Threads | mukewang | 沐曦开发者论坛

一、软硬件信息
1.服务器厂家：浪潮信息:
2.沐曦GPU型号：单张沐曦曦思N260
3.操作系统内核版本：4.19.90-89.11.v2401.ky10.x86_64
4.是否开启CPU虚拟化：是
5.mx-smi回显：
mx-smi version: 2.3.1

=================== MetaX System Management Interface Log ===================
Timestamp : Mon May 11 10:18:11 2026

Attached GPUs : 1
+---------------------------------------------------------------------------------+
| MX-SMI 2.3.1 Kernel Mode Driver Version: 3.7.11 |
| MACA Version: 3.7.0.38 BIOS Version: 1.31.1.0 |
|------------------+-----------------+---------------------+----------------------|
| Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
| Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
|==================+=================+=====================+======================|
| 0 MetaX N260 | 0 Off | 0000:c1:00.0 | 0% Disabled |
| 60W / 225W | 59C P9 | 52895/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+

+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| 0 3760427 VLLM::EngineCor 52228 |
+---------------------------------------------------------------------------------+
6.docker info回显：
Client:
Version: 29.3.1
Context: default
Debug Mode: false
Plugins:
compose: Docker Compose (Docker Inc.)
Version: v2.24.6
Path: /usr/local/lib/docker/cli-plugins/docker-compose

Server:
Containers: 25
Running: 24
Paused: 0
Stopped: 1
Images: 56
Server Version: 29.3.1
Storage Driver: overlayfs
driver-type: io.containerd.snapshotter.v1
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
CDI spec directories:
/etc/cdi
/var/run/cdi
Swarm: inactive
Runtimes: io.containerd.runc.v2 metax runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 301b2dac98f15c27117da5c8af12118a041a31d9
runc version: v1.3.4-0-gd6d73eb
init version: de40ad0
Security Options:
seccomp
Profile: builtin
Kernel Version: 4.19.90-89.11.v2401.ky10.x86_64
Operating System: Kylin Linux Advanced Server V10 (Halberd)
OSType: linux
Architecture: x86_64
CPUs: 64
Total Memory: 61.55GiB
Name: localhost.localdomain
ID: f92e3bfc-06d2-4441-886f-8b48bf0e6b27
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
::1/128
127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
Firewall Backend: iptables

WARNING: Support for cgroup v1 is deprecated and planned to be removed by no later than May 2029 (github.com/moby/moby/issues/51111)
7.镜像版本：
vllm-metax:0.18.0-maca.ai3.5.3.405-torch2.8-py310-kylinv11-amd64
8.启动容器命令：
metax-docker run -itd --device=/dev/dri --device=/dev/mxcd --group-add video --network=host --name vllm --restart unless-stopped --security-opt seccomp=unconfined --security-opt apparmor=unconfined --ulimit memlock=-1 -v /home/models:/models cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.18.0-maca.ai3.5.3.405-torch2.8-py312-kylinv11-amd64
9.容器内执行命令：
/opt/conda/bin/vllm serve /models/Qwen3.5-27B-W8A8 --host 0.0.0.0 --port 9901 --served-model-name qwen3 -tp 1 --trust-remote-code --dtype half --gpu-memory-utilization 0.8 --max-model-len 8192 --max-num-batched-tokens 8192 --reasoning-parser qwen3

二、问题现象
使用vllm运行Qwen3.5-27B-W8A8模型时，推理速度最高只有12.4 tokens/s，更换为Qwen3-32B-W8A8也是如此，当使用Qwen3-14B-AWQ时有所改善，能够达到18~22 tokens/s，并且模型运行过程中发生内存寻址错误出现异常中断，导致 vLLM 引擎崩溃影响正常使用，所以：1.我该如何提升模型运行速度？当前速度影响正常业务；2.如何解决模型运行报错中断的问题？

报错日志如下所示：
(APIServer pid=8943) INFO 05-11 09:56:50 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 33.6 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.0%, Prefix cache hit rate: 0.0%
[09:56:52.419][MXC][E]xnack(0x8): kernel causes atu address translation error
[09:56:52.420][MCR][E]mx_trapProcess.cpp :76 : traping pipeID=0,queueID=3!
[09:56:52.420][MCR][E]mx_trapProcess.cpp :77 : traping virtualDevice=0x7f3a48000c70!
[09:56:52.421][MCR][E]mx_trapProcess.cpp :707 : Xnack(0x8) exception happened in the shader, the mcruntime api will be disabled
[09:56:52.421][MCR][E]mx_trapProcess.cpp :798 : Node ID is 4
[09:56:52.421][MCR][E]mx_trapProcess.cpp :811 : debug info: 0x04,0x00,0x00,0x00,0x04,0x00,0x00,0x00,0xe0,0x23,0x00,0x00,0x00,0x20,0x00,0x00,
[09:56:52.421][MCR][E]mx_device.cpp :9606: trap:precise positioning.
[09:56:52.421][MCR][E]mx_device.cpp :9611: traping: kernelName: ZN4vllm30reshape_and_cache_flash_kernelIttLNS_18Fp8KVCacheDataTypeE0EEEvPKT_S4_PT0_S6_PKlllllliiiPKfSA ,commandIndex: 1179567 , trapType: Xnack Error/ATU Fault(0x8)
[09:56:53.408][MCR][E]mx_signal.cpp :592 : Failed signal [0x7f3d423766c0] wait
[09:56:53.408][MCR][E]mx_signal.cpp :592 : Failed signal [0x7f3d423766c0] wait
[09:56:53.408][MCR][E]mx_signal.cpp :160 : shouldn't destroy a profiling signal 0x5594b2ae4b70 that is still busy!
[09:56:53.408][MCR][E]mx_signal.cpp :160 : shouldn't destroy a profiling signal 0x7f3a48004aa0 that is still busy!
(EngineCore pid=9223) ERROR 05-11 09:56:53 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.18.0) with config: model='/models/Qwen3.5-27B-W8A8', speculative_config=None, tokenizer='/models/Qwen3.5-27B-W8A8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen3, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '/root/.cache/vllm/torch_compile_cache/f62306b0e2', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::mx_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 400, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': '/root/.cache/vllm/torch_compile_cache/f62306b0e2/rank_0_0/backbone', 'fast_moe_cold_start': True, 'static_all_moe_layers': []},
(EngineCore pid=9223) ERROR 05-11 09:56:53 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-a2089b62a4a5346f-a2d0cdd1', 'chatcmpl-b6dcfcbbedded5d1-898d1a0f'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[None, None],num_computed_tokens=[2914, 1184],num_output_tokens=[2904, 458]), num_scheduled_tokens={chatcmpl-a2089b62a4a5346f-a2d0cdd1: 1, chatcmpl-b6dcfcbbedded5d1-898d1a0f: 1}, total_num_scheduled_tokens=2, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null, new_block_ids_to_zero=null)
(EngineCore pid=9223) ERROR 05-11 09:56:53 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=2, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.02970297029702973, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] EngineCore encountered a fatal error.
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] Traceback (most recent call last):
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1092, in run_engine_core
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] engine_core.run_busy_loop()
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1133, in run_busy_loop
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] self._process_engine_step()
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1172, in _process_engine_step
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] outputs, model_executed = self.step_fn()
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 398, in step
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] model_output = self.model_executor.sample_tokens(grammar_output)
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 118, in sample_tokens
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] return self.collective_rpc(
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] return func(args, kwargs)
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] return func(args, kwargs)
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 759, in sample_tokens
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] return self.model_runner.sample_tokens(grammar_output)
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] return func(*args, kwargs)
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4029, in sample_tokens
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ) = self._bookkeeping_sync(
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3161, in _bookkeeping_sync
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] valid_sampled_token_ids = self._to_list(sampled_token_ids)
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 6643, in _to_list
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] self.transfer_event.synchronize()
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101]
(EngineCore pid=9223) Process EngineCore:
(EngineCore pid=9223) Traceback (most recent call last):
(EngineCore pid=9223) File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=9223) self.run()
(EngineCore pid=9223) File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=9223) self._target(self._args, self._kwargs)
(EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1103, in run_engine_core
(EngineCore pid=9223) raise e
(EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1092, in run_engine_core
(EngineCore pid=9223) engine_core.run_busy_loop()
(EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1133, in run_busy_loop
(EngineCore pid=9223) self._process_engine_step()
(EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1172, in _process_engine_step
(EngineCore pid=9223) outputs, model_executed = self.step_fn()
(APIServer pid=8943) ERROR 05-11 09:56:53 [async_llm.py:707] AsyncLLM output_handler failed.
(EngineCore pid=9223) ^^^^^^^^^^^^^^
(APIServer pid=8943) ERROR 05-11 09:56:53 [async_llm.py:707] Traceback (most recent call last):
(EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 398, in step
(APIServer pid=8943) ERROR 05-11 09:56:53 [async_llm.py:707] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 663, in output_handler
(EngineCore pid=9223) model_output = self.model_executor.sample_tokens(grammar_output)
(APIServer pid=8943) ERROR 05-11 09:56:53 [async_llm.py:707] outputs = await engine_core.get_output_async()
(EngineCore pid=9223) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=8943) ERROR 05-11 09:56:53 [async_llm.py:707] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 118, in sample_tokens
(APIServer pid=8943) ERROR 05-11 09:56:53 [async_llm.py:707] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 1022, in get_output_async
(EngineCore pid=9223) return self.collective_rpc(
(APIServer pid=8943) ERROR 05-11 09:56:53 [async_llm.py:707] raise self._format_exception(outputs) from None
(EngineCore pid=9223) ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=8943) ERROR 05-11 09:56:53 [async_llm.py:707] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=9223) result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=9223) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=9223) return func(args, kwargs)
(EngineCore pid=9223) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=9223) return func(*args, kwargs)
(EngineCore pid=9223) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 759, in sample_tokens
(EngineCore pid=9223) return self.model_runner.sample_tokens(grammar_output)
(EngineCore pid=9223) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=9223) return func(args, *kwargs)
(EngineCore pid=9223) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4029, in sample_tokens
(EngineCore pid=9223) ) = self._bookkeeping_sync(
(EngineCore pid=9223) ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3161, in _bookkeeping_sync
(EngineCore pid=9223) valid_sampled_token_ids = self._to_list(sampled_token_ids)
(EngineCore pid=9223) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 6643, in _to_list
(EngineCore pid=9223) self.transfer_event.synchronize()
(EngineCore pid=9223) torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(EngineCore pid=9223) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=9223) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore pid=9223) Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(EngineCore pid=9223)
(APIServer pid=8943) INFO: 10.75.45.11:43660 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=8943) INFO: 172.20.0.51:33402 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=8943) INFO: Shutting down
(APIServer pid=8943) INFO: Waiting for application shutdown.
(APIServer pid=8943) INFO: Application shutdown complete.
(APIServer pid=8943) INFO: Finished server process [8943]