Posts | langhongbin | MetaX Developer Forum

langhongbin
Members

How to deploy the bge-m3 and bge-reranker-v2-m3 models on MetaX C500? [Solved] May 26, 2026 19:49

After re-plugging, the error above still occurs.

langhongbin
Members

qwen3.6 inference is too slow on 8x MetaX C500 [Solved] May 25, 2026 14:14

curl http://10.217.247.136:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your_api_key_here" \
-d '{
"model": "qwen3.6",
"messages": [{"role": "user", "content": "生成一篇多于3000字的以勇气为主题的作文"}]
}'
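
The same request can be sent from Python with only the standard library, which makes it easy to time a single response. This is a minimal sketch, not a benchmark tool: `build_chat_request`, `tokens_per_second`, and `timed_chat` are hypothetical helper names, and the sketch assumes the response carries a `usage.completion_tokens` field as OpenAI-compatible servers usually do.

```python
import json
import time
import urllib.request


def build_chat_request(base_url, model, prompt, api_key):
    """Build the same POST request as the curl command above."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def tokens_per_second(completion_tokens, elapsed_seconds):
    """End-to-end generation rate for one request."""
    return completion_tokens / elapsed_seconds


def timed_chat(request):
    """Send the request and return tokens/s (assumes a `usage` field in the reply)."""
    start = time.monotonic()
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    elapsed = time.monotonic() - start
    return tokens_per_second(reply["usage"]["completion_tokens"], elapsed)
```

Calling `timed_chat(build_chat_request("http://10.217.247.136:8000", "qwen3.6", prompt, key))` would give a client-side figure to compare with the server's own throughput log.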

langhongbin
Members

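qwen3.6 inference is too slow on 8x MetaX C500 [Solved] May 25, 2026 14:11

After re-plugging, the speed still has not improved. dmesg -T | grep -i err shows no new log entries. The vLLM log is as follows: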
(APIServer pid=60) INFO 05-25 13:51:17 [loggers.py:259] Engine 000: Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 16.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=60) INFO 05-25 13:51:27 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 74.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=60) INFO 05-25 13:51:37 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 78.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=60) INFO 05-25 13:51:47 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 78.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=60) INFO 05-25 13:51:57 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 78.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=60) INFO 05-25 13:52:07 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 77.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=60) INFO: 10.217.247.136:57828 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=60) INFO 05-25 13:52:17 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 65.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=60) INFO 05-25 13:52:27 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
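
To put a single number on the log above, the generation throughput samples can be pulled out with a regular expression and averaged. The sketch below is only an illustration: `generation_throughputs` and `steady_state_mean` are hypothetical names, and dropping the first sample as warm-up is an assumption, not something the log itself states.

```python
import re
from statistics import mean

# Matches the part of each vLLM stats line that precedes the KV cache figure.
THROUGHPUT = re.compile(
    r"Avg generation throughput: ([0-9.]+) tokens/s, Running: (\d+) reqs"
)


def generation_throughputs(log_text):
    """Return throughput samples logged while at least one request was running."""
    samples = []
    for match in THROUGHPUT.finditer(log_text):
        rate, running = float(match.group(1)), int(match.group(2))
        if running > 0:
            samples.append(rate)
    return samples


def steady_state_mean(samples, warmup=1):
    """Average the samples after skipping `warmup` leading ones (an assumption)."""
    steady = samples[warmup:]
    return mean(steady) if steady else None
```

For the six samples taken while the request was running (16.2, 74.8, 78.8, 78.9, 78.4, 77.7), skipping the first gives roughly 77.7 tokens/s for one concurrent request.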

langhongbin
Members

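How to deploy the bge-m3 and bge-reranker-v2-m3 models on MetaX C500? [Solved] May 23, 2026 16:38

After moving the services to different ports, the same error still occurs.
Launch commands: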
nohup vllm serve /root/vllm/Qwen/Qwen3.6-35B-A3B/ \
--host 0.0.0.0 \
--port 8000 \
--served-model-name qwen3.6 \
--dtype bfloat16 \
--trust-remote-code \
--tensor-parallel-size 4 \
--distributed-executor-backend mp \
--gpu-memory-utilization 0.8 \
--max-model-len 32768 \
--max-num-batched-tokens 131072 \
--max-num-seqs 128 \
--enable-chunked-prefill \
--enable-prefix-caching \
> qwen.log 2>&1 &

nohup vllm serve /root/vllm/bge-m3/ \
--host 0.0.0.0 \
--port 8001 \
--served-model-name bge-m3 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.1 \
--trust-remote-code \
--dtype auto \
> bge-m3.log 2>&1 &
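
The two --gpu-memory-utilization values above are fractions of a card's total memory, so it can help to turn them into MiB before starting a second service. The sketch below is plain arithmetic under stated assumptions: the 65536 MiB capacity is the figure mx-smi reports for a C500 elsewhere in this thread, both services are assumed to land on the same card, and the function names are hypothetical.

```python
CARD_TOTAL_MIB = 65536  # per-card capacity as reported by mx-smi in this thread


def memory_budget_mib(utilization, total_mib=CARD_TOTAL_MIB):
    """Memory one vLLM instance may claim for a given utilization fraction."""
    return total_mib * utilization


def shared_card_headroom_mib(fractions, total_mib=CARD_TOTAL_MIB):
    """Memory left on a card after every listed instance takes its full budget.

    A negative result means the requested fractions cannot all fit.
    """
    return total_mib * (1.0 - sum(fractions))


qwen_budget = memory_budget_mib(0.8)   # about 52428.8 MiB
bge_budget = memory_budget_mib(0.1)    # about 6553.6 MiB
headroom = shared_card_headroom_mib([0.8, 0.1])  # about 6553.6 MiB left
```

On paper the two fractions fit on one card, which suggests the startup failure reported here is not a simple memory-budget overflow.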

Error log:
(EngineCore pid=26596) INFO 05-23 16:26:48 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='/root/vllm/bge-m3/', speculative_config=None, tokenizer='/root/vllm/bge-m3/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=bge-m3, enable_prefix_caching=False, enable_chunked_prefill=False, pooler_config=PoolerConfig(task=None, pooling_type=None, seq_pooling_type='CLS', tok_pooling_type='ALL', use_activation=True, dimensions=None, enable_chunked_processing=False, max_embed_len=None, logit_bias=None, step_tag_id=None, returned_token_ids=None), compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::mx_sparse_attn_indexer', 'vllm::mx_sparse_attn_indexer_bf16', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.PIECEWISE: 1>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=26596) INFO 05-23 16:26:48 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.217.247.136:40835 backend=nccl
(EngineCore pid=26596) INFO 05-23 16:26:48 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
[16:26:59.536][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[16:27:09.776][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[16:27:20.016][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[16:27:30.256][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[16:27:40.496][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[16:27:50.736][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[16:28:00.977][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
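
The MXKW lines above repeat at a very regular pace, which can be confirmed by measuring the gaps between their timestamps. This is a small illustrative sketch; `retry_intervals` is a hypothetical helper, and it assumes all lines fall on the same day because the log prints only the time of day.

```python
import re
from datetime import datetime

# Captures the leading [HH:MM:SS.mmm] stamp of MXKW error lines that mention a timeout.
MXKW_TIMEOUT = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2}\.\d{3})\]\[MXKW\]\[E\].*timeout", re.MULTILINE
)


def retry_intervals(log_text):
    """Return the seconds elapsed between consecutive MXKW timeout lines."""
    stamps = [
        datetime.strptime(stamp, "%H:%M:%S.%f")
        for stamp in MXKW_TIMEOUT.findall(log_text)
    ]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]
```

On the lines above the gaps come out at about 10.24 seconds each, so the runtime appears to be retrying the queue creation on a fixed timer rather than making progress.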

langhongbin
Members

How to deploy the bge-m3 and bge-reranker-v2-m3 models on MetaX C500? [Solved] May 22, 2026 14:47

1. Image version:
cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64

2. Container launch command:
docker run -itd \
--name qwen3.6 \
--network host \
--shm-size 512G \
--device=/dev/dri \
--device=/dev/mxcd \
--group-add video \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
--shm-size 100gb \
--ulimit memlock=-1 \
-v /home/modelscope:/root/vllm \
-e TZ=Asia/Shanghai \
-p 8000:8000 \
-p 8001:8001 \
-p 8002:8002 \
cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64

nohup vllm serve /root/vllm/bge-m3/ \
--host 0.0.0.0 \
--port 8001 \
--served-model-name bge-m3 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.1 \
--trust-remote-code \
--dtype auto \
> bge-m3.log 2>&1 &

nohup vllm serve /root/vllm/bge-reranker-v2-m3/ \
--host 0.0.0.0 \
--port 8001 \
--served-model-name bge-reranker-v2-m3 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.1 \
--trust-remote-code \
--dtype auto \
> reranker.log 2>&1 &
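
Both commands above ask for port 8001, so only one of them can bind it. A quick pre-flight check can catch this before launching anything. The sketch below is an illustration with hypothetical helper names; it only inspects a planned port list and probes the local host, and it does not imply that the port clash is the cause of the error reported below.

```python
import socket


def duplicate_ports(services):
    """Map each port requested more than once to the service names that want it."""
    by_port = {}
    for name, port in services.items():
        by_port.setdefault(port, []).append(name)
    return {port: names for port, names in by_port.items() if len(names) > 1}


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


planned = {"qwen3.6": 8000, "bge-m3": 8001, "bge-reranker-v2-m3": 8001}
clashes = duplicate_ports(planned)  # {8001: ["bge-m3", "bge-reranker-v2-m3"]}
```

Giving each service its own port, for example 8001 and 8002 as already published by the container command, removes the clash.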

II. Problem:
Starting multiple services in the container fails with the following error:
itionalGeneration.
WARNING 05-22 14:34:58 [registry.py:915] Model architecture GlmMoeDsaForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:GlmMoeDsaForCausalLM.
(Worker pid=2004) INFO 05-22 14:34:59 [parallel_state.py:1400] world_size=8 rank=5 local_rank=5 distributed_init_method=tcp://127.0.0.1:45863 backend=nccl
(Worker pid=2000) [rank1]:W0522 14:34:59.640000 2000 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(Worker pid=2000) [rank1]:W0522 14:34:59.640000 2000 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(Worker pid=2002) [rank3]:W0522 14:34:59.640000 2002 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(Worker pid=2002) [rank3]:W0522 14:34:59.640000 2002 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(Worker pid=2001) [rank2]:W0522 14:34:59.640000 2001 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(Worker pid=2001) [rank2]:W0522 14:34:59.640000 2001 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(Worker pid=1999) [rank0]:W0522 14:34:59.640000 1999 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(Worker pid=1999) [rank0]:W0522 14:34:59.640000 1999 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(Worker pid=2005) [rank6]:W0522 14:34:59.641000 2005 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(Worker pid=2006) [rank7]:W0522 14:34:59.641000 2006 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(Worker pid=2005) [rank6]:W0522 14:34:59.641000 2005 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(Worker pid=2006) [rank7]:W0522 14:34:59.641000 2006 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(Worker pid=2004) [rank5]:W0522 14:34:59.641000 2004 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(Worker pid=2004) [rank5]:W0522 14:34:59.641000 2004 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(Worker pid=2003) [rank4]:W0522 14:34:59.642000 2003 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(Worker pid=2003) [rank4]:W0522 14:34:59.642000 2003 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(Worker pid=1999) INFO 05-22 14:34:59 [mccl.py:27] Found mccl from library libmccl.so
(Worker pid=1999) INFO 05-22 14:34:59 [pynccl.py:111] vLLM is using nccl==2.16.5
[14:35:11.312][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[14:35:21.552][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[14:35:31.792][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[14:35:42.032][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[14:35:52.273][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[14:36:02.512][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[14:36:12.752][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.

langhongbin
Members

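How to deploy the bge-m3 and bge-reranker-v2-m3 models on MetaX C500? [Solved] May 22, 2026 12:22

I. Hardware and software information:
1. Server vendor: Inspur

2. MetaX GPU model: MetaX C500, 8 cards

3. OS kernel version: 6.6.0-32.7.v2505.ky11.x86_64

4. CPU virtualization enabled: Yes

5. mx-smi output: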
mx-smi version: 2.2.12

=================== MetaX System Management Interface Log ===================
Timestamp : Wed May 20 18:14:56 2026

Attached GPUs : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.12 Kernel Mode Driver Version: 3.6.11 |
| MACA Version: unknown BIOS Version: 1.31.1.0 |
|------------------+-----------------+---------------------+----------------------|
| Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
| Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
|==================+=================+=====================+======================|
| 0 MetaX C500 | 0 Off | 0000:04:00.0 | 0% Disabled |
| 82W / 350W | 61C P9 | 40353/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 1 MetaX C500 | 1 Off | 0000:05:00.0 | 0% Disabled |
| 75W / 350W | 58C P9 | 40993/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 2 MetaX C500 | 2 Off | 0000:63:00.0 | 0% Disabled |
| 80W / 350W | 56C P9 | 40353/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 3 MetaX C500 | 3 Off | 0000:64:00.0 | 0% Disabled |
| 80W / 350W | 59C P9 | 40993/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 4 MetaX C500 | 4 Off | 0000:83:00.0 | 0% Disabled |
| 82W / 350W | 56C P9 | 40993/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 5 MetaX C500 | 5 Off | 0000:84:00.0 | 0% Disabled |
| 72W / 350W | 53C P9 | 40353/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 6 MetaX C500 | 6 Off | 0000:e4:00.0 | 0% Disabled |
| 81W / 350W | 58C P9 | 40993/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 7 MetaX C500 | 7 Off | 0000:e5:00.0 | 0% Disabled |
| 74W / 350W | 54C P9 | 40353/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+

+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| 0 1025936 VLLM::Worker_TP 39386 |
| 1 1025937 VLLM::Worker_TP 40026 |
| 2 1025938 VLLM::Worker_TP 39386 |
| 3 1025939 VLLM::Worker_TP 40026 |
| 4 1025940 VLLM::Worker_TP 40026 |
| 5 1025941 VLLM::Worker_TP 39386 |
| 6 1025942 VLLM::Worker_TP 40026 |
| 7 1025943 VLLM::Worker_TP 39386 |
+---------------------------------------------------------------------------------+
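
The Memory-Usage column above shows how much of each card is already taken by the running VLLM::Worker_TP processes. A short sketch like the following can turn that column into free memory per card; `free_memory_mib` is a hypothetical helper and it assumes the `used/total MiB` layout shown in this mx-smi output.

```python
import re

# Matches entries such as "40353/65536 MiB" in the Memory-Usage column.
MEMORY_USAGE = re.compile(r"(\d+)/(\d+) MiB")


def free_memory_mib(smi_text):
    """Return free memory in MiB for each card, in the order the cards are listed."""
    return [int(total) - int(used)
            for used, total in MEMORY_USAGE.findall(smi_text)]
```

For the output above this gives 25183 MiB free on cards 0, 2, 5 and 7, and 24543 MiB free on cards 1, 3, 4 and 6, which is the room left for any additional embedding or reranker service.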

6. docker info output:
[root@localhost ~]# docker info
Client:
Version: 24.0.9
Context: default
Debug Mode: false

Server:
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 1
Server Version: 24.0.9
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9a04df1519ac2967eece6c6a5d13d3b846b574b2.m
runc version:
init version:
Security Options:
seccomp
Profile: builtin
Kernel Version: 6.6.0-32.7.v2505.ky11.x86_64
Operating System: Kylin Linux Advanced Server V11 (Swan25)
OSType: linux
Architecture: x86_64
CPUs: 256
Total Memory: 1.472TiB
Name: localhost.localdomain
ID: ded90092-4000-426b-a3ca-08950e376242
Docker Root Dir: /home/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Registry Mirrors:
docker.1ms.run/
dockerpull.com/
registry.docker-cn.com/
Live Restore Enabled: false

II. Problem
How to deploy the bge-m3 and bge-reranker-v2-m3 models on MetaX C500?

langhongbin
Members

qwen3.6 inference is too slow on 8x MetaX C500 [Solved] May 21, 2026 13:38

[四 5月 21 11:15:48 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 11:15:48 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 11:15:48 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 11:15:48 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 11:15:58 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 11:15:58 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 11:15:58 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 11:15:58 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 11:16:09 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 11:16:09 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 11:16:09 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 11:16:09 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 11:16:19 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 11:16:19 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 11:16:19 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 11:16:19 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 11:16:29 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 11:16:29 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 11:16:29 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 11:16:29 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 11:16:39 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 11:16:39 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 11:16:39 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 11:16:39 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 11:16:50 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 11:16:50 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 11:16:50 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 11:16:50 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 11:17:29 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 11:17:29 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 11:17:29 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 11:17:29 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 12:11:31 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 12:11:31 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 12:11:31 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 12:11:31 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 12:11:41 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 12:11:41 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 12:11:41 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 12:11:41 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 12:11:51 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 12:11:51 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 12:11:51 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 12:11:51 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 12:12:02 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 12:12:02 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 12:12:02 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 12:12:02 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 12:12:12 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 12:12:12 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 12:12:12 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 12:12:12 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 12:12:22 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 12:12:22 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 12:12:22 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 12:12:22 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 12:12:32 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 12:12:32 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 12:12:32 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 12:12:32 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 12:18:14 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 12:18:14 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
[四 5月 21 12:18:14 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
[四 5月 21 12:18:14 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
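
The dmesg output above is easier to read once it is grouped by device and message. The sketch below counts the RINGBUF errors and decodes the trailing number. It rests on one assumption: that the driver reports negative Linux errno values, as kernel code conventionally does. On Linux, errno 110 is ETIMEDOUT. The helper names are hypothetical.

```python
import errno
import os
import re
from collections import Counter

# Captures device tag, message, and numeric code from MXCD RINGBUF error lines.
RINGBUF_ERROR = re.compile(
    r"(MXCD\.\w+\.\w+)\.RINGBUF\.ERROR (.+?), (-\d+)$", re.MULTILINE
)


def summarize_ringbuf_errors(dmesg_text):
    """Count occurrences of each (device, message, code) triple."""
    counts = Counter()
    for device, message, code in RINGBUF_ERROR.findall(dmesg_text):
        counts[(device, message, int(code))] += 1
    return counts


def describe_errno(code):
    """Translate a (negative) errno value into its symbolic name and description."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name}: {os.strerror(number)}"
```

If the assumption holds, every line above is a timeout while creating a ring buffer on devices B400 and B500, which would be consistent with the user-space "create queue block timeout" retries seen in the vLLM logs.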

langhongbin
Members

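qwen3.6 inference is too slow on 8x MetaX C500 [Solved] May 21, 2026 12:30

After this error persisted for a while, the service started normally, but inference is still very slow. The log is as follows: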

(APIServer pid=53) INFO 05-21 12:28:35 [loggers.py:259] Engine 000: Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 15.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=53) INFO 05-21 12:28:45 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 78.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=53) INFO 05-21 12:28:55 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 78.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=53) INFO 05-21 12:29:05 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 79.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=53) INFO 05-21 12:29:15 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 79.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
(APIServer pid=53) INFO: 10.217.247.136:40238 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=53) INFO 05-21 12:29:25 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 65.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=53) INFO 05-21 12:29:35 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

langhongbin
Members

qwen3.6 inference is too slow on 8x MetaX C500 [Solved] May 21, 2026 12:16

The same error still occurs with four cards. Launch command:
nohup vllm serve /root/vllm/Qwen/Qwen3.6-35B-A3B/ -tp 4 \
--host 0.0.0.0 \
--port 8000 \
--served-model-name qwen3.6 \
--dtype bfloat16 \
--trust-remote-code \
--tensor-parallel-size 4 \
--distributed-executor-backend mp \
--gpu-memory-utilization 0.8 \
--max-model-len 32768 \
--max-num-batched-tokens 131072 \
--max-num-seqs 64 \
> qwen.log 2>&1 &
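
A launch line like the one above is easy to break when line continuations or the log redirect go missing. One alternative is to build the argument list in Python and let `subprocess` handle the redirect. This is only a sketch under assumptions: the helper names are hypothetical, only flags that already appear in this thread are used, and `start_new_session=True` stands in for `nohup` on POSIX systems.

```python
import subprocess


def build_vllm_command(model_path, port, served_name, tensor_parallel_size):
    """Assemble a `vllm serve` argument list from flags used in this thread."""
    return [
        "vllm", "serve", model_path,
        "--host", "0.0.0.0",
        "--port", str(port),
        "--served-model-name", served_name,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


def launch_in_background(command, log_path):
    """Start the command detached from the terminal, with stdout and stderr in one log."""
    with open(log_path, "ab") as log_file:
        return subprocess.Popen(
            command,
            stdout=log_file,
            stderr=subprocess.STDOUT,   # same effect as 2>&1
            stdin=subprocess.DEVNULL,
            start_new_session=True,     # keeps running after the shell exits
        )
```

For example, `launch_in_background(build_vllm_command("/root/vllm/Qwen/Qwen3.6-35B-A3B/", 8000, "qwen3.6", 4), "qwen.log")` would start the service and return the process handle. Because every argument is a separate list element, a missing space before a continuation cannot glue two flags together.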

Log:
tail -500f qwen.log
nohup: ignoring input
INFO 05-21 12:08:03 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 05-21 12:08:03 [__init__.py:46] - metax -> vllm_metax:register
INFO 05-21 12:08:03 [__init__.py:49] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 05-21 12:08:03 [__init__.py:239] Platform plugin metax is activated
(EngineCore pid=758) INFO 05-21 12:08:12 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=758) INFO 05-21 12:09:12 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=758) INFO 05-21 12:10:12 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=758) INFO 05-21 12:11:12 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=758) INFO 05-21 12:12:12 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=758) INFO 05-21 12:13:12 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=758) INFO 05-21 12:14:12 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=758) INFO 05-21 12:15:12 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=758) INFO 05-21 12:16:12 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).

langhongbin
Members

qwen3.6 inference is too slow on 8x MetaX C500 [Solved] May 21, 2026 11:05

With two cards, the inference service hangs during startup.
Service launch command:
nohup vllm serve /root/vllm/Qwen/Qwen3.6-35B-A3B/ -tp 2 \
--host 0.0.0.0 \
--port 8000 \
--served-model-name qwen3.6 \
--dtype bfloat16 \
--trust-remote-code \
--tensor-parallel-size 2 \
--distributed-executor-backend mp \
--gpu-memory-utilization 0.8 \
--max-model-len 32768 \
--max-num-batched-tokens 131072 \
--max-num-seqs 64 \
> qwen.log 2>&1 &

Log output:
(Worker_TP0 pid=1133)
(Worker_TP0 pid=1133) INFO 05-21 10:58:12 [default_loader.py:384] Loading weights took 19.40 seconds
(Worker_TP0 pid=1133) INFO 05-21 10:58:13 [gpu_model_runner.py:4820] Model loading took 32.86 GiB memory and 20.283825 seconds
(Worker_TP0 pid=1133) INFO 05-21 10:58:15 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 131072 tokens, and profiled with 8 image items of the maximum feature size.
(Worker_TP0 pid=1133) INFO 05-21 10:58:30 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/583c9adccf/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=1133) INFO 05-21 10:58:30 [backends.py:1111] Dynamo bytecode transform time: 11.64 s
(EngineCore pid=785) INFO 05-21 10:59:16 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=785) INFO 05-21 11:00:16 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=785) INFO 05-21 11:01:16 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=785) INFO 05-21 11:02:16 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=785) INFO 05-21 11:03:16 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=785) INFO 05-21 11:04:16 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
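
To quantify how long the startup has been stuck, the time between the last worker progress line and the latest shm_broadcast message can be computed from the log. The sketch below is illustrative only: `stall_seconds` is a hypothetical helper, and because the log omits the year, the year is passed in as a parameter (2026 is taken from the post dates).

```python
import re
from datetime import datetime

# Captures "MM-DD HH:MM:SS" and the source file of each vLLM INFO line.
INFO_LINE = re.compile(r"INFO (\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[([\w.]+):\d+\]")


def stall_seconds(log_text, year=2026, stall_source="shm_broadcast.py"):
    """Seconds from the last progress line before the stall to the latest stall message."""
    last_progress = None
    last_stall = None
    for stamp, source in INFO_LINE.findall(log_text):
        when = datetime.strptime(f"{year}-{stamp}", "%Y-%m-%d %H:%M:%S")
        if source == stall_source:
            last_stall = when
        elif last_stall is None:
            # Only lines logged before the first stall message count as progress.
            last_progress = when
    if last_progress is None or last_stall is None:
        return None
    return (last_stall - last_progress).total_seconds()
```

For the log above, the last progress line is the Dynamo transform at 10:58:30 and the latest stall message is at 11:04:16, so the worker has been silent for 346 seconds at that point.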

langhongbin
Members

qwen3.6 inference is too slow on 8x MetaX C500 [Solved] May 21, 2026 10:50

Do I need to add environment variables for optimization? If so, which ones specifically?

langhongbin
Members

qwen3.6 inference is too slow on 8x MetaX C500 [Solved] May 21, 2026 10:45

A single-card deployment runs out of GPU memory.

langhongbin
Members

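qwen3.6 inference is too slow on 8x MetaX C500 [Solved] May 21, 2026 09:08

I. Hardware and software information:
1. Server vendor: Inspur

2. MetaX GPU model: MetaX C500, 8 cards

3. OS kernel version: 6.6.0-32.7.v2505.ky11.x86_64

4. CPU virtualization enabled: Yes

5. mx-smi output: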
mx-smi version: 2.2.12

=================== MetaX System Management Interface Log ===================
Timestamp : Wed May 20 18:14:56 2026

Attached GPUs : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.12 Kernel Mode Driver Version: 3.6.11 |
| MACA Version: unknown BIOS Version: 1.31.1.0 |
|------------------+-----------------+---------------------+----------------------|
| Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
| Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
|==================+=================+=====================+======================|
| 0 MetaX C500 | 0 Off | 0000:04:00.0 | 0% Disabled |
| 82W / 350W | 61C P9 | 40353/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 1 MetaX C500 | 1 Off | 0000:05:00.0 | 0% Disabled |
| 75W / 350W | 58C P9 | 40993/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 2 MetaX C500 | 2 Off | 0000:63:00.0 | 0% Disabled |
| 80W / 350W | 56C P9 | 40353/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 3 MetaX C500 | 3 Off | 0000:64:00.0 | 0% Disabled |
| 80W / 350W | 59C P9 | 40993/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 4 MetaX C500 | 4 Off | 0000:83:00.0 | 0% Disabled |
| 82W / 350W | 56C P9 | 40993/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 5 MetaX C500 | 5 Off | 0000:84:00.0 | 0% Disabled |
| 72W / 350W | 53C P9 | 40353/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 6 MetaX C500 | 6 Off | 0000:e4:00.0 | 0% Disabled |
| 81W / 350W | 58C P9 | 40993/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 7 MetaX C500 | 7 Off | 0000:e5:00.0 | 0% Disabled |
| 74W / 350W | 54C P9 | 40353/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+

+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| 0 1025936 VLLM::Worker_TP 39386 |
| 1 1025937 VLLM::Worker_TP 40026 |
| 2 1025938 VLLM::Worker_TP 39386 |
| 3 1025939 VLLM::Worker_TP 40026 |
| 4 1025940 VLLM::Worker_TP 40026 |
| 5 1025941 VLLM::Worker_TP 39386 |
| 6 1025942 VLLM::Worker_TP 40026 |
| 7 1025943 VLLM::Worker_TP 39386 |
+---------------------------------------------------------------------------------+

6. docker info output:
[root@localhost ~]# docker info
Client:
Version: 24.0.9
Context: default
Debug Mode: false

Server:
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 1
Server Version: 24.0.9
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9a04df1519ac2967eece6c6a5d13d3b846b574b2.m
runc version:
init version:
Security Options:
seccomp
Profile: builtin
Kernel Version: 6.6.0-32.7.v2505.ky11.x86_64
Operating System: Kylin Linux Advanced Server V11 (Swan25)
OSType: linux
Architecture: x86_64
CPUs: 256
Total Memory: 1.472TiB
Name: localhost.localdomain
ID: ded90092-4000-426b-a3ca-08950e376242
Docker Root Dir: /home/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Registry Mirrors:
docker.1ms.run/
dockerpull.com/
registry.docker-cn.com/
Live Restore Enabled: false

7. Image version:
cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64

8. Container launch command:
docker run -itd \
--name qwen3.6 \
--network host \
--shm-size 512G \
--device=/dev/dri \
--device=/dev/mxcd \
--group-add video \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
--shm-size 100gb \
--ulimit memlock=-1 \
-v /home/modelscope:/root/vllm \
-e TZ=Asia/Shanghai \
-p 8000:8000 \
-p 8001:8001 \
-p 8002:8002 \
cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64
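
Two details of the launch command above may be worth a second look: --shm-size is passed twice (512G and 100gb), and -p is combined with --network host. The sketch below only detects these patterns in an abbreviated copy of the command. The list of flags that may legitimately repeat is an assumption made for this example, not something taken from Docker's documentation.

```python
import shlex
from collections import Counter

# Abbreviated copy of the docker run flags from the post (image name omitted).
command = """
docker run -itd --name qwen3.6 --network host --shm-size 512G
  --shm-size 100gb --ulimit memlock=-1 -p 8000:8000 -p 8001:8001 -p 8002:8002
"""

tokens = shlex.split(command)

# Flags assumed to be list-valued, so repeating them is expected (hypothetical list).
REPEATABLE = {"-p", "-v", "-e", "--device", "--security-opt", "--group-add", "--ulimit"}

counts = Counter(t.split("=", 1)[0] for t in tokens if t.startswith("-"))
duplicates = sorted(f for f, n in counts.items() if n > 1 and f not in REPEATABLE)
host_net_with_ports = tokens[tokens.index("--network") + 1] == "host" and "-p" in counts

print("repeated single-value flags:", duplicates)
print("-p combined with --network host:", host_net_with_ports)
```

Docker warns that published ports are discarded in host network mode, so the -p flags likely have no effect here. For the repeated --shm-size, only one value can apply (probably the later one); running docker inspect on the container and reading HostConfig.ShmSize would confirm which.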

9. Command run inside the container:
nohup vllm serve /root/vllm/Qwen/Qwen3.6-35B-A3B/ -tp 8 \
--host 0.0.0.0 \
--port 8000 \
--served-model-name qwen3.6 \
--dtype bfloat16 \
--trust-remote-code \
--tensor-parallel-size 8 \
--distributed-executor-backend mp \
--gpu-memory-utilization 0.8 \
--max-model-len 32768 \
--max-num-batched-tokens 327680 \
--kv-cache-dtype fp8_e4m3 >qwen.log 2>& 1 &

II. Symptoms
Inference is slow. First-round prompt prefill: 2.2 tokens/s (input processing is slow); the generation phase is stable at 70~73 tokens/s.
The log is as follows:
(APIServer pid=254754) INFO 05-20 20:11:26 [loggers.py:259] Engine 000: Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 7.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO 05-20 20:11:36 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 73.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.7%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO 05-20 20:11:46 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 72.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO 05-20 20:11:56 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 72.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.2%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO 05-20 20:12:06 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 71.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.3%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO 05-20 20:12:16 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 71.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.6%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO: 10.217.247.136:54410 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=254754) INFO 05-20 20:12:26 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 32.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO 05-20 20:12:36 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
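
One way to read the throughput figures in this log, sketched under an assumption: that each "Avg ... throughput" value is an average over the interval between consecutive log lines (10 seconds here, judging by the timestamps). That matches my understanding of how vLLM's periodic stats logger works, but it is not confirmed in the thread. Multiplying each rate by the interval gives the approximate token count handled in that window. The two log records are abridged copies of the first two lines above.

```python
import re

# Two records from the log above, abridged after the "Running" field.
log = """\
(APIServer pid=254754) INFO 05-20 20:11:26 [loggers.py:259] Engine 000: Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 7.1 tokens/s, Running: 1 reqs
(APIServer pid=254754) INFO 05-20 20:11:36 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 73.8 tokens/s, Running: 1 reqs
"""

PATTERN = re.compile(
    r"Avg prompt throughput: ([\d.]+) tokens/s, "
    r"Avg generation throughput: ([\d.]+) tokens/s"
)
LOG_INTERVAL_S = 10  # assumed averaging window, taken from the timestamp spacing

rows = PATTERN.findall(log)
for prompt_tps, gen_tps in rows:
    prompt_tokens = float(prompt_tps) * LOG_INTERVAL_S
    gen_tokens = float(gen_tps) * LOG_INTERVAL_S
    print(f"~{prompt_tokens:.0f} prompt tokens, ~{gen_tokens:.0f} generated tokens in this window")
```

If the assumption holds, 2.2 tokens/s corresponds to roughly 22 prompt tokens arriving within one 10-second window, so the figure would reflect a short prompt rather than the actual prefill speed. Measuring time to first token on a long prompt would be a more direct test of prefill performance.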

langhongbin
Members

Qwen3.6-35B-A3B model deployment error [Solved] May 20, 2026, 20:16

I. Hardware and software information:
1. Server vendor: Inspur

2. MetaX GPU model: MetaX C500, 8 cards

3. OS kernel version: 6.6.0-32.7.v2505.ky11.x86_64

4. CPU virtualization enabled: yes

5. mx-smi output:
mx-smi version: 2.2.12

=================== MetaX System Management Interface Log ===================
Timestamp : Wed May 20 18:14:56 2026

Attached GPUs : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.12 Kernel Mode Driver Version: 3.6.11 |
| MACA Version: unknown BIOS Version: 1.31.1.0 |
|------------------+-----------------+---------------------+----------------------|
| Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
| Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
|==================+=================+=====================+======================|
| 0 MetaX C500 | 0 Off | 0000:04:00.0 | 0% Disabled |
| 82W / 350W | 61C P9 | 40353/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 1 MetaX C500 | 1 Off | 0000:05:00.0 | 0% Disabled |
| 75W / 350W | 58C P9 | 40993/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 2 MetaX C500 | 2 Off | 0000:63:00.0 | 0% Disabled |
| 80W / 350W | 56C P9 | 40353/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 3 MetaX C500 | 3 Off | 0000:64:00.0 | 0% Disabled |
| 80W / 350W | 59C P9 | 40993/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 4 MetaX C500 | 4 Off | 0000:83:00.0 | 0% Disabled |
| 82W / 350W | 56C P9 | 40993/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 5 MetaX C500 | 5 Off | 0000:84:00.0 | 0% Disabled |
| 72W / 350W | 53C P9 | 40353/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 6 MetaX C500 | 6 Off | 0000:e4:00.0 | 0% Disabled |
| 81W / 350W | 58C P9 | 40993/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 7 MetaX C500 | 7 Off | 0000:e5:00.0 | 0% Disabled |
| 74W / 350W | 54C P9 | 40353/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+

+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| 0 1025936 VLLM::Worker_TP 39386 |
| 1 1025937 VLLM::Worker_TP 40026 |
| 2 1025938 VLLM::Worker_TP 39386 |
| 3 1025939 VLLM::Worker_TP 40026 |
| 4 1025940 VLLM::Worker_TP 40026 |
| 5 1025941 VLLM::Worker_TP 39386 |
| 6 1025942 VLLM::Worker_TP 40026 |
| 7 1025943 VLLM::Worker_TP 39386 |
+---------------------------------------------------------------------------------+

6. docker info output:
[root@localhost ~]# docker info
Client:
Version: 24.0.9
Context: default
Debug Mode: false

Server:
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 1
Server Version: 24.0.9
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9a04df1519ac2967eece6c6a5d13d3b846b574b2.m
runc version:
init version:
Security Options:
seccomp
Profile: builtin
Kernel Version: 6.6.0-32.7.v2505.ky11.x86_64
Operating System: Kylin Linux Advanced Server V11 (Swan25)
OSType: linux
Architecture: x86_64
CPUs: 256
Total Memory: 1.472TiB
Name: localhost.localdomain
ID: ded90092-4000-426b-a3ca-08950e376242
Docker Root Dir: /home/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Registry Mirrors:
docker.1ms.run/
dockerpull.com/
registry.docker-cn.com/
Live Restore Enabled: false

7. Image version:
cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64

8. Container launch command:
docker run -itd \
--name qwen3.6 \
--network host \
--shm-size 512G \
--device=/dev/dri \
--device=/dev/mxcd \
--group-add video \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
--shm-size 100gb \
--ulimit memlock=-1 \
-v /home/modelscope:/root/vllm \
-e TZ=Asia/Shanghai \
-p 8000:8000 \
-p 8001:8001 \
-p 8002:8002 \
cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64

9. Command run inside the container:
nohup vllm serve /root/vllm/Qwen/Qwen3.6-35B-A3B/ -tp 8 \
--host 0.0.0.0 \
--port 8000 \
--served-model-name qwen3.6 \
--dtype bfloat16 \
--trust-remote-code \
--tensor-parallel-size 8 \
--distributed-executor-backend mp \
--gpu-memory-utilization 0.8 \
--max-model-len 32768 \
--max-num-batched-tokens 327680 \
--kv-cache-dtype fp8_e4m3 >qwen.log 2>& 1 &

II. Symptoms
Inference is slow. First-round prompt prefill: 2.2 tokens/s (input processing is slow); the generation phase is stable at 70~73 tokens/s.
The log is as follows:
(APIServer pid=254754) INFO 05-20 20:11:26 [loggers.py:259] Engine 000: Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 7.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO 05-20 20:11:36 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 73.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.7%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO 05-20 20:11:46 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 72.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO 05-20 20:11:56 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 72.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.2%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO 05-20 20:12:06 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 71.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.3%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO 05-20 20:12:16 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 71.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.6%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO: 10.217.247.136:54410 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=254754) INFO 05-20 20:12:26 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 32.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO 05-20 20:12:36 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

langhongbin
Members

Qwen3.6-35B-A3B model deployment error [Solved] May 20, 2026, 16:41

Deploying the Qwen3.6-35B-A3B model on 8 MetaX C500 cards. The container launch command is:
docker run -itd \
--name qwen3.6 \
--network host \
--shm-size 512G \
--device=/dev/dri \
--device=/dev/mxcd \
--group-add video \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
--shm-size 100gb \
--ulimit memlock=-1 \
-v /home/modelscope:/root/vllm \
-e TZ=Asia/Shanghai \
-p 8000:8000 \
-p 8001:8001 \
-p 8002:8002 \
cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64

The vllm launch command is:
vllm serve /root/vllm/Qwen/Qwen3.6-35B-A3B/ -tp 8 \
--host 0.0.0.0 \
--port 8000 \
--served-model-name qwen3.6 \
--dtype bfloat16 \
--trust-remote-code \
--tensor-parallel-size 8 \
--distributed-executor-backend mp \
--gpu-memory-utilization 0.8 \
--max-model-len 32768 \
--max-num-batched-tokens 524288 \
--kv-cache-dtype fp8_e4m3

The error output is:
(EngineCore pid=157812) ERROR 05-20 16:37:27 [core.py:1108] RuntimeError: Worker failed with error 'CUDA out of memory. Tried to allocate 32.00 GiB. GPU 0 has a total capacity of 63.59 GiB of which 22.74 GiB is free. Of the allocated memory 35.24 GiB is allocated by PyTorch, and 442.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (pytorch.org/docs/stable/notes/cuda.html#environment-variables)', please check the stack trace above for the root cause
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] return self._call_impl(*args, **kwargs)
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] return forward_call(*args, **kwargs)
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "<eval_with_key>.82", line 258, in forward
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] submod_2 = self.submod_2(getitem_3, s59, getitem_4, l_self_modules_layers_modules_0_modules_linear_attn_modules_norm_parameters_weight_, getitem_5, l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_weight_, getitem_6, s18, l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_, l_inputs_embeds_, l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_1_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_, l_self_modules_layers_modules_1_modules_linear_attn_modules_in_proj_ba_parameters_weight_); getitem_3 = getitem_4 = l_self_modules_layers_modules_0_modules_linear_attn_modules_norm_parameters_weight_ = getitem_5 = l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_weight_ = getitem_6 = l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_ = l_inputs_embeds_ = l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_1_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_ = l_self_modules_layers_modules_1_modules_linear_attn_modules_in_proj_ba_parameters_weight_ = None
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] return self.runnable(*args, **kwargs)
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/vllm/compilation/piecewise_backend.py", line 367, in __call__
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] return range_entry.runnable(*args)
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/torch/_inductor/standalone_compile.py", line 62, in __call__
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] return self._compiled_fn(*args)
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] return fn(*args, **kwargs)
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1241, in forward
(Worker_TP5 pid=158165) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP4 pid=158164) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP0 pid=158160) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP2 pid=158162) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP6 pid=158166) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP1 pid=158161) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP7 pid=158167) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP3 pid=158163) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
(EngineCore pid=157812) ERROR 05-20 16:37:38 [multiproc_executor.py:273] Worker proc VllmWorker-4 died unexpectedly, shutting down executor.
(EngineCore pid=157812) Process EngineCore:
(EngineCore pid=157812) Traceback (most recent call last):
(EngineCore pid=157812) File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=157812) self.run()
(EngineCore pid=157812) File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=157812) self._target(*self._args, **self._kwargs)
(EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1112, in run_engine_core
(EngineCore pid=157812) raise e
(EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=157812) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=157812) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=157812) return func(*args, **kwargs)
(EngineCore pid=157812) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=157812) super().__init__(
(EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 124, in __init__
(EngineCore pid=157812) kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=157812) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=157812) return func(*args, **kwargs)
(EngineCore pid=157812) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 247, in _initialize_kv_caches
(EngineCore pid=157812) available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=157812) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=157812) return self.collective_rpc("determine_available_memory")
(EngineCore pid=157812) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 397, in collective_rpc
(EngineCore pid=157812) return aggregate(get_response())
(EngineCore pid=157812) ^^^^^^^^^^^^^^
(EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 380, in get_response
(EngineCore pid=157812) raise RuntimeError(
(EngineCore pid=157812) RuntimeError: Worker failed with error 'CUDA out of memory. Tried to allocate 32.00 GiB. GPU 0 has a total capacity of 63.59 GiB of which 22.74 GiB is free. Of the allocated memory 35.24 GiB is allocated by PyTorch, and 442.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (pytorch.org/docs/stable/notes/cuda.html#environment-variables)', please check the stack trace above for the root cause
(APIServer pid=157458) Traceback (most recent call last):
(APIServer pid=157458) File "/opt/conda/bin/vllm", line 8, in <module>
(APIServer pid=157458) sys.exit(main())
(APIServer pid=157458) ^^^^^^
(APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=157458) args.dispatch_function(args)
(APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=157458) uvloop.run(run_server(args))
(APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=157458) return asyncio.run(
(APIServer pid=157458) ^^^^^^^^^^^^^^
(APIServer pid=157458) File "/opt/conda/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=157458) return runner.run(main)
(APIServer pid=157458) ^^^^^^^^^^^^^^^^
(APIServer pid=157458) File "/opt/conda/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=157458) return self._loop.run_until_complete(task)
(APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=157458) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=157458) return await main
(APIServer pid=157458) ^^^^^^^^^^
(APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server
(APIServer pid=157458) await run_server_worker(listen_address, sock, args, uvicorn_kwargs)
(APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 684, in run_server_worker
(APIServer pid=157458) async with build_async_engine_client(
(APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=157458) File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=157458) return await anext(self.gen)
(APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=157458) async with build_async_engine_client_from_engine_args(
(APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=157458) File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=157458) return await anext(self.gen)
(APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=157458) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=157458) return cls(
(APIServer pid=157458) ^^^^
(APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=157458) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=157458) return func(*args, **kwargs)
(APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=157458) return AsyncMPClient(*client_args)
(APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=157458) return func(*args, **kwargs)
(APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 887, in __init__
(APIServer pid=157458) super().__init__(
(APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=157458) with launch_core_engines(
(APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=157458) File "/opt/conda/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=157458) next(self.gen)
(APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
(APIServer pid=157458) wait_for_engine_startup(
(APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
(APIServer pid=157458) raise RuntimeError(
(APIServer pid=157458) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/opt/conda/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 8 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
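
A quick arithmetic check on the out-of-memory message above. The failed allocation is 32.00 GiB and --max-num-batched-tokens is 524288, which works out to exactly 64 KiB per batched token. The sketch below assumes (this is not confirmed in the thread) that the failing buffer scales linearly with --max-num-batched-tokens, so lowering the flag would shrink the allocation in proportion.

```python
GIB = 2**30

failed_alloc_bytes = 32 * GIB    # "Tried to allocate 32.00 GiB"
free_bytes = 22.74 * GIB         # "22.74 GiB is free"
batched_tokens = 524288          # --max-num-batched-tokens in the failing command

bytes_per_token = failed_alloc_bytes / batched_tokens

# Largest token budget whose buffer would fit in the reported free memory,
# ASSUMING the buffer scales linearly with --max-num-batched-tokens.
max_tokens_that_fit = int(free_bytes // bytes_per_token)

print(f"{bytes_per_token / 1024:.0f} KiB per batched token")
print(f"buffer fits below about {max_tokens_that_fit} tokens")
print(f"at 327680 tokens: {327680 * bytes_per_token / GIB:.0f} GiB")
```

Under that assumption, the 327680 used in the later launch command would need about 20 GiB, just under the 22.74 GiB reported free. The free figure at failure time is only a rough guide, since other allocations also change it. With --max-model-len 32768, a much smaller --max-num-batched-tokens may also be worth trying.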