I. Hardware and Software Information
1. Server vendor and model: New H3C R5500 G7
2. MetaX GPU model: MetaX C550
3. OS kernel version: 6.8.0-31-generic
4. CPU virtualization enabled: yes
5. mx-smi output:
mx-smi version: 2.2.8
=================== MetaX System Management Interface Log ===================
Timestamp : Fri Jan 30 15:42:52 2026
Attached GPUs : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.8 Kernel Mode Driver Version: 3.0.11 |
| MACA Version: 3.1.0.14 BIOS Version: 1.27.5.0 |
|------------------------------------+---------------------+----------------------+
| GPU NAME Persistence-M | Bus-id | GPU-Util sGPU-M |
| Temp Pwr:Usage/Cap Perf | Memory-Usage | GPU-State |
|====================================+=====================+======================|
| 0 MetaX C550 Off | 0000:23:00.0 | 0% Native |
| 37C 93W / 450W P0 | 859/65536 MiB | Available |
+------------------------------------+---------------------+----------------------+
| 1 MetaX C550 Off | 0000:26:00.0 | 0% Native |
| 32C 93W / 450W P0 | 859/65536 MiB | Available |
+------------------------------------+---------------------+----------------------+
| 2 MetaX C550 Off | 0000:63:00.0 | 0% Native |
| 31C 91W / 450W P0 | 859/65536 MiB | Available |
+------------------------------------+---------------------+----------------------+
| 3 MetaX C550 Off | 0000:66:00.0 | 0% Native |
| 38C 95W / 450W P0 | 859/65536 MiB | Available |
+------------------------------------+---------------------+----------------------+
| 4 MetaX C550 Off | 0000:a3:00.0 | 0% Native |
| 39C 94W / 450W P0 | 859/65536 MiB | Available |
+------------------------------------+---------------------+----------------------+
| 5 MetaX C550 Off | 0000:a4:00.0 | 0% Native |
| 32C 92W / 450W P0 | 859/65536 MiB | Available |
+------------------------------------+---------------------+----------------------+
| 6 MetaX C550 Off | 0000:e3:00.0 | 0% Native |
| 31C 91W / 450W P0 | 859/65536 MiB | Available |
+------------------------------------+---------------------+----------------------+
| 7 MetaX C550 Off | 0000:e4:00.0 | 0% Native |
| 38C 92W / 450W P0 | 859/65536 MiB | Available |
+------------------------------------+---------------------+----------------------+
+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| no process found |
+---------------------------------------------------------------------------------+
6. docker info output:
Client: Docker Engine - Community
Version: 26.1.1
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.29.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.40.2
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 3
Running: 2
Paused: 0
Stopped: 1
Images: 23
Server Version: 26.1.1
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 metax runc
Default Runtime: runc
Init Binary: docker-init
containerd version: b98a3aace656320842a23f4a392a33f46af97866
runc version: v1.3.0-0-g4ca628d1
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 6.8.0-31-generic
Operating System: Ubuntu 24.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 256
Total Memory: 1.472TiB
Name: new12
ID: 9ed3c1a4-13fa-474f-9e65-83393f42b09c
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
10.205.70.4:5000
127.0.0.0/8
Live Restore Enabled: false
7. Image version: cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.12.0-maca.ai3.3.0.204-torch2.8-py310-ubuntu22.04-amd64
8. Container start command:
docker run -itd --name qwen3-bench-vllm \
--device=/dev/dri \
--device=/dev/mxcd \
--group-add video \
--network=host \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
--shm-size 512gb \
--ulimit memlock=-1 \
-v /home/gpu_benchmark_xl:/gpu_benchmark_xl \
cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.12.0-maca.ai3.3.0.204-torch2.8-py310-ubuntu22.04-amd64 \
/bin/bash
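As an optional sanity check (not part of the original steps), GPU visibility inside the container can be verified before launching vLLM; this assumes mx-smi is available in the image:
# All 8 MetaX C550 devices should be listed, matching the host-side mx-smi output above
docker exec -it qwen3-bench-vllm mx-smi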
9. Commands executed inside the container:
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
export MACA_GRAPH_LAUNCH_MODE=1
export OMP_NUM_THREADS=1
export VLLM_LOGGING_LEVEL=DEBUG
export NCCL_DEBUG=INFO
nohup vllm serve /gpu_benchmark_xl/weights/Qwen3-235B-A22B \
--async-scheduling \
--tensor-parallel-size 8 \
--data-parallel-size 1 \
--gpu-memory-utilization 0.93 \
--max-model-len 12800 \
--swap-space 16 \
--trust-remote-code \
--additional-config '{"enable_cpu_binding":true}' \
> ./logs/Qwen3-235B-A22B-serve_$TIMESTAMP.log 2>&1 &
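For reference, two follow-up checks after launching (a minimal sketch; it assumes the ./logs directory already exists for the nohup redirection, and that the server listens on vLLM's default port 8000):
# Follow the startup log written by the redirection above
tail -f ./logs/Qwen3-235B-A22B-serve_$TIMESTAMP.log
# Once the server reports it is ready, query the OpenAI-compatible API
# (reachable from the host as well, since the container uses --network=host)
curl http://localhost:8000/v1/models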
II. Problem Description
vllm serve hangs during startup, repeatedly logging "Waiting for 1 local, 0 remote core engine proc(s) to start.", and eventually aborts with a timeout error.
The problem can be worked around by setting export MCCL_P2P_DISABLE=1 to disable GPU P2P transfers, after which the vLLM service starts successfully, but this presumably costs performance. Is there another way to solve this?
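For completeness, the workaround currently in use is sketched below; it is simply the launch sequence from item 9 above with P2P disabled first (the performance concern being that inter-GPU traffic then no longer uses the P2P path):
# Current workaround: disable GPU P2P transfers before starting the server
export MCCL_P2P_DISABLE=1
# ...then run the same vllm serve command as in item 9 above
The log excerpt below is from a failed startup without this workaround: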
...
(VllmWorker rank=7 pid=100341) INFO 01-30 14:34:23 [fused_moe.py:770] Using configuration from /opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/H=4096,E=128,N=192,device_name=Device_4000.json for MoE layer.
(VllmWorker rank=6 pid=100340) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 26.45 s in total
(VllmWorker rank=5 pid=100339) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 27.75 s in total
(VllmWorker rank=1 pid=100335) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 25.66 s in total
(VllmWorker rank=4 pid=100338) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 26.01 s in total
(VllmWorker rank=2 pid=100336) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 25.79 s in total
(VllmWorker rank=3 pid=100337) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 26.84 s in total
(VllmWorker rank=7 pid=100341) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 27.89 s in total
(VllmWorker rank=0 pid=100334) INFO 01-30 14:34:24 [fused_moe.py:770] Using configuration from /opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/H=4096,E=128,N=192,device_name=Device_4000.json for MoE layer.
(VllmWorker rank=0 pid=100334) INFO 01-30 14:34:24 [monitor.py:34] torch.compile takes 32.08 s in total
DEBUG 01-30 14:34:30 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 01-30 14:34:40 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 01-30 14:34:50 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 01-30 14:35:00 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 01-30 14:35:07 [shm_broadcast.py:456] No available shared memory broadcast block found in 60 second.
DEBUG 01-30 14:35:10 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 01-30 14:35:20 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
...
DEBUG 01-30 14:44:30 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
[rank4]:[E130 14:44:31.416940324 ProcessGroupNCCL.cpp:629] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=_ALLGATHER_BASE, NumelIn=4861952, NumelOut=38895616, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
[rank4]:[E130 14:44:31.418359276 ProcessGroupNCCL.cpp:2174] [PG ID 2 PG GUID 3 Rank 4] failure detected by watchdog at work sequence id: 1 PG status: last enqueued work: 1, last completed work: -1
[rank4]:[E130 14:44:31.418403860 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank4]:[E130 14:44:31.418424639 ProcessGroupNCCL.cpp:681] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E130 14:44:31.418442312 ProcessGroupNCCL.cpp:695] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E130 14:44:31.420348341 ProcessGroupNCCL.cpp:1901] [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=_ALLGATHER_BASE, NumelIn=4861952, NumelOut=38895616, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
Exception raised from checkTimeout at /workspace/framework/mcPytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x70f80ae52b0c in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2da (0x70f7b72d978a in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x9b8 (0x70f7b72db1b8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x13f (0x70f7b72dbf7f in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3b65 (0x70f80b906b65 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x70f80c2d4ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x1268c0 (0x70f80c3668c0 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=_ALLGATHER_BASE, NumelIn=4861952, NumelOut=38895616, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.