I. Hardware and Software Information
1. Server vendor and model: New H3C R5500 G7
2. MetaX GPU model: MetaX C550
3. OS kernel version: 6.8.0-31-generic
4. CPU virtualization enabled: yes
5. mx-smi output:
mx-smi version: 2.2.8
=================== MetaX System Management Interface Log ===================
Timestamp : Fri Jan 30 15:42:52 2026
Attached GPUs : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.8 Kernel Mode Driver Version: 3.0.11 |
| MACA Version: 3.1.0.14 BIOS Version: 1.27.5.0 |
|------------------------------------+---------------------+----------------------+
| GPU NAME Persistence-M | Bus-id | GPU-Util sGPU-M |
| Temp Pwr:Usage/Cap Perf | Memory-Usage | GPU-State |
|====================================+=====================+======================|
| 0 MetaX C550 Off | 0000:23:00.0 | 0% Native |
| 37C 93W / 450W P0 | 859/65536 MiB | Available |
+------------------------------------+---------------------+----------------------+
| 1 MetaX C550 Off | 0000:26:00.0 | 0% Native |
| 32C 93W / 450W P0 | 859/65536 MiB | Available |
+------------------------------------+---------------------+----------------------+
| 2 MetaX C550 Off | 0000:63:00.0 | 0% Native |
| 31C 91W / 450W P0 | 859/65536 MiB | Available |
+------------------------------------+---------------------+----------------------+
| 3 MetaX C550 Off | 0000:66:00.0 | 0% Native |
| 38C 95W / 450W P0 | 859/65536 MiB | Available |
+------------------------------------+---------------------+----------------------+
| 4 MetaX C550 Off | 0000:a3:00.0 | 0% Native |
| 39C 94W / 450W P0 | 859/65536 MiB | Available |
+------------------------------------+---------------------+----------------------+
| 5 MetaX C550 Off | 0000:a4:00.0 | 0% Native |
| 32C 92W / 450W P0 | 859/65536 MiB | Available |
+------------------------------------+---------------------+----------------------+
| 6 MetaX C550 Off | 0000:e3:00.0 | 0% Native |
| 31C 91W / 450W P0 | 859/65536 MiB | Available |
+------------------------------------+---------------------+----------------------+
| 7 MetaX C550 Off | 0000:e4:00.0 | 0% Native |
| 38C 92W / 450W P0 | 859/65536 MiB | Available |
+------------------------------------+---------------------+----------------------+
+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| no process found |
+---------------------------------------------------------------------------------+
6. docker info output:
Client: Docker Engine - Community
Version: 26.1.1
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.29.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.40.2
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 3
Running: 2
Paused: 0
Stopped: 1
Images: 23
Server Version: 26.1.1
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 metax runc
Default Runtime: runc
Init Binary: docker-init
containerd version: b98a3aace656320842a23f4a392a33f46af97866
runc version: v1.3.0-0-g4ca628d1
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 6.8.0-31-generic
Operating System: Ubuntu 24.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 256
Total Memory: 1.472TiB
Name: new12
ID: 9ed3c1a4-13fa-474f-9e65-83393f42b09c
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
10.205.70.4:5000
127.0.0.0/8
Live Restore Enabled: false
7. Image version: cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.12.0-maca.ai3.3.0.204-torch2.8-py310-ubuntu22.04-amd64
8. Container start command:
docker run -itd --name qwen3-bench-vllm \
--device=/dev/dri \
--device=/dev/mxcd \
--group-add video \
--network=host \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
--shm-size 512gb \
--ulimit memlock=-1 \
-v /home/gpu_benchmark_xl:/gpu_benchmark_xl \
cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.12.0-maca.ai3.3.0.204-torch2.8-py310-ubuntu22.04-amd64 \
/bin/bash
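As an optional sanity check (not part of the original steps), GPU visibility inside the container can be verified before launching vLLM; this assumes mx-smi is available in the image:
# All 8 MetaX C550 devices should be listed, matching the host-side mx-smi output above
docker exec -it qwen3-bench-vllm mx-smi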
9. Commands executed inside the container:
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
export MACA_GRAPH_LAUNCH_MODE=1
export OMP_NUM_THREADS=1
export VLLM_LOGGING_LEVEL=DEBUG
export NCCL_DEBUG=INFO
nohup vllm serve /gpu_benchmark_xl/weights/Qwen3-235B-A22B \
--async-scheduling \
--tensor-parallel-size 8 \
--data-parallel-size 1 \
--gpu-memory-utilization 0.93 \
--max-model-len 12800 \
--swap-space 16 \
--trust-remote-code \
--additional-config '{"enable_cpu_binding":true}' \
> ./logs/Qwen3-235B-A22B-serve_$TIMESTAMP.log 2>&1 &
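For reference, two follow-up checks after launching (a minimal sketch; it assumes the ./logs directory already exists for the nohup redirection, and that the server listens on vLLM's default port 8000):
# Follow the startup log written by the redirection above
tail -f ./logs/Qwen3-235B-A22B-serve_$TIMESTAMP.log
# Once the server reports it is ready, query the OpenAI-compatible API
# (reachable from the host as well, since the container uses --network=host)
curl http://localhost:8000/v1/models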
II. Problem Description
vllm serve hangs during startup, repeatedly logging "Waiting for 1 local, 0 remote core engine proc(s) to start.", and eventually aborts with a timeout error.
The problem can be worked around by setting export MCCL_P2P_DISABLE=1 to disable GPU P2P transfers, after which the vLLM service starts successfully, but this presumably costs performance. Is there another way to solve this?
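For completeness, the workaround currently in use is sketched below; it is simply the launch sequence from item 9 above with P2P disabled first (the performance concern being that inter-GPU traffic then no longer uses the P2P path):
# Current workaround: disable GPU P2P transfers before starting the server
export MCCL_P2P_DISABLE=1
# ...then run the same vllm serve command as in item 9 above
The log excerpt below is from a failed startup without this workaround: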
...
(VllmWorker rank=7 pid=100341) INFO 01-30 14:34:23 [fused_moe.py:770] Using configuration from /opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/H=4096,E=128,N=192,device_name=Device_4000.json for MoE layer.
(VllmWorker rank=6 pid=100340) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 26.45 s in total
(VllmWorker rank=5 pid=100339) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 27.75 s in total
(VllmWorker rank=1 pid=100335) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 25.66 s in total
(VllmWorker rank=4 pid=100338) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 26.01 s in total
(VllmWorker rank=2 pid=100336) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 25.79 s in total
(VllmWorker rank=3 pid=100337) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 26.84 s in total
(VllmWorker rank=7 pid=100341) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 27.89 s in total
(VllmWorker rank=0 pid=100334) INFO 01-30 14:34:24 [fused_moe.py:770] Using configuration from /opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/H=4096,E=128,N=192,device_name=Device_4000.json for MoE layer.
(VllmWorker rank=0 pid=100334) INFO 01-30 14:34:24 [monitor.py:34] torch.compile takes 32.08 s in total
DEBUG 01-30 14:34:30 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 01-30 14:34:40 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 01-30 14:34:50 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 01-30 14:35:00 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 01-30 14:35:07 [shm_broadcast.py:456] No available shared memory broadcast block found in 60 second.
DEBUG 01-30 14:35:10 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 01-30 14:35:20 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
...
DEBUG 01-30 14:44:30 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
[rank4]:[E130 14:44:31.416940324 ProcessGroupNCCL.cpp:629] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=_ALLGATHER_BASE, NumelIn=4861952, NumelOut=38895616, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
[rank4]:[E130 14:44:31.418359276 ProcessGroupNCCL.cpp:2174] [PG ID 2 PG GUID 3 Rank 4] failure detected by watchdog at work sequence id: 1 PG status: last enqueued work: 1, last completed work: -1
[rank4]:[E130 14:44:31.418403860 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank4]:[E130 14:44:31.418424639 ProcessGroupNCCL.cpp:681] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E130 14:44:31.418442312 ProcessGroupNCCL.cpp:695] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E130 14:44:31.420348341 ProcessGroupNCCL.cpp:1901] [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=_ALLGATHER_BASE, NumelIn=4861952, NumelOut=38895616, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
Exception raised from checkTimeout at /workspace/framework/mcPytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x70f80ae52b0c in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2da (0x70f7b72d978a in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x9b8 (0x70f7b72db1b8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x13f (0x70f7b72dbf7f in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3b65 (0x70f80b906b65 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x70f80c2d4ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x1268c0 (0x70f80c3668c0 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=_ALLGATHER_BASE, NumelIn=4861952, NumelOut=38895616, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.