MetaX-Tech Developer Forum

    Loong (Members)
    [Solved] vllm serve hangs at startup with `Waiting for 1 local, 0 remote core engine proc(s) to start.` (January 30, 2026, 16:13)

    I. Hardware and software information
    1. Server vendor and model: New H3C R5500 G7
    2. MetaX GPU model: MetaX C550
    3. OS kernel version: 6.8.0-31-generic
    4. CPU virtualization enabled: yes
    5. mx-smi output:

    mx-smi  version: 2.2.8
    
    =================== MetaX System Management Interface Log ===================
    Timestamp                                         : Fri Jan 30 15:42:52 2026
    
    Attached GPUs                                     : 8
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.8                        Kernel Mode Driver Version: 3.0.11          |
    | MACA Version: 3.1.0.14              BIOS Version: 1.27.5.0                      |
    |------------------------------------+---------------------+----------------------+
    | GPU     NAME         Persistence-M | Bus-id              | GPU-Util      sGPU-M |
    | Temp    Pwr:Usage/Cap         Perf | Memory-Usage        | GPU-State            |
    |====================================+=====================+======================|
    | 0       MetaX C550             Off | 0000:23:00.0        | 0%            Native |
    | 37C     93W / 450W              P0 | 859/65536 MiB       | Available            |
    +------------------------------------+---------------------+----------------------+
    | 1       MetaX C550             Off | 0000:26:00.0        | 0%            Native |
    | 32C     93W / 450W              P0 | 859/65536 MiB       | Available            |
    +------------------------------------+---------------------+----------------------+
    | 2       MetaX C550             Off | 0000:63:00.0        | 0%            Native |
    | 31C     91W / 450W              P0 | 859/65536 MiB       | Available            |
    +------------------------------------+---------------------+----------------------+
    | 3       MetaX C550             Off | 0000:66:00.0        | 0%            Native |
    | 38C     95W / 450W              P0 | 859/65536 MiB       | Available            |
    +------------------------------------+---------------------+----------------------+
    | 4       MetaX C550             Off | 0000:a3:00.0        | 0%            Native |
    | 39C     94W / 450W              P0 | 859/65536 MiB       | Available            |
    +------------------------------------+---------------------+----------------------+
    | 5       MetaX C550             Off | 0000:a4:00.0        | 0%            Native |
    | 32C     92W / 450W              P0 | 859/65536 MiB       | Available            |
    +------------------------------------+---------------------+----------------------+
    | 6       MetaX C550             Off | 0000:e3:00.0        | 0%            Native |
    | 31C     91W / 450W              P0 | 859/65536 MiB       | Available            |
    +------------------------------------+---------------------+----------------------+
    | 7       MetaX C550             Off | 0000:e4:00.0        | 0%            Native |
    | 38C     92W / 450W              P0 | 859/65536 MiB       | Available            |
    +------------------------------------+---------------------+----------------------+
    
    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    |  GPU                    PID         Process Name                 GPU Memory     |
    |                                                                  Usage(MiB)     |
    |=================================================================================|
    |  no process found                                                               |
    +---------------------------------------------------------------------------------+
    

    6. docker info output:

    Client: Docker Engine - Community
     Version:    26.1.1
     Context:    default
     Debug Mode: false
     Plugins:
      buildx: Docker Buildx (Docker Inc.)
        Version:  v0.29.1
        Path:     /usr/libexec/docker/cli-plugins/docker-buildx
      compose: Docker Compose (Docker Inc.)
        Version:  v2.40.2
        Path:     /usr/libexec/docker/cli-plugins/docker-compose
    
    Server:
     Containers: 3
      Running: 2
      Paused: 0
      Stopped: 1
     Images: 23
     Server Version: 26.1.1
     Storage Driver: overlay2
      Backing Filesystem: xfs
      Supports d_type: true
      Using metacopy: false
      Native Overlay Diff: true
      userxattr: false
     Logging Driver: json-file
     Cgroup Driver: systemd
     Cgroup Version: 2
     Plugins:
      Volume: local
      Network: bridge host ipvlan macvlan null overlay
      Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
     Swarm: inactive
     Runtimes: io.containerd.runc.v2 metax runc
     Default Runtime: runc
     Init Binary: docker-init
     containerd version: b98a3aace656320842a23f4a392a33f46af97866
     runc version: v1.3.0-0-g4ca628d1
     init version: de40ad0
     Security Options:
      apparmor
      seccomp
       Profile: builtin
      cgroupns
     Kernel Version: 6.8.0-31-generic
     Operating System: Ubuntu 24.04 LTS
     OSType: linux
     Architecture: x86_64
     CPUs: 256
     Total Memory: 1.472TiB
     Name: new12
     ID: 9ed3c1a4-13fa-474f-9e65-83393f42b09c
     Docker Root Dir: /var/lib/docker
     Debug Mode: false
     Experimental: false
     Insecure Registries:
      10.205.70.4:5000
      127.0.0.0/8
     Live Restore Enabled: false
    

    7. Image version: cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.12.0-maca.ai3.3.0.204-torch2.8-py310-ubuntu22.04-amd64
    8. Container start command:

    docker run -itd --name qwen3-bench-vllm \
    --device=/dev/dri \
    --device=/dev/mxcd \
    --group-add video \
    --network=host \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --shm-size 512gb \
    --ulimit memlock=-1 \
    -v /home/gpu_benchmark_xl:/gpu_benchmark_xl \
    cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.12.0-maca.ai3.3.0.204-torch2.8-py310-ubuntu22.04-amd64 \
    /bin/bash
    

    9. Commands run inside the container:

    TIMESTAMP=$(date +%Y%m%d_%H%M%S)
    export MACA_GRAPH_LAUNCH_MODE=1
    
    export OMP_NUM_THREADS=1
    
    export VLLM_LOGGING_LEVEL=DEBUG
    export NCCL_DEBUG=INFO
    
    nohup vllm serve /gpu_benchmark_xl/weights/Qwen3-235B-A22B \
            --async-scheduling \
            --tensor-parallel-size 8 \
            --data-parallel-size 1 \
            --gpu-memory-utilization 0.93 \
            --max-model-len 12800 \
            --swap-space 16 \
            --trust-remote-code \
            --additional-config '{"enable_cpu_binding":true}' \
            > ./logs/Qwen3-235B-A22B-serve_$TIMESTAMP.log 2>&1 &
    

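    II. Symptoms

    `vllm serve` hangs during startup, repeatedly logging `Waiting for 1 local, 0 remote core engine proc(s) to start.`, and finally terminates with a timeout error.

    Setting `export MCCL_P2P_DISABLE=1`, which disables GPU P2P transfers, works around the problem and the vLLM service starts successfully. However, this presumably costs performance. Is there another way to solve the problem?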
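
To compare a run with the workaround against a run without it under otherwise identical settings, the launch environment can be built programmatically. The following is a minimal sketch, not a MetaX or vLLM tool: `BASE_ENV`, `SERVE_ARGV`, and `build_env` are hypothetical names, and the values are copied from the commands in this post. Only `MCCL_P2P_DISABLE` changes between the two runs.

```python
import os

# Environment variables copied from the post; they stay the same in both runs.
BASE_ENV = {
    "MACA_GRAPH_LAUNCH_MODE": "1",
    "OMP_NUM_THREADS": "1",
    "VLLM_LOGGING_LEVEL": "DEBUG",
    "NCCL_DEBUG": "INFO",
}

# The serve command from the post, split into an argv list (no shell redirection).
SERVE_ARGV = [
    "vllm", "serve", "/gpu_benchmark_xl/weights/Qwen3-235B-A22B",
    "--async-scheduling",
    "--tensor-parallel-size", "8",
    "--data-parallel-size", "1",
    "--gpu-memory-utilization", "0.93",
    "--max-model-len", "12800",
    "--swap-space", "16",
    "--trust-remote-code",
    "--additional-config", '{"enable_cpu_binding":true}',
]


def build_env(disable_p2p, parent_env=None):
    """Return the environment for one run (hypothetical helper).

    disable_p2p=True applies the MCCL_P2P_DISABLE=1 workaround from the post;
    disable_p2p=False removes the variable so an inherited value cannot leak in.
    """
    env = dict(os.environ if parent_env is None else parent_env)
    env.update(BASE_ENV)
    if disable_p2p:
        env["MCCL_P2P_DISABLE"] = "1"
    else:
        env.pop("MCCL_P2P_DISABLE", None)
    return env
```

Each run could then be started with `subprocess.Popen(SERVE_ARGV, env=build_env(True))` or `build_env(False)`, with the output of each written to its own log file.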

    
    ...
    
    (VllmWorker rank=7 pid=100341) INFO 01-30 14:34:23 [fused_moe.py:770] Using configuration from /opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/H=4096,E=128,N=192,device_name=Device_4000.json for MoE layer.
    (VllmWorker rank=6 pid=100340) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 26.45 s in total
    (VllmWorker rank=5 pid=100339) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 27.75 s in total
    (VllmWorker rank=1 pid=100335) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 25.66 s in total
    (VllmWorker rank=4 pid=100338) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 26.01 s in total
    (VllmWorker rank=2 pid=100336) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 25.79 s in total
    (VllmWorker rank=3 pid=100337) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 26.84 s in total
    (VllmWorker rank=7 pid=100341) INFO 01-30 14:34:23 [monitor.py:34] torch.compile takes 27.89 s in total
    (VllmWorker rank=0 pid=100334) INFO 01-30 14:34:24 [fused_moe.py:770] Using configuration from /opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/H=4096,E=128,N=192,device_name=Device_4000.json for MoE layer.
    (VllmWorker rank=0 pid=100334) INFO 01-30 14:34:24 [monitor.py:34] torch.compile takes 32.08 s in total
    DEBUG 01-30 14:34:30 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
    DEBUG 01-30 14:34:40 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
    DEBUG 01-30 14:34:50 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
    DEBUG 01-30 14:35:00 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
    DEBUG 01-30 14:35:07 [shm_broadcast.py:456] No available shared memory broadcast block found in 60 second.
    DEBUG 01-30 14:35:10 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
    DEBUG 01-30 14:35:20 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
    
    ...
    
    DEBUG 01-30 14:44:30 [utils.py:741] Waiting for 1 local, 0 remote core engine proc(s) to start.
    [rank4]:[E130 14:44:31.416940324 ProcessGroupNCCL.cpp:629] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=_ALLGATHER_BASE, NumelIn=4861952, NumelOut=38895616, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
    [rank4]:[E130 14:44:31.418359276 ProcessGroupNCCL.cpp:2174] [PG ID 2 PG GUID 3 Rank 4]  failure detected by watchdog at work sequence id: 1 PG status: last enqueued work: 1, last completed work: -1
    [rank4]:[E130 14:44:31.418403860 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
    [rank4]:[E130 14:44:31.418424639 ProcessGroupNCCL.cpp:681] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
    [rank4]:[E130 14:44:31.418442312 ProcessGroupNCCL.cpp:695] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
    [rank4]:[E130 14:44:31.420348341 ProcessGroupNCCL.cpp:1901] [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=_ALLGATHER_BASE, NumelIn=4861952, NumelOut=38895616, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
    Exception raised from checkTimeout at /workspace/framework/mcPytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x70f80ae52b0c in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
    frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2da (0x70f7b72d978a in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
    frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x9b8 (0x70f7b72db1b8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
    frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x13f (0x70f7b72dbf7f in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
    frame #4: <unknown function> + 0xd3b65 (0x70f80b906b65 in /opt/conda/bin/../lib/libstdc++.so.6)
    frame #5: <unknown function> + 0x94ac3 (0x70f80c2d4ac3 in /lib/x86_64-linux-gnu/libc.so.6)
    frame #6: <unknown function> + 0x1268c0 (0x70f80c3668c0 in /lib/x86_64-linux-gnu/libc.so.6)
    
    terminate called after throwing an instance of 'c10::DistBackendError'
      what():  [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=_ALLGATHER_BASE, NumelIn=4861952, NumelOut=38895616, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
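
The watchdog line above carries enough detail to sanity-check what was hanging. The sketch below extracts the fields of the `WorkNCCL(...)` summary and divides `NumelOut` by `NumelIn`. The helper name `parse_watchdog_timeout` is hypothetical, and the regular expression assumes the message layout shown in this log. Here 38895616 / 4861952 = 8, which matches `--tensor-parallel-size 8`, so the stalled operation appears to be an all-gather across all eight ranks. With `SeqNum=1` and `last completed work: -1`, it looks like the first collective on that process group never completed.

```python
import re

# Matches the WorkNCCL(...) summary printed in the watchdog timeout message.
WORK_RE = re.compile(
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\)"
)


def parse_watchdog_timeout(line):
    """Return the collective's fields as a dict, or None if the line does not match."""
    m = WORK_RE.search(line)
    if m is None:
        return None
    info = {k: (v if k == "op" else int(v)) for k, v in m.groupdict().items()}
    # For an all-gather, the output holds one input-sized shard per participating rank.
    info["ranks"] = info["numel_out"] // info["numel_in"] if info["numel_in"] else None
    return info


# The rank 4 line from the log above, copied verbatim.
sample = (
    "[rank4]:[E130 14:44:31.416940324 ProcessGroupNCCL.cpp:629] [Rank 4] "
    "Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, "
    "OpType=_ALLGATHER_BASE, NumelIn=4861952, NumelOut=38895616, "
    "Timeout(ms)=600000) ran for 600004 milliseconds before timing out."
)

if __name__ == "__main__":
    print(parse_watchdog_timeout(sample))
```

Running the same function over every line of the serve log shows whether other ranks report the same `SeqNum` and `OpType`, which helps tell a hang shared by all ranks from a single stuck device.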
    