• Members 10 posts
    May 26, 2026, 17:11

    I. Hardware and software information
    1. Server vendor: Inspur
    2. MetaX GPU model: MXC500
    3. OS and kernel version: Ubuntu 22.02, Linux 5.15.0-72-generic
    4. pip list (abridged):
    torch 2.8.0+metax3.5.3.9
    torchaudio 2.4.1+metax3.5.3.9
    torchcodec 0.6.0+metax3.5.3.9
    torchvision 0.15.1+metax3.5.3.9
    vllm 0.19.0
    vllm_metax 0.19.0+g933e92.d20260429.maca3.5.3.20.torch2.8
    5. Docker image version: vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-ubuntu22.04-amd64
    mx-smi output:

    root@node02:~# mx-smi
    mx-smi  version: 2.2.3
    
    =================== MetaX System Management Interface Log ===================
    Timestamp                                         : Tue May 26 17:02:48 2026
    
    Attached GPUs                                     : 8
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.3                        Kernel Mode Driver Version: 2.14.6          |
    | MACA Version: 2.32.0.6              BIOS Version: 1.24.3.0                      |
    |------------------------------------+---------------------+----------------------+
    | GPU         NAME                   | Bus-id              | GPU-Util             |
    | Temp        Pwr:Usage/Cap          | Memory-Usage        |                      |
    |====================================+=====================+======================|
    | 0           MetaX C500             | 0000:0f:00.0        | 0%                   |
    | 43C         76W / 350W             | 59952/65536 MiB     |                      |
    +------------------------------------+---------------------+----------------------+
    | 1           MetaX C500             | 0000:10:00.0        | 0%                   |
    | 44C         78W / 350W             | 59952/65536 MiB     |                      |
    +------------------------------------+---------------------+----------------------+
    | 2           MetaX C500             | 0000:11:00.0        | 0%                   |
    | 42C         61W / 350W             | 870/65536 MiB       |                      |
    +------------------------------------+---------------------+----------------------+
    | 3           MetaX C500             | 0000:13:00.0        | 0%                   |
    | 40C         60W / 350W             | 870/65536 MiB       |                      |
    +------------------------------------+---------------------+----------------------+
    | 4           MetaX C500             | 0000:88:00.0        | 0%                   |
    | 37C         58W / 350W             | 863/65536 MiB       |                      |
    +------------------------------------+---------------------+----------------------+
    | 5           MetaX C500             | 0000:89:00.0        | 0%                   |
    | 39C         60W / 350W             | 863/65536 MiB       |                      |
    +------------------------------------+---------------------+----------------------+
    | 6           MetaX C500             | 0000:8a:00.0        | 0%                   |
    | 40C         62W / 350W             | 863/65536 MiB       |                      |
    +------------------------------------+---------------------+----------------------+
    | 7           MetaX C500             | 0000:8b:00.0        | 0%                   |
    | 40C         58W / 350W             | 863/65536 MiB       |                      |
    +------------------------------------+---------------------+----------------------+
    
    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    |  GPU                    PID         Process Name                 GPU Memory     |
    |                                                                  Usage(MiB)     |
    |=================================================================================|
    |  0                    35623         VLLM::Worker                 2              |
    |  0                   502514         VLLM::Worker_TP              59086          |
    |  1                    35624         VLLM::Worker                 2              |
    |  1                   502517         VLLM::Worker_TP              59086          |
    |  2                    35625         VLLM::Worker                 6              |
    |  3                    35626         VLLM::Worker                 6              |
    +---------------------------------------------------------------------------------+
    
    End of Log
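
The table above can be read mechanically before deciding which cards to launch on. The following is a minimal sketch, not a MetaX tool: it assumes the two-rows-per-GPU layout printed by mx-smi 2.2.3 in this post, and the names `parse_memory_usage` and `mostly_free` as well as the 10% threshold are hypothetical choices made for illustration.

```python
import re

# Assumed layout (taken from the mx-smi output pasted in this post):
# one row with the GPU index and name, then one row with "used/total MiB".
GPU_ROW = re.compile(r"^\|\s*(\d+)\s+MetaX\s+\S+\s*\|")
MEM_ROW = re.compile(r"\|\s*(\d+)/(\d+)\s*MiB\s*\|")


def parse_memory_usage(smi_text):
    """Return {gpu_index: (used_mib, total_mib)} from pasted mx-smi text."""
    usage = {}
    current = None  # index of the GPU whose memory row we expect next
    for line in smi_text.splitlines():
        line = line.strip()
        gpu = GPU_ROW.match(line)
        if gpu:
            current = int(gpu.group(1))
            continue
        mem = MEM_ROW.search(line)
        if mem and current is not None:
            usage[current] = (int(mem.group(1)), int(mem.group(2)))
            current = None
    return usage


def mostly_free(usage, max_used_fraction=0.1):
    """Hypothetical helper: GPUs whose used memory is under the threshold."""
    return sorted(
        idx for idx, (used, total) in usage.items()
        if used / total <= max_used_fraction
    )


# Three GPUs copied from the table in this post.
SAMPLE = """
| 0           MetaX C500             | 0000:0f:00.0        | 0%                   |
| 43C         76W / 350W             | 59952/65536 MiB     |                      |
| 1           MetaX C500             | 0000:10:00.0        | 0%                   |
| 44C         78W / 350W             | 59952/65536 MiB     |                      |
| 2           MetaX C500             | 0000:11:00.0        | 0%                   |
| 42C         61W / 350W             | 870/65536 MiB       |                      |
"""

usage = parse_memory_usage(SAMPLE)
free_gpus = mostly_free(usage)
print(usage, free_gpus)
```

On the full table in this post the same check leaves GPUs 2 through 7, since GPUs 0 and 1 each already hold 59952 MiB for an existing `VLLM::Worker_TP` process.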
    

    While trying to deploy Qwen3.6-35B-A3B-W8A8 across 4 GPUs, I ran into a cross-GPU communication problem:

    (Worker pid=51607) INFO 05-26 16:58:38 [mccl.py:27] Found mccl from library libmccl.so
    (Worker pid=51607) INFO 05-26 16:58:38 [pynccl.py:111] vLLM is using nccl==2.16.5
    [16:58:49.492][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:51332 type:21. Retrying.
    [16:58:49.492][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:61644 type:21. Retrying.
    [16:58:59.728][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:61644 type:21. Retrying.
    [16:58:59.728][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:51332 type:21. Retrying.
    [16:59:09.972][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:51332 type:21. Retrying.
    [16:59:09.972][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:61644 type:21. Retrying.
    [16:59:20.212][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:51332 type:21. Retrying.
    [16:59:20.212][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:61644 type:21. Retrying.
    [16:59:30.452][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:51332 type:21. Retrying.
    [16:59:30.452][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:61644 type:21. Retrying.
    [16:59:40.688][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:51332 type:21. Retrying.
    [16:59:40.692][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:61644 type:21. Retrying.
    [16:59:50.932][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:51332 type:21. Retrying.
    [16:59:50.932][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:61644 type:21. Retrying.
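
To see at a glance which devices are stuck and how often the driver retries, the timeout lines above can be grouped by `gpu_id`. This is only a sketch under the assumption that every line follows the exact format shown in this post; the function names `summarize_timeouts` and `retry_intervals_ms` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed line format, copied from the MXKW messages in this post.
TIMEOUT = re.compile(
    r"\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]\[MXKW\]\[E\].*"
    r"create queue block timeout, gpu_id:(\d+) type:(\d+)"
)


def summarize_timeouts(log_text):
    """Map gpu_id -> list of timestamps in milliseconds since midnight."""
    seen = defaultdict(list)
    for line in log_text.splitlines():
        m = TIMEOUT.search(line)
        if not m:
            continue
        hours, minutes, seconds, millis, gpu_id, _queue_type = map(int, m.groups())
        stamp = ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis
        seen[gpu_id].append(stamp)
    return dict(seen)


def retry_intervals_ms(stamps):
    """Gaps between consecutive timeouts for one gpu_id, in milliseconds."""
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


# First four timeout lines from the log in this post.
SAMPLE = """
[16:58:49.492][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:51332 type:21. Retrying.
[16:58:49.492][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:61644 type:21. Retrying.
[16:58:59.728][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:61644 type:21. Retrying.
[16:58:59.728][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:51332 type:21. Retrying.
"""

timeouts = summarize_timeouts(SAMPLE)
for gpu_id, stamps in sorted(timeouts.items()):
    print(gpu_id, len(stamps), retry_intervals_ms(stamps))
```

In the full log above only two `gpu_id` values appear (51332 and 61644), each retried seven times at roughly 10.24-second intervals, even though the deployment targets four cards.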
    

    The kernel log on the bare-metal host is as follows:

    root@node02:~# dmesg -T | grep -i err
    [Wed Feb 11 15:12:48 2026] ACPI: Using IOAPIC for interrupt routing
    [Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKA configured for IRQ 15
    [Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKA disabled
    [Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKB configured for IRQ 15
    [Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKB disabled
    [Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKC configured for IRQ 15
    [Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKC disabled
    [Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKD configured for IRQ 15
    [Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKD disabled
    [Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKE configured for IRQ 15
    [Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKE disabled
    [Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKF configured for IRQ 15
    [Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKF disabled
    [Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKG configured for IRQ 15
    [Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKG disabled
    [Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKH configured for IRQ 15
    [Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKH disabled
    [Wed Feb 11 15:12:49 2026] pcieport 0000:36:01.0: pciehp: Slot #16 AttnBtn+ PwrCtrl+ MRL+ AttnInd+ PwrInd+ HotPlug+ Surprise- Interlock- NoCompl- IbPresDis- LLActRep+ (with Cmd Compl erratum)
    [Wed Feb 11 15:12:49 2026] ERST: Error Record Serialization Table (ERST) support is initialized.
    [Wed Feb 11 15:12:49 2026] RAS: Correctable Errors collector initialized.
    [Wed Feb 11 15:12:50 2026] i801_smbus 0000:00:1f.4: SMBus using PCI interrupt
    [Wed Feb 11 15:12:50 2026] igb 0000:37:00.0: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
    [Wed Feb 11 15:12:50 2026] igb 0000:37:00.1: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
    [Wed Feb 11 15:12:50 2026] igb 0000:37:00.2: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
    [Wed Feb 11 15:12:50 2026] igb 0000:37:00.3: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
    [Wed Feb 11 15:12:56 2026] EDAC MC0: Giving out device to module i10nm_edac controller Intel_10nm Socket#0 IMC#0: DEV 0000:7e:0c.0 (INTERRUPT)
    [Wed Feb 11 15:12:56 2026] EDAC MC1: Giving out device to module i10nm_edac controller Intel_10nm Socket#0 IMC#1: DEV 0000:7e:0d.0 (INTERRUPT)
    [Wed Feb 11 15:12:56 2026] EDAC MC2: Giving out device to module i10nm_edac controller Intel_10nm Socket#0 IMC#2: DEV 0000:7e:0e.0 (INTERRUPT)
    [Wed Feb 11 15:12:56 2026] EDAC MC3: Giving out device to module i10nm_edac controller Intel_10nm Socket#0 IMC#3: DEV 0000:7e:0f.0 (INTERRUPT)
    [Wed Feb 11 15:12:56 2026] EDAC MC4: Giving out device to module i10nm_edac controller Intel_10nm Socket#1 IMC#0: DEV 0000:fe:0c.0 (INTERRUPT)
    [Wed Feb 11 15:12:56 2026] EDAC MC5: Giving out device to module i10nm_edac controller Intel_10nm Socket#1 IMC#1: DEV 0000:fe:0d.0 (INTERRUPT)
    [Wed Feb 11 15:12:56 2026] EDAC MC6: Giving out device to module i10nm_edac controller Intel_10nm Socket#1 IMC#2: DEV 0000:fe:0e.0 (INTERRUPT)
    [Wed Feb 11 15:12:56 2026] EDAC MC7: Giving out device to module i10nm_edac controller Intel_10nm Socket#1 IMC#3: DEV 0000:fe:0f.0 (INTERRUPT)
    [Wed Feb 11 22:19:11 2026] perf: interrupt took too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
    [Thu Feb 12 00:57:43 2026] perf: interrupt took too long (3143 > 3127), lowering kernel.perf_event_max_sample_rate to 63500
    [Thu Feb 12 04:31:00 2026] perf: interrupt took too long (3939 > 3928), lowering kernel.perf_event_max_sample_rate to 50750
    [Thu Feb 12 10:39:28 2026] perf: interrupt took too long (9058 > 4923), lowering kernel.perf_event_max_sample_rate to 22000
    [Fri Feb 20 09:12:51 2026] hrtimer: interrupt took 14704 ns
    [Tue Mar  3 21:17:10 2026] METAX.B1000.D0.MC.ERROR vram try to use (67272992 kB) beyond total memory (67108864 kB), failed
    [Tue Mar  3 21:17:10 2026] METAX.B1000.D0.MC.ERROR failed to create bo on domain VRAM, -12
    [Tue Mar  3 21:17:10 2026] METAX.B1000.D0.MC.ERROR vram try to use (67242276 kB) beyond total memory (67108864 kB), failed
    [Tue Mar  3 21:17:10 2026] METAX.B1000.D0.MC.ERROR failed to create bo on domain VRAM, -12
    [Tue Mar  3 21:17:10 2026] METAX.BF00.D0.MC.ERROR vram try to use (67764516 kB) beyond total memory (67108864 kB), failed
    [Tue Mar  3 21:17:10 2026] METAX.BF00.D0.MC.ERROR failed to create bo on domain VRAM, -12
    [Tue Mar  3 21:17:10 2026] METAX.BF00.D0.MC.ERROR vram try to use (67733796 kB) beyond total memory (67108864 kB), failed
    [Tue Mar  3 21:17:10 2026] METAX.BF00.D0.MC.ERROR failed to create bo on domain VRAM, -12
    [Tue Mar  3 21:20:33 2026] METAX.BF00.D0.MC.ERROR vram try to use (67132796 kB) beyond total memory (67108864 kB), failed
    [Tue Mar  3 21:20:33 2026] METAX.BF00.D0.MC.ERROR failed to create bo on domain VRAM, -12
    [Tue Mar  3 21:20:33 2026] METAX.BF00.D0.MC.ERROR vram try to use (67132796 kB) beyond total memory (67108864 kB), failed
    [Tue Mar  3 21:20:33 2026] METAX.BF00.D0.MC.ERROR failed to create bo on domain VRAM, -12
    [Tue Mar  3 22:51:21 2026] worker.io[1950624]: segfault at 560cf17d34b5 ip 00007fa175cbf13d sp 00007fa119ffa3e0 error 4 in _raylet.so[7fa174f06000+1a25000]
    [Tue Mar  3 23:00:25 2026] traps: python3.10[2069691] general protection fault ip:7f6622857a69 sp:7ffe00ba4070 error:0 in _raylet.so[7f6621306000+1a25000]
    [Tue Apr 14 10:59:43 2026] perf: interrupt took too long (15005 > 11322), lowering kernel.perf_event_max_sample_rate to 13250
    [Wed Apr 15 13:14:37 2026] METAX.MC.ERROR -4 can not get user pages 0x55b3f6c68000 num_pages 0x1801 vm flags 0x8100073 gup flags 0x1
    [Wed Apr 15 13:14:37 2026] METAX.MC.ERROR failed to get user pages, -4
    [Wed Apr 15 13:14:37 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
    [Wed Apr 15 13:14:37 2026] METAX.MC.ERROR -4 can not get user pages 0x55c628ea1000 num_pages 0x1801 vm flags 0x8100073 gup flags 0x1
    [Wed Apr 15 13:14:37 2026] METAX.MC.ERROR failed to get user pages, -4
    [Wed Apr 15 13:14:37 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
    [Fri May  8 02:29:05 2026] METAX.BF00.D0.MC.ERROR vram try to use (68437304 kB) beyond total memory (67108864 kB), failed
    [Fri May  8 02:29:05 2026] METAX.BF00.D0.MC.ERROR failed to create bo on domain VRAM, -12
    [Fri May  8 04:24:36 2026] METAX.BF00.D0.MC.ERROR vram try to use (67449308 kB) beyond total memory (67108864 kB), failed
    [Fri May  8 04:24:36 2026] METAX.BF00.D0.MC.ERROR failed to create bo on domain VRAM, -12
    [Fri May  8 04:24:36 2026] METAX.BF00.D0.MC.ERROR vram try to use (67449304 kB) beyond total memory (67108864 kB), failed
    [Fri May  8 04:24:36 2026] METAX.BF00.D0.MC.ERROR failed to create bo on domain VRAM, -12
    [Fri May  8 10:03:36 2026] METAX.B8A00.D0.MC.ERROR vram try to use (68371796 kB) beyond total memory (67108864 kB), failed
    [Fri May  8 10:03:36 2026] METAX.B8A00.D0.MC.ERROR failed to create bo on domain VRAM, -12
    [Fri May  8 16:22:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7fa388fbc000 num_pages 0x1001 vm flags 0x8100073 gup flags 0x1
    [Fri May  8 16:22:36 2026] METAX.MC.ERROR failed to get user pages, -4
    [Fri May  8 16:22:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
    [Fri May  8 16:22:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7fdcfad66000 num_pages 0x2001 vm flags 0x8100073 gup flags 0x1
    [Fri May  8 16:22:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7f9d63b88000 num_pages 0x2001 vm flags 0x8100073 gup flags 0x1
    [Fri May  8 16:22:36 2026] METAX.MC.ERROR failed to get user pages, -4
    [Fri May  8 16:22:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
    [Fri May  8 16:22:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7f6318d73000 num_pages 0x2001 vm flags 0x8100073 gup flags 0x1
    [Fri May  8 16:22:36 2026] METAX.MC.ERROR failed to get user pages, -4
    [Fri May  8 16:22:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
    [Fri May  8 16:22:36 2026] METAX.MC.ERROR failed to get user pages, -4
    [Fri May  8 16:22:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
    [Fri May  8 16:22:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7f46197f5000 num_pages 0x1001 vm flags 0x8100073 gup flags 0x1
    [Fri May  8 16:22:36 2026] METAX.MC.ERROR failed to get user pages, -4
    [Fri May  8 16:22:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
    [Fri May  8 16:22:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7f4fe57d4000 num_pages 0x1001 vm flags 0x8100073 gup flags 0x1
    [Fri May  8 16:22:36 2026] METAX.MC.ERROR failed to get user pages, -4
    [Fri May  8 16:22:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
    [Fri May  8 16:24:16 2026] METAX.MC.ERROR -4 can not get user pages 0x7fef5e345000 num_pages 0x2000 vm flags 0x8100073 gup flags 0x1
    [Fri May  8 16:24:16 2026] METAX.MC.ERROR failed to get user pages, -4
    [Fri May  8 16:24:16 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
    [Fri May  8 16:25:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7fcd737bb000 num_pages 0x800 vm flags 0x8100073 gup flags 0x1
    [Fri May  8 16:25:36 2026] METAX.MC.ERROR failed to get user pages, -4
    [Fri May  8 16:25:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
    [Fri May  8 16:26:39 2026] METAX.MC.ERROR -4 can not get user pages 0x7f50c255b000 num_pages 0x2001 vm flags 0x8100073 gup flags 0x1
    [Fri May  8 16:26:39 2026] METAX.MC.ERROR failed to get user pages, -4
    [Fri May  8 16:26:39 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
    [Fri May  8 17:13:53 2026] METAX.MC.ERROR -4 can not get user pages 0x7fb90cb3e000 num_pages 0x800 vm flags 0x8100073 gup flags 0x1
    [Fri May  8 17:13:53 2026] METAX.MC.ERROR failed to get user pages, -4
    [Fri May  8 17:13:53 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
    [Sat May  9 11:34:42 2026] pt_nccl_heartbt[4127564]: segfault at 4c0 ip 00007f71c1e8bc1d sp 00007f6e237fd630 error 4 in libc.so.6[7f71c1e6f000+195000]
    [Sat May  9 18:12:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7f980f9d0000 num_pages 0x2001 vm flags 0x8100073 gup flags 0x1
    [Sat May  9 18:12:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7f847df5b000 num_pages 0x2001 vm flags 0x8100073 gup flags 0x1
    [Sat May  9 18:12:36 2026] METAX.MC.ERROR failed to get user pages, -4
    [Sat May  9 18:12:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7f7bf0d3d000 num_pages 0x2001 vm flags 0x8100073 gup flags 0x1
    [Sat May  9 18:12:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
    [Sat May  9 18:12:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7f399e75b000 num_pages 0x2001 vm flags 0x8100073 gup flags 0x1
    [Sat May  9 18:12:36 2026] METAX.MC.ERROR failed to get user pages, -4
    [Sat May  9 18:12:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
    [Sat May  9 18:12:36 2026] METAX.MC.ERROR failed to get user pages, -4
    [Sat May  9 18:12:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7fd121dac000 num_pages 0x2001 vm flags 0x8100073 gup flags 0x1
    [Sat May  9 18:12:36 2026] METAX.MC.ERROR failed to get user pages, -4
    [Sat May  9 18:12:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
    [Sat May  9 18:12:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
    [Sat May  9 18:12:36 2026] METAX.MC.ERROR failed to get user pages, -4
    [Sat May  9 18:12:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
    [Sun May 10 00:30:12 2026] METAX.MC.ERROR -4 can not get user pages 0x7f3267188000 num_pages 0x1001 vm flags 0x8100073 gup flags 0x1
    [Sun May 10 00:30:12 2026] METAX.MC.ERROR failed to get user pages, -4
    [Sun May 10 00:30:12 2026] METAX.MC.ERROR -4 can not get user pages 0x7f299b9bc000 num_pages 0x1001 vm flags 0x8100073 gup flags 0x1
    [Sun May 10 00:30:12 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
    [Sun May 10 00:30:12 2026] METAX.MC.ERROR failed to get user pages, -4
    [Sun May 10 00:30:12 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
    [Mon May 18 22:55:38 2026] python3.7[3982761]: segfault at 0 ip 0000000000000000 sp 00007ffdb6f2ccc8 error 14 in python3.7[55708cfdd000+58000]
    root@node02:~#
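
Because `grep -i err` also matches words such as `interrupt` and `erratum`, most of the output above is unrelated boot noise. The sketch below keeps only the driver's `METAX...MC.ERROR` records and counts them per day. It assumes the `dmesg -T` line format shown in this post; `metax_errors`, `errors_per_day`, and `vram_overshoot_kb` are hypothetical helper names.

```python
import re
from collections import Counter

# Assumed record shape: "[<dow> <mon> <day> <hh:mm:ss> <year>] METAX.<...>MC.ERROR <message>"
METAX_ERR = re.compile(r"^\[([^\]]+)\]\s+METAX\.\S*MC\.ERROR\s+(.*)$")
VRAM = re.compile(r"vram try to use \((\d+) kB\) beyond total memory \((\d+) kB\)")


def metax_errors(dmesg_text):
    """Yield (date, message) for METAX driver error records only."""
    for line in dmesg_text.splitlines():
        m = METAX_ERR.match(line.strip())
        if m:
            _dow, month, day, _clock, year = m.group(1).split()
            yield f"{year} {month} {int(day):02d}", m.group(2)


def errors_per_day(dmesg_text):
    """Count METAX error records per calendar day."""
    return Counter(date for date, _message in metax_errors(dmesg_text))


def vram_overshoot_kb(message):
    """Requested minus total VRAM in kB, or None for other message kinds."""
    m = VRAM.search(message)
    return int(m.group(1)) - int(m.group(2)) if m else None


# Three lines copied from the dmesg output in this post.
SAMPLE = """
[Wed Feb 11 15:12:48 2026] ACPI: Using IOAPIC for interrupt routing
[Tue Mar  3 21:17:10 2026] METAX.B1000.D0.MC.ERROR vram try to use (67272992 kB) beyond total memory (67108864 kB), failed
[Wed Apr 15 13:14:37 2026] METAX.MC.ERROR failed to get user pages, -4
"""

per_day = errors_per_day(SAMPLE)
records = list(metax_errors(SAMPLE))
print(per_day, vram_overshoot_kb(records[0][1]))
```

Applied to the full output above, this makes it easier to see that the most recent METAX record is dated May 10 and the last line of any kind is dated May 18, so nothing in the pasted kernel log is from May 26, when the queue timeouts occurred.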
    
  • Thread has been moved from 产品&运维 (Products & Operations).

  • Members 497 posts
    May 26, 2026, 17:13

    Dear developer, please shut down the server, unplug the power cable, reseat the GPUs, and then try again.