I. Hardware and software information
1. Server vendor: Inspur
2. MetaX GPU model: MXC500
3. OS and kernel version: Ubuntu 22.04, Linux 5.15.0-72-generic
4. pip list (abridged):
torch 2.8.0+metax3.5.3.9
torchaudio 2.4.1+metax3.5.3.9
torchcodec 0.6.0+metax3.5.3.9
torchvision 0.15.1+metax3.5.3.9
vllm 0.19.0
vllm_metax 0.19.0+g933e92.d20260429.maca3.5.3.20.torch2.8
5. Docker image version: vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-ubuntu22.04-amd64
mx-smi output:
root@node02:~# mx-smi
mx-smi version: 2.2.3
=================== MetaX System Management Interface Log ===================
Timestamp : Tue May 26 17:02:48 2026
Attached GPUs : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.3 Kernel Mode Driver Version: 2.14.6 |
| MACA Version: 2.32.0.6 BIOS Version: 1.24.3.0 |
|------------------------------------+---------------------+----------------------+
| GPU NAME | Bus-id | GPU-Util |
| Temp Pwr:Usage/Cap | Memory-Usage | |
|====================================+=====================+======================|
| 0 MetaX C500 | 0000:0f:00.0 | 0% |
| 43C 76W / 350W | 59952/65536 MiB | |
+------------------------------------+---------------------+----------------------+
| 1 MetaX C500 | 0000:10:00.0 | 0% |
| 44C 78W / 350W | 59952/65536 MiB | |
+------------------------------------+---------------------+----------------------+
| 2 MetaX C500 | 0000:11:00.0 | 0% |
| 42C 61W / 350W | 870/65536 MiB | |
+------------------------------------+---------------------+----------------------+
| 3 MetaX C500 | 0000:13:00.0 | 0% |
| 40C 60W / 350W | 870/65536 MiB | |
+------------------------------------+---------------------+----------------------+
| 4 MetaX C500 | 0000:88:00.0 | 0% |
| 37C 58W / 350W | 863/65536 MiB | |
+------------------------------------+---------------------+----------------------+
| 5 MetaX C500 | 0000:89:00.0 | 0% |
| 39C 60W / 350W | 863/65536 MiB | |
+------------------------------------+---------------------+----------------------+
| 6 MetaX C500 | 0000:8a:00.0 | 0% |
| 40C 62W / 350W | 863/65536 MiB | |
+------------------------------------+---------------------+----------------------+
| 7 MetaX C500 | 0000:8b:00.0 | 0% |
| 40C 58W / 350W | 863/65536 MiB | |
+------------------------------------+---------------------+----------------------+
+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| 0 35623 VLLM::Worker 2 |
| 0 502514 VLLM::Worker_TP 59086 |
| 1 35624 VLLM::Worker 2 |
| 1 502517 VLLM::Worker_TP 59086 |
| 2 35625 VLLM::Worker 6 |
| 3 35626 VLLM::Worker 6 |
+---------------------------------------------------------------------------------+
End of Log
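The mx-smi output above shows GPUs 0 and 1 each holding 59952 MiB for an existing VLLM::Worker_TP process, while GPUs 2 to 7 are nearly idle. As a rough aid for choosing four cards for a new deployment, the sketch below parses text in that layout and lists the GPUs under a usage threshold. The function name `free_gpus` and the 2048 MiB threshold are hypothetical choices for illustration, not part of any MetaX tool, and the regular expressions assume the row layout shown in this post.

```python
import re

# Hypothetical helper, not a MetaX API: it only parses text that looks like
# the mx-smi table pasted in this post.
GPU_ROW = re.compile(r"^\|\s*(\d+)\s+MetaX\s+\S+\s*\|")
MEM_ROW = re.compile(r"\|\s*(\d+)/(\d+)\s+MiB\s*\|")


def free_gpus(smi_text, max_used_mib=2048):
    """Return indices of GPUs whose used memory is at most max_used_mib."""
    free = []
    current = None  # index of the GPU whose memory row we expect next
    for line in smi_text.splitlines():
        gpu = GPU_ROW.match(line)
        if gpu:
            current = int(gpu.group(1))
            continue
        mem = MEM_ROW.search(line)
        if mem and current is not None:
            if int(mem.group(1)) <= max_used_mib:
                free.append(current)
            current = None
    return free


sample = """\
| 0 MetaX C500 | 0000:0f:00.0 | 0% |
| 43C 76W / 350W | 59952/65536 MiB | |
| 2 MetaX C500 | 0000:11:00.0 | 0% |
| 42C 61W / 350W | 870/65536 MiB | |
"""
print(free_gpus(sample))  # [2]
```

How the resulting indices are then passed to the launcher (for example through a device-visibility environment variable) depends on the MetaX software stack and is not covered here.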
While trying to deploy Qwen3.6-35B-A3B-W8A8 across 4 GPUs, I ran into a cross-GPU communication problem:
(Worker pid=51607) INFO 05-26 16:58:38 [mccl.py:27] Found mccl from library libmccl.so
(Worker pid=51607) INFO 05-26 16:58:38 [pynccl.py:111] vLLM is using nccl==2.16.5
[16:58:49.492][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:51332 type:21. Retrying.
[16:58:49.492][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:61644 type:21. Retrying.
[16:58:59.728][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:61644 type:21. Retrying.
[16:58:59.728][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:51332 type:21. Retrying.
[16:59:09.972][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:51332 type:21. Retrying.
[16:59:09.972][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:61644 type:21. Retrying.
[16:59:20.212][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:51332 type:21. Retrying.
[16:59:20.212][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:61644 type:21. Retrying.
[16:59:30.452][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:51332 type:21. Retrying.
[16:59:30.452][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:61644 type:21. Retrying.
[16:59:40.688][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:51332 type:21. Retrying.
[16:59:40.692][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:61644 type:21. Retrying.
[16:59:50.932][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:51332 type:21. Retrying.
[16:59:50.932][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:61644 type:21. Retrying.
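In the log above, the same two gpu_id values (51332 and 61644) time out again and again, roughly every 10 seconds. To make that pattern easier to see in a longer log, the following sketch counts the timeout lines per gpu_id and reports the time between the first and last occurrence. The function name `summarize_timeouts` is hypothetical, and the regular expression assumes the exact MXKW line format shown here, with all lines falling on the same day.

```python
import re
from collections import Counter

# Matches lines such as:
# [16:58:49.492][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl
# create queue block timeout, gpu_id:51332 type:21. Retrying.
LINE = re.compile(
    r"^\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]\[MXKW\]\[E\].*"
    r"ioctl create queue block timeout, gpu_id:(\d+) type:(\d+)"
)


def summarize_timeouts(log_text):
    """Map gpu_id -> (number of timeouts, span in ms from first to last)."""
    counts = Counter()
    first_seen = {}
    last_seen = {}
    for line in log_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        h, mi, s, ms, gpu_id, _queue_type = m.groups()
        t_ms = ((int(h) * 60 + int(mi)) * 60 + int(s)) * 1000 + int(ms)
        counts[gpu_id] += 1
        first_seen.setdefault(gpu_id, t_ms)
        last_seen[gpu_id] = t_ms
    return {g: (counts[g], last_seen[g] - first_seen[g]) for g in counts}


sample = """\
[16:58:49.492][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:51332 type:21. Retrying.
[16:58:49.492][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:61644 type:21. Retrying.
[16:58:59.728][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:51332 type:21. Retrying.
"""
print(summarize_timeouts(sample))
```

Only two distinct gpu_id values appear in the pasted log even though four cards were requested, which may help narrow down which devices are stuck. How these gpu_id values map to the mx-smi GPU indices is not stated in the log.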
The kernel log on the bare-metal host is as follows:
root@node02:~# dmesg -T | grep -i err
[Wed Feb 11 15:12:48 2026] ACPI: Using IOAPIC for interrupt routing
[Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKA configured for IRQ 15
[Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKA disabled
[Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKB configured for IRQ 15
[Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKB disabled
[Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKC configured for IRQ 15
[Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKC disabled
[Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKD configured for IRQ 15
[Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKD disabled
[Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKE configured for IRQ 15
[Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKE disabled
[Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKF configured for IRQ 15
[Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKF disabled
[Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKG configured for IRQ 15
[Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKG disabled
[Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKH configured for IRQ 15
[Wed Feb 11 15:12:49 2026] ACPI: PCI: Interrupt link LNKH disabled
[Wed Feb 11 15:12:49 2026] pcieport 0000:36:01.0: pciehp: Slot #16 AttnBtn+ PwrCtrl+ MRL+ AttnInd+ PwrInd+ HotPlug+ Surprise- Interlock- NoCompl- IbPresDis- LLActRep+ (with Cmd Compl erratum)
[Wed Feb 11 15:12:49 2026] ERST: Error Record Serialization Table (ERST) support is initialized.
[Wed Feb 11 15:12:49 2026] RAS: Correctable Errors collector initialized.
[Wed Feb 11 15:12:50 2026] i801_smbus 0000:00:1f.4: SMBus using PCI interrupt
[Wed Feb 11 15:12:50 2026] igb 0000:37:00.0: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
[Wed Feb 11 15:12:50 2026] igb 0000:37:00.1: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
[Wed Feb 11 15:12:50 2026] igb 0000:37:00.2: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
[Wed Feb 11 15:12:50 2026] igb 0000:37:00.3: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
[Wed Feb 11 15:12:56 2026] EDAC MC0: Giving out device to module i10nm_edac controller Intel_10nm Socket#0 IMC#0: DEV 0000:7e:0c.0 (INTERRUPT)
[Wed Feb 11 15:12:56 2026] EDAC MC1: Giving out device to module i10nm_edac controller Intel_10nm Socket#0 IMC#1: DEV 0000:7e:0d.0 (INTERRUPT)
[Wed Feb 11 15:12:56 2026] EDAC MC2: Giving out device to module i10nm_edac controller Intel_10nm Socket#0 IMC#2: DEV 0000:7e:0e.0 (INTERRUPT)
[Wed Feb 11 15:12:56 2026] EDAC MC3: Giving out device to module i10nm_edac controller Intel_10nm Socket#0 IMC#3: DEV 0000:7e:0f.0 (INTERRUPT)
[Wed Feb 11 15:12:56 2026] EDAC MC4: Giving out device to module i10nm_edac controller Intel_10nm Socket#1 IMC#0: DEV 0000:fe:0c.0 (INTERRUPT)
[Wed Feb 11 15:12:56 2026] EDAC MC5: Giving out device to module i10nm_edac controller Intel_10nm Socket#1 IMC#1: DEV 0000:fe:0d.0 (INTERRUPT)
[Wed Feb 11 15:12:56 2026] EDAC MC6: Giving out device to module i10nm_edac controller Intel_10nm Socket#1 IMC#2: DEV 0000:fe:0e.0 (INTERRUPT)
[Wed Feb 11 15:12:56 2026] EDAC MC7: Giving out device to module i10nm_edac controller Intel_10nm Socket#1 IMC#3: DEV 0000:fe:0f.0 (INTERRUPT)
[Wed Feb 11 22:19:11 2026] perf: interrupt took too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
[Thu Feb 12 00:57:43 2026] perf: interrupt took too long (3143 > 3127), lowering kernel.perf_event_max_sample_rate to 63500
[Thu Feb 12 04:31:00 2026] perf: interrupt took too long (3939 > 3928), lowering kernel.perf_event_max_sample_rate to 50750
[Thu Feb 12 10:39:28 2026] perf: interrupt took too long (9058 > 4923), lowering kernel.perf_event_max_sample_rate to 22000
[Fri Feb 20 09:12:51 2026] hrtimer: interrupt took 14704 ns
[Tue Mar 3 21:17:10 2026] METAX.B1000.D0.MC.ERROR vram try to use (67272992 kB) beyond total memory (67108864 kB), failed
[Tue Mar 3 21:17:10 2026] METAX.B1000.D0.MC.ERROR failed to create bo on domain VRAM, -12
[Tue Mar 3 21:17:10 2026] METAX.B1000.D0.MC.ERROR vram try to use (67242276 kB) beyond total memory (67108864 kB), failed
[Tue Mar 3 21:17:10 2026] METAX.B1000.D0.MC.ERROR failed to create bo on domain VRAM, -12
[Tue Mar 3 21:17:10 2026] METAX.BF00.D0.MC.ERROR vram try to use (67764516 kB) beyond total memory (67108864 kB), failed
[Tue Mar 3 21:17:10 2026] METAX.BF00.D0.MC.ERROR failed to create bo on domain VRAM, -12
[Tue Mar 3 21:17:10 2026] METAX.BF00.D0.MC.ERROR vram try to use (67733796 kB) beyond total memory (67108864 kB), failed
[Tue Mar 3 21:17:10 2026] METAX.BF00.D0.MC.ERROR failed to create bo on domain VRAM, -12
[Tue Mar 3 21:20:33 2026] METAX.BF00.D0.MC.ERROR vram try to use (67132796 kB) beyond total memory (67108864 kB), failed
[Tue Mar 3 21:20:33 2026] METAX.BF00.D0.MC.ERROR failed to create bo on domain VRAM, -12
[Tue Mar 3 21:20:33 2026] METAX.BF00.D0.MC.ERROR vram try to use (67132796 kB) beyond total memory (67108864 kB), failed
[Tue Mar 3 21:20:33 2026] METAX.BF00.D0.MC.ERROR failed to create bo on domain VRAM, -12
[Tue Mar 3 22:51:21 2026] worker.io[1950624]: segfault at 560cf17d34b5 ip 00007fa175cbf13d sp 00007fa119ffa3e0 error 4 in _raylet.so[7fa174f06000+1a25000]
[Tue Mar 3 23:00:25 2026] traps: python3.10[2069691] general protection fault ip:7f6622857a69 sp:7ffe00ba4070 error:0 in _raylet.so[7f6621306000+1a25000]
[Tue Apr 14 10:59:43 2026] perf: interrupt took too long (15005 > 11322), lowering kernel.perf_event_max_sample_rate to 13250
[Wed Apr 15 13:14:37 2026] METAX.MC.ERROR -4 can not get user pages 0x55b3f6c68000 num_pages 0x1801 vm flags 0x8100073 gup flags 0x1
[Wed Apr 15 13:14:37 2026] METAX.MC.ERROR failed to get user pages, -4
[Wed Apr 15 13:14:37 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
[Wed Apr 15 13:14:37 2026] METAX.MC.ERROR -4 can not get user pages 0x55c628ea1000 num_pages 0x1801 vm flags 0x8100073 gup flags 0x1
[Wed Apr 15 13:14:37 2026] METAX.MC.ERROR failed to get user pages, -4
[Wed Apr 15 13:14:37 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
[Fri May 8 02:29:05 2026] METAX.BF00.D0.MC.ERROR vram try to use (68437304 kB) beyond total memory (67108864 kB), failed
[Fri May 8 02:29:05 2026] METAX.BF00.D0.MC.ERROR failed to create bo on domain VRAM, -12
[Fri May 8 04:24:36 2026] METAX.BF00.D0.MC.ERROR vram try to use (67449308 kB) beyond total memory (67108864 kB), failed
[Fri May 8 04:24:36 2026] METAX.BF00.D0.MC.ERROR failed to create bo on domain VRAM, -12
[Fri May 8 04:24:36 2026] METAX.BF00.D0.MC.ERROR vram try to use (67449304 kB) beyond total memory (67108864 kB), failed
[Fri May 8 04:24:36 2026] METAX.BF00.D0.MC.ERROR failed to create bo on domain VRAM, -12
[Fri May 8 10:03:36 2026] METAX.B8A00.D0.MC.ERROR vram try to use (68371796 kB) beyond total memory (67108864 kB), failed
[Fri May 8 10:03:36 2026] METAX.B8A00.D0.MC.ERROR failed to create bo on domain VRAM, -12
[Fri May 8 16:22:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7fa388fbc000 num_pages 0x1001 vm flags 0x8100073 gup flags 0x1
[Fri May 8 16:22:36 2026] METAX.MC.ERROR failed to get user pages, -4
[Fri May 8 16:22:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
[Fri May 8 16:22:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7fdcfad66000 num_pages 0x2001 vm flags 0x8100073 gup flags 0x1
[Fri May 8 16:22:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7f9d63b88000 num_pages 0x2001 vm flags 0x8100073 gup flags 0x1
[Fri May 8 16:22:36 2026] METAX.MC.ERROR failed to get user pages, -4
[Fri May 8 16:22:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
[Fri May 8 16:22:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7f6318d73000 num_pages 0x2001 vm flags 0x8100073 gup flags 0x1
[Fri May 8 16:22:36 2026] METAX.MC.ERROR failed to get user pages, -4
[Fri May 8 16:22:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
[Fri May 8 16:22:36 2026] METAX.MC.ERROR failed to get user pages, -4
[Fri May 8 16:22:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
[Fri May 8 16:22:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7f46197f5000 num_pages 0x1001 vm flags 0x8100073 gup flags 0x1
[Fri May 8 16:22:36 2026] METAX.MC.ERROR failed to get user pages, -4
[Fri May 8 16:22:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
[Fri May 8 16:22:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7f4fe57d4000 num_pages 0x1001 vm flags 0x8100073 gup flags 0x1
[Fri May 8 16:22:36 2026] METAX.MC.ERROR failed to get user pages, -4
[Fri May 8 16:22:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
[Fri May 8 16:24:16 2026] METAX.MC.ERROR -4 can not get user pages 0x7fef5e345000 num_pages 0x2000 vm flags 0x8100073 gup flags 0x1
[Fri May 8 16:24:16 2026] METAX.MC.ERROR failed to get user pages, -4
[Fri May 8 16:24:16 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
[Fri May 8 16:25:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7fcd737bb000 num_pages 0x800 vm flags 0x8100073 gup flags 0x1
[Fri May 8 16:25:36 2026] METAX.MC.ERROR failed to get user pages, -4
[Fri May 8 16:25:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
[Fri May 8 16:26:39 2026] METAX.MC.ERROR -4 can not get user pages 0x7f50c255b000 num_pages 0x2001 vm flags 0x8100073 gup flags 0x1
[Fri May 8 16:26:39 2026] METAX.MC.ERROR failed to get user pages, -4
[Fri May 8 16:26:39 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
[Fri May 8 17:13:53 2026] METAX.MC.ERROR -4 can not get user pages 0x7fb90cb3e000 num_pages 0x800 vm flags 0x8100073 gup flags 0x1
[Fri May 8 17:13:53 2026] METAX.MC.ERROR failed to get user pages, -4
[Fri May 8 17:13:53 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
[Sat May 9 11:34:42 2026] pt_nccl_heartbt[4127564]: segfault at 4c0 ip 00007f71c1e8bc1d sp 00007f6e237fd630 error 4 in libc.so.6[7f71c1e6f000+195000]
[Sat May 9 18:12:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7f980f9d0000 num_pages 0x2001 vm flags 0x8100073 gup flags 0x1
[Sat May 9 18:12:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7f847df5b000 num_pages 0x2001 vm flags 0x8100073 gup flags 0x1
[Sat May 9 18:12:36 2026] METAX.MC.ERROR failed to get user pages, -4
[Sat May 9 18:12:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7f7bf0d3d000 num_pages 0x2001 vm flags 0x8100073 gup flags 0x1
[Sat May 9 18:12:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
[Sat May 9 18:12:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7f399e75b000 num_pages 0x2001 vm flags 0x8100073 gup flags 0x1
[Sat May 9 18:12:36 2026] METAX.MC.ERROR failed to get user pages, -4
[Sat May 9 18:12:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
[Sat May 9 18:12:36 2026] METAX.MC.ERROR failed to get user pages, -4
[Sat May 9 18:12:36 2026] METAX.MC.ERROR -4 can not get user pages 0x7fd121dac000 num_pages 0x2001 vm flags 0x8100073 gup flags 0x1
[Sat May 9 18:12:36 2026] METAX.MC.ERROR failed to get user pages, -4
[Sat May 9 18:12:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
[Sat May 9 18:12:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
[Sat May 9 18:12:36 2026] METAX.MC.ERROR failed to get user pages, -4
[Sat May 9 18:12:36 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
[Sun May 10 00:30:12 2026] METAX.MC.ERROR -4 can not get user pages 0x7f3267188000 num_pages 0x1001 vm flags 0x8100073 gup flags 0x1
[Sun May 10 00:30:12 2026] METAX.MC.ERROR failed to get user pages, -4
[Sun May 10 00:30:12 2026] METAX.MC.ERROR -4 can not get user pages 0x7f299b9bc000 num_pages 0x1001 vm flags 0x8100073 gup flags 0x1
[Sun May 10 00:30:12 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
[Sun May 10 00:30:12 2026] METAX.MC.ERROR failed to get user pages, -4
[Sun May 10 00:30:12 2026] METAX.BF00.D0.MC.ERROR init_user_pages failed, -4
[Mon May 18 22:55:38 2026] python3.7[3982761]: segfault at 0 ip 0000000000000000 sp 00007ffdb6f2ccc8 error 14 in python3.7[55708cfdd000+58000]
root@node02:~#
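Because `grep -i err` also matches words such as "interrupt" and "erratum", much of the output above is unrelated boot noise. The sketch below is one possible way to keep only the MetaX driver errors and process crashes, group them by kind, and compute how far the largest VRAM request exceeded the reported total. The category names and the functions `classify` and `summarize` are hypothetical, and the patterns assume the message wording shown in this post.

```python
import re
from collections import Counter

STAMP = re.compile(r"^\[[^\]]*\]\s*")  # leading "[Tue Mar 3 21:17:10 2026] "
VRAM = re.compile(
    r"vram try to use \((\d+) kB\) beyond total memory \((\d+) kB\)"
)


def classify(line):
    """Return a category name, or None for lines that are only grep noise."""
    msg = STAMP.sub("", line, count=1)
    if "METAX" in msg and "ERROR" in msg:
        if VRAM.search(msg):
            return "vram_over_limit"
        if "create bo" in msg:
            return "bo_alloc_failed"
        if "user pages" in msg or "init_user_pages" in msg:
            return "user_pages_failed"
        return "metax_other"
    if "segfault" in msg or "general protection fault" in msg:
        return "process_crash"
    return None  # e.g. "interrupt" lines that merely contain "err"


def summarize(dmesg_text):
    """Count lines per category and track the largest VRAM overshoot in kB."""
    kinds = Counter()
    max_over_kb = 0
    for line in dmesg_text.splitlines():
        kind = classify(line)
        if kind is None:
            continue
        kinds[kind] += 1
        m = VRAM.search(line)
        if m:
            max_over_kb = max(max_over_kb, int(m.group(1)) - int(m.group(2)))
    return kinds, max_over_kb


sample = """\
[Wed Feb 11 15:12:48 2026] ACPI: Using IOAPIC for interrupt routing
[Tue Mar 3 21:17:10 2026] METAX.B1000.D0.MC.ERROR vram try to use (67272992 kB) beyond total memory (67108864 kB), failed
[Tue Mar 3 21:17:10 2026] METAX.B1000.D0.MC.ERROR failed to create bo on domain VRAM, -12
[Wed Apr 15 13:14:37 2026] METAX.MC.ERROR failed to get user pages, -4
[Sat May 9 11:34:42 2026] pt_nccl_heartbt[4127564]: segfault at 4c0 ip 00007f71c1e8bc1d sp 00007f6e237fd630 error 4 in libc.so.6[7f71c1e6f000+195000]
"""
kinds, max_over_kb = summarize(sample)
print(dict(kinds), max_over_kb)
```

The trailing -12 and -4 look like negated Linux errno values (ENOMEM and EINTR), though that reading is an assumption about the driver's conventions. Note also that the last entry in the pasted output is dated 18 May, while the failed launch was logged on 26 May, so this filter may not have captured anything from the time of the failure.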