MCCL four-node test log
Test time: 2026-04-16 20:10
Environment: 4 nodes x 8 GPUs (MetaX C500), MACA 3.5.3, image vllm-metax:0.14.0-maca.ai3.5.3.102-torch2.8-py310-ubuntu22.04-amd64_glm_w4a8_full
Test command:
export MACA_PATH=/opt/maca
export LD_LIBRARY_PATH=${MACA_PATH}/lib:${MACA_PATH}/ompi/lib:${MACA_PATH}/ucx/lib
${MACA_PATH}/ompi/bin/mpirun -np 32 --allow-run-as-root \
-mca btl_tcp_if_include 10.66.3.0/24 -mca oob_tcp_if_include 10.66.3.0/24 \
-mca pml ^ucx -mca osc ^ucx -mca btl ^openib \
-host 10.66.3.10:8,10.66.3.11:8,10.66.3.12:8,10.66.3.13:8 \
-x MACA_PATH -x LD_LIBRARY_PATH -x MCCL_CROSS_NIC=1 -x FORCE_ACTIVE_WAIT=2 \
<test_binary> -b 1K -e 1G -d float -f 2 -g 1 -n 10
====================================================================
Test 1: all_reduce_perf (32 GPUs across 4 nodes)
====================================================================
nThread 1 nGpus 1 minBytes 1024 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 10 agg iters: 1 validation: 1 graph: 0
Using devices
Rank 0 Pid 207588 on suanfeng-mxc500-0001 device 0 [0x08] MetaX C500
Rank 1 Pid 207589 on suanfeng-mxc500-0001 device 1 [0x09] MetaX C500
Rank 2 Pid 207590 on suanfeng-mxc500-0001 device 2 [0x0e] MetaX C500
Rank 3 Pid 207591 on suanfeng-mxc500-0001 device 3 [0x11] MetaX C500
Rank 4 Pid 207592 on suanfeng-mxc500-0001 device 4 [0x32] MetaX C500
Rank 5 Pid 207593 on suanfeng-mxc500-0001 device 5 [0x38] MetaX C500
Rank 6 Pid 207594 on suanfeng-mxc500-0001 device 6 [0x3b] MetaX C500
Rank 7 Pid 207595 on suanfeng-mxc500-0001 device 7 [0x3c] MetaX C500
Rank 8 Pid 121115 on suanfeng-mxc500-0002 device 0 [0x08] MetaX C500
Rank 9 Pid 121116 on suanfeng-mxc500-0002 device 1 [0x09] MetaX C500
Rank 10 Pid 121117 on suanfeng-mxc500-0002 device 2 [0x0e] MetaX C500
Rank 11 Pid 121118 on suanfeng-mxc500-0002 device 3 [0x11] MetaX C500
Rank 12 Pid 121119 on suanfeng-mxc500-0002 device 4 [0x32] MetaX C500
Rank 13 Pid 121120 on suanfeng-mxc500-0002 device 5 [0x38] MetaX C500
Rank 14 Pid 121121 on suanfeng-mxc500-0002 device 6 [0x3b] MetaX C500
Rank 15 Pid 121122 on suanfeng-mxc500-0002 device 7 [0x3c] MetaX C500
Rank 16 Pid 30847 on suanfeng-mxc500-0003 device 0 [0x08] MetaX C500
Rank 17 Pid 30848 on suanfeng-mxc500-0003 device 1 [0x09] MetaX C500
Rank 18 Pid 30849 on suanfeng-mxc500-0003 device 2 [0x0e] MetaX C500
Rank 19 Pid 30850 on suanfeng-mxc500-0003 device 3 [0x11] MetaX C500
Rank 20 Pid 30851 on suanfeng-mxc500-0003 device 4 [0x32] MetaX C500
Rank 21 Pid 30852 on suanfeng-mxc500-0003 device 5 [0x38] MetaX C500
Rank 22 Pid 30853 on suanfeng-mxc500-0003 device 6 [0x3b] MetaX C500
Rank 23 Pid 30854 on suanfeng-mxc500-0003 device 7 [0x3c] MetaX C500
Rank 24 Pid 30001 on suanfeng-mxc500-0004 device 0 [0x08] MetaX C500
Rank 25 Pid 30002 on suanfeng-mxc500-0004 device 1 [0x09] MetaX C500
Rank 26 Pid 30003 on suanfeng-mxc500-0004 device 2 [0x0e] MetaX C500
Rank 27 Pid 30004 on suanfeng-mxc500-0004 device 3 [0x11] MetaX C500
Rank 28 Pid 30005 on suanfeng-mxc500-0004 device 4 [0x32] MetaX C500
Rank 29 Pid 30006 on suanfeng-mxc500-0004 device 5 [0x38] MetaX C500
Rank 30 Pid 30007 on suanfeng-mxc500-0004 device 6 [0x3b] MetaX C500
Rank 31 Pid 30008 on suanfeng-mxc500-0004 device 7 [0x3c] MetaX C500
                                                      out-of-place                       in-place
        size       count   type redop root      time  algbw  busbw #wrong      time  algbw  busbw #wrong
         (B)  (elements)                        (us) (GB/s) (GB/s)              (us) (GB/s) (GB/s)
        1024         256  float   sum   -1     65.95   0.02   0.03      0     70.78   0.01   0.03      0
        2048         512  float   sum   -1     68.10   0.03   0.06      0     73.63   0.03   0.05      0
        4096        1024  float   sum   -1     71.80   0.06   0.11      0     70.05   0.06   0.11      0
        8192        2048  float   sum   -1     75.38   0.11   0.21      0     78.53   0.10   0.20      0
       16384        4096  float   sum   -1     81.17   0.20   0.39      0     78.59   0.21   0.40      0
       32768        8192  float   sum   -1     90.55   0.36   0.70      0     88.60   0.37   0.72      0
       65536       16384  float   sum   -1    160.38   0.41   0.79      0    170.30   0.38   0.75      0
      131072       32768  float   sum   -1    184.35   0.71   1.38      0    179.75   0.73   1.41      0
      262144       65536  float   sum   -1    196.76   1.33   2.58      0    200.38   1.31   2.53      0
      524288      131072  float   sum   -1    222.27   2.36   4.57      0    219.08   2.39   4.64      0
     1048576      262144  float   sum   -1    286.65   3.66   7.09      0    268.51   3.91   7.57      0
     2097152      524288  float   sum   -1    306.80   6.84  13.24      0    307.21   6.83  13.23      0
     4194304     1048576  float   sum   -1    418.79  10.02  19.40      0    393.13  10.67  20.67      0
     8388608     2097152  float   sum   -1    599.86  13.98  27.09      0    598.73  14.01  27.15      0
    16777216     4194304  float   sum   -1   1018.58  16.47  31.91      0   1010.60  16.60  32.16      0
    33554432     8388608  float   sum   -1   1597.54  21.00  40.69      0   1595.35  21.03  40.75      0
    67108864    16777216  float   sum   -1   2951.84  22.73  44.05      0   2979.91  22.52  43.63      0
   134217728    33554432  float   sum   -1   6046.98  22.20  43.00      0   6052.52  22.18  42.97      0
   268435456    67108864  float   sum   -1  10429.17  25.74  49.87      0  10402.60  25.80  50.00      0
   536870912   134217728  float   sum   -1  16236.85  33.06  64.06      0  15635.12  34.34  66.53      0
  1073741824   268435456  float   sum   -1  29299.23  36.65  71.00      0  29201.44  36.77  71.24      0
Out of bounds values : 0 OK
Avg bus bandwidth : 20.214
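The algbw and busbw columns can be cross-checked from the size and time columns. The sketch below assumes the test binary follows the nccl-tests conventions (algbw = bytes / time, and for all_reduce busbw = algbw x 2(n-1)/n with n ranks); that is an assumption about the MCCL test suite, not something this log states, and the function names are illustrative only.

```python
def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s, with 1 GB = 1e9 bytes."""
    # bytes per microsecond equals MB/s * 1e-6, so divide by 1e3 to get GB/s.
    return size_bytes / time_us / 1e3


def allreduce_busbw_gbps(size_bytes: int, time_us: float, n_ranks: int) -> float:
    """Bus bandwidth for all_reduce under the assumed 2(n-1)/n correction factor."""
    return algbw_gbps(size_bytes, time_us) * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # 1 GiB out-of-place row of the all_reduce table: 29299.23 us on 32 ranks.
    print(round(algbw_gbps(1073741824, 29299.23), 2))
    print(round(allreduce_busbw_gbps(1073741824, 29299.23, 32), 2))
```

With the 1 GiB out-of-place row as input, both values agree with the 36.65 and 71.00 GB/s reported in the table, which supports the assumed formula for this run.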
====================================================================
Test 2: sendrecv_perf (32 GPUs across 4 nodes)
====================================================================
nThread 1 nGpus 1 minBytes 1024 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 10 agg iters: 1 validation: 1 graph: 0
Using devices
Rank 0 Pid 207778 on suanfeng-mxc500-0001 device 0 [0x08] MetaX C500
Rank 1 Pid 207779 on suanfeng-mxc500-0001 device 1 [0x09] MetaX C500
Rank 2 Pid 207780 on suanfeng-mxc500-0001 device 2 [0x0e] MetaX C500
Rank 3 Pid 207781 on suanfeng-mxc500-0001 device 3 [0x11] MetaX C500
Rank 4 Pid 207782 on suanfeng-mxc500-0001 device 4 [0x32] MetaX C500
Rank 5 Pid 207783 on suanfeng-mxc500-0001 device 5 [0x38] MetaX C500
Rank 6 Pid 207784 on suanfeng-mxc500-0001 device 6 [0x3b] MetaX C500
Rank 7 Pid 207785 on suanfeng-mxc500-0001 device 7 [0x3c] MetaX C500
Rank 8 Pid 121305 on suanfeng-mxc500-0002 device 0 [0x08] MetaX C500
Rank 9 Pid 121306 on suanfeng-mxc500-0002 device 1 [0x09] MetaX C500
Rank 10 Pid 121307 on suanfeng-mxc500-0002 device 2 [0x0e] MetaX C500
Rank 11 Pid 121308 on suanfeng-mxc500-0002 device 3 [0x11] MetaX C500
Rank 12 Pid 121309 on suanfeng-mxc500-0002 device 4 [0x32] MetaX C500
Rank 16 Pid 31037 on suanfeng-mxc500-0003 device 0 [0x08] MetaX C500
Rank 17 Pid 31038 on suanfeng-mxc500-0003 device 1 [0x09] MetaX C500
Rank 18 Pid 31039 on suanfeng-mxc500-0003 device 2 [0x0e] MetaX C500
Rank 19 Pid 31040 on suanfeng-mxc500-0003 device 3 [0x11] MetaX C500
Rank 20 Pid 31041 on suanfeng-mxc500-0003 device 4 [0x32] MetaX C500
Rank 21 Pid 31042 on suanfeng-mxc500-0003 device 5 [0x38] MetaX C500
Rank 22 Pid 31043 on suanfeng-mxc500-0003 device 6 [0x3b] MetaX C500
Rank 23 Pid 31044 on suanfeng-mxc500-0003 device 7 [0x3c] MetaX C500
Rank 24 Pid 30191 on suanfeng-mxc500-0004 device 0 [0x08] MetaX C500
Rank 25 Pid 30192 on suanfeng-mxc500-0004 device 1 [0x09] MetaX C500
Rank 26 Pid 30193 on suanfeng-mxc500-0004 device 2 [0x0e] MetaX C500
Rank 27 Pid 30194 on suanfeng-mxc500-0004 device 3 [0x11] MetaX C500
Rank 28 Pid 30195 on suanfeng-mxc500-0004 device 4 [0x32] MetaX C500
Rank 29 Pid 30196 on suanfeng-mxc500-0004 device 5 [0x38] MetaX C500
Rank 30 Pid 30197 on suanfeng-mxc500-0004 device 6 [0x3b] MetaX C500
Rank 31 Pid 30198 on suanfeng-mxc500-0004 device 7 [0x3c] MetaX C500
                                                      out-of-place                       in-place
        size       count   type redop root      time  algbw  busbw #wrong      time  algbw  busbw #wrong
         (B)  (elements)                        (us) (GB/s) (GB/s)              (us) (GB/s) (GB/s)
        1024         256  float   sum   -1     22.69   0.05   0.05      0     23.21   0.04   0.04    N/A
        2048         512  float   sum   -1     22.53   0.09   0.09      0     21.32   0.10   0.10    N/A
        4096        1024  float   sum   -1     22.56   0.18   0.18      0     23.10   0.18   0.18    N/A
        8192        2048  float   sum   -1     27.75   0.30   0.30      0     23.47   0.35   0.35    N/A
       16384        4096  float   sum   -1     27.30   0.60   0.60      0     31.13   0.53   0.53    N/A
       32768        8192  float   sum   -1     27.32   1.20   1.20      0     29.38   1.12   1.12    N/A
       65536       16384  float   sum   -1     33.79   1.94   1.94      0     35.12   1.87   1.87    N/A
      131072       32768  float   sum   -1     41.38   3.17   3.17      0     39.67   3.30   3.30    N/A
      262144       65536  float   sum   -1     41.94   6.25   6.25      0     42.77   6.13   6.13    N/A
      524288      131072  float   sum   -1     48.69  10.77  10.77      0     49.06  10.69  10.69    N/A
     1048576      262144  float   sum   -1     75.20  13.94  13.94      0     72.41  14.48  14.48    N/A
     2097152      524288  float   sum   -1    121.01  17.33  17.33      0    121.96  17.20  17.20    N/A
     4194304     1048576  float   sum   -1    215.95  19.42  19.42      0    213.89  19.61  19.61    N/A
     8388608     2097152  float   sum   -1    407.83  20.57  20.57      0    406.84  20.62  20.62    N/A
    16777216     4194304  float   sum   -1    790.86  21.21  21.21      0    789.07  21.26  21.26    N/A
    33554432     8388608  float   sum   -1   1558.21  21.53  21.53      0   1556.81  21.55  21.55    N/A
    67108864    16777216  float   sum   -1   3092.90  21.70  21.70      0   3088.81  21.73  21.73    N/A
   134217728    33554432  float   sum   -1   6170.58  21.75  21.75      0   6166.62  21.77  21.77    N/A
   268435456    67108864  float   sum   -1  12331.14  21.77  21.77      0  12331.46  21.77  21.77    N/A
   536870912   134217728  float   sum   -1  24670.56  21.76  21.76      0  24650.90  21.78  21.78    N/A
  1073741824   268435456  float   sum   -1  49320.18  21.77  21.77      0  49299.55  21.78  21.78    N/A
Out of bounds values : 0 OK
Avg bus bandwidth : 11.789
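The "Avg bus bandwidth" figure looks like the plain mean of every busbw sample in the table, out-of-place and in-place together. That is an assumption based on how nccl-tests reports its average; the sketch below checks it against the sendrecv busbw values copied from the table above. The helper names are illustrative.

```python
from statistics import mean

# busbw column (GB/s) of the sendrecv table, smallest message size first.
OUT_OF_PLACE = [0.05, 0.09, 0.18, 0.30, 0.60, 1.20, 1.94, 3.17, 6.25, 10.77,
                13.94, 17.33, 19.42, 20.57, 21.21, 21.53, 21.70, 21.75, 21.77,
                21.76, 21.77]
IN_PLACE = [0.04, 0.10, 0.18, 0.35, 0.53, 1.12, 1.87, 3.30, 6.13, 10.69,
            14.48, 17.20, 19.61, 20.62, 21.26, 21.55, 21.73, 21.77, 21.77,
            21.78, 21.78]


def avg_busbw(out_of_place, in_place):
    """Mean over every busbw sample, out-of-place and in-place together."""
    return mean(list(out_of_place) + list(in_place))


def peak_busbw(out_of_place, in_place):
    """Largest busbw sample across both columns."""
    return max(max(out_of_place), max(in_place))


if __name__ == "__main__":
    print(f"avg  busbw: {avg_busbw(OUT_OF_PLACE, IN_PLACE):.3f} GB/s")
    print(f"peak busbw: {peak_busbw(OUT_OF_PLACE, IN_PLACE):.2f} GB/s")
```

Because the table values are rounded to two decimals, the recomputed mean only matches the reported 11.789 to within rounding. The average is pulled down by the latency-bound small messages, so the peak is the more useful figure for link throughput.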
====================================================================
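Conclusion: all MCCL four-node communication tests passed with no data errors (#wrong is 0 on every checked row, out-of-bounds values: 0).
all_reduce peak bus bandwidth: 71.24 GB/s
sendrecv peak bus bandwidth: 21.78 GB/s
However, vLLM inference with PP>2 still hangs (reproduced with both PP=3 and PP=4).
Inference with PP=2 works normally. The problem therefore appears to lie not in MCCL's low-level communication but in the interaction between vLLM's PP pipeline scheduling and MCCL.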
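Since the microbenchmarks pass, one possible next step is to find out where each pipeline stage is blocked when the PP>2 run hangs. The sketch below uses Python's standard faulthandler module to dump all thread stacks if a call does not return in time. `run_with_hang_dump` and its parameters are hypothetical names, and wrapping a vLLM worker step with it is an assumption about how one might instrument the run, not a vLLM feature.

```python
import faulthandler
import sys
import time


def run_with_hang_dump(step, timeout_s=60.0, out=None):
    """Run step(); while it is still running, dump all thread stacks every timeout_s.

    `out` must be a file with a real file descriptor; defaults to sys.stderr.
    """
    target = sys.stderr if out is None else out
    # Arm a watchdog thread that writes tracebacks of every thread to `target`.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        return step()
    finally:
        # Disarm the watchdog once the step returns or raises.
        faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    # Stand-in for a blocking call; a real run would wrap the suspect step instead.
    run_with_hang_dump(lambda: time.sleep(0.5), timeout_s=0.2)
```

Comparing the dumped stacks across ranks would show whether the stages are waiting in a send, a receive, or somewhere in the scheduler, which helps separate a scheduling-order problem from a transport problem.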