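Server: H3C
GPU: MX C550
CPU: Hygon
RDMA NICs: 8 * mlx
Problem description:
No matter how many HCA high-performance NICs are added, multi-node allreduce (and similar tests) only ever saturates the bandwidth of a single NIC; that is, all traffic appears to go through one NIC. Setting MCCL_CROSS_NIC to 0 and to 1 made no difference.
MCCL_IB_HCA=mlx5_0,mlx5_1,.....
The RDMA network is healthy and NIC-to-NIC communication works. On every machine, NIC 1 belongs to subnet A and NIC 2 belongs to subnet B, in one-to-one correspondence. Many parameters have been adjusted, but the allreduce test still cannot be made to use multiple NICs.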
Dear developer, please provide the ib_write test results for each NIC.
The test results are similar for all NICs.
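For reference, a per-NIC check of this kind could be scripted roughly as follows. This is only a sketch, assuming the perftest tool `ib_write_bw` is installed on both machines; the helper name `ib_write_cmds` and the server address `172.16.1.10` are hypothetical, and the function only prints the client-side commands rather than running them.

```shell
#!/bin/sh
# Print one ib_write_bw client command per HCA in a comma-separated list.
# $1 = HCA list, $2 = GID index, $3 = server address (hypothetical values below).
ib_write_cmds() (
    hca_list=$1; gid_index=$2; server=$3
    for hca in $(printf '%s' "${hca_list}" | tr ',' ' '); do
        # -d selects the RDMA device, -x the GID index; --report_gbits reports Gb/s.
        echo "ib_write_bw -d ${hca} -x ${gid_index} --report_gbits ${server}"
    done
)

# The peer must already be running the matching server side
# (ib_write_bw -d <dev> -x <gid>) for each device.
ib_write_cmds "mlx5_1,mlx5_2" 3 172.16.1.10
```

Running the printed commands one device at a time gives a per-NIC baseline to compare against the allreduce bus bandwidth.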
Dear developer, please provide the allreduce test script and the test results.
Results:
MCCL INFO Connected all trees
1024 256 float sum -1 47.37 0.02 0.04 0 44.43 0.02 0.04 0
2048 512 float sum -1 43.77 0.05 0.09 0 43.64 0.05 0.09 0
4096 1024 float sum -1 46.11 0.09 0.17 0 47.15 0.09 0.16 0
8192 2048 float sum -1 47.47 0.17 0.32 0 45.45 0.18 0.34 0
16384 4096 float sum -1 48.26 0.34 0.64 0 54.89 0.30 0.56 0
32768 8192 float sum -1 51.41 0.64 1.20 0 56.86 0.58 1.08 0
65536 16384 float sum -1 131.29 0.50 0.94 0 129.71 0.51 0.95 0
131072 32768 float sum -1 132.38 0.99 1.86 0 133.09 0.98 1.85 0
262144 65536 float sum -1 136.61 1.92 3.60 0 138.39 1.89 3.55 0
524288 131072 float sum -1 151.87 3.45 6.47 0 160.54 3.27 6.12 0
1048576 262144 float sum -1 190.91 5.49 10.30 0 175.86 5.96 11.18 0
2097152 524288 float sum -1 387.85 5.41 10.14 0 216.03 9.71 18.20 0
4194304 1048576 float sum -1 291.62 14.38 26.97 0 286.74 14.63 27.43 0
8388608 2097152 float sum -1 426.88 19.65 36.85 0 430.51 19.49 36.54 0
16777216 4194304 float sum -1 670.59 25.02 46.91 0 753.13 22.28 41.77 0
33554432 8388608 float sum -1 1503.81 22.31 41.84 0 1541.25 21.77 40.82 0
67108864 16777216 float sum -1 2721.40 24.66 46.24 0 2719.98 24.67 46.26 0
134217728 33554432 float sum -1 5417.88 24.77 46.45 0 5400.53 24.85 46.60 0
268435456 67108864 float sum -1 10798.70 24.86 46.61 0 10757.76 24.95 46.79 0
536870912 134217728 float sum -1 21586.39 24.87 46.63 0 21569.71 24.89 46.67 0
1073741824 268435456 float sum -1 43066.86 24.93 46.75 0 43017.88 24.96 46.80 0
MCCL INFO comm 0x7f279eec4010 rank 14 nranks 16 cudaDev 6 busId e3000 - Destroy COMPLETE
458098:458098 [6] MCCL INFO comm 0x7f12d7f12010 rank 6 nranks 16 cudaDev 6 busId e3000 - Destroy COMPLETE
:458092 [0] MCCL INFO comm 0x7f7a5329c010 rank 0 nranks 16 cudaDev 0 busId 23000 - Destroy COMPLETE
[4] MCCL INFO comm 0x7fb504712010 rank 12 nranks 16 cudaDev 4 busId a3000 - Destroy COMPLETE
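As a sanity check on the table above, the two bandwidth columns look consistent with the usual nccl-tests convention for allreduce, busbw = algbw * 2(n-1)/n. That mccl_tests follows the same convention is an assumption, but the numbers fit it for the 16 ranks shown in the log. A small sketch (the helper name `allreduce_busbw` is hypothetical):

```shell
#!/bin/sh
# Compute allreduce bus bandwidth from algorithm bandwidth.
# $1 = algbw in GB/s, $2 = number of ranks.
# Assumes the nccl-tests convention: busbw = algbw * 2 * (n - 1) / n.
allreduce_busbw() {
    awk -v algbw="$1" -v n="$2" 'BEGIN { printf "%.2f\n", algbw * 2 * (n - 1) / n }'
}

# Last row of the table: algbw 24.93 GB/s with 16 ranks.
allreduce_busbw 24.93 16
```

This prints 46.74, which agrees with the reported 46.75 up to the rounding of the algbw column. The bus bandwidth plateaus at about 46.7 GB/s from 64 MiB upward.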
Script:
set -euo pipefail
MACA_PATH=/opt/maca
HOST_IP=${HOST_IP:-"<machine-info>"}
IP_MASK=${IP_MASK:-"172.16.1.0/24"}
GPU_NUM=${GPU_NUM:-64}
IB_PORT="mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7"
GID_INDEX=${GID_INDEX:-3}
TEST_DIR=/opt/maca/samples/mccl_tests/perf/mccl_perf
BENCH_NAMES=${BENCH_NAMES:-"all_reduce_perf"}
PERF_ENV="-x FORCE_ACTIVE_WAIT=2"
LIB_PATH_ENV="-x LD_LIBRARY_PATH=${MACA_PATH}/lib:${MACA_PATH}/ompi/lib"
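# Adding more mlx NICs to MCCL_SOCKET_IFNAME did not help either.
# MCCL_CROSS_NIC was tried with both 0 and 1, with no effect.
# (Comments must stay outside the quoted string; inside it they would be
# passed to mpirun as arguments.)
ENV_VAR="\
-x OMPI_ALLOW_RUN_AS_ROOT=1 \
-x OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 \
-x MCCL_IB_HCA=${IB_PORT} \
-x MCCL_IB_GID_INDEX=${GID_INDEX} \
-x MCCL_SOCKET_IFNAME=eth10 \
-x MCCL_CROSS_NIC=0 \
${PERF_ENV} \
${LIB_PATH_ENV}"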
MPI_PROCESS_NUM=${GPU_NUM}
MPI_RUN_OPT="\
-mca btl_tcp_if_include ${IP_MASK} \
-mca oob_tcp_if_include ${IP_MASK} \
-mca pml ^ucx \
-mca osc ^ucx \
-mca btl ^openib"
for BENCH in ${BENCH_NAMES}; do
    echo "The test is ${BENCH}, the maca version is $(realpath ${MACA_PATH})"
    echo "HOST_IP=${HOST_IP}"
    echo "IP_MASK=${IP_MASK}"
    echo "GPU_NUM=${GPU_NUM}"
    echo "IB_PORT=${IB_PORT}"
    echo "GID_INDEX=${GID_INDEX}"
    ${MACA_PATH}/ompi/bin/mpirun \
        --allow-run-as-root \
        -np ${MPI_PROCESS_NUM} \
        ${MPI_RUN_OPT} \
        -host ${HOST_IP} \
        ${ENV_VAR} \
        ${TEST_DIR}/${BENCH} \
        -b 1K -e 1G -d float -f 2 -g 1 -n 10
done
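Before launching a run like the one above, it may help to confirm that every device named in the HCA list exists and has an active port on each node. The following is a minimal sketch, assuming the standard Linux sysfs layout under `/sys/class/infiniband` and port number 1; the helper name `check_hcas` is hypothetical.

```shell
#!/bin/sh
# Report the sysfs port state of every HCA in a comma-separated list.
# $1 = HCA list, $2 = sysfs root (default /sys/class/infiniband), $3 = port (default 1).
# Returns non-zero if any device is missing or its port is not ACTIVE.
check_hcas() (
    hca_list=$1
    root=${2:-/sys/class/infiniband}
    port=${3:-1}
    rc=0
    for hca in $(printf '%s' "${hca_list}" | tr ',' ' '); do
        state_file="${root}/${hca}/ports/${port}/state"
        if [ ! -r "${state_file}" ]; then
            echo "${hca}: MISSING"
            rc=1
        elif grep -q 'ACTIVE' "${state_file}"; then
            echo "${hca}: ACTIVE"
        else
            # The state file normally reads like "1: DOWN" or "4: ACTIVE".
            echo "${hca}: $(cat "${state_file}")"
            rc=1
        fi
    done
    exit "${rc}"
)

# Example with the list used in the script; run it on every node.
check_hcas "mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7" || echo "some HCAs are not usable"
```

Note that the list in the script starts at mlx5_1, while the problem description mentions mlx5_0 as well.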
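Dear developer, how did you determine that multi-node allreduce only saturates one NIC, i.e. that all traffic goes through a single NIC?
The bandwidth with multiple NICs is too low; the result is the same even when only one NIC is used.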
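A more direct way to see which NICs actually carry traffic is to compare per-port transmit counters before and after a run. The sketch below assumes the devices expose the standard `port_xmit_data` counter in sysfs (reported in units of 4 bytes) on port 1; the helper names `hca_tx_bytes` and `tx_delta` are hypothetical.

```shell
#!/bin/sh
# Read the transmit counter of one HCA port and convert it to bytes.
# port_xmit_data counts 4-byte words, hence the multiplication.
# $1 = HCA name, $2 = sysfs root (default /sys/class/infiniband), $3 = port (default 1).
hca_tx_bytes() (
    hca=$1
    root=${2:-/sys/class/infiniband}
    port=${3:-1}
    words=$(cat "${root}/${hca}/ports/${port}/counters/port_xmit_data")
    echo $(( words * 4 ))
)

# Print the bytes sent by each HCA while a command runs.
# $1 = comma-separated HCA list, $2 = sysfs root, remaining args = command to run.
tx_delta() (
    hca_list=$1; root=$2; shift 2
    hcas=$(printf '%s' "${hca_list}" | tr ',' ' ')
    before=""
    for hca in ${hcas}; do
        before="${before} $(hca_tx_bytes "${hca}" "${root}")"
    done
    # Send the command's output to stderr so stdout holds only the report.
    "$@" >&2 || true
    # Reuse the positional parameters to walk the saved "before" values.
    set -- ${before}
    for hca in ${hcas}; do
        after=$(hca_tx_bytes "${hca}" "${root}")
        echo "${hca} $(( after - $1 ))"
        shift
    done
)

# Example (run on one node while the benchmark executes):
#   tx_delta "mlx5_1,mlx5_2" /sys/class/infiniband ./run_allreduce.sh
```

If only one device shows a large delta, the traffic is indeed concentrated on a single NIC; roughly equal deltas would point to a different bottleneck.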
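Dear developer, please obtain the cluster deployment guide through your business contact and configure the network according to its instructions.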