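In the NCCL tests, a local read is used by default as the flush (ncclIbIflush) to make sure GDR writes have completed.
When local read is not supported, NCCL_GDRCOPY_ENABLE and NCCL_GDRCOPY_FLUSH_ENABLE have to be enabled, which depends on the gdrcopy feature. Does the C500 support it?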
As the screenshot shows, after running export MACA_VISIBLE_DEVICES=2,3,6,7 and then mccl.sh, the devices reported by mx-smi do not match what was expected.
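One way to narrow this down is to have every launched rank print what it actually sees in its own environment. The sketch below is illustrative only: it assumes Python 3 is available on each node, the helper names `parse_visible_devices` and `report` are made up for this example, and it only reads environment variables (it does not query the MetaX runtime or mx-smi).

```python
import os
import socket


def parse_visible_devices(value):
    """Split a comma-separated device list such as "2,3,6,7" into integers."""
    if value is None or value.strip() == "":
        return None  # variable unset or empty: no restriction was requested
    return [int(item) for item in value.split(",") if item.strip()]


def report(env=os.environ):
    """Describe what this process sees for both variable names."""
    host = socket.gethostname()
    lines = []
    for name in ("MACA_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES"):
        devices = parse_visible_devices(env.get(name))
        lines.append(f"{host} {name}={devices}")
    return lines


if __name__ == "__main__":
    # Run this under the same launcher as the test so that each rank
    # prints the environment it really inherited.
    print("\n".join(report()))
```

If the job is started through Open MPI's mpirun, variables exported in the local shell are not necessarily forwarded to ranks on other nodes; `mpirun -x MACA_VISIBLE_DEVICES ...` is the usual way to forward one. Whether mccl.sh already does this is not known here.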
I. Hardware and software information
1. Server vendor: H3C UniServer R5300 G6
2. MetaX GPU model: C500
3. Operating system and kernel version:
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="www.ubuntu.com/"
SUPPORT_URL="help.ubuntu.com/"
BUG_REPORT_URL="bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
distribution_version=v0.3.205
firmware_version=v0.3.132
driver_version=v0.3.165
4. CPU virtualization enabled:
5. mx-smi output:
mx-smi version: 2.3.1
=================== MetaX System Management Interface Log ===================
Timestamp : Wed May 13 10:06:54 2026
Attached GPUs : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.3.1 Kernel Mode Driver Version: 3.8.23 |
| MACA Version: 3.7.0.38 BIOS Version: 1.33.4.0 |
|------------------+-----------------+---------------------+----------------------|
| Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
| Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
|==================+=================+=====================+======================|
| 0 MetaX C500 | 0 Off | 0000:08:00.0 | 0% Disabled |
| 36W / 350W | 36C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 1 MetaX C500 | 1 Off | 0000:09:00.0 | 0% Disabled |
| 39W / 350W | 38C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 2 MetaX C500 | 2 Off | 0000:0e:00.0 | 0% Disabled |
| 44W / 350W | 38C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 3 MetaX C500 | 3 Off | 0000:11:00.0 | 0% Disabled |
| 42W / 350W | 38C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 4 MetaX C500 | 4 Off | 0000:32:00.0 | 0% Disabled |
| 38W / 350W | 37C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 5 MetaX C500 | 5 Off | 0000:38:00.0 | 0% Disabled |
| 38W / 350W | 37C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 6 MetaX C500 | 6 Off | 0000:3b:00.0 | 0% Disabled |
| 41W / 350W | 39C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 7 MetaX C500 | 7 Off | 0000:3c:00.0 | 0% Disabled |
| 41W / 350W | 38C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| no process found |
+---------------------------------------------------------------------------------+
End of Log
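II. Symptoms
Running bash cluster.sh "10.5.1.44:1,10.5.1.45:1" 2 reduce_perf produces the output below. The script is also attached.
Please confirm: is it really CUDA_VISIBLE_DEVICES rather than MACA_VISIBLE_DEVICES? Is there any difference between the two variables?
Running bash mccl.sh 2 repeatedly shows no errors. Running bash mccl.sh 4 repeatedly intermittently produces: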
[muxi-44:20247] *** Process received signal ***
[muxi-44:20247] Signal: Segmentation fault (11)
[muxi-44:20247] Signal code: Address not mapped (1)
[muxi-44:20247] Failing at address: 0x2d20b170
[muxi-44:20247] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f3a2cbed520]
[muxi-44:20247] [ 1] /opt/maca/lib/libmccompiler.so(+0x94330)[0x7f3a2d494330]
[muxi-44:20247] [ 2] /opt/maca/lib/libmccompiler.so(+0x9caca)[0x7f3a2d49caca]
[muxi-44:20247] [ 3] /opt/maca/lib/libmccompiler.so(+0x82b7a)[0x7f3a2d482b7a]
[muxi-44:20247] [ 4] /opt/maca/lib/libmccompiler.so(+0x873b3)[0x7f3a2d4873b3]
[muxi-44:20247] [ 5] /opt/maca/lib/libmccompiler.so(+0x7a05a)[0x7f3a2d47a05a]
[muxi-44:20247] [ 6] /opt/maca/lib/libmccompiler.so(+0x6a9a7)[0x7f3a2d46a9a7]
[muxi-44:20247] [ 7] /opt/maca/lib/libmccl.so(+0x2a2f92)[0x7f3a2eca2f92]
[muxi-44:20247] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xb6)[0x7f3a2cbf0a56]
[muxi-44:20247] [ 9] /opt/maca/lib/libmccl.so(+0x4d483)[0x7f3a2ea4d483]
[muxi-44:20247] *** End of error message ***
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 0 with PID 0 on node muxi-44 exited on signal 11 (Segmentation fault).
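To make the backtrace easier to compare between runs, the frames can be grouped by library. The sketch below is only an aid for reading the log, not a diagnosis: `parse_frames` and `summarize` are hypothetical helper names, the regular expression assumes the Open MPI frame layout shown above, and `SAMPLE` is a subset of the frames copied from this log. It also flags whether `__cxa_finalize` appears, which would suggest the fault happens while exit-time handlers are running rather than during the collective itself.

```python
import os
import re

# Matches frame lines such as:
# [host:pid] [ 1] /path/lib.so(+0x94330)[0x7f3a2d494330]
FRAME_RE = re.compile(
    r"\[\s*(?P<index>\d+)\]\s+"
    r"(?P<module>[^\s(\[]+)"
    r"\((?P<symbol>[^)]*)\)"
    r"\[(?P<addr>0x[0-9a-fA-F]+)\]"
)

SAMPLE = """\
[muxi-44:20247] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f3a2cbed520]
[muxi-44:20247] [ 1] /opt/maca/lib/libmccompiler.so(+0x94330)[0x7f3a2d494330]
[muxi-44:20247] [ 7] /opt/maca/lib/libmccl.so(+0x2a2f92)[0x7f3a2eca2f92]
[muxi-44:20247] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xb6)[0x7f3a2cbf0a56]
[muxi-44:20247] [ 9] /opt/maca/lib/libmccl.so(+0x4d483)[0x7f3a2ea4d483]
"""


def parse_frames(log_text):
    """Return one dict per backtrace frame found in the log text."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append(
                {
                    "index": int(match.group("index")),
                    "module": match.group("module"),
                    "symbol": match.group("symbol"),
                    "addr": match.group("addr"),
                }
            )
    return frames


def summarize(frames):
    """Collapse consecutive frames from the same library; flag finalizers."""
    chain = []
    for frame in frames:
        name = os.path.basename(frame["module"])
        if not chain or chain[-1] != name:
            chain.append(name)
    in_finalizer = any(
        frame["symbol"].startswith("__cxa_finalize") for frame in frames
    )
    return chain, in_finalizer


if __name__ == "__main__":
    chain, in_finalizer = summarize(parse_frames(SAMPLE))
    print(" <- ".join(chain))
    print("exit-time finalizer on stack:", in_finalizer)
```

Feeding the full log text from a failing run into `parse_frames` instead of `SAMPLE` gives the complete chain for that run.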
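Different collective primitives produce several kinds of abnormal output when run in this environment. Please help investigate the cause of these errors.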
II. Symptoms
[muxi-45:06515] *** Process received signal ***
[muxi-45:06515] Signal: Segmentation fault (11)
[muxi-45:06515] Signal code: Address not mapped (1)
[muxi-45:06515] Failing at address: 0x185e0008
[muxi-45:06515] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7ff9f71ed520]
[muxi-45:06515] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0xa275d)[0x7ff9f724d75d]
[muxi-45:06515] [ 2] /lib/x86_64-linux-gnu/libc.so.6(free+0x73)[0x7ff9f7250453]
[muxi-45:06515] [ 3] /opt/maca/lib/libmccompiler.so(+0x9422b)[0x7ff9f7a9422b]
[muxi-45:06515] [ 4] /opt/maca/lib/libmccompiler.so(+0x9caca)[0x7ff9f7a9caca]
[muxi-45:06515] [ 5] /opt/maca/lib/libmccompiler.so(+0x82b7a)[0x7ff9f7a82b7a]
[muxi-45:06515] [ 6] /opt/maca/lib/libmccompiler.so(+0x873b3)[0x7ff9f7a873b3]
[muxi-45:06515] [ 7] /opt/maca/lib/libmccompiler.so(+0x7a05a)[0x7ff9f7a7a05a]
[muxi-45:06515] [ 8] /opt/maca/lib/libmccompiler.so(+0x6a9a7)[0x7ff9f7a6a9a7]
[muxi-45:06515] [ 9] /opt/maca/lib/libmccl.so(+0x22f4f2)[0x7ff9f922f4f2]
[muxi-45:06515] [10] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xb6)[0x7ff9f71f0a56]
[muxi-45:06515] [11] /opt/maca/lib/libmccl.so(+0x4d483)[0x7ff9f904d483]
[muxi-45:06515] *** End of error message ***
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 3 with PID 0 on node muxi-45 exited on signal 11 (Segmentation fault).
What is this problem, and why does this error occur?