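In nccl test, ncclIbIflush uses a local read by default to ensure that GDR transfers have completed.
When local read is not supported, NCCL_GDRCOPY_ENABLE and NCCL_GDRCOPY_FLUSH_ENABLE must be enabled instead, and this depends on the gdrcopy feature. Does the C500 support it?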
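For reference, the fallback configuration described above could be passed to a launch roughly as follows. This is only a sketch: the two variable names are the ones quoted in the question, but whether MCCL on the C500 honors them is exactly what is being asked, so that part is an assumption. The helper names `gdrcopy_flush_env` and `mpirun_export_args` are hypothetical, and the `-x NAME=VALUE` form assumes Open MPI's mpirun.

```python
import os

# Variable names are taken from the question. Whether MCCL on the C500
# honors them is unknown; treat this as an assumption, not a confirmed API.
GDRCOPY_FLUSH_VARS = ("NCCL_GDRCOPY_ENABLE", "NCCL_GDRCOPY_FLUSH_ENABLE")


def gdrcopy_flush_env(base=None):
    """Return a copy of `base` (default: os.environ) with both variables set to "1"."""
    env = dict(os.environ if base is None else base)
    for name in GDRCOPY_FLUSH_VARS:
        env[name] = "1"
    return env


def mpirun_export_args(env):
    """Build Open MPI style `-x NAME=VALUE` arguments for the two variables."""
    args = []
    for name in GDRCOPY_FLUSH_VARS:
        args += ["-x", f"{name}={env[name]}"]
    return args


if __name__ == "__main__":
    env = gdrcopy_flush_env()
    # These arguments would be appended to the mpirun command line.
    print(" ".join(mpirun_export_args(env)))
```

The environment is copied rather than modified in place, so the same process can launch one run with the variables set and one without for comparison.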
I. Hardware and software information
1. Server model: H3C UniServer R5300 G6
2. MetaX GPU model: C500
3. OS and kernel version:
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="www.ubuntu.com/"
SUPPORT_URL="help.ubuntu.com/"
BUG_REPORT_URL="bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
distribution_version=v0.3.205
firmware_version=v0.3.132
driver_version=v0.3.165
4. CPU virtualization enabled:
5. mx-smi output:
mx-smi version: 2.3.1
=================== MetaX System Management Interface Log ===================
Timestamp : Wed May 13 10:06:54 2026
Attached GPUs : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.3.1 Kernel Mode Driver Version: 3.8.23 |
| MACA Version: 3.7.0.38 BIOS Version: 1.33.4.0 |
|------------------+-----------------+---------------------+----------------------|
| Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
| Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
|==================+=================+=====================+======================|
| 0 MetaX C500 | 0 Off | 0000:08:00.0 | 0% Disabled |
| 36W / 350W | 36C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 1 MetaX C500 | 1 Off | 0000:09:00.0 | 0% Disabled |
| 39W / 350W | 38C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 2 MetaX C500 | 2 Off | 0000:0e:00.0 | 0% Disabled |
| 44W / 350W | 38C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 3 MetaX C500 | 3 Off | 0000:11:00.0 | 0% Disabled |
| 42W / 350W | 38C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 4 MetaX C500 | 4 Off | 0000:32:00.0 | 0% Disabled |
| 38W / 350W | 37C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 5 MetaX C500 | 5 Off | 0000:38:00.0 | 0% Disabled |
| 38W / 350W | 37C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 6 MetaX C500 | 6 Off | 0000:3b:00.0 | 0% Disabled |
| 41W / 350W | 39C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 7 MetaX C500 | 7 Off | 0000:3c:00.0 | 0% Disabled |
| 41W / 350W | 38C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| no process found |
+---------------------------------------------------------------------------------+
End of Log
II. Problem description
Running bash cluster.sh "10.5.1.44:1,10.5.1.45:1" 2 reduce_perf produces the output below. The script is also included as an attachment.
[muxi-45:06515] *** Process received signal ***
[muxi-45:06515] Signal: Segmentation fault (11)
[muxi-45:06515] Signal code: Address not mapped (1)
[muxi-45:06515] Failing at address: 0x185e0008
[muxi-45:06515] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7ff9f71ed520]
[muxi-45:06515] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0xa275d)[0x7ff9f724d75d]
[muxi-45:06515] [ 2] /lib/x86_64-linux-gnu/libc.so.6(free+0x73)[0x7ff9f7250453]
[muxi-45:06515] [ 3] /opt/maca/lib/libmccompiler.so(+0x9422b)[0x7ff9f7a9422b]
[muxi-45:06515] [ 4] /opt/maca/lib/libmccompiler.so(+0x9caca)[0x7ff9f7a9caca]
[muxi-45:06515] [ 5] /opt/maca/lib/libmccompiler.so(+0x82b7a)[0x7ff9f7a82b7a]
[muxi-45:06515] [ 6] /opt/maca/lib/libmccompiler.so(+0x873b3)[0x7ff9f7a873b3]
[muxi-45:06515] [ 7] /opt/maca/lib/libmccompiler.so(+0x7a05a)[0x7ff9f7a7a05a]
[muxi-45:06515] [ 8] /opt/maca/lib/libmccompiler.so(+0x6a9a7)[0x7ff9f7a6a9a7]
[muxi-45:06515] [ 9] /opt/maca/lib/libmccl.so(+0x22f4f2)[0x7ff9f922f4f2]
[muxi-45:06515] [10] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xb6)[0x7ff9f71f0a56]
[muxi-45:06515] [11] /opt/maca/lib/libmccl.so(+0x4d483)[0x7ff9f904d483]
[muxi-45:06515] *** End of error message ***
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 3 with PID 0 on node muxi-45 exited on signal 11 (Segmentation fault).
What is causing this, and why does this error occur?
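As a reading aid for the backtrace above, here is a minimal sketch that groups the frames by library. The helper names `parse_frames` and `group_by_library` are hypothetical, and the sample lines are four frames copied from the log. Read from the innermost frame outward, the fault is raised inside free() in libc, reached from libmccompiler.so, which is called from libmccl.so underneath __cxa_finalize. That suggests, without proving it, an invalid or repeated free during exit-time cleanup rather than during the collective itself.

```python
import os
import re
from itertools import groupby

# Matches frames such as:
# [muxi-45:06515] [ 2] /lib/x86_64-linux-gnu/libc.so.6(free+0x73)[0x7ff9f7250453]
FRAME_RE = re.compile(
    r"\[\s*(?P<idx>\d+)\]\s+(?P<lib>\S+?)\((?P<sym>[^)]*)\)\[0x[0-9a-f]+\]"
)

# Four frames copied verbatim from the log in this post.
TRACE = """\
[muxi-45:06515] [ 2] /lib/x86_64-linux-gnu/libc.so.6(free+0x73)[0x7ff9f7250453]
[muxi-45:06515] [ 3] /opt/maca/lib/libmccompiler.so(+0x9422b)[0x7ff9f7a9422b]
[muxi-45:06515] [ 9] /opt/maca/lib/libmccl.so(+0x22f4f2)[0x7ff9f922f4f2]
[muxi-45:06515] [10] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xb6)[0x7ff9f71f0a56]
"""


def parse_frames(trace):
    """Return (frame index, library file name, symbol or None) for each frame line."""
    frames = []
    for line in trace.splitlines():
        m = FRAME_RE.search(line)
        if m:
            # "free+0x73" -> "free"; "+0x9422b" has no symbol -> None
            symbol = m.group("sym").split("+")[0] or None
            frames.append((int(m.group("idx")), os.path.basename(m.group("lib")), symbol))
    return frames


def group_by_library(frames):
    """Collapse consecutive frames from the same library into (library, [indices])."""
    return [
        (lib, [idx for idx, _, _ in run])
        for lib, run in groupby(frames, key=lambda f: f[1])
    ]


if __name__ == "__main__":
    for lib, indices in group_by_library(parse_frames(TRACE)):
        print(lib, indices)
```

Most frames in the two MetaX libraries carry only offsets, so resolving them to function names would need the matching debug symbols from the vendor.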