• Members 8 posts
    2026-05-13 18:05

    I. Hardware and software information
    1. Server vendor and model: H3C UniServer R5300 G6
    2. MetaX GPU model: C500
    3. OS and kernel version:
    DISTRIB_ID=Ubuntu
    DISTRIB_RELEASE=22.04
    DISTRIB_CODENAME=jammy
    DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"
    PRETTY_NAME="Ubuntu 22.04.5 LTS"
    NAME="Ubuntu"
    VERSION_ID="22.04"
    VERSION="22.04.5 LTS (Jammy Jellyfish)"
    VERSION_CODENAME=jammy
    ID=ubuntu
    ID_LIKE=debian
    HOME_URL="www.ubuntu.com/"
    SUPPORT_URL="help.ubuntu.com/"
    BUG_REPORT_URL="bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    UBUNTU_CODENAME=jammy
    distribution_version=v0.3.205
    firmware_version=v0.3.132
    driver_version=v0.3.165
    4. CPU virtualization enabled:
    5. mx-smi output:
    mx-smi version: 2.3.1

    =================== MetaX System Management Interface Log ===================
    Timestamp : Wed May 13 10:06:54 2026

    Attached GPUs : 8
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.3.1                        Kernel Mode Driver Version: 3.8.23          |
    | MACA Version: 3.7.0.38              BIOS Version: 1.33.4.0                      |
    |------------------+-----------------+---------------------+----------------------|
    | Board Name       | GPU  Persist-M  | Bus-id              | GPU-Util  sGPU-M     |
    | Pwr:Usage/Cap    | Temp Perf       | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0  MetaX C500    | 0    Off        | 0000:08:00.0        | 0%        Disabled   |
    | 36W / 350W       | 36C  P0         | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 1  MetaX C500    | 1    Off        | 0000:09:00.0        | 0%        Disabled   |
    | 39W / 350W       | 38C  P0         | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 2  MetaX C500    | 2    Off        | 0000:0e:00.0        | 0%        Disabled   |
    | 44W / 350W       | 38C  P0         | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 3  MetaX C500    | 3    Off        | 0000:11:00.0        | 0%        Disabled   |
    | 42W / 350W       | 38C  P0         | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 4  MetaX C500    | 4    Off        | 0000:32:00.0        | 0%        Disabled   |
    | 38W / 350W       | 37C  P0         | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 5  MetaX C500    | 5    Off        | 0000:38:00.0        | 0%        Disabled   |
    | 38W / 350W       | 37C  P0         | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 6  MetaX C500    | 6    Off        | 0000:3b:00.0        | 0%        Disabled   |
    | 41W / 350W       | 39C  P0         | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 7  MetaX C500    | 7    Off        | 0000:3c:00.0        | 0%        Disabled   |
    | 41W / 350W       | 38C  P0         | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    |  GPU               PID         Process Name                  GPU Memory         |
    |                                                              Usage(MiB)         |
    |=================================================================================|
    | no process found                                                                |
    +---------------------------------------------------------------------------------+

    End of Log
    II. Problem description

    Out of bounds values : 0 OK

    Avg bus bandwidth : 28.3128

    [muxi-45:06515] *** Process received signal ***
    [muxi-45:06515] Signal: Segmentation fault (11)
    [muxi-45:06515] Signal code: Address not mapped (1)
    [muxi-45:06515] Failing at address: 0x185e0008
    [muxi-45:06515] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7ff9f71ed520]
    [muxi-45:06515] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0xa275d)[0x7ff9f724d75d]
    [muxi-45:06515] [ 2] /lib/x86_64-linux-gnu/libc.so.6(free+0x73)[0x7ff9f7250453]
    [muxi-45:06515] [ 3] /opt/maca/lib/libmccompiler.so(+0x9422b)[0x7ff9f7a9422b]
    [muxi-45:06515] [ 4] /opt/maca/lib/libmccompiler.so(+0x9caca)[0x7ff9f7a9caca]
    [muxi-45:06515] [ 5] /opt/maca/lib/libmccompiler.so(+0x82b7a)[0x7ff9f7a82b7a]
    [muxi-45:06515] [ 6] /opt/maca/lib/libmccompiler.so(+0x873b3)[0x7ff9f7a873b3]
    [muxi-45:06515] [ 7] /opt/maca/lib/libmccompiler.so(+0x7a05a)[0x7ff9f7a7a05a]
    [muxi-45:06515] [ 8] /opt/maca/lib/libmccompiler.so(+0x6a9a7)[0x7ff9f7a6a9a7]
    [muxi-45:06515] [ 9] /opt/maca/lib/libmccl.so(+0x22f4f2)[0x7ff9f922f4f2]
    [muxi-45:06515] [10] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xb6)[0x7ff9f71f0a56]
    [muxi-45:06515] [11] /opt/maca/lib/libmccl.so(+0x4d483)[0x7ff9f904d483]
    [muxi-45:06515] *** End of error message ***


    Primary job terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.



    mpirun noticed that process rank 3 with PID 0 on node muxi-45 exited on signal 11 (Segmentation fault).
    What is causing this problem, and why does this error occur?
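
One way to read the backtrace above is to tally which shared library each frame belongs to. The sketch below assumes the crash output has been saved to a file (the name `crash.log` in the usage note is hypothetical) and that frame lines follow the `[host:pid] [ N] /path/lib.so(...)[0x...]` layout seen in this post. It uses only `grep`, `sed`, `sort` and `uniq`, and it does not identify a root cause.

```shell
# Tally backtrace frames per shared library from a saved crash log.
# Assumption: frame lines look like
#   [host:pid] [ N] /path/to/lib.so(symbol+0xOFF)[0xADDR]
# The function body runs in a subshell, so it leaves no variables behind.
summarize_frames() (
    # $1: path to the saved log file
    grep -E '\] \[ *[0-9]+\] /' "$1" |
        sed -E 's/^.*\] \[ *[0-9]+\] ([^(]+)\(.*$/\1/' |
        sort | uniq -c | sort -rn
)
```

Running `summarize_frames crash.log` on the trace in this post reports 6 frames in libmccompiler.so, 4 in libc.so.6, and 2 in libmccl.so. Read bottom to top, the frames go from `__cxa_finalize` through libmccl.so and libmccompiler.so into `free`, and the bandwidth result is printed before the signal. That suggests, but does not prove, that the fault occurs during exit-time cleanup rather than during the collective operation itself.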

  • Members 458 posts
    2026-05-14 13:18

    Dear developer, please provide the complete mccl test log as an attachment.

  • Members 8 posts
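    2026-05-14 14:58

    Different communication primitives produce several kinds of abnormal output when run in this environment. Please help identify the cause of the errors.

    Attachment: mccl测试执行异常日志.docx (mccl test execution error log)

    DOCX, 23.5 KB, uploaded by uncle4 on 2026-05-14.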

  • Members 458 posts
    2026-05-14 15:00

    Dear developer, please run:

    bash mccl.sh 8
    
  • Members 8 posts
    2026-05-14 15:04

    Running bash mccl.sh 8 produces several different outputs.

    Attachment: bash mccl.sh 8.docx

    DOCX, 16.3 KB, uploaded by uncle4 on 2026-05-14.

  • Members 458 posts
    2026-05-15 11:09

    Dear developer, please run bash mccl.sh 2 or bash mccl.sh 4.

  • Members 8 posts
    2026-05-15 11:14

    Repeated runs of bash mccl.sh 2 showed no errors. Repeated runs of bash mccl.sh 4 intermittently produce the following:
    [muxi-44:20247] *** Process received signal ***
    [muxi-44:20247] Signal: Segmentation fault (11)
    [muxi-44:20247] Signal code: Address not mapped (1)
    [muxi-44:20247] Failing at address: 0x2d20b170
    [muxi-44:20247] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f3a2cbed520]
    [muxi-44:20247] [ 1] /opt/maca/lib/libmccompiler.so(+0x94330)[0x7f3a2d494330]
    [muxi-44:20247] [ 2] /opt/maca/lib/libmccompiler.so(+0x9caca)[0x7f3a2d49caca]
    [muxi-44:20247] [ 3] /opt/maca/lib/libmccompiler.so(+0x82b7a)[0x7f3a2d482b7a]
    [muxi-44:20247] [ 4] /opt/maca/lib/libmccompiler.so(+0x873b3)[0x7f3a2d4873b3]
    [muxi-44:20247] [ 5] /opt/maca/lib/libmccompiler.so(+0x7a05a)[0x7f3a2d47a05a]
    [muxi-44:20247] [ 6] /opt/maca/lib/libmccompiler.so(+0x6a9a7)[0x7f3a2d46a9a7]
    [muxi-44:20247] [ 7] /opt/maca/lib/libmccl.so(+0x2a2f92)[0x7f3a2eca2f92]
    [muxi-44:20247] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xb6)[0x7f3a2cbf0a56]
    [muxi-44:20247] [ 9] /opt/maca/lib/libmccl.so(+0x4d483)[0x7f3a2ea4d483]
    [muxi-44:20247] *** End of error message ***


    Primary job terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.



    mpirun noticed that process rank 0 with PID 0 on node muxi-44 exited on signal 11 (Segmentation fault).
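
Because the failure reported above is intermittent, it helps to put a number on it. The following is a minimal sketch of a repeat-and-count helper. It assumes a failing run either exits non-zero or leaves the text `Segmentation fault` in its output; whether `mccl.sh` propagates the `mpirun` exit status is not stated in this thread. The `LOG_DIR` variable and the `run_N.log` file names are illustrative choices.

```shell
# Run a command repeatedly and report how often it fails.
# Usage: repeat_run <count> <command> [args...]
# A run counts as failed if it exits non-zero OR its output mentions a
# segmentation fault (the second check is an assumption, in case the
# wrapper script does not pass the exit status through).
# The function body runs in a subshell, so its variables stay local.
repeat_run() (
    count=$1
    shift
    failures=0
    i=1
    while [ "$i" -le "$count" ]; do
        log="${LOG_DIR:-.}/run_${i}.log"
        if ! "$@" > "$log" 2>&1 || grep -q 'Segmentation fault' "$log"; then
            failures=$((failures + 1))
            echo "run ${i}: FAILED (see ${log})"
        fi
        i=$((i + 1))
    done
    echo "failures: ${failures}/${count}"
)
```

For example, `repeat_run 20 bash mccl.sh 4` would run the 4-GPU test twenty times and end with a line such as `failures: N/20`, keeping one log per run for later comparison.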

  • Members 458 posts
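    2026-05-15 11:22

    Dear developer, set the variable to the actual GPU IDs, for example export CUDA_VISIBLE_DEVICES=0,1,2,3, then run bash mccl.sh 2 or bash mccl.sh 4 to cross-validate and identify which GPUs have the communication problem. Once they are identified, power off the server, reseat those GPUs, and try again.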
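
The cross-validation described above can be sketched as a loop over GPU groups. This is an illustration under assumptions, not a vendor procedure: it assumes the argument to `mccl.sh` is the number of GPUs to use (as the `8`, `4` and `2` invocations in this thread suggest), and the helper names `gpu_pairs` and `cross_validate`, the `DRY_RUN` switch, and the log file names are all made up for this example. With `DRY_RUN` left at its default of 1, the commands are only printed.

```shell
# Print every unordered pair from a list of GPU IDs, one "a,b" per line.
gpu_pairs() (
    while [ "$#" -gt 1 ]; do
        first=$1
        shift
        for other in "$@"; do
            echo "${first},${other}"
        done
    done
)

# Run the test once per group of comma-separated GPU IDs.
# DRY_RUN=1 (default): print the command for each group.
# DRY_RUN=0: run it, log the output, and print ok/FAILED per group.
cross_validate() (
    for group in "$@"; do
        # Number of GPUs in the group, e.g. "0,1,2,3" -> 4.
        n=$(printf '%s\n' "$group" | tr ',' '\n' | wc -l | tr -d ' ')
        log="group_$(printf '%s' "$group" | tr ',' '-').log"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "CUDA_VISIBLE_DEVICES=${group} bash mccl.sh ${n}"
        elif CUDA_VISIBLE_DEVICES="$group" bash mccl.sh "$n" > "$log" 2>&1; then
            echo "${group}: ok"
        else
            echo "${group}: FAILED (see ${log})"
        fi
    done
)
```

A possible session: `cross_validate 0,1,2,3 4,5,6,7 0,1,4,5 2,3,6,7` to preview the 4-GPU groups, then the same call with `DRY_RUN=0` to execute them. `cross_validate $(gpu_pairs 0 1 2 3 4 5 6 7)` covers all 28 pairs. Since the failure in this thread is intermittent, each group likely needs several runs before it can be considered clean.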

  • Members 8 posts
    2026-05-15 11:42

    Please confirm: is it really CUDA_VISIBLE_DEVICES and not MACA_VISIBLE_DEVICES? Is there any difference between the two variables?

  • Members 458 posts
    2026-05-15 11:45

    Dear developer, it is CUDA_VISIBLE_DEVICES.

  • Members 8 posts
    2026-05-15 11:59

    As the screenshot shows, after running export MACA_VISIBLE_DEVICES=2,3,6,7 and then mccl.sh, the result observed with mx-smi is not what was expected.

    image.png

    PNG, 374.4 KB, uploaded by uncle4 on 2026-05-15.

  • Members 458 posts
    2026-05-15 12:05

    Dear developer, please use CUDA_VISIBLE_DEVICES.
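
When switching between the two variables discussed above, it is easy to leave a stale value exported in the same shell. The sketch below only reports what a child process such as `mccl.sh` would inherit; how the MACA runtime interprets each variable is not described in this thread. The function name `show_visible_devices` is made up for this example.

```shell
# Report which device-selection variables are exported to child processes.
# printenv only sees exported variables, which is what a launched test inherits.
show_visible_devices() (
    for name in CUDA_VISIBLE_DEVICES MACA_VISIBLE_DEVICES; do
        if value=$(printenv "$name"); then
            echo "${name}=${value}"
        else
            echo "${name} is not exported"
        fi
    done
)
```

A typical sequence would be `unset MACA_VISIBLE_DEVICES`, then `export CUDA_VISIBLE_DEVICES=2,3,6,7`, then `show_visible_devices` to confirm the environment before launching the test.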