MetaX-Tech Developer Forum

uncle4

  • Members
  • Joined May 13, 2026

uncle4 has started 3 threads.

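  • uncle4
    Members
    Does the MetaX C500 support GDRCopy? (Status: in progress, May 18, 2026 15:08)

    In the NCCL tests, a local read is used by default as the flush in ncclIbIflush to ensure that GDR writes have completed.
    Where local read is not supported, NCCL_GDRCOPY_ENABLE and NCCL_GDRCOPY_FLUSH_ENABLE have to be enabled, and that depends on the gdrcopy feature. Does the C500 support it?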
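    For reference, the two settings named above are ordinary environment variables in upstream NCCL. A minimal sketch of requesting them before a test run follows. Whether MCCL on the C500 honors these names, or exposes its own equivalents, is exactly the open question in this thread, so the variable names are an assumption carried over from NCCL.

```shell
# Assumption: variable names come from upstream NCCL; MCCL may ignore them
# or use differently named equivalents.
export NCCL_GDRCOPY_ENABLE=1        # request GDRCopy support
export NCCL_GDRCOPY_FLUSH_ENABLE=1  # request the GDRCopy-based flush path

# Print the settings so the launch log records what was requested.
env | grep '^NCCL_GDRCOPY' | sort
```

    Under a multi-node launch the variables also have to reach the remote ranks, for example through the launcher's own environment-forwarding option.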

  • uncle4
    Members
    Cross-node MCCL run reports #wrong and "Out of bounds values : 28 FAILED" (Status: in progress, May 15, 2026 11:58)

    I. Hardware and software information
    1. Server vendor: H3C UniServer R5300 G6
    2. MetaX GPU model: C500
    3. OS and kernel version:
    DISTRIB_ID=Ubuntu
    DISTRIB_RELEASE=22.04
    DISTRIB_CODENAME=jammy
    DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"
    PRETTY_NAME="Ubuntu 22.04.5 LTS"
    NAME="Ubuntu"
    VERSION_ID="22.04"
    VERSION="22.04.5 LTS (Jammy Jellyfish)"
    VERSION_CODENAME=jammy
    ID=ubuntu
    ID_LIKE=debian
    HOME_URL="www.ubuntu.com/"
    SUPPORT_URL="help.ubuntu.com/"
    BUG_REPORT_URL="bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    UBUNTU_CODENAME=jammy
    distribution_version=v0.3.205
    firmware_version=v0.3.132
    driver_version=v0.3.165
    4. CPU virtualization enabled:
    5. mx-smi output:
    mx-smi version: 2.3.1

    =================== MetaX System Management Interface Log ===================
    Timestamp : Wed May 13 10:06:54 2026

    Attached GPUs : 8
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.3.1 Kernel Mode Driver Version: 3.8.23 |
    | MACA Version: 3.7.0.38 BIOS Version: 1.33.4.0 |
    |------------------+-----------------+---------------------+----------------------|
    | Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
    | Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
    |==================+=================+=====================+======================|
    | 0 MetaX C500 | 0 Off | 0000:08:00.0 | 0% Disabled |
    | 36W / 350W | 36C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 1 MetaX C500 | 1 Off | 0000:09:00.0 | 0% Disabled |
    | 39W / 350W | 38C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 2 MetaX C500 | 2 Off | 0000:0e:00.0 | 0% Disabled |
    | 44W / 350W | 38C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 3 MetaX C500 | 3 Off | 0000:11:00.0 | 0% Disabled |
    | 42W / 350W | 38C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 4 MetaX C500 | 4 Off | 0000:32:00.0 | 0% Disabled |
    | 38W / 350W | 37C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 5 MetaX C500 | 5 Off | 0000:38:00.0 | 0% Disabled |
    | 38W / 350W | 37C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 6 MetaX C500 | 6 Off | 0000:3b:00.0 | 0% Disabled |
    | 41W / 350W | 39C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 7 MetaX C500 | 7 Off | 0000:3c:00.0 | 0% Disabled |
    | 41W / 350W | 38C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process: |
    | GPU PID Process Name GPU Memory |
    | Usage(MiB) |
    |=================================================================================|
    | no process found |
    +---------------------------------------------------------------------------------+

    End of Log
    II. Symptoms
    Running bash cluster.sh "10.5.1.44:1,10.5.1.45:1" 2 reduce_perf produces the output shown in the attached document. The script is attached as well.

  • uncle4
    Members
    /opt/maca/samples/mccl_tests/perf$ sudo bash mccl.sh 8 reduce_perf (Status: in progress, May 13, 2026 18:05)

    I. Hardware and software information
    1. Server vendor: H3C UniServer R5300 G6
    2. MetaX GPU model: C500
    3. OS and kernel version:
    DISTRIB_ID=Ubuntu
    DISTRIB_RELEASE=22.04
    DISTRIB_CODENAME=jammy
    DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"
    PRETTY_NAME="Ubuntu 22.04.5 LTS"
    NAME="Ubuntu"
    VERSION_ID="22.04"
    VERSION="22.04.5 LTS (Jammy Jellyfish)"
    VERSION_CODENAME=jammy
    ID=ubuntu
    ID_LIKE=debian
    HOME_URL="www.ubuntu.com/"
    SUPPORT_URL="help.ubuntu.com/"
    BUG_REPORT_URL="bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    UBUNTU_CODENAME=jammy
    distribution_version=v0.3.205
    firmware_version=v0.3.132
    driver_version=v0.3.165
    4. CPU virtualization enabled:
    5. mx-smi output:
    mx-smi version: 2.3.1

    =================== MetaX System Management Interface Log ===================
    Timestamp : Wed May 13 10:06:54 2026

    Attached GPUs : 8
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.3.1 Kernel Mode Driver Version: 3.8.23 |
    | MACA Version: 3.7.0.38 BIOS Version: 1.33.4.0 |
    |------------------+-----------------+---------------------+----------------------|
    | Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
    | Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
    |==================+=================+=====================+======================|
    | 0 MetaX C500 | 0 Off | 0000:08:00.0 | 0% Disabled |
    | 36W / 350W | 36C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 1 MetaX C500 | 1 Off | 0000:09:00.0 | 0% Disabled |
    | 39W / 350W | 38C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 2 MetaX C500 | 2 Off | 0000:0e:00.0 | 0% Disabled |
    | 44W / 350W | 38C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 3 MetaX C500 | 3 Off | 0000:11:00.0 | 0% Disabled |
    | 42W / 350W | 38C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 4 MetaX C500 | 4 Off | 0000:32:00.0 | 0% Disabled |
    | 38W / 350W | 37C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 5 MetaX C500 | 5 Off | 0000:38:00.0 | 0% Disabled |
    | 38W / 350W | 37C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 6 MetaX C500 | 6 Off | 0000:3b:00.0 | 0% Disabled |
    | 41W / 350W | 39C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 7 MetaX C500 | 7 Off | 0000:3c:00.0 | 0% Disabled |
    | 41W / 350W | 38C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process: |
    | GPU PID Process Name GPU Memory |
    | Usage(MiB) |
    |=================================================================================|
    | no process found |
    +---------------------------------------------------------------------------------+

    End of Log
    II. Symptoms

    Out of bounds values : 0 OK

    Avg bus bandwidth : 28.3128

    [muxi-45:06515] *** Process received signal ***
    [muxi-45:06515] Signal: Segmentation fault (11)
    [muxi-45:06515] Signal code: Address not mapped (1)
    [muxi-45:06515] Failing at address: 0x185e0008
    [muxi-45:06515] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7ff9f71ed520]
    [muxi-45:06515] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0xa275d)[0x7ff9f724d75d]
    [muxi-45:06515] [ 2] /lib/x86_64-linux-gnu/libc.so.6(free+0x73)[0x7ff9f7250453]
    [muxi-45:06515] [ 3] /opt/maca/lib/libmccompiler.so(+0x9422b)[0x7ff9f7a9422b]
    [muxi-45:06515] [ 4] /opt/maca/lib/libmccompiler.so(+0x9caca)[0x7ff9f7a9caca]
    [muxi-45:06515] [ 5] /opt/maca/lib/libmccompiler.so(+0x82b7a)[0x7ff9f7a82b7a]
    [muxi-45:06515] [ 6] /opt/maca/lib/libmccompiler.so(+0x873b3)[0x7ff9f7a873b3]
    [muxi-45:06515] [ 7] /opt/maca/lib/libmccompiler.so(+0x7a05a)[0x7ff9f7a7a05a]
    [muxi-45:06515] [ 8] /opt/maca/lib/libmccompiler.so(+0x6a9a7)[0x7ff9f7a6a9a7]
    [muxi-45:06515] [ 9] /opt/maca/lib/libmccl.so(+0x22f4f2)[0x7ff9f922f4f2]
    [muxi-45:06515] [10] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xb6)[0x7ff9f71f0a56]
    [muxi-45:06515] [11] /opt/maca/lib/libmccl.so(+0x4d483)[0x7ff9f904d483]
    [muxi-45:06515] *** End of error message ***


    Primary job terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.



    mpirun noticed that process rank 3 with PID 0 on node muxi-45 exited on signal 11 (Segmentation fault).
    What is causing this, and why does this error occur?
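    One way to narrow this down is to pull the library path and offset out of each frame so they can be handed to a symbolizer. The sketch below does that for a few frames copied from the trace above. The helper name extract_frames is hypothetical, and the approach assumes the MACA libraries carry enough symbol information to resolve offsets, which this post does not confirm.

```shell
# Hypothetical helper: print "library offset" for every frame of the form
#   [host:pid] [ N] /path/lib.so(+0xOFFSET)[0xADDR]
# Frames that already name a symbol (e.g. free+0x73) are skipped.
extract_frames() {
  sed -n 's|^.*] \(/[^(]*\)(+\(0x[0-9a-f]*\))\[0x[0-9a-f]*]$|\1 \2|p'
}

# A subset of the frames reported in the post.
trace='[muxi-45:06515] [ 2] /lib/x86_64-linux-gnu/libc.so.6(free+0x73)[0x7ff9f7250453]
[muxi-45:06515] [ 3] /opt/maca/lib/libmccompiler.so(+0x9422b)[0x7ff9f7a9422b]
[muxi-45:06515] [ 9] /opt/maca/lib/libmccl.so(+0x22f4f2)[0x7ff9f922f4f2]
[muxi-45:06515] [10] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xb6)[0x7ff9f71f0a56]
[muxi-45:06515] [11] /opt/maca/lib/libmccl.so(+0x4d483)[0x7ff9f904d483]'

frames=$(printf '%s\n' "$trace" | extract_frames)
printf '%s\n' "$frames"
```

    Each resulting pair can then be passed to a tool such as addr2line -e <library> <offset>. The trace itself shows the fault inside free, reached from libmccl.so via __cxa_finalize, and the test results had already been printed by then. That suggests the crash happens during exit-time cleanup rather than during the reduce itself, though the trace alone does not prove it.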
