• Members 8 posts
    May 15, 2026, 11:58

    I. Hardware and software information
    1. Server vendor and model: H3C UniServer R5300 G6
    2. MetaX GPU model: C500
    3. Operating system and kernel version:
    DISTRIB_ID=Ubuntu
    DISTRIB_RELEASE=22.04
    DISTRIB_CODENAME=jammy
    DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"
    PRETTY_NAME="Ubuntu 22.04.5 LTS"
    NAME="Ubuntu"
    VERSION_ID="22.04"
    VERSION="22.04.5 LTS (Jammy Jellyfish)"
    VERSION_CODENAME=jammy
    ID=ubuntu
    ID_LIKE=debian
    HOME_URL="www.ubuntu.com/"
    SUPPORT_URL="help.ubuntu.com/"
    BUG_REPORT_URL="bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    UBUNTU_CODENAME=jammy
    distribution_version=v0.3.205
    firmware_version=v0.3.132
    driver_version=v0.3.165
    4. Is CPU virtualization enabled:
    5. mx-smi output:
    mx-smi version: 2.3.1

    =================== MetaX System Management Interface Log ===================
    Timestamp : Wed May 13 10:06:54 2026

    Attached GPUs : 8
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.3.1                       Kernel Mode Driver Version: 3.8.23           |
    | MACA Version: 3.7.0.38             BIOS Version: 1.33.4.0                       |
    |------------------+-----------------+---------------------+----------------------|
    | Board Name       | GPU Persist-M   | Bus-id              | GPU-Util sGPU-M      |
    | Pwr:Usage/Cap    | Temp Perf       | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0 MetaX C500     | 0 Off           | 0000:08:00.0        | 0% Disabled          |
    | 36W / 350W       | 36C P0          | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 1 MetaX C500     | 1 Off           | 0000:09:00.0        | 0% Disabled          |
    | 39W / 350W       | 38C P0          | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 2 MetaX C500     | 2 Off           | 0000:0e:00.0        | 0% Disabled          |
    | 44W / 350W       | 38C P0          | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 3 MetaX C500     | 3 Off           | 0000:11:00.0        | 0% Disabled          |
    | 42W / 350W       | 38C P0          | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 4 MetaX C500     | 4 Off           | 0000:32:00.0        | 0% Disabled          |
    | 38W / 350W       | 37C P0          | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 5 MetaX C500     | 5 Off           | 0000:38:00.0        | 0% Disabled          |
    | 38W / 350W       | 37C P0          | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 6 MetaX C500     | 6 Off           | 0000:3b:00.0        | 0% Disabled          |
    | 41W / 350W       | 39C P0          | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 7 MetaX C500     | 7 Off           | 0000:3c:00.0        | 0% Disabled          |
    | 41W / 350W       | 38C P0          | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    | GPU PID Process Name GPU Memory                                                 |
    |                      Usage(MiB)                                                 |
    |=================================================================================|
    | no process found                                                                |
    +---------------------------------------------------------------------------------+

    End of Log
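
For anyone who wants to script the check this log supports (all eight cards present and reporting Available), here is a minimal Python sketch. It only parses pasted mx-smi text and does not call any MetaX API. The row pattern is inferred from the log above and is an assumption, not a documented format; `parse_status_rows` and `all_available` are hypothetical helper names.

```python
import re

# Assumed shape of the second row of each GPU entry, taken from the pasted log:
#   | 36W / 350W | 36C P0 | 858/65536 MiB | Available |
# This is not a documented mx-smi format and may differ between versions.
STATUS_ROW = re.compile(
    r"\|\s*(\d+)W\s*/\s*(\d+)W\s*"      # power draw / power cap
    r"\|\s*(\d+)C\s+(\S+)\s*"           # temperature and perf state
    r"\|\s*(\d+)/(\d+)\s*MiB\s*"        # memory used / total
    r"\|\s*(\S+)\s*\|"                  # GPU state
)


def parse_status_rows(log_text):
    """Return one dict per GPU status row found in the pasted text."""
    rows = []
    for line in log_text.splitlines():
        m = STATUS_ROW.search(line)
        if m:
            rows.append({
                "power_w": int(m.group(1)),
                "power_cap_w": int(m.group(2)),
                "temp_c": int(m.group(3)),
                "perf": m.group(4),
                "mem_used_mib": int(m.group(5)),
                "mem_total_mib": int(m.group(6)),
                "state": m.group(7),
            })
    return rows


def all_available(rows, expected_gpus):
    """True if exactly expected_gpus rows were found and all say Available."""
    return len(rows) == expected_gpus and all(
        r["state"] == "Available" for r in rows
    )


# Two entries copied from the log above, used here as sample input.
SAMPLE = """\
| 0 MetaX C500 | 0 Off | 0000:08:00.0 | 0% Disabled |
| 36W / 350W | 36C P0 | 858/65536 MiB | Available |
| 1 MetaX C500 | 1 Off | 0000:09:00.0 | 0% Disabled |
| 39W / 350W | 38C P0 | 858/65536 MiB | Available |
"""

rows = parse_status_rows(SAMPLE)
print(all_available(rows, expected_gpus=2))
```

The pattern tolerates extra spaces between fields, so it should not matter whether the log is pasted with its original column padding or with collapsed whitespace.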
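    II. Problem description
    Running bash cluster.sh "10.5.1.44:1,10.5.1.45:1" 2 reduce_perf produces the output shown in the attached document. The script is also included in the attachment.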

    mccl测试执行异常日志.docx

    DOCX, 23.5 KB, uploaded by uncle4 on May 15, 2026.
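
The first argument passed to cluster.sh in the post above looks like a comma-separated list of host:count pairs. As a sketch only, assuming that reading is right (the post does not say what the `:1` suffix or the following `2` mean), the string can be split and sanity-checked in Python before launching a run. `parse_host_spec` and `total_count` are hypothetical helper names, not part of cluster.sh or mccl.

```python
def parse_host_spec(spec):
    """Split "host:count,host:count" into a list of (host, count) tuples.

    Assumption: each entry is an address followed by ":" and an integer.
    """
    pairs = []
    for item in spec.split(","):
        # rpartition splits on the last ":" so the address stays intact.
        host, sep, count = item.strip().rpartition(":")
        if not sep or not host or not count.isdigit():
            raise ValueError(f"malformed entry: {item!r}")
        pairs.append((host, int(count)))
    return pairs


def total_count(pairs):
    """Sum the per-host counts."""
    return sum(count for _, count in pairs)


pairs = parse_host_spec("10.5.1.44:1,10.5.1.45:1")
print(pairs, total_count(pairs))
```

For the string in the post, the per-host counts sum to 2, which matches the second argument if that argument is the total; that interpretation is a guess and should be checked against the attached script.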

  • Thread has been moved from 公共 (Public).

  • Members 458 posts
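    May 15, 2026, 12:06

    Dear developer, please first confirm that the single-node, eight-GPU mccl test passes on each of the two servers, and only then run the two-node test.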