metax@metax-host-104:/opt/maca/samples/mccl_tests/perf$ bash mccl.sh 8
The test is all_reduce_perf, the maca version is /opt/maca-3.7.1
main_process = 7324
metax-host-104: Test CUDA failure common.cu:1349 'initialization error'
.. metax-host-104 pid 7325: Test failure common.cu:1271
metax-host-104: Test CUDA failure common.cu:1349 'initialization error'
.. metax-host-104 pid 7326: Test failure common.cu:1271
metax-host-104: Test CUDA failure common.cu:1349 'initialization error'
.. metax-host-104 pid 7327: Test failure common.cu:1271
metax-host-104: Test CUDA failure common.cu:1349 'initialization error'
.. metax-host-104 pid 7328: Test failure common.cu:1271
metax-host-104: Test CUDA failure common.cu:1349 'initialization error'
.. metax-host-104 pid 7329: Test failure common.cu:1271
metax-host-104: Test CUDA failure common.cu:1349 'initialization error'
.. metax-host-104 pid 7330: Test failure common.cu:1271
metax-host-104: Test CUDA failure common.cu:1349 'initialization error'
.. metax-host-104 pid 7331: Test failure common.cu:1271
===============================
nThread 1 nGpus 1 minBytes 1024 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 10 agg iters: 1 validation: 1 graph: 0
Using devices
metax-host-104: Test CUDA failure common.cu:1349 'initialization error'
.. metax-host-104 pid 7324: Test failure common.cu:1271
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[25858,1],4]
Exit code: 2
metax@metax-host-104:/opt/maca/samples/mccl_tests/perf$ bash mccl.sh 2
The test is all_reduce_perf, the maca version is /opt/maca-3.7.1
main_process = 7378
metax-host-104: Test CUDA failure common.cu:1349 'initialization error'
.. metax-host-104 pid 7379: Test failure common.cu:1271
===============================
nThread 1 nGpus 1 minBytes 1024 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 10 agg iters: 1 validation: 1 graph: 0
Using devices
metax-host-104: Test CUDA failure common.cu:1349 'initialization error'
.. metax-host-104 pid 7378: Test failure common.cu:1271
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[25940,1],0]
Exit code: 2
metax@metax-host-104:/opt/maca/samples/mccl_tests/perf$
Why is the single-node MCCL test failing?
The mx-smi output looks normal, as shown below:
metax@metax-host-104:/opt/maca/samples/mccl_tests/perf$ mx-smi
mx-smi version: 2.3.1
=================== MetaX System Management Interface Log ===================
Timestamp : Wed Jun 17 10:13:47 2026
Attached GPUs : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.3.1 Kernel Mode Driver Version: 3.8.1 |
| MACA Version: 3.7.1.5 BIOS Version: 1.29.1.0 |
|------------------+-----------------+---------------------+----------------------|
| Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
| Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
|==================+=================+=====================+======================|
| 0 MetaX C550 | 0 Off | 0000:2b:00.0 | 0% Disabled |
| 54W / 450W | 32C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 1 MetaX C550 | 1 Off | 0000:3a:00.0 | 0% Disabled |
| 56W / 450W | 33C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 2 MetaX C550 | 2 Off | 0000:4d:00.0 | 0% Disabled |
| 52W / 450W | 33C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 3 MetaX C550 | 3 Off | 0000:5c:00.0 | 0% Disabled |
| 56W / 450W | 33C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 4 MetaX C550 | 4 Off | 0000:aa:00.0 | 0% Disabled |
| 53W / 450W | 32C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 5 MetaX C550 | 5 Off | 0000:ba:00.0 | 0% Disabled |
| 52W / 450W | 33C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 6 MetaX C550 | 6 Off | 0000:ca:00.0 | 0% Disabled |
| 54W / 450W | 34C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 7 MetaX C550 | 7 Off | 0000:da:00.0 | 0% Disabled |
| 53W / 450W | 33C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| no process found |
+---------------------------------------------------------------------------------+
End of Log
metax@metax-host-104:/opt/maca/samples/mccl_tests/perf$