3. MX-DCM Features

This chapter introduces the main features of MX-DCM, along with usage examples for the mx-dcmi command.

Note

In this document, all sample output is taken from the 曦云C500.

3.1. Device Discovery

The device discovery feature lets users view the available devices on a specified host.

3.1.1. Usage Help

mx-dcmi discovery -h

Running the command prints the following:

USAGE:
   mx-dcmi discovery [-hlv] [--gpu-id <gpu-id>] [--host <IP:PORT>]

Where:
   -l, --list <FILE>
      List all GPUs discovered on the host.

   --host <IP:PORT>
      The target host to connect.

   Either of:
      --gpu-id <gpu-id>
         Query target gpu detail info

      --group-id <[0, 63]>
         Query gpus' detail info of a gpu group

   --, --ignore_rest
      Ignores the rest of the labeled arguments following this flag.

   -v, --version
      Displays version information and exits.

   -h, --help
      Displays usage information and exits.

   MetaX Data Center Management Interface

3.1.2. Listing Devices

  • mx-dcmi command:

    mx-dcmi discovery --host [IP]:[PORT] -l


    On success, the output is:

    ================== MetaX Data Center Management Interface Log ==================
    Timestamp                                              : Wed Nov 5 15:02:00 2025
    
    +----------+--------+----------------------------------------------------------+
    | BOARD_ID | GPU ID | Device Information                                       |
    +----------+--------+----------------------------------------------------------+
    | 0        | 0      | Name: MXC500                                             |
    |          |        | PCI Bus ID: 0000:4f:00.0                                 |
    |          |        | Device UUID: GPU-dcf2c6b3-bf23-7535-8406-fa0d793df737    |
    +----------+--------+----------------------------------------------------------+
    | 1        | 1      | Name: MXC500                                             |
    |          |        | PCI Bus ID: 0000:50:00.0                                 |
    |          |        | Device UUID: GPU-1323796f-71df-ea8c-ed3d-8389dbd9eef4    |
    +----------+--------+----------------------------------------------------------+
    
  • RESTful API:

    curl -i -X GET http://[IP]:[PORT]/api/v1/gpus
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-type: application/json
    Content-Length: 206
    
    {"gpus":[{"boardid":0,"deviceid":0,"bdfid":"0000:4f:00.0","uuid":"GPU-dcf2c6b3-bf23-7535-8406-fa0d793df737","type":"MXC500"},{"boardid":1,"deviceid":1,"bdfid":"0000:50:00.0","uuid":"GPU-1323796f-71df-ea8c-ed3d-8389dbd9eef4","type":"MXC500"}]}
    

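For scripting, the /api/v1/gpus response shown above can be consumed directly as JSON. A minimal Python sketch using the sample body from this section (parsing only; fetching the body over HTTP is left out):

```python
import json

# Sample response body from GET /api/v1/gpus, as shown above.
body = ('{"gpus":[{"boardid":0,"deviceid":0,"bdfid":"0000:4f:00.0",'
        '"uuid":"GPU-dcf2c6b3-bf23-7535-8406-fa0d793df737","type":"MXC500"},'
        '{"boardid":1,"deviceid":1,"bdfid":"0000:50:00.0",'
        '"uuid":"GPU-1323796f-71df-ea8c-ed3d-8389dbd9eef4","type":"MXC500"}]}')

def list_devices(payload):
    """Return (deviceid, bdfid, type) for each GPU in a /api/v1/gpus reply."""
    return [(g["deviceid"], g["bdfid"], g["type"])
            for g in json.loads(payload)["gpus"]]

for devid, bdf, name in list_devices(body):
    print(f"GPU {devid}: {name} @ {bdf}")
```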
3.1.3. Querying a Device's Attributes

  • mx-dcmi command:

    mx-dcmi discovery --host [IP]:[PORT] --gpu-id 0


    On success, the output is:

    ============== MetaX Data Center Management Interface Log ==============
    Timestamp                                      : Wed Nov 5 15:00:00 2025
    
    +---------------------------+------------------------------------------+
    | BOARD ID: 0               | Power Limit (W): 350                     |
    +---------------------------+------------------------------------------+
    | GPU ID: 0                 | Device Information                       |
    +---------------------------+------------------------------------------+
    | Device Name               | MXC500                                   |
    | PCI Bus ID                | 0000:4f:00.0                             |
    | Device UUID               | GPU-dcf2c6b3-bf23-7535-8406-fa0d793df737 |
    | Board SN                  | PBR23090062067                           |
    | KMD Version               | 3.3.9                                    |
    | VBIOS                     | 1.29.1.0                                 |
    +---------------------------+------------------------------------------+
    | Shut down Temperature(C)  | 108.00                                   |
    | Slow down Temperature(C)  | 95.00                                    |
    +---------------------------+------------------------------------------+
    
  • RESTful API:

    curl -i -X GET http://[IP]:[PORT]/api/v1/gpus/gpu/0
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-type: application/json
    Content-Length: 256
    
    {"gpus":[{"boardid":0,"deviceid":0,"bdfid":"0000:4f:00.0","uuid":"GPU-dcf2c6b3-bf23-7535-8406-fa0d793df737","name":"MXC500","boardSN":"PBR23090062067","kmd":"3.3.9","vbios":"1.29.1.0","shutdownTemperature":"108.0","slowdownTemperature":"95.0","powerLimit":"350"}]}
    

3.2. Device Groups

In MX-DCM, nearly all device operations are performed on device groups. Users can create, delete, and update device groups, then use them in subsequent operations. Device groups exist to make it easy to apply the same operation to multiple devices. Different groups may contain overlapping sets of devices, since different groups can serve different management needs.

Group 0 is the default device group; it contains all devices and cannot be modified.

3.2.1. Usage Help

mx-dcmi group -h

Running the command prints the following:

USAGE:
   mx-dcmi group  [-hlv] [-a <>|-c <>|-d <[0, 63]>|-i|-r <>] [-g <[0, 63]>]
                  [--host <IP:PORT>]
Where:
   -l,  --list
      List all groups.

   --host <IP:PORT>
      The target host to connect.

   -g <[0, 63]>,  --group <[0, 63]>
      Group id

   Either of:
      -a <>,  --add <>
         Add devices to group

      -c <>,  --create <>
         Create a group on the remote host.

      -d <[0, 63]>,  --delete <[0, 63]>
         Delete a group on the remote host.

      -r <>,  --remove <>
         Remove gpus from group


      -i,  --inquiry
         Inquiry group information

   --,  --ignore_rest
      Ignores the rest of the labeled arguments following this flag.

   -v,  --version
      Displays version information and exits.

   -h,  --help
      Displays usage information and exits.

   MetaX Data Center Management Interface

3.2.2. Creating a Device Group

  • mx-dcmi command:

    mx-dcmi group --host [IP]:[PORT] -c group1


    On success, the output is:

    create group group1 with a group ID of 1 successfully
    
  • RESTful API:

    curl -i -X POST http://[IP]:[PORT]/api/v1/groups/group1
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-type: application/json
    Content-Length: 13
    
    {"groupid":1}
    

3.2.3. Adding Devices to a Device Group

The following command adds gpu#0 and gpu#1 to device group #1.

  • mx-dcmi command:

    mx-dcmi group --host [IP]:[PORT] -a 0,1 -g 1


    On success, the output is:

    add gpus operation successfully
    
  • RESTful API:

    curl -i -X PUT http://[IP]:[PORT]/api/v1/groups/1/gpu?gpuid=0,1
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-Length: 0
    
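Taken together with 3.2.2, the REST calls form a small workflow: create a group, then add devices to it. A sketch that only assembles the requests (the helper names are my own; sending them, e.g. with urllib.request, and the host value are left to the caller):

```python
from urllib.parse import quote

def create_group_req(host, name):
    """POST request that creates a device group (see 3.2.2)."""
    return ("POST", f"http://{host}/api/v1/groups/{quote(name)}")

def add_gpus_req(host, group_id, gpu_ids):
    """PUT request that adds GPUs to an existing group."""
    ids = ",".join(str(i) for i in gpu_ids)
    return ("PUT", f"http://{host}/api/v1/groups/{group_id}/gpu?gpuid={ids}")

print(create_group_req("10.0.0.1:8080", "group1"))
print(add_gpus_req("10.0.0.1:8080", 1, [0, 1]))
```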

3.2.4. Removing Devices from a Device Group

The following command removes gpu#0 from device group #1.

  • mx-dcmi command:

    mx-dcmi group --host [IP]:[PORT] -r 0 -g 1


    On success, the output is:

    remove gpus operation successfully
    
  • RESTful API:

    curl -i -X DELETE http://[IP]:[PORT]/api/v1/groups/1/gpu?gpuid=0
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-Length: 0
    

3.2.5. Listing Device Groups

  • mx-dcmi command:

    mx-dcmi group --host [IP]:[PORT] -l


    On success, the output is:

    ============= MetaX Data Center Management Interface Log =============
    Timestamp                                  : Fri Nov  21 16:52:49 2025
    
    +--------------------------------------------------------------------+
    | GROUPS                                                             |
    | 1 groups found                                                     |
    +--------------------------------------------------------------------+
    +------------+-------------------------------------------------------+
    | Group ID   | 0                                                     |
    | Group Name | DCM_ALL_SUPPORTED_GPUS                                |
    | GPU ID(s)  | 0, 1, 2, 3, 4, 5, 6, 7                                |
    +------------+-------------------------------------------------------+
    
  • RESTful API:

    curl -i -X GET http://[IP]:[PORT]/api/v1/groups
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-type: application/json
    Content-Length: 86
    
    {"gpugroups":[{"groupid":0,"name":"DCM_ALL_SUPPORTED_GPUS","gpus":[0,1,2,3,4,5,6,7]}]}
    

3.2.6. Viewing Device Group Details

  • mx-dcmi command:

    mx-dcmi group --host [IP]:[PORT] -i -g 1


    On success, the output is:

    ============= MetaX Data Center Management Interface Log ===========
    Timestamp                                 : Fri Nov  7 17:11:38 2025
    
    +------------+-----------------------------------------------------+
    | GROUPS                                                           |
    +------------+-----------------------------------------------------+
    +------------+-----------------------------------------------------+
    | Group ID   | 1                                                   |
    | Group Name | group1                                              |
    | GPU ID(s)  | 0, 1                                                |
    +------------+-----------------------------------------------------+
    
  • RESTful API:

    curl -i -X GET http://[IP]:[PORT]/api/v1/groups/0
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-type: application/json
    Content-Length: 70
    
    {"groupid":0,"name":"DCM_ALL_SUPPORTED_GPUS","gpus":[0,1,2,3,4,5,6,7]}
    

3.2.7. Deleting a Device Group

  • mx-dcmi command:

    mx-dcmi group --host [IP]:[PORT] -d 1


    On success, the output is:

    delete group 1 successfully
    
  • RESTful API:

    curl -i -X DELETE http://[IP]:[PORT]/api/v1/groups/1
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-Length: 0
    

3.3. Field Groups

Users can manage custom field (metric) groups and use them to monitor the status of devices or device groups.

FieldGroup 0 is the default field group and cannot be modified.

3.3.1. Usage Help

mx-dcmi fieldgroup -h

Running the command prints the following:

USAGE:
mx-dcmi fieldgroup [-hv] {-c <groupName>|-d|-i|-l|--list-fields} [-f
                     <fieldsId>] [-g <[0,63]>] [--host <IP:PORT>]
Where:
   --host <IP:PORT>
      The target host to connect.

   -g <[0,63]>, --fieldgroup <[0,63]>
      The field group to query on the remote host.

   -f <fieldsId>, --fields <fieldsId>
      Comma-separated list of field ids to add to a field group when
      creating a new one, or the target fields to query when listing
      available fields

   One of:
      -c <groupName>, --create <groupName>
      Create a field group on the remote host.

      -i, --info
      Get a field group info on the remote host.

      -d, --delete
      Delete a field group on the remote host.

      -l, --list
      List all field groups.

      --list-fields
      List all available fields.

   MetaX Data Center Management Interface

3.3.2. Creating a Field Group

The following command creates the field group create_fieldgroup and adds fields 1-4 and 25.

  • mx-dcmi command:

    mx-dcmi fieldgroup --host [IP]:[PORT] -c create_fieldgroup -f 1-4,25


    On success, the output is:

    create field group "create_fieldgroup" with a group ID of 1 successfully
    
  • RESTful API:

    curl -i -X POST http://[IP]:[PORT]/api/v1/fieldgroups/create_fieldgroup?fields=1-4,25
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-type: application/json
    Content-Length: 13
    {"groupid":1}
    
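The -f / fields argument accepts comma-separated IDs and ranges such as 1-4,25. A sketch of that expansion (my own helper, not part of mx-dcmi); the result matches the field IDs stored in group 1 above:

```python
def expand_fields(spec):
    """Expand a field spec like '1-4,25' into an explicit ID list."""
    ids = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return ids

print(expand_fields("1-4,25"))  # [1, 2, 3, 4, 25]
```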

3.3.3. Deleting a Field Group

  • mx-dcmi command:

    mx-dcmi fieldgroup --host [IP]:[PORT] -d -g 1


    On success, the output is:

    delete field group 1 successfully
    
  • RESTful API:

    curl -i -X DELETE http://[IP]:[PORT]/api/v1/fieldgroups/1
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-Length: 0
    

3.3.4. Viewing Field Group Information

  • mx-dcmi command:

    mx-dcmi fieldgroup --host [IP]:[PORT] -i -g 1


    On success, the output is:

    ============ MetaX Data Center Management Interface Log ==========
    Timestamp                                : Wed Nov 5 15:00:00 2025
    
    +----------------------------------------------------------------+
    | FIELD GROUP                                                    |
    +--------------------+-------------------------------------------+
    | Field Group ID     | 1                                         |
    | Field Group Name   | create_fieldgroup                         |
    | Field ID(s)        | 1, 2, 3, 4, 25                            |
    +--------------------+-------------------------------------------+
    
  • RESTful API:

    curl -i -X GET http://[IP]:[PORT]/api/v1/fieldgroups/1
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-type: application/json
    Content-Length: 62
    
    {"groupid":1,"name":"create_fieldgroup","fields":[1,2,3,4,25]}
    

3.3.5. Listing Field Groups

  • mx-dcmi command:

    mx-dcmi fieldgroup --host [IP]:[PORT] -l


    On success, the output is:

    =========== MetaX Data Center Management Interface Log ===========
    Timestamp                                : Wed Nov 5 15:20:00 2025
    
    +----------------------------------------------------------------+
    | FIELD GROUP                                                    |
    | 2 field groups found                                           |
    +--------------------+-------------------------------------------+
    | Field Group ID     | 0                                         |
    | Field Group Name   | DEFAULT_FIELD_GROUP                       |
    | Field ID(s)        | 11, 13, 23, 24, 25, 31, 32, 33            |
    +--------------------+-------------------------------------------+
    | Field Group ID     | 1                                         |
    | Field Group Name   | create_fieldgroup                         |
    | Field ID(s)        | 1, 2, 3, 4, 25                            |
    +--------------------+-------------------------------------------+
    
  • RESTful API:

    curl -i -X GET http://[IP]:[PORT]/api/v1/fieldgroups
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-type: application/json
    Content-Length: 150
    
    {"fieldgroups":[{"groupid":0,"name":"DEFAULT_FIELD_GROUP","fields":[11,13,23,24,25,31,32,33]},{"groupid":1,"name":"create_fieldgroup","fields":[1,2,3,4,25]}]}
    

3.3.6. Listing All Available Fields

  • mx-dcmi command:

    mx-dcmi fieldgroup --host [IP]:[PORT] --list-fields


    On success, the output is:

    ======= MetaX Data Center Management Interface Log ========
    Timestamp                         : Wed Nov 5 15:00:00 2025
    
    +----+-----------------------+----+-----------------------+
    | Id | Field Name            | Id | Field Name            |
    +----+-----------------------+----+-----------------------+
    | 0  | UNKNOWN               | 18 | CCX_CLOCK             |
    | 1  | DRIVER_VERSION        | 19 | GPU_TEMP              |
    | 2  | VBIOS_VERSION         | 20 | SOC_TEMP              |
    | 3  | FW_VERSION            | 21 | TEMP_LIMIT            |
    | 4  | SML_VERSION           | 22 | PMBUS_POWER           |
    | 5  | PROCESS_NAME          | 23 | BOARD_POWER           |
    | 6  | DEV_COUNT             | 24 | GPU_UTIL              |
    | 7  | DEV_NAME              | 25 | MEM_UTIL              |
    | 8  | DEV_SML_INDEX         | 26 | VPUE_UTIL             |
    | 9  | DEV_SERIAL            | 27 | VPUD_UTIL             |
    | 10 | DEV_BDFID             | 28 | PCIE_GEN              |
    | 11 | GPU_CLOCK             | 29 | PCIE_WIDTH            |
    | 12 | CSC_CLOCK             | 30 | PCIE_MAX_GEN          |
    | 13 | MC_CLOCK              | 31 | PCIE_TX               |
    | 14 | SOC_CLOCK             | 32 | PCIE_RX               |
    | 15 | DNOC_CLOCK            | 33 | ECC_STATE             |
    | 16 | VPUE_CLOCK            | 34 | MAX_FIELDS            |
    | 17 | VPUD_CLOCK            |    |                       |
    +----+-----------------------+----+-----------------------+
    
  • RESTful API:

    curl -i -X GET http://[IP]:[PORT]/api/v1/fields
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-type: application/json
    Content-Length: 1008
    
    {"fields":[{"id":0,"name":"UNKNOWN"},{"id":1,"name":"DRIVER_VERSION"},{"id":2,"name":"VBIOS_VERSION"},{"id":3,"name":"FW_VERSION"},{"id":4,"name":"SML_VERSION"},{"id":5,"name":"PROCESS_NAME"},{"id":6,"name":"DEV_COUNT"},{"id":7,"name":"DEV_NAME"},{"id":8,"name":"DEV_SML_INDEX"},{"id":9,"name":"DEV_SERIAL"},{"id":10,"name":"DEV_BDFID"},{"id":11,"name":"GPU_CLOCK"},{"id":12,"name":"CSC_CLOCK"},{"id":13,"name":"MC_CLOCK"},{"id":14,"name":"SOC_CLOCK"},{"id":15,"name":"DNOC_CLOCK"},{"id":16,"name":"VPUE_CLOCK"},{"id":17,"name":"VPUD_CLOCK"},{"id":18,"name":"CCX_CLOCK"},{"id":19,"name":"GPU_TEMP"},{"id":20,"name":"SOC_TEMP"},{"id":21,"name":"TEMP_LIMIT"},{"id":22,"name":"PMBUS_POWER"},{"id":23,"name":"BOARD_POWER"},{"id":24,"name":"GPU_UTIL"},{"id":25,"name":"MEM_UTIL"},{"id":26,"name":"VPUE_UTIL"},{"id":27,"name":"VPUD_UTIL"},{"id":28,"name":"PCIE_GEN"},{"id":29,"name":"PCIE_WIDTH"},{"id":30,"name":"PCIE_MAX_GEN"},{"id":31,"name":"PCIE_TX"},{"id":32,"name":"PCIE_RX"},{"id":33,"name":"ECC_STATE"},{"id":34,"name":"MAX_FIELDS"}]}
    

3.4. Status Monitoring

Users can monitor device data in real time. For the 曦云C588, information is displayed per die.

3.4.1. Usage Help

mx-dcmi dmon -h

Running the command prints the following:

USAGE:
   mx-dcmi dmon  [-hv] [-g <[0, 63]>|-i <gpu-id>] [-e <fieldIds>|-f <[0,
               63]>] [-c <count>] [-d <delay>] [--host <IP:PORT>]
Where:
   --host <IP:PORT>
   The target host to connect.

   -c <count>,  --count <count>
   Integer representing how many times to loop before exiting, 0
   represents forever.

   -d <delay>,  --delay <delay>
   Integer(ms) representing how often to query results from DCMD and
   print them.
   Default: 1000ms, minimum: 1000ms.

   Either of:
      -i <gpu-id>,  --gpu-id <gpu-id>
      Gpu index, run mx-dcmi discovery --host ip:port -l to check list of
      gpu ids.

      -g <[0, 63]>,  --group-id <[0, 63]>
      The gpu group to query on the specified host.

   Either of:
      -e <fieldIds>,  --field-id <fieldIds>
      Field identifier to view/inject.

      -f <[0, 63]>,  --field-group-id <[0, 63]>
      The field group to query on the specified host.

   --,  --ignore_rest
   Ignores the rest of the labeled arguments following this flag.

   -v,  --version
   Displays version information and exits.

   -h,  --help
   Displays usage information and exits.

   MetaX Data Center Management Interface

3.4.2. Starting Device Monitoring

  • mx-dcmi command:

    mx-dcmi dmon --host [IP]:[PORT] -i 0 -e 11 -d 2000 -c 5

    • -e (field ID) may be replaced with -f (field group ID).

    • -d (optional) sets the monitoring interval; the default is 1 s.

    • -c (optional) sets the number of samples; if omitted, monitoring runs until stopped.

    On success, the output is:

    gpu   die   GPU_CLOCK
    id    id    MHz
    0     0     428
    0     0     428
    0     0     428
    0     0     428
    0     0     428
    
  • RESTful API:

    curl -i -X GET http://[IP]:[PORT]/api/v1/stats?gpu=0\&fields=11\&delay=2000\&count=5

    • fields (field IDs) may be replaced with fieldgroup (field group ID).

    • delay (optional) sets the monitoring interval; the default is 1 s.

    • count (optional) sets the number of samples; if omitted, monitoring runs until stopped.

    On success, the output is:

    HTTP/1.1 200 OK
    Content-Type: text/event-stream
    
    {"stats":[{"gpuid":0,"dieid":0,"gpustats":[{"fieldid":11,"fieldname":"GPU_CLOCK","fieldunit":"MHz","value":"428"}]}]}
    
    {"stats":[{"gpuid":0,"dieid":0,"gpustats":[{"fieldid":11,"fieldname":"GPU_CLOCK","fieldunit":"MHz","value":"428"}]}]}
    
    {"stats":[{"gpuid":0,"dieid":0,"gpustats":[{"fieldid":11,"fieldname":"GPU_CLOCK","fieldunit":"MHz","value":"428"}]}]}
    
    {"stats":[{"gpuid":0,"dieid":0,"gpustats":[{"fieldid":11,"fieldname":"GPU_CLOCK","fieldunit":"MHz","value":"428"}]}]}
    
    {"stats":[{"gpuid":0,"dieid":0,"gpustats":[{"fieldid":11,"fieldname":"GPU_CLOCK","fieldunit":"MHz","value":"428"}]}]}
    
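The stats endpoint replies with Content-Type: text/event-stream and emits one JSON object per non-empty line. A sketch of parsing such a stream; the sample line is copied from the output above, and reading from the HTTP connection is omitted:

```python
import json

def parse_stats_stream(lines):
    """Yield (gpuid, fieldname, value) from JSON lines of /api/v1/stats."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip the blank separator lines between samples
        for gpu in json.loads(line)["stats"]:
            for stat in gpu["gpustats"]:
                yield gpu["gpuid"], stat["fieldname"], stat["value"]

sample = ['{"stats":[{"gpuid":0,"dieid":0,"gpustats":[{"fieldid":11,'
          '"fieldname":"GPU_CLOCK","fieldunit":"MHz","value":"428"}]}]}',
          ""]
for gpuid, name, value in parse_stats_stream(sample):
    print(gpuid, name, value)  # 0 GPU_CLOCK 428
```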

3.4.3. Starting Device Group Monitoring

  • mx-dcmi command:

    mx-dcmi dmon --host [IP]:[PORT] -g 0 -e 11 -d 2000 -c 5

    • -e (field ID) may be replaced with -f (field group ID).

    • -d (optional) sets the monitoring interval; the default is 1 s.

    • -c (optional) sets the number of samples; if omitted, monitoring runs until stopped.

    On success, the output is:

    gpu   die   GPU_CLOCK
    id    id    MHz
    0     0     428
    1     0     428
    0     0     428
    1     0     428
    0     0     428
    1     0     428
    0     0     428
    1     0     428
    0     0     428
    1     0     428
    
  • RESTful API:

    curl -i -X GET http://[IP]:[PORT]/api/v1/stats?gpugroup=0\&fields=11\&delay=2000\&count=5

    • fields (field IDs) may be replaced with fieldgroup (field group ID).

    • delay (optional) sets the monitoring interval; the default is 1 s.

    • count (optional) sets the number of samples; if omitted, monitoring runs until stopped.

    On success, the output is:

    HTTP/1.1 200 OK
    Content-Type: text/event-stream
    
    {"stats":[{"gpuid":0,"dieid":0,"gpustats":[{"fieldid":11,"fieldname":"GPU_CLOCK","fieldunit":"MHz","value":"428"}]},{"gpuid":1,"dieid":0,"gpustats":[{"fieldid":11,"fieldname":"GPU_CLOCK","fieldunit":"MHz","value":"428"}]}]}
    
    {"stats":[{"gpuid":0,"dieid":0,"gpustats":[{"fieldid":11,"fieldname":"GPU_CLOCK","fieldunit":"MHz","value":"428"}]},{"gpuid":1,"dieid":0,"gpustats":[{"fieldid":11,"fieldname":"GPU_CLOCK","fieldunit":"MHz","value":"428"}]}]}
    
    {"stats":[{"gpuid":0,"dieid":0,"gpustats":[{"fieldid":11,"fieldname":"GPU_CLOCK","fieldunit":"MHz","value":"428"}]},{"gpuid":1,"dieid":0,"gpustats":[{"fieldid":11,"fieldname":"GPU_CLOCK","fieldunit":"MHz","value":"428"}]}]}
    
    {"stats":[{"gpuid":0,"dieid":0,"gpustats":[{"fieldid":11,"fieldname":"GPU_CLOCK","fieldunit":"MHz","value":"428"}]},{"gpuid":1,"dieid":0,"gpustats":[{"fieldid":11,"fieldname":"GPU_CLOCK","fieldunit":"MHz","value":"428"}]}]}
    
    {"stats":[{"gpuid":0,"dieid":0,"gpustats":[{"fieldid":11,"fieldname":"GPU_CLOCK","fieldunit":"MHz","value":"428"}]},{"gpuid":1,"dieid":0,"gpustats":[{"fieldid":11,"fieldname":"GPU_CLOCK","fieldunit":"MHz","value":"428"}]}]}
    

3.4.4. Stopping Monitoring

Press Ctrl+C to stop, or limit the number of samples with -c/--count.

3.5. Job-Level Data Collection

mx-dcmd can collect and analyze data in the background, which makes it easy to gather job-level statistics.

The background collection workflow is shown in Figure 3.1.

../_images/CollectDataInBackground.png

Figure 3.1 Background data collection workflow

3.5.1. Usage Help

mx-dcmi stats -h

Running the command prints the following:

USAGE:
   mx-dcmi stats [-hv] {-j <jobName>|-l|-r <jobName>|-s <jobName>|-x <jobName>} [-g <[0,63]>] [-m <duration>] [-u <interval>] [--host <IP:PORT>]

Where:
   --host <IP:PORT>
      The target host to connect.

   -g <[0,63]>, --group <[0,63]>
      The gpu group to query on the job.

   -u <interval>, --interval <interval>
      How often to update the job stats in ms, default: 2000.

   -m <duration>, --duration <duration>
      How long to run the job in s, default: run till stop.

One of:
      -s <jobName>, --jstart <jobName>
      Start recording job statistics.

      -x <jobName>, --jstop <jobName>
      Stop recording job statistics.

      -r <jobName>, --jremove <jobName>
      Remove job statistics.

      -j <jobName>, --job <jobName>
      Display job statistics.

      -l, --list
      List all jobs.

   --, --ignore_rest
      Ignores the rest of the labeled arguments following this flag.

   -v, --version
      Displays version information and exits.

   -h, --help
      Displays usage information and exits.

   MetaX Data Center Management Interface

3.5.2. Starting a Background Collection Job

The following command creates and starts test_job, collecting data for group 0 at a 2000 ms update interval and stopping after 6 updates.

Here the number of updates is 11 s / 2000 ms ≈ 6.

  • mx-dcmi command:

    mx-dcmi stats --host [IP]:[PORT] -s test_job -g 0 -u 2000 -m 11


    On success, the output is:

    Successfully started recording stats for test_job
    
  • RESTful API:

    curl -i -X POST http://[IP]:[PORT]/api/v1/job/test_job?group=0\&interval=2000\&duration=11
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-Length: 0
    
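The number of updates a job records follows from -m (duration, in seconds) and -u (interval, in milliseconds): the job above runs for 11 s at a 2000 ms interval, giving 6 updates. A sketch of that arithmetic, assuming the daemon rounds up (the exact rounding behavior is not specified here):

```python
import math

def update_count(duration_s, interval_ms):
    """Updates recorded for a job of duration_s seconds at interval_ms."""
    return math.ceil(duration_s * 1000 / interval_ms)

print(update_count(11, 2000))  # 6, matching the example above
```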

3.5.3. Listing All Background Collection Jobs

  • mx-dcmi command:

    mx-dcmi stats --host [IP]:[PORT] -l


    On success, the output is:

    =========== MetaX Data Center Management Interface Log ============
    Timestamp                                 : Wed Nov 5 15:30:00 2025
    
    +----------------------------------------------------------------+
    | Job List                                                       |
    +-------------+----------------+---------------------------------+
    | Job Name    | State          | Gpus                            |
    +-------------+----------------+---------------------------------+
    | test_job2   | running        | 0, 1, 2, 3, 4, 5, 6, 7          |
    | test_job    | stopped        | 0, 1, 2, 3, 4, 5, 6, 7          |
    +-------------+----------------+---------------------------------+
    
  • RESTful API:

    curl -i -X GET http://[IP]:[PORT]/api/v1/jobs
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-Length: 143
    
    {"jobs":[{"jobname":"test_job2","state":"running","gpus":[0,1,2,3,4,5,6,7]},{"jobname":"test_job","state":"stopped","gpus":[0,1,2,3,4,5,6,7]}]}
    

3.5.4. Stopping a Background Collection Job

  • mx-dcmi command:

    mx-dcmi stats --host [IP]:[PORT] -x test_job


    On success, the output is:

    Successfully stopped recording stats for test_job
    
  • RESTful API:

    curl -i -X PUT http://[IP]:[PORT]/api/v1/job/test_job?status=stop
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-Length: 0
    

3.5.5. Displaying Data for a Background Job

  • mx-dcmi command:

    mx-dcmi stats --host [IP]:[PORT] -j test_job


    On success, the output is:

    ============ MetaX Data Center Management Interface Log =============
    Timestamp                                   : Wed Nov 5 15:00:00 2025
    
    +-------------------------------------------------------------------+
    | Summary                                                           |
    +-------------------------------+-----------------------------------+
    | ----- Execution Stats ------- +-----------------------------------|
    | Start Time                    | Wed Nov 6 15:00:00                |
    | Update Time                   | Wed Nov 6 15:00:12                |
    | Total Execution Time (sec)    | 12                                |
    | No. of Processes              | 0                                 |
    | ----- Power Usage (Watts)  ---+-----------------------------------|
    | GPU0                          | Avg: 58.7, Max: 59.3 Min: 58.1    |
    | GPU1                          |-Avg: 60.7, Max: 61.1 Min: 60.5    |
    | GPU2                          | Avg: 56.8, Max: 57.1 Min: 56.5    |
    | GPU3                          | Avg: 55.2, Max: 55.5 Min: 55.0    |
    | GPU4                          | Avg: 62.5, Max: 62.8 Min: 62.3    |
    | GPU5                          | Avg: 53.4, Max: 53.6 Min: 53.1    |
    | GPU6                          | Avg: 55.6, Max: 55.9 Min: 55.1    |
    | GPU7                          | Avg: 57.0, Max: 57.1 Min: 57.0    |
    | --- Max Memory Utilization(%) +-----------------------------------|
    | GPU0                          | 1.3                               |
    | GPU1                          | 1.3                               |
    | GPU2                          | 0.6                               |
    | GPU3                          | 0.8                               |
    | GPU4                          | 1.3                               |
    | GPU5                          | 4.4                               |
    | GPU6                          | 5.6                               |
    | GPU7                          | 20.1                              |
    | -----  Event Stats -----------+-----------------------------------|
    | Single Bit ECC Errors         | Not Specified                     |
    | Double Bit ECC Errors         | Not Specified                     |
    | -----  Overall Health --------+-----------------------------------|
    | Overall Health                | Healthy                           |
    +-------------------------------+-----------------------------------+
    
  • RESTful API:

    curl -i -X GET http://[IP]:[PORT]/api/v1/job/test_job
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-type: application/json
    Content-Length: 830
    
    {"starttime":"Wed Nov 6 15:00:00","updatetime":"Wed Nov 6 15:00:12","executetime":12,"processnumber":0,"devicenumber":8,"gpustatistics":[{"gpuid":0,"maxmemoryused":936048,"avgpower":58864,"maxpower":59262,"minpower":58055},{"gpuid":1,"maxmemoryused":936048,"avgpower":58864,"maxpower":59262,"minpower":58055},{"gpuid":2,"maxmemoryused":936048,"avgpower":58864,"maxpower":59262,"minpower":58055},{"gpuid":3,"maxmemoryused":936048,"avgpower":58864,"maxpower":59262,"minpower":58055},{"gpuid":4,"maxmemoryused":936048,"avgpower":58864,"maxpower":59262,"minpower":58055},{"gpuid":5,"maxmemoryused":936048,"avgpower":58864,"maxpower":59262,"minpower":58055},{"gpuid":6,"maxmemoryused":936048,"avgpower":58864,"maxpower":59262,"minpower":58055},{"gpuid":7,"maxmemoryused":936048,"avgpower":57023,"maxpower":57104,"minpower":56960}]}
    
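The job JSON reduces to a per-GPU summary in a few lines. Note that the raw power values (e.g. avgpower 58864 against roughly 58-59 W in the table view) suggest milliwatts, but that unit is an assumption here; the sketch keeps only the first GPU entry from the body above:

```python
import json

# Trimmed sample from GET /api/v1/job/test_job (first GPU entry only).
body = ('{"starttime":"Wed Nov 6 15:00:00","updatetime":"Wed Nov 6 15:00:12",'
        '"executetime":12,"processnumber":0,"devicenumber":8,'
        '"gpustatistics":[{"gpuid":0,"maxmemoryused":936048,'
        '"avgpower":58864,"maxpower":59262,"minpower":58055}]}')

def avg_power_watts(payload):
    """Per-GPU average power, converting the raw value (assumed mW) to W."""
    job = json.loads(payload)
    return {g["gpuid"]: g["avgpower"] / 1000.0 for g in job["gpustatistics"]}

print(avg_power_watts(body))
```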

3.5.6. Removing a Background Collection Job

  • mx-dcmi command:

    mx-dcmi stats --host [IP]:[PORT] -r test_job


    On success, the output is:

    Successfully removed job test_job
    
  • RESTful API:

    curl -i -X DELETE http://[IP]:[PORT]/api/v1/job/test_job
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-Length: 0
    

3.6. Device Health Monitoring

Users can check device health status.

3.6.1. Usage Help

mx-dcmi health -h

Running the command prints the following:

USAGE:
   mx-dcmi health  [-hv] [-c|-d|-e <type>|--clear] [-g <[0, 63]>] [--host
                  <IP:PORT>]
Where:

   --host <IP:PORT>
   The target host to connect.

   -g <[0, 63]>,  --group <[0, 63]>
   Group id

   Either of:
      -c,  --check
        Check to see if any errors or warnings have occurred in the enabled
        monitors.

      -d,  --disable
        Disable all monitors.

      -e <type>,  --enable <type>
        Enable specified monitor(s).
        a - all monitors
        l - log monitor
        e - event monitor
        Note: 'le' is valid for enabling log&event monitors

      --clear
        Clear errors and warnings detected in the enabled monitors.
        Note: driver exception won't be cleared.

   --,  --ignore_rest
     Ignores the rest of the labeled arguments following this flag.

   -v,  --version
     Displays version information and exits.

   -h,  --help
     Displays usage information and exits.

   MetaX Data Center Management Interface

3.6.2. Checking Device Health Status

The following command checks whether any errors or warnings have occurred for device group #1 in the enabled monitors.

  • mx-dcmi command:

    mx-dcmi health --host [IP]:[PORT] -c -g 1


    On success, the output is:

    =========== MetaX Data Center Management Interface Log ===========
    Timestamp                               : Mon Nov 10 11:59:29 2025
    
    +----------------+-----------------------------------------------+
    | Overall Health | Healthy                                       |
    +----------------+-----------------------------------------------+
    | GPU ID: 0      | Healthy                                       |
    +----------------+-----------------------------------------------+
    | GPU ID: 1      | Healthy                                       |
    +----------------+-----------------------------------------------+
    
  • RESTful API:

    curl -i -X GET http://[IP]:[PORT]/api/v1/health/gpugroup/1
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-type: application/json
    Content-Length: 254
    
    {"grouphealthinfo":"Healthy","driverexception":[],"gpuhealthinfo":[{"gpuid":0,"dieid":0,"healthinfo":{"errorlog":[],"fatallog":[],"hwexception":[],"memoryexception":[]}},{"gpuid":1,"dieid":0,"healthinfo":{"errorlog":[],"fatallog":[],"hwexception":[],"memoryexception":[]}}]}
    

3.6.3. Enabling Monitors

The following command enables the Log and Event monitors.

  • mx-dcmi command:

    mx-dcmi health --host [IP]:[PORT] -e le


    On success, the output is:

    Successfully enable health monitor, type 3.
    
  • RESTful API:

    curl -i -X POST http://[IP]:[PORT]/api/v1/health/3
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-Length: 0
    
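The <type> letters from -e map to the numeric monitor type used in the REST path; the output above shows 'le' becoming type 3. A sketch assuming log = bit 0 and event = bit 1, which is consistent with 'le' → 3 and with 'a' enabling both, though the exact encoding is an assumption on my part:

```python
MONITOR_BITS = {"l": 1, "e": 2}  # assumed: log monitor = 1, event monitor = 2

def monitor_type(spec):
    """Translate an -e <type> string such as 'l', 'e', 'le' or 'a'."""
    if spec == "a":
        spec = "le"  # 'a' means all monitors
    mask = 0
    for ch in spec:
        mask |= MONITOR_BITS[ch]
    return mask

print(monitor_type("le"))  # 3 -> POST /api/v1/health/3
```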

3.6.4. Disabling All Monitors

The following command disables all monitors.

  • mx-dcmi command:

    mx-dcmi health --host [IP]:[PORT] -d


    On success, the output is:

    Successfully disable health monitors.
    
  • RESTful API:

    curl -i -X DELETE http://[IP]:[PORT]/api/v1/health
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-Length: 0
    

3.6.5. Clearing Alerts

The following command clears the alerts already detected by the enabled monitors for device group #1.

  • mx-dcmi command:

    mx-dcmi health --host [IP]:[PORT] --clear -g 1


    On success, the output is:

    Successfully clear group 1 health monitors.
    
  • RESTful API:

    curl -i -X DELETE http://[IP]:[PORT]/api/v1/health/gpugroup/1
    

    On success, the output is:

    HTTP/1.1 200 OK
    Content-Length: 0
    

3.7. Device Self-Test

Users can run the self-test (diagnostics) feature to check device condition. Disabling virtualization before running a self-test is recommended.

3.7.1. Usage Help

mx-dcmi diag -h

Running the command prints the following:

USAGE:
   mx-dcmi diag  [-hjv] [-c <PATH>] [-g <[0, 63]>] [-r <1|2|3|4|software|pcie|mxlk|ops|mem>] [--host <IP:PORT>]

Where:
   --host <IP:PORT>
     The target host to connect.

   -g <[0, 63]>,  --group <[0, 63]>
     The group ID to query.

   -c <PATH>,  --configfile <PATH>
     Path to the configuration file in json format.

   -j,  --json
     Print detailed result in json format.

   --generate-template
      Generate diag config template file.

   -r <1|2|3|4|software|pcie|mxlk|ops|mem>,  --run <1|2|3|4|software|pcie|mxlk|ops|mem>
     Run a diagnostic.
     (Note: higher numbered tests include all beneath.)
     1 - Quick (System Validation ~ seconds)
     2 - Medium (Extended System Validation ~ 2 minutes)
     3 - Long (System HW Diagnostics ~ 15 minutes)
     4 - Extended (Longer-running System HW Diagnostics)
     Specific module diagnostic to run by name
     Module name: software, pcie, mxlk, ops, mem

   -p <MXC500|MXC550|MXC588>, --product <MXC500|MXC550|MXC588>
     Specify metax product name to generate diag config file.
     Supported product: MXC500, MXC550, MXC588

   -s, --short
     Start the diagnostic and disconnect
     get the status by -l, get the result by -f

   Either of:
     -l, --list
       List the diagnostic status.

     -f, --fetch
       Get the diagnostic info
       specify gpu group id by -g, default 0

   --,  --ignore_rest
     Ignores the rest of the labeled arguments following this flag.

   -v,  --version
   Displays version information and exits.

   -h,  --help
   Displays usage information and exits.

MetaX Data Center Management Interface

3.7.2. Diagnostic Items

3.7.2.1. Diagnostic Item Descriptions

Table 3.1 Diagnostic item descriptions

  • MxSml check (Level 1, Deployment): checks whether libmxsml.so is available.

  • Maca check (Level 1, Deployment): checks whether the MXMACA software stack is available.

  • Permission Check (Level 1, Deployment): checks user permissions.

  • Compiler Check (Level 1, Deployment): checks whether the compiler is available.

  • Nouveau Check (Level 1, Deployment): checks whether the Nouveau driver is disabled.

  • Environment Variable Check (Level 1, Deployment): checks environment variable settings.

  • Power Mode Check (Level 1, Deployment): checks whether high-power mode is enabled; view it with mx-smi --show-power-mode. High means it is enabled, Normal means it is not.

  • PCIe (Level 2, Integration): checks PCIe speed (GT/s), lanes, and maximum bandwidth (MB/s).

  • MetaXLink (Level 2, Integration): checks MetaXLink speed (GT/s), lanes, and maximum bandwidth (MB/s).

  • OPS (Level 2, Hardware): compute-throughput tests at INT8, BF16, FP16, FP32_vector, FP32_matrix, and TF32 precision. Corresponds to the ops item in the diag configuration file and the perf item in the diag result.

  • Memory (Level 2, Hardware): large HBM allocation test and HBM bandwidth test.

  • Gpu Burn (Level 3, Hardware): stress test that checks power draw and utilization. Corresponds to the stress item in both the configuration file and the result.

  • Pulse Test (Level 4, Hardware): power-fluctuation test.

  • Memory Test (Level 4, Hardware): 11 types of memory tests. Specific types can be selected via test_indexes; by default all 11 are run.
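The test_indexes selection described for Memory Test could look like the fragment below. This shape is illustrative only: test_indexes is the key named above, but the surrounding "memory_test" key and the index values shown are assumptions; compare against the file produced by --generate-template before using it.

```json
{
    "memory_test": {
        "test_indexes": [0, 3, 7]
    }
}
```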

3.7.2.2. Diagnostic Durations

Table 3.2 Diagnostic duration by level (reference values)

  • Level 1: < 2 seconds

  • Level 2: < 2 minutes

  • Level 3: < 30 minutes

  • Level 4: < 2 hours

3.7.3. Running Diagnostics

3.7.3.1. Diagnostics with Default Baselines

Users can run Level 1 through Level 4 diagnostics directly against the default baselines; currently only MXC500, MXC550, and MXC588 are supported. Module-specific diagnostics still require a configuration file.

mx-dcmi command:

mx-dcmi diag --host [IP]:[PORT] -r {1|2|3|4}

3.7.3.2. Diagnostics with User-Defined Baselines

The following command generates the default baseline configuration file diag-config-MXC500.json:

mx-dcmi command:

mx-dcmi diag --generate-template --host [IP]:[PORT] --product {MXC500|MXC550|MXC588}

Here --product is the product name; MXC500, MXC550, and MXC588 are currently supported. If omitted, an MXC500 configuration file is generated by default.

No configuration file is needed for Level 1 diagnostics. Users can adjust the generated baselines to match their expectations; to skip a particular check, delete that item's entire section from the file.

The following command runs diagnostics with the user-customized configuration file diag-config-MXC500-modified.json:

mx-dcmi command:

mx-dcmi diag --host [IP]:[PORT] -r {2|3|4} -c diag-config-MXC500-modified.json
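Deleting an item's section to skip its check can also be scripted. The sketch below assumes the generated file is a flat JSON object whose top-level keys are the check items; the "ops" and "stress" keys are taken from the item descriptions above, but inspect your own generated diag-config-MXC500.json before relying on this layout.

```python
import json

# Sketch: drop one check item ("ops" here) from a generated diag config
# so that check is skipped. The file layout is an assumption; verify it
# against your own --generate-template output.
config = json.loads('{"ops": {"int8": 100}, "stress": {"power": 350}}')
config.pop("ops", None)  # delete the whole section to skip the check
print(json.dumps(config))
```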

3.7.4. Diagnostic Examples

3.7.4.1. Level 1 Diagnostics

mx-dcmi command:

mx-dcmi diag --host [IP]:[PORT] -r 1

On success, the output is as follows:

============= MetaX Data Center Management Interface Log =============
Timestamp                                   : Mon Nov 10 14:36:07 2025
+-----------------------------+--------------------------------------+
| Diagnostic                  | Result                               |
+-----------------------------+--------------------------------------+
| ------ Deployment ----------+--------------------------------------|
| MxSml Library               | Pass                                 |
| Maca Library                | Pass                                 |
| Permission Check            | Pass                                 |
| Compiler Check              | Pass                                 |
| Nouveau Check               | Pass                                 |
| Environment Variable Check  | Pass                                 |
| Power Mode Check            | Normal                               |
+-----------------------------+--------------------------------------+

3.7.4.2. Diagnostics via RESTful API

Get the MetaXLink diagnostic result; the -d payload carries the expected MetaXLink baselines. The MetaXLink test is a Level 2 check; the other Level 2 items are not run and are reported as skip.

RESTful API:

curl -i -X POST -d '{"metaxlink": {"bw_uni_p2p": 92000,"bw_bi_p2p": 181000,"speed": 32,"width": 16}}' http://[IP]:[PORT]/api/v1/diag/mxlk

On success, the output is as follows:

HTTP/1.1 200 OK
Content-type: application/json
Content-Length: 1126
{
   "software": {
         "mxsml": "pass",
         "maca": "pass",
         "permission": "pass",
         "compiler": "pass",
         "nouveau": "pass",
         "envvar": "pass",
         "powermode": "normal"
   },
   "pcie": {
         "result": "skip"
   },
   "metaxlink": {
         "result": "pass",
   "p2p_uni": [
      {
         "src": 0,
         "dst": 1,
         "bandwidth": 101677
      },
      {
         "src": 1,
         "dst": 0,
         "bandwidth": 98684
      }
   ],
   "p2p_bi": [
      {
         "src": 0,
         "dst": 1,
         "bandwidth": 191145
      }
   ],
   "mxlkspeedwidth": [
      {
         "gpuid": 0,
         "speed": 32.0,
         "width": 16
      },
      {
         "gpuid ": 1,
         "speed": 32.0,
         "width": 16
      }
   ]
   },
   "perf": {
         "result": "skip"
   },
   "memory": {
         "memAllocResult": "skip",
         "memBwResult": "skip"
   }
}
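The returned JSON can be checked programmatically. The sketch below uses a trimmed copy of the response above (field names come from that response, thresholds from the -d payload) and verifies that every measured p2p bandwidth meets the expected minimum.

```python
import json

# Trimmed version of the /api/v1/diag/mxlk response shown above.
response = json.loads("""
{"metaxlink": {"result": "pass",
  "p2p_uni": [{"src": 0, "dst": 1, "bandwidth": 101677},
              {"src": 1, "dst": 0, "bandwidth": 98684}],
  "p2p_bi":  [{"src": 0, "dst": 1, "bandwidth": 191145}]}}
""")

expected = {"bw_uni_p2p": 92000, "bw_bi_p2p": 181000}  # from the -d payload
mxlk = response["metaxlink"]
uni_ok = all(r["bandwidth"] >= expected["bw_uni_p2p"] for r in mxlk["p2p_uni"])
bi_ok = all(r["bandwidth"] >= expected["bw_bi_p2p"] for r in mxlk["p2p_bi"])
print(uni_ok and bi_ok and mxlk["result"] == "pass")  # True
```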

3.7.4.3. Module-Specific Diagnostics

Users can run a single diagnostic module with a custom configuration file; HBM is used as the example here.

mx-dcmi command:

mx-dcmi diag --host [IP]:[PORT] -r mem -c mxdcm-diag.json

On success, the output is as follows:

============= MetaX Data Center Management Interface Log =============
Timestamp                                   : Mon Nov 10 14:36:07 2025
+-----------------------------+--------------------------------------+
| Diagnostic                  | Result                               |
+-----------------------------+--------------------------------------+
| ------ Deployment ----------+--------------------------------------|
| MxSml Library               | Pass                                 |
| Maca Library                | Pass                                 |
| Permission Check            | Pass                                 |
| Compiler Check              | Pass                                 |
| Nouveau Check               | Pass                                 |
| Environment Variable Check  | Pass                                 |
| Power Mode Check            | Normal                               |
+-----------------------------+--------------------------------------+
| ------ Hardware ------------+--------------------------------------|
| Memory MAX                  | Pass                                 |
| Memory Bandwidth            | Pass                                 |
+-----------------------------+--------------------------------------+

3.7.4.4. Querying Diagnostic Status and Results

Note

This feature supports only Level 1 through Level 4 diagnostics; module-specific diagnostics are not yet supported.

Use the --short, -l, and --fetch options together to query diagnostic status and results.

  1. Run Level 3 diagnostics.

    mx-dcmi command:

    mx-dcmi diag --host [IP]:[PORT] -r 3 --short
    

    On success, the output is as follows:

    Diagnostic request has started successfully
    Use mx-dcmi diag -l/-f to get diagnostic status/result.
    
  2. Query the Level 3 diagnostic status.

    mx-dcmi command:

    mx-dcmi diag --host [IP]:[PORT] -l
    

    On success, the output is as follows:

    =========== MetaX Data Center Management Interface Log =========
    Timestamp                              : Tue Nov 4 14:36:07 2025
    +-------------------------+------------------------------------+
    | Group Id                | Result                             |
    +-------------------------+------------------------------------+
    | 0                       | Pass                               |
    +-------------------------+------------------------------------+
    
  3. Once the status changes from Running to Pass or Fail, the Level 3 diagnostic result can be fetched.

    mx-dcmi command:

    mx-dcmi diag --host [IP]:[PORT] -f
    

    On success, the output is as follows:

    ============= MetaX Data Center Management Interface Log =============
    Timestamp                                    : Tue Nov 4 14:36:07 2025
    +-----------------------------+--------------------------------------+
    | Diagnostic                  | Result                               |
    +-----------------------------+--------------------------------------+
    | ------ Deployment ----------+--------------------------------------|
    | MxSml Library               | Pass                                 |
    | Maca Library                | Pass                                 |
    | Permission Check            | Pass                                 |
    | Compiler Check              | Pass                                 |
    | Nouveau Check               | Pass                                 |
    | Environment Variable Check  | Pass                                 |
    | Power Mode Check            | Normal                               |
    +-----------------------------+--------------------------------------+
    | ------ Integration ---------+--------------------------------------|
    | PCIe                        | Pass                                 |
    | MetaXLink                   | Pass                                 |
    +-----------------------------+--------------------------------------+
    | ------ Hardware ------------+--------------------------------------|
    | Ops                         | Pass                                 |
    | Memory MAX                  | Pass                                 |
    | Memory Bandwidth            | Pass                                 |
    | Gpu Burn                    | Pass                                 |
    | Pulse Test                  | Skip                                 |
    | Memory Test                 | Skip                                 |
    +-----------------------------+--------------------------------------+
    

3.7.4.5. Diagnostics on Specified Devices

To diagnose only specified devices, first create a device group (see Section 3.2.2), add the devices to the group (see Section 3.2.3), and then run diagnostics.

The following command runs Level 2 diagnostics on the device group with Group ID 1.

mx-dcmi command:

mx-dcmi diag --host [IP]:[PORT] -r 2 -g 1

On success, the output is as follows:

============= MetaX Data Center Management Interface Log =============
Timestamp                                   : Tue Nov 4 14:36:07 2025
+-----------------------------+--------------------------------------+
| Diagnostic                  | Result                               |
+-----------------------------+--------------------------------------+
| ------ Deployment ----------+--------------------------------------|
| MxSml Library               | Pass                                 |
| Maca Library                | Pass                                 |
| Permission Check            | Pass                                 |
| Compiler Check              | Pass                                 |
| Nouveau Check               | Pass                                 |
| Environment Variable Check  | Pass                                 |
| Power Mode Check            | Normal                               |
+-----------------------------+--------------------------------------+
| ------ Integration ---------+--------------------------------------|
| PCIe                        | Pass                                 |
| MetaXLink                   | Pass                                 |
+-----------------------------+--------------------------------------+
| ------ Hardware ------------+--------------------------------------|
| Ops                         | Pass                                 |
| Memory MAX                  | Pass                                 |
| Memory Bandwidth            | Pass                                 |
+-----------------------------+--------------------------------------+

3.7.4.6. Displaying Diagnostic Results in JSON

The following command runs Level 1 diagnostics and prints the result in JSON format.

mx-dcmi command:

mx-dcmi diag --host [IP]:[PORT] -r 1 -j

On success, the output is as follows:

============= MetaX Data Center Management Interface Log =============
Timestamp                                  : Mon Nov 10 14:36:26 2025

{
   "software": {
      "mxsml": "pass",
      "maca": "pass",
      "permission": "pass",
      "compiler": "pass",
      "nouveau": "pass",
      "envvar": "pass",
      "powermode": "normal"
   }
}
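With -j output, pass/fail can be evaluated in scripts. A sketch over the Level 1 JSON shown above; note that powermode reports a mode (normal/high) rather than a verdict, so it is excluded from the strict comparison.

```python
import json

# The Level 1 diag -j output shown above.
result = json.loads("""
{"software": {"mxsml": "pass", "maca": "pass", "permission": "pass",
              "compiler": "pass", "nouveau": "pass", "envvar": "pass",
              "powermode": "normal"}}
""")

software = result["software"]
# "powermode" is a mode (normal/high), not a pass/fail verdict.
checks = {k: v for k, v in software.items() if k != "powermode"}
print(all(v == "pass" for v in checks.values()))  # True
```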

3.8. Topology Query

Users can query the interconnect topology of the GPU devices.

3.8.1. Usage Help

mx-dcmi topo -h

The command output is as follows:

USAGE:

   mx-dcmi topo  [-hv] {-g <[0, 63]>|--gpuid <gpuid>} [--host <IP:PORT>]


Where:

   --host <IP:PORT>
     The target host to connect.

   One of:
      --gpuid <gpuid>
        The gpu ID to query.

      -g <[0, 63]>,  --group <[0, 63]>
        The group ID to query.

   --,  --ignore_rest
     Ignores the rest of the labeled arguments following this flag.

   -v,  --version
     Displays version information and exits.

   -h,  --help
     Displays usage information and exits.

   MetaX Data Center Management Interface

3.8.2. Viewing Device Topology

mx-dcmi command:

mx-dcmi topo --host [IP]:[PORT] {-g <[0, 63]>|--gpuid <gpuid>}

On success, the output is as follows:

============= MetaX Data Center Management Interface Log ============
Timestamp                                   : Wed Nov 19 22:04:47 2025

+-------------------+-----------------------------------------------+
| GPU ID: 0         | Topology Information                          |
+-------------------+-----------------------------------------------+
| CPU Core Affinity | 0-39,80-119                                   |
+-------------------+-----------------------------------------------+
| To GPU 1          | Connected via a metaxlink                     |
+-------------------+-----------------------------------------------+
| To GPU 2          | Connected via a metaxlink                     |
+-------------------+-----------------------------------------------+
| To GPU 3          | Connected via a metaxlink                     |
+-------------------+-----------------------------------------------+
| To GPU 4          | Connected via a CPU-level link (the same numa)|
+-------------------+-----------------------------------------------+
| To GPU 5          | Connected via a CPU-level link (the same numa)|
+-------------------+-----------------------------------------------+
| To GPU 6          | Connected via a CPU-level link (the same numa)|
+-------------------+-----------------------------------------------+
| To GPU 7          | Connected via a CPU-level link (the same numa)|
+-------------------+-----------------------------------------------+

End of Log

3.9. NIC Diagnostics

Users can run self-diagnostics on the NICs of a single node, and run single-node and cross-node NIC performance and stress tests.

3.9.1. Usage Help

mx-dcmi nic -h

The command output is as follows:

USAGE:

   mx-dcmi nic  [-hjlv] [--generate-template] [--show-qos-info][-c <PATH>]
                [-r <1|2|3>] [--client-gpu-list <clientgpulist>]
                [--client-nic-list <clientniclist>] [--gpu-id <gpuid>]
                [--host <IP:PORT>] [--nic-id <nicid>] [--operation <[0,
                2]>] [--server-gpu-id <servergpuid>] [--server-nic-id
                <servernicid>] [--server-num <servernum>] [--socket-address
                <socketaddress>] [--test-mode <[0, 3]>] [--type <[0, 1]>]


Where:

   -l,  --list
   List all Nic info on the host.

   --host <IP:PORT>
   The target host to connect.

   --test-mode <[0, 3]>
   Query diagnostic test-mode:
   0 - do diagnostic on the target Nic
   1 - do burn diagnostic on the target Nic
   2 - start the target Nic as burn diagnostic server
   3 - start the target Nic as burn diagnostic client

   --nic-id <nicid>
   Target Nic Id

   --gpu-id <gpuid>
   Target Gpu Id

   --server-nic-id <servernicid>
   Server Nic Id

   --server-num <servernum>
   Server Count

   --server-gpu-id <servergpuid>
   Server Gpu Id

   --client-nic-list <clientniclist>
   Client Nic List

   --client-gpu-list <clientgpulist>
   Client Gpu List

   --socket-address <socketaddress>
   Socket Address

   --operation <[0, 2]>
   Query diagnostic operation:
   0 - Send (ib_send_*)
   1 - Read (ib_read_*)
   2 - Write (ib_write_*)

   --type <[0, 1]>
   Query diagnostic type:
   0 - bw (ib_*_bw)
   1 - lat (ib_*_lat)

   -c <PATH>,  --configfile <PATH>
   Path to the configuration file in json format.

   -j,  --json
   Print detailed result in json format.

   --generate-template
   Generate Nic diag config template file.

   --show-qos-info
     Show Qos Info. Need use with --list

   -r <1|2|3>,  --run <1|2|3>
   Run a Nic diagnostic.
   (Note: higher numbered tests include all beneath.)
   1 - Quick (Nic basic diagnostic ~ seconds)
   2 - Medium (Nic diagnostic include ib write perftest ~ 15 minutes)
   3 - Long (Nic diagnostic include ib write burntest ~ 30 minutes)

   --,  --ignore_rest
   Ignores the rest of the labeled arguments following this flag.

   -v,  --version
   Displays version information and exits.

   -h,  --help
   Displays usage information and exits.

   MetaX Data Center Management Interface

3.9.2. Querying NIC Information

mx-dcmi command:

mx-dcmi nic --host [IP]:[PORT] -l [--nic-id <nicid>]

To display a specific NIC, specify it with --nic-id <nicid>.

On success, the output is as follows:

Figure 3.2 Querying NIC information

3.9.3. Diagnostic Items

3.9.3.1. Diagnostic Item Descriptions

Table 3.3 NIC diagnostic item descriptions

  • Driver Version Check (Level 1): checks the NIC driver version.

  • Firmware Version Check (Level 1): checks the NIC firmware version.

  • Kernel Syslog Check (Level 1): checks the NIC's messages in the kernel log.

  • Device Name Check (Level 1): checks the NIC device names.

  • Port State Check (Level 1): checks the NIC port state (UP/DOWN) and IP addresses.

  • Error Counters Check (Level 1): checks the NIC error counters.

  • QoS Configuration (Level 1): checks the ECN/PFC, TC, and PRIO configuration.

  • Interface Name Check (Level 1): checks the NIC interface names.

  • Perf test by self (write) for all NICs (Level 2): performance test (RDMA write) on all NICs of this server.

  • Perf burn test (write) for all NICs (Level 3): stress test (RDMA write) on all NICs of this server.

Higher-level diagnostics include all lower-level checks. If any lower-level test fails, the higher-level tests are not run and are reported as Skip. For example, when running a Level 3 test, if the Level 2 tests fail, the Level 3 test content is skipped by default and its result shows Skip.

3.9.3.2. Baseline Configuration File

Before running Level 1 diagnostics, the following command generates a configuration file, nic-diag-config.json, matching the environment. The generated file covers only the NICs connected to the GPU devices; other NICs are excluded by default. See Section 3.9.2, Querying NIC Information, to obtain each NIC's ip_address. If a check item in the file is empty, that check is skipped; if the checked value matches the file's content, the check passes, otherwise it fails. For failed or empty items, the names of the NICs that failed or were skipped are listed under the corresponding item.

mx-dcmi command:

mx-dcmi nic --generate-template --host [IP]:[PORT]

The nic-diag-config.json below is only an example; adjust it to your actual needs.

{
      "Driver Version": "xxx",
      "Firmware Version": "xxx",
      "Nic Status": {
            "port state": "UP 4X NDR (InfiniBand)",
            "device names": [
                        "mlx5_0",
                        "mlx5_1"
            ],
            "interface names": [
                        "ens2np0",
                        "enp3np0"
            ],
            "ip address": ["172.16.2.1", "172.16.3.1"]
      },
      "Qos Configuration": {
          "priority trust state":"dscp",
          "cmaRoceTos":"",
          "trafficClass":"160",
          "pfc priority enabled":"0,0,0,0,0,1,0,0",
          "pfc priority buffer":"0,0,0,0,0,1,0,0",
          "ecn Np priority enabled":"1,1,1,1,1,1,1,1",
          "ecn Np cnp_802_prio":"6",
          "ecn Np cnp_dscp":"48",
          "ecn Rp priority enabled":"1,1,1,1,1,1,1,1"
      }
}
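One sanity check worth scripting before running Level 1: each NIC in the config should have a device name, an interface name, and an IP address. The sketch below checks that over the example file above; the one-to-one correspondence of these three lists is an assumption about the file's intent, not a documented rule.

```python
import json

# Trimmed copy of the example nic-diag-config.json shown above.
config = json.loads("""
{"Nic Status": {"device names": ["mlx5_0", "mlx5_1"],
                "interface names": ["ens2np0", "enp3np0"],
                "ip address": ["172.16.2.1", "172.16.3.1"]}}
""")

status = config["Nic Status"]
counts = {len(status["device names"]), len(status["interface names"]),
          len(status["ip address"])}
print(len(counts) == 1)  # True: all three lists cover the same NICs
```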

3.9.4. Diagnostic Examples

3.9.4.1. Level 1 Diagnostics

mx-dcmi command:

mx-dcmi nic --host [IP]:[PORT] -r 1 -c <config_file>

On success, the output is as follows:

============= MetaX Data Center Management Interface Log ==================
Timestamp                                    : Thu Nov 20 09:22:29 2025

+------------------------------------+------------------------------------+
| Diagnostic                         | Result                             |
+------------------------------------+------------------------------------+
| ------ NIC Self Check -------------+------------------------------------|
| Driver Version Check               | Pass                               |
| Firmware Version Check             | Pass                               |
| Kernel Syslog Check                | Pass                               |
| Device Names Check                 | Pass                               |
| Port State Check                   | Pass                               |
| Error Counters Check               | Pass                               |
| Interface Names Check              | Pass                               |
| Ip Address Check                   | Pass                               |
|----QoS Configuration Check---------+------------------------------------+
| Priority trust state Check         | Pass                               |
| Cma Roce Tos Check                 | Pass                               |
| Traffic Class Check                | Pass                               |
| PFC Configuration enabled  Check   | Pass                               |
| PFC Configuration buffer Check     | Pass                               |
| ECN Np Configuration enabled Check | Pass                               |
| Cnp 802p Prio Check                | Pass                               |
| ECN Rp Configuration enabled Check | Pass                               |
+------------------------------------+------------------------------------+

End of Log

3.9.4.2. Displaying Diagnostic Results in JSON

The following command runs Level 2 diagnostics and prints the result in JSON format.

mx-dcmi command:

mx-dcmi nic --host [IP]:[PORT] -c nic-diag-config.json -r 2 -j

On success, the output is as follows:

{"Driver version":{"Result":"pass"},"Firmware version":{"Result":"pass"},"Kernel syslog":{"Result":"pass"},"Device names":{"Result":"pass"},"Port state":{"Result":"pass"},"Interface names":{"Result":"pass"},"Ip address":{"Result":"pass"},"Error counters":{"Result":"pass"},"Perftest":{"Result":"skip"},"Priority trust state":{"Result":"pass"},"Cnp 802p Prio":{"Result":"pass"},"Cnp Dscp":{"Result":"pass"},"Cma Rock Tos":{"Result":"skip"},"Traffic Class":{"Result":"pass"},"PFC Configuration enabled":{"Result":"pass"},"PFC Configuration buffer":{"Result":"pass"},"ECN Np Configuration enabled":{"Result":"pass"},"ECN Rp Configuration enabled":{"Result":"pass"}}

3.9.5. Single-Node RDMA Performance Test

3.9.5.1. Basic Command Format

Users can freely combine the test options to suit their needs, and can pair them with GPUDirect to evaluate network and GPU performance together.

Note

When a test result falls short of expectations, a specific cause analysis and suggestions are provided.

mx-dcmi command:

mx-dcmi nic --host [IP]:[PORT] --test-mode 0 --nic-id <nicid> [--operation <0|1|2>] [--type <0|1>] [--gpu-id <gpuid>]

Table 3.4 Single-node RDMA performance test parameters

  • --host: IP address of the server under test and the mxdcmd service port.

  • --test-mode: test mode. 0 = RDMA performance test, 1 = RDMA stress test, 2 = start the server-side listener, 3 = connect to the server to test.

  • --nic-id: NIC ID.

  • --operation <0|1|2>: RDMA operation. 0 = send, 1 = read, 2 = write. Default: write.

  • --type <0|1>: test type. 0 = bandwidth test, 1 = latency test. Default: bandwidth test.

  • --gpu-id: target device ID for a GPUDirect test.

3.9.5.2. Example 1: Read Bandwidth Test

mx-dcmi command:

mx-dcmi nic --host [IP]:[PORT] --test-mode 0 --nic-id 1 --operation 1 --type 0

On success, the output is as follows:

Figure 3.3 Read bandwidth test

3.9.5.3. Example 2: Write Bandwidth Test (RESTful API)

RESTful API:

curl -i -X POST http://[IP]:[PORT]/api/v1/nic/ibs/ibtest?testmode=r\&server=3\&operation=write\&type=bw

On success, the output is as follows:

HTTP/1.1 200 OK
Content-type: application/json
Content-Length: 3372

{"results":[{"deviceId":3,"name":"mlx5_2","operation":"ib_write_bw","type":0,"result":[
{"Bytes":2,"Iterations":5000,"BW Peak":0.06857100129127502,"BW Average":0.0682080015540123,"Msg Rate":4.262992858886719,"analysisResult":0},
{"Bytes":4,"Iterations":5000,"BW Peak":0.12999999523162842,"BW Average":0.12999999523162842,"Msg Rate":3.9534409046173096,"analysisResult":0},
{"Bytes":8,"Iterations":5000,"BW Peak":0.23999999463558197,"BW Average":0.23999999463558197,"Msg Rate":3.7158620357513428,"analysisResult":0},
{"Bytes":16,"Iterations":5000,"BW Peak":0.46000000834465027,"BW Average":0.46000000834465027,"Msg Rate":3.588594913482666,"analysisResult":0},
{"Bytes":32,"Iterations":5000,"BW Peak":0.9100000262260437,"BW Average":0.9100000262260437,"Msg Rate":3.5382909774780273,"analysisResult":0},
{"Bytes":64,"Iterations":5000,"BW Peak":1.6100000143051147,"BW Average":1.6100000143051147,"Msg Rate":3.1451549530029297,"analysisResult":0},
{"Bytes":128,"Iterations":5000,"BW Peak":3.4000000953674316,"BW Average":3.4000000953674316,"Msg Rate":3.3210930824279785,"analysisResult":0},
{"Bytes":256,"Iterations":5000,"BW Peak":6.440000057220459,"BW Average":6.440000057220459,"Msg Rate":3.1433870792388916,"analysisResult":0},
{"Bytes":512,"Iterations":5000,"BW Peak":12.899999618530273,"BW Average":12.850000381469727,"Msg Rate":3.1373159885406494,"analysisResult":0},
{"Bytes":1024,"Iterations":5000,"BW Peak":26.030000686645508,"BW Average":26.020000457763672,"Msg Rate":3.1757090091705322,"analysisResult":0},
{"Bytes":2048,"Iterations":5000,"BW Peak":63.47999954223633,"BW Average":63.380001068115234,"Msg Rate":3.868633985519409,"analysisResult":0},
{"Bytes":4096,"Iterations":5000,"BW Peak":118.8499984741211,"BW Average":118.29000091552734,"Msg Rate":3.6099469661712646,"analysisResult":0},
{"Bytes":8192,"Iterations":5000,"BW Peak":192.2100067138672,"BW Average":191.22999572753906,"Msg Rate":2.917980909347534,"analysisResult":0},
{"Bytes":16384,"Iterations":5000,"BW Peak":207.5800018310547,"BW Average":207.5,"Msg Rate":1.5830650329589844,"analysisResult":0},
{"Bytes":32768,"Iterations":5000,"BW Peak":210.83999633789063,"BW Average":210.1999969482422,"Msg Rate":0.8018640279769897,"analysisResult":0},
{"Bytes":65536,"Iterations":5000,"BW Peak":211.94000244140625,"BW Average":211.94000244140625,"Msg Rate":0.40423500537872314,"analysisResult":0},
{"Bytes":131072,"Iterations":5000,"BW Peak":212.13999938964844,"BW Average":212.1199951171875,"Msg Rate":0.20229299366474152,"analysisResult":0},
{"Bytes":262144,"Iterations":5000,"BW Peak":212.22999572753906,"BW Average":212.22999572753906,"Msg Rate":0.10119900107383728,"analysisResult":0},
{"Bytes":524288,"Iterations":5000,"BW Peak":212.17999267578125,"BW Average":212.13999938964844,"Msg Rate":0.050579000264406204,"analysisResult":0},
{"Bytes":1048576,"Iterations":5000,"BW Peak":212.22000122070313,"BW Average":212.1199951171875,"Msg Rate":0.025287000462412834,"analysisResult":0},
{"Bytes":2097152,"Iterations":5000,"BW Peak":212.17999267578125,"BW Average":212.0500030517578,"Msg Rate":0.012639000080525875,"analysisResult":0},
{"Bytes":4194304,"Iterations":5000,"BW Peak":212.17999267578125,"BW Average":212.0500030517578,"Msg Rate":0.006320000160485506,"analysisResult":0},
{"Bytes":8388608,"Iterations":5000,"BW Peak":212.1199951171875,"BW Average":212.0800018310547,"Msg Rate":0.003160000080242753,"analysisResult":0}]}]}

3.9.5.4. Example 3: Read Latency Test (GPUDirect)

mx-dcmi command:

mx-dcmi nic --host [IP]:[PORT] --test-mode 0 --nic-id 1 --operation 0 --type 1 --gpu-id 0

On success, the output is as follows:

Figure 3.4 Read latency test

3.9.6. Single-Node RDMA Stress Test

3.9.6.1. Basic Command Format

Users can freely combine single-node and cross-node RDMA stress tests to suit their needs, including the scenarios supported by GPUDirect.

Note

When a test result falls short of expectations, a specific cause analysis and suggestions are provided.

mx-dcmi command:

mx-dcmi nic --host [IP]:[PORT] --test-mode 1 --server-nic-id <servernicid> --client-nic-list <clientniclist> [--operation <0|1|2>] [--type <0|1>] [--server-gpu-id <servergpuid>] [--client-gpu-list <clientgpulist>]

Table 3.5 Single-node RDMA stress test parameters

  • --host: IP address of the server under test and the mxdcmd service port.

  • --test-mode: test mode. 0 = RDMA test, 1 = RDMA stress test, 2 = start the server-side listener, 3 = connect to the server to test.

  • --server-nic-id: server-side NIC ID. It must be a NIC connected to a GPU; management and storage ports cannot be tested.

  • --client-nic-list: client-side NIC IDs. They must be NICs connected to GPUs; management and storage ports cannot be tested. Separate multiple NICs with ",", e.g. 1,2,3.

  • --server-gpu-id: server-side device ID for a GPUDirect test.

  • --client-gpu-list: client-side device IDs for a GPUDirect test; the count must match --client-nic-list. Separate multiple devices with ",", e.g. 1,2,3.

  • --operation <0|1|2>: RDMA operation. 0 = send, 1 = read, 2 = write. Default: write.

  • --type <0|1>: test type. 0 = bandwidth test, 1 = latency test. Default: bandwidth test.

3.9.6.2. Example 1: Read Bandwidth Stress Test

mx-dcmi nic --host [IP]:[PORT] --test-mode 1 --server-nic-id 0 --client-nic-list 1,3 --operation 1

On success, the NIC 1 result is as follows:

Figure 3.5 Read bandwidth stress test, NIC 1 result

The NIC 3 result is as follows:

Figure 3.6 Read bandwidth stress test, NIC 3 result

3.9.6.3. Example 2: Read Bandwidth Stress Test (GPUDirect)

mx-dcmi nic --host [IP]:[PORT] --test-mode 1 --server-nic-id 0 --client-nic-list 1,3 --server-gpu-id 0 --client-gpu-list 1,2 --operation 1

On success, the NIC 1 result is as follows:

Figure 3.7 GPUDirect read bandwidth stress test, NIC 1 result

The NIC 3 result is as follows:

Figure 3.8 GPUDirect read bandwidth stress test, NIC 3 result

3.9.7. Cross-Node RDMA Stress Test

3.9.7.1. Basic Command Format

Users can freely combine single-node and cross-node RDMA stress tests to suit their needs, including the scenarios supported by GPUDirect.

Note

When a test result falls short of expectations, a specific cause analysis and suggestions are provided.

Server:

mx-dcmi nic --host [IP1]:[PORT1] --test-mode 2 --server-nic-id <servernicid> --server-num <servernum> [--operation <0|1|2>] [--type <0|1>] [--server-gpu-id <servergpuid>]

Table 3.6 Cross-node RDMA stress test server-side parameters

  • --host: IP address of the server under test and the mxdcmd service port.

  • --test-mode: test mode. 0 = RDMA test, 1 = RDMA stress test, 2 = start the server-side listener, 3 = connect to the server to test.

  • --server-nic-id: server-side NIC ID. It must be a NIC connected to a GPU; management and storage ports cannot be tested.

  • --server-num: number of server-side ports; it must equal the number of IDs in the corresponding list. When greater than 1, only the first port is returned. The port is released when the test ends; to test again, restart both the server and the client.

  • --server-gpu-id: server-side device ID for a GPUDirect test.

  • --operation <0|1|2>: RDMA operation. 0 = send, 1 = read, 2 = write. Default: write.

  • --type <0|1>: test type. 0 = bandwidth test, 1 = latency test. Default: bandwidth test.

Client:

mx-dcmi nic --host [IP2]:[PORT2] --test-mode 3 --client-nic-list <clientniclist> --socket-address [IP1]:[PORT3] [--operation <0|1|2>] [--type <0|1>] [--client-gpu-list <clientgpulist>]

Table 3.7 Cross-node RDMA stress test client-side parameters

  • --host: IP address of the server under test and the mxdcmd service port.

  • --test-mode: test mode. 0 = RDMA test, 1 = RDMA stress test, 2 = start the server-side listener, 3 = connect to the server to test.

  • --client-nic-list: client-side NIC IDs. They must be NICs connected to GPUs; management and storage ports cannot be tested. Separate multiple NICs with ",", e.g. 1,2,3.

  • --socket-address: server IP and port to connect to, i.e. the IP and port returned by the server side. If that IP is an internal address, replace it with the host IP of the server.

  • --client-gpu-list: client-side device IDs for a GPUDirect test; the count must match --client-nic-list. Separate multiple devices with ",", e.g. 1,2,3.

  • --operation <0|1|2>: RDMA operation. 0 = send, 1 = read, 2 = write. Default: write.

  • --type <0|1>: test type. 0 = bandwidth test, 1 = latency test. Default: bandwidth test.

3.9.7.2. Example 1: Cross-Node Write Bandwidth Stress Test

Server:

mx-dcmi nic --host [IP1]:[PORT1] --test-mode 2 --server-nic-id 6 --server-num 2

The command returns a socket address for the client to use.

On success, the output is as follows:

============= MetaX Data Center Management Interface Log =============
Timestamp                                   : Wed Apr 23 12:50:54 2025

Socket Address: 172.17.26.36:23515

Client:

mx-dcmi nic --host [IP2]:[PORT2] --test-mode 3 --client-nic-list 3,4  --socket-address [IP1]:23515

On success, the NIC 3 result is as follows:

Figure 3.9 Cross-node write bandwidth stress test, NIC 3 result

The NIC 4 result is as follows:

Figure 3.10 Cross-node write bandwidth stress test, NIC 4 result

3.9.7.3. Example 2: Cross-Node Write Bandwidth Stress Test (GPUDirect)

Server:

mx-dcmi nic --host [IP1]:[PORT1] --test-mode 2 --server-nic-id 4 --server-gpu-id 3 --server-num 2

The command returns a socket address for the client to use.

On success, the output is as follows:

============= MetaX Data Center Management Interface Log =============
Timestamp                                   : Wed Apr 23 12:39:27 2025

Socket Address: 172.17.22.36:21515

Client:

mx-dcmi nic --host [IP2]:[PORT2] --test-mode 3 --client-gpu-list 1,3 --client-nic-list 0,3 --socket-address [IP1]:21515

On success, the NIC 1 result is as follows:

Figure 3.11 GPUDirect cross-node write bandwidth stress test, NIC 1 result

The NIC 3 result is as follows:

Figure 3.12 GPUDirect cross-node write bandwidth stress test, NIC 3 result

3.10. MCCL Diagnostics

Users can run MCCL tests on the NICs with the mccl subcommand. Its main functions are generating a test configuration template, running MCCL stress tests, querying test results, and managing test jobs.

Prerequisites:

  • In the cluster environment, the servers under test have multiple GPU devices and InfiniBand NICs deployed, with identical configurations.

  • The NICs under test must be UP and all have IPv4 addresses; see Section 3.9.2, Querying NIC Information.

  • mx-dcmd can be deployed on a host participating in the MCCL test, or on another host in the cluster configured identically to the test hosts; it does not need to be deployed on every server under test.

  • mx-dcmi can run on the management node, on a server not participating in the test, or on a participating server.

  • The parameters in the generated configuration file are set correctly; see Section 3.10.2, Configuration File.

  • Passwordless login is configured between the servers.

3.10.1. Usage Help

mx-dcmi mccl -h

The command output is as follows:

USAGE:

   mx-dcmi mccl  [-hv] {-g <PATH>|-j <jobname>|-l|-r <jobname>|-s <jobname>|
                -x <jobname>} [-c <PATH>] [--host <IP:PORT>]


Where:

   --host <IP:PORT>
   The target host to connect.

   -c <PATH>,  --configfile <PATH>
   Path to the configuration file in json format. Must be used in
   conjunction with -s/--start.

   One of:
      -s <jobname>,  --start <jobname>
      Start Mccl stress job with the configuration file.

      -x <jobname>,  --stop <jobname>
      Stop Mccl stress job.

      -r <jobname>,  --remove <jobname>
      Remove Mccl stress job.

      -j <jobname>,  --job <jobname>
      Display Mccl stress result.

      -g <PATH>,  --generate-template <PATH>
      Generate Mccl diag config template file with wanted ip list.

      -l,  --list
      List all Mccl stress job info on the host.

   --,  --ignore_rest
   Ignores the rest of the labeled arguments following this flag.

   -v,  --version
   Displays version information and exits.

   -h,  --help
   Displays usage information and exits.

   MetaX Data Center Management Interface

3.10.2. Configuration File

Before running MCCL, users must provide the parameters required for the MCCL test.

The following command generates the default baseline configuration file mccl-diag-config.json:

mx-dcmi command:

mx-dcmi mccl --host [IP]:[PORT] -g "<PATH>"

  • [IP]:[PORT] is the server that collects MCCL logs; mx-dcmd must be started and running normally on it.

  • <PATH> is the full path of the file listing the hosts under test, e.g. "/home/user/ip.txt". When it is "", a configuration file is generated automatically from the [IP]:[PORT] information.

The mccl-diag-config.json below is only an example; adjust it to your actual needs.

{
   "perf_type": "all_gather_perf",
   "mgmt_interface_name": "bond0",
   "mgmt_ip_mask": "10.200.146.192/26",
   "mgmt_ip_address": "10.200.146.235,10.200.146.226",
   "node": "3",
   "nic_dev": "mlx5_4, mlx5_1, mlx5_6, mlx5_3, mlx5_2, mlx5_0, mlx5_7, mlx5_5",
   "duration": "43200",
   "standard_alg_bandwidth": 37.19
}
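A quick lint of the file above can be scripted before starting a job. The REQUIRED list below mirrors the keys in Table 3.8; whether mxdcmd treats every key as mandatory is not stated, so treat this as a sanity check, not a gate.

```python
import json

# Keys mirrored from Table 3.8 (mandatoriness is an assumption).
REQUIRED = ["perf_type", "mgmt_interface_name", "mgmt_ip_mask",
            "mgmt_ip_address", "node", "nic_dev", "duration"]

config = json.loads("""
{"perf_type": "all_gather_perf", "mgmt_interface_name": "bond0",
 "mgmt_ip_mask": "10.200.146.192/26",
 "mgmt_ip_address": "10.200.146.235,10.200.146.226",
 "node": "3", "nic_dev": "mlx5_0, mlx5_1", "duration": "43200"}
""")

missing = [k for k in REQUIRED if k not in config]
hosts = config["mgmt_ip_address"].split(",")
print(missing, len(hosts))  # [] 2
```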
Table 3.8 MCCL test configuration file parameters

  • perf_type: test type; all_gather_perf and all_reduce_perf are currently supported. Default: all_gather_perf.

  • mgmt_interface_name: name of the server management interface.

  • mgmt_ip_mask: netmask of the server management interface; when the hosts span address segments, edit this parameter manually.

  • mgmt_ip_address: address list of the servers under test; it can be edited directly in the configuration file. All IPs passed in when generating the file are listed here.

  • node: number of servers per test group; use 1 for a single-node run.

  • nic_dev: list of NICs connected to the GPU devices; edit it to match the actual environment.

  • duration: test duration in seconds. Default: 43200 seconds (12 hours).

3.10.3. Basic MCCL Test Commands

This section uses the test job mccl_test as an example.

3.10.3.1. Starting an MCCL Test

mx-dcmi command:

mx-dcmi mccl --host [IP]:[PORT] --start mccl_test --configfile <config_file>
  • --host is the server that collects the MCCL logs; mx-dcmd must already be started and running normally on it.

  • --start takes the job name ( jobname ).

  • --configfile takes a configuration file usable for the test.

Note

At most 64 jobs can be created.

On success, the output is:

Successfully started Mccl diag for mccl_test

3.10.3.2. Querying Job Status

The job status can be queried 3 minutes after the MCCL test has started.

mx-dcmi command:

mx-dcmi mccl --host [IP]:[PORT] -l

On success, the output is:

================ MetaX Data Center Management Interface Log ==================
Timestamp                                          : Mon Nov  20 16:52:49 2025
Mccl stress job list
+-----------+------------+----------------------+-----------+----------------+
| Job Name  | Node List  | Env Variable         |  State    | Exception Node |
+-----------+------------+----------------------+-----------+----------------+
| mccl_test | x.x.x.x    | FORCE_ACTIVE_WAIT=2  |  Running  |                |
+-----------+------------+----------------------+-----------+----------------+

3.10.3.3. Viewing Test Results

mx-dcmi command:

mx-dcmi mccl --host [IP]:[PORT] -j mccl_test

On success, the output is:

============== MetaX Data Center Management Interface Log ==============
Timestamp                                   : Tue Apr 22 16:43:13 2025

{
   "results": [{
   "node_list": "x.x.x.x:8",
   "perf_type": " all_gather_perf",
   "nic_dev": " mlx5_7, mlx5_3, mlx5_2, mlx5_0, mlx5_4, mlx5_6, mlx5_1, mlx5_5",
   "iter": "5",
   "pass_time":"5",
   "start_time": "2025-11-19 16:43:52",
   "end_time": "2025-11-19 16:44:13",
   "standard_alg_bandwidth":"37.189999",
   "mission_rate":"100.00%",
   "min_alg_bandwidth": "180.96",
   "max_alg_bandwidth": "181.67",
   "test_result":"pass"
   }]
}

MCCL test results are stored in the mxdcmd-log folder, in files named mccl_stress_log_x_<pid>.log. If mxdcmd was started as a container, enter the container to view this folder.

  • pid is the process ID of the test job.

  • x is the MCCL test group number; when there are multiple test groups, multiple group logs are generated under the same PID.
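When scripting around mx-dcmi, the -j output shown above can be parsed as ordinary JSON. A minimal sketch extracting the verdict fields from a saved copy of the result; the /tmp path and the trimmed sample JSON are illustrative:

```shell
# Save a trimmed copy of the sample -j result shown above.
cat > /tmp/mccl_result.json <<'EOF'
{"results": [{"node_list": "x.x.x.x:8", "mission_rate": "100.00%", "test_result": "pass"}]}
EOF

# Extract the pass/fail verdict of the first result entry.
python3 -c 'import json; r = json.load(open("/tmp/mccl_result.json"))["results"][0]; print(r["test_result"], r["mission_rate"])'
# prints: pass 100.00%
```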

If a queried job reaches the stopped state very quickly, an error has occurred; the specific cause can be found in the results, as shown below:

=============== MetaX Data Center Management Interface Log ===============
Timestamp                                   : Wed Apr 23 13:14:18 2025

   {
      "results": [{
      "node_list": "10.200.146.235:8,10.200.146.226:8",
      "perf_type": " all_gather_perf",
      "nic_dev": " mlx5_4, mlx5_6, mlx5_3, mlx5_2, mlx5_0, mlx5_7, mlx5_5",
      "iter": "1",
      "error": "  : Test NCCL failure all_gather.cu:47 'internal error / Proxy Call to rank 15 failed (Connect)'"
      }]
   }

3.10.3.4. Stopping a Test Job

mx-dcmi command:

mx-dcmi mccl --host [IP]:[PORT] -x mccl_test

On success, the output is:

Successfully stopped Mccl diag for mccl_test

Querying the status of mccl_test again shows that it has stopped:

=============== MetaX Data Center Management Interface Log ================
Timestamp                                       : Mon Nov  20 17:52:49 2025
Mccl stress job list
+-----------+------------+---------------------+--------+-----------------+
| Job Name  | Node List  | Env Variable        | State  | Exception Node  |
+-----------+------------+---------------------+--------+-----------------+
| mccl_test | x.x.x.x    | FORCE_ACTIVE_WAIT=2 | ZOMBIE |  None           |
+-----------+------------+---------------------+--------+-----------------+

End of Log

3.10.3.5. Removing a Test Job

mx-dcmi command:

mx-dcmi mccl --host [IP]:[PORT] -r mccl_test

On success, the output is:

Successfully remove Mccl diag for mccl_test

Querying the job status again:

============= MetaX Data Center Management Interface Log =============
Timestamp                                   : Wed Apr 23 13:20:51 2025
Mccl stress job list
+-------------------------------------+
| There is no job                     |
+-------------------------------------+

End of Log