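1. Overview

mx-exporter is a tool for collecting metrics from 曦云® GPU devices in a cluster environment. A cluster monitoring system such as Prometheus can pull device metrics over HTTP from the mx-exporter instance running on each node.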

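2. Tool Deployment

mx-exporter can be installed in several ways. This document covers two of them: installation from a wheel package and installation from an image. For deployment on a Kubernetes cluster, see 《曦云® 系列通用计算GPU mx-exporter Kubernetes集群监控部署手册》 (the mx-exporter Kubernetes cluster monitoring deployment manual).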

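2.1. Wheel Package

2.1.1. Installation

Procedure

  1. Make sure Python 3 is installed on the system.

  2. Make sure the Python libraries prometheus_client, grpcio, and protobuf (version greater than 3.12.0) are installed.

  3. Run mx-smi -L in a terminal and check that the board information is reported correctly.

  4. After installing the MXMACA SDK, locate the Python 3 wheel package in the /opt/maca/wheel directory: mx_exporter_*.whl

  5. Install the wheel package.

    sudo pip3 install mx_exporter_*.whl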
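
The library prerequisites in step 2 can be checked before installing the wheel. The following is a minimal sketch, not part of mx-exporter: the helper names `version_tuple` and `check_prerequisites` are hypothetical, and it assumes the libraries are registered under the distribution names given in step 2. Version strings are compared on their leading numeric components only.

```python
import re
from importlib import metadata

# Libraries named in step 2; only protobuf has a version floor
# ("greater than 3.12.0").
REQUIRED = {"prometheus_client": None, "grpcio": None, "protobuf": (3, 12, 0)}


def version_tuple(version):
    """Turn '3.20.1' into (3, 20, 1); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_prerequisites(required=REQUIRED):
    """Return a list of problems; an empty list means everything was found."""
    problems = []
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        # The requirement is strictly greater than the stated version.
        if minimum is not None and version_tuple(installed) <= minimum:
            floor = ".".join(str(n) for n in minimum)
            problems.append(f"{name} {installed} is not newer than {floor}")
    return problems
```

Calling `check_prerequisites()` with no arguments and printing the returned list shows which prerequisites, if any, still need attention.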
    

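2.1.2. Usage

Table 2.1 mx-exporter parameters (wheel package installation)

Parameter | Description
--- | ---
-p PORT, --port PORT | HTTP port to open on the host. Defaults to 8000 if not specified.
-i INTERVAL, --interval INTERVAL | Interval at which mx-exporter collects metrics, in ms. Defaults to 10000 ms if not specified.
-c CONFIG_FILE, --config-file CONFIG_FILE | Metric configuration file. Defaults to /opt/maca/etc/default-counters.csv if not specified. A custom configuration file must be based on the default one; see 3.2 Modifying the Configuration File.
-h, --help | Show help information.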

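2.1.2.1. Starting mx-exporter

Note

Because mx-exporter monitors the kernel log by default, it must be started with sudo.

Example 1

Run the following command to start mx-exporter with /home/user/counters.csv as the metric configuration file:

    sudo mx-exporter -c /home/user/counters.csv

Example 2

Run the following command to start mx-exporter listening on port 8002 with a collection interval of 5 s:

    sudo mx-exporter -p 8002 -i 5000

Example 3

To monitor sGPU metrics, do the following:

  1. Edit /opt/maca/etc/default-counters.csv and remove the comment marker # in front of the sGPU metrics to enable them.

  2. Partition the host device. For details, see the "sGPU切分选项" (sGPU partitioning options) chapter of 《曦云® 系列通用计算GPU mx-smi使用手册》.

  3. Start mx-exporter.

    sudo mx-exporter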
    

2.1.2.2. Viewing the Data Collected by mx-exporter

Open http://<host_ip>:<host_port>/metrics in a browser, or run curl http://<host_ip>:<host_port>/metrics, where <host_port> is the HTTP port opened on the host (set with -p).

The collected GPU metrics have the following format:

# HELP <metric name> <metric description>
# TYPE <metric name> gauge
<metric name>{Hostname="xx",bios_version="xx",deviceId="xx",driver_version="xx",exported_container="xx",exported_namespace="xx",exported_pod="xx",modelName="xx",uuid="xx"} XX

Example

# HELP mx_chip_hotspot_temp Chip hotspot temperature
# TYPE mx_chip_hotspot_temp gauge
mx_chip_hotspot_temp{Hostname="xx",bios_version="1.16.0.0",deviceId="0",driver_version="2.6.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="xx"} 35.75
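
When the output is consumed by a script rather than by Prometheus, each sample line can be split into a metric name, a label dictionary, and a value. The sketch below is illustrative only: `parse_sample` and `parse_metrics` are hypothetical helper names, and the regular expressions assume the simple form shown above (no timestamps, and no escape sequences that need decoding inside label values).

```python
import re

# <name>{<labels>} <value>; the label block is optional.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\s*'
    r'(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
# key="value" pairs separated by commas, tolerating stray spaces.
LABEL_RE = re.compile(r'\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*"((?:[^"\\]|\\.)*)"\s*,?')


def parse_sample(line):
    """Split one sample line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def parse_metrics(text):
    """Parse every sample in a /metrics response, skipping comments and blanks."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        samples.append(parse_sample(line))
    return samples
```

Applied to the example line above, `parse_sample` yields the name `mx_chip_hotspot_temp`, a dictionary of nine labels, and the value 35.75.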

2.1.3. Uninstallation

Procedure

  1. Run the following command to uninstall the mx-exporter wheel package.

    sudo pip3 uninstall mx-exporter
    

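2.2. mx-exporter Image

2.2.1. Loading the Image

Procedure

  1. Extract the mx-exporter image package:

    tar -zxvf mx-exporter.xxx.tgz

  2. Load the image that matches the host architecture: use the image with the amd64 suffix on x86 hosts and the image with the arm64 suffix on Arm hosts.

    cd mx-exporter; docker load -i mx-exporter-xx-amd64.xz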
    

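2.2.2. Usage

Table 2.2 mx-exporter parameters (image deployment)

Parameter | Description
--- | ---
-p PORT, --port PORT | Container port. Defaults to 8000.
-i INTERVAL, --interval INTERVAL | Interval at which mx-exporter collects metrics, in ms. Defaults to 10000 ms. This interval should match the Prometheus pull period.
-c CONFIG_FILE, --config-file CONFIG_FILE | User-defined metric configuration file. The default is /opt/mxexporter/mx_exporter/default-counters.csv inside the container. Alternatively, modify mx-exporter/config/default-counters.csv extracted from mx-exporter.xxx.tgz and pass it in when starting the container. A custom configuration file must be based on the default one; see 3.2 Modifying the Configuration File.
-mp MOUNT_POINT, --mount-point MOUNT_POINT | Mount path inside the container. When started with the start_mxexporter.sh script, the default mount path inside the container is /host.
-h, --help | Show help information.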
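
The -i option takes milliseconds, whereas Prometheus expresses its pull period (the `scrape_interval` setting) as a duration string such as 10s. A small sketch of the conversion follows, so that the two values can be kept in step. The function name `scrape_interval_for` is hypothetical and is not part of mx-exporter.

```python
def scrape_interval_for(interval_ms):
    """Format an mx-exporter -i value (ms) as a Prometheus duration string."""
    if interval_ms <= 0:
        raise ValueError("interval must be a positive number of milliseconds")
    # Prefer whole seconds when possible; otherwise fall back to milliseconds.
    if interval_ms % 1000 == 0:
        return f"{interval_ms // 1000}s"
    return f"{interval_ms}ms"
```

With the default of 10000 ms, the function returns `10s`, which is the value to use for the Prometheus job that scrapes this exporter.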

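Table 2.3 Docker startup parameters

Parameter | Description
--- | ---
-d | Run the container in the background and print the container ID.
--device=/dev/dri | Mount the host's 曦云 series GPU devices into the container.
--device=/dev/mxgvm | With virtualization enabled, VF metrics are collected by default. To collect PF metrics as well, mount the host's /dev/mxgvm into the container.
-v /var/log:/host/var/log | Mount the host's /var/log directory at /host/var/log in the container.
--name=mx-exporter | Name the container mx-exporter.
-p 0.0.0.0:<host_port>:<container_port> | host_port is the HTTP port opened on the host; container_port is the container port.
-v /var/lib/kubelet/pod-resources:/var/lib/kubelet/pod-resources | Used to obtain Kubernetes resource information inside the container.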

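2.2.2.1. Running mx-exporter

Example 1

Start mx-exporter with the startup script mx-exporter/start_mxexporter.sh (run it with --help for usage details). The following command sets the HTTP listening port to 9000:

    sudo bash start_mxexporter.sh -p=9000

Example 2

Modify the configuration file mx-exporter/config/default-counters.csv extracted from mx-exporter.xxx.tgz and rename it to new_counters.csv, for example /home/<username>/mx-exporter/config/new_counters.csv (an absolute path is required). Then run the following command to mount the modified file into the container and use it as the configuration file:

    sudo bash start_mxexporter.sh -c=/home/<username>/mx-exporter/config/new_counters.csv

Example 3

Run the following command to start mx-exporter so that it reports the resources in use by Kubernetes (metric labels: exported_container, exported_namespace, exported_pod):

    sudo bash start_mxexporter.sh -pm=1

Example 4

To monitor sGPU metrics, do the following:

  1. Modify the configuration file mx-exporter/config/default-counters.csv extracted from mx-exporter.xxx.tgz and remove the comment marker # in front of the sGPU metrics to enable them.

  2. Start the container in privileged mode and mount the modified configuration file into it.

    sudo docker run -d --device=/dev/dri --name=mx-exporter -p 0.0.0.0:8000:8000 --privileged -v /var/log:/host/var/log -v /home/<username>/mx-exporter/config/new_counters.csv:/opt/mxexporter/mx_exporter/new_counters.csv <image ID> -c /opt/mxexporter/mx_exporter/new_counters.csv -mp /host

  3. Partition the host device. For details, see the "sGPU切分选项" (sGPU partitioning options) chapter of 《曦云® 系列通用计算GPU mx-smi使用手册》.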

2.2.2.2. Viewing the Data Collected by mx-exporter

Open http://<host_ip>:<host_port>/metrics in a browser, or run curl http://<host_ip>:<host_port>/metrics, where <host_port> is the HTTP port opened on the host. The collected GPU metrics have the following format:

# HELP <metric name> <metric description>
# TYPE <metric name> gauge
<metric name>{Hostname="xx",bios_version="xx",deviceId="xx",driver_version="xx",exported_container="xx",exported_namespace="xx",exported_pod="xx",modelName="xx",uuid="xx"} XX

Example

# HELP mx_chip_hotspot_temp Chip hotspot temperature
# TYPE mx_chip_hotspot_temp gauge
mx_chip_hotspot_temp{Hostname="xx",bios_version="1.16.0.0",deviceId="0",driver_version="2.6.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="xx"} 35.75

2.2.3. Removing the Container

Procedure

  1. Run the following commands to stop and remove the mx-exporter container.

    docker stop <mx-exporter_container-id>
    docker rm <mx-exporter_container-id>
    

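3. Using the Configuration File

3.1. Default Configuration File

After the MXMACA SDK is installed, the default configuration file is /opt/maca/etc/default-counters.csv.

Table 3.1 Header fields of the default configuration file

Name | Modifiable | Meaning
--- | --- | ---
metric id | | Metric ID, the unique identifier of the metric.
metric type | | Metric type, such as Gauge.
metric name | Yes | Metric name; can be changed as needed.
metric description | Yes | Metric description; can be changed as needed.
label | Yes (names only) | Metric labels, used to select the information shown for the metric. Only label names can be changed; labels cannot be reordered, added, or removed.

Table 3.2 Example entry for the mx_chip_hotspot_temp metric in the default configuration file

# Format: metric id | metric type | metric name | metric description | label
--- | --- | --- | --- | ---
# temperature | | | |
chip_hotspot_temp | Gauge | mx_chip_hotspot_temp | Chip hotspot temperature | deviceId, uuid, exported_pod, exported_namespace, exported_container, Hostname, driver_version, bios_version, modelName.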

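3.2. Modifying the Configuration File

3.2.1. Customizing the Metric Set

Procedure

  1. Open mx-exporter/config/default-counters.csv extracted from mx-exporter.xxx.tgz.

  2. Comment out the rows of the metrics you do not need to monitor by prefixing them with "#", then save the file.

  3. Run mx-exporter with -c pointing to the file.

  4. Check that the collected metrics no longer include the commented-out metrics.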
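
Step 2 can also be scripted when many metrics need to be disabled at once. The following is a rough sketch under stated assumptions: each metric occupies a single line of the configuration file, and the metric name appears verbatim on that line. The function name `disable_metrics` is hypothetical and is not shipped with mx-exporter.

```python
import re


def disable_metrics(config_text, metric_names):
    """Prefix '#' to every active line that mentions one of metric_names."""
    # Match whole identifiers only, so one metric name cannot hit a
    # longer name that merely contains it.
    patterns = [
        re.compile(r"(?<![A-Za-z0-9_])" + re.escape(name) + r"(?![A-Za-z0-9_])")
        for name in metric_names
    ]
    result = []
    for line in config_text.splitlines():
        # Blank lines and lines already commented out are left untouched.
        is_active = bool(line.strip()) and not line.lstrip().startswith("#")
        if is_active and any(p.search(line) for p in patterns):
            line = "#" + line
        result.append(line)
    return "\n".join(result) + "\n"
```

A typical use is to read a copy of default-counters.csv, pass its text through `disable_metrics`, write the result to a new file, and then start mx-exporter with -c pointing to that file. Running the function twice on the same text gives the same result, because lines that are already commented out are skipped.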

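3.2.2. Modifying Metric Names, Descriptions, and Labels

Procedure

  1. Open mx-exporter/config/default-counters.csv extracted from mx-exporter.xxx.tgz.

  2. Edit the metric name, metric description, and label columns, then save the file.

  3. Run mx-exporter with -c pointing to the file.

  4. Check that the metric names, descriptions, and labels in the collected metrics have changed as expected.

    Note

    • When modifying labels, only the names can be changed; labels cannot be reordered, added, or removed.

    • For the naming rules for metric names and labels, see the Prometheus documentation.

Example

Taking the mx_chip_hotspot_temp metric as an example, make the following changes:

Metric name: mx_chip_hotspot_temp changed to metax_chip_hotspot_temp
Metric description: Chip hotspot temperature changed to Metax Chip hotspot temperature
Metric label: deviceId changed to mx_deviceId
Metric label: modelName changed to mx_modelName

After running mx-exporter, the output is as follows:

# HELP metax_chip_hotspot_temp Metax Chip hotspot temperature
# TYPE metax_chip_hotspot_temp gauge
metax_chip_hotspot_temp{Hostname="xx",bios_version="1.16.0.0",mx_modelName="MXC500",mx_deviceId="0",driver_version="2.6.0",exported_container="",exported_namespace="",exported_pod="",uuid="xx"} 35.75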

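4. Metrics and Labels

4.1. Metrics

Table 4.1 Metric names and descriptions

Metric | Category | Description
--- | --- | ---
mx_chip_hotspot_temp | Temperature | Highest temperature inside the chip, in °C.
mx_board_core_temp | Temperature | Board temperature, in °C.
mx_optical_module_temp | Temperature | Optical module temperature. Only some devices, such as the C500X, have an optical module.
mx_vpue_usage | Usage | Encoder utilization.
mx_vpud_usage | Usage | Decoder utilization.
mx_memory_usage | Usage | HBM and system memory utilization. The type label distinguishes the two: vram is HBM, xtt is system memory.
mx_memory_total | Usage | Total HBM and system memory, in KB. The type label distinguishes the two: vram is HBM, xtt is system memory.
mx_memory_used | Usage | HBM and system memory in use, in KB. The type label distinguishes the two: vram is HBM, xtt is system memory.
mx_gpu_usage | Usage | GPU utilization.
mx_board_power | Power | Power consumption, in mW.
mx_vpue_clock | Clock | Encoder clock frequency, in MHz.
mx_vpud_clock | Clock | Decoder clock frequency, in MHz.
mx_mem_clock | Clock | HBM clock frequency, in MHz.
mx_gpu_clock | Clock | GPU clock frequency, in MHz.
mx_pcie_bw | PCIE | PCIe Tx and Rx throughput, in MB/s.
mx_pcie_speed | PCIE | PCIe transfer rate per lane of the device, in GT/s.
mx_pcie_width | PCIE | Device link width, expressed as the number of PCIe lanes.
mx_pcie_bridge_speed | PCIE | PCIe bridge transfer rate, in GT/s.
mx_pcie_bridge_width | PCIE | Number of PCIe bridge lanes.
mx_mxlk_bw | MetaXLink | MetaXLink Tx and Rx throughput, in MB/s.
mx_mxlk_speed | MetaXLink | Current MetaXLink rate, in GT/s.
mx_mxlk_width | MetaXLink | Number of MetaXLink lanes.
mx_mxlk_aer_count | MetaXLink | MetaXLink AER count, split into UE and CE.
mx_mxlk_traffic_total_bytes | MetaXLink | Total data transferred over MetaXLink since the GPU driver was loaded, in bytes.
mx_xcore_dpm_level | DPM | Current DPM level.
mx_gpu_state | Status | Device state: 0 means unavailable, 1 means available.
mx_clk_thr | Clocks Throttle Reasons | Current clock throttle reasons; 0 means no throttling. The reasons are encoded as binary bits, assigned from right to left in this order: Idle, Application Limit, Over Power, Chip over-temperature, VR over-temperature, HBM over-temperature, Thermal over-temperature, PCC, Power Brake, DIDT, Low Usage, Other. For example, a value of 12 means the current throttle reasons are Over Power and Chip over-temperature.
mx_sgpu_compute_quota | SGPU Usage | Share of the parent device's compute currently allocated to this sub-device, in % (available only when sGPU is enabled and the metric is enabled in default-counters.csv; disabled by default).
mx_sgpu_usage | SGPU Usage | Current utilization of this sub-device, in % (available only when sGPU is enabled and the metric is enabled in default-counters.csv; disabled by default).
mx_sgpu_total_memory | SGPU Usage | Total VRAM currently allocated to this sub-device, in KB (available only when sGPU is enabled and the metric is enabled in default-counters.csv; disabled by default).
mx_sgpu_used_memory | SGPU Usage | VRAM currently used by this sub-device, in KB (available only when sGPU is enabled and the metric is enabled in default-counters.csv; disabled by default).
mx_sgpu_free_memory | SGPU Usage | VRAM still available to this sub-device, in KB (available only when sGPU is enabled and the metric is enabled in default-counters.csv; disabled by default).
mx_server_info | UUID | Server UUID. local denotes this machine; remote denotes the servers interconnected with it.
mx_driver_log_errors | Log | Number of driver log messages at the ERROR and ALERT levels, per module per collection period.
mx_driver_eid_errors | Log | Latest driver EID error. The metric value converted to hexadecimal is the EID error code. For error details, impact, and common handling methods, see 《曦云® 系列通用计算GPU EID手册》.
mx_sdk_eid_errors | Log | Latest SDK EID error and the corresponding processId. The metric value converted to hexadecimal is the EID error code. For error details, impact, and common handling methods, see 《曦云® 系列通用计算GPU EID手册》.
mx_ecc_error_count | Count | ECC error counts, including the number of CE (Correctable Errors) and UE (Uncorrectable Errors) for SRAM and DRAM, and the number of retired pages.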
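
Two entries in Table 4.1 need a small conversion before they can be read: mx_clk_thr is a bit field, and the EID metrics carry an error code that is meant to be viewed in hexadecimal. The sketch below shows both conversions. The helper names `decode_clk_thr` and `eid_code` are hypothetical, and the bit positions assume that the rightmost bit (bit 0) corresponds to Idle, which is consistent with the documented example where 12 maps to Over Power and Chip over-temperature.

```python
# Throttle reasons in the documented right-to-left order; index = bit position.
THROTTLE_REASONS = (
    "Idle",
    "Application Limit",
    "Over Power",
    "Chip over-temperature",
    "VR over-temperature",
    "HBM over-temperature",
    "Thermal over-temperature",
    "PCC",
    "Power Brake",
    "DIDT",
    "Low Usage",
    "Other",
)


def decode_clk_thr(value):
    """Return the throttle reasons whose bits are set in an mx_clk_thr value."""
    bits = int(value)  # samples arrive as floats, e.g. 12.0
    return [name for i, name in enumerate(THROTTLE_REASONS) if bits & (1 << i)]


def eid_code(value):
    """Render an EID metric value as the hexadecimal error code."""
    return hex(int(value))
```

For instance, `decode_clk_thr(12)` returns the Over Power and Chip over-temperature entries, and `decode_clk_thr(0)` returns an empty list, meaning no throttling.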

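4.2. Labels

Table 4.2 Label descriptions

Name | Meaning
--- | ---
Hostname | Host name. If mx-exporter runs in a container, this is the container name; to change it, pass --hostname <hostname> when starting the container.
bios_version | BIOS version.
deviceId | Device number, that is, the board number (use mx-smi -L to obtain the board ID).
driver_version | Device driver version.
exported_container | Name of the container in the k8s pod that is using the device. Shown only when the mx-exporter container is started with -v /var/lib/kubelet/pod-resources:/var/lib/kubelet/pod-resources.
exported_namespace | k8s namespace of the pod that is using the device. Shown only when the mx-exporter container is started with -v /var/lib/kubelet/pod-resources:/var/lib/kubelet/pod-resources.
exported_pod | Name of the k8s pod that is using the device. Shown only when the mx-exporter container is started with -v /var/lib/kubelet/pod-resources:/var/lib/kubelet/pod-resources.
modelName | Chip model.
uuid | Device UUID.
major/minor | Unique number of the sub-device (available only when sGPU is enabled and the corresponding metrics are enabled in default-counters.csv; disabled by default).
sgpuID | ID of the sub-device created under the current parent device, in the range [0-15] (available only when sGPU is enabled and the corresponding metrics are enabled in default-counters.csv; disabled by default).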

4.3. Sample Metric Output

Start mx-exporter with the following command:

    docker run -d --device=/dev/dri -v /var/log:/host/var/log -v /var/lib/kubelet/pod-resources:/var/lib/kubelet/pod-resources --name=mx-exporter -p 0.0.0.0:8000:8000 <image ID> -mp=/host

Running curl http://<host_ip>:8000/metrics returns output like the following:

# HELP mx_device_type Device type
# TYPE mx_device_type gauge
mx_device_type{deviceId="0",deviceType="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 1.0
# HELP mx_bios_ver Bios version
# TYPE mx_bios_ver gauge
mx_bios_ver{bios="1.7.4.0",deviceId="0"} 1.0
# HELP mx_driver_ver Driver version
# TYPE mx_driver_ver gauge
mx_driver_ver{deviceId="0",driver="2.3.0"} 1.0
# HELP mx_chip_hotspot_temp Chip hotspot temperature
# TYPE mx_chip_hotspot_temp gauge
mx_chip_hotspot_temp{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 36.25
# HELP mx_board_core_temp Board DrMOS Core temperature
# TYPE mx_board_core_temp gauge
mx_board_core_temp{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 35.0
# HELP mx_vpue_usage Vpue usage in percent
# TYPE mx_vpue_usage gauge
mx_vpue_usage{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_vpud_usage Vpud usage in percent
# TYPE mx_vpud_usage gauge
mx_vpud_usage{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_memory_usage HBM and system memory usage in percent
# TYPE mx_memory_usage gauge
mx_memory_usage{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="vram",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 1.2849688529968262
mx_memory_usage{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="xtt",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.001035634935770525
# HELP mx_memory_total Total HBM and system memory in KB
# TYPE mx_memory_total gauge
mx_memory_total{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="vram",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 6.7108864e+07
mx_memory_total{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="xtt",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 1.585114545e+09
# HELP mx_memory_used HBM and system memory used in KB
# TYPE mx_memory_used gauge
mx_memory_used{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="vram",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 862328.0
mx_memory_used{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="xtt",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 16416.0
# HELP mx_gpu_usage GPU uage in percent
# TYPE mx_gpu_usage gauge
mx_gpu_usage{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_board_power Board power in milliwatt
# TYPE mx_board_power gauge
mx_board_power{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 55983.0
# HELP mx_vpue_clock Encoder clock frequency in MHz
# TYPE mx_vpue_clock gauge
mx_vpue_clock{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 535.0
# HELP mx_vpud_clock Decoder clock frequency in MHz
# TYPE mx_vpud_clock gauge
mx_vpud_clock{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 525.0
# HELP mx_mem_clock Memory clock frequency in MHz
# TYPE mx_mem_clock gauge
mx_mem_clock{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 1800.0
# HELP mx_gpu_clock GPU clock frequency in MHz
# TYPE mx_gpu_clock gauge
mx_gpu_clock{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 428.0
# HELP mx_pcie_bw Pcie Tx and Rx throughput in MB/s
# TYPE mx_pcie_bw gauge
mx_pcie_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_pcie_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_pcie_speed Pcie current speed in GT/s
# TYPE mx_pcie_speed gauge
mx_pcie_speed{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 32.0
# HELP mx_pcie_width Pcie current lanes
# TYPE mx_pcie_width gauge
mx_pcie_width{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 16.0
# HELP mx_pcie_bridge_speed Pcie bridge current speed in GT/s
# TYPE mx_pcie_bridge_speed gauge
mx_pcie_bridge_speed{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 32.0
# HELP mx_pcie_bridge_width Pcie bridge current lanes
# TYPE mx_pcie_bridge_width gauge
mx_pcie_bridge_width{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 16.0
# HELP mx_mxlk_bw MetaXLink Tx and Rx throughput in MB/s
# TYPE mx_mxlk_bw gauge
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="1",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="2",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="3",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="4",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="5",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="6",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="7",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="1",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="2",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="3",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="4",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="5",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="6",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="7",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_mxlk_speed MetaXLink current link speed in GT/s
# TYPE mx_mxlk_speed gauge
mx_mxlk_speed{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="1",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_speed{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="2",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_speed{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="3",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_speed{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="4",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 32.0
mx_mxlk_speed{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="5",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 32.0
mx_mxlk_speed{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="6",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 32.0
mx_mxlk_speed{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="7",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_mxlk_width MetaXLink current link width
# TYPE mx_mxlk_width gauge
mx_mxlk_width{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="1",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_width{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="2",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_width{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="3",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_width{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="4",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 16.0
mx_mxlk_width{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="5",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 16.0
mx_mxlk_width{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="6",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 16.0
mx_mxlk_width{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="7",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_mxlk_traffic_total_bytes  MetaXLink traffic total in Bytes
# TYPE mx_mxlk_traffic_total_bytes gauge
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="1",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="2",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="3",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="4",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 2.1474920704e+010
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="5",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 2.147492064e+010
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="6",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 2.1474920684e+010
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="7",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="1",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="2",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="3",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="4",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 2.1474930688e+010
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="5",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 2.1474932352e+010
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="6",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 2.1474930688e+010
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="7",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_mxlk_aer_count MetaXLink aer count
# TYPE mx_mxlk_aer_count gauge
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="1",type="ce",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="2",type="ce",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="3",type="ce",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="4",type="ce",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="5",type="ce",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="6",type="ce",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="7",type="ce",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="1",type="ue",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="2",type="ue",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="3",type="ue",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="4",type="ue",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="5",type="ue",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="6",type="ue",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="7",type="ue",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_xcore_dpm_level Dpm xcore performance level
# TYPE mx_xcore_dpm_level gauge
mx_xcore_dpm_level{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_gpu_state GPU state: 0(not available) 1(available)
# TYPE mx_gpu_state gauge
mx_gpu_state{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 1.0
# HELP mx_clk_thr Current gpu clock throttling reason
# TYPE mx_clk_thr gauge
mx_clk_thr{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_ecc_error_count Total ECC error count
# TYPE mx_ecc_error_count gauge
mx_ecc_error_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="sram_ce",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_ecc_error_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="sram_ue",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_ecc_error_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="dram_ce",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_ecc_error_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="dram_ue",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_ecc_error_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="retired_page",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_server_info Local server and its connected remote servers uuid info
# TYPE mx_server_info gauge
mx_server_info{Hostname="mxsrv003",kind="local",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 1.0
mx_server_info{Hostname="mxsrv003",kind="remote",uuid="GPU-1d54ca5a-b3d9-65df-0ffc-8ebe0d6347e7"} 1.0
# HELP mx_driver_log_errors Driver kernel log errors
# TYPE mx_driver_log_errors gauge
mx_driver_log_errors{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",error_level="ERROR",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",module="ATU",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 4.0
# HELP mx_driver_eid_errors Value of the latest driver EID error encountered
# TYPE mx_driver_eid_errors gauge
mx_driver_eid_errors{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",eid_info="shader exception, pasid 32769 error_type mem_viol(0x4)",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 8450.0
mx_driver_eid_errors{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",eid_info="atu 0x0 pde_base_addr 0x1000340000",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 8449.0
# HELP mx_sdk_eid_errors Value of the latest SDK EID error encountered
# TYPE mx_sdk_eid_errors gauge
mx_sdk_eid_errors{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",eid_info="Xnack Error/ATU Fault(0x8), check app kernel, _Z10trapKernelPfS_i",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",processId="415256",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 12548.0
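
Output in this format can be post-processed with a few lines of Python. The following is a minimal sketch that uses only the standard library; `parse_sample` and `nonzero_samples` are hypothetical helper names, not part of mx-exporter, and the label sets in the embedded text are shortened from the sample above for readability.

```python
import re

# One sample line: <name>{<label>="<value>",...} <number>
# The label block is optional so that unlabelled samples also parse.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
# A label value may contain commas or spaces, so match up to the closing quote.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one exposition-format sample line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def nonzero_samples(text, metric):
    """Return (labels, value) for every non-zero sample of `metric`."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, labels, value = parse_sample(line)
        if name == metric and value != 0.0:
            hits.append((labels, value))
    return hits


# Abbreviated copy of two samples from the output above (labels shortened).
text = '''# HELP mx_ecc_error_count Total ECC error count
# TYPE mx_ecc_error_count gauge
mx_ecc_error_count{Hostname="mxsrv003",deviceId="0",type="sram_ce"} 0.0
# HELP mx_driver_log_errors Driver kernel log errors
# TYPE mx_driver_log_errors gauge
mx_driver_log_errors{Hostname="mxsrv003",deviceId="0",error_level="ERROR",module="ATU"} 4.0
'''

for labels, value in nonzero_samples(text, "mx_driver_log_errors"):
    print(labels["deviceId"], labels["module"], value)
```

In practice the text would come from the `/metrics` endpoint rather than from an embedded string; a dedicated Prometheus client parser may be preferable for production use.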

5. Compatibility

5.1. MXMACA-C500-SDK-2.25.2/MXMACA-C500-K8s-0.8.2

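Metric names in mx-exporter now carry the mx_ prefix. To stay compatible with the previous, unprefixed names, use one of the following methods:

Wheel package

  • Edit the /opt/maca/etc/default-counters.csv configuration file to remove the mx_ prefix, then deploy mx-exporter with that configuration file. For the procedure, see section 2.1.2 (Usage).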
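
Removing the prefix by hand is error-prone when the file lists many metrics. The sketch below shows one way to do it on a copy of the file. It assumes, without confirmation from this manual, that each active line starts with the metric name and that disabled entries start with `#`; the sample lines and the helper name `strip_prefix` are hypothetical, so check the real file layout before relying on it.

```python
def strip_prefix(csv_text, prefix="mx_"):
    """Drop `prefix` from the start of every active (non-comment) line.

    Assumption: the metric name is the first field on the line.
    Lines that start with '#' (disabled entries) are left unchanged.
    """
    out = []
    for line in csv_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(prefix):
            # Keep any leading whitespace, remove only the prefix.
            indent = line[: len(line) - len(stripped)]
            line = indent + stripped[len(prefix):]
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical file content, used only to illustrate the transformation.
sample = (
    "mx_chip_hotspot_temp, gauge, Chip hotspot temperature\n"
    "# mx_example_disabled, gauge, hypothetical disabled entry\n"
)

print(strip_prefix(sample), end="")
```

Writing the result to a separate file and passing it with `-c`, as in the examples in section 2.1.2, leaves the default configuration file untouched.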

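Image

  • Edit the mx-exporter/config/default-counters.csv configuration file to remove the mx_ prefix, then deploy mx-exporter with that configuration file. For the procedure, see section 2.2.2 (Usage).

  • Alternatively, use the Prometheus relabeling feature. In the Prometheus configuration file mx-exporter/deployment/prometheus/config-map.yaml, under job_name: "metax-mx-exporter", add the following metric_relabel_configs entry. It rewrites the __name__ label and so renames the metrics in bulk; because replacement defaults to $1, only the part of the name after mx_ is kept.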

    metric_relabel_configs:
    - source_labels: [__name__]
      regex: mx_(.*)
      target_label: __name__
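
To preview which names this rule would change, its effect can be approximated in Python. This is a rough sketch rather than Prometheus code: it relies on the documented Prometheus behavior that relabel regexes are anchored at both ends and that `replacement` defaults to `$1`, and `relabel_name` is a hypothetical helper name.

```python
import re

# Mirrors the rule above: source_labels [__name__], regex mx_(.*),
# target_label __name__, default replacement "$1".
RULE = re.compile(r"mx_(.*)")


def relabel_name(name):
    """Return the metric name as it would look after the rule is applied."""
    match = RULE.fullmatch(name)  # anchored at both ends, like Prometheus
    if match is None:
        return name  # no match: the name is left unchanged
    return match.group(1)  # "$1": the part after the mx_ prefix


for original in ("mx_gpu_state", "mx_ecc_error_count", "up"):
    print(original, "->", relabel_name(original))
```

Names that do not start with mx_, such as metrics from other exporters scraped by the same job, pass through unchanged.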
    

6. Appendix

6.1. Terms and Abbreviations

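Term/Abbreviation   Full name                                   Description
-----------------   -----------------------------------------   ---------------------------------------------------
PCIe                Peripheral Component Interconnect Express   A high-speed serial computer expansion bus standard
DPM                 Dynamic Power Management                    Dynamic power management feature
VPUE                Video Processing Unit Encoder               Video processing unit, encoder
VPUD                Video Processing Unit Decoder               Video processing unit, decoder
MetaXLink           -                                           MetaX GPU D2D interface bus
ECC                 Error Checking and Correcting               Error checking and correction
EID                 Error ID                                    GPU error code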