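1. Overview
mx-exporter is a tool that collects metrics from 曦云® GPU devices in a cluster environment. A cluster monitoring system such as Prometheus can pull device metrics over HTTP from the mx-exporter instance running on each node.
2. Deployment
mx-exporter can be installed in several ways. This document covers two of them: installation from a wheel package and installation from a container image. For deployment on a Kubernetes cluster, see 《曦云® 系列通用计算GPU mx-exporter Kubernetes集群监控部署手册》.
Note
All output shown in this document uses the 曦云 C500 as the example device.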
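2.1. Wheel package
2.1.1. Installation
Procedure
Make sure Python 3 is installed on the system.
Make sure the Python libraries prometheus_client, grpcio, and protobuf (version later than 3.12.0) are installed.
Run mx-smi -L in a terminal and check that the board information is displayed correctly.
After installing the MXMACA SDK, locate the Python 3 wheel package mx_exporter_*.whl in the /opt/maca/wheel directory.
Install the wheel package:
sudo pip3 install mx_exporter_*.whl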
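The Python prerequisites listed in the procedure can be checked from a script before installing the wheel. The following is a minimal sketch, not part of mx-exporter: the helper names version_tuple and check_prerequisites are hypothetical, and the sketch assumes the libraries were installed under the distribution names given in the prerequisites.

```python
import re
from importlib import metadata

# protobuf must be strictly newer than this version, per the prerequisites.
PROTOBUF_FLOOR = (3, 12, 0)
REQUIRED = ("prometheus_client", "grpcio", "protobuf")


def version_tuple(version):
    """Turn '3.20.1' or '4.21.0rc1' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_prerequisites():
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for name in REQUIRED:
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if name == "protobuf" and version_tuple(installed) <= PROTOBUF_FLOOR:
            problems.append(f"protobuf {installed} is not newer than 3.12.0")
    return problems


if __name__ == "__main__":
    for line in check_prerequisites() or ["Python prerequisites look satisfied"]:
        print(line)
```

The check covers only the Python libraries; whether mx-smi -L reports the boards correctly still has to be verified in a terminal.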
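2.1.2. Usage
| Parameter | Description |
|---|---|
| -p | HTTP port to open on the host. Defaults to 8000 if not specified. |
| -i | Interval at which mx-exporter collects metrics, in ms. Defaults to 10000 ms if not specified. |
| -c | Metric configuration file. Defaults to /opt/maca/etc/default-counters.csv if not specified. |
| | Show help information. |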
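2.1.2.1. Starting mx-exporter
Note
mx-exporter monitors the kernel log by default, so it must be started with sudo.
Example 1
Run the following command to start mx-exporter with /home/user/counters.csv as the metric configuration file:
sudo mx-exporter -c /home/user/counters.csv
Example 2
Run the following command to start mx-exporter listening on port 8002 with a collection interval of 5 s:
sudo mx-exporter -p 8002 -i 5000
Example 3
To monitor sGPU metrics, do the following:
Edit /opt/maca/etc/default-counters.csv and remove the comment character # in front of the sGPU metrics to enable them.
Partition the host device. For details, see the "sGPU切分选项" chapter of 《曦云® 系列通用计算GPU mx-smi使用手册》.
Start mx-exporter:
sudo mx-exporter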
2.1.2.2. Viewing the data collected by mx-exporter
Open http://<host_ip>:<host_port>/metrics in a browser, or run curl http://<host_ip>:<host_port>/metrics, where <host_port> is the HTTP port opened on the host (set with -p).
The collected GPU metrics have the following format:
# HELP <metric name> <metric description>
# TYPE <metric name> gauge
<metric name>{Hostname="xx",bios_version="xx",deviceId="xx",driver_version="xx",exported_container="xx",exported_namespace="xx",exported_pod="xx",modelName="xx",uuid="xx"} XX
Example
# HELP mx_chip_hotspot_temp Chip hotspot temperature
# TYPE mx_chip_hotspot_temp gauge
mx_chip_hotspot_temp{Hostname="xx",bios_version="1.16.0.0",deviceId="0",driver_version="2.6.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="xx"} 35.75
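The format above is the Prometheus text exposition format, so a script can read it without a full Prometheus installation. The following is a minimal sketch using only the Python standard library. The function names parse_sample and fetch_samples are hypothetical, and the parser handles only the line shapes shown in this document (no timestamps, no unescaping of label values).

```python
import re
from urllib.request import urlopen

# name{label="value",...} value  (the label block is optional)
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'\s*([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"\s*,?')


def parse_sample(line):
    """Parse one line; return (name, labels, value), or None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not a metric sample: {line!r}")
    # Escape sequences inside label values are kept as-is in this sketch.
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def fetch_samples(url):
    """Download a /metrics page and return every parsed sample on it."""
    with urlopen(url, timeout=5) as response:
        text = response.read().decode("utf-8")
    return [s for s in map(parse_sample, text.splitlines()) if s is not None]
```

For example, fetch_samples("http://<host_ip>:8000/metrics") with the placeholder replaced by a real address returns a list of tuples that can be filtered by metric name or by the deviceId label.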
2.1.3. Uninstallation
Procedure
Run the following command to uninstall the mx-exporter wheel package:
sudo pip3 uninstall mx-exporter
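2.2. mx-exporter image
2.2.1. Loading the image
Procedure
Extract the mx-exporter image package:
tar -zxvf mx-exporter.xxx.tgz
Load the image that matches the host architecture. Use the image with the amd64 suffix on x86 hosts and the image with the arm64 suffix on Arm hosts.
cd mx-exporter; docker load -i mx-exporter-xx-amd64.xz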
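2.2.2. Usage
| Parameter | Description |
|---|---|
| -p | Container port. Defaults to 8000. |
| -i | Interval at which mx-exporter collects metrics, in ms. Defaults to 10000 ms. This interval must match the Prometheus pull period. |
| -c | User-defined metric configuration file. The default inside the container is /opt/mxexporter/mx_exporter/default-counters.csv. Alternatively, edit mx-exporter/config/default-counters.csv from the extracted mx-exporter.xxx.tgz package and pass it in when starting the container. |
| -mp | Mount path inside the container. When the container is started with the start_mxexporter.sh script, the default mount path inside the container is /host. |
| | Show help information. |

| Parameter | Description |
|---|---|
| -d | Run the container in the background and return the container ID. |
| --device=/dev/dri | Mount the host's 曦云 series GPU devices into the container. |
| | When virtualization is enabled, VF metrics are collected by default. To collect PF metrics as well, mount /dev/mxgvm from the host into the container. |
| -v /var/log:/host/var/log | Mount the host's /var/log directory at /host/var/log in the container. |
| --name=mx-exporter | Name the container mx-exporter. |
| | |
| -v /var/lib/kubelet/pod-resources:/var/lib/kubelet/pod-resources | Used to obtain Kubernetes resource information inside the container. |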
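2.2.2.1. Running mx-exporter
Example 1
Start mx-exporter with the startup script mx-exporter/start_mxexporter.sh. Run the script with --help for detailed usage. The following command sets the HTTP listening port to 9000:
sudo bash start_mxexporter.sh -p=9000
Example 2
Edit the configuration file mx-exporter/config/default-counters.csv from the extracted mx-exporter.xxx.tgz package and rename it to new_counters.csv, for example /home/<username>/mx-exporter/config/new_counters.csv (an absolute path is required). Then run the following command, which mounts the modified file into the container and tells mx-exporter to use it:
sudo bash start_mxexporter.sh -c=/home/<username>/mx-exporter/config/new_counters.csv
Example 3
Run the following command to start mx-exporter so that it reports the resources currently used by Kubernetes (metric labels: exported_container, exported_namespace, exported_pod):
sudo bash start_mxexporter.sh -pm=1
Example 4
To monitor sGPU metrics, do the following:
Edit the configuration file mx-exporter/config/default-counters.csv from the extracted mx-exporter.xxx.tgz package and remove the comment character # in front of the sGPU metrics to enable them.
Start the container in privileged mode and mount the modified configuration file into it:
sudo docker run -d --device=/dev/dri --name=mx-exporter -p 0.0.0.0:8000:8000 --privileged -v /var/log:/host/var/log -v /home/<username>/mx-exporter/config/new_counters.csv:/opt/mxexporter/mx_exporter/new_counters.csv <image ID> -c /opt/mxexporter/mx_exporter/new_counters.csv -mp /host
Partition the host device. For details, see the "sGPU切分选项" chapter of 《曦云® 系列通用计算GPU mx-smi使用手册》.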
2.2.2.2. Viewing the data collected by mx-exporter
Open http://<host_ip>:<host_port>/metrics in a browser, or run curl http://<host_ip>:<host_port>/metrics, where <host_port> is the HTTP port opened on the host. The collected GPU metrics have the following format:
# HELP <metric name> <metric description>
# TYPE <metric name> gauge
<metric name>{Hostname="xx",bios_version="xx",deviceId="xx",driver_version="xx",exported_container="xx",exported_namespace="xx",exported_pod="xx",modelName="xx",uuid="xx"} XX
Example
# HELP mx_chip_hotspot_temp Chip hotspot temperature
# TYPE mx_chip_hotspot_temp gauge
mx_chip_hotspot_temp{Hostname="xx",bios_version="1.16.0.0",deviceId="0",driver_version="2.6.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="xx"} 35.75
2.2.3. Removing the container
Procedure
Run the following commands to stop and remove the mx-exporter container:
docker stop <mx-exporter_container-id>
docker rm <mx-exporter_container-id>
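3. Configuration file
3.1. Default configuration file
After the MXMACA SDK is installed, the default configuration file is /opt/maca/etc/default-counters.csv.
| Name | Modifiable | Meaning |
|---|---|---|
| metric id | No | Metric ID, the unique identifier used to recognize the metric |
| metric type | No | Metric type, such as Gauge |
| metric name | Yes | Metric name; can be changed as needed |
| metric description | Yes | Metric description; can be changed as needed |
| label | Yes | Metric labels, which select the information shown with the metric. Only label names can be changed; labels cannot be reordered, added, or removed. |

| # Format: metric id | metric type | metric name | metric description | label |
|---|---|---|---|---|
| # temperature | | | | |
| chip_hotspot_temp | Gauge | mx_chip_hotspot_temp | Chip hotspot temperature | deviceId, uuid, exported_pod, exported_namespace, exported_container, Hostname, driver_version, bios_version, modelName. |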
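3.2. Modifying the configuration file
3.2.1. Customizing the metric set
Procedure
Open mx-exporter/config/default-counters.csv from the extracted mx-exporter.xxx.tgz package.
Comment out each metric that does not need to be monitored by adding "#" at the start of its line, then save the file.
Run mx-exporter with -c pointing to the file. The collected metrics no longer include the commented-out metrics.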
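When many metrics have to be disabled, the commenting step can be scripted. The following is a minimal sketch under stated assumptions: it assumes each metric occupies one line of the CSV file, that the metric id is the first comma-separated field, and that a leading # disables the line. The function name disable_metrics and the metric id board_core_temp in the usage note are hypothetical; check them against the real file before use.

```python
def disable_metrics(csv_text, metric_ids):
    """Return csv_text with a '#' put in front of every line whose metric id is listed."""
    out = []
    for line in csv_text.splitlines():
        stripped = line.lstrip()
        # Assumption: the metric id is the first comma-separated field on the line.
        first_field = stripped.split(",", 1)[0].strip()
        if stripped and not stripped.startswith("#") and first_field in metric_ids:
            out.append("#" + line)
        else:
            out.append(line)  # comments, blank lines, and other metrics are untouched
    trailing_newline = "\n" if csv_text.endswith("\n") else ""
    return "\n".join(out) + trailing_newline


def disable_metrics_in_file(src_path, dst_path, metric_ids):
    """Read src_path, disable the listed metrics, and write the result to dst_path."""
    with open(src_path, encoding="utf-8") as src:
        updated = disable_metrics(src.read(), metric_ids)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(updated)
```

A call such as disable_metrics_in_file("default-counters.csv", "new_counters.csv", {"board_core_temp"}) writes a copy that can then be passed to mx-exporter with -c. Writing to a new file keeps the original configuration intact.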
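3.2.2. Changing metric names, descriptions, and labels
Procedure
Open mx-exporter/config/default-counters.csv from the extracted mx-exporter.xxx.tgz package.
Edit the metric name, metric description, and label columns, then save the file.
Run mx-exporter with -c pointing to the file. In the collected metrics, the metric names, descriptions, and labels now show the new values.
Note
When editing labels, only the names can be changed; labels cannot be reordered, added, or removed.
For the naming rules that apply to metric names and labels, refer to the Prometheus documentation.
Example
Using the mx_chip_hotspot_temp metric as an example, make the following changes:
Change the metric name mx_chip_hotspot_temp to metax_chip_hotspot_temp
Change the metric description Chip hotspot temperature to Metax Chip hotspot temperature
Change the metric label deviceId to mx_deviceId
Change the metric label modelName to mx_modelName
After mx-exporter is run, the data is displayed as follows:
# HELP metax_chip_hotspot_temp Metax Chip hotspot temperature
# TYPE metax_chip_hotspot_temp gauge
metax_chip_hotspot_temp{Hostname="xx",bios_version="1.16.0.0",mx_modelName="MXC500",mx_deviceId="0",driver_version="2.6.0",exported_container="",exported_namespace="",exported_pod="",uuid="xx"} 35.75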
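Before saving renamed metrics or labels, the new names can be checked against the classic Prometheus naming rules: metric names match [a-zA-Z_:][a-zA-Z0-9_:]*, label names match [a-zA-Z_][a-zA-Z0-9_]*, and label names beginning with two underscores are reserved for internal use. The following is a minimal sketch of such a check. The function names are hypothetical, and whether mx-exporter applies further restrictions of its own is not covered here.

```python
import re

# Classic Prometheus naming rules for metric names and label names.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name):
    """True if the whole string is a valid classic Prometheus metric name."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_label_name(name):
    """True if the name is a valid label name and not in the reserved '__' namespace."""
    return LABEL_NAME_RE.fullmatch(name) is not None and not name.startswith("__")


if __name__ == "__main__":
    # The renamed identifiers from the example in this section.
    for metric in ("metax_chip_hotspot_temp",):
        print(metric, is_valid_metric_name(metric))
    for label in ("mx_deviceId", "mx_modelName"):
        print(label, is_valid_label_name(label))
```

A name containing a hyphen, such as mx-deviceId, fails this check, which is a common mistake when renaming labels by hand.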
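4. Metrics and labels
4.1. Metrics
| Metric name | Category | Description |
|---|---|---|
| mx_chip_hotspot_temp | Temperature | Highest temperature inside the chip, in degrees Celsius |
| mx_board_core_temp | Temperature | Board temperature, in degrees Celsius |
| mx_optical_module_temp | Temperature | Optical module temperature. Only some devices, such as the C500X, have an optical module. |
| mx_vpue_usage | Usage | Encoder utilization |
| mx_vpud_usage | Usage | Decoder utilization |
| mx_memory_usage | Usage | HBM and system memory utilization, distinguished by a metric label (type="vram" and type="xtt" appear in the sample output in section 4.3) |
| mx_memory_total | Usage | Total HBM and system memory, in KB, distinguished by a metric label (type="vram" and type="xtt" appear in the sample output in section 4.3) |
| mx_memory_used | Usage | HBM and system memory in use, in KB, distinguished by a metric label (type="vram" and type="xtt" appear in the sample output in section 4.3) |
| mx_gpu_usage | Usage | GPU utilization |
| mx_board_power | Power | Power consumption, in milliwatts |
| mx_vpue_clock | Clock | Encoder clock frequency, in MHz |
| mx_vpud_clock | Clock | Decoder clock frequency, in MHz |
| mx_mem_clock | Clock | HBM clock frequency, in MHz |
| mx_gpu_clock | Clock | GPU clock frequency, in MHz |
| mx_pcie_bw | PCIE | PCIe Tx and Rx throughput, in MB/s |
| mx_pcie_speed | PCIE | PCIe transfer rate per lane of the device, in GT/s |
| mx_pcie_width | PCIE | Link width of the device, expressed as the number of PCIe lanes |
| mx_pcie_bridge_speed | PCIE | PCIe bridge transfer rate, in GT/s |
| mx_pcie_bridge_width | PCIE | Number of PCIe bridge lanes |
| mx_mxlk_bw | MetaXLink | MetaXLink Tx and Rx throughput, in MB/s |
| mx_mxlk_speed | MetaXLink | MetaXLink real-time rate, in GT/s |
| mx_mxlk_width | MetaXLink | Number of MetaXLink lanes |
| mx_server_conn_status | MetaXLink | Overall MetaXLink status of the server: 1 means healthy, 0 means at least one MetaXLink is abnormal |
| mx_mxlk_aer_count | MetaXLink | MetaXLink AER count, split into UE and CE |
| mx_mxlk_traffic_total_bytes | MetaXLink | Total data transferred over MetaXLink since the GPU driver was loaded, in bytes |
| mx_xcore_dpm_level | DPM | Current DPM level |
| mx_gpu_state | Status | Device state: 0 means unavailable, 1 means available. When the value is 0, the label [...] |
| mx_clk_thr | Clocks Throttle Reasons | Current clock throttling reason; 0 means no throttling. The reasons are encoded as a bit field. Starting from the rightmost bit and moving left, the bits stand for Idle, Application Limit, Over Power, Chip over-temperature, VR over-temperature, HBM over-temperature, Thermal over-temperature, PCC, Power Brake, DIDT, Low Usage, and Other. For example, a value of 12 (binary 1100) means the current throttling reasons are Over Power and Chip over-temperature. |
| mx_sgpu_compute_quota | SGPU Usage | Share of the parent device's compute capacity currently assigned to this sub-device, in % (available only when sGPU is enabled and the metric is enabled in default-counters.csv; disabled by default) |
| mx_sgpu_usage | SGPU Usage | Current utilization of this sub-device, in % (available only when sGPU is enabled and the metric is enabled in default-counters.csv; disabled by default) |
| mx_sgpu_total_memory | SGPU Usage | Total VRAM currently assigned to this sub-device, in KB (available only when sGPU is enabled and the metric is enabled in default-counters.csv; disabled by default) |
| mx_sgpu_used_memory | SGPU Usage | VRAM currently used by this sub-device, in KB (available only when sGPU is enabled and the metric is enabled in default-counters.csv; disabled by default) |
| mx_sgpu_free_memory | SGPU Usage | VRAM still available to this sub-device, in KB (available only when sGPU is enabled and the metric is enabled in default-counters.csv; disabled by default) |
| mx_server_info | UUID | Server UUID. local denotes this machine; remote denotes a server interconnected with it. |
| mx_topo_info | Topo | Device topology information |
| mx_driver_log_errors | Log | Number of driver log messages at level ERROR or ALERT, per module, in each collection period |
| mx_driver_eid_errors | Log | Latest driver EID error. The metric value converted to hexadecimal is the EID error code. For error details, impact, and common remedies, see 《曦云® 系列通用计算GPU EID手册》. |
| mx_sdk_eid_errors | Log | Latest SDK EID error and the current MXMACA SDK version. The metric value converted to hexadecimal is the EID error code. For error details, impact, and common remedies, see 《曦云® 系列通用计算GPU EID手册》. |
| mx_ecc_error_count | Error | ECC error counts, including the number of SRAM and DRAM CEs (Correctable Errors) and UEs (Uncorrectable Errors), and the number of bad pages |
| mx_pci_event | Error | Shows the error event name, the PCIe error type, and the count. The types are [...] |
| mx_ras_count | Error | IP RAS error statistics. When errors exist, they are counted separately as Corrected Error and Uncorrected Error. |
| mx_ras_status | Status | Value of the IP RAS status register. The value is reported in decimal and must be converted to hexadecimal. |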
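Two of the metrics in the table carry encoded values: mx_clk_thr is a bit field, and the EID metrics hold an error code that is read in hexadecimal. The following is a minimal sketch of both conversions. The function names are hypothetical, and the bit order is taken from the mx_clk_thr description (rightmost bit first), so it should be confirmed against the driver documentation.

```python
# Throttle reasons in bit order, least significant bit first,
# as listed in the mx_clk_thr description.
THROTTLE_REASONS = (
    "Idle",
    "Application Limit",
    "Over Power",
    "Chip over-temperature",
    "VR over-temperature",
    "HBM over-temperature",
    "Thermal over-temperature",
    "PCC",
    "Power Brake",
    "DIDT",
    "Low Usage",
    "Other",
)


def decode_clk_thr(value):
    """Return the names of all throttle reasons whose bit is set in value."""
    bits = int(value)  # Prometheus samples are floats, e.g. 12.0
    return [name for i, name in enumerate(THROTTLE_REASONS) if bits & (1 << i)]


def eid_code(value):
    """Return an EID metric value as a hexadecimal string."""
    return hex(int(value))


if __name__ == "__main__":
    print(decode_clk_thr(12.0))  # ['Over Power', 'Chip over-temperature']
    print(eid_code(8450.0))      # 0x2102
```

The value 12 is the example from the mx_clk_thr description, and 8450 is the mx_driver_eid_errors value in the sample output in section 4.3. The resulting hexadecimal code is the one to look up in the EID manual.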
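4.2. Labels
| Name | Meaning |
|---|---|
| Hostname | Host name. If mx-exporter runs in a container, this is the container name. To change it, pass [...] when starting the container. |
| bios_version | BIOS version |
| deviceId | Device number, that is, the board number (obtained via [...]) |
| driver_version | Device driver version |
| exported_container | Name of the container in the k8s pod that is using the device. Only set when the mx-exporter container is started with [...] |
| exported_namespace | k8s namespace of the pod that is using the device. Only set when the mx-exporter container is started with [...] |
| exported_pod | Name of the k8s pod that is using the device. Only set when the mx-exporter container is started with [...] |
| modelName | Chip model |
| uuid | Device UUID |
| major/minor | Unique number of the sub-device (available only when sGPU is enabled and the metric is enabled in default-counters.csv; disabled by default) |
| sgpuID | ID of the sub-device created under the current parent device, in the range [0-15] (available only when sGPU is enabled and the metric is enabled in default-counters.csv; disabled by default) |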
4.3. Sample metric output
Start mx-exporter with the following command:
docker run -d --device=/dev/dri -v /var/log:/host/var/log -v /var/lib/kubelet/pod-resources:/var/lib/kubelet/pod-resources --name=mx-exporter -p 0.0.0.0:8000:8000 <image ID> -mp=/host
Running curl http://<host_ip>:8000/metrics then returns the following:
# HELP mx_device_type Device type
# TYPE mx_device_type gauge
mx_device_type{deviceId="0",deviceType="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 1.0
# HELP mx_bios_ver Bios version
# TYPE mx_bios_ver gauge
mx_bios_ver{bios="1.7.4.0",deviceId="0"} 1.0
# HELP mx_driver_ver Driver version
# TYPE mx_driver_ver gauge
mx_driver_ver{deviceId="0",driver="2.3.0"} 1.0
# HELP mx_chip_hotspot_temp Chip hotspot temperature
# TYPE mx_chip_hotspot_temp gauge
mx_chip_hotspot_temp{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 36.25
# HELP mx_board_core_temp Board DrMOS Core temperature
# TYPE mx_board_core_temp gauge
mx_board_core_temp{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 35.0
# HELP mx_vpue_usage Vpue usage in percent
# TYPE mx_vpue_usage gauge
mx_vpue_usage{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_vpud_usage Vpud usage in percent
# TYPE mx_vpud_usage gauge
mx_vpud_usage{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_memory_usage HBM and system memory usage in percent
# TYPE mx_memory_usage gauge
mx_memory_usage{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="vram",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 1.2849688529968262
mx_memory_usage{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="xtt",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.001035634935770525
# HELP mx_memory_total Total HBM and system memory in KB
# TYPE mx_memory_total gauge
mx_memory_total{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="vram",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 6.7108864e+07
mx_memory_total{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="xtt",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 1.585114545e+09
# HELP mx_memory_used HBM and system memory used in KB
# TYPE mx_memory_used gauge
mx_memory_used{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="vram",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 862328.0
mx_memory_used{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="xtt",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 16416.0
# HELP mx_gpu_usage GPU uage in percent
# TYPE mx_gpu_usage gauge
mx_gpu_usage{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_board_power Board power in milliwatt
# TYPE mx_board_power gauge
mx_board_power{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 55983.0
# HELP mx_vpue_clock Encoder clock frequency in MHz
# TYPE mx_vpue_clock gauge
mx_vpue_clock{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 535.0
# HELP mx_vpud_clock Decoder clock frequency in MHz
# TYPE mx_vpud_clock gauge
mx_vpud_clock{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 525.0
# HELP mx_mem_clock Memory clock frequency in MHz
# TYPE mx_mem_clock gauge
mx_mem_clock{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 1800.0
# HELP mx_gpu_clock GPU clock frequency in MHz
# TYPE mx_gpu_clock gauge
mx_gpu_clock{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 428.0
# HELP mx_pcie_bw Pcie Tx and Rx throughput in MB/s
# TYPE mx_pcie_bw gauge
mx_pcie_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_pcie_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_pcie_speed Pcie current speed in GT/s
# TYPE mx_pcie_speed gauge
mx_pcie_speed{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 32.0
# HELP mx_pcie_width Pcie current lanes
# TYPE mx_pcie_width gauge
mx_pcie_width{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 16.0
# HELP mx_pcie_bridge_speed Pcie bridge current speed in GT/s
# TYPE mx_pcie_bridge_speed gauge
mx_pcie_bridge_speed{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 32.0
# HELP mx_pcie_bridge_width Pcie bridge current lanes
# TYPE mx_pcie_bridge_width gauge
mx_pcie_bridge_width{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 16.0
# HELP mx_mxlk_bw MetaXLink Tx and Rx throughput in MB/s
# TYPE mx_mxlk_bw gauge
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="1",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="2",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="3",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="4",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="5",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="6",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="7",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="1",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="2",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="3",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="4",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="5",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="6",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_bw{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="7",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_mxlk_speed MetaXLink current link speed in GT/s
# TYPE mx_mxlk_speed gauge
mx_mxlk_speed{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="1",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_speed{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="2",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_speed{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="3",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_speed{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="4",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 32.0
mx_mxlk_speed{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="5",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 32.0
mx_mxlk_speed{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="6",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 32.0
mx_mxlk_speed{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="7",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_mxlk_width MetaXLink current link width
# TYPE mx_mxlk_width gauge
mx_mxlk_width{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="1",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_width{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="2",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_width{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="3",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_width{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="4",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 16.0
mx_mxlk_width{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="5",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 16.0
mx_mxlk_width{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="6",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 16.0
mx_mxlk_width{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="7",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_mxlk_traffic_total_bytes MetaXLink traffic total in Bytes
# TYPE mx_mxlk_traffic_total_bytes gauge
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="1",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="2",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="3",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="4",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 2.1474920704e+010
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="5",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 2.147492064e+010
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="6",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 2.1474920684e+010
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="7",type="rx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="1",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="2",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="3",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="4",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 2.1474930688e+010
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="5",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 2.1474932352e+010
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="6",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 2.1474930688e+010
mx_mxlk_traffic_total_bytes{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="7",type="tx",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_mxlk_aer_count MetaXLink aer count
# TYPE mx_mxlk_aer_count gauge
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="1",type="ce",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="2",type="ce",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="3",type="ce",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="4",type="ce",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="5",type="ce",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="6",type="ce",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="7",type="ce",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="1",type="ue",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="2",type="ue",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="3",type="ue",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="4",type="ue",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="5",type="ue",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="6",type="ue",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_mxlk_aer_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",mxlkId="7",type="ue",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_xcore_dpm_level Dpm xcore performance level
# TYPE mx_xcore_dpm_level gauge
mx_xcore_dpm_level{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_gpu_state GPU state: 0(not available) 1(available)
# TYPE mx_gpu_state gauge
mx_gpu_state{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 1.0
# HELP mx_clk_thr Current gpu clock throttling reason
# TYPE mx_clk_thr gauge
mx_clk_thr{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_ecc_error_count Total ECC error count
# TYPE mx_ecc_error_count gauge
mx_ecc_error_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="sram_ce",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_ecc_error_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="sram_ue",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_ecc_error_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="dram_ce",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_ecc_error_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="dram_ue",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_ecc_error_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="retired_page",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
# HELP mx_server_info Local server and its connected remote servers uuid info
# TYPE mx_server_info gauge
mx_server_info{Hostname="mxsrv003",kind="local",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 1.0
mx_server_info{Hostname="mxsrv003",kind="remote",uuid="GPU-1d54ca5a-b3d9-65df-0ffc-8ebe0d6347e7"} 1.0
# HELP mx_driver_log_errors Driver kernel log errors
# TYPE mx_driver_log_errors gauge
mx_driver_log_errors{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",error_level="ERROR",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",module="ATU",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 4.0
# HELP mx_driver_eid_errors Value of the latest driver EID error encountered
# TYPE mx_driver_eid_errors gauge
mx_driver_eid_errors{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",eid_info="shader exception, pasid 32769 error_type mem_viol(0x4)",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 8450.0
mx_driver_eid_errors{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",eid_info="atu 0x0 pde_base_addr 0x1000340000",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 8449.0
# HELP mx_sdk_eid_errors Value of the latest SDK EID error encountered
# TYPE mx_sdk_eid_errors gauge
mx_sdk_eid_errors{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",eid_info="Xnack Error/ATU Fault(0x8), check app kernel, _Z10trapKernelPfS_i",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",processId="415256",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 12548.0
# HELP mx_pci_event Driver pci event includes aer_ue/aer_ce/synfld/dbe/mmio
# TYPE mx_pci_event gauge
mx_pci_event{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",event_name="MC3 MCA FATAL",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",type="synfld",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 1.0
# HELP mx_ras_count Display value in the ras error counter registers
# TYPE mx_ras_count gauge
mx_ras_count{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="MCCTL0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 255.0
# HELP mx_ras_status Display data in status registers
# TYPE mx_ras_status gauge
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="PCIE reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 53248.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="MCCTL0 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 1.646208e+07
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="MCCTL1 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 1.4626816e+07
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="MCCTL2 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 8.33536e+06
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="MCCTL3 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 1.6723968e+07
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="SMP0 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 0.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="SMP1 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 4096.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="FUSE reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 8192.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="INT reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 36864.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="INT reg1",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="CCX0 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 49152.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="CCX0 reg1",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="CCX1 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 49152.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="CCX1 reg1",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="CCX2 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 49152.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="CCX2 reg1",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="DHUB1 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 57344.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="DHUB2 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 57344.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="DHUB3 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 57344.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="DHUB4 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 57344.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="DHUB5 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 57344.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="DHUB6 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 57344.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="DHUB7 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 57344.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="VPUE0 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="VPUD0 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="VPUD1 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="VPUD2 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="VPUD3 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="VPUD4 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="VPUD5 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="VPUD6 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="VPUE0 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="HAG reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="DMA0 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 36864.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="DMA0 reg1",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="DMA1 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 36864.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="DMA1 reg1",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="DMA2 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 36864.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="DMA2 reg1",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="DMA3 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 36864.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="DMA3 reg1",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="DMA4 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 36864.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="DMA4 reg1",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="ATH reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="ATUL20 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="ATUL20 reg1",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="ATUL21 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="ATUL21 reg1",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 40960.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="XSC reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 28672.0
mx_ras_status{Hostname="mxsrv003",bios_version="1.7.4.0",deviceId="0",driver_version="2.3.0",exported_container="",exported_namespace="",exported_pod="",modelName="MXC500",register_name="CE reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 28672.0
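The sample output above is in the Prometheus text exposition format, so each sample line can be split into a metric name, a label set, and a value. The following is a minimal sketch using only the Python standard library, not part of mx-exporter. The helper names `parse_sample` and `parse_metrics` are hypothetical, and the embedded sample lines are shortened copies of the output above with some labels left out. Printing the register values in hexadecimal is only an assumption about what makes them easier to read.

```python
import re

# One labelled sample line looks like: name{label="value",...} number
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
# One label pair: key="value" (the value may contain escaped characters)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one exposition line into (name, labels dict, float value)."""
    m = SAMPLE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a labelled sample line: {line!r}")
    labels = dict(LABEL_RE.findall(m.group("labels")))
    return m.group("name"), labels, float(m.group("value"))


def parse_metrics(text):
    """Parse every sample line, skipping blank lines and # HELP / # TYPE lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        samples.append(parse_sample(line))
    return samples


# Shortened copies of two lines from the sample output (labels trimmed).
text = '''# HELP mx_ras_status Display data in status registers
# TYPE mx_ras_status gauge
mx_ras_status{Hostname="mxsrv003",deviceId="0",register_name="PCIE reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 53248.0
mx_ras_status{Hostname="mxsrv003",deviceId="0",register_name="MCCTL0 reg0",uuid="GPU-1d34ca5a-b3d9-65df-0ffc-8ebe0d6347d9"} 1.646208e+07
'''

for name, labels, value in parse_metrics(text):
    # Show each register value as an integer in hexadecimal.
    print(name, labels["register_name"], hex(int(value)))
```

In a real deployment the text would come from the `/metrics` endpoint described in section 2.1.2.2 rather than from an embedded string.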
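5. Compatibility
5.1. MXMACA-C500-SDK-2.25.2/MXMACA-C500-K8s-0.8.2
In this release, mx-exporter adds the mx_ prefix to metric names. To resolve any resulting compatibility issues, use one of the following methods: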
Wheel package
Image
Modify the mx-exporter/config/default-counters.csv configuration file to remove the mx_ prefix, then deploy mx-exporter with that configuration file. For the procedure, see 2.2.2 Usage.
Alternatively, use Prometheus relabeling. In the Prometheus configuration file mx-exporter/deployment/prometheus/config-map.yaml, under job_name: "metax-mx-exporter", add the following metric_relabel_configs entry to rewrite the metric name label (__name__) in bulk:
metric_relabel_configs:
  - source_labels: [__name__]
    regex: mx_(.*)
    target_label: __name__
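The relabel rule above does not set `action` or `replacement`, so it relies on the Prometheus defaults: `action: replace` and `replacement: $1`. Prometheus also anchors the regex at both ends. The following is a minimal Python sketch of the resulting behaviour, not Prometheus code. The function name `relabel_name` is hypothetical, and `node_load1` is just an arbitrary example of a name without the prefix.

```python
import re


def relabel_name(metric_name, regex=r"mx_(.*)", replacement=r"\1"):
    """Mimic a Prometheus 'replace' relabel rule applied to __name__.

    Prometheus anchors the regex at both ends, hence re.fullmatch.
    Python writes the first capture group as \\1 where Prometheus uses $1.
    A name that does not match the regex is returned unchanged.
    """
    m = re.fullmatch(regex, metric_name)
    if m is None:
        return metric_name
    return m.expand(replacement)


for name in ["mx_gpu_state", "mx_ecc_error_count", "node_load1"]:
    print(name, "->", relabel_name(name))
```

Only names that start with `mx_` are rewritten, so metrics from other exporters scraped by the same job keep their names.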
6. Appendix
6.1. Terms and Abbreviations
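| Term/Abbreviation | Full name | Description |
|---|---|---|
| PCIe | Peripheral Component Interconnect Express | A high-speed serial computer expansion bus standard |
| DPM | Dynamic Power Management | Dynamic power management feature |
| VPUE | Video Processing Unit Encoder | Video processing unit, encoder |
| VPUD | Video Processing Unit Decoder | Video processing unit, decoder |
| MetaXLink | MetaX GPU D2D interface bus | |
| ECC | Error Checking and Correcting | Error checking and correction |
| EID | Error ID | GPU error code |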