4. Monitoring Mode

4.1. Monitoring Items

Table 4.1 mx-diagease monitoring items

Monitoring category                 Anomaly                           Log message
----------------------------------  --------------------------------  ---------------------------------------------------------------
Power monitoring                    Power limit exceeded              Device (xxx) power exceeds limit.
                                    Failed to get device power        Device (xxx) get power failed.
Over-temperature monitoring         Over-temperature                  Device (xxx) vr temperature exceeds limit.
                                                                      Device (xxx) chip temperature exceeds limit.
                                                                      Device (xxx) board temperature exceeds limit.
                                    CTF                               Device (xxx) chip temperature fault.
                                                                      Device (xxx) board temperature fault.
                                    Failed to get device temperature  Device (xxx) get temperature failed.
PCC monitoring                      PCC occurred                      Device (xxx) warning Pcc
                                                                      Device (xxx) critical Pcc
                                                                      warning: pcc counter share per unit time does not exceed 3%.
                                                                      critical: pcc counter share per unit time exceeds 3%.
                                    Counter operation failed          Device (xxx) set counter failed: Pcc
                                                                      Device (xxx) get counter failed: Pcc
                                    Counter cleared unexpectedly      Device (xxx) cleared counter unexpectedly: Pcc
Power brake monitoring              Power brake occurred              Device (xxx) warning Pwrbrk
                                                                      Device (xxx) critical Pwrbrk
                                                                      warning: pwrbrk counter share per unit time does not exceed 3%.
                                                                      critical: pwrbrk counter share per unit time exceeds 3%.
                                    Counter operation failed          Device (xxx) set counter failed: Pwrbrk
                                                                      Device (xxx) get counter failed: Pwrbrk
                                    Counter cleared unexpectedly      Device (xxx) cleared counter unexpectedly: Pwrbrk
DI/DT monitoring                    DI/DT occurred                    Device (xxx) warning Didt
                                                                      Device (xxx) critical Didt
                                                                      warning: didt counter share per unit time does not exceed 3%.
                                                                      critical: didt counter share per unit time exceeds 3%.
                                    Counter operation failed          Device (xxx) set counter failed: Didt
                                                                      Device (xxx) get counter failed: Didt
                                    Counter cleared unexpectedly      Device (xxx) cleared counter unexpectedly: Didt
Power state (deepsleep) monitoring  Power state error                 Device (xxx) critical power state error.
                                    Failed to get device clocks       Device (xxx) get clocks failed.
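
The warning/critical split for the Pcc, Pwrbrk, and Didt counters can be illustrated with a short Python sketch. It is not the mx-diagease implementation: the function names are hypothetical, the counter share is taken as an already computed fraction (how the tool derives it is not documented here), and treating a share of zero as "no anomaly" is an assumption. Only the 3% boundary and the log message shape come from Table 4.1.

```python
from typing import Optional

# 3% boundary stated in Table 4.1.
THRESHOLD = 0.03


def classify_counter(share: float) -> Optional[str]:
    """Map a counter share (0.0 to 1.0) to 'warning' or 'critical'.

    Returns None for a share of zero; that case is an assumption,
    since the table only describes the two anomaly levels.
    """
    if share < 0:
        raise ValueError("counter share cannot be negative")
    if share == 0:
        return None
    # "does not exceed 3%" is a warning, "exceeds 3%" is critical.
    return "critical" if share > THRESHOLD else "warning"


def format_counter_log(device: str, counter: str, share: float) -> Optional[str]:
    """Build a message in the 'Device (xxx) <level> <counter>' shape."""
    level = classify_counter(share)
    if level is None:
        return None
    return f"Device ({device}) {level} {counter}"
```

For example, `format_counter_log("0", "Pcc", 0.01)` yields a warning message, while any share above 0.03 yields a critical one.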

4.2. Monitoring Command

mx-diagease -m -t <time>

Running this command requires sudo privileges.

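-t specifies the monitoring duration and accepts either [seconds] or [hh:mm:ss]. If it is omitted, monitoring runs indefinitely; press Ctrl+C to exit mx-diagease, which then displays the summary information.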
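
When generating the -t value from a script, it can help to check it first. The following Python sketch converts both accepted forms to a number of seconds. The helper name `parse_duration` is hypothetical, and rejecting minute or second fields above 59 is an assumption; how mx-diagease itself validates the argument is not specified in this section.

```python
import re


def parse_duration(value: str) -> int:
    """Convert '[seconds]' or '[hh:mm:ss]' text into total seconds."""
    text = value.strip()

    # Plain seconds, for example "90".
    if re.fullmatch(r"[0-9]+", text):
        return int(text)

    # hh:mm:ss, for example "01:30:00".
    match = re.fullmatch(r"([0-9]+):([0-9]{1,2}):([0-9]{1,2})", text)
    if match:
        hours, minutes, seconds = (int(part) for part in match.groups())
        # Assumption: minutes and seconds must stay below 60.
        if minutes < 60 and seconds < 60:
            return hours * 3600 + minutes * 60 + seconds

    raise ValueError(f"unsupported duration format: {value!r}")
```

With this helper, `parse_duration("01:30:00")` and `parse_duration("5400")` describe the same monitoring window.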

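The command above continuously monitors the board's power module, counter data, and related items, and prints any abnormal indicator in real time. Logs are available in the mxdiag-log folder under the directory where mx-diagease is run.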
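
For longer runs, the anomaly messages can be tallied after the fact. The sketch below counts messages per device from an iterable of log lines. It assumes each anomaly line contains the "Device (xxx) ..." text from Table 4.1, possibly preceded by a timestamp or other prefix; the actual layout of the files in mxdiag-log is not described here, and `count_anomalies` is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Iterable

# Message shape from Table 4.1: "Device (<id>) <message>", with an
# optional trailing period. Anything before "Device" is ignored.
DEVICE_MSG = re.compile(r"Device \((?P<dev>[^)]+)\) (?P<msg>.+?)\.?\s*$")


def count_anomalies(lines: Iterable[str]) -> Counter:
    """Count (device, message) pairs found in the given log lines."""
    counts: Counter = Counter()
    for line in lines:
        match = DEVICE_MSG.search(line)
        if match:
            counts[(match.group("dev"), match.group("msg"))] += 1
    return counts
```

To apply it to real logs, pass the lines of each file under mxdiag-log; the file names and encoding in that folder would need to be checked on the system in question.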

Output

  • If the monitoring result is healthy on exit, the output looks like this:

    MetaX Diagnostic tool Version: X.X.XX
    Product : C500
    Kmd version : X.X.X
    Bios version : X.XX.X.X
    Maca version : X.XX.X.X
    ^C
    ------------------ Result -----------------
    Device xxx
    Device xxx is healthy
    
  • Anomalies are printed in real time, and summary information is displayed on exit, as shown below:

    MetaX Diagnostic tool Version: X.X.XX
    Product : C500
    Kmd version : X.X.X
    Bios version : X.XX.X.X
    Maca version : X.XX.X.X
    
    ------------------ Result -----------------
    Device 0
    Device 0 is healthy
    Device 1
    WARNING, power exceeds limit
    CAUTION, warning Didt
    CAUTION, get temperature info failed
    Device 2
    CRITICAL, critical power state error
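
If the summary needs to be consumed by another script, the Result section can be parsed into a per-device structure. The Python sketch below is based only on the sample output shown above: it assumes device headers have the form "Device <id>", that findings have the form "<SEVERITY>, <message>", and that WARNING, CAUTION, and CRITICAL are the only severities (other values may exist but do not appear in this section). `parse_summary` is a hypothetical name.

```python
import re
from typing import Dict, List, Optional, Tuple

DEVICE_HEADER = re.compile(r"^Device (\S+)$")
FINDING = re.compile(r"^(WARNING|CAUTION|CRITICAL), (.+)$")


def parse_summary(text: str) -> Dict[str, List[Tuple[str, str]]]:
    """Return {device_id: [(severity, message), ...]} from the Result section.

    A device with an empty list had no findings, which corresponds
    to the 'Device <id> is healthy' line in the sample output.
    """
    results: Dict[str, List[Tuple[str, str]]] = {}
    current: Optional[str] = None
    in_result = False
    for raw in text.splitlines():
        line = raw.strip()
        # The summary starts after the dashed "Result" separator line.
        if line.startswith("-") and "Result" in line:
            in_result = True
            continue
        if not in_result:
            continue
        header = DEVICE_HEADER.match(line)
        if header:
            current = header.group(1)
            results[current] = []
            continue
        finding = FINDING.match(line)
        if finding and current is not None:
            results[current].append((finding.group(1), finding.group(2)))
    return results
```

Lines such as "Device 0 is healthy" match neither pattern and are skipped, so healthy devices simply map to an empty list.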