附件一为相关日志
附件二为部署的k8s的yaml文件
附件一为相关日志
附件二为部署的k8s的yaml文件
尊敬的开发者您好,请提供以下信息
一、软硬件信息
1.服务器厂家:
2.沐曦GPU型号:
3.操作系统内核版本:
4.是否开启CPU虚拟化:
5.mx-smi回显:
6.docker info回显:
7.镜像版本:
8.启动容器命令:
9.容器内执行命令:
二、问题现象
请描述详细的问题现象日志。若日志过长,请上传附件(txt格式)。
一、软硬件信息
1.服务器厂家:
New H3C Technologies Co., Ltd.
2.沐曦GPU型号:MetaX C500
3.操作系统内核版本:5.19.0-46-generic
4.是否开启CPU虚拟化:
5.mx-smi回显:
root@chaoxun:/home/sts# mx-smi
mx-smi version: 2.2.9
=================== MetaX System Management Interface Log ===================
Timestamp : Thu Feb 26 11:24:27 2026
Attached GPUs : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.9 Kernel Mode Driver Version: 3.4.4 |
| MACA Version: 3.0.0.8 BIOS Version: 1.26.1.0 |
|------------------+-----------------+---------------------+----------------------|
| Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
| Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
|==================+=================+=====================+======================|
| 0 MetaX C500 | 0 Off | 0000:08:00.0 | 0% Disabled |
| 72W / 350W | 50C P9 | 59360/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 1 MetaX C500 | 1 Off | 0000:09:00.0 | 0% Disabled |
| 76W / 350W | 52C P9 | 59360/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 2 MetaX C500 | 2 Off | 0000:0e:00.0 | 0% Disabled |
| 72W / 350W | 49C P9 | 59360/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 3 MetaX C500 | 3 Off | 0000:11:00.0 | 0% Disabled |
| 75W / 350W | 50C P9 | 59360/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 4 MetaX C500 | 4 Off | 0000:32:00.0 | 0% Disabled |
| 58W / 350W | 45C P0 | 860/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 5 MetaX C500 | 5 Off | 0000:38:00.0 | 0% Disabled |
| 59W / 350W | 46C P0 | 860/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 6 MetaX C500 | 6 Off | 0000:3b:00.0 | 0% Disabled |
| 61W / 350W | 48C P0 | 860/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 7 MetaX C500 | 7 Off | 0000:3c:00.0 | 0% Disabled |
| 63W / 350W | 50C P0 | 860/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| 0 54852 VLLM::Worker_TP 58496 |
| 1 54853 VLLM::Worker_TP 58496 |
| 2 54854 VLLM::Worker_TP 58496 |
| 3 54855 VLLM::Worker_TP 58496 |
+---------------------------------------------------------------------------------+
6.docker info回显:
Client:
Version: 24.0.9
Context: default
Debug Mode: false
Server:
Containers: 123
Running: 63
Paused: 0
Stopped: 60
Images: 137
Server Version: 24.0.9
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: metax runc
Default Runtime: metax
Init Binary: docker-init
containerd version: 7c3aca7a610df76212171d200ca3811ff6096eb8
runc version: v1.1.12-0-g51d5e94
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.19.0-46-generic
Operating System: Ubuntu 22.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 160
Total Memory: 1.968TiB
Name: chaoxun
ID: d55e8de3-5871-4a10-814a-87f1c884bf87
Docker Root Dir: /mnt/disk0/sagesuite/data/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Registry Mirrors:
docker.mirrors.sjtug.sjtu.edu.cn/
docker.m.daocloud.io/
docker.1panel.live/
Live Restore Enabled: false
Product License: Community Engine
7.镜像版本:cr.metax-tech.com/public-ai-release/maca/modelzoo.llm.vllm:1.0.0-maca.ai3.3.0.11-torch2.6-py312-ubuntu22.04-amd64
8.启动容器命令:
command:
- /opt/conda/bin/vllm
- serve
# 3. 模型路径/名称:改为你要运行的Qwen3-VL-30B-A3B模型
- /models # 假设模型文件存放在宿主机的这个子目录下
- --port
- "8000"
# 4. 张量并行:根据你的显卡数量和模型需求调整。MoE模型建议开启专家并行。
- --tensor-parallel-size
- "4" # 假设仍然使用4张卡
# 5. 【关键】MoE模型必须设置:如果tensor-parallel-size无法被专家数整除,必须启用此选项
- --enable-expert-parallel
# 6. 数据类型:官方推荐使用bfloat16以获得最佳性能
- --dtype
- "bfloat16" # 从float16改为bfloat16
# 7. 模型长度:可根据需求调整,例如设为32768或更高
- --max-model-len
- "16384" # 提升上下文长度以发挥模型潜力 [citation:10]
# 8. 显存利用率:可以保持0.8-0.9,具体看显存大小
- --gpu-memory-utilization
- "0.9"
- --use-v1-engine
- "false"
# 9. 服务名称:可自定义
- --served-model-name
- qwen3-vl-30b-a3b
# 10. 多模态相关参数:根据实际硬件情况决定是否保留 --enforce-eager
# 如果使用FlashAttention,可以尝试去掉 --enforce-eager
- --enforce-eager # 如果不支持flash-attn或遇到问题,保留此项
- --no-enable-chunked-prefill
- --no-enable-prefix-caching
- --block-size
- "16"
- --disable-log-requests
- --disable-log-stats
- --disable-custom-all-reduce
- --disable-custom-all-reduce
- --max-num-seqs
- "1" # 强制串行处理
9.容器内执行命令:
二、问题现象
沐曦C500部署qwen3-vl-30b-a3b,第一次可以识别,第二次识别图片时报错。
尊敬的开发者您好,GPU裸机驱动和固件较低,请裸机升级驱动和固件后尝试
请问固件在哪下载安装,我只找到了驱动跟sdk
root@chaoxun:/lib/firmware/metax/mxc500# mx-smi -u /lib/firmware/metax/$chip_type/mxc500/mxvbios-1.31.1.0-1078-C550.bin -t 600
mx-smi version: 2.2.12
Hint: -u only support upgrading vbios for all devices.
GPU#0 vbios-upgrade Ioctl failed: Chip info mismatch
root@chaoxun:/lib/firmware/metax/mxc500# mx-smi
mx-smi version: 2.2.12
=================== MetaX System Management Interface Log ===================
Timestamp : Thu Feb 26 15:05:22 2026
Attached GPUs : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.12 Kernel Mode Driver Version: 3.4.4 |
| MACA Version: 3.5.3.17 BIOS Version: 1.26.1.0 |
|------------------+-----------------+---------------------+----------------------|
| Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
| Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
|==================+=================+=====================+======================|
| 0 MetaX C500 | 0 Off | 0000:08:00.0 | 0% Disabled |
| 60W / 350W | 47C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 1 MetaX C500 | 1 Off | 0000:09:00.0 | 0% Disabled |
| 63W / 350W | 48C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 2 MetaX C500 | 2 Off | 0000:0e:00.0 | 0% Disabled |
| 58W / 350W | 45C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 3 MetaX C500 | 3 Off | 0000:11:00.0 | 0% Disabled |
| 61W / 350W | 46C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 4 MetaX C500 | 4 Off | 0000:32:00.0 | 0% Disabled |
| 58W / 350W | 45C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 5 MetaX C500 | 5 Off | 0000:38:00.0 | 0% Disabled |
| 59W / 350W | 46C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 6 MetaX C500 | 6 Off | 0000:3b:00.0 | 0% Disabled |
| 61W / 350W | 47C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 7 MetaX C500 | 7 Off | 0000:3c:00.0 | 0% Disabled |
| 63W / 350W | 49C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| no process found |
+---------------------------------------------------------------------------------+
你好,我更新了驱动跟sdk后,再更新固件报错了,是哪里操作错了吗
尊敬的开发者您好,请选择mxvbios-C500固件
mx-smi -u /lib/firmware/metax/$chip_type/mxc500/mxvbios-1.31.1.0-1078-C550.bin -t 600
参照教程,这个可以吗?
请您查看我的回复,已经选择mx-smi -u /lib/firmware/metax/$chip_type/mxc500/mxvbios-1.31.1.0-1078-C550.bin -t 600不是吗?
尊敬的开发者您好,请选择mx-smi -u /lib/firmware/metax/$chip_type/mxc500/mxvbios-1.31.1.0-1078-C500.bin -t 600
好的,我看错了