I. Hardware and software information:
1. Server vendor: Inspur
2. MetaX GPU model: MetaX C500, 8 cards
3. OS kernel version: 6.6.0-32.7.v2505.ky11.x86_64
4. CPU virtualization enabled: Yes
5. mx-smi output:
mx-smi version: 2.2.12
=================== MetaX System Management Interface Log ===================
Timestamp : Wed May 20 18:14:56 2026
Attached GPUs : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.12                                Kernel Mode Driver Version: 3.6.11 |
| MACA Version: unknown                        BIOS Version: 1.31.1.0             |
|------------------+-----------------+---------------------+----------------------|
| Board Name       | GPU Persist-M   | Bus-id              | GPU-Util sGPU-M      |
| Pwr:Usage/Cap    | Temp Perf       | Memory-Usage        | GPU-State            |
|==================+=================+=====================+======================|
| 0 MetaX C500     | 0 Off           | 0000:04:00.0        | 0% Disabled          |
| 82W / 350W       | 61C P9          | 40353/65536 MiB     | Available            |
+------------------+-----------------+---------------------+----------------------+
| 1 MetaX C500     | 1 Off           | 0000:05:00.0        | 0% Disabled          |
| 75W / 350W       | 58C P9          | 40993/65536 MiB     | Available            |
+------------------+-----------------+---------------------+----------------------+
| 2 MetaX C500     | 2 Off           | 0000:63:00.0        | 0% Disabled          |
| 80W / 350W       | 56C P9          | 40353/65536 MiB     | Available            |
+------------------+-----------------+---------------------+----------------------+
| 3 MetaX C500     | 3 Off           | 0000:64:00.0        | 0% Disabled          |
| 80W / 350W       | 59C P9          | 40993/65536 MiB     | Available            |
+------------------+-----------------+---------------------+----------------------+
| 4 MetaX C500     | 4 Off           | 0000:83:00.0        | 0% Disabled          |
| 82W / 350W       | 56C P9          | 40993/65536 MiB     | Available            |
+------------------+-----------------+---------------------+----------------------+
| 5 MetaX C500     | 5 Off           | 0000:84:00.0        | 0% Disabled          |
| 72W / 350W       | 53C P9          | 40353/65536 MiB     | Available            |
+------------------+-----------------+---------------------+----------------------+
| 6 MetaX C500     | 6 Off           | 0000:e4:00.0        | 0% Disabled          |
| 81W / 350W       | 58C P9          | 40993/65536 MiB     | Available            |
+------------------+-----------------+---------------------+----------------------+
| 7 MetaX C500     | 7 Off           | 0000:e5:00.0        | 0% Disabled          |
| 74W / 350W       | 54C P9          | 40353/65536 MiB     | Available            |
+------------------+-----------------+---------------------+----------------------+
+---------------------------------------------------------------------------------+
| Process:                                                                        |
| GPU  PID         Process Name                                GPU Memory         |
|                                                              Usage(MiB)         |
|=================================================================================|
| 0    1025936     VLLM::Worker_TP                             39386              |
| 1    1025937     VLLM::Worker_TP                             40026              |
| 2    1025938     VLLM::Worker_TP                             39386              |
| 3    1025939     VLLM::Worker_TP                             40026              |
| 4    1025940     VLLM::Worker_TP                             40026              |
| 5    1025941     VLLM::Worker_TP                             39386              |
| 6    1025942     VLLM::Worker_TP                             40026              |
| 7    1025943     VLLM::Worker_TP                             39386              |
+---------------------------------------------------------------------------------+
6. docker info output:
[root@localhost ~]# docker info
Client:
 Version:    24.0.9
 Context:    default
 Debug Mode: false
Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 24.0.9
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 9a04df1519ac2967eece6c6a5d13d3b846b574b2.m
 runc version:
 init version:
 Security Options:
  seccomp
   Profile: builtin
 Kernel Version: 6.6.0-32.7.v2505.ky11.x86_64
 Operating System: Kylin Linux Advanced Server V11 (Swan25)
 OSType: linux
 Architecture: x86_64
 CPUs: 256
 Total Memory: 1.472TiB
 Name: localhost.localdomain
 ID: ded90092-4000-426b-a3ca-08950e376242
 Docker Root Dir: /home/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  docker.1ms.run/
  dockerpull.com/
  registry.docker-cn.com/
 Live Restore Enabled: false
7. Image version:
cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64
8. Container start command:
docker run -itd \
--name qwen3.6 \
--network host \
--shm-size 512G \
--device=/dev/dri \
--device=/dev/mxcd \
--group-add video \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
--shm-size 100gb \
--ulimit memlock=-1 \
-v /home/modelscope:/root/vllm \
-e TZ=Asia/Shanghai \
-p 8000:8000 \
-p 8001:8001 \
-p 8002:8002 \
cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64
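The command above passes --shm-size twice, once as 512G and once as 100gb. A minimal sketch for checking which value the running container actually ended up with, assuming the container name qwen3.6 from the command; the helper names bytes_to_gib and container_shm_gib are hypothetical and exist only for this illustration.

```shell
#!/usr/bin/env bash

# Convert a byte count to GiB with one decimal place.
bytes_to_gib() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

# Print the shared-memory size (in GiB) that the named container was created with.
# docker inspect reports HostConfig.ShmSize in bytes.
container_shm_gib() {
  bytes_to_gib "$(docker inspect --format '{{.HostConfig.ShmSize}}' "$1")"
}
```

On the host, container_shm_gib qwen3.6 would print the effective size, which can then be compared against the two values given on the command line.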
9. Command run inside the container:
nohup vllm serve /root/vllm/Qwen/Qwen3.6-35B-A3B/ -tp 8 \
--host 0.0.0.0 \
--port 8000 \
--served-model-name qwen3.6 \
--dtype bfloat16 \
--trust-remote-code \
--tensor-parallel-size 8 \
--distributed-executor-backend mp \
--gpu-memory-utilization 0.8 \
--max-model-len 32768 \
--max-num-batched-tokens 327680 \
--kv-cache-dtype fp8_e4m3 > qwen.log 2>&1 &
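II. Problem description
Inference is slow. First-round prompt prefill: 2.2 tokens/s (input processing is slow). Generation phase, steady state: 70 to 73 tokens/s.
The log output is as follows: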
(APIServer pid=254754) INFO 05-20 20:11:26 [loggers.py:259] Engine 000: Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 7.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO 05-20 20:11:36 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 73.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.7%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO 05-20 20:11:46 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 72.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO 05-20 20:11:56 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 72.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.2%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO 05-20 20:12:06 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 71.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.3%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO 05-20 20:12:16 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 71.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.6%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO: 10.217.247.136:54410 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=254754) INFO 05-20 20:12:26 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 32.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=254754) INFO 05-20 20:12:36 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
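To make the two figures in the log above easier to compare, here is a minimal sketch that pulls the prompt and generation throughput out of each "Engine 000" line. The function names extract_throughput and tokens_in_interval are hypothetical. The second helper rests on an assumption the log itself does not state: that each figure is an average over the interval between two consecutive log lines (10 seconds here, judging by the timestamps). If that holds, a reading of 2.2 tokens/s would correspond to roughly 22 prompt tokens processed somewhere within that window, not necessarily to a sustained prefill rate.

```shell
#!/usr/bin/env bash

# Read vLLM log lines on stdin and print, for each throughput line:
#   <time> <prompt tokens/s> <generation tokens/s>
# Lines that do not carry throughput figures (e.g. HTTP access lines) are skipped.
extract_throughput() {
  sed -n -E 's#.*INFO [0-9-]+ ([0-9:]+) .*Avg prompt throughput: ([0-9.]+) tokens/s, Avg generation throughput: ([0-9.]+) tokens/s.*#\1 \2 \3#p'
}

# Assumption: the logged rate is averaged over the logging interval,
# so tokens in that interval = rate * interval length in seconds.
tokens_in_interval() {
  awk -v rate="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", rate * secs }'
}
```

For example, extract_throughput < qwen.log lists one row per logging interval, and tokens_in_interval 2.2 10 gives the token count implied by the first row under the averaging assumption. Comparing that count with the actual prompt length of the test request would show whether the assumption fits.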