Model terminated abnormally while running, and every restart attempt fails with an error


May 22, 2026, 13:51

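I. Software and hardware information
1. Server vendor: Inspur Information (浪潮信息)
2. MetaX GPU model: a single MetaX 曦思 N260
3. Operating system kernel version: 4.19.90-89.11.v2401.ky10.x86_64
4. CPU virtualization enabled: yes
5. mx-smi output: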
mx-smi version: 2.3.1

=================== MetaX System Management Interface Log ===================
Timestamp : Mon May 11 10:18:11 2026

Attached GPUs : 1
+---------------------------------------------------------------------------------+
| MX-SMI 2.3.1 Kernel Mode Driver Version: 3.7.11 |
| MACA Version: 3.7.0.38 BIOS Version: 1.31.1.0 |
|------------------+-----------------+---------------------+----------------------|
| Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
| Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
|==================+=================+=====================+======================|
| 0 MetaX N260 | 0 Off | 0000:c1:00.0 | 0% Disabled |
| 60W / 225W | 59C P9 | 52895/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+

+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| 0 3760427 VLLM::EngineCor 52228 |
+---------------------------------------------------------------------------------+
6. docker info output:
Client:
Version: 29.3.1
Context: default
Debug Mode: false
Plugins:
compose: Docker Compose (Docker Inc.)
Version: v2.24.6
Path: /usr/local/lib/docker/cli-plugins/docker-compose

Server:
Containers: 25
Running: 24
Paused: 0
Stopped: 1
Images: 56
Server Version: 29.3.1
Storage Driver: overlayfs
driver-type: io.containerd.snapshotter.v1
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
CDI spec directories:
/etc/cdi
/var/run/cdi
Swarm: inactive
Runtimes: io.containerd.runc.v2 metax runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 301b2dac98f15c27117da5c8af12118a041a31d9
runc version: v1.3.4-0-gd6d73eb
init version: de40ad0
Security Options:
seccomp
Profile: builtin
Kernel Version: 4.19.90-89.11.v2401.ky10.x86_64
Operating System: Kylin Linux Advanced Server V10 (Halberd)
OSType: linux
Architecture: x86_64
CPUs: 64
Total Memory: 61.55GiB
Name: localhost.localdomain
ID: f92e3bfc-06d2-4441-886f-8b48bf0e6b27
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
::1/128
127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
Firewall Backend: iptables

WARNING: Support for cgroup v1 is deprecated and planned to be removed by no later than May 2029 (github.com/moby/moby/issues/51111)
7. Image version:
vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64
8. Container launch command:
metax-docker run -itd --gpus="[<sgpu:${GPU_UUID}>]" --group-add video --network=host --name llm-model --entrypoint bash --restart unless-stopped --shm-size=32g --security-opt seccomp=unconfined --security-opt apparmor=unconfined --ulimit memlock=-1 -v /home/models:/models cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64 -c "/models/run_model.sh"
9. Command run inside the container:
VLLM_USE_V1=1 /opt/conda/bin/vllm serve /models/Qwen3-32B-AWQ --max-num-seqs 8 --async-scheduling --host 0.0.0.0 --port 9901 --served-model-name qwen3 -tp 1 --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 8192 --max-num-batched-tokens 8192 --reasoning-parser qwen3 --no-enable-prefix-caching
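
For reference, the memory figures quoted in this post work out as follows. This is only arithmetic on the numbers in the mx-smi snapshot and the `vllm serve` command above. It assumes that `--gpu-memory-utilization` is a fraction of total device memory, as documented for upstream vLLM, and it says nothing about the cause of the fault. The variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# Values copied from the mx-smi output and the vllm serve command in this post.
total_mib=65536     # total memory of the N260 as reported by mx-smi
used_mib=52895      # Memory-Usage at the time of the mx-smi snapshot
util_percent=95     # --gpu-memory-utilization 0.95, written as a percentage

# Memory this vLLM instance is allowed to use (integer MiB, rounded down).
budget_mib=$(( total_mib * util_percent / 100 ))

# Memory that was not in use when the snapshot was taken.
free_mib=$(( total_mib - used_mib ))

echo "budget_mib=${budget_mib}"
echo "free_mib=${free_mib}"
```

With these inputs the script prints `budget_mib=62259` and `free_mib=12641`.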

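II. Problem description
While running the model with vLLM, the model hit an error and stopped abnormally, and it can no longer be restarted. Every restart attempt fails with the message "memory access offset is negative, out of bounds, or misaligned in kernel". How can this be resolved?

The error log is attached.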


dmesg.txt

Text, 194.1 KB, uploaded by mukewang on May 22, 2026.
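
To locate the relevant entries in a log of this size, a small search helper such as the following may be useful. It is a minimal sketch: the function names and the `LOG_FILE` variable are hypothetical, and it assumes the quoted error text actually appears in the file being searched, which may not hold for `dmesg.txt` if the message is only printed on the vLLM console.

```shell
#!/usr/bin/env bash
# Search a log file for the error text quoted in this post.
# LOG_FILE is a hypothetical variable; it defaults to the attachment name dmesg.txt.

ERROR_TEXT="memory access offset is negative, out of bounds, or misaligned"

# Print every matching line, prefixed with its line number.
find_fault_lines() {
    local log_file="$1"
    # -F treats the pattern as a fixed string, -n adds line numbers.
    grep -n -F -- "$ERROR_TEXT" "$log_file"
}

# Print the number of matching lines (0 when there are none).
count_fault_lines() {
    local log_file="$1"
    # grep -c prints 0 but exits non-zero on no match; "|| true" keeps the status clean.
    grep -c -F -- "$ERROR_TEXT" "$log_file" || true
}

# Only run when the log file is present in the current directory.
if [ -f "${LOG_FILE:-dmesg.txt}" ]; then
    count_fault_lines "${LOG_FILE:-dmesg.txt}"
    find_fault_lines "${LOG_FILE:-dmesg.txt}"
fi
```

Setting `LOG_FILE` to a saved copy of the container's console output would run the same search against the vLLM log instead.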