Sorry, I have been using the wrong name all along. By running the workload repeatedly I found that the first run is extremely slow, while every subsequent run gets faster. For example: with a single run, the latency is 2500 ms; with 2 runs total, the 1st takes 1300 ms and the 2nd 20 ms; with 3 runs, the 1st takes 840 ms and the 2nd/3rd take 13 ms and 12 ms; with 4 runs, the 1st takes 630 ms and the 2nd through 4th all take around 10 ms. As the total number of runs grows, the first-run latency drops to as low as 250 ms, and the subsequent runs settle at around 3.8 ms. What causes this behavior?
I. Hardware and software information
1. Server vendor: Kunpeng TaiShan 920 series (new model) server
2. MetaX GPU model: 曦云 C500
3. OS kernel version: Linux localhost.localdomain 6.6.0-72.0.0.76.oe2403sp1.aarch64
4. CPU virtualization: enabled
5. mx-smi output:
mx-smi version: 2.2.12
=================== MetaX System Management Interface Log ===================
Timestamp : Wed Feb 11 14:47:59 2026
Attached GPUs : 1
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.12 Kernel Mode Driver Version: 3.6.11 |
| MACA Version: 3.5.3.17 BIOS Version: 1.31.1.0 |
|------------------+-----------------+---------------------+----------------------|
| Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
| Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
|==================+=================+=====================+======================|
| 0 MetaX C500 | 0 Off | 0000:ab:00.0 | 0% Disabled |
| 34W / 350W | 51C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| no process found |
+---------------------------------------------------------------------------------+
6. docker info output:
Containers: 2
Running: 2
Paused: 0
Stopped: 0
Images: 1
Server Version: 18.09.0
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Hugetlb Pagesize: 2MB, 64KB, 32MB, 1GB, 64KB, 32MB, 2MB, 1GB (default is 2MB)
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 85f5646ca2e0404de288487d7b0414c4c44e9715
runc version: N/A
init version: N/A (expected: )
Security Options:
seccomp
Profile: default
Kernel Version: 6.6.0-72.0.0.76.oe2403sp1.aarch64
Operating System: openEuler 24.03 (LTS-SP1)
OSType: linux
Architecture: aarch64
CPUs: 160
Total Memory: 1006GiB
Name: localhost.localdomain
ID: WVRV:S6D5:4ASY:N5L3:DAJN:UJLQ:GP6C:M543:GKCC:DMLH:JSGM:HIHO
Docker Root Dir: /home/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: true
7. Image version:
REPOSITORY TAG IMAGE ID CREATED SIZE
maca-torch2.4-py310 mc3.3.0.4-kylinv10-arm64 9f4c837ed9ad 2 months ago 22.2GB
8. Container launch command:
docker run -it --device=/dev/mxcd --device=/dev/dri --privileged=true --ipc="shareable" --name torch24 --shm-size=256g -v /home/:/home/ -w /home/ maca-torch2.4-py310:mc3.3.0.4-kylinv10-arm64 /bin/bash
9. Script executed inside the container:
import time

import torch
from transformers import AutoProcessor, AutoModel
from PIL import Image

use_NPU = True
model_path = "/home/models/siglip-so400m-patch14-384/"
if use_NPU:
    model = AutoModel.from_pretrained(model_path, local_files_only=True).cuda()
else:
    model = AutoModel.from_pretrained(model_path, local_files_only=True)
processor = AutoProcessor.from_pretrained(model_path, local_files_only=True)

image = Image.open("/home/datasets/siglip/photos/xxxxx.jpg")
resized_image = image.resize((224, 224), resample=Image.Resampling.LANCZOS)
texts = ["xxxxx", "xxxxx", "xxxxx", "xxxxx", "xxxxx"]
inputs = processor(text=texts, images=resized_image, padding=True, return_tensors="pt")
if use_NPU:
    for key, value in inputs.items():
        inputs[key] = value.cuda()

start = time.time()
with torch.no_grad():
    outputs = model(**inputs)
print("elapsed:", (time.time() - start) * 1000, "ms")

logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
for text, prob in zip(texts, probs[0]):
    print(f"{text}: {prob:.4f}")
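The script above times a single forward pass, so one-off warm-up costs (kernel compilation, allocator growth, lazy device initialization) are folded into the reported number. A minimal, framework-agnostic sketch of separating first-call latency from steady-state latency; the helper name and run counts are my own, not from the MACA documentation:

```python
import time

def measure_latency(fn, warmup=3, runs=10):
    """Time a callable, reporting the first call separately.

    The first call typically pays one-off costs, so it is kept
    apart from the steady-state average over the measured runs.
    """
    first_ms = None
    for _ in range(warmup):
        t0 = time.perf_counter()
        fn()
        if first_ms is None:
            first_ms = (time.perf_counter() - t0) * 1000
    t0 = time.perf_counter()
    for _ in range(runs):
        fn()
    steady_ms = (time.perf_counter() - t0) * 1000 / runs
    return first_ms, steady_ms

# On a GPU/NPU, fn should include a device synchronize (e.g. wrap
# model(**inputs) together with torch.cuda.synchronize()) so that
# asynchronous kernel launches are not undercounted.
first, steady = measure_latency(lambda: sum(range(100_000)))
print(f"first: {first:.2f} ms, steady-state: {steady:.2f} ms")
```

Comparing first-call latency against the steady-state average makes it clear whether the slowness is a per-inference cost or a one-time initialization cost.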
II. Problem description
Inference with the siglip model on the NPU takes too long (2616 ms): slower than on an NVIDIA 4090 (310 ms), and even slower than CPU-only inference on the bare machine (1037 ms). When running NPU inference inside the container, the following warning is printed: /opt/conda/lib/python3.10/site-packages/torch/nn/functional.py:5168: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /workspace/framework/mcPytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:617.)
return _scaled_dot_product_attention(query, key, value, attn_mask, dropout_p, is_causal, scale = scale)
I am running the siglip model on a Kunpeng 920 (new model) CPU plus one 曦云 C500 NPU. For the same image, CPU-only inference on the bare machine takes about 1037 ms, NPU inference on the bare machine takes about 2837 ms, and NPU inference inside the container (maca-torch2.4-py310-mc3.3.0.4-kylinv10-arm64) takes about 2616 ms; on an NVIDIA 4090 the same inference takes about 310 ms. NPU inference being slower than CPU inference is clearly abnormal. How should I troubleshoot and fix this? The driver version is 3.5.3.11, the SDK version is 3.5.3.17, and cu-bridge is built from the master branch.
I have installed the driver, firmware, and SDK as documented, but the same error still occurs.
After downloading the maca-pytorch2.8-py312-3.5.3.9-aarch64.tar package, creating a conda environment on the bare machine, and installing pytorch and the other packages, importing pytorch fails with the following error:
File "<stdin>", line 1, in <module>
File "/home/lv/miniconda3/envs/python312/lib/python3.12/site-packages/torch/__init__.py", line 421, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: libmxomp.so: cannot open shared object file: No such file or directory
How can this be solved? I have already installed driver version 2.14.27 and MACA SDK version 2.32.0.9.
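libmxomp.so ships with the MACA SDK, so this ImportError usually means the dynamic linker cannot find the SDK's library directory. A diagnostic/configuration sketch, assuming a /opt/maca install prefix (adjust the paths to your actual installation):

```shell
# Check whether the linker can already resolve the library
ldconfig -p | grep libmxomp

# Locate it under the SDK install prefix (the prefix is an assumption)
find /opt/maca -name 'libmxomp.so*' 2>/dev/null

# If found, put its directory on the library search path before
# importing torch (here assuming /opt/maca/lib; use the directory
# reported by the find command above)
export MACA_PATH=/opt/maca
export LD_LIBRARY_PATH=$MACA_PATH/lib:$LD_LIBRARY_PATH
```

If the library is found but resides elsewhere, exporting that directory in LD_LIBRARY_PATH (or adding it via a file under /etc/ld.so.conf.d/ followed by ldconfig) before launching Python should let `from torch._C import *` succeed.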
Hello, when I use vllm 0.8.2 inside the container to serve a large model for inference, I cannot collect profiler data: after setting the VLLM_TORCH_PROFILER_DIR environment variable, the process hangs. How can this be solved?
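For reference, vLLM's documented profiling flow is to set the directory before launching the server and then toggle tracing around a short window of requests via the server's /start_profile and /stop_profile endpoints; profiling an entire long-running serving session can produce enormous traces and look like a hang. A sketch under those assumptions (port and output directory are placeholders):

```shell
# Set before launching the vLLM server so the torch profiler is enabled
export VLLM_TORCH_PROFILER_DIR=/tmp/vllm_profile

# Profile only a short window instead of the whole session
curl -X POST http://localhost:8000/start_profile
# ... send a few inference requests here ...
curl -X POST http://localhost:8000/stop_profile
```

If stopping the profiler still appears to stall, note that trace serialization at stop time can take minutes for large traces, which is easy to mistake for a deadlock.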