• Members 8 posts
    February 9, 2026, 10:14

    I am running the SigLIP model on a new-model Kunpeng 920 CPU plus a single 曦云 (XiYun) C500 NPU. For the same image, CPU-only inference on bare metal takes about 1037 ms, while NPU inference takes about 2837 ms on bare metal and about 2616 ms inside the container (maca-torch2.4-py310-mc3.3.0.4-kylinv10-arm64); on an NVIDIA 4090 the same inference takes about 310 ms. NPU inference being slower than CPU inference is clearly abnormal. How should I troubleshoot and resolve this? The driver version is 3.5.3.11, the SDK version is 3.5.3.17, and cu-bridge was built from the master branch code.

  • Members 221 posts
    February 11, 2026, 14:33

    Dear developer, please provide detailed information using the template below.
    I. Hardware and software information
    1. Server vendor:
    2. MetaX GPU model:
    3. OS kernel version:
    4. CPU virtualization enabled:
    5. mx-smi output:
    6. docker info output:
    7. Image version:
    8. Container launch command:
    9. Commands executed inside the container:
    II. Problem description
    Please describe the problem in detail, including logs. If the logs are long, please upload them as an attachment (txt format).

  • Thread has been moved from 公共.

  • Members 8 posts
    February 11, 2026, 15:09

    I. Hardware and software information
    1. Server vendor: Kunpeng TaiShan series server with the new-model 920 CPU
    2. MetaX GPU model: 曦云C500
    3. OS kernel version: Linux localhost.localdomain 6.6.0-72.0.0.76.oe2403sp1.aarch64
    4. CPU virtualization enabled: yes
    5. mx-smi output:
    mx-smi version: 2.2.12

    =================== MetaX System Management Interface Log ===================
    Timestamp : Wed Feb 11 14:47:59 2026

    Attached GPUs : 1
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.12                          Kernel Mode Driver Version: 3.6.11       |
    | MACA Version: 3.5.3.17                 BIOS Version: 1.31.1.0                   |
    |------------------+-----------------+---------------------+----------------------|
    | Board Name       | GPU  Persist-M  | Bus-id              | GPU-Util   sGPU-M    |
    | Pwr:Usage/Cap    | Temp      Perf  | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0  MetaX C500    | 0         Off   | 0000:ab:00.0        | 0%         Disabled  |
    | 34W / 350W       | 51C       P0    | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    | GPU        PID        Process Name                                 GPU Memory   |
    |                                                                    Usage(MiB)   |
    |=================================================================================|
    | no process found                                                                |
    +---------------------------------------------------------------------------------+
    6. docker info output:
    Containers: 2
     Running: 2
     Paused: 0
     Stopped: 0
    Images: 1
    Server Version: 18.09.0
    Storage Driver: overlay2
     Backing Filesystem: extfs
     Supports d_type: true
     Native Overlay Diff: true
    Logging Driver: json-file
    Cgroup Driver: cgroupfs
    Hugetlb Pagesize: 2MB, 64KB, 32MB, 1GB, 64KB, 32MB, 2MB, 1GB (default is 2MB)
    Plugins:
     Volume: local
     Network: bridge host macvlan null overlay
     Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
    Swarm: inactive
    Runtimes: runc
    Default Runtime: runc
    Init Binary: docker-init
    containerd version: 85f5646ca2e0404de288487d7b0414c4c44e9715
    runc version: N/A
    init version: N/A (expected: )
    Security Options:
     seccomp
      Profile: default
    Kernel Version: 6.6.0-72.0.0.76.oe2403sp1.aarch64
    Operating System: openEuler 24.03 (LTS-SP1)
    OSType: linux
    Architecture: aarch64
    CPUs: 160
    Total Memory: 1006GiB
    Name: localhost.localdomain
    ID: WVRV:S6D5:4ASY:N5L3:DAJN:UJLQ:GP6C:M543:GKCC:DMLH:JSGM:HIHO
    Docker Root Dir: /home/docker
    Debug Mode (client): false
    Debug Mode (server): false
    Registry: index.docker.io/v1/
    Labels:
    Experimental: false
    Insecure Registries:
     127.0.0.0/8
    Live Restore Enabled: true
    7. Image version:
    REPOSITORY            TAG                        IMAGE ID       CREATED        SIZE
    maca-torch2.4-py310   mc3.3.0.4-kylinv10-arm64   9f4c837ed9ad   2 months ago   22.2GB
    8. Container launch command:
    docker run -it --device=/dev/mxcd --device=/dev/dri --privileged=true --ipc="shareable" --name torch24 --shm-size=256g -v /home/:/home/ -w /home/ maca-torch2.4-py310:mc3.3.0.4-kylinv10-arm64 /bin/bash
    9. Commands executed inside the container:
    import time

    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModel

    use_NPU = True
    model_path = "/home/models/siglip-so400m-patch14-384/"
    if use_NPU:
        model = AutoModel.from_pretrained(model_path, local_files_only=True).cuda()
    else:
        model = AutoModel.from_pretrained(model_path, local_files_only=True)
    processor = AutoProcessor.from_pretrained(model_path, local_files_only=True)

    image = Image.open("/home/datasets/siglip/photos/xxxxx.jpg")
    resized_image = image.resize((224, 224), resample=Image.Resampling.LANCZOS)
    texts = ["xxxxx", "xxxxx", "xxxxx", "xxxxx", "xxxxx"]

    inputs = processor(text=texts, images=resized_image, padding=True, return_tensors="pt")

    if use_NPU:
        # Move every input tensor to the device
        for key, value in inputs.items():
            inputs[key] = value.cuda()

    start = time.time()
    with torch.no_grad():
        outputs = model(**inputs)
    print("Elapsed:", (time.time() - start) * 1000, "ms")

    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1)

    for text, prob in zip(texts, probs[0]):
        print(f"{text}: {prob:.4f}")

    II. Problem description
    SigLIP inference on the NPU takes too long (2616 ms): slower than on an NVIDIA 4090 (310 ms) and even slower than CPU-only inference on bare metal (1037 ms). When running NPU inference inside the container, the following warning is printed:
    /opt/conda/lib/python3.10/site-packages/torch/nn/functional.py:5168: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /workspace/framework/mcPytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:617.)
    return _scaled_dot_product_attention(query, key, value, attn_mask, dropout_p, is_causal, scale = scale)
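
    One way to check whether the scaled_dot_product_attention warning matters is to request the plain (eager) attention implementation when loading the model. A minimal sketch, assuming the transformers version inside the image accepts the attn_implementation argument for this model:

    import torch
    from transformers import AutoModel

    model_path = "/home/models/siglip-so400m-patch14-384/"

    # "eager" bypasses torch.nn.functional.scaled_dot_product_attention, whose
    # memory-efficient backend this torch build reports as unavailable
    # (assumption: the installed transformers supports attn_implementation)
    model = AutoModel.from_pretrained(
        model_path,
        local_files_only=True,
        attn_implementation="eager",
    ).cuda()

    If the measured latency changes, the attention backend is implicated; if not, it can be ruled out as the bottleneck.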

  • Members 221 posts
    February 11, 2026, 15:25

    Dear developer, MetaX devices are GPUs, not NPUs. Please run inference 100 times, take the average, and compare.

  • Members 8 posts
    February 11, 2026, 16:16

    Sorry, I have been using the wrong term all along. By running the script multiple times, I found that the first run is extremely slow but every subsequent run is much faster. For example: with a single run, the latency is 2500 ms; with 2 runs in total, the first takes 1300 ms and the second 20 ms; with 3 runs, the first takes 840 ms and the second and third take 13 ms and 12 ms; with 4 runs, the first takes 630 ms and runs 2 through 4 take about 10 ms each. As the total number of runs grows, the first-run latency drops to as low as 250 ms, with subsequent runs at around 3.8 ms. What causes this?
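
    This pattern is consistent with one-time warm-up cost (kernel compilation and caching on first use) dominating the first run rather than steady-state performance. A minimal timing sketch that separates warm-up from steady state and averages 100 runs, as suggested above; it assumes model and inputs are prepared as in the earlier script and that this torch build supports torch.cuda.synchronize():

    import time

    import torch

    def benchmark(model, inputs, warmup=10, iters=100):
        """Average steady-state latency in ms, excluding one-time warm-up cost."""
        with torch.no_grad():
            for _ in range(warmup):    # first calls trigger kernel compilation/caching
                model(**inputs)
            torch.cuda.synchronize()   # device execution is asynchronous; drain the queue
            start = time.time()
            for _ in range(iters):
                model(**inputs)
            torch.cuda.synchronize()   # wait for the last kernel before stopping the clock
        return (time.time() - start) * 1000 / iters

    # With model and inputs from the script above:
    # print(f"steady-state latency: {benchmark(model, inputs):.2f} ms/run")

    Without the synchronize calls, time.time() can stop before the queued kernels have finished, so timing a single forward pass on an asynchronous device is not reliable.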