Sorry, I have been using the wrong name all along. By running the workload repeatedly I found that the first run is extremely slow, while every subsequent run gets faster. For example: with a single run, the latency is 2500 ms; with 2 runs total, the 1st takes 1300 ms and the 2nd 20 ms; with 3 runs, the 1st takes 840 ms and the 2nd/3rd take 13 ms and 12 ms; with 4 runs, the 1st takes 630 ms and the 2nd through 4th all take around 10 ms. As the total number of runs grows, the first-run latency drops to as low as 250 ms, and the subsequent runs settle at around 3.8 ms. What causes this behavior?
I. Hardware and software information
1. Server vendor: Kunpeng TaiShan 920 series (new model) server
2. MetaX GPU model: 曦云 C500
3. OS kernel version: Linux localhost.localdomain 6.6.0-72.0.0.76.oe2403sp1.aarch64
4. CPU virtualization: enabled
5. mx-smi output:
mx-smi version: 2.2.12
=================== MetaX System Management Interface Log ===================
Timestamp : Wed Feb 11 14:47:59 2026
Attached GPUs : 1
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.12 Kernel Mode Driver Version: 3.6.11 |
| MACA Version: 3.5.3.17 BIOS Version: 1.31.1.0 |
|------------------+-----------------+---------------------+----------------------|
| Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
| Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
|==================+=================+=====================+======================|
| 0 MetaX C500 | 0 Off | 0000:ab:00.0 | 0% Disabled |
| 34W / 350W | 51C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| no process found |
+---------------------------------------------------------------------------------+
6. docker info output:
Containers: 2
Running: 2
Paused: 0
Stopped: 0
Images: 1
Server Version: 18.09.0
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Hugetlb Pagesize: 2MB, 64KB, 32MB, 1GB, 64KB, 32MB, 2MB, 1GB (default is 2MB)
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 85f5646ca2e0404de288487d7b0414c4c44e9715
runc version: N/A
init version: N/A (expected: )
Security Options:
seccomp
Profile: default
Kernel Version: 6.6.0-72.0.0.76.oe2403sp1.aarch64
Operating System: openEuler 24.03 (LTS-SP1)
OSType: linux
Architecture: aarch64
CPUs: 160
Total Memory: 1006GiB
Name: localhost.localdomain
ID: WVRV:S6D5:4ASY:N5L3:DAJN:UJLQ:GP6C:M543:GKCC:DMLH:JSGM:HIHO
Docker Root Dir: /home/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: true
7. Image version:
REPOSITORY TAG IMAGE ID CREATED SIZE
maca-torch2.4-py310 mc3.3.0.4-kylinv10-arm64 9f4c837ed9ad 2 months ago 22.2GB
8. Container launch command:
docker run -it --device=/dev/mxcd --device=/dev/dri --privileged=true --ipc="shareable" --name torch24 --shm-size=256g -v /home/:/home/ -w /home/ maca-torch2.4-py310:mc3.3.0.4-kylinv10-arm64 /bin/bash
9. Script executed inside the container:
import time

import torch
from transformers import AutoProcessor, AutoModel
from PIL import Image

use_NPU = True
model_path = "/home/models/siglip-so400m-patch14-384/"
if use_NPU:
    model = AutoModel.from_pretrained(model_path, local_files_only=True).cuda()
else:
    model = AutoModel.from_pretrained(model_path, local_files_only=True)
processor = AutoProcessor.from_pretrained(model_path, local_files_only=True)

image = Image.open("/home/datasets/siglip/photos/xxxxx.jpg")
resized_image = image.resize((224, 224), resample=Image.Resampling.LANCZOS)
texts = ["xxxxx", "xxxxx", "xxxxx", "xxxxx", "xxxxx"]
inputs = processor(text=texts, images=resized_image, padding=True, return_tensors="pt")
if use_NPU:
    for key, value in inputs.items():
        inputs[key] = value.cuda()

start = time.time()
with torch.no_grad():
    outputs = model(**inputs)
print("elapsed:", (time.time() - start) * 1000, "ms")

logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
for text, prob in zip(texts, probs[0]):
    print(f"{text}: {prob:.4f}")
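The script above times a single forward pass, so one-off warm-up costs (kernel compilation, allocator growth, lazy device initialization) are folded into the reported number. A minimal, framework-agnostic sketch of separating first-call latency from steady-state latency; the helper name and run counts are my own, not from the MACA documentation:

```python
import time

def measure_latency(fn, warmup=3, runs=10):
    """Time a callable, reporting the first call separately.

    The first call typically pays one-off costs, so it is kept
    apart from the steady-state average over the measured runs.
    """
    first_ms = None
    for _ in range(warmup):
        t0 = time.perf_counter()
        fn()
        if first_ms is None:
            first_ms = (time.perf_counter() - t0) * 1000
    t0 = time.perf_counter()
    for _ in range(runs):
        fn()
    steady_ms = (time.perf_counter() - t0) * 1000 / runs
    return first_ms, steady_ms

# On a GPU/NPU, fn should include a device synchronize (e.g. wrap
# model(**inputs) together with torch.cuda.synchronize()) so that
# asynchronous kernel launches are not undercounted.
first, steady = measure_latency(lambda: sum(range(100_000)))
print(f"first: {first:.2f} ms, steady-state: {steady:.2f} ms")
```

Comparing first-call latency against the steady-state average makes it clear whether the slowness is a per-inference cost or a one-time initialization cost.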
II. Problem description
Inference with the siglip model on the NPU takes too long (2616 ms): slower than on an NVIDIA 4090 (310 ms), and even slower than CPU-only inference on the bare machine (1037 ms). When running NPU inference inside the container, the following warning is printed: /opt/conda/lib/python3.10/site-packages/torch/nn/functional.py:5168: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /workspace/framework/mcPytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:617.)
return _scaled_dot_product_attention(query, key, value, attn_mask, dropout_p, is_causal, scale = scale)
I am running the siglip model on a Kunpeng 920 (new model) CPU plus one 曦云 C500 NPU. For the same image, CPU-only inference on the bare machine takes about 1037 ms, NPU inference on the bare machine takes about 2837 ms, and NPU inference inside the container (maca-torch2.4-py310-mc3.3.0.4-kylinv10-arm64) takes about 2616 ms; on an NVIDIA 4090 the same inference takes about 310 ms. NPU inference being slower than CPU inference is clearly abnormal. How should I troubleshoot and fix this? The driver version is 3.5.3.11, the SDK version is 3.5.3.17, and cu-bridge is built from the master branch.
I have installed the driver, firmware, and SDK as documented, but the same error still occurs.
After downloading the maca-pytorch2.8-py312-3.5.3.9-aarch64.tar package, creating a conda environment on the bare machine, and installing pytorch and the other packages, importing pytorch fails with the following error:
File "<stdin>", line 1, in <module>
File "/home/lv/miniconda3/envs/python312/lib/python3.12/site-packages/torch/__init__.py", line 421, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: libmxomp.so: cannot open shared object file: No such file or directory
How can this be solved? I have already installed driver version 2.14.27 and MACA SDK version 2.32.0.9.
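libmxomp.so ships with the MACA SDK, so this ImportError usually means the dynamic linker cannot find the SDK's library directory. A diagnostic/configuration sketch, assuming a /opt/maca install prefix (adjust the paths to your actual installation):

```shell
# Check whether the linker can already resolve the library
ldconfig -p | grep libmxomp

# Locate it under the SDK install prefix (the prefix is an assumption)
find /opt/maca -name 'libmxomp.so*' 2>/dev/null

# If found, put its directory on the library search path before
# importing torch (here assuming /opt/maca/lib; use the directory
# reported by the find command above)
export MACA_PATH=/opt/maca
export LD_LIBRARY_PATH=$MACA_PATH/lib:$LD_LIBRARY_PATH
```

If the library is found but resides elsewhere, exporting that directory in LD_LIBRARY_PATH (or adding it via a file under /etc/ld.so.conf.d/ followed by ldconfig) before launching Python should let `from torch._C import *` succeed.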
Hello, when I use vllm 0.8.2 inside the container to serve a large model for inference, I cannot collect profiler data: after setting the VLLM_TORCH_PROFILER_DIR environment variable, the process hangs. How can this be solved?
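For reference, vLLM's documented profiling flow is to set the directory before launching the server and then toggle tracing around a short window of requests via the server's /start_profile and /stop_profile endpoints; profiling an entire long-running serving session can produce enormous traces and look like a hang. A sketch under those assumptions (port and output directory are placeholders):

```shell
# Set before launching the vLLM server so the torch profiler is enabled
export VLLM_TORCH_PROFILER_DIR=/tmp/vllm_profile

# Profile only a short window instead of the whole session
curl -X POST http://localhost:8000/start_profile
# ... send a few inference requests here ...
curl -X POST http://localhost:8000/stop_profile
```

If stopping the profiler still appears to stall, note that trace serialization at stop time can take minutes for large traces, which is easy to mistake for a deadlock.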