• Members 8 posts
    February 9, 2026, 10:14

    I am running the SigLIP model on a new-model Kunpeng 920 CPU plus a single 曦云 (XiYun) C500 NPU. For the same image, CPU-only inference on bare metal takes about 1037 ms, while NPU inference takes about 2837 ms on bare metal and about 2616 ms inside the container (maca-torch2.4-py310-mc3.3.0.4-kylinv10-arm64); on an NVIDIA 4090 the same inference takes about 310 ms. NPU inference being slower than CPU inference is clearly abnormal. How should I troubleshoot and resolve this? The driver version is 3.5.3.11, the SDK version is 3.5.3.17, and cu-bridge was built from the master branch code.

  • Members 221 posts
    February 11, 2026, 14:33

    Dear developer, please provide detailed information using the template below.
    I. Hardware and software information
    1. Server vendor:
    2. MetaX GPU model:
    3. OS kernel version:
    4. CPU virtualization enabled:
    5. mx-smi output:
    6. docker info output:
    7. Image version:
    8. Container launch command:
    9. Commands executed inside the container:
    II. Problem description
    Please describe the problem in detail, including logs. If the logs are long, please upload them as an attachment (txt format).

  • Thread has been moved from 公共.

  • Members 8 posts
    February 11, 2026, 15:09

    I. Hardware and software information
    1. Server vendor: Kunpeng TaiShan series server with the new-model 920 CPU
    2. MetaX GPU model: 曦云C500
    3. OS kernel version: Linux localhost.localdomain 6.6.0-72.0.0.76.oe2403sp1.aarch64
    4. CPU virtualization enabled: yes
    5. mx-smi output:
    mx-smi version: 2.2.12

    =================== MetaX System Management Interface Log ===================
    Timestamp : Wed Feb 11 14:47:59 2026

    Attached GPUs : 1
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.12                          Kernel Mode Driver Version: 3.6.11       |
    | MACA Version: 3.5.3.17                 BIOS Version: 1.31.1.0                   |
    |------------------+-----------------+---------------------+----------------------|
    | Board Name       | GPU  Persist-M  | Bus-id              | GPU-Util   sGPU-M    |
    | Pwr:Usage/Cap    | Temp      Perf  | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0  MetaX C500    | 0         Off   | 0000:ab:00.0        | 0%         Disabled  |
    | 34W / 350W       | 51C       P0    | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    | GPU        PID        Process Name                                 GPU Memory   |
    |                                                                    Usage(MiB)   |
    |=================================================================================|
    | no process found                                                                |
    +---------------------------------------------------------------------------------+
    6. docker info output:
    Containers: 2
     Running: 2
     Paused: 0
     Stopped: 0
    Images: 1
    Server Version: 18.09.0
    Storage Driver: overlay2
     Backing Filesystem: extfs
     Supports d_type: true
     Native Overlay Diff: true
    Logging Driver: json-file
    Cgroup Driver: cgroupfs
    Hugetlb Pagesize: 2MB, 64KB, 32MB, 1GB, 64KB, 32MB, 2MB, 1GB (default is 2MB)
    Plugins:
     Volume: local
     Network: bridge host macvlan null overlay
     Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
    Swarm: inactive
    Runtimes: runc
    Default Runtime: runc
    Init Binary: docker-init
    containerd version: 85f5646ca2e0404de288487d7b0414c4c44e9715
    runc version: N/A
    init version: N/A (expected: )
    Security Options:
     seccomp
      Profile: default
    Kernel Version: 6.6.0-72.0.0.76.oe2403sp1.aarch64
    Operating System: openEuler 24.03 (LTS-SP1)
    OSType: linux
    Architecture: aarch64
    CPUs: 160
    Total Memory: 1006GiB
    Name: localhost.localdomain
    ID: WVRV:S6D5:4ASY:N5L3:DAJN:UJLQ:GP6C:M543:GKCC:DMLH:JSGM:HIHO
    Docker Root Dir: /home/docker
    Debug Mode (client): false
    Debug Mode (server): false
    Registry: index.docker.io/v1/
    Labels:
    Experimental: false
    Insecure Registries:
     127.0.0.0/8
    Live Restore Enabled: true
    7. Image version:
    REPOSITORY            TAG                        IMAGE ID       CREATED        SIZE
    maca-torch2.4-py310   mc3.3.0.4-kylinv10-arm64   9f4c837ed9ad   2 months ago   22.2GB
    8. Container launch command:
    docker run -it --device=/dev/mxcd --device=/dev/dri --privileged=true --ipc="shareable" --name torch24 --shm-size=256g -v /home/:/home/ -w /home/ maca-torch2.4-py310:mc3.3.0.4-kylinv10-arm64 /bin/bash
    9. Commands executed inside the container:
    import time

    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModel

    use_NPU = True
    model_path = "/home/models/siglip-so400m-patch14-384/"
    if use_NPU:
        model = AutoModel.from_pretrained(model_path, local_files_only=True).cuda()
    else:
        model = AutoModel.from_pretrained(model_path, local_files_only=True)
    processor = AutoProcessor.from_pretrained(model_path, local_files_only=True)

    image = Image.open("/home/datasets/siglip/photos/xxxxx.jpg")
    resized_image = image.resize((224, 224), resample=Image.Resampling.LANCZOS)
    texts = ["xxxxx", "xxxxx", "xxxxx", "xxxxx", "xxxxx"]

    inputs = processor(text=texts, images=resized_image, padding=True, return_tensors="pt")

    if use_NPU:
        # Move every input tensor to the device
        for key, value in inputs.items():
            inputs[key] = value.cuda()

    start = time.time()
    with torch.no_grad():
        outputs = model(**inputs)
    print("Elapsed:", (time.time() - start) * 1000, "ms")

    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1)

    for text, prob in zip(texts, probs[0]):
        print(f"{text}: {prob:.4f}")

    II. Problem description
    SigLIP inference on the NPU takes too long (2616 ms): slower than on an NVIDIA 4090 (310 ms) and even slower than CPU-only inference on bare metal (1037 ms). When running NPU inference inside the container, the following warning is printed:
    /opt/conda/lib/python3.10/site-packages/torch/nn/functional.py:5168: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /workspace/framework/mcPytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:617.)
    return _scaled_dot_product_attention(query, key, value, attn_mask, dropout_p, is_causal, scale = scale)
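
    One way to check whether the scaled_dot_product_attention warning matters is to request the plain (eager) attention implementation when loading the model. A minimal sketch, assuming the transformers version inside the image accepts the attn_implementation argument for this model:

    import torch
    from transformers import AutoModel

    model_path = "/home/models/siglip-so400m-patch14-384/"

    # "eager" bypasses torch.nn.functional.scaled_dot_product_attention, whose
    # memory-efficient backend this torch build reports as unavailable
    # (assumption: the installed transformers supports attn_implementation)
    model = AutoModel.from_pretrained(
        model_path,
        local_files_only=True,
        attn_implementation="eager",
    ).cuda()

    If the measured latency changes, the attention backend is implicated; if not, it can be ruled out as the bottleneck.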

  • Members 221 posts
    February 11, 2026, 15:25

    Dear developer, MetaX devices are GPUs, not NPUs. Please run inference 100 times, take the average, and compare.

  • Members 8 posts
    February 11, 2026, 16:16

    Sorry, I have been using the wrong term all along. By running the script multiple times, I found that the first run is extremely slow but every subsequent run is much faster. For example: with a single run, the latency is 2500 ms; with 2 runs in total, the first takes 1300 ms and the second 20 ms; with 3 runs, the first takes 840 ms and the second and third take 13 ms and 12 ms; with 4 runs, the first takes 630 ms and runs 2 through 4 take about 10 ms each. As the total number of runs grows, the first-run latency drops to as low as 250 ms, with subsequent runs at around 3.8 ms. What causes this?
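
    This pattern is consistent with one-time warm-up cost (kernel compilation and caching on first use) dominating the first run rather than steady-state performance. A minimal timing sketch that separates warm-up from steady state and averages 100 runs, as suggested above; it assumes model and inputs are prepared as in the earlier script and that this torch build supports torch.cuda.synchronize():

    import time

    import torch

    def benchmark(model, inputs, warmup=10, iters=100):
        """Average steady-state latency in ms, excluding one-time warm-up cost."""
        with torch.no_grad():
            for _ in range(warmup):    # first calls trigger kernel compilation/caching
                model(**inputs)
            torch.cuda.synchronize()   # device execution is asynchronous; drain the queue
            start = time.time()
            for _ in range(iters):
                model(**inputs)
            torch.cuda.synchronize()   # wait for the last kernel before stopping the clock
        return (time.time() - start) * 1000 / iters

    # With model and inputs from the script above:
    # print(f"steady-state latency: {benchmark(model, inputs):.2f} ms/run")

    Without the synchronize calls, time.time() can stop before the queued kernels have finished, so timing a single forward pass on an asynchronous device is not reliable.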