• Members 2 posts
    October 29, 2025, 12:03

    I. Hardware and Software Information
    1. Server vendor: Inspur
    2. MetaX GPU model: METAX_C500_64G ×4
    3. OS kernel version: 4.19.90-89.11.v2401.ky10.x86_64
    4. CPU virtualization enabled: No
    5. mx-smi output:
    mx-smi
    mx-smi version: 2.2.9

    =================== MetaX System Management Interface Log ===================
    Timestamp : Thu Oct 30 10:02:21 2025

    Attached GPUs : 4
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.9                                 Kernel Mode Driver Version: 3.3.12 |
    | MACA Version: unknown                                    BIOS Version: 1.29.1.0 |
    |------------------+-----------------+---------------------+----------------------|
    | Board Name       | GPU Persist-M   | Bus-id              | GPU-Util      sGPU-M |
    | Pwr:Usage/Cap    | Temp   Perf     | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0  MetaX C500    | 0    Off        | 0000:43:00.0        | 0%          Disabled |
    | 54W / 350W       | 31C   P0        | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 1  MetaX C500    | 1    Off        | 0000:44:00.0        | 0%          Disabled |
    | 55W / 350W       | 31C   P0        | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 2  MetaX C500    | 2    Off        | 0000:45:00.0        | 0%          Disabled |
    | 60W / 350W       | 33C   P0        | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 3  MetaX C500    | 3    Off        | 0000:47:00.0        | 0%          Disabled |
    | 57W / 350W       | 33C   P0        | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    | GPU        PID        Process Name                                   GPU Memory |
    |                                                                      Usage(MiB) |
    |=================================================================================|
    | no process found                                                                |
    +---------------------------------------------------------------------------------+

    End of Log

    6. docker info output:
    docker info
    Client:
     Version: 28.3.3
     Context: default
     Debug Mode: false

    Server:
     Containers: 2
      Running: 2
      Paused: 0
      Stopped: 0
     Images: 6
     Server Version: 28.3.3
     Storage Driver: overlay2
      Backing Filesystem: extfs
      Supports d_type: true
      Using metacopy: false
      Native Overlay Diff: true
      userxattr: false
     Logging Driver: json-file
     Cgroup Driver: cgroupfs
     Cgroup Version: 1
     Plugins:
      Volume: local
      Network: bridge host ipvlan macvlan null overlay
      Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
     CDI spec directories:
      /etc/cdi
      /var/run/cdi
     Swarm: inactive
     Runtimes: io.containerd.runc.v2 runc
     Default Runtime: runc
     Init Binary: docker-init
     containerd version: 05044ec0a9a75232cad458027ca83437aae3f4da
     runc version: v1.2.6-0-ge89a299
     init version: de40ad0
     Security Options:
      seccomp
       Profile: builtin
     Kernel Version: 4.19.90-89.11.v2401.ky10.x86_64
     Operating System: Kylin Linux Advanced Server V10 (Halberd)
     OSType: linux
     Architecture: x86_64
     CPUs: 128
     Total Memory: 994.7GiB
     Name: localhost.localdomain
     ID: ca3e0563-e7fc-4f52-ad88-9655d1100756
     Docker Root Dir: /data/docker
    7. Image versions:
    cr.metax-tech.com/public-library/maca-pytorch:3.2.1.4-torch2.6-py310-ubuntu24.04-amd64
    cr.metax-tech.com/public-ai-release/maca/vllm:maca.ai3.1.0.7-torch2.6-py310-ubuntu22.04-amd64
    cr.metax-tech.com/public-ai-release/maca/modelzoo.llm.vllm:maca.ai2.33.1.12-torch2.6-py310-ubuntu22.04-amd64
    cr.metax-tech.com/public-ai-release/maca/vllm:maca.ai2.33.1.12-torch2.6-py310-ubuntu22.04-amd64
    8. Container launch command:
    docker run -it --device=/dev/dri --device=/dev/mxcd --group-add video --name images --device=/dev/mem --network=host --security-opt seccomp=unconfined --security-opt apparmor=unconfined --shm-size '100gb' --ulimit memlock=-1 -v /usr/local/:/usr/local/ -v /data/models/:/data/models/ ce3f69501a52 /bin/bash
    9. Command run inside the container:
    vllm serve /data/models/Qwen/Qwen3-VL-30B-A3B-Instruct --served-model-name Qwen3-VL-30B --tensor-parallel-size 4 --swap-space 16 --trust-remote-code --dtype bfloat16 --gpu-memory-utilization 0.9 --max-model-len 30720 --port 18091
    II. Problem Description
    The server has four 64 GB MetaX C500 GPUs. Deploying MiniCPM-V-4_5 and Qwen3-VL-30B-A3B-Instruct both failed; vLLM 0.10.0 does not support these two models. When will MetaX's official vLLM 0.11 be available as an upgrade? Qwen3-Image also did not run successfully.

  • Members 139 posts
    October 30, 2025, 18:23

    Dear developer, please provide a detailed error log.

  • Thread has been moved from 公共 (Public).

  • Members 139 posts
    October 30, 2025, 18:25

    Dear developer, for the vLLM 0.11 image update, please watch the developer image download center for updates.

  • Members 2 posts
    October 30, 2025, 18:26

    RuntimeError: Worker failed with error 'CUDA out of memory. Tried to allocate 288.00 MiB. GPU 0 has a total capacity of 63.59 GiB of which 0 bytes is free. Of the allocated memory 57.90 GiB is allocated by PyTorch, and 45.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (pytorch.org/docs/stable/notes/cuda.html#environment-variables)', please check the stack trace above for the root cause
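As a sanity check, plain arithmetic on the figures reported in the traceback above (nothing MetaX-specific) shows why the allocation cannot succeed: the per-GPU budget implied by `--gpu-memory-utilization 0.9` is already smaller than what PyTorch has allocated.

```python
# All figures are copied from the error message above (GiB unless noted).
total_capacity_gib = 63.59   # reported total capacity of GPU 0
gpu_mem_util = 0.9           # --gpu-memory-utilization passed to vllm serve
allocated_gib = 57.90        # already allocated by PyTorch
request_mib = 288.0          # size of the failing allocation

budget_gib = total_capacity_gib * gpu_mem_util     # vLLM's per-GPU memory budget
headroom_mib = (budget_gib - allocated_gib) * 1024  # negative: budget exceeded

print(f"budget   ~ {budget_gib:.2f} GiB")   # ~57.23 GiB
print(f"headroom ~ {headroom_mib:.0f} MiB vs. requested {request_mib:.0f} MiB")
```

With negative headroom, the per-GPU footprint has to come down (a smaller context length, for instance) before the extra 288 MiB request can be served.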

  • Members 139 posts
    October 30, 2025, 18:27

    Dear developer, this error means you are running out of GPU memory.
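If the model weights themselves fit across the four 64 GB cards, the usual first mitigations are the ones hinted at in the traceback: enable PyTorch's expandable segments and shrink the per-GPU footprint. The flag values below are illustrative only, not an official MetaX recommendation; tune them for your workload.

```shell
# Reduce allocator fragmentation, as the PyTorch error message itself suggests.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Same launch as in the original post, with a smaller context window and a
# slightly lower memory cap to leave headroom for activations and KV cache.
vllm serve /data/models/Qwen/Qwen3-VL-30B-A3B-Instruct \
    --served-model-name Qwen3-VL-30B \
    --tensor-parallel-size 4 \
    --swap-space 16 \
    --trust-remote-code \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 16384 \
    --port 18091
```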