MetaX-Tech Developer Forum
rootcj

  • Members
  • Joined July 22, 2025

rootcj has posted 2 messages.

  • rootcj (Members)
    vLLM version issue (In progress), October 30, 2025 18:26

    RuntimeError: Worker failed with error 'CUDA out of memory. Tried to allocate 288.00 MiB. GPU 0 has a total capacity of 63.59 GiB of which 0 bytes is free. Of the allocated memory 57.90 GiB is allocated by PyTorch, and 45.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (pytorch.org/docs/stable/notes/cuda.html#environment-variables)', please check the stack trace above for the root cause
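The traceback itself suggests one mitigation; a minimal sketch of applying it before relaunching the server (this is the standard PyTorch allocator setting named in the error message; whether it resolves this particular OOM is not guaranteed):

```shell
# Let the PyTorch CUDA caching allocator grow existing memory segments
# instead of reserving new fixed-size blocks, which reduces fragmentation
# when "reserved but unallocated" memory piles up.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

Relaunch the same `vllm serve` command afterwards in the shell where the variable was exported.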

  • rootcj (Members)
    vLLM version issue (In progress), October 29, 2025 12:03

    I. Hardware and Software Information
    1. Server vendor: Inspur
    2. MetaX GPU model: METAX_C500_64G ×4
    3. OS kernel version: 4.19.90-89.11.v2401.ky10.x86_64
    4. CPU virtualization enabled: No
    5. mx-smi output:
    mx-smi
    mx-smi version: 2.2.9

    =================== MetaX System Management Interface Log ===================
    Timestamp : Thu Oct 30 10:02:21 2025

    Attached GPUs : 4
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.9                        Kernel Mode Driver Version: 3.3.12          |
    | MACA Version: unknown               BIOS Version: 1.29.1.0                      |
    |------------------+-----------------+---------------------+----------------------|
    | Board Name       | GPU  Persist-M  | Bus-id              | GPU-Util   sGPU-M    |
    | Pwr:Usage/Cap    | Temp  Perf      | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0  MetaX C500    | 0    Off        | 0000:43:00.0        | 0%        Disabled   |
    | 54W / 350W       | 31C   P0        | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 1  MetaX C500    | 1    Off        | 0000:44:00.0        | 0%        Disabled   |
    | 55W / 350W       | 31C   P0        | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 2  MetaX C500    | 2    Off        | 0000:45:00.0        | 0%        Disabled   |
    | 60W / 350W       | 33C   P0        | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 3  MetaX C500    | 3    Off        | 0000:47:00.0        | 0%        Disabled   |
    | 57W / 350W       | 33C   P0        | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    | GPU    PID    Process Name                                          GPU Memory  |
    |                                                                     Usage(MiB)  |
    |=================================================================================|
    | no process found                                                                |
    +---------------------------------------------------------------------------------+

    End of Log

    6. docker info output:
    docker info
    Client:
     Version: 28.3.3
     Context: default
     Debug Mode: false

    Server:
     Containers: 2
      Running: 2
      Paused: 0
      Stopped: 0
     Images: 6
     Server Version: 28.3.3
     Storage Driver: overlay2
      Backing Filesystem: extfs
      Supports d_type: true
      Using metacopy: false
      Native Overlay Diff: true
      userxattr: false
     Logging Driver: json-file
     Cgroup Driver: cgroupfs
     Cgroup Version: 1
     Plugins:
      Volume: local
      Network: bridge host ipvlan macvlan null overlay
      Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
     CDI spec directories:
      /etc/cdi
      /var/run/cdi
     Swarm: inactive
     Runtimes: io.containerd.runc.v2 runc
     Default Runtime: runc
     Init Binary: docker-init
     containerd version: 05044ec0a9a75232cad458027ca83437aae3f4da
     runc version: v1.2.6-0-ge89a299
     init version: de40ad0
     Security Options:
      seccomp
       Profile: builtin
     Kernel Version: 4.19.90-89.11.v2401.ky10.x86_64
     Operating System: Kylin Linux Advanced Server V10 (Halberd)
     OSType: linux
     Architecture: x86_64
     CPUs: 128
     Total Memory: 994.7GiB
     Name: localhost.localdomain
     ID: ca3e0563-e7fc-4f52-ad88-9655d1100756
     Docker Root Dir: /data/docker
    7. Image versions:
    cr.metax-tech.com/public-library/maca-pytorch:3.2.1.4-torch2.6-py310-ubuntu24.04-amd64
    cr.metax-tech.com/public-ai-release/maca/vllm:maca.ai3.1.0.7-torch2.6-py310-ubuntu22.04-amd64
    cr.metax-tech.com/public-ai-release/maca/modelzoo.llm.vllm:maca.ai2.33.1.12-torch2.6-py310-ubuntu22.04-amd64
    cr.metax-tech.com/public-ai-release/maca/vllm:maca.ai2.33.1.12-torch2.6-py310-ubuntu22.04-amd64
    8. Container launch command:
    docker run -it --device=/dev/dri --device=/dev/mxcd --group-add video --name images --device=/dev/mem --network=host --security-opt seccomp=unconfined --security-opt apparmor=unconfined --shm-size '100gb' --ulimit memlock=-1 -v /usr/local/:/usr/local/ -v /data/models/:/data/models/ ce3f69501a52 /bin/bash
    9. Command executed inside the container:
    vllm serve /data/models/Qwen/Qwen3-VL-30B-A3B-Instruct --served-model-name Qwen3-VL-30B --tensor-parallel-size 4 --swap-space 16 --trust-remote-code --dtype bfloat16 --gpu-memory-utilization 0.9 --max-model-len 30720 --port 18091
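If the OOM persists, the two knobs in this exact command that most directly control memory pressure are the KV-cache reservation and the context length. A hedged variant of the same command (the lowered values are guesses to try, not known-good settings for this hardware):

```shell
# Same command with two values lowered (both are guesses to try):
#  - --gpu-memory-utilization 0.85 (was 0.9): leaves more per-GPU headroom
#  - --max-model-len 16384 (was 30720): shrinks the KV-cache budget
vllm serve /data/models/Qwen/Qwen3-VL-30B-A3B-Instruct \
  --served-model-name Qwen3-VL-30B \
  --tensor-parallel-size 4 \
  --swap-space 16 \
  --trust-remote-code \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 16384 \
  --port 18091
```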
    II. Problem Description
    The server has four 64 GB MetaX C500 GPUs. Deploying both MiniCPM-V-4_5 and Qwen3-VL-30B-A3B-Instruct failed; vLLM 0.10.0 does not support these two models. When will MetaX's official vLLM 0.11 upgrade be available? Qwen3-Image did not succeed either.
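A back-of-envelope check of the weight memory, assuming roughly 30B bfloat16 parameters for Qwen3-VL-30B-A3B-Instruct (the parameter count is inferred from the model name, not stated in the post):

```python
# Rough weight-memory estimate for a ~30B-parameter model in bfloat16.
params = 30e9              # assumed parameter count, from the "30B" in the name
bytes_per_param = 2        # bfloat16 stores 2 bytes per parameter
weight_gib = params * bytes_per_param / 2**30   # ~55.9 GiB total

# With --tensor-parallel-size 4 the weights are sharded across 4 GPUs.
per_gpu_gib = weight_gib / 4                    # ~14.0 GiB per GPU

print(f"total weights ~{weight_gib:.1f} GiB, per GPU with TP=4 ~{per_gpu_gib:.1f} GiB")
```

Notably, the 57.90 GiB "allocated by PyTorch" in the traceback is close to the full unsharded weight size, so it may be worth confirming during load (e.g. by watching mx-smi) that all four GPUs fill up, i.e. that tensor parallelism is actually sharding the weights rather than the whole model landing on GPU 0.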
