MetaX-Tech Developer Forum

taowei

  • Members
  • Joined 2025-11-21

taowei has posted 4 messages.

  • See post
    taowei
    Members
    C500 deployment of Qwen3-VL-8B-Instruct: OutOfMemoryError and "Cannot use FA" (In progress) 2025-11-21 17:40

    ai.gitee.com/serverless-api/packages/1492
    Hello, I saw Qwen3-VL-8B-Instruct listed here. Is it currently unsupported on the C500?

  • See post
    taowei
    Members
    C500 deployment of Qwen3-VL-8B-Instruct: OutOfMemoryError and "Cannot use FA" (In progress) 2025-11-21 17:14

    Problem symptoms (full log in the attachment):

    (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:05:53 [fa_utils.py:57] Cannot use FA version 2 is not supported due to FA2 is unavaible due to: libcudart.so.12: cannot open shared object file: No such file or directory
    (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:55 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
    (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) WARNING 11-21 17:05:55 [utils.py:181] TransformersForMultimodalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
    (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:57 [gpu_model_runner.py:2338] Starting to load model /data/Qwen3-VL-8B-Instruct...
    (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) `torch_dtype` is deprecated! Use `dtype` instead!
    (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:57 [gpu_model_runner.py:2370] Loading model from scratch...
    (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:57 [transformers.py:439] Using Transformers backend.
    (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:58 [platform.py:298] Using Flash Attention backend on V1 engine.
    Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
    Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:05<00:16,  5.52s/it]
    Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:11<00:11,  5.79s/it]
    Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:13<00:03,  3.97s/it]
    Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:16<00:00,  3.57s/it]
    Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:16<00:00,  4.06s/it]
    (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) 
    (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:06:14 [default_loader.py:268] Loading weights took 16.46 seconds
    (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:06:15 [gpu_model_runner.py:2392] Model loading took 16.3341 GiB and 16.796520 seconds
    (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:06:15 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] EngineCore failed to start.
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] Traceback (most recent call last):
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     engine_core = EngineCoreProc(*args, **kwargs)
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 505, in __init__
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     super().__init__(vllm_config, executor_class, log_stats,
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 91, in __init__
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     self._initialize_kv_caches(vllm_config)
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 183, in _initialize_kv_caches
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     self.model_executor.determine_available_memory())
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 84, in determine_available_memory
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return self.collective_rpc("determine_available_memory")
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 309, in collective_rpc
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return self._run_workers(method, *args, **(kwargs or {}))
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/ray_distributed_executor.py", line 505, in _run_workers
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     ray_worker_outputs = ray.get(ray_worker_outputs)
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return fn(*args, **kwargs)
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return func(*args, **kwargs)
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 2858, in get
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 958, in get_objects
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     raise value.as_instanceof_cause()
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] ray.exceptions.RayTaskError(OutOfMemoryError): ray::RayWorkerWrapper.execute_method() (pid=791, ip=172.17.0.4, actor_id=2b9d7f7d597adf4159ecbb8101000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x7f5d455714e0>)
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 628, in execute_method
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     raise e
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 619, in execute_method
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return run_method(self, method, args, kwargs)
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/utils/__init__.py", line 3060, in run_method
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return func(*args, **kwargs)
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return func(*args, **kwargs)
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 263, in determine_available_memory
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     self.model_runner.profile_run()
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3017, in profile_run
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     self.model.get_multimodal_embeddings(
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/transformers.py", line 844, in get_multimodal_embeddings
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     vision_embeddings = self.model.get_image_features(
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1061, in get_image_features
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     image_embeds, deepstack_image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return self._call_impl(*args, **kwargs)
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return forward_call(*args, **kwargs)
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 739, in forward
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     hidden_states = blk(
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_layers.py", line 94, in __call__
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return super().__call__(*args, **kwargs)
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return self._call_impl(*args, **kwargs)
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return forward_call(*args, **kwargs)
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 267, in forward
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     hidden_states = hidden_states + self.attn(
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return self._call_impl(*args, **kwargs)
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return forward_call(*args, **kwargs)
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 230, in forward
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     attn_outputs = [
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 231, in <listcomp>
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     attention_interface(
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/integrations/sdpa_attention.py", line 96, in sdpa_attention_forward
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     attn_output = torch.nn.functional.scaled_dot_product_attention(
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 5912, in scaled_dot_product_attention
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return _scaled_dot_product_attention(query, key, value, attn_mask, dropout_p, is_causal, scale = scale, enable_gqa = enable_gqa)
    (EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 0 has a total capacity of 63.59 GiB of which 41.68 GiB is free. Of the allocated memory 19.19 GiB is allocated by PyTorch, and 442.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    
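    The `Tried to allocate 256.00 GiB` at the end of the traceback is characteristic of the path the log itself describes: FA2 is unavailable (`libcudart.so.12` missing), the model falls back to the Transformers implementation, and `sdpa_attention_forward` then materializes a full token-by-token attention-score map for the vision tower while vLLM profiles with one image of the maximum feature size. A back-of-envelope estimator makes the quadratic blow-up concrete (the head count and bf16 dtype below are illustrative assumptions, not values read from the Qwen3-VL config):

    ```python
    def sdpa_attn_bytes(num_tokens: int, num_heads: int, bytes_per_el: int = 2) -> int:
        """Rough size of a materialized attention-score matrix:
        one (num_tokens x num_tokens) map per head; bf16 = 2 bytes/element."""
        return num_heads * num_tokens * num_tokens * bytes_per_el

    # Example: at the 16384-token encoder budget shown in the log,
    # with an assumed 16 heads in bf16:
    gib = sdpa_attn_bytes(16384, 16) / 2**30
    print(f"{gib:.0f} GiB")  # prints "8 GiB"; doubling the token count quadruples this
    ```

    A maximum-size image produces far more vision tokens than a typical request, so the profiling pass can demand hundreds of GiB even when normal inference would fit.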
  • See post
    taowei
    Members
    C500 deployment of Qwen3-VL-8B-Instruct: OutOfMemoryError and "Cannot use FA" (In progress) 2025-11-21 17:13

    docker info output:

    Client: Docker Engine - Community
     Version:    26.0.2
     Context:    default
     Debug Mode: false
     Plugins:
      buildx: Docker Buildx (Docker Inc.)
        Version:  v0.14.0
        Path:     /usr/libexec/docker/cli-plugins/docker-buildx
      compose: Docker Compose (Docker Inc.)
        Version:  v2.26.1
        Path:     /usr/libexec/docker/cli-plugins/docker-compose
    
    Server:
     Containers: 103
      Running: 72
      Paused: 0
      Stopped: 31
     Images: 56
     Server Version: 26.0.2
     Storage Driver: overlay2
      Backing Filesystem: extfs
      Supports d_type: true
      Using metacopy: false
      Native Overlay Diff: true
      userxattr: false
     Logging Driver: json-file
     Cgroup Driver: systemd
     Cgroup Version: 2
     Plugins:
      Volume: local
      Network: bridge host ipvlan macvlan null overlay
      Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
     Swarm: inactive
     Runtimes: io.containerd.runc.v2 runc
     Default Runtime: runc
     Init Binary: docker-init
     containerd version: e377cd56a71523140ca6ae87e30244719194a521
     runc version: v1.1.12-0-g51d5e94
     init version: de40ad0
     Security Options:
      apparmor
      seccomp
       Profile: builtin
      cgroupns
     Kernel Version: 5.19.0-46-generic
     Operating System: Ubuntu 22.04.4 LTS
     OSType: linux
     Architecture: x86_64
     CPUs: 128
     Total Memory: 1008GiB
     Name: ZNDX-CA100
     ID: 43f2ba6f-191c-4779-ad11-f360c2d5fc11
     Docker Root Dir: /var/lib/docker
     Debug Mode: false
     Experimental: false
     Insecure Registries:
      127.0.0.0/8
     Registry Mirrors:
      https://docker.1ms.run/
      https://docker.m.daocloud.io/
      https://dockerpull.com/
      https://dockerproxy.com/
     Live Restore Enabled: false
    

    Image version: cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.10.2-maca.ai3.2.1.7-torch2.6-py310-ubuntu22.04-amd64

    Container command:

    docker run -it --device=/dev/dri --device=/dev/mxcd \
      --name Qwen3-VL-8B-Instruct \
      -v /8T/perfxcloud/model/Qwen/Qwen3-VL-8B-Instruct:/data/Qwen3-VL-8B-Instruct \
      -e CUDA_VISIBLE_DEVICES=5 \
      -e TRITON_ENABLE_MACA_OPT_MOVE_DOT_OPERANDS_OUT_LOOP=1 \
      -e TRITON_ENABLE_MACA_CHAIN_DOT_OPT=1 \
      -e TRITON_DISABLE_MACA_OPT_MMA_PREFETCH=1 \
      -e TRITON_ENABLE_MACA_COMPILER_INT8_OPT=True \
      -e MACA_SMALL_PAGESIZE_ENABLE=1 \
      -e RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 \
      -p 2032:30889 \
      --security-opt seccomp=unconfined \
      --security-opt apparmor=unconfined \
      --shm-size 100gb \
      --ulimit memlock=-1 \
      --group-add video \
      6e519687a9e4 \
      /opt/conda/bin/python -m vllm.entrypoints.openai.api_server \
      --model /data/Qwen3-VL-8B-Instruct \
      --api-key c01b24fc-4bf1-4871-a1c3-8663e151555b \
      --served-model-name Qwen3-VL-8B-Instruct \
      --max-model-len 16384 \
      --gpu-memory-utilization 0.95 \
      --port 30889 \
      --swap-space 8 \
      --tensor-parallel-size 1 \
      --disable-log-stats \
      --disable-log-requests \
      --trust-remote-code \
      --distributed-executor-backend ray \
      --dtype bfloat16 \
      --max-num-seqs 5
    
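    If the failure is the vision-tower profiling pass rather than steady-state serving, two changes to the command above are worth trying. This is a sketch, not a verified fix: whether the vllm-metax 0.10.2 image honors these multimodal flags is an assumption, and the `max_pixels` value is illustrative. The allocator hint comes directly from the OOM message in the log.

    ```shell
    # Delta vs. the docker run command above (everything not shown stays the same):
    docker run ... \
      -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
      6e519687a9e4 \
      /opt/conda/bin/python -m vllm.entrypoints.openai.api_server \
      --model /data/Qwen3-VL-8B-Instruct \
      --limit-mm-per-prompt '{"image": 1}' \
      --mm-processor-kwargs '{"max_pixels": 1003520}' \
      ...   # remaining flags unchanged
    ```

    Capping `max_pixels` bounds the number of vision tokens per image, which bounds the size of the attention map the SDPA fallback materializes during profiling.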
  • See post
    taowei
    Members
    C500 deployment of Qwen3-VL-8B-Instruct: OutOfMemoryError and "Cannot use FA" (In progress) 2025-11-21 17:11

    Server vendor/model: H3C UniServer R5300 G6

    MetaX GPU model: MetaX C500

    OS kernel version: 5.19.0-46-generic

    CPU virtualization enabled: yes

    mx-smi output:

    mx-smi  version: 2.2.3
    
    =================== MetaX System Management Interface Log ===================
    Timestamp                                         : Fri Nov 21 16:53:45 2025
    
    Attached GPUs                                     : 8
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.3                        Kernel Mode Driver Version: 2.14.6          |
    | MACA Version: 3.0.0.8               BIOS Version: 1.24.3.0                      |
    |------------------------------------+---------------------+----------------------+
    | GPU         NAME                   | Bus-id              | GPU-Util             |
    | Temp        Pwr:Usage/Cap          | Memory-Usage        |                      |
    |====================================+=====================+======================|
    | 0           MetaX C500             | 0000:08:00.0        | 0%                   |
    | 33C         52W / 350W             | 64603/65536 MiB     |                      |
    +------------------------------------+---------------------+----------------------+
    | 1           MetaX C500             | 0000:09:00.0        | 1%                   |
    | 34C         54W / 350W             | 58204/65536 MiB     |                      |
    +------------------------------------+---------------------+----------------------+
    | 2           MetaX C500             | 0000:0e:00.0        | 0%                   |
    | 35C         53W / 350W             | 63899/65536 MiB     |                      |
    +------------------------------------+---------------------+----------------------+
    | 3           MetaX C500             | 0000:11:00.0        | 0%                   |
    | 34C         53W / 350W             | 63643/65536 MiB     |                      |
    +------------------------------------+---------------------+----------------------+
    | 4           MetaX C500             | 0000:32:00.0        | 1%                   |
    | 32C         52W / 350W             | 58204/65536 MiB     |                      |
    +------------------------------------+---------------------+----------------------+
    | 5           MetaX C500             | 0000:38:00.0        | 0%                   |
    | 31C         40W / 350W             | 858/65536 MiB       |                      |
    +------------------------------------+---------------------+----------------------+
    | 6           MetaX C500             | 0000:3b:00.0        | 0%                   |
    | 34C         51W / 350W             | 59997/65536 MiB     |                      |
    +------------------------------------+---------------------+----------------------+
    | 7           MetaX C500             | 0000:3c:00.0        | 0%                   |
    | 33C         52W / 350W             | 59997/65536 MiB     |                      |
    +------------------------------------+---------------------+----------------------+
    
    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    |  GPU                    PID         Process Name                 GPU Memory     |
    |                                                                  Usage(MiB)     |
    |=================================================================================|
    |  0                  2155589         VLLM::EngineCor              63744          |
    |  1                  2231555         python                       57344          |
    |  2                  1290848         VLLM::EngineCor              63040          |
    |  3                  1586485         VLLM::EngineCor              62784          |
    |  4                  2232951         python                       57344          |
    |  6                  2235998         VLLM::Worker_TP              59136          |
    |  7                  2235999         VLLM::Worker_TP              59136          |
    +---------------------------------------------------------------------------------+
    