Posts | taowei | 沐曦开发者论坛

问题现象：详情请见附件

(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:05:53 [fa_utils.py:57] Cannot use FA version 2 is not supported due to FA2 is unavaible due to: libcudart.so.12: cannot open shared object file: No such file or directory
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:55 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) WARNING 11-21 17:05:55 [utils.py:181] TransformersForMultimodalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:57 [gpu_model_runner.py:2338] Starting to load model /data/Qwen3-VL-8B-Instruct...
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) `torch_dtype` is deprecated! Use `dtype` instead!
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:57 [gpu_model_runner.py:2370] Loading model from scratch...
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:57 [transformers.py:439] Using Transformers backend.
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:58 [platform.py:298] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:05<00:16,  5.52s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:11<00:11,  5.79s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:13<00:03,  3.97s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:16<00:00,  3.57s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:16<00:00,  4.06s/it]
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) 
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:06:14 [default_loader.py:268] Loading weights took 16.46 seconds
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:06:15 [gpu_model_runner.py:2392] Model loading took 16.3341 GiB and 16.796520 seconds
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:06:15 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] EngineCore failed to start.
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] Traceback (most recent call last):
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 505, in __init__
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 91, in __init__
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 183, in _initialize_kv_caches
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     self.model_executor.determine_available_memory())
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 84, in determine_available_memory
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 309, in collective_rpc
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return self._run_workers(method, *args, **(kwargs or {}))
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/ray_distributed_executor.py", line 505, in _run_workers
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     ray_worker_outputs = ray.get(ray_worker_outputs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return fn(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return func(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 2858, in get
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 958, in get_objects
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     raise value.as_instanceof_cause()
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] ray.exceptions.RayTaskError(OutOfMemoryError): ray::RayWorkerWrapper.execute_method() (pid=791, ip=172.17.0.4, actor_id=2b9d7f7d597adf4159ecbb8101000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x7f5d455714e0>)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 628, in execute_method
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     raise e
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 619, in execute_method
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return run_method(self, method, args, kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/utils/__init__.py", line 3060, in run_method
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return func(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return func(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 263, in determine_available_memory
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     self.model_runner.profile_run()
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3017, in profile_run
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     self.model.get_multimodal_embeddings(
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/transformers.py", line 844, in get_multimodal_embeddings
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     vision_embeddings = self.model.get_image_features(
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1061, in get_image_features
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     image_embeds, deepstack_image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 739, in forward
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     hidden_states = blk(
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_layers.py", line 94, in __call__
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return super().__call__(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 267, in forward
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     hidden_states = hidden_states + self.attn(
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 230, in forward
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     attn_outputs = [
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 231, in <listcomp>
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     attention_interface(
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/integrations/sdpa_attention.py", line 96, in sdpa_attention_forward
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     attn_output = torch.nn.functional.scaled_dot_product_attention(
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 5912, in scaled_dot_product_attention
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return _scaled_dot_product_attention(query, key, value, attn_mask, dropout_p, is_causal, scale = scale, enable_gqa = enable_gqa)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 0 has a total capacity of 63.59 GiB of which 41.68 GiB is free. Of the allocated memory 19.19 GiB is allocated by PyTorch, and 442.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

docker info回显：

Client: Docker Engine - Community
 Version:    26.0.2
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.14.0
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.26.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 103
  Running: 72
  Paused: 0
  Stopped: 31
 Images: 56
 Server Version: 26.0.2
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: e377cd56a71523140ca6ae87e30244719194a521
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.19.0-46-generic
 Operating System: Ubuntu 22.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 128
 Total Memory: 1008GiB
 Name: ZNDX-CA100
 ID: 43f2ba6f-191c-4779-ad11-f360c2d5fc11
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  https://docker.1ms.run/
  https://docker.m.daocloud.io/
  https://dockerpull.com/
  https://dockerproxy.com/
 Live Restore Enabled: false

镜像版本：cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.10.2-maca.ai3.2.1.7-torch2.6-py310-ubuntu22.04-amd64

容器指令：

docker run -it --device=/dev/dri --device=/dev/mxcd \
  --name Qwen3-VL-8B-Instruct \
  -v /8T/perfxcloud/model/Qwen/Qwen3-VL-8B-Instruct:/data/Qwen3-VL-8B-Instruct \
  -e CUDA_VISIBLE_DEVICES=5 \
  -e TRITON_ENABLE_MACA_OPT_MOVE_DOT_OPERANDS_OUT_LOOP=1 \
  -e TRITON_ENABLE_MACA_CHAIN_DOT_OPT=1 \
  -e TRITON_DISABLE_MACA_OPT_MMA_PREFETCH=1 \
  -e TRITON_ENABLE_MACA_COMPILER_INT8_OPT=True \
  -e MACA_SMALL_PAGESIZE_ENABLE=1 \
  -e RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 \
  -p 2032:30889 \
  --security-opt seccomp=unconfined \
  --security-opt apparmor=unconfined \
  --shm-size 100gb \
  --ulimit memlock=-1 \
  --group-add video \
  6e519687a9e4 \
  /opt/conda/bin/python -m vllm.entrypoints.openai.api_server \
  --model /data/Qwen3-VL-8B-Instruct \
  --api-key c01b24fc-4bf1-4871-a1c3-8663e151555b \
  --served-model-name Qwen3-VL-8B-Instruct \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.95 \
  --port 30889 \
  --swap-space 8 \
  --tensor-parallel-size 1 \
  --disable-log-stats \
  --disable-log-requests \
  --trust-remote-code \
  --distributed-executor-backend ray \
  --dtype bfloat16 \
  --max-num-seqs 5