/opt/conda/lib/python3.10/site-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
  warnings.warn(_BETA_TRANSFORMS_WARNING)
/opt/conda/lib/python3.10/site-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
  warnings.warn(_BETA_TRANSFORMS_WARNING)
INFO 11-21 17:05:06 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-21 17:05:06 [__init__.py:38] - metax -> vllm_metax:register
INFO 11-21 17:05:06 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-21 17:05:06 [__init__.py:207] Platform plugin metax is activated
WARNING 11-21 17:05:13 [registry.py:483] Model architecture BaichuanForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.baichuan:BaichuanForCausalLM.
WARNING 11-21 17:05:13 [registry.py:483] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_metax.models.qwen2_vl:Qwen2VLForConditionalGeneration.
WARNING 11-21 17:05:13 [registry.py:483] Model architecture InternVLChatModel is already registered, and will be overwritten by the new model class vllm_metax.models.internvl:InternVLChatModel.
WARNING 11-21 17:05:13 [registry.py:483] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_mtp:DeepSeekMTP.
INFO 11-21 17:05:13 [platform.py:423] [hook] platform:pre_register_and_update...
WARNING 11-21 17:05:13 [__init__.py:1758] argument '--disable-log-requests' is deprecated and replaced with '--enable-log-requests'. This will be removed in v0.12.0.
(APIServer pid=16) INFO 11-21 17:05:13 [api_server.py:1896] vLLM API server version 0.10.2
(APIServer pid=16) INFO 11-21 17:05:13 [platform.py:423] [hook] platform:pre_register_and_update...
(APIServer pid=16) INFO 11-21 17:05:13 [utils.py:328] non-default args: {'port': 30889, 'api_key': ['c01b24fc-4bf1-4871-a1c3-8663e151555b'], 'model': '/data/Qwen3-VL-8B-Instruct', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 16384, 'served_model_name': ['Qwen3-VL-8B-Instruct'], 'distributed_executor_backend': 'ray', 'gpu_memory_utilization': 0.95, 'swap_space': 8.0, 'max_num_seqs': 5, 'disable_log_stats': True}
(APIServer pid=16) INFO 11-21 17:05:13 [platform.py:423] [hook] platform:pre_register_and_update...
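The non-default args logged just above fully determine the engine configuration. For readers who want to reproduce the setup from Python rather than the CLI, a hedged sketch of the offline-API equivalent follows: `vllm.LLM` is vLLM's documented Python entry point and these keyword names mirror the logged keys, but the actual deployment here used the OpenAI API server entrypoint, so this snippet is illustrative only.

```python
# Hedged sketch: offline-API equivalent of the logged non-default args.
# The failing deployment ran the API server, not this constructor.
from vllm import LLM

llm = LLM(
    model="/data/Qwen3-VL-8B-Instruct",
    trust_remote_code=True,            # ignored here, per the warning below
    dtype="bfloat16",
    max_model_len=16384,
    served_model_name=["Qwen3-VL-8B-Instruct"],
    distributed_executor_backend="ray",
    gpu_memory_utilization=0.95,
    swap_space=8,
    max_num_seqs=5,
    disable_log_stats=True,
)
```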
(APIServer pid=16) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=16) INFO 11-21 17:05:28 [__init__.py:742] Resolved architecture: TransformersForMultimodalLM
(APIServer pid=16) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=16) INFO 11-21 17:05:28 [__init__.py:1815] Using max model len 16384
(APIServer pid=16) INFO 11-21 17:05:29 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=16) WARNING 11-21 17:05:29 [utils.py:181] TransformersForMultimodalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
/opt/conda/lib/python3.10/site-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
  warnings.warn(_BETA_TRANSFORMS_WARNING)
/opt/conda/lib/python3.10/site-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
  warnings.warn(_BETA_TRANSFORMS_WARNING)
INFO 11-21 17:05:33 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-21 17:05:33 [__init__.py:38] - metax -> vllm_metax:register
INFO 11-21 17:05:33 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-21 17:05:33 [__init__.py:207] Platform plugin metax is activated
(EngineCore_DP0 pid=296) INFO 11-21 17:05:35 [core.py:654] Waiting for init message from front-end.
(EngineCore_DP0 pid=296) WARNING 11-21 17:05:40 [registry.py:483] Model architecture BaichuanForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.baichuan:BaichuanForCausalLM.
(EngineCore_DP0 pid=296) WARNING 11-21 17:05:40 [registry.py:483] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_metax.models.qwen2_vl:Qwen2VLForConditionalGeneration.
(EngineCore_DP0 pid=296) WARNING 11-21 17:05:40 [registry.py:483] Model architecture InternVLChatModel is already registered, and will be overwritten by the new model class vllm_metax.models.internvl:InternVLChatModel.
(EngineCore_DP0 pid=296) WARNING 11-21 17:05:40 [registry.py:483] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_mtp:DeepSeekMTP.
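The TransformersForMultimodalLM fallback warning above is the pivotal line: this vLLM build has no native Qwen3-VL implementation, so the model runs through the generic Transformers backend. A hedged way to check that up front is shown below; `ModelRegistry.get_supported_archs()` is a real vLLM API, while the architecture name `Qwen3VLForConditionalGeneration` is an assumption based on the model name and should be read from the checkpoint's own config.

```python
# Hedged sketch: check whether the installed vLLM natively supports an
# architecture before serving. If the checkpoint's architecture is absent,
# vLLM falls back to the Transformers backend, as the warning above reports.
from vllm import ModelRegistry

supported = ModelRegistry.get_supported_archs()
# "Qwen3VLForConditionalGeneration" is assumed; the authoritative value is
# the "architectures" field of the checkpoint's config.json.
print("Qwen3VLForConditionalGeneration" in supported)
```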
(EngineCore_DP0 pid=296) INFO 11-21 17:05:40 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='/data/Qwen3-VL-8B-Instruct', speculative_config=None, tokenizer='/data/Qwen3-VL-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen3-VL-8B-Instruct, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":8,"local_cache_dir":null}
(EngineCore_DP0 pid=296) 2025-11-21 17:05:42,252 INFO worker.py:1927 -- Started a local Ray instance.
(EngineCore_DP0 pid=296) INFO 11-21 17:05:44 [ray_utils.py:345] No current placement group found. Creating a new placement group.
(EngineCore_DP0 pid=296) INFO 11-21 17:05:44 [ray_distributed_executor.py:171] use_ray_spmd_worker: True
(EngineCore_DP0 pid=296) (pid=791) /opt/conda/lib/python3.10/site-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
(EngineCore_DP0 pid=296) (pid=791)   warnings.warn(_BETA_TRANSFORMS_WARNING)
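The torchvision Beta warnings repeat in every process vLLM spawns (API server, engine core, and each Ray worker), as the next lines show again. The warning text itself names the opt-out call, which has to execute before the import that triggers it, for example at the top of the entry script of each process:

```python
# Silencing call named in the warning text above; it must run before the
# code path that imports torchvision.transforms.v2 / torchvision.datapoints.
import torchvision

torchvision.disable_beta_transforms_warning()
```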
(EngineCore_DP0 pid=296) (pid=791) /opt/conda/lib/python3.10/site-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
(EngineCore_DP0 pid=296) (pid=791)   warnings.warn(_BETA_TRANSFORMS_WARNING)
(EngineCore_DP0 pid=296) (pid=791) INFO 11-21 17:05:48 [__init__.py:36] Available plugins for group vllm.platform_plugins:
(EngineCore_DP0 pid=296) (pid=791) INFO 11-21 17:05:48 [__init__.py:38] - metax -> vllm_metax:register
(EngineCore_DP0 pid=296) (pid=791) INFO 11-21 17:05:48 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
(EngineCore_DP0 pid=296) (pid=791) INFO 11-21 17:05:48 [__init__.py:207] Platform plugin metax is activated
(EngineCore_DP0 pid=296) INFO 11-21 17:05:53 [ray_env.py:63] RAY_NON_CARRY_OVER_ENV_VARS from config: set()
(EngineCore_DP0 pid=296) INFO 11-21 17:05:53 [ray_env.py:65] Copying the following environment variables to workers: ['VLLM_WORKER_MULTIPROC_METHOD', 'VLLM_USE_RAY_SPMD_WORKER', 'VLLM_USE_V1', 'LD_LIBRARY_PATH', 'VLLM_USE_RAY_COMPILED_DAG']
(EngineCore_DP0 pid=296) INFO 11-21 17:05:53 [ray_env.py:68] If certain env vars should NOT be copied, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json file
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) WARNING 11-21 17:05:53 [registry.py:483] Model architecture BaichuanForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.baichuan:BaichuanForCausalLM.
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) WARNING 11-21 17:05:53 [registry.py:483] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_metax.models.qwen2_vl:Qwen2VLForConditionalGeneration.
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) WARNING 11-21 17:05:53 [registry.py:483] Model architecture InternVLChatModel is already registered, and will be overwritten by the new model class vllm_metax.models.internvl:InternVLChatModel.
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) WARNING 11-21 17:05:53 [registry.py:483] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_mtp:DeepSeekMTP.
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:05:53 [fa_utils.py:57] Cannot use FA version 2 is not supported due to FA2 is unavaible due to: libcudart.so.12: cannot open shared object file: No such file or directory
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:55 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) WARNING 11-21 17:05:55 [utils.py:181] TransformersForMultimodalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:57 [gpu_model_runner.py:2338] Starting to load model /data/Qwen3-VL-8B-Instruct...
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) `torch_dtype` is deprecated! Use `dtype` instead!
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:57 [gpu_model_runner.py:2370] Loading model from scratch...
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:57 [transformers.py:439] Using Transformers backend.
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:58 [platform.py:298] Using Flash Attention backend on V1 engine.
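One detail above is worth flagging before the crash: the fa_utils error says FlashAttention-2 is unusable because the dynamic loader cannot find libcudart.so.12, plausible on this Metax mcPytorch stack, which is not a stock CUDA 12 install. That leaves the Transformers-backend vision tower on plain SDPA later on. A hedged stdlib diagnostic that reproduces the loader condition:

```python
# Hedged diagnostic: check whether the dynamic loader can resolve the library
# named in the fa_utils error. ctypes.CDLL raises OSError with the same
# "cannot open shared object file" message when it is not on the search path.
import ctypes

try:
    ctypes.CDLL("libcudart.so.12")
    print("libcudart.so.12 is loadable")
except OSError as exc:
    print(f"libcudart.so.12 not loadable: {exc}")
```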
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 628, in execute_method
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     raise e
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 619, in execute_method
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return run_method(self, method, args, kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/utils/__init__.py", line 3060, in run_method
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return func(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return func(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 263, in determine_available_memory
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     self.model_runner.profile_run()
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3017, in profile_run
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     self.model.get_multimodal_embeddings(
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/transformers.py", line 844, in get_multimodal_embeddings
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     vision_embeddings = self.model.get_image_features(
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1061, in get_image_features
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     image_embeds, deepstack_image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 739, in forward
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     hidden_states = blk(
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_layers.py", line 94, in __call__
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return super().__call__(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 267, in forward
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     hidden_states = hidden_states + self.attn(
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 230, in forward
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     attn_outputs = [
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 231, in <listcomp>
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     attention_interface(
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/transformers/integrations/sdpa_attention.py", line 96, in sdpa_attention_forward
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     attn_output = torch.nn.functional.scaled_dot_product_attention(
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 5912, in scaled_dot_product_attention
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718]     return _scaled_dot_product_attention(query, key, value, attn_mask, dropout_p, is_causal, scale = scale, enable_gqa = enable_gqa)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 0 has a total capacity of 63.59 GiB of which 41.68 GiB is free. Of the allocated memory 19.19 GiB is allocated by PyTorch, and 442.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
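This first, ERROR-prefixed copy of the traceback already tells the whole story: the OOM fires during the memory-profiling pass (profile_run), inside the Qwen3-VL vision tower, in SDPA attention, where a dense attention matrix over the profiling image's full patch sequence is likely being materialized. Note the failed allocation is 256.00 GiB against a 63.59 GiB device, so the PYTORCH_CUDA_ALLOC_CONF hint in the message cannot rescue it on its own; it is shown below for completeness only.

```python
# Allocator hint quoted from the error message. It must be set before CUDA
# initializes (ideally in the shell that launches the server); it mitigates
# fragmentation only and cannot make a 256 GiB request fit on a 64 GiB card.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```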
(EngineCore_DP0 pid=296) Process EngineCore_DP0:
(EngineCore_DP0 pid=296) Traceback (most recent call last):
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=296)     self.run()
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=296)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
(EngineCore_DP0 pid=296)     raise e
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_DP0 pid=296)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 505, in __init__
(EngineCore_DP0 pid=296)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 91, in __init__
(EngineCore_DP0 pid=296)     self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 183, in _initialize_kv_caches
(EngineCore_DP0 pid=296)     self.model_executor.determine_available_memory())
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 84, in determine_available_memory
(EngineCore_DP0 pid=296)     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 309, in collective_rpc
(EngineCore_DP0 pid=296)     return self._run_workers(method, *args, **(kwargs or {}))
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/ray_distributed_executor.py", line 505, in _run_workers
(EngineCore_DP0 pid=296)     ray_worker_outputs = ray.get(ray_worker_outputs)
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(EngineCore_DP0 pid=296)     return fn(*args, **kwargs)
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
(EngineCore_DP0 pid=296)     return func(*args, **kwargs)
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 2858, in get
(EngineCore_DP0 pid=296)     values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 958, in get_objects
(EngineCore_DP0 pid=296)     raise value.as_instanceof_cause()
(EngineCore_DP0 pid=296) ray.exceptions.RayTaskError(OutOfMemoryError): ray::RayWorkerWrapper.execute_method() (pid=791, ip=172.17.0.4, actor_id=2b9d7f7d597adf4159ecbb8101000000, repr=)
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 628, in execute_method
(EngineCore_DP0 pid=296)     raise e
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 619, in execute_method
(EngineCore_DP0 pid=296)     return run_method(self, method, args, kwargs)
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/vllm/utils/__init__.py", line 3060, in run_method
(EngineCore_DP0 pid=296)     return func(*args, **kwargs)
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_DP0 pid=296)     return func(*args, **kwargs)
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 263, in determine_available_memory
(EngineCore_DP0 pid=296)     self.model_runner.profile_run()
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3017, in profile_run
(EngineCore_DP0 pid=296)     self.model.get_multimodal_embeddings(
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/transformers.py", line 844, in get_multimodal_embeddings
(EngineCore_DP0 pid=296)     vision_embeddings = self.model.get_image_features(
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1061, in get_image_features
(EngineCore_DP0 pid=296)     image_embeds, deepstack_image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=296)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=296)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 739, in forward
(EngineCore_DP0 pid=296)     hidden_states = blk(
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_layers.py", line 94, in __call__
(EngineCore_DP0 pid=296)     return super().__call__(*args, **kwargs)
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=296)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=296)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 267, in forward
(EngineCore_DP0 pid=296)     hidden_states = hidden_states + self.attn(
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=296)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=296)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 230, in forward
(EngineCore_DP0 pid=296)     attn_outputs = [
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 231, in <listcomp>
(EngineCore_DP0 pid=296)     attention_interface(
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/transformers/integrations/sdpa_attention.py", line 96, in sdpa_attention_forward
(EngineCore_DP0 pid=296)     attn_output = torch.nn.functional.scaled_dot_product_attention(
(EngineCore_DP0 pid=296)   File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 5912, in scaled_dot_product_attention
"/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 5912, in scaled_dot_product_attention (EngineCore_DP0 pid=296) return _scaled_dot_product_attention(query, key, value, attn_mask, dropout_p, is_causal, scale = scale, enable_gqa = enable_gqa) (EngineCore_DP0 pid=296) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 0 has a total capacity of 63.59 GiB of which 41.68 GiB is free. Of the allocated memory 19.19 GiB is allocated by PyTorch, and 442.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) (EngineCore_DP0 pid=296) INFO 11-21 17:06:26 [ray_distributed_executor.py:122] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray. (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) /opt/conda/lib/python3.10/site-packages/torch/nn/functional.py:5912: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /workspace/framework/mcPytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:649.) (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) return _scaled_dot_product_attention(query, key, value, attn_mask, dropout_p, is_causal, scale = scale, enable_gqa = enable_gqa) (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] Error executing method 'determine_available_memory'. This might cause deadlock in distributed execution. (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] Traceback (most recent call last): (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 619, in execute_method (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] return run_method(self, method, args, kwargs) (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] File "/opt/conda/lib/python3.10/site-packages/vllm/utils/__init__.py", line 3060, in run_method (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] return func(*args, **kwargs) (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] return func(*args, **kwargs) (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 263, in determine_available_memory (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] self.model_runner.profile_run() (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3017, in profile_run (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] self.model.get_multimodal_embeddings( (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) 
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/transformers.py", line 844, in get_multimodal_embeddings
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]     vision_embeddings = self.model.get_image_features(
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1061, in get_image_features
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]     image_embeds, deepstack_image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 739, in forward
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]     hidden_states = blk(
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]   File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_layers.py", line 94, in __call__
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]     return super().__call__(*args, **kwargs)
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]   File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 267, in forward
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]     hidden_states = hidden_states + self.attn(
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] return forward_call(*args, **kwargs) (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 230, in forward (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] attn_outputs = [ (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 231, in (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] attention_interface( (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] File "/opt/conda/lib/python3.10/site-packages/transformers/integrations/sdpa_attention.py", line 96, in sdpa_attention_forward (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] attn_output = torch.nn.functional.scaled_dot_product_attention( (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 5912, in scaled_dot_product_attention (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] return _scaled_dot_product_attention(query, key, value, attn_mask, dropout_p, is_causal, scale = scale, enable_gqa = enable_gqa) (EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:06:26 [worker_base.py:627] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 0 has a total capacity of 63.59 GiB of which 41.68 GiB is free. Of the allocated memory 19.19 GiB is allocated by PyTorch, and 442.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
(APIServer pid=16) Traceback (most recent call last):
(APIServer pid=16)   File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
(APIServer pid=16)     return _run_code(code, main_globals, None,
(APIServer pid=16)   File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
(APIServer pid=16)     exec(code, run_globals)
(APIServer pid=16)   File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 2011, in <module>
(APIServer pid=16)     uvloop.run(run_server(args))
(APIServer pid=16)   File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 69, in run
(APIServer pid=16)     return loop.run_until_complete(wrapper())
(APIServer pid=16)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=16)   File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=16)     return await main
(APIServer pid=16)   File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1941, in run_server
(APIServer pid=16)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=16)   File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1961, in run_server_worker
(APIServer pid=16)     async with build_async_engine_client(
(APIServer pid=16)   File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=16)     return await anext(self.gen)
(APIServer pid=16)   File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 179, in build_async_engine_client
(APIServer pid=16)     async with build_async_engine_client_from_engine_args(
(APIServer pid=16)   File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=16)     return await anext(self.gen)
(APIServer pid=16)   File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 221, in build_async_engine_client_from_engine_args
(APIServer pid=16)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=16)   File "/opt/conda/lib/python3.10/site-packages/vllm/utils/__init__.py", line 1589, in inner
(APIServer pid=16)     return fn(*args, **kwargs)
(APIServer pid=16)   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 212, in from_vllm_config
(APIServer pid=16)     return cls(
(APIServer pid=16)   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 136, in __init__
(APIServer pid=16)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=16)   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=16)     return AsyncMPClient(*client_args)
(APIServer pid=16)   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=16)     super().__init__(
(APIServer pid=16)   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 448, in __init__
(APIServer pid=16)     with launch_core_engines(vllm_config, executor_class,
(APIServer pid=16)   File "/opt/conda/lib/python3.10/contextlib.py", line 142, in __exit__
(APIServer pid=16)     next(self.gen)
(APIServer pid=16)   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 729, in launch_core_engines
(APIServer pid=16)     wait_for_engine_startup(
(APIServer pid=16)   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 782, in wait_for_engine_startup
"/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 782, in wait_for_engine_startup (APIServer pid=16) raise RuntimeError("Engine core initialization failed. " (APIServer pid=16) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}