/opt/conda/lib/python3.12/site-packages/tvm_ffi/_optional_torch_c_dlpack.py:181: UserWarning: Failed to JIT torch c dlpack extension, EnvTensorAllocator will not be enabled. We recommend installing via `pip install torch-c-dlpack-ext`
  warnings.warn(
WARNING 05-22 13:38:03 [__init__.py:79] The quantization method 'awq' already exists and will be overwritten by the quantization config .
WARNING 05-22 13:38:03 [__init__.py:79] The quantization method 'awq_marlin' already exists and will be overwritten by the quantization config .
WARNING 05-22 13:38:03 [__init__.py:79] The quantization method 'compressed-tensors' already exists and will be overwritten by the quantization config .
WARNING 05-22 13:38:03 [__init__.py:79] The quantization method 'gptq' already exists and will be overwritten by the quantization config .
WARNING 05-22 13:38:03 [__init__.py:79] The quantization method 'gptq_marlin' already exists and will be overwritten by the quantization config .
WARNING 05-22 13:38:03 [__init__.py:79] The quantization method 'moe_wna16' already exists and will be overwritten by the quantization config .
WARNING 05-22 13:38:03 [registry.py:915] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_mtp:DeepSeekMTP.
WARNING 05-22 13:38:03 [registry.py:915] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV2ForCausalLM.
WARNING 05-22 13:38:03 [registry.py:915] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV3ForCausalLM.
WARNING 05-22 13:38:03 [registry.py:915] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV3ForCausalLM.
WARNING 05-22 13:38:03 [registry.py:915] Model architecture KimiK25ForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_metax.models.kimi_k25:KimiK25ForConditionalGeneration.
WARNING 05-22 13:38:03 [registry.py:915] Model architecture GlmMoeDsaForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:GlmMoeDsaForCausalLM.
(APIServer pid=14) INFO 05-22 13:38:03 [utils.py:299]
(APIServer pid=14) INFO 05-22 13:38:03 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=14) INFO 05-22 13:38:03 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.0
(APIServer pid=14) INFO 05-22 13:38:03 [utils.py:299] █▄█▀ █ █ █ █ model /models/Qwen3-32B-AWQ
(APIServer pid=14) INFO 05-22 13:38:03 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=14) INFO 05-22 13:38:03 [utils.py:299]
(APIServer pid=14) INFO 05-22 13:38:03 [utils.py:233] non-default args: {'model_tag': '/models/Qwen3-32B-AWQ', 'host': '0.0.0.0', 'port': 9901, 'model': '/models/Qwen3-32B-AWQ', 'trust_remote_code': True, 'max_model_len': 8192, 'served_model_name': ['qwen3'], 'reasoning_parser': 'qwen3', 'gpu_memory_utilization': 0.95, 'enable_prefix_caching': False, 'max_num_batched_tokens': 8192, 'max_num_seqs': 8, 'async_scheduling': True}
(APIServer pid=14) WARNING 05-22 13:38:03 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_USE_V1
(APIServer pid=14) INFO 05-22 13:38:03 [model.py:549] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=14) INFO 05-22 13:38:03 [model.py:1678] Using max model len 8192
(APIServer pid=14) INFO 05-22 13:38:03 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=14) INFO 05-22 13:38:03 [vllm.py:790] Asynchronous scheduling is enabled.
INFO 05-22 13:38:08 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 05-22 13:38:08 [__init__.py:46] - metax -> vllm_metax:register
INFO 05-22 13:38:08 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 05-22 13:38:08 [__init__.py:239] Platform plugin metax is activated
INFO Print the version information of mcoplib during compilation.
Version info:Mcoplib_Version = '0.4.4'
Build_Maca_Version = '3.5.3.20'
GIT_BRANCH = 'HEAD'
GIT_COMMIT = '817afcc'
Vllm Op Version = 0.19.0
SGlang Op Version = 0.5.10
INFO Staring Check the current MACA version of the operating environment.
INFO: Release major.minor matching, successful:3.5.
/opt/conda/lib/python3.12/site-packages/tvm_ffi/_optional_torch_c_dlpack.py:181: UserWarning: Failed to JIT torch c dlpack extension, EnvTensorAllocator will not be enabled. We recommend installing via `pip install torch-c-dlpack-ext`
  warnings.warn(
WARNING 05-22 13:38:52 [__init__.py:79] The quantization method 'awq' already exists and will be overwritten by the quantization config .
WARNING 05-22 13:38:52 [__init__.py:79] The quantization method 'awq_marlin' already exists and will be overwritten by the quantization config .
WARNING 05-22 13:38:59 [__init__.py:79] The quantization method 'compressed-tensors' already exists and will be overwritten by the quantization config .
WARNING 05-22 13:38:59 [__init__.py:79] The quantization method 'gptq' already exists and will be overwritten by the quantization config .
WARNING 05-22 13:38:59 [__init__.py:79] The quantization method 'gptq_marlin' already exists and will be overwritten by the quantization config .
WARNING 05-22 13:38:59 [__init__.py:79] The quantization method 'moe_wna16' already exists and will be overwritten by the quantization config .
(EngineCore pid=365) WARNING 05-22 13:38:59 [registry.py:915] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_mtp:DeepSeekMTP.
(EngineCore pid=365) WARNING 05-22 13:38:59 [registry.py:915] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV2ForCausalLM.
(EngineCore pid=365) WARNING 05-22 13:38:59 [registry.py:915] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV3ForCausalLM.
(EngineCore pid=365) WARNING 05-22 13:38:59 [registry.py:915] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV3ForCausalLM.
(EngineCore pid=365) WARNING 05-22 13:38:59 [registry.py:915] Model architecture KimiK25ForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_metax.models.kimi_k25:KimiK25ForConditionalGeneration.
(EngineCore pid=365) WARNING 05-22 13:38:59 [registry.py:915] Model architecture GlmMoeDsaForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:GlmMoeDsaForCausalLM.
(EngineCore pid=365) INFO 05-22 13:38:59 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='/models/Qwen3-32B-AWQ', speculative_config=None, tokenizer='/models/Qwen3-32B-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=awq, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen3, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': , 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::mx_sparse_attn_indexer', 'vllm::mx_sparse_attn_indexer_bf16', 
'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False}, 'inductor_passes': {}, 'cudagraph_mode': , 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 16, 'dynamic_shapes_config': {'type': , 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []} (EngineCore pid=365) INFO 05-22 13:38:59 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.75.45.11:45101 backend=nccl (EngineCore pid=365) INFO 05-22 13:38:59 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A (EngineCore pid=365) INFO 05-22 13:39:00 [gpu_model_runner.py:4735] Starting to load model /models/Qwen3-32B-AWQ... (EngineCore pid=365) [rank0]:W0522 13:39:00.566000 365 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. (EngineCore pid=365) [rank0]:W0522 13:39:00.566000 365 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures. 
(EngineCore pid=365) INFO 05-22 13:39:00 [platform.py:395] Some attention backends are not valid for maca with AttentionSelectorConfig(head_size=128, dtype=torch.float16, kv_cache_dtype=auto, block_size=None, use_mla=False, has_sink=False, use_sparse=False, use_mm_prefix=False, use_per_head_quant_scales=False, attn_type=AttentionType.DECODER). Reasons: {FLEX_ATTENTION: [ImportError]}. (EngineCore pid=365) INFO 05-22 13:39:00 [platform.py:407] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'TREE_ATTN'] (EngineCore pid=365) INFO 05-22 13:39:00 [platform.py:438] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'TREE_ATTN') (EngineCore pid=365) INFO 05-22 13:39:00 [fa_utils.py:21] Using Maca version of flash attention, which only supports version 2. Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00.130", line 1363, in forward (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] submod_94 = self.submod_94(getitem_233, s72, l_self_modules_layers_modules_46_modules_self_attn_modules_o_proj_parameters_qweight_, l_self_modules_layers_modules_46_modules_self_attn_modules_o_proj_parameters_scales_, l_self_modules_layers_modules_46_modules_self_attn_modules_o_proj_parameters_qzeros_, l_self_modules_layers_modules_46_modules_post_attention_layernorm_parameters_weight_, getitem_234, l_self_modules_layers_modules_46_modules_mlp_modules_gate_up_proj_parameters_qweight_, l_self_modules_layers_modules_46_modules_mlp_modules_gate_up_proj_parameters_scales_, l_self_modules_layers_modules_46_modules_mlp_modules_gate_up_proj_parameters_qzeros_, l_self_modules_layers_modules_46_modules_mlp_modules_down_proj_parameters_qweight_, l_self_modules_layers_modules_46_modules_mlp_modules_down_proj_parameters_scales_, l_self_modules_layers_modules_46_modules_mlp_modules_down_proj_parameters_qzeros_, l_self_modules_layers_modules_47_modules_input_layernorm_parameters_weight_, 
l_self_modules_layers_modules_47_modules_self_attn_modules_qkv_proj_parameters_qweight_, l_self_modules_layers_modules_47_modules_self_attn_modules_qkv_proj_parameters_scales_, l_self_modules_layers_modules_47_modules_self_attn_modules_qkv_proj_parameters_qzeros_, l_self_modules_layers_modules_47_modules_self_attn_modules_q_norm_parameters_weight_, l_self_modules_layers_modules_47_modules_self_attn_modules_k_norm_parameters_weight_, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_); getitem_233 = l_self_modules_layers_modules_46_modules_self_attn_modules_o_proj_parameters_qweight_ = l_self_modules_layers_modules_46_modules_self_attn_modules_o_proj_parameters_scales_ = l_self_modules_layers_modules_46_modules_self_attn_modules_o_proj_parameters_qzeros_ = l_self_modules_layers_modules_46_modules_post_attention_layernorm_parameters_weight_ = getitem_234 = l_self_modules_layers_modules_46_modules_mlp_modules_gate_up_proj_parameters_qweight_ = l_self_modules_layers_modules_46_modules_mlp_modules_gate_up_proj_parameters_scales_ = l_self_modules_layers_modules_46_modules_mlp_modules_gate_up_proj_parameters_qzeros_ = l_self_modules_layers_modules_46_modules_mlp_modules_down_proj_parameters_qweight_ = l_self_modules_layers_modules_46_modules_mlp_modules_down_proj_parameters_scales_ = l_self_modules_layers_modules_46_modules_mlp_modules_down_proj_parameters_qzeros_ = l_self_modules_layers_modules_47_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_47_modules_self_attn_modules_qkv_proj_parameters_qweight_ = l_self_modules_layers_modules_47_modules_self_attn_modules_qkv_proj_parameters_scales_ = l_self_modules_layers_modules_47_modules_self_attn_modules_qkv_proj_parameters_qzeros_ = l_self_modules_layers_modules_47_modules_self_attn_modules_q_norm_parameters_weight_ = l_self_modules_layers_modules_47_modules_self_attn_modules_k_norm_parameters_weight_ = None (EngineCore pid=365) ERROR 05-22 
13:39:49 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] File "/opt/conda/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 254, in __call__ (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] return self.runnable(*args, **kwargs) (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] File 
"/opt/conda/lib/python3.12/site-packages/vllm/compilation/piecewise_backend.py", line 367, in __call__ (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] return range_entry.runnable(*args) (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] File "/opt/conda/lib/python3.12/site-packages/torch/_inductor/standalone_compile.py", line 62, in __call__ (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] return self._compiled_fn(*args) (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] File "/opt/conda/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 929, in _fn (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] return fn(*args, **kwargs) (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] ^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] File "/opt/conda/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1241, in forward (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] return compiled_fn(full_args) (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] File "/opt/conda/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 384, in runtime_wrapper (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] all_outs = call_func_at_runtime_with_args( (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] File "/opt/conda/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] out = normalize_as_list(f(args)) (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] ^^^^^^^ (EngineCore pid=365) ERROR 
05-22 13:39:49 [core.py:1108] File "/opt/conda/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 556, in wrapper (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] return compiled_fn(runtime_args) (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] File "/opt/conda/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 584, in __call__ (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] return self.current_callable(inputs) (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] File "/opt/conda/lib/python3.12/site-packages/torch/_inductor/utils.py", line 2720, in run (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] out = model(new_inputs) (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] ^^^^^^^^^^^^^^^^^ (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] File "/tmp/torchinductor_root/cn/ccnt3h5ihaeznnfptjhn7gaqgsuqih4iz74w2douejbhyoefnf4n.py", line 1049, in call (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] with torch.cuda._DeviceGuard(0): (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] File "/opt/conda/lib/python3.12/site-packages/torch/cuda/__init__.py", line 522, in __exit__ (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] self.idx = torch.cuda._maybe_exchange_device(self.prev_idx) (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] torch.AcceleratorError: CUDA error: an illegal memory access was encountered (EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. 
(EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore pid=365) ERROR 05-22 13:39:49 [core.py:1108]
(EngineCore pid=365) Process EngineCore:
(EngineCore pid=365) Traceback (most recent call last):
(EngineCore pid=365) File "/tmp/torchinductor_root/cn/ccnt3h5ihaeznnfptjhn7gaqgsuqih4iz74w2douejbhyoefnf4n.py", line 1102, in call
(EngineCore pid=365) buf13 = empty_strided_cuda((s72, 64, 1), (64, 1, 64*s72), torch.float32)
(EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=365) torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(EngineCore pid=365) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=365) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore pid=365) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore pid=365)
(EngineCore pid=365)
(EngineCore pid=365) During handling of the above exception, another exception occurred:
(EngineCore pid=365)
(EngineCore pid=365) Traceback (most recent call last):
(EngineCore pid=365) File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=365) self.run()
(EngineCore pid=365) File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=365) self._target(*self._args, **self._kwargs)
(EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1112, in run_engine_core
(EngineCore pid=365) raise e
(EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=365) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=365) return func(*args, **kwargs)
(EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=365) super().__init__(
(EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 124, in __init__
(EngineCore pid=365) kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=365) return func(*args, **kwargs)
(EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 247, in _initialize_kv_caches
(EngineCore pid=365) available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 136, in determine_available_memory (EngineCore pid=365) return self.collective_rpc("determine_available_memory") (EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc (EngineCore pid=365) result = run_method(self.driver_worker, method, args, kwargs) (EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 510, in run_method (EngineCore pid=365) return func(*args, **kwargs) (EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context (EngineCore pid=365) return func(*args, **kwargs) (EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory (EngineCore pid=365) self.model_runner.profile_run() (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5782, in profile_run (EngineCore pid=365) hidden_states, last_hidden_states = self._dummy_run( (EngineCore pid=365) ^^^^^^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context (EngineCore pid=365) return func(*args, **kwargs) (EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5474, in _dummy_run (EngineCore pid=365) outputs = self.model( (EngineCore pid=365) ^^^^^^^^^^^ (EngineCore pid=365) File 
"/opt/conda/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 254, in __call__ (EngineCore pid=365) return self.runnable(*args, **kwargs) (EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl (EngineCore pid=365) return self._call_impl(*args, **kwargs) (EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl (EngineCore pid=365) return forward_call(*args, **kwargs) (EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 322, in forward (EngineCore pid=365) hidden_states = self.model( (EngineCore pid=365) ^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 610, in __call__ (EngineCore pid=365) output = TorchCompileWithNoGuardsWrapper.__call__( (EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/vllm/compilation/wrapper.py", line 190, in __call__ (EngineCore pid=365) return self._call_with_optional_nvtx_range( (EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/vllm/compilation/wrapper.py", line 76, in _call_with_optional_nvtx_range (EngineCore pid=365) return callable_fn(*args, **kwargs) (EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 736, in compile_wrapper (EngineCore pid=365) return fn(*args, **kwargs) (EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py", line 422, in forward (EngineCore pid=365) 
def forward( (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 929, in _fn (EngineCore pid=365) return fn(*args, **kwargs) (EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/vllm/compilation/caching.py", line 211, in __call__ (EngineCore pid=365) return self.optimized_call(*args, **kwargs) (EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/torch/fx/graph_module.py", line 848, in call_wrapped (EngineCore pid=365) return self._wrapped_call(self, *args, **kwargs) (EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/torch/fx/graph_module.py", line 424, in __call__ (EngineCore pid=365) raise e (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/torch/fx/graph_module.py", line 411, in __call__ (EngineCore pid=365) return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc] (EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl (EngineCore pid=365) return self._call_impl(*args, **kwargs) (EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) File "/opt/conda/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl (EngineCore pid=365) return forward_call(*args, **kwargs) (EngineCore pid=365) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=365) File ".130", line 1363, in forward (EngineCore pid=365) submod_94 = self.submod_94(getitem_233, s72, l_self_modules_layers_modules_46_modules_self_attn_modules_o_proj_parameters_qweight_, l_self_modules_layers_modules_46_modules_self_attn_modules_o_proj_parameters_scales_, l_self_modules_layers_modules_46_modules_self_attn_modules_o_proj_parameters_qzeros_, 
l_self_modules_layers_modules_46_modules_post_attention_layernorm_parameters_weight_, getitem_234, l_self_modules_layers_modules_46_modules_mlp_modules_gate_up_proj_parameters_qweight_, l_self_modules_layers_modules_46_modules_mlp_modules_gate_up_proj_parameters_scales_, l_self_modules_layers_modules_46_modules_mlp_modules_gate_up_proj_parameters_qzeros_, l_self_modules_layers_modules_46_modules_mlp_modules_down_proj_parameters_qweight_, l_self_modules_layers_modules_46_modules_mlp_modules_down_proj_parameters_scales_, l_self_modules_layers_modules_46_modules_mlp_modules_down_proj_parameters_qzeros_, l_self_modules_layers_modules_47_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_47_modules_self_attn_modules_qkv_proj_parameters_qweight_, l_self_modules_layers_modules_47_modules_self_attn_modules_qkv_proj_parameters_scales_, l_self_modules_layers_modules_47_modules_self_attn_modules_qkv_proj_parameters_qzeros_, l_self_modules_layers_modules_47_modules_self_attn_modules_q_norm_parameters_weight_, l_self_modules_layers_modules_47_modules_self_attn_modules_k_norm_parameters_weight_, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_); getitem_233 = l_self_modules_layers_modules_46_modules_self_attn_modules_o_proj_parameters_qweight_ = l_self_modules_layers_modules_46_modules_self_attn_modules_o_proj_parameters_scales_ = l_self_modules_layers_modules_46_modules_self_attn_modules_o_proj_parameters_qzeros_ = l_self_modules_layers_modules_46_modules_post_attention_layernorm_parameters_weight_ = getitem_234 = l_self_modules_layers_modules_46_modules_mlp_modules_gate_up_proj_parameters_qweight_ = l_self_modules_layers_modules_46_modules_mlp_modules_gate_up_proj_parameters_scales_ = l_self_modules_layers_modules_46_modules_mlp_modules_gate_up_proj_parameters_qzeros_ = l_self_modules_layers_modules_46_modules_mlp_modules_down_proj_parameters_qweight_ = 
l_self_modules_layers_modules_46_modules_mlp_modules_down_proj_parameters_scales_ = l_self_modules_layers_modules_46_modules_mlp_modules_down_proj_parameters_qzeros_ = l_self_modules_layers_modules_47_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_47_modules_self_attn_modules_qkv_proj_parameters_qweight_ = l_self_modules_layers_modules_47_modules_self_attn_modules_qkv_proj_parameters_scales_ = l_self_modules_layers_modules_47_modules_self_attn_modules_qkv_proj_parameters_qzeros_ = l_self_modules_layers_modules_47_modules_self_attn_modules_q_norm_parameters_weight_ = l_self_modules_layers_modules_47_modules_self_attn_modules_k_norm_parameters_weight_ = None
(EngineCore pid=365)   File "/opt/conda/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(EngineCore pid=365)     return self.runnable(*args, **kwargs)
(EngineCore pid=365)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=365)   File "/opt/conda/lib/python3.12/site-packages/vllm/compilation/piecewise_backend.py", line 367, in __call__
(EngineCore pid=365)     return range_entry.runnable(*args)
(EngineCore pid=365)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=365)   File "/opt/conda/lib/python3.12/site-packages/torch/_inductor/standalone_compile.py", line 62, in __call__
(EngineCore pid=365)     return self._compiled_fn(*args)
(EngineCore pid=365)            ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=365)   File "/opt/conda/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
(EngineCore pid=365)     return fn(*args, **kwargs)
(EngineCore pid=365)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=365)   File "/opt/conda/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1241, in forward
(EngineCore pid=365)     return compiled_fn(full_args)
(EngineCore pid=365)            ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=365)   File "/opt/conda/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 384, in runtime_wrapper
(EngineCore pid=365)     all_outs = call_func_at_runtime_with_args(
(EngineCore pid=365)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=365)   File "/opt/conda/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
(EngineCore pid=365)     out = normalize_as_list(f(args))
(EngineCore pid=365)                             ^^^^^^^
(EngineCore pid=365)   File "/opt/conda/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 556, in wrapper
(EngineCore pid=365)     return compiled_fn(runtime_args)
(EngineCore pid=365)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=365)   File "/opt/conda/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 584, in __call__
(EngineCore pid=365)     return self.current_callable(inputs)
(EngineCore pid=365)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=365)   File "/opt/conda/lib/python3.12/site-packages/torch/_inductor/utils.py", line 2720, in run
(EngineCore pid=365)     out = model(new_inputs)
(EngineCore pid=365)           ^^^^^^^^^^^^^^^^^
(EngineCore pid=365)   File "/tmp/torchinductor_root/cn/ccnt3h5ihaeznnfptjhn7gaqgsuqih4iz74w2douejbhyoefnf4n.py", line 1049, in call
(EngineCore pid=365)     with torch.cuda._DeviceGuard(0):
(EngineCore pid=365)          ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=365)   File "/opt/conda/lib/python3.12/site-packages/torch/cuda/__init__.py", line 522, in __exit__
(EngineCore pid=365)     self.idx = torch.cuda._maybe_exchange_device(self.prev_idx)
(EngineCore pid=365)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=365) torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(EngineCore pid=365) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=365) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore pid=365) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore pid=365) [13:39:49.443][MCR][E]mc_runtime_api.cpp :215 : 365 : [7f27ce6de740] mcGetDevice: Returned mcErrorIllegalAddress
[rank0]:[W522 13:39:49.564680587 CUDAGuardImpl.h:115] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
[13:39:49.443][MCR][E]mc_runtime_api.cpp :215 : 365 : [7f27ce6de740] mcGetDevice: Returned mcErrorIllegalAddress
terminate called after throwing an instance of 'c10::AcceleratorError'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /workspace/framework/mcPytorch/c10/cuda/CUDAException.cpp:42 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0x88 (0x7f27cbb928f8 in /opt/conda/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0x177a7 (0x7f27ccb667a7 in /opt/conda/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::SetDevice(signed char, bool) + 0x72 (0x7f27ccbae462 in /opt/conda/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x2fe5f (0x7f27ccb7ee5f in /opt/conda/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x766ffe (0x7f27c89eaffe in /opt/conda/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x1fdc3b (0x55699b284c3b in VLLM::EngineCore)
frame #6: + 0x249950 (0x55699b2d0950 in VLLM::EngineCore)
frame #7: + 0x11aa3f (0x55699b1a1a3f in VLLM::EngineCore)
frame #8: PyEval_EvalCode + 0xa1 (0x55699b3371a1 in VLLM::EngineCore)
frame #9: + 0x2ea8ca (0x55699b3718ca in VLLM::EngineCore)
frame #10: + 0x2e5585 (0x55699b36c585 in VLLM::EngineCore)
frame #11: PyRun_StringFlags + 0x62 (0x55699b368aa2 in VLLM::EngineCore)
frame #12: PyRun_SimpleStringFlags + 0x3c (0x55699b36889c in VLLM::EngineCore)
frame #13: Py_RunMain + 0x4e4 (0x55699b365ff4 in VLLM::EngineCore)
frame #14: Py_BytesMain + 0x37 (0x55699b321247 in VLLM::EngineCore)
frame #15: + 0x23a10 (0x7f27ce706a10 in /lib64/libc.so.6)
frame #16: __libc_start_main + 0x89 (0x7f27ce706ac9 in /lib64/libc.so.6)
frame #17: + 0x29a0ed (0x55699b3210ed in VLLM::EngineCore)

(APIServer pid=14) Traceback (most recent call last):
(APIServer pid=14)   File "/opt/conda/bin/vllm", line 8, in
(APIServer pid=14)     sys.exit(main())
(APIServer pid=14)              ^^^^^^
(APIServer pid=14)   File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=14)     args.dispatch_function(args)
(APIServer pid=14)   File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=14)     uvloop.run(run_server(args))
(APIServer pid=14)   File "/opt/conda/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=14)     return __asyncio.run(
(APIServer pid=14)            ^^^^^^^^^^^^^^
(APIServer pid=14)   File "/opt/conda/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=14)     return runner.run(main)
(APIServer pid=14)            ^^^^^^^^^^^^^^^^
(APIServer pid=14)   File "/opt/conda/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=14)     return self._loop.run_until_complete(task)
(APIServer pid=14)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=14)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=14)   File "/opt/conda/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=14)     return await main
(APIServer pid=14)            ^^^^^^^^^^
(APIServer pid=14)   File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server
(APIServer pid=14)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=14)   File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 684, in run_server_worker
(APIServer pid=14)     async with build_async_engine_client(
(APIServer pid=14)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=14)   File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=14)     return await anext(self.gen)
(APIServer pid=14)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=14)   File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=14)     async with build_async_engine_client_from_engine_args(
(APIServer pid=14)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=14)   File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=14)     return await anext(self.gen)
(APIServer pid=14)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=14)   File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=14)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=14)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=14)   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=14)     return cls(
(APIServer pid=14)            ^^^^
(APIServer pid=14)   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=14)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=14)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=14)   File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=14)     return func(*args, **kwargs)
(APIServer pid=14)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=14)   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=14)     return AsyncMPClient(*client_args)
(APIServer pid=14)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=14)   File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=14)     return func(*args, **kwargs)
(APIServer pid=14)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=14)   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 887, in __init__
(APIServer pid=14)     super().__init__(
(APIServer pid=14)   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=14)     with launch_core_engines(
(APIServer pid=14)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=14)   File "/opt/conda/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=14)     next(self.gen)
(APIServer pid=14)   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
(APIServer pid=14)     wait_for_engine_startup(
(APIServer pid=14)   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
(APIServer pid=14)     raise RuntimeError(
(APIServer pid=14) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}