Problem deploying Qwen3.5-122B with vLLM


April 23, 2026, 15:48

I. Hardware and software information
1. IEIT SYSTEMS NF5468-M7-A0-R0-00
2. 8 x MetaX C500 64GB
3. Kernel: 5.14.0-284.25.1.el9_2.x86_64
4. CPU virtualization enabled
5. /workspace# mx-smi
mx-smi version: 2.2.8

=================== MetaX System Management Interface Log ===================
Timestamp : Thu Apr 23 15:36:44 2026

+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| no process found |
+---------------------------------------------------------------------------------+
7. Image: vllm-metax:0.17.0-maca.ai3.5.3.307-torch2.8-py312-ubuntu22.04-amd64
II. Details:
Run inside the container:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve /data/msz/models/Qwen3.5-122B-A10B \
--served-model-name qwen-122b \
--tensor-parallel-size 8 \
--max-model-len 2048 \
--max-num-seqs 1 \
--gpu-memory-utilization 0.80 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000 \
--api-key "" 2>&1 | tee /workspace/vllm_log/qwen-122b.log
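As a rough sanity check on the launch parameters above, the per-card weight footprint can be estimated from the figures in this post. This is a back-of-the-envelope sketch, not a measurement: it assumes about 122 billion parameters (inferred from the model name), 2 bytes per parameter (bfloat16), an even split across the 8 cards, and 1 GB = 1e9 bytes. The helper names are hypothetical.

```python
# Rough estimate only. Assumptions (not from vLLM or MetaX documentation):
#   - about 122e9 parameters, inferred from the model name
#   - 2 bytes per parameter (bfloat16)
#   - weights split evenly across the tensor-parallel ranks
#   - KV cache, activations and runtime overhead are ignored


def weight_gb_per_gpu(num_params: float, bytes_per_param: int, tp_size: int) -> float:
    """Weight memory per GPU in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / tp_size / 1e9


def budget_gb_per_gpu(gpu_mem_gb: float, utilization: float) -> float:
    """Memory available per GPU under the given --gpu-memory-utilization."""
    return gpu_mem_gb * utilization


weights = weight_gb_per_gpu(122e9, 2, 8)
budget = budget_gb_per_gpu(64, 0.80)
print(f"weights per GPU: {weights:.1f} GB, budget per GPU: {budget:.1f} GB")
```

Under these assumptions the weights take roughly 30.5 GB of a 51.2 GB per-card budget, which is consistent with the server starting normally and suggests that weight size alone would not exhaust the budget.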

At this point vLLM starts normally, but inference fails. The inference command:

curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer " \
-d '{
"model": "qwen-122b",
"prompt": "你好，请介绍一下你自己。",
"max_tokens": 100
}'
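For reproducing the request from a script, the same call can be sketched with the Python standard library. This only builds the request object; the function name `build_completion_request` is hypothetical, and the commented-out timeout value is an arbitrary choice rather than a recommended setting.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens, api_key=""):
    """Build (but do not send) a POST request for /v1/completions."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_completion_request(
    "http://localhost:8000", "qwen-122b", "你好，请介绍一下你自己。", 100
)
# To send it against a running server (timeout is an arbitrary value):
# with urllib.request.urlopen(req, timeout=600) as resp:
#     print(json.load(resp))
```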
Log:
(APIServer pid=470) INFO: Started server process [470]
(APIServer pid=470) INFO: Waiting for application startup.
(APIServer pid=470) INFO: Application startup complete.
(Worker pid=756) (Worker_TP2 pid=756) /opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker pid=756) (Worker_TP2 pid=756) return fn(*contiguous_args, **contiguous_kwargs)
(Worker pid=754) (Worker_TP0 pid=754) /opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker pid=754) (Worker_TP0 pid=754) return fn(*contiguous_args, **contiguous_kwargs)
(Worker pid=757) (Worker_TP3 pid=757) /opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker pid=757) (Worker_TP3 pid=757) return fn(*contiguous_args, **contiguous_kwargs)
(Worker pid=758) (Worker_TP4 pid=758) /opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker pid=758) (Worker_TP4 pid=758) return fn(*contiguous_args, **contiguous_kwargs)
(Worker pid=760) (Worker_TP6 pid=760) /opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker pid=760) (Worker_TP6 pid=760) return fn(*contiguous_args, **contiguous_kwargs)
(Worker pid=755) (Worker_TP1 pid=755) /opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker pid=755) (Worker_TP1 pid=755) return fn(*contiguous_args, **contiguous_kwargs)
(Worker pid=759) (Worker_TP5 pid=759) /opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker pid=759) (Worker_TP5 pid=759) return fn(*contiguous_args, **contiguous_kwargs)
(Worker pid=761) (Worker_TP7 pid=761) /opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker pid=761) (Worker_TP7 pid=761) return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=615) INFO 04-23 15:20:00 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=615) INFO 04-23 15:21:00 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=615) INFO 04-23 15:22:00 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=615) INFO 04-23 15:23:00 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.0) with config: model='/data/msz/models/Qwen3.5-122B-A10B', speculative_config=None, tokenizer='/data/msz/models/Qwen3.5-122B-A10B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen-122b, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::mx_sparse_attn_indexer', 
'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 2, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []},
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=cmpl-85a641483c598a77-0-bc24cafe,prompt_token_ids_len=6,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[248044], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids=([1], [2], [3], [4]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={cmpl-85a641483c598a77-0-bc24cafe: 6}, total_num_scheduled_tokens=6, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.002373887240356032, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] Traceback (most recent call last):
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm_metax/v1/executor/multiproc_executor.py", line 366, in get_response
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] status, result = mq.dequeue(
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] ^^^^^^^^^^^
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 622, in dequeue
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] with self.acquire_read(timeout, cancel, indefinite) as buf:
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] return next(self.gen)
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 542, in acquire_read
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] raise TimeoutError
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] TimeoutError
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102]
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] The above exception was the direct cause of the following exception:
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102]
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] Traceback (most recent call last):
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] engine_core.run_busy_loop()
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] self._process_engine_step()
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 397, in step
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] model_output = future.result()
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm_metax/v1/executor/multiproc_executor.py", line 83, in result
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] return super().result()
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] return self.__get_result()
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] raise self._exception
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm_metax/v1/executor/multiproc_executor.py", line 87, in wait_for_response
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] response = self.aggregate(get_response())
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm_metax/v1/executor/multiproc_executor.py", line 370, in get_response
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] TimeoutError: RPC call to execute_model timed out.
(APIServer pid=470) ERROR 04-23 15:24:00 [async_llm.py:708] AsyncLLM output_handler failed.
(APIServer pid=470) ERROR 04-23 15:24:00 [async_llm.py:708] Traceback (most recent call last):
(APIServer pid=470) ERROR 04-23 15:24:00 [async_llm.py:708] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler
(APIServer pid=470) ERROR 04-23 15:24:00 [async_llm.py:708] outputs = await engine_core.get_output_async()
(APIServer pid=470) ERROR 04-23 15:24:00 [async_llm.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=470) ERROR 04-23 15:24:00 [async_llm.py:708] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async
(APIServer pid=470) ERROR 04-23 15:24:00 [async_llm.py:708] raise self._format_exception(outputs) from None
(APIServer pid=470) ERROR 04-23 15:24:00 [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=470) INFO: 127.0.0.1:44862 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=470) INFO: Shutting down
(APIServer pid=470) INFO: Waiting for application shutdown.
(APIServer pid=470) INFO: Application shutdown complete.
(APIServer pid=470) INFO: Finished server process [470]
/opt/conda/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
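To see how long the engine waited before giving up, the timestamps in a log like the one above can be extracted with a short script. This is a minimal sketch: the regular expression is written against the line format visible in this post, the function names are hypothetical, and the year 2026 is taken from the post date because the log itself omits it. The sample lines are shortened copies of lines from the log.

```python
import re
from datetime import datetime

# Matches the "LEVEL MM-DD HH:MM:SS [file.py:line]" part of a log line.
LOG_RE = re.compile(
    r"(INFO|WARNING|ERROR) (\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[([\w.]+):(\d+)\]"
)


def parse_line(line, year=2026):
    """Return (level, timestamp, source_file), or None if there is no log header."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    level, stamp, source, _lineno = m.groups()
    # The log omits the year; 2026 comes from the post date (an assumption).
    when = datetime.strptime(f"{year}-{stamp}", "%Y-%m-%d %H:%M:%S")
    return level, when, source


def seconds_from_first_stall_to_error(lines):
    """Seconds between the first shm_broadcast notice and the first ERROR line."""
    first_stall = first_error = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        level, when, source = parsed
        if source == "shm_broadcast.py" and first_stall is None:
            first_stall = when
        if level == "ERROR" and first_error is None:
            first_error = when
    if first_stall is None or first_error is None:
        return None
    return (first_error - first_stall).total_seconds()


sample = [
    "(APIServer pid=470) INFO: Application startup complete.",
    "(EngineCore_DP0 pid=615) INFO 04-23 15:20:00 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds.",
    "(EngineCore_DP0 pid=615) INFO 04-23 15:23:00 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds.",
    "(EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] EngineCore encountered a fatal error.",
]
gap = seconds_from_first_stall_to_error(sample)
print(gap)
```

Since the first notice already reports 60 seconds without a free block, the workers appear to have been unresponsive for about one minute longer than the computed gap before `execute_model` timed out.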

After running pip install transformers==5.2.0, I tried to start vLLM with the same command, and this time it failed at startup:

root@app-1da4ba5aa5334988aec66a3b382902e1-788bf68fc8-rvp56:/data/msz/models/Qwen3.5-122B-A10B# CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve /data/msz/models/Qwen3.5-122B-A10B --served-model-name qwen-122b --tensor-parallel-size 8 --max-model-len 2048 --max-num-seqs 1 --gpu-memory-utilization 0.80 --trust-remote-code --host 0.0.0.0 --port 8000 --api-key "123" 2>&1 | tee /workspace/vllm_log/qwen-122b.log
tee: /workspace/vllm_log/qwen-122b.log: No such file or directory
INFO 04-23 15:27:36 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 04-23 15:27:36 [__init__.py:46] - metax -> vllm_metax:register
INFO 04-23 15:27:36 [__init__.py:49] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 04-23 15:27:36 [__init__.py:212] Platform plugin metax is activated
INFO 04-23 15:27:36 [envs.py:104] Plugin sets VLLM_USE_FLASHINFER_SAMPLER to False. Reason: flashinfer sampler are not supported on maca
INFO 04-23 15:27:36 [envs.py:104] Plugin sets VLLM_ENGINE_READY_TIMEOUT_S to 3600. Reason: set timeout to 3600s for model loading
INFO 04-23 15:27:36 [envs.py:104] Plugin sets VLLM_DISABLE_SHARED_EXPERTS_STREAM to True. Reason: no used on maca
INFO Print the version information of mcoplib during compilation.

Version info:Mcoplib_Version = '0.4.2'
Build_Maca_Version = '3.5.3.20'
GIT_BRANCH = 'HEAD'
GIT_COMMIT = 'e482051'
Vllm Op Version = 0.17.0
SGlang Op Version = 0.5.8 && 0.5.9

INFO Staring Check the current MACA version of the operating environment.

INFO: Release major.minor matching, successful:3.5.

WARNING 04-23 15:27:44 [__init__.py:80] The quantization method 'awq' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.awq.MacaAWQConfig'>.
WARNING 04-23 15:27:44 [__init__.py:80] The quantization method 'awq_marlin' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.awq_marlin.MacaAWQMarlinConfig'>.
WARNING 04-23 15:27:44 [__init__.py:80] The quantization method 'compressed-tensors' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.compressed_tensors.MacaCompressedTensorsConfig'>.
WARNING 04-23 15:27:44 [__init__.py:80] The quantization method 'gptq' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.gptq.MacaGPTQConfig'>.
WARNING 04-23 15:27:44 [__init__.py:80] The quantization method 'gptq_marlin' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.gptq_marlin.MacaGPTQMarlinConfig'>.
WARNING 04-23 15:27:44 [__init__.py:80] The quantization method 'moe_wna16' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.moe_wna16.MacaMoeWNA16Config'>.
WARNING 04-23 15:27:44 [registry.py:886] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_mtp:DeepSeekMTP.
WARNING 04-23 15:27:44 [registry.py:886] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV2ForCausalLM.
WARNING 04-23 15:27:44 [registry.py:886] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV3ForCausalLM.
WARNING 04-23 15:27:44 [registry.py:886] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV3ForCausalLM.
WARNING 04-23 15:27:44 [registry.py:886] Model architecture KimiK25ForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_metax.models.kimi_k25:KimiK25ForConditionalGeneration.
WARNING 04-23 15:27:44 [registry.py:886] Model architecture GlmMoeDsaForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:GlmMoeDsaForCausalLM.
WARNING 04-23 15:27:44 [registry.py:886] Model architecture Step3p5MTP is already registered, and will be overwritten by the new model class vllm_metax.models.step3p5_mtp:Step3p5MTP.
(APIServer pid=90642) INFO 04-23 15:27:44 [utils.py:302]
(APIServer pid=90642) INFO 04-23 15:27:44 [utils.py:302] █ █ █▄ ▄█
(APIServer pid=90642) INFO 04-23 15:27:44 [utils.py:302] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.17.0
(APIServer pid=90642) INFO 04-23 15:27:44 [utils.py:302] █▄█▀ █ █ █ █ model /data/msz/models/Qwen3.5-122B-A10B
(APIServer pid=90642) INFO 04-23 15:27:44 [utils.py:302] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=90642) INFO 04-23 15:27:44 [utils.py:302]
(APIServer pid=90642) INFO 04-23 15:27:44 [utils.py:238] non-default args: {'model_tag': '/data/msz/models/Qwen3.5-122B-A10B', 'host': '0.0.0.0', 'api_key': ['123'], 'model': '/data/msz/models/Qwen3.5-122B-A10B', 'trust_remote_code': True, 'max_model_len': 2048, 'served_model_name': ['qwen-122b'], 'tensor_parallel_size': 8, 'gpu_memory_utilization': 0.8, 'max_num_seqs': 1}
(APIServer pid=90642) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=90642) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=90642) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=90642) Traceback (most recent call last):
(APIServer pid=90642) File "/opt/conda/bin/vllm", line 8, in <module>
(APIServer pid=90642) sys.exit(main())
(APIServer pid=90642) ^^^^^^
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=90642) args.dispatch_function(args)
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 112, in cmd
(APIServer pid=90642) uvloop.run(run_server(args))
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=90642) return asyncio.run(
(APIServer pid=90642) ^^^^^^^^^^^^^^
(APIServer pid=90642) File "/opt/conda/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=90642) return runner.run(main)
(APIServer pid=90642) ^^^^^^^^^^^^^^^^
(APIServer pid=90642) File "/opt/conda/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=90642) return self._loop.run_until_complete(task)
(APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=90642) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=90642) return await main
(APIServer pid=90642) ^^^^^^^^^^
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=90642) await run_server_worker(listen_address, sock, args, uvicorn_kwargs)
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=90642) async with build_async_engine_client(
(APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=90642) File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=90642) return await anext(self.gen)
(APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=90642) async with build_async_engine_client_from_engine_args(
(APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=90642) File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=90642) return await anext(self.gen)
(APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 122, in build_async_engine_client_from_engine_args
(APIServer pid=90642) vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1477, in create_engine_config
(APIServer pid=90642) model_config = self.create_model_config()
(APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1329, in create_model_config
(APIServer pid=90642) return ModelConfig(
(APIServer pid=90642) ^^^^^^^^^^^^
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
(APIServer pid=90642) s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/config/model.py", line 474, in __post_init__
(APIServer pid=90642) hf_config = get_config(
(APIServer pid=90642) ^^^^^^^^^^^
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/transformers_utils/config.py", line 628, in get_config
(APIServer pid=90642) config_dict, config = config_parser.parse(
(APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/transformers_utils/config.py", line 163, in parse
(APIServer pid=90642) config = config_class.from_pretrained(
(APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/transformers/configuration_utils.py", line 552, in from_pretrained
(APIServer pid=90642) return cls.from_dict(config_dict, **kwargs)
(APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/transformers/configuration_utils.py", line 714, in from_dict
(APIServer pid=90642) config = cls(**config_dict)
(APIServer pid=90642) ^^^^^^^^^^^^^^^^^^
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/transformers_utils/configs/qwen3_5_moe.py", line 192, in __init__
(APIServer pid=90642) self.text_config = self.sub_configs["text_config"](**text_config)
(APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/transformers_utils/configs/qwen3_5_moe.py", line 121, in __init__
(APIServer pid=90642) super().__init__(**kwargs)
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/transformers/configuration_utils.py", line 219, in __init__
(APIServer pid=90642) kwargs = self.convert_rope_params_to_dict(
(APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/transformers/modeling_rope_utils.py", line 651, in convert_rope_params_to_dict
(APIServer pid=90642) ignore_keys_at_rope_validation = ignore_keys_at_rope_validation | {"partial_rotary_factor"}
(APIServer pid=90642) ~~~~~~~^~~~~~~
(APIServer pid=90642) TypeError: unsupported operand type(s) for |: 'list' and 'set'
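
The final TypeError comes from applying `|` to a list on the left and a set on the right. The sketch below reproduces that expression in isolation and shows that converting the list to a set first makes the union valid. The variable contents are a hypothetical placeholder, and this is an illustration of the failure, not a patch for transformers or vllm-metax. The traceback suggests the value reaches transformers as a list, which may indicate a mismatch between transformers 5.2.0 and the config classes shipped in this image, but that is an inference and has not been confirmed.

```python
# Minimal reproduction of the failing expression in isolation.
# `ignore_keys` stands in for ignore_keys_at_rope_validation; its contents
# here are a hypothetical placeholder, not taken from the real config.
ignore_keys = ["placeholder_key"]

try:
    ignore_keys | {"partial_rotary_factor"}
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Converting the left operand to a set makes the union well-defined.
merged = set(ignore_keys) | {"partial_rotary_factor"}
print(sorted(merged))
```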