• Members 7 posts
    2026年4月23日 15:48

    一. 软硬件信息
    1. IEIT SYSTEMS
    NF5468-M7-A0-R0-00
    2. 8卡 MetaX C500 64GB
    3. 5.14.0-284.25.1.el9_2.x86_64
    4. 启用cpu虚拟化
    5. /workspace# mx-smi
    mx-smi version: 2.2.8

    =================== MetaX System Management Interface Log ===================
    Timestamp : Thu Apr 23 15:36:44 2026

    Attached GPUs : 8
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.8 Kernel Mode Driver Version: 3.0.11 |
    | MACA Version: 3.5.3.20 BIOS Version: 1.27.5.0 |
    |------------------------------------+---------------------+----------------------+
    | GPU NAME Persistence-M | Bus-id | GPU-Util sGPU-M |
    | Temp Pwr:Usage/Cap Perf | Memory-Usage | GPU-State |
    |====================================+=====================+======================|
    | 0 MetaX C500 Off | 0000:0e:00.0 | 0% Native |
    | 34C 55W / 350W P0 | 858/65536 MiB | Available |
    +------------------------------------+---------------------+----------------------+
    | 1 MetaX C500 Off | 0000:0f:00.0 | 0% Native |
    | 37C 57W / 350W P0 | 858/65536 MiB | Available |
    +------------------------------------+---------------------+----------------------+
    | 2 MetaX C500 Off | 0000:10:00.0 | 0% Native |
    | 36C 57W / 350W P0 | 858/65536 MiB | Available |
    +------------------------------------+---------------------+----------------------+
    | 3 MetaX C500 Off | 0000:12:00.0 | 0% Native |
    | 35C 58W / 350W P0 | 858/65536 MiB | Available |
    +------------------------------------+---------------------+----------------------+
    | 4 MetaX C500 Off | 0000:35:00.0 | 0% Native |
    | 35C 55W / 350W P0 | 858/65536 MiB | Available |
    +------------------------------------+---------------------+----------------------+
    | 5 MetaX C500 Off | 0000:36:00.0 | 0% Native |
    | 38C 57W / 350W P0 | 858/65536 MiB | Available |
    +------------------------------------+---------------------+----------------------+
    | 6 MetaX C500 Off | 0000:37:00.0 | 0% Native |
    | 37C 56W / 350W P0 | 858/65536 MiB | Available |
    +------------------------------------+---------------------+----------------------+
    | 7 MetaX C500 Off | 0000:38:00.0 | 0% Native |
    | 38C 58W / 350W P0 | 858/65536 MiB | Available |
    +------------------------------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process: |
    | GPU PID Process Name GPU Memory |
    | Usage(MiB) |
    |=================================================================================|
    | no process found |
    +---------------------------------------------------------------------------------+
    7. 镜像: vllm-metax:0.17.0-maca.ai3.5.3.307-torch2.8-py312-ubuntu22.04-amd64
    二. 具体情况:
    容器内执行:CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    vllm serve /data/msz/models/Qwen3.5-122B-A10B \
    --served-model-name qwen-122b \
    --tensor-parallel-size 8 \
    --max-model-len 2048 \
    --max-num-seqs 1 \
    --gpu-memory-utilization 0.80 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key "" 2>&1 | tee /workspace/vllm_log/qwen-122b.log

    此时vllm正常启动,但是推理时报错,推理命令:

    curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer " \
    -d '{
    "model": "qwen-122b",
    "prompt": "你好,请介绍一下你自己。",
    "max_tokens": 100
    }'
    日志:
    (APIServer pid=470) INFO: Started server process [470]
    (APIServer pid=470) INFO: Waiting for application startup.
    (APIServer pid=470) INFO: Application startup complete.
    (Worker pid=756) (Worker_TP2 pid=756) /opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
    (Worker pid=756) (Worker_TP2 pid=756) return fn(contiguous_args, contiguous_kwargs)
    (Worker pid=754) (Worker_TP0 pid=754) /opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
    (Worker pid=754) (Worker_TP0 pid=754) return fn(
    contiguous_args, contiguous_kwargs)
    (Worker pid=757) (Worker_TP3 pid=757) /opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
    (Worker pid=757) (Worker_TP3 pid=757) return fn(*contiguous_args,
    contiguous_kwargs)
    (Worker pid=758) (Worker_TP4 pid=758) /opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
    (Worker pid=758) (Worker_TP4 pid=758) return fn(contiguous_args, contiguous_kwargs)
    (Worker pid=760) (Worker_TP6 pid=760) /opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
    (Worker pid=760) (Worker_TP6 pid=760) return fn(
    contiguous_args, contiguous_kwargs)
    (Worker pid=755) (Worker_TP1 pid=755) /opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
    (Worker pid=755) (Worker_TP1 pid=755) return fn(*contiguous_args,
    contiguous_kwargs)
    (Worker pid=759) (Worker_TP5 pid=759) /opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
    (Worker pid=759) (Worker_TP5 pid=759) return fn(contiguous_args, contiguous_kwargs)
    (Worker pid=761) (Worker_TP7 pid=761) /opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
    (Worker pid=761) (Worker_TP7 pid=761) return fn(
    contiguous_args, **contiguous_kwargs)
    (EngineCore_DP0 pid=615) INFO 04-23 15:20:00 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore_DP0 pid=615) INFO 04-23 15:21:00 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore_DP0 pid=615) INFO 04-23 15:22:00 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore_DP0 pid=615) INFO 04-23 15:23:00 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.0) with config: model='/data/msz/models/Qwen3.5-122B-A10B', speculative_config=None, tokenizer='/data/msz/models/Qwen3.5-122B-A10B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen-122b, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::mx_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 2, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []},
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=cmpl-85a641483c598a77-0-bc24cafe,prompt_token_ids_len=6,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[248044], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids=([1], [2], [3], [4]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={cmpl-85a641483c598a77-0-bc24cafe: 6}, total_num_scheduled_tokens=6, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null)
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.002373887240356032, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] EngineCore encountered a fatal error.
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] Traceback (most recent call last):
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm_metax/v1/executor/multiproc_executor.py", line 366, in get_response
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] status, result = mq.dequeue(
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] ^^^^^^^^^^^
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 622, in dequeue
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] with self.acquire_read(timeout, cancel, indefinite) as buf:
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/contextlib.py", line 137, in enter
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] return next(self.gen)
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] ^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 542, in acquire_read
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] raise TimeoutError
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] TimeoutError
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102]
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] The above exception was the direct cause of the following exception:
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102]
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] Traceback (most recent call last):
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] engine_core.run_busy_loop()
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] self._process_engine_step()
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] outputs, model_executed = self.step_fn()
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] ^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 397, in step
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] model_output = future.result()
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] ^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm_metax/v1/executor/multiproc_executor.py", line 83, in result
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] return super().result()
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] ^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] return self.__get_result()
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] ^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] raise self._exception
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm_metax/v1/executor/multiproc_executor.py", line 87, in wait_for_response
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] response = self.aggregate(get_response())
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] ^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] File "/opt/conda/lib/python3.12/site-packages/vllm_metax/v1/executor/multiproc_executor.py", line 370, in get_response
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] raise TimeoutError(f"RPC call to {method} timed out.") from e
    (EngineCore_DP0 pid=615) ERROR 04-23 15:24:00 [core.py:1102] TimeoutError: RPC call to execute_model timed out.
    (APIServer pid=470) ERROR 04-23 15:24:00 [async_llm.py:708] AsyncLLM output_handler failed.
    (APIServer pid=470) ERROR 04-23 15:24:00 [async_llm.py:708] Traceback (most recent call last):
    (APIServer pid=470) ERROR 04-23 15:24:00 [async_llm.py:708] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler
    (APIServer pid=470) ERROR 04-23 15:24:00 [async_llm.py:708] outputs = await engine_core.get_output_async()
    (APIServer pid=470) ERROR 04-23 15:24:00 [async_llm.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=470) ERROR 04-23 15:24:00 [async_llm.py:708] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async
    (APIServer pid=470) ERROR 04-23 15:24:00 [async_llm.py:708] raise self._format_exception(outputs) from None
    (APIServer pid=470) ERROR 04-23 15:24:00 [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
    (APIServer pid=470) INFO: 127.0.0.1:44862 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
    (APIServer pid=470) INFO: Shutting down
    (APIServer pid=470) INFO: Waiting for application shutdown.
    (APIServer pid=470) INFO: Application shutdown complete.
    (APIServer pid=470) INFO: Finished server process [470]
    /opt/conda/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '

    进行pip install transformers==5.2.0后,同样的命令尝试启动vllm,启动时报错:

    root@app-1da4ba5aa5334988aec66a3b382902e1-788bf68fc8-rvp56:/data/msz/models/Qwen3.5-122B-A10B# CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve /data/msz/models/Qwen3.5-122B-A10B --served-model-name qwen-122b --tensor-parallel-size 8 --max-model-len 2048 --max-num-seqs 1 --gpu-memory-utilization 0.80 --trust-remote-code --host 0.0.0.0 --port 8000 --api-key "123" 2>&1 | tee /workspace/vllm_log/qwen-122b.log
    tee: /workspace/vllm_log/qwen-122b.log: No such file or directory
    INFO 04-23 15:27:36 [init.py:44] Available plugins for group vllm.platform_plugins:
    INFO 04-23 15:27:36 [init.py:46] - metax -> vllm_metax:register
    INFO 04-23 15:27:36 [init.py:49] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
    INFO 04-23 15:27:36 [init.py:212] Platform plugin metax is activated
    INFO 04-23 15:27:36 [envs.py:104] Plugin sets VLLM_USE_FLASHINFER_SAMPLER to False. Reason: flashinfer sampler are not supported on maca
    INFO 04-23 15:27:36 [envs.py:104] Plugin sets VLLM_ENGINE_READY_TIMEOUT_S to 3600. Reason: set timeout to 3600s for model loading
    INFO 04-23 15:27:36 [envs.py:104] Plugin sets VLLM_DISABLE_SHARED_EXPERTS_STREAM to True. Reason: no used on maca
    INFO Print the version information of mcoplib during compilation.

    Version info:Mcoplib_Version = '0.4.2'
    Build_Maca_Version = '3.5.3.20'
    GIT_BRANCH = 'HEAD'
    GIT_COMMIT = 'e482051'
    Vllm Op Version = 0.17.0
    SGlang Op Version = 0.5.8 && 0.5.9

    INFO Staring Check the current MACA version of the operating environment.

    INFO: Release major.minor matching, successful:3.5.

    WARNING 04-23 15:27:44 [init.py:80] The quantization method 'awq' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.awq.MacaAWQConfig'>.
    WARNING 04-23 15:27:44 [init.py:80] The quantization method 'awq_marlin' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.awq_marlin.MacaAWQMarlinConfig'>.
    WARNING 04-23 15:27:44 [init.py:80] The quantization method 'compressed-tensors' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.compressed_tensors.MacaCompressedTensorsConfig'>.
    WARNING 04-23 15:27:44 [init.py:80] The quantization method 'gptq' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.gptq.MacaGPTQConfig'>.
    WARNING 04-23 15:27:44 [init.py:80] The quantization method 'gptq_marlin' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.gptq_marlin.MacaGPTQMarlinConfig'>.
    WARNING 04-23 15:27:44 [init.py:80] The quantization method 'moe_wna16' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.moe_wna16.MacaMoeWNA16Config'>.
    WARNING 04-23 15:27:44 [registry.py:886] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_mtp:DeepSeekMTP.
    WARNING 04-23 15:27:44 [registry.py:886] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV2ForCausalLM.
    WARNING 04-23 15:27:44 [registry.py:886] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV3ForCausalLM.
    WARNING 04-23 15:27:44 [registry.py:886] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV3ForCausalLM.
    WARNING 04-23 15:27:44 [registry.py:886] Model architecture KimiK25ForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_metax.models.kimi_k25:KimiK25ForConditionalGeneration.
    WARNING 04-23 15:27:44 [registry.py:886] Model architecture GlmMoeDsaForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:GlmMoeDsaForCausalLM.
    WARNING 04-23 15:27:44 [registry.py:886] Model architecture Step3p5MTP is already registered, and will be overwritten by the new model class vllm_metax.models.step3p5_mtp:Step3p5MTP.
    (APIServer pid=90642) INFO 04-23 15:27:44 [utils.py:302]
    (APIServer pid=90642) INFO 04-23 15:27:44 [utils.py:302] █ █ █▄ ▄█
    (APIServer pid=90642) INFO 04-23 15:27:44 [utils.py:302] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.17.0
    (APIServer pid=90642) INFO 04-23 15:27:44 [utils.py:302] █▄█▀ █ █ █ █ model /data/msz/models/Qwen3.5-122B-A10B
    (APIServer pid=90642) INFO 04-23 15:27:44 [utils.py:302] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
    (APIServer pid=90642) INFO 04-23 15:27:44 [utils.py:302]
    (APIServer pid=90642) INFO 04-23 15:27:44 [utils.py:238] non-default args: {'model_tag': '/data/msz/models/Qwen3.5-122B-A10B', 'host': '0.0.0.0', 'api_key': ['123'], 'model': '/data/msz/models/Qwen3.5-122B-A10B', 'trust_remote_code': True, 'max_model_len': 2048, 'served_model_name': ['qwen-122b'], 'tensor_parallel_size': 8, 'gpu_memory_utilization': 0.8, 'max_num_seqs': 1}
    (APIServer pid=90642) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    (APIServer pid=90642) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    (APIServer pid=90642) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    (APIServer pid=90642) Traceback (most recent call last):
    (APIServer pid=90642) File "/opt/conda/bin/vllm", line 8, in <module>
    (APIServer pid=90642) sys.exit(main())
    (APIServer pid=90642) ^^^^^^
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
    (APIServer pid=90642) args.dispatch_function(args)
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 112, in cmd
    (APIServer pid=90642) uvloop.run(run_server(args))
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/uvloop/init.py", line 96, in run
    (APIServer pid=90642) return asyncio.run(
    (APIServer pid=90642) ^^^^^^^^^^^^^^
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/asyncio/runners.py", line 195, in run
    (APIServer pid=90642) return runner.run(main)
    (APIServer pid=90642) ^^^^^^^^^^^^^^^^
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/asyncio/runners.py", line 118, in run
    (APIServer pid=90642) return self._loop.run_until_complete(task)
    (APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=90642) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/uvloop/__init
    .py", line 48, in wrapper
    (APIServer pid=90642) return await main
    (APIServer pid=90642) ^^^^^^^^^^
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
    (APIServer pid=90642) await run_server_worker(listen_address, sock, args, uvicorn_kwargs)
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
    (APIServer pid=90642) async with build_async_engine_client(
    (APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/contextlib.py", line 210, in aenter
    (APIServer pid=90642) return await anext(self.gen)
    (APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
    (APIServer pid=90642) async with build_async_engine_client_from_engine_args(
    (APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/contextlib.py", line 210, in aenter
    (APIServer pid=90642) return await anext(self.gen)
    (APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 122, in build_async_engine_client_from_engine_args
    (APIServer pid=90642) vllm_config = engine_args.create_engine_config(usage_context=usage_context)
    (APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1477, in create_engine_config
    (APIServer pid=90642) model_config = self.create_model_config()
    (APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1329, in create_model_config
    (APIServer pid=90642) return ModelConfig(
    (APIServer pid=90642) ^^^^^^^^^^^^
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/pydantic/_internal/_dataclasses.py", line 121, in init
    (APIServer pid=90642) s.pydantic_validator.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/config/model.py", line 474, in post_init
    (APIServer pid=90642) hf_config = get_config(
    (APIServer pid=90642) ^^^^^^^^^^^
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/transformers_utils/config.py", line 628, in get_config
    (APIServer pid=90642) config_dict, config = config_parser.parse(
    (APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/transformers_utils/config.py", line 163, in parse
    (APIServer pid=90642) config = config_class.from_pretrained(
    (APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/transformers/configuration_utils.py", line 552, in from_pretrained
    (APIServer pid=90642) return cls.from_dict(config_dict,
    kwargs)
    (APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/transformers/configuration_utils.py", line 714, in from_dict
    (APIServer pid=90642) config = cls(config_dict)
    (APIServer pid=90642) ^^^^^^^^^^^^^^^^^^
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/transformers_utils/configs/qwen3_5_moe.py", line 192, in init
    (APIServer pid=90642) self.text_config = self.sub_configs"text_config"
    (APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/vllm/transformers_utils/configs/qwen3_5_moe.py", line 121, in init
    (APIServer pid=90642) super().init(
    kwargs)
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/transformers/configuration_utils.py", line 219, in init
    (APIServer pid=90642) kwargs = self.convert_rope_params_to_dict(
    (APIServer pid=90642) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=90642) File "/opt/conda/lib/python3.12/site-packages/transformers/modeling_rope_utils.py", line 651, in convert_rope_params_to_dict
    (APIServer pid=90642) ignore_keys_at_rope_validation = ignore_keys_at_rope_validation | {"partial_rotary_factor"}
    (APIServer pid=90642) ~~~~~~~^~~~~~~
    (APIServer pid=90642) TypeError: unsupported operand type(s) for |: 'list' and 'set'

  • arrow_forward

    Thread has been moved from 产品&运维.

  • Members 460 posts
    2026年4月23日 15:50

    尊敬的开发者您好,请给出容器启动命令

  • Members 7 posts
    2026年4月23日 15:51

    容器启动命令: /bin/bash -lc "sleep infinity"

  • Members 460 posts
    2026年4月23日 15:55

    尊敬的开发者您好,请给出容器启动命令,如docker run

  • Members 7 posts
    2026年4月23日 16:00

    平台上启用的云容器实例,没有具体的docker run命令,启动时的基础配置:
    容器名称
    vllm
    镜像种类
    私有镜像
    镜像名称
    namespace/vllm-metax
    镜像Tag
    0.17.0-maca.ai3.5.3.307-torch2.8-py312-ubuntu22.04-amd64
    卡数
    8卡
    vCPU
    100核
    内存
    1900GiB

  • Members 460 posts
    2026年4月23日 16:07

    尊敬的开发者您好,请尝试推理qwen3.5其他小参数量模型进行交叉验证

  • Members 7 posts
    2026年4月23日 18:57

    你好,尝试了多种小参数模型的部署与推理,
    qwen3.5系列模型,容器:vllm-metax:0.17.0-maca.ai3.5.3.307-torch2.8-py312-ubuntu22.04-amd64
    4B,9B,27B的dense模型能在单卡上正常起服务,27B尝试了两卡部署,也能正常起服务
    35B的moe模型,Qwen/Qwen3.5-35B-A3B,能启动vllm,但是不能推理,一推理就崩溃
    piptransformers试过了,有冲突不能运行
    报错信息:

    (APIServer pid=18576) INFO:     Started server process [18576]
    (APIServer pid=18576) INFO:     Waiting for application startup.
    (APIServer pid=18576) INFO:     Application startup complete.
    (EngineCore_DP0 pid=18721) INFO 04-23 17:50:40 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore_DP0 pid=18721) INFO 04-23 17:51:40 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore_DP0 pid=18721) INFO 04-23 17:52:40 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore_DP0 pid=18721) INFO 04-23 17:53:40 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.0) with config: model='/data/msz/models/Qwen3.5-35B', speculative_config=None, tokenizer='/data/msz/models/Qwen3.5-35B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen3.5-35b-a3b, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::mx_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, 
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-ad8b05cb2ffe5905-8a2aec7c,prompt_token_ids_len=17,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[248044], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids=([1], [2], [3], [4]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={chatcmpl-ad8b05cb2ffe5905-8a2aec7c: 17}, total_num_scheduled_tokens=17, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null)
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.003824091778202643, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102] EngineCore encountered a fatal error.
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102] Traceback (most recent call last):
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]   File "/opt/conda/lib/python3.12/site-packages/vllm_metax/v1/executor/multiproc_executor.py", line 366, in get_response
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]     status, result = mq.dequeue(
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]                      ^^^^^^^^^^^
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]   File "/opt/conda/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 622, in dequeue
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]     with self.acquire_read(timeout, cancel, indefinite) as buf:
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]   File "/opt/conda/lib/python3.12/contextlib.py", line 137, in __enter__
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]     return next(self.gen)
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]            ^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]   File "/opt/conda/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 542, in acquire_read
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]     raise TimeoutError
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102] TimeoutError
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102] 
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102] The above exception was the direct cause of the following exception:
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102] 
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102] Traceback (most recent call last):
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]     engine_core.run_busy_loop()
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]     self._process_engine_step()
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]     outputs, model_executed = self.step_fn()
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]                               ^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 397, in step
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]     model_output = future.result()
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]                    ^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]   File "/opt/conda/lib/python3.12/site-packages/vllm_metax/v1/executor/multiproc_executor.py", line 83, in result
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]     return super().result()
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]            ^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]   File "/opt/conda/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]     return self.__get_result()
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]   File "/opt/conda/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]     raise self._exception
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]   File "/opt/conda/lib/python3.12/site-packages/vllm_metax/v1/executor/multiproc_executor.py", line 87, in wait_for_response
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]     response = self.aggregate(get_response())
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]                               ^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]   File "/opt/conda/lib/python3.12/site-packages/vllm_metax/v1/executor/multiproc_executor.py", line 370, in get_response
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102]     raise TimeoutError(f"RPC call to {method} timed out.") from e
    (EngineCore_DP0 pid=18721) ERROR 04-23 17:54:40 [core.py:1102] TimeoutError: RPC call to execute_model timed out.
    (APIServer pid=18576) ERROR 04-23 17:54:40 [async_llm.py:708] AsyncLLM output_handler failed.
    (APIServer pid=18576) ERROR 04-23 17:54:40 [async_llm.py:708] Traceback (most recent call last):
    (APIServer pid=18576) ERROR 04-23 17:54:40 [async_llm.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler
    (APIServer pid=18576) ERROR 04-23 17:54:40 [async_llm.py:708]     outputs = await engine_core.get_output_async()
    (APIServer pid=18576) ERROR 04-23 17:54:40 [async_llm.py:708]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=18576) ERROR 04-23 17:54:40 [async_llm.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async
    (APIServer pid=18576) ERROR 04-23 17:54:40 [async_llm.py:708]     raise self._format_exception(outputs) from None
    (APIServer pid=18576) ERROR 04-23 17:54:40 [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
    (APIServer pid=18576) INFO:     127.0.0.1:58924 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
    (APIServer pid=18576) INFO:     Shutting down
    (APIServer pid=18576) INFO:     Waiting for application shutdown.
    (APIServer pid=18576) INFO:     Application shutdown complete.
    (APIServer pid=18576) INFO:     Finished server process [18576]
    /opt/conda/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
      warnings.warn('resource_tracker: There appear to be %d '
    

    是否是moe模型的问题

  • Members 460 posts
    2026年4月23日 18:59

    尊敬的开发者您好,Dense模型推理请求测试是正常的吗

  • Members 7 posts
    2026年4月23日 18:59

    是的,能正常推理,但是moe的不行

  • Members 460 posts
    2026年4月24日 16:18

    尊敬的开发者您好,请联系相应平台的技术支持解决,查看裸金属相关日志定位问题

  • Members 7 posts
    2026年4月24日 18:11

    问题已解决,更换了启动命令: 

    vllm serve /models/Qwen3.5-122B-A10B -pp 1 -tp 8 \
        --trust-remote-code --dtype bfloat16 --distributed-executor-backend mp --swap-space 16 \
            --gpu-memory-utilization 0.85 --max-model-len 131072 --max-num-batched-tokens 131072 --no-async-scheduling --mm-encoder-tp-mode data --mm-processor-cache-type shm --limit-mm-per-prompt '{"image": 5, "video": 1}' --skip-mm-profiling --enable-prefix-caching \
        --served-model-name Qwen3.5-122B-A10B \
        --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 \
        --port 8000
    
  • arrow_forward

    Thread has been moved from 解决中.