Tried the unquantized version as well — it still fails with the same error:
root@dzdwd-server:/workspace# CUDA_VISIBLE_DEVICES=0,1 \
nohup vllm serve /models/Qwen3.5-35B-A3B -tp 2 \
--port 8889 \
--trust-remote-code \
--dtype auto \
--max-model-len 104800 \
--max-num-batched-tokens 104800 \
--swap-space 32 \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--served-model-name DeepSeek-R1-32B \
--enable-auto-tool-choice \
--api-key Dzdwd@85416 \
--tool-call-parser hermes > vllm_serve.log 2>&1 &
[1] 31
root@dzdwd-server:/workspace# tail -f vllm_serve.log
nohup: ignoring input
INFO 03-19 12:24:38 [__init__.py:43] Available plugins for group vllm.platform_plugins:
INFO 03-19 12:24:38 [__init__.py:45] - metax -> vllm_metax:register
INFO 03-19 12:24:38 [__init__.py:48] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 03-19 12:24:38 [__init__.py:217] Platform plugin metax is activated
INFO 03-19 12:24:38 [envs.py:83] Plugin sets VLLM_USE_FLASHINFER_SAMPLER to False. Reason: flashinfer sampler are not supported on maca
[2026-03-19 12:24:51] INFO config.py:27: Updated vLLM file registry _CLASS_TO_MODULE
[2026-03-19 12:24:51] INFO config.py:36: Updated vLLM internal model_type registry
INFO Print the version information of mcoplib during compilation.
Version info:Mcoplib_Version = '0.4.0'
Build_Maca_Version = '3.5.3.18'
GIT_BRANCH = 'HEAD'
GIT_COMMIT = '3cd1a1a'
Vllm Op Version = 0.14.0
SGlang Op Version = 0.5.7 && 0.5.8
INFO Staring Check the current MACA version of the operating environment.
INFO: Release major.minor matching, successful:3.5.
Successfully added Qwen3ASRForConditionalGeneration to _MULTIMODAL_MODELS
WARNING 03-19 12:25:21 [registry.py:801] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_mtp:DeepSeekMTP.
WARNING 03-19 12:25:21 [registry.py:801] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV2ForCausalLM.
WARNING 03-19 12:25:21 [registry.py:801] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV3ForCausalLM.
WARNING 03-19 12:25:21 [registry.py:801] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV3ForCausalLM.
WARNING 03-19 12:25:21 [__init__.py:94] The quantization method 'awq' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.awq.MacaAWQConfig'>.
WARNING 03-19 12:25:21 [__init__.py:94] The quantization method 'awq_marlin' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.awq_marlin.MacaAWQMarlinConfig'>.
WARNING 03-19 12:25:21 [__init__.py:94] The quantization method 'gptq' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.gptq.MacaGPTQConfig'>.
WARNING 03-19 12:25:21 [__init__.py:94] The quantization method 'gptq_marlin' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.gptq_marlin.MacaGPTQMarlinConfig'>.
WARNING 03-19 12:25:21 [__init__.py:94] The quantization method 'moe_wna16' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.moe_wna16.MacaMoeWNA16Config'>.
WARNING 03-19 12:25:21 [__init__.py:94] The quantization method 'compressed-tensors' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.compressed_tensors.MacaCompressedTensorsConfig'>.
(APIServer pid=31) INFO 03-19 12:25:21 [api_server.py:1272] vLLM API server version 0.14.0
(APIServer pid=31) INFO 03-19 12:25:21 [utils.py:263] non-default args: {'model_tag': '/models/Qwen3.5-35B-A3B', 'port': 8889, 'api_key': ['Dzdwd@85416'], 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'model': '/models/Qwen3.5-35B-A3B', 'trust_remote_code': True, 'max_model_len': 104800, 'served_model_name': ['DeepSeek-R1-32B'], 'tensor_parallel_size': 2, 'swap_space': 32.0, 'enable_prefix_caching': True, 'max_num_batched_tokens': 104800}
(APIServer pid=31) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=31) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=31) Traceback (most recent call last):
(APIServer pid=31) File "/opt/conda/bin/vllm", line 8, in <module>
(APIServer pid=31) sys.exit(main())
(APIServer pid=31) ^^^^^^
(APIServer pid=31) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=31) args.dispatch_function(args)
(APIServer pid=31) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 60, in cmd
(APIServer pid=31) uvloop.run(run_server(args))
(APIServer pid=31) File "/opt/conda/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=31) return __asyncio.run(
(APIServer pid=31) ^^^^^^^^^^^^^^
(APIServer pid=31) File "/opt/conda/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=31) return runner.run(main)
(APIServer pid=31) ^^^^^^^^^^^^^^^^
(APIServer pid=31) File "/opt/conda/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=31) return self._loop.run_until_complete(task)
(APIServer pid=31) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=31) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=31) File "/opt/conda/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=31) return await main
(APIServer pid=31) ^^^^^^^^^^
(APIServer pid=31) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1319, in run_server
(APIServer pid=31) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=31) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1338, in run_server_worker
(APIServer pid=31) async with build_async_engine_client(
(APIServer pid=31) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=31) File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=31) return await anext(self.gen)
(APIServer pid=31) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=31) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 173, in build_async_engine_client
(APIServer pid=31) async with build_async_engine_client_from_engine_args(
(APIServer pid=31) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=31) File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=31) return await anext(self.gen)
(APIServer pid=31) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=31) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 199, in build_async_engine_client_from_engine_args
(APIServer pid=31) vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=31) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=31) File "/opt/conda/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1369, in create_engine_config
(APIServer pid=31) model_config = self.create_model_config()
(APIServer pid=31) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=31) File "/opt/conda/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1223, in create_model_config
(APIServer pid=31) return ModelConfig(
(APIServer pid=31) ^^^^^^^^^^^^
(APIServer pid=31) File "/opt/conda/lib/python3.12/site-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
(APIServer pid=31) s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=31) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=31) Value error, The checkpoint you are trying to load has model type `qwen3_5_moe` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
(APIServer pid=31)
(APIServer pid=31) You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git` [type=value_error, input_value=ArgsKwargs((), {'model': ...rocessor_plugin': None}), input_type=ArgsKwargs]
(APIServer pid=31) For further information visit https://errors.pydantic.dev/2.12/v/value_error
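The traceback bottoms out in a `ModelConfig` validation error: the checkpoint declares `model_type: qwen3_5_moe`, which the installed Transformers release does not recognize. Before upgrading anything, it may help to confirm this directly. A minimal check (run in the same conda environment that `vllm` uses; `qwen3_5_moe` is the model type string taken from the error above):

```python
# Check whether the installed Transformers release registers the
# `qwen3_5_moe` model type in its auto-config mapping.
import transformers
from transformers.models.auto.configuration_auto import CONFIG_MAPPING_NAMES

print("transformers version:", transformers.__version__)
print("qwen3_5_moe registered:", "qwen3_5_moe" in CONFIG_MAPPING_NAMES)
```

If this prints `False`, the fix is what the error message suggests: `pip install --upgrade transformers`, or, if no released version supports the model yet, install from source with `pip install git+https://github.com/huggingface/transformers.git`. Note that on a plugin platform like vllm_metax, a newer Transformers may also need a matching vLLM build — the minimum compatible versions are not known from this log alone.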