MetaX-Tech Developer Forum · Forum home
  • MetaX Developers

xiaoo

  • Members
  • Joined January 30, 2026
  • Posts
  • Threads
  • Followers
  • Follows
  • Details

xiaoo has posted 18 messages.

  • See post
    xiaoo
    Members
    Qwen3.5-35B-A3B-FP8 throws an error when running. Is it not supported yet? [Solved] March 19, 2026 12:28

    I tried the unquantized version; it still fails with the same error.

    root@dzdwd-server:/workspace# CUDA_VISIBLE_DEVICES=0,1 \
    nohup vllm serve /models/Qwen3.5-35B-A3B -tp 2 \
    --port 8889 \
      --trust-remote-code \
      --dtype auto \
      --max-model-len 104800 \
      --max-num-batched-tokens 104800 \
      --swap-space 32 \
      --gpu-memory-utilization 0.90 \
      --enable-prefix-caching \
      --served-model-name DeepSeek-R1-32B \
      --enable-auto-tool-choice \
    --api-key Dzdwd@85416 \
      --tool-call-parser hermes > vllm_serve.log 2>&1 &
    [1] 31
    root@dzdwd-server:/workspace# tail -f vllm_serve.log
    nohup: ignoring input
    INFO 03-19 12:24:38 [__init__.py:43] Available plugins for group vllm.platform_plugins:
    INFO 03-19 12:24:38 [__init__.py:45] - metax -> vllm_metax:register
    INFO 03-19 12:24:38 [__init__.py:48] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
    INFO 03-19 12:24:38 [__init__.py:217] Platform plugin metax is activated
    INFO 03-19 12:24:38 [envs.py:83] Plugin sets VLLM_USE_FLASHINFER_SAMPLER to False. Reason: flashinfer sampler are not supported on maca
    [2026-03-19 12:24:51] INFO config.py:27: Updated vLLM file registry _CLASS_TO_MODULE
    [2026-03-19 12:24:51] INFO config.py:36: Updated vLLM internal model_type registry
    INFO Print the version information of mcoplib during compilation.
    
    Version info:Mcoplib_Version = '0.4.0'
    Build_Maca_Version = '3.5.3.18'
    GIT_BRANCH = 'HEAD'
    GIT_COMMIT = '3cd1a1a'
    Vllm Op Version = 0.14.0
    SGlang Op Version  = 0.5.7 && 0.5.8 
    
    INFO Staring Check the current MACA version of the operating environment.
    
    INFO: Release major.minor matching,  successful:3.5. 
    
    Successfully added Qwen3ASRForConditionalGeneration to _MULTIMODAL_MODELS
    WARNING 03-19 12:25:21 [registry.py:801] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_mtp:DeepSeekMTP.
    WARNING 03-19 12:25:21 [registry.py:801] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV2ForCausalLM.
    WARNING 03-19 12:25:21 [registry.py:801] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV3ForCausalLM.
    WARNING 03-19 12:25:21 [registry.py:801] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV3ForCausalLM.
    WARNING 03-19 12:25:21 [__init__.py:94] The quantization method 'awq' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.awq.MacaAWQConfig'>.
    WARNING 03-19 12:25:21 [__init__.py:94] The quantization method 'awq_marlin' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.awq_marlin.MacaAWQMarlinConfig'>.
    WARNING 03-19 12:25:21 [__init__.py:94] The quantization method 'gptq' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.gptq.MacaGPTQConfig'>.
    WARNING 03-19 12:25:21 [__init__.py:94] The quantization method 'gptq_marlin' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.gptq_marlin.MacaGPTQMarlinConfig'>.
    WARNING 03-19 12:25:21 [__init__.py:94] The quantization method 'moe_wna16' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.moe_wna16.MacaMoeWNA16Config'>.
    WARNING 03-19 12:25:21 [__init__.py:94] The quantization method 'compressed-tensors' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.compressed_tensors.MacaCompressedTensorsConfig'>.
    (APIServer pid=31) INFO 03-19 12:25:21 [api_server.py:1272] vLLM API server version 0.14.0
    (APIServer pid=31) INFO 03-19 12:25:21 [utils.py:263] non-default args: {'model_tag': '/models/Qwen3.5-35B-A3B', 'port': 8889, 'api_key': ['Dzdwd@85416'], 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'model': '/models/Qwen3.5-35B-A3B', 'trust_remote_code': True, 'max_model_len': 104800, 'served_model_name': ['DeepSeek-R1-32B'], 'tensor_parallel_size': 2, 'swap_space': 32.0, 'enable_prefix_caching': True, 'max_num_batched_tokens': 104800}
    (APIServer pid=31) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
    (APIServer pid=31) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
    (APIServer pid=31) Traceback (most recent call last):
    (APIServer pid=31)   File "/opt/conda/bin/vllm", line 8, in <module>
    (APIServer pid=31)     sys.exit(main())
    (APIServer pid=31)              ^^^^^^
    (APIServer pid=31)   File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
    (APIServer pid=31)     args.dispatch_function(args)
    (APIServer pid=31)   File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 60, in cmd
    (APIServer pid=31)     uvloop.run(run_server(args))
    (APIServer pid=31)   File "/opt/conda/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
    (APIServer pid=31)     return __asyncio.run(
    (APIServer pid=31)            ^^^^^^^^^^^^^^
    (APIServer pid=31)   File "/opt/conda/lib/python3.12/asyncio/runners.py", line 195, in run
    (APIServer pid=31)     return runner.run(main)
    (APIServer pid=31)            ^^^^^^^^^^^^^^^^
    (APIServer pid=31)   File "/opt/conda/lib/python3.12/asyncio/runners.py", line 118, in run
    (APIServer pid=31)     return self._loop.run_until_complete(task)
    (APIServer pid=31)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=31)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
    (APIServer pid=31)   File "/opt/conda/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
    (APIServer pid=31)     return await main
    (APIServer pid=31)            ^^^^^^^^^^
    (APIServer pid=31)   File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1319, in run_server
    (APIServer pid=31)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
    (APIServer pid=31)   File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1338, in run_server_worker
    (APIServer pid=31)     async with build_async_engine_client(
    (APIServer pid=31)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=31)   File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
    (APIServer pid=31)     return await anext(self.gen)
    (APIServer pid=31)            ^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=31)   File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 173, in build_async_engine_client
    (APIServer pid=31)     async with build_async_engine_client_from_engine_args(
    (APIServer pid=31)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=31)   File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
    (APIServer pid=31)     return await anext(self.gen)
    (APIServer pid=31)            ^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=31)   File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 199, in build_async_engine_client_from_engine_args
    (APIServer pid=31)     vllm_config = engine_args.create_engine_config(usage_context=usage_context)
    (APIServer pid=31)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=31)   File "/opt/conda/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1369, in create_engine_config
    (APIServer pid=31)     model_config = self.create_model_config()
    (APIServer pid=31)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=31)   File "/opt/conda/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1223, in create_model_config
    (APIServer pid=31)     return ModelConfig(
    (APIServer pid=31)            ^^^^^^^^^^^^
    (APIServer pid=31)   File "/opt/conda/lib/python3.12/site-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
    (APIServer pid=31)     s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
    (APIServer pid=31) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
    (APIServer pid=31)   Value error, The checkpoint you are trying to load has model type `qwen3_5_moe` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
    (APIServer pid=31) 
    (APIServer pid=31) You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git` [type=value_error, input_value=ArgsKwargs((), {'model': ...rocessor_plugin': None}), input_type=ArgsKwargs]
    (APIServer pid=31)     For further information visit https://errors.pydantic.dev/2.12/v/value_error
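    The ValidationError above says the installed transformers release does not recognize the `qwen3_5_moe` model type. A minimal sketch of that check (making no assumption about whether any released transformers version supports this type yet):

```python
# Sketch: ask the installed transformers (if present) whether it recognizes
# a given model_type string. CONFIG_MAPPING is transformers' registry of
# model_type -> config class; "qwen3_5_moe" is the type from the error above.
def supports_model_type(model_type: str) -> bool:
    try:
        from transformers.models.auto.configuration_auto import CONFIG_MAPPING
    except ImportError:
        return False  # transformers is not installed at all
    return model_type in CONFIG_MAPPING

print(supports_model_type("llama"))        # an old, widely supported type
print(supports_model_type("qwen3_5_moe"))  # the checkpoint's declared type
```

    If this prints False for the new type, the next step is what the error message itself suggests: upgrade transformers, or install it from source if the checkpoint is newer than the latest release.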
    
  • See post
    xiaoo
    Members
    Running inference, embedding, and reranking models at the same time gets stuck on GPU memory allocation [Solved] March 19, 2026 12:15

    Yeah, that's right, emmm.

  • See post
    xiaoo
    Members
    Running inference, embedding, and reranking models at the same time gets stuck on GPU memory allocation [Solved] March 18, 2026 18:55

    That's right.

    Use the following command as a reference to start the model:

    docker run -itd \
    --restart always \
    --device=/dev/mxcd \
    --device=/dev/sgpu000 \
    --device=/dev/sgpu001 \
    --device=/dev/dri/renderD128 \
    --device=/dev/dri/renderD129 \
    --group-add video \
    --network=host \
    --name llm \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --shm-size 110gb \
    --ulimit memlock=-1 \
    -v /models:/models \
    cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.13.0-maca.ai3.3.0.303-torch2.8-py310-ubuntu22.04-amd64 /bin/bash
    
  • See post
    xiaoo
    Members
    Qwen3.5-35B-A3B-FP8 throws an error when running. Is it not supported yet? [Solved] March 18, 2026 12:51

    So only w8a8 is supported, right?

  • See post
    xiaoo
    Members
    Qwen3.5-35B-A3B-FP8 throws an error when running. Is it not supported yet? [Solved] March 18, 2026 11:16

    Docker image: cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.14.0-maca.ai3.5.3.102-torch2.8-py312-ubuntu22.04-amd64
    GPUs: two N260 cards

    mx-smi version: 2.2.9

    =================== MetaX System Management Interface Log ===================
    Timestamp                                         : Wed Mar 18 09:39:39 2026

    Attached GPUs                                     : 2
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.9                       Kernel Mode Driver Version: 3.4.4            |
    | MACA Version: 3.3.0.15             BIOS Version: 1.29.1.0                       |
    |------------------+-----------------+---------------------+----------------------|
    | Board       Name | GPU   Persist-M | Bus-id              | GPU-Util      sGPU-M |
    | Pwr:Usage/Cap    | Temp       Perf | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0     MetaX N260 | 0           Off | 0000:41:00.0        | 0%           Enabled |
    | 52W / 225W       | 44C          P9 | 6645/65536 MiB      | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 1     MetaX N260 | 1           Off | 0000:c1:00.0        | 0%           Enabled |
    | 47W / 225W       | 41C          P9 | 6619/65536 MiB      | Available            |
    +------------------+-----------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Sliced GPU                                                                      |
    |------------------------------------+---------------------+----------------------|
    | Minor  GPU  sGPU-Id  Compute       | Vram Quota          | sGPU-Util            |
    |====================================+=====================+======================|
    | 000    0    0        5%            | 0/55296 MiB         | 0%                   |
    +------------------------------------+---------------------+----------------------+
    | 001    1    0        5%            | 0/55296 MiB         | 0%                   |
    +------------------------------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    |  GPU                    PID         Process Name                 GPU Memory     |
    |                                                                  Usage(MiB)     |
    |=================================================================================|
    |  no process found                                                               |
    +---------------------------------------------------------------------------------+

    Launch command:

    root@dzdwd-server:/workspace# CUDA_VISIBLE_DEVICES=0,1 \
    nohup vllm serve /models/Qwen3.5-35B-A3B-FP8 -tp 2 \
    --port 8889 \
      --trust-remote-code \
      --dtype auto \
      --max-model-len 204800 \
      --max-num-batched-tokens 204800 \
      --max-num-seqs 7 \
      --swap-space 32 \
      --gpu-memory-utilization 0.92 \
      --enable-prefix-caching \
      --served-model-name Qwen3.5-35B-A3B-FP8 \
      --enable-auto-tool-choice \
    --api-key Dzdwd@85416 \
      --tool-call-parser hermes > vllm_serve.log 2>&1 &
    [1] 56618
    

    Error output:

    nohup: ignoring input
    INFO 03-18 09:45:45 [__init__.py:43] Available plugins for group vllm.platform_plugins:
    INFO 03-18 09:45:45 [__init__.py:45] - metax -> vllm_metax:register
    INFO 03-18 09:45:45 [__init__.py:48] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
    INFO 03-18 09:45:45 [__init__.py:217] Platform plugin metax is activated
    INFO 03-18 09:45:45 [envs.py:89] Note!: set VLLM_USE_FLASHINFER_SAMPLER to False. Reason: flashinfer sampler are not supported on maca
    INFO 03-18 09:45:45 [envs.py:89] Note!: set VLLM_USE_TRTLLM_ATTENTION to False. Reason: trtllm interfaces are not supported
    INFO 03-18 09:45:45 [envs.py:89] Note!: set VLLM_DISABLE_FLASHINFER_PREFILL to True. Reason: disable flashinfer prefill(use flash_attn prefill) on maca
    INFO 03-18 09:45:45 [envs.py:89] Note!: set VLLM_USE_CUDNN_PREFILL to False. Reason: cudnn prefill interfaces are not supported
    INFO 03-18 09:45:45 [envs.py:89] Note!: set VLLM_USE_TRTLLM_RAGGED_DEEPSEEK_PREFILL to False. Reason: trtllm interfaces are not supported
    INFO 03-18 09:45:45 [envs.py:89] Note!: set VLLM_DISABLE_SHARED_EXPERTS_STREAM to True. Reason: shared expert stream may cause hang
    /opt/conda/lib/python3.10/site-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
      warnings.warn(_BETA_TRANSFORMS_WARNING)
    /opt/conda/lib/python3.10/site-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
      warnings.warn(_BETA_TRANSFORMS_WARNING)
    INFO Print the version information of mcoplib during compilation.
    
    Version info:Mcoplib_Version = '0.3.1'
    Build_Maca_Version = '3.3.0.15'
    GIT_BRANCH = 'HEAD'
    GIT_COMMIT = '836541d'
    Vllm Op Version = 0.13.0
    SGlang Op Version  = 0.5.7 
    
    INFO Staring Check the current MACA version of the operating environment.
    
    INFO: Release major.minor matching,  successful:3.3. 
    
    INFO 03-18 09:45:55 [fa_utils.py:15] Using Maca version of flash attention, which only supports version 2.
    WARNING 03-18 09:46:05 [registry.py:774] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_mtp:DeepSeekMTP.
    WARNING 03-18 09:46:05 [registry.py:774] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV2ForCausalLM.
    WARNING 03-18 09:46:05 [registry.py:774] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV3ForCausalLM.
    WARNING 03-18 09:46:05 [registry.py:774] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:DeepseekV3ForCausalLM.
    WARNING 03-18 09:46:05 [__init__.py:78] The quantization method 'awq' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.awq.MacaAWQConfig'>.
    WARNING 03-18 09:46:05 [__init__.py:78] The quantization method 'awq_marlin' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.awq_marlin.MacaAWQMarlinConfig'>.
    WARNING 03-18 09:46:05 [__init__.py:78] The quantization method 'gptq' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.gptq.MacaGPTQConfig'>.
    WARNING 03-18 09:46:05 [__init__.py:78] The quantization method 'gptq_marlin' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.gptq_marlin.MacaGPTQMarlinConfig'>.
    WARNING 03-18 09:46:05 [__init__.py:78] The quantization method 'moe_wna16' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.moe_wna16.MacaMoeWNA16Config'>.
    WARNING 03-18 09:46:05 [__init__.py:78] The quantization method 'compressed-tensors' already exists and will be overwritten by the quantization config <class 'vllm_metax.quant_config.compressed_tensors.MacaCompressedTensorsConfig'>.
    WARNING 03-18 09:46:06 [attention.py:82] Using VLLM_USE_CUDNN_PREFILL environment variable is deprecated and will be removed in v0.14.0 or v1.0.0, whichever is soonest. Please use --attention-config.use_cudnn_prefill command line argument or AttentionConfig(use_cudnn_prefill=...) config field instead.
    WARNING 03-18 09:46:06 [attention.py:82] Using VLLM_USE_TRTLLM_RAGGED_DEEPSEEK_PREFILL environment variable is deprecated and will be removed in v0.14.0 or v1.0.0, whichever is soonest. Please use --attention-config.use_trtllm_ragged_deepseek_prefill command line argument or AttentionConfig(use_trtllm_ragged_deepseek_prefill=...) config field instead.
    WARNING 03-18 09:46:06 [attention.py:82] Using VLLM_USE_TRTLLM_ATTENTION environment variable is deprecated and will be removed in v0.14.0 or v1.0.0, whichever is soonest. Please use --attention-config.use_trtllm_attention command line argument or AttentionConfig(use_trtllm_attention=...) config field instead.
    WARNING 03-18 09:46:06 [attention.py:82] Using VLLM_DISABLE_FLASHINFER_PREFILL environment variable is deprecated and will be removed in v0.14.0 or v1.0.0, whichever is soonest. Please use --attention-config.disable_flashinfer_prefill command line argument or AttentionConfig(disable_flashinfer_prefill=...) config field instead.
    (APIServer pid=56618) INFO 03-18 09:46:06 [api_server.py:1351] vLLM API server version 0.13.0
    (APIServer pid=56618) INFO 03-18 09:46:06 [utils.py:253] non-default args: {'model_tag': '/models/Qwen3.5-35B-A3B-FP8', 'port': 8889, 'api_key': ['Dzdwd@85416'], 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'model': '/models/Qwen3.5-35B-A3B-FP8', 'trust_remote_code': True, 'max_model_len': 204800, 'served_model_name': ['Qwen3.5-35B-A3B-FP8'], 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.92, 'swap_space': 32.0, 'enable_prefix_caching': True, 'max_num_batched_tokens': 204800, 'max_num_seqs': 7}
    (APIServer pid=56618) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
    (APIServer pid=56618) Traceback (most recent call last):
    (APIServer pid=56618)   File "/opt/conda/bin/vllm", line 8, in <module>
    (APIServer pid=56618)     sys.exit(main())
    (APIServer pid=56618)   File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
    (APIServer pid=56618)     args.dispatch_function(args)
    (APIServer pid=56618)   File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 60, in cmd
    (APIServer pid=56618)     uvloop.run(run_server(args))
    (APIServer pid=56618)   File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 69, in run
    (APIServer pid=56618)     return loop.run_until_complete(wrapper())
    (APIServer pid=56618)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
    (APIServer pid=56618)   File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 48, in wrapper
    (APIServer pid=56618)     return await main
    (APIServer pid=56618)   File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1398, in run_server
    (APIServer pid=56618)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
    (APIServer pid=56618)   File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1417, in run_server_worker
    (APIServer pid=56618)     async with build_async_engine_client(
    (APIServer pid=56618)   File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
    (APIServer pid=56618)     return await anext(self.gen)
    (APIServer pid=56618)   File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 172, in build_async_engine_client
    (APIServer pid=56618)     async with build_async_engine_client_from_engine_args(
    (APIServer pid=56618)   File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
    (APIServer pid=56618)     return await anext(self.gen)
    (APIServer pid=56618)   File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 198, in build_async_engine_client_from_engine_args
    (APIServer pid=56618)     vllm_config = engine_args.create_engine_config(usage_context=usage_context)
    (APIServer pid=56618)   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 1332, in create_engine_config
    (APIServer pid=56618)     model_config = self.create_model_config()
    (APIServer pid=56618)   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 1189, in create_model_config
    (APIServer pid=56618)     return ModelConfig(
    (APIServer pid=56618)   File "/opt/conda/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
    (APIServer pid=56618)     s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
    (APIServer pid=56618) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
    (APIServer pid=56618)   Value error, The checkpoint you are trying to load has model type `qwen3_5_moe` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
    (APIServer pid=56618) 
    (APIServer pid=56618) You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git` [type=value_error, input_value=ArgsKwargs((), {'model': ...rocessor_plugin': None}), input_type=ArgsKwargs]
    (APIServer pid=56618)     For further information visit https://errors.pydantic.dev/2.12/v/value_error
    
  • See post
    xiaoo
    Members
    Running inference, embedding, and reranking models at the same time gets stuck on GPU memory allocation [Solved] March 18, 2026 10:33

    Solved: you need to use GPU partitioning (sGPU).

  • See post
    xiaoo
    Members
    Running inference, embedding, and reranking models at the same time gets stuck on GPU memory allocation [Solved] March 18, 2026 10:28

    Solved: you need to use GPU partitioning (sGPU).

  • See post
    xiaoo
    Members
    Running inference, embedding, and reranking models at the same time gets stuck on GPU memory allocation [Solved] February 3, 2026 12:00

    What card do you have, and which models are you deploying? I can only run one at a time.

  • See post
    xiaoo
    Members
    Running inference, embedding, and reranking models at the same time gets stuck on GPU memory allocation [Solved] February 3, 2026 11:59

    Hi, I've tried the qwen2.5 7B model and it still hangs.

    (EngineCore_DP0 pid=34195) INFO 02-03 11:57:19 [parallel_state.py:1208] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.196.210.3:35845 backend=nccl
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    (EngineCore_DP0 pid=34195) INFO 02-03 11:57:19 [parallel_state.py:1394] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
    
    dzdwd@dzdwd-server:~$ mx-smi
    mx-smi  version: 2.2.9
    
    =================== MetaX System Management Interface Log ===================
    Timestamp                                         : Tue Feb  3 11:59:24 2026
    
    Attached GPUs                                     : 2
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.9                       Kernel Mode Driver Version: 3.4.4            |
    | MACA Version: 3.3.0.15             BIOS Version: 1.29.1.0                       |
    |------------------+-----------------+---------------------+----------------------|
    | Board       Name | GPU   Persist-M | Bus-id              | GPU-Util      sGPU-M |
    | Pwr:Usage/Cap    | Temp       Perf | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0     MetaX N260 | 0           Off | 0000:41:00.0        | 0%          Disabled |
    | 51W / 225W       | 43C          P9 | 22079/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 1     MetaX N260 | 1           Off | 0000:c1:00.0        | 0%          Disabled |
    | 47W / 225W       | 40C          P9 | 22063/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    
    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    |  GPU                    PID         Process Name                 GPU Memory     |
    |                                                                  Usage(MiB)     |
    |=================================================================================|
    |  0                  1384499         VLLM::Worker_TP              21394          |
    |  0                  1400062         VLLM::EngineCor              16             |
    |  1                  1384500         VLLM::Worker_TP              21394          |
    +---------------------------------------------------------------------------------+
    
  • See post
    xiaoo
    Members
    Running inference, embedding, and reranking models at the same time gets stuck on GPU memory allocation [Solved] February 3, 2026 09:52

    Uh, two N260 cards failing to run a 7B model seems quite unreasonable.

  • See post
    xiaoo
    Members
    Running inference, embedding, and reranking models at the same time gets stuck on GPU memory allocation [Solved] February 3, 2026 09:19

    Hi, today I tried Qwen 30B. It still hangs.

    vllm serve /models/Qwen3-VL-30B-A3B-Instruct \
      --port 8889 \
      -tp 2 \
      --max-model-len 2000 \
      --gpu-memory-utilization 0.6 \
      --api-key Dzdwd@85416 \
      --max-num-seqs 30 \
      --served-model-name Qwen3-VL-30B-A3B-Instruct
    
    dzdwd@dzdwd-server:~$ mx-smi
    mx-smi  version: 2.2.9
    
    =================== MetaX System Management Interface Log ===================
    Timestamp                                         : Tue Feb  3 09:17:16 2026
    
    Attached GPUs                                     : 2
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.9                       Kernel Mode Driver Version: 3.4.4            |
    | MACA Version: 3.3.0.15             BIOS Version: 1.29.1.0                       |
    |------------------+-----------------+---------------------+----------------------|
    | Board       Name | GPU   Persist-M | Bus-id              | GPU-Util      sGPU-M |
    | Pwr:Usage/Cap    | Temp       Perf | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0     MetaX N260 | 0           Off | 0000:41:00.0        | 0%          Disabled |
    | 52W / 225W       | 43C          P9 | 38897/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 1     MetaX N260 | 1           Off | 0000:c1:00.0        | 0%          Disabled |
    | 47W / 225W       | 40C          P9 | 38881/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    
    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    |  GPU                    PID         Process Name                 GPU Memory     |
    |                                                                  Usage(MiB)     |
    |=================================================================================|
    |  0                   619946         VLLM::Worker_TP              38212          |
    |  0                   665500         VLLM::EngineCor              16             |
    |  1                   619947         VLLM::Worker_TP              38212          |
    +---------------------------------------------------------------------------------+
    

    It hangs at this point:

    (EngineCore_DP0 pid=28524) INFO 02-03 09:05:54 [parallel_state.py:1208] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.196.210.3:42619 backend=nccl
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    (EngineCore_DP0 pid=28524) INFO 02-03 09:05:54 [parallel_state.py:1394] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
    
  • See post
    xiaoo
    Members
    Running inference, embedding, and reranking models simultaneously gets stuck on GPU memory allocation (Solved) Feb 2, 2026 18:19

    Still getting the error.

  • See post
    xiaoo
    Members
    Running inference, embedding, and reranking models simultaneously gets stuck on GPU memory allocation (Solved) Feb 2, 2026 17:25
    nohup vllm serve /models/Qwen3-Next-80B-A3B-Instruct.w8a8 \
      --port 8889 \
      -tp 2 \
      --enforce-eager \
      --max-model-len 512 \
      --gpu-memory-utilization 0.6 \
      --api-key Dzdwd@85416 \
      --max-num-seqs 2 \
      --served-model-name Qwen3-Next-80B-A3B-Instruct.w8a8 > vllm-80b.log 2>&1 &
    

    With max-model-len 512 tokens, --gpu-memory-utilization 0.6 fails to start.

    With max-model-len 512 tokens, --gpu-memory-utilization 0.7 does start, but launching the embedding model afterwards still fails with the same error.
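
    For what it's worth, the per-GPU budget arithmetic behind these fractions can be sketched roughly (my assumptions: 65536 MiB per N260 as mx-smi reports, and that each vLLM server pre-allocates about utilization × total memory on every visible GPU):

    ```shell
    # Rough per-GPU budget check (assumed 65536 MiB cards; fractions from the commands above)
    total=65536
    llm=$(( total * 70 / 100 ))   # LLM server at --gpu-memory-utilization 0.7
    emb=$(( total * 10 / 100 ))   # embedding server at --gpu-memory-utilization 0.1
    free=$(( total - llm - emb ))
    echo "LLM=${llm} MiB, embed=${emb} MiB, left=${free} MiB"
    ```

    On those assumptions the two fractions sum to 0.8, so they should fit; the hang would then be something other than simple over-commitment.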

  • See post
    xiaoo
    Members
    Running inference, embedding, and reranking models simultaneously gets stuck on GPU memory allocation (Solved) Feb 2, 2026 16:51

    I've already lowered it to 2000. Should I keep going?

    nohup vllm serve /models/Qwen3-Next-80B-A3B-Instruct.w8a8 \
      --port 8889 \
      -tp 2 \
      --enforce-eager \
      --max-model-len 2000 \
      --gpu-memory-utilization 0.7 \
      --api-key Dzdwd@85416 \
      --max-num-seqs 10 \
      --served-model-name Qwen3-Next-80B-A3B-Instruct.w8a8 > vllm-80b.log 2>&1 &
    
  • See post
    xiaoo
    Members
    Running inference, embedding, and reranking models simultaneously gets stuck on GPU memory allocation (Solved) Feb 2, 2026 16:48

    Tried it; no luck. Whichever I start first, the embedding model or the LLM, comes up fine, but launching the second one always errors out.

    INFO 02-02 16:46:35 [parallel_state.py:1208] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:43627 backend=nccl
    [Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
    [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
    [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
    [Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
    INFO 02-02 16:46:35 [mccl.py:28] Found nccl from library libmccl.so
    INFO 02-02 16:46:35 [mccl.py:28] Found nccl from library libmccl.so
    INFO 02-02 16:46:35 [pynccl.py:111] vLLM is using nccl==2.16.5
    [16:46:46.087][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:879 type:21. Retrying.
    [16:46:46.087][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:41742 type:21. Retrying.
    [16:46:56.327][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:879 type:21. Retrying.
    [16:46:56.327][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:41742 type:21. Retrying.
    [16:47:06.567][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:879 type:21. Retrying.
    
  • See post
    xiaoo
    Members
    Running inference, embedding, and reranking models simultaneously gets stuck on GPU memory allocation (Solved) Feb 2, 2026 16:41

    I've tried lowering the launch parameters; the large model only runs with --gpu-memory-utilization at 0.7 or above, and launching the embedding model still fails with the same error. Has Qwen3-Next-80B-A3B-Instruct.w8a8 already maxed out the cards? From what I can see, each GPU still has plenty of memory free, 10-plus GB.

  • See post
    xiaoo
    Members
    Running inference, embedding, and reranking models simultaneously gets stuck on GPU memory allocation (Solved) Jan 30, 2026 13:00

    Hardware:
    Lenovo SR658H, 512 GB RAM, GPUs: 2x MetaX N260

    Problem: either model runs fine on its own, but starting the second one fails to allocate GPU memory and hangs.

    Image version:
    cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.11.2-maca.ai3.3.0.103-torch2.8-py312-ubuntu22.04-amd64

    Docker command:

    docker run -itd \
      --restart always \
      --privileged \
      --device=/dev/dri \
      --device=/dev/mxcd \
      --group-add video \
      --network=host \
      --name Qwen3-Next-80B-A3B-Instruct.w8a8 \
      --security-opt seccomp=unconfined \
      --security-opt apparmor=unconfined \
      --shm-size 100gb \
      --ulimit memlock=-1 \
      -v /models:/models \
      cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.11.2-maca.ai3.3.0.103-torch2.8-py312-ubuntu22.04-amd64 \
      /bin/bash
    

    Model launch command:

    VLLM_USE_V1=0 nohup vllm serve /models/Qwen3-Next-80B-A3B-Instruct.w8a8 \
      --port 8889 \
      -tp 2 \
      --enforce-eager \
      --max-model-len 15000 \
      --gpu-memory-utilization 0.7 \
      --api-key Dzdwd@85416 \
      --max-num-seqs 35 \
      --served-model-name Qwen3-Next-80B-A3B-Instruct.w8a8 > vllm-80b.log 2>&1 &
    

    Embedding model launch command:

    nohup vllm serve /models/qwen3-Embedding-0.6B \
      --port 8890 \
      --enforce-eager \
      --served-model-name qwen3-Embedding-0.6B \
      --max-model-len 1024 \
      --gpu-memory-utilization 0.1 \
      --trust-remote-code \
      --task embed \
      --api-key Dzdwd@85416 > vllm-emb.log 2>&1 &
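
    For reference, one variant I could try is pinning the small embedding server to a single device, so the two servers contend on only one GPU's memory pool. This is just a sketch: it assumes CUDA_VISIBLE_DEVICES is honored by the MetaX stack the same way it is upstream, and it reuses the paths and ports from the commands above.

    ```shell
    # Hypothetical variant: restrict the embedding server to GPU 1 only,
    # so GPU 0 keeps its remaining headroom for the tensor-parallel LLM rank.
    CUDA_VISIBLE_DEVICES=1 nohup vllm serve /models/qwen3-Embedding-0.6B \
      --port 8890 \
      --enforce-eager \
      --task embed \
      --trust-remote-code \
      --max-model-len 1024 \
      --gpu-memory-utilization 0.1 \
      --served-model-name qwen3-Embedding-0.6B > vllm-emb.log 2>&1 &
    ```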
    

    Problem: either one runs fine on its own, but starting the second one fails to allocate GPU memory and hangs:

    (EngineCore_DP0 pid=20179) INFO 01-30 12:54:48 [core.py:93] Initializing a V1 LLM engine (v0.11.2) with config: model='/models/qwen3-Embedding-0.6B', speculative_config=None, tokenizer='/models/qwen3-Embedding-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=qwen3-Embedding-0.6B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=PoolerConfig(pooling_type='LAST', normalize=True, dimensions=None, enable_chunked_processing=None, max_embed_len=None, softmax=None, activation=None, use_activation=None, logit_bias=None, step_tag_id=None, returned_token_ids=None), compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': None, 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 0, 'local_cache_dir': None}
    (EngineCore_DP0 pid=20179) INFO 01-30 12:54:48 [parallel_state.py:1208] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.196.210.3:40141 backend=nccl
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    (EngineCore_DP0 pid=20179) INFO 01-30 12:54:49 [parallel_state.py:1394] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
    

    mx-smi:

    mx-smi  version: 2.2.9
    
    =================== MetaX System Management Interface Log ===================
    Timestamp                                         : Fri Jan 30 12:59:17 2026
    
    Attached GPUs                                     : 2
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.9                       Kernel Mode Driver Version: 3.4.4            |
    | MACA Version: 3.3.0.15             BIOS Version: 1.29.1.0                       |
    |------------------+-----------------+---------------------+----------------------|
    | Board       Name | GPU   Persist-M | Bus-id              | GPU-Util      sGPU-M |
    | Pwr:Usage/Cap    | Temp       Perf | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0     MetaX N260 | 0           Off | 0000:41:00.0        | 0%          Disabled |
    | 52W / 225W       | 43C          P9 | 47883/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 1     MetaX N260 | 1           Off | 0000:c1:00.0        | 0%          Disabled |
    | 47W / 225W       | 40C          P9 | 47867/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    
    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    |  GPU                    PID         Process Name                 GPU Memory     |
    |                                                                  Usage(MiB)     |
    |=================================================================================|
    |  0                  2322349         VLLM::Worker_TP              47198          |
    |  0                  2343541         VLLM::EngineCor              16             |
    |  1                  2322350         VLLM::Worker_TP              47198          |
    +---------------------------------------------------------------------------------+
    

    How should I proceed here, or is something wrong with my parameters?
