Qwen3.5-397B-A17B-W8A8 速度优化问题

Members 10 posts

2026年4月17日 13:53 2026年4月17日 13:53

硬件环境

mx-smi  version: 2.2.12

=================== MetaX System Management Interface Log ===================
Timestamp                                         : Fri Apr 17 13:50:37 2026

Attached GPUs                                     : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.12                      Kernel Mode Driver Version: 3.3.12           |
| MACA Version: 3.5.3.20             BIOS Version: 1.22.3.0                       |
|------------------+-----------------+---------------------+----------------------|
| Board       Name | GPU   Persist-M | Bus-id              | GPU-Util      sGPU-M |
| Pwr:Usage/Cap    | Temp       Perf | Memory-Usage        | GPU-State            |
|==================+=================+=====================+======================|
| 0     MetaX C550 | 0           N/A | 0000:2a:00.0        | 0%          Disabled |
| NA / NA          | 36C         N/A | 60773/65536 MiB     | Available            |
+------------------+-----------------+---------------------+----------------------+
| 1     MetaX C550 | 1           N/A | 0000:3a:00.0        | 0%          Disabled |
| NA / NA          | 41C         N/A | 60773/65536 MiB     | Available            |
+------------------+-----------------+---------------------+----------------------+
| 2     MetaX C550 | 2           N/A | 0000:4c:00.0        | 0%          Disabled |
| NA / NA          | 43C         N/A | 60773/65536 MiB     | Available            |
+------------------+-----------------+---------------------+----------------------+
| 3     MetaX C550 | 3           N/A | 0000:5c:00.0        | 0%          Disabled |
| NA / NA          | 38C         N/A | 60771/65536 MiB     | Available            |
+------------------+-----------------+---------------------+----------------------+
| 4     MetaX C550 | 4           N/A | 0000:aa:00.0        | 0%          Disabled |
| NA / NA          | 39C         N/A | 60773/65536 MiB     | Available            |
+------------------+-----------------+---------------------+----------------------+
| 5     MetaX C550 | 5           N/A | 0000:ba:00.0        | 0%          Disabled |
| NA / NA          | 43C         N/A | 60771/65536 MiB     | Available            |
+------------------+-----------------+---------------------+----------------------+
| 6     MetaX C550 | 6           N/A | 0000:ca:00.0        | 0%          Disabled |
| NA / NA          | 43C         N/A | 60771/65536 MiB     | Available            |
+------------------+-----------------+---------------------+----------------------+
| 7     MetaX C550 | 7           N/A | 0000:da:00.0        | 0%          Disabled |
| NA / NA          | 37C         N/A | 60771/65536 MiB     | Available            |
+------------------+-----------------+---------------------+----------------------+

+---------------------------------------------------------------------------------+
| Process:                                                                        |
|  GPU                    PID         Process Name                 GPU Memory     |
|                                                                  Usage(MiB)     |
|=================================================================================|
|  0                      315         VLLM::Worker_TP              59902          |
|  1                      316         VLLM::Worker_TP              59902          |
|  2                      317         VLLM::Worker_TP              59902          |
|  3                      318         VLLM::Worker_TP              59900          |
|  4                      319         VLLM::Worker_TP              59902          |
|  5                      320         VLLM::Worker_TP              59900          |
|  6                      321         VLLM::Worker_TP              59900          |
|  7                      322         VLLM::Worker_TP              59900          |
+---------------------------------------------------------------------------------+

使用的docker镜像
vllm-metax:0.17.0-maca.ai3.5.3.307-torch2.8-py312-ubuntu22.04-amd64
使用的权重
Qwen3.5-397B-A17B-W8A8
由于兼容性问题关闭了 CUDA Graph 捕获 VLLM_USE_V1=0
升级了transformers到5.2.0
启动命令：

vllm serve /data/metax-tech/Qwen3.5-397B-A17B-W8A8 \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.88 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --served-model-name Qwen3.5-W8A8 \
  --trust-remote-code \
  --enforce-eager

现在整体速度很低约 7.9 tokens/s 。有没有那些参数可以进行加速和优化？

link

shuai_chen

Members 595 posts

2026年4月17日 13:56 2026年4月17日 13:56

link

尊敬的开发者您好，麻烦详细描述兼容性问题关闭了 CUDA Graph 捕获 VLLM_USE_V1=0的原因

link

OverS9982

Members 10 posts

2026年4月17日 13:59 2026年4月17日 13:59

link

不关闭的话，有报错，报错如下：

(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100] EngineCore failed to start.
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100] Traceback (most recent call last):
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]   File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]     super().__init__(
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 120, in __init__
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]   File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 279, in _initialize_kv_caches
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 118, in initialize_from_config
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]     compilation_times: list[float] = self.collective_rpc("compile_or_warm_up_model")
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]   File "/opt/conda/lib/python3.12/site-packages/vllm_metax/v1/executor/multiproc_executor.py", line 389, in collective_rpc
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]     return aggregate(get_response())
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]                      ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]   File "/opt/conda/lib/python3.12/site-packages/vllm_metax/v1/executor/multiproc_executor.py", line 372, in get_response
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]     raise RuntimeError(
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100] RuntimeError: Worker failed with error 'CUDA error: operation not permitted when stream is capturing
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100] ', please check the stack trace above for the root cause

link

shuai_chen

Members 595 posts

2026年4月17日 14:01 2026年4月17日 14:01

link

尊敬的开发者您好，请参考developer.metax-tech.com/forum/t/fa-tie-qian-bi-kan-jing-xiang-shi-yong-wen-ti-ti-wen-mo-ban/267/ 详细描述操作步骤和日志