• Members 10 posts
    April 17, 2026 13:53

    Hardware environment

    mx-smi  version: 2.2.12
    
    =================== MetaX System Management Interface Log ===================
    Timestamp                                         : Fri Apr 17 13:50:37 2026
    
    Attached GPUs                                     : 8
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.12                      Kernel Mode Driver Version: 3.3.12           |
    | MACA Version: 3.5.3.20             BIOS Version: 1.22.3.0                       |
    |------------------+-----------------+---------------------+----------------------|
    | Board       Name | GPU   Persist-M | Bus-id              | GPU-Util      sGPU-M |
    | Pwr:Usage/Cap    | Temp       Perf | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0     MetaX C550 | 0           N/A | 0000:2a:00.0        | 0%          Disabled |
    | NA / NA          | 36C         N/A | 60773/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 1     MetaX C550 | 1           N/A | 0000:3a:00.0        | 0%          Disabled |
    | NA / NA          | 41C         N/A | 60773/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 2     MetaX C550 | 2           N/A | 0000:4c:00.0        | 0%          Disabled |
    | NA / NA          | 43C         N/A | 60773/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 3     MetaX C550 | 3           N/A | 0000:5c:00.0        | 0%          Disabled |
    | NA / NA          | 38C         N/A | 60771/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 4     MetaX C550 | 4           N/A | 0000:aa:00.0        | 0%          Disabled |
    | NA / NA          | 39C         N/A | 60773/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 5     MetaX C550 | 5           N/A | 0000:ba:00.0        | 0%          Disabled |
    | NA / NA          | 43C         N/A | 60771/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 6     MetaX C550 | 6           N/A | 0000:ca:00.0        | 0%          Disabled |
    | NA / NA          | 43C         N/A | 60771/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 7     MetaX C550 | 7           N/A | 0000:da:00.0        | 0%          Disabled |
    | NA / NA          | 37C         N/A | 60771/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    
    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    |  GPU                    PID         Process Name                 GPU Memory     |
    |                                                                  Usage(MiB)     |
    |=================================================================================|
    |  0                      315         VLLM::Worker_TP              59902          |
    |  1                      316         VLLM::Worker_TP              59902          |
    |  2                      317         VLLM::Worker_TP              59902          |
    |  3                      318         VLLM::Worker_TP              59900          |
    |  4                      319         VLLM::Worker_TP              59902          |
    |  5                      320         VLLM::Worker_TP              59900          |
    |  6                      321         VLLM::Worker_TP              59900          |
    |  7                      322         VLLM::Worker_TP              59900          |
    +---------------------------------------------------------------------------------+
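
To read the memory headroom off the table above without doing it by eye, here is a minimal Python sketch. It assumes the `used/total MiB` field keeps the layout shown in this mx-smi output; `memory_headroom` and `sample` are names made up for illustration, not part of any MetaX tool.

```python
import re

# Matches the "used/total MiB" field in an mx-smi status row.
MEM_RE = re.compile(r"(\d+)\s*/\s*(\d+)\s*MiB")


def memory_headroom(smi_text):
    """Return (used_mib, total_mib, free_mib) for each row with a memory field."""
    rows = []
    for line in smi_text.splitlines():
        match = MEM_RE.search(line)
        if match:
            used, total = int(match.group(1)), int(match.group(2))
            rows.append((used, total, total - used))
    return rows


# Two rows copied from the output above (GPU 0 and GPU 3).
sample = """\
| NA / NA          | 36C         N/A | 60773/65536 MiB     | Available            |
| NA / NA          | 38C         N/A | 60771/65536 MiB     | Available            |
"""

for used, total, free in memory_headroom(sample):
    print(f"{used}/{total} MiB used, {free} MiB free ({used / total:.1%})")
```

On the rows shown here, each card has a little under 4,800 MiB left unallocated.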
    

    Docker image used:
    vllm-metax:0.17.0-maca.ai3.5.3.307-torch2.8-py312-ubuntu22.04-amd64
    Model weights used:
    Qwen3.5-397B-A17B-W8A8
    Because of a compatibility issue, CUDA Graph capture was turned off (VLLM_USE_V1=0).
    transformers was upgraded to 5.2.0.
    Launch command:

    vllm serve /data/metax-tech/Qwen3.5-397B-A17B-W8A8 \
      --host 0.0.0.0 \
      --port 8000 \
      --tensor-parallel-size 8 \
      --gpu-memory-utilization 0.88 \
      --max-model-len 262144 \
      --reasoning-parser qwen3 \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder \
      --served-model-name Qwen3.5-W8A8 \
      --trust-remote-code \
      --enforce-eager
    

    Overall speed is currently very low, about 7.9 tokens/s. Are there any parameters that can be used to speed this up?
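
To make a figure like 7.9 tokens/s reproducible before and after changing parameters, one option is to time a single request against the server's OpenAI-compatible `/v1/completions` endpoint. The sketch below is only an illustration: the host, prompt, and `max_tokens` are assumptions, the model name and port are taken from the launch command above, and `measure` and `tokens_per_second` are hypothetical helper names. It times the whole request, so prefill is included and the result is a lower bound on pure decode speed.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def measure(base_url="http://localhost:8000", model="Qwen3.5-W8A8",
            prompt="Write a short story about a robot.", max_tokens=256):
    """Send one non-streaming completion request and return tokens/s.

    base_url, prompt, and max_tokens are placeholder assumptions.
    """
    payload = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    # The usage block reports how many tokens were actually generated.
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)

# Usage, with the server from the launch command running:
#   print(f"{measure():.1f} tokens/s")
```

Running the same measurement with a fixed prompt and `max_tokens` after each configuration change keeps the comparison like for like.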

  • Members 384 posts
    April 17, 2026 13:56

    Hello, could you please describe in detail the compatibility issue that led you to turn off CUDA Graph capture (VLLM_USE_V1=0)?

  • Members 10 posts
    April 17, 2026 13:59

    If I don't turn it off, startup fails with the following error:

    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100] EngineCore failed to start.
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100] Traceback (most recent call last):
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]   File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]     return func(*args, **kwargs)
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 834, in __init__
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]     super().__init__(
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 120, in __init__
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]   File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]     return func(*args, **kwargs)
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 279, in _initialize_kv_caches
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]     self.model_executor.initialize_from_config(kv_cache_configs)
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 118, in initialize_from_config
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]     compilation_times: list[float] = self.collective_rpc("compile_or_warm_up_model")
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]   File "/opt/conda/lib/python3.12/site-packages/vllm_metax/v1/executor/multiproc_executor.py", line 389, in collective_rpc
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]     return aggregate(get_response())
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]                      ^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]   File "/opt/conda/lib/python3.12/site-packages/vllm_metax/v1/executor/multiproc_executor.py", line 372, in get_response
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]     raise RuntimeError(
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100] RuntimeError: Worker failed with error 'CUDA error: operation not permitted when stream is capturing
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
    (EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100] ', please check the stack trace above for the root cause
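
Because every line of the traceback above carries the same process and timestamp prefix, the actual exception is easy to miss. A small sketch for stripping that prefix and pulling out the exception lines follows; the prefix pattern is inferred only from the log lines in this post, and `strip_prefix` and `find_exception_lines` are hypothetical helper names.

```python
import re

# Prefix as seen in this log, e.g.
# "(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100] "
PREFIX_RE = re.compile(
    r"^\s*\([^)]*\)\s+[A-Z]+\s+\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\s+\[[^\]]+\]\s?"
)
# A de-prefixed line that starts with "SomethingError:" or "SomethingException:".
EXCEPTION_RE = re.compile(r"^\w*(Error|Exception):")


def strip_prefix(line):
    """Remove the process/level/timestamp/source prefix if present."""
    return PREFIX_RE.sub("", line)


def find_exception_lines(log_text):
    """Return de-prefixed lines that look like 'SomeError: message'."""
    stripped = (strip_prefix(line) for line in log_text.splitlines())
    return [line for line in stripped if EXCEPTION_RE.match(line)]


# Two lines copied from the traceback above.
sample = (
    "(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100]     raise RuntimeError(\n"
    "(EngineCore_DP0 pid=3266) ERROR 04-17 11:46:53 [core.py:1100] RuntimeError: Worker failed "
    "with error 'CUDA error: operation not permitted when stream is capturing\n"
)

for line in find_exception_lines(sample):
    print(line)
```

Applied to this log, the only match is the `RuntimeError` line, which wraps a worker-side `CUDA error: operation not permitted when stream is capturing`. The worker's own stack trace, which the message says to check for the root cause, is not included in the excerpt.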