• Members 12 posts
    May 20, 2026, 11:30

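    Running Qwen3-32B inference on 4 曦云 C500 cards performs very poorly, even worse than running on 2 cards, yet the PCIe and P2P bandwidth tests show no problems. The commands used to start the service and run the benchmark are as follows: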

    vllm serve /home/models/Qwen3-32B --host 0.0.0.0 --port 8206 --block_size=16 --max_model_len=9120 --tensor-parallel-size 2 --gpu_memory_utilization=0.9 --no-enable-prefix-caching --async-scheduling
    
    vllm bench serve --backend vllm --model /home/models/Qwen3-32B --host 0.0.0.0 --port 8206 --dataset-name random --random-input-len 4096 --random-output-len 1024 --ignore-eos --request-rate 40 --num-prompts 40
    

    Hardware information for the 4 cards:
    =================== MetaX System Management Interface Log ===================
    Timestamp : Wed May 20 10:03:09 2026

    Attached GPUs : 4
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.12                          Kernel Mode Driver Version: 3.6.11       |
    | MACA Version: 3.5.3.18                 BIOS Version: 1.31.1.0                   |
    |------------------+-----------------+---------------------+----------------------|
    | Board Name       | GPU Persist-M   | Bus-id              | GPU-Util sGPU-M      |
    | Pwr:Usage/Cap    | Temp Perf       | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0 MetaX C500     | 0 Off           | 0000:81:00.0        | 0% Disabled          |
    | 43W / 350W       | 40C P0          | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 1 MetaX C500     | 1 Off           | 0000:a1:00.0        | 0% Disabled          |
    | 40W / 350W       | 37C P0          | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 2 MetaX C500     | 2 Off           | 0000:c1:00.0        | 0% Disabled          |
    | 43W / 350W       | 39C P0          | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 3 MetaX C500     | 3 Off           | 0000:e1:00.0        | 0% Disabled          |
    | 43W / 350W       | 38C P0          | 858/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    | GPU        PID         Process Name                          GPU Memory         |
    |                                                              Usage(MiB)         |
    |=================================================================================|
    | no process found                                                                |

    Performance with 4-card inference:
    ============ Serving Benchmark Result ============
    Successful requests: 40
    Request rate configured (RPS): 40.00
    Benchmark duration (s): 337.87
    Total input tokens: 163734
    Total generated tokens: 40960
    Request throughput (req/s): 0.12
    Output token throughput (tok/s): 121.23
    Peak output token throughput (tok/s): 320.00
    Peak concurrent requests: 40.00
    Total Token throughput (tok/s): 605.83
    ---------------Time to First Token----------------
    Mean TTFT (ms): 102917.51
    Median TTFT (ms): 102972.21
    P99 TTFT (ms): 196391.12
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 226.36
    Median TPOT (ms): 226.71
    P99 TPOT (ms): 315.54
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 226.36
    Median ITL (ms): 139.24
    P99 ITL (ms): 2468.52
    ==================================================
    Performance with 2-card inference:
    ============ Serving Benchmark Result ============
    Successful requests: 40
    Request rate configured (RPS): 40.00
    Benchmark duration (s): 124.06
    Total input tokens: 163734
    Total generated tokens: 40960
    Request throughput (req/s): 0.32
    Output token throughput (tok/s): 330.16
    Peak output token throughput (tok/s): 800.00
    Peak concurrent requests: 40.00
    Total Token throughput (tok/s): 1649.94
    ---------------Time to First Token----------------
    Mean TTFT (ms): 21047.51
    Median TTFT (ms): 21005.14
    P99 TTFT (ms): 40315.77
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 94.32
    Median TPOT (ms): 94.41
    P99 TPOT (ms): 109.96
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 94.32
    Median ITL (ms): 53.47
    P99 ITL (ms): 548.38
    ==================================================
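    To put the two runs side by side, the ratios between the 4-card and 2-card figures can be computed directly. This is a small sketch that uses only numbers copied from the two result blocks above; the dictionary keys and the `ratios` helper are names made up for illustration.

```python
# Figures copied from the 4-card and 2-card benchmark results above.
four_card = {
    "duration_s": 337.87,
    "output_tok_per_s": 121.23,
    "mean_ttft_ms": 102917.51,
    "mean_tpot_ms": 226.36,
}
two_card = {
    "duration_s": 124.06,
    "output_tok_per_s": 330.16,
    "mean_ttft_ms": 21047.51,
    "mean_tpot_ms": 94.32,
}


def ratios(a, b):
    """Return a[k] / b[k] for every metric k in a (hypothetical helper)."""
    return {k: a[k] / b[k] for k in a}


r = ratios(four_card, two_card)
for name, value in r.items():
    print(f"{name}: 4-card / 2-card = {value:.2f}")
```

    On these numbers the 4-card run delivers roughly 0.37x the output throughput of the 2-card run, with mean TTFT about 4.9x and mean TPOT about 2.4x as long.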

    PCIe bandwidth:
    DEVICE1  DEVICE2  TOPOLOGY  SIZE(B)     EFFECTIVE BANDWIDTH  RAW BANDWIDTH  TRANSMISSION DELAY (us)  DATA VALIDATION
    ──────────────────────────────────────────────────────────────────────────────────────────────────
    GPU#0    GPU#1    pcie      7516192768  101.34 GB/s          126.51 GB/s    69075.73                 PASS
    GPU#0    GPU#2    pcie      7516192768  101.33 GB/s          126.50 GB/s    69081.57                 PASS
    GPU#0    GPU#3    pcie      7516192768  101.34 GB/s          126.51 GB/s    69075.71                 PASS
    GPU#1    GPU#2    pcie      7516192768  101.33 GB/s          126.50 GB/s    69081.29                 PASS
    GPU#1    GPU#3    pcie      7516192768  101.34 GB/s          126.52 GB/s    69075.36                 PASS
    GPU#2    GPU#3    pcie      7516192768  101.33 GB/s          126.51 GB/s    69078.24                 PASS

    P2P bandwidth:
    DEV 1  DEV 2    SIZE(B)     EFFECTIVE BANDWIDTH  RAW BANDWIDTH  TRANSMISSION DELAY (us)  DATA VALIDATION
    ──────────────────────────────────────────────────────────────────────────────────────────────────
    CPU <> BOARD#0  7516192768  41.75 GB/s           52.12 GB/s     180041.45                PASS
    CPU <> BOARD#1  7516192768  41.92 GB/s           52.34 GB/s     179283.75                PASS
    CPU <> BOARD#2  7516192768  34.95 GB/s           43.64 GB/s     215041.24                PASS
    CPU <> BOARD#3  7516192768  34.96 GB/s           43.65 GB/s     214988.19                PASS
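    As a sanity check on how the columns in the two bandwidth tables relate, the first BANDWIDTH column (the one labeled EFFECTIVE) can be reproduced by dividing SIZE(B) by the delay. The sketch below does this for the first row of each table. The `bandwidth` helper and the choice of units are assumptions made for illustration, not documented behavior of the test tool.

```python
GIB = 2**30   # binary gigabyte
GB = 10**9    # decimal gigabyte


def bandwidth(size_bytes, delay_us, unit):
    """Bytes transferred divided by elapsed time, expressed in `unit` per second."""
    return size_bytes / (delay_us * 1e-6) / unit


SIZE = 7516192768  # SIZE(B) column, identical in every row

gpu_gpu = bandwidth(SIZE, 69075.73, GIB)     # GPU#0 / GPU#1 row
cpu_board = bandwidth(SIZE, 180041.45, GB)   # CPU <> BOARD#0 row

print(f"GPU#0-GPU#1:    {gpu_gpu:.2f}")
print(f"CPU <> BOARD#0: {cpu_board:.2f}")
```

    On the posted numbers, the GPU-to-GPU row matches its reported 101.34 when the result is expressed in units of 2^30 bytes per second, while the CPU-to-board row matches its reported 41.75 in units of 10^9 bytes per second. This is only an observation from the arithmetic, so the two tables may not be directly comparable.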

  • Thread has been moved from 产品&运维.

  • Members 458 posts
    May 20, 2026, 11:39

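    Dear developer, hello. We do not recommend 4-card inference for Qwen3-32B, because it wastes resources. We suggest starting one inference service per two cards to improve GPU utilization.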

  • Members 12 posts
    May 20, 2026, 11:48

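    I understand your point, but I would like to know why performance drops with 4-card inference. Moreover, when I run the same 4-card versus 2-card comparison on a different server, this problem does not occur.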

  • Members 458 posts
    May 20, 2026, 11:52

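    Dear developer, hello. Qwen3-32B is a dense model. With 4-card inference, the added communication latency and the low per-card utilization at TP4 cause 4-card performance to fall below 2-card performance.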
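    The trade-off described in this reply can be illustrated with a toy cost model: per-token decode time is taken as ideally scaled compute plus a fixed cost per all-reduce, with two all-reduces per transformer layer under tensor parallelism. Every input value below (single-card compute time, layer count, per-all-reduce latency) is a hypothetical placeholder and not a measurement of the C500 or of Qwen3-32B. Only the shape of the trade-off is the point.

```python
def decode_step_ms(tp, compute_ms_tp1, num_layers, allreduce_ms):
    """Toy per-token decode time: ideal compute scaling plus all-reduce cost.

    Assumes two all-reduces per transformer layer whenever tp > 1.
    """
    compute = compute_ms_tp1 / tp
    comm = 0.0 if tp == 1 else 2 * num_layers * allreduce_ms
    return compute + comm


# Hypothetical inputs, NOT measurements of the C500 or of Qwen3-32B.
COMPUTE_MS_TP1 = 80.0             # assumed single-card compute time per token
NUM_LAYERS = 64                   # assumed layer count
ALLREDUCE_MS = {2: 0.1, 4: 0.5}   # assumed per-all-reduce latency by TP degree

for tp in (2, 4):
    t = decode_step_ms(tp, COMPUTE_MS_TP1, NUM_LAYERS, ALLREDUCE_MS[tp])
    print(f"tp={tp}: {t:.1f} ms per token")
```

    With these placeholder values TP4 comes out slower than TP2, because the per-all-reduce latency is assumed to rise with the number of cards. With a lower latency assumption the ordering flips, which is why the same comparison could behave differently on a server with a different interconnect.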