MetaX-Tech Developer Forum

ruanding

  • Members
  • Joined December 18, 2025

ruanding has started 5 threads.

  • See post
    ruanding
    Members
    Performance problem running Qwen3-32B inference on 4 Xiyun (曦云) C500 cards · In progress · May 20, 2026 11:30

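    Running Qwen3-32B inference on four Xiyun C500 cards performs very poorly, even worse than running it on two cards, although the PCIe and P2P bandwidth tests show no problems. The commands used to start the service and run the benchmark are as follows: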

    vllm serve /home/models/Qwen3-32B --host 0.0.0.0 --port 8206 --block_size=16 --max_model_len=9120 --tensor-parallel-size 2 --gpu_memory_utilization=0.9 --no-enable-prefix-caching --async-scheduling
    
    vllm bench serve --backend vllm --model /home/models/Qwen3-32B --host 0.0.0.0 --port 8206 --dataset-name random --random-input-len 4096 --random-output-len 1024 --ignore-eos --request-rate 40 --num-prompts 40
    

    Hardware information for the 4 cards:
    =================== MetaX System Management Interface Log ===================
    Timestamp : Wed May 20 10:03:09 2026

    Attached GPUs : 4
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.12 Kernel Mode Driver Version: 3.6.11 |
    | MACA Version: 3.5.3.18 BIOS Version: 1.31.1.0 |
    |------------------+-----------------+---------------------+----------------------|
    | Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
    | Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
    |==================+=================+=====================+======================|
    | 0 MetaX C500 | 0 Off | 0000:81:00.0 | 0% Disabled |
    | 43W / 350W | 40C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 1 MetaX C500 | 1 Off | 0000:a1:00.0 | 0% Disabled |
    | 40W / 350W | 37C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 2 MetaX C500 | 2 Off | 0000:c1:00.0 | 0% Disabled |
    | 43W / 350W | 39C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 3 MetaX C500 | 3 Off | 0000:e1:00.0 | 0% Disabled |
    | 43W / 350W | 38C P0 | 858/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process: |
    | GPU PID Process Name GPU Memory |
    | Usage(MiB) |
    |=================================================================================|
    | no process found |

    Inference performance with 4 cards:
    ============ Serving Benchmark Result ============
    Successful requests: 40
    Request rate configured (RPS): 40.00
    Benchmark duration (s): 337.87
    Total input tokens: 163734
    Total generated tokens: 40960
    Request throughput (req/s): 0.12
    Output token throughput (tok/s): 121.23
    Peak output token throughput (tok/s): 320.00
    Peak concurrent requests: 40.00
    Total Token throughput (tok/s): 605.83
    ---------------Time to First Token----------------
    Mean TTFT (ms): 102917.51
    Median TTFT (ms): 102972.21
    P99 TTFT (ms): 196391.12
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 226.36
    Median TPOT (ms): 226.71
    P99 TPOT (ms): 315.54
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 226.36
    Median ITL (ms): 139.24
    P99 ITL (ms): 2468.52
    ==================================================
    Inference performance with 2 cards:
    ============ Serving Benchmark Result ============
    Successful requests: 40
    Request rate configured (RPS): 40.00
    Benchmark duration (s): 124.06
    Total input tokens: 163734
    Total generated tokens: 40960
    Request throughput (req/s): 0.32
    Output token throughput (tok/s): 330.16
    Peak output token throughput (tok/s): 800.00
    Peak concurrent requests: 40.00
    Total Token throughput (tok/s): 1649.94
    ---------------Time to First Token----------------
    Mean TTFT (ms): 21047.51
    Median TTFT (ms): 21005.14
    P99 TTFT (ms): 40315.77
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 94.32
    Median TPOT (ms): 94.41
    P99 TPOT (ms): 109.96
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 94.32
    Median ITL (ms): 53.47
    P99 ITL (ms): 548.38
    ==================================================

    PCIe bandwidth:
    DEVICE1  DEVICE2  TOPOLOGY  SIZE(B)     EFFECTIVE BANDWIDTH  RAW BANDWIDTH  TRANSMISSION DELAY (us)  DATA VALIDATION
    ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
    GPU#0    GPU#1    pcie      7516192768  101.34 GB/s          126.51 GB/s    69075.73                 PASS
    GPU#0    GPU#2    pcie      7516192768  101.33 GB/s          126.50 GB/s    69081.57                 PASS
    GPU#0    GPU#3    pcie      7516192768  101.34 GB/s          126.51 GB/s    69075.71                 PASS
    GPU#1    GPU#2    pcie      7516192768  101.33 GB/s          126.50 GB/s    69081.29                 PASS
    GPU#1    GPU#3    pcie      7516192768  101.34 GB/s          126.52 GB/s    69075.36                 PASS
    GPU#2    GPU#3    pcie      7516192768  101.33 GB/s          126.51 GB/s    69078.24                 PASS

    P2P bandwidth:
    DEV 1    DEV 2    SIZE(B)     EFFECTIVE BANDWIDTH  RAW BANDWIDTH  TRANSMISSION DELAY (us)  DATA VALIDATION
    ──────────────────────────────────────────────────────────────────────────────────────────────────────────
    CPU   <> BOARD#0  7516192768  41.75 GB/s           52.12 GB/s     180041.45                PASS
    CPU   <> BOARD#1  7516192768  41.92 GB/s           52.34 GB/s     179283.75                PASS
    CPU   <> BOARD#2  7516192768  34.95 GB/s           43.64 GB/s     215041.24                PASS
    CPU   <> BOARD#3  7516192768  34.96 GB/s           43.65 GB/s     214988.19                PASS
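
To put a number on the gap between the two runs, the throughput lines can be recomputed from the token counts and benchmark durations in the reports above. This is only a minimal sketch: the figures are copied from the post, while the `RUNS` dictionary and the helper names are made up for illustration and are not part of vLLM.

```python
# Figures copied from the two "Serving Benchmark Result" reports in the post.
RUNS = {
    "4 cards": {"duration_s": 337.87, "input_tokens": 163734, "output_tokens": 40960},
    "2 cards": {"duration_s": 124.06, "input_tokens": 163734, "output_tokens": 40960},
}


def throughput(run):
    """Return (output tok/s, total tok/s) recomputed from token counts and duration."""
    output_tps = run["output_tokens"] / run["duration_s"]
    total_tps = (run["input_tokens"] + run["output_tokens"]) / run["duration_s"]
    return output_tps, total_tps


def slowdown(slow, fast):
    """How many times longer the slow run took for the same workload."""
    return slow["duration_s"] / fast["duration_s"]


if __name__ == "__main__":
    for name, run in RUNS.items():
        out_tps, total_tps = throughput(run)
        print(f"{name}: output {out_tps:.2f} tok/s, total {total_tps:.2f} tok/s")
    factor = slowdown(RUNS["4 cards"], RUNS["2 cards"])
    print(f"4-card run took {factor:.2f}x as long as the 2-card run")
```

Both runs processed the same 40 requests and the same token totals, so the ratio of durations is a fair single-number summary of the regression.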

  • See post
    ruanding
    Members
    Adaptation of the GLM-5 and Qwen3.5 model series · Resolved · March 23, 2026 14:30

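    Hello. I recently saw articles stating that the GLM-5 and Qwen3.5 series models can run on MetaX Xiyun C500 GPUs. Which inference software should I use? As far as I can see, the Arm build of vLLM currently provided is still version 0.11, which does not support these two model series.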

    Article "MetaX Xiyun C-series GPUs complete Day 0 adaptation of Tongyi Qianwen Qwen 3.5" (沐曦股份曦云C系列GPU 完成通义千问Qwen 3.5 Day 0 适配):

    www.metax-tech.com/ndetail/12563.html

    mp.weixin.qq.com/s?__biz=Mzg5NzY1MDM3Mg==&mid=2247492553&idx=1&sn=0b9d5ee19095dc670e99d44959fe8c5e&scene=21&poc_token=HFu0wGmjU8FwnYY90l7ZxEprwuD7m6MHf7bH0FCH

  • See post
    ruanding
    Members
    Performance problem running the SigLIP model on a MetaX card · Resolved · February 9, 2026 10:14

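    I am running the SigLIP model on a newer-model Kunpeng 920 CPU plus one Xiyun C500 NPU. For the same image, inference latency is about 1037 ms on bare metal using the CPU only, about 2837 ms on bare metal using the NPU, and about 2616 ms using the NPU inside a container (maca-torch2.4-py310-mc3.3.0.4-kylinv10-arm64). On an NVIDIA 4090 the latency is about 310 ms. NPU inference being slower than CPU inference is clearly abnormal. How should I investigate and fix this? The driver version is 3.5.3.11, the SDK version is 3.5.3.17, and cu-bridge is built from the master branch.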
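
When comparing latencies like these, it can help to separate one-off start-up cost from steady-state latency before digging further. The following is a minimal sketch of a timing harness with warm-up runs. `measure_latency_ms` is a hypothetical helper, not part of any MetaX or PyTorch API, and it assumes the callable passed in blocks until the result is actually ready (for an accelerator this usually means the callable ends with a device synchronization).

```python
import statistics
import time


def measure_latency_ms(fn, warmup=3, iters=10):
    """Time fn() after a few untimed warm-up calls; return summary stats in ms.

    Assumption: fn() returns only once the work is finished. If the device
    runs asynchronously, fn should synchronize before returning, otherwise
    only the launch time is measured.
    """
    for _ in range(warmup):
        fn()  # untimed: absorbs initialization, caching, lazy compilation

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a single-image inference call.
    stats = measure_latency_ms(lambda: sum(range(100_000)))
    print(stats)
```

If the first call is much slower than the median of the timed calls, the reported single-image latency is likely dominated by initialization rather than by the model itself.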

  • See post
    ruanding
    Members
    Problem installing the MetaX build of PyTorch · Resolved · February 3, 2026 21:27

    I downloaded the maca-pytorch2.8-py312-3.5.3.9-aarch64.tar package, created a conda environment on bare metal, and installed PyTorch and the other packages from it. Importing PyTorch then fails with the following error:

    File "<stdin>", line 1, in <module>
    File "/home/lv/miniconda3/envs/python312/lib/python3.12/site-packages/torch/__init__.py", line 421, in <module>
      from torch._C import * # noqa: F403
      ^^^^^^^^^^^^^^^^^^^^^^
    ImportError: libmxomp.so: cannot open shared object file: No such file or directory

    How can I resolve this? I have installed driver version 2.14.27 and MACA SDK version 2.32.0.9.
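
The `ImportError` above means the dynamic loader could not locate `libmxomp.so` when loading `torch._C`. As a first check, one can look for the file in the directories listed in `LD_LIBRARY_PATH`. This is a minimal sketch: `find_shared_library` is a hypothetical helper, and it deliberately ignores the `ldconfig` cache and any rpath embedded in the binaries, so an empty result is a hint rather than proof.

```python
import os
from pathlib import Path


def find_shared_library(name, search_dirs=None):
    """Return the full paths where `name` exists as a file in the given directories.

    If search_dirs is None, the directories in LD_LIBRARY_PATH are used.
    The ldconfig cache and embedded rpaths are NOT consulted.
    """
    if search_dirs is None:
        raw = os.environ.get("LD_LIBRARY_PATH", "")
        search_dirs = [d for d in raw.split(":") if d]

    hits = []
    for directory in search_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            hits.append(str(candidate))
    return hits


if __name__ == "__main__":
    found = find_shared_library("libmxomp.so")
    if found:
        print("found:", *found, sep="\n  ")
    else:
        print("libmxomp.so is not in any LD_LIBRARY_PATH directory")
```

If nothing is found, the usual next steps are to locate the file inside the SDK installation and check whether the installed SDK version matches the one the PyTorch package was built against.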

  • See post
    ruanding
    Members
    Unable to collect profiler data or a timeline · Resolved · December 18, 2025 14:24

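    Hello. When I deploy a large model for inference with vLLM 0.8.2 inside a container, I cannot collect profiler data: after I set the VLLM_TORCH_PROFILER_DIR environment variable, the service hangs. How can I resolve this?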
