Posts | nuuuuuuke | MetaX (沐曦) Developer Forum

Members

PD disaggregation transfer backend (Resolved) April 30, 2026 14:16

python -m sglang.launch_server --help 2>&1 | grep -A8 -B2 "disaggregation-transfer-backend"

                    [--debug-tensor-dump-inject DEBUG_TENSOR_DUMP_INJECT]
                    [--disaggregation-mode {null,prefill,decode}]
                    [--disaggregation-transfer-backend {mooncake,nixl,ascend,fake,mori}]
                    [--disaggregation-bootstrap-port DISAGGREGATION_BOOTSTRAP_PORT]
                    [--disaggregation-decode-tp DISAGGREGATION_DECODE_TP]
                    [--disaggregation-decode-dp DISAGGREGATION_DECODE_DP]
                    [--disaggregation-prefill-pp DISAGGREGATION_PREFILL_PP]
                    [--disaggregation-ib-device DISAGGREGATION_IB_DEVICE]
                    [--disaggregation-decode-enable-offload-kvcache]
                    [--num-reserved-decode-tokens NUM_RESERVED_DECODE_TOKENS]
                    [--disaggregation-decode-polling-interval DISAGGREGATION_DECODE_POLLING_INTERVAL]

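With the sglang 0.5.9 image and a PD-disaggregated (prefill/decode) deployment, what should I pass to --disaggregation-transfer-backend? Do I just pip install mooncake, or is there a MetaX-specific build?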
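
As a small aid for comparing images, the accepted values for the flag can be pulled out of the help text programmatically. This is a minimal sketch that only parses the excerpt quoted above; `flag_choices` and `HELP_EXCERPT` are hypothetical names, and the result says nothing about which backend actually works on MetaX hardware.

```python
import re

# Excerpt of the `--help` output quoted in this post.
HELP_EXCERPT = """
                    [--disaggregation-mode {null,prefill,decode}]
                    [--disaggregation-transfer-backend {mooncake,nixl,ascend,fake,mori}]
                    [--disaggregation-bootstrap-port DISAGGREGATION_BOOTSTRAP_PORT]
"""


def flag_choices(help_text: str, flag: str) -> list[str]:
    """Return the choices printed as `flag {a,b,c}`, or [] if the flag takes free-form input."""
    match = re.search(re.escape(flag) + r"\s+\{([^}]*)\}", help_text)
    return match.group(1).split(",") if match else []


backends = flag_choices(HELP_EXCERPT, "--disaggregation-transfer-backend")
print(backends)
```

Running the same function over the full help output of two different images would show whether the list of accepted backends differs between them.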


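MiniMax M2.7 adaptation (Resolved) April 17, 2026 10:23

I have not converted a W8A8 model for vLLM. Also, they have a pile of mysterious environment variables, like these: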
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
export MACA_DIRECT_DISPATCH=1
export MACA_GRAPH_LAUNCH_MODE=5
export MACA_SMALL_PAGESIZE_ENABLE=1
export MACA_TORCH_COMPILE_CONF=triton.multi_kernel:1

MiniMax M2.7 adaptation (Resolved) April 16, 2026 17:05

I don't know.

MiniMax M2.7 adaptation (Resolved) April 16, 2026 16:43

System Info:
Machine ID: 9d52c7d699ca42f0ae1f8b918d2a3eb1
System UUID: b1a64fb0-1ed5-01e1-d311-debf52dba16c
Boot ID: bb311989-725f-4a20-baa7-960a7a0087c9
Kernel Version: 6.8.0-49-generic
OS Image: Ubuntu 24.04.3 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.23
Kubelet Version: v1.31.3-8+52431524cc27b6-sc
Kube-Proxy Version: v1.31.3-8+52431524cc27b6-sc

A single node with 16 C500 cards.

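MiniMax M2.7 adaptation (Resolved) April 16, 2026 16:34

Are there recommended launch parameters for sglang, including the various mysterious environment-variable switches? This is for a single node with 16 C500 cards, or more.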

MiniMax M2.7 adaptation (Resolved) April 16, 2026 16:22

Okay...

MiniMax M2.7 adaptation (Resolved) April 16, 2026 16:18

modelscope.cn/models/metax-tech/MiniMax-M2.7-W8A8

MiniMax M2.7 adaptation (Resolved) April 16, 2026 16:10

Model: MiniMax-M2.7-W8A8, downloaded from modelscope.
Image used: :0.14.0-maca.ai3.5.3.102-torch2.8-py310-ubuntu22.04-amd64
Hardware: C500, single node, 16 cards.
Launch command:
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
export MACA_DIRECT_DISPATCH=1
export MACA_GRAPH_LAUNCH_MODE=5
export MACA_SMALL_PAGESIZE_ENABLE=1
export MACA_TORCH_COMPILE_CONF=triton.multi_kernel:1
MODELPATH=/data/opensource-models/MiniMax-M2.7-W8A8-official/
MODEL_NAME=MiniMax-M2.7-W8A8

port=${1:-12001}

currenttime=$(date "+%Y%m%d%H%M%S")

vllm serve ${MODELPATH} \
--host 0.0.0.0 \
--port ${port} \
--served-model-name ${MODEL_NAME} \
--tensor-parallel-size 16 \
--pipeline-parallel-size 1 \
--dtype half \
--gpu-memory-utilization 0.9 \
--max-num-batched-tokens 8192 \
--max-model-len 8192 \
--swap-space 64 \
--mm-encoder-tp-mode data \
--trust-remote-code \
--max-num-seqs=64 \
--no-enable-prefix-caching --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think \
2>&1 | tee ./${currenttime}.log

It errored out...

Testing Qwen3-ASR-1.7B on the C500 (Resolved) April 10, 2026 08:36

......

Testing Qwen3-ASR-1.7B on the C500 (Resolved) April 9, 2026 14:58

Traceback (most recent call last):
  File "/opt/conda/bin/vllm", line 5, in <module>
    from vllm.entrypoints.cli.main import main
  File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/__init__.py", line 4, in <module>
    from vllm.entrypoints.cli.benchmark.mm_processor import (
  File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/benchmark/mm_processor.py", line 5, in <module>
    from vllm.benchmarks.mm_processor import add_cli_args, main
  File "/opt/conda/lib/python3.12/site-packages/vllm/benchmarks/mm_processor.py", line 26, in <module>
    from vllm.benchmarks.datasets import (
  File "/opt/conda/lib/python3.12/site-packages/vllm/benchmarks/datasets.py", line 39, in <module>
    from vllm.lora.utils import get_adapter_absolute_path
  File "/opt/conda/lib/python3.12/site-packages/vllm/lora/utils.py", line 17, in <module>
    from vllm.lora.layers import (
  File "/opt/conda/lib/python3.12/site-packages/vllm/lora/layers/__init__.py", line 4, in <module>
    from vllm.lora.layers.column_parallel_linear import (
  File "/opt/conda/lib/python3.12/site-packages/vllm/lora/layers/column_parallel_linear.py", line 12, in <module>
    from vllm.model_executor.layers.linear import (
  File "/opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 28, in <module>
    from vllm.model_executor.layers.utils import (
  File "/opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/utils.py", line 9, in <module>
    from vllm import _custom_ops as ops
  File "/opt/conda/lib/python3.12/site-packages/vllm/_custom_ops.py", line 95, in <module>
    @register_fake("_C::scaled_fp4_quant.out")
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/torch/library.py", line 1069, in register
    use_lib._register_fake(
  File "/opt/conda/lib/python3.12/site-packages/torch/library.py", line 219, in _register_fake
    handle = entry.fake_impl.register(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/torch/_library/fake_impl.py", line 50, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: operator _C::scaled_fp4_quant.out does not exist

Testing Qwen3-ASR-1.7B on the C500 (Resolved) April 8, 2026 19:28

Environment: I submitted a job in k8s, using the vllm-metax:0.15.0 image...
containers:
  - name: master
    image: registry.sa-ryd.sensetime.com.sa/ccr-alg/vllm-metax:0.15.0-maca.ai3.5.3.203-torch2.8-py312-ubuntu22.04-amd64
    command: ["bash", "-lc", "sleep infinity"]
    resources:
      requests:
        cpu: '32'
        memory: '64Gi'
        metax-tech.com/gpu: '1'
      limits:
        cpu: '32'
        memory: '64Gi'
        metax-tech.com/gpu: '1'
mx-smi output:
mx-smi version: 2.2.9

=================== MetaX System Management Interface Log ===================
Timestamp : Wed Apr 8 19:20:41 2026

Attached GPUs : 1
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.9 Kernel Mode Driver Version: 3.4.4 |
| MACA Version: 3.5.3.20 BIOS Version: 1.30.0.0 |
|------------------+-----------------+---------------------+----------------------|
| Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
| Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
|==================+=================+=====================+======================|
| 0 MetaX C500 | 0 Off | 0000:0c:00.0 | 0% Disabled |
| 57W / 350W | 36C P0 | 858/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+

Launch command:
Following the documentation at vllm-metax.readthedocs.io/en/latest/getting_started/installation/maca.html#build-vllm,
I installed vLLM-metax (git clone --branch v0.18.0-dev github.com/MetaX-MACA/vLLM-metax) together with vllm 0.18.0. The installation succeeded.
pip list | grep vllm
vllm 0.18.1.dev0+gbcf2be961.d20260408.empty
vllm_metax 0.18.0+gea0600.d20260408.maca3.5.3.20.torch2.8

Then I ran vllm:
vllm serve /data/ASR/qwen3-asr-hf/hub/models--Qwen--Qwen3-ASR-1.7B/snapshots/7278e1e70fe206f11671096ffdd38061171dd6e5 --served-model-name "Qwen3-ASR-1.7B" --host 0.0.0.0 --port 12212

Error output:

INFO 04-08 19:15:51 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 04-08 19:15:51 [__init__.py:46] - metax -> vllm_metax:register
INFO 04-08 19:15:51 [__init__.py:49] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 04-08 19:15:51 [__init__.py:239] Platform plugin metax is activated
INFO 04-08 19:15:51 [envs.py:104] Plugin sets VLLM_USE_FLASHINFER_SAMPLER to False. Reason: flashinfer sampler are not supported on maca
INFO 04-08 19:15:51 [envs.py:104] Plugin sets VLLM_ENGINE_READY_TIMEOUT_S to 3600. Reason: set timeout to 3600s for model loading
INFO Print the version information of mcoplib during compilation.

Version info:Mcoplib_Version = '0.4.0'
Build_Maca_Version = '3.5.3.18'
GIT_BRANCH = 'HEAD'
GIT_COMMIT = 'fe3a7e2'
Vllm Op Version = 0.15.0
SGlang Op Version = 0.5.7 && 0.5.8

INFO Staring Check the current MACA version of the operating environment.

INFO: Release major.minor matching, successful:3.5.

Traceback (most recent call last):
  File "/opt/conda/bin/vllm", line 5, in <module>
    from vllm.entrypoints.cli.main import main
  File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/__init__.py", line 4, in <module>
    from vllm.entrypoints.cli.benchmark.mm_processor import (
  File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/benchmark/mm_processor.py", line 5, in <module>
    from vllm.benchmarks.mm_processor import add_cli_args, main
  File "/opt/conda/lib/python3.12/site-packages/vllm/benchmarks/mm_processor.py", line 26, in <module>
    from vllm.benchmarks.datasets import (
  File "/opt/conda/lib/python3.12/site-packages/vllm/benchmarks/datasets.py", line 39, in <module>
    from vllm.lora.utils import get_adapter_absolute_path
  File "/opt/conda/lib/python3.12/site-packages/vllm/lora/utils.py", line 17, in <module>
    from vllm.lora.layers import (
  File "/opt/conda/lib/python3.12/site-packages/vllm/lora/layers/__init__.py", line 4, in <module>
    from vllm.lora.layers.column_parallel_linear import (
  File "/opt/conda/lib/python3.12/site-packages/vllm/lora/layers/column_parallel_linear.py", line 12, in <module>
    from vllm.model_executor.layers.linear import (
  File "/opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 28, in <module>
    from vllm.model_executor.layers.utils import (
  File "/opt/conda/lib/python3.12/site-packages/vllm/model_executor/layers/utils.py", line 9, in <module>
    from vllm import _custom_ops as ops
  File "/opt/conda/lib/python3.12/site-packages/vllm/_custom_ops.py", line 95, in <module>
    @register_fake("_C::scaled_fp4_quant.out")
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/torch/library.py", line 1069, in register
    use_lib._register_fake(
  File "/opt/conda/lib/python3.12/site-packages/torch/library.py", line 219, in _register_fake
    handle = entry.fake_impl.register(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/torch/_library/fake_impl.py", line 50, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: operator _C::scaled_fp4_quant.out does not exist

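Questions: Where in the chain does this operator error come from? The model has not even started loading; it looks like it crashes outright during initialization, while checking that the operators are in place.
If I want to run newer models ahead of the official release and hit an error like this, is that a dead end with nothing I can do?
If the k8s host does not upgrade its driver and I only run the new vllm-metax:0.15.0-maca.ai3.5.3.203 image on top of it, will there be problems?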
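
One thing visible in the output above, offered only as a possible reading and not confirmed anywhere in this thread: the mcoplib banner reports "Vllm Op Version = 0.15.0" while pip lists vllm 0.18.1.dev0. A minimal sketch for spotting that kind of gap from the two strings follows; `major_minor` is a hypothetical helper, and whether a version gap is the actual cause of the missing `_C::scaled_fp4_quant.out` operator is an assumption.

```python
import re

# Both strings are copied from this post; nothing is queried from the live environment.
MCOPLIB_BANNER = "Vllm Op Version = 0.15.0"
PIP_LIST_LINE = "vllm 0.18.1.dev0+gbcf2be961.d20260408.empty"


def major_minor(text: str) -> tuple[int, int]:
    """Extract the first `X.Y` version pair that appears in the text."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


ops_built_for = major_minor(MCOPLIB_BANNER)
installed = major_minor(PIP_LIST_LINE)
versions_match = ops_built_for == installed
print(f"ops built for vLLM {ops_built_for}, installed vLLM {installed}, match={versions_match}")
```

If the two pairs differ, it may be worth asking the vendor which vLLM versions the installed mcoplib build is meant to support before debugging further.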

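maca, metax-vllm, mcoplib, pytorch version issues (Resolved) April 8, 2026 16:29

I have a large number of MetaX C500 cards, managed with k8s + volcano.
Running a reasonably new open-source model is really hard. We have to wait for your official vllm-metax image, which usually takes about two months...
Can someone explain clearly how the versions of maca, mcoplib, pytorch and metax-vllm depend on each other?
We cannot get full performance out of the cards, and we cannot keep up with new models either.

For example, you released vllm-metax:0.15.0-maca.ai3.5.3.203-torch2.8-py312-ubuntu22.04-amd64. The nodes in my cluster all report MACA Version: 3.3.x. Will that image run on them?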
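
To make the question concrete, here is a minimal sketch that compares the MACA version embedded in the image tag with the version the host reports through mx-smi. The helper names are hypothetical, and comparing only major.minor is an assumption borrowed from the mcoplib startup message "Release major.minor matching" quoted in another post; it is not a documented compatibility rule.

```python
import re

# Image tag from this post; host version as reported by mx-smi in this post.
IMAGE_TAG = "vllm-metax:0.15.0-maca.ai3.5.3.203-torch2.8-py312-ubuntu22.04-amd64"
HOST_MACA = "3.3.0.15"


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '3.3.0.15' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def maca_from_image_tag(tag: str) -> tuple[int, ...]:
    """Extract the dotted version that follows 'maca.ai' in the image tag."""
    match = re.search(r"maca\.ai(\d+(?:\.\d+)*)", tag)
    if match is None:
        raise ValueError(f"no MACA version found in {tag!r}")
    return parse_version(match.group(1))


image_maca = maca_from_image_tag(IMAGE_TAG)
host_maca = parse_version(HOST_MACA)
# Assumption: only the first two components (major.minor) need to agree.
same_release = image_maca[:2] == host_maca[:2]
print(f"image MACA {image_maca}, host MACA {host_maca}, same major.minor: {same_release}")
```

A check like this could run as a pre-submit step for a k8s job, but the real compatibility matrix between the container's MACA libraries and the host driver still has to come from MetaX.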

mx-smi version: 2.2.9

=================== MetaX System Management Interface Log ===================
Timestamp : Wed Apr 8 16:22:30 2026

Attached GPUs : 16
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.9 Kernel Mode Driver Version: 3.4.4 |
| MACA Version: 3.3.0.15 BIOS Version: 1.30.0.0