MetaX-Tech Developer Forum Home
  • MetaX Developers

lishuai

  • Members
  • Joined December 19, 2025
  • Posts
  • Threads
  • Followers
  • Follows
  • Details

lishuai has posted 17 messages.

  • See post
    lishuai
    Members
    mccl test issue Solved March 25, 2026 17:22

    The current state of the mccl testing: with GDR disabled, traffic over SysMem runs to completion, but with GDR enabled, the P2P path immediately triggers a PCIe bus error.
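
    A quick A/B isolation between the two paths can be driven from the mpirun environment. This is a sketch only: MCCL_NET_GDR_LEVEL is the variable used in the cluster.sh script pasted elsewhere on this page, and treating 0 as "force SysMem" is an assumption borrowed from NCCL's same-named NCCL_NET_GDR_LEVEL semantics, not confirmed MCCL behavior.

    ```shell
    # Sketch: build the mpirun -x flag for a GDR on/off A/B test.
    # Assumption: MCCL_NET_GDR_LEVEL mirrors NCCL_NET_GDR_LEVEL, where
    # 0 disables GPUDirect RDMA (SysMem path) and 2 allows P2P over PCIe.
    GDR_LEVEL="${1:-0}"
    GDR_ENV="-x MCCL_NET_GDR_LEVEL=${GDR_LEVEL}"
    echo "${GDR_ENV}"
    ```

    Running the failing benchmark once with each setting narrows the fault to the GDR/P2P path specifically.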

  • See post
    lishuai
    Members
    mccl test issue Solved March 25, 2026 17:10

    After the ib_write test completes, running the cluster test causes the machine to reboot.

    Script contents:

    #!/bin/bash

    MACA_PATH="${MACA_PATH:-/opt/maca}"

    HOST_IP=192.168.1.204:8,192.168.1.205:8
    GPU_NUM=16

    TEST_DIR=$MACA_PATH/samples/mccl_tests/perf/mccl_perf

    BENCH_NAMES="all_reduce_perf all_gather_perf reduce_scatter_perf sendrecv_perf alltoall_perf"

    BENCH_NAMES="all_reduce_perf"

    if [[ -z "$1" || -z "$2" || -z "$3" ]]; then
        echo "Use the default ip addr. Run with parameters for custom ip addr, for example: bash cluster.sh ip_1:proc_count,ip_2:proc_count gpu_num test_name"
    else
        HOST_IP=$1
        GPU_NUM=$2

        if [ "$3" = "all" ]; then
            BENCH_NAMES="all_reduce_perf all_gather_perf reduce_scatter_perf sendrecv_perf alltoall_perf"
        else
            if [ -e "$TEST_DIR/$3" ]; then
                BENCH_NAMES=$3
            else
                echo "$TEST_DIR/$3 does not exist!"
                exit 1
            fi
        fi
    fi

    IP_MASK="$(echo "$HOST_IP" | cut -d. -f1-3).0/24"

    IP_MASK="192.168.100.0/24"
    IB_PORT=mlx5_0,mlx5_1

    PERF_ENV="-x FORCE_ACTIVE_WAIT=2"
    LIB_PATH_ENV="-x MACA_PATH=${MACA_PATH} -x LD_LIBRARY_PATH=${MACA_PATH}/lib:${MACA_PATH}/ompi/lib:${MACA_PATH}/ucx/lib"
    ENV_VAR="-x MCCL_IB_HCA=${IB_PORT} -x MCCL_CROSS_NIC=1 -x MCCL_IB_TRAFFIC_CLASS=160 -x MCCL_IB_GID_INDEX=3 -x MCCL_IB_RETRY_CNT=15 -x MCCL_NET_GDR_LEVEL=2 ${PERF_ENV} ${LIB_PATH_ENV}"
    MPI_PROCESS_NUM=${GPU_NUM}
    MPI_RUN_OPT="--allow-run-as-root -mca btl_tcp_if_include ${IP_MASK} -mca oob_tcp_if_include ${IP_MASK} -mca pml ^ucx -mca osc ^ucx -mca btl ^openib"

    for BENCH in ${BENCH_NAMES}; do
        echo -n "The test is ${BENCH}, the maca version is " && realpath ${MACA_PATH}
        ${MACA_PATH}/ompi/bin/mpirun -np ${MPI_PROCESS_NUM} ${MPI_RUN_OPT} -host ${HOST_IP} ${ENV_VAR} ${TEST_DIR}/${BENCH} -b 1K -e 1G -d float -f 2 -g 1 -n 10
    done
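
    One thing worth noting in the script above: IP_MASK is first derived from HOST_IP and then immediately overwritten with a hardcoded value. The derivation takes the first three dot-separated fields of HOST_IP, so with the default HOST_IP it yields the management subnet, not the 192.168.100.x RoCE subnet, which is presumably why the hardcoded override exists. The derivation can be checked in isolation:

    ```shell
    # Reproduce cluster.sh's IP_MASK derivation with the script's default HOST_IP.
    HOST_IP="192.168.1.204:8,192.168.1.205:8"
    IP_MASK="$(echo "$HOST_IP" | cut -d. -f1-3).0/24"
    echo "$IP_MASK"   # -> 192.168.1.0/24 (the management net, not the RoCE net)
    ```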

  • See post
    lishuai
    Members
    mccl test issue Solved March 25, 2026 16:59

    How does this data look?

  • See post
    lishuai
    Members
    mccl test issue Solved March 24, 2026 13:55

    The single-stream ib_write test results are shown in the image.

  • See post
    lishuai
    Members
    mccl test issue Solved March 24, 2026 13:42

    For example, what should be checked?

  • See post
    lishuai
    Members
    mccl test issue Solved March 24, 2026 13:39

    The results of cluster.sh from mccl are shown in the image.

  • See post
    lishuai
    Members
    mccl test issue Solved March 24, 2026 13:12

    The two-node model deployment is now stuck in this state. Looking at the mccl logs, it appears the environment variables declare that mccl should use the HCA, but mccl still goes over the business network. The ibdev2netdev, python, mccl, and ip information for both machines is in the attachment.

  • See post
    lishuai
    Members
    mccl test issue Solved March 23, 2026 16:55

    OK, I'll take a look.

  • See post
    lishuai
    Members
    mccl test issue Solved March 23, 2026 16:46

    No. How do I test that?

  • See post
    lishuai
    Members
    mccl test issue Solved March 23, 2026 16:32

    Bug report: MCCL cross-node RoCE memory registration failure (ibv_reg_mr Invalid argument)
    [Environment]

    Hardware: two nodes with 16 MetaX C500 GPUs in total (8 cards per node); the head nodes are two H3C UniServer R5330 G7 servers (C500*16)
    Network: 40Gbps RoCE NICs (device names rocep6s0, rocep95s0) as reported by ibstat; the physical cards are 400G Mellanox CX7
    Software: MACA 3.5.3 / MCCL 2.16.5

    [Symptom]
    When running the cross-node all_reduce_perf test via mpirun with IB/RoCE hardware acceleration enabled, MCCL crashes during initialization with the error:
    MCCL WARN Call to ibv_reg_mr failed with error Invalid argument
    [Troubleshooting and isolation completed]
    Network layer is fine: cross-node ping over the RoCE high-speed subnet (192.168.100.x) shows latency < 0.1ms, and firewalld and SELinux are disabled on both ends.
    System limits are fine: verified through MPI that ulimit -l is unlimited on both ends, ruling out memlock limits.
    iova2 fallback test: after adding -x MCCL_IB_PCI_RELAXED_ORDERING=0, the error degrades from ibv_reg_mr_iova2 failed to ibv_reg_mr failed, still returning Invalid argument (retcode 2).
    TCP fallback control group (key evidence): after adding -x MCCL_IB_DISABLE=1 to force plain TCP/Socket communication, the test passes cleanly (#wrong 0).


    mccl test script contents:
    [root@localhost /opt/maca/samples/mccl_tests/perf]# cat cluster.sh

    #!/bin/bash

    MACA_PATH="${MACA_PATH:-/opt/maca}"

    HOST_IP=192.168.1.204:8,192.168.1.205:8
    GPU_NUM=16

    TEST_DIR=$MACA_PATH/samples/mccl_tests/perf/mccl_perf

    BENCH_NAMES="all_reduce_perf all_gather_perf reduce_scatter_perf sendrecv_perf alltoall_perf"

    BENCH_NAMES="all_reduce_perf"

    if [[ -z "$1" || -z "$2" || -z "$3" ]]; then
        echo "Use the default ip addr. Run with parameters for custom ip addr, for example: bash cluster.sh ip_1:proc_count,ip_2:proc_count gpu_num test_name"
    else
        HOST_IP=$1
        GPU_NUM=$2

        if [ "$3" = "all" ]; then
            BENCH_NAMES="all_reduce_perf all_gather_perf reduce_scatter_perf sendrecv_perf alltoall_perf"
        else
            if [ -e "$TEST_DIR/$3" ]; then
                BENCH_NAMES=$3
            else
                echo "$TEST_DIR/$3 does not exist!"
                exit 1
            fi
        fi
    fi

    IP_MASK="$(echo "$HOST_IP" | cut -d. -f1-3).0/24"

    IP_MASK="192.168.100.0/24"
    IB_PORT=rocep6s0,rocep95s0

    PERF_ENV="-x FORCE_ACTIVE_WAIT=2"
    LIB_PATH_ENV="-x MACA_PATH=${MACA_PATH} -x LD_LIBRARY_PATH=${MACA_PATH}/lib:${MACA_PATH}/ompi/lib:${MACA_PATH}/ucx/lib"

    ENV_VAR="-x MCCL_IB_HCA=${IB_PORT} -x MCCL_CROSS_NIC=1 ${PERF_ENV} ${LIB_PATH_ENV}"

    ENV_VAR="-x MCCL_IB_HCA=rocep6s0,rocep95s0 -x MCCL_SOCKET_IFNAME=p50p1,p51p1 -x MCCL_CROSS_NIC=1 ${PERF_ENV} ${LIB_PATH_ENV} -x MCCL_IB_DISABLE=0"
    MPI_PROCESS_NUM=${GPU_NUM}
    MPI_RUN_OPT="--allow-run-as-root -mca btl_tcp_if_include ${IP_MASK} -mca oob_tcp_if_include ${IP_MASK} -mca pml ^ucx -mca osc ^ucx -mca btl ^openib"

    for BENCH in ${BENCH_NAMES}; do
        echo -n "The test is ${BENCH}, the maca version is " && realpath ${MACA_PATH}
        ${MACA_PATH}/ompi/bin/mpirun -np ${MPI_PROCESS_NUM} ${MPI_RUN_OPT} -host ${HOST_IP} ${ENV_VAR} ${TEST_DIR}/${BENCH} -b 1K -e 1G -d float -f 2 -g 1 -n 10
    done
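
    Before mpirun launches, the device names handed to MCCL_IB_HCA can be sanity-checked against the kernel's RDMA device list. This is a generic sysfs check, not an MCCL facility; on a machine without these devices each line simply reports "missing".

    ```shell
    # Check that each HCA named in MCCL_IB_HCA is visible to the RDMA stack.
    IB_PORT="rocep6s0,rocep95s0"
    for dev in $(echo "$IB_PORT" | tr ',' ' '); do
        if [ -d "/sys/class/infiniband/$dev" ]; then
            echo "$dev: present"
        else
            echo "$dev: missing"
        fi
    done
    ```

    A "missing" device here would explain MCCL falling back to the business network or failing registration outright.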

    The error messages are in the attached file.

  • See post
    lishuai
    Members
    Two-node 16-card C500 deployment of metax-tech/DeepSeek-R1-0528-W8A8 issue Solved February 5, 2026 13:40

    When configuring ray, if the machine has IB cards, is it mandatory to map the compute network interfaces?

  • See post
    lishuai
    Members
    How to tell whether a model is supported by the vllm versions MetaX has already adapted Solved February 4, 2026 09:57

    For example, which of the following models are supported and which are not:
    Qwen3-235B
    Qwen3-VL
    Qwen3-embeding
    Qwen3-rerank
    Z-Image
    GLM4.6/7
    Deepseek-V3.2
    Deepseek-V3
    Deepseek-R1

  • See post
    lishuai
    Members
    Are there optimization directions for OCR recognition performance on the C500 Solved February 3, 2026 19:50

    PP-OCRv5_server_det
    PP-LCNet_x1_0_doc_ori
    PP-LCNet_x1_0_textline_ori
    PP-OCRv5_server_rec
    UVDoc

    Because of a batch text-recognition requirement over stored documents, the pipeline formed by these models is clearly better suited to this workload. The vl-1.5 option was considered, but the business side feels that multimodal model is not a good fit for this use case.

  • See post
    lishuai
    Members
    Are there optimization directions for OCR recognition performance on the C500 Solved February 3, 2026 19:35

    These models work together to form a single OCR pipeline. How can this be run with vllm?

  • See post
    lishuai
    Members
    Are there optimization directions for OCR recognition performance on the C500 Solved February 3, 2026 16:13

    Install info: python -m pip install paddle-metax-gpu==3.3.0 -i www.paddlepaddle.org.cn/packages/stable/maca/
    The model run script is in the uploaded file.

    Models used:
    PP-OCRv5_server_det
    PP-LCNet_x1_0_doc_ori
    PP-LCNet_x1_0_textline_ori
    PP-OCRv5_server_rec
    UVDoc

  • See post
    lishuai
    Members
    Which MetaX file corresponds to NVIDIA's libdevice? Solved February 2, 2026 15:23

    The question is as stated in the title.

  • See post
    lishuai
    Members
    Does the N260 support GLM-4.1V-9B-Thinking Solved December 19, 2025 11:47

    Image: cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.11.0-maca.ai3.3.0.11-torch2.6-py310-ubuntu22.04-amd64

    Container creation command: docker run -it --restart always --device=/dev/dri --device=/dev/mxcd --group-add 44 --name GLM-4.1V-9B-Thinking --device=/dev/mem --network=host --security-opt seccomp=unconfined --security-opt apparmor=unconfined --shm-size '100gb' --ulimit memlock=-1 -v /mnt/data/models/GLM-4.1V-9B-Thinking:/mnt/data/models/GLM-4.1V-9B-Thinking cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.11.0-maca.ai3.3.0.11-torch2.6-py310-ubuntu22.04-amd64

    Service start command: vllm serve /mnt/data/models/GLM-4.1V-9B-Thinking/ --trust-remote-code --dtype auto --max-model-len 4096 --gpu-memory-utilization 0.9 --served-model-name GLM-4.1V-9B-Thinking

    Error:
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] EngineCore failed to start.
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] Traceback (most recent call last):
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/multimodal/processing.py", line 1057, in call_hf_processor
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] output = hf_processor(data,
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/transformers/models/glm4v/processing_glm4v.py", line 150, in __call__
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] videos_inputs = self.video_processor(videos=videos, **output_kwargs["videos_kwargs"])
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/transformers/video_processing_utils.py", line 206, in __call__
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] return self.preprocess(videos, **kwargs)
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/transformers/video_processing_utils.py", line 387, in preprocess
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] preprocessed_videos = self._preprocess(videos=videos, **kwargs)
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/transformers/models/glm4v/video_processing_glm4v.py", line 177, in preprocess
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] resized_height, resized_width = smart_resize(
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/transformers/models/glm4v/image_processing_glm4v.py", line 59, in smart_resize
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] raise ValueError(f"t:{num_frames} must be larger than temporal_factor:{temporal_factor}")
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] ValueError: t:1 must be larger than temporal_factor:2
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708]
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] The above exception was the direct cause of the following exception:
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708]
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] Traceback (most recent call last):
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] engine_core = EngineCoreProc(*args, **kwargs)
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 498, in __init__
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] super().__init__(vllm_config, executor_class, log_stats,
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 83, in __init__
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] self.model_executor = executor_class(vllm_config)
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 54, in __init__
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] self._init_executor()
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 54, in _init_executor
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] self.collective_rpc("init_device")
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] return [run_method(self.driver_worker, method, args, kwargs)]
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/utils/__init__.py", line 3122, in run_method
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] return func(*args, **kwargs)
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 259, in init_device
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] self.worker.init_device() # type: ignore
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 201, in init_device
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] self.model_runner: GPUModelRunner = GPUModelRunner(
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 421, in __init__
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] self.mm_budget = MultiModalBudget(
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/utils.py", line 47, in __init__
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] max_tokens_by_modality = mm_registry \
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/multimodal/registry.py", line 167, in get_max_tokens_per_item_by_nonzero_modality
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] max_tokens_per_item = self.get_max_tokens_per_item_by_modality(
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/multimodal/registry.py", line 143, in get_max_tokens_per_item_by_modality
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] return profiler.get_mm_max_contiguous_tokens(
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/multimodal/profiling.py", line 282, in get_mm_max_contiguous_tokens
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] return self._get_mm_max_tokens(seq_len,
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/multimodal/profiling.py", line 262, in _get_mm_max_tokens
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] mm_inputs = self._get_dummy_mm_inputs(seq_len, mm_counts)
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/multimodal/profiling.py", line 173, in _get_dummy_mm_inputs
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] return self.processor.apply(
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/multimodal/processing.py", line 2036, in apply
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] ) = self._cached_apply_hf_processor(
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/multimodal/processing.py", line 1826, in _cached_apply_hf_processor
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] ) = self._apply_hf_processor_main(
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/multimodal/processing.py", line 1572, in _apply_hf_processor_main
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] mm_processed_data = self._apply_hf_processor_mm_only(
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/multimodal/processing.py", line 1529, in _apply_hf_processor_mm_only
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] _, mm_processed_data, _ = self._apply_hf_processor_text_mm(
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/multimodal/processing.py", line 1456, in _apply_hf_processor_text_mm
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] processed_data = self._call_hf_processor(
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/glm4_1v.py", line 1207, in _call_hf_processor
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] video_outputs = super()._call_hf_processor(
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/multimodal/processing.py", line 1417, in _call_hf_processor
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] return self.info.ctx.call_hf_processor(
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] File "/opt/conda/lib/python3.10/site-packages/vllm/multimodal/processing.py", line 1080, in call_hf_processor
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] raise ValueError(msg) from exc
    (EngineCore_DP0 pid=137) ERROR 12-19 09:43:42 [core.py:708] ValueError: Failed to apply Glm4vProcessor on data={'text': '<|begin_of_video|><|video|><|end_of_video|>', 'videos': [[array([[[[255, 255, 255],
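
    The traceback bottoms out in smart_resize rejecting the one-frame dummy video that vLLM generates while profiling multimodal memory (t:1 vs temporal_factor:2). If video input is not needed for this deployment, one possible workaround is to exclude the video modality at startup. This is a sketch, not a confirmed fix: --limit-mm-per-prompt is an upstream vLLM flag, but its value syntax differs across vLLM versions, so check the flag's documentation for the version in this image.

    ```shell
    # Sketch (assumption: JSON value syntax for --limit-mm-per-prompt,
    # as in recent upstream vLLM; adjust for the installed version).
    vllm serve /mnt/data/models/GLM-4.1V-9B-Thinking/ \
        --trust-remote-code --dtype auto --max-model-len 4096 \
        --gpu-memory-utilization 0.9 \
        --served-model-name GLM-4.1V-9B-Thinking \
        --limit-mm-per-prompt '{"video": 0}'
    ```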

  • MetaX Developer Forum
powered by misago