MetaX-Tech Developer Forum

JunHowie

  • Members
  • Joined July 7, 2025

JunHowie has started 1 thread.

  • JunHowie
    Members
    TypeError: Input tensor data type is not supported for NCCL process group: BFloat16 (Products & Operations, July 7, 2025 09:56)

    Background: While fine-tuning on an MXC500-32G on the d.run platform, I hit TypeError: Input tensor data type is not supported for NCCL process group: BFloat16
    System environment:

    OS version  : Ubuntu 22.04.3 LTS
      Kernel      : 5.15.0-58-generic
      IP address  : 10.233.81.148
      Hostname    : ins-m7p8w-698894dd4d-p4f5r
    
      CPU model   : Intel(R) Xeon(R) Platinum 8460Y+
      CPU threads : 8 C
      Memory      : 133 MB / 98304 MB (0.14% used)
      GPU         : NO GPU detected
      CUDA        : NO CUDA detected
    
    (ms-swift) root@ins-m7p8w-698894dd4d-p4f5r:~/data# mx-smi
    mx-smi  version: 2.1.9
    
    =================== MetaX System Management Interface Log ===================
    Timestamp                                         : Mon Jul  7 09:47:37 2025
    
    Attached GPUs                                     : 1
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.1.9                        Kernel Mode Driver Version: 2.9.8           |
    | MACA Version: 2.31.0.6              BIOS Version: 1.20.3.0                      |
    |------------------------------------+---------------------+----------------------+
    | GPU         NAME                   | Bus-id              | GPU-Util             |
    | Temp        Power                  | Memory-Usage        |                      |
    |====================================+=====================+======================|
    | 0           MXC500 VF              | 0000:38:00.1        | 0%                   |
    | N/A         N/A                    | 618/32512 MiB       |                      |
    +------------------------------------+---------------------+----------------------+
    
    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    |  GPU                    PID         Process Name                 GPU Memory     |
    |                                                                  Usage(MiB)     |
    |=================================================================================|
    |  no process found                                                               |
    +---------------------------------------------------------------------------------+
    
    End of Log
    

    Environment setup:

    conda create --prefix=/root/data/envs/ms-swift python=3.10 -y
    
    conda activate /root/data/envs/ms-swift
    
    pip install 'ms-swift'
    

    Then install the MetaX-built packages. The contents of requirements.txt are as follows:

    apex==0.1+metax2.32.0.3
    torch==2.6.0+metax2.32.0.3
    torchaudio==2.4.1+metax2.32.0.3
    torchvision==0.15.1+metax2.32.0.3
    triton==3.0.0+metax2.32.0.3 
    flash_attn
    
    ## Install the MetaX-specific packages
    pip install -r requirements.txt -i https://repos.metax-tech.com/r/maca-pypi/simple --trusted-host repos.metax-tech.com --no-build-isolation
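
One detail visible in the listings above: the wheels are pinned with a +metax2.32.0.3 suffix, while mx-smi reports MACA Version 2.31.0.6. The sketch below only makes that comparison explicit. It assumes (this is not confirmed anywhere in the post) that the +metax suffix names the MACA release a wheel was built for; whether the difference has anything to do with the error is not established here. The helper names metax_suffix and major_minor are made up for illustration.

```python
import re

def metax_suffix(pin):
    """Return the version after '+metax' in a requirement pin, or None."""
    match = re.search(r"\+metax([0-9.]+)", pin)
    return match.group(1) if match else None

def major_minor(version):
    """Reduce a dotted version string to its (major, minor) pair."""
    return tuple(int(part) for part in version.split(".")[:2])

# Pins copied from requirements.txt in the post.
pins = [
    "apex==0.1+metax2.32.0.3",
    "torch==2.6.0+metax2.32.0.3",
    "torchaudio==2.4.1+metax2.32.0.3",
    "torchvision==0.15.1+metax2.32.0.3",
    "triton==3.0.0+metax2.32.0.3",
    "flash_attn",
]

# "MACA Version" as printed by mx-smi in the post.
maca_runtime = "2.31.0.6"

for pin in pins:
    suffix = metax_suffix(pin)
    if suffix is None:
        print(f"{pin}: no +metax suffix, skipped")
        continue
    same = major_minor(suffix) == major_minor(maca_runtime)
    status = "same major.minor" if same else "differs from runtime"
    print(f"{pin}: built for {suffix}, runtime {maca_runtime} ({status})")
```

If the assumption holds, this is something to mention when asking MetaX support which wheel set matches the installed driver stack.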
    

    The fine-tuning script is as follows

    Problem description

    The main error is TypeError: Input tensor data type is not supported for NCCL process group: BFloat16

    Running the following directly in a new Jupyter Lab terminal

    swift sft \
        --model "/root/data/internlm2_5-1_8b-chat" \
        --train_type lora \
        --dataset "/root/data/datasets/output.jsonl" \
        --model_type internlm2 \
        --torch_dtype bfloat16 \
        --num_train_epochs 4 \
        --per_device_train_batch_size 4 \
        --learning_rate 5e-5 \
        --warmup_ratio 0.1 \
        --split_dataset_ratio 0 \
        --lora_rank 8 \
        --lora_alpha 32 \
        --target_modules all-linear \
        --gradient_accumulation_steps 2 \
        --save_steps 2000 \
        --save_total_limit 5 \
        --gradient_checkpointing_kwargs '{"use_reentrant": false}' \
        --logging_steps 5 \
        --max_length 2048 \
        --output_dir ./swift_output/InternLM2.5-1.8B-Lora \
        --dataloader_num_workers 256 \
        --model_author JimmyMa99 \
        --model_name InternLM2.5-1.8B-Lora
    

    does not raise the error.

    After installing the MetaX-specific torch and the other packages, I tested whether bfloat16 is supported. The result is as follows:

    (ms-swift) root@ins-m7p8w-698894dd4d-p4f5r:~/data# python
    Python 3.10.18 | packaged by conda-forge | (main, Jun  4 2025, 14:45:41) [GCC 13.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import torch
    >>> print(torch.cuda.is_bf16_supported())
    True
    >>>
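
Note that torch.cuda.is_bf16_supported() reports on the device side only; as far as I know it does not exercise the collective-communication backend, which is where the TypeError in the title is raised. A minimal sketch of a more direct probe follows. It assumes a single-rank process group goes through the same dtype check as a multi-rank one, and that the MetaX torch build exposes its communication library under the "nccl" backend name; neither is confirmed in the post. The function name probe_bf16_all_reduce is hypothetical.

```python
import socket

def probe_bf16_all_reduce(backend=None):
    """Try a bfloat16 all_reduce in a one-rank group and describe the outcome."""
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        return "torch not installed"
    if not dist.is_available():
        return "torch.distributed not available"

    # Default to NCCL when a device is visible, otherwise fall back to Gloo on CPU.
    if backend is None:
        backend = "nccl" if torch.cuda.is_available() else "gloo"
    device = "cuda" if backend == "nccl" else "cpu"

    # Pick a free local port for the rendezvous.
    with socket.socket() as sock:
        sock.bind(("127.0.0.1", 0))
        port = sock.getsockname()[1]

    try:
        dist.init_process_group(
            backend=backend,
            init_method=f"tcp://127.0.0.1:{port}",
            rank=0,
            world_size=1,
        )
    except Exception as exc:
        return f"init failed ({backend}): {type(exc).__name__}: {exc}"

    try:
        tensor = torch.ones(4, dtype=torch.bfloat16, device=device)
        dist.all_reduce(tensor)
        if device == "cuda":
            torch.cuda.synchronize()
        return f"bfloat16 all_reduce ok ({backend})"
    except Exception as exc:
        return f"bfloat16 all_reduce failed ({backend}): {type(exc).__name__}: {exc}"
    finally:
        dist.destroy_process_group()

if __name__ == "__main__":
    print(probe_bf16_all_reduce())
```

If the probe reports the same TypeError, the limitation sits in the communication backend of the installed torch build rather than in the device's bfloat16 support.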
    