Background: On the d.run platform with an MXC500-32G, fine-tuning fails with `TypeError: Input tensor data type is not supported for NCCL process group: BFloat16`.
System environment:
OS version : Ubuntu 22.04.3 LTS
Kernel : 5.15.0-58-generic
IP address : 10.233.81.148
Hostname : ins-m7p8w-698894dd4d-p4f5r
CPU model : Intel(R) Xeon(R) Platinum 8460Y+
CPU threads : 8 C
Memory : 133 MB / 98304 MB (0.14% used)
GPU : NO GPU detected
CUDA : NO CUDA detected
(ms-swift) root@ins-m7p8w-698894dd4d-p4f5r:~/data# mx-smi
mx-smi version: 2.1.9
=================== MetaX System Management Interface Log ===================
Timestamp : Mon Jul 7 09:47:37 2025
Attached GPUs : 1
+---------------------------------------------------------------------------------+
| MX-SMI 2.1.9 Kernel Mode Driver Version: 2.9.8 |
| MACA Version: 2.31.0.6 BIOS Version: 1.20.3.0 |
|------------------------------------+---------------------+----------------------+
| GPU NAME | Bus-id | GPU-Util |
| Temp Power | Memory-Usage | |
|====================================+=====================+======================|
| 0 MXC500 VF | 0000:38:00.1 | 0% |
| N/A N/A | 618/32512 MiB | |
+------------------------------------+---------------------+----------------------+
+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| no process found |
+---------------------------------------------------------------------------------+
End of Log
Environment setup:
conda create --prefix=/root/data/envs/ms-swift python=3.10 -y
conda activate /root/data/envs/ms-swift
pip install 'ms-swift'
Then install the MetaX-compiled packages:
apex==0.1+metax2.32.0.3
torch==2.6.0+metax2.32.0.3
torchaudio==2.4.1+metax2.32.0.3
torchvision==0.15.1+metax2.32.0.3
triton==3.0.0+metax2.32.0.3
flash_attn
## Install the MetaX-specific packages
pip install -r requirements.txt -i https://repos.metax-tech.com/r/maca-pypi/simple --trusted-host repos.metax-tech.com --no-build-isolation
The fine-tuning script is given below.
Problem description
The main error is `TypeError: Input tensor data type is not supported for NCCL process group: BFloat16`.
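For context, bfloat16 is simply the top 16 bits of an IEEE-754 float32: the same sign bit and 8-bit exponent, with the significand truncated to 7 explicit bits. The sketch below illustrates this in pure Python; the helper names are invented for illustration, and it truncates rather than rounding to nearest-even as real libraries typically do.

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Truncate a float32 to its top 16 bits (the bfloat16 pattern).

    Note: this sketch truncates; production converters usually
    round-to-nearest-even. Helper names are invented for illustration.
    """
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16

def bfloat16_bits_to_float(bits16: int) -> float:
    """Re-expand a 16-bit bfloat16 pattern back to a Python float."""
    (val,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return val

# Same 8-bit exponent as float32, so huge magnitudes survive:
print(bfloat16_bits_to_float(float32_to_bfloat16_bits(3.0e38)))
# ...but only 7 explicit significand bits, so precision is coarse:
print(bfloat16_bits_to_float(float32_to_bfloat16_bits(1.2345678)))
```

Because bfloat16 is "just" a truncated float32, GPU compute kernels can support it easily, while a collective-communication backend still needs an explicit dtype mapping for it, which is what the error above is complaining about.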
Running the following directly in a new Jupyter Lab terminal:
swift sft \
--model "/root/data/internlm2_5-1_8b-chat" \
--train_type lora \
--dataset "/root/data/datasets/output.jsonl" \
--model_type internlm2 \
--torch_dtype bfloat16 \
--num_train_epochs 4 \
--per_device_train_batch_size 4 \
--learning_rate 5e-5 \
--warmup_ratio 0.1 \
--split_dataset_ratio 0 \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--gradient_accumulation_steps 2 \
--save_steps 2000 \
--save_total_limit 5 \
--gradient_checkpointing_kwargs '{"use_reentrant": false}' \
--logging_steps 5 \
--max_length 2048 \
--output_dir ./swift_output/InternLM2.5-1.8B-Lora \
--dataloader_num_workers 256 \
--model_author JimmyMa99 \
--model_name InternLM2.5-1.8B-Lora
the error does not occur. (Since the message refers to an NCCL process group, the failure presumably only surfaces when the platform launches training as a distributed job, not in a plain single-process run.)
After installing the MetaX-specific torch and the other packages, I tested whether bfloat16 is supported; the result is as follows:
(ms-swift) root@ins-m7p8w-698894dd4d-p4f5r:~/data# python
Python 3.10.18 | packaged by conda-forge | (main, Jun 4 2025, 14:45:41) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_bf16_supported())
True
>>>
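Note that `torch.cuda.is_bf16_supported()` only reports whether the device's compute kernels can handle bfloat16; it says nothing about the collective-communication backend, which is where the error is raised. If the workaround ends up being a fall back to float16 (assuming the trainer accepts `--torch_dtype float16`), be aware of the range trade-off: float16 overflows at 65504, while bfloat16 keeps float32's exponent range. A pure-Python sketch of the difference, modeling bfloat16 as a truncated float32 (an assumption of this sketch, not the MetaX implementation):

```python
import struct

# float16 (IEEE half) tops out at 65504, so a modest fp32-scale value
# cannot even be packed:
overflowed = False
try:
    struct.pack("e", 1.0e5)  # 1e5 > float16 max (65504)
except (OverflowError, struct.error):
    overflowed = True
print("float16 overflow:", overflowed)

# bfloat16 (modeled here as the top 16 bits of a float32) keeps
# float32's 8-bit exponent, so the same magnitude is representable,
# just with coarser precision:
(bits32,) = struct.unpack("<I", struct.pack("<f", 1.0e5))
bf16_value = struct.unpack("<f", struct.pack("<I", (bits32 >> 16) << 16))[0]
print("bfloat16 round-trip of 1e5:", bf16_value)
```

This is why bf16 training typically needs no loss scaling while fp16 training does; switching dtypes to dodge the NCCL error is not a free substitution.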