Background: On the d.run platform with an MXC500-32G, fine-tuning fails with `TypeError: Input tensor data type is not supported for NCCL process group: BFloat16`.
System environment:
OS version : Ubuntu 22.04.3 LTS
Kernel : 5.15.0-58-generic
IP address : 10.233.81.148
Hostname : ins-m7p8w-698894dd4d-p4f5r
CPU model : Intel(R) Xeon(R) Platinum 8460Y+
CPU threads : 8 C
Memory : 133 MB / 98304 MB (0.14% used)
GPU : NO GPU detected
CUDA : NO CUDA detected
(ms-swift) root@ins-m7p8w-698894dd4d-p4f5r:~/data# mx-smi
mx-smi version: 2.1.9
=================== MetaX System Management Interface Log ===================
Timestamp : Mon Jul 7 09:47:37 2025
Attached GPUs : 1
+---------------------------------------------------------------------------------+
| MX-SMI 2.1.9 Kernel Mode Driver Version: 2.9.8 |
| MACA Version: 2.31.0.6 BIOS Version: 1.20.3.0 |
|------------------------------------+---------------------+----------------------+
| GPU NAME | Bus-id | GPU-Util |
| Temp Power | Memory-Usage | |
|====================================+=====================+======================|
| 0 MXC500 VF | 0000:38:00.1 | 0% |
| N/A N/A | 618/32512 MiB | |
+------------------------------------+---------------------+----------------------+
+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| no process found |
+---------------------------------------------------------------------------------+
End of Log
Environment setup:
conda create --prefix=/root/data/envs/ms-swift python=3.10 -y
conda activate /root/data/envs/ms-swift
pip install 'ms-swift'
Then install the MetaX-compiled packages:
apex==0.1+metax2.32.0.3
torch==2.6.0+metax2.32.0.3
torchaudio==2.4.1+metax2.32.0.3
torchvision==0.15.1+metax2.32.0.3
triton==3.0.0+metax2.32.0.3
flash_attn
## Install the MetaX-specific packages
pip install -r requirements.txt -i https://repos.metax-tech.com/r/maca-pypi/simple --trusted-host repos.metax-tech.com --no-build-isolation
The fine-tuning script is given below.
Problem description
The main error is `TypeError: Input tensor data type is not supported for NCCL process group: BFloat16`.
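For context, bfloat16 is simply the top 16 bits of an IEEE-754 float32: the same sign bit and 8-bit exponent, with the significand truncated to 7 explicit bits. The sketch below illustrates this in pure Python; the helper names are invented for illustration, and it truncates rather than rounding to nearest-even as real libraries typically do.

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Truncate a float32 to its top 16 bits (the bfloat16 pattern).

    Note: this sketch truncates; production converters usually
    round-to-nearest-even. Helper names are invented for illustration.
    """
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16

def bfloat16_bits_to_float(bits16: int) -> float:
    """Re-expand a 16-bit bfloat16 pattern back to a Python float."""
    (val,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return val

# Same 8-bit exponent as float32, so huge magnitudes survive:
print(bfloat16_bits_to_float(float32_to_bfloat16_bits(3.0e38)))
# ...but only 7 explicit significand bits, so precision is coarse:
print(bfloat16_bits_to_float(float32_to_bfloat16_bits(1.2345678)))
```

Because bfloat16 is "just" a truncated float32, GPU compute kernels can support it easily, while a collective-communication backend still needs an explicit dtype mapping for it, which is what the error above is complaining about.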
Running the following directly in a new Jupyter Lab terminal:
swift sft \
--model "/root/data/internlm2_5-1_8b-chat" \
--train_type lora \
--dataset "/root/data/datasets/output.jsonl" \
--model_type internlm2 \
--torch_dtype bfloat16 \
--num_train_epochs 4 \
--per_device_train_batch_size 4 \
--learning_rate 5e-5 \
--warmup_ratio 0.1 \
--split_dataset_ratio 0 \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--gradient_accumulation_steps 2 \
--save_steps 2000 \
--save_total_limit 5 \
--gradient_checkpointing_kwargs '{"use_reentrant": false}' \
--logging_steps 5 \
--max_length 2048 \
--output_dir ./swift_output/InternLM2.5-1.8B-Lora \
--dataloader_num_workers 256 \
--model_author JimmyMa99 \
--model_name InternLM2.5-1.8B-Lora
the error does not occur. (Since the message refers to an NCCL process group, the failure presumably only surfaces when the platform launches training as a distributed job, not in a plain single-process run.)
After installing the MetaX-specific torch and the other packages, I tested whether bfloat16 is supported; the result is as follows:
(ms-swift) root@ins-m7p8w-698894dd4d-p4f5r:~/data# python
Python 3.10.18 | packaged by conda-forge | (main, Jun 4 2025, 14:45:41) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_bf16_supported())
True
>>>
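Note that `torch.cuda.is_bf16_supported()` only reports whether the device's compute kernels can handle bfloat16; it says nothing about the collective-communication backend, which is where the error is raised. If the workaround ends up being a fall back to float16 (assuming the trainer accepts `--torch_dtype float16`), be aware of the range trade-off: float16 overflows at 65504, while bfloat16 keeps float32's exponent range. A pure-Python sketch of the difference, modeling bfloat16 as a truncated float32 (an assumption of this sketch, not the MetaX implementation):

```python
import struct

# float16 (IEEE half) tops out at 65504, so a modest fp32-scale value
# cannot even be packed:
overflowed = False
try:
    struct.pack("e", 1.0e5)  # 1e5 > float16 max (65504)
except (OverflowError, struct.error):
    overflowed = True
print("float16 overflow:", overflowed)

# bfloat16 (modeled here as the top 16 bits of a float32) keeps
# float32's 8-bit exponent, so the same magnitude is representable,
# just with coarser precision:
(bits32,) = struct.unpack("<I", struct.pack("<f", 1.0e5))
bf16_value = struct.unpack("<f", struct.pack("<I", (bits32 >> 16) << 16))[0]
print("bfloat16 round-trip of 1e5:", bf16_value)
```

This is why bf16 training typically needs no loss scaling while fp16 training does; switching dtypes to dodge the NCCL error is not a free substitution.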