I recently hit the following error while running SFT on chatglm2. I was training with bf16 under DeepSpeed ZeRO-3, because I was worried that fp16 would produce a lot of NaNs.
File "/home/suser/.conda/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    return func(*args, **kwargs)
File "/home/suser/.conda/envs/llm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2532, in all_gather_into_tensor
    result = forward_call(*args, **kwargs)
File "/home/suser/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 805, in forward
    inputs_embeds = self.embedding(input_ids)
File "/home/suser/.conda/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    work = group._allgather_base(output_tensor, input_tensor)
RuntimeError: output tensor must have the same type as input tensor
Solution
Adding a bf16 section to the ZeRO stage-3 config fixes it. Under ZeRO-3, the sharded parameters are re-assembled with all_gather during the forward pass; if the DeepSpeed config does not also enable bf16, the gathered parameters end up in a different dtype than the bf16 tensors used for training, which triggers the type-mismatch error above.
{
  "bf16": {
    "enabled": true
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "fp16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
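As a quick sanity check before launching, you can verify that the config file you pass to DeepSpeed actually has bf16 enabled. This is a minimal sketch with a hypothetical helper (`check_bf16_enabled` is not a DeepSpeed API); it assumes the HuggingFace Trainer convention where "auto" is resolved from the command-line precision flags.

```python
import json

def check_bf16_enabled(config: dict) -> bool:
    # Hypothetical helper: bf16 training requires {"bf16": {"enabled": true}}
    # in the DeepSpeed config. The HF Trainer also accepts "auto" here, which
    # it resolves from the --bf16 launch flag, so we accept that too.
    value = config.get("bf16", {}).get("enabled", False)
    return value is True or value == "auto"

# Minimal subset of the stage-3 config above.
config = json.loads('{"bf16": {"enabled": true}, "zero_optimization": {"stage": 3}}')
assert check_bf16_enabled(config)
```

If the check fails, the gather in ZeRO-3 will run with mismatched dtypes and raise the RuntimeError shown above.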
References
[BUG] RuntimeError: output tensor must have the same type as input tensor