LLM Hands-on Camp, Season 2: 4. XTuner, Low-Cost Single-GPU Fine-Tuning of LLMs in Practice

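GitHub: InternLM/tutorial (书生·浦语 / InternLM LLM Hands-on Camp)
Docs: XTuner, Low-Cost Single-GPU Fine-Tuning of LLMs in Practice
Video: XTuner, Low-Cost Single-GPU Fine-Tuning of LLMs in Practice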
Intern Studio: https://studio.intern-ai.org.cn/console/instance

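The lecturer's research area is AI for ophthalmology, specifically multimodal (LLM + CV) diagnosis and treatment systems, which is a good match for my own interests.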

1. Introduction to fine-tuning


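Training on massive amounts of data yields a large pre-trained model, also called a base (foundation) model.
Without any further training or fine-tuning, if you ask it "What is lung cancer?", the model does not recognize this as a question that needs an answer. It simply continues the text with whatever best fits the distribution of its training set, much like a word-embedding lookup returns the nearest word.
Instruction fine-tuning is therefore needed so that the model understands the intent behind an instruction and gives the answer we actually want.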

For the fine-tuning data formats supported by XTuner, see this document:
https://github.com/InternLM/xtuner/blob/main/docs/zh_cn/user_guides/dataset_format.md

1.2 Instruction-following fine-tuning

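The input corpus is split into roles purely for the purpose of instruction fine-tuning, i.e. only during training. At actual use/inference time, users do not have to rewrite their input to match this template (role assignment is not done by the user).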

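The system template is configured by you when the conversation is started.
Whatever the user types is automatically placed in the user field.
The model's raw output still contains chat-template markers such as <|Bot|>; stripping them gives the text that is shown to the user.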
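The three points above can be sketched in a few lines. This is only an illustration of the idea, not XTuner's implementation: the notes mention only the <|Bot|> marker, so the system and user tags and the helper names below are assumptions.

```python
# Hypothetical template strings for illustration; only <|Bot|> appears in the notes.
SYSTEM_TMPL = "<|System|>:{system}\n"
USER_TMPL = "<|User|>:{user}\n"
BOT_TAG = "<|Bot|>:"


def build_prompt(system: str, user: str) -> str:
    """Wrap the configured system text and the raw user input in role tags."""
    return SYSTEM_TMPL.format(system=system) + USER_TMPL.format(user=user) + BOT_TAG


def strip_template(raw_output: str) -> str:
    """Keep only the text after the last bot tag, i.e. what the user should see."""
    return raw_output.rsplit(BOT_TAG, 1)[-1].strip()


prompt = build_prompt("You are a medical assistant.", "What is lung cancer?")
# Pretend the model continued the prompt with an answer.
raw = prompt + " Lung cancer is a malignant tumor of the lung."
reply = strip_template(raw)
print(reply)
```

The user only ever supplies the plain question; the wrapping and unwrapping happen inside the chat program.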

This is where instruction-following fine-tuning differs from incremental pre-training fine-tuning.

1.1 Incremental pre-training fine-tuning

This is really just standard natural language processing:

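data is the input
label is the desired output
In many-to-many NLP tasks (the input is a sequence and so is the output), inference is kicked off by feeding a start token as the input at time t=0 to start the recurrent network. The output therefore really consists of only two parts: output + <s>.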

For incremental fine-tuning:

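The data contains no question (user role), no answer (Bot role), and no background context (system role).
It is just plain sentences, so the corpus does not need to be split into roles; the raw data can be processed directly.
This is equivalent to leaving the system and user parts of the earlier template empty and putting the data into the answer part of the Bot role.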
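As a rough sketch of that last point, the two kinds of training records could be built like this. The field names (conversation, system, input, output) are modeled on the XTuner dataset format document linked earlier and should be treated as assumptions; the helper functions are hypothetical.

```python
import json


def to_pretrain_record(text: str) -> dict:
    """Incremental pre-training: no roles, raw text goes in the output slot."""
    return {"conversation": [{"input": "", "output": text}]}


def to_instruction_record(system: str, question: str, answer: str) -> dict:
    """Instruction fine-tuning: every role slot is filled."""
    return {"conversation": [{"system": system, "input": question, "output": answer}]}


corpus = ["Lung cancer is a malignant tumor that starts in the lung."]
pretrain_records = [to_pretrain_record(t) for t in corpus]
sft_record = to_instruction_record(
    "You are a medical assistant.",
    "What is lung cancer?",
    "A malignant tumor that starts in the lung.",
)
print(json.dumps(pretrain_records[0], ensure_ascii=False))
print(json.dumps(sft_record, ensure_ascii=False))
```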

1.3 Fine-tuning principles in XTuner: LoRA and QLoRA

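The explanation in the lecture is not very clear (and may even be slightly wrong?). I recommend reading "图解大模型微调系列之：大模型低秩适配器LoRA（原理篇）" (Illustrated LLM Fine-Tuning Series: the Low-Rank Adapter LoRA, Principles), which is excellent! 🥳🥳🥳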
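For readers who want the core idea without the article, here is a minimal pure-Python sketch of a LoRA layer, assuming the usual formulation y = Wx + (alpha / r) * B(Ax) with W frozen and only the small matrices A and B trained. All sizes and names are illustrative and unrelated to any real model. QLoRA follows the same scheme but additionally keeps the frozen W in a 4-bit quantized form.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def param_counts(d_out, d_in, r):
    """Trainable parameters: full fine-tuning vs. LoRA for one weight matrix."""
    return d_out * d_in, r * (d_in + d_out)


d_out, d_in, r, alpha = 6, 8, 2, 4
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # r x d_in, small random init
B = [[0.0] * r for _ in range(d_out)]                               # d_out x r, zero init
x = [rng.gauss(0, 1) for _ in range(d_in)]

y = lora_forward(W, A, B, x, alpha, r)
full, lora = param_counts(4096, 4096, 64)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.3%}")
```

Because B starts at zero, the adapter initially leaves the base model's output unchanged, and for a 4096 x 4096 matrix with r = 64 the adapter holds only about 3% as many parameters as the matrix itself.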


2. XTuner

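Note that although XTuner is promoted alongside OpenMMLab, its documentation is not unified with OpenMMLab's.
XTuner currently has no web documentation, only Markdown files:
https://github.com/InternLM/xtuner/blob/main/docs/zh_cn/user_guides/dataset_prepare.md
A bit bare-bones...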


When training finishes, you get a LoRA model (i.e. the Adapter model).
When loading the model, you need to load the Adapter (LoRA) model in addition to the base model.

This is similar to a plugin in ChatGPT.

This has little to do with XTuner itself: XTuner simply supports ZeRO, which actually comes from the DeepSpeed library (microsoft/DeepSpeed).


3. Hands-on practice

Just follow the documentation; it is pretty much foolproof.

3.1 config

xtuner list-cfg
[2024-02-20 21:39:31,194] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-20 21:40:02,552] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
==========================CONFIGS===========================
baichuan2_13b_base_qlora_alpaca_e3
baichuan2_13b_base_qlora_alpaca_enzh_e3
baichuan2_13b_base_qlora_alpaca_enzh_oasst1_e3
baichuan2_13b_base_qlora_alpaca_zh_e3
baichuan2_13b_base_qlora_arxiv_gentitle_e3
baichuan2_13b_chat_qlora_open_platypus_e3
baichuan2_7b_base_qlora_open_platypus_e3
baichuan2_7b_base_qlora_sql_e3
baichuan2_7b_chat_qlora_alpaca_e3
baichuan2_7b_chat_qlora_alpaca_enzh_e3

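How to read a config file name:

13b means the model has 13 billion parameters
chat vs base: names containing base, or containing neither chat nor base, are base models; chat means an instruction-tuned model
qlora vs lora indicates the fine-tuning method
alpaca, sql, etc. are the datasets used for fine-tuning
e3 means epoch=3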
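The naming rules above can be turned into a small parser. This is a heuristic sketch based only on the names listed in these notes, not an official XTuner utility; parse_cfg_name is a hypothetical helper.

```python
import re


def parse_cfg_name(name: str) -> dict:
    """Split an XTuner-style config name into its parts (heuristic, not official)."""
    tokens = name.split("_")
    # The fine-tuning method separates the model description from the dataset.
    method_idx = next((i for i, t in enumerate(tokens) if t in ("lora", "qlora")), None)
    if method_idx is None or not re.fullmatch(r"e\d+", tokens[-1]):
        raise ValueError(f"unrecognised config name: {name}")
    head = tokens[:method_idx]
    size = next((t for t in head if re.fullmatch(r"\d+b", t)), None)
    return {
        "family": "_".join(t for t in head if t != size and t not in ("base", "chat")),
        "params": int(size[:-1]) * 10**9 if size else None,  # "13b" -> 13 billion
        "variant": "chat" if "chat" in head else "base",      # no marker means base
        "method": tokens[method_idx],
        "dataset": "_".join(tokens[method_idx + 1:-1]),
        "epochs": int(tokens[-1][1:]),
    }


info = parse_cfg_name("baichuan2_13b_base_qlora_alpaca_enzh_oasst1_e3")
print(info)
```

The same function also handles names where chat comes before the size, such as internlm_chat_7b_qlora_oasst1_e3.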

These configs also largely correspond to the following documents:

https://github.com/InternLM/xtuner/blob/main/docs/zh_cn/user_guides/finetune.md
https://github.com/InternLM/xtuner/blob/main/docs/zh_cn/user_guides/chat.md

3.2 Procedure

Copy the config file

$ xtuner copy-cfg internlm_chat_7b_qlora_oasst1_e3 .
[2024-02-20 22:05:51,716] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-20 22:06:26,565] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Copy to ./internlm_chat_7b_qlora_oasst1_e3_copy.py

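I do not recommend using the xtuner command to copy or list configs: it is very slow and prints extra log messages, probably because it does some other work before running.

Instead, find xtuner/xtuner/configs/internlm/internlm_chat_7b/internlm_chat_7b_qlora_oasst1_e3.py in the xtuner repo directory and copy it by hand.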

Pre-trained model weights
On the InternStudio platform you can simply create a symlink

ln -s /share/temp/model_repos/internlm-chat-7b ~/ft-oasst1/
# or copy it
cp -r /share/temp/model_repos/internlm-chat-7b ~/ft-oasst1/

Dataset
On the InternStudio platform you can copy it directly

cd ~/ft-oasst1
cp -r /root/share/temp/datasets/openassistant-guanaco .  # not actually large, but the network to Hugging Face may be unreliable
(internlm-demo) root@intern-studio-052101:~/ft-oasst1/openassistant-guanaco# ls -lh
total 21M
-rw-r--r-- 1 root root 1.1M Feb 20 22:13 openassistant_best_replies_eval.jsonl
-rw-r--r-- 1 root root  20M Feb 20 22:13 openassistant_best_replies_train.jsonl

Check the file structure

~/ft-oasst1# tree -L 1
.
├── internlm-chat-7b -> /share/temp/model_repos/internlm-chat-7b # pre-trained model
├── internlm_chat_7b_qlora_oasst1_e3_copy.py # config file
└── openassistant-guanaco # dataset

Training

xtuner train ./internlm_chat_7b_qlora_oasst1_e3_copy.py --deepspeed deepspeed_zero2
// Judging from the printed logs, XTuner is also built on MMEngine
The map step converts the data into the chat-template format required for instruction training
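To make the map step more concrete, here is a sketch of what such a conversion could look like. It assumes each raw record has a text field of the form "### Human: ...### Assistant: ..." and produces input/output pairs; both the field names and the function are assumptions for illustration, not XTuner's actual map function.

```python
def to_conversation(example: dict) -> dict:
    """Turn one raw '### Human / ### Assistant' record into input/output turns."""
    turns = []
    for chunk in example["text"].split("###"):
        chunk = chunk.strip()
        if chunk.startswith("Human:"):
            turns.append(("input", chunk[len("Human:"):].strip()))
        elif chunk.startswith("Assistant:"):
            turns.append(("output", chunk[len("Assistant:"):].strip()))
    conversation = []
    # Pair each question with the answer that follows it; an unanswered
    # trailing question is dropped because there is nothing to learn from it.
    for (role_a, text_a), (role_b, text_b) in zip(turns[::2], turns[1::2]):
        if role_a == "input" and role_b == "output":
            conversation.append({"input": text_a, "output": text_b})
    return {"conversation": conversation}


raw = {
    "text": "### Human: What is lung cancer?"
            "### Assistant: A malignant tumor of the lung."
            "### Human: Is it treatable?"
}
converted = to_conversation(raw)
print(converted)
```

The chat template (role tags) is then applied on top of these pairs, so the raw dataset itself never needs to contain template markers.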

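Without DeepSpeed, on the A100 (1/4) configuration with 1 epoch, the training log shows eta: 3:54:38, i.e. one epoch takes almost 4 hours.
With DeepSpeed, on the same A100 (1/4) configuration with 1 epoch, the log shows eta: 1:50:32. That cuts the estimated time by more than half (about 53%), which is impressive.
After that, just wait.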
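The comparison can be checked directly from the two ETA strings in the logs. This is plain arithmetic on the numbers quoted above; eta_to_seconds is a hypothetical helper.

```python
def eta_to_seconds(eta: str) -> int:
    """Convert an 'H:MM:SS' ETA string from the training log to seconds."""
    hours, minutes, seconds = (int(part) for part in eta.split(":"))
    return hours * 3600 + minutes * 60 + seconds


baseline = eta_to_seconds("3:54:38")  # without DeepSpeed
with_ds = eta_to_seconds("1:50:32")   # with --deepspeed deepspeed_zero2

reduction = 1 - with_ds / baseline
speedup = baseline / with_ds
print(f"time reduction: {reduction:.1%}, speedup: {speedup:.2f}x")
# prints "time reduction: 52.9%, speedup: 2.12x"
```

These are ETAs reported early in training rather than measured wall-clock times, so the figures are estimates.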

After training, the directory looks roughly like this:

-- work_dirs
    `-- internlm_chat_7b_qlora_oasst1_e3_copy
        |-- 20231101_152923
        |   |-- 20231101_152923.log
        |   `-- vis_data
        |       |-- 20231101_152923.json
        |       |-- config.py
        |       `-- scalars.json
        |-- epoch_1.pth
        |-- epoch_2.pth
        |-- epoch_3.pth
        |-- internlm_chat_7b_qlora_oasst1_e3_copy.py
        |-- last_checkpoint

One checkpoint file is saved per epoch, because training is very slow.

Convert the resulting PTH model to a HuggingFace model, i.e. generate the Adapter folder

xtuner convert pth_to_hf ${CONFIG_NAME_OR_PATH} ${PTH_file_dir} ${SAVE_PATH}
// The actual commands used
cd ~/ft-oasst1/
mkdir hf
export MKL_SERVICE_FORCE_INTEL=1
export MKL_THREADING_LAYER=GNU
xtuner convert pth_to_hf ./internlm_chat_7b_qlora_oasst1_e3_copy.py \
./work_dirs/internlm_chat_7b_qlora_oasst1_e3_copy/epoch_1.pth \
./hf
# The output looks like
Reconstructed fp32 state dict with 448 params 159907840 elements
Load PTH model from ./work_dirs/internlm_chat_7b_qlora_oasst1_e3_copy/epoch_1.pth
Convert weights to float16
Saving HuggingFace model to ./hf
/root/.conda/envs/internlm-demo/lib/python3.10/site-packages/peft/utils/save_and_load.py:148: UserWarning: Could not find a config file in ./internlm-chat-7b - will assume that the vocabulary was not modified.
  warnings.warn(
All done!
# which means the conversion succeeded

Resulting file structure

> tree -L 1 ./hf/
./hf/
├── README.md
├── adapter_config.json
├── adapter_model.safetensors
└── xtuner_config.py

0 directories, 4 files
> ls -lh ./hf
total 306M
-rw-r--r-- 1 root root 5.0K Feb 25 16:08 README.md
-rw-r--r-- 1 root root  670 Feb 25 16:08 adapter_config.json
-rw-r--r-- 1 root root 306M Feb 25 16:08 adapter_model.safetensors  
# it really is a model file, 306M
-rw-r--r-- 1 root root 6.1K Feb 25 16:08 xtuner_config.py
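The 306M figure is consistent with the conversion log, which reports 159907840 elements converted to float16. A quick sanity check, assuming 2 bytes per float16 element:

```python
def tensor_mib(elements: int, bytes_per_element: int) -> float:
    """Size in MiB of a flat tensor with the given element count and width."""
    return elements * bytes_per_element / 1024**2


elements = 159_907_840            # "Reconstructed fp32 state dict ... 159907840 elements"
fp16_mib = tensor_mib(elements, 2)  # adapter is saved as float16
fp32_mib = tensor_mib(elements, 4)  # what it would be if kept in float32
print(f"fp16: {fp16_mib:.1f} MiB, fp32: {fp32_mib:.1f} MiB")
```

The weights alone come to 305 MiB in float16; the small remainder up to the 306M shown by ls -lh is presumably file metadata plus rounding.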

At this point, the hf folder is what we usually call the "LoRA model files".

Put simply: LoRA model files = Adapter

Merge the HuggingFace adapter into the LLM: base model + Adapter

xtuner convert merge ./internlm-chat-7b ./hf ./merged --max-shard-size 2GB
# xtuner convert merge \
#     ${NAME_OR_PATH_TO_LLM} \       (path to the base model)
#     ${NAME_OR_PATH_TO_ADAPTER} \   (path to the adapter)
#     ${SAVE_PATH} \
#     --max-shard-size 2GB           (save in shards of at most 2 GB each)
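Conceptually, merging folds the adapter into the base weights so that no separate adapter is needed at inference time. The sketch below shows the standard LoRA merge, W' = W + (alpha / r) * B A, on a toy 2 x 2 matrix and checks that the merged weight gives the same output as base plus adapter. It illustrates the idea only and is not XTuner's code; all values are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B A, a single matrix with the adapter baked in."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight
A = [[0.5, -1.0]]             # r x d_in with r = 1
B = [[2.0], [0.0]]            # d_out x r
alpha, r = 2, 1
x = [1.0, 1.0]

merged = merge_lora(W, A, B, alpha, r)
y_merged = matvec(merged, x)
# Same input through base + adapter, kept separate.
y_adapter = [b + (alpha / r) * u
             for b, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]
print(merged, y_merged, y_adapter)
```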

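The merged model's directory structure is essentially the same as that of the base model used earlier (printed in "LLM Hands-on Camp, Season 2: 2. InternLM Fun Demos", section 2.2.3 Model download); it is simply the format used for publishing to Hugging Face.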

tree -L 1 ./merged/
./merged/
├── added_tokens.json
├── config.json
├── configuration_internlm.py
├── generation_config.json
├── modeling_internlm.py
├── pytorch_model-00001-of-00008.bin
├── pytorch_model-00002-of-00008.bin
├── pytorch_model-00003-of-00008.bin
├── pytorch_model-00004-of-00008.bin
├── pytorch_model-00005-of-00008.bin
├── pytorch_model-00006-of-00008.bin
├── pytorch_model-00007-of-00008.bin
├── pytorch_model-00008-of-00008.bin
├── pytorch_model.bin.index.json
├── special_tokens_map.json
├── tokenization_internlm.py
├── tokenizer.model
└── tokenizer_config.json

0 directories, 18 files

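Chat test

# Chat with the merged model (base + Adapter, Float 16)
export MKL_SERVICE_FORCE_INTEL=1 # this must be run first for the command below to work
xtuner chat ./merged --prompt-template internlm_chat
# chat: launches the chat script
# ./merged: use the model in this folder
# --prompt-template internlm_chat: use the internlm_chat chat template; the template must match the model the fine-tuning was based on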
After launching, you can start chatting. Replies are slow with the full 16-bit model; 4-bit is faster.

# Load with 4-bit quantization
# xtuner chat ./merged --bits 4 --prompt-template internlm_chat

$ xtuner chat --help  # --help shows the script's arguments; it is slow to run, so the output is pasted here
usage: chat.py [-h] [--adapter ADAPTER]
               [--prompt-template {default,zephyr,internlm_chat,moss_sft,llama2_chat,code_llama_chat,chatglm2,chatglm3,qwen_chat,baichuan_chat,baichuan2_chat,wizardlm}]
               [--system SYSTEM | --system-template {moss_sft,alpaca,arxiv_gentile,colorist,coder,lawyer,medical,sql}] [--bits {4,8,None}]
               [--bot-name BOT_NAME] [--with-plugins {calculate,solve,search} [{calculate,solve,search} ...]] [--no-streamer] [--lagent]
               [--command-stop-word COMMAND_STOP_WORD] [--answer-stop-word ANSWER_STOP_WORD] [--offload-folder OFFLOAD_FOLDER]
               [--max-new-tokens MAX_NEW_TOKENS] [--temperature TEMPERATURE] [--top-k TOP_K] [--top-p TOP_P] [--seed SEED]
               model_name_or_path

Chat with a HF model

positional arguments:
  model_name_or_path    Hugging Face model name or path

options:
  -h, --help            show this help message and exit
  --adapter ADAPTER     adapter name or path
  --prompt-template {default,zephyr,internlm_chat,moss_sft,llama2_chat,code_llama_chat,chatglm2,chatglm3,qwen_chat,baichuan_chat,baichuan2_chat,wizardlm}
                        Specify a prompt template
  --system SYSTEM       Specify the system text
  --system-template {moss_sft,alpaca,arxiv_gentile,colorist,coder,lawyer,medical,sql}
                        Specify a system template
  --bits {4,8,None}     LLM bits
  --bot-name BOT_NAME   Name for Bot
  --with-plugins {calculate,solve,search} [{calculate,solve,search} ...]
                        Specify plugins to use
  --no-streamer         Whether to with streamer
  --lagent              Whether to use lagent
  --command-stop-word COMMAND_STOP_WORD
                        Stop key
  --answer-stop-word ANSWER_STOP_WORD
                        Stop key
  --offload-folder OFFLOAD_FOLDER
                        The folder in which to offload the model weights (or where the model weights are already offloaded).
  --max-new-tokens MAX_NEW_TOKENS
                        Maximum number of new tokens allowed in generated text
  --temperature TEMPERATURE
                        The value used to modulate the next token probabilities.
  --top-k TOP_K         The number of highest probability vocabulary tokens to keep for top-k-filtering.
  --top-p TOP_P         If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or
                        higher are kept for generation.
  --seed SEED           Random seed for reproducible text generation