LLM PEFT (Part 1): Learning Notes on Inference in Practice

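1. Introduction

LLaMA Factory offers:
Multiple models: LLaMA, Mistral, Mixtral-MoE, Qwen, Yi, Gemma, Baichuan, ChatGLM, Phi, and more.
Integrated methods: (continued) pre-training, supervised instruction fine-tuning, reward model training, PPO training, and DPO training.
Multiple precisions: 32-bit full-parameter fine-tuning, 16-bit freeze fine-tuning, 16-bit LoRA fine-tuning, and 2/4/8-bit QLoRA fine-tuning based on AQLM/AWQ/GPTQ/LLM.int8.
Advanced algorithms: GaLore, DoRA, LongLoRA, LLaMA Pro, LoftQ, and Agent tuning.
Practical tricks: FlashAttention-2, Unsloth, RoPE scaling, NEFTune, and rsLoRA.
Experiment monitoring: LlamaBoard, TensorBoard, Wandb, MLflow, and more.
Fast inference: an OpenAI-style API, a browser interface, and a command-line interface, based on vLLM.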

2. Model Comparison

[Figure: table of supported models]

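Notes:
The default module is used as the default value of --lora_target; you can pass --lora_target all to target all modules for better results.

For all base models, the --template argument can be any value such as default, alpaca, or vicuna. For chat (Instruct/Chat) models, however, be sure to use the corresponding template.

Make sure to use exactly the same template for training and inference.

For the full list of supported models, see constants.py.

You can also add your own chat templates in template.py.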
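To see why the template matters, here is a minimal sketch of the ChatML-style prompt format that Qwen chat models use. It is written for illustration only and is not the code in template.py; the default system prompt is an assumption.

```python
from typing import Dict, List

# Assumed default system prompt; LLaMA-Factory's qwen template may differ.
DEFAULT_SYSTEM = "You are a helpful assistant."


def to_chatml(messages: List[Dict[str, str]], system: str = DEFAULT_SYSTEM) -> str:
    """Render chat messages in ChatML style and open an assistant turn."""
    parts = [f"<|im_start|>system\n{system}<|im_end|>\n"]
    for msg in messages:
        if msg["role"] not in ("user", "assistant"):
            raise ValueError(f"unsupported role: {msg['role']}")
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    # The model continues from here until it emits <|im_end|>.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    print(to_chatml([{"role": "user", "content": "tell me a story"}]))
```

A chat model prompted with a different template never sees the <|im_start|>/<|im_end|> markers it was tuned on, which is one way to end up with odd or repetitive output.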

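3. Training Methods

[Figure: supported training methods]

4. Software and Hardware Dependencies

[Figure: dependency versions]

5. Hardware Requirements

[Table: estimated GPU memory requirements]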
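Estimates like these start from a simple rule: the weights alone take roughly (parameter count × bits per parameter / 8) bytes. The sketch below computes only that lower bound. The function name is hypothetical, and the 0.5e9 parameter count is an assumption taken from the nominal size in the name "Qwen1.5-0.5B".

```python
def weight_memory_gb(num_params: float, bits: int) -> float:
    """Memory for model weights only, in decimal GB (1 GB = 1e9 bytes).

    Ignores activations, KV cache, gradients and optimizer states, so real
    usage during training or long-context inference is higher.
    """
    if bits <= 0:
        raise ValueError("bits must be positive")
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    # 0.5e9 is the nominal size implied by the model name.
    for bits in (32, 16, 8, 4):
        print(f"{bits:>2}-bit: {weight_memory_gb(0.5e9, bits):.2f} GB")
```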

6. How to Use

6.0 Set Up the Python Environment

# Create a new environment
conda create -n py310 python=3.10
# Activate the environment
conda activate py310

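6.1 Install LLaMA Factory

# LLaMA-Factory version used in this post: commit c1fdf81df6ade5da7be4eb66b715f0efd171d5aa
# (--depth 1 fetches only the latest commit; to reproduce this exact version,
#  clone without --depth 1 and run: git checkout c1fdf81df6ade5da7be4eb66b715f0efd171d5aa)
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"

Optional extra dependencies: torch, metrics, deepspeed, bitsandbytes, vllm, galore, badam, gptq, awq, aqlm, qwen, modelscope, quality

If you run into package conflicts, you can use pip install --no-deps -e . instead.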

6.1.2 Guide for Windows Users

To enable quantized LoRA (QLoRA) on Windows, you need to install a prebuilt bitsandbytes library. The prebuilt wheels support CUDA 11.1 to 12.2; choose the release that matches your CUDA version.

pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.2.post2-py3-none-win_amd64.whl

To enable FlashAttention-2 on Windows, you need to install a prebuilt flash-attn library, which supports CUDA 12.1 to 12.2. Download the matching version from flash-attention and install it.

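6.1.3 Check Your CUDA Version

nvidia-smi

[Screenshot: nvidia-smi output]

The version is 12.2, which is within the supported range. Great. (nvidia-smi reports the highest CUDA version the installed driver supports.)

So I installed:

pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.39.1-py3-none-win_amd64.whl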
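The same check can be scripted. This is a minimal sketch that pulls the "CUDA Version: X.Y" field out of nvidia-smi's header and compares it with the 11.1 to 12.2 range quoted for the prebuilt bitsandbytes wheels; the function names are hypothetical, and the version it reads is the driver's maximum supported CUDA version, not necessarily the installed toolkit.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

# Range quoted above for the prebuilt bitsandbytes Windows wheels.
SUPPORTED_MIN = (11, 1)
SUPPORTED_MAX = (12, 2)


def parse_cuda_version(smi_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def in_supported_range(version: Tuple[int, int]) -> bool:
    # Tuples compare element by element, so (12, 1) < (12, 2) < (12, 10).
    return SUPPORTED_MIN <= version <= SUPPORTED_MAX


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found")
    else:
        out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
        ver = parse_cuda_version(out)
        print(ver, ver is not None and in_supported_range(ver))
```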

6.2 Install Dependencies

pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple --ignore-installed

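6.3 Download the Model

You can pick the model you want to fine-tune from [2. Model Comparison]. To keep things simple while learning, I chose Qwen1.5-0.5B.

git lfs install
# Clone into ./models so that the paths used in the commands below match
git clone https://huggingface.co/Qwen/Qwen1.5-0.5B models/Qwen1.5-0.5B

Haha, the download failed for me, so I ended up downloading the files one by one by hand from the web page.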
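When downloading files by hand it is easy to miss one. Below is a minimal sketch for checking a local model folder. The required file list is an assumption (a minimal set, not the full file list of Qwen1.5-0.5B), so compare it with the file list on the model page. The optional download helper uses huggingface_hub's snapshot_download, which has to be installed separately and needs network access.

```python
from pathlib import Path
from typing import Iterable, List

# Assumed minimal file set; compare with the files listed on the model page.
REQUIRED_FILES = ("config.json", "tokenizer_config.json")
WEIGHT_SUFFIXES = (".safetensors", ".bin")


def missing_files(model_dir: str, required: Iterable[str] = REQUIRED_FILES) -> List[str]:
    """Return the required files that are absent, plus a marker if no weight file exists."""
    root = Path(model_dir)
    missing = [name for name in required if not (root / name).is_file()]
    has_weights = any(p.suffix in WEIGHT_SUFFIXES for p in root.glob("*") if p.is_file())
    if not has_weights:
        missing.append("<weights: *.safetensors or *.bin>")
    return missing


def download(repo_id: str, local_dir: str) -> str:
    """Download a whole model repo (requires `pip install huggingface_hub`)."""
    # Imported lazily so that missing_files() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

For example, download("Qwen/Qwen1.5-0.5B", "models/Qwen1.5-0.5B") followed by missing_files("models/Qwen1.5-0.5B"); an empty list only means the assumed minimum is present.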

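7. Model Inference

The latest version (LLaMA-Factory commit c1fdf81df6ade5da7be4eb66b715f0efd171d5aa, the one used here) provides only three modes: api, webui, and train; cli_demo belongs to earlier versions.
However, you can use llamafactory-cli for inference.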

7.1 Inference via an OpenAI-Style API

CUDA_VISIBLE_DEVICES=0 API_PORT=8000 python src/api.py \
    --model_name_or_path ./models/Qwen1.5-0.5B \
    --template qwen
# To serve a LoRA fine-tuned checkpoint, also pass:
#   --adapter_name_or_path path_to_checkpoint --finetuning_type lora

[Screenshot: API server startup]

The API documentation is available at http://localhost:8000/docs.
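A minimal client sketch using only the Python standard library. The function names are hypothetical. The model value defaults to "string" because that is what the requests later in this post sent, and it did not appear to select a model. The post parameter exists only so the HTTP call can be swapped out (for example in tests).

```python
import json
import urllib.request
from typing import Any, Callable, Dict, List

BASE_URL = "http://localhost:8000"  # matches API_PORT=8000 in the command above

JsonPost = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def post_json(url: str, payload: Dict[str, Any], timeout: float = 120.0) -> Dict[str, Any]:
    """POST a JSON body and decode the JSON reply."""
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def chat(messages: List[Dict[str, str]], model: str = "string", post: JsonPost = post_json) -> str:
    """Send messages to /v1/chat/completions and return the first choice's text."""
    body = post(f"{BASE_URL}/v1/chat/completions", {"model": model, "messages": messages})
    return body["choices"][0]["message"]["content"]
```

With the server running, chat([{"role": "user", "content": "tell me a story"}]) returns the generated text. Because the API is OpenAI-style, the openai Python package can also be pointed at it through its base_url option.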

7.2 Inference via the Command Line

Command to start the model API:

CUDA_VISIBLE_DEVICES=0 API_PORT=8000 llamafactory-cli api \
    --model_name_or_path ./models/Qwen1.5-0.5B \
    --template qwen
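The VAR=value prefix in front of a command works in Unix-style shells such as bash, but not in Windows CMD or PowerShell. One way around that is a small Python launcher that passes the environment variables explicitly. This is a sketch; build_api_command is a hypothetical helper, and the optional adapter flags mirror the commented-out lines in 7.1.

```python
import os
from typing import Dict, List, Optional, Tuple


def build_api_command(
    model_path: str,
    template: str,
    port: int = 8000,
    device: str = "0",
    adapter_path: Optional[str] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for running `llamafactory-cli api`."""
    argv = ["llamafactory-cli", "api",
            "--model_name_or_path", model_path,
            "--template", template]
    if adapter_path is not None:
        # Only needed when serving a LoRA fine-tuned checkpoint.
        argv += ["--adapter_name_or_path", adapter_path, "--finetuning_type", "lora"]
    # Copy the current environment and add the two variables the command above sets.
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": device, "API_PORT": str(port)}
    return argv, env
```

To start the server: argv, env = build_api_command("models/Qwen1.5-0.5B", "qwen"), then subprocess.run(argv, env=env), which blocks while the server runs.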

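[Screenshot: llamafactory-cli api startup]

7.2.1 Accessing the docs: http://localhost:8000/docs

[Screenshot: API docs page]

7.2.1.1 GET request: /v1/models

Trying it with Postman:

Testing with the request tool built into the docs page: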
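The same GET request from code. A sketch that assumes the OpenAI-style list shape {"object": "list", "data": [{"id": ...}, ...]}; the exact fields the server returns are best confirmed on the /docs page. The function names are hypothetical, and fetch can be swapped out for testing.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, List


def get_json(url: str, timeout: float = 10.0) -> Dict[str, Any]:
    """GET a URL and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def list_model_ids(base_url: str = "http://localhost:8000",
                   fetch: Callable[[str], Dict[str, Any]] = get_json) -> List[str]:
    """Return the ids from GET /v1/models (assumes an OpenAI-style 'data' list)."""
    body = fetch(f"{base_url.rstrip('/')}/v1/models")
    return [item["id"] for item in body.get("data", [])]
```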

7.2.1.2 POST request: /v1/chat/completions

[Screenshot: /v1/chat/completions on the docs page]

# Request body (example value)
{"model": "string","messages": [{"role": "user","content": "string","tool_calls": [{"id": "string","type": "function","function": {"name": "string","arguments": "string"}}]}],"tools": [{"type": "function","function": {"name": "string","description": "string","parameters": {}}}],"do_sample": true,"temperature": 0,"top_p": 0,"n": 1,"max_tokens": 0,"stop": "string","stream": false
}

[Screenshot: response schemas on the docs page]

# Response: Successful Response (example value)
{"id": "string","object": "chat.completion","created": 0,"model": "string","choices": [{"index": 0,"message": {"role": "user","content": "string","tool_calls": [{"id": "string","type": "function","function": {"name": "string","arguments": "string"}}]},"finish_reason": "stop"}],"usage": {"prompt_tokens": 0,"completion_tokens": 0,"total_tokens": 0}
}

# Response: Validation Error (example value)
{"detail": [{"loc": ["string",0],"msg": "string","type": "string"}]
}
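Most fields in the request schema above are optional. This sketch builds a body that sends only the fields you actually set, so the server's defaults apply to the rest, and flattens the Validation Error shape into readable lines. Both helpers are hypothetical, and value ranges for temperature or top_p are not checked here.

```python
from typing import Any, Dict, List, Optional


def build_chat_body(messages: List[Dict[str, str]], model: str = "string",
                    temperature: Optional[float] = None, top_p: Optional[float] = None,
                    max_tokens: Optional[int] = None, stream: bool = False) -> Dict[str, Any]:
    """Build a /v1/chat/completions body, omitting optional fields left as None."""
    body: Dict[str, Any] = {"model": model, "messages": messages, "stream": stream}
    optional = {"temperature": temperature, "top_p": top_p, "max_tokens": max_tokens}
    body.update({k: v for k, v in optional.items() if v is not None})
    return body


def format_validation_error(error_body: Dict[str, Any]) -> List[str]:
    """Flatten a {'detail': [{'loc', 'msg', 'type'}]} error body into 'loc: msg' lines."""
    return [
        f"{'.'.join(str(part) for part in item.get('loc', []))}: {item.get('msg', '')}"
        for item in error_body.get("detail", [])
    ]
```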

Trying it with Postman:

[Screenshot: Postman test]

Testing with the request tool built into the docs page, using this request body:

{"model": "string","messages": [{"role": "user","content": "tell me a story"}]
}

Response:

{"id": "chatcmpl-5a4587623b494f46b190ab363ac4260a","object": "chat.completion","created": 1717041519,"model": "string","choices": [{"index": 0,"message": {"role": "assistant","content": "Sure, here's a story:\n\nOnce upon a time, there was a young girl named Lily. She loved to read and always wanted to learn more about the world around her. One day, while she was wandering through the woods, she stumbled upon a mysterious old book. The book was filled with stories and secrets that she had never heard before.\n\nAs she began to read, she found herself lost in the world of the book. She knew that the book was important, but she didn't know how to get it back. She searched for hours and hours, but she couldn't find it anywhere.\n\nJust when she thought she had lost hope, she saw a group of brave young men who were searching for treasure. They were looking for a lost treasure that had been hidden for centuries. Lily was excited to join them, but she was also scared.\n\nAs they searched the woods, they stumbled upon a secret passage that led to the treasure. But as they walked deeper into the forest, they realized that the passage was guarded by a group of trolls. The trolls were fierce and dangerous, and they knew that they couldn't leave without getting hurt.\n\nThe trolls were angry and territorial, and they wanted to take control of the treasure. They charged at Lily and her friends, but they were too late. They were outnumbered and outmatched, and they were all caught in the troll's trap.\n\nIn the end, Lily and her friends had to fight off the trolls and save the treasure. They had to work together to find a way out of the troll's trap, and they did it. They had a lot of fun and learned a lot about themselves and the world around them.","tool_calls": null},"finish_reason": "stop"}],"usage": {"prompt_tokens": 23,"completion_tokens": 339,"total_tokens": 362}
}
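A sketch for reading such a response in code: it pulls out the reply text, the finish_reason, and the usage counts, and checks that total_tokens equals prompt_tokens + completion_tokens (23 + 339 = 362 in the response above). The function name is hypothetical.

```python
import json
from typing import Any, Dict, Tuple


def parse_completion(raw: str) -> Tuple[str, str, Dict[str, int]]:
    """Return (content, finish_reason, usage) from a chat.completion JSON string."""
    body: Dict[str, Any] = json.loads(raw)
    if body.get("object") != "chat.completion":
        raise ValueError(f"unexpected object type: {body.get('object')!r}")
    choice = body["choices"][0]
    usage = body.get("usage", {})
    # total_tokens should equal prompt_tokens + completion_tokens.
    if usage and usage["total_tokens"] != usage["prompt_tokens"] + usage["completion_tokens"]:
        raise ValueError("inconsistent token usage")
    return choice["message"]["content"], choice["finish_reason"], usage
```

In OpenAI-style APIs, a finish_reason of "length" instead of "stop" usually means the reply was cut off by max_tokens.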

Server terminal log:
[Screenshot: server terminal log]

7.3 Inference in the Browser

CUDA_VISIBLE_DEVICES=0 python src/webui.py \
    --model_name_or_path ./models/Qwen1.5-0.5B \
    --template qwen
# For a LoRA fine-tuned checkpoint, also pass:
#   --adapter_name_or_path path_to_checkpoint --finetuning_type lora

[Screenshot: web UI]

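Parameter overview:

--model_name_or_path: the model name (the standard identifier on Hugging Face or ModelScope, e.g. "meta-llama/Meta-Llama-3-8B-Instruct") or the path of a locally downloaded model, e.g. the absolute path /media/codingma/LLM/llama3/Meta-Llama-3-8B-Instruct or the relative path ./models/Qwen1.5-0.5B.
--template: the prompt template used when chatting with the model. It differs from model to model; see https://github.com/hiyouga/LLaMA-Factory?tab=readme-ov-file#supported-models for each model's template, otherwise the answers may be strange or the model may keep repeating itself. Chat models almost always need it: for example, the template for Meta-Llama-3-8B-Instruct is llama3, and for Qwen models it is qwen.
--finetuning_type: the fine-tuning method, e.g. lora.
--adapter_name_or_path: the location of the fine-tuned weights, e.g. the LoRA adapter directory.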