LLM in Practice: Unsloth + Llama3, a Speed-Up Tool for LLM Fine-Tuning

1. Background

With the May Day holiday over, yours truly (qiang~) has dived back into the ocean of LLM technology. This installment introduces a fine-tuning accelerator for LLMs: Unsloth.

As Unsloth's official tagline puts it: "Easily finetune & train LLMs; Get faster with unsloth." Fine-tuning an LLM with it is significantly faster, and GPU memory usage is also noticeably reduced.

One thing worth noting: the open-source part of Unsloth currently only supports single-machine fine-tuning; the more efficient fine-tuning is only available as a paid offering, Unsloth Pro.

2. Unsloth Overview

2.1 Key Features

(1) All kernels are implemented in OpenAI's Triton language, with the backward-propagation engine written by hand. Triton is a language oriented toward accelerating LLM training.

(2) Zero loss of accuracy: there are no approximation methods, and the computation is exactly the same.

(3) No hardware changes are needed. NVIDIA GPUs from 2018 onward are supported (V100, T4, Titan V, RTX 20/30/40 series, A100, H100, L40, etc.; GTX 1070/1080 also work, just more slowly). The minimum CUDA compute capability is 7.0 (a quick check is sketched after this list).

(4) Works on Linux, and on Windows via WSL.

(5) Built on the bitsandbytes package, with support for 4-bit and 16-bit QLoRA/LoRA fine-tuning.

(6) The open-source code gives about a 5x improvement in training efficiency; Unsloth Pro pushes this up to 30x.
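As a quick way to check the hardware requirement in point (3) above, the GPU's compute capability can be read directly from PyTorch. This is a minimal sketch and is not part of Unsloth itself:

import torch

# Unsloth's stated minimum is NVIDIA compute capability 7.0 (V100/T4/RTX 20-series and newer)
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
if (major, minor) < (7, 0):
    print("This GPU is below the minimum compute capability that Unsloth supports.")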

2.2 Currently Supported Models

Because the underlying operators have to be rewritten in Triton, adapting some open-source models may take a fairly long time. Models currently supported by Unsloth include Qwen1.5 (7B, 14B, 32B, 72B), Llama3-8B, Mistral-7B, Gemma-7B, Phi-3 (3.8B), and TinyLlama, along with recipes such as ORPO and DPO Zephyr.

2.3 Acceleration Results

The Qwen1.5-7B integration was packaged and verified by the author of Firefly, with a 30%+ performance improvement and a 40%+ reduction in GPU memory usage; see the link in the References section.

2.4 Installation

conda create --name unsloth_env python=3.10
conda activate unsloth_env
conda install pytorch-cuda=<12.1/11.8> pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
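After installation, a quick sanity check (a minimal sketch, assuming the conda environment above is active) is to confirm that PyTorch sees a CUDA device and that Unsloth imports cleanly:

import torch
from unsloth import FastLanguageModel  # raises ImportError if unsloth did not install correctly

print(torch.__version__, torch.version.cuda)  # PyTorch build and the CUDA version it was compiled against
print(torch.cuda.is_available())              # True when a CUDA-capable GPU is visible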

3. Hands-On

In the spirit of "seeing something a thousand times is no match for doing it once," yours truly (qiang~) put together a comparison experiment for Unsloth. The test environments were P40, A40, and A800 GPUs, and the model compared was the fresh-out-of-the-oven Llama3 (8B).

3.1 Comparison Dimensions

Dimension                      Description
GPU                            whether bf16 is supported
Max sequence length            max_seq_length
Batch size                     per_device_train_batch_size
Gradient accumulation steps    gradient_accumulation_steps
LoRA rank                      rank
Dropout                        lora_dropout
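Each dimension in the table corresponds to a keyword argument of the training helpers defined in section 3.2 below; for example, one of the configurations actually run in this post looks like this (a sketch; train_unsloth is defined in the source code that follows):

# one experiment configuration, expressed via the dimensions in the table above
train_unsloth(
    dtype=torch.bfloat16,            # the GPU supports bf16
    max_seq_length=2048,             # max sequence length
    per_device_train_batch_size=4,   # batch size
    gradient_accumulation_steps=4,   # gradient accumulation steps
    rank=64,                         # LoRA rank
    lora_dropout=0.05)               # dropout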

3.2 Source Code

from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments, TextStreamer, AutoModelForCausalLM, set_seed, AutoTokenizer, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training
import gc

set_seed(42)

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


def train_unsloth(dtype, max_seq_length, per_device_train_batch_size, gradient_accumulation_steps, rank,
                  lora_alpha=16, lora_dropout=0, max_steps=50, save_steps=50, seed=42, warmup_steps=5,
                  learning_rate=2e-4, logging_steps=5):
    """Fine-tune with Unsloth."""
    print(f'dtype:{dtype}, max_seq_length:{max_seq_length}, per_device_train_batch_size:{per_device_train_batch_size}, '
          f'gradient_accumulation_steps:{gradient_accumulation_steps}, rank:{rank}, lora_dropout:{lora_dropout}')

    # Load the base model in 4-bit via Unsloth's FastLanguageModel
    load_in_4bit = True
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name='pretrain_models/llama/llama3-8B-Instruct',
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit)

    # Attach LoRA adapters
    model = FastLanguageModel.get_peft_model(
        model,
        r=rank,
        target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'],
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        bias='none',
        use_gradient_checkpointing=True,
        random_state=seed,
        use_rslora=False)

    EOS_TOKEN = tokenizer.eos_token

    def formatting_prompts_func(examples):
        instructions = examples["instruction"]
        inputs = examples["input"]
        outputs = examples["output"]
        texts = []
        for instruction, input, output in zip(instructions, inputs, outputs):
            # Must add EOS_TOKEN, otherwise your generation will go on forever!
            text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
            texts.append(text)
        return {"text": texts}

    dataset = load_dataset("yahma/alpaca-cleaned", split="train")
    dataset = dataset.map(formatting_prompts_func, batched=True)

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field='text',
        max_seq_length=max_seq_length,
        packing=False,
        args=TrainingArguments(
            per_device_train_batch_size=per_device_train_batch_size,
            gradient_accumulation_steps=gradient_accumulation_steps,
            warmup_steps=warmup_steps,
            learning_rate=learning_rate,
            fp16=not torch.cuda.is_bf16_supported(),
            bf16=torch.cuda.is_bf16_supported(),
            logging_steps=logging_steps,
            optim='adamw_8bit',
            weight_decay=0.01,
            lr_scheduler_type='linear',
            seed=seed,
            output_dir='output/llame3-8b-instruct-unsloth',
            save_steps=save_steps,
            max_steps=max_steps))

    # Report GPU memory before and after training
    gpu_stats = torch.cuda.get_device_properties(0)
    start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
    print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
    print(f"{start_gpu_memory} GB of memory reserved.")

    trainer_stats = trainer.train()

    used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
    used_percentage = round(used_memory / max_memory * 100, 3)
    lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
    print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
    print(f"{round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes used for training.")
    print(f"Peak reserved memory = {used_memory} GB.")
    print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
    print(f"Peak reserved memory % of max memory = {used_percentage} %.")
    print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

    # Save the LoRA adapters locally
    model.save_pretrained("output/llame3-8b-instruct-unsloth-lora")
    tokenizer.save_pretrained("output/llame3-8b-instruct-unsloth-lora")
    # model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")  # Merge to 16bit
    # model.save_pretrained_merged("model", tokenizer, save_method="merged_4bit")   # Merge to 4bit
    # model.save_pretrained_merged("model", tokenizer, save_method="lora")          # Just LoRA adapters
    # model.save_pretrained_gguf("model", tokenizer)                                # Save to 8bit Q8_0
    # model.save_pretrained_gguf("model", tokenizer, quantization_method="f16")     # Save to 16bit GGUF
    # model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")  # Save to q4_k_m GGUF

    # Free GPU memory before the next experiment
    del model
    del tokenizer
    torch.cuda.empty_cache()
    for _ in range(3):
        gc.collect()


def train_trans(dtype, max_seq_length, per_device_train_batch_size, gradient_accumulation_steps, rank,
                lora_alpha=16, lora_dropout=0, max_steps=50, save_steps=50, seed=42, warmup_steps=5,
                learning_rate=2e-4, logging_steps=5):
    """Fine-tune with the plain transformers + peft stack (baseline)."""
    print(f'dtype:{dtype}, max_seq_length:{max_seq_length}, per_device_train_batch_size:{per_device_train_batch_size}, '
          f'gradient_accumulation_steps:{gradient_accumulation_steps}, rank:{rank}, lora_dropout:{lora_dropout}')

    model_path = 'pretrain_models/llama/llama3-8B-Instruct'
    tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side='right', model_max_length=8192)
    tokenizer.add_special_tokens({"pad_token": '<|reserved_special_token_250|>'})
    tokenizer.pad_token = '<|reserved_special_token_250|>'

    # 4-bit NF4 quantization via bitsandbytes
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=dtype,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        llm_int8_threshold=6.0,
        llm_int8_has_fp16_weight=False)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=dtype,
        quantization_config=quantization_config)
    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
    model.enable_input_require_grads()

    config = LoraConfig(
        r=rank,
        lora_alpha=lora_alpha,
        target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'],
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
        use_rslora=False)
    model = get_peft_model(model, peft_config=config)
    model.gradient_checkpointing_enable()

    EOS_TOKEN = tokenizer.eos_token

    def formatting_prompts_func(examples):
        instructions = examples["instruction"]
        inputs = examples["input"]
        outputs = examples["output"]
        texts = []
        for instruction, input, output in zip(instructions, inputs, outputs):
            # Must add EOS_TOKEN, otherwise your generation will go on forever!
            text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
            texts.append(text)
        return {"text": texts}

    dataset = load_dataset("yahma/alpaca-cleaned", split="train")
    dataset = dataset.map(formatting_prompts_func, batched=True)

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field='text',
        max_seq_length=max_seq_length,
        packing=False,
        args=TrainingArguments(
            per_device_train_batch_size=per_device_train_batch_size,
            gradient_accumulation_steps=gradient_accumulation_steps,
            warmup_steps=warmup_steps,
            learning_rate=learning_rate,
            fp16=not torch.cuda.is_bf16_supported(),
            bf16=torch.cuda.is_bf16_supported(),
            logging_steps=logging_steps,
            optim='adamw_8bit',
            weight_decay=0.01,
            lr_scheduler_type='linear',
            seed=seed,
            output_dir='output/llame3-8b-instruct-unsloth',
            save_steps=save_steps,
            max_steps=max_steps))

    gpu_stats = torch.cuda.get_device_properties(0)
    start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
    print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
    print(f"{start_gpu_memory} GB of memory reserved.")

    trainer_stats = trainer.train()

    used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
    used_percentage = round(used_memory / max_memory * 100, 3)
    lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
    print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
    print(f"{round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes used for training.")
    print(f"Peak reserved memory = {used_memory} GB.")
    print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
    print(f"Peak reserved memory % of max memory = {used_percentage} %.")
    print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

    model.save_pretrained("output/llame3-8b-instruct-unsloth-lora")  # Local saving
    tokenizer.save_pretrained("output/llame3-8b-instruct-unsloth-lora")

    del model
    del tokenizer
    torch.cuda.empty_cache()
    for _ in range(3):
        gc.collect()


def infer():
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name='output/llame3-8b-instruct-unsloth-lora',
        max_seq_length=2048,
        dtype=torch.float16,
        load_in_4bit=True)
    # Enable Unsloth's native 2x faster inference
    FastLanguageModel.for_inference(model)
    inputs = tokenizer(
        [alpaca_prompt.format('Continue the fibonnaci sequence.', '1, 1, 2, 3, 5, 8', '')],
        return_tensors="pt").to('cuda')
    outputs = model.generate(**inputs, max_new_tokens=1024, use_cache=True)
    print(tokenizer.batch_decode(outputs))

    text_streamer = TextStreamer(tokenizer)
    outputs = model.generate(**inputs, max_new_tokens=1024, streamer=text_streamer)
    print(tokenizer.batch_decode(outputs))


if __name__ == '__main__':
    train_unsloth(dtype=torch.bfloat16, max_seq_length=1024, per_device_train_batch_size=1, gradient_accumulation_steps=16, rank=8, lora_dropout=0)
    train_unsloth(dtype=torch.bfloat16, max_seq_length=1024, per_device_train_batch_size=1, gradient_accumulation_steps=16, rank=64, lora_dropout=0)
    train_unsloth(dtype=torch.bfloat16, max_seq_length=2048, per_device_train_batch_size=1, gradient_accumulation_steps=16, rank=64, lora_dropout=0)
    train_unsloth(dtype=torch.bfloat16, max_seq_length=2048, per_device_train_batch_size=4, gradient_accumulation_steps=4, rank=64, lora_dropout=0)
    train_unsloth(dtype=torch.bfloat16, max_seq_length=2048, per_device_train_batch_size=4, gradient_accumulation_steps=4, rank=64, lora_dropout=0.05)
    train_unsloth(dtype=torch.bfloat16, max_seq_length=2048, per_device_train_batch_size=16, gradient_accumulation_steps=4, rank=64, lora_dropout=0.05)

    train_trans(dtype=torch.bfloat16, max_seq_length=1024, per_device_train_batch_size=1, gradient_accumulation_steps=16, rank=8, lora_dropout=0)
    train_trans(dtype=torch.bfloat16, max_seq_length=1024, per_device_train_batch_size=1, gradient_accumulation_steps=16, rank=64, lora_dropout=0)
    train_trans(dtype=torch.bfloat16, max_seq_length=2048, per_device_train_batch_size=1, gradient_accumulation_steps=16, rank=64, lora_dropout=0)
    train_trans(dtype=torch.bfloat16, max_seq_length=2048, per_device_train_batch_size=4, gradient_accumulation_steps=4, rank=64, lora_dropout=0)
    train_trans(dtype=torch.bfloat16, max_seq_length=2048, per_device_train_batch_size=4, gradient_accumulation_steps=4, rank=64, lora_dropout=0.05)
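Note that infer() is defined but never invoked in the __main__ block above. A hedged example of calling it once the Unsloth runs have finished (it assumes the LoRA adapter was saved to output/llame3-8b-instruct-unsloth-lora, the path used in train_unsloth):

if __name__ == '__main__':
    # ... run the train_unsloth(...) experiments above first ...
    infer()  # reloads the saved LoRA adapter in 4-bit and streams a completion for the Fibonacci prompt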

4. Experiment Results

4.1 P40

4.2 A40

4.3 A800

4.4 Conclusions

Comparing Unsloth-based training of Llama3-8B against training with the plain transformers framework, the conclusions are as follows:

(1) With Unsloth integrated, GPU memory usage is indeed lower and training is indeed faster, no matter which dimension is varied.

(2) On the P40, increasing batch_size raises GPU memory usage but also lengthens training time, suggesting the P40's performance drops when processing large batches; on the A40 and A800, increasing batch_size raises GPU memory usage but shortens training time.

(3) With batch_size = 1 the A800 trains less efficiently than the A40, but once batch_size is raised to 16 the A800 is nearly twice as fast as the A40. The A800 is therefore better suited to large-batch workloads; for small batch sizes, there is no need to use a sledgehammer to crack a nut.
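For context on conclusions (2) and (3): batch_size above refers to per_device_train_batch_size, while the effective batch size per optimizer step also depends on gradient_accumulation_steps. A quick calculation for the configurations used in the script (a sketch, using only the parameter grid from section 3.2):

# effective batch size = per_device_train_batch_size * gradient_accumulation_steps
print(1 * 16)   # 16 samples per optimizer step (per_device_train_batch_size=1,  gradient_accumulation_steps=16)
print(4 * 4)    # 16 samples per optimizer step (per_device_train_batch_size=4,  gradient_accumulation_steps=4)
print(16 * 4)   # 64 samples per optimizer step (per_device_train_batch_size=16, gradient_accumulation_steps=4)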

5. Summary

One sentence will do~

This post walks through an efficient fine-tuning experiment on Llama3 with the Unsloth framework, providing the full comparison code and a comparative analysis of the results.

A comparison experiment on Qwen1.5 will follow; stay tuned~

6. References

1. unsloth: https://github.com/unslothai/unsloth

2. Qwen1.5 + Unsloth: "Support Qwen2" by yangjianxin1, Pull Request #428, unslothai/unsloth, GitHub

