LLM - Chinese-Llama-2-7b 初体验

Chinese LLaMA2 7B

目录

一.引言

二.模型下载

三.快速测试

四.训练数据

五.总结


一.引言

自打 LLama-2 发布后就一直在等大佬们发布 LLama-2 的适配中文版,也是这几天蹲到了一版由 LinkSoul 发布的 Chinese-Llama-2-7b,其共发布了一个常规版本和一个 4-bit 的量化版本,今天我们主要体验下 Llama-2 的中文逻辑顺便看下其训练样本的样式,后续有机会把训练和微调跑起来。

二.模型下载

HuggingFace: https://huggingface.co/LinkSoul/Chinese-Llama-2-7b

4bit 量化版本: https://huggingface.co/LinkSoul/Chinese-Llama-2-7b-4bit

这里我们先整一版量化版本:

 省事且网络好的同学可以直接用 Hugging Face 的 API 下载,网不好就半夜慢慢下载吧。

from huggingface_hub import hf_hub_download, snapshot_downloadsnapshot_download(repo_id="LinkSoul/Chinese-Llama-2-7b-4bit", local_dir='./models')

三.快速测试

Tips 测试用到的基本库的版本,运行显卡为 Tesla-V100 32G:

python 3.9.11
numpy==1.23.5
torch==2.0.1
transformers==4.29.1

测试代码:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer# Original version
# model_path = "LinkSoul/Chinese-Llama-2-7b"
# 4 bit version
model_path = "/models/LLama2_4bit"tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
if model_path.endswith("4bit"):model = AutoModelForCausalLM.from_pretrained(model_path,torch_dtype=torch.float16,device_map='auto')
else:model = AutoModelForCausalLM.from_pretrained(model_path).half().cuda()
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)instruction = """[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n{} [/INST]"""while True:text = input("请输入 prompt\n")if text == "q":breakprompt = instruction.format(text)generate_ids = model.generate(tokenizer(prompt, return_tensors='pt').input_ids.cuda(), max_new_tokens=4096, streamer=streamer)

★ 常规测试

知识:

 推理:

★ 一些 Bad Case

知识错乱:

重复:

这里由于是 4-bit 的量化版本,模型的效果可能也会受影响,可以看到图中原始 LLama2 的知识能力相对还算不错。

 

四.训练数据

LinkSoul 在 LLama2 的基础上使用了中英文 SFT 数据集,数据量 1000 万:

LinkSoul/instruction_merge_set · Datasets at Hugging Face

 给定的数据是 parquet 格式的:

{"instruction":"","id":"chatalpaca_data_10k_0","conversations":"WrappedArray([human,Find the product of the numbers: 5 and 8], [gpt,The product of 5 and 8 is 40.], [human,What is the sum of the numbers 6 and 12?], [gpt,The sum of the numbers 6 and 12 is 18.], [human,Can you tell me the quotient of 20 and 5?], [gpt,Yes, the quotient of 20 and 5 is 4.], [human,What is the difference between 25 and 13?], [gpt,The difference between 25 and 13 is 12.], [human,What is the square of 9?], [gpt,The square of 9 is 81.], [human,What is the cube of 6?], [gpt,The cube of 6 is 216.])"}
{"instruction":"","id":"chatalpaca_data_10k_1","conversations":"WrappedArray([human,Provide five tips for effectively using tape measures], [gpt,1. Tighten the tape measure so that it stays in place when measuring.\n2. Make sure that the hook of the tape measure hits the starting point precisely.\n3. Allow for a bit of slack when measuring around curves or corners.\n4. Measure from a stable surface, such as a table or workbench.\n5. Mark the beginning and end of each measurement with a pen or marker.], [human,Can you give me an example of when to use tip number three, \"Allow for a bit of slack when measuring around curves or corners\"?], [gpt,Sure! Let's say you're measuring the circumference of a circular object, like a pipe. If you pull the tape measure too tightly, you won't get an accurate measurement because the tape measure will be stretched out of its original shape, making the measurement longer than it should be. Instead, you should wrap the tape measure around the curve or corner of the object without pulling it too tightly. This way, you'll get a more accurate measurement that takes into account the shape of the object.], [human,Can you tell me more about the markers or pens that should be used to mark the measurements? Do you recommend a certain type of pen?], [gpt,As an AI language model, I do not have personal preferences as a human, but there are certain types of markers or pens that are better suited for marking measurements than others. For example, a fine-tipped marker or mechanical pencil is great for making precise marks on a tape measure, because they produce a precise line that is easy to see. You'll want to avoid using a ballpoint pen or a marker with a thick tip, since these types of pens can produce marks that are too thick or blurry, making it difficult to accurately read the measurement. It's also a good idea to choose a marker or pen that is a different color than the color of the tape measure, so that you can easily distinguish your marks from the printed markings on the tape measure.], [human,Are there any other tips you recommend for maintaining tape measures in good condition? \n\nassistant: \nYes! One tip is to keep your tape measure clean and free from debris, so that it moves smoothly and doesn't get jammed or stuck. You can use a soft cloth or a brush to wipe down the tape measure after each use, and make sure to store it in a dry place to prevent rust or corrosion. Another tip is to avoid dropping or mishandling the tape measure, since this can damage the internal components and make it less accurate. It's also a good idea to periodically check your tape measure against a known length or standard to make sure that it is still measuring accurately.], [gpt,Finally, when retracting the tape measure, avoid letting it snap back quickly, as this can cause the tape to twist or become tangled. Instead, hold it firmly and retract it slowly and steadily to avoid damaging the internal components. With proper care and maintenance, your tape measure should last for years and provide accurate measurements every time.], [human,Goodbye.])"}

instruction 指令可选,剩下 conversations 里是 human 和  gpt 的多轮对话。

def tokenize(item, tokenizer):roles = {"human": "user", "gpt": "assistant"}input_ids = []labels = []if "instruction" in item and len(item["instruction"]) > 0:system = item["instruction"]else:system = dummy_message["system"]system = B_SYS + system + E_SYS# add system before the first content in conversationsitem["conversations"][0]['value'] = system + item["conversations"][0]['value']for i, turn in enumerate(item["conversations"]):role = turn['from']content = turn['value']content = content.strip()if role == 'human':content = f"{B_INST} {content} {E_INST} "content_ids = tokenizer.encode(content)labels += [IGNORE_TOKEN_ID] * (len(content_ids))else:# assert role == "gpt"content = f"{content} "content_ids = tokenizer.encode(content, add_special_tokens=False) + [tokenizer.eos_token_id]   # add_special_tokens=False remove bos token, and add eos at the endlabels += content_idsinput_ids += content_idsinput_ids = input_ids[:tokenizer.model_max_length]labels = labels[:tokenizer.model_max_length]trunc_id = last_index(labels, IGNORE_TOKEN_ID) + 1input_ids = input_ids[:trunc_id]labels = labels[:trunc_id]if len(labels) == 0:return tokenize(dummy_message, tokenizer)input_ids = safe_ids(input_ids, tokenizer.vocab_size, tokenizer.pad_token_id)labels = safe_ids(labels, tokenizer.vocab_size, IGNORE_TOKEN_ID)return input_ids, labels

训练代码:https://github.com/LinkSoul-AI/Chinese-Llama-2-7b/blob/main/train.py

中展示了 tokenizer 原始样本的流程:

◆  根据指令生成 system

◆  根据 from 和 value 的多轮对话生成 input_ids 和 labels

Tips: 这里会把前面生成的 system 缀到第一个 value 前面,labels 会在 human 部分用 IGNORE_TOKEN_ID 的掩码进行 Mask

◆  最后 safe_ids 用于限制 id < max_value 超过使用 pad_id 进行填充

def safe_ids(ids, max_value, pad_id):return [i if i < max_value else pad_id for i in ids]

这里输入格式严格遵循 llama-2-chat 格式,兼容适配所有针对原版 llama-2-chat 模型的优化。

五.总结

这里简单介绍了 LLama-2 7B Chinese 的推理和数据样式,后续有机会训练和微调该模型。

参考: 

Chinese Llama 2 7B: https://github.com/LinkSoul-AI/Chinese-Llama-2-7b

Model: https://huggingface.co/LinkSoul/Chinese-Llama-2-7b

Instruction_merge_set: https://huggingface.co/datasets/LinkSoul/instruction_merge_set/

Download Files: https://huggingface.co/docs/huggingface_hub/v0.16.3/guides/download

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/7213.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

Linux命令行宝典:随时查询、轻松应对

&#x1f57a;作者&#xff1a; 迷茫的启明星 学习路线C语言从0到1C初阶数据结构从0到1 &#x1f618;欢迎关注&#xff1a;&#x1f44d;点赞&#x1f64c;收藏✍️留言 &#x1f3c7;码字不易&#xff0c;你的&#x1f44d;点赞&#x1f64c;收藏❤️关注对我真的很重要&…

Flink CEP (一)原理及概念

目录 1.Flink CEP 原理 2.Flink API开发 2.1 模式 pattern 2.2 模式 pattern属性 2.3 模式间的关系 1.Flink CEP 原理 Flink CEP内部是用NFA&#xff08;非确定有限自动机&#xff09;来实现的&#xff0c;由点和边组成的一个状态图&#xff0c;以一个初始状态作为起点&am…

文件共享服务器(五)sicis

目录 前言 一、概述 1.iscsi概念 2.iscsi介绍 3.相关名词 二、实验 1.构建iscsi服务 2.实现步骤 服务器端 客户端 3.注意事项 总结 前言 iSCSI是由IBM发明的基于以太网的存储协议&#xff0c;该协议与SUN的NFS协议都是为了解决存储资源共享问题的解决方案。两者意图…

音视频——封装格式原理

视频解码基础 一、封裝格式 ​ 我们播放的视频文件一般都是用一种封装格式封装起来的&#xff0c;封装格式的作用是什么呢&#xff1f;一般视频文件里不光有视频&#xff0c;还有音频&#xff0c;封装格式的作用就是把视频和音频打包起来。 所以我们先要解封装格式&#xff0…

集成学习Boosting - AdaBoost

目录 1. Boosting方法的基本思想 1.1 Bagging VS Boosting 1.2 Boosting算法的基本元素与基本流程 1.3 sklearn中的Boosting算法 2. AdaBoost 3 AdaBoost的基本参数与损失函数 3.1 参数 base_estimator&#xff0c;属性base_estimator_与estimators_ 3.1. 参数 learnin…

用QFramework来重构 祖玛游戏

资料 Unity - 祖玛游戏 GitHub 说明 用QF一个场景就够了&#xff0c;在UIRoot下切换预制体达到面板切换。 但测试中当然要有一个直接跳到测试面板的 测试脚本&#xff0c;保留测试Scene&#xff08;不然初学者也不知道怎么恢复测试Scene&#xff09;&#xff0c;所以全文按S…

宋浩线性代数笔记(二)矩阵及其性质

更新线性代数第二章——矩阵&#xff0c;本章为线代学科最核心的一章&#xff0c;知识点多而杂碎&#xff0c;务必仔细学习。 重难点在于&#xff1a; 1.矩阵的乘法运算 2.逆矩阵、伴随矩阵的求解 3.矩阵的初等变换 4.矩阵的秩 &#xff08;去年写的字&#xff0c;属实有点ugl…

多态及其原理

文章目录 构成多态的条件虚函数作用&#xff1a;完成重写 重写重载 重写 隐藏为什么析构函数要搞成符合多态?原理预热对于基类指针或引用指向父类或者子类的成员函数是如何调用不同的函数呢&#xff1f; 一个类如果是基类&#xff0c;它的析构函数最好加上virtual 构成多态的条…

AlSD 系列智能安全配电装置是安科瑞电气有限公司专门为低压配电侧开发的一款智能安全用电产 品-安科瑞黄安南

一、应用背景 电力作为一种清洁能源&#xff0c;给人们带来了舒适、便捷的电气化生活。与此同时&#xff0c;由于使用不当&#xff0c;维护 不及时等原因引发的漏电触电和电气火灾事故&#xff0c;也给人们的生命和财产带来了巨大的威胁和损失。 为了防止低压配电系统发生漏…

2022 China Open Source Report

| 翻译&#xff1a;黄绍雅、岳扬、刘文涛、李思颖 | 编辑&#xff1a;胡欣元 | 设计&#xff1a;胡欣元 As 2022 finally came to an end, we also emerged from the challenging years of the three-year-long COVID pandemic. The new edition of the "China Open Sourc…

06.计算机网络——IP协议

文章目录 网络层IP协议基本概念协议头格式如何解包如何交付网段划分子网掩码特殊的IP地址IP地址的数量限制私有IP地址和公网IP地址路由 网络层 IP协议 IP协议提供一种将数据从A主机送达到B主机的能力&#xff0c;进行网络层的通信。 ​ IP协议 基本概念 主机 —— 配有IP地址…

MQ - 闲聊MQ一二事儿 (Kafka、RocketMQ 、Pulsar )

文章目录 MQ的发展史阶段一&#xff1a;追求解耦阶段二&#xff1a;追求吞吐量与一致性阶段三&#xff1a;追求平台化 MQ的通用架构主题topic、生产者producer、消费者consumer分区partition MQ 存储KafkaGood Design ---> 磁盘顺序写盘Poor Impact---> topic 数量不能过…

LabVIEW可重入VI,VI模板和动态VI之间的差异

LabVIEW可重入VI&#xff0c;VI模板和动态VI之间的差异 应该在何时使用可重入VI、模板VI和动态调用VI&#xff1f;这三种类型之间有什么区别&#xff1f; 可重入VI 当想要同时运行同一VI的多个实例时&#xff0c;将使用可重入VI。当VI不可重入时&#xff0c;VI只有一个数据空…

lama cleaner

这里写自定义目录标题 安装参数包含的额外plugins 安装 conda create --name lamacleaner python3.10 pip install -r requirements.txt pip install gfpgan pip install realesrgan pip install rembg pip install .如果安装本package报错&#xff0c;可以尝试改&#xff1…

后端(四):博客系统项目

咱们在这里实现的是后端项目&#xff0c;前端代码就提一提&#xff0c;不全做重点介绍&#xff0c;在开始讲解这个博客系统项目之前&#xff0c;我们先看看这个项目的前端界面&#xff1a; 登录界面&#xff1a; 个人主页&#xff1a; 博客详情页&#xff1a; 写博客页&#x…

常用API学习07(Java)

Date 在jdk1.8之前,java中的日期和时间是一类的&#xff0c;从1.8之后对日期和时间体系重新做了规划&#xff0c;划分出一个新的包 - java.time包&#xff0c;这个包中包含了日期、时间、时区、日历、单位。 Date&#xff0c;是java中最老的日期和时间类&#xff0c;后续退出…

Arcgis之 KML/KMZ文件转shp

一般我们在Goole Earth上勾画的区域导出后都为KML或者KMZ格式的&#xff0c;但无法在arcgis等软件上直接应用&#xff0c;故需进行一定的转换 1.打开ArcMap&#xff0c;选择ArcToolbox->Conversion Tools->From KML->KML To Layer 得到如下结果&#xff08;由于本KML…

pytest自动化测试框架tep环境变量、fixtures、用例三者之间的关系

tep是一款测试工具&#xff0c;在pytest测试框架基础上集成了第三方包&#xff0c;提供项目脚手架&#xff0c;帮助以写Python代码方式&#xff0c;快速实现自动化项目落地。 在tep项目中&#xff0c;自动化测试用例都是放到tests目录下的&#xff0c;每个.py文件相互独立&…

自动驾驶分级和技术架构

标题SAE 和 NHTSA自动驾驶分级 当前全球汽车行业中两个最权威的分级系统由美国国家公路交通安全管理局&#xff08;NHTSA&#xff09;和国际自动化工程师协会(SAE)提出。2013年&#xff0c;NHTSA将驾驶自动化的描述分为5个层级。2014年1月&#xff0c;SAE制定J3016自动驾驶分级…

Mybatis基础模块-日志管理

文章目录 1. 适配器模式2. Log2.1 默认实现StdOutImpl2.2 Log4jImpl 3. LogFactory4. 解析配置和应用4.1 settings配置4.2 解析 5. jdbc日志5. 1 类图5.2 BaseJdbcLogger5.3 ConnectionLogger5.4 ConnectionLogger的具体应用 1. 适配器模式 适配器使接口不兼容的对象可以相互合…