Qwen 论文阅读记录


本文仅作自己初步熟悉大模型,梳理之用,慢慢会更改/增加/删除,部分细节尚未解释,希望不断学习之后,能够完善补充。若有同道之人,欢迎指正探讨。

关于后面的code-qwen and math-qwen,我个人认为依托于前三部分,这两部分大致阅读,尚未细究,暂不记录于此。


1. Abstract(Introduction补充)

QWEN is a comprehensive language model series that encompasses distinct models with varying parameter counts.
QWEN 是一个全面的语言模型系列,包含参数数量不同的多个独立模型。

It includes QWEN, the base pretrained language models, and QWEN-CHAT, the chat models finetuned with human alignment techniques.

The base language models consistently demonstrate superior performance across a multitude of downstream tasks,

and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive.


以上两句话可以根据下图综合理解:

The model series include the base pretrained language models,
chat models finetuned with human alignment techniques, i.e., supervised finetuning (SFT), reinforcement learning with human feedback (RLHF), etc.,
as well as specialized models in coding and math.

在这里插入图片描述

Figure 1: Model Lineage of the Qwen Series.
We have pretrained the language models, namely QWEN, on massive datasets containing trillions of tokens.

We then use SFT and RLHF to align QWEN to human preference and thus we have QWEN-CHAT and specifically its improved version QWEN-CHAT-RLHF.

Additionally, we also develop specialized models for coding and mathematics, such as CODE-QWEN, CODE-QWEN-CHAT, and MATH-QWEN-CHAT based on QWEN with similar techniques.

Note that we previously released the multimodal LLM, QWEN-VL and QWEN-VLCHAT (Bai et al., 2023), which are also based on our QWEN base models.


The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter.
这些聊天模型在创建代理应用方面,具备先进的工具使用和规划能力,即使在执行如使用代码解释器等复杂任务时,与更大的模型相比也展现出了令人印象深刻的性能。


这句话,跟Introduction这句话综合理解:

LLMs are not just limited to language tasks.

They can also function as a generalist agent, collaborating with external systems, tools, and models to achieve the objectives set by humans.
它们还可以充当通用代理,与外部系统、工具和模型协作,以实现人类设定的目标。

For example, LLMs can understand multimodal instructions, execute code, use tools, and more.


2. Pretraining

2.1 Data

To ensure the quality of our pretraining data, we have developed a comprehensive data preprocessing procedure.
为了确保预训练数据的质量,我们开发了一个全面的数据预处理程序。

2.2 Tokenization

we utilize byte pair encoding (BPE) as our tokenization method.

2.3 Model

2.3.1 Architecture

QWEN is designed using a modified version of the Transformer architecture.

Specifically, we have adopted the recent open-source approach of training large language models, LLaMA (Touvron et al., 2023a).

Our modifications to the architecture include:

  • Embedding and output projection. ----嵌入和输出投影。
    Based on preliminary experimental findings, we have opted for the untied embedding approach instead of tying the weights of input embedding and output projection.
    基于初步的实验结果,我们选择了未绑定嵌入方法,而不是将输入嵌入和输出投影的权重绑定在一起。
    This decision was made in order to achieve better performance with the price of memory costs.
    这个决定是为了以牺牲内存成本为代价来获得更好的性能。

“Output projection”指的是在模型的输出层,将隐藏层的表示映射回原始的词汇或符号空间,以产生最终的输出。

“untied embedding approach”指的是输入嵌入和输出投影的权重是独立的,不共享的。这与“tied embedding”或“weight tying”相对,后者将输入嵌入和输出投影的权重设置为相同,以减少参数数量和防止过拟合。


  • Positional embedding.
    We have chosen RoPE (Rotary Positional Embedding) as our preferred option for incorporating positional information into our model.
    我们选择了RoPE,将位置信息融入模型。
    In particular, we have opted to use FP32 precision for the inverse frequency matrix, rather than BF16 or FP16, in order to prioritize model performance and achieve higher accuracy.
    特别是,我们选择了使用FP32精度来处理逆频率矩阵,而不是BF16或FP16,这是为了优先考虑模型性能并实现更高的准确性。

  • Bias.
    For most layers, we remove biases following Chowdhery et al. (2022), but we add biases in the QKV layer of attention to enhance the extrapolation ability of the model (Su, 2023b).
    在大多数层中移除了Bias,但在QKV层保留以提升模型的外推能力。​


根据Chowdhery等人的研究,移除大多数层的偏差项,有助于提高模型的稳定性和性能。这种做法可以减少参数数量,从而降低模型的复杂度和过拟合的风险。此外,移除偏差项有助于模型在训练过程中更好地泛化,从而在不同的数据集上表现得更为一致。

在注意力机制的QKV层中添加偏差项:Su的研究指出,在注意力机制的QKV(Query,Key,Value)层中添加偏差项可以增强模型的外推能力(extrapolation ability)。

QKV层是注意力机制的核心部分,通过添加偏差项,可以更好地捕捉输入数据的特征,从而提升模型在处理未知或未见数据时的表现。这种增强外推能力的做法对于处理复杂任务和应对多样化的数据输入非常重要。

总结来说,移除大多数层的偏差项是为了提高模型的泛化能力和稳定性,而在QKV层中添加偏差项则是为了增强模型的外推能力。


  • Pre-Norm & RMSNorm.
    In modern Transformer models, pre-normalization is the most widely used approach, which has been shown to improve training stability compared to post-normalization.
    Additionally, we have replaced the traditional layer normalization technique described in (Ba et al., 2016) with RMSNorm(Jiang et al., 2023). This change has resulted in equivalent performance while also improving efficiency.

  • Activation function.
    We have selected SwiGLU (Shazeer, 2020) as our activation function, a combination of Swish (Ramachandran et al., 2017) and Gated Linear Unit (Dauphin et al., 2017).
    Our initial experiments have shown that activation functions based on GLU generally outperform other baseline options, such as GeLU (Hendrycks & Gimpel, 2016).
    As is common practice in previous research, we have reduced the dimension of the feed-forward network (FFN) from 4 times the hidden size to 8/3 of the hidden size.

2.3.2 Context length extension

Transformer models have a significant limitation in terms of the context length for their attention mechanism.

As the context length increases, the quadratic-complexity computation leads to a drastic increase in both computation and memory costs.
计算复杂度与输入数据量的平方成正比

为此,作者提到了四个关键词:

  • NTK-aware interpolation (bloc97, 2023)
  • dynamic NTK-aware interpolation(Peng et al., 2023a)

QWEN additionally incorporates two attention mechanisms:

  • LogN-Scaling
  • Window Attention

LogN-Scaling rescales the dot product of the query and value by a factor that depends on the ratio of the context length to the training length, ensuring that the entropy of the attention value remains stable as the context length grows.
LogN-Scaling 通过一个依赖于上下文长度与训练长度之比的因子来重新缩放查询和值的点积,从而确保随着上下文长度的增加,注意力值的熵保持稳定。

Window attention restricts the attention to a limited context window, preventing the model from attending to tokens that are too far away.
Window Attention则将注意力限制在一个有限的上下文窗口内,防止模型关注到距离过远的标记(tokens)。


Swin transformer,待启动新的记录,详细解析其细节。


We also observed that the long-context modeling ability of our model varies across layers, with lower layers being more sensitive in context length extension compared to the higher layers.
我们还观察到,我们模型的 long-context 建模能力,在不同层之间存在差异,与较高层相比,较低层在上下文长度扩展方面更为敏感。

To leverage this observation, we assign different window sizes to each layer, using shorter windows for lower layers and longer windows for higher layers.

2.4 Training

To train QWEN, we follow the standard approach of autoregressive language modeling(自回归语言建模) Radford et al. (2018).

This involves training the model to predict the next token based on the context provided by the previous tokens.

We train models with context lengths of 2048.

To improve computational efficiency and reduce memory usage, we employ Flash Attention in the attention modules (Dao et al., 2022).


FlashAttention的核心原理是,通过将输入分块并在每个块上执行注意力操作,从而减少对高带宽内存(HBM)的读写操作。它利用底层硬件的内存层次知识,例如GPU的内存层次结构,来提高计算速度和减少内存访问开销。
具体来说,FlashAttention使用平铺(tiling)和重计算(recomputation)等经典技术。它首先将输入块从HBM加载到SRAM(快速缓存),在SRAM上执行注意力操作,并将结果更新回HBM。通过这种方式,FlashAttention减少了内存读写量,从而实现了加速。


3. Alignment

Pretrained large language models have been found to be out of sync with human behavior, making them unsuitable for serving as AI assistants in most cases.
已发现,预训练的大型语言模型与人类行为不一致,因此在大多数情况下不适合担任AI助手。

Recent research has shown that the use of alignment techniques, such as supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF), can significantly improve the ability of language models to engage in natural conversation.

3.1 Supervised Fine-tuning

To gain an understanding of human behavior, the initial step is to carry out supervised fine-tuning.

This process fine-tunes a pre-trained model on chat-style data, which includes both human queries and AI responses.
该过程对聊天风格数据的预训练模型进行微调,其中包括人工查询AI响应

Supervised finetuning is similar to text-to-text transfer, but it is capable of creating a helpful AI assistant due to the intricate and varied nature of the datasets used for finetuning.
监督式微调类似于文本到文本的迁移,但由于用于微调的数据集,具有复杂性和多样性,它能够创建出一个有用的AI助手。

In the following sections, we will delve into the details of data construction and training methods.

3.1.1 DATA

To enhance the capabilities of our supervised finetuning datasets, we have annotated conversations in multiple styles.

3.1.2 TRAINING

3.2 Rein Forcement Learning from Human Feedback

While SFT has proven to be effective, we acknowledge that its generalization and creativity capabilities may be limited, and it is prone to overfitting.

To address this issue, we have implemented Reinforcement Learning from Human Feedback (RLHF) to further align SFT models with human preferences.

This process involves training a reward model and using Proximal Policy Optimization (PPO) to conduct policy training.
这个过程包括训练一个奖励模型,并使用近端策略优化(Proximal Policy Optimization,简称PPO)来进行策略训练。

3.2.1 Reward Model

3.2.2 ReinForcement Learning

Our Proximal Policy Optimization (PPO) process involves four models:

  • policy model,
  • value model,
  • reference model,
  • reward model.

3.3 Automatic and Human evaluation of Aligned Models

Besides the widely used few-shot setting, we test our aligned models in the zero-shot setting to demonstrate how well the models follow instructions.

The prompt in a zero-shot setting consists of an instruction and a question without any previous examples in the context.
在零样本设置(zero-shot setting)中,提示(prompt)仅包含一个指令和一个问题,而不包含任何先前的示例作为上下文。


这种设置要求模型,仅根据给定的指令和问题本身,来生成回答或执行相应的任务。

零样本设置对于评估模型的泛化能力和理解能力尤为重要,因为它模拟了模型在面对全新、未见过的任务或问题时的表现。

在这种设置下,指令通常清晰地说明了模型需要执行的任务类型,比如生成文本、回答问题、翻译等。

问题则是模型需要处理的具体内容,它依赖于指令中指定的任务类型。

由于没有提供任何示例,模型必须依靠其预训练期间学到的知识和技能来理解和回答问题。

零样本设置的挑战在于模型需要在没有直接指导或示例的情况下,准确地理解并完成任务。这要求模型具备强大的语言理解和生成能力,以及良好的泛化能力。

因此,零样本设置是评估自然语言处理(NLP)模型性能的一个重要方面。


3.4 Tool use, Code Interpreter, and Agent

The QWEN models, which are designed to be versatile, have the remarkable ability to assist with (semi-) automating daily tasks by leveraging their skills in tool-use and planning.
QWEN模型设计得十分通用,它们具有,通过利用,其工具使用,和,规划方面的,技能,来协助(半)自动化日常任务的,显著能力。

We explore QWEN’s proficiency in the following areas:

  • Utilizing unseen tools through ReAct prompting (Yao et al., 2022) (see Table 6).
    通过ReAct提示利用未见过的工具;
  • Using a Python code interpreter to enhance math reasoning, data analysis, and more.
    使用Python代码解释器来增强数学推理、数据分析等能力;
  • Functioning as an agent that accesses Hugging Face’s extensive collection of multimodal models while engaging with humans (see Table 9).
    作为一个能够与人类交互,并访问Hugging Face庞大多模态模型库的代理;

To enhance QWEN’s capabilities as an agent or copilot, we employ the self-instruct (Wang et al., 2023c) strategy for supervised fine-tuning (SFT).
为了增强QWEN作为代理或协同工具的能力,我们采用自我指示策略进行有监督微调(SFT)。

Specifically, we utilize the in-context learning capability of QWEN for self-instruction.
具体来说,我们利用QWEN的上下文学习能力来进行自我指示。

By providing a few examples, we can prompt QWEN to generate more relevant queries and generate outputs that follow a specific format, such as ReAct.
通过提供几个示例,我们可以引导QWEN生成更多相关的查询,并产生符合特定格式(如ReAct)的输出。

We then apply rules and involve human annotators to filter out any noisy samples.
随后,我们应用规则并引入人工标注者来过滤掉任何噪声样本。

Afterwards, the samples are incorporated into QWEN’s training data, resulting in an updated version of QWEN that is more dependable for self-instruction.
之后,这些样本被纳入QWEN的训练数据中,从而得到一个在自我指示方面更加可靠的更新版QWEN。

We iterate through this process multiple times until we gather an ample number of samples that possess both exceptional quality and a widerange of diversity.
我们多次重复这个过程,直到收集到足够数量且既优质又多样化的样本。


在这里插入图片描述
Figure 2: A high-level overview of Self-Instruct.

The process starts with a small seed set of tasks as the task pool.

Random tasks are sampled from the task pool, and used to prompt an off-the-shelf LM to generate both new instructions and corresponding instances, followed by filtering low-quality or similar generations, and then added back to the initial repository of tasks.
从任务池中随机抽取任务,并用这些任务,提示一个现成的语言模型(LM),来生成新的指令和相应的实例。
随后,过滤掉质量低或相似的生成内容,再将它们添加回最初的任务库中。

The resulting data can be used for the instruction tuning of the language model itself later to follow instructions better.
所得数据可用于后续对语言模型本身的指令调优,以便其能更好地遵循指令。

Tasks shown in the figure are generated by GPT3.

其中,上图中的 Input-First 和 Output-First,具体可以参考下面两张图:

  • Input-First

在这里插入图片描述
Table 7: Prompt used for the input-first approach of instance generation. The model is prompted to generate the instance first, and then generate the corresponding output.
For instructions that don’t require additional input, the output is allowed to be generated directly.

  • Output-First

在这里插入图片描述

Table 8: Prompt used for the output-first approach of instance generation.
The model is prompted to generate the class label first, and then generate the corresponding input.
This prompt is used for generating the instances for classification tasks.

We apply the output-first approach to the classification tasks identified in the former step, and the input-first approach to the remaining non-classification tasks.


Related Work

  1. QWEN TECHNICAL REPORT.

  2. SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions.

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/pingmian/63845.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

JCR一区牛顿-拉夫逊优化算法+分解对比!VMD-NRBO-Transformer-BiLSTM多变量时序光伏功率预测

JCR一区牛顿-拉夫逊优化算法分解对比!VMD-NRBO-Transformer-BiLSTM多变量时序光伏功率预测 目录 JCR一区牛顿-拉夫逊优化算法分解对比!VMD-NRBO-Transformer-BiLSTM多变量时序光伏功率预测预测效果基本介绍程序设计参考资料 预测效果 基本介绍 1.中科院…

如何在小米平板5上运行 deepin 23 ?

deepin 23 加入了 ARM64 支持,这里尝试将 deepin 系统刷入平板中,平常使用中,带个笔记本电脑有时候也会嫌比较麻烦,把 Linux 系统刷入平板中既满足了使用需要,又满足了轻便的需求。为什么不使用 Termux ?虽…

QT6 Socket通讯封装(TCP/UDP)

为大家分享一下最近封装的以太网socket通讯接口 效果演示 如图,界面还没优化,后续更新 废话不多说直接上教程 添加库 如果为qmake项目中,在.pro文件添加 QT network QT core gui QT networkgreaterThan(QT_MAJOR_VERS…

all/any函数可以对“条件”打包(Python)

操作符直观易读适用简单逻辑,函数紧凑好写便于多条件处理。 (笔记模板由python脚本于2024年12月12日 22:19:10创建,本篇笔记适合有一定编程基础的coder翻阅) 【学习的细节是欢悦的历程】 Python 官网:https://www.python.org/ Free&#xff…

js:v-for循环中我希望再次循环七张图片,需要在v-for中嵌套一个v-for还是?

问: div classxxxx v-for(item,index) in data :keyindex div classimgDiv div classimgDivBox /div /div .imgDivBox { .background-img(/assets/images/top_01.png) } 这是现在设置的图片,但是现在我希望遍历一个数组然后遍历top01-top07&…

黑皮书-计算机科学导论02

目录 第二部分:计算机硬件 第5章计算机组成 5.1中央处理单元 Ⅰ.算数逻辑单元 Ⅱ.控制单元 Ⅲ.寄存器 5.2主存储器 Ⅰ.随机存取存储器(RAM) Ⅱ.只读存储器(ROM) 高速缓冲存储器(Cache) 5.3输入/输出子系统 Ⅰ.非存储设备 Ⅱ.存储设备(辅助存…

小程序开发中的插件生态与应用-上

更多精彩内容都在公zhong号:小白的大数据之旅 在小程序的开发过程中,插件作为扩展功能、提升效率的重要工具,扮演着不可或缺的角色。它们不仅能够帮助开发者快速集成复杂的功能模块,还能优化开发流程,缩短项目周期。 …

优选算法——分治(快排)

1. 颜色分类 题目链接:75. 颜色分类 - 力扣(LeetCode) 题目展示: 题目分析:本题其实就要将数组最终分成3块儿,这也是后面快排的优化思路,具体大家来看下图。 这里我们上来先定义了3个指针&…

【大模型系列篇】GPU资源容器化访问使用指南

在当今的高性能计算和机器学习领域,GPU(图形处理单元)因其卓越的并行计算能力而扮演着至关重要的角色。随着容器化技术如 Docker 的普及,越来越多的数据科学家和开发者选择将他们的应用和工作负载封装到 Docker 容器中&#xff0c…

【毕业设计选题】数据科学与大数据专业毕业设计选题与建议

目录 前言 毕设选题 开题指导建议 更多精选选题 选题帮助 最后 前言 大家好,这里是海浪学长毕设专题! 大四是整个大学期间最忙碌的时光,一边要忙着准备考研、考公、考教资或者实习为毕业后面临的升学就业做准备,一边要为毕业设计耗费大量精力。学长给大家整…

大数据笔记之flink-cdc实时同步数据

大数据笔记之flink-cdc实时同步数据(mysql -->doris) 一、基本概念 Flink CDC 是一个基于流的数据集成工具,旨在为用户提供一套功能更加全面的编程接口(API)。 该工具使得用户能够以 YAML配置文件的形式,优雅地定义其 ETL&…

蓝桥杯新年题解 | 第15届蓝桥杯迎新篇

蓝桥杯新年题解 | 第15届蓝桥杯迎新篇 2024年的蓝桥杯即将拉开序幕!对于许多编程爱好者来说,这不仅是一次展示自我能力的舞台,更是一次学习和成长的机会。作为一名大一新生的小蓝,对蓝桥杯充满了期待,但面对初次参赛的…

【有啥问啥】大语言模型Prompt中的“System指令”:深入剖析与误区澄清

大语言模型Prompt中的“System指令”:深入剖析与误区澄清 引言 在与大语言模型(LLM)交互时,“prompt”(提示符)这一概念已不再陌生。Prompt是引导模型生成特定类型文本的关键输入,决定了模型的…

linux/centOS7用户和权限管理笔记

linux系列中可以: 配置多个用户配置多个用户组用户可以加入多个用户中 linux中关于权限的管理级别有2个级别,分别是: 针对用户的权限控制针对用户组的权限控制 一,root用户 root用户拥有最大的系统操作权限,而普通…

sheng的学习笔记-AI-注意力模型(Attention Model)

Ai目录:sheng的学习笔记-AI目录-CSDN博客 先看下这两个文章: 序列模型:sheng的学习笔记-AI-序列模型(Sequence Models),RNN,GRU,LSTM_音乐识别是一对多吗-CSDN博客 机器翻译 sheng的学习笔记-AI-自然语…

el-table组件树形数据修改展开箭头

<style lang"scss" scoped> ::v-deep .el-table__expand-icon .el-icon-arrow-right:before {content: ">"; // 箭头样式font-size: 16px; }::v-deep .el-table__expand-icon{ // 没有展开的状态background-color: rgba(241, 242, 245, 1);color:…

已解决:elasticsearch创建索引失败

报错信息 具体报错&#xff1a; org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [typeillegal_argument_exception, reasonunknown setting [index.mappings.properties.category.analyzer] please check that any required plugins are installed…

JAVA学习笔记——第十一章 枚举和注解

一、引出枚举类 1.先看一个需求demo package com.hspedu.enum_;public class Enumration01 {public static void main(String[] args) {Season Spring new Season("春天", "温暖");Season Summer new Season("夏天", "炎热");Seas…

GeeCache-单体并发缓存

实现LRU中value接口的缓存类 使用互斥锁封装LRU缓存类&#xff0c;实现并发访问 实现Group组&#xff0c;用名称对缓存分类 Getter为缓存击穿时调用的回调函数 若缓存击穿则调用回调函数&#xff0c;并把读取到的值加载到缓存中

吸烟抽烟行为识别数据集-超高识别率,支持YOLO,COCO,VOC格式的标注,10162张各种姿势场景下的吸烟图片

吸烟抽烟行为识别数据集-超高识别率&#xff0c;支持YOLO&#xff0c;COCO,VOC格式的标注&#xff0c;10162张各种姿势场景下的吸烟图片 数据集分割 训练组91&#xff05; 9279图片 有效集5&#xff05; 507图片 测试集4% 376图片 预处理 自动定…