Paper Close Reading -- GPT-2

Overshadowed by BERT, yet still committed to the decoder-only architecture.

Abstract

Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

Summary:

Bigger and stronger: the training corpus grows to millions of web pages, and the model has 1.5 billion parameters.

Compared with BERT, though, scale alone is not a very compelling advantage (BERT-large gets by with only 0.34 billion parameters), so the authors pick a different angle and make zero-shot learning the main selling point.

Introduction

The dominant approach to creating ML systems is to collect a dataset of training examples demonstrating correct behavior for a desired task, train a system to imitate these behaviors, and then test its performance on independent and identically distributed (IID) held-out examples. This has served well to make progress on narrow experts. But the often erratic behavior of captioning models (Lake et al, 2017), reading comprehension systems (Jia & Liang, 2017), and image classifiers (Alcorn et al, 2018) on the diversity and variety of possible inputs highlights some of the shortcomings of this approach.

Our suspicion is that the prevalence of single task training on single domain datasets is a major contributor to the lack of generalization observed in current systems. Progress towards robust systems with current architectures is likely to require training and measuring performance on a wide range of domains and tasks. Recently, several benchmarks have been proposed such as GLUE (Wang et al, 2018) and decaNLP (McCann et al, 2018) to begin studying this.

Summary:

Existing models generalize poorly: each specific task still needs its own task-specific dataset.

Multitask learning (Caruana, 1997) is a promising framework for improving general performance. However, multitask training in NLP is still nascent. Recent work reports modest performance improvements (Yogatama et al, 2019) and the two most ambitious efforts to date have trained on a total of 10 and 17 (dataset, objective) pairs respectively (McCann et al, 2018) (Bowman et al, 2018). From a meta-learning perspective, each (dataset, objective) pair is a single training example sampled from the distribution of datasets and objectives. Current ML systems need hundreds to thousands of examples to induce functions which generalize well. This suggests that multitask training may need just as many effective training pairs to realize its promise with current approaches. It will be very difficult to continue to scale the creation of datasets and the design of objectives to the degree that may be required to brute force our way there with current techniques. This motivates exploring additional setups for performing multitask learning.

Summary:

Multitask learning: train on several datasets at once, possibly with several objectives or loss functions, so that a single model can serve multiple tasks; it has not seen much adoption in NLP so far.

The current best performing systems on language tasks utilize a combination of pre-training and supervised finetuning. This approach has a long history with a trend towards more flexible forms of transfer. First, word vectors were learned and used as inputs to task-specific architectures (Mikolov et al, 2013) (Collobert et al, 2011), then the contextual representations of recurrent networks were transferred (Dai & Le, 2015) (Peters et al, 2018), and recent work suggests that task-specific architectures are no longer necessary and transferring many self-attention blocks is sufficient (Radford et al, 2018) (Devlin et al, 2018).

Summary:

Pretraining plus supervised fine-tuning, the recipe behind GPT and BERT, is still what works best.

In this paper, we connect these two lines of work and continue the trend of more general methods of transfer. We demonstrate language models can perform down-stream tasks in a zero-shot setting – without any parameter or architecture modification. We demonstrate this approach shows potential by highlighting the ability of language models to perform a wide range of tasks in a zero-shot setting. We achieve promising, competitive, and state of the art results depending on the task.

Summary:

Train GPT-2 once as a zero-shot model and use it everywhere, with no fine-tuning at all.

Approach

For example, a translation training example can be written as the sequence (translate to french, english text, french text). Likewise, a reading comprehension training example can be written as (answer the question, document, question, answer). McCann et al (2018) demonstrated it was possible to train a single model, the MQAN, to infer and perform many different tasks on examples with this type of format.

Language modeling is also able to, in principle, learn the tasks of McCann et al (2018) without the need for explicit supervision of which symbols are the outputs to be predicted. Since the supervised objective is the same as the unsupervised objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective. In this slightly toy setting, the concerns with density estimation as a principled training objective discussed in (Sutskever et al, 2015) are sidestepped. The problem instead becomes whether we are able to, in practice, optimize the unsupervised objective to convergence. Preliminary experiments confirmed that sufficiently large language models are able to perform multitask learning in this toy-ish setup but learning is much slower than in explicitly supervised approaches.

While it is a large step from the well-posed setup described above to the messiness of “language in the wild”, Weston (2016) argues, in the context of dialog, for the need to develop systems capable of learning from natural language directly and demonstrated a proof of concept – learning a QA task without a reward signal by using forward prediction of a teacher’s outputs. While dialog is an attractive approach, we worry it is overly restrictive. The internet contains a vast amount of information that is passively available without the need for interactive communication. Our speculation is that a language model with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement. If a language model is able to do this it will be, in effect, performing unsupervised multitask learning. We test whether this is the case by analyzing the performance of language models in a zero-shot setting on a wide variety of tasks.
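To spell out the equivalence asserted in the paragraph on supervised versus unsupervised objectives, the paper's framing can be restated in symbols (a paraphrase, not a quotation): the language model factorizes a sequence with the chain rule, a single supervised task estimates p(output | input), and a general multitask system additionally conditions on the task, which in this setup is expressed as plain text inside the same sequence. The supervised loss is then just the unsupervised likelihood evaluated on the answer portion of the sequence.

```latex
% Chain-rule factorization used for unsupervised language modeling
p(x) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})

% A single supervised task estimates a conditional distribution
p(\mathrm{output} \mid \mathrm{input})

% A general multitask system also conditions on the task,
% with the task itself specified as text inside the sequence
p(\mathrm{output} \mid \mathrm{input}, \mathrm{task})
```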

Summary:

Because everything is zero-shot, there is no labeled downstream data with which to learn the start, delim, and extract tokens that GPT-1 relied on, so the model has to be steered with natural-language prompts instead. Why should that work? On the one hand, the authors argue, a sufficiently large model can understand what a prompt is asking for; on the other hand, many prompt-like phrasings already occur naturally in web text.
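As an illustration of what prompt-based zero-shot use looks like in practice, here is a minimal sketch using the publicly released GPT-2 weights via the Hugging Face `transformers` library. This is not the authors' evaluation code, and the prompt string is made up for illustration only.

```python
# Minimal zero-shot prompting sketch with the released GPT-2 weights.
# Assumes the `transformers` and `torch` packages are installed.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# The task is specified purely in the text, mirroring the
# (translate to french, english text, french text) format above.
prompt = "Translate English to French:\nsea otter => loutre de mer\ncheese =>"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=10,                     # only generate the continuation
    do_sample=False,                       # greedy decoding, deterministic sketch
    pad_token_id=tokenizer.eos_token_id,
)
# Print only the newly generated tokens after the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```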

Training Dataset

Instead, we created a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny.

The resulting dataset, WebText, contains the text subset of these 45 million links. To extract the text from HTML responses we use a combination of the Dragnet (Peters & Lecocq, 2013) and Newspaper content extractors. All results presented in this paper use a preliminary version of WebText which does not include links created after Dec 2017 and which after de-duplication and some heuristic based cleaning contains slightly over 8 million documents for a total of 40 GB of text. We removed all Wikipedia documents from WebText since it is a common data source for other datasets and could complicate analysis due to overlapping training data with test evaluation tasks.
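A rough sketch of the link-filtering heuristic described above, under assumed field names (`url`, `karma`); the actual scraper and the Dragnet/Newspaper extraction step are not reproduced here.

```python
# Hypothetical sketch of the Reddit-outbound-link heuristic described above.
# `posts` is assumed to be an iterable of dicts with "url" and "karma" fields.
from urllib.parse import urlparse

MIN_KARMA = 3  # "at least 3 karma" as a proxy for human curation

def select_links(posts):
    """Keep outbound links from posts with enough karma, drop Wikipedia."""
    seen = set()
    for post in posts:
        url = post["url"]
        domain = urlparse(url).netloc.lower()
        if post["karma"] < MIN_KARMA:
            continue
        if "wikipedia.org" in domain:    # removed to avoid test-set overlap
            continue
        if url in seen:                  # exact-URL de-duplication
            continue
        seen.add(url)
        yield url
```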

Experiments

We trained and benchmarked four LMs with approximately log-uniformly spaced sizes. The architectures are summarized in Table 2. The smallest model is equivalent to the original GPT, and the second smallest equivalent to the largest model from BERT (Devlin et al, 2018). Our largest model, which we call GPT-2, has over an order of magnitude more parameters than GPT. The learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText. All models still underfit WebText and held-out perplexity has as of yet improved given more training time.
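For reference, the four configurations summarized in the paper's Table 2 can be written down as follows; the parameter counts, layer counts, and model widths are as reported there, while the dictionary layout itself is only for illustration.

```python
# The four WebText LMs from Table 2 of the paper (sizes as reported there).
GPT2_CONFIGS = {
    "117M":  {"n_layer": 12, "d_model": 768},   # equivalent to the original GPT
    "345M":  {"n_layer": 24, "d_model": 1024},  # comparable to BERT-large
    "762M":  {"n_layer": 36, "d_model": 1280},
    "1542M": {"n_layer": 48, "d_model": 1600},  # the model called GPT-2
}
```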

Comparing the zero-shot models of different sizes:

Although there is still a gap to supervised systems on most tasks, performance keeps climbing as the parameter count grows.
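The selection metric mentioned in the quoted paragraph is held-out perplexity; a generic sketch of how perplexity is computed from per-token negative log-likelihoods (not the authors' code) is shown below.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# e.g. a model assigning every token probability 0.1 has perplexity ~10
print(perplexity([-math.log(0.1)] * 5))
```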

Generalization vs Memorization

Overall, our analysis suggests that data overlap between WebText training data and specific evaluation datasets provides a small but consistent benefit to reported results. However, for most datasets we do not notice significantly larger overlaps than those already existing between standard training and test sets, as Table 6 highlights.

Understanding and quantifying how highly similar text impacts performance is an important research question. Better de-duplication techniques such as scalable fuzzy matching could also help better answer these questions. For now, we recommend the use of n-gram overlap based de-duplication as an important verification step and sanity check during the creation of training and test splits for new NLP datasets.

Another potential way of determining whether the performance of WebText LMs is attributable to memorization is inspecting their performance on their own held-out set. As shown in Figure 4, performance on both the training and test sets of WebText are similar and improve together as model size is increased. This suggests even GPT-2 is still underfitting on WebText in many ways.
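A minimal sketch of the n-gram overlap check recommended above: the paper builds Bloom filters over 8-grams of the WebText training tokens, whereas this sketch uses plain Python sets and whitespace tokenization, which is enough to illustrate the idea.

```python
# Simple n-gram overlap check between a training corpus and a test document.
def ngrams(text, n=8):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(train_texts, test_text, n=8):
    """Fraction of the test document's n-grams that also appear in training data."""
    train_grams = set()
    for doc in train_texts:
        train_grams |= ngrams(doc, n)
    test_grams = ngrams(test_text, n)
    if not test_grams:
        return 0.0
    return len(test_grams & train_grams) / len(test_grams)
```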

Summary:

The WebText training data overlaps with common test sets, so n-gram overlap is used for de-duplication and to quantify the contamination.

Training and held-out perplexity fall together, so the model clearly has not been trained to saturation on WebText (it is still underfitting).

Discussion

Some zero-shot ability is there, but not a lot, and it is unclear where the ceiling lies if the parameter count keeps growing.

Conclusion

When a large language model is trained on a sufficiently large and diverse dataset it is able to perform well across many domains and datasets. GPT-2 zero-shots to state of the art performance on 7 out of 8 tested language modeling datasets. The diversity of tasks the model is able to perform in a zero-shot setting suggests that high-capacity models trained to maximize the likelihood of a sufficiently varied text corpus begin to learn how to perform a surprising amount of tasks without the need for explicit supervision.
