Paper Close Reading -- GPT-2

Overshadowed by BERT, yet still committed to the decoder-only architecture.

Abstract

Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

Summary:

Bigger and stronger: the training corpus grows to millions of web pages, and the model has 1.5 billion parameters.

Compared with BERT, though, scale alone is not a very compelling advantage (BERT-large gets by with only 0.34 billion parameters), so the authors pick a different angle and make zero-shot learning the main selling point.

Introduction

The dominant approach to creating ML systems is to collect a dataset of training examples demonstrating correct behavior for a desired task, train a system to imitate these behaviors, and then test its performance on independent and identically distributed (IID) held-out examples. This has served well to make progress on narrow experts. But the often erratic behavior of captioning models (Lake et al, 2017), reading comprehension systems (Jia & Liang, 2017), and image classifiers (Alcorn et al, 2018) on the diversity and variety of possible inputs highlights some of the shortcomings of this approach.

Our suspicion is that the prevalence of single task training on single domain datasets is a major contributor to the lack of generalization observed in current systems. Progress towards robust systems with current architectures is likely to require training and measuring performance on a wide range of domains and tasks. Recently, several benchmarks have been proposed such as GLUE (Wang et al, 2018) and decaNLP (McCann et al, 2018) to begin studying this.

Summary:

Existing models generalize poorly: each specific task still needs its own task-specific dataset.

Multitask learning (Caruana, 1997) is a promising framework for improving general performance. However, multitask training in NLP is still nascent. Recent work reports modest performance improvements (Yogatama et al, 2019) and the two most ambitious efforts to date have trained on a total of 10 and 17 (dataset, objective) pairs respectively (McCann et al, 2018) (Bowman et al, 2018). From a meta-learning perspective, each (dataset, objective) pair is a single training example sampled from the distribution of datasets and objectives. Current ML systems need hundreds to thousands of examples to induce functions which generalize well. This suggests that multitask training may need just as many effective training pairs to realize its promise with current approaches. It will be very difficult to continue to scale the creation of datasets and the design of objectives to the degree that may be required to brute force our way there with current techniques. This motivates exploring additional setups for performing multitask learning.

Summary:

Multitask learning: train on several datasets at once, possibly with several objectives or loss functions, so that a single model can serve multiple tasks; it has not seen much adoption in NLP so far.

The current best performing systems on language tasks utilize a combination of pre-training and supervised finetuning. This approach has a long history with a trend towards more flexible forms of transfer. First, word vectors were learned and used as inputs to task-specific architectures (Mikolov et al, 2013) (Collobert et al, 2011), then the contextual representations of recurrent networks were transferred (Dai & Le, 2015) (Peters et al, 2018), and recent work suggests that task-specific architectures are no longer necessary and transferring many self-attention blocks is sufficient (Radford et al, 2018) (Devlin et al, 2018).

Summary:

Pretraining plus supervised fine-tuning, the recipe behind GPT and BERT, is still what works best.

In this paper, we connect these two lines of work and continue the trend of more general methods of transfer. We demonstrate language models can perform down-stream tasks in a zero-shot setting – without any parameter or architecture modification. We demonstrate this approach shows potential by highlighting the ability of language models to perform a wide range of tasks in a zero-shot setting. We achieve promising, competitive, and state of the art results depending on the task.

Summary:

Train GPT-2 once as a zero-shot model and use it everywhere, with no fine-tuning at all.

Approach

For example, a translation training example can be written as the sequence (translate to french, english text, french text). Likewise, a reading comprehension training example can be written as (answer the question, document, question, answer). McCann et al (2018) demonstrated it was possible to train a single model, the MQAN, to infer and perform many different tasks on examples with this type of format.

Language modeling is also able to, in principle, learn the tasks of McCann et al (2018) without the need for explicit supervision of which symbols are the outputs to be predicted. Since the supervised objective is the same as the unsupervised objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective. In this slightly toy setting, the concerns with density estimation as a principled training objective discussed in (Sutskever et al, 2015) are sidestepped. The problem instead becomes whether we are able to, in practice, optimize the unsupervised objective to convergence. Preliminary experiments confirmed that sufficiently large language models are able to perform multitask learning in this toy-ish setup but learning is much slower than in explicitly supervised approaches.

While it is a large step from the well-posed setup described above to the messiness of “language in the wild”, Weston (2016) argues, in the context of dialog, for the need to develop systems capable of learning from natural language directly and demonstrated a proof of concept – learning a QA task without a reward signal by using forward prediction of a teacher’s outputs. While dialog is an attractive approach, we worry it is overly restrictive. The internet contains a vast amount of information that is passively available without the need for interactive communication. Our speculation is that a language model with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement. If a language model is able to do this it will be, in effect, performing unsupervised multitask learning. We test whether this is the case by analyzing the performance of language models in a zero-shot setting on a wide variety of tasks.
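To spell out the equivalence asserted in the paragraph on supervised versus unsupervised objectives, the paper's framing can be restated in symbols (a paraphrase, not a quotation): the language model factorizes a sequence with the chain rule, a single supervised task estimates p(output | input), and a general multitask system additionally conditions on the task, which in this setup is expressed as plain text inside the same sequence. The supervised loss is then just the unsupervised likelihood evaluated on the answer portion of the sequence.

```latex
% Chain-rule factorization used for unsupervised language modeling
p(x) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})

% A single supervised task estimates a conditional distribution
p(\mathrm{output} \mid \mathrm{input})

% A general multitask system also conditions on the task,
% with the task itself specified as text inside the sequence
p(\mathrm{output} \mid \mathrm{input}, \mathrm{task})
```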

Summary:

Because everything is zero-shot, there is no labeled downstream data with which to learn the start, delim, and extract tokens that GPT-1 relied on, so the model has to be steered with natural-language prompts instead. Why should that work? On the one hand, the authors argue, a sufficiently large model can understand what a prompt is asking for; on the other hand, many prompt-like phrasings already occur naturally in web text.
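As an illustration of what prompt-based zero-shot use looks like in practice, here is a minimal sketch using the publicly released GPT-2 weights via the Hugging Face `transformers` library. This is not the authors' evaluation code, and the prompt string is made up for illustration only.

```python
# Minimal zero-shot prompting sketch with the released GPT-2 weights.
# Assumes the `transformers` and `torch` packages are installed.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# The task is specified purely in the text, mirroring the
# (translate to french, english text, french text) format above.
prompt = "Translate English to French:\nsea otter => loutre de mer\ncheese =>"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=10,                     # only generate the continuation
    do_sample=False,                       # greedy decoding, deterministic sketch
    pad_token_id=tokenizer.eos_token_id,
)
# Print only the newly generated tokens after the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```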

Training Dataset

Instead, we created a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny.

The resulting dataset, WebText, contains the text subset of these 45 million links. To extract the text from HTML responses we use a combination of the Dragnet (Peters & Lecocq, 2013) and Newspaper content extractors. All results presented in this paper use a preliminary version of WebText which does not include links created after Dec 2017 and which after de-duplication and some heuristic based cleaning contains slightly over 8 million documents for a total of 40 GB of text. We removed all Wikipedia documents from WebText since it is a common data source for other datasets and could complicate analysis due to overlapping training data with test evaluation tasks.
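A rough sketch of the link-filtering heuristic described above, under assumed field names (`url`, `karma`); the actual scraper and the Dragnet/Newspaper extraction step are not reproduced here.

```python
# Hypothetical sketch of the Reddit-outbound-link heuristic described above.
# `posts` is assumed to be an iterable of dicts with "url" and "karma" fields.
from urllib.parse import urlparse

MIN_KARMA = 3  # "at least 3 karma" as a proxy for human curation

def select_links(posts):
    """Keep outbound links from posts with enough karma, drop Wikipedia."""
    seen = set()
    for post in posts:
        url = post["url"]
        domain = urlparse(url).netloc.lower()
        if post["karma"] < MIN_KARMA:
            continue
        if "wikipedia.org" in domain:    # removed to avoid test-set overlap
            continue
        if url in seen:                  # exact-URL de-duplication
            continue
        seen.add(url)
        yield url
```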

Experiments

We trained and benchmarked four LMs with approximately log-uniformly spaced sizes. The architectures are summarized in Table 2. The smallest model is equivalent to the original GPT, and the second smallest equivalent to the largest model from BERT (Devlin et al, 2018). Our largest model, which we call GPT-2, has over an order of magnitude more parameters than GPT. The learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText. All models still underfit WebText and held-out perplexity has as of yet improved given more training time.
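For reference, the four configurations summarized in the paper's Table 2 can be written down as follows; the parameter counts, layer counts, and model widths are as reported there, while the dictionary layout itself is only for illustration.

```python
# The four WebText LMs from Table 2 of the paper (sizes as reported there).
GPT2_CONFIGS = {
    "117M":  {"n_layer": 12, "d_model": 768},   # equivalent to the original GPT
    "345M":  {"n_layer": 24, "d_model": 1024},  # comparable to BERT-large
    "762M":  {"n_layer": 36, "d_model": 1280},
    "1542M": {"n_layer": 48, "d_model": 1600},  # the model called GPT-2
}
```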

Comparing the zero-shot models of different sizes:

Although there is still a gap to supervised systems on most tasks, performance keeps climbing as the parameter count grows.
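The selection metric mentioned in the quoted paragraph is held-out perplexity; a generic sketch of how perplexity is computed from per-token negative log-likelihoods (not the authors' code) is shown below.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# e.g. a model assigning every token probability 0.1 has perplexity ~10
print(perplexity([-math.log(0.1)] * 5))
```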

Generalization vs Memorization

Overall, our analysis suggests that data overlap between WebText training data and specific evaluation datasets provides a small but consistent benefit to reported results. However, for most datasets we do not notice significantly larger overlaps than those already existing between standard training and test sets, as Table 6 highlights.

Understanding and quantifying how highly similar text impacts performance is an important research question. Better de-duplication techniques such as scalable fuzzy matching could also help better answer these questions. For now, we recommend the use of n-gram overlap based de-duplication as an important verification step and sanity check during the creation of training and test splits for new NLP datasets.

Another potential way of determining whether the performance of WebText LMs is attributable to memorization is inspecting their performance on their own held-out set. As shown in Figure 4, performance on both the training and test sets of WebText are similar and improve together as model size is increased. This suggests even GPT-2 is still underfitting on WebText in many ways.
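A minimal sketch of the n-gram overlap check recommended above: the paper builds Bloom filters over 8-grams of the WebText training tokens, whereas this sketch uses plain Python sets and whitespace tokenization, which is enough to illustrate the idea.

```python
# Simple n-gram overlap check between a training corpus and a test document.
def ngrams(text, n=8):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(train_texts, test_text, n=8):
    """Fraction of the test document's n-grams that also appear in training data."""
    train_grams = set()
    for doc in train_texts:
        train_grams |= ngrams(doc, n)
    test_grams = ngrams(test_text, n)
    if not test_grams:
        return 0.0
    return len(test_grams & train_grams) / len(test_grams)
```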

Summary:

The WebText training data overlaps with common test sets, so n-gram overlap is used for de-duplication and to quantify the contamination.

Training and held-out perplexity fall together, so the model clearly has not been trained to saturation on WebText (it is still underfitting).

Discussion

Some zero-shot ability is there, but not a lot, and it is unclear where the ceiling lies if the parameter count keeps growing.

Conclusion

When a large language model is trained on a sufficiently large and diverse dataset it is able to perform well across many domains and datasets. GPT-2 zero-shots to state of the art performance on 7 out of 8 tested language modeling datasets. The diversity of tasks the model is able to perform in a zero-shot setting suggests that high-capacity models trained to maximize the likelihood of a sufficiently varied text corpus begin to learn how to perform a surprising amount of tasks without the need for explicit supervision.
