Unsolved problems in natural language datasets

Garbage in, garbage out. You don’t have to be an ML expert to have heard this phrase. Models uncover patterns in the data, so when the data is broken, they develop broken behavior. This is why researchers allocate significant resources towards curating datasets. However, despite best efforts, it is nearly impossible to collect perfectly clean data, especially at the scale demanded by deep learning.

This article discusses popular natural language datasets that turned out to disobey fundamental principles of machine learning and data science, despite being produced by experts in the field. Some of these flaws were exposed and quantified only years after the datasets were published and heavily used. This goes to show that data collection and validation are arduous processes. Here are some of their main impediments:

  1. Machine learning is data hungry. The sheer volume of data needed for ML (deep learning in particular) calls for automation, i.e., mining the Internet. Datasets end up inheriting undesirable properties from the Internet (e.g., duplication, statistical biases, falsehoods) that are non-trivial to detect and remove (see the near-duplicate sketch after this list).

  2. Desiderata cannot be captured exhaustively. Even in the presence of an oracle that could produce infinite data according to some predefined rules, it would be practically infeasible to enumerate all requirements. Consider the training data for a conversational bot. We can express general desiderata like diverse topics, respectful communication, or balanced exchange between interlocutors. But we don’t have enough imagination to specify all the relevant parameters.

  3. Humans take the path of least resistance. Some data collection efforts are still manageable at human scale. But we ourselves are not flawless and, despite our best efforts, are subconsciously inclined to take shortcuts. If you were tasked to write a statement that contradicts the premise “The dog is sleeping”, what would your answer be? Continue reading to find out whether you’d be part of the problem.

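To make the duplication problem from item 1 concrete, here is a minimal sketch of how near-duplicate text can be flagged before training. It uses character trigram Jaccard similarity with an illustrative 0.8 threshold; neither choice comes from the papers discussed below, and a web-scale pipeline would use MinHash/LSH instead of exhaustive pairwise comparison.

```python
def char_ngrams(text, n=3):
    """Lowercase character n-grams: a cheap fingerprint for near-duplicate detection."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two n-gram sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(corpus, threshold=0.8):
    """Return index pairs whose n-gram similarity exceeds the threshold.
    O(n^2) comparison is fine for a demo, not for web-scale corpora."""
    grams = [char_ngrams(doc) for doc in corpus]
    return [(i, j)
            for i in range(len(corpus))
            for j in range(i + 1, len(corpus))
            if jaccard(grams[i], grams[j]) >= threshold]

corpus = [
    "The dog is sleeping on the couch.",
    "The dog is sleeping on the couch!",   # near-duplicate of the first
    "A cat is sitting on the windowsill.",
]
print(near_duplicates(corpus))  # [(0, 1)]
```
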
Overlapping training and evaluation sets

ML practitioners split their data three ways: there’s a training set for actual learning, a validation set for hyperparameter tuning, and an evaluation set for measuring the final quality of the model. It is common knowledge that these sets should be mostly disjoint. When evaluating on training data, you are measuring the model’s capacity to memorize rather than its ability to recognize patterns and apply them in new contexts.
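
As a refresher, here is a minimal sketch of the three-way split with scikit-learn; the 80/10/10 ratio is a common convention, not a requirement.

```python
from sklearn.model_selection import train_test_split

# Toy data: X holds inputs, y holds labels.
X = list(range(100))
y = [i % 2 for i in range(100)]

# Carve out 20% for validation + evaluation, then split that part in half.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 80 10 10

# The sets are disjoint: tuning or evaluating on training data would
# measure memorization rather than generalization.
assert not set(X_train) & set(X_val) and not set(X_train) & set(X_test)
```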

This guideline sounds straightforward to apply, yet Lewis et al. [1] show in a 2020 paper that the most popular open-domain question answering datasets (open-QA) have a significant overlap between their training and evaluation sets. Their analysis includes WebQuestions, TriviaQA and Open Natural Questions — datasets created by reputable institutions and heavily used as QA benchmarks.

We find that 60–70% of test-time answers are also present somewhere in the training sets. We also find that 30% of test-set questions have a near-duplicate paraphrase in their corresponding training sets.

Of course, a 0% overlap between training and testing would not be ideal either. We do want some degree of memorization — models should be able to answer questions seen during training and know when to surface previously-seen answers. The real problem is benchmarking a model on a dataset with high training/evaluation overlap and making rushed conclusions about its generalization ability.

Lewis et al. [1] re-evaluate state-of-the-art QA models after partitioning the evaluation sets into three subsets: (a) question overlap — for which identical or paraphrased question-answer pairs occur in the training set, (b) answer overlap only — for which the same answers occur in the training set, but paired with a different question, and (c) no overlap. QA models score vastly differently across these three subsets. For instance, when tested on Open Natural Questions, the state-of-the-art Fusion-in-Decoder model scores ~70% on question overlap, ~50% on answer overlap only, and ~35% on no overlap.
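
The bookkeeping behind this partitioning can be sketched in a few lines. Question overlap is approximated here by exact match after normalization, and answer overlap by normalized string match; Lewis et al. [1] use more careful duplicate and paraphrase detection, so treat this as an illustration rather than a reimplementation.

```python
def normalize(s):
    """Lowercase and collapse whitespace: a crude stand-in for proper normalization."""
    return " ".join(s.lower().split())

def partition_eval(train_pairs, eval_pairs):
    """Split (question, answer) evaluation pairs into the three overlap buckets."""
    train_questions = {normalize(q) for q, _ in train_pairs}
    train_answers = {normalize(a) for _, a in train_pairs}
    buckets = {"question_overlap": [], "answer_overlap_only": [], "no_overlap": []}
    for q, a in eval_pairs:
        if normalize(q) in train_questions:
            buckets["question_overlap"].append((q, a))
        elif normalize(a) in train_answers:
            buckets["answer_overlap_only"].append((q, a))
        else:
            buckets["no_overlap"].append((q, a))
    return buckets

train = [("Who wrote Hamlet?", "Shakespeare"), ("Capital of France?", "Paris")]
evals = [("Who wrote Hamlet?", "Shakespeare"),       # question overlap
         ("Which city hosts the Louvre?", "Paris"),  # answer overlap only
         ("Who painted Guernica?", "Picasso")]       # no overlap
for bucket, items in partition_eval(train, evals).items():
    print(bucket, len(items))
```

Reporting accuracy per bucket, rather than as a single number, is exactly the behaviour-driven evaluation the authors call for below.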

It is clear that performance on these datasets cannot be properly understood by overall QA accuracy and suggest that in future, a greater emphasis should be placed on more behaviour-driven evaluation, rather than pursuing single-number overall accuracy figures.

Spurious correlations

Just like humans, models take shortcuts and discover the simplest patterns that explain the data. For instance, consider a dog-vs-cat image classifier and a naïve training set in which all dog images are grayscale and all cat images are in full color. The model will most likely latch onto the spurious correlation between presence/absence of color and labels. When tested on a dog in full color, it will probably label it as a cat.
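
A toy simulation makes this failure mode tangible: below, a hypothetical "color" feature perfectly predicts the label at training time but is reversed at test time, and a linear model that latches onto it drops far below chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

def make_data(color_matches_label):
    """Feature 0 is weakly but genuinely predictive; feature 1 ("has color")
    correlates with the label only when color_matches_label is True."""
    y = rng.integers(0, 2, n)
    signal = y + rng.normal(0, 1.5, n)           # noisy true signal
    color = y if color_matches_label else 1 - y  # spurious shortcut
    return np.column_stack([signal, color]), y

X_train, y_train = make_data(True)    # shortcut holds in training data
X_test, y_test = make_data(False)     # shortcut is reversed at test time

model = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # ~1.0, thanks to the shortcut
print("test accuracy:", model.score(X_test, y_test))     # far below 0.5
```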

Gururangan et al. [2] showed that similar spurious correlations occur in two of the most popular natural language inference (NLI) datasets, SNLI (Stanford NLI) and MNLI (Multi-genre NLI). Given two statements, a premise and a hypothesis, the natural language inference task is to decide the relationship between them: entailment, contradiction or neutrality. Here is an example from the MNLI dataset:

[Figure: example from the MNLI dataset]

Solving NLI requires understanding the subtle connection between the premise and the hypothesis. However, Gururangan et al. [2] revealed that, when models are shown the hypothesis alone, they can achieve accuracy as high as 67% on SNLI and 53% on MNLI. This is significantly higher than the most-frequent-class baseline (~35%), surfacing undeniable flaws in the datasets.
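
A hypothesis-only baseline of this kind is easy to reproduce in spirit: train any text classifier on hypotheses alone and compare it against the most-frequent-class baseline. The sketch below uses a bag-of-words logistic regression on a handful of made-up examples; the actual experiments in [2] ran over the full SNLI/MNLI training sets.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypotheses only: the premises are deliberately withheld.
hypotheses = [
    "The dog is not sleeping",      # contradiction (negation artifact)
    "The animal is resting",        # entailment (generalization artifact)
    "The tall dog is sleeping",     # neutral (modifier artifact)
    "Nobody is eating",             # contradiction
    "Some people are outside",      # entailment
    "The first runner is fastest",  # neutral
]
labels = ["contradiction", "entailment", "neutral",
          "contradiction", "entailment", "neutral"]

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(hypotheses, labels)

# If this beats the most-frequent-class baseline on held-out data,
# the labels are leaking through the hypotheses.
print(clf.predict(["The cat is not eating"]))  # likely 'contradiction', via "not"
```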

How did this happen? SNLI and MNLI were both crowd-sourced; humans were given a premise and asked to produce three hypotheses, one for each label. Which brings us back to the premise “The dog is sleeping”. How would you contradict it? “The dog is not sleeping” is a perfectly reasonable candidate. However, if negation is consistently applied as a heuristic, models learn to detect contradiction by simply checking for the occurrence of “not” in the hypothesis, achieving high accuracy without even reading the premise.

Gururangan et al. [2] reveal several other such annotation artefacts; one way to surface them programmatically is sketched after the list:

  • Entailment hypotheses were produced by generalizing words found in the premise (dog → animal, 3 → some, woman → person), making entailment recognizable from the hypothesis alone.

  • Neutral hypotheses were produced by injecting modifiers (tall, first, most) as an easy way to introduce information not entailed by the premise but also not contradictory to it.

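One standard way to surface such artefacts, in the spirit of the analysis in [2], is to score the association between individual words and labels with pointwise mutual information (PMI). A self-contained sketch on toy data follows; the real analysis runs over full training sets and uses smoothing.

```python
import math
from collections import Counter

# (hypothesis, label) pairs; in practice, the full SNLI/MNLI training set.
data = [
    ("the dog is not sleeping", "contradiction"),
    ("nobody is outside", "contradiction"),
    ("the animal is resting", "entailment"),
    ("a person is outdoors", "entailment"),
    ("the tall dog is sleeping", "neutral"),
    ("the first runner wins", "neutral"),
]

word_label, word_count, label_count = Counter(), Counter(), Counter()
total = 0
for text, label in data:
    for w in set(text.split()):  # one count per word per example
        word_label[(w, label)] += 1
        word_count[w] += 1
        label_count[label] += 1
        total += 1

def pmi(word, label):
    """PMI between a word and a label; positive values flag candidate artefacts."""
    if not word_label[(word, label)]:
        return float("-inf")
    p_joint = word_label[(word, label)] / total
    return math.log(p_joint / ((word_count[word] / total) * (label_count[label] / total)))

print(round(pmi("not", "contradiction"), 2))  # positive: "not" signals contradiction
print(round(pmi("tall", "neutral"), 2))       # positive: modifiers signal neutral
```
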
Despite these discoveries, MNLI remains on the GLUE leaderboard, one of the most popular benchmarks for natural language processing. Due to its considerable size compared to the other GLUE corpora (~400k data instances), MNLI is prominently featured in abstracts and used in ablation studies. While its shortcomings are starting to be recognized more widely, it is unlikely to lose its popularity until we find a better alternative.

Bias and under-representation

In the past few years, bias in machine learning has been exposed across multiple dimensions including gender and race. In response to biased word embeddings and model behavior, the research community has been directing increasingly more efforts towards bias mitigation, as illustrated by Sun et al. [3] in their comprehensive literature review.

Yann LeCun, co-recipient of the 2018 Turing Award, pointed out that biased data leads to biased model behavior:

[Embedded tweet]

His Tweet drew a lot of engagement from the research community, with mixed reactions. On the one hand, people acknowledged almost unanimously that bias does exist in many datasets. On the other hand, some disagreed with the perceived implication that bias stems solely from data, additionally blaming modeling and evaluation choices, and the unconscious bias of those who design and build the models. Yann LeCun later clarified that he does not consider data bias to be the only cause for societal bias in models:

[Embedded tweet]

Even though the dataset being discussed was an image corpus used for computer vision, natural language processing suffers no less from biased datasets. A prominent task that has exposed gender bias is coreference resolution, where a referring expression (like a pronoun) must be linked to an entity mentioned in the text. Here is an example from Webster et al. [4]:

In May, Fujisawa joined Mari Motohashi’s rink as the team’s skip, moving back from Karuizawa to Kitami where she had spent her junior days.

The authors point out that less than 15% of biographies on Wikipedia are about women, and that those pages tend to discuss marriage and divorce more prominently than pages about men do. Given that many NLP datasets are extracted from Wikipedia, this impacts many downstream tasks. For coreference resolution in particular, the scarcity of female pronouns and their association with certain stereotypes are problematic. For instance, how would you interpret the sentence “Mary saw her doctor as she entered the room”?

Eliminating bias from the training data is an unsolved problem. First, we cannot exhaustively enumerate the axes along which bias manifests; in addition to gender and race, there are many other subtle dimensions that can invite bias (age, proper names, profession, etc.). Second, even if we selected a single axis like gender, removing bias would mean either dropping a large portion of the data or applying error-prone heuristics to turn male pronouns into under-represented gender pronouns. Instead, the research community is currently focusing on producing unbiased evaluation datasets, since their smaller scale is more conducive to manual intervention. This at least gives us the ability to measure the performance of our models more truthfully, across a representative sample of the population.
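
A first practical step in that direction is to audit a corpus for representation before adopting it for evaluation. The sketch below counts gendered pronouns; the word lists are illustrative and far from exhaustive, and a serious audit would cover many more dimensions than pronouns.

```python
import re
from collections import Counter

FEMALE = {"she", "her", "hers", "herself"}
MALE = {"he", "him", "his", "himself"}

def pronoun_stats(corpus):
    """Count gendered pronouns across a corpus: a first-pass representation audit."""
    counts = Counter()
    for doc in corpus:
        for token in re.findall(r"[a-z']+", doc.lower()):
            if token in FEMALE:
                counts["female"] += 1
            elif token in MALE:
                counts["male"] += 1
    return counts

corpus = [
    "In May, Fujisawa joined Mari Motohashi's rink as the team's skip, "
    "moving back from Karuizawa to Kitami where she had spent her junior days.",
    "He published his first paper while he was still a student.",
]
print(pronoun_stats(corpus))  # Counter({'male': 3, 'female': 2})
```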

Building natural language datasets is a never-ending process: we continuously collect data, validate it, acknowledge its shortcomings and work around them. Then we rinse and repeat whenever a new source becomes available. And in the meantime we make progress. All the datasets mentioned above, despite their flaws, have undeniably helped push natural language understanding forward.

Translated from: https://towardsdatascience.com/unsolved-problems-in-natural-language-datasets-2b09ab37e94c
