Unsolved problems in natural language datasets

Garbage in, garbage out. You don’t have to be an ML expert to have heard this phrase. Models uncover patterns in the data, so when the data is broken, they develop broken behavior. This is why researchers allocate significant resources towards curating datasets. However, despite best efforts, it is nearly impossible to collect perfectly clean data, especially at the scale demanded by deep learning.

This article discusses popular natural language datasets that turned out to disobey fundamental principles of machine learning and data science, despite being produced by experts in the field. Some of these flaws were exposed and quantified only years after the datasets were published and heavily used. This goes to show that data collection and validation are arduous processes. Here are some of their main impediments:

  1. Machine learning is data hungry. The sheer volume of data needed for ML (deep learning in particular) calls for automation, i.e., mining the Internet. Datasets end up inheriting undesirable properties from the Internet (e.g., duplication, statistical biases, falsehoods) that are non-trivial to detect and remove (see the near-duplicate sketch after this list).

  2. Desiderata cannot be captured exhaustively. Even in the presence of an oracle that could produce infinite data according to some predefined rules, it would be practically infeasible to enumerate all requirements. Consider the training data for a conversational bot. We can express general desiderata like diverse topics, respectful communication, or balanced exchange between interlocutors. But we don’t have enough imagination to specify all the relevant parameters.

  3. Humans take the path of least resistance. Some data collection efforts are still manageable at human scale. But we ourselves are not flawless and, despite our best efforts, are subconsciously inclined to take shortcuts. If you were tasked to write a statement that contradicts the premise “The dog is sleeping”, what would your answer be? Continue reading to find out whether you’d be part of the problem.

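To make the duplication problem from item 1 concrete, here is a minimal sketch of how near-duplicate text can be flagged before training. It uses character trigram Jaccard similarity with an illustrative 0.8 threshold; neither choice comes from the papers discussed below, and a web-scale pipeline would use MinHash/LSH instead of exhaustive pairwise comparison.

```python
def char_ngrams(text, n=3):
    """Lowercase character n-grams: a cheap fingerprint for near-duplicate detection."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two n-gram sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(corpus, threshold=0.8):
    """Return index pairs whose n-gram similarity exceeds the threshold.
    O(n^2) comparison is fine for a demo, not for web-scale corpora."""
    grams = [char_ngrams(doc) for doc in corpus]
    return [(i, j)
            for i in range(len(corpus))
            for j in range(i + 1, len(corpus))
            if jaccard(grams[i], grams[j]) >= threshold]

corpus = [
    "The dog is sleeping on the couch.",
    "The dog is sleeping on the couch!",   # near-duplicate of the first
    "A cat is sitting on the windowsill.",
]
print(near_duplicates(corpus))  # [(0, 1)]
```
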
Overlapping training and evaluation sets

ML practitioners split their data three ways: there’s a training set for actual learning, a validation set for hyperparameter tuning, and an evaluation set for measuring the final quality of the model. It is common knowledge that these sets should be mostly disjoint. When evaluating on training data, you are measuring the model’s capacity to memorize rather than its ability to recognize patterns and apply them in new contexts.
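
As a refresher, here is a minimal sketch of the three-way split with scikit-learn; the 80/10/10 ratio is a common convention, not a requirement.

```python
from sklearn.model_selection import train_test_split

# Toy data: X holds inputs, y holds labels.
X = list(range(100))
y = [i % 2 for i in range(100)]

# Carve out 20% for validation + evaluation, then split that part in half.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 80 10 10

# The sets are disjoint: tuning or evaluating on training data would
# measure memorization rather than generalization.
assert not set(X_train) & set(X_val) and not set(X_train) & set(X_test)
```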

This guideline sounds straightforward to apply, yet Lewis et al. [1] show in a 2020 paper that the most popular open-domain question answering datasets (open-QA) have a significant overlap between their training and evaluation sets. Their analysis includes WebQuestions, TriviaQA and Open Natural Questions — datasets created by reputable institutions and heavily used as QA benchmarks.

We find that 60–70% of test-time answers are also present somewhere in the training sets. We also find that 30% of test-set questions have a near-duplicate paraphrase in their corresponding training sets.

Of course, a 0% overlap between training and testing would not be ideal either. We do want some degree of memorization — models should be able to answer questions seen during training and know when to surface previously-seen answers. The real problem is benchmarking a model on a dataset with high training/evaluation overlap and making rushed conclusions about its generalization ability.

Lewis et al. [1] re-evaluate state-of-the-art QA models after partitioning the evaluation sets into three subsets: (a) question overlap — for which identical or paraphrased question-answer pairs occur in the training set, (b) answer overlap only — for which the same answers occur in the training set, but paired with a different question, and (c) no overlap. QA models score vastly differently across these three subsets. For instance, when tested on Open Natural Questions, the state-of-the-art Fusion-in-Decoder model scores ~70% on question overlap, ~50% on answer overlap only, and ~35% on no overlap.
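
The bookkeeping behind this partitioning can be sketched in a few lines. Question overlap is approximated here by exact match after normalization, and answer overlap by normalized string match; Lewis et al. [1] use more careful duplicate and paraphrase detection, so treat this as an illustration rather than a reimplementation.

```python
def normalize(s):
    """Lowercase and collapse whitespace: a crude stand-in for proper normalization."""
    return " ".join(s.lower().split())

def partition_eval(train_pairs, eval_pairs):
    """Split (question, answer) evaluation pairs into the three overlap buckets."""
    train_questions = {normalize(q) for q, _ in train_pairs}
    train_answers = {normalize(a) for _, a in train_pairs}
    buckets = {"question_overlap": [], "answer_overlap_only": [], "no_overlap": []}
    for q, a in eval_pairs:
        if normalize(q) in train_questions:
            buckets["question_overlap"].append((q, a))
        elif normalize(a) in train_answers:
            buckets["answer_overlap_only"].append((q, a))
        else:
            buckets["no_overlap"].append((q, a))
    return buckets

train = [("Who wrote Hamlet?", "Shakespeare"), ("Capital of France?", "Paris")]
evals = [("Who wrote Hamlet?", "Shakespeare"),       # question overlap
         ("Which city hosts the Louvre?", "Paris"),  # answer overlap only
         ("Who painted Guernica?", "Picasso")]       # no overlap
for bucket, items in partition_eval(train, evals).items():
    print(bucket, len(items))
```

Reporting accuracy per bucket, rather than as a single number, is exactly the behaviour-driven evaluation the authors call for below.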

It is clear that performance on these datasets cannot be properly understood by overall QA accuracy and suggest that in future, a greater emphasis should be placed on more behaviour-driven evaluation, rather than pursuing single-number overall accuracy figures.

Spurious correlations

Just like humans, models take shortcuts and discover the simplest patterns that explain the data. For instance, consider a dog-vs-cat image classifier and a naïve training set in which all dog images are grayscale and all cat images are in full color. The model will most likely latch onto the spurious correlation between presence/absence of color and labels. When tested on a dog in full color, it will probably label it as a cat.
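
A toy simulation makes this failure mode tangible: below, a hypothetical "color" feature perfectly predicts the label at training time but is reversed at test time, and a linear model that latches onto it drops far below chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

def make_data(color_matches_label):
    """Feature 0 is weakly but genuinely predictive; feature 1 ("has color")
    correlates with the label only when color_matches_label is True."""
    y = rng.integers(0, 2, n)
    signal = y + rng.normal(0, 1.5, n)           # noisy true signal
    color = y if color_matches_label else 1 - y  # spurious shortcut
    return np.column_stack([signal, color]), y

X_train, y_train = make_data(True)    # shortcut holds in training data
X_test, y_test = make_data(False)     # shortcut is reversed at test time

model = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # ~1.0, thanks to the shortcut
print("test accuracy:", model.score(X_test, y_test))     # far below 0.5
```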

Gururangan et al. [2] showed that similar spurious correlations occur in two of the most popular natural language inference (NLI) datasets, SNLI (Stanford NLI) and MNLI (Multi-genre NLI). Given two statements, a premise and a hypothesis, the natural language inference task is to decide the relationship between them: entailment, contradiction or neutrality. Here is an example from the MNLI dataset:

[Figure: example from the MNLI dataset]

Solving NLI requires understanding the subtle connection between the premise and the hypothesis. However, Gururangan et al. [2] revealed that, when models are shown the hypothesis alone, they can achieve accuracy as high as 67% on SNLI and 53% on MNLI. This is significantly higher than the most-frequent-class baseline (~35%), surfacing undeniable flaws in the datasets.
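
A hypothesis-only baseline of this kind is easy to reproduce in spirit: train any text classifier on hypotheses alone and compare it against the most-frequent-class baseline. The sketch below uses a bag-of-words logistic regression on a handful of made-up examples; the actual experiments in [2] ran over the full SNLI/MNLI training sets.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypotheses only: the premises are deliberately withheld.
hypotheses = [
    "The dog is not sleeping",      # contradiction (negation artifact)
    "The animal is resting",        # entailment (generalization artifact)
    "The tall dog is sleeping",     # neutral (modifier artifact)
    "Nobody is eating",             # contradiction
    "Some people are outside",      # entailment
    "The first runner is fastest",  # neutral
]
labels = ["contradiction", "entailment", "neutral",
          "contradiction", "entailment", "neutral"]

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(hypotheses, labels)

# If this beats the most-frequent-class baseline on held-out data,
# the labels are leaking through the hypotheses.
print(clf.predict(["The cat is not eating"]))  # likely 'contradiction', via "not"
```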

How did this happen? SNLI and MNLI were both crowd-sourced; humans were given a premise and asked to produce three hypotheses, one for each label. Which brings us back to the premise “The dog is sleeping”. How would you contradict it? “The dog is not sleeping” is a perfectly reasonable candidate. However, if negation is consistently applied as a heuristic, models learn to detect contradiction by simply checking for the occurrence of “not” in the hypothesis, achieving high accuracy without even reading the premise.

Gururangan et al. [2] reveal several other such annotation artefacts; one way to surface them programmatically is sketched after the list:

  • Entailment hypotheses were produced by generalizing words found in the premise (dog → animal, 3 → some, woman → person), making entailment recognizable from the hypothesis alone.

  • Neutral hypotheses were produced by injecting modifiers (tall, first, most) as an easy way to introduce information not entailed by the premise but also not contradictory to it.

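One standard way to surface such artefacts, in the spirit of the analysis in [2], is to score the association between individual words and labels with pointwise mutual information (PMI). A self-contained sketch on toy data follows; the real analysis runs over full training sets and uses smoothing.

```python
import math
from collections import Counter

# (hypothesis, label) pairs; in practice, the full SNLI/MNLI training set.
data = [
    ("the dog is not sleeping", "contradiction"),
    ("nobody is outside", "contradiction"),
    ("the animal is resting", "entailment"),
    ("a person is outdoors", "entailment"),
    ("the tall dog is sleeping", "neutral"),
    ("the first runner wins", "neutral"),
]

word_label, word_count, label_count = Counter(), Counter(), Counter()
total = 0
for text, label in data:
    for w in set(text.split()):  # one count per word per example
        word_label[(w, label)] += 1
        word_count[w] += 1
        label_count[label] += 1
        total += 1

def pmi(word, label):
    """PMI between a word and a label; positive values flag candidate artefacts."""
    if not word_label[(word, label)]:
        return float("-inf")
    p_joint = word_label[(word, label)] / total
    return math.log(p_joint / ((word_count[word] / total) * (label_count[label] / total)))

print(round(pmi("not", "contradiction"), 2))  # positive: "not" signals contradiction
print(round(pmi("tall", "neutral"), 2))       # positive: modifiers signal neutral
```
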
Despite these discoveries, MNLI remains on the GLUE leaderboard, one of the most popular benchmarks for natural language processing. Due to its considerable size compared to the other GLUE corpora (~400k data instances), MNLI is prominently featured in abstracts and used in ablation studies. While its shortcomings are starting to be recognized more widely, it is unlikely to lose its popularity until we find a better alternative.

Bias and under-representation

In the past few years, bias in machine learning has been exposed across multiple dimensions including gender and race. In response to biased word embeddings and model behavior, the research community has been directing increasingly more efforts towards bias mitigation, as illustrated by Sun et al. [3] in their comprehensive literature review.

Yann LeCun, co-recipient of the 2018 Turing Award, pointed out that biased data leads to biased model behavior:

[Embedded tweet]

His Tweet drew a lot of engagement from the research community, with mixed reactions. On the one hand, people acknowledged almost unanimously that bias does exist in many datasets. On the other hand, some disagreed with the perceived implication that bias stems solely from data, additionally blaming modeling and evaluation choices, and the unconscious bias of those who design and build the models. Yann LeCun later clarified that he does not consider data bias to be the only cause for societal bias in models:

[Embedded tweet]

Even though the dataset being discussed was an image corpus used for computer vision, natural language processing suffers no less from biased datasets. A prominent task that has exposed gender bias is coreference resolution, where a referring expression (like a pronoun) must be linked to an entity mentioned in the text. Here is an example from Webster et al. [4]:

In May, Fujisawa joined Mari Motohashi’s rink as the team’s skip, moving back from Karuizawa to Kitami where she had spent her junior days.

The authors point out that less than 15% of biographies on Wikipedia are about women, and that those pages tend to discuss marriage and divorce more prominently than pages about men do. Given that many NLP datasets are extracted from Wikipedia, this impacts many downstream tasks. For coreference resolution in particular, the scarcity of female pronouns and their association with certain stereotypes are problematic. For instance, how would you interpret the sentence “Mary saw her doctor as she entered the room”?

Eliminating bias from the training data is an unsolved problem. First, we cannot exhaustively enumerate the axes along which bias manifests; in addition to gender and race, there are many other subtle dimensions that can invite bias (age, proper names, profession, etc.). Second, even if we selected a single axis like gender, removing bias would mean either dropping a large portion of the data or applying error-prone heuristics to turn male pronouns into under-represented gender pronouns. Instead, the research community is currently focusing on producing unbiased evaluation datasets, since their smaller scale is more conducive to manual intervention. This at least gives us the ability to measure the performance of our models more truthfully, across a representative sample of the population.
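
A first practical step in that direction is to audit a corpus for representation before adopting it for evaluation. The sketch below counts gendered pronouns; the word lists are illustrative and far from exhaustive, and a serious audit would cover many more dimensions than pronouns.

```python
import re
from collections import Counter

FEMALE = {"she", "her", "hers", "herself"}
MALE = {"he", "him", "his", "himself"}

def pronoun_stats(corpus):
    """Count gendered pronouns across a corpus: a first-pass representation audit."""
    counts = Counter()
    for doc in corpus:
        for token in re.findall(r"[a-z']+", doc.lower()):
            if token in FEMALE:
                counts["female"] += 1
            elif token in MALE:
                counts["male"] += 1
    return counts

corpus = [
    "In May, Fujisawa joined Mari Motohashi's rink as the team's skip, "
    "moving back from Karuizawa to Kitami where she had spent her junior days.",
    "He published his first paper while he was still a student.",
]
print(pronoun_stats(corpus))  # Counter({'male': 3, 'female': 2})
```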

Building natural language datasets is a never-ending process: we continuously collect data, validate it, acknowledge its shortcomings and work around them. Then we rinse and repeat whenever a new source becomes available. And in the meantime we make progress. All the datasets mentioned above, despite their flaws, have undeniably helped push natural language understanding forward.

Translated from: https://towardsdatascience.com/unsolved-problems-in-natural-language-datasets-2b09ab37e94c
