A Dataset of 1.7 Million arXiv Articles Available on Kaggle

“arXiv is a free distribution service and an open-access archive for 1.7 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics”, as stated by its editors. arXiv is a gold mine of knowledge: the more you dig into it, the more valuable information you find. It also makes it easier to follow trends in science.

If you work in data science, you have probably read articles on arXiv. If you haven’t yet, you should. Since data science is still an evolving field, new papers that lead to improvements are published every day. This makes platforms like arXiv even more valuable.

arXiv has made its entire corpus available as a dataset on Kaggle. The dataset contains relevant features such as article titles, authors, categories, content (both abstract and full text) and citations of the 1.7 million scholarly articles available on arXiv.
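As a rough illustration of how one might start exploring the dataset, the sketch below streams records from a JSON Lines source and counts papers by their first listed category. It assumes that the Kaggle file `arxiv-metadata-oai-snapshot.json` stores one JSON object per line and that each record carries a space-separated `categories` field. The helper names `iter_records` and `count_primary_categories` and the two sample records are made up for illustration.

```python
import json
from collections import Counter


def iter_records(lines):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_primary_categories(records):
    """Count papers by the first category in the assumed 'categories' field."""
    counts = Counter()
    for record in records:
        categories = record.get("categories", "").split()
        if categories:
            counts[categories[0]] += 1
    return counts


# Two made-up records standing in for lines of the real file.
sample_lines = [
    '{"id": "0001", "title": "A", "categories": "cs.LG stat.ML"}',
    '{"id": "0002", "title": "B", "categories": "hep-th"}',
]
print(count_primary_categories(iter_records(sample_lines)))
```

Because `iter_records` accepts any iterable of lines, an open file handle can be passed in directly, so a large file is read one record at a time instead of being loaded into memory at once.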

This dataset is an amazing resource for machine learning and deep learning applications. Some possible applications are:

  • Natural language processing (NLP) and understanding (NLU) use cases
  • Text generation with deep learning using the content of articles
  • Predictive analytics such as category prediction of articles
  • Trend analysis of topics in different scientific fields
  • Paper recommender engine
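To make the last item of the list concrete, here is a minimal sketch of a content-based paper recommender that ranks abstracts by TF-IDF cosine similarity. This is one simple technique chosen for illustration, not a method the dataset or its publishers prescribe. The function names and the three sample abstracts are hypothetical, and a real system would likely use a dedicated library and far more text.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Smoothed inverse document frequency, so no weight is zero or undefined.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either one is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(query_index, docs, top_k=3):
    """Return indices of the top_k documents most similar to docs[query_index]."""
    vectors = tfidf_vectors(docs)
    scores = [
        (cosine(vectors[query_index], vec), idx)
        for idx, vec in enumerate(vectors)
        if idx != query_index
    ]
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scores[:top_k]]


# Made-up abstracts used only to exercise the functions.
abstracts = [
    "Deep neural networks for image classification",
    "Convolutional neural networks improve image recognition",
    "Black hole entropy in string theory",
]
print(recommend(0, abstracts, top_k=1))
```

The first two sample abstracts share the terms "neural", "networks" and "image", while the third shares none with the first, so the second abstract is ranked as the closest match to the first.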
Photo by Skye Studios on Unsplash

Deep learning models are data hungry. With advances in computing and processing, models can absorb more data than ever. Such a large dataset of scientific text is highly valuable raw material for NLP, NLU and text generation. We may even see a model that writes scholarly articles on some topics. OpenAI’s new text generator, GPT-3, pushes us to think beyond the usual limits. So I don’t think a deep learning model that writes about science is far off.

Eleonora Presani, arXiv executive director, said that “by offering the dataset on Kaggle we go beyond what humans can learn by reading all these articles and we make the data and information behind arXiv available to the public in a machine-readable format”. I definitely agree with her on the learning opportunities. Having all of these articles as a dataset allows us to go beyond learning by reading. A ton of valuable insights can be discovered from this gold mine of articles through data analysis and machine learning. For instance, some not-so-obvious connections between different technologies may come to light.

Converting the entire arXiv corpus into a well-structured and organized dataset has the potential to accelerate scientific discovery. Science grows and advances by building on itself. There is no need to reinvent the wheel when we can focus on improving it. By analyzing this arXiv dataset, we can obtain a concise summary of what science has been up to and shed light on what we need to focus on going forward.
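One simple way to summarize what a field has been up to is to count papers per year in a given category. The sketch below does that under two assumptions about the metadata records: each has a space-separated `categories` string and an `update_date` string that begins with a four-digit year. Note that `update_date` would reflect the latest revision rather than the first submission, so this is only a rough proxy. The function name `papers_per_year` and the sample records are invented for illustration.

```python
from collections import defaultdict


def papers_per_year(records, category):
    """Count records per year whose 'categories' string includes `category`."""
    counts = defaultdict(int)
    for record in records:
        if category in record.get("categories", "").split():
            # Assumes dates look like 'YYYY-MM-DD'; other values are skipped.
            year = record.get("update_date", "")[:4]
            if year.isdigit():
                counts[int(year)] += 1
    # Return a plain dict ordered by year for easy printing or plotting.
    return dict(sorted(counts.items()))


# Made-up records standing in for real metadata entries.
sample_records = [
    {"categories": "cs.LG stat.ML", "update_date": "2020-03-01"},
    {"categories": "cs.LG", "update_date": "2020-07-15"},
    {"categories": "cs.LG cs.CV", "update_date": "2018-11-30"},
    {"categories": "hep-th", "update_date": "2020-01-10"},
]
print(papers_per_year(sample_records, "cs.LG"))
```

Comparing such yearly counts across categories gives a first, coarse picture of which areas are growing.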

There is so much to do with this dataset. I highly encourage you to at least take a look at it. You don’t have to build a machine learning product; it is also a helpful resource for practicing data analysis and processing skills.

Thank you for reading. Please let me know if you have any feedback.

  • https://www.kaggle.com/Cornell-University/arxiv?select=arxiv-metadata-oai-snapshot.json
  • https://blogs.cornell.edu/arxiv/2020/08/05/leveraging-machine-learning-to-fuel-new-discoveries-with-the-arxiv-dataset/

Translated from: https://towardsdatascience.com/a-dataset-of-1-7-million-arxiv-articles-available-on-kaggle-8a11075cac32
