How I used machine learning to explore the differences between British and American literature

by Sofia Godovykh

As I delved further into English literature to further my own language gains, my interest was piqued: how do American and British English differ?

With this question framed in my mind, the next step was to apply natural language processing and machine learning techniques to find concrete examples. I was curious to know whether it would be possible to train a classifier that could distinguish literary texts written in the two dialects.

It is quite easy to distinguish texts written in different languages, since the cardinality of the intersection of their words (features, in machine learning terms) is relatively small. Text classification by category (such as science, atheism, computer graphics, and so on) is the well-known “hello world” of text classification tasks. I faced a more difficult task in trying to compare two dialects of the same language, as the texts share no common theme.

The most time-consuming stage of machine learning is data retrieval. For the training sample, I used texts from Project Gutenberg, which can be freely downloaded. As for the list of American and British authors, I used names of authors I found on Wikipedia.

One of the challenges I encountered was matching the author of a text to a Wikipedia page. The site has a good search by name, but since it doesn’t allow the parsing of data, I proposed to use the files containing metadata instead. This meant I needed to solve a non-trivial name-matching task (Sir Arthur Ignatius Conan Doyle and Doyle, C. are the same person, but Doyle, M.E. is a different person), and I had to do so with a very high level of accuracy.

Instead, I chose to sacrifice the sample size for the sake of attaining high accuracy, as well as saving some time. I chose, as a unique identifier, an author’s Wikipedia link, which was included in some of the metadata files. With these files, I was able to acquire about 1,600 British and 2,500 American texts and use them to begin training my classifier.

For this project I used the sklearn package. The first step after the data collection and analysis stage is pre-processing, for which I used a CountVectorizer. A CountVectorizer takes text data as input and returns a vector of features as output. Next, I needed to calculate tf-idf (term frequency-inverse document frequency). A brief explanation of why I needed to use it and how:


For example, take the word “the” and count the number of its occurrences in a given text. Let’s suppose that we have 100 occurrences, and that the total number of words in the document is 1000.

Thus,

tf(“the”) = 100/1000 = 0.1

Next, take the word “sepal”, which has occurred 50 times:

tf(“sepal”) = 50/1000 = 0.05

To calculate the inverse document frequency of a word, we take the logarithm of the ratio of the total number of texts to the number of texts that contain the word at least once. If there are 10,000 texts in total, and the word “the” appears in every one of them:

idf(“the”) = log(10000/10000) = 0 and

tf-idf(“the”) = idf(“the”) * tf(“the”) = 0 * 0.1 = 0

The word “sepal” is much rarer, found in only 5 of the texts. Therefore:

idf(“sepal”) = log(10000/5) ≈ 7.6 and tf-idf(“sepal”) = 7.6 * 0.05 = 0.38

Thus, the most frequently occurring words carry less weight, while specific, rarer ones carry more. If there are many occurrences of the word “sepal”, we can assume that the text is botanical. We cannot feed a classifier with raw words, so we use the tf-idf measure instead.

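The arithmetic above can be checked directly with the natural logarithm (note that sklearn's TfidfTransformer uses a smoothed variant of this formula by default):

```python
import math

total_docs = 10000

# "the": 100 of 1000 words, and it appears in every document
tf_the = 100 / 1000                     # 0.1
idf_the = math.log(total_docs / 10000)  # log(1) = 0
tfidf_the = tf_the * idf_the            # 0.0

# "sepal": 50 of 1000 words, but it appears in only 5 documents
tf_sepal = 50 / 1000                    # 0.05
idf_sepal = math.log(total_docs / 5)    # ~7.6
tfidf_sepal = tf_sepal * idf_sepal      # ~0.38

print(tfidf_the, tfidf_sepal)
```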

After I had represented the data as a set of features, I needed to train the classifier. I was working with text data, which is sparse, so the best option is a linear classifier, which works well with large numbers of features.

First, I ran the CountVectorizer, TfidfTransformer, and SGDClassifier using the default parameters. By analyzing the plot of accuracy against sample size (accuracy fluctuated from 0.6 to 0.85), I discovered that the classifier was very much dependent on the particular sample used, and therefore not very effective.

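A minimal sketch of that first run; the six-sentence corpus and its labels are invented for illustration (the real sample was about 4,100 books):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Invented miniature corpus standing in for the Gutenberg texts.
texts = [
    "the colour of the london fog pleased his lordship",
    "she realised the honour of meeting the lord in london",
    "a grey sky hung over the london docks",
    "the color of the new york skyline pleased the girl",
    "he realized the honor of earning a thousand dollars",
    "a gray sky hung over the boston harbor",
]
labels = ["british", "british", "british", "american", "american", "american"]

pipe = Pipeline([
    ("vect", CountVectorizer()),              # words -> counts
    ("tfidf", TfidfTransformer()),            # counts -> tf-idf weights
    ("clf", SGDClassifier(random_state=42)),  # linear classifier
])
pipe.fit(texts, labels)

print(pipe.predict(["the grey colour of the london morning"]))
```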

After inspecting the list of classifier weights, I noticed part of the problem: the classifier had been fed words like “of” and “he”, which should have been treated as noise. I could easily solve this problem by removing these words from the features, setting the stop_words parameter of the CountVectorizer: stop_words = ‘english’ (or your own custom list of stop words).


With the default stop words removed, I got an accuracy of 0.85. After that, I launched the automatic selection of parameters using GridSearchCV and achieved a final accuracy of 0.89. I may be able to improve this result with a larger training sample, but for now I stuck with this classifier.

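A sketch of such a search; the parameter grid below is illustrative (the article does not say which parameters were actually tuned), and the six-sentence corpus is invented:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented miniature corpus; the real training set was about 4,100 books.
texts = [
    "the colour of the london fog pleased his lordship",
    "she realised the honour of meeting the lord in london",
    "a grey sky hung over the london docks",
    "the color of the new york skyline pleased the girl",
    "he realized the honor of earning a thousand dollars",
    "a gray sky hung over the boston harbor",
]
labels = ["british", "british", "british", "american", "american", "american"]

pipe = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=42)),
])

# Hypothetical grid: ngram range, idf on/off, and regularization strength.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "clf__alpha": [1e-4, 1e-3],
}

search = GridSearchCV(pipe, param_grid, cv=2)  # tiny cv for the tiny corpus
search.fit(texts, labels)
print(search.best_params_)
```

GridSearchCV addresses each pipeline step's parameters with the `step__param` naming convention and keeps the best cross-validated combination.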

Now on to what interests me most: which words point to the origin of the text? Here’s a list of words, sorted in descending order of weight in the classifier:


American: dollars, new, york, girl, gray, american, carvel, color, city, ain, long, just, parlor, boston, honor, washington, home, labor, got, finally, maybe, hodder, forever, dorothy, dr


British: round, sir, lady, london, quite, mr, shall, lord, grey, dear, honour, having, philip, poor, pounds, scrooge, soames, things, sea, man, end, come, colour, illustration, english, learnt


While having fun with the classifier, I was able to single out the most “American” British authors and the most “British” American authors (a tricky way to see how badly my classifier could perform).

The most “British” Americans:

  • Frances Hodgson Burnett (born in England, moved to the USA at age 17, so I treat her as an American writer)
  • Henry James (born in the USA, moved to England at age 33)
  • Owen Wister (yes, the father of Western fiction)
  • Mary Roberts Rinehart (called the American Agatha Christie for a reason)
  • William McFee (another writer who moved to America at a young age)

The most “American” British:

  • Rudyard Kipling (he lived in America for several years and also wrote “American Notes”)
  • Anthony Trollope (the author of “North America”)
  • Frederick Marryat (a veteran of the Anglo-American War of 1812; his “Narrative of the Travels and Adventures of Monsieur Violet in California, Sonara, and Western Texas” landed him in the American category)
  • Arnold Bennett (the author of “Your United States: Impressions of a first visit”; one more gentleman who wrote travel notes)
  • E. Phillips Oppenheim

And also the most “British” British and “American” American authors (because the classifier still works well):

Americans:

  • Francis Hopkinson Smith
  • Hamlin Garland
  • George Ade
  • Charles Dudley Warner
  • Mark Twain

British:

  • George Meredith
  • Samuel Richardson
  • John Galsworthy
  • Gilbert Keith Chesterton
  • Anthony Trollope (oh, hi)

I was inspired to do this work by this @TragicAllyHere tweet:

Well, wourds really matter, as I realised.

Translated from: https://www.freecodecamp.org/news/how-to-differentiate-between-british-and-american-literature-being-a-machine-learning-engineer-ac842662da1c/
