Using NLP to Visualize Ulysses

My data science experience has, thus far, been focused on natural language processing (NLP), and this post is neither the first nor the last to take the novel Ulysses, by James Joyce, as its primary target for NLP and literary elucidation. In this post I will explain why it’s such a perfect target, since Ulysses will likely be the focus of my next project. This will probably be a multi-part blog post.

About the Book

First off, why this book?

Ulysses, by James Joyce, has elicited just about every kind of response from readers since its publication in 1922. Some claim it’s the pinnacle of modernist literature; others claim it’s a filthy, decadent depiction of obscenity and pornography (nonetheless glittering with Shakespearean intertextuality on nearly every page) which should be banned. And banned it was in the United States, until the famous 1933 federal district court case, United States v. One Book Called Ulysses, decisively readmitted it. Moreover, this decision highlighted serious, longstanding philosophical questions about the role of art and the right to literary expression.

I fall firmly into the former category, and I believe it’s an affirmative work of genius. The reasons for this include many of the reasons I use Ulysses for NLP: the intent of Joyce was to recreate the many different modes of human experience through language. Through the scintillating and narrowing confines of different languages, dialects, subdialects, profanities, connotations, grammars, and idioms, all of which cross different religious and cultural traditions from the Catholic Church to Irish nationalists, to the mundane domestic affairs of a house in Ireland in 1904, to the parallels of ordinary life found on one day in Dublin to the Odyssey, as well as Shakespeare’s Hamlet, through fixations on classical philosophy and human suffering expressed via a hallucinatory, drunken escapade in a brothel manifesting as both the climax of the novel and the resurrection of the dead, Joyce believed that dimensionality would emerge through the parallax of flowing between these different modes of language, or life.

If the ostensible larger project of data science is to provide conceptual clarity via statistical analysis, as well as actionable insight through computing via machine learning, feature engineering, and deep understanding of data structures with the mindset of a scientist, then I can think of no greater interdisciplinary project, at least in the realm of NLP, than Ulysses for the sake of validating Joyce’s larger project. There are deeper parallels between data science and literary criticism than I initially realized when I entered the field, particularly in the importance of understanding the data through exploratory data analysis. And the more experience I’ve gained, the more I’ve realized this is a STEM way of saying ‘cultivate emotional alignment and clarity through conceptually rigorous inspection until dimensionality emerges in the data’.

Hence, the following project will be an attempt to simply visualize Ulysses in a way inspired by a similar project.

The Motivation

I was initially inspired by this project, presented in the form of an academic article by Christos Iraklis Tsatsoulis and completed in 2013. Its subject is Thomas Pynchon’s V., another difficult English-language novel characterized by a fragmented plot and an unclear timeline. I remain continuously shocked that I haven’t found more projects like it. I presume this is due to a cultural gap between Data Science and Literary Criticism, most likely a product of the ancient war between STEM and liberal arts.

To start, Tsatsoulis presents an overview of the novel from a literary perspective, going over chapter summaries and the two primary ‘storylines’ in the book, the V. storyline and the Profane storyline. For the reader’s sake, I’ll be transparent by saying that I haven’t read V. by Thomas Pynchon, nor do you need to have read Ulysses to understand either project. For this post, I would only like to outline and emphasize the ingenuity behind the overarching project, which is to make a true interdisciplinary effort to use the brilliant tools of contemporary NLP to augment literary analysis through both visualization and deeper understanding of the semantic content.

So often have I seen the combative attitude between ‘machine learning’ and ‘art’, always descending into the same pit of claims that ‘a computer can never make real art’ versus claims that ‘real art is simply a set of fundamental patterns which can be learned and replicated’, whether in the context of AI-produced music, literature, or any number of films about AI-related romance and love. Without getting into the tangential complexities of that debate, I only mean to point out how little cooperation there is between these general poles, which seem to correspond, again, to STEM and liberal arts.

What Tsatsoulis accomplished shows just how useful the tools of NLP can be for healing this strange adversarial relationship.

After presenting a literary overview of V., he then provides some exploratory data analysis, like any good data scientist, via a wordcloud and some of V.’s characterizing vocabulary. He explains his primary methodology for capturing the structure of semantic content throughout the novel, which involves TF-IDF and hierarchical clustering, as well as the interesting and original use of ‘distance thresholds’ between chapters based on Euclidean, Manhattan, and Canberra distances. An independent section covers Normalized Compression Distance, a methodology based on Kolmogorov complexity. He uses these distance thresholds to create the bafflingly interesting visualizations for the novel:

[Figure: visualization of V., retrieved from the article]
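For readers unfamiliar with the measures named above, here is a minimal sketch in Python of what they compute. It is not Tsatsoulis’s implementation. The three "chapters" are short toy strings rather than real chapter texts, tokenization is a plain whitespace split, the TF-IDF weighting (term frequency times log(N/df)) is only one of several common variants, and zlib stands in for the ideal compressor that Normalized Compression Distance assumes. All function names are my own.

```python
import math
import zlib
from collections import Counter

def tfidf_vectors(chapters):
    """Return (vocabulary, TF-IDF vectors), one vector per chapter."""
    tokenized = [text.lower().split() for text in chapters]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of chapters containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([
            (counts[term] / len(toks)) * math.log(n_docs / df[term])
            for term in vocab
        ])
    return vocab, vectors

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def canberra(u, v):
    # Coordinates where both values are zero are skipped.
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(u, v) if a or b)

def ncd(x, y):
    """Normalized Compression Distance, approximated with zlib."""
    cx = len(zlib.compress(x.encode("utf-8")))
    cy = len(zlib.compress(y.encode("utf-8")))
    cxy = len(zlib.compress((x + y).encode("utf-8")))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy stand-ins for chapters (assumption: real input would be full chapter texts).
chapters = [
    "stately plump buck mulligan came from the stairhead",
    "buck mulligan peeped under the mirror and laughed",
    "mr leopold bloom ate with relish the inner organs",
]
vocab, vectors = tfidf_vectors(chapters)

for name, dist in [("euclidean", euclidean), ("manhattan", manhattan), ("canberra", canberra)]:
    print(name, round(dist(vectors[0], vectors[1]), 3), round(dist(vectors[0], vectors[2]), 3))
print("ncd", round(ncd(chapters[0], chapters[1]), 3), round(ncd(chapters[0], chapters[2]), 3))
```

In this toy example the first two strings share character names, so every vector distance places them closer to each other than to the third. A chapter-by-chapter matrix of such distances is the kind of input that clustering and thresholding can then work on.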

This is an incredible application of NLP to literary analysis. Tsatsoulis even mentions:

Somewhat to our surprise, despite this universal agreement regarding the existence of two different storylines in the novel, it seems that there has never been an attempt to exclusively map each chapter to one and only one storyline.

Such a situation screams for the application of the tools of data science, and Tsatsoulis fantastically succeeded in applying them.
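To illustrate how chapters might be mapped to exactly one storyline each, here is a small sketch of hierarchical (agglomerative) clustering cut at two clusters. It assumes average linkage, which may differ from the linkage Tsatsoulis chose, and it runs on a made-up distance matrix for five hypothetical chapters, not on data from V. or Ulysses.

```python
def agglomerative(dist, n_clusters):
    """Average-linkage agglomerative clustering on a symmetric distance matrix.

    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average pairwise distance between members of the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        pairs = [
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        ]
        _, p, q = min(pairs)
        merged = sorted(clusters[p] + clusters[q])
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return sorted(clusters)

# Hypothetical pairwise distances between five chapters (made-up numbers).
toy_dist = [
    [0.0, 0.2, 0.9, 0.8, 0.3],
    [0.2, 0.0, 0.8, 0.9, 0.25],
    [0.9, 0.8, 0.0, 0.1, 0.85],
    [0.8, 0.9, 0.1, 0.0, 0.9],
    [0.3, 0.25, 0.85, 0.9, 0.0],
]

storylines = agglomerative(toy_dist, n_clusters=2)
print(storylines)  # [[0, 1, 4], [2, 3]]
```

Each chapter index lands in one and only one group, which is the exclusive chapter-to-storyline mapping the quotation describes. Whether the two groups correspond to meaningful storylines depends entirely on the quality of the distances fed in.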

Why, then, given that this project was completed in 2013, has this methodology not caught on in the field of literary analysis? James Joyce himself is infamous for having said of Ulysses, to his French translator:

I’ve put in so many enigmas and puzzles that it will keep the professors busy for centuries arguing over what I meant, and that’s the only way of insuring one’s immortality.

And indeed, professors remain busy arguing over Ulysses. I will not even discuss — for now — the ultimate enigma that is Finnegans Wake for the potential application of NLP, though it may indeed be the telos project of NLP and literature.

Hence, my motivation for applying a similar methodology to Ulysses is inspired by the utter success of Tsatsoulis’s project with Thomas Pynchon’s V., not just because of the literary merit provided by such an analysis, but because it demonstrates that productive cooperation between data science, or the field of precision and rigorous statistical dominance, and literary criticism, the refuge of obscurantism and impenetrable vocabulary, is possible.

Look out for Part Two of this post, where I’ll actually attempt similar visualizations with the plot of Ulysses, which will hopefully align with standard interpretations of the novel.

Translated from: https://medium.com/swlh/using-nlp-to-visualize-ulysses-8a953c27aca
