Using NLP to Visualize Ulysses

My data science experience has, thus far, been focused on natural language processing (NLP), and this post is neither the first nor the last to take James Joyce’s novel Ulysses as its primary target for NLP and literary elucidation. In this post I will explain why it’s such a perfect target, since Ulysses will likely be the focus of my next project. This will probably be a multi-part blog post.

About the Book

First off, why this book?

Ulysses, by James Joyce, has elicited just about every kind of response from readers since its publication in 1922. Some claim it’s the pinnacle of modernist literature; others claim it’s a filthy, decadent depiction of obscenity and pornography (nonetheless glittering with Shakespearean intertextuality on nearly every page) which should be banned. It was in fact banned in the United States until the famous federal court case United States v. One Book Called Ulysses, decided in 1933, allowed it into the country. Moreover, this decision highlighted serious, longstanding philosophical questions about the role of art and the right to literary expression.

I fall firmly into the former category, and I believe it’s an affirmative work of genius. The reasons for this include many of the reasons I use Ulysses for NLP: the intent of Joyce was to recreate the many different modes of human experience through language. Through the scintillating and narrowing confines of different languages, dialects, subdialects, profanities, connotations, grammars, and idioms, all of which cross different religious and cultural traditions from the Catholic Church to Irish nationalists, to the mundane domestic affairs of a house in Ireland in 1904, to the parallels between ordinary life on one day in Dublin and the Odyssey, as well as Shakespeare’s Hamlet, through fixations on classical philosophy and human suffering expressed via a hallucinatory, drunken escapade in a brothel manifesting as both the climax of the novel and the resurrection of the dead, Joyce believed that dimensionality would emerge through the parallax of flowing between these different modes of language, or life.

If the ostensible larger project of data science is to provide conceptual clarity via statistical analysis, as well as actionable insight through machine learning, feature engineering, and a deep understanding of data structures, all with the mindset of a scientist, then I can think of no greater interdisciplinary project, at least in the realm of NLP, than analyzing Ulysses in order to validate Joyce’s larger project. There are deeper parallels between data science and literary criticism than I initially realized when I entered the field, particularly in the importance of understanding the data through exploratory data analysis. And the more experience I’ve gained, the more I’ve realized that exploratory data analysis is a STEM way of saying ‘cultivate emotional alignment and clarity through conceptually rigorous inspection until dimensionality emerges in the data’.

Hence, the following project will be an attempt to simply visualize Ulysses in a way inspired by a similar project.

The Motivation

I was initially inspired by this project. It is presented in the form of an academic article by Christos Iraklis Tsatsoulis, completed in 2013, on Thomas Pynchon’s V., another difficult English-language novel characterized by a fragmented plot and an unclear timeline. I remain continuously shocked that I haven’t found more projects like it. I presume this is due to a cultural gap between data science and literary criticism, most likely rooted in the ancient war between STEM and the liberal arts.

To start, Tsatsoulis presents an overview of the novel from a literary perspective, going over chapter summaries and the two primary ‘storylines’ in the book, the V. storyline and the Profane storyline. For the reader’s sake, I’ll be transparent by saying that I haven’t read V. by Thomas Pynchon, nor do you need to have read Ulysses to understand either project. For this post, I would only like to outline and emphasize the ingenuity behind the overarching project, which is to make a true interdisciplinary effort to use the brilliant tools of contemporary NLP to augment literary analysis through both visualization and deeper understanding of the semantic content.

So often have I seen the combative attitude between ‘machine learning’ and ‘art’, always descending into the same pit of claims that ‘a computer can never make real art’ versus claims that ‘real art is simply a set of fundamental patterns which can be learned and replicated’, whether in the context of AI-produced music, literature, or any number of films about AI-related romance and love. Without getting into the tangential complexities of that debate, I only mean to point out how little cooperation there is between these general poles, which seem to correspond, again, to STEM and liberal arts.

What Tsatsoulis accomplished shows just how useful the tools of NLP can be for healing this strange adversarial relationship.

After presenting a literary overview of V., he provides some exploratory data analysis, like any good data scientist, via a word cloud and some of V.’s characteristic vocabulary. He then explains his primary methodology for capturing the structure of semantic content throughout the novel, which involves TF-IDF and hierarchical clustering, along with an interesting and original use of ‘distance thresholds’ between chapters based on Euclidean, Manhattan, and Canberra distances. A separate section covers Normalized Compression Distance, a measure based on Kolmogorov complexity. He uses these distance thresholds to create bafflingly interesting visualizations of the novel:

[Figure: visualization of V., retrieved from the article]
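To make the vocabulary above a little more concrete, here is a minimal sketch of TF-IDF weighting and the three chapter-to-chapter distances in plain Python. It is not Tsatsoulis’s code: the toy "chapters", the whitespace tokenizer, and every function name are assumptions made purely for illustration, and the TF-IDF variant used (term frequency times log(N/df)) is only one of several common definitions.

```python
import math
from collections import Counter
from itertools import combinations

# Invented toy "chapters" (not text from either novel).
chapters = {
    "ch1": "the sea the tower the sea the key",
    "ch2": "the sea the tower the milk the key",
    "ch3": "the street the tram the newspaper the street",
}

def tf_idf(docs):
    """Return {name: {term: weight}} with weight = tf * log(N / df)."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    weights = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        weights[name] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights

def to_vectors(weights):
    """Align the sparse weights on one shared, sorted vocabulary."""
    vocab = sorted({term for w in weights.values() for term in w})
    return {name: [w.get(term, 0.0) for term in vocab] for name, w in weights.items()}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def canberra(u, v):
    # Coordinates where both values are zero are skipped (0/0 treated as 0).
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(u, v) if a or b)

vectors = to_vectors(tf_idf(chapters))
for p, q in combinations(vectors, 2):
    print(p, q,
          round(euclidean(vectors[p], vectors[q]), 3),
          round(manhattan(vectors[p], vectors[q]), 3),
          round(canberra(vectors[p], vectors[q]), 3))
```

Note that a word appearing in every chapter (here, "the") gets a weight of zero, which is exactly why TF-IDF surfaces the vocabulary that characterizes one chapter against the others. A distance threshold would then simply be a cutoff applied to numbers like the ones printed here.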

This is an incredible application of NLP to literary analysis. Tsatsoulis even mentions:

Somewhat to our surprise, despite this universal agreement regarding the existence of two different storylines in the novel, it seems that there has never been an attempt to exclusively map each chapter to one and only one storyline.

Such a situation screams for the application of the tools of data science, and Tsatsoulis fantastically succeeded in applying them.
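As a rough illustration of what mapping each chapter to exactly one of two storylines could look like computationally, the sketch below computes a Normalized Compression Distance between toy chapters and merges them by average linkage until two groups remain. Everything here is an assumption for the sake of the example rather than Tsatsoulis’s actual procedure: zlib stands in for the ideal compressor that Kolmogorov complexity would call for, the function names are hypothetical, and the "chapters" are just two repeated fragments from the openings of two Ulysses episodes, padded so that they are long enough to compress meaningfully.

```python
import zlib
from itertools import combinations

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, approximated with zlib."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def two_groups(names, dist):
    """Average-linkage agglomerative clustering, stopped at two clusters."""
    clusters = [[name] for name in names]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[frozenset((p, q))] for p in a for q in b) / (len(a) * len(b))

    while len(clusters) > 2:
        # Find the closest pair of clusters and merge it.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c) for c in clusters]

# Toy data: heavy repetition keeps these short strings compressible.
A = b"stately plump buck mulligan came from the stairhead "
B = b"mr leopold bloom ate with relish the inner organs "
chapters = {
    "a1": A * 20,
    "a2": A * 18 + b"bearing a bowl of lather ",
    "b1": B * 20,
    "b2": B * 18 + b"of beasts and fowls ",
}

dist = {frozenset((p, q)): ncd(chapters[p], chapters[q])
        for p, q in combinations(chapters, 2)}
groups = two_groups(list(chapters), dist)
print(sorted(groups))
```

On real chapters the distances would be far noisier than on this toy data, so the two-cluster cut should be read as a hypothesis to compare against the literary consensus, not as a verdict.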

Why, then, given that this project was completed in 2013, has this methodology not caught on in the field of literary analysis? James Joyce himself is infamous for having said of Ulysses, to his French translator:

I’ve put in so many enigmas and puzzles that it will keep the professors busy for centuries arguing over what I meant, and that’s the only way of insuring one’s immortality.

And indeed, professors remain busy arguing over Ulysses. For now, I will not even discuss the potential application of NLP to the ultimate enigma that is Finnegans Wake, though it may indeed be the telos of NLP and literature.

Hence, my motivation for applying a similar methodology to Ulysses is inspired by the utter success of Tsatsoulis’s project with Thomas Pynchon’s V., not just because of the literary merit provided by such an analysis, but because it demonstrates that productive cooperation between data science, or the field of precision and rigorous statistical dominance, and literary criticism, the refuge of obscurantism and impenetrable vocabulary, is possible.

Look out for Part Two of this post, where I’ll actually attempt similar visualizations with the plot of Ulysses, which will hopefully align with standard interpretations of the novel.

Translated from: https://medium.com/swlh/using-nlp-to-visualize-ulysses-8a953c27aca
