Sentiment Analysis of the Harry Potter Novels with Python

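Preparing the data

The data we have is one novel per txt file. We want to analyze it chapter by chapter (the first list element holds the content of chapter 1, the second the content of chapter 2, and so on), which means using regular expressions to tidy the data.

For example, let us first look at the chapters in "01-Harry Potter and the Sorcerer's Stone.txt" by opening the txt file.

A search shows that every chapter heading follows the same regular form:

[Chapter][space][integer][newline \n][English title, possibly containing spaces][newline \n]

To get familiar with regular expressions, we use this rule to design a pattern that extracts the chapter headings.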

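import re
import nltk

# read the whole novel into a single string
raw_text = open("data/01-Harry Potter and the Sorcerer's Stone.txt").read()

# "Chapter", a space, digits, a newline, a title of letters and spaces, a newline
pattern = r'Chapter \d+\n[a-zA-Z ]+\n'

re.findall(pattern, raw_text)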

['Chapter 1\nThe Boy Who Lived\n',
'Chapter 2\nThe Vanishing Glass\n',
'Chapter 3\nThe Letters From No One\n',
'Chapter 4\nThe Keeper Of The Keys\n',
'Chapter 5\nDiagon Alley\n',
'Chapter 7\nThe Sorting Hat\n',
'Chapter 8\nThe Potions Master\n',
'Chapter 9\nThe Midnight Duel\n',
'Chapter 10\nHalloween\n',
'Chapter 11\nQuidditch\n',
'Chapter 12\nThe Mirror Of Erised\n',
'Chapter 13\nNicholas Flamel\n',
'Chapter 14\nNorbert the Norwegian Ridgeback\n',
'Chapter 15\nThe Forbidden Forest\n',
'Chapter 16\nThrough the Trapdoor\n',
'Chapter 17\nThe Man With Two Faces\n']
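Note that Chapter 6 is absent from the list above. A plausible cause is that its title ("The Journey from Platform Nine and Three-Quarters") contains a hyphen, which the character class [a-zA-Z ] does not allow. The sketch below runs on a made-up three-heading sample, not the real file, and compares the original pattern with a more permissive one that accepts any characters up to the line break. The sample text and the names strict_pattern and loose_pattern are illustrative assumptions.

```python
import re

# Hypothetical miniature sample; the real novel text is not reproduced here.
sample = (
    "Chapter 5\nDiagon Alley\nHarry woke early.\n"
    "Chapter 6\nThe Journey from Platform Nine and Three-Quarters\nThe last month.\n"
    "Chapter 7\nThe Sorting Hat\nThe door swung open.\n"
)

strict_pattern = r'Chapter \d+\n[a-zA-Z ]+\n'   # title may hold letters and spaces only
loose_pattern = r'Chapter \d+\n[^\n]+\n'        # title may hold anything except a newline

strict_hits = re.findall(strict_pattern, sample)
loose_hits = re.findall(loose_pattern, sample)

print(len(strict_hits))  # 2: the hyphenated Chapter 6 heading is skipped
print(len(loose_hits))   # 3
```

If a heading is skipped, re.split leaves that chapter glued to the previous one, so the chapter counts computed later may come out lower than the real number.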

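Now that the regular expression is familiar, we want to be more precise. I prepared a test text that mimics the chapter headings of the real novel but is much shorter, which makes it easier to follow. The test data contains only 5 chapters, so the resulting list should have length 5: its first element is the content of chapter 1, its second element the content of chapter 2, and so on.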

import re

test = """Chapter 1\nThe Boy Who Lived\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense.\nMr. Dursley was the director of a firm called Grunnings,
 Chapter 2\nThe Vanishing Glass\nFor a second, Mr. Dursley didn’t realize what he had seen — then he jerked his head around to look again. There was a tabby cat standing on the corner of Privet Drive, but there wasn’t a map in sight. What could he have been thinking of? It must have been a trick of the light. Mr. Dursley blinked and stared at the cat.
 Chapter 3\nThe Letters From No One\nThe traffic moved on and a few minutes later, Mr. Dursley arrived in the Grunnings parking lot, his mind back on drills.\nMr. Dursley always sat with his back to the window in his office on the ninth floor. If he hadn’t, he might have found it harder to concentrate on drills that morning.
 Chapter 4\nThe Keeper Of The Keys\nHe didn’t know why, but they made him uneasy. This bunch were whispering excitedly, too, and he couldn’t see a single collecting tin. 
 Chapter 5\nDiagon Alley\nIt was a few seconds before Mr. Dursley realized that the man was wearing a violet cloak. """

# Build the list of chapter contents (element 0 is chapter 1, element 1 is chapter 2, ...)
# The "if c" condition drops empty strings so that the list length matches the expected chapter count
chapter_contents = [c for c in re.split(r'Chapter \d+\n[a-zA-Z ]+\n', test) if c]

chapter_contents

['Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense.\nMr. Dursley was the director of a firm called Grunnings,\n ',
'For a second, Mr. Dursley didn’t realize what he had seen — then he jerked his head around to look again. There was a tabby cat standing on the corner of Privet Drive, but there wasn’t a map in sight. What could he have been thinking of? It must have been a trick of the light. Mr. Dursley blinked and stared at the cat.\n ',
'The traffic moved on and a few minutes later, Mr. Dursley arrived in the Grunnings parking lot, his mind back on drills.\nMr. Dursley always sat with his back to the window in his office on the ninth floor. If he hadn’t, he might have found it harder to concentrate on drills that morning.\n ',
'He didn’t know why, but they made him uneasy. This bunch were whispering excitedly, too, and he couldn’t see a single collecting tin. \n ',
'It was a few seconds before Mr. Dursley realized that the man was wearing a violet cloak. ']

With the list of chapter contents for Harry Potter in hand, we can start the actual text analysis.
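Because the same split is repeated for every book in the sections that follow, it could be wrapped in a small helper. The sketch below is one possible shape for it. The name split_chapters and the demo string are assumptions made for illustration, and it filters with strip() so that whitespace-only pieces are dropped as well, which is slightly stricter than the "if c" test used in this article.

```python
import re

def split_chapters(raw_text, pattern=r'Chapter \d+\n[a-zA-Z ]+\n'):
    """Return the non-blank pieces of raw_text that lie between chapter headings."""
    return [part for part in re.split(pattern, raw_text) if part.strip()]

# Made-up two-chapter text in the same layout as the novels.
demo = "Chapter 1\nThe Boy Who Lived\nFirst body.\nChapter 2\nThe Vanishing Glass\nSecond body."
chapters = split_chapters(demo)

print(chapters)       # ['First body.\n', 'Second body.']
print(len(chapters))  # 2
```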

Data analysis

Comparing chapter counts

import os
import re
import matplotlib.pyplot as plt

colors = ['#78C850', '#A8A878', '#F08030', '#C03028', '#6890F0', '#A890F0', '#A040A0']

harry_potters = ["Harry Potter and the Sorcerer's Stone.txt",
                 "Harry Potter and the Chamber of Secrets.txt",
                 "Harry Potter and the Prisoner of Azkaban.txt",
                 "Harry Potter and the Goblet of Fire.txt",
                 "Harry Potter and the Order of the Phoenix.txt",
                 "Harry Potter and the Half-Blood Prince.txt",
                 "Harry Potter and the Deathly Hallows.txt"]

# x-axis: book names, without the common prefix and the ".txt" extension
harry_potter_names = [n.replace('Harry Potter and the ', '')[:-4]
                      for n in harry_potters]

# y-axis: number of chapters in each book
chapter_nums = []
for harry_potter in harry_potters:
    file = "data/" + harry_potter
    raw_text = open(file).read()
    pattern = r'Chapter \d+\n[a-zA-Z ]+\n'
    chapter_contents = [c for c in re.split(pattern, raw_text) if c]
    chapter_nums.append(len(chapter_contents))

# figure size
plt.figure(figsize=(20, 10))

# title, font size, bold
plt.title('Chapter Number of Harry Potter', fontsize=25, weight='bold')

# colored bar chart
plt.bar(harry_potter_names, chapter_nums, color=colors)

# font size and rotation of the tick labels
plt.xticks(rotation=25, fontsize=16, weight='bold')
plt.yticks(fontsize=16, weight='bold')

# axis labels
plt.xlabel('Harry Potter Series', fontsize=20, weight='bold')
plt.ylabel('Chapter Number', rotation=25, fontsize=20, weight='bold')

plt.show()

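The chart shows that the last four books in the series have more chapters (this analysis is not very useful in itself; it is mainly for practice).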

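Vocabulary richness

The measure used here is the total number of words divided by the number of distinct words. If a 100-word sentence never repeats a word, the value is 100/100 = 1.

If a sentence of the same length uses only 20 distinct words, the value is 100/20 = 5. A lower value therefore means a more varied vocabulary.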
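As a quick check of this arithmetic, the ratio that the code in this section computes (total words divided by distinct words) can be tried on two tiny sentences. This is a minimal sketch under simplifying assumptions: it splits on whitespace instead of using word_tokenize, applies no stemming, and the function name repetition_ratio is invented for the example.

```python
def repetition_ratio(words):
    """Total number of words divided by the number of distinct words."""
    return len(words) / len(set(words))

all_unique = "the quick brown fox jumps".split()          # 5 words, 5 distinct
repetitive = "the cat and the dog and the bird".split()   # 8 words, 5 distinct

print(repetition_ratio(all_unique))   # 1.0
print(repetition_ratio(repetitive))   # 1.6
```

The ratio is 1.0 when no word repeats and grows as words are reused, so on the bar chart a taller bar indicates more repetition, not a larger vocabulary.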

import os
import re
import matplotlib.pyplot as plt
from nltk import word_tokenize
from nltk.stem.snowball import SnowballStemmer

plt.style.use('fivethirtyeight')

colors = ['#78C850', '#A8A878', '#F08030', '#C03028', '#6890F0', '#A890F0', '#A040A0']

harry_potters = ["Harry Potter and the Sorcerer's Stone.txt",
                 "Harry Potter and the Chamber of Secrets.txt",
                 "Harry Potter and the Prisoner of Azkaban.txt",
                 "Harry Potter and the Goblet of Fire.txt",
                 "Harry Potter and the Order of the Phoenix.txt",
                 "Harry Potter and the Half-Blood Prince.txt",
                 "Harry Potter and the Deathly Hallows.txt"]

# x-axis: book names, without the common prefix and the ".txt" extension
harry_potter_names = [n.replace('Harry Potter and the ', '')[:-4]
                      for n in harry_potters]

# y-axis: vocabulary richness of each book
richness_of_words = []
stemmer = SnowballStemmer("english")
for harry_potter in harry_potters:
    file = "data/" + harry_potter
    raw_text = open(file).read()
    words = word_tokenize(raw_text)
    # lowercase and stem each token so that inflected forms count as one word
    words = [stemmer.stem(w.lower()) for w in words]
    wordset = set(words)
    richness = len(words) / len(wordset)
    richness_of_words.append(richness)

# figure size
plt.figure(figsize=(20, 10))

# title, font size, bold
plt.title('The Richness of Word in Harry Potter', fontsize=25, weight='bold')

# colored bar chart
plt.bar(harry_potter_names, richness_of_words, color=colors)

# font size and rotation of the tick labels
plt.xticks(rotation=25, fontsize=16, weight='bold')
plt.yticks(fontsize=16, weight='bold')

# axis labels
plt.xlabel('Harry Potter Series', fontsize=20, weight='bold')
plt.ylabel('Richness of Words', rotation=25, fontsize=20, weight='bold')

plt.show()

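Sentiment analysis

To trace how the mood develops across the Harry Potter series, we use VADER. The ready-made vaderSentiment library provides the polarity_scores method, which returns:

neg: negative score

neu: neutral score

pos: positive score

compound: overall sentiment score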

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

test = 'i am so sorry'

analyzer.polarity_scores(test)

{'neg': 0.443, 'neu': 0.557, 'pos': 0.0, 'compound': -0.1513}
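The chapter-level score used in the rest of this section is the mean of the sentence-level compound values. A minimal sketch of that averaging step follows, with one addition that the article's code does not have: a guard for a chapter that yields no sentences, which would otherwise raise ZeroDivisionError. The names average_compound and toy_scores are hypothetical, and toy_scores is only a stand-in so the example runs without vaderSentiment installed.

```python
def average_compound(sentences, score_fn):
    """Mean of score_fn(sentence)['compound'] over sentences; 0.0 if there are none."""
    if not sentences:
        return 0.0
    total = sum(score_fn(s)['compound'] for s in sentences)
    return total / len(sentences)

# Stand-in scorer for illustration only. In the article this role is played by
# analyzer.polarity_scores from vaderSentiment.
def toy_scores(sentence):
    return {'compound': -0.5 if 'sorry' in sentence else 0.25}

print(average_compound(['i am so sorry', 'what a lovely day'], toy_scores))  # -0.125
print(average_compound([], toy_scores))                                      # 0.0
```

With the real library, analyzer.polarity_scores would be passed in place of toy_scores.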

import os
import re
import matplotlib.pyplot as plt
from nltk.tokenize import sent_tokenize
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

harry_potters = ["Harry Potter and the Sorcerer's Stone.txt",
                 "Harry Potter and the Chamber of Secrets.txt",
                 "Harry Potter and the Prisoner of Azkaban.txt",
                 "Harry Potter and the Goblet of Fire.txt",
                 "Harry Potter and the Order of the Phoenix.txt",
                 "Harry Potter and the Half-Blood Prince.txt",
                 "Harry Potter and the Deathly Hallows.txt"]

# x-axis: running chapter index across the whole series
chapter_indexes = []

# y-axis: sentiment score of each chapter
compounds = []

analyzer = SentimentIntensityAnalyzer()
chapter_index = 1

for harry_potter in harry_potters:
    file = "data/" + harry_potter
    raw_text = open(file).read()
    pattern = r'Chapter \d+\n[a-zA-Z ]+\n'
    chapters = [c for c in re.split(pattern, raw_text) if c]

    # compute the sentiment score of each chapter
    for chapter in chapters:
        compound = 0
        sentences = sent_tokenize(chapter)
        for sentence in sentences:
            score = analyzer.polarity_scores(sentence)
            compound += score['compound']
        # chapter score = mean compound score of its sentences
        compounds.append(compound / len(sentences))
        chapter_indexes.append(chapter_index)
        chapter_index += 1

# figure size
plt.figure(figsize=(20, 10))

# title, font size, bold
plt.title('Average Sentiment of the Harry Potter', fontsize=25, weight='bold')

# line chart
plt.plot(chapter_indexes, compounds, color='#A040A0')

# font size and rotation of the tick labels
plt.xticks(rotation=25, fontsize=16, weight='bold')
plt.yticks(fontsize=16, weight='bold')

# axis labels
plt.xlabel('Chapter', fontsize=20, weight='bold')
plt.ylabel('Average Sentiment', rotation=25, fontsize=20, weight='bold')

plt.show()

The curve is not smooth enough, so a custom function is defined to even out its fluctuations.

import numpy as np
import os
import re
import matplotlib.pyplot as plt
from nltk.tokenize import sent_tokenize
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# curve-smoothing function: moving average over window_size points
def movingaverage(value_series, window_size):
    window = np.ones(int(window_size)) / float(window_size)
    return np.convolve(value_series, window, 'same')

harry_potters = ["Harry Potter and the Sorcerer's Stone.txt",
                 "Harry Potter and the Chamber of Secrets.txt",
                 "Harry Potter and the Prisoner of Azkaban.txt",
                 "Harry Potter and the Goblet of Fire.txt",
                 "Harry Potter and the Order of the Phoenix.txt",
                 "Harry Potter and the Half-Blood Prince.txt",
                 "Harry Potter and the Deathly Hallows.txt"]

# x-axis: running chapter index across the whole series
chapter_indexes = []

# y-axis: sentiment score of each chapter
compounds = []

analyzer = SentimentIntensityAnalyzer()
chapter_index = 1

for harry_potter in harry_potters:
    file = "data/" + harry_potter
    raw_text = open(file).read()
    pattern = r'Chapter \d+\n[a-zA-Z ]+\n'
    chapters = [c for c in re.split(pattern, raw_text) if c]

    # compute the sentiment score of each chapter
    for chapter in chapters:
        compound = 0
        sentences = sent_tokenize(chapter)
        for sentence in sentences:
            score = analyzer.polarity_scores(sentence)
            compound += score['compound']
        # chapter score = mean compound score of its sentences
        compounds.append(compound / len(sentences))
        chapter_indexes.append(chapter_index)
        chapter_index += 1

# figure size
plt.figure(figsize=(20, 10))

# title, font size, bold
plt.title('Average Sentiment of the Harry Potter',
          fontsize=25,
          weight='bold')

# line chart: raw chapter scores plus the smoothed curve on the same x values
plt.plot(chapter_indexes, compounds,
         color='red')
plt.plot(chapter_indexes, movingaverage(compounds, 10),
         color='black',
         linestyle=':')

# font size and rotation of the tick labels
plt.xticks(rotation=25,
           fontsize=16,
           weight='bold')
plt.yticks(fontsize=16,
           weight='bold')

# axis labels
plt.xlabel('Chapter',
           fontsize=20,
           weight='bold')
plt.ylabel('Average Sentiment',
           rotation=25,
           fontsize=20,
           weight='bold')

plt.show()
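One property of movingaverage is worth keeping in mind when reading the smoothed curve: with mode 'same', np.convolve treats the positions beyond either end of the series as zeros, so the first and last few smoothed values are pulled toward zero. The sketch below shows a possible variant that divides each window sum by the number of real samples it covers. It is an illustration and not part of the original analysis; the name movingaverage_edge_corrected and the toy series are assumptions.

```python
import numpy as np

def movingaverage_edge_corrected(value_series, window_size):
    """Moving average that divides by the number of real samples in each window."""
    values = np.asarray(value_series, dtype=float)
    window = np.ones(int(window_size))
    # Sum of the samples that fall inside each window (zeros are implied at the edges).
    sums = np.convolve(values, window, 'same')
    # Number of real samples inside each window; smaller near the two ends.
    counts = np.convolve(np.ones_like(values), window, 'same')
    return sums / counts

flat = [1.0] * 8                                   # a constant toy series
plain = np.convolve(flat, np.ones(4) / 4, 'same')  # same computation as movingaverage(flat, 4)
corrected = movingaverage_edge_corrected(flat, 4)

print(plain[0] < 1.0)                # True: the edge value is pulled toward zero
print(np.allclose(corrected, 1.0))   # True: the constant series stays constant
```

For a series of this length the difference only affects about half a window at each end, so the overall trend in the plot is unchanged.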
