How to Quickly Preprocess and Visualize Text Data with TextHero

Natural Language Processing

When working on an NLP project or competition, we spend most of our time preprocessing text (removing digits, punctuation, stopwords, whitespace, and so on) and sometimes visualizing it too. After trying TextHero on a couple of NLP datasets, I found the library extremely useful for both preprocessing and visualization. It saves the time we would otherwise spend writing custom functions. Excited? Let's dive in.

We will apply the techniques covered in this article to the Spooky Author Identification dataset, which is available on Kaggle. The complete code is given at the end of the article.

Note: TextHero is still in beta and may undergo major changes, so some of the code snippets or functionality below might change.

Installation

pip install texthero

Preprocessing

As the name suggests, the clean method cleans the text. By default, it applies a pipeline of seven preprocessing functions to the text.

from texthero import preprocessing

# Apply the default cleaning pipeline to the 'text' column
df['clean_text'] = preprocessing.clean(df['text'])

  1. fillna(s)
  2. lowercase(s)
  3. remove_digits()
  4. remove_punctuation()
  5. remove_diacritics()
  6. remove_stopwords()
  7. remove_whitespace()
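To make the seven steps concrete, here is a minimal standard-library sketch of what such a pipeline does to each string. It is an illustration only, not TextHero's implementation: the function bodies, the tiny STOPWORDS set, the list-based clean helper, and the sample sentence are all assumptions, and TextHero itself operates on pandas Series rather than Python lists.

```python
import re
import string
import unicodedata

# Tiny illustrative stopword list (an assumption; real lists are much larger).
STOPWORDS = {"a", "an", "the", "is", "of", "and", "in", "to"}


def fillna(text):
    # Replace a missing value with an empty string.
    return "" if text is None else text


def lowercase(text):
    return text.lower()


def remove_digits(text):
    # Replace standalone blocks of digits with a space.
    return re.sub(r"\b\d+\b", " ", text)


def remove_punctuation(text):
    # Map every ASCII punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)


def remove_diacritics(text):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_stopwords(text):
    return " ".join(word for word in text.split() if word not in STOPWORDS)


def remove_whitespace(text):
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


DEFAULT_STEPS = [
    fillna, lowercase, remove_digits, remove_punctuation,
    remove_diacritics, remove_stopwords, remove_whitespace,
]


def clean(texts, steps=DEFAULT_STEPS):
    # Apply each step, in order, to every text.
    cleaned = []
    for text in texts:
        for step in steps:
            text = step(text)
        cleaned.append(text)
    return cleaned


print(clean(["The café opened in 1845, and it is STILL open!", None]))
# ['cafe opened it still open', '']
```

Because the steps are just a list of functions applied in order, swapping in a shorter or different list changes the behavior without touching clean itself.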

We can confirm the default pipeline with the code below:

preprocessing.get_default_pipeline()

Apart from the seven default functions above, TextHero provides many more preprocessing functions. The complete list, with descriptions, is in the TextHero documentation. They cover most of the tasks we routinely deal with during text preprocessing.

Depending on our requirements, we can also define a custom pipeline, as shown below. This example uses two functions, but a pipeline can contain as many as we want.

from texthero import preprocessing

# A custom pipeline is a list of preprocessing functions
custom_pipeline = [preprocessing.fillna, preprocessing.lowercase]
df['clean_text'] = preprocessing.clean(df['text'], custom_pipeline)

NLP

As of now, the NLP module provides only the named_entities and noun_chunks methods. See the sample code below. Since TextHero is still in beta, I expect more functionality to be added later.

named entity

import pandas as pd
from texthero import nlp

s = pd.Series("Narendra Damodardas Modi is an Indian politician serving as the 14th and current Prime Minister of India since 2014")
print(nlp.named_entities(s)[0])

Output:
[('Narendra Damodardas Modi', 'PERSON', 0, 24),
 ('Indian', 'NORP', 31, 37),
 ('14th', 'ORDINAL', 64, 68),
 ('India', 'GPE', 99, 104),
 ('2014', 'DATE', 111, 115)]

noun phrases

s = pd.Series("Narendra Damodardas Modi is an Indian politician serving as the 14th and current Prime Minister of India since 2014")
print(nlp.noun_chunks(s)[0])

Output:
[('Narendra Damodardas Modi', 'NP', 0, 24),
 ('an Indian politician', 'NP', 28, 48),
 ('the 14th and current Prime Minister', 'NP', 60, 95),
 ('India', 'NP', 99, 104)]

Representation

This module maps text data into vectors (term frequency, TF-IDF), clusters it (k-means, DBSCAN, mean shift), and reduces its dimensionality (PCA, t-SNE, NMF).

Let's look at an example with TF-IDF and PCA on the Spooky Author Identification training set.

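from texthero import preprocessing, representation, visualization

# Clean the text, vectorize it with TF-IDF, then reduce it with PCA
train['pca'] = (
    train['text']
    .pipe(preprocessing.clean)
    .pipe(representation.tfidf, max_features=1000)
    .pipe(representation.pca)
)

visualization.scatterplot(train, 'pca', color='author', title="Spooky Author identification")

[Figure: PCA scatter plot of the TF-IDF vectors, colored by author]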
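For readers who want to see what a TF-IDF vector actually contains, here is a minimal sketch using only the standard library. It uses the plain textbook weighting, term frequency times log(N / document frequency). Library implementations often add smoothing or normalization, so the values may differ from what TextHero produces. The tfidf helper and the three toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(documents):
    # Split on whitespace; assumes the text has already been cleaned.
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vocabulary = sorted(doc_freq)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term]) if total else 0.0
            for term in vocabulary
        ])
    return vocabulary, vectors


docs = ["raven raven night", "night sea", "sea monster sea"]
vocabulary, vectors = tfidf(docs)
print(vocabulary)  # ['monster', 'night', 'raven', 'sea']
for doc, vector in zip(docs, vectors):
    print(doc, [round(weight, 3) for weight in vector])
```

A term that is frequent in one document but rare across the collection (here "raven") gets the largest weight, which is what makes these vectors useful for separating authors or clusters.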

Visualization

This module draws scatter plots and word clouds, and can also return the top n words in the text. See the examples below.

Scatter-plot example

# Vectorize the cleaned text with TF-IDF
train['tfidf'] = (
    train['text']
    .pipe(preprocessing.clean)
    .pipe(representation.tfidf, max_features=1000)
)

# Cluster the TF-IDF vectors into three groups
train['kmeans_labels'] = (
    train['tfidf']
    .pipe(representation.kmeans, n_clusters=3)
    .astype(str)
)

train['pca'] = train['tfidf'].pipe(representation.pca)

visualization.scatterplot(train, 'pca', color='kmeans_labels', title="K-means Spooky author")

[Figure: PCA scatter plot colored by K-means cluster label]
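The cluster labels used to color the plot come from k-means, which alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The sketch below shows that loop in plain Python. It is not TextHero's implementation: the kmeans and squared_distance helpers, the explicit starting centroids (real implementations usually pick them randomly), and the six toy points are assumptions made for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iterations=100):
    centroids = list(initial_centroids)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(point, centroids[k]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The same loop works unchanged on high-dimensional TF-IDF vectors, since squared_distance and the centroid update handle tuples of any length.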

Wordcloud example

from texthero import visualization

visualization.wordcloud(train['clean_text'])

[Figure: word cloud of the cleaned text]

Top words example

[Figure: top words in the text]
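Finding the top n words amounts to tallying tokens across all documents and keeping the most frequent ones. Here is a minimal standard-library sketch of that idea. The count_top_words helper and the sample sentences are hypothetical, and this is not TextHero's own code.

```python
from collections import Counter


def count_top_words(texts, n=10):
    # Tally whitespace-separated tokens across all texts.
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    # Most frequent first; ties keep first-seen order.
    return counts.most_common(n)


cleaned = ["raven night raven", "night sea", "raven sea monster"]
print(count_top_words(cleaned, n=3))
# [('raven', 3), ('night', 2), ('sea', 2)]
```

Running this on cleaned text rather than raw text matters: without stopword removal, the top of the list is usually dominated by words like "the" and "of".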

Complete Code

Conclusion

We have gone through most of the functionality TextHero provides. Apart from the NLP module, I found the rest of the features really useful, and they are worth trying in your next NLP project.

Thank you so much for taking the time to read this article. You can reach me at https://www.linkedin.com/in/chetanambi/

Translated from: https://medium.com/towards-artificial-intelligence/how-to-quickly-preprocess-and-visualize-text-data-with-texthero-c86957452824
