Cloudy With a Chance of Words: Part 1

When working with text data and NLP projects, word frequency is often a useful feature to identify and look into. However, creating good visuals is often difficult because you don't have many options outside of bar charts. Let's face it: bar charts get old and boring quickly! This is where word clouds come into play. In this post, learn how to spice up your visualizations with word clouds on your next project.

Up until my most recent project, I actually didn't know a word cloud library existed in Python, but I assure you it does, and it has some amazing features!

For those interested, the full WordCloud library and its documentation are available at https://github.com/amueller/word_cloud.

TLDR

Part 1 of this blog will walk you through obtaining the appropriate libraries and the basic parameters and functions of the wordcloud library as well as how to create a generic word cloud. Part 2 will build upon this and walk you through creating custom masks for word clouds and other unique visual options.

Getting Started With WordCloud

Before we can start making visuals, we’ll need to make sure we have the libraries we need to create our word clouds. You’ll need the following libraries:

  • numpy
  • matplotlib
  • PIL (installed from PyPI as Pillow)
  • wordcloud
  • nltk (only needed for this post, as a source of sample text to create word clouds from)

If you're unable to import any of these libraries, they can all be installed with pip. For my specific project I used Google Colab, which required a slightly different approach to get wordcloud working. Google Colab users can install wordcloud with the following command:

!pip install git+https://github.com/amueller/word_cloud.git#egg=wordcloud

The trailing #egg=wordcloud fragment is important for Colab because it identifies and effectively names the library so that it can be properly imported.

Once we have all of our needed libraries installed, we can use the following set of import statements:

Image for post

We’re now ready to create some word clouds!

Generic Word Clouds

To start with, let's explore generic word clouds. For those who want to follow along, we'll use some corpora from the nltk library.

First off, we'll need to acquire our text. I'll note here that there are two forms of input that WordCloud can use to generate a visual. The first, and the main one we'll use, is a string. The second is a dictionary of words and their frequencies as key-value pairs.

If you're following along, or want to attempt this using other sample text from nltk, you can use the following code to list the available text samples:

Image for post
This shows a list of the different authors and texts we have to choose from within nltk’s gutenberg files

Feel free to attempt creating word clouds from any of the above options. The one that we’ll continue with in these examples, however, will be Moby Dick.

To gather our sample text as a single string, you can use the following command:

Image for post

Now that we have our text, let's take a look at how to turn it into a word cloud. In the code block below we instantiate a WordCloud object and then use that object to generate a cloud from the text we pass in. Once the cloud is generated, we want to show it without the unnecessary x and y axes.

Image for post

Look at that! We made a word cloud!

Now personally, I’m not a fan of the black background and it seems a little small, so let’s change that with some simple parameters.

Image for post

Now we're talking! Although, there seem to be some strange things showing up in our generic word cloud, don't there?

Parameters and Language Processing

Looking at the cloud above we notice some things. Some words seem to be paired.

  • the whale
  • the ship
  • the sea
  • the captain
  • White Whale

And so on. Our word cloud is still showing word frequencies; however, one of WordCloud's parameters is 'collocations', which defaults to True. With it enabled, the library also looks at pairs of words and their frequencies. In some instances this can definitely be useful, but in this one I think we'll get better results without it.

Image for post

Notice the difference?

A keen eye may recognize that the word ‘the’ no longer appears in our word cloud. This is because ‘the’ is recognized as a stop-word and excluded from the cloud even though it appears quite frequently in the text.

You may be wondering where stop-words came into play, and that is one of the really cool features of the wordcloud library. The library comes with its own list of stop-words that it uses by default. It actually applies quite a few NLP practices by default, which makes creating clouds that much easier while staying adjustable for the more experienced NLP practitioner. Some of these additional NLP parameters are:

  • regexp: an optional parameter that, if left blank, uses r"\w[\w']+" by default. A custom regex string can be passed in here.
  • normalize_plurals: default = True. For words that appear both with and without a trailing 's', the 's' is removed from the plural and it is counted toward its singular version.

In our original import statement we imported STOPWORDS from the wordcloud library. You can print it to see the entire list of words that are excluded by default; it currently contains 192 of the most common stop-words. You can add to this list if there are additional words you want excluded, or supply your own stop-words entirely if you prefer. Note that the stop-words must be passed in as a set and not a list.

Image for post

What a difference!

One last thing we’ll talk about before moving on to making fun and unique word clouds is “relative scaling”.

Relative scaling determines how strongly the size of a word depends on its frequency. By default it is effectively 0.5 (the library's 'auto' setting resolves to 0.5 unless repeat is enabled), which roughly means that a word occurring twice as often as another is drawn noticeably larger, but not twice as large.

Relative scaling can be set to any number between 0 and 1. At 0, only the rank of each word is considered, so how much more frequent one word is than another has no effect on size. At 1, a word that occurs twice as often is drawn twice as large. In some cases this can be useful to better show the differences in frequency. However, it doesn't always look very good, and it can affect how well a word cloud fits a mask, which we will talk about later.
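To make the 0-to-1 range concrete, the sketch below blends "keep the previous size" with "scale by the frequency ratio". It reflects my understanding of how the library steps from one word to the next, not the library's actual code, and real sizes also depend on whether a word fits in the remaining space. The function name and sample counts are made up.

```python
def size_factor(relative_scaling, freq, prev_freq):
    """Factor applied to the previous word's font size to get the next word's size."""
    ratio = freq / prev_freq  # e.g. 0.5 when the next word is half as frequent
    return relative_scaling * ratio + (1 - relative_scaling)


# A word that occurs half as often as the word before it:
for rs in (0.0, 0.5, 1.0):
    print(rs, size_factor(rs, freq=50, prev_freq=100))
```

At 0 the factor stays at 1 regardless of the counts, and at 1 it equals the frequency ratio itself.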

Image for post

In this case, using a relative scaling of 1 actually doesn’t look too bad! We’ll soon see how this translates to using it with an image mask.

Saving Your Word Cloud

Once you have your word cloud the way you want it, you’ll probably want to save it. To do so, you can run the following code which will save the current state of your WordCloud object.

Image for post

Keep in mind that this saves the image to your current working directory. If you have a specific location in mind, include the appropriate path in the file name.

Other Parameters Worth Playing With

We looked at the key parameters for making word clouds, but there are many more that are worth looking into and toying with. These parameters are fairly self-explanatory and can be used to further tweak your clouds:

  • prefer_horizontal: (float) If set to 1, all words appear horizontally, while lower values increase the share of vertical words. default = 0.9
  • min_font_size: (int) Smallest font size to be used. default = 4
  • max_words: (int) Maximum number of words in the cloud. default = 200
  • min_word_length: (int) Minimum number of letters a word needs to be included in the cloud. default = 0
  • include_numbers: (bool) Whether numbers are included in the cloud. default = False
  • repeat: (bool) Determines whether words and phrases are repeated until max_words or min_font_size is reached. (Can be used to create a word cloud from a single word.) default = False

Unique and Custom Word Clouds

Because this post turned out much longer than I had initially planned, I'll save the rest for Part 2, which will follow soon. There I'll discuss using image masks to create custom word clouds, how to create your own image masks from any image, and how to apply an image's colors to your cloud.

Translated from: https://medium.com/swlh/cloudy-with-a-chance-of-words-part-1-d34a29739dba
