When working with text data and NLP projects, word frequency is often a useful feature to identify and look into. However, creating good visuals is often difficult because you don’t have a lot of options outside of bar charts. Let’s face it: bar charts get old and boring quickly! This is where word clouds come into play. In this blog, learn how to spice up your visualizations using word clouds on your next project.

Up until my most recent project, I actually didn’t know a word cloud library existed in Python, but I assure you it does, and it has some amazing features!

The full WordCloud library and documentation can be found here for those interested.

TLDR

Part 1 of this blog will walk you through obtaining the appropriate libraries and the basic parameters and functions of the wordcloud library as well as how to create a generic word cloud. Part 2 will build upon this and walk you through creating custom masks for word clouds and other unique visual options.

Getting Started With WordCloud

Before we can start making visuals, we’ll need to make sure we have the libraries we need to create our word clouds. You’ll need the following libraries:

numpy
matplotlib
PIL (installed as the Pillow package)
wordcloud
nltk (only needed for this blog, as a source of sample text to create word clouds from)

All of these libraries can be installed with pip if you’re unable to import them. For my specific project, I used Google Colab, which required a slightly different approach to install wordcloud. For Google Colab users, you can use the following command to install wordcloud:

!pip install git+https://github.com/amueller/word_cloud.git#egg=wordcloud

That last part is important for Colab because it identifies and effectively names the library so that it can be properly imported.

Once we have all of our needed libraries installed, we can use the following set of import statements:

We’re now ready to create some word clouds!

Generic Word Clouds

To start with, let’s explore generic word clouds. For those who want to follow along, we’ll use some corpora from the nltk library.

First off, we’ll need to acquire our text. I’ll note here that there are two forms of input that WordCloud can use to generate a visual. The first, and the main one we’ll use, is a string. The second is a dictionary of words and their frequencies as key-value pairs.

If you’re following along, or want to attempt this using other sample text from nltk, you can use the following code to acquire our text samples:

This shows a list of the different authors and texts we have to choose from within nltk’s gutenberg files.

Feel free to attempt creating word clouds from any of the above options. The one that we’ll continue with in these examples, however, will be Moby Dick.

To gather our sample text as a single string you can use the following command:

Now that we have our text, let’s take a look at how to turn this into a word cloud. There are three steps: we instantiate a WordCloud object, we use that object to generate a cloud based upon the text that we pass in, and once the cloud is generated, we show it without the unnecessary x and y axes.

Look at that! We made a word cloud!

Now personally, I’m not a fan of the black background and it seems a little small, so let’s change that with some simple parameters.

Now we’re talking! Although there seem to be some strange things showing up in our generic word cloud, don’t there?

Parameters and Language Processing

Looking at the cloud above, we notice something: some words seem to be paired.

the whale
the ship
the sea
the captain
White Whale

And so on. Our word cloud is still showing word frequencies; however, one of the parameters WordCloud has is ‘collocations’, which defaults to True. When it is enabled, the library also looks at pairs of words and their frequencies. In some instances this can definitely be useful, but in this one I think we’ll get better results without it.

Notice the difference?

A keen eye may recognize that the word ‘the’ no longer appears in our word cloud. This is because ‘the’ is recognized as a stop-word and excluded from the cloud even though it appears quite frequently in the text.

You may be wondering where stop-words came into play, and that is one of the really cool features of the wordcloud library. The library comes with its own list of stop-words that it uses by default. It actually applies quite a few NLP practices by default, which makes creating clouds that much easier while staying adjustable for the more experienced NLP practitioner. Some of these additional NLP parameters are:

regexp: an optional parameter that, if left unset, uses r"\w[\w']+" by default. A custom regex string can be passed in here.
normalize_plurals: default = True. For words that appear both with and without a trailing ‘s’, the plural is dropped and its count is added to the singular form.

In our original import statement we imported STOPWORDS from the wordcloud library. You can print this to see the entire list of words that are being excluded by default; it currently contains 192 of the most common stop-words. You can add to this set if you have additional words you want excluded, or supply your own stop-words entirely if you prefer. Note that the stop-words should be passed in as a set and not a list.

What a difference!

One last thing we’ll talk about before moving on to making fun and unique word clouds is “relative scaling”.

Relative scaling determines how strongly the size of a word depends on its frequency. The parameter defaults to ‘auto’, which resolves to 0.5 unless repeat is enabled; that is essentially the equivalent of saying that a word that occurs twice as often as another word will be about 50% larger.

Relative scaling can be set to any number between 0 and 1. With 0, only the rank of a word matters, not how much more frequent it is than the next word; with 1, a word that occurs twice as often will be twice as large. A higher value can make differences in frequency easier to see. However, it doesn’t always look very good, and it can affect how well a word cloud fits a mask, which we will talk about later.

In this case, using a relative scaling of 1 actually doesn’t look too bad! We’ll soon see how this translates to using it with an image mask.

Saving Your Word Cloud

Once you have your word cloud the way you want it, you’ll probably want to save it. To do so, you can run the following code which will save the current state of your WordCloud object.

Keep in mind that this saves the image to your current working directory; if you have a specific location in mind, include the appropriate path in the file name.

Other Parameters Worth Playing With

We looked at the key parameters for making word clouds, but there are many more that are worth looking into and toying with. These parameters are fairly self-explanatory and can be used to further tweak your clouds:

prefer_horizontal: (float) If set to 1, all words will appear horizontal, while lower values will increase the frequency of vertical words. default = 0.9
min_font_size: (int) Smallest font size to be used. default = 4
max_words: (int) Maximum number of words to include in the cloud. default = 200
min_word_length: (int) Minimum number of letters required in a word to be in the cloud. default = 0
include_numbers: (bool) Whether numbers are included in the cloud. default = False
repeat: (bool) Determines if words/phrases will be repeated until max_words or min_font_size is reached. (Can be used to create word clouds from a single word.) default = False

Unique and Custom Word Clouds

Since this blog turned out much longer than I had initially planned, I’ll save the rest for Part 2, which will follow soon. It will cover using image masks to create custom word clouds, creating your own image masks from any image, and applying an image’s colors to your cloud.