Twitter Data Analysis
If you’ve written data science articles or are trying to get started, finding the most popular topics is a big help in getting your articles read. Below are the steps for easily determining these topics using R, along with the results of the analysis. This article can also serve as an intro to using an API and doing some basic text processing in R. Feel free to alter this code for other Twitter analyses, and skip to the end if you’re only interested in the results.
The Twitter API
If you don’t have a Twitter account, you need to make one. After that, head over to Twitter Developer. After signing in with your new account, you can select the Apps menu, then Create an App. From there, fill out the information about the App; for the most part this can be left blank except for the App’s name and details.
Installing R Packages
We will need three R packages for this project: rtweet, tidyverse, and tidytext. Install them with
install.packages(c("rtweet", "tidyverse", "tidytext"))
Make sure you load each of these packages in your script with the library() function.
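For example, at the top of your script:

# Load the packages used throughout this analysis
library(rtweet)     # collecting tweets via the Twitter API
library(tidyverse)  # data manipulation and the pipe operator
library(tidytext)   # tokenization and the stop_words data set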
Collecting Tweets
To get tweets we first need to generate a token. We can do that with the create_token() function from the rtweet package. The arguments can be filled in by looking over the information on the page of your new app on the Twitter Developer site. All of these variables are important for security, so whenever you use an API, make sure to take precautions by keeping these tokens and keys private.
# Authenticate using the credentials from your app's page on the Developer site
token <- create_token(app = "<app-name>",
                      consumer_key = "<consumer-key>",
                      consumer_secret = "<consumer-secret>",
                      access_token = "<access-token>",
                      access_secret = "<access-secret>")
Now to grab tweets! Use the get_timeline() function to access the 3200 most recent tweets of any public user; to get more than that in one query, you will have to pay.
tweets <- get_timeline("TDataScience", n = 3200, token = token)
The variable tweets now holds 3200 rows and 50 columns of data. We primarily care about the text, favorite_count, and retweet_count columns, but you can explore what else is available, as shown below.
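A quick way to inspect what came back (glimpse() is loaded with the tidyverse):

glimpse(tweets)  # one line per column: name, type, and sample values
names(tweets)    # or just list the column names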
Removing Unwanted Phrases
We’d also like to remove some things that are common to this domain, mainly links and Twitter handles. We can use the gsub() function to replace occurrences of links and handles with something else; in our case we remove them by replacing them with empty strings. Below we pass the text column from one gsub() call to the next with the pipe operator, removing the URLs and handles, and save the cleaned text in a new column called clean_text.
tweets$text %>%
  gsub("?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", .) %>%  # strip URLs
  gsub("@([A-Za-z0-9_]+)", "", .) -> tweets$clean_text  # strip @handles
Removing Stop Words
Now we want to remove stop words from our clean_text column. Stop words are words that are not important to understanding the meaning of a passage. These are typically the most common words of a language, but there may be additional words you’d like to remove for a specific domain. In our task we might consider the word “data” trivial, since almost every article will mention data; a sketch of how to handle such words follows.
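If you did want to treat a word like “data” as a stop word, one possible approach (a sketch, not part of the pipeline below) is to extend tidytext’s stop_words table, which has word and lexicon columns; the “custom” lexicon label here is just an arbitrary tag:

# Extend the standard stop word list with domain-specific words
custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = "data", lexicon = "custom")
)
# Then use anti_join(custom_stop_words) instead of anti_join(stop_words)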
Tokenization is the process of splitting text into sections. This could be by sentence, paragraph, word, or something else. In our case we will split by word, which takes our table of data and turns it into a table with one row for each word in each tweet. Below I select the columns of interest and then tokenize the words.
tweets %>%
  select(clean_text, favorite_count, retweet_count) %>%
  unnest_tokens(word, clean_text)
This leaves us with a one-row-per-word table, where words such as “include”, “training”, and “operations”, which were originally all part of the same row in the clean_text column, now each occupy their own row.
Now we proceed by doing an anti-join with the stop_words data set from the tidytext package. This removes all the words that are part of both the tweets data set and the stop_words data set (essentially this removes the stop words from your data). The code for this is below.
tweets %>%
  select(clean_text, favorite_count, retweet_count) %>%
  unnest_tokens(word, clean_text) %>%
  anti_join(stop_words)
Getting Average Favorites and Retweets
Using the group_by() and summarise() functions from the tidyverse, we can get the median number of favorites and retweets for each word in our data set. I chose to use the median here to eliminate the effect that outliers may have on our data.
tweets %>%
  select(clean_text, favorite_count, retweet_count) %>%
  unnest_tokens(word, clean_text) %>%
  anti_join(stop_words) %>%
  group_by(word) %>%
  summarise("Median Favorites" = median(favorite_count),
            "Median Retweets" = median(retweet_count),
            "Count" = n())
Filtering Results
We want to see the words that have the most favorites/retweets to inform us about what we should write about. But let’s consider the case where a word is only used once. For example, if there is one really popular article out there about data science applied to the geography of Michigan, we wouldn’t want this to bias us toward writing articles about Michigan. To fix this, I filter out all the words that weren’t used in more than 4 tweets with the filter() function, and then sort the results with the arrange() function. This leaves us with our final bit of code.
tweets %>%
  select(clean_text, favorite_count, retweet_count) %>%
  unnest_tokens(word, clean_text) %>%
  anti_join(stop_words) %>%
  group_by(word) %>%
  summarise("Median Favorites" = median(favorite_count),
            "Median Retweets" = median(retweet_count),
            "Count" = n()) %>%
  arrange(desc(`Median Favorites`)) %>%
  filter(`Count` > 4)
Final Results
So if you skipped to the end to see what the results are, or don’t want to run this code yourself, here they are. These are the common words in Towards Data Science tweets, ranked by how many favorites the tweets containing each word usually get.
- website — 84.5 favorites
- finance — 72 favorites
- action — 66 favorites
- matplotlib — 59.5 favorites
- plotting — 57 favorites
- beautiful — 55.5 favorites
- portfolio — 51 favorites
- exploratory — 47 favorites
- github — 46.5 favorites
- comprehensive — 46 favorites
Note: I did have to remove the word “James” here, since apparently there is an author (or several) out there who is very popular with that name. Congratulations, James(es)!
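For reference, a minimal sketch of how a one-off word like that could be excluded (the tweets_words name is just illustrative; unnest_tokens() lower-cases tokens by default, so the filter uses “james”):

# Drop a specific word before grouping and summarising
tweets_words <- tweets %>%
  select(clean_text, favorite_count, retweet_count) %>%
  unnest_tokens(word, clean_text) %>%
  anti_join(stop_words) %>%
  filter(word != "james")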
My advice now would be to write an article titled “A Comprehensive Guide to Creating Beautiful Financial Plots with Matplotlib” or something like that. You can now repeat this process for any other Twitter user with a public account.
Translated from: https://towardsdatascience.com/the-most-popular-towards-data-science-article-topics-on-twitter-2ecc512dd041