在Python中使用Seaborn和WordCloud可视化YouTube视频

I am an avid Youtube user and love watching videos on it in my free time. I decided to do some exploratory data analysis on the youtube videos streamed in the US. I found the dataset on the Kaggle on this link

我是YouTube的狂热用户,喜欢在业余时间观看视频。 我决定对在美国播放的youtube视频进行一些探索性数据分析。 我在此链接的Kaggle上找到了数据集

I downloaded the csv file ‘USvidoes.csv’ and the json file ‘US_category_id.json’ among all the geography-wise datasets available. I have used Jupyter notebook for the purpose of this analysis.

我在所有可用的地理区域数据集中下载了csv文件“ USvidoes.csv”和json文件“ US_category_id.json”。 我已使用Jupyter笔记本进行此分析。

让我们开始吧! (Let's get started!)

Loading the necessary libraries

加载必要的库

import pandas as pdimport numpy as npimport seaborn as snsimport matplotlib.pyplot as pltimport osfrom subprocess import check_outputfrom wordcloud import WordCloud, STOPWORDSimport stringimport re import nltkfrom nltk.corpus import stopwordsfrom nltk import pos_tagfrom nltk.stem.wordnet import WordNetLemmatizer from nltk.tokenize import word_tokenizefrom nltk.tokenize import TweetTokenizer

I created a dataframe by the name ‘df_you’ which will be used throughout the course of the analysis.

我创建了一个名为“ df_you”的数据框,该数据框将在整个分析过程中使用。

df_you = pd.read_csv(r"...\Projects\US Youtube - Python\USvideos.csv")

The foremost step is to understand the length, breadth and the bandwidth of the data.

最重要的步骤是了解数据的长度,宽度和带宽。

print(df_you.shape)
print(df_you.nunique())
Image for post

There seems to be around 40949 observations and 16 variables in the dataset. The next step would be to clean the data if necessary. I checked if there are any null values which need to be removed or manipulated.

数据集中似乎有大约40949个观测值和16个变量。 下一步将是在必要时清除数据。 我检查了是否有任何需要删除或处理的空值。

df_you.info()
Image for post

We see that there are total 16 columns with no null values in any of them. Good for us :) Let us now get a sense of data by viewing the top few rows.

我们看到总共有16列,其中任何一列都没有空值。 对我们有用:)现在让我们通过查看前几行来获得数据感。

df_you.head(n=5)

Now comes the exciting part of visualizations! To visualize the data in the variables such as ‘likes’, ‘dislikes’, ‘views’ and ‘comment count’, I first normalize the data using log distribution. Normalization of the data is essential to ensure that these variables are scaled appropriately without letting one dominant variable skew the final result.

现在是可视化令人兴奋的部分! 为了可视化“喜欢”,“喜欢”,“观看”和“评论计数”等变量中的数据,我首先使用对数分布对数据进行规范化。 数据的规范化对于确保适当缩放这些变量而不会让一个主要变量偏向最终结果至关重要。

df_you['likes_log'] = np.log(df_you['likes']+1)
df_you['views_log'] = np.log(df_you['views'] +1)
df_you['dislikes_log'] = np.log(df_you['dislikes'] +1)
df_you['comment_count_log'] = np.log(df_you['comment_count']+1)

Let us now plot these!

现在让我们绘制这些!

plt.figure(figsize = (12,6))
plt.subplot(221)
g1 = sns.distplot(df_you['likes_log'], color = 'green')
g1.set_title("LIKES LOG DISTRIBUTION", fontsize = 16)
plt.subplot(222)
g2 = sns.distplot(df_you['views_log'])
g2.set_title("VIEWS LOG DISTRIBUTION", fontsize = 16)
plt.subplot(223)
g3 = sns.distplot(df_you['dislikes_log'], color = 'r')
g3.set_title("DISLIKES LOG DISTRIBUTION", fontsize=16)
plt.subplot(224)
g4 = sns.distplot(df_you['comment_count_log'])
g4.set_title("COMMENT COUNT LOG DISTRIBUTION", fontsize=16)
plt.subplots_adjust(wspace = 0.2, hspace = 0.4, top = 0.9)
plt.show()
Image for post

Let us now find out the unique category ids present in our dataset to assign appropriate category names in our dataframe.

现在让我们找出数据集中存在的唯一类别ID,以便在数据框中分配适当的类别名称。

np.unique(df_you["category_id"])
Image for post

We see there are 16 unique categories. Let us assign the names to these categories using the information in the json file ‘US_category_id.json’ which we previously downloaded.

我们看到有16个独特的类别。 让我们使用先前下载的json文件“ US_category_id.json”中的信息将名称分配给这些类别。

df_you['category_name'] = np.nan
df_you.loc[(df_you["category_id"]== 1),"category_name"] = 'Film and Animation'
df_you.loc[(df_you["category_id"] == 2), "category_name"] = 'Cars and Vehicles'
df_you.loc[(df_you["category_id"] == 10), "category_name"] = 'Music'
df_you.loc[(df_you["category_id"] == 15), "category_name"] = 'Pet and Animals'
df_you.loc[(df_you["category_id"] == 17), "category_name"] = 'Sports'
df_you.loc[(df_you["category_id"] == 19), "category_name"] = 'Travel and Events'
df_you.loc[(df_you["category_id"] == 20), "category_name"] = 'Gaming'
df_you.loc[(df_you["category_id"] == 22), "category_name"] = 'People and Blogs'
df_you.loc[(df_you["category_id"] == 23), "category_name"] = 'Comedy'
df_you.loc[(df_you["category_id"] == 24), "category_name"] = 'Entertainment'
df_you.loc[(df_you["category_id"] == 25), "category_name"] = 'News and Politics'
df_you.loc[(df_you["category_id"] == 26), "category_name"] = 'How to and Style'
df_you.loc[(df_you["category_id"] == 27), "category_name"] = 'Education'
df_you.loc[(df_you["category_id"] == 28), "category_name"] = 'Science and Technology'
df_you.loc[(df_you["category_id"] == 29), "category_name"] = 'Non-profits and Activism'
df_you.loc[(df_you["category_id"] == 43), "category_name"] = 'Shows'

Let us now plot these to identify the popular video categories!

现在,让我们对它们进行标绘,以识别受欢迎的视频类别!

plt.figure(figsize = (14,10))
g = sns.countplot('category_name', data = df_you, palette="Set1", order = df_you['category_name'].value_counts().index)
g.set_xticklabels(g.get_xticklabels(),rotation=45, ha="right")
g.set_title("Count of the Video Categories", fontsize=15)
g.set_xlabel("", fontsize=12)
g.set_ylabel("Count", fontsize=12)
plt.subplots_adjust(wspace = 0.9, hspace = 0.9, top = 0.9)
plt.show()
Image for post

We see that the top — 5 viewed categories are ‘Entertainment’, ‘Music’, ‘How to and Style’, ‘Comedy’ and ‘People and Blogs’. So if you are thinking of starting your own youtube channel, you better think about these categories first!

我们看到排名前5位的类别是“娱乐”,“音乐”,“操作方法和样式”,“喜剧”和“人与博客”。 因此,如果您想建立自己的YouTube频道,最好先考虑这些类别!

Let us now see how views, likes, dislikes and comments fare across categories using boxplots.

现在,让我们看看使用箱线图,视图,喜欢,不喜欢和评论在不同类别中的表现如何。

plt.figure(figsize = (14,10))
g = sns.boxplot(x = 'category_name', y = 'views_log', data = df_you, palette="winter_r")
g.set_xticklabels(g.get_xticklabels(),rotation=45, ha="right")
g.set_title("Views across different categories", fontsize=15)
g.set_xlabel("", fontsize=12)
g.set_ylabel("Views(log)", fontsize=12)
plt.subplots_adjust(wspace = 0.9, hspace = 0.9, top = 0.9)
plt.show()
Image for post
plt.figure(figsize = (14,10))
g = sns.boxplot(x = 'category_name', y = 'likes_log', data = df_you, palette="spring_r")
g.set_xticklabels(g.get_xticklabels(),rotation=45, ha="right")
g.set_title("Likes across different categories", fontsize=15)
g.set_xlabel("", fontsize=12)
g.set_ylabel("Likes(log)", fontsize=12)
plt.subplots_adjust(wspace = 0.9, hspace = 0.9, top = 0.9)
plt.show()
Image for post
plt.figure(figsize = (14,10))
g = sns.boxplot(x = 'category_name', y = 'dislikes_log', data = df_you, palette="summer_r")
g.set_xticklabels(g.get_xticklabels(),rotation=45, ha="right")
g.set_title("Dislikes across different categories", fontsize=15)
g.set_xlabel("", fontsize=12)
g.set_ylabel("Dislikes(log)", fontsize=12)
plt.subplots_adjust(wspace = 0.9, hspace = 0.9, top = 0.9)
plt.show()
Image for post
plt.figure(figsize = (14,10))
g = sns.boxplot(x = 'category_name', y = 'comment_count_log', data = df_you, palette="plasma")
g.set_xticklabels(g.get_xticklabels(),rotation=45, ha="right")
g.set_title("Comments count across different categories", fontsize=15)
g.set_xlabel("", fontsize=12)
g.set_ylabel("Comment_count(log)", fontsize=12)
plt.subplots_adjust(wspace = 0.9, hspace = 0.9, top = 0.9)
plt.show()
Image for post

Next I calculated engagement measures such as like rate, dislike rate and comment rate.

接下来,我计算了参与度,例如喜欢率,不喜欢率和评论率。

df_you['like_rate'] = df_you['likes']/df_you['views']
df_you['dislike_rate'] = df_you['dislikes']/df_you['views']
df_you['comment_rate'] = df_you['comment_count']/df_you['views']

Building correlation matrix using a heatmap for engagement measures.

使用热图来建立参与度度量的相关矩阵。

plt.figure(figsize = (10,8))
sns.heatmap(df_you[['like_rate', 'dislike_rate', 'comment_rate']].corr(), annot=True)
plt.show()
Image for post

From the above heatmap, it can be seen that if a viewer likes a particular video, there is a 43% chance that he/she will comment on it as opposed to 28% chance of commenting if the viewer dislikes the video. This is a good insight which means if viewers like any videos, they are more likely to comment on them to show their appreciation/feedback.

从上面的热图可以看出,如果观众喜欢某个视频,则有43%的机会对其发表评论,而如果观众不喜欢该视频,则有28%的评论机会。 这是一个很好的见解,这意味着如果观众喜欢任何视频,他们就更有可能对它们发表评论以表示赞赏/反馈。

Next, I try to analyse the word count , unique word count, punctuation count and average length of the words in the ‘Title’ and ‘Tags’ columns

接下来,我尝试在“标题”和“标签”列中分析字数唯一字数,标点符号和字的平均长度

#Word count 
df_you['count_word']=df_you['title'].apply(lambda x: len(str(x).split()))
df_you['count_word_tags']=df_you['tags'].apply(lambda x: len(str(x).split()))#Unique word count
df_you['count_unique_word'] = df_you['title'].apply(lambda x: len(set(str(x).split())))
df_you['count_unique_word_tags'] = df_you['tags'].apply(lambda x: len(set(str(x).split())))#Punctutation count
df_you['count_punctuation'] = df_you['title'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))
df_you['count_punctuation_tags'] = df_you['tags'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))#Average length of the words
df_you['mean_word_len'] = df_you['title'].apply(lambda x : np.mean([len(x) for x in str(x).split()]))
df_you['mean_word_len_tags'] = df_you['tags'].apply(lambda x: np.mean([len(x) for x in str(x).split()]))

Plotting these…

绘制这些…

plt.figure(figsize = (12,18))
plt.subplot(421)
g1 = sns.distplot(df_you['count_word'],
hist = False, label = 'Text')
g1 = sns.distplot(df_you['count_word_tags'],
hist = False, label = 'Tags')
g1.set_title('Word count distribution', fontsize = 14)
g1.set(xlabel='Word Count')
plt.subplot(422)
g2 = sns.distplot(df_you['count_unique_word'],
hist = False, label = 'Text')
g2 = sns.distplot(df_you['count_unique_word_tags'],
hist = False, label = 'Tags')
g2.set_title('Unique word count distribution', fontsize = 14)
g2.set(xlabel='Unique Word Count')
plt.subplot(423)
g3 = sns.distplot(df_you['count_punctuation'],
hist = False, label = 'Text')
g3 = sns.distplot(df_you['count_punctuation_tags'],
hist = False, label = 'Tags')
g3.set_title('Punctuation count distribution', fontsize =14)
g3.set(xlabel='Punctuation Count')
plt.subplot(424)
g4 = sns.distplot(df_you['mean_word_len'],
hist = False, label = 'Text')
g4 = sns.distplot(df_you['mean_word_len_tags'],
hist = False, label = 'Tags')
g4.set_title('Average word length distribution', fontsize = 14)
g4.set(xlabel = 'Average Word Length')
plt.subplots_adjust(wspace = 0.2, hspace = 0.4, top = 0.9)
plt.legend()
plt.show()
Image for post

Let us now visualize the word cloud for Title of the videos, Description of the videos and videos Tags. This way we can discover which words are popular in the title, description and tags. Creating a word cloud is a popular way to find out trending words on the blogsphere.

现在让我们可视化视频标题,视频描述和视频标签的词云。 通过这种方式,我们可以发现标题,描述和标签中流行的单词。 创建词云是在Blogsphere上查找流行词的一种流行方法。

  1. Word Cloud for Title of the videos

    视频标题的词云

plt.figure(figsize = (20,20))
stopwords = set(STOPWORDS)
wordcloud = WordCloud(
background_color = 'black',
stopwords=stopwords,
max_words = 1000,
max_font_size = 120,
random_state = 42
).generate(str(df_you['title']))#Plotting the word cloud
plt.imshow(wordcloud)
plt.title("WORD CLOUD for Titles", fontsize = 20)
plt.axis('off')
plt.show()
Image for post

From the above word cloud, it is apparent that most popularly used title words are ‘Official’, ‘Video’, ‘Talk’, ‘SNL’, ‘VS’, and ‘Week’ among others.

从上面的词云中可以看出,最常用的标题词是“ Official”,“ Video”,“ Talk”,“ SNL”,“ VS”和“ Week”等。

2. Word cloud for Title Description

2.标题说明的词云

plt.figure(figsize = (20,20))
stopwords = set(STOPWORDS)
wordcloud = WordCloud(
background_color = 'black',
stopwords = stopwords,
max_words = 1000,
max_font_size = 120,
random_state = 42
).generate(str(df_you['description']))
plt.imshow(wordcloud)
plt.title('WORD CLOUD for Title Description', fontsize = 20)
plt.axis('off')
plt.show()
Image for post

I found that the most popular words for description of videos are ‘https’, ‘video’, ‘new’, ‘watch’ among others.

我发现最受欢迎的视频描述词是“ https”,“ video”,“ new”,“ watch”等。

3. Word Cloud for Tags

3.标签的词云

plt.figure(figsize = (20,20))
stopwords = set(STOPWORDS)
wordcloud = WordCloud(
background_color = 'black',
stopwords = stopwords,
max_words = 1000,
max_font_size = 120,
random_state = 42
).generate(str(df_you['tags']))
plt.imshow(wordcloud)
plt.title('WORD CLOUD for Tags', fontsize = 20)
plt.axis('off')
plt.show()
Image for post

Popular tags seem to be ‘SNL’, ‘TED’, ‘new’, ‘Season’, ‘week’, ‘Cream’, ‘youtube’ From the word cloud analysis, it looks like there are a lot of Saturday Night Live fans on the youtube out there!

热门标签似乎是'SNL','TED','new','Season','week','Cream','youtube'。从词云分析来看,似乎有很多Saturday Night Live粉丝在YouTube上!

I used the word cloud library for the very first time and it yielded pretty and useful visuals! If you are interested in knowing more about this library and how to use it then you must absolutely check this out

我第一次使用词云库,它产生了漂亮而有用的视觉效果! 如果您有兴趣了解有关此库以及如何使用它的更多信息,则必须完全检查一下

This analysis is hosted on my Github page here.

此分析托管在我的Github页面上。

Thanks for reading!

谢谢阅读!

翻译自: https://towardsdatascience.com/visualizing-youtube-videos-using-seaborn-and-wordcloud-in-python-b24247f70228

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/390930.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

老生常谈:抽象工厂模式

在创建型模式中有一个模式是不得不学的,那就是抽象工厂模式(Abstract Factory),这是创建型模式中最为复杂,功能最强大的模式.它常与工厂方法组合来实现。平时我们在写一个组件的时候一般只针对一种语言,或者说是针对一个区域的人来实现。 例如:现有有一个新闻组件,在中国我们有…

数据结构入门最佳书籍_最佳数据科学书籍

数据结构入门最佳书籍Introduction介绍 I get asked a lot what resources I recommend for people who want to start their Data Science journey. This section enlists books I recommend you should read at least once in your life as a Data Scientist.我被很多人问到…

函数式编程概念

什么是函数式编程 简单地说,函数式编程通过使用函数,将值转换成抽象单元,接着用于构建软件系统。 面向对象VS函数式编程 面向对象编程 面向对象编程认为一切事物皆对象,将现实世界的事物抽象成对象,现实世界中的关系抽…

多重插补 均值插补_Feature Engineering Part-1均值/中位数插补。

多重插补 均值插补Understanding the Mean /Median Imputation and Implementation using feature-engine….!了解使用特征引擎的均值/中位数插补和实现…。! 均值或中位数插补: (Mean or Median Imputation:) The mean or median value should be calc…

linux 查看用户上次修改密码的日期

查看root用户密码上次修改的时间 方法一:查看日志文件: # cat /var/log/secure |grep password changed 方法二: # chage -l root-----Last password change : Feb 27, 2018 Password expires : never…

客户行为模型 r语言建模_客户行为建模:汇总统计的问题

客户行为模型 r语言建模As a Data Scientist, I spend quite a bit of time thinking about Customer Lifetime Value (CLV) and how to model it. A strong CLV model is really a strong customer behavior model — the better you can predict next actions, the better yo…

【知识科普】解读闪电/雷电网络,零基础秒懂!

知识科普,解读闪电/雷电网络,零基础秒懂! 闪电网络的技术是革命性的,将实现即时0手续费的小金额支付。第一步是解决扩容问题,第二部就是解决共通性问题,利用原子交换协议和不同链条的状态通道结合&#xff…

Alpha 冲刺 (5/10)

【Alpha go】Day 5! Part 0 简要目录 Part 1 项目燃尽图Part 2 项目进展Part 3 站立式会议照片Part 4 Scrum 摘要Part 5 今日贡献Part 1 项目燃尽图 Part 2 项目进展 已分配任务进度博客检索功能:根据标签检索流程图 -> 实现 -> 测试近期比…

多维空间可视化_使用GeoPandas进行空间可视化

多维空间可视化Recently, I was working on a project where I was trying to build a model that could predict housing prices in King County, Washington — the area that surrounds Seattle. After looking at the features, I wanted a way to determine the houses’ …

机器学习 来源框架_机器学习的秘密来源:策展

机器学习 来源框架成功的机器学习/人工智能方法 (Methods for successful Machine learning / Artificial Intelligence) It’s widely stated that data is the new oil, and like oil, data needs the right refinement to evolve to be utilised perfectly. The power of ma…

WebLogic调用WebService提示Failed to localize、Failed to create WsdlDefinitionFeature

在本地Tomcat环境下调用WebService正常&#xff0c;但是部署到WebLogic环境中&#xff0c;则提示警告&#xff1a;[Failed to localize] MEX0008.PARSING_MDATA_FAILURE<SOAP_1_2 ......警告&#xff1a;[Failed to localize] MEX0008.PARSING_MDATA_FAILURE<SOAP_1_1 ..…

呼吁开放外网_服装数据集:呼吁采取行动

呼吁开放外网Getting a dataset with images is not easy if you want to use it for a course or a book. Yes, there are many datasets with images, but few of them are suitable for commercial or educational use.如果您想将其用于课程或书籍&#xff0c;则获取带有图像…

React JS 组件间沟通的一些方法

刚入门React可能会因为React的单向数据流的特性而遇到组件间沟通的麻烦&#xff0c;这篇文章主要就说一说如何解决组件间沟通的问题。 1.组件间的关系 1.1 父子组件 ReactJS中数据的流动是单向的&#xff0c;父组件的数据可以通过设置子组件的props传递数据给子组件。如果想让子…

数据可视化分析票房数据报告_票房收入分析和可视化

数据可视化分析票房数据报告Welcome back to my 100 Days of Data Science Challenge Journey. On day 4 and 5, I work on TMDB Box Office Prediction Dataset available on Kaggle.欢迎回到我的100天数据科学挑战之旅。 在第4天和第5天&#xff0c;我将研究Kaggle上提供的TM…

先知模型 facebook_Facebook先知

先知模型 facebook什么是先知&#xff1f; (What is Prophet?) “Prophet” is an open-sourced library available on R or Python which helps users analyze and forecast time-series values released in 2017. With developers’ great efforts to make the time-series …

搭建Maven私服那点事

摘要&#xff1a;本文主要介绍在CentOS7.1下使用nexus3.6.0搭建maven私服&#xff0c;以及maven私服的使用&#xff08;将自己的Maven项目指定到私服地址、将第三方项目jar上传到私服供其他项目组使用&#xff09; 一、简介 Maven是一个采用纯Java编写的开源项目管理工具, Mave…

gan训练失败_我尝试过(但失败了)使用GAN来创作艺术品,但这仍然值得。

gan训练失败This work borrows heavily from the Pytorch DCGAN Tutorial and the NVIDA paper on progressive GANs.这项工作大量借鉴了Pytorch DCGAN教程 和 有关渐进式GAN 的 NVIDA论文 。 One area of computer vision I’ve been wanting to explore are GANs. So when m…

19.7 主动模式和被动模式 19.8 添加监控主机 19.9 添加自定义模板 19.10 处理图形中的乱码 19.11 自动发现...

2019独角兽企业重金招聘Python工程师标准>>> 19.7 主动模式和被动模式 • 主动或者被动是相对客户端来讲的 • 被动模式&#xff0c;服务端会主动连接客户端获取监控项目数据&#xff0c;客户端被动地接受连接&#xff0c;并把监控信息传递给服务端 服务端请求以后&…

华盛顿特区与其他地区的差别_使用华盛顿特区地铁数据确定可获利的广告位置...

华盛顿特区与其他地区的差别深度分析 (In-Depth Analysis) Living in Washington DC for the past 1 year, I have come to realize how WMATA metro is the lifeline of this vibrant city. The metro network is enormous and well-connected throughout the DMV area. When …

Windows平台下kafka环境的搭建

近期在搞kafka&#xff0c;在Windows环境搭建的过程中遇到一些问题&#xff0c;把具体的流程几下来防止后面忘了。 准备工作&#xff1a; 1.安装jdk环境 http://www.oracle.com/technetwork/java/javase/downloads/index.html 2.下载kafka的程序安装包&#xff1a; http://kafk…