Box Office Revenue Analysis and Visualization

Welcome back to my 100 Days of Data Science Challenge journey. On days 4 and 5, I worked on the TMDB Box Office Prediction dataset available on Kaggle.

I’ll start by importing some useful libraries that we need in this task.

import pandas as pd

# for visualizations
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('dark_background')

Data Loading and Exploration

Once you have downloaded the data from Kaggle, you will have 3 files. As this is a prediction competition, you get train, test, and sample_submission files. For this project, my aim is only to perform data analysis and visualization, so I am going to ignore the test.csv and sample_submission.csv files.

Let’s load train.csv into a DataFrame using pandas.

%time train = pd.read_csv('./data/tmdb-box-office-prediction/train.csv')
# Output
CPU times: user 258 ms, sys: 132 ms, total: 389 ms
Wall time: 403 ms

About the dataset:

id: Unique integer id of each movie.
belongs_to_collection: Contains the TMDB id, name, movie poster, and backdrop URL of a movie in JSON format.
budget: Budget of a movie in dollars. Some rows contain 0, which means the budget is unknown.
genres: Contains all the genre names and TMDB ids in JSON format.
homepage: Contains the official URL of a movie.
imdb_id: IMDB id of a movie (string).
original_language: Two-letter code of the original language in which the movie was made.
original_title: The original title of a movie in original_language.
overview: Brief description of the movie.
popularity: Popularity of the movie.
poster_path: Poster path of a movie. You can see the full poster image by appending this path to https://image.tmdb.org/t/p/original/
production_companies: All production company names and TMDB ids of a movie in JSON format.
production_countries: Two-letter code and the full name of each production country in JSON format.
release_date: The release date of a movie in mm/dd/yy format.
runtime: Total runtime of a movie in minutes (integer).
spoken_languages: Two-letter code and the full name of each spoken language.
status: Whether the movie is released or rumored.
tagline: Tagline of a movie.
title: English title of a movie.
Keywords: TMDB id and name of all the keywords in JSON format.
cast: TMDB id, name, character name, and gender (1 = Female, 2 = Male) of all cast members in JSON format.
crew: Name, TMDB id, and profile path of crew members in various jobs, such as Director, Writer, Art, Sound, etc.
revenue: Total revenue earned by a movie in dollars.
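
As a small illustration of the poster_path note above, the full image URL could be assembled as follows. This is only a sketch: poster_url is a hypothetical helper, and the sample path is made up rather than taken from the dataset.

```python
POSTER_BASE_URL = 'https://image.tmdb.org/t/p/original/'

def poster_url(poster_path):
    """Join the base URL and a poster path without doubling the slash."""
    return POSTER_BASE_URL + poster_path.lstrip('/')

# Made-up path, for illustration only.
print(poster_url('/example_poster.jpg'))
```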

Let’s have a look at the sample data.

train.head()

As we can see, some features contain dictionaries, so I am dropping all such columns for now, along with the free-text tagline, overview, and homepage columns.

train = train.drop(['belongs_to_collection', 'genres', 'crew', 'cast', 'Keywords',
                    'spoken_languages', 'production_companies', 'production_countries',
                    'tagline', 'overview', 'homepage'], axis=1)
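
The dictionary-like columns do not have to be discarded for good. Below is a minimal sketch of how the names could be pulled out of one cell with the standard library. It assumes the cells are stored as Python-literal strings; extract_names is a hypothetical helper, and the sample value is made up to resemble the genres column rather than copied from the dataset.

```python
import ast

def extract_names(cell):
    """Return the 'name' values from a stringified list of dicts."""
    # Missing cells (NaN or an empty string) are treated as "no entries".
    if not isinstance(cell, str) or not cell.strip():
        return []
    records = ast.literal_eval(cell)  # safely parses literals, unlike eval
    return [record['name'] for record in records]

# Made-up cell value shaped like the genres column described above.
sample = "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]"
print(extract_names(sample))
```

On the real data, a helper like this could be applied to a column before dropping it, for example with train['genres'].apply(extract_names).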

Now it is time to look at the statistics of the data.

print("Shape of data is ")
train.shape
# Output
Shape of data is
(3000, 12)

Dataframe information.

train.info()
# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   id                 3000 non-null   int64
 1   budget             3000 non-null   int64
 2   imdb_id            3000 non-null   object
 3   original_language  3000 non-null   object
 4   original_title     3000 non-null   object
 5   popularity         3000 non-null   float64
 6   poster_path        2999 non-null   object
 7   release_date       3000 non-null   object
 8   runtime            2998 non-null   float64
 9   status             3000 non-null   object
 10  title              3000 non-null   object
 11  revenue            3000 non-null   int64
dtypes: float64(2), int64(3), object(7)
memory usage: 281.4+ KB
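
The info() output above shows that poster_path and runtime have fewer non-null entries than the other columns. A quick way to count missing values per column is sketched below; the toy frame is hypothetical and only mimics the shape of the real data.

```python
import pandas as pd

# Hypothetical toy frame with one gap in each column.
sample = pd.DataFrame({
    'runtime': [90.0, None, 120.0],
    'poster_path': ['/a.jpg', '/b.jpg', None],
})

missing = sample.isna().sum()  # number of missing cells per column
print(missing)
```

On the real data, train.isna().sum() should report 1 missing poster_path and 2 missing runtime values, which is 3000 minus the non-null counts listed above.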

Describe the DataFrame.

train.describe()

Let’s create new columns for release weekday, day, month, and year.

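# release_date is stored as mm/dd/yy (infer_datetime_format is deprecated in pandas 2.0)
train['release_date'] = pd.to_datetime(train['release_date'], format='%m/%d/%y')
train['release_day'] = train['release_date'].dt.day
train['release_weekday'] = train['release_date'].dt.weekday
train['release_month'] = train['release_date'].dt.month
# Two-digit years parsed as 2018 or later belong to the 1900s, so shift them back by 100 years.
# release_date itself is not corrected, so the weekday of those older films comes from the shifted date.
train['release_year'] = train['release_date'].apply(lambda t: t.year if t.year < 2018 else t.year - 100)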
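
The century fix in the release_year line can be seen in isolation with the standard library. Python’s own strptime maps two-digit years 69-99 to 1969-1999 and 00-68 to 2000-2068, which appears to be the behaviour the correction above is written for. This is only a sketch: parse_release_date and LAST_YEAR_IN_DATA are hypothetical names, and the cutoff assumes, as the code above does, that no film in the data is newer than 2017.

```python
from datetime import datetime

LAST_YEAR_IN_DATA = 2017  # assumption: the newest film in train.csv is from 2017

def parse_release_date(text):
    """Parse an mm/dd/yy string and move impossible future years back a century."""
    parsed = datetime.strptime(text, '%m/%d/%y')
    if parsed.year > LAST_YEAR_IN_DATA:
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed

print(parse_release_date('2/20/15').year)  # already plausible, left unchanged
print(parse_release_date('5/1/30').year)   # parsed as 2030, then shifted back
```

Correcting the whole date rather than only the year would also keep derived values such as the weekday consistent for the older films.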

Data Analysis and Visualization

Question 1: Which movie made the highest revenue?

train[train['revenue'] == train['revenue'].max()]

# Note: the output has a gradient style, which cannot be shown on Medium.
train[['id','title','budget','revenue']].sort_values(['revenue'], ascending=False).head(10).style.background_gradient(subset='revenue', cmap='BuGn')

The Avengers made the highest revenue.
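
The boolean-mask lookup above can also be written with idxmax or nlargest. Here is a minimal sketch on hypothetical toy rows; the titles and numbers are made up and not taken from train.csv.

```python
import pandas as pd

# Hypothetical toy rows, not taken from train.csv.
movies = pd.DataFrame({
    'title': ['Film A', 'Film B', 'Film C'],
    'budget': [50, 200, 120],
    'revenue': [300, 900, 450],
})

# idxmax returns the row label of the largest revenue.
top_title = movies.loc[movies['revenue'].idxmax(), 'title']
# nlargest returns the top rows already sorted, largest first.
top_two = movies.nlargest(2, 'revenue')

print(top_title)
print(top_two['title'].tolist())
```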

Question 2: Which movie has the highest budget?

train[train['budget'] == train['budget'].max()]
train[['id','title','budget', 'revenue']].sort_values(['budget'], ascending=False).head(10).style.background_gradient(subset=['budget', 'revenue'], cmap='PuBu')

Pirates of the Caribbean: On Stranger Tides is the most expensive movie.

Question 3: Which movie is the longest?

train[train['runtime'] == train['runtime'].max()]
plt.hist(train['runtime'].fillna(0) / 60, bins=40);
plt.title('Distribution of length of film in hours', fontsize=16, color='white');
plt.xlabel('Duration of Movie in Hours')
plt.ylabel('Number of Movies')
train[['id','title','runtime', 'budget', 'revenue']].sort_values(['runtime'],ascending=False).head(10).style.background_gradient(subset=['runtime','budget','revenue'], cmap='YlGn')

Carlos is the longest movie, with a runtime of 338 minutes (5 hours and 38 minutes).
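
The conversion from minutes to hours and minutes is a single divmod call. The helper name format_runtime below is hypothetical and only serves as an illustration.

```python
def format_runtime(minutes):
    """Convert a runtime in minutes into an 'H h M min' string."""
    hours, remainder = divmod(int(minutes), 60)
    return f"{hours} h {remainder} min"

print(format_runtime(338))  # prints "5 h 38 min"
```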

Question 4: In which year were most movies released?

plt.figure(figsize=(20,12))
# pass the data as x=, since newer seaborn versions no longer accept it positionally
sns.countplot(x=train['release_year'].sort_values(), palette="Dark2", edgecolor=(0,0,0))
plt.title("Movie Release count by Year",fontsize=20)
plt.xlabel('Release Year')
plt.ylabel('Number of Movies Released')
plt.xticks(fontsize=12,rotation=90)
plt.show()

train['release_year'].value_counts().head()
# Output
2013    141
2015    128
2010    126
2016    125
2012    125
Name: release_year, dtype: int64

Most movies were released in 2013: 141 in total.
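
value_counts sorts by frequency, so the busiest year can also be read off programmatically instead of from the printed table. A small sketch on a hypothetical toy Series (the years below are made up):

```python
import pandas as pd

# Hypothetical release years, for illustration only.
years = pd.Series([2013, 2015, 2013, 2010, 2013, 2015])

counts = years.value_counts()    # most frequent value first
busiest_year = counts.idxmax()   # label with the highest count
print(busiest_year, counts.max())
```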

Question 5: Movies with the highest and lowest popularity.

Most popular movie:

train[train['popularity']==train['popularity'].max()][['original_title','popularity','release_date','revenue']]

Least popular movie:

train[train['popularity']==train['popularity'].min()][['original_title','popularity','release_date','revenue']]

Let’s create a popularity distribution plot.

plt.figure(figsize=(20,12))
# distplot is deprecated since seaborn 0.11; sns.histplot(train['popularity']) is the newer equivalent
sns.distplot(train['popularity'], kde=False)
plt.title("Movie Popularity Count",fontsize=20)
plt.xlabel('Popularity')
plt.ylabel('Count')
plt.xticks(fontsize=12,rotation=90)
plt.show()

Wonder Woman has the highest popularity, at 294.33, whereas Big Time has the lowest, at 0.

Question 6: In which month were most movies released from 1921 to 2017?

plt.figure(figsize=(20,12))
# pass the data as x=, since newer seaborn versions no longer accept it positionally
sns.countplot(x=train['release_month'].sort_values(), palette="Dark2", edgecolor=(0,0,0))
plt.title("Movie Release count by Month",fontsize=20)
plt.xlabel('Release Month')
plt.ylabel('Number of Movies Released')
plt.xticks(fontsize=12)
plt.show()

train['release_month'].value_counts()
# Output
9     362
10    307
12    263
8     256
4     245
3     238
6     237
2     226
5     224
11    221
1     212
7     209
Name: release_month, dtype: int64

Most movies were released in September: 362 in total.
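
To put the monthly counts in perspective, they can be converted to shares of all releases. The sketch below reuses the counts from the output above; MONTH_NAMES and the variable names are only illustrative.

```python
# Counts copied from the value_counts() output above (month number -> releases).
month_counts = {9: 362, 10: 307, 12: 263, 8: 256, 4: 245, 3: 238,
                6: 237, 2: 226, 5: 224, 11: 221, 1: 212, 7: 209}
MONTH_NAMES = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

total = sum(month_counts.values())
for month in sorted(month_counts):  # calendar order instead of frequency order
    share = 100 * month_counts[month] / total
    print(f"{MONTH_NAMES[month - 1]}: {month_counts[month]} ({share:.1f}%)")

busiest = max(month_counts, key=month_counts.get)  # month number with most releases
```

September accounts for 362 of the 3000 releases, roughly 12%.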

Question 7: On which day of the month are most movies released?

plt.figure(figsize=(20,12))
# pass the data as x=, since newer seaborn versions no longer accept it positionally
sns.countplot(x=train['release_day'].sort_values(), palette="Dark2", edgecolor=(0,0,0))
plt.title("Movie Release count by Day of Month",fontsize=20)
plt.xlabel('Release Day')
plt.ylabel('Number of Movies Released')
plt.xticks(fontsize=12)
plt.show()

train['release_day'].value_counts().head()
# Output
1     152
15    126
12    122
7     110
6     107
Name: release_day, dtype: int64

The highest number of movies, 152, were released on the first day of the month.

Question 8: On which day of the week are most movies released?

plt.figure(figsize=(20,12))
sns.countplot(x=train['release_weekday'].sort_values(), palette='Dark2')
# weekday() numbers the days from 0 (Monday) to 6 (Sunday)
day_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
loc = range(len(day_labels))
plt.xlabel('Release Day of Week')
plt.ylabel('Number of Movies Released')
plt.xticks(loc, day_labels, fontsize=12)
plt.show()

train['release_weekday'].value_counts()
# Output
4    1334
3     609
2     449
1     196
5     158
0     135
6     119
Name: release_weekday, dtype: int64

The highest number of movies were released on Friday.
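
The release_weekday values come from weekday(), which counts from Monday as 0, so the value 4 in the output above is Friday. A small sketch of the mapping, using an arbitrary example date and a hypothetical DAY_LABELS list:

```python
from datetime import date

# Index positions match the numbers returned by weekday().
DAY_LABELS = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

release = date(2012, 5, 4)  # arbitrary example date
print(release.weekday(), DAY_LABELS[release.weekday()])
```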

Final Words

I hope this article was helpful to you. I tried to answer a few questions using data science, and there are many more questions to ask. Tomorrow I will move on to another dataset. All the code for the data analysis and visuals can be found in this GitHub repository or Kaggle kernel.

Thanks for reading.

I appreciate any feedback.

100 Days of Data Science Progress

If you like my work and want to support me, I’d greatly appreciate it if you follow me on my social media channels:

  • The best way to support me is by following me on Medium.

  • Subscribe to my new YouTube channel.

  • Sign up on my email list.

Translated from: https://towardsdatascience.com/box-office-revenue-analysis-and-visualization-ce5b81a636d7
