Predicting a Movie’s Revenue

J. Valenti, CEO of the Motion Picture Association of America (MPAA), once said: “No one can tell you how a movie is going to do in the marketplace. Not until the film opens in darkened theater and sparks fly up between the screen and the audience.”

Cigdem Tuncer
Aug 9

The modern film industry, worth nearly 10 billion dollars per year, is a cutthroat business.

Each year in the United States, hundreds of films are released to domestic audiences in the hope that they will become the next “blockbuster.” Predicting how well a movie will perform at the box office is hard because there are so many factors involved in success.

The goal of this project is to develop a computational model that predicts movie revenues from public data extracted from the Boxofficemojo.com online movie database.

The project has five phases. The first is web scraping: several types of features are extracted from Boxofficemojo.com, as described later. The second is data cleaning: after scraping the data from the source, I cleaned it, mainly by removing movies for which some features were unavailable. The third is exploratory data analysis, where I create graphics to understand the data. The fourth is feature engineering, where features for the machine learning model are created from the raw text data. The fifth is model analysis, where I apply regression models to the data set.

Web Scraping

Web scraping is the use of a program or algorithm to extract and process large amounts of data from the web. Whether you are a data scientist, an engineer, or anybody who analyzes large data sets, the ability to scrape data from the web is a useful skill to have.

It’s a good idea to do some research on your own and make sure that you’re not violating any Terms of Service before you start a large-scale project. To learn more about the legal aspects of web scraping, check out Legal Perspectives on Scraping Data from the Modern Web.

For this project:

· The BeautifulSoup library is used for data extraction from the web.

· The Pandas library is used for data manipulation and cleaning.

· Matplotlib and Seaborn are used for data visualization.
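
To illustrate the extraction step, here is a minimal sketch of parsing a page with BeautifulSoup. The HTML fragment, the class names, and the field values are hypothetical stand-ins; the real Box Office Mojo markup is different, and this is not the code from my notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded movie page;
# the real site uses different markup.
html = """
<div class="movie">
  <h1>Example Movie</h1>
  <span class="label">Distributor</span><span class="value">Example Pictures</span>
  <span class="label">Budget</span><span class="value">$50,000,000</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the labels and values in document order and pair them up.
labels = [tag.get_text(strip=True) for tag in soup.find_all("span", class_="label")]
values = [tag.get_text(strip=True) for tag in soup.find_all("span", class_="value")]
movie = {"Title": soup.h1.get_text(strip=True), **dict(zip(labels, values))}
print(movie)
```

In a real run, the HTML string would come from an HTTP request for each movie page, and one such dictionary per movie would be collected into a Pandas DataFrame.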

My data set contains 8319 movies released between 2010 and 2019. More recent movies were not selected because few movies were released in 2020 due to Covid-19. For each movie I collected the Title, Distributor, Release, MPAA, Time, Genre, Domestic, International, Worldwide, Opening, Budget, and Actors information.

Data Cleaning

At the beginning my data set had 8319 movies. I then realized that many movies did not have all data available, so missing features were the main reason for eliminating movies from my data set.

Most of the movies don’t have budget data available, so rows with null values were deleted.
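
A minimal sketch of this step with Pandas, assuming the scraped data is already in a DataFrame. The rows below are toy values; only the column names follow the article.

```python
import numpy as np
import pandas as pd

# Toy rows with hypothetical values; movie "B" has no budget.
df = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Domestic": [1_000_000.0, 250_000.0, 4_000_000.0],
    "Budget": [500_000.0, np.nan, 2_000_000.0],
})

print(df.isna().sum())                      # missing values per column
df_clean = df.dropna().reset_index(drop=True)
print(len(df), "->", len(df_clean))         # row count before and after
```

Checking `isna().sum()` first shows which columns drive the loss of rows before anything is actually dropped.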

For numeric columns, the dtype was converted from “object” to “float”.
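
One way to do such a conversion is sketched below. It assumes the scraped amounts are strings such as "$50,000,000"; that format is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical raw values as they might look right after scraping.
df = pd.DataFrame({
    "Budget": ["$50,000,000", "$1,200,000"],
    "Domestic": ["$120,500,000", "$900,000"],
})

for col in ["Budget", "Domestic"]:
    # Remove "$" and "," (a regex character class), then cast to float.
    df[col] = df[col].str.replace(r"[$,]", "", regex=True).astype(float)

print(df.dtypes)
```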

The “Release” data was checked for leap-year dates, and the entries found were corrected. The dtype of the “Release” column was converted from “object” to “datetime”. Unrelated information was removed from the “Distributor” column.
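
The sketch below shows one possible way to find problematic dates while converting to datetime. The date format and the sample strings are assumptions, and the article does not say how the leap-year check was actually done.

```python
import pandas as pd

# Hypothetical release strings; "Feb 29, 2019" is invalid because 2019 is not a leap year.
release = pd.Series(["Feb 29, 2016", "Feb 29, 2019", "Jul 04, 2014"])

# errors="coerce" turns unparseable dates into NaT instead of raising an error.
parsed = pd.to_datetime(release, format="%b %d, %Y", errors="coerce")

print(parsed)
print(release[parsed.isna()])   # entries that need a manual correction
```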

Duplicate rows were deleted from the data set.
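
In Pandas this is a one-liner; a small sketch on toy data:

```python
import pandas as pd

# Toy data (hypothetical): the first two rows are exact duplicates.
df = pd.DataFrame({"Title": ["A", "A", "B"], "Budget": [1.0, 1.0, 2.0]})

print(df.duplicated().sum())                      # number of duplicate rows
df = df.drop_duplicates().reset_index(drop=True)
```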

After removing those movies, I ended up with a data set of 1293 movies for which all information is available.

Exploratory Data Analysis (EDA)

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

Let’s look at the relationship between “Domestic Total Gross” and “Budget” for each year.

While there are an almost overwhelming number of methods to use in EDA, one of the most effective starting tools is the pairs plot (also called a scatterplot matrix). A pairs plot allows us to see both distribution of single variables and relationships between two variables. Pair plots are a great method to identify trends for follow-up analysis and, fortunately, are easily implemented in Python.


A heatmap is a graphical representation of data in which the individual values of a matrix are represented as colors. Heatmaps are well suited to exploring the correlation of features in a data set, and either Matplotlib or Seaborn can be used to create one. To get the correlation of the features in a data set we can call <dataset>.corr(), a Pandas DataFrame method, which returns the correlation matrix.
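
A small sketch of the correlation step on toy data (the values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "Budget":   [10.0, 25.0, 40.0, 60.0, 90.0],
    "Opening":  [3.0, 8.0, 12.0, 20.0, 35.0],
    "Domestic": [9.0, 30.0, 45.0, 80.0, 120.0],
})

corr = df.corr()          # Pearson correlation by default
print(corr.round(2))
```

The resulting matrix can then be passed to Seaborn, for example `sns.heatmap(corr, annot=True)`, to draw the heatmap.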

Feature Engineering

Feature engineering means building additional features out of existing data which is often spread across multiple related tables. Feature engineering requires extracting the relevant information from the data and getting it into a single table which can then be used to train a machine learning model.

Machine learning fits mathematical models to the data in order to derive insights. The models take features as input. A feature is generally a numeric representation of an aspect of a real-world phenomenon or of the data. Just as there are dead ends in a maze, the path through the data is filled with noise and missing pieces. Our job as data scientists is to find a clear path to the end goal of insights.

Let’s look at the summary statistics of the data set and at the distribution of the target column.

We would like the target variable predicted by the model to be approximately normally distributed. When we examine the distribution of our target variable, we see that it is not normal but right-skewed. We can correct this by applying a logarithmic transformation to the target variable.
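
The effect of the transformation can be checked with the sample skewness. The sketch below uses made-up gross values, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed grosses: many small values and a few very large ones.
gross = pd.Series([0.2e6, 0.5e6, 1e6, 2e6, 5e6, 12e6, 40e6, 150e6, 600e6])

# All values are positive, so the natural log is defined everywhere.
log_gross = np.log(gross)

print("skew before:", round(gross.skew(), 2))
print("skew after: ", round(log_gross.skew(), 2))
```

A skewness close to zero after the transformation indicates a more symmetric distribution. Predictions made on the log scale have to be transformed back with `np.exp` before they are read as dollar amounts.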

Ordinary least-squares (OLS) regression fits a model of the relationship between one or more explanatory variables and a continuous (or at least interval) outcome variable by minimizing the sum of squared errors, where an error is the difference between the actual and the predicted value of the outcome variable.
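
To make the definition concrete, here is a minimal sketch with plain NumPy on toy data (not the library or the data used for the tables in this article). It finds the coefficients that minimize the sum of squared errors.

```python
import numpy as np

# Toy data generated from y = 1 + 2x, so the exact fit is known in advance.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Design matrix: a column of ones for the intercept plus the feature.
X = np.column_stack([np.ones_like(x), x])

# lstsq returns the beta that minimizes ||y - X @ beta||^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta
print(beta)                       # approximately [1. 2.]
print(np.sum(residuals ** 2))     # approximately 0
```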

When I fit an OLS model with two numerical features from the data set, I got a low condition number (Cond. No.) but also a low R² score. To increase the R² score, I did feature engineering to add new features derived from the categorical variables in our data set.

· The “year” column and four season columns were created from the “Release” column.

· Four dummy columns were created from the “MPAA” column.

· A running time (min) column was created from the “Time” column.

· New columns were created for all distributors with more than 49 rows.

· Logs of the “Budget” and “Opening” columns were created.
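
The steps above can be sketched with Pandas as follows. The rows are toy values, the “Time” format ("2 hr 5 min"), the month-to-season mapping, and the new column names are assumptions, and the distributor threshold is lowered to 1 because the toy data has only three rows.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values); only the original column names follow the article.
df = pd.DataFrame({
    "Release": pd.to_datetime(["2015-01-16", "2016-07-01", "2018-10-05"]),
    "MPAA": ["PG-13", "R", "PG"],
    "Time": ["2 hr 5 min", "1 hr 38 min", "2 hr"],
    "Distributor": ["Studio A", "Studio A", "Studio B"],
    "Budget": [50e6, 12e6, 90e6],
    "Opening": [20e6, 3e6, 45e6],
})

df["year"] = df["Release"].dt.year

# Map months to seasons (one possible definition), then one-hot encode.
season_of_month = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "fall", 10: "fall", 11: "fall"}
df["season"] = df["Release"].dt.month.map(season_of_month)
df = pd.get_dummies(df, columns=["season", "MPAA"])

# Running time in minutes; a missing hour or minute part counts as 0.
hours = pd.to_numeric(df["Time"].str.extract(r"(\d+)\s*hr")[0]).fillna(0)
minutes = pd.to_numeric(df["Time"].str.extract(r"(\d+)\s*min")[0]).fillna(0)
df["running_time_min"] = (hours * 60 + minutes).astype(int)

# Dummy columns only for distributors above a row-count threshold
# (the article uses more than 49 rows; 1 is used here for the toy data).
min_rows = 1
counts = df["Distributor"].value_counts()
for name in counts[counts > min_rows].index:
    df["dist_" + name] = (df["Distributor"] == name).astype(int)

df["log_budget"] = np.log(df["Budget"])
df["log_opening"] = np.log(df["Opening"])
print(df.columns.tolist())
```

With the real data, all four seasons and all rating categories are present, so `get_dummies` produces the full set of dummy columns.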

Model Analysis

Now it is time to split our data into training, validation, and test sets. Let’s rerun our model and then compare the Ridge, Lasso, and Polynomial regression results.

The data set was split into train (60%), validation (20%), and test (20%) sets. The tuning parameter (alpha) of the Lasso and Ridge models was chosen from a wide range of values using 10-fold cross-validation.
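
A sketch of such a split and alpha search with scikit-learn is given below. The data is synthetic, and the alpha grid and random seeds are arbitrary choices for illustration, not the values used in the notebook.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the engineered features and the log target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# Hold out 20% for testing, then take 25% of the rest (20% overall) for validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

# A wide, hypothetical search range for alpha, evaluated with 10-fold CV.
alphas = np.logspace(-3, 3, 50)
ridge = RidgeCV(alphas=alphas, cv=10).fit(X_train, y_train)
lasso = LassoCV(alphas=alphas, cv=10).fit(X_train, y_train)

print(len(X_train), len(X_val), len(X_test))
print("ridge alpha:", ridge.alpha_, "lasso alpha:", lasso.alpha_)
print("validation R^2:", ridge.score(X_val, y_val), lasso.score(X_val, y_val))
```

Splitting twice is a simple way to get the 60/20/20 proportions, since `train_test_split` only produces two parts per call.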

When we included the engineered features in the model, the OLS R² score increased to 0.759, but the condition number increased as well. Lasso Regression and Ridge Regression gave the same results, and the result of Linear Regression was very close to them. The best result came from Degree 2 Polynomial Regression, followed by Ridge Polynomial Regression.
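
For readers who have not used polynomial regression before, the sketch below shows how a degree 2 model and its Ridge variant can be built as scikit-learn pipelines. The data is synthetic and the Ridge alpha is an arbitrary placeholder, so the scores have nothing to do with the results reported here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a quadratic term, so a degree 2 model should fit better.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(150, 2))
y = 1.0 + X[:, 0] - 0.5 * X[:, 1] + 2.0 * X[:, 0] ** 2 + rng.normal(scale=0.1, size=150)

linear = LinearRegression().fit(X, y)
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X, y)
# alpha=1.0 is a placeholder; in practice it would be tuned by cross-validation.
ridge_poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
).fit(X, y)

for name, model in [("linear", linear), ("poly2", poly), ("ridge poly2", ridge_poly)]:
    print(name, round(model.score(X, y), 3))   # in-sample R^2
```

Scaling the polynomial features before Ridge keeps the penalty from treating squared and interaction terms differently just because of their scale.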

Now it’s time to run cross-validation (CV) and look at the mean absolute error (MAE). When we cross-validate each model (kfold = 10), we see only a small drop in the scores.
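
A minimal sketch of 10-fold cross-validation with MAE scoring in scikit-learn, using synthetic data and a single Ridge model with an arbitrary alpha as an example:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (hypothetical), with a known linear relationship plus noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + rng.normal(scale=0.2, size=200)

cv = KFold(n_splits=10, shuffle=True, random_state=0)

# scikit-learn maximizes scores, so MAE is exposed as its negative.
neg_mae = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                          scoring="neg_mean_absolute_error")
mae = -neg_mae

print("MAE per fold:", mae.round(3))
print("mean MAE:", round(mae.mean(), 3))
```

Repeating this for each candidate model with the same `cv` object makes the MAE values directly comparable.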

Conclusion

Finally, looking at the mean absolute errors of the fitted models, we can say that Ridge Polynomial Regression gives the most accurate results.

The five fundamental assumptions of linear regression analysis were checked; the checks can be seen in the Jupyter Notebook.

The GitHub repository for web scraping and data processing is here.

Thank you for your time and for reading my article. Please feel free to contact me if you have any questions or would like to share your comments.

Translated from: https://medium.com/analytics-vidhya/predicting-a-movies-revenue-3709fb460604
