Jack Valenti, CEO of the Motion Picture Association of America (MPAA), once remarked: “No one can tell you how a movie is going to do in the marketplace. Not until the film opens in a darkened theater and sparks fly up between the screen and the audience.”
The modern film industry, a business of nearly 10 billion dollars per year, is a cutthroat competition.
Each year in the United States, hundreds of films are released to domestic audiences in the hope that they will become the next “blockbuster.” Predicting how well a movie will perform at the box office is hard because there are so many factors involved in success.
The goal of this project is to develop a computational model for predicting revenue based on public movie data extracted from the Boxofficemojo.com online movie database.
The project has five phases. The first phase is web scraping: several types of features, described later, are extracted from Boxofficemojo.com. The second phase is data cleaning: after scraping the data from the source, movies were removed mainly because some of their features were unavailable. The third phase is exploratory data analysis, where graphics are created to understand the data. The fourth phase is feature engineering, where features for the machine learning model are created from the raw text data. The fifth phase is model analysis, where a machine learning algorithm is applied to the data set.
Web Scraping
Web scraping is the use of a program or algorithm to extract and process large amounts of data from the web. Whether you are a data scientist, an engineer, or anybody who analyzes large data sets, the ability to scrape data from the web is a useful skill to have.
It’s a good idea to do some research on your own and make sure that you’re not violating any Terms of Service before you start a large-scale project. To learn more about the legal aspects of web scraping, check out Legal Perspectives on Scraping Data from the Modern Web.
For this project:
· The BeautifulSoup library is used for data extraction from the web.
· The Pandas library is used for data manipulation and cleaning.
· Matplotlib and Seaborn are used for data visualization.
My data set contains 8,319 movies released between 2010 and 2019. More recent movies were not selected because, with Covid-19, few movies were released in 2020. For each movie I collect Title, Distributor, Release, MPAA, Time, Genre, Domestic, International, Worldwide, Opening, Budget, and Actors.
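To make the scraping step concrete, here is a minimal sketch using BeautifulSoup. The URL pattern and CSS selectors are illustrative assumptions, not the exact ones used in this project; Box Office Mojo's actual markup should be inspected before relying on them.

```python
import requests
from bs4 import BeautifulSoup

def scrape_year(year):
    # NOTE: the URL pattern and the table selector below are assumptions
    # for illustration; check the live page structure before use.
    url = f"https://www.boxofficemojo.com/year/{year}/"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    rows = []
    for tr in soup.select("table tr")[1:]:          # skip the header row
        cells = [td.get_text(strip=True) for td in tr.select("td")]
        if cells:
            rows.append(cells)
    return rows

# one list of raw rows covering every release year in the data set
raw_rows = [row for year in range(2010, 2020) for row in scrape_year(year)]
```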
Data Cleaning
At the beginning my data set had 8,319 movies. I then noticed that many movies did not have all of their data available, so unavailability of features was the main reason for eliminating movies from my data set.
Most movies do not have budget data available, so those null rows have been deleted.
Dtype is converted from “Object” to “float” for numeric columns.
The “Release” data is checked for leap-year dates, and the affected entries are corrected. The dtype of the Release column is converted from “Object” to “datetime”. Data in the “Distributor” column is cleaned of unrelated information.
Duplicate rows have been deleted from the data set.
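A condensed sketch of these cleaning steps in Pandas, assuming the scraped data sits in a DataFrame `df` with the column names listed above; the dollar-sign and comma stripping is an assumption about how the raw text looks.

```python
import pandas as pd

df = df.dropna(subset=["Budget"])                       # drop rows with no budget
for col in ["Domestic", "International", "Worldwide", "Opening", "Budget"]:
    df[col] = (df[col].astype(str)
                      .str.replace(r"[$,]", "", regex=True)
                      .astype(float))                   # Object -> float
df["Release"] = pd.to_datetime(df["Release"], errors="coerce")  # Object -> datetime
df = df.drop_duplicates().reset_index(drop=True)        # remove duplicate rows
```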
After removing those movies, I finally got a data set of 1,293 movies with all information available.
Exploratory Data Analysis (EDA)
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
Let’s look at the relationship between “Domestic Total Gross” and “Budget” for each year.
While there is an almost overwhelming number of methods to use in EDA, one of the most effective starting tools is the pair plot (also called a scatterplot matrix). A pair plot allows us to see both the distribution of single variables and the relationships between two variables. Pair plots are a great method to identify trends for follow-up analysis and, fortunately, are easily implemented in Python.
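For example, a pair plot of a few numeric columns takes one line in Seaborn; the column selection here is an assumption based on the scraped fields, so swap in whichever columns interest you.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Scatterplot matrix of three numeric columns from the cleaned DataFrame.
sns.pairplot(df[["Budget", "Opening", "Domestic"]])
plt.show()
```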
A heatmap is a graphical representation of data where the individual values contained in a matrix are represented as colors. Heatmaps are perfect for exploring the correlation of features in a data set. We can use either Matplotlib or Seaborn to create the heatmap. To get the correlation of the features inside a data set we can call <dataset>.corr(), which is a Pandas DataFrame method. This will give us the correlation matrix.
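A minimal heatmap sketch along those lines, assuming the cleaned DataFrame `df` from earlier:

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr(numeric_only=True)    # correlation matrix of the numeric columns
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature correlations")
plt.show()
```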
Feature Engineering
Feature engineering means building additional features out of existing data, which is often spread across multiple related tables. Feature engineering requires extracting the relevant information from the data and getting it into a single table that can then be used to train a machine learning model.
Machine learning fits mathematical models to the data in order to derive insights. The models take features as input. A feature is generally a numeric representation of an aspect of real-world phenomena or data. Just as a maze has dead ends, the path through the data is filled with noise and missing pieces. Our job as data scientists is to find a clear path to the end goal of insights.
Let’s look at the description of the data set and see the distribution of the target column.
We want the target variable predicted by the model to have a normal distribution. When we examine the distribution of our target variable, we see that it is right-skewed rather than normal. We can correct this by applying a logarithmic transformation to the target variable.
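A one-line sketch of that transformation, assuming the domestic gross is the target; the column name `log_domestic` is introduced here for illustration.

```python
import numpy as np

# log1p maps 0 to 0 and compresses the long right tail toward a normal shape
df["log_domestic"] = np.log1p(df["Domestic"])
```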
Ordinary least-squares (OLS) regression fits a model of the relationship between one or more explanatory variables and a continuous (or at least interval) outcome variable by minimizing the sum of squared errors, where an error is the difference between the actual and the predicted value of the outcome variable.
When I fit an OLS model with two numerical features from the data set, I got a low condition number, but also a low R² score. To increase the R² score, I will do feature engineering to add new features from the categorical variables in the data set.
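As a baseline, an OLS fit with statsmodels might look like the sketch below; the two feature columns (“Budget” and “Opening”) and the log target are assumptions, since the article does not name the exact pair it used.

```python
import statsmodels.api as sm

X = sm.add_constant(df[["Budget", "Opening"]])  # assumed feature pair
y = df["log_domestic"]
ols = sm.OLS(y, X).fit()
print(ols.summary())   # the summary reports R-squared and the condition number
```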
· The “year” column and four season columns were created from the “Release” column.
· Four dummy columns were created from the “MPAA” column.
· A running time (min) column was created from the “Time” column.
· New columns were created for all distributors with more than 49 rows.
· Log transforms of the “Budget” and “Opening” columns were created.
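Taken together, these steps translate into something like the following Pandas sketch; the season boundaries, the runtime string format (“2 hr 10 min”), and the new column names are illustrative assumptions rather than the project's exact code.

```python
import numpy as np
import pandas as pd

month_to_season = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "fall", 10: "fall", 11: "fall"}
df["year"] = df["Release"].dt.year
df = df.join(pd.get_dummies(df["Release"].dt.month.map(month_to_season)))  # 4 season columns
df = df.join(pd.get_dummies(df["MPAA"], prefix="mpaa"))                    # MPAA dummies

# parse "X hr Y min" strings into a single minutes column
hm = df["Time"].str.extract(r"(?:(\d+)\s*hr)?\s*(?:(\d+)\s*min)?").astype(float)
df["runtime_min"] = hm[0].fillna(0) * 60 + hm[1].fillna(0)

# dummy columns only for distributors that appear in more than 49 rows
counts = df["Distributor"].value_counts()
frequent = df["Distributor"].where(df["Distributor"].isin(counts[counts > 49].index))
df = df.join(pd.get_dummies(frequent))

df["log_budget"] = np.log1p(df["Budget"])
df["log_opening"] = np.log1p(df["Opening"])
```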
Model Analysis
Now it is time to split our data into training, validation, and test sets. Let’s rerun our model and finally compare the Ridge, Lasso, and polynomial regression results.
The data set was split into training (60%), validation (20%), and test (20%) sets. The tuning parameter (alpha) of the Lasso and Ridge models was chosen from a wide range of values using 10-fold cross-validation.
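A sketch of the 60/20/20 split and the cross-validated alpha search with scikit-learn; the feature list, the alpha grid, and the random seed are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV, RidgeCV

features = ["log_budget", "log_opening", "runtime_min", "year"]  # plus dummy columns
X, y = df[features], df["log_domestic"]

# 60% train, then split the remaining 40% evenly into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

alphas = np.logspace(-4, 4, 50)                     # wide range of candidate alphas
lasso = LassoCV(alphas=alphas, cv=10).fit(X_train, y_train)
ridge = RidgeCV(alphas=alphas, cv=10).fit(X_train, y_train)
print("lasso alpha:", lasso.alpha_, "ridge alpha:", ridge.alpha_)
```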
When we include the feature-engineered variables in the model, the OLS R² score increases to 0.759, but at the same time the condition number increases. Lasso regression and Ridge regression brought us the same results, and the result of plain linear regression was also very close to them. We got the best result from a degree-2 polynomial regression, and the second best from Ridge polynomial regression.
Now it’s time to do cross-validation (CV) and look at the mean absolute error (MAE) scores. When we cross-validate each model (k-fold = 10), we see only a small drop in the scores.
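For instance, the 10-fold CV MAE for the Ridge model above can be computed like this:

```python
from sklearn.model_selection import cross_val_score

# negated because scikit-learn maximizes scores internally
mae = -cross_val_score(ridge, X_train, y_train, cv=10,
                       scoring="neg_mean_absolute_error").mean()
print(f"Ridge 10-fold CV MAE: {mae:.3f}")
```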
Conclusion
Finally, when we look at the mean absolute errors of the established models, we can say that Ridge polynomial regression will bring us the most accurate results.
The five fundamental assumptions of linear regression analysis were checked; these checks can be seen in the Jupyter Notebook.
The GitHub repository for the web scraping and data processing is here.
Thank you for your time and for reading my article. Please feel free to contact me if you have any questions or would like to share your comments.
Translated from: https://medium.com/analytics-vidhya/predicting-a-movies-revenue-3709fb460604