Analysis and Prediction of Billboard's Next Hit Song, Part 1

Using Spotify and Billboard’s data to understand what makes a song a hit.

Thousands of songs are released every year around the world. Some are very successful in the music industry; others less so. It is a fact that being successful in this industry remains a difficult task. Investing in the production of a song requires a variety of activities and can consume a lot of resources. There are very few music labels that fund studies to find out to what extent the song they are about to release could be a musical hit.

My curiosity about music prompted me to devote some time to studying the subject. For many people, the secret behind Drake's hits in recent years is his style as an artist; for others, he owes them mainly to his fame. Generally speaking, opinions differ when it comes to explaining why one song is a hit and another is not, or what an artist should prioritize when producing a song that is meant to be a hit.

This two-part article is a snapshot of the project I have been working on over the past few weeks. I used data science techniques to understand what characterizes a popular song and, more precisely, how the popularity of a song could be predicted based solely on its audio characteristics and the profile of its artist. I built a machine learning model that classifies a song as a hit or a non-hit.

Social factors such as the context in which a song was broadcast, the demographics of its listeners, and the effectiveness of its marketing campaign may well play an important role in its virality. Nevertheless, I hypothesized that characteristics inherent to a song, such as the profile of the artist who performs it, its duration, and its audio characteristics, are correlated with its virality and can help reveal it.

My data

No single source offered a data set containing all the variables I needed. To overcome this problem, I resorted to data enrichment, combining the following three data sources: Billboard, Spotify and Genius.

First, using Beautiful Soup, I collected the list of Billboard Year-End Hot 100 songs from 2010 to 2019, at a rate of 100 songs per year. Then I used the Spotipy package to retrieve two kinds of data: on the one hand, the songs' audio characteristics, such as danceability, instrumentalness, and liveness; on the other hand, data related to the artist's profile, such as number of followers and popularity. I did this both for the hit songs collected in the first step and for non-hit songs from the same period.

Finally, Genius is used mainly to retrieve the lyrics of all the songs that have been collected.

A song in our data set is considered a hit if it made it to the Billboard Year-End Hot 100 chart at least once during any of the years in the reporting period. In other words, our model was tasked with predicting whether a song would make it to Billboard’s 100 most popular song list or not.
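
The labeling rule above can be sketched in a few lines. This is only a minimal illustration, not the project's actual code: the function names `normalize` and `label_hits`, the (title, artist) matching key, and the sample rows are all assumptions made for the example.

```python
def normalize(title, artist):
    """Build a case- and whitespace-insensitive key for matching songs (hypothetical rule)."""
    return (" ".join(title.lower().split()), " ".join(artist.lower().split()))


def label_hits(tracks, year_end_charts):
    """Return one 0/1 label per track: 1 if the track appears on any year-end chart.

    tracks: list of (title, artist) tuples.
    year_end_charts: dict mapping a year to its list of (title, artist) tuples.
    """
    charted = {
        normalize(title, artist)
        for entries in year_end_charts.values()
        for title, artist in entries
    }
    return [1 if normalize(title, artist) in charted else 0 for title, artist in tracks]


# Made-up sample data, for illustration only.
charts = {
    2018: [("Song A", "Artist X")],
    2019: [("Song A", "Artist X"), ("Song B", "Artist Y")],
}
tracks = [("song a", "artist x"), ("Song C", "Artist Z"), ("Song B ", "Artist Y")]
labels = label_hits(tracks, charts)
print(labels)
```

A song that charted in several years still gets a single label of 1, which matches the "at least once" wording of the definition.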

Tools used:

The spotipy package to access data from the Spotify music platform
seaborn and matplotlib for data visualization
pandas and numpy for data analysis
LightGBM and the scikit-learn library for building and evaluating the model

Features

Spotify is one of the biggest streaming platforms in the world. Like Twitter or Facebook, it provides an API (Application Programming Interface) so that developers can interact with its huge music database. Via endpoints of this API, I was able to collect data for more than 22,000 songs; each song is characterized by more than twenty variables.

The variables returned by the API are as rich in information as they are varied. However, I selected only those deemed relevant for the task. They were then transformed using feature engineering techniques in order to prepare the data set as well as possible for training the model. You will find the description of each of the variables used here. The complete code of the project is accessible through this link.

Some of these variables are used only for analysis; others are involved in all stages of the pipeline. Now let's take a look at the first five observations of our data set.

Figure: First five observations of our data set, which contains 19,182 observations of 24 variables.

Data visualization

Before building the model, I wanted to explore the data a bit through visualizations, although the primary goal of the work was not to conduct a full exploratory analysis.

According to the graph above, January is the month in which the most hits were released, and July the month with the fewest. We also notice that the majority of the hits of the period came out on the 4th day of the week, which is Thursday. Finally, the first day of the month is used far more often than any other day of the month to release a song.
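
The kind of breakdown behind these observations can be sketched as follows. This is a hedged illustration with made-up dates, not the article's code: the function name `release_breakdown` is hypothetical, and the ISO weekday numbering (Monday = 1, so Thursday = 4) is an assumption that happens to agree with the "4th day of the week" wording.

```python
from collections import Counter
from datetime import date


def release_breakdown(release_dates):
    """Count releases per month, per ISO weekday (Monday=1 ... Sunday=7), and per day of month."""
    by_month = Counter(d.month for d in release_dates)
    by_weekday = Counter(d.isoweekday() for d in release_dates)
    by_day = Counter(d.day for d in release_dates)
    return by_month, by_weekday, by_day


# Made-up release dates, for illustration only.
sample = [date(2015, 1, 1), date(2016, 1, 7), date(2017, 7, 14), date(2018, 1, 1)]
by_month, by_weekday, by_day = release_breakdown(sample)

# Most frequent month, weekday, and day of month in the sample.
print(by_month.most_common(1), by_weekday.most_common(1), by_day.most_common(1))
```

Note that other tools number weekdays differently (pandas, for instance, starts at Monday = 0), so the convention should be checked before reading a weekday chart.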

Based on that observation, we could deduce that publishing a song on January 1st is probably a step towards optimizing the chances of its virality. We must still be careful: correlation is not causation. We would have to push our analysis further in order to be more precise in this conclusion.

I continued the visualization to get an idea of which artists recorded the most hits over the period. Unsurprisingly, the graph below shows that artists like Drake, Rihanna, and Taylor Swift recorded more hits than anyone else during the period. It is therefore plausible that a song is more likely to go viral if it features the vocals of one of these artists.

To understand what makes one song a hit and another not, it is essential to study the audio-related variables of both categories. I therefore looked at the distribution of these variables within each of the two categories.

The predominant variables within the two categories are the same: energy, danceability and valence. The only difference is that the average values of these variables are higher for the hit category. This suggests that hit songs are faster and louder, better suited to dancing, and more inclined to inspire joy, gaiety, and euphoria.
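
A per-class comparison of this kind can be sketched with pandas. The values below are made up for illustration; only the column names (`hit`, `energy`, `danceability`, `valence`) follow the article's data set, and `df_demo` is a hypothetical stand-in for the real data frame.

```python
import pandas as pd

# Made-up values, for illustration only.
df_demo = pd.DataFrame({
    "hit":          [1, 1, 0, 0],
    "energy":       [0.80, 0.70, 0.60, 0.50],
    "danceability": [0.75, 0.65, 0.55, 0.60],
    "valence":      [0.60, 0.50, 0.40, 0.45],
})

# Average value of each audio feature within each class (index 0 = non-hit, 1 = hit).
class_means = df_demo.groupby("hit")[["energy", "danceability", "valence"]].mean()
print(class_means)

# Positive differences mean the feature is higher on average for hits.
print(class_means.loc[1] - class_means.loc[0])
```

Comparing means is only a first look; plotting the full distributions per class, as the article does, shows whether the two classes actually overlap.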

The machine learning approach

So far we’ve been able to uncover some interesting insights about the data. In order to shorten this article, let’s go directly to the part about the machine learning algorithm that was used.

I wanted to build a model that predicts which class, hit or non-hit, a song is most likely to belong to, based on a set of explanatory variables, as explained at the beginning of this article.

In its raw state, the collected data was not ready for training a machine learning model, so I preprocessed it: first with the SMOTE technique, because the hit class was under-represented compared to the non-hit class, then with other feature engineering techniques in order to standardize the data.
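
To give an intuition for what SMOTE does, here is a deliberately simplified sketch of its core idea: a synthetic minority sample is placed at a random position on the segment between a minority point and one of its nearest minority neighbours. This is not the imbalanced-learn implementation used in the project; the function name `smote_like`, its parameters, and the sample points are assumptions made for the illustration.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding the point itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the segment between the point and its neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Made-up 2-D feature vectors for the under-represented (hit) class.
hits = [(0.80, 0.70), (0.75, 0.65), (0.90, 0.60)]
new_points = smote_like(hits, n_new=4)
print(new_points)
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region already occupied by the hit class instead of merely duplicating existing rows.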

I also wanted a model that would perform as well as possible. To that end, I trained several algorithms and compared their results on the selected evaluation criteria. The LightGBM classification algorithm turned out to be the one that, at training time, best detected the patterns between the explanatory variables and the variable to be predicted (hit).

%matplotlib inline
import lightgbm as lgb
from sklearn.metrics import auc, accuracy_score, roc_auc_score, roc_curve, confusion_matrix
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

sm = SMOTE(random_state=42)

# Explanatory variables and target (df is the prepared data set described above)
X = df[['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness',
        'loudness', 'speechiness', 'valence', 'tempo', 'duration_ms', 'featuring',
        'pop_artist', 'tot_followers', 'avail_mark', 'pop_track']]
y = df['hit']

# Splitting the data set: 70% for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=20)

# Oversample the training data only, so the test subset keeps its original class balance
X_res, y_res = sm.fit_resample(X_train, y_train)

# Parameters optimized via GridSearch
clf = lgb.LGBMClassifier(
    boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
    importance_type='split', learning_rate=0.2, max_depth=-1,
    min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
    n_estimators=90, n_jobs=-1, num_leaves=31, objective=None,
    random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
    subsample=1.0, subsample_for_bin=200000, subsample_freq=0,
)
clf.fit(X_res, y_res)

Confusion matrix

For a first-level performance analysis of the model, we will use the confusion matrix. Visualizing it allows us to understand the errors made by our classifier on the test subset.

This matrix measures the quality of a classification system. In a binary classification, the main diagonal holds the observations correctly classified by the model, and the secondary diagonal those classified incorrectly. Here, the most frequent mistake made by the model is classifying a song as a non-hit when in reality it was a hit (129 cases), which is a type II error.

A type I error occurs when the model classifies a song as a hit when it is a non-hit (false hit); a type II error is the reverse, that is, the model classifies a song as a non-hit although it is a hit (false non-hit). From the point of view of a music producer, a type I error is less acceptable than a type II error: we would not want to incur all the expenses of producing and promoting a song that the model predicted would be a hit, only for it to fail in the end. The number of type I errors should therefore be as small as possible.
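
The four cells of a binary confusion matrix, and the two error types just described, can be counted by hand as in the sketch below. The labels are made up for illustration and the function name `confusion_counts` is hypothetical; in the project itself, scikit-learn's `confusion_matrix` plays this role.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = hit and 0 = non-hit."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # true hit
        elif actual == 0 and predicted == 0:
            tn += 1  # true non-hit
        elif actual == 0 and predicted == 1:
            fp += 1  # type I error: false hit
        else:
            fn += 1  # type II error: false non-hit
    return tn, fp, fn, tp


# Made-up labels, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 0]
tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(f"type I errors (false hits): {fp}")
print(f"type II errors (false non-hits): {fn}")
```

The return order (tn, fp, fn, tp) mirrors what `confusion_matrix(y_true, y_pred).ravel()` yields in scikit-learn for labels 0 and 1, which makes the two easy to compare.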

The classification report

The classification report presents statistics calculated from the data in the confusion matrix. Each metric describes a different aspect of the classification. We will use this report for a second level performance analysis of the model.

The accuracy, which measures the overall percentage of correct classifications made by the model, is 96%. Since the test subset is unbalanced, this percentage is inflated by the over-represented class, in this case the non-hit class. This metric is therefore not the best one we could use.
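
To see why accuracy is a weak yardstick here, it helps to compare it with a trivial baseline that labels every song as a non-hit. The short check below uses the class sizes of the test subset reported in this article (5,484 non-hits and 271 hits); the variable names are illustrative.

```python
# Test-subset class sizes reported in the article.
n_non_hit = 5484
n_hit = 271

# Accuracy of a trivial classifier that labels every song as a non-hit.
baseline_accuracy = n_non_hit / (n_non_hit + n_hit)
print(f"always-non-hit accuracy: {baseline_accuracy:.1%}")
```

The baseline lands at roughly 95%, so the model's 96% accuracy is only about one point above a classifier that never predicts a hit.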

The recall measures, for each class, the percentage of its occurrences that the model classified correctly. A classification is correct when the predicted class matches the actual class. On the one hand, of the 5,484 non-hit songs used to test the model, 98% were correctly classified. On the other hand, the algorithm correctly classified only 52% of the 271 hit songs submitted to it. In other words, it is more difficult for the algorithm to classify a hit song as a hit (true hit) than to classify a non-hit song as a non-hit (true non-hit).

The precision levels for the non-hit and hit classes are 0.98 and 0.55, respectively. This means that 98% of all the songs the model classified as non-hits are indeed non-hits, whereas only 55% of the songs it predicted to be hits really are.
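
Recall and precision both come straight from the confusion matrix counts. The sketch below recomputes the hit-class recall from two figures given in this article (271 hits in the test subset, 129 of them misclassified as non-hits). The number of false hits is not stated in the text, so the precision call uses a clearly hypothetical count; the function and variable names are illustrative as well.

```python
def recall(tp, fn):
    """Share of the actual members of a class that the model found."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of the predictions for a class that were correct."""
    return tp / (tp + fp)


# Figures reported in the article: 271 hits in the test subset, 129 of them missed.
n_hit = 271
false_non_hits = 129
true_hits = n_hit - false_non_hits

hit_recall = recall(true_hits, false_non_hits)
print(f"hit recall: {hit_recall:.2f}")

# Hypothetical number of false hits, used only to show how precision is computed.
example_false_hits = 100
example_precision = precision(true_hits, example_false_hits)
print(f"hit precision with {example_false_hits} false hits: {example_precision:.2f}")
```

The recomputed recall rounds to 0.52, in line with the 52% quoted above.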

Our model performs better on non-hit songs. This is most likely because this class had far more data from the start, which may favor the detection of patterns between the non-hit modality of the dependent variable and the explanatory variables.

Once satisfied with the performance of the model, I set out to identify the explanatory variables that most influence the class assigned to a given song. That is why I used the feature importance plot below.

We can see that the popularity of a song on Spotify is the most important variable in predicting which class it is most likely to belong to. Next, the artist's popularity on Spotify, the number of followers he has, and the number of markets in which the song is available constitute a second tier of determining variables. Finally come the variables related to the audio of the songs, with relatively similar levels of influence.

This analysis draws attention to something major. Essentially, a song is a hit if it is popular on Spotify, if it is performed by an artist who is also popular on Spotify and has a significant number of followers, and finally if it is available in as many countries as possible. This conclusion seems logical, and … Eurêka🙂, it is also verified empirically by our model.

To better appreciate the relevance of this conclusion, it should be borne in mind that Billboard’s year-end top 100 music list is based primarily on a commercial aspect. Indeed, this ranking is a faithful reflection of physical and digital sales, radio listening and music streaming in the United States; all income-generating activities, directly or indirectly.

The more popular a song is on Spotify, the more it is listened to online. More streams translate into more revenue, because each time a subscriber listens to a song, the streaming platform pays a fee to the artist or the music label. The artist's popularity on Spotify and the number of followers he has there are channels that amplify the number of streams and sales, which in turn helps increase the income generated by his music.

The more revenue a song generates, the more likely it is to be on Billboard's year-end top 100 list; it is therefore also more likely to be classified as a hit by our model, because the variables that mainly determine a song's income are the most influential in the classification process according to our importance graph (logical, isn't it 😉?).

Usually, songs have another important characteristic that we haven’t yet taken into account: the lyrics. Can we further increase the performance of the model by using lyrics? This is what we will explore in the second part of the article.

As always I welcome constructive criticism and feedback. I can be reached on Twitter @jbobym.
