Customer Churn
Big Data Analytics within a real-life example of a digital music service
Customer churn is a key predictor of the long-term success or failure of a business: it is the rate at which customers leave your business and take their subscription dollars elsewhere. For every business, why users churn, and how to change, keep, and attract users, are questions it will forever ask itself.
Let's take a digital music service as our example, thinking of the most familiar platforms such as Spotify or Pandora. Every time you, the user, interact with the service, every small step, such as playing music, logging out, or liking a song, generates data. Here comes the Big Data! All of this data contains the key insights for predicting user churn and keeping the business thriving. Because of the size of the data, this is a challenging and common problem that we regularly encounter in any customer-facing business.
Here we are going to analyze a real-life large dataset for a music streaming service with Spark. We attempt to build machine learning models to predict the churn probability of each user and to understand the features that contribute to churning behavior.
Let's start with a mini-subset (~128 MB) of the large data (12 GB) for understanding and exploring the dataset. We load the dataset (JSON format) with the following commands:
# Create a Spark session
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local")
         .appName("sparkify")
         .getOrCreate())

# Read the dataset
events_df = spark.read.json('mini_sparkify_event_data.json')
We can also take a quick look at all the features and their datatypes by printing the schema:
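events_df.printSchema()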
root
|-- artist: string (nullable = true)
|-- auth: string (nullable = true)
|-- firstName: string (nullable = true)
|-- gender: string (nullable = true)
|-- itemInSession: long (nullable = true)
|-- lastName: string (nullable = true)
|-- length: double (nullable = true)
|-- level: string (nullable = true)
|-- location: string (nullable = true)
|-- method: string (nullable = true)
|-- page: string (nullable = true)
|-- registration: long (nullable = true)
|-- sessionId: long (nullable = true)
|-- song: string (nullable = true)
|-- status: long (nullable = true)
|-- ts: long (nullable = true)
|-- userAgent: string (nullable = true)
|-- userId: string (nullable = true)
The feature page seems to be the most important one, as it records all the user interactions. The page column records values such as Logout, Save Settings, Roll Advert, Settings, Submit Upgrade, Cancellation Confirmation, Add Friend, etc. Moreover, the Cancellation Confirmation events of page define the churn that we are interested in (0 for un-churned, 1 for churned).
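As a minimal sketch of this labeling (churned_ids and the churn column are names introduced here for illustration, not from the original code), we can flag every event of a user who ever reached the Cancellation Confirmation page:

from pyspark.sql import functions as F

# Users who ever hit the Cancellation Confirmation page count as churned
churned_ids = (events_df.filter(events_df.page == 'Cancellation Confirmation')
               .select('userId').distinct()
               .rdd.flatMap(lambda x: x).collect())

# 1 = churn, 0 = un-churn
events_df = events_df.withColumn(
    'churn', F.when(F.col('userId').isin(churned_ids), 1).otherwise(0))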
Exploratory Data Analysis (EDA)
We want to perform some exploratory data analysis to observe the behavior of users who stayed versus users who churned.
The bar plot on the left shows the average length of songs played by churned and un-churned users. Un-churned users have a longer mean listening length than the churned group.
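The aggregation behind this plot can be sketched as follows, assuming the churn column defined above:

avg_length = (events_df.filter(events_df.page == 'NextSong')
              .groupby('churn')
              .agg(F.avg('length').alias('avg_song_length')))
avg_length.show()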
The second bar chart shows the relationship between the churn rate and the User-Agent of the users. From the data, we can conclude that X11 and iPhone users tend to churn more, which gives us some insights for further investigation of those systems.
By checking the correlation matrix of the page events and using our domain knowledge, we pick several features (Thumbs Up, Thumbs Down, Add Friend, Add to Playlist, Error, Help) to observe the difference between churned and un-churned customers. The box plots below show some detailed information.
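The per-user counts behind these box plots can be computed with the same pivot pattern used later in feature engineering (again a sketch that assumes the churn column above):

pages = ['Thumbs Up', 'Thumbs Down', 'Add Friend', 'Add to Playlist',
         'Error', 'Help']
page_counts = (events_df.filter(F.col('page').isin(pages))
               .groupby('userId', 'churn')
               .pivot('page', pages)
               .count()
               .fillna(0))
page_counts.show(5)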
What can you gain from the plots? From my perspective, churned users are:
- less likely to click Thumbs Up
- less likely to add friends
- less likely to add songs to the playlist
However, this doesn't necessarily mean that they encounter more errors or need more help from the service.
Once we have familiarized ourselves with the data, let's build out the features that look promising to train the model on.
Feature Engineering
Here are some features that I found interesting:
1. Page features, with unrelated pages removed
from pyspark.sql import functions as F
from pyspark.sql.functions import sum, max, min, col

# df is the cleaned events dataframe; pivot on page to get per-user event counts
df_features = (df.groupby('userId')
               .pivot('page')
               .count()
               .fillna(0))

# Keep 'Cancellation Confirmation' (it becomes the label in the modeling step)
# and drop the pages unrelated to churn behavior
df_features = df_features.drop('About', 'Cancel', 'Login', 'Logout',
                               'Roll Advert', 'Submit Registration',
                               'Register', 'Save Settings')
2. Total song length the user listened to
total_length = (df.filter(df.page == 'NextSong')
                .groupby(df.userId)
                .agg(sum(df.length).alias('total_songlength')))

df_features = df_features.join(total_length, on=['userId'], how='inner')
3. Gender: dummy variables created
gender_df = df.select('userId', 'gender').dropDuplicates()

categories = (gender_df.select('gender')
              .distinct()
              .rdd.flatMap(lambda x: x)
              .collect())

exprs = [F.when(F.col('gender') == category, 1)
          .otherwise(0)
          .alias(category) for category in categories]

gender_df = gender_df.select('userId', *exprs)
df_features = df_features.join(gender_df, on=['userId'], how='inner')
4. Number of days the user was active
days = df.groupby('userId').agg(max(df.ts), min(df.ts))

# ts is a Unix timestamp in milliseconds, so convert the span to days
days = days.withColumn('days_active',
                       (col('max(ts)') - col('min(ts)')) / (1000 * 60 * 60 * 24))

df_features = (df_features.join(days, on=['userId'], how='inner')
               .drop('max(ts)', 'min(ts)'))
5. Number of days since the account was registered
days_reg = df.groupby('userId').agg(max(df.registration), max(df.ts))

# registration and ts are both in milliseconds
days_reg = days_reg.withColumn('days_register',
                               (col('max(ts)') - col('max(registration)')) / (1000 * 60 * 60 * 24))

df_features = (df_features.join(days_reg, on=['userId'], how='inner')
               .drop('max(ts)', 'max(registration)'))
6. The final level of the user (paid/free)
final_level = (df.groupby('userId', 'level')
               .agg(max(df.ts).alias('finalTime'))
               .sort('userId'))

# Keep only the most recent (userId, level) row, i.e. the level the user ended on
latest = final_level.groupby('userId').agg(max('finalTime').alias('finalTime'))
final_level = final_level.join(latest, on=['userId', 'finalTime'], how='inner')

categories = (final_level.select('level')
              .distinct()
              .rdd.flatMap(lambda x: x)
              .collect())

exprs = [F.when(F.col('level') == category, 1)
          .otherwise(0)
          .alias(category) for category in categories]

final_level = final_level.select('userId', *exprs)
df_features = df_features.join(final_level, on=['userId'], how='inner')
Modeling
After we engineered the features, we will build three models: logistic regression, random forest, and gradient-boosted trees. Let's start by generating the feature table, splitting the data, and scaling it.
from pyspark.ml.feature import VectorAssembler, StandardScaler

# Rename Cancellation Confirmation as label in df_features_label
df_features_label = df_features.withColumnRenamed('Cancellation Confirmation', 'label')

# Generate the features table
df_features = df_features.drop('Cancellation Confirmation', 'userId')

# Split the data
train, test = df_features_label.randomSplit([0.8, 0.2])

# Instantiate a VectorAssembler for the pipeline
vector_assembler = VectorAssembler(inputCols=df_features.columns, outputCol='Features')

# Scale each column for the pipeline
scale_df = StandardScaler(inputCol='Features', outputCol='ScaledFeatures')
Here we give an example of building the logistic regression model. All the other models are built with similar methods.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol='ScaledFeatures', labelCol='label',
                        maxIter=10, regParam=0.01)

# Create the pipeline
pipeline_lr = Pipeline(stages=[vector_assembler, scale_df, lr])

# Fit the model
model_lr = pipeline_lr.fit(train)
In order to evaluate the accuracy of the model, we write a function to report results on the validation set. Since the churned users are a fairly small subset, we use the F1 score as the metric to optimize.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

def performance(model, data, evaluation_metric):
    # Generate predictions
    evaluator = MulticlassClassificationEvaluator(metricName=evaluation_metric)
    predictions = model.transform(data)
    # Get scores
    score = evaluator.evaluate(predictions)
    confusion_matrix = (predictions.groupby('label')
                        .pivot('prediction')
                        .count()
                        .toPandas())
    return score, confusion_matrix
We check the performance of the model as follows:
# Performance
score_lr, confusion_matrix_lr = performance(model_lr, test, 'f1')
print('The F1 score for the Logistic Regression model: {}'.format(score_lr))
print(confusion_matrix_lr)
Here is the resulting output for the Logistic Regression model:
From the analysis, the gradient-boosted tree model did the best job, with an F1 score of up to 0.88. Note that, since usually only a small group of users churn, we care more about identifying the churned users correctly than about pursuing high overall performance. In this case, we did not perform grid search to tune the parameters.
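If you did want to tune, here is a minimal sketch of how grid search could be wired around the logistic regression pipeline above with Spark's CrossValidator; the parameter values are illustrative, not tuned:

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Candidate hyperparameters for the LR stage of the pipeline
param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.01, 0.1])
              .addGrid(lr.maxIter, [10, 20])
              .build())

cv = CrossValidator(estimator=pipeline_lr,
                    estimatorParamMaps=param_grid,
                    evaluator=MulticlassClassificationEvaluator(metricName='f1'),
                    numFolds=3)
cv_model = cv.fit(train)  # cv_model.bestModel holds the best pipeline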
Feature Importance
Using our best GBT model and its feature-importance function, we visualize the relative importance rank of each feature obtained in the feature engineering process. As the figure below shows, we find that the number of days active, the number of days since registration, and the number of times users add songs to the playlist are the most important features in the GBT model we built.
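A minimal sketch of extracting those importances, assuming model_gbt is a fitted pipeline built like pipeline_lr above but with a GBTClassifier as its last stage (model_gbt is an illustrative name, not from the original code):

import pandas as pd

gbt = model_gbt.stages[-1]  # the fitted GBTClassificationModel
importances = (pd.Series(gbt.featureImportances.toArray(),
                         index=df_features.columns)
               .sort_values(ascending=False))
print(importances)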
What actions can we take to decrease the churn rate, then?
Finishing the analysis of the data is never the end; how to apply it to the business is always the most important part, and it is what makes our model valuable. With the feature importances we gained, we can come up with some business strategies to counter customer churn in real-life business. Here are some brief ideas related to our analysis:
The number of active days is one of the important factors for churning, so rewards and discounts can be considered to encourage user activity. This can also apply to the friend system: for example, if a user recommends and adds 5 friends in the community, they can unlock a unique badge.
Wow! We are finally here! Do you still remember what we did with Big Data methods to find out the churn behaviors of customers?
Let's do a recap:
- Data loading
- Exploratory data analysis
- Feature engineering
- Model building and evaluation
- Identifying important features
- Business strategy (actions)
If you are interested in more details of these procedures, you can check out my entire code for this Sparkify analysis in my GitHub repository.
Hope you enjoyed reading this long blog and learned some strategies to boost your business the data science way!
Originally published at https://medium.com/@jessie.sssy/understanding-customer-churning-abd6525d61c5