Understanding Customer Churn

Big data analytics in a real-life example: a digital music service

Customer churn is a key predictor of the long-term success or failure of a business. The churn rate is the rate at which customers leave your business and take their subscription dollars elsewhere. Every business keeps asking itself the same questions: why do users churn, and how can we retain and attract them?

Let's look at a digital music service as an example. Think of the most familiar platforms, such as Spotify or Pandora. Every time you interact with the service as a user, every small step (playing a song, logging out, liking a song, and so on) generates data. Here comes the Big Data! These data contain the key insights for predicting user churn and keeping the business thriving. Predicting churn is a common problem in any customer-facing business, and the size of the data makes it a challenging one.

Here we use Spark to analyze a large real-life dataset from a music streaming service. We build machine learning models to predict how likely a user is to churn and to understand which features contribute to churn.

Let's start with a mini-subset (~128 MB) of the full dataset (12 GB) to understand and explore the data. We load the dataset (JSON format) with the following commands:

from pyspark.sql import SparkSession

# Create a Spark session
spark = (SparkSession.builder
         .master("local")
         .appName("sparkify")
         .getOrCreate())

# Read the dataset
events_df = spark.read.json("mini_sparkify_event_data.json")

We can also take a quick look at all the features and their data types (the output of events_df.printSchema()):

root
|-- artist: string (nullable = true)
|-- auth: string (nullable = true)
|-- firstName: string (nullable = true)
|-- gender: string (nullable = true)
|-- itemInSession: long (nullable = true)
|-- lastName: string (nullable = true)
|-- length: double (nullable = true)
|-- level: string (nullable = true)
|-- location: string (nullable = true)
|-- method: string (nullable = true)
|-- page: string (nullable = true)
|-- registration: long (nullable = true)
|-- sessionId: long (nullable = true)
|-- song: string (nullable = true)
|-- status: long (nullable = true)
|-- ts: long (nullable = true)
|-- userAgent: string (nullable = true)
|-- userId: string (nullable = true)

The page feature seems to be the most important one, as it records every user interaction. The page column takes values such as Logout, Save Settings, Roll Advert, Settings, Submit Upgrade, Cancellation Confirmation, Add Friend, etc. The Cancellation Confirmation events define the churn we are interested in: a user with such an event is labeled 1 (churned), and every other user is labeled 0 (not churned).
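To make the churn definition concrete, here is a minimal plain-Python sketch on a handful of made-up events. The records and the helper name label_churn are illustrative assumptions, not part of the original analysis, which does this step in Spark.

```python
# Made-up toy event log: (userId, page) pairs, for illustration only
events = [
    ("u1", "NextSong"),
    ("u1", "Thumbs Up"),
    ("u2", "NextSong"),
    ("u2", "Cancellation Confirmation"),
    ("u3", "Logout"),
]


def label_churn(events):
    """Map each userId to 1 if it has a Cancellation Confirmation event, else 0."""
    labels = {}
    for user_id, page in events:
        churned = 1 if page == "Cancellation Confirmation" else 0
        # Once a user is marked as churned, later events do not reset the label
        labels[user_id] = max(labels.get(user_id, 0), churned)
    return labels


churn_labels = label_churn(events)
print(churn_labels)
```

The label is attached to the user, not to the individual event, so every row of a churned user belongs to the churned group in the comparisons that follow.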

Exploratory Data Analysis (EDA)

We perform some exploratory data analysis to compare the behavior of users who stayed with that of users who churned.

[Image: two bar plots. Left: average length of songs played by churned and non-churned users. Right: churn rate by User-Agent.]

The bar plot on the left shows the average length of the songs played by churned and non-churned users. Non-churned users have a longer mean song length than churned users.

The second bar chart shows the relationship between the churn rate and the users' User-Agent. In this data, X11 and iPhone users tend to churn more, which gives us a starting point for further investigation of those platforms.
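The per-group churn rate behind a chart like this one can be sketched in plain Python. The user records below are invented for illustration and do not reflect the real dataset; the function name churn_rate_by_group is likewise an assumption.

```python
from collections import defaultdict

# Made-up toy users: (platform parsed from the User-Agent, churn label)
users = [
    ("iPhone", 1),
    ("iPhone", 0),
    ("X11", 1),
    ("Windows", 0),
    ("Windows", 0),
    ("Windows", 1),
    ("Macintosh", 0),
]


def churn_rate_by_group(users):
    """Return {group: churned users / all users} for each group."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for group, label in users:
        totals[group] += 1
        churned[group] += label
    return {group: churned[group] / totals[group] for group in totals}


rates = churn_rate_by_group(users)
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.2f}")
```

With only a few users per group the rates are noisy, so it is worth checking the group sizes before reading too much into a difference between platforms.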

Based on the correlation matrix of the page features and our domain knowledge, we pick several features (Thumbs Up, Thumbs Down, Add Friend, Add to Playlist, Error, Help) to compare churned and non-churned customers. The box plots below show the details.

[Image: box plots of the selected page features for churned and non-churned users]

What can you take away from the plots? From my perspective, churned users are:

  • less likely to click thumbs up
  • less likely to add friends
  • less likely to add songs to a playlist

However, this doesn't necessarily mean that they encounter more errors or need more help from the service.

Now that we are familiar with the data, let's build the features we find promising for training the model.

Feature Engineering

Here are some features that I found interesting:

1. Page features, with unrelated pages removed

df = events_df  # assumption: df is the events DataFrame loaded above

# Count each user's visits to each page (one column per page)
df_features = (df.groupby('userId')
               .pivot('page')
               .count()
               .fillna(0))

# Drop unrelated pages; 'Cancellation Confirmation' stays as the churn indicator
df_features = df_features.drop('About', 'Cancel', 'Login', 'Logout', 'Roll Advert',
                               'Submit Registration', 'Register', 'Save Settings')

2. Total length of the songs the user listened to

from pyspark.sql import functions as F

# Sum the length of every song each user played
total_length = (df.filter(df.page == 'NextSong')
                .groupby('userId')
                .agg(F.sum('length').alias('total_songlength')))

df_features = df_features.join(total_length, on=['userId'], how='inner')

3. Gender, encoded as dummy variables

# One row per user with their gender
gender_df = df.select('userId', 'gender').dropDuplicates()

# Build one 0/1 column per distinct gender value
categories = (gender_df.select('gender')
              .distinct()
              .rdd.flatMap(lambda x: x)
              .collect())
exprs = [F.when(F.col('gender') == category, 1).otherwise(0).alias(category)
         for category in categories]
gender_df = gender_df.select('userId', *exprs)

df_features = df_features.join(gender_df, on=['userId'], how='inner')

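4. Number of days the user was active

# Assumes ts is in epoch milliseconds (as in the Sparkify event data),
# so the difference is divided by the number of milliseconds in a day
days = df.groupby('userId').agg(F.max('ts').alias('last_ts'),
                                F.min('ts').alias('first_ts'))
days = days.withColumn('days_active',
                       (F.col('last_ts') - F.col('first_ts')) / (1000 * 60 * 60 * 24))
df_features = (df_features.join(days, on=['userId'], how='inner')
               .drop('last_ts', 'first_ts'))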

5. Number of days since the user registered the account

# Assumes registration and ts are both in epoch milliseconds
days_reg = df.groupby('userId').agg(F.max('registration').alias('registration_ts'),
                                    F.max('ts').alias('last_ts'))
days_reg = days_reg.withColumn('days_register',
                               (F.col('last_ts') - F.col('registration_ts')) / (1000 * 60 * 60 * 24))
df_features = (df_features.join(days_reg, on=['userId'], how='inner')
               .drop('last_ts', 'registration_ts'))

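6. The final level of the user (paid/free)

from pyspark.sql import Window

# Keep the level on each user's most recent event (grouping by userId and level
# alone would return one row for every level the user ever had)
w = Window.partitionBy('userId').orderBy(F.col('ts').desc())
final_level = (df.withColumn('rn', F.row_number().over(w))
               .filter(F.col('rn') == 1)
               .select('userId', 'level'))

# Build one 0/1 column per distinct level value
categories = (final_level.select('level')
              .distinct()
              .rdd.flatMap(lambda x: x)
              .collect())
exprs = [F.when(F.col('level') == category, 1).otherwise(0).alias(category)
         for category in categories]
final_level = final_level.select('userId', *exprs)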
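To see what the per-user feature table looks like, the first few steps above (page counts, total song length, gender dummies, days active) can be sketched in plain Python on made-up events. This is only an illustration under stated assumptions: the records, the build_features helper, and the choice of epoch-millisecond timestamps are mine, and the real pipeline runs in Spark.

```python
from collections import Counter, defaultdict

MS_PER_DAY = 1000 * 60 * 60 * 24  # assumes timestamps are epoch milliseconds

# Made-up toy events: (userId, gender, page, song length in seconds or None, ts in ms)
events = [
    ("u1", "F", "NextSong", 200.0, 0),
    ("u1", "F", "Thumbs Up", None, 1 * MS_PER_DAY),
    ("u1", "F", "NextSong", 100.0, 3 * MS_PER_DAY),
    ("u2", "M", "NextSong", 250.0, 0),
    ("u2", "M", "Cancellation Confirmation", None, MS_PER_DAY // 2),
]


def build_features(events):
    """Aggregate raw events into one feature row (a dict) per user."""
    page_counts = defaultdict(Counter)
    song_length = defaultdict(float)
    gender = {}
    first_ts, last_ts = {}, {}
    for user, g, page, length, ts in events:
        page_counts[user][page] += 1              # pivot on page + count
        if page == "NextSong":
            song_length[user] += length           # total song length
        gender[user] = g
        first_ts[user] = min(first_ts.get(user, ts), ts)
        last_ts[user] = max(last_ts.get(user, ts), ts)

    pages = sorted({page for _, _, page, _, _ in events})
    genders = sorted(set(gender.values()))
    features = {}
    for user in page_counts:
        # A Counter returns 0 for pages the user never visited (like fillna(0))
        row = {page: page_counts[user][page] for page in pages}
        row["total_songlength"] = song_length[user]
        for g in genders:
            row[g] = 1 if gender[user] == g else 0  # dummy variables
        row["days_active"] = (last_ts[user] - first_ts[user]) / MS_PER_DAY
        features[user] = row
    return features


features = build_features(events)
for user, row in features.items():
    print(user, row)
```

Note the unit of the timestamps: if they are in milliseconds, dividing by the number of seconds in a day (60*60*24) instead of MS_PER_DAY would overstate every duration by a factor of 1000.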

Modeling

After engineering the features, we build three models: logistic regression, random forest, and gradient-boosted trees. Let's start by generating the feature table, splitting the data, and setting up the scaling.

from pyspark.ml.feature import VectorAssembler, StandardScaler

# Rename Cancellation Confirmation to label in df_features_label
df_features_label = df_features.withColumnRenamed('Cancellation Confirmation', 'label')

# Feature table: every column except the label and the user ID
df_features = df_features.drop('Cancellation Confirmation', 'userId')

# Split the data into training and test sets
train, test = df_features_label.randomSplit([0.8, 0.2])

# Assemble the feature columns into a single vector column for the pipeline
vector_assembler = VectorAssembler(inputCols=df_features.columns, outputCol='Features')

# Scale the assembled feature vector for the pipeline
scale_df = StandardScaler(inputCol='Features', outputCol='ScaledFeatures')
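As a rough illustration of what the scaling step does to a single feature column: as far as I know, Spark's StandardScaler by default divides each feature by its sample standard deviation without centering it (withStd=True, withMean=False). The plain-Python sketch below mimics that behavior under this assumption; the function name and the numbers are made up.

```python
import statistics


def scale_to_unit_std(values):
    """Divide each value by the sample standard deviation (no mean centering)."""
    std = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    if std == 0:
        # A constant column carries no information; map it to zeros in this sketch
        return [0.0 for _ in values]
    return [v / std for v in values]


# Made-up values of one feature for three users
days_active = [2.0, 4.0, 6.0]
scaled = scale_to_unit_std(days_active)
print(scaled)
print(statistics.stdev(scaled))
```

Scaling matters mostly for logistic regression with regularization, where features on very different scales (counts in the thousands versus 0/1 dummies) would otherwise be penalized unevenly. Tree-based models are largely insensitive to it.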

Here we give an example of building the logistic regression model. The other models are built in a similar way.

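from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol="ScaledFeatures", labelCol="label", maxIter=10, regParam=0.01)

# Create the pipeline: assemble, scale, then classify
pipeline_lr = Pipeline(stages=[vector_assembler, scale_df, lr])

# Fit the model
model_lr = pipeline_lr.fit(train)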

To evaluate the model, we write a function that reports results on a held-out set. Since churned users are a fairly small subset, we use the F1 score as the metric to optimize.

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

def performance(model, data, evaluation_metric):
    """Return the chosen metric and a confusion matrix for model on data."""
    evaluator = MulticlassClassificationEvaluator(metricName=evaluation_metric)

    # Generate predictions
    predictions = model.transform(data)

    # Get the score
    score = evaluator.evaluate(predictions)

    # Confusion matrix: one row per true label, one column per predicted label
    confusion_matrix = (predictions.groupby('label')
                        .pivot('prediction')
                        .count()
                        .toPandas())

    return score, confusion_matrix

We check the performance of the model as follows:

# Performance
score_lr, confusion_matrix_lr = performance(model_lr, test, 'f1')
print('The F1 score for the logistic regression model: {}'.format(score_lr))
print(confusion_matrix_lr)

Here is the resulting output for the logistic regression model:

[Image: F1 score and confusion matrix printed for the logistic regression model]

From the analysis, the gradient-boosted tree model did the best job, with an F1 score of up to 0.88. Note that since usually only a small group of customers churns, we care more about identifying the churned users correctly than about high overall performance. In this case, we did not run a grid search to tune the hyperparameters.
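The point about caring more for the churned class than for overall performance can be made concrete with a small worked example. The confusion-matrix counts below are invented for illustration and are not results from this analysis. As far as I understand the Spark documentation, the 'f1' metric of MulticlassClassificationEvaluator is an F-measure weighted across the labels, so the sketch computes both the churn-class F1 and the weighted F1.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up confusion matrix for 100 users (20 churned, 80 stayed)
tp, fn = 10, 10  # churned users: correctly flagged vs. missed
fp, tn = 5, 75   # users who stayed: falsely flagged vs. correctly passed

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision_churn = tp / (tp + fp)
recall_churn = tp / (tp + fn)
f1_churn = f1_from_counts(tp, fp, fn)
# For the "stayed" class the roles swap: its hits are tn, its false alarms are fn
f1_stay = f1_from_counts(tn, fn, fp)

# Weight each class's F1 by its share of the true labels
n_churn, n_stay = tp + fn, fp + tn
weighted_f1 = (n_churn * f1_churn + n_stay * f1_stay) / (n_churn + n_stay)

print(f"accuracy        = {accuracy:.3f}")
print(f"churn precision = {precision_churn:.3f}")
print(f"churn recall    = {recall_churn:.3f}")
print(f"churn F1        = {f1_churn:.3f}")
print(f"weighted F1     = {weighted_f1:.3f}")
```

In this toy case the accuracy and the weighted F1 both look healthy while half of the churned users are missed, which is why it is worth reading the confusion matrix and the churn-class recall alongside the headline score.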

Feature Importance

Using our best GBT model and its feature importances, we visualize the relative importance of each feature obtained in the feature engineering step. As the figure below shows, the number of days active, the number of days since registration, and the number of times the user added a song to a playlist are the most important features for the GBT model we built.

[Image: feature importance ranking for the GBT model]

What actions can we take to decrease the churn rate, then?

Analyzing the data is never the end. Applying the results to the business is the most important part, and it is what makes our model valuable. With the feature importances we obtained, we can come up with business strategies to counter customer churn in a real-life business. Here are some brief ideas related to our analysis:

The number of active days is one of the important factors for churn, so rewards and discounts can be considered to encourage user activity. The same idea can apply to the add-friend feature: for example, a user who recommends and adds 5 friends in the community could unlock a unique badge.

Wow! We are finally here! Do you still remember what we did to uncover the churn behavior of customers with Big Data methods?

Let’s do a recap:

  1. Data loading
  2. Exploratory data analysis
  3. Feature engineering
  4. Model building and evaluation
  5. Identifying important features
  6. Business strategy (actions)

If you are interested in more details of these steps, you can check out the entire code for this Sparkify analysis in my GitHub repository.

I hope you enjoyed reading this long blog post and learned some strategies for growing your business the data science way!

Translated from: https://medium.com/@jessie.sssy/understanding-customer-churning-abd6525d61c5
