Backstory
I stumbled upon an interesting task while doing a data exercise for a company: a cohort analysis based on user activity data. I got really interested, so I thought of writing this post.
This article provides an insight into what cohort analysis is and how to analyze data for plotting cohorts. There are various ways to do this; I discuss a specific approach that uses pandas and Python to track user retention, and then provide some further analysis to figure out the best traffic sources (organic/inorganic) for an organization.
Cohort Analysis
Let’s start by introducing the concept of cohorts. The dictionary definition of a cohort is a group of people with some common characteristics. Examples of cohorts include birth cohorts (a group of people born during the same period, like ’90s kids) and academic cohorts (a group of people who start working through the same curriculum to finish a degree together).
Cohort analysis is specifically useful in analyzing user growth patterns for products. In terms of a product, a cohort can be a group of people with the same sign-up date, the same usage start month/date, or the same traffic source.
Cohort analysis is an analytics method by which these groups can be tracked over time to find key insights. This analysis can further be used for customer segmentation and to track metrics like retention, churn, and lifetime value. There are two types of cohorts: acquisitional and behavioral.
Acquisitional cohorts: groups of users based on their signup date, first-use date, etc.
Behavioral cohorts: groups of users based on their activities in a given period of time, for example when they install, uninstall, or delete the app.
In this article, I will demonstrate acquisitional cohort creation and analysis using a dataset. Let’s dive into it:
Setup
I am using pandas, NumPy, seaborn, and matplotlib for this analysis. So let’s start by importing the required libraries.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
The Data
This dataset consists of usage data from customers for the month of February. Some of the users in the dataset started using the app in February (if 'isFirst' is true) and some are pre-February users.
df = pd.read_json("data.json")
df.head()
The data has 5 columns:
- date: date of the use (for the month of February)
- timestamp: usage timestamp
- uid: unique id assigned to users
- isFirst: true if this is the user’s first use ever
- utmSource: traffic source from which the user came
We can check the shape and info of the dataframe as follows:
df.shape
(4823567, 5)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4823567 entries, 0 to 4823566
Data columns (total 5 columns):
date datetime64[ns]
isFirst bool
timestamp datetime64[ns]
uid object
utmSource object
dtypes: bool(1), datetime64[ns](2), object(2)
memory usage: 151.8+ MB
Below is a table of contents for the data analysis. I will first show the data cleaning I did for this dataset, followed by the questions that this exercise will answer. The most important part of any analysis-based project is deciding what questions are going to be answered by the end of it. I will answer 3 questions (listed below), followed by some further analysis, a summary, and conclusions.
Table of Contents
Data Cleaning
Question 1: Show the daily active users over the month.
Question 2: Calculate the daily retention curve for users who used the app for the first time on specific dates. Also, show the number of users from each cohort.
Question 3: Determine if there are any differences in usage based on where the users came from. From which traffic source does the app get its best users? Its worst users?
Conclusions
Data Cleaning
Here are some of the tasks I performed for cleaning my data.
Null values:
- Found the null values in the dataframe: utmSource had 1,674,386 null values.
- Created a new column ‘trafficSource’ in which null values are marked as ‘undefined’.
Merge Traffic Sources:
- Merged traffic sources using regular expressions: facebook.* to facebook, gplus.* to google, twitter.* to twitter.
- Merged the traffic sources with fewer than 500 unique users into ‘others’. This was done because 11 sources had only 1 unique user, another 11 sources had fewer than 10 unique users, and another 11 had fewer than 500 unique users.
- Finally reduced the number of traffic sources from 52 to 11.
df.describe()
# Let's check for null values
df.isnull().sum()

date              0
isFirst           0
timestamp         0
uid               0
utmSource   1674386
dtype: int64
It looks like utmSource has a lot of null values: almost 34% of the values are null. I created a new column ‘trafficSource’ in which null values are marked as ‘undefined’.
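A minimal sketch of that step (assuming the column names shown above; ‘trafficSource’ is the new column):

# Copy utmSource into a new column, replacing nulls with 'undefined'
df["trafficSource"] = df["utmSource"].fillna("undefined")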
Next, I took care of similar traffic sources like facebook, facebookapp, src_facebook, etc., by merging them.
I then found the number of unique users coming from each traffic source. This was done to figure out whether some of the traffic sources have very few unique users compared to the others; if so, they can all be merged. This reduces the number of traffic sources we have to analyze without any significant loss in the accuracy of the analysis. So I merged the traffic sources with fewer than 500 unique users (0.2% of the total) into ‘others’, finally reducing the number of traffic sources from 52 to 11.
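A sketch of the merging logic, assuming the regex patterns listed above and the 500-user threshold (variable names are mine):

# Normalize traffic-source names that refer to the same platform
df["trafficSource"] = (
    df["trafficSource"]
    .str.replace(r"^facebook.*", "facebook", regex=True)
    .str.replace(r"^gplus.*", "google", regex=True)
    .str.replace(r"^twitter.*", "twitter", regex=True)
)

# Count unique users per source and fold small sources into 'others'
users_per_source = df.groupby("trafficSource")["uid"].nunique()
small_sources = users_per_source[users_per_source < 500].index
df.loc[df["trafficSource"].isin(small_sources), "trafficSource"] = "others"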
Now let’s answer the questions.
Question 1:
Show the daily active users (DAU) over the month.
A user is considered active for the day if they used the app at least once on a given day. Tasks performed to answer this question:
- Segregated the users who started using the app in February from all the users.
- Calculated the DAU for (a) the users who started in the month of February and (b) the total set of active users.
- Plotted this on a graph, as sketched below.
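A minimal sketch of the DAU computation, assuming the cleaned dataframe from the previous section (the isFirst flag marks February starters; variable names are mine):

# Users whose first-ever use happened in February
feb_starters = df.loc[df["isFirst"], "uid"].unique()

# Daily active users: unique uids per day
dau_all = df.groupby("date")["uid"].nunique()
dau_feb = df[df["uid"].isin(feb_starters)].groupby("date")["uid"].nunique()

# Plot both curves on one figure
plt.figure(figsize=(12, 5))
dau_all.plot(label="All users")
dau_feb.plot(label="February starters")
plt.xlabel("Date")
plt.ylabel("Daily active users")
plt.legend()
plt.show()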
Figure 1 shows the daily active users (DAU) for the month of February. I have plotted 2 graphs: one is the DAU plot for all the users and the other is the DAU plot for those users who started using the app in February. As we can see from the graph, the daily active count for February starters is increasing, but the DAU plot for all users has significant periodic dips (which could be attributed to less usage during the weekends) and slower net growth compared to the February starters alone.
Question 2
Calculate the daily retention curve for users who used the app for the first time on specific dates. Also, show the number of users from each cohort.
The dates considered for creating cohorts are Feb 4th, Feb 10th, and Feb 14th. The tasks done to answer this question are:
- Created cohorts of all the users who started on the above dates.
- Calculated daily retention for each day of February, taking the above dates as starting dates. The daily retention curve is defined as the percentage of users from the cohort who used the product on a given day.
The function dailyRetention takes a dataframe and a cohort-creation date as input and creates a cohort of all the users who started using the app on that date. It outputs the total number of unique users in that cohort and the retention, as a percentage of the cohort, for each day of February.
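The original code gist is not embedded here, so the sketch below reconstructs what dailyRetention plausibly does from the description above (the signature and internals are my assumptions):

def dailyRetention(df, date):
    """Build the cohort of users whose first use was on `date` and
    return the cohort size plus their daily retention in percent."""
    # Users whose first-ever use happened on the cohort date
    cohort_uids = df.loc[df["isFirst"] & (df["date"] == date), "uid"].unique()
    cohort_size = len(cohort_uids)

    # For each day of the month, the % of cohort members active that day
    cohort_activity = df[df["uid"].isin(cohort_uids)]
    retention = cohort_activity.groupby("date")["uid"].nunique() / cohort_size * 100
    return cohort_size, retention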
Figure 2 shows the total number of unique users from each cohort.
The code below prepares the data for a heatmap: it adds a cohort index and pivots the data so that the index holds the cohort start dates, the columns hold the days of February, and the values hold the percentage of unique users who used the app on that day. It then plots the heatmap.
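Since the gist itself is not shown, here is a sketch assuming the curves returned by dailyRetention have been collected into a long-form dataframe retention_df with columns cohort, date, and pct (all names are mine):

# Pivot: rows = cohort start dates, columns = days of February,
# values = % of the cohort active on that day
retention_pivot = retention_df.pivot(index="cohort", columns="date", values="pct")

plt.figure(figsize=(16, 4))
sns.heatmap(retention_pivot, annot=True, fmt=".0f", cmap="Blues")
plt.title("Daily retention by cohort (%)")
plt.xlabel("Day of February")
plt.ylabel("Cohort start date")
plt.show()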
Figure 3 shows a heatmap of the daily retention for users who used the app for the first time on Feb 4th, Feb 10th, and Feb 14th. From the heatmap, we can see 100% retention on the first day of usage, and retention drops to as low as ~31% on some days.
Figure 4 shows the daily retention curve for the month of February. Initially the retention is 100%, but it keeps decreasing and becomes stable after about a week.
This retention curve immediately reflects an important insight: about 40–50% of the users stop using the app after the 1st day. After that initial large drop, a second brisk drop occurs after the 10th day, to under 35–45%, before the curve starts to level off after the 13th day, leaving about 37–43% of the original users still active in the app on the last day of February.
The above retention curve indicates that a lot of users are not experiencing the value of the app, resulting in drop-offs. One way to fix that is to improve the onboarding experience, which can help users experience the core value as quickly as possible, thereby boosting retention.
Question 3
Determine if there are any differences in usage based on where the users came from. From which traffic source does the app get its best users? Its worst users?
The tasks performed to answer this question are:
- Data cleaning: cleaned up user ids with duplicate sources.
- Feature engineering: engineered new features to find the best and worst sources.
Data cleaning: Identifying the best or the worst sources required some data cleaning. Some users had more than one traffic source, so I did some cleaning to remove/merge these multiple sources.
- 1.64% of user ids, i.e. 4,058 unique uids, had more than 1 source.
- Since the count of uids with duplicate traffic sources is not significant and there was no reliable way to attribute a single source to these uids, I simply removed them from my analysis.
The code below groups all users with multiple sources, drops those users from our dataframe, and creates a new dataframe ‘dfa’.
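A sketch of that step, assuming the column names used earlier (dfa is the de-duplicated dataframe, as in the text):

# Count distinct traffic sources per user
sources_per_uid = df.groupby("uid")["trafficSource"].nunique()

# Keep only users attributed to exactly one source
multi_source_uids = sources_per_uid[sources_per_uid > 1].index
dfa = df[~df["uid"].isin(multi_source_uids)].copy()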
Feature engineering: In this section, I engineered 2 different metrics to find the differences in usage based on the traffic source. Here are the metrics:
1) Total number of unique active users per source per day. The first metric is purely quantitative, calculated from how many users we get from each source and their activity per day throughout the month. The code below calculates this metric and plots a graph to visualize the results.
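A minimal sketch of this computation, assuming the dfa dataframe from the previous step:

# Unique active users per traffic source per day
daily_per_source = (
    dfa.groupby(["date", "trafficSource"])["uid"]
    .nunique()
    .unstack("trafficSource")
)

# One line per source over the month
daily_per_source.plot(figsize=(14, 6))
plt.xlabel("Date")
plt.ylabel("Unique active users")
plt.legend(title="Traffic source", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.show()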
Figure 6 plots this information as a line plot. From the plot, one can see that the biznesowe+rewolucje and undefined sources bring in the most users, but their usage dips on weekends. Sources like program, answers, shmoop, twitter, grub+street, and handbook show constant usage throughout the month, but the number of unique users they contribute is low.
2) Number of days active per source. The second metric I calculated is the number of days active per source. For this, I grouped the data by traffic source and uid and counted the number of unique dates, which gives the number of days each uid from each traffic source was active. I plotted this information on a KDE graph; from the graph it is evident that the distribution for all sources is bimodal, with peaks near 0 and 29 days. The best traffic sources can be defined as the ones that peak at 29 and the worst as the ones that peak at 0.
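A sketch of this metric and the KDE plot, again assuming dfa (variable names are mine):

# For each (source, user) pair, count the distinct days the user was active
days_active = (
    dfa.groupby(["trafficSource", "uid"])["date"]
    .nunique()
    .reset_index(name="days")
)

# One KDE curve per traffic source
plt.figure(figsize=(14, 6))
for source, group in days_active.groupby("trafficSource"):
    sns.kdeplot(group["days"], label=source)
plt.xlabel("Days active in February")
plt.legend(title="Traffic source")
plt.show()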
Figure 7 shows a KDE graph of the number of days active per source for the month of February. From the graph, the best sources, with a mode at 29 (most of the users from these sources used the app for 29 days), are shmoop and twitter, closely followed by program, salesmanago, and grub+street with peaks at 29, 28, and 27 respectively. The worst source is undefined, with a mode of 0 despite bringing in the most users, followed by answers and biznesowe+rewolucje. Ranking the traffic sources from best to worst based on this graph gives: shmoop, twitter, program, salesmanago, grub+street, other, handbook, mosalingua+fr, biznesowe+rewolucje, answers, and undefined.
Analysis
User behavior depends on the kind of metric that is important for a business. For some businesses, daily activity (pings) can be an important metric; for others, more activity (pings) on certain days of the month carries more weight than daily activity. One would define the best and worst users based on what is important for the product/organization.
Summary
If the total number of unique active users is an important metric for the product, then the first graph can be used to see which sources are best/worst: a higher number of users indicates a better traffic source.
But if we want to see activity over the month and analyze how many days the users from a particular source were active, then the second metric becomes important. In this case, we found that even if a source (e.g. shmoop or twitter) brings a smaller number of unique active users per day, the users it does bring use the app over a longer period of time.
Conclusions
In this article, I showed how to carry out cohort analysis using Python’s pandas, matplotlib, and seaborn. During the analysis, I made some simplifying assumptions, mostly due to the nature of the dataset. When working on real data, we would have a deeper understanding of the business and could draw better, more meaningful conclusions from the analysis.
You can find the code used for this article on my GitHub. As always, any constructive feedback is welcome. You can reach out to me in the comments.