Data Science, Opinion
“There is a saying, ‘A jack of all trades and a master of none.’ When it comes to being a data scientist you need to be a bit like this, but perhaps a better saying would be, ‘A jack of all trades and a master of some.’” — Brendan Tierney
I believe the word "some" in the above quote includes communication and domain knowledge. You may have read many articles focusing on the technical facets of data science. In this article, we will discuss some not-so-technical facets that data scientists encounter in their day-to-day lives by walking through a scenario.
Scenario
I work as a data practitioner for the online department of Eastside, a large retail company. My manager passes by my desk on his way to a meeting, asks me to figure out "our best customers", and leaves in a flash.
What does "best" mean here? Does it mean the customers who have spent the most? Or does it mean the customers who buy most often? Notice that spending the most and buying many items are two completely different things.
The situation above is a common occurrence in the field of data: the use of fuzzy (vague) language. Often, we hear people express their ideas in natural language that sounds fine at first but on close inspection is ill-defined.
The above situation shows how poor communication can have an adverse impact. A LinkedIn study states that communication is the most sought-after soft skill. Even though my manager was not precise in his request, I could have sought clarification. If we find out the end goal of the request, i.e., why he wants to know the best customers, we can decide on our approach.
When I reach out to my manager, he explains that there is $1000 left in the marketing budget and that he wants to use it to convert some physical store customers to the online store by emailing them free coupons. One caveat is that we should not steal customers from the physical store, as that may create a problem for the physical store's head. He also mentions that this task must be accomplished within two hours!
This is where domain knowledge comes into the picture. What does stealing customers mean? It means we should not send coupons to active customers of the physical store, as that may stop them from going to the physical store. Instead, we can send coupons to some of the best churned customers.
Customer churn means that a customer ceases to be a customer. (I bought a Netflix subscription for 3 months and later cancelled it. I am a churned customer.)
I explain to my manager that we will consider a customer churned if they have not purchased anything from the physical store in the past 3 months. (Most customers buy only groceries from the physical store, so it is reasonably safe to assume that someone who has not purchased anything for 3 months has churned.) My manager agrees and gives me the dataset of all the physical store customers.
You may think this definition of churned customers is not perfect. In data science, situations arise where we cannot know the truth because of time constraints or the inability to measure it. In such cases we use approximations that are close to the truth, called proxies. When a request is urgent, it is common to use proxies.
Approach
Let us explore the physical store data we were given.
import pandas as pd
import datetime

data = pd.read_csv("/content/es_phy_store.txt")
[Output not shown]
There are 125,000 rows in total and 3 columns. Our goal is to return the IDs of the best customers. We can also see that none of the columns contain null values, which suggests that the data is clean.
To find the churned customers, we will group the data by customer_id and find the latest transaction_date of each customer.
group_by_customer = data.groupby("customer_id")
last_transaction = group_by_customer["transaction_date"].max()
last_transaction.head(5)
[Output not shown]
Since customers are considered churned if their last transaction was more than three months ago, we will set a cutoff date of May 1st, 2020, and label the customers accordingly.
We will create a separate data frame called best_churn which consists of the customer_id, the transaction_date, and a boolean column churned denoting whether the customer has churned.
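A minimal sketch of how the best_churn data frame described here might be built. The sample rows and the use of pd.Timestamp for the cutoff are assumptions made for illustration; only the column names and the May 1st, 2020 cutoff come from the text.

```python
import pandas as pd

# Hypothetical sample rows standing in for the physical store data.
data = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C3"],
    "transaction_date": pd.to_datetime(
        ["2020-01-15", "2020-06-10", "2020-03-20", "2020-05-01"]
    ),
})

# Latest transaction per customer, indexed by customer_id.
last_transaction = data.groupby("customer_id")["transaction_date"].max()

# A customer counts as churned if the last purchase is before the cutoff.
cutoff = pd.Timestamp("2020-05-01")
best_churn = last_transaction.to_frame()
best_churn["churned"] = best_churn["transaction_date"] < cutoff

print(best_churn)
```

Keeping customer_id as the index means that later per-customer aggregates (counts, sums) align with best_churn automatically when assigned as new columns.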
[Output not shown]
Ranking the Customers
We have identified the churned customers. The main aim is to find the best churned customers. First, we need to rank the customers based on some criteria, and then we need to find a threshold value to identify the best ones.
Due to the time constraints, we cannot use a complex ML/DL model. Instead, we can use a simple weighted-sum model to rank customers. This model assigns a number (score) to each customer denoting how good they are. In our case, we need to consider two criteria: amount spent and number of purchases made. Both are given the same weight, i.e., a customer who spends a lot counts the same as a customer who makes many purchases. So we can define the customer score as: Score = (1/2 × Number of purchases) + (1/2 × Amount spent)
For example, if a customer made 2 purchases worth $500 in total, his score would be (1/2 × 2) + (1/2 × 500) = 251.
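The weighted-sum score can be written as a small function. This is only a sketch: the function name weighted_score and its parameter names are hypothetical, and the equal weights of 1/2 are the ones stated in the text.

```python
def weighted_score(purchases, amount_spent, w_purchases=0.5, w_amount=0.5):
    """Weighted sum of the two criteria; the default weights are equal."""
    return w_purchases * purchases + w_amount * amount_spent


# The worked example from the text: 2 purchases worth $500 in total.
example_score = weighted_score(2, 500)
print(example_score)  # 251.0
```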
Let us find the number of transactions per customer and store it in a separate column. This can be done by grouping the data by customer_id and using the size() method. We can also find the total amount spent by using the sum() method on the transaction_amount column. We will also drop the transaction_date column, which is no longer required.
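# Number of transactions per customer
best_churn["no_of_transactions"] = group_by_customer.size()
# Total amount spent per customer (sum only the amount column)
best_churn["amount_spent"] = group_by_customer["transaction_amount"].sum()
# transaction_date is no longer needed
best_churn = best_churn.drop(columns="transaction_date")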
[Output not shown]
Everything seems fine, but if we take a closer look at the formula we notice a defect. We saw that a customer who spent $500 across 2 purchases gets a score of 251. A customer who spent $400 across 20 different purchases would get a score of 210, which seems unfair: the second customer is more regular than the first and shows more potential to spend in the long run. This happens mainly for two reasons. 1) The amount spent is almost always far larger than the number of transactions. 2) We are using the same weights for both criteria.
Let us find the minimum and maximum number of transactions and amount spent in the best_churn data frame.
best_churn[["no_of_transactions", "amount_spent"]].describe().loc[["min", "max"]]
We can see that the number of transactions is far smaller than the amount spent. To overcome this problem we will use min-max scaling, which rescales each column to the range 0 to 1 so that values on different scales can be compared in a meaningful way. The formula for min-max scaling is: x_scaled = (x - x_min) / (x_max - x_min)
Let us apply the above formula to our no_of_transactions and amount_spent columns, compute the score using the scaled values, and sort the data frame by score.
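A minimal sketch of this step, assuming a small hypothetical best_churn data frame. The helper min_max_scale and the column names scaled_tran, scaled_amount, and score are illustrative names, not taken from the original code. The sketch also assumes each column has at least two distinct values, since otherwise the denominator would be zero.

```python
import pandas as pd

# Hypothetical stand-in for best_churn after the aggregation step.
best_churn = pd.DataFrame(
    {
        "churned": [True, False, True],
        "no_of_transactions": [2, 20, 11],
        "amount_spent": [500.0, 400.0, 100.0],
    },
    index=pd.Index(["C1", "C2", "C3"], name="customer_id"),
)


def min_max_scale(column):
    """Rescale a numeric column to the range [0, 1]."""
    return (column - column.min()) / (column.max() - column.min())


best_churn["scaled_tran"] = min_max_scale(best_churn["no_of_transactions"])
best_churn["scaled_amount"] = min_max_scale(best_churn["amount_spent"])

# Equal-weight score on the scaled values, highest score first.
best_churn["score"] = (
    0.5 * best_churn["scaled_tran"] + 0.5 * best_churn["scaled_amount"]
)
best_churn = best_churn.sort_values("score", ascending=False)

print(best_churn[["scaled_tran", "scaled_amount", "score"]])
```

With scaled inputs, the frequent buyer (C2 in this toy data) now ranks above the customer with two large purchases (C1), which is the behavior the unscaled formula failed to produce.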
[Output not shown]
How do we find out the threshold score value?
Should we choose the first 20 customers? The first 50? The top 10%? What should the criterion be? Again, domain knowledge plays a crucial role here. We know that the budget is $1000. The value of each coupon is not specified, so we must decide it ourselves. The coupon value cannot be too high, because that reduces the number of customers we can reach.
A 30% discount on one transaction is generally considered a pretty decent deal. So, let us find the mean value of all 125,000 transactions in the initial data frame and take 30% of that mean.
coupon = data["transaction_amount"].mean() * 0.3
Output: 19.4976
Let us round this up to 20, so each coupon is worth $20. We know that our budget is $1000, and 1000 / 20 = 50. Hence we can select the top 50 customers from the best_churn data whose churned value is 1 and email them the coupons.
top_50_churned = best_churn.loc[best_churn["churned"] == 1].head(50)
[Output not shown]
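The budget arithmetic and the selection can be sketched together as follows. The sample best_churn rows are hypothetical and assumed to be already sorted by score. Rounding with math.ceil is one way to get from 19.4976 to the $20 used in the text.

```python
import math

import pandas as pd

budget = 1000          # remaining marketing budget in dollars (from the scenario)
raw_coupon = 19.4976   # 30% of the mean transaction value reported in the text

coupon_value = math.ceil(raw_coupon)   # round up to a whole dollar amount
n_coupons = budget // coupon_value     # how many coupons the budget covers

# Hypothetical, already score-sorted stand-in for best_churn.
best_churn = pd.DataFrame(
    {"churned": [1, 0, 1, 1], "score": [0.9, 0.8, 0.7, 0.6]},
    index=pd.Index(["C4", "C2", "C1", "C3"], name="customer_id"),
)

# Keep only churned customers, then take as many as the budget allows.
selected = best_churn.loc[best_churn["churned"] == 1].head(n_coupons)

print(coupon_value, n_coupons)
print(selected)
```

Deriving the head count from the budget rather than hard-coding 50 keeps the selection consistent if the budget or the coupon value changes.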
Conclusion
In this article, we saw the importance of communication and how the use of fuzzy language can be a hindrance. We also took a real-life scenario and solved a problem that had many constraints and required communication skills, domain knowledge, and quick decision-making. I hope you learned something new today.
If you would like to get in touch, connect with me on LinkedIn.
Original article: https://medium.com/towards-artificial-intelligence/data-science-in-business-8266fae71a87