Data Science in Business

Data Science, Opinion

“There is a saying, ‘A jack of all trades and a master of none.’ When it comes to being a data scientist you need to be a bit like this, but perhaps a better saying would be, ‘A jack of all trades and a master of some.’” — Brendan Tierney

I believe the word "some" in the above quote includes communication and domain knowledge. You might have read many articles focusing on the technical facets of data science. In this article, we will discuss some not-so-technical facets that data scientists encounter in their day-to-day lives by walking through a scenario.

Scenario

I work as a data practitioner for the online department of Eastside, a large retail company. My manager passes by my desk on his way to a meeting, asks me to figure out "our best customers", and leaves in a flash.

What does "best" mean here? Does it mean the customers who have spent the most, or the customers who buy most often? Note that spending the most and making the most purchases are two completely different things.

The situation above is a common occurrence in the field of data: the use of fuzzy (vague) language. More often than not, people express their ideas in natural language that sounds fine at first but turns out to be ill-defined on close inspection.

The situation above shows how poor communication can have an adverse impact. A LinkedIn study states that communication is the most sought-after soft skill. Even though my manager was not precise in his request, I could have sought clarification. If we find out the end goal of the request, i.e. why he wants to know the best customers, we can decide on our approach.

When I reach out to my manager, he explains that there is $1000 left in the marketing budget and he wants to use that money to convert some physical store customers to the online store by emailing them free coupons. One caveat is that we should not steal customers from the physical store, as that may create a problem with the physical store's head. He also mentions that this task must be accomplished within two hours!

This is where domain knowledge comes into the picture. What does stealing customers mean? It means we should not send coupons to active customers of the physical store, as that may keep them from going to the physical store. Instead, we can send coupons to some of the best churned customers.

Customer churn means that a customer ceases to be a customer. (I bought a Netflix subscription for 3 months and later cancelled it. I am a churned customer.)

I explain to my manager that we will consider a customer churned if they have not purchased anything from the physical store in the past 3 months. (Most customers buy only groceries from the physical store, so it is reasonably safe to assume that someone who has not purchased anything for the past 3 months has churned.) My manager agrees and gives me the dataset of all the physical store customers.

You may think this definition of churned customers is not perfect. In data science, situations arise where we cannot know the truth because of time constraints or the inability to measure it. In those cases we use approximations that are close to the truth, called proxies. When a request is urgent, it is common to use proxies.

Approach

Let us explore the physical store data we were given.

import pandas as pd
import datetime

data = pd.read_csv("/content/es_phy_store.txt")

There are 125,000 rows and 3 columns in total. Our goal is to return the IDs of the best customers. We can also see that no column contains null values, so no missing-value handling is needed.
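The row count, column count, and null check mentioned above can be reproduced with a few pandas calls. This is only a sketch: it uses a tiny made-up frame in place of the real file, the column names follow the ones used in this article's code, and every value is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real file; all values are invented.
sample = pd.DataFrame(
    {
        "customer_id": ["C1", "C2", "C1", "C3"],
        "transaction_date": ["2020-01-15", "2020-06-02", "2020-07-20", "2019-11-30"],
        "tran_amount": [35, 80, 50, 120],
    }
)

n_rows, n_cols = sample.shape        # number of rows and columns
null_counts = sample.isnull().sum()  # missing values per column

print(n_rows, n_cols)
print(null_counts)
```

On the real data, the same two calls applied to data would give the totals quoted above.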

To find the churned customers, we will group the data by customer_id and find the latest transaction_date of each customer.

group_by_customer = data.groupby("customer_id")
last_transaction = group_by_customer["transaction_date"].max()
last_transaction.head(5)

Since customers are considered churned if their last transaction was more than three months ago, we will set a cutoff date of May 1st, 2020, and label the customers accordingly.

We will create a separate data frame called best_churn, which consists of the customer_id, the transaction_date, and a boolean column churned denoting whether the customer has churned.
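One way this labelling step could look is sketched below. The sample transactions are invented, the variable name cutoff_day is my own, and treating "last purchase strictly before the cutoff" as churned is an assumption about how the boundary is handled.

```python
import pandas as pd

# Hypothetical sample transactions; all values are invented.
data = pd.DataFrame(
    {
        "customer_id": ["C1", "C2", "C1", "C3"],
        "transaction_date": pd.to_datetime(
            ["2020-01-15", "2020-06-02", "2020-07-20", "2019-11-30"]
        ),
        "tran_amount": [35, 80, 50, 120],
    }
)

# Latest purchase date per customer (the index is customer_id).
last_transaction = data.groupby("customer_id")["transaction_date"].max()

# Assumption: a last purchase strictly before the cutoff counts as churned.
cutoff_day = pd.Timestamp("2020-05-01")

best_churn = last_transaction.to_frame()
best_churn["churned"] = best_churn["transaction_date"] < cutoff_day

print(best_churn)
```

Keeping customer_id as the index means that per-customer aggregates computed with the same groupby can later be assigned as new columns and will line up by customer.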

Ranking the Customers

We have found the churned customers. The main aim is to find the best churned customers. First, we need to rank the customers based on some criteria; next, we need to find a threshold value to identify the best customers.

Due to the time constraints, we cannot use a complex ML/DL model. We can use a simple weighted-sum model to rank customers. This model assigns a number (score) to each customer denoting how good they are. In our case, we need to consider two criteria: the amount spent and the number of purchases made. Both are given the same weight, i.e. spending a lot counts as much as making many purchases. So we can define the customer score as: Score = (1/2 × Number of purchases) + (1/2 × Amount spent)

For example, if a customer made 2 purchases worth $500 in total, their score would be (1/2 × 2) + (1/2 × 500) = 251.
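The formula above can be written as a small helper. The function name weighted_score and its parameter names are hypothetical, chosen here only for illustration; the default weights are the equal 1/2 weights from the formula.

```python
def weighted_score(n_purchases, amount_spent, w_purchases=0.5, w_amount=0.5):
    """Weighted sum of the purchase count and the amount spent."""
    return w_purchases * n_purchases + w_amount * amount_spent


# The worked example: 2 purchases worth $500 in total.
example_score = weighted_score(2, 500)
print(example_score)  # 251.0
```

Making the weights parameters keeps the helper usable if the two criteria are later weighted differently.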

Let us find the number of transactions per customer and store it in a separate column. This can be accomplished by grouping the data by customer_id and using the size() method. We can also find the total amount spent by using the sum() method on the transaction amount column. We will also drop the transaction_date column, which is no longer required.

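best_churn["no_of_transactions"] = group_by_customer.size()
# Sum only the amount column; tran_amount is the column name used in the coupon calculation later on
best_churn["amount_spent"] = group_by_customer["tran_amount"].sum()
best_churn = best_churn.drop(columns=["transaction_date"])  # no longer required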

Everything seems fine, but a closer look at the formula reveals a defect. We saw that a customer who spent $500 over 2 purchases scores 251. A customer who spent $400 across 20 different purchases would score 210, which seems unfair, because the second customer is more regular than the first and shows more potential to spend in the long run. This happens mainly for two reasons: 1) The amount spent is always a much larger number than the number of transactions. 2) We use the same weights for both criteria.

Let us find out the min and max number of transactions and amounts from the best_churn data frame.

best_churn[["no_of_transactions", "amount_spent"]].describe().loc[["min", "max"]]

We can see that the number of transactions is much smaller than the amount spent. To overcome this problem we will use min-max scaling, which rescales each column to the range 0 to 1 so that values on different scales can be compared in a meaningful way. The formula for min-max scaling is:

x_scaled = (x − min(x)) / (max(x) − min(x))

Let us apply the above formula to our no_of_transactions and amount_spent columns, compute the score from the scaled values, and sort the data frame by score.
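A minimal sketch of this step follows. It assumes best_churn is indexed by customer_id; the three customers and their totals are invented, and the helper min_max_scale and the columns scaled_tran and scaled_amount are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical per-customer totals; all values are invented.
best_churn = pd.DataFrame(
    {
        "no_of_transactions": [2, 20, 5],
        "amount_spent": [500, 400, 100],
    },
    index=pd.Index(["C1", "C2", "C3"], name="customer_id"),
)


def min_max_scale(column):
    """Rescale a numeric Series to the range [0, 1]."""
    return (column - column.min()) / (column.max() - column.min())


best_churn["scaled_tran"] = min_max_scale(best_churn["no_of_transactions"])
best_churn["scaled_amount"] = min_max_scale(best_churn["amount_spent"])

# Equal weights, now applied to the two scaled criteria.
best_churn["score"] = 0.5 * best_churn["scaled_tran"] + 0.5 * best_churn["scaled_amount"]

# Highest score first, so the best customers are at the top.
best_churn = best_churn.sort_values("score", ascending=False)

print(best_churn[["no_of_transactions", "amount_spent", "score"]])
```

With these made-up numbers, the customer with 20 purchases totalling $400 now ranks above the one with 2 purchases totalling $500. Note that the helper divides by zero when every value in a column is identical, so a real pipeline would need to guard against that case.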

How do we find out the threshold score value?

Should we choose the first 20 customers? The first 50 customers? The top 10%? What should the criterion be? Again, domain knowledge plays a crucial role here. We know that the budget is $1000. The value of each coupon is not specified, so we must decide it ourselves. The coupon value cannot be too high, because that reduces the number of customers we can reach.

A 30% discount on one transaction is generally considered a pretty decent deal. So, let us find the mean value of all 125,000 transactions in the initial data frame and take 30% of that mean value.

coupon = data["tran_amount"].mean() * 0.3

Output: 19.4976

Let us round this up to 20, so each coupon is worth $20. We know that our budget is $1000, and 1000 / 20 = 50. Hence we can select the top 50 customers from the best_churn data whose churned value is 1 and email them the coupons.
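The budget arithmetic can be checked in a few lines. The variable names are my own, and rounding up with math.ceil is an assumption about how 19.4976 becomes 20, since rounding to the nearest whole number would give 19.

```python
import math

budget = 1000            # marketing budget in dollars, from the scenario
mean_discount = 19.4976  # 30% of the mean transaction value, from the output above

# Assumption: round up to the next whole dollar.
coupon_value = math.ceil(mean_discount)

# Number of whole coupons the budget covers.
n_coupons = budget // coupon_value

print(coupon_value, n_coupons)  # 20 50
```

Integer division keeps the total spend within budget even when the budget is not an exact multiple of the coupon value.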

top_50_churned = best_churn.loc[best_churn["churned"] == 1].head(50)

Conclusion

In this article, we saw the importance of communication and how the use of fuzzy language can be a hindrance. We also took a real-life scenario and solved a problem that had many constraints and required communication skills, domain knowledge, and quick decision-making. I hope you learned something new today.

If you would like to get in touch, connect with me on LinkedIn.

Translated from: https://medium.com/towards-artificial-intelligence/data-science-in-business-8266fae71a87
