Problem Solving as a Data Scientist: A Case Study

There are two myths about how data scientists solve problems. The first is that the problem already exists in a well-defined form, so the data scientist's only challenge is to pick an algorithm and put it into production. The second is that data scientists always reach for the most advanced algorithm, as if a fancier model meant a better solution. Neither is entirely groundless, but they reflect two common misunderstandings of how data scientists work: one overemphasizes the “execution” side, and the other overstates the “algorithm” part.

Obviously, these myths are not how we actually solve problems. From my perspective, problem-solving for a data scientist is:

  • more about “how to abstract the problem out of the business context”, not just “being handed a specific task”
  • more about “solving the problem with an algorithm”, not just “using the best algorithm to solve a problem”
  • more about “iteratively delivering business value”, not just “implementing the code and calling it a day”.

With that said, I observe that the problem-solving process usually involves four stages. I would like to share what the four stages are, how they work in action through a case study, and how we can get there with the right mindsets.

The story starts with, once upon a time …

My first job was at a company that operates an automotive pricing and information website, and it went through its initial public offering (IPO) in May 2014. It was a great experience, and I vividly remember everyone around me cheering that day for the birth of a public company. As a public company, our revenue started to receive a lot of attention, especially with the first quarterly earnings report coming out in August. In early July, the director of the revenue department came to the Data Scientists' seating area, and he did not look like he had good news to share.

“We are in trouble, a percentage of the sales revenue cannot be credited appropriately; we need your help.”

Here is some relevant context: the company generates revenue by bringing more sales to car dealers. To collect the commission we have earned, we need to match each vehicle sale to the correct customer. If our data providers tell us which customer bought which vehicle, the matching is done and no extra effort is needed. The problem was that one data provider decided to stop providing 1-to-1 sale records and to report sales in batches instead (a “batch” is visualized below), which makes it much harder and less certain to know which customer bought which car.

[Figure: what a “batch” looks like]

The revenue team was surprised by this change. After spending the past month trying to solve the problem, they could manually recover only 2% of the sales from that data provider. This would be bad news for the first earnings call, so they came to the Data Scientists for help. It was clearly an urgent problem, so we jumped right on it.

Stage 1. Understand the problem, and then redefine it using mathematical terms

This is the first stage of problem-solving in Data Science. For the “understand the problem” part, one needs to identify the pain points clearly, so that once they are resolved the problem is gone. The “redefine the problem” part is usually where a Data Scientist's help is needed.

For the specific request from our revenue team, the problem is: we cannot assign each sold vehicle to a customer, so we lose the revenue.

The pain point is: finding who purchased a vehicle in a given batch is manual and inaccurate. With thousands of batches that need their sales matched, it is very time-consuming and not sustainable.

The “redefined” problem in mathematical terms is: given a batch with customers C1, C2, …, Cn, along with the sold vehicle information V1, V2, …, Vm, we need an automated solution that accurately identifies the matching pairs (Ci, Vj) reflecting the actual purchasing events.

Stage 2. Decompose the problem, identify a logical algorithm solution, and then build it out

With the redefined problem, we can see this is a “matching” exercise under constraints: the customers and vehicles in a batch are given. So I decomposed the problem further into two steps:

  • Step 1. Calculate the likelihood P(C|V) that a customer made the purchase, given the vehicle
  • Step 2. Based on that likelihood, attribute each car to the most likely customer in the batch

Now we can identify a solution for each step.

Step 1. Probability calculation

For simplicity, let’s assume there are three customers (c1, c2, c3) in this batch, and one vehicle (v1) is reported as sold.

  • P(C=c1) represents the likelihood of c1 buying any car. Assuming no prior knowledge about the customers, each is equally likely to buy a car: P(C=c1) = P(C=c2) = P(C=c3), which equals a constant (e.g. 1/3 in this situation)
  • P(V=v1) is the likelihood of v1 being sold. Since v1 appears in this batch as a sale, this should be 1 (100% likelihood of being sold)

Since only one customer made the purchase, this probability can be expanded into:

P(V=v1) = P(C=c1, V=v1) + P(C=c2, V=v1) + P(C=c3, V=v1) = 1.0

Each term can be written in two ways:

P(C=c1, V=v1) = P(C=c1|V=v1) * P(V=v1) = P(V=v1|C=c1) * P(C=c1)

Because P(V=v1) = 1 and P(C=c1) is the same constant for every customer, P(C=c1|V=v1) is proportional to P(V=v1|C=c1). Normalizing over the three customers gives the formula for the probability calculation:

P(C=c1|V=v1) = P(V=v1|C=c1) / (P(V=v1|C=c1) + P(V=v1|C=c2) + P(V=v1|C=c3))

The key is to get each P(V|C). In words: the likelihood that a vehicle was purchased by a specific customer is proportional to the likelihood that this customer would buy this specific vehicle.

The above formula may look too “mathematical”, so let me put it into an intuitive context. Assume three people are in a room: a musician, an athlete, and a data scientist. You are told that a violin in this room belongs to one of them. Now guess: who do you think owns the violin? This is pretty straightforward, right? Given that a musician is highly likely to own a violin, while an athlete or a data scientist is less likely to, the violin much more likely belongs to the musician. The “mathematical” thinking process is illustrated below.

[Figure: the violin example worked through with probabilities]
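The normalization behind the violin example can be sketched in a few lines of Python. This is an illustration only: the function name `posterior_over_customers` and the likelihood values are made up, and the sketch assumes a uniform prior over the people in the room and exactly one owner.

```python
from typing import Dict


def posterior_over_customers(likelihoods: Dict[str, float]) -> Dict[str, float]:
    """Turn P(V=v | C=c) for each customer in a batch into P(C=c | V=v).

    Assumes a uniform prior over customers and exactly one buyer, so the
    posterior is simply the likelihood divided by the sum of likelihoods.
    """
    total = sum(likelihoods.values())
    if total <= 0:
        raise ValueError("at least one likelihood must be positive")
    return {customer: p / total for customer, p in likelihoods.items()}


# Made-up likelihoods of owning a violin, for illustration only.
violin_likelihood = {"musician": 0.30, "athlete": 0.03, "data_scientist": 0.03}
posterior = posterior_over_customers(violin_likelihood)
most_likely_owner = max(posterior, key=posterior.get)
print(most_likely_owner, round(posterior[most_likely_owner], 3))
```

Only the ratios between the likelihoods matter here, which is why the constant prior and P(V=v1) = 1 drop out of the formula.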

Now, let’s put the probabilities into a business context. On an online automotive pricing platform, each customer needs to generate at least one vehicle quote, so we assume a customer can be reasonably represented by the vehicles he or she quoted. The probability P(V|C) can then be learned from the historical data the company has already accumulated, including who generated a vehicle quote and when, and what vehicle they eventually bought. I will not elaborate on the details, but the key point is that we can learn P(V|C) and then calculate the needed probability P(C|V) in each batch.

Step 2. Vehicle attribution

Once we have the probability of each vehicle having been sold to each customer, the second step is the attribution process. If there is only one sold vehicle in the batch, the process is trivial. If there are multiple sold vehicles in the batch, either of the following approaches would work:

  • (direct attribution) use only the calculated probability P(C|V), and always attribute a vehicle to the customer with the highest likelihood. Under this approach, it is possible to attribute two vehicles to the same customer.
  • (round-robin way) assume each customer buys at most one vehicle: once a vehicle is attributed to a customer, both are removed before the next round of vehicle attribution.
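The two attribution approaches can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the probabilities are hypothetical, and the round-robin variant picks the highest remaining probability in each round, an ordering the description above does not specify.

```python
from typing import Dict, List, Set, Tuple

# probs[vehicle][customer] = P(C=customer | V=vehicle) within one batch.
Probs = Dict[str, Dict[str, float]]


def direct_attribution(probs: Probs) -> Dict[str, str]:
    """Give each vehicle to its most likely customer; customers may repeat."""
    return {v: max(by_customer, key=by_customer.get) for v, by_customer in probs.items()}


def round_robin_attribution(probs: Probs) -> Dict[str, str]:
    """One vehicle per customer: accept the highest remaining probability,
    then remove both that vehicle and that customer from later rounds."""
    pairs: List[Tuple[float, str, str]] = sorted(
        ((p, v, c) for v, by_customer in probs.items() for c, p in by_customer.items()),
        reverse=True,
    )
    assigned: Dict[str, str] = {}
    used_customers: Set[str] = set()
    for _, vehicle, customer in pairs:
        if vehicle not in assigned and customer not in used_customers:
            assigned[vehicle] = customer
            used_customers.add(customer)
    return assigned


# Made-up probabilities for a batch with two sold vehicles.
batch = {
    "v1": {"c1": 0.6, "c2": 0.3, "c3": 0.1},
    "v2": {"c1": 0.5, "c2": 0.4, "c3": 0.1},
}
direct = direct_attribution(batch)
one_to_one = round_robin_attribution(batch)
print(direct)
print(one_to_one)
```

In this toy batch, direct attribution gives both vehicles to c1, while the round-robin variant gives v1 to c1 and v2 to c2.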

Now we have designed a two-step algorithm to solve the key challenge, and it’s time to test its performance! Given the historical quotes and sales data, it is straightforward to simulate the process of “creating random batches”, “attaching sales to the batch”, and trying to “recover sales from the given batch information”. Such a simulation provides a way to evaluate the model’s performance, and we estimated that more than 50% of sales could be recovered with high precision (>95%). We deployed the model on the real dataset, and the results matched our expectations well.
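One way to score such a simulation is sketched below. The details are assumptions, not a description of the actual evaluation: the sketch accepts an attribution only when its P(C|V) clears a confidence threshold, reports the recovery rate as the share of sales that were accepted, and reports precision as the share of accepted attributions that are correct. All names and numbers are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class SimulatedSale(NamedTuple):
    predicted_customer: str
    true_customer: str
    confidence: float  # P(C|V) of the predicted customer


def evaluate(sales: List[SimulatedSale], threshold: float) -> Tuple[float, float]:
    """Return (recovery_rate, precision) when only attributions with
    confidence >= threshold are accepted."""
    accepted = [s for s in sales if s.confidence >= threshold]
    correct = sum(s.predicted_customer == s.true_customer for s in accepted)
    recovery_rate = len(accepted) / len(sales) if sales else 0.0
    precision = correct / len(accepted) if accepted else 0.0
    return recovery_rate, precision


# Made-up simulation output: the true buyer is known because the batch
# was built from historical sales.
simulated = [
    SimulatedSale("c1", "c1", 0.95),
    SimulatedSale("c2", "c2", 0.90),
    SimulatedSale("c3", "c4", 0.55),
    SimulatedSale("c5", "c5", 0.40),
]
recovery_rate, precision = evaluate(simulated, threshold=0.8)
print(recovery_rate, precision)
```

Raising the threshold trades recovery for precision, which is one plausible way to end up with a figure like "50% recovered at >95% precision".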

The revenue team was very happy with the above solution: compared to the ~2% recovery rate, more than 50% is an improvement of over 25X! From a business impact perspective, this revenue was added directly to the bottom line of our first quarterly earnings report, and the value contributed by the Data Science team was significant.

Stage 3. Think deeper, and seek opportunities to make further improvement

We ran the above solution for an extra month and saw that the performance was pretty consistent, so it was time to think about what’s next. We recovered 50% of sales, but what about the remaining 50%? Was it possible to improve the algorithm further to get there?

Usually we, as data scientists, tend to focus too much on algorithmic details. In this case, there were some discussions about how to better model P(V|C), for example whether we should use a deep learning model to make this probability much better. However, in my understanding, such purely algorithmic improvements usually yield only incremental gains, and they were unlikely to close the remaining 50% gap.

Then I started a deeper conversation with the revenue team to figure out what was missing from our understanding of the problem. It turned out that we could control how the customers are grouped into batches! Although there are some restrictions (e.g. customers in a batch have to generate quotes from the same dealership), this gave us the freedom to optimize further, and I saw this as the direction for closing the gap on the remaining 50% of sales.

Why was I confident in this direction? Think about this situation: you have 4 people to be batched, and each batch holds 2 people. The best batching strategy is to put the most different people in the same batch, so that once an item is returned, the attribution will be more accurate. The following visualization shows the concept. On the left side, if you put two musicians in one batch and two athletes in the other, it’s very hard to know who owns the violin or the basketball. On the right side, if each batch has one musician and one athlete, it is much easier to tell, with high confidence, that Musician A owns the violin and Athlete D owns the basketball.

[Figure: batching similar people together (left) versus different people together (right)]

To put this concept into practice, two steps are required:

  • (similarity definition) how do we define customer-to-customer similarity, and then a batch’s entropy as the objective function to optimize?
  • (batch optimization) based on these similarities, how do we design an optimization strategy to achieve optimal batches?

Step 1. Similarity definition

In the first solution (Stage 2), we already found a way to calculate P(V|C). Here I make a direct generalization: the similarity between two customers is proportional to the average likelihood of each customer purchasing the other’s quoted vehicles. If each customer quoted only one vehicle (c1 quoted v1, and c2 quoted v2), a simplified version looks as follows:

Similarity(c1, c2) = 0.5 * (P(V=v1|C=c2) + P(V=v2|C=c1))

Once we have the pairwise similarity between customers, we can define the entropy of a batch as the sum of the pairwise similarities between the customers in the batch. Now we have an objective function to optimize: we want batches with maximum entropy.
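The similarity formula and a batch-level score can be sketched as below. Read literally, maximizing a sum of similarities would favor batches of similar customers, while the stated goal is to put the most different customers together. The sketch therefore assumes the batch score is the sum of pairwise dissimilarities (1 - similarity). That choice, the names `similarity` and `batch_score`, and the toy probabilities are assumptions for illustration and may differ from what was actually built.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical stand-in for the learned model:
# p_v_given_c[(vehicle, customer)] holds P(V=vehicle | C=customer).
p_v_given_c: Dict[Tuple[str, str], float] = {}
for customer in ("m1", "m2"):      # customers who favor sedans
    p_v_given_c[("sedan", customer)] = 0.8
    p_v_given_c[("truck", customer)] = 0.1
for customer in ("a1", "a2"):      # customers who favor trucks
    p_v_given_c[("sedan", customer)] = 0.1
    p_v_given_c[("truck", customer)] = 0.8

# One quoted vehicle per customer, as in the simplified formula.
quoted = {"m1": "sedan", "m2": "sedan", "a1": "truck", "a2": "truck"}


def similarity(c1: str, c2: str) -> float:
    """0.5 * (P(V=v1|C=c2) + P(V=v2|C=c1)), where c1 quoted v1 and c2 quoted v2."""
    v1, v2 = quoted[c1], quoted[c2]
    return 0.5 * (p_v_given_c[(v1, c2)] + p_v_given_c[(v2, c1)])


def batch_score(batch: Sequence[str], sim: Callable[[str, str], float]) -> float:
    """Assumed objective: sum of pairwise dissimilarities inside one batch."""
    return sum(1.0 - sim(a, b) for a, b in combinations(batch, 2))


similar_batch = batch_score(["m1", "m2"], similarity)
mixed_batch = batch_score(["m1", "a1"], similarity)
print(similar_batch, mixed_batch)
```

With these toy numbers, the mixed batch scores higher than the batch of two similar customers, matching the musician and athlete intuition.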

Step 2. Batch optimization

After reading some similar studies, I decided to use the 2-opt algorithm, a simple local search algorithm for solving the traveling salesman problem.

The basic concept of the 2-opt algorithm is as follows: in every step, two edges are randomly picked and a “swap” is attempted. If the objective function improves after the swap, the swap is kept; otherwise, two other edges are picked. The algorithm continues until the objective function converges or the maximum number of iterations is reached. The following figure illustrates two edges (red) being picked and swapped into new edges (blue), achieving a shorter distance.

[Figure: a 2-opt swap replacing two edges (red) with two new edges (blue)]

To apply the 2-opt algorithm to my case, I drew analogies to the traveling salesman problem (TSP):

  • In TSP, two edges are randomly selected; in my case, two batches are randomly selected, and one randomly picked customer from each batch is exchanged with the other
  • In TSP, the total distance is the objective function, the shorter the better; in my case, the entropy of all batches is the objective function, the higher the better.
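The swap-based search described by this analogy can be sketched as below. It is a simplified illustration, not the production code: it stops after a fixed number of iterations rather than checking convergence, it ignores the same-dealership restriction, and it scores batches by pairwise dissimilarity (1 - similarity), which is an assumption. The names `optimize_batches`, `total_score`, and `toy_similarity` are hypothetical.

```python
import random
from itertools import combinations
from typing import Callable, List

Similarity = Callable[[str, str], float]


def total_score(batches: List[List[str]], sim: Similarity) -> float:
    """Assumed objective: pairwise dissimilarity summed over all batches."""
    return sum(1.0 - sim(a, b) for batch in batches for a, b in combinations(batch, 2))


def optimize_batches(batches: List[List[str]], sim: Similarity,
                     max_iter: int = 1000, seed: int = 0) -> List[List[str]]:
    """2-opt style local search: swap one random customer between two
    random batches and keep the swap only if the objective improves."""
    rng = random.Random(seed)
    current = [list(batch) for batch in batches]
    best = total_score(current, sim)
    for _ in range(max_iter):
        i, j = rng.sample(range(len(current)), 2)
        a = rng.randrange(len(current[i]))
        b = rng.randrange(len(current[j]))
        current[i][a], current[j][b] = current[j][b], current[i][a]
        candidate = total_score(current, sim)
        if candidate > best:
            best = candidate
        else:
            # No improvement: undo the swap and try another pair.
            current[i][a], current[j][b] = current[j][b], current[i][a]
    return current


def toy_similarity(a: str, b: str) -> float:
    """Made-up similarity: customers sharing a prefix letter are alike."""
    return 0.8 if a[0] == b[0] else 0.1


start = [["m1", "m2"], ["a1", "a2"]]
result = optimize_batches(start, toy_similarity)
print(result)
```

Starting from two homogeneous batches, the search ends with one "m" customer and one "a" customer in each batch, the arrangement that the musician and athlete example argues for.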

Great, we have all the elements to optimize the batches! After implementing the algorithm, we backtested it on the existing data and found that more than 85% of sales could be recovered. In the following month, when we applied it to the real dataset, the recovery rate was at a similar level. This approach worked, as expected!

Stage 4. Engineer the solution to make it extendable and maintainable

What I described above is mainly the algorithm design part. In parallel, there is the Engineering development part: one cannot simply write the code and expect it to be extendable and maintainable.

As the project evolved, we gradually noticed a pattern of dependencies across the modules we needed. A vehicle is represented by many features, a customer is represented by a set of vehicles, and a batch is represented by a set of customers. With this high-level representation, we can build the dependency lineage as Vehicle -> Customer -> Batch.
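The Vehicle -> Customer -> Batch lineage could be expressed with plain data classes, as in the sketch below. The field names and the helper method are hypothetical and only illustrate how each level is built from the one before it.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Vehicle:
    """A vehicle is represented by its features."""
    vehicle_id: str
    features: Dict[str, str]


@dataclass
class Customer:
    """A customer is represented by the vehicles he or she quoted."""
    customer_id: str
    quoted_vehicles: List[Vehicle]


@dataclass
class Batch:
    """A batch is represented by the customers grouped into it."""
    batch_id: str
    customers: List[Customer]

    def quoted_vehicle_ids(self) -> List[str]:
        """All vehicle ids quoted by customers in this batch, in order."""
        return [v.vehicle_id for c in self.customers for v in c.quoted_vehicles]


sedan = Vehicle("v1", {"body": "sedan", "make": "ExampleMake"})
truck = Vehicle("v2", {"body": "truck", "make": "ExampleMake"})
batch = Batch("b1", [Customer("c1", [sedan]), Customer("c2", [truck])])
print(batch.quoted_vehicle_ids())
```

Keeping the three levels separate means a change in how vehicles are featurized only touches `Vehicle`, while similarity and batching logic can keep working against `Customer` and `Batch`.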

Meanwhile, as a data product, the system needs to be able to evolve, updating the needed parameters and always evaluating performance along the way. Hence the architecture was designed in the following way:

[Figure: system architecture]

With this architecture, what the Data Scientists need to do on a regular basis is:

  • re-train the model for P(V|C) to ensure it incorporates the most recent customer purchasing behavior
  • simulate the whole process, including both batch optimization and sales attribution, to ensure the system performance is above a threshold
  • run the monthly batch optimization to prepare data for our revenue team, and the sales attribution to match customers to sales

Now we have built a sustainable, maintainable data product. Because the data science team had established a good reputation, in the following year we were heavily involved in the re-design of the sales matching system, which further expanded the data science footprint across the company. And because of this architecture’s operational excellence, it freed up more of our resources to seek the next challenge.

The general problem-solving flow with the right mindset

The data science area is quite broad, and designing algorithmic data products is only one of many potential projects. Other commonly seen data science projects include experiment design, causal inference, and deep-dive analysis to drive strategic changes. Although they may not strictly follow, or even need, all the stages I listed above, the four-stage flow still helps lay out a way to think about problem-solving in general:

  • Stage 1 (problem identification) helps you focus on the key question and not lose track while diving deep into the data
  • Stage 2 (first logical solution) gets you a quick win and keeps the momentum to build trust with business partners
  • Stage 3 (iterative improvement) helps you move the solution further ahead and become the owner of the area
  • Stage 4 (operational excellence) helps you remove tech debt and frees you from mundane maintenance work going forward

The four-stage flow is not necessarily a strict rule to follow; it is more like a natural outcome when a data scientist faces an incoming challenge with the right mindsets. In my opinion, these mindsets are:

  • Business-driven, not algorithm-driven. Look at the big picture and see how data science fits into the business; understand why data science is needed and how it delivers value. Don’t be too attached to any specific algorithm: “if all you have is a hammer, everything looks like a nail”.

  • Owning the problem, not just taking orders. Being the owner of a problem means being proactive in thinking about how to solve it now, solve it better, and solve it with less effort. An owner would not stop at a sub-optimal solution and consider it done.

  • Open-minded, and always learning. As an interdisciplinary field, data science overlaps with statistics, computer science, operations research, psychology, economics, marketing, sales, and more! It’s almost impossible to know all these areas ahead of time, so be open-minded and keep learning along the way. There could always be a better solution than the one you already know.

I hope you find the above helpful: happy problem solving, the data science way.

— — — — — — — — — — — — — —

If you enjoyed this article, help spread the word by liking, sharing, and commenting. Pan is currently a Data Science Manager at LinkedIn. You can read previous posts and follow him on LinkedIn.

Here are two previous articles sharing Pan’s Data Science experience:

My First Data Science Project

How to innovate in Data Science

Original article: https://towardsdatascience.com/problem-solving-as-data-scientist-a-case-study-49296d8cd7b7
