Critical Thinking
As Alexander Pope said, to err is human. By that metric, who is more human than us data scientists? We devise wrong hypotheses constantly and then spend time working on them just to find out how wrong we were.
When looking at mistakes from an experiment, a data scientist needs to be critical, always on the lookout for something that others may have missed. But sometimes, in our day-to-day routine, we can easily get lost in little details. When this happens, we often fail to look at the overall picture, ultimately failing to deliver what the business wants.
Our business partners have hired us to generate value. We won’t be able to generate that value unless we develop business-oriented critical thinking, including having a more holistic perspective of the business at hand. So here is some practical advice for your day-to-day work as a data scientist.
1. Beware of clean data syndrome
Tell me how many times this has happened to you: You get a data set and start working on it straight away. You create neat visualizations and start building models. Maybe you even present automatically generated descriptive analytics to your business counterparts!
But do you ever ask, “Does this data actually make sense?”
Incorrectly assuming that the data is clean could lead you toward very wrong hypotheses. Not only that, but you’re also missing an important analytical opportunity with this assumption.
You can actually discern a lot of important patterns by looking at discrepancies in the data. For example, if you notice that a particular column has more than 50 percent of its values missing, you might think about dropping the column. But what if the values are missing because the data collection instrument has an error? By calling attention to this, you could help the business improve its processes.
Or what if you’re given a distribution of customers that shows a ratio of 90 percent men versus 10 percent women, but the business is a cosmetics company that predominantly markets its products to women? You could assume you have clean data and show the results as is, or you can use common sense and ask the business partner if the labels are switched.
Such errors are widespread. Catching them not only improves future data collection but also keeps other teams from using bad data, which in turn prevents the company from making wrong decisions.
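The two checks described in this section (a mostly missing column and a suspicious category split) can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the `customers` records, the column names, and the 50 percent thresholds are hypothetical, and a real project would likely use a data frame library instead.

```python
from collections import Counter


def missing_fraction(rows, column):
    """Share of rows where `column` is absent, None, or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)


def category_shares(rows, column):
    """Share of each observed value in `column`, ignoring missing values."""
    values = [row[column] for row in rows if row.get(column) not in (None, "")]
    if not values:
        return {}
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


# Hypothetical customer records for a cosmetics company (made-up data).
customers = [{"gender": "M", "age": None} for _ in range(9)] + [
    {"gender": "F", "age": 31}
]

report = {
    "age_missing": missing_fraction(customers, "age"),
    "gender_shares": category_shares(customers, "gender"),
}

# Turn surprising numbers into questions for the business, not silent fixes.
questions = []
if report["age_missing"] > 0.5:
    questions.append("Why is 'age' mostly missing? Is the collection step broken?")
if report["gender_shares"].get("M", 0.0) > 0.5:
    questions.append("Most customers are labeled 'M'. Are the labels switched?")

for question in questions:
    print(question)
```

The point of the sketch is the last step: rather than dropping the column or reporting the split as is, the discrepancies become questions to raise with the business partner.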
2. Be on the lookout for something out of the ordinary
You probably know fab.com. If you don’t, it’s a website that sells selected health and fitness items. But the site’s origins weren’t in e-commerce. Fab.com started as Fabulis.com, a social networking site for gay men. One of the site’s most popular features was called the “Gay Deal of the Day.”
One day, the deal was for hamburgers. Half of the deal’s buyers were women, despite the fact that they weren’t the site’s target users. This fact caused the data team to realize that they had an untapped market for selling goods to women. So Fabulis.com changed its business model to serve this newfound market.
Be on the lookout for something out of the ordinary. Be ready to ask questions. If you see something in the data, you may have hit gold. Data can help a business to optimize revenue, but sometimes it has the power to change the direction of the company as well.
Another famous example of this is Flickr, which started out as a multiplayer game. Only when the founders noticed that people were using it as a photo upload service did the company pivot to the photo-sharing app we know it as today.
Try to see patterns that others would miss. Do you see a discrepancy in some buying patterns or maybe something you can’t seem to explain? That might be an opportunity in disguise when you look through a wider lens.
3. Focus on the right metrics
What do we want to optimize for?
Most businesses fail to answer this simple question.
Every business problem is a little different and should, therefore, be optimized differently. For example, a website owner might ask you to optimize for daily active users. Daily active users is a metric defined as the number of people who open a product on a given day.
But is that the right metric? Maybe not. In reality, it’s just a vanity metric, meaning one that makes you look good but doesn’t serve any purpose when it comes to actionability. This metric will always increase if you are spending marketing dollars across various channels to bring more and more customers to your site.
Instead, I would recommend optimizing the percentage of users who are active, which gives a better idea of how the product is performing. A big marketing campaign might bring a lot of users to the site, but if only a few of them become active, the campaign was a failure and the site’s stickiness is very low. You can measure stickiness with the second metric but not with the first. If the percentage of active users is increasing, that suggests users actually like the website.
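To make the difference between the two metrics concrete, here is a small sketch with made-up numbers. The user counts are hypothetical and `active_share` is just an illustrative helper name, not a standard function.

```python
def active_share(daily_active_users, total_users):
    """Fraction of all registered users who were active on a given day."""
    if total_users == 0:
        return 0.0
    return daily_active_users / total_users


# Hypothetical numbers before and after a big marketing campaign.
before = {"total_users": 10_000, "daily_active_users": 2_000}
after = {"total_users": 50_000, "daily_active_users": 4_000}

# The vanity metric: daily active users doubled.
dau_growth = after["daily_active_users"] / before["daily_active_users"]

# The stickiness metric: the share of users who are active actually fell.
share_before = active_share(**before)
share_after = active_share(**after)

print(f"DAU grew {dau_growth:.1f}x")
print(f"Active share went from {share_before:.0%} to {share_after:.0%}")
```

In this made-up scenario the raw count looks like a success while the share of active users drops, which is exactly the situation the raw count hides.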
Another example of looking at the wrong metric happens when we create classification models. We often try to increase accuracy for such models. But do we really want accuracy as a metric of our model performance?
Imagine that we’re predicting whether an asteroid will hit the Earth. If we want to optimize for accuracy, we can just predict “no impact” every time, and we will be 99.99 percent accurate. That 0.01 percent error could be hugely impactful, though. What if that 0.01 percent is a planet-killing asteroid? A model can be highly accurate and still not at all valuable. A better metric would be the F score, which would be zero in this case: the model never predicts an impact, so its recall is zero.
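The asteroid example can be checked with a few lines of code. The sketch below assumes a hypothetical data set of 10,000 observations with a single impact, and it adopts the common convention of treating precision as zero when the model makes no positive predictions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    # Convention: precision and recall are 0 when their denominator is 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: one impact (1) among 10,000 observations.
y_true = [0] * 9_999 + [1]
# A "model" that always predicts no impact.
y_pred = [0] * 10_000

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"accuracy = {acc:.4f}, F1 = {f1:.1f}")
```

The always-negative model scores almost perfectly on accuracy yet gets an F1 of zero, because it never catches the one event that matters.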
When it comes to data science, designing the project and choosing the evaluation metrics is much more important than the modeling itself. The metrics need to reflect the business goal, and aiming for the wrong goal defeats the whole purpose of modeling. For example, F1 or PR AUC (the area under the precision-recall curve) is a better metric for asteroid prediction because both take the model’s precision and recall into account. If we optimize for accuracy, our whole modeling effort could be in vain.
4. Remember: Statistics mislead sometimes
Be skeptical of any statistics that get quoted to you. Statistics have been used to lie in advertisements, in workplaces, and in a lot of other areas in the past. People will do anything to get sales or promotions.
For example, do you remember Colgate’s claim that 80 percent of dentists recommended their brand? This statistic seems pretty good at first. If so many dentists use Colgate, I should too, right?
It turns out that during the survey, the dentists could choose multiple brands rather than just one. So other brands could be just as popular as Colgate.
Marketing departments are just myth-creation machines. We often see such examples in our daily lives. Take, for example, this 1992 ad from Chevrolet. Looking at just the graph and not at the axis labels, it seems that Nissan/Datsun must be dreadful truck manufacturers.
In fact, the graph indicates that more than 95 percent of the Nissan and Datsun trucks sold in the previous 10 years were still running. And the small difference might just be due to sample sizes and the types of trucks sold by each of the companies. As a general rule, never trust a chart that doesn’t label the Y-axis.
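A quick calculation shows how much a truncated axis can exaggerate a small gap. The survival rates and the axis starting point below are hypothetical values chosen for illustration, not the figures from the ad.

```python
def drawn_height_ratio(value_a, value_b, axis_start=0.0):
    """Ratio of the two bar heights as drawn when the y-axis starts at `axis_start`."""
    return (value_a - axis_start) / (value_b - axis_start)


# Hypothetical shares of trucks still on the road, in percent.
brand_a, brand_b = 98.0, 96.0

# With the axis starting at zero, the bars look nearly identical.
honest = drawn_height_ratio(brand_a, brand_b, axis_start=0.0)

# With the axis starting at 95, the same data looks like a threefold gap.
truncated = drawn_height_ratio(brand_a, brand_b, axis_start=95.0)

print(f"axis from 0:  bar A is {honest:.2f}x as tall as bar B")
print(f"axis from 95: bar A is {truncated:.2f}x as tall as bar B")
```

A two-point difference becomes a bar three times as tall once the axis starts at 95, which is why the axis labels deserve a look before the bars do.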
As a part of the ongoing pandemic, we’re seeing even more such examples with a lot of studies promoting cures for COVID-19. This past June in India, a man claimed to have made medicine for coronavirus that cured 100 percent of patients in seven days. This news predictably caused a big stir, but only after he was asked about the sample size did we understand what was actually happening here.
With a sample size of 100, the claim was utterly ridiculous on its face.
Worse, the way the sample was selected was hugely flawed. His organization selected asymptomatic and mildly symptomatic patients with a mean age between 35 and 45 and no pre-existing conditions. I was dumbfounded: this was not even a random sample. So not only was the study useless, it was actually unethical.
When you see charts and statistics, remember to evaluate them carefully. Make sure the statistics were sampled correctly and are being used in an ethical, honest way.
5. Don’t give in to fallacies
During the summer of 1913 in a casino in Monaco, gamblers watched in amazement as the roulette wheel landed on black an astonishing 26 times in a row. Since red and black are almost equally likely on any spin, they were confident that red was “due.” But each spin is independent of the ones before it, so the streak made red no more likely. It was a field day for the casino and a perfect example of the gambler’s fallacy, a.k.a. the Monte Carlo fallacy.
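A short simulation illustrates why red is never “due.” The sketch assumes a European-style wheel with 18 black, 18 red, and 1 green pocket, which is an assumption about the setup rather than a detail from the story, and it uses a fixed random seed so the run is repeatable.

```python
import random

# Assumed European wheel: 18 black, 18 red, 1 green pocket.
P_BLACK = 18 / 37


def red_rate_after_black_streak(spins, streak_length):
    """Share of reds among spins that immediately follow `streak_length` blacks."""
    followers = [
        spins[i]
        for i in range(streak_length, len(spins))
        if all(s == "black" for s in spins[i - streak_length:i])
    ]
    if not followers:
        return None
    return sum(1 for s in followers if s == "red") / len(followers)


rng = random.Random(0)  # fixed seed for a repeatable simulation
spins = rng.choices(["black", "red", "green"], weights=[18, 18, 1], k=200_000)

overall_red = spins.count("red") / len(spins)
red_after_streak = red_rate_after_black_streak(spins, streak_length=4)

# Exact probability of 26 blacks in a row under the assumed wheel.
prob_26_blacks = P_BLACK ** 26

print(f"red overall:            {overall_red:.3f}")
print(f"red after 4 blacks:     {red_after_streak:.3f}")
print(f"P(26 blacks in a row):  {prob_26_blacks:.2e}")
```

The share of reds right after a run of blacks should come out close to the overall share of reds. A run of 26 blacks is extremely unlikely in advance, but once it has happened it tells you nothing about the next spin.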
This happens in everyday life outside of casinos too. People tend to avoid long strings of the same answer. Sometimes they do so while sacrificing accuracy of judgment for the sake of getting a pattern of decisions that look fairer or more probable. For example, an admissions office may reject the next application they see if they have approved three applications in a row, even if the application should have been accepted on merit.
The world works on probabilities. We are seven billion people, each experiencing some event every second of our lives. Because of that sheer volume, rare events are bound to happen. But we shouldn’t put our money on them.
Think also of the spurious correlations we end up seeing regularly. This particular graph suggests that organic food sales cause autism. Or is it the opposite? Just because two variables move in tandem doesn’t mean that one causes the other. Correlation does not imply causation, and as data scientists, it is our job to be on the lookout for such fallacies, biases, and spurious correlations. We can’t allow oversimplified conclusions to cloud our work.
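One common source of spurious correlation is a shared trend: two unrelated quantities that both grow over time will look strongly correlated. The sketch below uses purely synthetic series (the slopes, noise levels, and seed are arbitrary assumptions) and compares the correlation of the raw levels with the correlation of period-to-period changes.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def first_differences(xs):
    """Change from each period to the next, which removes a linear trend."""
    return [b - a for a, b in zip(xs, xs[1:])]


rng = random.Random(42)  # fixed seed for a repeatable example
periods = range(200)

# Two synthetic series generated independently; both simply grow over time.
series_a = [2.0 * t + rng.gauss(0, 5) for t in periods]
series_b = [0.5 * t + rng.gauss(0, 5) for t in periods]

corr_levels = pearson(series_a, series_b)
corr_changes = pearson(first_differences(series_a), first_differences(series_b))

print(f"correlation of levels:  {corr_levels:.2f}")
print(f"correlation of changes: {corr_changes:.2f}")
```

The levels correlate strongly only because both series trend upward. Once the trend is removed by differencing, little or no relationship should remain, which is a cheap sanity check before reading anything causal into a chart.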
Data scientists have a big role to play in any organization. A good data scientist must be both technical and business-driven to do the job well. Thus, we need to make a conscious effort to understand the business’s needs while also polishing our technical skills.
Continue learning
If you want to learn more about how to apply data science in a business context, I would recommend the AI for Everyone course by Andrew Ng, which focuses on spotting opportunities to apply AI to problems in your own organization, working with an AI team, and building an AI strategy in your company.
Thanks for the read. I am going to be writing more beginner-friendly posts in the future too. Follow me on Medium or subscribe to my blog to be informed about them. As always, I welcome feedback and constructive criticism and can be reached on Twitter @mlwhiz.
Translated from: https://medium.com/swlh/why-critical-thinking-skills-are-essential-for-data-scientists-e9a16634ac8