Tired of data quality discussions
I have been in tons of meetings where data and the results of some analysis were presented. Most of these meetings have one thing in common: the data quality is challenged, and most of the meeting time goes into discussing potential data quality issues. The number one follow-up is to verify the open questions, and then we start all over again. Sound familiar?
It can be different. There are meetings where these discussions don’t take place, or where they start but are dealt with immediately. I have seen and been involved in a few. And there is ONE difference between these two types of meetings that I have seen over and over again: in the meetings that got stuck, the person presenting the data was not on top of their data, was not anticipating, and was not thinking a step further.
In fact, many of these data quality discussions are not about data quality at all but about understanding what the data means: for example, its hierarchy or structure, and the interpretation of the metrics. When you don’t understand something, it is very easy to blame the data quality, but usually the issue lies somewhere else.
Let’s assume you are working on some exploratory data analysis to get started with AI. The key to success is to really understand the data you are working with. If the quality is not up to standard, bring it up to standard or find a way to work with the data nonetheless. Be proactive; it will go a long way.
1. Start small
The key here, as with so many things, is to start small. If you are looking at a handful of features, you can actually dig into what these features mean. If you start off with hundreds, that will be much more difficult. As an example, let’s look at the number of products per customer, which is clearly a small problem.
2. Make sure you understand your data
Because you started small, you are able to dig deep. Do your correlation plots, look at the frequencies, and read the documentation on these features.
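As a rough illustration of what looking at frequencies and correlations can mean for the products-per-customer example, here is a minimal sketch using only the Python standard library. The rows, the column names (`n_products`, `annual_revenue`), and the helper functions are invented for illustration; they are not taken from any real system or from the author’s own analysis.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy data: one row per customer (all values are invented).
customers = [
    {"customer_id": "C1", "n_products": 1, "annual_revenue": 10.0},
    {"customer_id": "C2", "n_products": 2, "annual_revenue": 20.0},
    {"customer_id": "C3", "n_products": 2, "annual_revenue": 25.0},
    {"customer_id": "C4", "n_products": 3, "annual_revenue": 30.0},
    {"customer_id": "C5", "n_products": 1, "annual_revenue": 15.0},
]


def frequencies(rows, column):
    """Count how often each value of `column` occurs."""
    return Counter(row[column] for row in rows)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# How many customers hold 1, 2, 3, ... products?
product_counts = frequencies(customers, "n_products")

# Does the number of products move together with revenue?
corr = pearson(
    [c["n_products"] for c in customers],
    [c["annual_revenue"] for c in customers],
)

print(product_counts)
print(round(corr, 2))
```

With only two features, both outputs are small enough to read line by line and to compare against what the documentation says each feature should contain.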
Because you started small, you are able to dig deep and truly understand the data
In our example, we basically have two features to look at, and both have a large potential for discussion. I once took about three months to define what is meant by “customer”, an especially difficult question in a B2B environment. Depending on the company you work in, products may be grouped at different levels, each of which can be of interest to a different type of role. A product manager can be interested in a different hierarchy level than the head of sales of a region.
3. Verify the data quality
There may already be standard ways in which the data quality is checked, and you should understand and be able to explain these. I recommend going a step beyond the usual checks. Check for inconsistencies from a business perspective. Are most of your customers’ job titles “Accountant”? Think again: it may simply be the top selection of the drop-down list. Another typical quality issue is inconsistency between systems. Be sure you know these inconsistencies, what drives them, and their implications.
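The drop-down check can be sketched in a few lines. This is only an illustration under stated assumptions: the job titles, the `dropdown_order` list, the 50% threshold, and the `dominant_value` helper are all hypothetical, and a real check would use the actual form configuration of the source system.

```python
from collections import Counter


def dominant_value(values, threshold=0.5):
    """Return (value, share) if one value makes up more than `threshold`
    of the non-empty entries, otherwise None."""
    filled = [v for v in values if v not in (None, "")]
    if not filled:
        return None
    value, count = Counter(filled).most_common(1)[0]
    share = count / len(filled)
    return (value, share) if share > threshold else None


# Hypothetical job titles as entered in a customer form (invented values).
job_titles = ["Accountant", "Accountant", "Engineer", "Accountant", "",
              "Accountant", "Buyer"]

# Hypothetical order of the options in the entry form's drop-down list.
dropdown_order = ["Accountant", "Buyer", "Engineer"]

flag = dominant_value(job_titles)

# A dominant value that is also the first drop-down option deserves a second look:
# it may reflect a default selection rather than the real job title.
suspicious = flag is not None and flag[0] == dropdown_order[0]

print(flag, suspicious)
```

A flag here does not prove the data is wrong; it only marks a value worth checking with the people who enter the data.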
4. Anticipate the issues
You can anticipate quite a few questions and issues. What questions do you typically get? What KPIs have been reported to your audience? What discussions have taken place in the past? Which words are used in the daily discussions? That should, for example, give you a good sense of the product split your audience is looking at (spoiler alert: it may well be none of the splits in your data). Make sure you understand the different levels, why they are used, and how.
Anticipating the issues will allow you to divert from the data quality discussion
In my example, there were different product hierarchies (from different systems) that were used by different audiences. I built both hierarchies into my dashboard and was able to explain the overlap and the differences between the two.
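One way to prepare for that explanation is to tabulate where two hierarchies agree and where they differ before the meeting. The sketch below is an assumption-laden illustration, not the dashboard described above: the system names, product IDs, group labels, and the `compare_hierarchies` helper are all invented.

```python
# Hypothetical mappings of product -> product group, as two systems might define them.
crm_hierarchy = {"P1": "Hardware", "P2": "Hardware", "P3": "Services", "P4": "Software"}
erp_hierarchy = {"P1": "Hardware", "P2": "Accessories", "P3": "Services", "P5": "Software"}


def compare_hierarchies(a, b):
    """Summarize overlap and differences between two product -> group mappings."""
    shared = a.keys() & b.keys()
    return {
        # Products present in both systems and assigned to the same group.
        "agree": sorted(p for p in shared if a[p] == b[p]),
        # Products present in both systems but assigned to different groups.
        "disagree": sorted(p for p in shared if a[p] != b[p]),
        # Products known to only one of the two systems.
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
    }


report = compare_hierarchies(crm_hierarchy, erp_hierarchy)
for key, products in report.items():
    print(key, products)
```

The "disagree" and "only in one system" lists are the ones most likely to come up as apparent data quality questions, so they are worth being able to explain product by product.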
Find out which systems your audience is using and what data they typically see. Then have an upfront discussion with someone you trust, going through the data and results together to weed out as many flaws as possible.
5. Know the issues and work around them
Once you know the issues that are there, it’s time to work around them. One way is to tackle the issue at the source. That may not be your job, but it is potentially critical for a follow-up project in which these features are going to be used.
If you are still in the exploratory phase, you could instead make the issues and assumptions explicit. The key is that you are able to explain them and their implications, so that you gain the trust of your audience.
Think this is a lot of work? Think again: once this is sorted, you can actually do your job, start creating actionable insights, and take action.
About me: I am an Analytics Consultant and Director of Studies for “AI Management” at a local business school. I am on a mission to help organizations generate business value with AI and create an environment in which Data Scientists can thrive. Sign up to my newsletter for new articles, insights, and offerings on AI Management here.
Translated from: https://towardsdatascience.com/tired-of-data-quality-discussions-654106ce2e00