Good Tales of Bad Data

Meet Julia. She’s a data engineer. Julia is responsible for ensuring that your data warehouses and lakes don’t turn into data swamps, and that, generally speaking, your data pipelines are in good working order.

Julia is happy when nothing breaks, but like any good engineer, she knows that this is nearly impossible. So she just wants to be the first to know when issues do arise so that she can solve them.

Meet Ted. He’s a data analyst. Ted is known by his company as the “SQL King” because he’s the go-to query wrangler for their Marketing, Customer Support, and Operations teams. He’s an expert in Tableau, and knows all the Excel hacks. Ted is also happy when nothing breaks, and like Julia, knows that this is impossible. However, Ted doesn’t want bad data to ruin his analytics, making his life and the lives of his stakeholders miserable (more on that later).

Meet Alex. Alex is a data consumer. She might be a data scientist, a product manager, a VP of Marketing, or even your CEO. Alex uses data to make smarter decisions, whether that’s what the title of her new product should be or which pair of lucky socks she should wear to tomorrow’s board meeting.

Alex, or anyone else at the company for that matter, can’t do their job if they can’t trust their data. We call this phenomenon data downtime. Data downtime refers to periods of time when your data is inaccurate, missing, or otherwise erroneous. It spares no one, sort of like death and taxes. Unlike death and taxes, however, data downtime can be easily avoided if you act on it immediately.

When raw data is consumed by your data pipeline, it’s abstract and meaningless on its own. It doesn’t really matter if there’s data downtime because no one is using it quite yet — other than Julia, to pass it on. The problem is, she doesn’t always know if data is broken.

As data moves through the pipeline, it becomes more concrete. Once it reaches the company’s business intelligence tools, Ted can start using it, transforming what was formerly vague and abstract into Excel spreadsheets, Tableau dashboards, and other beautiful vessels of knowledge.

Ted can then transform this data (now nearing full maturity) into actionable insights for the rest of his company. Now, Alex can create marketing collateral and PDFs and customer decks with this data, which is polished and concrete and bound to save the world. Or is it?

As data errors move down the pipeline, the severity of data downtime increases. There are more and more Teds and Alexes using the data, many of whom have no idea if what they’re looking at is right, wrong, or somewhere in between until it’s too late.

When is it too late, you might ask?

Too late is when Julia is paged at 3 a.m. on Monday morning by Ted, who was called only a few minutes earlier by Alex, his skip-level manager and the VP of Sales, about a wonky report that was supposed to be presented to their CEO the next morning. Too late is when you’ve wasted time, lost revenue, and eroded the precious trust of Alex and everyone else.

The more concrete and further removed the data gets from Julia’s raw tables, the more severe the impact. We refer to this as the cone of data anxiety.

Disaster struck and Julia had no idea why, let alone that it had happened. If only she had caught the data downtime immediately — right when it hit — instead of through Alex and her other data consumers (down the cone of anxiety), disaster could have been avoided.
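To make "catching data downtime right when it hits" a little more concrete, here is a minimal Python sketch of the kind of check Julia might run against a raw table before anyone downstream sees it. Everything in it is an assumption for illustration: the `raw_orders` table, the `amount` column, the helper names `null_rate` and `detect_downtime`, and the thresholds are all hypothetical, and a real pipeline would likely read from a warehouse and page someone rather than print.

```python
from datetime import datetime, timedelta


def null_rate(rows, column):
    """Fraction of rows whose value for `column` is None (0.0 for an empty table)."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def detect_downtime(rows, column, last_loaded_at, now,
                    max_null_rate=0.05, max_age=timedelta(hours=24)):
    """Return a list of human-readable issues; an empty list means no alert.

    The thresholds are illustrative defaults, not recommendations.
    """
    issues = []
    if not rows:
        issues.append("table is empty")
    rate = null_rate(rows, column)
    if rate > max_null_rate:
        issues.append(f"{column}: null rate {rate:.0%} exceeds {max_null_rate:.0%}")
    if now - last_loaded_at > max_age:
        issues.append("table is stale")
    return issues


# Hypothetical raw table: half of the `amount` values are missing,
# and the last load finished 30 hours ago.
now = datetime(2020, 1, 6, 3, 0)
raw_orders = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": None},
    {"order_id": 4, "amount": 5.00},
]

issues = detect_downtime(
    raw_orders, "amount",
    last_loaded_at=now - timedelta(hours=30),
    now=now,
)
for issue in issues:
    print("ALERT:", issue)
```

Run on every load of the raw table, a check like this would put the alert at the top of the cone, with Julia, rather than at the bottom, with Alex.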

Worst of all, she was in the middle of a once-in-a-lifetime dream. Cotton candy clouds, chocolate fountain waterfalls, and no null values. The complete opposite of the reality she was facing at 3 a.m. on Monday morning.

Sound familiar? Yeah, I’m with you.

If data downtime is something you’ve experienced, we’d love to hear from you! Reach out to Barr with your own good tales of bad data.

This article was co-written by Barr Moses & Martín Alonso Lago.

Translated from: https://towardsdatascience.com/good-tales-of-bad-data-91eccc29cbc5
