Why I Built an Open-Source Tool for Big Data Testing and Quality Control

I’ve developed an open-source data testing and quality tool called data-flare. It aims to help data engineers and data scientists assure the data quality of large datasets using Spark. In this post I’ll share why I wrote this tool, why the existing tools weren’t enough, and how it may be helpful to you.

Who spends their evenings writing a data quality tool?

In every data-driven organisation, we must always recognise that without confidence in the quality of our data, that data is useless. Despite that, there are relatively few tools available to help us keep our data quality high.

What I was looking for was a tool that:

  • Helped me write high-performance checks on the key properties of my data, like the size of my datasets, the percentage of rows that comply with a condition, or the distinct values in my columns
  • Helped me track those key properties over time, so that I can see how my datasets are evolving and spot problem areas easily
  • Enabled me to write more complex checks for facets of my data that weren’t simple to capture in a single property, and to compare different datasets
  • Would scale to huge volumes of data

The tools that I found were more limited, constraining me to simpler checks defined in YAML or JSON, or only letting me check simpler properties of a single dataset. I wrote data-flare to fill these gaps, and to provide a one-stop shop for our data quality needs.

Show me the code

data-flare is a Scala library built on top of Spark. That means you will need to write some Scala, but I’ve tried to keep the interface simple, so even a non-Scala developer can pick it up quickly.

Let’s look at a simple example. Imagine we have a dataset containing orders, with the following attributes:

  • customerId
  • orderId
  • itemId
  • orderType
  • orderValue

We can represent this as a Dataset[Order] in Spark, with Order defined as:

```scala
case class Order(customerId: String, orderId: String, itemId: String, orderType: String, orderValue: Int)
```

Checks on a single dataset

We want to check that our orders are all in order, including checking:

  • orderType is “Sale” at least 90% of the time
  • Orders with an orderType of “Refund” have order values of less than 0
  • There are 20 different items that we sell, and we expect orders for each of them
  • We have at least 100 orders

We can do this as follows (here orders represents our Dataset[Order]):

```scala
val ordersChecks = ChecksSuite(
  "orders",
  singleDsChecks = Map(
    DescribedDs(orders, "orders") -> Seq(
      SingleMetricCheck.complianceCheck(
        AbsoluteThreshold(0.9, 1),
        ComplianceFn(col("orderType") === "Sale")
      ),
      SingleMetricCheck.complianceCheck(
        AbsoluteThreshold(1, 1),
        ComplianceFn(col("orderValue") < 0),
        MetricFilter(col("orderType") === "Refund")
      ),
      SingleMetricCheck.distinctValuesCheck(
        AbsoluteThreshold(Some(20), None),
        List("itemId")
      ),
      SingleMetricCheck.sizeCheck(AbsoluteThreshold(Some(100), None))
    )
  )
)
```

As you can see from this code, everything starts with a ChecksSuite. You then pass in all of the checks that operate on single datasets using the singleDsChecks argument. We were able to do all of these checks with SingleMetricChecks, which are efficient and perform every check in a single pass over the dataset.
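
To make the single-pass idea concrete, here is a pure-Scala sketch (no Spark, and not data-flare’s actual implementation) of accumulating several metrics in one traversal of the data:

```scala
// Pure-Scala illustration of the "single pass" idea: several metrics are
// accumulated together in one traversal, instead of one scan per metric.
// The Order case class is repeated here so the sketch is self-contained.
case class Order(customerId: String, orderId: String, itemId: String, orderType: String, orderValue: Int)

case class Metrics(size: Long = 0, saleCount: Long = 0, itemIds: Set[String] = Set.empty)

def collectMetrics(orders: Iterable[Order]): Metrics =
  orders.foldLeft(Metrics()) { (m, o) =>
    Metrics(
      size = m.size + 1,                                            // dataset size
      saleCount = m.saleCount + (if (o.orderType == "Sale") 1 else 0), // compliance numerator
      itemIds = m.itemIds + o.itemId                                // distinct values
    )
  }
```

Each check (size threshold, compliance percentage, distinct-value count) can then be evaluated against the single Metrics result, which is what makes metric checks cheap compared with re-scanning the dataset per check.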

What if we wanted to do something that we couldn’t easily express with a metric check? Let’s say we wanted to check that no customer had more than 5 orders with an orderType of “Flash Sale”. We could express that with an Arbitrary Check like so:

```scala
ArbSingleDsCheck("less than 5 flash sales per customer") { ds =>
  val tooManyFlashSaleCustomerCount = ds
    .filter(col("orderType") === "Flash Sale")
    .groupBy("customerId")
    .agg(count("orderId").as("flashSaleCount"))
    .filter(col("flashSaleCount") > 5)
    .count

  if (tooManyFlashSaleCustomerCount > 0)
    RawCheckResult(CheckStatus.Error, s"$tooManyFlashSaleCustomerCount customers had too many flash sales")
  else
    RawCheckResult(CheckStatus.Success, "No customers had more than 5 flash sales :)")
}
```

The ability to define arbitrary checks in this way gives you the power to define any check you want. They won’t be as efficient as the metric-based checks, but the flexibility you get can make it a worthwhile trade-off.
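
For intuition, the same per-customer logic can be sketched with plain Scala collections (no Spark; the threshold and sample data are only illustrative):

```scala
// Pure-Scala sketch of the arbitrary check's logic: find customers with more
// than `limit` orders of type "Flash Sale". Repeated Order definition keeps
// this sketch self-contained.
case class Order(customerId: String, orderId: String, itemId: String, orderType: String, orderValue: Int)

def customersWithTooManyFlashSales(orders: Seq[Order], limit: Int = 5): Set[String] =
  orders
    .filter(_.orderType == "Flash Sale")          // keep only flash sales
    .groupBy(_.customerId)                        // bucket them per customer
    .collect { case (customerId, flashSales) if flashSales.size > limit => customerId }
    .toSet                                        // customers over the limit
```

The Spark version above does exactly this, but distributed across the cluster, and then turns an empty or non-empty result into a Success or Error RawCheckResult.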

Checks on a pair of datasets

Let’s imagine we have a machine learning algorithm that predicts which item each customer will order next. It returns another Dataset[Order] containing the predicted orders.

We may want to compare metrics on our predicted orders with metrics on our original orders. Let’s say that we expect to have an entry in our predicted orders for every customer that has had a previous order. We could check this using Flare as follows:

```scala
val predictedOrdersChecks = ChecksSuite(
  "orders",
  dualDsChecks = Map(
    DescribedDsPair(DescribedDs(orders, "orders"), DescribedDs(predictedOrders, "predictedOrders")) ->
      Seq(
        DualMetricCheck(
          CountDistinctValuesMetric(List("customerId")),
          CountDistinctValuesMetric(List("customerId")),
          "predicted orders present for every customer",
          MetricComparator.metricsAreEqual
        )
      )
  )
)
```

We can pass in dualDsChecks to a ChecksSuite. Here we describe the datasets we want to compare, the metrics we want to calculate for each of those datasets, and a MetricComparator which describes how those metrics should be compared. In this case we want the number of distinct customerIds in each dataset to be equal.
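
Conceptually, a metric comparator pairs a descriptive name with a function that decides whether the two computed metric values are acceptable together. A simplified pure-Scala sketch of that idea (this type is my illustration, not data-flare’s actual MetricComparator definition):

```scala
// Simplified stand-in for the comparator concept: a named binary predicate
// over two metric values of the same type.
final case class SimpleMetricComparator[T](name: String, compare: (T, T) => Boolean)

// Equality comparison, as used for the distinct-customerId check above.
val metricsAreEqual: SimpleMetricComparator[Long] =
  SimpleMetricComparator("metricsAreEqual", (a, b) => a == b)
```

A comparator shaped like this could also express looser relationships, for example allowing the predicted dataset’s metric to be within some tolerance of the original’s.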

What happens when you run your checks?

When you run your checks, all metrics are calculated in a single pass over each dataset, and the check results are calculated and returned. You can then decide for yourself how to handle those results. For example, if one of your checks gives an error you could fail the Spark job, or send a failure notification.
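
One way to act on the results might look like the following. Note that SuiteResult and SuiteStatus here are stand-in types I have invented for illustration; data-flare returns its own richer result objects, so treat this only as the shape of the idea:

```scala
// Hypothetical sketch of handling check results: fail the job on error,
// log and continue on success. These types are illustrative assumptions,
// not data-flare's real API.
sealed trait SuiteStatus
case object SuitePassed extends SuiteStatus
case object SuiteFailed extends SuiteStatus

final case class SuiteResult(status: SuiteStatus, report: String)

def handleResults(result: SuiteResult): Unit = result.status match {
  case SuiteFailed => throw new RuntimeException(s"Data quality checks failed: ${result.report}")
  case SuitePassed => println(s"Data quality checks passed: ${result.report}")
}
```

Throwing from the driver fails the Spark job; alternatively, the failure branch could send a notification instead of (or before) throwing.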

What else can you do?

  • Store your metrics and check results by passing a metricsPersister and qcResultsRepository to your ChecksSuite (Elasticsearch is supported out of the box, and it is extensible to support any data store)
  • Graph metrics over time in Kibana so you can spot trends
  • Write arbitrary checks for pairs of datasets

For more information check out the documentation and the code!

Translated from: https://medium.com/swlh/why-i-built-an-opensource-tool-for-big-data-testing-and-quality-control-182a14701e8d
