Why I Built an Open-Source Tool for Big Data Testing and Quality Control

I’ve developed an open-source data testing and quality tool called data-flare. It aims to help data engineers and data scientists assure the data quality of large datasets using Spark. In this post I’ll share why I wrote this tool, why the existing tools weren’t enough, and how it may be helpful to you.

Who spends their evenings writing a data quality tool?

In every data-driven organisation, we must always recognise that without confidence in the quality of our data, that data is useless. Despite that, there are relatively few tools available to help us keep our data quality high.

What I was looking for was a tool that:

  • Helped me write high-performance checks on the key properties of my data, like the size of my datasets, the percentage of rows that comply with a condition, or the distinct values in my columns
  • Helped me track those key properties over time, so that I can see how my datasets are evolving and spot problem areas easily
  • Enabled me to write more complex checks for facets of my data that weren’t simple to capture in a single property, and to compare different datasets
  • Would scale to huge volumes of data

The tools that I found were more limited, constraining me to simpler checks defined in YAML or JSON, or only letting me check simpler properties of a single dataset. I wrote data-flare to fill these gaps, and to provide a one-stop shop for our data quality needs.

Show me the code

data-flare is a Scala library built on top of Spark. That means you will need to write some Scala, but I’ve tried to keep the interface simple, so even a non-Scala developer can pick it up quickly.

Let’s look at a simple example. Imagine we have a dataset containing orders, with the following attributes:

  • customerId
  • orderId
  • itemId
  • orderType
  • orderValue

We can represent this as a Dataset[Order] in Spark, with Order defined as:

```scala
case class Order(customerId: String, orderId: String, itemId: String, orderType: String, orderValue: Int)
```

Checks on a single dataset

We want to check that our orders are all in order, including checking:

  • orderType is “Sale” at least 90% of the time
  • Orders with an orderType of “Refund” have order values of less than 0
  • There are 20 different items that we sell, and we expect orders for each of them
  • We have at least 100 orders

We can do this as follows (here orders represents our Dataset[Order]):

```scala
val ordersChecks = ChecksSuite(
  "orders",
  singleDsChecks = Map(
    DescribedDs(orders, "orders") -> Seq(
      SingleMetricCheck.complianceCheck(
        AbsoluteThreshold(0.9, 1),
        ComplianceFn(col("orderType") === "Sale")
      ),
      SingleMetricCheck.complianceCheck(
        AbsoluteThreshold(1, 1),
        ComplianceFn(col("orderValue") < 0),
        MetricFilter(col("orderType") === "Refund")
      ),
      SingleMetricCheck.distinctValuesCheck(
        AbsoluteThreshold(Some(20), None),
        List("itemId")
      ),
      SingleMetricCheck.sizeCheck(AbsoluteThreshold(Some(100), None))
    )
  )
)
```

As you can see from this code, everything starts with a ChecksSuite. You then pass in all of the checks that operate on single datasets using the singleDsChecks argument. We were able to do all of these checks with SingleMetricChecks, which are efficient and perform every check in a single pass over the dataset.
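
To make the single-pass idea concrete, here is a pure-Scala sketch (no Spark, and not data-flare’s actual implementation) of accumulating several metrics in one traversal of the data:

```scala
// Pure-Scala illustration of the "single pass" idea: several metrics are
// accumulated together in one traversal, instead of one scan per metric.
// The Order case class is repeated here so the sketch is self-contained.
case class Order(customerId: String, orderId: String, itemId: String, orderType: String, orderValue: Int)

case class Metrics(size: Long = 0, saleCount: Long = 0, itemIds: Set[String] = Set.empty)

def collectMetrics(orders: Iterable[Order]): Metrics =
  orders.foldLeft(Metrics()) { (m, o) =>
    Metrics(
      size = m.size + 1,                                            // dataset size
      saleCount = m.saleCount + (if (o.orderType == "Sale") 1 else 0), // compliance numerator
      itemIds = m.itemIds + o.itemId                                // distinct values
    )
  }
```

Each check (size threshold, compliance percentage, distinct-value count) can then be evaluated against the single Metrics result, which is what makes metric checks cheap compared with re-scanning the dataset per check.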

What if we wanted to do something that we couldn’t easily express with a metric check? Let’s say we wanted to check that no customer had more than 5 orders with an orderType of “Flash Sale”. We could express that with an Arbitrary Check like so:

```scala
ArbSingleDsCheck("less than 5 flash sales per customer") { ds =>
  val tooManyFlashSaleCustomerCount = ds
    .filter(col("orderType") === "Flash Sale")
    .groupBy("customerId")
    .agg(count("orderId").as("flashSaleCount"))
    .filter(col("flashSaleCount") > 5)
    .count

  if (tooManyFlashSaleCustomerCount > 0)
    RawCheckResult(CheckStatus.Error, s"$tooManyFlashSaleCustomerCount customers had too many flash sales")
  else
    RawCheckResult(CheckStatus.Success, "No customers had more than 5 flash sales :)")
}
```

The ability to define arbitrary checks in this way gives you the power to define any check you want. They won’t be as efficient as the metric-based checks, but the flexibility you get can make it a worthwhile trade-off.
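
For intuition, the same per-customer logic can be sketched with plain Scala collections (no Spark; the threshold and sample data are only illustrative):

```scala
// Pure-Scala sketch of the arbitrary check's logic: find customers with more
// than `limit` orders of type "Flash Sale". Repeated Order definition keeps
// this sketch self-contained.
case class Order(customerId: String, orderId: String, itemId: String, orderType: String, orderValue: Int)

def customersWithTooManyFlashSales(orders: Seq[Order], limit: Int = 5): Set[String] =
  orders
    .filter(_.orderType == "Flash Sale")          // keep only flash sales
    .groupBy(_.customerId)                        // bucket them per customer
    .collect { case (customerId, flashSales) if flashSales.size > limit => customerId }
    .toSet                                        // customers over the limit
```

The Spark version above does exactly this, but distributed across the cluster, and then turns an empty or non-empty result into a Success or Error RawCheckResult.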

Checks on a pair of datasets

Let’s imagine we have a machine learning algorithm that predicts which item each customer will order next. It returns another Dataset[Order] containing the predicted orders.

We may want to compare metrics on our predicted orders with metrics on our original orders. Let’s say that we expect to have an entry in our predicted orders for every customer that has had a previous order. We could check this using Flare as follows:

```scala
val predictedOrdersChecks = ChecksSuite(
  "orders",
  dualDsChecks = Map(
    DescribedDsPair(DescribedDs(orders, "orders"), DescribedDs(predictedOrders, "predictedOrders")) ->
      Seq(
        DualMetricCheck(
          CountDistinctValuesMetric(List("customerId")),
          CountDistinctValuesMetric(List("customerId")),
          "predicted orders present for every customer",
          MetricComparator.metricsAreEqual
        )
      )
  )
)
```

We can pass in dualDsChecks to a ChecksSuite. Here we describe the datasets we want to compare, the metrics we want to calculate for each of those datasets, and a MetricComparator which describes how those metrics should be compared. In this case we want the number of distinct customerIds in each dataset to be equal.
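
Conceptually, a metric comparator pairs a descriptive name with a function that decides whether the two computed metric values are acceptable together. A simplified pure-Scala sketch of that idea (this type is my illustration, not data-flare’s actual MetricComparator definition):

```scala
// Simplified stand-in for the comparator concept: a named binary predicate
// over two metric values of the same type.
final case class SimpleMetricComparator[T](name: String, compare: (T, T) => Boolean)

// Equality comparison, as used for the distinct-customerId check above.
val metricsAreEqual: SimpleMetricComparator[Long] =
  SimpleMetricComparator("metricsAreEqual", (a, b) => a == b)
```

A comparator shaped like this could also express looser relationships, for example allowing the predicted dataset’s metric to be within some tolerance of the original’s.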

What happens when you run your checks?

When you run your checks, all metrics are calculated in a single pass over each dataset, and the check results are calculated and returned. You can then decide for yourself how to handle those results. For example, if one of your checks gives an error you could fail the Spark job, or send a failure notification.
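
One way to act on the results might look like the following. Note that SuiteResult and SuiteStatus here are stand-in types I have invented for illustration; data-flare returns its own richer result objects, so treat this only as the shape of the idea:

```scala
// Hypothetical sketch of handling check results: fail the job on error,
// log and continue on success. These types are illustrative assumptions,
// not data-flare's real API.
sealed trait SuiteStatus
case object SuitePassed extends SuiteStatus
case object SuiteFailed extends SuiteStatus

final case class SuiteResult(status: SuiteStatus, report: String)

def handleResults(result: SuiteResult): Unit = result.status match {
  case SuiteFailed => throw new RuntimeException(s"Data quality checks failed: ${result.report}")
  case SuitePassed => println(s"Data quality checks passed: ${result.report}")
}
```

Throwing from the driver fails the Spark job; alternatively, the failure branch could send a notification instead of (or before) throwing.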

What else can you do?

  • Store your metrics and check results by passing a metricsPersister and qcResultsRepository to your ChecksSuite (Elasticsearch is supported out of the box, and it is extensible to support any data store)
  • Graph metrics over time in Kibana so you can spot trends
  • Write arbitrary checks for pairs of datasets

For more information check out the documentation and the code!

Translated from: https://medium.com/swlh/why-i-built-an-opensource-tool-for-big-data-testing-and-quality-control-182a14701e8d
