Partnering for data quality
Author Vlad Rișcuția is joined for this article by co-authors Wayne Yim and Ayyappan Balasubramanian.
Why data quality?
Data quality is a critical aspect of ensuring high quality business decisions. One estimate puts the cost of poor data quality in the United States alone at $3.1 trillion per year, equating to approximately 16.5 percent of GDP.¹ For a business such as Microsoft, where data-driven decisions are ingrained within the fabric of the company, ensuring high data quality is paramount. Not only is data used to drive, steer, and grow the Microsoft business from a tactical and strategic perspective, but there are also regulatory obligations to produce accurate data for quarterly financial reporting.
History of DataCop
In the Experiences and Devices (E+D) division at Microsoft, a central data team called IDEAs (Insights Data Engineering and Analytics) generates key business metrics that are used to grow and steer the business. As one of its first undertakings, the team created the Office 365 Commercial Monthly Active User (MAU) measure to track the usage and growth of Office 365. This was a complicated endeavor due to the sheer scale of data, the number of Office products and services involved, and the heterogeneous nature of the data pipelines across different products and services. In addition, many other business metrics, tracking the growth and usage of all Office products and services, also needed to be created.
In the process of creating these critical business metrics, it was clear that generating them at scale and in a reliable way with high data quality was of the utmost importance, as key tactical and strategic business decisions would be based on them. In addition, because of the team’s charge to generate key metrics for release with quarterly earnings, producing high quality data was also a regulatory requirement.
The IDEAs team formed a data quality team consisting of program management, engineering, and data science representatives, and set out to investigate internal and external data quality solutions. The team examined internal data quality systems and researched public whitepapers from other companies that work with huge amounts of data. Members of the team also spent considerable time with LinkedIn, learning about its data quality system, "Data Sentinel,"² to potentially leverage what LinkedIn had built, since LinkedIn had already invested heavily in developing Data Sentinel and is also part of Microsoft.
The vision for a data quality platform in IDEAs was that it would be extensible, scalable, able to work with the multiple data fabrics involved, and be leveraged by the wider data science community at Microsoft. For example, data scientists and data analysts should be able to write data quality checks in languages familiar to them such as Python, R, and Scala, among others, and have these data quality checks operate reliably at scale.
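As an illustration of that vision (a hypothetical sketch, not DataCop's actual API), a minimal completeness check authored in Python might look like this:

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def completeness_check(rows, required_columns, min_rows=1):
    """Fail if the dataset is empty or any required column has missing values.

    rows is a list of dicts; column and threshold names here are illustrative.
    """
    if len(rows) < min_rows:
        return CheckResult("completeness", False,
                           f"expected >= {min_rows} rows, got {len(rows)}")
    for col in required_columns:
        missing = sum(1 for r in rows if r.get(col) in (None, ""))
        if missing:
            return CheckResult("completeness", False,
                               f"column '{col}' has {missing} missing values")
    return CheckResult("completeness", True, "ok")
```

A data scientist could author a check like this in a familiar language and let the platform handle scheduling it reliably at scale.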
Another key requirement was to have the data quality platform function as a DaaS, or “Data as a Service,” resulting in the need to apply the same “service rigor” in engineering, operations, and processes that were used to create and operate Office 365, the largest SaaS in the world. This meant having very high engineering standards around change management, monitoring, security controls, and auditability, and tightly integrating with Microsoft incident management systems to ensure that systems operate with high availability, efficiency, and security.
In the end, the team decided to build its own extensible data quality system from scratch in order for it to function with the scale and reliability of a DaaS and for it to interface with other internal Microsoft data systems. The initial functional specification was written in late 2018, and by early 2019 DataCop was born. Today, DataCop is part of the DataHub platform that also consists of Data Build and Data Catalog. Data Build generates the datasets required by the business in a compliant and scalable way, and Data Catalog is a search store for all assets, surfacing metadata such as data quality scores from DataCop as well as access and privacy information. Future articles will describe how Data Catalog and Data Build are used to generate the metrics and insights that drive, steer, and grow the E+D business and serve as critical components of the data quality journey.
Architecture
DataCop is designed with the mindset that no one team can solve this challenge on its own. The data ecosystem at Microsoft consists of multiple data fabrics, with data arriving at latencies ranging from minutes to a month. The system must be flexible and simple enough for other developers across Microsoft to add plugins and workers for the data fabrics or quality checks they want to build on. As a result, DataCop was built as a distributed message broker based on Azure Service Bus, with quality check results stored in Cosmos DB.
Messages in the message broker must be self-contained, allowing workers to process them independently. This allows messages to come from the Orchestrator, which runs scheduled checks, or from an Azure Data Factory (ADF) pipeline itself. Every time a data check or new fabric needs to be added, the developer can simply implement an override and develop their own worker process without affecting the rest of the system. The Azure team leveraged this extensibility to build on DataCop quickly, as described below.
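The self-contained message idea can be sketched as follows. The field names are illustrative assumptions, not DataCop's real message contract; in production the JSON payload would travel over an Azure Service Bus queue:

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class QualityCheckMessage:
    """A self-contained quality check request (schema is hypothetical)."""
    check_id: str
    dataset_path: str   # where the worker finds the data
    fabric: str         # e.g. "databricks", "azuresql", "adx"
    check_type: str     # e.g. "completeness", "anomaly"
    parameters: dict    # everything the check needs, so no callback is required

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(payload: str) -> "QualityCheckMessage":
        return QualityCheckMessage(**json.loads(payload))
```

Because each message carries all the context a worker needs, the same message shape can be produced by the Orchestrator on a schedule or emitted directly by an ADF pipeline.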
Workers run today as Azure WebJobs. Workers typically leverage another compute service in Azure, such as Azure Databricks or Azure SQL, to execute quality checks against the actual data. Workers are lightweight, used only to determine whether checks succeed, which makes Azure WebJobs a perfect fit for running them. For consistency, the Orchestrator is hosted as a WebJob as well. The Orchestrator is a time-triggered WebJob that generates the sets of quality checks that need to be executed and places them in the respective worker-specific Service Bus queues.
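The Orchestrator's routing step might be sketched like this. The queue names and fabric keys are hypothetical, and the real code would enqueue to Azure Service Bus rather than in-memory lists:

```python
from collections import defaultdict

# Hypothetical mapping from data fabric to its worker-specific queue.
QUEUE_BY_FABRIC = {
    "databricks": "datacop-worker-databricks",
    "azuresql": "datacop-worker-azuresql",
    "adx": "datacop-worker-adx",
}


def route_checks(checks):
    """Group due checks into their worker-specific queues.

    checks is a list of dicts, each with at least a 'fabric' key.
    Returns {queue_name: [check, ...]}.
    """
    queues = defaultdict(list)
    for check in checks:
        queue = QUEUE_BY_FABRIC.get(check["fabric"])
        if queue is None:
            raise ValueError(f"no worker registered for fabric '{check['fabric']}'")
        queues[queue].append(check)
    return dict(queues)
```

Keeping one queue per worker type is what lets a new fabric be added by registering a new queue and worker, without touching existing workers.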
The next important part of any data quality system is alerting. All Microsoft services use IcM, the company-wide incident management system. Data alerts are not like service alerts: Data arrives at higher latency than in typical services, and bad data can sometimes be recovered. If bad data needs to be restated, an incident can potentially stay open until the restatement is complete. So, alert suppression is configured to handle a very different set of cases — data unavailable for x days because of an upstream issue should result in a single alert, and alerts for data unavailable downstream because of a common upstream issue should be suppressed.
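One way to implement the downstream-suppression rule described above is to alert only on root-cause failures: datasets whose own upstream dependencies are healthy. A minimal sketch, assuming a simple dependency map (the dataset names are made up):

```python
def suppress_alerts(failures, upstream_of):
    """Return the subset of failed datasets that should actually page.

    failures: set of dataset names whose checks failed.
    upstream_of: dict mapping a dataset to its upstream dependencies.
    A dataset is a root cause if none of its upstream dependencies also
    failed; downstream failures with a failing upstream are suppressed.
    """
    roots = set()
    for dataset in failures:
        deps = upstream_of.get(dataset, [])
        if not any(dep in failures for dep in deps):
            roots.add(dataset)
    return roots
```

With this rule, a telemetry outage that breaks ten downstream datasets raises one alert on the telemetry dataset rather than eleven.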
This is a good place to touch upon another important topic in the data quality landscape: anomaly detection. Data volumes and metrics change often and are prone to seasonality. An anomaly detection system that can handle seasonality makes it possible to move beyond monitoring raw data volumes and daily trends to a more sophisticated system. DataCop leverages the Azure Anomaly Detector APIs to monitor completeness statistics such as file size, along with a few key metrics across multiple dimensions. This is a work in progress, with further updates to come.
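DataCop itself calls the Azure Anomaly Detector APIs; as a stand-in illustration of why seasonality matters, the sketch below compares today's volume against the median for the same weekday, so a quiet weekend doesn't trip a naive daily threshold. This is not the Azure service's algorithm, just a simple seasonality-aware baseline:

```python
from statistics import median


def is_volume_anomaly(history, today_value, weekday, tolerance=0.5):
    """Flag today's volume if it deviates from the same-weekday baseline.

    history: list of (weekday, value) pairs from past days.
    weekday: 0-6, today's day of week.
    tolerance: allowed relative deviation from the weekday median.
    """
    same_day = [v for d, v in history if d == weekday]
    if not same_day:
        return False  # not enough history to judge
    baseline = median(same_day)
    if baseline == 0:
        return today_value != 0
    return abs(today_value - baseline) / baseline > tolerance
```

A naive "alert if volume drops 50 percent from yesterday" rule would fire every Saturday; comparing like-for-like days keeps the alert meaningful.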
It was apparent that developers needed a way to quickly author and deploy data quality checks. As a result, we integrated with the Azure DevOps workflow to automatically deploy these data quality monitors. Today, the IDEAs team runs close to 2,000 tests on about 750 key datasets, including externally reported financial metrics.
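A declarative monitor definition deployed through such a pipeline might look like the following. The schema is an assumption for illustration, not DataCop's published format; a validation step like this would run in the Azure DevOps pipeline before deployment:

```python
# Illustrative declarative monitor definition, checked into source control.
MONITOR = {
    "name": "office-mau-completeness",
    "dataset": "curated/office/mau",
    "schedule": "0 6 * * *",            # daily at 06:00 UTC
    "checks": [
        {"type": "completeness", "min_rows": 1_000_000},
        {"type": "freshness", "max_age_hours": 26},
    ],
    "severity": 2,                      # incident severity raised on failure
}

REQUIRED_FIELDS = {"name", "dataset", "schedule", "checks"}


def validate_monitor(monitor: dict) -> list:
    """Return a list of validation errors; empty means deployable."""
    errors = [f"missing field: {f}"
              for f in sorted(REQUIRED_FIELDS - monitor.keys())]
    if not monitor.get("checks"):
        errors.append("at least one check is required")
    return errors
```

Validating definitions at deploy time means a malformed monitor fails the pipeline rather than silently never running.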
Partnership between M365 and Azure
The Customer Growth and Analytics team (CGA) is a centralized data science team in the Cloud+AI division at Microsoft. The team’s mission is to learn from customers and empower them to make the most of Azure services.³
Last year, as CGA’s scope was growing, an effort began to standardize technologies. Having a smaller number of technologies upon which CGA’s data platform is built makes it easier to move engineering resources as needed, share knowledge, and in general increase the reliability of the overall system. The use of Azure PaaS offerings reduced the need for writing custom code. The team standardized on Azure Data Factory for data movement and Azure Monitor for monitoring, among others. Unfortunately, at this writing, Azure doesn’t offer a PaaS data quality testing framework.
CGA realized the need for a reliable and scalable data quality solution, especially as the data platform evolved to support more and more production workloads where data issues can have large impacts, and so evaluated multiple options.
CGA tried out several data quality testing solutions against its code base, but quickly realized they were built for smaller projects, made some rigid assumptions, and would require significant investment to scale out to cover the entire platform.
Discussions with other data science organizations within the company about how they were handling this led to LinkedIn and an introduction to Data Sentinel. Its main limitation is that it runs exclusively on Spark. CGA must support multiple data fabrics: In some cases, different compute scenarios require the specific best solution for the job, such as Azure Data Explorer for analytics or Azure Data Lake Storage and Azure Machine Learning for ML workloads. In other cases, data ingested from other teams comes from a variety of storage locations: Azure SQL, blob storage, and Azure Data Lake Storage gen1, among others.
Further outreach led to discussions with the M365 data science team and an introduction to DataCop, the solution described in this article. Its capabilities were compelling: test scheduling, integration with the standard Microsoft alerting platform, and a declarative way of describing tests. Its main limitation was that DataCop didn't support Azure Data Explorer.
Because Azure Data Explorer (ADX) is core to CGA’s platform, this could have been a showstopper, but in true One Microsoft spirit, the DataCop team was more than happy to work with CGA to light up the missing capability. The teams agreed to treat this as an “internal open source” project, with CGA contributing code to the DataCop solution from which both teams could benefit. Due to its flexible design, adding ADX capabilities was significantly easier than the alternative (investing in a home-grown solution).
CGA deployed an instance of DataCop in its environment and over the following months made a big data quality push, including training the team on how to author tests and increasing test coverage to 100 percent of the datasets in CGA's platform. At the time of writing, CGA has around 400 tests covering close to 300 key datasets. Over the past 30 days, CGA ran more than 4,000 tests, identifying and quickly acting to mitigate multiple data issues that would have caused significant anomalies in CGA's systems. Onboarding to DataCop saved significant engineering effort, which was refocused on test authoring.
Closing thoughts/summary
This article described DataCop, the data quality solution developed by the M365 data team in partnership with the Azure data team.
- Data quality is a critical aspect of a business, both for informing decisions and for regulatory obligations.
- The diverse data fabrics in use and their huge scale led to development of DataCop, a data quality solution for supporting the Microsoft business.
- DataCop is a cloud-native Azure solution, consisting of a set of web jobs that communicate via service bus.
- The plug-in architecture allowed the CGA team to quickly develop an Azure Data Explorer test runner and expand the scope of DataCop from the M365 team to also cover the Azure business.
- Today, DataCop runs hundreds of tests every day to ensure the quality of data throughout multiple systems on both teams.
Vlad Rișcuția is on LinkedIn.
[1] The Four V’s of Big Data, IBM, 2016.
[2] Data Sentinel: Automating Data Validation, LinkedIn, March 2020.
[3] Using Azure to Understand Azure, by Ron Sielinski, January 2020.
Translated from: https://medium.com/data-science-at-microsoft/partnering-for-data-quality-dc9123557f8b