DataCleaner (4.5), Chapter 1

Part 1. Introduction to DataCleaner

  • What is data quality (DQ)?
  • What is data profiling?
  • What is a datastore?
      • Composite datastore
  • What is data monitoring?
  • What is master data management (MDM)?

What is data quality (DQ)?

Data Quality (DQ) is a concept and a business term covering the quality of the data used for a particular purpose. The term is often applied to the quality of data used in business decisions, but it may also refer to the quality of data used in research, campaigns, processes and more.

Working with data quality varies a lot from project to project, just as the data quality issues themselves vary a lot. Examples of data quality issues include:

      1. Completeness of data
      2. Correctness of data
      3. Duplication of data
      4. Uniformity/standardization of data
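Two of the issues listed above, completeness and duplication, are easy to measure directly. The following is a minimal Python sketch under stated assumptions: the sample records, the field names and the helper functions `completeness` and `duplicate_keys` are hypothetical illustrations, not part of DataCleaner.

```python
from collections import Counter

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "name": "Ann Lee", "email": "ann@example.com"},
    {"id": 2, "name": "Bob Ray", "email": None},
    {"id": 3, "name": "Ann Lee", "email": "ann@example.com"},
    {"id": 4, "name": "", "email": "cy@example.com"},
]


def completeness(rows, field):
    """Fraction of rows where the field is present and non-empty."""
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows) if rows else 0.0


def duplicate_keys(rows, fields):
    """Return the key tuples that occur more than once across the given fields."""
    counts = Counter(tuple(row.get(f) for f in fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]


name_completeness = completeness(records, "name")
email_completeness = completeness(records, "email")
duplicates = duplicate_keys(records, ["name", "email"])
```

Correctness and standardization usually need reference data or domain rules, so they are harder to reduce to a single ratio like this.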

A less technical definition of high-quality data is that data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J. M. Juran).

Data quality analysis (DQA) is the (human) process of examining the quality of data for a particular process or organization. A DQA includes both technical and non-technical elements. For example, to do a good DQA you will probably need to talk to users, business people, partner organizations and maybe customers. This is needed to assess what the goal of the DQA should be.

From a technical viewpoint, the main task in a DQA is the data profiling activity, which will help you discover and measure the current state of affairs in the data.

What is data profiling?

Data profiling is the activity of investigating a datastore to create a 'profile' of it. With a profile of your datastore you will be a lot better equipped to actually use and improve it.
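To make the idea of a 'profile' concrete, here is a minimal Python sketch that summarizes every column of a small table. It is an illustration only: `profile_column`, `profile_table` and the sample rows are hypothetical, and the chosen measures (nulls, distinct values, length range) are just a few of the things a real profiling tool reports.

```python
def profile_column(values):
    """Summarize one column: row count, nulls, distinct values, length range."""
    present = [v for v in values if v is not None and v != ""]
    lengths = [len(str(v)) for v in present]
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
    }


def profile_table(rows):
    """Profile every column found in a list of dict rows."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical rows; "country" is a column one might assume to be clean.
rows = [
    {"country": "DK", "zip": "2100"},
    {"country": "dk", "zip": "8000"},
    {"country": "Denmark", "zip": None},
]
profile = profile_table(rows)
```

Here the `country` column has three distinct spellings for what is probably one value, which is the kind of finding that only shows up when supposedly correct columns are profiled too.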

The way you do profiling often depends on whether you already have some ideas about the quality of the data or whether you are unfamiliar with the datastore at hand. Either way we recommend an explorative approach: even if you think there is only a certain number of issues you need to look for, it is our experience (and the reasoning behind a lot of the features of DataCleaner) that it is just as important to check those items in the data that you think are correct!

Typically it's cheap to include a bit more data in your analysis, and the results just might surprise you and save you time!

DataCleaner comprises (amongst other things) a desktop application for doing data profiling on just about any kind of datastore.

 

What is a datastore?

A datastore is the place where data is stored. Usually enterprise data lives in relational databases, but there are numerous exceptions to that rule.

To encompass different sources of data, such as databases, spreadsheets, XML files and even standard business applications, we employ the umbrella term datastore.

DataCleaner is capable of retrieving data from a very wide range of datastores. Furthermore, DataCleaner can also update the data in most of these datastores.

A datastore can be created in the UI or via the configuration file. You can create a datastore from many types of source, such as CSV, Excel, Oracle Database, MySQL, etc.

[Figure: click to register a new datastore]

Composite datastore

A composite datastore contains multiple datastores. The main advantage of a composite datastore is that it allows you to analyze and process data from multiple sources in the same job.
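The benefit of working across sources in one job can be sketched in plain Python. This is not how DataCleaner implements composite datastores; it is a hypothetical illustration in which an inline CSV text and an in-memory SQLite table stand in for two separate datastores, and all table and column names are invented for the example.

```python
import csv
import io
import sqlite3

# Source 1: a CSV "datastore" (inline text stands in for a file).
csv_text = "customer_id,name\n1,Ann Lee\n2,Bob Ray\n"
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Source 2: a relational "datastore" (in-memory SQLite for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 20.0), (1, 5.5), (3, 9.0)]
)
order_rows = conn.execute("SELECT customer_id, total FROM orders").fetchall()
conn.close()

# One job over both sources: total order value per known customer,
# plus orders that reference a customer missing from the CSV.
names = {int(row["customer_id"]): row["name"] for row in csv_rows}
totals = {name: 0.0 for name in names.values()}
orphan_orders = []
for customer_id, total in order_rows:
    if customer_id in names:
        totals[names[customer_id]] += total
    else:
        orphan_orders.append((customer_id, total))
```

The orphaned order for customer 3 is a cross-source issue: neither source looks wrong on its own, and the problem only becomes visible when both are analyzed together.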

 

What is data monitoring?

We've argued that data profiling is ideally an explorative activity. Data monitoring typically isn't! The measurements that you do when profiling often need to be continuously checked so that your improvements are enforced over time. This is what data monitoring is typically about.

Data monitoring solutions come in different shapes and sizes. You can set up your own batch of scheduled jobs that run every night. You can build alerts around them that send you an email if a particular measure goes beyond its allowed thresholds, or in some cases you can attempt to rule out the issue entirely by applying First-Time-Right (FTR) principles that validate data at entry time, for example in data registration forms.
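The threshold-based alerting described above can be sketched as a small check that a scheduled job might run after each measurement. This is a minimal Python sketch under stated assumptions: `check_thresholds`, the metric names and the limits are all hypothetical, and actually sending the email is left out.

```python
def check_thresholds(metrics, thresholds):
    """Return alert messages for metrics outside their allowed (low, high) range."""
    alerts = []
    for name, (low, high) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            alerts.append(f"{name}: no measurement available")
        elif not (low <= value <= high):
            alerts.append(f"{name}: {value} outside [{low}, {high}]")
    return alerts


# Hypothetical nightly measurements and allowed ranges.
thresholds = {"email_completeness": (0.95, 1.0), "duplicate_rate": (0.0, 0.02)}
nightly_metrics = {"email_completeness": 0.91, "duplicate_rate": 0.01}
alerts = check_thresholds(nightly_metrics, thresholds)
```

A missing measurement is reported as an alert as well, since a job that silently stops producing metrics is itself a monitoring failure.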

As of version 3, DataCleaner also includes a monitoring web application, dubbed "DataCleaner monitor". The monitor is a server application that supports orchestrating and scheduling jobs, as well as exposing metrics through web services and through interactive timelines and reports. It also supports the configuration and job-building process through wizards and management pages for all the components of the solution. As such, we like to say that the DataCleaner monitor provides a good foundation for the infrastructure needed in a Master Data Management hub.

What is master data management (MDM)?

Master data management (MDM) is a very broad term and is realized in a variety of ways. Within the scope of this document it serves more as a context for data quality than as an activity that we actually target with DataCleaner per se.

The overall goal of MDM is to manage the important data of an organization. By "master data" we refer to "a single version of the truth", i.e. not the data of a particular system but, for example, all the customer data or product data of a company. Usually this data is dispersed over multiple datastores, so an important part of MDM is the process of unifying the data into a single model.

Another very important issue to handle in MDM is the quality of data. If you simply gather, say, "all customer data" from all systems in an organization, you will most likely see a lot of data quality issues. There will be a lot of duplicate entries, there will be variances in the way that customer data is filled in, and there will be different identifiers and even different levels of granularity for defining "what is a customer?". In the context of MDM, DataCleaner can serve as the engine to cleanse, transform and unify data from multiple datastores into the single view of the master data.
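The unification step can be illustrated with a small Python sketch that merges customer rows from two systems into one record per customer. It rests on loud assumptions: matching on a normalized email address is a deliberately naive rule chosen for the example, and `normalize_email`, `unify_customers` and the source names are hypothetical rather than anything DataCleaner provides.

```python
def normalize_email(value):
    """Lower-case and trim an email so that variants compare equal."""
    return value.strip().lower() if value else None


def unify_customers(sources):
    """Merge rows from several sources into one record per normalized email."""
    master = {}
    for source_name, rows in sources.items():
        for row in rows:
            key = normalize_email(row.get("email"))
            if key is None:
                continue  # no matching key; a real job would route these for review
            record = master.setdefault(
                key, {"email": key, "name": None, "sources": []}
            )
            # Keep the first non-empty name seen for this customer.
            if record["name"] is None and row.get("name"):
                record["name"] = row["name"].strip()
            record["sources"].append(source_name)
    return master


# Hypothetical systems that fill in customer data differently.
sources = {
    "crm": [{"name": "Ann Lee", "email": "Ann@Example.com"}],
    "billing": [
        {"name": "", "email": " ann@example.com "},
        {"name": "Bob Ray", "email": "bob@example.com"},
    ],
}
master = unify_customers(sources)
```

Real master data matching usually has to cope with missing or conflicting identifiers, so a single exact key like this is only a starting point.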

 

Reposted from: https://www.cnblogs.com/xiaotao726/p/8519993.html
