Data Profiling in Data Science


Data Profiling

Data profiling is the process of examining data from an existing source and summarizing information about that data. You profile data to determine its accuracy, completeness, and validity. Data profiling is done for many reasons, but it is most commonly part of assessing data quality as a component of a larger project. It is often combined with an ETL (Extract, Transform, Load) process that moves data from one system to another. Done properly, ETL and data profiling together can cleanse, enrich, and move quality data to a target location.

For example, you may want to perform data profiling when migrating from a legacy system to a new system. Data profiling helps identify data quality issues that need to be handled in code when you move data into the new system. Or you may want to perform data profiling as you move data into a data warehouse for business analytics. When data is moved into a data warehouse, ETL tools are often used to move it. Data profiling helps identify which data quality issues must be fixed in the source and which can be fixed during the ETL process.

Why profile data?

Data profiling lets you answer the following questions about your data:

  • Is the data complete? Are there blank or null values?

  • Is the data unique? How many distinct values are there? Is the data duplicated?

  • Are there anomalous patterns in your data? What is the distribution of patterns in your data?

  • Are these the patterns I expect?

  • What range of values exists, and are they expected? What are the maximum, minimum, and average values for given data? Are these the ranges I expect?

Answering these questions helps you make sure you are maintaining quality data, which, as companies are increasingly realizing, is the cornerstone of a thriving business.
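As a rough illustration of the questions above, the following sketch summarizes a single column using only the Python standard library. The function name `summarize_column` and the rule that `None` and blank strings count as missing are assumptions made for this example, not part of the original article.

```python
from statistics import mean

def summarize_column(values):
    """Summarize one column: completeness, uniqueness, and numeric range."""
    # Assumption: None and empty/whitespace-only strings count as missing.
    present = [v for v in values
               if v is not None and not (isinstance(v, str) and v.strip() == "")]
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    summary = {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "duplicated": len(present) - len(set(present)),
    }
    if numeric:
        summary.update(min=min(numeric), max=max(numeric), mean=mean(numeric))
    return summary

# Hypothetical sample column with two missing entries and one suspicious age.
ages = [34, 29, None, 41, 29, "", 120]
profile = summarize_column(ages)
print(profile)
```

A maximum of 120 in an age column is the kind of result that prompts the "is this the range I expect?" question.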

How does one profile data?

Data profiling can be performed in different ways, but there are roughly three base methods used to analyze the data.

Column profiling counts the number of times each value appears within each column of a table. This method helps uncover the patterns within your data.
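A minimal sketch of column profiling, assuming a column is just a Python list. The helpers `value_counts`, `pattern_of`, and `pattern_counts` are hypothetical names. Reducing each value to a "shape" (digits become `9`, letters become `A`) is one common way to surface format patterns, not the only one.

```python
import re
from collections import Counter

def value_counts(column):
    """Count how many times each value appears in a column."""
    return Counter(column)

def pattern_of(value):
    """Reduce a value to its shape: digits -> 9, letters -> A, others kept."""
    text = re.sub(r"[0-9]", "9", str(value))
    return re.sub(r"[A-Za-z]", "A", text)

def pattern_counts(column):
    """Count how many values share each shape."""
    return Counter(pattern_of(v) for v in column)

# Hypothetical phone-number column with inconsistent formatting.
phones = ["555-0101", "555-0102", "5550103", "555-0101", "n/a"]
counts = value_counts(phones)
shapes = pattern_counts(phones)
print(counts.most_common(1))
print(shapes)
```

Here the value counts expose a repeated number, while the shape counts show that most values follow one format and a few do not.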

Cross-column profiling looks across columns to perform key and dependency analysis. Key analysis scans collections of values in a table to locate a potential primary key. Dependency analysis determines the dependent relationships within a data set. Together, these analyses determine the relationships and dependencies within a table.
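Key and dependency analysis can be sketched as follows, assuming a table is a list of dictionaries. `candidate_keys` and `depends_on` are hypothetical helper names, and the brute-force search over column combinations is only practical for small tables. Note that a result holds for the data examined and does not prove the relationship holds in general.

```python
from itertools import combinations

def candidate_keys(rows, columns, max_size=2):
    """Return minimal column combinations whose values are unique across rows."""
    keys = []
    for size in range(1, max_size + 1):
        for combo in combinations(columns, size):
            # Skip supersets of a key already found; they are not minimal.
            if any(set(k) <= set(combo) for k in keys):
                continue
            seen = {tuple(row[c] for c in combo) for row in rows}
            if len(seen) == len(rows):
                keys.append(combo)
    return keys

def depends_on(rows, determinant, dependent):
    """Check that each determinant value maps to exactly one dependent value."""
    mapping = {}
    for row in rows:
        if mapping.setdefault(row[determinant], row[dependent]) != row[dependent]:
            return False
    return True

# Hypothetical orders table.
orders = [
    {"order_id": 1, "customer": "ann", "zip": "10001", "city": "New York"},
    {"order_id": 2, "customer": "bob", "zip": "60601", "city": "Chicago"},
    {"order_id": 3, "customer": "ann", "zip": "10001", "city": "New York"},
    {"order_id": 4, "customer": "cat", "zip": "10001", "city": "New York"},
]
keys = candidate_keys(orders, ["order_id", "customer", "zip", "city"])
print(keys)
print(depends_on(orders, "zip", "city"))
print(depends_on(orders, "city", "customer"))
```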

Cross-table profiling looks across tables to identify potential foreign keys. It also attempts to determine the similarities and differences in syntax and data types between tables, to determine which data might be redundant and which could be mapped together.
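One simple way to look for potential foreign keys is an inclusion check: a child column is a candidate if its values all appear in a unique column of the parent table. The sketch below assumes tables are lists of dictionaries. `foreign_key_candidates` and its `min_coverage` parameter are hypothetical, introduced only for illustration.

```python
def foreign_key_candidates(child_rows, parent_rows, min_coverage=1.0):
    """Return (child_col, parent_col, coverage) where child values occur in a unique parent column."""
    candidates = []
    parent_cols = list(parent_rows[0].keys()) if parent_rows else []
    child_cols = list(child_rows[0].keys()) if child_rows else []
    for p in parent_cols:
        parent_values = [row[p] for row in parent_rows]
        parent_set = set(parent_values)
        if len(parent_set) != len(parent_values):
            continue  # a referenced column is expected to be unique
        for c in child_cols:
            child_values = [row[c] for row in child_rows if row[c] is not None]
            if not child_values:
                continue
            hits = sum(1 for v in child_values if v in parent_set)
            coverage = hits / len(child_values)
            if coverage >= min_coverage:
                candidates.append((c, p, coverage))
    return candidates

# Hypothetical parent and child tables.
customers = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}, {"id": 3, "name": "cat"}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
    {"order_id": 12, "customer_id": 1},
]
fks = foreign_key_candidates(orders, customers)
print(fks)
```

Lowering `min_coverage` below 1.0 would also report near-matches, which can point to orphaned rows in the child table.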

Rule validation is sometimes considered the final step in data profiling. It is a proactive step of adding rules that check the correctness and integrity of the data entered into the system.
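Rule validation might look like the sketch below, where each rule is a named predicate applied to every row. The rules themselves (an `@` sign in the email, an age between 0 and 120) and the names `RULES` and `validate` are illustrative assumptions. Real rules would come from the business requirements of the system.

```python
# Hypothetical rules: each maps a description to a check on one row.
RULES = {
    "email has an @ sign": lambda r: isinstance(r.get("email"), str) and "@" in r["email"],
    "age between 0 and 120": lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
}

def validate(rows, rules):
    """Return (row_index, rule_name) for every rule that a row fails."""
    return [(i, name)
            for i, row in enumerate(rows)
            for name, check in rules.items()
            if not check(row)]

# Hypothetical incoming records.
rows = [
    {"email": "ann@example.com", "age": 34},
    {"email": "bob.example.com", "age": 29},
    {"email": "cat@example.com", "age": -5},
]
failures = validate(rows, RULES)
print(failures)
```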

These methods can be performed manually by an analyst, or by a service that automates these queries.

Data profiling challenges

Data profiling is often difficult because of the sheer volume of data you need to profile. This is especially true if you are looking at a legacy system. A legacy system might have years of older data with thousands of errors. Experts recommend that you segment your data as part of your data profiling process so that you can see the forest for the trees.
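Segmenting can be as simple as grouping rows by one attribute and profiling each group separately. The sketch below assumes rows are dictionaries and segments them by a hypothetical `year` field. `profile_by_segment` is an illustrative name, and treating `None` and the empty string as missing is an assumption.

```python
from collections import defaultdict

def profile_by_segment(rows, segment_key, column):
    """Group rows by a segment and report row and missing-value counts per segment."""
    stats = defaultdict(lambda: {"rows": 0, "missing": 0})
    for row in rows:
        bucket = stats[row[segment_key]]
        bucket["rows"] += 1
        # Assumption: None and the empty string count as missing.
        if row.get(column) in (None, ""):
            bucket["missing"] += 1
    return dict(stats)

# Hypothetical records spanning an older and a newer period.
records = [
    {"year": 2019, "email": None},
    {"year": 2019, "email": ""},
    {"year": 2019, "email": "a@example.com"},
    {"year": 2023, "email": "b@example.com"},
    {"year": 2023, "email": "c@example.com"},
]
by_year = profile_by_segment(records, "year", "email")
print(by_year)
```

Splitting the counts this way shows whether errors are concentrated in older data or spread evenly across the whole set.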

If you perform data profiling manually, you need the expertise to run numerous queries and sift through the results to gain meaningful insights about your data, which can eat up precious resources. In addition, you will likely only be able to check a subset of your overall data, because going through the entire data set takes too long.

Translated from: https://www.includehelp.com/data-science/data-profiling.aspx
