Data Cleaning | Data Science

Data Cleaning

Data cleaning is the process of modifying data to ensure that it is correct, accurate, and meaningful. The definition may be simple, but data cleaning is used in many scenarios and covers a wide range of activities, all of which aim to improve the quality of your data. Usually, these tasks are accomplished by combining several different operations. This post discusses the most important data cleaning tasks.

Schema Matching and Data Standardization

Often, schema matching is the first task you need to perform. Its aim is to align the attributes coming from new datasets with the ones in your existing database.

Existing Customer Schema (Name, Country, Address, Phone)

Incoming Customer Schema (Country, City, Street, Apt, Phone)

To match these schemas and move forward with your data matching task, you need to devise a procedure that converts each tuple in the Incoming Customer Schema to the Existing Customer Schema.
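
A minimal sketch of such a conversion procedure, written in Python for the two customer schemas listed above. The class names, the sample record, and the "Street, Apt, City" address layout are assumptions made for illustration; the incoming schema has no Name attribute, so the sketch leaves it empty rather than guessing.

```python
from typing import NamedTuple, Optional


class IncomingCustomer(NamedTuple):
    country: str
    city: str
    street: str
    apt: str
    phone: str


class ExistingCustomer(NamedTuple):
    name: Optional[str]
    country: str
    address: str
    phone: str


def to_existing(row: IncomingCustomer) -> ExistingCustomer:
    """Map one incoming tuple onto the existing schema."""
    # Assumed address layout: "Street, Apt, City"; empty parts are skipped.
    parts = [row.street, row.apt, row.city]
    address = ", ".join(p.strip() for p in parts if p and p.strip())
    # The incoming schema carries no name, so it stays unknown here.
    return ExistingCustomer(name=None, country=row.country,
                            address=address, phone=row.phone)


# Hypothetical sample record.
incoming = IncomingCustomer("France", "Paris", "12 Rue de Rivoli", "Apt 4",
                            "+33 1 23 45 67 89")
converted = to_existing(incoming)
print(converted)
```

In practice the missing name would have to come from another source or be filled in later; the point here is only that every incoming tuple ends up with the attributes of the existing schema.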

Another scenario involves the same two schemas, but assume that the records about your customers do not contain postal codes. If you need to see how many customers there are for a specific code, it is crucial to have the correct zip values.

The same principles apply when you need to maintain your product catalog database. You should ensure that all dimensions of a product are expressed in the same units and that these values are not missing. If not, search queries will return incorrect results. The task that ensures all values follow the same convention is called data standardization. This is the task you should perform before other data cleaning activities such as data matching and data deduplication. These are by no means trivial tasks and, often, it is not feasible to perform them manually.
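
As an illustration of unit standardization for product dimensions, the sketch below converts widths to a single unit and flags values that are missing or cannot be converted. The choice of centimetres as the target unit, the conversion table, and the sample rows are all assumptions.

```python
from typing import Optional

# Conversion factors to centimetres (the target unit is an assumption).
TO_CM = {"mm": 0.1, "cm": 1.0, "m": 100.0, "in": 2.54}


def to_cm(value: Optional[float], unit: Optional[str]) -> Optional[float]:
    """Return the value in centimetres, or None when it cannot be standardized."""
    if value is None or unit is None:
        return None
    factor = TO_CM.get(unit.strip().lower())
    return None if factor is None else round(value * factor, 2)


# Hypothetical product rows: (product id, width, unit).
products = [("A1", 250.0, "mm"), ("B2", 10.0, "in"),
            ("C3", None, "cm"), ("D4", 0.5, "M")]

standardized = {pid: to_cm(width, unit) for pid, width, unit in products}
missing = [pid for pid, width in standardized.items() if width is None]

print(standardized)
print("needs review:", missing)
```

Returning None instead of raising keeps the pass over the catalog going, so every problematic row can be collected and reviewed in one place.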

Data Matching

The aim of record matching is to match each record from one dataset with the records from another dataset. Typically, you need to perform this task when you import new data. This way, you ensure that the new datasets do not introduce duplicate entities.

Consider a scenario where you need to import a new set of customer records into your business database. You should check whether the same customer is represented in both the incoming batch and the existing database, and keep only one record. Unfortunately, because of typing mistakes or differences in representation, the same record can look different in the two sources. As a result, it may not match on the important attributes, such as phone, address, and name.

The difficulty often increases for entries where the product description is a concatenation of more than one attribute. Thus, the goal of record matching is to find pairs of records, one from each of the two datasets, that correspond to the same entity.
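
One simple way to tolerate typing mistakes when comparing two records is approximate string similarity. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the averaging over name, address, and phone, the 0.85 threshold, and the sample records are assumptions chosen for illustration, not a recommended configuration.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_match(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Compare the average field similarity against an assumed threshold."""
    fields = ("name", "address", "phone")
    score = sum(similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)
    return score >= threshold


# Hypothetical records: the second has a typo and extra whitespace.
existing = {"name": "John Smith", "address": "12 Baker Street", "phone": "555-0101"}
incoming = {"name": "Jon Smith", "address": "12 Baker  Street", "phone": "555-0101"}
other = {"name": "Mary Jones", "address": "4 Elm Road", "phone": "555-0199"}

print(is_match(existing, incoming))
print(is_match(existing, other))
```

A real matcher would usually weight the fields differently and tune the threshold on labelled examples.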

The most important challenges you have to address here are:

  • Identify the criteria that establish that two records really do refer to the same real-world entity.

  • With the huge datasets available today, find the most efficient computation technique, one that can determine these matching pairs over large sets of data.
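
For the efficiency challenge, a common technique is blocking: only records that share a cheap key are compared in detail, instead of every possible pair. The sketch below is a minimal illustration; the blocking key (the last four digits of the phone number) and the sample data are assumptions.

```python
from collections import defaultdict


def block_key(record: dict) -> str:
    """Assumed blocking key: the last four digits of the phone number."""
    digits = "".join(ch for ch in record["phone"] if ch.isdigit())
    return digits[-4:]


def candidate_pairs(existing: list, incoming: list):
    """Yield only the cross-dataset pairs that share a blocking key."""
    blocks = defaultdict(list)
    for rec in existing:
        blocks[block_key(rec)].append(rec)
    for rec in incoming:
        for other in blocks.get(block_key(rec), []):
            yield other, rec


# Hypothetical data: 3 x 2 = 6 possible pairs without blocking.
existing_db = [
    {"name": "John Smith", "phone": "555-0101"},
    {"name": "Mary Jones", "phone": "555-0199"},
    {"name": "Ann Lee", "phone": "555-0142"},
]
new_batch = [
    {"name": "Jon Smith", "phone": "(555) 0101"},
    {"name": "Bob Ray", "phone": "555-0177"},
]

pairs = list(candidate_pairs(existing_db, new_batch))
print(len(pairs), "candidate pair(s) to compare in detail")
```

The trade-off is that a typo inside the blocking key itself hides a true match, so the key should be built from the most reliable attribute available.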

Fortunately, a few applications can help you overcome these obstacles. By using its smart fuzzy matching engine, our product is designed to find as many true matches and as few false matches as possible. Moreover, you can combine these results with the customizable knowledge base library.

Data Deduplication

Data deduplication aims to group the records in a dataset so that each group represents the same real-world entity. For best results, you should perform this process both when you populate the database for the first time and when you add new records. Compared with data matching, deduplication additionally involves grouping the matching records. These groups collectively partition the input dataset.

Consider an example where your database stores several records, such as:

  • Nikon D750 Camera

  • Nikon D750 SLR

  • Nikon D750 Digital SLR

This set contains different records that represent the same entity. Therefore, you should be able not only to match two of them, but to map all three records to the same real-world entity.
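
The grouping step can be sketched as clustering: pairs that look similar are linked, and links are followed transitively so that all three Nikon records end up in one group. The token-overlap (Jaccard) measure, the 0.5 threshold, and the extra Canon record are assumptions made for this illustration.

```python
def tokens(text: str) -> set:
    """Split a description into a set of lower-case words."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token overlap between two descriptions, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def deduplicate(records: list, threshold: float = 0.5) -> list:
    """Group records into clusters; similar pairs are merged transitively."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(records[i], records[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i, rec in enumerate(records):
        clusters.setdefault(find(i), []).append(rec)
    return list(clusters.values())


records = ["Nikon D750 Camera", "Nikon D750 SLR",
           "Nikon D750 Digital SLR", "Canon EOS 5D Camera"]
groups = deduplicate(records)
for group in groups:
    print(group)
```

Note that "Nikon D750 Camera" and "Nikon D750 Digital SLR" fall below the threshold when compared directly; they land in the same group only because both are linked to "Nikon D750 SLR". The resulting groups partition the input, as described above.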

Data Profiling

Since data cleaning is an interactive process, it is essential that you can assess the quality of your data. You should be able to do this both before and after the data cleaning process, so that you can verify its effectiveness. We call this procedure data profiling. Its most important goal is to ensure that your values match your expectations.

For example, you may expect a customer's name and address to uniquely identify every customer in your database. In that case, the number of unique tuples should be as close as possible to the total number of entries in your database.
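
This expectation can be checked with a simple uniqueness ratio: the number of distinct (name, address) tuples divided by the total number of rows. The sketch below assumes rows stored as dictionaries with hypothetical field names and sample data.

```python
def uniqueness_ratio(rows: list, key_fields: tuple) -> float:
    """Share of rows whose key tuple is distinct (1.0 means fully unique)."""
    if not rows:
        return 1.0
    keys = {tuple(row[f].strip().lower() for f in key_fields) for row in rows}
    return len(keys) / len(rows)


# Hypothetical customer rows; the first and third differ only in letter case.
customers = [
    {"name": "John Smith", "address": "12 Baker Street"},
    {"name": "Mary Jones", "address": "4 Elm Road"},
    {"name": "john smith", "address": "12 Baker Street"},
    {"name": "Ann Lee", "address": "9 Oak Avenue"},
]

ratio = uniqueness_ratio(customers, ("name", "address"))
print(f"unique (name, address) tuples: {ratio:.0%}")
```

A ratio well below 1.0 suggests duplicates remain; computing it before and after cleaning gives a rough measure of how effective the cleaning was.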

However, even though you can obtain subsets of this information through a few SQL queries, that approach is inefficient and tedious. Data Profiling/Statistics is an easy-to-use and powerful data profiling tool made to help you find patterns in your datasets. In addition, the module can check the quality of your data by examining value counts, types, formats, and completeness. The module provides a complete set of statistical information intended to help you clean your data.
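
To make these checks concrete, here is a minimal sketch of a column profile covering value counts, types, formats, and completeness. It is not the module mentioned above, whose interface is not described here; the function name, the five-digit zip pattern, and the sample values are assumptions.

```python
import re
from collections import Counter


def profile_column(values: list, pattern: str) -> dict:
    """Summarize one column: completeness, types, value counts, format matches."""
    present = [v for v in values if v not in (None, "")]
    regex = re.compile(pattern)
    return {
        "completeness": len(present) / len(values) if values else 0.0,
        "types": Counter(type(v).__name__ for v in present),
        "top_values": Counter(present).most_common(3),
        "format_ok": sum(1 for v in present
                         if isinstance(v, str) and regex.fullmatch(v)),
    }


# Hypothetical zip code column with a missing value, a typo, and a wrong type.
zip_codes = ["10001", "10001", None, "1000A", 94105, ""]
report = profile_column(zip_codes, r"\d{5}")

for key, value in report.items():
    print(key, "->", value)
```

Each entry of the report points at a different kind of problem: low completeness means missing values, mixed types mean inconsistent loading, and a low format count means values that do not follow the expected pattern.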

Translated from: https://www.includehelp.com/data-science/data-cleaning.aspx
