Data Analysis: Data Cleaning
Data Cleaning
Data cleaning is the process of modifying data to ensure that it is correct, accurate, and meaningful. The definition is simple, but data cleaning is used in many scenarios and covers a wide range of activities, all of which aim to improve the quality of your data. These tasks are usually accomplished by combining several operations. This post discusses the most important data cleaning tasks.
Schema Matching and Data Standardization
Schema matching is often the first task you have to perform. Its aim is to align the attributes coming from new datasets with the ones in your current database.
Existing Customer Schema (Name, Country, Address, Phone)
Incoming Customer Schema (Country, City, Street, Apt, Phone)
To match these schemas and move ahead with your data matching task, you have to devise a procedure that converts each tuple in the Incoming Customer Schema to the Existing Customer Schema.
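A conversion of this kind might look like the sketch below. The two schemas come from the text; the rule for building `Address` (street, then apartment, then city) and the choice to leave `Name` empty are assumptions made for illustration, since the incoming schema carries no name field.

```python
from typing import NamedTuple, Optional


class IncomingCustomer(NamedTuple):
    country: str
    city: str
    street: str
    apt: str
    phone: str


class ExistingCustomer(NamedTuple):
    name: Optional[str]
    country: str
    address: str
    phone: str


def to_existing(rec: IncomingCustomer, name: Optional[str] = None) -> ExistingCustomer:
    """Map one incoming tuple onto the existing schema.

    Assumption: the address is the street, apartment, and city joined
    with commas. The name is not in the incoming schema, so it must be
    supplied by the caller or left as None.
    """
    parts = [rec.street, f"Apt {rec.apt}" if rec.apt else "", rec.city]
    address = ", ".join(p for p in parts if p)
    return ExistingCustomer(name=name, country=rec.country,
                            address=address, phone=rec.phone)


# Made-up sample record, for illustration only.
incoming = IncomingCustomer("USA", "Springfield", "12 Main St", "4B", "555-0100")
converted = to_existing(incoming)
print(converted)
```

Keeping the mapping in one function makes the assumptions easy to change when the real address convention differs.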
Another scenario refers to the same two schemas, but assume that the data records about your customers do not contain postal codes. If you need to see how many customers there are for a particular code, it is critical to have the correct zip values.
The same principles apply when you have to maintain a product catalog database. You should make sure that all dimensions of a product are expressed in the same units and that these values are not missing. If not, search queries will return incorrect results. The task that ensures all values use the same convention is called data standardization. You should perform it before other data cleaning activities, such as data matching and data deduplication. These are by no means trivial activities, and it is often not feasible to perform them manually.
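As a rough illustration of standardization, the sketch below converts product weights recorded in mixed units to grams and flags records whose value is missing. The field names, the choice of grams as the target unit, and the sample products are all assumptions made for this example.

```python
# Conversion factors to grams (exact by definition of the units).
UNIT_TO_GRAMS = {"g": 1.0, "kg": 1000.0, "lb": 453.59237, "oz": 28.349523125}


def standardize_weight(value, unit):
    """Return the weight in grams, or None if it is missing or the unit is unknown."""
    if value is None or unit is None:
        return None
    factor = UNIT_TO_GRAMS.get(unit.strip().lower())
    if factor is None:
        return None
    return float(value) * factor


# Made-up sample catalog, for illustration only.
products = [
    {"name": "Widget A", "weight": 0.75, "unit": "kg"},
    {"name": "Widget B", "weight": 20, "unit": "G"},
    {"name": "Widget C", "weight": None, "unit": "kg"},
]

for p in products:
    p["weight_g"] = standardize_weight(p["weight"], p["unit"])

# Records that still need attention after standardization.
missing = [p["name"] for p in products if p["weight_g"] is None]
print(missing)
```

Returning `None` instead of raising keeps the pass going over the whole catalog, so all problem records can be reviewed together.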
Data Matching
The aim of record matching is to match each record from one dataset with the records from another dataset. Generally, you have to perform this task when you import new data. This way, you make sure the new datasets do not introduce duplicate entities.
Consider a scenario in which you have to import a new set of customer records into your business database. You should check whether the same customer is represented in both the incoming batch and the existing database, and keep just one record. Unfortunately, because of typing mistakes or representational errors, the same record may look different in the two datasets. As a result, it may not match on the important attributes, for example phone, address, and name.
The difficulty often increases for product entries, where the product description is a concatenation of more than one attribute. The goal of record matching is therefore to find the pairs of records across the two datasets that refer to the same entity.
The most important challenges you have to address at this stage are the following. First, identify the criteria that establish that two records really refer to the same real-world entity. Second, with the huge datasets available today, find the most efficient computation technique, one that can determine these pairs over large sets of data.
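One simple way to express a matching criterion is a similarity score over normalized fields. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the averaging rule, the 0.85 threshold, and the sample records are assumptions for illustration, not a recommended configuration. It compares every incoming record with every existing one, which is quadratic and would need a more efficient strategy on large datasets.

```python
import re
from difflib import SequenceMatcher


def normalize_text(s):
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", "", s.lower()).split())


def normalize_phone(s):
    """Keep digits only, so formatting differences do not matter."""
    return re.sub(r"\D", "", s)


def similarity(a, b):
    """Average the per-field similarity of name, address, and phone."""
    scores = [
        SequenceMatcher(None, normalize_text(a["name"]), normalize_text(b["name"])).ratio(),
        SequenceMatcher(None, normalize_text(a["address"]), normalize_text(b["address"])).ratio(),
        SequenceMatcher(None, normalize_phone(a["phone"]), normalize_phone(b["phone"])).ratio(),
    ]
    return sum(scores) / len(scores)


def find_matches(incoming, existing, threshold=0.85):
    """Return (incoming_index, existing_index) pairs that score at or above the threshold."""
    return [
        (i, j)
        for i, new in enumerate(incoming)
        for j, old in enumerate(existing)
        if similarity(new, old) >= threshold
    ]


# Made-up sample records, for illustration only.
existing = [{"name": "John Smith", "address": "12 Main Street", "phone": "(555) 010-0100"}]
incoming = [
    {"name": "Jon Smith", "address": "12 Main St.", "phone": "555-010-0100"},
    {"name": "Mary Jones", "address": "98 Oak Avenue", "phone": "555-019-9999"},
]

matches = find_matches(incoming, existing)
print(matches)
```

Here the misspelled "Jon Smith" record is paired with the existing customer, while the unrelated record is not.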
Luckily, a few applications can help you overcome these obstacles. By using its smart fuzzy matching engine, our product is designed to find as many true matches and as few false matches as possible. Moreover, you can combine these results with the customizable knowledge base library.
Data Deduplication
Data deduplication aims to group the records in a dataset so that each group represents the same real-world entity. For best results, you should perform this process both when you populate the database for the first time and when you add new records. Compared with data matching, deduplication additionally involves grouping the matching records. This way, the groups collectively partition the input dataset.
Consider an example where your database stores several records, for example:
Nikon D750 Camera
Nikon D750 SLR
Nikon D750 Digital SLR
This set contains different records that represent the same entity. You therefore need to be able not only to match two of them but to match all three records to the same real-world entity.
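The grouping step can be sketched by treating pairwise matches as links and merging linked records with a union-find structure. The token-overlap (Jaccard) similarity, the 0.5 threshold, and the extra "Canon EOS 5D Camera" record are assumptions added for illustration; the three Nikon records are the ones from the example above.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set overlap between two descriptions, from 0.0 to 1.0."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def find(parent, i):
    """Return the root of i's group, shortening the path on the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def deduplicate(records, threshold=0.5):
    """Group records so that any chain of pairwise matches ends up in one group."""
    parent = list(range(len(records)))
    for i, j in combinations(range(len(records)), 2):
        if jaccard(records[i], records[j]) >= threshold:
            parent[find(parent, i)] = find(parent, j)
    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(parent, i), []).append(rec)
    return list(groups.values())


records = [
    "Nikon D750 Camera",
    "Nikon D750 SLR",
    "Nikon D750 Digital SLR",
    "Canon EOS 5D Camera",  # hypothetical extra record, not from the text
]

groups = deduplicate(records)
for group in groups:
    print(group)
```

With these settings, "Nikon D750 Camera" and "Nikon D750 Digital SLR" do not match each other directly, but both match "Nikon D750 SLR", so all three land in one group. That transitive merging is what distinguishes grouping from plain pairwise matching.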
Data Profiling
Since data cleaning is an interactive process, it is essential that you can evaluate the quality of your data. You should be able to do this both before and after the data cleaning process, so that you can check its effectiveness. We call this process data profiling. Its main goal is to make sure that your values match your expectations.
For example, you may expect a customer name and address to uniquely identify each customer in your database. In that case, the number of unique tuples must be as close as possible to the total number of entries in your database.
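This expectation can be checked with a small uniqueness ratio, sketched below under the assumption that records are dictionaries with `name` and `address` keys. The sample rows are made up; a ratio close to 1.0 means the chosen fields nearly identify each entry.

```python
def uniqueness_ratio(rows, key_fields):
    """Share of rows whose key_fields combination is distinct (1.0 = all unique)."""
    if not rows:
        return 1.0
    keys = {tuple(row[f] for f in key_fields) for row in rows}
    return len(keys) / len(rows)


# Made-up sample data, for illustration only.
customers = [
    {"name": "John Smith", "address": "12 Main Street"},
    {"name": "Mary Jones", "address": "98 Oak Avenue"},
    {"name": "John Smith", "address": "12 Main Street"},  # exact duplicate
    {"name": "John Smith", "address": "7 Hill Road"},
]

ratio = uniqueness_ratio(customers, ["name", "address"])
print(ratio)
```

Running the same check before and after cleaning gives a quick measure of whether the cleaning step removed duplicates.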
However, even though you can obtain subsets of the data through a few SQL queries, this approach is inefficient and tedious. Data Profiling/Statistics is an easy-to-use and powerful data profiling module made to help you find patterns in your datasets. In addition, the module can check the quality of your data by analyzing value counts, types, formats, and completeness. It provides a complete set of statistical information intended to help you clean your data.
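To make these statistics concrete, here is a minimal standard-library sketch that profiles one column for value counts, types, and completeness. It is not the module described above; the function name, the returned keys, and the sample rows are assumptions made for illustration, and format checks are left out.

```python
from collections import Counter


def profile_column(rows, column):
    """Summarize one column: row count, completeness, distinct values, types, top values."""
    values = [row.get(column) for row in rows]
    # Treat None and the empty string as missing (an assumption of this sketch).
    present = [v for v in values if v is not None and v != ""]
    return {
        "count": len(values),
        "completeness": len(present) / len(values) if values else 0.0,
        "distinct": len(set(present)),
        "types": dict(Counter(type(v).__name__ for v in present)),
        "top_values": Counter(present).most_common(3),
    }


# Made-up sample data, for illustration only.
rows = [
    {"name": "Ann Lee", "country": "US", "zip": "10001"},
    {"name": "Bob Ray", "country": "US", "zip": 10001},
    {"name": "Cy Dow", "country": "us", "zip": ""},
    {"name": "Di Fox", "country": None, "zip": "94105"},
]

country_profile = profile_column(rows, "country")
zip_profile = profile_column(rows, "zip")
print(country_profile)
print(zip_profile)
```

In this sample, the profile exposes two typical problems: inconsistent casing in `country` ("US" and "us" count as distinct values) and mixed types in `zip` (strings and an integer).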
Translated from: https://www.includehelp.com/data-science/data-cleaning.aspx