How to Improve Data Quality for Machine Learning?


The ultimate goal of every data scientist or machine learning evangelist is to build a better model with higher predictive accuracy. However, while we chase better hyperparameters or modeling algorithms, the data themselves may actually be the culprit. A famous Chinese saying, “工欲善其事，必先利其器”, translates literally as: to do a good job, an artisan first needs the best tools. If the data are of poor quality, the results will be subpar at best, regardless of how good the machine learning model is.

Why is data preparation so important?



It is no secret that data preparation in the process of data analytics is ‘an essential but unsexy’ task and more than half of data scientists regard cleaning and organizing data as the least enjoyable part of their work.

Multiple surveys of data scientists and experts support the common 80/20 trope: 80% of the time is mired in the mundane janitorial work of preparing data, from collecting and cleaning to exploring the data (data wrangling or munging), leaving only 20% for the actual analytic work of modeling and building algorithms.

Thus, the Achilles heel of a data analytics process is the disproportionate amount of time spent on data preparation alone. For data scientists, this can be a big productivity hurdle to building a meaningful model. For businesses, it can be a huge drain on resources, as only the remaining one-fifth of the investment in data analytics goes to its original intent.

Heard of GIGO (garbage in, garbage out)? This is exactly what happens here. A data scientist arrives at a task with a given set of data and is expected to build the best model to fulfill the goal of the task. But halfway through the assignment, they realize that no matter how good the model is, the results never improve. After some back-and-forth, they find lapses in data quality and start scrubbing the data to make them “clean and usable”. By the time the data are finally fit for use, the deadline is creeping closer and resources are draining away, leaving limited time to build and refine the actual model they were hired to build.

This is akin to a product recall. When defects are discovered in products already on the market, it is often too late to remedy them, and the products have to be recalled to protect consumers. In most cases, the defects result from negligent quality control of the components or ingredients used in the supply chain: for example, laptops recalled due to battery issues, or chocolates recalled due to contaminated dairy ingredients. Be it a physical or a digital product, the striking similarity is that the raw material usually takes the blame.

But if data quality is a problem, why not just improve it?


To answer this question, we first have to understand what data quality is.

Data quality has two aspects. The first is independent quality: a measure of the agreement between the data views presented and the same data in the real world, based on inherent characteristics and features. The second is application-dependent quality: a measure of how well the data conform to user needs for the intended purposes.

Let’s say you are a university recruiter trying to recruit fresh graduates for entry-level jobs. You have a fairly accurate contact list, but as you go through it you realize that most of the contacts are over 50 years old, which makes them unsuitable to approach. Applying the definition, this scenario fulfills only the first half: the list is accurate and consists of good data. It does not meet the second criterion: the data, no matter how accurate, are not suitable for the application.
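To make the two halves of the definition concrete, here is a minimal sketch of the recruiter scenario. The contact records, the `verified` flag, and the `MAX_AGE` cut-off are all hypothetical and chosen only for illustration; they are not taken from the original article.

```python
# Hypothetical contact list: every entry has been verified as correct.
contacts = [
    {"name": "A", "age": 56, "verified": True},
    {"name": "B", "age": 61, "verified": True},
    {"name": "C", "age": 23, "verified": True},
    {"name": "D", "age": 52, "verified": True},
]

# Aspect 1 (independent quality): share of entries that match the real world.
accuracy = sum(c["verified"] for c in contacts) / len(contacts)

# Aspect 2 (application-dependent quality): share of entries usable for the task.
# MAX_AGE is an assumed stand-in for "likely to be a fresh graduate".
MAX_AGE = 30
fitness = sum(c["age"] <= MAX_AGE for c in contacts) / len(contacts)

print(f"accuracy={accuracy:.2f}, fitness for recruiting={fitness:.2f}")
```

The same list scores perfectly on the first aspect and poorly on the second, which is why both have to be measured separately.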

In this example, accuracy is the dimension we use to assess the inherent quality of the data. There are many other dimensions. To give you an idea of which dimensions are commonly studied in peer-reviewed literature, here is a histogram showing the top 6 dimensions, drawn from a study of 15 different data quality assessment methodologies involving 32 dimensions.

A systemic approach to Data Quality Assessment


If you fail to plan, you plan to fail. A good systemic approach cannot succeed without good planning. To have a good plan, you need a thorough understanding of the business, especially of the problems associated with data quality. In the previous example, one should be aware that the contact list, albeit correct, has a data quality problem: it is not applicable to the goal of the assigned task.

After the problems become clear, data quality dimensions to be investigated should be defined. This can be done using an empirical approach like surveys among stakeholders to find out which dimension matters the most in reference to the data quality problems.


A set of assessment steps should follow suit. Design a way for the implementation so that these steps can map the assessment based on selected dimensions to the actual data. For instance, the following five requirements can be used as an example:


[1] Timeframe: Decide on the interval over which the data under investigation are collected.

[2] Definition: Define a standard for differentiating good data from bad data.

[3] Aggregation: Decide how to quantify the data for the assessment.

[4] Interpretability: Choose a mathematical expression to assess the data.

[5] Threshold: Select a cut-off point to evaluate the results.
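The five requirements above can be sketched in code. The example below is only one possible reading, applied to the completeness dimension; the records, the field names, the date interval, and the 0.9 cut-off are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical records; None marks a missing value.
records = [
    {"collected": date(2020, 1, 5), "email": "a@example.com", "age": 23},
    {"collected": date(2020, 1, 9), "email": None, "age": 24},
    {"collected": date(2020, 2, 2), "email": "c@example.com", "age": None},
    {"collected": date(2019, 12, 30), "email": "d@example.com", "age": 22},
]

# [1] Timeframe: only assess records collected within this interval.
START, END = date(2020, 1, 1), date(2020, 3, 31)

# [2] Definition: a record is "good" when no required field is missing.
REQUIRED = ("email", "age")

def is_good(record):
    return all(record[field] is not None for field in REQUIRED)

# [3] Aggregation: count the good records among those in the timeframe.
in_scope = [r for r in records if START <= r["collected"] <= END]
good = sum(is_good(r) for r in in_scope)

# [4] Interpretability: express the result as a ratio between 0 and 1.
completeness = good / len(in_scope) if in_scope else 0.0

# [5] Threshold: an assumed cut-off for a "satisfactory" result.
THRESHOLD = 0.9
passed = completeness >= THRESHOLD

print(f"completeness={completeness:.2f}, passed={passed}")
```

Other dimensions, such as accuracy or timeliness, would need their own definition of a "good" record in step [2], while the remaining steps could stay largely the same.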

Once the assessment methodologies are in place, it is time to get hands-on and carry out the actual assessment. After the assessment, a reporting mechanism can be set up to evaluate the results. If the data quality is satisfactory, the data are fit for further analytic purposes. Otherwise, the data have to be revised and potentially collected again. An example can be seen in the following illustration.
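The reporting step and the decision that follows it could look roughly like the sketch below. The `evaluate` helper, the dimension names, and all scores and thresholds are hypothetical; in practice they would come from the assessment that was actually carried out.

```python
def evaluate(scores, thresholds):
    """Compare each dimension's score with its threshold.

    Returns (fit_for_use, failing_dimensions). A dimension without a
    score is treated as failing.
    """
    failing = sorted(
        dim for dim, cutoff in thresholds.items()
        if scores.get(dim, 0.0) < cutoff
    )
    return (not failing, failing)

# Assumed assessment results and cut-offs, for illustration only.
scores = {"accuracy": 0.97, "completeness": 0.82, "timeliness": 0.95}
thresholds = {"accuracy": 0.95, "completeness": 0.90, "timeliness": 0.90}

fit, failing = evaluate(scores, thresholds)
if fit:
    print("Data are fit for further analysis.")
else:
    print("Revise or re-collect the data; failing dimensions:", ", ".join(failing))
```

Listing the failing dimensions, rather than returning a single pass or fail flag, tells whoever revises the data where to start.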

Conclusion


There is no one-size-fits-all solution for data quality problems because, as the definition above shows, half of what makes up data quality is highly subjective. However, we can always use a systemic approach to assess data quality. While this approach is largely objective and relatively versatile, some domain knowledge is still required, for example when selecting the data quality dimensions. Accuracy and completeness might be critical for use case A, while for use case B these dimensions might matter less.

Translated from: https://towardsdatascience.com/how-to-improve-data-preparation-for-machine-learning-dde107b60091
