How to Improve Data Quality for Machine Learning?

The ultimate goal of every data scientist or machine learning evangelist is to build a better model with higher predictive accuracy. However, while we chase better hyperparameters or modeling algorithms, the data themselves might be the real culprit. There is a famous Chinese saying, “工欲善其事,必先利其器”, which roughly translates to: to do a good job, an artisan must first sharpen his tools. If the data are of poor quality, the results will be subpar at best, regardless of how good the machine learning model is.

Why is data preparation so important?

Photo by Austin Distel on Unsplash

It is no secret that data preparation in the process of data analytics is ‘an essential but unsexy’ task and more than half of data scientists regard cleaning and organizing data as the least enjoyable part of their work.

Multiple surveys of data scientists and experts have confirmed the common 80/20 trope: 80% of the time is mired in the mundane janitorial work of preparing data, from collecting and cleaning to exploring it (data wrangling or munging), leaving only 20% for the actual analytic work of modeling and building algorithms.

Thus, the Achilles heel of a data analytics process is the disproportionate amount of time spent on data preparation alone. For data scientists, this is a major drag on the productivity needed to build a meaningful model. For businesses, it is a heavy drain on resources, because only the remaining one-fifth of the investment in data analytics goes to its original intent.

Heard of GIGO (garbage in, garbage out)? This is exactly what happens here. A data scientist arrives at a task with a given set of data and is expected to build the best model to fulfill the goal of the task. But halfway through the assignment, he realizes that no matter how good the model is, the results never improve. After going back and forth, he finds lapses in data quality and starts scrubbing through the data to make them “clean and usable”. By the time the data are finally fit for use, the deadline is creeping closer and resources are draining away, and he is left with limited time to build and refine the actual model he was hired for.

This is akin to a product recall. When defects are discovered in products already on the market, it is often too late to remedy them, and the products have to be recalled to protect consumers. In most cases, the defects result from negligent quality control of the components or ingredients used in the supply chain: for example, laptops recalled due to battery issues, or chocolates recalled due to contaminated dairy ingredients. Be it a physical or a digital product, the striking similarity is that it is the raw material that takes the blame.

But if data quality is a problem, why not just improve it?

To answer this question, we first have to understand what data quality is.

Data quality has a two-part definition. First, independent quality: a measure of how well the data as presented agree with the same data in the real world, based on inherent characteristics and features. Second, application-dependent quality: a measure of how well the data conform to user needs for the intended purpose.

Let’s say you are a university recruiter trying to recruit fresh grads for entry-level jobs. You have a fairly accurate contact list, but as you go through it you realize that most of the contacts are people over 50 years old, which makes them unsuitable to approach. Applying the definition, this scenario fulfills only the first half: the list is accurate and consists of good data. But it does not meet the second criterion: the data, no matter how accurate, are not suitable for the application.
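
The recruiter scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the contact records, the `verified_emails` reference, the function names, and the "under 30" proxy for fresh graduates are all hypothetical and not part of the original article.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str
    email: str
    age: int


# Hypothetical contact list (made-up data for illustration only).
contacts = [
    Contact("Ana", "ana@example.com", 58),
    Contact("Ben", "ben@example.com", 61),
    Contact("Cho", "cho@example.com", 23),
    Contact("Dee", "dee@example.com", 54),
]

# Hypothetical "real-world" reference used to check accuracy.
verified_emails = {
    "Ana": "ana@example.com",
    "Ben": "ben@example.com",
    "Cho": "cho@example.com",
    "Dee": "dee@example.com",
}


def independent_quality(records, reference):
    """Share of records whose email agrees with the real-world reference."""
    matches = sum(1 for r in records if reference.get(r.name) == r.email)
    return matches / len(records)


def application_quality(records, is_fit):
    """Share of records that suit the intended use of the data."""
    return sum(1 for r in records if is_fit(r)) / len(records)


accuracy = independent_quality(contacts, verified_emails)
# Assumption: "under 30" stands in for "likely a fresh graduate".
fitness = application_quality(contacts, lambda r: r.age < 30)

print(f"accuracy={accuracy:.2f}, fitness={fitness:.2f}")
```

With this made-up list, every email matches the reference, yet only one contact in four fits the recruiting purpose: the data score well on the first half of the definition and poorly on the second.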

In this example, accuracy is the dimension we use to assess the inherent quality of the data. There are many other dimensions. To give you an idea of which dimensions are commonly studied in peer-reviewed literature, here is a histogram showing the top 6 dimensions, drawn from a study of 15 different data quality assessment methodologies involving 32 dimensions.

[Figure: histogram of the top 6 data quality dimensions across 15 assessment methodologies]

A systemic approach to Data Quality Assessment

If you fail to plan, you plan to fail. A good systemic approach cannot succeed without good planning. To have a good plan, you need a thorough understanding of the business, especially of the problems associated with data quality. In the previous example, one should be aware that the contact list, albeit correct, has a data quality problem: it is not applicable to the goal of the assigned task.

After the problems become clear, data quality dimensions to be investigated should be defined. This can be done using an empirical approach like surveys among stakeholders to find out which dimension matters the most in reference to the data quality problems.

A set of assessment steps should follow. Design the implementation so that these steps map the assessment, based on the selected dimensions, onto the actual data. For instance, the following five requirements can serve as an example:

[1] Timeframe — Decide on an interval for when the investigative data are collected.

[2] Definition — Define a standard on how to differentiate the good from the bad data.

[3] Aggregation — How to quantify the data for the assessment.

[4] Interpretability — A mathematical expression to assess the data.

[5] Threshold —Select a cut-off point to evaluate the results.
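
The five requirements above can be sketched for a single dimension, completeness. This is a minimal sketch rather than a prescribed method: the records, the required fields, the date window, and the 0.8 cut-off are all hypothetical values chosen for illustration.

```python
from datetime import date

# Hypothetical records; None marks a missing value.
records = [
    {"collected": date(2024, 1, 5), "email": "a@example.com", "phone": "555-0101"},
    {"collected": date(2024, 1, 9), "email": None, "phone": "555-0102"},
    {"collected": date(2024, 2, 2), "email": "c@example.com", "phone": None},
    {"collected": date(2024, 3, 1), "email": "d@example.com", "phone": "555-0104"},
    {"collected": date(2023, 12, 30), "email": None, "phone": None},
]

# [1] Timeframe: only assess records collected within this interval.
START, END = date(2024, 1, 1), date(2024, 3, 31)

# [2] Definition: a record is "good" when every required field is present.
REQUIRED_FIELDS = ("email", "phone")


def is_good(record):
    return all(record.get(field) is not None for field in REQUIRED_FIELDS)


# [3] Aggregation: count the good records among those in the timeframe.
in_window = [r for r in records if START <= r["collected"] <= END]
good_count = sum(1 for r in in_window if is_good(r))

# [4] Interpretability: completeness = good records / assessed records.
completeness = good_count / len(in_window)

# [5] Threshold: the cut-off that decides whether the data pass.
THRESHOLD = 0.8
passed = completeness >= THRESHOLD

print(f"completeness={completeness:.2f}, passed={passed}")
```

Here four records fall inside the window and two of them are complete, so the completeness ratio is 0.5 and the data fail the assumed 0.8 cut-off. A different dimension would swap in a different `is_good` rule while the other four steps keep the same shape.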

Once the assessment methodologies are in place, it is time to get hands-on and carry out the actual assessment. After the assessment, a reporting mechanism can be set up to evaluate the results. If the data quality is satisfactory, the data are fit for further analysis. Otherwise, the data have to be revised and possibly collected again. An example can be seen in the following illustration.

[Figure: example of a data quality assessment process]

Conclusion

There is no one-size-fits-all solution to data quality problems because, as the definition above shows, half of data quality is highly subjective. However, we can always use a systemic approach to assess data quality. While this approach is largely objective and relatively versatile, some domain knowledge is still required, for example in selecting the data quality dimensions. Data accuracy and completeness might be critical for use case A but less important for use case B.

Original article: https://towardsdatascience.com/how-to-improve-data-preparation-for-machine-learning-dde107b60091