MissForest: The Best Missing Data Imputation Algorithm

Missing data often plagues real-world datasets, and hence there is tremendous value in imputing, or filling in, the missing values. Unfortunately, standard ‘lazy’ imputation methods like simply using the column median or average don’t work well.

KNN, on the other hand, is a machine-learning-based imputation algorithm that has seen success, but it requires tuning the parameter k and inherits many of KNN’s weaknesses, such as sensitivity to outliers and noise. Depending on the circumstances, it can also be computationally expensive, since the entire dataset must be stored and distances computed between every pair of points.

MissForest is another machine-learning-based imputation algorithm, built on the Random Forest algorithm. Stekhoven and Bühlmann, the creators of the algorithm, conducted a study in 2011 comparing imputation methods on datasets with randomly introduced missing values. MissForest outperformed the other algorithms tested, including KNN-Impute, on all metrics, in some cases by over 50%.

First, the missing values are filled in using median/mode imputation. Then, for each column with missing values, we mark the rows where that column was missing as ‘Predict’ and the remaining rows as training rows. The training rows are fed into a Random Forest model trained to predict, in this case, Age based on Score. The model’s prediction for each ‘Predict’ row then replaces the initial fill, producing a transformed dataset.
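
The first two steps described above (the initial median/mode fill, then the split into training and ‘Predict’ rows) can be sketched in plain Python. This is a minimal illustration, not the missingpy implementation: the toy rows, the extra `Grade` column, and the helper names `initial_fill` and `split_for_target` are all assumptions made for the example, and the Random Forest fit itself is left out.

```python
from statistics import median, mode

# Hypothetical toy data; None marks a missing value.
rows = [
    {"Score": 70, "Age": 25, "Grade": "B"},
    {"Score": 85, "Age": None, "Grade": "A"},
    {"Score": 90, "Age": 31, "Grade": None},
    {"Score": 60, "Age": 22, "Grade": "B"},
]

def initial_fill(rows, numeric_cols, categorical_cols):
    """Fill numeric columns with the median and categorical ones with the mode.

    Returns the filled copy plus a mask recording which cells were missing.
    """
    mask = [{col: row[col] is None for col in row} for row in rows]
    filled = [dict(row) for row in rows]
    for col in numeric_cols + categorical_cols:
        observed = [row[col] for row in rows if row[col] is not None]
        fill = median(observed) if col in numeric_cols else mode(observed)
        for row in filled:
            if row[col] is None:
                row[col] = fill
    return filled, mask

def split_for_target(filled, mask, target):
    """Rows where the target was observed train the model; the rest are 'Predict' rows."""
    train = [row for row, m in zip(filled, mask) if not m[target]]
    predict = [row for row, m in zip(filled, mask) if m[target]]
    return train, predict

filled, mask = initial_fill(rows, ["Score", "Age"], ["Grade"])
train_rows, predict_rows = split_for_target(filled, mask, "Age")
```

Keeping the mask separate from the filled data matters: later passes need to know which cells were originally missing so that only those cells are overwritten with predictions.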

Figure: Assume that the dataset is truncated. Image created by author.

This process of looping through the missing data points repeats several times, with each iteration training on better and better data. It’s like standing on a pile of rocks while continually adding more to raise yourself: the model uses its current position to elevate itself further.

The model may decide in the following iterations to adjust predictions or to keep them the same.

Iterations continue until a stopping criterion is met or a set number of iterations has elapsed. As a general rule, datasets become well imputed after four to five iterations, but this depends on the size of the dataset and the amount of missing data.
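
The article does not say which stopping criterion is used. One criterion described for MissForest by Stekhoven and Bühlmann is to stop as soon as the difference between two successive imputations increases for the first time, and to keep the imputation from before that increase. The sketch below assumes that rule for numeric columns only; `impute_once` is a hypothetical callable standing in for one full Random Forest pass, and the scripted values are made up purely to exercise the loop.

```python
def relative_change(old, new):
    """Sum of squared differences divided by the sum of squared new values."""
    num = sum((n - o) ** 2 for row_o, row_n in zip(old, new)
              for o, n in zip(row_o, row_n))
    den = sum(n ** 2 for row_n in new for n in row_n)
    return num / den

def iterate_until_stable(initial, impute_once, max_iter=10):
    """Repeat impute_once until the change between passes first increases.

    Returns the last accepted matrix and the number of accepted passes.
    """
    current, prev_delta = initial, float("inf")
    for accepted in range(max_iter):
        candidate = impute_once(current)
        delta = relative_change(current, candidate)
        if delta > prev_delta:
            # The change grew again: keep the pass before the increase.
            return current, accepted
        current, prev_delta = candidate, delta
    return current, max_iter

# Hypothetical stand-in for a Random Forest pass: replay scripted results.
scripted = iter([[[14.0]], [[15.0]], [[15.2]], [[14.6]]])
result, n_passes = iterate_until_stable([[10.0]], lambda X: next(scripted))
```

Here the change shrinks over the first three passes and grows on the fourth, so the loop returns the third pass. The `max_iter` cap corresponds to the "set number of iterations" alternative mentioned above.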

There are many benefits to using MissForest. For one, it can be applied to mixed data types, both numerical and categorical. Using KNN-Impute on categorical data requires first converting it into some numerical measure. This scale (usually 0/1 with dummy variables) is almost always incompatible with the scales of the other dimensions, so the data must be standardized.

In a similar vein, no pre-processing is required. Since KNN uses naïve Euclidean distances, all sorts of steps such as categorical encoding, standardization, normalization, scaling, and data splitting need to be taken to ensure its success. Random Forest, by contrast, can handle these aspects of the data because it doesn’t make assumptions about feature relationships the way K-Nearest Neighbors does.
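
To see why Euclidean distance forces this pre-processing on KNN, consider a small made-up example in which one feature is measured in years and another in dollars. The data and the helper names `nearest` and `standardize` are assumptions for illustration only.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical rows: (age in years, income in dollars).
people = [(25, 50_000), (60, 50_500), (27, 53_000), (45, 90_000), (35, 20_000)]

def nearest(points, query_index):
    """Index of the point closest to points[query_index], excluding itself."""
    others = [i for i in range(len(points)) if i != query_index]
    return min(others, key=lambda i: dist(points[i], points[query_index]))

def standardize(points):
    """Rescale every column to zero mean and unit standard deviation."""
    columns = list(zip(*points))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(p, means, stds)) for p in points]

raw_nn = nearest(people, 0)                  # neighbour on raw values
scaled_nn = nearest(standardize(people), 0)  # neighbour after standardizing
```

On the raw values, the 60-year-old is the nearest neighbour of the 25-year-old simply because their incomes differ by only 500 dollars; the 35-year age gap barely registers. After standardization, the 27-year-old with a similar income becomes the nearest neighbour. Tree-based models split on one feature at a time, so they are not affected by this kind of scale mismatch.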

MissForest is also robust to noisy data and multicollinearity, since random forests have built-in feature selection (evaluating entropy and information gain). KNN-Impute, by contrast, yields poor predictions when datasets have weak predictors or heavy correlation between features.
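
The entropy and information gain mentioned above can be computed by hand for a single candidate split. The sketch below uses made-up labels and features; note that it only illustrates the idea for a classification target, and that many Random Forest implementations default to other split criteria such as Gini impurity or variance reduction.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature, labels, threshold):
    """Entropy reduction from splitting the rows on feature <= threshold."""
    left = [y for x, y in zip(feature, labels) if x <= threshold]
    right = [y for x, y in zip(feature, labels) if x > threshold]
    weighted = sum(len(part) / len(labels) * entropy(part)
                   for part in (left, right) if part)
    return entropy(labels) - weighted

# Hypothetical target and two candidate features.
labels = ["low", "low", "high", "high"]
informative = [1.0, 2.0, 8.0, 9.0]  # separates the classes cleanly
noisy = [5.0, 1.0, 4.0, 2.0]        # unrelated to the classes

gain_informative = information_gain(informative, labels, threshold=5.0)
gain_noisy = information_gain(noisy, labels, threshold=3.0)
```

The informative feature removes all uncertainty about the label, while the noisy one removes none, so a tree choosing splits by information gain would ignore the noisy feature. This is the sense in which feature selection is built in.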

The results of KNN are also heavily determined by the value of k, which must be found through what is essentially a try-them-all approach. Random Forest, on the other hand, is non-parametric and tends to work well with its default settings, so little tuning is required. It can also work with high-dimensional data and is far less prone to the Curse of Dimensionality than KNN-Impute.

MissForest does have some downsides. For one, even though it takes up less space, it may be more expensive to run than KNN-Impute if the dataset is sufficiently small. Additionally, it’s an algorithm, not a model object: it must be run every time data is imputed, which may not work in some production environments.

Using MissForest is simple. In Python, it is available through the missingpy library, which has a sklearn-like interface and shares many parameters with RandomForestClassifier/RandomForestRegressor. The complete documentation can be found in the library’s GitHub repository.

The model is only as good as the data, so taking proper care of the dataset is a must. Consider using MissForest next time you need to impute missing data!

Thanks for reading!

Translated from: https://towardsdatascience.com/missforest-the-best-missing-data-imputation-algorithm-4d01182aed3
