Data Imputation in Python
Most machine learning algorithms expect complete, clean, noise-free datasets. Unfortunately, real-world datasets are messy and have multiple missing cells, and handling that missing data can become quite complex.
Therefore, in today’s article we are going to discuss some of the most effective and easy-to-use data imputation techniques for dealing with missing data.
So without any further delay, let’s get started.

What is Data Imputation?
Data imputation is a method in which the missing values in a variable or data frame (in machine learning) are filled with substitute values so that the task can be performed. With this method the sample size remains the same; only the cells that were blank are now filled with values. The method is easy to use, but simple variants such as filling with the mean reduce the variance of the dataset.
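The two claims above (same sample size, smaller variance) can be checked on a toy column. This is only a minimal sketch using the standard library; the `ages` list and its values are made up for illustration, and mean imputation is assumed as the filling rule.

```python
import statistics

# Hypothetical toy column; None marks a missing cell.
ages = [22.0, None, 35.0, 41.0, None, 28.0]

# Mean of the observed (non-missing) values only.
observed = [v for v in ages if v is not None]
mean_age = statistics.mean(observed)

# Mean imputation: every blank is replaced by the same number.
imputed = [mean_age if v is None else v for v in ages]

var_observed = statistics.variance(observed)  # sample variance of observed values
var_imputed = statistics.variance(imputed)    # sample variance after imputation

print(len(ages), len(imputed))    # the sample size is unchanged
print(var_observed, var_imputed)  # the imputed column has the smaller variance
```

The filled cells sit exactly on the mean, so they add nothing to the spread while still increasing the count, which is why the variance shrinks.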
Why Data Imputation?
There can be various reasons for imputing data. Many real-world datasets (not talking about CIFAR or MNIST) contain missing values, which can appear in any form: blanks, NaN, 0s, arbitrary integers or a categorical symbol. Dropping the rows or columns that contain missing values comes at the price of losing data that may be valuable, so a better strategy is to impute the missing values.
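To make the trade-off concrete, here is a small sketch with pandas. The table, its column names and the convention that an income of 0 means "not recorded" are all assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: missing cells appear as NaN, as a blank string and as a 0 placeholder.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 31.0, 52.0],
    "income": [50000.0, 62000.0, 0.0, 58000.0, 71000.0],  # assumption: 0 means "not recorded"
    "city": ["Pune", "", "Delhi", "Delhi", "Mumbai"],     # "" is a blank cell
})

# Normalise the different markers to missing values so they are all treated alike.
clean = df.copy()
clean["income"] = clean["income"].mask(clean["income"] == 0)
clean["city"] = clean["city"].mask(clean["city"] == "")

# Option 1: drop every row that has at least one missing cell.
dropped = clean.dropna()

# Option 2: impute instead (median for numeric columns, mode for the categorical one).
imputed = clean.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["income"] = imputed["income"].fillna(imputed["income"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print(len(df), len(dropped), len(imputed))
```

Dropping keeps only the fully observed rows, while imputing keeps all of them at the cost of introducing estimated values.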
Coming back to the topic: the sklearn.impute module includes the following two imputation algorithms for filling in missing values:
1. SimpleImputer
SimpleImputer performs univariate imputation: each feature column is imputed on its own, using only the non-missing values of that same column.
Different strategies are provided: imputing with a constant value, or using a statistic of each column (mean, median or mode) to fill that column's missing values.
For categorical data, it supports the ‘most_frequent’ strategy, which is the mode of the column's values.
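A short sketch of these strategies with sklearn's SimpleImputer follows. The arrays are toy data invented for the example; only the `strategy`, `missing_values` and `fill_value` parameters of SimpleImputer are used.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric data with one missing cell per column.
X_num = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan],
                  [5.0, 60.0]])

# Mean strategy: each blank gets the mean of the observed values in its own column.
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_mean = mean_imputer.fit_transform(X_num)

# Constant strategy: each blank gets the same user-chosen value.
const_imputer = SimpleImputer(strategy="constant", fill_value=-1.0)
X_const = const_imputer.fit_transform(X_num)

# Most-frequent strategy on a toy categorical column.
X_cat = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)
mode_imputer = SimpleImputer(strategy="most_frequent")
X_mode = mode_imputer.fit_transform(X_cat)

print(X_mean)
print(X_const)
print(X_mode)
```

In practice the imputer would be fitted on the training split only and then applied to the test split with `transform`, so that the test data does not influence the fill values.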

2. IterativeImputer
IterativeImputer performs multivariate imputation, which applies to datasets that have two or more variables (feature columns) per observation. It makes use of the entire set of available feature columns to estimate the missing values.
In IterativeImputer, each feature with missing values is modelled as a function of the other features. The process runs in a loop: at each step, one feature column is selected as the output y and the other feature columns are treated as inputs X; a regressor is fit on (X, y) for the rows where y is known and is then used to predict the missing values of y.
This is done for each feature column in turn, and the whole cycle is repeated for a number of imputation rounds (max_iter in sklearn). The results of the final round are used to fill the missing values.
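A minimal sketch of this loop using sklearn's IterativeImputer is shown below. The toy array is invented so that the second column is roughly twice the first; the estimator is left at sklearn's default. Note that IterativeImputer is still marked experimental in sklearn, so the `enable_iterative_imputer` import is required before importing it.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (must come first)
from sklearn.impute import IterativeImputer

# Toy data in which column 1 is about 2 * column 0.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, np.nan],
              [np.nan, 10.0]])

# max_iter is the number of imputation rounds over all columns.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Because each column is predicted from the other, the filled cells should follow the relationship between the two columns rather than simply sitting at a column mean.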

Missingpy library
Missingpy is a Python library for imputing missing values. Currently, it supports a K-Nearest Neighbours based imputation technique and MissForest, i.e. a Random Forest-based imputation technique.
To install the missingpy library, you can type the following on the command line:
pip install missingpy
3. KNNImputer
KNNImputer is a multivariate data imputation technique that fills in missing values using the K-Nearest Neighbours approach. Each missing value is filled with the mean value from the n_neighbors nearest neighbours found in the training set, either weighted by distance or unweighted.
If a sample has more than one feature missing, the neighbours for that sample can differ depending on which feature is being imputed. If fewer than n_neighbors neighbours are available and there is no defined distance to the training set, the training-set average for that feature is used during imputation.
Nearest neighbours are selected on the basis of a distance metric, which by default is a Euclidean distance that skips missing coordinates, and n_neighbors sets how many neighbours are considered for each imputed value.
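The sketch below uses sklearn's own KNNImputer (available in sklearn.impute since version 0.22) rather than the missingpy class, which is a substitution made for this example; the two are not guaranteed to share every parameter name or default. The toy array is invented.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data with two missing cells.
X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Two neighbours, unweighted mean, Euclidean distance that ignores missing coordinates.
imputer = KNNImputer(n_neighbors=2, weights="uniform", metric="nan_euclidean")
X_filled = imputer.fit_transform(X)

print(X_filled)
# The blank in row 0 becomes 4.0 (mean of 3.0 and 5.0 from its two nearest rows);
# the blank in row 2 becomes 5.5 (mean of 3.0 and 8.0).
```

Setting `weights="distance"` instead would give closer neighbours more influence on the filled value.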

4. MissForest
MissForest is another technique for filling in missing values, this time using Random Forests in an iterative fashion. The first candidate column is the one with the fewest missing values among all the columns.
In the first step, the missing values in all the other (non-candidate) columns are filled with an initial guess: the mean for numerical columns and the mode for categorical columns. The imputer then fits a random forest model with the candidate column as the outcome (target) variable and the remaining columns as independent variables, and fills the missing values in the candidate column using the predictions from the fitted Random Forest model.
The imputer then moves on to the next candidate column, the one with the second-fewest missing values, and the process repeats for each column that has missing values.
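The steps above can be sketched by hand with sklearn's RandomForestRegressor. This is not the missingpy implementation: `missforest_sketch` is a hypothetical helper written for this article, it handles numeric columns only, it assumes every column has at least one observed value, and it runs a fixed number of passes (`n_epochs`, an assumed parameter) instead of checking a stopping criterion.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def missforest_sketch(X, n_epochs=3, random_state=0):
    """Hypothetical helper: MissForest-style imputation for numeric columns only."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)          # remember which cells were blank
    filled = X.copy()

    # Initial guess: fill every blank with the mean of its column.
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(missing)
    filled[rows, cols] = col_means[cols]

    # Candidate order: columns with the fewest missing values first.
    n_missing = missing.sum(axis=0)
    order = [j for j in np.argsort(n_missing) if n_missing[j] > 0]

    for _ in range(n_epochs):      # fixed number of passes (assumption, see lead-in)
        for j in order:
            other = np.delete(np.arange(X.shape[1]), j)
            known = ~missing[:, j]
            forest = RandomForestRegressor(n_estimators=50, random_state=random_state)
            # Fit on rows where the candidate column was observed.
            forest.fit(filled[known][:, other], filled[known, j])
            # Overwrite only the cells that were originally missing.
            filled[missing[:, j], j] = forest.predict(filled[missing[:, j]][:, other])
    return filled


# Toy data: column 2 is the sum of columns 0 and 1, with some cells blanked out.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(60, 3))
X_full[:, 2] = X_full[:, 0] + X_full[:, 1]
X_miss = X_full.copy()
X_miss[::7, 2] = np.nan
X_miss[::5, 0] = np.nan

X_filled = missforest_sketch(X_miss)
print(np.isnan(X_filled).sum())  # no missing cells remain
```

Categorical columns would need a classifier and a mode-based initial guess, which this sketch leaves out to stay short.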

Further Reading
IterativeImputer was originally part of fancyimpute but was later merged into sklearn. Apart from IterativeImputer, fancyimpute has many other algorithms that can help with imputing missing values. A few of them are not very common in industry but have proved useful in particular projects, which is why I am not including them in today's article. You can study them here.
It is yet another Python package for the analysis and imputation of missing values in datasets. It provides various utility functions for examining patterns in missing values, along with imputation methods for continuous, categorical and time-series data. It also supports both single and multiple imputation frameworks. You can study it here.
