Understanding Mean/Median Imputation and Its Implementation Using feature-engine
Mean or Median Imputation:
To avoid over-fitting, the mean or median should be calculated only on the train set and then used to replace NA in both the train and test sets.
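A minimal sketch of this rule with pandas, assuming a toy `age` column whose values are made up for illustration. The replacement value is learned from the train set only and then reused on the test set:

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): "age" has missing values in both splits.
train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"age": [np.nan, 25.0]})

# Learn the replacement value from the train set only (NaN is skipped by default).
train_mean = train["age"].mean()

# Apply the same train-derived value to both splits.
train_imputed = train.fillna({"age": train_mean})
test_imputed = test.fillna({"age": train_mean})
```

The test set never contributes to `train_mean`, so no information from it leaks into the fitted value.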
Mean/Median Imputation: Definition:
Mean/median imputation consists of replacing all occurrences of missing values (NA) within a variable by the mean or median.
Which variables can I impute with Mean/Median Imputation?
· The mean and median can only be calculated on numerical variables; therefore, these methods are suitable for continuous and discrete numerical variables only.
Assumptions:
1. Data is missing completely at random (MCAR)
2. The missing observations most likely look like the majority of the observations in the variable (i.e., the mean/median).
3. If data is missing completely at random, then it is fair to assume that the missing values are most likely very close to the value of the mean or the median of the distribution, as these represent the most frequent/average observation.
Advantages:
- Easy to implement.
- Fast way of obtaining complete datasets.
- Can be integrated into production (during model deployment).
Limitations:
- Distortion of the original variable distribution.
- Distortion of the original variance.
- Distortion of the covariance with the remaining variables of the dataset.
When replacing NA with the mean or median, the variance of the variable will be distorted if the number of NA is large relative to the total number of observations, leading to underestimation of the variance.
Besides, estimates of covariance and correlations with other variables in the dataset may also be affected. Mean / median imputation may alter intrinsic correlations since the mean / median value that now replaces the missing data will not necessarily preserve the relationship with the remaining variables.
Finally, concentrating all missing values at the mean/median value may cause observations that are common in the distribution to be picked up as outliers.
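The variance and covariance distortion described above can be checked on simulated data. This is a rough sketch under stated assumptions: the variables `x` and `y`, the 30% missing rate, and the random seed are all made up for illustration, and the values are removed completely at random.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Simulated data (illustrative): y depends linearly on x plus noise.
x = rng.normal(loc=50.0, scale=10.0, size=n)
y = 2.0 * x + rng.normal(scale=5.0, size=n)
df = pd.DataFrame({"x": x, "y": y})

# Remove 30% of x completely at random.
missing_idx = rng.choice(n, size=300, replace=False)
df_missing = df.copy()
df_missing.loc[missing_idx, "x"] = np.nan

# Mean imputation of x.
df_imputed = df_missing.fillna({"x": df_missing["x"].mean()})

# pandas skips NaN, so the "observed" statistics use the 700 complete rows.
var_observed = df_missing["x"].var()
var_imputed = df_imputed["x"].var()
cov_observed = df_missing["x"].cov(df_missing["y"])
cov_imputed = df_imputed["x"].cov(df_imputed["y"])

print(f"variance:   observed={var_observed:.2f}  imputed={var_imputed:.2f}")
print(f"covariance: observed={cov_observed:.2f}  imputed={cov_imputed:.2f}")
```

Every imputed value sits exactly at the mean, so it adds nothing to the sum of squared deviations while still increasing the sample size. With mean imputation, both the sample variance and the sample covariance shrink by the factor (n_observed - 1) / (n - 1), which is 699/999 in this sketch.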
When to use mean/median imputation?
· Data is missing completely at random.
· No more than 5% of the variable contains missing data.
· Although in theory, the above conditions should be met to minimize the impact of this imputation technique, in practice, mean/median imputation is very commonly used, even in those cases when data is not MCAR and there are a lot of missing values. The reason behind this is the simplicity of the technique.
Typically, mean/median imputation is done together with adding a binary “missing indicator” variable to capture those observations where the data was missing.
If the data were missing completely at random, this would be captured by the mean/median imputation; if they were not, this would be captured by the additional “missing indicator” variable. Both methods are extremely straightforward to implement and are therefore a top choice in data science competitions.
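Combining imputation with a missing indicator could look like the following sketch in pandas. The `income` column, its values, and the `income_na` indicator name are hypothetical; in a real project the median would come from the train set only.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration).
df = pd.DataFrame({"income": [30.0, np.nan, 50.0, 40.0, np.nan]})

# Flag the missing rows BEFORE imputing, otherwise the information is lost.
df["income_na"] = df["income"].isna().astype(int)

# Replace NA with the median of the observed values.
median_value = df["income"].median()
df["income"] = df["income"].fillna(median_value)
```

After this step, `income` has no missing values, and `income_na` marks the rows where the value was imputed, so a model can still learn from the fact that the value was absent.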
Note the following:
1. If a variable is normally distributed, the mean, median, and mode are approximately the same. Therefore, replacing missing values by the mean or by the median is roughly equivalent. Replacing missing data by the mode is not common practice for numerical variables.
2. If the variable is skewed, the mean is biased by the values at the far end of the distribution. Therefore, the median is a better representation of the majority of the values in the variable.
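A small numeric illustration of the second point, using a made-up right-skewed sample in which a single extreme value pulls the mean away from the bulk of the data:

```python
import numpy as np

# Hypothetical right-skewed sample: most values are small, one is extreme.
values = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 100.0])

mean_value = values.mean()        # pulled up by the extreme value
median_value = np.median(values)  # stays with the bulk of the data

# Count how many observations lie below the mean.
n_below_mean = int((values < mean_value).sum())

print(mean_value, median_value, n_below_mean)
```

Here seven of the eight observations lie below the mean, so imputing with the mean would place the missing values where almost no real observation sits, while the median stays inside the bulk of the data.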
Implementation
Let’s discuss in the comments if you find anything wrong in the post or if you have anything to add :P Thanks.
Original article: https://medium.com/analytics-vidhya/feature-engineering-part-1-mean-median-imputation-761043b95379