Feature Engineering Part 1: Mean/Median Imputation

Understanding mean/median imputation and its implementation using feature-engine

Mean or Median Imputation:

The mean or median should be calculated on the training set only, and that value should then be used to replace NA in both the training and test sets. This avoids over-fitting, because no information from the test set leaks into the imputation value.
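The rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the feature-engine API: the helper names `fit_imputation_value` and `impute` and the toy `age` values are assumptions made up for the example, and `None` stands in for NA.

```python
from statistics import mean, median

def fit_imputation_value(train_values, method="median"):
    """Learn the fill value from the observed (non-missing) training values only."""
    observed = [v for v in train_values if v is not None]
    return median(observed) if method == "median" else mean(observed)

def impute(values, fill_value):
    """Replace every missing entry (None) with the learned fill value."""
    return [fill_value if v is None else v for v in values]

# Toy data, made up for illustration; None marks a missing value.
train_age = [22, 35, None, 41, 29, None, 58]
test_age = [None, 31, 47]

fill = fit_imputation_value(train_age, method="median")  # learned on train only
train_imputed = impute(train_age, fill)
test_imputed = impute(test_age, fill)  # the test set reuses the train statistic

print(fill)
print(test_imputed)
```

The test set never contributes to `fill`; it only receives the value learned from the training set.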

Mean / Median imputation: definition:

Mean/median imputation consists of replacing all occurrences of missing values (NA) within a variable with the mean or median of that variable.

Which variables can I impute with Mean / Median Imputation?

· The mean and median can only be calculated on numerical variables; therefore, these methods are suitable for continuous and discrete numerical variables only.

Figure: Mean/Median Imputation

Assumptions:

1. Data is missing completely at random (MCAR).

2. The missing observations most likely look like the majority of the observations in the variable (i.e., the mean/median).

3. If data is missing completely at random, then it is fair to assume that the missing values are most likely very close to the mean or the median of the distribution, as these represent the most frequent/average observation.

Advantages:

  • Easy to implement.
  • Fast way of obtaining complete datasets.
  • Can be integrated into production (during model deployment).

Limitations:

  • Distortion of the original variable distribution.
  • Distortion of the original variance.
  • Distortion of the covariance with the remaining variables of the dataset.

Figure: Distortion of Variance
Figure: Distortion of Covariance

When replacing NA with the mean or median, the variance of the variable will be distorted if the number of NA is large relative to the total number of observations, leading to underestimation of the variance.
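A small numeric check makes the underestimation concrete. The values below are made up for illustration, and the case is deliberately extreme: as many values are missing as are observed.

```python
from statistics import mean, variance

observed = [10, 12, 14, 16, 18]  # values actually observed (toy data)
n_missing = 5                    # as many missing values as observed ones

fill = mean(observed)                    # imputation value: the observed mean
imputed = observed + [fill] * n_missing  # every NA replaced by the mean

var_before = variance(observed)  # sample variance of the observed values
var_after = variance(imputed)    # sample variance after mean imputation

print(var_before, var_after)
```

The imputed points add nothing to the sum of squared deviations but do increase the sample size, so with n observed and m imputed values the sample variance is scaled by (n - 1) / (n + m - 1). Here that factor is 4/9.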

Besides, estimates of covariance and correlation with other variables in the dataset may also be affected. Mean/median imputation may alter intrinsic correlations, since the mean/median value that now replaces the missing data will not necessarily preserve the relationship with the remaining variables.
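As a rough illustration, take two variables that are perfectly correlated and see what happens once missing entries in one of them are filled with the mean. The data and the helper functions `covariance` and `correlation` are assumptions written for this sketch.

```python
from statistics import mean

def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation computed from the sample covariance."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

y = [2, 4, 6, 8, 10, 12]
x_full = [1, 2, 3, 4, 5, 6]           # y is exactly 2 * x
x_missing = [1, None, 3, 4, None, 6]  # two values of x are missing

fill = mean([v for v in x_missing if v is not None])
x_imputed = [fill if v is None else v for v in x_missing]

print(covariance(x_full, y), correlation(x_full, y))
print(covariance(x_imputed, y), correlation(x_imputed, y))
```

The imputed rows sit at the mean of x regardless of their y value, so both the covariance and the correlation with y drop below what the complete data would show.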

Finally, concentrating all missing values at the mean/median value shrinks the spread of the variable, so observations that are common occurrences in the original distribution may be picked up as outliers.
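The effect can be sketched with a simple outlier rule. The example assumes a rule that flags values more than two sample standard deviations from the mean (one of several possible rules), uses made-up data, and exaggerates the share of missing values to make the effect obvious.

```python
from statistics import mean, stdev

def flag_outliers(values, k=2.0):
    """Return the distinct values lying more than k sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return sorted({v for v in values if abs(v - m) > k * s})

observed = [10, 12, 14, 16, 18]             # toy data, nothing unusual here
imputed = observed + [mean(observed)] * 15  # 15 NAs all replaced by the mean

print(flag_outliers(observed))  # no value is flagged
print(flag_outliers(imputed))   # the original extremes are now flagged
```

None of the observed values changed; only the standard deviation shrank, which moved the outlier threshold inward.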

When to use mean/median imputation?

· Data is missing completely at random.

· No more than 5% of the variable contains missing data.

Although in theory the above conditions should be met to minimize the impact of this imputation technique, in practice mean/median imputation is very commonly used, even when data is not MCAR and there are many missing values. The reason is the simplicity of the technique.

Typically, mean/median imputation is done together with adding a binary “missing indicator” variable to flag the observations where the data was missing.

If the data were missing completely at random, the mean/median imputation alone would account for it; if they were not, the additional “missing indicator” variable would capture that. Both methods are extremely straightforward to implement, and are therefore a top choice in data science competitions.
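A minimal sketch of combining the two steps, in plain Python rather than any particular library: the function name `impute_with_indicator` and the `income` values are hypothetical, and `None` marks a missing value.

```python
from statistics import median

def impute_with_indicator(values, fill_value):
    """Return (imputed values, 0/1 flags marking where a value was missing)."""
    indicator = [1 if v is None else 0 for v in values]
    imputed = [fill_value if v is None else v for v in values]
    return imputed, indicator

# Toy training data; None marks a missing value.
train_income = [30, None, 45, 52, None, 38]

fill = median([v for v in train_income if v is not None])
income_imputed, income_na = impute_with_indicator(train_income, fill)

print(income_imputed)
print(income_na)
```

The indicator column is built before the values are overwritten, so a model can still tell an imputed median apart from a genuinely observed one.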

Note the following:

1. If a variable is normally distributed, the mean, median, and mode are approximately the same. Therefore, replacing missing values with the mean or with the median is equivalent. Replacing missing data with the mode is not common practice for numerical variables.

2. If the variable is skewed, the mean is biased by the values at the far end of the distribution. Therefore, the median is a better representation of the majority of the values in the variable.

Figure: Skewed Distribution
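The two notes above can be checked on toy numbers. Both lists are made up for illustration: the first is roughly symmetric, the second has a single very large value in its right tail.

```python
from statistics import mean, median

symmetric = [48, 49, 50, 50, 51, 52]
skewed = [20, 22, 25, 27, 30, 400]  # one very large value in the right tail

sym_mean, sym_median = mean(symmetric), median(symmetric)
skew_mean, skew_median = mean(skewed), median(skewed)

print(sym_mean, sym_median)    # the two statistics agree
print(skew_mean, skew_median)  # the mean is pulled toward the tail
```

For the skewed list the mean lands well above five of the six values, while the median stays among them, which is why the median is the safer fill value for skewed variables.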

Implementation

Let’s discuss in the comments if you find anything wrong in the post or if you have anything to add. Thanks.

Translated from: https://medium.com/analytics-vidhya/feature-engineering-part-1-mean-median-imputation-761043b95379
