Feature Engineering Part 1: Mean/Median Imputation

Understanding mean/median imputation and its implementation with feature-engine

Mean or Median Imputation:

The mean or median should be calculated on the training set only, and that value should then be used to replace NA in both the training and test sets. Computing it on the full data would leak information from the test set into training, which encourages over-fitting.
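
As a minimal sketch of this rule, the snippet below learns the median from the training set and reuses it on the test set. The column name "age" and the toy values are assumptions made up for illustration; they are not from the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "age" has missing values in both splits.
train = pd.DataFrame({"age": [22.0, 35.0, np.nan, 41.0, 29.0]})
test = pd.DataFrame({"age": [np.nan, 50.0, 31.0]})

# Learn the statistic from the training set only (NaN is skipped by default).
train_median = train["age"].median()

# Reuse the same training value to fill NA in both sets.
train_imputed = train.fillna({"age": train_median})
test_imputed = test.fillna({"age": train_median})

print(train_median)                   # 32.0
print(test_imputed["age"].tolist())   # [32.0, 50.0, 31.0]
```

The test set never contributes to the statistic, so the evaluation stays honest.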

Mean/Median Imputation: Definition:

Mean/median imputation consists of replacing all occurrences of missing values (NA) within a variable with the mean or median of that variable.

Which variables can I impute with Mean/Median Imputation?

· The mean and median can only be calculated on numerical variables, so these methods are suitable for continuous and discrete numerical variables only.
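
One possible way to respect this restriction is to select the numerical columns first and impute only those. The DataFrame and its column names below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with continuous, discrete and categorical variables.
df = pd.DataFrame({
    "income": [52000.0, np.nan, 61000.0],   # continuous
    "children": [2.0, 0.0, np.nan],          # discrete
    "city": ["Pune", None, "Delhi"],         # categorical, not imputable this way
})

# Keep only the numerical variables.
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Impute each numerical column with its own median; "city" is left untouched.
medians = df[numeric_cols].median()
imputed = df.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(medians)

print(numeric_cols)                 # ['income', 'children']
print(imputed["income"].tolist())   # [52000.0, 56500.0, 61000.0]
```

The categorical column still contains its missing value and needs a different technique.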


Assumptions:

1. Data is missing completely at random (MCAR).

2. The missing observations most likely look like the majority of the observations in the variable (that is, like the mean/median).

3. If data is missing completely at random, it is fair to assume that the missing values are most likely very close to the mean or the median of the distribution, as these represent the most frequent/average observation.

Advantages:

  • Easy to implement.
  • Fast way of obtaining complete datasets.
  • Can be integrated into production (during model deployment).

Limitations:

  • Distortion of the original variable distribution.
  • Distortion of the original variance.
  • Distortion of the covariance with the remaining variables of the dataset.

When replacing NA with the mean or median, the variance of the variable will be distorted if the number of NA is large relative to the total number of observations, leading to an underestimation of the variance.
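
To make the underestimation concrete, here is a small sketch with made-up numbers. With n observed values and m values imputed at the mean, the sum of squared deviations does not change but the divisor grows, so the sample variance shrinks by a factor of (n - 1) / (n + m - 1).

```python
import numpy as np
import pandas as pd

# Hypothetical variable: n = 5 observed values and m = 5 missing values.
observed = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])
with_na = pd.concat([observed, pd.Series([np.nan] * 5)], ignore_index=True)

# Mean imputation: every NA becomes 14.0.
imputed = with_na.fillna(with_na.mean())

var_before = with_na.var()   # sample variance of the observed values (NaN skipped)
var_after = imputed.var()    # sample variance after imputation

print(var_before)            # 10.0
print(round(var_after, 3))   # 4.444, i.e. 10.0 * (5 - 1) / (5 + 5 - 1)
```

Half of the values are missing here, which is deliberately extreme; with only a few percent missing the shrinkage is much smaller.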

Estimates of covariance and correlation with other variables in the dataset may also be affected. Mean/median imputation may alter intrinsic correlations, since the mean/median value that now replaces the missing data will not necessarily preserve the relationship with the remaining variables.
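
A rough illustration of this effect, on invented data: y is an exact linear function of x, so the correlation starts at 1.0. The missing positions are picked by hand to make the effect obvious, which is an assumption of this sketch rather than a realistic missingness pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y = 2 * x, so x and y are perfectly correlated.
df = pd.DataFrame({"x": np.arange(1.0, 11.0)})
df["y"] = 2.0 * df["x"]

# Knock out four y values purely for illustration.
df.loc[[0, 1, 8, 9], "y"] = np.nan

corr_before = df["x"].corr(df["y"])            # uses only the complete pairs
imputed = df.fillna({"y": df["y"].mean()})     # every NA in y becomes 11.0
corr_after = imputed["x"].corr(imputed["y"])

print(round(corr_before, 3))   # 1.0
print(round(corr_after, 3))    # well below 1.0
```

The imputed rows all share the same y value regardless of x, which is what breaks the relationship.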

Finally, concentrating all missing values at the mean/median may cause observations that are common occurrences in the original distribution to be picked up as outliers, because the spread of the imputed variable becomes narrower.
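
The sketch below shows one way this can happen, using the 1.5 * IQR rule as an assumed outlier definition (the post does not say which rule it has in mind). The helper name iqr_outliers and the data are hypothetical.

```python
import numpy as np
import pandas as pd

def iqr_outliers(s: pd.Series) -> pd.Series:
    """Return the values outside the 1.5 * IQR fences (a common rule of thumb)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Hypothetical variable: 9 observed values plus 5 missing ones.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0] + [np.nan] * 5)

outliers_before = iqr_outliers(values.dropna())
imputed = values.fillna(values.mean())   # every NA becomes 5.0
outliers_after = iqr_outliers(imputed)

print(len(outliers_before))      # 0: nothing is flagged on the observed data
print(outliers_after.tolist())   # the extremes 1.0 and 9.0 are now flagged
```

Piling values up at 5.0 shrinks the interquartile range, so the fences move inward and ordinary values fall outside them.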

When to use mean/median imputation?

· Data is missing completely at random.

· No more than 5% of the variable contains missing data.

In theory, the conditions above should be met to minimize the impact of this imputation technique. In practice, mean/median imputation is very commonly used even when data is not MCAR and there are many missing values, simply because the technique is so easy to apply.

Typically, mean/median imputation is done together with adding a binary "missing indicator" variable to capture those observations where the data was missing.

If the data is missing completely at random, the mean/median imputation is a reasonable stand-in for the missing values; if it is not, the additional "missing indicator" variable lets the model pick up on the missingness itself. Both methods are extremely straightforward to implement, and are therefore a top choice in data science competitions.
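
A minimal sketch of combining the two, assuming a hypothetical "age" column and the made-up indicator name "age_na". The indicator has to be created before the NA values are filled, otherwise the information is already gone.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing ages.
df = pd.DataFrame({"age": [22.0, np.nan, 35.0, np.nan, 41.0]})

# 1. Flag the rows where the value was missing (1 = missing, 0 = observed).
df["age_na"] = df["age"].isna().astype(int)

# 2. Only then fill the NA values with the median of the observed ages.
df["age"] = df["age"].fillna(df["age"].median())

print(df["age"].tolist())      # [22.0, 35.0, 35.0, 35.0, 41.0]
print(df["age_na"].tolist())   # [0, 1, 0, 1, 0]
```

In a real project the median would come from the training set only, as described at the top of the post.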

Note the following:

1. If a variable is normally distributed, the mean, median, and mode are approximately the same. Therefore, replacing missing values with the mean or with the median is roughly equivalent. Replacing missing data with the mode is not common practice for numerical variables.

2. If the variable is skewed, the mean is biased by the values at the far end of the distribution. Therefore, the median is a better representation of the majority of the values in the variable.
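
The two notes above can be checked on a couple of invented series: one symmetric, one with a single large value in the right tail.

```python
import pandas as pd

# Hypothetical symmetric variable: mean and median coincide.
symmetric = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])

# Hypothetical right-skewed variable: one extreme value in the tail.
skewed = pd.Series([20.0, 22.0, 25.0, 27.0, 30.0, 32.0, 35.0, 400.0])

print(symmetric.mean(), symmetric.median())   # 14.0 14.0
print(skewed.mean(), skewed.median())         # 73.875 28.5
```

For the skewed series, the mean (73.875) is larger than seven of the eight values, while the median (28.5) sits in the middle of the bulk of the data.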

Implementation

Let's discuss in the comments if you find anything wrong in the post or if you have anything to add. Thanks.


Translated from: https://medium.com/analytics-vidhya/feature-engineering-part-1-mean-median-imputation-761043b95379
