python 插补数据_python 2020中缺少数据插补技术的快速指南

python 插补数据

Most machine learning algorithms expect complete and clean noise-free datasets, unfortunately, real-world datasets are messy and have multiples missing cells, in such cases handling missing data becomes quite complex.

大多数机器学习算法期望完整且干净的无噪声数据集,但不幸的是,现实世界的数据集比较杂乱,缺少多个单元格,在这种情况下,处理丢失的数据变得相当复杂。

Therefore in today’s article, we are going to discuss some of the most effective and indeed easy-to-use data imputation techniques which can be used to deal with missing data.

因此,在今天的文章中,我们将讨论一些最有效且确实易于使用的数据插补技术,这些技术可用于处理丢失的数据。

So without any further delay, let’s get started.

因此,没有任何进一步的延迟,让我们开始吧。

Image for post
Copyright: © Ampcool22 | Dreamstime.com
版权:©Ampcool22 | Dreamstime.com

什么是数据归因? (What is Data Imputation?)

Data Imputation is a method in which the missing values in any variable or data frame(in Machine learning) is filled with some numeric values for performing the task. Using this method the sample size remains the same, only the blanks which were missing are now filled with some values. This method is easy to use but the variance of the dataset is reduced.

数据插补是一种方法,其中(在机器学习中)任何变量或数据框中的缺失值都填充有一些数字值,以执行任务。 使用此方法,样本大小保持不变,现在仅将缺少空白 填充一些值这种方法易于使用,但数据集的方差减小了。

为什么要进行数据插补? (Why Data Imputation?)

There can be various reasons for imputing data, many real-world datasets(not talking about CIFAR or MNIST) containing missing values which can be in any form such as blanks, NaN, 0s, any integers or any categorical symbol. Instead of just dropping the Rows or Columns containing the missing values which come at the price of losing data which may be valuable, a better strategy is to impute the missing values.

插补数据可能有多种原因,许多现实世界的数据集(不涉及CIFAR或MNIST)包含缺失值,这些缺失值可以采用任何形式,例如空格,NaN,0,任何整数或任何分类符号更好的策略是估算缺失值 ,而不是仅仅删除包含缺失值的行或列,而这些缺失值会以丢失可能有价值的数据代价

Having a good theoretical knowledge is amazing but implementing them in code in a real-time machine learning project is a completely different thing. You might get different and unexpected results based on different problems and datasets. So as a Bonus,I am also adding the links to the various courses which has helped me a lot in my journey to learn Data science and ML, experiment and compare different data imputations strategies which led me to write this article on comparisons between different data imputations methods.

拥有良好的理论知识是惊人的,但是在实时机器学习项目中以代码实现它们是完全不同的。 根据不同的问题和数据集,您可能会得到不同且出乎意料的结果。 因此,作为奖励,我还添加了到各种课程的链接,这些链接对我学习数据科学和ML,实验和比较不同的数据归因策略有很大帮助,这使我撰写了有关不同数据之间的比较的本文。归因方法。

I am personally a fan of DataCamp, I started from it and I am still learning through DataCamp and keep doing new courses. They seriously have some exciting courses. Do check them out.

我个人是 DataCamp 的粉丝 ,我从此开始,但仍在学习 DataCamp 并继续 学习 新课程。 他们认真地开设了一些令人兴奋的课程。 请检查一下。

1.处理缺少输入的数据 (1. handling-missing-data-with-imputations-in-r)

2.在Python中处理丢失的数据 (2. dealing-with-missing-data-in-python)

3.处理R中的缺失数据 (3. dealing-with-missing-data-in-r)

4.在Python中构建数据工程管道 (4. building-data-engineering-pipelines-in-python)

5.数据工程概论 (5. introduction-to-data-engineering)

6.用Python进行数据处理 (6. data-manipulation-with-python)

7.熊猫数据处理 (7. data-manipulation-with-pandas)

8.使用R进行数据处理 (8. data-manipulation-with-r)

P.S: I am still using DataCamp and keep doing courses in my free time. I actually insist the readers to try out any of the above courses as per their interest, to get started and build a good foundation in Machine learning and Data Science. The best thing about these courses by DataCamp is that they explain it in a very elegant and different manner with a balanced focus on practical and well as conceptual knowledge and at the end, there is always a Case study. This is what I love the most about them. These courses are truly worth your time and money. These courses would surely help you also understand and implement Deep learning, machine learning in a better way and also implement it in Python or R. I am damn sure you will love it and I am claiming this from my personal opinion and experience.

PS:我仍在使用 DataCamp, 并在 业余时间 继续 上课 我实际上是坚持要求读者根据自己的兴趣尝试上述任何课程,以开始并为机器学习和数据科学打下良好的基础。 DataCamp 开设的这些课程的最好之 在于,他们以非常优雅且与众不同的方式 对课程进行了 解释,同时重点关注实践和概念知识,最后始终进行案例研究。 这就是我最喜欢他们的地方。 这些课程确实值得您花费时间和金钱。 这些课程肯定会帮助您更好地理解和实施深度学习,机器学习,并且还可以在Python或R中实现它。我该死的,您一定会喜欢它的,我是从我个人的观点和经验中主张这一点的。

Also, I have noticed that DataCamp is giving unlimited access to all the courses for free for one week starting from 1st of September through 8th September 2020, 12 PM EST. So this would literally be the best time to grab some yearly subscriptions(which I have) which basically has unlimited access to all the courses and other things on DataCamp and make fruitful use of your time sitting at home during this Pandemic. So go for it folks and Happy learning

此外,我注意到, DataCamp 自2020年9月1日至美国东部时间12 PM,为期一周免费无限制地访问所有课程。 因此,从字面上看,这将是获取一些年度订阅(我拥有)的最佳时间,该订阅基本上可以无限制地访问 DataCamp 上的所有课程和其他内容, 并可以在这次大流行期间充分利用您在家里的时间。 因此,请亲朋好友学习愉快

Coming back to the topic -

回到主题-

Sklearn.impute package provides 2 types of imputations algorithms to fill in missing values:

Sklearn.impute包提供了两种插补算法来填充缺失值:

1. SimpleImputer (1. SimpleImputer)

SimpleImputer is used for imputations on univariate datasets, univariate datasets are datasets that have only a single variable. SimpleImputer allows us to impute values in any feature column using only missing values in that feature space.

SimpleImputer用于单变量数据集的插补, 变量数据集是仅具有单个变量的数据集。 SimpleImputer允许我们仅使用该要素空间中的缺失值插补任何要素列中的值

There are different strategies provided to impute data such as imputing with a constant value or using the statistical methods such as mean, median or mode to impute data for each column of missing values.

提供了不同的策略来估算数据,例如以恒定值进行估算,或使用统计方法(例如均值,中位数或众数)为缺失值的每一列估算数据。

For categorical data representations, it has support for ‘most-frequent’ strategy which is like the mode of numerical values.

对于分类数据表示,它支持“最频繁”策略,就像数值模式一样。

Image for post
dataframe with 5 columns具有5列的数据框
Image for post
number of missing values in each column每列中缺失值的数量
Image for post
Image for post
Image for post
Imputing Data using Mean, Median and Mode Strategy from SimpleImputer使用SimpleImputer的均值,中位数和众数策略插补数据

2.迭代计算机 (2. IterativeImputer)

IterativeImputer is used for imputations on multivariate datasets, multivariate datasets are datasets that have more than two variables or feature columns per observation. IterativeImputer allows us to make use of the entire dataset of available features columns to impute the missing values.

IterativeImputer用于对多元数据集进行插补, 多元数据集是每个观察值具有两个以上变量或特征列的数据集。 IterativeImputer允许我们利用可用要素列的整个数据集来估算缺失值。

In IterativeImpute each feature with a missing value is used as a function of other features with known output and models the function for imputations. The same process is then iterated in a loop for some iterations and at each step, a feature column is selected as output y and other feature columns are treated as inputs X, then a regressor is fit on (X, y) for known y and is used to predict the missing values of y.

IterativeImpute与缺失值的每个特征被用作与已知的输出和模式插补函数其它特征的函数 。 然后,在循环中重复相同的过程进行一些迭代,并在每个步骤中,选择一个特征列作为 输出y ,将其他特征列视为 输入X ,然后将 回归器拟合到(X,y)上 以获取已知 y 和用于 预测y的缺失值

The same process is repeated for each feature column in a loop and the average of all the multiple regression values are taken to impute the missing values for the data points.

对循环中的每个要素列重复相同的过程,并采用所有多个回归值的平均值来估算数据点的缺失值。

Image for post
Imputing Data using IterativeImputer使用IterativeImputer插补数据

失踪图书馆 (Missingpy library)

Missingpy is a library in python used for imputations of missing values. Currently, it supports K-Nearest Neighbours based imputation technique and MissForest i.e Random Forest-based imputation technique.

Missingpy是python中的一个库,用于估算缺失值。 当前,它支持基于K最近邻的插补技术和MissForest即基于随机森林的插补技术。

To install missingpy library, you can type the following in command line:

要安装missingpy库,可以在命令行中键入以下内容:

pip install missingpy

pip install missingpy

3. KNNImputer (3. KNNImputer)

KNNImputer is a multivariate data imputation technique used for filling in the missing values using the K-Nearest Neighbours approach. Each missing value is filled by the mean value form the n nearest neighbours found in the training set, either weighted or unweighted.

KNNImputer是一种多变量数据插补技术,用于使用K最近邻方法填充缺失值。 每个缺失值都由在训练集中找到的n个最近邻居 (加权或未加权)的平均值填充。

If a sample has more than one feature missing then the neighbour for that sample can be different and if the number of neighbours is lesser than n_neighbour specified then there is no defined distance in the training set, the average of that training set is used during imputation.

如果样本缺少一个以上的特征,则该样本的邻居可能会有所不同 ;如果邻居的数量小于指定的n_neighbour,则训练集中没有定义的距离,则在插补过程中将使用该训练集的平均值。

Nearest neighbours are selected on the basis of distance metrics, by default it is set to euclidean distance and n_neighbour are specified to consider for each step.

根据距离量度 选择最近的邻居 默认情况下将其设置为欧式距离,并为每个步骤指定要考虑的n_neighbour

Image for post
Imputing Data using KNN from missingpy使用失踪的KNN插值数据

4.小姐森林 (4. MissForest)

It is another technique used to fill in the missing values using Random Forest in an iterated fashion. The candidate column is selected from the set of all the columns having the least number of missing values.

这是另一种使用“ 随机森林 ”以迭代方式填充缺失值的技术。 从缺少值最少的所有列的集合中选择候选列

In the first step, all the other columns i.e non-candidate columns having missing values are filled with the mean for the numerical columns and mode for the categorical columns and after that imputer fits a random forest model with the candidate columns as the outcome variable(target variable) and remaining columns as independent variables and then filling the missing values in candidate column using the predictions from the fitted Random Forest model.

第一步,将所有其他列(即缺少值的非候选列)填充为数值列的平均值和分类列的众数,然后,imputer将候选列作为结果变量拟合随机森林模型 (目标变量)和其余列作为自变量 ,然后使用拟合随机森林模型的预测填充候选列中的缺失值。

Then the imputer moves on and the next candidate column is selected with the second least number of missing values and the process repeats itself for each column with the missing values.

然后,推动者继续前进,并选择缺失值次之下一个候选列,并且该过程针对具有缺失值的每一列重复其自身。

Image for post
Imputing data using MissForest from Missingpy使用Missingpy的MissForest插补数据

进一步阅读 (Further Readings)

FancyImpute:

FancyImpute:

IterativeImputer was originally a part of the fancy impute but later on was merged into sklearn. Apart from IterativeImputer, fancy impute has many different algorithms that can be helpful in imputing missing values. Few of them are not much common in the industry but have proved their existence in some particular projects, that is why I am not including them in today's article. You can study them here.

IterativeImputer最初是幻想式估算的一部分,但后来合并为sklearn。 除了IterativeImputer之外,花式插补还具有许多不同的算法,可用于插补缺失值。 它们很少在行业中并不常见,但是已经证明它们在某些特定项目中的存在,这就是为什么我不在今天的文章中包括它们。 你可以在这里学习。

AutoImpute:

自动提示:

It is yet another python package for analysis and imputation of missing values in datasets. It supports various utility functions to examine patterns in missing values and provides some imputations methods for continuous, categorical or time-series data. It also supports multiple and single imputation frameworks for imputations. You can study them here.

它是另一个python软件包,用于分析和估算数据集中的缺失值。 它支持各种实用程序功能以检查缺失值中的模式,并为连续,分类或时间序列数据提供一些插补方法。 它还支持用于插补的多个和单个插补框架。 你可以在这里学习。

If you enjoyed reading this article, I am sure that we share similar interests and are/will be in similar industries. So let’s connect via LinkedIn and Github. Please do not hesitate to send a contact request!

如果您喜欢阅读本文,那么我相信我们有相同的兴趣并且会/将会从事相似的行业。 因此,让我们通过LinkedIn和Github进行连接。 请不要犹豫,发送联系请求!

翻译自: https://medium.com/analytics-vidhya/a-quick-guide-on-missing-data-imputation-techniques-in-python-2020-5410f3df1c1e

python 插补数据

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/389696.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

NIO 学习笔记

0. 介绍 参考 关于Java IO与NIO知识都在这里 ,在其基础上进行修改与补充。 1. NIO介绍 1.1 NIO 是什么 Java NIO 是 java 1.4, 之后新出的一套IO接口. NIO中的N可以理解为Non-blocking,不单纯是New。 1.2 NIO的特性/NIO与IO区别 IO是面向流的&#x…

[原创]java获取word里面的文本

需求场景 开发的web办公系统如果需要处理大量的Word文档(比如有成千上万个文档),用户一定提出查找包含某些关键字的文档的需求,这就要求能够读取 word 中的文字内容,而忽略其中的文字样式、表格、图片等信息。 方案分析…

ab 模拟_Ab测试第二部分的直观模拟

ab 模拟In this post, I would like to invite you to continue our intuitive exploration of A/B testing, as seen in the previous post:在本文中,我想邀请您继续我们对A / B测试的直观探索,如前一篇文章所示: Resuming what we saw, we…

1886. 判断矩阵经轮转后是否一致

1886. 判断矩阵经轮转后是否一致 给你两个大小为 n x n 的二进制矩阵 mat 和 target 。现 以 90 度顺时针轮转 矩阵 mat 中的元素 若干次 ,如果能够使 mat 与 target 一致,返回 true ;否则,返回 false 。 示例 1: 输…

samba登陆密码不正确

win7访问Linux Samba的共享目录提示“登录失败:用户名或密码错误”解决方法 解决办法:修改本地安全策略 通过Samba服务可以实现UNIX/Linux主机与Windows主机之间的资源互访,由于实验需要,轻车熟路的在linux下配置了samba服务&…

各类软件马斯洛需求层次分析_需求的分析层次

各类软件马斯洛需求层次分析When I joined Square, I was embedded on a product that had been in-market for a year but didn’t have dedicated analytics support.当我加入Square时,我被嵌入了已经上市一年但没有专门的分析支持的产品。 As you might expect,…

MySQL的变量分类总结

在MySQL中,my.cnf是参数文件(Option Files),类似于ORACLE数据库中的spfile、pfile参数文件,照理说,参数文件my.cnf中的都是系统参数(这种称呼比较符合思维习惯),但是官方…

亚洲国家互联网渗透率_发展中亚洲国家如何回应covid 19

亚洲国家互联网渗透率The COVID-19 pandemic has severely hit various economies across the world, with global impact estimated between USD 6.1 trillion and USD 9.1 trillion, equivalent to a loss of 7.1% to 10.5% of global gross domestic product (GDP).[1] More…

snake4444勒索病毒成功处理教程方法工具达康解密金蝶/用友数据库sql后缀snake4444...

*snake4444勒索病毒成功处理教程方法 案例:笔者负责一个政务系统的第三方公司的运维,上班后发现服务器的所有文件都打不开了,而且每个文件后面都有一个snake4444的后缀,通过网络我了解到这是一种勒索病毒。因为各个文件不能正常打…

有史以来最漂亮的游戏机

The recent reveal of the PlayStation 5’s design has divided the gaming world. There are those who appreciate its bold, daring industrial design and those who would have preferred something a little less outlandish; perhaps a little more traditional.吨 他最…

墨刀原型制作 位置选择_原型制作不再是可选的

墨刀原型制作 位置选择The ‘role’ of a designer has been a topic of discussion several many years now. In the past decade, the role of a Designer got split into several different roles like — Graphic Designer, User Experience Designer, Interaction Designe…

eclipse maven 构建简单springmvc项目

环境&#xff1a;eclipse Version: Oxygen.3a Release (4.7.3a) 创建maven Project项目&#xff0c;目录结构 修改工程的相关编译属性 修改pop.xml&#xff0c;引入springmvc相关包 <project xmlns"http://maven.apache.org/POM/4.0.0"xmlns:xsi"http://www.…

使用协同过滤推荐电影

ALSO, ARE RECOMMENDER SYSTEMS INFLUENCING OUR TASTE??此外&#xff0c;推荐系统是否影响我们的口味&#xff1f; An excerpt on creating a movie recommender system similar to the OTT platforms.有关创建类似于OTT平台的电影推荐系统的摘录。 INTRODUCTION介绍 For…

数据暑假实习面试_面试数据科学实习如何准备

数据暑假实习面试Unfortunately, on this occasion, your application was not successful, and we have appointed an applicant who…不幸的是&#xff0c;这一次&#xff0c;您的申请没有成功&#xff0c;我们已经任命了一位符合以下条件的申请人&#xff1a; Sounds famili…

谷歌 colab_如何在Google Colab上使用熊猫分析

谷歌 colabRecently, pandas have come up with an amazing open-source library called pandas-profiling. Generally, EDA starts by df.describe(), df.info() and etc which to be done separately. Pandas_profiling extends the general data frame report using a singl…

Java之生成Pdf并对Pdf内容操作

虽说网上有很多可以在线导出Pdf或者word或者转成png等格式的工具&#xff0c;但是我觉得还是得了解知道是怎么实现的。一来&#xff0c;在线免费转换工具&#xff0c;是有容量限制的&#xff0c;达到一定的容量时&#xff0c;是不能成功导出的;二来&#xff0c;业务需求&#x…

边际概率条件概率_数据科学家解释的边际联合和条件概率

边际概率条件概率Probability plays a very important role in Data Science, as Data Scientist regularly attempt to draw statistical inferences that could be used to predict data or analyse data better.P robability起着数据科学非常重要的作用&#xff0c;为数据科…

袋装决策树_袋装树是每个数据科学家需要的机器学习算法

袋装决策树袋装树木介绍 (Introduction to Bagged Trees) Without diving into the specifics just yet, it’s important that you have some foundation understanding of decision trees.尚未深入研究细节&#xff0c;对决策树有一定基础了解就很重要。 From the evaluatio…

[JS 分析] 天_眼_查 字体文件

0. 参考 js分析 猫_眼_电_影 字体文件 font-face 1. 分析 1.1 定位目标元素 1.2 查看网页源代码 1.3 requests 请求提取得到大量错误信息 对比猫_眼_电_影抓取到unicode编码&#xff0c;天_眼_查混合使用正常字体和自定义字体&#xff0c;难点在于如何从 红 转化为 美。 一开始…

经天测绘测量工具包_公共土地测量系统

经天测绘测量工具包部分-乡镇第一师 (Sections — First Divisions of Townships) The PLSS Townships are typically divided into 36 Sections (nominally one mile on a side), but in the national standard this feature is called the first division because Townships …