Python: Handling Missing Values in a Data Frame

Introduction

In the last article we went through how to find missing values in a data frame. The details are at this link: https://medium.com/@kallepalliravi/python-finding-missing-values-in-a-data-frame-3030aaf0e4fd

Now that you have identified all the missing values, what should you do with them? In this article we will go over how to handle missing data in a data frame.

There are multiple ways of handling missing data, and the right one varies case by case. There is no universally best way to deal with missing data. Use your best judgement and explore different options to determine which method is best for your data set.

  1. Deleting all rows/columns with missing data: This can be used when the majority of the data in a row or column is missing. When you delete rows/columns you may lose valuable information, which can lead to biased models. So analyze your data before deleting and check whether there is a particular reason the data is missing.

  2. Imputing data: This is by far the most common way to handle missing data. In this method you impute a value where data is missing. Imputing data can introduce bias into the data set. Imputation can be done in multiple ways:

a. You can impute the mean, median or mode of a column into its missing values.

b. You can use predictive algorithms to impute missing values.

c. For categorical variables you can label missing data as a category of its own.

For this exercise we will use the Seattle Airbnb data set, which can be found at the link below. https://www.kaggle.com/airbnb/seattle?select=listings.csv

Load the data and find the missing values.


The details of these steps can be found in the previous post at the link below. https://medium.com/@kallepalliravi/python-finding-missing-values-in-a-data-frame-3030aaf0e4fd

[Image: Load the data file and check the structure of the data]
[Image: % of missing data in each numerical column]
[Image: % of missing data in categorical columns]
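
The screenshots for this step are not reproduced here, so the following is a minimal sketch of how the per-column missing percentages could be computed. The small data frame is a hypothetical stand-in for listings.csv (the real file has many more columns and rows), and the variable names are my own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for listings.csv; values are made up for illustration.
df = pd.DataFrame({
    "price": [85.0, 150.0, np.nan, 100.0],
    "square_feet": [np.nan, np.nan, np.nan, 800.0],
    "license": [np.nan, np.nan, np.nan, np.nan],
    "property_type": ["House", None, "Apartment", "House"],
})

# Share of missing values per column, expressed as a percentage.
missing_pct = df.isnull().mean() * 100

# Split the report into numerical columns and all other (categorical) columns.
numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

print(missing_pct[numeric_cols])
print(missing_pct[categorical_cols])
```

With the real data you would replace the hand-built frame with pd.read_csv("listings.csv"), assuming the file has been downloaded from the Kaggle link above.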

1. Deleting rows/columns with missing data:

Deleting specific rows/columns

From the above you can see that, among the numerical columns, 100% of the values in the license column and 97% of the values in the square_feet column are missing.

60% of the values in monthly_price, 51% of the values in security_deposit and 47% of the values in weekly_price are missing.

Let's try deleting these 5 columns.

The pandas drop function can be used to delete rows and columns. Full details of this function can be found at the link below. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html

All the columns to be deleted should be listed in the columns parameter. axis=1 refers to columns and axis=0 refers to rows; axis only matters when you pass names through the labels parameter, so with columns it can be left out. In this case we are telling pandas to delete all the columns specified in the columns parameter.
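
The original code for this step was shown only as an image, so here is a minimal sketch of the drop call. The data frame is a hypothetical miniature of the listings data; only the five column names come from the discussion above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the listings data; values are made up.
df = pd.DataFrame({
    "price": [85.0, 150.0, 100.0],
    "license": [np.nan, np.nan, np.nan],
    "square_feet": [np.nan, np.nan, 800.0],
    "monthly_price": [np.nan, 3000.0, np.nan],
    "security_deposit": [np.nan, np.nan, 200.0],
    "weekly_price": [np.nan, 1000.0, np.nan],
})

cols_to_drop = ["license", "square_feet", "monthly_price",
                "security_deposit", "weekly_price"]

# drop returns a new data frame by default; df itself is left unchanged.
df_dropped = df.drop(columns=cols_to_drop)

# Equivalent form: pass the names as labels and say they are columns with axis=1.
df_dropped_alt = df.drop(cols_to_drop, axis=1)

print(df_dropped.columns.tolist())
```

Assigning the result to a new name keeps the original data frame around, which is handy if you later want to compare models with and without the dropped columns.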


After the drop, the data frame no longer contains the deleted columns.

Deleting rows/columns with NA

If you want to delete rows/columns that contain NA values, you can use the dropna function in pandas. Details of this function can be found at the link below. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

The dropna function has multiple parameters; the 3 main ones are:

  1. how: this has 2 options, "any" or "all". With "any", a row or column is deleted if even one of its values is NA. With "all", a row or column is deleted only if all of its values are NA.
  2. axis: this can be set to 0 or 1. If 0, rows with NA values are dropped; if 1, columns with NA values are dropped.
  3. subset: if you want the operation to consider only certain columns, list the column names in subset. If subset is not defined, the operation considers all the columns.
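
To make the three parameters concrete, here is a small sketch on a made-up data frame (the column names and values are assumptions for illustration, not the Airbnb data).

```python
import numpy as np
import pandas as pd

# Made-up data: the last row is entirely NA and "license" is entirely NA.
df = pd.DataFrame({
    "price": [85.0, np.nan, 100.0, np.nan],
    "bedrooms": [1.0, 2.0, np.nan, np.nan],
    "license": [np.nan, np.nan, np.nan, np.nan],
})

# how="all", axis=1: drop columns in which every value is NA (removes "license").
no_empty_cols = df.dropna(how="all", axis=1)

# how="all", axis=0: drop rows in which every value is NA (removes the last row).
no_empty_rows = df.dropna(how="all", axis=0)

# subset: only look at "price" when deciding which rows to drop.
has_price = df.dropna(how="any", axis=0, subset=["price"])

print(no_empty_cols.columns.tolist())
print(no_empty_rows.index.tolist())
print(has_price.index.tolist())
```

Note that how="any" with no subset would remove every row here, because the license column is NA everywhere. That is a good reason to drop fully empty columns first.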

2. Imputing Data

To impute is to assign a value by inference. In this case you fill in a missing value by applying some method to the feature that contains it. The method can be as simple as assigning the mean, median or mode of the column to the missing values, or you can use machine learning techniques to predict them. Imputation methods can differ for numerical and categorical variables.

Imputation for Numerical values:

With numerical columns, the most common approach is to impute the mean, median or mode of the column in place of the missing values.

To do that we will write a function that fills NA with the mean/median/mode and then apply that function to all the columns.

As an example, I will fill the missing data with the mean of the column.

The fill_mean function is called on each column in turn and fills that column's NA values with the column mean.

You can then use the apply() function to apply fill_mean to one or multiple columns of a data frame.

This example uses the mean; you can use the median() or mode() function in place of mean() if you want to impute the median or mode of the column. Note that mode() returns a Series, because a column can have more than one mode, so take its first element.
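
The original fill_mean code was shown only as an image, so the version below is a reconstruction under that caveat rather than the author's exact function. The data frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd


def fill_mean(col):
    """Return the column with its NA values replaced by the column mean."""
    return col.fillna(col.mean())


# Made-up data with one missing value in each numerical column.
df = pd.DataFrame({
    "bathrooms": [1.0, np.nan, 2.0, 3.0],
    "bedrooms": [1.0, 2.0, np.nan, 3.0],
    "property_type": ["House", None, "Apartment", "House"],
})

# Apply fill_mean column by column (axis=0) to the numerical columns only.
num_cols = ["bathrooms", "bedrooms"]
df[num_cols] = df[num_cols].apply(fill_mean, axis=0)

# Variants: col.fillna(col.median()) or col.fillna(col.mode().iloc[0]).
print(df)
```

Restricting apply to the numerical columns matters, because calling mean() on a text column such as property_type would fail.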


Imputation for Categorical values:

For categorical variables you clearly cannot use the mean or median for imputation. But you can use the mode, which is the most frequently occurring value, or alternatively treat missing data as a category by itself.

Since I have already gone through how to impute the most frequent value, in this step I will show how to make missing data a category. This is very straightforward: you just replace NA with a "missing data" category. "missing data" will then be one of the levels in each categorical variable.
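
A minimal sketch of this replacement, assuming a made-up data frame with two categorical columns (the column names and values are illustrative, not taken from the original screenshot):

```python
import numpy as np
import pandas as pd

# Made-up data: two categorical columns and one numerical column.
df = pd.DataFrame({
    "property_type": ["House", None, "Apartment", "House"],
    "bed_type": ["Real Bed", "Futon", None, None],
    "price": [85.0, np.nan, 100.0, 120.0],
})

# Replace NA with an explicit "missing data" level in the categorical columns only.
cat_cols = ["property_type", "bed_type"]
df[cat_cols] = df[cat_cols].fillna("missing data")

print(df["bed_type"].value_counts())
```

The numerical price column is deliberately left alone here; filling it with a string would turn the whole column into text.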


Imputation using a model to predict missing values:

One more option is to use a model to predict missing values. To perform this task you can use IterativeImputer from the sklearn library. You can find details at the link below.

https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html


IterativeImputer takes each feature with missing values and models it as a function of the other features. It then uses that model to estimate the missing values and imputes them.

It does this in an iterative manner: it takes the 1st feature with missing values, treats it as the response variable, and treats all the other features as input variables. Using these input variables it estimates the missing values in the response variable. In the next step it takes the 2nd feature with missing values as the response variable, uses all the other features as input variables, and estimates its missing values. This process continues until all the features with missing values are addressed, and the whole cycle is repeated for several rounds (controlled by the max_iter parameter).

In this example I use a random forest as the estimator inside the imputer and fit the imputer to a data frame.
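
The original code was shown only as an image, so this is a sketch of one way to set it up, not the author's exact settings. The data frame is made up, and n_estimators, max_iter and random_state are my own choices to keep the example small and repeatable.

```python
import numpy as np
import pandas as pd
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Made-up numerical data with one missing value per column.
df = pd.DataFrame({
    "bedrooms": [1.0, 2.0, np.nan, 3.0, 2.0, 1.0],
    "bathrooms": [1.0, np.nan, 2.0, 2.0, 1.0, 1.0],
    "price": [85.0, 150.0, 200.0, np.nan, 140.0, 90.0],
})

# A random forest predicts each feature from the others, round after round.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    max_iter=5,
    random_state=0,
)

# fit_transform returns a NumPy array, so wrap it back into a data frame.
imputed = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)

print(imputed)
```

IterativeImputer works on numerical input only, so categorical columns would need to be encoded or handled separately first. On a tiny data set like this one scikit-learn may warn that the imputer has not converged, which is harmless for a demonstration.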


Conclusion:

In this article we went through how to handle missing values in a data frame:

  1. Deleting the rows/columns with missing values.
  2. Imputing the missing values with a statistic such as the mean, median or mode.
  3. For categorical variables, making missing data a category.
  4. Using IterativeImputer to develop a model that predicts the missing values in each of the features.

Translated from: https://medium.com/analytics-vidhya/python-handling-missing-values-in-a-data-frame-4156dac4399
