How to Find Missing Values in a Data Frame Using Python/Pandas
Introduction:
When you start working on a data science project, the data you are given is rarely clean. One of the most common issues with any dataset is missing values. Most machine learning algorithms cannot handle missing values, so they need to be addressed before you apply any such algorithm.
Missing values can be handled in different ways depending on whether the affected column is continuous or categorical. In this article I will cover how to find missing values. In the next article I will cover how to handle them.
Finding Missing Values:
For this exercise I will be using the “listings.csv” data file from the Seattle Airbnb data. The data can be found at this link: https://www.kaggle.com/airbnb/seattle?select=listings.csv
Step 1: Load the data frame and study the structure of the data frame.
The first step is to load the file and look at its structure. When you have a big dataset with a high number of columns, it is hard to look at each column and study its type.
To find out how many of the columns are categorical and how many are numerical, we can use pandas “dtypes” to get the data type of each column and the pandas “value_counts()” function to count how many columns have each data type. value_counts() groups all the unique instances and gives the count of each.
In this dataset, 62 columns are objects (categorical data), 17 columns are of float data type and 13 columns are of int data type.
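As a minimal sketch of this step, the snippet below counts columns per data type. It assumes a small made-up data frame in place of the real listings.csv (the rows and values are hypothetical and only a handful of columns are included), so the counts differ from the 62/17/13 reported for the full file.

```python
import numpy as np
import pandas as pd

# With the real file this would be: df = pd.read_csv("listings.csv")
# Made-up stand-in rows so the example runs without the download.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [85.0, 150.0, np.nan, 99.0],
    "square_feet": [np.nan, np.nan, np.nan, 600.0],
    "license": [np.nan] * 4,
    # dtype is set explicitly so the text column is "object" on every pandas version
    "room_type": pd.Series(["Entire home/apt", None, None, "Shared room"], dtype="object"),
})

# dtypes holds one data type per column; value_counts() counts columns per type.
type_counts = df.dtypes.value_counts()
print(type_counts)
```

Note that a column containing only missing values, such as license here, is stored as float, which is why it ends up among the numerical columns.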
Step 2: Separate categorical and numerical columns in the data frame
The reason to separate the categorical and numerical columns is that the methods for handling missing values differ between these two data types, which I will walk through in the next article.
The easiest way to do this is to filter the columns of the original data frame by data type. Using the “dtypes” attribute and the equality operator, you can find out which columns are objects (categorical variables) and which are not.
To get the names of the columns that satisfy these conditions, we can use “df.columns”. Indexing df.columns with each condition gives the column names which are objects and the column names which are not.
This splits the columns of the original data frame into two groups, each assigned to its own variable: one for categorical variables and one for non-categorical variables.
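The split described in this step might look like the sketch below. The data frame is a made-up stand-in for listings.csv, and “cat_vars” is an assumed name chosen to pair with the article's “num_vars”.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Airbnb listings; values are hypothetical.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [85.0, 150.0, np.nan, 99.0],
    "square_feet": [np.nan, np.nan, np.nan, 600.0],
    "license": [np.nan] * 4,
    # dtype is set explicitly so the text column is "object" on every pandas version
    "room_type": pd.Series(["Entire home/apt", None, None, "Shared room"], dtype="object"),
})

# Boolean condition per column, then used to index the column names.
cat_vars = df.columns[df.dtypes == "object"]   # categorical (object) columns
num_vars = df.columns[df.dtypes != "object"]   # everything else

print(list(cat_vars))
print(list(num_vars))
```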
Step 3: Find the missing values
Finding missing values works the same way for both categorical and continuous variables. We will use “num_vars”, which holds the names of all the columns that are not of object data type.
df[num_vars] gives you all the columns listed in “num_vars”, that is, all the columns in the data frame which are not of object data type.
We can use the pandas “isnull()” function to find all the fields which have missing values. It returns True for each field whose value is missing and False for each field whose value is present.
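A small sketch of what isnull() returns, again on a made-up stand-in for the listings data (the values are hypothetical): the result has the same shape as the input, with a boolean in every field.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Airbnb listings; values are hypothetical.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [85.0, 150.0, np.nan, 99.0],
    "square_feet": [np.nan, np.nan, np.nan, 600.0],
    "license": [np.nan] * 4,
    "room_type": pd.Series(["Entire home/apt", None, None, "Shared room"], dtype="object"),
})
num_vars = df.columns[df.dtypes != "object"]

# One True/False per field: True where the value is missing.
mask = df[num_vars].isnull()
print(mask)
```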
To get the number of missing values in each column, we use sum() along with isnull(). This sums up all the True values in each column from the step above.
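Chaining the two calls can be sketched as follows, using the same made-up stand-in data (hypothetical values). Each True counts as 1, so the sum is the number of missing values per column.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Airbnb listings; values are hypothetical.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [85.0, 150.0, np.nan, 99.0],
    "square_feet": [np.nan, np.nan, np.nan, 600.0],
    "license": [np.nan] * 4,
    "room_type": pd.Series(["Entire home/apt", None, None, "Shared room"], dtype="object"),
})
num_vars = df.columns[df.dtypes != "object"]

# Count the True values in each column of the isnull() result.
missing_counts = df[num_vars].isnull().sum()
print(missing_counts)
```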
It's always good practice to sort the columns in descending order so you can see which columns have the most missing values. To do this we can use the sort_values() function. By default this function sorts in ascending order. Since we want the columns with the most missing values first, we set it to descending by passing the “ascending=False” parameter to sort_values().
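The sorting step could be sketched like this, on the same made-up stand-in data (hypothetical values):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Airbnb listings; values are hypothetical.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [85.0, 150.0, np.nan, 99.0],
    "square_feet": [np.nan, np.nan, np.nan, 600.0],
    "license": [np.nan] * 4,
    "room_type": pd.Series(["Entire home/apt", None, None, "Shared room"], dtype="object"),
})
num_vars = df.columns[df.dtypes != "object"]

# Missing-value counts per column, largest first.
sorted_counts = df[num_vars].isnull().sum().sort_values(ascending=False)
print(sorted_counts)
```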
The above gives you the count of missing values in each column. To get the % of missing values in each column, you can divide by the length of the data frame. You can use “len(df)”, which gives you the number of rows in the data frame.
In this dataset, the license column is missing 100% of its data and the square_feet column is missing 97% of its data.
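A sketch of the division step is shown below. It runs on a four-row made-up stand-in rather than the real file, so the figures will not match the 100% and 97% quoted for the actual data. Dividing by len(df) yields a fraction between 0 and 1; multiplying by 100 to show a percentage is an optional extra, not something the article prescribes.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Airbnb listings; values are hypothetical.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [85.0, 150.0, np.nan, 99.0],
    "square_feet": [np.nan, np.nan, np.nan, 600.0],
    "license": [np.nan] * 4,
    "room_type": pd.Series(["Entire home/apt", None, None, "Shared room"], dtype="object"),
})
num_vars = df.columns[df.dtypes != "object"]

# Share of rows with a missing value in each column, largest first.
missing_fraction = df[num_vars].isnull().sum().sort_values(ascending=False) / len(df)
print(missing_fraction)

# Optional: express the same numbers as percentages.
print(missing_fraction * 100)
```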
Conclusion
This article went over how to find missing values in a data frame using the Python pandas library. Below are the steps:
Use the isnull() function to identify the missing values in the data frame.
Use the sum() function to get the number of missing values per column.
Use the sort_values(ascending=False) function to list the columns by missing values in descending order.
Divide by len(df) to get the % of missing values in each column.
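The four steps can also be wrapped into one reusable helper. The sketch below is only one possible arrangement: the function name “missing_value_summary” and its output column names are hypothetical, and the data is a made-up stand-in for listings.csv.

```python
import numpy as np
import pandas as pd

def missing_value_summary(frame):
    """Return per-column missing counts and fractions, most-missing first."""
    counts = frame.isnull().sum().sort_values(ascending=False)
    return pd.DataFrame({
        "missing_count": counts,
        "missing_fraction": counts / len(frame),
    })

# Made-up stand-in for the Airbnb listings; values are hypothetical.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [85.0, 150.0, np.nan, 99.0],
    "square_feet": [np.nan, np.nan, np.nan, 600.0],
    "license": [np.nan] * 4,
    "room_type": pd.Series(["Entire home/apt", None, None, "Shared room"], dtype="object"),
})

# Works on categorical and numerical columns alike.
summary = missing_value_summary(df)
print(summary)
```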
In this article we identified missing values; in the next we will go over how to handle them.
Translated from: https://medium.com/analytics-vidhya/python-finding-missing-values-in-a-data-frame-3030aaf0e4fd