Pandas Datasets: Pandas for Data Science, Part 3

Data is almost never perfect. Data scientists spend more time preprocessing a dataset than creating a model. We often come across datasets in which some data is missing. Such data points are represented as NaN (Not a Number) in Pandas. It is therefore very important to discover columns with NaN/null values in the early stages of analyzing data.

We have covered many methods in the Pandas library. If you haven't read the previous articles, I recommend going through them first to get into the flow. If you have been following from the beginning, let's get started.

In this article, we are going to learn:

What is NaN?
How to find NaN in a dataset?
How to deal with NaN as a beginner?
Finally, some methods to make a dataframe more readable.

How to find NaN in a dataset?

To check for NaN data in a column or in an entire dataframe, we use isnull() or isna(). Both work the same way, so we will use isnull() in this article. If you want to understand why there are two methods for the same task, you can learn it here. Let's begin by checking null values in the entire dataset.

>> titanic_data.info()
output :
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Here you can see some valuable information about the dataset. The information we are interested in is in the Non-Null Count column, which shows the number of non-null data points in each column. The first line of the output shows that there are 891 entries in total, that is, 891 data points. We can also directly check the number of non-null entries in each column using the count() method.

>> print(titanic_data.count())
output :
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

From this we can conclude that Age, Cabin and Embarked are the columns with null values. There is another way to get this result, using the isnull() method discussed earlier.

>> print(titanic_data.isnull().any())
output :
PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

>> print(titanic_data.isnull().sum())
output :
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

As we can see, this result is much more convenient if we are solely interested in null values.
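
The same checks can be tried on a small frame. The sketch below uses a made-up dataframe named `sample` (not the Titanic data) and assumes pandas and NumPy are installed. It shows that isnull() and isna() give the same mask, and that the null count per column can also be derived from count().

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, standing in for titanic_data
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() and isna() produce identical boolean masks
same_mask = sample.isnull().equals(sample.isna())

# Number of nulls per column
null_counts = sample.isnull().sum()

# Same numbers, derived from count(): total rows minus non-null rows
null_from_count = len(sample) - sample.count()

# Fraction of nulls per column (mean of the boolean mask)
null_share = sample.isnull().mean()

print(same_mask)
print(null_counts)
print(null_share)
```

The fraction of nulls is often easier to judge than the raw count when deciding what to do with a column.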

How to deal with NaN as a beginner?

It is important to know the number of null values in a column, as it helps us decide how to deal with them. If there are only a few null values, as in Embarked, we can remove those entries from the dataset. However, if most of the values are null, as in Cabin, it is better to skip that column while creating the model.
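
A minimal sketch of both options, using a made-up frame rather than the Titanic data. dropna(subset=...) and drop(columns=...) are standard pandas calls; the column names only mirror the ones discussed above, and the values are invented.

```python
import pandas as pd

# Hypothetical toy data with the same column names as in the article
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Cabin": [None, None, "C85", None, None],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Few nulls in Embarked: drop only the rows where Embarked is null
rows_dropped = sample.dropna(subset=["Embarked"])

# Mostly nulls in Cabin: drop the whole column instead
column_dropped = rows_dropped.drop(columns=["Cabin"])

print(rows_dropped.shape)
print(column_dropped.columns.tolist())
```

Neither call modifies `sample`; each returns a new dataframe, so the original stays available for comparison.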

There is another case, where the null values are too many to simply remove the entries but too few to justify skipping the column, as with Age here. For such cases there are many ways to deal with null values, but as beginners we will learn just one trick here: filling them with a value. We will use the fillna() method to do that.

>> titanic_data.Age.fillna("Unknown", inplace = True)
>> print(titanic_data.Age.isnull().any())
output :
False
# The Age column no longer has any null values

We used the inplace argument so that the changes are applied to the object calling the method. If we do not pass this argument, or keep it False, the method returns a new object and the changes will not appear in our dataset. We can also check whether a specific column has null values in the same manner as we did for the whole dataset.
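
To illustrate the difference, here is a small sketch on made-up data. It fills the missing ages with the column median instead of the string "Unknown"; that is a swapped-in choice, not what the article does above, and it keeps the column numeric. It also assigns the result back rather than calling an inplace method on a selected column, since recent pandas versions may warn about that pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
sample = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 30.0]})

# Without inplace=True, fillna() returns a new Series; sample is unchanged
filled = sample["Age"].fillna(sample["Age"].median())
still_has_nulls = sample["Age"].isnull().any()

# Assigning the result back updates the dataframe
sample["Age"] = filled

print(still_has_nulls)
print(sample["Age"].isnull().any())
print(sample["Age"].tolist())
```

Whether the median, the mean or some other value is a sensible fill depends on the data; the point here is only how the returned object and the original differ.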

We can also replace values in a column that are not NaN, using the replace() method.

>> titanic_data.Sex.replace("male","M",inplace = True)
>> titanic_data.Sex.replace("female","F",inplace = True)
>> print(titanic_data.Sex)
output :
0      M
1      F
2      F
3      F
4      M
      ..
886    M
887    F
888    F
889    M
890    M
Name: Sex, Length: 891, dtype: object
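
Both replacements can also be expressed in one call by passing a dictionary to replace(). The sketch below uses a made-up `Sex` column and assigns the result back instead of using inplace.

```python
import pandas as pd

# Hypothetical toy data
sample = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Map several old values to new ones in a single replace() call
sample["Sex"] = sample["Sex"].replace({"male": "M", "female": "F"})

print(sample["Sex"].tolist())
```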

Some methods to make a dataset more readable

1. rename() : There might be situations where we realize that a column name does not suit our requirements. We can use the rename() method to change the column name.

>> titanic_data.rename(columns={"Sex":"Gender"},inplace=True)
>> print(titanic_data.Gender)
output :
0      M
1      F
2      F
3      F
4      M
      ..
886    M
887    F
888    F
889    M
890    M
Name: Gender, Length: 891, dtype: object

2. rename_axis() : This is a simple method and, as the name suggests, it is used to provide names for the axes.

>> titanic_data.rename_axis("Sr.No",axis='rows',inplace=True)
>> titanic_data.rename_axis("Category",axis='columns',inplace=True)
>> print(titanic_data.head(2))
output :
Category  PassengerId  Survived  Pclass  .....
Sr.No                                     
0                   1         0       3   
1                   2         1       1
[2 rows x 12 columns]
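
Both methods also work without inplace, returning a new dataframe, which allows them to be chained. A small sketch on made-up data; the names `Gender` and `Sr.No` follow the article, everything else is illustrative.

```python
import pandas as pd

# Hypothetical toy data
sample = pd.DataFrame({"PassengerId": [1, 2], "Sex": ["M", "F"]})

# rename() changes column labels; rename_axis() names the axes themselves
renamed = (
    sample
    .rename(columns={"Sex": "Gender"})
    .rename_axis("Sr.No", axis="rows")
    .rename_axis("Category", axis="columns")
)

print(renamed.columns.tolist())
print(renamed.index.name, renamed.columns.name)
```

Because nothing is changed in place, `sample` keeps its original column names.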

With this we come to the end of this article and of the series on Pandas. I believe the methods we came across in this series are very helpful for analyzing data before we start training models on it. However, this is just a small fraction of the methods in the Pandas library and just the beginning of data exploration and preprocessing. As a beginner, I think these are enough to get started on the Data Science journey. I hope you found this series valuable. Thank you for reading. Keep practicing. Happy Coding! 😄
