17 - Handling Missing Values in Pandas

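Missing data is a common problem in data analysis. Missing values lower data quality and therefore reduce the accuracy of model predictions, which is especially damaging in machine learning and data mining. Handling missing values properly makes model predictions more accurate and more useful.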

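Why do missing values exist?

The examples in earlier chapters contained many NaN values. You may wonder why data goes missing and at what point it is lost. The following scenario gives one answer.

In many situations people are reluctant to reveal much about themselves. Suppose you are surveying users about their experience with a product. Some users are happy to describe their experience but unwilling to give their name and contact details; others share everything, including their name and contact details. As a result, some data is always lost for reasons outside your control, and this happens often in practice.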

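What is sparse data?

Sparse data refers to a database or dataset in which a large share of the values are missing or empty; such a dataset is called a sparse dataset. Sparse data is not invalid data. It is merely incomplete, and with suitable methods it can still be put to good use.

Sparse data arises for many reasons, which fall roughly into the following groups:

Sparse data caused by poorly conducted surveys;
Sparse data caused by natural limitations;
Sparse data produced by text mining.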
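
One way to judge how sparse a dataset is would be to measure the fraction of missing values in each column. The sketch below assumes a small, made-up survey table (the column names and values are hypothetical, not taken from this chapter):

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: names and phone numbers are often withheld.
survey = pd.DataFrame({
    "rating": [5, 4, 3, 5],
    "name": ["Ann", np.nan, np.nan, "Dan"],
    "phone": [np.nan, np.nan, np.nan, "555-0100"],
})

# isnull() marks missing cells as True; the mean of a boolean column
# is the fraction of True values (0.0 = complete, 1.0 = entirely missing).
missing_ratio = survey.isnull().mean()
print(missing_ratio)
```

A column whose ratio is close to 1.0 carries little information, while a low ratio suggests the gaps can probably be filled or dropped cheaply.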

Handling missing values

So how does Pandas deal with missing values? Let's take a look.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)

Output:

          0         1         2
a  1.659217 -0.522622  0.004241
b       NaN       NaN       NaN
c  0.788436 -1.135235 -1.753622
d       NaN       NaN       NaN
e  0.144724 -0.307758 -0.435239
f -0.807119  1.932682 -0.684306
g       NaN       NaN       NaN
h -0.026587 -0.732601  0.204647

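In the example above, reindex() adds the index labels 'b', 'd' and 'g', which have no data, so the resulting DataFrame contains missing values in those rows.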

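Checking for missing values

To make missing values easier to detect, Pandas provides the isnull() and notnull() functions, which work on both Series and DataFrame objects (isna() and notna() are equivalent aliases).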

import pandas as pd
import numpy as np

# Reindexing adds rows 'b', 'd' and 'g', which are filled with NaN
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print("Original data:")
print(df)
# True marks a missing value in columns 1 and 2
print("Is the value missing:")
print(df[[1, 2]].isnull())

Output:

Original data:
          0         1         2
a  0.232274  0.049742 -1.196369
b       NaN       NaN       NaN
c -0.598790  0.731822  0.898254
d       NaN       NaN       NaN
e -1.728575 -0.669052 -0.454844
f  0.739184  1.352023  0.061043
g       NaN       NaN       NaN
h  1.305116  0.578419  0.021779
Is the value missing:
       1      2
a  False  False
b   True   True
c  False  False
d   True   True
e  False  False
f  False  False
g   True   True
h  False  False

An example using the notnull() function:

import pandas as pd
import numpy as np

# Reindexing adds rows 'b', 'd' and 'g', which are filled with NaN
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print("Original data:")
print(df)
# True marks a value that is present in columns 1 and 2
print("Is the value present:")
print(df[[1, 2]].notnull())

Output:

Original data:
          0         1         2
a -0.213029  0.466790  0.632789
b       NaN       NaN       NaN
c  0.091104  0.674106  0.396130
d       NaN       NaN       NaN
e -0.304249 -0.098976 -0.627985
f -1.390221 -0.592548 -0.539891
g       NaN       NaN       NaN
h  0.624527  0.278154  1.006981
Is the value present:
       1      2
a   True   True
b  False  False
c   True   True
d  False  False
e   True   True
f   True   True
g  False  False
h   True   True
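
The boolean results of isnull() and notnull() are usually combined with other operations rather than printed. The following sketch shows three common uses: counting missing values per column, filtering rows with a boolean mask, and testing whether anything is missing at all. The variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9, dtype=float).reshape(3, 3),
                  index=['a', 'c', 'e'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e'])

# Summing the boolean mask counts the missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Use notnull() as a row filter: keep rows where column 'one' has a value.
complete_rows = df[df['one'].notnull()]
print(complete_rows)

# True if the DataFrame contains at least one missing value.
has_missing = df.isnull().values.any()
print(has_missing)
```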

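Calculations with missing data

Two points matter when computing with missing data. First, aggregations such as sum() skip NA values by default, so in a sum they effectively count as 0. Second, element-wise arithmetic propagates NA: if an operand is NA, the result is NA. Note that in pandas 0.22 and later the sum of an all-NA (or empty) Series is 0 rather than NA; pass min_count=1 to get NA in that case. For example: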

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(15).reshape(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
# sum() skips the NaN rows: 0 + 3 + 6 + 9 + 12
print("Sum of column one:", df['one'].sum())

Output:

    one   two  three
a   0.0   1.0    2.0
b   NaN   NaN    NaN
c   3.0   4.0    5.0
d   NaN   NaN    NaN
e   6.0   7.0    8.0
f   9.0  10.0   11.0
g   NaN   NaN    NaN
h  12.0  13.0   14.0
Sum of column one: 30.0
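
The behaviour of aggregations on missing data can be checked with a short experiment. This is a minimal sketch, assuming a reasonably recent pandas (0.22 or later, where the sum of an all-NaN Series defaults to 0); the Series and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
all_missing = pd.Series([np.nan, np.nan])

total = s.sum()                      # NaN is skipped: 1.0 + 3.0
average = s.mean()                   # mean of the two values that are present
strict_total = s.sum(skipna=False)   # NaN propagates into the result
empty_total = all_missing.sum()      # nothing to add, so the default is 0.0
checked_total = all_missing.sum(min_count=1)  # require at least 1 value, else NaN

print(total, average, strict_total, empty_total, checked_total)
```

Note that mean() divides by the number of values that are present, not by the length of the Series, so skipping NaN is not the same as treating it as 0 for every statistic.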

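Cleaning and filling missing values

Pandas provides several ways to clean up missing values. The fillna() function "fills" NaN values with non-null data.

1) Replacing NaN with a scalar value

The following program replaces NaN values with 0: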

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(15).reshape(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print("Original data:")
print(df)
# fillna() returns a new DataFrame; df itself is unchanged
print("NaN filled with 0:")
print(df.fillna(0))

Output:

Original data:
    one   two  three
a   0.0   1.0    2.0
b   NaN   NaN    NaN
c   3.0   4.0    5.0
d   NaN   NaN    NaN
e   6.0   7.0    8.0
f   9.0  10.0   11.0
g   NaN   NaN    NaN
h  12.0  13.0   14.0
NaN filled with 0:
    one   two  three
a   0.0   1.0    2.0
b   0.0   0.0    0.0
c   3.0   4.0    5.0
d   0.0   0.0    0.0
e   6.0   7.0    8.0
f   9.0  10.0   11.0
g   0.0   0.0    0.0
h  12.0  13.0   14.0

You can of course fill with any other value that suits your needs.
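
As an illustration of filling with other values, fillna() also accepts a Series or a dict keyed by column name, so each column can receive its own fill value. The sketch below fills with the column means and then with hand-picked per-column values; the specific fill values are arbitrary and only for demonstration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# df.mean() is a Series indexed by column name, so every column
# is filled with its own mean (computed over the non-missing values).
filled_mean = df.fillna(df.mean())
print(filled_mean)

# A dict assigns a separate fill value to each column.
filled_dict = df.fillna({'one': 0, 'two': -1, 'three': 99})
print(filled_dict)
```

Filling with a statistic such as the mean keeps the column average unchanged, whereas filling with 0 pulls it toward zero; which is appropriate depends on what the data represents.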

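2) Filling NA forward and backward

Forward filling (ffill) copies the last valid value before a gap into the gap; backward filling (bfill) copies the next valid value after the gap. For example: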

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(15).reshape(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print("Original data:")
print(df)
# ffill()/bfill() replace fillna(method='ffill'/'bfill'),
# which is deprecated in recent pandas releases
print("Forward fill:")
print(df.ffill())
print("Backward fill:")
print(df.bfill())

Output:

Original data:
    one   two  three
a   0.0   1.0    2.0
b   NaN   NaN    NaN
c   3.0   4.0    5.0
d   NaN   NaN    NaN
e   6.0   7.0    8.0
f   9.0  10.0   11.0
g   NaN   NaN    NaN
h  12.0  13.0   14.0
Forward fill:
    one   two  three
a   0.0   1.0    2.0
b   0.0   1.0    2.0
c   3.0   4.0    5.0
d   3.0   4.0    5.0
e   6.0   7.0    8.0
f   9.0  10.0   11.0
g   9.0  10.0   11.0
h  12.0  13.0   14.0
Backward fill:
    one   two  three
a   0.0   1.0    2.0
b   3.0   4.0    5.0
c   3.0   4.0    5.0
d   6.0   7.0    8.0
e   6.0   7.0    8.0
f   9.0  10.0   11.0
g  12.0  13.0   14.0
h  12.0  13.0   14.0
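
When a gap spans several consecutive rows, it may be undesirable to copy one value across the whole gap. The ffill() and bfill() methods take a limit argument that caps how many consecutive NaN values are filled. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Fill at most one NaN after each valid value: 1.0, 1.0, NaN, NaN, 5.0
forward_limited = s.ffill(limit=1)
print(forward_limited)

# Fill at most one NaN before each valid value: 1.0, NaN, NaN, 5.0, 5.0
backward_limited = s.bfill(limit=1)
print(backward_limited)
```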

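3) Replacing generic values with replace()

In some cases you need replace() to substitute a specific value for a generic value in a DataFrame, which is similar to replacing NaN values with fillna(). For example: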

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(15).reshape(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print("Original data:")
print(df)
# The dict maps each value to be replaced to its replacement
print("Data after replacement:")
print(df.replace({np.nan: 0}))

Output:

Original data:
    one   two  three
a   0.0   1.0    2.0
b   NaN   NaN    NaN
c   3.0   4.0    5.0
d   NaN   NaN    NaN
e   6.0   7.0    8.0
f   9.0  10.0   11.0
g   NaN   NaN    NaN
h  12.0  13.0   14.0
Data after replacement:
    one   two  three
a   0.0   1.0    2.0
b   0.0   0.0    0.0
c   3.0   4.0    5.0
d   0.0   0.0    0.0
e   6.0   7.0    8.0
f   9.0  10.0   11.0
g   0.0   0.0    0.0
h  12.0  13.0   14.0
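
replace() is also useful in the opposite direction: some data sources encode "no value" with a sentinel number, and converting those sentinels to NaN lets the rest of the Pandas missing-value tools work on them. The sketch below assumes hypothetical sensor readings in which -999.0 and -1.0 mean "not recorded":

```python
import numpy as np
import pandas as pd

# Hypothetical readings; -999.0 and -1.0 are placeholder codes, not real values.
raw = pd.DataFrame({
    'temperature': [21.5, -999.0, 19.0],
    'humidity': [-1.0, 40.0, 45.0],
})

# Passing a list replaces every listed value with the same replacement.
cleaned = raw.replace([-999.0, -1.0], np.nan)
print(cleaned)

# The placeholders are now counted as missing values.
print(cleaned.isnull().sum())
```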

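Dropping missing values

To remove missing values, use the dropna() function together with its axis parameter. The default is axis=0, which processes the data row by row: any row that contains a NaN value is dropped entirely. For example: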

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(15).reshape(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print("Original data:")
print(df)
# dropna() returns a new DataFrame without rows 'b', 'd' and 'g'
print("Data after dropping:")
print(df.dropna())

Output:

Original data:
    one   two  three
a   0.0   1.0    2.0
b   NaN   NaN    NaN
c   3.0   4.0    5.0
d   NaN   NaN    NaN
e   6.0   7.0    8.0
f   9.0  10.0   11.0
g   NaN   NaN    NaN
h  12.0  13.0   14.0
Data after dropping:
    one   two  three
a   0.0   1.0    2.0
c   3.0   4.0    5.0
e   6.0   7.0    8.0
f   9.0  10.0   11.0
h  12.0  13.0   14.0

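With axis=1, dropna() works column by column and removes every column that contains a NaN value. Because all three columns in this example contain missing values, df.dropna(axis=1) drops them all and returns an empty DataFrame in which only the index remains.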
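
Dropping every row or column that has any NaN is often too aggressive. dropna() also accepts the how, thresh and subset parameters to narrow what is removed. The following sketch reuses the DataFrame from this section and adds one partially missing row so that the options give different results (that extra NaN is an assumption made for the demonstration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df.loc['a', 'three'] = np.nan  # make row 'a' partially missing

# Drop a row only if all of its values are missing (removes b, d, g).
rows_all = df.dropna(how='all')

# Keep rows with at least 3 non-missing values (also removes row 'a').
rows_thresh = df.dropna(thresh=3)

# Look for NaN only in the listed columns (row 'a' is kept).
rows_subset = df.dropna(subset=['one', 'two'])

# Column-wise: every column contains a NaN, so no columns remain.
cols_dropped = df.dropna(axis=1)

print(rows_all, rows_thresh, rows_subset, cols_dropped.shape, sep="\n\n")
```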