The Three Musketeers of Python Data Analysis: Pandas (8): Reshaping Data, Handling Duplicates, and Replacing Values

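Articles in the Pandas series:

The Three Musketeers of Python Data Analysis: Pandas (1): Introducing Pandas and Its Series and DataFrame Objects
The Three Musketeers of Python Data Analysis: Pandas (2): The Index Object and Indexing Operations
The Three Musketeers of Python Data Analysis: Pandas (3): Arithmetic and Handling Missing Values
The Three Musketeers of Python Data Analysis: Pandas (4): Function Application, Mapping, Sorting, and Hierarchical Indexing
The Three Musketeers of Python Data Analysis: Pandas (5): Statistical Computation and Descriptive Statistics
The Three Musketeers of Python Data Analysis: Pandas (6): GroupBy: Split, Apply, Combine
The Three Musketeers of Python Data Analysis: Pandas (7): Merging Datasets
The Three Musketeers of Python Data Analysis: Pandas (8): Reshaping Data, Handling Duplicates, and Replacing Values
The Three Musketeers of Python Data Analysis: Pandas (9): Time Series
The Three Musketeers of Python Data Analysis: Pandas (10): Reading and Writing Data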

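The NumPy and Matplotlib series are also complete; feel free to follow them:

NumPy series: https://itrhx.blog.csdn.net/category_9780393.html
Matplotlib series: https://itrhx.blog.csdn.net/category_9780418.html

Recommended learning materials and sites (the author helped translate part of the documentation):

NumPy official Chinese site: https://www.numpy.org.cn/
Pandas official Chinese site: https://www.pypandas.cn/
Matplotlib official Chinese site: https://www.matplotlib.org.cn/
NumPy, Matplotlib, and Pandas quick-reference tables: https://github.com/TRHX/Python-quick-reference-table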

Contents

- 【01x00】Data reshaping
  - 【01x01】stack
  - 【01x02】unstack
- 【02x00】Handling duplicate data
  - 【02x01】duplicated
  - 【02x02】drop_duplicates
- 【03x00】Replacing data
  - 【03x01】replace
  - 【03x02】where
  - 【03x03】mask

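The following lines are anti-scraping text; readers can ignore them.
This article was originally published on CSDN by TRHX.
Blog home page: https://itrhx.blog.csdn.net/
Link to this article: https://itrhx.blog.csdn.net/article/details/106900748
Reposting without permission is prohibited. Please respect original work.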

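【01x00】Data reshaping

pandas provides a number of basic operations for rearranging tabular data. They are known as reshape or pivot operations. Two methods do most of the work when reshaping a hierarchical index:

stack: pivots the columns of the data into rows;
unstack: pivots the rows of the data into columns.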
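
As a quick illustration of how the two methods relate, the sketch below stacks a small table and then unstacks it again. The data is a hypothetical two-row table chosen only for this example; it is not taken from any real dataset.

```python
import pandas as pd

# Hypothetical two-row table used only for this illustration.
df = pd.DataFrame(
    [[0, 1], [2, 3]],
    index=["cat", "dog"],
    columns=["weight", "height"],
)

# stack: the column labels become the inner level of the row index,
# so the 2x2 DataFrame turns into a Series with 4 entries.
stacked = df.stack()

# unstack: the inner row level moves back to the columns.
restored = stacked.unstack()

print(stacked)
print(restored)
```

For a simple table like this one, unstack undoes stack: every value can be looked up under the same row and column labels as before.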

【01x01】stack

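The stack method pivots the columns of the data into rows.

Basic syntax: DataFrame.stack(self, level=-1, dropna=True)

Official documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.stack.html

Parameter	Description
level	The column level(s) to move from the columns to the rows: a level number, a level label, or a list of these; default -1 (the innermost level)
dropna	bool; whether to drop rows of the reshaped result whose values are all NaN; default True

Single level columns: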

>>> import pandas as pd
>>> obj = pd.DataFrame([[0, 1], [2, 3]], index=['cat', 'dog'], columns=['weight', 'height'])
>>> obj
     weight  height
cat       0       1
dog       2       3
>>> 
>>> obj.stack()
cat  weight    0
     height    1
dog  weight    2
     height    3
dtype: int64

Multi level columns:

>>> import pandas as pd
>>> multicol = pd.MultiIndex.from_tuples([('weight', 'kg'), ('weight', 'pounds')])
>>> obj = pd.DataFrame([[1, 2], [2, 4]], index=['cat', 'dog'], columns=multicol)
>>> obj
    weight
        kg pounds
cat      1      2
dog      2      4
>>> 
>>> obj.stack()
            weight
cat kg           1
    pounds       2
dog kg           2
    pounds       4

Missing values: when the inner column labels differ between the outer groups, the combinations that do not exist are filled with NaN:

>>> import pandas as pd
>>> multicol = pd.MultiIndex.from_tuples([('weight', 'kg'), ('height', 'm')])
>>> obj = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=['cat', 'dog'], columns=multicol)
>>> obj
    weight height
        kg      m
cat    1.0    2.0
dog    3.0    4.0
>>> 
>>> obj.stack()
        height  weight
cat kg     NaN     1.0
    m      2.0     NaN
dog kg     NaN     3.0
    m      4.0     NaN

Use the level parameter to choose which column level (or levels) to stack:

>>> import pandas as pd
>>> multicol = pd.MultiIndex.from_tuples([('weight', 'kg'), ('height', 'm')])
>>> obj = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=['cat', 'dog'], columns=multicol)
>>> obj
    weight height
        kg      m
cat    1.0    2.0
dog    3.0    4.0
>>> 
>>> obj.stack(level=0)
             kg    m
cat height  NaN  2.0
    weight  1.0  NaN
dog height  NaN  4.0
    weight  3.0  NaN
>>> 
>>> obj.stack(level=1)
        height  weight
cat kg     NaN     1.0
    m      2.0     NaN
dog kg     NaN     3.0
    m      4.0     NaN
>>>
>>> obj.stack(level=[0, 1])
cat  height  m     2.0
     weight  kg    1.0
dog  height  m     4.0
     weight  kg    3.0
dtype: float64

In the reshaped result, a row whose values are all NaN is dropped by default; pass dropna=False to keep it:

>>> import pandas as pd
>>> multicol = pd.MultiIndex.from_tuples([('weight', 'kg'), ('height', 'm')])
>>> obj = pd.DataFrame([[None, 1.0], [2.0, 3.0]], index=['cat', 'dog'], columns=multicol)
>>> obj
    weight height
        kg      m
cat    NaN    1.0
dog    2.0    3.0
>>> 
>>> obj.stack(dropna=False)
        height  weight
cat kg     NaN     NaN
    m      1.0     NaN
dog kg     NaN     2.0
    m      3.0     NaN
>>> 
>>> obj.stack(dropna=True)
        height  weight
cat m      1.0     NaN
dog kg     NaN     2.0
    m      3.0     NaN

【01x02】unstack

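The unstack method pivots the rows of the data into columns.

Basic syntax:

Series.unstack(self, level=-1, fill_value=None)
DataFrame.unstack(self, level=-1, fill_value=None)

Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.unstack.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html

Parameter	Description
level	The row-index level(s) to move from the rows to the columns; default -1 (the innermost level)
fill_value	Value used to replace the NaN entries that unstacking introduces

Usage on a Series: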

>>> import pandas as pd
>>> obj = pd.Series([1, 2, 3, 4], index=pd.MultiIndex.from_product([['one', 'two'], ['a', 'b']]))
>>> obj
one  a    1
     b    2
two  a    3
     b    4
dtype: int64
>>> 
>>> obj.unstack()
     a  b
one  1  2
two  3  4
>>> 
>>> obj.unstack(level=0)
   one  two
a    1    3
b    2    4

As with stack, combinations that do not exist in the data show up as missing values (NaN):

>>> import pandas as pd
>>> obj1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
>>> obj2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
>>> obj3 = pd.concat([obj1, obj2], keys=['one', 'two'])
>>> obj3
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64
>>> 
>>> obj3.unstack()
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0
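
The fill_value parameter from the table above is not exercised in these examples, so here is a minimal sketch of it. It rebuilds the same two Series as the example above (under the shorter, hypothetical names s1, s2, and s) and fills the gaps with 0 instead of NaN.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"])
s2 = pd.Series([4, 5, 6], index=["c", "d", "e"])
s = pd.concat([s1, s2], keys=["one", "two"])

# Combinations that are absent from the data (for example ('two', 'a'))
# would normally become NaN; fill_value supplies a replacement instead.
filled = s.unstack(fill_value=0)
print(filled)
```

Choosing 0 as the filler is only an assumption for this sketch; whether a filler value is appropriate at all depends on what a missing combination means in your data.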

Usage on a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame(np.arange(6).reshape((2, 3)), index=pd.Index(['Ohio', 'Colorado'], name='state'), columns=pd.Index(['one', 'two', 'three'], name='number'))
>>> obj
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5
>>> 
>>> obj2 = obj.stack()
>>> obj2
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32
>>> 
>>> obj3 = pd.DataFrame({'left': obj2, 'right': obj2 + 5}, columns=pd.Index(['left', 'right'], name='side'))
>>> obj3
side             left  right
state    number             
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10
>>> 
>>> obj3.unstack('state')
side   left          right         
state  Ohio Colorado  Ohio Colorado
number                             
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10
>>> 
>>> obj3.unstack('state').stack('side')
state         Colorado  Ohio
number side                 
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7

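【02x00】Handling duplicate data

duplicated: tests whether each value is a duplicate;
drop_duplicates: removes duplicate values.

【02x01】duplicated

The duplicated method reports, element by element (or row by row for a DataFrame), whether a value is a duplicate.

Basic syntax:

Series.duplicated(self, keep='first')
DataFrame.duplicated(self, subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first') → 'Series'

Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.duplicated.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html

Parameter	Description
keep	How duplicates are marked; default 'first'. 'first': unique values and the first occurrence of each duplicated value are marked False, later occurrences True. 'last': unique values and the last occurrence of each duplicated value are marked False, earlier occurrences True. False: every occurrence of a duplicated value is marked True, unique values False
subset	Column label or sequence of labels; DataFrame only. Restricts the duplicate check to the given column(s); by default all columns are considered

By default, within each group of equal values the first occurrence is marked False and the remaining occurrences are marked True; unique values are marked False. This is the same as keep='first':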

>>> import pandas as pd
>>> obj = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama'])
>>> obj
0      lama
1       cow
2      lama
3    beetle
4      lama
dtype: object
>>> 
>>> obj.duplicated()
0    False
1    False
2     True
3    False
4     True
dtype: bool
>>>
>>> obj.duplicated(keep='first')
0    False
1    False
2     True
3    False
4     True
dtype: bool

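With keep='last', unique values and the last occurrence in each group of duplicates are marked False and the earlier occurrences True. With keep=False, every occurrence of a duplicated value is marked True and all other values False: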

>>> import pandas as pd
>>> obj = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama'])
>>> obj
0      lama
1       cow
2      lama
3    beetle
4      lama
dtype: object
>>> 
>>> obj.duplicated(keep='last')
0     True
1    False
2     True
3    False
4    False
dtype: bool
>>> 
>>> obj.duplicated(keep=False)
0     True
1    False
2     True
3    False
4     True
dtype: bool

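On a DataFrame, the subset parameter restricts the duplicate check to the given column(s); by default all columns are considered. The data2 column below is generated randomly, so your values will differ:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame({'data1': ['a'] * 4 + ['b'] * 4, 'data2': np.random.randint(0, 4, 8)})
>>> obj
  data1  data2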
0     a      0
1     a      0
2     a      0
3     a      3
4     b      3
5     b      3
6     b      0
7     b      2
>>> 
>>> obj.duplicated()
0    False
1     True
2     True
3    False
4    False
5     True
6    False
7    False
dtype: bool
>>> 
>>> obj.duplicated(subset='data1')
0    False
1     True
2     True
3     True
4    False
5     True
6     True
7     True
dtype: bool
>>> 
>>> obj.duplicated(subset='data2', keep='last')
0     True
1     True
2     True
3     True
4     True
5    False
6    False
7    False
dtype: bool
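
Because duplicated returns a boolean Series, it can be used directly as a filter. The sketch below uses a small fixed table (hypothetical values, chosen so that the result is reproducible, unlike the random data above) to pull out every row that belongs to a duplicate group and to count the redundant copies.

```python
import pandas as pd

# Fixed, hypothetical data so that the result is reproducible.
df = pd.DataFrame({
    "data1": ["a", "a", "a", "b", "b"],
    "data2": [0, 0, 3, 3, 3],
})

# keep=False flags every member of a duplicate group,
# which makes it easy to inspect all the rows involved.
dup_mask = df.duplicated(keep=False)
dup_rows = df[dup_mask]

# With the default keep='first', True marks only the redundant copies,
# so summing the mask counts how many rows drop_duplicates would remove.
n_extra = int(df.duplicated().sum())

print(dup_rows)
print(n_extra)
```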

【02x02】drop_duplicates

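The drop_duplicates method returns a copy of the object with duplicate values removed.

Basic syntax:

Series.drop_duplicates(self, keep='first', inplace=False)

DataFrame.drop_duplicates(self, subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first', inplace: bool = False, ignore_index: bool = False) → Union[ForwardRef('DataFrame'), NoneType]

Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.drop_duplicates.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

Parameter	Description
keep	Which duplicates to drop; default 'first'. 'first': keep unique values and the first occurrence of each duplicated value, drop the rest. 'last': keep unique values and the last occurrence of each duplicated value, drop the rest. False: drop every occurrence of a duplicated value and keep only unique values
inplace	Whether to modify the original object instead of returning a new one; default False. If True, nothing is returned and the original data is changed directly
subset	Column label or sequence of labels; DataFrame only. Only the given column(s) are used to identify duplicates; by default all columns are considered
ignore_index	bool; DataFrame only. Whether to discard the original index labels; default False. If True, the result is relabeled 0, 1, 2, …, n-1

Using the keep parameter: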

>>> import pandas as pd
>>> obj = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'], name='animal')
>>> obj
0      lama
1       cow
2      lama
3    beetle
4      lama
5     hippo
Name: animal, dtype: object
>>> 
>>> obj.drop_duplicates()
0      lama
1       cow
3    beetle
5     hippo
Name: animal, dtype: object
>>> 
>>> obj.drop_duplicates(keep='last')
1       cow
3    beetle
4      lama
5     hippo
Name: animal, dtype: object
>>> 
>>> obj.drop_duplicates(keep=False)
1       cow
3    beetle
5     hippo
Name: animal, dtype: object

With inplace=True nothing is returned, but the original object is modified:

>>> import pandas as pd
>>> obj1 = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'], name='animal')
>>> obj1
0      lama
1       cow
2      lama
3    beetle
4      lama
5     hippo
Name: animal, dtype: object
>>> 
>>> obj2 = obj1.drop_duplicates()
>>> obj2          # a new object is returned
0      lama
1       cow
3    beetle
5     hippo
Name: animal, dtype: object
>>> 
>>> obj3 = obj1.drop_duplicates(inplace=True)
>>> obj3         # nothing is returned (obj3 is None)
>>>
>>> obj1         # the original object has been changed
0      lama
1       cow
3    beetle
5     hippo
Name: animal, dtype: object

Usage on a DataFrame (data2 is generated randomly, so your values will differ):

>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.DataFrame({'data1': ['a'] * 4 + ['b'] * 4, 'data2': np.random.randint(0, 4, 8)})
>>> obj
  data1  data2
0     a      2
1     a      1
2     a      1
3     a      2
4     b      1
5     b      2
6     b      0
7     b      0
>>> 
>>> obj.drop_duplicates()
  data1  data2
0     a      2
1     a      1
4     b      1
5     b      2
6     b      0
>>> 
>>> obj.drop_duplicates(subset='data2')
  data1  data2
0     a      2
1     a      1
6     b      0
>>> 
>>> obj.drop_duplicates(subset='data2', ignore_index=True)
  data1  data2
0     a      2
1     a      1
2     b      0
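
The parameters can also be combined. As a sketch, suppose a hypothetical orders table in which a later row for the same user and item supersedes an earlier one; the column names and values are invented for this illustration. Passing a list to subset together with keep='last' keeps only the most recent row per pair.

```python
import pandas as pd

# Hypothetical order log: a later row for the same (user, item) pair
# is assumed to supersede the earlier one.
orders = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u3"],
    "item": ["pen", "pen", "ink", "pad", "pen"],
    "qty": [1, 2, 5, 1, 3],
})

# Rows count as duplicates only if BOTH user and item match;
# keep='last' retains the final occurrence of each pair, and
# ignore_index=True relabels the result 0..n-1.
latest = orders.drop_duplicates(
    subset=["user", "item"], keep="last", ignore_index=True
)
print(latest)
```

Since inplace is left at its default, orders itself is unchanged and latest is a new DataFrame.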

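【03x00】Replacing data

【03x01】replace

The replace method substitutes values based on their content.

Basic syntax:

Series.replace(self, to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
DataFrame.replace(self, to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')

Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.replace.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html

Common parameters:

Parameter	Description
to_replace	How to find the values to replace. May be a string, regular expression, list, dict, int, float, Series, or None; see the official documentation for how each form behaves
value	The value that replaces each match. For a DataFrame, a dict can specify a different value per column; regular expressions, strings, and lists or dicts of such objects are also allowed
inplace	bool; whether to modify the original data directly and return nothing; default False
regex	bool, or the same types as to_replace. Set it to True when to_replace is a regular expression, or pass the pattern through this parameter instead of to_replace

Passing a single value for both to_replace and value replaces one value with another: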

>>> import pandas as pd
>>> obj = pd.Series([0, 1, 2, 3, 4])
>>> obj
0    0
1    1
2    2
3    3
4    4
dtype: int64
>>> 
>>> obj.replace(0, 5)
0    5
1    1
2    2
3    3
4    4
dtype: int64

Passing several values for to_replace and a single value replaces all of them with that one value:

>>> import pandas as pd
>>> obj = pd.Series([0, 1, 2, 3, 4])
>>> obj
0    0
1    1
2    2
3    3
4    4
dtype: int64
>>> 
>>> obj.replace([0, 1, 2, 3], 4)
0    4
1    4
2    4
3    4
4    4
dtype: int64

Passing several values for both to_replace and value replaces each value with its counterpart at the same position:

>>> import pandas as pd
>>> obj = pd.Series([0, 1, 2, 3, 4])
>>> obj
0    0
1    1
2    2
3    3
4    4
dtype: int64
>>> 
>>> obj.replace([0, 1, 2, 3], [4, 3, 2, 1])
0    4
1    3
2    2
3    1
4    4
dtype: int64
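
A common practical use of these forms is turning sentinel codes into proper missing values. The sketch below assumes a hypothetical series of sensor readings in which -999 and -1000 are placeholder codes rather than real measurements; the numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings; -999 and -1000 are assumed to be sentinel codes.
readings = pd.Series([1.0, -999.0, 2.0, -1000.0, 3.0])

# Many-to-one: both sentinels become NaN.
as_missing = readings.replace([-999, -1000], np.nan)

# A dict gives each sentinel its own replacement.
mapped = readings.replace({-999: np.nan, -1000: 0})

print(as_missing)
print(mapped)
```

Neither call touches readings, because inplace defaults to False.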

Passing a dict for to_replace. A flat dict such as {0: 10, 1: 100} maps old values to new ones; a dict keyed by column name, combined with value, replaces a different value in each column; a nested dict limits an old-to-new mapping to one column:

>>> import pandas as pd
>>> obj = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [5, 6, 7, 8, 9], 'C': ['a', 'b', 'c', 'd', 'e']})
>>> obj
   A  B  C
0  0  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e
>>> 
>>> obj.replace(0, 5)
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e
>>> 
>>> obj.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
>>> 
>>> obj.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
>>> 
>>> obj.replace({'A': {0: 100, 4: 400}})
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

Passing a regular expression for to_replace:

>>> import pandas as pd
>>> obj = pd.DataFrame({'A': ['bat', 'foo', 'bait'], 'B': ['abc', 'bar', 'xyz']})
>>> obj
      A    B
0   bat  abc
1   foo  bar
2  bait  xyz
>>> 
>>> obj.replace(to_replace=r'^ba.$', value='new', regex=True)
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> 
>>> obj.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
      A    B
0   new  abc
1   foo  bar
2  bait  xyz
>>> 
>>> obj.replace(regex=r'^ba.$', value='new')
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> 
>>> obj.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
      A    B
0   new  abc
1   xyz  new
2  bait  xyz
>>> 
>>> obj.replace(regex=[r'^ba.$', 'foo'], value='new')
      A    B
0   new  abc
1   new  new
2  bait  xyz

【03x02】where

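The where method replaces the values for which a condition is False.

Basic syntax:

Series.where(self, cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False)
DataFrame.where(self, cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False)

Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.where.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html

Common parameters:

Parameter	Description
cond	The condition. Where cond is True the original value is kept; where it is False the value is replaced by the corresponding value from other
other	The replacement. Where cond is False, the value is taken from this parameter; default NaN
inplace	bool; whether to modify the original data directly and return nothing; default False

Usage on a Series: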

>>> import pandas as pd
>>> obj = pd.Series(range(5))
>>> obj
0    0
1    1
2    2
3    3
4    4
dtype: int64
>>> 
>>> obj.where(obj > 0)
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
>>> 
>>> obj.where(obj > 1, 10)
0    10
1    10
2     2
3     3
4     4
dtype: int64
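
Two further points about where are worth a short sketch: cond may be a callable that receives the object itself, and capping values with where gives the same numbers as clip. The cap of 3 and the marker -1 are arbitrary values chosen for this illustration.

```python
import pandas as pd

s = pd.Series(range(5))

# cond may be a callable; it is called with the Series itself.
# Even values are kept, odd values are replaced by -1.
evens_only = s.where(lambda x: x % 2 == 0, -1)

# Keep values up to 3 and replace anything larger with 3,
# which gives the same numbers as s.clip(upper=3).
capped = s.where(s <= 3, 3)

print(evens_only.tolist())
print(capped.tolist())
```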

Usage on a DataFrame:

>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
>>> obj
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> 
>>> m = obj % 3 == 0
>>> obj.where(m, -obj)
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> 
>>> obj.where(m, -obj) == np.where(m, obj, -obj)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True

【03x03】mask

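The mask method is the inverse of where: it replaces the values for which a condition is True.

Basic syntax:

Series.mask(self, cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False)
DataFrame.mask(self, cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False)

Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.mask.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html

Common parameters:

Parameter	Description
cond	The condition. Where cond is False the original value is kept; where it is True the value is replaced by the corresponding value from other
other	The replacement. Where cond is True, the value is taken from this parameter; default NaN
inplace	bool; whether to modify the original data directly and return nothing; default False

Usage on a Series: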

>>> import pandas as pd
>>> obj = pd.Series(range(5))
>>> obj
0    0
1    1
2    2
3    3
4    4
dtype: int64
>>> 
>>> obj.mask(obj > 0)
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
>>> 
>>> obj.mask(obj > 1, 10)
0     0
1     1
2    10
3    10
4    10
dtype: int64
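
As one possible use of mask, the sketch below replaces implausibly large values with the median of the series. The data and the threshold of 100 are hypothetical, and using the median as the replacement is just one choice among many.

```python
import pandas as pd

# Hypothetical measurements with one obvious outlier.
s = pd.Series([3.0, 5.0, 400.0, 4.0, 6.0])

# Where the condition is True (value above the assumed threshold),
# the value is replaced by the median of the series; others are kept.
cleaned = s.mask(s > 100, s.median())

# The same result expressed with where and the inverted condition.
same = s.where(~(s > 100), s.median())

print(cleaned.tolist())
```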

Usage on a DataFrame:

>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
>>> obj
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> 
>>> m = obj % 3 == 0
>>> 
>>> obj.mask(m, -obj)
   A  B
0  0  1
1  2 -3
2  4  5
3 -6  7
4  8 -9
>>> 
>>> obj.where(m, -obj) == obj.mask(~m, -obj)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
