pandas: Series and DataFrame

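pandas is a powerful Python package that provides a large collection of functions and methods for processing and analyzing data.

Before using pandas, install the package and import it with import pandas as pd. The examples below also use NumPy, imported with import numpy as np.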

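I. Series

A Series is a one-dimensional labeled array; the labels are its index.

1. Creating a Series

A Series is created with: s = pd.Series(obj, index='***', name='***')

If name is not passed when the Series is created, it defaults to None; it is not taken from the variable name s on the left of the =.

① From a dictionary

When a Series is created from a dictionary, the dictionary keys become the index. If a key appears more than once in the dictionary literal, Python keeps only the last value, so that is the value the Series receives.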

dic = {'name':'Alice','age':23,'age':20,'age':25,'hobby':'dance'}
s = pd.Series(dic,name='dic_Seris')
print(s)
# name     Alice
# age         25
# hobby    dance
# Name: dic_Seris, dtype: object

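② From a one-dimensional array, list, or tuple

With this method, the index defaults to integers starting from 0 if index is not given. If index is given, its length must equal the number of elements in the Series; otherwise an error is raised.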

arr = np.arange(1,6)
s1 = pd.Series(arr)
s2 = pd.Series(arr,index=list('abcde'),name='iter_Seris')
print(s1.name,s2.name)
print(s1)
print('-------------')
print(s2)
# None iter_Seris
# 0    1
# 1    2
# 2    3
# 3    4
# 4    5
# dtype: int32
# -------------
# a    1
# b    2
# c    3
# d    4
# e    5
# Name: iter_Seris, dtype: int32
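
The length rule above can be checked directly. The following is a minimal sketch; the variable name mismatch_error is only illustrative and is not part of pandas.

```python
import numpy as np
import pandas as pd

arr = np.arange(1, 6)  # five values

# Three labels for five values: pandas rejects the mismatch.
try:
    pd.Series(arr, index=list('abc'))
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)
```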

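③ From a scalar

When a Series is created from a scalar, obj is a single fixed value that every element takes. index must be specified in this case, and its length determines the number of elements.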

s = pd.Series('hi',index=list('abc'),name='s_Seris')
print(s)
# a    hi
# b    hi
# c    hi
# Name: s_Seris, dtype: object

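2. Indexing a Series

① Positional indexing

Positions start at 0, and -1 refers to the last element. A slice [m:n] includes m and excludes n. Each element of a Series has a type of the form <class 'numpy.***'>.

[[m, n, x]] returns the values at positions m, n, and x; plain lists and tuples do not support this form. Recent pandas versions (2.x) emit a deprecation warning when the integers in s[1] or s[[0, 4]] are treated as positions on a non-integer index; s.iloc[1] and s.iloc[[0, 4]] state the intent explicitly.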

s = pd.Series([1,2,3,4,5],index=list('abcde'))
print(s[1],type(s[1]))
print(s[-2])
print(s[1:3])
print(s[[0,4]])
# 2 <class 'numpy.int64'>
# 4
# b    2
# c    3
# dtype: int64
# a    1
# e    5
# dtype: int64

② Label indexing

Unlike positional slicing, a label slice [m:n] includes both m and n. [[m, n, x]] likewise returns the values with labels m, n, and x.

s = pd.Series([1,2,3,4,5],index=list('abcde'))
print(s['b'])
print(s['c':'d'])
print(s[['a','e']])
# 2
# c    3
# d    4
# dtype: int64
# a    1
# e    5
# dtype: int64

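Note that when the labels of a Series are themselves integers, indexing becomes confusing: s[3] looks up the label 3, while the slice s[2:4] is positional. For this reason, custom integer labels are best avoided.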

s = pd.Series([1,2,3,4,5],index=[1,2,3,4,5])
print(s)
print('------------')
print(s[3])
print('------------')
print(s[2:4])
# 1    1
# 2    2
# 3    3
# 4    4
# 5    5
# dtype: int64
# ------------
# 3
# ------------
# 3    3
# 4    4
# dtype: int64
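
When integer labels cannot be avoided, the explicit accessors remove the ambiguity: .loc always works on labels and .iloc always works on positions. A small sketch using the same Series as above (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=[1, 2, 3, 4, 5])

by_label = s.loc[3]            # label 3 -> value 3
by_position = s.iloc[3]        # fourth element -> value 4
label_slice = s.loc[2:4]       # labels 2 through 4, both ends included
position_slice = s.iloc[2:4]   # positions 2 and 3, end excluded

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```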

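③ Boolean indexing

A comparison on a Series returns a boolean Series with the same index. Indexing with that mask keeps only the elements where it is True.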

s = pd.Series([1,2,3,4,5],index=list('abcde'))
m = s > 3
print(m)
print(s[m])
# a    False
# b    False
# c    False
# d     True
# e     True
# dtype: bool
# d    4
# e    5
# dtype: int64
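
Masks can also be combined. The sketch below assumes the same Series as above and uses the element-wise operators & (and), | (or), and ~ (not); each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))

middle = s[(s > 1) & (s < 5)]    # both conditions hold
edges = s[(s == 1) | (s == 5)]   # either condition holds
not_big = s[~(s > 3)]            # condition negated

print(middle.tolist(), edges.tolist(), not_big.tolist())
```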

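3. Inspecting a Series and common methods

① head() and tail()

The argument defaults to 5, so head() returns the first 5 elements and tail() the last 5; a different count can be passed.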

s = pd.Series([1,2,3,4,5,6,7,8,9,10])
print(s.head(2))
print(s.tail(3))
# 0    1
# 1    2
# dtype: int64
# 7     8
# 8     9
# 9    10
# dtype: int64

② tolist() (also available as to_list())

Converts a Series to a list.

s = pd.Series(np.random.randint(1,10,10))
print(s.tolist())
# [3, 8, 8, 9, 8, 2, 2, 7, 7, 7]

③ reindex(index, fill_value=NaN)

reindex returns a new Series. Each label in the index argument that exists in the original index keeps its value; labels that do not exist are filled with fill_value, which defaults to NaN.

arr = np.arange(1,6)
s1 = pd.Series(arr,index = list('abcde'))
s2 =s1.reindex(['a','d','f','h'],fill_value=0)
print(s1)
print(s2)
# a    1
# b    2
# c    3
# d    4
# e    5
# dtype: int32
# a    1
# d    4
# f    0
# h    0
# dtype: int32
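
For comparison, here is a short sketch of reindex without fill_value. The missing label is filled with NaN, and because NaN is a floating-point value the integer data is converted to float64.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(1, 6), index=list('abcde'))

# 'f' does not exist in s1, so it is filled with the default NaN.
s3 = s1.reindex(['a', 'd', 'f'])

print(s3)
print(s3.dtype)
```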

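④ + and -

Adding or subtracting a single value applies the operation to every element of the Series.

Adding or subtracting two Series aligns them by index: values with matching labels are combined, and labels present in only one Series are kept with the value NaN.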

s1 = pd.Series(np.arange(1,4),index = list('abc'))
s2 = pd.Series(np.arange(5,8),index = list('bcd'))
print(s1+s2)
print('--------')
print(s2-s1)
print('--------')
print(s1+10)
# a    NaN
# b    7.0
# c    9.0
# d    NaN
# dtype: float64
# --------
# a    NaN
# b    3.0
# c    3.0
# d    NaN
# dtype: float64
# --------
# a    11
# b    12
# c    13
# dtype: int32
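
If NaN for the unmatched labels is not wanted, the method form of the operator accepts a fill_value. A minimal sketch with the same two Series, treating a label missing from one side as 0:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(1, 4), index=list('abc'))
s2 = pd.Series(np.arange(5, 8), index=list('bcd'))

# A label missing from one Series is treated as 0 before adding.
total = s1.add(s2, fill_value=0)

print(total)
```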

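⑤ Adding elements

Assigning to a new label adds an element and modifies the original Series in place. (Assigning to a new position, as in the commented-out s[3] = 10, raised an out-of-bounds index error in the pandas version used here.)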

s = pd.Series(np.arange(1,4),index = list('abc'))
# s[3] = 10
s['p'] = 15
print(s)
# a     1
# b     2
# c     3
# p    15
# dtype: int64

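s1.append(s2) returns a new Series and leaves s1 and s2 unchanged. Series.append was removed in pandas 2.0; pd.concat([s1, s2]) gives the same result and is used below.

s1 = pd.Series(np.arange(1,3),index = list('ab'))
s2 = pd.Series(np.arange(3,5),index = list('mn'))
a = pd.concat([s1, s2])  # on pandas < 2.0, s1.append(s2) is equivalent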
print(s1)
print(s2)
print(a)
# a    1
# b    2
# dtype: int32
# m    3
# n    4
# dtype: int32
# a    1
# b    2
# m    3
# n    4
# dtype: int32

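⑥ Removing elements with drop()

Usage: drop(index, inplace=False) removes the values whose labels are given by index. By default the result is returned as a new Series and the original is unchanged. With inplace=True the original Series is modified directly and drop returns None.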

s1 = pd.Series(np.arange(1,4),index = list('abc'))
s2 = s1.drop(['a','c'])
print(s1)
print(s2)
s3 = pd.Series(np.arange(5,8),index = list('lmn'))
s4 = s3.drop('m',inplace=True)
print(s3)
print(s4)
# a    1
# b    2
# c    3
# dtype: int32
# b    2
# dtype: int32
# l    5
# n    7
# dtype: int32
# None


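II. DataFrame

A DataFrame is a tabular data structure: a labeled two-dimensional array. It is the most commonly used data structure in pandas. For a DataFrame named df, df.index is the row index, df.columns is the column index, and df.values holds the actual values.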

dic = {'name':['alice','Bob','Jane'],'age':[23,26,25]}
df = pd.DataFrame(dic)
print(df)
print(type(df))
print(df.index)
print(df.columns)
print(df.values)
#     name  age
# 0  alice   23
# 1    Bob   26
# 2   Jane   25
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex(start=0, stop=3, step=1)
# Index(['name', 'age'], dtype='object')
# [['alice' 23]
#  ['Bob' 26]
#  ['Jane' 25]]

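1. Creating a DataFrame

① From a dictionary, or from a list of dictionaries

With this method, the dictionary keys become the column labels, and the row index defaults to integers starting from 0. The df1 output below lists the columns alphabetically, which is how older pandas versions ordered a list of dictionaries; recent versions keep the keys in insertion order.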

dic1 = [{'name':'Alice','age':23},{'name':'Bob','age':26},{'name':'Jane','age':25}]
dic2 = {'name':['alice','Bob','Jane'],'age':[23,26,25]}
df1 = pd.DataFrame(dic1)
df2 = pd.DataFrame(dic2)
print(df1)
print('---------------')
# print(pd.DataFrame(df1,columns=['name','age']))
print(df2)
#    age   name
# 0   23  Alice
# 1   26    Bob
# 2   25   Jane
# ---------------
#     name  age
# 0  alice   23
# 1    Bob   26
# 2   Jane   25

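The row index can be set with index at creation time, but its length must equal the number of rows in the DataFrame; otherwise an error is raised.

The column labels can also be set with columns, and they do not have to match the keys of the source data. Labels found in both are kept, keys of the source dictionary or list that are not listed in columns are dropped, and labels in columns with no matching key are kept and filled with NaN.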

dic = {'name':['alice','Bob','Jane'],'age':[23,26,25]}
df1 = pd.DataFrame(dic,columns=['name','hobby'])
df2 = pd.DataFrame(dic,index=['a','b','c'])
print(df1)
print(df2)
#    name hobby
# 0  alice   NaN
# 1    Bob   NaN
# 2   Jane   NaN
#     name  age
# a  alice   23
# b    Bob   26
# c   Jane   25
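
The difference between the two parameters can be seen in a short sketch: an index of the wrong length is rejected, while columns only selects and reorders. The variable names index_error and subset are illustrative.

```python
import pandas as pd

dic = {'name': ['alice', 'Bob', 'Jane'], 'age': [23, 26, 25]}

# Two row labels for three rows: construction fails.
try:
    pd.DataFrame(dic, index=['a', 'b'])
    index_error = None
except ValueError as exc:
    index_error = exc

# columns may name any subset (or superset) of the keys, in any order.
subset = pd.DataFrame(dic, columns=['age', 'hobby'])

print(type(index_error).__name__)
print(subset)
```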

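② From Series

When a DataFrame is built from Series, the Series may differ in length. The rows are aligned on the Series' index labels, the resulting index covers all of them, and missing entries are filled with NaN.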

dic1 = {'one':pd.Series(np.arange(2)),'two':pd.Series(np.arange(3))}
dic2 = {'one':pd.Series(np.arange(2),index=['a','b']),'two':pd.Series(np.arange(3),index = ['a','b','c'])}
print(pd.DataFrame(dic1))
print('------------')
print(pd.DataFrame(dic2))
#    one  two
# 0  0.0    0
# 1  1.0    1
# 2  NaN    2
# ------------
#    one  two
# a  0.0    0
# b  1.0    1
# c  NaN    2
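
The alignment is by label, not by position. In the sketch below (with made-up values), the first Series lists its labels in a different order, and each value still ends up in the row that carries its label.

```python
import pandas as pd

one = pd.Series([10, 20], index=['b', 'a'])        # labels out of order
two = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

df = pd.DataFrame({'one': one, 'two': two})

# Row 'a' pairs one['a'] with two['a'], regardless of position.
print(df)
```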

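③ From a two-dimensional array

Usage: DataFrame(arr, index='***', columns='***'). If index and columns are not given, both default to integers starting from 0. If they are given, their lengths must equal the number of rows and columns of the array; otherwise an error is raised.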

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index=['a','b','c'],columns=['col1','col2','col3','col4'])
print(df)
#    col1  col2  col3  col4
# a     0     1     2     3
# b     4     5     6     7
# c     8     9    10    11

④ From a nested dictionary

With this method, the outer keys become the column labels and the inner keys become the row labels.

dic = {'Chinese':{'Alice':92,'Bob':95,'Jane':93},'Math':{'Alice':96,'Bob':98,'Jane':95}}
print(pd.DataFrame(dic))
#        Chinese  Math
# Alice       92    96
# Bob         95    98
# Jane        93    95

2. Indexing a DataFrame

.values returns the data without index and columns, as a two-dimensional array.

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
print(df.values)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

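① Column indexing

A single column is selected with df['column label']. The result is a Series whose name is the column label and whose index is the index of the original DataFrame.

Multiple columns are selected with df[['column label 1', 'column label 2', ...]]. The result is a DataFrame whose columns are the requested labels and whose index is the index of the original DataFrame.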

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
print(df)
print('-------------')
print(df['one'],type(df['one']))
print('-------------')
print(df[['one','three']])
#    one  two  three  four
# a    0    1      2     3
# b    4    5      6     7
# c    8    9     10    11
# -------------
# a    0
# b    4
# c    8
# Name: one, dtype: int32 <class 'pandas.core.series.Series'>
# -------------
#    one  three
# a    0      2
# b    4      6
# c    8     10

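② Row indexing

A single row is selected with df.loc['row label']. The result is a Series whose name is the row label and whose index is the columns of the original DataFrame.

Multiple rows are selected with df.loc[['row label 1', 'row label 2', ...]]. The result is a DataFrame whose columns are those of the original DataFrame and whose index is the requested row labels.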

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
print(df.loc['a'],type(df.loc['a']))
print(df.loc[['a','c']])
# one      0
# two      1
# three    2
# four     3
# Name: a, dtype: int32 <class 'pandas.core.series.Series'>
#    one  two  three  four
# a    0    1      2     3
# c    8    9     10    11

Rows can also be selected with iloc[]. loc[] selects rows by label, while iloc[] selects them by position (that is, by row number).

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
print(df.iloc[1],type(df.iloc[1]))
print(df.iloc[[0,2]])
# one      4
# two      5
# three    6
# four     7
# Name: b, dtype: int32 <class 'pandas.core.series.Series'>
#    one  two  three  four
# a    0    1      2     3
# c    8    9     10    11
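
Both accessors also accept slices, with the same difference seen for Series: a label slice includes its end, a positional slice does not. A brief sketch on the same DataFrame:

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(3, 4)
df = pd.DataFrame(arr, index=['a', 'b', 'c'],
                  columns=['one', 'two', 'three', 'four'])

by_label = df.loc['a':'b']     # rows a and b: end label included
by_position = df.iloc[0:2]     # rows 0 and 1: end position excluded

print(by_label)
print(by_position)
```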

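③ Cell and block indexing

A single cell can be accessed in three ways: df['column label'].loc['row label'], df.loc['row label']['column label'], and df.loc['row label', 'column label'].

A block can be accessed the same three ways: df[['column label 1', 'column label 2', ...]].loc[['row label 1', 'row label 2', ...]], df.loc[['row label 1', 'row label 2', ...]][['column label 1', 'column label 2', ...]], and df.loc[['row label 1', 'row label 2', ...], ['column label 1', 'column label 2', ...]]. The single-call df.loc[rows, columns] form is generally the one to prefer, because the chained forms build an intermediate object first.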

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
print(df)
print('--------------------------')
print(df['two'].loc['b'] , df.loc['b']['two'] , df.loc['b','two'])
print('--------------------------')
print(df.loc[['a','c'],['one','four']])
#    one  two  three  four
# a    0    1      2     3
# b    4    5      6     7
# c    8    9     10    11
# --------------------------
# 5 5 5
# --------------------------
#    one  four
# a    0     3
# c    8    11
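
The same cell and block can also be reached by position. The sketch below uses .at and .iat, the scalar-only counterparts of .loc and .iloc, plus a positional block selection; it is one possible way to write these lookups, not the only one.

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(3, 4)
df = pd.DataFrame(arr, index=['a', 'b', 'c'],
                  columns=['one', 'two', 'three', 'four'])

cell_by_label = df.at['b', 'two']       # scalar lookup by labels
cell_by_position = df.iat[1, 1]         # scalar lookup by positions
block = df.iloc[[0, 2], [0, 3]]         # rows 0 and 2, columns 0 and 3

print(cell_by_label, cell_by_position)
print(block)
```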

④ Boolean indexing

Indexing a DataFrame with a boolean mask built from a single column returns the rows where the mask is True.

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
m1= df['one']>5
print(df)
print('------------------------')
print(m1)  # the value for index c is True
print('------------------------')
print(df[m1])  # shows the row with index c, including all columns
#   one  two  three  four
# a    0    1      2     3
# b    4    5      6     7
# c    8    9     10    11
# ------------------------
# a    False
# b    False
# c     True
# Name: one, dtype: bool
# ------------------------
#    one  two  three  four
# c    8    9     10    11

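Indexing with a boolean mask built from several columns or from the whole DataFrame returns a DataFrame with the same shape as the original. Within the masked columns, values that meet the condition are shown as-is and values that do not are shown as NaN; columns outside the mask are entirely NaN.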

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
m1 = df[['one','three']] > 5
print(m1)
print(df[m1])   # values in columns one and three that meet the condition are kept; everything else is NaN
#      one  three
# a  False  False
# b  False   True
# c   True   True
#    one  two  three  four
# a  NaN  NaN    NaN   NaN
# b  NaN  NaN    6.0   NaN
# c  8.0  NaN   10.0   NaN

arr = np.arange(12).reshape(3,4)  # same data as above; the mask now covers the whole DataFrame
df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
m = df >5
print(m)
print(df[m])
#     one    two  three   four
# a  False  False  False  False
# b  False  False   True   True
# c   True   True   True   True
#    one  two  three  four
# a  NaN  NaN    NaN   NaN
# b  NaN  NaN    6.0   7.0
# c  8.0  9.0   10.0  11.0

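(Building the mask from a row instead, for example df[df.loc['b'] > 5], raises pandas.core.indexing.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match). Such a mask is indexed by column labels, which do not align with the DataFrame's row index.)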
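
A mask built from a row can still be used, as long as it is applied to the column axis. The following sketch passes it as the second argument of .loc; row_mask and selected are illustrative names.

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(3, 4)
df = pd.DataFrame(arr, index=['a', 'b', 'c'],
                  columns=['one', 'two', 'three', 'four'])

# The mask is indexed by column labels: True for 'three' and 'four'.
row_mask = df.loc['b'] > 5

# Applied on the column axis, it keeps the matching columns for all rows.
selected = df.loc[:, row_mask]

print(selected)
```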

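3. Common DataFrame methods

① Transposing with .T

Transposing turns the original columns into the index and the original index into the columns. In the output below, the transposed DataFrame shares its data with the original: changing one also changes the other. This sharing is a detail of the pandas version used here and of a DataFrame with a single dtype; it should not be relied on, for instance when Copy-on-Write is enabled.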

arr = np.arange(12).reshape(3,4)
df1 = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
df2 = df1.T
df1.loc['a','one'] = 100
print(df1)
print(df2)
df2.loc['two','b'] = 500
print(df1)
print(df2)
#    one  two  three  four
# a  100    1      2     3
# b    4    5      6     7
# c    8    9     10    11
#          a  b   c
# one    100  4   8
# two      1  5   9
# three    2  6  10
# four     3  7  11
#    one  two  three  four
# a  100    1      2     3
# b    4  500      6     7
# c    8    9     10    11
#          a    b   c
# one    100    4   8
# two      1  500   9
# three    2    6  10
# four     3    7  11
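
When the transposed DataFrame must be independent of the original, an explicit copy makes that certain regardless of pandas version. A minimal sketch:

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(3, 4)
df1 = pd.DataFrame(arr, index=['a', 'b', 'c'],
                   columns=['one', 'two', 'three', 'four'])

df2 = df1.T.copy()           # copy() gives df2 its own data

df1.loc['a', 'one'] = 100    # change the original only

print(df1.loc['a', 'one'], df2.loc['one', 'a'])
```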

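② Adding and modifying

Add a column: df['new column label'] = [***]. The number of elements must equal the number of rows in the DataFrame; otherwise an error is raised.

Add a row: df.loc['new row label'] = [***]. The number of elements must equal the number of columns in the DataFrame; otherwise an error is raised.

To modify a DataFrame, select the cell or block with the indexing forms from the previous section and assign to it. For assignment, use the single-call form df.loc['row label', 'column label'] = value, since assigning through a chained selection may not update the original.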

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
print(df)
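df['five'] = [11,22,33]  # the number of elements must equal the number of rows, or an error is raised
print(df)
df.loc['d'] = [100,200,300,400,500]  # the number of elements must equal the number of columns, or an error is raised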
print(df)
#   one  two  three  four
# a    0    1      2     3
# b    4    5      6     7
# c    8    9     10    11
#    one  two  three  four  five
# a    0    1      2     3    11
# b    4    5      6     7    22
# c    8    9     10    11    33
#    one  two  three  four  five
# a    0    1      2     3    11
# b    4    5      6     7    22
# c    8    9     10    11    33
# d  100  200    300   400   500
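
Modifying existing values is not shown above, so here is a small sketch of it. It assigns to one cell and to one block through a single df.loc call; the new values are arbitrary.

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(3, 4)
df = pd.DataFrame(arr, index=['a', 'b', 'c'],
                  columns=['one', 'two', 'three', 'four'])

df.loc['b', 'two'] = 500                      # one cell
df.loc[['a', 'c'], ['one', 'four']] = 0       # a 2x2 block, set to one value

print(df)
```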

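③ Deleting

del df['column label'] deletes the column from the original DataFrame directly.

df.drop('label', axis=0, inplace=False) can delete either rows or columns. axis defaults to 0, which deletes rows; axis=1 deletes columns. An error is raised if the given label does not exist along the chosen axis.

By default (inplace=False) drop returns a new DataFrame and leaves the original unchanged. With inplace=True no new DataFrame is created: the original is modified directly and drop returns None.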

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
print(df)
del df['four']
print(df)  # del removes the column from the original DataFrame
f = df.drop('c')
print(f)
print(df)
f = df.drop('three',axis=1,inplace=True)
print(f)
print(df)
#    one  two  three  four
# a    0    1      2     3
# b    4    5      6     7
# c    8    9     10    11
#    one  two  three
# a    0    1      2
# b    4    5      6
# c    8    9     10
#    one  two  three
# a    0    1      2
# b    4    5      6
#    one  two  three
# a    0    1      2
# b    4    5      6
# c    8    9     10
# None
#    one  two
# a    0    1
# b    4    5
# c    8    9
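
drop also accepts the keyword arguments index= and columns= in place of axis, and errors='ignore' to skip labels that do not exist. The sketch below shows these alongside the error case; the label 'zzz' is a deliberately missing example.

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(3, 4)
df = pd.DataFrame(arr, index=['a', 'b', 'c'],
                  columns=['one', 'two', 'three', 'four'])

without_rows = df.drop(index=['a', 'c'])         # same as drop([...], axis=0)
without_cols = df.drop(columns=['two', 'four'])  # same as drop([...], axis=1)

# A label that does not exist raises KeyError by default.
try:
    df.drop('zzz')
    missing_raised = False
except KeyError:
    missing_raised = True

# errors='ignore' returns the data unchanged instead.
unchanged = df.drop('zzz', errors='ignore')

print(without_rows)
print(without_cols)
```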

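④ Addition and subtraction

Adding or subtracting a single value applies the operation to every element of the DataFrame.

Adding or subtracting two DataFrames does not require identical index and columns. Cells whose row and column labels match are combined; rows and columns present in only one DataFrame are kept, with all their values set to NaN.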

arr1 = np.arange(12).reshape(3,4)
arr2 = np.arange(12).reshape(4,3)
df1 = pd.DataFrame(arr1,index = ['a','b','c'],columns = ['one','two','three','four'])
df2 = pd.DataFrame(arr2,index = ['a','b','c','d'],columns = ['one','two','three'])
print( df1 + 1 )
print( df1 + df2 )
#    one  two  three  four
# a    1    2      3     4
# b    5    6      7     8
# c    9   10     11    12
#    four   one  three   two
# a   NaN   0.0    4.0   2.0
# b   NaN   7.0   11.0   9.0
# c   NaN  14.0   18.0  16.0
# d   NaN   NaN    NaN   NaN
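
As with Series, the method form add takes a fill_value for cells that exist in only one DataFrame. In this sketch, a cell missing from one side is treated as 0; a cell missing from both sides (row d, column four) still comes out as NaN.

```python
import numpy as np
import pandas as pd

arr1 = np.arange(12).reshape(3, 4)
arr2 = np.arange(12).reshape(4, 3)
df1 = pd.DataFrame(arr1, index=['a', 'b', 'c'],
                   columns=['one', 'two', 'three', 'four'])
df2 = pd.DataFrame(arr2, index=['a', 'b', 'c', 'd'],
                   columns=['one', 'two', 'three'])

# Cells present in only one DataFrame are paired with 0.
total = df1.add(df2, fill_value=0)

print(total)
```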

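⑤ Sorting

Sort by value: sort_values('column label', ascending=True) orders the rows by the values in one column, ascending by default. To sort by several columns, pass ['column label 1', 'column label 2', ...].

Sort by index: sort_index(ascending=True) orders the rows by their index labels, ascending by default.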

arr = np.random.randint(1,10,[4,3])
df = pd.DataFrame(arr,index = ['a','b','c','d'],columns = ['one','two','three'])
print(df)
print(df.sort_values(['one','three'],ascending=True))
print(df.sort_index(ascending=False))
#    one  two  three
# a    7    7      1
# b    5    7      1
# c    1    9      4
# d    7    9      9
#    one  two  three
# c    1    9      4
# b    5    7      1
# a    7    7      1
# d    7    9      9
#    one  two  three
# d    7    9      9
# c    1    9      4
# b    5    7      1
# a    7    7      1
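
When sorting by several columns, ascending can also be a list with one flag per column. The sketch below uses fixed data (the same values as the random sample above, so the result is reproducible) and sorts column one in descending order, breaking ties by column three in ascending order.

```python
import pandas as pd

df = pd.DataFrame({'one': [7, 5, 1, 7],
                   'two': [7, 7, 9, 9],
                   'three': [1, 1, 4, 9]},
                  index=['a', 'b', 'c', 'd'])

# One flag per sort column: 'one' descending, 'three' ascending.
ordered = df.sort_values(['one', 'three'], ascending=[False, True])

print(ordered)
```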