Table of Contents
- I. Object Creation
- 1. Series objects
- (1) Creating from a list
- (2) Creating from a 1-D numpy array
- (3) Creating from a dict
- (4) Scalar data
- 2. DataFrame objects
- (1) From a Series object
- (2) From a dict of Series objects
- (3) From a list of dicts
- (4) From a 2-D numpy array
- II. DataFrame Properties
- 1. Attributes
- (1) values returns the data as a numpy array
- (2) index returns the row index
- (3) columns returns the column index
- (4) shape returns the shape
- (5) size returns the number of elements
- (6) dtypes returns the dtype of each column
- 2. Indexing
- (1) Getting columns
- (2) Getting rows
- (3) Getting a scalar
- (4) Indexing a Series object
- 3. Slicing
- (1) Row slicing
- (2) Column slicing
- (3) Mixed selection
- 4. Boolean indexing
- 5. Assignment
- III. Numerical Operations and Statistics
- 1. Inspecting data
- (1) Viewing the first rows
- (2) Viewing the last rows
- (3) Viewing summary information
- 2. numpy ufuncs work on pandas objects
- (1) Vectorized operations
- (2) Matrix operations
- (3) Broadcasting
- 3. New behaviors
- (1) Index alignment
- (2) Statistics
- IV. Handling Missing Values
- 1. Detecting missing values
- 2. Dropping missing values
- (1) Dropping rows
- (2) Dropping columns
- 3. Filling missing values
- V. Combining Data
- 1. Vertical concatenation
- 2. Horizontal concatenation
- 3. Overlapping indexes
- 4. Aligned merging with merge()
- 5. Example: merging city information
- VI. Grouping and Pivot Tables
- 1. Grouping
- (1) Lazy evaluation
- (2) Selecting columns
- (3) Iterating over groups
- (4) Calling methods
- (5) More complex operations
- (6) Filtering
- (7) Transformation
- (8) The apply() method
- (9) Using a list or array as the grouping key
- (10) Mapping the index to groups with a dict
- (11) Any Python function
- (12) A list of any of the valid keys
- (13) Example: processing planet-observation data
- 2. Pivot tables
- VII. MultiIndex: for multi-dimensional data
- VIII. High-performance pandas
- 1. eval() and query() usage
- 2. When to use eval() and query()
I. Object Creation
1. Series objects
A Series is a one-dimensional array of data with labels.
(1) Creating from a list
pd.Series(data, index=index, dtype=dtype)
data: the data; can be a list, dict, or numpy array
index: the index, optional
dtype: the data type, optional
① Example
import pandas as pd
# index omitted; defaults to an integer sequence
data = pd.Series([2,4,3,6])
print(data)
'''
0 2
1 4
2 3
3 6
dtype: int64
'''
② Adding an index
import pandas as pd
data = pd.Series([2,4,3,6], index=["a", "b", "c", "d"])
print(data)
'''
a 2
b 4
c 3
d 6
dtype: int64
'''
③ Specifying a data type
import pandas as pd
data = pd.Series([2,4,3,6], index=["a", "b", "c", "d"], dtype=float)
print(data)
'''
a 2.0
b 4.0
c 3.0
d 6.0
dtype: float64
'''
print(data["c"]) # 3.0
④ The data type can be coerced
import pandas as pd
data = pd.Series([2,4,"3",6], index=["a", "b", "c", "d"], dtype=float)
print(data)
'''
a 2.0
b 4.0
c 3.0
d 6.0
dtype: float64
'''
print(data["c"]) # 3.0
(2) Creating from a 1-D numpy array
import pandas as pd
import numpy as np
x = np.arange(5)
data = pd.Series(x)
print(data)
'''
0 0
1 1
2 2
3 3
4 4
dtype: int32
'''
(3) Creating from a dict
By default the dict keys become the index and the values become the data.
import pandas as pd
dic = {"x":1,"y":10}
data = pd.Series(dic)
print(data)
'''
x 1
y 10
dtype: int64
'''
When creating from a dict, if an index is specified, pandas looks up each label among the dict keys; labels without a match get NaN.
import pandas as pd
dic = {"x":1,"y":10}
data = pd.Series(dic, index=["x","z"])
print(data)
'''
x 1.0
z NaN
dtype: float64
'''
(4) Scalar data
When data is a scalar, it is repeated for every index label.
import pandas as pd
data = pd.Series(5, index=["x", "z"])
print(data)
'''
x 5
z 5
dtype: int64
'''
2. DataFrame objects
A DataFrame is a multi-dimensional array of data with labels.
pd.DataFrame(data, index=index, columns=columns)
data: the data; can be a list, dict, or numpy array
index: the index, optional
columns: the column labels, optional
(1) From a Series object
import pandas as pd
dic = {"beijing":110, "shanghai":370}
popu = pd.Series(dic)
dpopu = pd.DataFrame(popu)
print(dpopu)
'''0
beijing 110
shanghai 370
'''
import pandas as pd
dic = {"beijing":110, "shanghai":370}
popu = pd.Series(dic)
dpopu = pd.DataFrame(popu,columns=["icode"])
print(dpopu)
'''icode
beijing 110
shanghai 370
'''
(2) From a dict of Series objects
import pandas as pd
dic1 = {"beijing":2300, "shanghai":2100}
dic2 = {"beijing":110, "shanghai":370}
popu = pd.Series(dic1)
icode = pd.Series(dic2)
data = pd.DataFrame({"population":popu, "icode":icode, "country":"China"})
# the scalar "China" is broadcast to fill the column
print(data)
'''population icode country
beijing 2300 110 China
shanghai 2100 370 China
'''
(3) From a list of dicts
The list positions become the index and the dict keys become the columns.
import pandas as pd
data = [{"a":i, "b":2*i} for i in range(3)]
print(pd.DataFrame(data))
'''a b
0 0 0
1 1 2
2 2 4
'''
Keys missing from a dict default to NaN.
import pandas as pd
data = [{"a":1,"b":2},{"b":3,"c":10}]
print(pd.DataFrame(data))
'''a b c
0 1.0 2 NaN
1 NaN 3 10.0
'''
(4) From a 2-D numpy array
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randint(10, size=(3,2)),
                    columns=["foo","bar"], index=["a","b","c"])
print(data)
'''foo bar
a 3 2
b 1 8
c 3 5
'''
II. DataFrame Properties
1. Attributes
(1) values returns the data as a numpy array
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randint(10, size=(3,2)),
                    columns=["foo","bar"], index=["a","b","c"])
print(data.values)
'''
[[0 5]
 [7 0]
 [4 8]]
'''
(2) index returns the row index
print(data.index)
'''
Index(['a', 'b', 'c'], dtype='object')
'''
(3) columns returns the column index
print(data.columns)
'''
Index(['foo', 'bar'], dtype='object')
'''
(4) shape returns the shape
print(data.shape) # (3, 2)
(5) size returns the number of elements
print(data.size) # 6
(6) dtypes returns the dtype of each column
print(data.dtypes)
'''
foo int32
bar int32
dtype: object
'''
2. Indexing
(1) Getting columns
Dict-style:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(6).reshape(3,2),
                    columns=["foo","bar"], index=["a","b","c"])
print(data)
'''foo bar
a 0 1
b 2 3
c 4 5
'''
print(data["foo"])
'''
a 0
b 2
c 4
Name: foo, dtype: int32
'''
print(data[["foo", "bar"]])
'''foo bar
a 0 1
b 2 3
c 4 5
'''
Attribute-style:
print(data.bar)
'''
a 1
b 3
c 5
Name: bar, dtype: int32
'''
(2) Getting rows
Label-based indexing with loc
print(data.loc["b"])
'''
foo 2
bar 3
Name: b, dtype: int32
'''
Position-based indexing with iloc
print(data.iloc[1])
'''
foo 2
bar 3
Name: b, dtype: int32
'''
print(data.iloc[[0,2]])
'''foo bar
a 0 1
c 4 5
'''
(3) Getting a scalar
print(data.loc["b","bar"]) # 3
print(data.iloc[0,1]) # 1
print(data.values[0][1]) # 1
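Beyond loc and iloc, pandas also provides the at and iat accessors, optimized for single scalar lookups. A minimal sketch, reusing the frame layout from the example above:

```python
import numpy as np
import pandas as pd

# same layout as above: values 0..5, rows a/b/c, columns foo/bar
data = pd.DataFrame(np.arange(6).reshape(3, 2),
                    columns=["foo", "bar"], index=["a", "b", "c"])

print(data.at["b", "bar"])  # label-based scalar access -> 3
print(data.iat[0, 1])       # position-based scalar access -> 1
```

at and iat only accept a single row and a single column, which is what lets them be faster than the more general loc/iloc.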
(4) Indexing a Series object
print(type(data.foo)) # <class 'pandas.core.series.Series'>
print(data.foo["c"]) # 4
3. Slicing
import pandas as pd
import numpy as np
datas = pd.date_range(start="2019-01-01", periods=6)
print(datas)
'''
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06'],
              dtype='datetime64[ns]', freq='D')
'''
df = pd.DataFrame(np.random.randn(6,4), index=datas,columns=["A","B","C","D"])
print(df)
'''A B C D
2019-01-01 -0.593472 -0.526596 -0.663579 -0.475506
2019-01-02 0.029637 -1.542327 1.446231 -0.219709
2019-01-03 0.312669 -0.540142 0.106548 -0.569854
2019-01-04 -0.031100 1.409991 -0.625770 1.349713
2019-01-05 -0.752705 -0.302528 0.043599 0.592143
2019-01-06 0.956202 -0.393068 0.466223 -1.890532
'''
(1) Row slicing
print(df["2019-01-01":"2019-01-03"])
print(df.loc["2019-01-01":"2019-01-03"])
print(df.iloc[0:3])
'''A B C D
2019-01-01 -0.563258 -0.981668 -0.038098 0.313748
2019-01-02 1.453888 -1.075848 1.452511 -0.562839
2019-01-03 0.797852 0.774357 1.796320 1.337514
'''
(2) Column slicing
print(df.loc[:, "A":"C"])
print(df.iloc[:,0:3])
'''A B C
2019-01-01 0.121463 -2.668285 0.175662
2019-01-02 -0.042151 1.250018 0.964810
2019-01-03 0.641962 0.892863 -0.091651
2019-01-04 -0.381722 0.014011 -0.962964
2019-01-05 1.158018 -0.030124 0.599618
2019-01-06 0.569749 -0.435110 -0.319675
'''
(3) Mixed selection
Slicing rows and columns simultaneously
print(df.loc["2019-01-02":"2019-01-04", "B":"C"])
print(df.iloc[1:4,1:3])
'''B C
2019-01-02 1.885370 0.439749
2019-01-03 -1.054281 0.271491
2019-01-04 -0.781519 -0.872194
'''
Row slice with scattered column values
print(df.loc["2019-01-04":"2019-01-06", ["A","C"]])
print(df.iloc[3:, [0,2]])
'''A C
2019-01-04 0.057934 0.415995
2019-01-05 0.656228 0.836275
2019-01-06 -0.956402 0.720133
'''
Scattered row values with a column slice
print(df.loc[["2019-01-04", "2019-01-06"], "C":"D"])
print(df.iloc[[3,5],2:4])
'''C D
2019-01-04 -0.796464 -1.371296
2019-01-06 2.131938 -1.106263
'''
Scattered values on both rows and columns
print(df.loc[["2019-01-04","2019-01-06"], ["B", "D"]])
print(df.iloc[[3,5],[1,3]])
'''B D
2019-01-04 -0.320283 1.346262
2019-01-06 -0.216891 -0.844410
'''
4. Boolean indexing
print(df[df>0])
'''A B C D
2019-01-01 1.170066 NaN NaN NaN
2019-01-02 0.786002 2.158762 NaN NaN
2019-01-03 NaN NaN 0.322335 0.602991
2019-01-04 NaN NaN NaN NaN
2019-01-05 0.416069 NaN 0.838723 0.687255
2019-01-06 0.277207 0.086217 NaN NaN
'''
print(df.A>0) # check whether each element of column A is greater than 0
'''
2019-01-01 True
2019-01-02 True
2019-01-03 False
2019-01-04 False
2019-01-05 False
2019-01-06 True
'''
print(df[df.A>0])
'''A B C D
2019-01-01 0.590420 -1.282202 0.318478 0.415096
2019-01-02 2.072327 -0.121314 1.713179 1.663085
2019-01-06 0.106245 -0.522096 0.417755 -0.524761
'''
The isin() method
df2 = df.copy()
df2["E"] = ["one","one","two","three","four","three"]
print(df2)
'''A B C D E
2019-01-01 -0.432689 1.960850 0.079677 0.609651 one
2019-01-02 0.026600 0.081690 0.555260 -0.193917 one
2019-01-03 1.346473 -0.249037 -0.398267 1.376942 two
2019-01-04 1.631712 -1.757012 -0.386546 -0.215699 three
2019-01-05 0.802655 -0.033013 0.771480 -1.589764 four
2019-01-06 0.615043 -0.240700 0.678544 -0.838852 three
'''
ind = df2["E"].isin(["two","four"])
print(ind)
'''
2019-01-01 False
2019-01-02 False
2019-01-03 True
2019-01-04 False
2019-01-05 True
2019-01-06 False
'''
print(df2[ind])
'''A B C D E
2019-01-03 0.704706 0.123659 1.147022 0.104124 two
2019-01-05 0.065825 0.207168 1.425794 -0.267355 four
'''
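Boolean conditions can also be combined with the element-wise operators & (and), | (or), and ~ (not); Python's own and/or do not work here, and each condition needs its own parentheses. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, -2, 3, -4], "B": [5, 6, -7, 8]})

print(df[(df.A > 0) & (df.B > 0)])  # rows where both A and B are positive
print(df[(df.A > 0) | (df.B > 0)])  # rows where at least one is positive
print(df[~(df.A > 0)])              # rows where A is not positive
```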
5. Assignment
Adding a new column to a DataFrame
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range("20190101", periods=6))
print(s1)
'''
2019-01-01 1
2019-01-02 2
2019-01-03 3
2019-01-04 4
2019-01-05 5
2019-01-06 6
Freq: D, dtype: int64
'''
df["E"] = s1
print(df)
'''A B C D E
2019-01-01 -2.192860 1.744378 -0.671842 0.704741 1
2019-01-02 0.125302 -1.141235 1.145471 1.860608 2
2019-01-03 1.462714 -0.632829 -0.046127 0.379126 3
2019-01-04 1.745818 -0.688786 0.574567 -0.900502 4
2019-01-05 0.680510 -0.194625 -1.047654 1.482277 5
2019-01-06 1.627649 -0.205627 -1.003146 0.453174 6
'''
Modifying values
df.loc["2019-01-01", "A"] = 0
df.iloc[0,1] = 0
df["D"] = np.array([5]*len(df)) # can be shortened to df["D"] = 5; len(df) returns the number of rows
print(df)
'''A B C D
2019-01-01 0.000000 0.000000 1.095675 5
2019-01-02 -2.028600 2.048896 -1.527212 5
2019-01-03 2.149004 -0.904068 0.471809 5
2019-01-04 -0.034528 2.151367 -0.219636 5
2019-01-05 -0.544008 -1.098587 -1.873869 5
2019-01-06 -1.547652 -2.084554 -0.701767 5
'''
Modifying index and columns
df.index = [i for i in range(len(df))]
df.columns = [i*10 for i in range(df.shape[1])]
print(df)
'''0 10 20 30
0 -0.942362 0.191228 0.891761 -0.520997
1 -1.330733 -0.462275 -0.711679 1.503393
2 -0.187491 1.461077 0.557227 -0.798765
3 -0.012331 -1.728701 0.018166 0.659837
4 0.518749 0.776088 2.482731 -0.020565
5 0.475219 -1.025717 1.293841 1.236391
'''
III. Numerical Operations and Statistics
1. Inspecting data
import pandas as pd
import numpy as np
dates = pd.date_range(start="2019-01-01", periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates,columns=["A","B","C","D"])
print(df)
'''A B C D
2019-01-01 -1.061156 0.591245 -0.885117 1.123434
2019-01-02 -1.142466 -0.807766 -1.519887 0.051029
2019-01-03 -0.739533 1.907320 -1.359995 0.335202
2019-01-04 -0.290423 -1.784109 -1.033240 0.706024
2019-01-05 1.179959 0.660133 0.596361 0.384645
2019-01-06 1.093600 -0.395159 -0.799479 -0.308565
'''
(1) Viewing the first rows
print(df.head(2)) # defaults to the first 5 rows
'''A B C D
2019-01-01 -1.062086 -1.966453 0.638081 0.922812
2019-01-02 0.683613 1.363954 0.004098 1.308496
'''
(2) Viewing the last rows
print(df.tail(2)) # defaults to the last 5 rows
'''A B C D
2019-01-05 -0.370315 0.187505 -0.272255 0.296648
2019-01-06 1.393871 -0.341858 0.361288 0.834284
'''
(3) Viewing summary information
df.iloc[0, 3] = np.nan # set the value in row 1, column 4 to NaN
print(df)
'''A B C D
2019-01-01 0.529576 -0.582373 1.174552 NaN
2019-01-02 1.381525 2.005128 -0.084598 -0.680730
2019-01-03 0.634071 -0.421678 -0.695929 1.936779
2019-01-04 -0.146882 1.434341 0.553859 -0.452890
2019-01-05 -0.257330 -0.119174 -0.859402 0.163590
2019-01-06 -1.684116 0.372460 1.312178 -1.548088
'''
df.info()
'''
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2019-01-01 to 2019-01-06
Freq: D
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       6 non-null      float64
 1   B       6 non-null      float64
 2   C       6 non-null      float64
 3   D       5 non-null      float64
dtypes: float64(4)
memory usage: 240.0 bytes
'''
2. numpy ufuncs work on pandas objects
(1) Vectorized operations
x = pd.DataFrame(np.arange(4).reshape(1,4))
print(x)
'''0 1 2 3
0 0 1 2 3
'''
print(x+5)
'''0 1 2 3
0 5 6 7 8
'''
print(np.exp(x))
'''0 1 2 3
0 1.0 2.718282 7.389056 20.085537
'''
x = pd.DataFrame(np.arange(4).reshape(1,4))
print(x)
'''0 1 2 3
0 0 1 2 3
'''
y = pd.DataFrame(np.arange(4,8).reshape(1,4))
print(y)
'''0 1 2 3
0 4 5 6 7
'''
print(x*y)
'''0 1 2 3
0 0 5 12 21
'''
(2) Matrix operations
np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(5,5)))
print(x)
'''0 1 2 3 4
0 6 3 7 4 6
1 9 2 6 7 4
2 3 7 7 2 5
3 4 1 7 5 1
4 4 0 9 5 8
'''
print(x.dtypes)
'''
0 int32
1 int32
2 int32
3 int32
4 int32
dtype: object
'''
np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(5,5)))
print(x)
'''0 1 2 3 4
0 6 3 7 4 6
1 9 2 6 7 4
2 3 7 7 2 5
3 4 1 7 5 1
4 4 0 9 5 8
'''
z = x.T # transpose
print(z)
'''0 1 2 3 4
0 6 9 3 4 4
1 3 2 7 1 0
2 7 6 7 7 9
3 4 7 2 5 5
4 6 4 5 1 8
'''
print(x.dot(z))
'''0 1 2 3 4
0 146 154 126 102 155
1 154 186 117 119 157
2 126 117 136 83 125
3 102 119 83 92 112
4 155 157 125 112 186
'''
print(np.dot(x,z))
'''
[[146 154 126 102 155]
 [154 186 117 119 157]
 [126 117 136  83 125]
 [102 119  83  92 112]
 [155 157 125 112 186]]
'''
For the same computation, purely numerical work generally runs faster in numpy: numpy is geared toward computation, pandas toward data handling.
(3) Broadcasting
np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=list("ABC"))
print(x)
'''A B C
0 6 3 7
1 4 6 9
2 2 6 7
'''
Broadcasting along rows
print(x.iloc[0])
'''
A 6
B 3
C 7
Name: 0, dtype: int32
'''
print(x/x.iloc[0])
'''A B C
0 1.000000 1.0 1.000000
1 0.666667 2.0 1.285714
2 0.333333 2.0 1.000000
'''
Broadcasting along columns
print(x.A)
'''
0 6
1 4
2 2
Name: A, dtype: int32
'''
print(x.div(x.A, axis=0)) # divide every column by column A
'''A B C
0 1.0 0.5 1.166667
1 1.0 1.5 2.250000
2 1.0 3.0 3.500000
'''
print(x.iloc[0])
'''
A 6
B 3
C 7
Name: 0, dtype: int32
'''
print(x.div(x.iloc[0], axis=1)) # axis=1 is the default, i.e. operate along rows
'''A B C
0 1.000000 1.0 1.000000
1 0.666667 2.0 1.285714
2 0.333333 2.0 1.000000
'''
3. New behaviors
(1) Index alignment
np.random.seed(42)
x = pd.DataFrame(np.random.randint(0,20,size=(2,2)), columns=list("AB"))
print(x)
'''A B
0 6 19
1 14 10
'''
y = pd.DataFrame(np.random.randint(0,10,size=(3,3)), columns=list("ABC"))
print(y)
'''A B C
0 7 4 6
1 9 2 6
2 7 4 3
'''
pandas automatically aligns the indexes of the two objects; positions missing from either side become np.nan.
print(x+y)
'''A B C
0 13.0 23.0 NaN
1 23.0 12.0 NaN
2 NaN NaN NaN
'''
Missing values can also be filled via fill_value:
print(x.add(y, fill_value=0))
'''A B C
0 13.0 23.0 6.0
1 23.0 12.0 6.0
2 7.0 4.0 3.0
'''
(2) Statistics
Counting distinct values
import pandas as pd
import numpy as np
from collections import Counter

np.random.seed(42)
y = np.random.randint(3, size=10)
print(y) # [2 0 2 2 0 0 2 1 2 2]
print(np.unique(y)) # [0 1 2]
print(Counter(y)) # Counter({2: 6, 0: 3, 1: 1})
y1 = pd.DataFrame(y,columns=["A"])
print(y1)
'''A
0 2
1 0
2 2
3 2
4 0
5 0
6 2
7 1
8 2
9 2
'''
print(np.unique(y1)) # [0 1 2]
print(y1["A"].value_counts())
'''
2 6
0 3
1 1
Name: A, dtype: int64
'''
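value_counts() can also report proportions instead of raw counts via its normalize parameter:

```python
import pandas as pd

s = pd.Series([2, 0, 2, 2, 0, 0, 2, 1, 2, 2])  # same values as y above

print(s.value_counts())                # counts: 2 -> 6, 0 -> 3, 1 -> 1
print(s.value_counts(normalize=True))  # proportions: 2 -> 0.6, 0 -> 0.3, 1 -> 0.1
```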
Deriving new columns and sorting
import pandas as pd
import numpy as np
population_dict = {"BeiJing":2154,"ShangHai":2424,"ShenZhen":1303,"HangZhou":981}
population = pd.Series(population_dict)
GDP_dict = {"BeiJing":30320,"ShangHai":32680,"ShenZhen":24222,"HangZhou":13468}
GDP = pd.Series(GDP_dict)
city_info = pd.DataFrame({"population":population,"GDP":GDP})
city_info["per_GDP"] = city_info["GDP"]/city_info["population"]
print(city_info)
'''population GDP per_GDP
BeiJing 2154 30320 14.076137
ShangHai 2424 32680 13.481848
ShenZhen 1303 24222 18.589409
HangZhou 981 13468 13.728848
'''
① Ascending sort
print(city_info.sort_values(by="per_GDP"))
'''population GDP per_GDP
ShangHai 2424 32680 13.481848
HangZhou 981 13468 13.728848
BeiJing 2154 30320 14.076137
ShenZhen 1303 24222 18.589409
'''
② Descending sort
print(city_info.sort_values(by="per_GDP", ascending=False))
'''population GDP per_GDP
ShenZhen 1303 24222 18.589409
BeiJing 2154 30320 14.076137
HangZhou 981 13468 13.728848
ShangHai 2424 32680 13.481848
'''
③ Sorting along an axis
data = pd.DataFrame(np.random.randint(20, size=(3,4)),index=[2,1,0],columns=list("CBAD"))
print(data)
'''C B A D
2 2 5 19 16
1 14 11 9 4
0 6 18 5 17
'''
print(data.sort_index()) # sort by row index
'''C B A D
0 6 18 5 17
1 14 11 9 4
2 2 5 19 16
'''
print(data.sort_index(axis=1)) # sort by column index
'''A B C D
2 3 15 1 14
1 10 7 18 6
0 15 13 11 14
'''
print(data.sort_index(axis=1, ascending=False))
'''D C B A
2 3 10 9 6
1 5 11 15 5
0 5 7 16 2
'''
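sort_values also accepts a list of columns, with a matching list for ascending, to break ties column by column. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"grade": [2, 1, 2, 1],
                   "score": [88, 95, 91, 79]})

# sort by grade ascending, then by score descending within each grade
print(df.sort_values(by=["grade", "score"], ascending=[True, False]))
```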
Statistical methods
np.random.seed(10)
df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)),columns=list("ABCD"))
print(df)
'''A B C D
0 7.326346 4.861116 -4.181601 1.966465
1 4.485344 -0.880342 3.062046 2.434194
2 2.017166 1.301599 3.732105 6.812149
3 -1.860263 6.113096 2.914521 3.780550
4 -2.546409 2.540548 7.938148 -2.319220
5 -5.910913 -4.973489 3.064281 11.539869
'''
# count non-null values
print(df.count())
'''
A 6
B 6
C 6
D 6
'''
# sum
print(df.sum())
'''
A 3.511271
B 8.962527
C 16.529499
D 24.214008
dtype: float64
'''
print(df.sum(axis=1))
'''
0 9.972325
1 9.101242
2 13.863019
3 10.947905
4 5.613067
5 3.719748
dtype: float64
'''
# max and min
print(df.min()) # per column
'''
A -5.910913
B -4.973489
C -4.181601
D -2.319220
dtype: float64
'''
print(df.max(axis=1)) # per row
'''
0 7.326346
1 4.485344
2 6.812149
3 6.113096
4 7.938148
5 11.539869
dtype: float64
'''
print(df.idxmax()) # index label of each column's max
'''
A 0
B 3
C 4
D 5
dtype: int64
'''
# mean
print(df.mean())
'''
A 0.585212
B 1.493755
C 2.754917
D 4.035668
dtype: float64
'''
# variance
print(df.var())
'''
A 24.138289
B 16.254343
C 15.230314
D 22.263578
dtype: float64
'''
# standard deviation
print(df.std())
'''
A 4.913073
B 4.031668
C 3.902604
D 4.718430
dtype: float64
'''
# median
print(df.median())
'''
A 0.078452
B 1.921073
C 3.063163
D 3.107372
dtype: float64
'''
# mode
data = pd.DataFrame(np.random.randint(5,size=(10,2)),columns=list("AB"))
print(data)
'''A B
0 2 0
1 3 4
2 2 0
3 1 2
4 0 0
5 3 1
6 3 4
7 1 4
8 2 0
9 0 4
'''
print(data.mode())
'''A B
0 2 0
1 3 4
'''
print(df.quantile(0.75)) # 75th percentile
'''
A 3.868299
B 4.280974
C 3.565149
D 6.054250
Name: 0.75, dtype: float64
'''
print(df.describe())
'''A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.585212 1.493755 2.754917 4.035668
std 4.913073 4.031668 3.902604 4.718430
min -5.910913 -4.973489 -4.181601 -2.319220
25% -2.374872 -0.334857 2.951402 2.083397
50% 0.078452 1.921073 3.063163 3.107372
75% 3.868299 4.280974 3.565149 6.054250
max 7.326346 6.113096 7.938148 11.539869
'''
data2 = pd.DataFrame([["a","a","c","d"],["c","a","c","d"],["a","a","d","c"]],
columns=list("ABCD"))
print(data2)
'''A B C D
0 a a c d
1 c a c d
2 a a d c
'''
print(data2.describe())
'''A B C D
count 3 3 3 3
unique 2 1 2 2
top a a c d
freq 2 3 2 2
'''
'''
count: number of values in each column,
unique: number of distinct values in each column,
top: most frequent value in each column,
freq: how often that most frequent value appears.
'''
# correlation coefficients
print(df.corr())
'''A B C D
A 1.000000 0.409966 -0.655007 -0.383420
B 0.409966 1.000000 -0.255655 -0.631457
C -0.655007 -0.255655 1.000000 -0.152966
D -0.383420 -0.631457 -0.152966 1.000000
'''
print(df.corrwith(df["A"]))
'''
A 1.000000
B 0.409966
C -0.655007
D -0.383420
dtype: float64
'''
Custom output
apply(method): applies method, by default to each column.
np.random.seed(10)
df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)),columns=list("ABCD"))
print(df)
'''A B C D
0 7.326346 4.861116 -4.181601 1.966465
1 4.485344 -0.880342 3.062046 2.434194
2 2.017166 1.301599 3.732105 6.812149
3 -1.860263 6.113096 2.914521 3.780550
4 -2.546409 2.540548 7.938148 -2.319220
5 -5.910913 -4.973489 3.064281 11.539869
'''
print(df.apply(np.cumsum)) # cumulative sum down each column
'''A B C D
0 7.326346 4.861116 -4.181601 1.966465
1 11.811690 3.980774 -1.119555 4.400659
2 13.828856 5.282373 2.612550 11.212808
3 11.968593 11.395469 5.527070 14.993359
4 9.422184 13.936017 13.465218 12.674139
5 3.511271 8.962527 16.529499 24.214008
'''
print(df.apply(np.cumsum, axis=1)) # cumulative sum across each row
'''A B C D
0 7.326346 12.187462 8.005861 9.972325
1 4.485344 3.605002 6.667048 9.101242
2 2.017166 3.318765 7.050870 13.863019
3 -1.860263 4.252834 7.167354 10.947905
4 -2.546409 -0.005861 7.932287 5.613067
5 -5.910913 -10.884402 -7.820122 3.719748
'''
print(df.apply(sum))
'''
A 3.511271
B 8.962527
C 16.529499
D 24.214008
dtype: float64
'''
print(df.apply(lambda x: x.max()-x.min()))
'''
A 13.237259
B 11.086585
C 12.119749
D 13.859089
dtype: float64
'''
def my_describe(x):
    return pd.Series([x.count(), x.mean(), x.max(), x.idxmin(), x.std()],
                     index=["Count", "mean", "max", "idxmin", "std"])
print(df.apply(my_describe))
'''A B C D
Count 6.000000 6.000000 6.000000 6.000000
mean 0.585212 1.493755 2.754917 4.035668
max 7.326346 6.113096 7.938148 11.539869
idxmin 5.000000 5.000000 0.000000 4.000000
std 4.913073 4.031668 3.902604 4.718430
'''
IV. Handling Missing Values
1. Detecting missing values
import pandas as pd
import numpy as np
data = pd.DataFrame(np.array([[1, np.nan, 2],[np.nan, 3, 4],[5, 6, None]]),
                    columns=["A", "B", "C"])
print(data)
'''A B C
0 1 NaN 2
1 NaN 3 4
2 5 6 None
'''
Note: once None, strings, etc. are present, every column's dtype becomes object, which consumes more resources than int or float.
print(data.dtypes)
'''
A object
B object
C object
dtype: object
'''
print(data.isnull())
'''A B C
0 False True False
1 True False False
2 False False True
'''
print(data.notnull())
'''A B C
0 True False True
1 False True True
2 True True False
'''
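Because True counts as 1, chaining isnull() with sum() gives the number of missing values per column, and a second sum() gives the total:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.array([[1, np.nan, 2],
                              [np.nan, 3, 4],
                              [5, 6, None]]),
                    columns=["A", "B", "C"])

print(data.isnull().sum())        # missing values per column: A 1, B 1, C 1
print(data.isnull().sum().sum())  # total missing values: 3
```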
2. Dropping missing values
import pandas as pd
import numpy as np
data = pd.DataFrame(np.array([[1, np.nan, 2, 3],[np.nan, 3, 4, 6],[7, 8, np.nan, 9],[10, 11, 12, 13]]),
                    columns=["A", "B", "C", "D"])
print(data)
'''A B C D
0 1.0 NaN 2.0 3.0
1 NaN 3.0 4.0 6.0
2 7.0 8.0 NaN 9.0
3 10.0 11.0 12.0 13.0
'''
Note: np.nan is a special floating-point value.
print(data.dtypes)
'''
A float64
B float64
C float64
D float64
dtype: object
'''
(1) Dropping rows
print(data.dropna())
'''A B C D
3 10.0 11.0 12.0 13.0
'''
(2) Dropping columns
print(data.dropna(axis=1))
'''D
0 3.0
1 6.0
2 9.0
3 13.0
'''
data["D"] = np.nan
print(data)
'''A B C D
0 1.0 NaN 2.0 NaN
1 NaN 3.0 4.0 NaN
2 7.0 8.0 NaN NaN
3 10.0 11.0 12.0 NaN
'''
print(data.dropna(axis=1, how="all"))
'''A B C
0 1.0 NaN 2.0
1 NaN 3.0 4.0
2 7.0 8.0 NaN
3 10.0 11.0 12.0
'''
data.loc[3] = np.nan
print(data)
'''A B C D
0 1.0 NaN 2.0 NaN
1 NaN 3.0 4.0 NaN
2 7.0 8.0 NaN NaN
3 NaN NaN NaN NaN
'''
print(data.dropna(how="all"))
'''A B C D
0 1.0 NaN 2.0 NaN
1 NaN 3.0 4.0 NaN
2 7.0 8.0 NaN NaN
'''
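Between the default how="any" and how="all", dropna also accepts a thresh parameter: keep only the rows (or columns) that have at least that many non-null values. A sketch with made-up data:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.array([[1, np.nan, 2, np.nan],
                              [np.nan, 3, 4, np.nan],
                              [7, 8, np.nan, np.nan],
                              [10, 11, 12, np.nan]]),
                    columns=["A", "B", "C", "D"])

# keep only rows with at least 3 non-null values (here, only the last row)
print(data.dropna(thresh=3))
```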
3. Filling missing values
import pandas as pd
import numpy as np
data = pd.DataFrame(np.array([[1, np.nan, 2, 3],[np.nan, 3, 4, 6],[7, 8, np.nan, 9],[10, 11, 12, 13]]),
                    columns=["A", "B", "C", "D"])
print(data)
'''A B C D
0 1.0 NaN 2.0 3.0
1 NaN 3.0 4.0 6.0
2 7.0 8.0 NaN 9.0
3 10.0 11.0 12.0 13.0
'''
print(data.fillna(value=5))
'''A B C D
0 1.0 5.0 2.0 3.0
1 5.0 3.0 4.0 6.0
2 7.0 8.0 5.0 9.0
3 10.0 11.0 12.0 13.0
'''
Replacing with the mean
print(data.fillna(value=data.mean())) # fill each column with its own mean
'''A B C D
0 1.0 7.333333 2.0 3.0
1 6.0 3.000000 4.0 6.0
2 7.0 8.000000 6.0 9.0
3 10.0 11.000000 12.0 13.0
'''
print(data.fillna(value=data.stack().mean())) # fill with the mean of all non-null values in the DataFrame
'''A B C D
0 1.000000 6.846154 2.000000 3.0
1 6.846154 3.000000 4.000000 6.0
2 7.000000 8.000000 6.846154 9.0
3 10.000000 11.000000 12.000000 13.0
'''
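Missing values can also be propagated from their neighbors: ffill() carries the previous valid value forward down each column, while bfill() carries the next valid value backward. A sketch on the same frame:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 3, 4, 6],
                              [7, 8, np.nan, 9],
                              [10, 11, 12, 13]]),
                    columns=["A", "B", "C", "D"])

print(data.ffill())  # forward fill: NaN in C at row 2 becomes 4.0
print(data.bfill())  # backward fill: NaN in A at row 1 becomes 7.0
```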
V. Combining Data
First build a helper function that produces a DataFrame:
import pandas as pd

def make_df(cols, ind):
    data = {c: [str(c)+str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

print(make_df("ABC", range(3)))
'''A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
'''
1. Vertical concatenation
df_1 = make_df("AB", [1, 2])
df_2 = make_df("AB", [3, 4])
print(df_1)
'''A B
1 A1 B1
2 A2 B2
'''
print(df_2)
'''A B
3 A3 B3
4 A4 B4
'''
print(pd.concat([df_1, df_2]))
'''A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4
'''
2. Horizontal concatenation
df_3 = make_df("AB", [0,1])
df_4 = make_df("CD", [0,1])
print(df_3)
'''A B
0 A0 B0
1 A1 B1
'''
print(df_4)
'''C D
0 C0 D0
1 C1 D1
'''
print(pd.concat([df_3, df_4], axis=1))
'''A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
'''
3. Overlapping indexes
df_5 = make_df("AB", [1, 2])
df_6 = make_df("AB", [1, 2])
print(df_5)
'''A B
1 A1 B1
2 A2 B2
'''
print(df_6)
'''A B
1 A1 B1
2 A2 B2
'''
print(pd.concat([df_5, df_6]))
'''A B
1 A1 B1
2 A2 B2
1 A1 B1
2 A2 B2
'''
print(pd.concat([df_5, df_6], ignore_index=True))
'''A B
0 A1 B1
1 A2 B2
2 A1 B1
3 A2 B2
'''
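Instead of discarding the overlapping index with ignore_index, concat can record where each row came from by building a hierarchical index with keys (make_df is the helper defined at the start of this chapter):

```python
import pandas as pd

def make_df(cols, ind):
    # helper from earlier: cell "A1" at column A, row 1
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

df_5 = make_df("AB", [1, 2])
df_6 = make_df("AB", [1, 2])

# rows are labeled (x, 1), (x, 2), (y, 1), (y, 2)
print(pd.concat([df_5, df_6], keys=["x", "y"]))
```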
4. Aligned merging with merge()
df_9 = make_df("AB", [1, 2])
df_10 = make_df("BC", [1, 2])
print(df_9)
'''A B
1 A1 B1
2 A2 B2
'''
print(df_10)
'''B C
1 B1 C1
2 B2 C2
'''
print(pd.merge(df_9, df_10))
'''A B C
0 A1 B1 C1
1 A2 B2 C2
'''
5. Example: merging city information
import pandas as pd
population_dict = {"city": ("BeiJing", "HangZhou", "ShenZhen"),
                   "pop": (2154, 981, 1303)}
population = pd.DataFrame(population_dict)
print(population)
'''city pop
0 BeiJing 2154
1 HangZhou 981
2 ShenZhen 1303
'''
GDP_dict = {"city": ("BeiJing", "ShangHai", "HangZhou"),"GDP": (30320, 32680, 13468)}
GDP = pd.DataFrame(GDP_dict)
print(GDP)
'''city GDP
0 BeiJing 30320
1 ShangHai 32680
2 HangZhou 13468
'''
city_info = pd.merge(population, GDP)
print(city_info)
'''city pop GDP
0 BeiJing 2154 30320
1 HangZhou 981 13468
'''
city_info = pd.merge(population, GDP, how="outer") # union of keys; the default is the intersection
print(city_info)
'''city pop GDP
0 BeiJing 2154.0 30320.0
1 HangZhou 981.0 13468.0
2 ShenZhen 1303.0 NaN
3 ShangHai NaN 32680.0
'''
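Besides the default inner join and how="outer", merge also supports how="left" and how="right", which keep every key from one side only. A sketch on the same city data:

```python
import pandas as pd

population = pd.DataFrame({"city": ["BeiJing", "HangZhou", "ShenZhen"],
                           "pop": [2154, 981, 1303]})
GDP = pd.DataFrame({"city": ["BeiJing", "ShangHai", "HangZhou"],
                    "GDP": [30320, 32680, 13468]})

# keep every city from the left frame; ShenZhen gets NaN for GDP
print(pd.merge(population, GDP, how="left"))
```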
VI. Grouping and Pivot Tables
import pandas as pd
import numpy as np

np.random.seed(10)
df = pd.DataFrame({"key":["A", "B", "C", "A", "B", "C"],
                   "data1":range(6),
                   "data2":np.random.randint(0, 10, size=6)})
print(df)
'''key data1 data2
0 A 0 9
1 B 1 4
2 C 2 0
3 A 3 1
4 B 4 9
5 C 5 0
'''
1. Grouping
(1) Lazy evaluation
print(df.groupby("key"))
# <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001E5DB95F610>
print(df.groupby("key").sum())
'''data1 data2
key
A 3 10
B 5 13
C 7 0
'''
(2) Selecting columns
print(df.groupby("key")["data2"].sum())
'''
key
A 10
B 13
C 0
Name: data2, dtype: int32
'''
(3) Iterating over groups
for data, group in df.groupby("key"):
    print("{0:5} shape={1}".format(data, group.shape))
'''
A shape=(2, 3)
B shape=(2, 3)
C shape=(2, 3)
'''
(4) Calling methods
print(df.groupby("key")["data1"].describe())
'''count mean std min 25% 50% 75% max
key
A 2.0 1.5 2.12132 0.0 0.75 1.5 2.25 3.0
B 2.0 2.5 2.12132 1.0 1.75 2.5 3.25 4.0
C 2.0 3.5 2.12132 2.0 2.75 3.5 4.25 5.0
'''
(5) More complex operations
print(df.groupby("key").aggregate(["min", "median", "max"]))
'''    data1            data2
         min median max   min median max
key
A 0 1.5 3 1 5.0 9
B 1 2.5 4 4 6.5 9
C 2 3.5 5 0 0.0 0
'''
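aggregate() also accepts a dict mapping each column to the operation (or list of operations) to apply to it:

```python
import numpy as np
import pandas as pd

np.random.seed(10)
df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data1": range(6),
                   "data2": np.random.randint(0, 10, size=6)})

# a different aggregation for each column
print(df.groupby("key").aggregate({"data1": "min", "data2": "max"}))
```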
(6) Filtering
def filter_func(x):
    return x["data2"].std() > 3

print(df.groupby("key")["data2"].std())
'''
key
A 5.656854
B 3.535534
C 0.000000
Name: data2, dtype: float64
'''
print(df.groupby("key").filter(filter_func))
'''key data1 data2
0 A 0 9
1 B 1 4
3 A 3 1
4 B 4 9
'''
(7) Transformation
print(df.groupby("key").transform(lambda x: x-x.mean()))
'''data1 data2
0 -1.5 4.0
1 -1.5 -2.5
2 -1.5 0.0
3 1.5 -4.0
4 1.5 2.5
5 1.5 0.0
'''
(8) The apply() method
def norm_by_data2(x):
    x["data1"] /= x["data2"].sum()
    return x

print(df.groupby("key").apply(norm_by_data2))
'''key data1 data2
0 A 0.000000 9
1 B 0.076923 4
2 C inf 0
3 A 0.300000 1
4 B 0.307692 9
5 C inf 0
'''
(9) Using a list or array as the grouping key
L = [0, 1, 0, 1, 2, 0]
print(df.groupby(L).sum())
'''data1 data2
0 7 9
1 4 5
2 4 9
'''
(10) Mapping the index to groups with a dict
df2 = df.set_index("key")
print(df2)
'''data1 data2
key
A 0 9
B 1 4
C 2 0
A 3 1
B 4 9
C 5 0
'''
mapping = {"A": "first", "B": "constant", "C": "constant"}
print(df2.groupby(mapping).sum())
'''data1 data2
key
constant 12 13
first 3 10
'''
(11) Any Python function
print(df2.groupby(str.lower).mean())
'''data1 data2
key
a 1.5 5.0
b 2.5 6.5
c 3.5 0.0
'''
(12) A list of any of the valid keys
mapping = {"A": "first", "B": "constant", "C": "constant"}
print(df2.groupby([str.lower, mapping]).mean())
'''data1 data2
key key
a first 1.5 5.0
b constant 2.5 6.5
c constant 3.5 0.0
'''
(13) Example: processing planet-observation data
import seaborn as sns

planets = sns.load_dataset("planets")
# print(planets)
# print(planets.shape)
# print(planets.head())
# print(planets.describe())
decade = 10*(planets["year"]//10)
decade = decade.astype(str) + "s"
decade.name = "decade"
print(decade.head())
# print(planets.groupby(["method", decade]).sum())
print(planets.groupby(["method", decade])[["number"]].sum().unstack().fillna(0))
2. Pivot tables
import seaborn as sns

titanic = sns.load_dataset("titanic")
# print(titanic.head())
# print(titanic.describe())
# print(titanic.groupby("sex")[["survived"]].mean())
'''survived
sex
female 0.742038
male 0.188908
'''
# print(titanic.groupby("sex")["survived"].mean())
'''
sex
female 0.742038
male 0.188908
Name: survived, dtype: float64
'''
# print(
# titanic.groupby(["sex", "class"])["survived"].aggregate("mean").unstack()
# )
'''
class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447
'''
# pivot table
# print(
# titanic.pivot_table("survived", index="sex", columns="class",
# aggfunc="mean", margins=True)
# )
'''
class First Second Third All
sex
female 0.968085 0.921053 0.500000 0.742038
male 0.368852 0.157407 0.135447 0.188908
All 0.629630 0.472826 0.242363 0.383838
'''
print(titanic.pivot_table(index="sex", columns="class",
                          aggfunc={"survived":sum, "fare":"mean"}))
'''fare survived
class First Second Third First Second Third
sex
female 106.125798 21.970121 16.118810 91 70 72
male 67.226127 19.741782 12.661633 45 17 47
'''
VII. MultiIndex: for multi-dimensional data
import pandas as pd
import numpy as np

base_data = np.array([[1771, 11115],
                      [2154, 30320],
                      [2141, 14070],
                      [2424, 32680],
                      [1077, 7806],
                      [1303, 24222],
                      [798, 4789],
                      [981, 13468]])
data = pd.DataFrame(base_data,
                    index=[["BeiJing", "BeiJing", "ShangHai", "ShangHai",
                            "ShenZhen", "ShenZhen", "HangZhou", "HangZhou"],
                           [2008, 2018] * 4],
                    columns=["population", "GDP"])
data.index.names = ["city", "year"]
print(data)
'''               population    GDP
city     year
BeiJing  2008        1771  11115
         2018        2154  30320
ShangHai 2008        2141  14070
         2018        2424  32680
ShenZhen 2008        1077   7806
         2018        1303  24222
HangZhou 2008         798   4789
         2018         981  13468
'''
print(data["GDP"])
'''
city year
BeiJing  2008    11115
         2018    30320
ShangHai 2008    14070
         2018    32680
ShenZhen 2008     7806
         2018    24222
HangZhou 2008     4789
         2018    13468
Name: GDP, dtype: int32
'''
print(data.loc["ShangHai", "GDP"])
'''
year
2008 14070
2018 32680
Name: GDP, dtype: int32
'''
print(data.loc["ShangHai", 2018]["GDP"]) # 32680
VIII. High-performance pandas
1. eval() and query() usage
They cut down on the intermediate memory allocations made while evaluating compound expressions.
import pandas as pd
import numpy as np

df1, df2, df3, df4 = (pd.DataFrame(np.random.random((10000,100))) for i in range(4))
print(np.allclose((df1+df2)/(df3+df4),pd.eval("(df1+df2)/(df3+df4)"))) # True
query() is used the same way as eval().
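For instance, query() filters rows with a boolean expression written as a string; the result matches ordinary mask-based selection. A minimal sketch:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.random((1000, 3)), columns=["A", "B", "C"])

# the two selections are equivalent; query() avoids building temporary masks
r1 = df[(df.A < 0.5) & (df.B < 0.5)]
r2 = df.query("A < 0.5 and B < 0.5")
print(r1.equals(r2))  # True
```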
2. When to use eval() and query()
For small arrays the ordinary approach is faster; eval() and query() are meant for large arrays.
# size in bytes of all the elements of DataFrame df1
print(df1.values.nbytes) # 8000000