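Preface: I'm Octopus; the name comes from my Chinese name, 章鱼 (octopus). I love programming, algorithms, and open source. All of the source code is on my personal GitHub. This blog records what I learn along the way; if you are interested in Python, Java, AI, or algorithms, feel free to follow my updates so we can learn and improve together.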
I. Grouped computation
# Sample data
import numpy as np
import pandas as pd

df = pd.read_csv("pokemon_data.csv", encoding="gbk")
df.head(10)
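The column names stay in Chinese because the code refers to them: 姓名 (name), 类型1 (type 1), 类型2 (type 2), 总计 (total), 生命值 (HP), 攻击力 (attack), 防御力 (defense), 速度 (speed), 时代 (generation).

| | 姓名 | 类型1 | 类型2 | 总计 | 生命值 | 攻击力 | 防御力 | 速度 | 时代 |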
---|---|---|---|---|---|---|---|---|---|
0 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 45 | 1 |
1 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 60 | 1 |
2 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 80 | 1 |
3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 80 | 1 |
4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 65 | 1 |
5 | Charmeleon | Fire | NaN | 405 | 58 | 64 | 58 | 80 | 1 |
6 | Charizard | Fire | Flying | 534 | 78 | 84 | 78 | 100 | 1 |
7 | CharizardMega Charizard X | Fire | Dragon | 634 | 78 | 130 | 111 | 100 | 1 |
8 | CharizardMega Charizard Y | Fire | Flying | 634 | 78 | 104 | 78 | 100 | 1 |
9 | Squirtle | Water | NaN | 314 | 44 | 48 | 65 | 43 | 1 |
1. How to run a grouped computation
Suppose we want to group by the 类型1 (type 1) column.
# Check how the categories in 类型1 are distributed
df["类型1"].value_counts()
Water 112
Normal 98
Grass 70
Bug 69
Psychic 57
Fire 52
Electric 44
Rock 44
Ground 32
Dragon 32
Ghost 32
Dark 31
Poison 28
Steel 27
Fighting 27
Ice 24
Fairy 17
Flying 4
Name: 类型1, dtype: int64
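# Count how many distinct categories 类型1 has
len(df["类型1"].value_counts())
Q1: What is the mean attack (攻击力) of each of the 18 categories in 类型1? (grouping by a single column)
# Group by 类型1 and store the result in grouped1
grouped1 = df.groupby("类型1")
# Displaying grouped1 only shows that it is a DataFrameGroupBy object and its memory address (0x0000000008EE9E80 in this run), which is not useful on its own
grouped1
# Mean attack for each of the 18 categories in 类型1
grouped1[["攻击力"]].mean()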
类型1 | 攻击力 |
---|---|
Bug | 70.971014 |
Dark | 88.387097 |
Dragon | 112.125000 |
Electric | 69.090909 |
Fairy | 61.529412 |
Fighting | 96.777778 |
Fire | 84.769231 |
Flying | 78.750000 |
Ghost | 73.781250 |
Grass | 73.214286 |
Ground | 95.750000 |
Ice | 72.750000 |
Normal | 73.469388 |
Poison | 74.678571 |
Psychic | 71.456140 |
Rock | 92.863636 |
Steel | 92.703704 |
Water | 74.151786 |
To summarize:
grouped1 = df.groupby("类型1") is the first step of a grouped computation: split.
grouped1[["攻击力"]].mean() covers the second and third steps: apply and combine.
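The three steps can also be written out by hand to see what groupby does internally. This is only a minimal sketch: `df_small`, `pieces`, `manual`, and `auto` are names I made up for illustration, and the data is just the 类型1 and 攻击力 values of the first ten rows shown above, so it runs without the CSV file.

```python
import pandas as pd

# First ten rows of the sample table, reduced to two columns
df_small = pd.DataFrame({
    "类型1": ["Grass"] * 4 + ["Fire"] * 5 + ["Water"],
    "攻击力": [49, 62, 82, 100, 52, 64, 84, 130, 104, 48],
})

# split: one sub-frame per distinct value of 类型1
pieces = {key: part for key, part in df_small.groupby("类型1")}

# apply: compute the mean attack of each piece
means = {key: part["攻击力"].mean() for key, part in pieces.items()}

# combine: assemble the per-group results into a single Series
manual = pd.Series(means, name="攻击力").rename_axis("类型1")

# The usual one-liner performs all three steps at once
auto = df_small.groupby("类型1")["攻击力"].mean()

print(manual)
print(auto)
```

Both Series should list the same three groups with the same means; the manual version is only there to make the split, apply, and combine stages visible.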
Q2: For each combination of 类型1 and 类型2, what is the mean attack? (grouping by multiple columns)
grouped2 = df.groupby(["类型1","类型2"])
grouped2[["攻击力"]].mean()
Q3: For each combination of 类型1 and 类型2, what are the mean, median, and sum of attack? (applying several functions to each group)
grouped2[["攻击力"]].agg(["mean", "median", "sum"])
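Q4: For each combination of 类型1 and 类型2, what are the mean and median of attack and the sum of HP (生命值)? (applying different functions to different columns)
grouped2.agg({"攻击力": ["mean", "median"], "生命值": "sum"})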
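When different columns get different functions, the result has two-level column names. pandas also supports named aggregation, which gives flat column names of your choice. A small sketch under the same assumptions as before: `df_small`, `summary`, and the output names `attack_mean`, `attack_median`, and `hp_sum` are hypothetical, and the data is the first ten rows of the sample table grouped by 类型1 only.

```python
import pandas as pd

df_small = pd.DataFrame({
    "类型1": ["Grass"] * 4 + ["Fire"] * 5 + ["Water"],
    "生命值": [45, 60, 80, 80, 39, 58, 78, 78, 78, 44],
    "攻击力": [49, 62, 82, 100, 52, 64, 84, 130, 104, 48],
})

# Named aggregation: output_name=(input_column, function)
summary = df_small.groupby("类型1").agg(
    attack_mean=("攻击力", "mean"),
    attack_median=("攻击力", "median"),
    hp_sum=("生命值", "sum"),
)
print(summary)
```

The dictionary form in Q4 is shorter to type; the named form is easier to work with afterwards because no MultiIndex columns are involved.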
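Q5: Standardizing the data within each group (transform)
zscore = lambda x: (x - x.mean()) / x.std()
# Standardize only the numeric columns; text columns such as 姓名 have no mean
numeric_cols = list(df.select_dtypes("number").columns)
grouped1[numeric_cols].transform(zscore)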
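Unlike an aggregation, transform returns one value per input row. The sketch below checks this on a tiny frame; `df_small` and `standardized` are illustrative names and the data is again the first ten rows of the sample table. It also shows an edge case worth knowing: a group with a single row has no sample standard deviation, so its z-score comes out as NaN.

```python
import pandas as pd

df_small = pd.DataFrame({
    "类型1": ["Grass"] * 4 + ["Fire"] * 5 + ["Water"],
    "攻击力": [49, 62, 82, 100, 52, 64, 84, 130, 104, 48],
})

def zscore(x):
    """Standardize one group: subtract its mean, divide by its sample std."""
    return (x - x.mean()) / x.std()

standardized = df_small.groupby("类型1")["攻击力"].transform(zscore)

# The result has the same length and index as the input
print(len(standardized), len(df_small))

# Within each group the standardized values average to roughly zero
print(standardized.groupby(df_small["类型1"]).mean())

# Water has a single row, so its sample std (and therefore its z-score) is NaN
print(standardized[df_small["类型1"] == "Water"])
```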
Q6: Filtering groups by a condition
Goal: from the groups in grouped2, keep only those whose mean attack is above 100 and drop the rest.
attack_filter = lambda x: x["攻击力"].mean() > 100
grouped2.filter(attack_filter)
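Note that filter keeps or drops whole groups and returns the original rows, not one row per group. A minimal sketch on the first ten rows of the sample table; `df_small` and `strong` are hypothetical names, and the threshold of 80 is my own choice so that exactly one group survives in this small sample.

```python
import pandas as pd

df_small = pd.DataFrame({
    "类型1": ["Grass"] * 4 + ["Fire"] * 5 + ["Water"],
    "攻击力": [49, 62, 82, 100, 52, 64, 84, 130, 104, 48],
})

# Group means here: Fire 86.8, Grass 73.25, Water 48.0
# Only groups whose mean attack exceeds 80 are kept
strong = df_small.groupby("类型1").filter(lambda g: g["攻击力"].mean() > 80)
print(strong)
```

All five Fire rows come back with their original index labels, including the rows whose own attack is below 80, because the condition is evaluated per group.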
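Q7: Use 类型1 and 类型2 as the index and group by the index (grouping by index levels)
# Set 类型1 and 类型2 as the index
df_pokemon = df.set_index(["类型1", "类型2"])
# Group by the index levels
grouped3 = df_pokemon.groupby(level=["类型1", "类型2"])
grouped3
# Mean of every numeric column per group (the text column 姓名 is skipped)
grouped3.mean(numeric_only=True)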
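Grouping by index levels should give the same numbers as grouping by the equivalent columns. The sketch below compares the two on the first ten rows of the sample table. `df_small`, `by_level`, and `by_column` are illustrative names, and I have assumed the missing 类型2 values are filled with "Flying" (as the post does later for df_p) so that no keys are NaN.

```python
import pandas as pd

df_small = pd.DataFrame({
    "类型1": ["Grass"] * 4 + ["Fire"] * 5 + ["Water"],
    "类型2": ["Poison"] * 4 + ["Flying", "Flying", "Flying", "Dragon", "Flying", "Flying"],
    "攻击力": [49, 62, 82, 100, 52, 64, 84, 130, 104, 48],
})

# Group by index levels after moving the two columns into the index
indexed = df_small.set_index(["类型1", "类型2"])
by_level = indexed.groupby(level=["类型1", "类型2"])["攻击力"].mean()

# Group by the same two columns directly
by_column = df_small.groupby(["类型1", "类型2"])["攻击力"].mean()

print(by_level)
print(by_column)
```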
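2. Some properties of groups
Number of rows in each group
grouped2.size()
Index positions in the source data for each group
grouped2.groups
All rows that belong to one group
# All rows in the (Fire, Flying) group
grouped2.get_group(('Fire', 'Flying'))
# Iterate over the groups, printing each group's key and shape
for name, group in grouped2:
    print(name)
    print(group.shape)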
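To make these attributes concrete, here is a small sketch on the first ten rows of the sample table. `df_small`, `grouped`, `sizes`, `fire_flying`, and `row_counts` are names chosen for illustration, and missing 类型2 values are assumed to be filled with "Flying" so every row belongs to a group.

```python
import pandas as pd

df_small = pd.DataFrame({
    "类型1": ["Grass"] * 4 + ["Fire"] * 5 + ["Water"],
    "类型2": ["Poison"] * 4 + ["Flying", "Flying", "Flying", "Dragon", "Flying", "Flying"],
    "攻击力": [49, 62, 82, 100, 52, 64, 84, 130, 104, 48],
})
grouped = df_small.groupby(["类型1", "类型2"])

# Number of groups and number of rows per group
print(grouped.ngroups)
sizes = grouped.size()
print(sizes)

# Mapping from each group key to the row labels that belong to it
print(grouped.groups)

# All rows of one group, looked up by its key tuple
fire_flying = grouped.get_group(("Fire", "Flying"))
print(fire_flying)

# Iterating yields (key, sub-frame) pairs; with two keys the key is a tuple
row_counts = {name: len(group) for name, group in grouped}
print(row_counts)
```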
II. Pivot tables
1. Pivot tables with pivot_table
# Sample data (copy the slice so the edits below do not touch df)
df_p = df.iloc[:10, 0:6].copy()
df_p
| | 姓名 | 类型1 | 类型2 | 总计 | 生命值 | 攻击力 |
---|---|---|---|---|---|---|
0 | Bulbasaur | Grass | Poison | 318 | 45 | 49 |
1 | Ivysaur | Grass | Poison | 405 | 60 | 62 |
2 | Venusaur | Grass | Poison | 525 | 80 | 82 |
3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 |
4 | Charmander | Fire | NaN | 309 | 39 | 52 |
5 | Charmeleon | Fire | NaN | 405 | 58 | 64 |
6 | Charizard | Fire | Flying | 534 | 78 | 84 |
7 | CharizardMega Charizard X | Fire | Dragon | 634 | 78 | 130 |
8 | CharizardMega Charizard Y | Fire | Flying | 634 | 78 | 104 |
9 | Squirtle | Water | NaN | 314 | 44 | 48 |
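# Make a few changes: relabel the rows into three groups and fill the missing 类型2 values
df_p.loc[0:2, "姓名"] = "A"
df_p.loc[3:5, "姓名"] = "B"
df_p.loc[6:9, "姓名"] = "C"
df_p["类型2"] = df_p["类型2"].fillna("Flying")
df_p.rename(columns={"姓名": "组"}, inplace=True)
# Put 组 (group) on the rows and 类型1 on the columns, aggregating 攻击力; when aggfunc is not given, the mean is used
df_p.pivot_table(index="组", columns="类型1", values="攻击力")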
组 (rows) / 类型1 (columns) | Fire | Grass | Water |
---|---|---|---|
A | NaN | 64.333333 | NaN |
B | 58.0 | 100.000000 | NaN |
C | 106.0 | NaN | 48.0 |
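A pivot table is essentially a grouped computation whose inner group key has been moved to the columns. The following sketch checks this on the ten-row sample; `df_demo`, `pivoted`, and `via_groupby` are illustrative names, and the data reproduces the 组, 类型1, and 攻击力 columns of df_p after the modifications above.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "组": ["A"] * 3 + ["B"] * 3 + ["C"] * 4,
    "类型1": ["Grass"] * 4 + ["Fire"] * 5 + ["Water"],
    "攻击力": [49, 62, 82, 100, 52, 64, 84, 130, 104, 48],
})

# Pivot table: 组 on the rows, 类型1 on the columns, mean of 攻击力 in the cells
pivoted = df_demo.pivot_table(index="组", columns="类型1", values="攻击力", aggfunc="mean")

# The same table built from groupby: group by both keys, then move 类型1 to the columns
via_groupby = df_demo.groupby(["组", "类型1"])["攻击力"].mean().unstack("类型1")

print(pivoted)
print(via_groupby)
```

Combinations that never occur in the data, such as group A with Fire, show up as NaN in both versions.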
# Put 组 and 类型1 on the rows and 类型2 on the columns, and compute the mean and count of 攻击力
df_p.pivot_table(index=["组", "类型1"], columns="类型2", values="攻击力", aggfunc=["mean", len])
组 | 类型1 | mean / Dragon | mean / Flying | mean / Poison | len / Dragon | len / Flying | len / Poison |
---|---|---|---|---|---|---|---|
A | Grass | NaN | NaN | 64.333333 | NaN | NaN | 3.0 |
B | Fire | NaN | 58.0 | NaN | NaN | 2.0 | NaN |
B | Grass | NaN | NaN | 100.000000 | NaN | NaN | 1.0 |
C | Fire | 130.0 | 94.0 | NaN | 1.0 | 2.0 | NaN |
C | Water | NaN | 48.0 | NaN | NaN | 1.0 | NaN |
# Put 组 and 类型1 on the rows and 类型2 on the columns, and compute the mean and count of 生命值 and 攻击力
df_p.pivot_table(index=["组", "类型1"], columns="类型2", values=["生命值", "攻击力"], aggfunc=["mean", len])
组 | 类型1 | mean / 攻击力 / Dragon | mean / 攻击力 / Flying | mean / 攻击力 / Poison | mean / 生命值 / Dragon | mean / 生命值 / Flying | mean / 生命值 / Poison | len / 攻击力 / Dragon | len / 攻击力 / Flying | len / 攻击力 / Poison | len / 生命值 / Dragon | len / 生命值 / Flying | len / 生命值 / Poison |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | Grass | NaN | NaN | 64.333333 | NaN | NaN | 61.666667 | NaN | NaN | 3.0 | NaN | NaN | 3.0 |
B | Fire | NaN | 58.0 | NaN | NaN | 48.5 | NaN | NaN | 2.0 | NaN | NaN | 2.0 | NaN |
B | Grass | NaN | NaN | 100.000000 | NaN | NaN | 80.000000 | NaN | NaN | 1.0 | NaN | NaN | 1.0 |
C | Fire | 130.0 | 94.0 | NaN | 78.0 | 78.0 | NaN | 1.0 | 2.0 | NaN | 1.0 | 2.0 | NaN |
C | Water | NaN | 48.0 | NaN | NaN | 44.0 | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN |
# Same layout as before (mean and count of 生命值 and 攻击力), with missing cells filled with 0
df_p1 = df_p.pivot_table(index=["组", "类型1"], columns="类型2", values=["生命值", "攻击力"], aggfunc=["mean", len], fill_value=0)
df_p1
组 | 类型1 | mean / 攻击力 / Dragon | mean / 攻击力 / Flying | mean / 攻击力 / Poison | mean / 生命值 / Dragon | mean / 生命值 / Flying | mean / 生命值 / Poison | len / 攻击力 / Dragon | len / 攻击力 / Flying | len / 攻击力 / Poison | len / 生命值 / Dragon | len / 生命值 / Flying | len / 生命值 / Poison |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | Grass | 0 | 0 | 64.333333 | 0 | 0.0 | 61.666667 | 0 | 0 | 3 | 0 | 0 | 3 |
B | Fire | 0 | 58 | 0.000000 | 0 | 48.5 | 0.000000 | 0 | 2 | 0 | 0 | 2 | 0 |
B | Grass | 0 | 0 | 100.000000 | 0 | 0.0 | 80.000000 | 0 | 0 | 1 | 0 | 0 | 1 |
C | Fire | 130 | 94 | 0.000000 | 78 | 78.0 | 0.000000 | 1 | 2 | 0 | 1 | 2 | 0 |
C | Water | 0 | 48 | 0.000000 | 0 | 44.0 | 0.000000 | 0 | 1 | 0 | 0 | 1 | 0 |
# Same layout again, with missing cells filled with 0 and a totals row and column added (margins=True)
df_p.pivot_table(index=["组", "类型1"], columns="类型2", values=["生命值", "攻击力"], aggfunc=["mean", len], fill_value=0, margins=True)
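The output of the margins call is not shown above, so here is a smaller sketch of what fill_value and margins do. `df_demo` and `table` are hypothetical names, the data mirrors the 组, 类型1, and 攻击力 columns of df_p, and I use a single aggregation function to keep the result readable.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "组": ["A"] * 3 + ["B"] * 3 + ["C"] * 4,
    "类型1": ["Grass"] * 4 + ["Fire"] * 5 + ["Water"],
    "攻击力": [49, 62, 82, 100, 52, 64, 84, 130, 104, 48],
})

table = df_demo.pivot_table(
    index="组",
    columns="类型1",
    values="攻击力",
    aggfunc="mean",
    fill_value=0,   # empty cells such as (A, Fire) become 0 instead of NaN
    margins=True,   # adds an "All" row and an "All" column
)
print(table)
```

The "All" entries are computed from the underlying rows with the same aggregation function, so with aggfunc="mean" they are means over the raw data rather than sums or averages of the cells.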
2. Reshaping a hierarchical index
stack(): rotates the innermost column level into the rows
unstack(): rotates the innermost row level into the columns
# Sample data
df_p1
组 | 类型1 | mean / 攻击力 / Dragon | mean / 攻击力 / Flying | mean / 攻击力 / Poison | mean / 生命值 / Dragon | mean / 生命值 / Flying | mean / 生命值 / Poison | len / 攻击力 / Dragon | len / 攻击力 / Flying | len / 攻击力 / Poison | len / 生命值 / Dragon | len / 生命值 / Flying | len / 生命值 / Poison |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | Grass | 0 | 0 | 64.333333 | 0 | 0.0 | 61.666667 | 0 | 0 | 3 | 0 | 0 | 3 |
B | Fire | 0 | 58 | 0.000000 | 0 | 48.5 | 0.000000 | 0 | 2 | 0 | 0 | 2 | 0 |
B | Grass | 0 | 0 | 100.000000 | 0 | 0.0 | 80.000000 | 0 | 0 | 1 | 0 | 0 | 1 |
C | Fire | 130 | 94 | 0.000000 | 78 | 78.0 | 0.000000 | 1 | 2 | 0 | 1 | 2 | 0 |
C | Water | 0 | 48 | 0.000000 | 0 | 44.0 | 0.000000 | 0 | 1 | 0 | 0 | 1 | 0 |
# Rotate the innermost column level into the rows, i.e. move 类型2 to the rows
df_p1.stack()
组 | 类型1 | 类型2 | mean / 攻击力 | mean / 生命值 | len / 攻击力 | len / 生命值 |
---|---|---|---|---|---|---|
A | Grass | Dragon | 0.000000 | 0.000000 | 0 | 0 |
A | Grass | Flying | 0.000000 | 0.000000 | 0 | 0 |
A | Grass | Poison | 64.333333 | 61.666667 | 3 | 3 |
B | Fire | Dragon | 0.000000 | 0.000000 | 0 | 0 |
B | Fire | Flying | 58.000000 | 48.500000 | 2 | 2 |
B | Fire | Poison | 0.000000 | 0.000000 | 0 | 0 |
B | Grass | Dragon | 0.000000 | 0.000000 | 0 | 0 |
B | Grass | Flying | 0.000000 | 0.000000 | 0 | 0 |
B | Grass | Poison | 100.000000 | 80.000000 | 1 | 1 |
C | Fire | Dragon | 130.000000 | 78.000000 | 1 | 1 |
C | Fire | Flying | 94.000000 | 78.000000 | 2 | 2 |
C | Fire | Poison | 0.000000 | 0.000000 | 0 | 0 |
C | Water | Dragon | 0.000000 | 0.000000 | 0 | 0 |
C | Water | Flying | 48.000000 | 44.000000 | 1 | 1 |
C | Water | Poison | 0.000000 | 0.000000 | 0 | 0 |
# Rotate the innermost row level into the columns, i.e. move 类型1 to the columns (the display is truncated at "...")
df_p1.unstack()
组 | mean / 攻击力 / Dragon / Fire | mean / 攻击力 / Dragon / Grass | mean / 攻击力 / Dragon / Water | mean / 攻击力 / Flying / Fire | mean / 攻击力 / Flying / Grass | mean / 攻击力 / Flying / Water | mean / 攻击力 / Poison / Fire | mean / 攻击力 / Poison / Grass | mean / 攻击力 / Poison / Water | mean / 生命值 / Dragon / Fire | ... | len / 攻击力 / Poison / Water | len / 生命值 / Dragon / Fire | len / 生命值 / Dragon / Grass | len / 生命值 / Dragon / Water | len / 生命值 / Flying / Fire | len / 生命值 / Flying / Grass | len / 生命值 / Flying / Water | len / 生命值 / Poison / Fire | len / 生命值 / Poison / Grass | len / 生命值 / Poison / Water |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | NaN | 0.0 | NaN | NaN | 0.0 | NaN | NaN | 64.333333 | NaN | NaN | ... | NaN | NaN | 0.0 | NaN | NaN | 0.0 | NaN | NaN | 3.0 | NaN |
B | 0.0 | 0.0 | NaN | 58.0 | 0.0 | NaN | 0.0 | 100.000000 | NaN | 0.0 | ... | NaN | 0.0 | 0.0 | NaN | 2.0 | 0.0 | NaN | 0.0 | 1.0 | NaN |
C | 130.0 | NaN | 0.0 | 94.0 | NaN | 48.0 | 0.0 | NaN | 0.0 | 78.0 | ... | 0.0 | 1.0 | NaN | 0.0 | 2.0 | NaN | 1.0 | 0.0 | NaN | 0.0 |
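Because the tables above are wide, the two operations are easier to follow on a table with a single column level. The sketch below stacks a small pivot table into a long Series and then unstacks it back. `df_demo`, `wide`, `long`, and `back` are names made up for this example, and the data mirrors the 组, 类型1, and 攻击力 columns of df_p.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "组": ["A"] * 3 + ["B"] * 3 + ["C"] * 4,
    "类型1": ["Grass"] * 4 + ["Fire"] * 5 + ["Water"],
    "攻击力": [49, 62, 82, 100, 52, 64, 84, 130, 104, 48],
})

# Wide layout: 组 on the rows, 类型1 on the columns, no missing cells
wide = df_demo.pivot_table(index="组", columns="类型1", values="攻击力",
                           aggfunc="mean", fill_value=0)

# stack: the only column level (类型1) moves into the row index
long = wide.stack()
print(long)

# unstack: the innermost row level (类型1) moves back to the columns
back = long.unstack()
print(back)
```

With one column level, stack returns a Series indexed by (组, 类型1); unstacking it restores the 3 by 3 layout with the same values.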
III. Cross-tabulation
A crosstab is a special kind of pivot table used to compute group frequencies.
# Sample data
df_p
| | 组 | 类型1 | 类型2 | 总计 | 生命值 | 攻击力 |
---|---|---|---|---|---|---|
0 | A | Grass | Poison | 318 | 45 | 49 |
1 | A | Grass | Poison | 405 | 60 | 62 |
2 | A | Grass | Poison | 525 | 80 | 82 |
3 | B | Grass | Poison | 625 | 80 | 100 |
4 | B | Fire | Flying | 309 | 39 | 52 |
5 | B | Fire | Flying | 405 | 58 | 64 |
6 | C | Fire | Flying | 534 | 78 | 84 |
7 | C | Fire | Dragon | 634 | 78 | 130 |
8 | C | Fire | Flying | 634 | 78 | 104 |
9 | C | Water | Flying | 314 | 44 | 48 |
# Count how often each combination of 组 and 类型1 occurs
pd.crosstab(index=df_p["组"], columns=df_p["类型1"])
组 (rows) / 类型1 (columns) | Fire | Grass | Water |
---|---|---|---|
A | 0 | 3 | 0 |
B | 2 | 1 | 0 |
C | 3 | 0 | 1 |
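To tie crosstab back to the first section: the same counts can be produced with groupby, size, and unstack. The sketch below also shows the normalize option of pd.crosstab, which turns counts into row shares. `df_demo`, `counts`, `via_groupby`, and `shares` are illustrative names, and the data mirrors the 组 and 类型1 columns of df_p.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "组": ["A"] * 3 + ["B"] * 3 + ["C"] * 4,
    "类型1": ["Grass"] * 4 + ["Fire"] * 5 + ["Water"],
})

# Frequency table of 组 against 类型1
counts = pd.crosstab(index=df_demo["组"], columns=df_demo["类型1"])
print(counts)

# The same table from a grouped computation: count rows per pair, then pivot 类型1 out
via_groupby = df_demo.groupby(["组", "类型1"]).size().unstack(fill_value=0)
print(via_groupby)

# normalize="index" divides each row by its total, giving shares within each 组
shares = pd.crosstab(df_demo["组"], df_demo["类型1"], normalize="index")
print(shares)
```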