9. Advanced Pandas Processing

4.6 Advanced Processing - Handling Missing Values

Dataset: https://download.csdn.net/download/weixin_44827418/12548095

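Outline: Advanced Pandas Processing
  • Handling missing values
  • Data discretization
  • Merging
  • Crosstabs and pivot tables
  • Grouping and aggregation
  • Comprehensive case study

4.6 Advanced Processing - Handling Missing Values
  • How to handle missing values, two approaches:
      1) Delete the samples that contain missing values
      2) Replace / impute
  • 4.6.1 How to handle NaN
      1) Check whether the data contains NaN: pd.isnull(df), pd.notnull(df)
      2) Delete the samples that contain missing values: df.dropna(inplace=False)
      3) Replace / impute: df.fillna(value, inplace=False)
  • 4.6.2 Missing values that are not NaN but carry a default marker
      1) Replace "?" with np.nan: df.replace(to_replace="?", value=np.nan)
      2) Then follow the steps for handling np.nan
  • Worked example of missing-value handling

4.7 Advanced Processing - Data Discretization
  • One-hot encoding and dummy variables. Label-encoded tables:

       gender  age
    A       1   23
    B       2   30
    C       1   18

       species  fur
    A        1    2
    B        2    1
    C        3    1

    The same tables after one-hot encoding:

       male  female  age
    A     1       0   23
    B     0       1   30
    C     1       0   18

       dog  pig  mouse  fur
    A    1    0      0    2
    B    0    1      0    1
    C    0    0      1    1

  • 4.7.1 What data discretization is. Raw height data: 165, 174, 160, 180, 159, 163, 192, 184
  • 4.7.2 Why discretize
  • 4.7.3 How to discretize data
      1) Bin the data
           automatic binning: sr = pd.qcut(data, q)
           custom binning: sr = pd.cut(data, [])
      2) Convert the binned result to one-hot encoding: pd.get_dummies(sr, prefix=)

4.8 Advanced Processing - Merging
  • numpy: np.concatenate((a, b), axis=), horizontal stacking np.hstack(), vertical stacking np.vstack()
  • 1) Concatenate along an axis: pd.concat([data1, data2], axis=1)
  • 2) Join on keys with pd.merge: pd.merge(left, right, how="inner", on=[keys])

4.9 Advanced Processing - Crosstabs and Pivot Tables
  • Purpose: find and explore the relationship between two variables
  • 4.9.1 What crosstabs and pivot tables are for
  • 4.9.2 Using crosstab: pd.crosstab(value1, value2)
  • 4.9.3 pivot_table

4.10 Advanced Processing - Grouping and Aggregation
  • 4.10.1 What grouping and aggregation are
  • 4.10.2 Grouping and aggregation API: DataFrame, Series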

4.6.1 How to handle NaN

import pandas as pd

movie = pd.read_csv("./datas/IMDB-Movie-Data.csv")
movie
     Rank  Title                    Genre                     Description                                        Director              Actors                                             Year  Runtime (Minutes)  Rating   Votes  Revenue (Millions)  Metascore
  0     1  Guardians of the Galaxy  Action,Adventure,Sci-Fi   A group of intergalactic criminals are forced ...  James Gunn            Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...  2014                121     8.1  757074              333.13       76.0
  1     2  Prometheus               Adventure,Mystery,Sci-Fi  Following clues to the origin of mankind, a te...  Ridley Scott          Noomi Rapace, Logan Marshall-Green, Michael Fa...  2012                124     7.0  485820              126.46       65.0
  2     3  Split                    Horror,Thriller           Three girls are kidnapped by a man with a diag...  M. Night Shyamalan    James McAvoy, Anya Taylor-Joy, Haley Lu Richar...  2016                117     7.3  157606              138.12       62.0
  3     4  Sing                     Animation,Comedy,Family   In a city of humanoid animals, a hustling thea...  Christophe Lourdelet  Matthew McConaughey,Reese Witherspoon, Seth Ma...  2016                108     7.2   60545              270.32       59.0
  4     5  Suicide Squad            Action,Adventure,Fantasy  A secret government agency recruits some of th...  David Ayer            Will Smith, Jared Leto, Margot Robbie, Viola D...  2016                123     6.2  393727              325.02       40.0
 ..   ...  ...                      ...                       ...                                                ...                   ...                                                 ...                ...     ...     ...                 ...        ...
995   996  Secret in Their Eyes     Crime,Drama,Mystery       A tight-knit team of rising investigators, alo...  Billy Ray             Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...  2015                111     6.2   27585                 NaN       45.0
996   997  Hostel: Part II          Horror                    Three American college students studying abroa...  Eli Roth              Lauren German, Heather Matarazzo, Bijou Philli...  2007                 94     5.5   73152               17.54       46.0
997   998  Step Up 2: The Streets   Drama,Music,Romance       Romantic sparks occur between two dance studen...  Jon M. Chu            Robert Hoffman, Briana Evigan, Cassie Ventura,...  2008                 98     6.2   70699               58.01       50.0
998   999  Search Party             Adventure,Comedy          A pair of friends embark on a mission to reuni...  Scot Armstrong        Adam Pally, T.J. Miller, Thomas Middleditch,Sh...  2014                 93     5.6    4881                 NaN       22.0
999  1000  Nine Lives               Comedy,Family,Fantasy     A stuffy businessman finds himself trapped ins...  Barry Sonnenfeld      Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...  2016                 87     5.3   12435               19.64       11.0

1000 rows × 12 columns

# 1. Check whether there are NaN missing values; True marks a missing value
movie.isnull()
      Rank  Title  Genre  Description  Director  Actors   Year  Runtime (Minutes)  Rating  Votes  Revenue (Millions)  Metascore
  0  False  False  False        False     False   False  False              False   False  False               False      False
  1  False  False  False        False     False   False  False              False   False  False               False      False
  2  False  False  False        False     False   False  False              False   False  False               False      False
  3  False  False  False        False     False   False  False              False   False  False               False      False
  4  False  False  False        False     False   False  False              False   False  False               False      False
 ..    ...    ...    ...          ...       ...     ...    ...                ...     ...    ...                 ...        ...
995  False  False  False        False     False   False  False              False   False  False                True      False
996  False  False  False        False     False   False  False              False   False  False               False      False
997  False  False  False        False     False   False  False              False   False  False               False      False
998  False  False  False        False     False   False  False              False   False  False                True      False
999  False  False  False        False     False   False  False              False   False  False               False      False

1000 rows × 12 columns

import numpy as np

# any() returns True as soon as a single value is True
# A result of True means the data contains missing values
np.any(movie.isnull())
True
# With notnull, False marks a missing value
pd.notnull(movie)
     Rank  Title  Genre  Description  Director  Actors  Year  Runtime (Minutes)  Rating  Votes  Revenue (Millions)  Metascore
  0  True   True   True         True      True    True  True               True    True   True                True       True
  1  True   True   True         True      True    True  True               True    True   True                True       True
  2  True   True   True         True      True    True  True               True    True   True                True       True
  3  True   True   True         True      True    True  True               True    True   True                True       True
  4  True   True   True         True      True    True  True               True    True   True                True       True
 ..   ...    ...    ...          ...       ...     ...   ...                ...     ...    ...                 ...        ...
995  True   True   True         True      True    True  True               True    True   True               False       True
996  True   True   True         True      True    True  True               True    True   True                True       True
997  True   True   True         True      True    True  True               True    True   True                True       True
998  True   True   True         True      True    True  True               True    True   True               False       True
999  True   True   True         True      True    True  True               True    True   True                True       True

1000 rows × 12 columns

# all() returns False as soon as a single value is False
# A result of False means the data contains missing values
np.all(pd.notnull(movie))
False
pd.isnull(movie).any()
Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)     True
Metascore              True
dtype: bool
pd.notnull(movie).all()
Rank                   True
Title                  True
Genre                  True
Description            True
Director               True
Actors                 True
Year                   True
Runtime (Minutes)      True
Rating                 True
Votes                  True
Revenue (Millions)    False
Metascore             False
dtype: bool
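The per-column checks above say whether a column has missing values, but not how many. A minimal sketch on a small made-up frame (the column names and values here are illustrative assumptions, not the IMDB data) shows how `isnull().sum()` counts them and how to select the affected rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the movie table
df = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Revenue": [10.0, np.nan, 30.0, np.nan],
    "Metascore": [50.0, 60.0, np.nan, 80.0],
})

# Number of missing values in each column (True counts as 1)
missing_per_column = df.isnull().sum()

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

Counting first can help decide between deleting rows and filling values: deleting is cheap when few rows are affected.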
# Handling the missing values
# Method 1: delete the samples that contain missing values
movie_full = movie.dropna()
movie_full.isnull().any()
Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)    False
Metascore             False
dtype: bool
# Method 2: replace (fill in) the missing values
movie.head()
   Rank  Title                    Genre                     Description                                        Director              Actors                                             Year  Runtime (Minutes)  Rating   Votes  Revenue (Millions)  Metascore
0     1  Guardians of the Galaxy  Action,Adventure,Sci-Fi   A group of intergalactic criminals are forced ...  James Gunn            Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...  2014                121     8.1  757074              333.13       76.0
1     2  Prometheus               Adventure,Mystery,Sci-Fi  Following clues to the origin of mankind, a te...  Ridley Scott          Noomi Rapace, Logan Marshall-Green, Michael Fa...  2012                124     7.0  485820              126.46       65.0
2     3  Split                    Horror,Thriller           Three girls are kidnapped by a man with a diag...  M. Night Shyamalan    James McAvoy, Anya Taylor-Joy, Haley Lu Richar...  2016                117     7.3  157606              138.12       62.0
3     4  Sing                     Animation,Comedy,Family   In a city of humanoid animals, a hustling thea...  Christophe Lourdelet  Matthew McConaughey,Reese Witherspoon, Seth Ma...  2016                108     7.2   60545              270.32       59.0
4     5  Suicide Squad            Action,Adventure,Fantasy  A secret government agency recruits some of th...  David Ayer            Will Smith, Jared Leto, Margot Robbie, Viola D...  2016                123     6.2  393727              325.02       40.0
movie["Revenue (Millions)"].mean()
82.95637614678897
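# Columns that contain missing values (False in notnull().all() above):
# Revenue (Millions)    False
# Metascore             False
movie["Revenue (Millions)"] = movie["Revenue (Millions)"].fillna(movie["Revenue (Millions)"].mean())
movie["Revenue (Millions)"].isnull().any()
False
# Assign the filled column back to the DataFrame; calling fillna(..., inplace=True)
# on a selected column may not update the original DataFrame in recent pandas versions
movie["Metascore"] = movie["Metascore"].fillna(movie["Metascore"].mean())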
movie["Metascore"].isnull().any()
False
movie.isnull().any() # all missing values have now been handled
Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)    False
Metascore             False
dtype: bool
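The two approaches (delete or fill) can be compared side by side. The following is a minimal sketch on made-up numbers, not the IMDB data; filling every numeric column with its own mean in one call is a shortcut assumed here, equivalent to filling the columns one at a time:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one NaN in each column
df = pd.DataFrame({
    "Revenue": [10.0, np.nan, 30.0, 20.0],
    "Metascore": [50.0, 60.0, np.nan, 70.0],
})

# Approach 1: drop every row that has at least one missing value
dropped = df.dropna()

# Approach 2: fill each column's NaN with that column's mean;
# df.mean() returns one mean per column and fillna aligns on column names
filled = df.fillna(df.mean(numeric_only=True))

print(len(df), len(dropped))
print(filled)
```

Both calls return new objects and leave `df` unchanged, which matches the `inplace=False` default listed in the outline.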

4.6.2 Handling missing values that are not NaN but carry a default marker

data = pd.read_csv("./datas/GBvideos.csv",encoding="GBK")
data
video_idtitlechannel_titlecategory_idtagsviewslikesdislikescomment_totalthumbnail_linkdate
0jt2OHQh0HoQLive Apple Event - Apple September Event 2017 ...Apple Event28apple events|apple event|iphone 8|iphone x|iph...74263937824013548705https://i.ytimg.com/vi/jt2OHQh0HoQ/default_liv...13.09
1AqokkXoa7uEHolly and Phillip Meet Samantha the Sex Robot ...This Morning24this morning|interview|holly willoughby|philli...494203265113090https://i.ytimg.com/vi/AqokkXoa7uE/default.jpg13.09
2YPVcg45W0z4My DNA Test Results? I'm WHAT??emmablackery24emmablackery|emma blackery|emma|blackery|briti...142819131191511141https://i.ytimg.com/vi/YPVcg45W0z4/default.jpg13.09
3T_PuZBdT2iMgetting into a conversation in a language you ...ProZD1skit|korean|language|conversation|esl|japanese...15800286572915293598https://i.ytimg.com/vi/T_PuZBdT2iM/default.jpg13.09
4NsjsmgmbCfcBaby Name Challenge?Sprinkleofglitter26sprinkleofglitter|sprinkle of glitter|baby gli...40592501957490https://i.ytimg.com/vi/NsjsmgmbCfc/default.jpg13.09
....................................
1595w8fAellnPnsJuicy Chicken Breast - You Suck at Cooking (ep...You Suck At Cooking26how to|cooking|recipe|kitchen|chicken|chicken ...788466319459452274https://i.ytimg.com/vi/w8fAellnPns/default.jpg20.09
1596RsG37JcEQNwWeezer - Beach Boysweezer10weezer|pacific daydream|pacificdaydream|beach ...1079272435412641https://i.ytimg.com/vi/RsG37JcEQNw/default.jpg20.09
1597htSiIA2g7G8Berry Frozen Yogurt Bark RecipeSORTEDfood26frozen yogurt bark|frozen yoghurt bark|frozen ...109222484035212https://i.ytimg.com/vi/htSiIA2g7G8/default.jpg20.09
1598ZQK1F0wz6z4What Do You Want to Eat??Wong Fu Productions24panda|what should we eat|buzzfeed|comedy|boyfr...626223229625321559https://i.ytimg.com/vi/ZQK1F0wz6z4/default.jpg20.09
1599DuPXdnSWoLkThe Child in Time: Trailer - BBC OneBBC24BBC|iPlayer|bbc one|bbc 1|bbc1|trailer|the chi...992281699?135https://i.ytimg.com/vi/DuPXdnSWoLk/default.jpg20.09

1600 rows × 11 columns

# 1. Replace "?" with np.nan
new_data = data.replace(to_replace="?",value=np.nan)
new_data
video_idtitlechannel_titlecategory_idtagsviewslikesdislikescomment_totalthumbnail_linkdate
0jt2OHQh0HoQLive Apple Event - Apple September Event 2017 ...Apple Event28apple events|apple event|iphone 8|iphone x|iph...74263937824013548705https://i.ytimg.com/vi/jt2OHQh0HoQ/default_liv...13.09
1AqokkXoa7uEHolly and Phillip Meet Samantha the Sex Robot ...This Morning24this morning|interview|holly willoughby|philli...494203265113090https://i.ytimg.com/vi/AqokkXoa7uE/default.jpg13.09
2YPVcg45W0z4My DNA Test Results? I'm WHAT??emmablackery24emmablackery|emma blackery|emma|blackery|briti...142819131191511141https://i.ytimg.com/vi/YPVcg45W0z4/default.jpg13.09
3T_PuZBdT2iMgetting into a conversation in a language you ...ProZD1skit|korean|language|conversation|esl|japanese...15800286572915293598https://i.ytimg.com/vi/T_PuZBdT2iM/default.jpg13.09
4NsjsmgmbCfcBaby Name Challenge?Sprinkleofglitter26sprinkleofglitter|sprinkle of glitter|baby gli...40592501957490https://i.ytimg.com/vi/NsjsmgmbCfc/default.jpg13.09
....................................
1595w8fAellnPnsJuicy Chicken Breast - You Suck at Cooking (ep...You Suck At Cooking26how to|cooking|recipe|kitchen|chicken|chicken ...788466319459452274https://i.ytimg.com/vi/w8fAellnPns/default.jpg20.09
1596RsG37JcEQNwWeezer - Beach Boysweezer10weezer|pacific daydream|pacificdaydream|beach ...1079272435412641https://i.ytimg.com/vi/RsG37JcEQNw/default.jpg20.09
1597htSiIA2g7G8Berry Frozen Yogurt Bark RecipeSORTEDfood26frozen yogurt bark|frozen yoghurt bark|frozen ...109222484035212https://i.ytimg.com/vi/htSiIA2g7G8/default.jpg20.09
1598ZQK1F0wz6z4What Do You Want to Eat??Wong Fu Productions24panda|what should we eat|buzzfeed|comedy|boyfr...626223229625321559https://i.ytimg.com/vi/ZQK1F0wz6z4/default.jpg20.09
1599DuPXdnSWoLkThe Child in Time: Trailer - BBC OneBBC24BBC|iPlayer|bbc one|bbc 1|bbc1|trailer|the chi...992281699NaN135https://i.ytimg.com/vi/DuPXdnSWoLk/default.jpg20.09

1600 rows × 11 columns

new_data.isnull().any() # the "?" in the dislikes column has been replaced with NaN
video_id          False
title             False
channel_title     False
category_id       False
tags              False
views             False
likes             False
dislikes           True
comment_total     False
thumbnail_link    False
date              False
dtype: bool
new_data.dropna(inplace=True)
new_data.isnull().any()
video_id          False
title             False
channel_title     False
category_id       False
tags              False
views             False
likes             False
dislikes          False
comment_total     False
thumbnail_link    False
date              False
dtype: bool
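After the marker is replaced, a column that held "?" is usually still stored as text, so numeric operations such as `mean()` would not work on it yet. A minimal sketch, using a made-up frame rather than GBvideos.csv, of replacing the marker and then converting the column with `pd.to_numeric`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "?" marks a missing value, so the column is read as text
df = pd.DataFrame({
    "views": [100, 250, 75],
    "dislikes": ["3", "?", "8"],
})

# Step 1: turn the marker into a real NaN
cleaned = df.replace(to_replace="?", value=np.nan)

# Step 2: convert the text column to numbers so it can be aggregated
cleaned["dislikes"] = pd.to_numeric(cleaned["dislikes"])

# Step 3: handle the NaN as usual, here by dropping the row
result = cleaned.dropna()

print(cleaned.dtypes)
print(result)
```

When the marker is known before loading, `pd.read_csv` also accepts an `na_values` argument (for example `na_values="?"`) that performs the replacement while reading.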

4.7 Advanced Processing - Data Discretization

import pandas as pd

# Prepare the data
data = pd.Series([165,174,160,180,159,163,192,184],index=["No1:165","No2:174","No3:160","No4:180","No5:159","No6:163","No7:192","No8:184"])
data
No1:165    165
No2:174    174
No3:160    160
No4:180    180
No5:159    159
No6:163    163
No7:192    192
No8:184    184
dtype: int64

Automatic binning

# 1. Bin the data
# Automatic binning: qcut(data, number of bins)
sr = pd.qcut(data,3)
sr
No1:165      (163.667, 178.0]
No2:174      (163.667, 178.0]
No3:160    (158.999, 163.667]
No4:180        (178.0, 192.0]
No5:159    (158.999, 163.667]
No6:163    (158.999, 163.667]
No7:192        (178.0, 192.0]
No8:184        (178.0, 192.0]
dtype: category
Categories (3, interval[float64]): [(158.999, 163.667] < (163.667, 178.0] < (178.0, 192.0]]
# Inspect how many values fall in each bin
sr.value_counts()
(178.0, 192.0]        3
(158.999, 163.667]    3
(163.667, 178.0]      2
dtype: int64
type(sr)
pandas.core.series.Series
# 2. Convert the binned result to one-hot encoding
# prefix sets the prefix of the column names
pd.get_dummies(sr,prefix="height")
         height_(158.999, 163.667]  height_(163.667, 178.0]  height_(178.0, 192.0]
No1:165                          0                        1                      0
No2:174                          0                        1                      0
No3:160                          1                        0                      0
No4:180                          0                        0                      1
No5:159                          1                        0                      0
No6:163                          1                        0                      0
No7:192                          0                        0                      1
No8:184                          0                        0                      1

Custom binning

# Custom binning
# pd.cut(data, list containing all of the bin edges)
sr = pd.cut(data,[150,165,180,195])
sr
No1:165    (150, 165]
No2:174    (165, 180]
No3:160    (150, 165]
No4:180    (165, 180]
No5:159    (150, 165]
No6:163    (150, 165]
No7:192    (180, 195]
No8:184    (180, 195]
dtype: category
Categories (3, interval[int64]): [(150, 165] < (165, 180] < (180, 195]]
sr.value_counts()
(150, 165]    4
(180, 195]    2
(165, 180]    2
dtype: int64
pd.get_dummies(sr,prefix="身高")
         身高_(150, 165]  身高_(165, 180]  身高_(180, 195]
No1:165              1              0              0
No2:174              0              1              0
No3:160              1              0              0
No4:180              0              1              0
No5:159              1              0              0
No6:163              1              0              0
No7:192              0              0              1
No8:184              0              0              1
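Interval names such as `(150, 165]` make for awkward column names. As a sketch of one way around that, `pd.cut` accepts a `labels` argument; the label names "short", "medium" and "tall" below are assumptions chosen for illustration, while the heights and bin edges are the ones used above:

```python
import pandas as pd

heights = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])

# Same edges as above; each interval is open on the left and closed on the right
groups = pd.cut(heights, bins=[150, 165, 180, 195],
                labels=["short", "medium", "tall"])  # hypothetical label names

# sort=False keeps the bins in their natural order instead of sorting by count
counts = groups.value_counts(sort=False)

# dtype=int gives 0/1 columns instead of booleans
dummies = pd.get_dummies(groups, prefix="height", dtype=int)

print(counts)
print(dummies)
```

The counts agree with the custom-binning result above (4, 2 and 2), only with readable names.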

4.8 Advanced Processing - Merging

4.8.1 Merging with pd.concat (concatenating along an axis)

data1 = np.arange(0,20,1).reshape(4,5)
data1 = pd.DataFrame(data1)
data1
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
data2 = np.arange(100,120,1).reshape(4,5)
data2 = pd.DataFrame(data2)
data2
     0    1    2    3    4
0  100  101  102  103  104
1  105  106  107  108  109
2  110  111  112  113  114
3  115  116  117  118  119
# Concatenate data1 and data2 horizontally
data_concat = pd.concat([data1,data2],axis=1)
data_concat
    0   1   2   3   4    0    1    2    3    4
0   0   1   2   3   4  100  101  102  103  104
1   5   6   7   8   9  105  106  107  108  109
2  10  11  12  13  14  110  111  112  113  114
3  15  16  17  18  19  115  116  117  118  119
data2.T
     0    1    2    3
0  100  105  110  115
1  101  106  111  116
2  102  107  112  117
3  103  108  113  118
4  104  109  114  119
# Concatenate data1 and data2.T vertically; data2.T has no column 4, so those cells become NaN
data_concat1 = pd.concat([data1,data2.T],axis=0)
data_concat1
     0    1    2    3     4
0    0    1    2    3   4.0
1    5    6    7    8   9.0
2   10   11   12   13  14.0
3   15   16   17   18  19.0
0  100  105  110  115   NaN
1  101  106  111  116   NaN
2  102  107  112  117   NaN
3  103  108  113  118   NaN
4  104  109  114  119   NaN
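The vertical result above has two side effects: the row labels repeat (0 to 3, then 0 to 4) and the unmatched column is padded with NaN. A minimal sketch of two `pd.concat` options that address this, `ignore_index` and `join="inner"`, using the same two frames:

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame(np.arange(0, 20).reshape(4, 5))
data2 = pd.DataFrame(np.arange(100, 120).reshape(4, 5))

# Default behaviour (join="outer"): keep every column, pad the gaps with NaN
default = pd.concat([data1, data2.T], axis=0)

# join="inner" keeps only the columns present in both frames (0 to 3),
# ignore_index=True renumbers the rows 0..8 instead of repeating labels
stacked = pd.concat([data1, data2.T], axis=0, join="inner", ignore_index=True)

print(default.shape, stacked.shape)
print(stacked)
```

Whether dropping the unmatched column is acceptable depends on the data; the default outer join is the safer choice when no column should be lost.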

4.8.2 Merging with pd.merge (joining on key columns)

left=pd.DataFrame({'key1':['K0','K0','K1','K2'],
'key2':['K0','K1','K0','K1'],
'A':['A0','A1','A2','A3'],
'B':['B0','B1','B2','B3']})
left
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3
right=pd.DataFrame({'key1':['K0','K1','K1','K2'],
'key2':['K0','K0','K0','K0'],
'C':['Co','C1','C2','C3'],
'D':['DO','D1','D2','D3']})
right
  key1 key2   C   D
0   K0   K0  Co  DO
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3
# The default is an inner join
# inner keeps only the key combinations present in both tables
result = pd.merge(left,right,on=['key1','key2'],how="inner")
result
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  Co  DO
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2
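# left: left join
# Every key combination in the left table is kept; the merge follows the left table
result_left = pd.merge(left,right,on=['key1','key2'],how="left")
result_left
  key1 key2   A   B    C    D
0   K0   K0  A0  B0   Co   DO
1   K0   K1  A1  B1  NaN  NaN
2   K1   K0  A2  B2   C1   D1
3   K1   K0  A2  B2   C2   D2
4   K2   K1  A3  B3  NaN  NaN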
# right: right join
# Every key combination in the right table is kept; the merge follows the right table
result_right = pd.merge(left,right,on=['key1','key2'],how="right")
result_right
  key1 key2    A    B   C   D
0   K0   K0   A0   B0  Co  DO
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3
# outer: outer join
# Every key combination from both tables is kept
result_outer = pd.merge(left,right,on=['key1','key2'],how="outer")
result_outer
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   Co   DO
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K1   A3   B3  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3
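In an outer join it is not always obvious which table each row came from. As a sketch, `pd.merge` has an `indicator` parameter that adds a `_merge` column recording this. The frames below have the same keys as the ones above; the `C`/`D` values are placeholder strings chosen for illustration:

```python
import pandas as pd

left = pd.DataFrame({"key1": ["K0", "K0", "K1", "K2"],
                     "key2": ["K0", "K1", "K0", "K1"],
                     "A": ["A0", "A1", "A2", "A3"],
                     "B": ["B0", "B1", "B2", "B3"]})
right = pd.DataFrame({"key1": ["K0", "K1", "K1", "K2"],
                      "key2": ["K0", "K0", "K0", "K0"],
                      "C": ["C0", "C1", "C2", "C3"],
                      "D": ["D0", "D1", "D2", "D3"]})

# indicator=True adds a "_merge" column with the values
# "left_only", "right_only" or "both"
outer = pd.merge(left, right, on=["key1", "key2"], how="outer", indicator=True)

# How many rows matched in both tables, and how many came from one side only
counts = outer["_merge"].value_counts()

print(outer)
print(counts)
```

The rows marked "both" are exactly the rows of the inner join, "both" plus "left_only" gives the left join, and "both" plus "right_only" gives the right join.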

4.9 Advanced Processing - Crosstabs and Pivot Tables

  • Used to explore the relationship between two variables

4.9.2 Implementing with crosstab

data = pd.read_excel("./datas/szfj_baoan.xls")
data
      district  roomnum  hall   AREA  C_floor  floor_num  school  subway  per_price
   0     baoan        3     2   89.3   middle         31       0       0     7.0773
   1     baoan        4     2  127.0     high         31       0       0     6.9291
   2     baoan        1     1   28.0      low         39       0       0     3.9286
   3     baoan        1     1   28.0   middle         30       0       0     3.3568
   4     baoan        2     2   78.0   middle          8       1       1     5.0769
 ...       ...      ...   ...    ...      ...        ...     ...     ...        ...
1246     baoan        4     2   89.3      low          8       0       0     4.2553
1247     baoan        2     1   67.0   middle         30       0       0     3.8060
1248     baoan        2     2   67.4   middle         29       1       0     5.3412
1249     baoan        2     2   73.1      low         15       1       0     5.9508
1250     baoan        3     2   86.2   middle         32       0       1     4.5244

1251 rows × 9 columns

time = "2020-06-23"
# pandas datetime type
date = pd.to_datetime(time)
date
Timestamp('2020-06-23 00:00:00')
type(date)
pandas._libs.tslibs.timestamps.Timestamp
date.year
2020
date.month
6
data["week"] = date.weekday()  # weekday is a method and must be called
data.drop("week",axis=1,inplace=True)
data
      district  roomnum  hall   AREA  C_floor  floor_num  school  subway  per_price
   0     baoan        3     2   89.3   middle         31       0       0     7.0773
   1     baoan        4     2  127.0     high         31       0       0     6.9291
   2     baoan        1     1   28.0      low         39       0       0     3.9286
   3     baoan        1     1   28.0   middle         30       0       0     3.3568
   4     baoan        2     2   78.0   middle          8       1       1     5.0769
 ...       ...      ...   ...    ...      ...        ...     ...     ...        ...
1246     baoan        4     2   89.3      low          8       0       0     4.2553
1247     baoan        2     1   67.0   middle         30       0       0     3.8060
1248     baoan        2     2   67.4   middle         29       1       0     5.3412
1249     baoan        2     2   73.1      low         15       1       0     5.9508
1250     baoan        3     2   86.2   middle         32       0       1     4.5244

1251 rows × 9 columns

data["feature"] = np.where(data["per_price"] > 5.0000,1,0)
data
      district  roomnum  hall   AREA  C_floor  floor_num  school  subway  per_price  feature
   0     baoan        3     2   89.3   middle         31       0       0     7.0773        1
   1     baoan        4     2  127.0     high         31       0       0     6.9291        1
   2     baoan        1     1   28.0      low         39       0       0     3.9286        0
   3     baoan        1     1   28.0   middle         30       0       0     3.3568        0
   4     baoan        2     2   78.0   middle          8       1       1     5.0769        1
 ...       ...      ...   ...    ...      ...        ...     ...     ...        ...      ...
1246     baoan        4     2   89.3      low          8       0       0     4.2553        0
1247     baoan        2     1   67.0   middle         30       0       0     3.8060        0
1248     baoan        2     2   67.4   middle         29       1       0     5.3412        1
1249     baoan        2     2   73.1      low         15       1       0     5.9508        1
1250     baoan        3     2   86.2   middle         32       0       1     4.5244        1

1251 rows × 10 columns

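# Crosstab
# Examine the relationship between the floor number and whether the price per square meter is above 50000 (per_price > 5.0 in the code)
# The result gives, for each floor number, the count of rows with feature 0 and with feature 1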
data0 = pd.crosstab(data["floor_num"],data["feature"])
data0
feature      0    1
floor_num
1            6    8
3            0    1
4            0   10
6            3    7
7           16   25
8           19   32
9            2   11
10           4    9
11           8   11
12           1    3
13           4   20
14           0    5
15           8   33
16           9   19
17          20   21
18          17   35
19          11    5
20           2    4
21           1    6
22           0    1
23           4    8
24          10   26
25           4   37
26           9   57
27           5   38
28           6   35
29          26   68
30          30   78
31           4  151
32          21  126
33          34   20
34           1    5
35           1    2
36           0    4
37           1    1
38           0    1
39           5   10
40           1    3
43           0    1
44           0    6
45           0    7
47           0    1
50           0    1
51           0    3
52           0    2
53           0    1
data0.sum(axis=1) # sum across each row
floor_num
1      14
3       1
4      10
6      10
7      41
8      51
9      13
10     13
11     19
12      4
13     24
14      5
15     41
16     28
17     41
18     52
19     16
20      6
21      7
22      1
23     12
24     36
25     41
26     66
27     43
28     41
29     94
30    108
31    155
32    147
33     54
34      6
35      3
36      4
37      2
38      1
39     15
40      4
43      1
44      6
45      7
47      1
50      1
51      3
52      2
53      1
dtype: int64
data0.div(data0.sum(axis=1),axis=0) # divide each row by its row total
feature           0         1
floor_num
1          0.428571  0.571429
3          0.000000  1.000000
4          0.000000  1.000000
6          0.300000  0.700000
7          0.390244  0.609756
8          0.372549  0.627451
9          0.153846  0.846154
10         0.307692  0.692308
11         0.421053  0.578947
12         0.250000  0.750000
13         0.166667  0.833333
14         0.000000  1.000000
15         0.195122  0.804878
16         0.321429  0.678571
17         0.487805  0.512195
18         0.326923  0.673077
19         0.687500  0.312500
20         0.333333  0.666667
21         0.142857  0.857143
22         0.000000  1.000000
23         0.333333  0.666667
24         0.277778  0.722222
25         0.097561  0.902439
26         0.136364  0.863636
27         0.116279  0.883721
28         0.146341  0.853659
29         0.276596  0.723404
30         0.277778  0.722222
31         0.025806  0.974194
32         0.142857  0.857143
33         0.629630  0.370370
34         0.166667  0.833333
35         0.333333  0.666667
36         0.000000  1.000000
37         0.500000  0.500000
38         0.000000  1.000000
39         0.333333  0.666667
40         0.250000  0.750000
43         0.000000  1.000000
44         0.000000  1.000000
45         0.000000  1.000000
47         0.000000  1.000000
50         0.000000  1.000000
51         0.000000  1.000000
52         0.000000  1.000000
53         0.000000  1.000000
data_percent = data0.div(data0.sum(axis=1),axis=0)
data_percent
# stacked=True stacks the bars on top of each other
data_percent.plot(kind="bar",stacked=True)
<matplotlib.axes._subplots.AxesSubplot at 0x24719dd7488>

[Figure: stacked bar chart of data_percent by floor_num (output_70_1.png)]
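The count-then-divide steps can also be done in a single call. A minimal sketch, on a tiny made-up frame rather than the housing data, comparing the manual division with the `normalize="index"` option of `pd.crosstab`:

```python
import pandas as pd

# Hypothetical toy data: three floors, feature is 0 or 1
df = pd.DataFrame({
    "floor_num": [1, 1, 1, 2, 2, 3],
    "feature":   [0, 1, 1, 0, 0, 1],
})

# Two-step version: count, then divide each row by its row total
counts = pd.crosstab(df["floor_num"], df["feature"])
manual = counts.div(counts.sum(axis=1), axis=0)

# One-step version: normalize="index" makes every row sum to 1
direct = pd.crosstab(df["floor_num"], df["feature"], normalize="index")

print(manual)
print(direct)
```

Both tables hold the same proportions, so either can be passed to `plot(kind="bar", stacked=True)`.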


4.9.3 Implementing with pivot_table

# With a pivot table the whole process becomes simpler
# The result is directly the proportion of rows whose value is 1
data.pivot_table(["feature"],index=["floor_num"])

            feature
floor_num
1          0.571429
3          1.000000
4          1.000000
6          0.700000
...             ...
50         1.000000
51         1.000000
52         1.000000
53         1.000000
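The pivot table gives the proportion directly because its default aggregation is the mean, and the mean of a 0/1 column is the share of ones. A minimal sketch on made-up data (not the housing file) that checks this against an explicit `groupby`:

```python
import pandas as pd

# Hypothetical toy data: three floors, feature is 0 or 1
df = pd.DataFrame({
    "floor_num": [1, 1, 1, 2, 2, 3],
    "feature":   [0, 1, 1, 0, 0, 1],
})

# pivot_table aggregates with the mean unless aggfunc says otherwise
pivot = df.pivot_table(values="feature", index="floor_num")

# The same number computed explicitly: group by floor, average the 0/1 flag
grouped = df.groupby("floor_num")["feature"].mean()

print(pivot)
print(grouped)
```

Passing a different `aggfunc`, for example `aggfunc="sum"`, would return the number of ones per floor instead of the proportion.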

4.10 Advanced Processing - Grouping and Aggregation

4.10.2 Grouping and aggregation API

col = pd.DataFrame({'color':['white','red','green','red','green'],'object':["pen","pencil","pencil","ashtray","pen"],'price1':[4.56,4.20,1.30,0.56,2.75],'price2':[4.75,4.12,1.68,0.75,3.15]})
col
   color   object  price1  price2
0  white      pen    4.56    4.75
1    red   pencil    4.20    4.12
2  green   pencil    1.30    1.68
3    red  ashtray    0.56    0.75
4  green      pen    2.75    3.15
# Group by color and aggregate the price1 column
# Grouping with the DataFrame method
col.groupby(by="color")["price1"].max()
color
green    2.75
red      4.20
white    4.56
Name: price1, dtype: float64
# Grouping with the Series method
col['price1'].groupby(col["color"])
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002471D178D08>
col['price1'].groupby(col["color"]).max()
color
green    2.75
red      4.20
white    4.56
Name: price1, dtype: float64
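A grouped object is not limited to one statistic at a time. As a sketch using the same `col` frame, `agg` can compute several aggregations in one pass, and selecting a list of columns aggregates them together:

```python
import pandas as pd

col = pd.DataFrame({
    "color": ["white", "red", "green", "red", "green"],
    "object": ["pen", "pencil", "pencil", "ashtray", "pen"],
    "price1": [4.56, 4.20, 1.30, 0.56, 2.75],
    "price2": [4.75, 4.12, 1.68, 0.75, 3.15],
})

# Several statistics for one column, one row per color
summary = col.groupby("color")["price1"].agg(["max", "mean", "count"])

# One statistic for several columns at once
both_prices = col.groupby("color")[["price1", "price2"]].max()

print(summary)
print(both_prices)
```

The `max` column of `summary` matches the result shown above: 2.75 for green, 4.20 for red and 4.56 for white.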

4.11 Comprehensive Case Study

# 1. Prepare the data
movie = pd.read_csv("./datas/IMDB-Movie-Data.csv")
movie

# Question 1: we want to know the average rating, the number of directors, and similar information.
# How do we get them?
movie["Rating"].mean()
6.723200000000003
movie["Director"]
0                James Gunn
1              Ridley Scott
2        M. Night Shyamalan
3      Christophe Lourdelet
4                David Ayer
                ...
995               Billy Ray
996                Eli Roth
997              Jon M. Chu
998          Scot Armstrong
999        Barry Sonnenfeld
Name: Director, Length: 1000, dtype: object
# np.unique() removes duplicates, since one director may have directed several movies
np.unique(movie["Director"])
array(['Aamir Khan', 'Abdellatif Kechiche', 'Adam Leon', 'Adam McKay','Adam Shankman', 'Adam Wingard', 'Afonso Poyart', 'Aisling Walsh','Akan Satayev', 'Akiva Schaffer', 'Alan Taylor', 'Albert Hughes','Alejandro Amenábar', 'Alejandro González Iñárritu',...'Tomas Alfredson', 'Tony Gilroy', 'Tony Scott', 'Travis Knight','Tyler Shields', 'Wally Pfister', 'Walt Dohrn', 'Walter Hill','Warren Beatty', 'Werner Herzog', 'Wes Anderson', 'Wes Ball','Wes Craven', 'Whit Stillman', 'Will Gluck', 'Will Slocombe','William Brent Bell', 'William Oldroyd', 'Woody Allen','Xavier Dolan', 'Yimou Zhang', 'Yorgos Lanthimos', 'Zack Snyder','Zackary Adler'], dtype=object)
# Number of directors
np.unique(movie["Director"]).size
644
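pandas has its own counterparts to `np.unique` that work directly on a Series. A minimal sketch on a short made-up list of names (not the full Director column): `nunique()` returns the number of distinct values, and `value_counts()` also shows how many rows each value has:

```python
import pandas as pd

# Hypothetical toy data: one director appears twice
directors = pd.Series(["Ridley Scott", "James Gunn", "Ridley Scott", "David Ayer"])

# Number of distinct directors
n_directors = directors.nunique()

# Number of movies per director, most frequent first
films_per_director = directors.value_counts()

print(n_directors)
print(films_per_director)
```

On the real data, `movie["Director"].nunique()` would be the call corresponding to `np.unique(movie["Director"]).size`.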
# Question 2: for this movie data, if we want to see the distribution of rating and runtime, how should we present the data?
movie["Rating"].plot(kind="hist",figsize=(20,8),fontsize=40)
<matplotlib.axes._subplots.AxesSubplot at 0x2471ce18708>

[Figure: histogram of movie["Rating"] (output_86_1.png)]

import matplotlib.pyplot as plt

# 1. Create the figure
plt.figure(figsize=(20,8),dpi=100)
# 2. Draw the histogram
plt.hist(movie["Rating"],20)
# Adjust the x-axis ticks
plt.xticks(np.linspace(movie["Rating"].min(),movie["Rating"].max(),21))
# Add a grid
plt.grid(linestyle="--",alpha=0.5)
# 3. Show the figure
plt.show()

[Figure: histogram of Rating with 20 bins and a dashed grid (output_87_0.png)]

movie["Rating"]
0      8.1
1      7.0
2      7.3
3      7.2
4      6.2
      ...
995    6.2
996    5.5
997    6.2
998    5.6
999    5.3
Name: Rating, Length: 1000, dtype: float64
# Question 3: for this movie data, if we want statistics on the movie genres, how should we process the data?
# First, find out which genres exist
movie_genre = [i.split(",") for i in movie["Genre"]]
movie_genre
[['Action', 'Adventure', 'Sci-Fi'],['Adventure', 'Mystery', 'Sci-Fi'],['Horror', 'Thriller'],['Animation', 'Comedy', 'Family'],['Action', 'Adventure', 'Fantasy'],...['Horror'],['Drama', 'Music', 'Romance'],['Adventure', 'Comedy'],['Comedy', 'Family', 'Fantasy']]
[j for i in movie_genre for j in i]
['Action','Adventure','Sci-Fi','Adventure','Mystery','Sci-Fi',
...'Animation','Action','Adventure','Action','Adventure','Drama',...]
movie_class = np.unique([j for i in movie_genre for j in i])
movie_class
array(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime','Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music','Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller','War', 'Western'], dtype='<U9')
len(movie_class) # 20 movie genres
20
# Count how many movies fall in each genre
# First create a DataFrame filled with zeros
count = pd.DataFrame(np.zeros(shape=[1000,20],dtype="int32"),columns=movie_class)
count.head()
   Action  Adventure  Animation  Biography  Comedy  Crime  Drama  Family  Fantasy  History  Horror  Music  Musical  Mystery  Romance  Sci-Fi  Sport  Thriller  War  Western
0       0          0          0          0       0      0      0       0        0        0       0      0        0        0        0       0      0         0    0        0
1       0          0          0          0       0      0      0       0        0        0       0      0        0        0        0       0      0         0    0        0
2       0          0          0          0       0      0      0       0        0        0       0      0        0        0        0       0      0         0    0        0
3       0          0          0          0       0      0      0       0        0        0       0      0        0        0        0       0      0         0    0        0
4       0          0          0          0       0      0      0       0        0        0       0      0        0        0        0       0      0         0    0        0
count.loc[0,movie_genre[0]]
Action       0
Adventure    0
Sci-Fi       0
Name: 0, dtype: int32
movie_genre[0]
['Action', 'Adventure', 'Sci-Fi']
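# Fill in the table: set 1 for every genre a movie belongs to
for i in range(1000):
    count.loc[i,movie_genre[i]] = 1
count
     Action  Adventure  Animation  Biography  Comedy  Crime  Drama  Family  Fantasy  History  Horror  Music  Musical  Mystery  Romance  Sci-Fi  Sport  Thriller  War  Western
  0       1          1          0          0       0      0      0       0        0        0       0      0        0        0        0       1      0         0    0        0
  1       0          1          0          0       0      0      0       0        0        0       0      0        0        1        0       1      0         0    0        0
  2       0          0          0          0       0      0      0       0        0        0       1      0        0        0        0       0      0         1    0        0
  3       0          0          1          0       1      0      0       1        0        0       0      0        0        0        0       0      0         0    0        0
  4       1          1          0          0       0      0      0       0        1        0       0      0        0        0        0       0      0         0    0        0
 ..     ...        ...        ...        ...     ...    ...    ...     ...      ...      ...     ...    ...      ...      ...      ...     ...    ...       ...  ...      ...
995       0          0          0          0       0      1      1       0        0        0       0      0        0        1        0       0      0         0    0        0
996       0          0          0          0       0      0      0       0        0        0       1      0        0        0        0       0      0         0    0        0
997       0          0          0          0       0      0      1       0        0        0       0      1        0        0        1       0      0         0    0        0
998       0          1          0          0       1      0      0       0        0        0       0      0        0        0        0       0      0         0    0        0
999       0          0          0          0       1      0      0       1        1        0       0      0        0        0        0       0      0         0    0        0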

1000 rows × 20 columns

# Sum down each column
count.sum(axis=0)
Action       303
Adventure    259
Animation     49
Biography     81
Comedy       279
Crime        150
Drama        513
Family        51
Fantasy      101
History       29
Horror       119
Music         16
Musical        5
Mystery      106
Romance      141
Sci-Fi       120
Sport         18
Thriller     195
War           13
Western        7
dtype: int64
count.sum(axis=0).sort_values(ascending=False)
Drama        513
Action       303
Comedy       279
Adventure    259
Thriller     195
Crime        150
Romance      141
Sci-Fi       120
Horror       119
Mystery      106
Fantasy      101
Biography     81
Family        51
Animation     49
History       29
Sport         18
Music         16
War           13
Western        7
Musical        5
dtype: int64
count.sum(axis=0).sort_values(ascending=False).plot(kind="bar",fontsize=20,figsize=(20,9),colormap="cool")
<matplotlib.axes._subplots.AxesSubplot at 0x2472450c1c8>
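The zero table and the fill loop can be replaced by pandas string methods. This is an alternative sketch, not the method used above; it runs on the first three Genre values from the table rather than the full file. `Series.str.get_dummies` builds the 0/1 indicator table in one call, and `split` plus `explode` gives the counts directly:

```python
import pandas as pd

# First three Genre values from the movie table
genre = pd.Series([
    "Action,Adventure,Sci-Fi",
    "Adventure,Mystery,Sci-Fi",
    "Horror,Thriller",
])

# One 0/1 column per genre, one row per movie
indicator = genre.str.get_dummies(sep=",")

# Movies per genre, largest first
counts = indicator.sum(axis=0).sort_values(ascending=False)

# Same counts without the indicator table: one row per (movie, genre) pair
alt_counts = genre.str.split(",").explode().value_counts()

print(indicator)
print(counts)
```

On the full data the same calls would apply to `movie["Genre"]`, and `counts.plot(kind="bar")` would draw the bar chart.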
