Click the title to get the article's source code and notes. Dataset: https://download.csdn.net/download/weixin_44827418/12548095
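Pandas advanced processing: missing-value handling, data discretization, merging, crosstabs and pivot tables, grouping and aggregation, and a comprehensive case study.
4.6 Advanced processing - missing-value handling
1) How to handle missing values. There are two approaches:
   1) Drop the samples that contain missing values
   2) Replace / impute
4.6.1 How to handle NaN
   1) Check whether the data contains NaN: pd.isnull(df), pd.notnull(df)
   2) Drop the samples that contain missing values: df.dropna(inplace=False)
      Replace / impute: df.fillna(value, inplace=False)
4.6.2 Missing values that are not NaN but carry a default marker
   1) Replace "?" -> np.nan: df.replace(to_replace="?", value=np.nan)
   2) Then follow the steps for handling np.nan
2) Missing-value handling examples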
4.7 Advanced processing - data discretization
   gender  age
A       1   23
B       2   30
C       1   18

   species  fur
A        1
B        2
C        3

   male  female  age
A     1       0   23
B     0       1   30
C     1       0   18

   dog  pig  mouse  fur
A    1    0      0    2
B    0    1      0    1
C    0    0      1    1
One-hot encoding & dummy variables
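4.7.1 What is data discretization
Original height data: 165, 174, 160, 180, 159, 163, 192, 184
4.7.2 Why discretize
4.7.3 How to discretize data
1) Binning
   Automatic binning: sr = pd.qcut(data, bins)
   Custom binning: sr = pd.cut(data, [])
2) Convert the binned result to one-hot encoding: pd.get_dummies(sr, prefix=)
4.8 Advanced processing - merging
numpy: np.concatenate((a, b), axis=)
   Horizontal stacking: np.hstack()
   Vertical stacking: np.vstack()
1) Concatenate along an axis: pd.concat([data1, data2], axis=1)
2) Join on keys with pd.merge: pd.merge(left, right, how="inner", on=[key columns])
4.9 Advanced processing - crosstabs and pivot tables
Find and explore the relationship between two variables
4.9.1 What crosstabs and pivot tables are for
4.9.2 Using crosstab: pd.crosstab(value1, value2)
4.9.3 pivot_table
4.10 Advanced processing - grouping and aggregation
4.10.1 What grouping and aggregation are
4.10.2 Grouping and aggregation API: dataframe, sr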
4.6.1 How to handle NaN
import pandas as pd

movie = pd.read_csv("./datas/IMDB-Movie-Data.csv")
movie
  | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore
0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0
1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0
2 | 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0
3 | 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0
4 | 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
995 | 996 | Secret in Their Eyes | Crime,Drama,Mystery | A tight-knit team of rising investigators, alo... | Billy Ray | Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... | 2015 | 111 | 6.2 | 27585 | NaN | 45.0
996 | 997 | Hostel: Part II | Horror | Three American college students studying abroa... | Eli Roth | Lauren German, Heather Matarazzo, Bijou Philli... | 2007 | 94 | 5.5 | 73152 | 17.54 | 46.0
997 | 998 | Step Up 2: The Streets | Drama,Music,Romance | Romantic sparks occur between two dance studen... | Jon M. Chu | Robert Hoffman, Briana Evigan, Cassie Ventura,... | 2008 | 98 | 6.2 | 70699 | 58.01 | 50.0
998 | 999 | Search Party | Adventure,Comedy | A pair of friends embark on a mission to reuni... | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch,Sh... | 2014 | 93 | 5.6 | 4881 | NaN | 22.0
999 | 1000 | Nine Lives | Comedy,Family,Fantasy | A stuffy businessman finds himself trapped ins... | Barry Sonnenfeld | Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0
1000 rows × 12 columns
movie.isnull()
  | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore
0 | False | False | False | False | False | False | False | False | False | False | False | False
1 | False | False | False | False | False | False | False | False | False | False | False | False
2 | False | False | False | False | False | False | False | False | False | False | False | False
3 | False | False | False | False | False | False | False | False | False | False | False | False
4 | False | False | False | False | False | False | False | False | False | False | False | False
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
995 | False | False | False | False | False | False | False | False | False | False | True | False
996 | False | False | False | False | False | False | False | False | False | False | False | False
997 | False | False | False | False | False | False | False | False | False | False | False | False
998 | False | False | False | False | False | False | False | False | False | False | True | False
999 | False | False | False | False | False | False | False | False | False | False | False | False
1000 rows × 12 columns
import numpy as np
np.any(movie.isnull())
True
pd.notnull(movie)
  | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore
0 | True | True | True | True | True | True | True | True | True | True | True | True
1 | True | True | True | True | True | True | True | True | True | True | True | True
2 | True | True | True | True | True | True | True | True | True | True | True | True
3 | True | True | True | True | True | True | True | True | True | True | True | True
4 | True | True | True | True | True | True | True | True | True | True | True | True
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
995 | True | True | True | True | True | True | True | True | True | True | False | True
996 | True | True | True | True | True | True | True | True | True | True | True | True
997 | True | True | True | True | True | True | True | True | True | True | True | True
998 | True | True | True | True | True | True | True | True | True | True | False | True
999 | True | True | True | True | True | True | True | True | True | True | True | True
1000 rows × 12 columns
np.all(pd.notnull(movie))
False
pd.isnull(movie).any()
Rank False
Title False
Genre False
Description False
Director False
Actors False
Year False
Runtime (Minutes) False
Rating False
Votes False
Revenue (Millions) True
Metascore True
dtype: bool
pd.notnull(movie).all()
Rank True
Title True
Genre True
Description True
Director True
Actors True
Year True
Runtime (Minutes) True
Rating True
Votes True
Revenue (Millions) False
Metascore False
dtype: bool
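Beyond a yes/no answer per column, it often helps to know how many values are missing. The following is a minimal sketch on a small made-up frame (the `toy` data and its values are illustrative assumptions, not taken from the IMDB file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the movie table
toy = pd.DataFrame({
    "Rating": [8.1, 7.0, 7.3, 6.2],
    "Revenue (Millions)": [333.13, np.nan, 138.12, np.nan],
    "Metascore": [76.0, 65.0, np.nan, 45.0],
})

# isnull() gives a boolean frame; summing counts the True values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Names of the columns that contain at least one NaN
columns_with_nan = toy.columns[toy.isnull().any()].tolist()
print(columns_with_nan)
```

The same two calls should work unchanged on `movie`.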
movie_full = movie.dropna()
movie_full.isnull().any()
Rank False
Title False
Genre False
Description False
Director False
Actors False
Year False
Runtime (Minutes) False
Rating False
Votes False
Revenue (Millions) False
Metascore False
dtype: bool
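`dropna()` with no arguments removes every row that has any NaN, which can discard a lot of data. A small sketch of two narrower options, `subset` and `thresh`, on made-up data (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Title": ["a", "b", "c", "d"],
    "Revenue (Millions)": [333.13, np.nan, 138.12, np.nan],
    "Metascore": [76.0, 65.0, np.nan, np.nan],
})

# Default: drop every row that has at least one NaN
drop_any = toy.dropna()

# Only look at one column when deciding which rows to drop
drop_subset = toy.dropna(subset=["Revenue (Millions)"])

# Keep rows that have at least 2 non-missing values
drop_thresh = toy.dropna(thresh=2)

print(len(drop_any), len(drop_subset), len(drop_thresh))
```

None of these calls modifies `toy`, because `inplace` defaults to False.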
movie.head()
  | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore
0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0
1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0
2 | 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0
3 | 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0
4 | 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0
movie["Revenue (Millions)"].mean()
82.95637614678897
# Assign the result back instead of calling fillna(inplace=True) on a column selection
movie["Revenue (Millions)"] = movie["Revenue (Millions)"].fillna(movie["Revenue (Millions)"].mean())
movie["Revenue (Millions)"].isnull().any()
False
movie["Metascore"] = movie["Metascore"].fillna(movie["Metascore"].mean())
movie["Metascore"].isnull().any()
False
movie.isnull().any()
Rank False
Title False
Genre False
Description False
Director False
Actors False
Year False
Runtime (Minutes) False
Rating False
Votes False
Revenue (Millions) False
Metascore False
dtype: bool
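When several numeric columns need the same treatment, the per-column fill can be done in one call. This is a sketch under the assumption that filling with the column mean is acceptable for every numeric column; the `toy` frame is made up:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Title": ["a", "b", "c"],
    "Revenue (Millions)": [100.0, np.nan, 300.0],
    "Metascore": [50.0, 70.0, np.nan],
})

# Mean of every numeric column (NaN values are skipped by default)
column_means = toy.mean(numeric_only=True)

# Passing a Series to fillna fills each column with the value under its own label
filled = toy.fillna(column_means)
print(filled)
```

`filled` is a new frame; `toy` keeps its NaN values.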
Handling missing values that are not NaN but carry a default marker
data = pd.read_csv("./datas/GBvideos.csv", encoding="GBK")
data
  | video_id | title | channel_title | category_id | tags | views | likes | dislikes | comment_total | thumbnail_link | date
0 | jt2OHQh0HoQ | Live Apple Event - Apple September Event 2017 ... | Apple Event | 28 | apple events|apple event|iphone 8|iphone x|iph... | 7426393 | 78240 | 13548 | 705 | https://i.ytimg.com/vi/jt2OHQh0HoQ/default_liv... | 13.09
1 | AqokkXoa7uE | Holly and Phillip Meet Samantha the Sex Robot ... | This Morning | 24 | this morning|interview|holly willoughby|philli... | 494203 | 2651 | 1309 | 0 | https://i.ytimg.com/vi/AqokkXoa7uE/default.jpg | 13.09
2 | YPVcg45W0z4 | My DNA Test Results? I'm WHAT?? | emmablackery | 24 | emmablackery|emma blackery|emma|blackery|briti... | 142819 | 13119 | 151 | 1141 | https://i.ytimg.com/vi/YPVcg45W0z4/default.jpg | 13.09
3 | T_PuZBdT2iM | getting into a conversation in a language you ... | ProZD | 1 | skit|korean|language|conversation|esl|japanese... | 1580028 | 65729 | 1529 | 3598 | https://i.ytimg.com/vi/T_PuZBdT2iM/default.jpg | 13.09
4 | NsjsmgmbCfc | Baby Name Challenge? | Sprinkleofglitter | 26 | sprinkleofglitter|sprinkle of glitter|baby gli... | 40592 | 5019 | 57 | 490 | https://i.ytimg.com/vi/NsjsmgmbCfc/default.jpg | 13.09
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
1595 | w8fAellnPns | Juicy Chicken Breast - You Suck at Cooking (ep... | You Suck At Cooking | 26 | how to|cooking|recipe|kitchen|chicken|chicken ... | 788466 | 31945 | 945 | 2274 | https://i.ytimg.com/vi/w8fAellnPns/default.jpg | 20.09
1596 | RsG37JcEQNw | Weezer - Beach Boys | weezer | 10 | weezer|pacific daydream|pacificdaydream|beach ... | 107927 | 2435 | 412 | 641 | https://i.ytimg.com/vi/RsG37JcEQNw/default.jpg | 20.09
1597 | htSiIA2g7G8 | Berry Frozen Yogurt Bark Recipe | SORTEDfood | 26 | frozen yogurt bark|frozen yoghurt bark|frozen ... | 109222 | 4840 | 35 | 212 | https://i.ytimg.com/vi/htSiIA2g7G8/default.jpg | 20.09
1598 | ZQK1F0wz6z4 | What Do You Want to Eat?? | Wong Fu Productions | 24 | panda|what should we eat|buzzfeed|comedy|boyfr... | 626223 | 22962 | 532 | 1559 | https://i.ytimg.com/vi/ZQK1F0wz6z4/default.jpg | 20.09
1599 | DuPXdnSWoLk | The Child in Time: Trailer - BBC One | BBC | 24 | BBC|iPlayer|bbc one|bbc 1|bbc1|trailer|the chi... | 99228 | 1699 | ? | 135 | https://i.ytimg.com/vi/DuPXdnSWoLk/default.jpg | 20.09
1600 rows × 11 columns
new_data = data.replace(to_replace="?", value=np.nan)
new_data
  | video_id | title | channel_title | category_id | tags | views | likes | dislikes | comment_total | thumbnail_link | date
0 | jt2OHQh0HoQ | Live Apple Event - Apple September Event 2017 ... | Apple Event | 28 | apple events|apple event|iphone 8|iphone x|iph... | 7426393 | 78240 | 13548 | 705 | https://i.ytimg.com/vi/jt2OHQh0HoQ/default_liv... | 13.09
1 | AqokkXoa7uE | Holly and Phillip Meet Samantha the Sex Robot ... | This Morning | 24 | this morning|interview|holly willoughby|philli... | 494203 | 2651 | 1309 | 0 | https://i.ytimg.com/vi/AqokkXoa7uE/default.jpg | 13.09
2 | YPVcg45W0z4 | My DNA Test Results? I'm WHAT?? | emmablackery | 24 | emmablackery|emma blackery|emma|blackery|briti... | 142819 | 13119 | 151 | 1141 | https://i.ytimg.com/vi/YPVcg45W0z4/default.jpg | 13.09
3 | T_PuZBdT2iM | getting into a conversation in a language you ... | ProZD | 1 | skit|korean|language|conversation|esl|japanese... | 1580028 | 65729 | 1529 | 3598 | https://i.ytimg.com/vi/T_PuZBdT2iM/default.jpg | 13.09
4 | NsjsmgmbCfc | Baby Name Challenge? | Sprinkleofglitter | 26 | sprinkleofglitter|sprinkle of glitter|baby gli... | 40592 | 5019 | 57 | 490 | https://i.ytimg.com/vi/NsjsmgmbCfc/default.jpg | 13.09
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
1595 | w8fAellnPns | Juicy Chicken Breast - You Suck at Cooking (ep... | You Suck At Cooking | 26 | how to|cooking|recipe|kitchen|chicken|chicken ... | 788466 | 31945 | 945 | 2274 | https://i.ytimg.com/vi/w8fAellnPns/default.jpg | 20.09
1596 | RsG37JcEQNw | Weezer - Beach Boys | weezer | 10 | weezer|pacific daydream|pacificdaydream|beach ... | 107927 | 2435 | 412 | 641 | https://i.ytimg.com/vi/RsG37JcEQNw/default.jpg | 20.09
1597 | htSiIA2g7G8 | Berry Frozen Yogurt Bark Recipe | SORTEDfood | 26 | frozen yogurt bark|frozen yoghurt bark|frozen ... | 109222 | 4840 | 35 | 212 | https://i.ytimg.com/vi/htSiIA2g7G8/default.jpg | 20.09
1598 | ZQK1F0wz6z4 | What Do You Want to Eat?? | Wong Fu Productions | 24 | panda|what should we eat|buzzfeed|comedy|boyfr... | 626223 | 22962 | 532 | 1559 | https://i.ytimg.com/vi/ZQK1F0wz6z4/default.jpg | 20.09
1599 | DuPXdnSWoLk | The Child in Time: Trailer - BBC One | BBC | 24 | BBC|iPlayer|bbc one|bbc 1|bbc1|trailer|the chi... | 99228 | 1699 | NaN | 135 | https://i.ytimg.com/vi/DuPXdnSWoLk/default.jpg | 20.09
1600 rows × 11 columns
new_data.isnull().any()
video_id False
title False
channel_title False
category_id False
tags False
views False
likes False
dislikes True
comment_total False
thumbnail_link False
date False
dtype: bool
new_data.dropna(inplace=True)
new_data.isnull().any()
video_id False
title False
channel_title False
category_id False
tags False
views False
likes False
dislikes False
comment_total False
thumbnail_link False
date False
dtype: bool
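One detail worth checking after replacing a marker such as "?": a column that contained the marker was read as text, so it may still hold strings rather than numbers. A minimal sketch with made-up values (the `toy` frame is an assumption, loosely modelled on the `dislikes` column):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a column that was read as text because of "?"
toy = pd.DataFrame({"views": [99228, 40592, 142819],
                    "dislikes": ["1309", "?", "151"]})

cleaned = toy.replace(to_replace="?", value=np.nan)

# Convert the text column to numbers so that mean(), sum(), etc. work
cleaned["dislikes"] = pd.to_numeric(cleaned["dislikes"])
print(cleaned["dislikes"].mean())
```

As an alternative, `pd.read_csv` accepts an `na_values` argument, so passing `na_values="?"` at load time should make the marker arrive as NaN directly.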
4.7 Advanced processing - data discretization
import pandas as pd

data = pd.Series([165, 174, 160, 180, 159, 163, 192, 184],
                 index=["No1:165", "No2:174", "No3:160", "No4:180", "No5:159", "No6:163", "No7:192", "No8:184"])
data
No1:165 165
No2:174 174
No3:160 160
No4:180 180
No5:159 159
No6:163 163
No7:192 192
No8:184 184
dtype: int64
Automatic binning
sr = pd.qcut(data, 3)
sr
No1:165 (163.667, 178.0]
No2:174 (163.667, 178.0]
No3:160 (158.999, 163.667]
No4:180 (178.0, 192.0]
No5:159 (158.999, 163.667]
No6:163 (158.999, 163.667]
No7:192 (178.0, 192.0]
No8:184 (178.0, 192.0]
dtype: category
Categories (3, interval[float64]): [(158.999, 163.667] < (163.667, 178.0] < (178.0, 192.0]]
sr.value_counts()
(178.0, 192.0] 3
(158.999, 163.667] 3
(163.667, 178.0] 2
dtype: int64
type(sr)
pandas.core.series.Series
pd.get_dummies(sr, prefix="height")
  | height_(158.999, 163.667] | height_(163.667, 178.0] | height_(178.0, 192.0]
No1:165 | 0 | 1 | 0
No2:174 | 0 | 1 | 0
No3:160 | 1 | 0 | 0
No4:180 | 0 | 0 | 1
No5:159 | 1 | 0 | 0
No6:163 | 1 | 0 | 0
No7:192 | 0 | 0 | 1
No8:184 | 0 | 0 | 1
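Interval names such as `(158.999, 163.667]` make awkward column headers. As a sketch, `pd.qcut` can attach readable labels and also return the bin edges it computed (the label names below are illustrative assumptions):

```python
import pandas as pd

heights = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])

# Three quantile-based bins with readable labels
binned = pd.qcut(heights, 3, labels=["short", "medium", "tall"])
print(binned.value_counts().sort_index())

# retbins=True also returns the bin edges that qcut computed
_, edges = pd.qcut(heights, 3, retbins=True)
print(edges)
```

Passing `binned` to `pd.get_dummies` would then give columns named after the labels rather than the intervals.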
Custom binning
sr = pd.cut(data, [150, 165, 180, 195])
sr
No1:165 (150, 165]
No2:174 (165, 180]
No3:160 (150, 165]
No4:180 (165, 180]
No5:159 (150, 165]
No6:163 (150, 165]
No7:192 (180, 195]
No8:184 (180, 195]
dtype: category
Categories (3, interval[int64]): [(150, 165] < (165, 180] < (180, 195]]
sr.value_counts()
(150, 165] 4
(180, 195] 2
(165, 180] 2
dtype: int64
pd.get_dummies(sr, prefix="身高")  # "身高" means "height"
  | 身高_(150, 165] | 身高_(165, 180] | 身高_(180, 195]
No1:165 | 1 | 0 | 0
No2:174 | 0 | 1 | 0
No3:160 | 1 | 0 | 0
No4:180 | 0 | 1 | 0
No5:159 | 1 | 0 | 0
No6:163 | 1 | 0 | 0
No7:192 | 0 | 0 | 1
No8:184 | 0 | 0 | 1
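With custom edges, values that fall exactly on an edge (165 and 180 here) land in the lower bin because intervals are closed on the right by default. A short sketch of how `right=False` changes that; the label names are assumptions for illustration:

```python
import pandas as pd

heights = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])
edges = [150, 165, 180, 195]

# Default right=True: intervals are (150, 165], (165, 180], (180, 195]
right_closed = pd.cut(heights, edges, labels=["low", "mid", "high"])

# right=False: intervals become [150, 165), [165, 180), [180, 195)
left_closed = pd.cut(heights, edges, labels=["low", "mid", "high"], right=False)

print(pd.DataFrame({"height": heights,
                    "right_closed": right_closed,
                    "left_closed": left_closed}))
```

Only the two edge values, 165 and 180, change bins between the two variants.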
4.8 Advanced processing - merging
4.8.1 Merging with pd.concat (concatenating along an axis)
data1 = np.arange(0, 20, 1).reshape(4, 5)
data1 = pd.DataFrame(data1)
data1
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
data2 = np.arange(100, 120, 1).reshape(4, 5)
data2 = pd.DataFrame(data2)
data2
     0    1    2    3    4
0  100  101  102  103  104
1  105  106  107  108  109
2  110  111  112  113  114
3  115  116  117  118  119
data_concat = pd.concat([data1, data2], axis=1)
data_concat
    0   1   2   3   4    0    1    2    3    4
0   0   1   2   3   4  100  101  102  103  104
1   5   6   7   8   9  105  106  107  108  109
2  10  11  12  13  14  110  111  112  113  114
3  15  16  17  18  19  115  116  117  118  119
data2.T
     0    1    2    3
0  100  105  110  115
1  101  106  111  116
2  102  107  112  117
3  103  108  113  118
4  104  109  114  119
data_concat1 = pd.concat([data1, data2.T], axis=0)
data_concat1
     0    1    2    3     4
0    0    1    2    3   4.0
1    5    6    7    8   9.0
2   10   11   12   13  14.0
3   15   16   17   18  19.0
0  100  105  110  115   NaN
1  101  106  111  116   NaN
2  102  107  112  117   NaN
3  103  108  113  118   NaN
4  104  109  114  119   NaN
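The last result shows two side effects of a vertical concat: row labels repeat, and columns missing from one frame are padded with NaN. A sketch of the two `pd.concat` options that address these, rebuilding the same frames so the block runs on its own:

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame(np.arange(0, 20).reshape(4, 5))
data2 = pd.DataFrame(np.arange(100, 120).reshape(4, 5))

# Vertical concat keeps the original row labels (0..3 appear twice)
stacked = pd.concat([data1, data2], axis=0)

# ignore_index=True renumbers the rows 0..7
renumbered = pd.concat([data1, data2], axis=0, ignore_index=True)

# join="inner" keeps only the columns present in both frames,
# so no NaN column is produced for the 4-column transposed frame
inner = pd.concat([data1, data2.T], axis=0, join="inner")

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(inner.shape)
```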
4.8.2 Merging with pd.merge (joining on keys)
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
left
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['Co', 'C1', 'C2', 'C3'],
                      'D': ['DO', 'D1', 'D2', 'D3']})
right
  key1 key2   C   D
0   K0   K0  Co  DO
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3
result = pd.merge(left, right, on=['key1', 'key2'], how="inner")
result
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  Co  DO
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2
result_left = pd.merge(left, right, on=['key1', 'key2'], how="left")
result_left
  key1 key2   A   B    C    D
0   K0   K0  A0  B0   Co   DO
1   K0   K1  A1  B1  NaN  NaN
2   K1   K0  A2  B2   C1   D1
3   K1   K0  A2  B2   C2   D2
4   K2   K1  A3  B3  NaN  NaN
result_right = pd.merge(left, right, on=['key1', 'key2'], how="right")
result_right
  key1 key2    A    B   C   D
0   K0   K0   A0   B0  Co  DO
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3
result_outer = pd.merge(left, right, on=['key1', 'key2'], how="outer")
result_outer
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   Co   DO
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K1   A3   B3  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3
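In an outer join it is not always obvious which side each row came from. As a sketch, `indicator=True` adds a `_merge` column that records this; the frames below are trimmed copies of `left` and `right` with simplified values:

```python
import pandas as pd

left = pd.DataFrame({"key1": ["K0", "K0", "K1", "K2"],
                     "key2": ["K0", "K1", "K0", "K1"],
                     "A": ["A0", "A1", "A2", "A3"]})
right = pd.DataFrame({"key1": ["K0", "K1", "K1", "K2"],
                      "key2": ["K0", "K0", "K0", "K0"],
                      "C": ["C0", "C1", "C2", "C3"]})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
outer = pd.merge(left, right, on=["key1", "key2"], how="outer", indicator=True)
print(outer)

# Rows of `left` that found no partner in `right`
unmatched_left = outer[outer["_merge"] == "left_only"]
print(unmatched_left[["key1", "key2", "A"]])
```

The row order of an outer merge can differ between pandas versions, so the sketch filters on `_merge` rather than relying on positions.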
4.9 Advanced processing - crosstabs and pivot tables
4.9.2 Using crosstab
data = pd.read_excel("./datas/szfj_baoan.xls")
data
  | district | roomnum | hall | AREA | C_floor | floor_num | school | subway | per_price
0 | baoan | 3 | 2 | 89.3 | middle | 31 | 0 | 0 | 7.0773
1 | baoan | 4 | 2 | 127.0 | high | 31 | 0 | 0 | 6.9291
2 | baoan | 1 | 1 | 28.0 | low | 39 | 0 | 0 | 3.9286
3 | baoan | 1 | 1 | 28.0 | middle | 30 | 0 | 0 | 3.3568
4 | baoan | 2 | 2 | 78.0 | middle | 8 | 1 | 1 | 5.0769
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
1246 | baoan | 4 | 2 | 89.3 | low | 8 | 0 | 0 | 4.2553
1247 | baoan | 2 | 1 | 67.0 | middle | 30 | 0 | 0 | 3.8060
1248 | baoan | 2 | 2 | 67.4 | middle | 29 | 1 | 0 | 5.3412
1249 | baoan | 2 | 2 | 73.1 | low | 15 | 1 | 0 | 5.9508
1250 | baoan | 3 | 2 | 86.2 | middle | 32 | 0 | 1 | 4.5244
1251 rows × 9 columns
time = "2020-06-23"
date = pd.to_datetime(time)
date
Timestamp('2020-06-23 00:00:00')
type(date)
pandas._libs.tslibs.timestamps.Timestamp
date.year
2020
date.month
6
data["week"] = date.weekday()  # weekday is a method and must be called
data.drop("week", axis=1, inplace=True)
data
  | district | roomnum | hall | AREA | C_floor | floor_num | school | subway | per_price
0 | baoan | 3 | 2 | 89.3 | middle | 31 | 0 | 0 | 7.0773
1 | baoan | 4 | 2 | 127.0 | high | 31 | 0 | 0 | 6.9291
2 | baoan | 1 | 1 | 28.0 | low | 39 | 0 | 0 | 3.9286
3 | baoan | 1 | 1 | 28.0 | middle | 30 | 0 | 0 | 3.3568
4 | baoan | 2 | 2 | 78.0 | middle | 8 | 1 | 1 | 5.0769
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
1246 | baoan | 4 | 2 | 89.3 | low | 8 | 0 | 0 | 4.2553
1247 | baoan | 2 | 1 | 67.0 | middle | 30 | 0 | 0 | 3.8060
1248 | baoan | 2 | 2 | 67.4 | middle | 29 | 1 | 0 | 5.3412
1249 | baoan | 2 | 2 | 73.1 | low | 15 | 1 | 0 | 5.9508
1250 | baoan | 3 | 2 | 86.2 | middle | 32 | 0 | 1 | 4.5244
1251 rows × 9 columns
data["feature"] = np.where(data["per_price"] > 5.0000, 1, 0)
data
  | district | roomnum | hall | AREA | C_floor | floor_num | school | subway | per_price | feature
0 | baoan | 3 | 2 | 89.3 | middle | 31 | 0 | 0 | 7.0773 | 1
1 | baoan | 4 | 2 | 127.0 | high | 31 | 0 | 0 | 6.9291 | 1
2 | baoan | 1 | 1 | 28.0 | low | 39 | 0 | 0 | 3.9286 | 0
3 | baoan | 1 | 1 | 28.0 | middle | 30 | 0 | 0 | 3.3568 | 0
4 | baoan | 2 | 2 | 78.0 | middle | 8 | 1 | 1 | 5.0769 | 1
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
1246 | baoan | 4 | 2 | 89.3 | low | 8 | 0 | 0 | 4.2553 | 0
1247 | baoan | 2 | 1 | 67.0 | middle | 30 | 0 | 0 | 3.8060 | 0
1248 | baoan | 2 | 2 | 67.4 | middle | 29 | 1 | 0 | 5.3412 | 1
1249 | baoan | 2 | 2 | 73.1 | low | 15 | 1 | 0 | 5.9508 | 1
1250 | baoan | 3 | 2 | 86.2 | middle | 32 | 0 | 1 | 4.5244 | 0
1251 rows × 10 columns
data0 = pd.crosstab(data["floor_num"], data["feature"])
data0
feature | 0 | 1
floor_num
1 | 6 | 8
3 | 0 | 1
4 | 0 | 10
6 | 3 | 7
7 | 16 | 25
8 | 19 | 32
9 | 2 | 11
10 | 4 | 9
11 | 8 | 11
12 | 1 | 3
13 | 4 | 20
14 | 0 | 5
15 | 8 | 33
16 | 9 | 19
17 | 20 | 21
18 | 17 | 35
19 | 11 | 5
20 | 2 | 4
21 | 1 | 6
22 | 0 | 1
23 | 4 | 8
24 | 10 | 26
25 | 4 | 37
26 | 9 | 57
27 | 5 | 38
28 | 6 | 35
29 | 26 | 68
30 | 30 | 78
31 | 4 | 151
32 | 21 | 126
33 | 34 | 20
34 | 1 | 5
35 | 1 | 2
36 | 0 | 4
37 | 1 | 1
38 | 0 | 1
39 | 5 | 10
40 | 1 | 3
43 | 0 | 1
44 | 0 | 6
45 | 0 | 7
47 | 0 | 1
50 | 0 | 1
51 | 0 | 3
52 | 0 | 2
53 | 0 | 1
data0.sum(axis=1)
floor_num
1 14
3 1
4 10
6 10
7 41
8 51
9 13
10 13
11 19
12 4
13 24
14 5
15 41
16 28
17 41
18 52
19 16
20 6
21 7
22 1
23 12
24 36
25 41
26 66
27 43
28 41
29 94
30 108
31 155
32 147
33 54
34 6
35 3
36 4
37 2
38 1
39 15
40 4
43 1
44 6
45 7
47 1
50 1
51 3
52 2
53 1
dtype: int64
data0.div(data0.sum(axis=1), axis=0)
feature | 0 | 1
floor_num
1 | 0.428571 | 0.571429
3 | 0.000000 | 1.000000
4 | 0.000000 | 1.000000
6 | 0.300000 | 0.700000
7 | 0.390244 | 0.609756
8 | 0.372549 | 0.627451
9 | 0.153846 | 0.846154
10 | 0.307692 | 0.692308
11 | 0.421053 | 0.578947
12 | 0.250000 | 0.750000
13 | 0.166667 | 0.833333
14 | 0.000000 | 1.000000
15 | 0.195122 | 0.804878
16 | 0.321429 | 0.678571
17 | 0.487805 | 0.512195
18 | 0.326923 | 0.673077
19 | 0.687500 | 0.312500
20 | 0.333333 | 0.666667
21 | 0.142857 | 0.857143
22 | 0.000000 | 1.000000
23 | 0.333333 | 0.666667
24 | 0.277778 | 0.722222
25 | 0.097561 | 0.902439
26 | 0.136364 | 0.863636
27 | 0.116279 | 0.883721
28 | 0.146341 | 0.853659
29 | 0.276596 | 0.723404
30 | 0.277778 | 0.722222
31 | 0.025806 | 0.974194
32 | 0.142857 | 0.857143
33 | 0.629630 | 0.370370
34 | 0.166667 | 0.833333
35 | 0.333333 | 0.666667
36 | 0.000000 | 1.000000
37 | 0.500000 | 0.500000
38 | 0.000000 | 1.000000
39 | 0.333333 | 0.666667
40 | 0.250000 | 0.750000
43 | 0.000000 | 1.000000
44 | 0.000000 | 1.000000
45 | 0.000000 | 1.000000
47 | 0.000000 | 1.000000
50 | 0.000000 | 1.000000
51 | 0.000000 | 1.000000
52 | 0.000000 | 1.000000
53 | 0.000000 | 1.000000
data_percent = data0.div(data0.sum(axis=1), axis=0)
data_percent
feature | 0 | 1
floor_num
1 | 0.428571 | 0.571429
3 | 0.000000 | 1.000000
4 | 0.000000 | 1.000000
6 | 0.300000 | 0.700000
7 | 0.390244 | 0.609756
8 | 0.372549 | 0.627451
9 | 0.153846 | 0.846154
10 | 0.307692 | 0.692308
11 | 0.421053 | 0.578947
12 | 0.250000 | 0.750000
13 | 0.166667 | 0.833333
14 | 0.000000 | 1.000000
15 | 0.195122 | 0.804878
16 | 0.321429 | 0.678571
17 | 0.487805 | 0.512195
18 | 0.326923 | 0.673077
19 | 0.687500 | 0.312500
20 | 0.333333 | 0.666667
21 | 0.142857 | 0.857143
22 | 0.000000 | 1.000000
23 | 0.333333 | 0.666667
24 | 0.277778 | 0.722222
25 | 0.097561 | 0.902439
26 | 0.136364 | 0.863636
27 | 0.116279 | 0.883721
28 | 0.146341 | 0.853659
29 | 0.276596 | 0.723404
30 | 0.277778 | 0.722222
31 | 0.025806 | 0.974194
32 | 0.142857 | 0.857143
33 | 0.629630 | 0.370370
34 | 0.166667 | 0.833333
35 | 0.333333 | 0.666667
36 | 0.000000 | 1.000000
37 | 0.500000 | 0.500000
38 | 0.000000 | 1.000000
39 | 0.333333 | 0.666667
40 | 0.250000 | 0.750000
43 | 0.000000 | 1.000000
44 | 0.000000 | 1.000000
45 | 0.000000 | 1.000000
47 | 0.000000 | 1.000000
50 | 0.000000 | 1.000000
51 | 0.000000 | 1.000000
52 | 0.000000 | 1.000000
53 | 0.000000 | 1.000000
data_percent.plot(kind="bar", stacked=True)
<matplotlib.axes._subplots.AxesSubplot at 0x24719dd7488>
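The divide-by-row-total step can also be requested from `pd.crosstab` itself through its `normalize` argument. A minimal sketch on made-up data shaped like the housing example (the `toy` values are assumptions):

```python
import pandas as pd

# Hypothetical toy data in the same shape as the housing example
toy = pd.DataFrame({
    "floor_num": [1, 1, 1, 2, 2, 3],
    "feature":   [0, 1, 1, 0, 0, 1],
})

counts = pd.crosstab(toy["floor_num"], toy["feature"])

# Manual row normalisation, as in the text
manual = counts.div(counts.sum(axis=1), axis=0)

# normalize="index" asks crosstab to divide each row by its total
direct = pd.crosstab(toy["floor_num"], toy["feature"], normalize="index")

print(direct)
```

Both routes give the same proportions; the manual version keeps the raw counts around, which is handy if they are needed later.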
4.9.3 Using pivot_table
data.pivot_table(["feature"], index=["floor_num"])
floor_num | feature
1 | 0.571429
3 | 1.000000
4 | 1.000000
6 | 0.700000
... | ...
50 | 1.000000
51 | 1.000000
52 | 1.000000
53 | 1.000000
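`pivot_table` aggregates with the mean by default, and because `feature` is 0/1 its mean is the share of rows where `feature` equals 1. The sketch below, on made-up data, shows that a `groupby` mean gives the same numbers:

```python
import pandas as pd

toy = pd.DataFrame({
    "floor_num": [1, 1, 1, 2, 2, 3],
    "feature":   [0, 1, 1, 0, 0, 1],
})

# pivot_table aggregates with the mean by default
pivot = toy.pivot_table(["feature"], index=["floor_num"])

# The same numbers via groupby
grouped = toy.groupby("floor_num")[["feature"]].mean()

print(pivot)
print(grouped)
```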
4.10 Advanced processing - grouping and aggregation
4.10.2 Grouping and aggregation API
col = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                    'object': ["pen", "pencil", "pencil", "ashtray", "pen"],
                    'price1': [4.56, 4.20, 1.30, 0.56, 2.75],
                    'price2': [4.75, 4.12, 1.68, 0.75, 3.15]})
col
   color   object  price1  price2
0  white      pen    4.56    4.75
1    red   pencil    4.20    4.12
2  green   pencil    1.30    1.68
3    red  ashtray    0.56    0.75
4  green      pen    2.75    3.15
col.groupby(by="color")["price1"].max()
color
green 2.75
red 4.20
white 4.56
Name: price1, dtype: float64
col['price1'].groupby(col["color"])
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002471D178D08>
col['price1'].groupby(col["color"]).max()
color
green 2.75
red 4.20
white 4.56
Name: price1, dtype: float64
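A grouped object is not limited to a single aggregation. As a sketch, `agg` can compute several statistics at once, or a different one per column (the output column names `price1_max` and `price2_mean` are chosen here for illustration):

```python
import pandas as pd

col = pd.DataFrame({"color": ["white", "red", "green", "red", "green"],
                    "object": ["pen", "pencil", "pencil", "ashtray", "pen"],
                    "price1": [4.56, 4.20, 1.30, 0.56, 2.75],
                    "price2": [4.75, 4.12, 1.68, 0.75, 3.15]})

# Several aggregations of one column at once
summary = col.groupby("color")["price1"].agg(["min", "max", "mean"])
print(summary)

# A different aggregation per column via named aggregation
per_column = col.groupby("color").agg(
    price1_max=("price1", "max"),
    price2_mean=("price2", "mean"),
)
print(per_column)
```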
4.11 Comprehensive case study
movie = pd.read_csv("./datas/IMDB-Movie-Data.csv")
movie
  | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore
0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0
1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0
2 | 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0
3 | 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0
4 | 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
995 | 996 | Secret in Their Eyes | Crime,Drama,Mystery | A tight-knit team of rising investigators, alo... | Billy Ray | Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... | 2015 | 111 | 6.2 | 27585 | NaN | 45.0
996 | 997 | Hostel: Part II | Horror | Three American college students studying abroa... | Eli Roth | Lauren German, Heather Matarazzo, Bijou Philli... | 2007 | 94 | 5.5 | 73152 | 17.54 | 46.0
997 | 998 | Step Up 2: The Streets | Drama,Music,Romance | Romantic sparks occur between two dance studen... | Jon M. Chu | Robert Hoffman, Briana Evigan, Cassie Ventura,... | 2008 | 98 | 6.2 | 70699 | 58.01 | 50.0
998 | 999 | Search Party | Adventure,Comedy | A pair of friends embark on a mission to reuni... | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch,Sh... | 2014 | 93 | 5.6 | 4881 | NaN | 22.0
999 | 1000 | Nine Lives | Comedy,Family,Fantasy | A stuffy businessman finds himself trapped ins... | Barry Sonnenfeld | Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0
1000 rows × 12 columns
movie["Rating"].mean()
6.723200000000003
movie["Director"]
0 James Gunn
1 Ridley Scott
2 M. Night Shyamalan
3 Christophe Lourdelet
4 David Ayer
...
995 Billy Ray
996 Eli Roth
997 Jon M. Chu
998 Scot Armstrong
999 Barry Sonnenfeld
Name: Director, Length: 1000, dtype: object
np.unique(movie["Director"])
array(['Aamir Khan', 'Abdellatif Kechiche', 'Adam Leon', 'Adam McKay','Adam Shankman', 'Adam Wingard', 'Afonso Poyart', 'Aisling Walsh','Akan Satayev', 'Akiva Schaffer', 'Alan Taylor', 'Albert Hughes','Alejandro Amenábar', 'Alejandro González Iñárritu',...'Tomas Alfredson', 'Tony Gilroy', 'Tony Scott', 'Travis Knight','Tyler Shields', 'Wally Pfister', 'Walt Dohrn', 'Walter Hill','Warren Beatty', 'Werner Herzog', 'Wes Anderson', 'Wes Ball','Wes Craven', 'Whit Stillman', 'Will Gluck', 'Will Slocombe','William Brent Bell', 'William Oldroyd', 'Woody Allen','Xavier Dolan', 'Yimou Zhang', 'Yorgos Lanthimos', 'Zack Snyder','Zackary Adler'], dtype=object)
np.unique(movie["Director"]).size
644
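pandas can give the same count directly on the column, and can also report how often each value occurs. A brief sketch on a made-up list of names (not the real column):

```python
import pandas as pd

# Hypothetical short list of directors
directors = pd.Series(["James Gunn", "Ridley Scott", "David Ayer",
                       "Ridley Scott", "James Gunn"])

# Number of distinct values, without building the array of uniques first
n_directors = directors.nunique()
print(n_directors)

# How often each director appears, most frequent first
counts = directors.value_counts()
print(counts)
```

On the real data, `movie["Director"].nunique()` should agree with the `np.unique(...).size` result above.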
movie["Rating"].plot(kind="hist", figsize=(20, 8), fontsize=40)
<matplotlib.axes._subplots.AxesSubplot at 0x2471ce18708>
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 8), dpi=100)
plt.hist(movie["Rating"], 20)  # 20 equal-width bins
# Place one tick on each of the 21 bin edges
plt.xticks(np.linspace(movie["Rating"].min(), movie["Rating"].max(), 21))
plt.grid(linestyle="--", alpha=0.5)
plt.show()
movie["Rating"]
0 8.1
1 7.0
2 7.3
3 7.2
4 6.2
...
995 6.2
996 5.5
997 6.2
998 5.6
999 5.3
Name: Rating, Length: 1000, dtype: float64
movie_genre = [i.split(",") for i in movie["Genre"]]
movie_genre
[['Action', 'Adventure', 'Sci-Fi'],
 ['Adventure', 'Mystery', 'Sci-Fi'],
 ['Horror', 'Thriller'],
 ['Animation', 'Comedy', 'Family'],
 ['Action', 'Adventure', 'Fantasy'],
 ...
 ['Horror'],
 ['Drama', 'Music', 'Romance'],
 ['Adventure', 'Comedy'],
 ['Comedy', 'Family', 'Fantasy']]
[j for i in movie_genre for j in i]
['Action','Adventure','Sci-Fi','Adventure','Mystery','Sci-Fi',
...'Animation','Action','Adventure','Action','Adventure','Drama',...]
movie_class = np.unique([j for i in movie_genre for j in i])
movie_class
array(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime','Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music','Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller','War', 'Western'], dtype='<U9')
len(movie_class)
20
count = pd.DataFrame(np.zeros(shape=[1000, 20], dtype="int32"), columns=movie_class)
count.head()
  | Action | Adventure | Animation | Biography | Comedy | Crime | Drama | Family | Fantasy | History | Horror | Music | Musical | Mystery | Romance | Sci-Fi | Sport | Thriller | War | Western
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
count.loc[0, movie_genre[0]]
Action 0
Adventure 0
Sci-Fi 0
Name: 0, dtype: int32
movie_genre[0]
['Action', 'Adventure', 'Sci-Fi']
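# Mark each movie's genres with 1 in its row
for i in range(1000):
    count.loc[i, movie_genre[i]] = 1
count
  | Action | Adventure | Animation | Biography | Comedy | Crime | Drama | Family | Fantasy | History | Horror | Music | Musical | Mystery | Romance | Sci-Fi | Sport | Thriller | War | Western
0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0
3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
995 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
996 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
997 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0
998 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
999 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0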
1000 rows × 20 columns
count.sum(axis=0)
Action 303
Adventure 259
Animation 49
Biography 81
Comedy 279
Crime 150
Drama 513
Family 51
Fantasy 101
History 29
Horror 119
Music 16
Musical 5
Mystery 106
Romance 141
Sci-Fi 120
Sport 18
Thriller 195
War 13
Western 7
dtype: int64
count.sum(axis=0).sort_values(ascending=False)
Drama 513
Action 303
Comedy 279
Adventure 259
Thriller 195
Crime 150
Romance 141
Sci-Fi 120
Horror 119
Mystery 106
Fantasy 101
Biography 81
Family 51
Animation 49
History 29
Sport 18
Music 16
War 13
Western 7
Musical 5
dtype: int64
count.sum(axis=0).sort_values(ascending=False).plot(kind="bar", fontsize=20, figsize=(20, 9), colormap="cool")
<matplotlib.axes._subplots.AxesSubplot at 0x2472450c1c8>
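The zero matrix plus loop above can likely be replaced by a single call: pandas string methods include `str.get_dummies`, which splits on a separator and builds the indicator columns. A sketch on a made-up sample of the `Genre` column:

```python
import pandas as pd

# Hypothetical sample of the "Genre" column
genre = pd.Series(["Action,Adventure,Sci-Fi",
                   "Adventure,Mystery,Sci-Fi",
                   "Horror,Thriller",
                   "Action,Adventure,Fantasy"])

# One indicator column per genre, split on the comma separator
indicator = genre.str.get_dummies(sep=",")
print(indicator)

# Number of movies per genre, largest first
genre_counts = indicator.sum(axis=0).sort_values(ascending=False)
print(genre_counts)
```

Applied to the real data, `movie["Genre"].str.get_dummies(sep=",")` should give a table equivalent to `count`, with no hard-coded 1000 or 20.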