Table of Contents
- 1. Loading a dataset
- 2. Inspecting the data with common attributes and methods
- type()
- shape
- columns
- dtypes
- info()
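- 3. Viewing part of the data
- Getting a single column
- Getting multiple columns
- Selecting rows
- Selecting rows and columns together
- Slice syntax
- 4. Simple data analysis
- 5. Data visualization
- Summary
1. Loading a dataset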
pd.read_csv() can load TSV files as well as CSV files, provided the sep parameter names the field delimiter (a tab, '\t', for TSV).
import pandas as pd

df = pd.read_csv('data/gapminder.tsv', sep='\t')
print(df)
'''
Output:
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106
... ... ... ... ... ... ...
1699 Zimbabwe Africa 1987 62.351 9216418 706.157306
1700 Zimbabwe Africa 1992 60.377 10704340 693.420786
1701 Zimbabwe Africa 1997 46.809 11404948 792.449960
1702 Zimbabwe Africa 2002 39.989 11926563 672.038623
1703 Zimbabwe Africa 2007 43.487 12311143 469.709298

[1704 rows x 6 columns]
'''
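The effect of sep can be checked without any file on disk. The following is a minimal sketch that assumes a small made-up TSV string (the values are toy data, not taken from gapminder) and reads it through io.StringIO:

```python
import io
import pandas as pd

# Made-up, tab-separated toy data standing in for a real TSV file.
tsv_text = "country\tyear\tlifeExp\nA\t1952\t30.0\nB\t1952\t40.0\n"

# With the default sep=',', no line contains a comma,
# so every line ends up in a single column.
wrong = pd.read_csv(io.StringIO(tsv_text))

# With sep='\t', each line is split into its three fields.
right = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```

A one-column result after loading is usually a sign that the delimiter does not match the file.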
2. Inspecting the data with common attributes and methods
type()
print(type(df))
'''
Output:
<class 'pandas.core.frame.DataFrame'>
'''
shape
print(df.shape)
'''
Output:
(1704, 6)
'''
columns
print(df.columns)
'''
Output:
Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')
'''
dtypes
print(df.dtypes)
'''
Output:
country object
continent object
year int64
lifeExp float64
pop int64
gdpPercap float64
dtype: object
'''
info()
df.info()  # info() prints its report itself and returns None, so no print() is needed
'''
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   country    1704 non-null   object
 1   continent  1704 non-null   object
 2   year       1704 non-null   int64
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
'''
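These attributes can also be used programmatically rather than only printed. The sketch below assumes a small made-up DataFrame named toy (hypothetical data); select_dtypes is used here as one way to pick out the numeric columns:

```python
import pandas as pd

# Hypothetical toy data, not taken from gapminder.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'year': [1952, 1952, 1957],
    'lifeExp': [30.0, 40.5, 35.2],
})

# shape is an attribute (no parentheses) holding a (rows, columns) tuple.
n_rows, n_cols = toy.shape

# columns is an Index; converting it to a list gives plain column names.
column_names = list(toy.columns)

# Keep only the columns whose dtype is numeric.
numeric_cols = toy.select_dtypes(include='number').columns.tolist()

print(n_rows, n_cols)   # 3 3
print(numeric_cols)     # ['year', 'lifeExp']
```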
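3. Viewing part of the data
The two most important accessors for selecting data are:
df.loc[row label, column label]
df.iloc[row position, column position]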
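As a minimal sketch of the two accessors, assume a made-up DataFrame whose row labels are the hypothetical strings 'a', 'b', 'c', so that labels and positions are visibly different:

```python
import pandas as pd

# Hypothetical toy data with string row labels.
toy = pd.DataFrame(
    {'year': [1952, 1957, 1962], 'lifeExp': [30.0, 31.5, 33.0]},
    index=['a', 'b', 'c'],
)

by_label = toy.loc['b', 'lifeExp']   # row label 'b', column label 'lifeExp'
by_position = toy.iloc[1, 1]         # second row, second column (0-based)

print(by_label, by_position)  # 31.5 31.5
```

Both calls reach the same cell: loc addresses it by names, iloc by integer positions.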
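Getting a single column
A DataFrame can be thought of as a collection of Series keyed by column name; passing a column name returns the corresponding column as a Series.
col = df['country']
print(col)
'''
Output: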
0 Afghanistan
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
...
1699 Zimbabwe
1700 Zimbabwe
1701 Zimbabwe
1702 Zimbabwe
1703 Zimbabwe
Name: country, Length: 1704, dtype: object
'''
Getting multiple columns
To get several columns, pass a list of column names.
subset = df[['country','continent','year']]
print(subset)
'''
Output:
country continent year
0 Afghanistan Asia 1952
1 Afghanistan Asia 1957
2 Afghanistan Asia 1962
3 Afghanistan Asia 1967
4 Afghanistan Asia 1972
... ... ... ...
1699 Zimbabwe Africa 1987
1700 Zimbabwe Africa 1992
1701 Zimbabwe Africa 1997
1702 Zimbabwe Africa 2002
1703 Zimbabwe Africa 2007

[1704 rows x 3 columns]
'''
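The bracket style determines the type of the result. The sketch below, on a made-up DataFrame (hypothetical data), shows that a single name yields a Series while a list of names yields a DataFrame, even when the list has one element:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({'country': ['A', 'B'], 'year': [1952, 1957], 'pop': [100, 200]})

one_col = toy['country']             # a column name -> Series
as_frame = toy[['country']]          # a one-element list -> DataFrame
two_cols = toy[['country', 'year']]  # a list of names -> DataFrame

print(type(one_col).__name__)   # Series
print(type(as_frame).__name__)  # DataFrame
print(two_cols.shape)           # (2, 2)
```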
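Selecting rows
By default, a DataFrame uses an integer row index, a sequence starting at 0, so each row's label equals its position.
The following code gets the first row of df (position 0, label 0).
print(df.loc[0])
'''
Output: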
country Afghanistan
continent Asia
year 1952
lifeExp 28.801
pop 8425333
gdpPercap 779.445314
Name: 0, dtype: object
'''
iloc gives the same result here:
print(df.iloc[0])
'''
Output:
country Afghanistan
continent Asia
year 1952
lifeExp 28.801
pop 8425333
gdpPercap 779.445314
Name: 0, dtype: object
'''
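If you specify the row index explicitly when loading the data, row labels and row positions no longer coincide.
The following code gets the first row of the dataset, whose position is 0 and whose row label (index value) is Avatar.
movie = pd.read_csv('data/movie.csv',index_col='movie_title')
movie.loc['Avatar'] # select by row label
'''
Output: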
color Color
director_name James Cameron
num_critic_for_reviews 723.0
duration 178.0
director_facebook_likes 0.0
actor_3_facebook_likes 855.0
actor_2_name Joel David Moore
... ...
country USA
content_rating PG-13
budget 237000000.0
title_year 2009.0
actor_2_facebook_likes 936.0
imdb_score 7.9
aspect_ratio 1.78
movie_facebook_likes 33000
Name: Avatar, dtype: object
'''
iloc with the row position gives the same result:
print(movie.iloc[0])
'''
Output:
color Color
director_name James Cameron
num_critic_for_reviews 723.0
duration 178.0
director_facebook_likes 0.0
actor_3_facebook_likes 855.0
actor_2_name Joel David Moore
... ...
country USA
content_rating PG-13
budget 237000000.0
title_year 2009.0
actor_2_facebook_likes 936.0
imdb_score 7.9
aspect_ratio 1.78
movie_facebook_likes 33000
Name: Avatar, dtype: object
'''
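The same label-versus-position behavior can be reproduced without movie.csv. This sketch assumes a tiny made-up CSV string (the titles, directors, and scores are invented for illustration) and uses index_col in the same way:

```python
import io
import pandas as pd

# Invented toy rows standing in for a real movie file.
csv_text = (
    "movie_title,director_name,imdb_score\n"
    "Film X,Director 1,7.0\n"
    "Film Y,Director 2,6.5\n"
)

# index_col turns the movie_title column into the row index.
toy_movie = pd.read_csv(io.StringIO(csv_text), index_col='movie_title')

first_by_label = toy_movie.loc['Film X']   # by row label
first_by_position = toy_movie.iloc[0]      # by row position

print(first_by_label.equals(first_by_position))  # True
print(list(toy_movie.index))                     # ['Film X', 'Film Y']
```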
Because iloc accesses data by position, it supports Python list-style indexing, including negative indices:
print(df.iloc[-1])
'''
Output:
country Zimbabwe
continent Africa
year 2007
lifeExp 43.487
pop 12311143
gdpPercap 469.709298
Name: 1703, dtype: object
'''
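Negative indices are a position-only feature. As a sketch on a made-up DataFrame with the default integer index (hypothetical data), iloc[-1] returns the last row, whereas loc[-1] looks for a row labeled -1 and fails:

```python
import pandas as pd

# Hypothetical toy data with the default 0..n-1 row index.
toy = pd.DataFrame({'country': ['A', 'B', 'C'], 'year': [1952, 1957, 1962]})

last_row = toy.iloc[-1]        # position-based: the last row
print(last_row['country'])     # C

loc_failed = False
try:
    toy.loc[-1]                # label-based: no row carries the label -1
except KeyError:
    loc_failed = True

print(loc_failed)              # True
```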
Selecting rows and columns together
df.loc[:,['country','year']]
'''
Output:
country year
0 Afghanistan 1952
1 Afghanistan 1957
2 Afghanistan 1962
3 Afghanistan 1967
4 Afghanistan 1972
... ... ...
1699 Zimbabwe 1987
1700 Zimbabwe 1992
1701 Zimbabwe 1997
1702 Zimbabwe 2002
1703 Zimbabwe 2007

[1704 rows x 2 columns]
'''
df.iloc[:,[0,2]]
'''
Output:
country year
0 Afghanistan 1952
1 Afghanistan 1957
2 Afghanistan 1962
3 Afghanistan 1967
4 Afghanistan 1972
... ... ...
1699 Zimbabwe 1987
1700 Zimbabwe 1992
1701 Zimbabwe 1997
1702 Zimbabwe 2002
1703 Zimbabwe 2007

[1704 rows x 2 columns]
'''
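The two calls above select the same data, once by column label and once by column position. A minimal sketch on a made-up DataFrame (hypothetical data) confirms the equivalence and also restricts rows and columns at the same time:

```python
import pandas as pd

# Hypothetical toy data with the same column order as in the text.
toy = pd.DataFrame({
    'country': ['A', 'B'],
    'continent': ['X', 'Y'],
    'year': [1952, 1957],
})

by_label = toy.loc[:, ['country', 'year']]   # all rows, columns by name
by_position = toy.iloc[:, [0, 2]]            # all rows, columns by position

print(by_label.equals(by_position))          # True

# A row selector and a column selector can be combined.
first_row_only = toy.loc[[0], ['country', 'year']]
print(first_row_only.shape)                  # (1, 2)
```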
Slice syntax
Get all rows of the columns at positions 3, 4, and 5:
df.iloc[:,3:6]
'''
Output:
lifeExp pop gdpPercap
0 28.801 8425333 779.445314
1 30.332 9240934 820.853030
2 31.997 10267083 853.100710
3 34.020 11537966 836.197138
4 36.088 13079460 739.981106
... ... ... ...
1699 62.351 9216418 706.157306
1700 60.377 10704340 693.420786
1701 46.809 11404948 792.449960
1702 39.989 11926563 672.038623
1703 43.487 12311143 469.709298

[1704 rows x 3 columns]
'''
A slice can also take a step:
df.iloc[:,0:6:2]
'''
Output:
country year pop
0 Afghanistan 1952 8425333
1 Afghanistan 1957 9240934
2 Afghanistan 1962 10267083
3 Afghanistan 1967 11537966
4 Afghanistan 1972 13079460
... ... ... ...
1699 Zimbabwe 1987 9216418
1700 Zimbabwe 1992 10704340
1701 Zimbabwe 1997 11404948
1702 Zimbabwe 2002 11926563
1703 Zimbabwe 2007 12311143

[1704 rows x 3 columns]
'''
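With iloc, a slice follows the usual Python rule: the start is included and the stop is excluded. The sketch below uses a made-up one-row DataFrame with hypothetical column names 'a' to 'f' to show which positions each slice covers:

```python
import pandas as pd

# Hypothetical one-row frame; only the column positions matter here.
toy = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=list('abcdef'))

cols_3_to_5 = list(toy.iloc[:, 3:6].columns)     # positions 3, 4, 5
every_second = list(toy.iloc[:, 0:6:2].columns)  # positions 0, 2, 4

print(cols_3_to_5)    # ['d', 'e', 'f']
print(every_second)   # ['a', 'c', 'e']

# A slice is equivalent to listing the positions explicitly.
same = toy.iloc[:, 3:6].equals(toy.iloc[:, [3, 4, 5]])
print(same)           # True
```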
4. Simple data analysis
- Average life expectancy for each year
df.groupby('year')['lifeExp'].mean()
'''
Output:
year
1952 49.057620
1957 51.507401
1962 53.609249
1967 55.678290
1972 57.647386
1977 59.570157
1982 61.533197
1987 63.212613
1992 64.160338
1997 65.014676
2002 65.694923
2007 67.007423
Name: lifeExp, dtype: float64
'''
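To make clear what the groupby call computes, the following sketch compares it with an explicit loop on a made-up DataFrame (hypothetical values): split the rows by year, take the mean of lifeExp within each group, and collect the results.

```python
import pandas as pd

# Hypothetical toy data: two rows per year.
toy = pd.DataFrame({
    'year': [1952, 1952, 1957, 1957],
    'lifeExp': [30.0, 40.0, 32.0, 44.0],
})

# One-liner: group by year, then average lifeExp within each group.
grouped = toy.groupby('year')['lifeExp'].mean()

# The same computation written out by hand.
manual = {}
for year in toy['year'].unique():
    rows = toy[toy['year'] == year]                  # split
    manual[int(year)] = float(rows['lifeExp'].mean())  # apply, then collect

print(manual)  # {1952: 35.0, 1957: 38.0}
```

The index of grouped holds the years and its values hold the per-year means, matching the manual dictionary.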
- Average life expectancy, population, and GDP per capita for each year
df.groupby('year')[['lifeExp','pop','gdpPercap']].mean()
'''
Output:
lifeExp pop gdpPercap
year
1952 49.057620 1.695040e+07 3725.276046
1957 51.507401 1.876341e+07 4299.408345
1962 53.609249 2.042101e+07 4725.812342
1967 55.678290 2.265830e+07 5483.653047
1972 57.647386 2.518998e+07 6770.082815
1977 59.570157 2.767638e+07 7313.166421
1982 61.533197 3.020730e+07 7518.901673
1987 63.212613 3.303857e+07 7900.920218
1992 64.160338 3.599092e+07 8158.608521
1997 65.014676 3.883947e+07 9090.175363
2002 65.694923 4.145759e+07 9917.848365
2007 67.007423 4.402122e+07 11680.071820
'''
- Averages for each year and each continent
df.groupby(by=['year','continent'])[['lifeExp','pop','gdpPercap']].mean()
'''
Output:
                  lifeExp           pop     gdpPercap
year continent
1952 Africa     39.135500  4.570010e+06   1252.572466
     Americas   53.279840  1.380610e+07   4079.062552
     Asia       46.314394  4.228356e+07   5195.484004
     Europe     64.408500  1.393736e+07   5661.057435
     Oceania    69.255000  5.343003e+06  10298.085650
1957 Africa     41.266346  5.093033e+06   1385.236062
     Americas   55.960280  1.547816e+07   4616.043733
     Asia       49.318544  4.735699e+07   5787.732940
     Europe     66.703067  1.459635e+07   6963.012816
     Oceania    70.295000  5.970988e+06  11598.522455
...                   ...           ...           ...
2002 Africa     53.325231  1.603315e+07   2599.385159
     Americas   72.422040  3.399091e+07   9287.677107
     Asia       69.233879  1.091455e+08  10174.090397
     Europe     76.700600  1.927413e+07  21711.732422
     Oceania    79.740000  1.172741e+07  26938.778040
2007 Africa     54.806038  1.787576e+07   3089.032605
     Americas   73.608120  3.595485e+07  11003.031625
     Asia       70.728485  1.155138e+08  12473.026870
     Europe     77.648600  1.953662e+07  25054.481636
     Oceania    80.719500  1.227497e+07  29810.188275
'''
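Grouping by two keys produces a result whose row index has two levels (year, continent). As a sketch on a made-up DataFrame (hypothetical values and continent names), a single group can be looked up with a tuple, and reset_index() turns the index levels back into ordinary columns:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'year': [1952, 1952, 1952, 1957],
    'continent': ['X', 'X', 'Y', 'X'],
    'lifeExp': [30.0, 40.0, 50.0, 42.0],
})

means = toy.groupby(by=['year', 'continent'])[['lifeExp']].mean()

index_levels = list(means.index.names)
print(index_levels)                        # ['year', 'continent']

# Look up one (year, continent) group with a tuple of labels.
value = means.loc[(1952, 'X'), 'lifeExp']
print(value)                               # 35.0

# Flatten the two index levels into regular columns.
flat = means.reset_index()
print(list(flat.columns))                  # ['year', 'continent', 'lifeExp']
```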
- Number of countries on each continent
df.groupby('continent')['country'].nunique()
'''
Output:
continent
Africa 52
Americas 25
Asia 33
Europe 30
Oceania 2
Name: country, dtype: int64
'''
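nunique() is used here because each country appears once per survey year, so counting rows would overstate the number of countries. The sketch below contrasts nunique() with count() on a made-up DataFrame (hypothetical continent and country names):

```python
import pandas as pd

# Hypothetical toy data: country 'A' appears in two different years.
toy = pd.DataFrame({
    'continent': ['X', 'X', 'X', 'Y'],
    'country': ['A', 'A', 'B', 'C'],
    'year': [1952, 1957, 1952, 1952],
})

distinct = toy.groupby('continent')['country'].nunique()  # distinct countries
row_count = toy.groupby('continent')['country'].count()   # rows per continent

print(distinct['X'], row_count['X'])  # 2 3
print(distinct['Y'], row_count['Y'])  # 1 1
```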
5. Data visualization
df.groupby('year')['lifeExp'].mean().plot()
# pass kind='bar' to plot() to draw a bar chart instead of a line chart
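The following is a minimal sketch of the bar-chart variant on a made-up DataFrame (hypothetical values). It assumes matplotlib is installed, selects the non-interactive Agg backend so it can run without a display, and renders the figure into an in-memory buffer rather than a file:

```python
import io

import matplotlib
matplotlib.use('Agg')  # non-interactive backend; no window is opened
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'year': [1952, 1952, 1957, 1957, 1962],
    'lifeExp': [30.0, 40.0, 32.0, 44.0, 50.0],
})

yearly_mean = toy.groupby('year')['lifeExp'].mean()

# plot() returns a matplotlib Axes; kind='bar' draws one bar per year.
ax = yearly_mean.plot(kind='bar')
ax.set_ylabel('mean lifeExp')

# Render the figure as PNG bytes in memory.
buffer = io.BytesIO()
ax.figure.savefig(buffer, format='png')
```

In a notebook the chart is displayed automatically, so the backend selection and the buffer are only needed for scripts.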
Summary
Data analysis workflow:
Load the data → basic data processing → data analysis → data visualization → draw conclusions