Getting Started with DataFrame

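Contents

- 1. Loading a Dataset
- 2. Inspecting the Data with Common Attributes and Methods
  - type()
  - shape
  - columns
  - dtypes
  - info()
- 3. Viewing a Subset of the Data
  - Selecting a Single Column
  - Selecting Multiple Columns
  - Selecting Rows
  - Selecting Rows and Columns Together
  - Slicing Syntax
- 4. Simple Data Analysis
- 5. Data Visualization
- Summary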

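1. Loading a Dataset

pd.read_csv() loads TSV files as well as CSV files. For a TSV file, pass the sep parameter (sep='\t') so that pandas splits fields on tabs instead of commas.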

import pandas as pd

df = pd.read_csv('data/gapminder.tsv', sep='\t')
print(df)
'''
Output:
          country continent  year  lifeExp       pop   gdpPercap
0     Afghanistan      Asia  1952   28.801   8425333  779.445314
1     Afghanistan      Asia  1957   30.332   9240934  820.853030
2     Afghanistan      Asia  1962   31.997  10267083  853.100710
3     Afghanistan      Asia  1967   34.020  11537966  836.197138
4     Afghanistan      Asia  1972   36.088  13079460  739.981106
...           ...       ...   ...      ...       ...         ...
1699     Zimbabwe    Africa  1987   62.351   9216418  706.157306
1700     Zimbabwe    Africa  1992   60.377  10704340  693.420786
1701     Zimbabwe    Africa  1997   46.809  11404948  792.449960
1702     Zimbabwe    Africa  2002   39.989  11926563  672.038623
1703     Zimbabwe    Africa  2007   43.487  12311143  469.709298

[1704 rows x 6 columns]
'''
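The effect of sep can be checked without the data file. The sketch below is only an illustration: it reads a three-row, in-memory sample (rows copied from the output above, wrapped in io.StringIO as a stand-in for data/gapminder.tsv) once with sep='\t' and once with the default comma separator. The names tsv_text, sample, and wrong are made up for this example.

```python
import io

import pandas as pd

# Assumption: a tiny in-memory sample stands in for data/gapminder.tsv.
tsv_text = (
    "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n"
    "Afghanistan\tAsia\t1952\t28.801\t8425333\t779.445314\n"
    "Afghanistan\tAsia\t1957\t30.332\t9240934\t820.853030\n"
    "Zimbabwe\tAfrica\t2007\t43.487\t12311143\t469.709298\n"
)

# With the tab separator, every field becomes its own column.
sample = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# With the default separator (a comma), each line is read as one single field.
wrong = pd.read_csv(io.StringIO(tsv_text))

print(sample.shape)  # (3, 6)
print(wrong.shape)   # (3, 1)
```

A one-column result like the second one is a common sign that the separator does not match the file.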

2. Inspecting the Data with Common Attributes and Methods

type()

print(type(df))
'''
Output:
<class 'pandas.core.frame.DataFrame'>
'''

shape

print(df.shape)
'''
Output:
(1704, 6)
'''

columns

print(df.columns)
'''
Output:
Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')
'''

dtypes

print(df.dtypes)
'''
Output:
country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object
'''

info()

df.info()  # info() prints its report itself and returns None, so no print() is needed
'''
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   country    1704 non-null   object
 1   continent  1704 non-null   object
 2   year       1704 non-null   int64
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
'''
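The items above mix attributes (read without parentheses) and methods (called with parentheses). The following sketch, built on a made-up two-row frame named small that reuses three gapminder column names, shows what kind of Python object each one gives back. It is an illustration under those assumptions, not part of the original walkthrough.

```python
import pandas as pd

# Assumption: a hypothetical miniature frame instead of the full dataset.
small = pd.DataFrame(
    {
        "country": ["Afghanistan", "Zimbabwe"],
        "year": [1952, 2007],
        "lifeExp": [28.801, 43.487],
    }
)

# Attributes are read without parentheses.
n_rows, n_cols = small.shape                      # shape is a plain tuple
column_names = list(small.columns)                # columns is an Index object
dtype_names = small.dtypes.astype(str).to_dict()  # dtypes is a Series keyed by column name

# info() is a method: it prints a report and returns None.
result = small.info()

print(n_rows, n_cols)  # 2 3
print(column_names)    # ['country', 'year', 'lifeExp']
print(dtype_names)
print(result)          # None
```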

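3. Viewing a Subset of the Data

The two most important indexers for retrieving data:

df.loc[row label, column label]
df.iloc[row position, column position]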

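Selecting a Single Column

A DataFrame can be viewed as a collection of Series, one per column. Passing a column name in square brackets returns that column as a Series.

col = df['country']
print(col)
'''
Output: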
0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object
'''

Selecting Multiple Columns

To select several columns, pass a list of column names.

subset = df[['country','continent','year']]
print(subset)
'''
Output:
          country continent  year
0     Afghanistan      Asia  1952
1     Afghanistan      Asia  1957
2     Afghanistan      Asia  1962
3     Afghanistan      Asia  1967
4     Afghanistan      Asia  1972
...           ...       ...   ...
1699     Zimbabwe    Africa  1987
1700     Zimbabwe    Africa  1992
1701     Zimbabwe    Africa  1997
1702     Zimbabwe    Africa  2002
1703     Zimbabwe    Africa  2007

[1704 rows x 3 columns]
'''
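The difference between the two forms is the type of the result: a bare column name gives a Series, while a list of names gives a DataFrame, even when the list has a single element. A minimal sketch on a made-up two-row frame (the name small and its values are assumptions for illustration):

```python
import pandas as pd

# Assumption: a hypothetical two-row frame with gapminder-style columns.
small = pd.DataFrame(
    {
        "country": ["Afghanistan", "Zimbabwe"],
        "continent": ["Asia", "Africa"],
        "year": [1952, 2007],
    }
)

one_col = small["country"]             # a column name        -> Series
one_col_frame = small[["country"]]     # a one-element list   -> DataFrame with one column
two_cols = small[["country", "year"]]  # a list of two names  -> DataFrame with two columns

print(type(one_col).__name__)        # Series
print(type(one_col_frame).__name__)  # DataFrame
print(two_cols.shape)                # (2, 2)
```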

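Selecting Rows

By default, a DataFrame's row index is a numeric index, a sequence of integers starting at 0. In that case each row's label is the same as its position.

The following code retrieves the first row of df (position 0, and row label 0 as well).

print(df.loc[0])
'''
Output: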
country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap     779.445314
Name: 0, dtype: object
'''

iloc gives the same result:

print(df.iloc[0])
'''
Output:
country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap     779.445314
Name: 0, dtype: object
'''

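If you specify the row index explicitly when loading the data, row labels and row positions no longer coincide.

The following code retrieves the first row of the dataset, whose position is 0 and whose row label (index value) is Avatar.

movie = pd.read_csv('data/movie.csv', index_col='movie_title')
print(movie.loc['Avatar'])  # select by row label
'''
Output: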
color                                                                    Color
director_name                                                    James Cameron
num_critic_for_reviews                                                   723.0
duration                                                                 178.0
director_facebook_likes                                                    0.0
actor_3_facebook_likes                                                   855.0
actor_2_name                                                  Joel David Moore
... ...
country                                                                    USA
content_rating                                                           PG-13
budget                                                             237000000.0
title_year                                                              2009.0
actor_2_facebook_likes                                                   936.0
imdb_score                                                                 7.9
aspect_ratio                                                              1.78
movie_facebook_likes                                                     33000
Name: Avatar, dtype: object
'''

Selecting by position with iloc returns the same row:

print(movie.iloc[0])
'''
Output:
color                                                                    Color
director_name                                                    James Cameron
num_critic_for_reviews                                                   723.0
duration                                                                 178.0
director_facebook_likes                                                    0.0
actor_3_facebook_likes                                                   855.0
actor_2_name                                                  Joel David Moore
... ...
country                                                                    USA
content_rating                                                           PG-13
budget                                                             237000000.0
title_year                                                              2009.0
actor_2_facebook_likes                                                   936.0
imdb_score                                                                 7.9
aspect_ratio                                                              1.78
movie_facebook_likes                                                     33000
Name: Avatar, dtype: object
'''
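The same label-versus-position distinction can be reproduced without movie.csv. The sketch below uses a made-up two-row frame indexed by country name (frame, by_label, and by_position are illustrative names). It also shows that an integer is no longer a valid label once the index holds strings; the exact exception type is an assumption about the installed pandas version, so both plausible types are caught.

```python
import pandas as pd

# Assumption: a hypothetical frame whose row labels are strings, not 0..n-1.
frame = pd.DataFrame(
    {"continent": ["Asia", "Africa"], "lifeExp": [28.801, 43.487]},
    index=["Afghanistan", "Zimbabwe"],
)

by_label = frame.loc["Zimbabwe"]  # loc looks the row up by its label
by_position = frame.iloc[1]       # iloc looks the row up by its position

print(by_label.equals(by_position))  # True

# loc does not fall back to positions: 1 is not one of the row labels.
try:
    frame.loc[1]
    lookup_failed = False
except (KeyError, TypeError):
    lookup_failed = True

print(lookup_failed)  # True
```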

Because iloc selects rows by position, it accepts the same indexing forms as a Python list, including negative indices:

print(df.iloc[-1])
'''
Output:
country        Zimbabwe
continent        Africa
year               2007
lifeExp          43.487
pop            12311143
gdpPercap    469.709298
Name: 1703, dtype: object
'''
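Negative positions are specific to iloc. With loc, -1 is treated as a row label, and a default numeric index has no such label. A small sketch on a made-up three-row frame (small is an illustrative name):

```python
import pandas as pd

# Assumption: a hypothetical three-row frame with the default 0, 1, 2 index.
small = pd.DataFrame({"year": [1952, 1957, 2007]})

last = small.iloc[-1]  # position -1 means the last row
print(last["year"])    # 2007

# loc treats -1 as a label, and no row is labelled -1.
try:
    small.loc[-1]
    label_found = True
except KeyError:
    label_found = False

print(label_found)  # False
```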

Selecting Rows and Columns Together

df.loc[:,['country','year']]
'''
Output:
          country  year
0     Afghanistan  1952
1     Afghanistan  1957
2     Afghanistan  1962
3     Afghanistan  1967
4     Afghanistan  1972
...           ...   ...
1699     Zimbabwe  1987
1700     Zimbabwe  1992
1701     Zimbabwe  1997
1702     Zimbabwe  2002
1703     Zimbabwe  2007

[1704 rows x 2 columns]
'''

df.iloc[:,[0,2]]
'''
Output:
          country  year
0     Afghanistan  1952
1     Afghanistan  1957
2     Afghanistan  1962
3     Afghanistan  1967
4     Afghanistan  1972
...           ...   ...
1699     Zimbabwe  1987
1700     Zimbabwe  1992
1701     Zimbabwe  1997
1702     Zimbabwe  2002
1703     Zimbabwe  2007

[1704 rows x 2 columns]
'''
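The two calls above keep every row (the bare colon). Both indexers also accept a row selector in the first slot. The following sketch, on a made-up three-row frame, picks rows 0 and 2 together with two columns, and then a single cell; all variable names are assumptions for illustration.

```python
import pandas as pd

# Assumption: a hypothetical three-row frame with gapminder-style columns.
small = pd.DataFrame(
    {
        "country": ["Afghanistan", "Afghanistan", "Zimbabwe"],
        "continent": ["Asia", "Asia", "Africa"],
        "year": [1952, 1957, 2007],
    }
)

# Same selection written with labels (loc) and with positions (iloc).
by_label = small.loc[[0, 2], ["country", "year"]]
by_position = small.iloc[[0, 2], [0, 2]]
print(by_label.equals(by_position))  # True

# A single row label plus a single column label returns one value.
cell = small.loc[2, "year"]
print(cell)  # 2007
```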

Slicing Syntax

Get all rows for columns 3, 4, and 5 (positions are zero-based, and the slice end, 6, is excluded):

df.iloc[:,3:6]
'''
Output:
      lifeExp       pop   gdpPercap
0      28.801   8425333  779.445314
1      30.332   9240934  820.853030
2      31.997  10267083  853.100710
3      34.020  11537966  836.197138
4      36.088  13079460  739.981106
...       ...       ...         ...
1699   62.351   9216418  706.157306
1700   60.377  10704340  693.420786
1701   46.809  11404948  792.449960
1702   39.989  11926563  672.038623
1703   43.487  12311143  469.709298

[1704 rows x 3 columns]
'''

A slice can also take a step:

df.iloc[:,0:6:2]
'''
Output:
          country  year       pop
0     Afghanistan  1952   8425333
1     Afghanistan  1957   9240934
2     Afghanistan  1962  10267083
3     Afghanistan  1967  11537966
4     Afghanistan  1972  13079460
...           ...   ...       ...
1699     Zimbabwe  1987   9216418
1700     Zimbabwe  1992  10704340
1701     Zimbabwe  1997  11404948
1702     Zimbabwe  2002  11926563
1703     Zimbabwe  2007  12311143

[1704 rows x 3 columns]
'''
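One detail worth checking when slicing: iloc slices exclude the end position, as Python lists do, whereas loc slices include the end label. The sketch below illustrates this on two made-up frames (row holds the first gapminder row shown earlier, years is a hypothetical five-row frame).

```python
import pandas as pd

# Assumption: a one-row frame with the six gapminder columns.
row = pd.DataFrame(
    {
        "country": ["Afghanistan"],
        "continent": ["Asia"],
        "year": [1952],
        "lifeExp": [28.801],
        "pop": [8425333],
        "gdpPercap": [779.445314],
    }
)
# Assumption: a hypothetical five-row frame with the default 0..4 index.
years = pd.DataFrame({"year": [1952, 1957, 1962, 1967, 1972]})

# Position slices: the end position is excluded.
print(list(row.iloc[:, 3:6].columns))    # ['lifeExp', 'pop', 'gdpPercap']
print(list(row.iloc[:, 0:6:2].columns))  # ['country', 'year', 'pop']
print(len(years.iloc[0:3]))              # 3 (positions 0, 1, 2)

# Label slices: the end label is included.
print(list(row.loc[:, "year":"pop"].columns))  # ['year', 'lifeExp', 'pop']
print(len(years.loc[0:3]))                     # 4 (labels 0, 1, 2, 3)
```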

4. Simple Data Analysis

Average life expectancy for each year:

df.groupby('year')['lifeExp'].mean()
'''
Output:
year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64
'''
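What groupby does here can be spelled out by hand: split the rows by year, take the lifeExp values of each group, and average them. The sketch below compares the two on a made-up four-row frame. The round lifeExp values are invented for easy checking, and the boolean-mask filtering in manual is an illustration device that the article itself does not use.

```python
import pandas as pd

# Assumption: hypothetical round values, not real gapminder data.
small = pd.DataFrame(
    {
        "year": [1952, 1952, 2007, 2007],
        "lifeExp": [30.0, 40.0, 50.0, 70.0],
    }
)

# One line with groupby: split by year, select lifeExp, average each group.
grouped = small.groupby("year")["lifeExp"].mean()

# The same computation written out group by group.
manual = {
    year: small.loc[small["year"] == year, "lifeExp"].mean()
    for year in small["year"].unique()
}

print(grouped)
print(manual)
```

The result of groupby(...).mean() is a Series whose index is the grouping key, which is why the output above starts with a year header line.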

Average life expectancy, average population, and average GDP per capita for each year:

df.groupby('year')[['lifeExp','pop','gdpPercap']].mean()
'''
Output:
        lifeExp           pop     gdpPercap
year                                       
1952  49.057620  1.695040e+07   3725.276046
1957  51.507401  1.876341e+07   4299.408345
1962  53.609249  2.042101e+07   4725.812342
1967  55.678290  2.265830e+07   5483.653047
1972  57.647386  2.518998e+07   6770.082815
1977  59.570157  2.767638e+07   7313.166421
1982  61.533197  3.020730e+07   7518.901673
1987  63.212613  3.303857e+07   7900.920218
1992  64.160338  3.599092e+07   8158.608521
1997  65.014676  3.883947e+07   9090.175363
2002  65.694923  4.145759e+07   9917.848365
2007  67.007423  4.402122e+07  11680.071820
'''

Averages for each year and each continent:

df.groupby(by=['year','continent'])[['lifeExp','pop','gdpPercap']].mean()
'''
Output:
                  lifeExp           pop     gdpPercap
year continent                                       
1952 Africa     39.135500  4.570010e+06   1252.572466
     Americas   53.279840  1.380610e+07   4079.062552
     Asia       46.314394  4.228356e+07   5195.484004
     Europe     64.408500  1.393736e+07   5661.057435
     Oceania    69.255000  5.343003e+06  10298.085650
1957 Africa     41.266346  5.093033e+06   1385.236062
     Americas   55.960280  1.547816e+07   4616.043733
     Asia       49.318544  4.735699e+07   5787.732940
     Europe     66.703067  1.459635e+07   6963.012816
     Oceania    70.295000  5.970988e+06  11598.522455
... ...
2002 Africa     53.325231  1.603315e+07   2599.385159
     Americas   72.422040  3.399091e+07   9287.677107
     Asia       69.233879  1.091455e+08  10174.090397
     Europe     76.700600  1.927413e+07  21711.732422
     Oceania    79.740000  1.172741e+07  26938.778040
2007 Africa     54.806038  1.787576e+07   3089.032605
     Americas   73.608120  3.595485e+07  11003.031625
     Asia       70.728485  1.155138e+08  12473.026870
     Europe     77.648600  1.953662e+07  25054.481636
     Oceania    80.719500  1.227497e+07  29810.188275
'''
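Grouping by two keys produces a result with a two-level row index (year, then continent). The sketch below, on a made-up six-row frame with invented round values, shows one way to read a single group out of such a result with a (year, continent) tuple, and how reset_index() turns the index levels back into ordinary columns. The variable names are assumptions for illustration.

```python
import pandas as pd

# Assumption: hypothetical round values, not real gapminder data.
small = pd.DataFrame(
    {
        "year": [1952, 1952, 1952, 2007, 2007, 2007],
        "continent": ["Asia", "Asia", "Africa", "Asia", "Asia", "Africa"],
        "lifeExp": [30.0, 40.0, 35.0, 60.0, 70.0, 50.0],
    }
)

means = small.groupby(["year", "continent"])[["lifeExp"]].mean()
print(means.index.nlevels)  # 2

# A (year, continent) tuple addresses one group in the two-level index.
asia_1952 = means.loc[(1952, "Asia"), "lifeExp"]
print(asia_1952)  # 35.0

# reset_index() moves both index levels back into regular columns.
flat = means.reset_index()
print(list(flat.columns))  # ['year', 'continent', 'lifeExp']
```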

Number of countries in each continent:

df.groupby('continent')['country'].nunique()
'''
Output:
continent
Africa      52
Americas    25
Asia        33
Europe      30
Oceania      2
Name: country, dtype: int64
'''
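nunique() is used here rather than count() because each country appears once per survey year, so counting rows would overstate the number of countries. A minimal sketch on a made-up four-row frame (the rows are invented for illustration):

```python
import pandas as pd

# Assumption: a hypothetical frame in which Afghanistan appears twice.
small = pd.DataFrame(
    {
        "continent": ["Asia", "Asia", "Asia", "Africa"],
        "country": ["Afghanistan", "Afghanistan", "Japan", "Zimbabwe"],
    }
)

unique_countries = small.groupby("continent")["country"].nunique()  # distinct names per group
row_counts = small.groupby("continent")["country"].count()          # rows per group

print(unique_countries["Asia"])  # 2
print(row_counts["Asia"])        # 3
```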

5. Data Visualization

df.groupby('year')['lifeExp'].mean().plot()
# pass kind='bar' to plot() to draw a bar chart instead

[Figure: line chart of average life expectancy by year]
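As a sketch of the kind='bar' variant, the code below plots the first three yearly averages from the groupby output above as a bar chart. It assumes matplotlib is installed (pandas uses it for plotting), selects the non-interactive Agg backend so the script runs without a display, and renders into an in-memory buffer; yearly, ax, and buffer are names chosen for this example.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend: no window is opened

import pandas as pd

# First three yearly averages, copied from the groupby output above.
yearly = pd.Series(
    [49.057620, 51.507401, 53.609249],
    index=pd.Index([1952, 1957, 1962], name="year"),
    name="lifeExp",
)

# plot() returns a matplotlib Axes; kind="bar" draws one bar per index value.
ax = yearly.plot(kind="bar")
ax.set_ylabel("average lifeExp")

# Render the figure as PNG bytes; pass a file path instead to save it to disk.
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
print(len(ax.patches))  # 3
```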

Summary

A typical data analysis workflow:
load the data → basic data processing → data analysis → data visualization → draw conclusions
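The workflow can be sketched end to end on a tiny in-memory sample. Everything in the block is an assumption made for illustration: the three rows use placeholder country names A, B, C and invented round values, and idxmax() is just one simple way to read off a conclusion.

```python
import io

import pandas as pd

# Step 1, load the data (hypothetical tab-separated sample, not real data).
tsv_text = (
    "country\tcontinent\tyear\tlifeExp\n"
    "A\tAsia\t2007\t70.0\n"
    "B\tAsia\t2007\t60.0\n"
    "C\tAfrica\t2007\t50.0\n"
)
data = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Step 2, basic processing: keep only the columns the analysis needs.
subset = data[["continent", "lifeExp"]]

# Step 3, analysis: average life expectancy per continent.
means = subset.groupby("continent")["lifeExp"].mean()

# Step 4, visualization: means.plot(kind="bar") would chart the result
# (skipped here so that the sketch does not require matplotlib).

# Step 5, conclusion: the continent with the highest average in this sample.
highest = means.idxmax()
print(highest)  # Asia
```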
