Getting Started with Pandas, Part 1 (DataFrame + Series Creating/Reading/Writing, Index + Select + Assign)

Table of Contents

- 1. Creating, Reading and Writing
  - 1.1 DataFrame
  - 1.2 Series
  - 1.3 Reading data
- 2. Indexing, Selecting, Assigning
  - 2.1 Python-style access
  - 2.2 Pandas-specific accessors
    - 2.2.1 iloc: access by integer position
    - 2.2.2 loc: label-based access
  - 2.3 set_index(): setting the index column
  - 2.4 Conditional selection
    - 2.4.1 Boolean operators `&`, `|`, `==`
    - 2.4.2 Pandas built-ins `isin`, `isnull`, `notnull`
  - 2.5 Assigning data
    - 2.5.1 Assigning a constant
    - 2.5.2 Assigning an iterable of values

Based on the Kaggle Learn pandas course: https://www.kaggle.com/learn/pandas

Next post: Getting Started with Pandas, Part 2 (DataFunctions + Maps + groupby + sort_values)

1. Creating, Reading and Writing

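1.1 DataFrame

A DataFrame is a table. One way to create it is from a dictionary of the form key: [value_1, ..., value_n], where each key becomes a column name and each list holds that column's values.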

#%%
# -*- coding:utf-8 -*-
# @Python Version: 3.7
# @Time: 2020/5/16 21:10
# @Author: Michael Ming
# @Website: https://michael.blog.csdn.net/
# @File: pandasExercise.ipynb
# @Reference: https://www.kaggle.com/learn/pandas
import pandas as pd
#%%
pd.DataFrame({'Yes':[50,22],"No":[131,2]})

fruits = pd.DataFrame([[30, 21],[40, 22]], columns=['Apples', 'Bananas'])

The values in the dictionary can also be strings:

pd.DataFrame({"Michael":['handsome','good'],"Ming":['love basketball','coding']})

To give the rows custom index labels, pass index=['index1', 'index2', ...]:

pd.DataFrame({"Michael":['handsome','good'],"Ming":['love basketball','coding']},index=['people1 say','people2 say'])
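
As a minimal sketch of the dictionary form described above, the snippet below builds a small table and inspects its shape, column names, and row labels. The variable name `sales` and the values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
sales = pd.DataFrame(
    {"Apples": [30, 40], "Bananas": [21, 22]},  # key -> column name, list -> column values
    index=["2019", "2020"],                     # one label per row
)

print(sales.shape)          # (2, 2): two rows, two columns
print(list(sales.columns))  # ['Apples', 'Bananas']
print(list(sales.index))    # ['2019', '2020']
```

Each list must have the same length as the index, since every row needs one value per column.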

1.2 Series

A Series is a sequence of data values; you can think of it as a list.

pd.Series([5,2,0,1,3,1,4])
0    5
1    2
2    0
3    1
4    3
5    1
6    4
dtype: int64

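A Series can also carry index labels. Unlike a DataFrame, it has no column names, only a single overall name.
A DataFrame is essentially several Series glued together.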

pd.Series([30,40,50],index=['2018 Sales','2019 Sales','2020 Sales'],name='Blog Visits')
2018 Sales    30
2019 Sales    40
2020 Sales    50
Name: Blog Visits, dtype: int64
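
To illustrate the idea that a DataFrame is several Series glued together, here is a small sketch that joins two named Series side by side with pd.concat. The Series names and values are hypothetical; this is one of several ways to combine Series, not necessarily the one the course uses.

```python
import pandas as pd

# Hypothetical Series; the names and values are illustrative only.
apples = pd.Series([30, 40], index=["2019", "2020"], name="Apples")
bananas = pd.Series([21, 22], index=["2019", "2020"], name="Bananas")

# Concatenating along axis=1 places each Series side by side as a column,
# using each Series' name as the column name.
fruits_by_year = pd.concat([apples, bananas], axis=1)

# Selecting one column from the DataFrame gives back a Series.
col = fruits_by_year["Apples"]
print(type(col).__name__)            # Series
print(list(fruits_by_year.columns))  # ['Apples', 'Bananas']
```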

1.3 Reading data

Read a CSV ("Comma-Separated Values") file with pd.read_csv('file'), which loads it into a DataFrame:

wine_rev = pd.read_csv("winemag-data-130k-v2.csv")

wine_rev.shape # dimensions as (rows, columns)
(129971, 14)

wine_rev.head() # show the first 5 rows

You can choose which column serves as the index with index_col=, which accepts either a column's position or its name:

wine_rev = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
wine_rev.head()

(The result has one fewer column than before, because column 0 is now used as the index.)

To save a DataFrame to disk, use to_csv('xxx.csv'):

wine_rev.to_csv('XXX.csv')
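
The effect of index_col is easiest to see in a round trip. The sketch below writes a small, hypothetical stand-in for the wine data to a temporary file and reads it back both ways; the file name `reviews.csv` and the values are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the wine review data.
reviews = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US"], "points": [87, 87, 87]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "reviews.csv")
    reviews.to_csv(path)  # by default the index is written as the first column

    plain = pd.read_csv(path)                 # the saved index comes back as "Unnamed: 0"
    indexed = pd.read_csv(path, index_col=0)  # column 0 becomes the index again

print(plain.shape)    # (3, 3)
print(indexed.shape)  # (3, 2)
```

This is consistent with the wine data shapes in the text: the same file shows 14 columns when read plainly and one fewer when column 0 is used as the index.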

2. Indexing, Selecting, Assigning

2.1 Python-style access

item.col_name # drawback: cannot reach columns whose names contain spaces; the [] form can
item['col_name']

wine_rev.country
wine_rev['country']
0            Italy
1         Portugal
2               US
3               US
4               US
            ...   
129966     Germany
129967          US
129968      France
129969      France
129970      France
Name: country, Length: 129971, dtype: object

wine_rev['country'][0]   # 'Italy': select the column first, then the row
wine_rev.country[1]  # 'Portugal'
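
A short sketch of the difference between the two access styles, using a hypothetical frame in which one column name contains a space:

```python
import pandas as pd

# Hypothetical frame with one column name that contains a space.
df = pd.DataFrame({"country": ["Italy", "US"], "taster name": ["Kerin", "Paul"]})

by_attr = df.country        # works: "country" is a valid Python identifier
by_key = df["country"]      # bracket access works for any column name
spaced = df["taster name"]  # dot syntax cannot express this name, brackets can

print(by_attr.equals(by_key))  # True
print(spaced.tolist())         # ['Kerin', 'Paul']
```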

2.2 Pandas-specific accessors

2.2.1 iloc: access by integer position

To select the first row of the DataFrame, use:
wine_rev.iloc[0]

country                                                              Italy
description              Aromas include tropical fruit, broom, brimston...
designation                                                   Vulkà Bianco
points                                                                  87
price                                                                  NaN
province                                                 Sicily & Sardinia
region_1                                                              Etna
region_2                                                               NaN
taster_name                                                  Kerin O’Keefe
taster_twitter_handle                                         @kerinokeefe
title                                    Nicosia 2013 Vulkà Bianco  (Etna)
variety                                                        White Blend
winery                                                             Nicosia
Name: 0, dtype: object

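Both loc and iloc take the row first and the column second, the reverse of the Python-style access above (column first, then row).

wine_rev.iloc[:, 0] gets the first column; a bare : means "all rows".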

0            Italy
1         Portugal
2               US
3               US
4               US
            ...   
129966     Germany
129967          US
129968      France
129969      France
129970      France
Name: country, Length: 129971, dtype: object

wine_rev.iloc[:3, 0]: the slice :3 is the half-open range [0, 3), i.e. rows 0, 1, 2.

0       Italy
1    Portugal
2          US
Name: country, dtype: object

You can also pass a list of positions to pick non-contiguous rows: wine_rev.iloc[[1,2],0]

1    Portugal
2          US
Name: country, dtype: object

To take the last few rows, use wine_rev.iloc[-5:], which runs from the fifth-to-last row to the end.
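
The iloc patterns above can be collected in one runnable sketch. The frame below is hypothetical and deliberately uses string row labels, to make it clear that iloc looks only at positions and ignores labels.

```python
import pandas as pd

# Hypothetical frame with string row labels; iloc ignores them.
df = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US", "France"], "points": [87, 87, 90, 92]},
    index=["a", "b", "c", "d"],
)

first_row = df.iloc[0]       # Series holding the first row
first_col = df.iloc[:, 0]    # all rows, first column
top_two = df.iloc[:2, 0]     # positions 0 and 1 (end excluded)
picked = df.iloc[[1, 3], 0]  # positions 1 and 3
last_two = df.iloc[-2:]      # last two rows, all columns

print(top_two.tolist())      # ['Italy', 'Portugal']
print(picked.tolist())       # ['Portugal', 'France']
print(list(last_two.index))  # ['c', 'd']
```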

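2.2.2 loc: label-based access

wine_rev.loc[0, 'country']: the row selector can also be a list such as [0,1] for non-contiguous rows; columns must be given by label, not by integer position.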

'Italy'

wine_rev.loc[:3, 'country']: unlike iloc, this includes row 3, because loc slices include the end label.

0       Italy
1    Portugal
2          US
3          US
Name: country, dtype: object

wine_rev.loc[1:3, ['country','points']]: to select multiple columns, wrap them in a list.

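An advantage of loc: when the rows have a string index, a label slice such as df.loc['Apples':'Potatoes'] selects every row from 'Apples' through 'Potatoes', both ends included.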
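
A small sketch of label-based selection, including the inclusive end of loc slices. The `stock` and `numbers` frames are hypothetical examples, not part of the wine data.

```python
import pandas as pd

# Hypothetical produce table indexed by name.
stock = pd.DataFrame(
    {"price": [1.2, 0.5, 2.0, 0.8], "quantity": [10, 25, 7, 40]},
    index=["Apples", "Bananas", "Cherries", "Potatoes"],
)

# Label slices include both endpoints.
middle = stock.loc["Bananas":"Cherries"]
# Rows and columns can each be a list of labels.
subset = stock.loc[["Apples", "Potatoes"], ["price"]]

print(list(middle.index))  # ['Bananas', 'Cherries']
print(subset.shape)        # (2, 1)

# With a default integer index, loc[:3] returns 4 rows while iloc[:3] returns 3.
numbers = pd.DataFrame({"x": [10, 20, 30, 40, 50]})
print(len(numbers.loc[:3]), len(numbers.iloc[:3]))  # 4 3
```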

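2.3 set_index(): setting the index column

set_index() replaces the index with the given column, for example wine_rev.set_index("title"). By default it returns a new DataFrame and leaves the original unchanged.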
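
A minimal sketch of set_index() on a hypothetical miniature of the wine data (the titles "Wine A" and "Wine B" are made up for the example):

```python
import pandas as pd

# Hypothetical miniature of the wine data.
wines = pd.DataFrame(
    {"title": ["Wine A", "Wine B"], "country": ["Italy", "US"], "points": [87, 90]}
)

by_title = wines.set_index("title")  # new DataFrame; wines itself is unchanged

print(by_title.index.name)               # title
print(list(by_title.columns))            # ['country', 'points']
print(by_title.loc["Wine B", "points"])  # 90
print("title" in wines.columns)          # True
```

Once the title is the index, rows can be looked up by title with loc, as the third print shows.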

2.4 Conditional selection

2.4.1 Boolean operators `&`, `|`, `==`

wine_rev.country == 'US' tests each row's country and produces a Series of True/False values, which can then be passed to loc.

0         False
1         False
2          True
3          True
4          True
          ...  
129966    False
129967     True
129968    False
129969    False
129970    False
Name: country, Length: 129971, dtype: bool
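
As a sketch of how such a boolean Series is used, the example below filters a small hypothetical frame with loc and combines conditions with & and |. The data and the threshold of 90 points are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature of the wine data.
wines = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US", "US"], "points": [87, 87, 90, 85]}
)

is_us = wines.country == "US"  # boolean Series, one value per row
us_wines = wines.loc[is_us]    # keeps only the rows where the mask is True

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
good_us = wines.loc[(wines.country == "US") & (wines.points >= 90)]
us_or_italy = wines.loc[(wines.country == "US") | (wines.country == "Italy")]

print(is_us.tolist())    # [False, False, True, True]
print(len(us_wines))     # 2
print(len(good_us))      # 1
print(len(us_or_italy))  # 3
```

The parentheses matter because & and | bind more tightly than == and >= in Python.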