选取数据子集
简书大神SeanCheney的译作,我作了些格式调整和文章目录结构的变化,更适合自己阅读,以后翻阅是更加方便自己查找吧
import pandas as pd
import numpy as np
设定最大列数和最大行数
pd.set_option('max_columns',5 , 'max_rows', 5)
1 选取Series数据
college = pd.read_csv('data/college.csv', index_col='INSTNM')
city = college['CITY']
1.1 iloc整数选取
标量
city.iloc[3]
'Huntsville'
多行 - 整数列表
city.iloc[[10,20,30]]
INSTNM
Birmingham Southern College Birmingham
George C Wallace State Community College-Hanceville Hanceville
Judson College Marion
Name: CITY, dtype: object
多行 - 整数切片
city.iloc[4:50:10]
INSTNM
Alabama State University Montgomery
Enterprise State Community College Enterprise
Heritage Christian University Florence
Marion Military Institute Marion
Reid State Technical College Evergreen
Name: CITY, dtype: object
1.2 loc标签选取
标量
city.loc['Heritage Christian University']
'Florence'
多行 - 标签列表
随机选择4个标签
labels = list(np.random.choice(city.index, 4))
labels
['University of St Thomas','Paul Mitchell the School-Woodbridge','San Francisco Conservatory of Music','Trinity Bible College']
city.loc[labels]
INSTNM
University of St Thomas Saint Paul
Paul Mitchell the School-Woodbridge Woodbridge
San Francisco Conservatory of Music San Francisco
Trinity Bible College Ellendale
Name: CITY, dtype: object
多行 - 标签切片
city.loc['Alabama State University':'Reid State Technical College':10]
INSTNM
Alabama State University Montgomery
Enterprise State Community College Enterprise
Heritage Christian University Florence
Marion Military Institute Marion
Reid State Technical College Evergreen
Name: CITY, dtype: object
也可以切片逆序选取
city.loc['Reid State Technical College':'Alabama State University':-10]
INSTNM
Reid State Technical College Evergreen
Marion Military Institute Marion
Heritage Christian University Florence
Enterprise State Community College Enterprise
Alabama State University Montgomery
Name: CITY, dtype: object
2 选取DataFrame数据
2.1 选取DataFrame的行
注意iloc是整数索引,loc是标签索引即可
2.2 同时选取DataFrame的行和列
2.2.1 选取前3行和前4列
college.iloc[:3, :4]
CITY | STABBR | HBCU | MENONLY | |
---|---|---|---|---|
INSTNM | ||||
Alabama A & M University | Normal | AL | 1.0 | 0.0 |
University of Alabama at Birmingham | Birmingham | AL | 0.0 | 0.0 |
Amridge University | Montgomery | AL | 0.0 | 0.0 |
college.loc[:'Amridge University', :'MENONLY']
CITY | STABBR | HBCU | MENONLY | |
---|---|---|---|---|
INSTNM | ||||
Alabama A & M University | Normal | AL | 1.0 | 0.0 |
University of Alabama at Birmingham | Birmingham | AL | 0.0 | 0.0 |
Amridge University | Montgomery | AL | 0.0 | 0.0 |
2.2.2 选取两列的所有的行
college.iloc[:, [4,6]].head()
WOMENONLY | SATVRMID | |
---|---|---|
INSTNM | ||
Alabama A & M University | 0.0 | 424.0 |
University of Alabama at Birmingham | 0.0 | 570.0 |
Amridge University | 0.0 | NaN |
University of Alabama in Huntsville | 0.0 | 595.0 |
Alabama State University | 0.0 | 425.0 |
college.loc[:, ['WOMENONLY', 'SATVRMID']]
WOMENONLY | SATVRMID | |
---|---|---|
INSTNM | ||
Alabama A & M University | 0.0 | 424.0 |
University of Alabama at Birmingham | 0.0 | 570.0 |
... | ... | ... |
Bay Area Medical Academy - San Jose Satellite Location | NaN | NaN |
Excel Learning Center-San Antonio South | NaN | NaN |
7535 rows × 2 columns
2.2.3 选取不连续的行和列
college.iloc[[100, 200], [7, 15]]
SATMTMID | UGDS_NHPI | |
---|---|---|
INSTNM | ||
GateWay Community College | NaN | 0.0029 |
American Baptist Seminary of the West | NaN | NaN |
rows = ['GateWay Community College', 'American Baptist Seminary of the West']
columns = ['SATMTMID', 'UGDS_NHPI']
college.loc[rows, columns]
SATMTMID | UGDS_NHPI | |
---|---|---|
INSTNM | ||
GateWay Community College | NaN | 0.0029 |
American Baptist Seminary of the West | NaN | NaN |
2.2.4 选取一个标量值
college.iloc[5, -4]
0.401
college.loc['The University of Alabama', 'PCTFLOAN']
0.401
2.2.5 行切片选取一列
college.iloc[90:80:-2, 5]
INSTNM
Empire Beauty School-Flagstaff 0
Charles of Italy Beauty College 0
Central Arizona College 0
University of Arizona 0
Arizona State University-Tempe 0
Name: RELAFFIL, dtype: int64
start = 'Empire Beauty School-Flagstaff'
stop = 'Arizona State University-Tempe'
college.loc[start:stop:-2, 'RELAFFIL']
INSTNM
Empire Beauty School-Flagstaff 0
Charles of Italy Beauty College 0
Central Arizona College 0
University of Arizona 0
Arizona State University-Tempe 0
Name: RELAFFIL, dtype: int64
3 用整数和标签的互换
用索引方法get_loc,找到指定列的整数位置
col_start = college.columns.get_loc('UGDS_WHITE')
col_end = college.columns.get_loc('UGDS_UNKN') + 1
col_start, col_end
(10, 19)
college.iloc[:5, col_start:col_end]
UGDS_WHITE | UGDS_BLACK | ... | UGDS_NRA | UGDS_UNKN | |
---|---|---|---|---|---|
INSTNM | |||||
Alabama A & M University | 0.0333 | 0.9353 | ... | 0.0059 | 0.0138 |
University of Alabama at Birmingham | 0.5922 | 0.2600 | ... | 0.0179 | 0.0100 |
Amridge University | 0.2990 | 0.4192 | ... | 0.0000 | 0.2715 |
University of Alabama in Huntsville | 0.6988 | 0.1255 | ... | 0.0332 | 0.0350 |
Alabama State University | 0.0158 | 0.9208 | ... | 0.0243 | 0.0137 |
5 rows × 9 columns
获得整数行对应的标签名
row_end = college.index[4]
col_start,col_end = college.columns[10],college.columns[19]
col_start,col_end
('UGDS_WHITE', 'PPTUG_EF')
college.loc[:row_end, col_start:col_end]
UGDS_WHITE | UGDS_BLACK | ... | UGDS_UNKN | PPTUG_EF | |
---|---|---|---|---|---|
INSTNM | |||||
Alabama A & M University | 0.0333 | 0.9353 | ... | 0.0138 | 0.0656 |
University of Alabama at Birmingham | 0.5922 | 0.2600 | ... | 0.0100 | 0.2607 |
Amridge University | 0.2990 | 0.4192 | ... | 0.2715 | 0.4536 |
University of Alabama in Huntsville | 0.6988 | 0.1255 | ... | 0.0350 | 0.2146 |
Alabama State University | 0.0158 | 0.9208 | ... | 0.0137 | 0.0892 |
5 rows × 10 columns
4 切片操作
4.1 惰性行切片
惰性,我的理解就就是省去loc和iloc吧,但是惰性切片不能用于列,只能用于DataFrame的行和Series,也不能同时选取行和列。
Series选取数据
city = college['CITY']
city[10:20:2]
INSTNM
Birmingham Southern College Birmingham
Concordia College Alabama Selma
Enterprise State Community College Enterprise
Faulkner University Montgomery
New Beginning College of Cosmetology Albertville
Name: CITY, dtype: object
dataframe行选取
college[10:20:2]
CITY | STABBR | ... | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|
INSTNM | |||||
Birmingham Southern College | Birmingham | AL | ... | 44200 | 27000 |
Concordia College Alabama | Selma | AL | ... | 19900 | PrivacySuppressed |
Enterprise State Community College | Enterprise | AL | ... | 24600 | 8273 |
Faulkner University | Montgomery | AL | ... | 37200 | 22000 |
New Beginning College of Cosmetology | Albertville | AL | ... | NaN | 5500 |
5 rows × 26 columns
Series和DataFrame都可以用标签进行切片。
start = 'Mesa Community College'
stop = 'Spokane Community College'
college[start:stop:1500]
CITY | STABBR | ... | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|
INSTNM | |||||
Mesa Community College | Mesa | AZ | ... | 35200 | 8000 |
Hair Academy Inc-New Carrollton | New Carrollton | MD | ... | 15200 | 9666 |
National College of Natural Medicine | Portland | OR | ... | NaN | PrivacySuppressed |
3 rows × 26 columns
4.2 按照字母切片
需先对索引进行排序
college = college.sort_index()
college.head()
CITY | STABBR | ... | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|
INSTNM | |||||
A & W Healthcare Educators | New Orleans | LA | ... | NaN | 19022.5 |
A T Still University of Health Sciences | Kirksville | MO | ... | 219800 | PrivacySuppressed |
ABC Beauty Academy | Garland | TX | ... | NaN | PrivacySuppressed |
ABC Beauty College Inc | Arkadelphia | AR | ... | PrivacySuppressed | 16500 |
AI Miami International University of Art and Design | Miami | FL | ... | 29900 | 31000 |
5 rows × 26 columns
选取字母顺序在‘Sp’和‘Su’之间的学校
college.loc['Sp':'Su']
CITY | STABBR | ... | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|
INSTNM | |||||
Spa Tech Institute-Ipswich | Ipswich | MA | ... | 21500 | 6333 |
Spa Tech Institute-Plymouth | Plymouth | MA | ... | 21500 | 6333 |
... | ... | ... | ... | ... | ... |
Styles and Profiles Beauty College | Selmer | TN | ... | PrivacySuppressed | PrivacySuppressed |
Styletrends Barber and Hairstyling Academy | Rock Hill | SC | ... | PrivacySuppressed | 9495.5 |
201 rows × 26 columns