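Contents: Introduction to pandas; Why use pandas; DataFrame; DataFrame attributes; DataFrame indexes; MultiIndex; Series; Indexing; Assignment; Sorting; DataFrame operations; Arithmetic operations; Logical operations; Logical operators < > | &; Logical functions query() and isin(); Statistical operations; Custom operations; Plotting with pandas; Reading and writing CSV files; Reading and writing HDF5 files; Reading and writing JSON files; Summary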
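Introduction to pandas
pandas = panel + data + analysis, that is, panel data analysis.
"Panel" refers to panel data, an econometrics term for three-dimensional data.
pandas is built on NumPy and takes advantage of NumPy's fast numerical computation.
It builds on matplotlib, which makes plotting simple.
It provides its own data structures.
Why use pandas
Convenient data processing.
Easy file reading.
It wraps the plotting and computation capabilities of matplotlib and NumPy.
DataFrame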
import pandas as pd
import numpy as np
stock_change = np.random.normal(0, 1, (10, 5))
stock_change
array([[ 0.52652359, -0.42210135,  0.45506419, -0.1319933 , -0.85892243],
       [-2.80978824,  0.68502373, -0.72809275, -1.56716962,  0.24278934],
       [ 0.1423945 , -0.14913827, -0.30118759,  0.80841083,  0.56448585],
       [-1.11053808, -0.91833131, -0.82696531,  0.33592674, -1.81590623],
       [-0.7972349 , -0.38960542, -0.64822525, -1.67732846, -1.1320404 ],
       [-0.83075257, -0.96589613,  1.21458607, -0.54116531,  0.5416992 ],
       [ 0.2346827 ,  0.38728822,  0.5534352 ,  0.49615629,  0.03958449],
       [ 1.32743523,  0.8559906 , -0.35473279, -0.40734067,  0.23585156],
       [ 2.217162  ,  0.43897264,  1.39278121, -0.17076621,  1.25111371],
       [-1.84123059, -1.00666366,  2.07583716,  1.03959872,  1.20092384]])
stock_change1 = pd.DataFrame(stock_change)
stock_change1
          0         1         2         3         4
0 -0.230423 -0.108677  2.116127 -0.405135 -0.600457
1  1.422377 -1.136674 -0.462335  0.795195 -0.013265
2  0.708261 -0.197826 -0.177992 -1.078743  0.357987
3 -0.325432  0.264337  0.856580 -1.035939 -0.228252
4  0.016734  1.007554  0.454911  0.252380 -0.691905
5 -0.471790  0.557541 -0.703171  0.344268 -0.083205
6 -0.013339 -0.300371  1.424916  0.028338  1.101670
7  0.061438 -0.802730 -0.746614 -0.919655 -1.336464
8  0.369274  0.515427  0.661126 -0.550260 -1.560633
9 -1.087217 -1.164305 -0.408748  1.198835 -0.389584
stock_code = ['股票{}'.format(i + 1) for i in range(10)]
stock_code
pd.DataFrame(stock_change, index=stock_code)
              0         1         2         3         4
股票1  -1.796149  0.063469  0.922334 -0.338207  2.157024
股票2  -0.064218  0.969453  0.223896 -0.795105 -2.020499
股票3  -0.039286  0.046665 -0.408812 -0.284145  1.852426
股票4  -1.811617  0.588799 -1.020581 -0.421300 -1.068160
股票5  -0.867187  0.070269  0.362412  0.595810  0.005319
股票6  -2.384285  0.185213 -0.094201  0.559706  1.156052
股票7   1.231396  0.226930 -0.284544  1.056286 -0.765503
股票8   1.451832 -0.518495  0.115510  0.578233  0.174324
股票9   1.184461 -0.327693 -1.405433  1.480470  0.049133
股票10  0.891309  0.780864 -0.858295 -1.154474  0.127319
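pd.date_range(start=None, end=None, periods=None, freq=None)
start: start date
end: end date
periods: number of periods to generate
freq: step between consecutive dates; the default is one calendar day ('D'), while 'B' steps by business day and skips weekends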
date = pd.date_range(start='20231021', end=None, periods=5, freq='B')
date
DatetimeIndex(['2023-10-23', '2023-10-24', '2023-10-25', '2023-10-26','2023-10-27'],dtype='datetime64[ns]', freq='B')
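To make the effect of freq more concrete, the following minimal sketch compares a calendar-day range with a business-day range. Both start on 2023-10-21, which is a Saturday; the variable names are illustrative only.

```python
import pandas as pd

# 2023-10-21 is a Saturday, so the two frequencies diverge immediately.
calendar_days = pd.date_range(start="2023-10-21", periods=5, freq="D")
business_days = pd.date_range(start="2023-10-21", periods=5, freq="B")

print(calendar_days[0].date())  # the start date itself
print(business_days[0].date())  # the first business day on or after the start
```

With freq='B' the Saturday start is rolled forward to the following Monday, which matches the DatetimeIndex printed above.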
stock_c = pd.DataFrame(stock_change, index=stock_code, columns=date)
stock_c
       2023-10-23  2023-10-24  2023-10-25  2023-10-26  2023-10-27
股票1    -1.796149    0.063469    0.922334   -0.338207    2.157024
股票2    -0.064218    0.969453    0.223896   -0.795105   -2.020499
股票3    -0.039286    0.046665   -0.408812   -0.284145    1.852426
股票4    -1.811617    0.588799   -1.020581   -0.421300   -1.068160
股票5    -0.867187    0.070269    0.362412    0.595810    0.005319
股票6    -2.384285    0.185213   -0.094201    0.559706    1.156052
股票7     1.231396    0.226930   -0.284544    1.056286   -0.765503
股票8     1.451832   -0.518495    0.115510    0.578233    0.174324
股票9     1.184461   -0.327693   -1.405433    1.480470    0.049133
股票10    0.891309    0.780864   -0.858295   -1.154474    0.127319
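DataFrame attributes
obj.shape: the shape as (rows, columns)
obj.index: the row index
obj.columns: the column index
obj.values: the underlying values as a NumPy array
obj.T: the transpose (rows and columns swapped)
obj.head(): the first rows, 5 by default
obj.tail(): the last rows, 5 by default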
stock_c.shape
(10, 5)
stock_c.index
Index(['股票1', '股票2', '股票3', '股票4', '股票5', '股票6', '股票7', '股票8', '股票9', '股票10'], dtype='object')
stock_c.columns
DatetimeIndex(['2023-10-23', '2023-10-24', '2023-10-25', '2023-10-26','2023-10-27'],dtype='datetime64[ns]', freq='B')
stock_c.values
array([[-1.7961491 ,  0.06346948,  0.92233413, -0.33820729,  2.15702396],
       [-0.06421753,  0.96945298,  0.22389647, -0.79510515, -2.02049945],
       [-0.03928641,  0.04666511, -0.40881248, -0.28414454,  1.85242648],
       [-1.81161734,  0.5887991 , -1.02058093, -0.42130023, -1.06816   ],
       [-0.86718681,  0.07026887,  0.36241195,  0.59581008,  0.00531913],
       [-2.38428482,  0.18521273, -0.09420118,  0.55970591,  1.15605167],
       [ 1.23139579,  0.22693018, -0.28454449,  1.05628637, -0.76550258],
       [ 1.45183169, -0.51849484,  0.11550995,  0.57823283,  0.17432416],
       [ 1.18446114, -0.3276933 , -1.40543347,  1.48046993,  0.04913251],
       [ 0.89130874,  0.78086438, -0.85829505, -1.15447368,  0.12731851]])
stock_c.T
                 股票1       股票2       股票3       股票4       股票5       股票6       股票7       股票8       股票9      股票10
2023-10-23 -1.796149 -0.064218 -0.039286 -1.811617 -0.867187 -2.384285  1.231396  1.451832  1.184461  0.891309
2023-10-24  0.063469  0.969453  0.046665  0.588799  0.070269  0.185213  0.226930 -0.518495 -0.327693  0.780864
2023-10-25  0.922334  0.223896 -0.408812 -1.020581  0.362412 -0.094201 -0.284544  0.115510 -1.405433 -0.858295
2023-10-26 -0.338207 -0.795105 -0.284145 -0.421300  0.595810  0.559706  1.056286  0.578233  1.480470 -1.154474
2023-10-27  2.157024 -2.020499  1.852426 -1.068160  0.005319  1.156052 -0.765503  0.174324  0.049133  0.127319
stock_c.head()
       2023-10-23  2023-10-24  2023-10-25  2023-10-26  2023-10-27
股票1    -1.796149    0.063469    0.922334   -0.338207    2.157024
股票2    -0.064218    0.969453    0.223896   -0.795105   -2.020499
股票3    -0.039286    0.046665   -0.408812   -0.284145    1.852426
股票4    -1.811617    0.588799   -1.020581   -0.421300   -1.068160
股票5    -0.867187    0.070269    0.362412    0.595810    0.005319
stock_c.tail()
       2023-10-23  2023-10-24  2023-10-25  2023-10-26  2023-10-27
股票6    -2.384285    0.185213   -0.094201    0.559706    1.156052
股票7     1.231396    0.226930   -0.284544    1.056286   -0.765503
股票8     1.451832   -0.518495    0.115510    0.578233    0.174324
股票9     1.184461   -0.327693   -1.405433    1.480470    0.049133
股票10    0.891309    0.780864   -0.858295   -1.154474    0.127319
DataFrame indexes
Changing row and column index values
stock_c.index = [f'股票_{i + 1}' for i in range(10)]
stock_c
        2023-10-23  2023-10-24  2023-10-25  2023-10-26  2023-10-27
股票_1    -1.796149    0.063469    0.922334   -0.338207    2.157024
股票_2    -0.064218    0.969453    0.223896   -0.795105   -2.020499
股票_3    -0.039286    0.046665   -0.408812   -0.284145    1.852426
股票_4    -1.811617    0.588799   -1.020581   -0.421300   -1.068160
股票_5    -0.867187    0.070269    0.362412    0.595810    0.005319
股票_6    -2.384285    0.185213   -0.094201    0.559706    1.156052
股票_7     1.231396    0.226930   -0.284544    1.056286   -0.765503
股票_8     1.451832   -0.518495    0.115510    0.578233    0.174324
股票_9     1.184461   -0.327693   -1.405433    1.480470    0.049133
股票_10    0.891309    0.780864   -0.858295   -1.154474    0.127319
Resetting the index
reset_index() moves the current index into a column named index and replaces it with a default integer index.
stock_c.reset_index()
    index  2023-10-23 00:00:00  2023-10-24 00:00:00  2023-10-25 00:00:00  2023-10-26 00:00:00  2023-10-27 00:00:00
0   股票_1            -1.796149             0.063469             0.922334            -0.338207             2.157024
1   股票_2            -0.064218             0.969453             0.223896            -0.795105            -2.020499
2   股票_3            -0.039286             0.046665            -0.408812            -0.284145             1.852426
3   股票_4            -1.811617             0.588799            -1.020581            -0.421300            -1.068160
4   股票_5            -0.867187             0.070269             0.362412             0.595810             0.005319
5   股票_6            -2.384285             0.185213            -0.094201             0.559706             1.156052
6   股票_7             1.231396             0.226930            -0.284544             1.056286            -0.765503
7   股票_8             1.451832            -0.518495             0.115510             0.578233             0.174324
8   股票_9             1.184461            -0.327693            -1.405433             1.480470             0.049133
9  股票_10             0.891309             0.780864            -0.858295            -1.154474             0.127319
stock_c.reset_index(drop=True)
   2023-10-23  2023-10-24  2023-10-25  2023-10-26  2023-10-27
0   -1.796149    0.063469    0.922334   -0.338207    2.157024
1   -0.064218    0.969453    0.223896   -0.795105   -2.020499
2   -0.039286    0.046665   -0.408812   -0.284145    1.852426
3   -1.811617    0.588799   -1.020581   -0.421300   -1.068160
4   -0.867187    0.070269    0.362412    0.595810    0.005319
5   -2.384285    0.185213   -0.094201    0.559706    1.156052
6    1.231396    0.226930   -0.284544    1.056286   -0.765503
7    1.451832   -0.518495    0.115510    0.578233    0.174324
8    1.184461   -0.327693   -1.405433    1.480470    0.049133
9    0.891309    0.780864   -0.858295   -1.154474    0.127319
Setting a column as the new index
df = pd.DataFrame({'year': [2021, 2021, 2023, 2024], 'month': [1, 2, 3, 4], 'sale': [22, 100, 222, 113]})
df
   year  month  sale
0  2021      1    22
1  2021      2   100
2  2023      3   222
3  2024      4   113
df.index
RangeIndex(start=0, stop=4, step=1)
df.set_index(keys=['year'])
      month  sale
year
2021      1    22
2021      2   100
2023      3   222
2024      4   113
new_df = df.set_index(keys=['year', 'month'], drop=False)
new_df
            year  month  sale
year month
2021 1      2021      1    22
     2      2021      2   100
2023 3      2023      3   222
2024 4      2024      4   113
new_df.index
MultiIndex([(2021, 1),(2021, 2),(2023, 3),(2024, 4)],names=['year', 'month'])
MultiIndex
new_df.index.names
FrozenList(['year', 'month'])
tuples = [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index
MultiIndex([('bar', 'one'),('bar', 'two'),('baz', 'one'),('baz', 'two'),('foo', 'one'),('foo', 'two'),('qux', 'one'),('qux', 'two')],names=['first', 'second'])
pd.Series(np.random.randn(8), index=index)
first  second
bar    one      -0.816907
       two       0.660782
baz    one      -1.032361
       two      -0.595878
foo    one      -0.658145
       two      -0.891936
qux    one       0.385722
       two      -0.192622
dtype: float64
arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
# A separate name keeps the earlier df (year/month/sale) available for the later examples.
df_arrays = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df_arrays
                0         1         2         3
bar one -0.162790  2.799107  1.070652  0.034360
    two -0.283814 -0.551970 -1.270871 -0.813390
baz one  0.422166  1.380131  0.593804  0.776062
    two  1.888835 -0.176970 -0.568067 -1.343601
foo one -0.532914  1.206831 -0.367705  0.912403
    two -1.576118 -0.082882 -0.122176  1.521598
qux one -0.074543 -0.359237  0.309770  0.895598
    two  0.905186  0.670022 -1.549954 -0.539559
pd.DataFrame(np.random.randn(8, 4), index=index)
                     0         1         2         3
first second
bar   one    -1.208274 -0.810972 -1.820593 -0.833156
      two    -1.501657  0.683875  0.923321 -0.710930
baz   one    -0.008496 -3.645099  2.125764  1.406796
      two    -0.440605  0.645926 -1.640536  1.002207
foo   one     0.264713  0.182264 -1.410930  0.837404
      two     0.683733 -0.300426  1.281374  0.440129
qux   one    -0.179653 -0.331090 -0.817277  0.583263
      two    -0.305134 -0.934428 -0.479319 -0.179533
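MultiIndex.from_arrays(): takes a list of arrays, one per level
MultiIndex.from_tuples(): takes a list of tuples, one per index entry
MultiIndex.from_product(): takes several iterables and builds the index from their Cartesian product
MultiIndex.from_frame(): takes a DataFrame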
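Only from_tuples and the array form are demonstrated above, so here is a small sketch of the other two constructors. The year/month/sale values are reused from the earlier example; the variable names are illustrative.

```python
import pandas as pd

# from_product: every combination of the two level lists (4 x 2 = 8 entries).
idx_product = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"]],
    names=["first", "second"],
)

# from_frame: each row of the DataFrame becomes one index entry,
# and the column names become the level names.
frame = pd.DataFrame({"year": [2021, 2021, 2023, 2024], "month": [1, 2, 3, 4]})
idx_frame = pd.MultiIndex.from_frame(frame)

sales = pd.Series([22, 100, 222, 113], index=idx_frame, name="sale")
print(sales.loc[(2021, 2)])  # select one value with a full (year, month) key
print(sales.loc[2021])       # select all months of 2021 via the first level
```

Selecting with a full tuple returns a single value, while selecting with only the first level returns a Series indexed by the remaining level.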
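Series
obj[flag1][flag2][flag3]: chained indexing, column first and then row
obj.loc[]: row first and then column, by label; slicing is supported
obj.iloc[]: row first and then column, by integer position
new_df['year'][2021][1]
2021
df.loc[0:4, 'sale']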
0 22
1 100
2 222
3 113
Name: sale, dtype: int64
df.iloc[0:3, :5]
   year  month  sale
0  2021      1    22
1  2021      2   100
2  2023      3   222
new_df.iloc[0:3, :5]
            year  month  sale
year month
2021 1      2021      1    22
     2      2021      2   100
2023 3      2023      3   222
sr = pd.Series(np.arange(2, 10, 2), index=['数值{}'.format(i + 1) for i in range(4)])
sr
数值1 2
数值2 4
数值3 6
数值4 8
dtype: int32
sr.values
array([2, 4, 6, 8])
sr.index
Index(['数值1', '数值2', '数值3', '数值4'], dtype='object')
Indexing
import numpy as np
import pandas as pd
mydata = np.random.normal(0, 1, (5, 5))
mydata_index = ['index{}'.format(i + 1) for i in range(5)]
mydata_col = ['col{}'.format(i + 1) for i in range(5)]
data = pd.DataFrame(mydata, index=mydata_index, columns=mydata_col)
data
            col1      col2      col3      col4      col5
index1  0.178961  0.849560 -0.077123 -0.550173 -0.821073
index2 -0.479774 -0.986681 -0.934725  0.010318 -0.736170
index3 -0.384807 -0.636485  0.056328 -1.383175 -0.451370
index4 -0.770427 -1.009373 -0.283575 -0.923803 -1.502639
index5  0.068687 -0.361269  1.827731  0.034858  1.239907
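Direct indexing (column first, then row)
data['col1']['index1']
-0.31201088599026405
(The scalar outputs in this section come from a different random run than the table above, which is why the numbers differ.)
Indexing by label
data.loc['index1']['col1']
-0.31201088599026405
data.loc['index1', 'col1']
-0.31201088599026405
data.loc[['index1', 'index2'], 'col1']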
index1 0.178961
index2 -0.479774
Name: col1, dtype: float64
Indexing by position
data.iloc[1, 0]
-0.2269501796329433
data.iloc[:4, :1]
            col1
index1  0.178961
index2 -0.479774
index3 -0.384807
index4 -0.770427
Assignment
data['col1'] = 0.01
data
        col1      col2      col3      col4      col5
index1  0.01  0.849560 -0.077123 -0.550173 -0.821073
index2  0.01 -0.986681 -0.934725  0.010318 -0.736170
index3  0.01 -0.636485  0.056328 -1.383175 -0.451370
index4  0.01 -1.009373 -0.283575 -0.923803 -1.502639
index5  0.01 -0.361269  1.827731  0.034858  1.239907
data.col1 = 0.02
data
        col1      col2      col3      col4      col5
index1  0.02  0.849560 -0.077123 -0.550173 -0.821073
index2  0.02 -0.986681 -0.934725  0.010318 -0.736170
index3  0.02 -0.636485  0.056328 -1.383175 -0.451370
index4  0.02 -1.009373 -0.283575 -0.923803 -1.502639
index5  0.02 -0.361269  1.827731  0.034858  1.239907
# data.col1.index1 = 0.1 is chained assignment and does not update data when Copy-on-Write is enabled; use .loc instead.
data.loc['index1', 'col1'] = 0.1
data
        col1      col2      col3      col4      col5
index1  0.10  0.849560 -0.077123 -0.550173 -0.821073
index2  0.02 -0.986681 -0.934725  0.010318 -0.736170
index3  0.02 -0.636485  0.056328 -1.383175 -0.451370
index4  0.02 -1.009373 -0.283575 -0.923803 -1.502639
index5  0.02 -0.361269  1.827731  0.034858  1.239907
# data['col1']['index2'] = 0.3 is also chained assignment; the single .loc call below writes into data directly.
data.loc['index2', 'col1'] = 0.3
data
        col1      col2      col3      col4      col5
index1  0.10  0.849560 -0.077123 -0.550173 -0.821073
index2  0.30 -0.986681 -0.934725  0.010318 -0.736170
index3  0.02 -0.636485  0.056328 -1.383175 -0.451370
index4  0.02 -1.009373 -0.283575 -0.923803 -1.502639
index5  0.02 -0.361269  1.827731  0.034858  1.239907
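Chained forms such as frame['col1']['index2'] = value go through an intermediate object, and in newer pandas versions with Copy-on-Write they do not update the original frame. The following sketch, on a small made-up frame, shows the single-call alternatives with loc and iloc.

```python
import pandas as pd

frame = pd.DataFrame(
    {"col1": [0.02, 0.02, 0.02], "col2": [1.0, 2.0, 3.0]},
    index=["index1", "index2", "index3"],
)

# Label-based: one indexing call, so pandas writes into frame itself.
frame.loc["index1", "col1"] = 0.1

# Position-based equivalent for the second row, first column.
frame.iloc[1, 0] = 0.3

# Conditional assignment: a boolean mask selects the rows to overwrite.
frame.loc[frame["col2"] > 2.5, "col1"] = 0.0

print(frame)
```

The same pattern extends to whole slices, for example frame.loc[:, "col1"] = 0.0 to overwrite an entire column.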
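Sorting
Sorting by values
obj.sort_values(by=, key=, ascending=): sorts by one column or by several columns given in by; key is an optional function applied to the values before sorting; ascending=True (the default) sorts in ascending order and ascending=False in descending order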
data.sort_values(by=['col1'], ascending=False)
        col1      col2      col3      col4      col5
index2  0.30 -0.986681 -0.934725  0.010318 -0.736170
index1  0.10  0.849560 -0.077123 -0.550173 -0.821073
index3  0.02 -0.636485  0.056328 -1.383175 -0.451370
index4  0.02 -1.009373 -0.283575 -0.923803 -1.502639
index5  0.02 -0.361269  1.827731  0.034858  1.239907
data.sort_values(by=['col1', 'col2'], ascending=False)
        col1      col2      col3      col4      col5
index2  0.30 -0.986681 -0.934725  0.010318 -0.736170
index1  0.10  0.849560 -0.077123 -0.550173 -0.821073
index5  0.02 -0.361269  1.827731  0.034858  1.239907
index3  0.02 -0.636485  0.056328 -1.383175 -0.451370
index4  0.02 -1.009373 -0.283575 -0.923803 -1.502639
sr = data['col1']
sr
index1 0.10
index2 0.30
index3 0.02
index4 0.02
index5 0.02
Name: col1, dtype: float64
sr.sort_values()
index3 0.02
index4 0.02
index5 0.02
index1 0.10
index2 0.30
Name: col1, dtype: float64
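The key parameter of sort_values is listed above but not demonstrated. The sketch below uses a small made-up frame (the names and scores are illustrative) to show key, plus a per-column ascending list. It assumes pandas 1.1 or later, where key was introduced.

```python
import pandas as pd

frame = pd.DataFrame(
    {"name": ["bob", "Alice", "carol"], "score": [-3.0, 1.0, 2.0]}
)

# key receives the whole column as a Series and must return a Series of the same length.
by_name = frame.sort_values(by="name", key=lambda s: s.str.lower())

# Sort by absolute value, largest first.
by_abs = frame.sort_values(by="score", key=lambda s: s.abs(), ascending=False)

# Mixed directions: one ascending flag per column in by.
mixed = frame.sort_values(by=["score", "name"], ascending=[False, True])

print(by_name["name"].tolist())
print(by_abs["score"].tolist())
```

Without key, the uppercase "Alice" would still sort first here, but only because uppercase letters precede lowercase ones; lowering the case makes the ordering independent of capitalization.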
Sorting by index
data.sort_index()
        col1      col2      col3      col4      col5
index1  0.10  0.849560 -0.077123 -0.550173 -0.821073
index2  0.30 -0.986681 -0.934725  0.010318 -0.736170
index3  0.02 -0.636485  0.056328 -1.383175 -0.451370
index4  0.02 -1.009373 -0.283575 -0.923803 -1.502639
index5  0.02 -0.361269  1.827731  0.034858  1.239907
sr.sort_index()
index1 0.10
index2 0.30
index3 0.02
index4 0.02
index5 0.02
Name: col1, dtype: float64
DataFrame operations
Arithmetic operations
data.col1 + 2
index1 2.10
index2 2.30
index3 2.02
index4 2.02
index5 2.02
Name: col1, dtype: float64
data.col1.add(3)
index1 3.10
index2 3.30
index3 3.02
index4 3.02
index5 3.02
Name: col1, dtype: float64
data.sub(10).head(2)
        col1       col2       col3       col4       col5
index1  -9.9  -9.150440 -10.077123 -10.550173 -10.821073
index2  -9.7 -10.986681 -10.934725  -9.989682 -10.736170
data.col1.sub(data.col2).head(3)
index1 -0.749560
index2 1.286681
index3 0.656485
dtype: float64
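When both operands are Series, as in data.col1.sub(data.col2), pandas matches values by index label rather than by position. The following sketch, using two made-up Series, shows what happens when the labels only partly overlap and how the method form's fill_value argument changes the result.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# The operator aligns on labels; "x" has no partner in b, so the result there is NaN.
plain = a + b

# The method form accepts fill_value, used where only one side has the label.
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```

This is the main practical difference between the operators (+, -) and the methods (add, sub): the methods expose extra options such as fill_value.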
Logical operations
Logical operators < > | &
data.col1 > 0.1
index1 False
index2 True
index3 False
index4 False
index5 False
Name: col1, dtype: bool
data[data.col1 > 0.1]
        col1      col2      col3      col4     col5
index2   0.3 -0.986681 -0.934725  0.010318 -0.73617
data[(data.col1 < 0.1) & (data.col2 < 0.1)]
        col1      col2      col3      col4      col5
index3  0.02 -0.636485  0.056328 -1.383175 -0.451370
index4  0.02 -1.009373 -0.283575 -0.923803 -1.502639
index5  0.02 -0.361269  1.827731  0.034858  1.239907
Logical functions query() and isin()
data.query('col1 < 0.1 & col2 < 0.1')
        col1      col2      col3      col4      col5
index3  0.02 -0.636485  0.056328 -1.383175 -0.451370
index4  0.02 -1.009373 -0.283575 -0.923803 -1.502639
index5  0.02 -0.361269  1.827731  0.034858  1.239907
data[data.col1.isin([0.02, 0.01])]
        col1      col2      col3      col4      col5
index3  0.02 -0.636485  0.056328 -1.383175 -0.451370
index4  0.02 -1.009373 -0.283575 -0.923803 -1.502639
index5  0.02 -0.361269  1.827731  0.034858  1.239907
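As a small complement, this sketch shows that a boolean mask and query() select the same rows, that query() can refer to a Python variable with @, and that ~ negates an isin() mask. The frame and the threshold variable are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    {"col1": [0.10, 0.30, 0.02, 0.02], "col2": [0.85, -0.99, -0.64, -1.01]},
    index=["index1", "index2", "index3", "index4"],
)

threshold = 0.1  # hypothetical cutoff, referenced inside query() with @

mask_result = frame[(frame["col1"] < threshold) & (frame["col2"] < threshold)]
query_result = frame.query("col1 < @threshold and col2 < @threshold")

# ~ negates a boolean mask: rows whose col1 is NOT in the list.
not_in = frame[~frame["col1"].isin([0.02])]

print(mask_result.equals(query_result))
print(not_in.index.tolist())
```

The parentheses around each comparison in the mask form are required, because & binds more tightly than < in Python.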
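Statistical operations
Statistical functions: count, mean, std, min, max, var, prod, mode, abs, idxmax, idxmin
idxmax and idxmin return the index label of the maximum and minimum value; they are similar to NumPy's argmax and argmin, which return integer positions.
obj.describe() returns the count, mean, standard deviation, minimum, quartiles, and maximum in one call.
Cumulative functions: cumsum, cummax, cummin, and cumprod compute the running sum, maximum, minimum, and product over the first n values.
data.max()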
col1 0.300000
col2 0.849560
col3 1.827731
col4 0.034858
col5 1.239907
dtype: float64
data.describe()
           col1      col2      col3      col4      col5
count  5.000000  5.000000  5.000000  5.000000  5.000000
mean   0.092000 -0.428850  0.117727 -0.562395 -0.454269
std    0.121326  0.763248  1.028901  0.610155  1.022660
min    0.020000 -1.009373 -0.934725 -1.383175 -1.502639
25%    0.020000 -0.986681 -0.283575 -0.923803 -0.821073
50%    0.020000 -0.636485 -0.077123 -0.550173 -0.736170
75%    0.100000 -0.361269  0.056328  0.010318 -0.451370
max    0.300000  0.849560  1.827731  0.034858  1.239907
data.col1.cumsum()
index1 0.10
index2 0.40
index3 0.42
index4 0.44
index5 0.46
Name: col1, dtype: float64
data.col1.cumsum().plot()
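To illustrate the difference between a label and a position, the sketch below compares idxmax with NumPy's argmax on a short Series that reuses three of the col1 values, and also shows cummax next to cumsum.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.10, 0.30, 0.02], index=["index1", "index2", "index3"])

label_of_max = s.idxmax()                       # index label of the largest value
position_of_max = int(np.argmax(s.to_numpy()))  # integer position of the largest value

running_max = s.cummax()  # largest value seen so far at each step
running_sum = s.cumsum()  # sum of all values up to each step

print(label_of_max, position_of_max)
print(running_max.tolist())
```

On a DataFrame, idxmax() returns one label per column by default.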
Custom operations
obj.apply(func, axis=0)
func: the user-defined function
axis: 0 (the default) applies func to each column; 1 applies it to each row
data[['col1', 'col2']].apply(lambda x: x.max() - x.min(), axis=0)
col1 0.280000
col2 1.858932
dtype: float64
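The example above only uses axis=0. As a rough sketch on a tiny made-up frame, the same function can be applied along both axes; value_range is a hypothetical helper name.

```python
import pandas as pd

frame = pd.DataFrame(
    {"col1": [1.0, 5.0], "col2": [4.0, 2.0]}, index=["index1", "index2"]
)

def value_range(x):
    """Return the spread (max minus min) of a Series."""
    return x.max() - x.min()

per_column = frame.apply(value_range, axis=0)  # one result per column
per_row = frame.apply(value_range, axis=1)     # one result per row

print(per_column)
print(per_row)
```

With axis=0 the function receives each column as a Series and the result is indexed by column name; with axis=1 it receives each row and the result is indexed by row label.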
Plotting with pandas
pandas.DataFrame.plot(x=None, y=None, kind='line')
kind options: line (line plot), bar (bar chart), barh (horizontal bar chart), hist (histogram), pie (pie chart), scatter (scatter plot)
pandas.Series.plot
import pandas as pd
import numpy as np
mydata = np.random.normal(0, 1, (5, 5))
mydata_index = ['index{}'.format(i + 1) for i in range(5)]
mydata_col = ['col{}'.format(i + 1) for i in range(5)]
data = pd.DataFrame(mydata, index=mydata_index, columns=mydata_col)
data
            col1      col2      col3      col4      col5
index1  0.706740  1.059931 -0.290975  0.480027  0.869103
index2 -0.461089  2.278285  0.118369 -0.141536 -1.054914
index3  0.871724 -1.184708 -0.729994 -0.291118  0.606099
index4 -0.300855 -0.784571 -1.815973  0.791439  0.861675
index5  1.380419  1.675737  0.400070  0.130281  0.501257
data.plot()
data.plot(x='col1', y='col2', kind='barh')
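Reading and writing CSV files
pandas.read_csv(filepath_or_buffer, sep=',', header='infer', names=None, usecols=None)
DataFrame.to_csv(path_or_buf=None, sep=',', na_rep='', index=True, header=True, mode='w', encoding=None)
path_or_buf: path or file object to write the CSV to
sep: column separator, a comma by default
na_rep: representation of missing values, an empty string by default
index: whether to write the row index, True by default
header: whether to write the column names, True by default
mode: write mode; the default 'w' overwrites the file, and 'a' appends to it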
read_data = pd.read_csv('E:/Project/PyCharm_Projects/pandas_test/read.csv', encoding='GBK')
read_data
data = {'Name': ['Alice', 'Bob', 'Carol'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
# mode='a' appends to the existing file; header=False avoids writing the column names again
df.to_csv('E:/Project/PyCharm_Projects/pandas_test/output.csv', index=False, mode='a', header=False)
df_read = pd.read_csv('E:/Project/PyCharm_Projects/pandas_test/output.csv')
df_read
    Name  Age
0  Alice   25
1    Bob   30
2  Carol   35
3  Alice   25
4    Bob   30
5  Carol   35
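The effect of the index argument is easiest to see without touching the disk. In this sketch, to_csv is called without a path so that it returns the CSV text, and read_csv reads that text back through io.StringIO; the variable names are illustrative.

```python
import io
import pandas as pd

people = pd.DataFrame({"Name": ["Alice", "Bob", "Carol"], "Age": [25, 30, 35]})

# With no path, to_csv returns the CSV text instead of writing a file.
with_index = people.to_csv()                # index=True is the default
without_index = people.to_csv(index=False)

print(with_index.splitlines()[0])     # header begins with an empty index column
print(without_index.splitlines()[0])  # header holds only the column names

# Read back a single column, treating the text as a file-like object.
ages = pd.read_csv(io.StringIO(without_index), usecols=["Age"])
print(ages)
```

Writing with index=False is usually what you want when the index is just the default 0..n-1 range, since otherwise reading the file back produces an extra unnamed column.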
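Reading and writing HDF5 files
pandas.read_hdf(path_or_buf, key=None, **kwargs)
path_or_buf: file path
key: the key to read
mode: the mode in which to open the file
Returns: the selected object
DataFrame.to_hdf(path_or_buf, key, **kwargs)
HDF5 stores data as key-value pairs and can also hold three-dimensional data. It is cross-platform, supports compression, and saves space.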
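No HDF5 example is given above, so here is a minimal round-trip sketch. It assumes the optional PyTables package (installed as tables) may or may not be present and falls back to a message when it is missing; the file name and the key "prices" are hypothetical.

```python
import os
import tempfile
import pandas as pd

frame = pd.DataFrame({"col1": [1.0, 2.0], "col2": [3.0, 4.0]})

restored = None
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.h5")  # hypothetical file name
    try:
        # Each object is stored under a key, so one file can hold several frames.
        frame.to_hdf(path, key="prices", mode="w")
        restored = pd.read_hdf(path, key="prices")
    except ImportError:
        # PyTables is not installed, so pandas cannot read or write HDF5.
        print("HDF5 support requires the optional 'tables' package")

print(restored)
```

Because every object lives under its own key, writing a second frame to the same file with a different key and mode='a' keeps both.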
Reading and writing JSON files
pandas.read_json(path_or_buf, orient=None, typ='frame', lines=False)
Converts JSON data into a pandas DataFrame by default.
orient: the layout of the JSON data; 'records' is the usual choice
lines: whether to treat each line as one JSON object
DataFrame.to_json(path_or_buf=None, orient=None, lines=False)
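The following sketch shows what orient='records' and lines=True produce, using a small made-up frame. As with CSV, to_json returns a string when no path is given, and read_json can read that string back through io.StringIO.

```python
import io
import pandas as pd

frame = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# orient="records": a JSON array with one object per row.
as_records = frame.to_json(orient="records")

# lines=True (only valid with orient="records"): one JSON object per line.
as_lines = frame.to_json(orient="records", lines=True)

restored = pd.read_json(io.StringIO(as_lines), orient="records", lines=True)

print(as_records)
print(as_lines)
```

The line-delimited form is convenient for logs and streaming, because each line can be parsed on its own.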
Summary
pandas basic data handling
Introduction to pandas: panel data; a data processing tool; convenient data processing; integrates NumPy and matplotlib; easy file reading
Series: one-dimensional data; DataFrame: two-dimensional data
Series attributes: index, values
DataFrame attributes: shape, index, columns, values, T
Common DataFrame methods: head(), tail()
MultiIndex: a way to store higher-dimensional data
Basic pandas operations
Indexing: direct indexing (column first, then row), by label with loc, by position with iloc
Assignment
Sorting: sort_values(), sort_index()
pandas operations
Arithmetic operations
Logical operations: logical operators, boolean indexing, query(), isin()
Statistical operations: describe(), min, max, std, idxmax, idxmin, cumsum, cummax
Custom operations: apply()
Plotting with pandas
pandas I/O
CSV: pd.read_csv(path, names, usecols), DataFrame.to_csv(path, header, mode, index)
HDF5: pd.read_hdf(path, key), DataFrame.to_hdf(path, key)
JSON: pd.read_json(path, orient, lines), DataFrame.to_json(path, orient, lines)