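Contents
- Introduction
- 1. Basic Data Structures
- 1.1. Series: Initialization and Basic Operations
- 1.2. DataFrame: Initialization and Basic Operations
- 1.2.1. Initialization and Persistence
- 1.2.2. Reading and Inspecting
- 1.2.3. Row Operations
- 1.2.4. Column Operations
- 1.2.5. Selection and Filtering
- 2. Data Preprocessing
- 2.0. Generating a Sample Table
- 2.1. Handling Missing Values
- 2.2. Type Conversion and Sorting
- 2.3. Statistical Analysis
- 3. Pivoting
- 3.0. Generating a Sample Table
- 3.1. Building a Pivot Table
- 4. Data Reshaping
- 4.1. Hierarchical Indexing
- 4.1.1. Series with a Two-Level Index
- 4.1.2. DataFrame with a Two-Level Index
- 4.2. Discretization
- 4.2.1. Grouped Operations
- 4.2.2. Binning into Ranked Labels
- 4.3. Combining Datasets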
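Introduction
Pandas (Python Data Analysis Library) is a data analysis toolkit built on NumPy. It bundles a number of libraries and standard data models and provides the tools needed to work efficiently with large datasets.
In the function signatures below, the arguments shown are the default values, and functions that return nothing are not demonstrated in assignment form.
1. Basic Data Structures
1.1. Series: Initialization and Basic Operations
A Series in Pandas is a one-dimensional array that can hold values of different types, similar to a NumPy array or a Python list.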
s = pd.Series(data, index, dtype, name, copy)
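- data: array-like or dict. For an array-like, its positions (0, 1, 2, ...) become the index s.index and its elements become the values s.values. For a dict, the keys become s.index and the values become s.values.
- index: array-like. Sets s.index explicitly. If data is array-like, the labels are matched to the elements in order. If data is a dict, values are looked up by label, and any label that is not a key of the dict gets np.nan.
- dtype: a data type or its string name. Sets the type of the Series, s.dtype.
- name: any hashable object (a string, a number, a tuple, and so on). Sets the name of the Series, s.name. Defaults to None.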
import pandas as pd
import numpy as np

# Build a Series with the default row labels, then inspect its values and index
s1 = pd.Series([-1, 0.7, False, np.nan])
'''
0 -1
1 0.7
2 False
3 NaN
dtype: object
'''
print(s1.values) # [-1 0.7 False nan]
print(s1.index) # RangeIndex(start=0, stop=4, step=1)

# Set the row labels, the Series name, and the index name
s2 = pd.Series([-1, 0.7, False, np.nan], index=list('abcd'), name='demo')
s2.index.name = 'index'
'''
index
a -1
b 0.7
c False
d NaN
Name: demo, dtype: object
'''
print(s2.values) # [-1 0.7 False nan]
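print(s2.index) # Index(['a', 'b', 'c', 'd'], dtype='object', name='index')

# Look up a value by label
print(s1[3]) # nan
print(s2['d']) # nan

# Slicing: positional slices exclude the stop position, label slices include both endpoints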
print(s1[::2])
'''
0 -1
2 False
dtype: object
'''
print(s2['a':'c'])
'''
a -1
b 0.7
c False
Name: demo, dtype: object
'''
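The rule for combining a dict with an explicit index is easy to misread, so here is a minimal sketch of it. The dict contents and the names prices, s_all and s_sel are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the dict keys become the index labels.
prices = {'a': 1.0, 'b': 2.0, 'c': 3.0}

# Without `index`, the dict keys are used as s.index.
s_all = pd.Series(prices)

# With `index`, values are looked up by label: labels missing from the
# dict ('d') become NaN, and keys that are not requested ('a') are dropped.
s_sel = pd.Series(prices, index=['b', 'c', 'd'])

print(s_sel)
```

In other words, index acts as a label-based selection on a dict, whereas for an array-like it simply relabels the elements in order.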
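1.2. DataFrame: Initialization and Basic Operations
A DataFrame in Pandas is a two-dimensional tabular data structure, similar to data.frame in R. It can be thought of as a container of Series.
1.2.1. Initialization and Persistence
import pandas as pd
import numpy as np

# 1. Build a table from a dict; scalar values are broadcast to every row
df = pd.DataFrame({
    'A': pd.Timestamp('20250110'),
    'B': pd.Series(0, index=list(range(4))),
    'C': np.array([1] * 4),
    'D': 2,
    'E': pd.Categorical(['test', 'train', 'test', 'train']),
})
'''
A B C D E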
0 2025-01-10 0 1 2 test
1 2025-01-10 0 1 2 train
2 2025-01-10 0 1 2 test
3 2025-01-10 0 1 2 train
'''

# Generate a date range
date = pd.date_range('20250110', periods=5)
'''
DatetimeIndex(['2025-01-10', '2025-01-11', '2025-01-12', '2025-01-13',
               '2025-01-14'],
              dtype='datetime64[ns]', freq='D')
'''

# 2. Build a table indexed by the date range
df = pd.DataFrame(np.random.randn(5, 3), index=date, columns=list('xyz'))
'''
x y z
2025-01-10 -0.274766 -0.593336 0.724735
2025-01-11 1.552149 -0.300292 0.061253
2025-01-12 0.411908 -0.470191 -0.893243
2025-01-13 -1.328169 -0.999390 -0.081419
2025-01-14 0.683950 -0.483677 -2.019955
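'''

# Save the table; the date index is written as the first CSV column
df.to_csv(r'E:\Pycharm_Python\course\demo.csv')
1.2.2. Reading and Inspecting
# Read the table, using the first column as the row index
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)

# View the first and last rows (5 rows by default)
print(df.head(2))
'''
x y z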
2025-01-10 -0.274766 -0.593336 0.724735
2025-01-11 1.552149 -0.300292 0.061253
'''
print(df.tail(1))
'''
x y z
2025-01-14 0.68395 -0.483677 -2.019955
'''

# View the column dtypes
print(df.dtypes)
'''
x float64
y float64
z float64
dtype: object
'''

# View the row labels (read back from CSV as plain strings)
print(df.index)
'''Index(['2025-01-10', '2025-01-11', '2025-01-12', '2025-01-13', '2025-01-14'], dtype='object')'''

# View the column labels
print(df.columns)
'''Index(['x', 'y', 'z'], dtype='object')'''

# View the underlying data
print(df.values)
'''
[[-0.27476573 -0.59333579 0.72473541]
 [ 1.55214904 -0.30029235 0.06125304]
 [ 0.41190756 -0.47019098 -0.89324306]
 [-1.32816905 -0.99938983 -0.08141872]
 [ 0.6839496 -0.48367661 -2.01995517]]
'''
1.2.3. Row Operations
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)

# Select a single row by position
print(df.iloc[1])
'''
x 1.552149
y -0.300292
z 0.061253
Name: 2025-01-11, dtype: float64
'''

# Select multiple rows
# print(df.loc['2025-01-13':'2025-01-14'])  # label slice; the labels must match the stored strings
print(df.iloc[3:5])
'''
x y z
2025-01-13 -1.328169 -0.999390 -0.081419
2025-01-14 0.683950 -0.483677 -2.019955
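'''

# Append a single row (DataFrame.append was removed in pandas 2.0, so use pd.concat)
new_row = pd.Series(dict(zip('xyz', np.random.randn(3))), name='2025-01-15')
df = pd.concat([df, new_row.to_frame().T])
'''
x y z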
2025-01-10 -0.274766 -0.593336 0.724735
2025-01-11 1.552149 -0.300292 0.061253
2025-01-12 0.411908 -0.470191 -0.893243
2025-01-13 -1.328169 -0.999390 -0.081419
2025-01-14 0.683950 -0.483677 -2.019955
2025-01-15 1.205855 0.841471 0.843053
'''

# Delete a single row by label
df = df.drop(['2025-01-15'])
'''
x y z
2025-01-10 -0.274766 -0.593336 0.724735
2025-01-11 1.552149 -0.300292 0.061253
2025-01-12 0.411908 -0.470191 -0.893243
2025-01-13 -1.328169 -0.999390 -0.081419
2025-01-14 0.683950 -0.483677 -2.019955
'''
1.2.4. Column Operations
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)

# Select a single column
print(df['x'])
'''
2025-01-10 -0.274766
2025-01-11 1.552149
2025-01-12 0.411908
2025-01-13 -1.328169
2025-01-14 0.683950
Name: x, dtype: float64
'''

# Select multiple columns
print(df[['x', 'z']])
'''
x z
2025-01-10 -0.274766 0.724735
2025-01-11 1.552149 0.061253
2025-01-12 0.411908 -0.893243
2025-01-13 -1.328169 -0.081419
2025-01-14 0.683950 -2.019955
'''

# Add a single column
df['p'] = np.random.rand(5)
'''
x y z p
2025-01-10 -0.274766 -0.593336 0.724735 0.070785
2025-01-11 1.552149 -0.300292 0.061253 0.034027
2025-01-12 0.411908 -0.470191 -0.893243 0.446612
2025-01-13 -1.328169 -0.999390 -0.081419 0.545531
2025-01-14 0.683950 -0.483677 -2.019955 0.261958
'''

# Delete a single column
df = df.drop('p', axis=1)
print(df)
'''
x y z
2025-01-10 -0.274766 -0.593336 0.724735
2025-01-11 1.552149 -0.300292 0.061253
2025-01-12 0.411908 -0.470191 -0.893243
2025-01-13 -1.328169 -0.999390 -0.081419
2025-01-14 0.683950 -0.483677 -2.019955
'''
1.2.5. Selection and Filtering
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)

# Select a single value
print(df.loc['2025-01-12', 'y'])
'''-0.47019098075848'''

# Select a region
# print(df.loc[['2025-01-12', '2025-01-13'], ['x', 'z']])
print(df[['x', 'z']][2:4])
'''
x z
2025-01-12 0.411908 -0.893243
2025-01-13 -1.328169 -0.081419
'''

# Evaluate a condition (returns a boolean Series)
print(df['x'] > 0)
'''
2025-01-10 False
2025-01-11 True
2025-01-12 True
2025-01-13 False
2025-01-14 True
Name: x, dtype: bool
'''

# Filter rows by a combined condition
print(df[(df['x'] > 0) & (df['z'] > 0)])
'''
x y z
2025-01-11 1.552149 -0.300292 0.061253
'''

# Filter by a condition, then slice the result by position
print(df[df['x'] > 0][1:4])
'''
x y z
2025-01-12 0.411908 -0.470191 -0.893243
2025-01-14 0.683950 -0.483677 -2.019955
'''
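The selections above chain two indexing steps, for example df[['x', 'z']][2:4]. As a rough sketch, the same results can usually be written as a single .loc or .iloc call. The frame df_demo and its values are invented here so that the result is reproducible.

```python
import pandas as pd

# Hypothetical frame with fixed values (not the random table used above).
df_demo = pd.DataFrame(
    {'x': [-0.3, 1.5, 0.4, -1.3, 0.7],
     'z': [0.7, 0.1, -0.9, -0.1, -2.0]},
    index=['2025-01-10', '2025-01-11', '2025-01-12',
           '2025-01-13', '2025-01-14'],
)

# Row condition and column selection in one .loc call.
positive_x = df_demo.loc[df_demo['x'] > 0, ['x']]

# .loc slices by label and includes both endpoints ...
by_label = df_demo.loc['2025-01-12':'2025-01-13', ['x', 'z']]
# ... while .iloc slices by position and excludes the stop position.
by_position = df_demo.iloc[2:4]

print(positive_x)
print(by_label)
```

A single call also makes later assignment safer, because it acts on the original table rather than on an intermediate copy.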
2. Data Preprocessing
2.0. Generating a Sample Table
import pandas as pd
import numpy as np

# Generate a date range
date = pd.date_range('20250110', periods=5)

# Build the table
df = pd.DataFrame(np.random.randn(5, 3), index=date, columns=list('xyz'))

# Generate a random boolean mask
mask = np.random.randint(0, 2, df.shape, dtype='bool')

# Set the masked cells to missing
df[pd.DataFrame(mask, index=df.index, columns=df.columns)] = np.nan

# Save the table
df.to_csv(r'E:\Pycharm_Python\course\demo_nan.csv')
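2.1. Handling Missing Values
- isnull(): returns a boolean table of the same shape as the original, with True at the positions of missing values and False elsewhere.
- fillna(value, inplace=False): returns the table with missing values filled with value. With inplace=True the original table is modified in place and the method returns None (the same applies to the functions below).
- replace(to_replace, value, inplace=False): returns the table with every occurrence of to_replace replaced by value. When the first two arguments are lists, the replacements are applied pairwise.
- dropna(axis=0, inplace=False): returns the table with the records along the given axis that contain missing values removed (axis=0 drops rows, axis=1 drops columns).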
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo_nan.csv', index_col=0)
'''
x y z
2025-01-10 0.606495 NaN 0.456811
2025-01-11 NaN 0.743876 NaN
2025-01-12 0.024458 0.733735 NaN
2025-01-13 0.306332 NaN -0.586894
2025-01-14 NaN NaN NaN
'''

# Detect missing values
print(df.isnull())
'''
x y z
2025-01-10 False True False
2025-01-11 True False True
2025-01-12 False False True
2025-01-13 False True False
2025-01-14 True True True
'''

# Fill missing values (here: column x with its mean)
print(df['x'].fillna(df['x'].mean()))
'''
2025-01-10 0.606495
2025-01-11 0.312428
2025-01-12 0.024458
2025-01-13 0.306332
2025-01-14 0.312428
Name: x, dtype: float64
'''
# inplace=True modifies df and returns None, so print df afterwards
df.fillna(0, inplace=True)
print(df)
'''
x y z
2025-01-10 0.606495 0.000000 0.456811
2025-01-11 0.000000 0.743876 0.000000
2025-01-12 0.024458 0.733735 0.000000
2025-01-13 0.306332 0.000000 -0.586894
2025-01-14 0.000000 0.000000 0.000000
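'''

# Replace missing values (df has no NaN left after the in-place fill above, so this returns an unchanged copy)
print(df.replace(np.nan, 0))

# Drop rows with missing values (likewise a no-op at this point)
print(df.dropna())

# Save the filled table; note that this overwrites demo.csv
df.to_csv(r'E:\Pycharm_Python\course\demo.csv')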
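Batch replacement with lists and the axis argument of dropna are described above but not demonstrated. The following is a minimal sketch on a small made-up table; the name df_nan and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical table with a few missing cells.
df_nan = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                       'y': [np.nan, np.nan, 6.0],
                       'z': [7.0, 8.0, 9.0]})

# Batch replacement: the two lists are paired by position,
# so 1.0 becomes 10.0 and 3.0 becomes 30.0.
replaced = df_nan.replace([1.0, 3.0], [10.0, 30.0])

# axis=0 drops every row that contains NaN;
# axis=1 drops every column that contains NaN.
rows_kept = df_nan.dropna(axis=0)
cols_kept = df_nan.dropna(axis=1)

# inplace defaults to False, so df_nan itself is unchanged here.
print(replaced)
print(rows_kept)
print(cols_kept)
```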
2.2. Type Conversion and Sorting
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)

# Type conversion: zero becomes False, any nonzero value becomes True
print(df.astype(bool))
'''
x y z
2025-01-10 True False True
2025-01-11 False True False
2025-01-12 True True False
2025-01-13 True False True
2025-01-14 False False False
'''

# Sort in descending order
print(df.sort_values(by='x', ascending=False))
'''
x y z
2025-01-10 0.606495 0.000000 0.456811
2025-01-13 0.306332 0.000000 -0.586894
2025-01-12 0.024458 0.733735 0.000000
2025-01-11 0.000000 0.743876 0.000000
2025-01-14 0.000000 0.000000 0.000000
'''

# Sort in ascending order with priorities: by z first, then by y to break ties
print(df.sort_values(by=['z', 'y'], ascending=True))
'''
x y z
2025-01-13 0.306332 0.000000 -0.586894
2025-01-14 0.000000 0.000000 0.000000
2025-01-12 0.024458 0.733735 0.000000
2025-01-11 0.000000 0.743876 0.000000
2025-01-10 0.606495 0.000000 0.456811
'''
2.3. Statistical Analysis
# Descriptive statistics
print(df.describe())
'''
x y z
count 5.000000 5.000000 5.000000
mean 0.187457 0.295522 -0.026017
std 0.267662 0.404676 0.370721
min 0.000000 0.000000 -0.586894
25% 0.000000 0.000000 0.000000
50% 0.024458 0.000000 0.000000
75% 0.306332 0.733735 0.000000
max 0.606495 0.743876 0.456811
'''

# Maximum
print(df['x'].max()) # 0.606494812593188

# Minimum
print(df['x'].min()) # 0.0

# Mean
print(df['x'].mean()) # 0.18745690178195

# Median
print(df['x'].median()) # 0.0244579357029649

# Variance (sample variance, ddof=1)
print(df['x'].var()) # 0.07164321143837143

# Standard deviation (sample, ddof=1)
print(df['x'].std()) # 0.26766249538994336

# Count of non-missing values
print(df['z'].count()) # 5

# Distinct values
print(df['z'].unique()) # [ 0.45681124 0. -0.58689448]

# Count per distinct value
print(df['z'].value_counts())
'''
z
0.000000 3
0.456811 1
-0.586894 1
'''

# Column sums
print(df.sum())
'''
x 0.937285
y 1.477611
z -0.130083
dtype: float64
'''

# Correlation matrix (Pearson by default)
print(df.corr())
'''
x y z
x 1.000000 -0.597883 0.306501
y -0.597883 1.000000 0.064061
z 0.306501 0.064061 1.000000
'''

# Covariance matrix
print(df.cov())
'''
x y z
x 0.071643 -0.064761 0.030414
y -0.064761 0.163763 0.009611
z 0.030414 0.009611 0.137434
'''
3. Pivoting
3.0. Generating a Sample Table
import pandas as pd
import numpy as np

# Generate the data
hour = np.random.randint(0, 24, (1000, 1))
area = np.random.randint(0, 10, 1000)
displacement = np.random.randn(1000, 3)

# Assemble the table
a = np.concatenate((hour, displacement), axis=1)
df = pd.DataFrame(a, index=area, columns=['hour', 'x', 'y', 'z'])
df.index.name = 'area'

# Type conversion: concatenation promoted hour to float, so cast it back to integer
df['hour'] = df['hour'].astype('int64')

# Save the table
df.to_csv(r'E:\Pycharm_Python\course\demo.csv')
'''
hour x y z
area
9 18 1.453873 -0.452853 0.126672
5 20 -0.541874 -0.798552 0.209252
9 12 0.848762 -0.734806 0.124415
1 13 0.794053 1.838139 -0.268814
8 2 -0.115496 2.054565 0.860301
... ... ... ... ...
9 21 -0.212381 0.355993 -1.124492
1 20 -0.010173 0.408953 -0.275197
2 15 0.334253 0.231890 3.557654
0 3 -0.383228 -0.562431 2.418784
8 12 -1.004758 -0.539583 1.589166
[1000 rows x 4 columns]
'''
3.1. Building a Pivot Table
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)

# Set the display limits
pd.set_option('display.max_columns', 4)
pd.set_option('display.max_rows', 10)

# Build a pivot table: fill missing values with 0 and add an "All" row aggregated over the whole table
pt = pd.pivot_table(df, index=['area', 'hour'], values=['x', 'y'], aggfunc=['sum', 'mean'], fill_value=0, margins=True)
'''
sum mean
x y x y
area hour
0 0 4.372290 0.019988 1.093072 0.004997
1 -5.463018 3.510755 -1.092604 0.702151
2 0.429444 -2.022444 0.429444 -2.022444
3 1.954055 -1.683926 0.488514 -0.420981
4 -2.226930 -3.827011 -1.113465 -1.913506
... ... ... ... ...
9 20 -1.422801 0.262971 -0.237134 0.043829
21 -2.720063 1.411410 -0.544013 0.282282
22 -0.342656 -0.502878 -0.171328 -0.251439
23 -1.024892 -0.385198 -0.170815 -0.064200
All -24.847056 19.244009 -0.024847 0.019244
[239 rows x 4 columns]
'''

# Specify the aggregation function for each column with a dict
pt = pd.pivot_table(df, index=['area', 'hour'], aggfunc={'x': 'sum', 'y': 'mean'})
'''
x y
area hour
0 0 4.372290 0.004997
1 -5.463018 0.702151
2 0.429444 -2.022444
3 1.954055 -0.420981
4 -2.226930 -1.913506
... ... ...
9 19 1.744798 -0.405821
20 -1.422801 0.043829
21 -2.720063 0.282282
22 -0.342656 -0.251439
23 -1.024892 -0.064200
[238 rows x 2 columns]
'''
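Because the sample table is random, the effect of fill_value and margins is hard to verify by eye. Here is a small deterministic sketch. The table sales and its values are invented, and it spreads hour across the columns (a columns= argument the examples above do not use) so that an empty cell actually appears.

```python
import pandas as pd

# Hypothetical data: area 1 has no record for hour 1.
sales = pd.DataFrame({
    'area': [0, 0, 0, 1, 1],
    'hour': [0, 0, 1, 0, 0],
    'x':    [1.0, 3.0, 5.0, 2.0, 4.0],
})

# Rows are areas, columns are hours, cells hold the sum of x.
# fill_value=0 fills the empty (area 1, hour 1) cell;
# margins=True appends an 'All' row and an 'All' column of totals.
pt_small = pd.pivot_table(sales, index='area', columns='hour',
                          values='x', aggfunc='sum',
                          fill_value=0, margins=True)
print(pt_small)
```

With aggfunc='sum' the margins are plain totals. With 'mean' they would be means over all underlying rows, not means of the cell means.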
4. Data Reshaping
4.1. Hierarchical Indexing
4.1.1. Series with a Two-Level Index
import pandas as pd
import numpy as np

# Two-level index
index = pd.MultiIndex.from_arrays([list('aaabbccdd'), list(map(int, '123121212'))], names=('area', 'numbers'))
'''
MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 2),
            ('c', 1),
            ('c', 2),
            ('d', 1),
            ('d', 2)],
           names=['area', 'numbers'])
'''

# Initialize
s = pd.Series(np.random.randn(9), index=index)
print(s)
'''
area numbers
a 1 0.417328
2 0.168057
3 1.252186
b 1 -1.835490
2 0.951358
c 1 -1.903762
2 -0.075067
d 1 0.782123
2 0.355078
dtype: float64
'''

# Select a single element
print(s['a', 1]) # 0.417328381875337

# Select by the outer level
print(s['a'])
'''
numbers
1 0.417328
2 0.168057
3 1.252186
dtype: float64
'''

# Slice on the outer level (label slices include both endpoints)
print(s['a':'b'])
'''
area numbers
a 1 0.417328
2 0.168057
3 1.252186
b 1 -1.835490
2 0.951358
dtype: float64
'''

# Select on the inner level
print(s[:, 2])
'''
area
a 0.168057
b 0.951358
c -0.075067
d 0.355078
dtype: float64
'''
4.1.2. DataFrame with a Two-Level Index
# Set the display limits
pd.set_option('display.max_columns', 4)
pd.set_option('display.max_rows', 8)

# Two-level row and column labels
index = pd.MultiIndex.from_arrays([list('aabbccdd'), list(map(int, '12121212'))], names=('area', 'numbers'))
columns = pd.MultiIndex.from_tuples([('t1', 'x'), ('t1', 'y'), ('t2', 'x'), ('t2', 'y')])
df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=columns)
'''
t1 t2
x y x y
area numbers
a 1 -0.125867 -1.722040 -0.266579 0.910084
2 0.060483 0.750894 -0.479338 0.608312
b 1 0.345995 1.470237 1.763323 -0.336475
2 -1.977062 0.071204 0.000797 -0.323753
c 1 0.963804 0.186688 0.443276 0.615650
2 1.729371 -0.775489 1.663172 -0.657688
d 1 0.376276 0.693671 0.982811 -0.393840
2 -0.632945 -2.046240 0.865305 1.150940
'''

# Select a single value
print(df.loc[('a', 1), ('t1', 'x')]) # -0.12586716795606423

# Select rows by the outer level
print(df.loc['a'])
'''
t1 t2
x y x y
numbers
1 -0.125867 -1.722040 -0.266579 0.910084
2 0.060483 0.750894 -0.479338 0.608312
'''

# Select columns by the outer level
print(df['t1'])
'''
x y
area numbers
a 1 -0.125867 -1.722040
2 0.060483 0.750894
b 1 0.345995 1.470237
2 -1.977062 0.071204
c 1 0.963804 0.186688
2 1.729371 -0.775489
d 1 0.376276 0.693671
2 -0.632945 -2.046240
'''

# Swap the index levels (the row order is unchanged)
df = df.swaplevel('area', 'numbers')
'''
t1 t2
x y x y
numbers area
1 a -0.125867 -1.722040 -0.266579 0.910084
2 a 0.060483 0.750894 -0.479338 0.608312
1 b 0.345995 1.470237 1.763323 -0.336475
2 b -1.977062 0.071204 0.000797 -0.323753
1 c 0.963804 0.186688 0.443276 0.615650
2 c 1.729371 -0.775489 1.663172 -0.657688
1 d 0.376276 0.693671 0.982811 -0.393840
2 d -0.632945 -2.046240 0.865305 1.150940
'''
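As the output above shows, swaplevel only changes the order of the levels and leaves the rows where they were. A minimal sketch of two common follow-ups, sort_index and xs, on a small made-up table (df_mi and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical two-level table with fixed values.
index = pd.MultiIndex.from_arrays(
    [list('aabb'), [1, 2, 1, 2]], names=('area', 'numbers'))
df_mi = pd.DataFrame({'x': [10, 20, 30, 40]}, index=index)

# swaplevel reorders the levels but keeps the original row order.
swapped = df_mi.swaplevel('area', 'numbers')

# sort_index then groups the rows by the new outer level.
swapped_sorted = swapped.sort_index()

# xs takes a cross-section on an inner level without swapping at all.
numbers_2 = df_mi.xs(2, level='numbers')

print(swapped_sorted)
print(numbers_2)
```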
4.2. Discretization
4.2.1. Grouped Operations
# Set the display limit
pd.set_option('display.max_rows', 6)

# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)
'''
hour x y z
area
9 18 1.453873 -0.452853 0.126672
5 20 -0.541874 -0.798552 0.209252
9 12 0.848762 -0.734806 0.124415
... ... ... ... ...
2 15 0.334253 0.231890 3.557654
0 3 -0.383228 -0.562431 2.418784
8 12 -1.004758 -0.539583 1.589166
[1000 rows x 4 columns]
'''

# Group the table by the index and by a column, then aggregate
print(df.groupby([df.index, df['hour']]).mean())
'''
x y z
area hour
0 0 1.093072 0.004997 -0.183101
1 -1.092604 0.702151 -0.556797
2 0.429444 -2.022444 -0.346545
... ... ... ...
9 21 -0.544013 0.282282 -0.191250
22 -0.171328 -0.251439 -0.022121
23 -0.170815 -0.064200 -0.244528
[238 rows x 3 columns]
'''

# Group a single column by the index, then aggregate
print(df['x'].groupby(df.index).sum())
'''
area
0 -10.897082
1 -4.915652
2 -13.841750
...
7 -4.954806
8 -6.258694
9 2.828964
Name: x, Length: 10, dtype: float64
'''
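The grouping keys above are passed as objects (df.index, df['hour']). As a sketch under the assumption of a reasonably recent pandas, the same keys can also be given by name, and several aggregations can be computed in one pass with agg. The table df_g and its values are made up.

```python
import pandas as pd

# Hypothetical table: 'area' is the index name, 'hour' is a column.
df_g = pd.DataFrame({'hour': [0, 0, 1, 1],
                     'x': [1.0, 3.0, 5.0, 7.0]},
                    index=pd.Index([0, 0, 0, 1], name='area'))

# Group by an index level and a column, both referred to by name.
means = df_g.groupby(['area', 'hour'])['x'].mean()

# Compute several aggregations of one column at once.
summary = df_g.groupby('area')['x'].agg(['sum', 'mean', 'count'])

print(means)
print(summary)
```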
4.2.2. Binning into Ranked Labels
# Set the display limit
pd.set_option('display.max_rows', 6)

# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)
'''
hour x y z
area
9 18 1.453873 -0.452853 0.126672
5 20 -0.541874 -0.798552 0.209252
9 12 0.848762 -0.734806 0.124415
... ... ... ... ...
2 15 0.334253 0.231890 3.557654
0 3 -0.383228 -0.562431 2.418784
8 12 -1.004758 -0.539583 1.589166
[1000 rows x 4 columns]
'''

# Bin by fixed value ranges (each bin is open on the left and closed on the right)
bins = [0, 0.1, 0.4, 0.8, 1.6, 3.2, 4.0]
labels = ['E', 'D', 'C', 'B', 'A', 'S']
df['rank_x'] = pd.cut(df['x'].abs(), bins, labels=labels)
'''
hour x y z rank_x
area
9 18 1.453873 -0.452853 0.126672 B
5 20 -0.541874 -0.798552 0.209252 C
9 12 0.848762 -0.734806 0.124415 B
... ... ... ... ... ...
2 15 0.334253 0.231890 3.557654 D
0 3 -0.383228 -0.562431 2.418784 D
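8 12 -1.004758 -0.539583 1.589166 B
[1000 rows x 5 columns]
'''

# Bin by quantile ranges
# Note: the edges below are percentiles of the signed x, while the values being binned are |x|,
# so bins with negative edges stay empty. To rank by quantiles of |x|, compute the edges from df['x'].abs().
bins = np.percentile(df['x'], [0, 25, 50, 70, 85, 95, 100])
labels = ['E', 'D', 'C', 'B', 'A', 'S']
df['rank_x'] = pd.cut(df['x'].abs(), bins, labels=labels)
print(df)
'''
hour x y z rank_x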
area
9 18 1.453873 -0.452853 0.126672 A
5 20 -0.541874 -0.798552 0.209252 B
9 12 0.848762 -0.734806 0.124415 B
... ... ... ... ... ...
2 15 0.334253 0.231890 3.557654 C
0 3 -0.383228 -0.562431 2.418784 C
8 12 -1.004758 -0.539583 1.589166 A
[1000 rows x 5 columns]
'''
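The quantile example above takes its bin edges from the signed x but bins |x|. If the intent is to rank |x| by its own quantiles, one possible variant is sketched below. It uses a seeded random series instead of the CSV file, and pd.qcut is shown as an alternative rather than as the method used in this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['x']: seeded so the run is repeatable.
rng = np.random.default_rng(0)
x = pd.Series(rng.standard_normal(1000))
abs_x = x.abs()

labels = ['E', 'D', 'C', 'B', 'A', 'S']
quantiles = [0, 25, 50, 70, 85, 95, 100]

# Compute the edges from the same values that are being binned.
edges = np.percentile(abs_x, quantiles)

# Bins are open on the left, so include_lowest=True is needed
# to keep the minimum value in the first bin.
rank_cut = pd.cut(abs_x, edges, labels=labels, include_lowest=True)

# pd.qcut derives the quantile edges itself (q is given as fractions).
rank_qcut = pd.qcut(abs_x, [q / 100 for q in quantiles], labels=labels)

print(rank_cut.value_counts().sort_index())
```

With edges taken from |x| itself, every value receives a label and the bin sizes follow the chosen percentages.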
4.3. Combining Datasets
# Set the display limit
pd.set_option('display.max_rows', 6)

# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)
'''
hour x y z
area
9 18 1.453873 -0.452853 0.126672
5 20 -0.541874 -0.798552 0.209252
9 12 0.848762 -0.734806 0.124415
... ... ... ... ...
2 15 0.334253 0.231890 3.557654
0 3 -0.383228 -0.562431 2.418784
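8 12 -1.004758 -0.539583 1.589166
[1000 rows x 4 columns]
'''

# Append one table to another, and concatenate several tables
# (DataFrame.append was removed in pandas 2.0; pd.concat covers both cases)
print(pd.concat([df.iloc[:3], df.iloc[3:6]]))
print(pd.concat([df.iloc[:2], df.iloc[2:4], df.iloc[4:6]], axis=0))
'''
hour x y z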
area
9 18 1.453873 -0.452853 0.126672
5 20 -0.541874 -0.798552 0.209252
9 12 0.848762 -0.734806 0.124415
1 13 0.794053 1.838139 -0.268814
8 2 -0.115496 2.054565 0.860301
9 9 1.235167 1.030952 -0.517618
'''

# Merge
# Sample tables
df1 = df.iloc[2:5][['x', 'y']]
'''
x y
area
9 0.848762 -0.734806
1 0.794053 1.838139
8 -0.115496 2.054565
'''
# sample(frac=1) shuffles the rows
df2 = df.iloc[2:5][['x', 'z']].sample(frac=1)
'''
x z
area
8 -0.115496 0.860301
1 0.794053 -0.268814
9 0.848762 0.124415
'''
# Join on the index level 'area'; the shared column x gets the suffixes _x and _y
print(pd.merge(df1, df2, on='area'))
'''
x_x y x_y z
area
9 0.848762 -0.734806 0.848762 0.124415
1 0.794053 1.838139 0.794053 -0.268814
8 -0.115496 2.054565 -0.115496 0.860301
'''
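The merge above uses the default inner join and the default suffixes. As a rough sketch of the other main options, the example below compares an inner and an outer join with explicit suffixes. The tables left and right and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical tables sharing the index name 'area' and the column 'x'.
left = pd.DataFrame({'x': [1.0, 2.0, 3.0]},
                    index=pd.Index([9, 1, 8], name='area'))
right = pd.DataFrame({'x': [2.0, 4.0]},
                     index=pd.Index([1, 7], name='area'))

# Inner join (the default): only keys present in both tables survive.
inner = pd.merge(left, right, on='area', suffixes=('_left', '_right'))

# Outer join: every key is kept and the gaps are filled with NaN.
outer = pd.merge(left, right, on='area', how='outer',
                 suffixes=('_left', '_right'))

print(inner)
print(outer)
```

how also accepts 'left' and 'right', which keep all keys of one side only.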