Data (the row-number column is named 序号 in data.csv):
No.   x1    x2    x3   x4
1     40    2     5    20
2     10    1.5   5    30
3     120   3     13   50
4     250   4.5   18   0
5     120   3.5   9    50
6     10    1.5   12   50
7     40    1     19   40
8     270   4     13   60
9     280   3.5   11   60
10    170   3     9    60
11    180   3.5   14   40
12    130   2     30   50
13    220   1.5   17   20
14    160   1.5   35   60
15    220   2.5   14   30
16    140   2     20   20
17    220   2     14   10
18    40    1     10   0
19    20    1     12   60
20    120   2     20   0
Standardized data:
           x1         x2         x3         x4
0   -1.102513  -0.308130  -1.347755  -0.708447
1   -1.440017  -0.782175  -1.347755  -0.251384
2   -0.202502   0.639961  -0.269551   0.662740
3    1.260015   2.062098   0.404327  -1.622571
4   -0.202502   1.114007  -0.808653   0.662740
5   -1.440017  -0.782175  -0.404327   0.662740
6   -1.102513  -1.256220   0.539102   0.205678
7    1.485017   1.588052  -0.269551   1.119803
8    1.597518   1.114007  -0.539102   1.119803
9    0.360004   0.639961  -0.808653   1.119803
10   0.472505   1.114007  -0.134776   0.205678
11  -0.090001  -0.308130   2.021633   0.662740
12   0.922511  -0.782175   0.269551  -0.708447
13   0.247503  -0.782175   2.695510   1.119803
14   0.922511   0.165916  -0.134776  -0.251384
15   0.022500  -0.308130   0.673878  -0.708447
16   0.922511  -0.308130  -0.134776  -1.165509
17  -1.102513  -1.256220  -0.673878  -1.622571
18  -1.327515  -1.256220  -0.404327   1.119803
19  -0.202502  -0.308130   0.673878  -1.622571
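Standardization can also be done with the sklearn package:
from sklearn import preprocessing
# Z-score standardization
# Create a StandardScaler object
zscore = preprocessing.StandardScaler()
# Standardize the data (data is the raw table, as a DataFrame or 2-D array)
data_zs = zscore.fit_transform(data)
Note: sklearn computes the standard deviation with n in the denominator, whereas the std used in the code below divides by n-1. SPSS also divides by n-1.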
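The effect of the denominator can be checked numerically. The sketch below uses a small made-up array (an assumption for illustration, not the data from this post) and NumPy's `ddof` argument: `ddof=0` divides by n, which is the convention StandardScaler follows, while `ddof=1` divides by n-1, which is the default of pandas `describe()` and `std()`.

```python
import numpy as np

# Hypothetical toy data, used only to illustrate the two denominators.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 40.0],
                [4.0, 80.0]])
n = toy.shape[0]
centered = toy - toy.mean(axis=0)

z_pop = centered / toy.std(axis=0, ddof=0)     # denominator n (sklearn's convention)
z_sample = centered / toy.std(axis=0, ddof=1)  # denominator n-1 (pandas/SPSS convention)

# The two versions differ only by the constant factor sqrt(n / (n - 1)).
factor = np.sqrt(n / (n - 1))
print(np.allclose(z_pop, z_sample * factor))
```

Because the factor is the same for every column, it rescales the standardized values and the scores but leaves the correlation matrix and the principal directions unchanged.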
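Dimensionality reduction with sklearn:
import sklearn.decomposition as dp  # assumed: dp is an alias for sklearn.decomposition
pca = dp.PCA(n_components=2)        # create the PCA estimator, keeping 2 principal components
reduced_x = pca.fit_transform(x)    # reduce the original data x and store the result in reduced_x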
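Standardization code:
import pandas as pd
csv_data = pd.read_csv('C:/Users/admin/Desktop/2019.10.05/算法/主成分分析/data.csv')  # read the data
csv_data = csv_data.drop('序号', axis=1)  # drop the row-number column
csv_data = csv_data.astype(float)  # integer columns cannot hold the standardized values
describe = csv_data.describe()  # per-column statistics: count, mean, std, quantiles, etc.
mean = describe.loc['mean']
std = describe.loc['std']  # sample standard deviation (denominator n-1)
m = csv_data.index.size  # number of rows
n = csv_data.columns.size  # number of columns
column = csv_data.columns.values  # ['x1' 'x2' 'x3' 'x4']
# Apply the standardization to every element of the DataFrame
for i in range(0, m):
    for j in range(0, n):
        csv_data.iloc[i, j] = (csv_data.iloc[i, j] - mean.iloc[j]) / std.iloc[j]  # row i, column j
print("Standardized data:\n", csv_data)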
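The element-by-element loop can also be written as a single vectorized expression, since pandas aligns the per-column mean and std with the DataFrame columns. A minimal sketch, with the 20 observations from the data table typed in inline so that no CSV file is needed (the names `df` and `standardized` are illustrative):

```python
import pandas as pd

# The 20 observations from the data table, entered directly.
df = pd.DataFrame({
    "x1": [40, 10, 120, 250, 120, 10, 40, 270, 280, 170,
           180, 130, 220, 160, 220, 140, 220, 40, 20, 120],
    "x2": [2, 1.5, 3, 4.5, 3.5, 1.5, 1, 4, 3.5, 3,
           3.5, 2, 1.5, 1.5, 2.5, 2, 2, 1, 1, 2],
    "x3": [5, 5, 13, 18, 9, 12, 19, 13, 11, 9,
           14, 30, 17, 35, 14, 20, 14, 10, 12, 20],
    "x4": [20, 30, 50, 0, 50, 50, 40, 60, 60, 60,
           40, 50, 20, 60, 30, 20, 10, 0, 60, 0],
})

# Column-wise z-scores; DataFrame.std() uses the n-1 denominator by default.
standardized = (df - df.mean()) / df.std()
print(standardized.round(6).head())
```

The first rows should agree with the standardized table shown earlier, because both use the n-1 denominator.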
Principal component analysis:
import pandas as pd
import numpy as np
csv_data = pd.read_csv('C:/Users/admin/Desktop/2019.10.05/算法/主成分分析/data.csv')  # read the data
csv_data = csv_data.drop('序号', axis=1)  # drop the row-number column
corr = csv_data.corr()  # correlation coefficients between the variables, to judge whether PCA is appropriate
print("Raw data:\n", csv_data)
print("\nCorrelation matrix:\n", corr)
describe = csv_data.describe()  # per-column statistics: count, mean, std, quantiles, etc.
mean = describe.loc['mean']
std = describe.loc['std']
# Standardize every column at once; arr holds one row per variable
arr = ((csv_data - mean) / std).values.T
print("\nStandardized data:\n", arr.T)
M = corr.values  # correlation matrix as a NumPy array
eig, vec = np.linalg.eig(M)  # eigenvalues and eigenvectors; both are ndarrays, and the columns of vec are the eigenvectors
print("\nEigenvalues of the correlation matrix:\n", eig)
order = np.argsort(eig)[::-1]  # eigenvalue indices from largest to smallest, [1, 3, 2, 0] for this data
per = (eig[order] / eig.sum()).tolist()  # contribution rates, largest first
print("\nContribution rates, sorted:\n", per)
print("\nCumulative contribution rate:\n", np.array(per).cumsum())

# Orthonormalization function
def gram_schmidt(A):
    """Gram-Schmidt orthonormalization of the columns of A."""
    Q = np.zeros_like(A, dtype=float)
    cnt = 0
    for a in A.T:
        u = np.copy(a)
        for i in range(0, cnt):
            u -= np.dot(np.dot(Q[:, i].T, a), Q[:, i])  # subtract the projection onto each vector already found
        e = u / np.linalg.norm(u)  # normalize
        Q[:, cnt] = e
        cnt += 1
    return Q

Q = gram_schmidt(vec)
print("\nOrthonormalized eigenvectors:")
print(Q.T)
print("\nOrthonormalized eigenvectors, ordered by eigenvalue:")
print(Q.T[order])
y = np.dot(arr.T, Q.T[order].T)
Y = pd.DataFrame(y)
Y.rename(columns={0: 'Y1', 1: 'Y2', 2: 'Y3', 3: 'Y4'}, inplace=True)
print("\nPrincipal component values (scores):\n", Y)
print("\nCorrelation matrix of the principal components:")
corr1 = Y.corr()
print(corr1)
result = csv_data.join(Y, how='inner')
print("\nRaw data and principal component scores:")
print(result)
corr2 = result.corr()
print("\nCorrelations between the raw data and the principal component scores:")
print(corr2.iloc[0:4, 4:8])
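Since a correlation matrix is symmetric, `np.linalg.eigh` is a natural alternative to `np.linalg.eig`: it returns real eigenvalues in ascending order together with eigenvectors that are already orthonormal, so a separate Gram-Schmidt step is not needed. The following is a compact sketch of the same computation under that substitution, not the listing's own method. The data is typed in inline, and the names `df`, `z`, `R`, `ratio` and `scores` are illustrative.

```python
import numpy as np
import pandas as pd

# The 20 observations from the data table, entered directly.
df = pd.DataFrame({
    "x1": [40, 10, 120, 250, 120, 10, 40, 270, 280, 170,
           180, 130, 220, 160, 220, 140, 220, 40, 20, 120],
    "x2": [2, 1.5, 3, 4.5, 3.5, 1.5, 1, 4, 3.5, 3,
           3.5, 2, 1.5, 1.5, 2.5, 2, 2, 1, 1, 2],
    "x3": [5, 5, 13, 18, 9, 12, 19, 13, 11, 9,
           14, 30, 17, 35, 14, 20, 14, 10, 12, 20],
    "x4": [20, 30, 50, 0, 50, 50, 40, 60, 60, 60,
           40, 50, 20, 60, 30, 20, 10, 0, 60, 0],
})

z = (df - df.mean()) / df.std()   # standardized data (n-1 denominator)
R = df.corr().to_numpy()          # correlation matrix

# eigh targets symmetric matrices; eigenvectors are the columns of eigvecs.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

ratio = eigvals / eigvals.sum()    # contribution rate of each component
scores = pd.DataFrame(z.to_numpy() @ eigvecs, columns=["Y1", "Y2", "Y3", "Y4"])

print("eigenvalues:", eigvals.round(6))
print("cumulative contribution:", ratio.cumsum().round(6))
print(scores.head().round(6))
```

Eigenvectors are only determined up to sign, so individual score columns from this sketch may come out with the opposite sign from the listing's output. The variances and correlations are unaffected.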
Output of the full principal component analysis listing:
Correlation matrix:
          x1         x2         x3        x4
x1  1.000000   0.694984   0.219456  0.024898
x2  0.694984   1.000000  -0.147955  0.135133
x3  0.219456  -0.147955   1.000000  0.071327
x4  0.024898   0.135133   0.071327  1.000000
Eigenvalues of the correlation matrix:
[0.20686561 1.71825161 0.98134701 1.09353577]
Contribution rates, sorted:
[0.42956290217587323, 0.2733839423331357, 0.2453367536310203, 0.05171640185997065]
Cumulative contribution rate:
[0.4295629 0.70294684 0.9482836 1. ]
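The contribution rate of a component is its eigenvalue divided by the sum of all eigenvalues, and the cumulative rate is typically used to decide how many components to keep. A small sketch of that arithmetic, using the eigenvalues printed above. The 85% cut-off is an assumed, commonly used threshold and is not specified in this post.

```python
import numpy as np

# Eigenvalues as printed in the output above (unsorted).
eig = np.array([0.20686561, 1.71825161, 0.98134701, 1.09353577])

ratio = np.sort(eig)[::-1] / eig.sum()  # contribution rates, largest first
cumulative = ratio.cumsum()             # cumulative contribution rate

threshold = 0.85  # assumed cut-off, chosen only for illustration
# Smallest number of components whose cumulative rate reaches the threshold.
k = int(np.searchsorted(cumulative, threshold) + 1)
print(ratio.round(4), cumulative.round(4), k)
```

With these eigenvalues, two components explain about 70.3% of the total variance and three explain about 94.8%, so an 85% threshold would keep three components.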
Orthonormalized eigenvectors:
[[ 0.66588327 -0.66355498 -0.31889547 0.12083021]
[-0.69996363 -0.6897981 -0.08793923 -0.16277651]
[-0.24004879 0.05846333 -0.27031356 0.93053167]
[ 0.09501037 -0.28364662 0.9041587 0.30498307]]
Orthonormalized eigenvectors, ordered by eigenvalue:
[[-0.69996363 -0.6897981 -0.08793923 -0.16277651]
[ 0.09501037 -0.28364662 0.9041587 0.30498307]
[-0.24004879 0.05846333 -0.27031356 0.93053167]
[ 0.66588327 -0.66355498 -0.31889547 0.12083021]]
Principal component values (scores):
           Y1         Y2         Y3         Y4
0    1.218105  -1.451999  -0.048273  -0.185493
1    1.706942  -1.210208   0.430341  -0.040449
2   -0.383874  -0.242355   0.775589  -0.393455
3   -2.075835  -0.594474  -1.801057  -0.854286
4   -0.663462  -0.864250   0.949030  -0.536093
5    1.475180  -0.078406   1.025942  -0.230850
6    1.557370   0.801735   0.236877  -0.047638
7   -2.293467  -0.211550   0.851241   0.156353
8   -2.021514  -0.310116   0.869385   0.631779
9   -0.804599  -0.536949   1.211598   0.208253
10  -1.120804  -0.330221   0.179526  -0.356740
11  -0.010115   2.108850   0.073817  -0.420080
12  -0.014567   0.337162  -0.999271   0.961740
13  -0.053019   3.024066   0.208238  -0.040456
14  -0.707401  -0.157940  -0.409237   0.516795
15   0.252856   0.482766  -0.864806  -0.081055
16  -0.231607  -0.302271  -1.287573   0.720896
17   1.961634  -0.852576  -1.136482   0.118267
18   1.649029   0.206141   1.396532   0.213845
19   0.559148   0.182595  -1.661416  -0.341334
Correlation matrix of the principal components:
               Y1             Y2             Y3             Y4
Y1   1.000000e+00  -2.120752e-16  -4.499891e-17   7.693762e-16
Y2  -2.120752e-16   1.000000e+00   1.974226e-16  -6.972072e-16
Y3  -4.499891e-17   1.974226e-16   1.000000e+00   2.075015e-16
Y4   7.693762e-16  -6.972072e-16   2.075015e-16   1.000000e+00
Raw data and principal component scores:
     x1   x2  x3  x4         Y1         Y2         Y3         Y4
0    40  2.0   5  20   1.218105  -1.451999  -0.048273  -0.185493
1    10  1.5   5  30   1.706942  -1.210208   0.430341  -0.040449
2   120  3.0  13  50  -0.383874  -0.242355   0.775589  -0.393455
3   250  4.5  18   0  -2.075835  -0.594474  -1.801057  -0.854286
4   120  3.5   9  50  -0.663462  -0.864250   0.949030  -0.536093
5    10  1.5  12  50   1.475180  -0.078406   1.025942  -0.230850
6    40  1.0  19  40   1.557370   0.801735   0.236877  -0.047638
7   270  4.0  13  60  -2.293467  -0.211550   0.851241   0.156353
8   280  3.5  11  60  -2.021514  -0.310116   0.869385   0.631779
9   170  3.0   9  60  -0.804599  -0.536949   1.211598   0.208253
10  180  3.5  14  40  -1.120804  -0.330221   0.179526  -0.356740
11  130  2.0  30  50  -0.010115   2.108850   0.073817  -0.420080
12  220  1.5  17  20  -0.014567   0.337162  -0.999271   0.961740
13  160  1.5  35  60  -0.053019   3.024066   0.208238  -0.040456
14  220  2.5  14  30  -0.707401  -0.157940  -0.409237   0.516795
15  140  2.0  20  20   0.252856   0.482766  -0.864806  -0.081055
16  220  2.0  14  10  -0.231607  -0.302271  -1.287573   0.720896
17   40  1.0  10   0   1.961634  -0.852576  -1.136482   0.118267
18   20  1.0  12  60   1.649029   0.206141   1.396532   0.213845
19  120  2.0  20   0   0.559148   0.182595  -1.661416  -0.341334
Correlations between the raw data and the principal component scores:
           Y1         Y2         Y3         Y4
x1  -0.917527   0.099354  -0.237799   0.302860
x2  -0.904202  -0.296616   0.057916  -0.301801
x3  -0.115273   0.945499  -0.267781  -0.145042
x4  -0.213371   0.318928   0.921812   0.054957
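These correlations are the component loadings. For PCA on a correlation matrix, the correlation between variable x_i and component Y_j equals the i-th entry of the j-th unit eigenvector multiplied by the square root of the j-th eigenvalue. A quick check of that relation, using the eigenvalues and the ordered eigenvectors printed earlier (the variable names are illustrative):

```python
import numpy as np

# Eigenvalues in descending order and the matching unit eigenvectors
# (one eigenvector per row), copied from the printed output.
eigvals = np.array([1.71825161, 1.09353577, 0.98134701, 0.20686561])
eigvecs = np.array([
    [-0.69996363, -0.6897981, -0.08793923, -0.16277651],
    [0.09501037, -0.28364662, 0.9041587, 0.30498307],
    [-0.24004879, 0.05846333, -0.27031356, 0.93053167],
    [0.66588327, -0.66355498, -0.31889547, 0.12083021],
])

# loadings[i, j] = (entry i of eigenvector j) * sqrt(eigenvalue j)
loadings = eigvecs.T * np.sqrt(eigvals)
print(loadings.round(6))
```

Read this way, Y1 is dominated by x1 and x2 (loadings near -0.9), Y2 by x3, and Y3 by x4.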