Previously we learned how to identify missing values and how to handle them, but the resulting KNN accuracy was still not very high. Today we continue exploring the data to further strengthen the machine learning pipeline.
From the column histograms we can see that the columns differ greatly in mean, maximum, minimum, and so on:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer

# Impute the missing values with each column's mean before plotting
imputer = SimpleImputer(strategy='mean')
pima_imputed = imputer.fit_transform(pima)
pima_imputed = pd.DataFrame(pima_imputed, columns=pima_column_names)
pima_imputed.hist(figsize=(15, 15))
plt.show()
The describe method shows the differences numerically: for example, diastolic_blood_pressure ranges from 24 to 122, while age ranges from 21 to 81.
print(pima_imputed.describe())
# times_pregnant plasma_glucose_concentration diastolic_blood_pressure triceps_thickness serum_insulin bmi pedigree_function age onset_diabetes
# count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
# mean 3.845052 121.686763 72.405184 29.153420 155.548223 32.457464 0.471876 33.240885 0.348958
# std 3.369578 30.435949 12.096346 8.790942 85.021108 6.875151 0.331329 11.760232 0.476951
# min 0.000000 44.000000 24.000000 7.000000 14.000000 18.200000 0.078000 21.000000 0.000000
# 25% 1.000000 99.750000 64.000000 25.000000 121.500000 27.500000 0.243750 24.000000 0.000000
# 50% 3.000000 117.000000 72.202592 29.153420 155.548223 32.400000 0.372500 29.000000 0.000000
# 75% 6.000000 140.250000 80.000000 32.000000 155.548223 36.600000 0.626250 41.000000 1.000000
# max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000
If we force all columns to share the same x- and y-axes in the histograms, we can see that the scales differ so much that some columns' plots become unreadable:
pima_imputed.hist(figsize=(15, 15), sharey=True, sharex=True)
plt.show()
We can address this problem with some form of normalization in the machine learning pipeline. Normalization puts rows and columns onto a consistent scale, ensuring that all of them are treated equally by the learning algorithm and that the data is processed consistently.
Below we will try three normalization methods:
❏ z-score standardization;
❏ min-max scaling;
❏ row normalization.
Z-score standardization
Z-score standardization is the most common scaling technique; it relies on the simple statistical idea of the z-score (standard score). Its output is rescaled so that the mean is 0 and the standard deviation is 1. By scaling features to a common mean and variance (the square of the standard deviation), a model such as KNN can perform at its best without being biased toward larger-scale features. The formula is simple: for each column, every cell is replaced with:
z = (x - μ) / σ
We use this formula to compute the z-scores of plasma_glucose_concentration:
pgc_std = pima_imputed['plasma_glucose_concentration'].std()
pgc_mean = pima_imputed['plasma_glucose_concentration'].mean()
pgc_z = (pima_imputed['plasma_glucose_concentration']-pgc_mean)/pgc_std
print(pgc_z.head())
# 0 0.864545
# 1 -1.205376
# 2 2.014501
# 3 -1.073952
# 4 0.503130
# Name: plasma_glucose_concentration, dtype: float64
The histogram shows that before scaling, the plasma_glucose_concentration values span roughly 40 to 200 on the horizontal axis:
pgc_field_name = 'plasma_glucose_concentration'
ax = pima_imputed[pgc_field_name].hist()
ax.set_title('Distribution of pgc')
plt.show()
sklearn provides StandardScaler to compute z-scores:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
pgc_z_standardized = scaler.fit_transform(pima_imputed[[pgc_field_name]])
print(pgc_z_standardized.mean())
print(pgc_z_standardized.std())
# -3.561965537339044e-16
# 1.0
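One subtlety worth checking: pandas' .std() defaults to the sample standard deviation (ddof=1), while StandardScaler divides by the population standard deviation (ddof=0), so our manual z-scores differ from StandardScaler's output by a tiny constant factor. A minimal sanity check, reusing pgc_mean, pgc_std, and pgc_z_standardized from above:
import numpy as np

# Ratio of sample std (ddof=1) to population std (ddof=0):
# a constant sqrt(n / (n - 1)), here sqrt(768/767) ≈ 1.00065
print(pgc_std / pima_imputed[pgc_field_name].std(ddof=0))

# Recompute the z-scores with ddof=0 to match StandardScaler exactly
pgc_std_pop = pima_imputed[pgc_field_name].std(ddof=0)
pgc_z_pop = (pima_imputed[pgc_field_name] - pgc_mean) / pgc_std_pop
print(np.allclose(pgc_z_pop.values, pgc_z_standardized.reshape(-1)))  # True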
The histogram after scaling shows the horizontal axis spanning roughly -2.6 to 2.6; the distribution has become much more compact:
ax = pd.Series(pgc_z_standardized.reshape(-1,)).hist()
ax.set_title('Distribution of pgc after Z Score Scaling')
plt.show()
Now we compute z-scores for every column in the dataset. The histograms show the values spread roughly between -2.5 and 7.5 on the horizontal axis:
scaler = StandardScaler()
pima_imputed_z = pd.DataFrame(scaler.fit_transform(pima_imputed), columns=pima_column_names)
pima_imputed_z.hist(figsize=(15, 15), sharex=True)
plt.show()
We now insert StandardScaler into our earlier machine learning pipeline:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

onset_field_name = 'onset_diabetes'
knn_params = {'imputer__strategy': ['mean', 'median'], 'classify__n_neighbors': [1, 2, 3, 4, 5, 6, 7]}
knn = KNeighborsClassifier()
pipe_z = Pipeline([('imputer', SimpleImputer()), ('standardize', StandardScaler()), ('classify', knn)])
X = pima.drop(onset_field_name, axis=1)
y = pima[onset_field_name]
grid = GridSearchCV(pipe_z, knn_params)
grid.fit(X, y)
print(grid.best_score_, grid.best_params_)
# 0.7539173245055598 {'classify__n_neighbors': 7, 'imputer__strategy': 'mean'}
Min-max scaling
Min-max scaling is similar to z-score standardization in that it, too, replaces each value in a column using a formula. Here the formula is:
m = (x - x_min) / (x_max - x_min)
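To see the formula at work before reaching for sklearn, here is a minimal manual computation on the plasma_glucose_concentration column, mirroring the earlier z-score example (reusing pima_imputed and pgc_field_name from above):
pgc = pima_imputed[pgc_field_name]

# Apply m = (x - x_min) / (x_max - x_min) by hand
pgc_min_maxed = (pgc - pgc.min()) / (pgc.max() - pgc.min())
print(pgc_min_maxed.min(), pgc_min_maxed.max())  # 0.0 1.0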
We use sklearn's built-in MinMaxScaler. After scaling, every column's minimum becomes 0 and its maximum becomes 1. A side effect of this scaling is that the standard deviations all become very small, which can hurt some models because the weight of outliers is reduced.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
pima_min_maxed = pd.DataFrame(scaler.fit_transform(pima_imputed), columns=pima_column_names)
print(pima_min_maxed.describe())
# times_pregnant plasma_glucose_concentration diastolic_blood_pressure triceps_thickness serum_insulin bmi pedigree_function age onset_diabetes
# count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
# mean 0.226180 0.501205 0.493930 0.240798 0.170130 0.291564 0.168179 0.204015 0.348958
# std 0.198210 0.196361 0.123432 0.095554 0.102189 0.140596 0.141473 0.196004 0.476951
# min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
# 25% 0.058824 0.359677 0.408163 0.195652 0.129207 0.190184 0.070773 0.050000 0.000000
# 50% 0.176471 0.470968 0.491863 0.240798 0.170130 0.290389 0.125747 0.133333 0.000000
# 75% 0.352941 0.620968 0.571429 0.271739 0.170130 0.376278 0.234095 0.333333 1.000000
# max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
We insert MinMaxScaler into the pipeline:
knn_params = {'imputer__strategy': ['mean', 'median'], 'classify__n_neighbors': [1, 2, 3, 4, 5, 6, 7]}
knn = KNeighborsClassifier()
pipe_z = Pipeline([('imputer', SimpleImputer()), ('standardize', MinMaxScaler()), ('classify', knn)])
X = pima.drop(onset_field_name, axis=1)
y = pima[onset_field_name]
grid = GridSearchCV(pipe_z, knn_params)
grid.fit(X, y)
print(grid.best_score_, grid.best_params_)
# 0.7630336983278159 {'classify__n_neighbors': 7, 'imputer__strategy': 'median'}
Row normalization
Rather than computing statistics per column (mean, minimum, maximum, and so on), row normalization guarantees that each row has unit norm, meaning every row vector has the same length. Picture each row as a point in n-dimensional space: every row then has a vector norm (a length).
The pima dataset has 8 feature columns, so each row can be viewed as a vector in an 8-dimensional space. We measure a vector's length with the L2 norm, ||x|| = sqrt(x1^2 + x2^2 + ... + xn^2).
Applying this formula directly, we can compute the mean row norm of the matrix:
pima_l2 = np.sqrt((pima_imputed**2).sum(axis=1))
pima_l2_mean = pima_l2.mean()
print(pima_l2_mean)
# 223.36222025823747
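To make "unit norm" concrete, we can normalize the rows by hand: divide each row by its own L2 norm, after which every row has length 1. A minimal sketch, reusing pima_imputed and pima_l2 from above:
# Divide each row by its own L2 norm; axis=0 aligns pima_l2 with the row index
pima_manual_normalized = pima_imputed.div(pima_l2, axis=0)

# Every row should now have unit length
print(np.sqrt((pima_manual_normalized**2).sum(axis=1)).mean())  # 1.0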
We use sklearn's built-in Normalizer; after the transformation, every row's norm is 1:
from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
pima_row_normalized = pd.DataFrame(normalizer.fit_transform(pima_imputed), columns=pima_column_names)
pima_row_normalized_l2 = np.sqrt((pima_row_normalized**2).sum(axis=1))
pima_row_normalized_l2_mean = pima_row_normalized_l2.mean()
print(pima_row_normalized_l2_mean)
# 1.0
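Note that Normalizer defaults to the L2 norm; it also supports L1 and max norms through its norm parameter. A quick sketch of the L1 variant, where each row's absolute values sum to 1:
# norm='l1' rescales each row so that its absolute values sum to 1
normalizer_l1 = Normalizer(norm='l1')
pima_l1_normalized = pd.DataFrame(normalizer_l1.fit_transform(pima_imputed), columns=pima_column_names)
print(pima_l1_normalized.abs().sum(axis=1).mean())  # 1.0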
We insert Normalizer into the pipeline:
knn_params = {'imputer__strategy': ['mean', 'median'], 'classify__n_neighbors': [1, 2, 3, 4, 5, 6, 7]}
knn = KNeighborsClassifier()
pipe_z = Pipeline([('imputer', SimpleImputer()), ('normalize', Normalizer()), ('classify', knn)])
X = pima.drop(onset_field_name, axis=1)
y = pima[onset_field_name]
grid = GridSearchCV(pipe_z, knn_params)
grid.fit(X, y)
print(grid.best_score_, grid.best_params_)
# 0.6980052627111452 {'classify__n_neighbors': 7, 'imputer__strategy': 'median'}
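To compare the three approaches side by side, here is a sketch that reruns the same grid search with each scaler (plus a no-scaling baseline) in one loop; the exact scores may vary slightly with the scikit-learn version and cross-validation splits:
# Rerun the same grid search for each scaling strategy
scalers = {
    'none': None,
    'standardize': StandardScaler(),
    'min_max': MinMaxScaler(),
    'normalize': Normalizer(),
}
for name, scaler in scalers.items():
    steps = [('imputer', SimpleImputer())]
    if scaler is not None:
        steps.append(('scale', scaler))
    steps.append(('classify', KNeighborsClassifier()))
    grid = GridSearchCV(Pipeline(steps), knn_params)
    grid.fit(X, y)
    print(name, grid.best_score_, grid.best_params_)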