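Machine Learning Fundamentals 06

1. Gradient Descent

1.1 The Concept of Gradient Descent

1.2 The Gradient Descent Formula

1.3 Learning Rate

1.4 Implementing Gradient Descent

1.5 API

1.5.1 Stochastic Gradient Descent (SGD)

1.5.2 Mini-Batch Gradient Descent (MBGD)

1.6 Gradient Descent Optimization

2. Underfitting and Overfitting

2.1 Underfitting

2.2 Overfitting

2.3 Regularization

2.3.1 L1 Regularization Term (Manhattan Distance)

2.3.2 L2 Regularization Term (Euclidean Distance)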

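1. Gradient Descent

1.1 The Concept of Gradient Descent

Drawbacks of solving with the normal equation

① The W obtained from the normal equation is the optimal solution because the MSE loss function is convex. Not every loss function in machine learning is convex, however. For a non-convex loss, setting the derivative to 0 can yield many stationary points, so a unique solution cannot be determined.

② When the number of samples and features is large, the matrix computation becomes too expensive.

In machine learning, the gradient is the vector of partial derivatives of the loss function with respect to the model parameters. For each trainable parameter, it gives the rate at which the loss changes, at the current parameter values, as that parameter varies. Gradient descent computes this gradient and then adjusts the parameters in the direction that reduces the loss.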
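To make the definition concrete, the sketch below compares hand-derived partial derivatives with a finite-difference estimate of the rate of change along each parameter direction. The loss is the two-parameter function used in section 1.4. The helper names analytic_gradient and numerical_gradient and the step size eps are illustrative assumptions, not part of these notes.

```python
def loss(w1, w2):
    # Two-parameter quadratic loss (same form as the example in section 1.4)
    return 4 * w1**2 + 9 * w2**2 + 2 * w1 * w2 + 3.5 * w1 - 4 * w2 + 6


def analytic_gradient(w1, w2):
    # Partial derivatives obtained by differentiating loss() by hand
    return (8 * w1 + 2 * w2 + 3.5, 2 * w1 + 18 * w2 - 4)


def numerical_gradient(f, w1, w2, eps=1e-6):
    # Central differences: rate of change of f along each parameter direction
    d_w1 = (f(w1 + eps, w2) - f(w1 - eps, w2)) / (2 * eps)
    d_w2 = (f(w1, w2 + eps) - f(w1, w2 - eps)) / (2 * eps)
    return (d_w1, d_w2)


grad_exact = analytic_gradient(10.0, 10.0)
grad_approx = numerical_gradient(loss, 10.0, 10.0)
print("analytic: ", grad_exact)
print("numerical:", grad_approx)
```

Both components are positive at (10, 10), so decreasing w1 and w2 lowers the loss there, which is the direction gradient descent moves in.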

1.2 The Gradient Descent Formula

Given the loss function:

The gradient descent update rule:

This gives:
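As a worked illustration (an assumed example, not necessarily the formulas these notes originally showed), the standard update rule with learning rate α, the symbol used in section 1.3, can be applied to the one-variable loss from the first program in section 1.4:

```latex
\begin{aligned}
\text{loss}(w) &= 10w^2 - 15.9w + 6.5 \\
w_{\text{new}} &= w_{\text{old}} - \alpha \,\frac{\partial\, \text{loss}}{\partial w}\bigg|_{w = w_{\text{old}}} \\
\frac{\partial\, \text{loss}}{\partial w} &= 20w - 15.9 \\
w_{\text{new}} &= w_{\text{old}} - \alpha\,(20\,w_{\text{old}} - 15.9)
\end{aligned}
```

For example, with α = 0.01 and w_old = 10, the gradient is 20 × 10 − 15.9 = 184.1, so w_new = 10 − 0.01 × 184.1 = 8.159.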

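1.3 Learning Rate

With a large learning rate α, each update changes the parameters by a large amount; with a small learning rate α, each update changes them by a small amount.

(1) Commonly used values: 0.1, 0.01, 0.001, 0.0001

(2) The learning rate can be reduced gradually as the number of iterations grows; the optimization algorithms used in deep learning can adjust the learning rate automatically.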
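A rough sketch of how the size of α changes the behaviour, using the one-variable loss 10w² − 15.9w + 6.5 from section 1.4, whose minimum is at w = 0.795. The function name run_gradient_descent, the starting point w = 5.0, and the decay rule alpha / (1 + decay * step) are assumptions chosen for illustration; other decay schedules are equally valid.

```python
def gradient(w):
    # Derivative of loss(w) = 10*w**2 - 15.9*w + 6.5
    return 20 * w - 15.9


def run_gradient_descent(alpha, w=5.0, steps=50, decay=0.0):
    # decay > 0 shrinks the learning rate as the step count grows
    for step in range(steps):
        current_alpha = alpha / (1 + decay * step)
        w = w - current_alpha * gradient(w)
    return w


w_small = run_gradient_descent(alpha=0.001)              # small steps: still far from 0.795
w_medium = run_gradient_descent(alpha=0.01)              # moderate steps: close to 0.795
w_large = run_gradient_descent(alpha=0.11)               # steps too large: w diverges
w_decayed = run_gradient_descent(alpha=0.11, decay=0.5)  # same initial rate, shrinking over time

for name, value in [("alpha=0.001", w_small), ("alpha=0.01", w_medium),
                    ("alpha=0.11", w_large), ("alpha=0.11 with decay", w_decayed)]:
    print(f"{name:>22}: w = {value:.4f}, distance to minimum = {abs(value - 0.795):.4f}")
```

For this particular loss the update multiplies the distance to the minimum by (1 − 20α) at every step, so any fixed α above 0.1 makes the iterates move away from the minimum, while the decaying schedule brings the same initial rate back into the stable range after a few steps.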

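1.4 Implementing Gradient Descent

import matplotlib.pyplot as plt
import numpy as np

# Random initialization of the weight
w = np.random.randint(-10, 10, 1)
# Learning rate
h = 0.01
# Convergence threshold
diff = 0.0001
# Maximum number of updates
time = 1000

lt_w = []      # value of w at each iteration
lt_w_new = []  # loss at each recorded w
for i in range(time):
    # Keep the old w so the size of the update can be computed
    w_new = w
    lt_w.append(w_new)
    lt_w_new.append(10 * w**2 - 15.9 * w + 6.5)
    # Update w; 20*w - 15.9 is the derivative of the loss (the slope of the tangent)
    w = w - h * (20 * w_new - 15.9)
    difference = w_new - w
    print(f'Iteration {i+1}:', '\t', 'w:', w, '\t', 'w_new-w:', difference)
    if abs(difference) <= diff:
        break

# Plot: the scatter points are the gradient descent steps
plt.scatter(lt_w, lt_w_new, c='red')
w = np.linspace(-10, 10, 100)
loss = 10 * w**2 - 15.9 * w + 6.5
# The curve is the loss function
plt.plot(w, loss)
plt.show()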

import numpy as np
import matplotlib.pyplot as plt

# Initial values of the two weights
w1 = 10
w2 = 10
# Learning rate
h = 0.001
# Convergence threshold
diff = 0.0001
# Maximum number of updates
time = 1000


def loss(w1, w2):
    return 4*w1**2 + 9*w2**2 + 2*w1*w2 + 3.5*w1 - 4*w2 + 6


def dloss_w1(w1, w2):
    # Partial derivative of the loss with respect to w1
    return 8*w1 + 2*w2 + 3.5


def dloss_w2(w1, w2):
    # Partial derivative of the loss with respect to w2
    return 2*w1 + 18*w2 - 4


# Record w1 and w2 at every iteration
w1_history = [w1]
w2_history = [w2]

for i in range(time):
    # Keep the old weights so the size of the update can be computed
    w1_new = w1
    w2_new = w2
    # Update both weights using gradients evaluated at the old values
    w1 = w1 - h * dloss_w1(w1_new, w2_new)
    w2 = w2 - h * dloss_w2(w1_new, w2_new)
    difference1 = w1_new - w1
    difference2 = w2_new - w2
    print(f'Iteration {i+1}:\tw1: {w1:.6f}, w2: {w2:.6f}, w1_new-w1: {difference1:.6f}, w2_new-w2: {difference2:.6f}')
    # Record w1 and w2 at every iteration
    w1_history.append(w1)
    w2_history.append(w2)
    if abs(difference1) <= diff and abs(difference2) <= diff:
        break

print("Final result: w1 =", w1, "w2 =", w2)

# Draw the 3D plot
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')
# Create the grid data
w1_vals = np.linspace(-15, 15, 100)
w2_vals = np.linspace(-15, 15, 100)
w1_grid, w2_grid = np.meshgrid(w1_vals, w2_vals)
loss_grid = loss(w1_grid, w2_grid)
# Draw the surface of the loss function
ax.plot_surface(w1_grid, w2_grid, loss_grid, cmap='viridis', alpha=0.7)
# Draw the gradient descent path
ax.plot(w1_history, w2_history, [loss(a, b) for a, b in zip(w1_history, w2_history)], color='r', marker='.')
# Set the axis labels
ax.set_xlabel('w1')
ax.set_ylabel('w2')
ax.set_zlabel('Loss')
# Show the figure
plt.show()

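1.5 API

Batch Gradient Descent (BGD)

Mini-Batch Gradient Descent (MBGD)

Stochastic Gradient Descent (SGD)

Batch Gradient Descent (BGD): every iteration uses all training samples to compute the gradient and update the weights, so each update is based on the average gradient over the whole dataset. The advantage is that each update direction is the most accurate of the three methods; the drawback is that every step is computationally expensive and slow, especially on large datasets.
Mini-Batch Gradient Descent (MBGD): this method sits between batch and stochastic gradient descent. It uses neither all samples nor a single sample. Instead, each iteration randomly draws a small subset of the dataset (for example, 32 out of 500 samples) and updates the weights from the average gradient over that mini-batch. This strikes a balance between accuracy and computational efficiency.
Stochastic Gradient Descent (SGD): each iteration uses a single randomly chosen sample (sometimes called an "example") to compute the gradient and update the weights. Each update is very cheap, so the model can make progress quickly, but because every update is based on one sample, the path taken by the weights is noisy and unstable.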
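The three variants above can be sketched with one training loop in which only the batch size changes. This is a minimal illustration using the Python standard library on synthetic one-feature data; the data generator, the function names make_data and train, and the hyperparameter values are all assumptions made for the example.

```python
import random


def make_data(n=200, seed=0):
    # Synthetic data from y = 3x + 2 plus small noise (illustrative only)
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [3 * x + 2 + rng.gauss(0, 0.1) for x in xs]
    return xs, ys


def train(xs, ys, batch_size, epochs=100, alpha=0.1, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        # Shuffle once per epoch so batches are drawn at random
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of the squared error over the current batch
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= alpha * grad_w
            b -= alpha * grad_b
    return w, b


xs, ys = make_data()
results = {
    "BGD (all 200 samples)": train(xs, ys, batch_size=len(xs)),
    "MBGD (32 samples)": train(xs, ys, batch_size=32),
    "SGD (1 sample)": train(xs, ys, batch_size=1),
}
for name, (w, b) in results.items():
    print(f"{name}: w = {w:.3f}, b = {b:.3f}")
```

With batch_size equal to the dataset size there is one update per epoch (BGD); with batch_size=1 there are as many updates per epoch as samples (SGD); anything in between is MBGD.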

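1.5.1 Stochastic Gradient Descent (SGD)

sklearn.linear_model.SGDRegressor()

Parameters:
   loss: the loss function, default='squared_error'
   fit_intercept: whether to fit the bias (intercept), default=True
   eta0: float, default=0.01, the initial learning rate
   learning_rate: str, default='invscaling'
      'constant': eta = eta0; the learning rate stays at the value set by eta0
      'optimal': eta = 1.0 / (alpha * (t + t0))
      'invscaling': eta = eta0 / pow(t, power_t)
      'adaptive': eta = eta0; the learning rate starts at eta0 and is reduced whenever the training loss stops improving
   max_iter: int, default=1000, the maximum number of passes over the training data (epochs)
   shuffle: default=True, whether to shuffle the training data after each epoch
   penalty: {'l2', 'l1', 'elasticnet', None}, default='l2', the regularization term to use
Attributes:
   coef_: the fitted weight coefficients
   intercept_: the bias (intercept)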
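To see what the 'constant' and 'invscaling' schedules listed above mean numerically, the sketch below evaluates the two formulas directly. It is plain Python arithmetic, not a call into scikit-learn; the function names are hypothetical, t is taken to be the update counter, and power_t=0.25 is assumed here because it is the documented default of SGDRegressor.

```python
def constant_eta(eta0, t):
    # 'constant': the learning rate never changes
    return eta0


def invscaling_eta(eta0, t, power_t=0.25):
    # 'invscaling': eta = eta0 / pow(t, power_t)
    return eta0 / pow(t, power_t)


eta0 = 0.01
for t in (1, 16, 256, 10000):
    print(f"t = {t:>5}: constant = {constant_eta(eta0, t):.5f}, "
          f"invscaling = {invscaling_eta(eta0, t):.5f}")
```

Under these assumptions the 'invscaling' rate halves each time t grows by a factor of 16, so it decays slowly compared with schedules that divide by t itself.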

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# Load the dataset (downloaded to ./src on first use)
dataset = fetch_california_housing(data_home='./src')

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, train_size=0.7, shuffle=True, random_state=200)

# Standardize the features; fit on the training set only
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)

# Linear regression estimator trained with SGD
estimator = SGDRegressor(loss='squared_error', penalty='l1', max_iter=1000,
                         eta0=0.01, learning_rate='constant')
estimator.fit(x_train, y_train)

# Model parameters
print('coef:', estimator.coef_)
print('intercept:', estimator.intercept_)

# Evaluate on the test set
y_predict = estimator.predict(x_test)
print("Predictions on the test set:\n", y_predict)
print('Coefficient of determination (R^2):', estimator.score(x_test, y_test))
error = mean_squared_error(y_test, y_predict)
print('Mean squared error:', error)

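1.5.2 Mini-Batch Gradient Descent (MBGD)

sklearn.linear_model.SGDRegressor()

Calling partial_fit updates the current weights using only the samples passed in, so training can proceed batch by batch without calling fit, which would restart training from scratch. Note that partial_fit runs one pass of stochastic gradient descent over the samples it receives, so feeding it small batches approximates mini-batch training rather than averaging the gradient over each batch.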

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# Load the dataset (downloaded to ./src on first use)
dataset = fetch_california_housing(data_home='./src')

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, train_size=0.7, shuffle=True, random_state=200)

# Standardize the features; fit on the training set only
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)

# Linear regression estimator trained with SGD
estimator = SGDRegressor(loss='squared_error', penalty='l1', max_iter=1000,
                         eta0=0.01, learning_rate='constant')

# Mini-batch gradient descent
batch_size = 50  # number of samples per batch
n_batches = len(x_train) // batch_size
for epoch in range(estimator.max_iter):
    # Shuffle the sample order at the start of every epoch
    indices = np.random.permutation(len(x_train))
    for i in range(n_batches):
        start_index = i * batch_size
        end_index = (i + 1) * batch_size
        batch_indices = indices[start_index:end_index]
        x_batch = x_train[batch_indices]
        y_batch = y_train[batch_indices]
        # Update the model weights with this batch
        estimator.partial_fit(x_batch, y_batch)

# Model parameters
print('coef:', estimator.coef_)
print('intercept:', estimator.intercept_)

# Evaluate on the test set
y_predict = estimator.predict(x_test)
print("Predictions on the test set:\n", y_predict)
print('Coefficient of determination (R^2):', estimator.score(x_test, y_test))
error = mean_squared_error(y_test, y_predict)
print('Mean squared error:', error)