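Translation
- Building a neural network from scratch: introduction
- Imports and configuration
- Generating a dataset
- A helper function for plotting the decision boundary
- Logistic Regression
- Training a neural network
  - How our network makes predictions
  - Learning the parameters
  - Implementation
- Training a network with a hidden layer of size 3
- Varying the hidden layer size
- Exercises
Exercise solutions
- 1. Minibatch gradient
- 2. Annealing learning rate
- 3. Other activation functions
  - Sigmoid Activation
  - ReLU Activation
- 4. Three Classes
- 5. Extend the network to 4 layers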

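Translation

I came across this article in the reference material while working through the assignments for Andrew Ng's deep learning course, and I thought it was well written. I am translating it here to deepen my own understanding, and at the end I also work through the exercises the author left, to improve my grasp of neural networks. For the original text of the translated part, see the original post (link).

Building a neural network from scratch: introduction

In this post we will implement a very simple 3-layer neural network from scratch. We won't derive all the math that is required, but I will try to give an intuitive explanation of what we are doing. I will also point out where to look for the details.

I am assuming that you are familiar with basic calculus and machine learning concepts, for example that you know what regression and classification problems are. Ideally you also know a bit about optimization techniques such as gradient descent. But even if you know none of the above, this post may still turn out to be interesting.

Why implement a neural network from scratch? Even if you plan to use a framework like PyBrain in the future, implementing a network from scratch at least once shows you how neural networks work, and understanding how they work is essential to designing effective models.

One thing to note is that the code here is not very efficient, because it is meant to be easy to understand. In a later post I will implement an efficient neural network using Theano. (Translator's note: by the time I read this, that post was already available (link). "Efficient" there means running on a GPU.)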

Imports and configuration

# Package imports
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import sklearn.datasets
import sklearn.linear_model

# Display plots inline and change default figure size
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)

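Generating a dataset

To train a model we first need a dataset. Fortunately, scikit-learn has some useful dataset generators, so we can use the make_moons function directly.

# Generate a dataset and plot it
np.random.seed(0)
X, y = sklearn.datasets.make_moons(200, noise=0.20)
plt.scatter(X[:,0], X[:,1], s=40, c=y, cmap=plt.cm.Spectral)

The generated dataset looks like this:
[Figure: scatter plot of the generated two-moons dataset]
The dataset has two classes, plotted as red and blue points. You can think of the blue points as male patients and the red points as female patients, with the x and y values being two medical measurements.

Our goal is to train a classifier that predicts the correct class (male or female) given the x and y values. Note that the data is not linearly separable: we cannot draw a straight line that separates the two classes. This means that linear classifiers such as logistic regression cannot fit the data unless you hand-engineer nonlinear features (such as polynomials) that work well for the given dataset.

In fact, that is one of the major advantages of neural networks. You don't need to worry about feature engineering; the hidden layer of the network will learn useful features for you.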

A helper function for plotting the decision boundary

# Helper function to plot a decision boundary.
# If you don't fully understand this function don't worry,
# it just generates the contour plot below.
def plot_decision_boundary(pred_func):
    # Set min and max values and give it some padding
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the function value for the whole grid
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)

Logistic Regression

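To demonstrate the point, let's train a logistic regression classifier. Its input is the x and y values and its output is the predicted class (0 or 1). To keep things simple we use the model that scikit-learn provides.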

# Train the logistic regression classifier
clf = sklearn.linear_model.LogisticRegressionCV(cv=5)
clf.fit(X, y)
# Plot the decision boundary
plot_decision_boundary(lambda x: clf.predict(x))
plt.title("Logistic Regression")

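The result looks like this:
[Figure: decision boundary of the logistic regression classifier]
The plot shows the decision boundary learned by our logistic regression classifier. It separates the data as well as it can with a straight line, but it can never capture the "moon" shape of the data.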

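Training a neural network

We now build a 3-layer neural network with one input layer, one hidden layer, and one output layer. The number of nodes in the input layer is determined by the dimensionality of the data, which is 2 (x and y). The number of nodes in the output layer is determined by the number of classes, which is also 2. (Because we only have two classes we could get away with a single output node that predicts 0 or 1, but using two nodes makes it easier to extend the network to more classes later.) The input to the network is x and y, and its output is two probabilities, one for class 0 ("female") and one for class 1 ("male"). The network looks like this:
[Figure: architecture of the 3-layer network]
We choose the number of nodes in the hidden layer ourselves. The more nodes we put into the hidden layer, the more complex the functions we can fit. But a large hidden layer comes at a cost. First, more computation is required for prediction and training. Second, a larger number of parameters makes overfitting more likely.

How do we choose the size of the hidden layer? There is no general rule; it depends on the specific problem and is something of an art (the art of hyperparameter tuning!). Later we will vary the number of hidden nodes to see how it affects the output.

We also need to pick an activation function for the hidden layer. The activation function transforms the linear combination of a layer's inputs into its outputs. A nonlinear activation function is what allows us to make nonlinear predictions. Common choices are tanh, sigmoid, and ReLUs. We will use tanh, which performs well in many scenarios. A nice property of these functions is that their derivative can be computed from the value of the function itself. For example, the derivative of tanh(x) is 1-tanh²(x). This is useful because it lets us compute tanh(x) once and reuse that value to get the derivative, which saves a lot of computation.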
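As a quick numerical check of the tanh identity above, the sketch below compares the derivative computed from the cached activation with a central finite difference. It is not part of the original post; the sample points and the step size h are arbitrary illustration values.

```python
import numpy as np

# Sample inputs; the range and count are arbitrary illustration values.
z = np.linspace(-3.0, 3.0, 7)

# Forward pass: compute the activation once and keep it.
a = np.tanh(z)

# Derivative reused from the cached activation: 1 - tanh^2(z).
grad_from_activation = 1.0 - a ** 2

# Independent check with a central finite difference.
h = 1e-5
grad_numeric = (np.tanh(z + h) - np.tanh(z - h)) / (2.0 * h)

max_abs_diff = np.max(np.abs(grad_from_activation - grad_numeric))
print("largest difference between the two derivatives:", max_abs_diff)
```

The same reuse shows up later in the training code, where the backward pass uses the stored hidden activation instead of calling tanh again.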

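Because we want the network to output probabilities, the activation function for the output layer is softmax, which converts raw scores into probabilities. If you are familiar with the logistic function, you can think of softmax as its generalization to multiple classes.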
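To make the link between softmax and the logistic function concrete, here is a minimal sketch. The helper names softmax and logistic and the sample scores are made up for this illustration. Subtracting the row maximum before exponentiating is a common numerical safeguard added here; the code in this post calls np.exp on the raw scores directly.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax for a 2-D array of scores."""
    # Subtracting the row maximum leaves the result unchanged
    # but keeps np.exp from overflowing on large scores.
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

def logistic(t):
    """The logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-t))

# Three made-up score rows; the last one would overflow without the shift.
scores = np.array([[2.0, -1.0], [0.5, 0.5], [1000.0, 0.0]])
probs = softmax(scores)

# With two classes, the softmax probability of class 0 equals the
# logistic function applied to the difference of the two scores.
two_class = logistic(scores[:, 0] - scores[:, 1])

print(probs)
print(two_class)
```

Each row of probs sums to 1, and its first column matches two_class, which is the sense in which softmax generalizes the logistic function.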

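How our network makes predictions

Our network makes predictions using forward propagation, which is just a series of matrix multiplications and applications of the activation functions we defined. If x is a 2-dimensional vector, we calculate the prediction $\hat{y}$ as follows:
$z_{1}=xW_{1}+b_{1}$
$a_{1}=\tanh(z_{1})$
$z_{2}=a_{1}W_{2}+b_{2}$
$a_{2}=\hat{y}=\mathrm{softmax}(z_{2})$
$z_{i}$ is the input of layer i and $a_{i}$ is the output of layer i after applying the activation function. $W_{1}$, $b_{1}$, $W_{2}$, $b_{2}$ are the parameters of the network, which we need to learn from the training data. You can think of them as matrices that transform data between the layers of the network. From the definition of matrix multiplication we can work out the dimensions of these matrices. If the hidden layer has 500 nodes, then $W_{1}\in \mathbb{R}^{2\times 500}$, $b_{1}\in \mathbb{R}^{500}$, $W_{2}\in \mathbb{R}^{500\times 2}$, $b_{2}\in \mathbb{R}^{2}$. Now you can see why a larger hidden layer means more parameters.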
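The shapes and parameter counts above can be checked with a small sketch. The helpers init_params, forward, and count_params are hypothetical names introduced for this illustration, and the input point is made up; the post's own implementation follows later and is organized differently.

```python
import numpy as np

def init_params(n_in, n_hidden, n_out, seed=0):
    """Random weights and zero biases with the shapes described above."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((n_in, n_hidden)) / np.sqrt(n_in),
        "b1": np.zeros((1, n_hidden)),
        "W2": rng.standard_normal((n_hidden, n_out)) / np.sqrt(n_hidden),
        "b2": np.zeros((1, n_out)),
    }

def forward(params, x):
    """Forward propagation: z1 -> a1 -> z2 -> softmax."""
    z1 = x @ params["W1"] + params["b1"]
    a1 = np.tanh(z1)
    z2 = a1 @ params["W2"] + params["b2"]
    exp_scores = np.exp(z2 - z2.max(axis=1, keepdims=True))
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def count_params(params):
    """Total number of scalar parameters in the network."""
    return sum(p.size for p in params.values())

x = np.array([[0.5, -1.2]])          # one made-up 2-dimensional input
small = init_params(2, 3, 2)         # hidden layer of size 3
large = init_params(2, 500, 2)       # hidden layer of size 500
y_hat = forward(large, x)

print(y_hat.shape, count_params(small), count_params(large))
# prints: (1, 2) 17 2502
```

Going from 3 to 500 hidden nodes takes the network from 17 to 2502 parameters (2·h + h + h·2 + 2 for a hidden layer of size h).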

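Learning the parameters

Learning the parameters means finding the values $(W_{1}, b_{1}, W_{2}, b_{2})$ that minimize the error on the training data. But how do we define the error? We use a loss function to measure it. With a softmax output, the usual choice is the cross-entropy loss (also known as the negative log likelihood). If we have N training examples and C classes, the loss of our prediction $\hat{y}$ with respect to the true labels y is:
$L(y,\hat{y})=-\frac{1}{N}\sum_{n\in N}\sum_{i\in C}y_{n,i}\log\hat{y}_{n,i}$

The formula looks complicated, but all it really does is sum over the training examples and add to the loss whenever we predict the wrong class. The further apart y (the true labels) and $\hat{y}$ (the predictions) are, the greater the loss. Minimizing the loss is equivalent to maximizing the likelihood of the training data.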
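A small worked example may help with the loss formula. The function name cross_entropy and the probability tables below are made up for illustration; the labels are stored as class indices, which is equivalent to the one-hot $y_{n,i}$ in the formula.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true labels."""
    n = probs.shape[0]
    # Picking probs[n, labels[n]] equals summing y_{n,i} * log(yhat_{n,i})
    # over i when y is one-hot.
    return -np.mean(np.log(probs[np.arange(n), labels]))

labels = np.array([0, 1, 1])

# Predictions that put most of the probability on the true class ...
mostly_right = np.array([[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]])
# ... and predictions that put most of it on the wrong class.
mostly_wrong = np.array([[0.1, 0.9], [0.9, 0.1], [0.8, 0.2]])

loss_right = cross_entropy(mostly_right, labels)
loss_wrong = cross_entropy(mostly_wrong, labels)
print(loss_right, loss_wrong)
```

The second loss is much larger than the first, matching the statement that the loss grows as the predictions move away from the true labels.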

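We use gradient descent to minimize the loss. Here we use the most common variant, batch gradient descent with a fixed learning rate. Its variations, SGD (stochastic gradient descent) and minibatch gradient descent, typically perform better in practice, so if you are serious you will want to use one of them. Ideally you would also decay the learning rate over time.

As input, gradient descent needs the gradients (vectors of derivatives) of the loss function with respect to our parameters: $\frac{\partial L}{\partial W_{1}}$, $\frac{\partial L}{\partial b_{1}}$, $\frac{\partial L}{\partial W_{2}}$, $\frac{\partial L}{\partial b_{2}}$. To calculate these gradients we use the famous backpropagation algorithm. I won't go into detail on how backpropagation works, but there are widely circulated explanations online (link, link).

Applying backpropagation we find the following (trust me on this):
$\delta_{3}=\hat{y}-y$
$\delta_{2}=(1-\tanh^{2}z_{1})\circ\delta_{3}W_{2}^{T}$
$\frac{\partial L}{\partial W_{2}}=a_{1}^{T}\delta_{3}$
$\frac{\partial L}{\partial b_{2}}=\delta_{3}$
$\frac{\partial L}{\partial W_{1}}=x^{T}\delta_{2}$
$\frac{\partial L}{\partial b_{1}}=\delta_{2}$
When a batch of examples is stacked as rows, the matrix products sum over the examples automatically, and the bias gradients are obtained by summing the rows of $\delta_{3}$ and $\delta_{2}$.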
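Rather than taking the backpropagation formulas on trust, they can be checked numerically. The sketch below is an addition, not part of the original post: it builds a tiny random problem (X_demo, y_demo, and the weight shapes are made up), computes the gradients with the formulas above, and compares them with central finite differences of the summed cross-entropy loss. Regularization is left out to keep the check short.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((5, 2))      # 5 made-up examples, 2 features
y_demo = np.array([0, 1, 1, 0, 1])        # made-up labels
rows = np.arange(len(y_demo))
W1 = rng.standard_normal((2, 3))
b1 = np.zeros((1, 3))
W2 = rng.standard_normal((3, 2))
b2 = np.zeros((1, 2))

def summed_loss(W1, b1, W2, b2):
    """Cross-entropy summed over the examples (no 1/N, no regularization)."""
    a1 = np.tanh(X_demo @ W1 + b1)
    z2 = a1 @ W2 + b2
    z2 = z2 - z2.max(axis=1, keepdims=True)
    log_probs = z2 - np.log(np.exp(z2).sum(axis=1, keepdims=True))
    return -log_probs[rows, y_demo].sum()

def analytic_grads(W1, b1, W2, b2):
    """Gradients from the backpropagation formulas, in the order W1, b1, W2, b2."""
    a1 = np.tanh(X_demo @ W1 + b1)
    z2 = a1 @ W2 + b2
    exp_scores = np.exp(z2 - z2.max(axis=1, keepdims=True))
    delta3 = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    delta3[rows, y_demo] -= 1                      # y_hat - y
    delta2 = (delta3 @ W2.T) * (1 - a1 ** 2)       # tanh derivative from a1
    return (X_demo.T @ delta2, delta2.sum(axis=0, keepdims=True),
            a1.T @ delta3, delta3.sum(axis=0, keepdims=True))

def numeric_grad(param_index, h=1e-6):
    """Central finite difference with respect to one parameter array."""
    params = [W1.copy(), b1.copy(), W2.copy(), b2.copy()]
    target = params[param_index]
    grad = np.zeros_like(target)
    for idx in np.ndindex(target.shape):
        original = target[idx]
        target[idx] = original + h
        plus = summed_loss(*params)
        target[idx] = original - h
        minus = summed_loss(*params)
        target[idx] = original
        grad[idx] = (plus - minus) / (2 * h)
    return grad

analytic = analytic_grads(W1, b1, W2, b2)
max_error = max(np.max(np.abs(analytic[k] - numeric_grad(k))) for k in range(4))
print("largest gradient mismatch:", max_error)
```

A mismatch close to zero suggests the formulas and their matrix form agree; a check like this is also a handy guard when changing the activation function or adding layers.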

Implementation

Now we are ready to implement the network. We start by defining some useful variables and parameters.

num_examples = len(X) # training set size
nn_input_dim = 2 # input layer dimensionality
nn_output_dim = 2 # output layer dimensionality

# Gradient descent parameters (I picked these by hand)
epsilon = 0.01 # learning rate for gradient descent
reg_lambda = 0.01 # regularization strength

First let's implement the loss function, which we use to evaluate how well the model is doing.

# Helper function to evaluate the total loss on the dataset
def calculate_loss(model):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation to calculate our predictions
    z1 = X.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    # Calculating the loss
    corect_logprobs = -np.log(probs[range(num_examples), y])
    data_loss = np.sum(corect_logprobs)
    # Add regularization term to loss (optional)
    data_loss += reg_lambda/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
    return 1./num_examples * data_loss

We also define a helper function for making predictions. It runs forward propagation as defined above and returns the class with the highest probability.

# Helper function to predict an output (0 or 1)
def predict(model, x):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation
    z1 = x.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return np.argmax(probs, axis=1)

Finally, here is the function that trains the model. It implements batch gradient descent with a fixed learning rate.

# This function learns parameters for the
# neural network and returns the model.
# - nn_hdim: Number of nodes in the hidden layer
# - num_passes: Number of passes
#   through the training data for gradient descent
# - print_loss: If True, print the loss every 1000 iterations
def build_model(nn_hdim, num_passes=20000, print_loss=False):
    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))

    # This is what we return at the end
    model = {}

    # Gradient descent. For each batch...
    for i in range(0, num_passes):
        # Forward propagation
        z1 = X.dot(W1) + b1
        a1 = np.tanh(z1)
        z2 = a1.dot(W2) + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Backpropagation
        delta3 = probs
        delta3[range(num_examples), y] -= 1
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW1 = np.dot(X.T, delta2)
        db1 = np.sum(delta2, axis=0)

        # Add regularization terms (b1 and b2 don't
        # have regularization terms)
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1

        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2

        # Assign new parameters to the model
        model = { 'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}

        # Optionally print the loss.
        # This is expensive because it uses the whole dataset,
        # so we don't want to do it too often.
        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f" % (i, calculate_loss(model)))

    return model

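Training a network with a hidden layer of size 3

# Build a model with a 3-dimensional hidden layer
model = build_model(3, print_loss=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(model, x))
plt.title("Decision Boundary for hidden layer size 3")

Result:
[Figure: decision boundary for hidden layer size 3]
Yay! This looks pretty good. Our neural network was able to find a decision boundary that separates the two classes.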

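Varying the hidden layer size

In the example above we picked a hidden layer size of 3. Let's now see how different hidden layer sizes affect the result.

plt.figure(figsize=(16, 32))
hidden_layer_dimensions = [1, 2, 3, 4, 5, 20, 50]
for i, nn_hdim in enumerate(hidden_layer_dimensions):
    plt.subplot(5, 2, i+1)
    plt.title('Hidden Layer size %d' % nn_hdim)
    model = build_model(nn_hdim)
    plot_decision_boundary(lambda x: predict(model, x))
plt.show()

[Figure: decision boundaries for hidden layer sizes 1, 2, 3, 4, 5, 20, 50]
We can see that a hidden layer of low dimensionality nicely captures the general trend of the data. Higher dimensionalities are prone to overfitting: they "memorize" the training data instead of fitting its general shape, which hurts generalization. If we were to evaluate the model on a separate test set (and you should!), the model with the smaller hidden layer would likely perform better because it generalizes better. We could counteract overfitting with stronger regularization, but picking the right size for the hidden layer is a much more "economical" solution.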
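The advice to evaluate on a separate test set can be sketched as follows. This is an illustration rather than the post's method: two_moons is a NumPy stand-in that imitates make_moons, train is a compact re-implementation of the batch gradient descent above that takes the data as arguments instead of reading globals, and the 150/50 split, 2000 passes, and hidden sizes are arbitrary choices.

```python
import numpy as np

def two_moons(n_samples, noise, rng):
    """NumPy stand-in for a two-moons dataset (hypothetical helper)."""
    n_outer = n_samples // 2
    n_inner = n_samples - n_outer
    t_outer = np.linspace(0, np.pi, n_outer)
    t_inner = np.linspace(0, np.pi, n_inner)
    outer = np.column_stack([np.cos(t_outer), np.sin(t_outer)])
    inner = np.column_stack([1 - np.cos(t_inner), 0.5 - np.sin(t_inner)])
    X = np.vstack([outer, inner]) + rng.normal(scale=noise, size=(n_samples, 2))
    y = np.concatenate([np.zeros(n_outer, dtype=int), np.ones(n_inner, dtype=int)])
    return X, y

def train(X_train, y_train, n_hidden, num_passes=2000,
          epsilon=0.01, reg_lambda=0.01, seed=0):
    """Batch gradient descent for the 3-layer tanh/softmax network."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X_train.shape[1], 2
    W1 = rng.standard_normal((n_in, n_hidden)) / np.sqrt(n_in)
    b1 = np.zeros((1, n_hidden))
    W2 = rng.standard_normal((n_hidden, n_out)) / np.sqrt(n_hidden)
    b2 = np.zeros((1, n_out))
    rows = np.arange(len(y_train))
    for _ in range(num_passes):
        # Forward propagation (with a max-shift for numerical stability)
        a1 = np.tanh(X_train @ W1 + b1)
        z2 = a1 @ W2 + b2
        exp_scores = np.exp(z2 - z2.max(axis=1, keepdims=True))
        delta3 = exp_scores / exp_scores.sum(axis=1, keepdims=True)
        # Backpropagation, all gradients computed before any update
        delta3[rows, y_train] -= 1
        delta2 = (delta3 @ W2.T) * (1 - a1 ** 2)
        dW2 = a1.T @ delta3 + reg_lambda * W2
        dW1 = X_train.T @ delta2 + reg_lambda * W1
        W2 -= epsilon * dW2
        b2 -= epsilon * delta3.sum(axis=0, keepdims=True)
        W1 -= epsilon * dW1
        b1 -= epsilon * delta2.sum(axis=0, keepdims=True)
    return W1, b1, W2, b2

def accuracy(params, X_eval, y_eval):
    """Fraction of examples whose highest-scoring class is the true class."""
    W1, b1, W2, b2 = params
    scores = np.tanh(X_eval @ W1 + b1) @ W2 + b2
    return float(np.mean(np.argmax(scores, axis=1) == y_eval))

rng = np.random.default_rng(0)
X_all, y_all = two_moons(200, noise=0.20, rng=rng)
perm = rng.permutation(len(X_all))
test_idx, train_idx = perm[:50], perm[50:]       # hold out 50 of 200 points
X_tr, y_tr = X_all[train_idx], y_all[train_idx]
X_te, y_te = X_all[test_idx], y_all[test_idx]

results = {}
for n_hidden in (1, 3, 50):
    params = train(X_tr, y_tr, n_hidden)
    results[n_hidden] = (accuracy(params, X_tr, y_tr), accuracy(params, X_te, y_te))
    print(n_hidden, "train/test accuracy:", results[n_hidden])
```

Comparing the two numbers for each hidden size is what reveals overfitting: a gap between training and test accuracy that widens as the layer grows is the pattern to look for, though with a dataset this small the result will vary with the split.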

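Exercises

1. Instead of batch gradient descent, use minibatch gradient descent (reference) to train the network. Minibatch gradient descent typically performs better in practice.
2. We used a fixed learning rate $\epsilon$ for gradient descent. Implement an annealing schedule for the learning rate, that is, a decay strategy (reference).
3. We used tanh as the activation function for the hidden layer. Experiment with other activation functions.
4. Extend the network from two to three classes (you will need to generate a dataset for this).
5. Extend the network to four layers. Experiment with different hidden layer sizes.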

Exercise solutions

1. Minibatch gradient

Only the training function needs to change.

import random

def build_model_batch(nn_hdim, num_passes=50000, print_loss=False, batch_size=50):
    # batch_size is the size of each minibatch
    # Build a list of indexes into the training set
    indexes = [index for index in range(num_examples)]

    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))

    # This is what we return at the end
    model = {}

    # Gradient descent. For each batch...
    for i in range(0, num_passes):
        # Randomly draw batch_size examples from the training set
        train_indexes = random.sample(indexes, batch_size)
        X_TRAIN = X[train_indexes, :]
        y_train = y[train_indexes]

        # Forward propagation
        z1 = X_TRAIN.dot(W1) + b1
        a1 = np.tanh(z1)
        z2 = a1.dot(W2) + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Backpropagation
        delta3 = probs
        delta3[range(batch_size), y_train] -= 1
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW1 = np.dot(X_TRAIN.T, delta2)
        db1 = np.sum(delta2, axis=0)

        # Add regularization terms (b1 and b2 don't
        # have regularization terms)
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1

        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2

        # Assign new parameters to the model
        model = { 'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}

        # Optionally print the loss.
        # This is expensive because it uses
        # the whole dataset, so we don't want to do it too often.
        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f" % (i, calculate_loss(model)))

    return model

Train the model this way:

# Build a model with a 3-dimensional hidden layer with MiniBatch
model = build_model_batch(3, print_loss=False)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(model, x))
plt.title("Decision Boundary for hidden layer size 3 with MiniBatch")

Result:
[Figure: decision boundary for hidden layer size 3 with minibatch gradient descent]

2. Annealing learning rate

Starting with a larger learning rate can help the optimizer move past poor local minima early on, while the decay lets it settle later. Here is about the simplest annealing schedule possible.

# Initial learning rate
max_epsilon = 0.01
# Final learning rate
min_epsilon = 0.001

def build_model_annealing(nn_hdim, num_passes=80000, print_loss=False, explore=100):
    # explore is the annealing period: the rate is lowered once every explore iterations
    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))

    # This is what we return at the end
    model = {}

    # Start with the maximum learning rate
    tem_epsilon = max_epsilon

    # Gradient descent. For each batch...
    for i in range(0, num_passes):
        # Anneal the learning rate
        if tem_epsilon > min_epsilon and i % explore == 0:
            tem_epsilon -= (max_epsilon - min_epsilon) / explore
            tem_epsilon = max(tem_epsilon, min_epsilon)

        # Forward propagation
        z1 = X.dot(W1) + b1
        a1 = np.tanh(z1)
        z2 = a1.dot(W2) + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Backpropagation
        delta3 = probs
        delta3[range(num_examples), y] -= 1
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW1 = np.dot(X.T, delta2)
        db1 = np.sum(delta2, axis=0)
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1

        # Gradient descent parameter update
        W1 += -tem_epsilon * dW1
        b1 += -tem_epsilon * db1
        W2 += -tem_epsilon * dW2
        b2 += -tem_epsilon * db2

        # Assign new parameters to the model
        model = { 'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}

        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f, learning rate is now %s" % (i, calculate_loss(model), tem_epsilon))

    return model

# Build a model with a 3-dimensional hidden layer with Annealing
model = build_model_annealing(3, print_loss=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(model, x))
plt.title("Decision Boundary for hidden layer size 3 with Annealing")

Result:
[Figure: decision boundary for hidden layer size 3 with annealing]
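The schedule inside build_model_annealing can also be written as a pure function of the iteration number, which makes it easy to inspect or swap out. The sketch below is an addition: linear_annealed_rate is a hypothetical helper that reproduces the step-wise linear decay above in closed form, and exponential_rate is a different, commonly used schedule included only for comparison; its decay constant is an arbitrary choice.

```python
import math

def linear_annealed_rate(i, max_epsilon=0.01, min_epsilon=0.001, explore=100):
    """Learning rate used at iteration i under the step-wise linear schedule."""
    # The training loop lowers the rate at i = 0, explore, 2 * explore, ...
    decay_steps = i // explore + 1
    step = (max_epsilon - min_epsilon) / explore
    return max(max_epsilon - decay_steps * step, min_epsilon)

def exponential_rate(i, max_epsilon=0.01, min_epsilon=0.001, decay=1e-4):
    """A smooth alternative: exponential decay with a floor at min_epsilon."""
    return max(max_epsilon * math.exp(-decay * i), min_epsilon)

for i in (0, 1000, 5000, 10000, 50000):
    print(i, linear_annealed_rate(i), exponential_rate(i))
```

With the default values the linear schedule reaches min_epsilon after roughly explore * explore = 10000 iterations and stays there, so most of the 80000 passes in the solution above run at the minimum rate.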

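3. Other activation functions

After changing the activation function, the loss calculation, the prediction function, and the training function all need to change. The cleanest approach would be to add the activation function as a parameter; I took the lazy route here and simply copied the code.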
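For reference, the parameterized approach mentioned above could look roughly like this. It is a sketch under assumptions, not the code used in the rest of this section: ACTIVATIONS, hidden_forward, and hidden_backward are hypothetical names, and each entry pairs an activation with its derivative written in terms of the pre-activation z and the activation a.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# name -> (forward function, derivative as a function of z and a = f(z))
ACTIVATIONS = {
    "tanh": (np.tanh, lambda z, a: 1.0 - a ** 2),
    "sigmoid": (_sigmoid, lambda z, a: a * (1.0 - a)),
    "relu": (lambda z: np.maximum(z, 0.0), lambda z, a: (z > 0).astype(float)),
}

def hidden_forward(z, activation="tanh"):
    """Apply the chosen activation to the hidden layer's input."""
    forward_fn, _ = ACTIVATIONS[activation]
    return forward_fn(z)

def hidden_backward(upstream, z, a, activation="tanh"):
    """Multiply the upstream gradient by the activation's derivative."""
    _, derivative_fn = ACTIVATIONS[activation]
    return upstream * derivative_fn(z, a)

# Made-up hidden-layer inputs and upstream gradient for illustration.
z1_demo = np.array([[-1.5, 0.3, 2.0]])
upstream_demo = np.ones_like(z1_demo)
for name in ACTIVATIONS:
    a1_demo = hidden_forward(z1_demo, name)
    print(name, a1_demo, hidden_backward(upstream_demo, z1_demo, a1_demo, name))
```

With helpers like these, a single training function could take an activation name as an argument: the forward pass would call hidden_forward and the line computing delta2 would call hidden_backward, instead of keeping three copies of the code.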

Sigmoid Activation

# Define the sigmoid function
def sigmoid(z):
    s = 1.0 / (1 + np.exp(-z))
    return s

# Plot the function
def draw_sigmoid():
    fig = plt.figure(figsize=(6, 4))
    ax = fig.add_subplot(111)
    x = np.linspace(-6, 6, 1000)  # 1000 evenly spaced x values between -6 and 6
    y = sigmoid(x)  # apply sigmoid element-wise to get the y values
    plt.xlim((-6, 6))
    plt.ylim((0.00, 1.00))
    plt.yticks([0, 0.5, 1.0], [0, 0.5, 1.0])  # ticks shown on the y axis
    plt.plot(x, y, color='darkblue')  # plot the 1000 (x, y) pairs
    ax = plt.gca()
    ax.spines['right'].set_color('none')  # hide the right border
    ax.spines['top'].set_color('none')    # hide the top border
    ax.xaxis.set_ticks_position('bottom')
    ax.spines['bottom'].set_position(('data', 0))  # move the x axis
    ax.yaxis.set_ticks_position('left')
    ax.spines['left'].set_position(('data', 0))   # move the y axis
    plt.xlabel("sigmoid")
    plt.show()

draw_sigmoid()

Result:
[Figure: plot of the sigmoid function]
Model implementation with sigmoid as the activation function

# Helper function to evaluate the total loss on the dataset
# sigmoid edition
def calculate_loss_sigmoid(model):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation to calculate our predictions
    z1 = X.dot(W1) + b1
    # Activation function changed to sigmoid
    a1 = sigmoid(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    # Calculating the loss
    corect_logprobs = -np.log(probs[range(num_examples), y])
    data_loss = np.sum(corect_logprobs)
    # Add regularization term to loss (optional)
    data_loss += reg_lambda/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
    return 1./num_examples * data_loss

# Helper function to predict an output (0 or 1)
# sigmoid edition
def predict_sigmoid(model, x):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation
    z1 = x.dot(W1) + b1
    # Activation function changed to sigmoid
    a1 = sigmoid(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return np.argmax(probs, axis=1)

# sigmoid edition of building model
def build_model_sigmoid(nn_hdim, num_passes=50000, print_loss=False):
    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))

    # This is what we return at the end
    model = {}

    # Gradient descent. For each batch...
    for i in range(0, num_passes):
        # Forward propagation
        z1 = X.dot(W1) + b1
        # Activation function changed to sigmoid
        a1 = sigmoid(z1)
        z2 = a1.dot(W2) + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Backpropagation
        delta3 = probs
        delta3[range(num_examples), y] -= 1
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        # The derivative is now that of sigmoid, i.e. a1*(1-a1)
        delta2 = delta3.dot(W2.T) * a1 * (1 - a1)
        dW1 = np.dot(X.T, delta2)
        db1 = np.sum(delta2, axis=0)
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1

        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2

        # Assign new parameters to the model
        model = { 'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}

        # Use the sigmoid edition of the loss
        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f" % (i, calculate_loss_sigmoid(model)))

    return model

# Build a model with a 3-dimensional hidden layer with sigmoid
model = build_model_sigmoid(3, print_loss=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict_sigmoid(model, x))
plt.title("Decision Boundary for hidden layer size 3 with sigmoid")

Result:
[Figure: decision boundary for hidden layer size 3 with sigmoid]

ReLU Activation

Plot of ReLU

def draw_ReLU():
    fig = plt.figure(figsize=(6, 4))
    ax = fig.add_subplot(111)
    x = np.arange(-10, 10)
    y = np.where(x > 0, x, 0)
    plt.xlim(-11, 11)
    plt.ylim(-11, 11)
    ax.spines['top'].set_color('none')
    ax.spines['right'].set_color('none')
    ax.xaxis.set_ticks_position('bottom')
    ax.spines['bottom'].set_position(('data', 0))
    ax.set_xticks([-10, -5, 0, 5, 10])
    ax.yaxis.set_ticks_position('left')
    ax.spines['left'].set_position(('data', 0))
    ax.set_yticks([-10, -5, 5, 10])
    plt.plot(x, y, label="ReLU", color="blue")
    plt.legend()
    plt.show()

draw_ReLU()

Result:
[Figure: plot of the ReLU function]
Model implementation with ReLU as the activation function

# Helper function to evaluate the total loss on the dataset
# ReLU edition
def calculate_loss_ReLU(model):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation to calculate our predictions
    z1 = X.dot(W1) + b1
    # Activation function changed to ReLU
    a1 = np.where(z1 > 0, z1, 0)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    # Calculating the loss
    corect_logprobs = -np.log(probs[range(num_examples), y])
    data_loss = np.sum(corect_logprobs)
    # Add regularization term to loss (optional)
    data_loss += reg_lambda/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
    return 1./num_examples * data_loss

# Helper function to predict an output (0 or 1)
# ReLU edition
def predict_ReLU(model, x):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation
    z1 = x.dot(W1) + b1
    # Activation function changed to ReLU
    a1 = np.where(z1 > 0, z1, 0)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return np.argmax(probs, axis=1)

# ReLU edition of building model
def build_model_ReLU(nn_hdim, num_passes=20000, print_loss=False):
    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))

    # This is what we return at the end
    model = {}

    # Gradient descent. For each batch...
    for i in range(0, num_passes):
        # Forward propagation
        z1 = X.dot(W1) + b1
        # Activation function changed to ReLU
        a1 = np.where(z1 > 0, z1, 0)
        z2 = a1.dot(W2) + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Backpropagation
        delta3 = probs
        delta3[range(num_examples), y] -= 1
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T)
        # ReLU derivative: pass the gradient only where z1 > 0
        delta2 = np.where(z1 > 0, delta2, 0)
        dW1 = np.dot(X.T, delta2)
        db1 = np.sum(delta2, axis=0)
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1

        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2

        # Assign new parameters to the model
        model = { 'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}

        # Use the ReLU edition of the loss
        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f" % (i, calculate_loss_ReLU(model)))

    return model

# Build a model with a 3-dimensional hidden layer with ReLU
model = build_model_ReLU(3, print_loss=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict_ReLU(model, x))
plt.title("Decision Boundary for hidden layer size 3 with ReLU")

Result:
[Figure: decision boundary for hidden layer size 3 with ReLU]

4. Three Classes

First we need a dataset.
We again use scikit-learn to generate one.

# Generate a dataset and plot it
X1, y1 = sklearn.datasets.make_classification(
    n_samples=300, n_features=2, n_redundant=0, n_informative=2,
    n_clusters_per_class=1, n_classes=3, random_state=29)
plt.scatter(X1[:,0], X1[:,1], s=40, c=y1, cmap=plt.cm.Spectral)

Result:
[Figure: scatter plot of the three-class dataset]
Initialize the parameters for the three-class problem

num_examples1 = len(X1) # training set size
nn_input_dim1 = 2 # input layer dimensionality
nn_output_dim1 = 3 # output layer dimensionality

Implementing the three-class network
Because the output layer uses softmax, very little needs to change: only the dataset and the number of output nodes.

# Helper function to evaluate the total loss on the dataset
# Only the dataset has been swapped out here
def calculate_loss1(model):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation to calculate our predictions
    z1 = X1.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    # Calculating the loss
    corect_logprobs = -np.log(probs[range(num_examples1), y1])
    data_loss = np.sum(corect_logprobs)
    # Add regularization term to loss (optional)
    data_loss += reg_lambda/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
    return 1./num_examples1 * data_loss

# Helper function to predict an output (class index 0, 1 or 2)
# This one is unchanged
def predict1(model, x):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation
    z1 = x.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return np.argmax(probs, axis=1)

def build_model1(nn_hdim, num_passes=20000, print_loss=False):
    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim1, nn_hdim) / np.sqrt(nn_input_dim1)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim1) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim1))

    # This is what we return at the end
    model = {}

    # Gradient descent. For each batch...
    for i in range(0, num_passes):
        # Forward propagation
        z1 = X1.dot(W1) + b1
        a1 = np.tanh(z1)
        z2 = a1.dot(W2) + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Backpropagation
        delta3 = probs
        delta3[range(num_examples1), y1] -= 1
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW1 = np.dot(X1.T, delta2)
        db1 = np.sum(delta2, axis=0)
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1

        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2

        # Assign new parameters to the model
        model = { 'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}

        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f" % (i, calculate_loss1(model)))

    return model

Because the dataset changed, and the author's boundary-plotting function has the dataset hard-coded, I took the lazy route again and made an adjusted copy.

def plot_decision_boundary1(pred_func):
    # Set min and max values and give it some padding
    x_min, x_max = X1[:, 0].min() - .5, X1[:, 0].max() + .5
    y_min, y_max = X1[:, 1].min() - .5, X1[:, 1].max() + .5
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the function value for the whole grid
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
    plt.scatter(X1[:, 0], X1[:, 1], c=y1, cmap=plt.cm.Spectral)

# Build a model with a 5-dimensional hidden layer
model = build_model1(5, print_loss=True)

# Plot the decision boundary
plot_decision_boundary1(lambda x: predict1(model, x))
plt.title("Decision Boundary for hidden layer size 5")

Result:
[Figure: decision boundary of the three-class network with hidden layer size 5]

5. Extend the network to 4 layers

All three functions simply gain one more hidden layer.

def calculate_loss2(model):
    W1, b1, W2, b2, W3, b3 = model['W1'], model['b1'], model['W2'], model['b2'], model['W3'], model['b3']
    # Forward propagation to calculate our predictions
    z1 = X.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    a2 = np.tanh(z2)
    z3 = a2.dot(W3) + b3
    exp_scores = np.exp(z3)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    # Calculating the loss
    corect_logprobs = -np.log(probs[range(num_examples), y])
    data_loss = np.sum(corect_logprobs)
    # Add regularization term to loss (optional); W3 is included
    # because its gradient is regularized during training as well
    data_loss += reg_lambda/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3)))
    return 1./num_examples * data_loss

def predict2(model, x):
    W1, b1, W2, b2, W3, b3 = model['W1'], model['b1'], model['W2'], model['b2'], model['W3'], model['b3']
    # Forward propagation
    z1 = x.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    a2 = np.tanh(z2)
    z3 = a2.dot(W3) + b3
    exp_scores = np.exp(z3)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return np.argmax(probs, axis=1)

def build_model2(nn_hdim, nn_hdim1, num_passes=20000, print_loss=False):
    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_hdim1) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_hdim1))
    W3 = np.random.randn(nn_hdim1, nn_output_dim) / np.sqrt(nn_hdim1)
    b3 = np.zeros((1, nn_output_dim))

    # This is what we return at the end
    model = {}

    # Gradient descent. For each batch...
    for i in range(0, num_passes):
        # Forward propagation
        z1 = X.dot(W1) + b1
        a1 = np.tanh(z1)
        z2 = a1.dot(W2) + b2
        a2 = np.tanh(z2)
        z3 = a2.dot(W3) + b3
        exp_scores = np.exp(z3)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Backpropagation
        delta4 = probs
        delta4[range(num_examples), y] -= 1
        dW3 = (a2.T).dot(delta4)
        db3 = np.sum(delta4, axis=0, keepdims=True)
        delta3 = delta4.dot(W3.T) * (1 - np.power(a2, 2))
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW1 = np.dot(X.T, delta2)
        db1 = np.sum(delta2, axis=0, keepdims=True)
        dW3 += reg_lambda * W3
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1

        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2
        W3 += -epsilon * dW3
        b3 += -epsilon * db3

        # Assign new parameters to the model
        model = { 'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2, 'W3': W3, 'b3': b3}

        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f" % (i, calculate_loss2(model)))

    return model

# Build a model with hidden layers of size 3 and 4
model = build_model2(3, 4, print_loss=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict2(model, x))
plt.title("Decision Boundary for hidden layer size 3-4")