GoogLeNet V2 and V3: Batch Normalization

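Contents

Batch Normalization
- internal covariate shift
- The role of activation layers
- Where BN is applied
- Data whitening
- The BN layer in the network
- Training procedure
BN experimental results
- MNIST
- Comparison with GoogLeNet V1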

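After GoogLeNet came out, Google evolved the design through several more versions. It is usually described as having four: the original is V1, followed by V2, V3 and V4.
Personally I think V2 and V3 belong together, since both combine improvements from two papers: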

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Rethinking the Inception Architecture for Computer Vision

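The first paper introduces an important concept, Batch Normalization (BN), which targets the internal covariate shift problem; put simply, it speeds up training. Inserting BN as an extra layer before each activation layer makes the network converge faster.
The second paper proposes several new ways of structuring convolutions and, combined with the first paper, arrives at the Inception v2 architecture. As I read it, v3 is not introduced as a separate design; rather, some variants of v2 are what became v3.
Let's look at what these two papers say and what v2 and v3 actually improve.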

Batch Normalization

internal covariate shift

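Before getting into BN itself, we should be clear about the problem it solves. The paper calls it internal covariate shift. The name sounds impressive, but it is not obvious what it means.
The paper describes it as follows: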
Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change.
This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs.
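The gist: during training, every backward pass updates the parameters of the earlier layers, so the outputs of those layers change, and with them the distribution of the inputs to the next layer. This continual change in each layer's input distribution is what the paper calls internal covariate shift.
Put simply, the distribution of the hidden-layer data shifts a lot, which makes vanishing and exploding gradients more likely and makes training hard to converge. When a gradient is sky-high one moment and zero the next, convergence is indeed difficult.
The basic idea of BN is then, for each training mini-batch, to add a BN layer before each activation layer (Sigmoid, ReLU and so on). In other words, it standardizes the data, linearly transforming the previous layer's output into a fixed distribution.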

The role of activation layers

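A short aside: I never quite understood what activation layers were for, and this paper made it clearer. The middle of a network consists mostly of convolutional and fully connected layers, and both apply a linear (more precisely, affine) transformation to the previous layer's output. Without activation layers in between, any number of stacked layers could be collapsed into a single one, because a composition of linear transformations is itself a linear transformation.
For example, suppose the first layer computes:
$F (x) = 2 x + 3$

and the second layer computes:
$H (x) = 4 x - 4$
Here the input x of the second layer is the first layer's output $F (x)$, so
$H (F (x)) = 4 (2 x + 3) - 4 = 8 x + 8$
which is again a single layer. The same holds for more complicated linear transformations. Adding an activation layer changes this: an activation is a nonlinear function, so it does not satisfy
$f (x + y) = f (x) + f (y)$, and the layers can no longer be merged in this way.
This increases the model's representation power, that is, its ability to fit the data distribution.
That is why, in most architectures, every convolutional layer is followed by a nonlinear layer (pooling or activation).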
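The collapse argument can be checked numerically. The sketch below reuses the two example layers $F$ and $H$; the helper names (`layer_f`, `layer_h`, `collapsed`, `relu`) and the sample inputs are made up for illustration, and ReLU is just one possible choice of activation.

```python
def layer_f(x):
    # First "layer": F(x) = 2x + 3
    return 2 * x + 3

def layer_h(x):
    # Second "layer": H(x) = 4x - 4
    return 4 * x - 4

def collapsed(x):
    # Single layer equivalent to H(F(x)) = 4(2x + 3) - 4 = 8x + 8
    return 8 * x + 8

def relu(x):
    # A common nonlinear activation: max(x, 0)
    return max(x, 0.0)

xs = [-5.0, -1.0, 0.0, 2.0]  # arbitrary sample inputs

# Without an activation, the two stacked layers equal the single collapsed layer.
linear_matches = all(layer_h(layer_f(x)) == collapsed(x) for x in xs)

# With a ReLU between the layers, the stack no longer equals that single layer.
relu_matches = all(layer_h(relu(layer_f(x))) == collapsed(x) for x in xs)

print(linear_matches, relu_matches)
```

The second check fails for inputs where $F(x)$ is negative, because the ReLU clips them to zero before the second layer sees them.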

Where BN is applied

The paper says: To Batch-Normalize a network, we specify a subset of activations and insert the BN transform for each of them.
In practice, BN is added before every activation layer.

Data whitening

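The paper notes: By fixing the distribution of the layer inputs x as the training progresses, we expect to improve the training speed. It has been long known that the network training converges faster if its inputs are whitened, linearly transformed to have zero means and unit variances, and decorrelated.
This builds on LeCun's 1998 paper, which observed that whitened inputs speed up training. Strictly speaking, whitening means zero mean, unit variance and decorrelated features; because full whitening is expensive, the BN paper simplifies it to normalizing each dimension independently to zero mean and unit variance.
For a d-dimensional sample vector $x=(x^{(1)},x^{(2)},\ldots,x^{(d)})$, each dimension is normalized as:
$\hat{x}^{(k)}=\frac{x^{(k)}-E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$
Once every dimension has been processed, the resulting vector $\hat{x}$ has zero mean and unit variance in each dimension.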
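As a minimal NumPy sketch of this per-dimension normalization: the function name `standardize`, the synthetic data, and the small constant `eps` (added under the square root to avoid division by zero) are assumptions for illustration, not part of the formula above. Note that this does not decorrelate the features, so it is not full whitening.

```python
import numpy as np

def standardize(x, eps=1e-5):
    """Normalize each feature (column) of x to zero mean and unit variance.

    x has shape (num_samples, d). eps is an assumed stability constant.
    """
    mean = x.mean(axis=0)   # E[x^(k)] for each dimension k
    var = x.var(axis=0)     # Var[x^(k)] for each dimension k
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# 1000 synthetic samples with d = 3 features on very different scales
x = rng.normal(loc=[5.0, -2.0, 0.5], scale=[3.0, 0.5, 10.0], size=(1000, 3))

x_hat = standardize(x)
print(x_hat.mean(axis=0))  # close to 0 in every dimension
print(x_hat.var(axis=0))   # close to 1 in every dimension
```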

The BN layer in the network

After this normalization, one more linear transformation is applied:
$y^{(k)}=\gamma^{(k)}\hat{x}^{(k)}+\beta^{(k)}$
As for why this extra step is needed, the paper says:
Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity.
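My understanding is that plain normalization reduces the network's representational capacity: forcing every input into a zero-mean, unit-variance distribution throws some information away. So the normalized values are scaled and shifted (picture stretching and shifting the normal distribution curve), and the two parameters $\gamma$ and $\beta$ are learned during training. In other words, the network learns how much scaling and shifting fits the data best.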
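One way to see that $\gamma$ and $\beta$ restore what normalization takes away: the paper points out that setting $\gamma^{(k)}=\sqrt{Var[x^{(k)}]}$ and $\beta^{(k)}=E[x^{(k)}]$ recovers the original activations. The sketch below checks this numerically. The function names, the synthetic data, and `eps` are assumptions for illustration; in a real network $\gamma$ and $\beta$ would be learned by gradient descent rather than set by hand.

```python
import numpy as np

def normalize(x, eps=1e-5):
    # Per-dimension normalization; also returns the statistics used.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps), mean, var

def scale_shift(x_hat, gamma, beta):
    # y^(k) = gamma^(k) * x_hat^(k) + beta^(k)
    return gamma * x_hat + beta

eps = 1e-5
rng = np.random.default_rng(0)
x = rng.normal(loc=[4.0, -1.0], scale=[2.0, 0.5], size=(256, 2))
x_hat, mean, var = normalize(x, eps)

# gamma = 1, beta = 0 leaves the normalized values unchanged.
y_default = scale_shift(x_hat, np.ones(2), np.zeros(2))

# gamma = sqrt(var + eps), beta = mean undoes the normalization entirely.
y_identity = scale_shift(x_hat, np.sqrt(var + eps), mean)

print(np.allclose(y_default, x_hat), np.allclose(y_identity, x))
```

So the BN layer can represent the identity transform if that turns out to be the best choice, and anything in between.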

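The normalization above is written in terms of the expectation and variance of each dimension; in training, these statistics have to be computed over the samples of one training mini-batch.
The paper says: since we use mini-batches in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation.
In other words, the normalization is performed separately for each mini-batch.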
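To make "each mini-batch produces estimates" concrete, here is a rough sketch that splits synthetic activations into mini-batches and normalizes each with its own statistics. The batch size, data, helper names, and `eps` are all illustrative assumptions.

```python
import numpy as np

def batch_stats(batch):
    # Per-activation mean and variance estimated from one mini-batch
    return batch.mean(axis=0), batch.var(axis=0)

def normalize_batch(batch, eps=1e-5):
    # Normalize a mini-batch using only its own statistics
    mean, var = batch_stats(batch)
    return (batch - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# Synthetic "activations": 640 samples, 4 activations each
data = rng.normal(loc=3.0, scale=2.0, size=(640, 4))

batch_size = 64
batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

# Every mini-batch gives a slightly different estimate of the mean.
batch_means = np.stack([batch_stats(b)[0] for b in batches])  # shape (10, 4)
print(batch_means.std(axis=0))  # nonzero spread across mini-batches

# Each mini-batch is nevertheless normalized to zero mean on its own.
normalized = [normalize_batch(b) for b in batches]
print(normalized[0].mean(axis=0))
```

The estimates fluctuate from batch to batch, which is why the statistics used during training differ slightly at every step.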

The computation is as follows: