A Collection of Deep Learning Training Tricks (Part 1)


Environment: PyTorch 1.4.0 + Ubuntu 16.04

References:

Data Augmentation Strategies (Part 1), mp.weixin.qq.com

https://zhuanlan.zhihu.com/p/104992391

Tricks for Training Deep Neural Network Models (Principles and Code Roundup), mp.weixin.qq.com

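I. Data augmentation

For basic data augmentation, the transforms in torchvision.transforms are enough. Here I collect the other methods.

Reference:

The Most Complete Explanation of Data Augmentation Methods in PyTorch, cloud.tencent.com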

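1.1 Single-image operations (image occlusion)

1. Cutout

Apply a square cut-out mask to the input of the CNN's first layer, i.e. mask out a random square region of the input image.

Paper:

Improved Regularization of Convolutional Neural Networks with Cutout, arxiv.org

Code:

https://github.com/uoguelph-mlrg/Cutout

Figure: Cutout illustration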
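To make the idea concrete, here is a minimal NumPy sketch of Cutout. It assumes the image is an H x W x C array; the function name `cutout` and the `mask_size` parameter are my own choices for illustration and are not the API of the repository linked above.

```python
import numpy as np


def cutout(image, mask_size, rng=None):
    """Zero out one square of side `mask_size` centred at a random pixel."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    # The square may extend past the border, so clip it to the image.
    y1, y2 = max(cy - mask_size // 2, 0), min(cy + mask_size // 2, h)
    x1, x2 = max(cx - mask_size // 2, 0), min(cx + mask_size // 2, w)
    out = image.copy()  # leave the input untouched
    out[y1:y2, x1:x2] = 0
    return out


image = np.ones((32, 32, 3), dtype=np.float32)
augmented = cutout(image, mask_size=16, rng=np.random.default_rng(0))
```

Because the centre can lie near the border, the masked area is sometimes smaller than `mask_size` x `mask_size`.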

2. Random Erasing

Replace a region of the image with random values or with the mean pixel value of the training set.

Paper:

https://arxiv.org/abs/1708.04896

Code:

https://github.com/zhunzhong07/Random-Erasing/blob/master/transforms.py

Figure: Random Erasing illustration
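A rough NumPy sketch of Random Erasing follows. It assumes a float image in [0, 1] with shape H x W x C. The parameter names (`p`, `area_range`, `min_aspect`, `fill`) and their default values are illustrative assumptions, not taken from the text above.

```python
import numpy as np


def random_erasing(image, p=0.5, area_range=(0.02, 0.4), min_aspect=0.3,
                   fill=None, rng=None):
    """With probability p, overwrite one random rectangle of the image.

    fill=None writes uniform random values; otherwise `fill` is written,
    e.g. the mean pixel value of the training set.
    """
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() > p:
        return image
    h, w = image.shape[:2]
    out = image.copy()
    for _ in range(100):  # retry until the sampled rectangle fits in the image
        area = rng.uniform(*area_range) * h * w
        aspect = rng.uniform(min_aspect, 1.0 / min_aspect)
        eh = int(round(np.sqrt(area * aspect)))
        ew = int(round(np.sqrt(area / aspect)))
        if 0 < eh < h and 0 < ew < w:
            y, x = rng.integers(0, h - eh), rng.integers(0, w - ew)
            if fill is None:
                out[y:y + eh, x:x + ew] = rng.random((eh, ew) + image.shape[2:])
            else:
                out[y:y + eh, x:x + ew] = fill
            return out
    return out  # no rectangle fitted: return an unchanged copy


image = np.full((64, 64, 3), 0.5, dtype=np.float32)
erased = random_erasing(image, p=1.0, fill=0.0, rng=np.random.default_rng(0))
```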

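3. Hide-and-Seek

Split the image into a grid of S x S patches and randomly hide each patch with a given probability. This makes the model learn what the whole object looks like rather than a single part of it, for example not relying only on an animal's face for recognition.

Paper:

Hide-and-Seek: Forcing a Network to be Meticulous for Weakly-supervised Object and Action Localization, arxiv.org

Code:

https://github.com/kkanshul/Hide-and-Seek/blob/master/hide_patch.py

Figure: Hide-and-Seek illustration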
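The patch-hiding step can be sketched in a few lines of NumPy. This is only an illustration under my own assumptions: `grid` is the number of cells per side, `hide_prob` is the per-patch hiding probability, and hidden patches are filled with a constant `fill` (0 here).

```python
import numpy as np


def hide_and_seek(image, grid=4, hide_prob=0.5, fill=0.0, rng=None):
    """Split the image into grid x grid patches and hide each with prob. hide_prob."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    # Patch boundaries along each axis (integer pixel coordinates).
    ys = np.linspace(0, h, grid + 1, dtype=int)
    xs = np.linspace(0, w, grid + 1, dtype=int)
    out = image.copy()
    for i in range(grid):
        for j in range(grid):
            if rng.random() < hide_prob:
                out[ys[i]:ys[i + 1], xs[j]:xs[j + 1]] = fill
    return out


image = np.ones((32, 32, 3), dtype=np.float32)
patched = hide_and_seek(image, grid=4, hide_prob=0.5, rng=np.random.default_rng(0))
```

Hiding is normally a training-time augmentation only; at test time the full image is used.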

4. GridMask

Hide regions of the image in a regular grid pattern. The goal is again to make the model learn all the parts that make up the object.

Paper:

https://arxiv.org/pdf/2001.04086.pdf

Code:

https://github.com/Jia-Research-Lab/GridMask/blob/master/imagenet_grid/utils/grid.py

Figure: GridMask illustration
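Below is a simplified NumPy sketch of a grid-shaped mask. The parameters are my own assumptions for illustration: `d` is the grid period in pixels and `ratio` is the fraction of each period that is kept along each axis. The sketch only applies a random offset and leaves out any extras, such as rotating the mask, that a full implementation might add.

```python
import numpy as np


def grid_mask(image, d=16, ratio=0.5, rng=None):
    """Drop a regular grid of squares from the image (simplified GridMask)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    keep = int(d * ratio + 0.5)  # width of the kept band in each period
    oy, ox = rng.integers(0, d), rng.integers(0, d)  # random grid offset
    # A pixel is dropped only if it lies in the "drop" part of the period on both axes.
    drop_rows = (np.arange(h) + oy) % d >= keep
    drop_cols = (np.arange(w) + ox) % d >= keep
    mask = ~(drop_rows[:, None] & drop_cols[None, :])
    return image * mask[..., None] if image.ndim == 3 else image * mask


image = np.ones((64, 64, 3), dtype=np.float32)
masked = grid_mask(image, d=16, ratio=0.5, rng=np.random.default_rng(0))
```

With `d=16` and `ratio=0.5`, each 16 x 16 period loses one 8 x 8 square, so a quarter of the pixels are dropped.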

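1.2 Multi-image combinations

1. Mixup

Generate a new image by linearly blending two images. The corresponding labels are blended with the same weights and used for training.

Paper:

https://arxiv.org/abs/1710.09412

Explanation and code:

How is mixup applied for image augmentation in object detection?, www.zhihu.com

Figure: Mixup illustration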
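A minimal NumPy sketch of Mixup for a single pair of samples is shown below. It assumes one-hot label vectors and draws the blending weight from a Beta(alpha, alpha) distribution as in the Mixup paper. The function name and the `alpha=1.0` default are illustrative choices.

```python
import numpy as np


def mixup(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Blend two images and their one-hot labels with the same weight lam."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)  # blending weight in [0, 1]
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y, lam


x1, x2 = np.zeros((32, 32, 3)), np.ones((32, 32, 3))
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x, y, lam = mixup(x1, y1, x2, y2, alpha=1.0, rng=np.random.default_rng(0))
```

In a training loop the same blending is usually applied to a whole batch by pairing it with a shuffled copy of itself.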

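2. CutMix

Augment the current image by pasting in a patch cut from another image. Because part of the image is replaced, the model is forced to learn to predict from a wide range of features.

Paper:

https://arxiv.org/abs/1905.04899

Code:

https://github.com/clovaai/CutMix-PyTorch/blob/master/train.py

Code walkthrough:

Model training tricks: CutMix (Guo_Python's blog on CSDN), blog.csdn.net

Figure: CutMix illustration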
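The following NumPy sketch shows CutMix for one pair of same-sized images with one-hot labels. It follows the sampling described in the CutMix paper (a Beta-distributed ratio, then a box whose area is roughly 1 - lam of the image), but the function and variable names are my own and this is not the code from the linked repository.

```python
import numpy as np


def cutmix(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Paste a random box from x2 into x1 and mix the labels by area."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = x1.shape[:2]
    lam = rng.beta(alpha, alpha)
    cut_h, cut_w = int(h * np.sqrt(1.0 - lam)), int(w * np.sqrt(1.0 - lam))
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    x = x1.copy()
    x[top:bottom, left:right] = x2[top:bottom, left:right]
    # Clipping may shrink the box, so recompute lam from the pasted area.
    lam = 1.0 - (bottom - top) * (right - left) / (h * w)
    y = lam * y1 + (1.0 - lam) * y2
    return x, y, lam


x1, x2 = np.zeros((32, 32, 3)), np.ones((32, 32, 3))
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x, y, lam = cutmix(x1, y1, x2, y2, alpha=1.0, rng=np.random.default_rng(0))
```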

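3. Mosaic data augmentation (for detection)

CutMix combines two images, whereas Mosaic combines four training images into one at certain ratios, so the model learns to recognize objects at a smaller scale. It also significantly reduces the need for a large batch size.

Code:

https://zhuanlan.zhihu.com/p/163356279

Figure: Mosaic data augmentation illustration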
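Here is a deliberately simplified NumPy sketch of the image part of Mosaic. It assumes four images that are each at least `out_size` x `out_size`, fills each quadrant by cropping rather than resizing, and ignores the bounding boxes, which a detection pipeline would also have to shift and clip. All names are hypothetical.

```python
import numpy as np


def mosaic(images, out_size=64, rng=None):
    """Tile four H x W x C images around a random split point."""
    assert len(images) == 4
    rng = np.random.default_rng() if rng is None else rng
    s = out_size
    # Random split point, kept away from the borders so every tile is non-empty.
    cy = rng.integers(s // 4, 3 * s // 4)
    cx = rng.integers(s // 4, 3 * s // 4)
    canvas = np.zeros((s, s) + images[0].shape[2:], dtype=images[0].dtype)
    # (row slice, column slice) of the canvas for each of the four quadrants.
    regions = [(slice(0, cy), slice(0, cx)), (slice(0, cy), slice(cx, s)),
               (slice(cy, s), slice(0, cx)), (slice(cy, s), slice(cx, s))]
    for img, (rows, cols) in zip(images, regions):
        th, tw = rows.stop - rows.start, cols.stop - cols.start
        canvas[rows, cols] = img[:th, :tw]  # crop the tile from the top-left corner
    return canvas, (cy, cx)


images = [np.full((64, 64, 3), k) for k in (1, 2, 3, 4)]
canvas, (cy, cx) = mosaic(images, out_size=64, rng=np.random.default_rng(0))
```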

II. Label Smoothing

Paper:

https://arxiv.org/pdf/1812.01187.pdf

Further reading:

An Introduction to Softmax and Its Label Smoothing Optimization, blog.csdn.net

Label Smoothing, blog.csdn.net

https://zhuanlan.zhihu.com/p/148487894

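In a multi-class training task, the network computes a confidence score for each class for the input image. Softmax normalizes these scores into the probability that the image belongs to each class. Training then minimizes the cross-entropy between the predicted probabilities and the true label distribution to obtain the best predicted distribution.

With one-hot labels, the network is driven to make the gap between the score of the correct label and the scores of the wrong labels as large as possible. When the training data is not enough to represent the features of all samples, this leads to overfitting. Label smoothing was proposed to address this problem. It first appeared in Inception v2 and is a regularization strategy: by "softening" the traditional one-hot labels, it effectively suppresses overfitting when the loss is computed.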
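Written as a formula, the softened target and the resulting loss can be sketched as follows. The notation is my own: K is the number of classes, y the true class, ε the smoothing factor (`smoothing` in the code that follows) and p_i the softmax probability of class i.

```latex
q_i = (1 - \varepsilon)\,\mathbb{1}[i = y] + \frac{\varepsilon}{K},
\qquad
\mathcal{L} = -\sum_{i=1}^{K} q_i \log p_i
```

For example, with K = 5 and ε = 0.1 the true class gets a target of 0.92 and every other class gets 0.02, which still sums to 1.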

Code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class LabelSmoothCELoss(nn.Module):
        """Cross-entropy loss with label smoothing."""

        def __init__(self):
            super().__init__()

        def forward(self, pred, label, smoothing=0.1):
            # pred: (N, C) raw logits; label: (N,) integer class indices
            log_prob = F.log_softmax(pred, dim=1)  # more stable than log(softmax(x))
            one_hot_label = F.one_hot(label, pred.size(1)).float()
            # Soften the one-hot target: spread `smoothing` uniformly over all classes
            smoothed_one_hot_label = (1.0 - smoothing) * one_hot_label + smoothing / pred.size(1)
            loss = (-log_prob * smoothed_one_hot_label).sum(dim=1)
            return loss.mean()

To use it, replace criterion = nn.CrossEntropyLoss() with criterion = LabelSmoothCELoss().

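III. Learning rate scheduling

Warm-up was first introduced in this paper: https://arxiv.org/pdf/1706.02677.pdf . Following that paper, warm-up is usually applied only during the first 5 epochs. The cosine learning rate schedule comes from this paper: https://arxiv.org/pdf/1812.01187.pdf . Using warm-up together with the cosine learning rate usually gives better results. Implementation: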

    import math

    import torch
    from torch.optim.lr_scheduler import _LRScheduler


    class WarmUpLR(_LRScheduler):
        """Warm-up learning rate scheduler.

        Args:
            optimizer: optimizer (e.g. SGD)
            total_iters: total number of iterations in the warm-up phase
        """

        def __init__(self, optimizer, total_iters, last_epoch=-1):
            self.total_iters = total_iters
            super().__init__(optimizer, last_epoch)

        def get_lr(self):
            """For the m-th batch, set the learning rate to base_lr * m / total_iters."""
            return [base_lr * self.last_epoch / (self.total_iters + 1e-8)
                    for base_lr in self.base_lrs]


    # `optimizer` and `args` (milestones, warm_up_epochs, epochs) are assumed to be defined elsewhere.

    # MultiStepLR without warm-up
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=args.milestones, gamma=0.1)

    # Warm-up followed by multi-step decay.
    # Note: the factor is 0 at epoch 0, so the first epoch runs with lr = 0.
    warm_up_with_multistep_lr = lambda epoch: (
        epoch / args.warm_up_epochs if epoch <= args.warm_up_epochs
        else 0.1 ** len([m for m in args.milestones if m <= epoch]))
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warm_up_with_multistep_lr)

    # Warm-up followed by cosine decay
    warm_up_with_cosine_lr = lambda epoch: (
        epoch / args.warm_up_epochs if epoch <= args.warm_up_epochs
        else 0.5 * (math.cos((epoch - args.warm_up_epochs) / (args.epochs - args.warm_up_epochs) * math.pi) + 1))
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warm_up_with_cosine_lr)
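To see what the warm-up plus cosine schedule looks like over a full run, the multiplier can be computed without any framework. The sketch below is a standalone variant with hypothetical names and defaults (5 warm-up epochs, 100 epochs in total, base learning rate 0.1). Unlike the lambda above, it uses `(epoch + 1)` during warm-up so that the first epoch does not train with a learning rate of 0.

```python
import math


def warmup_cosine_factor(epoch, warm_up_epochs=5, total_epochs=100):
    """Multiplier applied to the base learning rate at a given 0-based epoch."""
    if epoch < warm_up_epochs:
        # Linear warm-up; (epoch + 1) keeps the very first epoch from using lr = 0.
        return (epoch + 1) / warm_up_epochs
    # Cosine decay from 1 towards 0 over the remaining epochs.
    progress = (epoch - warm_up_epochs) / (total_epochs - warm_up_epochs)
    return 0.5 * (math.cos(progress * math.pi) + 1)


base_lr = 0.1
schedule = [base_lr * warmup_cosine_factor(e) for e in range(100)]
```

The learning rate rises linearly to `base_lr` by the end of warm-up and then decays smoothly, which avoids the abrupt drops of a multi-step schedule.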

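IV. Distillation

4.1 Classic distillation

Paper:

https://arxiv.org/pdf/1503.02531.pdf

Further reading:

Deep Learning Methods (15): Knowledge Distillation (Distilling the Knowledge in a Neural Network) and Online Distillation, blog.csdn.net

The Core Idea of Knowledge Distillation (Distilling Knowledge), blog.csdn.net

Figure: classic distillation diagram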

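Training follows these steps:
1. Train the large, complex network (Teacher Net, TN) with hard labels.
2. Using a high temperature T, run a forward pass through the trained TN to obtain soft labels.
3. Have the small network (Student Net, SN) produce two outputs, one at the high T and one at T=1, compute a weighted sum of the two cross-entropy losses, and train the SN.
4. Use the trained SN to predict classes.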
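The weighted loss in step 3 can be sketched roughly as below. It is written in NumPy only to show the arithmetic; in real training the same expression would be written with differentiable PyTorch ops. The values `T=4.0` and `alpha=0.7` are hypothetical, and the T**2 factor follows the suggestion in the distillation paper for balancing the two terms.

```python
import numpy as np


def softmax(logits, T=1.0):
    """Row-wise softmax with temperature T."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Weighted sum of the soft (temperature T) and hard (T = 1) cross-entropies."""
    n = student_logits.shape[0]
    # Soft term: cross-entropy between teacher and student distributions at temperature T.
    soft_targets = softmax(teacher_logits, T)
    soft_loss = -(soft_targets * np.log(softmax(student_logits, T))).sum(axis=1).mean()
    # Hard term: ordinary cross-entropy with the ground-truth labels at T = 1.
    hard_loss = -np.log(softmax(student_logits)[np.arange(n), labels]).mean()
    # T**2 rescales the soft term so both terms contribute comparably.
    return alpha * T ** 2 * soft_loss + (1.0 - alpha) * hard_loss


teacher_logits = np.array([[5.0, 1.0, 0.0], [0.0, 4.0, 1.0]])
student_logits = np.array([[2.0, 1.0, 0.5], [0.5, 1.5, 1.0]])
labels = np.array([0, 1])
loss = distillation_loss(student_logits, teacher_logits, labels)
```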

4.2 A newer distillation method: channel distillation

Paper:

Channel Distillation: Channel-Wise Attention for Knowledge Distillation, arxiv.org

Code:

https://github.com/zhouzaida/channel-distillation