google
ICML-2021
Table of Contents
- 1 Background and Motivation
- 2 Related Work
- 3 Advantages / Contributions
- 4 Method
- 4.1 Understanding Training Efficiency
- 4.2 Training-Aware NAS and Scaling
- 4.3 Progressive learning
- 5 Experiments
- 5.1 Datasets and Metrics
- 5.2 ImageNet ILSVRC2012 and ImageNet21k
- 5.3 Transfer Learning Datasets
- 5.4 Ablation Studies
- Comparison to EfficientNet
- Progressive Learning for Different Networks
- Importance of Adaptive Regularization
- 6 Conclusion(own)
1 Background and Motivation
An upgrade built on EfficientNet v1 (【EfficientNet】《EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks》), targeting faster training speed and better parameter efficiency.
The authors observe that:
- training with very large image sizes is slow (hence progressive training: the input image size is gradually increased as training proceeds)
- depthwise convolutions are slow in early layers (hence the proposed Fused-MBConv; the NAS result uses Fused-MBConv, which has no dw conv, in all early layers)
- equally scaling up every stage is sub-optimal
2 Related Work
- Training and parameter efficiency
- Progressive training
- Neural architecture search (NAS)
3 Advantages / Contributions
Proposes EfficientNetV2, obtained with training-aware neural architecture search and scaling; further improves speed and accuracy with progressive learning (jointly scheduling image size and regularization); faster and better results on public datasets.
4 Method
The method builds on the EfficientNet (v1) baseline.
4.1 Understanding Training Efficiency
(1)Training with very large image sizes is slow
(2)Depthwise convolutions are slow in early layers but effective in later stages
Although depthwise convs have fewer parameters and FLOPs, they cannot fully utilize modern accelerators.
Fused-MBConv is more hardware-efficient and more accurate, but it increases parameters and FLOPs.
So how should MBConv and Fused-MBConv be combined in the network for the best efficiency? The authors leverage neural architecture search to automatically search for the best combination.
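To make the trade-off concrete, here is a minimal PyTorch sketch of the two building blocks (simplified: no SE module, no stochastic depth; the class names and the `expand_ratio` default are my own choices, not the official implementation):

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Expand (1x1) -> depthwise 3x3 -> project (1x1). Fewer FLOPs/params,
    but the depthwise conv under-utilizes modern accelerators in early stages."""
    def __init__(self, in_ch, out_ch, expand_ratio=4, stride=1):
        super().__init__()
        mid = in_ch * expand_ratio
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),  # depthwise
            nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.use_residual = stride == 1 and in_ch == out_ch

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

class FusedMBConv(nn.Module):
    """Replaces the expand 1x1 + depthwise 3x3 with a single regular 3x3 conv:
    more FLOPs/params, but runs faster on accelerators in early stages."""
    def __init__(self, in_ch, out_ch, expand_ratio=4, stride=1):
        super().__init__()
        mid = in_ch * expand_ratio
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 3, stride, 1, bias=False),  # fused regular conv
            nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.use_residual = stride == 1 and in_ch == out_ch

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```

In the searched EfficientNetV2-S, the early stages end up using Fused-MBConv while the later stages keep MBConv (with SE).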
(3)Equally scaling up every stage is sub-optimal
Scaling the network's depth or width by the same factor in every stage is not optimal.
EfficientNet-S -> EfficientNet-M -> EfficientNet-L
4.2 Training-Aware NAS and Scaling
The NAS is built on top of v1; the searched EfficientNetV2-S architecture is shown below (Fused-MBConv in the early stages, MBConv with SE in the later stages).
When scaling EfficientNetV2-S up to M and L, the authors gradually add more layers to the later stages (e.g., stage 5 and 6) instead of scaling every stage equally; a toy sketch of this idea follows.
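In the sketch below, the per-stage layer counts, `depth_mult`, and the `later_stage_bonus` factor are hypothetical placeholders to illustrate non-uniform depth scaling, not the paper's actual S/M/L configurations or scaling rule:

```python
# Hypothetical per-stage layer counts (stages 1..6); NOT the real EfficientNetV2-S layout.
base_layers = [2, 4, 4, 6, 9, 15]

def scale_depth(base_layers, depth_mult, later_stage_bonus):
    """Scale every stage's depth by depth_mult, but give the later stages
    (here, stages 5 and 6) an extra multiplier, mimicking the idea of
    'gradually add more layers to later stages'."""
    scaled = []
    for idx, n in enumerate(base_layers, start=1):
        mult = depth_mult * (later_stage_bonus if idx >= 5 else 1.0)
        scaled.append(max(1, round(n * mult)))
    return scaled

# e.g. a hypothetical "M"-like depth config derived from the "S"-like base
print(scale_depth(base_layers, depth_mult=1.4, later_stage_bonus=1.3))
```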
Training Speed Comparison:
EffNet (reprod), the authors' reproduction, is trained with an image size about 30% smaller.
4.3 Progressive learning
in the early training epochs, we train the network with small image size and weak regularization (e.g., dropout and data augmentation), then we gradually increase image size and add stronger regularization.
The novelty is that not only the image size is progressive, but the regularization is as well.
The larger the input size, the stronger the data augmentation should be; this combination gives better results.
The progressive learning strategy is as follows.
Not only does the image size grow as training proceeds, the regularization strength also increases. The algorithm uses the notation below:
- $S_i$: image size at stage $i$; the initial size is $S_0$ and the final (target) size is $S_e$.
- $\phi_i^k$: regularization magnitude of the $k$-th regularization type at stage $i$ (see 【Randaugment】《Randaugment: Practical automated data augmentation with a reduced search space》 and 【AutoAugment for OD】《Learning Data Augmentation Strategies for Object Detection》). The authors use Dropout, RandAugment, and Mixup; the minimum magnitude is $\phi_0^k$ and the maximum is $\phi_e^k$.
- $M$: the training process is divided into $M$ stages (here four stages with about 87 epochs per stage); note these are training stages, not the backbone's stages.
- $N$: total training steps, which can be read as epochs or per-batch iterations.
Progressive learning uses the simplest form, linear interpolation; the detailed per-model parameter ranges are listed in the table below.
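As a concrete illustration of the linear schedule, here is a minimal Python sketch; the `linear_interp` helper and the concrete ranges (image size 128→300, RandAugment 5→15, Mixup 0→0.2, Dropout 0.1→0.3) are illustrative placeholders rather than the paper's exact table values:

```python
def linear_interp(start, end, stage, num_stages):
    """Linearly interpolate a value for training stage i in [0, M-1]."""
    if num_stages == 1:
        return end
    return start + (end - start) * stage / (num_stages - 1)

# Illustrative ranges (placeholders, not the paper's exact table values).
M = 4                                   # training split into 4 stages
image_size = (128, 300)                 # S_0 -> S_e
regularizers = {                        # phi_0^k -> phi_e^k
    "randaug_magnitude": (5, 15),
    "mixup_alpha": (0.0, 0.2),
    "dropout_rate": (0.1, 0.3),
}

for i in range(M):
    size = int(linear_interp(*image_size, i, M))
    regs = {k: round(linear_interp(lo, hi, i, M), 3)
            for k, (lo, hi) in regularizers.items()}
    print(f"stage {i}: image_size={size}, {regs}")
```

Each of the $M$ stages then trains for roughly $N/M$ steps with its interpolated image size and regularization magnitudes.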
5 Experiments
5.1 Datasets and Metrics
- ImageNet ILSVRC2012:about 1.28M training images and 50,000 validation images with 1000 classes
- ImageNet21k:13M training images with 21,841 classes
- CIFAR-10
- CIFAR-100
- Flowers
- Cars
5.2 ImageNet ILSVRC2012 and ImageNet21k
Results
Both faster and more accurate.
From the figure, though, the speed advantage does not look dramatic, while the accuracy advantage is very clear.
The authors' takeaways after adding ImageNet21k:
- Scaling up data size is more effective than simply scaling up model size in the high-accuracy regime. Here, data size means the number of training images, not the image size fed into the network.
- Pretraining on ImageNet21k could be quite efficient: since the proposed method speeds up training (and inference), pre-training on the larger dataset costs roughly the same time as the earlier setup without progressive learning, while further improving accuracy.
5.3 Transfer Learning Datasets
The gain on CIFAR-10 is modest, while the lead on CIFAR-100 and Cars is much clearer.
5.4 Ablation Studies
Comparison to EfficientNet
Both training speed (reduced from 139h to 54h) and accuracy (improved from 84.7% to 85.0%) are better than in the original EfficientNet paper.
Starting from EfficientNetV2-S, the authors also scale down to several smaller models and check their performance.
The main selling point is speed.
Progressive Learning for Different Networks
The speedup is significant.
Importance of Adaptive Regularization
This is one of the authors' key contributions.
6 Conclusion(own)
- Unlike transformers, whose weights (e.g., position embedding) may depend on the input length, CNN weights do not depend on the input image size, which is what makes progressive resizing straightforward (see the sketch after this list).
- Does progressive learning also work for object detection?
- Use depthwise convolutions in the later stages, keeping them out of the early ones.
- data size vs model size
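As a quick sanity check of the first bullet, the PyTorch sketch below shows that the same conv weights handle different input sizes unchanged, while a ViT-style position embedding is tied to a fixed patch grid and has to be interpolated when the input size changes (the toy model and all numbers are mine, not from the paper):

```python
import torch
import torch.nn as nn

# A tiny CNN: convolution weights are independent of the spatial size,
# so the same model runs on 128x128 and 300x300 inputs unchanged.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)
print(cnn(torch.randn(1, 3, 128, 128)).shape)   # torch.Size([1, 10])
print(cnn(torch.randn(1, 3, 300, 300)).shape)   # torch.Size([1, 10])

# A ViT-style position embedding is sized for a fixed number of patches
# (e.g. 14x14 for a 224x224 input with 16x16 patches); changing the input
# size changes the token count, so the embedding must be interpolated.
pos_embed = torch.randn(1, 14 * 14, 768)        # fixed-length by construction
new_grid = 19                                   # e.g. 304x304 / 16 = 19x19 patches
resized = nn.functional.interpolate(
    pos_embed.reshape(1, 14, 14, 768).permute(0, 3, 1, 2),
    size=(new_grid, new_grid), mode="bicubic", align_corners=False,
).permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, 768)
print(resized.shape)                            # torch.Size([1, 361, 768])
```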