【ICCV21】Swin Transformer: Hierarchical Vision Transformer using Shifted Windows


Paper link: https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper.pdf

0. Abstract

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.

Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.

To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows.

The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.

This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size.

These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val).

Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.

The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.

1. Introduction


3. Method

3.1 Overall Architecture

It first splits an input RGB image into non-overlapping patches by a patch splitting module, like ViT.

Each patch is treated as a “token”, and its feature is set as a concatenation of the raw pixel RGB values.

A linear embedding layer is applied on this raw-valued feature to project it to an arbitrary dimension (denoted as C).
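The patch splitting and projection steps above can be sketched in plain Python. This is only an illustration, not the authors' implementation: the 4 x 4 patch size is the paper's default but is not stated in these notes, and the names `patch_partition` and `linear_embed`, the toy image, and the weights are made up for the example.

```python
def patch_partition(image, patch=4):
    """Split an H x W x 3 image (nested lists) into non-overlapping patches.

    Returns a grid of tokens; each token is the flat concatenation of the raw
    RGB values inside one patch, so its length is patch * patch * 3.
    """
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "pad the image first"
    tokens = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            feature = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    feature.extend(image[y][x])  # append R, G, B of one pixel
            row.append(feature)
        tokens.append(row)
    return tokens


def linear_embed(token, weight):
    """Project one raw token to C dimensions; weight has shape (C, len(token))."""
    return [sum(w * v for w, v in zip(w_row, token)) for w_row in weight]


# Toy 8 x 8 RGB image whose pixel values encode the pixel position.
image = [[[y, x, 0] for x in range(8)] for y in range(8)]
tokens = patch_partition(image, patch=4)
print(len(tokens), len(tokens[0]), len(tokens[0][0]))  # 2 2 48

# Hypothetical projection to C = 2 with arbitrary (not learned) weights.
embedded = linear_embed(tokens[0][0], [[1] * 48, [0] * 48])
```

With a 4 x 4 patch, each token starts as a 48-dimensional vector (4 * 4 * 3) before the projection to C dimensions.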

Several Transformer blocks with modified self-attention computation (Swin Transformer blocks) are applied on these patch tokens.

To produce a hierarchical representation, the number of tokens is reduced by patch merging layers as the network gets deeper.

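The first patch merging layer concatenates the features of each group of 2 × 2 neighboring patches and applies a linear layer on the 4C-dimensional concatenated features.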
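A minimal sketch of the concatenation step in patch merging, assuming tokens are stored as a grid of plain Python lists. The function name `patch_merging` and the order in which the four neighbours are concatenated are choices made for this example; the linear layer that follows the concatenation in the paper is omitted.

```python
def patch_merging(tokens):
    """Concatenate each 2 x 2 group of neighbouring tokens.

    tokens: grid (list of rows) of C-dimensional feature lists.
    Returns a grid with half the height and width and 4C-dimensional features.
    """
    rows, cols = len(tokens), len(tokens[0])
    assert rows % 2 == 0 and cols % 2 == 0, "grid sides must be even"
    merged = []
    for r in range(0, rows, 2):
        merged_row = []
        for c in range(0, cols, 2):
            # Concatenation order is an arbitrary choice for this sketch.
            merged_row.append(
                tokens[r][c]
                + tokens[r + 1][c]
                + tokens[r][c + 1]
                + tokens[r + 1][c + 1]
            )
        merged.append(merged_row)
    return merged


# 4 x 4 grid of tokens with C = 2; each token stores its own (row, col).
grid = [[[r, c] for c in range(4)] for r in range(4)]
merged = patch_merging(grid)
print(len(merged), len(merged[0]), len(merged[0][0]))  # 2 2 8
```

The number of tokens drops by a factor of 4 (from 16 to 4 here) while the feature length grows from C to 4C.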

Swin Transformer blocks are applied afterwards for feature transformation.

Two successive Swin Transformer blocks use W-MSA and SW-MSA, multi-head self-attention modules with regular and shifted windowing configurations, respectively.

  • Swin Transformer block

A Swin Transformer block is built by replacing the standard multi-head self-attention (MSA) module in a Transformer block with a module based on shifted windows.
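To make the regular and shifted window configurations concrete, here is a small sketch on a 4 x 4 grid of token labels. It assumes, following the paper, that the shifted configuration displaces windows by half the window size and is realised with a cyclic shift toward the top-left; the helper names `window_partition` and `cyclic_shift` are invented for this example, and the attention computation itself is left out.

```python
def window_partition(grid, window):
    """Split a 2-D grid into non-overlapping window x window blocks."""
    rows, cols = len(grid), len(grid[0])
    assert rows % window == 0 and cols % window == 0
    windows = []
    for top in range(0, rows, window):
        for left in range(0, cols, window):
            windows.append(
                [
                    grid[y][x]
                    for y in range(top, top + window)
                    for x in range(left, left + window)
                ]
            )
    return windows


def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` positions, wrapping around."""
    rows, cols = len(grid), len(grid[0])
    return [
        [grid[(y + shift) % rows][(x + shift) % cols] for x in range(cols)]
        for y in range(rows)
    ]


# 4 x 4 token grid labelled 0..15 in row-major order; window size M = 2.
M = 2
grid = [[y * 4 + x for x in range(4)] for y in range(4)]
regular = window_partition(grid, M)
shifted = window_partition(cyclic_shift(grid, M // 2), M)
print(regular[0])  # [0, 1, 4, 5]
print(shifted[0])  # [5, 6, 9, 10]
print(shifted[3])  # [15, 12, 3, 0]
```

The first shifted window mixes tokens 5, 6, 9 and 10, which sit in four different regular windows, so the next block connects them. The last shifted window holds tokens that only became adjacent through the wrap-around; the paper masks attention between such non-neighbouring tokens, which this sketch does not do.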

3.2 Shifted Window based Self-Attention

  • Self-attention in non-overlapped windows

  • Shifted window partitioning in successive blocks

  • Efficient batch computation for shifted configuration

  • Relative position bias
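For the relative position bias, the paper adds a bias B to the attention logits, with B looked up from a smaller table of (2M-1) x (2M-1) learnable values. The sketch below only builds the lookup index for one window, assuming row-major token order inside the window; `relative_position_index` is a name chosen for this example, and the table values themselves would be learned.

```python
def relative_position_index(window):
    """Map every (query, key) pair in a window to an entry of the bias table.

    Relative offsets along each axis lie in [-(M-1), M-1], so there are
    (2M-1)^2 distinct offsets; each one shares a single learnable bias.
    """
    coords = [(y, x) for y in range(window) for x in range(window)]
    span = 2 * window - 1
    index = []
    for qy, qx in coords:
        row = []
        for ky, kx in coords:
            dy = qy - ky + window - 1  # shift offset into 0..2M-2
            dx = qx - kx + window - 1
            row.append(dy * span + dx)
        index.append(row)
    return index


index = relative_position_index(2)
print(len(index), len(index[0]))  # 4 4
print(index[0])  # [4, 3, 1, 0]
```

Every diagonal entry maps to the same table slot (offset zero), and pairs with the same relative offset share one bias regardless of where they sit in the window.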

3.3 Architecture Variants



4. Experiments

We conduct experiments on ImageNet-1K image classification [19], COCO object detection [43], and ADE20K semantic segmentation [83].

In the following, we first compare the proposed Swin Transformer architecture with the previous state of the art on the three tasks.

Then, we ablate the important design elements of Swin Transformer.

4.1 Image Classification on ImageNet-1K


4.2 Object Detection on COCO


4.3 Semantic Segmentation on ADE20K


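FLOPS (note: all uppercase) stands for floating-point operations per second. It measures hardware performance.
FLOPs (lowercase s) stands for floating-point operations, i.e. the number of floating-point operations. It can be used to measure the complexity of an algorithm or model.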
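As a worked example of FLOPs as a complexity measure, the sketch below evaluates the paper's complexity estimates for global self-attention, 4hwC^2 + 2(hw)^2 C, and window-based self-attention, 4hwC^2 + 2M^2 hwC, where h x w is the token grid, C the channel dimension and M the window size (softmax cost is omitted in both). The function names and the concrete sizes are illustrative assumptions.

```python
def msa_flops(h, w, channels):
    """Global multi-head self-attention over h * w tokens (softmax omitted)."""
    tokens = h * w
    return 4 * tokens * channels**2 + 2 * tokens**2 * channels


def window_msa_flops(h, w, channels, window):
    """Self-attention restricted to non-overlapping window x window windows."""
    tokens = h * w
    return 4 * tokens * channels**2 + 2 * window**2 * tokens * channels


# Illustrative sizes: a 56 x 56 token grid, C = 96, window size M = 7.
global_cost = msa_flops(56, 56, 96)
local_cost = window_msa_flops(56, 56, 96, 7)
print(f"global / windowed cost ratio: {global_cost / local_cost:.1f}")

# Doubling the number of tokens doubles the windowed cost (linear growth),
# while the global cost grows faster than that (quadratic term).
print(window_msa_flops(112, 56, 96, 7) / local_cost)
print(msa_flops(112, 56, 96) / global_cost)
```

With M fixed, the windowed cost is linear in hw, which is the linear complexity with respect to image size claimed in the abstract.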

4.4 Ablation Study


5. Conclusion

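Swin Transformer produces hierarchical feature representations and has linear computational complexity with respect to the input image size. It achieves state-of-the-art results on COCO and ADE20K.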


6. Acknowledgement

We thank many colleagues at Microsoft for their help, in particular, Li Dong and Furu Wei for useful discussions; Bin Xiao, Lu Yuan and Lei Zhang for help on datasets.






My thought

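Swin Transformer places more emphasis on generality across vision and language tasks; this paper places more emphasis on its capability as a backbone for different vision tasks.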








