【ICCV21】Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Table of Contents

  • 0. Abstract
  • 1. Introduction
  • 2. Related Work
  • 3. Method
    • 3.1 Overall Architecture
    • 3.2 Shifted Window based Self-Attention
    • 3.3 Architecture Variants
  • 4. Experiments
    • 4.1 Image Classification on ImageNet-1K
    • 4.2 Object Detection on COCO
    • 4.3 Semantic Segmentation on ADE20K
    • 4.4 Ablation Study
  • 5. Conclusion
  • 6. Acknowledgement
  • References
  • My thought
  • Easter Egg


Paper link: https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper.pdf

Paper reading notes

0. Abstract

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.

Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.

To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows.

The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.

This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size.

These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val).

Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.

The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.

Code repository: https://github.com/microsoft/Swin-Transformer

1. Introduction

This paper aims to extend the Transformer into a general-purpose backbone for computer vision that competes with CNNs, improving performance on image classification and vision-language tasks.

Swin Transformer is well suited to serve as a general-purpose backbone for a variety of vision tasks, in sharp contrast to previous Transformer-based architectures.


Modeling in computer vision has long been dominated by convolutional neural networks (CNNs).

Figure 1. (a) The proposed Swin Transformer builds hierarchical feature maps by merging image patches (shown in gray) in deeper layers and has linear computation complexity to input image size due to computation of self-attention only within each local window (shown in red). It can thus serve as a general-purpose backbone for both image classification and dense recognition tasks. (b) In contrast, previous vision Transformers [19] produce feature maps of a single low resolution and have quadratic computation complexity to input image size due to computation of self-attention globally.

the prevalent architecture today is instead the Transformer

Designed for sequence modeling and transduction tasks, the Transformer is notable for its use of attention to model long-range dependencies in the data.

demonstrated promising results on certain tasks

between the two modalities

can vary substantially in scale

This would be intractable for Transformer on high-resolution images, as the computational complexity of its self-attention is quadratic to image size.

conveniently leverage advanced techniques for dense prediction such as feature pyramid networks (FPN) [38] or U-Net [47]. The linear computational complexity is achieved by computing self-attention locally within non-overlapping windows that partition an image (outlined in red).
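
For reference, the paper quantifies this complexity gap in Eqs. (1)-(2). With h × w patches of dimension C and window size M (M = 7 by default), global MSA is quadratic in the patch count while window-based W-MSA is linear:

```latex
\Omega(\mathrm{MSA})            = 4hwC^2 + 2(hw)^2 C
\Omega(\mathrm{W\text{-}MSA})   = 4hwC^2 + 2M^2 hw C
```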

The window partition is shifted between consecutive self-attention layers, as illustrated in Figure 2.

This strategy is also efficient in regards to real-world latency: all query patches within a window share the same key set, which facilitates memory access in hardware.

A unified architecture across computer vision and natural language processing could benefit both fields, since it would facilitate joint modeling of visual and textual signals, and the modeling knowledge from both domains can be more deeply shared.

2. Related Work

  • CNN and variants

  • Self-attention based backbone architectures

  • Self-attention/Transformers to complement CNNs

  • Transformer based vision backbones

This work is most closely related to the Vision Transformer (ViT).

Our approach is both efficient and effective, achieving state-of-the-art accuracy on both COCO object detection and ADE20K semantic segmentation.

3. Method

3.1 Overall Architecture

It first splits an input RGB image into non-overlapping patches by a patch splitting module, like ViT.

Each patch is treated as a “token” and its feature is set as a concatenation of the raw pixel RGB values.

A linear embedding layer is applied on this raw-valued feature to project it to an arbitrary dimension (denoted as C).

Several Transformer blocks with modified self-attention computation (Swin Transformer blocks) are applied on these patch tokens.

To produce a hierarchical representation, the number of tokens is reduced by patch merging layers as the network gets deeper.

The first patch merging layer concatenates the features of each group of 2 × 2 neighboring patches, and applies a linear layer on the 4C-dimensional concatenated features.

Swin Transformer blocks are applied afterwards for feature transformation; a sketch of these two steps follows below.
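
To make the pipeline above concrete, here is a minimal PyTorch-style sketch of patch splitting and patch merging. It is a simplified reading of the paper, not the official implementation; the class names only echo the repo's modules.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping 4x4 patches and embed them to dim C."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # A strided conv is equivalent to "split + concat raw RGB values + linear projection".
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)    # (B, H/4 * W/4, C) patch tokens

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighboring patches and project 4C -> 2C."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                # x: (B, H*W, C)
        B, L, C = x.shape
        x = x.view(B, H, W, C)
        # Gather the four patches of every 2x2 neighborhood.
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
        x = x.view(B, -1, 4 * C)               # token count drops 4x
        return self.reduction(self.norm(x))    # channel dim doubles: (B, H/2 * W/2, 2C)
```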

[Figure: two successive Swin Transformer blocks, with regular and shifted windowing configurations, respectively.]

  • Swin Transformer block

Swin Transformer is built by replacing the standard multi-head self-attention (MSA) module in a Transformer block by a module based on shifted windows.
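
With W-MSA and SW-MSA denoting window-based and shifted-window-based multi-head self-attention, the paper (Eq. 3) writes two consecutive Swin blocks as:

```latex
\hat{z}^{l}   = \mathrm{W\text{-}MSA}\big(\mathrm{LN}(z^{l-1})\big) + z^{l-1}, \qquad
z^{l}         = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l},
\hat{z}^{l+1} = \mathrm{SW\text{-}MSA}\big(\mathrm{LN}(z^{l})\big) + z^{l}, \qquad
z^{l+1}       = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}
```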

3.2 Shifted Window based Self-Attention

  • Self-attention in non-overlapped windows

  • Shifted window partitioning in successive blocks

  • Efficient batch computation for shifted configuration (via cyclic shift; see the sketch below)

  • Relative position bias
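
A hedged sketch of the first three points, assuming PyTorch: window_partition/window_reverse mirror helpers of the same name in the official repo, and the cyclic shift uses torch.roll so the shifted configuration keeps the same number of equally sized windows (the attention mask that stops tokens from formerly non-adjacent regions attending to each other is noted but not reproduced here).

```python
import torch

def window_partition(x, M):
    """(B, H, W, C) -> (num_windows*B, M, M, C): split a feature map into MxM windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

def window_reverse(windows, M, H, W):
    """Inverse of window_partition."""
    B = windows.shape[0] // ((H // M) * (W // M))
    x = windows.view(B, H // M, W // M, M, M, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

# Shifted configuration: instead of padding to handle the displaced windows,
# cyclically roll the feature map by (-M//2, -M//2), run the same windowed
# attention (with a shift mask), then roll back.
B, H, W, C, M = 2, 56, 56, 96, 7
x = torch.randn(B, H, W, C)
shifted = torch.roll(x, shifts=(-M // 2, -M // 2), dims=(1, 2))
windows = window_partition(shifted, M)        # (B * 64, 7, 7, C) for a 56x56 map
# ... window attention with the shift mask would run here ...
restored = torch.roll(window_reverse(windows, M, H, W), shifts=(M // 2, M // 2), dims=(1, 2))
assert torch.allclose(restored, x)            # partition + reverse + unroll is the identity
```

For the fourth point, the paper adds a learned relative position bias B to each head's attention logits, Attention(Q, K, V) = SoftMax(QKᵀ/√d + B)V, with B indexed from a (2M − 1) × (2M − 1) parameter table.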

3.3 Architecture Variants

Besides the base model Swin-B, the paper also builds Swin-T, Swin-S, and Swin-L; their hyper-parameters are summarized below.

[Table: detailed architecture specifications of the Swin variants.]
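
The variant hyper-parameters reported in the paper, collected as a small Python summary (window size M = 7 throughout; the model size and complexity of Swin-T/S/L are about 0.25x, 0.5x, and 2x those of Swin-B):

```python
# Stage-1 channel dim C and number of Swin blocks per stage, from the paper.
SWIN_VARIANTS = {
    "Swin-T": {"C": 96,  "depths": (2, 2, 6, 2)},   # complexity similar to ResNet-50
    "Swin-S": {"C": 96,  "depths": (2, 2, 18, 2)},  # complexity similar to ResNet-101
    "Swin-B": {"C": 128, "depths": (2, 2, 18, 2)},
    "Swin-L": {"C": 192, "depths": (2, 2, 18, 2)},
}
```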

4. Experiments

We conduct experiments on ImageNet-1K image classification [19], COCO object detection [43], and ADE20K semantic segmentation [83].

In the following, we first compare the proposed Swin Transformer architecture with the previous state-of-the-arts on the three tasks.

Then, we ablate the important design elements of Swin Transformer.

4.1 Image Classification on ImageNet-1K

[Table omitted: ImageNet-1K classification results.]

4.2 Object Detection on COCO

[Table omitted: COCO object detection results.]

4.3 Semantic Segmentation on ADE20K

[Table omitted: ADE20K semantic segmentation results.]

Note the capitalization: FLOPS (all caps) is short for floating-point operations per second, a measure of hardware performance. FLOPs is short for floating-point operations, a count of arithmetic operations used to measure the complexity of an algorithm or model.
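
As a worked example of the distinction (and of the paper's Eqs. 1-2), a short back-of-the-envelope calculation of self-attention FLOPs on a stage-1-sized 56 × 56 feature map with C = 96 and window size M = 7; the numbers are illustrative only:

```python
def msa_flops(h, w, C):
    # Global self-attention: quadratic in the token count h*w (paper Eq. 1).
    return 4 * h * w * C**2 + 2 * (h * w)**2 * C

def wmsa_flops(h, w, C, M=7):
    # Window self-attention: linear in h*w for fixed window size M (paper Eq. 2).
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

h = w = 56
C = 96
print(f"MSA:   {msa_flops(h, w, C) / 1e9:.2f} GFLOPs")   # ~2.00 GFLOPs
print(f"W-MSA: {wmsa_flops(h, w, C) / 1e9:.2f} GFLOPs")  # ~0.15 GFLOPs, ~14x cheaper
```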

4.4 Ablation Study

[Tables omitted: ablation results.]

5. Conclusion

Swin Transformer produces hierarchical feature representations and has linear computational complexity with respect to input image size, achieving state-of-the-art results on COCO and ADE20K.

The shifted-window-based self-attention proposed in this paper is effective and efficient on vision problems.

6. Acknowledgement

We thank many colleagues at Microsoft for their help, in particular, Li Dong and Furu Wei for useful discussions; Bin Xiao, Lu Yuan and Lei Zhang for help on datasets.

Note: the people acknowledged here are not among the paper's authors.

References

https://github.com/microsoft/Swin-Transformer


https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper.pdf

My thought

Transformers in general stress their versatility across vision and language tasks; this paper stresses Swin Transformer's capability as a backbone for different vision tasks.

Easter Egg

A light moment:
[Image omitted.]

Feel free to discuss this article in the comments.
