Bottleneck Transformers for Visual Recognition

Type: Article

Publication Date: 2021-06-01

Citations: 946

DOI: https://doi.org/10.1109/cvpr46437.2021.01625

Abstract

We present BoTNet, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation. By just replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet, and making no other changes, our approach improves upon the baselines significantly on instance segmentation and object detection while also reducing the parameters, with minimal overhead in latency. Through the design of BoTNet, we also point out how ResNet bottleneck blocks with self-attention can be viewed as Transformer blocks. Without any bells and whistles, BoTNet achieves 44.4% Mask AP and 49.7% Box AP on the COCO Instance Segmentation benchmark using the Mask R-CNN framework, surpassing the previous best published single-model, single-scale results of ResNeSt [67] evaluated on the COCO validation set. Finally, we present a simple adaptation of the BoTNet design for image classification, resulting in models that achieve a strong performance of 84.7% top-1 accuracy on the ImageNet benchmark while being up to 1.64x faster in "compute" time than the popular EfficientNet models on TPU-v3 hardware. We hope our simple and effective approach will serve as a strong baseline for future research in self-attention models for vision.
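
The change the abstract describes is small enough to sketch directly. Below is a minimal PyTorch sketch, not the authors' released implementation: a ResNet bottleneck block whose 3x3 spatial convolution is swapped for global multi-head self-attention. The class names (MHSA2d, BoTBlock) are illustrative, and the factorized relative-position term is a simplification of the relative position self-attention the paper builds on.

```python
import torch
import torch.nn as nn


class MHSA2d(nn.Module):
    """Global multi-head self-attention over an H x W feature map with a
    factorized learned relative-position term (simplified sketch)."""

    def __init__(self, dim, heads=4, height=14, width=14):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)
        # Learned position embeddings, factorized into height and width parts.
        # The feature map fed to forward() must match (height, width).
        self.rel_h = nn.Parameter(torch.randn(heads, dim // heads, height, 1))
        self.rel_w = nn.Parameter(torch.randn(heads, dim // heads, 1, width))

    def forward(self, x):
        b, c, h, w = x.shape
        d = c // self.heads
        q, k, v = self.qkv(x).reshape(b, 3, self.heads, d, h * w).unbind(dim=1)
        content = torch.einsum('bhdi,bhdj->bhij', q, k)    # content-content: q . k
        pos = (self.rel_h + self.rel_w).reshape(self.heads, d, h * w)
        position = torch.einsum('bhdi,hdj->bhij', q, pos)  # content-position: q . r
        attn = ((content + position) * self.scale).softmax(dim=-1)
        out = torch.einsum('bhij,bhdj->bhdi', attn, v)
        return out.reshape(b, c, h, w)


class BoTBlock(nn.Module):
    """ResNet bottleneck block with the 3x3 convolution replaced by global
    self-attention, as the abstract describes for the final ResNet stage."""

    def __init__(self, dim, mid_dim, heads=4, height=14, width=14):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, mid_dim, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_dim),
            nn.ReLU(inplace=True),
            MHSA2d(mid_dim, heads, height, width),  # was: 3x3 convolution
            nn.BatchNorm2d(mid_dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_dim, dim, kernel_size=1, bias=False),
            nn.BatchNorm2d(dim),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.net(x))  # residual connection


# A c5-sized ResNet-50 feature map (e.g. from a 448x448 input): 2048 x 14 x 14.
x = torch.randn(2, 2048, 14, 14)
print(BoTBlock(2048, 512)(x).shape)  # torch.Size([2, 2048, 14, 14])
```

In the full model the abstract describes, three such blocks replace the final ResNet stage; strides, downsampling, and the exact relative-position formulation are omitted here for brevity.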

Locations

  • arXiv (Cornell University)
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Similar Works

  • Bottleneck Transformers for Visual Recognition (2021): Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, Ashish Vaswani
  • Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition (2024): Qibin Hou, Cheng-Ze Lu, Ming-Ming Cheng, Jiashi Feng
  • Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention (2021): Sitong Wu, Tianyi Wu, Haoru Tan, Guodong Guo
  • Focal Self-attention for Local-Global Interactions in Vision Transformers (2021): Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao
  • CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021): Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo
  • Improved Multiscale Vision Transformers for Classification and Detection (2021): Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer
  • CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2022): Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo
  • Three things everyone should know about Vision Transformers (2022): Hugo Touvron, Matthieu Cord, Alaaeldin El-Nouby, Jakob Verbeek, Hervé Jégou
  • You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection (2021): Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu
  • Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition (2022): Qibin Hou, Cheng-Ze Lu, Ming-Ming Cheng, Jiashi Feng
  • Convolutional Xformers for Vision (2022): Pranav Jeevan, Amit Sethi
  • Understanding The Robustness in Vision Transformers (2022): Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Anima Anandkumar, Jiashi Feng, José M. Álvarez
  • ReViT: Enhancing Vision Transformers with Attention Residual Connections for Visual Recognition (2024): Anxhelo Diko, Danilo Avola, Marco Cascio, Luigi Cinque
  • HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling (2022): Xiaosong Zhang, Yunjie Tian, Wei Huang, Qixiang Ye, Qi Dai, Lingxi Xie, Qi Tian
  • Stand-Alone Self-Attention in Vision Models (2019): Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jonathon Shlens
  • MViTv2: Improved Multiscale Vision Transformers for Classification and Detection (2022): Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer
  • Refiner: Refining Self-attention for Vision Transformers (2021): Daquan Zhou, Yujun Shi, Bingyi Kang, Weihao Yu, Zihang Jiang, Yuan Li, Xiaojie Jin, Qibin Hou, Jiashi Feng
  • MViTv2: Improved Multiscale Vision Transformers for Classification and Detection (2021): Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer

Works That Cite This (227)

  • 3D Spine Shape Estimation from Single 2D DXA (2024): Emmanuelle Bourigault, Amir Jamaludin, Andrew Zisserman
  • Recent progress in transformer-based medical image analysis (2023): Zhaoshan Liu, Qiujie Lv, Ziduo Yang, Yifan Li, Chau Hung Lee, Lei Shen
  • GT U-Net: A U-Net Like Group Transformer Network for Tooth Root Segmentation (2021): Yunxiang Li, Shuai Wang, Jun Wang, Guodong Zeng, Wenjun Liu, Qianni Zhang, Qun Jin, Yaqi Wang
  • Exploring Advances in Transformers and CNN for Skin Lesion Diagnosis on Small Datasets (2022): Leandro M. de Lima, Renato A. Krohling
  • CMT: Convolutional Neural Networks Meet Vision Transformers (2021): Jianyuan Guo, Kai Han, Han Wu, Chang Xu, Yehui Tang, Chunjing Xu, Yunhe Wang
  • Learning ADC maps from accelerated radial k-space diffusion-weighted MRI in mice using a deep CNN-transformer model (2023): Yuemeng Li, Miguel Romanello Joaquim, Stephen Pickup, Hee Kwon Song, Rong Zhou, Yong Fan
  • Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks (2023): Jierun Chen, Shiu-hong Kao, Hao He, Weipeng Zhuo, Wen Song, Chul-Ho Lee, S.-H. Gary Chan
  • Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition (2022): Qibin Hou, Zihang Jiang, Li Yuan, Ming-Ming Cheng, Shuicheng Yan, Jiashi Feng
  • EdgeViTs: Competing Light-Weight CNNs on Mobile Devices with Vision Transformers (2022): Junting Pan, Adrian Bulat, Fuwen Tan, Xiatian Zhu, Łukasz Dudziak, Hongsheng Li, Georgios Tzimiropoulos, Brais Martínez
  • MedViT: A robust vision transformer for generalized medical image classification (2023): Omid Nejati Manzari, Hamid Ahmadabadi, Hossein Kashiani, Shahriar B. Shokouhi, Ahmad Ayatollahi