PVT v2: Improved baselines with Pyramid Vision Transformer

Type: Article

Publication Date: 2022-03-16

Citations: 1092

DOI: https://doi.org/10.1007/s41095-022-0274-8

Abstract

Transformers have recently shown encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) with three designs: (1) a linear-complexity attention layer, (2) overlapping patch embedding, and (3) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation. Notably, the proposed PVT v2 achieves performance comparable to or better than recent works such as the Swin Transformer. We hope this work will facilitate state-of-the-art Transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
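For a concrete picture of the three designs listed in the abstract, the sketch below shows one way they are commonly realized in PyTorch: an overlapping patch embedding built from a strided convolution, a feed-forward network with a depth-wise 3x3 convolution between its two linear layers, and an attention layer whose keys and values are average-pooled to a fixed spatial size so that its cost scales linearly with input resolution. Class names, default hyperparameters, and the exact ordering of operations are illustrative assumptions, not the authors' reference implementation; the official code is at the repository linked above.

```python
# Minimal, illustrative PyTorch sketch of the three PVT v2 designs named in
# the abstract. Names, sizes, and operation order are assumptions for
# exposition, not the authors' reference code (see github.com/whai362/PVT).
import torch
import torch.nn as nn


class OverlapPatchEmbed(nn.Module):
    """(2) Overlapping patch embedding: a strided convolution whose kernel is
    larger than its stride, so neighbouring patches overlap."""
    def __init__(self, in_chans=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                          # x: (B, 3, H, W)
        x = self.proj(x)                           # (B, D, H', W')
        B, D, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)           # (B, H'*W', D) tokens
        return self.norm(x), H, W


class ConvFFN(nn.Module):
    """(3) Convolutional feed-forward network: a 3x3 depth-wise convolution
    between the two linear layers of the usual Transformer MLP."""
    def __init__(self, dim=64, hidden_dim=256):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1,
                                groups=hidden_dim)  # depth-wise
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, H, W):                    # x: (B, N, dim), N = H*W
        x = self.fc1(x)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)  # back to 2-D layout
        x = self.dwconv(x)                         # local spatial mixing
        x = x.reshape(B, C, N).transpose(1, 2)
        return self.fc2(self.act(x))


class LinearSRAttention(nn.Module):
    """(1) Linear-complexity attention: keys and values are average-pooled to
    a fixed 7x7 grid, so cost grows linearly with the number of tokens."""
    def __init__(self, dim=64, num_heads=1, pool_size=7):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.pool = nn.AdaptiveAvgPool2d(pool_size)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):                    # x: (B, N, dim), N = H*W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)
        pooled = self.pool(x.transpose(1, 2).reshape(B, C, H, W))   # (B, C, 7, 7)
        pooled = self.norm(pooled.flatten(2).transpose(1, 2))       # (B, 49, C)
        kv = self.kv(pooled).reshape(B, -1, 2, self.num_heads, C // self.num_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)           # each (B, heads, 49, C/heads)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    img = torch.randn(2, 3, 224, 224)
    tokens, H, W = OverlapPatchEmbed()(img)        # (2, 3136, 64), H = W = 56
    tokens = tokens + LinearSRAttention()(tokens, H, W)
    tokens = tokens + ConvFFN()(tokens, H, W)
    print(tokens.shape)                            # torch.Size([2, 3136, 64])
```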

Locations

  • Computational Visual Media
  • arXiv (Cornell University)
  • DataCite API

Similar Works

  • Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021). Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, Chunhua Shen
  • CvT: Introducing Convolutions to Vision Transformers (2021). Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang
  • Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (2021). Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo
  • When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism (2022). Guangting Wang, Yucheng Zhao, Chuanxin Tang, Chong Luo, Wenjun Zeng
  • Three things everyone should know about Vision Transformers (2022). Hugo Touvron, Matthieu Cord, Alaaeldin El-Nouby, Jakob Verbeek, Hervé Jégou
  • P2T: Pyramid Pooling Transformer for Scene Understanding (2021). Yu-Huan Wu, Yun Liu, Xin Zhan, Ming-Ming Cheng
  • Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition (2024). Qibin Hou, Cheng-Ze Lu, Ming-Ming Cheng, Jiashi Feng
  • A Survey on Vision Transformer (2020). Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu
  • A Survey on Vision Transformer (2022). Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu
  • DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion (2024). Zhenzhen Chu, Jiayu Chen, Cen Chen, Chengyu Wang, Ziheng Wu, Jun Huang, Weining Qian
  • DAT++: Spatially Dynamic Vision Transformer with Deformable Attention (2023). Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, Gao Huang
  • A survey of the Vision Transformers and its CNN-Transformer based Variants (2023). Asifullah Khan, Zunaira Rauf, Anabia Sohail, Abdul Rehman, Hifsa Asif, Aqsa Asif, Umair Farooq
  • BViT: Broad Attention-Based Vision Transformer (2023). Nannan Li, Yaran Chen, Weifan Li, Zixiang Ding, Dongbin Zhao, Shuai Nie
  • Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention (2021). Sitong Wu, Tianyi Wu, Haoru Tan, Guodong Guo
  • ReViT: Enhancing Vision Transformers with Attention Residual Connections for Visual Recognition (2024). Anxhelo Diko, Danilo Avola, Marco Cascio, Luigi Cinque
  • BViT: Broad Attention based Vision Transformer (2022). Nannan Li, Yaran Chen, Weifan Li, Zixiang Ding, Dongbin Zhao

Works That Cite This (211)

  • Semantic segmentation using Vision Transformers: A survey (2023). Hans Thisanke, Chamli Deshan, Kavindu Chamith, Sachith Seneviratne, Rajith Vidanaarachchi, Damayanthi Herath
  • Swin MAE: Masked autoencoders for small datasets (2023). Zian Xu, Yin Dai, Fayu Liu, Weibing Chen, Yue Liu, Lifu Shi, Sheng Liu, Yuhang Zhou
  • Faster OreFSDet: A lightweight and effective few-shot object detector for ore images (2023). Yang Zhang, Le Cheng, Yuting Peng, Chengming Xu, Yanwei Fu, Bo Wu, Guodong Sun
  • Communication-Efficient Framework for Distributed Image Semantic Wireless Transmission (2023). Bingyan Xie, Yongpeng Wu, Yuxuan Shi, Derrick Wing Kwan Ng, Wenjun Zhang
  • SparseDC: Depth completion from sparse and non-uniform inputs (2024). C.-F. Long, Wenxiao Zhang, Zhe Chen, Haiping Wang, Yuan Liu, Peiling Tong, Zhen Cao, Zhen Dong, Bisheng Yang
  • Explicit Change-Relation Learning for Change Detection in VHR Remote Sensing Images (2024). Dalong Zheng, Zebin Wu, Jia Liu, Yang Xu, Chih-Cheng Hung, Zhihui Wei
  • Task-balanced distillation for object detection (2023). Ruining Tang, Zhenyu Liu, Yangguang Li, Yiguo Song, Hui Liu, Qide Wang, Jing Shao, Guifang Duan, Jianrong Tan
  • Gallery Filter Network for Person Search (2023). Lucas Jaffe, Avideh Zakhor
  • Multi-level feature fusion network combining attention mechanisms for polyp segmentation (2023). Junzhuo Liu, Qiaosong Chen, Ye Zhang, Zhixiang Wang, Xin Deng, Jin Wang
  • A Survey on Efficient Vision Transformers: Algorithms, Techniques, and Performance Benchmarking (2024). Lorenzo Papa, Paolo Russo, Irene Amerini, Luping Zhou

Works Cited by This (51)

  • Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (2015). Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
  • Going deeper with convolutions (2015). Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich
  • ImageNet Large Scale Visual Recognition Challenge (2015). Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein
  • Rethinking the Inception Architecture for Computer Vision (2016). Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, Zbigniew Wojna
  • Deep Residual Learning for Image Recognition (2016). Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
  • DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs (2017). Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan Yuille
  • Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units (2016). Dan Hendrycks, Kevin Gimpel
  • SGDR: Stochastic Gradient Descent with Warm Restarts (2016). Ilya Loshchilov, Frank Hutter
  • Aggregated Residual Transformations for Deep Neural Networks (2017). Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He
  • MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications (2017). Andrew Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam