DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion

Type: Book-Chapter

Publication Date: 2024-01-01

Citations: 1

DOI: https://doi.org/10.1137/1.9781611978032.79

Abstract

Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in computer vision. Unlike convolutional neural networks (CNNs), ViTs are capable of global information sharing. With the development of various ViT structures, ViTs have become increasingly advantageous for many vision tasks. However, the quadratic complexity of self-attention renders ViTs computationally intensive, and their lack of the inductive biases of locality and translation equivariance demands larger model sizes than CNNs to effectively learn visual features. In this paper, we propose a light-weight and efficient vision transformer model called DualToken-ViT that leverages the advantages of CNNs and ViTs. DualToken-ViT effectively fuses the token carrying local information, obtained by a convolution-based structure, with the token carrying global information, obtained by a self-attention-based structure, to achieve an efficient attention structure. In addition, we use position-aware global tokens throughout all stages to enrich the global information, which further strengthens the effect of DualToken-ViT. The position-aware global tokens also contain the position information of the image, which makes our model better suited to vision tasks. We conducted extensive experiments on image classification, object detection and semantic segmentation tasks to demonstrate the effectiveness of DualToken-ViT. On the ImageNet-1K dataset, our models of different scales achieve accuracies of 75.4% and 79.4% with only 0.5G and 1.0G FLOPs, respectively, and our model with 1.0G FLOPs outperforms LightViT-T, which also uses global tokens, by 0.7%.
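The sketch below is a minimal, conceptual illustration of the dual-token idea described in the abstract: a convolution-based branch mixes local information, a small set of learnable, position-aware global tokens gathers and redistributes global context via attention, and the two results are fused. All names and design details here (DualTokenBlock, num_global_tokens, the gather/broadcast attention layers, additive fusion) are illustrative assumptions, not the authors' implementation; refer to the paper for the actual DualToken-ViT structure.

```python
import torch
import torch.nn as nn


class DualTokenBlock(nn.Module):
    """Conceptual sketch of dual-token mixing (not the authors' exact design).

    Local branch:  3x3 depthwise convolution over the spatial token map.
    Global branch: learnable, position-aware global tokens that gather context
                   from the image tokens and broadcast it back to them.
    The two branches are fused by addition and passed through a residual projection.
    """

    def __init__(self, dim: int = 64, num_global_tokens: int = 8, num_heads: int = 4):
        super().__init__()
        # Local branch: depthwise conv mixes neighbouring tokens.
        self.local_mix = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Position-aware global tokens: learnable embeddings carried by the block.
        self.global_tokens = nn.Parameter(torch.randn(1, num_global_tokens, dim) * 0.02)
        # Global tokens gather context from image tokens ...
        self.gather = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # ... and broadcast it back to every image token.
        self.broadcast = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map of image tokens.
        B, C, H, W = x.shape
        local = self.local_mix(x)                              # (B, C, H, W)
        tokens = x.flatten(2).transpose(1, 2)                  # (B, H*W, C)
        g = self.global_tokens.expand(B, -1, -1)               # (B, G, C)
        g, _ = self.gather(g, tokens, tokens)                  # global tokens summarize the image
        glob, _ = self.broadcast(tokens, g, g)                 # image tokens read global context
        glob = glob.transpose(1, 2).reshape(B, C, H, W)        # back to a spatial map
        fused = (local + glob).flatten(2).transpose(1, 2)      # fuse local and global information
        out = tokens + self.proj(self.norm(fused))             # residual connection
        return out.transpose(1, 2).reshape(B, C, H, W)


if __name__ == "__main__":
    block = DualTokenBlock(dim=64)
    feats = torch.randn(2, 64, 14, 14)
    print(block(feats).shape)  # torch.Size([2, 64, 14, 14])
```

In this sketch, attention is only computed between the image tokens and the small set of global tokens, so the extra cost of the global branch grows linearly with the number of image tokens rather than quadratically.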

Locations

  • arXiv (Cornell University)
  • Society for Industrial and Applied Mathematics eBooks

Similar Works

  • DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion (2023) - Zhenzhen Chu, Jiayu Chen, Cen Chen, Chengyu Wang, Ziheng Wu, Jun Huang, Weining Qian
  • Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers (2024) - Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim
  • Focal Self-attention for Local-Global Interactions in Vision Transformers (2021) - Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao
  • MVFormer: Diversifying Feature Normalization and Token Mixing for Efficient Vision Transformers (2024) - Jongseong Bae, Susang Kim, Minsu Cho, Ha Young Kim
  • VOLO: Vision Outlooker for Visual Recognition (2022) - Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan
  • ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond (2022) - Qiming Zhang, Yufei Xu, Jing Zhang, Dacheng Tao
  • DaViT: Dual Attention Vision Transformers (2022) - Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, Lu Yuan
  • CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification (2021) - Chun-Fu Chen, Quanfu Fan, Rameswar Panda
  • Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations (2022) - Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie
  • LightViT: Towards Light-Weight Convolution-Free Vision Transformers (2022) - Tao Huang, Lang Huang, Shan You, Fei Wang, Qian Chen, Chang Xu
  • TiC: Exploring Vision Transformer in Convolution (2023) - Song Zhang, Qingzhong Wang, Jiang Bian, Haoyi Xiong
  • Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets (2022) - Zhiying Lu, Hongtao Xie, Chuanbin Liu, Yongdong Zhang
  • Dual Vision Transformer (2022) - Ting Yao, Yehao Li, Yingwei Pan, Yu Wang, Xiao-Ping Zhang, Tao Mei
  • Lightweight Vision Transformer with Cross Feature Attention (2022) - Youpeng Zhao, Huadong Tang, Yingying Jiang, A Yong, Qiang Wu
  • So-ViT: Mind Visual Tokens for Vision Transformer (2021) - Jiangtao Xie, Ruiren Zeng, Qilong Wang, Ziqi Zhou, Peihua Li
  • DAT++: Spatially Dynamic Vision Transformer with Deformable Attention (2023) - Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, Gao Huang
  • Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021) - Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis E. H. Tay, Jiashi Feng, Shuicheng Yan
  • Multi-Tailed Vision Transformer for Efficient Inference (2022) - Yunke Wang, Boxue Du, Chang Xu
  • CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021) - Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo
  • CvT: Introducing Convolutions to Vision Transformers (2021) - Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang

Works That Cite This (0)

Works Cited by This (0)