A Spitting Image: Modular Superpixel Tokenization in Vision Transformers

Type: Preprint

Publication Date: 2024-08-14

Citations: 0

DOI: https://doi.org/10.48550/arXiv.2408.07680

Abstract

Vision Transformer (ViT) architectures traditionally employ a grid-based approach to tokenization that is independent of the semantic content of an image. We propose a modular superpixel tokenization strategy that decouples tokenization from feature extraction, a shift from contemporary approaches in which the two are treated as an undifferentiated whole. Using on-line content-aware tokenization and scale- and shape-invariant positional embeddings, we perform experiments and ablations that contrast our approach with patch-based tokenization and randomized partitions as baselines. We show that our method significantly improves the faithfulness of attributions and gives pixel-level granularity on zero-shot unsupervised dense prediction tasks, while maintaining predictive performance on classification tasks. Our approach provides a modular tokenization framework compatible with standard architectures, extending the space of ViTs to a larger class of semantically rich models.
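
The pipeline the abstract describes can be sketched concretely: a content-aware tokenizer partitions the image into superpixels, per-pixel features are pooled within each superpixel to form tokens, and each token's position is derived from superpixel geometry (centroid and area) rather than grid indices. Below is a minimal, hypothetical Python sketch of that idea, assuming SLIC superpixels from scikit-image as the on-line tokenizer; the function name superpixel_tokenize, the random per-pixel projection, and all hyperparameters are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of content-aware superpixel tokenization for a
# ViT-style model. Not the paper's implementation: SLIC, the mean-pooling
# rule, and the geometric positional features are illustrative assumptions.
import numpy as np
from skimage.segmentation import slic


def superpixel_tokenize(image, n_segments=196, embed_dim=64, seed=0):
    """Partition an image into superpixels and pool pixel features per token.

    image: float array of shape (H, W, 3) with values in [0, 1].
    Returns (tokens, pos): one feature row and one geometry row per superpixel.
    """
    h, w, _ = image.shape

    # Content-aware partition: SLIC clusters pixels by colour and location,
    # so token boundaries follow image content instead of a fixed grid.
    labels = slic(image, n_segments=n_segments, compactness=10.0, start_label=0)
    flat = labels.reshape(-1)
    n_tokens = int(flat.max()) + 1

    # Stand-in per-pixel features: RGB lifted by a fixed random projection.
    # In a real model this would be a learned per-pixel feature extractor.
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((3, embed_dim)) / np.sqrt(3.0)
    pixel_feats = image.reshape(-1, 3) @ proj  # (H*W, embed_dim)

    counts = np.bincount(flat, minlength=n_tokens).astype(float)
    safe = np.maximum(counts, 1.0)  # guard against empty labels

    # Mean-pool pixel features within each superpixel to form one token each.
    tokens = np.zeros((n_tokens, embed_dim))
    np.add.at(tokens, flat, pixel_feats)
    tokens /= safe[:, None]

    # Geometry per superpixel: normalised centroid and relative area, the raw
    # ingredients for a scale- and shape-invariant positional embedding.
    ys, xs = np.mgrid[0:h, 0:w]
    cy = np.bincount(flat, weights=ys.ravel(), minlength=n_tokens) / safe
    cx = np.bincount(flat, weights=xs.ravel(), minlength=n_tokens) / safe
    area = counts / (h * w)
    pos = np.stack([cy / h, cx / w, np.sqrt(area)], axis=1)  # (n_tokens, 3)
    return tokens, pos


# Usage on a toy image: roughly 196 tokens, matching a 14x14 ViT grid.
img = np.random.default_rng(1).random((128, 128, 3))
tokens, pos = superpixel_tokenize(img)
print(tokens.shape, pos.shape)
```

In a full model, the per-pixel features would come from a learned encoder, and pos would be lifted to the embedding dimension (for example with sinusoidal features or a small MLP) before being combined with the tokens; that lifting step is where the scale- and shape-invariance of the positional embedding would be realised.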

Locations

  • arXiv (Cornell University)

Similar Works

  • Visual Transformers: Token-based Image Representation and Processing for Computer Vision (2020). Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Masayoshi Tomizuka, Kurt Keutzer, Péter Vajda
  • Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens (2024). Jaihyun Lew, Soohyuk Jang, Jaehoon Lee, Seungryong Yoo, Eunji Kim, Saehyung Lee, Jisoo Mok, S.R. Kim, Sungroh Yoon
  • Vision Transformer with Super Token Sampling (2022). Huaibo Huang, Xiaoqiang Zhou, Jie Cao, Ran He, Tieniu Tan
  • MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers (2023). Jakob D. Havtorn, Amélie Royer, Tijmen Blankevoort, Babak Ehteshami Bejnordi
  • Vision Transformers with Natural Language Semantics (2024). Young Kyung Kim, J. Matías Di Martino, Guillermo Sapiro
  • Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning (2022). Weicong Liang, Yuhui Yuan, Henghui Ding, Xiao Luo, Weihong Lin, Ding Jia, Zheng Zhang, Chao Zhang, Han Hu
  • Expediting Large-Scale Vision Transformer for Dense Prediction Without Fine-Tuning (2023). Yuhui Yuan, Weicong Liang, Henghui Ding, Zhanhao Liang, Chao Zhang, Han Hu
  • Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers (2023). Chenyang Lu, Daan de Geus, Gijs Dubbelman
  • PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers (2022). Ryan Grainger, Thomas Paniagua, Xi Song, Tianfu Wu
  • Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation (2024). Jianyu Zhang, Li Zhang, Shijian Li
  • Token Pooling in Vision Transformers (2021). Dmitrii Marin, Jen-Hao Rick Chang, Anurag Ranjan, Anish Prabhu, Mohammad Rastegari, Oncel Tuzel
  • TCFormer: Visual Recognition via Token Clustering Transformer (2024). Wang Zeng, Sheng Jin, Lumin Xu, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo, Xiaogang Wang
  • Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations (2022). Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie
  • Make A Long Image Short: Adaptive Token Length for Vision Transformers (2021). Yichen Zhu, Yuqin Zhu, Jie Du, Yi Wang, Zhicai Ou, Feifei Feng, Jian Tang
  • Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer (2024). Shuai Peng, Di Fu, Baole Wei, Yong Cao, Liangcai Gao, Zhi Hao Tang
  • DaViT: Dual Attention Vision Transformers (2022). Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, Lu Yuan
  • Subobject-level Image Tokenization (2024). Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, Pascale Fung

Works That Cite This (0)

Works Cited by This (0)
