Adaptive Length Image Tokenization via Recurrent Allocation

Type: Preprint

Publication Date: 2024-11-04

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2411.02393

Abstract

Current vision systems typically assign fixed-length representations to images, regardless of their information content. This contrasts with human intelligence (and even large language models), which allocates varying representational capacity based on entropy, context, and familiarity. Inspired by this, we propose an approach to learn variable-length token representations for 2D images. Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts. Each iteration refines the 2D tokens, updates the existing 1D latent tokens, and adaptively increases representational capacity by adding new tokens. This enables compression of images into a variable number of tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction loss and FID metrics, demonstrating that token count aligns with image entropy, familiarity, and downstream task requirements. Recurrent token processing with increasing representational capacity in each iteration shows signs of token specialization, revealing potential for object/part discovery.
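
As a concrete illustration of the recurrent rollout the abstract describes, here is a minimal PyTorch sketch, assuming a cross-attention read from the 2D patch tokens and weight sharing across iterations. All names (RecurrentAdaptiveTokenizer, tokens_per_step, num_steps) and architectural details are illustrative assumptions, not the paper's implementation; the sketch also omits the decoder and the per-iteration refinement of the 2D tokens.

```python
import torch
import torch.nn as nn

class RecurrentAdaptiveTokenizer(nn.Module):
    """Illustrative sketch: distill 2D patch tokens into a growing set of
    1D latent tokens over recurrent rollouts (32 up to 256 tokens).
    Names and architecture details are assumptions, not the paper's code."""

    def __init__(self, dim=256, min_tokens=32, max_tokens=256,
                 tokens_per_step=32, heads=8):
        super().__init__()
        self.min_tokens = min_tokens
        self.tokens_per_step = tokens_per_step
        # Enough rollouts to grow from min_tokens to max_tokens.
        self.max_steps = (max_tokens - min_tokens) // tokens_per_step + 1
        # Learned initialization for every latent slot we might allocate.
        self.latent_init = nn.Parameter(torch.randn(max_tokens, dim) * 0.02)
        # A single shared block is reused at every rollout (recurrence
        # here means weight sharing across iterations).
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)

    def forward(self, patch_tokens, num_steps=None):
        """patch_tokens: (B, N, dim) 2D image tokens -> (B, K, dim) latents,
        where K = min_tokens + (num_steps - 1) * tokens_per_step."""
        num_steps = num_steps or self.max_steps
        B = patch_tokens.shape[0]
        latents = self.latent_init[: self.min_tokens].unsqueeze(0).expand(B, -1, -1)
        for step in range(num_steps):
            if step > 0:
                # Adaptively increase capacity: append fresh latent slots.
                k = latents.shape[1]
                fresh = self.latent_init[k : k + self.tokens_per_step]
                latents = torch.cat(
                    [latents, fresh.unsqueeze(0).expand(B, -1, -1)], dim=1)
            # Update all latents (old and new) by reading from the 2D tokens.
            read, _ = self.cross_attn(self.norm_q(latents),
                                      patch_tokens, patch_tokens)
            latents = latents + read
            # Mix information among the latents themselves.
            s = self.norm_s(latents)
            mixed, _ = self.self_attn(s, s, s)
            latents = latents + mixed
        return latents

# Usage: stop early for an easy image, run all rollouts for a hard one.
tok = RecurrentAdaptiveTokenizer()
patches = torch.randn(2, 196, 256)   # e.g. a 14x14 grid of patch embeddings
z_small = tok(patches, num_steps=1)  # (2, 32, 256)
z_full = tok(patches)                # (2, 256, 256)
```

A real system would choose the number of rollouts per image, for example by checking reconstruction quality after each iteration; the sketch simply exposes num_steps so the caller decides.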

Locations

  • arXiv (Cornell University)

Similar Works

  • CAT: Content-Adaptive Image Tokenization (2025). Jun-Hong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, Chunting Zhou
  • Make A Long Image Short: Adaptive Token Length for Vision Transformers (2023). Qiqi Zhou, Yichen Zhu
  • Make A Long Image Short: Adaptive Token Length for Vision Transformers (2021). Yichen Zhu, Yuqin Zhu, Jie Du, Yi Wang, Zhicai Ou, Feifei Feng, Jian Tang
  • Visual Transformers: Token-based Image Representation and Processing for Computer Vision (2020). Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Masayoshi Tomizuka, Kurt Keutzer, Péter Vajda
  • FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression (2024). Yilin Zhu, Chi Xie, Shuang Liang, Bo Zheng, Sheng Guo
  • SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference (2024). Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer
  • Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition (2021). Yulin Wang, Rui Huang, Shiji Song, Zeyi Huang, Gao Huang
  • Taming Scalable Visual Tokenizer for Autoregressive Image Generation (2024). Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, Limin Wang
  • Image Understanding Makes for A Good Tokenizer for Image Generation (2024). Luting Wang, Yang Zhao, Zijian Zhang, Jiashi Feng, Si Liu, Bingyi Kang
  • FIT: Far-reaching Interleaved Transformers (2023). Ting Chen, Lala Li
  • Token Pooling in Vision Transformers (2021). Dmitrii Marin, Jen-Hao Rick Chang, Anurag Ranjan, Anish Prabhu, Mohammad Rastegari, Oncel Tuzel
  • A Spitting Image: Modular Superpixel Tokenization in Vision Transformers (2024). Marius Aasan, Odd Kolbjørnsen, Anne Solberg, Adín Ramírez Rivera
  • LookupViT: Compressing visual information to a limited number of tokens (2024). Rajat Koner, Gagan Jain, Prateek Jain, Volker Tresp, Sujoy Paul
  • Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training (2023). Lorenzo Baraldi, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Andrea Pilzer, Rita Cucchiara
  • Planting a SEED of Vision in Large Language Model (2023). Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, Ying Shan
  • PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models (2024). Chenyu Yang, Xuan Dong, X L Zhu, Weijie Su, Jiahao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, Yifeng Dai
  • ImagePiece: Content-aware Re-tokenization for Efficient Image Recognition (2024). Seungdong Yoa, Seungjun Lee, H.D. Cho, Bumsoo Kim, Woohyung Lim
  • PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction (2024). Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu

Works That Cite This (0)

Works Cited by This (0)