A Spitting Image: Modular Superpixel Tokenization in Vision Transformers

Vision Transformer (ViT) architectures traditionally employ a grid-based approach to tokenization that is independent of the semantic content of an image. We propose a modular superpixel tokenization strategy that decouples tokenization from feature extraction, a shift from contemporary approaches in which the two are treated as an undifferentiated whole. Using on-line content-aware tokenization and …