ImagePiece: Content-aware Re-tokenization for Efficient Image Recognition
Vision Transformers (ViTs) have achieved remarkable success in various computer vision tasks. However, ViTs incur a high computational cost due to their inherent reliance on multi-head self-attention (MHSA), whose cost grows quadratically with the number of tokens, prompting efforts to accelerate them for practical applications. To this end, recent works aim to reduce the number of tokens, mainly focusing …