MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it …