ConViT: improving vision transformers with soft convolutional inductive biases*

Abstract

Convolutional architectures have proven to be extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision transformers rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly …