ConViT: improving vision transformers with soft convolutional inductive biases*
Abstract

Convolutional architectures have proven to be extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision transformers rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly …