UniFormer: Unifying Convolution and Self-Attention for Visual Recognition
Learning discriminative representations from images and videos is challenging because of the large local redundancy and complex global dependency in visual data. Convolutional neural networks (CNNs) and vision transformers (ViTs) have been the two dominant frameworks in the past few years. Though CNNs can efficiently decrease local …