Ask a Question

Prefer a chat interface with context about you and your work?

UniFormer: Unifying Convolution and Self-Attention for Visual Recognition

UniFormer: Unifying Convolution and Self-Attention for Visual Recognition

It is a challenging task to learn discriminative representation from images and videos, due to large local redundancy and complex global dependency in these visual data. Convolution neural networks (CNNs) and vision transformers (ViTs) have been two dominant frameworks in the past few years. Though CNNs can efficiently decrease local …