MAR: <u>M</u>asked Autoencoders for Efficient <u>A</u>ction <u>R</u>ecognition
Standard approaches to video action recognition usually operate on full input videos, which is inefficient given the widespread spatio-temporal redundancy in videos. Recent progress in masked video modelling, specifically VideoMAE, has shown that vanilla Vision Transformers (ViT) can complete spatio-temporal contexts from only limited visual content. Inspired …