MAR: <u>M</u>asked Autoencoders for Efficient <u>A</u>ction <u>R</u>ecognition
Standard approaches to video action recognition usually operate on full input videos, which is inefficient given the widespread spatio-temporal redundancy in videos. Recent progress in masked video modelling, specifically VideoMAE, has shown that vanilla Vision Transformers (ViT) can complete spatio-temporal contexts from only limited visual content. Inspired …