MAR: <u>M</u>asked Autoencoders for Efficient <u>A</u>ction <u>R</u>ecognition

Standard approaches to video action recognition usually operate on full input videos, which is inefficient given the widespread spatio-temporal redundancy in video. Recent progress in masked video modelling, notably VideoMAE, has shown that vanilla Vision Transformers (ViTs) can complement spatio-temporal context from limited visual content. Inspired …
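As a rough sketch of the masked video modelling idea the abstract refers to, the snippet below generates a high-ratio "tube" mask over video patch tokens, where the same spatial patches are hidden in every frame. The function name, the tube-masking choice, and the 90% ratio are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def random_tube_mask(num_frames, num_patches, mask_ratio, seed=0):
    """Illustrative tube masking: mask the same spatial patches in
    every frame, so masked content cannot simply be copied from a
    neighbouring frame (an assumption-level sketch, not MAR itself)."""
    rng = np.random.default_rng(seed)
    num_masked = int(num_patches * mask_ratio)
    chosen = rng.choice(num_patches, size=num_masked, replace=False)
    mask = np.zeros(num_patches, dtype=bool)
    mask[chosen] = True
    # Broadcast the spatial mask along the temporal axis (the "tube").
    return np.broadcast_to(mask, (num_frames, num_patches))

# 8 frames, 14x14 = 196 patches per frame, 90% of patches masked.
mask = random_tube_mask(num_frames=8, num_patches=196, mask_ratio=0.9)
print(mask.shape)        # (8, 196)
print(int(mask[0].sum()))  # 176 patches masked per frame
```

With masks like this, the encoder only processes the small visible subset of tokens, which is where the efficiency gain in masked-modelling approaches comes from.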