BEVT: BERT Pretraining of Video Transformers

This paper studies BERT pretraining of video transformers, a straightforward extension that is worth studying given the recent success of BERT pretraining for image transformers. We introduce BEVT, which decouples video representation learning into spatial representation learning and temporal dynamics learning. In particular, BEVT first performs masked image modeling …