Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
We study joint video and language (VL) pretraining to enable cross-modality learning and benefit a wide range of downstream VL tasks. Existing works either extract low-quality video features or learn limited text embeddings, neglecting that high-resolution videos and diversified semantics can significantly improve cross-modality learning. In this paper, we propose a novel …