Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

We study joint video and language (VL) pretraining to enable cross-modality learning and benefit a wide range of downstream VL tasks. Existing works either extract low-quality video features or learn limited text embeddings, while neglecting that high-resolution videos and diversified semantics can significantly improve cross-modality learning. In this paper, we propose a novel …