Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment

While pre-training large-scale video-language models (VLMs) has shown remarkable potential for various downstream video-language tasks, existing VLMs can still suffer from several common limitations, e.g., coarse-grained cross-modal alignment, under-modeling of temporal dynamics, and a detached video-language view. In this work, we target enhancing VLMs with a fine-grained structural spatio-temporal alignment …