Object-aware Video-language Pre-training for Retrieval

Recently, with the introduction of large-scale datasets and strong transformer networks, video-language pre-training has shown great success, especially for retrieval. Yet existing video-language transformer models do not explicitly perform fine-grained semantic alignment. In this work, we present Object-aware Transformers, an object-centric approach that extends the video-language transformer to incorporate object representations. The key idea …