Building an Open-Vocabulary Video CLIP Model With Better Architectures, Optimization and Data

Despite the significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made to explore its potential for zero-shot video recognition. This paper presents Open-VCLIP++, a simple yet effective framework that adapts CLIP into a strong zero-shot video classifier, capable of identifying novel actions and events …