Ask a Question

Prefer a chat interface with context about you and your work?

Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data

Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data

Scaling up weakly-supervised datasets has shown to be highly effective in the image-text domain and has contributed to most of the recent state-of-the-art computer vision and multimodal neural networks. However, existing large-scale video-text datasets and mining techniques suffer from several limitations, such as the scarcity of aligned data, the lack …