Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data
Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data
Scaling up weakly-supervised datasets has shown to be highly effective in the image-text domain and has contributed to most of the recent state-of-the-art computer vision and multimodal neural networks. However, existing large-scale video-text datasets and mining techniques suffer from several limitations, such as the scarcity of aligned data, the lack …