Enhancing Vision-Language Pre-training with Rich Supervisions

We propose Strongly Supervised pre-training with ScreenShots (S4), a novel pre-training paradigm for Vision-Language Models that uses data from large-scale web screenshot rendering. Web screenshots unlock a treasure trove of visual and textual cues that are not present in image-text pairs. In S4, we leverage the inherent tree-structured …