Unified Multimodal Pre-training and Prompt-based Tuning for
Vision-Language Understanding and Generation
Most existing vision-language pre-training methods focus on understanding tasks and use BERT-like objectives (masked language modeling and image-text matching) during pretraining. Although they perform well on many downstream understanding tasks, e.g., visual question answering, image-text retrieval, and visual entailment, they lack the ability to generate. To tackle this …