Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation

Most existing vision-language pre-training methods focus on understanding tasks and use BERT-like objectives (masked language modeling and image-text matching) during pre-training. Although they perform well on many downstream understanding tasks, e.g., visual question answering, image-text retrieval, and visual entailment, they lack the ability to generate. To tackle this …
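To make the two BERT-like objectives the abstract names concrete, here is a minimal PyTorch sketch of masked language modeling (MLM) and image-text matching (ITM) heads over a fused multimodal representation. This is not the paper's implementation; the constants, module names, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MASK_ID, VOCAB, HIDDEN = 103, 30522, 768  # BERT-style constants (assumed)

class VLPretrainHeads(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlm_head = nn.Linear(HIDDEN, VOCAB)  # predicts masked tokens
        self.itm_head = nn.Linear(HIDDEN, 2)      # matched vs. mismatched pair

    def forward(self, fused_feats, mlm_labels, itm_labels):
        # fused_feats: [B, T, HIDDEN] output of a multimodal encoder (assumed)
        mlm_logits = self.mlm_head(fused_feats)              # [B, T, VOCAB]
        mlm_loss = F.cross_entropy(
            mlm_logits.view(-1, VOCAB), mlm_labels.view(-1),
            ignore_index=-100)                               # score masked positions only
        itm_logits = self.itm_head(fused_feats[:, 0])        # [CLS] token -> [B, 2]
        itm_loss = F.cross_entropy(itm_logits, itm_labels)   # pair matched or not
        return mlm_loss + itm_loss

def mask_tokens(input_ids, mask_prob=0.15):
    """Replace a random subset of tokens with [MASK]; unmasked positions
    receive label -100 so cross-entropy ignores them."""
    labels = input_ids.clone()
    mask = torch.rand_like(input_ids, dtype=torch.float) < mask_prob
    labels[~mask] = -100
    return input_ids.masked_fill(mask, MASK_ID), labels
```

Note that both heads are classifiers over existing tokens or pair labels, which is why, as the abstract argues, models trained only on these objectives cannot produce free-form text.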