PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models

Vision-language pre-training (VLP) has shown impressive performance on a wide range of cross-modal tasks, and VLP models that do not rely on object detectors are becoming mainstream thanks to their superior computational efficiency and competitive performance. However, removing object detectors also deprives VLP models of their capability in explicit …