Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate
Modality Imbalance in VLMs?
While Vision Language Models (VLMs) are impressive on tasks such as visual question answering (VQA) and image captioning, their ability to apply multi-step reasoning to images has lagged, giving rise to perceptions of modality imbalance or brittleness. Toward a systematic study of such issues, we introduce a synthetic framework for assessing …