VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic
Reasoning Tasks
Drawing inferences from heterogeneous inputs (such as images, text, and audio) is an important skill that humans rely on to perform day-to-day tasks. A similar ability is desirable for advanced Artificial Intelligence (AI) systems. While state-of-the-art models are rapidly closing the gap with human-level performance on diverse computer vision …