Muffin or Chihuahua? Challenging Large Vision-Language Models with
Multipanel VQA
Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily lives. These images, characterized by their composition of multiple subfigures in distinct layouts, effectively convey information to people. Toward building advanced multimodal AI applications, such as agents that understand complex scenes and navigate through webpages, the skill of …