Muffin or Chihuahua? Challenging Large Vision-Language Models with Multipanel VQA

Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily lives. These images, characterized by their composition of multiple subfigures in distinct layouts, effectively convey information to people. Toward building advanced multimodal AI applications, such as agents that understand complex scenes and navigate through webpages, the skill of …