ReFocus: Visual Editing as a Chain of Thought for Structured Image
Understanding
Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multihop selective attention capability. In this work, we introduce ReFocus, a …