From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis

We explore multi-step reasoning in vision-language models (VLMs). The problem is challenging because reasoning data that spans multiple steps of visual and language processing is scarce. To overcome this challenge, we first introduce a least-to-most visual reasoning paradigm, which interleaves steps of decomposing a question into sub-questions and invoking …