From the Least to the Most: Building a Plug-and-Play Visual Reasoner via
Data Synthesis
We explore multi-step reasoning in vision-language models (VLMs). The problem is challenging because reasoning data that span multiple steps of visual and language processing are scarce. To overcome this challenge, we first introduce a least-to-most visual reasoning paradigm, which interleaves steps of decomposing a question into sub-questions and invoking …