TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action
While open-source multi-modal language models perform well on simple question-answering tasks, they often fail on complex questions that demand multi-step solutions and draw on several capabilities at once, such as fine-grained recognition, visual grounding, and reasoning. We present TACO, a family of multi-modal large action models designed to improve performance on …