Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

Type: Preprint

Publication Date: 2024-05-16

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2405.10292

Abstract

Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not efficiently yield optimal decision-making agents for multi-step, goal-directed tasks in interactive environments. To address this challenge, we propose an algorithmic framework that fine-tunes VLMs with reinforcement learning (RL). Specifically, our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM to efficiently explore intermediate reasoning steps that lead to the final text-based action. Next, the open-ended text output is parsed into an executable action that interacts with the environment to obtain goal-directed task rewards. Finally, our framework uses these task rewards to fine-tune the entire VLM with RL. Empirically, we demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7B models to outperform commercial models such as GPT-4V and Gemini. Furthermore, we find that CoT reasoning is a crucial component for performance improvement: removing it results in a significant decrease in the overall performance of our method.
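The abstract describes a concrete interaction loop: prompt the VLM with the task description plus a CoT instruction, parse the free-form text output into an executable action, step the environment, and record the task reward for RL fine-tuning. The Python sketch below illustrates one possible rollout-collection step under assumed interfaces; `vlm.generate`, `env.reset`, `env.step`, `env.legal_actions`, the prompt template, and the zero reward for unparsable output are hypothetical stand-ins, not the paper's actual API.

```python
import re
from dataclasses import dataclass

@dataclass
class Transition:
    prompt: str    # task description + CoT instruction shown to the VLM
    response: str  # generated CoT reasoning followed by the text action
    reward: float  # goal-directed task reward from the environment

# Assumed output convention: the VLM ends its response with "Action: <action>".
ACTION_PATTERN = re.compile(r"Action:\s*(.+)", re.IGNORECASE)

def parse_action(response: str, legal_actions: list[str]) -> str | None:
    """Parse the open-ended text output into an executable action."""
    match = ACTION_PATTERN.search(response)
    if match is None:
        return None
    action = match.group(1).strip().lower()
    # Assumes legal_actions are lowercase strings; reject anything else.
    return action if action in legal_actions else None

def collect_episode(vlm, env, task_description: str) -> list[Transition]:
    """Roll out one episode: prompt with CoT, parse, act, record rewards."""
    transitions = []
    obs, done = env.reset(), False
    while not done:
        prompt = (f"{task_description}\n"
                  "Think step by step, then finish with 'Action: <action>'.")
        response = vlm.generate(image=obs, prompt=prompt)  # hypothetical call
        action = parse_action(response, env.legal_actions)
        if action is None:
            # Illustrative choice: unparsable output ends the episode
            # with zero reward so the policy learns to stay parseable.
            reward, done = 0.0, True
        else:
            obs, reward, done = env.step(action)
        transitions.append(Transition(prompt, response, reward))
    return transitions
```

A policy-gradient update over the collected (prompt, response, reward) triples would then fine-tune the whole VLM, treating every generated token, CoT tokens included, as part of the action; the abstract does not name the exact RL algorithm, so a PPO-style update here is an assumption rather than the paper's stated method.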

Locations

  • arXiv (Cornell University)

Similar Works

  • InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model (2025). Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Steven X. Ding, Shwu Chong Wu, Yizhou Ma, Haodong Duan, Wenwei Zhang
  • HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning (2024). Fucai Ke, Zhixi Cai, Simindokht Jahangard, Weiqing Wang, Pari Delir Haghighi, Hamid Rezatofighi
  • Octopus: Embodied Vision-Language Programmer from Environmental Feedback (2023). Jingkang Yang, Yuhao Dong, Shuai Liu, Bo Li, Ziyue Wang, Chencheng Jiang, Haoran Tan, Jiamu Kang, Yuanhan Zhang, Kaiyang Zhou
  • BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games (2024). Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wołczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus
  • Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models (2024). Yuhao Dong, Zuyan Liu, Hailong Sun, Jingkang Yang, Weisheng Hu, Yongming Rao, Ziwei Liu
  • INSIGHT: End-to-End Neuro-Symbolic Visual Reinforcement Learning with Language Explanations (2024). Lirui Luo, Guoxi Zhang, Hongming Xu, Yaodong Yang, Cong Fang, Qing Li
  • On the Modeling Capabilities of Large Language Models for Sequential Decision Making (2024). Martin Klissarov, Devon Hjelm, Alexander Toshev, Bogdan Mazoure
  • NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models (2023, 2024). Gengze Zhou, Yicong Hong, Qi Wu
  • Policy Learning with a Language Bottleneck (2024). Megha Srivastava, Cédric Colas, Dorsa Sadigh, Jacob Andreas
  • ViSTa Dataset: Do vision-language models understand sequential tasks? (2024). Evžen Wybitul, Elsa L. Gunter, Mikhail Seleznyov, David Lindner
  • Language Models can Solve Computer Tasks (2023). Geunwoo Kim, Pierre Baldi, Stephen McAleer
  • lilGym: Natural Language Visual Reasoning with Reinforcement Learning (2022, 2023). Anne Wu, Kianté Brantley, Noriyuki Kojima, Yoav Artzi
  • NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning (2024). Bingqian Lin, Yunshuang Nie, Ziming Wei, Jiaqi Chen, Shikui Ma, Jianhua Han, Hang Xu, Xiaojun Chang, Xiaodan Liang
  • Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning (2024). Di Zhang, Junxian Li, Jingdi Lei, Xunzhi Wang, Yujie Liu, Zonglin Yang, Jiatong Li, Weida Wang, Shunkun Yang, Jianbo Wu
  • Evaluating Vision-Language Models as Evaluators in Path Planning (2024). Mohamed Aghzal, Xiang Yue, Erion Plaku, Ziyu Yao
  • VisualHints: A Visual-Lingual Environment for Multimodal Reinforcement Learning (2020, 2021). Thomas Carta, Subhajit Chaudhury, Kartik Talamadupula, Michiaki Tatsubori
  • Building Open-Ended Embodied Agent via Language-Policy Bidirectional Adaptation (2024). Shaopeng Zhai, Jie Wang, Tianyi Zhang, Fuxian Huang, Qi Zhang, Zhou Ming, Jing Hou, Yu Liu

Works That Cite This (0)


Works Cited by This (0)
