REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models

Type: Preprint

Publication Date: 2024-08-05

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2408.02231

Abstract

Text-to-Image (T2I) and multimodal large language models (MLLMs) have been adopted in solutions for several computer vision and multimodal learning tasks. However, such vision-language models have been found to lack the ability to reason correctly over spatial relationships. To address this shortcoming, we develop the REVISION framework, which improves spatial fidelity in vision-language models. REVISION is a 3D rendering-based pipeline that generates spatially accurate synthetic images given a textual prompt. It is an extendable framework that currently supports 100+ 3D assets and 11 spatial relationships, all with diverse camera perspectives and backgrounds. Leveraging images from REVISION as additional guidance in a training-free manner consistently improves the spatial consistency of T2I models across all spatial relationships, achieving competitive performance on the VISOR and T2I-CompBench benchmarks. We also design RevQA, a question-answering benchmark for evaluating the spatial reasoning abilities of MLLMs, and find that state-of-the-art models are not robust to complex spatial reasoning under adversarial settings. Our results indicate that rendering-based frameworks are an effective approach for developing spatially aware generative models.
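The abstract describes REVISION as a pipeline that turns a spatial prompt into a rendered scene, which is then used as training-free guidance for a T2I model. The sketch below is only an illustration of that idea under assumed details, not the authors' implementation: it parses a simple (object, relation, object) prompt and converts the relation into 3D placements that a renderer such as Blender could consume. The relation set, the offsets, and every name (RELATION_OFFSETS, parse_spatial_prompt, build_scene, Placement) are hypothetical.

```python
# Illustrative sketch (not the paper's code): map a spatial prompt to 3D
# asset placements that a renderer could consume. All names and offsets
# here are assumptions for illustration only.

import re
from dataclasses import dataclass

# A few of the spatial relations the abstract mentions (REVISION supports 11);
# each maps to a displacement of the subject relative to the anchor object.
RELATION_OFFSETS = {
    "to the left of": (-2.0, 0.0, 0.0),
    "to the right of": (2.0, 0.0, 0.0),
    "above": (0.0, 0.0, 2.0),
    "below": (0.0, 0.0, -2.0),
    "in front of": (0.0, -2.0, 0.0),
    "behind": (0.0, 2.0, 0.0),
}

@dataclass
class Placement:
    asset: str       # name of a 3D asset in the library
    position: tuple  # (x, y, z) location in the scene

def parse_spatial_prompt(prompt: str):
    """Extract a (subject, relation, anchor) triple from a simple spatial prompt."""
    pattern = "|".join(re.escape(r) for r in RELATION_OFFSETS)
    match = re.search(rf"an? (.+?) ({pattern}) an? (.+)", prompt.lower())
    if match is None:
        raise ValueError(f"No supported spatial relation found in: {prompt!r}")
    return match.group(1).strip(), match.group(2), match.group(3).strip()

def build_scene(prompt: str):
    """Turn a prompt into two asset placements; the anchor object sits at the origin."""
    subject, relation, anchor = parse_spatial_prompt(prompt)
    dx, dy, dz = RELATION_OFFSETS[relation]
    return [
        Placement(asset=anchor, position=(0.0, 0.0, 0.0)),
        Placement(asset=subject, position=(dx, dy, dz)),
    ]

if __name__ == "__main__":
    # A full pipeline would load the named assets, randomize the camera and
    # background, and render the scene; here we only print the computed layout.
    for p in build_scene("a chair to the left of a dog"):
        print(p)
```

In the workflow the abstract outlines, the rendered image produced from such a layout would then accompany the original text prompt as additional guidance to the T2I model, with no retraining of the model itself.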

Locations

  • arXiv (Cornell University)

Similar Works

  • LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation (2024) - Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, Lili Yu
  • Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion (2024) - Jiuhai Chen, Jianwei Yang, Haiping Wu, Dianqi Li, Jianfeng Gao, Tianyi Zhou, Bin Xiao
  • JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images (2024) - Zhecan Wang, Junzhang Liu, Chia-Wei Tang, Hani Al-Omari, Anushka Sivakumar, Rui Sun, Wenhao Li, Md. Atabuzzaman, Hammad Ayyubi, Haoxuan You
  • Probing Visual Language Priors in VLMs (2024) - Tiange Luo, Ang Cao, Gunhee Lee, Justin Johnson, Honglak Lee
  • SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models (2024) - Zheng Liu, Hao Liang, Wentao Xiong, Qinhan Yu, Conghui He, Bin Cui, Wentao Zhang
  • VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation (2023) - Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan
  • DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback (2023) - Jiao Sun, Deqing Fu, Yushi Hu, Su Wang, Royi Rassin, Da-Cheng Juan, Dana Alon, Charles Herrmann, Sjoerd van Steenkiste, Ranjay Krishna
  • AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations? (2024) - Shouwei Ruan, Hong Wei Liu, Yao Huang, Xiaoqi Wang, Chieh-Yi Kang, Hang Su, Yinpeng Dong, Xingxing Wei
  • Explaining Multi-modal Large Language Models by Analyzing their Vision Perception (2024) - Loris Giulivi, Giacomo Boracchi
  • Liquid: Language Models are Scalable Multi-modal Generators (2024) - Junfeng Wu, Yi Jiang, Chuofan Ma, Yuliang Liu, Hengshuang Zhao, Zehuan Yuan, Song Bai, Xiang Bai
  • BLINK: Multimodal Large Language Models Can See but Not Perceive (2024) - Xingyu Fu, Yushi Hu, Bangzheng Li, Yujun Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, Ranjay Krishna
  • Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models (2024) - Nitzan Bitton-Guetta, Aviv Slobodkin, Aviya Maimon, Eliya Habba, Royi Rassin, Yonatan Bitton, Idan Szpektor, Amir Globerson, Yuval Elovici
  • Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (2024) - Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Ge Wen-bin
  • Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge (2024) - Yuanze Lin, Yunsheng Li, Dongdong Chen, Weijian Xu, Ronald Clark, Philip H. S. Torr, Lu Yuan
  • Visual Hallucinations of Multi-modal Large Language Models (2024) - Wen Huang, Hongbin Liu, Minxin Guo, Neil Zhenqiang Gong
  • Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs (2024) - Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie
  • BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions (2024) - Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, Zhuowen Tu
  • BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions (2023) - Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, Zhuowen Tu
  • Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (2024) - Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia
  • Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads (2024) - Siqi Kou, Jiachun Jin, Chang Liu, Ye Ma, Jian Jia, Quan Chen, Peng Jiang, Zhijie Deng

Works That Cite This (0)

Works Cited by This (0)