VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model

Type: Preprint

Publication Date: 2025-01-21

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2501.12327

Abstract

We present VARGPT, a novel multimodal large language model (MLLM) that unifies visual understanding and generation within a single autoregressive framework. VARGPT employs a next-token prediction paradigm for visual understanding and a next-scale prediction paradigm for visual autoregressive generation. VARGPT innovatively extends the LLaVA architecture, achieving efficient scale-wise autoregressive visual generation within MLLMs while seamlessly accommodating mixed-modal input and output within a single model framework. VARGPT undergoes a three-stage unified training process on specially curated datasets, comprising a pre-training phase and two mixed visual instruction-tuning phases. The three stages are designed, respectively, to align visual and textual features, to enhance instruction following for both understanding and generation, and to improve visual generation quality. Despite its LLaVA-based architecture for multimodal understanding, VARGPT significantly outperforms LLaVA-1.5 across various vision-centric benchmarks, such as visual question-answering and reasoning tasks. Notably, VARGPT naturally supports autoregressive visual generation and instruction-to-image synthesis, showcasing its versatility in both visual understanding and generation tasks. Project page: https://vargpt-1.github.io/
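
The abstract describes a single autoregressive backbone that performs next-token prediction over text for visual understanding and next-scale (coarse-to-fine token-map) prediction for visual generation. The minimal PyTorch sketch below is only meant to illustrate that split; it is not the authors' implementation, and the class name, layer sizes, zero-vector scale queries, and greedy decoding are all assumptions made purely for illustration.

import torch
import torch.nn as nn

class ToyUnifiedAR(nn.Module):
    """Toy unified model (illustrative only): a shared backbone with a
    next-token text head and a next-scale image head."""

    def __init__(self, vocab_size=1000, img_vocab_size=512, dim=256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.img_embed = nn.Embedding(img_vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.text_head = nn.Linear(dim, vocab_size)     # next-token prediction
        self.img_head = nn.Linear(dim, img_vocab_size)  # next-scale prediction

    def understand(self, text_ids):
        # Understanding path: causal next-token prediction over text tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(text_ids.size(1))
        h = self.backbone(self.text_embed(text_ids), mask=mask)
        return self.text_head(h)                        # (B, T, vocab_size)

    @torch.no_grad()
    def generate_scales(self, prompt_ids, scales=(1, 2, 4)):
        # Generation path: predict the full token map of each successively
        # finer scale, conditioned on the prompt and on all coarser scales.
        ctx = self.text_embed(prompt_ids)
        token_maps = []
        for s in scales:
            # Zero queries stand in for the positional embeddings of the next
            # scale; all of its tokens are predicted in parallel.
            queries = torch.zeros(ctx.size(0), s * s, ctx.size(-1))
            h = self.backbone(torch.cat([ctx, queries], dim=1))
            tokens = self.img_head(h[:, -s * s:, :]).argmax(-1)  # greedy
            token_maps.append(tokens.view(-1, s, s))
            ctx = torch.cat([ctx, self.img_embed(tokens)], dim=1)
        return token_maps

if __name__ == "__main__":
    model = ToyUnifiedAR()
    prompt = torch.randint(0, 1000, (1, 8))
    print(model.understand(prompt).shape)                    # torch.Size([1, 8, 1000])
    print([m.shape for m in model.generate_scales(prompt)])  # 1x1, 2x2, 4x4 token maps

In the actual system, the predicted token maps would be decoded back to pixels by a visual tokenizer's decoder; the sketch stops at the token maps.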

Locations

  • arXiv (Cornell University)

Similar Works

  • MetaMorph: Multimodal Understanding and Generation via Instruction Tuning (2024). Shengbang Tong, Daiming Fan, Jianfei Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, Zhuang Liu
  • PUMA: Empowering Unified MLLM with Multi-granular Visual Generation (2024). Rongyao Fang, Chengqi Duan, Kun Wang, Hao Li, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Hongsheng Li, Xihui Liu
  • The (R)Evolution of Multimodal Large Language Models: A Survey (2024). Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara
  • VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation (2023). Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan
  • Liquid: Language Models are Scalable Multi-modal Generators (2024). Junfeng Wu, Yi Jiang, Chuofan Ma, Yuliang Liu, Hengshuang Zhao, Zehuan Yuan, Song Bai, Xiang Bai
  • MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens (2023). Kaizhi Zheng, Xuehai He, Xin Eric Wang
  • MIO: A Foundation Model on Multimodal Tokens (2024). Zekun Wang, Keying Zhu, Chunpu Xu, Wangchunshu Zhou, Jiaheng Liu, Y M Zhang, Jiashuo Wang, Shi Ning, Siyu Li, Yi‐Zhi Li
  • Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads (2024). Siqi Kou, Jiachun Jin, Chang Liu, Ye Ma, Jian Jia, Quan Chen, Peng Jiang, Zhijie Deng
  • Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation (2021). Tianyi Liu, Zuxuan Wu, Wenhan Xiong, Jingjing Chen, Yu‐Gang Jiang
  • GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation (2023). Zhanyu Wang, Longyue Wang, Zhen Zhao, Minghao Wu, Chenyang Lyu, Huayang Li, Deng Cai, Luping Zhou, Shuming Shi, Zhaopeng Tu
  • JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation (2024). Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zongyou Pan, Zhenda Xie, Haowei Zhang, X.G. Yu
  • ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance (2024). Chunwei Wang, Guansong Lu, Junwei Yang, Runhui Huang, Jianhua Han, Lu Hou, Wei Zhang, Songcen Xu
  • Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering (2024). Jinhe Bi, Yujun Wang, Haokun Chen, Xun Xiao, Artur Hecker, Volker Tresp, Yunpu Ma
  • mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality (2023). Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi
  • Generative Visual Instruction Tuning (2024). Jefferson Hernandez, Ruben Villegas, Vicente Ordóñez
  • VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks (2024). Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Wenhai Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu
  • VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation (2024). Yunfeng Wu, Zhengtong Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Yi Li
  • InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation (2023). Rongyao Fang, Shilin Yan, Zhaoyang Huang, Jingqiu Zhou, Hao Tian, Jifeng Dai, Hongsheng Li
  • LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation (2024). Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, Lili Yu

Works That Cite This (0)


Works Cited by This (0)
