Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads

Type: Preprint

Publication Date: 2024-11-28

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2412.00127

Abstract

We introduce Orthus, an autoregressive (AR) transformer that excels at generating images from textual prompts, answering questions based on visual inputs, and even crafting lengthy interleaved image-text content. Unlike prior art on unified multimodal modeling, Orthus simultaneously copes with discrete text tokens and continuous image features under the AR modeling principle. The continuous treatment of visual signals minimizes information loss for both image understanding and generation, while the fully AR formulation makes characterizing the correlation between modalities straightforward. The key mechanism enabling Orthus to leverage these advantages lies in its modality-specific heads: a regular language modeling (LM) head predicts discrete text tokens, and a diffusion head generates continuous image features conditioned on the output of the backbone. We devise an efficient strategy for building Orthus: by substituting the Vector Quantization (VQ) operation in an existing unified AR model with a soft alternative, introducing a diffusion head, and tuning the added modules to reconstruct images, we can create an Orthus-base model effortlessly (e.g., within a mere 72 A100 GPU hours). Orthus-base can then undergo post-training to better model interleaved images and texts. Empirically, Orthus surpasses competing baselines, including Show-o and Chameleon, across standard benchmarks, achieving a GenEval score of 0.58 and an MME-P score of 1265.8 with 7B parameters. Orthus also shows exceptional mixed-modality generation capabilities, reflecting its potential for handling intricate practical generation tasks.
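Since the modality-specific heads are the paper's central mechanism, a minimal sketch may help make the abstract concrete. The PyTorch-style code below is an illustration of the idea only, not the authors' implementation: the class name OrthusHeads, the method names, the MLP standing in for the diffusion head, and all dimensions are hypothetical.

    import torch
    import torch.nn as nn

    class OrthusHeads(nn.Module):
        # One shared AR backbone output feeds two heads:
        # - an LM head producing logits over the text vocabulary, and
        # - a diffusion head that denoises continuous image features
        #   conditioned on the backbone's hidden state.
        def __init__(self, backbone_dim, vocab_size, image_feat_dim):
            super().__init__()
            self.lm_head = nn.Linear(backbone_dim, vocab_size)
            # Hypothetical stand-in for the real diffusion head: a small MLP
            # taking [hidden state, noisy feature, timestep] as input.
            self.diffusion_head = nn.Sequential(
                nn.Linear(backbone_dim + image_feat_dim + 1, 4 * backbone_dim),
                nn.SiLU(),
                nn.Linear(4 * backbone_dim, image_feat_dim),
            )

        def text_logits(self, z):
            # z: (batch, backbone_dim) hidden state at a text position.
            return self.lm_head(z)

        def denoise(self, z, noisy_feat, t):
            # Predict the clean image feature (or noise) for one diffusion
            # step, conditioned on the backbone output z and timestep t.
            t = t.expand(noisy_feat.shape[0], 1)
            return self.diffusion_head(torch.cat([z, noisy_feat, t], dim=-1))

    heads = OrthusHeads(backbone_dim=1024, vocab_size=32000, image_feat_dim=256)
    z = torch.randn(2, 1024)                           # backbone hidden states
    logits = heads.text_logits(z)                      # (2, 32000) text logits
    noisy = torch.randn(2, 256)                        # noisy image features
    pred = heads.denoise(z, noisy, torch.tensor(0.5))  # (2, 256) denoised

In such a setup, text positions would be trained with a cross-entropy loss on the LM head's logits and image positions with a denoising loss on the diffusion head, matching the abstract's description of one AR backbone with two modality-specific heads.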

Locations

  • arXiv (Cornell University)

Similar Works

  • Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining (2024). Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, Peng Gao
  • Jointly Training Large Autoregressive Multimodal Models (2023). Emanuele Aiello, Lili Yu, Yixin Nie, Armen Aghajanyan, Barlas Oğuz
  • ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation (2024). Ethan Chern, Jiadi Su, Yan Ma, Pengfei Liu
  • JetFormer: An Autoregressive Generative Model of Raw Images and Text (2024). Michael Tschannen, André Susano Pinto, Alexander Kolesnikov
  • Generative Pretraining in Multimodality (2023). Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang
  • Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (2024). Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, Mike Zheng Shou
  • LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation (2024). Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, Lili Yu
  • Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (2023). Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Müller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin
  • VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation (2023). Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan
  • MAGVLT: Masked Generative Vision-and-Language Transformer (2023). Sungwoong Kim, Daejin Jo, Donghoon Lee, Jongmin Kim
  • MIO: A Foundation Model on Multimodal Tokens (2024). Zekun Wang, Keying Zhu, Chunpu Xu, Wangchunshu Zhou, Jiaheng Liu, Y M Zhang, Jiashuo Wang, Shi Ning, Siyu Li, Yi‐Zhi Li
  • Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (2022). Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang M. Luong, Gunjan Baid, Zirui Wang, Vijay K. Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan
  • Generating Images with Multimodal Language Models (2023). Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov
  • VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model (2025). Xianwei Zhuang, Yuxin Xie, Yufan Deng, Liming Liang, Jinghan Ru, Yuguo Yin, Yuexian Zou
  • Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens (2024). Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, Yonglong Tian
  • Liquid: Language Models are Scalable Multi-modal Generators (2024). Junfeng Wu, Yi Jiang, Chuofan Ma, Yuliang Liu, Hengshuang Zhao, Zehuan Yuan, Song Bai, Xiang Bai
  • MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis (2024). Wanggui He, Siming Fu, Mushui Liu, Xierui Wang, Wenyi Xiao, Fangxun Shu, Yi Wang, Lei Zhang, Zhelun Yu, Haoyuan Li
  • Chameleon: Mixed-Modal Early-Fusion Foundation Models (2024). Chameleon Team
  • MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens (2023). Kaizhi Zheng, Xuehai He, Xin Eric Wang

Works That Cite This (0)


Works Cited by This (0)
