Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

Type: Preprint

Publication Date: 2024-10-17

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2410.13863

Abstract

Scaling up autoregressive models in vision has not proven as beneficial as in large language models. In this work, we investigate this scaling problem in the context of text-to-image generation, focusing on two critical factors: whether models use discrete or continuous tokens, and whether tokens are generated in a random or fixed raster order using BERT- or GPT-like transformer architectures. Our empirical results show that, while all models scale effectively in terms of validation loss, their evaluation performance -- measured by FID, GenEval score, and visual quality -- follows different trends. Models based on continuous tokens achieve significantly better visual quality than those using discrete tokens. Furthermore, the generation order and attention mechanisms significantly affect the GenEval score: random-order models achieve notably better GenEval scores than raster-order models. Inspired by these findings, we train Fluid, a random-order autoregressive model on continuous tokens. The Fluid 10.5B model achieves a new state-of-the-art zero-shot FID of 6.16 on MS-COCO 30K and a 0.69 overall score on the GenEval benchmark. We hope our findings and results will encourage future efforts to further bridge the scaling gap between vision and language models.
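
The abstract's core comparison -- raster-order (GPT-like, causal) versus random-order (BERT-like, masked) generation over continuous rather than discrete tokens -- can be illustrated with a short sketch. The Python below is not taken from the paper: the text-conditioned transformer and the diffusion-style head used for continuous tokens are collapsed into a single random linear stub (predict), and the names raster_order_generate and random_order_generate are hypothetical.

```python
# Toy sketch (not the authors' code): contrasts the two generation orders
# compared in the abstract. Tokens are continuous vectors rather than codebook
# indices, so there is no vocabulary; the text-conditioned transformer and the
# per-token diffusion-style head are collapsed into a random linear stub.
import torch

SEQ_LEN, DIM = 16, 8                 # e.g. a 4x4 grid of continuous image tokens
MASK = torch.zeros(DIM)              # placeholder for not-yet-generated positions
W = torch.randn(DIM, DIM)            # stand-in for model weights

def predict(tokens: torch.Tensor) -> torch.Tensor:
    """Return one continuous vector per position, computed from all current tokens."""
    return torch.tanh(tokens @ W)

def raster_order_generate() -> torch.Tensor:
    """GPT-like decoding: commit to positions left to right, one per step.
    Later positions still hold the MASK placeholder, mimicking causal masking."""
    tokens = MASK.repeat(SEQ_LEN, 1)
    for pos in range(SEQ_LEN):
        tokens[pos] = predict(tokens)[pos]
    return tokens

def random_order_generate(tokens_per_step: int = 4) -> torch.Tensor:
    """BERT/MaskGIT-like decoding: commit to a random subset of the still-masked
    positions at each step, re-predicting from every position already filled in."""
    tokens = MASK.repeat(SEQ_LEN, 1)
    order = torch.randperm(SEQ_LEN).tolist()          # random generation order
    while order:
        chosen, order = order[:tokens_per_step], order[tokens_per_step:]
        preds = predict(tokens)
        for pos in chosen:
            tokens[pos] = preds[pos]
    return tokens

print(raster_order_generate().shape)    # torch.Size([16, 8])
print(random_order_generate().shape)    # torch.Size([16, 8])
```

Filling several randomly chosen positions per step in random_order_generate mirrors the masked-prediction decoding the paper groups under "BERT-like" random order, whereas raster_order_generate commits to one position at a time in a fixed left-to-right scan.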

Locations

  • arXiv (Cornell University): https://arxiv.org/abs/2410.13863 (PDF available)

Similar Works

  • Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation (2024) - Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan
  • Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching (2024) - Enshu Liu, Xuefei Ning, Yu Wang, Zinan Lin
  • Emage: Non-Autoregressive Text-to-Image Generation (2023) - Zhangyin Feng, Runyi Hu, Liangxin Liu, Fan Zhang, Duyu Tang, Yong Dai, Xiaocheng Feng, Jiwei Li, Bing Qin, Shuming Shi
  • StraIT: Non-autoregressive Generation with Stratified Image Transformer (2023) - Shengju Qian, Hui-Wen Chang, Yuanzhen Li, Zizhao Zhang, Jiaya Jia, Han Zhang
  • MAGVLT: Masked Generative Vision-and-Language Transformer (2023) - Sungwoong Kim, Daejin Jo, Donghoon Lee, Jongmin Kim
  • A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation (2024) - Liang Chen, Sinan Tan, Zefan Cai, Wenxuan Xie, Haozhe Zhao, Yichi Zhang, Junyang Lin, Jinze Bai, Tianyu Liu, Baobao Chang
  • Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (2022) - Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang M. Luong, Gunjan Baid, Zirui Wang, Vijay K. Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan
  • Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads (2024) - Siqi Kou, Jiachun Jin, Chang Liu, Ye Ma, Jian Jia, Quan Chen, Peng Jiang, Zhijie Deng
  • Muse: Text-To-Image Generation via Masked Generative Transformers (2023) - Hui-Wen Chang, Han Zhang, Jarred Barber, AJ Maschinot, José Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein
  • Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining (2024) - Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, Peng Gao
  • CLIPAG: Towards Generator-Free Text-to-Image Generation (2024) - Roy Ganz, Michael Elad
  • CLIPAG: Towards Generator-Free Text-to-Image Generation (2023) - Roy Ganz, Michael Elad
  • VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling (2024) - Qian Zhang, Xiangzi Dai, Ninghua Yang, Xiang An, Ziyong Feng, Xingyu Ren
  • Next Patch Prediction for Autoregressive Visual Generation (2024) - Ying Pang, Jin Peng, Shuo Yang, Bin Lin, Bin Zhu, Zhenyu Tang, Liuhan Chen, Francis Eng-Hock Tay, Ser-Nam Lim, Harry Yang
  • Causal Diffusion Transformers for Generative Modeling (2024) - Chaorui Deng, Deyao Zhu, Kunchang Li, Guang Shi, Haoqi Fan
  • Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (2023) - Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Müller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin
  • Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation (2024) - Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yuanjian Yang, Linda Wang, Ying Shan
  • Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation (2023) - Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann
  • EditAR: Unified Conditional Generation with Autoregressive Models (2025) - Jiteng Mu, Nuno Vasconcelos, Xiaolong Wang
