Text-Conditioned Sampling Framework for Text-to-Image Generation with Masked Generative Models

Type: Article

Publication Date: 2023-10-01

Citations: 1

DOI: https://doi.org/10.1109/iccv51070.2023.02125


Abstract

Token-based masked generative models are gaining popularity for their fast inference time with parallel decoding. While recent token-based approaches achieve performance competitive with diffusion-based models, their generation performance is still suboptimal because they sample multiple tokens simultaneously without considering the dependence among them. We empirically investigate this problem and propose a learnable sampling model, Text-Conditioned Token Selection (TCTS), to select optimal tokens via localized supervision with text information. TCTS improves not only the image quality but also the semantic alignment of the generated images with the given texts. To further improve the image quality, we introduce a cohesive sampling strategy, Frequency Adaptive Sampling (FAS), which is applied to each group of tokens divided according to the self-attention maps. We validate the efficacy of TCTS combined with FAS on various generative tasks, demonstrating that it significantly outperforms the baselines in image-text alignment and image quality. Our text-conditioned sampling framework further reduces the original inference time by more than 50% without modifying the original generative model.
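As a rough illustration of the parallel decoding the abstract refers to, the sketch below shows a generic confidence-based unmasking loop. It is not the paper's TCTS or FAS. The model is replaced by a random stand-in, and the names `toy_predict`, `parallel_decode`, `score_fn`, and the `MASK` sentinel, as well as the cosine schedule, are all assumptions made for this example. The `score_fn` hook only marks the place where a learned token selector could, in principle, replace raw model confidence.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a still-masked position


def toy_predict(tokens, vocab_size, rng):
    """Stand-in for a masked generative model (assumption, not a real model).

    Returns one (sampled_token, confidence) pair per position.
    """
    preds = []
    for _ in tokens:
        weights = [rng.random() for _ in range(vocab_size)]
        total = sum(weights)
        probs = [w / total for w in weights]
        tok = rng.choices(range(vocab_size), weights=probs)[0]
        preds.append((tok, probs[tok]))
    return preds


def parallel_decode(num_tokens, vocab_size, steps, seed=0, score_fn=None):
    """Iteratively unmask several tokens per step, highest score first."""
    rng = random.Random(seed)
    # Default selection rule: rank positions by the model's own confidence.
    score = score_fn or (lambda pos, tok, conf: conf)
    tokens = [MASK] * num_tokens
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # All masked positions are sampled at once, independently of each other.
        preds = toy_predict(tokens, vocab_size, rng)
        # Cosine schedule: how many positions stay masked after this step.
        ratio = math.cos(math.pi / 2 * step / steps)
        keep_masked = max(0, math.floor(num_tokens * ratio))
        n_unmask = max(1, len(masked) - keep_masked)
        ranked = sorted(masked, key=lambda i: score(i, *preds[i]), reverse=True)
        for i in ranked[:n_unmask]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(parallel_decode(num_tokens=16, vocab_size=8, steps=4))
```

Because every masked position is sampled independently within a step, the choice of which samples to keep is the only place where dependence among tokens can be taken into account. That is the role the abstract assigns to a learned selector.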

Locations

  • arXiv (Cornell University)
  • 2023 IEEE/CVF International Conference on Computer Vision (ICCV)

Similar Works

Action Title Year Authors
+ Text-Conditioned Sampling Framework for Text-to-Image Generation with Masked Generative Models 2023 Dong Keun Lee
Sangwon Jang
Jaehyeong Jo
Jaehong Yoon
Yunji Kim
Jin-Hwa Kim
Jung-Woo Ha
Sung Ju Hwang
+ MAGVLT: Masked Generative Vision-and-Language Transformer 2023 Sungwoong Kim
Daejin Jo
Dong‐Hoon Lee
Jongmin Kim
+ MAGVLT: Masked Generative Vision-and-Language Transformer 2023 Sungwoong Kim
Daejin Jo
Donghoon Lee
Jongmin Kim
+ MarkovGen: Structured Prediction for Efficient Text-to-Image Generation 2023 Sadeep Jayasumana
Daniel Gläsner
Srikumar Ramalingam
Andreas Veit
Ayan Chakrabarti
Sanjiv Kumar
+ Understanding and Mitigating Compositional Issues in Text-to-Image Generative Models 2024 Arman Zarei
Keivan Rezaei
Samyadeep Basu
Mehrdad Saberi
Mazda Moayeri
Priyatham Kattakinda
Soheil Feizi
+ Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens 2025 Dongwon Kim
Ju He
Qihang Yu
Chenglin Yang
Xiaohui Shen
Suha Kwak
Liang-Chieh Chen
+ UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs 2023 Yanwu Xu
Yang Zhao
Zhisheng Xiao
Tingbo Hou
+ Muse: Text-To-Image Generation via Masked Generative Transformers 2023 Hui‐Wen Chang
Han Zhang
Jarred Barber
AJ Maschinot
José Lezama
Lu Jiang
Ming–Hsuan Yang
Kevin Murphy
William T. Freeman
Michael Rubinstein
+ Discriminative Probing and Tuning for Text-to-Image Generation 2024 Leigang Qu
Wenjie Wang
Yongqi Li
Hanwang Zhang
Liqiang Nie
Tat‐Seng Chua
+ Attribute-Centric Compositional Text-to-Image Generation 2023 Yuren Cong
Martin Renqiang Min
Li Erran Li
Bodo Rosenhahn
Michael Ying Yang
+ Localized Text-to-Image Generation for Free via Cross Attention Control 2023 Yutong He
Ruslan Salakhutdinov
J. Zico Kolter
+ JetFormer: An Autoregressive Generative Model of Raw Images and Text 2024 Michael Tschannen
André Susano Pinto
Alexander Kolesnikov
+ Unifying Multimodal Transformer for Bi-directional Image and Text Generation 2021 Yupan Huang
Hongwei Xue
Bei Liu
Yutong Lu
+ Training-free Subject-Enhanced Attention Guidance for Compositional Text-to-image Generation 2024 Shengyuan Liu
Bo Wang
Ye Ma
Te Yang
Xipeng Cao
Quan Chen
Han Li
Di Dong
Peng Jiang
+ ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation 2021 Han Zhang
Weichong Yin
Yewei Fang
Lanxin Li
Boqiang Duan
Zhihua Wu
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
+ Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion 2024 Xuantong Liu
Tianyang Hu
Wenjia Wang
Kenji Kawaguchi
Yuan Yao
+ UPainting: Unified Text-to-Image Diffusion Generation with Cross-modal Guidance 2022 Wei Li
Xue Xu
Xinyan Xiao
Jiachen Liu
Hu Yang
Guohao Li
Zhanpeng Wang
Zhifan Feng
Qiaoqiao She
Yajuan Lyu
+ Bag of Design Choices for Inference of High-Resolution Masked Generative Transformer 2024 Shitong Shao
Zikai Zhou
Ye Tian
Lijie Bai
Zhiqiang Xu
Zeke Xie
+ Lformer: Text-to-Image Generation with L-shape Block Parallel Decoding 2023 Jiacheng Li
Longhui Wei
ZongYuan Zhan
Xin He
Siliang Tang
Qi Tian
Yueting Zhuang
+ CogView: Mastering Text-to-Image Generation via Transformers 2021 Ming Ding
Zhuoyi Yang
Wenyi Hong
Wendi Zheng
Chang Zhou
Da Yin
Junyang Lin
Xu Zou
Zhou Shao
Hongxia Yang

Cited by (0)

Action Title Year Authors

Citing (43)

Action Title Year Authors
+ Fine-Grained Visual-Textual Representation Learning 2019 Xiangteng He
Yuxin Peng
+ Microsoft COCO: Common Objects in Context 2014 Tsung-Yi Lin
Michael Maire
Serge Belongie
Lubomir Bourdev
Ross Girshick
James Hays
Pietro Perona
Deva Ramanan
C. Lawrence Zitnick
Piotr Dollár
+ StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks 2018 Han Zhang
Tao Xu
Hongsheng Li
Shaoting Zhang
Xiaogang Wang
Xiaolei Huang
Dimitris Metaxas
+ AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks 2018 Tao Xu
Pengchuan Zhang
Qiuyuan Huang
Han Zhang
Zhe Gan
Xiaolei Huang
Xiaodong He
+ StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks 2017 Han Zhang
Tao Xu
Hongsheng Li
Shaoting Zhang
Xiaogang Wang
Xiaolei Huang
Dimitris Metaxas
+ DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-To-Image Synthesis 2019 Minfeng Zhu
Pingbo Pan
Wei Chen
Yi Yang
+ Generating Diverse High-Fidelity Images with VQ-VAE-2 2019 Ali Razavi
Aäron van den Oord
Oriol Vinyals
+ Semantic Object Accuracy for Generative Text-to-Image Synthesis 2020 Tobias Hinz
Stefan Heinrich
Stefan Wermter
+ Denoising Diffusion Probabilistic Models 2020 Jonathan Ho
Ajay N. Jain
Pieter Abbeel
+ DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis 2020 Ming Tao
Hao Tang
Songsong Wu
Nicu Sebe
Fei Wu
Xiao‐Yuan Jing
+ FOIL it! Find One mismatch between Image and Language caption 2017 Ravi Shekhar
Sandro Pezzelle
Yauhen Klimovich
Aurélie Herbelot
Moin Nabi
Enver Sangineto
Raffaella Bernardi
+ Score-Based Generative Modeling through Stochastic Differential Equations 2020 Yang Song
Jascha Sohl‐Dickstein
Diederik P. Kingma
Abhishek Kumar
Stefano Ermon
Ben Poole
+ CLIPScore: A Reference-free Evaluation Metric for Image Captioning 2021 Jack Hessel
Ari Holtzman
Maxwell Forbes
Ronan Le Bras
Yejin Choi
+ Diffusion Models Beat GANs on Image Synthesis 2021 Prafulla Dhariwal
Alex Nichol
+ Cascaded Diffusion Models for High Fidelity Image Generation 2021 Jonathan Ho
Chitwan Saharia
William Chan
David J. Fleet
Mohammad Norouzi
Tim Salimans
+ Taming Transformers for High-Resolution Image Synthesis 2021 Patrick Esser
Robin Rombach
Björn Ommer
+ Vector-quantized Image Modeling with Improved VQGAN 2021 Jiahui Yu
Xin Li
Jing Yu Koh
Han Zhang
Ruoming Pang
James Qin
Alexander Ku
Yuanzhong Xu
Jason Baldridge
Yonghui Wu
+ Hierarchical Text-Conditional Image Generation with CLIP Latents 2022 Aditya Ramesh
Prafulla Dhariwal
Alex Nichol
Casey Chu
Mark Chen
+ GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models 2021 Alex Nichol
Prafulla Dhariwal
Aditya Ramesh
Pranav Shyam
Pamela Mishkin
Bob McGrew
Ilya Sutskever
Mark Chen
+ Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding 2022 Chitwan Saharia
William Chan
Saurabh Saxena
Lala Li
Jay Whang
Emily Denton
Seyed Kamyar Seyed Ghasemipour
Burcu Karagol Ayan
S. Sara Mahdavi
Rapha Gontijo Lopes
+ Improved Vector Quantized Diffusion Models 2022 Zhicong Tang
Shuyang Gu
Jianmin Bao
Dong Chen
Fang Wen
+ Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer 2022 Doyup Lee
Chiheon Kim
Saehoon Kim
Minsu Cho
Wook-Shin Han
+ Mutual Information Divergence: A Unified Metric for Multimodal Generative Models 2022 Jin-Hwa Kim
Yunji Kim
Jiyoung Lee
Kang Min Yoo
Sang‐Woo Lee
+ Scaling Autoregressive Models for Content-Rich Text-to-Image Generation 2022 Jiahui Yu
Yuanzhong Xu
Jing Yu Koh
Thang M. Luong
Gunjan Baid
Zirui Wang
Vijay K. Vasudevan
Alexander Ku
Yinfei Yang
Burcu Karagol Ayan
+ StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets 2022 Axel Sauer
Katja Schwarz
Andreas Geiger
+ Instance-Conditioned GAN 2021 Arantxa Casanova
Marlène Careil
Jakob Verbeek
Michal Drozdzal
Adriana Romero-Soriano
+ Classifier-Free Diffusion Guidance 2022 Jonathan Ho
Tim Salimans
+ GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium 2017 Martin Heusel
Hubert Ramsauer
Thomas Unterthiner
Bernhard Nessler
Sepp Hochreiter
+ Lafite2: Few-shot Text-to-Image Generation 2022 Yufan Zhou
Chunyuan Li
Changyou Chen
Jianfeng Gao
Jinhui Xu
+ eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers 2022 Yogesh Balaji
Seungjun Nah
Xun Huang
Arash Vahdat
Jiaming Song
Karsten Kreis
Miika Aittala
Timo Aila
Samuli Laine
Bryan Catanzaro
+ Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes 2022 Sam Bond-Taylor
Peter Hessey
Hiroshi Sasaki
Toby P. Breckon
Chris G. Willcocks
+ Vector Quantized Diffusion Model for Text-to-Image Synthesis 2022 Shuyang Gu
Dong Chen
Jianmin Bao
Fang Wen
Bo Zhang
Dongdong Chen
Lu Yuan
Baining Guo
+ Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors 2022 Oran Gafni
Adam Polyak
Oron Ashual
Shelly Sheynin
Devi Parikh
Yaniv Taigman
+ High-Resolution Image Synthesis with Latent Diffusion Models 2022 Robin Rombach
Andreas Blattmann
Dominik Lorenz
Patrick Esser
Björn Ommer
+ MaskGIT: Masked Generative Image Transformer 2022 Huiwen Chang
Han Zhang
Lu Jiang
Ce Liu
William T. Freeman
+ Improved Masked Image Generation with Token-Critic 2022 José Lezama
Huiwen Chang
Lu Jiang
Irfan Essa
+ Muse: Text-To-Image Generation via Masked Generative Transformers 2023 Hui‐Wen Chang
Han Zhang
Jarred Barber
AJ Maschinot
José Lezama
Lu Jiang
Ming–Hsuan Yang
Kevin Murphy
William T. Freeman
Michael Rubinstein
+ DiffEdit: Diffusion-based semantic image editing with mask guidance 2022 Guillaume Couairon
Jakob Verbeek
Holger Schwenk
Matthieu Cord
+ Frido: Feature Pyramid Diffusion for Complex Scene Image Synthesis 2023 Wan-Cyuan Fan
Yen‐Chun Chen
Dongdong Chen
Cheng Yu
Lu Yuan
Yu-Chiang Frank Wang
+ Attention Is All You Need 2017 Ashish Vaswani
Noam Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan N. Gomez
Łukasz Kaiser
Illia Polosukhin