Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

We present Show-o, a unified transformer for multimodal understanding and generation. Unlike fully autoregressive models, Show-o combines autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks, including visual question answering, text-to-image …