MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

In this work, we propose Visual-Predictive Instruction Tuning (VPiT), a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to predict discrete text tokens and …