MetaMorph: Multimodal Understanding and Generation via Instruction
Tuning
In this work, we propose Visual-Predictive Instruction Tuning (VPiT), a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to predict discrete text tokens and …
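To make the idea of one autoregressive model emitting both text and visual tokens concrete, here is a toy training loss over a mixed sequence. This is a minimal sketch, not the authors' implementation. The text above is cut off before it states what form the visual targets take, so the sketch assumes they are continuous vectors scored with cosine distance. The two linear heads and every function name are hypothetical.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy for one discrete token prediction (log-sum-exp form)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target_index]


def cosine_distance(pred, target):
    """1 - cosine similarity. ASSUMPTION: stands in for the visual-token loss."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm_p * norm_t)


def linear(weights, hidden):
    """Apply a weight matrix (list of rows) to a hidden-state vector."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


def mixed_sequence_loss(hidden_states, targets, text_head, visual_head):
    """Average loss over a sequence whose targets mix text and visual tokens.

    hidden_states: one hidden vector per position from a shared backbone.
    targets: one ("text", token_id) or ("visual", vector) pair per position.
    text_head / visual_head: hypothetical output projections.
    """
    total = 0.0
    for hidden, (kind, target) in zip(hidden_states, targets):
        if kind == "text":
            total += softmax_cross_entropy(linear(text_head, hidden), target)
        else:
            total += cosine_distance(linear(visual_head, hidden), target)
    return total / len(targets)


# Toy usage: a two-position sequence, one text target and one visual target.
hidden_states = [[1.0, 0.0], [0.0, 1.0]]
targets = [("text", 2), ("visual", [0.0, 1.0])]
text_head = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # 3-word vocabulary, uniform logits
visual_head = [[1.0, 0.0], [0.0, 1.0]]            # identity projection
loss = mixed_sequence_loss(hidden_states, targets, text_head, visual_head)
```

The point of the sketch is that both target types share the same hidden states and differ only in the output head and loss applied at each position. The actual heads, losses, and weighting used by VPiT are not given in the text above.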