LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation
We present LlamaFusion, a framework for empowering pretrained text-only large language models (LLMs) with multimodal generative capabilities, enabling them to understand and generate both text and images in arbitrary sequences. LlamaFusion leverages Llama-3's existing weights to process text autoregressively, while introducing additional, parallel transformer modules to process images with …
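Since the abstract is cut off here, the following is only an illustrative PyTorch sketch of the parallel-module idea it describes, not the authors' implementation: the names `ModalitySplitBlock` and `route` and all hyperparameters are hypothetical, and a plain causal mask stands in for whatever attention pattern the full paper specifies. The sketch shows the core mechanism: each position in a mixed text/image sequence is dispatched to modality-specific weights, while a single self-attention operation runs jointly over the whole sequence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def route(x, is_image, text_mod, image_mod):
    """Run text positions through text_mod and image positions through
    image_mod, then reassemble results in the original sequence order."""
    y_text = text_mod(x[~is_image])
    y_image = image_mod(x[is_image])
    y = x.new_empty(*x.shape[:-1], y_text.shape[-1])
    y[~is_image] = y_text
    y[is_image] = y_image
    return y


class ModalitySplitBlock(nn.Module):
    """One decoder layer with parallel, modality-specific weights
    (index 0: text, index 1: image) and joint self-attention.
    In the setup the abstract describes, the text-side weights would be
    initialized from Llama-3; the image-side copies are new modules."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        dup = lambda make: nn.ModuleList([make(), make()])
        self.norm1 = dup(lambda: nn.LayerNorm(d_model))
        self.norm2 = dup(lambda: nn.LayerNorm(d_model))
        self.qkv = dup(lambda: nn.Linear(d_model, 3 * d_model))
        self.proj = dup(lambda: nn.Linear(d_model, d_model))
        self.ffn = dup(lambda: nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.SiLU(),
            nn.Linear(4 * d_model, d_model)))

    def forward(self, x, is_image):
        # x: (B, S, D) mixed-modality hidden states; is_image: (B, S) bool.
        B, S, D = x.shape
        h = route(x, is_image, self.norm1[0], self.norm1[1])
        qkv = route(h, is_image, self.qkv[0], self.qkv[1])
        q, k, v = (t.view(B, S, self.n_heads, self.head_dim).transpose(1, 2)
                   for t in qkv.chunk(3, dim=-1))
        # Joint attention across both modalities (causal here for brevity;
        # the real model's mask for image tokens may differ).
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        a = a.transpose(1, 2).reshape(B, S, D)
        x = x + route(a, is_image, self.proj[0], self.proj[1])
        h = route(x, is_image, self.norm2[0], self.norm2[1])
        return x + route(h, is_image, self.ffn[0], self.ffn[1])


# Usage on a toy mixed sequence:
block = ModalitySplitBlock(d_model=64, n_heads=4)
x = torch.randn(2, 10, 64)
is_image = torch.zeros(2, 10, dtype=torch.bool)
is_image[:, 4:8] = True          # positions 4..7 carry image latents
out = block(x, is_image)         # -> (2, 10, 64)
```

Keeping the two weight sets disjoint is what lets the text pathway retain the pretrained LLM's behavior while the image pathway is trained from scratch; only the attention operation mixes information across modalities.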