EMOVA: Empowering Language Models to See, Hear and Speak with Vivid
Emotions
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external …