EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Type: Preprint

Publication Date: 2024-09-26

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2409.18042

Abstract

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still have limited or even no vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech capabilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we surprisingly find that omni-modal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, a lightweight style module is proposed for flexible speech style control (e.g., emotions and pitches). For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks while supporting omni-modal spoken dialogue with vivid emotions.
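The abstract names two architectural ideas: a semantic-acoustic disentangled speech tokenizer (content tokens go to the LLM, speaking style stays outside it) and a lightweight style module for emotion/pitch control. The sketch below is only an illustrative rendering of how such components could be wired; the class names, dimensions, style vocabulary, and the argmax stand-in for vector quantization are assumptions for exposition, not EMOVA's released implementation.

```python
# Minimal sketch (not the authors' code) of a semantic-acoustic disentangled
# speech tokenizer plus a lightweight style module, as described in the abstract.
# All names, dimensions, and the emotion/pitch vocabulary are illustrative assumptions.

import torch
import torch.nn as nn


class DisentangledSpeechTokenizer(nn.Module):
    """Splits speech features into (a) discrete semantic tokens for the LLM and
    (b) a continuous acoustic/style embedding kept outside the LLM, so content
    and speaking style remain disentangled."""

    def __init__(self, feat_dim=80, codebook_size=1024, style_dim=64):
        super().__init__()
        self.semantic_proj = nn.Linear(feat_dim, codebook_size)      # logits over a unit codebook
        self.style_encoder = nn.GRU(feat_dim, style_dim, batch_first=True)

    def forward(self, speech_feats):                                 # (B, T, feat_dim)
        # argmax is a toy stand-in for vector quantization
        semantic_tokens = self.semantic_proj(speech_feats).argmax(-1)  # (B, T) discrete units
        _, style = self.style_encoder(speech_feats)                    # (1, B, style_dim)
        return semantic_tokens, style.squeeze(0)                       # content vs. style


class StyleModule(nn.Module):
    """Lightweight controller that injects a requested emotion/pitch label into
    the speech decoder's conditioning, enabling style control at generation."""

    def __init__(self, num_styles=8, style_dim=64):
        super().__init__()
        self.embed = nn.Embedding(num_styles, style_dim)

    def forward(self, style_id, acoustic_style):
        # Blend the requested style with the reference acoustic style
        return self.embed(style_id) + acoustic_style


if __name__ == "__main__":
    tokenizer = DisentangledSpeechTokenizer()
    style_module = StyleModule()
    feats = torch.randn(2, 50, 80)                    # dummy batch of speech features
    units, ref_style = tokenizer(feats)
    cond = style_module(torch.tensor([3, 5]), ref_style)
    print(units.shape, cond.shape)                    # torch.Size([2, 50]) torch.Size([2, 64])
```

In this reading, the LLM only ever sees the discrete semantic units (alongside image and text tokens), while the style conditioning is applied downstream in the speech decoder, which is one plausible way to keep vision-language performance intact while adding controllable speech output.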

Locations

  • arXiv (Cornell University)

Similar Works

  • OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis (2025). Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang
  • Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities (2024). Zhiyang Xie, Chyuan-Chuan Wu
  • Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition (2024). Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tan Qu, Yanwei Li, Yukang Chen
  • FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs (2024). Tongyi SpeechTeam
  • Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech (2024). Yiu-Wai Chu, Yujeong Shim, Unsang Park
  • VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction (2025). Chaoyou Fu, Haojia Lin, Xiong Wang, Yifan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li
  • BLSP-Emo: Towards Empathetic Large Speech-Language Models (2024). Chen Wang, Minpeng Liao, Zhongqiang Huang, Junhong Wu, Chengqing Zong, Jiajun Zhang
  • LLaSM: Large Language and Speech Model (2023). Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, Yemin Shi
  • EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector (2024). Deok-Hyeon Cho, Hyung‐Seok Oh, Seung-Bin Kim, Seong-Whan Lee
  • MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis (2024). Xiang Li, Zhi-Qi Cheng, Jun-Yan He, Xiaojiang Peng, Alexander G. Hauptmann
  • HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding (2025). Jiaxing Zhao, Qize Yang, Yixing Peng, Detao Bai, Shiqing Yao, Boyuan Sun, Xiang Chen, Shenghao Fu, W.-T. Chen, Xihan Wei
  • Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation Models with Optimal Transport for Non-Verbal Emotion Recognition (2024). Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Sishir Kalita, Arun Balaji Buduru, Rajesh Sharma, S. R. Mahadeva Prasanna
  • Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech (2024). Haibin Wu, Xiaofei Wang, Şefik Emre Eskimez, Manthan Thakker, Daniel M. Tompkins, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Sheng Zhao, Jinyu Li
  • VITA: Towards Open-Source Interactive Omni Multimodal LLM (2024). Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Zhao Meng, Yifan Zhang, Xiong Wang, Di Yin, L.L. Ma, Xiawu Zheng
  • CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition (2023). Kari Ali Noriy, Xiaosong Yang, Marcin Budka, Jianjun Zhang
  • Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness (2024). Xincan Feng, Akifumi Yoshimoto
  • SpiRit-LM: Interleaved Spoken and Written Language Model (2024). Tu Anh Nguyen, Benjamin Müller, Bokai Yu, Marta R. Costa‐jussà, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat
  • X-VILA: Cross-Modality Alignment for Large Language Model (2024). Hanrong Ye, De-An Huang, Yao Lu, Zhiding Yu, Wei Ping, Andrew Tao, Jan Kautz, Song Han, Dan Xu, Pavlo Molchanov
  • AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis (2023). Hrishikesh Viswanath, Aneesh Bhattacharya, Pascal Jutras-Dubé, Prerit Gupta, Mridu Prashanth, Yashvardhan Khaitan, Aniket Bera
  • Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks (2024). Chien‐Yu Huang, Wei‐Chih Chen, Shu-Wen Yang, Andy T. Liu, Chen-An Li, Yuxiang Lin, Wei‐Cheng Tseng, Anuj Diwan, Yi-Jen Shih, Jiatong Shi

Works That Cite This (0)

Works Cited by This (0)