EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Type: Preprint

Publication Date: 2024-09-26

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2409.18042

Abstract

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still have limited or even no vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech capabilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we surprisingly find that omni-modal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, a lightweight style module is proposed for flexible speech style control (e.g., emotions and pitches). For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks while supporting omni-modal spoken dialogue with vivid emotions.
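The abstract names two architectural ideas: a semantic-acoustic disentangled speech tokenizer (content tokens go to the LLM, speaking style stays outside it) and a lightweight style module for emotion/pitch control. The sketch below is only an illustrative rendering of how such components could be wired; the class names, dimensions, style vocabulary, and the argmax stand-in for vector quantization are assumptions for exposition, not EMOVA's released implementation.

```python
# Minimal sketch (not the authors' code) of a semantic-acoustic disentangled
# speech tokenizer plus a lightweight style module, as described in the abstract.
# All names, dimensions, and the emotion/pitch vocabulary are illustrative assumptions.

import torch
import torch.nn as nn


class DisentangledSpeechTokenizer(nn.Module):
    """Splits speech features into (a) discrete semantic tokens for the LLM and
    (b) a continuous acoustic/style embedding kept outside the LLM, so content
    and speaking style remain disentangled."""

    def __init__(self, feat_dim=80, codebook_size=1024, style_dim=64):
        super().__init__()
        self.semantic_proj = nn.Linear(feat_dim, codebook_size)      # logits over a unit codebook
        self.style_encoder = nn.GRU(feat_dim, style_dim, batch_first=True)

    def forward(self, speech_feats):                                 # (B, T, feat_dim)
        # argmax is a toy stand-in for vector quantization
        semantic_tokens = self.semantic_proj(speech_feats).argmax(-1)  # (B, T) discrete units
        _, style = self.style_encoder(speech_feats)                    # (1, B, style_dim)
        return semantic_tokens, style.squeeze(0)                       # content vs. style


class StyleModule(nn.Module):
    """Lightweight controller that injects a requested emotion/pitch label into
    the speech decoder's conditioning, enabling style control at generation."""

    def __init__(self, num_styles=8, style_dim=64):
        super().__init__()
        self.embed = nn.Embedding(num_styles, style_dim)

    def forward(self, style_id, acoustic_style):
        # Blend the requested style with the reference acoustic style
        return self.embed(style_id) + acoustic_style


if __name__ == "__main__":
    tokenizer = DisentangledSpeechTokenizer()
    style_module = StyleModule()
    feats = torch.randn(2, 50, 80)                    # dummy batch of speech features
    units, ref_style = tokenizer(feats)
    cond = style_module(torch.tensor([3, 5]), ref_style)
    print(units.shape, cond.shape)                    # torch.Size([2, 50]) torch.Size([2, 64])
```

In this reading, the LLM only ever sees the discrete semantic units (alongside image and text tokens), while the style conditioning is applied downstream in the speech decoder, which is one plausible way to keep vision-language performance intact while adding controllable speech output.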

Locations

  • arXiv (Cornell University)

Similar Works

  • OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis (2025). Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang
  • Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities (2024). Zhiyang Xie, Chyuan-Chuan Wu
  • Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition (2024). Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tan Qu, Yanwei Li, Yukang Chen
  • FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs (2024). Tongyi SpeechTeam
  • Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech (2024). Yiu-Wai Chu, Yujeong Shim, Unsang Park
  • VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction (2025). Chaoyou Fu, Haojia Lin, Xiong Wang, Yifan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li
  • BLSP-Emo: Towards Empathetic Large Speech-Language Models (2024). Chen Wang, Minpeng Liao, Zhongqiang Huang, Junhong Wu, Chengqing Zong, Jiajun Zhang
  • LLaSM: Large Language and Speech Model (2023). Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, Yemin Shi
  • EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector (2024). Deok-Hyeon Cho, Hyung‐Seok Oh, Seung-Bin Kim, Seong-Whan Lee
  • MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis (2024). Xiang Li, Zhi-Qi Cheng, Jun-Yan He, Xiaojiang Peng, Alexander G. Hauptmann
  • HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding (2025). Jiaxing Zhao, Qize Yang, Yixing Peng, Detao Bai, Shiqing Yao, Boyuan Sun, Xiang Chen, Shenghao Fu, W.-T. Chen, Xihan Wei
  • Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation Models with Optimal Transport for Non-Verbal Emotion Recognition (2024). Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Sishir Kalita, Arun Balaji Buduru, Rajesh Sharma, S. R. Mahadeva Prasanna
  • Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech (2024). Haibin Wu, Xiaofei Wang, Şefik Emre Eskimez, Manthan Thakker, Daniel M. Tompkins, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Sheng Zhao, Jinyu Li
  • VITA: Towards Open-Source Interactive Omni Multimodal LLM (2024). Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Zhao Meng, Yifan Zhang, Xiong Wang, Di Yin, L.L. Ma, Xiawu Zheng
  • CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition (2023). Kari Ali Noriy, Xiaosong Yang, Marcin Budka, Jianjun Zhang
  • Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness (2024). Xincan Feng, Akifumi Yoshimoto
  • SpiRit-LM: Interleaved Spoken and Written Language Model (2024). Tu Anh Nguyen, Benjamin Müller, Bokai Yu, Marta R. Costa‐jussà, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat
  • X-VILA: Cross-Modality Alignment for Large Language Model (2024). Hanrong Ye, De-An Huang, Yao Lu, Zhiding Yu, Wei Ping, Andrew Tao, Jan Kautz, Song Han, Dan Xu, Pavlo Molchanov
  • AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis (2023). Hrishikesh Viswanath, Aneesh Bhattacharya, Pascal Jutras-Dubé, Prerit Gupta, Mridu Prashanth, Yashvardhan Khaitan, Aniket Bera
  • Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks (2024). Chien‐Yu Huang, Wei‐Chih Chen, Shu-Wen Yang, Andy T. Liu, Chen-An Li, Yuxiang Lin, Wei‐Cheng Tseng, Anuj Diwan, Yi-Jen Shih, Jiatong Shi

Works That Cite This (0)

Works Cited by This (0)