ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis

Xiangheng He, Junjie Chen, Zixing Zhang, Björn Schüller

Type: Preprint

Publication Date: 2024-12-16

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2412.11795

Abstract

Prosody carries rich information beyond the literal meaning of words and is crucial for the intelligibility of speech. Current models still fall short in phrasing and intonation: they miss or misplace breaks when synthesizing long sentences with complex structures, and they produce unnatural intonation. We propose ProsodyFM, a prosody-aware text-to-speech synthesis (TTS) model with a flow-matching (FM) backbone that aims to enhance the phrasing and intonation aspects of prosody. ProsodyFM introduces two key components: a Phrase Break Encoder that captures initial phrase break locations, followed by a Duration Predictor that flexibly adjusts break durations; and a Terminal Intonation Encoder that combines a set of intonation shape tokens with a novel Pitch Processor for more robust modeling of human-perceived intonation change. ProsodyFM is trained with no explicit prosodic labels, yet it can uncover a broad spectrum of break durations and intonation patterns. Experimental results demonstrate that ProsodyFM effectively improves the phrasing and intonation aspects of prosody, thereby enhancing overall intelligibility compared to four state-of-the-art (SOTA) models. Out-of-distribution experiments show that this prosody improvement also gives ProsodyFM superior generalizability to unseen complex sentences and speakers. A case study illustrates ProsodyFM's fine-grained control over phrasing and intonation.
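
The abstract names a flow-matching backbone but gives no formulation. As a rough illustration only, the sketch below shows one common variant, conditional flow matching along a straight-line path from noise to data. The path definition, the `SIGMA_MIN` value, the `predict_velocity` callback, and the `cond` argument (standing in for text and prosody conditioning) are all assumptions made for this example and are not taken from the paper.

```python
import random

SIGMA_MIN = 1e-4  # assumed small noise floor; not given in the abstract


def interpolate(x0, x1, t, sigma_min=SIGMA_MIN):
    """Point at time t on the straight-line path from noise x0 to data x1."""
    return [(1.0 - (1.0 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1, sigma_min=SIGMA_MIN):
    """Time derivative of interpolate() with respect to t (constant in t)."""
    return [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x1, cond, rng):
    """Single-sample conditional flow-matching loss (mean squared error).

    predict_velocity(xt, t, cond) is a hypothetical model callback; cond is a
    placeholder for whatever conditioning the model receives.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                         # time drawn uniformly from [0, 1)
    xt = interpolate(x0, x1, t)
    u = target_velocity(x0, x1)
    v = predict_velocity(xt, t, cond)
    return sum((vi - ui) ** 2 for vi, ui in zip(v, u)) / len(x1)
```

In a real system the model would be a neural network trained by minimizing this loss over many samples, and synthesis would integrate the learned velocity field from noise toward a spectrogram; both steps are omitted here.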

Locations

arXiv (Cornell University)

Similar Works

Title	Year	Authors
Into-TTS : Intonation Template Based Prosody Control System	2022	Jihwan Lee Joun Yeop Lee Heejin Choi Seongkyu Mun Sangjun Park Chanwoo Kim
PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling	2023	Ji-Sang Hwang Sang-Hoon Lee Seong‐Whan Lee
Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control	2022	Giridhar Pamisetty K. Sri Rama Murty
Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data	2021	Li Zhu Yuqing Zhang Mengxi Nie Ming Yan Mengnan He Ruixiong Zhang Caixia Gong
Unsupervised Word-Level Prosody Tagging for Controllable Speech Synthesis	2022	Yiwei Guo Chenpeng Du Kai Yu
Improving Multi-Speaker TTS Prosody Variance with a Residual Encoder and Normalizing Flows	2021	Iván Vallés-Pérez Julian Roth Grzegorz Beringer Roberto Barra-Chicote Jasha Droppo
Controllable Neural Text-to-Speech Synthesis Using Intuitive Prosodic Features	2020	Tuomo Raitio Ramya Rasipuram Dan Castellani
CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech	2020	Sri Karlapati Alexis Moinet Arnaud Joly Viacheslav Klimkov Daniel Sáez-Trigueros Thomas Drugman
StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis	2022	Yinghao Aaron Li Cong Han Nima Mesgarani
ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech	2022	Yi Ren Ming Lei Zhiying Huang Shiliang Zhang Qian Chen Zhijie Yan Zhou Zhao
DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions	2025	Weidong Chen Yang Shan Guangzhi Li Xixin Wu
Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP	2023	Jinzuomu Zhong Yang Li Hui Huang Jie Liu Zhiba Su Jing Guo Benlai Tang Fengjie Zhu
Prosody-TTS: An end-to-end speech synthesis system with prosody control	2021	Giridhar Pamisetty K. Sri Rama Murty
Daft-Exprt: Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis	2021	Julian Zaïdi Hugo Seuté Benjamin van Niekerk Marc-André Carbonneau
