ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis

Type: Preprint

Publication Date: 2024-12-16

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2412.11795

Abstract

Prosody contains rich information beyond the literal meaning of words, which is crucial for the intelligibility of speech. Current models still fall short in phrasing and intonation; they not only miss or misplace breaks when synthesizing long sentences with complex structures but also produce unnatural intonation. We propose ProsodyFM, a prosody-aware text-to-speech synthesis (TTS) model with a flow-matching (FM) backbone that aims to enhance the phrasing and intonation aspects of prosody. ProsodyFM introduces two key components: a Phrase Break Encoder to capture initial phrase break locations, followed by a Duration Predictor for the flexible adjustment of break durations; and a Terminal Intonation Encoder that integrates a set of intonation shape tokens combined with a novel Pitch Processor for more robust modeling of human-perceived intonation change. ProsodyFM is trained with no explicit prosodic labels and yet can uncover a broad spectrum of break durations and intonation patterns. Experimental results demonstrate that ProsodyFM can effectively improve the phrasing and intonation aspects of prosody, thereby enhancing the overall intelligibility compared to four state-of-the-art (SOTA) models. Out-of-distribution experiments show that this prosody improvement also gives ProsodyFM superior generalizability to unseen complex sentences and speakers. Our case study intuitively illustrates the powerful and fine-grained controllability of ProsodyFM over phrasing and intonation.

Locations

  • arXiv (Cornell University)

Similar Works

Action Title Year Authors
+ Into-TTS : Intonation Template Based Prosody Control System 2022 Jihwan Lee
Joun Yeop Lee
Heejin Choi
Seongkyu Mun
Sangjun Park
Chanwoo Kim
+ PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling 2023 Ji-Sang Hwang
Sang-Hoon Lee
Seong‐Whan Lee
+ Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control 2022 Giridhar Pamisetty
K. Sri Rama Murty
+ Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data 2021 Li Zhu
Yuqing Zhang
Mengxi Nie
Ming Yan
Mengnan He
Ruixiong Zhang
Caixia Gong
+ Unsupervised word-level prosody tagging for controllable speech synthesis 2022 Yiwei Guo
Chenpeng Du
Kai Yu
+ Improving multi-speaker TTS prosody variance with a residual encoder and normalizing flows 2021 Iván Vallés-Pérez
Julian Roth
Grzegorz Beringer
Roberto Barra-Chicote
Jasha Droppo
+ Controllable neural text-to-speech synthesis using intuitive prosodic features 2020 Tuomo Raitio
Ramya Rasipuram
Dan Castellani
+ CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech 2020 Sri Karlapati
Alexis Moinet
Arnaud Joly
Viacheslav Klimkov
Daniel Sáez-Trigueros
Thomas Drugman
+ StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis 2022 Yinghao Aaron Li
Cong Han
Nima Mesgarani
+ ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech 2022 Yi Ren
Ming Lei
Zhiying Huang
Shiliang Zhang
Qian Chen
Zhijie Yan
Zhou Zhao
+ DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions 2025 Weidong Chen
Yang Shan
Guangzhi Li
Xixin Wu
+ Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP 2023 Jinzuomu Zhong
Yang Li
Hui Huang
Jie Liu
Zhiba Su
Jing Guo
Benlai Tang
Fengjie Zhu
+ Prosody-TTS: An end-to-end speech synthesis system with prosody control 2021 Giridhar Pamisetty
K. Sri Rama Murty
+ Daft-Exprt: Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis 2021 Julian Zaïdi
Hugo Seuté
Benjamin van Niekerk
Marc-André Carbonneau
