Improving Multi-Speaker TTS Prosody Variance with a Residual Encoder and Normalizing Flows
Text-to-speech systems have recently achieved quality almost indistinguishable from human speech. However, the prosody of those systems is generally flatter than that of natural speech, producing samples with low expressiveness. Disentangling speaker ID and prosody is crucial in text-to-speech systems to improve naturalness and produce more variable syntheses. This paper proposes a new neural …