ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

Text-to-speech (TTS) has undergone remarkable improvements in performance, particularly with the advent of Denoising Diffusion Probabilistic Models (DDPMs). However, the perceived quality of audio depends not solely on its content, pitch, rhythm, and energy, but also on the physical environment. In this work, we propose ViT-TTS, the first visual TTS model with …