SVTS: Scalable Video-to-Speech Synthesis
Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio. This task has received an increasing amount of attention due to its self-supervised nature (i.e., it can be trained without manual labelling) combined with the ever-growing collection of audio-visual data available online. Despite these strong …