Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder
Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder
In recent research, slight performance improvement is observed from automatic speech recognition systems to audio-visual speech recognition systems in end-to-end frameworks with low-quality videos. Unmatching convergence rates and specialized input representations between audio-visual modalities are considered to cause the problem. In this paper, we propose two novel techniques to improve …