Improving RNN Transducer with Target Speaker Extraction and Neural Uncertainty Estimation
Target-speaker speech recognition aims to recognize the speech of a target speaker in environments with background noise and interfering speakers. This work presents a joint framework that combines time-domain target-speaker speech extraction with a Recurrent Neural Network Transducer (RNN-T). To stabilize joint training, we propose a multi-stage training strategy that pre-trains and fine-tunes each …