Type: Article
Publication Date: 2021-12-13
Citations: 13
DOI: https://doi.org/10.1109/asru51503.2021.9688101
In this paper, we propose several methods to address practical challenges in deploying RNN Transducer (RNN-T) based speech recognition systems. The challenges are adapting a well-trained RNN-T model to a new domain without collecting audio data, and obtaining time stamps and confidence scores at the word level. We address the first challenge with a splicing data method, which concatenates speech segments extracted from the source domain data. To obtain time stamps, we add a phone prediction branch to the RNN-T model; the branch shares the encoder and is used for forced alignment. Finally, we obtain word-level confidence scores from several types of features calculated during decoding and from a confusion network. Evaluated on Microsoft production data, the splicing data adaptation method achieves relative word error rate reductions of 58.03% over the baseline and 15.25% over adaptation with text-to-speech audio. The proposed time stamping method achieves an average word timing difference of less than 50 milliseconds from the ground truth alignment while maintaining recognition accuracy. We also obtain high confidence annotation performance at limited computation cost.
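The splicing idea can be illustrated with a minimal sketch. The abstract does not state the segment unit or how segments are chosen, so the code below assumes word-level segments whose boundaries come from an existing alignment of the source-domain audio, and it picks a segment at random for each word. The names `build_word_inventory` and `splice_utterance` are hypothetical, and audio is represented as a plain list of samples for simplicity.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Samples = List[float]                 # audio samples (simplified representation)
Alignment = List[Tuple[str, int, int]]  # (word, start_sample, end_sample)


def build_word_inventory(
    utterances: Iterable[Tuple[Samples, Alignment]],
) -> Dict[str, List[Samples]]:
    """Collect every aligned word segment found in the source-domain audio."""
    inventory: Dict[str, List[Samples]] = defaultdict(list)
    for samples, alignment in utterances:
        for word, start, end in alignment:
            inventory[word].append(list(samples[start:end]))
    return inventory


def splice_utterance(
    text: str,
    inventory: Dict[str, List[Samples]],
    rng: random.Random,
) -> Optional[Samples]:
    """Concatenate one stored segment per word of a new-domain sentence.

    Returns None when a word has no segment in the inventory (assumption:
    such sentences are simply skipped).
    """
    spliced: Samples = []
    for word in text.split():
        candidates = inventory.get(word)
        if not candidates:
            return None
        spliced.extend(rng.choice(candidates))
    return spliced


# Toy source-domain data: two utterances with word-level alignments.
source = [
    ([0.1, 0.2, 0.3, 0.4, 0.5], [("play", 0, 2), ("music", 2, 5)]),
    ([0.6, 0.7, 0.8], [("stop", 0, 1), ("music", 1, 3)]),
]
inventory = build_word_inventory(source)

# Synthesize audio for new-domain text that never occurred as a whole utterance.
audio = splice_utterance("stop play", inventory, random.Random(0))
```

The spliced audio, paired with the new-domain text, would then serve as adaptation data in place of recorded or text-to-speech audio. A real system would also have to handle words missing from the inventory and choose among speakers, which this sketch leaves out.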