Type: Article
Publication Date: 2021-08-27
Citations: 3
DOI: https://doi.org/10.21437/interspeech.2021-1131
Neural end-to-end (E2E) models have become a promising technique for realizing practical automatic speech recognition (ASR) systems. When building such a system, one important issue is segmenting the audio to handle streaming input or long recordings. After audio segmentation, an ASR model with a small real-time factor (RTF) is preferable because it lowers the latency of the system. Recently, E2E ASR based on non-autoregressive models has become a promising approach, since it can decode an N-length token sequence in fewer than N iterations. We propose a system that combines audio segmentation with non-autoregressive ASR to achieve both high accuracy and low RTF. An insertion-based model is used as the non-autoregressive ASR. In addition, instead of cascading separate models for segmentation and ASR, we introduce a new architecture that performs audio segmentation and non-autoregressive ASR with a single neural network. Experimental results on Japanese and English datasets show that the method achieves a reasonable trade-off between accuracy and RTF compared with baseline autoregressive Transformer and connectionist temporal classification (CTC) models.
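
The abstract relies on the real-time factor without defining it. As a minimal sketch, assuming the conventional definition (processing time divided by the duration of the audio processed), RTF can be computed as below. The function name `real_time_factor` and the numbers are hypothetical and are not taken from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Conventional RTF: time spent decoding divided by the audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Illustrative values only (assumed, not results reported in the paper):
# decoding a 10-second segment in 2 seconds.
rtf = real_time_factor(processing_seconds=2.0, audio_seconds=10.0)
print(rtf)
```

Under this definition, an RTF below 1 means the recognizer processes audio faster than it arrives, which is why a model that needs fewer decoding iterations per segment is attractive for streaming use.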