Serialized Output Training for End-to-End Overlapped Speech Recognition
This paper proposes serialized output training (SOT), a novel framework for multi-speaker overlapped speech recognition based on an attention-based encoder-decoder approach. Instead of having multiple output layers as in permutation invariant training (PIT), SOT uses a model with only one output layer that generates the transcriptions of multiple speakers …
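To make the single-output-layer idea concrete, the following is a minimal sketch of how reference labels could be serialized into one target sequence for such a model, assuming a special speaker-change token and references ordered by start time; the token names (`<sc>`, `<eos>`) and the helper `serialize_references` are illustrative assumptions, not definitions taken from the paper.

```python
# Sketch: serializing multi-speaker references into one label sequence,
# assuming a speaker-change token and first-come ordering (illustrative only).
from typing import List, Tuple

SC_TOKEN = "<sc>"    # hypothetical speaker-change symbol
EOS_TOKEN = "<eos>"  # hypothetical end-of-sequence symbol

def serialize_references(refs: List[Tuple[float, List[str]]]) -> List[str]:
    """Concatenate per-speaker token sequences into a single target.

    refs: list of (start_time, tokens) pairs, one per speaker utterance.
    Returns one serialized token list, with SC_TOKEN between speakers
    and EOS_TOKEN at the end.
    """
    ordered = sorted(refs, key=lambda r: r[0])  # order by start time
    serialized: List[str] = []
    for i, (_, tokens) in enumerate(ordered):
        if i > 0:
            serialized.append(SC_TOKEN)
        serialized.extend(tokens)
    serialized.append(EOS_TOKEN)
    return serialized

# Example with two overlapped speakers:
refs = [(0.0, ["hello", "world"]), (1.2, ["good", "morning"])]
print(serialize_references(refs))
# -> ['hello', 'world', '<sc>', 'good', 'morning', '<eos>']
```

The decoder can then be trained against this single serialized sequence, which is what lets one output layer cover an arbitrary number of overlapping speakers.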