Direct Acoustics-to-Word Models for English Conversational Speech Recognition
Recent work on end-to-end automatic speech recognition (ASR) has shown that the connectionist temporal classification (CTC) loss can be used to convert acoustics to phone or character sequences. Such systems are used with a dictionary and a separately trained language model (LM) to produce word sequences. However, they are not truly end-to-end in the …