Generative Spoken Language Model based on continuous word-sized audio tokens
In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard input of spoken LMs is a sequence of 20 ms or 40 ms discrete units (shorter than a phoneme). Taking inspiration from word-based LMs, we introduce a Generative Spoken Language Model …