Bytes Are All You Need: End-to-end Multilingual Speech Recognition and Synthesis with Bytes

We present two end-to-end models, Audio-to-Byte (A2B) and Byte-to-Audio (B2A), for multilingual speech recognition and synthesis. Prior work has predominantly used characters, sub-words, or words as the unit of choice to model text. These units are difficult to scale to languages with large vocabularies, particularly in the case of multilingual …
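
To make the byte-level idea concrete, here is a minimal sketch of representing text as a sequence of UTF-8 byte values. It assumes UTF-8 as the byte encoding, and the helper names `text_to_byte_ids` and `byte_ids_to_text` are hypothetical; it illustrates the text unit only, not the A2B or B2A models themselves.

```python
from typing import List


def text_to_byte_ids(text: str) -> List[int]:
    """Encode text as UTF-8 and return one integer (0-255) per byte."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids: List[int]) -> str:
    """Decode a byte-id sequence back to text.

    Invalid byte sequences are replaced rather than raising, since a
    model emitting bytes one at a time could produce malformed UTF-8.
    """
    return bytes(ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Sample strings in three scripts (illustrative only).
    for sample in ["hello", "こんにちは", "नमस्ते"]:
        ids = text_to_byte_ids(sample)
        # Every id fits in a fixed 256-entry vocabulary, whatever the script.
        assert all(0 <= i < 256 for i in ids)
        assert byte_ids_to_text(ids) == sample
        print(f"{sample!r}: {len(sample)} code points, {len(ids)} bytes")
```

Under this encoding the output vocabulary stays at 256 symbols however many languages are covered, at the cost of longer sequences for scripts whose characters take several bytes each.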