Two-Pass End-to-End Speech Recognition

Type: Article

Publication Date: 2019-09-13

Citations: 117

DOI: https://doi.org/10.21437/interspeech.2019-1341

Abstract

The requirements for many applications of state-of-the-art speech recognition systems include not only low word error rate (WER) but also low latency. Specifically, for many use-cases, the system must be able to decode utterances in a streaming fashion and faster than real-time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has been shown to be a good candidate for on-device speech recognition, with improved WER and latency metrics compared to conventional on-device models [1]. However, this model still lags behind a large state-of-the-art conventional model in quality [2]. On the other hand, a non-streaming E2E Listen, Attend and Spell (LAS) model has shown comparable quality to large conventional models [3]. This work aims to bring the quality of an E2E streaming model closer to that of a conventional system by incorporating a LAS network as a second-pass component, while still abiding by latency constraints. Our proposed two-pass model achieves a 17%-22% relative reduction in WER compared to RNN-T alone and increases latency by a small fraction over RNN-T.
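
The abstract's closing figure is a relative reduction, measured against the baseline WER and not in absolute points. The sketch below works through that arithmetic in Python. The baseline WER of 10.0% and the function names are hypothetical, chosen only for illustration; they are not values or code from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def wer_after_reduction(baseline_wer: float, relative_reduction: float) -> float:
    """Return the WER that results from a given relative reduction."""
    return baseline_wer * (1.0 - relative_reduction)


# Hypothetical first-pass (RNN-T only) WER in percent; NOT a number from the paper.
baseline = 10.0

# The 17% and 22% bounds are the relative reductions quoted in the abstract.
wer_at_17 = wer_after_reduction(baseline, 0.17)
wer_at_22 = wer_after_reduction(baseline, 0.22)

print(f"WER after 17% relative reduction: {wer_at_17:.1f}%")
print(f"WER after 22% relative reduction: {wer_at_22:.1f}%")
```

Under this assumed baseline, a 17%-22% relative reduction corresponds to an absolute drop of roughly two WER points, which is why relative and absolute figures should not be compared directly.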

Locations

  • arXiv (Cornell University)
  • Interspeech 2019

Similar Works

Title Year Authors
  • A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency (2020). Tara N. Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo-Yiin Chang, Wei Li, Raziel Álvarez, Zhifeng Chen
  • Towards Fast and Accurate Streaming End-to-End ASR (2020). Bo Li, Shuo-Yiin Chang, Tara N. Sainath, Ruoming Pang, Yanzhang He, Trevor Strohman, Yonghui Wu
  • Streaming End-to-end Speech Recognition for Mobile Devices (2019). Yanzhang He, Tara N. Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Álvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang
  • Streaming End-to-end Speech Recognition For Mobile Devices (2018). Yanzhang He, Tara N. Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Álvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang
  • A Better and Faster End-to-End Model for Streaming ASR (2021). Bo Li, Anmol Gulati, Jiahui Yu, Tara N. Sainath, Chung‐Cheng Chiu, Arun Narayanan, Shuo-Yiin Chang, Ruoming Pang, Yanzhang He, James Qin
  • A Better and Faster End-to-End Model for Streaming ASR (2020). Bo Li, Anmol Gulati, Jiahui Yu, Tara N. Sainath, Chung‐Cheng Chiu, Arun Narayanan, Shuo-Yiin Chang, Ruoming Pang, Yanzhang He, James Qin
  • Two-Pass Low Latency End-to-End Spoken Language Understanding (2022). Siddhant Arora, Siddharth Dalmia, Xuankai Chang, Brian Yan, Alan W. Black, Shinji Watanabe
  • Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition (2020). Wei Li, James Qin, Chung‐Cheng Chiu, Ruoming Pang, Yanzhang He
  • Endpoint Detection for Streaming End-to-End Multi-talker ASR (2022). Liang Lu, Jinyu Li, Yifan Gong
  • Dissecting User-Perceived Latency of On-Device E2E Speech Recognition (2021). Yuan Shangguan, Rohit Prabhavalkar, Hang Su, Jay Mahadeokar, Yangyang Shi, Jiatong Zhou, Chunyang Wu, Manh Duc Le, Ozlem Kalinli, Christian Fuegen
  • Streaming End-to-End Multi-Talker Speech Recognition (2021). Liang Lu, Naoyuki Kanda, Jinyu Li, Yifan Gong
  • On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition (2020). Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, Shujie Liu

Works That Cite This (82)

Title Year Authors
  • Toward Streaming ASR with Non-Autoregressive Insertion-Based Model (2021). Yuya Fujita, Tianzi Wang, Shinji Watanabe, Motoi Omachi
  • Audio-Attention Discriminative Language Model for ASR Rescoring (2020). Ankur Gandhe, Ariya Rastrow
  • Monotonic Segmental Attention for Automatic Speech Recognition (2023). Albert Zeyer, Robin Schmitt, Wei Zhou, Ralf Schlüter, Hermann Ney
  • E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model (2023). Wei Huang, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, David Rybach, Robert David, Rohit Prabhavalkar, Cyril Allauzen, Cal Peyser, Trevor Strohman
  • Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus (2020). Cal Peyser, Sepand Mavandadi, Tara N. Sainath, James Apfel, Ruoming Pang, Shankar Kumar
  • Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability (2020). Jinyu Li, Rui Zhao, Zhong Meng, Yanqing Liu, Wenning Wei, Sarangarajan Parthasarathy, Vadim Mazalov, Zhenghao Wang, Lei He, Sheng Zhao
  • High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model (2020). Jinyu Li, Rui Zhao, Eric Sun, Jeremy H. M. Wong, Amit Das, Zhong Meng, Yifan Gong
  • SimulLR: Simultaneous Lip Reading Transducer with Attention-Guided Adaptive Memory (2021). Zhijie Lin, Zhou Zhao, Haoyuan Li, Jinglin Liu, Meng Zhang, Xingshan Zeng, Xiaofei He

Works Cited by This (15)

Title Year Authors
  • Sequence Transduction with Recurrent Neural Networks (2012). Alex Graves
  • Listen, Attend and Spell (2015). William Chan, Navdeep Jaitly, Quoc V. Le, Oriol Vinyals
  • TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems (2016). Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jay B. Dean, Matthieu Devin
  • Personalized speech recognition on mobile devices (2016). Ian McGraw, Rohit Prabhavalkar, Raziel Álvarez, Montse Gonzalez Arenas, Kanishka Rao, David Rybach, Ouais Alsharif, Haşim Sak, Alexander Gruenstein, Françoise Beaufays
  • Joint CTC-attention based end-to-end speech recognition using multi-task learning (2017). Suyoun Kim, Takaaki Hori, Shinji Watanabe
  • Towards Better Decoding and Language Model Integration in Sequence to Sequence Models (2017). Jan Chorowski, Navdeep Jaitly
  • Deep Context: End-to-end Contextual Speech Recognition (2018). Golan Pundak, Tara N. Sainath, Rohit Prabhavalkar, Anjuli Kannan, Ding Zhao
  • Streaming End-to-end Speech Recognition for Mobile Devices (2019). Yanzhang He, Tara N. Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Álvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang
  • State-of-the-Art Speech Recognition with Sequence-to-Sequence Models (2018). Chung‐Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina
  • An Analysis of Incorporating an External Language Model into a Sequence-to-Sequence Model (2018). Anjuli Kannan, Yonghui Wu, Patrick Nguyen, Tara N. Sainath, Zhifeng Chen, Rohit Prabhavalkar