On Addressing Practical Challenges for RNN-Transducer

Rui Zhao, Jian Xue, Jinyu Li, Wenning Wei, Lei He, Yifan Gong

Type: Article

Publication Date: 2021-12-13

Citations: 13

DOI: https://doi.org/10.1109/asru51503.2021.9688101

Abstract

In this paper, several works are proposed to address practical challenges for deploying RNN Transducer (RNN-T) based speech recognition systems. These challenges are adapting a well-trained RNN-T model to a new domain without collecting the audio data, obtaining time stamps and confidence scores at word level. We solve the first challenge with a splicing data method which concatenates the speech segments extracted from the source domain data. To get time stamps, a phone prediction branch is added to the RNN-T model by sharing the encoder for the purpose of forced alignment. Finally, we obtain word level confidence scores by utilizing several types of features calculated during decoding and from a confusion network. Evaluated with Microsoft production data, the splicing data adaptation method improves the baseline and adaptation with the text to speech method by 58.03% and 15.25% relative word error rate reduction, respectively. The proposed time stamping method can get less than 50 millisecond word timing difference from the ground truth alignment on average while maintaining the recognition accuracy. We also obtain high confidence annotation performance with limited computation cost.
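
The splicing data idea in the abstract (building audio for new-domain text by concatenating speech segments taken from source-domain data) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the input layout (`samples` with word-level `(word, start, end)` alignments), the function names `build_segment_bank` and `splice_utterance`, the random choice among candidate segments, and the decision to skip sentences containing a word with no available segment.

```python
import random


def build_segment_bank(utterances):
    """Index word-level audio segments by word.

    `utterances` is a list of (samples, alignment) pairs. `samples` is a list
    of floats and `alignment` is a list of (word, start, end) sample indices.
    This layout is hypothetical, chosen only to keep the sketch small.
    """
    bank = {}
    for samples, alignment in utterances:
        for word, start, end in alignment:
            bank.setdefault(word, []).append(samples[start:end])
    return bank


def splice_utterance(text, bank, rng):
    """Concatenate one randomly chosen source segment per target word.

    Returns None when any word has no segment in the bank (an assumed policy).
    """
    spliced = []
    for word in text.split():
        candidates = bank.get(word)
        if not candidates:
            return None
        spliced.extend(rng.choice(candidates))
    return spliced


# Toy source-domain data: two utterances with word-level alignments.
source = [
    ([0.1, 0.2, 0.3, 0.4, 0.5], [("play", 0, 2), ("music", 2, 5)]),
    ([0.6, 0.7, 0.8], [("stop", 0, 1), ("music", 1, 3)]),
]
bank = build_segment_bank(source)

# Target-domain text for which no audio was collected.
audio = splice_utterance("stop music", bank, random.Random(0))
```

In this sketch the spliced audio and its text would form a training pair for adaptation; how the paper actually selects segments and handles unseen words is not stated in the abstract.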

Locations

arXiv (Cornell University)
2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

Similar Works

Title	Year	Authors
On Addressing Practical Challenges for RNN-Transducer	2021	Rui Zhao Jian Xue Jinyu Li Wenning Wei Lei He Yifan Gong
Exploring Pre-Training with Alignments for RNN Transducer Based End-to-End Speech Recognition	2020	Hu Hu Rui Zhao Jinyu Li Liang Lu Yifan Gong
Improving RNN Transducer Modeling for End-to-End Speech Recognition	2019	Jinyu Li Rui Zhao Hu Hu Yifan Gong
Efficient Minimum Word Error Rate Training of RNN-Transducer for End-to-End Speech Recognition	2020	Jinxi Guo Gautam Tiwari Jasha Droppo Maarten Van Segbroeck Che-Wei Huang Andreas Stolcke Roland Maas
Efficient Training of Neural Transducer for Speech Recognition	2022	Wei Zhou Wilfried Michel Ralf Schlüter Hermann Ney
Fast Text-Only Domain Adaptation of RNN-Transducer Prediction Network	2021	Janne Pylkkönen Antti Ukkonen Juho Kilpikoski Samu Tamminen Hannes Heikinheimo
Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers	2021	Juntae Kim Jeehye Lee
Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers	2022	Juntae Kim Jeehye Lee
On Language Model Integration for RNN Transducer based Speech Recognition	2021	Wei Zhou Zuoyun Zheng Ralf Schlüter Hermann Ney
On Language Model Integration for RNN Transducer Based Speech Recognition	2022	Wei Zhou Zuoyun Zheng Ralf Schlüter Hermann Ney
Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition	2019	Chao Weng Chengzhu Yu Jia Cui Chunlei Zhang Dong Yu
Word-Level Confidence Estimation for RNN Transducers	2021	Mingqiu Wang Hagen Soltau Laurent El Shafey Izhak Shafran

Works That Cite This (8)

Title	Year	Authors
Speaker Change Detection For Transformer Transducer ASR	2023	Jian Wu Zhuo Chen Min Hu Xiong Xiao Jinyu Li
Code-Switching Text Generation and Injection in Mandarin-English ASR	2023	Haibin Yu Yuxuan Hu Yao Qian Jin Ma Linquan Liu Shujie Liu Yu Shi Yanmin Qian Edward Lin Michael Zeng
Pronunciation-Aware Unique Character Encoding for RNN Transducer-Based Mandarin Speech Recognition	2023	Peng Shen Xugang Lu Hisashi Kawai
Fast and Accurate Factorized Neural Transducer for Text Adaption of End-to-End Speech Recognition Models	2023	Rui Zhao Jian Xue Partha Parthasarathy Veljko Miljanic Jinyu Li
Speech Collage: Code-Switched Audio Generation by Collaging Monolingual Corpora	2024	Amir Hussein Dorsa Zeinali Ondřej Klejch Matthew Wiesner Brian Yan Shammur Absar Chowdhury Ahmed Ali Shinji Watanabe Sanjeev Khudanpur
From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition	2023	Chao-Han Huck Yang Bo Li Yu Zhang Nanxin Chen Rohit Prabhavalkar Tara N. Sainath Trevor Strohman
Hybrid Attention-Based Encoder-Decoder Model for Efficient Language Model Adaptation	2024	Shaoshi Ling Guoli Ye Rui Zhao Yifan Gong
CTC-GMM: CTC Guided Modality Matching For Fast and Accurate Streaming Speech Translation	2024	Rui Zhao Jinyu Li Ruchao Fan Matt Post

Works Cited by This (25)

Title	Year	Authors
Sequence Transduction with Recurrent Neural Networks	2012	Alex Graves
Finding consensus in speech recognition: word error minimization and other applications of confusion networks	2000	Lidia Mangu Eric Brill Andreas Stolcke
Exploring Neural Transducers for End-to-End Speech Recognition	2017	Eric Battenberg Jitong Chen Rewon Child Adam Coates Yashesh Gaur Yi Li Hairong Liu Sanjeev Satheesh David Seetapun Anuroop Sriram
Modeling Multi-speaker Latent Space to Improve Neural TTS: Quick Enrolling New Speaker and Enhancing Premium Voice	2018	Yan Deng Lei He Frank K. Soong
Improving Performance of End-to-End ASR on Numeric Sequences	2019	Cal Peyser Hao Zhang Tara N. Sainath Zelin Wu
Streaming End-to-end Speech Recognition for Mobile Devices	2019	Yanzhang He Tara N. Sainath Rohit Prabhavalkar Ian McGraw Raziel Álvarez Ding Zhao David Rybach Anjuli Kannan Yonghui Wu Ruoming Pang
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models	2018	Chung‐Cheng Chiu Tara N. Sainath Yonghui Wu Rohit Prabhavalkar Patrick Nguyen Zhifeng Chen Anjuli Kannan Ron J. Weiss Kanishka Rao Ekaterina Gonina
EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding	2015	Yajie Miao Mohammad Gowayyed Florian Metze
Advancing Acoustic-to-Word CTC Model	2018	Jinyu Li Guoli Ye Amit Das Rui Zhao Yifan Gong
Improving RNN Transducer Modeling for End-to-End Speech Recognition	2019	Jinyu Li Rui Zhao Hu Hu Yifan Gong