TorchAudio: Building Blocks for Audio and Speech Processing

Yao-Yuan Yang, Moto Hira, Zhaoheng Ni, Artyom Astafurov, Caroline Chen, Christian Puhrsch, David Pollack, Dmitriy Genzel, Donny Greenberg, Edward Z. Yang

Type: Article

Publication Date: 2022-04-27

Citations: 61

DOI: https://doi.org/10.1109/icassp43922.2022.9747236

Abstract

This document describes version 0.10 of TorchAudio, a set of building blocks for machine learning applications in the audio and speech processing domain. The objective of TorchAudio is to accelerate the development and deployment of machine learning applications for researchers and engineers by providing off-the-shelf building blocks. The building blocks are designed to be GPU-compatible, automatically differentiable, and production-ready. TorchAudio can be easily installed from the Python Package Index (PyPI) repository, and the source code is publicly available under a BSD-2-Clause License (as of September 2021) at https://github.com/pytorch/audio. In this document, we provide an overview of the design principles, functionalities, and benchmarks of TorchAudio. We also benchmark our implementation of several audio and speech operations and models. The benchmarks verify that our implementations of various operations and models are valid and perform similarly to other publicly available implementations.

Locations

arXiv (Cornell University)
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Similar Works

Title	Year	Authors
TorchAudio: Building Blocks for Audio and Speech Processing	2021	Yao-Yuan Yang Moto Hira Zhaoheng Ni Anjali Chourdia Artyom Astafurov Caroline Chen Ching-Feng Yeh Christian Puhrsch David Pollack Dmitriy Genzel
TorchAudio 2.1: Advancing Speech Recognition, Self-Supervised Learning, and Audio Processing Components for Pytorch	2023	Jeff Hwang Moto Hira Caroline Chen Xiaohui Zhang Zhaoheng Ni Guangzhi Sun Pingchuan Ma Ruizhe Huang Vineel Pratap Yuekai Zhang
TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch	2023	Jeff Hwang Moto Hira Caroline Chen Xiaohui Zhang Zhaoheng Ni Guangzhi Sun Pingchuan Ma Ruizhe Huang Vineel Pratap Yuekai Zhang
Codec-SUPERB @ SLT 2024: A lightweight benchmark for neural audio codec models	2024	Haibin Wu Xuanjun Chen Yi‐Cheng Lin Kai‐Wei Chang Jiawei Du Ke-Han Lu Alexander H. Liu Ho-Lam Chung Yuan-Kuei Wu Dongchao Yang
Lhotse: a speech data representation library for the modern deep learning ecosystem	2021	Piotr Żelasko Daniel Povey Jan Trmal Sanjeev Khudanpur
Shennong: A Python toolbox for audio speech features extraction	2023	Mathieu Bernard Maxime Poli Julien Karadayi Emmanuel Dupoux
Audiodec: An Open-Source Streaming High-Fidelity Neural Audio Codec	2023	Yi-Chiao Wu Israel D. Gebru Dejan Marković Alexander Richard
Overview of the Amphion Toolkit (v0.2)	2025	Jiaqi Li Xueyao Zhang Yuancheng Wang Haorui He Chaoren Wang Lijun Wang Huan Liao J Ão Zhihui Xie Yiqiao Huang
DeepSpectrumLite: A Power-Efficient Transfer Learning Framework for Embedded Speech and Audio Processing from Decentralised Data	2021	Shahin Amiriparian Tobias Hübner Maurice Gerczuk Sandra Ottl Björn Schüller
The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge	2024	Yiwei Guo Chenrun Wang Yifan Yang Hankun Wang Ziyang Ma Chenpeng Du Shuai Wang H. B. Li Shuai Fan Hui Zhang
The ICME 2025 Audio Encoder Capability Challenge	2025	Junbo Zhang Heinrich Dinkel Qiong Song Helen Wang Y. Niu Cheng Si Xiaofeng Xin Ke Li Wenwu Wang Yujun Wang
ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech	2024	Jiatong Shi Jinchuan Tian Yihan Wu Jee-weon Jung Jia Qi Yip Yoshiki Masuyama William Chen Yuning Wu Yuxun Tang Massa Baali
Low-complexity deep learning frameworks for acoustic scene classification	2022	Lam Pham Dat Ngo Anahid Jalali Alexander Schindler
Transformer-based Sequence Labeling for Audio Classification based on MFCCs	2023	C. S. Sonali B S Chinmayi Ahana Balasubramanian
Audio-Language Datasets of Scenes and Events: A Survey	2024	Gijs Wijngaard Elia Formisano Michele Esposito Michel Dumontier
The PyTorch-Kaldi Speech Recognition Toolkit	2018	Mirco Ravanelli Titouan Parcollet Yoshua Bengio
The Pytorch-kaldi Speech Recognition Toolkit	2019	Mirco Ravanelli Titouan Parcollet Yoshua Bengio
EasyASR: A Distributed Machine Learning Platform for End-to-end Automatic Speech Recognition	2020	Chengyu Wang Mengli Cheng Hu Xu Jun Huang
EasyASR: A Distributed Machine Learning Platform for End-to-end Automatic Speech Recognition	2021	Chengyu Wang Mengli Cheng Hu Xu Jun Huang

Works That Cite This (23)

Title	Year	Authors
Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge	2024	Simon Leglaive Matthieu Fraticelli Hend Elghazaly Léonie Borne Mostafa Sadeghi Scott Wisdom Manuel Pariente John R. Hershey Daniel Pressnitzer Jon Barker
GANStrument: Adversarial Instrument Sound Synthesis with Pitch-Invariant Instance Conditioning	2023	Gaku Narita Junichi Shimizu Taketo Akama
A Large-Scale Evaluation of Speech Foundation Models	2024	Shu-Wen Yang Heng-Jui Chang Zili Huang Andy T. Liu Cheng-I Lai Haibin Wu Jiatong Shi Xuankai Chang Hsiang-Sheng Tsai Wen-Chin Huang
TorchGeo: Deep Learning With Geospatial Data	2024	Adam J. Stewart Caleb Robinson Isaac Corley Anthony Ortiz Juan Lavista Ferres Arindam Banerjee
TorchAudio 2.1: Advancing Speech Recognition, Self-Supervised Learning, and Audio Processing Components for Pytorch	2023	Jeff Hwang Moto Hira Caroline Chen Xiaohui Zhang Zhaoheng Ni Guangzhi Sun Pingchuan Ma Ruizhe Huang Vineel Pratap Yuekai Zhang
The Impact of Silence on Speech Anti-Spoofing	2023	Yuxiang Zhang Zhuo Li Jingze Lu Hua Hua Wenchao Wang Pengyuan Zhang
ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit	2023	Brian Yan Jiatong Shi Yun Tang Hirofumi Inaguma Yifan Peng Siddharth Dalmia Peter Polák Patrick Fernandes Dan Berrebbi Tomoki Hayashi
LPCSE: Neural Speech Enhancement through Linear Predictive Coding	2022	Yang Liu Na Tang Xiaoli Chu Yang Yang Jun Wang
Soft Label Coding for end-to-end Sound Source Localization with ad-hoc Microphone Arrays	2023	Linfeng Feng Yijun Gong Xiao-Lei Zhang
End-to-End Spoken Language Understanding Using Joint CTC Loss and Self-Supervised, Pretrained Acoustic Encoders	2023	Jixuan Wang Martin Radfar Kai Wei Clement Chung

Works Cited by This (19)

Title	Year	Authors
Deep Speech: Scaling up end-to-end speech recognition	2014	Awni Hannun Carl Case Jared Casper Bryan Catanzaro Greg Diamos Erich Elsen Ryan Prenger Sanjeev Satheesh Shubho Sengupta Adam Coates
Wav2Letter: an End-to-End ConvNet-based Speech Recognition System	2016	Ronan Collobert Christian Puhrsch Gabriel Synnaeve
fairseq: A Fast, Extensible Toolkit for Sequence Modeling	2019	Myle Ott Sergey Edunov Alexei Baevski Angela Fan Sam Gross Nathan Ng David Grangier Michael Auli
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation	2019	Yi Luo Nima Mesgarani
ESPnet: End-to-End Speech Processing Toolkit	2018	Shinji Watanabe Takaaki Hori Shigeki Karita Tomoki Hayashi Jiro Nishitoba Yuya Unno Nelson Enrique Yalta Soplin Jahn Heymann Matthew Wiesner Nanxin Chen
Waveglow: A Flow-based Generative Network for Speech Synthesis	2019	Ryan Prenger Rafael Valle Bryan Catanzaro
The Pytorch-kaldi Speech Recognition Toolkit	2019	Mirco Ravanelli Titouan Parcollet Yoshua Bengio
Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions	2018	Jonathan Shen Ruoming Pang Ron J. Weiss Mike Schuster Navdeep Jaitly Zongheng Yang Zhifeng Chen Yu Zhang Yuxuan Wang RJ Skerry-Ryan
PyTorch: An Imperative Style, High-Performance Deep Learning Library	2019	Adam Paszke Sam Gross Francisco Massa Adam Lerer James Bradbury Gregory Chanan Trevor Killeen Zeming Lin Natalia Gimelshein Luca Antiga
NeMo: a toolkit for building AI applications using Neural Modules	2019	Oleksii Kuchaiev Jason Li Huyen Nguyen Oleksii Hrinchuk R. Bret Leary Boris Ginsburg Samuel Kriman Stanislav Beliaev Vitaly Lavrukhin Jack Cook