Attention Is All You Need In Speech Separation

Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, Jianyuan Zhong

Type: Article

Publication Date: 2021-05-13

Citations: 364

DOI: https://doi.org/10.1109/icassp39728.2021.9413901

Abstract

Recurrent Neural Networks (RNNs) have long been the dominant architecture in sequence-to-sequence learning. RNNs, however, are inherently sequential models that do not allow parallelization of their computations. Transformers are emerging as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechanism. In this paper, we propose the SepFormer, a novel RNN-free Transformer-based neural network for speech separation. The SepFormer learns short- and long-term dependencies with a multi-scale approach that employs Transformers. The proposed model achieves state-of-the-art (SOTA) performance on the standard WSJ0-2/3mix datasets. It reaches an SI-SNRi of 22.3 dB on WSJ0-2mix and an SI-SNRi of 19.5 dB on WSJ0-3mix. The SepFormer inherits the parallelization advantages of Transformers and achieves competitive performance even when downsampling the encoded representation by a factor of 8. It is thus significantly faster and less memory-demanding than the latest speech separation systems with comparable performance.
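
The abstract reports results as SI-SNRi (scale-invariant signal-to-noise ratio improvement) without defining it. The sketch below follows the commonly used definition, in which the estimate is projected onto the clean target and the improvement is measured relative to the unprocessed mixture. The function names, the `eps` stabilizer, and the NumPy implementation are assumptions made for illustration, not the authors' evaluation code.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals of equal length."""
    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target: this is the "signal" part.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    # Whatever is left after the projection counts as "noise".
    noise = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)


def si_snr_improvement(estimate, target, mixture):
    """SI-SNRi: gain of the separated estimate over the raw mixture, in dB."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```

Under this definition, an SI-SNRi of 22.3 dB means the separated signal scores 22.3 dB higher against the clean source than the input mixture does. Because of the projection step, rescaling the estimate leaves the score unchanged.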

Locations

arXiv (Cornell University)
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Works That Cite This (121)

Title	Year	Authors
MT3: Multi-Task Multitrack Music Transcription	2021	Josh Gardner, Ian Simon, Ethan Manilow, Curtis Hawthorne, Jesse Engel
Exploring Self-Attention Mechanisms for Speech Separation	2023	Cem Subakan, Mirco Ravanelli, Samuele Cornell, François Grondin, Mirko Bronzi
DPCCN: Densely-Connected Pyramid Complex Convolutional Network for Robust Speech Separation and Extraction	2022	Jiangyu Han, Yanhua Long, Lukáš Burget, Jaň Černocký
DasFormer: Deep Alternating Spectrogram Transformer for Multi/Single-Channel Speech Separation	2023	Shuo Wang, Xiang‐Yu Kong, Xiulian Peng, Hesam Movassagh, Vinod Prakash, Yan Lu
Ripple Sparse Self-Attention for Monaural Speech Enhancement	2023	Qiquan Zhang, Hongxu Zhu, Qi Song, Xinyuan Qian, Zhaoheng Ni, Haizhou Li
RobustDistiller: Compressing Universal Speech Representations for Enhanced Environment Robustness	2023	Heitor R. Guimarães, Arthur Pimentel, Anderson R. Avila, Mehdi Rezagholizadeh, Boxing Chen, Tiago H. Falk
On the Design and Training Strategies for RNN-Based Online Neural Speech Separation Systems	2023	Kai Li, Yi Luo
Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation	2023	Yoshiki Masuyama, Xuankai Chang, Wangyou Zhang, Samuele Cornell, Zhong-Qiu Wang, Nobutaka Ono, Yanmin Qian, Shinji Watanabe
Latent Iterative Refinement for Modular Source Separation	2023	Dimitrios Bralios, Efthymios Tzinis, Gordon Wichern, Paris Smaragdis, Jonathan Le Roux
MossFormer: Pushing the Performance Limit of Monaural Speech Separation Using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions	2023	Shengkui Zhao, Bin Ma

Works Cited by This (27)

Title	Year	Authors
Deep clustering: Discriminative embeddings for segmentation and separation	2016	John R. Hershey, Zhuo Chen, Jonathan Le Roux, Shinji Watanabe
Permutation invariant training of deep models for speaker-independent multi-talker speech separation	2017	Dong Yu, Morten Kolbæk, Zheng‐Hua Tan, Jesper Jensen
Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks	2017	Morten Kolbæk, Dong Yu, Zheng‐Hua Tan, Jesper Jensen
Light Gated Recurrent Units for Speech Recognition	2018	Mirco Ravanelli, Philémon Brakel, Maurizio Omologo, Yoshua Bengio
Neural Speech Synthesis with Transformer Network	2019	Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation	2019	Yi Luo, Nima Mesgarani
TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation	2018	Yi Luo, Nima Mesgarani
Deep Learning Based Phase Reconstruction for Speaker Separation: A Trigonometric Perspective	2019	Zhong-Qiu Wang, Ke Tan, DeLiang Wang
Attention is All you Need	2017	Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
End-To-End Source Separation With Adaptive Front-Ends	2018	Shrikant Venkataramani, Jonah Casebeer, Paris Smaragdis