Rethinking The Separation Layers In Speech Separation Networks

Type: Article

Publication Date: 2021-05-13

Citations: 6

DOI: https://doi.org/10.1109/icassp39728.2021.9414176

Abstract

Modules in all existing speech separation networks can be categorized into single-input-multi-output (SIMO) modules and single-input-single-output (SISO) modules: SIMO modules generate more output streams than inputs, while SISO modules keep the number of inputs and outputs the same. While the majority of separation models contain only SIMO architectures, certain two-stage separation systems that integrate a post-enhancement SISO module have been shown to improve separation quality. Why can performance improvements be achieved by incorporating SISO modules? Are SIMO modules always necessary? In this paper, we examine these questions empirically by designing models with varying configurations of SIMO and SISO modules. We show that, compared with the standard SIMO-only design, a mixed SIMO-SISO design with the same model size improves separation performance, especially under low-overlap conditions. We further examine the necessity of SIMO modules and show that SISO-only models can still perform separation without sacrificing performance. These observations allow us to rethink the model design paradigm and present different views on how separation is performed.
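The SIMO/SISO distinction described above can be sketched minimally in code. The mask values and the identity "enhancement" below are placeholders for illustration only, not the learned networks studied in the paper:

```python
import numpy as np

def simo_module(mixture, num_speakers=2):
    """SIMO: one input mixture -> one output stream per speaker.
    Placeholder uniform masks stand in for a learned separator."""
    masks = [np.full_like(mixture, 1.0 / num_speakers) for _ in range(num_speakers)]
    return [m * mixture for m in masks]

def siso_module(stream):
    """SISO: one input stream -> one refined output stream.
    Identity placeholder for a post-enhancement network."""
    return stream

# Two-stage SIMO-SISO pipeline: separate first, then enhance each stream.
mixture = np.ones(8)
streams = simo_module(mixture)                 # 1 input -> 2 outputs
enhanced = [siso_module(s) for s in streams]   # each stream refined in place
```

In the SIMO-only design the pipeline stops after `simo_module`; the mixed designs examined in the paper vary how much of the model's capacity is allocated to each stage.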

Locations

  • arXiv (Cornell University)
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Works Cited by This (24)

  • Deep clustering: Discriminative embeddings for segmentation and separation (2016) - John R. Hershey, Zhuo Chen, Jonathan Le Roux, Shinji Watanabe
  • Permutation invariant training of deep models for speaker-independent multi-talker speech separation (2017) - Dong Yu, Morten Kolbæk, Zheng‐Hua Tan, Jesper Jensen
  • Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks (2017) - Morten Kolbæk, Dong Yu, Zheng‐Hua Tan, Jesper Jensen
  • Low-latency Speaker-independent Continuous Speech Separation (2019) - Takuya Yoshioka, Zhuo Chen, Changliang Liu, Xiong Xiao, Hakan Erdoğan, Dimitrios Dimitriadis
  • End-to-End Multi-Channel Speech Separation (2019) - Rongzhi Gu, Jian Wu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu, Meng Yu, Dan Su, Yuexian Zou, Dong Yu
  • Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation (2019) - Yi Luo, Nima Mesgarani
  • TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation (2018) - Yi Luo, Nima Mesgarani
  • Adam: A Method for Stochastic Optimization (2014) - Diederik P. Kingma, Jimmy Ba
  • Guided Source Separation Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR (2019) - Naoyuki Kanda, Christoph Boeddeker, Jens Heitkaemper, Yusuke Fujita, Shota Horiguchi, Kenji Nagamatsu, Reinhold Haeb‐Umbach
  • WHAM!: Extending Speech Separation to Noisy Environments (2019) - Gordon Wichern, Joe Antognini, M. D. Flynn, Licheng Richard Zhu, Emmett McQuinn, Dwight Crow, Ethan Manilow, Jonathan Le Roux