Personal VAD: Speaker-Conditioned Voice Activity Detection

Shaojin Ding, Quan Wang, Shuo-Yiin Chang, Li Wan, Ignacio López Moreno

Type: Preprint

Publication Date: 2020-05-15

Citations: 68

DOI: https://doi.org/10.21437/odyssey.2020-62

Abstract

In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming on-device speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption, especially in scenarios where a keyword detector is unpreferable. We achieve this by training a VAD-alike neural network that is conditioned on the target speaker embedding or the speaker verification score. For each frame, personal VAD outputs the probabilities for three classes: non-speech, target speaker speech, and non-target speaker speech. Under our optimal setup, we are able to train a model with only 130K parameters that outperforms a baseline system where individually trained standard VAD and speaker recognition networks are combined to perform the same task.
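
To make the three-class, frame-level output concrete, the following is a minimal sketch of the baseline-style idea mentioned in the abstract: combining a standard VAD speech probability with a speaker verification score, then gating frames for a downstream recognizer. It is not the paper's trained network. The function names, the product rule used to combine the two scores, the assumption that both scores lie in [0, 1], and the 0.5 threshold are all illustrative assumptions.

```python
from typing import List, Tuple

# Class order follows the abstract: non-speech, target speaker speech,
# non-target speaker speech.
CLASSES = ("non-speech", "target speaker speech", "non-target speaker speech")


def combine_scores(speech_prob: float, target_score: float) -> Tuple[float, float, float]:
    """Turn two per-frame scores into three class probabilities.

    speech_prob:  probability that the frame contains speech (standard VAD).
    target_score: speaker verification score for the target speaker,
                  assumed here to be already scaled to [0, 1].
    The product rule below is an illustrative assumption, not the paper's method.
    """
    if not (0.0 <= speech_prob <= 1.0 and 0.0 <= target_score <= 1.0):
        raise ValueError("both scores must lie in [0, 1]")
    non_speech = 1.0 - speech_prob
    target = speech_prob * target_score
    non_target = speech_prob * (1.0 - target_score)
    return (non_speech, target, non_target)


def gate_frames(frames: List[Tuple[float, float]], threshold: float = 0.5) -> List[bool]:
    """Return True for each frame that would be forwarded to the recognizer.

    A frame passes only when its target-speaker probability reaches the
    (hypothetical) threshold.
    """
    return [combine_scores(p, s)[1] >= threshold for p, s in frames]


if __name__ == "__main__":
    # (speech_prob, target_score) per frame: silence, target user, another talker.
    frames = [(0.1, 0.9), (0.9, 0.9), (0.9, 0.1)]
    for frame, keep in zip(frames, gate_frames(frames)):
        probs = combine_scores(*frame)
        print(dict(zip(CLASSES, probs)), "forward to ASR:", keep)
```

Under these assumptions only the second frame, where both the speech probability and the verification score are high, would reach the recognizer. The paper's contribution is to replace this two-model combination with a single small network conditioned on the speaker embedding or verification score.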

Locations

arXiv (Cornell University)

Similar Works

Title	Year	Authors
Personal VAD: Speaker-Conditioned Voice Activity Detection	2019	Shaojin Ding, Quan Wang, Shuo-Yiin Chang, Li Wan, Ignacio López Moreno
Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition	2022	Shaojin Ding, Rajeev V. Rikhye, Qiao Liang, Yanzhang He, Quan Wang, Arun Narayanan, Tom O’Malley, Ian McGraw
SVVAD: Personal Voice Activity Detection for Speaker Verification	2023	Zuheng Kang, Jianzong Wang, Junqing Peng, Jing Xiao
Speaker Embeddings With Weakly Supervised Voice Activity Detection For Efficient Speaker Diarization	2024	Jenthe Thienpondt, Kris Demuynck
Enrollment-less training for personalized voice activity detection	2021	Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura
SG-VAD: Stochastic Gates Based Speech Activity Detection	2022	Jonathan Svirsky, Ofir Lindenbaum
Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness	2024	Sai Srujana Buddi, Satyam Kumar, Utkarsh Sarawgi, Vineet Garg, Shivesh Ranjan, Ognjen Rudovic, Ahmed Hussen Abdelaziz, Saurabh Adya
Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness	2024	Satyam Kumar, Sai Srujana Buddi, Utkarsh Sarawgi, Vineet Garg, Shivesh Ranjan, Ognjen Rudovic, Ahmed Hussen Abdelaziz, Saurabh Adya
End-to-End Speaker-Dependent Voice Activity Detection	2020	Yefei Chen, Shuai Wang, Yanmin Qian, Kai Yu
Multi-user VoiceFilter-Lite via Attentive Speaker Embedding	2021	Rajeev V. Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ian McGraw
SG-VAD: Stochastic Gates Based Speech Activity Detection	2023	Jonathan Svirsky, Ofir Lindenbaum
Semantic VAD: Low-Latency Voice Activity Detection for Speech Interaction	2023	Mohan Shi, Yuchun Shu, Lingyun Zuo, Qian Chen, Shiliang Zhang, Jie Zhang, Li-Rong Dai
Advancing VAD Systems Based on Multi-Task Learning with Improved Model Structures	2023	Lingyun Zuo, Keyu An, Shiliang Zhang, Zhijie Yan

Works That Cite This (43)

Title	Year	Authors
End-to-End Active Speaker Detection	2022	Juan León Alcázar, Moritz Cordes, Chen Zhao, Bernard Ghanem
Personalized PercepNet: Real-time, Low-complexity Target Voice Separation and Enhancement	2021	Ritwik Giri, Shrikant Venkataramani, Jean-Marc Valin, Umut Isik, Arvindh Krishnaswamy
Target-Speaker Voice Activity Detection: A Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario	2020	Ivan Medennikov, Maxim Korenevsky, Tatiana Prisyach, Yuri Khokhlov, Mariya Korenevskaya, Ivan Sorokin, Tatiana Timofeeva, Anton Mitrofanov, Andrei Andrusenko, Ivan Podluzhny
The "Sound of Silence" in EEG -- Cognitive voice activity detection	2020	Rini A Sharon, Hema A. Murthy
Target-speaker Voice Activity Detection with Improved I-Vector Estimation for Unknown Number of Speaker	2021	Maokui He, Desh Raj, Zili Huang, Jun Du, Zhuo Chen, Shinji Watanabe
VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition	2020	Quan Wang, Ignacio López Moreno, Mert Sağlam, Kevin R. Wilson, Alan Chiao, Renjie Liu, Yanzhang He, Wěi Li, Jason Pelecanos, Marily Nika
Configurable Privacy-Preserving Automatic Speech Recognition	2021	Ranya Aloufi, Hamed Haddadi, David Boyle
VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition	2020	Quan Wang, Ignacio López Moreno, Mert Sağlam, Kevin Wilson, Alan Chiao, Renjie Liu, Yanzhang He, Wei Li, Jason Pelecanos, Marily Nika
In-Ear-Voice: Towards Milli-Watt Audio Enhancement With Bone-Conduction Microphones for In-Ear Sensing Platforms	2023	Philipp Schilk, Niccolò Polvani, Andrea Ronco, Miloš Cerňak, Michele Magno

Works Cited by This (15)

Title	Year	Authors
Distilling the Knowledge in a Neural Network	2015	Geoffrey E. Hinton, Oriol Vinyals, Jeff Dean
Voice Activity Detection: Merging Source and Filter-based Information	2015	Thomas Drugman, Yannis Stylianou, Yusuke Kida, Masami Akamine
On the Efficient Representation and Execution of Deep Acoustic Models	2016	Raziel Álvarez, Rohit Prabhavalkar, Anton Bakhtin
Deep Speaker: an End-to-End Neural Speaker Embedding System	2017	Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, Zhenyao Zhu
VoxCeleb: A Large-Scale Speaker Identification Dataset	2017	Arsha Nagrani, Joon Son Chung, Andrew Zisserman
Wavenet Based Low Rate Speech Coding	2018	W. Bastiaan Kleijn, Felicia S. C. Lim, Alejandro Luebs, Jan Skoglund, Florian Stimberg, Quan Wang, Thomas C. Walters
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis	2018	Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio López Moreno
Sample Efficient Adaptive Text-to-Speech	2018	Yutian Chen, Yannis Assael, Brendan Shillingford, David Budden, Scott Reed, Heiga Zen, Quan Wang, Luis C. Cobo, Andrew Trask, Ben Laurie
Fully Supervised Speaker Diarization	2019	Aonan Zhang, Quan Wang, Zhenyao Zhu, John Paisley, Chong Wang
Tuplemax Loss for Language Identification	2019	Li Wan, Prashant Sridhar, Yang Yu, Quan Wang, Ignacio López Moreno