Personal VAD: Speaker-Conditioned Voice Activity Detection

Type: Preprint

Publication Date: 2020-05-15

Citations: 68

DOI: https://doi.org/10.21437/odyssey.2020-62

Abstract

In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming on-device speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption, especially in scenarios where a keyword detector is unpreferable. We achieve this by training a VAD-alike neural network that is conditioned on the target speaker embedding or the speaker verification score. For each frame, personal VAD outputs the probabilities for three classes: non-speech, target speaker speech, and non-target speaker speech. Under our optimal setup, we are able to train a model with only 130K parameters that outperforms a baseline system where individually trained standard VAD and speaker recognition networks are combined to perform the same task.
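
The per-frame, three-class output and the gating step described in the abstract can be illustrated with a small sketch. This is not the authors' model: a single random linear layer stands in for the trained network, conditioning by concatenating the speaker embedding to the frame features is an assumption made for illustration, and the names `personal_vad_frame` and `gate_frames`, the feature sizes, and the 0.5 threshold are all hypothetical.

```python
import math
import random

# Class order follows the abstract; index 1 is the target speaker.
CLASSES = ("non-speech", "target speaker speech", "non-target speaker speech")


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def personal_vad_frame(features, speaker_embedding, weights, biases):
    """Return three class probabilities for one frame.

    Assumption: the network is conditioned by concatenating the target
    speaker embedding to the acoustic features of the frame. A single
    linear layer stands in for the real (trained) network.
    """
    x = list(features) + list(speaker_embedding)
    logits = [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


def gate_frames(frame_probs, threshold=0.5):
    """Return indices of frames to forward to the speech recognizer.

    A frame passes only if its target-speaker probability exceeds the
    (hypothetical) threshold.
    """
    return [i for i, p in enumerate(frame_probs) if p[1] > threshold]


# Toy data: random weights, one enrolled-speaker embedding, five frames.
rng = random.Random(0)
feat_dim, emb_dim = 4, 3
weights = [
    [rng.uniform(-1, 1) for _ in range(feat_dim + emb_dim)] for _ in CLASSES
]
biases = [0.0, 0.0, 0.0]
embedding = [rng.uniform(-1, 1) for _ in range(emb_dim)]
frames = [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(5)]

probs = [personal_vad_frame(f, embedding, weights, biases) for f in frames]
kept = gate_frames(probs)
```

Here `probs` holds one three-way distribution per frame and `kept` lists the frames that would be passed downstream. With random weights the selection is meaningless; the point is only the shape of the interface, in which one small model replaces a separate VAD followed by a speaker recognition network.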

Locations

  • arXiv (Cornell University)

Similar Works

  • Personal VAD: Speaker-Conditioned Voice Activity Detection (2019). Shaojin Ding, Quan Wang, Shuo-Yiin Chang, Li Wan, Ignacio López Moreno
  • Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition (2022). Shaojin Ding, Rajeev V. Rikhye, Qiao Liang, Yanzhang He, Quan Wang, Arun Narayanan, Tom O’Malley, Ian McGraw
  • SVVAD: Personal Voice Activity Detection for Speaker Verification (2023). Zuheng Kang, Jianzong Wang, Junqing Peng, Jing Xiao
  • Speaker Embeddings With Weakly Supervised Voice Activity Detection For Efficient Speaker Diarization (2024). Jenthe Thienpondt, Kris Demuynck
  • Enrollment-less training for personalized voice activity detection (2021). Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura
  • SG-VAD: Stochastic Gates Based Speech Activity Detection (2022). Jonathan Svirsky, Ofir Lindenbaum
  • Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness (2024). Sai Srujana Buddi, Satyam Kumar, Utkarsh Sarawgi, Vineet Garg, Shivesh Ranjan, Ognjen Rudovic, Ahmed Hussen Abdelaziz, Saurabh Adya
  • End-to-End Speaker-Dependent Voice Activity Detection (2020). Yefei Chen, Shuai Wang, Yanmin Qian, Kai Yu
  • Multi-user VoiceFilter-Lite via Attentive Speaker Embedding (2021). Rajeev V. Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ian McGraw
  • SG-VAD: Stochastic Gates Based Speech Activity Detection (2023). Jonathan Svirsky, Ofir Lindenbaum
  • Semantic VAD: Low-Latency Voice Activity Detection for Speech Interaction (2023). Mohan Shi, Yuchun Shu, Lingyun Zuo, Qian Chen, Shiliang Zhang, Jie Zhang, Li-Rong Dai
  • Advancing VAD Systems Based on Multi-Task Learning with Improved Model Structures (2023). Lingyun Zuo, Keyu An, Shiliang Zhang, Zhijie Yan

Works That Cite This (43)

  • End-to-End Active Speaker Detection (2022). Juan León Alcázar, Moritz Cordes, Chen Zhao, Bernard Ghanem
  • Personalized PercepNet: Real-time, Low-complexity Target Voice Separation and Enhancement (2021). Ritwik Giri, Shrikant Venkataramani, Jean-Marc Valin, Umut Isik, Arvindh Krishnaswamy
  • Target-Speaker Voice Activity Detection: A Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario (2020). Ivan Medennikov, Maxim Korenevsky, Tatiana Prisyach, Yuri Khokhlov, Mariya Korenevskaya, Ivan Sorokin, Tatiana Timofeeva, Anton Mitrofanov, Andrei Andrusenko, Ivan Podluzhny
  • The "Sound of Silence" in EEG -- Cognitive voice activity detection (2020). Rini A Sharon, Hema A. Murthy
  • Target-speaker Voice Activity Detection with Improved I-Vector Estimation for Unknown Number of Speaker (2021). Maokui He, Desh Raj, Zili Huang, Jun Du, Zhuo Chen, Shinji Watanabe
  • VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition (2020). Quan Wang, Ignacio López Moreno, Mert Sağlam, Kevin R. Wilson, Alan Chiao, Renjie Liu, Yanzhang He, Wěi Li, Jason Pelecanos, Marily Nika
  • Configurable Privacy-Preserving Automatic Speech Recognition (2021). Ranya Aloufi, Hamed Haddadi, David Boyle
  • In-Ear-Voice: Towards Milli-Watt Audio Enhancement With Bone-Conduction Microphones for In-Ear Sensing Platforms (2023). Philipp Schilk, Niccolò Polvani, Andrea Ronco, Miloš Cerňak, Michele Magno

Works Cited by This (15)

  • Distilling the Knowledge in a Neural Network (2015). Geoffrey E. Hinton, Oriol Vinyals, Jay B. Dean
  • Voice Activity Detection: Merging Source and Filter-based Information (2015). Thomas Drugman, Yannis Stylianou, Yusuke Kida, Masami Akamine
  • On the Efficient Representation and Execution of Deep Acoustic Models (2016). Raziel Álvarez, Rohit Prabhavalkar, Anton Bakhtin
  • Deep Speaker: an End-to-End Neural Speaker Embedding System (2017). Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, Zhenyao Zhu
  • VoxCeleb: A Large-Scale Speaker Identification Dataset (2017). Arsha Nagrani, Joon Son Chung, Andrew Zisserman
  • Wavenet Based Low Rate Speech Coding (2018). W. Bastiaan Kleijn, Felicia S. C. Lim, Alejandro Luebs, Jan Skoglund, Florian Stimberg, Quan Wang, Thomas C. Walters
  • Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (2018). Jia Ye, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio López Moreno
  • Sample Efficient Adaptive Text-to-Speech (2018). Yutian Chen, Yannis Assael, Brendan Shillingford, David Budden, Scott Reed, Heiga Zen, Quan Wang, Luis C. Cobo, Andrew Trask, Ben Laurie
  • Fully Supervised Speaker Diarization (2019). Aonan Zhang, Quan Wang, Zhenyao Zhu, John Paisley, Chong Wang
  • Tuplemax Loss for Language Identification (2019). Li Wan, Prashant Sridhar, Yang Yu, Quan Wang, Ignacio López Moreno