A Lightweight Framework for Online Voice Activity Detection in the Wild (3 minutes introduction)

Xuenan Xu (SJTU, China), Heinrich Dinkel (Xiaomi, China), Mengyue Wu (SJTU, China), Kai Yu (SJTU, China)

Voice activity detection (VAD) is an essential pre-processing component for speech-related tasks such as automatic speech recognition (ASR). Traditional VAD systems require strong frame-level supervision for training, which limits their performance in real-world test scenarios. The general-purpose VAD (GPVAD) framework was previously proposed to significantly enhance noise robustness. However, GPVAD models are comparatively large and only work for offline evaluation. This work proposes a knowledge distillation framework in which a large, offline teacher model provides frame-level supervision to a lightweight, online student model. Our experiments verify that the proposed lightweight student models outperform GPVAD on all test sets, including clean, synthetic and real-world scenarios. When evaluated on a Raspberry Pi, our smallest student model uses only 2.2% of the parameters and 15.9% of the inference time of our teacher model.
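
The teacher-to-student supervision described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the teacher probabilities are made-up values, the "student" is reduced to one free logit per frame instead of a neural network, and the function names `distillation_loss` and `loss_gradient` are hypothetical. It only shows the general idea of using a teacher's frame-level speech probabilities as soft labels in a binary cross-entropy loss.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def distillation_loss(student_logits, teacher_probs, eps=1e-7):
    """Mean per-frame binary cross-entropy against the teacher's soft labels."""
    assert len(student_logits) == len(teacher_probs)
    total = 0.0
    for z, t in zip(student_logits, teacher_probs):
        # Clamp the student probability to keep log() finite.
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(student_logits)


def loss_gradient(student_logits, teacher_probs):
    """Gradient of the mean loss with respect to each student logit."""
    n = len(student_logits)
    return [(sigmoid(z) - t) / n for z, t in zip(student_logits, teacher_probs)]


# Hypothetical teacher output: speech probability for five consecutive frames.
teacher_probs = [0.05, 0.10, 0.90, 0.95, 0.80]

# Toy "student": one free logit per frame (an assumption made for brevity;
# a real student would be a small online model with shared weights).
student_logits = [0.0] * len(teacher_probs)
initial_loss = distillation_loss(student_logits, teacher_probs)

step_size = 5.0  # illustrative value, not taken from the paper
for _ in range(200):
    grad = loss_gradient(student_logits, teacher_probs)
    student_logits = [z - step_size * g for z, g in zip(student_logits, grad)]

final_loss = distillation_loss(student_logits, teacher_probs)
student_probs = [sigmoid(z) for z in student_logits]
```

In this toy setting the student's per-frame probabilities move toward the teacher's, so no hand-annotated frame labels are needed for the student. How the actual teacher labels are produced and which loss the authors use is not stated in the abstract.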

Related Recordings

End-to-End Transformer-Based Open-Vocabulary Keyword Spotting with Location-Guided Local Attention
(3 minutes introduction)

Bo Wei, Meirong Yang, Tao Zhang, Xiao Tang, Xing Huang, Kyuhong Kim, Jaeyun Lee, Kiho Cho, Sung-Un Park

Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation
(3 minutes introduction)

Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velázquez, Najim Dehak

InterSpeech 2021
