Speech Bandwidth Expansion For Speaker Recognition On Telephony Audio
Ganesh Sivaraman, Amruta Vidwans, Elie Khoury

Practical applications often require speaker recognition systems to work well on audio files with different sampling rates. However, the performance of these systems degrades substantially when the audio sampling rate differs between training and testing conditions. For example, wideband speaker recognition models trained on audio with a 16 kHz sampling rate perform poorly on telephony audio with an 8 kHz sampling rate because the higher-frequency information is missing. In this paper, we propose a Deep Neural Network (DNN) based system that estimates the speech spectrum at frequencies above 4 kHz for narrowband 8 kHz telephony audio. We train the proposed system on speech datasets processed with various simulated telephony codecs. Additionally, we perform speaker recognition and verification experiments in which the bandwidth expansion system serves as a preprocessor for wideband models. The datasets used for the speaker verification experiments are downsampled VoxCeleb1, downsampled SITW, and the NIST SRE 2010 protocols. We see a significant improvement in the results compared to simple upsampling with interpolation and low-pass filtering. These promising experiments show that the proposed bandwidth expansion system can be used successfully as data augmentation for the training of speaker embeddings.
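
To illustrate why the baseline mentioned above falls short, the following is a minimal sketch of upsampling with interpolation and low-pass filtering, written with NumPy only. It is not the authors' implementation: the function names `upsample_2x` and `high_band_energy_ratio`, the 101-tap Hamming-windowed sinc filter, and the synthetic two-tone test signal are all assumptions chosen for illustration. The point it shows is that the upsampled signal carries a 16 kHz sampling rate but still has almost no energy above 4 kHz, which is the band a bandwidth expansion system would have to estimate.

```python
import numpy as np

def upsample_2x(x, num_taps=101):
    """Upsample by 2: zero insertion followed by a windowed-sinc low-pass filter."""
    # Insert a zero after every sample, which doubles the sampling rate
    # but creates spectral images above the original Nyquist frequency.
    y = np.zeros(2 * len(x))
    y[::2] = x
    # Low-pass filter with cutoff at one quarter of the new sampling rate
    # (4 kHz when going from 8 kHz to 16 kHz). Filter length is an assumption.
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = 0.5 * np.sinc(0.5 * n) * np.hamming(num_taps)
    h *= 2.0 / h.sum()  # DC gain of 2 compensates for the inserted zeros
    return np.convolve(y, h, mode="same")

def high_band_energy_ratio(x, fs, split_hz=4000.0):
    """Fraction of the signal energy that lies above split_hz."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return power[freqs > split_hz].sum() / power.sum()

# Synthetic stand-in for one second of narrowband (8 kHz) audio.
fs_nb = 8000
t = np.arange(fs_nb) / fs_nb
narrowband = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)

wideband_like = upsample_2x(narrowband)
ratio = high_band_energy_ratio(wideband_like, fs=16000)
print(f"Energy above 4 kHz after upsampling: {ratio:.2e} of total")
```

The upsampled output has twice as many samples, yet the 4 to 8 kHz band stays essentially empty, so a wideband model sees a spectrum unlike the one it was trained on.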