ECAPA-TDNN Embeddings for Speaker Diarization
(3 minutes introduction)
Nauman Dawalatabad (IIT Madras, India), Mirco Ravanelli (Mila, Canada), François Grondin (Université de Sherbrooke, Canada), Jenthe Thienpondt (Ghent University, Belgium), Brecht Desplanques (Ghent University, Belgium), Hwidong Na (Samsung, Korea)

Learning robust speaker embeddings is a crucial step in speaker diarization. Deep neural networks can accurately capture speaker-discriminative characteristics, and popular deep embeddings such as x-vectors are now a fundamental component of modern diarization systems. Recently, several improvements over the standard TDNN architecture used for x-vectors have been proposed. The ECAPA-TDNN model, for instance, has shown impressive performance in speaker verification, thanks to a carefully designed neural architecture. In this work, we extend, for the first time, the use of the ECAPA-TDNN model to speaker diarization. Moreover, we improve its robustness with a powerful augmentation scheme that concatenates several contaminated versions of the same signal within the same training batch. The ECAPA-TDNN model provides robust speaker embeddings under both close-talking and distant-talking conditions. Our results on the popular AMI meeting corpus show that our system significantly outperforms recently proposed approaches.
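
The batch-level augmentation described in the abstract can be illustrated with a minimal sketch. The abstract only states that several contaminated versions of the same signal are concatenated within one training batch; the two contaminations used here (white noise at a fixed SNR and a single delayed echo as a crude stand-in for reverberation), the function names, and all parameter values are assumptions made for illustration, not the authors' recipe. Plain Python lists are used to keep the example dependency-free; a real pipeline would operate on tensors.

```python
import math
import random


def add_noise(wav, snr_db, rng):
    """Return wav plus white Gaussian noise at the requested SNR (in dB)."""
    signal_power = sum(x * x for x in wav) / len(wav)
    noise_std = math.sqrt(signal_power / (10.0 ** (snr_db / 10.0)))
    return [x + rng.gauss(0.0, noise_std) for x in wav]


def add_echo(wav, delay, gain):
    """Return wav plus one delayed, attenuated copy of itself.

    Hypothetical placeholder for a reverberation-style contamination.
    """
    return [
        x + (gain * wav[i - delay] if i >= delay else 0.0)
        for i, x in enumerate(wav)
    ]


def concat_augment(wavs, labels, rng, snr_db=10.0, delay=80, gain=0.5):
    """Build one enlarged batch: clean signals, then noisy, then echoed.

    Each contaminated copy keeps the speaker label of its clean source,
    so the label list is repeated once per version.
    """
    noisy = [add_noise(w, snr_db, rng) for w in wavs]
    echoed = [add_echo(w, delay, gain) for w in wavs]
    aug_wavs = [list(w) for w in wavs] + noisy + echoed
    aug_labels = list(labels) * 3
    return aug_wavs, aug_labels
```

With a batch of 4 signals, `concat_augment` returns 12 signals and 12 labels, so the network sees the clean and contaminated renditions of each utterance in the same training step.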