InterSpeech 2021

TEACHER-STUDENT MIXIT FOR UNSUPERVISED AND SEMI-SUPERVISED SPEECH SEPARATION
(3-minute introduction)

Jisi Zhang (University of Sheffield, UK), Cătălin Zorilă (Toshiba, UK), Rama Doddipatla (Toshiba, UK), Jon Barker (University of Sheffield, UK)
In this paper, we introduce a novel semi-supervised learning framework for end-to-end speech separation. The proposed method first uses mixtures of unseparated sources and the mixture invariant training (MixIT) criterion to train a teacher model. The teacher model then estimates separated sources that are used to train a student model with standard permutation invariant training (PIT). The student model can be fine-tuned with supervised data, i.e., paired artificial mixtures and clean speech sources, and further improved via model distillation. Experiments with single- and multi-channel mixtures show that the teacher-student training resolves the over-separation problem observed in the original MixIT method. Further, the performance of the semi-supervised system is comparable to that of a fully supervised separation system trained using ten times the amount of supervised data.
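The two training criteria named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a negative SNR signal loss (the abstract does not specify one), exhaustive search over assignments, and single-channel signals; the function names `neg_snr`, `mixit_loss`, and `pit_loss` are hypothetical. MixIT scores the estimated sources against the input mixtures by searching for the best assignment of each estimate to one mixture, while PIT scores estimates against reference sources (here, these would be the teacher's estimates acting as pseudo-targets) under the best permutation.

```python
import itertools
import numpy as np


def neg_snr(ref, est, eps=1e-8):
    """Negative SNR in dB (lower is better). Assumed loss, not from the abstract."""
    noise = ref - est
    return -10.0 * np.log10((np.sum(ref ** 2) + eps) / (np.sum(noise ** 2) + eps))


def mixit_loss(mixtures, est_sources):
    """MixIT-style loss: mixtures is (n_mix, T), est_sources is (n_src, T).

    Each estimated source is assigned to exactly one mixture; the loss is the
    minimum, over all such assignments, of the summed loss between each
    mixture and the sum of the estimates assigned to it.
    """
    n_mix, n_src = mixtures.shape[0], est_sources.shape[0]
    best_loss, best_assign = np.inf, None
    for assign in itertools.product(range(n_mix), repeat=n_src):
        # Binary mixing matrix with a single 1 in every column.
        A = np.zeros((n_mix, n_src))
        A[list(assign), np.arange(n_src)] = 1.0
        remixed = A @ est_sources
        loss = sum(neg_snr(mixtures[i], remixed[i]) for i in range(n_mix))
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign


def pit_loss(targets, est_sources):
    """PIT loss: targets and est_sources are both (n_src, T).

    Returns the minimum mean loss over all permutations, where perm[i] is the
    index of the estimate matched to target i.
    """
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(targets.shape[0])):
        loss = np.mean([neg_snr(targets[i], est_sources[p])
                        for i, p in enumerate(perm)])
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy check with random signals standing in for speech.
rng = np.random.default_rng(0)
sources = rng.standard_normal((4, 1000))
mixtures = np.stack([sources[0] + sources[1], sources[2] + sources[3]])

# A "perfect" separator that outputs the sources in shuffled order.
_, assign = mixit_loss(mixtures, sources[[2, 0, 3, 1]])
print(assign)  # (1, 0, 1, 0): estimates 1 and 3 go to mixture 0, the rest to mixture 1

# Pseudo-targets (a stand-in for teacher outputs) matched to swapped estimates.
_, perm = pit_loss(sources[:2], sources[[1, 0]])
print(perm)  # (1, 0)
```

The exhaustive search grows exponentially (MixIT) or factorially (PIT) with the number of outputs, which is acceptable for the small source counts typical of speech separation but is a simplification chosen here for readability.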