Adjunct-Emeritus Distillation for Semi-Supervised Language Model Adaptation
Scott Novotney (Amazon, USA), Yile Gu (Amazon, USA), Ivan Bulyko (Amazon, USA)
To improve customer privacy, commercial speech applications are reducing human transcription of customer data. This hurts language model training by shrinking the amount of in-domain transcripts. Prior work demonstrated that training on automated transcripts alone provides only modest gains because recognition errors are reinforced. We consider a new condition in which a model trained on historical human transcripts is available to us, but the transcripts themselves are not. To overcome temporal drift in vocabulary and topics, we propose a novel extension of knowledge distillation, adjunct-emeritus distillation, in which two imperfect teachers jointly train a student model. We conduct experiments on an English voice assistant domain and simulate a one-year gap in human transcription. Unlike fine-tuning, our approach is architecture agnostic; it achieves a 14% relative reduction in perplexity over the baseline of freezing model development and also improves over a standard knowledge distillation baseline.
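The central idea, two imperfect teachers jointly supervising a student, can be illustrated with a minimal sketch. The combination below is an assumption made only for illustration and is not the paper's stated objective: it interpolates the two teachers' temperature-smoothed next-token distributions with a fixed weight `alpha` and mixes the resulting distillation term with hard-label cross-entropy on automated transcripts.

```python
# Hedged sketch of a two-teacher ("adjunct-emeritus") distillation loss for an LM.
# Assumptions, not from the paper: the teachers' distributions are interpolated with
# a fixed weight `alpha`, and the student also sees hard-label cross-entropy on
# automated transcripts, weighted by `lam`.
import torch
import torch.nn.functional as F


def two_teacher_distillation_loss(student_logits, emeritus_logits, adjunct_logits,
                                  targets, alpha=0.5, lam=0.5, temperature=2.0):
    """All *_logits: (batch, seq_len, vocab); targets: (batch, seq_len) token ids."""
    # Soft targets: interpolate the two teachers' temperature-smoothed distributions.
    emeritus_probs = F.softmax(emeritus_logits / temperature, dim=-1)
    adjunct_probs = F.softmax(adjunct_logits / temperature, dim=-1)
    mixed_probs = alpha * emeritus_probs + (1.0 - alpha) * adjunct_probs

    # KL divergence from the student's distribution to the mixed teacher distribution.
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(log_student, mixed_probs, reduction="batchmean") * temperature ** 2

    # Hard-label cross-entropy on the (automated) transcript tokens.
    ce_loss = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                              targets.view(-1))
    return lam * kd_loss + (1.0 - lam) * ce_loss
```

In this reading, one teacher preserves knowledge from historical human transcripts while the other tracks recent automated transcripts; because supervision flows only through output distributions, the student's architecture need not match either teacher, which is what makes the approach architecture agnostic.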