Large-Scale Self- and Semi-Supervised Learning for Speech Translation
(3-minute introduction) [![Full paper at ISCA](/images/interspeech/full-paper-isca.png)](https://www.isca-speech.org/archive/interspeech_2021/wang21r_interspeech.html)
Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau (all Facebook, USA)
In this paper, we improve speech translation (ST) by effectively leveraging large quantities of unlabeled speech and text data in different and complementary ways. We explore both pretraining and self-training using the large Libri-Light speech audio corpus, and language modeling with CommonCrawl. With a simple recipe combining wav2vec 2.0 pretraining, a single iteration of self-training, and decoding with a language model, our experiments improve over the previous state of the art by 2.8 BLEU on average across the four CoVoST 2 language pairs considered. Unlike existing work, our approach uses no supervision other than ST data. Code and models are publicly released.
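One ingredient of the recipe is decoding with an external language model. A common way to do this is shallow fusion, where the translation model's score and the LM's score are combined with a tunable weight. The sketch below illustrates this idea with hypothetical per-token log-probabilities; the function names, scores, and the specific weight are illustrative assumptions, not the paper's implementation.

```python
import math

def shallow_fusion_score(st_log_probs, lm_log_probs, lm_weight=0.3):
    """Shallow-fusion score for one candidate translation:
    score = log p_ST(y | x) + lambda * log p_LM(y).
    Inputs are per-token log-probabilities (illustrative values)."""
    return sum(st_log_probs) + lm_weight * sum(lm_log_probs)

def pick_best(hypotheses, lm_weight=0.3):
    """Return the token sequence with the highest fused score.
    Each hypothesis is (tokens, st_log_probs, lm_log_probs)."""
    return max(
        hypotheses,
        key=lambda h: shallow_fusion_score(h[1], h[2], lm_weight),
    )[0]

# Toy example: two candidate translations with made-up scores.
# The LM strongly penalizes the ungrammatical second candidate.
hyps = [
    (["das", "ist", "gut"], [-0.2, -0.3, -0.4], [-0.5, -0.2, -0.3]),
    (["das", "is", "gut"],  [-0.2, -0.2, -0.6], [-0.5, -1.5, -0.3]),
]
best = pick_best(hyps)  # -> ["das", "ist", "gut"]
```

In practice this combination is applied inside beam search at each decoding step rather than as a final rerank, but the scoring rule is the same.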