Bootstrap an End-to-end ASR System by Multilingual Training, Transfer Learning, Text-to-text Mapping and Synthetic Audio
(3-minute introduction)
Manuel Giollo (Amazon, Italy), Deniz Gunceler (Amazon, Germany), Yulan Liu (Amazon, UK), Daniel Willett (Amazon, Germany)
Bootstrapping speech recognition on limited data resources has long been an area of active research. The recent transition to all-neural models and end-to-end (E2E) training brought particular challenges, as these models are known to be data-hungry, but it also opened opportunities around language-agnostic representations derived from multilingual data, as well as shared word-piece output representations across languages that share script and roots. We investigate the effectiveness of different strategies to bootstrap an RNN-Transducer (RNN-T) based automatic speech recognition (ASR) system in the low-resource regime, while exploiting the abundant resources available in other languages as well as synthetic audio from a text-to-speech (TTS) engine. Our experiments demonstrate that transfer learning from a multilingual model, a post-ASR text-to-text mapping, and synthetic audio deliver additive improvements, allowing us to bootstrap a model for a new language with a fraction of the data that would otherwise be needed. The best system achieved a 46% relative word error rate (WER) reduction compared to the monolingual baseline, of which 25% relative WER improvement is attributed to the post-ASR text-to-text mapping and the TTS synthetic data.
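To make the transfer-learning idea concrete, the sketch below shows one common way to warm-start a low-resource model from a multilingual checkpoint and then fine-tune it on a mix of real and TTS-synthesized data. This is a minimal illustration only: the `TinyEncoder` module, its dimensions, the checkpoint path, and the training setup are assumptions for the example, not the authors' actual RNN-T architecture or training recipe.

```python
# Illustrative sketch of warm-starting from a multilingual checkpoint (PyTorch).
# TinyEncoder, its dimensions, and the file path are hypothetical placeholders.
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Stand-in for an RNN-T audio encoder (hypothetical sizes)."""

    def __init__(self, feat_dim=80, hidden=256, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True)

    def forward(self, x):
        out, _ = self.lstm(x)
        return out


# 1) Assume a multilingual encoder was trained on high-resource languages
#    and its weights were saved to disk.
multilingual_encoder = TinyEncoder()
torch.save(multilingual_encoder.state_dict(), "multilingual_encoder.pt")

# 2) Bootstrap the new-language model: initialize its encoder from the
#    multilingual weights instead of training from scratch.
target_encoder = TinyEncoder()
target_encoder.load_state_dict(torch.load("multilingual_encoder.pt"))

# 3) Fine-tune on the limited real data for the new language, augmented with
#    TTS synthetic audio (the actual training loop is omitted here).
optimizer = torch.optim.Adam(target_encoder.parameters(), lr=1e-4)
```

A typical usage would feed batches drawn from both the small real corpus and the synthetic corpus through `target_encoder` during fine-tuning; the paper's point is that this warm start, the synthetic data, and the post-ASR text-to-text mapping each add further WER reduction on top of one another.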