Multiple Softmax Architecture for Streaming Multilingual End-to-End ASR Systems (3 minutes introduction)

Vikas Joshi (Microsoft, India), Amit Das (Microsoft, USA), Eric Sun (Microsoft, USA), Rupesh R. Mehta (Microsoft, India), Jinyu Li (Microsoft, USA), Yifan Gong (Microsoft, USA)

Multilingual end-to-end (E2E) automatic speech recognition (ASR) systems have manifold advantages: they simplify the training strategy, are easier to scale, and exhibit better performance than monolingual models. However, it is still challenging to use a single multilingual model to recognize multiple languages without knowing the input language, as most multilingual models assume that the input language is known. In this paper, we introduce a multi-softmax model to improve multilingual recurrent neural network transducer (RNN-T) models by having language-specific softmax, joint, and embedding layers while sharing the rest of the parameters. We extend the multi-softmax model to work without knowing the input language by integrating a language identification (LID) model that estimates the language on the fly while performing recognition at the same time. The multi-softmax model outperforms monolingual models with an average relative word error rate (WERR) reduction of 4.65% on Indian languages. Fine-tuning further improves the WERR reduction to 12.2%. The multi-softmax model with on-the-fly LID estimation shows a WERR reduction of 13.86% compared to the multilingual baseline.
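
The routing idea in the abstract, where a shared representation feeds one output layer per language and an LID estimate picks which layer to use, can be illustrated with a toy sketch. This is not the authors' implementation: the language codes `hi` and `ta`, the vocabulary sizes, the weight values, and the names `MultiSoftmaxHead` and `pick_language` are all hypothetical, and only the softmax layer is modelled (the language-specific joint and embedding layers and the shared RNN-T encoder are left out).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, features):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


class MultiSoftmaxHead:
    """Toy stand-in for language-specific output layers (hypothetical)."""

    def __init__(self, vocab_sizes, feature_dim):
        # One weight matrix per language; each language has its own vocabulary size.
        # The weights are fixed toy values, not trained parameters.
        self.weights = {
            lang: [
                [0.1 * (i + 1) * (j + 1) for j in range(feature_dim)]
                for i in range(size)
            ]
            for lang, size in vocab_sizes.items()
        }

    def forward(self, shared_features, lang):
        # The same shared features are projected through the head of the chosen language.
        return softmax(linear(self.weights[lang], shared_features))


def pick_language(lid_posteriors):
    """Pick the most probable language from an LID posterior (assumed given)."""
    return max(lid_posteriors, key=lid_posteriors.get)


# Assumed toy setup: two languages with different vocabulary sizes.
head = MultiSoftmaxHead({"hi": 3, "ta": 4}, feature_dim=2)
shared = [0.5, -0.2]  # stands in for the output of the shared layers
lang = pick_language({"hi": 0.2, "ta": 0.8})
probs = head.forward(shared, lang)
```

In this sketch the LID posterior is passed in as a plain dictionary; in a streaming system it would be re-estimated as audio arrives, so the selected head could change during an utterance.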

Related Recordings

Super-Human Performance in Online Low-latency Recognition of Conversational Speech
(3 minutes introduction)

Thai-Son Nguyen, Sebastian Stüker, Alex Waibel

An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling
(3 minutes introduction)

Tara N. Sainath, Yanzhang He, Arun Narayanan, Rami Botros, Ruoming Pang, David Rybach, Cyril Allauzen, Ehsan Variani, James Qin, Quoc-Nam Le-The, Shuo-Yiin Chang, Bo Li, Anmol Gulati, Jiahui Yu, Chung-Cheng Chiu, Diamantino Caseiro, Wei Li, Qiao Liang, Pat Rondon

InterSpeech 2021
