The TAL system for the INTERSPEECH2021 Shared Task on Automatic Speech Recognition for Non-Native Children’s Speech
Gaopeng Xu (TAL, China), Song Yang (TAL, China), Lu Ma (TAL, China), Chengfei Li (TAL, China), Zhongqin Wu (TAL, China)
This paper describes TAL’s system for the INTERSPEECH 2021 shared task on Automatic Speech Recognition (ASR) for non-native children’s speech. In this work, we apply a self-supervised approach to non-native German children’s ASR. First, we conduct baseline experiments showing that self-supervised learning captures more acoustic information from non-native children’s speech. Then, we apply 11-fold data augmentation combined with data clean-up to supplement the limited training data. Moreover, an in-domain semi-supervised voice activity detection (VAD) model is used to segment untranscribed audio. These strategies significantly improve system performance. Furthermore, we use two types of language models (LMs) to improve performance further: a 4-gram LM applied during CTC beam-search decoding and a Transformer LM for second-pass rescoring. Our ASR system reduces the Word Error Rate (WER) by about 48% relative to the baseline, ranking 1st in the evaluation period with a WER of 23.5%.
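To illustrate the first-pass decoding step described above (CTC beam search with a 4-gram LM), the sketch below shows shallow fusion of an n-gram LM into CTC beam-search decoding. The paper does not name a decoding toolkit, so pyctcdecode with a KenLM model is assumed here purely as a stand-in; the character vocabulary, LM path, and fusion weights are placeholders rather than the authors' actual settings.

```python
# Illustrative sketch only: pyctcdecode + KenLM stand in for the paper's
# first-pass CTC beam-search decoding with a 4-gram LM.
import numpy as np
from pyctcdecode import build_ctcdecoder  # pip install pyctcdecode kenlm

# Character vocabulary of the CTC output layer; "" is the CTC blank symbol.
labels = list(" abcdefghijklmnopqrstuvwxyzäöüß") + [""]

# Hypothetical in-domain German 4-gram LM (ARPA or binary KenLM file).
decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="4gram_de.arpa",  # placeholder path, not from the paper
    alpha=0.5,  # LM weight; would be tuned on a dev set
    beta=1.0,   # word-insertion bonus; would be tuned on a dev set
)

# Dummy frame-level CTC log-probabilities of shape (time, vocab) standing in
# for the acoustic model's output.
logits = np.log(np.random.dirichlet(np.ones(len(labels)), size=200)).astype(np.float32)

# First-pass hypothesis via beam search with shallow n-gram fusion; a
# Transformer LM could then rescore the n-best list in a second pass.
hypothesis = decoder.decode(logits, beam_width=100)
print(hypothesis)
```

In a complete system, the n-best list (or lattice) produced by this first pass would be rescored with the Transformer LM, and the best-rescored hypothesis would be taken as the final output.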