Joint Training End-to-End Speech Recognition Systems with Speaker Attributes
Sheng Li, Xugang Lu, Raj Dabre, Peng Shen, Hisashi Kawai

The end-to-end (E2E) model simplifies conventional automatic speech recognition (ASR) systems by integrating the acoustic model, lexicon, and language model into a single neural network. In this paper, we focus on improving the performance of the state-of-the-art transformer-based E2E ASR system (ASR-Transformer). We propose to jointly train the compressed ASR-Transformer with speaker recognition (SR) tasks. A common practice is to use speaker-ids for jointly training the ASR and SR tasks. However, this yields no significant improvement. To address this problem, we propose to augment the labels with bags-of-attributes of speakers instead of simple speaker-ids. Experiments show that the proposed method effectively improves the performance of the compressed ASR-Transformer on the CSJ corpus. Moreover, the proposed bags-of-attributes method could be used to build highly customized ASR systems.
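
To make the contrast between speaker-id labels and bags-of-attributes labels concrete, the following is a minimal sketch of one possible reading of the label augmentation. The abstract does not specify the label format, so everything here is an assumption for illustration: the token syntax (`<spk:...>`, `<gender:...>`), the attribute names, the speaker metadata, and the helper functions `with_speaker_id` and `with_attribute_bag` are all hypothetical.

```python
from typing import Dict, List


def with_speaker_id(tokens: List[str], speaker_id: str) -> List[str]:
    """Baseline: prepend a single speaker-id token to the transcript tokens."""
    return [f"<spk:{speaker_id}>"] + tokens


def with_attribute_bag(tokens: List[str], attributes: Dict[str, str]) -> List[str]:
    """Prepend one token per speaker attribute (sorted by name for a stable order)."""
    tags = [f"<{name}:{value}>" for name, value in sorted(attributes.items())]
    return tags + tokens


# Hypothetical metadata: two different speakers who share the same attributes.
speakers = {
    "A01": {"gender": "female", "age": "30s"},
    "B07": {"gender": "female", "age": "30s"},
}
transcript = ["ko", "n", "ni", "chi", "wa"]

id_a = with_speaker_id(transcript, "A01")
id_b = with_speaker_id(transcript, "B07")
bag_a = with_attribute_bag(transcript, speakers["A01"])
bag_b = with_attribute_bag(transcript, speakers["B07"])

print(id_a)   # speaker-id label sequence for speaker A01
print(bag_a)  # attribute-augmented label sequence for speaker A01
```

In this sketch, speaker-id tokens are unique to each speaker, whereas attribute tokens are shared by every speaker with the same attributes, so each attribute token would appear in far more training utterances. Whether this sharing is what drives the reported improvement is not stated in the abstract.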