Weakly Supervised Construction of ASR Systems from Massive Video Data
Mengli Cheng (Alibaba, China), Chengyu Wang (Alibaba, China), Jun Huang (Alibaba, China), Xiaobo Wang (Alibaba, China)
Despite the rapid development of deep learning models, building large-scale Automatic Speech Recognition (ASR) systems from scratch for real-world applications remains highly challenging, mostly due to the time-consuming and financially expensive process of annotating large amounts of audio data with transcripts. Although several self-supervised pre-training models have been proposed to learn speech representations, applying such models directly might be sub-optimal if more labeled training data could be obtained at low cost. In this paper, we present VideoASR, a weakly supervised framework for constructing ASR systems from massive video data. As user-generated videos often contain human-speech audio roughly aligned with subtitles, we consider videos an important knowledge source and propose an effective approach to extract high-quality audio aligned with transcripts from videos based on text detection and Optical Character Recognition (OCR). After weakly supervised pre-training on the automatically generated datasets, the underlying ASR models can be fine-tuned to fit any domain-specific target training dataset. Extensive experiments show that VideoASR produces state-of-the-art results on six public datasets for Mandarin speech recognition. In addition, the VideoASR framework has been deployed on the cloud to support various industrial-scale applications.
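To illustrate the weak-labeling idea described above, the sketch below shows one plausible way to turn per-frame subtitle text (as recovered by OCR at known frame timestamps) into (start, end, transcript) segments that can be paired with the corresponding audio spans. All function and parameter names here are illustrative assumptions, not the paper's actual implementation; a real pipeline would also need to handle OCR noise, e.g. by fuzzy-matching near-identical subtitle strings across frames.

```python
# Hypothetical sketch: merge consecutive video frames that display the same
# OCR-detected subtitle into a single (start, end, transcript) segment.
# Each such segment can then be cut from the audio track to form a weakly
# labeled training pair. Names and thresholds are illustrative only.

def frames_to_segments(frames, min_duration=0.3):
    """frames: list of (timestamp_seconds, subtitle_text) sorted by time,
    where subtitle_text is '' when no subtitle is detected in that frame.
    Returns a list of (start, end, text) segments; runs shorter than
    min_duration seconds are discarded as likely OCR flicker."""
    segments = []
    cur_text, start, last = None, None, None
    for t, text in frames:
        if text == cur_text:
            # Same subtitle still on screen: extend the current run.
            last = t
            continue
        # Subtitle changed: close the previous run if it was non-empty
        # and lasted long enough to be a plausible utterance.
        if cur_text and last - start >= min_duration:
            segments.append((start, last, cur_text))
        cur_text, start, last = text, t, t
    # Close the final run at end of video.
    if cur_text and last - start >= min_duration:
        segments.append((start, last, cur_text))
    return segments
```

Given frames sampled every 0.5 s, a subtitle visible from 0.5 s to 1.0 s becomes one segment, while a subtitle seen in only a single frame is dropped by the `min_duration` filter.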