Phonetically Motivated Self-Supervised Speech Representation Learning
Xianghu Yue (NUS, Singapore), Haizhou Li (NUS, Singapore)
Self-supervised representation learning has been remarkably successful at encoding high-level semantic information from unlabelled speech data. Prior studies have focused on new pretext tasks that improve the learned speech representation and on various schemes for masking speech frames. We argue that an effective latent speech representation should be phonetically informed. In this work, we propose a novel phonetically motivated masking scheme: we select the speech frames to mask according to the phonetic segmentation of an utterance. The resulting self-supervised representation benefits downstream speech processing tasks. We evaluate the proposed learning algorithm on phoneme classification, speech recognition, and speaker recognition, and show that it consistently outperforms competitive baselines.
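
As a rough illustration of masking frames along phonetic segments rather than at arbitrary positions, the sketch below masks whole segments until a frame budget is reached. It is not the authors' implementation: the function name `segment_mask`, the `mask_ratio` parameter, and the assumption that segment boundaries arrive as a sorted list of start-frame indices (for example from a forced aligner or an unsupervised segmenter) are all hypothetical choices made for this example.

```python
import random


def segment_mask(boundaries, num_frames, mask_ratio=0.15, seed=0):
    """Return a list of booleans, True where a frame is masked.

    boundaries: sorted frame indices at which a segment starts, beginning
    with 0 (hypothetical input format, assumed for this sketch).
    """
    rng = random.Random(seed)

    # Turn the start indices into (start, end) spans covering [0, num_frames).
    edges = list(boundaries) + [num_frames]
    segments = [(edges[i], edges[i + 1]) for i in range(len(edges) - 1)]
    rng.shuffle(segments)

    mask = [False] * num_frames
    budget = int(mask_ratio * num_frames)  # target number of masked frames
    masked = 0
    for start, end in segments:
        if masked >= budget:
            break
        # Mask the whole segment, so a mask never cuts a segment in half.
        # The last segment chosen may push the total past the budget.
        for t in range(start, end):
            mask[t] = True
        masked += end - start
    return mask


# Example: four segments over 30 frames, aiming for roughly 30% masked.
mask = segment_mask([0, 5, 12, 20], num_frames=30, mask_ratio=0.3)
```

Because segments are masked whole, the realised mask ratio can exceed `mask_ratio`; a real system would need its own policy for long segments and for how masked frames are replaced, neither of which the abstract specifies.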