Fine-tuning pre-trained voice conversion model for adding new target speakers with limited data (3 minutes introduction)

Takeshi Koshizuka (Tokyo University of Science, Japan), Hidefumi Ohmura (Tokyo University of Science, Japan), Kouichi Katsurada (Tokyo University of Science, Japan)

Voice conversion (VC) is a technique that converts the speaker-dependent non-linguistic information of input speech into that of another speaker while retaining the linguistic information. A typical VC system comprises two modules: an encoder module that removes speaker individuality from the input speech and a decoder module that incorporates another speaker’s individuality into the synthesized speech. This paper proposes a training method for a vocoder-free any-to-many encoder-decoder VC model with limited data. Various pre-training techniques have been proposed to solve problems caused by limited training data; some of these techniques employ the text-to-speech (TTS) task for pre-training. We pre-train the decoder module on the voice conversion task itself so that our pre-training technique can be extended to continuously adding target speakers to the VC system. The experimental results show that good conversion performance can be achieved with VC-based pre-training. We also confirmed that the rehearsal and pseudo-rehearsal methods can effectively fine-tune the model without degrading conversion performance for the pre-trained target speakers.
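
The rehearsal idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `make_rehearsal_batches`, the `old_fraction` parameter, and the tuple-based data items are all hypothetical, chosen only to show how fine-tuning batches might mix new-speaker utterances with utterances of already-trained target speakers. For pseudo-rehearsal, `old_items` would hold utterances generated by the pre-trained model rather than stored recordings; the mixing step is assumed to be the same.

```python
import random


def make_rehearsal_batches(new_items, old_items, batch_size, old_fraction, seed=0):
    """Yield fine-tuning batches that mix new-speaker and rehearsal items.

    new_items: utterances of the speaker being added (hypothetical format).
    old_items: utterances of pre-trained target speakers, either stored
               (rehearsal) or model-generated (pseudo-rehearsal).
    old_fraction: assumed share of each batch reserved for old_items.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if not 0.0 <= old_fraction < 1.0:
        raise ValueError("old_fraction must be in [0, 1)")

    rng = random.Random(seed)
    # Always leave room for at least one new-speaker item per batch.
    n_old = min(int(batch_size * old_fraction), batch_size - 1) if old_items else 0
    n_new = batch_size - n_old

    shuffled = list(new_items)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), n_new):
        batch = shuffled[start:start + n_new]
        # Rehearsal items are drawn with replacement on every step.
        batch += [rng.choice(old_items) for _ in range(n_old)]
        rng.shuffle(batch)
        yield batch


if __name__ == "__main__":
    new = [("new_speaker", i) for i in range(6)]
    old = [("old_speaker", i) for i in range(10)]
    for batch in make_rehearsal_batches(new, old, batch_size=4, old_fraction=0.5):
        print(batch)
```

Each new-speaker item is visited once per pass, while old-speaker items are resampled in every batch so that the decoder keeps seeing the speakers it was pre-trained on.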

Related Recordings

A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion
(3 minutes introduction)

Wen-Chin Huang, Kazuhiro Kobayashi, Yu-Huai Peng, Ching-Feng Liu, Yu Tsao, Hsin-Min Wang, Tomoki Toda

Normalization Driven Zero-shot Multi-Speaker Speech Synthesis
(3 minutes introduction)

Neeraj Kumar, Srishti Goel, Ankur Narang, Brejesh Lall

InterSpeech 2021
