Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training (3 minutes introduction)

Kun Zhou (NUS, Singapore), Berrak Sisman (SUTD, Singapore), Haizhou Li (NUS, Singapore)

Emotional voice conversion (EVC) aims to change the emotional state of an utterance while preserving its linguistic content and speaker identity. In this paper, we propose a novel two-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data. The proposed EVC framework leverages text-to-speech (TTS), as the two tasks share a common goal: generating high-quality expressive voice. In stage 1, we perform style initialization with a multi-speaker TTS corpus to disentangle speaking style and linguistic content. In stage 2, we perform emotion training with a limited amount of emotional speech data to learn how to disentangle emotional style and linguistic information from speech. The proposed framework can perform both spectrum and prosody conversion, and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
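
The two-stage schedule described in the abstract (initialize on a large corpus, then continue training on a small one) can be illustrated with a toy sketch. This is not the authors' sequence-to-sequence model: a one-variable linear regressor stands in for the network, and the two synthetic datasets, the function names, the learning rate, and the epoch counts are all assumptions made for illustration. The only point is the control flow, where stage 2 starts from the stage 1 parameters instead of from scratch.

```python
def sgd_fit(data, w, b, lr, epochs):
    """Plain SGD on squared error for y ~ w*x + b; returns the updated (w, b)."""
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b


def mse(data, w, b):
    """Mean squared error of the line (w, b) on a dataset."""
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)


# Hypothetical stand-in for the large multi-speaker TTS corpus (toy task: y = 2x + 1).
large_x = [i / 250.0 - 1.0 for i in range(500)]
large_corpus = [(x, 2.0 * x + 1.0) for x in large_x]

# Hypothetical stand-in for the limited emotional data (related toy task: y = 2x + 1.5).
small_x = [-0.9, -0.6, -0.3, -0.1, 0.2, 0.4, 0.7, 1.0]
small_corpus = [(x, 2.0 * x + 1.5) for x in small_x]

# Stage 1 (analogue of style initialization): train on the large corpus.
w1, b1 = sgd_fit(large_corpus, 0.0, 0.0, lr=0.05, epochs=5)

# Stage 2 (analogue of emotion training): continue from the stage 1 parameters
# using only the small dataset.
w2, b2 = sgd_fit(small_corpus, w1, b1, lr=0.05, epochs=5)

# Baseline for comparison: the same small-data budget, trained from scratch.
w0, b0 = sgd_fit(small_corpus, 0.0, 0.0, lr=0.05, epochs=5)

two_stage_error = mse(small_corpus, w2, b2)
scratch_error = mse(small_corpus, w0, b0)
print("two-stage beats scratch:", two_stage_error < scratch_error)
```

In this toy setting the stage 1 parameters already capture what the two tasks share (the slope), so the few stage 2 updates only need to adjust what differs (the offset). The sketch says nothing about the size of the improvement reported in the paper.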

InterSpeech 2021

Related Recordings

Adversarial Voice Conversion against Neural Spoofing Detectors
(3 minutes introduction)

Adversarially Learning Disentangled Speech Representations for Robust Multi-factor Voice Conversion
(3 minutes introduction)
