Universal Speaker Extraction in the Presence and Absence of Target Speakers for Speech of One and Two Talkers (Oral presentation)

Marvin Borsdorf (Universität Bremen, Germany), Chenglin Xu (NUS, Singapore), Haizhou Li (NUS, Singapore), Tanja Schultz (Universität Bremen, Germany)

Speaker extraction has been studied mostly for scenarios in which the target speaker is present in a mixture of two or more talkers. Such scenarios do not adequately reflect everyday conversations. For example, the target speaker can be the only active talker, be quiet for a while, or leave the conversation; in the last two cases, the target speaker is absent from the mixture. Traditional speaker extraction models fail in these scenarios. We propose a novel speaker extraction approach that handles speech mixtures with one or two talkers in which the target speaker can be either present or absent. First, we formulate four speaker extraction conditions that cover the typical scenarios of everyday conversations with one and two talkers. Second, we introduce a joint training scheme with one unified loss function that works for all four conditions. We show that only a small amount of data is required to adapt the model to work well in the four conditions.

Related Recordings

Auxiliary loss function for target speech extraction and recognition with weak supervision based on speaker characteristics
(Oral presentation)

Katerina Zmolikova, Marc Delcroix, Desh Raj, Shinji Watanabe, Jan Černocký

Using X-vectors for Speech Activity Detection in Broadcast Streams
(Oral presentation)

Lukas Mateju, Frantisek Kynych, Petr Cerva, Jindrich Zdansky, Jiri Malek

InterSpeech 2021
