Hello, everyone.

My name is Berrak Sisman, from the Singapore University of Technology and Design.

Today I will be talking about generative adversarial networks for singing voice conversion, with and without parallel training data. We conducted this research together with my co-authors in Singapore.

The basic definition of singing voice conversion is to convert one singer's voice to sound like that of another singer, without changing the lyrical content.

You can also see an illustration of it here: we have a source singer who is singing a song, we apply singing voice conversion and change the singer identity, so that it sounds as if this lady, the target singer, is singing the same song.

I would like to highlight that singing carries lexical and emotional information, and all of this is transferred from the source to the target speaker.

In this paper, we propose novel solutions to singing voice conversion based on generative adversarial networks, with and without parallel training data.

Let's briefly talk about singing voice conversion.

Singing voice conversion is not a very easy task: singing in itself is not easy, and mimicking someone's singing is even more difficult. Professional singers are trained to control and vary their vocal timbre, but they are bound by the physical limits of their vocal production system. Singing voice conversion provides an extension to such vocal effects: it makes it possible to control the voice beyond those physical limits and to extend expressiveness in a remarkable way.

Singing voice conversion has lots of applications, and some of them are listed here, such as singing synthesis, dubbing of movie soundtracks, and personalized singing.

There is also a challenge here that I would like to highlight: singing is a fine art, and any distortion of the rendered singing voice cannot be tolerated.

Once we mention singing voice conversion, you might ask: there is also speech voice conversion, so what is the difference between the two? Well, they share a similar motivation: in conventional speech voice conversion, which we also call voice identity conversion, we convert one speaker's voice to another. However, singing voice conversion differs from speech voice conversion in many ways, which are listed here.

To start with, in traditional speech voice conversion, speech prosody, which includes speech dynamics, duration, and pitch, characterizes speaker individuality; therefore we need to transform it from the source to the target speaker. In singing voice conversion, the melody of the singing is governed by the sheet music itself, so it is considered speaker-independent. Therefore, in singing voice conversion, only the characteristics of voice identity, such as the spectrum, are considered and transformed to those of the target.

In this paper, we only focus on the spectrum conversion aspect of singing voice conversion.

Before starting to talk about our proposed framework, I would like to briefly explain why we use generative adversarial networks in this paper. The traditional generative adversarial network performs joint generative and discriminative training, as you may already know, and generative adversarial networks have recently been shown to be effective in many fields, listed below: image generation, image translation, speech enhancement, language identification, text-to-speech synthesis, and speech voice conversion.

In this paper, we propose generative adversarial networks for singing voice conversion, both with and without parallel training data.

I would like to list our contributions here. To start with, we propose a singing voice conversion framework that is based on generative adversarial networks, and it does not require any singing models or external modules such as speech recognition, which is not easy to train. Second, with CycleGAN, we achieve parallel-data-free singing voice conversion that outperforms the baseline. And last but not least, we reduce the reliance on a large amount of data, both for the parallel and the non-parallel training scenarios. We would also like to note that this paper reports the first successful attempt to use generative adversarial networks for singing voice conversion.

Let's start with GAN-based singing voice conversion with parallel training data. Statistical methods such as Gaussian mixture models have been proposed and achieved success in singing voice conversion; we have listed multiple such works here.

Some of these works are based on great ideas, but most of the time they do not use deep learning, and we believe deep learning has had a positive impact in many fields, with singing voice conversion being no exception. In this paper, we propose to use GANs to learn the essential differences between the source singing and the original target singing through a discriminative training process. We also study deep neural network processing as part of the possible solutions to singing voice conversion, in a comparative study.

Let's start with the training phase of the GAN; there are three main steps, provided here. The first is to perform WORLD vocoder analysis to obtain the spectral and prosodic features, as shown here. The second step is to use the dynamic time warping algorithm for the temporal alignment of the source and target singing spectral features; it is also shown here, with the blue color. Note that this alignment is only needed during training. The last step is to train the generative adversarial network by using the aligned source and target singing features. I would like to highlight one more time that we have data from the source and target singers, and they are singing the same songs; this is what we call parallel training data for singing voice conversion.
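The temporal alignment step just described can be illustrated with a minimal sketch. This is plain textbook DTW over mel-cepstral frame sequences, with toy random arrays standing in for the WORLD features; it is not the exact configuration used in the paper:

```python
import numpy as np

def dtw_align(src, tgt):
    """Align two feature sequences (frames x dims) with classic DTW.
    Returns the index pairs (i, j) of the optimal warping path."""
    n, m = len(src), len(tgt)
    # pairwise Euclidean distances between all source/target frames
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # backtrack from the end to recover the warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# toy "mel-cepstral" sequences of different lengths (10 vs 11 frames)
src = np.random.RandomState(0).randn(10, 24)
tgt = np.vstack([src[:5], src[4:]])  # a stretched copy of src
path = dtw_align(src, tgt)
```

The path pairs each source frame with a target frame, which is what lets us train on frame-aligned parallel features afterwards.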

I would also like to highlight that, as previous studies have shown, in singing voice conversion it is not always necessary to transform the F0 values from the source to the target singer, assuming both singers sing in a similar key, and the conversion of F0 usually causes a degradation of the singing voice quality. Therefore, in this paper, we only transform the spectral feature vectors, which characterize the singing voice identity.

Moving to the run-time conversion, we again have three main steps, provided here. The first step is to extract the source singing features using WORLD analysis. The second step is to generate the converted singing spectral features by using the generator, which was already trained during the training phase. And last but not least, we generate the converted singing waveform by using WORLD synthesis.

I would like to highlight that in this paper, following the previous studies, we do not transform F0 in the intra-gender singing voice conversion experiments; for the cross-gender singing voice conversion experiments, we perform F0 conversion. This applies to all the experiments that we report in this paper.
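When converting between singers of different genders, the F0 contour typically has to be shifted into the target's range. A common recipe, which I am assuming here since the talk does not spell out the exact formula, is a linear transform in the log-F0 domain using each singer's mean and standard deviation:

```python
import numpy as np

def convert_f0(f0_src, mu_src, std_src, mu_tgt, std_tgt):
    """Linear log-F0 conversion; unvoiced frames (f0 == 0) stay untouched.
    mu/std are the mean and std of log-F0 for each singer."""
    f0_conv = np.zeros_like(f0_src)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    f0_conv[voiced] = np.exp((log_f0 - mu_src) / std_src * std_tgt + mu_tgt)
    return f0_conv

# toy example: shift a male-range contour toward a female range
f0 = np.array([0.0, 110.0, 120.0, 0.0, 130.0])  # zeros mark unvoiced frames
out = convert_f0(f0, mu_src=np.log(120.0), std_src=0.1,
                 mu_tgt=np.log(220.0), std_tgt=0.1)
```

With equal source and target variances, this reduces to a simple multiplicative pitch shift, which matches the intuition that only the key changes.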

Now let's move to GAN-based singing voice conversion without parallel training data. Before we discuss CycleGAN-based singing voice conversion, I would like to highlight what we can learn from non-parallel training data.

As also cited here, CycleGAN, which was proposed for image-to-image translation, also provides a solution for speech voice conversion. However, to the best of our knowledge, CycleGAN had not been studied for singing voice conversion before.

In this paper, CycleGAN seeks to find an optimal pseudo pair from the non-parallel singing data of the speakers for singing voice conversion.

The CycleGAN loss is as follows: it consists of an adversarial loss, a cycle-consistency loss, and an identity-mapping loss. Together, these allow us to preserve the lyrical content of the source speaker, sorry, the source singer.

In the next slides, we will discuss very briefly why we need these loss functions.

Let's start with why we need the adversarial loss. In voice conversion, we aim to optimize the distribution of the converted singing features to be as close as possible to the distribution of the target singer. When the distribution of the converted data comes close to that of the target singing, the network has learned the target speaker, and we can achieve high speaker similarity in singing voice conversion.

Next, why do we need the cycle-consistency loss? The reason is that the adversarial loss only tells us whether the converted data follows the distribution of the target singer; it does not help to preserve the contextual information. With the cycle-consistency loss, we can maintain the contextual information between the source and target singing.

As for the identity-mapping loss: the adversarial and cycle-consistency losses reduce the space of possible mappings; however, they do not suffice to guarantee that the mapping always preserves the lyrical content of the source singer. To explicitly preserve the lyrical content, we incorporate the identity-mapping loss here.
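The three losses just discussed can be written compactly. This is the standard CycleGAN objective in my own notation, not verbatim from the paper, with $X$ and $Y$ the source and target singing feature domains and $G$, $D$ the generators and discriminators:

```latex
% Adversarial loss: push converted features toward the target distribution
\mathcal{L}_{\mathrm{adv}}(G_{X \to Y}, D_Y) =
  \mathbb{E}_{y}\left[\log D_Y(y)\right] +
  \mathbb{E}_{x}\left[\log\left(1 - D_Y(G_{X \to Y}(x))\right)\right]

% Cycle-consistency loss: converting forth and back must reconstruct the input
\mathcal{L}_{\mathrm{cyc}} =
  \mathbb{E}_{x}\left[\lVert G_{Y \to X}(G_{X \to Y}(x)) - x \rVert_1\right] +
  \mathbb{E}_{y}\left[\lVert G_{X \to Y}(G_{Y \to X}(y)) - y \rVert_1\right]

% Identity-mapping loss: a generator fed a target-domain input should leave it unchanged
\mathcal{L}_{\mathrm{id}} =
  \mathbb{E}_{y}\left[\lVert G_{X \to Y}(y) - y \rVert_1\right] +
  \mathbb{E}_{x}\left[\lVert G_{Y \to X}(x) - x \rVert_1\right]

% Full objective, with weights \lambda_{\mathrm{cyc}} and \lambda_{\mathrm{id}}
\mathcal{L} = \mathcal{L}_{\mathrm{adv}}(G_{X \to Y}, D_Y)
            + \mathcal{L}_{\mathrm{adv}}(G_{Y \to X}, D_X)
            + \lambda_{\mathrm{cyc}}\,\mathcal{L}_{\mathrm{cyc}}
            + \lambda_{\mathrm{id}}\,\mathcal{L}_{\mathrm{id}}
```

The adversarial terms give speaker similarity, while the cycle and identity terms are what keep the lyrical content intact.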

Let's look at the experiments. In this paper, we perform objective and subjective evaluations with the NUS singing database, which consists of studio-quality audio recordings of English songs by professional singers.

For both the parallel and non-parallel training data settings in our experiments, we use only a few songs of singing data from each singer. We extract 24 Mel-cepstral coefficients, the logarithmic fundamental frequency, and the aperiodicity. We normalize the source and target Mel-cepstra to zero mean and unit variance by using the statistics of the training data.
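The normalization step is straightforward; here is a minimal sketch, with a toy random array standing in for the training mel-cepstra:

```python
import numpy as np

# toy stand-in for training mel-cepstra: frames x 24 coefficients
train_mcep = np.random.RandomState(1).randn(200, 24) * 3.0 + 1.5

# statistics are computed on the training data only, then reused at run time
mu = train_mcep.mean(axis=0)
sigma = train_mcep.std(axis=0)

def normalize(mcep, mu, sigma):
    """Per-coefficient zero-mean, unit-variance normalization."""
    return (mcep - mu) / sigma

def denormalize(mcep_norm, mu, sigma):
    """Invert the normalization before waveform synthesis."""
    return mcep_norm * sigma + mu

norm = normalize(train_mcep, mu, sigma)
```

At conversion time the generator's output is denormalized with the target singer's statistics before synthesis.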

Now let's look at the objective evaluation. We report the Mel-cepstral distortion between the target singer's natural singing and the converted singing; as you may know, a lower Mel-cepstral distortion value indicates smaller spectral distortion.
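Mel-cepstral distortion has a standard definition; a common variant is shown below. Whether the energy coefficient c0 is excluded is my assumption, since the talk does not say which variant the paper uses:

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_conv):
    """Frame-averaged MCD in dB between two time-aligned mel-cepstral
    sequences of shape (frames, dims). c0 (energy) is excluded here."""
    diff = mcep_ref[:, 1:] - mcep_conv[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

# sanity check on toy data: identical sequences give zero distortion
rng = np.random.RandomState(0)
ref = rng.randn(50, 25)
mcd_same = mel_cepstral_distortion(ref, ref)  # 0.0
```

Because the measure is frame-wise, the two sequences must first be aligned, which is why the DTW alignment matters for the objective evaluation as well.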

In Table 1, we evaluate four frameworks. If you are interested in how we trained these networks, please note that all the models and experimental conditions are provided in the paper, so you can just go and check; for each of them, we provide a paragraph explaining how we trained them.

We report male-to-male and female-to-male conversion. The DNN and the GAN use the same parallel training data from each speaker. If you look at the DNN and the GAN results, the GAN always outperforms the DNN. This shows that, when we have parallel training data, the GAN is a much better solution than the DNN for singing voice conversion.

For CycleGAN, the problem is more challenging, because we are performing non-parallel training, which means the lyrical content is different during training and the data is not parallel. Even so, CycleGAN achieves results comparable to, and sometimes even better than, the GMM and DNN baselines, while the baselines use parallel data. All these results show that even if we do not have parallel data, CycleGAN can achieve comparable or even better results than those of the DNN.

In the next slide, we report the subjective evaluation. We have more experiments in the paper, but in the interest of time, I only report some of them here in the presentation. We report the mean opinion score: fifteen subjects participated in the listening test, and each subject listened to the converted singing voices.

The DNN and the GAN were trained with parallel data, while CycleGAN was trained with non-parallel training data. If you look at the DNN and the GAN, even though they use the same amount of training data, the results show that the GAN outperforms the DNN, and it should be preferred for singing voice conversion. If you look at CycleGAN, it was trained with the same amount of training data, but the data is not parallel, which means the task is more challenging. Even for this more challenging task, CycleGAN achieves a very similar performance to that of the DNN, while the DNN uses parallel training data. So we believe the performance of CycleGAN is remarkable, considering that it uses non-parallel training data.

In another experiment, we compare CycleGAN and the GAN for speaker similarity. In this experiment, reported here, we conduct a preference test of speaker similarity, where CycleGAN is trained with non-parallel data and the GAN is trained with parallel data.

This experiment shows that CycleGAN with non-parallel singing data achieves comparable results to the GAN with parallel singing data: the listeners chose its samples as the better ones 48.1% of the time. We believe this is remarkable, because training without parallel data is a much more challenging task than training with a parallel dataset. So we believe that CycleGAN achieves a remarkable performance for singing voice conversion when there is no parallel training data.

To summarize, in this paper we propose novel solutions based on generative adversarial networks for singing voice conversion, with and without parallel training data. The GAN framework, which is trained with parallel training data, uses adversarial training to learn the mapping between the source and target singers. With CycleGAN, which does not need parallel training data, we show that the approach works really well. Furthermore, we also show that the proposed framework performs better with less training data than the DNN, which we find really remarkable. We conclude that, with or without parallel training data available, generative adversarial networks achieve high performance for singing voice conversion. Thank you for listening.