Hello everyone. My name is Berrak Sisman, from the Singapore University of Technology and Design. Today I will be talking about generative adversarial networks for singing voice conversion. We conducted this research together with my co-authors in Singapore.
The basic definition of singing voice conversion is to convert one singer's voice to sound like that of another singer, without changing the lyrical content. You can see an illustration of it here: we have a source singer who is singing a song, we apply singing voice conversion, and we change the identity, so that it sounds as if this lady is singing the same song. I would like to highlight that singing carries lexical and emotional information, and only the voice identity is transferred from the source to the target singer.
In this paper, we propose novel solutions to singing voice conversion based on generative adversarial networks, with and without parallel training data.
Let's briefly talk about singing voice conversion. Singing voice conversion is not an easy task; singing itself is not easy, and mimicking someone's singing is even more difficult. Professional singers are trained to control and vary their vocal timbre, but they are bound by the physical limits of their voice production system. Singing voice conversion provides an extension to one's vocal capability, making it possible to control the voice beyond those physical limits and to extend expressiveness in many ways. Singing voice conversion has many applications, some of which are listed here, such as singing synthesis and the dubbing of soundtracks.
There is also a challenge here that I would like to highlight: singing is a fine art, and any distortion of the melody of the singing voice cannot be tolerated.
If you know singing voice conversion, you might be thinking: there is also traditional voice conversion, so what is the difference between singing voice conversion and traditional voice conversion? They share a similar motivation: in conventional speech voice conversion, we also convert the speaker identity. However, singing voice conversion differs from speech voice conversion in several ways, listed here. In traditional speech voice conversion, the prosody, which includes speech dynamics and duration, carries speaker individuality; therefore it needs to be transformed from the source to the target speaker. In singing voice conversion, the melody of the singing is largely determined by the musical score itself, so it is considered speaker-independent. Therefore, in singing voice conversion, only the characteristics of voice identity, such as the spectrum, are converted from the source to the target singer. In this paper, we thus focus only on the spectral conversion aspect of singing voice conversion.
Before starting to talk about our proposed frameworks, I would like to briefly talk about generative adversarial networks and why we use them in this paper. The traditional generative adversarial network performs generative and discriminative training, as you may already know, and generative adversarial networks have recently been shown to be effective in the fields listed below: image generation, image translation, speech enhancement, language identification, text-to-speech synthesis, and speech voice conversion. In this paper, we propose generative adversarial network frameworks for singing voice conversion, both with and without parallel training data.
I would like to list our contributions here. To start with, we propose a singing voice conversion framework that is based on generative adversarial networks and does not rely on any external module, such as automatic speech recognition, which is not easy to train. Next, with CycleGAN, singing voice conversion can be achieved without any parallel training data. And last but not least, we reduce the reliance on a large amount of training data, in both the parallel and the non-parallel training scenarios. We would like to note that this paper reports the first successful attempt to use generative adversarial networks for parallel and non-parallel singing voice conversion.
Let's first look at singing voice conversion with parallel training data. Statistical methods such as Gaussian mixture models have been applied to singing voice conversion with success, and we have listed some of these works here. They are great ideas, but they do not use deep learning, and deep learning has had a positive impact across many fields, with singing voice conversion being no exception. In this paper, we propose to use generative adversarial networks to learn the essential differences between the converted singing and the original target singing through a discriminative process during training. We also investigate deep neural network (DNN) based conversion as part of a comparative study.
Let's start with the training phase; there are three main steps, as provided here. The first step is to perform WORLD vocoder analysis to obtain the spectral and prosodic features. The second step is to use the dynamic time warping (DTW) algorithm for temporal alignment of the source and target singing spectral features; it is shown here in blue. We need this alignment because the aligned feature pairs are what we use in training. The last step is to train the generative adversarial network by using the aligned source and target singing features. I would like to highlight one more time that we have data from the source and target singers, and they are singing the same songs; this is referred to as parallel training data for singing voice conversion.
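To make the alignment step concrete, here is a minimal pure-Python sketch of dynamic time warping over toy one-dimensional feature sequences. This is illustrative only, not the authors' code; a real system aligns multi-dimensional Mel-cepstral frames with a suitable frame distance.

```python
def dtw_path(source, target, dist=lambda a, b: abs(a - b)):
    """Return (total cost, frame-level alignment path) between two sequences."""
    n, m = len(source), len(target)
    INF = float("inf")
    # cost[i][j] = accumulated cost of aligning source[:i] with target[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a source frame
                                 cost[i][j - 1],      # skip a target frame
                                 cost[i - 1][j - 1])  # match both frames
    # backtrack from the end to recover the aligned frame pairs
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min((cost[i - 1][j - 1], (i - 1, j - 1)),
                   (cost[i - 1][j], (i - 1, j)),
                   (cost[i][j - 1], (i, j - 1)))
        i, j = step[1]
    return cost[n][m], path[::-1]

# toy example: the target is a time-stretched copy of the source
total, path = dtw_path([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
```

The aligned index pairs in `path` are what a parallel-data trainer would use to pair source and target spectral frames.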
I would also like to highlight that previous studies have shown that in singing voice conversion it is not always necessary to transform the pitch values from the source to the target singer, since it is possible for singers to sing in a similar key, and modification of the pitch values usually has a negative impact on the quality of the converted singing voice. Therefore, in this paper, we only transform the spectral feature vectors and keep the prosody of the source singing where possible.
Moving to run-time conversion, we again have three main steps, as provided here. The first step is to extract the source singing features using WORLD analysis. The second step is to generate the converted singing spectral features by using the generator, which was already trained during the training phase. And last but not least, we reconstruct the converted singing voice by using WORLD synthesis.
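Schematically, the three run-time steps chain together as below. Every function here is a toy stand-in (in the real system, the analysis and synthesis are WORLD vocoder calls and the converter is the trained GAN generator), so treat this purely as a shape of the pipeline, not an implementation.

```python
def world_analysis(waveform):
    """Toy stand-in for vocoder analysis: split into (spectral, F0) parts."""
    spectra = [abs(x) for x in waveform]   # pretend spectral features
    f0 = [120.0 for _ in waveform]         # pretend F0 contour
    return spectra, f0

def generator(spectra):
    """Toy stand-in for the trained generator mapping source -> target spectra."""
    return [2.0 * s for s in spectra]

def world_synthesis(spectra, f0):
    """Toy stand-in for synthesis; a real vocoder recombines spectra,
    F0 and aperiodicity into a waveform. Here we just return the spectra."""
    return list(spectra)

def convert(waveform):
    spectra, f0 = world_analysis(waveform)   # step 1: analysis
    converted = generator(spectra)           # step 2: spectral conversion
    return world_synthesis(converted, f0)    # step 3: synthesis, source F0 kept

out = convert([1.0, -1.0])
```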
I would like to highlight that in this paper, inspired by the previous studies, we do not transform the fundamental frequency in the intra-gender singing voice conversion experiments; for the cross-gender singing voice conversion experiments, we performed F0 conversion. This applies to all experiments reported in this paper.
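For the cross-gender case, a common choice in the voice-conversion literature (an assumption of this sketch; the talk does not spell out the exact method) is a linear transform in the log-F0 domain, using each singer's training statistics:

```python
import math

def convert_logf0(f0_src, mean_src, std_src, mean_tgt, std_tgt):
    """Map a source F0 value (Hz) into the target singer's log-F0 range.

    mean/std are statistics of log-F0 computed on each singer's training
    data; unvoiced frames (f0 == 0) are passed through unchanged.
    """
    if f0_src <= 0.0:
        return 0.0
    z = (math.log(f0_src) - mean_src) / std_src   # normalize in log domain
    return math.exp(z * std_tgt + mean_tgt)       # denormalize to target

# e.g. a 200 Hz source frame mapped toward a lower-pitched target singer
converted = convert_logf0(200.0, math.log(200.0), 0.2, math.log(100.0), 0.2)
```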
This brings us to our second framework: singing voice conversion without parallel training data. Before we discuss it, I would like to highlight what we can learn without parallel training data. As also cited here, CycleGAN was first proposed for image-to-image translation, and it also provides a solution for speech voice conversion. However, to the best of our knowledge, CycleGAN had not been studied for singing voice conversion. In this paper, CycleGAN is trained to find an optimal mapping between the singing data of two singers for singing voice conversion. Its objective consists of three losses, as follows: the adversarial loss, the cycle-consistency loss, and the identity-mapping loss. In our CycleGAN training, we incorporate the identity-mapping loss because it allows us to preserve the lyrical content of the source singer.
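Putting the three terms together, the standard CycleGAN objective for mappings between singer domains $X$ and $Y$ (written here in the usual CycleGAN-VC notation; the loss weights $\lambda$ are hyperparameters whose exact values are given in the paper, not here) is:

$$\mathcal{L} = \mathcal{L}_{adv}(G_{X\to Y}, D_Y) + \mathcal{L}_{adv}(G_{Y\to X}, D_X) + \lambda_{cyc}\,\mathcal{L}_{cyc} + \lambda_{id}\,\mathcal{L}_{id}$$

with

$$\mathcal{L}_{cyc} = \mathbb{E}_x\big[\lVert G_{Y\to X}(G_{X\to Y}(x)) - x\rVert_1\big] + \mathbb{E}_y\big[\lVert G_{X\to Y}(G_{Y\to X}(y)) - y\rVert_1\big]$$

$$\mathcal{L}_{id} = \mathbb{E}_y\big[\lVert G_{X\to Y}(y) - y\rVert_1\big] + \mathbb{E}_x\big[\lVert G_{Y\to X}(x) - x\rVert_1\big]$$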
In the next slides, I will briefly discuss why we need each loss function. Let's start with the adversarial loss. In voice conversion, our aim is to bring the distribution of the converted singing features as close as possible to the distribution of the target singer's features. As the distribution of the converted data approaches that of the target singer, the network learns the target singer's identity, and we can achieve high speaker similarity in singing voice conversion. Next, we need the cycle-consistency loss. The reason is that the adversarial loss only tells us whether the converted features follow the target singer's distribution; it does not help to preserve the contextual information. With the cycle-consistency loss, we can maintain the contextual information between the source and the target. Finally, while the cycle-consistency loss encourages a one-to-one mapping structure, it does not suffice to guarantee that the mappings always preserve the linguistic content. Therefore, as explicitly presented in the literature, we also incorporate the identity-mapping loss here.
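As a toy illustration of the two reconstruction-style losses just described (scalar "features" and hand-written generators here are assumptions of the sketch; real training uses sequence tensors and learned networks):

```python
def l1(a, b):
    """Mean absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(G, F, xs, ys):
    """||F(G(x)) - x|| + ||G(F(y)) - y||: going there and back must
    recover the input, which is what preserves the lyrical content."""
    return l1([F(G(x)) for x in xs], xs) + l1([G(F(y)) for y in ys], ys)

def identity_mapping_loss(G, F, xs, ys):
    """||G(y) - y|| + ||F(x) - x||: feeding a generator data already in
    its target domain should change nothing."""
    return l1([G(y) for y in ys], ys) + l1([F(x) for x in xs], xs)

# toy generators that are exact inverses, so the cycle loss vanishes
G = lambda x: 2.0 * x   # "source -> target"
F = lambda y: 0.5 * y   # "target -> source"
xs, ys = [1.0, 2.0], [2.0, 4.0]
```

Note that G and F are consistent as a cycle (zero cycle loss) yet still penalized by the identity-mapping loss, which is exactly why both terms are needed.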
Now let's look at the experiments. In this paper, we performed objective and subjective evaluations with the NUS singing database, which consists of recordings of English songs by professional singers. We use it for both the parallel and the non-parallel training data settings, with a limited amount of singing data from each singer.
We extract 24 Mel-cepstral coefficients, the logarithmic fundamental frequency, and the aperiodicities. We then normalize the source and target Mel-cepstra to zero mean and unit variance by using the statistics of the training data.
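The normalization in the last step is plain per-coefficient mean-variance normalization computed on the training set, along these lines (illustrative sketch, not the authors' code):

```python
import math

def mean_var_normalize(frames, stats_frames):
    """Normalize each coefficient to zero mean / unit variance, using
    statistics computed on the (training) frames only."""
    dims = len(stats_frames[0])
    means = [sum(f[d] for f in stats_frames) / len(stats_frames)
             for d in range(dims)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in stats_frames)
                      / len(stats_frames))
            for d in range(dims)]
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in frames]

train = [[1.0, 10.0], [3.0, 30.0]]   # two toy 2-dim "cepstral" frames
norm = mean_var_normalize(train, train)
```

At conversion time, the same training-set statistics are reused to normalize test frames and to denormalize the generator's output.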
Now let's look at the objective evaluation. We report the Mel-cepstral distortion between the target singer's natural singing and the converted singing; as you may know, a lower Mel-cepstral distortion value indicates smaller spectral distortion.
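For reference, the usual frame-level definition of Mel-cepstral distortion (a standard formula from the voice-conversion literature, not specific to this talk; the energy coefficient c0 is typically excluded) can be sketched as:

```python
import math

def mel_cepstral_distortion(mcc_ref, mcc_conv):
    """MCD in dB between two aligned Mel-cepstral frames:
    (10 / ln 10) * sqrt(2 * sum_d (c_ref[d] - c_conv[d])^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(mcc_ref, mcc_conv))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * sq)

# identical frames give zero distortion
zero = mel_cepstral_distortion([1.0, 2.0], [1.0, 2.0])
```

The reported score is this quantity averaged over all frame pairs, after time-aligning the natural and converted utterances.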
In Table 1, we report three frameworks. If you are interested in how we trained these networks, please note that all the models and experimental conditions are provided in the paper, so you can just go and check; for each of them, we provide a paragraph explaining how it was trained.
We report results for intra-gender (male-to-male) and cross-gender (female-to-male) conversion. The DNN and our proposed GAN framework are trained on parallel data from each speaker. If you look at their results, the GAN framework always outperforms the DNN. This shows that, when parallel training data is available, the GAN framework is a much better solution than the DNN for singing voice conversion. For CycleGAN, the problem is more challenging, because we are doing non-parallel training: the lyrical content is different during training, and the data is not aligned. Even so, CycleGAN achieves results comparable to, or sometimes better than, the DNN and GMM baselines, even though those baselines use parallel data. All these results show that, even without relying on parallel data, CycleGAN can achieve comparable or even better results than the baselines.
In the next slide, we report the subjective evaluation. We have more experiments in the paper, but in the interest of time I will only present some of them here in the presentation. First, the mean opinion score: fifteen subjects participated in the listening test, and each subject listened to the converted singing voices. The DNN and our GAN framework are trained on parallel data, while CycleGAN is trained on non-parallel data. If you look at the DNN and our GAN framework, even though they use the same amount of training data, the results show that the GAN framework outperforms the DNN, so it should be preferred for singing voice conversion. If you look at CycleGAN, it is trained with the same amount of training data, but that data is non-parallel, which makes the task more challenging. Even for this more challenging task, CycleGAN achieves very similar performance to that of the DNN, although the DNN uses parallel training data. We therefore believe the performance of CycleGAN is remarkable, considering that it uses non-parallel training data.
In another experiment, we compare CycleGAN and our parallel GAN framework for speaker similarity. In this experiment, reported here, we conducted a preference test of speaker similarity between samples from CycleGAN, trained on non-parallel data, and samples from the GAN framework trained on parallel data. The experiment shows that CycleGAN, with non-parallel singing data, achieves comparable results: its samples were preferred 48.1 percent of the time, which we believe is remarkable, because training without parallel data is a much more challenging task than training with a parallel dataset. So we believe that CycleGAN achieves truly remarkable performance for singing voice conversion when there is no parallel training data.
To summarize, in this paper we propose novel solutions based on generative adversarial networks for singing voice conversion, with and without parallel training data. For the framework with parallel training data, we use dynamic time warping to align the source and target singing features. For the non-parallel case, we use CycleGAN, and we show that it works really well. Furthermore, we also show that the proposed frameworks perform better with less training data than the DNN, which we find remarkable. We conclude that, with or without parallel training data available, generative adversarial networks achieve high-quality singing voice conversion. Thank you for listening.