Hello. This is our presentation for the Odyssey 2020 workshop. I'm Dongsuk Yook from Korea University, and I'll be presenting our recent research on many-to-many voice conversion using CycleVAE with multiple decoders. This is joint work with Seong-Gyun Leem, Hyung-Jun Lee, and In-Chul Yoo. We are all from Korea University.
I'll first define the problem that we want to solve, which is voice conversion among multiple speakers, and describe the key idea of the proposed method, called CycleVAE with multiple decoders, to solve the problem. Dr. In-Chul Yoo is then going to explain the details of the CycleVAE with multiple decoders and show some experimental results of the proposed method, followed by some concluding remarks.
Voice conversion is the task of converting the speaker-related voice characteristics in an utterance while maintaining the linguistic information. For example, a female speaker may sound like a male speaker using a voice conversion technique. Voice conversion can be applied to data augmentation, for example for the training of automatic speech recognition systems; to voice generation for text-to-speech systems; to speaking assistance for foreign languages through accent conversion; to speech enhancement by improving the comprehensibility of the converted voice; and to personal information protection through speaker de-identification.
If we have parallel training data, which contain pairs of utterances of the same transcription spoken by different speakers, we can simply train a neural network by providing the source speaker's utterances as the input of the network and the target speaker's utterances as the target of the network, given the proper time alignment of the parallel utterances. However, building a parallel corpus is a highly expensive task, sometimes even impossible, which motivates the strong need for voice conversion methods that do not require parallel training data.
Therefore, recent voice conversion approaches attempt to use non-parallel training data. One such approach uses a variational autoencoder, or VAE for short, which was originally developed as a generative model for image generation. A VAE is composed of an encoder and a decoder. The encoder produces a set of parameters for the posterior distribution of a latent variable z given the input data x, whereas the decoder generates a set of parameters for the distribution of the output data x given the latent variable z. After being trained using the variational lower bound as its objective function, the VAE can be used to generate samples of x by feeding random latent variables to the decoder.
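As a minimal sketch of this encode-sample-decode flow (toy linear encoder and decoder with assumed dimensions, not the actual model from this work), the VAE computation looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, chosen only for illustration
x_dim, z_dim = 8, 2

# Linear "encoder": maps x to the mean and log-variance of q(z|x)
W_mu = rng.normal(size=(z_dim, x_dim))
W_logvar = rng.normal(size=(z_dim, x_dim))
# Linear "decoder": maps z back to the parameters of p(x|z) (here, just a mean)
W_dec = rng.normal(size=(x_dim, z_dim))

def encode(x):
    return W_mu @ x, W_logvar @ x              # posterior parameters for z

def reparameterize(mu, logvar):
    eps = rng.normal(size=mu.shape)            # z = mu + sigma * eps keeps sampling differentiable
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    return W_dec @ z                           # mean of the output distribution

x = rng.normal(size=x_dim)
mu, logvar = encode(x)
z = reparameterize(mu, logvar)
x_recon = decode(z)

# The two terms of the (negative) variational lower bound:
# reconstruction error plus the KL divergence of q(z|x) from N(0, I)
recon_err = np.sum((x - x_recon) ** 2)
kl = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))
```

Feeding a random z directly to `decode` is the generation step described above.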
A VAE can be applied to voice conversion by providing a speaker identity to the decoder together with the latent variable. The VAE is trained to reconstruct the input speech from the latent variable z and the source speaker identity X. Here, the lowercase letter x represents an utterance, and the uppercase letter X represents a speaker identity. To convert speech from a source speaker to a target speaker, the source speaker identity X is replaced with the target speaker identity Y.
However, due to the absence of an explicit training process for the conversion path between the source speaker and the target speaker, VAE-based voice conversion methods generally produce poor-quality voice. That is, the conventional method trains the model with the self-reconstruction objective function only, without considering the conversion path from a source speaker to a target speaker.
In order to solve this problem, we propose the cycle-consistent variational autoencoder, or CycleVAE for short, with multiple decoders. It uses the cycle consistency loss and multiple decoders for explicit conversion path training, as follows.
When the speech x is fed into the network, it passes through the encoder and is compressed into the latent variable z. The reconstruction error is computed using the reconstructed speech x′ produced by the speaker X decoder. Up to this point, the loss function is similar to that of the vanilla VAE, except that it does not require the speaker identity, because the decoder is used exclusively for speaker X. The same input speech x goes through the encoder and the speaker Y decoder as well,
to generate the converted speech x′ from speaker X to speaker Y, which has the same linguistic contents as the original input speech x, but in speaker Y's voice. Then the converted speech x′ goes through the encoder and the speaker X decoder to generate the converted-back speech x″, which should recover the first input speech x. The cyclic conversion enables the explicit training of the voice conversion path from speaker Y to speaker X without parallel training data.
The cycle consistency loss of the two-decoder VAE for two speakers, given the input speech x, is defined as follows. Again, the uppercase letters X and Y represent speaker identities. Now, given the input speech x, the loss function of the CycleVAE for two speakers is the weighted sum of the above two losses, as follows, where λ is the weight of the cycle consistency loss.
Similarly, the input speech y is used to train the conversion path from speaker X to speaker Y explicitly, as well as the self-reconstruction path for speaker Y. The method can be easily extended to more than two speakers by summing over all pairs of the training speakers. The loss function of the CycleVAE for more than two speakers can be computed as follows, where the second summation is usually over a mini-batch.
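The overall structure of this objective, summing a self-reconstruction term and a cycle term over all ordered speaker pairs, can be sketched as follows. This is only a schematic: simple squared errors stand in for the actual variational objectives, and the shift-based toy conversion is purely illustrative.

```python
import itertools
import numpy as np

def recon_loss(x, x_recon):
    # stand-in for the VAE self-reconstruction objective
    return float(np.mean((x - x_recon) ** 2))

def cycle_loss(x, x_cycled):
    # stand-in for the cycle consistency term: x -> target voice -> back to source
    return float(np.mean((x - x_cycled) ** 2))

def total_loss(utterances, convert, lam=1.0):
    """utterances: dict speaker -> feature array (stand-in mini-batch).
    convert(x, src, tgt): conversion using the tgt speaker's decoder.
    lam: weight of the cycle consistency loss (lambda in the talk)."""
    loss = 0.0
    for src, tgt in itertools.permutations(utterances, 2):  # all ordered pairs
        x = utterances[src]
        x_recon = convert(x, src, src)                      # self-reconstruction path
        x_cycled = convert(convert(x, src, tgt), tgt, src)  # cyclic conversion path
        loss += recon_loss(x, x_recon) + lam * cycle_loss(x, x_cycled)
    return loss

# Toy "conversion": identity plus a small speaker-dependent shift (an assumption,
# just so the function has something to cycle through)
shift = {"X": 0.0, "Y": 0.1}
convert = lambda x, src, tgt: x + (shift[tgt] - shift[src])

utts = {"X": np.zeros(4), "Y": np.ones(4)}
loss = total_loss(utts, convert, lam=1.0)
```

A perfect model drives both terms to zero for every pair, which is exactly what the explicit conversion-path training encourages.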
The sound quality can be improved, since each decoder learns its own speaker's voice characteristics through the additional conversion path training, while the conventional VAE must handle multiple speakers with only a single decoder trained by self-reconstruction.
At this point, I'd like to hand the microphone over to Dr. Yoo, who is going to explain the details of the proposed CycleVAE with multiple decoders. Thank you.
I'm In-Chul Yoo from Korea University. I'll be explaining the details of the proposed CycleVAE with multiple decoders, the experimental results, and conclusions.
Generative adversarial networks, or GANs for short, can be applied to the CycleVAE to improve the quality of the resulting speech. The reconstructed speech x′ is retrieved from the CycleVAE by feeding the speech x from speaker X and the speaker identity of X. The discriminator is trained to distinguish the reconstructed speech from the original speech.
For the cyclic conversion, the speech x″ is also retrieved from the CycleVAE. The input speech x is first converted to speaker Y's voice by feeding the speaker identity of Y with the latent variable z. The converted speech is then converted back to speaker X's voice by feeding the speaker identity of X with the latent variable. The discriminator is also trained to distinguish the resulting x″ speech from the original speech.
The CycleVAE with GANs can be further extended to use multiple decoders. In a similar fashion to the CycleVAE with multiple decoders, each speaker uses dedicated decoder and discriminator networks. Since there are multiple GANs, the previous GAN loss is modified accordingly, as marked on the slide. In this work, we used Wasserstein GANs, or WGANs for short, instead of vanilla GANs.
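The Wasserstein objective that each critic (discriminator) optimizes can be sketched as follows; this is a generic WGAN loss on toy scalar critic scores, not the actual networks or values used in this work.

```python
import numpy as np

def critic_loss(d_real, d_fake):
    # The WGAN critic maximizes E[D(real)] - E[D(fake)];
    # written as a loss, we minimize the negative of that quantity.
    return -(np.mean(d_real) - np.mean(d_fake))

def generator_loss(d_fake):
    # The generator (here, a speaker's decoder) tries to raise
    # the critic's score on its converted/reconstructed speech.
    return -np.mean(d_fake)

# Toy critic scores, assumed values purely for illustration
d_real = np.array([0.9, 1.1, 1.0])
d_fake = np.array([0.1, -0.2, 0.0])

c_loss = critic_loss(d_real, d_fake)
g_loss = generator_loss(d_fake)
```

With multiple decoders, one such critic loss is maintained per speaker, and the per-speaker terms are summed into the total objective.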
We designed the architecture of our models based on the paper by Kameoka et al., 2019. All encoders, decoders, and discriminators use fully convolutional architectures with gated linear units, or GLUs for short. The source identity vector is broadcast to each GLU layer; that is, the source identity vector c is appended to the output of the previous GLU layer.
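A gated linear unit with the identity vector appended can be sketched as below. This uses a toy dense layer on a single frame vector; the actual model uses gated convolutions, and every shape here is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def glu_layer(h, c, W_lin, W_gate):
    """Gated linear unit conditioned on a speaker identity vector c:
    the layer input is the previous output with c appended, and the
    linear path is gated elementwise by a sigmoid path."""
    hc = np.concatenate([h, c])
    return (W_lin @ hc) * sigmoid(W_gate @ hc)

h_dim, c_dim, out_dim = 6, 4, 5            # toy sizes
W_lin = rng.normal(size=(out_dim, h_dim + c_dim))
W_gate = rng.normal(size=(out_dim, h_dim + c_dim))

h = rng.normal(size=h_dim)                 # output of the previous layer
c = np.eye(c_dim)[0]                       # one-hot speaker identity (assumption)
out = glu_layer(h, c, W_lin, W_gate)
```

Because the gate lies in (0, 1), each output is a damped copy of the linear path, which is what makes GLUs act as learnable, data-dependent feature selectors.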
Since we assume Gaussian distributions with diagonal covariance matrices for the encoder and the decoder, the outputs of the encoder and the decoder are pairs of means and variances.
The decoder architecture is similar to that of the encoder, with the target speaker identity vectors; note that they are not used for the multi-decoder CycleVAE and the multi-decoder CycleVAE-WGAN. This is the architecture of the discriminator network. As in the decoder, the target speaker identity vectors are not used for the multi-decoder CycleVAE and the multi-decoder CycleVAE-WGAN.
Now I will show some experimental results of the proposed method and concluding remarks. Here are the details of the experimental setup. We used the Voice Conversion Challenge 2018 dataset, which consists of six female and six male speakers. We used a subset of two female speakers and two male speakers. Each speaker has 116 utterances; we used 72 utterances for training, nine utterances for validation, and 35 utterances for testing.
We used three sets of features: 36 mel-cepstral coefficients, or MCCs for short, the fundamental frequency, and aperiodicities. We used the hyperparameters shown on the slide.
We analyzed the time and space complexity of the algorithms. The time complexity is measured by the average training time per epoch in seconds using a single GPU machine. The space complexity is measured by the number of model parameters.
By comparing the VAE and the CycleVAE with a single decoder, we can see that adding the cycle consistency increases the training time by up to four times, but the number of parameters stays identical. The same can be seen by comparing the VAE-WGAN and the CycleVAE-WGAN with a single decoder. Using multiple decoders considerably increases the space complexity, especially when the WGAN is used, since it needs separate discriminators for each speaker as well.
The global variance, or GV for short, of the MCCs can be used to measure the degree of oversmoothing; higher GV values correlate with the sharpness of the spectra. We measured the GV for each of the MCC dimensions of the original source speech and of the speech converted by the conventional VAE and the proposed CycleVAE. While the GV values for the original source speech were similar, the GV values of the CycleVAE for the converted speech were higher than those of the conventional VAE, indicating sharper spectra.
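Global variance is simply the per-dimension variance of the cepstral trajectory across frames, so oversmoothed converted speech shows reduced GV. A minimal sketch, using random stand-in features rather than real speech:

```python
import numpy as np

rng = np.random.default_rng(2)

def global_variance(mcc):
    """mcc: (frames, dims) mel-cepstral coefficients.
    Returns the variance of each cepstral dimension over time."""
    return np.var(mcc, axis=0)

# Stand-in "source" features and an over-smoothed "converted" version:
# shrinking the dynamic range models the smoothing effect of conversion.
source = rng.normal(scale=1.0, size=(200, 36))
converted = 0.5 * source

gv_src = global_variance(source)
gv_cnv = global_variance(converted)
```

Comparing `gv_cnv` against `gv_src` per dimension mirrors the GV plots discussed above: converted speech with GV closer to the source's is less oversmoothed.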
If two speech utterances contain the same linguistic information, the difference between the MCCs of the two utterances should be small. We measured the mel-cepstral distortion, or MCD for short, for the various algorithms.
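Mel-cepstral distortion between two time-aligned frames is conventionally computed as (10 / ln 10) · sqrt(2 · Σᵢ dᵢ²) in dB over the cepstral dimensions, usually excluding the 0th (energy) coefficient; a sketch under those conventions (the alignment itself, typically done with DTW, is assumed already done):

```python
import numpy as np

def mcd(mc_ref, mc_cnv):
    """Mean mel-cepstral distortion in dB between time-aligned
    (frames, dims) cepstra, skipping the 0th (energy) coefficient."""
    diff = mc_ref[:, 1:] - mc_cnv[:, 1:]
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

# Toy aligned cepstra: identical frames give 0 dB distortion,
# and a constant offset of 1 in every coefficient gives a fixed value.
ref = np.zeros((10, 37))
cnv = ref + 1.0
d = mcd(ref, cnv)
```

Lower MCD between converted and target speech indicates better spectral conversion, which is how the table is read.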
By comparing the VAE and the VAE-WGAN, we can see that the WGAN generally improves the VAE. By comparing the VAE-WGAN and the CycleVAE-WGAN with a single decoder, we can see the effectiveness of adding the cycle consistency. By comparing the CycleVAE with a single decoder and with multiple decoders, and the CycleVAE-WGAN with a single decoder and with multiple decoders, we can see that the multiple decoders further improve the performance. One interesting thing to note is that the CycleVAE with multiple decoders outperforms the CycleVAE-WGAN with multiple decoders. We suspect that the multi-decoder cycle consistency loss is sufficient for learning the conversion paths explicitly, so that the additional WGAN for the conversion paths may not be necessary.
We conducted two subjective evaluations: a naturalness test and a similarity test. For the naturalness test, we measured the mean opinion score, where ten listeners evaluated the naturalness of the 48 converted utterances on a scale of one to five. On average, the proposed multi-decoder CycleVAE achieved slightly higher naturalness scores than the conventional VAE. It can also be seen that the proposed CycleVAE method shows relatively stable performance across different conversion pairs.
For the similarity test, we conducted the following experiment using 48 utterances and ten participants, as in the naturalness test. The target speaker's utterance was played first; then the two utterances converted by the two methods were played in random order. Listeners were asked to select the one most similar to the target speaker's speech, or to answer "fair" if they could not tell the difference. The results show that the proposed multi-decoder CycleVAE-based method outperforms the conventional VAE significantly.
Now we show some examples of voice conversion. This is the sound of the source speaker: [audio sample]. And the target speaker: [audio sample]. These are the sounds of the converted speech: [audio samples of the same sentence produced by the compared methods].
Here is another example of voice conversion between two speakers. This is the sound of the source speaker: "The proper course to pursue is to offer your name and address." And the target speaker: [audio sample of the same sentence]. These are the sounds of the converted speech: [audio samples of the same sentence produced by the compared methods].
Now, here are some concluding remarks. Variational autoencoder-based voice conversion can perform many-to-many voice conversion without parallel training data. However, it suffers from low quality due to the absence of an explicit training process for the conversion paths. In this work, we improved the quality of VAE-based voice conversion by using cycle consistency and multiple decoders. The use of cycle consistency enables the network to explicitly learn the conversion paths, and the use of multiple decoders enables the network to better learn the individual target speakers' voice characteristics.
For future work, we are currently running experiments using a larger corpus consisting of more than a hundred speakers, to find out how the proposed method scales with the number of speakers.
The proposed method can be further extended by utilizing multiple encoders, for example using a dedicated encoder for each of the speakers. Also, replacing the vocoder with a more powerful neural vocoder, such as WaveNet or WaveRNN, may further improve the quality of the converted speech.
Thank you for watching our presentation.