0:00:13 Hello, and welcome to this Speaker Odyssey 2020 workshop presentation.
0:00:20 I am Dongsuk Yook from Korea University, and I will be presenting our recent research on many-to-many voice conversion using a cycle-consistent VAE with multiple decoders.
0:00:36 This is joint work with Seong-Gyun Leem, Hyung-Min Lee, and In-Chul Yoo. We are all from Korea University.
0:00:54 I will first define the problem that we want to solve, which is voice conversion among multiple speakers, and describe the key idea of the proposed method, called the CycleVAE with multiple decoders, to solve the problem. Dr. In-Chul Yoo will then explain the details of the CycleVAE with multiple decoders, show some experimental results of the proposed method, and finish with some concluding remarks.
0:01:30 Voice conversion is the task of converting the speaker-related voice characteristics in an utterance while maintaining the linguistic information. For example, a female speaker may be made to sound like a male speaker using a voice conversion technique. Voice conversion can be applied to data augmentation, for example for the training of automatic speech recognition systems; various voice generation for text-to-speech systems; speaking assistance for foreign languages through accent conversion; speech enhancement by improving the comprehensibility of the converted voice; and personal information protection through speaker de-identification.
0:02:21 If we have parallel training data, which contain pairs of utterances of the same transcription spoken by different speakers, we can simply train a neural network by providing the source speaker's utterances as the input of the network and the target speaker's utterances as the target of the network, after proper time alignment of the parallel utterances. However, building a parallel corpus is a highly expensive task, sometimes even impossible, which strongly motivates voice conversion methods that do not require parallel training data.
0:03:06 Therefore, recent voice conversion approaches attempt to use non-parallel training data. One such approach uses a variational autoencoder, or VAE for short, which was originally developed as a generative model for image generation. A VAE is composed of an encoder and a decoder. The encoder produces a set of parameters for the posterior distribution of the latent variable z given the input data x, whereas the decoder generates a set of parameters for the posterior distribution of the output data x given the latent variable z.
0:03:54 After being trained using the variational lower bound as its objective function, the VAE can be used to generate samples of x by feeding random latent variables to the decoder.
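As a side note for readers, the variational lower bound for diagonal-Gaussian encoders and decoders, as described above, can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; all function names are ours.

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def gaussian_nll(x, mu, log_var):
    """Negative log-likelihood of x under a diagonal-Gaussian decoder."""
    return 0.5 * np.sum(log_var + (x - mu)**2 / np.exp(log_var) + np.log(2.0 * np.pi))

def negative_elbo(x, enc_mu, enc_log_var, dec_mu, dec_log_var):
    """Negative variational lower bound = reconstruction NLL + KL regularizer."""
    return gaussian_nll(x, dec_mu, dec_log_var) + kl_diag_gaussian(enc_mu, enc_log_var)
```

Training minimizes this quantity; sampling then amounts to drawing z from N(0, I) and decoding.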
0:04:08 The VAE can be applied to voice conversion by providing a speaker identity to the decoder together with the latent variable. The VAE is trained to reconstruct the input speech from the latent variable z and the source speaker identity X. Here, the lowercase letter x represents an utterance, and the uppercase letter X represents a speaker identity. To convert speech from a source speaker to a target speaker, the source speaker identity X is replaced with the target speaker identity Y.
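Conversion itself comes down to which identity vector the decoder sees. A toy sketch, where the `encoder` and `decoder` callables stand in for trained networks (names and shapes are illustrative):

```python
import numpy as np

def one_hot(speaker_index, num_speakers):
    """Speaker identity as a one-hot vector."""
    v = np.zeros(num_speakers)
    v[speaker_index] = 1.0
    return v

def convert(x, encoder, decoder, target_speaker, num_speakers):
    """Encode x to the latent z, then decode with the *target* identity Y
    in place of the source identity X."""
    z = encoder(x)                                # speaker-independent content
    y = one_hot(target_speaker, num_speakers)     # swapped-in target identity
    return decoder(np.concatenate([z, y]))
```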
0:04:49 However, due to the absence of an explicit training process for the conversion path between the source speaker and the target speaker, VAE-based voice conversion methods generally produce poor quality voice. That is, the conventional method trains the model with the self-reconstruction objective function only, not considering the conversion path from a source speaker to a target speaker.
0:05:19 To solve this problem, we propose the cycle-consistent variational autoencoder, or CycleVAE for short, with multiple decoders. It uses the cycle consistency loss and multiple decoders for explicit conversion path training, as follows.
0:05:40 When the speech x is fed into the network, it passes through the encoder and is compressed into the latent variable z. The reconstruction error is computed using the reconstructed speech x′ produced by the speaker X decoder. Up to this point, the loss function is similar to that of the vanilla VAE, except that it does not require the speaker identity, because the decoder is used exclusively for speaker X.
0:06:14 The same input speech x also goes through the encoder and the speaker Y decoder, to generate the converted speech x′ (from speaker X to speaker Y), which has the same linguistic contents as the original input speech x but in speaker Y's voice. Then the converted speech x′ goes through the encoder and the speaker X decoder to generate the converted-back speech x″, which should recover the first input speech x.
0:06:53 This cyclic conversion encourages the explicit training of the voice conversion path from speaker Y to speaker X without parallel training data. The cycle consistency loss of the two-decoder CycleVAE for two speakers, given the input speech x, is defined as follows.
0:07:17 Again, the uppercase letters X and Y represent speaker identities. Given the input speech x, the loss function of the CycleVAE for two speakers is the weighted sum of the above two losses, as follows, where λ is the weight of the cycle consistency loss.
0:07:41 Similarly, the input speech y is used to explicitly train the conversion path from speaker X to speaker Y, as well as the self-reconstruction path for speaker Y. The method can be easily extended to more than two speakers by summing over all pairs of the training speakers. The loss function of the CycleVAE for more than two speakers can be computed as follows, where the second summation is usually over a mini-batch.
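The overall objective can be sketched as a double loop over speakers. This is a schematic, not the paper's exact notation: `recon_loss` and `cycle_loss` stand in for the VAE terms described above, and `lam` is the cycle weight λ.

```python
def cycle_vae_loss(utterances, recon_loss, cycle_loss, lam=1.0):
    """Total loss summed over all ordered pairs of training speakers.

    utterances       : dict mapping speaker -> list of utterances
    recon_loss(x, s) : self-reconstruction loss with speaker s's decoder
    cycle_loss(x, s, t) : loss for the cyclic path s -> t -> s on input x
    lam              : weight of the cycle consistency term
    """
    total = 0.0
    for s, batch in utterances.items():
        for x in batch:
            total += recon_loss(x, s)                    # self-reconstruction path
            for t in utterances:
                if t != s:
                    total += lam * cycle_loss(x, s, t)   # explicit conversion path
    return total
```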
0:08:15 The sound quality can be improved, since each decoder concentrates on its own speaker's voice characteristics through the additional conversion path training, while the conventional VAE must handle multiple speakers with only a single decoder trained by self-reconstruction alone.
0:08:37 At this point, I would like to hand the microphone over to Dr. Yoo, who is going to explain the details of the proposed CycleVAE with multiple decoders.

0:08:47 Thank you. I am In-Chul Yoo from Korea University. I will explain the details of the proposed CycleVAE with multiple decoders, the experimental results, and the conclusions.
0:09:02 The generative adversarial network, or GAN for short, can be applied to the CycleVAE to improve the quality of the resulting speech. The reconstructed speech x′ is retrieved from the CycleVAE by feeding the speech x from speaker X and the speaker identity of X. The discriminator is trained to distinguish the reconstructed speech from the original speech.
0:09:27 For the cyclic conversion, the converted-back speech x″ is also retrieved from the CycleVAE. The speech x is first converted to speaker Y's voice by feeding the speaker identity of Y with the latent variable z. The converted speech is then converted back to speaker X's voice by feeding the speaker identity of X with the latent variable. The discriminator is also trained to distinguish the resulting converted-back speech from the original speech.
0:10:01 The CycleVAE-GAN can be further extended to use multiple decoders. In a similar fashion to the CycleVAE with multiple decoders, each speaker uses dedicated decoder and discriminator networks.
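The per-speaker modules can be kept in a simple table. The factory callables below are illustrative placeholders for real network constructors, not the authors' code:

```python
def build_multi_decoder_model(speakers, make_decoder, make_discriminator):
    """One shared encoder is assumed elsewhere; every speaker gets its own
    decoder and (for the GAN variant) its own discriminator."""
    return {s: {"decoder": make_decoder(),
                "discriminator": make_discriminator()} for s in speakers}
```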
0:10:17 Since there are multiple GANs, the previous equation is modified; the modified equation is shown on the right. In this work, we used Wasserstein GANs, or WGANs for short, instead of vanilla GANs.
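For reference, the Wasserstein objective replaces the usual cross-entropy GAN loss. A minimal sketch under our own naming, where the critic is any scalar-valued network; the weight clipping or gradient penalty that WGAN training also requires is omitted here:

```python
import numpy as np

def wgan_critic_loss(critic, real, fake):
    """Critic maximizes E[D(real)] - E[D(fake)]; return the value to minimize."""
    return np.mean(critic(fake)) - np.mean(critic(real))

def wgan_generator_loss(critic, fake):
    """The generator (here, a decoder) tries to raise the critic's score."""
    return -np.mean(critic(fake))
```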
0:10:35 We designed the architecture of our models based on the paper by Kameoka et al. (2019). All encoders, decoders, and discriminators use fully convolutional architectures with gated linear units, or GLUs for short. The source identity vector is broadcast in each GLU layer; that is, the source identity vector c is appended to the output of the previous GLU layer. Since we assume Gaussian distributions with diagonal covariance matrices for the encoder and the decoder, the outputs of the encoder and the decoder are pairs of means and variances.
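A GLU layer and the identity broadcast can be sketched as follows for a (channels × frames) feature map; shapes and names are illustrative assumptions, not the paper's code:

```python
import numpy as np

def glu(x):
    """Gated linear unit: split channels in half, gate one half with a sigmoid."""
    a, b = np.split(x, 2, axis=0)
    return a * (1.0 / (1.0 + np.exp(-b)))

def append_identity(h, c):
    """Broadcast the speaker identity vector c over all frames and append it
    to the feature map h as extra channels."""
    c_map = np.repeat(c[:, None], h.shape[1], axis=1)
    return np.concatenate([h, c_map], axis=0)
```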
0:11:15 The decoder architecture is similar to that of the encoder, except that the target speaker identity vectors are not used for the multi-decoder CycleVAE and the multi-decoder CycleVAE-WGAN. And this is the architecture of the discriminator network. As in the decoders, the target speaker identity vectors are not used for the multi-decoder CycleVAE and the multi-decoder CycleVAE-WGAN.
0:11:43 Now I will show some experimental results of the proposed method and the concluding remarks. Here is the description of the experimental setup. We used the Voice Conversion Challenge 2018 dataset, which consists of six female and six male speakers. We used a subset of two female speakers and two male speakers. Each speaker has one hundred sixteen utterances: we used seventy-two utterances for training, nine utterances for validation, and thirty-five utterances for testing.
0:12:19 We used three sets of features: thirty-six-dimensional mel-cepstral coefficients, or MCCs for short, the fundamental frequency, and aperiodicities. We used the following hyperparameters, and the WORLD vocoder was used to synthesize the waveforms.
0:12:39 We analyzed the time and space complexity of the algorithms. The time complexity is measured by the average training time per epoch in seconds, using a GeForce RTX 2080 GPU machine. The space complexity is measured by the number of model parameters.
0:12:56 By comparing the VAE and the CycleVAE with a single decoder, we can see that adding the cycle consistency increases the training time about four times, but the number of parameters stays identical. The same can be seen by comparing the VAE-WGAN and the CycleVAE-WGAN with a single decoder. Using multiple decoders considerably increases the space complexity, especially when the WGAN is applied, since it needs separate discriminators for each speaker as well.
0:13:30 The global variance, or GV for short, of the MCCs can be used to measure the degree of smoothing: higher GV values correlate with sharpness of the spectra.
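Computing the GV is essentially a one-liner over the frame axis; over-smoothed converted speech tends to show lower values than natural speech. A sketch with our own naming:

```python
import numpy as np

def global_variance(mcc):
    """Variance of each mel-cepstral dimension over time.
    mcc: array of shape (frames, dims)."""
    return np.var(mcc, axis=0)
```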
0:13:43 We measured the GV for each of the mel-cepstral dimensions, for the source speech and for the speech converted by the conventional VAE and by the proposed CycleVAE. While the GV values of the conventional VAE and the proposed CycleVAE for the lower dimensions were similar, the GV values of the CycleVAE for the higher dimensions were closer to those of the source speech than those of the conventional VAE.
0:14:12 If two speech utterances contain the same linguistic information, the difference between the MCCs of the two utterances should be small. We measured the mel-cepstral distortion, or MCD for short, for the various algorithms.
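A common form of the MCD, assuming the two MCC sequences are already time-aligned and the 0th (energy) coefficient is excluded, can be sketched as:

```python
import numpy as np

def mel_cepstral_distortion(c_ref, c_conv):
    """Frame-averaged mel-cepstral distortion in dB.
    c_ref, c_conv: aligned arrays of shape (frames, dims)."""
    diff = c_ref - c_conv
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```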
0:14:28 By comparing the VAE and the VAE-WGAN, we can see that the WGAN helps improve the vanilla VAE. By comparing the VAE-WGAN and the CycleVAE with a single decoder, we can see the effectiveness of adding the cycle consistency. By comparing the CycleVAE with a single decoder and with multiple decoders, and the CycleVAE-WGAN with a single decoder and with multiple decoders, we can see that the multiple decoders further improve the performance. One interesting thing to note is that the CycleVAE with multiple decoders outperformed the CycleVAE-WGAN with multiple decoders. We suspect that the multi-decoder cycle consistency loss is sufficient for learning the conversion paths explicitly, so the additional WGAN for the conversion paths may not be necessary.
0:15:21 We conducted two subjective evaluations: a naturalness test and a similarity test. For the naturalness test, we measured the mean opinion score, where ten listeners evaluated the naturalness of forty-eight utterances on a scale of one to five. On average, the proposed multi-decoder CycleVAE achieved slightly higher naturalness scores than the conventional VAE. It can also be seen that the proposed CycleVAE method showed relatively stable performance across the conversion pairs.
0:15:58 For the similarity test, we conducted the following experiment, using forty-eight utterances and ten participants as in the naturalness test. A target speaker's utterance was played first; then the two utterances converted by the two methods were played in random order. Listeners were asked to select the one more similar to the target speaker's speech, or to answer "not sure" if they could not tell. The results show that the proposed multi-decoder CycleVAE outperformed the conventional VAE significantly.
0:16:34 Now we show some examples of voice conversion. This is the sound of the source speaker:
0:16:41because man groping in the arctic darkness
0:16:44the found the elemental
0:16:46 And this is the sound of the target speaker:
0:16:48because my well being in arctic darkness and found no matter
0:16:53 These are the sounds of the converted speech:
0:16:57we present and grabbing in the arctic darkness and found the elemental
0:17:02five nine running our preparedness and finding a no
0:17:07because and then grabbing in the arctic darkness the funny elemental
0:17:11because nine broken in the arctic darkness the founding elemental
0:17:16because an island hopping in the arctic darkness the founding elemental
0:17:21as in and problem of the art darkness the finding a no
0:17:26 Here is another example of voice conversion. This is the sound of the source speaker:
0:17:34the proper course to pursue is to offer your name and address
0:17:39 And this is the sound of the target speaker:
0:17:41the proper course to pursue is to offer your name address
0:17:45 These are the sounds of the converted speech:
0:17:49the proper course to pursue is to offer your name inventor's
0:17:54the proportion issue is to offer your main interest
0:17:58the proper course to pursue is to offer your name and managers
0:18:03the proper course to pursue is to offer your name and address
0:18:08the proper course to pursue is to offer your name and address
0:18:13the proper corresponds to is often in a way to address
0:18:17 Now, here are some concluding remarks. Variational autoencoder based voice conversion can learn many-to-many voice conversion without parallel training data. However, it suffers from low quality due to the absence of an explicit training process for the conversion paths. In this work, we improved the quality of VAE-based voice conversion by using cycle consistency and multiple decoders. The use of cycle consistency enables the network to explicitly learn the conversion paths, and the use of multiple decoders enables the network to learn the individual target speakers' voice characteristics.
0:18:56 For future work, we are currently running experiments using a larger corpus consisting of more than a hundred speakers, to find out how the proposed method scales with the number of speakers. The proposed method can be further extended by utilizing multiple encoders, for example using a dedicated encoder for each of the speakers. Also, replacing the vocoder with a more powerful neural vocoder, such as WaveNet or WaveRNN, can improve the quality of the converted speech. Thank you for watching our presentation.