0:00:13 | Hello, and welcome. This is our presentation for the Speaker Odyssey workshop.
0:00:20 | I'm Dongsuk Yook from Korea University,
0:00:23 | and I'll be presenting our recent research effort on many-to-many voice conversion using a cycle-consistent variational autoencoder (CycleVAE) with multiple decoders.
0:00:36 | This is joint work with Seong-Gyun Leem, Keonnyeong Lee, and In-Chul Yoo.
0:00:49 | We are all from Korea University.
0:00:54 | I'll first define the problem that we want to solve, which is voice conversion among multiple speakers,
0:01:02 | and describe the key idea of the proposed method, called CycleVAE with multiple decoders, to solve the problem.
0:01:11 | Dr. In-Chul Yoo is going to explain the details of the CycleVAE with multiple decoders,
0:01:19 | and show you some experimental results of the proposed method, followed by some concluding remarks.
0:01:30 | Voice conversion is the task of converting the speaker-related voice characteristics in an utterance while maintaining the linguistic information.
0:01:40 | For example, a female speaker may be made to sound like a male speaker using a voice conversion technique.
0:01:48 | Voice conversion can be applied to data augmentation, for example, for the training of automatic speech recognition systems;
0:01:57 | diverse voice generation for text-to-speech systems;
0:02:02 | speaking assistance for foreign languages through accent conversion;
0:02:08 | speech enhancement by improving the comprehensibility of the converted voice;
0:02:13 | and personal information protection through speaker de-identification.
0:02:21 | If we have parallel training data, which contain pairs of utterances with the same transcription spoken by different speakers,
0:02:30 | we can simply train a neural network by providing the source speaker's utterances as the input of the network
0:02:39 | and the target speaker's utterances as the target of the network, after proper time alignment of the parallel utterances.
0:02:48 | However, building a parallel corpus is a highly expensive task, sometimes even impossible,
0:02:56 | which creates a strong demand for voice conversion methods that do not require parallel training data.
0:03:06 | Therefore, recent voice conversion approaches attempt to use non-parallel training data.
0:03:14 | One such approach uses a variational autoencoder, or VAE for short, which was originally developed as a generative model for image generation.
0:03:28 | A VAE is composed of an encoder and a decoder.
0:03:33 | The encoder produces a set of parameters for the posterior distribution of a latent variable z given the input data x,
0:03:44 | whereas the decoder generates a set of parameters for the posterior distribution of the output data x given the latent variable z.
0:03:54 | After being trained using the variational lower bound as its objective function,
0:04:00 | the VAE can be used to generate samples of x by feeding random latent variables to the decoder.
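As a rough illustration of this encoder/decoder structure, here is a minimal VAE sketch. PyTorch and the fully connected layer sizes are our assumptions for illustration, not the configuration used in the talk.

```python
# Minimal VAE sketch (illustrative, not the presenters' implementation).
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, feat_dim=36, latent_dim=16):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, 2 * latent_dim)  # mean and log-variance of q(z|x)
        self.decoder = nn.Linear(latent_dim, 2 * feat_dim)  # mean and log-variance of p(x|z)

    def forward(self, x):
        mu_z, logvar_z = self.encoder(x).chunk(2, dim=-1)
        z = mu_z + torch.randn_like(mu_z) * (0.5 * logvar_z).exp()  # reparameterization trick
        mu_x, logvar_x = self.decoder(z).chunk(2, dim=-1)
        return mu_x, logvar_x, mu_z, logvar_z

def elbo_loss(x, mu_x, logvar_x, mu_z, logvar_z):
    # Negative variational lower bound, up to an additive constant:
    # Gaussian reconstruction NLL plus KL(q(z|x) || N(0, I)).
    nll = 0.5 * ((x - mu_x) ** 2 / logvar_x.exp() + logvar_x).sum(-1)
    kl = -0.5 * (1 + logvar_z - mu_z ** 2 - logvar_z.exp()).sum(-1)
    return (nll + kl).mean()
```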
0:04:08 | A VAE can be applied to voice conversion by providing a speaker identity to the decoder together with the latent variable.
0:04:18 | The VAE is trained to reconstruct the input speech from the latent variable z and the source speaker identity X.
0:04:27 | Here, the lowercase letter x represents an utterance, and the uppercase letter X represents a speaker identity.
0:04:38 | To convert speech from a source speaker to a target speaker, the source speaker identity X is replaced with the target speaker identity Y.
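In code form, the conversion step simply swaps the identity vector fed to the decoder. A sketch, assuming a one-hot speaker code and a decoder conditioned by concatenation (unlike the plain VAE above, this decoder takes the speaker code as an extra input):

```python
# Hedged sketch of VAE-based conversion; the one-hot speaker code is an assumption.
import torch

def convert(vae, x, speaker_y_onehot):
    # Encode the source utterance to the latent posterior parameters.
    mu_z, logvar_z = vae.encoder(x).chunk(2, dim=-1)
    z = mu_z  # use the posterior mean at conversion time (a common choice)
    # Decode conditioned on the *target* identity Y instead of the source identity X.
    dec_in = torch.cat([z, speaker_y_onehot], dim=-1)
    mu_x, logvar_x = vae.decoder(dec_in).chunk(2, dim=-1)
    return mu_x  # converted speech features
```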
0:04:49 | However, due to the absence of an explicit training process for the conversion path between the source speaker and the target speaker,
0:04:59 | the VAE-based voice conversion methods generally produce poor-quality voice.
0:05:05 | That is, the conventional method trains the model with the self-reconstruction objective function only,
0:05:12 | not considering the conversion path from a source speaker to a target speaker.
0:05:19 | In order to solve this problem, we propose the cycle-consistent variational autoencoder, or CycleVAE for short, with multiple decoders.
0:05:31 | It uses the cycle consistency loss and multiple decoders for explicit conversion path training, as follows.
0:05:40 | When the input speech x is fed into the network, it passes through the encoder and is compressed into the latent variable z.
0:05:50 | The reconstruction error is computed using the reconstructed speech x' produced by the speaker X decoder.
0:05:59 | Up to this point, the loss function is similar to that of the vanilla VAE, except that it does not require the speaker identity, because the decoder is used exclusively for speaker X.
0:06:14 | The same input speech x goes through the encoder and the speaker Y decoder as well,
0:06:21 | to generate the converted speech x' from speaker X to speaker Y, which has the same linguistic contents as the original input speech x but in speaker Y's voice.
0:06:36 | Then the converted speech x' goes through the encoder and the speaker X decoder to generate the converted-back speech x'', which should recover the original input speech x.
0:06:53 | This cyclic conversion encourages the explicit training of the voice conversion path from speaker Y to speaker X without parallel training data.
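A minimal sketch of one such training step with per-speaker decoders. PyTorch is assumed, the encoder is treated as deterministic, and the variational terms are simplified to squared errors for brevity; this is not the presenters' exact objective.

```python
# Hedged sketch of one CycleVAE-style training step with dedicated decoders.
import torch

def cycle_step(encoder, decoders, x, spk_x, spk_y):
    # Self-reconstruction path: encoder -> speaker-X decoder.
    z = encoder(x)
    x_rec = decoders[spk_x](z)
    loss_rec = ((x - x_rec) ** 2).mean()

    # Conversion path: encoder -> speaker-Y decoder gives the converted speech x'.
    x_conv = decoders[spk_y](z)

    # Cycle path: re-encode x' and decode with the speaker-X decoder; x'' should recover x.
    z_cyc = encoder(x_conv)
    x_back = decoders[spk_x](z_cyc)
    loss_cyc = ((x - x_back) ** 2).mean()

    lam = 1.0  # weight of the cycle consistency loss (value is illustrative)
    return loss_rec + lam * loss_cyc
```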
0:07:06 | The cycle consistency loss that optimizes the two-decoder VAE for the two speakers, given the input speech x, is defined as follows.
0:07:17 | Again, the uppercase letters X and Y represent speaker identities.
0:07:24 | Now, given the input speech x, the loss function of the CycleVAE for the two speakers is the weighted sum of the above two losses as follows,
0:07:36 | where lambda is the weight of the cycle consistency loss.
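The equation on the slide is not visible in this transcript; a plausible form, consistent with the weighted-sum description above (self-reconstruction loss plus a weighted cycle consistency term), is:

```latex
\mathcal{L}_{\mathrm{CycleVAE}}(x; X, Y)
  = \mathcal{L}_{\mathrm{rec}}(x; X)
  + \lambda \, \mathcal{L}_{\mathrm{cyc}}(x; X, Y)
```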
0:07:41 | Similarly, the input speech y is used to train the conversion path from speaker X to speaker Y explicitly, as well as the self-reconstruction path for speaker Y.
0:07:55 | This can be easily extended to more than two speakers by summing over all pairs of the training speakers.
0:08:03 | The loss function of the CycleVAE for more than two speakers can be computed as follows,
0:08:10 | where the second summation is usually over a mini-batch.
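Again as a reconstruction (the exact slide notation is not recoverable here), the multi-speaker objective sums the pairwise loss over speaker pairs and over the utterances of each mini-batch:

```latex
\mathcal{L}_{\mathrm{total}}
  = \sum_{X \ne Y} \; \sum_{x \in \mathcal{B}_X}
      \mathcal{L}_{\mathrm{CycleVAE}}(x; X, Y)
```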
0:08:15 | The sound quality can be improved, since each decoder learns its own speaker's voice characteristics through the additional conversion path training,
0:08:27 | while the conventional VAE must handle multiple speakers with only a single decoder trained by self-reconstruction alone.
0:08:37 | At this point, I'd like to hand over the microphone to Dr. Yoo, who is going to explain the details of the proposed CycleVAE with multiple decoders.
0:08:47 | Thank you.
0:08:48 | I'm In-Chul Yoo from Korea University.
0:08:53 | I'll be explaining the details of the proposed CycleVAE with multiple decoders, experimental results, and conclusions.
0:09:02 | A generative adversarial network, or GAN for short, can be applied to the CycleVAE to improve the quality of the resulting speech.
0:09:10 | The reconstructed speech x' is retrieved from the CycleVAE by feeding the speech x from speaker X and the speaker identity of X.
0:09:20 | The discriminator is trained to distinguish the reconstructed speech from the original speech.
0:09:27 | For the cyclic conversion, the converted-back speech x'' is also retrieved from the CycleVAE.
0:09:34 | It first converts the speech x to speaker Y's voice by feeding the speaker identity of Y with the latent variable z.
0:09:43 | The converted speech is then converted back to speaker X's voice by feeding the speaker identity of X with the latent variable.
0:09:53 | The discriminator is also trained to distinguish the resulting converted-back speech x'' from the original speech.
0:10:01 | The CycleVAE with GAN can be further extended to use multiple decoders.
0:10:07 | In a similar fashion to the CycleVAE with multiple decoders, each speaker uses dedicated decoder and discriminator networks.
0:10:17 | Since there are multiple GANs, the previous equation of the GAN is modified accordingly.
0:10:28 | In this work, we used Wasserstein GANs, or WGAN for short, instead of the vanilla GAN.
0:10:35 | We designed the architecture of our models based on the paper by Kameoka et al., 2019.
0:10:42 | All encoder, decoder, and discriminator networks use fully convolutional architectures with gated linear units, or GLU, as shown here.
0:10:51 | The source identity vector is broadcast at each GLU layer; that is, the source identity vector is appended to the output of the previous GLU layer.
0:11:02 | Since we assume Gaussian distributions with diagonal covariance matrices for the encoder and the decoder, the outputs of the encoder and the decoder are pairs of means and variances.
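A sketch of one such GLU convolution block with the broadcast speaker conditioning described above. PyTorch, the 1-D convolution, and the channel counts are our assumptions, not the paper's exact configuration.

```python
# Hedged sketch of a GLU block with broadcast speaker conditioning.
import torch
import torch.nn as nn

class GLUBlock(nn.Module):
    def __init__(self, in_ch, out_ch, spk_dim, kernel_size=5):
        super().__init__()
        pad = kernel_size // 2
        # One convolution producing both the values and the gates.
        self.conv = nn.Conv1d(in_ch + spk_dim, 2 * out_ch, kernel_size, padding=pad)

    def forward(self, h, spk):                    # h: (B, C, T), spk: (B, spk_dim)
        spk_map = spk.unsqueeze(-1).expand(-1, -1, h.size(-1))  # broadcast over time
        a, b = self.conv(torch.cat([h, spk_map], dim=1)).chunk(2, dim=1)
        return a * torch.sigmoid(b)               # gated linear unit
```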
0:11:15 | The decoder architecture is similar to that of the encoder.
0:11:19 | The target speaker identity vectors are not used for the multi-decoder CycleVAE and the multi-decoder CycleVAE with WGAN.
0:11:29 | This is the architecture of the discriminator network.
0:11:33 | As in the decoder, the target speaker identity vectors are not used for the multi-decoder CycleVAE and the multi-decoder CycleVAE with WGAN.
0:11:43 | Now I will show some experimental results of the proposed method and concluding remarks.
0:11:50 | Here are the details of the experimental setup.
0:11:54 | We used the Voice Conversion Challenge 2018 dataset, which consists of six female and six male speakers.
0:12:02 | We used a subset of two female speakers and two male speakers.
0:12:07 | Each speaker has one hundred sixteen utterances.
0:12:10 | We used seventy-two utterances for training, nine utterances for validation, and thirty-five utterances for testing.
0:12:19 | We used three sets of features: mel-cepstral coefficients, or MCEPs for short, fundamental frequency, and aperiodicities.
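These three feature streams are what the WORLD analyzer produces. A sketch using the pyworld and pysptk Python bindings; the tooling choice, the MCEP order, and the all-pass constant below are our assumptions, not settings stated in the talk.

```python
# Hedged feature-extraction sketch using WORLD (pyworld) + SPTK (pysptk).
import pyworld
import pysptk
import soundfile as sf

x, fs = sf.read("utterance.wav")                 # float64 mono signal expected
f0, t = pyworld.harvest(x, fs)                   # fundamental frequency (F0)
sp = pyworld.cheaptrick(x, f0, t, fs)            # smoothed spectral envelope
ap = pyworld.d4c(x, f0, t, fs)                   # aperiodicities
mcep = pysptk.sp2mc(sp, order=35, alpha=0.455)   # mel-cepstral coefficients
                                                 # (order and alpha are illustrative)
```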
0:12:29 | We used the following hyperparameters to train the models.
0:12:39 | We analyzed the time and space complexity of the algorithms.
0:12:43 | The time complexity is measured by the average training time in seconds using a GPU machine.
0:12:51 | The space complexity is measured by the number of model parameters.
0:12:56 | By comparing the VAE and the CycleVAE with a single decoder,
0:13:00 | we can see that adding the cycle consistency increases the training time about four times, but the number of parameters stays identical.
0:13:09 | The same can be seen by comparing the VAE with WGAN and the CycleVAE with WGAN with a single decoder.
0:13:16 | Using multiple decoders considerably increases the space complexity,
0:13:22 | especially when the WGAN is applied, since it needs a separate discriminator for each speaker as well.
0:13:30 | The global variance, or GV for short, of the MCEPs can be used to measure the degree of smoothness of the converted speech; higher GV values correlate with the sharpness of the spectra.
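Computing the GV is straightforward: pool the MCEP frames of an utterance set and take the per-dimension variance. A minimal NumPy sketch (our illustration, not the presenters' code):

```python
# Hedged sketch: global variance (GV) of MCEPs over a set of utterances.
import numpy as np

def global_variance(mceps):
    """mceps: list of (T_i, D) arrays of mel-cepstral coefficients."""
    frames = np.concatenate(mceps, axis=0)   # pool all frames: (sum T_i, D)
    return frames.var(axis=0)                # per-dimension variance: (D,)
```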
0:13:43 | We measured the GV for each of the MCEP dimensions of the original source speech and of the speech converted by the conventional VAE and the proposed CycleVAE.
0:13:55 | The GVs of the conventional VAE and the proposed CycleVAE for the lower MCEP dimensions were all similar,
0:14:03 | but the GVs of the CycleVAE for the higher MCEP dimensions were visibly better than those of the conventional VAE.
0:14:12 | In voice conversion, the source and the converted speech utterances contain the same linguistic information,
0:14:18 | so the difference between the mel-cepstra of the two speech utterances should be small.
0:14:23 | We measured the mel-cepstral distortion, or MCD, for the various algorithms.
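For reference, the standard MCD formula in dB, as a NumPy sketch; it assumes the two MCEP sequences are already time-aligned (e.g., by DTW), which is our assumption rather than a detail stated in the talk:

```python
# Hedged sketch: mel-cepstral distortion (MCD) in dB between aligned MCEP sequences.
import numpy as np

def mcd_db(mc_ref, mc_conv):
    """mc_ref, mc_conv: (T, D) aligned MCEP arrays; the 0th (energy) coefficient is excluded."""
    diff = mc_ref[:, 1:] - mc_conv[:, 1:]
    const = 10.0 / np.log(10.0) * np.sqrt(2.0)
    return const * np.mean(np.sqrt(np.sum(diff ** 2, axis=1)))
```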
0:14:28 | By comparing the VAE and the VAE with WGAN, we can see that the WGAN generally improves the VAE.
0:14:38 | By comparing the VAE with WGAN and the CycleVAE with a single decoder, we can see the effectiveness of adding the cycle consistency.
0:14:48 | By comparing the CycleVAE with a single decoder and with multiple decoders, and the CycleVAE with WGAN with a single decoder and with multiple decoders,
0:14:56 | we can see that the multiple decoders further improve the performance.
0:15:01 | One interesting thing to note is that the CycleVAE with multiple decoders outperformed the CycleVAE with WGAN with multiple decoders.
0:15:09 | We suspect that the multi-decoder cycle consistency loss is sufficient for learning the conversion paths explicitly, so that the additional WGANs for the conversion paths may not be necessary.
0:15:21 | We conducted two subjective evaluations: a naturalness test and a similarity test.
0:15:28 | For the naturalness test, we measured the mean opinion score, or MOS,
0:15:33 | where ten listeners evaluated the naturalness of the forty-eight utterances on a scale of one to five.
0:15:41 | On average, the proposed multi-decoder CycleVAE achieved slightly higher naturalness scores than the conventional VAE.
0:15:50 | It can also be seen that the proposed CycleVAE method showed relatively stable performance regardless of the conversion pair.
0:15:58 | For the similarity test, we conducted the following experiment, using forty-eight utterances and ten participants as in the naturalness test.
0:16:07 | The target speaker's utterance was played first;
0:16:11 | then the two utterances converted by the two methods were played in random order.
0:16:16 | Listeners were asked to select the one more similar to the target speaker's speech, or to answer "fair" if they could not decide.
0:16:25 | The results show that the proposed multi-decoder CycleVAE-based voice conversion significantly outperformed the conventional VAE-based voice conversion.
0:16:34 | Now we show some examples of voice conversion.
0:16:38 | This is the sound of the source speaker: "Because men, groping in the Arctic darkness, had found a yellow metal." [source audio sample]
0:16:46 | And the target speaker. [target audio sample]
0:16:53 | These are the sounds of the converted speeches. [converted audio samples from each method]
0:17:26 | Here is another example of voice conversion for a different conversion pair.
0:17:32 | This is the sound of the source speaker: "The proper course to pursue is to offer your name and address." [source audio sample]
0:17:39 | And the target speaker. [target audio sample]
0:17:45 | These are the sounds of the converted speeches. [converted audio samples from each method]
0:18:17 | Now let me share some concluding remarks.
0:18:20 | The variational autoencoder-based voice conversion can perform many-to-many voice conversion without parallel training data.
0:18:27 | However, it suffers from low quality due to the absence of an explicit training process for the conversion paths.
0:18:34 | In this work, we improved the quality of VAE-based voice conversion by using cycle consistency and multiple decoders.
0:18:42 | The use of cycle consistency enables the network to explicitly learn the conversion paths,
0:18:48 | and the use of multiple decoders enables the network to learn individual target speakers' voice characteristics.
0:18:56 | For future work, we are currently running experiments using a larger corpus consisting of more than a hundred speakers,
0:19:04 | to find out how the proposed method scales with the number of speakers.
0:19:09 | The proposed method can be further extended by utilizing multiple encoders, for example, using a dedicated encoder for each source speaker.
0:19:18 | Also, replacing the vocoder with a more powerful neural vocoder such as WaveNet or WaveRNN can further increase the quality of the converted speech.
0:19:28 | Thank you for watching our presentation.