0:00:13 | Hello. Today I'll be talking about "Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis." |
0:00:23 | It's a pleasure to be here today, and I also want to thank my co-authors, Joanna Rownicka, Pilar Oplustil, and Simon King. |
0:00:31 | The problem that we want to solve in this research is how to develop a neural network that will output a mean opinion score (MOS) given some synthetic speech input. |
0:00:43 | The motivations for this work are to speed up the TTS development cycle, to save the time and money spent on listening tests, and to predict TTS quality automatically. |
0:00:56 | There have been several attempts to develop an automatic quality estimation system. P.563 was developed for speech coding, and it has no correlation to text-to-speech. PESQ was also developed for telephony speech; it requires a high-quality reference, |
0:01:11 | and it is based on comparing degraded speech to natural speech, so TTS errors are not always captured in this approach. AutoMOS is interesting, but it was limited to only Google TTS systems, and its ground truth is based on multiple MOS tests conducted over a period of time. |
0:01:28 | Quality-Net was introduced in 2018; this was for speech enhancement, and it is limited to the TIMIT dataset. |
0:01:36 | In 2019, MOSNet was introduced. This is for estimating the quality of voice conversion systems. We found that the pre-trained models provided do not generalize well to text-to-speech. |
0:01:51 | The two main contributions of this paper are: we retrain the original MOSNet framework using TTS data and explore frame-based weighting in the loss function; and we also train a new low-capacity CNN architecture on the TTS dataset and compare different types of speech representations. |
0:02:09 | Finally, we characterize MOS prediction performance based on speaker-level ranking. |
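For context, the original MOSNet objective combines an utterance-level MSE with a frame-level MSE term, which is where a frame-based weighting can be applied. Below is a minimal PyTorch sketch of that style of objective; `alpha` is an assumed weighting hyperparameter, and the exact weighting scheme used in the paper may differ.

```python
import torch

def mosnet_loss(utt_pred, frame_preds, mos_target, alpha=1.0):
    """MOSNet-style loss: utterance-level MSE plus a weighted
    frame-level MSE term (a sketch, not the paper's exact scheme).

    utt_pred:    (batch,) predicted utterance-level MOS
    frame_preds: (batch, frames) predicted frame-level MOS
    mos_target:  (batch,) ground-truth MOS for each utterance
    """
    utt_loss = torch.mean((utt_pred - mos_target) ** 2)
    # Broadcast the utterance-level target to every frame.
    frame_loss = torch.mean((frame_preds - mos_target.unsqueeze(1)) ** 2)
    return utt_loss + alpha * frame_loss
```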
0:02:15 | Our approach is basically in two phases. In Phase 1, we derive all of our speech representations, train MOS prediction networks, and retrain the original MOSNet using TTS data; then we determine what the best representation and model for TTS is. In Phase 2, |
0:02:34 | we train a new multi-speaker TTS system, conduct a small MOS listening test, and apply the trained model from Phase 1 to analyze generalization. The question we want to answer in Phase 2 is: does our best model, developed in Phase 1, generalize to a new TTS system? |
0:02:55 | One of our main contributions is that we explore different types of speech representations in a low-capacity CNN architecture, which we developed to handle these new representations. |
0:03:06 | We have five different types of x-vector extractors. These are just like regular x-vectors; however, during training of the extractor, what is adjusted is that the target is not necessarily speaker ID. We have one version of the extractor where the target is in fact the speaker ID. |
0:03:25 | Then we have another type of x-vector where the target is a categorical variable representing room size, from the ASVspoof dataset. |
0:03:36 | We have another type of x-vector that models categorical values for the T60 reverb time, another one that models talker origin, basically the attacker-to-talker distance, and finally another one that models the replay device quality. |
0:03:53 | We also have Deep Spectrum features, which are the output of an ImageNet-pretrained model operating on the entire spectrogram as an image, so this is a very high-dimensional representation. |
0:04:13 | Finally, we have the acoustic model embeddings; those are from an ASR system, and again those are 512-dimensional, the same as the x-vectors. |
0:04:23 | And then there is the original MOSNet, which uses frame-based features; we retrain that on the LA dataset. |
0:04:36 | When we train these different types of environment x-vector extractors, the targets we use are: speakers, room size, the T60 reverb time, distances, and the replay device quality. |
0:04:52 | So the purpose is to use the extractors to model the different types of environments and attacks that are labeled in the ASVspoof 2019 PA (physical access) dataset. We're getting labels for free from the physical access data, and we want to use those to model speech degradation. |
0:05:12 | We then apply these as a way of modeling the degradation in text-to-speech. |
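As a hedged illustration of getting these "labels for free": the ASVspoof 2019 PA protocol encodes the environment and attack conditions as short categorical codes alongside each utterance. The column layout assumed below should be checked against the protocol README; this is a sketch, not the paper's actual data pipeline.

```python
# Hypothetical sketch: reading categorical x-vector targets from an
# ASVspoof 2019 PA protocol file. The field layout is an assumption.
from collections import namedtuple

Targets = namedtuple("Targets", "speaker room_size t60 distance device_quality")

def parse_pa_protocol(path):
    targets = {}
    for line in open(path):
        spk, utt, env, attack, key = line.split()
        # env is assumed to hold three categorical codes:
        # room size, T60 reverb time, and a distance category.
        room, t60, dist = env[0], env[1], env[2]
        # attack (spoofed trials only) is assumed to encode
        # attacker-to-talker distance and replay device quality.
        device_q = attack[1] if attack != "-" else None
        targets[utt] = Targets(spk, room, t60, dist, device_q)
    return targets
```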
0:05:20 | So when we train our low-capacity CNN and the retrained MOSNet, we are working on the logical access (LA) dataset from the ASVspoof challenge; this is the evaluation portion of the LA dataset. |
0:05:34 | It consists of 48 unique VCTK speakers. Importantly, there are thirteen different TTS and VC systems in this dataset, so we get a wide range of quality of TTS and voice conversion; however, most of the systems are in fact text-to-speech. |
0:05:53 | Also importantly, they were evaluated with human judgements, and the ground truth judgement is on a one-to-ten-point scale for mean opinion score, all from the same MOS rating task, so we don't have to combine different MOS tests conducted over time; this alleviates our problem. |
0:06:12 | We used a speaker-disjoint train/validation/test split. |
0:06:15 | In Table 1 of the paper that we reference here, we can see these systems, labeled A07 through A19; there are thirteen of them, and they have different characteristics. For example, they use different types of vocoders: some use Griffin-Lim, while others use high-quality neural vocoders such as WaveNet and WaveRNN. |
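A speaker-disjoint split can be produced with scikit-learn's `GroupShuffleSplit`, grouping utterances by speaker so that no speaker appears in more than one partition. The split ratios below are illustrative assumptions, not the paper's exact partition.

```python
from sklearn.model_selection import GroupShuffleSplit

def speaker_disjoint_split(utterances, speakers, seed=0):
    # Hold out 20% of speakers for testing; no speaker crosses partitions.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    trainval_idx, test_idx = next(outer.split(utterances, groups=speakers))
    # Carve a validation set out of the remaining speakers.
    inner = GroupShuffleSplit(n_splits=1, test_size=0.125, random_state=seed)
    sub_utts = [utterances[i] for i in trainval_idx]
    sub_spks = [speakers[i] for i in trainval_idx]
    train_idx, val_idx = next(inner.split(sub_utts, groups=sub_spks))
    return ([sub_utts[i] for i in train_idx],
            [sub_utts[i] for i in val_idx],
            [utterances[i] for i in test_idx])
```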
0:06:43 | So again, we explore two types of MOS prediction neural networks. In the first case, we have all of the machinery that comes with the original MOSNet, which has different types of architectures: for example, there's one version that has a bidirectional LSTM, another version that has a CNN, |
0:07:00 | and another version that has a CNN-BLSTM combination. Since their code is available online, we get all of the architectures, and we explored all the different hyperparameters that they explored. |
0:07:14 | In addition to the original MOSNet, we introduce our low-capacity CNN, which we use to operate on our different representations, such as the x-vectors, the Deep Spectrum features, and the acoustic model embeddings. |
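Below is a minimal PyTorch sketch of what a "low-capacity CNN" MOS predictor could look like, consuming a sequence of embedding frames (e.g. 512-dimensional acoustic-model embeddings). The layer sizes are illustrative assumptions, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class LowCapacityCNN(nn.Module):
    """Small 1-D CNN that maps a sequence of embeddings to one MOS value."""
    def __init__(self, input_dim=512, channels=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(input_dim, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.head = nn.Linear(channels, 1)  # scalar MOS estimate

    def forward(self, x):                    # x: (batch, frames, input_dim)
        h = self.conv(x.transpose(1, 2))     # -> (batch, channels, frames)
        h = h.mean(dim=2)                    # global average pool over time
        return self.head(h).squeeze(-1)      # (batch,) predicted MOS
```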
0:07:32 | So now we're going to talk about some of the findings from our experiments. |
0:07:38 | First, we used four different correlation metrics. Each of them has a different range and different tradeoffs and might be useful for different types of problems; we wanted to keep in mind which metrics previous work has used, and we also introduce the Kendall tau correlation. |
0:07:54 | From the start, we have the linear correlation coefficient (LCC), also known as Pearson's r; it is a value that ranges between negative one and one, with one being highly correlated. |
0:08:10 | Then there is the Spearman rank correlation coefficient (SRCC); one of its benefits is that it is non-parametric, and its values again range from negative one to positive one. We also use mean squared error (MSE); it is not ideal, as it fails to capture distributional information such as outliers. |
0:08:26 | Finally, we have the Kendall tau rank correlation coefficient, which is useful for this task because it captures ordinal ratings and is a little bit more robust to error sensitivity than the Spearman rank correlation coefficient. |
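All four metrics are available in SciPy/NumPy; here is a small helper that computes them for a set of true and predicted MOS values:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def mos_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return {
        "LCC":  pearsonr(y_true, y_pred)[0],    # linear (Pearson's r)
        "SRCC": spearmanr(y_true, y_pred)[0],   # Spearman rank, non-parametric
        "KTAU": kendalltau(y_true, y_pred)[0],  # Kendall tau, ordinal ratings
        "MSE":  float(np.mean((y_true - y_pred) ** 2)),
    }
```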
0:08:42 | So here is the first table of our first set of results. This is the correlation between the ground truth MOS scores from the LA dataset and our predicted MOS scores from our different systems, aggregated in two different ways: |
0:09:00 | one is at the system level, and the other is at the speaker level. In this work we are particularly interested in how different speakers contribute to the overall quality of a TTS system, so we focus our discussion on the speaker-level results. |
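The two aggregation levels can be sketched with a pandas group-by, assuming a per-utterance DataFrame with `system`, `speaker`, `mos_true`, and `mos_pred` columns (hypothetical names):

```python
import pandas as pd
from scipy.stats import spearmanr

def aggregated_srcc(df: pd.DataFrame, level: str) -> float:
    # level is "system" or "speaker": average utterance scores per group,
    # then correlate the true vs. predicted group means.
    g = df.groupby(level)[["mos_true", "mos_pred"]].mean()
    return spearmanr(g["mos_true"], g["mos_pred"])[0]
```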
0:09:15 | From left to right, we have the different systems and the different representations, starting with this first column: the pre-trained voice conversion CNN. This is the pre-trained model that comes with the original MOSNet, and it is trained on voice conversion data. |
0:09:31 | Here we have applied it to the LA dataset, and what we see is that there is almost no correlation between the pre-trained model's predictions and the TTS data. |
0:09:41 | When we retrained the MOSNet CNN structure on the LA dataset and then evaluated it again on a held-out portion of the LA data, we got a much higher correlation. We can see that the MOSNet architecture itself is fine; it just needed to be retrained on TTS data. |
0:10:03 | Then we have our other representations: we compare our set of x-vector extractors as well as the Deep Spectrum features and the acoustic model embeddings. These were run on our low-capacity CNN, which we trained from scratch, so there are no pre-trained models in this experiment. |
0:10:23 | What we find, when we consider all the different correlation metrics at speaker-level aggregation, is that x-vector XV5 appears to be the best representation. |
0:10:33 | Recall that XV5 models device quality, so it does make some intuitive sense. And it's worth mentioning that the retrained MOSNet architecture also performs quite well. |
0:10:50 | Here we want to characterize some of the best and worst TTS systems. |
0:10:57 | For example, using the ground truth, we identified that system A08 is the poorest-quality system: it has a mean MOS of 1.75, and it is an HMM-based TTS system, so it makes sense that this might be the worst-performing system. |
0:11:14 | We also identified the best-performing system as high quality, having a higher mean MOS of 5.58; that is in fact the WaveNet-based TTS system. |
0:11:26 | So now let's listen to some examples of what this speech sounds like. What we see here in the plot is that the ground truth MOS labels have quite a spread, between one and six, while what is being predicted by our systems |
0:11:46 | falls in a very narrow band: a range from about 2.5 to 3.5, so it is very narrow. |
0:11:56 | [audio sample, WaveNet system: "Dialogue is the key."] |
0:11:59 | OK, that was the WaveNet system, and here is the HMM system. |
0:12:02 | [audio sample, HMM system: "Today will tell."] |
0:12:04 | It's got a little bit more distortion. |
0:12:08 | Next, we also want to characterize the best and worst speakers, and here things get a little bit tricky. We have the best system and the worst system, which we just heard: the WaveNet and the HMM. |
0:12:30 | We also have the best speaker and the worst speaker, which we identified. The best speaker in the LA dataset, based on the ground truth, is the speaker labeled 0048, and the worst is 0040. |
0:12:44 | Now, when we look at what the ground truth MOS score is for the best system with the worst speaker, and the worst system with the best speaker, the true MOS has quite a big gap; however, our model's predicted mean opinion score |
0:13:03 | is much narrower in its difference. |
0:13:06 | Also, the ordinal ranking is reversed. Let's listen to some examples. |
0:13:14 | [audio sample: "The culture here has changed dramatically in the past five or six years."] |
0:13:18 | So that was the best system with the worst speaker. |
0:13:21 | [audio sample: "Today will tell."] |
0:13:23 | And that was the worst system with the best speaker. Did you notice that they sound somewhat close? Let's listen to them again. |
0:13:31 | [audio sample: "The culture here has changed dramatically in the past five or six years."] |
0:13:37 | [audio sample: "Today will tell."] |
0:13:38 | OK. So the fact that we're hearing some closeness may correspond to the narrow range of scores predicted by our system. |
0:13:53 | Next, and importantly, we want to talk about a post-hoc analysis that we did: how well does the MOSNet that we trained generalize to a completely held-out TTS system and its data? |
0:14:07 | For this we used the LibriTTS dataset, which is audiobook data. The large set has 585 hours from over two thousand speakers, and the data did undergo some cleaning from Google. |
0:14:22 | We used a small subset to train our TTS system: sixty hours of male and female speech from forty-five speakers, balanced across the two genders, giving approximately thirty-seven thousand utterances that we trained our TTS system on. |
0:14:43 | The system that we chose is DC-TTS; it is otherwise the standard model, just DC-TTS with one-hot speaker codes incorporated into the system. |
0:14:54 | This TTS system consists of a text-to-mel network, but it also has a spectrogram super-resolution network, and the audio is generated with Griffin-Lim, so we will hear the Griffin-Lim artifacts in the next slide. |
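As a hedged sketch of the one-hot speaker conditioning idea: project the one-hot speaker code and inject it into the text-to-mel network's encoder output. The injection point and dimensions below are assumptions for illustration; the actual DC-TTS modification may differ.

```python
import torch
import torch.nn as nn

class SpeakerConditionedEncoder(nn.Module):
    """Adds a projected one-hot speaker code to every encoder step."""
    def __init__(self, text_dim=256, n_speakers=45):
        super().__init__()
        self.spk_proj = nn.Linear(n_speakers, text_dim)

    def forward(self, text_enc, speaker_id):
        # text_enc: (batch, steps, text_dim); speaker_id: (batch,) int
        one_hot = nn.functional.one_hot(
            speaker_id, num_classes=self.spk_proj.in_features).float()
        spk = self.spk_proj(one_hot).unsqueeze(1)  # (batch, 1, text_dim)
        return text_enc + spk  # broadcast speaker code over all steps
```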
0:15:07 | We applied the MOSNet to the synthesized speech and aggregated at the speaker level. |
0:15:18 | What we see, again, is that the best representation as far as the correlation metrics go is XV5, which is the device quality, as mentioned before. |
0:15:30 | However, the correlation overall is quite poor, so we cannot say that the MOSNet is working very well on this dataset, even though we have identified a better representation to use compared to the others. |
0:15:51 | So even though the MOSNet doesn't generalize well to this new system, when we use our best-performing representation, XV5, we can capture some relative speaker rankings. |
0:16:05 | [audio sample: worst speaker] |
0:16:11 | So that was the worst speaker synthesized by our system; here is a mid-range speaker. |
0:16:16 | [audio sample: mid-range speaker] |
0:16:20 | And here is the best. |
0:16:21 | [audio sample: best speaker] |
0:16:28 | Now let's look at them side by side: so, side by side, we have our DC-TTS trained on LibriTTS and the WaveNet system from the LA data. |
0:16:47 | What we see is that the speakers in each system contribute differently to the overall performance of the system: there are some speakers that are just outstanding in both systems, |
0:17:03 | and some speakers that are generally much worse. Now I take the worst-performing speaker of the DC-TTS trained on LibriTTS and the worst-performing speaker in the LA TTS, and put them side by side. |
0:17:17 | Let's listen to that. |
0:17:20 | [audio sample: worst speaker, LA system] |
0:17:23 | And in the DC-TTS case: |
0:17:26 | [audio sample: worst speaker, DC-TTS system] |
0:17:31 | OK, so both are actually quite poor. What we find is that by selectively choosing the speakers with which a system is evaluated, one could artificially lower or boost the overall system score: selecting only the speakers that are performing very well would artificially boost the overall system score. |
0:18:01 | So, in conclusion, what we determined is that the overall approach for doing MOS prediction is sound, but the correlation between true and predicted scores could be improved. |
0:18:12 | The MOS prediction model trained on the LA dataset does not generalize well to a held-out TTS system and its data. |
0:18:21 | And we did find that some representations are better suited for this task, while other representations are just not well suited to it. |
0:18:30 | We have made two tools available on GitHub. |
0:18:33 | The first is the MOS estimation low-capacity CNN using the XV5 device-quality extractor, and we provide our pretrained model. |
0:18:45 | The second tool is the original MOSNet architecture with a pre-trained model that has been retrained on the LA dataset: the original MOSNet provided a pretrained model for voice conversion, and we are providing a pretrained model for TTS. |
0:19:03 | Some of the future directions are to look at predicting speaker similarity. We also think it would be interesting to use speaker verification scores to predict the MOS score. |
0:19:17 | We think it would be interesting to reformulate this task as a MUSHRA or A/B preference test. |
0:19:26 | And finally, we would like to incorporate automatic MOS estimation into the TTS training process. |
0:19:35 | Thank you very much for listening to the talk, and we hope you enjoy the paper. |