0:00:13 Hello, I'm going to give a talk today about a comparison of speech representations for
0:00:19 automatic quality estimation in multi-speaker text-to-speech synthesis.
0:00:23 It's a pleasure to be here today, and I also want to thank my co-authors Joanna Rownicka, Pilar Oplustil,
0:00:27 and Simon King.
0:00:31 The problem that we want to solve in this research is how to develop a neural
0:00:35 network that will output a mean opinion score given some synthetic speech input.
0:00:43 The motivations for this work are to speed up the TTS development cycle, to
0:00:48 save time and money on listening tests, and to predict TTS quality automatically.
0:00:56 There have been several attempts to develop an automatic quality estimation system. P.563
0:01:02 was developed for coded telephony speech, and it shows no correlation with text-to-speech. PESQ was
0:01:06 also developed for telephony speech; it requires a high-quality reference
0:01:11 and is based on comparing degraded speech to natural speech. TTS errors are not
0:01:16 always captured in this approach. AutoMOS is interesting, but it was limited to
0:01:21 only the Google TTS systems, and its ground truth is based on multiple MOS tests conducted
0:01:26 over a period of time.
0:01:28 Quality-Net was introduced in 2018. This was for speech enhancement, and
0:01:34 it is limited to the TIMIT dataset.
0:01:36 In 2019, MOSNet was introduced.
0:01:39 This is for estimating the quality of voice conversion systems. We found that the pre-trained
0:01:45 models do not generalize well to text-to-speech.
0:01:51 The two main contributions of this paper are to retrain the original MOSNet framework using
0:01:56 TTS data and explore frame-based weighting in the loss function. We also train
0:02:02 a new low-capacity CNN architecture on the TTS dataset and compare
0:02:07 different types of speech representations.
0:02:09 Finally, we characterize MOS prediction performance based on speaker global ranking.
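As a concrete illustration of frame-based weighting in the loss, here is a minimal PyTorch sketch: an utterance-level MSE term plus a per-frame MSE term against the same target MOS. The `frame_weights` and the `alpha` mixing factor are illustrative assumptions, not the exact formulation used here.

```python
import torch

def weighted_mosnet_loss(frame_scores: torch.Tensor,
                         utt_score: torch.Tensor,
                         target_mos: float,
                         frame_weights: torch.Tensor = None,
                         alpha: float = 1.0) -> torch.Tensor:
    """MOSNet-style objective: utterance-level MSE plus a frame-level MSE
    against the same utterance MOS target, optionally weighted per frame."""
    utt_loss = (utt_score - target_mos) ** 2
    frame_err = (frame_scores - target_mos) ** 2        # shape: (n_frames,)
    if frame_weights is None:
        frame_weights = torch.ones_like(frame_err)      # uniform = unweighted case
    frame_loss = (frame_weights * frame_err).sum() / frame_weights.sum()
    return utt_loss + alpha * frame_loss
```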
0:02:15 Our approach has two phases. In Phase 1, we gathered speech representations,
0:02:20 trained models, and retrained the original MOSNet using TTS data, and we determined
0:02:30 what the best representation and model for TTS is. In Phase 2,
0:02:34 we trained a new multi-speaker TTS system, conducted a small MOS listening test, and applied the
0:02:41 trained model from Phase 1 to analyze generalization. The question we want to answer in Phase
0:02:47 2 is: does our best model developed in Phase 1 generalize to a new TTS
0:02:52 system?
0:02:55 One of our main contributions is that we explore different types of speech representations
0:03:01 in a low-capacity CNN architecture, which we developed to handle these new
0:03:05 representations.
0:03:06 We have five different types of x-vector extractors.
0:03:11 These are just like regular x-vectors; however, during training of the extractor, the objective is
0:03:16 adjusted so that
0:03:17 the target is not necessarily speaker ID. We have one version of the x-vector extractor
0:03:21 where the target is in fact the speaker ID.
0:03:25 Then we have another type of x-vector where the target is a categorical
0:03:30 variable representing room size, from the ASVspoof dataset.
0:03:36 We have another type of x-vector that models oracle values for the
0:03:40 T60 reverb time.
0:03:42 Another one models the talker origin, basically a distance,
0:03:46 the attacker-to-talker distance, and finally another one models the replay device quality.
0:03:53 We also have deep spectrum features,
0:03:57 which are the output of an ImageNet-
0:04:03 pretrained model that operates on the entire spectrogram as an image.
0:04:08 So this is a very high-dimensional representation.
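A rough sketch of this kind of deep spectrum feature extraction, assuming VGG16 as the ImageNet-pretrained backbone (the actual backbone and preprocessing may differ):

```python
import numpy as np
import torch
from torchvision import models, transforms

# Load VGG16 and drop the final classification layer -> 4096-dim features.
_vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
_vgg.classifier = _vgg.classifier[:-1]
_vgg.eval()

_prep = transforms.Compose([
    transforms.ToTensor(),                          # HxWx3 uint8 -> 3xHxW in [0,1]
    transforms.Resize((224, 224), antialias=True),  # VGG's expected input size
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def deep_spectrum(mel_db: np.ndarray) -> torch.Tensor:
    """mel_db: (n_mels, n_frames) log-mel spectrogram in dB."""
    lo, hi = mel_db.min(), mel_db.max()
    img = (mel_db - lo) / (hi - lo + 1e-8)                       # scale to [0, 1]
    img = (np.stack([img] * 3, axis=-1) * 255).astype(np.uint8)  # grayscale -> RGB
    with torch.no_grad():
        return _vgg(_prep(img).unsqueeze(0)).squeeze(0)          # 4096-dim vector
```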
0:04:13 Finally, we have the acoustic model embeddings,
0:04:17 and those are from an ASR system. Again, those are 512 dimensions, the same
0:04:21 as the x-vectors.
0:04:23 Finally, there is the original MOSNet, which uses frame-based features,
0:04:28 and we are retraining that on the LA dataset.
0:04:36 When we train these different types of
0:04:39 x-vector extractors,
0:04:42 we have as targets
0:04:45 speakers, room size, the T60 reverb time, the distances, and the replay device
0:04:51 quality.
0:04:52 The purpose is to use the x-vectors to model different types of environments
0:04:58 and attacks that are labeled in the ASVspoof 2019 physical
0:05:03 access dataset. So we are getting labels for free from the physical access data, and we want
0:05:08 to use those to model speech degradation,
0:05:12 even though they do not directly apply, as a way of modeling the degradation in text-to-
0:05:17 speech.
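A minimal sketch of such an x-vector-style extractor: a TDNN with statistics pooling whose classification head can target any categorical label (speaker ID, room size, a binned T60, a distance bin, or replay device quality). Layer sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class XVectorStyleExtractor(nn.Module):
    """TDNN frame layers + statistics pooling + embedding; the classifier
    can target any categorical label, not just speaker ID."""
    def __init__(self, n_feats: int, n_classes: int, emb_dim: int = 512):
        super().__init__()
        self.frame_layers = nn.Sequential(            # TDNN = dilated 1-D convs
            nn.Conv1d(n_feats, 512, kernel_size=5), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.embedding = nn.Linear(2 * 1500, emb_dim)  # after mean+std pooling
        self.classifier = nn.Linear(emb_dim, n_classes)

    def forward(self, feats: torch.Tensor):           # feats: (batch, n_feats, frames)
        h = self.frame_layers(feats)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # stats pooling
        emb = self.embedding(stats)                    # the "x-vector" used downstream
        return self.classifier(emb), emb

# e.g. an extractor whose target is replay device quality instead of speaker ID:
# net = XVectorStyleExtractor(n_feats=30, n_classes=3)
```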
0:05:20 When we train our low-capacity CNN and the retrained MOSNet, we
0:05:25 are working on the logical access
0:05:29 dataset from the ASVspoof challenge. This is the evaluation portion of the LA
0:05:33 dataset.
0:05:34 It consists of 48 unique speakers.
0:05:38 Importantly, there are thirteen different TTS and VC systems
0:05:43 in this dataset, so we get a wide range of quality of TTS and voice
0:05:48 conversion. However, most of the systems are in fact text-to-speech.
0:05:53 Also importantly, they were evaluated with human judgments, and the ground-truth judgments are on
0:05:59 a one-to-ten-point scale for mean opinion score, all from the same MOS rating
0:06:04 task. So we are not conflating different MOS tests conducted over time; this alleviates
0:06:10 that problem.
0:06:12 We used a speaker-disjoint training, validation, and test split.
0:06:15 In Table 1 of the paper that we reference here, we can see these
0:06:20 systems. They are labeled A07 through A19, there are thirteen of them, and they
0:06:26 have different characteristics. For example, they use different types of vocoders: some
0:06:31 use Griffin-Lim, while others use high-quality
0:06:34 neural vocoders such as WaveNet or WaveRNN.
0:06:43 So again, we explore two types of MOS prediction neural networks. In the first
0:06:48 case, we have all of the machinery that comes with the original MOSNet, which
0:06:52 has different types of architectures: for example, there is one version that has a bidirectional LSTM,
0:06:57 another version that has a CNN,
0:07:00 and another version that has a CNN plus BLSTM combination.
0:07:04 Their code release provides all of the architectures, and we explored
0:07:10 all the different hyperparameters.
0:07:14 In addition to the original MOSNet, we introduce our low-capacity CNN,
0:07:19 which we use to operate on our different representations, such as the x-vectors, the
0:07:25 deep spectrum features, and the acoustic model embeddings.
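A rough sketch of what a low-capacity CNN MOS regressor of this kind might look like; the layer sizes are assumptions, not the actual configuration:

```python
import torch
import torch.nn as nn

class LowCapacityMOSCNN(nn.Module):
    """Sketch: small 1-D conv stack + global average pooling -> scalar MOS.
    Works on framewise features or a single fixed vector (frames == 1)."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.out = nn.Linear(32, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, in_dim, frames)
        return self.out(self.conv(x).squeeze(-1)).squeeze(-1)

# e.g. predicting MOS from a batch of 512-dim x-vectors:
# mos = LowCapacityMOSCNN(512)(torch.randn(8, 512, 1))
```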
0:07:32 So now we're going to talk about some of the findings from
0:07:36 our experiments.
0:07:38 First, we used four different correlation metrics.
0:07:44 Each of them has a different range and different tradeoffs, and each might be useful for different
0:07:50 types of problems. We wanted to keep in mind that previous work used these;
0:07:54 we also report MSE and the Kendall tau correlation. To start, we
0:07:59 have the linear correlation coefficient, the LCC,
0:08:02 also known as Pearson's r. It is a value that ranges between negative one and one depending
0:08:07 on the correlation, with one being highly correlated.
0:08:10 Next is the Spearman rank correlation coefficient. One of its benefits is that it is non-parametric, and
0:08:16 the values again range between negative one and one. We also use mean
0:08:21 squared error. MSE is not ideal, as it fails to capture distributional information such as outliers.
0:08:26 And lastly, we have the Kendall tau rank correlation coefficient, which is useful in this
0:08:31 task because it captures orderings and is a little bit more robust to error sensitivity
0:08:36 than the Spearman rank correlation coefficient.
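All four metrics are available in SciPy; a small helper along these lines (a sketch, not the actual evaluation code) computes them for a pair of true and predicted MOS vectors:

```python
import numpy as np
from scipy import stats

def mos_agreement(true_mos, pred_mos):
    """LCC, SRCC, Kendall tau, and MSE between true and predicted MOS."""
    t, p = np.asarray(true_mos, float), np.asarray(pred_mos, float)
    return {
        "LCC":  stats.pearsonr(t, p)[0],    # linear correlation, [-1, 1]
        "SRCC": stats.spearmanr(t, p)[0],   # non-parametric rank correlation
        "KTAU": stats.kendalltau(t, p)[0],  # ordinal agreement, robust to errors
        "MSE":  float(np.mean((t - p) ** 2)),
    }
```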
0:08:42 So here is the table with our first set of results.
0:08:46 This is the correlation between the ground-truth MOS scores from the LA
0:08:52 dataset and our predicted MOS scores from our different systems, aggregated in two different ways.
0:09:00 One is at the system level,
0:09:03 and the other is at the speaker level. In this work, we are particularly interested in
0:09:07 how different speakers contribute to the overall quality of a TTS system, so we focus
0:09:12 our discussion on the speaker-level results.
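Aggregation here simply means averaging per-utterance predictions within each group before computing correlations; a minimal sketch, where the grouping key can be a system ID or a speaker ID:

```python
from collections import defaultdict

def aggregate_mos(utt_scores, group_ids):
    """Average per-utterance MOS within each group (system ID or speaker ID)."""
    groups = defaultdict(list)
    for score, gid in zip(utt_scores, group_ids):
        groups[gid].append(score)
    return {gid: sum(v) / len(v) for gid, v in groups.items()}

# speaker-level: aggregate_mos(pred, speaker_ids) vs aggregate_mos(true, speaker_ids)
```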
0:09:15 From left to right,
0:09:16 we have the different systems and the different representations, starting with this first column:
0:09:22 the pre-trained voice conversion CNN. This is the pre-trained model that comes with the
0:09:26 original MOSNet; it is trained on voice conversion data, and here
0:09:31 we have applied it to the LA dataset. What we see is that there is
0:09:35 almost no correlation between the pre-trained model's predictions and the TTS data.
0:09:41 When we retrained the MOSNet CNN-BLSTM architecture on the LA dataset and
0:09:48 then evaluated it again on a held-out portion of the LA data, we get
0:09:53 much higher correlation. We can see that the MOSNet architecture itself is
0:09:59 fine, except that it needed to be retrained on TTS data.
0:10:03 Then we have our other representations: we compare our set of x-vectors, the
0:10:09 deep spectrum features, and the acoustic model embeddings.
0:10:12 These were
0:10:15 run on our low-capacity CNN, which we trained from scratch, so there are no pre-trained
0:10:20 models in this experiment.
0:10:23 What we find, when we consider all the different correlation metrics at the speaker-
0:10:28 level aggregation, is that x-vector 5 is the best representation.
0:10:33 Recall that x-vector 5 models device quality, so this does make some intuitive sense.
0:10:38 It is also worth mentioning that the MOSNet CNN-BLSTM architecture
0:10:43 retrained on the ASVspoof LA data also performs quite well.
0:10:50 Here we want to characterize
0:10:52 the best and worst TTS systems.
0:10:57 For example, using the ground truth, we identified that system A08 is
0:11:02 the poorest quality system. It has a mean MOS of 1.75,
0:11:06 and it is an HMM-based TTS system, so it makes sense that this might
0:11:12 be the worst performing system.
0:11:14 We also identified the best performing system as a high-quality system with a higher mean
0:11:20 MOS
0:11:22 of 5.58, and that is in fact the WaveRNN TTS
0:11:25 system.
0:11:26 So now let's listen to some examples of what this speech sounds like. What
0:11:32 we see here in the plot is that the true ground-truth MOS
0:11:35 labels
0:11:37 have quite a spread, between one and six, while what is being predicted by
0:11:43 our systems
0:11:46 falls in a very narrow band: we have a range from about 2.5
0:11:52 to 3.5. So it is a very narrow range.
0:11:56 [audio sample: "dialogue is the key"]
0:11:59 Okay, that was the WaveRNN, and here's the HMM:
0:12:02 [audio sample: "today will tell"]
0:12:04 It's got a little bit more distortion.
0:12:08 Next, we also want to characterize the best and worst speakers,
0:12:13 and here things get a little bit tricky. So we have the best system
0:12:20 and the worst system, which we just heard:
0:12:25 the WaveRNN and the HMM.
0:12:30 We also have the best speaker and the worst speaker, which we identified. The
0:12:35 best speaker in the LA dataset, based on the ground truth, is the speaker labeled
0:12:40 0048, and the worst is 0040.
0:12:44 Now we look at what the true MOS is. When we look at
0:12:49 the best system with the worst speaker, and the worst system with the best speaker, the true MOS
0:12:53 has quite a big gap. However, the predicted mean opinion score from our
0:13:01 model
0:13:03 shows a much narrower difference,
0:13:06 and the ordinal ranking is also reversed. Let's listen to some examples.
0:13:14 [audio sample: "the culture here has changed dramatically in the past five or six years"]
0:13:18 So that was the best system with the worst speaker.
0:13:21 [audio sample: "today will tell"]
0:13:23 That was the worst system with the best speaker, and they sound
0:13:28 somewhat close. Let's listen to them again.
0:13:31 [audio sample: "the culture here has changed dramatically in the past five or six years"]
0:13:37 [audio sample: "today will tell"]
0:13:38 Okay, so the fact that we're hearing some closeness
0:13:43 may correspond to the narrow range of scores predicted by our system.
0:13:53 Next, importantly, we want to talk about a post-hoc analysis that we did: how well
0:13:57 does the MOSNet that we trained generalize to a completely held-out TTS system with
0:14:04 held-out data?
0:14:07 For this, we have the LibriTTS dataset, which is audiobook
0:14:12 data. The full set is large: it has 586 hours and over
0:14:17 two thousand speakers, and the data did undergo some cleaning from Google.
0:14:22 We used a small subset, which we trained our TTS system on:
0:14:27 sixty hours of male and female data from 45 speakers, balanced
0:14:33 across the two genders, and approximately 37,000 utterances that we
0:14:38 trained our TTS system on.
0:14:43 The system that we use for this is DCTTS, otherwise known as deep convolutional
0:14:48 TTS,
0:14:50 with one-hot speaker codes incorporated into the system.
0:14:54 This TTS system consists of a text-to-mel network, but it also has a
0:14:59 spectrogram super-resolution network, and the audio comes from Griffin-Lim, which you will hear
0:15:04 in the next slide.
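For reference, the Griffin-Lim inversion step can be done with librosa; a hedged sketch, with STFT parameters that are illustrative rather than the exact DCTTS settings:

```python
import librosa

def magnitude_to_audio(mag, hop_length=256, win_length=1024, n_iter=60):
    """Invert a predicted linear magnitude spectrogram to a waveform by
    iteratively estimating phase with Griffin-Lim."""
    return librosa.griffinlim(mag, n_iter=n_iter,
                              hop_length=hop_length, win_length=win_length)
```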
0:15:07 We applied the MOSNet to the speech synthesized
0:15:13 by our system
0:15:15 and aggregated at the speaker level.
0:15:18 What we do see, again, is that the best representation,
0:15:23 as far as the correlation metrics go, is x-vector 5,
0:15:27 which is the device quality one, as mentioned before.
0:15:30 However, the correlation overall is quite poor, so we cannot say that the MOSNet
0:15:38 approach
0:15:39 is working very well on this dataset, even though we have identified a better representation
0:15:44 to use compared to the others.
0:15:51 So even though the MOSNet does not generalize well to this new system,
0:15:56 when we use our best performing representation, x-vector 5, we can capture some
0:16:03 relative speaker rankings.
0:16:05 [audio sample plays]
0:16:11 That would be the worst speaker synthesized by our system. Here is a midrange speaker:
0:16:16 [audio sample plays]
0:16:20 And here is your best:
0:16:21 [audio sample plays]
0:16:28 Then we looked at the two systems side by
0:16:32 side:
0:16:34 our DCTTS trained on LibriTTS
0:16:40 and the WaveNet system from
0:16:44 the LA data.
0:16:47 What we see is that the speakers in each system
0:16:53 contribute differently to the overall performance of the system. There are some speakers that
0:16:59 are just outstanding in both systems,
0:17:03 and some speakers are generally much worse. Now, let's take the worst performing speaker
0:17:09 from our DCTTS trained on LibriTTS and the worst performing speaker in
0:17:14 the LA TTS, and put them side by side.
0:17:17 Let's listen to that.
0:17:20 [audio sample plays]
0:17:23 That was the DCTTS case.
0:17:26 [audio sample plays]
0:17:31 Okay, so both are actually quite poor. What we find is that by selectively
0:17:39 choosing the speakers to evaluate a system on, one could artificially lower or boost
0:17:46 the overall system score. So selecting only the speakers that
0:17:50 are performing very well would boost the
0:17:55 overall system score artificially.
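A toy sketch of that effect: restricting the system-level average to only the best speakers inflates the score.

```python
def system_score(per_speaker_mos: dict, keep_best: int = None) -> float:
    """Mean speaker-level MOS; restricting to the best speakers inflates it."""
    scores = sorted(per_speaker_mos.values(), reverse=True)
    if keep_best is not None:
        scores = scores[:keep_best]
    return sum(scores) / len(scores)

# toy example: full speaker set vs a cherry-picked subset
mos = {"spk1": 4.1, "spk2": 3.8, "spk3": 2.2, "spk4": 1.9}
print(system_score(mos))               # 3.0  (honest estimate)
print(system_score(mos, keep_best=2))  # 3.95 (artificially boosted)
```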
0:18:01 In conclusion, what we determined is that the overall approach to MOS prediction
0:18:05 is sound, but the correlation between true and predicted scores could be improved.
0:18:12 The MOS prediction model
0:18:14 trained on the LA dataset
0:18:16 does not generalize well
0:18:18 to a held-out TTS system and data.
0:18:21 And we did find that some representations are better suited to this task, while
0:18:26 other representations are just not well suited to it.
0:18:30 We have made two tools available on GitHub.
0:18:33 The first is the MOS estimation low-capacity CNN using the x-vector 5
0:18:38 device quality
0:18:41 extractor, and we provide our pre-trained model.
0:18:45 The second tool is the
0:18:48 original MOSNet architecture with the pre-trained model that was retrained
0:18:53 on the LA dataset. So where the original MOSNet provided a pre-trained model for voice conversion,
0:18:58 we're providing a pre-trained model for TTS.
0:19:03 Some of the future directions are to look at predicting speaker similarity.
0:19:09 We also think it would be interesting to use distances between speaker vectors
0:19:13 to predict
0:19:14 the MOS score.
0:19:17 We think that it would be important to reformulate this task
0:19:22 as a MUSHRA or A/B preference test.
0:19:26 And finally, we would like to incorporate automatic MOS estimation into the TTS training process.
0:19:35 Thank you very much for listening to the talk, and we hope you enjoy the paper.