Hello, and welcome to our talk today on "Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis."
It's a pleasure to be with you today, and I also want to thank my co-authors Joanna Rownicka, Pilar Oplustil, and Simon King.
The problem that we want to solve in this research is how to develop a neural network that will output a mean opinion score (MOS) given some synthetic speech as input.
The motivations for this work are to speed up the TTS development cycle, to save the time and money spent on listening tests, and to predict TTS quality automatically.
There have been several attempts to develop an automatic MOS estimation system. P.563 was developed for coded telephony speech, and it shows no correlation with text-to-speech quality. PESQ was also developed for telephony speech; it requires a high-quality reference and is based on comparing degraded speech to natural speech, so TTS errors are not always captured by this approach. AutoMOS is interesting, but it was limited to only the Google TTS systems, and its ground truth is based on multiple MOS tests conducted over a period of time.
Quality-Net was introduced in 2018; this was for speech enhancement, and it is limited to the TIMIT dataset.
In 2019, MOSNet was introduced. It is for estimating the quality of voice conversion systems, and we found that its pre-trained models do not generalize well to text-to-speech.
The two main contributions of this paper are: we retrain the original MOSNet framework using TTS data and explore frame-based weighting in the loss function, and we train a new low-capacity CNN architecture on the TTS dataset and compare different types of speech representations.
Finally, we characterize MOS prediction performance based on speaker global ranking.
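To make the frame-based weighting concrete, here is a minimal sketch of a MOSNet-style objective in which every frame-level prediction is pulled toward the utterance-level MOS, with a weight alpha on the frame term. The function name and the exact weighting scheme are my illustration of the idea, not necessarily the paper's exact formulation.

```python
import torch

def mosnet_style_loss(frame_scores: torch.Tensor, true_mos: float, alpha: float = 1.0):
    """Utterance-level MSE plus an alpha-weighted frame-level MSE term.

    frame_scores: (T,) per-frame MOS predictions for one utterance.
    true_mos:     scalar ground-truth MOS for that utterance.
    """
    utt_pred = frame_scores.mean()                        # utterance-level prediction
    utt_loss = (utt_pred - true_mos) ** 2                 # utterance-level error
    frame_loss = ((frame_scores - true_mos) ** 2).mean()  # each frame targets the utterance MOS
    return utt_loss + alpha * frame_loss
```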
Our approach is basically in two phases. In Phase 1, we explore speech representations and train MOS networks, re-training the original MOSNet using TTS data, and we determine the best representation and model for TTS. In Phase 2, we train a new multi-speaker TTS system, conduct a small MOS listening test, and apply the trained model from Phase 1 to analyze generalization. The question we want to answer in Phase 2 is: does our best model developed in Phase 1 generalize to a new TTS system?
One of our main contributions is that we explore different types of speech representations in a low-capacity CNN architecture, which we developed to handle these new representations.
We have five different types of x-vector extractors. These are just like regular x-vectors; however, during training of the extractor, the weights are adjusted so that the target is not necessarily speaker ID. We have one version of the x-vector extractor where the target is in fact the speaker ID. Then we have another type of x-vector where the target is a categorical variable representing room size from the ASVspoof dataset. We have another type of x-vector that models oracle values for the T60 reverb time, another one that models the talker-to-ASV distance and the attacker-to-talker distance, and finally one that models the replay device quality.
We also have deep spectrum features, which are the output of an ImageNet-pretrained model operating on the entire spectrogram as an image, so this is a very high-dimensional representation.
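As a rough illustration of this kind of feature, the following sketch feeds a spectrogram, tiled to three channels, through a pretrained ImageNet model and keeps the penultimate activations. The choice of resnet18 is my assumption for the sketch, not necessarily the backbone used in the paper.

```python
import torch
import torchvision.models as models

# Pretrained ImageNet backbone with the classifier head removed.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

def deep_spectrum_features(spec: torch.Tensor) -> torch.Tensor:
    """spec: (freq, time) magnitude spectrogram -> fixed-size feature vector."""
    img = spec.unsqueeze(0).repeat(3, 1, 1).unsqueeze(0)  # (1, 3, F, T) pseudo-image
    with torch.no_grad():
        return resnet(img).squeeze(0)                     # 512-dim for resnet18
```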
Next we have the acoustic model embeddings. Those come from an ASR system; again, they are 512-dimensional, the same as the x-vectors.
Finally, there is the original MOSNet, which uses frame-based features, and we retrain it on the LA data.
When we train these different types of x-vector extractors, we use as targets the speaker ID, room size, T60 reverb time, distances, and replay device quality. The purpose is to use the extractors to model different types of environments and attackers, as labeled in the ASVspoof 2019 physical access dataset. We are getting these labels for free from the physical access data, and we want to use them to model speech degradation, with the hope that this will also apply as a way of modeling degradation in text-to-speech.
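A minimal sketch of what such an extractor could look like: an x-vector-style frame stack with statistics pooling, where only num_classes changes with the chosen ASVspoof PA label. The layer sizes here are illustrative assumptions, not the exact recipe.

```python
import torch
import torch.nn as nn

class XVectorStyleExtractor(nn.Module):
    """x-vector-style extractor whose training target (and num_classes) is swappable:
    speaker ID, room size, T60 bin, distance, or replay device quality."""

    def __init__(self, feat_dim=30, embed_dim=512, num_classes=3):
        super().__init__()
        self.frame_layers = nn.Sequential(                  # simplified TDNN stack
            nn.Conv1d(feat_dim, 512, kernel_size=5), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.embedding = nn.Linear(2 * 1500, embed_dim)     # after stats pooling
        self.head = nn.Linear(embed_dim, num_classes)       # target-specific head

    def forward(self, x):                                   # x: (batch, feat_dim, frames)
        h = self.frame_layers(x)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # statistics pooling
        emb = self.embedding(stats)                         # the representation we keep
        return self.head(emb), emb
```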
Then we train our low-capacity CNNs and the retrained MOSNet on the logical access (LA) dataset from the ASVspoof 2019 challenge; this is the evaluation portion of the LA dataset. It consists of 48 unique VCTK speakers. Importantly, there are thirteen different TTS and VC systems in this dataset, so we get a wide range of TTS and voice conversion quality, and most of the systems are in fact text-to-speech. Also importantly, they were evaluated with human judgements, and the ground truth judgement is a mean opinion score on a one-to-ten-point scale from the same rating task, so we don't have the problem of aggregating different MOS tests collected over time; this alleviates that problem.
We used a speaker-disjoint train/dev/test split.
In Table 1 of the paper that we reference here, you can see these systems, labeled A07 through A19; there are thirteen of them, and they have different characteristics. For example, they use different types of vocoders: some use Griffin-Lim, while others use high-quality neural vocoders such as WaveNet and WaveRNN.
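For anyone implementing something similar, a speaker-disjoint split can be obtained almost for free by grouping on the speaker label. The tiny metadata table below is made up purely for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical metadata: one row per utterance, with its speaker label.
utterances = np.array(["u1.wav", "u2.wav", "u3.wav", "u4.wav"])
speakers   = np.array(["LA_0048", "LA_0048", "LA_0040", "LA_0040"])

# Grouping on speaker guarantees no speaker appears in both partitions.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_idx, test_idx = next(splitter.split(utterances, groups=speakers))
assert set(speakers[train_idx]).isdisjoint(speakers[test_idx])
```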
So again, we explore two types of MOS prediction neural networks. In the first case, we have all of the machinery that comes with the original MOSNet, which has different types of architectures: for example, there is one version that has a bidirectional LSTM, another version that has a CNN, and another version that has a CNN plus BLSTM combination. Their code was released, so we obtained all of the architectures and explored all of the different hyperparameters.
In addition to the original MOSNet, we introduce our low-capacity CNN, which we use to operate on our different representations, such as the x-vectors, the deep spectrum features, and the acoustic model embeddings.
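Here is a rough sketch of what a low-capacity CNN over a fixed-size 512-dimensional representation might look like; all layer sizes are my assumptions, since the talk doesn't give the exact configuration.

```python
import torch.nn as nn

class LowCapacityCNN(nn.Module):
    """Small CNN MOS predictor over a fixed-size embedding (sizes are illustrative)."""

    def __init__(self, input_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Flatten(),
            nn.Linear(16 * (input_dim // 16), 64), nn.ReLU(),
            nn.Linear(64, 1),                      # scalar MOS prediction
        )

    def forward(self, x):                          # x: (batch, input_dim) embedding
        return self.net(x.unsqueeze(1)).squeeze(-1)
```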
So now we're going to talk about some of the findings that we got from our experiments.
First, we used four different correlation metrics. Each of them has different ranges and different tradeoffs and might be useful for different types of problems; we wanted to keep them in mind since previous work has used them, and we also introduce the Kendall Tau correlation. To start, we have the linear correlation coefficient, the LCC, also known as Pearson's r; it is a value that ranges between negative one and one, with one being highly correlated. The Spearman rank correlation coefficient has the benefit of being non-parametric, and its values again range between negative one and positive one. We also use mean squared error; it is not ideal, as it fails to capture distributional information such as outliers. And we have the Kendall Tau rank correlation coefficient, which is useful on this task because it captures ordinal ratings and is a little bit more robust to error sensitivity than the Spearman rank correlation coefficient.
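All four metrics are available off the shelf in scipy and numpy; a small helper like the following computes exactly the quantities discussed.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def mos_metrics(true_mos, pred_mos):
    """The four metrics discussed: LCC, SRCC, MSE, and Kendall's Tau."""
    true_mos, pred_mos = np.asarray(true_mos), np.asarray(pred_mos)
    return {
        "LCC":  pearsonr(true_mos, pred_mos)[0],   # linear (Pearson) correlation
        "SRCC": spearmanr(true_mos, pred_mos)[0],  # non-parametric rank correlation
        "MSE":  np.mean((true_mos - pred_mos) ** 2),
        "KTAU": kendalltau(true_mos, pred_mos)[0], # ordinal, robust to small errors
    }
```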
So here is the first table of our first set of results. This is the correlation between the ground truth MOS scores from the LA dataset and our predicted MOS scores from our different systems, aggregated in two different ways: one is at the system level, and the other is at the speaker level. In this work we are particularly interested in how different speakers contribute to the overall quality of a TTS system, so we focus our discussion on the speaker-level results.
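The two aggregations amount to a simple group-by over the per-utterance scores; the system and speaker names and values in this toy frame are invented purely to show the mechanics.

```python
import pandas as pd

# Hypothetical results: one row per synthesized utterance.
df = pd.DataFrame({
    "system":  ["sysA", "sysA", "sysB", "sysB"],
    "speaker": ["spk1", "spk2", "spk1", "spk2"],
    "true":    [1.8, 1.7, 5.5, 5.6],   # ground-truth MOS (made-up values)
    "pred":    [2.6, 2.5, 3.4, 3.5],   # predicted MOS (made-up values)
})

system_level  = df.groupby("system")[["true", "pred"]].mean()   # system-level aggregation
speaker_level = df.groupby("speaker")[["true", "pred"]].mean()  # speaker-level aggregation
```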
From left to right, we have the different systems and the different representations. Starting with this first column, the pre-trained voice conversion CNN: this is the pre-trained model that comes with the original MOSNet, and as you know, it is trained on voice conversion data; here we have applied it to the LA dataset. What we see is that there is almost no correlation between the pre-trained model's predictions and the TTS data.
When we retrained the MOSNet CNN-BLSTM structure on the LA dataset and then evaluated it again on a held-out portion of the LA data, we got much higher correlation. We can see that the MOSNet architecture is fine, except that it needed to be retrained on TTS data.
Then we have our different representations, where we compare our set of x-vector extractors, deep spectrum features, and acoustic model embeddings. These were run with our low-capacity CNN, which we trained from scratch, so there are no pre-trained models in this experiment.
What we find, when we consider all the different correlation metrics at the speaker-level aggregation, is that xvec5 appears to be the best representation. Recall that xvec5 models device quality, so this does make some intuitive sense. And it's worth mentioning that the retrained CNN-BLSTM MOSNet structure also performs quite well.
Here we want to characterize some of the best and worst TTS systems. For example, using the ground truth, we identified that system A08 is the worst-quality system: it has a mean MOS score of 1.75, and it is an HMM-based TTS system, so it makes sense that this might be the worst-performing system. We also identified the best-performing system as a high-quality one, having a higher mean MOS of 5.58, and that is in fact the WaveRNN TTS system.
So now let's listen to some examples of what this speech sounds like. What we see here in the plot is that the ground truth MOS labels have quite a spread, between 1 and 6.5, while what is being predicted by our systems falls in a very narrow band: we have this range from about 2.5 to 3.5, so it is very narrow.
[audio sample: "Dialogue is the key."]
Okay, that was the WaveRNN, and here's the HMM:
[audio sample: "Today will tell."]
It's got a little bit more dullness.
Next, we also want to characterize the best and worst speakers, and here things get a little bit tricky. We have the best system, the WaveRNN, and the worst system, A08, the HMM, which we just heard. We also have the best speaker and the worst speaker, which we identified as follows: the best speaker in the LA dataset, based on the ground truth, is the speaker labeled 0048, and the worst is 0040. Now, when we look at the true MOS in terms of best system and best speaker versus worst system and worst speaker, the true MOS has quite a big gap; however, our model's predicted mean opinion scores are much narrower in their difference.
Also, the ordinal ranking is reversed. Let's listen to some examples:
[audio sample: "... has changed dramatically in the past five or six years."]
So that was the best system with the worst speaker.
[audio sample: "Today will tell."]
And that was the worst system with the best speaker. They sound somewhat close, so let's listen to them again:
[audio sample: "... has changed dramatically in the past five or six years."]
[audio sample: "Today will tell."]
Okay. The fact that we're hearing some closeness may correspond to the narrow range of scores predicted by our system.
Next, and importantly, we want to talk about a post-hoc analysis that we did: how well does the MOSNet that we trained generalize to a completely held-out TTS system with held-out data?
For this we have the LibriTTS dataset, which is audiobook data. The full set is large: it has about 585 hours from over two thousand speakers, and it did undergo some cleaning from Google. We trained our TTS system on a small subset: 60 hours of male and female speech from 45 speakers, so it is balanced across the two genders, and we have approximately 37,000 utterances that we trained our TTS system on.
The system that we chose is DCTTS, otherwise known as deep convolutional TTS, with one-hot speaker codes incorporated into the system. This TTS system consists of a text-to-mel network, but it also has a spectrogram super-resolution network, and the audio is generated with Griffin-Lim, so we will hear the Griffin-Lim in the next slide.
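For reference, Griffin-Lim inversion of a magnitude spectrogram is essentially a one-liner in librosa; the STFT parameters and example audio below are illustrative, not the paper's settings.

```python
import numpy as np
import librosa

# Load any example audio, take its magnitude spectrogram, then invert it
# with Griffin-Lim, as DCTTS does after its super-resolution network.
y, sr = librosa.load(librosa.ex("trumpet"))
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
y_hat = librosa.griffinlim(S, n_iter=60, hop_length=256)  # phase-less reconstruction
```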
We applied the MOS networks to the synthesized speech from our system and aggregated at the speaker level. What we do see, again, is that the best representation as far as the correlation metrics go is xvec5, which is the device quality one, just as before. However, the correlation overall is quite poor, so we cannot say that the estimator is working very well on this dataset, even though we have identified a better representation to use compared to the others. So even though the MOSNet doesn't generalize well to this new system, when we use our best-performing representation, xvec5, we can capture some relative speaker rankings.
[audio sample]
So that was the worst speaker synthesized by our system. Here is a mid-range speaker:
[audio sample]
And here is our best:
[audio sample]
Then, when we look at them side by side, we have our DCTTS with Griffin-Lim trained on LibriTTS against the WaveNet system from the LA data. What we see is that the speakers in each system contribute differently to the overall performance of the system: there are some speakers that are just outstanding in both systems, and some speakers are generally much worse. Now let's take the worst-performing speaker of our DCTTS trained on LibriTTS and the worst-performing speaker in the LA TTS and put them side by side. Let's listen to that:
[audio sample]
[audio sample]
Okay, so both are actually quite poor. What we find is that, by selectively choosing the speakers on which a system is evaluated, one could artificially lower or boost the overall system score: selecting only the speakers that are performing very well would boost the overall system score artificially.
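To see how much cherry-picking speakers can move a system score, consider this toy example with invented per-speaker MOS values.

```python
import pandas as pd

# Illustrative per-speaker MOS for one system (values are made up).
per_speaker = pd.Series({"spk_a": 4.6, "spk_b": 3.9, "spk_c": 2.1, "spk_d": 1.8})

honest_score  = per_speaker.mean()             # all speakers: 3.10
cherry_picked = per_speaker.nlargest(2).mean() # best two only: 4.25
print(f"all speakers: {honest_score:.2f}, best two: {cherry_picked:.2f}")
```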
So, in conclusion, what we determined is that the overall approach for doing MOS prediction is sound, but the correlation between true and predicted scores could be improved. The MOS prediction model trained on the LA dataset does not generalize well to a held-out TTS system and data. And we did find that some representations are better suited for this task, while other representations are just not well suited to it.
We have made two tools available on GitHub. The first is the MOS estimation low-capacity CNN using the xvec5 device quality extractor, and we provide our pre-trained model. The second tool is the original MOSNet structure with a pre-trained model that has been retrained on the LA dataset; so where the original MOSNet provided a pre-trained model for voice conversion, we are providing a pre-trained model for TTS.
Some of the future directions are to look at predicting speaker similarity. We also think it would be interesting to use distances between x-vectors to predict the MOS score. We think it would be important to reformulate this task as a MUSHRA or A/B preference test. And finally, we would like to incorporate automatic MOS estimation into the TTS training process.
Thank you very much for listening to the talk, and we hope you enjoy the paper.