Okay, so I'm pleased to introduce the next guest speaker, Keiichi Tokuda from Nagoya Institute of Technology. He's extremely well known, but for those who don't know, he's the pioneer of statistical speech synthesis, in particular HMM-based speech synthesis. Most speech recognition researchers regard speech synthesis as a messy problem, and perhaps that's the reason why he will talk about a statistical formulation of speech synthesis in this presentation.
Okay. To realize speech synthesis systems, many approaches have been proposed. Before the 1990s, rule-based formant synthesis was mainly studied; in this approach, the synthesis parameters are determined by hand-crafted rules. After the 1990s, the corpus-based concatenative speech synthesis approach became dominant, and state-of-the-art speech synthesis systems based on unit selection can generate natural-sounding speech. In recent years, the statistical parametric speech synthesis approach has gained popularity. It has several advantages, such as flexibility in voice characteristics, a small footprint, automatic voice building, and so on. But the most important advantage of the statistical approach is that we can use mathematically well-defined models and algorithms.
In this talk, I would like to discuss how we can formulate, and understand, the whole speech synthesis process, including speech feature extraction, acoustic modeling, text processing, and so on, in a unified statistical framework.
Okay, the basic problem of speech synthesis can be stated as shown here. We have a speech database, that is, a set of texts and the corresponding speech waveforms. Given a text to be synthesized, we generate the speech waveform corresponding to that text. The problem can be represented by this equation, and it can be solved by estimating the predictive distribution given these variables, and then drawing samples from the predictive distribution. Basically, it's quite simple.
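As a hedged sketch of the equation on the slide (the symbols here are my assumptions: x is the waveform to be synthesized, w the input text, and X, W the database waveforms and texts):

$$\hat{x} \sim p\left(x \mid w, \mathcal{X}, \mathcal{W}\right)$$

That is, the synthesized waveform is a sample drawn from the predictive distribution given the input text and the speech database.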
However, estimating the predictive distribution directly is very hard, so we have to introduce an acoustic model, for example an HMM. This part corresponds to the training part, and this part corresponds to the generation part.
First, I'd like to discuss the generation part. As we know, modeling speech waveforms directly with acoustic models is very difficult, so we have to introduce a parametric representation of the speech waveform, o; for example, cepstrum or mel-cepstrum is used, together with F0. Accordingly, the generation part is decomposed into these two terms. We also know that the text should be converted to labels, because the same text can have multiple pronunciations, parts of speech, lexical stresses, or other information. So the generation part is decomposed into these three terms: text processing, speech parameter generation from the acoustic model, and speech waveform reconstruction.
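A hedged reconstruction of this decomposition (assuming l denotes the labels, o the speech parameters, and λ the acoustic model):

$$p(x \mid w, \mathcal{X}, \mathcal{W}) \approx \sum_{l}\iint p(x \mid o)\,p(o \mid l,\lambda)\,P(l \mid w)\,p(\lambda \mid \mathcal{X}, \mathcal{W})\;do\,d\lambda$$

Here P(l | w) is the text processing term, p(o | l, λ) the parameter generation term, p(x | o) the waveform reconstruction term, and p(λ | X, W) the training term.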
It is difficult to perform the integrations and summations over all the variables, so we approximate them by joint maximization, as shown here. However, joint maximization is still hard, so it is approximated by a step-by-step maximization problem. This maximization corresponds to text analysis, and this one corresponds to speech parameter generation from the acoustic model.
I talked about the generation part, but the training part also requires a parametric representation of the speech waveform and labels. Accordingly, the training part can be approximated by a step-by-step maximization problem in a similar manner to the generation part: labeling of the speech database, feature extraction from the speech database, and acoustic model training. As a result, the original problem is decomposed into these sub-problems; these are for the training part and those are for the generation part: feature extraction from the speech database, labeling, and acoustic model training; then text analysis of the text to be synthesized and speech parameter generation from the acoustic model; and finally, we reconstruct the speech waveform by sampling from this distribution.
Okay, I have just talked about the mathematical formulation. In the following, I'd like to explain each component step by step, then show examples to demonstrate the flexibility of the statistical approach, and finally give some discussion and conclusions. Okay.
This is the overview of an HMM-based speech synthesis system. The training part is similar to that used in HMM-based speech recognition systems. The essential difference is that the state output vector includes not only spectrum parameters, for example mel-cepstrum, but also excitation parameters, that is, F0 parameters. On the other hand, the synthesis part does the inverse operation of speech recognition. That is, phoneme HMMs are concatenated according to the labels derived from the text to be synthesized; a sequence of speech parameters, spectrum parameters and F0 parameters, is determined in such a way that its output probability for the HMM is maximized; and finally, the speech waveform is synthesized by using a speech synthesis filter. Each part corresponds to one of the sub-problems: feature extraction and model training; text analysis for the text to be synthesized; speech parameter generation from the trained acoustic model; and speech waveform reconstruction.
First, I'd like to talk about speech feature extraction and speech waveform reconstruction, which correspond to these sub-problems. They are based on the source-filter model of human speech production. In this presentation, I assume that the system function H(z) is represented by mel-cepstral coefficients, that is, frequency-warped cepstral coefficients, defined by this equation. The frequency warping function defined by this first-order allpass system function gives a good approximation to auditory frequency scales with an appropriate choice of α.
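A sketch of the representation being described, under the usual mel-cepstral analysis notation (the notation is my assumption):

$$H(z) = \exp \sum_{m=0}^{M} \tilde{c}(m)\,\tilde{z}^{-m}, \qquad \tilde{z}^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}$$

For 16 kHz sampled speech, α ≈ 0.42 makes the warping a good approximation to the mel scale.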
Assuming x is a short segment of a speech waveform, and that x is a Gaussian process, we determine the mel-cepstrum c in such a way that its likelihood with respect to x is maximized. This is just the maximum likelihood estimation of the mel-cepstral coefficients. Because the log likelihood of x is convex with respect to c, the solution can easily be obtained by an iterative algorithm.
Okay, to resynthesize speech, H(z) is controlled according to the estimated mel-cepstrum and excited by a pulse train and white noise for voiced and unvoiced segments, respectively. This is the pulse train, and this is white noise; the excitation signal is generated based on the voiced/unvoiced information and the F0 extracted from the original speech. This is the original speech: [sample plays]. And this is the excitation signal: [sample plays]. It has the same F0 as the original speech. By exciting the speech synthesis filter, controlled by the mel-cepstral coefficient vectors, with this excitation signal, we can reconstruct the speech waveform: [sample plays].
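A minimal runnable sketch of this pulse/noise excitation scheme (the function name and frame settings are illustrative, not from the talk):

```python
import numpy as np

def make_excitation(f0, frame_len=80, fs=16000):
    """Pulse train in voiced frames, white Gaussian noise in unvoiced ones.
    f0: per-frame F0 in Hz, with 0 marking unvoiced frames."""
    e = np.zeros(len(f0) * frame_len)
    next_pulse = 0.0
    for i, f in enumerate(f0):
        start = i * frame_len
        if f > 0:  # voiced: pulses spaced by the pitch period
            period = fs / f
            while next_pulse < start + frame_len:
                e[int(next_pulse)] = np.sqrt(period)  # roughly unit power
                next_pulse += period
        else:      # unvoiced: white noise
            e[start:start + frame_len] = np.random.randn(frame_len)
            next_pulse = start + frame_len
    return e
```

Feeding this excitation through the synthesis filter H(z), for mel-cepstra typically the MLSA filter, gives back the waveform.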
So now the problem is how we can generate such speech parameters from the text to be synthesized with the corresponding acoustic model.
Okay, next I'd like to talk about this maximization problem, which corresponds to acoustic modeling. This is a hidden Markov model, an HMM, with a left-to-right topology, as used in speech recognition systems. We use the same structure for speech synthesis. Please note that the state output probability is defined as a single Gaussian; that's enough for speech synthesis, because we are using a speaker-dependent model. For speech synthesis, as I explained, we need to model not only spectrum parameters but also F0 parameters to resynthesize the speech waveform, so the state output vector consists of a spectrum part and an F0 part. The spectrum part consists of the mel-cepstral coefficient vector and its delta and delta-delta, and the F0 part consists of F0 and its delta and delta-delta.
The problem in modeling F0 with an HMM is that we cannot apply the conventional discrete or continuous output distributions, because the F0 value is not defined in unvoiced regions. That is, the observation sequence of F0 is composed of one-dimensional continuous values and a discrete symbol which represents "unvoiced". Several heuristic methods have been investigated for handling the unvoiced regions, for example interpolating the gaps or substituting random values for the unvoiced regions. To model this kind of observation sequence in a statistically correct manner, we have defined a new kind of HMM; we refer to it as the multi-space probability distribution HMM, or MSD-HMM. It includes the discrete HMM and the continuous mixture HMM as special cases, and furthermore, it can model sequences of observation vectors with variable dimensionality, including discrete symbols. This shows the structure of an MSD-HMM specialized for F0 modeling: each state has weights, which represent the probabilities of voiced and unvoiced, and a continuous distribution for voiced observations. An EM algorithm can easily be derived for training this type of HMM.
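A hedged sketch of the state output probability of such an MSD-HMM for F0 (notation assumed):

$$b_j(o_t) = \begin{cases} w_j^{\mathrm{v}}\,\mathcal{N}\!\left(o_t;\,\mu_j,\sigma_j^2\right), & o_t \text{ voiced (a continuous } F_0 \text{ value)} \\ w_j^{\mathrm{uv}}, & o_t = \text{the discrete unvoiced symbol} \end{cases}$$

with $w_j^{\mathrm{v}} + w_j^{\mathrm{uv}} = 1$, so the voiced/unvoiced decision and the F0 value are modeled jointly.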
Okay, by combining the spectrum part and the F0 part, the state output distribution has a multi-stream structure like this.
Okay, now I'd like to talk about the model structure. In speech recognition, the preceding and succeeding phone identities are regarded as contexts. On the other hand, in speech synthesis, the current phone identity can also be a context, because, unlike in speech recognition, it is already known; it is not something to be recognized. Furthermore, there are many other contextual factors that affect the spectrum, F0, and duration, as shown here: for example, the number of phones in the syllable, the position of the current syllable in the current word, part of speech, other linguistic information, and so on. Since there are too many combinations, it is difficult to have all possible models. To avoid this problem, in the same manner as HMM-based speech recognition, we use context-dependent HMMs and apply a decision-tree-based context clustering technique.
In this figure, HTK-style triphone labels are shown. However, in the case of speech synthesis, the label is very long because it includes all of this information, so we also list many other questions about this information.
Okay, since each of the spectrum and F0 has its own influential contextual factors, the spectrum and F0 should be clustered independently. This results in a stream-dependent context clustering structure.
In the standard HMM, the state duration probability, implicitly determined by the self-transition probability, decreases exponentially with increasing duration. However, this is too simple to control the temporal structure of the speech parameter sequence. Therefore, we assume that the state durations are Gaussian. Note that an HMM with explicit duration models is called a hidden semi-Markov model, or HSMM, and we need a special type of EM algorithm for the parameter estimation of this model.
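In equations, a plausible sketch of the change: the implicit HMM duration model $p_j(d) = a_{jj}^{\,d-1}(1 - a_{jj})$, which decays exponentially with d, is replaced in the HSMM by an explicit Gaussian,

$$p_j(d) = \mathcal{N}\!\left(d;\, m_j, \sigma_j^2\right).$$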
Okay, as a result, the state durations of each HMM are modeled by a three-dimensional Gaussian, and the context-dependent three-dimensional Gaussians are clustered by a decision tree. So now we have seven decision trees in this example: three for the spectrum, that is, the mel-cepstrum; three for F0; and one for duration.
Okay, next I'd like to talk about the second maximization problem, which corresponds to speech parameter generation from the acoustic model. By concatenating context-dependent HMMs according to the labels derived from the text to be synthesized, a sentence HMM can be constructed. For the given sentence HMM, we determine the speech parameter vector sequence o which maximizes the output probability P(o | λ). This equation can be approximated by this one, where the summation is approximated by maximization, and it can be further decomposed into two maximization problems: first, we determine the state sequence q-hat, and then we determine the speech parameter vector sequence o-hat for the fixed state sequence q-hat. The first problem can be solved very easily: because the state durations are modeled by Gaussians, the solution is simply given by the means of the Gaussians. Unfortunately, the direct solution of the second problem is inappropriate for synthesizing speech.
This is an example of parameter generation from an HMM composed by concatenating phoneme HMMs. Each vertical dotted line represents a state boundary. We assume that the covariance matrices are diagonal, so each state has its mean and variance; for example, this horizontal dotted line represents the mean of this state, and the shaded area represents the variance of this state. By maximizing the output probability, the parameter sequence becomes the mean vector sequence, resulting in a stepwise function like this, because this is the most likely sequence for the sequence of state output Gaussians. These jumps cause discontinuities in the synthetic speech.
To avoid the problem, we assume that each state output vector o consists of the mel-cepstral coefficient vector c and its dynamic feature vectors, delta and delta-delta, which correspond to the first and second derivatives of the speech parameter vector c and can be calculated as linear combinations of the neighboring speech parameter vectors. Most speech recognition systems also use this type of speech parameters.
The relationship between c and its delta and delta-delta can be arranged in matrix form, as shown here: o is the vector consisting of the mel-cepstral coefficient vectors and their deltas and delta-deltas, c includes all the mel-cepstral coefficient vectors for the utterance, and W is the matrix for calculating the dynamic features. Under this constraint, o = Wc, maximizing P with respect to o is equivalent to maximizing it with respect to c. Thus, by setting the derivative equal to zero, we obtain a set of linear equations, which can be written in matrix form.
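A hedged reconstruction of that system of equations, writing μ and Σ for the means and (diagonal) covariances of the Gaussians along the fixed state sequence:

$$\hat{c} = \arg\max_{c}\; \mathcal{N}\!\left(Wc;\, \mu_{\hat q}, \Sigma_{\hat q}\right) \quad\Longrightarrow\quad W^{\top}\Sigma_{\hat q}^{-1}W\,\hat{c} = W^{\top}\Sigma_{\hat q}^{-1}\mu_{\hat q}$$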
The dimensionality of the equation is very high, for example tens of thousands, because c includes all the mel-cepstral coefficient vectors for the utterance. Fortunately, by using the special structure of this matrix, which is very sparse, it can be solved by a fast algorithm.
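A minimal sketch of this generation step for a single feature dimension, assuming the common delta windows (-0.5, 0, 0.5) and (1, -2, 1); a production system would exploit the band structure with a Cholesky solver rather than a generic sparse solve:

```python
import numpy as np
from scipy.sparse import lil_matrix, csr_matrix, diags
from scipy.sparse.linalg import spsolve

def mlpg(mu, var):
    """mu, var: (T, 3) per-frame Gaussian means/variances of
    [static, delta, delta-delta], read off the fixed state sequence.
    Returns the smooth static trajectory c maximizing N(Wc; mu, var)."""
    T = len(mu)
    W = lil_matrix((3 * T, T))
    for t in range(T):
        W[3 * t, t] = 1.0                                       # static
        if 0 < t < T - 1:
            W[3 * t + 1, [t - 1, t + 1]] = [-0.5, 0.5]          # delta
            W[3 * t + 2, [t - 1, t, t + 1]] = [1.0, -2.0, 1.0]  # delta-delta
    W = csr_matrix(W)
    P = diags(1.0 / var.reshape(-1))          # Sigma^{-1} (diagonal)
    A = (W.T @ P @ W).tocsc()                 # banded normal equations
    b = W.T @ P @ mu.reshape(-1)
    return spsolve(A, b)
```

For multi-dimensional cepstra the same solve is done once per dimension, since the covariances are diagonal.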
Okay, this is an example of parameter generation from a sentence HMM using dynamic feature parameters. This shows the trajectory of the second coefficient of the generated mel-cepstrum sequence, and these show its delta and delta-delta, which correspond to the first and second derivatives of the trajectory. These three trajectories are constrained by each other and determined simultaneously by maximizing the total output probability. As a result, the trajectory is constrained to be realistic, as defined by the statistics of the static and dynamic features.
You may have noticed that P(Wc | q, λ) is improper as a distribution of c, because it is not normalized with respect to c. Interestingly, by normalizing it with respect to c, we can derive a new type of trajectory model, which we call the trajectory HMM. I'm sorry, but I won't go into the details in this presentation.
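In symbols, the normalization being described is plausibly

$$\bar{p}(c \mid q, \lambda) = \frac{p(Wc \mid q, \lambda)}{\displaystyle\int p(Wc' \mid q, \lambda)\,dc'}$$

which turns the constrained HMM into a proper distribution over static feature trajectories.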
Okay, these figures show the spectra calculated from the mel-cepstrum vectors generated without and with the dynamic feature parameters, respectively. It can be seen that by taking the dynamic feature parameters into account, a smoothly varying spectral sequence can be obtained. These show the generated F0 patterns: without the dynamic features, the generated F0 sequence becomes a stepwise function; on the other hand, by taking the dynamic features into account, we can generate F0 trajectories which approximate natural F0 patterns.
Okay, now I would like to play some synthesized speech samples to demonstrate the effect of the dynamic features in speech parameter generation. This one was synthesized from the model trained with both static and dynamic features; this one without the spectrum dynamic features; this one without the F0 dynamic features; and this one without both the spectrum and F0 dynamic features. The first one: [sample plays]. Without the spectrum dynamic features, you may perceive frequent discontinuities: [sample plays]. Without the F0 dynamic features, you may perceive a different type of discontinuity: [sample plays]. Without both, you may perceive serious discontinuities: [sample plays]. And again with both: [sample plays]. From these examples, we can see the importance of the dynamic features in HMM-based speech synthesis.
Okay, in the next part I'd like to show some examples to demonstrate the flexibility of the statistical approach. First, an example of emotional speech synthesis. I'm sorry, this is a very old demo, so the speech quality is poor. This sample was synthesized from a model trained with neutral speech, and this one from a model trained with angry speech. Again, I'm sorry that it's in Japanese; this is the English translation. First, the neutral one: [sample plays]. It has flat prosody. And from the angry model: [sample plays]. Another sentence, neutral: [sample plays], and angry: [sample plays]. It sounds like he's angry. We can see that by training the system with a small amount of emotional speech data, we can synthesize emotional speech very easily; it's not necessary to handcraft heuristic rules for emotional speech.
Next, I'll show an example of speaker adaptation in speech synthesis. We applied the speaker adaptation techniques used in speech recognition, for example MLLR, to the synthesis system. Here, the speaker-independent model is the average model, and it was adapted to a target speaker with a small amount of data; this is the adapted model. Okay, this sample was synthesized from the speaker-independent model: [sample plays]. I'm sorry, it's in Japanese. This is synthesized speech, but it has the averaged voice characteristics of the speaker-independent model. And this was synthesized from the adapted model, with four utterances: [sample plays], and with fifty utterances: [sample plays]. Let me play them again: the speaker-independent model [sample plays], adapted with four utterances [sample plays], adapted with fifty utterances [sample plays]. If these three sound very similar, it means that the system can mimic the target speaker's voice using a very small amount of adaptation data. And we have another sample, maybe in a famous person's voice: [sample plays]. Can you tell who he is? Yes, you're right. Please note that this was done by Junichi Yamagishi at CSTR of the University of Edinburgh, and it was synthesized by the system adapted to his voice.
Okay, the next example is speaker interpolation in speech synthesis. When we have several speaker-dependent HMM sets, by interpolating among the HMM parameters, the means and variances, we can generate a new HMM set which corresponds to a new voice. In this case, we have two speaker-dependent models: one trained by a female speaker, and one trained by a male speaker.
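One common form of this interpolation (a sketch; the details differ across variants) mixes the Gaussians of the K speaker models with ratios a_k:

$$\mu = \sum_{k=1}^{K} a_k\,\mu_k, \qquad \Sigma = \sum_{k=1}^{K} a_k^2\,\Sigma_k, \qquad \sum_{k=1}^{K} a_k = 1$$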
Okay, let me play the speech samples: synthesized from the female model [sample plays], and this was synthesized from the male speaker's model [sample plays]. We can interpolate between these two models with an arbitrary interpolation ratio. This sample is at the center of the two models: [sample plays]. We cannot tell whether the speaker is male or female. And we can change the interpolation ratio linearly within an utterance, from female to male: [sample plays]. It sounds like a female at first and a male at the end.
This is the same, except that we have four speaker-dependent models: the first speaker [sample plays], the second speaker [sample plays], the third speaker [sample plays], and the fourth speaker [sample plays]. This is at the center of these four speakers: [sample plays]. And again we can change the interpolation ratio gradually: [sample plays].
It is interesting, but how can it be used? If we train each model with a specific speaking style, we can interpolate among speaking styles, too; this could be useful for spoken dialogue systems. In this case, we have two models: one trained with a neutral reading voice, and one trained with a high-tension voice, by the same speaker.
Okay, first the neutral voice: [sample plays]. And the high-tension model: [sample plays]. If you feel it's too much, we can adjust the degree of the expression by interpolating between the two models, for example this one: [sample plays]. And we can also extrapolate the two models. Let me replay all of them in this order: [samples play]. Please note that it's not just changing the average F0; the whole prosody is changed.
Okay, the next example is eigenvoices. The eigenvoice technique was developed for very fast speaker adaptation in speech recognition; in speech synthesis, it can be used for creating new voices. This represents the weights for the eigenvoices; by adjusting them, we can find a favorite voice. Each eigenvoice, the first eigenvoice, the second eigenvoice, and so on, may correspond to a specific voice characteristic.
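A hedged sketch of the construction: stacking all the state means of a voice model into a supervector, a new voice is a point in the space spanned by the first few eigenvectors of the training speakers' supervectors,

$$\mu^{\text{new}} = \bar{\mu} + \sum_{j=1}^{J} w_j\, e_j,$$

where the w_j are exactly the per-eigenvoice weights being adjusted here.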
Let me play some speech samples. The first eigenvoice with a negative weight: [sample plays]. And now with a positive weight for the first eigenvoice: [sample plays]. I'm sorry, this is with the maximum value of the weight. The second eigenvoice with a negative weight: [sample plays], and with a positive weight: [sample plays]. By setting the weights appropriately, we can generate various voices and find our favorite voice: [samples play]. I hope this one is better. Anyway, this shows the flexibility of the statistical approach to speech synthesis.
Okay, similarly to other corpus-based approaches, the HMM-based system has a very compact language-dependent part, so it can easily be applied to other languages. I'd like to play some of them: Japanese [sample plays], English [sample plays], Chinese [sample plays], Korean [sample plays], and Finnish [sample plays]. And this is also English, but trained with a baby voice: [sample plays].
And now, the next example shows that even a singing voice can be used as the training data; as a result, the system can sing any piece of music with his or her voice. This is one of the training data: [sample plays]. She's a semi-professional singer. And this sample was synthesized using the trained acoustic models, for a song she has never sung: [sample plays]. Maybe it sounds natural, but it is entirely synthesized.
Okay, this is the final part. I'd like to show the basic problem of speech synthesis again, this one. Solving this problem directly, based on this equation, is ideal, but we have to decompose it into tractable sub-problems, because the direct solution is not feasible with the currently available computational resources. However, we can relax the approximations. For example, by marginalizing over the model parameters, for example the HMM acoustic model parameters, we can derive a variational Bayesian acoustic modeling technique for speech synthesis. Or, by marginalizing over the labels, we can derive joint front-end and back-end model training, where the front end means the text processing and the back end the acoustic model. Or, by including the speech waveform generation part in the statistical model, we can also derive a better waveform-level statistical model. Anyway, please note that these kinds of improved techniques can be derived based on this equation, which represents the basic problem of speech synthesis.
Okay, to summarize this presentation: I have talked about the statistical formulation of the speech synthesis problem. The whole speech synthesis process is described in a unified statistical framework, and it gives us a unified view and reveals what is correct and what is wrong. Another point I should emphasize is the importance of the database. As future work, we still have many problems which we should solve based on the equation which represents the speech synthesis problem. Okay, this is the final slide: is speech synthesis a messy problem? No, I don't think so. I would be happy if many speech recognition researchers joined speech synthesis research; it must be very helpful to our research area. That's all, thank you very much.
Thanks for such a good talk. We have some time for questions. Michael?
Thank you very much for a wonderful talk on speech synthesis. At some point in the future, I guess we won't even have to have our presenters give presentations anymore; we could just synthesize them. (I would like to do that; then it wouldn't be me speaking.) One of the things you alluded to at the end of your talk, I was wondering if you could elaborate a little bit more. One of the problems you can still hear on some of the examples you played, to a certain extent depending upon the speaker, is the quality of the final waveform generation. I'm just wondering if you could say a few words about some of the current techniques that are being looked at in order to improve the quality of the waveform generation. The model you showed at the beginning of the talk is still a relatively simple excitation-plus-spectral-envelope model, and I know people have looked at fancier stuff. I'm just wondering if you have some comments as to what you think are interesting, promising directions to improve the quality of the waveform generation.
I didn't mention it, but in the newest system we are using the STRAIGHT vocoding technique, and it improves the speech quality very much. However, I'm afraid that it's not based on the statistical framework, so I would like to include that kind of model; the vocoding part should be included in this equation, this one. But currently we still use many approximations. For example, the formulation, exciting the synthesis filter with Gaussian white noise, is correct for unvoiced segments; however, it's not appropriate for periodic, voiced segments. So we need a more sophisticated statistical speech waveform generation model, and I believe that can solve that kind of problem.
Hi, yes, I have a couple of questions related to the smoothing of the cepstral coefficients you talked about. The use of the deltas and double deltas gives you the smoothed cepstral coefficients; how important is that relative to, say, generating static coefficients and then applying a moving average filter, or some smoothing like that?
Okay, good question. I may have a slide for that... I'm sorry, it seems it's not included in the slides, but anyway: the delta is very effective, and the delta-delta is not so effective. Of course, we can apply some heuristics, moving average filtering or something like that, and it is still effective, but I have results showing that using the delta and delta-delta parameters gives higher MOS scores.
I'm sorry, I can't find it. One other follow-up question on that: when you set up those linear equations, do you weight all the different dimensions equally? The deltas, double deltas, and static coefficients, are those all weighted equally when you solve the least-squares equations?
No, there are no weights; we have no ad hoc weighting or scaling operation. We just have the definition of the output probability, and we simply maximize it, so each dimension is effectively weighted by its inverse variance from the model.
So I have a question. Obviously, we've borrowed HMMs from speech recognition for text-to-speech. In speech recognition it's a frequent comment, and it came up in various talks, this whole question of how good an HMM is as a model of speech; the received wisdom is either that it's kind of terrible, or that it's terrible but so tractable that it's useful. Given its use in speech synthesis, do you think that the success of this technique in fact demonstrates that HMMs are a good model of speech? Because I think the quality is far higher than anybody would have believed possible. And what follows from that, for this workshop, about how we should build models?
Yeah, that's a very good question. Anyway, we have been organizing an evaluation campaign comparing synthesis systems, and we have found that the intelligibility of the HMM-based systems is almost perfect, almost comparable to that of natural speech. But the naturalness is still insufficient compared with natural speech, maybe due to the prosody. So I believe that we have to improve the prosodic part within the statistical framework. Human speech conveys various non-verbal information through prosody, and that cannot be done in the current speech synthesis systems; that kind of aspect should be included.
Your talk was very nice. I want to follow a little bit along Paul's line of questioning, because I was thinking about your final call to the ASR community to join you in this. One of the things you see a lot with HMMs in the speech field is that everybody's moving towards discriminative models of various kinds, and the nice thing about the HMMs for the synthesis problem is that it really is a generative problem, so in some ways the model matches a little better, which is sort of what Paul was touching on. So do you see, moving forward in synthesis, that discriminative techniques are going to be playing a part, or do you think that generative models are definitely going to be the right way to model this kind of thing?
Yeah, a good question. Discriminative training does not quite fit: for generating speech utterances, it is not necessary to discriminate. Another point is that we could set a specific objective function based on human perception, somewhat like discriminative training does in speech recognition. But anyway, in speech synthesis the basic problem is generation, so we can concentrate on generative models; it's not necessary to tackle discriminative training. That is a nice point of speech synthesis research. But I do want to do that kind of optimization in a statistical framework; by changing the hyperparameters or the model structure, we can do that within the statistical framework.
And I've got a related question: so you generate the maximum likelihood sequence, but if you have a really good generative model, we'd really like to sample stochastically.
Actually, we are doing something like that, because, as shown here, these are the given variables, this is the speech waveform, and this is the predictive distribution; we do sample the speech waveform. Exciting the speech synthesis filter with Gaussian white noise is exactly sampling. And the speech parameters, the mel-cepstrum and F0, are marginalized out in this equation; as an approximation, we generate them with the maximum likelihood criterion. It's just an approximation, but the criterion itself is sampling. Does that make sense?
Well, I guess I'm wondering whether it's a good approximation. Yes, and that's the reason why we want to reduce or remove, that is, relax, those approximations in future work.
Okay, I think it's time to close, so let's thank the speaker.