Okay, so I'm pleased to introduce the next guest speaker, Keiichi Tokuda from Nagoya Institute of Technology. He's extremely well known, but for those who don't know, he's the pioneer of statistical speech synthesis, in particular HMM-based speech synthesis. Most speech recognition researchers regard speech synthesis as a messy problem, and perhaps that's the reason why he will talk about a statistical formulation of speech synthesis in this presentation.
Okay. To realize speech synthesis systems, many approaches have been proposed. Before the 1990s, rule-based formant synthesis was mainly studied; in this approach, the synthesis parameters are determined by hand-crafted rules. After the 1990s, the corpus-based concatenative speech synthesis approach became dominant, and state-of-the-art speech synthesis systems based on unit selection can generate natural-sounding speech. In recent years, the statistical parametric speech synthesis approach has gained popularity. It has several advantages, such as flexibility in voice characteristics, a small footprint, automatic voice building, and so on. But the most important advantage of the statistical approach is that we can use mathematically well-defined models and algorithms.
In this talk, I would like to discuss how we can formulate, and understand, the whole speech synthesis process, including speech feature extraction, acoustic modeling, text processing, and so on, in a unified statistical framework.
Okay, the basic problem of speech synthesis can be stated as shown here. We have a speech database, that is, a set of texts and the corresponding speech waveforms. Given a text to be synthesized, we generate the speech waveform corresponding to that text. The problem can be represented by this equation, and it can be solved by estimating the predictive distribution given these variables, and then drawing samples from the predictive distribution. Basically, it's quite simple.
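As a hedged sketch of the equation on the slide (the symbols here are my assumptions: x is the waveform to be synthesized, w the input text, and X, W the database waveforms and texts):

$$\hat{x} \sim p\left(x \mid w, \mathcal{X}, \mathcal{W}\right)$$

That is, the synthesized waveform is a sample drawn from the predictive distribution given the input text and the speech database.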
However, estimating the predictive distribution directly is very hard, so we have to introduce an acoustic model, for example an HMM. This part corresponds to the training part, and this part corresponds to the generation part.
First, I'd like to discuss the generation part. As we know, modeling speech waveforms directly with acoustic models is very difficult, so we have to introduce a parametric representation of the speech waveform, o; for example, cepstrum or mel-cepstrum is used, together with F0. Accordingly, the generation part is decomposed into these two terms. We also know that the text should be converted to labels, because the same text can have multiple pronunciations, parts of speech, lexical stresses, or other information. So the generation part is decomposed into these three terms: text processing, speech parameter generation from the acoustic model, and speech waveform reconstruction.
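A hedged reconstruction of this decomposition (assuming l denotes the labels, o the speech parameters, and λ the acoustic model):

$$p(x \mid w, \mathcal{X}, \mathcal{W}) \approx \sum_{l}\iint p(x \mid o)\,p(o \mid l,\lambda)\,P(l \mid w)\,p(\lambda \mid \mathcal{X}, \mathcal{W})\;do\,d\lambda$$

Here P(l | w) is the text processing term, p(o | l, λ) the parameter generation term, p(x | o) the waveform reconstruction term, and p(λ | X, W) the training term.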
It is difficult to perform the integrations and summations over all the variables, so we approximate them by joint maximization, as shown here. However, joint maximization is still hard, so it is approximated by a step-by-step maximization problem. This maximization corresponds to text analysis, and this one corresponds to speech parameter generation from the acoustic model.
I talked about the generation part, but the training part also requires a parametric representation of the speech waveform and labels. Accordingly, the training part can be approximated by a step-by-step maximization problem in a similar manner to the generation part: labeling of the speech database, feature extraction from the speech database, and acoustic model training. As a result, the original problem is decomposed into these sub-problems; these are for the training part and those are for the generation part: feature extraction from the speech database, labeling, and acoustic model training; then text analysis of the text to be synthesized and speech parameter generation from the acoustic model; and finally, we reconstruct the speech waveform by sampling from this distribution.
Okay, I have just talked about the mathematical formulation. In the following, I'd like to explain each component step by step, then show examples to demonstrate the flexibility of the statistical approach, and finally give some discussion and conclusions. Okay.
This is the overview of an HMM-based speech synthesis system. The training part is similar to that used in HMM-based speech recognition systems. The essential difference is that the state output vector includes not only spectrum parameters, for example mel-cepstrum, but also excitation parameters, that is, F0 parameters. On the other hand, the synthesis part does the inverse operation of speech recognition. That is, phoneme HMMs are concatenated according to the labels derived from the text to be synthesized; a sequence of speech parameters, spectrum parameters and F0 parameters, is determined in such a way that its output probability for the HMM is maximized; and finally, the speech waveform is synthesized by using a speech synthesis filter. Each part corresponds to one of the sub-problems: feature extraction and model training; text analysis for the text to be synthesized; speech parameter generation from the trained acoustic model; and speech waveform reconstruction.
First, I'd like to talk about speech feature extraction and speech waveform reconstruction, which correspond to these sub-problems. They are based on the source-filter model of human speech production. In this presentation, I assume that the system function H(z) is represented by mel-cepstral coefficients, that is, frequency-warped cepstral coefficients, defined by this equation. The frequency warping function defined by this first-order allpass system function gives a good approximation to auditory frequency scales with an appropriate choice of α.
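A sketch of the representation being described, under the usual mel-cepstral analysis notation (the notation is my assumption):

$$H(z) = \exp \sum_{m=0}^{M} \tilde{c}(m)\,\tilde{z}^{-m}, \qquad \tilde{z}^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}$$

For 16 kHz sampled speech, α ≈ 0.42 makes the warping a good approximation to the mel scale.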
Assuming x is a short segment of a speech waveform, and that x is a Gaussian process, we determine the mel-cepstrum c in such a way that its likelihood with respect to x is maximized. This is just the maximum likelihood estimation of the mel-cepstral coefficients. Because the log likelihood of x is convex with respect to c, the solution can easily be obtained by an iterative algorithm.
Okay, to resynthesize speech, H(z) is controlled according to the estimated mel-cepstrum and excited by a pulse train and white noise for voiced and unvoiced segments, respectively. This is the pulse train, and this is white noise; the excitation signal is generated based on the voiced/unvoiced information and the F0 extracted from the original speech. This is the original speech: [sample plays]. And this is the excitation signal: [sample plays]. It has the same F0 as the original speech. By exciting the speech synthesis filter, controlled by the mel-cepstral coefficient vectors, with this excitation signal, we can reconstruct the speech waveform: [sample plays].
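A minimal runnable sketch of this pulse/noise excitation scheme (the function name and frame settings are illustrative, not from the talk):

```python
import numpy as np

def make_excitation(f0, frame_len=80, fs=16000):
    """Pulse train in voiced frames, white Gaussian noise in unvoiced ones.
    f0: per-frame F0 in Hz, with 0 marking unvoiced frames."""
    e = np.zeros(len(f0) * frame_len)
    next_pulse = 0.0
    for i, f in enumerate(f0):
        start = i * frame_len
        if f > 0:  # voiced: pulses spaced by the pitch period
            period = fs / f
            while next_pulse < start + frame_len:
                e[int(next_pulse)] = np.sqrt(period)  # roughly unit power
                next_pulse += period
        else:      # unvoiced: white noise
            e[start:start + frame_len] = np.random.randn(frame_len)
            next_pulse = start + frame_len
    return e
```

Feeding this excitation through the synthesis filter H(z), for mel-cepstra typically the MLSA filter, gives back the waveform.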
So now the problem is how we can generate such speech parameters from the text to be synthesized with the corresponding acoustic model.
Okay, next I'd like to talk about this maximization problem, which corresponds to acoustic modeling. This is a hidden Markov model, an HMM, with a left-to-right topology, as used in speech recognition systems. We use the same structure for speech synthesis. Please note that the state output probability is defined as a single Gaussian; that's enough for speech synthesis, because we are using a speaker-dependent model. For speech synthesis, as I explained, we need to model not only spectrum parameters but also F0 parameters to resynthesize the speech waveform, so the state output vector consists of a spectrum part and an F0 part. The spectrum part consists of the mel-cepstral coefficient vector and its delta and delta-delta, and the F0 part consists of F0 and its delta and delta-delta.
The problem in modeling F0 with an HMM is that we cannot apply the conventional discrete or continuous output distributions, because the F0 value is not defined in unvoiced regions. That is, the observation sequence of F0 is composed of one-dimensional continuous values and a discrete symbol which represents "unvoiced". Several heuristic methods have been investigated for handling the unvoiced regions, for example interpolating the gaps or substituting random values for the unvoiced regions. To model this kind of observation sequence in a statistically correct manner, we have defined a new kind of HMM; we refer to it as the multi-space probability distribution HMM, or MSD-HMM. It includes the discrete HMM and the continuous mixture HMM as special cases, and furthermore, it can model sequences of observation vectors with variable dimensionality, including discrete symbols. This shows the structure of an MSD-HMM specialized for F0 modeling: each state has weights, which represent the probabilities of voiced and unvoiced, and a continuous distribution for voiced observations. An EM algorithm can easily be derived for training this type of HMM.
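A hedged sketch of the state output probability of such an MSD-HMM for F0 (notation assumed):

$$b_j(o_t) = \begin{cases} w_j^{\mathrm{v}}\,\mathcal{N}\!\left(o_t;\,\mu_j,\sigma_j^2\right), & o_t \text{ voiced (a continuous } F_0 \text{ value)} \\ w_j^{\mathrm{uv}}, & o_t = \text{the discrete unvoiced symbol} \end{cases}$$

with $w_j^{\mathrm{v}} + w_j^{\mathrm{uv}} = 1$, so the voiced/unvoiced decision and the F0 value are modeled jointly.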
Okay, by combining the spectrum part and the F0 part, the state output distribution has a multi-stream structure like this.
Okay, now I'd like to talk about the model structure. In speech recognition, the preceding and succeeding phone identities are regarded as contexts. On the other hand, in speech synthesis, the current phone identity can also be a context, because, unlike in speech recognition, it is already known; it is not something to be recognized. Furthermore, there are many other contextual factors that affect the spectrum, F0, and duration, as shown here: for example, the number of phones in the syllable, the position of the current syllable in the current word, part of speech, other linguistic information, and so on. Since there are too many combinations, it is difficult to have all possible models. To avoid this problem, in the same manner as HMM-based speech recognition, we use context-dependent HMMs and apply a decision-tree-based context clustering technique.
In this figure, HTK-style triphone labels are shown. However, in the case of speech synthesis, the label is very long because it includes all of this information, so we also list many other questions about this information.
Okay, since each of the spectrum and F0 has its own influential contextual factors, the spectrum and F0 should be clustered independently. This results in a stream-dependent context clustering structure.
In the standard HMM, the state duration probability, implicitly determined by the self-transition probability, decreases exponentially with increasing duration. However, this is too simple to control the temporal structure of the speech parameter sequence. Therefore, we assume that the state durations are Gaussian. Note that an HMM with explicit duration models is called a hidden semi-Markov model, or HSMM, and we need a special type of EM algorithm for the parameter estimation of this model.
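In equations, a plausible sketch of the change: the implicit HMM duration model $p_j(d) = a_{jj}^{\,d-1}(1 - a_{jj})$, which decays exponentially with d, is replaced in the HSMM by an explicit Gaussian,

$$p_j(d) = \mathcal{N}\!\left(d;\, m_j, \sigma_j^2\right).$$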
Okay, as a result, the state durations of each HMM are modeled by a three-dimensional Gaussian, and the context-dependent three-dimensional Gaussians are clustered by a decision tree. So now we have seven decision trees in this example: three for the spectrum, that is, the mel-cepstrum; three for F0; and one for duration.
Okay, next I'd like to talk about the second maximization problem, which corresponds to speech parameter generation from the acoustic model. By concatenating context-dependent HMMs according to the labels derived from the text to be synthesized, a sentence HMM can be constructed. For the given sentence HMM, we determine the speech parameter vector sequence o which maximizes the output probability P(o | λ). This equation can be approximated by this one, where the summation is approximated by maximization, and it can be further decomposed into two maximization problems: first, we determine the state sequence q-hat, and then we determine the speech parameter vector sequence o-hat for the fixed state sequence q-hat. The first problem can be solved very easily: because the state durations are modeled by Gaussians, the solution is simply given by the means of the Gaussians. Unfortunately, the direct solution of the second problem is inappropriate for synthesizing speech.
This is an example of parameter generation from an HMM composed by concatenating phoneme HMMs. Each vertical dotted line represents a state boundary. We assume that the covariance matrices are diagonal, so each state has its mean and variance; for example, this horizontal dotted line represents the mean of this state, and the shaded area represents the variance of this state. By maximizing the output probability, the parameter sequence becomes the mean vector sequence, resulting in a stepwise function like this, because this is the most likely sequence for the sequence of state output Gaussians. These jumps cause discontinuities in the synthetic speech.
To avoid the problem, we assume that each state output vector o consists of the mel-cepstral coefficient vector c and its dynamic feature vectors, delta and delta-delta, which correspond to the first and second derivatives of the speech parameter vector c and can be calculated as linear combinations of the neighboring speech parameter vectors. Most speech recognition systems also use this type of speech parameters.
The relationship between c and its delta and delta-delta can be arranged in matrix form, as shown here: o is the vector consisting of the mel-cepstral coefficient vectors and their deltas and delta-deltas, c includes all the mel-cepstral coefficient vectors for the utterance, and W is the matrix for calculating the dynamic features. Under this constraint, o = Wc, maximizing P with respect to o is equivalent to maximizing it with respect to c. Thus, by setting the derivative equal to zero, we obtain a set of linear equations, which can be written in matrix form.
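A hedged reconstruction of that system of equations, writing μ and Σ for the means and (diagonal) covariances of the Gaussians along the fixed state sequence:

$$\hat{c} = \arg\max_{c}\; \mathcal{N}\!\left(Wc;\, \mu_{\hat q}, \Sigma_{\hat q}\right) \quad\Longrightarrow\quad W^{\top}\Sigma_{\hat q}^{-1}W\,\hat{c} = W^{\top}\Sigma_{\hat q}^{-1}\mu_{\hat q}$$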
The dimensionality of the equation is very high, for example tens of thousands, because c includes all the mel-cepstral coefficient vectors for the utterance. Fortunately, by using the special structure of this matrix, which is very sparse, it can be solved by a fast algorithm.
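A minimal sketch of this generation step for a single feature dimension, assuming the common delta windows (-0.5, 0, 0.5) and (1, -2, 1); a production system would exploit the band structure with a Cholesky solver rather than a generic sparse solve:

```python
import numpy as np
from scipy.sparse import lil_matrix, csr_matrix, diags
from scipy.sparse.linalg import spsolve

def mlpg(mu, var):
    """mu, var: (T, 3) per-frame Gaussian means/variances of
    [static, delta, delta-delta], read off the fixed state sequence.
    Returns the smooth static trajectory c maximizing N(Wc; mu, var)."""
    T = len(mu)
    W = lil_matrix((3 * T, T))
    for t in range(T):
        W[3 * t, t] = 1.0                                       # static
        if 0 < t < T - 1:
            W[3 * t + 1, [t - 1, t + 1]] = [-0.5, 0.5]          # delta
            W[3 * t + 2, [t - 1, t, t + 1]] = [1.0, -2.0, 1.0]  # delta-delta
    W = csr_matrix(W)
    P = diags(1.0 / var.reshape(-1))          # Sigma^{-1} (diagonal)
    A = (W.T @ P @ W).tocsc()                 # banded normal equations
    b = W.T @ P @ mu.reshape(-1)
    return spsolve(A, b)
```

For multi-dimensional cepstra the same solve is done once per dimension, since the covariances are diagonal.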
Okay, this is an example of parameter generation from a sentence HMM using dynamic feature parameters. This shows the trajectory of the second coefficient of the generated mel-cepstrum sequence, and these show its delta and delta-delta, which correspond to the first and second derivatives of the trajectory. These three trajectories are constrained by each other and determined simultaneously by maximizing the total output probability. As a result, the trajectory is constrained to be realistic, as defined by the statistics of the static and dynamic features.
You may have noticed that P(Wc | q, λ) is improper as a distribution of c, because it is not normalized with respect to c. Interestingly, by normalizing it with respect to c, we can derive a new type of trajectory model, which we call the trajectory HMM. I'm sorry, but I won't go into the details in this presentation.
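In symbols, the normalization being described is plausibly

$$\bar{p}(c \mid q, \lambda) = \frac{p(Wc \mid q, \lambda)}{\displaystyle\int p(Wc' \mid q, \lambda)\,dc'}$$

which turns the constrained HMM into a proper distribution over static feature trajectories.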
Okay, these figures show the spectra calculated from the mel-cepstrum vectors generated without and with the dynamic feature parameters, respectively. It can be seen that by taking the dynamic feature parameters into account, a smoothly varying spectral sequence can be obtained. These show the generated F0 patterns: without the dynamic features, the generated F0 sequence becomes a stepwise function; on the other hand, by taking the dynamic features into account, we can generate F0 trajectories which approximate natural F0 patterns.
Okay, now I would like to play some synthesized speech samples to demonstrate the effect of the dynamic features in speech parameter generation. This one was synthesized from the model trained with both static and dynamic features; this one without the spectrum dynamic features; this one without the F0 dynamic features; and this one without both the spectrum and F0 dynamic features. The first one: [sample plays]. Without the spectrum dynamic features, you may perceive frequent discontinuities: [sample plays]. Without the F0 dynamic features, you may perceive a different type of discontinuity: [sample plays]. Without both, you may perceive serious discontinuities: [sample plays]. And again with both: [sample plays]. From these examples, we can see the importance of the dynamic features in HMM-based speech synthesis.
Okay, in the next part I'd like to show some examples to demonstrate the flexibility of the statistical approach. First, an example of emotional speech synthesis. I'm sorry, this is a very old demo, so the speech quality is poor. This sample was synthesized from a model trained with neutral speech, and this one from a model trained with angry speech. Again, I'm sorry that it's in Japanese; this is the English translation. First, the neutral one: [sample plays]. It has flat prosody. And from the angry model: [sample plays]. Another sentence, neutral: [sample plays], and angry: [sample plays]. It sounds like he's angry. We can see that by training the system with a small amount of emotional speech data, we can synthesize emotional speech very easily; it's not necessary to handcraft heuristic rules for emotional speech.
Next, I'll show an example of speaker adaptation in speech synthesis. We applied the speaker adaptation techniques used in speech recognition, for example MLLR, to the synthesis system. Here, the speaker-independent model is the average model, and it was adapted to a target speaker with a small amount of data; this is the adapted model. Okay, this sample was synthesized from the speaker-independent model: [sample plays]. I'm sorry, it's in Japanese. This is synthesized speech, but it has the averaged voice characteristics of the speaker-independent model. And this was synthesized from the adapted model, with four utterances: [sample plays], and with fifty utterances: [sample plays]. Let me play them again: the speaker-independent model [sample plays], adapted with four utterances [sample plays], adapted with fifty utterances [sample plays]. If these three sound very similar, it means that the system can mimic the target speaker's voice using a very small amount of adaptation data. And we have another sample, maybe in a famous person's voice: [sample plays]. Can you tell who he is? Yes, you're right. Please note that this was done by Junichi Yamagishi at CSTR of the University of Edinburgh, and it was synthesized by the system adapted to his voice.
Okay, the next example is speaker interpolation in speech synthesis. When we have several speaker-dependent HMM sets, by interpolating among the HMM parameters, the means and variances, we can generate a new HMM set which corresponds to a new voice. In this case, we have two speaker-dependent models: one trained by a female speaker, and one trained by a male speaker.
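One common form of this interpolation (a sketch; the details differ across variants) mixes the Gaussians of the K speaker models with ratios a_k:

$$\mu = \sum_{k=1}^{K} a_k\,\mu_k, \qquad \Sigma = \sum_{k=1}^{K} a_k^2\,\Sigma_k, \qquad \sum_{k=1}^{K} a_k = 1$$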
Okay, let me play the speech samples: synthesized from the female model [sample plays], and this was synthesized from the male speaker's model [sample plays]. We can interpolate between these two models with an arbitrary interpolation ratio. This sample is at the center of the two models: [sample plays]. We cannot tell whether the speaker is male or female. And we can change the interpolation ratio linearly within an utterance, from female to male: [sample plays]. It sounds like a female at first and a male at the end.
This is the same, except that we have four speaker-dependent models: the first speaker [sample plays], the second speaker [sample plays], the third speaker [sample plays], and the fourth speaker [sample plays]. This is at the center of these four speakers: [sample plays]. And again we can change the interpolation ratio gradually: [sample plays].
It is interesting, but how can it be used? If we train each model with a specific speaking style, we can interpolate among speaking styles, too; this could be useful for spoken dialogue systems. In this case, we have two models: one trained with a neutral reading voice, and one trained with a high-tension voice, by the same speaker.
Okay, first the neutral voice: [sample plays]. And the high-tension model: [sample plays]. If you feel it's too much, we can adjust the degree of the expression by interpolating between the two models, for example this one: [sample plays]. And we can also extrapolate the two models. Let me replay all of them in this order: [samples play]. Please note that it's not just changing the average F0; the whole prosody is changed.
Okay, the next example is eigenvoices. The eigenvoice technique was developed for very fast speaker adaptation in speech recognition; in speech synthesis, it can be used for creating new voices. This represents the weights for the eigenvoices; by adjusting them, we can find a favorite voice. Each eigenvoice, the first eigenvoice, the second eigenvoice, and so on, may correspond to a specific voice characteristic.
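A hedged sketch of the construction: stacking all the state means of a voice model into a supervector, a new voice is a point in the space spanned by the first few eigenvectors of the training speakers' supervectors,

$$\mu^{\text{new}} = \bar{\mu} + \sum_{j=1}^{J} w_j\, e_j,$$

where the w_j are exactly the per-eigenvoice weights being adjusted here.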
Let me play some speech samples. The first eigenvoice with a negative weight: [sample plays]. And now with a positive weight for the first eigenvoice: [sample plays]. I'm sorry, this is with the maximum value of the weight. The second eigenvoice with a negative weight: [sample plays], and with a positive weight: [sample plays]. By setting the weights appropriately, we can generate various voices and find our favorite voice: [samples play]. I hope this one is better. Anyway, this shows the flexibility of the statistical approach to speech synthesis.
Okay, similarly to other corpus-based approaches, the HMM-based system has a very compact language-dependent part, so it can easily be applied to other languages. I'd like to play some of them: Japanese [sample plays], English [sample plays], Chinese [sample plays], Korean [sample plays], and Finnish [sample plays]. And this is also English, but trained with a baby voice: [sample plays].
And now, the next example shows that even a singing voice can be used as the training data; as a result, the system can sing any piece of music with his or her voice. This is one of the training data: [sample plays]. She's a semi-professional singer. And this sample was synthesized using the trained acoustic models, for a song she has never sung: [sample plays]. Maybe it sounds natural, but it is entirely synthesized.
Okay, this is the final part. I'd like to show the basic problem of speech synthesis again, this one. Solving this problem directly, based on this equation, is ideal, but we have to decompose it into tractable sub-problems, because the direct solution is not feasible with the currently available computational resources. However, we can relax the approximations. For example, by marginalizing over the model parameters, for example the HMM acoustic model parameters, we can derive a variational Bayesian acoustic modeling technique for speech synthesis. Or, by marginalizing over the labels, we can derive joint front-end and back-end model training, where the front end means the text processing and the back end the acoustic model. Or, by including the speech waveform generation part in the statistical model, we can also derive a better waveform-level statistical model. Anyway, please note that these kinds of improved techniques can be derived based on this equation, which represents the basic problem of speech synthesis.
Okay, to summarize this presentation: I have talked about the statistical formulation of the speech synthesis problem. The whole speech synthesis process is described in a unified statistical framework, and it gives us a unified view and reveals what is correct and what is wrong. Another point I should emphasize is the importance of the database. As future work, we still have many problems which we should solve based on the equation which represents the speech synthesis problem. Okay, this is the final slide: is speech synthesis a messy problem? No, I don't think so. I would be happy if many speech recognition researchers joined speech synthesis research; it must be very helpful to our research area. That's all, thank you very much.
Thanks for such a good talk. We have some time for questions. Michael?
Thank you very much for a wonderful talk on speech synthesis. At some point in the future, I guess we won't even have to have our presenters give presentations anymore; we could just synthesize them. (I would like to do that; then it wouldn't be me speaking.) One of the things you alluded to at the end of your talk, I was wondering if you could elaborate a little bit more. One of the problems you can still hear on some of the examples you played, to a certain extent depending upon the speaker, is the quality of the final waveform generation. I'm just wondering if you could say a few words about some of the current techniques that are being looked at in order to improve the quality of the waveform generation. The model you showed at the beginning of the talk is still a relatively simple excitation-plus-spectral-envelope model, and I know people have looked at fancier stuff. I'm just wondering if you have some comments as to what you think are interesting, promising directions to improve the quality of the waveform generation.
I didn't mention it, but in the newest system we are using the STRAIGHT vocoding technique, and it improves the speech quality very much. However, I'm afraid that it's not based on the statistical framework, so I would like to include that kind of model; the vocoding part should be included in this equation, this one. But currently we still use many approximations. For example, the formulation, exciting the synthesis filter with Gaussian white noise, is correct for unvoiced segments; however, it's not appropriate for periodic, voiced segments. So we need a more sophisticated statistical speech waveform generation model, and I believe that can solve that kind of problem.
Hi, yes, I have a couple of questions related to the smoothing of the cepstral coefficients you talked about. The use of the deltas and double deltas gives you the smoothed cepstral coefficients; how important is that relative to, say, generating static coefficients and then applying a moving average filter, or some smoothing like that?
Okay, good question. I may have a slide for that... I'm sorry, it seems it's not included in the slides, but anyway: the delta is very effective, and the delta-delta is not so effective. Of course, we can apply some heuristics, moving average filtering or something like that, and it is still effective, but I have results showing that using the delta and delta-delta parameters gives higher MOS scores.
I'm sorry, I can't find it. One other follow-up question on that: when you set up those linear equations, do you weight all the different dimensions equally? The deltas, double deltas, and static coefficients, are those all weighted equally when you solve the least-squares equations?
No, there are no weights; we have no ad hoc weighting or scaling operation. We just have the definition of the output probability, and we simply maximize it, so each dimension is effectively weighted by its inverse variance from the model.
So I have a question. Obviously, we've borrowed HMMs from speech recognition for text-to-speech. In speech recognition it's a frequent comment, and it came up in various talks, this whole question of how good an HMM is as a model of speech; the received wisdom is either that it's kind of terrible, or that it's terrible but so tractable that it's useful. Given its use in speech synthesis, do you think that the success of this technique in fact demonstrates that HMMs are a good model of speech? Because I think the quality is far higher than anybody would have believed possible. And what follows from that, for this workshop, about how we should build models?
Yeah, that's a very good question. Anyway, we have been organizing an evaluation campaign comparing synthesis systems, and we have found that the intelligibility of the HMM-based systems is almost perfect, almost comparable to that of natural speech. But the naturalness is still insufficient compared with natural speech, maybe due to the prosody. So I believe that we have to improve the prosodic part within the statistical framework. Human speech conveys various non-verbal information through prosody, and that cannot be done in the current speech synthesis systems; that kind of aspect should be included.
Your talk was very nice. I want to follow a little bit along Paul's line of questioning, because I was thinking about your final call to the ASR community to join you in this. One of the things you see a lot with HMMs in the speech field is that everybody's moving towards discriminative models of various kinds, and the nice thing about the HMMs for the synthesis problem is that it really is a generative problem, so in some ways the model matches a little better, which is sort of what Paul was touching on. So do you see, moving forward in synthesis, that discriminative techniques are going to be playing a part, or do you think that generative models are definitely going to be the right way to model this kind of thing?
Yeah, a good question. Discriminative training does not quite fit: for generating speech utterances, it is not necessary to discriminate. Another point is that we could set a specific objective function based on human perception, somewhat like discriminative training does in speech recognition. But anyway, in speech synthesis the basic problem is generation, so we can concentrate on generative models; it's not necessary to tackle discriminative training. That is a nice point of speech synthesis research. But I do want to do that kind of optimization in a statistical framework; by changing the hyperparameters or the model structure, we can do that within the statistical framework.
And I've got a related question: so you generate the maximum likelihood sequence, but if you have a really good generative model, we'd really like to sample stochastically.
Actually, we are doing something like that, because, as shown here, these are the given variables, this is the speech waveform, and this is the predictive distribution; we do sample the speech waveform. Exciting the speech synthesis filter with Gaussian white noise is exactly sampling. And the speech parameters, the mel-cepstrum and F0, are marginalized out in this equation; as an approximation, we generate them with the maximum likelihood criterion. It's just an approximation, but the criterion itself is sampling. Does that make sense?
Well, I guess I'm wondering whether it's a good approximation. Yes, and that's the reason why we want to reduce or remove, that is, relax, those approximations in future work.
Okay, I think it's time to close, so let's thank the speaker.