Welcome to Speaker Odyssey 2020. This is the tutorial session on text-to-speech synthesis. I'm Xin Wang from the National Institute of Informatics, Japan, and I'm going to deliver this tutorial on text-to-speech synthesis.
First, a brief self-introduction. I'm a postdoc, and I got my PhD two years ago. During my PhD I was working on text-to-speech synthesis. Since then, as a postdoc, I have been working on speech and also music audio generation. Meanwhile, I am also getting involved in the ASVspoof and VoicePrivacy challenges of this year.
For this tutorial, I'd like to first apologize about the abstract. In the abstract I mentioned that I would explain the recent neural-network-based acoustic models, the waveform generators, the classic hidden Markov model based approaches, and also voice conversion. But that abstract seems to be too ambitious; I don't think I can cover all of these topics in a one-hour tutorial. So in this tutorial I will focus on the recent neural-network-based acoustic models, including Tacotron and its variants. Other topics, such as waveform generators and HMM-based TTS, are left out of this tutorial. If you are interested, you can find useful notes and reference papers in the slides.
For this tutorial, I'm going to focus on the recent approaches like Tacotron and the related sequence-to-sequence TTS models. I'm going to talk about how they work and what their differences are.
This tutorial is based on my own reading list; I summarize what I have learned and what I have implemented with my colleagues, so the content may not be comprehensive. However, I have tried my best to include more content, which I summarize in the notes attached to each slide. I also provide an appendix with a reading list of what I have read in the past. I hope you enjoy this tutorial, and of course your feedback is welcome.
For this tutorial, I'd like to first give a brief introduction to the current situation, the state of the art of TTS research. After that I will give an overview of TTS, briefly introducing the classical methods and why we are here today. Then I will spend most of the time of this tutorial on sequence-to-sequence TTS, the state of the art in TTS nowadays, explaining different types of sequence-to-sequence TTS: those based on soft attention, hard attention, and hybrid approaches. Finally, I will make a summary and draw conclusions.
Let's begin with the introduction.
TTS is a technology that converts input text into an output waveform. One famous example of a TTS application is the speech device used by Professor Stephen Hawking. Nowadays we have more types of applications based on TTS: one example is intelligent robots; we also have dialogue systems on cell phones and computers.
Research on TTS has a really long history. If we read the textbooks and reference papers on TTS, we can find many different types of TTS methods, for example formant synthesis and unit selection. The reason why researchers are still working on TTS is that they want to make synthesized speech as natural as possible, as natural as human speech; for some types of applications we also want the synthesized speech to sound like a particular person. Towards this goal, researchers have put a lot of effort into TTS research. However, it was not until recent years that researchers found really good models to achieve this goal.
Here I'd like to use the ASVspoof data to show the rapid progress of TTS.
The first picture is from ASVspoof 2015. It shows an i-vector space where different types of TTS systems are plotted according to their distance from the natural speech, the genuine human speech. You can see there are many systems here; most of them are based on HMMs or GMM-based voice conversion. In this i-vector space, most TTS systems are really far from the natural speech; it is only unit selection that is close to the natural speech.
So how about ASVspoof 2019, after four years of research? Here are the results based on x-vectors. Compared with the picture for 2015, we can see that there are many systems that are really close to the natural speech, not only unit selection. I'd like to highlight a few of them here. The first examples are the HMM and DNN systems; as you can see from this figure, they are still quite far from the natural speech. Unit selection is still close to the natural speech. Meanwhile, we can see other types of TTS methods, including sequence-to-sequence TTS and WaveNet, that are really close to the natural speech.
Of course, this figure is based on acoustic features, either the x-vectors or the i-vectors. The question is whether the synthesized speech really sounds natural in human perception.
To answer that question, I'd like to use the results from our recent study, where we conducted a human evaluation on the ASVspoof 2019 data. We asked human evaluators to judge how much the synthesized speech sounds like the target speakers, and what the quality of the synthesized speech is compared with the natural speech. We show the results in terms of DET curves.
On the left-hand side, you can see that the HMM and DNN systems are really far from the natural speech in terms of speaker similarity: their whole distributions are far away from the natural target speech. Unit selection is closer, but still not close enough. It is only the sequence-to-sequence system, as you can see from this figure, that is really close to the target speaker's natural speech. In this case the EER is roughly fifty percent, which means the listeners really think the synthesized speech sounds like the target speakers; human beings cannot tell them apart.
We see a similar trend if we look at the results in terms of speech quality: the DNN and unit-selection systems are not good enough; it is only the sequence-to-sequence model that is really close to the natural speech.
From these results we can get a general idea of how the recent models based on the sequence-to-sequence framework improve the quality and the speaker similarity, to the point that even human beings cannot tell them apart from the natural speech.
Okay, after introducing the results, I'd like to play some samples from the ASVspoof 2019 database, so that you can get a general impression of how these models sound compared with natural speech.
[Audio samples played: several systems synthesizing the same sentence, "We did not compete with any of the local firms," followed by samples from a second speaker.]
These are samples from two speakers. I think you may agree that unit selection sounds like the natural speech in terms of speaker identity, but you can sometimes perceive the discontinuities where different units are concatenated together. The HMM system sounds close, but it still sounds like artificial speech. It is the sequence-to-sequence models that truly sound like the target speakers. If you are interested, you can find more samples on our website, or download the ASVspoof 2019 database and have a try.
After listening to the TTS samples from ASVspoof 2019, I'm going to talk about TTS in more detail: what kinds of problems we may face when we build a TTS system, what kinds of solutions we can use, and how we arrive at the idea of sequence-to-sequence TTS models.
So, what are the problems we may face when we build a TTS system? To give an example, here is one sentence from the guidelines for ToBI labeling: "Marianna made the marmalade."
The first thing to note when we convert text into a waveform is that the text is basically discrete; it comes from a finite set of symbols. The waveform, in contrast, is continuous in the time domain and also in the amplitude domain.
Because of this basic difference between text and speech, the first issue we notice is the ambiguity in pronunciation: for example, the same letters "ma" in "marmalade", "Marianna", and "made" are pronounced in different ways. The second issue is alignment: for example, when we say "made", we may shorten or lengthen the duration of the sounds as we pronounce them. This kind of alignment has to be learned from the data, which is not easy. Another issue is how to recover information that is not encoded in the text, for example the speaker identity and the prosody. These are really difficult issues when we build TTS systems.
Here is an example of using a classical TTS pipeline to convert text into an output waveform. The first step of the system is to clean the input text, doing some kind of text normalization to remove all kinds of strange symbols from the input text. After that, the system converts the text into phoneme or phone strings; the phonemes are symbols that tell the computer how to read each word. Of course, this is not enough; we may need to add additional prosodic tags to each word or to some part of a word, for example when we emphasize "Marianna" instead of "made".
Given this linguistic information about how to read the text, the system converts it into acoustic units or acoustic features. Finally, the system uses a waveform generator to convert the acoustic information into the output waveform.
In the literature, we normally refer to the first steps of such a system as the front end and the rest as the back end. In this tutorial I will not cover the topics on the front end; readers can refer to textbooks on the front end. For this tutorial we focus on the back-end issues, especially how we learn the alignment between the text and the waveform in the back-end models.
The first example I'd like to explain is the unit-selection back end. As the name suggests, this method is quite simple and straightforward: for each input unit, it directly selects one speech segment from a large database, and after that it directly concatenates these speech units into the output waveform. There is no explicit modeling of the alignment between the text and the waveform, because this alignment is already preserved in the speech units, so we do not really have to care about alignment in this kind of method.
However, the story becomes different when we use the HMM-based back end to synthesize speech.
Unlike unit selection, which directly generates the waveform, the HTS (HMM-based) approach does not directly predict the waveform. Instead, we first predict a sequence of acoustic features from the input text. Each acoustic feature vector may correspond to, say, 25 milliseconds of waveform, and we can use vocoders to reconstruct the waveform from the acoustic feature vectors. Each acoustic feature vector may contain, for example, the cepstral coefficients, the F0, and other kinds of acoustic features specific to the speech vocoder. But the general idea is this: in HTS we do not directly predict the waveform; instead, we first predict the acoustic feature vectors from the input text.
The question is how we can do that. Remember that the input information has been extracted from the text, including the phoneme identity and other prosodic tags. In HTS we normally encode, or convert, the linguistic features into a vector for each input unit. Each vector may contain information such as the phoneme identity, whether the vowel is stressed, and so on. We assign this kind of vector to each unit.
The question, of course, is how we can convert the sequence of encoded linguistic vectors into the output acoustic feature vectors. Remember, the number of linguistic vectors is equal to the number of units in the text, and this number is much smaller than the number of acoustic feature vectors we need to predict. This is the alignment issue.
Here is how the HTS system handles this issue.
Since this system is based on HMMs, the first thing we need to do is to convert the linguistic vectors into HMM states. This is done by simply searching through the decision trees; after that, we obtain the HMM states for each specific vector. After searching and finding the HMM states for each linguistic vector, the next step is to predict the duration of each HMM state: for example, we repeat the first HMM state two times, the second one three times. Given this duration information, we can create a state sequence like this. The length of this HMM state sequence is now equal to the number of vectors we need to predict in the output. The regression task then becomes much easier, because we can use many types of algorithms to generate vectors from each HMM state. Specifically, the HTS system uses the so-called maximum likelihood parameter generation (MLPG) to produce the acoustic feature vectors from the HMM states.
This is how the HTS system produces the output from the input linguistic feature vectors.
To briefly summarize the HTS system, we can use this picture. We generate the linguistic features from the input text; we do the search in the decision trees; after that, we predict the duration of each HMM state, and this is where the alignment is produced. Then we generate the output acoustic features; after that, everything is straightforward: we convert each state into the output vectors and do the waveform generation using the vocoder.
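As a rough pseudocode sketch of the back-end steps just described (the helper objects decision_trees, duration_model, mlpg, and vocoder are hypothetical placeholders, not the real HTS API):

```python
def hts_backend(linguistic_vectors, decision_trees, duration_model, mlpg, vocoder):
    # 1. map each linguistic vector to its HMM states via decision-tree search
    states = [s for v in linguistic_vectors for s in decision_trees.search(v)]
    # 2. predict how many frames (e.g. 25 ms each) every HMM state should last
    durations = [duration_model.predict(s) for s in states]
    # 3. expand the state sequence so it is frame-aligned with the output frames
    frame_states = [s for s, d in zip(states, durations) for _ in range(d)]
    # 4. generate acoustic feature vectors (e.g. with MLPG) and run the vocoder
    acoustic_features = mlpg(frame_states)
    return vocoder.synthesize(acoustic_features)
```

The important point is step 2 and 3: the explicit duration model is what produces the alignment before any regression happens.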
From HTS to DNNs, the change is straightforward: we just need to replace the HMM states with neural networks, feed-forward or recurrent ones. However, in this kind of framework we still need the duration model; we need to predict the alignment from the linguistic feature vectors. Without that, we cannot prepare the input to the neural networks. Indeed, as the paper by Alex Graves says, RNNs are usually restricted to problems where the input and output sequences are well aligned. In other words, when using the common feed-forward or recurrent neural networks, we still need additional tools, including the HMM, to learn and generate the alignment for the TTS task.
We may wonder whether we can use a single model to jointly learn the alignment and do the regression. This is where the sequence-to-sequence models come onto the stage. In fact, they are even more ambitious: they want to use a single neural network to jointly learn the alignment, do the regression, and even conduct the linguistic analysis on the input text. There is a lot of recent work showing that this approach is reasonable, and that with really deep neural networks we can achieve better quality for TTS.
Okay, let's look at the sequence-to-sequence TTS models. Remember that the task of the sequence-to-sequence model is to convert the text into the acoustic feature sequence, and we need to solve three specific tasks: how to derive the linguistic features, how to learn and generate the alignment, and how to generate the output sequence. Again, we cannot simply use common neural networks such as the feed-forward or recurrent ones. For this kind of sequence-to-sequence model, we normally use the attention mechanism.
For the explanation, I will use x as the input and y as the output. Note that the input and the output generally have different numbers of time steps, so they have different lengths.
The first framework we can use is the so-called encoder-decoder framework. Here we use an RNN layer as the encoder to process the input, and we extract a vector c from the last hidden state of the encoder. After that, we use this c vector as the condition to generate the output sequence step by step. If we write down the equations, they look roughly like the sketch below: you can see how the output is factorized along the time steps, and the condition c is used at every time step.
This framework is straightforward and simple: no matter how long the input sequence is, we can always compress the input information into a single vector. However, there is also an issue, because we have to use this single c vector across all the time steps when we generate the output. Can we extract a different context from the input when we generate different output time steps?
The answer is yes, and we can use the attention mechanism to achieve this goal. Suppose we want to generate the second output time step y2. We take the hidden state of the decoder from the previous time step and feed it back to the encoder side; after that, we compute some kind of weight vector through a softmax layer; then we do a weighted sum over the input information and produce the vector c2. We can use this c2 vector as the input to the decoder and produce the y2 vector. This is how the context information is calculated for the second time step. Note that we can save the output of the softmax layer; it is the weight information used for the second time step.
We can repeat the process for the next time step: in this case, we feed back the hidden state of the decoder at the second time step, and then we calculate the vector c3 for the output y3. In general, we can do this for every time step, and we can write the equations like these.
After we save the outputs of the softmax along all the time steps, we can notice something: the weights calculated by the softmax gradually change. As we generate the output along the time axis, the weights also move along the input sequence, as you can see from this picture. This is also called the alignment matrix, and you can find this kind of picture in many papers on TTS or speech recognition.
To briefly summarize the attention-based sequence-to-sequence models, we can use these equations. For each output time step n, we calculate the softmax weight vector alpha_n; then we use this vector to summarize the information from the input, doing a weighted sum over the h vectors; that gives us the context vector c_n for that time step; with the context c_n we generate the output y_n; and we repeat the process for all the time steps. This is, in general, how the attention-based sequence-to-sequence model works.
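The equations below sketch this per-step computation in my own notation (h_1, ..., h_I are the encoder outputs, s_{n-1} is the previous decoder state, alpha_n the attention weights); the exact form of the score function differs between papers:

```latex
e_{n,i} = \mathrm{score}(s_{n-1},\, h_i), \qquad
\alpha_{n,i} = \frac{\exp(e_{n,i})}{\sum_{j=1}^{I}\exp(e_{n,j})}, \qquad
c_n = \sum_{i=1}^{I} \alpha_{n,i}\, h_i, \qquad
y_n = \mathrm{Decoder}(s_{n-1},\, c_n,\, y_{n-1})
```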
As you can see from the previous explanation, the attention mechanism is essential for a sequence-to-sequence TTS model, and for this reason many different types of attention have been proposed. When I read the papers, I noticed that there are so many types of attention we can use: self-attention, forward attention, hard attention, soft attention. What is the relationship between the different types of attention, and what is the purpose of using a specific one? In the next few slides I will explain them in a more systematic way. As my own proposal, I organize the attention types based on what kind of features are used to compute the alignment, how they compute the alignment, and what kind of constraints are put on the alignment.
With respect to the features used to compute the alignment, we can organize the attention types according to whether they are content-based, location-aware, or purely location-based. With respect to the way the alignment is computed, we can organize them into three groups: additive, dot, and scaled dot attention. For the final axis, we can ask whether the attention is monotonic or forward attention, local attention, or global attention. This is my proposal for organizing the so-called soft attention.
Soft attention is not the only group we can find in the literature. If we read the papers, we can find another group, the so-called hard attention. The difference from soft attention is that in hard attention the alignment is treated as a latent random variable: we need to use all kinds of tools, such as dynamic programming and marginalization, to calculate the probability and to marginalize out the latent variable. I will talk more about the difference between the two groups of attention in later slides, but for now I will focus on the soft attention.
Let's first look at the dot, scaled dot, and additive attention. These three types of attention use different ways to compute the alignment matrix. Suppose we are going to compute the output y_n for the n-th time step. What we have is the decoder state of the previous time step, s_{n-1}; we also have the features extracted from the input text, which are denoted as h. The three types of attention differ in the way they compute the input to the softmax layer; the output of the softmax is the alignment matrix.
The first one, the dot attention, directly multiplies the two vectors, s_{n-1} from the decoder and h from the encoder; that is why it is called the dot attention. The scaled dot attention is quite similar, but in this case we add a scaling factor to change the magnitude of the activation fed into the softmax layer. The last type is the additive attention: in this case, we apply linear transformations to the two vectors and then add the transformed vectors together; this is the reason why it is called the additive attention.
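A hedged sketch of the three score functions, written in a common textbook form (the exact parameterization in each paper may differ; W, U, v, and the dimensionality d here are assumed model parameters):

```latex
e^{\text{dot}}_{n,i} = s_{n-1}^{\top} h_i, \qquad
e^{\text{scaled}}_{n,i} = \frac{s_{n-1}^{\top} h_i}{\sqrt{d}}, \qquad
e^{\text{additive}}_{n,i} = v^{\top}\tanh\big(W s_{n-1} + U h_i\big)
```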
Note that for all three types of attention in this example, we use the s vector from the decoder and the h vector from the encoder. In other words, we can consider h as the content of the input: we combine the content extracted from the input text with the hidden state of the decoder in order to compute the alignment matrix.
This brings us to the second question, based on which we can classify the different types of attention: what kind of features can we use to compute the alignment? In the previous slide I explained the dot, scaled dot, and additive attention using examples where the decoder state and the content vector h are used to compute the alignment. These methods are called content-based attention because they use the content vector.
However, this is not the only way to compute the alignment. The second way is the so-called location-aware attention. As you can see from these two equations, compared with the content-based attention, the location-aware attention additionally uses the attention vector from the previous time step; this attention is aware of the previous alignment, and that is why it is called location-aware attention.
The third type of attention in this group is the so-called location-based attention. Compared with the location-aware attention, we can notice from this equation that the content vector h is removed from the input. In other words, in the location-based attention we do not care about the content: we compute the alignment matrix purely from the decoder state and the alignment of the previous step.
Finally, there is a simplified variant of the location-based attention: in this case, we only use the decoder state to compute the alignment, without using the alignment from the previous time step.
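One way to summarize the four variants is by the arguments of the score function (again in my own notation, with alpha_{n-1} the alignment vector from the previous output step):

```latex
\text{content-based:}\quad   e_{n,i} = \mathrm{score}(s_{n-1},\, h_i) \\
\text{location-aware:}\quad  e_{n,i} = \mathrm{score}(s_{n-1},\, \alpha_{n-1},\, h_i) \\
\text{location-based:}\quad  e_{n,i} = \mathrm{score}(s_{n-1},\, \alpha_{n-1}) \\
\text{simplified variant:}\quad e_{n,i} = \mathrm{score}(s_{n-1})
```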
From the equations of these four types of attention, you may notice that when we compute the attention, or the alignment matrix, for each output time step, we consider all input time steps. This leads to the third dimension along which we can classify the attention. Along this dimension I'd like to explain two types of attention. The first one is the so-called global attention: as the name suggests, when we compute the alignment for each output time step, we consider extracting information from all the input time steps, so the alignment vector here has no zero elements. In contrast, when we use local attention, we allow some of the alignment elements to be zero; for example, in this case we only consider extracting information from the input steps in the middle.
Now I have explained the three dimensions along which we can classify the soft attention. In fact, all the examples I have explained can find their location in this three-dimensional space.
But let me give one more concrete example: the self-attention. The self-attention is a scaled dot attention, it is content-based, and it is a global attention. Let's see how it is defined.
If we look at the equations of the self-attention, we can see why it is called a scaled dot, global, and content-based attention. The special thing about self-attention is that we extract both the feature vectors h and the query vectors from the same input sequence; in other words, we are computing the alignment over the input sequence itself.
Because of this, we can compute everything in parallel, and we can also define a matrix form for the self-attention. In this case, we formulate the input feature sequence as a matrix and write the scaled dot attention in matrix form. The matrices are called the query, key, and value matrices; in this case they all refer to the same matrix H. In other words, the self-attention performs a transformation on the input sequence, and the output sequence has the same length as the input.
In some sense, we can consider the self-attention as a special type of convolutional or recurrent layer that transforms the input into an output of the same length. Of course, we can also use the self-attention for alignment learning; in that case it is just a special type of soft attention, a scaled dot, content-based attention. As you can see from the equations, in that case we replace the query matrix with the states from the decoder, but the process is quite similar, and we can do everything in parallel by using matrix multiplication. A small numerical sketch is shown below.
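A minimal numerical sketch of the scaled dot-product self-attention in matrix form, assuming the query, key, and value are all the same matrix H (this is a NumPy illustration of the idea, not the code of any particular TTS system):

```python
import numpy as np

def scaled_dot_self_attention(H):
    """H has shape (T, d): T input time steps, d feature dimensions.
    Query, key, and value are all H itself, so the output length is also T."""
    d = H.shape[-1]
    scores = H @ H.T / np.sqrt(d)                            # (T, T) alignment scores
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ H                                       # weighted sum over inputs

H = np.random.randn(5, 16)                  # 5 input time steps, 16-dim features
print(scaled_dot_self_attention(H).shape)   # (5, 16): same length as the input
```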
By now I have explained the three dimensions used to classify the soft attention, and also an example based on the self-attention. In fact, there are more ways to combine different types of attention, and you can find such variants in the paper published by Google this year.
Given the explanation of the soft attention, let me now quickly explain how it works in a TTS system. For TTS systems that use attention-based sequence-to-sequence models, we use almost the same framework as the one used for speech translation, machine translation, or speech recognition. In this case the input is the phonemes or characters, and the output is the acoustic feature vector sequence; we still have the encoder, the attention, and the decoder, which is autoregressive. Of course, we can do more, for example adding more layers, increasing the number of recurrent layers in the decoder, or adding a pre-net that receives the feedback from the previous time step in the autoregressive decoder; this is free to choose. But the basic idea is still the attention-based approach to learning the alignment between the input and the output.
This gives us the basics to understand the first famous TTS system based on the sequence-to-sequence model: the Tacotron system. As you can see from the picture in the original paper, the architecture of the network can be generally divided into three groups: the decoder, the attention, and the encoder. Different systems just differ in how they define the encoder, for example by using different types of hidden layers to extract information from the input phoneme or character sequence; but the basic idea is still the same: use attention to learn the alignment between the input and the output.
In fact, Tacotron is not the only model that uses the sequence-to-sequence approach. As far as I know, the first model might be the pioneering work by Alex Graves: if you listen to his talk from 2015, you can notice that he played some samples using the attention-based framework, so attention-based sequence-to-sequence TTS already existed in 2015. After that, there was one Interspeech paper on attention-based TTS that first used attention in a published paper. Then came the Tacotron system in 2017. Meanwhile, there are different types of systems, for example Char2Wav, Tacotron 2, DC-TTS, Deep Voice 3, and Transformer TTS. All these systems are based on the attention mechanism.
Here I'd also like to mention one special system, the so-called VoiceLoop, which is also a sequence-to-sequence TTS but actually uses a different type of alignment learning, the so-called memory buffer. If you are interested in this model, you can find an illustration in the appendix.
To help you understand the differences between the different types of sequence-to-sequence TTS systems, I summarize the details and differences across systems in this table. There are many details here, for example the waveform generator, the acoustic features, and the architecture of the encoder and decoder, but let's focus on the attention. As you can see, the Tacotron-based systems mainly use the additive attention, of course combined with location-sensitive mechanisms. There are also other systems, for example Char2Wav, which directly uses location-based attention, and there is a purely self-attention-based system, the Transformer TTS. You can find the details later in the slides.
Now I'd like to play some samples published with these papers. They are from the official websites, and the data is in the public domain. For the systems trained on their own internal data, I cannot put the samples here, but you can find them on their websites.
[Audio samples played: pairs of synthesized utterances from the published systems, including "Prosecutors have opened a massive investigation into allegations of fixing games and illegal betting" and "... had accepted it as a numerical value without any physical explanation."]
After playing the samples, I hope you have a general impression of how the sequence-to-sequence TTS systems sound. Of course, the quality might not be as good as what we heard in the ASVspoof 2019 data; there are many different reasons for that. If you want to find other good examples, I suggest the samples of Tacotron and Transformer TTS, where the authors used their own internal data to train the systems.
After listening to the samples, you may wonder whether the soft attention is good enough for TTS. I think the answer is no. The samples I played are all good examples, but there are actually many cases where the sequence-to-sequence based TTS systems do not work. For those cases we need to consider specific attention mechanisms designed for TTS.
This leads us to another group of systems, which use the monotonic and forward attention. Before explaining this type of model, I think we need to first explain why the global attention, or the global alignment, sometimes does not work. Remember that for the global alignment, or global attention, we need to compute the alignment between every pair of input and output time steps. This might be necessary for other tasks such as machine translation, but it might not be necessary for TTS, and this kind of alignment is hard to learn; sometimes it simply does not work.
I'd like to play one sample. This is a sample from a paper from Microsoft Research, where they used the global attention to generate a very long sentence. The text transcription is here; this is the input.
[Audio sample played: a long sentence containing a file path with many backslashes, ending with "... that makes post-processing a little painful ..."; the synthesized speech breaks down partway through the utterance.]
I hope this interesting example shows how the soft attention might not work when we use a long text as input. This is an issue we need to solve. So what can we do to alleviate the problem? One thing we can consider is that for text-to-speech there is some kind of monotonic relationship between the input and the output, because human beings read the text from left to right. We can use this kind of prior knowledge to constrain the alignment, so that it becomes easier for the system to learn the mapping from the input to the output.
The idea looks like this. This is the motivation behind the monotonic and forward attention. The essential idea of the forward (monotonic) attention is to re-compute the alignment matrix: suppose we have computed an alignment matrix like this; then, with some kind of prior knowledge, we re-compute the alignment matrix to encourage a monotonic alignment.
To show how it works, consider this simple task: convert the input x1, x2, x3 into the output y1, y2, y3.
Suppose we have used the soft attention and we have computed the alignment for the first output time step. This is where we can introduce prior knowledge to constrain the alignment learning. Suppose the alignment can only start from the first input time step; then we can put an alignment vector alpha-0-hat here to indicate the initial condition, so in this case it is (1, 0, 0). Furthermore, we constrain the alignment so that it can only stay at the same input step or transit from the previous input step to the next one, like a left-to-right HMM. Based on these conditions, we can re-compute the alignment vector alpha-1-hat like this, and we then use it in place of the original one.
To give you a concrete example, suppose alpha-1 is equal to (0.5, 0.4, 0.1); after the re-calculation we get a new vector. You can notice how the probability of aligning y1 with x3 is reduced from 0.1 to 0. This is how the forward re-calculation of the alignment matrix suppresses the impossible alignments during the model training stage. Of course, in the paper the authors also propose other mechanisms to re-compute the alignment matrix, but the essential idea is the same. A small sketch of this re-computation is shown below.
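This is my own simplified implementation of the forward re-computation under the left-to-right constraint, reproducing the worked example from the slides (it is a sketch of the idea, not the exact code of the forward-attention paper):

```python
import numpy as np

def forward_attention_step(alpha_fwd_prev, alpha_soft):
    """alpha_fwd_prev: forward alignment from the previous output step (length N).
    alpha_soft: alignment given by the soft attention at the current step (length N).
    The alignment may only stay at the same input step or move one step to the
    right, like a left-to-right HMM."""
    shifted = np.concatenate(([0.0], alpha_fwd_prev[:-1]))   # alpha_fwd_prev(i-1)
    alpha_new = (alpha_fwd_prev + shifted) * alpha_soft
    return alpha_new / alpha_new.sum()                        # re-normalize

# the worked example: initial condition (1, 0, 0), soft alignment (0.5, 0.4, 0.1)
alpha0 = np.array([1.0, 0.0, 0.0])
alpha1 = np.array([0.5, 0.4, 0.1])
print(forward_attention_step(alpha0, alpha1))
# -> approximately [0.556, 0.444, 0.0]; the weight on x3 drops from 0.1 to 0
```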
Given the re-calculated alignment vector, we can use it to compute the output of the first time step. We then repeat the process, learning the alignment while computing the outputs y1 to y3.
Interestingly, if we check the alignment matrices in the paper, we can see how the forward attention differs from the conventional soft-attention-based approaches, especially in the first rows of the alignment matrices, which show the alignment after only one training epoch. For the baseline without any constraint, the alignment is just a random or uniform distribution. For the forward attention, with the re-calculation over the matrix, you can see that the alignment matrix already has a monotonic shape. We can also consider this monotonic shape as a prior, a constraint on what we can learn from the input and output data. Based on this example, I think you can understand why the forward attention makes it easier for the TTS system to learn the alignment between the input and the output.
In addition to the forward attention, there are also other types of monotonic attention, for example using different computation forms or combining it with local attention. However, I'd like to mention that the forward, also called monotonic, attention cannot guarantee that the attention will be exactly monotonic. There are many reasons for that, but I think the fundamental reason is that we are still using soft attention, where we compute the alignment and summarize the context from the input data in a deterministic way. This is the issue we would like to solve by using hard attention, which I will explain in later slides.
Okay, let's play some samples to see how the forward attention works. This is the same text that I played before: with the soft attention, the TTS system does not read this sentence correctly. Let's listen to how the forward-attention-based system works.
[Audio sample played: the same long backslash-filled sentence, now read through to the end, "... that makes post-processing a little painful, since the files and reports crashes in a hierarchical structure ..."]
From this example, you can notice how the forward attention made the system successfully read the latter part of this nonsense sentence. This is a good example of how the forward attention works. But again, as I mentioned in the previous slide, the forward attention is not guaranteed to produce a monotonic alignment.
Here is one example from the Microsoft paper.
[Audio sample played: a news sentence about a district-court ruling, in which the synthesized speech repeats the phrase "rival chip firms" many times before continuing.]
This is a funny example; I hope you noticed how the forward-attention system repeats the phrase "rival chip firms" multiple times. You can also see this in the alignment plot here: in this case the alignment is not monotonic. So again, the soft attention, even with the forward mechanism, does not guarantee that a monotonic alignment is learned from the data.
Anyway, from the previous samples I think you can hear how the forward attention can help the TTS system learn the alignment for long sentences. There are actually other TTS systems using the forward attention, for example the papers listed here. I will not play their samples here; if you are interested, you can find the samples on their websites or in the slides.
Since the soft attention cannot guarantee a monotonic alignment during generation, we have to find another solution. One potential answer could be the hard attention. Here is my understanding of how hard attention works. Suppose we have the soft-attention alignment matrix; this matrix tells us the probability that each output time step is aligned with each input time step. From this alignment probability matrix, we may sample a monotonic alignment like this. That is the idea if we want to use a monotonic alignment for TTS generation.
However, we have to take into account that there are multiple candidates for the alignment, for example the alignments shown here, and we have different probabilities of drawing these samples. Accordingly, during training we have to take into account the uncertainty over the different alignments: if we want to evaluate the model likelihood during training, we have to treat the alignment as a latent variable in a probabilistic model. This idea is very similar to the hidden Markov model, and as you can imagine, during training we have to use all kinds of dynamic programming, forward, or search algorithms to evaluate the model likelihood.
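In other words, the training criterion marginalizes over the latent alignment A, typically with a forward-algorithm-style dynamic program (a sketch in my own notation, not the exact formula of any specific paper):

```latex
p(y \mid x) = \sum_{A} p(y \mid A,\, x)\, p(A \mid x)
```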
To give you a more intuitive picture of how the hard attention works, we can compare it with the soft attention. As you can see from this picture, with the soft attention, for each output time step we directly calculate a weighted sum to extract information from the input; that is how the alignment is used during generation with soft attention, and we repeat this operation for all the time steps.
In contrast, with the hard attention we have to draw samples: we have to select only one possible alignment for each time step. Of course, we can use more sophisticated techniques such as beam search or Viterbi decoding to select a good alignment for the TTS generation. But this is how generation works with hard attention: compared with soft attention, we do not compute a weighted sum; instead, we draw samples.
Similarly, in the training stage we have to use dynamic programming to sum over all possible alignments in order to evaluate the model likelihood for the hard-attention-based models. In contrast, soft attention does not require this; we just do the same thing as in the generation stage, computing the weighted sum for each time step.
Because of these differences between soft and hard attention, we need a different space to categorize the techniques for hard attention, which leads to this diagram; I think it makes the different kinds of hard-attention techniques easier to understand. However, due to the limited time, I cannot explain the details of hard attention here; if you are interested, please see the slides where I explain the hard attention in more detail.
In terms of TTS systems with hard attention, as far as we know there is only one group actually using hard attention for TTS, and it is our group. You can find the reference papers on the website below, as well as many details on how we use different types of search and sampling techniques to produce the output alignment from the hard-attention-based models.
Given the details on the soft attention and the brief introduction to the hard attention, we now come to the third group, the hybrid approaches for the sequence-to-sequence TTS models.
From the first part of this tutorial, I hope you understand that the soft attention is easy to implement, but it might not work when generating long utterances. The hard attention may help to solve this issue because it guarantees a monotonic alignment during generation. However, according to our experiments, the hard attention might not be as accurate as the soft attention; for example, it sometimes overestimates the duration of silences.
For both soft and hard attention, we compute the alignment probability for each pair of input and output time steps. For TTS, because the output sequence can be quite long, this means we have to calculate a large matrix of alignment probabilities, which is not easy.
Of course, we can do something more efficient. Suppose we can summarize the alignment information from the matrix, so that we know roughly how many output time steps need to be generated for each input token. Using this information, we can build one probabilistic model per input token just to estimate how many time steps it needs to produce during the generation stage. This idea is not new; it has actually been used in the HMM- and DNN-based systems, and it is also the idea behind the hybrid approaches.
The hybrid approaches first use an attention-based model to extract the alignment matrix. After that, they summarize this information, for example the duration, i.e., how many output time steps we need to repeat for each input token. Having summarized this information, we can train a duration model directly for each input token.
During the generation stage, we can directly plug in the trained duration model: as you can see from this picture, we just need to predict how many output time steps to repeat for each input token. Given this duration information, we can do the up-sampling simply by duplicating each input vector. The input to the decoder is then well aligned with the output sequence we want to generate, and we can use normal neural networks such as feed-forward, recurrent, or autoregressive networks to convert the input into the output acoustic feature sequence. A small sketch of this up-sampling is shown below.
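A minimal sketch of the duration-based up-sampling, a simplified "length regulator" in the spirit of the hybrid approaches (the toy values and function name are my own, not from any specific system):

```python
import numpy as np

def upsample_by_duration(encoder_outputs, durations):
    """Repeat each input vector according to its predicted duration (in frames),
    so the up-sampled sequence is frame-aligned with the output to generate."""
    repeated = [np.repeat(encoder_outputs[i:i + 1], int(d), axis=0)
                for i, d in enumerate(durations)]
    return np.concatenate(repeated, axis=0)

tokens = np.random.randn(3, 8)        # 3 input tokens, 8-dim encoder outputs
durations = [2, 3, 1]                 # predicted number of output frames per token
frames = upsample_by_duration(tokens, durations)
print(frames.shape)                   # (6, 8): ready for a feed-forward decoder
```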
Here are some TTS systems using the hybrid approaches: FastSpeech uses soft attention to extract the duration, while AlignTTS and other systems use different kinds of techniques to extract the duration. I'd like to play some samples extracted from the published papers; I will play just one sample for each system, from FastSpeech and FastSpeech 2.
[Audio samples played from the published FastSpeech systems.]
Although I only play short samples here, you can find longer sentences on their websites. What I want to show with these examples is that by using the hybrid approaches we can generate synthetic speech with quite robust durations. I think that is one strong point of the hybrid approaches.
Okay, let's come to the summary. In this tutorial I first explained the pipeline TTS systems, including the HMM- and DNN-based systems. In the pipeline TTS we need the front end to extract linguistic information from the input text; after that, we need the duration model to predict the duration of each input unit; following that, we need the acoustic model and the waveform generator to convert the linguistic features into the final waveform.
In 2016, DeepMind proposed WaveNet. Although WaveNet is not explained in this tutorial, I'd like to mention that the original WaveNet still needs the front end and the duration model. It achieved its astonishing performance because it uses a single network to directly convert the linguistic features into the waveform sampling points. This avoids the issues, or artifacts, that appear when we use conventional waveform generators such as vocoders.
Different from these two types of TTS systems, the sequence-to-sequence models use a single model to convert the input text into the acoustic features: a single model does the alignment learning, the duration modeling, and the acoustic modeling. In fact, many sequence-to-sequence models also use WaveNet-like waveform generators to further improve the quality of the synthesized speech.
If we summarize the differences between the pipeline systems and the sequence-to-sequence systems, I think there are four aspects. First, we replace the conventional front end of the pipeline system with a trainable, implicit front end in the sequence-to-sequence model. Second, instead of using an external duration model, we may jointly do the duration modeling with the sequence-to-sequence mapping. The third point concerns the acoustic model: although it is not explained in this tutorial, most of the sequence-to-sequence models use so-called autoregressive decoding, producing one output time step conditioned on the previous time steps. The last point is the neural waveform models: as I mentioned in the previous slide, many of the sequence-to-sequence models use neural waveform models like WaveNet.
The first three types of differences are implemented through the attention-based sequence-to-sequence models, so in this tutorial we focused on the attention mechanism.
We first explained the soft attention, and we grouped the soft-attention approaches along three dimensions: what kind of features are used to calculate the alignment matrix, how the alignment is calculated, and what kind of constraints are put on the alignment. We also mentioned the shortcoming of the soft attention: it does not guarantee a monotonic structure. We then looked at the hard-attention-based approach; however, the hard attention might not be accurate enough to produce natural speech. This brings us to the last possible solution, the hybrid approach, where we do not use attention during generation.
All four aspects are quite essential to the performance of the sequence-to-sequence TTS models.
Of course, we may wonder which factor contributes most to the performance of the sequence-to-sequence models. To answer that, Oliver Watts and his colleagues designed experiments to analyze the impact of each of these factors on the quality of speech generated from the sequence-to-sequence models. I recommend reading their paper to understand why the sequence-to-sequence models outperform the pipeline TTS systems.
Before we end this tutorial, let me briefly mention other research topics based on the sequence-to-sequence TTS models. The first is the neural waveform models, which have been used in many sequence-to-sequence models; due to the limited time, I cannot explain the neural waveform models here, but you can find the reference papers in the reading list. Another topic is speaker, style, and emotion modeling in sequence-to-sequence models; prosody is also a hot topic in sequence-to-sequence modeling.
In terms of multi-speaker modeling, most of the sequence-to-sequence models are quite straightforward: they either jointly train the speaker vectors with the sequence-to-sequence model, or they use a separate speaker model to extract the speaker vectors from the reference speech; the latter is the so-called zero-shot learning for multi-speaker TTS.
In terms of prosody, some papers focus on segmental prosody, for example the lexical tones or the pitch accents; most of these papers focus on pitch-accent or tonal languages such as Mandarin or Japanese. In terms of suprasegmental variation, there are also papers combining prosody embeddings with Tacotron-based systems, as well as systems using variational autoencoders to extract the prosody embeddings from the reference speech.
Finally, I'd like to mention another direction of TTS research, and that is TTS for entertainment. For example, in this paper the authors use traditional Japanese comedy data to train the TTS system; the goal of this kind of TTS system is not only speech communication but also entertaining the audience.
This is the end of this tutorial. You can find the slides on my GitHub page; I recommend checking the additional slides, the reading list, and the appendix. Thank you for listening.