Welcome to Speaker Odyssey 2020. This is the tutorial session on text-to-speech synthesis. I'm Xin Wang from the National Institute of Informatics, Japan, and I'm going to deliver this tutorial on text-to-speech synthesis.
First, a brief self-introduction. I'm a postdoc, and I got my PhD two years ago. During my PhD I was working on text-to-speech synthesis. Since then, as a postdoc, I have been working on speech and also music audio generation. Meanwhile, I am also getting involved in the ASVspoof and VoicePrivacy challenges of this year.
For this tutorial, I'd like to first apologize about the abstract. In the abstract I mentioned that I would explain the recent neural-network-based acoustic models, the waveform generators, the classic hidden Markov model based approaches, and also voice conversion. But that abstract seems to be too ambitious; I don't think I can cover all of these topics in a one-hour tutorial. So in this tutorial I will focus on the recent neural-network-based acoustic models, including Tacotron and its variants. Other topics, such as waveform generators and HMM-based TTS, are left out of this tutorial. If you are interested, you can find useful notes and reference papers in the slides.
For this tutorial, I'm going to focus on the recent approaches like Tacotron and the related sequence-to-sequence TTS models. I'm going to talk about how they work and what their differences are.
This tutorial is based on my own reading list; I summarize what I have learned and what I have implemented with my colleagues, so the content may not be comprehensive. However, I have tried my best to include more content, which I summarize in the notes attached to each slide. I also provide an appendix with a reading list of what I have read in the past. I hope you enjoy this tutorial, and of course your feedback is welcome.
For this tutorial, I'd like to first give a brief introduction to the current situation, the state of the art of TTS research. After that I will give an overview of TTS, briefly introducing the classical methods and why we are here today. Then I will spend most of the time of this tutorial on sequence-to-sequence TTS, the state of the art in TTS nowadays, explaining different types of sequence-to-sequence TTS: those based on soft attention, hard attention, and hybrid approaches. Finally, I will make a summary and draw conclusions.
Let's begin with the introduction.
TTS is a technology that converts input text into an output waveform. One famous example of a TTS application is the speech device used by Professor Stephen Hawking. Nowadays we have more types of applications based on TTS: one example is intelligent robots; we also have dialogue systems on cell phones and computers.
Research on TTS has a really long history. If we read the textbooks and reference papers on TTS, we can find many different types of TTS methods, for example formant synthesis and unit selection. The reason why researchers are still working on TTS is that they want to make synthesized speech as natural as possible, as natural as human speech; for some types of applications we also want the synthesized speech to sound like a particular person. Towards this goal, researchers have put a lot of effort into TTS research. However, it was not until recent years that researchers found really good models to achieve this goal.
Here I'd like to use the ASVspoof data to show the rapid progress of TTS.
The first picture is from ASVspoof 2015. It shows an i-vector space where different types of TTS systems are plotted according to their distance from the natural speech, the genuine human speech. You can see there are many systems here; most of them are based on HMMs or GMM-based voice conversion. In this i-vector space, most TTS systems are really far from the natural speech; it is only unit selection that is close to the natural speech.
So how about ASVspoof 2019, after four years of research? Here are the results based on x-vectors. Compared with the picture for 2015, we can see that there are many systems that are really close to the natural speech, not only unit selection. I'd like to highlight a few of them here. The first examples are the HMM and DNN systems; as you can see from this figure, they are still quite far from the natural speech. Unit selection is still close to the natural speech. Meanwhile, we can see other types of TTS methods, including sequence-to-sequence TTS and WaveNet, that are really close to the natural speech.
Of course, this figure is based on acoustic features, either the x-vectors or the i-vectors. The question is whether the synthesized speech really sounds natural in human perception.
To answer that question, I'd like to use the results from our recent study, where we conducted a human evaluation on the ASVspoof 2019 data. We asked human evaluators to judge how much the synthesized speech sounds like the target speakers, and what the quality of the synthesized speech is compared with the natural speech. We show the results in terms of DET curves.
On the left-hand side, you can see that the HMM and DNN systems are really far from the natural speech in terms of speaker similarity: their whole distributions are far away from the natural target speech. Unit selection is closer, but still not close enough. It is only the sequence-to-sequence system, as you can see from this figure, that is really close to the target speaker's natural speech. In this case the EER is roughly fifty percent, which means the listeners really think the synthesized speech sounds like the target speakers; human beings cannot tell them apart.
We see a similar trend if we look at the results in terms of speech quality: the DNN and unit-selection systems are not good enough; it is only the sequence-to-sequence model that is really close to the natural speech.
From these results we can get a general idea of how the recent models based on the sequence-to-sequence framework improve the quality and the speaker similarity, to the point that even human beings cannot tell them apart from the natural speech.
Okay, after introducing the results, I'd like to play some samples from the ASVspoof 2019 database, so that you can get a general impression of how these models sound compared with natural speech.
[Audio samples played: several systems synthesizing the same sentence, "We did not compete with any of the local firms," followed by samples from a second speaker.]
These are samples from two speakers. I think you may agree that unit selection sounds like the natural speech in terms of speaker identity, but you can sometimes perceive the discontinuities where different units are concatenated together. The HMM system sounds close, but it still sounds like artificial speech. It is the sequence-to-sequence models that truly sound like the target speakers. If you are interested, you can find more samples on our website, or download the ASVspoof 2019 database and have a try.
After listening to the TTS samples from ASVspoof 2019, I'm going to talk about TTS in more detail: what kinds of problems we may face when we build a TTS system, what kinds of solutions we can use, and how we arrive at the idea of sequence-to-sequence TTS models.
So, what are the problems we may face when we build a TTS system? To give an example, here is one sentence from the guidelines for ToBI labeling: "Marianna made the marmalade."
The first thing to note when we convert text into a waveform is that the text is basically discrete; it comes from a finite set of symbols. The waveform, in contrast, is continuous in the time domain and also in the amplitude domain.
Because of this basic difference between text and speech, the first issue we notice is the ambiguity in pronunciation: for example, the same letters "ma" in "marmalade", "Marianna", and "made" are pronounced in different ways. The second issue is alignment: for example, when we say "made", we may shorten or lengthen the duration of the sounds as we pronounce them. This kind of alignment has to be learned from the data, which is not easy. Another issue is how to recover information that is not encoded in the text, for example the speaker identity and the prosody. These are really difficult issues when we build TTS systems.
Here is an example of using a classical TTS pipeline to convert text into an output waveform. The first step of the system is to clean the input text, doing some kind of text normalization to remove all kinds of strange symbols from the input text. After that, the system converts the text into phoneme or phone strings; the phonemes are symbols that tell the computer how to read each word. Of course, this is not enough; we may need to add additional prosodic tags to each word or to some part of a word, for example when we emphasize "Marianna" instead of "made".
Given this linguistic information about how to read the text, the system converts it into acoustic units or acoustic features. Finally, the system uses a waveform generator to convert the acoustic information into the output waveform.
In the literature, we normally refer to the first steps of such a system as the front end and the rest as the back end. In this tutorial I will not cover the topics on the front end; readers can refer to textbooks on the front end. For this tutorial we focus on the back-end issues, especially how we learn the alignment between the text and the waveform in the back-end models.
The first example I'd like to explain is the unit-selection back end. As the name suggests, this method is quite simple and straightforward: for each input unit, it directly selects one speech segment from a large database, and after that it directly concatenates these speech units into the output waveform. There is no explicit modeling of the alignment between the text and the waveform, because this alignment is already preserved in the speech units, so we do not really have to care about alignment in this kind of method.
However, the story becomes different when we use the HMM-based back end to synthesize speech.
Unlike unit selection, which directly generates the waveform, the HTS (HMM-based) approach does not directly predict the waveform. Instead, we first predict a sequence of acoustic features from the input text. Each acoustic feature vector may correspond to, say, 25 milliseconds of waveform, and we can use vocoders to reconstruct the waveform from the acoustic feature vectors. Each acoustic feature vector may contain, for example, the cepstral coefficients, the F0, and other kinds of acoustic features specific to the speech vocoder. But the general idea is this: in HTS we do not directly predict the waveform; instead, we first predict the acoustic feature vectors from the input text.
The question is how we can do that. Remember that the input information has been extracted from the text, including the phoneme identity and other prosodic tags. In HTS we normally encode, or convert, the linguistic features into a vector for each input unit. Each vector may contain information such as the phoneme identity, whether the vowel is stressed, and so on. We assign this kind of vector to each unit.
The question, of course, is how we can convert the sequence of encoded linguistic vectors into the output acoustic feature vectors. Remember, the number of linguistic vectors is equal to the number of units in the text, and this number is much smaller than the number of acoustic feature vectors we need to predict. This is the alignment issue.
Here is how the HTS system handles this issue.
Since this system is based on HMMs, the first thing we need to do is to convert the linguistic vectors into HMM states. This is done by simply searching through the decision trees; after that, we obtain the HMM states for each specific vector. After searching and finding the HMM states for each linguistic vector, the next step is to predict the duration of each HMM state: for example, we repeat the first HMM state two times, the second one three times. Given this duration information, we can create a state sequence like this. The length of this HMM state sequence is now equal to the number of vectors we need to predict in the output. The regression task then becomes much easier, because we can use many types of algorithms to generate vectors from each HMM state. Specifically, the HTS system uses the so-called maximum likelihood parameter generation (MLPG) to produce the acoustic feature vectors from the HMM states.
This is how the HTS system produces the output from the input linguistic feature vectors.
To briefly summarize the HTS system, we can use this picture. We generate the linguistic features from the input text; we do the search in the decision trees; after that, we predict the duration of each HMM state, and this is where the alignment is produced. Then we generate the output acoustic features; after that, everything is straightforward: we convert each state into the output vectors and do the waveform generation using the vocoder.
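As a rough pseudocode sketch of the back-end steps just described (the helper objects decision_trees, duration_model, mlpg, and vocoder are hypothetical placeholders, not the real HTS API):

```python
def hts_backend(linguistic_vectors, decision_trees, duration_model, mlpg, vocoder):
    # 1. map each linguistic vector to its HMM states via decision-tree search
    states = [s for v in linguistic_vectors for s in decision_trees.search(v)]
    # 2. predict how many frames (e.g. 25 ms each) every HMM state should last
    durations = [duration_model.predict(s) for s in states]
    # 3. expand the state sequence so it is frame-aligned with the output frames
    frame_states = [s for s, d in zip(states, durations) for _ in range(d)]
    # 4. generate acoustic feature vectors (e.g. with MLPG) and run the vocoder
    acoustic_features = mlpg(frame_states)
    return vocoder.synthesize(acoustic_features)
```

The important point is step 2 and 3: the explicit duration model is what produces the alignment before any regression happens.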
From HTS to DNNs, the change is straightforward: we just need to replace the HMM states with neural networks, feed-forward or recurrent ones. However, in this kind of framework we still need the duration model; we need to predict the alignment from the linguistic feature vectors. Without that, we cannot prepare the input to the neural networks. Indeed, as the paper by Alex Graves says, RNNs are usually restricted to problems where the input and output sequences are well aligned. In other words, when using the common feed-forward or recurrent neural networks, we still need additional tools, including the HMM, to learn and generate the alignment for the TTS task.
We may wonder whether we can use a single model to jointly learn the alignment and do the regression. This is where the sequence-to-sequence models come onto the stage. In fact, they are even more ambitious: they want to use a single neural network to jointly learn the alignment, do the regression, and even conduct the linguistic analysis on the input text. There is a lot of recent work showing that this approach is reasonable, and that with really deep neural networks we can achieve better quality for TTS.
Okay, let's look at the sequence-to-sequence TTS models. Remember that the task of the sequence-to-sequence model is to convert the text into the acoustic feature sequence, and we need to solve three specific tasks: how to derive the linguistic features, how to learn and generate the alignment, and how to generate the output sequence. Again, we cannot simply use common neural networks such as the feed-forward or recurrent ones. For this kind of sequence-to-sequence model, we normally use the attention mechanism.
For the explanation, I will use x as the input and y as the output. Note that the input and the output generally have different numbers of time steps, so they have different lengths.
The first framework we can use is the so-called encoder-decoder framework. Here we use an RNN layer as the encoder to process the input, and we extract a vector c from the last hidden state of the encoder. After that, we use this c vector as the condition to generate the output sequence step by step. If we write down the equations, they look roughly like the sketch below: you can see how the output is factorized along the time steps, and the condition c is used at every time step.
This framework is straightforward and simple: no matter how long the input sequence is, we can always compress the input information into a single vector. However, there is also an issue, because we have to use this single c vector across all the time steps when we generate the output. Can we extract a different context from the input when we generate different output time steps?
The answer is yes, and we can use the attention mechanism to achieve this goal. Suppose we want to generate the second output time step y2. We take the hidden state of the decoder from the previous time step and feed it back to the encoder side; after that, we compute some kind of weight vector through a softmax layer; then we do a weighted sum over the input information and produce the vector c2. We can use this c2 vector as the input to the decoder and produce the y2 vector. This is how the context information is calculated for the second time step. Note that we can save the output of the softmax layer; it is the weight information used for the second time step.
We can repeat the process for the next time step: in this case, we feed back the hidden state of the decoder at the second time step, and then we calculate the vector c3 for the output y3. In general, we can do this for every time step, and we can write the equations like these.
After we save the outputs of the softmax along all the time steps, we can notice something: the weights calculated by the softmax gradually change. As we generate the output along the time axis, the weights also move along the input sequence, as you can see from this picture. This is also called the alignment matrix, and you can find this kind of picture in many papers on TTS or speech recognition.
To briefly summarize the attention-based sequence-to-sequence models, we can use these equations. For each output time step n, we calculate the softmax weight vector alpha_n; then we use this vector to summarize the information from the input, doing a weighted sum over the h vectors; that gives us the context vector c_n for that time step; with the context c_n we generate the output y_n; and we repeat the process for all the time steps. This is, in general, how the attention-based sequence-to-sequence model works.
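The equations below sketch this per-step computation in my own notation (h_1, ..., h_I are the encoder outputs, s_{n-1} is the previous decoder state, alpha_n the attention weights); the exact form of the score function differs between papers:

```latex
e_{n,i} = \mathrm{score}(s_{n-1},\, h_i), \qquad
\alpha_{n,i} = \frac{\exp(e_{n,i})}{\sum_{j=1}^{I}\exp(e_{n,j})}, \qquad
c_n = \sum_{i=1}^{I} \alpha_{n,i}\, h_i, \qquad
y_n = \mathrm{Decoder}(s_{n-1},\, c_n,\, y_{n-1})
```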
As you can see from the previous explanation, the attention mechanism is essential for a sequence-to-sequence TTS model, and for this reason many different types of attention have been proposed. When I read the papers, I noticed that there are so many types of attention we can use: self-attention, forward attention, hard attention, soft attention. What is the relationship between the different types of attention, and what is the purpose of using a specific one? In the next few slides I will explain them in a more systematic way. As my own proposal, I organize the attention types based on what kind of features are used to compute the alignment, how they compute the alignment, and what kind of constraints are put on the alignment.
With respect to the features used to compute the alignment, we can organize the attention types according to whether they are content-based, location-aware, or purely location-based. With respect to the way the alignment is computed, we can organize them into three groups: additive, dot, and scaled dot attention. For the final axis, we can ask whether the attention is monotonic or forward attention, local attention, or global attention. This is my proposal for organizing the so-called soft attention.
Soft attention is not the only group we can find in the literature. If we read the papers, we can find another group, the so-called hard attention. The difference from soft attention is that in hard attention the alignment is treated as a latent random variable: we need to use all kinds of tools, such as dynamic programming and marginalization, to calculate the probability and to marginalize out the latent variable. I will talk more about the difference between the two groups of attention in later slides, but for now I will focus on the soft attention.
Let's first look at the dot, scaled dot, and additive attention. These three types of attention use different ways to compute the alignment matrix. Suppose we are going to compute the output y_n for the n-th time step. What we have is the decoder state of the previous time step, s_{n-1}; we also have the features extracted from the input text, which are denoted as h. The three types of attention differ in the way they compute the input to the softmax layer; the output of the softmax is the alignment matrix.
The first one, the dot attention, directly multiplies the two vectors, s_{n-1} from the decoder and h from the encoder; that is why it is called the dot attention. The scaled dot attention is quite similar, but in this case we add a scaling factor to change the magnitude of the activation fed into the softmax layer. The last type is the additive attention: in this case, we apply linear transformations to the two vectors and then add the transformed vectors together; this is the reason why it is called the additive attention.
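A hedged sketch of the three score functions, written in a common textbook form (the exact parameterization in each paper may differ; W, U, v, and the dimensionality d here are assumed model parameters):

```latex
e^{\text{dot}}_{n,i} = s_{n-1}^{\top} h_i, \qquad
e^{\text{scaled}}_{n,i} = \frac{s_{n-1}^{\top} h_i}{\sqrt{d}}, \qquad
e^{\text{additive}}_{n,i} = v^{\top}\tanh\big(W s_{n-1} + U h_i\big)
```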
Note that for all three types of attention in this example, we use the s vector from the decoder and the h vector from the encoder. In other words, we can consider h as the content of the input: we combine the content extracted from the input text with the hidden state of the decoder in order to compute the alignment matrix.
This brings us to the second question, based on which we can classify the different types of attention: what kind of features can we use to compute the alignment? In the previous slide I explained the dot, scaled dot, and additive attention using examples where the decoder state and the content vector h are used to compute the alignment. These methods are called content-based attention because they use the content vector.
However, this is not the only way to compute the alignment. The second way is the so-called location-aware attention. As you can see from these two equations, compared with the content-based attention, the location-aware attention additionally uses the attention vector from the previous time step; this attention is aware of the previous alignment, and that is why it is called location-aware attention.
The third type of attention in this group is the so-called location-based attention. Compared with the location-aware attention, we can notice from this equation that the content vector h is removed from the input. In other words, in the location-based attention we do not care about the content: we compute the alignment matrix purely from the decoder state and the alignment of the previous step.
Finally, there is a simplified variant of the location-based attention: in this case, we only use the decoder state to compute the alignment, without using the alignment from the previous time step.
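One way to summarize the four variants is by the arguments of the score function (again in my own notation, with alpha_{n-1} the alignment vector from the previous output step):

```latex
\text{content-based:}\quad   e_{n,i} = \mathrm{score}(s_{n-1},\, h_i) \\
\text{location-aware:}\quad  e_{n,i} = \mathrm{score}(s_{n-1},\, \alpha_{n-1},\, h_i) \\
\text{location-based:}\quad  e_{n,i} = \mathrm{score}(s_{n-1},\, \alpha_{n-1}) \\
\text{simplified variant:}\quad e_{n,i} = \mathrm{score}(s_{n-1})
```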
From the equations of these four types of attention, you may notice that when we compute the attention, or the alignment matrix, for each output time step, we consider all input time steps. This leads to the third dimension along which we can classify the attention. Along this dimension I'd like to explain two types of attention. The first one is the so-called global attention: as the name suggests, when we compute the alignment for each output time step, we consider extracting information from all the input time steps, so the alignment vector here has no zero elements. In contrast, when we use local attention, we allow some of the alignment elements to be zero; for example, in this case we only consider extracting information from the input steps in the middle.
Now I have explained the three dimensions along which we can classify the soft attention. In fact, all the examples I have explained can find their location in this three-dimensional space.
But let me give one more concrete example: the self-attention. The self-attention is a scaled dot attention, it is content-based, and it is a global attention. Let's see how it is defined.
If we look at the equations of the self-attention, we can see why it is called a scaled dot, global, and content-based attention. The special thing about self-attention is that we extract both the feature vectors h and the query vectors from the same input sequence; in other words, we are computing the alignment over the input sequence itself.
Because of this, we can compute everything in parallel, and we can also define a matrix form for the self-attention. In this case, we formulate the input feature sequence as a matrix and write the scaled dot attention in matrix form. The matrices are called the query, key, and value matrices; in this case they all refer to the same matrix H. In other words, the self-attention performs a transformation on the input sequence, and the output sequence has the same length as the input.
In some sense, we can consider the self-attention as a special type of convolutional or recurrent layer that transforms the input into an output of the same length. Of course, we can also use the self-attention for alignment learning; in that case it is just a special type of soft attention, a scaled dot, content-based attention. As you can see from the equations, in that case we replace the query matrix with the states from the decoder, but the process is quite similar, and we can do everything in parallel by using matrix multiplication. A small numerical sketch is shown below.
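A minimal numerical sketch of the scaled dot-product self-attention in matrix form, assuming the query, key, and value are all the same matrix H (this is a NumPy illustration of the idea, not the code of any particular TTS system):

```python
import numpy as np

def scaled_dot_self_attention(H):
    """H has shape (T, d): T input time steps, d feature dimensions.
    Query, key, and value are all H itself, so the output length is also T."""
    d = H.shape[-1]
    scores = H @ H.T / np.sqrt(d)                            # (T, T) alignment scores
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ H                                       # weighted sum over inputs

H = np.random.randn(5, 16)                  # 5 input time steps, 16-dim features
print(scaled_dot_self_attention(H).shape)   # (5, 16): same length as the input
```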
By now I have explained the three dimensions used to classify the soft attention, and also an example based on the self-attention. In fact, there are more ways to combine different types of attention, and you can find such variants in the paper published by Google this year.
Given the explanation of the soft attention, let me now quickly explain how it works in a TTS system. For TTS systems that use attention-based sequence-to-sequence models, we use almost the same framework as the one used for speech translation, machine translation, or speech recognition. In this case the input is the phonemes or characters, and the output is the acoustic feature vector sequence; we still have the encoder, the attention, and the decoder, which is autoregressive. Of course, we can do more, for example adding more layers, increasing the number of recurrent layers in the decoder, or adding a pre-net that receives the feedback from the previous time step in the autoregressive decoder; this is free to choose. But the basic idea is still the attention-based approach to learning the alignment between the input and the output.
This gives us the basics to understand the first famous TTS system based on the sequence-to-sequence model: the Tacotron system. As you can see from the picture in the original paper, the architecture of the network can be generally divided into three groups: the decoder, the attention, and the encoder. Different systems just differ in how they define the encoder, for example by using different types of hidden layers to extract information from the input phoneme or character sequence; but the basic idea is still the same: use attention to learn the alignment between the input and the output.
In fact, Tacotron is not the only model that uses the sequence-to-sequence approach. As far as I know, the first model might be the pioneering work by Alex Graves: if you listen to his talk from 2015, you can notice that he played some samples using the attention-based framework, so attention-based sequence-to-sequence TTS already existed in 2015. After that, there was one Interspeech paper on attention-based TTS that first used attention in a published paper. Then came the Tacotron system in 2017. Meanwhile, there are different types of systems, for example Char2Wav, Tacotron 2, DC-TTS, Deep Voice 3, and Transformer TTS. All these systems are based on the attention mechanism.
Here I'd also like to mention one special system, the so-called VoiceLoop, which is also a sequence-to-sequence TTS but actually uses a different type of alignment learning, the so-called memory buffer. If you are interested in this model, you can find an illustration in the appendix.
To help you understand the differences between the different types of sequence-to-sequence TTS systems, I summarize the details and differences across systems in this table. There are many details here, for example the waveform generator, the acoustic features, and the architecture of the encoder and decoder, but let's focus on the attention. As you can see, the Tacotron-based systems mainly use the additive attention, of course combined with location-sensitive mechanisms. There are also other systems, for example Char2Wav, which directly uses location-based attention, and there is a purely self-attention-based system, the Transformer TTS. You can find the details later in the slides.
Now I'd like to play some samples published with these papers. They are from the official websites, and the data is in the public domain. For the systems trained on their own internal data, I cannot put the samples here, but you can find them on their websites.
[Audio samples played: pairs of synthesized utterances from the published systems, including "Prosecutors have opened a massive investigation into allegations of fixing games and illegal betting" and "... had accepted it as a numerical value without any physical explanation."]
After playing the samples, I hope you have a general impression of how the sequence-to-sequence TTS systems sound. Of course, the quality might not be as good as what we heard in the ASVspoof 2019 data; there are many different reasons for that. If you want to find other good examples, I suggest the samples of Tacotron and Transformer TTS, where the authors used their own internal data to train the systems.
After listening to the samples, you may wonder whether the soft attention is good enough for TTS. I think the answer is no. The samples I played are all good examples, but there are actually many cases where the sequence-to-sequence based TTS systems do not work. For those cases we need to consider specific attention mechanisms designed for TTS.
This leads us to another group of systems, which use the monotonic and forward attention. Before explaining this type of model, I think we need to first explain why the global attention, or the global alignment, sometimes does not work. Remember that for the global alignment, or global attention, we need to compute the alignment between every pair of input and output time steps. This might be necessary for other tasks such as machine translation, but it might not be necessary for TTS, and this kind of alignment is hard to learn; sometimes it simply does not work.
I'd like to play one sample. This is a sample from a paper from Microsoft Research, where they used the global attention to generate a very long sentence. The text transcription is here; this is the input.
[Audio sample played: a long sentence containing a file path with many backslashes, ending with "... that makes post-processing a little painful ..."; the synthesized speech breaks down partway through the utterance.]
I hope this interesting example shows how the soft attention might not work when we use a long text as input. This is an issue we need to solve. So what can we do to alleviate the problem? One thing we can consider is that for text-to-speech there is some kind of monotonic relationship between the input and the output, because human beings read the text from left to right. We can use this kind of prior knowledge to constrain the alignment, so that it becomes easier for the system to learn the mapping from the input to the output.
The idea looks like this. This is the motivation behind the monotonic and forward attention. The essential idea of the forward (monotonic) attention is to re-compute the alignment matrix: suppose we have computed an alignment matrix like this; then, with some kind of prior knowledge, we re-compute the alignment matrix to encourage a monotonic alignment.
To show how it works, consider this simple task: convert the input x1, x2, x3 into the output y1, y2, y3.
Suppose we have used the soft attention and we have computed the alignment for the first output time step. This is where we can introduce prior knowledge to constrain the alignment learning. Suppose the alignment can only start from the first input time step; then we can put an alignment vector alpha-0-hat here to indicate the initial condition, so in this case it is (1, 0, 0). Furthermore, we constrain the alignment so that it can only stay at the same input step or transit from the previous input step to the next one, like a left-to-right HMM. Based on these conditions, we can re-compute the alignment vector alpha-1-hat like this, and we then use it in place of the original one.
To give you a concrete example, suppose alpha-1 is equal to (0.5, 0.4, 0.1); after the re-calculation we get a new vector. You can notice how the probability of aligning y1 with x3 is reduced from 0.1 to 0. This is how the forward re-calculation of the alignment matrix suppresses the impossible alignments during the model training stage. Of course, in the paper the authors also propose other mechanisms to re-compute the alignment matrix, but the essential idea is the same. A small sketch of this re-computation is shown below.
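This is my own simplified implementation of the forward re-computation under the left-to-right constraint, reproducing the worked example from the slides (it is a sketch of the idea, not the exact code of the forward-attention paper):

```python
import numpy as np

def forward_attention_step(alpha_fwd_prev, alpha_soft):
    """alpha_fwd_prev: forward alignment from the previous output step (length N).
    alpha_soft: alignment given by the soft attention at the current step (length N).
    The alignment may only stay at the same input step or move one step to the
    right, like a left-to-right HMM."""
    shifted = np.concatenate(([0.0], alpha_fwd_prev[:-1]))   # alpha_fwd_prev(i-1)
    alpha_new = (alpha_fwd_prev + shifted) * alpha_soft
    return alpha_new / alpha_new.sum()                        # re-normalize

# the worked example: initial condition (1, 0, 0), soft alignment (0.5, 0.4, 0.1)
alpha0 = np.array([1.0, 0.0, 0.0])
alpha1 = np.array([0.5, 0.4, 0.1])
print(forward_attention_step(alpha0, alpha1))
# -> approximately [0.556, 0.444, 0.0]; the weight on x3 drops from 0.1 to 0
```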
Given the re-calculated alignment vector, we can use it to compute the output of the first time step. We then repeat the process, learning the alignment while computing the outputs y1 to y3.
Interestingly, if we check the alignment matrices in the paper, we can see how the forward attention differs from the conventional soft-attention-based approaches, especially in the first rows of the alignment matrices, which show the alignment after only one training epoch. For the baseline without any constraint, the alignment is just a random or uniform distribution. For the forward attention, with the re-calculation over the matrix, you can see that the alignment matrix already has a monotonic shape. We can also consider this monotonic shape as a prior, a constraint on what we can learn from the input and output data. Based on this example, I think you can understand why the forward attention makes it easier for the TTS system to learn the alignment between the input and the output.
In addition to the forward attention, there are also other types of monotonic attention, for example using different computation forms or combining it with local attention. However, I'd like to mention that the forward, also called monotonic, attention cannot guarantee that the attention will be exactly monotonic. There are many reasons for that, but I think the fundamental reason is that we are still using soft attention, where we compute the alignment and summarize the context from the input data in a deterministic way. This is the issue we would like to solve by using hard attention, which I will explain in later slides.
Okay, let's play some samples to see how the forward attention works. This is the same text that I played before: with the soft attention, the TTS system does not read this sentence correctly. Let's listen to how the forward-attention-based system works.
[Audio sample played: the same long backslash-filled sentence, now read through to the end, "... that makes post-processing a little painful, since the files and reports crashes in a hierarchical structure ..."]
From this example, you can notice how the forward attention made the system successfully read the latter part of this nonsense sentence. This is a good example of how the forward attention works. But again, as I mentioned in the previous slide, the forward attention is not guaranteed to produce a monotonic alignment.
Here is one example from the Microsoft paper.
[Audio sample played: a news sentence about a district-court ruling, in which the synthesized speech repeats the phrase "rival chip firms" many times before continuing.]
This is a funny example; I hope you noticed how the forward-attention system repeats the phrase "rival chip firms" multiple times. You can also see this in the alignment plot here: in this case the alignment is not monotonic. So again, the soft attention, even with the forward mechanism, does not guarantee that a monotonic alignment is learned from the data.
Anyway, from the previous samples I think you can hear how the forward attention can help the TTS system learn the alignment for long sentences. There are actually other TTS systems using the forward attention, for example the papers listed here. I will not play their samples here; if you are interested, you can find the samples on their websites or in the slides.
Since the soft attention cannot guarantee a monotonic alignment during generation, we have to find another solution. One potential answer could be the hard attention. Here is my understanding of how hard attention works. Suppose we have the soft-attention alignment matrix; this matrix tells us the probability that each output time step is aligned with each input time step. From this alignment probability matrix, we may sample a monotonic alignment like this. That is the idea if we want to use a monotonic alignment for TTS generation.
However, we have to take into account that there are multiple candidates for the alignment, for example the alignments shown here, and we have different probabilities of drawing these samples. Accordingly, during training we have to take into account the uncertainty over the different alignments: if we want to evaluate the model likelihood during training, we have to treat the alignment as a latent variable in a probabilistic model. This idea is very similar to the hidden Markov model, and as you can imagine, during training we have to use all kinds of dynamic programming, forward, or search algorithms to evaluate the model likelihood.
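In other words, the training criterion marginalizes over the latent alignment A, typically with a forward-algorithm-style dynamic program (a sketch in my own notation, not the exact formula of any specific paper):

```latex
p(y \mid x) = \sum_{A} p(y \mid A,\, x)\, p(A \mid x)
```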
To give you a more intuitive picture of how the hard attention works, we can compare it with the soft attention. As you can see from this picture, with the soft attention, for each output time step we directly calculate a weighted sum to extract information from the input; that is how the alignment is used during generation with soft attention, and we repeat this operation for all the time steps.
In contrast, with the hard attention we have to draw samples: we have to select only one possible alignment for each time step. Of course, we can use more sophisticated techniques such as beam search or Viterbi decoding to select a good alignment for the TTS generation. But this is how generation works with hard attention: compared with soft attention, we do not compute a weighted sum; instead, we draw samples.
Similarly, in the training stage we have to use dynamic programming to sum over all possible alignments in order to evaluate the model likelihood for the hard-attention-based models. In contrast, soft attention does not require this; we just do the same thing as in the generation stage, computing the weighted sum for each time step.
Because of these differences between soft and hard attention, we need a different space to categorize the techniques for hard attention, which leads to this diagram; I think it makes the different kinds of hard-attention techniques easier to understand. However, due to the limited time, I cannot explain the details of hard attention here; if you are interested, please see the slides where I explain the hard attention in more detail.
In terms of TTS systems with hard attention, as far as we know there is only one group actually using hard attention for TTS, and it is our group. You can find the reference papers on the website below, as well as many details on how we use different types of search and sampling techniques to produce the output alignment from the hard-attention-based models.
Given the details on the soft attention and the brief introduction to the hard attention, we now come to the third group, the hybrid approaches for the sequence-to-sequence TTS models.
From the first part of this tutorial, I hope you understand that the soft attention is easy to implement, but it might not work when generating long utterances. The hard attention may help to solve this issue because it guarantees a monotonic alignment during generation. However, according to our experiments, the hard attention might not be as accurate as the soft attention; for example, it sometimes overestimates the duration of silences.
For both soft and hard attention, we compute the alignment probability for each pair of input and output time steps. For TTS, because the output sequence can be quite long, this means we have to calculate a large matrix of alignment probabilities, which is not easy.
Of course, we can do something more efficient. Suppose we can summarize the alignment information from the matrix, so that we know roughly how many output time steps need to be generated for each input token. Using this information, we can build one probabilistic model per input token just to estimate how many time steps it needs to produce during the generation stage. This idea is not new; it has actually been used in the HMM- and DNN-based systems, and it is also the idea behind the hybrid approaches.
The hybrid approaches first use an attention-based model to extract the alignment matrix. After that, they summarize this information, for example the duration, i.e., how many output time steps we need to repeat for each input token. Having summarized this information, we can train a duration model directly for each input token.
During the generation stage, we can directly plug in the trained duration model: as you can see from this picture, we just need to predict how many output time steps to repeat for each input token. Given this duration information, we can do the up-sampling simply by duplicating each input vector. The input to the decoder is then well aligned with the output sequence we want to generate, and we can use normal neural networks such as feed-forward, recurrent, or autoregressive networks to convert the input into the output acoustic feature sequence. A small sketch of this up-sampling is shown below.
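A minimal sketch of the duration-based up-sampling, a simplified "length regulator" in the spirit of the hybrid approaches (the toy values and function name are my own, not from any specific system):

```python
import numpy as np

def upsample_by_duration(encoder_outputs, durations):
    """Repeat each input vector according to its predicted duration (in frames),
    so the up-sampled sequence is frame-aligned with the output to generate."""
    repeated = [np.repeat(encoder_outputs[i:i + 1], int(d), axis=0)
                for i, d in enumerate(durations)]
    return np.concatenate(repeated, axis=0)

tokens = np.random.randn(3, 8)        # 3 input tokens, 8-dim encoder outputs
durations = [2, 3, 1]                 # predicted number of output frames per token
frames = upsample_by_duration(tokens, durations)
print(frames.shape)                   # (6, 8): ready for a feed-forward decoder
```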
Here are some TTS systems using the hybrid approaches: FastSpeech uses soft attention to extract the duration, while AlignTTS and other systems use different kinds of techniques to extract the duration. I'd like to play some samples extracted from the published papers; I will play just one sample for each system, from FastSpeech and FastSpeech 2.
[Audio samples played from the published FastSpeech systems.]
Although I only play short samples here, you can find longer sentences on their websites. What I want to show with these examples is that by using the hybrid approaches we can generate synthetic speech with quite robust durations. I think that is one strong point of the hybrid approaches.
Okay, let's come to the summary. In this tutorial I first explained the pipeline TTS systems, including the HMM- and DNN-based systems. In the pipeline TTS we need the front end to extract linguistic information from the input text; after that, we need the duration model to predict the duration of each input unit; following that, we need the acoustic model and the waveform generator to convert the linguistic features into the final waveform.
In 2016, DeepMind proposed WaveNet. Although WaveNet is not explained in this tutorial, I'd like to mention that the original WaveNet still needs the front end and the duration model. It achieved its astonishing performance because it uses a single network to directly convert the linguistic features into the waveform sampling points. This avoids the issues, or artifacts, that appear when we use conventional waveform generators such as vocoders.
Different from these two types of TTS systems, the sequence-to-sequence models use a single model to convert the input text into the acoustic features: a single model does the alignment learning, the duration modeling, and the acoustic modeling. In fact, many sequence-to-sequence models also use WaveNet-like waveform generators to further improve the quality of the synthesized speech.
If we summarize the differences between the pipeline systems and the sequence-to-sequence systems, I think there are four aspects. First, we replace the conventional front end of the pipeline system with a trainable, implicit front end in the sequence-to-sequence model. Second, instead of using an external duration model, we may jointly do the duration modeling with the sequence-to-sequence mapping. The third point concerns the acoustic model: although it is not explained in this tutorial, most of the sequence-to-sequence models use so-called autoregressive decoding, producing one output time step conditioned on the previous time steps. The last point is the neural waveform models: as I mentioned in the previous slide, many of the sequence-to-sequence models use neural waveform models like WaveNet.
The first three types of differences are implemented through the attention-based sequence-to-sequence models, so in this tutorial we focused on the attention mechanism.
We first explained the soft attention, and we grouped the soft-attention approaches along three dimensions: what kind of features are used to calculate the alignment matrix, how the alignment is calculated, and what kind of constraints are put on the alignment. We also mentioned the shortcoming of the soft attention: it does not guarantee a monotonic structure. We then looked at the hard-attention-based approach; however, the hard attention might not be accurate enough to produce natural speech. This brings us to the last possible solution, the hybrid approach, where we do not use attention during generation.
All four aspects are quite essential to the performance of the sequence-to-sequence TTS models.
Of course, we may wonder which factor contributes most to the performance of the sequence-to-sequence models. To answer that, Oliver Watts and his colleagues designed experiments to analyze the impact of each of these factors on the quality of speech generated from the sequence-to-sequence models. I recommend reading their paper to understand why the sequence-to-sequence models outperform the pipeline TTS systems.
Before we end this tutorial, let me briefly mention other research topics based on the sequence-to-sequence TTS models. The first is the neural waveform models, which have been used in many sequence-to-sequence models; due to the limited time, I cannot explain the neural waveform models here, but you can find the reference papers in the reading list. Another topic is speaker, style, and emotion modeling in sequence-to-sequence models; prosody is also a hot topic in sequence-to-sequence modeling.
In terms of multi-speaker modeling, most of the sequence-to-sequence models are quite straightforward: they either jointly train the speaker vectors with the sequence-to-sequence model, or they use a separate speaker model to extract the speaker vectors from the reference speech; the latter is the so-called zero-shot learning for multi-speaker TTS.
In terms of prosody, some papers focus on segmental prosody, for example the lexical tones or the pitch accents; most of these papers focus on pitch-accent or tonal languages such as Mandarin or Japanese. In terms of suprasegmental variation, there are also papers combining prosody embeddings with Tacotron-based systems, as well as systems using variational autoencoders to extract the prosody embeddings from the reference speech.
Finally, I'd like to mention another direction of TTS research, and that is TTS for entertainment. For example, in this paper the authors use traditional Japanese comedy data to train the TTS system; the goal of this kind of TTS system is not only speech communication but also entertaining the audience.
This is the end of this tutorial. You can find the slides on my GitHub page; I recommend checking the additional slides, the reading list, and the appendix. Thank you for listening.