0:00:00 | Welcome to Speaker Odyssey 2020. |
---|
0:00:02 | This is the tutorial session on text-to-speech synthesis. |
---|
0:00:06 | I'm Xin Wang, from the National Institute of Informatics, Japan. |
---|
0:00:10 | I'm going to deliver this tutorial on text-to-speech synthesis. |
---|
0:00:15 | First, a brief self-introduction. |
---|
0:00:17 | I'm a postdoc at NII. I got my PhD two |
---|
0:00:22 | years ago from SOKENDAI. During my PhD I was working on |
---|
0:00:25 | text-to-speech synthesis. |
---|
0:00:27 | Since starting the postdoc, I have been working on speech and also music audio |
---|
0:00:31 | generation. |
---|
0:00:33 | Meanwhile, I am also getting involved in ASVspoof and the Voice Privacy |
---|
0:00:38 | Challenge of this year. |
---|
0:00:42 | For this tutorial, I'd like to first apologize about the abstract. |
---|
0:00:48 | In the abstract I mentioned that I would explain the recent neural-network-based acoustic |
---|
0:00:53 | models and waveform generators, |
---|
0:00:56 | the classic hidden-Markov-model-based approaches, and also voice conversion. |
---|
0:01:02 | But this abstract seems to be too ambitious; I don't think I can cover all topics |
---|
0:01:07 | in a one-hour tutorial. So in this tutorial I will focus on the recent neural- |
---|
0:01:13 | network-based acoustic models, including Tacotron and its variants. |
---|
0:01:19 | Other topics such as waveform generators and HMM-based TTS I have to leave out of |
---|
0:01:26 | this tutorial. |
---|
0:01:28 | If you are interested, you can find useful notes and the reference papers in |
---|
0:01:33 | the slides. |
---|
0:01:35 | For this tutorial, I'm going to focus on the recent approaches like Tacotron |
---|
0:01:40 | and the related |
---|
0:01:42 | sequence-to-sequence TTS models. |
---|
0:01:45 | I'm going to talk about how they work and what their differences are. |
---|
0:01:50 | This tutorial is based on my own reading list; I summarize what I have learned |
---|
0:01:55 | and what I have implemented with my colleagues, |
---|
0:01:59 | so the content may not be comprehensive. |
---|
0:02:03 | However, I will try my best to include more content that I have summarized |
---|
0:02:08 | in the notes of each slide. |
---|
0:02:11 | I also provide an appendix with a reading list of what I have read in the past. |
---|
0:02:17 | I hope you enjoy this tutorial, and of course all your feedback is welcome. |
---|
0:02:23 | For this tutorial, I'd like to first give a brief introduction to the current |
---|
0:02:29 | situation, or the state of the art, of TTS research. After that I will give an |
---|
0:02:36 | overview of TTS, briefly introducing the classical methods |
---|
0:02:41 | and why we are here today. |
---|
0:02:42 | After that, I will spend most of the time of this tutorial on |
---|
0:02:47 | the sequence-to-sequence TTS, the state-of-the-art TTS nowadays, |
---|
0:02:52 | explaining different types of sequence-to-sequence TTS: those based on soft attention, hard |
---|
0:02:58 | attention, |
---|
0:02:59 | and hybrid approaches. Finally, I will make a summary and draw conclusions. |
---|
0:03:07 | Let's begin with the introduction. |
---|
0:03:10 | TTS is a technology that converts input text into an output waveform. |
---|
0:03:15 | One famous example of a TTS application is the speech device used by Professor Stephen Hawking. |
---|
0:03:21 | Nowadays we have more types of applications based on TTS. |
---|
0:03:26 | One example is the intelligent robot. |
---|
0:03:29 | We also have digital assistants on cell phones and computers. |
---|
0:03:35 | Research on TTS has a really long history. If we read the books and reference |
---|
0:03:41 | papers on TTS, we can find many different types of TTS methods, such as formant |
---|
0:03:46 | synthesis, unit selection, and WaveNet. |
---|
0:03:49 | The reason why researchers are still working on TTS is that |
---|
0:03:55 | they want to make synthesized speech as natural as possible, as natural as |
---|
0:04:00 | human speech. For some types of applications, we also want the synthesized speech to |
---|
0:04:07 | sound like us. |
---|
0:04:09 | Towards this goal, researchers have put a lot of effort into TTS research. |
---|
0:04:15 | However, it was not until recent years that researchers found |
---|
0:04:21 | really good models to achieve this goal. |
---|
0:04:25 | Here I'd like to use the ASVspoof data to show the rapid progress |
---|
0:04:30 | of TTS. |
---|
0:04:32 | The first picture is from ASVspoof 2015. |
---|
0:04:36 | It is an i-vector space where we show different types of TTS systems and their |
---|
0:04:41 | distance from the natural speech, the genuine human speech. |
---|
0:04:44 | You can see there are many systems here; most of them are based on |
---|
0:04:49 | HMM or GMM-based voice conversion. |
---|
0:04:53 | This statistical TTS is really far away from the natural speech; it is only |
---|
0:04:58 | unit selection that is close to the natural speech. |
---|
0:05:03 | So how about ASVspoof 2019, after four years of research? |
---|
0:05:08 | Here are the results based on x-vectors. Compared with the picture from 2015, |
---|
0:05:15 | we can see there are so many TTS systems that are really close to the |
---|
0:05:19 | natural speech, |
---|
0:05:21 | not only the unit selection. I'd like to point out a few here. |
---|
0:05:25 | The first example is the HMM and DNN systems. As you can see from this |
---|
0:05:30 | figure, they are still far away from natural speech. |
---|
0:05:35 | The unit selection is still close to natural speech. Meanwhile, we can see |
---|
0:05:40 | other types of TTS methods, |
---|
0:05:43 | including the sequence-to-sequence TTS and WaveNet, and they are |
---|
0:05:46 | really close to natural speech. |
---|
0:05:48 | Of course, this figure is based on acoustic features, either the x-vectors or the i-vectors. |
---|
0:05:55 | But the question is how the synthesized speech really sounds in human |
---|
0:06:00 | perception. |
---|
0:06:03 | To answer that question, I'd like to use the results from our |
---|
0:06:08 | recent study, where we conducted a human evaluation on the ASVspoof |
---|
0:06:13 | 2019 data. |
---|
0:06:14 | Here we asked human evaluators to evaluate how much the synthesized speech sounds |
---|
0:06:20 | like the target speakers, and |
---|
0:06:25 | what the quality of the synthesized speech is compared with the natural speech. |
---|
0:06:30 | We show the results by using the DET curves. |
---|
0:06:36 | As you can see from the left-hand side, the |
---|
0:06:40 | HMM and DNN systems are really far away from the natural speech in |
---|
0:06:45 | terms of the speaker similarity, so the whole distribution is far away from the natural |
---|
0:06:51 | target speech. |
---|
0:06:53 | Unit selection is closer, but still not close enough. It is only the sequence-to-sequence |
---|
0:06:58 | system, as you can see from this picture, that is really close to the target speaker's |
---|
0:07:03 | natural speech. In this case the EER is roughly close to fifty |
---|
0:07:09 | percent, |
---|
0:07:11 | which means that for this kind of system the synthesized speech sounds |
---|
0:07:15 | like the target speakers, and human beings cannot tell them from each other. |
---|
0:07:21 | There is a similar trend if we look at the results in terms of speech quality: |
---|
0:07:28 | the DNN and the unit selection are not good enough; it is only the sequence-to-sequence |
---|
0:07:34 | model that is really close to the natural speech. |
---|
0:07:37 | From these results we can have a general idea of how the |
---|
0:07:40 | recent models based on sequence-to-sequence modeling improve the quality and |
---|
0:07:45 | speaker similarity, so that even human beings cannot tell them from the natural speech. |
---|
0:07:54 | Okay, after introducing the results, I'd like to play some samples from the ASVspoof 2019 |
---|
0:07:59 | database, |
---|
0:08:01 | and I think you can get a general perception of how the synthesized samples sound |
---|
0:08:06 | compared with the natural speech. |
---|
0:08:10 | [Synthesized and natural speech samples from two speakers are played.] |
---|
0:08:32 | These are samples from two speakers. I think you may agree that the unit selection |
---|
0:08:39 | sounds like the natural speech in terms of the speaker identity, but you can sometimes |
---|
0:08:45 | perceive the artifacts when we concatenate different units together. |
---|
0:08:53 | The HMM system sounds close, but it sounds like artificial speech. It is the sequence-to- |
---|
0:09:01 | sequence models that truly sound like the target speakers. |
---|
0:09:05 | If you are interested, you can find more samples on our website, or download the |
---|
0:09:10 | ASVspoof 2019 database to have a try. |
---|
0:09:15 | After listening to the TTS samples from ASVspoof 2019, |
---|
0:09:20 | I'm going to talk about more details of TTS: what kind of problems we may |
---|
0:09:25 | face when we build a TTS system, what kind of solutions we can use, |
---|
0:09:29 | and how we come up with the idea of the sequence-to-sequence TTS models. |
---|
0:09:36 | So, what are the problems we may face when we build the TTS system? |
---|
0:09:40 | To give an example, here is one sentence from the guidelines for ToBI |
---|
0:09:45 | labelling: "Marianna made the marmalade". |
---|
0:09:49 | The first thing we need to note when we convert the text into the waveform is |
---|
0:09:53 | that |
---|
0:09:53 | the text is basically discrete; |
---|
0:09:56 | it comes from a finite set of symbols, |
---|
0:09:59 | while the waveform is continuous in the time domain and also in the amplitude |
---|
0:10:03 | domain. |
---|
0:10:04 | So, |
---|
0:10:05 | because of this basic difference between the text and the speech, the first thing we |
---|
0:10:11 | notice is the ambiguity in pronunciation: for example, the "ma" segments in "marmalade", "Marianna", |
---|
0:10:18 | and "made" are pronounced in different ways. The second thing is about alignment. |
---|
0:10:23 | For example, when we say "made", |
---|
0:10:27 | we may shorten or increase the duration of the sound when we pronounce |
---|
0:10:32 | it. |
---|
0:10:33 | This kind of alignment we need to learn from the data, which is not |
---|
0:10:37 | easy. Another issue is to recover information which is not encoded in |
---|
0:10:43 | the text, for example |
---|
0:10:45 | the speaker identity and prosody. These are the different issues we face when we build TTS systems. |
---|
0:10:55 | Here is one example of using a classic TTS pipeline to convert the text into the output |
---|
0:11:00 | waveform. |
---|
0:11:02 | The first step of the system is to clean the input text, to |
---|
0:11:06 | do some kind of text normalization to remove all kinds of |
---|
0:11:11 | strange symbols from the input text. |
---|
0:11:14 | After that, the system converts the text into the phoneme or phone strings. |
---|
0:11:19 | The phones or phonemes are symbols that tell the computer how to read the |
---|
0:11:24 | word. |
---|
0:11:25 | Of course this is not enough; we may need to add additional prosodic tags |
---|
0:11:30 | to each word or some part of the word, |
---|
0:11:33 | for example when we emphasize "Marianna" instead of "made". |
---|
0:11:37 | Given this linguistic information about how to read the text, |
---|
0:11:43 | the system will then convert it into the acoustic units or acoustic features. |
---|
0:11:48 | Finally, the system will use a waveform generator to convert the acoustic information into the |
---|
0:11:54 | output waveform. |
---|
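A minimal sketch of the pipeline just described, written as toy Python. Every function here is a made-up placeholder rather than a real TTS library API; only the order of the stages matters.

```python
# Toy sketch of the classic pipeline: front end (text processing) + back end (acoustics).
def normalize_text(text):                # front end: remove strange symbols
    return "".join(c for c in text.lower() if c.isalpha() or c.isspace())

def grapheme_to_phoneme(text):           # front end: toy letter-as-phoneme conversion
    return [c for c in text if c != " "]

def add_prosodic_tags(phonemes):         # front end: attach a dummy prosodic tag
    return [(p, "unstressed") for p in phonemes]

def acoustic_model(linguistic):          # back end: one dummy feature vector per unit
    return [[0.0] * 80 for _ in linguistic]

def waveform_generator(acoustic):        # back end: dummy vocoder output
    return [0.0] * (len(acoustic) * 200)

wav = waveform_generator(acoustic_model(add_prosodic_tags(
    grapheme_to_phoneme(normalize_text("Marianna made the marmalade.")))))
```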
0:11:58 | In the literature we normally refer to the first steps of such a system as |
---|
0:12:03 | the front end, and the rest as the back end. |
---|
0:12:07 | In this tutorial I will not cover the topics on the front end; |
---|
0:12:11 | readers can refer to textbooks on the front end. |
---|
0:12:15 | For this tutorial we focus on the back-end issues, especially how we learn the |
---|
0:12:22 | alignment between the text and waveform in the back-end models. |
---|
0:12:28 | The first example I'd like to explain is the unit-selection-based back end. |
---|
0:12:33 | As the name suggests, this method is quite simple and straightforward: for each input |
---|
0:12:39 | unit, we directly select one speech segment from a large database. |
---|
0:12:44 | After that, we directly concatenate these speech units into the output waveform. |
---|
0:12:50 | So there is no explicit modeling of the alignment between the text and the |
---|
0:12:56 | waveform, |
---|
0:12:56 | because this alignment has been preserved in the speech units. So we don't really |
---|
0:13:02 | care about the alignment in this kind of method. |
---|
0:13:07 | However, the story becomes different when we use the HMM-based back end to |
---|
0:13:12 | synthesize speech. |
---|
0:13:14 | So, |
---|
0:13:15 | unlike the unit selection, which directly generates the waveform, |
---|
0:13:20 | in HTS, the HMM-based approach, we don't directly predict the waveform; instead |
---|
0:13:25 | we first predict the sequence of acoustic features |
---|
0:13:29 | from the input text. Each of these acoustic feature vectors may correspond |
---|
0:13:36 | to, |
---|
0:13:36 | say, twenty-five milliseconds of waveform, |
---|
0:13:40 | and we can use vocoders to reconstruct the waveform from the acoustic feature |
---|
0:13:44 | vectors. |
---|
0:13:45 | Each acoustic feature vector may contain, for example, the cepstral coefficients, the |
---|
0:13:52 | F0, |
---|
0:13:53 | and other kinds of acoustic features specific |
---|
0:13:56 | to the speech vocoders. |
---|
0:13:58 | But the general idea here is that |
---|
0:14:00 | in HTS we don't directly predict the waveform; instead we need to first predict |
---|
0:14:05 | the acoustic feature vectors from the input text. |
---|
0:14:11 | The question is how we can do that. Remember that the input information has been |
---|
0:14:18 | extracted from the text, |
---|
0:14:20 | including the phoneme identity and other prosodic tags. |
---|
0:14:24 | So in HTS we normally encode or convert the linguistic features into a |
---|
0:14:30 | vector for each input unit. |
---|
0:14:33 | Each vector may contain information like the phoneme identity, |
---|
0:14:38 | whether the current syllable is stressed, and so on. |
---|
0:14:41 | So we assign this kind of vector to each unit. |
---|
0:14:46 | The question, of course, is how we can convert the sequence of encoded linguistic vectors |
---|
0:14:51 | into the output acoustic feature vectors. |
---|
0:14:54 | Remember, the number of vectors we have is equal to the number of units |
---|
0:14:59 | in the text, |
---|
0:15:01 | and this number is much smaller than the number of acoustic feature vectors we |
---|
0:15:05 | need to predict. |
---|
0:15:07 | So this is the alignment issue. |
---|
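As a rough illustration of this mismatch (the numbers are assumed here, not taken from the slides): if each acoustic feature vector corresponds to 25 milliseconds of waveform, a 3-second utterance needs 120 acoustic vectors, while its text may contain only a few dozen phonemes, so each linguistic vector has to be mapped to many output frames.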
0:15:11 | This is how the HTS system handles this issue. |
---|
0:15:14 | Since this system is based on the HMM, the first thing we need to |
---|
0:15:19 | do is to convert the linguistic vectors into the HMM states. |
---|
0:15:24 | This is done by simply searching through the |
---|
0:15:27 | decision trees. After that, we can get the HMM states for this specific vector. |
---|
0:15:34 | After searching and finding all the HMM states for each linguistic vector, |
---|
0:15:41 | the next thing is to predict the duration for each HMM state: for example, |
---|
0:15:47 | we may repeat the first HMM state two times and the second one three times. |
---|
0:15:53 | Given this duration information, we can create a state sequence like this. |
---|
0:15:59 | Note that the length of this HMM state sequence will be equal to the number |
---|
0:16:06 | of vectors we need to predict in the output. |
---|
0:16:10 | Now the regression task becomes much easier, because we can use many types of |
---|
0:16:15 | algorithms |
---|
0:16:16 | to generate the vectors from each HMM state. |
---|
0:16:22 | Specifically, the HTS system uses the so-called |
---|
0:16:26 | maximum likelihood parameter generation, or MLPG, to produce |
---|
0:16:30 | the acoustic feature vectors from the HMM states. |
---|
0:16:34 | This is how the HTS system produces the output from the input |
---|
0:16:39 | linguistic feature vectors. |
---|
0:16:43 | To briefly summarize the HTS system, we can use this picture: |
---|
0:16:48 | we generate the linguistic features from the input text, |
---|
0:16:53 | we do the searching in the decision trees, |
---|
0:16:56 | and after that we predict the duration for each HMM state. So this is where the |
---|
0:17:01 | alignment is produced. |
---|
0:17:03 | To generate the output acoustic features after that, everything is straightforward: just convert each |
---|
0:17:09 | state into the output vectors |
---|
0:17:12 | and do the waveform generation using the vocoder. |
---|
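A toy numpy sketch of the duration-based expansion just described; the per-state mean lookup at the end stands in for the actual MLPG algorithm, and all the numbers are made up:

```python
import numpy as np

# HTS-style back end, heavily simplified: linguistic units -> HMM states (via decision
# trees, skipped here), states -> predicted durations, expanded states -> acoustic frames.
state_means = {"s1": np.array([0.1, 0.2]), "s2": np.array([0.5, 0.4]),
               "s3": np.array([0.9, 0.8])}        # toy per-state Gaussian means
state_seq   = ["s1", "s2", "s3"]                  # states found via the decision trees
durations   = [2, 3, 1]                           # predicted number of frames per state

expanded = [s for s, d in zip(state_seq, durations) for _ in range(d)]
# expanded == ['s1', 's1', 's2', 's2', 's2', 's3']: now aligned with the output frames
acoustic = np.stack([state_means[s] for s in expanded])   # one vector per output frame
print(acoustic.shape)                             # (6, 2): 6 frames of 2-dim toy features
```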
0:17:18 | From HTS to DNN is straightforward: we just need to replace the |
---|
0:17:22 | HMM states with the neural networks, |
---|
0:17:25 | feed-forward ones or recurrent ones. |
---|
0:17:27 | However, for this kind of framework we still need the duration model; we need to |
---|
0:17:32 | predict |
---|
0:17:33 | the alignment |
---|
0:17:34 | from the linguistic feature vectors. |
---|
0:17:37 | Without that, we cannot prepare the input to the neural networks. |
---|
0:17:42 | Indeed, as the paper by Alex Graves says, RNNs are usually restricted to |
---|
0:17:48 | problems where the input and output sequences |
---|
0:17:51 | are well aligned. |
---|
0:17:53 | In other words, when using the common feed-forward or recurrent neural networks, |
---|
0:17:58 | we still need additional tools, including the HMM, |
---|
0:18:02 | to learn and generate the alignment |
---|
0:18:04 | for the TTS task. |
---|
0:18:07 | We may wonder whether we can use a single model to jointly learn the alignment |
---|
0:18:11 | and do the regression, |
---|
0:18:13 | and this is where the sequence-to-sequence models come on the stage. |
---|
0:18:17 | In fact, they are more ambitious: |
---|
0:18:20 | they want to use a single neural network to jointly learn the alignment, do the regression, |
---|
0:18:25 | and even conduct the linguistic analysis on the input text. |
---|
0:18:29 | There is a lot of recent work showing that this approach is reasonable, and |
---|
0:18:34 | it really simplifies the neural networks so that we can achieve a better quality |
---|
0:18:39 | for TTS. |
---|
0:18:42 | Okay, let's look at the sequence-to-sequence TTS models. |
---|
0:18:47 | Remember that the task of the sequence-to-sequence model is to convert the text into |
---|
0:18:54 | the acoustic feature sequences, |
---|
0:18:56 | and we need to solve three specific tasks: |
---|
0:18:59 | how to derive the linguistic features, |
---|
0:19:01 | how to learn and generate the alignment, and how to generate the output sequences. |
---|
0:19:06 | Again, we cannot use a common neural network such as the feed- |
---|
0:19:10 | forward or recurrent one. |
---|
0:19:12 | For this kind of sequence-to-sequence model, we normally use the attention mechanism. |
---|
0:19:18 | For the explanation, I will use x as the input while y is the output. |
---|
0:19:24 | Note that the input has M time steps while the output has N time |
---|
0:19:28 | steps, |
---|
0:19:29 | so they have different lengths. |
---|
0:19:33 | The first framework we can use is the so-called encoder-decoder framework. |
---|
0:19:37 | Here we use an RNN layer as the encoder to process the input, and we extract the |
---|
0:19:44 | c vector from the last hidden state of the encoder. After that, we |
---|
0:19:50 | use this c vector as a condition |
---|
0:19:52 | to generate the output sequence step by step. |
---|
0:19:55 | If we write down the equations, it would look like this; you can see |
---|
0:20:00 | how the output is factorized |
---|
0:20:03 | along all time steps, and the condition c is used at |
---|
0:20:08 | each time step. |
---|
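A plausible way to write the factorization being described (the notation is assumed here: x_{1:M} is the input, y_{1:N} the output, and c the single context vector taken from the encoder's last hidden state):

$$p(y_{1:N} \mid x_{1:M}) = \prod_{n=1}^{N} p(y_n \mid y_{1:n-1}, c), \qquad c = \mathrm{Encoder}(x_{1:M})$$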
0:20:10 | This framework is straightforward and simple: no matter how long the input |
---|
0:20:15 | sequence is, we can always compress the input sequence information into a single vector. |
---|
0:20:21 | However, there is also an issue, because we need to use this single c vector |
---|
0:20:25 | across all the time steps when we generate the output. |
---|
0:20:31 | Can we extract a different context from the input when we generate different output time steps? |
---|
0:20:37 | The answer is yes, and we can use the attention mechanism to achieve this goal. |
---|
0:20:43 | Suppose we want to generate the second time step, y2, here. |
---|
0:20:46 | We may extract the hidden state from the decoder at the previous time step and |
---|
0:20:52 | feed it back to the encoder. |
---|
0:20:54 | After that, we extract some kind of weight |
---|
0:20:58 | vector through the softmax layer. |
---|
0:21:01 | Then we do a weighted sum over the input information |
---|
0:21:04 | and produce the vector c2 here. |
---|
0:21:07 | We can use this c2 vector as the input |
---|
0:21:09 | to the decoder and produce the y2 vector. |
---|
0:21:13 | So this is how |
---|
0:21:14 | the context information can be calculated for the second time step. |
---|
0:21:19 | Note that we can save the output from the softmax layer; it is the |
---|
0:21:24 | weight information used for the second time step. |
---|
0:21:28 | We can repeat the process for the next time step. In this case we |
---|
0:21:32 | feed back the history from the decoder at the second time step, and then we calculate |
---|
0:21:39 | the vector c3 for the output y3. |
---|
0:21:44 | In general, we can do this for every time step, and we can write |
---|
0:21:48 | equations like these. |
---|
0:21:50 | After we save all the outputs from the softmax |
---|
0:21:56 | along all the time steps, we can notice that |
---|
0:21:59 | the weights |
---|
0:22:00 | calculated by the softmax will gradually change |
---|
0:22:04 | as we generate the output along the time axis; the weights |
---|
0:22:08 | will also gradually move |
---|
0:22:09 | along the input sequence, as you can see from this picture. |
---|
0:22:16 | This is also known as the alignment matrix, and you can find this kind of picture in many papers |
---|
0:22:21 | on TTS or speech recognition. |
---|
0:22:26 | To briefly summarize the attention-based sequence-to-sequence models, we can use these equations. |
---|
0:22:32 | For each time step n, we calculate the softmax weight |
---|
0:22:37 | vector, the alpha_n here, |
---|
0:22:39 | and then we use the alpha_n vector to summarize the information from the input: we |
---|
0:22:44 | do a weighted sum over the h vectors. |
---|
0:22:48 | That gives us the context vector c_n for each time step. |
---|
0:22:53 | With the c_n context we can generate the output y_n, |
---|
0:22:56 | and we repeat the process for all time steps. |
---|
0:22:59 | This is generally how the attention-based sequence-to-sequence model works. |
---|
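A minimal numpy sketch of this decoding loop; the decoder state update and the output layer are toy linear maps (all sizes and names assumed), and only the attention logic matters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))        # encoder outputs h_1..h_5 (5 input steps, dim 8)
W_out = rng.normal(size=(8, 4))    # toy output projection to a 4-dim "acoustic" vector
s = np.zeros(8)                    # toy decoder state

outputs, alignments = [], []
for n in range(7):                 # 7 output steps, more than the 5 input steps
    scores = H @ s                 # score between the decoder state and each encoder step
    alpha = softmax(scores)        # alignment weights for this output step
    c = alpha @ H                  # context vector: weighted sum over the encoder outputs
    y = c @ W_out                  # generate the output vector from the context
    s = np.tanh(c + s)             # toy recurrent update with feedback
    outputs.append(y); alignments.append(alpha)

A = np.stack(alignments)           # (7, 5) alignment matrix, as in the pictures
```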
0:23:05 | As you can see from the previous explanation, |
---|
0:23:08 | the attention mechanism is essential for a sequence-to-sequence TTS |
---|
0:23:12 | model, |
---|
0:23:13 | and due to this reason there have been so many different types of attention proposed. |
---|
0:23:18 | When I read the papers, I noticed that there are so many different types of |
---|
0:23:23 | attention we can use: |
---|
0:23:25 | self-attention, forward attention, hard attention, and soft attention. |
---|
0:23:29 | So what is the relationship between the different types of attention, and what is the |
---|
0:23:34 | purpose of using a specific attention? |
---|
0:23:37 | In the next few slides I will explain them in a more systematic way. |
---|
0:23:43 | As my proposal, I organize the attention types based on what kind of features are used |
---|
0:23:48 | to compute the alignment, |
---|
0:23:50 | how they compute the alignment, |
---|
0:23:53 | and what kind of constraints we need to put on the alignment. |
---|
0:23:56 | For what kind of features are used to compute the |
---|
0:24:01 | alignment, we can organize attention based on whether it is content-based, whether it is location- |
---|
0:24:07 | aware, or whether it is pure location-based attention. |
---|
0:24:11 | For the way to compute the alignment, we can organize attention according to three groups: |
---|
0:24:18 | additive, dot-product, and scaled dot-product attention. |
---|
0:24:21 | And for the final axis, we can see whether the attention is monotonic, |
---|
0:24:26 | forward attention, local attention, or global attention. So this is my proposal |
---|
0:24:32 | for organizing the so-called soft attention. |
---|
0:24:38 | But soft attention is not the only group we can find in the literature. |
---|
0:24:43 | If we read the papers, we can find another group, the so-called hard attention. |
---|
0:24:47 | The difference from the soft attention is that in hard attention the alignment is treated |
---|
0:24:54 | as a latent variable. |
---|
0:24:56 | We need to use all kinds of tools such as dynamic programming and marginalization to |
---|
0:25:02 | calculate the probability and to marginalize the latent variable. |
---|
0:25:07 | I will talk more about the difference between the two groups of attention in the |
---|
0:25:11 | later slides, but at this stage I will focus on the soft attention. |
---|
0:25:17 | Let's first look at the dot-product, |
---|
0:25:19 | scaled dot-product, and additive attention. |
---|
0:25:21 | These are three types of attention which use different ways to |
---|
0:25:26 | compute the alignment matrix. |
---|
0:25:28 | Suppose we are going to compute the output y_n for the n-th time |
---|
0:25:31 | step. |
---|
0:25:33 | What we have is the decoder state of the previous time step, |
---|
0:25:38 | s_{n-1}. |
---|
0:25:38 | We also have the features extracted from the input text, which are denoted |
---|
0:25:43 | as the h. |
---|
0:25:45 | These three types of attention differ in the way they compute the input to the |
---|
0:25:50 | softmax layer; the output of the softmax will be the alignment matrix. |
---|
0:25:58 | The first one, the dot-product attention, |
---|
0:26:01 | directly multiplies these two vectors, the s_{n-1} from the decoder and the |
---|
0:26:07 | h_m from the encoder. |
---|
0:26:09 | This is why it is called the dot-product attention. |
---|
0:26:14 | Scaled dot-product attention is quite similar, but in this case we add the scalar d |
---|
0:26:18 | here to change the magnitude of the activation that goes into the soft- |
---|
0:26:24 | max layer. |
---|
0:26:25 | The last type of attention is the additive attention. |
---|
0:26:30 | In this case we apply linear transformations to the two vectors, and after that |
---|
0:26:36 | we add these two vectors together. This is the reason why it is |
---|
0:26:40 | called the additive attention. |
---|
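Written in a common notation (assumed here: d is the vector dimension, and W, U, v are trainable parameters), the three scoring functions are usually given as:

$$
\begin{aligned}
\text{dot-product:}\quad & e_{n,m} = s_{n-1}^{\top} h_m\\
\text{scaled dot-product:}\quad & e_{n,m} = s_{n-1}^{\top} h_m \,/\, \sqrt{d}\\
\text{additive:}\quad & e_{n,m} = v^{\top}\tanh(W s_{n-1} + U h_m)\\
& \alpha_{n,m} = \operatorname{softmax}_m(e_{n,m})
\end{aligned}
$$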
0:26:44 | Note that |
---|
0:26:46 | for all the three types of attention in this example, |
---|
0:26:51 | we are using the s vector from the decoder and the h vector from the encoder. |
---|
0:26:56 | In other words, |
---|
0:26:57 | we can consider the h as the content of the input. |
---|
0:27:01 | So |
---|
0:27:03 | we multiply the content from the input text |
---|
0:27:07 | with the hidden state from the decoder in order to compute the alignment matrix. |
---|
0:27:13 | This brings us to the second question, based on which we can classify different types |
---|
0:27:18 | of attention. |
---|
0:27:19 | The question is what kind of features we can use to compute the alignment. |
---|
0:27:25 | In the previous slide I explained the dot-product, scaled dot-product, and additive attention by |
---|
0:27:31 | using examples where we use the decoder state and the content vector h to compute the alignment. |
---|
0:27:40 | These types of methods we call content-based attention, because |
---|
0:27:45 | they are using the content vector. |
---|
0:27:48 | However, this is not the only way to compute the alignment. |
---|
0:27:53 | The second way is the so-called location-aware attention. |
---|
0:27:57 | As you can see from these two equations, compared with the content-based attention, the |
---|
0:28:02 | location-aware attention uses the attention vector from the previous time step. |
---|
0:28:07 | So this attention is aware of the previous alignment, and that is why we call it the location- |
---|
0:28:14 | aware attention. |
---|
0:28:17 | The third type of attention in this group is the so-called location-based |
---|
0:28:21 | attention. |
---|
0:28:22 | Compared with the location-aware attention, we can notice from this equation |
---|
0:28:27 | that the content vector h is removed from the input. In other |
---|
0:28:33 | words, in the location-based attention we don't care about the content; we purely compute |
---|
0:28:39 | the attention or the alignment matrix |
---|
0:28:42 | based on the decoder state and the alignment from the previous step. |
---|
0:28:47 | Finally, there is a small variant of the location-based attention. In |
---|
0:28:52 | this case |
---|
0:28:54 | we only use the decoder state to compute the alignment, without using the alignment from the |
---|
0:28:59 | previous time step. |
---|
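One common way to write these four variants (the notation is assumed here; f_n is a feature computed from the previous alignment, for example by convolving alpha_{n-1} with a filter F, as in location-aware attention):

$$
\begin{aligned}
\text{content-based:}\quad & e_{n,m} = \operatorname{Score}(s_{n-1},\, h_m)\\
\text{location-aware:}\quad & e_{n,m} = \operatorname{Score}(s_{n-1},\, h_m,\, f_{n,m}), \qquad f_n = F * \alpha_{n-1}\\
\text{location-based:}\quad & e_{n,m} = \operatorname{Score}(s_{n-1},\, f_{n,m})\\
\text{variant:}\quad & e_{n,m} = \operatorname{Score}(s_{n-1})
\end{aligned}
$$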
0:29:03 | From the equations of the four types of attention, I think you may notice |
---|
0:29:08 | that when we compute the attention or the alignment matrix for each output time |
---|
0:29:12 | step, we need to consider all encoder time steps. |
---|
0:29:16 | This leads to the third dimension along which we can classify the attention. |
---|
0:29:22 | Along this dimension I'd like to explain two types of attention. The first one is |
---|
0:29:27 | the so-called global attention. |
---|
0:29:28 | As the name suggests, when we compute the alignment for each output time step, |
---|
0:29:34 | we consider it possible to get information from all the input time steps, |
---|
0:29:39 | so this alignment vector alpha here has non-zero |
---|
0:29:45 | elements everywhere. |
---|
0:29:47 | In contrast, when we use the local attention, we consider that some of the alignment |
---|
0:29:52 | can be zero; for example, in this case we only consider |
---|
0:29:56 | extracting information from the input steps in the middle. |
---|
0:30:03 | Now I have explained these three dimensions along which we can classify the soft |
---|
0:30:08 | attention. In fact, all the examples I have explained can find their |
---|
0:30:14 | location in this 3D space. |
---|
0:30:17 | But let me give one more concrete example, that is, the self-attention. |
---|
0:30:22 | The self-attention is a scaled dot-product attention, it is based on content, |
---|
0:30:28 | and it is a global attention. |
---|
0:30:29 | So let's see how it is defined. |
---|
0:30:33 | If we look at the equations of the self-attention, we can notice |
---|
0:30:36 | why it is called the scaled dot-product, |
---|
0:30:39 | global, and content-based attention. |
---|
0:30:43 | But in this case, the special thing about self-attention is that we extract both |
---|
0:30:48 | the |
---|
0:30:50 | feature vectors, the h here and the s here, from the input sequence. In other |
---|
0:30:55 | words, we are computing the alignment on the input sequence itself. |
---|
0:31:00 | Of course, because we can compute everything in parallel, |
---|
0:31:04 | we can also define a matrix form for the self-attention. So |
---|
0:31:09 | in this case we formulate the input feature sequence as a matrix, |
---|
0:31:14 | and then we do the scaled dot-product attention in the |
---|
0:31:17 | matrix form, |
---|
0:31:18 | and the matrices are called the query, key, and value matrices. |
---|
0:31:24 | In this case they all refer to the same matrix, the H. |
---|
0:31:29 | In other words, this self-attention does a transformation on the input |
---|
0:31:35 | sequence, and the output sequence has the same length as the input. |
---|
0:31:40 | In some sense, we can consider the self-attention as a special type of convolutional |
---|
0:31:45 | or recurrent layer to transform the input into the output with the same length. |
---|
0:31:52 | Of course, we can also use the self-attention for alignment learning. In this case |
---|
0:31:56 | it is just a special type of soft attention based on the scaled dot-product and |
---|
0:32:03 | content-based attention. |
---|
0:32:05 | As you can see from the equations, in this case we replace the query matrix |
---|
0:32:11 | with the states from the decoder, but the process is quite similar, and we can |
---|
0:32:16 | do everything in parallel by using the matrix multiplication. |
---|
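A toy numpy sketch of the matrix form just described: the query, key, and value matrices all come from the same input sequence H, and the output has the same length as the input (all sizes assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.normal(size=(5, 8))             # input sequence: 5 steps, dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

Q, K, V = H @ Wq, H @ Wk, H @ Wv        # query, key, value all derived from the same H
scores = Q @ K.T / np.sqrt(8)           # scaled dot-product, shape (5, 5)
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A = A / A.sum(axis=1, keepdims=True)    # row-wise softmax: alignment over the input itself
out = A @ V                             # output sequence, same length (5) as the input
```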
0:32:21 | So by now I have explained all the three dimensions to classify the soft |
---|
0:32:26 | attention, and also an example based on the self-attention. In fact, there are more ways |
---|
0:32:32 | to combine different types of attention, and you can find the variants in the |
---|
0:32:38 | paper published by Google this year. |
---|
0:32:43 | Given the explanation of the soft attention, let's now quickly explain how it works |
---|
0:32:49 | in a TTS system. |
---|
0:32:51 | For the TTS system, when we use the attention-based sequence- |
---|
0:32:55 | to-sequence models, we use almost the same framework as those used for |
---|
0:33:00 | speech translation, |
---|
0:33:02 | machine translation, or speech recognition. |
---|
0:33:05 | In this case the input is the phonemes or characters, |
---|
0:33:09 | the output is the acoustic feature vector sequences, |
---|
0:33:12 | and we still have the encoder, the attention, and the decoder, which is autoregressive. |
---|
0:33:20 | Of course we can do something more, for example adding more layers in the |
---|
0:33:25 | decoder, increasing the number of recurrent layers, or adding the pre-net which receives the feedback |
---|
0:33:34 | from the previous time step in the autoregressive decoder. This we are free to |
---|
0:33:39 | choose, |
---|
0:33:40 | but the basic idea is still the attention-based approach to learn the alignment between the |
---|
0:33:45 | input and output. |
---|
0:33:48 | This gives us the basics to understand the first famous TTS system based on the |
---|
0:33:57 | sequence-to-sequence model, and this is the Tacotron system. |
---|
0:34:00 | As you can see from the picture in the original paper, the architecture |
---|
0:34:06 | of the network can be generally divided into three groups: |
---|
0:34:12 | the decoder, the attention, and the encoder. |
---|
0:34:16 | They just differ in how they define the encoder, for example by using |
---|
0:34:21 | different types of hidden layers |
---|
0:34:25 | to extract information from the input phoneme or character sequences, |
---|
0:34:31 | but the basic idea is still the same: use attention to learn the alignment between the |
---|
0:34:35 | input and output. |
---|
0:34:38 | In fact, Tacotron is not the only model that uses the sequence- |
---|
0:34:43 | to-sequence based approaches. |
---|
0:34:46 | As far as I have read, the first model might be the one |
---|
0:34:52 | in the unpublished work by Alex Graves. If you listen to his talk in 2015, |
---|
0:34:58 | you can notice he played some samples |
---|
0:35:00 | using the attention-based framework, so attention-based sequence-to-sequence models already existed |
---|
0:35:07 | in 2015. |
---|
0:35:08 | After that, at Interspeech there was one paper on Mandarin TTS which first used the attention |
---|
0:35:15 | in a published paper. |
---|
0:35:17 | After that came the Tacotron system in |
---|
0:35:21 | 2017. |
---|
0:35:22 | Meanwhile, there are different types of systems, for example the Char2Wav, |
---|
0:35:26 | the Tacotron 2, |
---|
0:35:29 | the DC-TTS, the Deep Voice 3, and the Transformer TTS. |
---|
0:35:34 | All these types of systems are based on the attention mechanism. |
---|
0:35:40 | But here I'd like to also mention one special system, the so-called VoiceLoop, |
---|
0:35:46 | which is also a sequence-to-sequence TTS but actually uses a different type of alignment |
---|
0:35:51 | learning, the so-called memory buffer. |
---|
0:35:55 | If you are interested in this model, you can find the illustration in the appendix. |
---|
0:36:01 | To help you understand the difference between the different types of sequence-to- |
---|
0:36:07 | sequence TTS systems, I summarize the details and differences in this table across the |
---|
0:36:13 | different TTS systems. |
---|
0:36:16 | There are many details here, for example in terms of the waveform generator, the acoustic |
---|
0:36:21 | features, and the architecture of the decoder and encoder, |
---|
0:36:25 | but let's focus on the attention here. |
---|
0:36:28 | As you can see, the Tacotron-based systems mainly use the |
---|
0:36:32 | additive attention, of course with the location-awareness. |
---|
0:36:38 | There are also other systems, for example the Char2Wav, which directly use the location-based attention, |
---|
0:36:44 | and there is also a pure self-attention-based system, that is, the Transformer |
---|
0:36:49 | TTS. |
---|
0:36:51 | You can find the details later from the slides. |
---|
0:36:56 | Now I'd like to play some samples published with these papers. They are |
---|
0:37:01 | from the official websites, and the data is in the public domain. |
---|
0:37:07 | For |
---|
0:37:08 | systems trained using their own internal data, I cannot put the samples here, but |
---|
0:37:14 | you can find the samples on their websites. |
---|
0:37:17 | [Samples from the published sequence-to-sequence systems are played, including the sentence "Prosecutors have opened a massive investigation into allegations of fixing games and illegal betting."] |
---|
0:37:53 | After playing the samples, I hope you have a general impression of how the |
---|
0:37:58 | sequence-to-sequence TTS systems sound. |
---|
0:38:02 | Of course, the quality might not be as good as what we have |
---|
0:38:06 | heard in the ASVspoof 2019 samples. |
---|
0:38:10 | There are many different reasons for that, |
---|
0:38:12 | and if you want to find other good examples, I suggest the samples of the |
---|
0:38:17 | Tacotron and the Transformer, where they used their own internal data to train the systems. |
---|
0:38:25 | After listening to the samples, I think the audience may wonder whether the soft |
---|
0:38:29 | attention is good enough for the TTS purpose. |
---|
0:38:33 | I think the answer is no. The samples I played are all good samples; there are actually |
---|
0:38:39 | many cases where the sequence-to-sequence-based TTS systems do not work. For |
---|
0:38:46 | these cases we need to consider specific attention mechanisms that are designed for |
---|
0:38:52 | TTS. |
---|
0:38:53 | So this leads us to |
---|
0:38:56 | another group of systems, which use the monotonic and the forward attention. |
---|
0:39:03 | Before explaining this type of models, I think we need to first explain why the |
---|
0:39:09 | global attention or the global alignment does not work sometimes. |
---|
0:39:15 | Remember that for the global alignment or the global attention, we need to compute the alignment |
---|
0:39:19 | between every pair of the input and the output time steps. |
---|
0:39:25 | This might be necessary for other tasks such as machine translation, but it might |
---|
0:39:30 | not be necessary for TTS, |
---|
0:39:33 | and this kind of alignment is hard to learn; sometimes it does not work. |
---|
0:39:38 | So I'd like to play one sample. |
---|
0:39:41 | This is one sample from a paper from Microsoft Research, where they used |
---|
0:39:48 | the global attention to generate a very long sentence. You can hear the sample; the |
---|
0:39:53 | text transcription here is the input. |
---|
0:39:59 | [Sample playing: the input is a long sentence full of "backslash" tokens ("crashes backslash ... that makes post-processing a little painful ..."); the synthesized speech degenerates into babbling partway through.] |
---|
0:40:25 | I hope this interesting example can show you how the soft attention might not |
---|
0:40:29 | work |
---|
0:40:29 | when we use a long text as the input, |
---|
0:40:33 | and this is the issue we need to solve. |
---|
0:40:35 | So what can we do to alleviate the problem? One thing we can consider |
---|
0:40:40 | is that for text-to-speech there is some kind of monotonic |
---|
0:40:45 | relationship between the input and output, because human beings read the text from left to |
---|
0:40:51 | right. |
---|
0:40:52 | So we can use this kind of prior knowledge to constrain the alignment, so that we |
---|
0:40:58 | can make it easier for the system to learn the mapping from the input to |
---|
0:41:03 | the output. |
---|
0:41:04 | So the idea looks like this. |
---|
0:41:08 | This is the motivation behind the monotonic and the forward attention, |
---|
0:41:13 | and the central idea of the forward or monotonic attention is to recompute |
---|
0:41:19 | the alignment matrix. |
---|
0:41:21 | Suppose we have computed an alignment matrix like this. After that, with some |
---|
0:41:26 | kind of prior knowledge, we recompute the alignment matrix to encourage the monotonic alignment. |
---|
0:41:34 | To give you an example of how it works, let's consider this simple task to |
---|
0:41:39 | convert the input x1, x2, x3 into the outputs y1, |
---|
0:41:44 | y2, y3. |
---|
0:41:46 | Suppose we have used the soft attention and we have computed the alignment for the |
---|
0:41:50 | first time step. |
---|
0:41:53 | This is where we can introduce the prior knowledge to constrain the alignment learning. |
---|
0:41:58 | Suppose we can only start from the first input time step; |
---|
0:42:04 | we can put this alignment vector, the alpha-zero-hat here, to |
---|
0:42:09 | indicate the initial condition. In this case alpha-zero-hat is one, zero, zero. |
---|
0:42:15 | Furthermore, we constrain that the alignment can only stay at the same input |
---|
0:42:20 | step, or it can only transition from the previous input step to the |
---|
0:42:25 | next one, like in a left-to-right HMM. |
---|
0:42:29 | Based on these conditions, we can recompute the alignment vector |
---|
0:42:34 | like this, the alpha-one-tilde, |
---|
0:42:37 | which we can then normalize before it is used. |
---|
0:42:39 | To give you one more example: suppose the alpha-one is equal |
---|
0:42:44 | to 0.5, 0.4, and 0.1. After the |
---|
0:42:49 | recalculation we can get the new vector. |
---|
0:42:52 | You can notice how the probability to align the y1 with x3 |
---|
0:42:57 | is reduced from 0.1 to 0. |
---|
0:43:02 | So this is how we do the |
---|
0:43:07 | forward recalculation of the alignment matrix and reduce the impossible alignments during |
---|
0:43:14 | the model training stage. |
---|
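A small numpy sketch of this recomputation, using the numbers from the example; this only shows the key stay-or-move recursion (the full forward attention in the literature adds further details such as a transition agent):

```python
import numpy as np

def forward_recompute(alpha_prev_hat, alpha_now):
    # alignment can either stay at the same input step or move one step forward
    shifted = np.concatenate(([0.0], alpha_prev_hat[:-1]))   # mass coming from step i-1
    new = (alpha_prev_hat + shifted) * alpha_now             # combine with the soft weights
    return new / new.sum()                                   # renormalize

alpha0_hat = np.array([1.0, 0.0, 0.0])        # initial condition: start at the first input step
alpha1     = np.array([0.5, 0.4, 0.1])        # soft-attention weights for output step 1
print(forward_recompute(alpha0_hat, alpha1))  # ~[0.556, 0.444, 0.0]: weight on x3 drops 0.1 -> 0
```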
0:43:17 | Of course, in the paper they also propose other mechanisms to recompute |
---|
0:43:23 | the alignment matrix, but the central idea is the same. |
---|
0:43:28 | Given the recalculated alignment vector, we can |
---|
0:43:33 | use it to compute the output of the first time step. |
---|
0:43:36 | Then we can repeat the process and learn the alignment and compute the outputs y1 |
---|
0:43:43 | to y3. |
---|
0:43:46 | Interestingly, if we check the alignment matrices in the paper, we can notice |
---|
0:43:51 | how the forward attention is different from the common soft-attention-based approaches, |
---|
0:43:59 | especially as you can see from the first row of the alignment matrices; |
---|
0:44:03 | that is the alignment after only one training epoch. |
---|
0:44:09 | For the baseline without any constraint, the alignment is just a random or uniform distribution. |
---|
0:44:16 | For the forward attention, with the recalculation of the matrix, you can see |
---|
0:44:20 | how the alignment matrix looks like a monotonic shape. |
---|
0:44:26 | We can also consider this |
---|
0:44:29 | type of monotonic shape as a prior or a constraint on what we can learn from |
---|
0:44:34 | the input and output data. |
---|
0:44:37 | Based on this example, I think you can understand why the forward attention makes it |
---|
0:44:43 | easier for the TTS system to learn the alignment between the input and |
---|
0:44:48 | output. |
---|
0:44:51 | In addition to the forward attention, there are also other types of monotonic attention, for |
---|
0:44:58 | example using different kinds of priors or combining it with the local attention. |
---|
0:45:04 | However, I'd like to mention that |
---|
0:45:07 | the forward or so-called monotonic attention cannot guarantee the attention to |
---|
0:45:13 | be exactly monotonic. |
---|
0:45:18 | There are many reasons to explain that, but I think the fundamental reason is |
---|
0:45:22 | that we are still considering the soft attention, where we compute the alignment and summarize |
---|
0:45:30 | the context vector in a deterministic way. |
---|
0:45:34 | This is the issue we'd like to solve with the hard attention, which I will |
---|
0:45:38 | explain in the later slides. |
---|
0:45:40 | Okay, let's just play some samples to see how the forward attention works. |
---|
0:45:45 | This is the same text which I played before. If we |
---|
0:45:50 | use the soft attention, the TTS system does not really work on this sample. |
---|
0:45:55 | Now let's listen to how the forward-attention-based system works. |
---|
0:46:00 | [Sample playing: the forward-attention system reads the same "backslash" sentence and continues correctly through "... that makes post-processing a little painful since the files and reports crashes in a hierarchical structure ..."] |
---|
0:46:19 | From this example you can notice how the forward attention made the system |
---|
0:46:25 | successfully read the later part of this long sentence. |
---|
0:46:30 | This is a good example which shows how the forward attention works. |
---|
0:46:34 | But again, as I mentioned in the previous slide, the forward attention does not |
---|
0:46:39 | guarantee that it will produce a monotonic alignment. |
---|
0:46:42 | Here is one example from the Microsoft paper. |
---|
0:46:47 | [Sample playing: the forward-attention system reads a news sentence but gets stuck and repeats the phrase "rival chip firms" many times in the middle.] |
---|
0:47:19 | This is a funny example. I hope you can notice how the forward-attention |
---|
0:47:23 | system |
---|
0:47:23 | repeated the phrase "rival chip firms" |
---|
0:47:27 | multiple times. |
---|
0:47:28 | You can also see the alignment from the picture here. In this case the |
---|
0:47:32 | alignment is not |
---|
0:47:34 | monotonic. |
---|
0:47:36 | So again, |
---|
0:47:37 | being a soft attention, |
---|
0:47:39 | the forward attention does not guarantee a monotonic alignment can be learned from the data. |
---|
0:47:45 | Anyway, from the previous samples I think you can hear how the forward at- |
---|
0:47:50 | tention can help |
---|
0:47:51 | the TTS system to learn the alignment for the long sentences. |
---|
0:47:56 | Actually there are many TTS systems using the forward attention, for example the four papers here. |
---|
0:48:01 | I will not play the samples here; if you are interested, you can find the samples |
---|
0:48:05 | on our website or in the slides. |
---|
0:48:09 | Since the soft attention cannot guarantee the monotonic alignment during generation, |
---|
0:48:15 | we have to find another solution. |
---|
0:48:17 | One potential answer could be the hard attention. |
---|
0:48:21 | Here is my understanding of how the hard attention works. |
---|
0:48:24 | Suppose we have the soft-attention alignment matrix. |
---|
0:48:27 | This matrix tells us the probability that each output time step is aligned with each |
---|
0:48:33 | input time step. |
---|
0:48:35 | From this alignment probability matrix we may draw or sample |
---|
0:48:40 | a monotonic alignment like this. |
---|
0:48:42 | So this is the idea if we want to use a monotonic alignment for TTS |
---|
0:48:47 | generation. |
---|
0:48:48 | However, we have to take into consideration that there are multiple candidates for the alignment, |
---|
0:48:54 | for example the alignments two and three here, |
---|
0:48:58 | and we have different probabilities to draw these samples. |
---|
0:49:04 | Accordingly, during training we have to take into consideration the uncertainty over the different |
---|
0:49:09 | alignments. |
---|
0:49:11 | If we want to evaluate the model likelihood during training, |
---|
0:49:16 | we have to treat the alignment as a latent variable of this probabilistic model. |
---|
0:49:22 | This idea is very similar to the hidden Markov model, and as you can |
---|
0:49:27 | imagine, during training we have to use all kinds of dynamic programming, |
---|
0:49:32 | forward or search algorithms, |
---|
0:49:34 | to evaluate the model likelihood. |
---|
0:49:38 | To give you a more intuitive example of how the hard attention works, we can |
---|
0:49:43 | compare it with the soft attention. |
---|
0:49:46 | As you can see from this picture, for the soft attention, |
---|
0:49:49 | for each output time step |
---|
0:49:51 | we just directly calculate the weighted sum |
---|
0:49:54 | to extract information from the input. |
---|
0:49:57 | This is how we generate the alignment during generation in the soft attention, |
---|
0:50:02 | and we repeat this |
---|
0:50:06 | operation for all the time steps. |
---|
0:50:08 | In contrast, in the hard attention we have to draw samples; |
---|
0:50:13 | we have to select only one |
---|
0:50:15 | possible alignment for each time step. |
---|
0:50:18 | Of course we can use more complicated sampling techniques, such as beam search or |
---|
0:50:24 | Viterbi decoding, to select a good alignment |
---|
0:50:27 | for the TTS generation, |
---|
0:50:29 | but this is how we do the generation in the hard attention. |
---|
0:50:33 | Compared with the soft attention, |
---|
0:50:34 | we don't do a weighted sum; |
---|
0:50:36 | instead, we draw samples. |
---|
0:50:41 | Similarly, in the training stage we have to use dynamic programming to summarize all |
---|
0:50:46 | possible alignments in order to evaluate the model likelihood |
---|
0:50:51 | for the hard-attention-based models. |
---|
0:50:53 | In contrast, soft attention does not require us to do so; we just |
---|
0:50:57 | do the same as what we do |
---|
0:51:00 | in the generation stage: |
---|
0:51:03 | we do the weighted sum for each time step. |
---|
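A toy numpy sketch of the generation-time difference: soft attention takes a weighted sum for every output step, while hard attention picks exactly one input position per step. The stay-or-move sampling rule below is my own simplified illustration of a monotonic hard alignment, not the actual algorithm used in the papers:

```python
import numpy as np

rng = np.random.default_rng(2)
H = rng.normal(size=(5, 8))                    # encoder outputs: 5 input steps, dim 8
A = rng.dirichlet(np.ones(5), size=7)          # toy alignment probabilities for 7 output steps

# Soft attention: deterministic weighted sum for every output step.
soft_context = A @ H                           # (7, 8)

# Hard attention: sample one input position per output step, monotonically (stay or move +1).
pos, hard_context = 0, []
for n in range(7):
    move_prob = A[n, pos + 1] / (A[n, pos] + A[n, pos + 1]) if pos + 1 < 5 else 0.0
    if rng.random() < move_prob:
        pos += 1                               # move forward to the next input step
    hard_context.append(H[pos])                # pick exactly one encoder vector
hard_context = np.stack(hard_context)          # (7, 8)
```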
0:51:06 | So the difference between the soft attention and the hard attention requires us to |
---|
0:51:12 | use a different space |
---|
0:51:14 | to categorize the different techniques for hard attention. |
---|
0:51:18 | That leads to this space, |
---|
0:51:20 | which I think makes it easy to understand the different kinds of hard-attention techniques. |
---|
0:51:25 | However, due to the limited time, I cannot explain the details of hard attention. If |
---|
0:51:31 | you are interested, please find these slides, |
---|
0:51:34 | where I explain the hard attention in more detail. |
---|
0:51:38 | In terms of the TTS systems with hard attention, as far as we know, |
---|
0:51:42 | there is only one group actually using the hard attention |
---|
0:51:46 | for TTS, |
---|
0:51:47 | and it is our group. |
---|
0:51:49 | You can find the reference papers on the website below, |
---|
0:51:53 | and you can also find many details on how we use different types of search |
---|
0:51:59 | and sampling techniques |
---|
0:52:01 | to produce the output alignment from the hard-attention-based models. |
---|
0:52:07 | Given the details on the soft attention and a brief introduction to the hard attention, we now |
---|
0:52:13 | come to the third group, |
---|
0:52:15 | the hybrid approaches |
---|
0:52:16 | for the sequence-to-sequence TTS models. |
---|
0:52:20 | From the first part of this tutorial, I hope you can understand that the soft attention |
---|
0:52:25 | is easy to implement, but |
---|
0:52:28 | it might not work when we generate long utterances. |
---|
0:52:32 | The hard attention may help to solve this issue, because it can guarantee a |
---|
0:52:36 | monotonic alignment during generation. |
---|
0:52:40 | However, |
---|
0:52:41 | according to our experiments, the hard attention might not be as accurate as |
---|
0:52:47 | the soft attention; |
---|
0:52:48 | for example, sometimes it may overestimate the duration of silence. |
---|
0:52:54 | For both soft and hard attention, we compute the alignment probability for each pair of |
---|
0:52:59 | the input and output time steps. |
---|
0:53:02 | For TTS, because the output sequence can be quite long, |
---|
0:53:05 | this means we have to calculate a large matrix |
---|
0:53:08 | for the alignment probability, which is not easy. |
---|
0:53:13 | Of course we can do something more efficient. Suppose we can summarize the alignment information |
---|
0:53:18 | from the matrix, |
---|
0:53:20 | so that we know roughly how many output time steps we need to |
---|
0:53:24 | generate for each input token. |
---|
0:53:27 | By using this information, we can compute one probabilistic model for each input |
---|
0:53:32 | token, |
---|
0:53:33 | just to estimate how many time steps it needs to produce during the generation |
---|
0:53:39 | stage. |
---|
0:53:40 | This idea is not new; it has actually been used in the HMM- and |
---|
0:53:45 | DNN-based systems. |
---|
0:53:47 | Actually, this is also the idea behind the hybrid approaches. |
---|
0:53:51 | The hybrid approaches first use the attention-based model |
---|
0:53:55 | to extract the alignment matrix. |
---|
0:53:58 | After that, they summarize the information, for example |
---|
0:54:01 | the duration, or how many output time steps we need to repeat |
---|
0:54:06 | for each input token. |
---|
0:54:08 | After summarizing this information, we can train the duration model directly for each input |
---|
0:54:15 | token. |
---|
0:54:17 | During the generation stage, we can directly plug in the trained duration model. As you |
---|
0:54:22 | can see from this picture, we just need to predict how many output time steps |
---|
0:54:26 | we need to repeat |
---|
0:54:28 | for each input token. |
---|
0:54:31 | Given this duration information, we can do the upsampling, |
---|
0:54:35 | simply by duplicating each input vector, |
---|
0:54:38 | so the input to the decoder will be well aligned with the output sequence we |
---|
0:54:43 | want to generate. Then we can use normal neural networks such as the feed- |
---|
0:54:47 | forward, |
---|
0:54:48 | recurrent, or autoregressive neural networks |
---|
0:54:51 | to convert the input into the output acoustic feature sequences. |
---|
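The upsampling step described above (often called a length regulator in the FastSpeech line of work) is easy to sketch. The function below is a minimal illustration assuming an encoder output matrix and integer durations like those produced by the sketch above; it is not the exact code of any published system.

```python
import numpy as np

def upsample_by_duration(encoder_out: np.ndarray,
                         durations: np.ndarray) -> np.ndarray:
    """Duplicate each encoder vector according to its predicted duration.

    encoder_out: (n_input_tokens, hidden_dim) encoder outputs.
    durations:   (n_input_tokens,) integer number of output frames per token.
    Returns an array of shape (sum(durations), hidden_dim) that is aligned
    frame-by-frame with the acoustic feature sequence to be generated.
    """
    # np.repeat duplicates row i of encoder_out exactly durations[i] times.
    return np.repeat(encoder_out, durations, axis=0)

# Toy example: 3 tokens, hidden size 2, durations [2, 1, 3] -> 6 frames.
enc = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
print(upsample_by_duration(enc, np.array([2, 1, 3])).shape)  # (6, 2)
```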
0:54:58 | here are some TTS systems using the hybrid approaches
---|
0:55:02 | FastSpeech uses soft attention to extract the durations
---|
0:55:06 | while AlignTTS and other systems use different kinds of techniques to
---|
0:55:11 | extract the durations
---|
0:55:15 | i'd like to play some samples extracted from the published papers
---|
0:55:21 | i will play just one sample for each system, one from FastSpeech and one from
---|
0:55:25 | FastSpeech 2
---|
0:55:28 | [synthetic speech sample played: FastSpeech]
---|
0:55:36 | [synthetic speech sample played: FastSpeech 2]
---|
0:55:44 | although i only play short samples here, i think you can find longer samples
---|
0:55:48 | on their websites
---|
0:55:51 | what i want to say here from the examples is that, by using the hybrid
---|
0:55:55 | approaches, we can generate synthetic speech with quite robust durations
---|
0:56:00 | i think that is one strong point of the hybrid approaches
---|
0:56:06 | okay let's come to the summary |
---|
0:56:09 | in this tutorial i first explained the pipeline TTS systems, including the HMM- and
---|
0:56:15 | DNN-based systems
---|
0:56:17 | in pipeline TTS we need to use the front end
---|
0:56:21 | to extract linguistic information from the input text; after that, we need a duration model
---|
0:56:27 | to predict the duration for each input unit
---|
0:56:32 | following that, we need the acoustic model and the waveform generator
---|
0:56:36 | to convert the linguistic features into the final
---|
0:56:39 | waveform
---|
0:56:42 | in 2016, Google DeepMind proposed WaveNet
---|
0:56:47 | although WaveNet is not explained in this tutorial, i'd like to mention that
---|
0:56:51 | the original WaveNet still needs the front end and the duration model
---|
0:56:56 | it achieved astonishing performance because it uses a single network
---|
0:57:02 | to directly convert the linguistic features into the waveform sampling points
---|
0:57:07 | this avoids the issues and artifacts we had when we used conventional waveform generators
---|
0:57:13 | like the vocoders
---|
0:57:15 | different from these two types of TTS systems, the sequence-to-sequence models use
---|
0:57:20 | a single model to convert the input text
---|
0:57:22 | into the acoustic features
---|
0:57:25 | they use a single model to do the alignment learning, the duration modeling, and
---|
0:57:30 | the acoustic modeling
---|
0:57:31 | in fact, many sequence-to-sequence models also use WaveNet-like waveform generators
---|
0:57:37 | to further improve the quality of the synthetic speech
---|
0:57:42 | if we summarize the differences from the pipeline systems to the sequence-to-sequence
---|
0:57:46 | systems, i think there are four aspects
---|
0:57:50 | the first one is that we replace the conventional front end of the pipeline system
---|
0:57:56 | with a trainable, implicit front end in the sequence-to-sequence model
---|
0:58:00 | second, instead of using an external duration model
---|
0:58:04 | we may jointly do the duration modeling with the sequence-to-sequence mapping
---|
0:58:11 | the third point is the acoustic model; although it is not explained in detail in this tutorial
---|
0:58:17 | most of the sequence-to-sequence models actually use so-called autoregressive decoding
---|
0:58:22 | which produces one output time step
---|
0:58:24 | conditioned on the previous time steps
---|
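To make the autoregressive decoding idea concrete, here is a small, hypothetical sketch: a decoder step function (a stand-in for a trained Tacotron-style decoder cell) is called once per output frame, and each prediction is fed back as the input of the next step. The names and shapes here are assumptions for illustration, not any system's actual API.

```python
import numpy as np

def autoregressive_decode(decoder_step, n_frames: int, feat_dim: int) -> np.ndarray:
    """Generate acoustic features one frame at a time.

    decoder_step: callable(prev_frame) -> next_frame, standing in for a
    trained decoder cell; only the feedback loop is shown here.
    """
    frames = []
    prev = np.zeros(feat_dim)          # the usual all-zero "go" frame
    for _ in range(n_frames):
        prev = decoder_step(prev)      # next frame conditioned on the previous one
        frames.append(prev)
    return np.stack(frames)            # (n_frames, feat_dim)

# Toy decoder step: adds 1 to every dimension (for illustration only).
print(autoregressive_decode(lambda prev: prev + 1.0, n_frames=3, feat_dim=2))
```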
0:58:27 | the last point is the neural waveform models; as i mentioned in the
---|
0:58:31 | previous slide
---|
0:58:32 | many of the sequence-to-sequence models use neural waveform models like WaveNet
---|
0:58:39 | the first three types of differences are implemented through the attention-based sequence-to-sequence
---|
0:58:45 | models
---|
0:58:46 | so in this tutorial we focused on the attention mechanism
---|
0:58:51 | we first explained soft attention
---|
0:58:53 | we also grouped the soft-attention approaches based on three
---|
0:58:58 | dimensions
---|
0:58:59 | what kind of features are used to calculate the alignment matrix, how we calculate the
---|
0:59:03 | alignment
---|
0:59:04 | and what kind of constraints we impose on the alignment
---|
0:59:08 | we also mentioned the shortcoming of soft attention: it does not guarantee a
---|
0:59:13 | monotonic structure
---|
0:59:15 | that motivates the hard-attention-based approaches
---|
0:59:18 | however, hard attention might not be accurate enough to produce natural speech
---|
0:59:23 | that leads us to a possible solution, the hybrid approach, where
---|
0:59:29 | we don't
---|
0:59:30 | use attention during the generation
---|
0:59:35 | all four aspects are quite essential to the performance of the sequence-to-sequence
---|
0:59:39 | TTS models
---|
0:59:40 | of course, we may wonder
---|
0:59:42 | what is the most important factor
---|
0:59:46 | that contributes to the performance of the sequence-to-sequence models
---|
0:59:49 | to answer that, other researchers have designed experiments
---|
0:59:54 | and tried to analyze the impact of each factor
---|
0:59:58 | on the quality of the generated speech from the sequence-to-sequence models
---|
1:00:03 | i recommend reading their paper to understand why the sequence-to-sequence models
---|
1:00:08 | outperform the pipeline TTS systems
---|
1:00:13 | before we end this tutorial, let me briefly mention other research topics based on
---|
1:00:18 | the sequence-to-sequence TTS models
---|
1:00:22 | the first big one is the neural waveform models that have been used in many
---|
1:00:27 | sequence-to-sequence models
---|
1:00:29 | due to the limited time i cannot explain the neural waveform models, but
---|
1:00:34 | you can find the reference papers in the reading list
---|
1:00:37 | another topic is speaker, style, and emotion modeling in sequence-to-sequence models
---|
1:00:45 | prosody is also a hot topic in sequence-to-sequence modeling
---|
1:00:49 | in terms of multi-speaker modeling, most of the sequence-to-sequence models
---|
1:00:54 | are quite straightforward
---|
1:00:55 | they either jointly train the speaker vectors with the sequence-to-sequence model
---|
1:01:01 | or they use a separate speaker model
---|
1:01:04 | to extract the speaker vectors from the reference speech
---|
1:01:08 | the latter is the so-called zero-shot learning for multi-speaker TTS
---|
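As an illustration of the second option (a separate speaker model), here is a small, hypothetical sketch: a speaker vector extracted from reference speech is concatenated to every encoder output before decoding. The function names and shapes are assumptions for illustration, not the API of any particular system.

```python
import numpy as np

def condition_on_speaker(encoder_out: np.ndarray,
                         speaker_vec: np.ndarray) -> np.ndarray:
    """Concatenate one fixed speaker embedding to every encoder frame.

    encoder_out: (n_tokens, hidden_dim) text-encoder outputs.
    speaker_vec: (speaker_dim,) embedding from a separately trained
                 speaker encoder (e.g., extracted from reference speech).
    Returns (n_tokens, hidden_dim + speaker_dim) decoder input.
    """
    tiled = np.tile(speaker_vec, (encoder_out.shape[0], 1))  # repeat per token
    return np.concatenate([encoder_out, tiled], axis=1)

# Toy example: 4 tokens, hidden size 3, speaker embedding size 2.
enc = np.zeros((4, 3))
spk = np.array([0.5, -0.5])
print(condition_on_speaker(enc, spk).shape)  # (4, 5)
```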
1:01:14 | in terms of prosody, some papers focus on segmental prosody, for example
---|
1:01:20 | the lexical tone or the pitch accent
---|
1:01:23 | most of these papers focus on pitch-accent or tonal
---|
1:01:29 | languages such as Mandarin or Japanese
---|
1:01:33 | in terms of suprasegmental variation, there are also papers
---|
1:01:37 | combining prosody embeddings with Tacotron-based systems
---|
1:01:41 | and also systems using variational autoencoders
---|
1:01:45 | to extract the prosody embeddings from the reference speech
---|
1:01:49 | finally, i'd like to mention another direction of TTS research
---|
1:01:54 | that is, TTS for entertainment
---|
1:01:57 | for example, in this paper the authors use traditional Japanese comedy data
---|
1:02:04 | to train the TTS system
---|
1:02:05 | the goal of this kind of TTS system is not only speech
---|
1:02:10 | communication but also
---|
1:02:12 | to entertain the audience
---|
1:02:16 | this is the end of this tutorial
---|
1:02:18 | you can find these slides on my GitHub page; i recommend checking
---|
1:02:22 | the slides, the reading list, and the appendix; thank you for listening
---|