0:00:13 | a matched call |
---|
0:00:14 | as me |
---|
0:00:15 | okay, so I will start |
---|
0:00:17 | this work is an extension |
---|
0:00:20 | of our previous paper from last Interspeech, |
---|
0:00:23 | which was also about the recurrent neural network language model, and |
---|
0:00:27 | here we show some |
---|
0:00:29 | more details about how to train this model efficiently |
---|
0:00:33 | and a comparison |
---|
0:00:34 | against standard neural network |
---|
0:00:36 | language models. So this is some |
---|
0:00:38 | introduction: |
---|
0:00:40 | basically, neural network language models work, |
---|
0:00:43 | let's say, better than the standard |
---|
0:00:46 | back-off models because they can |
---|
0:00:48 | automatically |
---|
0:00:50 | share some parameters between similar words, |
---|
0:00:53 | so they perform some kind of soft clustering in a |
---|
0:00:56 | low-dimensional space. |
---|
0:00:58 | So in some sense they are similar to class-based models. |
---|
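As a rough illustration of the parameter sharing mentioned here (a minimal sketch, not code from the talk): each word is mapped to a shared low-dimensional vector, so words used in similar contexts end up close to each other, which is the soft-clustering effect. The vocabulary, dimensionality, and function names are illustrative assumptions.

```python
import numpy as np

# Toy vocabulary and a low-dimensional embedding matrix (sizes are illustrative).
vocab = ["monday", "tuesday", "car", "truck", "the"]
dim = 3                                             # the "low-dimensional space"
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(len(vocab), dim))   # one shared vector per word

def embed(word):
    """Look up the continuous representation of a word (the shared parameters)."""
    return E[vocab.index(word)]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# After training on text, similar words such as "monday" and "tuesday" would end up
# with a high cosine similarity, i.e. the soft clustering mentioned in the talk.
print(cosine(embed("monday"), embed("tuesday")))
```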
0:01:03 | The good thing about these neural network language models is |
---|
0:01:06 | that they are quite |
---|
0:01:08 | simple to implement and |
---|
0:01:10 | we do not need to |
---|
0:01:13 | deal with, for example, smoothing, |
---|
0:01:15 | and even the training algorithm is usually just the standard backpropagation algorithm, which is |
---|
0:01:20 | very well known and described. |
---|
0:01:23 | And |
---|
0:01:24 | actually, what we have shown recently |
---|
0:01:26 | was that the recurrent architecture |
---|
0:01:30 | beats the feed-forward architecture, |
---|
0:01:34 | which is |
---|
0:01:35 | quite |
---|
0:01:36 | nice, because the recurrent architecture is much more powerful: it allows the model |
---|
0:01:42 | to remember some kind of information in the hidden layer, |
---|
0:01:45 | so we do not build an n-gram model with some limited |
---|
0:01:49 | history, but the model actually learns the history from the data. |
---|
0:01:54 | We will see that later in the pictures. |
---|
0:01:57 | So in this presentation |
---|
0:01:59 | I will |
---|
0:02:00 | describe backpropagation through time, which is a very old algorithm for |
---|
0:02:05 | training recurrent neural networks, |
---|
0:02:07 | and |
---|
0:02:08 | I will then present a speed-up technique that is actually |
---|
0:02:12 | very similar |
---|
0:02:13 | to the previous presentation, |
---|
0:02:15 | just that our technique is, I would say, much simpler. |
---|
0:02:18 | And then results on combining |
---|
0:02:22 | many randomized neural network models and |
---|
0:02:26 | how this affects perplexity, |
---|
0:02:29 | and also some comparison with other techniques. |
---|
0:02:32 | Then |
---|
0:02:33 | some |
---|
0:02:34 | results that are not in the original paper, because we obtained them after the |
---|
0:02:40 | paper was written; this is |
---|
0:02:41 | about some large |
---|
0:02:42 | ASR |
---|
0:02:44 | task |
---|
0:02:46 | with a lot more data than |
---|
0:02:48 | what we show |
---|
0:02:49 | here in the |
---|
0:02:51 | simple examples. |
---|
0:02:52 | So the model looks like this: |
---|
0:02:55 | basically we have |
---|
0:02:57 | some input layer |
---|
0:02:58 | and an output layer that have the same dimensionality as the vocabulary, |
---|
0:03:03 | that is w(t) and y(t), and between these |
---|
0:03:06 | two layers there is one |
---|
0:03:08 | hidden layer that has a much lower dimensionality, |
---|
0:03:11 | let's say |
---|
0:03:12 | one hundred or two hundred neurons. |
---|
0:03:15 | If we did not consider the recurrent |
---|
0:03:19 | parameters, the recurrent weights |
---|
0:03:23 | that connect |
---|
0:03:25 | the hidden layer |
---|
0:03:28 | to itself, |
---|
0:03:30 | then the network would be just a standard bigram neural network |
---|
0:03:35 | language model, but these parameters give the model |
---|
0:03:39 | the power to remember some history and use it efficiently. |
---|
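For concreteness, a minimal sketch of the recurrent (Elman-style) architecture being described, with assumed sizes and variable names: input and output layers the size of the vocabulary, a small hidden layer, and recurrent weights that feed the previous hidden state back in. Dropping the recurrent matrix W below would reduce it to the bigram network mentioned above.

```python
import numpy as np

V, H = 10000, 100            # vocabulary size and hidden layer size (assumed values)
rng = np.random.default_rng(1)
U = rng.normal(scale=0.1, size=(H, V))   # input  -> hidden weights
W = rng.normal(scale=0.1, size=(H, H))   # hidden -> hidden (recurrent) weights
Y = rng.normal(scale=0.1, size=(V, H))   # hidden -> output weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def step(word_index, h_prev):
    """One time step: with a 1-of-V input w(t), the hidden state is
    s(t) = sigmoid(U w(t) + W s(t-1)) and the output is y(t) = softmax(Y s(t))."""
    h = 1.0 / (1.0 + np.exp(-(U[:, word_index] + W @ h_prev)))
    return h, softmax(Y @ h)

h = np.zeros(H)
for w in [12, 7, 104]:       # toy word indices
    h, p = step(w, h)        # p is the distribution over the next word
```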
0:03:44 | So, |
---|
0:03:47 | actually, in the previous paper we were using just the normal backpropagation for training such a network, |
---|
0:03:53 | but |
---|
0:03:54 | here I will show that with backpropagation through time we can |
---|
0:03:57 | actually get better results, which should be even more noticeable on character-based |
---|
0:04:02 | language models, where |
---|
0:04:03 | the usual architecture does not really work well, |
---|
0:04:07 | not only for the recurrent network but also for the feed-forward one. |
---|
0:04:12 | And |
---|
0:04:13 | actually, |
---|
0:04:14 | how does backpropagation through time work? It works by unfolding the recurrent part of the network |
---|
0:04:20 | in time, |
---|
0:04:21 | so that we obtain a deep feed-forward network, |
---|
0:04:25 | which is some kind of approximation of the recurrent part of the network, |
---|
0:04:29 | and we train this by |
---|
0:04:31 | using the standard backpropagation, |
---|
0:04:33 | just that we have many more hidden layers. |
---|
0:04:38 | So |
---|
0:04:39 | it basically looks like this: imagine the original bigram network, and now we know that there |
---|
0:04:44 | are |
---|
0:04:45 | recurrent connections in the hidden layer, |
---|
0:04:47 | it is connected to itself, just that these connections are |
---|
0:04:51 | delayed in time. |
---|
0:04:53 | So when we |
---|
0:04:54 | train the network, |
---|
0:04:56 | we compute the error vector in the output layer and propagate it back using backpropagation, |
---|
0:05:02 | and |
---|
0:05:04 | we unfold the network in time, so basically |
---|
0:05:07 | we can |
---|
0:05:09 | go one step back in time and we can see that the |
---|
0:05:12 | activation values |
---|
0:05:15 | in the hidden layer are depending on the |
---|
0:05:18 | state of the |
---|
0:05:20 | input layer and on the state of the |
---|
0:05:23 | hidden layer in the previous time step, |
---|
0:05:27 | and |
---|
0:05:27 | so on. |
---|
0:05:29 | So basically we can unfold this network for a few steps in time |
---|
0:05:34 | and obtain a feed-forward approximation of the recurrent neural network. So |
---|
0:05:39 | this is the idea of how the algorithm works, |
---|
0:05:41 | and |
---|
0:05:43 | it is a little bit tricky to implement correctly, but otherwise it is quite straightforward, I would say. |
---|
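A compact sketch of the unfolding just described, under the assumption of the toy sigmoid/softmax network from the previous sketch: the error computed at the output layer is pushed back through the last few copies of the hidden layer with ordinary backpropagation. This only returns gradients (no weight update) and truncates after `tau` steps; all names are assumptions.

```python
import numpy as np

def bptt_grads(U, W, Y, inputs, hs, h0, target, tau=4):
    """Gradients for one predicted word, unfolding the recurrence `tau` steps back.
    inputs: word indices x_1..x_T; hs: hidden states s_1..s_T; h0: initial state."""
    T = len(inputs)
    dU, dW, dY = np.zeros_like(U), np.zeros_like(W), np.zeros_like(Y)

    # Error at the output layer (softmax with cross-entropy): p - 1_target.
    o = Y @ hs[T - 1]
    p = np.exp(o - o.max()); p /= p.sum()
    d_o = p.copy(); d_o[target] -= 1.0
    dY += np.outer(d_o, hs[T - 1])
    d_h = Y.T @ d_o                                # error arriving at the last hidden layer

    # Unfold the recurrent part in time and apply plain backpropagation to each copy.
    for k in range(T - 1, max(T - 1 - tau, -1), -1):
        d_a = d_h * hs[k] * (1.0 - hs[k])          # through the sigmoid nonlinearity
        dU[:, inputs[k]] += d_a                    # input weights (1-of-V input)
        h_prev = hs[k - 1] if k > 0 else h0
        dW += np.outer(d_a, h_prev)                # recurrent weights
        d_h = W.T @ d_a                            # pass the error one step further back
    return dU, dW, dY
```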
0:05:49 | The other extension that we describe in our paper is a factorization of the output layer, |
---|
0:05:55 | which is basically something very similar to |
---|
0:05:58 | class-based language models, |
---|
0:06:01 | like what Joshua Goodman did in his paper |
---|
0:06:04 | ten years ago. |
---|
0:06:05 | Just that in our case we do not really |
---|
0:06:08 | extend this approach by |
---|
0:06:10 | using some trees and so on; |
---|
0:06:12 | we keep the approach simpler, |
---|
0:06:15 | and actually we make it even simpler by not even computing any classes, but using just |
---|
0:06:21 | a factorization |
---|
0:06:22 | that is based just on the frequency of the words. So basically we do frequency binning of the |
---|
0:06:29 | vocabulary |
---|
0:06:30 | to obtain these, let's say, classes, |
---|
0:06:32 | and |
---|
0:06:33 | otherwise the approach is very similar to |
---|
0:06:37 | what was in the previous presentation. |
---|
0:06:40 | So we |
---|
0:06:41 | basically compute first the |
---|
0:06:44 | probability distribution over the |
---|
0:06:47 | class layer, which can be very small, let's say |
---|
0:06:49 | just one hundred output units, |
---|
0:06:52 | and then we compute just the probability distribution |
---|
0:06:55 | for the words that belong to this class. Otherwise the model stays the same, |
---|
0:07:01 | so we do not need to compute the probability distribution over |
---|
0:07:04 | the whole output layer, which can be, say, ten thousand words; |
---|
0:07:08 | we will be computing the activations just for |
---|
0:07:11 | much less. |
---|
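A rough sketch of the frequency-binning factorization described above, with assumed names and sizes: words are sorted by unigram frequency and cut into bins of roughly equal probability mass, and the model then computes P(w | history) = P(class(w) | history) * P(w | class(w), history), so only about one hundred class outputs plus the words of one class are evaluated instead of the whole vocabulary.

```python
def frequency_bins(word_counts, n_classes=100):
    """Assign words to classes by frequency binning: sort words by count and cut
    the sorted list so each class covers roughly 1/n_classes of the unigram mass."""
    words = sorted(word_counts, key=word_counts.get, reverse=True)
    total = float(sum(word_counts.values()))
    word2class = {}
    members = [[] for _ in range(n_classes)]
    mass, c = 0.0, 0
    for w in words:
        word2class[w] = c
        members[c].append(w)
        mass += word_counts[w] / total
        # Move to the next bin once this one has collected its share of the mass.
        while mass > (c + 1) / n_classes and c < n_classes - 1:
            c += 1
    return word2class, members

# Toy usage with made-up counts: frequent words land in the first (small) bins.
counts = {"the": 50, "of": 30, "cat": 5, "dog": 4, "xylophone": 1}
word2class, members = frequency_bins(counts, n_classes=3)
```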
0:07:15 | So this can provide a |
---|
0:07:16 | speed-up, in some cases even more than a hundred times, if the output vocabulary is |
---|
0:07:20 | very large. So |
---|
0:07:22 | this technique is very nice: |
---|
0:07:23 | we do not need to introduce any |
---|
0:07:25 | short lists or any trees, and |
---|
0:07:28 | it is actually quite surprising that |
---|
0:07:31 | something as simple as this works, but we will see in the results that it does. |
---|
0:07:35 | So |
---|
0:07:36 | our basic setup, which is described more closely in the paper, is the |
---|
0:07:42 | Penn Treebank, |
---|
0:07:44 | a part of the Wall Street Journal corpus, |
---|
0:07:47 | and we use the same settings as |
---|
0:07:50 | the other researchers, |
---|
0:07:52 | so that we can directly compare the results. |
---|
0:07:55 | We have now extended this in our |
---|
0:07:57 | ongoing work, but |
---|
0:07:59 | I will |
---|
0:07:59 | keep it simple here. |
---|
0:08:01 | So |
---|
0:08:01 | this shows the importance of the backpropagation-through-time training. |
---|
0:08:06 | These are the results on this corpus, |
---|
0:08:08 | and you can see that |
---|
0:08:10 | the blue curve, or I should start maybe with the baseline, which is the green line: |
---|
0:08:15 | that is a |
---|
0:08:15 | modified Kneser-Ney |
---|
0:08:17 | five-gram. |
---|
0:08:18 | And the blue curve is when we trained four models with |
---|
0:08:23 | a different number of steps for the backpropagation-through-time algorithm, |
---|
0:08:28 | and we can see that |
---|
0:08:30 | the average of |
---|
0:08:32 | these four models is actually plotted in the graph. We can see that |
---|
0:08:36 | the more steps we go |
---|
0:08:38 | back in time, |
---|
0:08:39 | the better the final model is, |
---|
0:08:41 | while the evaluation of the model is still the same; it is not affected by the training. |
---|
0:08:47 | When we actually combine these models, using linear interpolation of these models, we |
---|
0:08:52 | can see that |
---|
0:08:53 | the results are better, but the effect of |
---|
0:08:56 | using a better training algorithm stays. So |
---|
0:08:59 | we still obtain a |
---|
0:09:00 | quite significant improvement here; it is about ten percent in |
---|
0:09:04 | perplexity, and it would be even more if we would use more training data; this is just about |
---|
0:09:10 | one million words |
---|
0:09:11 | of training data. |
---|
0:09:14 | Here |
---|
0:09:16 | we show that |
---|
0:09:17 | if we actually combine more than, let's say, four models, |
---|
0:09:20 | we can still observe some improvement even after the whole |
---|
0:09:24 | combination of models is |
---|
0:09:25 | interpolated with the |
---|
0:09:26 | back-off model. |
---|
0:09:28 | To |
---|
0:09:29 | combine the neural nets, |
---|
0:09:31 | we use just linear interpolation with |
---|
0:09:34 | equal weights for each model, |
---|
0:09:35 | but the weight of the back-off model is tuned on the validation data. |
---|
0:09:39 | This is why the curve is |
---|
0:09:40 | slightly noisy, |
---|
0:09:42 | the red one. |
---|
0:09:43 | But |
---|
0:09:44 | you can basically see that |
---|
0:09:45 | we can obtain some very small improvements by going |
---|
0:09:48 | for more than four models, |
---|
0:09:53 | and these networks |
---|
0:09:55 | differ just in the |
---|
0:09:57 | random initialization of the |
---|
0:09:59 | weights. |
---|
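A minimal sketch of this combination scheme, assuming per-word probability streams from each model: the neural network models share the non-back-off mass equally, while the single back-off weight is the value one would tune on validation data. Names and numbers are illustrative.

```python
import numpy as np

def interpolate(nn_probs, backoff_probs, backoff_weight):
    """Linear interpolation: the N neural models get equal weights that share the
    remaining mass, the back-off model gets `backoff_weight` (tuned on validation)."""
    nn_probs = np.asarray(nn_probs, dtype=float)          # shape (N, num_test_words)
    nn_weight = (1.0 - backoff_weight) / nn_probs.shape[0]
    return nn_weight * nn_probs.sum(axis=0) + backoff_weight * np.asarray(backoff_probs)

def perplexity(word_probs):
    """Perplexity of a test stream given the per-word probabilities."""
    word_probs = np.asarray(word_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(word_probs))))

# Made-up example: four RNN models and one back-off model scoring three test words.
rnn = [[0.10, 0.020, 0.30], [0.12, 0.030, 0.28], [0.09, 0.020, 0.31], [0.11, 0.025, 0.29]]
kn  =  [0.08, 0.050, 0.20]
print(perplexity(interpolate(rnn, kn, backoff_weight=0.3)))
```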
0:10:03 | Here |
---|
0:10:04 | is the comparison that I was already |
---|
0:10:06 | introducing, |
---|
0:10:07 | to other techniques. So |
---|
0:10:09 | the baseline here can be the five-gram; |
---|
0:10:12 | its perplexity is one hundred forty-one, |
---|
0:10:15 | that is the first |
---|
0:10:16 | row. |
---|
0:10:17 | And then |
---|
0:10:18 | a random forest that is interpolated with this baseline |
---|
0:10:23 | achieves |
---|
0:10:24 | a perplexity reduction of somewhat less than ten percent, |
---|
0:10:27 | and structured language models |
---|
0:10:29 | actually work better than the random forests on this setup. |
---|
0:10:33 | And |
---|
0:10:34 | we can see that all the neural network language models work even better than that: |
---|
0:10:39 | the standard feed-forward neural networks |
---|
0:10:41 | are about |
---|
0:10:43 | ten points of perplexity better than the structured language models. |
---|
0:10:46 | Then |
---|
0:10:47 | the previously best technique on this setup was from Emami, a syntactic |
---|
0:10:52 | neural network language model that uses |
---|
0:10:54 | actually |
---|
0:10:55 | even more features that are |
---|
0:10:59 | linguistically motivated. |
---|
0:11:01 | And we can see that if we train, |
---|
0:11:05 | just using the standard backpropagation, |
---|
0:11:08 | a recurrent neural network, |
---|
0:11:11 | we can obtain better results on this setup than with the |
---|
0:11:15 | usual feed-forward neural network, |
---|
0:11:17 | and if we train it by backpropagation through time we |
---|
0:11:20 | obtain |
---|
0:11:21 | a large improvement in the end. All these results are |
---|
0:11:26 | after combination with the back-off model. |
---|
0:11:29 | And then when we train several |
---|
0:11:32 | different models, we obtain again a quite significant improvement. |
---|
0:11:35 | Actually, we have some |
---|
0:11:37 | ongoing work and we are able to |
---|
0:11:40 | get |
---|
0:11:40 | a perplexity on this setup that is lower than that, |
---|
0:11:44 | by combining a lot of |
---|
0:11:45 | different |
---|
0:11:46 | techniques. |
---|
0:11:50 | So |
---|
0:11:52 | the factorization of the output layer that I have described before |
---|
0:11:57 | provides a |
---|
0:11:59 | significant speed-up, which is quite useful, and we can see here that the |
---|
0:12:05 | cost |
---|
0:12:05 | in perplexity, |
---|
0:12:09 | which we pay because we make some assumptions that are not completely true and the approach is very |
---|
0:12:13 | simple, |
---|
0:12:14 | is small: |
---|
0:12:15 | the results do not degrade very much, even if we go for, let's say, a hundred classes. And if we |
---|
0:12:21 | go to even fewer classes, the results |
---|
0:12:23 | will get |
---|
0:12:25 | better again, because actually |
---|
0:12:27 | the model with |
---|
0:12:29 | the number of classes equal to one, |
---|
0:12:31 | or equal to the size of the vocabulary, is identical to the original model. |
---|
0:12:36 | So the optimal value is about the |
---|
0:12:38 | square root of the size of the vocabulary; |
---|
0:12:41 | that is the optimal value to obtain the maximum speed-up. Of course, |
---|
0:12:44 | you can make some |
---|
0:12:46 | compromise and go for a lot more classes to |
---|
0:12:50 | obtain a |
---|
0:12:51 | somewhat less efficient |
---|
0:12:54 | network that has better accuracy. |
---|
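In equations (a sketch of the argument, treating every class as holding about V/C words, which frequency binning only satisfies approximately): with H hidden units, V vocabulary words and C classes, the per-word output cost drops from H times V to roughly H times (C + V/C), and that expression is minimized when C is the square root of V.

```latex
\[
  \mathrm{cost}_{\mathrm{full}} = H \cdot V,
  \qquad
  \mathrm{cost}_{\mathrm{class}} \approx H \cdot \left( C + \frac{V}{C} \right),
\]
\[
  \frac{d}{dC}\left( C + \frac{V}{C} \right) = 1 - \frac{V}{C^{2}} = 0
  \quad\Longrightarrow\quad
  C = \sqrt{V}.
\]
% Example: V = 10000 gives C = 100, i.e. about 200 output units evaluated
% per word instead of 10000.
```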
0:13:00 | What we did not have in the paper is |
---|
0:13:02 | what happens if we actually |
---|
0:13:04 | add more data, |
---|
0:13:06 | because the previous |
---|
0:13:07 | experiments were with just one million |
---|
0:13:09 | words in the training data. |
---|
0:13:11 | Here we show a graph on |
---|
0:13:13 | English Gigaword, where we used up to thirty-six million words, |
---|
0:13:19 | and you can see |
---|
0:13:20 | that for these recurrent neural networks the difference |
---|
0:13:23 | against the back-off models |
---|
0:13:24 | is actually increasing with more data, |
---|
0:13:27 | which is |
---|
0:13:28 | the opposite of what we can see for most of the other language modeling techniques |
---|
0:13:33 | that work only for |
---|
0:13:35 | small amounts of data; |
---|
0:13:36 | when we increase the amount of the training data, |
---|
0:13:39 | then |
---|
0:13:40 | actually all the improvements |
---|
0:13:42 | tend to vanish. So this is not the case here. |
---|
0:13:48 | So |
---|
0:13:49 | next, we did |
---|
0:13:50 | a lot of |
---|
0:13:51 | small modifications to |
---|
0:13:52 | further improve the accuracy and the speed, and |
---|
0:13:55 | one of these things is dynamic evaluation, which can be |
---|
0:13:58 | used for adaptation of the models. It is |
---|
0:14:01 | an extension, or a simplification, of our previous approach that we |
---|
0:14:05 | described in |
---|
0:14:06 | our last Interspeech paper, |
---|
0:14:09 | and |
---|
0:14:09 | it basically works so that we train the network even during the testing phase, |
---|
0:14:16 | but in this case we just retrain the network on the |
---|
0:14:20 | one-best hypothesis |
---|
0:14:22 | during recognition. |
---|
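A hedged sketch of this dynamic evaluation idea, reusing the toy network and the `bptt_grads` function from the earlier sketches (names, the learning rate, and the start-of-sentence index are assumptions): each test word is first scored, then immediately used as one more training example, so the model adapts to the test data as recognition proceeds.

```python
import numpy as np

def dynamic_evaluation(U, W, Y, test_words, lr=0.1, tau=4):
    """Score a one-best word stream and keep training on it as we go:
    every word is predicted (scored) first and then used for a gradient step."""
    H = W.shape[0]
    h, h0 = np.zeros(H), np.zeros(H)
    hs, xs, logprob = [], [], 0.0
    prev = 0                                       # assumed sentence-start index
    for w in test_words:
        # Forward step with the previous word as input.
        h = 1.0 / (1.0 + np.exp(-(U[:, prev] + W @ h)))
        o = Y @ h
        p = np.exp(o - o.max()); p /= p.sum()
        logprob += float(np.log(p[w]))             # score before adapting
        hs.append(h); xs.append(prev)
        # Adapt: one truncated-BPTT gradient step on the word just seen.
        dU, dW, dY = bptt_grads(U, W, Y, xs, hs, h0, target=w, tau=tau)
        U -= lr * dU; W -= lr * dW; Y -= lr * dY
        prev = w
    return logprob
```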
0:14:24 | Then also, |
---|
0:14:26 | we show in |
---|
0:14:27 | the paper a combination and comparison of recurrent neural networks with |
---|
0:14:33 | many other |
---|
0:14:34 | advanced language modeling techniques, |
---|
0:14:36 | which |
---|
0:14:37 | leads to more than fifty percent reduction of perplexity |
---|
0:14:42 | against |
---|
0:14:42 | some standard back-off |
---|
0:14:44 | n-gram language models, |
---|
0:14:46 | and on data even larger than this Penn Treebank corpus we are able to |
---|
0:14:52 | get even more than fifty percent |
---|
0:14:54 | reduction in |
---|
0:14:56 | perplexity. |
---|
0:14:57 | We also have some |
---|
0:14:58 | ASR experiments and results: on |
---|
0:15:03 | an easy setup |
---|
0:15:05 | that uses some very basic acoustic models, we are able to |
---|
0:15:10 | obtain almost twenty percent reduction of the word error rate. |
---|
0:15:13 | And on a much harder and larger |
---|
0:15:17 | setup, which is |
---|
0:15:18 | the same as the one that was |
---|
0:15:21 | used last year |
---|
0:15:23 | at the JHU summer workshop, |
---|
0:15:25 | we obtain almost ten percent reduction of the |
---|
0:15:29 | word error rate against a |
---|
0:15:30 | baseline four-gram model. |
---|
0:15:33 | Actually, I can even include the results for the |
---|
0:15:36 | Model M on this system, |
---|
0:15:39 | which provides |
---|
0:15:40 | a reduction from |
---|
0:15:42 | thirteen point one to |
---|
0:15:44 | twelve point five, |
---|
0:15:45 | which means that |
---|
0:15:46 | the recurrent neural network is about |
---|
0:15:49 | twice better |
---|
0:15:50 | in word error rate reduction on this setup |
---|
0:15:53 | than Model M. |
---|
0:15:55 | And also, |
---|
0:15:58 | all these experiments can be repeated, as |
---|
0:16:01 | we made the toolkit available for this |
---|
0:16:04 | setting, |
---|
0:16:05 | and the link should also be in the paper. |
---|
0:16:07 | So |
---|
0:16:09 | I would say, |
---|
0:16:10 | yes, |
---|
0:16:11 | all these experiments can be repeated, it just |
---|
0:16:13 | takes a lot of time. |
---|
0:16:15 | So, |
---|
0:16:16 | thanks for your attention. |
---|
0:16:24 | Time for questions. |
---|
0:16:40 | yeah |
---|
0:16:42 | yeah |
---|
0:16:48 | Just a second. |
---|
0:16:55 | Just this table. |
---|
0:16:58 | yeah |
---|
0:16:59 | So which numbers |
---|
0:17:01 | do you mean? |
---|
0:17:04 | This is the combination of the big model |
---|
0:17:08 | with the baseline model. |
---|
0:17:11 | Without the combination, |
---|
0:17:13 | I am not sure if it is in the paper, but basically it would be |
---|
0:17:17 | like this: the weight of the recurrent network in the combination on this setup is usually about |
---|
0:17:22 | zero point seven or zero point eight, |
---|
0:17:24 | so it would be a bit better than the baseline; I think it was around one hundred |
---|
0:17:29 | and something. |
---|
0:17:41 | Any other questions? |
---|