0:00:13 | a matched call |
---|
0:00:14 | as me |
---|
0:00:15 | okay, so I will start |
---|
0:00:17 | this work is an extension |
---|
0:00:20 | of our previous paper from last Interspeech, |
---|
0:00:23 | which was also about the recurrent neural network language model, and |
---|
0:00:27 | here we show some |
---|
0:00:29 | more details about how to train this model efficiently |
---|
0:00:33 | and a comparison |
---|
0:00:34 | against standard neural network |
---|
0:00:36 | language models. So this is some |
---|
0:00:38 | introduction: |
---|
0:00:40 | basically, neural network language models work, |
---|
0:00:43 | let's say, better than the standard |
---|
0:00:46 | back-off models because they can |
---|
0:00:48 | automatically |
---|
0:00:50 | share some parameters between similar words, |
---|
0:00:53 | so they perform some kind of soft clustering in a |
---|
0:00:56 | low-dimensional space. |
---|
0:00:58 | So in some sense they are similar to class-based models. |
---|
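As a rough illustration of the parameter sharing mentioned here (a minimal sketch, not code from the talk): each word is mapped to a shared low-dimensional vector, so words used in similar contexts end up close to each other, which is the soft-clustering effect. The vocabulary, dimensionality, and function names are illustrative assumptions.

```python
import numpy as np

# Toy vocabulary and a low-dimensional embedding matrix (sizes are illustrative).
vocab = ["monday", "tuesday", "car", "truck", "the"]
dim = 3                                             # the "low-dimensional space"
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(len(vocab), dim))   # one shared vector per word

def embed(word):
    """Look up the continuous representation of a word (the shared parameters)."""
    return E[vocab.index(word)]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# After training on text, similar words such as "monday" and "tuesday" would end up
# with a high cosine similarity, i.e. the soft clustering mentioned in the talk.
print(cosine(embed("monday"), embed("tuesday")))
```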
0:01:03 | The good thing about these neural network language models is |
---|
0:01:06 | that they are quite |
---|
0:01:08 | simple to implement and |
---|
0:01:10 | we do not need to |
---|
0:01:13 | deal with, for example, smoothing, |
---|
0:01:15 | and even the training algorithm is usually just the standard backpropagation algorithm, which is |
---|
0:01:20 | very well known and described. |
---|
0:01:23 | And |
---|
0:01:24 | actually, what we have shown recently |
---|
0:01:26 | was that the recurrent architecture |
---|
0:01:30 | beats the feed-forward architecture, |
---|
0:01:34 | which is |
---|
0:01:35 | quite |
---|
0:01:36 | nice, because the recurrent architecture is much more powerful: it allows the model |
---|
0:01:42 | to remember some kind of information in the hidden layer, |
---|
0:01:45 | so we do not build an n-gram model with some limited |
---|
0:01:49 | history, but the model actually learns the history from the data. |
---|
0:01:54 | We will see that later in the pictures. |
---|
0:01:57 | So in this presentation |
---|
0:01:59 | I will |
---|
0:02:00 | describe backpropagation through time, which is a very old algorithm for |
---|
0:02:05 | training recurrent neural networks, |
---|
0:02:07 | and |
---|
0:02:08 | I will then present a speed-up technique that is actually |
---|
0:02:12 | very similar |
---|
0:02:13 | to the previous presentation, |
---|
0:02:15 | just that our technique is, I would say, much simpler. |
---|
0:02:18 | And then results on combining |
---|
0:02:22 | many randomized neural network models and |
---|
0:02:26 | how this affects perplexity, |
---|
0:02:29 | and also some comparison with other techniques. |
---|
0:02:32 | Then |
---|
0:02:33 | some |
---|
0:02:34 | results that are not in the original paper, because we obtained them after the |
---|
0:02:40 | paper was written; this is |
---|
0:02:41 | about some large |
---|
0:02:42 | ASR |
---|
0:02:44 | task |
---|
0:02:46 | with a lot more data than |
---|
0:02:48 | what we show |
---|
0:02:49 | here in the |
---|
0:02:51 | simple examples. |
---|
0:02:52 | So the model looks like this: |
---|
0:02:55 | basically we have |
---|
0:02:57 | some input layer |
---|
0:02:58 | and an output layer that have the same dimensionality as the vocabulary, |
---|
0:03:03 | that is w(t) and y(t), and between these |
---|
0:03:06 | two layers there is one |
---|
0:03:08 | hidden layer that has a much lower dimensionality, |
---|
0:03:11 | let's say |
---|
0:03:12 | one hundred or two hundred neurons. |
---|
0:03:15 | If we did not consider the recurrent |
---|
0:03:19 | parameters, the recurrent weights |
---|
0:03:23 | that connect |
---|
0:03:25 | the hidden layer |
---|
0:03:28 | to itself, |
---|
0:03:30 | then the network would be just a standard bigram neural network |
---|
0:03:35 | language model, but these parameters give the model |
---|
0:03:39 | the power to remember some history and use it efficiently. |
---|
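For concreteness, a minimal sketch of the recurrent (Elman-style) architecture being described, with assumed sizes and variable names: input and output layers the size of the vocabulary, a small hidden layer, and recurrent weights that feed the previous hidden state back in. Dropping the recurrent matrix W below would reduce it to the bigram network mentioned above.

```python
import numpy as np

V, H = 10000, 100            # vocabulary size and hidden layer size (assumed values)
rng = np.random.default_rng(1)
U = rng.normal(scale=0.1, size=(H, V))   # input  -> hidden weights
W = rng.normal(scale=0.1, size=(H, H))   # hidden -> hidden (recurrent) weights
Y = rng.normal(scale=0.1, size=(V, H))   # hidden -> output weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def step(word_index, h_prev):
    """One time step: with a 1-of-V input w(t), the hidden state is
    s(t) = sigmoid(U w(t) + W s(t-1)) and the output is y(t) = softmax(Y s(t))."""
    h = 1.0 / (1.0 + np.exp(-(U[:, word_index] + W @ h_prev)))
    return h, softmax(Y @ h)

h = np.zeros(H)
for w in [12, 7, 104]:       # toy word indices
    h, p = step(w, h)        # p is the distribution over the next word
```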
0:03:44 | So, |
---|
0:03:47 | actually, in the previous paper we were using just the normal backpropagation for training such a network, |
---|
0:03:53 | but |
---|
0:03:54 | here I will show that with backpropagation through time we can |
---|
0:03:57 | actually get better results, which should be even more noticeable on character-based |
---|
0:04:02 | language models, where |
---|
0:04:03 | the usual architecture does not really work well, |
---|
0:04:07 | not only for the recurrent network but also for the feed-forward one. |
---|
0:04:12 | And |
---|
0:04:13 | actually, |
---|
0:04:14 | how does backpropagation through time work? It works by unfolding the recurrent part of the network |
---|
0:04:20 | in time, |
---|
0:04:21 | so that we obtain a deep feed-forward network, |
---|
0:04:25 | which is some kind of approximation of the recurrent part of the network, |
---|
0:04:29 | and we train this by |
---|
0:04:31 | using the standard backpropagation, |
---|
0:04:33 | just that we have many more hidden layers. |
---|
0:04:38 | So |
---|
0:04:39 | it basically looks like this: imagine the original bigram network, and now we know that there |
---|
0:04:44 | are |
---|
0:04:45 | recurrent connections in the hidden layer, |
---|
0:04:47 | it is connected to itself, just that these connections are |
---|
0:04:51 | delayed in time. |
---|
0:04:53 | So when we |
---|
0:04:54 | train the network, |
---|
0:04:56 | we compute the error vector in the output layer and propagate it back using backpropagation, |
---|
0:05:02 | and |
---|
0:05:04 | we unfold the network in time, so basically |
---|
0:05:07 | we can |
---|
0:05:09 | go one step back in time and we can see that the |
---|
0:05:12 | activation values |
---|
0:05:15 | in the hidden layer are depending on the |
---|
0:05:18 | state of the |
---|
0:05:20 | input layer and on the state of the |
---|
0:05:23 | hidden layer in the previous time step, |
---|
0:05:27 | and |
---|
0:05:27 | so on. |
---|
0:05:29 | So basically we can unfold this network for a few steps in time |
---|
0:05:34 | and obtain a feed-forward approximation of the recurrent neural network. So |
---|
0:05:39 | this is the idea of how the algorithm works, |
---|
0:05:41 | and |
---|
0:05:43 | it is a little bit tricky to implement correctly, but otherwise it is quite straightforward, I would say. |
---|
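A compact sketch of the unfolding just described, under the assumption of the toy sigmoid/softmax network from the previous sketch: the error computed at the output layer is pushed back through the last few copies of the hidden layer with ordinary backpropagation. This only returns gradients (no weight update) and truncates after `tau` steps; all names are assumptions.

```python
import numpy as np

def bptt_grads(U, W, Y, inputs, hs, h0, target, tau=4):
    """Gradients for one predicted word, unfolding the recurrence `tau` steps back.
    inputs: word indices x_1..x_T; hs: hidden states s_1..s_T; h0: initial state."""
    T = len(inputs)
    dU, dW, dY = np.zeros_like(U), np.zeros_like(W), np.zeros_like(Y)

    # Error at the output layer (softmax with cross-entropy): p - 1_target.
    o = Y @ hs[T - 1]
    p = np.exp(o - o.max()); p /= p.sum()
    d_o = p.copy(); d_o[target] -= 1.0
    dY += np.outer(d_o, hs[T - 1])
    d_h = Y.T @ d_o                                # error arriving at the last hidden layer

    # Unfold the recurrent part in time and apply plain backpropagation to each copy.
    for k in range(T - 1, max(T - 1 - tau, -1), -1):
        d_a = d_h * hs[k] * (1.0 - hs[k])          # through the sigmoid nonlinearity
        dU[:, inputs[k]] += d_a                    # input weights (1-of-V input)
        h_prev = hs[k - 1] if k > 0 else h0
        dW += np.outer(d_a, h_prev)                # recurrent weights
        d_h = W.T @ d_a                            # pass the error one step further back
    return dU, dW, dY
```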
0:05:49 | The other extension that we describe in our paper is a factorization of the output layer, |
---|
0:05:55 | which is basically something very similar to |
---|
0:05:58 | class-based language models, |
---|
0:06:01 | like what Joshua Goodman did in his paper |
---|
0:06:04 | ten years ago. |
---|
0:06:05 | Just that in our case we do not really |
---|
0:06:08 | extend this approach by |
---|
0:06:10 | using some trees and so on; |
---|
0:06:12 | we keep the approach simpler, |
---|
0:06:15 | and actually we make it even simpler by not even computing any classes, but using just |
---|
0:06:21 | a factorization |
---|
0:06:22 | that is based just on the frequency of the words. So basically we do frequency binning of the |
---|
0:06:29 | vocabulary |
---|
0:06:30 | to obtain these, let's say, classes, |
---|
0:06:32 | and |
---|
0:06:33 | otherwise the approach is very similar to |
---|
0:06:37 | what was in the previous presentation. |
---|
0:06:40 | So we |
---|
0:06:41 | basically compute first the |
---|
0:06:44 | probability distribution over the |
---|
0:06:47 | class layer, which can be very small, let's say |
---|
0:06:49 | just one hundred output units, |
---|
0:06:52 | and then we compute just the probability distribution |
---|
0:06:55 | for the words that belong to this class. Otherwise the model stays the same, |
---|
0:07:01 | so we do not need to compute the probability distribution over |
---|
0:07:04 | the whole output layer, which can be, say, ten thousand words; |
---|
0:07:08 | we will be computing the activations just for |
---|
0:07:11 | much less. |
---|
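A rough sketch of the frequency-binning factorization described above, with assumed names and sizes: words are sorted by unigram frequency and cut into bins of roughly equal probability mass, and the model then computes P(w | history) = P(class(w) | history) * P(w | class(w), history), so only about one hundred class outputs plus the words of one class are evaluated instead of the whole vocabulary.

```python
def frequency_bins(word_counts, n_classes=100):
    """Assign words to classes by frequency binning: sort words by count and cut
    the sorted list so each class covers roughly 1/n_classes of the unigram mass."""
    words = sorted(word_counts, key=word_counts.get, reverse=True)
    total = float(sum(word_counts.values()))
    word2class = {}
    members = [[] for _ in range(n_classes)]
    mass, c = 0.0, 0
    for w in words:
        word2class[w] = c
        members[c].append(w)
        mass += word_counts[w] / total
        # Move to the next bin once this one has collected its share of the mass.
        while mass > (c + 1) / n_classes and c < n_classes - 1:
            c += 1
    return word2class, members

# Toy usage with made-up counts: frequent words land in the first (small) bins.
counts = {"the": 50, "of": 30, "cat": 5, "dog": 4, "xylophone": 1}
word2class, members = frequency_bins(counts, n_classes=3)
```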
0:07:15 | So this can provide a |
---|
0:07:16 | speed-up, in some cases even more than a hundred times, if the output vocabulary is |
---|
0:07:20 | very large. So |
---|
0:07:22 | this technique is very nice: |
---|
0:07:23 | we do not need to introduce any |
---|
0:07:25 | short lists or any trees, and |
---|
0:07:28 | it is actually quite surprising that |
---|
0:07:31 | something as simple as this works, but we will see in the results that it does. |
---|
0:07:35 | So |
---|
0:07:36 | our basic setup, which is described more closely in the paper, is the |
---|
0:07:42 | Penn Treebank, |
---|
0:07:44 | a part of the Wall Street Journal corpus, |
---|
0:07:47 | and we use the same settings as |
---|
0:07:50 | the other researchers, |
---|
0:07:52 | so that we can directly compare the results. |
---|
0:07:55 | We have now extended this in our |
---|
0:07:57 | ongoing work, but |
---|
0:07:59 | I will |
---|
0:07:59 | keep it simple here. |
---|
0:08:01 | So |
---|
0:08:01 | this shows the importance of the backpropagation-through-time training. |
---|
0:08:06 | These are the results on this corpus, |
---|
0:08:08 | and you can see that |
---|
0:08:10 | the blue curve, or I should start maybe with the baseline, which is the green line: |
---|
0:08:15 | that is a |
---|
0:08:15 | modified Kneser-Ney |
---|
0:08:17 | five-gram. |
---|
0:08:18 | And the blue curve is when we trained four models with |
---|
0:08:23 | a different number of steps for the backpropagation-through-time algorithm, |
---|
0:08:28 | and we can see that |
---|
0:08:30 | the average of |
---|
0:08:32 | these four models is actually plotted in the graph. We can see that |
---|
0:08:36 | the more steps we go |
---|
0:08:38 | back in time, |
---|
0:08:39 | the better the final model is, |
---|
0:08:41 | while the evaluation of the model is still the same; it is not affected by the training. |
---|
0:08:47 | When we actually combine these models, using linear interpolation of these models, we |
---|
0:08:52 | can see that |
---|
0:08:53 | the results are better, but the effect of |
---|
0:08:56 | using a better training algorithm stays. So |
---|
0:08:59 | we still obtain a |
---|
0:09:00 | quite significant improvement here; it is about ten percent in |
---|
0:09:04 | perplexity, and it would be even more if we would use more training data; this is just about |
---|
0:09:10 | one million words |
---|
0:09:11 | of training data. |
---|
0:09:14 | Here |
---|
0:09:16 | we show that |
---|
0:09:17 | if we actually combine more than, let's say, four models, |
---|
0:09:20 | we can still observe some improvement even after the whole |
---|
0:09:24 | combination of models is |
---|
0:09:25 | interpolated with the |
---|
0:09:26 | back-off model. |
---|
0:09:28 | To |
---|
0:09:29 | combine the neural nets, |
---|
0:09:31 | we use just linear interpolation with |
---|
0:09:34 | equal weights for each model, |
---|
0:09:35 | but the weight of the back-off model is tuned on the validation data. |
---|
0:09:39 | This is why the curve is |
---|
0:09:40 | slightly noisy, |
---|
0:09:42 | the red one. |
---|
0:09:43 | But |
---|
0:09:44 | you can basically see that |
---|
0:09:45 | we can obtain some very small improvements by going |
---|
0:09:48 | for more than four models, |
---|
0:09:53 | and these networks |
---|
0:09:55 | differ just in the |
---|
0:09:57 | random initialization of the |
---|
0:09:59 | weights. |
---|
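A minimal sketch of this combination scheme, assuming per-word probability streams from each model: the neural network models share the non-back-off mass equally, while the single back-off weight is the value one would tune on validation data. Names and numbers are illustrative.

```python
import numpy as np

def interpolate(nn_probs, backoff_probs, backoff_weight):
    """Linear interpolation: the N neural models get equal weights that share the
    remaining mass, the back-off model gets `backoff_weight` (tuned on validation)."""
    nn_probs = np.asarray(nn_probs, dtype=float)          # shape (N, num_test_words)
    nn_weight = (1.0 - backoff_weight) / nn_probs.shape[0]
    return nn_weight * nn_probs.sum(axis=0) + backoff_weight * np.asarray(backoff_probs)

def perplexity(word_probs):
    """Perplexity of a test stream given the per-word probabilities."""
    word_probs = np.asarray(word_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(word_probs))))

# Made-up example: four RNN models and one back-off model scoring three test words.
rnn = [[0.10, 0.020, 0.30], [0.12, 0.030, 0.28], [0.09, 0.020, 0.31], [0.11, 0.025, 0.29]]
kn  =  [0.08, 0.050, 0.20]
print(perplexity(interpolate(rnn, kn, backoff_weight=0.3)))
```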
0:10:03 | Here |
---|
0:10:04 | is the comparison that I was already |
---|
0:10:06 | introducing, |
---|
0:10:07 | to other techniques. So |
---|
0:10:09 | the baseline here can be the five-gram; |
---|
0:10:12 | its perplexity is one hundred forty-one, |
---|
0:10:15 | that is the first |
---|
0:10:16 | row. |
---|
0:10:17 | And then |
---|
0:10:18 | a random forest that is interpolated with this baseline |
---|
0:10:23 | achieves |
---|
0:10:24 | a perplexity reduction of somewhat less than ten percent, |
---|
0:10:27 | and structured language models |
---|
0:10:29 | actually work better than the random forests on this setup. |
---|
0:10:33 | And |
---|
0:10:34 | we can see that all the neural network language models work even better than that: |
---|
0:10:39 | the standard feed-forward neural networks |
---|
0:10:41 | are about |
---|
0:10:43 | ten points of perplexity better than the structured language models. |
---|
0:10:46 | Then |
---|
0:10:47 | the previously best technique on this setup was from Emami, a syntactic |
---|
0:10:52 | neural network language model that uses |
---|
0:10:54 | actually |
---|
0:10:55 | even more features that are |
---|
0:10:59 | linguistically motivated. |
---|
0:11:01 | And we can see that if we train, |
---|
0:11:05 | just using the standard backpropagation, |
---|
0:11:08 | a recurrent neural network, |
---|
0:11:11 | we can obtain better results on this setup than with the |
---|
0:11:15 | usual feed-forward neural network, |
---|
0:11:17 | and if we train it by backpropagation through time we |
---|
0:11:20 | obtain |
---|
0:11:21 | a large improvement in the end. All these results are |
---|
0:11:26 | after combination with the back-off model. |
---|
0:11:29 | And then when we train several |
---|
0:11:32 | different models, we obtain again a quite significant improvement. |
---|
0:11:35 | Actually, we have some |
---|
0:11:37 | ongoing work and we are able to |
---|
0:11:40 | get |
---|
0:11:40 | a perplexity on this setup that is lower than that, |
---|
0:11:44 | by combining a lot of |
---|
0:11:45 | different |
---|
0:11:46 | techniques. |
---|
0:11:50 | So |
---|
0:11:52 | the factorization of the output layer that I have described before |
---|
0:11:57 | provides a |
---|
0:11:59 | significant speed-up, which is quite useful, and we can see here that the |
---|
0:12:05 | cost |
---|
0:12:05 | in perplexity, |
---|
0:12:09 | which we pay because we make some assumptions that are not completely true and the approach is very |
---|
0:12:13 | simple, |
---|
0:12:14 | is small: |
---|
0:12:15 | the results do not degrade very much, even if we go for, let's say, a hundred classes. And if we |
---|
0:12:21 | go to even fewer classes, the results |
---|
0:12:23 | will get |
---|
0:12:25 | better again, because actually |
---|
0:12:27 | the model with |
---|
0:12:29 | the number of classes equal to one, |
---|
0:12:31 | or equal to the size of the vocabulary, is identical to the original model. |
---|
0:12:36 | So the optimal value is about the |
---|
0:12:38 | square root of the size of the vocabulary; |
---|
0:12:41 | that is the optimal value to obtain the maximum speed-up. Of course, |
---|
0:12:44 | you can make some |
---|
0:12:46 | compromise and go for a lot more classes to |
---|
0:12:50 | obtain a |
---|
0:12:51 | somewhat less efficient |
---|
0:12:54 | network that has better accuracy. |
---|
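In equations (a sketch of the argument, treating every class as holding about V/C words, which frequency binning only satisfies approximately): with H hidden units, V vocabulary words and C classes, the per-word output cost drops from H times V to roughly H times (C + V/C), and that expression is minimized when C is the square root of V.

```latex
\[
  \mathrm{cost}_{\mathrm{full}} = H \cdot V,
  \qquad
  \mathrm{cost}_{\mathrm{class}} \approx H \cdot \left( C + \frac{V}{C} \right),
\]
\[
  \frac{d}{dC}\left( C + \frac{V}{C} \right) = 1 - \frac{V}{C^{2}} = 0
  \quad\Longrightarrow\quad
  C = \sqrt{V}.
\]
% Example: V = 10000 gives C = 100, i.e. about 200 output units evaluated
% per word instead of 10000.
```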
0:13:00 | What we did not have in the paper is |
---|
0:13:02 | what happens if we actually |
---|
0:13:04 | add more data, |
---|
0:13:06 | because the previous |
---|
0:13:07 | experiments were with just one million |
---|
0:13:09 | words in the training data. |
---|
0:13:11 | Here we show a graph on |
---|
0:13:13 | English Gigaword, where we used up to thirty-six million words, |
---|
0:13:19 | and you can see |
---|
0:13:20 | that for these recurrent neural networks the difference |
---|
0:13:23 | against the back-off models |
---|
0:13:24 | is actually increasing with more data, |
---|
0:13:27 | which is |
---|
0:13:28 | the opposite of what we can see for most of the other language modeling techniques |
---|
0:13:33 | that work only for |
---|
0:13:35 | small amounts of data; |
---|
0:13:36 | when we increase the amount of the training data, |
---|
0:13:39 | then |
---|
0:13:40 | actually all the improvements |
---|
0:13:42 | tend to vanish. So this is not the case here. |
---|
0:13:48 | So |
---|
0:13:49 | next, we did |
---|
0:13:50 | a lot of |
---|
0:13:51 | small modifications to |
---|
0:13:52 | further improve the accuracy and the speed, and |
---|
0:13:55 | one of these things is dynamic evaluation, which can be |
---|
0:13:58 | used for adaptation of the models. It is |
---|
0:14:01 | an extension, or a simplification, of our previous approach that we |
---|
0:14:05 | described in |
---|
0:14:06 | our last Interspeech paper, |
---|
0:14:09 | and |
---|
0:14:09 | it basically works so that we train the network even during the testing phase, |
---|
0:14:16 | but in this case we just retrain the network on the |
---|
0:14:20 | one-best hypothesis |
---|
0:14:22 | during recognition. |
---|
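A hedged sketch of this dynamic evaluation idea, reusing the toy network and the `bptt_grads` function from the earlier sketches (names, the learning rate, and the start-of-sentence index are assumptions): each test word is first scored, then immediately used as one more training example, so the model adapts to the test data as recognition proceeds.

```python
import numpy as np

def dynamic_evaluation(U, W, Y, test_words, lr=0.1, tau=4):
    """Score a one-best word stream and keep training on it as we go:
    every word is predicted (scored) first and then used for a gradient step."""
    H = W.shape[0]
    h, h0 = np.zeros(H), np.zeros(H)
    hs, xs, logprob = [], [], 0.0
    prev = 0                                       # assumed sentence-start index
    for w in test_words:
        # Forward step with the previous word as input.
        h = 1.0 / (1.0 + np.exp(-(U[:, prev] + W @ h)))
        o = Y @ h
        p = np.exp(o - o.max()); p /= p.sum()
        logprob += float(np.log(p[w]))             # score before adapting
        hs.append(h); xs.append(prev)
        # Adapt: one truncated-BPTT gradient step on the word just seen.
        dU, dW, dY = bptt_grads(U, W, Y, xs, hs, h0, target=w, tau=tau)
        U -= lr * dU; W -= lr * dW; Y -= lr * dY
        prev = w
    return logprob
```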
0:14:24 | Then also, |
---|
0:14:26 | we show in |
---|
0:14:27 | the paper a combination and comparison of recurrent neural networks with |
---|
0:14:33 | many other |
---|
0:14:34 | advanced language modeling techniques, |
---|
0:14:36 | which |
---|
0:14:37 | leads to more than fifty percent reduction of perplexity |
---|
0:14:42 | against |
---|
0:14:42 | some standard back-off |
---|
0:14:44 | n-gram language models, |
---|
0:14:46 | and on data even larger than this Penn Treebank corpus we are able to |
---|
0:14:52 | get even more than fifty percent |
---|
0:14:54 | reduction in |
---|
0:14:56 | perplexity. |
---|
0:14:57 | We also have some |
---|
0:14:58 | ASR experiments and results: on |
---|
0:15:03 | an easy setup |
---|
0:15:05 | that uses some very basic acoustic models, we are able to |
---|
0:15:10 | obtain almost twenty percent reduction of the word error rate. |
---|
0:15:13 | And on a much harder and larger |
---|
0:15:17 | setup, which is |
---|
0:15:18 | the same as the one that was |
---|
0:15:21 | used last year |
---|
0:15:23 | at the JHU summer workshop, |
---|
0:15:25 | we obtain almost ten percent reduction of the |
---|
0:15:29 | word error rate against a |
---|
0:15:30 | baseline four-gram model. |
---|
0:15:33 | Actually, I can even include the results for the |
---|
0:15:36 | Model M on this system, |
---|
0:15:39 | which provides |
---|
0:15:40 | a reduction from |
---|
0:15:42 | thirteen point one to |
---|
0:15:44 | twelve point five, |
---|
0:15:45 | which means that |
---|
0:15:46 | the recurrent neural network is about |
---|
0:15:49 | twice better |
---|
0:15:50 | in word error rate reduction on this setup |
---|
0:15:53 | than Model M. |
---|
0:15:55 | And also, |
---|
0:15:58 | all these experiments can be repeated, as |
---|
0:16:01 | we made the toolkit available for this |
---|
0:16:04 | setting, |
---|
0:16:05 | and the link should also be in the paper. |
---|
0:16:07 | So |
---|
0:16:09 | I would say, |
---|
0:16:10 | yes, |
---|
0:16:11 | all these experiments can be repeated, it just |
---|
0:16:13 | takes a lot of time. |
---|
0:16:15 | So, |
---|
0:16:16 | thanks for your attention. |
---|
0:16:24 | Time for questions. |
---|
0:16:40 | yeah |
---|
0:16:42 | yeah |
---|
0:16:48 | Just a second. |
---|
0:16:55 | Just this table. |
---|
0:16:58 | yeah |
---|
0:16:59 | So which numbers |
---|
0:17:01 | do you mean? |
---|
0:17:04 | This is the combination of the big model |
---|
0:17:08 | with the baseline model. |
---|
0:17:11 | Without the combination, |
---|
0:17:13 | I am not sure if it is in the paper, but basically it would be |
---|
0:17:17 | like this: the weight of the recurrent network in the combination on this setup is usually about |
---|
0:17:22 | zero point seven or zero point eight, |
---|
0:17:24 | so it would be a bit better than the baseline; I think it was around one hundred |
---|
0:17:29 | and something. |
---|
0:17:41 | Any other questions? |
---|