0:00:16 | "'kay" uh |
---|
0:00:17 | first a couple of quick disclaimers |
---|
0:00:19 | uh this is uh work mostly done by a look the us and that to much we love who give |
---|
0:00:24 | the previous talk |
---|
0:00:26 | uh a with a lot of help from a the fund and from |
---|
0:00:30 | no |
---|
0:00:30 | and i just have the pleasure of sort of pretending that that have something to do that and going to |
---|
0:00:35 | talk |
---|
0:00:36 | uh uh not actually couldn't come because of some uh i is a problems not on the check side but |
---|
0:00:41 | going back to the U S so he sends apologies |
---|
0:00:44 | and uh he even made the slide so apologies to then then V of the slides a two nice i |
---|
0:00:49 | don't think he's trying to hide any |
---|
0:00:52 | alright right |
---|
0:00:53 | so |
---|
0:00:54 | You heard Tomáš tell you how the recurrent neural net language model gives us great performance.
0:00:58 | One of the issues with a model like that is that it has essentially, at least theoretically, infinite memory, and it really does depend on more than the past five, seven, eight words, so you really can't do lattice rescoring with a model like this.
0:01:10 | So the main idea of this paper is: can we do something with the neural net language model so that we can rescore lattices with it?
0:01:17 | And if you want the idea in a nutshell: this whole "variational approximation" is a scary term, I don't know who came up with that; it's actually a very simple idea.
0:01:25 | Imagine for a moment that there really was a true language model that generated all the sentences that you and I speak.
0:01:31 | Right. How do we actually build such a model? What we do is we take a lot of actual text, sentences.
0:01:37 | So it's a little bit like saying we sample from this true underlying distribution and get a bunch of sentences, and then we approximate that underlying model with a Markov chain, be it a second, third, or fourth order Markov chain.
0:01:49 | And that is an n-gram approximation of the true underlying model that you and I presumably carry in our heads, right?
0:01:56 | So the idea here is the same: you take Tomáš's neural net language model and pretend that that's the true model of language.
0:02:02 | Generate lots and lots of data from that model, instead of having human beings write the text, and simply estimate an n-gram model from that.
0:02:10 | That's essentially the long and short of the paper.
0:02:13 | So I'll tell you how it works.
0:02:17 | Oh. Yeah, this is how statistical speech recognition works.
0:02:24 | I was trying to do it this way... no, that's okay.
0:02:29 | So we get annotated speech, and we use the usual generative training procedure to create acoustic models.
0:02:35 | We get some text, we use it to train a language model, we combine the two and figure out some scale factor.
0:02:41 | We get new speech, we feed it to the decoder, which has all these models, and it produces the transcribed utterance.
0:02:48 | Essentially we are implementing this formula: P of A given W, raised to the power mu, times P of W, and we find the W that maximizes it; that's just the standard formulation.
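The decision rule being described, written out; this is the standard MAP formulation with the language model scale folded into an exponent $\mu$ on the acoustic score, as the speaker states it, not a formula copied from the slides:

```latex
\hat{W} \;=\; \arg\max_{W}\; P(A \mid W)^{\mu}\, P(W)
```

Here $A$ is the acoustic observation, $W$ ranges over word sequences, $P(A \mid W)$ is the acoustic model, and $P(W)$ is the language model.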
0:02:57 | Now the language model is typically, as I said, approximated using an n-gram.
0:03:01 | So we take P of W, with W being a whole sentence, and write it as the product of P of W_i given W_1 through W_{i-1}, and that is approximated by something that looks only at the last couple of words in the history, which gives rise to n-gram models.
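In symbols, the chain-rule factorization and its n-gram truncation just described (a standard identity, not specific to this paper):

```latex
P(W) \;=\; \prod_{i=1}^{|W|} P(w_i \mid w_1, \ldots, w_{i-1})
     \;\approx\; \prod_{i=1}^{|W|} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
```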
0:03:14 | And typically we use a small n, and a small number of n-grams, in order to make the decoding feasible, and then we get word lattices; that is, we first create a small search space like a lattice with this, and we rescore the lattice using a bigger n-gram.
0:03:30 | Again, standard practice; everybody knows this, there's nothing new here.
0:03:36 | So this talk is asking: can we go beyond n-grams, can we use something like Tomáš's neural net to do lattice rescoring?
0:03:42 | Or we could perhaps talk about using more complex acoustic models.
0:03:46 | All of these are feasible with lattices, provided your models are tractable in the sense of being local: phones, even when they have a five-phone context, are still local, and n-grams, even for n equal to five, are still local.
0:03:59 | But if you truly have a long-span model, that's not possible; you can't do it with word lattices, so you tend to do it with n-best lists.
0:04:07 | N-best lists have the advantage that you can deploy a really long-span model, but they do have a bias, and this paper is about how to get past that.
0:04:16 | So the real question is: can we do with lattices what we do with n-best lists? Let's see how we get there.
0:04:23 | What's good about word lattices is that they provide a large search space, and they are not biased by the new model that will be used to rescore them, because they really have a lot of options in them; but they do require the new models to be local.
0:04:36 | N-best lists don't require the new models to be local; the new models can be really long-span. But they do offer a limited search space, limited by N, and more importantly, the top N are chosen according to the old model, so the choice of the top-N hypotheses is biased.
0:04:50 | So, Anoop has other work on this, a paper at this ICASSP in a poster session going on right now, which you might be able to catch after this session; it shows how to essentially expand the search space, how to search more than just the n-best in the lattice.
0:05:09 | That's like saying: let's do lattice-like things with n-best lists, except let's somehow make the N very large without really having a large N; let's do the rescoring with some magic so it covers more of the space.
0:05:20 | This paper goes in the other direction. It says: can I somehow approximate the long-span model and make it local?
0:05:28 | So yes, the neural net model has a long history, but can I somehow get a local approximation of that model? That's what is at stake here.
0:05:35 | Alright, so let's see. This is sort of the outline of the talk that's coming up.
0:05:40 | The first order of business: which long-span model do we approximate? I already told you: it's the recurrent neural net model that Tomáš just presented, so you all know about it and I won't waste time on it.
0:05:53 | So what is this approximation going to be?
0:05:55 | Think of it this way. This is the space of all language models; this blob is the set of all recurrent neural networks, let's say this is the set of all n-gram models, and let's say this is the model you'd like.
0:06:09 | Well, typically you can decode and rescore lattices with n-gram models, so there's a check mark against those; I think the mark doesn't stand out, that's the colour, yeah.
0:06:20 | Okay, anyway, so that set is tractable; the blue one is not. So what we should really be using is a tractable model which is as close as possible to the model we would really like to use.
0:06:32 | But the n-gram model that we actually use for rescoring, estimated from the training data, may not be that model.
0:06:39 | From the same training data you estimate a recurrent neural net model P, and you estimate an n-gram model M, and M may not be close to Q-star, the n-gram which is closest to your long-span language model, whatever that might be.
0:06:53 | So what we do in this paper is to say: hmm, what happens if we use Q-star? What if we approximate this better model P with the best n-gram model?
0:07:05 | So that's how you want to look at this. So how do we actually find Q-star? Let me skip ahead.
0:07:13 | This is what happens when you try to use someone else's slides.
0:07:17 | Okay, so here is how the approximation is going to work: you're going to look for an n-gram model Q, among the set of all n-gram models, script Q, which is closest to this long-span model in the sense of KL divergence; familiar enough to everybody.
0:07:33 | Alright.
0:07:36 | So what do you do? Well, essentially, the KL divergence between P and Q is basically just a sum over all X of P of X times log P of X over Q of X, where X ranges over all possible sentences in the world, because these are sentence-level models.
0:07:50 | And if you drop the P log P term, what you really want is the Q that maximizes the sum, over all sentences in the universe, of P of X times log Q of X, where P of X is the neural net probability of the sentence and Q of X is the n-gram probability of the sentence.
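Written out, the objective the speaker is describing; the final step, replacing the intractable sum over all sentences with samples drawn from $P$, is the Monte Carlo view that motivates the recipe that follows (my rendering of the argument, not a formula from the slides):

```latex
Q^{*} \;=\; \arg\min_{Q \in \mathcal{Q}} D(P \,\|\, Q)
      \;=\; \arg\min_{Q \in \mathcal{Q}} \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
      \;=\; \arg\max_{Q \in \mathcal{Q}} \sum_{x} P(x) \log Q(x)
      \;\approx\; \arg\max_{Q \in \mathcal{Q}} \frac{1}{N} \sum_{n=1}^{N} \log Q(x_{n}),
      \qquad x_{n} \sim P
```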
0:08:07 | Of course you can't get P of X for every sentence in the universe. But what you can do is just what we do with normal human language models: we approximate them by "synthesizing" data from the model, namely getting people to write down text, and estimating an n-gram model from that text.
0:08:24 | So what we'll do is synthesize sentences using this neural net language model, and we'll simply estimate an n-gram model from the synthesized data.
0:08:40 | So the recipe is very simple. You take your fancy long-span language model; there is one prerequisite: this fancy long-span language model needs to be a generative model, meaning you need to be able to simulate sentences from it, and the recurrent neural net is ideal for that.
0:08:56 | You synthesize sentences, and once you have a big enough corpus that you're comfortable estimating whatever n-gram model you want, you go ahead and estimate it as if somebody had given you tons of text.
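A minimal sketch of that recipe, not the authors' code: `toy_sampler` stands in for word-by-word sampling from the recurrent neural net language model, and the estimator uses plain relative-frequency counts rather than the Kneser-Ney smoothing used in the experiments.

```python
import random
from collections import defaultdict

def toy_sampler(rng):
    """Placeholder for the generative long-span model; in the real recipe each
    word would be drawn from the RNN language model's predictive distribution."""
    vocab = ["the", "cat", "dog", "sat", "ran", "home"]
    return [rng.choice(vocab) for _ in range(rng.randint(3, 8))] + ["</s>"]

def synthesize_corpus(sample_sentence, num_sentences, seed=0):
    """Draw num_sentences samples from the model, i.e. build the simulated corpus."""
    rng = random.Random(seed)
    return [sample_sentence(rng) for _ in range(num_sentences)]

def estimate_ngram(corpus, order=3):
    """Count n-grams in the simulated corpus and return relative-frequency
    estimates P(w | history): the 'variational' n-gram approximation Q."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        padded = ["<s>"] * (order - 1) + sentence
        for i in range(order - 1, len(padded)):
            history = tuple(padded[i - order + 1:i])
            counts[history][padded[i]] += 1
    model = {}
    for history, followers in counts.items():
        total = sum(followers.values())
        model[history] = {w: c / total for w, c in followers.items()}
    return model

if __name__ == "__main__":
    corpus = synthesize_corpus(toy_sampler, num_sentences=100_000)
    q = estimate_ngram(corpus, order=3)
    print(len(q), "distinct trigram histories in the simulated-text model")
```

In the experiments that follow, the simulated corpora range from a few times to a couple of hundred times the size of the original training text, and the counts are smoothed before use.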
0:09:07 | Sounds crazy, yeah? So let's see what it does.
0:09:12 | I'm going to give you three sets of experiments: first a baby one, then a medium one, and then a really muscular one.
0:09:18 | This is the baby experiment. We start off with the Penn Treebank corpus, which has about a million words for training and about twenty thousand words for evaluation, and the vocabulary is made of the top ten thousand words; this is a standard setup.
0:09:33 | And just to tell you how standard it is: Peng Xu and Jelinek ran their random forest language models on it, Chelba presented a structured language model on it, and others have built similar language models as well; Filimonov and Mary Harper have results on it, and there are cross-sentence, cache-like language models too; all sorts of people have run on exactly this corpus.
0:09:57 | So in fact we didn't have to design the experiment; we simply took their setup and copied their numbers. It's pretty standard.
0:10:04 | So what do we get on that? First we estimated the neural net; actually Tomáš did that. And then we ran the simulation: instead of having a million words of training text, we generated two hundred and thirty million words of training text.
0:10:17 | Why two hundred and thirty million? That's how much we could create in a day or so, so we stopped there.
0:10:22 | Then we simply estimate an n-gram model from the simulated text. The models estimated from the simulated text are the rows labelled as the variational approximation of the neural net; we can generate either trigram or five-gram models from it, and here are the numbers on those.
0:10:40 | The good old trigram model with standard Kneser-Ney smoothing has a perplexity of about one forty.
0:10:46 | If you approximate the neural net by this synthetic method, and remember this n-gram is estimated on two hundred and thirty million words of data, which is a lot of data, its perplexity is only one fifty-two, which is sort of comparable to that one forty; and if you interpolate the two you get one twenty-four, so maybe there's reason to be happy.
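The interpolation mentioned here is, presumably, the usual linear mixture of the two models, with the weight tuned on held-out data (the weight itself is not given in the talk):

```latex
P_{\text{interp}}(w \mid h) \;=\; \lambda\, Q^{*}(w \mid h) \;+\; (1 - \lambda)\, P_{\text{KN}}(w \mid h),
\qquad 0 \le \lambda \le 1
```

where $Q^{*}$ is the n-gram estimated from the simulated text and $P_{\text{KN}}$ is the Kneser-Ney n-gram estimated from the original text.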
0:11:02 | What do we compare it with? Peng Xu's random forest language model is sort of a trigram-order model; it looks at the two previous words. We can compare against that: it gets one thirty-two versus our one twenty-four. Okay, so far so good.
0:11:17 | You can also compare it with the structured, head-driven language models; they look at previous words and syntactic heads and so on, so think of them as five-gram models, because they look at four preceding words, although these are not four consecutive words; they are chosen based on the syntax.
0:11:33 | If you want to do that comparison, we can compare by simulating a five-gram model. A five-gram Kneser-Ney model based on the same one million words of training text has a perplexity of about one forty, and this approximation of the neural net also has a perplexity of about one forty, but when you interpolate the two you get one twenty. So again it's competitive with all the other models.
0:11:53 | And then finally, there is an across-sentence model that looks at the previous sentences; to compare with that we simply implemented a very simple cache language model: cache all the words in the previous couple of sentences and add that to the model. With that you get a perplexity of a hundred and eleven.
0:12:13 | Mind you, this is still not as good as one oh two, which is what Tomáš gets, and that tells you that the exact neural net language model is still better than the approximation we are creating; after all, we are approximating a long-span net with a five-gram.
0:12:25 | But the approximation already is pretty good, and quite different from the n-gram estimated just from the million words of text. So things are working, but...
0:12:33 | Okay, so these are nice perplexity results. For the next experiment we looked at the MIT lectures data.
0:12:40 | For those of you who don't know this corpus, there are a few tens of hours of acoustic training data, I think something like twenty, and then a couple of hours of evaluation data; these are professors giving lectures, and there are a couple of speakers.
0:12:54 | We have transcripts for about a hundred and fifty thousand words of speech, and we have a lot of data from broadcast news, which is out of domain: that is news, and these are lectures, so it's about as different as it could be.
0:13:05 | So we basically said, let's see what we can do with it. We estimated an RNN language model from the in-domain data, simulated twenty times more data, that's three million words of text, and estimated an n-gram from that.
0:13:18 | And we compared it to the baseline Kneser-Ney model; the comparison is asking what you can do with the simulated language model. Of course you don't want to throw away the broadcast news models, because they do cover a lot of good English that is useful for the lectures as well.
0:13:33 | As for baselines, you can choose your own; if you use the Kneser-Ney model estimated from the MIT data, interpolated with broadcast news, you do about as well as you can, and you get word error rates like twenty-four point seven on one lecture and twenty-two point four on the other; you get a reasonable number.
0:13:52 | This is with a big acoustic model; these are fairly well trained acoustic models, discriminatively trained I believe, though I don't know if they have every refinement; I don't think they do. So yes, all the goodies are in there.
0:14:08 | If you rescore the top hundred hypotheses using the full neural network, because remember that's a full-sentence model, you get some reduction: twenty-four point one on the first set, and a similar drop on the other.
0:14:19 | If you go much deeper in the n-best list you get a bit more of an improvement, close to one percent: point eight, point nine. And that's pretty good, though the oracle rate of the list is around fifteen or so, so maybe we could do better; who knows.
0:14:34 | Anyway, that's what you get by doing n-best rescoring, and as I said, one of the problems is that the n-best list presented to you is good according to the original four-gram model.
0:14:43 | What you can do instead is replace the original four-gram model with this phony four-gram model estimated from the simulated text, which is much more text: three million words versus one hundred and fifty thousand words.
0:14:55 | And what you can see, on the first line of the second block, is that if you simply use the variational approximation four-gram instead of the Kneser-Ney four-gram and decode with that, you already get a point four and a point two percent reduction in word error rate.
0:15:10 | What's more interesting is that now, if you simply rescore the hundred best, which is much less work, you get almost all of the gain. So this is starting to look good.
0:15:19 | So what we're saying is that not only do you get better one-best output if you use the n-gram approximation of the neural net language model, you also produce better lattices: when you then rescore with the full neural net language model, you get the reductions at a much lower N, because the n-best list you have to rescore is much shorter than it would be if you had the original model.
0:15:43 | So that was the medium-size experiment. If you want a large experiment, this one has to do with English conversational telephone speech and some meeting speech from the NIST '07 evaluation data.
0:15:58 | Here there are about five million words transcribed, so that's our basic in-domain training data, and then we can either build a Kneser-Ney model from it, or we can train a neural net model, synthesize another corpus an order of magnitude larger, something like forty million words of text, and then use that text to build the language model.
0:16:16 | So again, the original language model is in blue, and the simulated-data language model is in red.
0:16:25 | And again, the notation "two-gram, five-gram" indicates that we produced lattices using a bigram model but then rescored them using a five-gram model, and likewise for the rescoring with the neural net model.
0:16:42 | On that, here again there are a bunch of results, and again you can decide what the baseline is; I like to think of the Kneser-Ney five-gram lattice rescoring as the baseline.
0:16:51 | You get about twenty-eight percent word error rate on the CTS data, and on the meetings data you get thirty-two point four.
0:16:58 | If you rescore these using the neural net you get down to twenty-seven point one with hundred-best rescoring, or twenty-six point five with thousand-best rescoring; that's the kind of gain you can get if you decode using a standard n-gram and rescore using the neural net.
0:17:13 | But if, instead of the standard n-gram, you now replace it with this new n-gram, which is an approximation of the neural net, here's what we get.
0:17:22 | You get a nice reduction just by using a different n-gram model: twenty-eight point four becomes twenty-seven point two, and thirty-two point four becomes thirty-one point seven, which is nice.
0:17:32 | We don't get as much of a gain in the rescoring, so maybe we have done too good a job with the lattices, but there is a gain to be had, at least in the first-pass decoding.
0:17:43 | So, to conclude: this basically convinced us, and hopefully has convinced you, that if you have a very long-span model that you cannot fit into your traditional decoding, then rather than just leaving it for rescoring at the end, you might as well simulate some text from it and build an n-gram model of the simulated text, because that n-gram is already better than just using the old one, and it might save you some work during rescoring.
0:18:09 | We were able to improve with a significantly lower N in the n-best lists, and that is of course mainly coming from the better first-pass n-gram.
0:18:15 | Before I conclude, let me show you one more slide, which is interesting; it's the one that Tomáš showed at the end of his talk.
0:18:21 | This is a much larger training set, because one of the reviewers, by the way, asked: this is all nice, but does it scale to large data? What happens when you have a lot of original training data?
0:18:30 | So this is a setup where we have four hundred million words to train the initial n-gram, and then, once we train a neural net with that, which takes forever, but that's Tomáš's problem, we simulate another billion words of text and build an n-gram model out of that.
0:18:46 | As you can see, the original decoding with the four-gram gives you thirteen point one word error rate, and if you simply decode using the approximated four-gram based on the neural net, you already get a reduction.
0:18:58 | That tells you that just replacing your old four-gram model with the new one, based on approximating a better language model, is already a good start; and then the last number, which Tomáš showed you, is twelve point one, if you rescore the lattices, or an n-best list out of them, with the full model.
0:19:16 | Let me go back and say: that's it, and thank you.
0:19:25 | Questions?
0:19:39 | Yes, your long-span model has to be good.
0:19:50 | Yes.
0:19:59 | Okay.
0:20:09 | Okay.
0:20:14 | Okay, I can't give you a quantitative answer in terms of correlation, but think of it this way; should I repeat the question? Yes.
0:20:22 | So yes, this looks like the standard bias-variance trade-off.
0:20:27 | If you have lots of initial text to train the neural net... well, first of all, this will work only if the neural net is much better than the n-gram model you are trying to replace with the simulation, and it is, yes.
0:20:37 | You are approximating, with the four-gram, not actual text but some imagined text. So the imagined text has to be a good representation of actual text; it doesn't have to be as good as actual text, but it has to be a better representation of actual text than the four-gram approximation of the actual text.
0:20:57 | So there are two models competing here: there is actual text, which is a good representation of your language, and the four-gram, which is an okay representation of the language; so your model has to be better than the four-gram.
0:21:08 | And once you have that, you have taken care of the bias, and the simulation reduces the variance.
0:21:16 | I think there was another question.
0:21:24 | I just wonder how big your one-billion-word language model is after you have made it an n-gram language model.
0:21:31 | Oh, it's pruned down quite a bit, by the way; I don't recall exactly.
0:21:39 | No, I can't count the number off on my fingers, but yes, there's standard Stolcke-like pruning, for all sorts of things.
0:21:48 | So it is... Tomáš, do you remember how big that language model is?
0:22:11 | Five million and fifty million.
0:22:14 | Yeah, five million n-grams for decoding, fifty million for rescoring.
0:22:23 | I just have a question about the setup: have you run the recurrent neural net language model with lattice rescoring directly?
0:22:32 | Sorry, I can't hear you well because of the doubling here; I hear the room as well as the reflections.
0:22:36 | So, do you have a direct result using lattice rescoring with the neural net language model?
0:22:42 | A direct result of lattice rescoring using the neural net model itself? No, we don't.
0:22:47 | But in that table, the first row is the Kneser-Ney model on broadcast news, so that is basically the Kneser-Ney model based on the four hundred million words of broadcast news, and the second one is the Kneser-Ney model from the broadcast news interpolated with the four-gram approximation of the neural net.
0:23:09 | And the numbers may be slightly different purely because of the implementation, because the nature of the expanded model is that it has no back-off structure, so every single context is an explicit four-gram context, and that is what makes the decoding harder.
0:23:22 | And if you want to use the neural net itself, you cannot rescore lattices with it, because the recurrent part is keeping information about the entire past, so when two paths merge in the lattice you cannot merge them exactly. Okay?
0:23:35 | Yes. Okay.
0:23:40 | We have time for one more question.
0:23:42 | (inaudible question)
0:23:50 | How much do we have to synthesize? I think the simulations don't work for just a factor of two or a factor of five; they started working when you had at least an order of magnitude more than the original data, yeah.
0:24:05 | Because, actually, there is a statistic which is interesting: when we simulated the two hundred and thirty million words, for example, we looked at n-gram coverage, at how many of the new n-grams are hallucinations and how many are good.
0:24:17 | Of course there's no way to tell, except there is one way: we said that if an n-gram that we hypothesized, or hallucinated, in the simulation shows up in the Google 5-gram corpus, then we'll say it was a good simulation, and if it doesn't, we'll say it was a bad simulation.
0:24:31 | If you look at the one million words of Wall Street Journal text, eighty-five percent of its n-grams are in the Google 5-grams, so real text has a "goodness" of eighty-five percent, and the two hundred and thirty million words we simulated have a goodness of seventy-two percent.
0:24:47 | So we are mostly simulating good n-grams, but you do have to simulate an order of magnitude more.
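A minimal sketch of that sanity check, assuming you have the simulated sentences as token lists and some reference collection of attested n-grams (a plain Python set below stands in for a lookup into the Google 5-gram data):

```python
from typing import Iterable, Set, Tuple

def ngrams(sentence: Iterable[str], order: int = 5):
    """Yield all n-grams of the given order from one tokenized sentence."""
    words = list(sentence)
    for i in range(len(words) - order + 1):
        yield tuple(words[i:i + order])

def coverage(corpus, reference: Set[Tuple[str, ...]], order: int = 5) -> float:
    """Fraction of the corpus's n-grams that appear in the reference set;
    the talk calls this the 'goodness' of the text."""
    total = attested = 0
    for sentence in corpus:
        for gram in ngrams(sentence, order):
            total += 1
            attested += gram in reference
    return attested / total if total else 0.0

# Hypothetical usage: real_text and simulated_text are lists of token lists,
# google_5grams is a set of 5-gram tuples loaded elsewhere.
# print(coverage(real_text, google_5grams), coverage(simulated_text, google_5grams))
```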
0:24:52 | Okay, let's thank the speaker.
0:24:55 | Okay.