0:00:16 | "'kay" uh |
---|
0:00:17 | first a couple of quick disclaimers |
---|
0:00:19 | uh this is uh work mostly done by a look the us and that to much we love who give |
---|
0:00:24 | the previous talk |
---|
0:00:26 | uh a with a lot of help from a the fund and from |
---|
0:00:30 | no |
---|
0:00:30 | and i just have the pleasure of sort of pretending that that have something to do that and going to |
---|
0:00:35 | talk |
---|
0:00:36 | uh uh not actually couldn't come because of some uh i is a problems not on the check side but |
---|
0:00:41 | going back to the U S so he sends apologies |
---|
0:00:44 | and uh he even made the slide so apologies to then then V of the slides a two nice i |
---|
0:00:49 | don't think he's trying to hide any |
---|
0:00:52 | alright right |
---|
0:00:53 | so |
---|
0:00:54 | You heard Tomáš tell you how the recurrent neural net language model gives us great performance.
0:00:58 | One of the issues with a model like that is that it has essentially, at least theoretically, infinite memory, and it really does depend on more than the past five, seven, eight words, so you really can't do lattice rescoring with a model like this.
0:01:10 | So the main idea of this paper is: can we do something with the neural net language model so that we can rescore lattices with it?
0:01:17 | And if you want the idea in a nutshell: this whole "variational approximation" is a scary term, I don't know who came up with that; it's actually a very simple idea.
0:01:25 | Imagine for a moment that there really was a true language model that generated all the sentences that you and I speak.
0:01:31 | Right. How do we actually build such a model? What we do is we take a lot of actual text, sentences.
0:01:37 | So it's a little bit like saying we sample from this true underlying distribution and get a bunch of sentences, and then we approximate that underlying model with a Markov chain, be it a second, third, or fourth order Markov chain.
0:01:49 | And that is an n-gram approximation of the true underlying model that you and I presumably carry in our heads, right?
0:01:56 | So the idea here is the same: you take Tomáš's neural net language model and pretend that that's the true model of language.
0:02:02 | Generate lots and lots of data from that model, instead of having human beings write the text, and simply estimate an n-gram model from that.
0:02:10 | That's essentially the long and short of the paper.
0:02:13 | So I'll tell you how it works.
0:02:17 | Oh. Yeah, this is how statistical speech recognition works.
0:02:24 | I was trying to do it this way... no, that's okay.
0:02:29 | So we get annotated speech, and we use the usual generative training procedure to create acoustic models.
0:02:35 | We get some text, we use it to train a language model, we combine the two and figure out some scale factor.
0:02:41 | We get new speech, we feed it to the decoder, which has all these models, and it produces the transcribed utterance.
0:02:48 | Essentially we are implementing this formula: P of A given W, raised to the power mu, times P of W, and we find the W that maximizes it; that's just the standard formulation.
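The decision rule being described, written out; this is the standard MAP formulation with the language model scale folded into an exponent $\mu$ on the acoustic score, as the speaker states it, not a formula copied from the slides:

```latex
\hat{W} \;=\; \arg\max_{W}\; P(A \mid W)^{\mu}\, P(W)
```

Here $A$ is the acoustic observation, $W$ ranges over word sequences, $P(A \mid W)$ is the acoustic model, and $P(W)$ is the language model.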
0:02:57 | Now the language model is typically, as I said, approximated using an n-gram.
0:03:01 | So we take P of W, with W being a whole sentence, and write it as the product of P of W_i given W_1 through W_{i-1}, and that is approximated by something that looks only at the last couple of words in the history, which gives rise to n-gram models.
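In symbols, the chain-rule factorization and its n-gram truncation just described (a standard identity, not specific to this paper):

```latex
P(W) \;=\; \prod_{i=1}^{|W|} P(w_i \mid w_1, \ldots, w_{i-1})
     \;\approx\; \prod_{i=1}^{|W|} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
```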
0:03:14 | And typically we use a small n, and a small number of n-grams, in order to make the decoding feasible, and then we get word lattices; that is, we first create a small search space like a lattice with this, and we rescore the lattice using a bigger n-gram.
0:03:30 | Again, standard practice; everybody knows this, there's nothing new here.
0:03:36 | So this talk is asking: can we go beyond n-grams, can we use something like Tomáš's neural net to do lattice rescoring?
0:03:42 | Or we could perhaps talk about using more complex acoustic models.
0:03:46 | All of these are feasible with lattices, provided your models are tractable in the sense of being local: phones, even when they have a five-phone context, are still local, and n-grams, even for n equal to five, are still local.
0:03:59 | But if you truly have a long-span model, that's not possible; you can't do it with word lattices, so you tend to do it with n-best lists.
0:04:07 | N-best lists have the advantage that you can deploy a really long-span model, but they do have a bias, and this paper is about how to get past that.
0:04:16 | So the real question is: can we do with lattices what we do with n-best lists? Let's see how we get there.
0:04:23 | What's good about word lattices is that they provide a large search space, and they are not biased by the new model that will be used to rescore them, because they really have a lot of options in them; but they do require the new models to be local.
0:04:36 | N-best lists don't require the new models to be local; the new models can be really long-span. But they do offer a limited search space, limited by N, and more importantly, the top N are chosen according to the old model, so the choice of the top-N hypotheses is biased.
0:04:50 | So, Anoop has other work on this, a paper at this ICASSP in a poster session going on right now, which you might be able to catch after this session; it shows how to essentially expand the search space, how to search more than just the n-best in the lattice.
0:05:09 | That's like saying: let's do lattice-like things with n-best lists, except let's somehow make the N very large without really having a large N; let's do the rescoring with some magic so it covers more of the space.
0:05:20 | This paper goes in the other direction. It says: can I somehow approximate the long-span model and make it local?
0:05:28 | So yes, the neural net model has a long history, but can I somehow get a local approximation of that model? That's what is at stake here.
0:05:35 | Alright, so let's see. This is sort of the outline of the talk that's coming up.
0:05:40 | The first order of business: which long-span model do we approximate? I already told you: it's the recurrent neural net model that Tomáš just presented, so you all know about it and I won't waste time on it.
0:05:53 | So what is this approximation going to be?
0:05:55 | Think of it this way. This is the space of all language models; this blob is the set of all recurrent neural networks, let's say this is the set of all n-gram models, and let's say this is the model you'd like.
0:06:09 | Well, typically you can decode and rescore lattices with n-gram models, so there's a check mark against those; I think the mark doesn't stand out, that's the colour, yeah.
0:06:20 | Okay, anyway, so that set is tractable; the blue one is not. So what we should really be using is a tractable model which is as close as possible to the model we would really like to use.
0:06:32 | But the n-gram model that we actually use for rescoring, estimated from the training data, may not be that model.
0:06:39 | From the same training data you estimate a recurrent neural net model P, and you estimate an n-gram model M, and M may not be close to Q-star, the n-gram which is closest to your long-span language model, whatever that might be.
0:06:53 | So what we do in this paper is to say: hmm, what happens if we use Q-star? What if we approximate this better model P with the best n-gram model?
0:07:05 | So that's how you want to look at this. So how do we actually find Q-star? Let me skip ahead.
0:07:13 | This is what happens when you try to use someone else's slides.
0:07:17 | Okay, so here is how the approximation is going to work: you're going to look for an n-gram model Q, among the set of all n-gram models, script Q, which is closest to this long-span model in the sense of KL divergence; familiar enough to everybody.
0:07:33 | Alright.
0:07:36 | So what do you do? Well, essentially, the KL divergence between P and Q is basically just a sum over all X of P of X times log P of X over Q of X, where X ranges over all possible sentences in the world, because these are sentence-level models.
0:07:50 | And if you drop the P log P term, what you really want is the Q that maximizes the sum, over all sentences in the universe, of P of X times log Q of X, where P of X is the neural net probability of the sentence and Q of X is the n-gram probability of the sentence.
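Written out, the objective the speaker is describing; the final step, replacing the intractable sum over all sentences with samples drawn from $P$, is the Monte Carlo view that motivates the recipe that follows (my rendering of the argument, not a formula from the slides):

```latex
Q^{*} \;=\; \arg\min_{Q \in \mathcal{Q}} D(P \,\|\, Q)
      \;=\; \arg\min_{Q \in \mathcal{Q}} \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
      \;=\; \arg\max_{Q \in \mathcal{Q}} \sum_{x} P(x) \log Q(x)
      \;\approx\; \arg\max_{Q \in \mathcal{Q}} \frac{1}{N} \sum_{n=1}^{N} \log Q(x_{n}),
      \qquad x_{n} \sim P
```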
0:08:07 | Of course you can't get P of X for every sentence in the universe. But what you can do is just what we do with normal human language models: we approximate them by "synthesizing" data from the model, namely getting people to write down text, and estimating an n-gram model from that text.
0:08:24 | So what we'll do is synthesize sentences using this neural net language model, and we'll simply estimate an n-gram model from the synthesized data.
0:08:40 | So the recipe is very simple. You take your fancy long-span language model; there is one prerequisite: this fancy long-span language model needs to be a generative model, meaning you need to be able to simulate sentences from it, and the recurrent neural net is ideal for that.
0:08:56 | You synthesize sentences, and once you have a big enough corpus that you're comfortable estimating whatever n-gram model you want, you go ahead and estimate it as if somebody had given you tons of text.
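A minimal sketch of that recipe, not the authors' code: `toy_sampler` stands in for word-by-word sampling from the recurrent neural net language model, and the estimator uses plain relative-frequency counts rather than the Kneser-Ney smoothing used in the experiments.

```python
import random
from collections import defaultdict

def toy_sampler(rng):
    """Placeholder for the generative long-span model; in the real recipe each
    word would be drawn from the RNN language model's predictive distribution."""
    vocab = ["the", "cat", "dog", "sat", "ran", "home"]
    return [rng.choice(vocab) for _ in range(rng.randint(3, 8))] + ["</s>"]

def synthesize_corpus(sample_sentence, num_sentences, seed=0):
    """Draw num_sentences samples from the model, i.e. build the simulated corpus."""
    rng = random.Random(seed)
    return [sample_sentence(rng) for _ in range(num_sentences)]

def estimate_ngram(corpus, order=3):
    """Count n-grams in the simulated corpus and return relative-frequency
    estimates P(w | history): the 'variational' n-gram approximation Q."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        padded = ["<s>"] * (order - 1) + sentence
        for i in range(order - 1, len(padded)):
            history = tuple(padded[i - order + 1:i])
            counts[history][padded[i]] += 1
    model = {}
    for history, followers in counts.items():
        total = sum(followers.values())
        model[history] = {w: c / total for w, c in followers.items()}
    return model

if __name__ == "__main__":
    corpus = synthesize_corpus(toy_sampler, num_sentences=100_000)
    q = estimate_ngram(corpus, order=3)
    print(len(q), "distinct trigram histories in the simulated-text model")
```

In the experiments that follow, the simulated corpora range from a few times to a couple of hundred times the size of the original training text, and the counts are smoothed before use.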
0:09:07 | Sounds crazy, yeah? So let's see what it does.
0:09:12 | I'm going to give you three sets of experiments: first a baby one, then a medium one, and then a really muscular one.
0:09:18 | This is the baby experiment. We start off with the Penn Treebank corpus, which has about a million words for training and about twenty thousand words for evaluation, and the vocabulary is made of the top ten thousand words; this is a standard setup.
0:09:33 | And just to tell you how standard it is: Peng Xu and Jelinek ran their random forest language models on it, Chelba presented a structured language model on it, and others have built similar language models as well; Filimonov and Mary Harper have results on it, and there are cross-sentence, cache-like language models too; all sorts of people have run on exactly this corpus.
0:09:57 | So in fact we didn't have to design the experiment; we simply took their setup and copied their numbers. It's pretty standard.
0:10:04 | So what do we get on that? First we estimated the neural net; actually Tomáš did that. And then we ran the simulation: instead of having a million words of training text, we generated two hundred and thirty million words of training text.
0:10:17 | Why two hundred and thirty million? That's how much we could create in a day or so, so we stopped there.
0:10:22 | Then we simply estimate an n-gram model from the simulated text. The models estimated from the simulated text are the rows labelled as the variational approximation of the neural net; we can generate either trigram or five-gram models from it, and here are the numbers on those.
0:10:40 | The good old trigram model with standard Kneser-Ney smoothing has a perplexity of about one forty.
0:10:46 | If you approximate the neural net by this synthetic method, and remember this n-gram is estimated on two hundred and thirty million words of data, which is a lot of data, its perplexity is only one fifty-two, which is sort of comparable to that one forty; and if you interpolate the two you get one twenty-four, so maybe there's reason to be happy.
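The interpolation mentioned here is, presumably, the usual linear mixture of the two models, with the weight tuned on held-out data (the weight itself is not given in the talk):

```latex
P_{\text{interp}}(w \mid h) \;=\; \lambda\, Q^{*}(w \mid h) \;+\; (1 - \lambda)\, P_{\text{KN}}(w \mid h),
\qquad 0 \le \lambda \le 1
```

where $Q^{*}$ is the n-gram estimated from the simulated text and $P_{\text{KN}}$ is the Kneser-Ney n-gram estimated from the original text.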
0:11:02 | What do we compare it with? Peng Xu's random forest language model is sort of a trigram-order model; it looks at the two previous words. We can compare against that: it gets one thirty-two versus our one twenty-four. Okay, so far so good.
0:11:17 | You can also compare it with the structured, head-driven language models; they look at previous words and syntactic heads and so on, so think of them as five-gram models, because they look at four preceding words, although these are not four consecutive words; they are chosen based on the syntax.
0:11:33 | If you want to do that comparison, we can compare by simulating a five-gram model. A five-gram Kneser-Ney model based on the same one million words of training text has a perplexity of about one forty, and this approximation of the neural net also has a perplexity of about one forty, but when you interpolate the two you get one twenty. So again it's competitive with all the other models.
0:11:53 | And then finally, there is an across-sentence model that looks at the previous sentences; to compare with that we simply implemented a very simple cache language model: cache all the words in the previous couple of sentences and add that to the model. With that you get a perplexity of a hundred and eleven.
0:12:13 | Mind you, this is still not as good as one oh two, which is what Tomáš gets, and that tells you that the exact neural net language model is still better than the approximation we are creating; after all, we are approximating a long-span net with a five-gram.
0:12:25 | But the approximation already is pretty good, and quite different from the n-gram estimated just from the million words of text. So things are working, but...
0:12:33 | Okay, so these are nice perplexity results. For the next experiment we looked at the MIT lectures data.
0:12:40 | For those of you who don't know this corpus, there are a few tens of hours of acoustic training data, I think something like twenty, and then a couple of hours of evaluation data; these are professors giving lectures, and there are a couple of speakers.
0:12:54 | We have transcripts for about a hundred and fifty thousand words of speech, and we have a lot of data from broadcast news, which is out of domain: that is news, and these are lectures, so it's about as different as it could be.
0:13:05 | So we basically said, let's see what we can do with it. We estimated an RNN language model from the in-domain data, simulated twenty times more data, that's three million words of text, and estimated an n-gram from that.
0:13:18 | And we compared it to the baseline Kneser-Ney model; the comparison is asking what you can do with the simulated language model. Of course you don't want to throw away the broadcast news models, because they do cover a lot of good English that is useful for the lectures as well.
0:13:33 | As for baselines, you can choose your own; if you use the Kneser-Ney model estimated from the MIT data, interpolated with broadcast news, you do about as well as you can, and you get word error rates like twenty-four point seven on one lecture and twenty-two point four on the other; you get a reasonable number.
0:13:52 | This is with a big acoustic model; these are fairly well trained acoustic models, discriminatively trained I believe, though I don't know if they have every refinement; I don't think they do. So yes, all the goodies are in there.
0:14:08 | If you rescore the top hundred hypotheses using the full neural network, because remember that's a full-sentence model, you get some reduction: twenty-four point one on the first set, and a similar drop on the other.
0:14:19 | If you go much deeper in the n-best list you get a bit more of an improvement, close to one percent: point eight, point nine. And that's pretty good, though the oracle rate of the list is around fifteen or so, so maybe we could do better; who knows.
0:14:34 | Anyway, that's what you get by doing n-best rescoring, and as I said, one of the problems is that the n-best list presented to you is good according to the original four-gram model.
0:14:43 | What you can do instead is replace the original four-gram model with this phony four-gram model estimated from the simulated text, which is much more text: three million words versus one hundred and fifty thousand words.
0:14:55 | And what you can see, on the first line of the second block, is that if you simply use the variational approximation four-gram instead of the Kneser-Ney four-gram and decode with that, you already get a point four and a point two percent reduction in word error rate.
0:15:10 | What's more interesting is that now, if you simply rescore the hundred best, which is much less work, you get almost all of the gain. So this is starting to look good.
0:15:19 | So what we're saying is that not only do you get better one-best output if you use the n-gram approximation of the neural net language model, you also produce better lattices: when you then rescore with the full neural net language model, you get the reductions at a much lower N, because the n-best list you have to rescore is much shorter than it would be if you had the original model.
0:15:43 | So that was the medium-size experiment. If you want a large experiment, this one has to do with English conversational telephone speech and some meeting speech from the NIST '07 evaluation data.
0:15:58 | Here there are about five million words transcribed, so that's our basic in-domain training data, and then we can either build a Kneser-Ney model from it, or we can train a neural net model, synthesize another corpus an order of magnitude larger, something like forty million words of text, and then use that text to build the language model.
0:16:16 | So again, the original language model is in blue, and the simulated-data language model is in red.
0:16:25 | And again, the notation "two-gram, five-gram" indicates that we produced lattices using a bigram model but then rescored them using a five-gram model, and likewise for the rescoring with the neural net model.
0:16:42 | On that, here again there are a bunch of results, and again you can decide what the baseline is; I like to think of the Kneser-Ney five-gram lattice rescoring as the baseline.
0:16:51 | You get about twenty-eight percent word error rate on the CTS data, and on the meetings data you get thirty-two point four.
0:16:58 | If you rescore these using the neural net you get down to twenty-seven point one with hundred-best rescoring, or twenty-six point five with thousand-best rescoring; that's the kind of gain you can get if you decode using a standard n-gram and rescore using the neural net.
0:17:13 | But if, instead of the standard n-gram, you now replace it with this new n-gram, which is an approximation of the neural net, here's what we get.
0:17:22 | You get a nice reduction just by using a different n-gram model: twenty-eight point four becomes twenty-seven point two, and thirty-two point four becomes thirty-one point seven, which is nice.
0:17:32 | We don't get as much of a gain in the rescoring, so maybe we have done too good a job with the lattices, but there is a gain to be had, at least in the first-pass decoding.
0:17:43 | So, to conclude: this basically convinced us, and hopefully has convinced you, that if you have a very long-span model that you cannot fit into your traditional decoding, then rather than just leaving it for rescoring at the end, you might as well simulate some text from it and build an n-gram model of the simulated text, because that n-gram is already better than just using the old one, and it might save you some work during rescoring.
0:18:09 | We were able to improve with a significantly lower N in the n-best lists, and that is of course mainly coming from the better first-pass n-gram.
0:18:15 | Before I conclude, let me show you one more slide, which is interesting; it's the one that Tomáš showed at the end of his talk.
0:18:21 | This is a much larger training set, because one of the reviewers, by the way, asked: this is all nice, but does it scale to large data? What happens when you have a lot of original training data?
0:18:30 | So this is a setup where we have four hundred million words to train the initial n-gram, and then, once we train a neural net with that, which takes forever, but that's Tomáš's problem, we simulate another billion words of text and build an n-gram model out of that.
0:18:46 | As you can see, the original decoding with the four-gram gives you thirteen point one word error rate, and if you simply decode using the approximated four-gram based on the neural net, you already get a reduction.
0:18:58 | That tells you that just replacing your old four-gram model with the new one, based on approximating a better language model, is already a good start; and then the last number, which Tomáš showed you, is twelve point one, if you rescore the lattices, or an n-best list out of them, with the full model.
0:19:16 | Let me go back and say: that's it, and thank you.
0:19:25 | Questions?
0:19:39 | Yes, your long-span model has to be good.
0:19:50 | Yes.
0:19:59 | Okay.
0:20:09 | Okay.
0:20:14 | Okay, I can't give you a quantitative answer in terms of correlation, but think of it this way; should I repeat the question? Yes.
0:20:22 | So yes, this looks like the standard bias-variance trade-off.
0:20:27 | If you have lots of initial text to train the neural net... well, first of all, this will work only if the neural net is much better than the n-gram model you are trying to replace with the simulation, and it is, yes.
0:20:37 | You are approximating, with the four-gram, not actual text but some imagined text. So the imagined text has to be a good representation of actual text; it doesn't have to be as good as actual text, but it has to be a better representation of actual text than the four-gram approximation of the actual text.
0:20:57 | So there are two models competing here: there is actual text, which is a good representation of your language, and the four-gram, which is an okay representation of the language; so your model has to be better than the four-gram.
0:21:08 | And once you have that, you have taken care of the bias, and the simulation reduces the variance.
0:21:16 | I think there was another question.
0:21:24 | I just wonder how big your one-billion-word language model is after you have made it an n-gram language model.
0:21:31 | Oh, it's pruned down quite a bit, by the way; I don't recall exactly.
0:21:39 | No, I can't count the number off on my fingers, but yes, there's standard Stolcke-like pruning, for all sorts of things.
0:21:48 | So it is... Tomáš, do you remember how big that language model is?
0:22:11 | Five million and fifty million.
0:22:14 | Yeah, five million n-grams for decoding, fifty million for rescoring.
0:22:23 | I just have a question about the setup: have you run the recurrent neural net language model with lattice rescoring directly?
0:22:32 | Sorry, I can't hear you well because of the doubling here; I hear the room as well as the reflections.
0:22:36 | So, do you have a direct result using lattice rescoring with the neural net language model?
0:22:42 | A direct result of lattice rescoring using the neural net model itself? No, we don't.
0:22:47 | But in that table, the first row is the Kneser-Ney model on broadcast news, so that is basically the Kneser-Ney model based on the four hundred million words of broadcast news, and the second one is the Kneser-Ney model from the broadcast news interpolated with the four-gram approximation of the neural net.
0:23:09 | And the numbers may be slightly different purely because of the implementation, because the nature of the expanded model is that it has no back-off structure, so every single context is an explicit four-gram context, and that is what makes the decoding harder.
0:23:22 | And if you want to use the neural net itself, you cannot rescore lattices with it, because the recurrent part is keeping information about the entire past, so when two paths merge in the lattice you cannot merge them exactly. Okay?
0:23:35 | Yes. Okay.
0:23:40 | We have time for one more question.
0:23:42 | (inaudible question)
0:23:50 | How much do we have to synthesize? I think the simulations don't work for just a factor of two or a factor of five; they started working when you had at least an order of magnitude more than the original data, yeah.
0:24:05 | Because, actually, there is a statistic which is interesting: when we simulated the two hundred and thirty million words, for example, we looked at n-gram coverage, at how many of the new n-grams are hallucinations and how many are good.
0:24:17 | Of course there's no way to tell, except there is one way: we said that if an n-gram that we hypothesized, or hallucinated, in the simulation shows up in the Google 5-gram corpus, then we'll say it was a good simulation, and if it doesn't, we'll say it was a bad simulation.
0:24:31 | If you look at the one million words of Wall Street Journal text, eighty-five percent of its n-grams are in the Google 5-grams, so real text has a "goodness" of eighty-five percent, and the two hundred and thirty million words we simulated have a goodness of seventy-two percent.
0:24:47 | So we are mostly simulating good n-grams, but you do have to simulate an order of magnitude more.
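A minimal sketch of that sanity check, assuming you have the simulated sentences as token lists and some reference collection of attested n-grams (a plain Python set below stands in for a lookup into the Google 5-gram data):

```python
from typing import Iterable, Set, Tuple

def ngrams(sentence: Iterable[str], order: int = 5):
    """Yield all n-grams of the given order from one tokenized sentence."""
    words = list(sentence)
    for i in range(len(words) - order + 1):
        yield tuple(words[i:i + order])

def coverage(corpus, reference: Set[Tuple[str, ...]], order: int = 5) -> float:
    """Fraction of the corpus's n-grams that appear in the reference set;
    the talk calls this the 'goodness' of the text."""
    total = attested = 0
    for sentence in corpus:
        for gram in ngrams(sentence, order):
            total += 1
            attested += gram in reference
    return attested / total if total else 0.0

# Hypothetical usage: real_text and simulated_text are lists of token lists,
# google_5grams is a set of 5-gram tuples loaded elsewhere.
# print(coverage(real_text, google_5grams), coverage(simulated_text, google_5grams))
```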
0:24:52 | Okay, let's thank the speaker.
0:24:55 | Okay.