0:00:15 | can you hear me |
---|
0:00:18 | and then um |
---|
0:00:19 | a much |
---|
0:00:20 | a where that up of and and |
---|
0:00:22 | it |
---|
0:00:22 | but its right |
---|
0:00:24 | so i'll try to |
---|
0:00:27 | trying to explain to you |
---|
0:00:28 | we went to a frame |
---|
0:00:31 | large |
---|
0:00:33 | um |
---|
0:00:34 | this is my |
---|
0:00:36 | i |
---|
0:00:38 | so better |
---|
0:00:40 | uh so this is the outline of the talk i'll begin with an introduction and a motivation as to why |
---|
0:00:45 | a model like model M is relevant |
---|
0:00:47 | and then move on to the two main components which make uh the computation of model M uh extremely computationally |
---|
0:00:55 | expensive and memory hard |
---|
0:00:57 | and go on to some of the ideas we had for distributed training and how much gain we got in |
---|
0:01:02 | memory and |
---|
0:01:03 | uh speed-up by using these ideas |
---|
0:01:06 | so just to give a very brief overview uh everyone knows what exponential n-gram models are and their power |
---|
0:01:13 | and their ability to capture really complex dependencies |
---|
0:01:17 | um these were first introduced back in ninety six and subsequently again in two thousand two |
---|
0:01:22 | um |
---|
0:01:23 | so the definition really goes as |
---|
0:01:25 | given a set of target symbols Y and inputs X |
---|
0:01:29 | you can formulate the exponential language model in the following way |
---|
0:01:34 | where you see a numerator term which is the summation over all features which are active and the denominator |
---|
0:01:40 | term which we refer to as the normalisation sum computation |
---|
0:01:44 | which involves an additional summation over all the target outputs |
---|
0:01:48 | so X is the history and Y is the target |
---|
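For reference, the exponential n-gram language model described here can be written as follows. This is a reconstruction from the spoken description, using the standard maximum-entropy notation of feature functions f_i with weights lambda_i rather than the exact symbols on the slide:

```latex
P(y \mid x) = \frac{\exp\!\left(\sum_i \lambda_i f_i(x, y)\right)}{Z(x)},
\qquad
Z(x) = \sum_{y' \in Y} \exp\!\left(\sum_i \lambda_i f_i(x, y')\right)
```

The numerator sums over the features active for history x and target y, and Z(x), the normalisation sum, repeats that summation for every target symbol in Y.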
0:01:51 | and now you can imagine at a large scale how extremely computationally intensive this is gonna be |
---|
0:01:57 | so for the typical uh exponential n-gram model where n equals three you can think of three history trees the null |
---|
0:02:03 | tree the unigram tree and the bigram tree |
---|
0:02:06 | and uh parameter estimation in an exponential model basically has uh many objective functions which are variations of regularisations on the |
---|
0:02:15 | basic objective function |
---|
0:02:16 | and the one that we use is formulated in the following form |
---|
0:02:21 | uh with an L1 and L2 regularization component and for some choice of alpha and sigma which are |
---|
0:02:27 | determined empirically |
---|
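A common form of such an L1-plus-L2-regularised objective, with the hyperparameters alpha and sigma chosen empirically as mentioned above, is sketched below; this is the standard formulation for regularised exponential models and may differ in detail from the exact objective on the slide:

```latex
\mathcal{O}(\Lambda) = \sum_{(x,y) \in D} \log P_{\Lambda}(y \mid x)
\;-\; \alpha \sum_i |\lambda_i|
\;-\; \frac{1}{2\sigma^2} \sum_i \lambda_i^2
```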
0:02:29 | um |
---|
0:02:31 | for training these parameters there is a whole class of iterative update algorithms |
---|
0:02:36 | and basically it's a comparison of the expected feature counts to the observed counts |
---|
0:02:41 | and you update the parameters accordingly so there you see the expression for the expected feature count |
---|
0:02:47 | and you see a summation term over your training data D |
---|
0:02:52 | which can be represented as a sequence of histories and target pairs |
---|
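The expected and observed feature counts being compared can be written in the usual notation as follows (a sketch consistent with the description, not copied from the slide):

```latex
E_{P_\Lambda}[f_i] = \sum_{(x,y) \in D} \; \sum_{y' \in Y} P_\Lambda(y' \mid x)\, f_i(x, y'),
\qquad
\tilde{E}[f_i] = \sum_{(x,y) \in D} f_i(x, y)
```

The inner sum over all targets y' for every training event is what makes this expensive at scale.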
0:02:56 | um |
---|
0:02:57 | if you look at the reference that i have there from malouf in two thousand two |
---|
0:03:01 | you'll find an exhaustive comparison of various iterative update algorithms and their advantages and disadvantages and so on |
---|
0:03:08 | what we use internally is a generalized uh iterative scaling algorithm we've implemented |
---|
0:03:15 | so you heard about model M from our previous speaker |
---|
0:03:17 | uh introduced in two thousand and nine by chen |
---|
0:03:21 | um it basically has the following form which is think of it as a combination of two exponential n-gram models |
---|
0:03:28 | and what we've seen is it consistently outperforms any other language model and gives us one of the best |
---|
0:03:34 | results and we've been seeing five percent relative gains across |
---|
0:03:37 | a wide range of domains and tasks |
---|
0:03:39 | and |
---|
0:03:40 | large corpora as well |
---|
0:03:42 | um |
---|
0:03:43 | now let's talk about model M in this particular formulation where you have like a trigram |
---|
0:03:49 | along the lines of the exponential n-gram model with n equal to three |
---|
0:03:53 | so the key thing to note here is you have a class model and a word model both of which |
---|
0:03:57 | can be estimated independently |
---|
0:03:59 | where the class model has a history dependence on the previous classes |
---|
0:04:03 | as well as the words |
---|
0:04:05 | the word model has this little term C J sitting here |
---|
0:04:09 | which is the predicted class of the word |
---|
0:04:11 | of the current word w_j that you're trying to predict |
---|
0:04:15 | so the two models can be trained independently of each other |
---|
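For n = 3, the two factors being described, the class model and the word model, are usually written roughly as below, where c_j is the class of word w_j. This is a reconstruction of the standard model M decomposition, with each factor an exponential n-gram model:

```latex
P(w_j \mid \text{history}) \approx
P_{\text{class}}\big(c_j \mid c_{j-2}, c_{j-1}, w_{j-2}, w_{j-1}\big)\cdot
P_{\text{word}}\big(w_j \mid w_{j-2}, w_{j-1}, c_j\big)
```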
0:04:18 | and we know that model M as we've shown has significant |
---|
0:04:23 | performance improvement |
---|
0:04:24 | but then we know it comes at a steep computational cost |
---|
0:04:27 | so typically something like thirty hours of computation is required for a typical training dataset of a hundred million |
---|
0:04:35 | words |
---|
0:04:36 | and it consumes memory of like ten gigabytes |
---|
0:04:39 | so this is something that we want to reduce if you want to scale to billions of words as |
---|
0:04:43 | is common in many of the tasks such as voice search or you could think of any kind of |
---|
0:04:48 | large-scale broadcast news type task |
---|
0:04:52 | so there are two parts which we worked on in this approach to improving the parameter estimation and fast |
---|
0:04:59 | computation |
---|
0:05:00 | the first one i want to talk about is the normalisation sum and then i'll come to the uh feature |
---|
0:05:05 | computation |
---|
0:05:06 | um before we go to the parallelization we'll talk about the efficient computation of these two components |
---|
0:05:12 | uh first consider the case where |
---|
0:05:14 | uh you remember from the previous slides the formulation of the norm sum computation |
---|
0:05:20 | consider two event histories X1 and X2 |
---|
0:05:23 | and if you were to compute a differential of those two histories the expression would look something like |
---|
0:05:28 | this |
---|
0:05:29 | and here the alpha term is just a summation taken over all of the features that are actually |
---|
0:05:34 | active |
---|
0:05:35 | now think of any infrequent word |
---|
0:05:38 | the uh only case where a feature is going to be really active is the unigram feature |
---|
0:05:43 | case so for most cases you'll find alpha of X1 equal to alpha of X2 |
---|
0:05:47 | if you could efficiently order all your histories in a manner |
---|
0:05:52 | in which you're reduced to small differential computations then the whole process becomes very efficient |
---|
0:05:59 | and so the key idea here is to sort the corpus |
---|
0:06:03 | in a lexicographical way |
---|
0:06:05 | such that consecutive histories differ only in a small number of words |
---|
0:06:10 | and what this does for you is it reduces the number of target words over which you have to sum in |
---|
0:06:16 | the final |
---|
0:06:17 | normalisation sum computation |
---|
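A minimal sketch of this sorting step, assuming event histories are represented as word tuples (the function and variable names here are illustrative, not from the talk):

```python
# Sort event histories lexicographically so that consecutive histories
# differ only in a small number of words; Python's tuple comparison is
# already lexicographic, position one first, then position two, and so on.
def sort_histories(histories):
    """histories: an iterable of word tuples, e.g. ("the", "cat")."""
    return sorted(set(histories))

corpus_histories = [("the", "dog"), ("a", "cat"), ("the", "cat"), ("a", "cat")]
for h in sort_histories(corpus_histories):
    print(h)
# ('a', 'cat'), ('the', 'cat'), ('the', 'dog') -- neighbouring histories
# share words, so moving from one to the next changes only a few features.
```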
0:06:21 | so what is this lexicographic |
---|
0:06:23 | sorting all we're doing is to make sure |
---|
0:06:26 | same words in position one are occurring first and then the same words in position two are occurring next |
---|
0:06:32 | and it's sort of like a nested history tree clustering |
---|
0:06:35 | so you have your bigram unigram and empty history and you can visualise the whole thing as a prefix |
---|
0:06:41 | tree |
---|
0:06:42 | where you have the null history tree at the root node |
---|
0:06:44 | and then all of the sorted event histories are children of that node |
---|
0:06:50 | each |
---|
0:06:51 | of them uh is a unique event and you can tie them all up at the root node |
---|
0:06:55 | and |
---|
0:06:56 | when you do this you're basically assigning if you will unique IDs to these event histories |
---|
0:07:03 | so the lower order ones get lower IDs and the higher order ones get higher |
---|
0:07:07 | IDs |
---|
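The ID assignment described here could look like the following sketch (a hypothetical helper, assuming that lower-order histories should simply receive smaller integers than their higher-order extensions):

```python
def assign_history_ids(histories):
    """Give every event history a unique integer ID such that lower-order
    histories (shorter tuples, closer to the root of the prefix tree)
    get lower IDs than higher-order ones."""
    ordered = sorted(set(histories), key=lambda h: (len(h), h))
    ids = {(): 0}                      # the null history sits at the root
    for h in ordered:
        ids.setdefault(h, len(ids))
    return ids
```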
0:07:08 | so when you do your differential computation that i talked about uh you're only going to sum the |
---|
0:07:14 | differentials over a small number of them so your computation cost comes down and later on we'll talk about the |
---|
0:07:19 | memory as well |
---|
0:07:21 | um this is basically what i just covered of how the differential can be computed efficiently |
---|
0:07:26 | um so once you have the event histories sorted you have one more step which is to just sum over all |
---|
0:07:32 | the events |
---|
0:07:33 | now you can do the same lexicographic sorting at the n-gram event level as well |
---|
0:07:39 | and when you do that the only two items you keep track of are the alpha value |
---|
0:07:44 | and the norm sum value which is the summation of alpha over all the target symbols you're interested in |
---|
0:07:52 | and when you move from one event history to the next |
---|
0:07:56 | you only want to identify the feature histories that differ |
---|
0:07:59 | that have actually changed and this comes down to our sorting again because then you know that all the |
---|
0:08:05 | lower order histories have lower unique IDs and you're really only going to |
---|
0:08:09 | difference the higher order terms |
---|
0:08:11 | and so this differential becomes zero for all the lower order histories and exists only for the higher order |
---|
0:08:17 | ones |
---|
0:08:17 | so it's just an efficient way of storing your histories and your target symbols |
---|
0:08:22 | such that your denominator term in an exponential model can be computed efficiently |
---|
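Putting the pieces together, the incremental norm-sum computation could be sketched roughly as below. This is a simplified illustration of the differential idea under two assumptions that are mine, not the paper's: histories are stored most-recent-word-first so that every n-gram feature is a (history prefix, target) pair, and `feature_weights` maps each history prefix to a dict of target-to-lambda values.

```python
import math
from collections import defaultdict

def norm_sums(sorted_histories, vocab, feature_weights):
    """Compute the normalisation sum Z(x) for every event history x.

    Only the two running quantities mentioned in the talk are kept:
    the per-target alpha values and the current norm sum Z.  Targets not
    in `vocab` are ignored, which also lets the vocabulary be split
    across machines later on.
    """
    alpha = dict.fromkeys(vocab, 0.0)
    for y, lam in feature_weights.get((), {}).items():   # unigram features
        if y in alpha:
            alpha[y] += lam
    z_val = sum(math.exp(a) for a in alpha.values())      # Z of the null history
    prev, result = (), {}
    for hist in sorted_histories:
        # length of the prefix shared with the previous history
        common = 0
        while common < min(len(prev), len(hist)) and prev[common] == hist[common]:
            common += 1
        # the differential only involves the changed, higher-order prefixes
        delta = defaultdict(float)
        for k in range(common + 1, len(prev) + 1):
            for y, lam in feature_weights.get(prev[:k], {}).items():
                if y in alpha:
                    delta[y] -= lam
        for k in range(common + 1, len(hist) + 1):
            for y, lam in feature_weights.get(hist[:k], {}).items():
                if y in alpha:
                    delta[y] += lam
        # update alpha and Z only for the targets whose features changed
        for y, d in delta.items():
            z_val -= math.exp(alpha[y])
            alpha[y] += d
            z_val += math.exp(alpha[y])
        result[hist] = z_val
        prev = hist
    return result
```

Only the features attached to the prefixes that changed between consecutive sorted histories are touched, which is exactly why the lexicographic sort matters.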
0:08:27 | now once you do this this sort of spills over to the expectation uh computation of the features as well |
---|
0:08:33 | because you can view that numerator term um as just this alpha divided by Z |
---|
0:08:39 | and you want to compute Z very efficiently as we did so far |
---|
0:08:43 | so if you want to consider the case of the numerator term for the expectation computation |
---|
0:08:49 | where you have a sequence of histories d1 d2 |
---|
0:08:53 | and you wrote the whole expression out this way |
---|
0:08:56 | if you have |
---|
0:08:57 | an element in the event history which can be shared for the same target Y |
---|
0:09:02 | then this expression reduces to the alpha term coming out and just the second term remaining |
---|
0:09:08 | and what this means is this becomes independent of Y |
---|
0:09:12 | and you can share this across multiple targets um |
---|
0:09:17 | so think of the event histories uh |
---|
0:09:20 | that you did earlier on in the norm sum computation |
---|
0:09:23 | such that only a few of the alpha terms change and because of that you got some good |
---|
0:09:29 | advantages in the computation and it all just spills over to the expectation computation |
---|
0:09:34 | where it just reduces to a multiplication if you will of your running total that you kept so far of |
---|
0:09:39 | the norm sum you multiply it by the alpha term which also you computed only for |
---|
0:09:45 | those event histories that changed |
---|
0:09:48 | um |
---|
0:09:49 | so that's what we point out here that it becomes a very efficient computation of the uh numerator term |
---|
0:09:55 | so once we have both the numerator and the denominator the parameter updates in an exponential uh language model |
---|
0:10:01 | become very uh efficient |
---|
0:10:03 | and you have well known algorithms that exist to do this kind of iterative scaling |
---|
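For orientation, the textbook generalised iterative scaling update has the form below, where C is (an upper bound on) the total feature count of any event; the talk's implementation uses an unnormalised iterative-scaling variant (see the computational-cost notes near the end), so take this only as the generic shape of such updates:

```latex
\lambda_i \leftarrow \lambda_i + \frac{1}{C}\,
\log \frac{\tilde{E}[f_i]}{E_{P_\Lambda}[f_i]}
```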
0:10:08 | um |
---|
0:10:10 | at this point i wanna talk a little bit about the prior art here |
---|
0:10:13 | this sort of uh |
---|
0:10:15 | lexicographic sorting was introduced by lafferty in ninety five |
---|
0:10:18 | and it's one of the bases that we use for this efficient uh computation of both the norm sum and expected |
---|
0:10:25 | feature computations |
---|
0:10:26 | and uh |
---|
0:10:28 | wu and khudanpur in two thousand two and two thousand |
---|
0:10:31 | used a similar prefix tree traversal type algorithm |
---|
0:10:36 | but the difference between their work and our work is they sort of used a propagation if you will of |
---|
0:10:42 | norm sum terms |
---|
0:10:43 | from lower order to higher order n-grams |
---|
0:10:45 | and the expectations from higher order to lower order n-grams so computational complexity wise these two algorithms the one we |
---|
0:10:52 | talked about and what they proposed would be similar |
---|
0:10:55 | except that our formulation is much simpler |
---|
0:10:58 | and it relies purely on something as simple as sorting the corpus |
---|
0:11:01 | and we only compute the differentials and again we don't need to store all of this in memory we |
---|
0:11:06 | only need to |
---|
0:11:08 | um |
---|
0:11:08 | keep just the differentials in memory so you don't have to |
---|
0:11:11 | store all of the n-gram features |
---|
0:11:14 | and given that this is a simpler formulation |
---|
0:11:16 | you can think of uh sharing information across multiple sets of n-gram features like the class prediction model that |
---|
0:11:23 | we have where we have two exponential models in model M so |
---|
0:11:26 | um these are the sort of key differences between the |
---|
0:11:29 | prior art that's been proposed |
---|
0:11:31 | um now how do you do this practically so far we talked about an efficient computation |
---|
0:11:37 | now we wanna see how you can parallelise |
---|
0:11:40 | um |
---|
0:11:40 | the norm sum computation so you can do it in two ways |
---|
0:11:44 | one way is you take your target vocabulary and split it across machines or one where you take your actual |
---|
0:11:50 | corpus |
---|
0:11:51 | each machine computes over all your target vocabulary but only on a limited portion of the corpus |
---|
0:11:57 | so let's take the first case where it's the target vocabulary that's split |
---|
0:12:01 | then the norm sum can be computed by summing the normalization terms |
---|
0:12:06 | over each vocabulary partition |
---|
0:12:08 | so what happens here is the expectation computation also gets split perfectly by splitting the target words at each |
---|
0:12:15 | machine |
---|
0:12:16 | um what you would need to do is you would need to manage all these partial normalisation terms |
---|
0:12:22 | at the end of one computation and broadcast them back |
---|
0:12:25 | so each one of the uh slave nodes can do the subsequent computation |
---|
0:12:31 | so you would need to merge the norm sums in this case now if you go to the case |
---|
0:12:34 | where you split the corpus |
---|
0:12:36 | you're splitting parts of the corpus across different machines |
---|
0:12:40 | so while you can do the norm sum computation on a single machine and no merge is required |
---|
0:12:45 | for the expectation parts you would need to make sure that they have the right normalisation terms so that |
---|
0:12:50 | way you have to have some sort of communications between slaves and masters |
---|
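A toy sketch of the first scheme, splitting the target vocabulary across workers and merging the partial normalisation sums on the master. The multiprocessing setup and names are illustrative assumptions; `partial_norm_sums` simply wraps the incremental `norm_sums` sketch from earlier, restricted to one vocabulary shard.

```python
from multiprocessing import Pool

def partial_norm_sums(args):
    """Worker: compute, for every history, the part of Z(x) contributed
    by this worker's slice of the target vocabulary."""
    sorted_histories, vocab_shard, feature_weights = args
    return norm_sums(sorted_histories, vocab_shard, feature_weights)

def parallel_norm_sums(sorted_histories, vocab, feature_weights, n_workers=4):
    """vocab: a list of target symbols, partitioned across the workers."""
    shards = [vocab[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partials = pool.map(
            partial_norm_sums,
            [(sorted_histories, shard, feature_weights) for shard in shards])
    # master: merge the partial normalisation terms; the merged Z(x) values
    # would then be broadcast back to the workers for the expectation step
    merged = {}
    for part in partials:
        for hist, z in part.items():
            merged[hist] = merged.get(hist, 0.0) + z
    return merged
```

Because the vocabulary shards partition the targets, the per-shard partial sums add up exactly to the full normalisation sum for each history.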
0:12:54 | but now you have a combination of the |
---|
0:12:56 | prior art the efficient computation and the parallelization we talked about |
---|
0:13:00 | so together they seem to work uh very well in reducing the memory and the uh computational time |
---|
0:13:07 | so in this table we show how if you were to do a direct implementation where you take no short |
---|
0:13:12 | cut |
---|
0:13:13 | um the computation time in seconds for a typical hub four corpus with one point five million words is going |
---|
0:13:19 | to be like of the order of four thousand seconds |
---|
0:13:21 | uh unigram caching is something that everybody does so that significantly brings the time down and we can reduce that |
---|
0:13:27 | even further with our proposed approach which comes down to just |
---|
0:13:31 | seconds |
---|
0:13:32 | so this is a neat way to do the exponential model combination computation |
---|
0:13:37 | um |
---|
0:13:38 | the results i show you now are on the english hub four system |
---|
0:13:42 | um this is uh |
---|
0:13:44 | uh |
---|
0:13:45 | the training part is about three hundred million words with an eighty K vocabulary |
---|
0:13:49 | and the number of features that are active for the class portion of the exponential model |
---|
0:13:53 | and for the word portion of the model are roughly two hundred |
---|
0:13:57 | thirty eight million and two seventy six million |
---|
0:13:59 | um if you take the comparison with the arabic gale system which we also have results on that |
---|
0:14:04 | is a one point six billion words |
---|
0:14:06 | and you see |
---|
0:14:19 | and you see how many more histories are active for the class and the |
---|
0:14:24 | uh word based models |
---|
0:14:26 | so here we give you an idea of the memory required during training |
---|
0:14:31 | you see on all the four models the class and the word for english as well as arabic |
---|
0:14:36 | yeah |
---|
0:14:37 | uh the X axis talks about how much memory per node is being used |
---|
0:14:41 | and the Y axis is basically what happens as the number of nodes |
---|
0:14:45 | increases as you parallelise further |
---|
0:14:47 | so as expected you see a nice drop in the amount of memory consumed |
---|
0:14:52 | and what this means is you can train something with billions of words |
---|
0:14:56 | in |
---|
0:14:56 | a day as opposed to where earlier it was taking like weeks to do this |
---|
0:15:01 | um |
---|
0:15:02 | these are just the memory numbers right now and if you talk about training time or computational time that's also significantly |
---|
0:15:08 | reduced |
---|
0:15:09 | you go down from a single machine case of something like two hundred and fifty hours |
---|
0:15:13 | down to when you parallelise it with just twenty machines you're down to like about fifty hours so that's |
---|
0:15:18 | a nice uh improvement we get in training time as well as in memory |
---|
0:15:24 | um so there is also related work on parallelisation and there are many variants that are being pursued |
---|
0:15:29 | in the literature |
---|
0:15:30 | originally in ninety six it was introduced as a uh theoretical result on how you can parallelise n |
---|
0:15:36 | grams |
---|
0:15:37 | uh distributed um language model computation has since then been uh pursued uh by uh both um |
---|
0:15:43 | of than brian |
---|
0:15:44 | a from two thousand seven |
---|
0:15:46 | and further authors have investigated how you can do maximum entropy based models in a distributed environment as |
---|
0:15:52 | well |
---|
0:15:53 | so this work is uh along the lines of these two and we've shown how you can efficiently implement this |
---|
0:15:59 | computation |
---|
0:16:01 | uh just some notes on computational cost |
---|
0:16:03 | uh we've used uh unnormalized iterative scaling here and you can see details in the two thousand nine paper |
---|
0:16:09 | uh but there are other choices you can use such as conjugate gradient L-BFGS and R |
---|
0:16:14 | prop and we didn't compare exactly how our parameter estimation goes against these but |
---|
0:16:19 | you can |
---|
0:16:20 | uh view you know one of these as additional uh improvements that you could potentially get |
---|
0:16:25 | um now the timing results are also based on a strict convergence criterion |
---|
0:16:30 | we used a very tight convergence criterion and you could probably relax this a little depending on the task and |
---|
0:16:35 | get further improvement |
---|
0:16:37 | uh the computational cost is both a function of the number of unique features and unique histories |
---|
0:16:42 | and in a typical case we process like one point two million unique n-grams per second with five compute |
---|
0:16:48 | nodes |
---|
0:16:48 | but this can vary drastically across whether it's a broadcast news task or a conversational task and so on |
---|
0:16:55 | so again the relative gain you get will depend on the task |
---|
0:16:58 | but basically if you don't do this the |
---|
0:17:01 | possibility of scaling to a large corpus becomes almost uh in |
---|
0:17:05 | feasible |
---|
0:17:06 | so to conclude we have an efficient training algorithm for exponential n-gram models which we believe we can extend to |
---|
0:17:13 | other types of features thrown in in the model M framework |
---|
0:17:16 | and in addition on top of this if you were to parallelise it either as a target vocabulary subset |
---|
0:17:23 | or as the data subset |
---|
0:17:25 | then it can lead to efficient use of memory as well as better computational cost |
---|
0:17:30 | and if you were to just split across the data you can increase that a lot more than you can |
---|
0:17:34 | if you were to just |
---|
0:17:36 | split the target vocabulary |
---|
0:17:38 | and that's all |
---|
0:17:46 | questions any questions from |
---|
0:17:58 | a simple question |
---|
0:17:59 | which you not related to to but to to work some more and more |
---|
0:18:06 | my or to use |
---|
0:18:08 | to to model |
---|
0:18:11 | uh yes the results i presented here were uh without any pruning |
---|
0:18:15 | uh but we have numbers with |
---|
0:18:17 | pruning feature pruning and we find that we take uh |
---|
0:18:22 | we absolutely don't take much of a |
---|
0:18:24 | hit unless you go to pruning the features like below fifty |
---|
0:18:27 | is oh |
---|
0:18:29 | the a one or two old |
---|
0:18:32 | so well that's another paper but uh we have uh tried a variety of pruning methods uh |
---|
0:18:39 | some of them which are simply based on counts a threshold on counts |
---|
0:18:43 | uh some of them uh |
---|
0:18:45 | basically extending n-gram pruning methodology to the exponential n-gram framework of model M and they all seem to work |
---|
0:18:52 | in a similar way |
---|
0:18:53 | so we have another forthcoming paper |
---|
0:19:01 | a for abortion |
---|