0:00:15 | can you hear me |
---|
0:00:18 | and then um |
---|
0:00:19 | a much |
---|
0:00:20 | a where that up of and and |
---|
0:00:22 | it |
---|
0:00:22 | but its right |
---|
0:00:24 | so i'll try to |
---|
0:00:27 | trying to explain to you |
---|
0:00:28 | we went to a frame |
---|
0:00:31 | large |
---|
0:00:33 | um |
---|
0:00:34 | this is my |
---|
0:00:36 | i |
---|
0:00:38 | so better |
---|
0:00:40 | uh so this is the outline of the talk i'll begin with an introduction and a motivation as to why |
---|
0:00:45 | a model like model M is relevant |
---|
0:00:47 | and then move on to the two main components which make uh the computation of model M uh extremely computationally |
---|
0:00:55 | expensive and memory hard |
---|
0:00:57 | and go on to some of the ideas we had for distributed training and how much gain we got in |
---|
0:01:02 | memory and |
---|
0:01:03 | uh speed-up by using these ideas |
---|
0:01:06 | so just to give a very brief overview uh everyone knows what exponential n-gram models are and their power |
---|
0:01:13 | and their ability to capture really complex dependencies |
---|
0:01:17 | um these were first introduced back in ninety six and subsequently again in two thousand two |
---|
0:01:22 | um |
---|
0:01:23 | so the definition really goes as |
---|
0:01:25 | given a set of target symbols Y and inputs X |
---|
0:01:29 | you can formulate the exponential language model in the following way |
---|
0:01:34 | where you see a numerator term which is the summation over all features which are active and the denominator |
---|
0:01:40 | term which we refer to as the normalisation sum computation |
---|
0:01:44 | which involves an additional summation over all the target outputs |
---|
0:01:48 | so X is the history and Y is the target |
---|
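For reference, the exponential n-gram language model described here can be written as follows. This is a reconstruction from the spoken description, using the standard maximum-entropy notation of feature functions f_i with weights lambda_i rather than the exact symbols on the slide:

```latex
P(y \mid x) = \frac{\exp\!\left(\sum_i \lambda_i f_i(x, y)\right)}{Z(x)},
\qquad
Z(x) = \sum_{y' \in Y} \exp\!\left(\sum_i \lambda_i f_i(x, y')\right)
```

The numerator sums over the features active for history x and target y, and Z(x), the normalisation sum, repeats that summation for every target symbol in Y.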
0:01:51 | and now you can imagine at a large scale how extremely computationally intensive this is gonna be |
---|
0:01:57 | so for the typical uh exponential n-gram model where n equals three you can think of three history trees the null |
---|
0:02:03 | tree the unigram tree and the bigram tree |
---|
0:02:06 | and uh parameter estimation in an exponential model basically has uh many objective functions which are variations of regularisations on the |
---|
0:02:15 | basic objective function |
---|
0:02:16 | and the one that we use is formulated in the following form |
---|
0:02:21 | uh with an L1 and L2 regularization component and for some choice of alpha and sigma which are |
---|
0:02:27 | determined empirically |
---|
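A common form of such an L1-plus-L2-regularised objective, with the hyperparameters alpha and sigma chosen empirically as mentioned above, is sketched below; this is the standard formulation for regularised exponential models and may differ in detail from the exact objective on the slide:

```latex
\mathcal{O}(\Lambda) = \sum_{(x,y) \in D} \log P_{\Lambda}(y \mid x)
\;-\; \alpha \sum_i |\lambda_i|
\;-\; \frac{1}{2\sigma^2} \sum_i \lambda_i^2
```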
0:02:29 | um |
---|
0:02:31 | for training these parameters there is a whole class of iterative update algorithms |
---|
0:02:36 | and basically it's a comparison of the expected feature counts to the observed counts |
---|
0:02:41 | and you update the parameters accordingly so there you see the expression for the expected feature count |
---|
0:02:47 | and you see a summation term over your training data D |
---|
0:02:52 | which can be represented as a sequence of histories and target pairs |
---|
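The expected and observed feature counts being compared can be written in the usual notation as follows (a sketch consistent with the description, not copied from the slide):

```latex
E_{P_\Lambda}[f_i] = \sum_{(x,y) \in D} \; \sum_{y' \in Y} P_\Lambda(y' \mid x)\, f_i(x, y'),
\qquad
\tilde{E}[f_i] = \sum_{(x,y) \in D} f_i(x, y)
```

The inner sum over all targets y' for every training event is what makes this expensive at scale.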
0:02:56 | um |
---|
0:02:57 | if you look at the reference that i have there from malouf in two thousand two |
---|
0:03:01 | you'll find an exhaustive comparison of various iterative update algorithms and their advantages and disadvantages and so on |
---|
0:03:08 | what we use internally is a generalized uh iterative scaling algorithm we've implemented |
---|
0:03:15 | so you heard about model M from our previous speaker |
---|
0:03:17 | uh introduced in two thousand and nine by chen |
---|
0:03:21 | um it basically has the following form which is think of it as a combination of two exponential n-gram models |
---|
0:03:28 | and what we've seen is it consistently outperforms any other language model and gives us one of the best |
---|
0:03:34 | results and we've been seeing five percent relative gains across |
---|
0:03:37 | a wide range of domains and tasks |
---|
0:03:39 | and |
---|
0:03:40 | large corpora as well |
---|
0:03:42 | um |
---|
0:03:43 | now let's talk about model M in this particular formulation where you have like a trigram |
---|
0:03:49 | along the lines of the exponential n-gram model with n equal to three |
---|
0:03:53 | so the key thing to note here is you have a class model and a word model both of which |
---|
0:03:57 | can be estimated independently |
---|
0:03:59 | where the class model has a history dependence on the previous classes |
---|
0:04:03 | as well as the words |
---|
0:04:05 | the word model has this little term C J sitting here |
---|
0:04:09 | which is the predicted class of the word |
---|
0:04:11 | of the current word w_j that you're trying to predict |
---|
0:04:15 | so the two models can be trained independently of each other |
---|
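For n = 3, the two factors being described, the class model and the word model, are usually written roughly as below, where c_j is the class of word w_j. This is a reconstruction of the standard model M decomposition, with each factor an exponential n-gram model:

```latex
P(w_j \mid \text{history}) \approx
P_{\text{class}}\big(c_j \mid c_{j-2}, c_{j-1}, w_{j-2}, w_{j-1}\big)\cdot
P_{\text{word}}\big(w_j \mid w_{j-2}, w_{j-1}, c_j\big)
```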
0:04:18 | and we know that model M as we've shown has significant |
---|
0:04:23 | performance improvement |
---|
0:04:24 | but then we know it comes at a steep computational cost |
---|
0:04:27 | so typically something like thirty hours of computation is required for a typical training dataset of a hundred million |
---|
0:04:35 | words |
---|
0:04:36 | and it consumes memory of like ten gigabytes |
---|
0:04:39 | so this is something that we want to reduce if you want to scale to billions of words as |
---|
0:04:43 | is common in many of the tasks such as voice search or you could think of any kind of |
---|
0:04:48 | large-scale broadcast news type task |
---|
0:04:52 | so there are two parts which we worked on in this approach to improving the parameter estimation and fast |
---|
0:04:59 | computation |
---|
0:05:00 | the first one i want to talk about is the normalisation sum and then i'll come to the uh feature |
---|
0:05:05 | computation |
---|
0:05:06 | um before we go to the parallelization we'll talk about the efficient computation of these two components |
---|
0:05:12 | uh first consider the case where |
---|
0:05:14 | uh you remember from the previous slides the formulation of the norm sum computation |
---|
0:05:20 | consider two event histories X1 and X2 |
---|
0:05:23 | and if you were to compute a differential of those two histories the expression would look something like |
---|
0:05:28 | this |
---|
0:05:29 | and here the alpha term is just a summation taken over all of the features that are actually |
---|
0:05:34 | active |
---|
0:05:35 | now think of any infrequent word |
---|
0:05:38 | the uh only case where a feature is going to be really active is the unigram feature |
---|
0:05:43 | case so for most cases you'll find alpha of X1 equal to alpha of X2 |
---|
0:05:47 | if you could efficiently order all your histories in a manner |
---|
0:05:52 | in which you're reduced to small differential computations then the whole process becomes very efficient |
---|
0:05:59 | and so the key idea here is to sort the corpus |
---|
0:06:03 | in a lexicographical way |
---|
0:06:05 | such that consecutive histories differ only in a small number of words |
---|
0:06:10 | and what this does for you is it reduces the number of target words over which you have to sum in |
---|
0:06:16 | the final |
---|
0:06:17 | normalisation sum computation |
---|
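A minimal sketch of this sorting step, assuming event histories are represented as word tuples (the function and variable names here are illustrative, not from the talk):

```python
# Sort event histories lexicographically so that consecutive histories
# differ only in a small number of words; Python's tuple comparison is
# already lexicographic, position one first, then position two, and so on.
def sort_histories(histories):
    """histories: an iterable of word tuples, e.g. ("the", "cat")."""
    return sorted(set(histories))

corpus_histories = [("the", "dog"), ("a", "cat"), ("the", "cat"), ("a", "cat")]
for h in sort_histories(corpus_histories):
    print(h)
# ('a', 'cat'), ('the', 'cat'), ('the', 'dog') -- neighbouring histories
# share words, so moving from one to the next changes only a few features.
```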
0:06:21 | so what is this lexicographic |
---|
0:06:23 | sorting all we're doing is to make sure |
---|
0:06:26 | same words in position one are occurring first and then the same words in position two are occurring next |
---|
0:06:32 | and it's sort of like a nested history tree clustering |
---|
0:06:35 | so you have your bigram unigram and empty history and you can visualise the whole thing as a prefix |
---|
0:06:41 | tree |
---|
0:06:42 | where you have the null history tree at the root node |
---|
0:06:44 | and then all of the sorted event histories are children of that node |
---|
0:06:50 | each |
---|
0:06:51 | of them uh is a unique event and you can tie them all up at the root node |
---|
0:06:55 | and |
---|
0:06:56 | when you do this you're basically assigning if you will unique IDs to these event histories |
---|
0:07:03 | so the lower order ones get lower IDs and the higher order ones get higher |
---|
0:07:07 | IDs |
---|
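The ID assignment described here could look like the following sketch (a hypothetical helper, assuming that lower-order histories should simply receive smaller integers than their higher-order extensions):

```python
def assign_history_ids(histories):
    """Give every event history a unique integer ID such that lower-order
    histories (shorter tuples, closer to the root of the prefix tree)
    get lower IDs than higher-order ones."""
    ordered = sorted(set(histories), key=lambda h: (len(h), h))
    ids = {(): 0}                      # the null history sits at the root
    for h in ordered:
        ids.setdefault(h, len(ids))
    return ids
```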
0:07:08 | so when you do your differential computation that i talked about uh you're only going to sum the |
---|
0:07:14 | differentials over a small number of them so your computation cost comes down and later on we'll talk about the |
---|
0:07:19 | memory as well |
---|
0:07:21 | um this is basically what i just covered of how the differential can be computed efficiently |
---|
0:07:26 | um so once you have the event histories sorted you have one more step which is to just sum over all |
---|
0:07:32 | the events |
---|
0:07:33 | now you can do the same lexicographic sorting at the n-gram event level as well |
---|
0:07:39 | and when you do that the only two items you keep track of are the alpha value |
---|
0:07:44 | and the norm sum value which is the summation of alpha over all the target symbols you're interested in |
---|
0:07:52 | and when you move from one event history to the next |
---|
0:07:56 | you only want to identify the feature histories that differ |
---|
0:07:59 | that have actually changed and this comes down to our sorting again because then you know that all the |
---|
0:08:05 | lower order histories have lower unique IDs and you're really only going to |
---|
0:08:09 | difference the higher order terms |
---|
0:08:11 | and so this differential becomes zero for all the lower order histories and exists only for the higher order |
---|
0:08:17 | ones |
---|
0:08:17 | so it's just an efficient way of storing your histories and your target symbols |
---|
0:08:22 | such that your denominator term in an exponential model can be computed efficiently |
---|
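Putting the pieces together, the incremental norm-sum computation could be sketched roughly as below. This is a simplified illustration of the differential idea under two assumptions that are mine, not the paper's: histories are stored most-recent-word-first so that every n-gram feature is a (history prefix, target) pair, and `feature_weights` maps each history prefix to a dict of target-to-lambda values.

```python
import math
from collections import defaultdict

def norm_sums(sorted_histories, vocab, feature_weights):
    """Compute the normalisation sum Z(x) for every event history x.

    Only the two running quantities mentioned in the talk are kept:
    the per-target alpha values and the current norm sum Z.  Targets not
    in `vocab` are ignored, which also lets the vocabulary be split
    across machines later on.
    """
    alpha = dict.fromkeys(vocab, 0.0)
    for y, lam in feature_weights.get((), {}).items():   # unigram features
        if y in alpha:
            alpha[y] += lam
    z_val = sum(math.exp(a) for a in alpha.values())      # Z of the null history
    prev, result = (), {}
    for hist in sorted_histories:
        # length of the prefix shared with the previous history
        common = 0
        while common < min(len(prev), len(hist)) and prev[common] == hist[common]:
            common += 1
        # the differential only involves the changed, higher-order prefixes
        delta = defaultdict(float)
        for k in range(common + 1, len(prev) + 1):
            for y, lam in feature_weights.get(prev[:k], {}).items():
                if y in alpha:
                    delta[y] -= lam
        for k in range(common + 1, len(hist) + 1):
            for y, lam in feature_weights.get(hist[:k], {}).items():
                if y in alpha:
                    delta[y] += lam
        # update alpha and Z only for the targets whose features changed
        for y, d in delta.items():
            z_val -= math.exp(alpha[y])
            alpha[y] += d
            z_val += math.exp(alpha[y])
        result[hist] = z_val
        prev = hist
    return result
```

Only the features attached to the prefixes that changed between consecutive sorted histories are touched, which is exactly why the lexicographic sort matters.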
0:08:27 | now once you do this this sort of spills over to the expectation uh computation of the features as well |
---|
0:08:33 | because you can view that numerator term um as just this alpha divided by Z |
---|
0:08:39 | and you want to compute Z very efficiently as we did so far |
---|
0:08:43 | so if you want to consider the case of the numerator term for the expectation computation |
---|
0:08:49 | where you have a sequence of histories d1 d2 |
---|
0:08:53 | and you wrote the whole expression out this way |
---|
0:08:56 | if you have |
---|
0:08:57 | an element in the event history which can be shared for the same target Y |
---|
0:09:02 | then this expression reduces to the alpha term coming out and just the second term remaining |
---|
0:09:08 | and what this means is this becomes independent of Y |
---|
0:09:12 | and you can share this across multiple targets um |
---|
0:09:17 | so think of the event histories uh |
---|
0:09:20 | that you did earlier on in the norm sum computation |
---|
0:09:23 | such that only a few of the alpha terms change and because of that you got some good |
---|
0:09:29 | advantages in the computation and it all just spills over to the expectation computation |
---|
0:09:34 | where it just reduces to a multiplication if you will of your running total that you kept so far of |
---|
0:09:39 | the norm sum you multiply it by the alpha term which also you computed only for |
---|
0:09:45 | those event histories that changed |
---|
0:09:48 | um |
---|
0:09:49 | so that's what we point out here that it becomes a very efficient computation of the uh numerator term |
---|
0:09:55 | so once we have both the numerator and the denominator the parameter updates in an exponential uh language model |
---|
0:10:01 | become very uh efficient |
---|
0:10:03 | and you have well known algorithms that exist to do this kind of iterative scaling |
---|
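For orientation, the textbook generalised iterative scaling update has the form below, where C is (an upper bound on) the total feature count of any event; the talk's implementation uses an unnormalised iterative-scaling variant (see the computational-cost notes near the end), so take this only as the generic shape of such updates:

```latex
\lambda_i \leftarrow \lambda_i + \frac{1}{C}\,
\log \frac{\tilde{E}[f_i]}{E_{P_\Lambda}[f_i]}
```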
0:10:08 | um |
---|
0:10:10 | at this point i wanna talk a little bit about the prior art here |
---|
0:10:13 | this sort of uh |
---|
0:10:15 | lexicographic sorting was introduced by lafferty in ninety five |
---|
0:10:18 | and it's one of the bases that we use for this efficient uh computation of both the norm sum and expected |
---|
0:10:25 | feature computations |
---|
0:10:26 | and uh |
---|
0:10:28 | wu and khudanpur in two thousand two and two thousand |
---|
0:10:31 | used a similar prefix tree traversal type algorithm |
---|
0:10:36 | but the difference between their work and our work is they sort of used a propagation if you will of |
---|
0:10:42 | norm sum terms |
---|
0:10:43 | from lower order to higher order n-grams |
---|
0:10:45 | and the expectations from higher order to lower order n-grams so computational complexity wise these two algorithms the one we |
---|
0:10:52 | talked about and what they proposed would be similar |
---|
0:10:55 | except that our formulation is much simpler |
---|
0:10:58 | and it relies purely on something as simple as sorting the corpus |
---|
0:11:01 | and we only compute the differentials and again we don't need to store all of this in memory we |
---|
0:11:06 | only need to |
---|
0:11:08 | um |
---|
0:11:08 | keep just the differentials in memory so you don't have to |
---|
0:11:11 | store all of the n-gram features |
---|
0:11:14 | and given that this is a simpler formulation |
---|
0:11:16 | you can think of uh sharing information across multiple sets of n-gram features like the class prediction model that |
---|
0:11:23 | we have where we have two exponential models in model M so |
---|
0:11:26 | um these are the sort of key differences between the |
---|
0:11:29 | prior art that's been proposed |
---|
0:11:31 | um now how do you do this practically so far we talked about an efficient computation |
---|
0:11:37 | now we wanna see how you can parallelise |
---|
0:11:40 | um |
---|
0:11:40 | the norm sum computation so you can do it in two ways |
---|
0:11:44 | one way is you take your target vocabulary and split it across machines or one where you take your actual |
---|
0:11:50 | corpus |
---|
0:11:51 | each machine computes over all your target vocabulary but only on a limited portion of the corpus |
---|
0:11:57 | so let's take the first case where it's the target vocabulary that's split |
---|
0:12:01 | then the norm sum can be computed by summing the normalization terms |
---|
0:12:06 | over each vocabulary partition |
---|
0:12:08 | so what happens here is the expectation computation also gets split perfectly by splitting the target words at each |
---|
0:12:15 | machine |
---|
0:12:16 | um what you would need to do is you would need to manage all these partial normalisation terms |
---|
0:12:22 | at the end of one computation and broadcast them back |
---|
0:12:25 | so each one of the uh slave nodes can do the subsequent computation |
---|
0:12:31 | so you would need to merge the norm sums in this case now if you go to the case |
---|
0:12:34 | where you split the corpus |
---|
0:12:36 | you're splitting parts of the corpus across different machines |
---|
0:12:40 | so while you can do the norm sum computation on a single machine and no merge is required |
---|
0:12:45 | for the expectation parts you would need to make sure that they have the right normalisation terms so that |
---|
0:12:50 | way you have to have some sort of communications between slaves and masters |
---|
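A toy sketch of the first scheme, splitting the target vocabulary across workers and merging the partial normalisation sums on the master. The multiprocessing setup and names are illustrative assumptions; `partial_norm_sums` simply wraps the incremental `norm_sums` sketch from earlier, restricted to one vocabulary shard.

```python
from multiprocessing import Pool

def partial_norm_sums(args):
    """Worker: compute, for every history, the part of Z(x) contributed
    by this worker's slice of the target vocabulary."""
    sorted_histories, vocab_shard, feature_weights = args
    return norm_sums(sorted_histories, vocab_shard, feature_weights)

def parallel_norm_sums(sorted_histories, vocab, feature_weights, n_workers=4):
    """vocab: a list of target symbols, partitioned across the workers."""
    shards = [vocab[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partials = pool.map(
            partial_norm_sums,
            [(sorted_histories, shard, feature_weights) for shard in shards])
    # master: merge the partial normalisation terms; the merged Z(x) values
    # would then be broadcast back to the workers for the expectation step
    merged = {}
    for part in partials:
        for hist, z in part.items():
            merged[hist] = merged.get(hist, 0.0) + z
    return merged
```

Because the vocabulary shards partition the targets, the per-shard partial sums add up exactly to the full normalisation sum for each history.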
0:12:54 | but now you have a combination of the |
---|
0:12:56 | prior art the efficient computation and the parallelization we talked about |
---|
0:13:00 | so together they seem to work uh very well in reducing the memory and the uh computational time |
---|
0:13:07 | so in this table we show how if you were to do a direct implementation where you take no short |
---|
0:13:12 | cut |
---|
0:13:13 | um the computation time in seconds for a typical hub four corpus with one point five million words is going |
---|
0:13:19 | to be like of the order of four thousand seconds |
---|
0:13:21 | uh unigram caching is something that everybody does so that significantly brings the time down and we can reduce that |
---|
0:13:27 | even further with our proposed approach which comes down to just |
---|
0:13:31 | seconds |
---|
0:13:32 | so this is a neat way to do the exponential model combination computation |
---|
0:13:37 | um |
---|
0:13:38 | the results i show you now are on the english hub four system |
---|
0:13:42 | um this is uh |
---|
0:13:44 | uh |
---|
0:13:45 | the training part is about three hundred million words with an eighty K vocabulary |
---|
0:13:49 | and the number of features that are active for the class portion of the exponential model |
---|
0:13:53 | and for the word portion of the model are roughly two hundred |
---|
0:13:57 | thirty eight million and two seventy six million |
---|
0:13:59 | um if you take the comparison with the arabic gale system which we also have results on that |
---|
0:14:04 | is a one point six billion words |
---|
0:14:06 | and you see |
---|
0:14:19 | and you see how many more histories are active for the class and the |
---|
0:14:24 | uh word based models |
---|
0:14:26 | so here we give you an idea of the memory required during training |
---|
0:14:31 | you see on all the four models the class and the word for english as well as arabic |
---|
0:14:36 | yeah |
---|
0:14:37 | uh the X axis talks about how much memory per node is being used |
---|
0:14:41 | and the Y axis is basically what happens as the number of nodes |
---|
0:14:45 | increases as you parallelise further |
---|
0:14:47 | so as expected you see a nice drop in the amount of memory consumed |
---|
0:14:52 | and what this means is you can train something with billions of words |
---|
0:14:56 | in |
---|
0:14:56 | a day as opposed to where earlier it was taking like weeks to do this |
---|
0:15:01 | um |
---|
0:15:02 | these are just the memory numbers right now and if you talk about training time or computational time that's also significantly |
---|
0:15:08 | reduced |
---|
0:15:09 | you go down from a single machine case of something like two hundred and fifty hours |
---|
0:15:13 | down to when you parallelise it with just twenty machines you're down to like about fifty hours so that's |
---|
0:15:18 | a nice uh improvement we get in training time as well as in memory |
---|
0:15:24 | um so there is also related work on parallelisation and there are many variants that are being pursued |
---|
0:15:29 | in the literature |
---|
0:15:30 | originally in ninety six it was introduced as a uh theoretical result on how you can parallelise n |
---|
0:15:36 | grams |
---|
0:15:37 | uh distributed um language model computation has since then been uh pursued uh by uh both um |
---|
0:15:43 | of than brian |
---|
0:15:44 | a from two thousand seven |
---|
0:15:46 | and further authors have investigated how you can do maximum entropy based models in a distributed environment as |
---|
0:15:52 | well |
---|
0:15:53 | so this work is uh along the lines of these two and we've shown how you can efficiently implement this |
---|
0:15:59 | computation |
---|
0:16:01 | uh just some notes on computational cost |
---|
0:16:03 | uh we've used uh unnormalized iterative scaling here and you can see details in the two thousand nine paper |
---|
0:16:09 | uh but there are other choices you can use such as conjugate gradient L-BFGS and R |
---|
0:16:14 | prop and we didn't compare exactly how our parameter estimation goes against these but |
---|
0:16:19 | you can |
---|
0:16:20 | uh view you know one of these as additional uh improvements that you could potentially get |
---|
0:16:25 | um now the timing results are also based on a strict convergence criterion |
---|
0:16:30 | we used a very tight convergence criterion and you could probably relax this a little depending on the task and |
---|
0:16:35 | get further improvement |
---|
0:16:37 | uh the computational cost is both a function of the number of unique features and unique histories |
---|
0:16:42 | and in a typical case we process like one point two million unique n-grams per second with five compute |
---|
0:16:48 | nodes |
---|
0:16:48 | but this can vary drastically across whether it's a broadcast news task or a conversational task and so on |
---|
0:16:55 | so again the relative gain you get will depend on the task |
---|
0:16:58 | but basically if you don't do this the |
---|
0:17:01 | possibility of scaling to a large corpus becomes almost uh in |
---|
0:17:05 | feasible |
---|
0:17:06 | so to conclude we have an efficient training algorithm for exponential n-gram models which we believe we can extend to |
---|
0:17:13 | other types of features thrown in in the model M framework |
---|
0:17:16 | and in addition on top of this if you were to parallelise it either as a target vocabulary subset |
---|
0:17:23 | or as the data subset |
---|
0:17:25 | then it can lead to efficient use of memory as well as better computational cost |
---|
0:17:30 | and if you were to just split across the data you can increase that a lot more than you can |
---|
0:17:34 | if you were to just |
---|
0:17:36 | split the target vocabulary |
---|
0:17:38 | and that's all |
---|
0:17:46 | questions any questions from |
---|
0:17:58 | a simple question |
---|
0:17:59 | which you not related to to but to to work some more and more |
---|
0:18:06 | my or to use |
---|
0:18:08 | to to model |
---|
0:18:11 | uh yes the results i presented here were uh without any pruning |
---|
0:18:15 | uh but we have numbers with |
---|
0:18:17 | pruning feature pruning and we find that we take uh |
---|
0:18:22 | we absolutely don't take much of a |
---|
0:18:24 | hit unless you go to pruning the features like below fifty |
---|
0:18:27 | is oh |
---|
0:18:29 | the a one or two old |
---|
0:18:32 | so well that's another paper but uh we have uh tried a variety of pruning methods uh |
---|
0:18:39 | some of them which are simply based on counts a threshold on counts |
---|
0:18:43 | uh some of them uh |
---|
0:18:45 | basically extending n-gram pruning methodology to the exponential n-gram framework of model M and they all seem to work |
---|
0:18:52 | in a similar way |
---|
0:18:53 | so we have another forthcoming paper |
---|
0:19:01 | a for abortion |
---|