0:00:13 | So what I would like to talk about today is how we combine multiple binary classifiers to solve the multi-class problem.
0:00:25 | The technique that we are going to use is geometric programming, so we can solve the problem using one of the convex optimization solvers.
0:00:40 | I would like to start with multi-class learning. There are two different approaches to solving multi-class problems: either a direct method, or we can reduce the multi-class problem into multiple binary problems.
0:00:59 | In the latter case we need to combine the multiple binary answers to determine the final answer to the multi-class problem.
0:01:12 | I formulate this aggregation problem as a geometric program, so that we can always find a global solution.
0:01:20 | So I will introduce the geometric programming formulation, then the softmax model and L1-norm regularized maximum likelihood estimation, then show some of the numerical experiments, and then close.
0:01:37 | In a multi-class problem we need to assign a class label from 1 to K to each data point.
0:01:48 | With a direct method, suppose we have three classes: the direct method tries to find a separating hyperplane which discriminates the three classes. In general this separating hyperplane is not linear.
0:02:04 | On the other hand, for a binary decomposition method, for example all-pairs, we look at the pairs (1,2), (2,3), and (1,3); for each binary pair problem we can always find a binary classifier.
0:02:23 | The remaining problem is how we aggregate the solutions of the binary problems in order to determine the final answer to the multi-class problem.
0:02:33 | So what are the advantages of binary decomposition over a direct method? It is easier and simpler to learn the classifiers; a lot of sophisticated classifiers are already available for binary problems, for example support vector machines; and it is better suited to parallel computation.
0:02:56 | These three are well-known examples of binary decompositions, and we can also treat binary decomposition as a binary encoding problem. In other words, how we aggregate the multiple binary answers is a binary decoding problem.
0:03:16 | For example, one-versus-all: say we have three classes. The first binary classifier discriminates the first class from the remaining classes, the second binary classifier discriminates the second class from the remaining classes, and so on. In all-pairs, we look at the pairs (1,2), (2,3), and (1,3).
0:03:44 | Error-correcting output coding can also be used. There we choose code words with maximal Hamming distance between them; in other words, solutions with a tolerable number of errors can still be correctly classified.
0:04:04 | In other words, a binary decomposition leads to a code matrix. In this case we have three classes and three binary classifiers, and these are exemplary code matrices for one-versus-all, all-pairs, and error-correcting output coding.
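A minimal sketch of the code matrices just described, for K = 3 classes, with rows indexing classes and columns indexing binary classifiers. The {0,1} encoding and the use of -1 as a don't-care marker for all-pairs are assumed conventions, not taken from the talk.

```python
import numpy as np

K = 3

# One-versus-all: classifier m separates class m from the remaining classes.
M_ova = np.eye(K, dtype=int)          # rows: classes, columns: classifiers

# All-pairs: one classifier per pair (1,2), (1,3), (2,3); a class outside
# the pair gets a don't-care entry, marked here as -1 (an assumed convention).
M_ap = np.array([[ 1,  1, -1],
                 [ 0, -1,  1],
                 [-1,  0,  0]])

print(M_ova)   # [[1 0 0], [0 1 0], [0 0 1]]
print(M_ap)
```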
0:04:24 | So we need to train the three different binary classifiers, each following its own column of the code matrix.
0:04:37 | "'kay" so for example or the case of the one versus a and then D is the code matrix produced |
---|
0:04:42 | by the one versus soul |
---|
0:04:44 | and is that a case |
---|
0:04:46 | and we need to oh |
---|
0:04:47 | train the three different pine binary classifiers |
---|
0:04:50 | and for example all X up i |
---|
0:04:53 | is that the data and the target labels is a two |
---|
0:04:57 | so it's such a case actually |
---|
0:04:59 | uh |
---|
0:05:00 | so the target label two |
---|
0:05:03 | and then |
---|
0:05:04 | in terms of the binary classifiers to actually the correct labels should be zero |
---|
0:05:09 | one and zero okay so the first the binary classifier a second binary and third binary classifiers |
---|
0:05:15 | so we i mean these uh binary classifier |
---|
0:05:19 | followed the |
---|
0:05:20 | at the binary label |
---|
0:05:22 | uh uh in this good matrix |
---|
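As a hypothetical illustration of this relabeling step, here is how the binary training labels (0, 1, 0) fall out of the one-versus-all code matrix for a point with multi-class label 2:

```python
import numpy as np

M_ova = np.eye(3, dtype=int)    # one-versus-all code matrix for 3 classes

y = 2                           # multi-class target label (1-indexed)
binary_labels = M_ova[y - 1]    # the row of the code matrix for class y
print(binary_labels)            # -> [0 1 0]: labels for classifiers 1, 2, 3
```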
0:05:26 | So we train the binary classifiers, and each binary classifier produces a probability estimate. For example, we can use support vector machines with a sigmoid model, so that each binary classifier produces a score between zero and one.
0:05:47 | So the problem is this: we have trained the three binary classifiers, and each produces a score between zero and one. To answer the multi-class problem, we have to combine the answers determined by the three binary classifiers.
0:06:08 | So how do we aggregate the binary classifiers? Some well-known heuristics are majority voting for the case of all-pairs, and taking the maximum for the case of one-versus-all.
0:06:24 | In hard decoding, we find the code word which best matches the collection of predicted results computed by the binary classifiers. In the case of three classes, we have three code words and train three binary classifiers. Given a test data point, the three binary classifiers produce scores, and the collection of these three values constitutes a three-dimensional vector. We then search for the code word that best matches this three-dimensional prediction result in order to determine the final answer to the multi-class problem.
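A minimal sketch of this loss-based decoding under the one-versus-all code matrix; the squared-error loss is an illustrative choice, since the talk does not fix the loss function:

```python
import numpy as np

M = np.eye(3, dtype=int)                  # one-versus-all code matrix
scores = np.array([0.2, 0.7, 0.4])        # scores from the 3 binary classifiers

# Pick the class whose code word best matches the score vector.
losses = ((M - scores) ** 2).sum(axis=1)  # loss of each code word vs. the scores
y_hat = int(np.argmin(losses)) + 1        # predicted class (1-indexed)
print(y_hat)                              # -> 2
```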
0:07:07 | We can also do probabilistic decoding. In this case we really need to compute the class membership probabilities; once we have the class membership probabilities, we can make the prediction for the class.
0:07:25 | One of the popular approaches to probabilistic decoding is based on the Bradley-Terry model, so let me briefly explain what the Bradley-Terry model is doing in this case.
0:07:40 | Suppose again that we have three classes. The Bradley-Terry model has been used to relate the binary predictions to the class membership probabilities: we have three answers produced by the three binary classifiers, and we have to relate those answers to the class membership probabilities. In such a case we treat the class membership probabilities as parameters.
0:08:08 | In the case of the all-pairs binary decomposition, P_1^* is the class membership probability for the data point x^*.
0:08:27 | So this is the class membership probability, and this is the all-pairs result.
0:08:43 | The highlighted quantities are based on the Bradley-Terry model. We introduce parameters pi_1, pi_2, and pi_3; these relations come directly from the Bradley-Terry model, and r_j^* is the probability estimate determined by the binary classifiers.
0:09:06 | So in order to compute the class membership probabilities, we treat them as parameters, and we estimate these parameters by minimizing the KL divergence between the binary predictions and the quantities coming from the pi's.
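A reconstruction of the Bradley-Terry relations and the KL-divergence fit sketched here, in notation assumed for this write-up (the slide's exact symbols are not recoverable from the transcript): r_{ij} is the binary estimate for the pair (i, j), and the pi_k are the class membership probabilities treated as parameters.

```latex
\mu_{ij} = \frac{\pi_i}{\pi_i + \pi_j}, \qquad
\hat{\pi} = \arg\min_{\pi \ge 0,\ \sum_k \pi_k = 1}
\sum_{i<j} \left[ r_{ij} \log\frac{r_{ij}}{\mu_{ij}}
  + (1 - r_{ij}) \log\frac{1 - r_{ij}}{1 - \mu_{ij}} \right]
```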
0:09:29 | For the approaches that exploit this technique, the drawback is that the number of parameters grows with the number of training examples. So if we have a huge number of training examples, then we have a huge number of parameters that must be optimized.
0:10:00 | Some of the existing techniques are based on the Bradley-Terry model, and one of the recent techniques tries to find an optimal aggregation.
0:10:12 | Why is optimal aggregation good? Because some of the binary predictions are biased or unreliable, which can degrade the overall performance. If we can come up with weights which optimally aggregate the binary predictions, then we can avoid this problem.
0:10:35 | All of the techniques developed for optimal aggregation have been based on the Bradley-Terry model. The problem is that for these optimal probabilistic decoders using the Bradley-Terry model, the parameters are the aggregation weights and also the class membership probabilities, which grow with the number of examples. So many parameters must be optimized, and moreover the resulting problem is not convex, so a global solution is not guaranteed.
0:11:13 | What I would like to do here is formulate this problem as a convex optimization problem. In our aggregation model we do not use the Bradley-Terry model; instead we use a softmax model, which we also used in work presented last year.
0:11:42 | Let me introduce the softmax model. These are the M different binary classifiers, and our approach parameterizes the aggregation weights: each classifier is weighted by a different coefficient, w_1 through w_M. Our goal is to optimize these coefficients to produce the best combination of the binary predictions.
0:12:20 | The class membership probabilities follow a softmax function. In other words, the probability that y_i equals k, given the aggregation weights and the data point x_i, follows a softmax function whose exponent is the weighted sum of the discrepancies between the code words and the binary predictions.
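In symbols, with notation assumed for this write-up: M_{km} is the code-matrix entry for class k and classifier m, f_m(x_i) is the output of the m-th binary classifier, d(.,.) is the discrepancy, and w_m >= 0 are the aggregation weights, the only parameters of the model.

```latex
P(y_i = k \mid \mathbf{w}, x_i)
  = \frac{\exp\!\left(-\sum_{m=1}^{M} w_m\, d\!\left(M_{km}, f_m(x_i)\right)\right)}
         {\sum_{l=1}^{K} \exp\!\left(-\sum_{m=1}^{M} w_m\, d\!\left(M_{lm}, f_m(x_i)\right)\right)}
```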
0:12:50 | For example, we can use the cross-entropy as the discrepancy function. This is really a probabilistic extension of loss-based decoding, and in this way we have only the aggregation weights as parameters.
0:13:08 | Based on this model, we write the likelihood of the training data (the details can be found in the paper), and then we add an L1-norm regularizer. The negative log-likelihood plus the L1-norm regularizer gives us a log-sum-exponential function.
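A schematic form of that objective, under the notation above; writing D_{ik} = sum_m w_m d(M_{km}, f_m(x_i)) for the weighted discrepancy of example i against class k, the L1-regularized negative log-likelihood is a sum of log-sum-exp terms (the exact scaling and constants in the paper may differ):

```latex
J(\mathbf{w}) = \sum_{i=1}^{N}
  \left[ D_{i y_i} + \log \sum_{k=1}^{K} \exp\!\left(-D_{ik}\right) \right]
  + \lambda \sum_{m=1}^{M} w_m, \qquad w_m \ge 0
```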
0:13:39 | So our optimization is to minimize this log-sum-exponential function, subject to nonnegativity constraints on the coefficients. The log-sum-exponential function is convex, so we can solve this as a convex optimization problem.
0:13:56 | What we figured out about two years ago is that we can fit this into geometric programming. So here is a short introduction to geometric programming.
0:14:10 | This is the standard form of a geometric program: we minimize a posynomial. A posynomial is like a polynomial, but with the difference that the exponents are allowed to be real-valued, whereas in a polynomial the exponents must be nonnegative integers. So we minimize a posynomial subject to posynomial inequality constraints and monomial equality constraints.
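In the usual GP convention: a monomial is g(x) = c x_1^{a_1} ... x_n^{a_n} with c > 0 and real exponents a_i, a posynomial is a sum of monomials, and the standard form reads:

```latex
\begin{aligned}
\text{minimize}   \quad & f_0(x) \\
\text{subject to} \quad & f_i(x) \le 1, \quad i = 1, \dots, m \quad (f_i \ \text{posynomials}) \\
                        & g_j(x) = 1,  \quad j = 1, \dots, p \quad (g_j \ \text{monomials}) \\
                        & x_1, \dots, x_n > 0
\end{aligned}
```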
0:14:37 | A geometric program in posynomial form is not itself convex, but it can always be converted into a convex form.
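The conversion is the standard change of variables: setting x_i = e^{y_i} and taking logarithms turns every posynomial f(x) = sum_t c_t x^{a_t} into a log-sum-exp of affine functions of y, which is convex:

```latex
\log f\!\left(e^{y}\right)
  = \log \sum_{t} \exp\!\left(a_t^{\top} y + \log c_t\right)
```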
0:14:49 | This is our optimization problem, and we can write it as a geometric program in either convex or posynomial form.
0:15:03 | There are efficient solvers available, so we simply use those solvers to find the minimum of this objective function.
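As a toy illustration of handing a GP to an off-the-shelf solver (the talk does not name a specific solver; CVXPY's geometric-programming mode is used here as one possibility):

```python
import cvxpy as cp

x = cp.Variable(pos=True)   # GP variables must be positive
y = cp.Variable(pos=True)

objective = cp.Minimize(x + y)   # a posynomial objective
constraints = [x * y >= 4]       # a monomial lower-bound constraint
problem = cp.Problem(objective, constraints)
problem.solve(gp=True)           # solve as a geometric program

print(problem.value)             # -> 4.0, attained at x = y = 2
```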
0:15:17 | In the experiments we compared against some of the existing work: loss-based decoding, which is one of the hard decoding methods, and W-MAP, which is an optimal aggregation method based on the Bradley-Terry model.
0:15:36 | These are the datasets, taken from the UCI repository, together with the number of samples, the number of attributes, and the number of classes.
0:15:52 | We compared the classification performance for three different encoding techniques: all-pairs, one-versus-all, and error-correcting output coding. These are the results for loss-based decoding, these for W-MAP, and these are the results of our method.
0:16:11 | Across the experiments, our method performs better than these two existing methods. W-MAP is also an optimal aggregation method, but it involves a huge number of parameters, so the runtime of our method is much faster than the previous one.
0:16:36 | In our case the parameters are only the aggregation weights.
0:16:41 | In conclusion, we presented a convex optimization technique for the aggregation of binary classifiers to solve multi-class problems. We chose geometric programming because our objective function can be easily fit into the standard form of a geometric program. We compared the classification performance with some of the existing methods to show that the proposed method seems to work better than the existing ones. That concludes my talk.
0:17:33 | [Question] The fact that you have fewer parameters for your method: I presume that directly relates to your method being less likely to overfit. Is that the right way to think about it?
0:17:46 | [Answer] Right, yes. The previous one has a huge number of parameters, so it easily overfits, and that might be one of the reasons why our method performs better than some of the existing ones.
0:18:05 | [Question] Did you compare your results with direct multi-class classification? For example, you could have used multinomial logistic regression as the combined classifier, instead of comparing only against fusions of binary classifiers, for solving the multi-class problem.
0:18:22 | [Answer] Maybe we can compare, but I don't think we compared with multinomial logistic regression. Multinomial logistic regression is also convex, so you might be right. We didn't do it, but we will.
0:18:51 | [Answer] The number of features is the number of attributes in the data descriptions; it ranges from around ten to six hundred, depending on the data.
0:19:07 | No, we just used the whole set of features; this is just a matter of classifier performance, not feature extraction.
0:19:18 | All right, thank you.
---|