0:00:13 | So what I would like to talk about today is how we combine multiple binary classifiers to solve the multi-class problem.
0:00:25 | The technique that we are going to use is geometric programming, so we can solve the problem using one of the convex optimization solvers.
0:00:40 | I would like to start with multi-class learning. There are two different approaches to solving multi-class problems: either a direct method, or we can reduce the multi-class problem into multiple binary problems.
0:00:59 | In the latter case we need to combine the multiple binary answers to determine the final answer to the multi-class problem.
0:01:12 | I formulate this aggregation problem as a geometric program, so that we can always find a global solution.
0:01:20 | So I will introduce the geometric programming formulation, then the softmax model and L1-norm regularized maximum likelihood estimation, then show some of the numerical experiments, and then close.
0:01:37 | In a multi-class problem we need to assign a class label from 1 to K to each data point.
0:01:48 | With a direct method, suppose we have three classes: the direct method tries to find a separating hyperplane which discriminates the three classes. In general this separating hyperplane is not linear.
0:02:04 | On the other hand, for a binary decomposition method, for example all-pairs, we look at the pairs (1,2), (2,3), and (1,3); for each binary pair problem we can always find a binary classifier.
0:02:23 | The remaining problem is how we aggregate the solutions of the binary problems in order to determine the final answer to the multi-class problem.
0:02:33 | So what are the advantages of binary decomposition over a direct method? It is easier and simpler to learn the classifiers; a lot of sophisticated classifiers are already available for binary problems, for example support vector machines; and it is better suited to parallel computation.
0:02:56 | These three are well-known examples of binary decompositions, and we can also treat binary decomposition as a binary encoding problem. In other words, how we aggregate the multiple binary answers is a binary decoding problem.
0:03:16 | For example, one-versus-all: say we have three classes. The first binary classifier discriminates the first class from the remaining classes, the second binary classifier discriminates the second class from the remaining classes, and so on. In all-pairs, we look at the pairs (1,2), (2,3), and (1,3).
0:03:44 | Error-correcting output coding can also be used. There we choose code words with maximal Hamming distance between them; in other words, solutions with a tolerable number of errors can still be correctly classified.
0:04:04 | In other words, a binary decomposition leads to a code matrix. In this case we have three classes and three binary classifiers, and these are exemplary code matrices for one-versus-all, all-pairs, and error-correcting output coding.
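A minimal sketch of the code matrices just described, for K = 3 classes, with rows indexing classes and columns indexing binary classifiers. The {0,1} encoding and the use of -1 as a don't-care marker for all-pairs are assumed conventions, not taken from the talk.

```python
import numpy as np

K = 3

# One-versus-all: classifier m separates class m from the remaining classes.
M_ova = np.eye(K, dtype=int)          # rows: classes, columns: classifiers

# All-pairs: one classifier per pair (1,2), (1,3), (2,3); a class outside
# the pair gets a don't-care entry, marked here as -1 (an assumed convention).
M_ap = np.array([[ 1,  1, -1],
                 [ 0, -1,  1],
                 [-1,  0,  0]])

print(M_ova)   # [[1 0 0], [0 1 0], [0 0 1]]
print(M_ap)
```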
0:04:24 | So we need to train the three different binary classifiers, each following its own column of the code matrix.
0:04:37 | "'kay" so for example or the case of the one versus a and then D is the code matrix produced |
---|
0:04:42 | by the one versus soul |
---|
0:04:44 | and is that a case |
---|
0:04:46 | and we need to oh |
---|
0:04:47 | train the three different pine binary classifiers |
---|
0:04:50 | and for example all X up i |
---|
0:04:53 | is that the data and the target labels is a two |
---|
0:04:57 | so it's such a case actually |
---|
0:04:59 | uh |
---|
0:05:00 | so the target label two |
---|
0:05:03 | and then |
---|
0:05:04 | in terms of the binary classifiers to actually the correct labels should be zero |
---|
0:05:09 | one and zero okay so the first the binary classifier a second binary and third binary classifiers |
---|
0:05:15 | so we i mean these uh binary classifier |
---|
0:05:19 | followed the |
---|
0:05:20 | at the binary label |
---|
0:05:22 | uh uh in this good matrix |
---|
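As a hypothetical illustration of this relabeling step, here is how the binary training labels (0, 1, 0) fall out of the one-versus-all code matrix for a point with multi-class label 2:

```python
import numpy as np

M_ova = np.eye(3, dtype=int)    # one-versus-all code matrix for 3 classes

y = 2                           # multi-class target label (1-indexed)
binary_labels = M_ova[y - 1]    # the row of the code matrix for class y
print(binary_labels)            # -> [0 1 0]: labels for classifiers 1, 2, 3
```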
0:05:26 | So we train the binary classifiers, and each binary classifier produces a probability estimate. For example, we can use support vector machines with a sigmoid model, so that each binary classifier produces a score between zero and one.
0:05:47 | So the problem is this: we have trained the three binary classifiers, and each produces a score between zero and one. To answer the multi-class problem, we have to combine the answers determined by the three binary classifiers.
0:06:08 | So how do we aggregate the binary classifiers? Some well-known heuristics are majority voting for the case of all-pairs, and taking the maximum for the case of one-versus-all.
0:06:24 | In hard decoding, we find the code word which best matches the collection of predicted results computed by the binary classifiers. In the case of three classes, we have three code words and train three binary classifiers. Given a test data point, the three binary classifiers produce scores, and the collection of these three values constitutes a three-dimensional vector. We then search for the code word that best matches this three-dimensional prediction result in order to determine the final answer to the multi-class problem.
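A minimal sketch of this loss-based decoding under the one-versus-all code matrix; the squared-error loss is an illustrative choice, since the talk does not fix the loss function:

```python
import numpy as np

M = np.eye(3, dtype=int)                  # one-versus-all code matrix
scores = np.array([0.2, 0.7, 0.4])        # scores from the 3 binary classifiers

# Pick the class whose code word best matches the score vector.
losses = ((M - scores) ** 2).sum(axis=1)  # loss of each code word vs. the scores
y_hat = int(np.argmin(losses)) + 1        # predicted class (1-indexed)
print(y_hat)                              # -> 2
```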
0:07:07 | We can also do probabilistic decoding. In this case we really need to compute the class membership probabilities; once we have the class membership probabilities, we can make the prediction for the class.
0:07:25 | One of the popular approaches to probabilistic decoding is based on the Bradley-Terry model, so let me briefly explain what the Bradley-Terry model is doing in this case.
0:07:40 | Suppose again that we have three classes. The Bradley-Terry model has been used to relate the binary predictions to the class membership probabilities: we have three answers produced by the three binary classifiers, and we have to relate those answers to the class membership probabilities. In such a case we treat the class membership probabilities as parameters.
0:08:08 | In the case of the all-pairs binary decomposition, P_1^* is the class membership probability for the data point x^*.
0:08:27 | So this is the class membership probability, and this is the all-pairs result.
0:08:43 | The highlighted quantities are based on the Bradley-Terry model. We introduce parameters pi_1, pi_2, and pi_3; these relations come directly from the Bradley-Terry model, and r_j^* is the probability estimate determined by the binary classifiers.
0:09:06 | So in order to compute the class membership probabilities, we treat them as parameters, and we estimate these parameters by minimizing the KL divergence between the binary predictions and the quantities coming from the pi's.
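A reconstruction of the Bradley-Terry relations and the KL-divergence fit sketched here, in notation assumed for this write-up (the slide's exact symbols are not recoverable from the transcript): r_{ij} is the binary estimate for the pair (i, j), and the pi_k are the class membership probabilities treated as parameters.

```latex
\mu_{ij} = \frac{\pi_i}{\pi_i + \pi_j}, \qquad
\hat{\pi} = \arg\min_{\pi \ge 0,\ \sum_k \pi_k = 1}
\sum_{i<j} \left[ r_{ij} \log\frac{r_{ij}}{\mu_{ij}}
  + (1 - r_{ij}) \log\frac{1 - r_{ij}}{1 - \mu_{ij}} \right]
```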
0:09:29 | For the approaches that exploit this technique, the drawback is that the number of parameters grows with the number of training examples. So if we have a huge number of training examples, then we have a huge number of parameters that must be optimized.
0:10:00 | Some of the existing techniques are based on the Bradley-Terry model, and one of the recent techniques tries to find an optimal aggregation.
0:10:12 | Why is optimal aggregation good? Because some of the binary predictions are biased or unreliable, which can degrade the overall performance. If we can come up with weights which optimally aggregate the binary predictions, then we can avoid this problem.
0:10:35 | All of the techniques developed for optimal aggregation have been based on the Bradley-Terry model. The problem is that for these optimal probabilistic decoders using the Bradley-Terry model, the parameters are the aggregation weights and also the class membership probabilities, which grow with the number of examples. So many parameters must be optimized, and moreover the resulting problem is not convex, so a global solution is not guaranteed.
0:11:13 | What I would like to do here is formulate this problem as a convex optimization problem. In our aggregation model we do not use the Bradley-Terry model; instead we use a softmax model, which we also used in work presented last year.
0:11:42 | Let me introduce the softmax model. These are the M different binary classifiers, and our approach parameterizes the aggregation weights: each classifier is weighted by a different coefficient, w_1 through w_M. Our goal is to optimize these coefficients to produce the best combination of the binary predictions.
0:12:20 | The class membership probabilities follow a softmax function. In other words, the probability that y_i equals k, given the aggregation weights and the data point x_i, follows a softmax function whose exponent is the weighted sum of the discrepancies between the code words and the binary predictions.
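In symbols, with notation assumed for this write-up: M_{km} is the code-matrix entry for class k and classifier m, f_m(x_i) is the output of the m-th binary classifier, d(.,.) is the discrepancy, and w_m >= 0 are the aggregation weights, the only parameters of the model.

```latex
P(y_i = k \mid \mathbf{w}, x_i)
  = \frac{\exp\!\left(-\sum_{m=1}^{M} w_m\, d\!\left(M_{km}, f_m(x_i)\right)\right)}
         {\sum_{l=1}^{K} \exp\!\left(-\sum_{m=1}^{M} w_m\, d\!\left(M_{lm}, f_m(x_i)\right)\right)}
```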
0:12:50 | For example, we can use the cross-entropy as the discrepancy function. This is really a probabilistic extension of loss-based decoding, and in this way we have only the aggregation weights as parameters.
0:13:08 | Based on this model, we write the likelihood of the training data (the details can be found in the paper), and then we add an L1-norm regularizer. The negative log-likelihood plus the L1-norm regularizer gives us a log-sum-exponential function.
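A schematic form of that objective, under the notation above; writing D_{ik} = sum_m w_m d(M_{km}, f_m(x_i)) for the weighted discrepancy of example i against class k, the L1-regularized negative log-likelihood is a sum of log-sum-exp terms (the exact scaling and constants in the paper may differ):

```latex
J(\mathbf{w}) = \sum_{i=1}^{N}
  \left[ D_{i y_i} + \log \sum_{k=1}^{K} \exp\!\left(-D_{ik}\right) \right]
  + \lambda \sum_{m=1}^{M} w_m, \qquad w_m \ge 0
```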
0:13:39 | So our optimization is to minimize this log-sum-exponential function, subject to nonnegativity constraints on the coefficients. The log-sum-exponential function is convex, so we can solve this as a convex optimization problem.
0:13:56 | What we figured out about two years ago is that we can fit this into geometric programming. So here is a short introduction to geometric programming.
0:14:10 | This is the standard form of a geometric program: we minimize a posynomial. A posynomial is like a polynomial, but with the difference that the exponents are allowed to be real-valued, whereas in a polynomial the exponents must be nonnegative integers. So we minimize a posynomial subject to posynomial inequality constraints and monomial equality constraints.
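In the usual GP convention: a monomial is g(x) = c x_1^{a_1} ... x_n^{a_n} with c > 0 and real exponents a_i, a posynomial is a sum of monomials, and the standard form reads:

```latex
\begin{aligned}
\text{minimize}   \quad & f_0(x) \\
\text{subject to} \quad & f_i(x) \le 1, \quad i = 1, \dots, m \quad (f_i \ \text{posynomials}) \\
                        & g_j(x) = 1,  \quad j = 1, \dots, p \quad (g_j \ \text{monomials}) \\
                        & x_1, \dots, x_n > 0
\end{aligned}
```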
0:14:37 | A geometric program in posynomial form is not itself convex, but it can always be converted into a convex form.
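The conversion is the standard change of variables: setting x_i = e^{y_i} and taking logarithms turns every posynomial f(x) = sum_t c_t x^{a_t} into a log-sum-exp of affine functions of y, which is convex:

```latex
\log f\!\left(e^{y}\right)
  = \log \sum_{t} \exp\!\left(a_t^{\top} y + \log c_t\right)
```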
0:14:49 | This is our optimization problem, and we can write it as a geometric program in either convex or posynomial form.
0:15:03 | There are efficient solvers available, so we simply use those solvers to find the minimum of this objective function.
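As a toy illustration of handing a GP to an off-the-shelf solver (the talk does not name a specific solver; CVXPY's geometric-programming mode is used here as one possibility):

```python
import cvxpy as cp

x = cp.Variable(pos=True)   # GP variables must be positive
y = cp.Variable(pos=True)

objective = cp.Minimize(x + y)   # a posynomial objective
constraints = [x * y >= 4]       # a monomial lower-bound constraint
problem = cp.Problem(objective, constraints)
problem.solve(gp=True)           # solve as a geometric program

print(problem.value)             # -> 4.0, attained at x = y = 2
```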
0:15:17 | In the experiments we compared against some of the existing work: loss-based decoding, which is one of the hard decoding methods, and W-MAP, which is an optimal aggregation method based on the Bradley-Terry model.
0:15:36 | These are the datasets, taken from the UCI repository, together with the number of samples, the number of attributes, and the number of classes.
0:15:52 | We compared the classification performance for three different encoding techniques: all-pairs, one-versus-all, and error-correcting output coding. These are the results for loss-based decoding, these for W-MAP, and these are the results of our method.
0:16:11 | Across the experiments, our method performs better than these two existing methods. W-MAP is also an optimal aggregation method, but it involves a huge number of parameters, so the runtime of our method is much faster than the previous one.
0:16:36 | In our case the parameters are only the aggregation weights.
0:16:41 | In conclusion, we presented a convex optimization technique for the aggregation of binary classifiers to solve multi-class problems. We chose geometric programming because our objective function can be easily fit into the standard form of a geometric program. We compared the classification performance with some of the existing methods to show that the proposed method seems to work better than the existing ones. That concludes my talk.
0:17:33 | [Question] The fact that you have fewer parameters for your method: I presume that directly relates to your method being less likely to overfit. Is that the right way to think about it?
0:17:46 | [Answer] Right, yes. The previous one has a huge number of parameters, so it easily overfits, and that might be one of the reasons why our method performs better than some of the existing ones.
0:18:05 | [Question] Did you compare your results with direct multi-class classification? For example, you could have used multinomial logistic regression as the combined classifier, instead of comparing only against fusions of binary classifiers, for solving the multi-class problem.
0:18:22 | [Answer] Maybe we can compare, but I don't think we compared with multinomial logistic regression. Multinomial logistic regression is also convex, so you might be right. We didn't do it, but we will.
0:18:51 | [Answer] The number of features is the number of attributes in the data descriptions; it ranges from around ten to six hundred, depending on the data.
0:19:07 | No, we just used the whole set of features; this is just a matter of classifier performance, not feature extraction.
0:19:18 | All right, thank you.
---|