0:00:15 | The next talk is Variational Bayes logistic regression as regularized fusion for NIST SRE |
---|
0:00:22 | 2010 |
---|
0:00:40 | OK |
---|
0:00:42 | My name is Hautamäki, and |
---|
0:00:46 | the topic is fusion, so |
---|
0:00:51 | I think this is probably the only fusion talk this time for speaker recognition. |
---|
0:01:01 | This time, I have tried variational Bayes fusion on the NIST SRE evaluation corpora. |
---|
0:01:12 | OK, let's start with fusion. Why do we do fusion; why don't we just have a single |
---|
0:01:21 | best system? The motivation is that fusion works better than the single best, and so |
---|
0:01:30 | we can take multiple classifiers and so on. |
---|
0:01:35 | And on the other hand, when some classifiers are not well behaved on development data, |
---|
0:01:46 | fusion can help to smooth that out. |
---|
0:01:51 | There is conventional wisdom in fusion that complementary classifiers should be selected for the fusion |
---|
0:02:00 | pool. This is the main question in our work here. |
---|
0:02:08 | So, if we are going to do fusion instead of one single best system, how |
---|
0:02:14 | do we select the classifiers for the fusion pool, or ensemble? |
---|
0:02:20 | We work under the assumption that there is some complementarity in the |
---|
0:02:27 | feature sets, or the classifiers, or something else. |
---|
0:02:31 | But it is difficult to quantify. |
---|
0:02:35 | What does complementarity mean? |
---|
0:02:39 | So we can first look at the mutual information between the classifier outputs and the class |
---|
0:02:49 | label, and at Fano's inequality. |
---|
0:02:53 | Maximizing mutual information means minimizing the classification error. |
---|
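For reference, Fano's inequality in the weakened form usually quoted in this context (entropies in bits; the notation here is mine, not from the slides) shows that increasing the mutual information lowers the achievable error bound:

```latex
H(Y \mid X) \;\le\; H_b(P_e) + P_e \log\bigl(|\mathcal{Y}| - 1\bigr)
\quad\Longrightarrow\quad
P_e \;\ge\; \frac{H(Y) - I(X;Y) - 1}{\log |\mathcal{Y}|}
```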
0:03:03 | Gavin Brown, in the paper "An Information Theoretic Perspective on Multiple Classifier Systems" (2009), showed that |
---|
0:03:12 | the multi-way mutual information, where we take all classifiers from 1 to L (the |
---|
0:03:21 | potential pool), can be decomposed into three different terms. |
---|
0:03:27 | The first term is very familiar to us. It is basically the sum of the individual classifiers' |
---|
0:03:36 | accuracies. |
---|
0:03:38 | You usually try to maximize this term. |
---|
0:03:42 | Actually, maximizing this term can lead to maximizing the mutual information. |
---|
0:03:48 | But we have to subtract the second term here, so it is not very nice. |
---|
0:03:55 | This term here, which I am not going to discuss in much detail, is a |
---|
0:04:07 | kind of mutual information. |
---|
0:04:11 | Here we only have the classifier outputs; we don't have the class label. |
---|
0:04:18 | It is basically a correlation term, but we take all subsets of all the classifiers. |
---|
0:04:26 | Basically, minimizing the correlation over the subsets and maximizing the first term can lead to |
---|
0:04:40 | maximizing the mutual information. |
---|
0:04:43 | The last term is an interesting term: you take all the subsets and you |
---|
0:04:52 | compute the mutual information again, but conditioned on the class label. |
---|
0:05:00 | This term is additive. |
---|
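A sketch of the three-term decomposition being described, written as I understand Brown's (2009) result rather than copied from the slide; here I({X_S}) denotes the interaction information among the classifier outputs in subset S:

```latex
I(X_{1:L}; Y) \;=\;
\underbrace{\sum_{i=1}^{L} I(X_i; Y)}_{\text{individual relevance}}
\;-\;
\underbrace{\sum_{\substack{S \subseteq \{1,\dots,L\} \\ |S| \ge 2}} I(\{X_S\})}_{\text{redundancy over subsets}}
\;+\;
\underbrace{\sum_{\substack{S \subseteq \{1,\dots,L\} \\ |S| \ge 2}} I(\{X_S\} \mid Y)}_{\text{class-conditional redundancy}}
```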
0:05:04 | The conclusion is that we should not only minimize the correlation between classifiers at this |
---|
0:05:14 | higher level, over groups of increasing ensemble size, but we also have to have |
---|
0:05:24 | strong conditional correlation. |
---|
0:05:29 | I don't think this gives a general recipe, but I think it gives some kind |
---|
0:05:38 | of idea that the conventional wisdom on complementarity might not be so accurate. |
---|
0:05:46 | The topic of this talk is that we use sparseness to do this kind |
---|
0:05:56 | of task automatically. We don't consider the decomposition explicitly. |
---|
0:06:02 | I don't try to hand-optimize this ensemble. |
---|
0:06:17 | We don't base the selection on optimizing any diversity measure. We don't try to |
---|
0:06:25 | do anything like that. |
---|
0:06:28 | Instead, I use sparse regression to optimize the ensemble size. |
---|
0:06:35 | But regularized regression introduces an extra parameter, the regularization parameter, and that is not very nice. |
---|
0:06:45 | I would like to get rid of this parameter. I have treated it as a hyper- |
---|
0:06:50 | parameter and optimized it at the same time as we optimize the actual classifier, the fusion device. |
---|
0:06:57 | The attempt here was to use variational Bayes, because it is a nice framework. |
---|
0:07:06 | We can integrate hyperparameter estimation into the same objective. |
---|
0:07:13 | We attempt to integrate over all parameters to get the posterior. |
---|
0:07:32 | OK, the motivation is basically this: why do we use regularization for sparseness of |
---|
0:07:39 | the fusion? |
---|
0:07:41 | Here we have two classifiers, and CWLR is the cost function. |
---|
0:07:51 | The x-axis corresponds to classifier one's weight, the y-axis to classifier two's weight. The figure on the left is for the |
---|
0:07:56 | training set, the figure in the middle is for the development set, and the figure on |
---|
0:08:02 | the right is for the final evaluation set. |
---|
0:08:06 | If we use the L2 norm as regularization, we can see the following: |
---|
0:08:12 | The red round point is the L2 optimum that we find, the red cross |
---|
0:08:18 | is the unconstrained optimum, and the black square is the L1 optimum. We can see |
---|
0:08:25 | that on the training data the L2 optimum is closer to the unconstrained optimum. |
---|
0:08:31 | For the development set this is also true. But when we move to the test set, |
---|
0:08:39 | you can see that classifier w2 has been zeroed out, which actually ends up closer |
---|
0:08:46 | to the true minimum for that set. |
---|
0:08:51 | The minimum has changed; the CWLR function has changed on the SRE set. Of course, |
---|
0:09:11 | we were lucky in this case. |
---|
0:09:17 | But suppose this happens on real data. |
---|
0:09:21 | So, if we had optimized to the unconstrained minimum, the solution would be here. |
---|
0:09:28 | And now, with the sparse solution, we are much closer to the true minimum. |
---|
0:09:35 | So it tells us that we can definitely |
---|
0:09:38 | zero out a classifier, and it can give us |
---|
0:09:41 | a real benefit. |
---|
0:09:44 | Basically, we have to find which value of the Lagrange coefficient to use. This is done by |
---|
0:09:54 | cross-validation. |
---|
0:10:00 | We have a discriminative probabilistic framework here, and we are using it to optimize the |
---|
0:10:09 | fusion weights. |
---|
0:10:20 | Optimizing the logistic sigmoid model leads to the cross-entropy cost. |
---|
0:10:35 | And so the |
---|
0:10:37 | thing that Niko has proposed is to take the whole log-likelihood ratio |
---|
0:10:40 | in a cross-entropy weighted by the |
---|
0:10:44 | proportions of target and non-target trials in the |
---|
0:10:47 | actual training set, |
---|
0:10:51 | but we also have here |
---|
0:10:53 | an additive term, |
---|
0:10:55 | the |
---|
0:10:56 | logit of the effective prior pi. |
---|
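A minimal sketch of this prior-weighted cross-entropy cost, assuming the usual formulation in which each class is weighted by its effective prior over its trial count and the fused score gets a logit-prior offset; the function and variable names below are illustrative, not from the talk, and normalization conventions differ slightly between papers:

```python
import numpy as np

def weighted_cross_entropy(w, b, scores_tar, scores_non, p_eff=0.5):
    """Prior-weighted logistic regression cost for score fusion (sketch).

    scores_tar: (N_tar, L) per-classifier scores for target trials
    scores_non: (N_non, L) per-classifier scores for non-target trials
    w, b:       fusion weights (length L) and bias
    p_eff:      effective target prior; logit(p_eff) is the additive offset
    """
    offset = np.log(p_eff / (1.0 - p_eff))      # logit of the effective prior
    llr_tar = scores_tar @ w + b                # fused scores, treated as LLRs
    llr_non = scores_non @ w + b
    # each class is weighted by its prior over its trial count, so the cost
    # does not depend on the target/non-target ratio of the training list
    c_tar = p_eff / len(llr_tar) * np.sum(np.log1p(np.exp(-(llr_tar + offset))))
    c_non = (1.0 - p_eff) / len(llr_non) * np.sum(np.log1p(np.exp(llr_non + offset)))
    return (c_tar + c_non) / np.log(2.0)        # report in bits
```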
0:11:04 | The idea of regularization is that we do a MAP estimate, and the Laplace prior |
---|
0:11:16 | is the double exponential. |
---|
0:11:20 | Basically, we can see the prior probability of the weight parameter for each |
---|
0:11:30 | classifier j. |
---|
0:11:36 | Here we can assume that the parameter is the same for all dimensions. In the case |
---|
0:11:47 | of ridge regression, we have an isotropic Gaussian prior. |
---|
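For reference, a sketch of the correspondence just described, in my own notation: a Laplace (double exponential) prior gives L1-regularized MAP estimation, while an isotropic Gaussian prior gives ridge (L2) regularization:

```latex
p(w_j \mid \lambda) = \tfrac{\lambda}{2}\, e^{-\lambda |w_j|}
\;\Rightarrow\;
\hat{w}_{\mathrm{MAP}} = \arg\min_{w}\; C_{\mathrm{wlr}}(w) + \lambda \sum_j |w_j|,
\qquad
p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)
\;\Rightarrow\;
\hat{w}_{\mathrm{MAP}} = \arg\min_{w}\; C_{\mathrm{wlr}}(w) + \tfrac{\alpha}{2}\, \lVert w \rVert_2^2
```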
0:11:53 | For variational Bayes, we follow the treatment in Bishop's book, chapter 10. |
---|
0:11:59 | Now we are not looking for the MAP estimate but for the whole posterior. It is approximated |
---|
0:12:09 | as q(w) q(a) q(t); here we factorize over all hidden parameters. |
---|
0:12:19 | But we have an additional problem here: the likelihood term has to be approximated by a bound h(w, z). |
---|
0:12:30 | We have one scalar z for each training score vector, and it has |
---|
0:12:39 | to be optimized in the same VB loop. |
---|
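A minimal sketch of the variational loop described above, assuming the standard local-bound treatment from Bishop's chapter 10 with a single shared precision alpha; the function names, default hyperparameters, and fixed iteration count are illustrative choices of mine:

```python
import numpy as np

def lam(xi):
    """Local variational bound coefficient lambda(xi) = tanh(xi/2) / (4*xi)."""
    xi = np.maximum(xi, 1e-8)            # guard the xi -> 0 limit
    return np.tanh(xi / 2.0) / (4.0 * xi)

def vb_logistic(X, t, a0=1e-2, b0=1e-4, n_iter=50):
    """VB logistic regression sketch in the style of Bishop, ch. 10.

    X: (N, D) training score vectors (append a column of ones for the bias),
    t: (N,) labels in {0, 1}.
    Returns the Gaussian posterior q(w) = N(m, S) and E[alpha].
    """
    N, D = X.shape
    xi = np.ones(N)                      # one local bound parameter per trial
    a_n, b_n = a0 + D / 2.0, b0          # Gamma posterior over the shared precision
    for _ in range(n_iter):
        e_alpha = a_n / b_n              # E[alpha] under q(alpha)
        # update q(w) = N(m, S) given the current bound parameters
        S = np.linalg.inv(e_alpha * np.eye(D) + 2.0 * (X.T * lam(xi)) @ X)
        m = S @ (X.T @ (t - 0.5))
        # update the local bound parameters (one scalar per training vector)
        xi = np.sqrt(np.einsum('nd,de,ne->n', X, S + np.outer(m, m), X))
        # update q(alpha)
        b_n = b0 + 0.5 * (m @ m + np.trace(S))
    return m, S, a_n / b_n
```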
0:12:45 | Here the natural choice for the distribution of alpha (we saw alpha on the previous slide) |
---|
0:13:03 | is a Gamma; here the Gamma is made non-informative. |
---|
0:13:13 | The interesting point here is that the mean of the predictive density is just an inner |
---|
0:13:20 | product, so it is consistent for us to use normal linear scoring. |
---|
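One standard way to see this (a sketch using the usual Gaussian/probit approximation to the predictive integral, not necessarily the talk's exact derivation): the predictive probability depends on the score vector x only through the inner product of x with the posterior mean m:

```latex
p(t = 1 \mid x) \;\approx\; \sigma\!\bigl(\kappa(\sigma_a^2)\, m^{\top} x\bigr),
\qquad
\sigma_a^2 = x^{\top} S\, x,
\qquad
\kappa(\sigma_a^2) = \bigl(1 + \pi \sigma_a^2 / 8\bigr)^{-1/2}
```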
0:13:31 | As I explained earlier, we are interested in having a sparse solution. |
---|
0:13:46 | I tried to use automatic relevance determination (ARD) for p(w|alpha). |
---|
0:13:51 | We have A as a diagonal matrix, so there is one precision for each classifier. |
---|
0:14:03 | And we have a product of Gammas instead of having just one Gamma. |
---|
0:14:16 | The general idea is that classifiers that don't play |
---|
0:14:25 | any role are driven down to zero. |
---|
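A sketch of how the single shared precision in the previous loop would become per-classifier precisions under ARD, assuming the standard per-weight Gamma updates; the names and default hyperparameters are again illustrative:

```python
import numpy as np

def ard_update(m, S, a0=1e-2, b0=1e-4):
    """One ARD hyperparameter update, given the current q(w) = N(m, S).

    Returns E[alpha_j], one precision per fusion weight; in the VB loop the
    shared prior term e_alpha * np.eye(D) would be replaced by np.diag(e_alpha).
    """
    a_j = a0 + 0.5                            # Gamma shape, identical for every weight
    b_j = b0 + 0.5 * (m ** 2 + np.diag(S))    # Gamma rate, one per weight
    return a_j / b_j                          # a large E[alpha_j] pins w_j near zero
```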
0:14:37 | Our setup uses NIST SRE 2008. The extended trial list is split into |
---|
0:14:54 | two trial lists. |
---|
0:14:58 | One set is for training the fusion device, and the other is the cross-validation set. |
---|
0:15:05 | Then we evaluate on the NIST SRE 2010 core set. |
---|
0:15:13 | Here we can see the results of variational Bayes logistic regression. |
---|
0:15:19 | I forgot to mention that in this setting there is maybe no cross-validation |
---|
0:15:27 | needed at all, and so the complexity of the operation is the same as the |
---|
0:15:36 | standard FoCal approach. |
---|
0:15:41 | We can see that for itv-itv my result is the best. My result in |
---|
0:15:52 | minDCF gets a slight improvement, |
---|
0:15:56 | but the actual DCF result is not well calibrated. |
---|
0:16:03 | Unfortunately, there were only two classifiers that, for some reason I |
---|
0:16:15 | can't explain, got anywhere near zero. |
---|
0:16:25 | I searched the literature; there are many comments about using ARD, and some people |
---|
0:16:34 | complain that ARD somehow under-fits the data. |
---|
0:16:39 | And I guess the solution here is that instead of ARD I have to |
---|
0:16:46 | use another, stronger prior that has a stronger regularization ability. |
---|
0:16:53 | So let's look at another condition; in some cases the standard logistic regression actually performs better. |
---|
0:17:07 | On the other hand, in the tel-tel case I got some improvement over the |
---|
0:17:16 | standard in equal error rate (EER). So that was not an intended consequence. |
---|
0:17:27 | So the interesting point here is that there is quite a big problem with calibration. At |
---|
0:17:39 | least this variational Bayes logistic regression does not calibrate well. |
---|
0:17:48 | So I think we would need an extra calibration step. I didn't try that yet. Actually, |
---|
0:17:56 | I am more interested in changing the prior than in working on this baseline. |
---|
0:18:04 | But on the other hand, we produce some scores, so we can of course add |
---|
0:18:10 | subset selection on top of this result. |
---|
0:18:14 | And so here, this is a bit ad hoc in the case of variational Bayes: |
---|
0:18:20 | I can impose the L0 norm on this result, so this |
---|
0:18:27 | forces a sparse solution. |
---|
0:18:34 | So you will see here what happens with standard logistic regression: when we scan |
---|
0:18:46 | the ensemble size, there is definitely a minimum at around 8 or 9 classifiers. |
---|
0:19:19 | But that is actually smaller than our predicted subset size. |
---|
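A sketch of the kind of exhaustive L0 scan described above; `train_fusion` and `evaluate` are placeholder callbacks of mine, standing in for training the fusion weights on a classifier subset and scoring that subset on the cross-validation list:

```python
from itertools import combinations

def l0_subset_scan(train_fusion, evaluate, n_classifiers):
    """Brute-force L0 subset selection (sketch).

    train_fusion(subset) -> fusion weights trained on that subset of classifiers,
    evaluate(subset, weights) -> cross-validation cost (e.g. EER or CWLR).
    Exponential in n_classifiers, so only feasible for small pools (here 12).
    """
    best_per_size = {}
    for k in range(1, n_classifiers + 1):
        best_per_size[k] = min(
            ((s, evaluate(s, train_fusion(s)))
             for s in combinations(range(n_classifiers), k)),
            key=lambda pair: pair[1],
        )
    # overall winner across all ensemble sizes
    overall = min(best_per_size.values(), key=lambda pair: pair[1])
    return overall, best_per_size
```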
0:19:26 | This is the logistic regression baseline. We can see there is a gain from 3.55 to |
---|
0:19:36 | 3.40 in EER. On the other hand, the oracle tells us what the correct |
---|
0:19:46 | size is, what the correct subset is. And we would have a large gain there. |
---|
0:20:03 | But the interesting point here is that when we apply variational Bayes, the behavior is actually much |
---|
0:20:12 | closer to our prediction; we follow the oracle bound quite closely. |
---|
0:20:22 | Unfortunately, because the scores are not well calibrated, the actual DCF |
---|
0:20:27 | doesn't look so nice. |
---|
0:20:31 | But we could do post-calibration. |
---|
0:20:37 | Here is the comparison table. I only did this for the itv-itv set. We get an |
---|
0:20:46 | improvement in equal error rate, from 3.48 for the full set to 3.37 for the subset. And |
---|
0:20:54 | now we have only 6 classifiers left in our pool. |
---|
0:21:00 | Of course, we have exponential time complexity in the size of the original classifier pool. Here |
---|
0:21:09 | I have 12 classifiers in total, so it doesn't cost so much, but it is |
---|
0:21:18 | still really impractical in a real application. |
---|
0:21:22 | Note that this reintroduces a regularization parameter again here, but it can bring some benefit. |
---|
0:21:33 | The difference between standard logistic regression and the VB regression is not great, but ... |
---|
0:21:48 | It is possible to add extra parameters to our regularization. One option is to use the |
---|
0:21:59 | elastic net. Now we are back to the MAP estimate. The elastic net is basically a convex combination of the L1 |
---|
0:22:10 | and L2 penalties. |
---|
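For reference, the elastic-net penalty in its usual convex-combination form (notation mine):

```latex
\hat{w} = \arg\min_{w}\; C_{\mathrm{wlr}}(w)
  + \lambda \Bigl( \beta \sum_j |w_j| + \tfrac{1 - \beta}{2} \sum_j w_j^2 \Bigr),
  \qquad 0 \le \beta \le 1
```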
0:22:12 | There is another possibility: sometimes LASSO can regularize |
---|
0:22:21 | too harshly. So we can restrict LASSO so that it does not regularize one chosen classifier. |
---|
0:22:27 | This method is called restricted LASSO. |
---|
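A sketch of what restricted LASSO amounts to here, as I read the description: the L1 penalty is applied to every fusion weight except one chosen classifier k, which is left unregularized:

```latex
\hat{w} = \arg\min_{w}\; C_{\mathrm{wlr}}(w) + \lambda \sum_{j \ne k} |w_j|
```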
0:22:31 | Here we see the results for the complete set. Basically, for the itv-itv condition the ensemble |
---|
0:22:44 | size with LASSO is 6. It is a computationally efficient method. |
---|
0:22:58 | For interview-telephone we observe similar performance; the restricted LASSO gets the best result. |
---|
0:23:14 | Now, the interesting thing is in the restricted LASSO here. We want one classifier not to be |
---|
0:23:26 | regularized. We can see that the restricted LASSO gives a smaller ensemble size than the original LASSO for |
---|
0:23:39 | the interview-telephone sub-condition. So we have selected one classifier that is not itself regularized but still |
---|
0:23:52 | affects the others, causing other classifiers to go to zero. |
---|
0:24:01 | I can't explain this behavior, but I found the ensemble sizes produced here quite interesting. |
---|
0:24:13 | Here is the telephone-telephone sub-condition. The EER shown here is from variational Bayes. This is the only |
---|
0:24:27 | condition where sparsity didn't help in terms of actual DCF. |
---|
0:24:37 | Otherwise, in the other conditions, sparsity does help. |
---|
0:24:53 | It is surprising that ARD was not able to drive more classifiers |
---|
0:25:02 | to zero than the standard variational Bayes. |
---|
0:25:07 | On the other hand, there is the possibility that the prior has to be stronger. |
---|
0:25:15 | The elastic net shows the most promise, but we are not yet able to estimate its parameters efficiently. |
---|
0:25:22 | In future work we will study methods to automatically learn the elastic-net prior hyperparameters. |
---|
0:26:16 | Of course we try to do variational Bayes. We have hyperprior parameters, but then |
---|
0:26:28 | we can set those hyperprior parameters to be non-informative. |
---|
0:27:01 | The comparison between standard logistic regression and the VB method is totally fair. |
---|
0:27:11 | There are some studies in the literature, in different fields, where people do this |
---|
0:27:19 | kind of thing. They observe that optimizing the regularization parameter by cross-validation brings better performance, |
---|
0:27:27 | but using this kind of Bayesian approach brings more stable performance. |
---|
0:27:33 | It isn't as good, but it is more predictable. And that is my goal. |
---|