The next talk is on Variational Bayes logistic regression as regularized fusion for NIST SRE 2010.
OK
My name is Hautamäki, and the topic is fusion; I think this is the only fusion talk this time for speaker recognition.
This time, I have tried to do variational Bayes on the NIST SRE evaluation corpora.
OK, let's start with fusion: why do we do fusion at all, instead of just keeping the single best system? The motivation is that fusion works better than the single best system, so we take multiple classifiers and combine them.
And on the other hand, when some classifiers are not well behaved on the development data, fusion can help to smooth that out.
The conventional wisdom in fusion is that complementary classifiers should be selected into the fusion pool. How to select these systems is the main question in our work here.
So, if we are going to do fusion instead of one single best system, how do we select the classifiers for the fusion pool, or ensemble?
We usually think along the lines that there is some complementarity in the feature sets, or the classifiers, or something else, but it is difficult to quantify. What does complementarity actually mean?
So we can first look at the mutual information between the classifier output and the class label, and at Fano's inequality: maximizing mutual information means minimizing the classification error.
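For reference, a commonly used weakened form of Fano's inequality bounds the error probability $P_e$ of predicting the class label $Y$ from the classifier output $\hat{Y}$:

$$ P_e \;\ge\; \frac{H(Y) - I(Y;\hat{Y}) - 1}{\log_2 |\mathcal{Y}|}, $$

so increasing the mutual information $I(Y;\hat{Y})$ lowers the bound on the achievable error.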
Gavin Brown, in the paper "An Information Theoretic Perspective on Multiple Classifier Systems" (2009), showed that the multi-way mutual information, where we take all classifiers from 1 to L (the potential pool), can be decomposed into three different terms.
The first term is very familiar to us: it is basically the sum of the individual classifier accuracies. We usually try to maximize this term, and maximizing it can lead to maximizing the mutual information. But we have to subtract the second term, so it is not that simple.
The second term, which I will not cover in detail, is a kind of mutual information among the classifier outputs only; there is no class label involved. It is basically a correlation term, taken over all subsets of the classifiers. So minimizing the correlation over the subsets, while maximizing the first term, can lead to maximizing the mutual information.
The last term is the interesting one: you again take all the subsets and compute the mutual information, but now conditioned on the class label. This term is additive. The conclusion is that we should not only minimize the correlation between classifiers at this higher level, over groups of increasing ensemble size, but we also need strong conditional correlation.
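From memory, the decomposition has roughly this shape (notation mine; $X_i$ is the output of classifier $i$, $Y$ the class label, and $I(X_T)$ the interaction information within a subset $T$ of classifiers):

$$ I(X_{1:L};Y) \;=\; \underbrace{\sum_{i=1}^{L} I(X_i;Y)}_{\text{individual relevance}} \;-\; \underbrace{\sum_{\substack{T \subseteq \{1,\dots,L\}\\ |T|\ge 2}} I(X_T)}_{\text{redundancy}} \;+\; \underbrace{\sum_{\substack{T \subseteq \{1,\dots,L\}\\ |T|\ge 2}} I(X_T \mid Y)}_{\text{conditional redundancy}}. $$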
This is hard to interpret in general, but it does give some idea that the conventional wisdom about complementarity might not be so accurate.
The topic of this talk is that we use sparseness to do this task automatically. We do not consider the decomposition, I do not try to hand-optimize the ensemble, and we do not base the selection on optimizing any diversity measure. We don't try to do anything like that.
Instead, I use sparse regression to optimize the ensemble size. But regularized regression introduces an extra parameter, the regularization parameter, which is not very nice. I would like to get rid of it, so I treat it as a hyperparameter and optimize it at the same time as we optimize the actual fusion device.
The attempt here was to use variational Bayes, because it is a nice framework: we can integrate hyperparameter estimation into the same objective. We are attempting to integrate over all parameters to get the posterior.
OK, the motivation is this: why use regularization to get sparseness in the fusion? Here we have two classifiers and C_wlr is the cost function. The x-axis is the weight of classifier one and the y-axis is the weight of classifier two. The figure on the left is for the training set, the one in the middle is for the development set, and the one on the right is for the final evaluation set.
If we use the L2 norm as regularization, we can see the following. The red round point is the optimum that we find, the red cross is the unconstrained optimum, and the black square is the L1 optimum. On the training data the L2 optimum is close to the unconstrained optimum, and for the development set this is also true. But when we move to the test set, you can see that the L1 solution, which has zeroed out the weight of classifier two, is actually closer to the true minimum for that set.
The minimum has changed, because the C_wlr function has changed on the SRE 2010 set. Of course we are lucky in this example, but suppose this happens with real data.
So if we had used the unconstrained minimum, the solution would be over here, whereas now we are much closer to the true minimum. It tells us that zeroing out a classifier can give us a real benefit.
Basically, we have to find what value of the Lagrange coefficient to use, and this is normally done by cross-validation. We have a discriminative probabilistic framework here, and we use it to optimize the fusion weights.
Optimizing the logistic sigmoid model leads to a cross-entropy cost. What Niko Brümmer proposed is to use the weighted cost C_wlr, where the target and non-target trials are weighted by their proportions in the actual training set, and where we also have an additive term, the logit of the effective prior π.
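For reference, this weighted cross-entropy fusion cost is usually written in a form like the following (my notation; $\mathbf{s}_i$ is the score vector of trial $i$, $\mathbf{w}$ the fusion weights, $b$ a bias, and $\pi$ the effective prior):

$$ C_{\mathrm{wlr}}(\mathbf{w}, b) \;=\; \frac{\pi}{N_{\mathrm{tar}}} \sum_{i \in \mathrm{tar}} \log\!\left(1 + e^{-(\mathbf{w}^{\top}\mathbf{s}_i + b + \operatorname{logit}\pi)}\right) \;+\; \frac{1-\pi}{N_{\mathrm{non}}} \sum_{j \in \mathrm{non}} \log\!\left(1 + e^{\,\mathbf{w}^{\top}\mathbf{s}_j + b + \operatorname{logit}\pi}\right). $$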
The idea of regularization is that we do a MAP estimate, and the LASSO prior is the double exponential (Laplace) distribution. Basically, we have a prior with one regularization parameter for each classifier j, and here we can assume that the parameter is the same for all dimensions. In the case of ridge regression, we have an isotropic Gaussian prior instead.
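For reference, these two priors induce the familiar penalties in the MAP objective (notation mine):

$$ p(w_j \mid \alpha) = \frac{\alpha}{2}\, e^{-\alpha |w_j|} \;\;\Rightarrow\;\; -\log p(\mathbf{w} \mid \alpha) = \alpha \sum_j |w_j| + \text{const} \quad \text{(LASSO / L1)}, $$

$$ p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{0}, \alpha^{-1} I) \;\;\Rightarrow\;\; -\log p(\mathbf{w} \mid \alpha) = \frac{\alpha}{2} \|\mathbf{w}\|_2^2 + \text{const} \quad \text{(ridge / L2)}. $$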
For variational Bayes, we follow the treatment in Bishop's book, chapter 10. Now we are not looking for a MAP estimate but for the whole posterior, which is approximated as q(w) q(α) q(t); we factorize over all hidden parameters. But we have an additional problem: the likelihood term has to be approximated by a bound h(w, z), with one scalar z per training score vector, and these have to be optimized in the same VB loop.
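For reference, the local bound from Bishop, Section 10.6 (with $\xi$ playing the role of the z above) is

$$ \sigma(a) \;\ge\; \sigma(\xi)\,\exp\!\left\{ \frac{a-\xi}{2} - \lambda(\xi)\,(a^2 - \xi^2) \right\}, \qquad \lambda(\xi) = \frac{1}{2\xi}\left[\sigma(\xi) - \tfrac{1}{2}\right], $$

applied to each likelihood factor; it is quadratic in the exponent, so the approximate posterior $q(\mathbf{w})$ comes out Gaussian.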
Here the natural choice of distribution for alpha (the alpha from the previous slide) is a Gamma, and we set it to be non-informative. The interesting point is that the mean of the predictive density reduces to an inner product, so it is consistent for us to keep using normal linear scoring.
As I explained earlier, we are interested in having a sparse solution, so I try to use automatic relevance determination (ARD) for p(w|alpha). Here A is a diagonal matrix with one precision per classifier, and the prior on alpha is a product of Gammas instead of just one Gamma. The general idea is that classifiers that don't play any role are driven down to zero.
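As a minimal sketch of how such a VB-ARD fusion loop could look, following the chapter 10 updates (the function name, interfaces, and data layout are mine, and the actual fusion would also include a bias and the effective-prior offset):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lam(xi):
    # lambda(xi) from the local variational bound on the sigmoid
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def vb_ard_fusion(S, t, n_iter=100, a0=1e-4, b0=1e-4):
    """Variational Bayes logistic regression with an ARD prior (sketch).

    S : (N, D) array of score vectors, one row per training trial
    t : (N,) array of 0/1 labels (non-target / target)
    Returns the posterior mean m and covariance V of the fusion weights w.
    """
    N, D = S.shape
    alpha = np.ones(D)   # E[alpha_j], one precision per classifier
    xi = np.ones(N)      # local variational parameter per trial
    for _ in range(n_iter):
        # q(w) = N(m, V) under the bound, with prior precision diag(alpha)
        V = np.linalg.inv(np.diag(alpha) + 2.0 * (S.T * lam(xi)) @ S)
        m = V @ (S.T @ (t - 0.5))
        # re-estimate the local parameters xi_n
        xi = np.sqrt(np.einsum('nd,de,ne->n', S, V + np.outer(m, m), S))
        # q(alpha_j) is a Gamma; use its mean for the next iteration
        alpha = (a0 + 0.5) / (b0 + 0.5 * (m ** 2 + np.diag(V)))
    return m, V

# Classifiers whose alpha_j grows very large get weights pinned near zero,
# which is how the ensemble is (ideally) pruned.
```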
Our setup uses NIST SRE 2008: the extended trial list is split into two trial lists, one for training the fusion device and the other as a cross-validation set. Then we evaluate on the NIST SRE 2010 core set.
Here we see the results of variational Bayes logistic regression. I forgot to mention that in this setup no cross-validation is needed at all, so the computational complexity of the operation is the same as for the standard FoCal approach.
In the interview-interview (itv-itv) condition my result is the best: there is a slight improvement in minDCF, but the actual DCF result is not well calibrated.
Unfortunately, for some reason I can't explain, only two classifiers got anywhere near zero, and nothing more than that.
I searched the literature, and there are many comments about using ARD; some people complain that ARD tends to under-fit the data. I guess the solution is that instead of ARD I should use another prior that is stronger and has a stronger regularization ability.
Let's look at the other conditions; in some cases the standard logistic regression actually performs better. On the other hand, in the telephone-telephone (tel-tel) case I got some improvement in equal error rate (EER) just from the standard method, so that was an unintended consequence.
The interesting point here is that there is quite a big problem with calibration; at least this variational Bayes logistic regression does not calibrate well. So I think we need an extra calibration step. I didn't try that yet; actually I am more interested in changing the prior than in working on this baseline.
On the other hand, we do produce scores, so we can of course add subset selection on top of this result. This is a bit ad hoc in the case of variational Bayes: I can impose the L0 norm on this result, which forces a sparse solution.
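As a rough sketch of what such a brute-force L0-style subset scan could look like (the function names and interfaces here are hypothetical, not the actual implementation):

```python
from itertools import combinations

def l0_subset_scan(S_dev, t_dev, refit, metric):
    """Refit the fusion for every classifier subset and keep the subset
    with the lowest metric (e.g. EER or min Cllr) on the held-out list.

    S_dev  : (N, D) array of development score vectors
    t_dev  : (N,) array of labels
    refit  : function (scores, labels) -> fusion weights
    metric : function (fused scores, labels) -> error value
    Exponential in D, so only feasible for a small pool (here D = 12).
    """
    D = S_dev.shape[1]
    best_err, best_subset, best_w = float('inf'), None, None
    for k in range(1, D + 1):
        for subset in combinations(range(D), k):
            cols = list(subset)
            w = refit(S_dev[:, cols], t_dev)
            err = metric(S_dev[:, cols] @ w, t_dev)
            if err < best_err:
                best_err, best_subset, best_w = err, subset, w
    return best_subset, best_w, best_err
```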
So you see here what happens with standard logistic regression: when we scan the ensemble size, there is clearly a minimum around size 8-9, but the actual best subset is smaller than the one we predicted.
This is the logistic regression baseline: we see a gain from 3.55 to 3.40 in EER. The oracle, on the other hand, tells us the correct size and the correct subset, and there we have a large gain.
But the interesting point is that when we apply variational Bayes, the behavior is much closer to our prediction; we follow the oracle bound closely. Unfortunately, because the scores are not well calibrated, the actual DCF does not look so nice. But we could do post-calibration.
Here is the comparison table; I only did this for the itv-itv set. We get an improvement in equal error rate, from 3.48 for the full set to 3.37 for the subset, and now there are only 6 classifiers left in the pool.
Of course, we have exponential time complexity in the size of the original classifier pool. I have 12 classifiers in total, so it does not cost that much here, but it is still impractical in a real application.
Now I am reintroducing a regularization parameter here, which can bring some benefit. The difference between standard logistic regression and the VB regression is not great, but...
It is possible to add extra parameters to the regularization. One option is the elastic-net; now we are back to MAP estimation, and the elastic-net is basically a convex combination of L1 and L2 penalties. Another possibility, which I haven't talked about yet, is that sometimes LASSO regularizes too hard, so we can restrict LASSO to not regularize one chosen classifier. This method is called restricted LASSO.
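For reference, the elastic-net penalty as a convex combination of the two norms can be written as something like (my notation, added to the C_wlr cost):

$$ R(\mathbf{w}) \;=\; \lambda \left[ \rho\,\|\mathbf{w}\|_1 + (1-\rho)\,\|\mathbf{w}\|_2^2 \right], \qquad 0 \le \rho \le 1, $$

and restricted LASSO simply leaves the chosen weight $w_k$ out of the $\|\cdot\|_1$ sum, so that one classifier is never penalized.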
Here we see the results for the complete set. For the itv-itv condition the ensemble size from LASSO is 6, and it is a computationally efficient method.
For interview-telephone we observe similar performance, and restricted LASSO gets the best result. The interesting thing about restricted LASSO is that we want one classifier to be unregularized, and we can see that restricted LASSO gives a smaller ensemble than the original LASSO for the interview-telephone sub-condition. So the selected classifier is not regularized itself, but it still causes other classifiers to be driven to zero.
I can't explain this behavior, but I found the ensemble sizes produced here interesting.
Here is the telephone-telephone sub-condition; the EER shown is from variational Bayes. This is the only condition where sparsity did not help in terms of actual DCF; in the other conditions sparsity does help.
It is surprising that ARD was not able to drive more classifiers to zero than standard variational Bayes. On the other hand, it is possible that the prior simply has to be stronger.
The elastic-net shows the most promise, but we are not yet able to estimate its parameters efficiently. In future work we will study methods to automatically learn the elastic-net prior hyperparameters.
Of course we try to do this with variational Bayes: we still have hyperprior parameters, but we can set those to be non-informative. That way the comparison between standard logistic regression and the VB method is totally fair.
There are some studies in the literature, in different fields, where people do this kind of thing. They observe that optimizing the regularization parameter with cross-validation brings better peak performance, but this kind of Bayesian approach brings more stable performance: not quite as good, but more predictable. That is my goal.