0:00:15 | The next talk is Variational Bayes logistic regression as regularized fusion for NIST SRE |
---|
0:00:22 | 2010 |
---|
0:00:40 | OK |
---|
0:00:42 | My name is Hautamäki, and |
---|
0:00:46 | the topic is fusion, so |
---|
0:00:51 | I think this is probably the only fusion talk this time for speaker recognition. |
---|
0:01:01 | This time, I have tried variational Bayes fusion on the NIST SRE evaluation corpora. |
---|
0:01:12 | OK, let's start with fusion. Why do we do fusion; why don't we just have a single |
---|
0:01:21 | best system? The motivation is that fusion works better than the single best, and so |
---|
0:01:30 | we can take multiple classifiers and so on. |
---|
0:01:35 | And on the other hand, when some classifiers are not well behaved on development data, |
---|
0:01:46 | fusion can help to smooth that out. |
---|
0:01:51 | There is conventional wisdom in fusion that complementary classifiers should be selected for the fusion |
---|
0:02:00 | pool. This is the main question in our work here. |
---|
0:02:08 | So, if we are going to do fusion instead of one single best system, how |
---|
0:02:14 | do we select the classifiers for the fusion pool, or ensemble? |
---|
0:02:20 | We work under the assumption that there is some complementarity in the |
---|
0:02:27 | feature sets, or the classifiers, or something else. |
---|
0:02:31 | But it is difficult to quantify. |
---|
0:02:35 | What does complementarity mean? |
---|
0:02:39 | So we can first look at the mutual information between the classifier outputs and the class |
---|
0:02:49 | label, and at Fano's inequality. |
---|
0:02:53 | Maximizing mutual information means minimizing the classification error. |
---|
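For reference, Fano's inequality in the weakened form usually quoted in this context (entropies in bits; the notation here is mine, not from the slides) shows that increasing the mutual information lowers the achievable error bound:

```latex
H(Y \mid X) \;\le\; H_b(P_e) + P_e \log\bigl(|\mathcal{Y}| - 1\bigr)
\quad\Longrightarrow\quad
P_e \;\ge\; \frac{H(Y) - I(X;Y) - 1}{\log |\mathcal{Y}|}
```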
0:03:03 | Gavin Brown, in the paper "An Information Theoretic Perspective on Multiple Classifier Systems" (2009), showed that |
---|
0:03:12 | the multi-way mutual information, where we take all classifiers from 1 to L (the |
---|
0:03:21 | potential pool), can be decomposed into three different terms. |
---|
0:03:27 | The first term is very familiar to us. It is basically the sum of the individual classifiers' |
---|
0:03:36 | accuracies. |
---|
0:03:38 | You usually try to maximize this term. |
---|
0:03:42 | Actually, maximizing this term can lead to maximizing the mutual information. |
---|
0:03:48 | But we have to subtract the second term here, so it is not very nice. |
---|
0:03:55 | This term here, which I am not going to discuss in much detail, is a |
---|
0:04:07 | kind of mutual information. |
---|
0:04:11 | Here we only have the classifier outputs; we don't have the class label. |
---|
0:04:18 | It is basically a correlation term, but we take all subsets of all the classifiers. |
---|
0:04:26 | Basically, minimizing the correlation over the subsets and maximizing the first term can lead to |
---|
0:04:40 | maximizing the mutual information. |
---|
0:04:43 | The last term is an interesting term: you take all the subsets and you |
---|
0:04:52 | compute the mutual information again, but conditioned on the class label. |
---|
0:05:00 | This term is additive. |
---|
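A sketch of the three-term decomposition being described, written as I understand Brown's (2009) result rather than copied from the slide; here I({X_S}) denotes the interaction information among the classifier outputs in subset S:

```latex
I(X_{1:L}; Y) \;=\;
\underbrace{\sum_{i=1}^{L} I(X_i; Y)}_{\text{individual relevance}}
\;-\;
\underbrace{\sum_{\substack{S \subseteq \{1,\dots,L\} \\ |S| \ge 2}} I(\{X_S\})}_{\text{redundancy over subsets}}
\;+\;
\underbrace{\sum_{\substack{S \subseteq \{1,\dots,L\} \\ |S| \ge 2}} I(\{X_S\} \mid Y)}_{\text{class-conditional redundancy}}
```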
0:05:04 | The conclusion is that we should not only minimize the correlation between classifiers at this |
---|
0:05:14 | higher level, over groups of increasing ensemble size, but we also have to have |
---|
0:05:24 | strong conditional correlation. |
---|
0:05:29 | I don't think this gives a general recipe, but I think it gives some kind |
---|
0:05:38 | of idea that the conventional wisdom on complementarity might not be so accurate. |
---|
0:05:46 | The topic of this talk is that we use sparseness to do this kind |
---|
0:05:56 | of task automatically. We don't consider the decomposition explicitly. |
---|
0:06:02 | I don't try to hand-optimize this ensemble. |
---|
0:06:17 | We don't base the selection on optimizing any diversity measure. We don't try to |
---|
0:06:25 | do anything like that. |
---|
0:06:28 | Instead, I use sparse regression to optimize the ensemble size. |
---|
0:06:35 | But regularized regression introduces an extra parameter, the regularization parameter, and that is not very nice. |
---|
0:06:45 | I would like to get rid of this parameter. I have treated it as a hyper- |
---|
0:06:50 | parameter and optimized it at the same time as we optimize the actual classifier, the fusion device. |
---|
0:06:57 | The attempt here was to use variational Bayes, because it is a nice framework. |
---|
0:07:06 | We can integrate hyperparameter estimation into the same objective. |
---|
0:07:13 | We attempt to integrate over all parameters to get the posterior. |
---|
0:07:32 | OK, the motivation is basically this: why do we use regularization for sparseness of |
---|
0:07:39 | the fusion? |
---|
0:07:41 | Here we have two classifiers, and CWLR is the cost function. |
---|
0:07:51 | The x-axis corresponds to classifier one's weight, the y-axis to classifier two's weight. The figure on the left is for the |
---|
0:07:56 | training set, the figure in the middle is for the development set, and the figure on |
---|
0:08:02 | the right is for the final evaluation set. |
---|
0:08:06 | If we use the L2 norm as regularization, we can see the following: |
---|
0:08:12 | The red round point is the L2 optimum that we find, the red cross |
---|
0:08:18 | is the unconstrained optimum, and the black square is the L1 optimum. We can see |
---|
0:08:25 | that on the training data the L2 optimum is closer to the unconstrained optimum. |
---|
0:08:31 | For the development set this is also true. But when we move to the test set, |
---|
0:08:39 | you can see that classifier w2 has been zeroed out, which actually ends up closer |
---|
0:08:46 | to the true minimum for that set. |
---|
0:08:51 | The minimum has changed; the CWLR function has changed on the SRE set. Of course, |
---|
0:09:11 | we were lucky in this case. |
---|
0:09:17 | But suppose this happens on real data. |
---|
0:09:21 | So, if we had optimized to the unconstrained minimum, the solution would be here. |
---|
0:09:28 | And now, with the sparse solution, we are much closer to the true minimum. |
---|
0:09:35 | So it tells us that we can definitely |
---|
0:09:38 | zero out a classifier, and it can give us |
---|
0:09:41 | a real benefit. |
---|
0:09:44 | Basically, we have to find which value of the Lagrange coefficient to use. This is done by |
---|
0:09:54 | cross-validation. |
---|
0:10:00 | We have a discriminative probabilistic framework here, and we are using it to optimize the |
---|
0:10:09 | fusion weights. |
---|
0:10:20 | Optimizing the logistic sigmoid model leads to the cross-entropy cost. |
---|
0:10:35 | And so the |
---|
0:10:37 | thing that Niko has proposed is to take the whole log-likelihood ratio |
---|
0:10:40 | in a cross-entropy weighted by the |
---|
0:10:44 | proportions of target and non-target trials in the |
---|
0:10:47 | actual training set, |
---|
0:10:51 | but we also have here |
---|
0:10:53 | an additive term, |
---|
0:10:55 | the |
---|
0:10:56 | logit of the effective prior pi. |
---|
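A minimal sketch of this prior-weighted cross-entropy cost, assuming the usual formulation in which each class is weighted by its effective prior over its trial count and the fused score gets a logit-prior offset; the function and variable names below are illustrative, not from the talk, and normalization conventions differ slightly between papers:

```python
import numpy as np

def weighted_cross_entropy(w, b, scores_tar, scores_non, p_eff=0.5):
    """Prior-weighted logistic regression cost for score fusion (sketch).

    scores_tar: (N_tar, L) per-classifier scores for target trials
    scores_non: (N_non, L) per-classifier scores for non-target trials
    w, b:       fusion weights (length L) and bias
    p_eff:      effective target prior; logit(p_eff) is the additive offset
    """
    offset = np.log(p_eff / (1.0 - p_eff))      # logit of the effective prior
    llr_tar = scores_tar @ w + b                # fused scores, treated as LLRs
    llr_non = scores_non @ w + b
    # each class is weighted by its prior over its trial count, so the cost
    # does not depend on the target/non-target ratio of the training list
    c_tar = p_eff / len(llr_tar) * np.sum(np.log1p(np.exp(-(llr_tar + offset))))
    c_non = (1.0 - p_eff) / len(llr_non) * np.sum(np.log1p(np.exp(llr_non + offset)))
    return (c_tar + c_non) / np.log(2.0)        # report in bits
```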
0:11:04 | The idea of regularization is that we do a MAP estimate, and the Laplace prior |
---|
0:11:16 | is the double exponential. |
---|
0:11:20 | Basically, we can see the prior probability of the weight parameter for each |
---|
0:11:30 | classifier j. |
---|
0:11:36 | Here we can assume that the parameter is the same for all dimensions. In the case |
---|
0:11:47 | of ridge regression, we have an isotropic Gaussian prior. |
---|
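For reference, a sketch of the correspondence just described, in my own notation: a Laplace (double exponential) prior gives L1-regularized MAP estimation, while an isotropic Gaussian prior gives ridge (L2) regularization:

```latex
p(w_j \mid \lambda) = \tfrac{\lambda}{2}\, e^{-\lambda |w_j|}
\;\Rightarrow\;
\hat{w}_{\mathrm{MAP}} = \arg\min_{w}\; C_{\mathrm{wlr}}(w) + \lambda \sum_j |w_j|,
\qquad
p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)
\;\Rightarrow\;
\hat{w}_{\mathrm{MAP}} = \arg\min_{w}\; C_{\mathrm{wlr}}(w) + \tfrac{\alpha}{2}\, \lVert w \rVert_2^2
```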
0:11:53 | For variational Bayes, we follow the treatment in Bishop's book, chapter 10. |
---|
0:11:59 | Now we are not looking for the MAP estimate but for the whole posterior. It is approximated |
---|
0:12:09 | as q(w) q(a) q(t); here we factorize over all hidden parameters. |
---|
0:12:19 | But we have an additional problem here: the likelihood term has to be approximated by a bound h(w, z). |
---|
0:12:30 | We have one scalar z for each training score vector, and it has |
---|
0:12:39 | to be optimized in the same VB loop. |
---|
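A minimal sketch of the variational loop described above, assuming the standard local-bound treatment from Bishop's chapter 10 with a single shared precision alpha; the function names, default hyperparameters, and fixed iteration count are illustrative choices of mine:

```python
import numpy as np

def lam(xi):
    """Local variational bound coefficient lambda(xi) = tanh(xi/2) / (4*xi)."""
    xi = np.maximum(xi, 1e-8)            # guard the xi -> 0 limit
    return np.tanh(xi / 2.0) / (4.0 * xi)

def vb_logistic(X, t, a0=1e-2, b0=1e-4, n_iter=50):
    """VB logistic regression sketch in the style of Bishop, ch. 10.

    X: (N, D) training score vectors (append a column of ones for the bias),
    t: (N,) labels in {0, 1}.
    Returns the Gaussian posterior q(w) = N(m, S) and E[alpha].
    """
    N, D = X.shape
    xi = np.ones(N)                      # one local bound parameter per trial
    a_n, b_n = a0 + D / 2.0, b0          # Gamma posterior over the shared precision
    for _ in range(n_iter):
        e_alpha = a_n / b_n              # E[alpha] under q(alpha)
        # update q(w) = N(m, S) given the current bound parameters
        S = np.linalg.inv(e_alpha * np.eye(D) + 2.0 * (X.T * lam(xi)) @ X)
        m = S @ (X.T @ (t - 0.5))
        # update the local bound parameters (one scalar per training vector)
        xi = np.sqrt(np.einsum('nd,de,ne->n', X, S + np.outer(m, m), X))
        # update q(alpha)
        b_n = b0 + 0.5 * (m @ m + np.trace(S))
    return m, S, a_n / b_n
```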
0:12:45 | Here the natural choice for the distribution of alpha (we saw alpha on the previous slide) |
---|
0:13:03 | is a Gamma; here the Gamma is made non-informative. |
---|
0:13:13 | The interesting point here is that the mean of the predictive density is just an inner |
---|
0:13:20 | product, so it is consistent for us to use normal linear scoring. |
---|
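One standard way to see this (a sketch using the usual Gaussian/probit approximation to the predictive integral, not necessarily the talk's exact derivation): the predictive probability depends on the score vector x only through the inner product of x with the posterior mean m:

```latex
p(t = 1 \mid x) \;\approx\; \sigma\!\bigl(\kappa(\sigma_a^2)\, m^{\top} x\bigr),
\qquad
\sigma_a^2 = x^{\top} S\, x,
\qquad
\kappa(\sigma_a^2) = \bigl(1 + \pi \sigma_a^2 / 8\bigr)^{-1/2}
```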
0:13:31 | As I explained earlier, we are interested in having a sparse solution. |
---|
0:13:46 | I tried to use automatic relevance determination (ARD) for p(w|alpha). |
---|
0:13:51 | We have A as a diagonal matrix, so there is one precision for each classifier. |
---|
0:14:03 | And we have a product of Gammas instead of having just one Gamma. |
---|
0:14:16 | The general idea is that classifiers that don't play |
---|
0:14:25 | any role are driven down to zero. |
---|
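A sketch of how the single shared precision in the previous loop would become per-classifier precisions under ARD, assuming the standard per-weight Gamma updates; the names and default hyperparameters are again illustrative:

```python
import numpy as np

def ard_update(m, S, a0=1e-2, b0=1e-4):
    """One ARD hyperparameter update, given the current q(w) = N(m, S).

    Returns E[alpha_j], one precision per fusion weight; in the VB loop the
    shared prior term e_alpha * np.eye(D) would be replaced by np.diag(e_alpha).
    """
    a_j = a0 + 0.5                            # Gamma shape, identical for every weight
    b_j = b0 + 0.5 * (m ** 2 + np.diag(S))    # Gamma rate, one per weight
    return a_j / b_j                          # a large E[alpha_j] pins w_j near zero
```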
0:14:37 | Our setup uses NIST SRE 2008. The extended trial list is split into |
---|
0:14:54 | two trial lists. |
---|
0:14:58 | One set is for training the fusion device, and the other is the cross-validation set. |
---|
0:15:05 | Then we evaluate on the NIST SRE 2010 core set. |
---|
0:15:13 | Here we can see the results of variational Bayes logistic regression. |
---|
0:15:19 | I forgot to mention that in this setting there is maybe no cross-validation |
---|
0:15:27 | needed at all, and so the complexity of the operation is the same as the |
---|
0:15:36 | standard FoCal approach. |
---|
0:15:41 | We can see that for itv-itv my result is the best. My result in |
---|
0:15:52 | minDCF gets a slight improvement, |
---|
0:15:56 | but the actual DCF result is not well calibrated. |
---|
0:16:03 | Unfortunately, there were only two classifiers that, for some reason I |
---|
0:16:15 | can't explain, got anywhere near zero. |
---|
0:16:25 | I searched the literature; there are many comments about using ARD, and some people |
---|
0:16:34 | complain that ARD somehow under-fits the data. |
---|
0:16:39 | And I guess the solution here is that instead of ARD I have to |
---|
0:16:46 | use another, stronger prior that has a stronger regularization ability. |
---|
0:16:53 | So let's look at another condition; in some cases the standard logistic regression actually performs better. |
---|
0:17:07 | On the other hand, in the tel-tel case I got some improvement over the |
---|
0:17:16 | standard in equal error rate (EER). So that was not an intended consequence. |
---|
0:17:27 | So the interesting point here is that there is quite a big problem with calibration. At |
---|
0:17:39 | least this variational Bayes logistic regression does not calibrate well. |
---|
0:17:48 | So I think we would need an extra calibration step. I didn't try that yet. Actually, |
---|
0:17:56 | I am more interested in changing the prior than in working on this baseline. |
---|
0:18:04 | But on the other hand, we produce some scores, so we can of course add |
---|
0:18:10 | subset selection on top of this result. |
---|
0:18:14 | And so here, this is a bit ad hoc in the case of variational Bayes: |
---|
0:18:20 | I can impose the L0 norm on this result, so this |
---|
0:18:27 | forces a sparse solution. |
---|
0:18:34 | So you will see here what happens with standard logistic regression: when we scan |
---|
0:18:46 | the ensemble size, there is definitely a minimum at around 8 or 9 classifiers. |
---|
0:19:19 | But that is actually smaller than our predicted subset size. |
---|
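A sketch of the kind of exhaustive L0 scan described above; `train_fusion` and `evaluate` are placeholder callbacks of mine, standing in for training the fusion weights on a classifier subset and scoring that subset on the cross-validation list:

```python
from itertools import combinations

def l0_subset_scan(train_fusion, evaluate, n_classifiers):
    """Brute-force L0 subset selection (sketch).

    train_fusion(subset) -> fusion weights trained on that subset of classifiers,
    evaluate(subset, weights) -> cross-validation cost (e.g. EER or CWLR).
    Exponential in n_classifiers, so only feasible for small pools (here 12).
    """
    best_per_size = {}
    for k in range(1, n_classifiers + 1):
        best_per_size[k] = min(
            ((s, evaluate(s, train_fusion(s)))
             for s in combinations(range(n_classifiers), k)),
            key=lambda pair: pair[1],
        )
    # overall winner across all ensemble sizes
    overall = min(best_per_size.values(), key=lambda pair: pair[1])
    return overall, best_per_size
```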
0:19:26 | This is the logistic regression baseline. We can see there is a gain from 3.55 to |
---|
0:19:36 | 3.40 in EER. On the other hand, the oracle tells us what the correct |
---|
0:19:46 | size is, what the correct subset is. And we would have a large gain there. |
---|
0:20:03 | But the interesting point here is that when we apply variational Bayes, the behavior is actually much |
---|
0:20:12 | closer to our prediction; we follow the oracle bound quite closely. |
---|
0:20:22 | Unfortunately, because the scores are not well calibrated, the actual DCF |
---|
0:20:27 | doesn't look so nice. |
---|
0:20:31 | But we could do post-calibration. |
---|
0:20:37 | Here is the comparison table. I only did this for the itv-itv set. We get an |
---|
0:20:46 | improvement in equal error rate, from 3.48 for the full set to 3.37 for the subset. And |
---|
0:20:54 | now we have only 6 classifiers left in our pool. |
---|
0:21:00 | Of course, we have exponential time complexity in the size of the original classifier pool. Here |
---|
0:21:09 | I have 12 classifiers in total, so it doesn't cost so much, but it is |
---|
0:21:18 | still really impractical in a real application. |
---|
0:21:22 | Note that this reintroduces a regularization parameter again here, but it can bring some benefit. |
---|
0:21:33 | The difference between standard logistic regression and the VB regression is not great, but ... |
---|
0:21:48 | It is possible to add extra parameters to our regularization. One option is to use the |
---|
0:21:59 | elastic net. Now we are back to the MAP estimate. The elastic net is basically a convex combination of the L1 |
---|
0:22:10 | and L2 penalties. |
---|
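For reference, the elastic-net penalty in its usual convex-combination form (notation mine):

```latex
\hat{w} = \arg\min_{w}\; C_{\mathrm{wlr}}(w)
  + \lambda \Bigl( \beta \sum_j |w_j| + \tfrac{1 - \beta}{2} \sum_j w_j^2 \Bigr),
  \qquad 0 \le \beta \le 1
```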
0:22:12 | There is another possibility: sometimes LASSO can regularize |
---|
0:22:21 | too harshly. So we can restrict LASSO so that it does not regularize one chosen classifier. |
---|
0:22:27 | This method is called restricted LASSO. |
---|
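A sketch of what restricted LASSO amounts to here, as I read the description: the L1 penalty is applied to every fusion weight except one chosen classifier k, which is left unregularized:

```latex
\hat{w} = \arg\min_{w}\; C_{\mathrm{wlr}}(w) + \lambda \sum_{j \ne k} |w_j|
```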
0:22:31 | Here we see the results for the complete set. Basically, for the itv-itv condition the ensemble |
---|
0:22:44 | size with LASSO is 6. It is a computationally efficient method. |
---|
0:22:58 | For interview-telephone we observe similar performance; the restricted LASSO gets the best result. |
---|
0:23:14 | Now, the interesting thing is in the restricted LASSO here. We want one classifier not to be |
---|
0:23:26 | regularized. We can see that the restricted LASSO gives a smaller ensemble size than the original LASSO for |
---|
0:23:39 | the interview-telephone sub-condition. So we have selected one classifier that is not itself regularized but still |
---|
0:23:52 | affects the others, causing other classifiers to go to zero. |
---|
0:24:01 | I can't explain this behavior, but I found the ensemble sizes produced here quite interesting. |
---|
0:24:13 | Here is the telephone-telephone sub-condition. The EER shown here is from variational Bayes. This is the only |
---|
0:24:27 | condition where sparsity didn't help in terms of actual DCF. |
---|
0:24:37 | Otherwise, in the other conditions, sparsity does help. |
---|
0:24:53 | It is surprising that ARD was not able to drive more classifiers |
---|
0:25:02 | to zero than the standard variational Bayes. |
---|
0:25:07 | On the other hand, there is the possibility that the prior has to be stronger. |
---|
0:25:15 | The elastic net shows the most promise, but we are not yet able to estimate its parameters efficiently. |
---|
0:25:22 | In future work we will study methods to automatically learn the elastic-net prior hyperparameters. |
---|
0:26:16 | Of course we try to do variational Bayes. We have hyperprior parameters, but then |
---|
0:26:28 | we can set those hyperprior parameters to be non-informative. |
---|
0:27:01 | The comparison between standard logistic regression and the VB method is totally fair. |
---|
0:27:11 | There are some studies in the literature, in different fields, where people do this |
---|
0:27:19 | kind of thing. They observe that optimizing the regularization parameter by cross-validation brings better performance, |
---|
0:27:27 | but using this kind of Bayesian approach brings more stable performance. |
---|
0:27:33 | It isn't as good, but it is more predictable. And that is my goal. |
---|