0:00:18 | So, they asked me to do the introduction for the opening plenary talk here. And
---|
0:00:23 | luckily, it's very easy to do, since we have Niko, who is... as everyone knows, |
---|
0:00:29 | has been part of The Odyssey Workshops; has become part of the institution of the |
---|
0:00:33 | Odyssey Workshops itself. He's been involved in the area of speaker and language recognition for |
---|
0:00:38 | over twenty years. He started off working at Spescom and DataVoice and now he's the |
---|
0:00:44 | chief scientist at AGNITIO. He received his Ph.D. in two thousand and ten from the University
---|
0:00:49 | of Stellenbosch, where he also received
---|
0:00:51 | his undergraduate and Master's degrees.
---|
0:00:53 | He's been involved in the area of speaker and language recognition in various aspects of |
---|
0:00:56 | it, those from |
---|
0:00:57 | working on the core technologies of the |
---|
0:01:01 | classifiers themselves: from generative models to discriminatively trained models. And working on the other side |
---|
0:01:07 | of calibration and how you evaluate. And in today's talk Niko is going to address that one
---|
0:01:12 | area that he's had a lot of contributions in over the years: how we can
---|
0:01:16 | go about evaluating the systems |
---|
0:01:18 | we build: How do we know how well they're working and how can we
---|
0:01:21 | do this in a way that's going to show their utility for downstream applications? So,
---|
0:01:25 | with that, I hand it over to Niko to begin his talk |
---|
0:01:39 | Thank you very much, Doug
---|
0:01:41 | to be here, thank you |
---|
0:01:45 | So, when Haizhou invited me, he asked me to say something about calibration and fusion |
---|
0:01:54 | which I've been doing for
---|
0:01:56 | years. So, I'll do so by discussing proper scoring rules, the basic principle that underlies |
---|
0:02:05 | all of this work. So, fusion you can do in many ways. Proper scoring rules |
---|
0:02:09 | are a good way to do fusion, but they're not essential for fusion. But, in
---|
0:02:14 | my view, if you're talking about calibration, you do need proper scoring rules. |
---|
0:02:20 | So, they've been around since nineteen fifty. The Brier score was
---|
0:02:26 | proposed for the evaluation of |
---|
0:02:29 | the goodness of probabilistic weather forecasting. Since then, they've been in the
---|
0:02:36 | statistics literature, even up to the present. In pattern recognition, machine learning and speech processing they're
---|
0:02:43 | not that well known, but in fact, if you use maximum likelihood for generative training, |
---|
0:02:50 | or cross-entropy for discriminative training, you are in practice using the logarithmic scoring rule.
---|
0:02:57 | So, you've probably all used it already |
---|
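As a minimal illustration of that last point (my own sketch, not from the talk; the function names and toy data are hypothetical): the familiar cross-entropy objective is just the logarithmic scoring rule averaged over a labelled data set.

```python
import numpy as np

def log_scoring_rule(q, y):
    """Logarithmic proper scoring rule: cost of having predicted
    probability q for an event whose true outcome is y (1 or 0)."""
    return -np.log(q) if y == 1 else -np.log(1.0 - q)

def cross_entropy(predicted_probs, labels):
    """Average logarithmic score over a labelled set = cross-entropy."""
    return float(np.mean([log_scoring_rule(q, y)
                          for q, y in zip(predicted_probs, labels)]))

# toy example: three trials with probabilistic predictions and true labels
print(cross_entropy([0.9, 0.2, 0.7], [1, 0, 1]))
```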
0:03:01 | In the future, we may be seeing more of |
---|
0:03:05 | proper scoring rules in machine learning. We've got these new restricted Boltzmann machines and
---|
0:03:13 | other energy-based models, which are now becoming very popular. They're very difficult to train,
---|
0:03:19 | because you can't work out the likelihood. Hyvarinen proposed a proper scoring rule for how
---|
0:03:26 | to attack that problem, and if you google you'll find some papers... some recent papers |
---|
0:03:30 | on that as well. |
---|
0:03:32 | so, I'll concentrate |
---|
0:03:35 | on our own application of proper scoring rules in our field, and
---|
0:03:41 | to promote better understanding of the concept of calibration itself and how to form |
---|
0:03:48 | training algorithms which are calibration sensitive and... and the evaluation measures |
---|
0:03:55 | So, I'll start by outlining the problem that we are trying to solve. And then, |
---|
0:04:02 | I don't know if you can see the grey... but then I'll... I'll introduce proper scoring rules
---|
0:04:07 | and then, the last section will be how to design proper scoring rules with the |
---|
0:04:12 | several different ones, how to design them to do what you want them to do |
---|
0:04:17 | So, not all pattern recognition needs to be probabilistic. You can build a nice recognizer
---|
0:04:24 | with an SVM classifier and you don't need to think about probabilities even to do |
---|
0:04:29 | that. But in this talk, we're interested in probabilistic pattern recognition, where the output is |
---|
0:04:37 | a probability, or a likelihood, or a likelihood ratio. So, if you can get the |
---|
0:04:44 | calibration right, that part of output is more useful than just hard decisions. |
---|
0:04:51 | In machine learning and also in speech recognition, if you do probabilistic recognition, you might |
---|
0:04:57 | be used to seeing a posterior probability as an output. An example is a phone |
---|
0:05:02 | recognizer where there will be forty or so posterior probabilities given |
---|
0:05:08 | input frames. But in speaker and language, there are good reasons why we want to |
---|
0:05:13 | use class likelihoods rather than posteriors. And if there are two classes, as in speaker |
---|
0:05:19 | recognition, then likelihood ratio is the most convenient. So, what I'm about to |
---|
0:05:26 | about... we can do all of those things, it doesn't really matter which of those |
---|
0:05:31 | forms we use |
---|
0:05:33 | so, we're interested |
---|
0:05:35 | in a pattern recognizer that takes some form of input, maybe the acoustic feature vectors,
---|
0:05:42 | maybe an i-vector, or maybe just even a score. And then the output will be |
---|
0:05:47 | a small number of discrete classes, for example: target and non-target in speaker recognition, or
---|
0:05:52 | a language recognition a number of language classes |
---|
0:05:55 | So, the output of the recognizer might be in likelihood form:
---|
0:06:02 | given one piece of data, you have a likelihood for... for each of the classes |
---|
0:06:09 | here |
---|
0:06:10 | if you also have a prior; and for the purposes here you can consider the |
---|
0:06:15 | prior as given, a prior distribution of the classes; then it's easy, we just plug |
---|
0:06:20 | that into Bayes' rule and you get the posterior. So, you can go from the
---|
0:06:24 | posterior to the likelihoods or the other way round. They're equivalent, they have the same |
---|
0:06:28 | information |
---|
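For reference, a sketch of the Bayes' rule link being described here (the notation is mine, not from the slides): with a given prior, likelihoods and posteriors carry the same information.

```latex
% General multiclass form, and the two-class form written with the likelihood ratio.
\[
  P(H_i \mid x) = \frac{\pi_i\, p(x \mid H_i)}{\sum_j \pi_j\, p(x \mid H_j)},
  \qquad
  P(\mathrm{tar} \mid x) = \frac{\pi\, \mathrm{LR}(x)}{\pi\, \mathrm{LR}(x) + (1-\pi)},
  \quad
  \mathrm{LR}(x) = \frac{p(x \mid \mathrm{tar})}{p(x \mid \mathrm{non})}.
\]
```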
0:06:30 | Also, if the one side is well calibrated, we can say the other side is |
---|
0:06:34 | well calibrated as well. So, it doesn't really matter on which side of Bayes' rule
---|
0:06:38 | we look at calibration. So, |
---|
0:06:42 | the recognizer output, for the purposes of this presentation, will be here |
---|
0:06:49 | and we'll look at measuring calibration on the other side of Bayes' rule.
---|
0:06:54 | So, why is calibration necessary? |
---|
0:06:58 | Because our models are imperfect models of the data. |
---|
0:07:02 | Even when that model manages to extract information that could in principle discriminate with high accuracy
---|
0:07:08 | between the classes, the probabilistic representation might not be optimal. For example, it might be |
---|
0:07:15 | over-confident. The probabilities might all be very close to zero and one,
---|
0:07:19 | whereas the accuracy doesn't warrant that kind of high confidence. So, that's the calibration problem.
---|
0:07:27 | So, calibration analysis will help you to detect that problem and also to fix it. |
---|
0:07:35 | So, calibration can have two meanings: as a measure of goodness, how good is the |
---|
0:07:39 | calibration, and also as a... as a transformation. |
---|
0:07:42 | So, this is |
---|
0:07:45 | what the typical transformation might look like. We have a pattern recognizer, which outputs likelihoods. |
---|
0:07:53 | That recognizer might be based on some probabilistic model. The joint probability here, by which |
---|
0:08:00 | I want to indicate the model can be generative. Probability of data given class or |
---|
0:08:06 | the other way round, discriminative probability class given data, doesn't matter. You're probably going to |
---|
0:08:14 | do better if you recalibrate that output and again you... you could do that. This |
---|
0:08:21 | time we're modeling the scores. The likelihoods that come out of this model, we call |
---|
0:08:27 | them scores, features if you like, and we fit
---|
0:08:30 | another probabilistic model. The scores are simpler, of lower dimension than the original input, so they're
---|
0:08:35 | easier to model. Again, you can do a generative or discriminative modeling of the scores. |
---|
0:08:42 | What I'm about to show is going to be mostly about discriminative modeling, but you |
---|
0:08:45 | can do generative as well. |
---|
0:08:49 | so |
---|
0:08:51 | Why can we call the likelihoods that come out of the second stage calibrated? Because
---|
0:08:57 | we're going to measure them, we're going to measure how well they're calibrated and moreover, |
---|
0:09:01 | we're going to force them to be... to be well calibrated. |
---|
0:09:05 | so |
---|
0:09:09 | If you send the likelihoods through Bayes' rule, then you get the posterior, and that's where
---|
0:09:13 | we're going to measure the calibration with the proper scoring rule. |
---|
0:09:19 | so |
---|
0:09:20 | Obviously, you need to do this kind of measurement with the supervised evaluation database. So, |
---|
0:09:27 | you apply the proper scoring rule to every example in the database and then you
---|
0:09:31 | average the values of the proper scoring rule. That's your measure of goodness of
---|
0:09:36 | your recognizer on this database, and you plug that into the training algorithm as
---|
0:09:42 | your objective function, and you can adjust the calibration parameters, and that's the way you |
---|
0:09:47 | force your calibrator to |
---|
0:09:52 | produce calibrated likelihoods. |
---|
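A minimal sketch of the training setup just described (my own illustration, not the speaker's code; the helper names are hypothetical): an affine calibration of log-likelihood ratios trained by minimizing Cllr, the average logarithmic scoring rule, over a supervised score set. In practice one would use something like the FoCal toolkit's logistic-regression calibration.

```python
import numpy as np
from scipy.optimize import minimize

def cllr(llrs, labels):
    """Class-balanced average of the logarithmic scoring rule (in bits),
    applied to posteriors obtained from the LLRs at a prior of one half."""
    llrs, labels = np.asarray(llrs, float), np.asarray(labels)
    tar, non = llrs[labels == 1], llrs[labels == 0]
    return (np.mean(np.log1p(np.exp(-tar))) +
            np.mean(np.log1p(np.exp(non)))) / (2.0 * np.log(2.0))

def train_affine_calibration(scores, labels):
    """Find scale a and offset b so that a*score + b minimizes Cllr."""
    scores = np.asarray(scores, float)
    objective = lambda w: cllr(w[0] * scores + w[1], labels)
    return minimize(objective, x0=[1.0, 0.0], method="Nelder-Mead").x

# toy usage: raw scores and labels (1 = target, 0 = non-target)
a, b = train_affine_calibration([2.0, 1.5, -0.5, -1.0], [1, 1, 0, 0])
```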
0:09:54 | So, you can use the same assembly for fusion, if you have multiple systems to fuse
---|
0:09:59 | to a final... to a fusion point; or more generally, you can just train your whole
---|
0:10:05 | recognizer. |
---|
0:10:08 | with the same principle. |
---|
0:10:11 | So in summary of this part, calibration is easiest applied to the likelihoods |
---|
0:10:16 | simple affine transforms work very well in the log-likelihood domain, but the measurement is based on
---|
0:10:24 | the posteriors, and that's going to be done with proper scoring rules. So,
---|
0:10:28 | let's introduce proper scoring rules |
---|
0:10:33 | I'll first talk about the classical definition of proper scoring rules; then, a more engineering
---|
0:10:39 | viewpoint |
---|
0:10:42 | how you can define them by decision theory. It is also very useful to look |
---|
0:10:46 | at them in information theory, that will tell you how much information the recognizer is |
---|
0:10:51 | delivering to the user. But that won't be directly relevant to this talk, so I'll |
---|
0:10:56 | just refer you to this reference. |
---|
0:11:00 | So, we start with the |
---|
0:11:03 | classical definition and the sort of canonical example is weather forecast. |
---|
0:11:11 | so |
---|
0:11:13 | We have a weather forecaster. He predicts whether it will rain tomorrow or not and |
---|
0:11:17 | he has a probabilistic prediction, he gives us a probability for rain. |
---|
0:11:22 | The next day, it rains or it doesn't. How do we decide whether that was |
---|
0:11:27 | a good probability or not? |
---|
0:11:29 | So, it's reasonable to choose some kind of a cost function. So, you put the |
---|
0:11:34 | probability, the prediction in there, as well as the fact whether it rained or not. |
---|
0:11:38 | So, what should this cost function look like? |
---|
0:11:44 | It's not so obvious how this cost function should look. If, for example, temperature was |
---|
0:11:50 | predicted, it's easy, you can compare the predicted against the actual temperature and just compute |
---|
0:11:57 | some kind of the squared difference. But in this case, it's a probabilistic prediction on |
---|
0:12:03 | the day that it rains or not, there's no true probability for rain, so we
---|
0:12:08 | can't do that kind of |
---|
0:12:09 | direct comparison |
---|
0:12:11 | So, the solution to forming such a cost function is the family of cost functions |
---|
0:12:19 | for proper scoring rules |
---|
0:12:22 | and they have |
---|
0:12:25 | two nice properties: first of all, they force the prediction to be as accurate as
---|
0:12:30 | possible, but subject to honesty. You can't pretend that your |
---|
0:12:37 | prediction is more accurate than it actually is. So,
---|
0:12:42 | you need these two things to work together. |
---|
0:12:47 | so |
---|
0:12:49 | this is a simple picture of how weather forecast might... might be done. You've got |
---|
0:12:55 | the data, which comes from satellites and other sensors |
---|
0:12:59 | and the probabilistic model, and then you compute the probability for rain, given the observations |
---|
0:13:05 | in the model. Posterior probability. So, the weather forecaster might ask himself: Do I predict |
---|
0:13:11 | what I calculated or do I output |
---|
0:13:16 | some warping or reinterpretation of this probability, maybe that would be more useful for my |
---|
0:13:22 | users? Maybe my boss will be happier with me if I pretend that my predictions |
---|
0:13:27 | are more accurate than they really are? So, |
---|
0:13:31 | If the weather forecaster trusts his model and his data,
---|
0:13:37 | then we can't really do better than the weather forecaster; we're not weather forecasters.
---|
0:13:42 | So, what we do want is his best probability, p, the one that he calculated, |
---|
0:13:47 | not something else. So, how do we force him to do that? |
---|
0:13:52 | so, we tell the weather forecaster: Tomorrow |
---|
0:13:56 | when you've predicted some q, which might be different from p, which we really want; |
---|
0:14:01 | we are going to evaluate you with the proper scoring rule, with this type of |
---|
0:14:05 | cost function. Then, the weather forecaster, he doesn't know whether it's going to rain or |
---|
0:14:10 | not. The best information he has is his prediction, p. So, he forms an expected |
---|
0:14:16 | value for the way he's going to be evaluated tomorrow. What's my expected cost that |
---|
0:14:23 | I'm going to be evaluated with tomorrow? So, |
---|
0:14:26 | proper scoring rule satisfies this expectation requirement. So, |
---|
0:14:34 | this probability, p, forms the expectation |
---|
0:14:38 | q is what he submits and |
---|
0:14:40 | with a proper scoring rule, you're always going to do better if you submit p instead of
---|
0:14:48 | q. So that is the way that the proper scoring rule motivates honesty.
---|
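Written out, the expectation requirement being described is (a sketch of the standard definition, with C the cost, p the forecaster's belief and q what he reports):

```latex
\[
  \mathbb{E}_{y \sim p}\,[\,C(q, y)\,] \;\ge\; \mathbb{E}_{y \sim p}\,[\,C(p, y)\,]
  \qquad \text{for every reported distribution } q,
\]
% i.e. under his own belief p, the forecaster's expected cost is never
% reduced by reporting anything other than p itself.
```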
0:14:55 | The same mechanism also motivates him to make it more accurate. So, |
---|
0:15:03 | he might sit down and think: If I have a bigger computer, if I launch |
---|
0:15:07 | more satellites, I could get a better prediction. And even though I don't have the |
---|
0:15:12 | better prediction, if I had it, I would form my expectation with the better prediction. |
---|
0:15:19 | And the same mechanism then says: well, we would do better with the better prediction, |
---|
0:15:24 | it's kind of obvious, but the proper scoring rule makes that obvious statement work mathematically. |
---|
0:15:33 | Here's another view. It turns out if you form the... if you look at the |
---|
0:15:39 | expected cost of the proper scoring rule, as a function of the predicted probability, then |
---|
0:15:44 | you get the minima at the vertices of the probability simplex. So, |
---|
0:15:50 | this is very much like the entropy function. In fact, if you use the logarithmic |
---|
0:15:54 | scoring rule, this is just the Shannon entropy, so minimizing expected cost is the same
---|
0:16:00 | as... as |
---|
0:16:02 | minimizing entropy uncertainty. So, |
---|
0:16:06 | driving down expected cost tends to favour |
---|
0:16:13 | sharper predictions, that they... they sometimes call it. But it has to be subject to |
---|
0:16:18 | calibration as well. |
---|
0:16:21 | so |
---|
0:16:22 | why are we going on about what humans might do? We can motivate machines. That |
---|
0:16:29 | is called discriminative training. And we can expect the same benefits. |
---|
0:16:36 | Some examples. There are many different proper scoring rules; two very well known ones are
---|
0:16:42 | the Brier score, which has this quadratic form,
---|
0:16:45 | I'll show... I'll show a graph just now... and also the logarithmic score |
---|
0:16:50 | In both cases it's really easy to show that they do satisfy this expectation requirement.
---|
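A small numerical check of that claim (my own sketch; the function names are made up): for both rules, the expected cost under a belief p is minimized by reporting q = p.

```python
import numpy as np

def brier(q, y):
    """One common binary form of the Brier (quadratic) scoring rule."""
    return (q - y) ** 2

def logarithmic(q, y):
    """Logarithmic scoring rule for a binary event."""
    return -np.log(q if y == 1 else 1.0 - q)

def expected_cost(rule, q, p):
    """Expected cost of reporting q when the honest belief is p."""
    return p * rule(q, 1) + (1.0 - p) * rule(q, 0)

p = 0.7                                   # the forecaster's honest belief
grid = np.linspace(0.01, 0.99, 99)
for rule in (brier, logarithmic):
    best_q = grid[np.argmin([expected_cost(rule, q, p) for q in grid])]
    print(rule.__name__, "expected cost is minimized near q =", round(best_q, 2))
```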
0:16:58 | So, here's an example: if |
---|
0:17:01 | the top left. If it does rain, we're looking at the green curve. If you |
---|
0:17:07 | predicted zero probability for rain, that's bad, so the cost is high. If you predicted |
---|
0:17:13 | one, probability one, that's good, so the cost is low. If it doesn't rain it |
---|
0:17:18 | works the other way round. So, that's the Brier score. This is the logarithmic, very |
---|
0:17:21 | similar, except it goes out to |
---|
0:17:23 | infinity here. |
---|
0:17:25 | If you take another view, you can do a log-odds transformation on the probability and then you see,
---|
0:17:30 | they look very different. |
---|
0:17:33 | the logarithmic one tends... turns out to form nice convex objective functions, which are easier |
---|
0:17:40 | to numerically optimize |
---|
0:17:42 | the Brier score is a little bit harder to optimize |
---|
0:17:48 | so now, let's switch to the |
---|
0:17:51 | engineering view of proper scoring rule. So, we're building these recognizers because we actually want |
---|
0:17:57 | to use them for some useful purpose, we want to |
---|
0:18:03 | do whatever we're doing in a cost effective way. We want to minimize expected cost. |
---|
0:18:08 | So, if you ask what are the consequences of Bayes decision that I can make |
---|
0:18:15 | with some probabilistic prediction? Then you've really already constructed the proper scoring rule;
---|
0:18:21 | you just have to ask that very natural question. So, all proper scoring rules can |
---|
0:18:26 | be interpreted in that way. |
---|
0:18:29 | so |
---|
0:18:32 | I'm assuming everybody knows this. This is the example of the NIST detection cost function. |
---|
0:18:40 | You make some decision to accept or reject and it's a target or a non-target. |
---|
0:18:44 | And if you get it wrong there's some cost, if you get it right everything |
---|
0:18:48 | is good, the cost is zero. So, that's the consequence |
---|
0:18:51 | now we are using the probabilistic recognizer, which gives us this probability distribution, q, for |
---|
0:18:58 | target, One minus q for non target. And we want to make, we want to |
---|
0:19:03 | use that to make a decision, so we are making a minimum expected cost Bayes |
---|
0:19:08 | decision. We are assuming that input is well calibrated, that we can use it directly |
---|
0:19:14 | in the minimum expected cost Bayes decision. So, on the two sides of the inequality, |
---|
0:19:19 | we've got the expected cost. You choose the lowest expected cost and then you put
---|
0:19:23 | it into the cost function. So the cost function is used twice. You see, I |
---|
0:19:28 | highlighted the cost parameters that are used twice and the end result is then the |
---|
0:19:33 | proper scoring rule. So, you're comparing the probability distribution for the hypotheses with the true |
---|
0:19:40 | hypotheses, and the proper scoring rule then tells you how well these two match.
---|
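A sketch of that construction in code (my own illustration; the parameter names and default costs are assumptions): make the minimum-expected-cost Bayes decision from the submitted posterior, then charge the cost of that decision against the true class.

```python
def dcf_scoring_rule(q, is_target, c_miss=10.0, c_fa=1.0):
    """Proper scoring rule built from a cost function via Bayes decisions.

    q          -- submitted posterior probability that the trial is a target
    is_target  -- the true class of the trial
    """
    # minimum-expected-cost Bayes decision, using q as if it were well calibrated
    expected_cost_accept = (1.0 - q) * c_fa     # risk of a false alarm
    expected_cost_reject = q * c_miss           # risk of a miss
    accept = expected_cost_accept <= expected_cost_reject

    # the cost function is used a second time: charge the consequence
    if is_target:
        return 0.0 if accept else c_miss        # missed a target
    return c_fa if accept else 0.0              # false alarm on a non-target
```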
0:19:46 | so this is exactly how NIST this year will form their new evaluation criterion. All
---|
0:19:54 | the years up to two thousand and ten, they used just the DCF as is,
---|
0:20:00 | with hard input decisions... our output decisions. This year they'll use the proper scoring rule
---|
0:20:07 | and they'll ask for likelihood ratios. Of course, we have to put that through Bayes'
---|
0:20:12 | rule to get posterior and then it goes into the proper scoring rule. |
---|
0:20:17 | So, we can generalise this to more than two classes and you can really use |
---|
0:20:22 | any cost function, you can use more complicated cost functions. This recipe works: there's
---|
0:20:26 | this trivial inequality that shows that this type of construction of a proper scoring rule
---|
0:20:33 | satisfies the expectation requirement |
---|
0:20:39 | so |
---|
0:20:40 | in summary of this part, this Bayes decision interpretation tells us: If you need the |
---|
0:20:44 | proper scoring rule, take your favourite cost function |
---|
0:20:47 | apply this recipe, apply Bayes decisions and you'll have a proper scoring rule and then |
---|
0:20:53 | that will measure and optimize the cost effectiveness of your recognizer. |
---|
0:21:02 | so just a last word about the discrimination/calibration decomposition...
---|
0:21:09 | The Bayes decision scoring rule measures the full cost of using the probabilistic recognizer to make decisions.
---|
0:21:17 | So, often it's useful to decompose this cost into two components. The first might be |
---|
0:21:24 | the underlying inability of the recognizer to perfectly discriminate between the two classes. Even if |
---|
0:21:32 | you get the calibration optimal, you still can't recognize the classes perfectly. And then, the |
---|
0:21:38 | second component is the additional cost due to bad calibration. |
---|
0:21:44 | So, we've all been looking for... in my case for more than a decade, the |
---|
0:21:51 | NIST's actual DCF versus minimum DCF. That's very much the same kind of decomposition, but |
---|
0:22:00 | in this case, calibration refers only to setting your decision threshold. So, if we move |
---|
0:22:06 | to probabilistic output of the recognizer, that's a more general type of calibration, so does |
---|
0:22:14 | that same recipe apply... can we do that same kind of decomposition? My answer is yes.
---|
0:22:21 | I've tried it over the last few years with speaker and language recognition and in |
---|
0:22:26 | my opinion it's a useful thing to do. So, the recipe is |
---|
0:22:31 | at the output end of your recognizer you isolate a few parameters that you call |
---|
0:22:37 | the calibration parameters or you might add an extra stage and call that a calibration |
---|
0:22:42 | stage. If it's multiclass, maybe there's some debate about how to choose these parameters |
---|
0:22:48 | once you've done that, you choose whatever proper scoring rule you're going to use for |
---|
0:22:55 | your evaluation metric. |
---|
0:22:57 | and you use that over your supervised evaluation database; that's then called the actual cost.
---|
0:23:05 | Then, the evaluator goes and, using the true class labels, minimizes just those calibration parameters
---|
0:23:12 | and that reduces the cost |
---|
0:23:17 | somewhat. And then, let's call that the minimum cost, and then you can compare the |
---|
0:23:21 | actual to the minimum cost. If they are very close, you can say: My calibration |
---|
0:23:25 | was good. Otherwise, let's go back and see what went wrong. |
---|
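A sketch of that recipe (my own illustration; the cllr helper from the earlier calibration sketch is repeated here so the snippet stands alone, and the affine re-tuning stands in for whatever calibration parameters the evaluator is allowed to adjust):

```python
import numpy as np
from scipy.optimize import minimize

def cllr(llrs, labels):
    """Class-balanced average logarithmic scoring rule, in bits."""
    llrs, labels = np.asarray(llrs, float), np.asarray(labels)
    tar, non = llrs[labels == 1], llrs[labels == 0]
    return (np.mean(np.log1p(np.exp(-tar))) +
            np.mean(np.log1p(np.exp(non)))) / (2.0 * np.log(2.0))

def actual_and_min_cost(llrs, labels):
    """Actual cost of the scores as delivered, and the minimum cost after
    the evaluator re-optimizes only a scale and offset using the labels."""
    actual = cllr(llrs, labels)
    res = minimize(lambda w: cllr(w[0] * np.asarray(llrs, float) + w[1], labels),
                   x0=[1.0, 0.0], method="Nelder-Mead")
    minimum = res.fun
    return actual, minimum   # if these are close, the calibration was good
```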
0:23:32 | So, in the last part of the talk we're going to play around with proper
---|
0:23:36 | scoring rules a bit |
---|
0:23:39 | I proposed this Cllr for use in speaker recognition in two thousand and four,
---|
0:23:48 | but |
---|
0:23:49 | what I want to show here is that's not the only option |
---|
0:23:53 | you can adjust the proper scoring rule to target your |
---|
0:24:01 | application. So, |
---|
0:24:04 | I'll show how to do that |
---|
0:24:06 | so, the mechanism |
---|
0:24:09 | for, let's call it binary proper scoring rules, is the fact that you can combine |
---|
0:24:16 | proper scoring rules |
---|
0:24:18 | just a weighted summation of the proper scoring rules and once you do that it's |
---|
0:24:24 | still the proper scoring rule. So, you might have multiple different proper scoring rules representing |
---|
0:24:28 | slightly different applications, applications that work in different operating points |
---|
0:24:35 | if you do this kind of application... combination of those proper scoring rules, you get |
---|
0:24:40 | a new proper scoring rule that represents a mixture of applications. So, a real application probably
---|
0:24:46 | is not going to be a mixture; it'll be used just at a single operating point. But if
---|
0:24:50 | it's a probabilistic output, you can hope to apply it to a range of different |
---|
0:24:54 | operating points. So, this type of |
---|
0:24:58 | combination of proper scoring rules is then a nice way to
---|
0:25:02 | evaluate that kind of more generally applicable recognizer. |
---|
0:25:08 | So, NIST is also going to do that, this year in SRE twelve. They will |
---|
0:25:13 | use a combination of two discrete operating points in a proper scoring rule. So you can
---|
0:25:20 | do discrete combinations, or continuous combinations also. So, the interesting thing is that all
---|
0:25:29 | binary, two class proper scoring rules, can be described in this way, I'll show how |
---|
0:25:35 | that is done. So this DCF turns out to be fundamental building block for |
---|
0:25:42 | two class proper scoring rules |
---|
0:25:45 | This is the same picture I had before. I just normalized the cost function, so |
---|
0:25:52 | there's a cost of miss and false alarm, that's redundant. You don't really need those |
---|
0:25:56 | two costs, we can reduce it to one parameter. |
---|
0:25:59 | Because the magnitude of the proper scoring rule doesn't really tell us anything. So if |
---|
0:26:04 | you normalize it like this, then the expected cost at the decision threshold is always |
---|
0:26:09 | going to be one, no matter what the... what the parameters, no matter what the |
---|
0:26:13 | operating point. So, the parameter that we're using |
---|
0:26:18 | is the Bayes decision threshold:
---|
0:26:20 | the posterior probability for the target,
---|
0:26:23 | we compare that to this parameter t, which is the threshold |
---|
0:26:27 | The cost of miss is one over t, and the cost of false
---|
0:26:31 | alarm is one over one minus t. You see, if t is close to
---|
0:26:34 | zero, the one cost goes to infinity; if it's close to one, the other cost |
---|
0:26:37 | goes to infinity. So, you're covering the whole range of cost-ratio just by varying this |
---|
0:26:44 | parameter t. So, we'll call this the normalized DCF scoring rule and I've got the |
---|
0:26:51 | c star notation for it and the operating point is t. |
---|
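In symbols, the normalized DCF scoring rule being described is (a reconstruction from the spoken description, with q the posterior for the target and t the operating point):

```latex
\[
  C^{*}_{t}(q,\ \mathrm{target}) =
    \begin{cases} \dfrac{1}{t}, & q < t \\[4pt] 0, & q \ge t, \end{cases}
  \qquad
  C^{*}_{t}(q,\ \mathrm{non\text{-}target}) =
    \begin{cases} 0, & q < t \\[4pt] \dfrac{1}{1-t}, & q \ge t, \end{cases}
\]
% so that the expected cost at the threshold q = t is exactly one on both sides.
```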
0:26:59 | So, what does it look like? |
---|
0:27:02 | It's a very simple step function. |
---|
0:27:04 | If your posterior probability for the target is too low, you're going to miss the |
---|
0:27:09 | target; if you're below the threshold t, and you get hit with the miss cost. |
---|
0:27:14 | If p is high enough... let's suppose it really is the target: if p is
---|
0:27:18 | high enough, we pass t, the cost is zero. If it's not the target, it's |
---|
0:27:23 | the other way round in the step function, the red line goes up |
---|
0:27:28 | so |
---|
0:27:29 | you now have four different values of t, so if you adjust the parameters, then |
---|
0:27:35 | cost of miss is high, cost of false alarm is low. If you adjust it, |
---|
0:27:38 | they ... they change. |
---|
0:27:43 | in comparison I've got the logarithmic scoring rule
---|
0:27:47 | and you'll see, it looks very similar. It tends to follow the way that the |
---|
0:27:54 | miss and false alarm cost change, so you'll find indeed, if you integrate over all |
---|
0:27:59 | values of t |
---|
0:28:01 | then you will get the logarithmic scoring rule.
---|
0:28:07 | so |
---|
0:28:08 | All binary proper scoring rules can be expressed as a... as an expectation over operating |
---|
0:28:13 | points. So, the integrand here is the step function, the c star guy, as well
---|
0:28:18 | as some weighting distribution. |
---|
0:28:21 | so |
---|
0:28:25 | The weighting distribution is like a probability distribution: it has to be non-negative and it has to integrate to
---|
0:28:30 | one |
---|
0:28:31 | and it determines |
---|
0:28:34 | the nature of your proper scoring rule. Several properties depend on this weighting distribution and |
---|
0:28:39 | it also tells you what relative importance do I place on different operating points. |
---|
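The representation being described can be written as follows (a sketch in the notation used above, with w the weighting distribution over operating points):

```latex
\[
  C(q, y) \;=\; \int_{0}^{1} C^{*}_{t}(q, y)\, w(t)\, \mathrm{d}t,
  \qquad
  w(t) \ge 0, \quad \int_{0}^{1} w(t)\, \mathrm{d}t = 1 .
\]
```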
0:28:48 | There is a rich variety of things you can do. If you make the weighting function
---|
0:28:53 | an impulse... I shouldn't say function, it's a distribution;
---|
0:28:57 | an impulse in mathematics is not really a function.
---|
0:29:00 | Any case, if it's an impulse, we're looking at a single operating point. If it's |
---|
0:29:05 | a sum of impulses, we're looking at multiple operating points, discrete operating points. Or if |
---|
0:29:12 | it's a smooth probability distribution then we're looking at the continuous range of operating points. |
---|
0:29:20 | So, examples of the discrete ones, that could be the SRE ten operating point |
---|
0:29:25 | which is a step function... I mean, an impulse, at point nine... or in SRE
---|
0:29:33 | twelve we'll have |
---|
0:29:37 | you're looking at two operating points, a mixture of two points. |
---|
0:29:42 | If you do smooth weighting, this quadratic form over here gives the Brier score |
---|
0:29:48 | and the logarithmic score just uses a very simple constant weighting. So, weighting matters a
---|
0:29:54 | lot. The Brier score, if you use it for discriminative training, it forms a non- |
---|
0:30:01 | convex optimization objective, which also tends not to generalize that well. If you trained on |
---|
0:30:09 | this data and then use recognizer on that data, it doesn't generalize that well, whereas |
---|
0:30:17 | the logarithmic one
---|
0:30:19 | has a little bit of natural regularisation built in, so you can expect to do better on
---|
0:30:24 | new data |
---|
0:30:28 | so |
---|
0:30:30 | Look at this in your own time; this is just an example of how the integral works out.
---|
0:30:35 | The step function causes the probability that you submit to the proper scoring rule to |
---|
0:30:40 | appear in the |
---|
0:30:43 | boundary of the integral; it's very simple, and you get this logarithmic form.
---|
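For the target case with a flat weighting, the step function indeed moves the submitted probability into the integration boundary and the logarithmic form drops out (a sketch of the calculation):

```latex
\[
  \int_{0}^{1} C^{*}_{t}(q,\ \mathrm{target})\, \mathrm{d}t
  \;=\; \int_{q}^{1} \frac{1}{t}\, \mathrm{d}t
  \;=\; -\log q .
\]
```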
0:30:49 | So now, let's do a case study |
---|
0:30:52 | and let's design a proper scoring rule to target the low false alarm region for...
---|
0:30:59 | of course, for |
---|
0:31:00 | speaker recognition |
---|
0:31:02 | detection |
---|
0:31:04 | a range of thresholds |
---|
0:31:07 | that's the threshold you place in the posterior probability |
---|
0:31:10 | that corresponds |
---|
0:31:12 | an operating point on the DET-curve. |
---|
0:31:15 | So we can use this weighting function to tailor the proper scoring rule to target |
---|
0:31:23 | only a part of the DET-curve if we want. So, George Doddington recently proposed |
---|
0:31:31 | another way to achieve the same thing. He called it cllr and ten |
---|
0:31:37 | it's mentioned in the new NIST evaluation plan. There's also |
---|
0:31:41 | upcoming Interspeech paper, so |
---|
0:31:45 | he used the standard logarithmic scoring rule, which is essentially just the same as Cllr
---|
0:31:50 | that I proposed. And then, he just included the scores above some threshold, so that
---|
0:31:59 | you can target that low false alarm region;
---|
0:32:03 | he omitted the scores below the threshold |
---|
0:32:07 | So unfortunately, cllr and ten does not quite fit into this framework of a proper scoring
---|
0:32:13 | rule, because it's got a threshold that's dependent on the miss rate of every system, so
---|
0:32:17 | the threshold is slightly different for different systems |
---|
0:32:21 | To make it a proper scoring rule, I'm just saying, let's use a fixed threshold and
---|
0:32:26 | then let's just call it the truncated cllr. |
---|
0:32:29 | and then you can also express |
---|
0:32:31 | the truncated cllr just with the weighting function |
---|
0:32:35 | So, the original Cllr logarithmic score has a flat weighting distribution. Truncated Cllr uses a
---|
0:32:43 | unit step |
---|
0:32:44 | which steps up at wherever you want to |
---|
0:32:47 | threshold the scores. |
---|
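In the notation above, a sketch of that weighting (t0 is a hypothetical symbol for the fixed threshold): flat above the threshold, zero below, rescaled so that it still integrates to one.

```latex
\[
  w_{\mathrm{trunc}}(t) \;=\; \frac{1}{1 - t_0}\,\mathbf{1}[\,t \ge t_0\,],
  \qquad 0 < t_0 < 1 .
\]
```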
0:32:51 | so |
---|
0:32:53 | there are several different things you can do. Let's call them varieties of... variations of
---|
0:33:00 | Cllr. The original one is just the logarithmic
---|
0:33:08 | proper scoring rule, which you need to apply to a probability |
---|
0:33:13 | To go from a log-likelihood ratio to a probability we need to have some prior, and then
---|
0:33:18 | apply Bayes' rule; so the prior that defines Cllr is just one half.
---|
0:33:25 | you can shift cllr by using some other prior and I'll show you in what |
---|
0:33:29 | sense it's shifted in a graph just after this.
---|
0:33:34 | That mechanism has been in the FoCal toolkit and most of us have probably
---|
0:33:38 | used that to do calibration and fusion |
---|
0:33:44 | but I never explicitly recommended this as an evaluation criteria |
---|
0:33:51 | and then this truncated cllr, which is very close to what George proposed, |
---|
0:33:57 | uses unit-step weighting.
---|
0:34:00 | So there's this transformation between log-likelihood-ratio and posterior, so I'm going to show
---|
0:34:08 | a plot where the threshold is a log-likelihood-ratio threshold, so there's this
---|
0:34:15 | transformation, and then the
---|
0:34:16 | prior is also involved, the prior just shifts you along the x-axis. And you have |
---|
0:34:22 | to remember, this transformation has a Jacobian associated with it, so on the right you have the
---|
0:34:28 | posterior threshold domain |
---|
0:34:35 | So, this is what graph looks like |
---|
0:34:37 | the |
---|
0:34:39 | the x-axis is the log-likelihood-ratio threshold,
---|
0:34:43 | the y-axis is the relative weighting that the proper scoring rule assigns to different operating |
---|
0:34:49 | points. So, in this view, this weighting function is the probability distribution. It looks almost |
---|
0:34:54 | like Gaussian, it's not quite a Gaussian |
---|
0:35:00 | and |
---|
0:35:01 | now what we do is... we just change the prior, then you get the shifted |
---|
0:35:05 | Cllr, which is the green curve, shifted to the right. So, that's
---|
0:35:11 | shifted towards the low false alarm region. So, I've labelled the regions here... the middle |
---|
0:35:18 | one we can call the equal error rate region, close to log-likelihood-ratio zero; this is
---|
0:35:23 | the low miss rate region: if your threshold is low, you're not gonna
---|
0:35:29 | miss so many
---|
0:35:31 | targets. And on the other side, that's the low false alarm region.
---|
0:35:35 | and the blue curve is truncated cllr, so |
---|
0:35:42 | you basically ignore all scores on this side of the threshold and, of course, you |
---|
0:35:47 | have to scale it by a factor of ten so that it integrates to one.
---|
0:35:53 | So now, let's look at a different option, one final option.
---|
0:35:56 | There's this beta family of proper scoring rules, that was proposed by Buja. |
---|
0:36:04 | It uses the beta distribution as this weighting distribution. It has two adjustable parameters |
---|
0:36:12 | Why this... why not just the Gaussian? The answer is the integrals work out if |
---|
0:36:16 | you use the beta. |
---|
0:36:18 | It's also general enough, so by adjusting these parameters we can get the Brier, logarithmic |
---|
0:36:23 | and also the c star which we've been using here |
---|
0:36:27 | and |
---|
0:36:29 | So it's a convenient family to use for this purpose.
---|
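For reference, the Beta weighting density has the standard form below (a and b are the two adjustable shape parameters; exactly which values recover the Brier, logarithmic and c star rules depends on the parametrization convention, which I have not verified here):

```latex
\[
  w(t) \;=\; \frac{t^{\,a-1}\,(1-t)^{\,b-1}}{B(a, b)}, \qquad 0 < t < 1 .
\]
```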
0:36:34 | For this presentation I've chosen the parameters to be equal to ten and one and |
---|
0:36:40 | that's then gonna be
---|
0:36:43 | very similar to the truncated Cllr.
---|
0:36:46 | This is what the proper scoring rule looks like |
---|
0:36:49 | I like this: it's logarithmic. If p goes close to one,
---|
0:36:55 | the polynomial term doesn't do very much anymore, it's more or less constant. So, then |
---|
0:37:03 | at the very low false alarm region this just becomes the logarithmic scoring rule again. |
---|
0:37:10 | So that's what this new beta looks like. The red curve over here |
---|
0:37:14 | it has its peak in the same place as the truncated one or the
---|
0:37:19 | shifted one. But, compared to, for example, the shifted one, it
---|
0:37:27 | more effectively ignores |
---|
0:37:30 | the one side of the DET-curve. So, if you believe this is the way
---|
0:37:36 | to go forward, you really do want to ignore that side of the DET-curve.
---|
0:37:40 | You can tailor your proper scoring rule to do that. So, I've not tried the |
---|
0:37:47 | blue or the red version here myself numerically |
---|
0:37:51 | so I cannot tell you that you're going to do well in SRE twelve
---|
0:37:56 | if you use one of these curves. It's up to you to experiment. So I'd
---|
0:38:00 | just like to point out: Cllr is not the only proper scoring rule.
---|
0:38:06 | They're very general, you can tailor them |
---|
0:38:10 | play with them, see what you can get. |
---|
0:38:16 | these guys are saying |
---|
0:38:19 | we have to say something about multiclass |
---|
0:38:21 | so I've got one slide on multiclass.
---|
0:38:25 | Multiclass turns out to be a lot more difficult to analyze |
---|
0:38:31 | it's amazing, the complexity, if you go from two to three classes, the trouble you |
---|
0:38:35 | can get into |
---|
0:38:37 | But, it's useful to know that some of the same rules still apply. You can |
---|
0:38:44 | construct |
---|
0:38:45 | the proper scoring rule |
---|
0:38:48 | choose some cost function and construct a proper scoring rule via the Bayes decision recipe. |
---|
0:38:55 | You can also combine them, so the same rules apply. |
---|
0:38:58 | And then, the logarithmic scoring rule is just very nice, it behaves nicely |
---|
0:39:05 | it also turns out to be an expectation of weighted misclassification errors, very
---|
0:39:12 | similar to what I've shown before. The integral is a lot harder to show, but it
---|
0:39:17 | works like that. And then, the logarithmic scoring rule does form a nice evaluation criterion |
---|
0:39:23 | and nice discriminative training criteria and |
---|
0:39:28 | it will be used as such in the Albayzin two thousand and twelve language recognition |
---|
0:39:32 | evaluation. Nicholas here will be telling us more about that later this week. |
---|
0:39:40 | so in conclusion |
---|
0:39:41 | in my view, proper scoring rules are essential if you want to make use of
---|
0:39:47 | the probabilistic output |
---|
0:39:51 | They do work well for the discriminative training |
---|
0:39:54 | you have to choose the right proper scoring rule for your training, but some of |
---|
0:39:58 | them do work very well. And they have a rich structure, they can be tailored; there's
---|
0:40:02 | not only one |
---|
0:40:04 | and in future maybe we'll see them used more generally in machine learning, even for |
---|
0:40:11 | generative training. |
---|
0:40:15 | Some selected references. The first one, my Ph.D. dissertation has a lot more material about |
---|
0:40:21 | proper scoring rules, many more references |
---|
0:40:26 | and |
---|
0:40:28 | a few questions |
---|
0:41:08 | well |
---|
0:41:10 | we've had a bit of a discussion |
---|
0:41:13 | in the context of... of |
---|
0:41:17 | recognizer that has to recognize the age of the... of the speaker and then if |
---|
0:41:23 | you see... look at the age as a... as a continuous variable, then the nature |
---|
0:41:28 | of the proper scoring rule changes. |
---|
0:41:29 | and |
---|
0:41:31 | there's a... there's a lot of literature on that type of proper scoring rule
---|
0:41:37 | there are extra issues |
---|
0:41:42 | for example, you... you have to ask |
---|
0:41:46 | even in a multi class case. In the multiclass case |
---|
0:41:51 | you have to ask is there some association between the classes, are some of them |
---|
0:41:56 | closer, so that if you make an error, but error is... is |
---|
0:42:02 | well, let's take an example. If the language is really English and you... no, let's |
---|
0:42:07 | say if language is really one of the Chinese languages and your recognizer says it's |
---|
0:42:13 | one of the other Chinese languages, that error is not as bad as saying it's |
---|
0:42:17 | English. |
---|
0:42:19 | So, the logarithmic scoring rule, for example, doesn't do that. Any error is as bad |
---|
0:42:26 | as |
---|
0:42:27 | as... as any other error. |
---|
0:42:30 | if you have a continuous range like age |
---|
0:42:32 | if |
---|
0:42:36 | if you... if the age is really thirty and you say it's thirty one, that's
---|
0:42:39 | not such a bad error. So, the logarithmic... there's a logarithmic version of the continuous |
---|
0:42:45 | scoring rule |
---|
0:42:48 | That one will not tell you that error is excusable. |
---|
0:42:52 | So, there are ways to design scoring rules to take into account |
---|
0:42:58 | some... some structure in the way you define your classes. |
---|
0:43:28 | I like to think we've thought more about the problem |
---|
0:43:34 | and I |
---|
0:43:38 | I think one of the reasons for that are the NIST evaluations and specifically |
---|
0:43:44 | the ... the DCF that we've been using in the NIST evaluation. |
---|
0:43:50 | In machine learning they like to just do error rates,
---|
0:43:53 | so, going from... from error rates to DCF it's a simple step, we're just weighting
---|
0:43:59 | the errors |
---|
0:44:38 | You are never speaking about the constraints
---|
0:44:41 | concerning the datasets |
---|
0:44:45 | if we are targeting |
---|
0:44:49 | some part of the |
---|
0:44:54 | curve, like the low false alarm region, we will certainly have some constraints on the
---|
0:45:01 | dataset, to have a balanced dataset. That's my first question;
---|
0:45:05 | the second one a whole lot easier so you get to be |
---|
0:45:08 | dataset |
---|
0:45:11 | maybe we should start now to speak also about the quantity of information we have in the
---|
0:45:19 | is it... I'm coming back to your example in language recognition. Is it the same |
---|
0:45:23 | error if you |
---|
0:45:26 | choose one Chinese language when it was a different Chinese language, compared to deciding
---|
0:45:31 | it's English |
---|
0:45:35 | in speaker recognition is it the same error if you decide it's |
---|
0:45:40 | it's not a target when
---|
0:45:43 | you have nothing in the speech file, no information in the speech file, compared to when you
---|
0:45:48 | decide that with a very good speech file, with a lot of information?
---|
0:45:57 | Here, let me answer the first ... first question, if I understood it correctly |
---|
0:46:04 | you asked about the ... the size of your evaluation database. So, of course, that's... |
---|
0:46:11 | that's very important |
---|
0:46:14 | in |
---|
0:46:16 | my presentation that I had at the... this SRE analysis workshop in December last year,
---|
0:46:22 | I addressed that... that issue. So, if you look at this view of the proper
---|
0:46:29 | scoring rule as an integral of error rates |
---|
0:46:33 | then |
---|
0:46:36 | if you move to... if you're at an operating point where the error
---|
0:46:41 | rate is going to be low, which does happen in the low false alarm region,
---|
0:46:45 | then you need |
---|
0:46:47 | enough data, so that you actually do have
---|
0:46:50 | errors. If you don't have errors, how can you measure the error rate? So, one |
---|
0:46:56 | has to be very careful |
---|
0:47:00 | not to push your evaluation outside of the range |
---|
0:47:03 | that your data can cover.
---|
0:47:07 | and the second question is |
---|
0:47:13 | the case that I covered is just the basics |
---|
0:47:17 | if you want a more complicated cost functions |
---|
0:47:21 | where you want to assign different costs to different flavours of errors, that does fit |
---|
0:47:29 | into this framework, so |
---|
0:47:30 | you can take any cost function
---|
0:47:33 | as long as it doesn't do something pathological again |
---|
0:47:38 | In the two-class case the cost function is simple; in multiclass you have to think really carefully how
---|
0:47:43 | to |
---|
0:47:44 | construct the cost function that doesn't contradict itself |
---|
0:47:49 | once you've formed a nice cost function |
---|
0:47:53 | you can apply this recipe |
---|
0:47:55 | just plug it into Bayes decision, back into the cost function and you'll have a |
---|
0:48:00 | proper scoring rule. So, this framework does cover that. |
---|
0:48:33 | are you dealing with people who are real? |
---|
0:48:59 | okay |
---|
0:49:00 | and going to tell us some more |
---|