So, they asked me to do the introduction for the opening plenary talk here. And
luckily, it's very easy to do, since we have Niko, who, as everyone knows,
has been part of the Odyssey workshops; he has become part of the institution of the
Odyssey workshops itself. He's been involved in the area of speaker and language recognition for
over twenty years. He started off working at Spescom DataVoice and now he's the
chief scientist at AGNITIO. He received his Ph.D. in two thousand and ten from the University
of Stellenbosch, where he also received his undergraduate and Master's degrees.
He's been involved in various aspects of speaker and language recognition, from
working on the core technologies of the classifiers themselves, from generative models to
discriminatively trained models, to working on the other side: calibration and how you evaluate.
And in today's talk, Niko is going to address one area that he's made a lot of contributions
in over the years: how we can go about evaluating the systems we build. How do we know how
well they're working, and how can we do this in a way that's going to show their utility for
downstream applications? So, with that, I hand it over to Niko to begin his talk.
Thanks very much, Doug. It's good to be here, thank you.
So, when Haizhou invited me, he asked me to say something about calibration and fusion,
which I've been doing for many years. I'll do so by discussing proper scoring rules, the basic
principle that underlies all of this work. Fusion you can do in many ways; proper scoring rules
are a good way to do fusion, but they're not essential for it. But, in
my view, if you're talking about calibration, you do need proper scoring rules.
So, they've been around since nineteen fifty. The Brier score was
proposed for evaluating the goodness of probabilistic weather forecasting. Since then, they've
stayed in the statistics literature, right up to the present. In pattern recognition, machine
learning and speech processing they're not that well known, but in fact, if you use maximum
likelihood for generative training, or cross-entropy for discriminative training, you are in
practice using the logarithmic scoring rule. So, you've probably all used it already.
In the future, we may be seeing more of proper scoring rules in machine learning. We've got
these restricted Boltzmann machines and other energy-based models, which are now becoming very
popular. They're very difficult to train, because you can't work out the likelihood.
Hyvärinen proposed a proper scoring rule to attack that problem, and if you google it
you'll find some recent papers on that as well.
So, I'll concentrate on our own application of proper scoring rules, in our field, and
try to promote better understanding of the concept of calibration itself and of how to form
training algorithms and evaluation measures which are calibration-sensitive.
I'll start by outlining the problem that we are trying to solve. Then I'll introduce
proper scoring rules, and the last section will be about how to design proper scoring rules:
there are several different ones, and you can design them to do what you want them to do.
So, not all pattern recognition needs to be probabilistic. You can build a nice recognizer
with an SVM classifier and you don't need to think about probabilities at all to do
that. But in this talk, we're interested in probabilistic pattern recognition, where the output is
a probability, or a likelihood, or a likelihood ratio. If you can get the
calibration right, that form of output is more useful than just hard decisions.
In machine learning and also in speech recognition, if you do probabilistic recognition, you might
be used to seeing a posterior probability as an output. An example is a phone
recognizer, where there will be forty or so posterior probabilities given the
input frames. But in speaker and language recognition, there are good reasons why we want to
use class likelihoods rather than posteriors. And if there are two classes, as in speaker
recognition, then the likelihood ratio is the most convenient. For what I'm about to present,
we can use any of those forms; it doesn't really matter which one.
So, we're interested
in a pattern recognizer that takes some form of input, maybe the acoustic feature vectors,
maybe an i-vector, or maybe even just a score. And then the output will be over
a small number of discrete classes, for example target and non-target in speaker recognition, or,
in language recognition, a number of language classes.
So, the output of the recognizer might be in likelihood form: given one piece of data, you
have a likelihood for each of the classes. If you also have a prior, and for the purposes here
you can consider the prior as given, a prior distribution over the classes, then it's easy:
we just plug that into Bayes' rule and you get the posterior. So, you can go from the
posterior to the likelihoods or the other way round. They're equivalent, they have the same
information.
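As a minimal sketch of this equivalence, with made-up numbers (the likelihoods and the prior below are just placeholders):

```python
# Given class likelihoods and a prior, Bayes' rule gives the posterior;
# the two representations carry the same information.
import numpy as np

likelihoods = np.array([0.8, 0.1])   # hypothetical P(data | class) for target, non-target
prior       = np.array([0.3, 0.7])   # assumed prior distribution over the classes

joint = likelihoods * prior
posterior = joint / joint.sum()      # Bayes' rule
print(posterior)                     # approximately [0.774, 0.226]

# Going back: the log-likelihood-ratio is recoverable from posterior and prior,
# so calibration can be judged on either side of Bayes' rule.
llr = np.log(posterior[0] / posterior[1]) - np.log(prior[0] / prior[1])
print(llr, np.log(likelihoods[0] / likelihoods[1]))   # the two agree
```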
Also, if the one side is well calibrated, we can say the other side is
well calibrated as well. So, it doesn't really matter on which side of Bayes' rule
we look at calibration. The recognizer output, for the purposes of this presentation, will be
the likelihoods, and we'll look at measuring calibration on the other side of Bayes' rule.
So, why is calibration necessary?
Because our models are imperfect models of the data.
Even when the model manages to extract information that could in principle discriminate with
high accuracy between the classes, the probabilistic representation might not be optimal. For
example, it might be overconfident: the probabilities might all be very close to zero and one,
whereas the accuracy doesn't warrant that kind of high confidence. So, that's the calibration problem.
So, calibration analysis will help you to detect that problem and also to fix it.
So, calibration can have two meanings: as a measure of goodness, how good is the
calibration, and also as a... as a transformation.
So, this is
what the typical transformation might look like. We have a pattern recognizer, which outputs likelihoods.
That recognizer might be based on some probabilistic model; the joint probability here, by which
I want to indicate the model, can be generative, probability of data given class, or
the other way round, discriminative, probability of class given data; it doesn't matter.
You're probably going to do better if you recalibrate that output, and again you can do that with
another probabilistic model. This time we're modeling the scores: the likelihoods that come out of
the first model, we call them scores, features if you like. The scores are simpler, of lower
dimension than the original input, so they're easier to model. Again, you can do generative or
discriminative modeling of the scores. What I'm about to show is going to be mostly about
discriminative modeling, but you can do generative as well.
so
How can we call the likelihoods that come out of the second stage calibrated? Because
we're going to measure them: we're going to measure how well they're calibrated, and moreover,
we're going to force them to be well calibrated.
so
If you send the likelihoods through Bayes' rule, then you get the posterior, and that's where
we're going to measure the calibration with the proper scoring rule.
so
Obviously, you need to do this kind of measurement with a supervised evaluation database. So,
you apply the proper scoring rule to every example in the database and then you
average the values of the proper scoring rule. That's your measure of goodness of
your recognizer on this database, and you plug that in as the objective function of
the training algorithm, and you can adjust the calibration parameters; that's the way you
force your calibrator to
produce calibrated likelihoods.
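Here is a rough sketch of that assembly, assuming an affine calibrator on log-likelihood-ratio scores and the logarithmic scoring rule as the objective; the scores, labels and prior below are synthetic stand-ins, not anything from the talk:

```python
# Affine calibration (scale a, offset b) trained by minimizing the average
# logarithmic proper scoring rule over a supervised database.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
tar = rng.normal(2.0, 1.0, 1000)     # hypothetical target scores
non = rng.normal(-1.0, 1.5, 10000)   # hypothetical non-target scores
scores = np.concatenate([tar, non])
labels = np.concatenate([np.ones_like(tar), np.zeros_like(non)])
prior = 0.5                          # assumed effective target prior for evaluation

def avg_log_score(params):
    a, b = params
    llr = a * scores + b                          # calibrated log-likelihood-ratio
    log_post_odds = llr + np.log(prior / (1 - prior))
    # logarithmic scoring rule applied to the posterior, averaged per class
    c_tar = np.mean(np.log1p(np.exp(-log_post_odds[labels == 1])))
    c_non = np.mean(np.log1p(np.exp( log_post_odds[labels == 0])))
    return prior * c_tar + (1 - prior) * c_non

res = minimize(avg_log_score, x0=[1.0, 0.0], method="Nelder-Mead")
print(res.x)   # calibration parameters that minimize the average scoring rule
```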
So, you can use the same assembly for fusion, if you have multiple systems to combine into
a final fused output; or, more generally, you can just train your whole
recognizer
with the same principle.
So, in summary of this part: calibration is most easily applied to the likelihoods;
simple affine transforms work very well in the log-likelihood domain; but the measurement is
based on the posteriors, and it's going to be done with proper scoring rules. So,
let's introduce proper scoring rules
I'll first talk about the classical definition of proper scoring rules; then, a more engineering
viewpoint: how you can define them via decision theory. It is also very useful to look
at them in information theory; that will tell you how much information the recognizer is
delivering to the user. But that won't be directly relevant to this talk, so I'll
just refer you to this reference.
So, we start with the
classical definition, and the sort of canonical example is weather forecasting.
so
We have a weather forecaster. He predicts whether it will rain tomorrow or not and
he has a probabilistic prediction, he gives us a probability for rain.
The next day, it rains or it doesn't. How do we decide whether that was
a good probability or not?
So, it's reasonable to choose some kind of a cost function. So, you put the
probability, the prediction in there, as well as the fact whether it rained or not.
So, what should this cost function look like?
It's not so obvious how this cost function should look. If, for example, temperature were being
predicted, it would be easy: you could compare the predicted against the actual temperature and
just compute some kind of squared difference. But in this case, it's a probabilistic prediction,
and on the day it either rains or it doesn't; there's no true probability for rain, so we
can't do that kind of
direct comparison.
So, the solution to forming such a cost function is the family of cost functions
called proper scoring rules,
and they have
two nice properties. First of all, they force predictions to be as accurate as
possible, but subject to honesty: you can't pretend that your
prediction is more accurate than it actually is. So,
you need these two things to work together.
so
This is a simple picture of how weather forecasting might be done. You've got
the data, which comes from satellites and other sensors,
and the probabilistic model, and then you compute the probability for rain, given the observations
and the model: a posterior probability. So, the weather forecaster might ask himself: do I report
what I calculated, or do I output
some warping or reinterpretation of this probability? Maybe that would be more useful for my
users? Maybe my boss will be happier with me if I pretend that my predictions
are more accurate than they really are? So,
if the weather forecaster trusts his model and his data,
then we can't really do better than the weather forecaster; we're not weather forecasters ourselves.
So, what we want is his best probability, p, the one that he calculated,
not something else. So, how do we force him to do that?
So, we tell the weather forecaster: tomorrow,
when you've submitted some q, which might be different from the p which we really want,
we are going to evaluate you with a proper scoring rule, with this type of
cost function. Now, the weather forecaster doesn't know whether it's going to rain or
not. The best information he has is his prediction, p. So, he forms an expected
value for the way he's going to be evaluated tomorrow: what's the expected cost that
I'm going to be evaluated with tomorrow? So,
a proper scoring rule satisfies this expectation requirement:
the probability p forms the expectation,
q is what he submits, and with a
proper scoring rule you're always going to do better if you submit p instead of q.
So that is the way that the proper scoring rule motivates honesty.
The same mechanism also motivates him to make it more accurate. So,
he might sit down and think: If I have a bigger computer, if I launch
more satellites, I could get a better prediction. And even though I don't have the
better prediction, if I had it, I would form my expectation with the better prediction.
And the same mechanism then says: well, we would do better with the better prediction,
it's kind of obvious, but the proper scoring rule makes that obvious statement work mathematically.
Here's another view. It turns out that if you look at the
expected cost of the proper scoring rule as a function of the predicted probability, then
you get the minima at the vertices of the probability simplex. So,
this is very much like the entropy function; in fact, if you use the logarithmic
scoring rule, this is just the entropy of p. So minimizing expected cost is the same
as minimizing entropy, minimizing uncertainty. So,
driving down expected cost tends to favour
sharper predictions, as they sometimes call it. But it has to be subject to
calibration as well.
so
why are we going on about what humans might do? Because we can motivate machines in the same way:
that is called discriminative training. And we can expect the same benefits.
Some examples. There are many different proper scoring rules; the very well known ones are
the Brier score, which has this quadratic form...
I'll show a graph just now... and also the logarithmic score. In
both cases it's really easy to show that they do satisfy this expectation requirement.
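A small numerical check of that expectation requirement might look like this; the honest probability p below is arbitrary:

```python
# If the honest prediction is p, the expected cost E_p[C(q, rain)] of either
# rule is minimized by reporting q = p.
import numpy as np

def brier(q, rained):      # quadratic scoring rule
    return (1 - q)**2 if rained else q**2

def logarithmic(q, rained):
    return -np.log(q) if rained else -np.log(1 - q)

p = 0.3                                   # the forecaster's honest probability of rain
q_grid = np.linspace(0.01, 0.99, 99)      # candidate reported probabilities
for rule in (brier, logarithmic):
    expected = [p * rule(q, True) + (1 - p) * rule(q, False) for q in q_grid]
    best_q = q_grid[int(np.argmin(expected))]
    print(rule.__name__, "is minimized near q =", best_q)   # both print ~0.3
```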
So, here's an example, at the top left. If it does rain, we're looking at the green curve. If you
predicted zero probability for rain, that's bad, so the cost is high. If you predicted
probability one, that's good, so the cost is low. If it doesn't rain, it
works the other way round. So, that's the Brier score. The logarithmic one is very
similar, except it goes out to
infinity here.
If you take another view and do a log-odds transformation on the probability, then you see
they look very different.
The logarithmic one turns out to form nice convex objective functions, which are easier
to numerically optimize;
the Brier score is a little bit harder to optimize.
So now, let's switch to the
engineering view of proper scoring rules. We're building these recognizers because we actually want
to use them for some useful purpose; we want to
do whatever we're doing in a cost-effective way; we want to minimize expected cost.
So, if you ask what the consequences are of the Bayes decisions that I can make
with some probabilistic prediction, then you've really already constructed the proper scoring rule;
you just have to ask that very natural question. All proper scoring rules can
be interpreted in that way.
so
I'm assuming everybody knows this. This is the example of the NIST detection cost function.
You make some decision to accept or reject, and it's a target or a non-target.
If you get it wrong there's some cost; if you get it right, everything
is good, the cost is zero. So, that's the consequence.
Now we're using a probabilistic recognizer, which gives us this probability distribution: q for
target, one minus q for non-target. And we want to
use that to make a decision, so we make a minimum-expected-cost Bayes
decision. We're assuming that the input is well calibrated, so that we can use it directly
in the minimum-expected-cost Bayes decision. So, on the two sides of the inequality,
we've got the expected costs. You choose the decision with the lowest expected cost and then you put
it into the cost function. So the cost function is used twice; you see, I've
highlighted the cost parameters that are used twice. The end result is then the
proper scoring rule: you're comparing the probability distribution over the hypotheses with the true
hypothesis, and the proper scoring rule tells you how well these two match.
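As a sketch of that construction, with illustrative (not official) cost values:

```python
# Take a detection cost function, make a minimum-expected-cost Bayes decision
# with the submitted posterior q, then charge the cost of that decision
# against the true hypothesis: the cost function is used twice.
C_miss, C_fa = 10.0, 1.0     # assumed costs of a miss and a false alarm

def dcf_scoring_rule(q_target, is_target):
    # expected cost of each hard decision, given the posterior q_target
    exp_cost_accept = (1 - q_target) * C_fa      # accept -> risk a false alarm
    exp_cost_reject = q_target * C_miss          # reject -> risk a miss
    accept = exp_cost_accept <= exp_cost_reject  # Bayes decision
    # the cost function is used a second time, now against the truth
    if is_target:
        return 0.0 if accept else C_miss
    return C_fa if accept else 0.0

print(dcf_scoring_rule(0.05, True))    # confident 'non-target', but it was a target: cost 10
print(dcf_scoring_rule(0.05, False))   # same prediction, non-target: cost 0
```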
So, this is exactly how NIST will form their new evaluation criterion this year. In all
the evaluations up to two thousand and ten, they used just the DCF as is,
applied to hard decisions. This year they'll use a proper scoring rule
and they'll ask for likelihood ratios. Of course, those have to be put through Bayes'
rule to get posteriors, and then they go into the proper scoring rule.
We can generalise this to more than two classes, and you can use more complicated cost
functions; really any cost function. There's a trivial inequality that shows
this type of construction of a proper scoring rule
satisfies the expectation requirement.
so
in summary of this part, the Bayes decision interpretation tells us: if you need a
proper scoring rule, take your favourite cost function,
apply this recipe, apply Bayes decisions, and you'll have a proper scoring rule; and
it will measure and optimize the cost-effectiveness of your recognizer.
So, just a last word about the discrimination/calibration decomposition.
The Bayes decision measures the full cost of using the probabilistic recognizer to make decisions.
So, often it's useful to decompose this cost into two components. The first might be
the underlying inability of the recognizer to perfectly discriminate between the two classes. Even if
you get the calibration optimal, you still can't recognize the classes perfectly. And then, the
second component is the additional cost due to bad calibration.
So, we've all been looking, in my case for more than a decade, at
NIST's actual DCF versus minimum DCF. That's very much the same kind of decomposition, but
in that case, calibration refers only to setting your decision threshold. So, if we move
to probabilistic output of the recognizer, that's a more general type of calibration; can we
do that same kind of decomposition? My answer is yes.
I've tried it over the last few years with speaker and language recognition and in
my opinion it's a useful thing to do. So, the recipe is:
at the output end of your recognizer you isolate a few parameters that you call
the calibration parameters, or you might add an extra stage and call that a calibration
stage. If it's multiclass, maybe there's some debate about how to choose these parameters.
Once you've done that, you choose whatever proper scoring rule you're going to use for
your evaluation metric,
and you evaluate it over your supervised evaluation database; that's then called the actual cost.
Then, the evaluator goes and, using the true class labels, minimizes just those calibration
parameters, and that reduces the cost
somewhat. Let's call that the minimum cost, and then you can compare the
actual to the minimum cost. If they are very close, you can say: my calibration
was good. Otherwise, go back and see what went wrong.
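A minimal sketch of that recipe, with synthetic scores and an affine transform standing in for the isolated calibration parameters; the metric choice (logarithmic rule) is also just an assumption:

```python
# Actual cost: average proper scoring rule with the calibration as submitted.
# Minimum cost: the evaluator re-optimizes only the calibration parameters
# using the true labels, then re-evaluates.
import numpy as np
from scipy.optimize import minimize

def avg_log_score(llr, labels, prior=0.5):
    log_post_odds = llr + np.log(prior / (1 - prior))
    c_tar = np.mean(np.log1p(np.exp(-log_post_odds[labels == 1])))
    c_non = np.mean(np.log1p(np.exp( log_post_odds[labels == 0])))
    return prior * c_tar + (1 - prior) * c_non

rng = np.random.default_rng(1)
labels = np.concatenate([np.ones(1000), np.zeros(10000)])
llr = np.concatenate([rng.normal(4, 2, 1000), rng.normal(-2, 2, 10000)])  # miscalibrated LLRs

actual = avg_log_score(llr, labels)
opt = minimize(lambda ab: avg_log_score(ab[0] * llr + ab[1], labels),
               x0=[1.0, 0.0], method="Nelder-Mead")
minimum = opt.fun
print(actual, minimum)   # a large gap would indicate a calibration problem
```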
So, in the last part of the talk we're going to play around with proper
scoring rules a bit.
I proposed Cllr for use in speaker recognition in two thousand and four,
but
what I want to show here is that it's not the only option:
you can adjust the proper scoring rule to target your
application.
So, I'll show how to do that.
The mechanism
for, let's call them binary proper scoring rules, is the fact that you can combine
proper scoring rules:
just a weighted summation of proper scoring rules, and once you do that, it's
still a proper scoring rule. So, you might have multiple different proper scoring rules representing
slightly different applications, applications that work at different operating points,
and if you do this kind of combination of those proper scoring rules, you get
a new proper scoring rule that represents a mixture of applications. A real application is probably
going to be used at just a single operating point, but if
it's a probabilistic output, you can hope to apply it to a range of different
operating points. So, this type of
combination of proper scoring rules is then a nice way to
evaluate that kind of more generally applicable recognizer.
So, NIST is also going to do that, this year in SRE twelve: they will
use a combination of two discrete operating points in a proper scoring rule. You can
do discrete combinations, or continuous combinations as well. The interesting thing is that all
binary, two-class proper scoring rules can be described in this way; I'll show how
that is done. This DCF turns out to be the fundamental building block for
two-class proper scoring rules.
This is the same picture I had before. I've just normalized the cost function:
there's a cost of miss and a cost of false alarm, but that's redundant. You don't really need those
two costs; we can reduce it to one parameter,
because the magnitude of the proper scoring rule doesn't really tell us anything. If
you normalize it like this, then the expected cost at the decision threshold is always
going to be one, no matter what the parameters are, no matter what the
operating point is. So, the parameter that we're using
is the Bayes decision threshold:
the posterior probability for the target
is compared to this parameter t, which is the threshold.
The cost of a miss is one over t, and the cost of a false
alarm is one over one minus t. You see, if t is close to
zero, the one cost goes to infinity; if it's close to one, the other cost
goes to infinity. So, you're covering the whole range of cost ratios just by varying this
parameter t. We'll call this the normalized DCF scoring rule; I've got the
C-star notation for it, and the operating point is t.
So, what does it look like?
It's a very simple step function.
If your posterior probability for the target is too low, below the threshold t, you're going
to miss the target and get hit with the miss cost.
If p is high enough, and it really is the target, we pass t and
the cost is zero. If it's not the target, the step function works
the other way round: the red line goes up.
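A sketch of that step function, under the parameterization just described (miss cost one over t, false-alarm cost one over one minus t); the example values are arbitrary:

```python
# The normalized "C-star" scoring rule at operating point t: a Bayes decision
# at threshold t, then the appropriate normalized cost against the truth.
def c_star(p_target, is_target, t):
    if is_target:
        return 1.0 / t if p_target <= t else 0.0        # missed target
    return 1.0 / (1.0 - t) if p_target > t else 0.0     # false alarm

# Same prediction, different operating points: as t moves, the relative
# penalties for misses and false alarms change.
print(c_star(0.4, True, t=0.5))    # 2.0   (miss at the equal-cost point)
print(c_star(0.4, True, t=0.1))    # 0.0   (accepted at this low threshold)
print(c_star(0.4, False, t=0.1))   # ~1.11 (and therefore a false alarm if non-target)
```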
so
here I have four different values of t. For one setting, the cost of miss is high and the
cost of false alarm is low; as you adjust t, they change.
In comparison, I've got the logarithmic scoring rule,
and you'll see it looks very similar. It tends to follow the way that the
miss and false alarm costs change, and indeed, you'll find that if you integrate over all
values of t,
you get the logarithmic scoring rule.
so
All binary proper scoring rules can be expressed as an expectation over operating
points. The integrand here is the step function, the C-star guy, as well
as some weighting distribution.
so
The weighting distribution is a relative distribution: it has to be non-negative and it has to
integrate to one,
and it determines
the nature of your proper scoring rule. Several properties depend on this weighting distribution, and
it also tells you what relative importance you place on different operating points. There
is a rich variety of things you can do. You can make the weighting function
an impulse... I shouldn't say function, it's a distribution; in
mathematics an impulse is not really a function.
In any case, if it's an impulse, we're looking at a single operating point. If it's
a sum of impulses, we're looking at multiple, discrete operating points. Or, if
it's a smooth probability distribution, then we're looking at a continuous range of operating points.
Examples of the discrete ones: that could be the SRE ten operating point,
which is an impulse at a single point; or in SRE
twelve, where
you're looking at two operating points, a mixture of two impulses.
If you do smooth weighting, this quadratic form over here gives the Brier score,
and the logarithmic score just uses a very simple constant weighting. So, the weighting matters a
lot. The Brier score, if you use it for discriminative training, forms a non-convex
optimization objective, which also tends not to generalize that well: if you train on
this data and then use the recognizer on that data, it doesn't generalize that well, whereas
the logarithmic one
has a little bit of natural regularisation built in, so you can expect to do better on
new data.
so
You can work through this in your own time; it's just an example of how the integral works out.
The step function causes the probability that you submit to the proper scoring rule to
appear in the
boundary of the integral; it's very simple, and you get this logarithmic form.
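Here is a quick numerical check of that integral (not the slide's derivation), assuming the flat weighting and the step-function rule from before:

```python
# With a flat weighting over operating points t, the integral of the
# step-function rule reproduces the logarithmic scoring rule, -log(p).
import numpy as np
from scipy.integrate import quad

def c_star(p, is_target, t):
    if is_target:
        return 1.0 / t if p <= t else 0.0
    return 1.0 / (1.0 - t) if p > t else 0.0

p = 0.2
integral, _ = quad(lambda t: c_star(p, True, t), 0.0, 1.0, points=[p])
print(integral, -np.log(p))   # both approximately 1.609
```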
So now, let's do a case study,
and let's design a proper scoring rule to target the low false alarm region, for,
of course,
speaker
detection.
There's a range of thresholds;
that's the threshold you place on the posterior probability,
and that corresponds to
an operating point on the DET curve.
So we can use this weighting function to tailor the proper scoring rule to target
only a part of the DET curve if we want. George Doddington recently proposed
another way to achieve the same thing; he called it Cllr-M10, and
it's mentioned in the new NIST evaluation plan. There's also an
upcoming Interspeech paper.
He used the standard logarithmic scoring rule, which is essentially the same as the Cllr
that I proposed, but he evaluated only the scores above some threshold, so that
you target that low false alarm region;
he omitted the scores below the threshold.
So, unfortunately, George's metric does not quite fit into this framework of a proper scoring
rule, because it has a threshold that depends on the miss rate of each system, so
the threshold is slightly different for different systems. To
make it a proper scoring rule, I'm just saying: let's use a fixed threshold, and
then let's call it the truncated Cllr.
And then you can also express
the truncated Cllr with just a weighting function.
The original Cllr, the logarithmic score, has a flat weighting distribution; truncated Cllr uses a
unit step,
which steps up wherever you want to
threshold the scores.
so
there are several different things you can do; let's call them variations of
Cllr. The original one is just the logarithmic
proper scoring rule, which you need to apply to a probability. To
go from a log-likelihood ratio to a probability, we need a prior and then
Bayes' rule; the prior that defines Cllr is just a half.
You can shift Cllr by using some other prior, and I'll show in a graph just after this
in what sense it is shifted.
That mechanism has been in the FoCal toolkit, and most of us have probably
used it to do calibration and fusion,
but I never explicitly recommended it as an evaluation criterion.
And then there's this truncated Cllr, which is very close to what George proposed,
and uses a unit-step weighting.
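As a sketch of the first two variants, with synthetic scores (the truncated variant would just swap the flat weighting for a unit step, as in the integral view above):

```python
# Original Cllr applies the logarithmic scoring rule to the posterior obtained
# with a prior of one half; the shifted variant simply uses a different prior.
import numpy as np

def prior_weighted_cllr(tar_llrs, non_llrs, prior=0.5):
    offset = np.log(prior / (1 - prior))          # Bayes' rule in the log-odds domain
    c_tar = np.mean(np.log1p(np.exp(-(tar_llrs + offset)))) / np.log(2)
    c_non = np.mean(np.log1p(np.exp(  non_llrs + offset))) / np.log(2)
    return prior * c_tar + (1 - prior) * c_non

rng = np.random.default_rng(2)
tar = rng.normal(3, 2, 1000)                       # hypothetical target LLRs
non = rng.normal(-3, 2, 10000)                     # hypothetical non-target LLRs
print(prior_weighted_cllr(tar, non))               # original Cllr, prior = 0.5
print(prior_weighted_cllr(tar, non, prior=0.01))   # shifted towards the low false alarm region
```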
So, there's this transformation between the log-likelihood-ratio and posterior domains. I'm going to show
a plot where the threshold is a log-likelihood-ratio threshold, so this transformation
is involved, and the
prior is also involved: the prior just shifts you along the x-axis. And you have
to remember that this transformation has a Jacobian associated with it, because on the right-hand
side you're in the posterior threshold domain.
So, this is what the graph looks like.
On the x-axis is the log-likelihood-ratio threshold;
on the y-axis is the relative weighting that the proper scoring rule assigns to different operating
points. In this view, the weighting function is a probability distribution. It looks almost
like a Gaussian, but it's not quite a Gaussian.
Now, if we just change the prior, then you get the shifted
Cllr, which is the green curve, shifted to the right; that is,
shifted towards the low false alarm region. I've labelled the regions here: the middle
one we can call the equal error rate region, close to log-likelihood-ratio zero;
then the low miss rate region, because if your threshold is low, you're not going to
miss so many
targets; and then the low false alarm region.
The blue curve is the truncated Cllr:
you basically ignore all scores on this side of the threshold and, of course, you
have to scale it, by a factor of about ten here, so that it integrates to one.
So now, let's look at another, final option.
There's the beta family of proper scoring rules, which was proposed by Buja.
It uses the beta distribution as the weighting distribution, and it has two adjustable parameters.
Why the beta and not just a Gaussian? The answer is that the integrals work out if
you use the beta.
It's also general enough: by adjusting these parameters we can get the Brier score, the logarithmic
score, and also the C-star that we've been using here.
So it's a comfortable family to use for this purpose.
For this presentation, I've chosen the parameters to be equal to ten and one, and
that's going to be
very similar to the truncated Cllr.
This is what the proper scoring rule looks like.
I like this logarithm here: if p goes close to one,
the polynomial term doesn't do very much anymore, it's more or less constant, so
at the very low false alarm region this just becomes the logarithmic scoring rule again.
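Here is a sketch of that family, under my reading of the talk: the step-function rule is weighted by a Beta density over operating points and integrated numerically. The parameter ordering and the example values are assumptions for illustration; the flat Beta(1, 1) case recovers the logarithmic rule.

```python
# Beta-weighted proper scoring rule, evaluated by numerical integration.
import numpy as np
from scipy.integrate import quad
from scipy.stats import beta

def c_star(p, is_target, t):
    if is_target:
        return 1.0 / t if p <= t else 0.0
    return 1.0 / (1.0 - t) if p > t else 0.0

def beta_score(p, is_target, a, b):
    integrand = lambda t: c_star(p, is_target, t) * beta.pdf(t, a, b)
    value, _ = quad(integrand, 0.0, 1.0, points=[p])
    return value

print(beta_score(0.2, True, 1, 1), -np.log(0.2))   # flat weighting recovers -log(p)
print(beta_score(0.2, True, 10, 1))                # this choice weights high thresholds heavily
```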
So that's what this new beta rule looks like: the red curve over here.
It has its peak in the same place as the truncated one or the
shifted one. But, compared to, for example, the shifted one, it
more effectively ignores
the one side of the DET curve. So, if you believe this is the way
to go forward, that you really do want to ignore that side of the DET curve,
you can tailor your proper scoring rule to do that. I've not tried the
blue or the red version here myself numerically,
so I cannot promise that you're going to do well in SRE twelve
if you use one of these curves. It's up to you to experiment. I'd
just like to point out: Cllr is not the only proper scoring rule.
They're very general, you can tailor them,
play with them, and see what you can get.
These guys are saying
we have to say something about multiclass,
so I've got one slide on multiclass.
Multiclass turns out to be a lot more difficult to analyze;
it's amazing, the complexity and the trouble you can get into if you go from two to
three classes.
But it's useful to know that some of the same rules still apply. You can
construct
a proper scoring rule:
choose some cost function and construct it via the Bayes decision recipe.
You can also combine them, so the same rules apply.
And the logarithmic scoring rule is just very nice; it behaves nicely.
It also turns out to be an expectation over a weighting of misclassification errors, very
similar to what I've shown before; the integral is a lot harder to show, but it
works like that. The logarithmic scoring rule forms a nice evaluation criterion
and a nice discriminative training criterion, and
it will be used as such in the Albayzin two thousand and twelve language recognition
evaluation. Nicholas here will be telling us more about that later this week.
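A minimal sketch of the multiclass logarithmic scoring rule, with an arbitrary number of classes and an assumed flat prior:

```python
# Multiclass logarithmic scoring rule (multiclass cross-entropy): send the class
# log-likelihoods through Bayes' rule with a prior, then charge minus the log
# of the posterior of the true class.
import numpy as np

def multiclass_log_score(log_likelihoods, prior, true_class):
    log_post = log_likelihoods + np.log(prior)
    log_post -= np.logaddexp.reduce(log_post)     # normalize (Bayes' rule in the log domain)
    return -log_post[true_class]

log_lk = np.array([-1.0, -2.5, -4.0])             # hypothetical log-likelihoods for 3 languages
prior = np.array([1/3, 1/3, 1/3])
print(multiclass_log_score(log_lk, prior, true_class=0))
```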
So, in conclusion:
in my view, proper scoring rules are essential if you want to use a recognizer's
probabilistic output.
They work well for discriminative training:
you have to choose the right proper scoring rule for your training, but some of
them do work very well. They have a rich structure, they can be tailored, and there's
not only one.
And in future, maybe we'll see them used more generally in machine learning, even for
generative training.
Some selected references. The first one, my Ph.D. dissertation, has a lot more material about
proper scoring rules and many more references.
We have time for a few questions.
Well,
we've had a bit of a discussion
in the context of a
recognizer that has to recognize the age of the speaker, and if
you look at the age as a continuous variable, then the nature
of the proper scoring rule changes.
There's a lot of literature on that type of proper scoring rule.
There are extra issues.
For example, you have to ask,
even in the multiclass case:
is there some association between the classes, are some of them
closer, so that if you make an error, the error is...
well, let's take an example. If the language is really one of the Chinese languages and your
recognizer says it's one of the other Chinese languages, that error is not as bad as saying it's
English.
The logarithmic scoring rule, for example, doesn't do that: any error is as bad
as any other error.
If you have a continuous range like age:
if the age is really thirty and you say it's thirty-one, that's
not such a bad error. There's a logarithmic version of the continuous
scoring rule, but
that one will not tell you that such an error is excusable.
So, there are ways to design scoring rules to take into account
some structure in the way you define your classes.
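As an illustration of that kind of structured cost, here is a hypothetical cost matrix plugged into the same Bayes-decision recipe; the class names and cost values are made up:

```python
# Confusing one Chinese language for another is penalized less than calling it
# English; the Bayes-decision recipe turns this cost matrix into a proper
# scoring rule that treats the two kinds of error differently.
import numpy as np

classes = ["Mandarin", "Cantonese", "English"]
# cost[i, j] = cost of deciding class j when the truth is class i
cost = np.array([[0.0, 0.2, 1.0],
                 [0.2, 0.0, 1.0],
                 [1.0, 1.0, 0.0]])

def structured_scoring_rule(posterior, true_class):
    expected_costs = posterior @ cost            # expected cost of each decision
    decision = int(np.argmin(expected_costs))    # Bayes decision
    return cost[true_class, decision]            # charge it against the truth

q = np.array([0.1, 0.6, 0.3])                    # a submitted posterior
print(structured_scoring_rule(q, true_class=0))  # 0.2: decided Cantonese, truth Mandarin
```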
I like to think we've thought more about the problem,
and I
think one of the reasons for that is the NIST evaluations, and specifically
the DCF that we've been using in the NIST evaluations.
In machine learning they like to just count errors;
going from error rate to DCF is a simple step, we're just weighting
the errors.
You never speak about the constraints
concerning the datasets.
If we are targeting
some part of the
curve, like the low false alarm region, we will certainly have some constraints on the
dataset, to have a balanced dataset. That's my first question.
The second one:
maybe we should also start to speak about the quantity of information we have in the speech files.
I'm coming back to your example in language recognition. Is it the same
error if
your choice is one Chinese language when it was a different Chinese language, compared to deciding
it's English?
And in speaker recognition, is it the same error if you decide it's
not a target when
you have nothing in the speech file, no information in the speech file, as when you
decide that with a very good speech file, with a lot of information?
Let me answer the first question, if I understood it correctly:
you asked about the size of your evaluation database. Of course, that's
very important.
In
the presentation that I gave at the SRE analysis workshop in December last year,
I addressed that issue. If you look at this view of the proper
scoring rule as an integral over error rates,
then,
if you move to an operating point where the error
rate is going to be low, which does happen in the low false alarm region,
then you need
enough data so that you actually do see
errors. If you don't have errors, how can you measure the error rate? So, one
has to be very careful
not to push your evaluation outside of the range
the data can cover.
And the second question:
the case that I covered is just the basics.
If you want a more complicated cost function,
where you want to assign different costs to different flavours of errors, that does fit
into this framework.
You can take any cost function,
as long as it doesn't do something pathological. In the
two-class case, the cost function is simple; in multiclass you have to think really carefully how
to
construct a cost function that doesn't contradict itself.
But once you've formed a nice cost function,
you can apply this recipe:
just plug it into the Bayes decision and back into the cost function, and you'll have a
proper scoring rule. So, this framework does cover that.
are you dealing with people who are real?
Okay, and you're going to tell us some more.