So, they asked me to do the introduction for the opening plenary talk here. And
luckily, it's very easy to do, since we have Niko, who, as everyone knows,
has been part of the Odyssey workshops; he has become part of the institution of the
Odyssey workshops itself. He's been involved in the area of speaker and language recognition for
over twenty years. He started off working at Spescom DataVoice and now he's the
chief scientist at AGNITIO. He received his Ph.D. in two thousand and ten from the University
of Stellenbosch, where he also received his undergraduate and Master's degrees.
He's been involved in various aspects of speaker and language recognition, from
working on the core technologies of the classifiers themselves, from generative models to
discriminatively trained models, to working on the other side: calibration and how you evaluate.
And in today's talk, Niko is going to address one area that he's made a lot of contributions
in over the years: how we can go about evaluating the systems we build. How do we know how
well they're working, and how can we do this in a way that's going to show their utility for
downstream applications? So, with that, I hand it over to Niko to begin his talk.
Thanks very much, Doug. It's good to be here, thank you.
So, when Haizhou invited me, he asked me to say something about calibration and fusion,
which I've been doing for many years. I'll do so by discussing proper scoring rules, the basic
principle that underlies all of this work. Fusion you can do in many ways; proper scoring rules
are a good way to do fusion, but they're not essential for it. But, in
my view, if you're talking about calibration, you do need proper scoring rules.
So, they've been around since nineteen fifty. The Brier score was
proposed for evaluating the goodness of probabilistic weather forecasting. Since then, they've
stayed in the statistics literature, right up to the present. In pattern recognition, machine
learning and speech processing they're not that well known, but in fact, if you use maximum
likelihood for generative training, or cross-entropy for discriminative training, you are in
practice using the logarithmic scoring rule. So, you've probably all used it already.
In the future, we may be seeing more of proper scoring rules in machine learning. We've got
these restricted Boltzmann machines and other energy-based models, which are now becoming very
popular. They're very difficult to train, because you can't work out the likelihood.
Hyvärinen proposed a proper scoring rule to attack that problem, and if you google it
you'll find some recent papers on that as well.
So, I'll concentrate on our own application of proper scoring rules, in our field, and
try to promote better understanding of the concept of calibration itself and of how to form
training algorithms and evaluation measures which are calibration-sensitive.
I'll start by outlining the problem that we are trying to solve. Then I'll introduce
proper scoring rules, and the last section will be about how to design proper scoring rules:
there are several different ones, and you can design them to do what you want them to do.
So, not all pattern recognition needs to be probabilistic. You can build a nice recognizer
with an SVM classifier and you don't need to think about probabilities at all to do
that. But in this talk, we're interested in probabilistic pattern recognition, where the output is
a probability, or a likelihood, or a likelihood ratio. If you can get the
calibration right, that form of output is more useful than just hard decisions.
In machine learning and also in speech recognition, if you do probabilistic recognition, you might
be used to seeing a posterior probability as an output. An example is a phone
recognizer, where there will be forty or so posterior probabilities given the
input frames. But in speaker and language recognition, there are good reasons why we want to
use class likelihoods rather than posteriors. And if there are two classes, as in speaker
recognition, then the likelihood ratio is the most convenient. For what I'm about to present,
we can use any of those forms; it doesn't really matter which one.
So, we're interested
in a pattern recognizer that takes some form of input, maybe the acoustic feature vectors,
maybe an i-vector, or maybe even just a score. And then the output will be over
a small number of discrete classes, for example target and non-target in speaker recognition, or,
in language recognition, a number of language classes.
So, the output of the recognizer might be in likelihood form: given one piece of data, you
have a likelihood for each of the classes. If you also have a prior, and for the purposes here
you can consider the prior as given, a prior distribution over the classes, then it's easy:
we just plug that into Bayes' rule and you get the posterior. So, you can go from the
posterior to the likelihoods or the other way round. They're equivalent, they have the same
information.
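As a minimal sketch of this equivalence, with made-up numbers (the likelihoods and the prior below are just placeholders):

```python
# Given class likelihoods and a prior, Bayes' rule gives the posterior;
# the two representations carry the same information.
import numpy as np

likelihoods = np.array([0.8, 0.1])   # hypothetical P(data | class) for target, non-target
prior       = np.array([0.3, 0.7])   # assumed prior distribution over the classes

joint = likelihoods * prior
posterior = joint / joint.sum()      # Bayes' rule
print(posterior)                     # approximately [0.774, 0.226]

# Going back: the log-likelihood-ratio is recoverable from posterior and prior,
# so calibration can be judged on either side of Bayes' rule.
llr = np.log(posterior[0] / posterior[1]) - np.log(prior[0] / prior[1])
print(llr, np.log(likelihoods[0] / likelihoods[1]))   # the two agree
```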
Also, if the one side is well calibrated, we can say the other side is
well calibrated as well. So, it doesn't really matter on which side of Bayes' rule
we look at calibration. The recognizer output, for the purposes of this presentation, will be
the likelihoods, and we'll look at measuring calibration on the other side of Bayes' rule.
So, why is calibration necessary?
Because our models are imperfect models of the data.
Even when the model manages to extract information that could in principle discriminate with
high accuracy between the classes, the probabilistic representation might not be optimal. For
example, it might be overconfident: the probabilities might all be very close to zero and one,
whereas the accuracy doesn't warrant that kind of high confidence. So, that's the calibration problem.
So, calibration analysis will help you to detect that problem and also to fix it.
So, calibration can have two meanings: as a measure of goodness, how good is the
calibration, and also as a... as a transformation.
So, this is
what the typical transformation might look like. We have a pattern recognizer, which outputs likelihoods.
That recognizer might be based on some probabilistic model; the joint probability here, by which
I want to indicate the model, can be generative, probability of data given class, or
the other way round, discriminative, probability of class given data; it doesn't matter.
You're probably going to do better if you recalibrate that output, and again you can do that with
another probabilistic model. This time we're modeling the scores: the likelihoods that come out of
the first model, we call them scores, features if you like. The scores are simpler, of lower
dimension than the original input, so they're easier to model. Again, you can do generative or
discriminative modeling of the scores. What I'm about to show is going to be mostly about
discriminative modeling, but you can do generative as well.
so
How can we call the likelihoods that come out of the second stage calibrated? Because
we're going to measure them: we're going to measure how well they're calibrated, and moreover,
we're going to force them to be well calibrated.
so
If you send the likelihoods through Bayes' rule, then you get the posterior, and that's where
we're going to measure the calibration with the proper scoring rule.
so
Obviously, you need to do this kind of measurement with a supervised evaluation database. So,
you apply the proper scoring rule to every example in the database and then you
average the values of the proper scoring rule. That's your measure of goodness of
your recognizer on this database, and you plug that in as the objective function of
the training algorithm, and you can adjust the calibration parameters; that's the way you
force your calibrator to
produce calibrated likelihoods.
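Here is a rough sketch of that assembly, assuming an affine calibrator on log-likelihood-ratio scores and the logarithmic scoring rule as the objective; the scores, labels and prior below are synthetic stand-ins, not anything from the talk:

```python
# Affine calibration (scale a, offset b) trained by minimizing the average
# logarithmic proper scoring rule over a supervised database.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
tar = rng.normal(2.0, 1.0, 1000)     # hypothetical target scores
non = rng.normal(-1.0, 1.5, 10000)   # hypothetical non-target scores
scores = np.concatenate([tar, non])
labels = np.concatenate([np.ones_like(tar), np.zeros_like(non)])
prior = 0.5                          # assumed effective target prior for evaluation

def avg_log_score(params):
    a, b = params
    llr = a * scores + b                          # calibrated log-likelihood-ratio
    log_post_odds = llr + np.log(prior / (1 - prior))
    # logarithmic scoring rule applied to the posterior, averaged per class
    c_tar = np.mean(np.log1p(np.exp(-log_post_odds[labels == 1])))
    c_non = np.mean(np.log1p(np.exp( log_post_odds[labels == 0])))
    return prior * c_tar + (1 - prior) * c_non

res = minimize(avg_log_score, x0=[1.0, 0.0], method="Nelder-Mead")
print(res.x)   # calibration parameters that minimize the average scoring rule
```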
So, you can use the same assembly for fusion, if you have multiple systems to combine into
a final fused output; or, more generally, you can just train your whole
recognizer
with the same principle.
So, in summary of this part: calibration is most easily applied to the likelihoods;
simple affine transforms work very well in the log-likelihood domain; but the measurement is
based on the posteriors, and it's going to be done with proper scoring rules. So,
let's introduce proper scoring rules
I'll first talk about the classical definition of proper scoring rules; then, a more engineering
viewpoint: how you can define them via decision theory. It is also very useful to look
at them in information theory; that will tell you how much information the recognizer is
delivering to the user. But that won't be directly relevant to this talk, so I'll
just refer you to this reference.
So, we start with the
classical definition, and the sort of canonical example is weather forecasting.
so
We have a weather forecaster. He predicts whether it will rain tomorrow or not and
he has a probabilistic prediction, he gives us a probability for rain.
The next day, it rains or it doesn't. How do we decide whether that was
a good probability or not?
So, it's reasonable to choose some kind of a cost function. So, you put the
probability, the prediction in there, as well as the fact whether it rained or not.
So, what should this cost function look like?
It's not so obvious how this cost function should look. If, for example, temperature were being
predicted, it would be easy: you could compare the predicted against the actual temperature and
just compute some kind of squared difference. But in this case, it's a probabilistic prediction,
and on the day it either rains or it doesn't; there's no true probability for rain, so we
can't do that kind of
direct comparison.
So, the solution to forming such a cost function is the family of cost functions
called proper scoring rules,
and they have
two nice properties. First of all, they force predictions to be as accurate as
possible, but subject to honesty: you can't pretend that your
prediction is more accurate than it actually is. So,
you need these two things to work together.
so
This is a simple picture of how weather forecasting might be done. You've got
the data, which comes from satellites and other sensors,
and the probabilistic model, and then you compute the probability for rain, given the observations
and the model: a posterior probability. So, the weather forecaster might ask himself: do I report
what I calculated, or do I output
some warping or reinterpretation of this probability? Maybe that would be more useful for my
users? Maybe my boss will be happier with me if I pretend that my predictions
are more accurate than they really are? So,
if the weather forecaster trusts his model and his data,
then we can't really do better than the weather forecaster; we're not weather forecasters ourselves.
So, what we want is his best probability, p, the one that he calculated,
not something else. So, how do we force him to do that?
So, we tell the weather forecaster: tomorrow,
when you've submitted some q, which might be different from the p which we really want,
we are going to evaluate you with a proper scoring rule, with this type of
cost function. Now, the weather forecaster doesn't know whether it's going to rain or
not. The best information he has is his prediction, p. So, he forms an expected
value for the way he's going to be evaluated tomorrow: what's the expected cost that
I'm going to be evaluated with tomorrow? So,
a proper scoring rule satisfies this expectation requirement:
the probability p forms the expectation,
q is what he submits, and with a
proper scoring rule you're always going to do better if you submit p instead of q.
So that is the way that the proper scoring rule motivates honesty.
The same mechanism also motivates him to make it more accurate. So,
he might sit down and think: If I have a bigger computer, if I launch
more satellites, I could get a better prediction. And even though I don't have the
better prediction, if I had it, I would form my expectation with the better prediction.
And the same mechanism then says: well, we would do better with the better prediction,
it's kind of obvious, but the proper scoring rule makes that obvious statement work mathematically.
Here's another view. It turns out that if you look at the
expected cost of the proper scoring rule as a function of the predicted probability, then
you get the minima at the vertices of the probability simplex. So,
this is very much like the entropy function; in fact, if you use the logarithmic
scoring rule, this is just the entropy of p. So minimizing expected cost is the same
as minimizing entropy, minimizing uncertainty. So,
driving down expected cost tends to favour
sharper predictions, as they sometimes call it. But it has to be subject to
calibration as well.
so
why are we going on about what humans might do? Because we can motivate machines in the same way:
that is called discriminative training. And we can expect the same benefits.
Some examples. There are many different proper scoring rules; the very well known ones are
the Brier score, which has this quadratic form...
I'll show a graph just now... and also the logarithmic score. In
both cases it's really easy to show that they do satisfy this expectation requirement.
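A small numerical check of that expectation requirement might look like this; the honest probability p below is arbitrary:

```python
# If the honest prediction is p, the expected cost E_p[C(q, rain)] of either
# rule is minimized by reporting q = p.
import numpy as np

def brier(q, rained):      # quadratic scoring rule
    return (1 - q)**2 if rained else q**2

def logarithmic(q, rained):
    return -np.log(q) if rained else -np.log(1 - q)

p = 0.3                                   # the forecaster's honest probability of rain
q_grid = np.linspace(0.01, 0.99, 99)      # candidate reported probabilities
for rule in (brier, logarithmic):
    expected = [p * rule(q, True) + (1 - p) * rule(q, False) for q in q_grid]
    best_q = q_grid[int(np.argmin(expected))]
    print(rule.__name__, "is minimized near q =", best_q)   # both print ~0.3
```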
So, here's an example, at the top left. If it does rain, we're looking at the green curve. If you
predicted zero probability for rain, that's bad, so the cost is high. If you predicted
probability one, that's good, so the cost is low. If it doesn't rain, it
works the other way round. So, that's the Brier score. The logarithmic one is very
similar, except it goes out to
infinity here.
If you take another view and do a log-odds transformation on the probability, then you see
they look very different.
The logarithmic one turns out to form nice convex objective functions, which are easier
to numerically optimize;
the Brier score is a little bit harder to optimize.
So now, let's switch to the
engineering view of proper scoring rules. We're building these recognizers because we actually want
to use them for some useful purpose; we want to
do whatever we're doing in a cost-effective way; we want to minimize expected cost.
So, if you ask what the consequences are of the Bayes decisions that I can make
with some probabilistic prediction, then you've really already constructed the proper scoring rule;
you just have to ask that very natural question. All proper scoring rules can
be interpreted in that way.
so
I'm assuming everybody knows this. This is the example of the NIST detection cost function.
You make some decision to accept or reject, and it's a target or a non-target.
If you get it wrong there's some cost; if you get it right, everything
is good, the cost is zero. So, that's the consequence.
Now we're using a probabilistic recognizer, which gives us this probability distribution: q for
target, one minus q for non-target. And we want to
use that to make a decision, so we make a minimum-expected-cost Bayes
decision. We're assuming that the input is well calibrated, so that we can use it directly
in the minimum-expected-cost Bayes decision. So, on the two sides of the inequality,
we've got the expected costs. You choose the decision with the lowest expected cost and then you put
it into the cost function. So the cost function is used twice; you see, I've
highlighted the cost parameters that are used twice. The end result is then the
proper scoring rule: you're comparing the probability distribution over the hypotheses with the true
hypothesis, and the proper scoring rule tells you how well these two match.
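As a sketch of that construction, with illustrative (not official) cost values:

```python
# Take a detection cost function, make a minimum-expected-cost Bayes decision
# with the submitted posterior q, then charge the cost of that decision
# against the true hypothesis: the cost function is used twice.
C_miss, C_fa = 10.0, 1.0     # assumed costs of a miss and a false alarm

def dcf_scoring_rule(q_target, is_target):
    # expected cost of each hard decision, given the posterior q_target
    exp_cost_accept = (1 - q_target) * C_fa      # accept -> risk a false alarm
    exp_cost_reject = q_target * C_miss          # reject -> risk a miss
    accept = exp_cost_accept <= exp_cost_reject  # Bayes decision
    # the cost function is used a second time, now against the truth
    if is_target:
        return 0.0 if accept else C_miss
    return C_fa if accept else 0.0

print(dcf_scoring_rule(0.05, True))    # confident 'non-target', but it was a target: cost 10
print(dcf_scoring_rule(0.05, False))   # same prediction, non-target: cost 0
```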
So, this is exactly how NIST will form their new evaluation criterion this year. In all
the evaluations up to two thousand and ten, they used just the DCF as is,
applied to hard decisions. This year they'll use a proper scoring rule
and they'll ask for likelihood ratios. Of course, those have to be put through Bayes'
rule to get posteriors, and then they go into the proper scoring rule.
We can generalise this to more than two classes, and you can use more complicated cost
functions; really any cost function. There's a trivial inequality that shows
this type of construction of a proper scoring rule
satisfies the expectation requirement.
so
in summary of this part, the Bayes decision interpretation tells us: if you need a
proper scoring rule, take your favourite cost function,
apply this recipe, apply Bayes decisions, and you'll have a proper scoring rule; and
it will measure and optimize the cost-effectiveness of your recognizer.
So, just a last word about the discrimination/calibration decomposition.
The Bayes decision measures the full cost of using the probabilistic recognizer to make decisions.
So, often it's useful to decompose this cost into two components. The first might be
the underlying inability of the recognizer to perfectly discriminate between the two classes. Even if
you get the calibration optimal, you still can't recognize the classes perfectly. And then, the
second component is the additional cost due to bad calibration.
So, we've all been looking, in my case for more than a decade, at
NIST's actual DCF versus minimum DCF. That's very much the same kind of decomposition, but
in that case, calibration refers only to setting your decision threshold. So, if we move
to probabilistic output of the recognizer, that's a more general type of calibration; can we
do that same kind of decomposition? My answer is yes.
I've tried it over the last few years with speaker and language recognition and in
my opinion it's a useful thing to do. So, the recipe is:
at the output end of your recognizer you isolate a few parameters that you call
the calibration parameters, or you might add an extra stage and call that a calibration
stage. If it's multiclass, maybe there's some debate about how to choose these parameters.
Once you've done that, you choose whatever proper scoring rule you're going to use for
your evaluation metric,
and you evaluate it over your supervised evaluation database; that's then called the actual cost.
Then, the evaluator goes and, using the true class labels, minimizes just those calibration
parameters, and that reduces the cost
somewhat. Let's call that the minimum cost, and then you can compare the
actual to the minimum cost. If they are very close, you can say: my calibration
was good. Otherwise, go back and see what went wrong.
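A minimal sketch of that recipe, with synthetic scores and an affine transform standing in for the isolated calibration parameters; the metric choice (logarithmic rule) is also just an assumption:

```python
# Actual cost: average proper scoring rule with the calibration as submitted.
# Minimum cost: the evaluator re-optimizes only the calibration parameters
# using the true labels, then re-evaluates.
import numpy as np
from scipy.optimize import minimize

def avg_log_score(llr, labels, prior=0.5):
    log_post_odds = llr + np.log(prior / (1 - prior))
    c_tar = np.mean(np.log1p(np.exp(-log_post_odds[labels == 1])))
    c_non = np.mean(np.log1p(np.exp( log_post_odds[labels == 0])))
    return prior * c_tar + (1 - prior) * c_non

rng = np.random.default_rng(1)
labels = np.concatenate([np.ones(1000), np.zeros(10000)])
llr = np.concatenate([rng.normal(4, 2, 1000), rng.normal(-2, 2, 10000)])  # miscalibrated LLRs

actual = avg_log_score(llr, labels)
opt = minimize(lambda ab: avg_log_score(ab[0] * llr + ab[1], labels),
               x0=[1.0, 0.0], method="Nelder-Mead")
minimum = opt.fun
print(actual, minimum)   # a large gap would indicate a calibration problem
```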
So, in the last part of the talk we're going to play around with proper
scoring rules a bit.
I proposed Cllr for use in speaker recognition in two thousand and four,
but
what I want to show here is that it's not the only option:
you can adjust the proper scoring rule to target your
application.
So, I'll show how to do that.
The mechanism
for, let's call them binary proper scoring rules, is the fact that you can combine
proper scoring rules:
just a weighted summation of proper scoring rules, and once you do that, it's
still a proper scoring rule. So, you might have multiple different proper scoring rules representing
slightly different applications, applications that work at different operating points,
and if you do this kind of combination of those proper scoring rules, you get
a new proper scoring rule that represents a mixture of applications. A real application is probably
going to be used at just a single operating point, but if
it's a probabilistic output, you can hope to apply it to a range of different
operating points. So, this type of
combination of proper scoring rules is then a nice way to
evaluate that kind of more generally applicable recognizer.
So, NIST is also going to do that, this year in SRE twelve: they will
use a combination of two discrete operating points in a proper scoring rule. You can
do discrete combinations, or continuous combinations as well. The interesting thing is that all
binary, two-class proper scoring rules can be described in this way; I'll show how
that is done. This DCF turns out to be the fundamental building block for
two-class proper scoring rules.
This is the same picture I had before. I've just normalized the cost function:
there's a cost of miss and a cost of false alarm, but that's redundant. You don't really need those
two costs; we can reduce it to one parameter,
because the magnitude of the proper scoring rule doesn't really tell us anything. If
you normalize it like this, then the expected cost at the decision threshold is always
going to be one, no matter what the parameters are, no matter what the
operating point is. So, the parameter that we're using
is the Bayes decision threshold:
the posterior probability for the target
is compared to this parameter t, which is the threshold.
The cost of a miss is one over t, and the cost of a false
alarm is one over one minus t. You see, if t is close to
zero, the one cost goes to infinity; if it's close to one, the other cost
goes to infinity. So, you're covering the whole range of cost ratios just by varying this
parameter t. We'll call this the normalized DCF scoring rule; I've got the
C-star notation for it, and the operating point is t.
So, what does it look like?
It's a very simple step function.
If your posterior probability for the target is too low, below the threshold t, you're going
to miss the target and get hit with the miss cost.
If p is high enough, and it really is the target, we pass t and
the cost is zero. If it's not the target, the step function works
the other way round: the red line goes up.
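A sketch of that step function, under the parameterization just described (miss cost one over t, false-alarm cost one over one minus t); the example values are arbitrary:

```python
# The normalized "C-star" scoring rule at operating point t: a Bayes decision
# at threshold t, then the appropriate normalized cost against the truth.
def c_star(p_target, is_target, t):
    if is_target:
        return 1.0 / t if p_target <= t else 0.0        # missed target
    return 1.0 / (1.0 - t) if p_target > t else 0.0     # false alarm

# Same prediction, different operating points: as t moves, the relative
# penalties for misses and false alarms change.
print(c_star(0.4, True, t=0.5))    # 2.0   (miss at the equal-cost point)
print(c_star(0.4, True, t=0.1))    # 0.0   (accepted at this low threshold)
print(c_star(0.4, False, t=0.1))   # ~1.11 (and therefore a false alarm if non-target)
```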
so
here I have four different values of t. For one setting, the cost of miss is high and the
cost of false alarm is low; as you adjust t, they change.
In comparison, I've got the logarithmic scoring rule,
and you'll see it looks very similar. It tends to follow the way that the
miss and false alarm costs change, and indeed, you'll find that if you integrate over all
values of t,
you get the logarithmic scoring rule.
so
All binary proper scoring rules can be expressed as an expectation over operating
points. The integrand here is the step function, the C-star guy, as well
as some weighting distribution.
so
The weighting distribution is a relative distribution: it has to be non-negative and it has to
integrate to one,
and it determines
the nature of your proper scoring rule. Several properties depend on this weighting distribution, and
it also tells you what relative importance you place on different operating points. There
is a rich variety of things you can do. You can make the weighting function
an impulse... I shouldn't say function, it's a distribution; in
mathematics an impulse is not really a function.
In any case, if it's an impulse, we're looking at a single operating point. If it's
a sum of impulses, we're looking at multiple, discrete operating points. Or, if
it's a smooth probability distribution, then we're looking at a continuous range of operating points.
Examples of the discrete ones: that could be the SRE ten operating point,
which is an impulse at a single point; or in SRE
twelve, where
you're looking at two operating points, a mixture of two impulses.
If you do smooth weighting, this quadratic form over here gives the Brier score,
and the logarithmic score just uses a very simple constant weighting. So, the weighting matters a
lot. The Brier score, if you use it for discriminative training, forms a non-convex
optimization objective, which also tends not to generalize that well: if you train on
this data and then use the recognizer on that data, it doesn't generalize that well, whereas
the logarithmic one
has a little bit of natural regularisation built in, so you can expect to do better on
new data.
so
You can work through this in your own time; it's just an example of how the integral works out.
The step function causes the probability that you submit to the proper scoring rule to
appear in the
boundary of the integral; it's very simple, and you get this logarithmic form.
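Here is a quick numerical check of that integral (not the slide's derivation), assuming the flat weighting and the step-function rule from before:

```python
# With a flat weighting over operating points t, the integral of the
# step-function rule reproduces the logarithmic scoring rule, -log(p).
import numpy as np
from scipy.integrate import quad

def c_star(p, is_target, t):
    if is_target:
        return 1.0 / t if p <= t else 0.0
    return 1.0 / (1.0 - t) if p > t else 0.0

p = 0.2
integral, _ = quad(lambda t: c_star(p, True, t), 0.0, 1.0, points=[p])
print(integral, -np.log(p))   # both approximately 1.609
```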
So now, let's do a case study,
and let's design a proper scoring rule to target the low false alarm region, for,
of course,
speaker
detection.
There's a range of thresholds;
that's the threshold you place on the posterior probability,
and that corresponds to
an operating point on the DET curve.
So we can use this weighting function to tailor the proper scoring rule to target
only a part of the DET curve if we want. George Doddington recently proposed
another way to achieve the same thing; he called it Cllr-M10, and
it's mentioned in the new NIST evaluation plan. There's also an
upcoming Interspeech paper.
He used the standard logarithmic scoring rule, which is essentially the same as the Cllr
that I proposed, but he evaluated only the scores above some threshold, so that
you target that low false alarm region;
he omitted the scores below the threshold.
So, unfortunately, George's metric does not quite fit into this framework of a proper scoring
rule, because it has a threshold that depends on the miss rate of each system, so
the threshold is slightly different for different systems. To
make it a proper scoring rule, I'm just saying: let's use a fixed threshold, and
then let's call it the truncated Cllr.
And then you can also express
the truncated Cllr with just a weighting function.
The original Cllr, the logarithmic score, has a flat weighting distribution; truncated Cllr uses a
unit step,
which steps up wherever you want to
threshold the scores.
so
there are several different things you can do; let's call them variations of
Cllr. The original one is just the logarithmic
proper scoring rule, which you need to apply to a probability. To
go from a log-likelihood ratio to a probability, we need a prior and then
Bayes' rule; the prior that defines Cllr is just a half.
You can shift Cllr by using some other prior, and I'll show in a graph just after this
in what sense it is shifted.
That mechanism has been in the FoCal toolkit, and most of us have probably
used it to do calibration and fusion,
but I never explicitly recommended it as an evaluation criterion.
And then there's this truncated Cllr, which is very close to what George proposed,
and uses a unit-step weighting.
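As a sketch of the first two variants, with synthetic scores (the truncated variant would just swap the flat weighting for a unit step, as in the integral view above):

```python
# Original Cllr applies the logarithmic scoring rule to the posterior obtained
# with a prior of one half; the shifted variant simply uses a different prior.
import numpy as np

def prior_weighted_cllr(tar_llrs, non_llrs, prior=0.5):
    offset = np.log(prior / (1 - prior))          # Bayes' rule in the log-odds domain
    c_tar = np.mean(np.log1p(np.exp(-(tar_llrs + offset)))) / np.log(2)
    c_non = np.mean(np.log1p(np.exp(  non_llrs + offset))) / np.log(2)
    return prior * c_tar + (1 - prior) * c_non

rng = np.random.default_rng(2)
tar = rng.normal(3, 2, 1000)                       # hypothetical target LLRs
non = rng.normal(-3, 2, 10000)                     # hypothetical non-target LLRs
print(prior_weighted_cllr(tar, non))               # original Cllr, prior = 0.5
print(prior_weighted_cllr(tar, non, prior=0.01))   # shifted towards the low false alarm region
```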
So, there's this transformation between the log-likelihood-ratio and posterior domains. I'm going to show
a plot where the threshold is a log-likelihood-ratio threshold, so this transformation
is involved, and the
prior is also involved: the prior just shifts you along the x-axis. And you have
to remember that this transformation has a Jacobian associated with it, because on the right-hand
side you're in the posterior threshold domain.
So, this is what the graph looks like.
On the x-axis is the log-likelihood-ratio threshold;
on the y-axis is the relative weighting that the proper scoring rule assigns to different operating
points. In this view, the weighting function is a probability distribution. It looks almost
like a Gaussian, but it's not quite a Gaussian.
Now, if we just change the prior, then you get the shifted
Cllr, which is the green curve, shifted to the right; that is,
shifted towards the low false alarm region. I've labelled the regions here: the middle
one we can call the equal error rate region, close to log-likelihood-ratio zero;
then the low miss rate region, because if your threshold is low, you're not going to
miss so many
targets; and then the low false alarm region.
The blue curve is the truncated Cllr:
you basically ignore all scores on this side of the threshold and, of course, you
have to scale it, by a factor of about ten here, so that it integrates to one.
So now, let's look at another, final option.
There's the beta family of proper scoring rules, which was proposed by Buja.
It uses the beta distribution as the weighting distribution, and it has two adjustable parameters.
Why the beta and not just a Gaussian? The answer is that the integrals work out if
you use the beta.
It's also general enough: by adjusting these parameters we can get the Brier score, the logarithmic
score, and also the C-star that we've been using here.
So it's a comfortable family to use for this purpose.
For this presentation, I've chosen the parameters to be equal to ten and one, and
that's going to be
very similar to the truncated Cllr.
This is what the proper scoring rule looks like.
I like this logarithm here: if p goes close to one,
the polynomial term doesn't do very much anymore, it's more or less constant, so
at the very low false alarm region this just becomes the logarithmic scoring rule again.
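Here is a sketch of that family, under my reading of the talk: the step-function rule is weighted by a Beta density over operating points and integrated numerically. The parameter ordering and the example values are assumptions for illustration; the flat Beta(1, 1) case recovers the logarithmic rule.

```python
# Beta-weighted proper scoring rule, evaluated by numerical integration.
import numpy as np
from scipy.integrate import quad
from scipy.stats import beta

def c_star(p, is_target, t):
    if is_target:
        return 1.0 / t if p <= t else 0.0
    return 1.0 / (1.0 - t) if p > t else 0.0

def beta_score(p, is_target, a, b):
    integrand = lambda t: c_star(p, is_target, t) * beta.pdf(t, a, b)
    value, _ = quad(integrand, 0.0, 1.0, points=[p])
    return value

print(beta_score(0.2, True, 1, 1), -np.log(0.2))   # flat weighting recovers -log(p)
print(beta_score(0.2, True, 10, 1))                # this choice weights high thresholds heavily
```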
So that's what this new beta rule looks like: the red curve over here.
It has its peak in the same place as the truncated one or the
shifted one. But, compared to, for example, the shifted one, it
more effectively ignores
the one side of the DET curve. So, if you believe this is the way
to go forward, that you really do want to ignore that side of the DET curve,
you can tailor your proper scoring rule to do that. I've not tried the
blue or the red version here myself numerically,
so I cannot promise that you're going to do well in SRE twelve
if you use one of these curves. It's up to you to experiment. I'd
just like to point out: Cllr is not the only proper scoring rule.
They're very general, you can tailor them,
play with them, and see what you can get.
These guys are saying
we have to say something about multiclass,
so I've got one slide on multiclass.
Multiclass turns out to be a lot more difficult to analyze;
it's amazing, the complexity and the trouble you can get into if you go from two to
three classes.
But it's useful to know that some of the same rules still apply. You can
construct
a proper scoring rule:
choose some cost function and construct it via the Bayes decision recipe.
You can also combine them, so the same rules apply.
And the logarithmic scoring rule is just very nice; it behaves nicely.
It also turns out to be an expectation over a weighting of misclassification errors, very
similar to what I've shown before; the integral is a lot harder to show, but it
works like that. The logarithmic scoring rule forms a nice evaluation criterion
and a nice discriminative training criterion, and
it will be used as such in the Albayzin two thousand and twelve language recognition
evaluation. Nicholas here will be telling us more about that later this week.
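A minimal sketch of the multiclass logarithmic scoring rule, with an arbitrary number of classes and an assumed flat prior:

```python
# Multiclass logarithmic scoring rule (multiclass cross-entropy): send the class
# log-likelihoods through Bayes' rule with a prior, then charge minus the log
# of the posterior of the true class.
import numpy as np

def multiclass_log_score(log_likelihoods, prior, true_class):
    log_post = log_likelihoods + np.log(prior)
    log_post -= np.logaddexp.reduce(log_post)     # normalize (Bayes' rule in the log domain)
    return -log_post[true_class]

log_lk = np.array([-1.0, -2.5, -4.0])             # hypothetical log-likelihoods for 3 languages
prior = np.array([1/3, 1/3, 1/3])
print(multiclass_log_score(log_lk, prior, true_class=0))
```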
So, in conclusion:
in my view, proper scoring rules are essential if you want to use a recognizer's
probabilistic output.
They work well for discriminative training:
you have to choose the right proper scoring rule for your training, but some of
them do work very well. They have a rich structure, they can be tailored, and there's
not only one.
And in future, maybe we'll see them used more generally in machine learning, even for
generative training.
Some selected references. The first one, my Ph.D. dissertation, has a lot more material about
proper scoring rules and many more references.
We have time for a few questions.
Well,
we've had a bit of a discussion
in the context of a
recognizer that has to recognize the age of the speaker, and if
you look at the age as a continuous variable, then the nature
of the proper scoring rule changes.
There's a lot of literature on that type of proper scoring rule.
There are extra issues.
For example, you have to ask,
even in the multiclass case:
is there some association between the classes, are some of them
closer, so that if you make an error, the error is...
well, let's take an example. If the language is really one of the Chinese languages and your
recognizer says it's one of the other Chinese languages, that error is not as bad as saying it's
English.
The logarithmic scoring rule, for example, doesn't do that: any error is as bad
as any other error.
If you have a continuous range like age:
if the age is really thirty and you say it's thirty-one, that's
not such a bad error. There's a logarithmic version of the continuous
scoring rule, but
that one will not tell you that such an error is excusable.
So, there are ways to design scoring rules to take into account
some structure in the way you define your classes.
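As an illustration of that kind of structured cost, here is a hypothetical cost matrix plugged into the same Bayes-decision recipe; the class names and cost values are made up:

```python
# Confusing one Chinese language for another is penalized less than calling it
# English; the Bayes-decision recipe turns this cost matrix into a proper
# scoring rule that treats the two kinds of error differently.
import numpy as np

classes = ["Mandarin", "Cantonese", "English"]
# cost[i, j] = cost of deciding class j when the truth is class i
cost = np.array([[0.0, 0.2, 1.0],
                 [0.2, 0.0, 1.0],
                 [1.0, 1.0, 0.0]])

def structured_scoring_rule(posterior, true_class):
    expected_costs = posterior @ cost            # expected cost of each decision
    decision = int(np.argmin(expected_costs))    # Bayes decision
    return cost[true_class, decision]            # charge it against the truth

q = np.array([0.1, 0.6, 0.3])                    # a submitted posterior
print(structured_scoring_rule(q, true_class=0))  # 0.2: decided Cantonese, truth Mandarin
```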
I like to think we've thought more about the problem,
and I
think one of the reasons for that is the NIST evaluations, and specifically
the DCF that we've been using in the NIST evaluations.
In machine learning they like to just count errors;
going from error rate to DCF is a simple step, we're just weighting
the errors.
You never speak about the constraints
concerning the datasets.
If we are targeting
some part of the
curve, like the low false alarm region, we will certainly have some constraints on the
dataset, to have a balanced dataset. That's my first question.
The second one:
maybe we should also start to speak about the quantity of information we have in the speech files.
I'm coming back to your example in language recognition. Is it the same
error if
your choice is one Chinese language when it was a different Chinese language, compared to deciding
it's English?
And in speaker recognition, is it the same error if you decide it's
not a target when
you have nothing in the speech file, no information in the speech file, as when you
decide that with a very good speech file, with a lot of information?
Let me answer the first question, if I understood it correctly:
you asked about the size of your evaluation database. Of course, that's
very important.
In
the presentation that I gave at the SRE analysis workshop in December last year,
I addressed that issue. If you look at this view of the proper
scoring rule as an integral over error rates,
then,
if you move to an operating point where the error
rate is going to be low, which does happen in the low false alarm region,
then you need
enough data so that you actually do see
errors. If you don't have errors, how can you measure the error rate? So, one
has to be very careful
not to push your evaluation outside of the range
the data can cover.
And the second question:
the case that I covered is just the basics.
If you want a more complicated cost function,
where you want to assign different costs to different flavours of errors, that does fit
into this framework.
You can take any cost function,
as long as it doesn't do something pathological. In the
two-class case, the cost function is simple; in multiclass you have to think really carefully how
to
construct a cost function that doesn't contradict itself.
But once you've formed a nice cost function,
you can apply this recipe:
just plug it into the Bayes decision and back into the cost function, and you'll have a
proper scoring rule. So, this framework does cover that.
are you dealing with people who are real?
Okay, and you're going to tell us some more.