Thank you all for attending this talk. I'm a researcher at the Computer Science Institute, which is a joint unit of the University of Buenos Aires and CONICET in Argentina. Today I'll be talking about the issue of calibration in speaker verification, and hopefully by the end of the talk I will have convinced you that this is an important issue, if you were not already convinced.

The talk will be organized this way: first I'm going to define calibration and give some intuition; then I'll talk about why we should care about it, which is related to how to measure it; then, if we find out that calibration is bad in a certain system, how to fix it; and finally I'll talk about issues of robustness of calibration for speaker verification.
The main task on which I will base the examples is speaker verification. I assume that this audience knows the task well, but just in case: it's a binary classification task where each sample is given by two waveforms, or two sets of waveforms, that we need to compare to decide whether they come from the same speaker or from different speakers. Since the task is binary classification, much of what I'm going to say applies to any binary classification task, not just speaker verification.
Okay, so what is calibration? Say we want to build a system that predicts the probability that it will rain within the next hour, based only on a picture of the sky. If we see this first picture, a mostly clear sky, then we would expect the system to output a low probability of rain, say 0.1, while if we see this other picture, then we would expect a much higher probability of rain, closer to one. We will say that the system is calibrated if the values output by the system coincide with what we see in the data.

So a well-calibrated score should reflect the uncertainty of the system. To be concrete: if, for all the samples that get a score of 0.8 from the system, we see that eighty percent of them are labeled correctly — which is what a score of 0.8 means — then we will say that the system is well calibrated.
Here is an example of a diagram that is used in many tasks — not much in speaker verification, actually — but I think it's very intuitive for understanding calibration. It's called the reliability diagram. Basically, what it shows are the posteriors from a system that was run on certain data — the posteriors that the system gave for the class that it predicted. For example, for this bin we take all the samples for which the system gave a posterior between 0.8 and the next bin edge, and what the diagram shows is the accuracy on those samples. So, if the system were calibrated, we would expect the bars to follow the diagonal, because what the system predicted would coincide with the accuracy that we observe. In this specific case, what we actually see is that the system was correct more often than it thought it would be, which corresponds to a system that underestimates its confidence.
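A minimal sketch of the reliability-diagram computation just described (my own implementation and variable names, not code from the talk), assuming numpy arrays of the posterior assigned to the predicted class and of whether each prediction was correct:

```python
import numpy as np

def reliability_diagram(confidence, correct, n_bins=10):
    """confidence: posterior given to the predicted class, in [0, 1].
    correct: 1 if the prediction was right, 0 otherwise.
    Returns per-bin mean confidence and accuracy (NaN for empty bins)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    mean_conf = np.full(n_bins, np.nan)
    accuracy = np.full(n_bins, np.nan)
    for i in range(n_bins):
        in_bin = (confidence >= edges[i]) & (confidence < edges[i + 1])
        if i == n_bins - 1:           # include confidence == 1.0 in the last bin
            in_bin |= confidence == 1.0
        if in_bin.any():
            mean_conf[i] = confidence[in_bin].mean()
            accuracy[i] = correct[in_bin].mean()
    return mean_conf, accuracy
```

For a calibrated system, accuracy should track mean confidence across the bins (the diagonal of the plot).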
Now, I took this diagram from a paper from 2017 which studies this issue of calibration for different architectures. It compares, on a task called CIFAR-100 — image classification with one hundred classes — the plot that I already showed for a LeNet CNN from 1998 and for a ResNet from 2016, and it shows that the new network is actually much worse calibrated than the old network. For the same bin as before, the new network has an accuracy much lower than the value it predicted. So this is an overconfident network: it thinks it will do much better than it actually does. On the other hand, the error rate of the new network is lower, so if you use this network to make decisions, the decisions will be better than with the old one — but the scores it outputs cannot be interpreted as posteriors; they cannot be interpreted as the certainty that the system has when it makes a decision.
This is a phenomenon we also see a lot in speaker recognition: you can have a badly calibrated model that is still good at discriminating. The problem is that such a model might be useless in practice, depending on the scenario in which we plan to use it. As I already said, the scores from a miscalibrated system cannot be interpreted as the certainty that the system has in its decisions. Also, the scores cannot be used to make optimal decisions without having data with which to tune how to make those decisions. That's what I'm going to talk about in the next two slides.
So how do we make optimal decisions, in general, for binary classification? We usually define a cost function, and this is a very common cost function with very nice properties. It's a combination of two terms, one for each class. The main part here is the probability of making an error for that class — this one is the probability of deciding class zero when the true class was one. We multiply this probability of error by the prior for that class, class one, and we further multiply it by a cost, which is what we think it will cost us if we make this error; this is very specific to the application in which we're going to use the system. The term for the other class is symmetric. So this is an expected cost.

The way to minimize this expected cost is to choose the following decision: for a certain sample x, the decided class should be one if this factor is larger than that factor, and zero otherwise. Each factor is composed of the cost, the prior, and the likelihood for the corresponding class. So we see that what we need in order to make optimal decisions is the likelihood, p(x given c).
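Reconstructing the expressions the slides refer to (the notation here is mine, following the standard Bayes decision framework): the expected cost is

$$
\mathbb{E}[C] \;=\; C_1\,P_1\,P(\hat c = 0 \mid c = 1)\;+\;C_0\,P_0\,P(\hat c = 1 \mid c = 0),
$$

and the decision rule that minimizes it is

$$
\text{decide class } 1 \quad\Longleftrightarrow\quad C_1\,P_1\,p(x \mid 1) \;>\; C_0\,P_0\,p(x \mid 0).
$$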
Now, what we actually have is the likelihood that our model learned on the training data. That's why I'm using the tilde here: to indicate that, while the probabilities in the cost are the ones we expect to see in testing, what we have is what we saw in training. Say we train a generative model; then the model outputs this likelihood directly, but it is the likelihood learned in training. And that's fine — in order to do anything at all in machine learning we assume that this will generalize to testing.

Now, we may not have the likelihood if we trained a discriminative system. In that case we may have the posterior: discriminative systems trained, for example, with cross-entropy output posteriors. What we need to do then is convert those posteriors into likelihoods, and for that we use Bayes' rule: we multiply by p(x) and divide by the prior — and note again that this is the prior in training, not the prior (the one without the tilde) that I put in the cost, which is the one we expect to see in testing. And that's the whole point of why we use likelihoods and not posteriors to make these optimal decisions: it gives us the flexibility to separate the prior from training from the prior in testing.
Okay, so going back to the optimal decisions: we have this expression, and we can simplify it by defining the log-likelihood ratio, which I'm sure everybody here knows since you work in speaker verification. It's the ratio between the likelihood for class one and the likelihood for class zero, and we take the logarithm because it's nicer to work with. We can do a similar thing with the costs — the factors that multiply these likelihoods — and define this threshold theta. With those definitions, the optimal decision simplifies to this: decide class one if the LLR is larger than theta, and class zero otherwise. And the LLR can be computed from the system's posteriors with this expression, which is just Bayes' rule after taking the logarithm; the p(x) cancels out because it appears in both factors, in both likelihoods. This is basically the log-odds of the posterior minus the log-odds of the prior, which can be written this way using the logit function.
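A minimal sketch of that conversion and of the resulting decision rule (function and variable names are mine, not from the talk):

```python
import numpy as np

def logit(p):
    # log-odds: log(p / (1 - p))
    return np.log(p) - np.log1p(-p)

def llr_from_posterior(post_class1, train_prior_class1):
    # LLR = logit(posterior) - logit(training prior); p(x) cancels out.
    return logit(post_class1) - logit(train_prior_class1)

def bayes_decision(llr, theta):
    # theta = log(C0 * P0 / (C1 * P1)), with the costs/priors expected in testing.
    return (llr > theta).astype(int)
```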
Okay, so in speaker verification the feature x is actually a pair of features, or even a pair of sets of features: one for enrollment and one for test. Class one is the class for target, or same-speaker, trials, and class zero is the class for impostor, or different-speaker, trials. The cost function we usually use in speaker verification is the DCF, written with these names for the costs and priors, and we call the errors Pmiss and Pfalse-alarm: a miss means labeling a target trial as an impostor, and a false alarm means labeling an impostor trial as a target. The threshold then looks like this with these names.

Note that, if you only care about making optimal decisions, it's actually only this theta that matters — you don't care about the individual values of costs and priors, only about how they combine into theta. So you can in fact simplify the family of cost functions to consider by using a single parameter, an effective target prior P-tilde, which is equivalent to the whole triplet of parameters for decision purposes, because theta is a function of that effective prior alone. We will be using that effective prior for the rest of the talk, because it's much simpler and it helps a lot in the analysis: we collapse all possible cost functions — all combinations of costs and priors — into a single effective prior.
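A sketch of the two quantities just described, using the standard DCF definitions (the function names are mine):

```python
import numpy as np

def bayes_threshold(p_target, c_miss, c_fa):
    # theta = log( c_fa * (1 - p_target) / (c_miss * p_target) )
    return np.log(c_fa * (1.0 - p_target)) - np.log(c_miss * p_target)

def effective_prior(p_target, c_miss, c_fa):
    # Collapses the (p_target, c_miss, c_fa) triplet into a single prior
    # that gives the same Bayes threshold.
    return (c_miss * p_target) / (c_miss * p_target + c_fa * (1.0 - p_target))

# With equal priors and equal costs the Bayes threshold is zero:
assert abs(bayes_threshold(0.5, 1.0, 1.0)) < 1e-12
```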
So let's see some examples of applications that use different costs and priors. The default, simplest cost function would be to have equal priors and equal costs, and that gives you a threshold of zero — the Bayes-optimal threshold for that cost function.

Now, consider an application like speaker authentication, where your goal is to verify whether somebody is who they say they are using their voice — for example, to get access to some system. There you would expect most of your cases to be target trials, because not that many impostors are trying to get into your system. On the other hand, the cost of a false alarm is very high: you don't want any of those few impostors getting in. That means you need to set a very high cost of false alarm compared to the cost of miss, and that corresponds to a threshold of around +2.3. Basically, you are moving the threshold to the right so that this area under the solid curve — the distribution of scores for the impostor samples, where everything above the threshold of 2.3 becomes a false alarm — is minimized.

Another application, which is actually the opposite in terms of costs and priors, is speaker search. There you're looking for one specific speaker within audio from many other speakers. The probability of finding your speaker is low, say one percent, but the errors you care about and want to avoid are the misses, because you're looking for one specific speaker that is important to you for some reason, and you don't want to miss them. In that case the optimal threshold is symmetric to the previous one, around -2.3, and what you're trying to minimize is the area under the dashed curve to the left of the threshold, which is the probability of miss.
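The talk does not give the exact costs and priors behind those two numbers, so here are illustrative parameter choices of my own that reproduce thresholds of roughly +2.3 and -2.3, using the same helper as in the earlier sketch:

```python
import numpy as np

def bayes_threshold(p_target, c_miss, c_fa):
    return np.log(c_fa * (1.0 - p_target)) - np.log(c_miss * p_target)

# Authentication-like setup: equal priors, false alarms 10x more costly.
print(bayes_threshold(0.5, c_miss=1.0, c_fa=10.0))     # ~ +2.3
# Search-like setup: rare target (1%), misses ~1000x more costly.
print(bayes_threshold(0.01, c_miss=1000.0, c_fa=1.0))  # ~ -2.3
```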
Okay, so to recap before moving on to evaluation: if we have LLRs, I showed that we can trivially make optimal decisions for any possible cost function that you can imagine, with the formula that I gave. But of course these decisions will only actually be optimal if the system outputs are well calibrated; otherwise they will not be.

So how do we figure out whether we have a well-calibrated system? The idea is: if you're going to have your system make decisions using the thresholds that I showed before, the thetas, then that's how you should evaluate it — have the system make those decisions using those thetas and see how well it does. And then the further question is: could we have made better decisions if we had calibrated the scores before making them? That gives us a measure of how well calibrated the system is to begin with.
two meeting
so
the when we usually evaluate performance on binary classification task
these
by using the cost
no wonder you over initial
so we prefix that the racial
using bayes
a decision theory or not
we just
that is commercial and then compute the beanie some people sometime which of these yes
and the two distributions
and then compute the costs
now we can also
and
define matrix that depend on the whole distribution to two sisters
so for example the equal error rate
is defined
by finding the commercial that makes these two areas the same
so basically to computing you need the whole test this deviation
and a similar thing is the minimum dcf
so what you're doing that case is
we official
across the whole range of scores
compute the cost
for almost possible threshold
and then
choose the threshold okay the mean cost
now that minimum cost is actually bounded
and
and it bummed in by
basically dummy decisions
this system that makes to make decisions
if you put
for example you official all the way to write
then you will only make
and mistakes that are misses
everything will be nice
so you'll have been means of one before xenomorph zero
in that case the cost then you will incur is this factor here
when the other hand if you put the threshold a way to the left
then you will only make false alarms and there will be the cost for that
system
will be these factors here
so basically the bound for the meeting these is
the best of those
two case
they're both times systems but one will be better than the other
are we usually use this mindcf to normalize
the dcf so and nist evaluations for example
the
core studies define is the normalized dcf
also
and then finally another thing we can do is we the threshold
we called the puny some people's allow for every possible value of potential
and then gives a score curves like these
and if we transform the axis appropriately then we get the
standard that curves we use for speaker verification
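A rough sketch of those metrics (standard formulas; the implementation and names are mine), computed from arrays of target and non-target scores:

```python
import numpy as np

def error_rates(tar, non, threshold):
    p_miss = np.mean(tar <= threshold)   # targets rejected
    p_fa = np.mean(non > threshold)      # non-targets accepted
    return p_miss, p_fa

def dcf(tar, non, threshold, p_target=0.5, c_miss=1.0, c_fa=1.0):
    p_miss, p_fa = error_rates(tar, non, threshold)
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa

def min_and_normalized_dcf(tar, non, p_target=0.5, c_miss=1.0, c_fa=1.0):
    # Sweep every candidate threshold (the score values themselves plus the
    # two extremes) and keep the lowest cost.
    candidates = np.concatenate([[-np.inf], np.sort(np.concatenate([tar, non])), [np.inf]])
    costs = [dcf(tar, non, t, p_target, c_miss, c_fa) for t in candidates]
    min_cost = min(costs)
    # Bound given by the better of the two dummy systems (all-miss / all-false-alarm).
    bound = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return min_cost, min_cost / bound
```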
So the cost I've been talking about can be decomposed into a discrimination and a calibration component. Let's see how. Say we assume a cost with equal priors and equal costs; in that case the Bayes-optimal threshold is zero. We compute the cost using that threshold, and since the priors and costs are the same, the cost is given by the average of these two areas, shown here. You can also compute the minimum cost, as I mentioned before, by sweeping the threshold and choosing the one that gives the minimum — which again is the average of two areas, and you can see they are much smaller than the previous ones in this case. The difference between those two costs can be seen as the additional cost you incur because your system is miscalibrated. So this orange area here, the difference between the sum of the areas in one case and the sum in the other, is the cost due to miscalibration, and that is one way of measuring how miscalibrated the system is.

So there is discrimination, which is how well the scores separate the classes, and there is calibration, which is whether the scores can be interpreted probabilistically — which implies that you can make optimal Bayes decisions with them if they are calibrated. And the key here is that discrimination is the part of the performance that cannot be changed by an invertible transformation of the scores. Here's a simple example: say you have these score distributions and a threshold t that you chose for some reason, optimal or not, and you transform the scores with any monotonic transformation — in this example, just an affine one. If you transform the threshold t with the same exact function, the transformed threshold corresponds to exactly the same cost as the threshold t in the original domain. So, by applying a monotonic transformation to your scores you cannot change their discrimination: the minimum cost you can reach is the same in both cases.
The cost I've been talking about measures performance at a single operating point; it evaluates the quality of the hard decisions for a certain theta. A more comprehensive measure is the cross-entropy, given by this expression, which you probably all know. The empirical cross-entropy is the average of the negative logarithm of the posterior that the system gives to the correct class for each sample. You want this posterior to be as high as possible — one, if possible, which gives a logarithm of zero — and if that happens for every sample you get a cross-entropy of zero, which is what you want.

There is also a weighted version of this cross-entropy, which is basically the same, but you split your samples into two terms, one for class zero and one for class one, and you weight these averages by a prior — the effective prior I talked about before. Basically, you make yourself independent of the priors you happen to see in the test data, and you can evaluate for any effective prior you want. The posteriors are computed from the LLRs and the priors using Bayes' rule — and note that these are the priors that you apply in the weights, the same ones you need to use to compute the posteriors from the LLRs. And the famous Cllr that we use in NIST evaluations and in many papers is defined as this weighted cross-entropy with a prior of 0.5, normalized by the logarithm of 2, as explained in the next slide.
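A sketch of the weighted cross-entropy and Cllr computed directly from LLRs (standard definitions; the implementation is mine):

```python
import numpy as np

def weighted_cross_entropy(tar_llrs, non_llrs, p_target):
    prior_logodds = np.log(p_target) - np.log1p(-p_target)
    # -log P(target | x) for target trials and -log P(non | x) for non-target
    # trials, written as softplus terms for numerical stability.
    nll_tar = np.logaddexp(0.0, -(tar_llrs + prior_logodds))
    nll_non = np.logaddexp(0.0, non_llrs + prior_logodds)
    return p_target * nll_tar.mean() + (1.0 - p_target) * nll_non.mean()

def cllr(tar_llrs, non_llrs):
    # Cllr is the weighted cross-entropy at prior 0.5, normalized by log(2),
    # so a system that always outputs LLR = 0 (the prior) scores exactly 1.0.
    return weighted_cross_entropy(tar_llrs, non_llrs, 0.5) / np.log(2.0)
```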
So the weighted cross-entropy can also be decomposed, like the cost, into discrimination and calibration terms: you compute the actual weighted cross-entropy and subtract the minimum weighted cross-entropy. Now, this minimum is not as trivial to obtain as it was for the cost — you can't just choose a threshold, because here we are evaluating the scores themselves, not just the decisions. We need to actually warp the scores to get the best possible weighted cross-entropy without changing their discrimination, and that means using a monotonic transformation. There is an algorithm called PAV — pool adjacent violators — that does exactly that: without changing the rank of the scores, their order, it does the best it can to minimize the weighted cross-entropy. That's what we use to compute this delta, which measures how miscalibrated your system is in terms of weighted cross-entropy.

The weighted cross-entropy is bounded, just like the cost, by a dummy system — in this case a system that outputs the prior instead of the posteriors, a system that knows nothing about its input but still does the best it can by outputting the prior. And that means the worst minimum Cllr is 1.0, because we normalize by log 2, which is exactly the value of this term evaluated at 0.5. So the minimum Cllr will never be worse than one — and if the actual Cllr is worse than one, then you know for sure that there will be a difference between the two, because the minimum is never larger than one; it means you have a calibration problem.
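A rough sketch of the min-Cllr computation just described, using isotonic regression as the PAV implementation and the cllr() sketch from above (the clipping constant is my own addition, to avoid infinite log-odds at the extreme PAV bins):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def min_cllr(tar_llrs, non_llrs, eps=1e-10):
    scores = np.concatenate([tar_llrs, non_llrs])
    labels = np.concatenate([np.ones(len(tar_llrs)), np.zeros(len(non_llrs))])
    # PAV: the best monotone map from scores to empirical P(target | score).
    post = IsotonicRegression(y_min=0.0, y_max=1.0).fit_transform(scores, labels)
    post = np.clip(post, eps, 1.0 - eps)
    # Remove the empirical prior of the pooled set to get optimally warped LLRs.
    p_emp = len(tar_llrs) / len(scores)
    llrs = (np.log(post) - np.log1p(-post)) - (np.log(p_emp) - np.log1p(-p_emp))
    return cllr(llrs[labels == 1], llrs[labels == 0])
```

The calibration loss is then the difference (or relative difference) between cllr() on the raw LLRs and min_cllr().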
Okay. Finally, in terms of evaluation, I wanted to mention these curves, the APE curves — applied probability of error curves. The Cllr gives a single summary number, but you may want to actually see the performance across a range of operating points, and that's what these curves do. They show the cost as a function of the effective target prior — of its log-odds — which also defines the theta. What we see here is: the cost for prior decisions, where the prior decisions are what I mentioned before, a dummy system that always outputs the prior instead of posteriors; the red curve, which is our system, whatever it is, calibrated or not; and the dashed curve, which is the very best you could do if you were to warp your scores with the PAV algorithm. So, for each theta, the difference between the dashed and the red curve is the miscalibration at that operating point. A nice property of these curves is that the Cllr is proportional to the area under them: the actual Cllr is proportional to the area under the red curve, and the minimum Cllr is proportional to the area under the dashed one.
Furthermore, the equal error rate is the maximum of the optimized (dashed) curve. And there are variants of these curves, which you can find in the accompanying papers, that change the way the axes are defined.
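A sketch of an APE-style sweep (my own implementation, not from the talk): the prior-weighted error rate as a function of the prior log-odds, for the system's decisions at the Bayes threshold and for the dummy system that always answers with the prior.

```python
import numpy as np

def ape_curve(tar_llrs, non_llrs, logodds_grid=np.linspace(-7, 7, 141)):
    actual, dummy = [], []
    for lo in logodds_grid:
        p_tar = 1.0 / (1.0 + np.exp(-lo))
        theta = -lo                              # Bayes threshold for this prior
        p_miss = np.mean(tar_llrs <= theta)
        p_fa = np.mean(non_llrs > theta)
        actual.append(p_tar * p_miss + (1.0 - p_tar) * p_fa)
        dummy.append(min(p_tar, 1.0 - p_tar))    # always accept or always reject
    return logodds_grid, np.array(actual), np.array(dummy)
```

Replacing the raw LLRs with the PAV-warped ones from the earlier sketch would give the dashed (minimum) curve.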
Okay, so let's say we find that our system has a calibration problem. Should we worry about it? Should we try to fix it? There are scenarios where a miscalibrated system is not a problem and there is no need to fix it. For example, if you know the cost function ahead of time and there is development data available, then all you need to do is run the system on the development data and find the empirically best threshold on that data, for that system and that cost function, and you're done. You also don't need to worry about calibration if you only care about ranking the samples — say, you want to output the N most likely targets — and nothing else.

On the other hand, fixing calibration may be very necessary in many other scenarios. One of them is when you don't know ahead of time what the system will be used for — what the application is. That means you don't know the cost function, and if you don't know the cost function, you cannot optimize the threshold ahead of time. So if you want to give the user of the system a knob that defines this effective prior, then the system has to be calibrated for the Bayes-optimal threshold to be really optimal, to work well. Another case where you need good calibration is when you want a probabilistic value from your system — some measure of the uncertainty the system has when it makes its decision. You can use that uncertainty, for example, to reject samples when the system is uncertain: if the LLR is too close to the threshold you were planning to use, then perhaps you want the system not to make a decision and to tell the user "I don't know, decide on your own". And another case is when you don't want to make hard decisions at all and instead want to report a value that is interpretable on its own, as in the forensic voice comparison field, for example.
Okay, so say we do want to fix calibration — we are in one of those scenarios where it matters. A very common approach is linear logistic regression. It assumes that the LLR, the calibrated score, is an affine transformation of whatever your system outputs; the parameters of this model are the w and the b, and it uses the weighted cross-entropy as the loss function. Now, to compute the weighted cross-entropy we need posteriors, not LLRs, so we need to convert the LLRs into posteriors, and we use the expression I showed before — the LLR is the log-odds of the posterior minus the log-odds of the prior. Inverting this expression gives the logistic function, the inverse of the logit, and after some trivial computations we get this expression, which is the standard linear logistic expression. We then plug this posterior into the weighted cross-entropy to get the loss, which we can optimize however we wish, and once we optimize it on some data we get the w and b that are optimal for that calibration set.

This is an affine transformation, so it doesn't change the shapes of the distributions at all — basically it looks as if it did nothing — but what it does is shift and scale the axis so that the resulting scores are calibrated. In terms of Cllr, you can see that the raw scores, which are these ones, have a very high Cllr — actually higher than one, so worse than the dummy system — and after you calibrate them, which only shifts and rescales them, you get a much better Cllr. This minimum here is the very best you can do, so with the affine transformation we are doing almost as well as the very best, which means the affine assumption was actually quite good in this case. And this is a real case: VoxCeleb data processed with a PLDA system.
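A minimal sketch of that recipe (my own implementation of the standard linear logistic regression calibration, using scipy's L-BFGS rather than any particular toolkit): find w and b such that w * score + b minimizes the weighted cross-entropy on a calibration set.

```python
import numpy as np
from scipy.optimize import minimize

def train_affine_calibration(tar_scores, non_scores, p_target=0.5):
    prior_logodds = np.log(p_target) - np.log1p(-p_target)

    def loss(params):
        w, b = params
        tar_llrs = w * tar_scores + b
        non_llrs = w * non_scores + b
        # Weighted cross-entropy, as in the cllr() sketch above.
        nll_tar = np.logaddexp(0.0, -(tar_llrs + prior_logodds)).mean()
        nll_non = np.logaddexp(0.0, non_llrs + prior_logodds).mean()
        return p_target * nll_tar + (1.0 - p_target) * nll_non

    result = minimize(loss, x0=np.array([1.0, 0.0]), method="L-BFGS-B")
    w, b = result.x
    return w, b   # calibrated_llr = w * raw_score + b
```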
There are many other approaches to calibration; I'm not going to cover them because it would take another whole keynote. There are nonlinear approaches, which in some cases do better than linear — for example, when the affine assumption is not good enough. There are regularized and Bayesian approaches that do quite well when you have very little data to train the calibration model. And there are approaches that go all the way to using unlabeled data — there is data, but you don't know its labels — and those work surprisingly well.
So, if we have a calibrated score, we know we can trust it as a log-likelihood ratio, which means we can use it to make optimal decisions, and we can also convert it to a posterior if we want to, given a prior. A very nice property of the LLR is that if you were to compute the log-likelihood ratio of your already calibrated score, you would get the same value back. You can treat the score — the LLR — as the feature, and if you compute this ratio you recover the same value. And this property has some nice consequences. For example, for a calibrated score, the two distributions have to cross exactly at zero: when the LLR is zero, this ratio is one, which means these two densities have to be equal — and those two densities are exactly what we're seeing here, the probability density functions of the score for each of the two classes, so they must cross at zero. Further, if we assume that one of these two distributions is Gaussian, then the other is forced to be Gaussian as well, with the same standard deviation and with symmetric means. And this, as I said, is a real example, and it's actually quite close to that assumption in this VoxCeleb setup.
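As a quick check of that last claim (my own derivation, not from the slides): a calibrated score satisfies $s = \log\frac{p(s\mid\mathrm{tar})}{p(s\mid\mathrm{non})}$, i.e. $p(s\mid\mathrm{non}) = p(s\mid\mathrm{tar})\,e^{-s}$. If $p(s\mid\mathrm{tar}) = \mathcal{N}(s;\mu,\sigma^2)$, completing the square gives

$$
\mathcal{N}(s;\mu,\sigma^2)\,e^{-s} \;=\; e^{\sigma^2/2-\mu}\,\mathcal{N}(s;\mu-\sigma^2,\sigma^2),
$$

and requiring the left-hand side to integrate to one forces $\mu = \sigma^2/2$, hence $p(s\mid\mathrm{non}) = \mathcal{N}(s;-\sigma^2/2,\sigma^2)$: the same standard deviation, means symmetric around zero, and the two densities crossing at $s = 0$, as stated.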
Okay, so to recap this part before moving on: what I've been saying is that DET curves, the equal error rate, and the minimum DCF measure only discrimination performance. Basically, they ignore the issue of threshold selection — the issue of how to get from the scores to the actual decisions. On the other hand, the weighted cross-entropy, the actual DCF, and the APE curves measure total performance, and that includes the issue of how to make the decisions. We can further use these metrics to compute the calibration loss, to see whether the system is well calibrated or not. And if you find that calibration is not good, then fixing these calibration issues is usually easy in ideal conditions: you train an invertible transformation, usually on a small representative set, which is enough because in many of these approaches the number of parameters is very small, so you don't need a lot of data. The key, though, is that you need a representative set, and that's what I'm going to discuss in the next slides.
So, what we have observed repeatedly is that the calibration of our speaker verification systems is extremely fragile. This is true for our current systems, and it has always been true for as long as I've been working on speaker verification — almost twenty years now. Anything — language, noise, distortions, duration — affects the calibration parameters, and that means a model trained on one condition is very unlikely to generalize to another condition. The discrimination performance, on the other hand, is usually still reasonable on unseen conditions: if you train a system on telephone data and use it on microphone data, it may not be the best you could do, but it will still be reasonable. If, instead, you train your calibration model on telephone data and try to use it on microphone data, it may perform horribly.
Here is one example. I train the calibration model on two different sets — Speakers in the Wild and an SRE16 set — and apply those models on VoxCeleb2. The raw scores are identical; all I'm changing is the w and the b, based on the calibration set. What we see is that the model trained on Speakers in the Wild is extremely good — basically almost perfect — while the model trained on the SRE16 set is quite bad: better than the raw scores, but still quite bad compared to the best you can do. This is not surprising, because VoxCeleb is quite close to Speakers in the Wild in terms of conditions, while SRE16 is not.

Now, you may think that SRE16 is simply a bad set for doing calibration, but that's not the case, because if you evaluate on the SRE16 evaluation data, the opposite happens: the calibration model that is good in that case is the one trained on the SRE16 set — these scores give a much lower Cllr than the ones calibrated with Speakers in the Wild, and again you almost reach the minimum. So this tells us that the conditions on which the calibration model is trained determine where it is going to be good: you have to match the conditions of your evaluation.
Now, this goes even deeper: if you zoom into a dataset, you can find calibration issues within the dataset itself. Here I'm showing results on the SRE16 evaluation set, where I train the calibration parameters on that same evaluation set — so this is a cheating calibration experiment. I'm showing the Cllr, which is the solid bar, the minimum Cllr, which for the full set is the same by construction, and the relative difference between the two. So for the full set I have no calibration loss, by construction, as I said. However, if I start taking subsets of the full set — randomly, by gender, or by condition — I start to see more calibration loss. A random subset is fine, it is well calibrated, and females and males are reasonably well calibrated. But for specific conditions — defined by the language, the gender, and whether the two waveforms in the trial come from the same telephone number or not — we start to see calibration loss, up to almost twenty percent in this case.

Looking at the distributions for the Tagalog, female, same-telephone-number subset, we see that they are shifted to the right. They should be aligned with zero — remember that these are the score distributions, and if they were calibrated they should cross at zero — but they don't. They are shifted to the right, and that is reasonable: since the telephone number is the same on both sides of the trial, the two sides look much more alike than if the channels were different, so every trial looks more target-like than it should — or than it does in the overall distribution. The opposite happens for the different-telephone-number scores: they shift to the left.

And the final comment here is that this miscalibration within a dataset also causes a discrimination problem, because if you pool these trials as they are, miscalibrated, you get worse discrimination than if you first calibrated each subset and then pooled them together. So there is an interplay between calibration and discrimination, caused by the miscalibration that is happening for different sub-conditions within the set.
Okay, so there have been several approaches in the literature, over the last decade or two at least, that try to solve this problem of condition-dependent miscalibration, where the assumption of a global calibration model — a single w and a single b for all trials — is not a good one. Most of these approaches assume there is an external class or vector representation — either given by the metadata or estimated — that represents the condition of the enrollment and test samples. These vectors are fed into the calibration stage and are used to condition its parameters. Here are some references, if you are interested in taking a look. Overall, these approaches are quite successful at making the final system better — actually more discriminative — because they align the distributions of the different sub-conditions before pooling them together.

There is another family of approaches where the condition-awareness is put in the backend itself rather than in the calibration stage: again there is a condition extractor of some kind, but it affects the parameters of the backend. The thing is that this does not necessarily fix calibration — it improves discrimination in general, but you may still need to do calibration if the backend is, for example, PLDA; in those cases what comes out of it is still miscalibrated, so you may still need a global calibration model at the output.

And recently we proposed an approach that jointly trains the backend and a condition-dependent calibrator, where the condition is extracted automatically as a function of the embeddings themselves, and the whole thing is trained jointly to optimize the weighted cross-entropy. This model gives excellent calibration performance across a wide range of conditions. You can find the paper in these proceedings if you are interested, and there is a closely related paper, also in these proceedings, by Daniel Garcia-Romero, which I suggest you take a look at if you are interested in these topics.
Okay, so to finish up, I have been talking about two broad application scenarios for speaker verification technology. The first is the one where you assume that development data is available for the evaluation conditions. In that case, as I said, you can either calibrate the system on that data, which is matched, or just find the best threshold on it, so calibration in that scenario is not a major issue. In fact, most speaker verification papers historically operate under this scenario. It is also the scenario of the NIST evaluations, where we usually get development data which is maybe not perfectly matched, but pretty well matched to what we will see in the evaluation. In these proceedings I found thirty-three speaker recognition papers, of which twenty-eight fall in this category. They mostly report just equal error rate and minimum DCF; some report actual values, some don't. And I think it's fine to report only minimum DCF in those cases, because you are basically assuming that the calibration or threshold issue is easy to solve — that if you had development data, you could train a calibration model and the actual performance would get very close to the minimum. There is still a caveat, though: you may still have miscalibration problems within sub-conditions, and if you don't report actual DCF or Cllr on sub-conditions, those problems stay hidden behind the overall performance.
The other big scenario is the one where we do not have development data for the evaluation conditions. In that case we cannot calibrate, or set a threshold, on matched conditions; we can only hope that our system will work well out of the box. In these proceedings I found only five papers that operate in this scenario — papers that test a system trained on some conditions on data from different conditions, without assuming development data for those conditions. So, basically, we as a community are very heavily focused on the first scenario, and have always been, historically. I believe this may be why our current speaker verification technology cannot be used out of the box: we are used to always asking for development data in order to tune at least the calibration stage of our systems, because we know that otherwise the system won't work, or may work very badly.

So my question — and maybe we can discuss it in the question-and-answer session — is: wouldn't it be worth it, as a community, to pay more attention to this scenario with no development data available? I believe the new end-to-end approaches have the potential to be quite good at generalizing. This is based on the paper I mentioned, which is not really end-to-end, but almost, and it works quite well — surprisingly well — in terms of calibration across unseen conditions. So I think it's doable. Maybe if we worked on this as a community we could reduce, or even eliminate if we're very optimistic, the performance difference between the two scenarios — so we could end up with systems that are not so dependent on having development data, where perhaps having development data would not help much, I don't know, over the out-of-the-box system.
So what would it entail to develop for this no-development-data scenario? First, we have to assume we will need heterogeneous data for training, of course, because if you train a system only on telephone data it is quite unlikely to generalize to other conditions. Second, one has to hold out some sets, at least during development, that are not used for hyperparameter tuning — otherwise they would not be truly unseen. These sets have to be held out until the very end, until you just evaluate the system out of the box, as in the scenario we are imagining. And of course you need to report actual metrics, not just minimum ones, because in this case you cannot assume you will be able to do calibration well — you need to test whether the model, as it stands, is actually giving you good calibration along with its decisions. Finally, it is probably a good idea to also report metrics on sub-conditions of the set, because the miscalibration issues within sub-conditions may be hidden — within the full distribution of the whole set they sometimes compensate each other — and by reporting metrics on sub-conditions, both actual and minimum, you can actually tell whether there is a calibration problem.

Okay, thank you very much for listening, and I'm looking forward to your questions in the next session.