0:00:00 | Thank you all for attending this talk. |
0:00:03 | I am a researcher at the Computer Science Institute, which is a joint unit of the University of Buenos Aires and CONICET, in Argentina. |
0:00:13 | Today I will be talking about the issue of calibration in speaker verification, |
0:00:17 | and hopefully, by the end of the talk, you will be convinced that this is indeed an important issue, if you were not already. |
0:00:26 | The talk will be organized this way: first, I am going to define calibration and give an intuition for it. |
0:00:35 | Then I will talk about why we should care about it, which is also related to how to measure it, |
0:00:43 | and, if we find out that calibration is bad in a certain system, how to fix it. |
0:00:49 | Then, finally, I will talk about issues of robustness of calibration for speaker verification. |
0:00:55 | The main task on which I will base the examples is speaker verification. |
0:01:02 | I assume that this audience knows the task well, but just in case: |
0:01:10 | it is a binary classification task where the samples are given by two waveforms, or two sets of waveforms, |
0:01:18 | that we need to compare to decide whether they come from the same speaker or from different speakers. |
0:01:26 | Since the task is binary classification, much of what I am going to say applies to any binary classification task, not just speaker verification. |
0:01:36 | Okay, so what is calibration? |
0:01:39 | Say we want to build a system that predicts the probability that it will rain within the next hour, based only on a picture of the sky. |
0:01:47 | If we see this picture, then we would expect the system to output a low probability of rain, say 0.1, |
0:01:53 | while if we see this picture, then we would expect a much higher probability of rain, closer to one. |
0:02:03 | We will say that the system is well calibrated when the values output by the system coincide with what we see in the data. |
0:02:15 | So, a well-calibrated score should reflect the uncertainty of the system. |
0:02:21 | For example, to be concrete: of all the samples that get a score of 0.8 from the system, we would expect eighty percent to be labeled correctly; that is what a score of 0.8 means. |
0:02:35 | If that happens, then we say that the system is well calibrated. |
0:02:40 | Here we see an example of a diagram that is used in many tasks, not so much in speaker verification, but I think it is very intuitive for understanding calibration. |
0:02:53 | It is called the reliability diagram. |
0:02:55 | Basically, what it shows is the posteriors from a system that was run on certain data: the posteriors that the system gave to the class that it predicted. |
0:03:07 | So, for example, for this bin we take all the samples for which the system gave a posterior that falls between 0.8 and the next bin edge, |
0:03:17 | and what the diagram shows is the accuracy on those samples. |
0:03:23 | If the system were calibrated, then we would expect these bars to follow the diagonal, because what the system predicted would coincide with the accuracy that we see in the data. |
0:03:35 | In this specific case, what we actually see is that the system was correct more times than it thought it would be, which means this is a system that underestimates its confidence. |
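As a concrete illustration of the diagram just described, here is a minimal sketch, not from the talk, of computing reliability-diagram points from a system's predictions; the ten equal-width bins are an assumed convention.

```python
import numpy as np

def reliability_diagram(confidences, correct, n_bins=10):
    """For each confidence bin, return the mean predicted posterior
    and the empirical accuracy of the samples falling in that bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    edges[-1] += 1e-9  # include samples with confidence exactly 1.0
    mean_conf, accuracy = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences >= lo) & (confidences < hi)
        if in_bin.any():
            mean_conf.append(confidences[in_bin].mean())
            accuracy.append(correct[in_bin].mean())
    return np.array(mean_conf), np.array(accuracy)

# confidences: posterior of the predicted class per sample (numpy array)
# correct: 1.0 where the predicted class was right, 0.0 otherwise
# For a calibrated system, accuracy is close to mean_conf in every bin,
# so the bars follow the diagonal.
```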
0:03:49 | Now, I took this diagram from a paper from 2017, which actually studies the issue of calibration in modern neural network architectures. |
0:04:01 | It compares systems on a task called CIFAR-100, which is image classification with one hundred classes. |
0:04:08 | It compares, with the kind of plot I already showed, a CNN from 1998 with a ResNet from 2016, |
0:04:19 | and it shows that the new network is actually much worse calibrated than the old network. |
0:04:26 | For this same bin as before, the new network has an accuracy much lower than what it claims it should be. |
0:04:38 | So this system is overconfident: the ResNet thinks it will do much better than it actually does. |
0:04:46 | On the other hand, the error of the new network is lower, so if you put this network to make decisions, the decisions will be better than the old ones, |
0:04:55 | but the scores that it outputs cannot be interpreted as posteriors; they cannot be interpreted as the certainty that the system has when it makes a decision. |
0:05:08 | This is actually a phenomenon that we see a lot in speaker recognition: you can have a badly calibrated model that is still very discriminative. |
0:05:20 | The problem is that such a model might be useless in practice, depending on the scenario in which we plan to use it. |
0:05:28 | As I already said, the scores from a miscalibrated system cannot be interpreted as the certainty that the system has in its decisions. |
0:05:39 | Also, the scores cannot be used to make optimal decisions without having data with which to tune how to make the decisions; that is what I am going to talk about in the next two sections. |
0:05:54 | So how do we make optimal decisions, in general, for binary classification? |
0:05:59 | We usually define a cost function, and this is a very common cost function which has very nice properties. |
0:06:07 | It is a combination of two terms, one for each class, where this part here is the probability of making an error for that class: the probability of deciding class zero when the true class was one. |
0:06:26 | We multiply this probability of error by the prior for class one, |
0:06:31 | and then we further multiply it by a cost, which is what we think it is going to cost us if we make this error; this is very specific to the application in which we are going to use the system. |
0:06:44 | The term for the other class is symmetric. |
0:06:49 | This is an expected cost, and the way to minimize this expected cost is to choose the following decisions: |
0:06:59 | for a certain sample x, the decided class should be one if this factor is larger than this factor, and zero otherwise. |
0:07:09 | Each factor is composed of the cost, the prior, and the likelihood: this one for class one, and the same for class zero. |
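In symbols (a reconstruction, since the slides are not reproduced here), the cost function and decision rule being described are the following; the tildes mark the costs, priors, and likelihoods expected at test time, as the speaker explains next.

```latex
C = \tilde{C}_{01}\,\tilde{P}_1\,P(\text{decide } 0 \mid \text{class } 1)
  + \tilde{C}_{10}\,\tilde{P}_0\,P(\text{decide } 1 \mid \text{class } 0),
\qquad
\text{decide class } 1
\;\Longleftrightarrow\;
\tilde{C}_{01}\,\tilde{P}_1\,\tilde{p}(x \mid C_1) > \tilde{C}_{10}\,\tilde{P}_0\,\tilde{p}(x \mid C_0).
```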
0:07:23 | We see here that what we need in order to make optimal decisions is this likelihood, p of x given c. |
0:07:34 | Now, what we have is not the likelihood that we need; what we have is the likelihood learned on the training data. |
0:07:43 | That is why I am using the tilde here: to indicate that the probabilities in the cost are the ones we expect to see in testing, the ones we will actually see at test time. |
0:07:55 | We do not have those; what we have is what we saw in training. |
0:08:00 | So, say we train a generative model: our generative model is going to give us directly this likelihood, but it will be the likelihood learned in training. |
0:08:10 | And that is fine: in order to do anything at all in machine learning, we usually just assume that this will generalize to testing. |
0:08:20 | Now, we may not have the likelihood if we train a discriminative system; in that case we may have the posterior: discriminative systems trained, for example, with cross-entropy output posteriors. |
0:08:34 | In that case, what we need to do is convert those posteriors into likelihoods, and for that we use Bayes' rule: |
0:08:41 | basically, we multiply the posterior by p of x and divide by the prior. |
0:08:46 | Note here that, again, this is the prior in training; it is not the prior, the p tilde, that appears in the cost, which is the one we expect to see in testing. |
0:08:58 | And that is the whole point of why we use likelihoods and not posteriors to make these optimal decisions: it gives us the flexibility to separate the prior in training from the prior in testing. |
0:09:14 | Okay, so, going back to the optimal decisions, we have this expression. |
0:09:21 | We can simplify it by defining the log-likelihood ratio, which I am sure everybody knows if you work in speaker verification: |
0:09:30 | it is basically the ratio between the likelihood for class one and the likelihood for class zero, and we take the logarithm because it is nicer to work with. |
0:09:40 | We can do a similar thing with the costs and priors, the factors that multiply these likelihoods here: we define this theta. |
0:09:49 | With those definitions, we can simplify the optimal decision to look like this: basically, you decide class one if the LLR is larger than theta, and class zero otherwise. |
0:10:02 | The LLR can itself be computed from the system posteriors with this expression, which is just Bayes' rule after taking the logarithm. |
0:10:11 | The p of x term cancels out, because it appears in both likelihoods, |
0:10:21 | and what remains is basically the log-odds of the posterior minus the log-odds of the prior, which can be written this way using the logit function. |
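Collecting the definitions from the last few slides in one place (notation reconstructed, so it may differ in detail from the slides):

```latex
\mathrm{LLR}(x) = \log\frac{p(x \mid C_1)}{p(x \mid C_0)},
\qquad
\theta = \log\frac{\tilde{C}_{10}\,\tilde{P}_0}{\tilde{C}_{01}\,\tilde{P}_1},
\qquad
\text{decide } C_1 \iff \mathrm{LLR}(x) > \theta,
```

and, from Bayes' rule with the training prior P_1,

```latex
\mathrm{LLR}(x) = \mathrm{logit}\,P(C_1 \mid x) - \mathrm{logit}\,P_1,
\qquad
\mathrm{logit}(p) = \log\frac{p}{1-p}.
```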
0:10:34 | Okay. In speaker verification, the feature x is actually a pair of features, or even a pair of sets of features: one for enrollment and one for test. |
0:10:46 | Class one is the class for target, or same-speaker, trials, and class zero is the class for impostor, or different-speaker, trials. |
0:10:57 | We define the cost function, usually called the DCF in speaker verification, using these names for the costs and priors, |
0:11:08 | and we call the errors P miss and P false-alarm: |
0:11:15 | a miss would be labeling a target trial as an impostor, and a false alarm would be labeling an impostor as a target. |
0:11:26 | The threshold looks like this using these names. |
0:11:30 | And if you look at it, to make optimal decisions you only care about this ratio; you do not care about the whole combination of values of costs and priors, only about this theta. |
0:11:44 | So you can in fact simplify the family of cost functions to consider by using a single parameter, an effective target prior, which is equivalent to having the whole triplet, because the decisions depend on the triplet only through theta. |
0:12:06 | We will be using theta for the rest of the talk, because it is much simpler and it helps a lot in the analysis: basically, we collapse all possible cost functions, all combinations of costs and priors, to a single effective theta. |
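A small sketch of the reduction just described (mine, not the speaker's): any triplet of costs and target prior collapses to a single effective prior, and hence to a single Bayes threshold for the LLRs.

```python
import numpy as np

def effective_prior(p_tar, c_miss, c_fa):
    """Collapse the (P_tar, C_miss, C_fa) triplet into one effective target prior."""
    return (c_miss * p_tar) / (c_miss * p_tar + c_fa * (1.0 - p_tar))

def bayes_threshold(p_eff):
    """Bayes optimal LLR threshold: theta = -logit(p_eff)."""
    return np.log((1.0 - p_eff) / p_eff)

def bayes_decisions(llrs, p_tar, c_miss, c_fa):
    """Optimal hard decisions: True = target, False = impostor."""
    theta = bayes_threshold(effective_prior(p_tar, c_miss, c_fa))
    return llrs > theta

# Equal priors and equal costs give p_eff = 0.5 and theta = 0, which is
# exactly the default cost function discussed next.
```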
0:12:27 | So let us see some examples of applications that use different cost functions. |
0:12:33 | The default, simplest cost function would be to have equal priors and equal costs, and that would give you a threshold of zero; that would be the Bayes optimal threshold for that cost function. |
0:12:49 | Now, say you have an application like speaker authentication, where your goal is to verify whether somebody is who they say they are using their voice, for example to enter some system. |
0:13:09 | Then you would expect most of your cases to be target trials, because you do not have many impostors trying to get into your system. |
0:13:18 | On the other hand, the cost of making a mistake is very high if you make a false alarm: you do not want any of the few impostors getting into the system. |
0:13:31 | That means you need to set a very high cost of false alarm compared to the cost of miss, and that corresponds to a threshold of 2.3. |
0:13:41 | So basically what you are doing is moving the threshold to the right. |
0:13:49 | This area here, under the solid curve, which is the distribution of scores for the impostor samples, |
0:13:55 | is the probability of false alarm: everything above the threshold of 2.3 will be a false alarm. |
0:14:00 | By moving the threshold to the right, we are minimizing this area. |
0:14:05 | Another application, which is actually the opposite in terms of costs and priors, is speaker search. |
0:14:11 | In that case you are looking for a certain specific speaker within audio from many other speakers, so the probability of finding your speaker is actually low, say one percent. |
0:14:26 | But the errors that you care about, the errors you want to avoid, are the misses, because you are looking for one specific speaker who is important to you for some reason, so you do not want to miss them. |
0:14:40 | In that case the optimal threshold is symmetric to the previous one, minus 2.3, and what you are trying to minimize is the area under the dashed curve to the left of the threshold, which is the probability of miss. |
0:14:59 | Okay. So, to recap before moving on to evaluation: |
0:15:07 | if we have LLRs, I showed that we can trivially make optimal decisions for any possible cost function that you can imagine, with the formula that I gave. |
0:15:19 | But of course these decisions will only be actually optimal if the system outputs are well calibrated; otherwise they will not be. |
0:15:29 | So how do we figure out whether we have a well-calibrated system? |
0:15:35 | The idea is: if you are going to have your system make decisions using the thresholds that I showed before, the thetas, then that is what you should evaluate: have your system make those decisions using those thetas, and see how well it does. |
0:15:54 | And then the further question is: could we have made better decisions if we had calibrated the scores before making the decisions? That will give us a measure of how well calibrated the system is to begin with. |
0:16:08 | So, the way we usually evaluate performance on a binary classification task is by using the cost at one particular threshold. |
0:16:19 | We pre-fix the threshold (using Bayes decision theory or not; we just set a threshold), then compute the P miss and P false-alarm, which are these areas under the two distributions, and then compute the cost. |
0:16:37 | Now, we can also define metrics that depend on the whole distribution of scores. |
0:16:45 | For example, the equal error rate is defined by finding the threshold that makes these two areas the same, so to compute it you need the whole test distribution. |
0:16:59 | A similar thing is the minimum DCF: what you do in that case is sweep the threshold across the whole range of scores, compute the cost for every possible threshold, and then choose the threshold that gives the minimum cost. |
0:17:20 | Now, that minimum cost is actually bounded, and it is bounded by dummy decisions: a system that makes fixed decisions without looking at the input. |
0:17:31 | If you put the threshold, for example, all the way to the right, then the only mistakes you make are misses; everything will be labeled as an impostor, so you will have a P miss of one and a P false-alarm of zero. |
0:17:46 | In that case, the cost that you incur is this factor here. |
0:17:50 | On the other hand, if you put the threshold all the way to the left, then you will only make false alarms, and the cost for that system will be this factor here. |
0:18:02 | So basically the bound for the minimum DCF is the best of those two cases; they are both dummy systems, but one will be better than the other. |
0:18:12 | We usually use this bound to normalize the DCF; in NIST evaluations, for example, the cost that is defined is the normalized DCF. |
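Putting the last few slides together, here is a compact sketch, under my own array conventions, of the actual normalized cost at a given threshold, the minimum cost over all thresholds, and the equal error rate, all computed from target and impostor LLRs:

```python
import numpy as np

def error_rates(tar, non, threshold):
    """P_miss and P_fa for hard decisions at a given threshold."""
    return np.mean(tar <= threshold), np.mean(non > threshold)

def dcf(tar, non, p_eff, threshold):
    """Cost at one threshold, normalized by the best dummy system."""
    p_miss, p_fa = error_rates(tar, non, threshold)
    return (p_eff * p_miss + (1 - p_eff) * p_fa) / min(p_eff, 1 - p_eff)

def min_dcf(tar, non, p_eff):
    """Sweep the threshold over all observed scores and keep the minimum cost."""
    scores = np.sort(np.concatenate([tar, non]))
    thresholds = np.concatenate([[-np.inf], scores, [np.inf]])
    return min(dcf(tar, non, p_eff, t) for t in thresholds)

def eer(tar, non):
    """Coarse equal error rate: threshold where P_miss and P_fa cross."""
    thresholds = np.sort(np.concatenate([tar, non]))
    rates = [error_rates(tar, non, t) for t in thresholds]
    i = int(np.argmin([abs(pm - pf) for pm, pf in rates]))
    return (rates[i][0] + rates[i][1]) / 2
```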
0:18:26 | And then, finally, another thing we can do is sweep the threshold, computing the P miss and P false-alarm for every possible value, and that gives score curves like these; |
0:18:37 | if we transform the axes appropriately, we get the standard DET curves we use in speaker verification. |
0:19:02 | So, the cost that I have been talking about can be decomposed into a discrimination and a calibration component. Let us see how. |
0:19:12 | Say we assume a cost with equal priors and equal costs; in that case the Bayes optimal threshold will be zero. |
0:19:25 | So we compute the cost using that threshold, and we get this value; given that the priors and costs are the same, the cost will be given by the average of these two areas, shown here. |
0:19:38 | Now, we can also compute the minimum cost, as I mentioned before: basically, we sweep the threshold and find the threshold that gives the minimum cost. |
0:19:47 | Again, it is the average between these two areas, which, as you can see, is much smaller than the average between these two areas in this case. |
0:19:55 | And the difference between those two costs can be seen as the additional cost that you are incurring because your system was miscalibrated. |
0:20:06 | So this orange area here, which is the difference between the sum of the areas here and the sum of the areas here, is the cost due to miscalibration, and that is one way of measuring how miscalibrated the system is. |
0:20:25 | So, there is discrimination, which is how well the scores separate the classes, and there is calibration, which is whether the scores can be interpreted probabilistically, which implies that you can make optimal Bayes decisions if they are well calibrated. |
0:20:40 | And the key here is that discrimination is the part of the performance that cannot be changed if we transform the scores with an invertible transformation. |
0:20:52 | Here is a simple example: say you have these distributions of scores, and a threshold t that you chose for some reason; it could be optimal or not. |
0:21:03 | You transform the scores with any monotonic transformation (in this example it is just an affine transformation), and you can also transform the threshold t with the same exact function. |
0:21:22 | The threshold f of t will correspond to exactly the same cost as the threshold t in the original domain. |
0:21:31 | So basically, by applying a monotonic transformation to your scores you cannot change their discrimination: the minimum cost that you will be able to find in both cases will be the same. |
0:21:52 | So, the cost I have been talking about measures performance at a single operating point: it evaluates the quality of the hard decisions for a certain theta. |
0:22:03 | Now, a more comprehensive measure is the cross-entropy, which is given by this expression, which you probably all know. |
0:22:11 | The empirical cross-entropy is the average of minus the logarithm of the posterior that the system gives to the correct class of each sample. |
0:22:22 | You want this posterior to be as high as possible, as close to one as possible; the logarithm of one is zero, and if that happens for every sample then you get a cross-entropy of zero, which is what you want. |
0:22:37 | Now, there is a weighted version of this cross-entropy, which is basically the same, but you split your samples into two terms, the ones for class zero and the ones for class one, and you weight these averages by a prior, which is the effective prior that I talked about before. |
0:23:01 | So basically you make yourself independent of the priors that you are seeing in the test data: you can evaluate for any prior you want. |
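A reconstruction of the weighted cross-entropy expression being described, with target trials T, impostor trials I, and an arbitrary effective prior P-tilde replacing the proportions seen in the test data:

```latex
\mathrm{WCE}(\tilde{P}) =
 -\frac{\tilde{P}}{|T|} \sum_{t \in T} \log P(C_1 \mid x_t)
 \;-\; \frac{1-\tilde{P}}{|I|} \sum_{i \in I} \log P(C_0 \mid x_i).
```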
0:23:13 | These posteriors are computed from the LLRs and the priors using Bayes' rule; note that these are the same priors that appear in the weights, the ones you need to use when converting the LLRs. |
0:23:27 | And the famous Cllr, which we use in NIST evaluations and in many papers, is defined as this weighted cross-entropy when the priors are 0.5, normalized by the logarithm of two; I will explain why in the next slide. |
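A sketch of that Cllr computation (my own, assuming the scores are already LLRs): at effective prior 0.5, the posterior log-odds equal the LLR, so each per-trial term reduces to a softplus of the LLR.

```python
import numpy as np

def cllr(tar_llrs, non_llrs):
    """Weighted cross-entropy of LLR scores at effective prior 0.5,
    normalized by log(2) so the prior-only dummy system scores 1.0."""
    # -log P(target | x)   = log(1 + exp(-llr)) for target trials,
    # -log P(impostor | x) = log(1 + exp( llr)) for impostor trials.
    c_tar = np.mean(np.logaddexp(0.0, -tar_llrs))
    c_non = np.mean(np.logaddexp(0.0, non_llrs))
    return (0.5 * c_tar + 0.5 * c_non) / np.log(2.0)
```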
0:23:44 | So, the weighted cross-entropy can be decomposed, like the cost, into discrimination and calibration terms: basically, you compute the actual weighted cross-entropy and you subtract the minimum weighted cross-entropy. |
0:24:01 | Now, this minimum is not as trivial to obtain as for the cost: you cannot just choose a threshold, because here we are evaluating the scores themselves, not just the decisions. |
0:24:12 | So we need to actually warp the scores to get the best possible weighted cross-entropy without changing the discrimination of the scores, and that means using a monotonic transformation. |
0:24:26 | There is an algorithm called PAV, pool adjacent violators, which does exactly that: without changing the rank of the scores, the order of the scores, it does the best it can to minimize the weighted cross-entropy. |
0:24:42 | So that is what we use to compute this delta, which measures how miscalibrated your system is in terms of weighted cross-entropy. |
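And a sketch of the minimum just described, using scikit-learn's isotonic regression as the PAV step and reusing the cllr function above; reference implementations handle ties and endpoint posteriors more carefully, so treat this as an approximation.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def min_cllr(tar_llrs, non_llrs):
    """Approximate min Cllr: monotonically warp the scores with PAV,
    convert the resulting posteriors to LLRs, and recompute Cllr."""
    scores = np.concatenate([tar_llrs, non_llrs])
    labels = np.concatenate([np.ones_like(tar_llrs), np.zeros_like(non_llrs)])
    pav = IsotonicRegression(y_min=1e-6, y_max=1 - 1e-6)
    post = pav.fit_transform(scores, labels)  # optimal monotone posteriors
    # The posteriors reflect the proportion of targets in the data,
    # so subtract that prior's log-odds to obtain LLRs.
    prior = labels.mean()
    llrs = np.log(post / (1 - post)) - np.log(prior / (1 - prior))
    return cllr(llrs[labels == 1], llrs[labels == 0])
```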
0:24:53 | This weighted cross-entropy is bounded, the same as the cost, by a dummy system: in this case, the system that outputs, instead of the posteriors, directly the priors. It is a system that does not know anything about its input, but still does the best it can by outputting the priors. |
0:25:16 | That means that the worst sensible Cllr is 1.0, because we normalize by the logarithm of two, which is exactly this expression evaluated at a prior of 0.5. |
0:25:29 | So this means that the minimum Cllr will never be worse than one. |
0:25:36 | If the actual Cllr is worse than one, then you know for sure that you are going to have a difference here, because this one is never larger than one; so if this one is larger than one, it means you have a calibration problem. |
0:25:50 | Okay, finally, in terms of evaluation, I wanted to mention these curves: the APE, or applied probability of error, curves. |
0:25:59 | The Cllr gives a single summary number, but you might want to actually see the performance across a range of operating points, and that is what these curves do. |
0:26:10 | They basically show the cost as a function of the effective target prior, which also defines the theta. |
0:26:22 | What we see here is the cost for prior decisions; the prior decisions are what I mentioned before: basically just a dummy system that always outputs the priors instead of the posteriors. |
0:26:38 | In red is our system, whatever it is, calibrated or not, |
0:26:46 | and the dashed curve is the very best you could do if you were to warp your scores using the PAV algorithm. |
0:26:54 | So basically, for each theta, the difference between the dashed and the red curves is the miscalibration at that operating point. |
0:27:05 | And the nice property of these curves is that the Cllr is proportional to the area under them: the actual Cllr is proportional to the area under the red curve, and the minimum Cllr is proportional to the area under the dashed one. |
0:27:23 | Furthermore, the equal error rate is the maximum of the dashed curve. |
0:27:30 | And there are variants of these curves, which accompany these papers, that change the way the axes are defined. |
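A sketch of how the points of such a curve can be traced, under the conventions of the earlier snippets: for a sweep of effective priors, compare the Bayes error of the actual decisions against the prior-only dummy system (the curve for PAV-optimized scores would use the warped LLRs from the previous sketch).

```python
import numpy as np

def ape_points(tar_llrs, non_llrs, n=50):
    """Error of Bayes decisions vs. the dummy system, across effective priors."""
    logit_priors = np.linspace(-7.0, 7.0, n)
    actual, dummy = [], []
    for lp in logit_priors:
        p = 1.0 / (1.0 + np.exp(-lp))  # effective target prior
        theta = -lp                    # Bayes threshold for this prior
        p_miss = np.mean(tar_llrs <= theta)
        p_fa = np.mean(non_llrs > theta)
        actual.append(p * p_miss + (1 - p) * p_fa)
        dummy.append(min(p, 1 - p))    # best of always-accept / always-reject
    return logit_priors, np.array(actual), np.array(dummy)
```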
0:27:39 | Okay, so let us say we find that our system has a calibration problem. Should we worry about it? Should we try to fix it? |
0:27:50 | There are some scenarios where there is no problem in having a miscalibrated system, no need to fix it. |
0:27:59 | For example, if you know what the cost function is ahead of time, and there is development data available, then all you need to do is run the system on the development data and find the empirically best threshold for that data, that system, and that cost function, and use that. |
0:28:20 | You also do not need to worry about calibration if you only care about ranking the samples: you want the most likely targets at the top, and nothing else. |
0:28:33 | On the other hand, it may be very necessary to fix calibration in many other scenarios. |
0:28:39 | One of them is, for example, when you do not know ahead of time what the system will be used for, what the application is exactly; that means you do not know the cost function, and if you do not know the cost function, you cannot optimize the threshold ahead of time. |
0:28:53 | So, if you want to give the user of the system a knob that defines this effective target prior, then the system has to be calibrated for the Bayes optimal threshold to be really optimal, to work well. |
0:29:11 | Another case where you need good calibration is if you want to get a probabilistic value from your system: some measure of the uncertainty that the system has when it makes its decision. |
0:29:25 | You can use that uncertainty, for example, to reject samples when the system is uncertain. |
0:29:33 | So, if your LLR is too close to the threshold that you were planning to use to make hard decisions, then perhaps you want the system not to make a decision and to tell the user, "I do not know, you are on your own." |
0:29:47 | And another case is when you actually do not want to make hard decisions, but to report a value that is interpretable, as, for example, in the forensic voice comparison field. |
0:30:02 | Okay, so say we do want to fix calibration; we are in one of those scenarios where it matters. |
0:30:10 | One very common approach is to use linear logistic regression. |
0:30:15 | This assumes that the LLR, the calibrated score, is an affine transformation of whatever your system outputs; the parameters of this model are the w and the b, and it uses the weighted cross-entropy as the loss function. |
0:30:35 | Now, to compute the weighted cross-entropy we need posteriors, not LLRs, so we need to convert those LLRs into posteriors, and we use the expression I showed before: the LLR is the logit of the posterior minus the logit of the prior. |
0:30:55 | Inverting that expression, we get the logistic function, which is the inverse of the logit, |
0:31:04 | and finally, after trivial computations, we get this expression, which is the standard linear logistic regression expression. |
0:31:15 | We then plug this posterior into the expression of the weighted cross-entropy to get the loss, which we can then optimize as we wish. |
0:31:24 | And finally, once we optimize this loss on some data, we get the w and b that are optimal for that data. |
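A minimal sketch of this calibration step, assuming scipy is available: fit w and b by minimizing the weighted cross-entropy of the affine-transformed scores, here at a chosen effective prior.

```python
import numpy as np
from scipy.optimize import minimize

def train_affine_calibration(tar, non, p_eff=0.5):
    """Linear logistic regression: llr = w * score + b, with (w, b)
    chosen to minimize the weighted cross-entropy on labeled scores."""
    offset = np.log(p_eff / (1.0 - p_eff))  # logit of the effective prior

    def loss(params):
        w, b = params
        t = w * tar + b   # calibrated target LLRs
        i = w * non + b   # calibrated impostor LLRs
        c_tar = np.mean(np.logaddexp(0.0, -(t + offset)))
        c_non = np.mean(np.logaddexp(0.0, i + offset))
        return p_eff * c_tar + (1.0 - p_eff) * c_non

    result = minimize(loss, x0=np.array([1.0, 0.0]), method="Nelder-Mead")
    w, b = result.x
    return w, b

# Apply to new raw scores: calibrated_llrs = w * raw_scores + b
```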
0:31:37 | So, this is an affine transformation, so it does not change the shapes of the distributions at all; it looks like it did nothing. |
0:31:47 | But what it did is shift and shrink the axes, so that the resulting scores are calibrated. |
0:32:01 | In terms of Cllr, you can see that the raw scores, which are these ones, had a very high Cllr, actually higher than one, so it was worse than the dummy system, |
0:32:12 | and after you calibrate them, which only changes the global scale and shift, you get a much better Cllr. |
0:32:20 | This minimum here is basically the very best you can do, so with the affine transformation we are actually doing almost as well as the very best, |
0:32:36 | which means that the affine assumption was, in this case, actually quite good. |
0:32:41 | This is a real case: this is VoxCeleb data processed with a PLDA system. |
0:32:50 | Now, there are many other approaches to do calibration; I am not going to cover them, because it would take another whole keynote. |
0:32:58 | There are non-linear approaches, which in some cases do better than the linear one, when the affine assumption is not good enough. |
0:33:12 | Then there are regularized and Bayesian approaches, which actually do quite well when you have very little data to train the calibration model. |
0:33:19 | And then there are approaches that go all the way to unlabeled data: you have data, but you do not know the labels. And those work surprisingly well. |
0:33:35 | So, if we have a calibrated score, then we know we can treat the scores as log-likelihood ratios, which means we can use them to make optimal decisions, and we can also convert them to posteriors if we wanted to, if we had the prior. |
0:33:53 | A very nice property of the LLR is that if you were to compute the log-likelihood ratio of your already calibrated score, then you would get the same thing: you can treat the score, the LLR, as if it were a feature, and if you recompute this ratio, you get the same value back. |
0:34:16 | This identity gives the LLR some nice properties. For example, for a calibrated score, the two distributions have to cross exactly at zero, |
0:34:28 | because when the LLR is zero, this ratio is one, which means that these two have to be the same; and these two are exactly what we are seeing here, the probability density functions of the score for each of the two classes. They have to cross at zero. |
0:34:46 | And further, if we assume that one of these two distributions is Gaussian, then the other distribution is forced to be Gaussian too, with the same standard deviation and with symmetric means. |
0:34:59 | And this, as I said, is a real example, and it is actually quite close to that assumption, on this VoxCeleb data. |
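For completeness, a short derivation of that constraint (my algebra, not shown in the talk), using the self-consistency property above, LLR(s) = s: if the two class-conditional score densities are Gaussians with symmetric means ±μ and shared variance σ², then

```latex
\log\frac{\mathcal{N}(s;\, \mu, \sigma^2)}{\mathcal{N}(s;\, -\mu, \sigma^2)}
 = \frac{2\mu}{\sigma^2}\, s
 \;\stackrel{!}{=}\; s
 \quad\Longrightarrow\quad
 \sigma^2 = 2\mu,
```

so a calibrated score with Gaussian class-conditional distributions is pinned down, up to μ, exactly as the speaker describes.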
0:35:09 | Okay, so, to recap this part before we move on: what I have been saying is that DET curves, the equal error rate and the minimum DCF measure only discrimination performance. |
0:35:21 | Basically, this means that they ignore the threshold selection; they ignore how to get to the actual decisions from the scores. |
0:35:31 | On the other hand, the weighted cross-entropy, the actual DCF and the APE curves measure total performance, and that includes the threshold, how you make the decisions. |
0:35:45 | And we can further use these metrics to compute the calibration loss, to see whether the system is well calibrated or not. |
0:35:57 | If you find that calibration is actually not good, then fixing these calibration issues is usually easy in ideal conditions: you can train an invertible transformation using, usually, a small representative dev set, |
0:36:13 | which is enough because, in many of the approaches, the number of parameters is very small, so you do not need a lot of data. |
0:36:23 | The key here, though, is that you need a representative dev set, and that is what I am going to discuss in the next slides. |
0:36:33 | So, basically, what we have observed repeatedly is that the calibration of our speaker verification systems is extremely fragile. |
0:36:45 | This is true for our current systems, and it has always been the case since I started working on speaker verification, almost twenty years ago. |
0:36:57 | Anything (language, noise, distortions, duration) affects the calibration parameters, and that means that a model trained on one condition is very unlikely to generalize to another condition. |
0:37:11 | On the other hand, the discrimination performance is usually still reasonable on unseen conditions. |
0:37:17 | So if you train a system on telephone data and you try to use it on microphone data, it may not be the best you can do, but it will still be reasonable. |
0:37:28 | On the other hand, if you train your calibration model on telephone data and try to use it on microphone data, it may perform horribly. |
0:37:37 | And here is one example. |
0:37:40 | I am training the calibration model on two different sets, Speakers in the Wild and the SRE16 dev set, and applying those models on VoxCeleb2. |
0:37:56 | The raw scores are identical in both cases; all I am doing is changing the w and the b based on the calibration set. |
0:38:05 | What we see here is that the model that was trained on Speakers in the Wild is extremely good: it is basically almost perfect, |
0:38:17 | while the model that was trained on SRE16 is quite bad; it is better than the raw scores, but still quite bad compared to the best you can do. |
0:38:27 | And this is not surprising, because VoxCeleb is actually quite close to Speakers in the Wild in terms of conditions, but SRE16 is not. |
0:38:36 | Now, you may think that maybe SRE16 is just a bad set for training calibration, but that is not the case, because if you evaluate on SRE16 evaluation data, then the opposite happens. |
0:38:50 | The calibration model that is good in that case is the one that was trained on the SRE16 dev set: these scores are now much lower in Cllr than the ones calibrated with Speakers in the Wild, |
0:39:06 | and in this case, again, you almost reach the minimum. |
0:39:10 | So basically, this tells us that the conditions on which the calibration model is trained determine where it is going to be good: you have to match the conditions of your evaluation. |
0:39:26 | Now, this goes even deeper: if you zoom into a data set, you can actually find calibration issues within the data set itself. |
0:39:38 | Here I am showing results on the SRE16 evaluation set, where I train the calibration parameters on exactly that same evaluation set, so this is a cheating calibration experiment. |
0:39:50 | I am showing the actual Cllr, which is the solid bar, and the minimum Cllr, which here, for the full set, are the same by construction; and here is the relative difference between those two. |
0:40:04 | So for the full set I have no calibration loss, by construction, as I said. |
0:40:08 | On the other hand, if I start to subset this full set, randomly, or by gender, or by condition, I start to see more calibration loss. |
0:40:21 | The random subset is fine, it is well calibrated, and females and males are reasonably well calibrated. |
0:40:28 | But for these specific conditions, defined by the language, the gender, and whether the two waveforms in the trial come from the same telephone number or not, we start to see calibration losses of up to almost twenty percent in this case. |
0:40:47 | So, if we look at the distributions for the target and impostor trials of the female, same-telephone-number subset, we see that the distributions are shifted to the right. |
0:41:00 | They should be aligned with zero; remember, these are the LLR distributions, and if they were calibrated, they should cross at zero. But they do not. |
0:41:09 | They are shifted to the right, and that is reasonable, because since the telephone number is the same for both sides of the trial, the two sides look much more alike than if the channels were different. |
0:41:27 | So every trial looks more target-like than it should, or than trials do in the overall distribution. |
0:41:36 | The opposite happens for the different-telephone-number scores: they shift to the left. |
0:41:42 | And the final comment here is that this miscalibration within the data set also causes a discrimination problem, because if you pool these trials as they are, miscalibrated, you get poorer discrimination than if you were to first calibrate them per condition and then pool them together. |
0:42:03 | So there is an interplay here between calibration and discrimination, because the miscalibration is happening for different sub-conditions within the set. |
0:42:21 | Okay, so there have been several approaches in the literature, over the last two decades at least, that try to solve this problem of condition-dependent miscalibration, |
0:42:38 | where the assumption of having a global calibration model, with a single w and a single b for all trials, is actually not good. |
0:42:50 | Most of these approaches assume that there is an external class or vector representation, given by the metadata or estimated, that represents the conditions of the enrollment and test samples, |
0:43:07 | and these vectors are fed into the calibration stage and used to condition the parameters of that calibration stage. |
0:43:17 | Here are some approaches, if you are interested in taking a look. |
0:43:24 | Overall, these approaches are quite successful at making the final system better, actually more discriminative, because they align the distributions of the different sub-conditions before pooling them together. |
0:43:40 | And there is another family of approaches, where the condition-awareness is put in the backend itself rather than in the calibration stage. |
0:43:50 | So there is, again, a condition extractor of some kind, which affects the parameters of the backend. |
0:43:58 | The thing is, this approach does not necessarily fix calibration: it improves discrimination in general, but you may still need to do calibration. If the backend is, for example, PLDA, as in these cases, what comes out of it is still miscalibrated, so you still need, perhaps, a normal calibration model on top. |
0:44:23 | And recently we proposed an approach that jointly trains the backend and a condition-dependent calibrator, where the condition is extracted automatically as a function of the embeddings themselves, and the whole thing is trained jointly to optimize weighted cross-entropy. |
0:44:47 | This model actually gives excellent calibration performance across a wide range of conditions; you can find the paper in these proceedings if you are interested. |
0:44:59 | And there is a very related paper, also in these proceedings, by Daniel Garcia-Romero, which I suggest you take a look at if you are interested in these topics. |
0:45:13 | Okay, so, to finish up: I have been talking about two broad application scenarios for speaker verification technology. |
0:45:24 | One of them is where you assume that there is development data available for the evaluation conditions. |
0:45:32 | In that case, as I said, you can either calibrate the system on that data, which is matched, or just find the best threshold on it, so calibration in that scenario is not a big issue. |
0:45:49 | In fact, most speaker verification papers historically operate under this scenario. |
0:45:55 | It is also the scenario of the NIST evaluations, where we usually get development data which is maybe not perfectly matched, but pretty well matched to what we will see in the evaluation. |
0:46:07 | Looking at this conference's proceedings, I found thirty-three speaker recognition papers, of which twenty-eight fall in this category. |
0:46:18 | They mostly report just equal error rate and minimum DCF; some report actual values, some do not. |
0:46:28 | And I think it is fine to report just minimum DCF in those cases, because you are basically assuming that the calibration issue is easy to solve: |
0:46:39 | if you were to have development data, you could train a calibration model and you would get very close to the minimum; the actual performance would get very close to the minimum. |
0:46:54 | Now, there is still a caveat there: you may still have miscalibration problems within sub-conditions, and if you do not report actual DCF or Cllr on sub-conditions, that stays hidden behind the overall performance. |
0:47:11 | The other big scenario is the one where we do not have development data for the evaluation conditions. |
0:47:21 | In that case, we cannot calibrate, or choose a threshold, on matched conditions; we can only hope that our system will work well out of the box. |
0:47:35 | From these proceedings, I found only five papers that operate under this scenario, where they actually test a system that was trained on some conditions on data that is from different conditions, and they do not assume that they have development data for recalibration. |
0:47:58 | So basically, we as a community are very heavily focused on the first scenario, and have always been, historically. |
0:48:09 | And I think this may be why our current speaker verification technology cannot be used out of the box: we are just used to always asking for development data in order to tune at least the calibration stage of our systems. |
0:48:28 | We know the calibration stage has to be tuned; otherwise the system will not work, and it may be worse than a dummy system. |
0:48:36 | So my question is (and maybe we can discuss it in the question and answer session): wouldn't it be worth it, for us as a community, to pay more attention to this scenario with no development data available? |
0:48:53 | I believe that the new end-to-end approaches have the potential to be quite good at generalizing, and this is basically based on the paper that I mentioned, which is not really end-to-end, but almost, and it works surprisingly well in terms of calibration on unseen conditions. |
0:49:18 | So I think it is doable. |
0:49:22 | Maybe, if we worked on that as a community, we could reduce, or even eliminate if we are very optimistic, the performance difference between the two scenarios. |
0:49:30 | So maybe we could end up with systems that are not so dependent on having development data, and perhaps having development data would not help much, I do not know, over the out-of-the-box system. |
0:49:51 | So what would it entail to develop for this no-development-data scenario? |
0:49:56 | First, we have to assume that we will need heterogeneous data for training, of course, because if you train a system on telephone data, it is quite unlikely that it will generalize to other conditions. |
0:50:11 | The second thing is, one has to hold out some sets, at least during development, that are not used for hyperparameter tuning, because otherwise they would not be completely unseen. |
0:50:26 | So these sets have to be really held out until the very end, until you just evaluate the system out of the box, as in this scenario that we are imagining. |
0:50:38 | And of course, you need to report actual metrics and not just minimums, because in this case you cannot assume that you will be able to do calibration well; you need to test whether the model, as it stands, is actually giving you good calibration out of the box. |
0:50:56 | And finally, it is probably a good idea to also report metrics on sub-conditions of the set, because the miscalibration issues within the sub-conditions may be hidden within the pooled distribution of the whole set; they compensate for each other sometimes. |
0:51:15 | By reporting metrics on sub-conditions, both actual and minimum, you can actually tell whether there is a calibration problem. |
0:51:27 | Okay, thank you very much for listening, and I am looking forward to your questions in the next session. |
---|