0:00:15 | okay, so |
0:00:17 | in this paper I'm going to compare some linear and some nonlinear calibration functions |
0:00:25 | so, there's a list of previous papers; all of them used linear calibration and we did various interesting things |
0:00:34 | but when I was doing that work, every time it became evident that linear calibration has limits, so let's explore what we can do with nonlinear calibration |
0:00:45 | so in this paper we're going to use lots of data for training the calibration; in that case a plug-in solution works well |
0:00:57 | we have an upcoming Interspeech paper where we use tiny amounts of training data, and there a Bayesian solution is an interesting thing to do |
0:01:09 | so, just a reminder of why calibration: a speaker recognizer is not useful unless it actually does something |
0:01:19 | so we want to make decisions, and it's nice if you can make those decisions minimum-expected-cost Bayes decisions |
0:01:28 | if you take the raw scores out of the recognizer, they don't make good decisions, so we like to calibrate them so that we can make cost-effective decisions |
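To make "minimum-expected-cost Bayes decisions" concrete, here is a minimal Python sketch of the standard Bayes threshold applied to calibrated log-likelihood ratios; the function names and example numbers are illustrative, not from the paper:

```python
import numpy as np

def bayes_threshold(p_target, c_miss=1.0, c_fa=1.0):
    """Minimum-expected-cost Bayes decision: accept (decide 'target')
    when llr >= log(c_fa * (1 - p_target)) - log(c_miss * p_target)."""
    return np.log(c_fa * (1.0 - p_target)) - np.log(c_miss * p_target)

def bayes_decisions(llrs, p_target, c_miss=1.0, c_fa=1.0):
    """Apply the threshold to an array of calibrated log-likelihood ratios."""
    return llrs >= bayes_threshold(p_target, c_miss, c_fa)

# illustrative operating point: rare targets, expensive false alarms
decisions = bayes_decisions(np.array([-2.0, 0.5, 4.2]), p_target=0.01, c_fa=10.0)
```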
0:01:39 | so, for years we've done pretty well with linear calibration, why complicate things? |
0:01:49 | well, the good thing about linear calibration is simplicity: it's easy to do |
0:01:57 | and then there's the problem of overfitting: linear calibration has very few parameters, so it doesn't overfit that easily |
0:02:05 | that's not a problem that should be underestimated; even if you have lots of data, if you work at an extreme operating point the error rates become very low, so your effective data is really the errors, not the speech samples or trials, and with only a few errors you're going to have overfitting problems |
0:02:23 | and another thing: linear calibration is monotonic; as the score increases, the log-likelihood ratio increases |
0:02:30 | if you do something nonlinear, you might be in a situation where the score increases but the log-likelihood ratio decreases |
0:02:40 | and I don't think even all the saunas in Finland can help us be certain whether we want that kind of thing or not |
0:02:50 | so, the limitation of linear methods is that they don't look at all operating points at the same time |
0:02:59 | you have to choose where you want to operate: what cost ratio, what prior for the target do you want to work at |
0:03:09 | and then you have to tailor your training objective function to that operating point, to make that work |
0:03:21 | so why is this a problem? you cannot always know in advance where you're going to want your system to work, especially if you're dealing with unsupervised data |
0:03:33 | the nonlinear methods can be accurate over a wider range of operating points |
0:03:41 | then you don't need to do so much gymnastics with your training objective function |
0:03:52 | the nonlinear methods are considerably more complex to train; I had to go and find out a lot of things about Bessel functions and how to compute the derivatives |
0:04:03 | and they are more vulnerable to overfitting: more complex functions, more things can go wrong |
0:04:12 | so we'll compare various flavours: discriminative and generative linear calibrations, and the same for nonlinear ones |
0:04:22 | and the conclusion is going to be that there is some benefit to the nonlinear ones, but only if you want to cover a wide range of operating points |
0:04:34 | let's first describe the linear ones |
0:04:39 | it's linear because we take the score that comes out of the system, we scale it by some factor a, and we shift it by some constant b |
0:04:50 | if we use Gaussian score distributions, you have two distributions, one for targets and one for non-targets |
0:05:01 | you have a target mean and a non-target mean, and then you share the variance between the two distributions |
0:05:09 | if you don't, if you have separate variances, you get a quadratic function; so in the linear case we are sharing that sigma |
0:05:19 | that gives a linear generative calibration |
0:05:23 | or you can be discriminative: in that case your probabilistic model is just the formula at the top of the slide, and you directly train those parameters by minimizing cross-entropy |
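As a rough sketch of the generative linear recipe just described (two Gaussian score distributions with a shared variance, which makes the log-likelihood ratio affine in the score), something along these lines would do; the helper names are mine, and the pooled-variance estimate is one of several reasonable choices:

```python
import numpy as np

def train_gaussian_linear_calibration(target_scores, nontarget_scores):
    """Fit one Gaussian per class with a shared (pooled) variance.
    Sharing the variance is what keeps the resulting LLR linear: llr = a*s + b."""
    mu_t, mu_n = target_scores.mean(), nontarget_scores.mean()
    deviations = np.concatenate([target_scores - mu_t, nontarget_scores - mu_n])
    var = deviations.var()                      # pooled variance, the shared sigma^2
    a = (mu_t - mu_n) / var                     # scale
    b = (mu_n**2 - mu_t**2) / (2.0 * var)       # offset
    return a, b

def calibrate(scores, a, b):
    """Map raw scores to calibrated log-likelihood ratios."""
    return a * scores + b
```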
0:05:36 | so, I said we have to tune the objective function to make it work at a specific operating point |
0:05:44 | what we basically do is weight the target trials and the non-target trials, or if you want, the miss errors and the false alarm errors, by factors of alpha and one minus alpha |
0:05:57 | but alpha is also a training parameter, so when you train, you first have to select your operating point |
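A minimal sketch of that weighting for the discriminative linear case: an alpha-weighted cross-entropy over the affine parameters a and b, optimized numerically. This is my own illustrative formulation of "weight the misses by alpha and the false alarms by one minus alpha", not necessarily the paper's exact objective:

```python
import numpy as np
from scipy.optimize import minimize

def weighted_cross_entropy(params, tar, non, alpha):
    """alpha weights the miss errors (target trials), 1 - alpha the false alarms."""
    a, b = params
    softplus = lambda x: np.logaddexp(0.0, x)     # log(1 + exp(x)), numerically stable
    c_miss = softplus(-(a * tar + b)).mean()      # -mean log sigmoid(llr) on targets
    c_fa = softplus(a * non + b).mean()           # -mean log(1 - sigmoid(llr)) on non-targets
    return alpha * c_miss + (1.0 - alpha) * c_fa

def train_discriminative_linear(tar, non, alpha=0.5):
    res = minimize(weighted_cross_entropy, x0=np.array([1.0, 0.0]),
                   args=(tar, non, alpha), method="Nelder-Mead")
    return res.x    # the scale a and offset b
```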
0:06:07 | let's see how that all works out; I'll present the experimental results, first the linear stuff and then the nonlinear |
0:06:16 | it's a simple experimental setup with an i-vector system |
0:06:23 | we trained the calibrations on a huge number of scores, about forty million, and we tested on SRE'12, which was described earlier today, with about nine million scores |
0:06:36 | our evaluation criterion is the same one that was used in the NIST evaluation, the well-known DCF, or if you want, the Bayes error rate |
0:06:49 | and it's normalized by the performance of a default system that doesn't look at the scores and just makes decisions by the prior alone |
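A minimal sketch of that evaluation criterion, with unit costs so the operating point is just the (log-odds of the) target prior; the normalization by the prior-only default system is the min(p, 1-p) term. Names and conventions here are illustrative:

```python
import numpy as np

def normalized_bayes_error(tar_llrs, non_llrs, logit_prior):
    """Actual (normalized) DCF at one operating point, unit costs:
    Bayes decisions are made with the calibrated LLRs at this prior."""
    p_tar = 1.0 / (1.0 + np.exp(-logit_prior))
    threshold = -logit_prior                       # Bayes threshold on the LLR
    p_miss = np.mean(tar_llrs < threshold)
    p_fa = np.mean(non_llrs >= threshold)
    dcf = p_tar * p_miss + (1.0 - p_tar) * p_fa
    # default system: ignores the scores, always makes the cheaper blind decision
    return dcf / min(p_tar, 1.0 - p_tar)
```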
0:07:02 | so this is the result of the Gaussian calibration |
0:07:13 | what we're looking at: the vertical axis is the DCF, or the error rate; lower is better |
0:07:20 | the horizontal axis is your operating point, or your target prior, on a log-odds scale: zero would be a prior of a half, negative means small priors, positive means large priors |
0:07:38 | the dashed line is what you would know as minimum DCF: the best you can do if the evaluator sets the threshold for you at every single operating point |
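For comparison, the dashed minimum-DCF line lets the evaluator pick the best threshold at every operating point; here is a naive but correct sketch of that, fine for illustration though too slow for millions of trials:

```python
import numpy as np

def min_dcf(tar_llrs, non_llrs, logit_prior):
    """Best achievable normalized DCF when the threshold is set with hindsight."""
    p_tar = 1.0 / (1.0 + np.exp(-logit_prior))
    best = min(p_tar, 1.0 - p_tar)                 # never worse than the default system
    for t in np.unique(np.concatenate([tar_llrs, non_llrs])):
        p_miss = np.mean(tar_llrs < t)
        p_fa = np.mean(non_llrs >= t)
        best = min(best, p_tar * p_miss + (1.0 - p_tar) * p_fa)
    return best / min(p_tar, 1.0 - p_tar)          # normalized, as in the plots
```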
0:07:50 | so we trained the system using three different values for the training weighting parameter alpha |
0:08:03 | alpha much smaller than one means we're in George Doddington's region, the low-false-alarm-rate region: false alarms are more important, so you weight them more |
0:08:16 | if you do that, you do well in the region where you want to do well, but on the other side, as you can see, the red curve suffers |
0:08:22 | if we set the parameter to a half, it does badly almost everywhere |
0:08:29 | if you set the parameter to the other side, almost one, you get the reverse: on that side it's bad, on this side it's good; that's the blue curve |
0:08:40 | so this was generative; let's move to discriminative |
0:08:46 | the picture is slightly better; this is the usual pattern: if you have lots of data, discriminative outperforms generative a bit |
0:08:57 | but still, we don't do as well as we might like to over all operating points |
0:09:03 | so let's see what the nonlinear methods will do |
0:09:10 | the PAV algorithm, also sometimes called isotonic regression, is a very interesting algorithm |
0:09:18 | we allow as calibration function any monotonically rising function |
0:09:29 | and then there's an optimization procedure which essentially selects, for every single score, what the function is going to map it to, so it's non-parametric |
0:09:42 | and the very interesting thing is, we don't have to choose which objective we actually want to optimize: this function class is rich enough that it just optimizes all of them |
0:09:57 | you get that automatically: all your objective functions are optimized, at all operating points, on the training data |
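The talk gives no code, but a PAV-style calibration can be sketched with scikit-learn's IsotonicRegression (my choice of tool, not the authors'); converting the fitted posterior back to an LLR by subtracting the training prior log-odds is also my assumption about the bookkeeping:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def pav_calibration(tar_scores, non_scores):
    """Fit a monotonically non-decreasing map from score to posterior
    probability of 'target' (isotonic regression / PAV); return score -> LLR."""
    scores = np.concatenate([tar_scores, non_scores])
    labels = np.concatenate([np.ones(len(tar_scores)), np.zeros(len(non_scores))])
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(scores, labels)
    prior_logodds = np.log(len(tar_scores)) - np.log(len(non_scores))

    def to_llr(new_scores, eps=1e-6):
        post = np.clip(iso.predict(new_scores), eps, 1.0 - eps)
        return np.log(post / (1.0 - post)) - prior_logodds   # posterior odds -> LLR
    return to_llr
```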
0:10:07 | if you go to the test data, you see that over a wide range of operating points it does work pretty well, but at the extreme negative end we do have a slight problem, and I attribute that to overfitting |
0:10:25 | this thing has forty-two million parameters; it's non-parametric, so the parameters grow with the data |
0:10:32 | but there are also forty-two million inequality constraints, and that makes it behave, mostly, except where we run out of errors and it stops behaving |
0:10:45 | now we go to the generative version of nonlinear |
0:10:52 | as I mentioned before, if you allow the target distribution and the non-target distribution to have separate variances, you get a nonlinear, quadratic calibration function |
0:11:05 | and then we also applied a Student's t-distribution, and an even more general distribution, the normal inverse Gaussian |
0:11:14 | I won't go into all the details, but the important thing is: we go from the Gaussian, which just has a mean and a variance, a location and a scale, to distributions that can also control the tail thickness, and then the final one also has skewness |
0:11:26 | so we will see what these extra parameters do, what their effect is |
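A minimal sketch of the generative nonlinear family just described: fit one score distribution per class and take the difference of log-densities. With scipy.stats.norm and separate variances the LLR is quadratic in the score; stats.t adds a tail-thickness parameter and stats.norminvgauss adds skewness. This uses plain maximum-likelihood fits as a stand-in; the paper's actual estimators may differ:

```python
import numpy as np
from scipy import stats

def generative_calibration(tar_scores, non_scores, family=stats.norm):
    """Fit one distribution per class by ML; llr(s) = log f_tar(s) - log f_non(s)."""
    tar_params = family.fit(tar_scores)
    non_params = family.fit(non_scores)

    def llr(s):
        return family.logpdf(s, *tar_params) - family.logpdf(s, *non_params)
    return llr

# usage sketch on synthetic scores; swap in stats.t or stats.norminvgauss
llr_gauss = generative_calibration(np.random.randn(1000) + 3.0,
                                   np.random.randn(5000), family=stats.norm)
```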
0:11:37 | this picture is much better than the previous ones; all of them are better |
0:11:45 | if we have to choose between them, the blue one, the most complex one, does the best |
0:11:52 | but the Gaussian one does pretty well, and the Gaussian one is a lot faster and a lot easier to use |
0:12:00 | so maybe you don't want to bother with Bessel functions and complex optimization algorithms |
0:12:09 | you can read in the paper how to optimize the NIG one |
0:12:16 | what is interesting is the t-distribution: its complexity is between the other two, so why is it the worst? |
0:12:27 | you wouldn't expect that; the green one we would expect to be between the red and the blue |
0:12:33 | my explanation is that it's sort of abusing its ability to adjust the tail thickness: it's symmetric, so what it sees at the one tail it tries to apply to the other tail as well |
0:12:48 | so I think it's a complex mixture of overfitting and underfitting that you're seeing here |
0:12:57 | let me just quickly summarize the results: this table gives all the calibration solutions |
0:13:03 | the red ones are the linear ones, with two or three parameters; they underfit |
0:13:08 | the PAV has forty-two million parameters, and there is some overfitting |
0:13:15 | and then the blue ones, the nonlinear parametric ones, do a lot better, and the most complex one works the best |
0:13:26 | I'll just show these plots again so you can see how we improve: from the generative linear one, to the discriminative one, to the nonlinear non-parametric, and to the nonlinear parametric |
0:13:43 | in conclusion: linear calibration suffers from underfitting, but we can manage that by focusing on a specific operating point |
0:13:58 | nonlinear calibrations don't have the underfitting problem, but you have to watch out for overfitting |
0:14:07 | again, that can be managed: you can regularize, as you would with other machine learning techniques, or you can use Bayesian methods |
0:14:19 | so that's my story; any questions? |
0:14:39 | can I ask a double question? |
0:14:43 | do you think that these conclusions hold for other kinds of systems and other kinds of data? do you have any experience, or was this only a PLDA i-vector system? |
0:14:57 | yes, correct, I only did it on that system, on the one database |
0:15:08 | I would like to speculate, but only once it's been tested on other data as well |