0:00:15 | okay, so |
0:00:17 | in this paper I'm going to compare some linear and some nonlinear calibration functions |
0:00:25 | so, there's a list of previous papers; all of them used linear calibration and we did various interesting things |
0:00:34 | but when I was doing that work, every time it became evident that linear calibration has limits, so let's explore what we can do with nonlinear calibration |
0:00:45 | so in this paper we're going to use lots of data for training the calibration; in that case a plug-in solution works well |
0:00:57 | we have an upcoming Interspeech paper where we use tiny amounts of training data, and there a Bayesian solution is an interesting thing to do |
0:01:09 | so, just a reminder of why calibration: a speaker recognizer is not useful unless it actually does something |
0:01:19 | so we want to make decisions, and it's nice if you can make those decisions minimum-expected-cost Bayes decisions |
0:01:28 | if you take the raw scores out of the recognizer, they don't make good decisions, so we like to calibrate them so that we can make cost-effective decisions |
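To make "minimum-expected-cost Bayes decisions" concrete, here is a minimal Python sketch of the standard Bayes threshold applied to calibrated log-likelihood ratios; the function names and example numbers are illustrative, not from the paper:

```python
import numpy as np

def bayes_threshold(p_target, c_miss=1.0, c_fa=1.0):
    """Minimum-expected-cost Bayes decision: accept (decide 'target')
    when llr >= log(c_fa * (1 - p_target)) - log(c_miss * p_target)."""
    return np.log(c_fa * (1.0 - p_target)) - np.log(c_miss * p_target)

def bayes_decisions(llrs, p_target, c_miss=1.0, c_fa=1.0):
    """Apply the threshold to an array of calibrated log-likelihood ratios."""
    return llrs >= bayes_threshold(p_target, c_miss, c_fa)

# illustrative operating point: rare targets, expensive false alarms
decisions = bayes_decisions(np.array([-2.0, 0.5, 4.2]), p_target=0.01, c_fa=10.0)
```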
0:01:39 | so, for years we've done pretty well with linear calibration, why complicate things? |
0:01:49 | well, the good thing about linear calibration is simplicity: it's easy to do |
0:01:57 | and then there's the problem of overfitting: linear calibration has very few parameters, so it doesn't overfit that easily |
0:02:05 | that's not a problem that should be underestimated; even if you have lots of data, if you work at an extreme operating point the error rates become very low, so your effective data is really the errors, not the speech samples or trials, and with only a few errors you're going to have overfitting problems |
0:02:23 | and another thing: linear calibration is monotonic; as the score increases, the log-likelihood ratio increases |
0:02:30 | if you do something nonlinear, you might be in a situation where the score increases but the log-likelihood ratio decreases |
0:02:40 | and I don't think even all the saunas in Finland can help us be certain whether we want that kind of thing or not |
0:02:50 | so, the limitation of linear methods is that they don't look at all operating points at the same time |
0:02:59 | you have to choose where you want to operate: what cost ratio, what prior for the target do you want to work at |
0:03:09 | and then you have to tailor your training objective function to that operating point, to make that work |
0:03:21 | so why is this a problem? you cannot always know in advance where you're going to want your system to work, especially if you're dealing with unsupervised data |
0:03:33 | the nonlinear methods can be accurate over a wider range of operating points |
0:03:41 | then you don't need to do so much gymnastics with your training objective function |
0:03:52 | the nonlinear methods are considerably more complex to train; I had to go and find out a lot of things about Bessel functions and how to compute the derivatives |
0:04:03 | and they are more vulnerable to overfitting: more complex functions, more things can go wrong |
0:04:12 | so we'll compare various flavours: discriminative and generative linear calibrations, and the same for nonlinear ones |
0:04:22 | and the conclusion is going to be that there is some benefit to the nonlinear ones, but only if you want to cover a wide range of operating points |
0:04:34 | let's first describe the linear ones |
0:04:39 | it's linear because we take the score that comes out of the system, we scale it by some factor a, and we shift it by some constant b |
0:04:50 | if we use Gaussian score distributions, you have two distributions, one for targets and one for non-targets |
0:05:01 | you have a target mean and a non-target mean, and then you share the variance between the two distributions |
0:05:09 | if you don't, if you have separate variances, you get a quadratic function; so in the linear case we are sharing that sigma |
0:05:19 | that gives a linear generative calibration |
0:05:23 | or you can be discriminative: in that case your probabilistic model is just the formula at the top of the slide, and you directly train those parameters by minimizing cross-entropy |
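As a rough sketch of the generative linear recipe just described (two Gaussian score distributions with a shared variance, which makes the log-likelihood ratio affine in the score), something along these lines would do; the helper names are mine, and the pooled-variance estimate is one of several reasonable choices:

```python
import numpy as np

def train_gaussian_linear_calibration(target_scores, nontarget_scores):
    """Fit one Gaussian per class with a shared (pooled) variance.
    Sharing the variance is what keeps the resulting LLR linear: llr = a*s + b."""
    mu_t, mu_n = target_scores.mean(), nontarget_scores.mean()
    deviations = np.concatenate([target_scores - mu_t, nontarget_scores - mu_n])
    var = deviations.var()                      # pooled variance, the shared sigma^2
    a = (mu_t - mu_n) / var                     # scale
    b = (mu_n**2 - mu_t**2) / (2.0 * var)       # offset
    return a, b

def calibrate(scores, a, b):
    """Map raw scores to calibrated log-likelihood ratios."""
    return a * scores + b
```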
0:05:36 | so, I said we have to tune the objective function to make it work at a specific operating point |
0:05:44 | what we basically do is weight the target trials and the non-target trials, or if you want, the miss errors and the false alarm errors, by factors of alpha and one minus alpha |
0:05:57 | but alpha is also a training parameter, so when you train, you first have to select your operating point |
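A minimal sketch of that weighting for the discriminative linear case: an alpha-weighted cross-entropy over the affine parameters a and b, optimized numerically. This is my own illustrative formulation of "weight the misses by alpha and the false alarms by one minus alpha", not necessarily the paper's exact objective:

```python
import numpy as np
from scipy.optimize import minimize

def weighted_cross_entropy(params, tar, non, alpha):
    """alpha weights the miss errors (target trials), 1 - alpha the false alarms."""
    a, b = params
    softplus = lambda x: np.logaddexp(0.0, x)     # log(1 + exp(x)), numerically stable
    c_miss = softplus(-(a * tar + b)).mean()      # -mean log sigmoid(llr) on targets
    c_fa = softplus(a * non + b).mean()           # -mean log(1 - sigmoid(llr)) on non-targets
    return alpha * c_miss + (1.0 - alpha) * c_fa

def train_discriminative_linear(tar, non, alpha=0.5):
    res = minimize(weighted_cross_entropy, x0=np.array([1.0, 0.0]),
                   args=(tar, non, alpha), method="Nelder-Mead")
    return res.x    # the scale a and offset b
```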
0:06:07 | let's see how that all works out; I'll present the experimental results, first the linear stuff and then the nonlinear |
0:06:16 | it's a simple experimental setup with an i-vector system |
0:06:23 | we trained the calibrations on a huge number of scores, about forty million, and we tested on SRE'12, which was described earlier today, with about nine million scores |
0:06:36 | our evaluation criterion is the same one that was used in the NIST evaluation, the well-known DCF, or if you want, the Bayes error rate |
0:06:49 | and it's normalized by the performance of a default system that doesn't look at the scores and just makes decisions by the prior alone |
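A minimal sketch of that evaluation criterion, with unit costs so the operating point is just the (log-odds of the) target prior; the normalization by the prior-only default system is the min(p, 1-p) term. Names and conventions here are illustrative:

```python
import numpy as np

def normalized_bayes_error(tar_llrs, non_llrs, logit_prior):
    """Actual (normalized) DCF at one operating point, unit costs:
    Bayes decisions are made with the calibrated LLRs at this prior."""
    p_tar = 1.0 / (1.0 + np.exp(-logit_prior))
    threshold = -logit_prior                       # Bayes threshold on the LLR
    p_miss = np.mean(tar_llrs < threshold)
    p_fa = np.mean(non_llrs >= threshold)
    dcf = p_tar * p_miss + (1.0 - p_tar) * p_fa
    # default system: ignores the scores, always makes the cheaper blind decision
    return dcf / min(p_tar, 1.0 - p_tar)
```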
0:07:02 | so this is the result of the Gaussian calibration |
0:07:13 | what we're looking at: the vertical axis is the DCF, or the error rate; lower is better |
0:07:20 | the horizontal axis is your operating point, or your target prior, on a log-odds scale: zero would be a prior of a half, negative means small priors, positive means large priors |
0:07:38 | the dashed line is what you would know as minimum DCF: the best you can do if the evaluator sets the threshold for you at every single operating point |
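For comparison, the dashed minimum-DCF line lets the evaluator pick the best threshold at every operating point; here is a naive but correct sketch of that, fine for illustration though too slow for millions of trials:

```python
import numpy as np

def min_dcf(tar_llrs, non_llrs, logit_prior):
    """Best achievable normalized DCF when the threshold is set with hindsight."""
    p_tar = 1.0 / (1.0 + np.exp(-logit_prior))
    best = min(p_tar, 1.0 - p_tar)                 # never worse than the default system
    for t in np.unique(np.concatenate([tar_llrs, non_llrs])):
        p_miss = np.mean(tar_llrs < t)
        p_fa = np.mean(non_llrs >= t)
        best = min(best, p_tar * p_miss + (1.0 - p_tar) * p_fa)
    return best / min(p_tar, 1.0 - p_tar)          # normalized, as in the plots
```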
0:07:50 | so we trained the system using three different values for the training weighting parameter alpha |
0:08:03 | alpha much smaller than one means we're in George Doddington's region, the low-false-alarm-rate region: false alarms are more important, so you weight them more |
0:08:16 | if you do that, you do well in the region where you want to do well, but on the other side, as you can see, the red curve suffers |
0:08:22 | if we set the parameter to a half, it does badly almost everywhere |
0:08:29 | if you set the parameter to the other side, almost one, you get the reverse: on that side it's bad, on this side it's good; that's the blue curve |
0:08:40 | so this was generative; let's move to discriminative |
0:08:46 | the picture is slightly better; this is the usual pattern: if you have lots of data, discriminative outperforms generative a bit |
0:08:57 | but still, we don't do as well as we might like to over all operating points |
0:09:03 | so let's see what the nonlinear methods will do |
0:09:10 | the PAV algorithm, also sometimes called isotonic regression, is a very interesting algorithm |
0:09:18 | we allow as calibration function any monotonically rising function |
0:09:29 | and then there's an optimization procedure which essentially selects, for every single score, what the function is going to map it to, so it's non-parametric |
0:09:42 | and the very interesting thing is, we don't have to choose which objective we actually want to optimize: this function class is rich enough that it just optimizes all of them |
0:09:57 | you get that automatically: all your objective functions are optimized, at all operating points, on the training data |
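The talk gives no code, but a PAV-style calibration can be sketched with scikit-learn's IsotonicRegression (my choice of tool, not the authors'); converting the fitted posterior back to an LLR by subtracting the training prior log-odds is also my assumption about the bookkeeping:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def pav_calibration(tar_scores, non_scores):
    """Fit a monotonically non-decreasing map from score to posterior
    probability of 'target' (isotonic regression / PAV); return score -> LLR."""
    scores = np.concatenate([tar_scores, non_scores])
    labels = np.concatenate([np.ones(len(tar_scores)), np.zeros(len(non_scores))])
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(scores, labels)
    prior_logodds = np.log(len(tar_scores)) - np.log(len(non_scores))

    def to_llr(new_scores, eps=1e-6):
        post = np.clip(iso.predict(new_scores), eps, 1.0 - eps)
        return np.log(post / (1.0 - post)) - prior_logodds   # posterior odds -> LLR
    return to_llr
```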
0:10:07 | if you go to the test data, you see that over a wide range of operating points it does work pretty well, but at the extreme negative end we do have a slight problem, and I attribute that to overfitting |
0:10:25 | this thing has forty-two million parameters; it's non-parametric, so the parameters grow with the data |
0:10:32 | but there are also forty-two million inequality constraints, and that makes it behave, mostly, except where we run out of errors and it stops behaving |
0:10:45 | now we go to the generative version of nonlinear |
0:10:52 | as I mentioned before, if you allow the target distribution and the non-target distribution to have separate variances, you get a nonlinear, quadratic calibration function |
0:11:05 | and then we also applied a Student's t-distribution, and an even more general distribution, the normal inverse Gaussian |
0:11:14 | I won't go into all the details, but the important thing is: we go from the Gaussian, which just has a mean and a variance, a location and a scale, to distributions that can also control the tail thickness, and then the final one also has skewness |
0:11:26 | so we will see what these extra parameters do, what their effect is |
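A minimal sketch of the generative nonlinear family just described: fit one score distribution per class and take the difference of log-densities. With scipy.stats.norm and separate variances the LLR is quadratic in the score; stats.t adds a tail-thickness parameter and stats.norminvgauss adds skewness. This uses plain maximum-likelihood fits as a stand-in; the paper's actual estimators may differ:

```python
import numpy as np
from scipy import stats

def generative_calibration(tar_scores, non_scores, family=stats.norm):
    """Fit one distribution per class by ML; llr(s) = log f_tar(s) - log f_non(s)."""
    tar_params = family.fit(tar_scores)
    non_params = family.fit(non_scores)

    def llr(s):
        return family.logpdf(s, *tar_params) - family.logpdf(s, *non_params)
    return llr

# usage sketch on synthetic scores; swap in stats.t or stats.norminvgauss
llr_gauss = generative_calibration(np.random.randn(1000) + 3.0,
                                   np.random.randn(5000), family=stats.norm)
```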
0:11:37 | this picture is much better than the previous ones; all of them are better |
0:11:45 | if we have to choose between them, the blue one, the most complex one, does the best |
0:11:52 | but the Gaussian one does pretty well, and the Gaussian one is a lot faster and a lot easier to use |
0:12:00 | so maybe you don't want to bother with Bessel functions and complex optimization algorithms |
0:12:09 | you can read in the paper how to optimize the NIG one |
0:12:16 | what is interesting is the t-distribution: its complexity is between the other two, so why is it the worst? |
0:12:27 | you wouldn't expect that; the green one we would expect to be between the red and the blue |
0:12:33 | my explanation is that it's sort of abusing its ability to adjust the tail thickness: it's symmetric, so what it sees at the one tail it tries to apply to the other tail as well |
0:12:48 | so I think it's a complex mixture of overfitting and underfitting that you're seeing here |
0:12:57 | let me just quickly summarize the results: this table gives all the calibration solutions |
0:13:03 | the red ones are the linear ones, with two or three parameters; they underfit |
0:13:08 | the PAV has forty-two million parameters, and there is some overfitting |
0:13:15 | and then the blue ones, the nonlinear parametric ones, do a lot better, and the most complex one works the best |
0:13:26 | I'll just show these plots again so you can see how we improve: from the generative linear one, to the discriminative one, to the nonlinear non-parametric, and to the nonlinear parametric |
0:13:43 | in conclusion: linear calibration suffers from underfitting, but we can manage that by focusing on a specific operating point |
0:13:58 | nonlinear calibrations don't have the underfitting problem, but you have to watch out for overfitting |
0:14:07 | again, that can be managed: you can regularize, as you would with other machine learning techniques, or you can use Bayesian methods |
0:14:19 | so that's my story; any questions? |
0:14:39 | can I ask a double question? |
0:14:43 | do you think that these conclusions hold for other kinds of systems and other kinds of data? do you have any experience, or was this only a PLDA i-vector system? |
0:14:57 | yes, correct, I only did it on that system, on the one database |
0:15:08 | I would like to speculate, but only once it's been tested on other data as well |