0:00:00 | Thank you all for attending this talk. |
0:00:03 | I am a researcher at the Computer Science Institute, which is a joint unit of the University of Buenos Aires and CONICET, in Argentina. |
0:00:13 | Today I will be talking about the issue of calibration in speaker verification, |
0:00:17 | and hopefully, by the end of the talk, you will be convinced that this is indeed an important issue, if you were not already. |
0:00:26 | The talk will be organized this way: first, I am going to define calibration and give an intuition for it. |
0:00:35 | Then I will talk about why we should care about it, which is also related to how to measure it, |
0:00:43 | and, if we find out that calibration is bad in a certain system, how to fix it. |
0:00:49 | Then, finally, I will talk about issues of robustness of calibration for speaker verification. |
0:00:55 | The main task on which I will base the examples is speaker verification. |
0:01:02 | I assume that this audience knows the task well, but just in case: |
0:01:10 | it is a binary classification task where the samples are given by two waveforms, or two sets of waveforms, |
0:01:18 | that we need to compare to decide whether they come from the same speaker or from different speakers. |
0:01:26 | Since the task is binary classification, much of what I am going to say applies to any binary classification task, not just speaker verification. |
0:01:36 | Okay, so what is calibration? |
0:01:39 | Say we want to build a system that predicts the probability that it will rain within the next hour, based only on a picture of the sky. |
0:01:47 | If we see this picture, then we would expect the system to output a low probability of rain, say 0.1, |
0:01:53 | while if we see this picture, then we would expect a much higher probability of rain, closer to one. |
0:02:03 | We will say that the system is well calibrated when the values output by the system coincide with what we see in the data. |
0:02:15 | So, a well-calibrated score should reflect the uncertainty of the system. |
0:02:21 | For example, to be concrete: of all the samples that get a score of 0.8 from the system, we would expect eighty percent to be labeled correctly; that is what a score of 0.8 means. |
0:02:35 | If that happens, then we say that the system is well calibrated. |
0:02:40 | Here we see an example of a diagram that is used in many tasks, not so much in speaker verification, but I think it is very intuitive for understanding calibration. |
0:02:53 | It is called the reliability diagram. |
0:02:55 | Basically, what it shows is the posteriors from a system that was run on certain data: the posteriors that the system gave to the class that it predicted. |
0:03:07 | So, for example, for this bin we take all the samples for which the system gave a posterior that falls between 0.8 and the next bin edge, |
0:03:17 | and what the diagram shows is the accuracy on those samples. |
0:03:23 | If the system were calibrated, then we would expect these bars to follow the diagonal, because what the system predicted would coincide with the accuracy that we see in the data. |
0:03:35 | In this specific case, what we actually see is that the system was correct more times than it thought it would be, which means this is a system that underestimates its confidence. |
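As a concrete illustration of the diagram just described, here is a minimal sketch, not from the talk, of computing reliability-diagram points from a system's predictions; the ten equal-width bins are an assumed convention.

```python
import numpy as np

def reliability_diagram(confidences, correct, n_bins=10):
    """For each confidence bin, return the mean predicted posterior
    and the empirical accuracy of the samples falling in that bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    edges[-1] += 1e-9  # include samples with confidence exactly 1.0
    mean_conf, accuracy = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences >= lo) & (confidences < hi)
        if in_bin.any():
            mean_conf.append(confidences[in_bin].mean())
            accuracy.append(correct[in_bin].mean())
    return np.array(mean_conf), np.array(accuracy)

# confidences: posterior of the predicted class per sample (numpy array)
# correct: 1.0 where the predicted class was right, 0.0 otherwise
# For a calibrated system, accuracy is close to mean_conf in every bin,
# so the bars follow the diagonal.
```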
0:03:49 | Now, I took this diagram from a paper from 2017, which actually studies the issue of calibration in modern neural network architectures. |
0:04:01 | It compares systems on a task called CIFAR-100, which is image classification with one hundred classes. |
0:04:08 | It compares, with the kind of plot I already showed, a CNN from 1998 with a ResNet from 2016, |
0:04:19 | and it shows that the new network is actually much worse calibrated than the old network. |
0:04:26 | For this same bin as before, the new network has an accuracy much lower than what it claims it should be. |
0:04:38 | So this system is overconfident: the ResNet thinks it will do much better than it actually does. |
0:04:46 | On the other hand, the error of the new network is lower, so if you put this network to make decisions, the decisions will be better than the old ones, |
0:04:55 | but the scores that it outputs cannot be interpreted as posteriors; they cannot be interpreted as the certainty that the system has when it makes a decision. |
0:05:08 | This is actually a phenomenon that we see a lot in speaker recognition: you can have a badly calibrated model that is still very discriminative. |
0:05:20 | The problem is that such a model might be useless in practice, depending on the scenario in which we plan to use it. |
0:05:28 | As I already said, the scores from a miscalibrated system cannot be interpreted as the certainty that the system has in its decisions. |
0:05:39 | Also, the scores cannot be used to make optimal decisions without having data with which to tune how to make the decisions; that is what I am going to talk about in the next two sections. |
0:05:54 | So how do we make optimal decisions, in general, for binary classification? |
0:05:59 | We usually define a cost function, and this is a very common cost function which has very nice properties. |
0:06:07 | It is a combination of two terms, one for each class, where this part here is the probability of making an error for that class: the probability of deciding class zero when the true class was one. |
0:06:26 | We multiply this probability of error by the prior for class one, |
0:06:31 | and then we further multiply it by a cost, which is what we think it is going to cost us if we make this error; this is very specific to the application in which we are going to use the system. |
0:06:44 | The term for the other class is symmetric. |
0:06:49 | This is an expected cost, and the way to minimize this expected cost is to choose the following decisions: |
0:06:59 | for a certain sample x, the decided class should be one if this factor is larger than this factor, and zero otherwise. |
0:07:09 | Each factor is composed of the cost, the prior, and the likelihood: this one for class one, and the same for class zero. |
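In symbols (a reconstruction, since the slides are not reproduced here), the cost function and decision rule being described are the following; the tildes mark the costs, priors, and likelihoods expected at test time, as the speaker explains next.

```latex
C = \tilde{C}_{01}\,\tilde{P}_1\,P(\text{decide } 0 \mid \text{class } 1)
  + \tilde{C}_{10}\,\tilde{P}_0\,P(\text{decide } 1 \mid \text{class } 0),
\qquad
\text{decide class } 1
\;\Longleftrightarrow\;
\tilde{C}_{01}\,\tilde{P}_1\,\tilde{p}(x \mid C_1) > \tilde{C}_{10}\,\tilde{P}_0\,\tilde{p}(x \mid C_0).
```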
0:07:23 | We see here that what we need in order to make optimal decisions is this likelihood, p of x given c. |
0:07:34 | Now, what we have is not the likelihood that we need; what we have is the likelihood learned on the training data. |
0:07:43 | That is why I am using the tilde here: to indicate that the probabilities in the cost are the ones we expect to see in testing, the ones we will actually see at test time. |
0:07:55 | We do not have those; what we have is what we saw in training. |
0:08:00 | So, say we train a generative model: our generative model is going to give us directly this likelihood, but it will be the likelihood learned in training. |
0:08:10 | And that is fine: in order to do anything at all in machine learning, we usually just assume that this will generalize to testing. |
0:08:20 | Now, we may not have the likelihood if we train a discriminative system; in that case we may have the posterior: discriminative systems trained, for example, with cross-entropy output posteriors. |
0:08:34 | In that case, what we need to do is convert those posteriors into likelihoods, and for that we use Bayes' rule: |
0:08:41 | basically, we multiply the posterior by p of x and divide by the prior. |
0:08:46 | Note here that, again, this is the prior in training; it is not the prior, the p tilde, that appears in the cost, which is the one we expect to see in testing. |
0:08:58 | And that is the whole point of why we use likelihoods and not posteriors to make these optimal decisions: it gives us the flexibility to separate the prior in training from the prior in testing. |
0:09:14 | Okay, so, going back to the optimal decisions, we have this expression. |
0:09:21 | We can simplify it by defining the log-likelihood ratio, which I am sure everybody knows if you work in speaker verification: |
0:09:30 | it is basically the ratio between the likelihood for class one and the likelihood for class zero, and we take the logarithm because it is nicer to work with. |
0:09:40 | We can do a similar thing with the costs and priors, the factors that multiply these likelihoods here: we define this theta. |
0:09:49 | With those definitions, we can simplify the optimal decision to look like this: basically, you decide class one if the LLR is larger than theta, and class zero otherwise. |
0:10:02 | The LLR can itself be computed from the system posteriors with this expression, which is just Bayes' rule after taking the logarithm. |
0:10:11 | The p of x term cancels out, because it appears in both likelihoods, |
0:10:21 | and what remains is basically the log-odds of the posterior minus the log-odds of the prior, which can be written this way using the logit function. |
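Collecting the definitions from the last few slides in one place (notation reconstructed, so it may differ in detail from the slides):

```latex
\mathrm{LLR}(x) = \log\frac{p(x \mid C_1)}{p(x \mid C_0)},
\qquad
\theta = \log\frac{\tilde{C}_{10}\,\tilde{P}_0}{\tilde{C}_{01}\,\tilde{P}_1},
\qquad
\text{decide } C_1 \iff \mathrm{LLR}(x) > \theta,
```

and, from Bayes' rule with the training prior P_1,

```latex
\mathrm{LLR}(x) = \mathrm{logit}\,P(C_1 \mid x) - \mathrm{logit}\,P_1,
\qquad
\mathrm{logit}(p) = \log\frac{p}{1-p}.
```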
0:10:34 | Okay. In speaker verification, the feature x is actually a pair of features, or even a pair of sets of features: one for enrollment and one for test. |
0:10:46 | Class one is the class for target, or same-speaker, trials, and class zero is the class for impostor, or different-speaker, trials. |
0:10:57 | We define the cost function, usually called the DCF in speaker verification, using these names for the costs and priors, |
0:11:08 | and we call the errors P miss and P false-alarm: |
0:11:15 | a miss would be labeling a target trial as an impostor, and a false alarm would be labeling an impostor as a target. |
0:11:26 | The threshold looks like this using these names. |
0:11:30 | And if you look at it, to make optimal decisions you only care about this ratio; you do not care about the whole combination of values of costs and priors, only about this theta. |
0:11:44 | So you can in fact simplify the family of cost functions to consider by using a single parameter, an effective target prior, which is equivalent to having the whole triplet, because the decisions depend on the triplet only through theta. |
0:12:06 | We will be using theta for the rest of the talk, because it is much simpler and it helps a lot in the analysis: basically, we collapse all possible cost functions, all combinations of costs and priors, to a single effective theta. |
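A small sketch of the reduction just described (mine, not the speaker's): any triplet of costs and target prior collapses to a single effective prior, and hence to a single Bayes threshold for the LLRs.

```python
import numpy as np

def effective_prior(p_tar, c_miss, c_fa):
    """Collapse the (P_tar, C_miss, C_fa) triplet into one effective target prior."""
    return (c_miss * p_tar) / (c_miss * p_tar + c_fa * (1.0 - p_tar))

def bayes_threshold(p_eff):
    """Bayes optimal LLR threshold: theta = -logit(p_eff)."""
    return np.log((1.0 - p_eff) / p_eff)

def bayes_decisions(llrs, p_tar, c_miss, c_fa):
    """Optimal hard decisions: True = target, False = impostor."""
    theta = bayes_threshold(effective_prior(p_tar, c_miss, c_fa))
    return llrs > theta

# Equal priors and equal costs give p_eff = 0.5 and theta = 0, which is
# exactly the default cost function discussed next.
```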
0:12:27 | So let us see some examples of applications that use different cost functions. |
0:12:33 | The default, simplest cost function would be to have equal priors and equal costs, and that would give you a threshold of zero; that would be the Bayes optimal threshold for that cost function. |
0:12:49 | Now, say you have an application like speaker authentication, where your goal is to verify whether somebody is who they say they are using their voice, for example to enter some system. |
0:13:09 | Then you would expect most of your cases to be target trials, because you do not have many impostors trying to get into your system. |
0:13:18 | On the other hand, the cost of making a mistake is very high if you make a false alarm: you do not want any of the few impostors getting into the system. |
0:13:31 | That means you need to set a very high cost of false alarm compared to the cost of miss, and that corresponds to a threshold of 2.3. |
0:13:41 | So basically what you are doing is moving the threshold to the right. |
0:13:49 | This area here, under the solid curve, which is the distribution of scores for the impostor samples, |
0:13:55 | is the probability of false alarm: everything above the threshold of 2.3 will be a false alarm. |
0:14:00 | By moving the threshold to the right, we are minimizing this area. |
0:14:05 | Another application, which is actually the opposite in terms of costs and priors, is speaker search. |
0:14:11 | In that case you are looking for a certain specific speaker within audio from many other speakers, so the probability of finding your speaker is actually low, say one percent. |
0:14:26 | But the errors that you care about, the errors you want to avoid, are the misses, because you are looking for one specific speaker who is important to you for some reason, so you do not want to miss them. |
0:14:40 | In that case the optimal threshold is symmetric to the previous one, minus 2.3, and what you are trying to minimize is the area under the dashed curve to the left of the threshold, which is the probability of miss. |
0:14:59 | Okay. So, to recap before moving on to evaluation: |
0:15:07 | if we have LLRs, I showed that we can trivially make optimal decisions for any possible cost function that you can imagine, with the formula that I gave. |
0:15:19 | But of course these decisions will only be actually optimal if the system outputs are well calibrated; otherwise they will not be. |
0:15:29 | So how do we figure out whether we have a well-calibrated system? |
0:15:35 | The idea is: if you are going to have your system make decisions using the thresholds that I showed before, the thetas, then that is what you should evaluate: have your system make those decisions using those thetas, and see how well it does. |
0:15:54 | And then the further question is: could we have made better decisions if we had calibrated the scores before making the decisions? That will give us a measure of how well calibrated the system is to begin with. |
0:16:08 | So, the way we usually evaluate performance on a binary classification task is by using the cost at one particular threshold. |
0:16:19 | We pre-fix the threshold (using Bayes decision theory or not; we just set a threshold), then compute the P miss and P false-alarm, which are these areas under the two distributions, and then compute the cost. |
0:16:37 | Now, we can also define metrics that depend on the whole distribution of scores. |
0:16:45 | For example, the equal error rate is defined by finding the threshold that makes these two areas the same, so to compute it you need the whole test distribution. |
0:16:59 | A similar thing is the minimum DCF: what you do in that case is sweep the threshold across the whole range of scores, compute the cost for every possible threshold, and then choose the threshold that gives the minimum cost. |
0:17:20 | Now, that minimum cost is actually bounded, and it is bounded by dummy decisions: a system that makes fixed decisions without looking at the input. |
0:17:31 | If you put the threshold, for example, all the way to the right, then the only mistakes you make are misses; everything will be labeled as an impostor, so you will have a P miss of one and a P false-alarm of zero. |
0:17:46 | In that case, the cost that you incur is this factor here. |
0:17:50 | On the other hand, if you put the threshold all the way to the left, then you will only make false alarms, and the cost for that system will be this factor here. |
0:18:02 | So basically the bound for the minimum DCF is the best of those two cases; they are both dummy systems, but one will be better than the other. |
0:18:12 | We usually use this bound to normalize the DCF; in NIST evaluations, for example, the cost that is defined is the normalized DCF. |
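Putting the last few slides together, here is a compact sketch, under my own array conventions, of the actual normalized cost at a given threshold, the minimum cost over all thresholds, and the equal error rate, all computed from target and impostor LLRs:

```python
import numpy as np

def error_rates(tar, non, threshold):
    """P_miss and P_fa for hard decisions at a given threshold."""
    return np.mean(tar <= threshold), np.mean(non > threshold)

def dcf(tar, non, p_eff, threshold):
    """Cost at one threshold, normalized by the best dummy system."""
    p_miss, p_fa = error_rates(tar, non, threshold)
    return (p_eff * p_miss + (1 - p_eff) * p_fa) / min(p_eff, 1 - p_eff)

def min_dcf(tar, non, p_eff):
    """Sweep the threshold over all observed scores and keep the minimum cost."""
    scores = np.sort(np.concatenate([tar, non]))
    thresholds = np.concatenate([[-np.inf], scores, [np.inf]])
    return min(dcf(tar, non, p_eff, t) for t in thresholds)

def eer(tar, non):
    """Coarse equal error rate: threshold where P_miss and P_fa cross."""
    thresholds = np.sort(np.concatenate([tar, non]))
    rates = [error_rates(tar, non, t) for t in thresholds]
    i = int(np.argmin([abs(pm - pf) for pm, pf in rates]))
    return (rates[i][0] + rates[i][1]) / 2
```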
0:18:26 | And then, finally, another thing we can do is sweep the threshold, computing the P miss and P false-alarm for every possible value, and that gives score curves like these; |
0:18:37 | if we transform the axes appropriately, we get the standard DET curves we use in speaker verification. |
0:19:02 | So, the cost that I have been talking about can be decomposed into a discrimination and a calibration component. Let us see how. |
0:19:12 | Say we assume a cost with equal priors and equal costs; in that case the Bayes optimal threshold will be zero. |
0:19:25 | So we compute the cost using that threshold, and we get this value; given that the priors and costs are the same, the cost will be given by the average of these two areas, shown here. |
0:19:38 | Now, we can also compute the minimum cost, as I mentioned before: basically, we sweep the threshold and find the threshold that gives the minimum cost. |
0:19:47 | Again, it is the average between these two areas, which, as you can see, is much smaller than the average between these two areas in this case. |
0:19:55 | And the difference between those two costs can be seen as the additional cost that you are incurring because your system was miscalibrated. |
0:20:06 | So this orange area here, which is the difference between the sum of the areas here and the sum of the areas here, is the cost due to miscalibration, and that is one way of measuring how miscalibrated the system is. |
0:20:25 | So, there is discrimination, which is how well the scores separate the classes, and there is calibration, which is whether the scores can be interpreted probabilistically, which implies that you can make optimal Bayes decisions if they are well calibrated. |
0:20:40 | And the key here is that discrimination is the part of the performance that cannot be changed if we transform the scores with an invertible transformation. |
0:20:52 | Here is a simple example: say you have these distributions of scores, and a threshold t that you chose for some reason; it could be optimal or not. |
0:21:03 | You transform the scores with any monotonic transformation (in this example it is just an affine transformation), and you can also transform the threshold t with the same exact function. |
0:21:22 | The threshold f of t will correspond to exactly the same cost as the threshold t in the original domain. |
0:21:31 | So basically, by applying a monotonic transformation to your scores you cannot change their discrimination: the minimum cost that you will be able to find in both cases will be the same. |
0:21:52 | So, the cost I have been talking about measures performance at a single operating point: it evaluates the quality of the hard decisions for a certain theta. |
0:22:03 | Now, a more comprehensive measure is the cross-entropy, which is given by this expression, which you probably all know. |
0:22:11 | The empirical cross-entropy is the average of minus the logarithm of the posterior that the system gives to the correct class of each sample. |
0:22:22 | You want this posterior to be as high as possible, as close to one as possible; the logarithm of one is zero, and if that happens for every sample then you get a cross-entropy of zero, which is what you want. |
0:22:37 | Now, there is a weighted version of this cross-entropy, which is basically the same, but you split your samples into two terms, the ones for class zero and the ones for class one, and you weight these averages by a prior, which is the effective prior that I talked about before. |
0:23:01 | So basically you make yourself independent of the priors that you are seeing in the test data: you can evaluate for any prior you want. |
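A reconstruction of the weighted cross-entropy expression being described, with target trials T, impostor trials I, and an arbitrary effective prior P-tilde replacing the proportions seen in the test data:

```latex
\mathrm{WCE}(\tilde{P}) =
 -\frac{\tilde{P}}{|T|} \sum_{t \in T} \log P(C_1 \mid x_t)
 \;-\; \frac{1-\tilde{P}}{|I|} \sum_{i \in I} \log P(C_0 \mid x_i).
```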
0:23:13 | These posteriors are computed from the LLRs and the priors using Bayes' rule; note that these are the same priors that appear in the weights, the ones you need to use when converting the LLRs. |
0:23:27 | And the famous Cllr, which we use in NIST evaluations and in many papers, is defined as this weighted cross-entropy when the priors are 0.5, normalized by the logarithm of two; I will explain why in the next slide. |
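A sketch of that Cllr computation (my own, assuming the scores are already LLRs): at effective prior 0.5, the posterior log-odds equal the LLR, so each per-trial term reduces to a softplus of the LLR.

```python
import numpy as np

def cllr(tar_llrs, non_llrs):
    """Weighted cross-entropy of LLR scores at effective prior 0.5,
    normalized by log(2) so the prior-only dummy system scores 1.0."""
    # -log P(target | x)   = log(1 + exp(-llr)) for target trials,
    # -log P(impostor | x) = log(1 + exp( llr)) for impostor trials.
    c_tar = np.mean(np.logaddexp(0.0, -tar_llrs))
    c_non = np.mean(np.logaddexp(0.0, non_llrs))
    return (0.5 * c_tar + 0.5 * c_non) / np.log(2.0)
```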
0:23:44 | So, the weighted cross-entropy can be decomposed, like the cost, into discrimination and calibration terms: basically, you compute the actual weighted cross-entropy and you subtract the minimum weighted cross-entropy. |
0:24:01 | Now, this minimum is not as trivial to obtain as for the cost: you cannot just choose a threshold, because here we are evaluating the scores themselves, not just the decisions. |
0:24:12 | So we need to actually warp the scores to get the best possible weighted cross-entropy without changing the discrimination of the scores, and that means using a monotonic transformation. |
0:24:26 | There is an algorithm called PAV, pool adjacent violators, which does exactly that: without changing the rank of the scores, the order of the scores, it does the best it can to minimize the weighted cross-entropy. |
0:24:42 | So that is what we use to compute this delta, which measures how miscalibrated your system is in terms of weighted cross-entropy. |
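And a sketch of the minimum just described, using scikit-learn's isotonic regression as the PAV step and reusing the cllr function above; reference implementations handle ties and endpoint posteriors more carefully, so treat this as an approximation.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def min_cllr(tar_llrs, non_llrs):
    """Approximate min Cllr: monotonically warp the scores with PAV,
    convert the resulting posteriors to LLRs, and recompute Cllr."""
    scores = np.concatenate([tar_llrs, non_llrs])
    labels = np.concatenate([np.ones_like(tar_llrs), np.zeros_like(non_llrs)])
    pav = IsotonicRegression(y_min=1e-6, y_max=1 - 1e-6)
    post = pav.fit_transform(scores, labels)  # optimal monotone posteriors
    # The posteriors reflect the proportion of targets in the data,
    # so subtract that prior's log-odds to obtain LLRs.
    prior = labels.mean()
    llrs = np.log(post / (1 - post)) - np.log(prior / (1 - prior))
    return cllr(llrs[labels == 1], llrs[labels == 0])
```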
0:24:53 | This weighted cross-entropy is bounded, the same as the cost, by a dummy system: in this case, the system that outputs, instead of the posteriors, directly the priors. It is a system that does not know anything about its input, but still does the best it can by outputting the priors. |
0:25:16 | That means that the worst sensible Cllr is 1.0, because we normalize by the logarithm of two, which is exactly this expression evaluated at a prior of 0.5. |
0:25:29 | So this means that the minimum Cllr will never be worse than one. |
0:25:36 | If the actual Cllr is worse than one, then you know for sure that you are going to have a difference here, because this one is never larger than one; so if this one is larger than one, it means you have a calibration problem. |
0:25:50 | Okay, finally, in terms of evaluation, I wanted to mention these curves: the APE, or applied probability of error, curves. |
0:25:59 | The Cllr gives a single summary number, but you might want to actually see the performance across a range of operating points, and that is what these curves do. |
0:26:10 | They basically show the cost as a function of the effective target prior, which also defines the theta. |
0:26:22 | What we see here is the cost for prior decisions; the prior decisions are what I mentioned before: basically just a dummy system that always outputs the priors instead of the posteriors. |
0:26:38 | In red is our system, whatever it is, calibrated or not, |
0:26:46 | and the dashed curve is the very best you could do if you were to warp your scores using the PAV algorithm. |
0:26:54 | So basically, for each theta, the difference between the dashed and the red curves is the miscalibration at that operating point. |
0:27:05 | And the nice property of these curves is that the Cllr is proportional to the area under them: the actual Cllr is proportional to the area under the red curve, and the minimum Cllr is proportional to the area under the dashed one. |
0:27:23 | Furthermore, the equal error rate is the maximum of the dashed curve. |
0:27:30 | And there are variants of these curves, which accompany these papers, that change the way the axes are defined. |
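A sketch of how the points of such a curve can be traced, under the conventions of the earlier snippets: for a sweep of effective priors, compare the Bayes error of the actual decisions against the prior-only dummy system (the curve for PAV-optimized scores would use the warped LLRs from the previous sketch).

```python
import numpy as np

def ape_points(tar_llrs, non_llrs, n=50):
    """Error of Bayes decisions vs. the dummy system, across effective priors."""
    logit_priors = np.linspace(-7.0, 7.0, n)
    actual, dummy = [], []
    for lp in logit_priors:
        p = 1.0 / (1.0 + np.exp(-lp))  # effective target prior
        theta = -lp                    # Bayes threshold for this prior
        p_miss = np.mean(tar_llrs <= theta)
        p_fa = np.mean(non_llrs > theta)
        actual.append(p * p_miss + (1 - p) * p_fa)
        dummy.append(min(p, 1 - p))    # best of always-accept / always-reject
    return logit_priors, np.array(actual), np.array(dummy)
```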
0:27:39 | Okay, so let us say we find that our system has a calibration problem. Should we worry about it? Should we try to fix it? |
0:27:50 | There are some scenarios where there is no problem in having a miscalibrated system, no need to fix it. |
0:27:59 | For example, if you know what the cost function is ahead of time, and there is development data available, then all you need to do is run the system on the development data and find the empirically best threshold for that data, that system, and that cost function, and use that. |
0:28:20 | You also do not need to worry about calibration if you only care about ranking the samples: you want the most likely targets at the top, and nothing else. |
0:28:33 | On the other hand, it may be very necessary to fix calibration in many other scenarios. |
0:28:39 | One of them is, for example, when you do not know ahead of time what the system will be used for, what the application is exactly; that means you do not know the cost function, and if you do not know the cost function, you cannot optimize the threshold ahead of time. |
0:28:53 | So, if you want to give the user of the system a knob that defines this effective target prior, then the system has to be calibrated for the Bayes optimal threshold to be really optimal, to work well. |
0:29:11 | Another case where you need good calibration is if you want to get a probabilistic value from your system: some measure of the uncertainty that the system has when it makes its decision. |
0:29:25 | You can use that uncertainty, for example, to reject samples when the system is uncertain. |
0:29:33 | So, if your LLR is too close to the threshold that you were planning to use to make hard decisions, then perhaps you want the system not to make a decision and to tell the user, "I do not know, you are on your own." |
0:29:47 | And another case is when you actually do not want to make hard decisions, but to report a value that is interpretable, as, for example, in the forensic voice comparison field. |
0:30:02 | Okay, so say we do want to fix calibration; we are in one of those scenarios where it matters. |
0:30:10 | One very common approach is to use linear logistic regression. |
0:30:15 | This assumes that the LLR, the calibrated score, is an affine transformation of whatever your system outputs; the parameters of this model are the w and the b, and it uses the weighted cross-entropy as the loss function. |
0:30:35 | Now, to compute the weighted cross-entropy we need posteriors, not LLRs, so we need to convert those LLRs into posteriors, and we use the expression I showed before: the LLR is the logit of the posterior minus the logit of the prior. |
0:30:55 | Inverting that expression, we get the logistic function, which is the inverse of the logit, |
0:31:04 | and finally, after trivial computations, we get this expression, which is the standard linear logistic regression expression. |
0:31:15 | We then plug this posterior into the expression of the weighted cross-entropy to get the loss, which we can then optimize as we wish. |
0:31:24 | And finally, once we optimize this loss on some data, we get the w and b that are optimal for that data. |
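A minimal sketch of this calibration step, assuming scipy is available: fit w and b by minimizing the weighted cross-entropy of the affine-transformed scores, here at a chosen effective prior.

```python
import numpy as np
from scipy.optimize import minimize

def train_affine_calibration(tar, non, p_eff=0.5):
    """Linear logistic regression: llr = w * score + b, with (w, b)
    chosen to minimize the weighted cross-entropy on labeled scores."""
    offset = np.log(p_eff / (1.0 - p_eff))  # logit of the effective prior

    def loss(params):
        w, b = params
        t = w * tar + b   # calibrated target LLRs
        i = w * non + b   # calibrated impostor LLRs
        c_tar = np.mean(np.logaddexp(0.0, -(t + offset)))
        c_non = np.mean(np.logaddexp(0.0, i + offset))
        return p_eff * c_tar + (1.0 - p_eff) * c_non

    result = minimize(loss, x0=np.array([1.0, 0.0]), method="Nelder-Mead")
    w, b = result.x
    return w, b

# Apply to new raw scores: calibrated_llrs = w * raw_scores + b
```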
0:31:37 | So, this is an affine transformation, so it does not change the shapes of the distributions at all; it looks like it did nothing. |
0:31:47 | But what it did is shift and shrink the axes, so that the resulting scores are calibrated. |
0:32:01 | In terms of Cllr, you can see that the raw scores, which are these ones, had a very high Cllr, actually higher than one, so it was worse than the dummy system, |
0:32:12 | and after you calibrate them, which only changes the global scale and shift, you get a much better Cllr. |
0:32:20 | This minimum here is basically the very best you can do, so with the affine transformation we are actually doing almost as well as the very best, |
0:32:36 | which means that the affine assumption was, in this case, actually quite good. |
0:32:41 | This is a real case: this is VoxCeleb data processed with a PLDA system. |
0:32:50 | Now, there are many other approaches to do calibration; I am not going to cover them, because it would take another whole keynote. |
0:32:58 | There are non-linear approaches, which in some cases do better than the linear one, when the affine assumption is not good enough. |
0:33:12 | Then there are regularized and Bayesian approaches, which actually do quite well when you have very little data to train the calibration model. |
0:33:19 | And then there are approaches that go all the way to unlabeled data: you have data, but you do not know the labels. And those work surprisingly well. |
0:33:35 | So, if we have a calibrated score, then we know we can treat the scores as log-likelihood ratios, which means we can use them to make optimal decisions, and we can also convert them to posteriors if we wanted to, if we had the prior. |
0:33:53 | A very nice property of the LLR is that if you were to compute the log-likelihood ratio of your already calibrated score, then you would get the same thing: you can treat the score, the LLR, as if it were a feature, and if you recompute this ratio, you get the same value back. |
0:34:16 | This identity gives the LLR some nice properties. For example, for a calibrated score, the two distributions have to cross exactly at zero, |
0:34:28 | because when the LLR is zero, this ratio is one, which means that these two have to be the same; and these two are exactly what we are seeing here, the probability density functions of the score for each of the two classes. They have to cross at zero. |
0:34:46 | And further, if we assume that one of these two distributions is Gaussian, then the other distribution is forced to be Gaussian too, with the same standard deviation and with symmetric means. |
0:34:59 | And this, as I said, is a real example, and it is actually quite close to that assumption, on this VoxCeleb data. |
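For completeness, a short derivation of that constraint (my algebra, not shown in the talk), using the self-consistency property above, LLR(s) = s: if the two class-conditional score densities are Gaussians with symmetric means ±μ and shared variance σ², then

```latex
\log\frac{\mathcal{N}(s;\, \mu, \sigma^2)}{\mathcal{N}(s;\, -\mu, \sigma^2)}
 = \frac{2\mu}{\sigma^2}\, s
 \;\stackrel{!}{=}\; s
 \quad\Longrightarrow\quad
 \sigma^2 = 2\mu,
```

so a calibrated score with Gaussian class-conditional distributions is pinned down, up to μ, exactly as the speaker describes.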
0:35:09 | Okay, so, to recap this part before we move on: what I have been saying is that DET curves, the equal error rate and the minimum DCF measure only discrimination performance. |
0:35:21 | Basically, this means that they ignore the threshold selection; they ignore how to get to the actual decisions from the scores. |
0:35:31 | On the other hand, the weighted cross-entropy, the actual DCF and the APE curves measure total performance, and that includes the threshold, how you make the decisions. |
0:35:45 | And we can further use these metrics to compute the calibration loss, to see whether the system is well calibrated or not. |
0:35:57 | If you find that calibration is actually not good, then fixing these calibration issues is usually easy in ideal conditions: you can train an invertible transformation using, usually, a small representative dev set, |
0:36:13 | which is enough because, in many of the approaches, the number of parameters is very small, so you do not need a lot of data. |
0:36:23 | The key here, though, is that you need a representative dev set, and that is what I am going to discuss in the next slides. |
0:36:33 | So, basically, what we have observed repeatedly is that the calibration of our speaker verification systems is extremely fragile. |
0:36:45 | This is true for our current systems, and it has always been the case since I started working on speaker verification, almost twenty years ago. |
0:36:57 | Anything (language, noise, distortions, duration) affects the calibration parameters, and that means that a model trained on one condition is very unlikely to generalize to another condition. |
0:37:11 | On the other hand, the discrimination performance is usually still reasonable on unseen conditions. |
0:37:17 | So if you train a system on telephone data and you try to use it on microphone data, it may not be the best you can do, but it will still be reasonable. |
0:37:28 | On the other hand, if you train your calibration model on telephone data and try to use it on microphone data, it may perform horribly. |
0:37:37 | And here is one example. |
0:37:40 | I am training the calibration model on two different sets, Speakers in the Wild and the SRE16 dev set, and applying those models on VoxCeleb2. |
0:37:56 | The raw scores are identical in both cases; all I am doing is changing the w and the b based on the calibration set. |
0:38:05 | What we see here is that the model that was trained on Speakers in the Wild is extremely good: it is basically almost perfect, |
0:38:17 | while the model that was trained on SRE16 is quite bad; it is better than the raw scores, but still quite bad compared to the best you can do. |
0:38:27 | And this is not surprising, because VoxCeleb is actually quite close to Speakers in the Wild in terms of conditions, but SRE16 is not. |
0:38:36 | Now, you may think that maybe SRE16 is just a bad set for training calibration, but that is not the case, because if you evaluate on SRE16 evaluation data, then the opposite happens. |
0:38:50 | The calibration model that is good in that case is the one that was trained on the SRE16 dev set: these scores are now much lower in Cllr than the ones calibrated with Speakers in the Wild, |
0:39:06 | and in this case, again, you almost reach the minimum. |
0:39:10 | So basically, this tells us that the conditions on which the calibration model is trained determine where it is going to be good: you have to match the conditions of your evaluation. |
0:39:26 | Now, this goes even deeper: if you zoom into a data set, you can actually find calibration issues within the data set itself. |
0:39:38 | Here I am showing results on the SRE16 evaluation set, where I train the calibration parameters on exactly that same evaluation set, so this is a cheating calibration experiment. |
0:39:50 | I am showing the actual Cllr, which is the solid bar, and the minimum Cllr, which here, for the full set, are the same by construction; and here is the relative difference between those two. |
0:40:04 | So for the full set I have no calibration loss, by construction, as I said. |
0:40:08 | On the other hand, if I start to subset this full set, randomly, or by gender, or by condition, I start to see more calibration loss. |
0:40:21 | The random subset is fine, it is well calibrated, and females and males are reasonably well calibrated. |
0:40:28 | But for these specific conditions, defined by the language, the gender, and whether the two waveforms in the trial come from the same telephone number or not, we start to see calibration losses of up to almost twenty percent in this case. |
0:40:47 | So, if we look at the distributions for the target and impostor trials of the female, same-telephone-number subset, we see that the distributions are shifted to the right. |
0:41:00 | They should be aligned with zero; remember, these are the LLR distributions, and if they were calibrated, they should cross at zero. But they do not. |
0:41:09 | They are shifted to the right, and that is reasonable, because since the telephone number is the same for both sides of the trial, the two sides look much more alike than if the channels were different. |
0:41:27 | So every trial looks more target-like than it should, or than trials do in the overall distribution. |
0:41:36 | The opposite happens for the different-telephone-number scores: they shift to the left. |
0:41:42 | And the final comment here is that this miscalibration within the data set also causes a discrimination problem, because if you pool these trials as they are, miscalibrated, you get poorer discrimination than if you were to first calibrate them per condition and then pool them together. |
0:42:03 | So there is an interplay here between calibration and discrimination, because the miscalibration is happening for different sub-conditions within the set. |
0:42:21 | Okay, so there have been several approaches in the literature, over the last two decades at least, that try to solve this problem of condition-dependent miscalibration, |
0:42:38 | where the assumption of having a global calibration model, with a single w and a single b for all trials, is actually not good. |
0:42:50 | Most of these approaches assume that there is an external class or vector representation, given by the metadata or estimated, that represents the conditions of the enrollment and test samples, |
0:43:07 | and these vectors are fed into the calibration stage and used to condition the parameters of that calibration stage. |
0:43:17 | Here are some approaches, if you are interested in taking a look. |
0:43:24 | Overall, these approaches are quite successful at making the final system better, actually more discriminative, because they align the distributions of the different sub-conditions before pooling them together. |
0:43:40 | And there is another family of approaches, where the condition-awareness is put in the backend itself rather than in the calibration stage. |
0:43:50 | So there is, again, a condition extractor of some kind, which affects the parameters of the backend. |
0:43:58 | The thing is, this approach does not necessarily fix calibration: it improves discrimination in general, but you may still need to do calibration. If the backend is, for example, PLDA, as in these cases, what comes out of it is still miscalibrated, so you still need, perhaps, a normal calibration model on top. |
0:44:23 | And recently we proposed an approach that jointly trains the backend and a condition-dependent calibrator, where the condition is extracted automatically as a function of the embeddings themselves, and the whole thing is trained jointly to optimize weighted cross-entropy. |
0:44:47 | This model actually gives excellent calibration performance across a wide range of conditions; you can find the paper in these proceedings if you are interested. |
0:44:59 | And there is a very related paper, also in these proceedings, by Daniel Garcia-Romero, which I suggest you take a look at if you are interested in these topics. |
0:45:13 | Okay, so, to finish up: I have been talking about two broad application scenarios for speaker verification technology. |
0:45:24 | One of them is where you assume that there is development data available for the evaluation conditions. |
0:45:32 | In that case, as I said, you can either calibrate the system on that data, which is matched, or just find the best threshold on it, so calibration in that scenario is not a big issue. |
0:45:49 | In fact, most speaker verification papers historically operate under this scenario. |
0:45:55 | It is also the scenario of the NIST evaluations, where we usually get development data which is maybe not perfectly matched, but pretty well matched to what we will see in the evaluation. |
0:46:07 | Looking at this conference's proceedings, I found thirty-three speaker recognition papers, of which twenty-eight fall in this category. |
0:46:18 | They mostly report just equal error rate and minimum DCF; some report actual values, some do not. |
0:46:28 | And I think it is fine to report just minimum DCF in those cases, because you are basically assuming that the calibration issue is easy to solve: |
0:46:39 | if you were to have development data, you could train a calibration model and you would get very close to the minimum; the actual performance would get very close to the minimum. |
0:46:54 | Now, there is still a caveat there: you may still have miscalibration problems within sub-conditions, and if you do not report actual DCF or Cllr on sub-conditions, that stays hidden behind the overall performance. |
0:47:11 | The other big scenario is the one where we do not have development data for the evaluation conditions. |
0:47:21 | In that case, we cannot calibrate, or choose a threshold, on matched conditions; we can only hope that our system will work well out of the box. |
0:47:35 | From these proceedings, I found only five papers that operate under this scenario, where they actually test a system that was trained on some conditions on data that is from different conditions, and they do not assume that they have development data for recalibration. |
0:47:58 | So basically, we as a community are very heavily focused on the first scenario, and have always been, historically. |
0:48:09 | And I think this may be why our current speaker verification technology cannot be used out of the box: we are just used to always asking for development data in order to tune at least the calibration stage of our systems. |
0:48:28 | We know the calibration stage has to be tuned; otherwise the system will not work, and it may be worse than a dummy system. |
0:48:36 | So my question is (and maybe we can discuss it in the question and answer session): wouldn't it be worth it, for us as a community, to pay more attention to this scenario with no development data available? |
0:48:53 | I believe that the new end-to-end approaches have the potential to be quite good at generalizing, and this is basically based on the paper that I mentioned, which is not really end-to-end, but almost, and it works surprisingly well in terms of calibration on unseen conditions. |
0:49:18 | So I think it is doable. |
0:49:22 | Maybe, if we worked on that as a community, we could reduce, or even eliminate if we are very optimistic, the performance difference between the two scenarios. |
0:49:30 | So maybe we could end up with systems that are not so dependent on having development data, and perhaps having development data would not help much, I do not know, over the out-of-the-box system. |
0:49:51 | So what would it entail to develop for this no-development-data scenario? |
0:49:56 | First, we have to assume that we will need heterogeneous data for training, of course, because if you train a system on telephone data, it is quite unlikely that it will generalize to other conditions. |
0:50:11 | The second thing is, one has to hold out some sets, at least during development, that are not used for hyperparameter tuning, because otherwise they would not be completely unseen. |
0:50:26 | So these sets have to be really held out until the very end, until you just evaluate the system out of the box, as in this scenario that we are imagining. |
0:50:38 | And of course, you need to report actual metrics and not just minimums, because in this case you cannot assume that you will be able to do calibration well; you need to test whether the model, as it stands, is actually giving you good calibration out of the box. |
0:50:56 | And finally, it is probably a good idea to also report metrics on sub-conditions of the set, because the miscalibration issues within the sub-conditions may be hidden within the pooled distribution of the whole set; they compensate for each other sometimes. |
0:51:15 | By reporting metrics on sub-conditions, both actual and minimum, you can actually tell whether there is a calibration problem. |
0:51:27 | Okay, thank you very much for listening, and I am looking forward to your questions in the next session. |
---|