0:00:16 | Hello.
0:00:17 | I am a researcher at the Computer Science Institute, which is affiliated with CONICET and the University of Buenos Aires.
0:00:26 | The work I am going to talk about today was done in collaboration with Mitchell McLaren from the STAR Lab at SRI International.
0:00:35 | So let me start by describing one of the most standard speaker verification pipelines these days.
0:00:40 | The pipeline is composed of three stages.
0:00:44 | First we have the speaker embedding extractor, which is meant to transform the two input signals in a trial into fixed-length vectors, x1 and x2 here.
0:00:54 | Then we have a stage that does LDA, followed by mean and variance normalization, and then length normalization.
0:01:02 | The resulting vectors x1 and x2 are then processed by the PLDA stage, which computes a score for the trial,
0:01:10 | which can then be thresholded to make the final decision.
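To make the three stages concrete, here is a minimal NumPy sketch of this kind of back-end; the function and parameter names (W, mu, Lam, Gam, c, k) are my own illustrative assumptions, not the implementation from the talk, and variance normalization is omitted for brevity.

```python
import numpy as np

def preprocess(x, W, mu):
    """LDA projection and mean subtraction, followed by length normalization."""
    z = W.T @ (x - mu)
    return z / np.linalg.norm(z)

def plda_llr(z1, z2, Lam, Gam, c, k):
    """PLDA score for one trial: the closed form is a second-order polynomial
    in the two preprocessed embeddings (discussed in the next slides)."""
    return (2 * z1 @ Lam @ z2
            + z1 @ Gam @ z1 + z2 @ Gam @ z2
            + c @ (z1 + z2) + k)

def decide(score, threshold=0.0):
    """Threshold the score to make the final same-speaker decision."""
    return score >= threshold
```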
0:01:14 | The PLDA scores are computed as LLRs, log-likelihood ratios, under a set of Gaussian assumptions.
0:01:23 | The form of the LLR is this: it is the logarithm of the ratio between two probabilities, the probability of the two inputs given that the speakers are the same, and the probability of the inputs given that the speakers are different.
0:01:37 | This LLR, given the Gaussian assumptions in PLDA, can be computed with a closed form, which is a second-order polynomial in x1 and x2; you can find the formula in the paper.
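For reference, the LLR just described can be written as below; under the PLDA Gaussian assumptions it reduces to a second-order polynomial in x1 and x2. The symbols Λ, Γ, c and k are my notation for the matrices and constants derived from the PLDA parameters; their exact definitions are given in the paper.

```latex
\mathrm{LLR}(x_1, x_2)
  = \log \frac{p(x_1, x_2 \mid \text{same speaker})}
              {p(x_1, x_2 \mid \text{different speakers})}
  = 2\, x_1^\top \Lambda\, x_2
    + x_1^\top \Gamma\, x_1
    + x_2^\top \Gamma\, x_2
    + c^\top (x_1 + x_2)
    + k
```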
0:01:52 | So, the problem is that, in most cases, what comes out of PLDA are scores that are badly miscalibrated. This means that even though we computed them as LLRs, they really are not LLRs,
0:02:04 | and the cause for this mismatch is that the assumptions made by PLDA do not really match the real data.
0:02:18 | Miscalibrated scores have the problem that they have no probabilistic interpretation. In consequence, we cannot use their absolute values; we can only use them relative to each other.
0:02:32 | So we could rank trials, but we cannot interpret the scores themselves.
0:02:37 | Let's say, for example, that you get a score of minus one from a certain system for a certain trial.
0:02:44 | You would only be able to tell what this minus one means after you have seen a distribution of scores from some development data that has gone through the system.
0:02:55 | Once you see this distribution, then you can interpret this minus one properly, and you could actually threshold the score and decide whether the trial is a target trial.
0:03:10 | Okay, so we would like scores to be well calibrated, because then they have the nice property that they are LLRs, so that we can interpret their values,
0:03:20 | and we can also use Bayes rule to make a decision on the threshold without having to see any development data.
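Since well-calibrated LLRs can be thresholded with Bayes rule alone, the decision threshold follows directly from the target prior and the error costs, with no development data; a minimal sketch (function and argument names are mine):

```python
import math

def bayes_threshold(p_target, c_miss=1.0, c_fa=1.0):
    """Threshold on an LLR score that minimizes the expected cost for the given
    target prior and error costs: accept the same-speaker hypothesis when
    llr >= log(c_fa * (1 - p_target)) - log(c_miss * p_target)."""
    return math.log(c_fa * (1.0 - p_target)) - math.log(c_miss * p_target)

# Example: equal costs and a 1% target prior give a threshold of about 4.6.
print(bayes_threshold(0.01))
```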
0:03:31 | Calibration is generally done with an affine transformation that is trained using logistic regression. So let's say you know that your scores are miscalibrated.
0:03:42 | Then what you do is train alpha and beta, the two parameters of the affine transformation, so that they minimize the cross-entropy, which is the logistic regression objective function,
0:03:55 | and then you get properly calibrated LLRs at the output.
0:04:02 | Basically, what this means is that we take the pipeline we had and just add one more stage, the global calibration.
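As a reference point, a global calibration of this kind can be fit in a few lines; the sketch below trains alpha and beta with a plain binary cross-entropy in PyTorch, which is logistic regression. Names are mine, and note that the objective described in the talk is the prior-weighted cross-entropy; the unweighted version is used here only for brevity.

```python
import torch

def train_global_calibration(scores, labels, steps=2000, lr=0.01):
    """Fit llr = alpha * score + beta with logistic regression.

    labels: 1 for target (same-speaker) trials, 0 for impostor trials."""
    s = torch.as_tensor(scores, dtype=torch.float32)
    y = torch.as_tensor(labels, dtype=torch.float32)
    alpha = torch.ones(1, requires_grad=True)
    beta = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([alpha, beta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.binary_cross_entropy_with_logits(alpha * s + beta, y)
        loss.backward()
        opt.step()
    return alpha.item(), beta.item()
```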
0:04:12 | Now, the problem is that this does not really solve the issue in general: with this global calibration we are only solving it for the exact set on which we trained the calibration parameters.
0:04:27 | If the calibration set does not match our test set, then we will still have a calibration problem.
0:04:36 | These results illustrate this. Do not worry about what the sets are for now; I will explain them later.
0:04:44 | What is important is that I am showing three different PLDA systems that are identical up to the calibration stage; the only difference is which training data was used to train the calibration parameters. Those are the red bars.
0:05:08 | What matters here is to compare the height of each bar, which is the actual Cllr for that system, with the black line, which is the minimum Cllr for that system.
0:05:19 | If the difference between the two is small, it means the system is well calibrated; if it is big, it means it is not well calibrated.
0:05:28 | What we see here is that the performance, the actual Cllr, is very sensitive to which set was used to train the calibration model.
0:05:39 | So, for example, VoxCeleb plus Switchboard, which is mostly VoxCeleb in this case, matches the Speakers in the Wild dataset very well,
0:05:52 | so it gives very good calibration there, but horrible calibration for SRE.
0:05:56 | Similarly, the RATS data matches LASRS very well, but is not so good for SRE16.
0:06:04 | So basically this means we cannot get a single global calibration model that will work well across the board.
0:06:14 | Alright, so the goal of this work is to develop a system that does not require this recalibration for every new condition. It is quite an ambitious goal:
0:06:24 | we basically want a speaker verification system that can be used out of the box, without having to collect a development dataset.
0:06:36 | Okay, so going back to the pipeline: the standard approach for the pipeline I showed is to train each of the stages separately. Once the previous stage is trained, the data that comes out of that stage is used to train the next stage,
0:06:54 | and the stages have different objectives. The first one, the speaker embedding extractor, is trained with a speaker classification objective;
0:07:02 | the LDA and the PLDA are trained to maximize the likelihood;
0:07:07 | and finally the calibration stage is trained to optimize a binary cross-entropy, which is a speaker verification objective.
0:07:21 | Now, one simple thing we can do is just integrate the three stages into a single model. We may think this is a solution to the calibration problem, and that it may actually solve our initial issue of needing recalibration across conditions.
0:07:38 | What we do is basically keep the same exact functional form as in the standard pipeline,
0:07:45 | but instead of training the stages separately with different objectives, we train them jointly using stochastic gradient descent.
0:07:54 | For this, of course, we need mini-batches that are made of trials rather than samples.
0:08:02 | What we do is randomly select speakers and, for each speaker, select two samples;
0:08:11 | then, from that list of samples, we create all the possible trials, all the pairs across those samples.
0:08:24 | With that we can compute the binary cross-entropy, and that is what we optimize.
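A sketch of how such mini-batches of trials can be formed and scored, following the sampling scheme just described (random speakers, two samples each, all pairs); the data layout, tensor shapes and the model interface are assumptions on my part.

```python
import itertools
import random
import torch

def make_trial_batch(embeddings_by_speaker, num_speakers):
    """Sample speakers, take two embeddings each, and build all pairwise trials."""
    speakers = random.sample(list(embeddings_by_speaker), num_speakers)
    samples, labels = [], []
    for spk in speakers:
        for emb in random.sample(embeddings_by_speaker[spk], 2):
            samples.append(torch.as_tensor(emb, dtype=torch.float32))
            labels.append(spk)
    idx1, idx2, targets = [], [], []
    for i, j in itertools.combinations(range(len(samples)), 2):
        idx1.append(i)
        idx2.append(j)
        targets.append(float(labels[i] == labels[j]))  # 1.0 for same-speaker trials
    x = torch.stack(samples)
    return x[idx1], x[idx2], torch.tensor(targets)

def batch_loss(model, x1, x2, targets):
    """Binary cross-entropy over all trials in the batch; model(x1, x2) returns LLR-like scores."""
    scores = model(x1, x2)
    return torch.nn.functional.binary_cross_entropy_with_logits(scores, targets)
```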
0:08:31 | This is not the first time that something like this has been proposed, of course. Some time ago, Burget and colleagues proposed something very similar;
0:08:43 | at the time they actually trained the back-end with an SVM or with linear logistic regression instead of stochastic gradient descent, but the concept is the same.
0:08:55 | More recently, there have been a few papers on end-to-end speaker verification that use some flavor of this idea, where the back-end, which usually has a form very similar to this one, is again trained discriminatively.
0:09:13 | You can find the references to these papers in our paper.
0:09:21 | These papers usually report improved discrimination performance, but they do not usually report calibration performance, which is what we care about in this work.
0:09:32 | And what we actually found in our previous paper is that this approach of just training the PLDA back-end discriminatively is not sufficient to get good calibration across conditions.
0:09:45 | We know that from our previous papers, so this architecture, trained jointly, is not enough.
0:09:57 | So what is the problem with this basic form?
0:10:02 | As we showed before, the calibration stage is still global, the same as in the standard pipeline,
0:10:11 | and it seems that this does not give the model enough flexibility to adapt to the different conditions in the data.
0:10:18 | Even if you train the model with a lot of different conditions, it will just adapt to the majority condition.
0:10:27 | So what we propose to do is to add a branch to this model.
0:10:32 | We keep the speaker verification branch the same,
0:10:36 | and we add a branch that is in charge of computing the calibration parameters as a function of both input vectors, x1 and x2.
0:10:46 | The form of this branch starts out the same as the top one: it is an affine transformation followed by length normalization; of course, the parameters of this affine transformation are different from those of the top branch.
0:11:00 | Then we do dimensionality reduction; we go to a very low dimension, in the paper we use dimension five, to compute what we call the side-information vectors.
0:11:13 | We then use these vectors to compute an alpha and a beta, using a very simple form which is similar to the PLDA form here.
0:11:26 | So in the end we have two branches: one is in charge of computing the score, and the other is in charge of computing the calibration parameters for each trial.
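A rough sketch of this two-branch back-end in PyTorch: the speaker branch follows the PLDA-style score from before, while the mapping from the two side-information vectors to alpha and beta is deliberately simplified here (the talk describes a form similar to the PLDA score), so the layer choices and names are assumptions on my part.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionAwareBackend(nn.Module):
    def __init__(self, emb_dim, lda_dim, side_dim=5):
        super().__init__()
        # Speaker branch: affine (LDA-like) stage followed by a PLDA-form scorer.
        self.spk_affine = nn.Linear(emb_dim, lda_dim)
        self.Lam = nn.Parameter(torch.eye(lda_dim) * 0.01)
        self.Gam = nn.Parameter(torch.zeros(lda_dim, lda_dim))
        self.c = nn.Parameter(torch.zeros(lda_dim))
        self.k = nn.Parameter(torch.zeros(1))
        # Side-information branch: affine stage, then reduction to a small dimension.
        self.side_affine = nn.Linear(emb_dim, lda_dim)
        self.side_reduce = nn.Linear(lda_dim, side_dim)
        # Map the pair of side-info vectors to calibration parameters alpha and beta.
        self.cal_alpha = nn.Bilinear(side_dim, side_dim, 1)
        self.cal_beta = nn.Bilinear(side_dim, side_dim, 1)

    def raw_score(self, z1, z2):
        """PLDA-style second-order polynomial score."""
        return (2 * (z1 * (z2 @ self.Lam.T)).sum(-1)
                + (z1 * (z1 @ self.Gam.T)).sum(-1)
                + (z2 * (z2 @ self.Gam.T)).sum(-1)
                + (z1 + z2) @ self.c + self.k)

    def forward(self, x1, x2):
        # Speaker branch: affine + length normalization, then PLDA-form score.
        z1 = F.normalize(self.spk_affine(x1), dim=-1)
        z2 = F.normalize(self.spk_affine(x2), dim=-1)
        s = self.raw_score(z1, z2)
        # Side-info branch: affine + length normalization, then dimensionality reduction.
        q1 = self.side_reduce(F.normalize(self.side_affine(x1), dim=-1))
        q2 = self.side_reduce(F.normalize(self.side_affine(x2), dim=-1))
        # Trial-dependent calibration parameters, symmetrized over the two sides.
        alpha = 0.5 * (self.cal_alpha(q1, q2) + self.cal_alpha(q2, q1)).squeeze(-1)
        beta = 0.5 * (self.cal_beta(q1, q2) + self.cal_beta(q2, q1)).squeeze(-1)
        return alpha * s + beta
```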
0:11:40 | I will show the results now, but first let me talk about the data.
0:11:46 | We have a whole lot of training data: we used VoxCeleb 1 and 2, SRE data, that is, Speaker Recognition Evaluation data from 2005 to 2012, plus Mixer 6 and Switchboard.
0:12:04 | All of that is actually shared with the embedding extractor training data; we just use half of what was used for embedding extractor training, simply to speed up the experimentation.
0:12:14 | Then we have two more sets: RATS source data, which is telephone data in several non-English languages,
0:12:21 | and FVC Australian, which is forensic voice comparison data, a very clean data set recorded with studio microphones in Australian English.
0:12:34 | For testing we use SRE16, SRE18, Speakers in the Wild,
0:12:41 | LASRS, which is a bilingual set recorded over several different microphones,
0:12:48 | and the Chinese version of the forensic voice comparison data; the recording conditions of these last two sets are very similar, but the language is different.
0:12:59 | As development sets, we use the dev parts of three of the test sets, SRE16, SRE18 and Speakers in the Wild; with those we do all the parameter tuning, we choose the best iteration for each of the models, things like that.
0:13:18 | Okay, so here are the results.
0:13:22 | The first bars are the same ones as in the previous figure I showed, and I have added the blue bar, which is the system we propose.
0:13:34 | As you can see, in most cases it is as good as, or better than, the best global calibration model.
0:13:44 | So we basically achieved what we wanted, which is to have a single model that adapts to the test conditions without us telling it what the test conditions are.
0:13:56 | The only exception is the FVC CMN case, which is not well calibrated at all;
0:14:03 | in fact, there is one global PLDA model that is better than the one we propose. It is still bad, but it is better than ours.
0:14:14 | The problem with that set is basically that it is a condition that is not seen, in combination, during training:
0:14:24 | we have clean data in training, but it is not in Chinese, and we have Chinese data in training, but it is not clean.
0:14:32 | So the model does not seem to be able to learn how to properly calibrate that data, unfortunately.
0:14:39 | This just means there is more work to be done; we have not really achieved the ambitious goal I mentioned before, which was to have a completely general, out-of-the-box system.
0:14:54 | Okay, so before finishing I would like to describe a few details of how this model is trained, because they are essential to get good performance.
0:15:04 | One important thing is to do a non-random initialization.
0:15:10 | What we do, and many of the end-to-end training papers do similar things, is initialize the speaker branch with the parameters of a standard PLDA baseline. That is the first thing.
0:15:25 | Then, for the side-information branch, we initialize the first stage with the bottom components of the same LDA transform that we trained for the speaker branch.
0:15:41 | That means that what comes out of here is basically the worst you could do for speaker identification, which should be close to the best you can do for condition identification;
0:15:51 | we are trying to extract the condition information from the input.
0:15:56 | Then this matrix here, which does not have any reasonable a priori value, we just initialize randomly anyway.
0:16:05 | And these last two components here we initialize so that what comes out of them are the global calibration parameters at the first iteration of training.
0:16:16 | So basically, at initialization, the scores that come out of the model are the same as those that would come out of a standard, calibrated PLDA pipeline.
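A sketch of this initialization, assuming the ConditionAwareBackend class from the earlier snippet, a baseline LDA transform whose columns are ordered from most to least speaker-discriminative, baseline PLDA parameters, and a global calibration; all names and the column-ordering convention are my assumptions.

```python
import torch

def initialize_from_baseline(model, lda_W, lda_mu, plda_params, global_alpha, global_beta):
    """Non-random initialization of the joint model from a standard LDA/PLDA baseline.

    lda_W: full LDA transform (emb_dim x emb_dim), columns sorted from most to least
    speaker-discriminative; plda_params: (Lam, Gam, c, k) from the baseline PLDA;
    global_alpha, global_beta: parameters of a global calibration trained on the baseline."""
    lda_dim = model.spk_affine.out_features
    with torch.no_grad():
        # Speaker branch: top LDA directions plus the baseline PLDA parameters.
        model.spk_affine.weight.copy_(lda_W[:, :lda_dim].T)
        model.spk_affine.bias.copy_(-lda_W[:, :lda_dim].T @ lda_mu)
        Lam, Gam, c, k = plda_params
        model.Lam.copy_(Lam); model.Gam.copy_(Gam); model.c.copy_(c); model.k.copy_(k)
        # Side-info branch first stage: bottom LDA directions, i.e. the directions that
        # are worst for speaker discrimination and hopefully best for condition information.
        model.side_affine.weight.copy_(lda_W[:, -lda_dim:].T)
        model.side_affine.bias.zero_()
        # The reduction matrix has no obvious prior value, so it stays randomly initialized.
        # Last stage: start from the global calibration, so that at iteration zero the model
        # outputs the same scores as the standard calibrated PLDA pipeline.
        for layer, value in ((model.cal_alpha, global_alpha), (model.cal_beta, global_beta)):
            layer.weight.zero_()
            layer.bias.fill_(value)
    return model
```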
0:16:28 | Here are the results comparing three different initialization approaches:
0:16:34 | random;
0:16:37 | then a partial one, which means what I described before but without initializing this stage with the bottom LDA components, which is left random;
0:16:49 | and then the blue one, which is the complete initialization.
0:16:52 | The blue one is the best of the three, so it is worth the trouble to take the time to find good initial parameters in this smart way.
0:17:06 | Another important thing is that we train the model in two stages.
0:17:10 | The first stage uses all the training data to train all the parameters,
0:17:16 | and then, in the second stage, we freeze the LDA and PLDA blocks and train only the rest of the parameters, using domain-balanced data.
0:17:26 | This is important because, if the data is not balanced, most of the trials in any given mini-batch would come from one domain,
0:17:33 | and then we would just be optimizing for that domain only, the one that has more samples.
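A simple way to obtain such domain-balanced mini-batches is to draw the same number of speakers from every domain before forming the trials, as in this sketch, which reuses make_trial_batch from the earlier snippet; the data layout is an assumption.

```python
import torch

def make_balanced_trial_batch(embeddings_by_domain, speakers_per_domain):
    """One mini-batch of trials with the same number of speakers drawn from every domain.

    embeddings_by_domain maps domain -> {speaker -> list of embeddings}. Without this
    balancing, most trials in a batch would come from the largest training domain."""
    x1, x2, t = [], [], []
    for spk_dict in embeddings_by_domain.values():
        a, b, y = make_trial_batch(spk_dict, speakers_per_domain)
        x1.append(a)
        x2.append(b)
        t.append(y)
    return torch.cat(x1), torch.cat(x2), torch.cat(t)
```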
0:17:42 | Finally, the convergence of the model is kind of a big issue: the validation performance jumps around a lot from batch to batch,
0:17:52 | so if you look at the optimization curve, it can change significantly from one step to the next.
0:18:00 | What we do is basically choose the best iteration using the validation sets that I mentioned before,
0:18:06 | and the good thing is that this approach seems to generalize well to other sets, even to sets that are not very well matched to the validation sets.
0:18:15 | We tried a bunch of tricks to smooth out the validation performance, like regularization and slower learning rates,
0:18:27 | but they actually make the minimum worse, so we keep the wild validation curves and just choose the minimum.
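In other words, the model is simply kept at the iteration with the lowest validation cost rather than smoothing the curve; a minimal sketch, assuming train_one_step and validation_cost are provided elsewhere (their names are mine).

```python
import copy

def train_with_best_iteration(model, train_one_step, validation_cost, dev_sets, num_iterations):
    """Train for a fixed number of iterations and keep the parameters from the iteration
    with the lowest average validation cost (e.g. actual Cllr over the development sets),
    instead of trying to smooth out the noisy validation curve."""
    best_cost = float("inf")
    best_state = copy.deepcopy(model.state_dict())
    for _ in range(num_iterations):
        train_one_step(model)
        cost = sum(validation_cost(model, d) for d in dev_sets) / len(dev_sets)
        if cost < best_cost:
            best_cost = cost
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model, best_cost
```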
0:18:38 | Finally, there is a GitHub repository with exactly this model implemented, for training and for evaluation; you just need to have pre-computed embeddings,
0:18:52 | and there is an example with embeddings that we provide.
0:18:56 | Feel free to use it or modify it, and let me know if you find bugs; I will be happy to respond to questions and comments.
0:19:05 | Okay, so, in conclusion, we developed a model that achieves excellent performance across a wide variety of conditions.
0:19:13 | It integrates the different stages of a speaker verification back-end into one stage and trains the whole thing jointly.
0:19:21 | It also integrates an automatic extractor of side-information, which it then uses to condition the calibration parameters,
0:19:28 | and this achieves our goal of getting good performance across different conditions.
0:19:36 | Of course, there are many open issues, for example the training convergence: I do not think we are done with that, and I would like to have a model that is easier to optimize.
0:19:49 | And of course we would like to plug this model in together with the embedding extractor and train everything end to end.
0:19:56 | Okay, thank you very much.
0:19:59 | If you have any questions, please reach out to me or post them on the conference platform and we can discuss in more detail.
0:20:05 | Thank you.