0:00:15 | This will be a brief talk about the topic. |
---|
0:00:18 | We are working with probabilistic linear discriminant analysis (PLDA), which has previously been improved by discriminative training. |
---|
0:00:30 | Previous studies used loss functions that essentially focus on a very broad range of applications, so in this work we try to train the PLDA in a way that makes it more suitable for a narrow range of applications. |
---|
0:00:47 | We observe a small improvement in the minimum detection cost by doing so. |
---|
0:00:55 | so as a background |
---|
0:00:57 | s when we use the speaker verification system we would like to minimize the expected |
---|
0:01:02 | cost |
---|
0:01:04 | from our decision |
---|
0:01:06 | and that this is a very much reflected in the detection cost of the of |
---|
0:01:10 | then use |
---|
0:01:12 | so we have at the cost for false rejection and false alarm and also a |
---|
0:01:16 | prior which we can say together constitutes the operating point of our system |
---|
0:01:22 | and which of course depend on the application |
---|
0:01:25 | So the goal here is to build an application-specific system that is optimal for one or several operating points, rather than one that performs more or less the same over a broad range of them. |
---|
0:01:41 | A similar idea has already been explored for score calibration, in the Interspeech paper mentioned here. |
---|
0:01:51 | However, with score calibration we can reduce the gap between the actual detection cost and the minimum detection cost, but we cannot reduce the minimum detection cost itself. |
---|
0:02:05 | By applying this idea at an earlier stage of the speaker verification system, we could hope to reduce also the minimum detection cost. |
---|
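A minimal sketch of the detection cost metric being discussed, assuming log-likelihood-ratio scores, a boolean target/non-target label array, and an effective prior `p_eff` summarizing the operating point (the exact NIST normalization may differ):

```python
import numpy as np

def detection_cost(scores, labels, p_eff, threshold=None):
    """Prior-weighted error rate at one operating point.
    With the Bayes threshold -log(p_eff/(1-p_eff)) this is the actual cost;
    sweeping the threshold and taking the minimum gives the minimum cost."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, bool)          # True = target trial
    if threshold is None:
        threshold = -np.log(p_eff / (1.0 - p_eff))
    p_miss = np.mean(scores[labels] < threshold)
    p_fa = np.mean(scores[~labels] >= threshold)
    return p_eff * p_miss + (1.0 - p_eff) * p_fa

def min_detection_cost(scores, labels, p_eff):
    """Minimum detection cost: best achievable cost over all score thresholds."""
    candidates = np.concatenate([np.unique(scores), [np.inf]])
    return min(detection_cost(scores, labels, p_eff, t) for t in candidates)
```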
0:02:14 | So we will apply it to discriminative PLDA training. |
---|
0:02:25 | We use a method that has previously been developed for discriminative PLDA training. |
---|
0:02:33 | The key point is that the log-likelihood ratio score of the PLDA model is given by the form shown here, and we can apply a discriminative training criterion to its parameters, which act on an expansion of the i-vector pair. |
---|
0:03:00 | In essence, we take all possible pairs of i-vectors in the training database and minimize some loss function over all pairs, with some weights, and we also apply a regularization term. |
---|
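A rough sketch of the kind of objective just described, under the common parameterization where the PLDA log-likelihood ratio is a quadratic function of the two i-vectors; the parameter names (`Lambda`, `Gamma`, `c`, `k`), the per-class weights and the L2 pull towards the ML solution are illustrative assumptions, not necessarily the paper's exact formulation:

```python
import numpy as np

def plda_llr(phi1, phi2, Lambda, Gamma, c, k):
    """PLDA verification score as a quadratic function of an i-vector pair."""
    return (phi1 @ Lambda @ phi2
            + phi1 @ Gamma @ phi1 + phi2 @ Gamma @ phi2
            + c @ (phi1 + phi2) + k)

def pairwise_objective(params, params_ml, ivectors, spk_ids, loss_fn, beta, reg):
    """Weighted loss over all i-vector pairs plus regularization towards the
    ML-trained parameters, as outlined in the talk.
    beta: dict mapping True/False (target/non-target trial) to a trial weight."""
    Lambda, Gamma, c, k = params
    total = 0.0
    n = len(ivectors)
    for i in range(n):
        for j in range(i + 1, n):
            s = plda_llr(ivectors[i], ivectors[j], Lambda, Gamma, c, k)
            is_target = spk_ids[i] == spk_ids[j]
            total += beta[is_target] * loss_fn(s, is_target)
    # simple L2 penalty pulling the parameters towards the ML solution
    penalty = sum(np.sum((p - p0) ** 2) for p, p0 in zip(params, params_ml))
    return total + reg * penalty
```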
0:03:23 | So when we consider how to target the system towards certain operating points, we need to consider two things. |
---|
0:03:42 | First, the weights beta, which depend on the trial; in essence, they will be different for target and non-target trials. |
---|
0:03:52 | Second, the loss function. Put simply, the weights depend on which operating point we are targeting, whereas the choice of loss function decides how much emphasis we put on the surrounding operating points. |
---|
0:04:17 | A few brief words about the choice of the weights beta. |
---|
0:04:24 | As many of you probably know, an application given by these three parameters, the probability of a target trial and the two costs, can be rewritten as an equivalent application with an effective prior, whose cost in training or evaluation is proportional to that of the original application. |
---|
0:04:46 | So we can just as well consider this equivalent application and minimize its cost; in that way we make sure that our system will be suitable for the operating point we are looking at. |
---|
0:05:08 | Essentially, we need to rescale every trial so that the effective proportion of target trials in the training database matches the one we consider for evaluation. |
---|
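A small sketch of the standard effective-prior collapse and one way to set the trial weights so that the training data mimics the targeted operating point; the exact normalization used in the paper may differ:

```python
import numpy as np

def effective_prior(p_target, c_miss, c_fa):
    """Collapse (P_target, C_miss, C_fa) into a single effective prior with unit costs."""
    return p_target * c_miss / (p_target * c_miss + (1.0 - p_target) * c_fa)

def trial_weights(is_target, p_eff):
    """Per-trial weights beta: rescale target and non-target trials so their
    effective proportions match the effective prior of the application."""
    is_target = np.asarray(is_target, bool)
    n_tar, n_non = is_target.sum(), (~is_target).sum()
    return np.where(is_target, p_eff / n_tar, (1.0 - p_eff) / n_non)
```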
0:05:38 | Regarding the choice of loss function: previous studies on discriminative PLDA training used the logistic regression loss or the SVM hinge loss. |
---|
0:05:49 | The logistic regression loss, which is essentially the same as the Cllr metric, is, like the EER, an application-independent evaluation metric, so it could be suitable as a loss function if we want to target a very broad range of applications. |
---|
0:06:07 | What we want to examine here is whether, by targeting a narrower range of operating points, we can obtain better performance for those operating points. |
---|
0:06:22 | The loss that corresponds exactly to a single detection cost is the zero-one loss, and we will also consider a loss function that is a little broader than the zero-one loss but narrower than the logistic regression loss, namely the Brier loss. |
---|
0:06:44 | To explain why this is the case, I can refer to the Interspeech paper, which is very interesting. |
---|
0:07:01 | Here I am showing a picture of how these different loss functions look. |
---|
0:07:06 | The blue one is the logistic regression loss, which is convex, but a drawback is that it can be sensitive to outliers. |
---|
0:07:22 | By the way, this is how the losses look for a target trial: for the zero-one loss we have some cost here, and once the score passes the threshold, which is here, there is no cost. |
---|
0:07:38 | The logistic regression loss, on the other hand, can become very large for a few unusual data points in our database, so our system may become too strongly adjusted to them. |
---|
0:07:57 | We also have the Brier loss and the zero-one loss here, as I said, together with a couple of approximations that we will use later. |
---|
0:08:07 | For optimization we use a sigmoid approximation of the zero-one loss, which includes a parameter alpha that makes it more and more similar to the zero-one loss as alpha increases; it is shown here for alpha equal to one, ten and one hundred. |
---|
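A minimal sketch of the loss functions being compared, written as functions of the score relative to the decision threshold; the exact scaling and sign conventions in the paper may differ:

```python
import numpy as np

def logistic_loss(s, is_target, threshold=0.0):
    """Logistic regression (Cllr-style) loss: convex, but unbounded for badly scored trials."""
    y = 1.0 if is_target else -1.0
    return np.log1p(np.exp(-y * (s - threshold)))

def zero_one_loss(s, is_target, threshold=0.0):
    """0/1 loss: a unit cost exactly when the decision at the threshold is wrong."""
    return float(s < threshold) if is_target else float(s >= threshold)

def brier_loss(s, is_target, threshold=0.0):
    """Brier (squared-error) loss on the sigmoid-mapped score: between the
    logistic and 0/1 losses in how sharply it concentrates on the threshold."""
    p = 1.0 / (1.0 + np.exp(-(s - threshold)))
    return (p - (1.0 if is_target else 0.0)) ** 2

def sigmoid_loss(s, is_target, threshold=0.0, alpha=10.0):
    """Differentiable surrogate for the 0/1 loss; approaches it as alpha grows (e.g. 1, 10, 100)."""
    y = 1.0 if is_target else -1.0
    return 1.0 / (1.0 + np.exp(alpha * y * (s - threshold)))
```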
0:08:37 | There are a couple of problems, though. The true zero-one loss is not differentiable, which is why we use the sigmoid function instead. |
---|
0:08:50 | Also, the Brier loss and the sigmoid loss are non-convex, so we use an approach where we gradually increase the non-convexity. |
---|
0:09:02 | For the sigmoid loss this means we start from the logistic regression model (we also tried starting from the ML model, but starting from logistic regression works better) and then increase alpha gradually; there are other papers doing this for other applications. |
---|
0:09:20 | We do something similar for the Brier loss, where we start from the logistic regression model and then train with the Brier loss, followed finally by the sigmoid approximation of the zero-one loss. |
---|
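A sketch of the annealing strategy just described, assuming a hypothetical `objective(w, alpha)` that evaluates the weighted sigmoid-loss criterion (including regularization) for a flattened parameter vector `w`; the speaker mentions later that BFGS was used for the optimization:

```python
import numpy as np
from scipy.optimize import minimize

def train_with_annealing(objective, w_logreg, alphas=(1.0, 10.0, 100.0)):
    """Gradually sharpen the sigmoid approximation of the 0/1 loss, warm-starting
    each stage from the previous solution (the first stage starts from the
    logistic-regression-trained parameters, which worked better than the ML start)."""
    w = np.asarray(w_logreg, dtype=float)
    for alpha in alphas:
        result = minimize(lambda v: objective(v, alpha), w, method="BFGS",
                          options={"maxiter": 200})
        w = result.x          # warm start for the next, sharper stage
    return w
```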
0:09:39 | Regarding the experiments, we used the male telephone trials and a couple of different databases: SRE'06 as development set, SRE'08 for testing, and the usual standard datasets for PLDA training. |
---|
0:10:09 | Here you see the number of i-vectors and speakers with and without SRE'06 included, because we first used it as a development set but sometimes included it in the training set, after the parameters had been decided, to get slightly better performance. |
---|
0:10:26 | We conducted four experiments. I should also say that we target the operating point mentioned here, which has been the standard operating point in several NIST evaluations. |
---|
0:10:42 | The first experiment just considers a couple of different normalization and regularization techniques, because we were not sure which is best, although this is not really related to the topic of this paper. |
---|
0:10:56 | In the second experiment we compare the different loss functions discussed above; then we analyze the effect of calibration, and finally we try to investigate a little whether the choice of beta according to the formula shown before is actually suitable or not. |
---|
0:11:18 | For regularization there are two options that are popular in this kind of work: regularization towards zero, or regularization towards the ML model, by which I mean the normal generatively trained PLDA (logistic regression is, in that sense, also a maximum likelihood approach). |
---|
0:11:48 | We also compare whitening with the within-class covariance to whitening with the total covariance. |
---|
0:11:58 | We found that, in terms of minDCF and EER, whitening with the total covariance and regularization towards the ML model lead to better performance, so we use those in the remaining experiments. |
---|
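A brief sketch of the two whitening options compared in the first experiment (total covariance versus pooled within-class covariance); the eigenvalue flooring is an assumption added for numerical safety:

```python
import numpy as np

def whitening_transform(X, labels=None, use_total_covariance=True):
    """Return (mean, W) such that W @ (x - mean) has roughly identity covariance,
    using either the total or the within-class covariance of the i-vectors."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    if use_total_covariance or labels is None:
        centered = X - mean
    else:
        # pooled within-class covariance: remove each speaker's own mean first
        labels = np.asarray(labels)
        centered = np.vstack([X[labels == s] - X[labels == s].mean(axis=0)
                              for s in np.unique(labels)])
    C = centered.T @ centered / len(centered)
    evals, evecs = np.linalg.eigh(C)
    W = evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-10))) @ evecs.T
    return mean, W
```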
0:12:19 | Comparing the loss functions: first we should say that the discriminative training schemes give a better actual detection cost than standard maximum likelihood training, but that is to be expected, because they do calibration at the same time, although not a great calibration, as we will discuss later. |
---|
0:12:56 | After calibration, however, the ML model is very competitive. |
---|
0:13:04 | Still, we can see some improvement in minimum detection cost and EER from the application-specific loss functions compared to logistic regression on one of the sets, whereas for the other SRE set there is no such gain. |
---|
0:13:29 | The standard maximum likelihood model has somewhat worse calibration, but since we can fix that by applying calibration, for a fair comparison we also consider that here. |
---|
0:13:45 | To do this we need to hold out a portion of the training data; here you see that we tried using fifty, seventy-five, ninety or ninety-five percent of the training data for PLDA training and the rest for calibration. |
---|
0:14:01 | For calibration we use the Cllr loss, which is essentially the same as logistic regression, weighted for the operating point we are targeting, and in these experiments SRE'06 is not included. |
---|
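A rough sketch of the held-out calibration step: an affine map `a*s + b` fitted on the reserved portion of the training trials by minimizing a prior-weighted logistic (Cllr-like) loss at the targeted operating point; the exact weighting in the paper may differ:

```python
import numpy as np
from scipy.optimize import minimize

def train_calibration(scores, is_target, p_eff):
    """Fit scores -> a*scores + b so that the calibrated scores behave like
    log-likelihood ratios at the effective prior p_eff."""
    scores = np.asarray(scores, float)
    is_target = np.asarray(is_target, bool)
    tar, non = scores[is_target], scores[~is_target]
    logit_p = np.log(p_eff / (1.0 - p_eff))

    def objective(ab):
        a, b = ab
        c_tar = np.mean(np.log1p(np.exp(-(a * tar + b + logit_p))))   # cost on misses
        c_non = np.mean(np.log1p(np.exp(a * non + b + logit_p)))      # cost on false alarms
        return p_eff * c_tar + (1.0 - p_eff) * c_non

    a, b = minimize(objective, x0=np.array([1.0, 0.0]), method="BFGS").x
    return a, b
```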
0:14:15 | The results look like this. The first thing to note is that applying calibration to the ML model gives better results than discriminative training without calibration. |
---|
0:14:29 | The second thing is that the discriminative training here also benefited from calibration, which must be explained by the fact that we use regularization. |
---|
0:14:52 | Overall we can also say that using seventy-five percent of the data for discriminative training and the rest for calibration was optimal. |
---|
0:15:09 | We also noticed that logistic regression performs quite badly with a very small amount of training data, i.e. when using only fifty percent of it, whereas the Brier loss and the zero-one loss perform better. |
---|
0:15:28 | This is probably because those two loss functions, if I go back to this figure, for example the zero-one loss, do not make much use of a data point with a score like this, whereas the logistic regression loss would; so the logistic regression loss effectively relies on more of the data. |
---|
0:16:05 | So what I think happens here is that, since we do regularization towards the ML model, the zero-one loss does not change the model very much, and we stay close to the ML model when the training data is limited. |
---|
0:16:26 | Also, this choice of beta is optimal only under the assumption that the trials in the database are independent, which is not really the case, since we have made up the training data by pairing all the i-vectors. |
---|
0:16:52 | It also assumes that the training database and the evaluation database have fairly similar properties, which is probably also not really the case, so the optimal beta could be different from the one given by the formula. |
---|
0:17:12 | This may look a bit strange, but basically we want to check a couple of different values for beta, or equivalently for the effective prior, so we introduce a parameter gamma. |
---|
0:17:33 | When gamma is 0.5 we use the standard effective prior calculated according to the formula; when gamma is one we use an effective prior of one, which means all weight goes to the target trials; and when gamma is zero all weight goes to the non-target trials. |
---|
0:17:58 | We use the Brier loss in this experiment, and SRE'06 is not included. |
---|
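A tiny sketch of one way to read the gamma parameterization described above; the piecewise-linear interpolation between the three anchor points is an assumption, not necessarily the paper's exact formula:

```python
def effective_prior_gamma(p_eff, gamma):
    """Interpolated effective prior: gamma = 0.5 gives the standard effective prior,
    gamma = 1 puts all weight on target trials (prior 1), and gamma = 0 puts all
    weight on non-target trials (prior 0)."""
    if gamma >= 0.5:
        return p_eff + (1.0 - p_eff) * (gamma - 0.5) / 0.5
    return p_eff * (gamma / 0.5)
```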
0:18:09 | This figure is a little bit interesting, I think. At first it seems that the choice of gamma matters much more for the actual detection cost than for the minimum detection cost, but remember that here we did not apply calibration. |
---|
0:18:29 | It is clear that the best choice is not the one calculated with the formula, which is very interesting and an area that should be explored more. |
---|
0:18:47 | I should probably also have said that the actual cost goes up a little bit here, which is quite noticeable and I am not sure why, because that point is actually the effective prior according to the formula, the one we used in the other experiments. |
---|
0:19:11 | But anyway, we can see that this pattern is connected to the regularization towards the ML model. |
---|
0:19:27 | A very interesting thing is that the minimum detection cost actually goes down a little bit at the extremes, which are the cases where we give all weight to the target trials or all weight to the non-target trials. |
---|
0:19:55 | I think the explanation is that this really should not work, but because we regularize towards the ML model, the model simply stays very close to the ML model in such settings, which can actually be a good thing. |
---|
0:20:09 | We also included the results for regularization towards zero, where we can see that this is not the case. |
---|
0:20:20 | In conclusion, we can see that application-specific training sometimes improves performance, but quite often there is not much difference. |
---|
0:20:30 | We tried different optimization strategies; what I should say about that is that the starting point is important (starting from the logistic regression model), whereas the scheme of gradually increasing the non-convexity was not so effective actually, although I did not discuss the details of that. |
---|
0:20:57 | So the optimization is something to consider. Also, since the weights beta seem to have some importance and the best choice is not obvious, we should probably consider better estimates of them, maybe something that depends on other factors than just whether the trial is a target or a non-target trial. |
---|
0:21:22 | And since the discriminatively trained models also needed calibration, I think it could be interesting to make the regularization target a parameter vector that has calibration built in: we would calibrate the ML model and then put the parameters from that calibration into the regularization target, so that we actually regularize towards something that is calibrated, or something like that. |
---|
0:22:29 | [Question] What optimizer did you use? — We used the BFGS algorithm. |
---|
0:22:38 | [Question] You mentioned there were some issues with the non-convexity of your objective. In the work I presented this morning I also had issues with non-convexity, and BFGS was a problem, because BFGS basically forms a rough approximation to the inverse of the Hessian, and if you use BFGS that Hessian approximation is going to be positive definite. So BFGS cannot see the non-convexity and cannot do anything about it. |
---|
0:23:25 | That is a good point, and we should consider a better optimization algorithm. We can confirm that we reduced the value of the objective function quite significantly, but maybe we could have done much better in that respect by using something more suitable, I think so. |
---|
0:23:45 | [Question] In my case there was a simple solution: I could compute the full Hessian and invert it without problems, because I had very few parameters, and I could do an eigenvalue analysis and then go down the steepest negative eigenvectors to get out of the non-convex regions. For you the dimensionality is very high, so you could perhaps do some other things, but it is trickier, I am afraid. But thank you. |
---|
0:24:21 | [Inaudible audience question] |
---|
0:25:04 | Okay, well, basically because we are not doing calibration, we are doing discriminative training of the PLDA. Okay. |
---|