0:00:13 | Hello, my name is Anssi Kanervisto, and I am here to tell you about our work, "An initial investigation on optimizing tandem speaker verification and countermeasure systems using reinforcement learning". |
0:00:23 | So that we are all on the same page: a speaker verification (ASV) system verifies that a claimed identity and a provided speech sample come from the same person. |
0:00:37 | That is, the speaker verification system takes in a claimed identity and a speech sample, and if the identity matches the identity of the person who spoke, then all is good and the system will accept. |
0:00:49 | Likewise, if somebody else claims that they are someone they are not and provides a speech sample, it should not let them pass. |
0:00:59 | Good. This is very simple, and many of you work in this field. |
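As a minimal sketch of this decision (my illustration, not code from the talk; it assumes some speaker-embedding extractor and a threshold tuned on development data), an ASV system typically scores the similarity between the enrolled and test utterances and thresholds it:

```python
import numpy as np

def asv_decision(enrolled_emb: np.ndarray, test_emb: np.ndarray,
                 threshold: float = 0.5) -> bool:
    """Accept the trial if the test speech matches the claimed identity.

    Cosine similarity between speaker embeddings is one common scoring choice.
    """
    score = float(np.dot(enrolled_emb, test_emb) /
                  (np.linalg.norm(enrolled_emb) * np.linalg.norm(test_emb)))
    return score >= threshold  # True = accept, False = reject
```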
0:01:02 | Of course, when it comes to security and systems like this, there are some bad guys who want to break the system. |
0:01:10 | For example, somebody could record Tomi's speech here with a mobile phone and then later use that recorded speech to claim that they are Tomi, by stating that they are Tomi and playing back the audio, and the system will gladly accept that. |
0:01:24 | Previous work, such as the ASVspoof 2017 challenge, has shown that if you do not protect against this, the ASV system will gladly accept this kind of trial, even though it should not. |
0:01:37 | Likewise, you could use an already gathered dataset, or collect data of somebody speaking, and then use speech synthesis or voice conversion to again generate speech that sounds like Tomi, feed it to the system, and it will again accept it just fine. |
0:01:52 | Again, this has been shown in previous competitions; it is a known problem, but you can also protect against it. |
0:01:58 | This is where the countermeasures come in. A countermeasure (CM) system takes in the same speech sample that was provided to the ASV system, and checks that the sample comes from a live human speaker, as opposed to, say, a mobile phone replay, synthesized speech, or voice-converted speech. It is a detector of bona fide human speech. |
0:02:26 | So if, for example, somebody has recorded somebody else's speech and feeds it to the system, it is now fed to the countermeasure system as well; the countermeasure system says "reject", and the attacker does not get access. It keeps the attacker out. |
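The tandem decision rule itself is just a conjunction; a one-line sketch (illustrative, not code from the paper):

```python
def tandem_decision(asv_accept: bool, cm_accept: bool) -> bool:
    """A trial passes only if the ASV system accepts the claimed identity
    AND the countermeasure judges the sample to be bona fide human speech."""
    return asv_accept and cm_accept
```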
0:02:43 | So far so good, and these competitions have shown that when you train for these kinds of situations, when you train to detect these replay attacks or this synthesized speech, you can detect them and everything works fine. |
0:02:57 | But one issue we had with this setup is that the ASV system and the countermeasure system are trained completely independently from each other. The ASV system has its own dataset, its own loss, its own training protocols, and so on. |
0:03:19 | Likewise, the CM system has its own datasets, its own loss, its own training protocols, its own network architecture, and so on. |
0:03:27 | These are trained separately, but then they are evaluated together, as one bigger system, with a completely different evaluation metric. These two systems have never been trained to actually minimize this evaluation metric; they have been trained on their own tasks. |
0:03:46 | So we had this coffee-room idea: when we have this kind of bigger, whole system, what if we trained the ASV and the CM systems on the evaluation metric directly? Perhaps, on top of their already existing training, we could also optimize them to minimize (or maximize) the evaluation metric, for better results. |
0:04:17 | However, sadly, it is not so straightforward. |
0:04:22 | We have this system where we feed the speech to the ASV and CM systems, and both of them produce an accept or reject label: either accept or reject. These are then fed to the evaluation metric, which usually computes the error rates, i.e. the false rejection rate and the false acceptance rate. |
0:04:43 | These are then used in various ways, depending on the evaluation metric, to come up with a single number that shows how good the system as a whole is. |
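For concreteness, here is how such error rates fall out of the hard accept/reject labels (a generic sketch with illustrative names; the actual evaluation metric used in this work, the t-DCF, is introduced below):

```python
import numpy as np

def error_rates(accepted: np.ndarray, is_target: np.ndarray):
    """accepted: boolean accept decision per trial.
    is_target: True where the trial should have been accepted."""
    false_reject_rate = np.mean(~accepted[is_target])   # targets rejected
    false_accept_rate = np.mean(accepted[~is_target])   # non-targets accepted
    return false_reject_rate, false_accept_rate
```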
0:04:54 | Now, assume that these two systems are differentiable, i.e. they are neural networks, which is quite common these days. If we wanted to minimize the evaluation metric, we would need to compute the gradient of the evaluation metric with respect to the two systems we have, or rather their parameters. |
0:05:14 | Sadly, we cannot compute the gradient over these hard decisions of accept/reject, yet these are all required for the error rates and for the whole evaluation metric. For example, the tandem detection cost function (t-DCF), which we will be using later, requires these two error rates and weights them in different ways, but we cannot backpropagate from it, computing the gradient all the way back to the systems, because the hard decision is not a differentiable operation, and thus we cannot calculate the gradient. |
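Schematically, the obstacle looks like this (my notation, not the slides'): the error rates are averages of indicator functions, which are piecewise constant in the parameters, so their gradient is zero almost everywhere and backpropagation stops there.

```latex
% We would need the gradient of the metric w.r.t. the system parameters \theta:
\nabla_\theta \, \text{t-DCF}\bigl(P_{\text{miss}}(\theta),\, P_{\text{fa}}(\theta)\bigr),
\qquad \text{where, e.g.,} \quad
P_{\text{fa}}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\bigl[\, s_\theta(x_i) > \tau \,\bigr]
% The indicator is piecewise constant in \theta: zero gradient almost everywhere.
```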
0:05:55 | In a related topic, other work has suggested "soft" versions of error metrics. For example, by softening the F1 score or the area-under-curve (AUC) loss, you can come up with a differentiable version of the score metric, and then you can do this computation. Softening means that the hard decisions are relaxed, so that you have a function you can actually take the derivative of, and then you can compute this gradient. |
0:06:27 | However, the tandem detection cost function we have here does not have such a soft version. |
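To make the softening idea concrete (a generic illustration, not this paper's method, which keeps the hard decisions and estimates the gradient instead), one can replace the indicator with a sigmoid so that an error rate becomes differentiable:

```python
import torch

def soft_false_accept_rate(scores, is_target, threshold=0.0, temperature=0.1):
    """Differentiable surrogate for the false acceptance rate: the hard
    decision 1[score > threshold] is relaxed to a steep sigmoid, so
    gradients can flow back to the scores (and the network behind them)."""
    soft_accept = torch.sigmoid((scores - threshold) / temperature)
    return soft_accept[~is_target].mean()
```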
0:06:32 | Instead, we looked into reinforcement learning. In reinforcement learning, a simplified setup goes like this: a computer agent sees an image, or more generally some information from the game or the environment; the agent chooses an action; the action is then executed in the environment; and depending on whether the outcome of the action is good or not, the agent receives some reward. |
0:07:05 | In reinforcement learning, the goal of this whole setup is to get as much reward as possible, i.e. to modify the agent so that it gets as much reward as possible. |
0:07:17 | One way to do this is via the gradient. We could take the gradient of the expected reward, i.e. the reward averaged over all the different situations in the setup, with respect to the policy, that is, with respect to the agent. If we can do this, we can then of course update the agent in the direction that increases the amount of reward. |
0:07:46 | However, here we again have the problem that we cannot really differentiate the decision part, where we choose one specific action out of many and execute it in the environment, so we cannot differentiate that and we cannot compute the gradient. |
0:08:04 | However, there is a thing called the policy gradient, which estimates this gradient. It does so with an equation where, instead of calculating the gradient of the reward directly, it computes the gradient of the log-probabilities of the selected actions and weights them by the reward we got. |
0:08:29 | This has been shown in reinforcement learning to be quite an effective estimate, and it has also been shown that you can replace the reward with any function and, by running this, you will still get the correct gradient, i.e. the same result, with enough samples. |
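In symbols, this is the standard score-function (REINFORCE) estimator; with policy $\pi_\theta$, sampled actions $a_i \sim \pi_\theta$, and reward $R$ (my notation, not the slides'):

```latex
\nabla_\theta\, \mathbb{E}_{a \sim \pi_\theta}\!\left[ R(a) \right]
 \;=\; \mathbb{E}_{a \sim \pi_\theta}\!\left[ R(a)\, \nabla_\theta \log \pi_\theta(a) \right]
 \;\approx\; \frac{1}{N}\sum_{i=1}^{N} R(a_i)\, \nabla_\theta \log \pi_\theta(a_i)
```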
0:08:52 | Going back to our tandem optimization, where we had the same problem of hard decisions that we cannot differentiate: we simply apply this policy gradient, or the policy gradient theorem, here. The equation is more or less the same, just with the terms having a different meaning. |
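A minimal sketch of one such update step, treating each accept/reject as a sampled Bernoulli action and using the negative evaluation cost as the reward (the specifics here, such as a single batch-level reward, are my illustrative assumptions, not a faithful reproduction of the paper's training code):

```python
import torch

def tandem_policy_gradient_step(asv_logits, cm_logits, compute_cost, optimizer):
    """One REINFORCE-style update of the tandem ASV + CM systems.

    asv_logits, cm_logits: per-trial "accept" logits from the two networks.
    compute_cost: maps hard accept/reject decisions to a scalar cost (e.g. a
                  t-DCF-style number); it does NOT need to be differentiable.
    """
    asv_dist = torch.distributions.Bernoulli(logits=asv_logits)
    cm_dist = torch.distributions.Bernoulli(logits=cm_logits)

    # Sample hard accept (1) / reject (0) decisions: the non-differentiable step.
    asv_actions = asv_dist.sample()
    cm_actions = cm_dist.sample()

    # Reward: negative evaluation cost of the sampled joint decisions (a float).
    reward = -compute_cost(asv_actions, cm_actions)

    # Surrogate loss: log-probabilities of the chosen actions, weighted by reward.
    log_prob = asv_dist.log_prob(asv_actions).sum() + cm_dist.log_prob(cm_actions).sum()
    loss = -reward * log_prob

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```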
0:09:17 | We then proceeded to test how well this works, with a rather simple setup. We have two datasets: VoxCeleb1, or more specifically its speaker verification part, and ASVspoof 2019 for the synthesized speech and the countermeasure (bona fide/spoof) labels. |
0:09:41 | For the ASV task we extract x-vectors using the pretrained Kaldi models, and for ASVspoof we extract CQCC features. These are fixed; they are not being trained in this setup. |
0:09:55 | We then train the ASV system and the CM system separately, as is normally done, using these two datasets, and evaluate them together using the t-DCF cost function as presented in the ASVspoof 2019 competition. |
0:10:13 | After this, we take the two pretrained systems, perform the tandem optimization in the manner previously shown, and finally evaluate the results and compare them with the pretrained results, to see if it actually helps. |
0:10:28 | Not so surprisingly, the tandem optimization helps; here it is in a very short nutshell. |
0:10:37 | One way we see this is by looking at this learning curve, where on the x-axis you have the number of updates you do (you can compare this to a loss curve, where you have the loss against the number of epochs), and on the y-axis you have the relative change on the evaluation set compared to the pretrained system. So if it is at zero percent, it means the metric did not change from the pretrained system. |
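I assume the relative change plotted here is the usual one (the talk does not spell out the formula):

```latex
\Delta_{\text{rel}} \;=\; 100\% \times
\frac{m_{\text{tandem}} - m_{\text{pretrained}}}{m_{\text{pretrained}}}
```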
0:11:07 | The main metric we wanted to minimize is the minimum normalized tandem detection cost function (min t-DCF), and this indeed decreased over time as we performed the tandem optimization: from a zero percent change it went to a minus 25 percent change. So yes, it improved. |
0:11:28 | We then also studied how the individual systems changed over the training. For example, the countermeasure's equal error rate on its own task, detecting whether a sample is spoofed or not, also improved, by around ten percent. |
0:11:47 | But interestingly, the ASV equal error rate increased over time, and we hypothesize why this is. When we evaluated the ASV system on the countermeasure task and the countermeasure on the ASV task, i.e. with the tasks swapped, we noticed that after this tandem optimization the two had improved on each other's tasks: the ASV was better on the countermeasure task, and the countermeasure was also slightly better on the speaker verification task. |
0:12:18 | So we hypothesize that the tandem optimization weighted the speaker verification system away from its normal task of detecting the correct speaker, and it started to detect, like a countermeasure, the spoofed samples instead. |
0:12:39 | We also compared this to a simple baseline where, instead of using this tandem optimization, we just independently continued training the two systems, using the same samples as in the policy gradient method. Basically, we use the same samples: we use the ASV samples to continue updating the ASV system, and we use the countermeasure samples to update the countermeasure system, independently and completely separately from each other. |
0:13:11 | We see the same ASV behaviour here, but the countermeasure system's equal error rate just explodes in the beginning and then slowly creeps back down. In the end, over multiple runs, we see that the policy gradient method improves the results by 26 percent, while the fine-tuning improves the results by 7.84 percent. But note that the fine-tuning results have a much higher variance than the policy gradient version. |
0:13:48 | These results are all very positive, but as said, this was a very initial investigation, and I highly recommend that you check out the paper for more results and figures. |
0:14:00 | And that is all. Thank you for listening, and be sure to check out the paper and the code behind that link. |