0:00:15 | Hello everyone, my name is [unclear], |
---|
0:00:18 | and I come from NUS, Singapore. |
---|
0:00:21 | I will present our recent work about |
---|
0:00:24 | black-box attacks |
---|
0:00:26 | on automatic speaker verification using feedback-controlled voice conversion. |
---|
0:00:31 | This work has been done with my co-authors. |
---|
0:00:37 | I organize my presentation into four parts: |
---|
0:00:39 | the introduction, |
---|
0:00:41 | related works and the proposed method, |
---|
0:00:43 | experiments and results, |
---|
0:00:45 | and finally the conclusion. |
---|
0:00:48 | Let's start with the introduction. |
---|
0:00:51 | With the development of automatic speaker verification (ASV), |
---|
0:00:56 | speaker verification systems have been used in many applications, |
---|
0:01:00 | such as banking, |
---|
0:01:02 | authentication, |
---|
0:01:04 | and other voice-based applications. |
---|
0:01:07 | However, the ASV system also suffers from spoofing attacks. |
---|
0:01:13 | It is found that |
---|
0:01:14 | ASV systems are vulnerable to various kinds of spoofing attacks. |
---|
0:01:20 | To handle this problem, |
---|
0:01:21 | different countermeasures have been developed against spoofing attacks |
---|
0:01:25 | to improve the security of speaker verification systems. |
---|
0:01:30 | In practice, |
---|
0:01:31 | the spoofing attacks |
---|
0:01:37 | can be realized with different techniques, |
---|
0:01:42 | for example, impersonation, |
---|
0:01:45 | replay, synthetic speech, |
---|
0:01:48 | and converted speech. |
---|
0:01:51 | Different |
---|
0:01:52 | models can be used, for example, TTS |
---|
0:01:55 | and voice conversion. |
---|
0:01:57 | In this work, |
---|
0:01:59 | we focus on the attacks |
---|
0:02:01 | generated by voice conversion. |
---|
0:02:05 | From an attacker's point of view, |
---|
0:02:08 | it is possible to generate a kind of adversarial attack |
---|
0:02:11 | with feedback from the ASV |
---|
0:02:14 | system. |
---|
0:02:16 | In this way, an impostor obtains some knowledge of the target system |
---|
0:02:22 | to improve the strength of the spoofing attack. |
---|
0:02:25 | Adversarial attacks are well studied in image processing. |
---|
0:02:30 | Given a panda image, |
---|
0:02:32 | the system recognizes it as a panda. |
---|
0:02:35 | However, after adversarial noise is added, the image is misclassified |
---|
0:02:41 | by the system |
---|
0:02:42 | as a gibbon. |
---|
0:02:43 | This shows the potential threat associated with adversarial attacks. |
---|
0:02:48 | In this work, we would like to apply such adversarial attacks |
---|
0:02:53 | to a speaker verification system. |
---|
0:02:55 | This will help us to build more robust ASV systems |
---|
0:02:59 | in the future. |
---|
0:03:02 | Let us look at the |
---|
0:03:04 | spoofing problem from the attacker's perspective. |
---|
0:03:07 | In the non-adversarial attack scenario, |
---|
0:03:12 | the attacker can use an existing spoofing system to generate spoofed |
---|
0:03:15 | speech samples |
---|
0:03:18 | to attack the ASV system. |
---|
0:03:21 | On the other hand, in |
---|
0:03:23 | the adversarial attack scenario, |
---|
0:03:26 | the attacker can update the spoofing system |
---|
0:03:29 | with the feedback of the ASV system, and then generate adversarially optimized spoofed |
---|
0:03:35 | samples |
---|
0:03:36 | to attack the ASV system again. |
---|
0:03:41 | Of course, |
---|
0:03:42 | this kind of |
---|
0:03:43 | adversarial sample |
---|
0:03:46 | poses a greater threat |
---|
0:03:48 | to the ASV system. |
---|
0:03:53 | With different levels of |
---|
0:03:56 | knowledge, |
---|
0:03:59 | there are three types of adversarial attacks, |
---|
0:04:02 | including the black-box attack, |
---|
0:04:05 | the grey-box attack, |
---|
0:04:06 | and the white-box attack. |
---|
0:04:08 | In a black-box attack, the attacker only has output |
---|
0:04:11 | information from the |
---|
0:04:14 | ASV system. |
---|
0:04:16 | In a |
---|
0:04:18 | grey-box attack, |
---|
0:04:19 | the attacker has |
---|
0:04:21 | information on both the input and output of the ASV system. |
---|
0:04:27 | In a white-box attack, |
---|
0:04:29 | the attacker has |
---|
0:04:31 | full information about the ASV system. |
---|
0:04:34 | Such an attack shows the strongest threat. |
---|
0:04:37 | However, in real practice, the attacker may not |
---|
0:04:44 | have as much information as that. |
---|
0:04:48 | So the black-box attack is |
---|
0:04:51 | easier to realize |
---|
0:04:54 | in practice. |
---|
0:04:57 | Therefore, |
---|
0:04:59 | we focus on this type of attack. |
---|
0:05:03 | Now we go to the related work and the proposed method. |
---|
0:05:07 | First, we will introduce voice conversion. |
---|
0:05:11 | Voice conversion is a |
---|
0:05:13 | technique that modifies the speaker identity of a source speaker to a target speaker |
---|
0:05:18 | without changing the linguistic information. |
---|
0:05:23 | In the conventional framework, |
---|
0:05:24 | the conversion model is |
---|
0:05:26 | trained with parallel data from the source and target speakers, |
---|
0:05:31 | so the conversion model will be |
---|
0:05:36 | specific to that speaker pair. |
---|
0:05:40 | However, for the spoofing attack, |
---|
0:05:44 | a more |
---|
0:05:45 | suitable choice is non-parallel voice conversion, |
---|
0:05:50 | for example, PPG-based voice conversion. |
---|
0:05:53 | The basic idea is to train a feature mapping model between |
---|
0:05:58 | speaker-independent features and speaker-dependent features. |
---|
0:06:02 | For example, |
---|
0:06:03 | given target speech, PPGs can be used as the speaker-independent linguistic features, together with speaker-dependent acoustic |
---|
0:06:10 | features. |
---|
0:06:12 | Then these two features are used to train the |
---|
0:06:16 | conversion model. |
---|
0:06:19 | As the PPG feature |
---|
0:06:22 | is assumed to be |
---|
0:06:24 | speaker-independent, |
---|
0:06:25 | that means, |
---|
0:06:26 | regardless of the speaker, as long as the speech content |
---|
0:06:30 | is the same, |
---|
0:06:31 | the PPG does not change. |
---|
0:06:35 | So, |
---|
0:06:36 | in such a framework, |
---|
0:06:38 | it is easy to achieve many-to-one conversion. |
---|
0:06:41 | And |
---|
0:06:42 | in this framework, the source speech is not required during training, |
---|
0:06:47 | so this will be |
---|
0:06:49 | easier to use for a spoofing attack. |
---|
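The many-to-one conversion described above can be sketched in a few lines. This is an illustrative sketch rather than the speakers' code: the toy linear least-squares mapping stands in for the actual neural conversion model, and the random arrays stand in for real PPG and acoustic features; only the 42/240 dimensions come from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions mentioned in the talk: 42-dim PPGs, 240-dim acoustic targets.
PPG_DIM, ACOUSTIC_DIM, FRAMES = 42, 240, 100

# Training uses target-speaker data only: PPGs are treated as
# speaker-independent linguistic features, acoustic features as
# speaker-dependent ones.
target_ppg = rng.standard_normal((FRAMES, PPG_DIM))
target_acoustic = rng.standard_normal((FRAMES, ACOUSTIC_DIM))

# A toy linear mapping fitted by least squares stands in for the
# BLSTM conversion model used in the talk.
W, *_ = np.linalg.lstsq(target_ppg, target_acoustic, rcond=None)

# Conversion: PPGs from ANY source speaker map to target-speaker
# acoustics -- no source data was needed at training time, which is
# what makes the scheme many-to-one.
source_ppg = rng.standard_normal((50, PPG_DIM))
converted_acoustic = source_ppg @ W
print(converted_acoustic.shape)  # (50, 240)
```

The key property shown here is that the source speaker never appears in training; any utterance whose PPGs can be extracted becomes a candidate attack input.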
0:06:56 | Let us first look at the non-adversarial attack scenario. |
---|
0:07:00 | In the non-adversarial attack scenario, |
---|
0:07:07 | the PPG and acoustic features will be extracted from the target speech to train |
---|
0:07:12 | the |
---|
0:07:12 | conversion model. |
---|
0:07:14 | The model will be updated |
---|
0:07:17 | with a loss |
---|
0:07:18 | calculated between the predicted acoustic features |
---|
0:07:23 | and the ground-truth features from the target speech. |
---|
0:07:28 | During the attack, |
---|
0:07:30 | the PPG |
---|
0:07:30 | is extracted from the source speech. |
---|
0:07:34 | Then |
---|
0:07:35 | we feed the source PPG into the conversion model to get the converted acoustic |
---|
0:07:39 | features. |
---|
0:07:42 | We use a vocoder to convert the acoustic features |
---|
0:07:47 | into the |
---|
0:07:49 | converted speech samples, |
---|
0:07:51 | which are used to perform |
---|
0:07:53 | the attack |
---|
0:07:54 | on the ASV system. |
---|
0:08:00 | This is a |
---|
0:08:01 | typical PPG-based voice conversion model. |
---|
0:08:04 | It is optimized for speaker similarity and quality, |
---|
0:08:08 | so it is not designed to attack the ASV system |
---|
0:08:11 | and may be non-optimal |
---|
0:08:13 | for such an ASV attack. |
---|
0:08:17 | For our proposed feedback-controlled voice conversion, |
---|
0:08:23 | the main difference is that |
---|
0:08:24 | we provide feedback from the ASV system |
---|
0:08:27 | during training. |
---|
0:08:33 | Specifically, during training, for each mini-batch, |
---|
0:08:35 | we feed the |
---|
0:08:37 | PPG extracted |
---|
0:08:41 | from the target speech into the model to generate the predicted acoustic features. |
---|
0:08:45 | The first part of the loss is calculated between the predicted acoustic features and the actual acoustic features, |
---|
0:08:51 | the same as in the baseline PPG-based voice conversion. |
---|
0:08:56 | For the other part, |
---|
0:08:59 | we also use a vocoder to generate the converted speech signal |
---|
0:09:04 | from the predicted acoustic features, |
---|
0:09:06 | and |
---|
0:09:07 | feed this |
---|
0:09:08 | speech signal to the ASV system |
---|
0:09:11 | to get the |
---|
0:09:12 | feedback, |
---|
0:09:15 | which serves as the other part of the loss |
---|
0:09:18 | for the model updating. |
---|
0:09:23 | During the attack, |
---|
0:09:25 | it is the same as |
---|
0:09:27 | the previous framework: |
---|
0:09:30 | the PPG features are extracted from the source speech, and we |
---|
0:09:36 | feed this source PPG into the conversion model to get the |
---|
0:09:40 | converted acoustic features, |
---|
0:09:42 | and |
---|
0:09:43 | a vocoder is used to generate the converted speech signal to fool the ASV system. |
---|
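The two-part objective just described can be sketched as follows. This is an illustrative sketch, not the speakers' implementation: `asv_score` stands for the black-box score returned by the ASV system, the mean-squared feature loss stands in for the actual reconstruction loss, and which term the 0.7 combining ratio weights is an assumption here.

```python
import numpy as np

def combined_loss(pred_feat, true_feat, asv_score, ratio=0.7):
    """Two-part objective sketched from the talk: a feature-level
    voice-conversion loss (as in the PPG baseline) plus an ASV
    feedback term derived from the black-box score. The exact
    weighting convention of the combining ratio is assumed."""
    vc_loss = np.mean((pred_feat - true_feat) ** 2)
    asv_loss = 1.0 - asv_score  # higher ASV score => smaller loss
    return ratio * vc_loss + (1.0 - ratio) * asv_loss

# Toy check: perfect reconstruction and a perfect ASV score give zero loss.
feat = np.ones((10, 240))
print(combined_loss(feat, feat, asv_score=1.0))  # 0.0
```

Because the ASV term comes from a black-box score rather than a differentiable layer, it can steer training only through the value of the total loss, which is why the schedule described next matters.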
0:09:52 | Now let us see |
---|
0:09:54 | how the combined loss |
---|
0:09:56 | is used for the |
---|
0:09:58 | voice conversion model training. |
---|
0:10:01 | As we know, |
---|
0:10:03 | in the black-box attack scenario, |
---|
0:10:06 | we do not have knowledge of the |
---|
0:10:10 | relationship between the voice conversion loss |
---|
0:10:14 | and the ASV loss. |
---|
0:10:16 | So |
---|
0:10:18 | there is no |
---|
0:10:19 | gradient signal available from the ASV part. |
---|
0:10:25 | But |
---|
0:10:25 | the ASV loss contributes to the |
---|
0:10:28 | change of the combined loss curve. |
---|
0:10:30 | So, to leverage this signal for the voice conversion model training, we use |
---|
0:10:36 | an adaptive learning rate schedule |
---|
0:10:38 | based on the loss |
---|
0:10:40 | on the validation set to achieve convergence. |
---|
0:10:43 | For example, |
---|
0:10:44 | the learning rate will be adjusted, |
---|
0:10:48 | or reduced, |
---|
0:10:49 | once the total loss increases on the validation set. |
---|
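The adaptive schedule can be sketched as a minimal rule; the reduction factor of 0.5 and the epoch-level granularity are illustrative assumptions, since the talk only states that the rate is reduced when the validation loss increases.

```python
def adapt_learning_rate(lr, prev_val_loss, val_loss, factor=0.5):
    """Reduce the learning rate once the total (combined) loss
    increases on the validation set, as the talk describes. The
    reduction factor of 0.5 is an illustrative assumption."""
    return lr * factor if val_loss > prev_val_loss else lr

# Toy run: the validation loss rises once (0.8 -> 0.9), so the rate
# is halved exactly once over these epochs.
lr = 1e-3
val_losses = [1.0, 0.8, 0.9, 0.7]
for prev, cur in zip(val_losses, val_losses[1:]):
    lr = adapt_learning_rate(lr, prev, cur)
print(lr)  # 0.0005
```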
0:10:56 | Let us go to the experiments and the results. |
---|
0:11:00 | First, the database used in our experiments consists of two parts: |
---|
0:11:05 | the training part and the validation part. |
---|
0:11:09 | For training, |
---|
0:11:11 | we trained three models: |
---|
0:11:14 | first, the PPG extractor; |
---|
0:11:19 | second, the i-vector extractor, trained on the |
---|
0:11:23 | combined corpus of |
---|
0:11:25 | Switchboard and the NIST SRE corpora from 2006 to 2010; |
---|
0:11:31 | and the conversion model, trained on the ASVspoof 2019 development set. |
---|
0:11:37 | We |
---|
0:11:38 | chose six target speakers, including three males and three females. |
---|
0:11:43 | For each speaker, we chose |
---|
0:11:45 | one hundred and ten utterances |
---|
0:11:48 | for model training. |
---|
0:11:51 | For validation, we use the ASVspoof 2019 evaluation dataset, which contains |
---|
0:11:58 | sixty-seven speakers. |
---|
0:12:01 | We chose twenty utterances per speaker, |
---|
0:12:03 | so in total we have |
---|
0:12:07 | one thousand three hundred and forty utterances. |
---|
0:12:12 | We built two systems |
---|
0:12:15 | for our experiments. |
---|
0:12:17 | The first is the PPG-based voice conversion system without feedback, |
---|
0:12:23 | and the other is the |
---|
0:12:24 | feedback-controlled voice conversion system, which is our proposed |
---|
0:12:28 | system. |
---|
0:12:30 | In particular, |
---|
0:12:32 | the combining ratio |
---|
0:12:33 | is set to 0.7. |
---|
0:12:38 | For both models, |
---|
0:12:40 | we use the same network structure, which consists of two BLSTM |
---|
0:12:46 | layers |
---|
0:12:48 | with |
---|
0:12:48 | one thousand and twenty-four |
---|
0:12:51 | hidden units each. |
---|
0:12:54 | The network input is |
---|
0:12:56 | a forty-two-dimensional PPG feature, |
---|
0:13:00 | while the |
---|
0:13:01 | output dimension is two hundred and forty, |
---|
0:13:04 | consisting of the |
---|
0:13:05 | eighty-dimensional mel-spectrum |
---|
0:13:08 | and its delta and delta-delta (dynamic) features. |
---|
0:13:11 | A vocoder is then used for the speech signal reconstruction. |
---|
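The 240-dimensional output is the 80-dimensional mel-spectrum stacked with its two orders of dynamic features; a minimal sketch of that stacking is below. The simple `np.gradient` differences are a stand-in assumption — real systems typically compute deltas with regression windows — but the dimensionality bookkeeping (80 × 3 = 240) matches the talk.

```python
import numpy as np

def add_dynamics(static):
    """Stack static features with first- and second-order differences.
    A minimal stand-in for delta / delta-delta (dynamic) features;
    production systems usually use regression windows instead."""
    delta = np.gradient(static, axis=0)   # frame-to-frame change
    delta2 = np.gradient(delta, axis=0)   # change of the change
    return np.concatenate([static, delta, delta2], axis=1)

mel = np.random.default_rng(1).standard_normal((100, 80))  # 80-dim mel-spectrum
out = add_dynamics(mel)
print(out.shape)  # (100, 240)
```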
0:13:11 | the rippling what colour they really is used to speech signal reconstruction |
---|
0:13:17 | this figure shows the training curve |
---|
0:13:21 | a only |
---|
0:13:23 | training and validation set |
---|
0:13:26 | the line shows the baseline b g based voice conversion |
---|
0:13:31 | the |
---|
0:13:32 | or shall i shows they |
---|
0:13:34 | create a control wise conversion is a convolutional zero point five |
---|
0:13:39 | the lies shows a |
---|
0:13:42 | with that control voice conversion with a |
---|
0:13:45 | combined racial of zero point seven |
---|
0:13:49 | From the training curves, we can see |
---|
0:13:58 | that the feedback-controlled voice conversion |
---|
0:14:02 | generally achieves a lower overall loss during training, on both the training |
---|
0:14:08 | and validation sets, especially for |
---|
0:14:13 | the ASV loss. |
---|
0:14:18 | And according to these curves, we can see that |
---|
0:14:21 | with a combining |
---|
0:14:24 | ratio of 0.7, the performance is better, |
---|
0:14:30 | so we chose 0.7 as |
---|
0:14:34 | our setting |
---|
0:14:35 | for the following experiments. |
---|
0:14:39 | For the objective evaluation, we use the equal error rate (EER) to evaluate the speaker verification system. |
---|
0:14:48 | From the first column, |
---|
0:14:50 | we can see that |
---|
0:14:51 | the ASV system performs |
---|
0:14:53 | very effectively when the impostor trials are used, |
---|
0:14:57 | with a low EER. |
---|
0:15:02 | The performance |
---|
0:15:03 | decreases significantly |
---|
0:15:05 | when the PPG-based voice conversion |
---|
0:15:08 | attacks are performed, |
---|
0:15:12 | where the EER is increased to over |
---|
0:15:16 | twenty-five percent for |
---|
0:15:18 | all the scenarios. |
---|
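The EER metric used above can be sketched as a threshold sweep; this is an illustrative approximation (smallest max of false-acceptance and false-rejection rates over observed scores), not an optimized or official implementation, and the Gaussian scores are synthetic placeholders for real ASV trial scores.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping the decision threshold over
    all observed scores and taking the smallest max(FAR, FRR)."""
    best = 1.0
    for t in np.concatenate([genuine, impostor]):
        far = np.mean(impostor >= t)  # impostor trials accepted
        frr = np.mean(genuine < t)    # genuine trials rejected
        best = min(best, max(far, frr))
    return best

rng = np.random.default_rng(0)
# Well-separated score distributions (an effective ASV system) give a
# low EER; attacks push the distributions together and the EER rises.
clean = equal_error_rate(rng.normal(3, 1, 1000), rng.normal(-3, 1, 1000))
attacked = equal_error_rate(rng.normal(0.5, 1, 1000), rng.normal(-0.5, 1, 1000))
print(clean < attacked)  # True
```

This mirrors the reported trend: the closer the attack pushes spoofed scores to the genuine distribution, the higher the EER climbs.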
0:15:21 | And |
---|
0:15:23 | it is also observed that |
---|
0:15:25 | the proposed feedback-controlled voice conversion |
---|
0:15:28 | is able to further increase the EER, |
---|
0:15:32 | which shows the vulnerability of ASV systems to this type of attack. |
---|
0:15:40 | We use |
---|
0:15:41 | this figure as an example to show the effectiveness of our proposed |
---|
0:15:49 | adversarial attack |
---|
0:15:52 | method. |
---|
0:15:55 | The red line shows the impostor score distribution, and the blue line shows the |
---|
0:16:02 | score distribution of the genuine trials, |
---|
0:16:06 | and the yellow line shows the score distribution of the PPG-based voice conversion |
---|
0:16:11 | baseline, |
---|
0:16:13 | and the |
---|
0:16:14 | purple line shows the score distribution of our proposed method. |
---|
0:16:19 | We can see that our proposed method can push the |
---|
0:16:24 | scores |
---|
0:16:25 | towards the genuine ones, |
---|
0:16:28 | which shows the effectiveness of our proposed method. |
---|
0:16:34 | Now let us go to the conclusion. |
---|
0:16:37 | In this work, |
---|
0:16:38 | we formulated an adversarial attack scenario with a feedback-controlled voice conversion |
---|
0:16:42 | system, |
---|
0:16:43 | which effectively |
---|
0:16:45 | degrades the speaker verification system performance. |
---|
0:16:49 | We also evaluated the proposed |
---|
0:16:51 | adversarial attack framework |
---|
0:16:54 | on the ASVspoof 2019 corpus, |
---|
0:16:57 | which is widely used for anti-spoofing system benchmarking. |
---|
0:17:02 | We also provided |
---|
0:17:04 | an in-depth case study of |
---|
0:17:07 | the proposed framework, which exposes the weak links |
---|
0:17:10 | of the common speaker verification systems |
---|
0:17:13 | in facing |
---|
0:17:14 | voice conversion attacks. |
---|
0:17:17 | That is all for my presentation. |
---|
0:17:19 | Thank you for your attention. |
---|