0:00:15 Hello, my name is Xiaohai. I am from Singapore. I will present our recent work about black-box attacks on automatic speaker verification using feedback-controlled voice conversion. This work was done together with my co-authors.
0:00:37 I organize my presentation into four parts: the introduction, related works and the proposed method, experiments and results, and finally the conclusion.
0:00:48 Let's start with the introduction.
0:00:51 With the development of automatic speaker verification (ASV), speaker verification systems have been used in many applications, such as banking, authentication, and other voice-driven services. However, the ASV system also suffers from spoofing attacks: it is found that the ASV system is vulnerable to various kinds of spoofing attacks. To handle this problem, different countermeasures have been developed against spoofing attacks, to improve the security of speaker verification systems.
0:01:30 In practice, attacks on the ASV system can be realized with different techniques, for example impersonation, replay, synthetic speech, and converted speech. Different models can be used to generate such attacks, for example text-to-speech synthesis or voice conversion.
0:01:57 In this work, we focus on the attacks generated by voice conversion. From an attacker's point of view, it is possible to generate a kind of voice conversion attack with feedback from the ASV system, as an impostor may have some knowledge of the ASV system and use it to improve the attack strength.
0:02:25 The adversarial attack is a well-known example from image processing. Given an image, the system recognizes it as a panda; however, after an adversarial noise is added, the image is misclassified by the system as a gibbon. This shows the potential threat associated with adversarial attacks. In this work, we would like to launch a similar kind of adversarial attack against a speaker verification system, hoping this will help to build more robust ASV systems in the future.
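The panda-to-gibbon example above can be illustrated with a minimal fast-gradient-sign (FGSM) sketch. The logistic-regression "classifier", the input dimension, and the step size `eps` below are all illustrative assumptions, not part of this work:

```python
import numpy as np

# Minimal FGSM sketch on a toy logistic-regression classifier.
# All names and values here are illustrative, not from the paper.

rng = np.random.default_rng(0)
w = rng.normal(size=64)              # classifier weights
b = 0.0
x = rng.normal(size=64)              # a clean input
y = 1.0 if w @ x + b > 0 else 0.0   # treat the model's own label as ground truth

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad_x(w, b, x, y):
    # Gradient of the cross-entropy loss with respect to the input x.
    p = sigmoid(w @ x + b)
    return (p - y) * w

eps = 1.0
x_adv = x + eps * np.sign(loss_grad_x(w, b, x, y))  # FGSM perturbation

orig_pred = sigmoid(w @ x + b) > 0.5
adv_pred = sigmoid(w @ x_adv + b) > 0.5             # label flips after the attack
```

The same principle, a small input change that flips the system's decision, is what the talk carries over from images to speaker verification.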
0:03:02 Let us look at the spoofing problem from the attacker's perspective. In the non-adversarial attack scenario, the attacker can use an existing spoofing system to generate spoofed samples to attack the ASV system. In the adversarial attack scenario, the attacker updates the spoofing system with the feedback of the ASV system, and then generates adversarial spoofed samples to attack the ASV system again. Of course, this kind of adversarial sample may pose a greater threat to the ASV system.
0:03:53 With different levels of knowledge, there may be three types of adversarial attacks: the black-box attack, the grey-box attack, and the white-box attack. In the black-box attack, the attacker only has limited information, namely the output of the ASV system. In the grey-box attack, the attacker has information on both the input and the output of the ASV system. In the white-box attack, the attacker is fully informed about the ASV system. Such an attack poses the greatest threat; however, in real practice, the attacker may not have as much information as in the white-box case. So the black-box attack is more realistic and easier to realize in practice, and we therefore focus on this case.
0:05:03 Now we go to the related works and the proposed method. First, we will introduce voice conversion. Voice conversion is a technique that modifies the speaker identity of a source speaker towards a target speaker without changing the linguistic information. In the conventional framework, the conversion model is trained with parallel data from the source and target speakers, so the conversion model will be specific to one speaker pair.
0:05:40 However, for the spoofing attack, a more general model is desirable, such as many-to-one voice conversion, which does not require parallel data. For example, in PPG-based voice conversion, the basic idea is to train a feature mapping model between a speaker-independent feature and a speaker-dependent feature. Given an utterance, the phonetic posteriorgram (PPG) is used as the speaker-independent linguistic feature, and the acoustic features are the speaker-dependent features; these two features are then used to train the conversion model. As the PPG feature is speaker-independent, as long as the speech content is the same, the PPG does not change across speakers. So, in such a framework, it is easy to achieve many-to-one conversion; furthermore, the source speech is not required during training, which makes it easier to use for spoofing attacks.
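The many-to-one idea, mapping speaker-independent PPG frames to the target speaker's acoustic frames, can be sketched with a toy linear regression. The shapes and the linear model are illustrative assumptions; the actual system in the talk uses a recurrent neural network:

```python
import numpy as np

# Toy sketch of the many-to-one mapping: learn a regression from
# speaker-independent PPG frames to target-speaker acoustic frames.
# The linear model and synthetic data are illustrative only.

rng = np.random.default_rng(1)
T, ppg_dim, ac_dim = 500, 42, 240    # frames, PPG dim, acoustic dim (from the talk)

ppg_target = rng.normal(size=(T, ppg_dim))      # PPGs from target speech
true_map = rng.normal(size=(ppg_dim, ac_dim))
acoustic_target = ppg_target @ true_map          # "natural" acoustic features

# Training: fit the mapping on target-speaker data only (no source data needed).
W, *_ = np.linalg.lstsq(ppg_target, acoustic_target, rcond=None)

# Attack time: PPGs from ANY source speaker go through the same mapping,
# because the PPG depends only on the spoken content, not on the speaker.
ppg_source = rng.normal(size=(100, ppg_dim))
converted = ppg_source @ W                       # converted acoustic features
```

Because the mapping is trained only on the target speaker's data, any source speaker's PPGs can be converted, which is exactly what makes this framework convenient for spoofing attacks.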
0:06:56 Now let us look at the non-adversarial attack scenario. In this scenario, the PPG and acoustic features are first extracted from the target speech to train the conversion model. The model is updated with a loss calculated between the predicted acoustic features and the natural acoustic features from the target speech. During the attack, the PPG is extracted from the source speech; we then feed this source PPG into the conversion model to get the converted acoustic features, and use a vocoder to convert them into converted speech samples, which are used to perform an attack on the ASV system.
0:08:00 This is a typical PPG-based voice conversion model. It is optimized for speaker similarity and speech quality, so it is not designed against the ASV system, and may be non-optimal for the ASV attack.
0:08:17 In our proposed feedback-controlled voice conversion, the main difference is that we provide feedback from the ASV system during training. During training, for each mini-batch, we extract the PPG from the target speech and generate the predicted acoustic features. The first part of the loss is calculated between the predicted acoustic features and the natural acoustic features, the same as in the baseline PPG-based voice conversion. On the other hand, we also use a vocoder to generate the converted speech signal from the predicted acoustic features, and feed this speech signal to the ASV system; the ASV feedback then serves as another part of the loss for the model updating.
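The two-part objective described above can be sketched as follows. The mean-squared-error form of the conversion loss and the negated-score form of the ASV feedback term are assumptions for illustration; `beta` is the combined ratio (set to 0.7 in the experiments reported later):

```python
import numpy as np

# Sketch of the combined training objective of feedback-controlled
# voice conversion. The exact loss forms are illustrative assumptions.

def vc_loss(pred_acoustic, target_acoustic):
    # Part 1: ordinary voice-conversion loss between predicted and
    # natural acoustic features (mean squared error here).
    return np.mean((pred_acoustic - target_acoustic) ** 2)

def asv_feedback_loss(asv_score):
    # Part 2: feedback from the (black-box) ASV system. A higher
    # verification score means a more successful attack, so the
    # negated score is minimised.
    return -asv_score

def combined_loss(pred_acoustic, target_acoustic, asv_score, beta=0.7):
    # Weighted combination of the two parts with combined ratio beta.
    return beta * vc_loss(pred_acoustic, target_acoustic) \
        + (1 - beta) * asv_feedback_loss(asv_score)

# Example with dummy values: vc_loss = 1.0, asv_score = 0.5
# -> 0.7 * 1.0 + 0.3 * (-0.5) = 0.55
loss = combined_loss(np.zeros((5, 3)), np.ones((5, 3)), asv_score=0.5)
```

Note that only the scalar score returned by the ASV system is used, which is consistent with the black-box setting: no gradients from inside the ASV system are needed.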
0:09:23 During the attack, the process is the same as in the baseline: the PPG extractor is used to extract the PPG from the source speech, we feed this source PPG into the conversion model to get the converted acoustic features, and a vocoder is used to generate the converted speech signal to fool the ASV system.
0:09:52 Now let us see how the combined loss is used for the voice conversion model training. In the black-box attack scenario, we do not have knowledge of the relationship between the voice conversion loss and the ASV loss, so there is no principled way to weight the two loss signals; however, the ASV loss changes the shape of the combined loss curve. Therefore, to leverage the two loss signals for the voice conversion model training, we use an adaptive learning rate schedule based on the loss on the validation set: for example, the learning rate will be adjusted, or reduced, once the total loss increases on the validation set.
0:10:56 Now let us go to the experiments and the results. First, the database used in our experiments is divided into two parts: the training part and the validation part. For training, we work with three models: the PPG extractor; the i-vector extractor, trained on the combined corpora of Switchboard and NIST SRE; and the conversion model, trained on the ASVspoof 2019 development set. We choose six target speakers, including three male and three female; for each speaker, we choose two hundred utterances for model training. For validation, we use the ASVspoof 2019 evaluation dataset, which contains sixty-seven speakers. We choose twenty utterances per speaker, so in total we have one thousand three hundred and forty utterances.
0:12:12 We built two systems for our experiments. The first is the PPG-based voice conversion system without feedback; the other is the feedback-controlled voice conversion system, which is our proposed system. Specifically, the combined ratio is set to 0.7. For both models, we use the same network structure, which consists of two bidirectional LSTM layers with 1,024 units in each layer. The network input of both systems is the 42-dimensional PPG feature, while the output dimension is 240, consisting of the 80-dimensional mel-spectrum together with its dynamic (delta and delta-delta) features. A vocoder is used for speech signal reconstruction.
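The 240-dimensional output can be understood as the mel-spectrum stacked with its dynamic features (80 × 3 = 240). The simple first-difference delta used in this sketch is an illustrative assumption; practical systems often use a regression-window delta:

```python
import numpy as np

# Sketch of assembling the 240-dimensional network output: the
# 80-dimensional mel-spectrum plus its delta and delta-delta features.
# The first-difference delta here is an illustrative simplification.

def delta(feat):
    # First-order difference along the time axis; same shape as input,
    # with the first frame left as zeros.
    d = np.zeros_like(feat)
    d[1:] = feat[1:] - feat[:-1]
    return d

T, mel_dim = 100, 80
mel = np.random.default_rng(2).normal(size=(T, mel_dim))
d1 = delta(mel)                                  # delta (velocity)
d2 = delta(d1)                                   # delta-delta (acceleration)
out = np.concatenate([mel, d1, d2], axis=1)      # shape (T, 240)
```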
0:13:17 This figure shows the training curves on the training and validation sets. One curve shows the baseline PPG-based voice conversion; the others show the feedback-controlled voice conversion with combined ratios of 0.5 and 0.7. From the training curves we can see that the feedback-controlled voice conversion generally achieves a lower overall loss during training, on both the training and validation sets, especially for the ASV loss. Comparing the combined ratios on this database, we chose 0.7 as the setting for the following experiments.
0:14:39 This table shows the equal error rate (EER) of the speaker verification system under the different attacks. From the first row, we can see that the ASV system performs very effectively when the impostor trials are used, with a very low EER. The performance decreases significantly when the PPG-based voice conversion attacks are performed, where the EER increases to over twenty-five percent for all the scenarios. It is also shown that the proposed feedback-controlled voice conversion is able to further degrade the performance, which shows that when the attacker gets feedback from the ASV system, the ASV system is exposed to a greater threat.
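For reference, the EER reported in the table can be computed from genuine and impostor (or attack) score lists as the operating point where false acceptance and false rejection rates are equal. The threshold sweep below is a minimal sketch with illustrative scores:

```python
import numpy as np

# Sketch of computing the equal error rate (EER) from score lists.
# The scores used in the example are illustrative, not from the paper.

def eer(genuine, impostor):
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    best_gap, best_eer = np.inf, None
    # Sweep the decision threshold over all observed scores and keep
    # the point where FAR and FRR are closest.
    for t in np.unique(np.concatenate([genuine, impostor])):
        frr = np.mean(genuine < t)     # genuine trials wrongly rejected
        far = np.mean(impostor >= t)   # impostor/attack trials wrongly accepted
        if abs(far - frr) < best_gap:
            best_gap = abs(far - frr)
            best_eer = (far + frr) / 2
    return float(best_eer)

# Well-separated scores give an EER of zero.
clean_eer = eer([0.7, 0.8, 0.9], [0.1, 0.2, 0.3])
```

Replacing the impostor list with converted-speech scores is how an attack's effect on the EER, like the jump above twenty-five percent mentioned in the talk, would be measured.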
0:15:40 We use two figures with an attack example to show the effectiveness of our proposed adversarial attack. The red line shows the impostor score distribution, and the blue line shows the score distribution of the genuine trials. The yellow line shows the score distribution of the PPG-based voice conversion baseline, and the purple line shows the score distribution of our proposed method. We can see that our proposed method pushes the scores towards the genuine region, which shows the effectiveness of our proposed method.
0:16:34 Now let us go to the conclusion. In this work, we formulated the adversarial attack scenario with a feedback-controlled voice conversion system, which effectively degrades the speaker verification system performance. We evaluated the proposed adversarial attack framework on the ASVspoof 2019 corpus, which is widely used for spoofing system benchmarking. The results of the study show that the proposed framework exposes a weak link of common speaker verification systems in facing voice conversion attacks. That is all for my presentation. Thank you for your attention.