Hello everyone, my name is Xiaohai Tian.

I come from the National University of Singapore.

I am going to present our recent work on black-box attacks on automatic speaker verification using feedback-controlled voice conversion.

This work has been done together with my co-authors.

I divide my presentation into four parts:

the introduction,

related work and the proposed method,

experiments and results,

and finally the conclusion.

Let's start with the introduction.

With the development of automatic speaker verification (ASV), speaker verification systems have been used in many applications,

such as banking, authentication, and other applications.

However, ASV systems also face threats from spoofing attacks.

It has been found that ASV systems are vulnerable to various kinds of spoofing attacks.

To handle this problem, different countermeasures have been developed against spoofing attacks, to enhance the security of speaker verification systems.

In practice, spoofing attacks can be realized with different techniques,

for example impersonation, replay, and synthetic speech.

To generate synthetic speech, different models can be used, for example TTS and voice conversion.

In this work, we focus on the attacks generated by voice conversion.

From an attacker's point of view, it is possible to generate a kind of adversarial attack with feedback from the ASV system,

since an impostor may have some knowledge of the ASV system and can use it to improve the spoofing attack.

As for adversarial attacks, here is an example from image processing.

Given an image, usually the system recognizes it correctly.

However, when adversarial noise is added, the image is misclassified by the system as a different class.

This shows the potential threat associated with adversarial attacks.

In this work, we would like to study this kind of adversarial attack on a speaker verification system.

This will help to build more robust ASV systems in the future.

Let's look at the spoofing problem from the attacker's perspective.

In a non-adversarial attack scenario, the attacker can use a spoofing system to generate spoofing samples to attack the ASV system.

On the other hand, in the adversarial attack scenario, the attacker can update the spoofing system with the feedback of the ASV system, and then generate improved spoofing samples to attack the ASV system again.

Of course, this kind of adversarial sample poses more threat to the ASV system.

With different levels of knowledge, there are three types of adversarial attacks,

including the black-box attack, the grey-box attack, and the white-box attack.

For the black-box attack, the attacker only has the output information of the ASV system.

For the grey-box attack, the attacker has information on both the input and the output of the ASV system.

For the white-box attack, the attacker has the full information of the ASV system.

Such studies show that there is a threat.

However, in real practice, the attacker may not have as much information as described above.

So the black-box attack is more likely to arise in reality.

So in this work, we focus on the black-box attack.

Then we go to the related work and the proposed method.

First, we will introduce voice conversion.

Voice conversion is a technique that modifies the speaker identity of speech from a source speaker to a target speaker without changing the linguistic information.

In the conventional framework, the conversion model is built on data from both the source and the target speaker,

so the conversion model is built specifically for one speaker pair.

However, for a spoofing attack, a more commonly used approach is, for example, PPG-based voice conversion.

The basic idea is to train a feature mapping model between a speaker-independent feature and a speaker-dependent feature.

For example, given a speech utterance, we first extract the speaker-independent PPG feature and the speaker-dependent acoustic feature,

and then use these two features to train the conversion model.

The PPG feature is assumed to be speaker independent.

That means that, regardless of the speaker, as long as the speech content is the same, the PPG does not change.

So in such a framework, it is easy to achieve many-to-one conversion.

And in this framework, the source speech is not required during training,

so this is easier to use for a spoofing attack.

So let's look at the non-adversarial attack scenario.

In the non-adversarial attack scenario, at the training stage, the PPG and the acoustic features are extracted from the target speech to train the conversion model.

The model is updated with a loss calculated between the predicted acoustic features and the acoustic features from the target speech.

During attacking, the PPG is extracted from the source speech.

Then we feed this source PPG into the conversion model to generate the converted acoustic features.

We use a vocoder to convert the acoustic features into the converted speech samples,

to perform the attack on the ASV system.

This is the PPG-based voice conversion model.

It is optimized for speaker similarity and quality,

so it is not designed for the ASV system, and it may be non-optimal for an ASV attack.

For our proposed feedback-controlled voice conversion, the main difference is that we provide feedback from the ASV system during training, as a negative example.

During training, for each mini-batch, we take the target speech and extract the PPG from it to generate the predicted acoustic features.

The first part of the loss is calculated between the predicted acoustic features and the actual acoustic features, the same as in the baseline PPG-based voice conversion.

In the other part, we also use a vocoder to generate the converted speech signal from the predicted acoustic features,

and we feed this speech signal to the ASV system to get the feedback, which serves as another part of the loss for the model updating.

During attacking, it is the same as the baseline framework.

The PPG extractor is used to extract the PPG from the source speech, and we feed this source PPG into the conversion model to generate the converted acoustic features.

A vocoder is used to generate the converted speech signal to attack the ASV system.

Note that the combined loss is used only for the voice conversion model training.
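
The combined loss described above can be sketched as follows. This is a minimal illustration, not the exact formulation from the talk: the talk only states that a combination ratio is used, so the convex weighting, the mean-squared-error form of the conversion loss, and the names `mse`, `combined_loss`, `vc_loss`, `asv_loss` and `ratio` are all assumptions.

```python
def mse(predicted, target):
    """Mean squared error between two equal-length feature vectors."""
    if len(predicted) != len(target):
        raise ValueError("feature vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def combined_loss(vc_loss, asv_loss, ratio=0.7):
    """Blend the conversion loss with the ASV feedback loss.

    Assumption: the ratio weights the conversion loss and (1 - ratio)
    weights the ASV feedback term. The talk does not specify which term
    the ratio is applied to.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    return ratio * vc_loss + (1.0 - ratio) * asv_loss


# Hypothetical values for one mini-batch.
vc = mse([0.2, 0.4, 0.6], [0.0, 0.4, 1.0])   # acoustic-feature loss
asv = 0.5                                     # loss derived from ASV feedback
total = combined_loss(vc, asv, ratio=0.7)
```

With this form, a ratio of 1.0 falls back to the baseline PPG-based training, which makes it easy to compare the two systems with the same code.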

As we know, for the black-box adversarial scenario, we do not have knowledge of the relationship between the voice conversion loss and the ASV loss.

So we have no access to the inside of the ASV part,

but the ASV loss does change the combined loss curve.

So, to leverage the ASV signals for the voice conversion model training, we use an adaptive learning rate schedule based on the loss on the validation set to achieve convergence.

For example, the learning rate will be adjusted, that is reduced, once the total loss increases on the validation set.
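
The schedule just described can be sketched like this. It is only an illustration under assumptions: the talk says the learning rate is reduced when the total validation loss increases, but the reduction factor, the lower bound, and the class name `ReduceOnIncrease` are hypothetical.

```python
class ReduceOnIncrease:
    """Reduce the learning rate whenever the validation loss goes up.

    The factor and min_lr defaults are illustrative assumptions, not
    values reported in the talk.
    """

    def __init__(self, lr, factor=0.5, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.min_lr = min_lr
        self.prev_loss = None  # validation loss seen at the previous step

    def step(self, val_loss):
        """Update the learning rate from the latest total validation loss."""
        if self.prev_loss is not None and val_loss > self.prev_loss:
            self.lr = max(self.lr * self.factor, self.min_lr)
        self.prev_loss = val_loss
        return self.lr


scheduler = ReduceOnIncrease(lr=0.001)
for loss in [1.0, 0.9, 1.1, 1.0]:
    current_lr = scheduler.step(loss)
```

Because the schedule only looks at the value of the combined loss, it does not need any gradient from the ASV system, which fits the black-box setting.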

Let's go to the experiments and the results.

First, the database used in our experiments consists of two parts: the training part and the evaluation part.

For training, we train three models.

First, the PPG extractor, which is trained on its own speech corpus.

The i-vector extractor is trained on a combined corpus of Switchboard and the NIST SRE corpora from 2006 to 2010.

The conversion model is trained on the ASVspoof 2019 development set.

We choose six target speakers, including three male and three female.

For each speaker, we choose several hundred utterances for model training.

For the evaluation set, we use the ASVspoof 2019 evaluation dataset, which contains sixty-seven speakers.

We choose twenty utterances per speaker, so in total we have one thousand three hundred and forty utterances.

We built two systems to perform our experiments.

One is the PPG-based voice conversion system without feedback.

The other is the feedback-controlled voice conversion system, which is our proposed system.

In particular, the combination ratio is set to 0.7.

For both models, we use the same network structure, which consists of two BLSTM layers with the same number of units in each layer.

The network input is a forty-two-dimensional PPG feature,

while the output dimension is two hundred and forty,

consisting of the eighty-dimensional mel-spectrum and its dynamic and acceleration features.
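
To make the output dimension concrete, here is a small sketch of how an 80-dimensional mel-spectrum plus its dynamic and acceleration features gives 240 dimensions per frame. The simple two-frame difference used for the deltas and the function names `deltas` and `stack_static_and_dynamic` are assumptions for illustration; the talk does not say how the dynamic features are computed.

```python
def deltas(frames):
    """First-order dynamic features: d[t] = (x[t+1] - x[t-1]) / 2.

    Edge frames are repeated at the boundaries. This difference window
    is an assumption, not the one reported in the talk.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out


def stack_static_and_dynamic(mel_frames):
    """Concatenate static, delta and delta-delta features frame by frame."""
    d1 = deltas(mel_frames)   # dynamic features
    d2 = deltas(d1)           # acceleration features
    return [s + a + b for s, a, b in zip(mel_frames, d1, d2)]


# Five dummy frames of an 80-dimensional mel-spectrum.
mel = [[float(t)] * 80 for t in range(5)]
features = stack_static_and_dynamic(mel)
print(len(features), len(features[0]))
```

Each frame keeps its 80 static values and gains 80 delta and 80 delta-delta values, so the target vector has 3 x 80 = 240 dimensions.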

A vocoder is then used for speech signal reconstruction.

This figure shows the training curves on the training and validation sets.

The first line shows the baseline PPG-based voice conversion.

The second line shows the feedback-controlled voice conversion with a combination ratio of 0.5.

The third line shows the feedback-controlled voice conversion with a combination ratio of 0.7.

From the training curves, we can see that the feedback-controlled voice conversion generally achieves a lower loss during training on both the training and validation sets, especially for the ASV loss.

And according to these curves, comparing the two combination ratios, we chose 0.7 as our setting for the following experiments.

For the objective evaluation, we use the equal error rate (EER) of the speaker verification system.

From the results, we can see that the ASV system performs very effectively when only the impostor trials are used, with a low EER.

And the performance decreases significantly when the PPG-based voice conversion attacks are performed,

where the EER increases to over twenty-five percent for all the scenarios.
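
For reference, the EER can be computed from genuine and impostor scores roughly as below. This is a simple threshold sweep written for illustration, not the scoring tool used in the experiments, and the score values are made up.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER: the point where false acceptance equals false rejection.

    Sweeps every observed score as a threshold and returns the mean of
    FAR and FRR at the threshold where the two rates are closest.
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # An impostor is falsely accepted if its score reaches the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # A genuine trial is falsely rejected if its score falls below it.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


genuine = [0.9, 0.8, 0.7]
zero_effort = [0.1, 0.2, 0.3]      # ordinary impostor trials
converted = [0.85, 0.75]           # hypothetical voice conversion attacks

baseline_eer = equal_error_rate(genuine, zero_effort)
attacked_eer = equal_error_rate(genuine, zero_effort + converted)
```

In this toy example the converted samples score close to the genuine trials, so adding them to the impostor set raises the EER, which is the effect reported for the attacks.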

And it is also observed that the proposed feedback-controlled voice conversion is able to further increase the EER,

which shows the vulnerability of ASV systems to adversarial attacks.

We also use two figures as an example to show the effectiveness of our proposed adversarial attack.

The first line shows the impostor score distribution, and the blue line shows the score distribution of the genuine trials.

The yellow line shows the score distribution of the PPG-based voice conversion baseline,

and the purple line shows the score distribution of our proposed method.

We can see that our proposed method can push the score distribution towards the genuine one,

which shows the effectiveness of our proposed method.

And let's go to the conclusion.

In this work, we formulate an adversarial attack scenario with a feedback-controlled voice conversion system,

which effectively degrades the speaker verification system performance.

We also evaluated the proposed and the non-adversarial frameworks on the ASVspoof 2019 corpus,

which is widely used for ASV system benchmarking.

We also provide an analytical study of the proposed framework, which exposes the weak links of common speaker verification systems in facing voice conversion attacks.

That's all for my presentation.

Thank you for your attention.