hello everyone my name is tian xiaohai
i come from the national university of singapore
i will present our recent work on
black box attacks
on automatic speaker verification using feedback controlled voice conversion
this work was done together with my co-authors
i divide my presentation into four parts
the introduction
related work and proposed method
experiments and results
and finally the conclusion
let's start with the introduction
with the development of automatic speaker verification
the speaker verification system has been used in many applications
such as banking
matching authentication
and i have been c applications
however the asv system also faces threats from spoofing attacks
it is found that
the asv system is vulnerable to various kinds of spoofing attacks
to handle this problem
different countermeasures are developed against spoofing attacks
to enhance the security of speaker verification systems
in practice
spoofing attacks
can be realised with different techniques
for example impersonation
replay and synthetic speech
to generate synthetic speech
different
models can be used for example tts
and voice conversion
in this work
we focus on
the attacks
generated by voice conversion
from an attacker's point of view
it is possible to generate a kind of adversarial attack
with feedback from the asv
system
as an impostor the attacker needs some knowledge of the target asv system
in order to improve the spoofing attack
as illustrated here this is an example from image processing
given an image usually
the system recognises it as a panda
however after adding some noise the image is misclassified
by the system
as a gibbon
this shows the potential threat associated with adversarial attacks
in this work we would like to study the threat of adversarial
attacks
on a speaker verification system
this will help to build more robust asv systems
in the future
we study the
spoofing problem from the attacker's perspective
in the non-adversarial attack scenario
the attacker can use a spoofing system to generate spoofed speech
samples
to attack the asv system
while in
the adversarial attack scenario
the attacker updates the spoofing system
with the feedback of the asv and then generates adversarial spoofing
samples
to attack the asv system again
of course
this kind of
adversarial sample
will
pose a more serious threat
to the asv system
with
different levels of
knowledge
there are three types of adversarial attacks
including black box attack
grey box attack
and white box attack
for the black box attack the attacker only has the output
information of the
asv system
for the
grey box attack
the attacker has
information on both the input and output of the asv system
for the white box attack
the attacker has
the full information of the asv system
so such adversarial attacks show that there is a threat
however in real practice the attacker may not
have
as much information as described above
so
the black box attack is more
likely to arise in
reality
so in this
work
we focus on the black box attack
then we go to the related work and proposed method
first we will introduce voice conversion
voice conversion is a
technique that modifies the speaker identity from a source speaker to a target speaker
without changing the linguistic information
in the
conventional framework
the conversion model is
trained on parallel data from source and target speakers
so the conversion model will be
specific to a speaker pair
however for the spoofing attack
a more
practical choice is voice conversion that does not rely on parallel data
for example ppg based voice conversion
the basic idea is to train a feature mapping model between the
speaker independent feature and speaker dependent feature
for example
given a target speech we extract the speaker independent ppg feature and speaker dependent acoustic
feature
then use these two features to train the
conversion model
as the ppg feature
is
speaker
independent
that means
regardless of the speaker as long as the speech content
is the same
the ppg does not change
so
in such a framework
it is easy to achieve many-to-one conversion
and
in this framework the source speech is not required during training
so this will be
easier to use for spoofing attacks
so let's look at the non-adversarial attack scenario
in the non-adversarial attack scenario
during training
the ppg and acoustic features are extracted from the target speech to train the
conversion model
the model is updated
with a loss
calculated between the predicted acoustic features
and
the actual acoustic features from the target speech
during attacking
the
ppg is extracted from the source speech
then
we feed the source ppg into the conversion model to generate the converted acoustic
feature
we use a vocoder to convert the acoustic feature
into the
converted speech samples
to perform
the attack
on the asv system
this is a
typical voice conversion model
it is optimised for speaker similarity and quality
so it is not designed for the asv system
it may be non-optimal
for attacking the asv
for our proposed feedback controlled voice conversion
the main difference is
we provide feedback from the asv system
during training
as negative example
during training for each mini batch
we take the
target speech and extract the ppg
from the target speech to generate the predicted acoustic feature
the first part of the loss is calculated between the predicted acoustic feature and the actual acoustic feature
same as the baseline ppg based voice conversion
and on the other hand
we also use a vocoder to generate the converted speech signal
from the predicted acoustic features
and
feed this
speech signal to the asv system
to get
the asv
feedback as another part of the loss
for the model updating
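The two-part loss described above can be sketched as a weighted sum. This is a minimal illustration and not the authors' implementation. The talk later mentions a combination ratio but does not say which term it weights, so letting `ratio` weight the conversion loss is an assumption. The function names, the toy feature values, and the conversion of an ASV score into a loss (`1 - score`) are also hypothetical.

```python
def mse(predicted, actual):
    """Mean squared error between two equal-length feature vectors."""
    if len(predicted) != len(actual):
        raise ValueError("feature vectors must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


def combined_loss(vc_loss, asv_loss, ratio=0.7):
    """Weighted sum of the conversion loss and the ASV feedback loss.

    Assumption: `ratio` weights the conversion loss and (1 - ratio) the
    ASV feedback; the talk only names a "combination ratio".
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    return ratio * vc_loss + (1.0 - ratio) * asv_loss


# Toy mini-batch: one predicted and one actual acoustic frame.
predicted = [0.2, 0.4, 0.6]
actual = [0.0, 0.5, 0.5]
vc_loss = mse(predicted, actual)

# Hypothetical ASV feedback: a similarity score in [0, 1] returned by the
# black-box system, turned into a loss that shrinks as the score rises.
asv_score = 0.3
asv_loss = 1.0 - asv_score

total = combined_loss(vc_loss, asv_loss, ratio=0.7)
print(round(total, 4))
```

In a black-box setting only the numeric value of `asv_loss` is available, so it can be added to the total but not differentiated through.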
during attacking
it is the same as
the baseline framework
the ppg extractor is used to extract the ppg features from the source speech and we
feed the source ppg into the conversion model to generate
the converted acoustic feature
and
a vocoder is used to generate the converted speech signal to attack the asv system
now let's see
how the combined loss
is used for the
voice conversion model training
as we know
for the black box attack scenario
we do not have knowledge of the relationship between the voice conversion model
and the asv loss
so
there is no
gradient from the asv part
but
the asv loss will
change the combined loss curve
so to leverage the asv feedback signal for the voice conversion model training we use
an adaptive learning rate schedule
based on the loss
on the validation set to achieve convergence
for example
the learning rate will be adjusted
or reduced
once the total loss increases on the validation set
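The learning rate rule just described can be sketched as a small scheduler. This is a minimal sketch under stated assumptions: the class name `ReduceOnIncrease`, the halving factor, the floor `min_lr`, and the comparison against the previous epoch (rather than the best loss so far) are all hypothetical choices, since the talk only says the rate is reduced once the total validation loss increases.

```python
class ReduceOnIncrease:
    """Shrink the learning rate whenever the validation loss goes up."""

    def __init__(self, lr, factor=0.5, min_lr=1e-6):
        if not 0.0 < factor < 1.0:
            raise ValueError("factor must be in (0, 1)")
        self.lr = lr
        self.factor = factor      # assumed reduction factor
        self.min_lr = min_lr      # assumed lower bound
        self.prev_loss = None

    def step(self, val_loss):
        """Record one validation loss and return the (possibly reduced) rate."""
        if self.prev_loss is not None and val_loss > self.prev_loss:
            self.lr = max(self.lr * self.factor, self.min_lr)
        self.prev_loss = val_loss
        return self.lr


# Made-up combined validation losses over five epochs.
scheduler = ReduceOnIncrease(lr=0.001, factor=0.5)
history = [scheduler.step(loss) for loss in [1.0, 0.8, 0.9, 0.7, 0.75]]
print(history)
```

Because the ASV term changes the combined validation loss, it influences when the rate drops even though no gradient flows back from the ASV system.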
let's go to the experiments and the results
for the database used in our experiments it is divided into two parts
the training part and validation part
for training
we train
three models
first
the ppg extractor which is trained out of the target strata
the i-vector extractor trained on a combined
corpus of
switchboard and nist sre corpora from two thousand six to two thousand ten
the conversion model is trained on the asvspoof two thousand nineteen development set
we
choose six target speakers including three male and three female
for each speaker we choose
but hundred and channel utterances
for model training
for the evaluation set we use the asvspoof two thousand nineteen evaluation dataset which contains
sixty seven speakers
we choose twenty utterances per speaker
so in total we have one thousand
and
three hundred and forty utterances
we have two systems
to compare in our experiments
one is the ppg based voice conversion system without feedback
another is
the feedback controlled voice conversion system which is our proposed
system
empirically
the combination ratio
is set to zero point seven
for both models
we use the same network structure which consists of two blstm
layers
with
one can find one two
hidden units in each layer
the network input for
both systems is a forty two dimensional ppg feature
while the
output dimension is two hundred and forty
consisting of
an eighty dimensional mel spectrum
with its dynamic and acceleration features
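The output dimensionality can be checked with a short sketch, assuming the two hundred and forty dimensions are an eighty dimensional mel spectrum stacked with its first-order (dynamic) and second-order (acceleration) differences. The central-difference window, the edge handling, and the function names are assumptions for illustration; the talk does not state how the dynamic features were computed.

```python
def deltas(frames):
    """First-order differences along time using a central difference.

    Edge frames reuse their nearest neighbour, so the output has the same
    number of frames as the input.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out


def stack_static_and_dynamic(mel_frames):
    """Concatenate static, delta and delta-delta features per frame."""
    d1 = deltas(mel_frames)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(mel_frames, d1, d2)]


# Toy input: 5 frames of an 80-dimensional mel spectrum.
mel = [[float(t * 80 + k) for k in range(80)] for t in range(5)]
features = stack_static_and_dynamic(mel)
print(len(features), len(features[0]))
```

Each frame keeps its 80 static values and gains 80 delta and 80 delta-delta values, which gives the 240-dimensional target per frame.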
the vocoder is used for speech signal reconstruction
this figure shows the training curves
on the
training and validation sets
one line shows the baseline ppg based voice conversion
the
orange line shows the
feedback controlled voice conversion with a combination ratio of zero point five
another line shows the
feedback controlled voice conversion with a
combination ratio of zero point seven
from the training curves we can see
that
the feedback controlled voice conversion
generally achieves a lower overall loss during training on both the training
and validation sets especially for the
asv loss
and according to these curves
comparing the combined loss
for the combination ratios of
zero point five
and zero point seven
we choose zero point seven as our
setting
for the attack experiments
the objective evaluation uses the equal error rate of the speaker verification system
from the first row
we can see that
the asv system performs
very effectively when the impostor trials are used
resulting in a low equal error rate
and the performance
decreases significantly
when the ppg based voice conversion
attacks are performed
where the equal error rate increases to over
twenty five percent for
all the scenarios
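The equal error rate used in this evaluation is the operating point where the false acceptance rate equals the false rejection rate. The sketch below is one simple way to estimate it from two score lists; it is an illustration with made-up scores, not the evaluation script used in the work, and the function name and threshold scan are assumptions.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Return the EER, scanning every observed score as a threshold.

    A trial is accepted when its score is >= the threshold.
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap, eer = None, None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # False rejection: genuine trials scored below the threshold.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        # False acceptance: impostor or spoofed trials at or above it.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


genuine = [0.9, 0.8, 0.7, 0.6]
zero_effort = [0.1, 0.2, 0.3, 0.4]   # made-up impostor scores
converted = [0.5, 0.65, 0.75, 0.3]   # made-up voice conversion scores

print(equal_error_rate(genuine, zero_effort))
print(equal_error_rate(genuine, converted))
```

With these toy numbers the zero-effort impostors are perfectly separable from the genuine trials, while the converted scores overlap the genuine ones and raise the error rate, which mirrors the trend reported in the talk.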
and
it is also shown that
the proposed feedback controlled voice conversion
is able to further increase the equal error rate
which shows the vulnerability of the asv system to adversarial attacks
we also
use two figures as an example to show the effectiveness of our proposed
adversarial attack
the red line shows the impostor score distribution and the blue line shows the
score distribution of the genuine trials
and the yellow line shows the score distribution of the ppg based
voice conversion baseline
and
the purple line shows the score distribution of our proposed method
we can see our proposed method can push the
score
towards the genuine region
which shows the effectiveness of our proposed method
then let's go to the conclusion
in this work
we formulate an adversarial attack scenario with the feedback controlled voice conversion
system
which effectively
degrades the speaker verification system performance
we also evaluated the proposed
black box adversarial attack framework on the
asvspoof two thousand nineteen corpus
which is widely used for spoofing system benchmarking
we also provide
an analytical study
the proposed framework exposes the weak links
of the common speaker verification systems
in facing
voice conversion attacks
that's all for my presentation
thank you for your attention