Hello, my name is Anssi Kanervisto, and I'm here to tell you about our work, an initial investigation on optimizing tandem speaker verification and countermeasure systems using reinforcement learning.
So that we are all on the same page: a speaker verification system verifies that the claimed identity and the provided speech sample come from the same person. An automatic speaker verification (ASV) system takes in a claimed identity and a speech sample, and if the identity matches the identity of the person who spoke, then all is good and the system accepts the trial. Likewise, if somebody claims to be someone they are not and provides a speech sample, the system should not let them pass. This is a very simple summary for the many of you who work in this field.
Of course, when it comes to security and systems like this, there are bad guys who want to break the system. For example, somebody could record Tomi's speech with a mobile phone and later use that recorded speech to claim that they are Tomi, by stating Tomi's identity and playing back the audio, and the system will gladly accept that. Previous work, such as the ASVspoof 2017 challenge on replay attacks, has shown that if you do not protect against this, the ASV system will gladly accept this kind of trial even though it should not.
Likewise, you could gather a dataset or collect recordings of somebody speaking, and then use speech synthesis or voice conversion to generate speech that sounds like Tomi, feed it to the system, and it will again accept it just fine. Again, this has been shown in previous competitions: it is a real problem, but one you can also protect against.
This is where countermeasures come in. A countermeasure (CM) system takes in the same speech sample that was provided to the ASV system, and checks that the sample comes from a live human speaker rather than from a playback device, and that it is not synthesized or voice-converted speech. In other words, it verifies bona fide human speech. So if somebody has recorded somebody else's speech and feeds it to the system, that sample is now fed to the countermeasure system as well; the countermeasure says reject, the attacker does not get access, and we keep the attacker outside. The ASVspoof competitions have shown that when you train for these kinds of situations, when you train to detect replay attacks and synthesized speech, you can detect them and it all works fine.
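To make the combined decision rule concrete, here is a toy sketch of the tandem logic described above; a trial passes only if both systems accept it (this is an illustration in Python, not code from the paper):

```python
# Toy illustration of a tandem ASV + countermeasure decision (not the paper's code):
# the claimed identity is granted access only if the speaker verification system
# accepts the trial AND the countermeasure judges the sample to be bona fide speech.
def tandem_decision(asv_accepts: bool, cm_says_bonafide: bool) -> bool:
    return asv_accepts and cm_says_bonafide

# Example: a replayed recording may fool the ASV system, but the countermeasure
# rejects it, so the attacker is kept out.
print(tandem_decision(asv_accepts=True, cm_says_bonafide=False))  # False
```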
However, one problem we had with this setup is that the ASV system and the countermeasure system are trained completely independently of each other. The ASV system has its own dataset, its own loss, its own training protocol, and so on; likewise, the CM system has its own datasets, its own loss, its own training protocol, and its own network architecture. They are trained separately, but then they are evaluated together as one bigger system, with a completely different evaluation metric, and these two systems have never been trained to minimize that evaluation metric; they have only been trained on their own tasks.
So we had this coffee-room idea: when we have this kind of bigger, combined system, what if we trained the ASV and CM systems on the evaluation metric directly? That is, on top of their already existing training, we would also optimize them to minimize (or maximize) the evaluation metric for better results. Sadly, however, it is not so straightforward.
We have a system where the speech is fed to both the ASV and CM systems, and each of them produces an accept or reject label. These decisions are then fed to the evaluation metric, which usually computes error rates, the false rejection rate and the false acceptance rate, and these are combined in various ways, depending on the evaluation metric, to come up with one number showing how good the system as a whole is. Now, assume the two systems are differentiable, for example neural networks, which is quite common these days. If we wanted to minimize the metric, we would need to compute the gradient of the evaluation metric with respect to the parameters of the two systems. Sadly, we cannot compute the gradient over these hard accept/reject decisions, yet they are required for the error rates and thus for the whole metric. For example, the tandem detection cost function (t-DCF), which we will be using later, requires these error rates and weights them in different ways, but we cannot backpropagate from it all the way back to the systems, because the hard decision is not a differentiable operation, and thus we cannot calculate the gradient.
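As a concrete illustration of where the gradient breaks, here is a minimal PyTorch sketch (the names and the threshold are made up for this example, not taken from the paper):

```python
# Minimal PyTorch sketch: a hard accept/reject decision detaches the computation
# graph, so an error rate computed from it carries no gradient back to the scores.
import torch

asv_scores = torch.randn(8, requires_grad=True)   # stand-in for ASV output scores
labels = torch.randint(0, 2, (8,)).float()        # 1 = target trial, 0 = non-target

decisions = (asv_scores > 0.0).float()            # hard thresholding: accept/reject
print(decisions.requires_grad)                    # False: the comparison is not differentiable

false_reject_rate = ((1 - decisions) * labels).sum() / labels.sum().clamp(min=1)
print(false_reject_rate.requires_grad)            # False: no gradient path back to asv_scores
```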
On a related topic, other work has suggested soft versions of error metrics. For example, by softening the F1 score or the area-under-the-curve (AUC) loss, you can come up with a differentiable version of the metric, and then you can do this computation. Softening means that the hard decisions are smoothed so that you end up with a function you can actually take the derivative of, and then you can compute the gradient.
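For intuition about what such softening looks like, here is a hedged sketch of a soft F1 score built from predicted probabilities instead of hard counts; it mirrors the general idea mentioned here, not any specific paper's formulation:

```python
# Hedged sketch of a differentiable ("soft") F1 score: hard 0/1 decisions are
# replaced by probabilities, so the counts, and hence the score, admit gradients.
import torch

def soft_f1(probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    tp = (probs * labels).sum()          # soft true positives
    fp = (probs * (1 - labels)).sum()    # soft false positives
    fn = ((1 - probs) * labels).sum()    # soft false negatives
    return 2 * tp / (2 * tp + fp + fn + 1e-8)

scores = torch.randn(16, requires_grad=True)
labels = torch.randint(0, 2, (16,)).float()
loss = 1.0 - soft_f1(torch.sigmoid(scores), labels)
loss.backward()                          # gradients now flow back to the scores
print(scores.grad is not None)           # True
```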
However, the tandem detection cost function we have here does not have such a soft version.
So instead, we looked into reinforcement learning. In reinforcement learning, a simplified setup looks like this: the agent sees an image, or more generally some information from the game or the environment; the agent chooses an action; the action is then executed in the environment; and depending on whether the outcome of the action is good or not, the agent receives some reward. The goal of this whole setup is to get as much reward as possible, that is, to modify the agent so that it gathers as much reward as possible.
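As a minimal sketch of this interaction loop (generic pseudocode in Python, not tied to any particular library or to the paper's code):

```python
# Generic agent-environment interaction loop: observe, act, receive reward.
def rollout(env, agent, num_steps):
    total_reward = 0.0
    obs = env.reset()                         # the agent observes the environment
    for _ in range(num_steps):
        action = agent.act(obs)               # the agent chooses an action
        obs, reward, done = env.step(action)  # the action is executed in the environment
        total_reward += reward                # good outcomes yield reward
        if done:
            obs = env.reset()
    return total_reward                       # training aims to maximize this quantity
```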
One way to do this is via the gradient: we could take the gradient of the expected reward, that is, the reward averaged over all the different situations in this setup, with respect to the policy, i.e., the agent's parameters. If we could do this, we could then update the agent in the direction that increases the amount of reward. However, here we again have the problem that you cannot really differentiate through the decision step, where we choose one specific action out of many and execute it in the environment, so we cannot compute the gradient directly.
However, there is a thing called the policy gradient, which estimates this gradient. It does so with an equation where, instead of calculating the gradient of the reward directly, it computes the gradient of the log-probabilities of the selected actions and weights them by the reward we got. This has been shown to be quite an effective estimator in reinforcement learning, and it has also been shown that you can replace the reward with other functions and, by running this, still obtain the correct gradient, i.e., the same result with enough samples.
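Written out for the single-decision setting used here, the estimator is the standard REINFORCE / policy-gradient form (textbook notation, with symbols chosen for this summary rather than taken from the slides):

```latex
\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}\!\left[R(a)\right]
  = \mathbb{E}_{a \sim \pi_\theta}\!\left[R(a)\,\nabla_\theta \log \pi_\theta(a)\right]
  \approx \frac{1}{N}\sum_{i=1}^{N} R(a_i)\,\nabla_\theta \log \pi_\theta(a_i),
  \qquad a_i \sim \pi_\theta .
```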
Going back to our tandem optimization, where we had the same problem of hard decisions which we cannot differentiate, we simply apply this policy gradient to get an estimate of the gradient; the equation is more or less the same, just with the terms having a different meaning.
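To make this concrete, here is a hedged PyTorch sketch of a REINFORCE-style tandem update in this spirit; `asv_net`, `cm_net` and the toy reward are illustrative placeholders (the paper uses the t-DCF as the reward, which is not reproduced exactly here):

```python
# Hedged sketch of a REINFORCE-style tandem update: treat each accept/reject as a
# stochastic action sampled from the systems' outputs, and weight the log-probability
# of the sampled decisions by a reward derived from the tandem outcome.
import torch

def tandem_policy_gradient_step(asv_net, cm_net, optimizer, speech, identity, labels):
    asv_prob = torch.sigmoid(asv_net(speech, identity))   # P(accept | trial)
    cm_prob = torch.sigmoid(cm_net(speech))               # P(bona fide | trial)
    asv_dist = torch.distributions.Bernoulli(probs=asv_prob)
    cm_dist = torch.distributions.Bernoulli(probs=cm_prob)
    asv_action = asv_dist.sample()                         # hard accept/reject
    cm_action = cm_dist.sample()

    accepted = asv_action * cm_action                      # tandem accepts only if both accept
    # Placeholder reward: negative error of the tandem decision
    # (the paper uses the t-DCF here; this toy cost merely stands in for it).
    target = labels.float()                                # 1 = bona fide target trial
    reward = -(accepted * (1 - target) + (1 - accepted) * target)

    log_prob = asv_dist.log_prob(asv_action) + cm_dist.log_prob(cm_action)
    loss = -(reward.detach() * log_prob).mean()            # REINFORCE objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```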
We then proceeded to test how well this works, with a rather simple setup. We have two datasets: VoxCeleb1, more specifically its speaker verification part, and ASVspoof 2019 for the synthesized speech, the countermeasure task and its labels. For the ASV task we extract x-vectors using the pretrained Kaldi models,
and for the ASVspoof task we extract LFCC features. These features are fixed; they are not trained further in this setup.
We then train the ASV system and the CM system separately, as is normally done, using these two datasets, and evaluate them together using the t-DCF cost function as presented in the ASVspoof 2019 competition. After this, we take the two pretrained systems, perform the tandem optimization in the manner shown previously, and finally evaluate the results and compare them against the pretrained results to see if it actually helps.
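Schematically, the protocol looks like the following sketch (the function names are illustrative stand-ins, not the actual code released with the paper):

```python
# High-level sketch of the experimental protocol: pretrain both systems as usual,
# measure the joint t-DCF, run the policy-gradient tandem optimization, re-measure.
def run_experiment():
    asv = pretrain_asv(voxceleb1_xvectors)       # ASV trained on its own task
    cm = pretrain_cm(asvspoof2019_features)      # CM trained on its own task
    baseline = evaluate_tandem_tdcf(asv, cm)     # joint evaluation of the pretrained pair

    tandem_optimize(asv, cm)                     # REINFORCE-style joint fine-tuning
    tuned = evaluate_tandem_tdcf(asv, cm)        # re-evaluate after tandem optimization

    return baseline, tuned                       # compare: did the tandem metric improve?
```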
In a very short nutshell: the tandem optimization helps. One way to see this is to look at the learning curve, where on the x-axis you have the number of updates (comparable to a plot of loss versus the number of epochs), and on the y-axis you have the relative change on the evaluation set compared to the pretrained system; zero percent means the metric did not change from the pretrained system. The main metric we wanted to minimize is the minimum normalized tandem detection cost function, and it indeed decreased over time as we run this tandem optimization: from zero percent change it went to minus 25 percent change, so yes, it improved.
We also studied how the individual systems changed over the training. For example, the countermeasure's equal error rate in its own task, detecting whether speech is spoofed or not, also improved, by around 10 percent. Interestingly, though, the ASV equal error rate increased over time. When we evaluated the ASV system on the countermeasure task and the countermeasure on the ASV task, we noticed that after this tandem optimization the two had improved in each other's task: the ASV was better in the countermeasure task, and the countermeasure was also slightly better in the speaker verification task. So we hypothesize that the speaker verification system drifted away from its normal task of detecting the correct speaker and started to detect the countermeasure's spoofed samples instead.
We also compared this to a simple baseline where, instead of using the tandem optimization, we just independently continued training the two systems using the same samples as in the policy gradient method. Basically, we used the same samples: the ASV samples to continue updating the ASV system, and the countermeasure samples to update the countermeasure system, completely independently of each other. We see the same ASV behaviour here, but the countermeasure's equal error rate just exploded in the beginning and then slowly crept back down. In the end, averaged over multiple runs, the policy gradient method improved the results by 26 percent, while fine-tuning improved the results by 7.84 percent; note, though, that the fine-tuning results have a much higher variance than the policy gradient version.
These results are very positive, but as said, this was a very initial investigation, and I highly recommend that you check out the paper for more results and figures. That is all, so thank you for listening, and be sure to check out the paper and the code behind that link.