Hi everyone, I'm Jahangir Alam from the Computer Research Institute of Montreal (CRIM). Today I am going to talk about a multi-condition training strategy for countermeasures against spoofing attacks to speaker recognizers. This is joint work with João Monteiro and Tiago Falk.
In this presentation I am going to provide an overview of our work on an end-to-end approach employing a TDNN that utilizes data augmentation to increase the amount of training data, improving performance on unseen test data.
Here is the outline of my talk. First I will start with the introduction and background. Then I am going to talk about spoofing detection, data augmentation, the baselines used for this task, and the end-to-end approach to spoofing detection using the TDNN architecture. Finally I am going to provide some results for performance evaluation, and then I will conclude my talk.
First, the introduction and background. Given a pair of recordings, the goal of a speaker verification system is to determine whether the recordings are from the same speaker or from two different speakers. In order to do so, a speaker verification system utilizes a set of recognizable and verifiable voice characteristics, which are normally considered unique and specific to a person. These characteristics are normally extracted in the feature extraction module of a speaker verification system. In a controlled setting, speaker verification systems perform very well, but their performance degrades in real-world settings, where an impostor can pretend to be a genuine speaker by forging a genuine speaker's voice recording, or where there is a mismatch between the training and test environments.
In this work we are mainly concerned with the forging of a genuine speaker's voice by an impostor. In a speaker verification system, the claimed identity can be genuine or forged by an impostor whose goal is to gain illegitimate access to the system. This manipulation of an authentication system by impostors is normally known as spoofing.
Speaker verification systems are vulnerable to spoofing attacks generated by replay, speech synthesis, voice conversion, and impostor impersonation. Except for impersonation, the other three attacks are normally considered major threats to a speaker verification system. Among the three major attack types, replay is known as a physical access attack, whereas the speech synthesis and voice conversion attacks are known as logical access attacks.
Next I am going to talk about spoofing detection. Fortunately, all the attack styles discussed in the previous slide, that means replay, speech synthesis, and voice conversion, leave some traces in the manipulated speech in the form of audible artifacts. Spoofing detection techniques normally use these audible artifacts in order to distinguish spoofed speech from genuine speech.
To make speaker verification systems robust against spoofing attacks, the speaker verification and spoofing detection systems can be combined in several ways. In the left side of the figure presented on this slide, the speaker verification system is followed by the spoofing detection system: the recording with the claimed identity is first passed through the speaker verification system to make the verification decision, and if the identity is accepted by the verification system, it is then passed through the spoofing detection system to find out whether the claimed identity is actually genuine, in order to accept or reject it. In the right side of the figure, the order is reversed: the spoofing detection system is followed by the speaker verification system. In this case, the spoofing detection system checks the claimed identity first, and only if the claimed identity is found genuine is it passed to the verification system to make the verification decision. Finally, the speaker verification and spoofing detection systems can be connected in parallel. In this case, the fused score of the speaker verification and spoofing detection systems is used to make the accept-or-reject decision. The advantage of this approach is that only one decision threshold is required to make the verification decision.
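The three combination strategies described above can be sketched as simple decision rules. This is an illustrative sketch only, not the challenge systems themselves; the score conventions, thresholds, and fusion weight are hypothetical (higher scores mean "accept"/"genuine").

```python
# Illustrative sketch of the three ways to combine a speaker verification
# (ASV) score with a spoofing countermeasure (CM) score.
# All thresholds and the fusion weight are hypothetical placeholders.

def cascade_asv_then_cm(asv_score, cm_score, asv_thr=0.0, cm_thr=0.0):
    """ASV first; only accepted trials are checked by the countermeasure."""
    if asv_score < asv_thr:
        return False           # identity rejected by the verifier
    return cm_score >= cm_thr  # accepted only if the trial also looks genuine

def cascade_cm_then_asv(asv_score, cm_score, asv_thr=0.0, cm_thr=0.0):
    """Countermeasure first; only genuine-looking trials reach the verifier."""
    if cm_score < cm_thr:
        return False           # trial flagged as spoofed
    return asv_score >= asv_thr

def parallel_fusion(asv_score, cm_score, weight=0.5, thr=0.0):
    """Parallel combination: a single fused score, hence a single threshold."""
    fused = weight * asv_score + (1.0 - weight) * cm_score
    return fused >= thr
```

Note that only the parallel variant reduces the whole decision to one threshold on one fused score, which is the advantage mentioned above.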
Like the 2015 and 2017 editions of the spoofing challenge, in the 2019 edition of the ASVspoof challenge the participants were asked to build a spoofing detection system irrespective of any speaker verification system. But in the 2019 edition, the challenge organizers also provided the verification scores to the participants, so that the participants could evaluate their spoofing detection scores in terms of the tandem detection cost function (t-DCF) when used alongside the verification system.
Next I am going to talk about data augmentation. Modern machine learning models such as deep learning architectures may have billions of parameters and normally require a large amount of data for training, but in most application cases having a large amount of data is normally not possible. As an example, consider the case of the ASVspoof challenges, where the training data provided to the participants are not sufficient to expect generalized performance using deep learning approaches. So, to use deep learning architectures, we need to increase the training data. The process of increasing the amount and the diversity of the training data is normally known as data augmentation. Data augmentation normally serves two purposes. One purpose is domain adaptation or domain generalization; in this case the main goal is to compensate for the environmental mismatch between training and test data, and this approach is widely used in speech-based applications, for example speaker recognition and speech recognition. The other purpose of data augmentation is regularization; here the main goal is to improve performance on unseen test data by increasing the training data.
In this work our purpose was regularization. For this work we tried to adopt a data augmentation strategy that preserves the artifacts of the spoofing attacks and at the same time does not use any external data such as noise, reverberation, et cetera. The data augmentation strategy adopted in this work is presented in the figure on this slide. Here, additional training data were created by using speed perturbation with the two perturbation factors of 0.9 and 1.1, and by low-pass and high-pass filtering of the training data. By doing the augmentation in this way, we were able to increase the training data to five times the original training data.
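The five-fold augmentation described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: linear interpolation stands in for a proper resampler, and the windowed-sinc filter lengths and cutoff frequencies are assumptions.

```python
import numpy as np

def speed_perturb(x, factor):
    """Resample the waveform so it plays back `factor` times faster.
    Linear interpolation stands in for a proper resampler here."""
    n_out = int(len(x) / factor)
    positions = np.arange(n_out) * factor  # fractional read positions
    return np.interp(positions, np.arange(len(x)), x)

def fir_filter(x, cutoff, num_taps=101, highpass=False):
    """Windowed-sinc low-pass (or high-pass) FIR filter.
    `cutoff` is a normalized frequency in (0, 0.5); values are illustrative."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = 2 * cutoff * np.sinc(2 * cutoff * n) * np.hamming(num_taps)
    if highpass:                       # spectral inversion of the low-pass
        h = -h
        h[(num_taps - 1) // 2] += 1.0
    return np.convolve(x, h, mode="same")

def augment(x):
    """Original plus four perturbed copies -> 5x the training data."""
    return [x,
            speed_perturb(x, 0.9), speed_perturb(x, 1.1),
            fir_filter(x, cutoff=0.25),                 # low-pass copy
            fir_filter(x, cutoff=0.25, highpass=True)]  # high-pass copy
```

All four transformations act on the waveform itself without mixing in external material, which is consistent with the goal of preserving the spoofing artifacts.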
Next I am going to talk about the speech representations used for this work. Over the course of the 2015 and 2017 editions of the ASVspoof challenges, and after their evaluations, it became almost clear that the most effective countermeasures for spoofing detection use localized speech representations. By localized we mean frame-level features, which are typically extracted over ten-millisecond intervals. For the ASVspoof 2019 challenge tasks we used three widely used localized speech representations. One of them is the linear frequency cepstral coefficient (LFCC) feature; the various steps to compute this feature are presented in the left-hand side of the figure.
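The standard LFCC pipeline (framing, power spectrum, linearly spaced triangular filterbank, log compression, DCT) can be sketched roughly as below. The frame sizes, filter count, and cepstral order are illustrative assumptions, not the exact challenge configuration.

```python
import numpy as np

def lfcc(x, n_fft=512, frame_len=400, hop=160, n_filters=20, n_ceps=20):
    """Minimal LFCC sketch: frame -> power spectrum -> linearly spaced
    triangular filterbank -> log -> DCT-II. Sizes are illustrative."""
    # Frame the signal with a Hamming window
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(frame_len)

    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2  # (n_frames, n_fft//2+1)

    # Linearly spaced triangular filterbank (this is what makes it an "L"FCC,
    # as opposed to the mel-spaced filterbank of MFCCs)
    n_bins = n_fft // 2 + 1
    edges = np.linspace(0, n_bins - 1, n_filters + 2)
    fbank = np.zeros((n_filters, n_bins))
    k = np.arange(n_bins)
    for m in range(n_filters):
        lo, ctr, hi = edges[m], edges[m + 1], edges[m + 2]
        fbank[m] = np.clip(np.minimum((k - lo) / (ctr - lo),
                                      (hi - k) / (hi - ctr)), 0, None)

    log_energy = np.log(power @ fbank.T + 1e-10)

    # DCT-II to decorrelate, keeping the first n_ceps coefficients
    m_idx = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                  (2 * m_idx + 1) / (2 * n_filters)))
    return log_energy @ dct.T
```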
Another feature is the constant Q cepstral coefficient (CQCC) feature, which was found very effective for the 2015 edition of the spoofing challenge task. We used this feature in this task as well, as this feature was provided with the baseline. The various steps to compute the CQCC feature are presented in the right-hand side of the figure.
The third localized speech representation we used for this work is the product spectrum, which is the product of the power spectrum and the group delay function. This feature incorporates both the amplitude and phase spectral components, and the various steps for computing this feature are presented in the figure on this slide.
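For one already-windowed frame, the product spectrum can be sketched as below, using the standard identity that the group delay is computable from the DFTs of x[n] and n·x[n]. This is a per-frame sketch under that assumption, not the exact feature extractor used in the work.

```python
import numpy as np

def product_spectrum(frame, n_fft=None):
    """Product spectrum of one (already windowed) frame:
    |X(w)|^2 * tau(w), where the group delay tau(w) is obtained from the
    DFTs of x[n] and n*x[n] as Re{X(w) * conj(Y(w))} / |X(w)|^2."""
    n_fft = n_fft or len(frame)
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)
    Y = np.fft.rfft(n * frame, n_fft)  # DFT of the time-weighted frame
    power = np.abs(X) ** 2
    group_delay = np.real(X * np.conj(Y)) / (power + 1e-12)
    return power * group_delay         # amplitude and phase information combined
```

As a sanity check, for a pure delayed impulse the group delay equals the delay at every frequency, so the product spectrum is flat.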
Next I would like to talk about the baselines used for spoofing detection in this task. In order to make performance comparisons, we used the two baselines provided by the organizers: one of the baselines is the CQCC feature with a GMM classifier, and the other baseline is the LFCC feature with the same GMM classifier. Besides these, we also created our own baselines: one of our baselines is MFCC with a GMM, and the other is an i-vector PLDA system. Both of our baselines were built with the Kaldi toolkit. The figure on this slide presents the GMM-based framework for spoofing detection. In this framework, genuine and spoofed GMM models are trained using genuine and spoofed speech training data. Then, given a test recording, the genuine-versus-spoofed decision is made based on the likelihood ratio computed using the two trained GMM models.
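The GMM scoring step can be sketched as follows: the utterance score is the average frame log-likelihood under the genuine model minus that under the spoofed model. The diagonal-covariance parameterization is a sketch assumption; in the actual systems the GMMs are trained with EM on the challenge training data.

```python
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.
    frames: (T, D); weights: (C,); means, variances: (C, D)."""
    # log N(x | mu_c, diag(var_c)) for every frame/component pair -> (T, C)
    diff2 = (frames[:, None, :] - means[None, :, :]) ** 2
    comp = -0.5 * np.sum(diff2 / variances + np.log(2 * np.pi * variances),
                         axis=-1)
    comp += np.log(weights)
    # log-sum-exp over components, then average over frames
    m = comp.max(axis=1, keepdims=True)
    return float(np.mean(m[:, 0] + np.log(np.exp(comp - m).sum(axis=1))))

def llr_score(frames, genuine_gmm, spoof_gmm):
    """Likelihood-ratio score: positive favours the genuine model."""
    return gmm_loglik(frames, *genuine_gmm) - gmm_loglik(frames, *spoof_gmm)
```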
Next I am going to talk about the end-to-end approach that we used for spoofing detection in this task. In an end-to-end approach, localized speech representations are typically mapped directly to a spoofing detection score. In this approach, for modeling we adopted the TDNN; the detailed architecture is presented in Table 1 on the slide. In this architecture, several one-dimensional convolutional layers encode the input localized speech representation into localized countermeasure features. A statistics pooling layer is then used to summarize this sequence of localized countermeasure features into a global countermeasure feature. Finally, the global countermeasure feature is projected into the final output score through an affine transformation, trained along with the complete model.
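The last two stages of that pipeline can be sketched as below: statistics pooling concatenates the per-dimension mean and standard deviation of the (T, D) feature sequence into a (2D,) global feature, and an affine layer maps it to a scalar score. The weights here are random placeholders; in the real system everything is learned end to end.

```python
import numpy as np

def statistics_pooling(local_feats):
    """Summarize a (T, D) sequence of local countermeasure features into a
    single (2*D,) global feature: per-dimension mean and standard deviation."""
    mean = local_feats.mean(axis=0)
    std = local_feats.std(axis=0)
    return np.concatenate([mean, std])

def output_score(global_feat, w, b):
    """Final affine projection to a scalar spoofing-detection score."""
    return float(global_feat @ w + b)

# Hypothetical usage with random placeholder weights:
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 64))  # T=200 local features of dimension 64
pooled = statistics_pooling(frames)  # -> (128,)
score = output_score(pooled, rng.normal(size=128), 0.0)
```

Because the pooling is over time, the same model handles recordings of any duration.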
For training, the binary cross-entropy loss is used, as in a standard binary classification setting. As we have seen previously, the training data is quite unbalanced for the ASVspoof challenge data: there is almost nine times as much spoofed as genuine training data. So, mini-batches were created in such a way that genuine examples are sampled several times per epoch; this was done to ensure that the mini-batches are balanced.
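The oversampling scheme just described can be sketched as building one epoch's sample order with the genuine indices repeated to roughly match the spoofed count (about 9:1 here). The function and variable names are hypothetical.

```python
import random

def balanced_epoch_indices(genuine_idx, spoof_idx, seed=0):
    """Build one epoch's sample order: genuine examples are repeated so the
    two classes contribute roughly equally, then everything is shuffled so
    each mini-batch drawn from the list is approximately balanced."""
    repeats = max(1, round(len(spoof_idx) / max(1, len(genuine_idx))))
    epoch = list(spoof_idx) + list(genuine_idx) * repeats
    rng = random.Random(seed)
    rng.shuffle(epoch)
    return epoch
```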
Training is carried out using the stochastic gradient descent algorithm with a mini-batch size of sixteen. Model selection is also employed for this task.
Next I am going to present some results on the ASVspoof 2019 challenge development and evaluation data. The metrics used for our performance evaluation are the normalized minimum tandem detection cost function (t-DCF) and the equal error rate (EER).
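The EER is the operating point where the false acceptance rate (spoofed trials scoring above the threshold) equals the false rejection rate (genuine trials scoring below it). A minimal sketch, scanning candidate thresholds over the observed scores:

```python
import numpy as np

def equal_error_rate(genuine_scores, spoof_scores):
    """EER: find the threshold where the false acceptance rate (FAR) and
    false rejection rate (FRR) are closest, and return their average."""
    genuine = np.asarray(genuine_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, spoof]))
    far = np.array([(spoof >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))  # closest FAR/FRR crossing
    return (far[i] + frr[i]) / 2.0
```

The t-DCF additionally weighs countermeasure and verification errors by application-dependent costs and priors, so it cannot be reduced to a snippet this small.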
For the experiments on the logical and physical access tasks, the ASVspoof 2019 challenge data were used. For the physical access task, the spoofed data are generated using simulated replay attacks, whereas for the logical access task, the spoofed data are generated using various speech synthesis and voice conversion algorithms.
Table 2 presents the number of genuine and spoofed recordings and the number of speakers contained in the train, development, and evaluation partitions of the logical access and physical access tasks. We can see from this table that the training data is quite unbalanced. On this slide, the physical access spoofing detection results in terms of the tandem detection cost function and the equal error rate on the development as well as the evaluation sets are reported in Tables 3 and 5. We can observe from the presented results that data augmentation helps to improve performance on both test sets.
Now the logical access spoofing detection results. This slide presents the logical access spoofing detection results in terms of the tandem detection cost function and the equal error rate on the development as well as the evaluation sets. For the logical access task, data augmentation is found effective only on the development set. Overall, we can see that the end-to-end approach employing the TDNN architecture provided better performance than the baselines on both the logical and physical access tasks.
Finally, in conclusion, we can say that data augmentation is found helpful, specifically for the physical access task, for spoofing detection employing deep learning architectures. In order to do data augmentation for spoofing detection, we have to make sure that the signal transformations employed in the data augmentation technique preserve the artifacts introduced by the spoofing algorithms. The end-to-end approach employing the TDNN, both with and without data augmentation, outperformed all the baselines on both the logical access and physical access tasks.
Data augmentation by speed perturbation and filtering, basically low-pass and high-pass filtering, is found useful for the physical access task, but for the logical access task speed perturbation is found harmful. Feature normalization, the use of voice activity detection, and reducing the number of filters to fewer than sixty-four are not recommended for the spoofing detection task. Thank you very much for your attention.