0:00:16hi everyone i'm john allen from computer research institute of montreal
0:00:20today i'm going to talk about a multi-condition training strategy for countermeasures against
0:00:27spoofing attacks
0:00:29to speaker recognizers
0:00:30this is the joint work we did wrong wanted to and channel four
0:00:35in this presentation i'm going to provide an overview of our work
0:00:40on end-to-end spoofing detection
0:00:44employing a tdnn architecture that utilizes data augmentation to increase
0:00:50the amount of
0:00:51training data for improving
0:00:54performance on the unseen test data
0:01:04the outline of my talk is as follows i will start with the
0:01:09introduction and background then i'm going to talk about
0:01:13spoofing detection data augmentation
0:01:16baselines used for this task
0:01:20and end-to-end approaches to spoofing detection using the tdnn architecture
0:01:26and finally i'm going to provide some results for performance evaluation
0:01:32and i'm going to conclude my talk
0:01:39here is the introduction and background
0:01:43given a pair of recordings
0:01:45the goal of
0:01:47a speaker verification system is to determine
0:01:50whether the recordings are from the same speaker or
0:01:55from two different speakers
0:01:57in order to do so
0:01:59a speaker verification system utilizes a set of recognizable
0:02:04and verifiable voice characteristics
0:02:09which are normally considered unique and specific to a person
0:02:15these characteristics are normally extracted
0:02:19in the feature extraction module of a speaker verification system
0:02:23in a controlled setting
0:02:25speaker verification systems perform very well
0:02:29but
0:02:30their performance degrades in real-world settings
0:02:34where an impostor can pretend to be a genuine speaker by forging
0:02:39a genuine speaker's voice recording
0:02:43or when there is a mismatch between training and test environments
0:02:48in this work we are mainly concerned with the
0:02:51forging of a genuine speaker's voice by an impostor
0:02:59in a speaker verification system
0:03:02the claimed identity
0:03:05can be genuine or forged by an impostor
0:03:09whose goal is to gain illegitimate access to the system
0:03:15the manipulation of an authentication system by impostors is normally known as spoofing
0:03:22speaker verification systems are vulnerable to spoofing attacks generated by
0:03:29replay
0:03:31speech synthesis voice conversion
0:03:34and impersonation
0:03:37except impersonation all other three attacks are normally considered major threats
0:03:44to a speaker verification system
0:03:47among the three major attack types
0:03:50replay is known as the physical access attack whereas
0:03:55speech synthesis
0:03:57and voice conversion attacks are known as the logical access attacks
0:04:05next i'm going to talk about spoofing detection
0:04:10fortunately all the attacks
0:04:14discussed in the previous slide that means replay
0:04:19speech synthesis and voice conversion leave some traces in the converted speech in the
0:04:25form of audible artifacts
0:04:28spoofing detection techniques normally use these audible artifacts
0:04:33in order to distinguish
0:04:35spoofed speech from genuine speech
0:04:40to make speaker verification systems robust against spoofing attacks
0:04:46speaker verification and spoofing detection systems can be
0:04:51connected in cascade or in parallel
0:04:53in the left side of the figure spoofing detection follows
0:04:58the speaker verification system
0:05:01where the recording of the claimed identity is
0:05:05initially passed through the speaker verification system to make a verification decision
0:05:10if the identity
0:05:12is accepted by the verification system
0:05:15it is then passed through the spoofing detection system
0:05:18to find out whether the claimed identity is actually
0:05:22genuine or spoofed
0:05:26in the right side of the figure the speaker verification system follows
0:05:31the spoofing detection system in this case
0:05:35the
0:05:37claimed identity is checked first and if it is found genuine only then it is passed
0:05:42to the verification system to make a verification decision
0:05:48finally
0:05:51speaker verification and spoofing detection systems can be connected in parallel
0:05:57in this case
0:05:59the fused score of
0:06:02speaker verification and spoofing detection
0:06:06systems is used to make the accept or reject decision
0:06:11the advantage of this approach is that only one threshold is required
0:06:17to make the verification decision
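The parallel combination described here can be sketched as a weighted sum of the two scores compared against a single threshold. This is only an illustration: the talk does not say which fusion rule was used, and the names `fuse_scores`, `accept`, the weight `alpha`, and the threshold are hypothetical.

```python
def fuse_scores(asv_score: float, cm_score: float, alpha: float = 0.5) -> float:
    """Linear fusion of a speaker verification score and a spoofing detection score.

    alpha is a hypothetical fusion weight; alpha=1 uses only the verification score.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * asv_score + (1.0 - alpha) * cm_score


def accept(asv_score: float, cm_score: float, threshold: float, alpha: float = 0.5) -> bool:
    """Accept the claimed identity when the fused score reaches the single threshold."""
    return fuse_scores(asv_score, cm_score, alpha) >= threshold
```

Because both systems contribute to one number, only one operating point has to be tuned, which is the advantage mentioned in the talk.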
0:06:20like the twenty fifteen and twenty seventeen editions of the asvspoof
0:06:24challenge
0:06:26in the twenty nineteen edition of the asvspoof challenge
0:06:33the participants were asked to build standalone spoofing detection systems irrespective of
0:06:38a speaker verification system
0:06:42but in the twenty nineteen edition of the asvspoof challenge the organizers provided the verification
0:06:48scores for the participants
0:06:51so the participants can evaluate their spoofing detection scores
0:06:55in terms of tandem detection cost function when used alongside the verification system
0:07:08next i'm going to talk about data augmentation
0:07:13modern machine learning models such as deep-learning architectures
0:07:19may have billions of parameters and normally require a large amount of data for training
0:07:26but in
0:07:28most application cases having a large amount of data is normally not possible
0:07:35for example consider the case of asvspoof challenges where the training data provided
0:07:41to the participants are not sufficient
0:07:46to expect generalized performance using
0:07:50deep-learning based approaches
0:07:53so
0:07:56to use deep learning architectures we need to increase the training data
0:08:00the process of increasing the amount and the diversity of the training data is
0:08:06normally known as
0:08:09data augmentation
0:08:10data augmentation normally serves
0:08:12two purposes
0:08:13one purpose is domain adaptation or domain generalization
0:08:18in this case the main goal is to compensate for environmental
0:08:22mismatch between training and test data
0:08:25and this approach is widely used in speech based applications
0:08:30for example speaker recognition and speech recognition
0:08:34another purpose of data augmentation is regularization
0:08:38the main goal is to improve performance on unseen test data
0:08:43by
0:08:44increasing the training data
0:08:46in this work our purpose was to
0:08:49do regularization
0:08:53for this work we tried to adopt a data augmentation strategy that preserves the
0:08:59artifacts of the spoofing attacks
0:09:02and at the same time does not
0:09:05use any external data such as
0:09:08noise reverberation et cetera
0:09:11the data augmentation strategy adopted in this work is presented
0:09:17in this figure
0:09:19on the slide
0:09:21here additional training data were created by using speed perturbation with
0:09:27perturbation factors of
0:09:29zero point nine and one point one
0:09:32and low-pass and high-pass filtering of the training data
0:09:38by doing the data augmentation in this case we were able to increase the training data
0:09:44to five times the original training data
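A minimal sketch of this five-fold augmentation (original, two speed-perturbed copies, one low-pass and one high-pass copy) is given here under stated assumptions. Only the factors 0.9 and 1.1 come from the talk. The linear-interpolation resampler and the one-pole filters with coefficient `alpha` are simple stand-ins, since the actual resampling method and filter designs are not specified.

```python
def speed_perturb(signal, factor):
    """Resample by linear interpolation so the signal plays `factor` times faster."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = int(len(signal) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor                      # fractional read position in the input
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return out


def low_pass(signal, alpha=0.2):
    """One-pole low-pass filter (hypothetical stand-in for the unspecified filter)."""
    out, prev = [], 0.0
    for x in signal:
        prev = alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out


def high_pass(signal, alpha=0.2):
    """Complementary high-pass: the input minus its low-pass version."""
    return [x - y for x, y in zip(signal, low_pass(signal, alpha))]


def augment(signal):
    """Return the original plus four transformed copies (five versions in total)."""
    return {
        "original": list(signal),
        "speed_0.9": speed_perturb(signal, 0.9),
        "speed_1.1": speed_perturb(signal, 1.1),
        "low_pass": low_pass(signal),
        "high_pass": high_pass(signal),
    }
```

None of these transformations mixes in external noise or reverberation, which matches the stated goal of leaving the spoofing artifacts in the signal.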
0:09:51next i'm going to talk about
0:09:53the speech representations we used
0:09:56for this work
0:09:59in the course of
0:10:00the twenty fifteen and twenty seventeen editions of
0:10:04the asvspoof challenges
0:10:06and after the evaluations it became almost clear that the most effective countermeasures
0:10:13for spoofing detection are the local speech representations
0:10:18by local i mean frame level features
0:10:21which are typically extracted over ten millisecond intervals
0:10:25for the asvspoof twenty nineteen challenge task
0:10:30we
0:10:31used three types of local speech representations
0:10:35one of them is
0:10:37the linear frequency cepstral coefficient feature
0:10:40and the various steps to compute this feature are presented in the left
0:10:46hand side of the figure
0:10:48another
0:10:50feature is the
0:10:53constant q cepstral coefficient feature
0:10:57which was found very effective for the
0:11:00twenty fifteen edition of the asvspoof challenge task
0:11:04and
0:11:05we used
0:11:06this feature also in this task
0:11:10as this feature was provided along
0:11:13with the baseline
0:11:15and to compute the cqcc feature the various steps are presented in the right
0:11:21hand side of the figure
0:11:27another local speech representation we used for this work is the
0:11:33product spectrum which is the product of the power spectrum and the group delay function
0:11:39this feature incorporates both the amplitude and phase spectral components
0:11:44and the various steps for computing this feature are presented
0:11:50in this figure on the slide
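The product of power spectrum and group delay can be computed without phase unwrapping. If X is the DFT of a frame x(n) and Y is the DFT of n·x(n), the group delay is (X_R·Y_R + X_I·Y_I)/|X|², so multiplying by the power spectrum |X|² leaves X_R·Y_R + X_I·Y_I. The sketch here covers only that step for a single frame. Windowing, filterbank and cepstral steps from the slide are not reproduced, and the function names are hypothetical.

```python
import cmath


def dft(x):
    """Plain O(N^2) discrete Fourier transform, sufficient for a short frame."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def product_spectrum(frame):
    """Power spectrum times group delay, computed as X_R*Y_R + X_I*Y_I per bin."""
    X = dft(frame)
    Y = dft([t * v for t, v in enumerate(frame)])   # DFT of n * x(n)
    return [a.real * b.real + a.imag * b.imag for a, b in zip(X, Y)]
```

For a unit impulse delayed by three samples the power spectrum is 1 and the group delay is 3 in every bin, so every value of the product spectrum is 3.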
0:11:55next i'm going to talk about the baselines used for spoofing detection in this task
0:12:02in order to make a comparison of performance we used two baselines provided by the
0:12:07organizers one of the baselines is the cqcc feature with a gmm classifier
0:12:12and another baseline is the
0:12:14lfcc feature with the same gmm classifier
0:12:19besides we also created our own baselines one of our baselines is mfcc with the
0:12:25gmm and another is the i-vector plda system
0:12:28both of our baselines were
0:12:31built using the kaldi toolkit
0:12:37in the figure of this slide is presented the gmm based framework for spoofing
0:12:42detection
0:12:44in this framework
0:12:45genuine and spoof gmm models are trained
0:12:49using genuine and
0:12:51spoofed speech training data
0:12:54then given a test recording the genuine or spoof decision is made based on the likelihood ratio
0:13:02computed using the trained gmm models
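The likelihood-ratio decision can be illustrated with two diagonal-covariance GMMs. The sketch assumes each model is a tuple `(weights, means, variances)` and that the score is the average per-frame log-likelihood ratio. The talk does not give the model sizes or the exact scoring normalization, so these are assumptions, as are the function names.

```python
import math


def gmm_log_likelihood(frame, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for x, m, v in zip(frame, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        component_logs.append(log_p)
    peak = max(component_logs)                     # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))


def llr_score(frames, genuine_gmm, spoof_gmm):
    """Average per-frame log-likelihood ratio; higher means more likely genuine."""
    total = 0.0
    for frame in frames:
        total += gmm_log_likelihood(frame, *genuine_gmm)
        total -= gmm_log_likelihood(frame, *spoof_gmm)
    return total / len(frames)
```

A recording would be labelled genuine when `llr_score` exceeds a chosen threshold, and spoofed otherwise.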
0:13:09next i'm going to talk about the end-to-end approaches that we used for
0:13:13spoofing detection in this task
0:13:19in an end-to-end approach the local speech representations are directly mapped to a
0:13:25spoofing detection score
0:13:27in this approach for modeling we adopted the tdnn
0:13:31the detail of this architecture is presented in
0:13:34table one of the slide
0:13:37in this architecture several
0:13:40one dimensional convolutional layers are used
0:13:44to encode
0:13:47the input local speech representations into local countermeasures
0:13:51a statistics pooling layer is then used to summarize the sequence of
0:13:56local countermeasures into a global countermeasure
0:14:00finally the global countermeasure is projected into a final output score
0:14:07through an affine transformation learned along with the complete model
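The pooling and projection steps can be sketched as follows. The sketch assumes the statistics are the per-dimension mean and standard deviation over frames, which is the common choice in tdnn systems. The talk does not state which statistics were pooled, and the weights and bias here are hypothetical placeholders for parameters that would be learned.

```python
import math


def statistics_pooling(frames):
    """Summarize a sequence of frame-level vectors as [means..., stds...]."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds


def affine_score(pooled, weights, bias):
    """Project the pooled (global) vector to a single output score."""
    return sum(w * p for w, p in zip(weights, pooled)) + bias
```

Pooling makes the output independent of the number of frames, so recordings of any length map to a vector of fixed size before the final projection.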
0:14:16for training
0:14:18binary cross entropy loss is used as in a standard binary classification setting
0:14:24as
0:14:27we have seen previously the training data is quite unbalanced for the asvspoof challenge
0:14:32data
0:14:33the spoofed data is almost nine times the bona fide training data
0:14:39so
0:14:41mini-batches were created in such a way that genuine examples are sampled
0:14:46several times per epoch to ensure
0:14:50the mini-batches are balanced
0:14:53training was done using the stochastic gradient descent algorithm with a mini-batch size
0:14:59of sixteen
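One way to build balanced mini-batches of sixteen is to walk through the spoofed data once per epoch and fill the other half of each batch by sampling genuine examples with replacement. This is a sketch of that idea only. The exact sampling scheme used in the work is not described, and the label convention (1 for genuine, 0 for spoofed) and function names are assumptions.

```python
import math
import random


def binary_cross_entropy(prob, label):
    """Binary cross entropy for one example; prob is the predicted P(genuine)."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)          # avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def balanced_minibatches(genuine, spoofed, batch_size=16, seed=0):
    """Yield one epoch of batches with equal numbers of genuine and spoofed items.

    Each spoofed item is used at most once; genuine items are drawn with
    replacement, so each one appears several times per epoch on average.
    """
    rng = random.Random(seed)
    half = batch_size // 2
    spoof_pool = list(spoofed)
    rng.shuffle(spoof_pool)
    batches = []
    for start in range(0, len(spoof_pool) - half + 1, half):
        spoof_part = spoof_pool[start:start + half]
        genuine_part = [rng.choice(genuine) for _ in range(half)]
        batch = [(g, 1) for g in genuine_part] + [(s, 0) for s in spoof_part]
        rng.shuffle(batch)
        batches.append(batch)
    return batches
```

With a nine-to-one imbalance, each genuine example is drawn about nine times per epoch on average.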
0:15:01only have to selection is also employed for this does
0:15:09next i'm going to present some results on the asvspoof twenty nineteen
0:15:13challenge development and evaluation data
0:15:17the metrics used for performance evaluation are the normalized minimum tandem
0:15:23detection cost function and the equal error rate
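Of the two metrics, the equal error rate is simple enough to sketch directly: it is the operating point where the miss rate on genuine trials equals the false alarm rate on spoofed trials. The version here scans the observed scores as candidate thresholds and returns the average of the two rates where they are closest. It is an approximation for illustration, not the official challenge scoring script, and the tandem detection cost function is not covered.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate EER; higher scores are assumed to indicate genuine speech."""
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # Miss: a genuine trial scored below the threshold (rejected).
        miss = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False alarm: a spoofed trial scored at or above the threshold (accepted).
        false_alarm = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2.0
    return eer
```

Perfectly separated score sets give an equal error rate of zero, and fully overlapping ones approach one half.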
0:15:28for experiments on the logical and physical access tasks
0:15:32the
0:15:33asvspoof twenty nineteen challenge data were used
0:15:44for the physical access task
0:15:46the spoofed data were generated using simulated replay attacks whereas for the
0:15:52logical access task the spoofed data were generated using various
0:15:56speech synthesis and voice conversion algorithms
0:16:00table two presents
0:16:03the number of genuine and spoofed recordings and the number of speakers
0:16:07contained in the training development and evaluation partitions of the
0:16:12logical access and physical access tasks
0:16:17we can see from this table that the training data is quite unbalanced
0:16:25physical access spoofing detection results in terms of tandem detection cost function and equal error
0:16:31rate
0:16:32on the development as well as evaluation sets are reported in
0:16:38table three and five
0:16:41we can observe from the presented results that
0:16:45data augmentation helped
0:16:46to improve performance on both test sets
0:16:54logical access spoofing detection results
0:16:57in this slide are presented the logical access spoofing detection results in terms of
0:17:02tandem detection cost function and equal error rate
0:17:06on the development as well as evaluation sets
0:17:11for the logical access task
0:17:14data augmentation is found effective only on the development set
0:17:18and overall we can see that the end-to-end approach employing the tdnn architecture provided
0:17:24better performance
0:17:26than the baselines
0:17:29on both logical and physical access tasks
0:17:36finally in conclusion we can say
0:17:40data augmentation is found helpful specifically for the physical access task for spoofing detection
0:17:47employing deep-learning architectures
0:17:52in order to do data augmentation for spoofing detection we have to make sure the
0:17:57signal transformations employed in the
0:18:00data augmentation technique preserve the artifacts introduced by the
0:18:06spoofing algorithms
0:18:09the end-to-end approach employing tdnn and data augmentation outperformed the
0:18:15baselines
0:18:21the end-to-end approach
0:18:22with the tdnn architecture with
0:18:26or without data augmentation
0:18:28outperformed all the baselines on both the logical access and physical access tasks
0:18:35data augmentation by speed perturbation and
0:18:41filtering
0:18:42basically low-pass and high-pass filtering is
0:18:46found useful for the physical access task but for the
0:18:49logical access task
0:18:51speed perturbation is found harmful
0:18:56feature normalization use of voice activity detection and a reduction of the number of
0:19:02filters to less than
0:19:04sixty four
0:19:06are not recommended for the spoofing detection task
0:19:15thank you very much for your attention