Hi everyone. I'm Jahangir Alam from the Computer Research Institute of Montreal (CRIM).

Today I'm going to talk about a multi-condition training strategy for countermeasures against spoofing attacks on speaker recognizers.

This is joint work with my colleagues at CRIM.

In this presentation I'm going to provide an overview of our work on end-to-end spoofing detection, employing a TDNN model that utilizes data augmentation to increase the amount of training data and improve performance on unseen test data.

Here is the outline of my talk. First, I will start with the introduction and background. Then I'm going to talk about spoofing detection, data augmentation, and the baselines used for this task, and the end-to-end approach to spoofing detection using a TDNN architecture. Finally, I'm going to provide some results for performance evaluation, and then I'm going to conclude my talk.

First, the introduction and background.

Given a pair of recordings, the goal of a speaker verification system is to determine whether the recordings are from the same speaker or from two different speakers.

In order to do so, a speaker verification system utilizes a set of recognizable and verifiable voice characteristics, which are normally considered unique and specific to a person. These characteristics are normally extracted in the feature extraction module of a speaker verification system.

In a controlled setting, speaker verification systems perform very well, but their performance degrades in real-world settings, where an impostor can pretend to be a genuine speaker by forging a genuine speaker's voice recording, or where there is a mismatch between the training and test environments.

In this work we are mainly concerned with the forging of a genuine speaker's voice by an impostor. In a speaker verification system, the claimed identity can be genuine, or forged by an impostor whose goal is to gain illegitimate access to the system. This manipulation of an authentication system by impostors is normally known as spoofing.

Speaker verification systems are vulnerable to spoofing attacks generated by replay, speech synthesis, voice conversion, and impostor impersonation. Except for impersonation, the other three attacks are normally considered major threats to a speaker verification system. Among these three major attack types, replay is known as a physical access attack, whereas speech synthesis and voice conversion attacks are known as logical access attacks.

Next, I'm going to talk about spoofing detection.

Fortunately, all the attack styles discussed in the previous slide, that is, replay, speech synthesis, and voice conversion, leave some traces in the converted speech, in the form of audible artifacts. Spoofing detection techniques normally use these audible artifacts in order to distinguish spoofed speech from genuine speech.

To make speaker verification systems robust against spoofing attacks, the speaker verification and spoofing detection systems can be connected in cascade or in parallel.

In the left side of the figure presented in this slide, spoofing detection follows the speaker verification system: the recording of the claimed identity is first passed through the speaker verification system to make a verification decision. If the identity is accepted by the verification system, it is then passed through the spoofing detection system to find out whether the claimed identity is actually genuine, in order to produce the final decision.

In the other setup of the figure, the spoofing detection system precedes the speaker verification system. In this case, the claimed identity is first checked by the spoofing detection system; only if it is found genuine is it then passed to the verification system to make the verification decision.

Finally, the speaker verification and spoofing detection systems can be connected in parallel. In this case, the fused score of the speaker verification and spoofing detection systems is used to make the accept-or-reject decision. The advantage of this approach is that only one threshold is required to make the verification decision.
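As a rough Python sketch of the parallel configuration just described: the speaker verification (SV) score and the countermeasure (CM) score are fused with a weighted sum, so a single threshold decides accept or reject. The fusion weight and threshold values here are illustrative placeholders, not values from our systems.

```python
def fuse_scores(sv_score, cm_score, alpha=0.5):
    """Weighted-sum fusion of SV and CM scores (higher = more genuine).
    alpha is an illustrative weight, normally tuned on development data."""
    return alpha * sv_score + (1.0 - alpha) * cm_score

def accept(sv_score, cm_score, threshold=0.0, alpha=0.5):
    """Single-threshold accept/reject decision on the fused score."""
    return fuse_scores(sv_score, cm_score, alpha) > threshold

# A trial with a strong SV score but a very poor CM score is still
# rejected in one operation, which a cascade would only catch in its
# second stage.
print(accept(sv_score=2.0, cm_score=-3.0))  # -> False
```
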

Like the 2015 and 2017 editions of the ASVspoof challenge, the 2019 edition of the ASVspoof challenge required the participants to build standalone spoofing detection systems, irrespective of a speaker verification system. But in the 2019 edition, the ASVspoof challenge organizers also provided verification scores to the participants, so that the participants could evaluate their spoofing detection scores in terms of the tandem detection cost function (t-DCF) when used alongside the verification system.

Next I'm going to talk about data augmentation.

Modern machine learning models, such as deep learning architectures, may have billions of parameters and normally require a large amount of data for training. But in most application cases, such a large amount of data is normally not available. For example, consider the case of the ASVspoof challenges, where the training data provided to the participants are not sufficient to expect generalized performance using deep learning approaches.

So, to use deep learning architectures, we need to increase the training data. The process of increasing the amount and the diversity of training data is normally known as data augmentation.

Data augmentation normally serves two purposes. One purpose is domain adaptation, or domain generalization; in this case the main goal is to compensate for the environmental mismatch between training and test data. This approach is widely used in speech-based applications, for example speaker recognition and speech recognition. The other purpose of data augmentation is regularization; here the main goal is to improve performance on unseen test data by increasing the training data. In this work, our purpose was regularization.

For this work, we tried to adopt a data augmentation strategy that preserves the artifacts of the spoofing attacks and, at the same time, does not use any external data such as noise, reverberation, et cetera.

The data augmentation strategy adopted in this work is presented in the figure on this slide. Here, additional training data were created by applying speed perturbation, with perturbation factors of 0.9 and 1.1, and low-pass and high-pass filtering, to the training data. By doing data augmentation in this way, we were able to increase the training data to five times the original amount.
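The pipeline can be sketched in numpy as below. The speed factors 0.9 and 1.1 are the ones from the talk; the filter cutoffs and filter length are illustrative assumptions, since the talk does not specify them, and the interpolation-based resampler is a stand-in for a proper speed-perturbation tool.

```python
import numpy as np

def speed_perturb(x, factor):
    """Resample by linear interpolation: factor 0.9 stretches the signal,
    1.1 compresses it (a simple stand-in for sox-style speed perturbation)."""
    n_out = int(round(len(x) / factor))
    t = np.arange(n_out) * factor
    return np.interp(t, np.arange(len(x)), x)

def sinc_lowpass(x, cutoff, num_taps=101):
    """FIR low-pass filter: Hamming-windowed sinc; cutoff is in normalized
    frequency (0 < cutoff < 0.5)."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = 2 * cutoff * np.sinc(2 * cutoff * n) * np.hamming(num_taps)
    return np.convolve(x, h, mode="same")

def sinc_highpass(x, cutoff, num_taps=101):
    """High-pass as the spectral complement of the low-pass."""
    return x - sinc_lowpass(x, cutoff, num_taps)

def augment(x):
    """Original plus four transformed copies -> five times the data,
    matching the 5x increase described in the talk."""
    return [x,
            speed_perturb(x, 0.9),
            speed_perturb(x, 1.1),
            sinc_lowpass(x, 0.25),    # cutoff chosen for illustration
            sinc_highpass(x, 0.05)]   # cutoff chosen for illustration

copies = augment(np.random.randn(16000))  # one second at 16 kHz
print(len(copies))  # -> 5
```

Note that none of these transformations injects external signals (noise or reverberation), which is the constraint stated above.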

Next I'm going to talk about the speech representations we used for this work.

Over the course of the 2015 and 2017 editions of the ASVspoof challenges, and after their evaluations, it became almost clear that the most effective countermeasures for spoofing detection use low-level speech representations. By low-level, I mean frame-level features, which are typically extracted over ten-millisecond intervals. For the ASVspoof 2019 challenge task, we used three widely used low-level speech representations.

One of them is the linear frequency cepstral coefficient (LFCC) feature; the various steps to compute this feature are presented in the left-hand side of the figure. Another feature is the constant Q cepstral coefficient (CQCC) feature, which was found very effective in the 2015 edition of the spoofing challenge task. We used this feature in this task as well, as it was provided with one of the baselines. The various steps to compute the CQCC feature are presented in the right-hand side of the figure.

The third low-level speech representation we used for this work is the product spectrum, which is the product of the power spectrum and the group delay function. This feature incorporates both the amplitude and phase spectral components. The various steps for computing this feature are presented in the figure on this slide.
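The core of the product spectrum can be sketched compactly: with Y the FFT of the time-ramped signal n·x[n], the group delay is tau = (X_R·Y_R + X_I·Y_I) / |X|^2, so the product spectrum |X|^2·tau reduces to X_R·Y_R + X_I·Y_I. This sketch covers only a single frame; the framing, windowing, and cepstral steps from the figure are omitted.

```python
import numpy as np

def product_spectrum(frame, n_fft=512):
    """Product spectrum of one frame: P(w) = |X(w)|^2 * tau(w), where
    tau is the group delay. With Y = FFT(n * x[n]), this simplifies to
    X_R*Y_R + X_I*Y_I, avoiding an explicit division by |X|^2."""
    x = np.asarray(frame, dtype=float)
    n = np.arange(len(x))          # time ramp over the real samples
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(n * x, n_fft)
    return X.real * Y.real + X.imag * Y.imag

ps = product_spectrum(np.random.randn(400))  # 25 ms frame at 16 kHz
print(ps.shape)  # -> (257,)
```

As a sanity check, a pure impulse delayed by k samples has unit power spectrum and constant group delay k, so its product spectrum is k at every bin.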

Next I'd like to talk about the baselines used for spoofing detection in this task.

In order to make performance comparisons, we used the two baselines provided by the organizers: one baseline is the CQCC feature with a GMM classifier, and the other baseline is the LFCC feature with the same GMM classifier. Besides these, we also created our own baselines: one of our baselines is MFCC with a GMM, and the other is an i-vector PLDA system. Both of our own baselines were built with the Kaldi toolkit.

The figure in this slide presents the GMM-based framework for spoofing detection. In this framework, genuine and spoofed GMM models are trained using genuine and spoofed speech training data. Then, given a test recording, the genuine-or-spoofed decision is made based on the likelihood ratio computed with the two trained GMM models.
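The likelihood-ratio scoring can be sketched in numpy as follows. The single-component "GMMs" and their parameters here are invented purely for illustration; the actual baselines train multi-component GMMs with EM on the genuine and spoofed training features.

```python
import numpy as np

def gmm_logpdf(X, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    X: (T, D); weights: (K,); means, variances: (K, D)."""
    diff = X[:, None, :] - means[None, :, :]              # (T, K, D)
    log_comp = -0.5 * (np.sum(diff**2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    return np.logaddexp.reduce(np.log(weights) + log_comp, axis=1)

def llr_score(X, gmm_genuine, gmm_spoof):
    """Average per-frame log-likelihood ratio; positive => genuine."""
    return np.mean(gmm_logpdf(X, *gmm_genuine) - gmm_logpdf(X, *gmm_spoof))

# Toy single-component models (illustrative parameters only).
genuine = (np.array([1.0]), np.zeros((1, 2)), np.ones((1, 2)))
spoof   = (np.array([1.0]), np.full((1, 2), 3.0), np.ones((1, 2)))

X = np.zeros((10, 2))                      # frames near the genuine mean
print(llr_score(X, genuine, spoof) > 0)    # -> True
```

The decision is then a single threshold on this score, exactly as the likelihood-ratio framework in the figure describes.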

Next I'm going to talk about the end-to-end approach that we used for spoofing detection in this task.

In an end-to-end approach, low-level speech representations are typically mapped directly to a spoofing detection score. For modeling in this approach, we adopted a TDNN; the detailed architecture is presented in Table 1 of the slide. In this architecture, several one-dimensional convolutional layers encode the input low-level speech representation into local countermeasure features. A statistics pooling layer is then used to summarize the sequence of local countermeasure features into a global countermeasure embedding. Finally, the global countermeasure embedding is projected into the final output score through an affine transformation. For training the complete model, the binary cross-entropy loss is used, as in a standard binary classification setting.
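The pooling and scoring steps just described can be sketched in numpy; the convolutional encoder itself is omitted, and the 64-dimensional feature size is an illustrative assumption rather than the dimension from Table 1.

```python
import numpy as np

def stats_pooling(H):
    """Statistics pooling: summarize a variable-length sequence of local
    features H with shape (T, D) into one fixed-size vector of shape
    (2*D,) by concatenating the per-dimension mean and std deviation."""
    return np.concatenate([H.mean(axis=0), H.std(axis=0)])

def score_head(g, W, b):
    """Final affine projection of the pooled embedding to a scalar score."""
    return float(W @ g + b)

H = np.random.randn(200, 64)   # e.g. 200 frames of 64-dim TDNN outputs
g = stats_pooling(H)
print(g.shape)                 # -> (128,)
```

Because the pooled vector has a fixed size regardless of T, utterances of any duration can be scored by the same affine head.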

As we have seen previously, the training data is quite unbalanced for the ASVspoof challenge data: there is almost nine times as much spoofed data as genuine training data. So mini-batches are created in such a way that genuine examples are sampled several times per epoch, to ensure that the mini-batches are balanced. Training is carried out using the stochastic gradient descent algorithm with a mini-batch size of sixteen. Model selection for early stopping is also employed for this task.
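A minimal sketch of such a balanced batching scheme follows; the sampling-with-replacement strategy is one plausible reading of "genuine examples are sampled several times per epoch", not necessarily our exact implementation.

```python
import numpy as np

def balanced_batches(genuine_idx, spoof_idx, batch_size=16, seed=0):
    """Yield index mini-batches with equal genuine/spoof halves. The small
    genuine set is resampled with replacement, so each epoch reuses
    genuine examples several times while covering the spoofed set once."""
    rng = np.random.default_rng(seed)
    half = batch_size // 2
    spoof = rng.permutation(spoof_idx)
    for start in range(0, len(spoof) - half + 1, half):
        g = rng.choice(genuine_idx, size=half, replace=True)
        yield np.concatenate([g, spoof[start:start + half]])

# Toy example with a roughly 1:9 genuine-to-spoofed ratio, as in ASVspoof.
genuine = np.arange(10)
spoof = np.arange(10, 100)
batches = list(balanced_batches(genuine, spoof))
print(len(batches), len(batches[0]))  # -> 11 16
```

Each batch then contributes equally-weighted genuine and spoofed frames to the binary cross-entropy loss, despite the 1:9 class imbalance in the corpus.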

Next, I'm going to present some results on the ASVspoof 2019 challenge development and evaluation data.

The metrics used for our performance evaluation are the normalized minimum tandem detection cost function (t-DCF) and the equal error rate (EER).
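Of the two metrics, the EER is simple enough to sketch directly (the t-DCF additionally requires the ASV system's error rates and cost parameters, so it is omitted here):

```python
import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    """Equal error rate: sweep the decision threshold over all observed
    scores and return the operating point where the false-rejection rate
    (genuine rejected) and false-acceptance rate (spoof accepted) meet."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    frr = np.array([np.mean(genuine_scores < t) for t in thresholds])
    far = np.array([np.mean(spoof_scores >= t) for t in thresholds])
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2.0

# Perfectly separated scores give 0% EER.
print(compute_eer(np.array([2.0, 3.0, 4.0]),
                  np.array([-2.0, -1.0, 0.0])))  # -> 0.0
```
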

For the experiments on the logical access and physical access tasks, the ASVspoof 2019 challenge data were used. For the physical access task, the spoofed data were generated using simulated replay attacks, whereas for the logical access task, the spoofed data were generated using various speech synthesis and voice conversion algorithms.

Table 2 presents the number of genuine and spoofed recordings and the number of speakers contained in the training, development, and evaluation partitions of the logical access and physical access tasks. We can see from this table that the training data is quite unbalanced in both tasks.

This slide presents the physical access spoofing detection results, in terms of the tandem detection cost function and the equal error rate, on the development as well as the evaluation sets, as reported in Tables 3 and 5. We can observe from the presented results that data augmentation helps to improve performance on both test sets.

Next, the logical access spoofing detection results. This slide presents the logical access spoofing detection results in terms of the tandem detection cost function and the equal error rate, on the development as well as the evaluation sets. For the logical access task, data augmentation is found effective only on the development set. Overall, we can see that the end-to-end approach employing the TDNN architecture provided better performance than the baselines on both the logical and physical access tasks.

Finally, in conclusion, we can say that data augmentation is found helpful, specifically for the physical access task, for spoofing detection employing deep learning architectures. In order to do data augmentation for spoofing detection, we have to make sure that the signal transformations employed by the data augmentation technique preserve the artifacts introduced by the spoofing algorithms.

The end-to-end approach employing a TDNN architecture, both with and without data augmentation, outperformed all the baselines on both the logical access and physical access tasks.

Data augmentation by speed perturbation and filtering, that is, low-pass and high-pass filtering, is found useful for the physical access task; but for the logical access task, speed perturbation is found harmful.

Feature normalization, the use of voice activity detection, and a number of filters smaller than sixty-four are not recommended for the spoofing detection task.

thank you very much for your attention