hello everyone

my name is what's industry and i'm going to present a paper

on sub-band modeling for spoofing detection

in automatic speaker verification

so in the next few slides

of like to do with five from one

speaker

can you three

and compromises

so an automatic speaker verification system

aims at verifying the claimed identity of a speaker

so in this paper

options

a more distant from various it's a

and this speaker

to determine whether both speech samples come from the same speaker if that is the

case

the system accepts the test speech as the claimed speaker

otherwise the system

rejects it as an impostor

so a common application of asv is

for user authentication for example in by classification

recent studies have shown

the systems are not

a lot of the listings of

which can be

from various research and

one of the using synthesis has

and synthetic thing

speaker

second with involves and voice conversion have

to generate the speech of speaker and the colour from the track involved

replay attacks which involve playing back

are created by means of a speaker

and for on the back in one

if question at the voice of target or not

a given x is then used system

so in this paper

our focus is on

replay attacks

for two main reasons

one

because of its simplicity

this form of attack doesn't require any expertise in

signal processing or speech data

second

this form of attack is

difficult to detect because we are actually replaying real speech

so it doesn't use any

computer algorithm to generate speech

so this and ones for all target speaker

so

spoofing countermeasures are primarily developed

two

based on dsps

so it really very fine control

video

so the anti-spoofing problem can be viewed as a binary classifier which

aims to discriminate whether the input speech

is genuine speech

or is spoofed speech

so in our case when we talk about this posting on the major

so what exactly so there's a them in the

so but not in five

what we factors interest you know

independently back speech

so essentially replayed speech carries

different channel characteristics

introduced during playback and re-recording of speech

and also

background noise

involved during this process

these are essentially the cues that we expect a model to exploit for playback speech detection

before i start

on our sub-band modeling framework

i'd like to go over some commonly used approaches towards designing

spoofing countermeasures

which since also

exploiting information of course on

so the first one of them

shown here

is an example of a

gaussian mixture model which is a generative model

as can be seen

given the input speech the power spectrum is extracted

on top of it we extract

acoustic features

for example the asvspoof baseline features

and on these acoustic features

basically

two gmm models one for spoofed and one for genuine

utterances are trained

in order to model the distribution of

these two classes

so this is an example of commonly used gmm model it has shown promising results

in the asvspoof evaluations
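A two-model GMM countermeasure of this kind is commonly scored with a log-likelihood ratio between the genuine and the spoofed model. The talk does not spell out the scoring rule, so the sketch below assumes that convention, diagonal covariances, and already-trained parameters. The names `frame_log_likelihood` and `llr_score` and the toy model values are illustrative, not taken from the talk.

```python
import math
from typing import List, Sequence, Tuple

# One mixture component: (weight, per-dimension means, per-dimension variances).
Component = Tuple[float, Sequence[float], Sequence[float]]


def frame_log_likelihood(x: Sequence[float], gmm: List[Component]) -> float:
    """Log p(x | gmm) for a diagonal-covariance GMM, using log-sum-exp."""
    logs = []
    for weight, means, variances in gmm:
        log_p = math.log(weight)
        for xi, mu, var in zip(x, means, variances):
            log_p += -0.5 * (math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var)
        logs.append(log_p)
    peak = max(logs)
    return peak + math.log(sum(math.exp(v - peak) for v in logs))


def llr_score(frames: Sequence[Sequence[float]],
              genuine_gmm: List[Component],
              spoof_gmm: List[Component]) -> float:
    """Average per-frame log-likelihood ratio; positive values favour genuine."""
    total = sum(
        frame_log_likelihood(f, genuine_gmm) - frame_log_likelihood(f, spoof_gmm)
        for f in frames
    )
    return total / len(frames)


# Made-up single-component models in a 2-dimensional feature space.
genuine_gmm = [(1.0, [0.0, 0.0], [1.0, 1.0])]
spoof_gmm = [(1.0, [3.0, 3.0], [1.0, 1.0])]
utterance = [[0.1, -0.2], [0.3, 0.0]]
score = llr_score(utterance, genuine_gmm, spoof_gmm)
print(score > 0)  # frames near the genuine mean give a positive score
```

Thresholding this score gives the accept or reject decision; the threshold itself would be chosen on development data.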

the other category of countermeasures that we are going to study

is

discriminatively trained cnns

which have shown very

promising results in the twenty

nineteen asvspoof evaluations

so as can be seen this model takes the full band spectrogram as its input

and basically learns to discriminate

between the two classes

so this actually motivates our current work

and in fact the main research question that motivates our work is

do we really need

all the speech

all the information across all the frequency bands

for replay spoofing detection

so our intuition is that maybe

maybe the relevant cues might be somewhere in high frequency regions or maybe

in very low frequency regions

so in this framework what we try to do is understand the importance of

different frequency bands for spoofing detection

and also try to come up with a sub-band combination that improves

the performance

so we test this hypothesis this idea of our framework on two benchmarks

the asvspoof twenty seventeen and

asvspoof twenty nineteen

so this depicts our proposed methodology

it can be explained in two steps

in the first one we basically

split our input into

n

sub-band spectrograms

and train n independent cnns on top of them discriminatively

and likewise what we obtain is we have n independent cnns that are trained

on n sub-band spectrograms
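The first step, cutting the input into uniform non-overlapping sub-bands along the frequency axis, can be sketched as follows. This is a minimal illustration assuming the spectrogram is stored as a list of frequency-bin rows ordered from low to high frequency; `split_into_subbands` and the toy data are hypothetical names and values, not the authors' code.

```python
from typing import List

# Rows are frequency bins (low to high), columns are time frames.
Spectrogram = List[List[float]]


def split_into_subbands(spec: Spectrogram, n_bands: int) -> List[Spectrogram]:
    """Split the frequency axis into n_bands equal, non-overlapping blocks."""
    n_bins = len(spec)
    if n_bands <= 0 or n_bins % n_bands != 0:
        raise ValueError("number of frequency bins must be divisible by n_bands")
    width = n_bins // n_bands
    return [spec[i * width:(i + 1) * width] for i in range(n_bands)]


# Toy spectrogram: 8 frequency bins x 3 frames, each value equal to its bin index.
toy = [[float(b)] * 3 for b in range(8)]
bands = split_into_subbands(toy, 4)
print([len(band) for band in bands])  # number of bins in each sub-band
```

Each returned block would then be the input to its own independently trained model.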

so

this allows us to actually look at

the individual sub-bands

and understand which frequency regions are more discriminative

for spoofing detection on these data sets

so

so this can also be viewed as employing n independent workers

just focusing on

small sub-band information and trying to exploit discriminative information for separating spoofed

and genuine speech

rather than having one cnn which

as in the baseline is having to look at

all the information and also

how to

if the bulletin of are on by means an introduction and while retaining the

most discriminative information what we do we begin by the fast basically be a we

kind of

allow independent cnns to focus on one band

and then

we into a

improve performance by doing so

so in the second step what

we do is

basically take these pretrained models and combine their learned features

and train another classifier on top of all these

different features and then jointly update the weights of the entire framework
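The second step can be pictured as a forward pass that concatenates the per-band features and feeds them to one classifier. The sketch below makes strong simplifying assumptions: `mean_max_encoder` is a hypothetical stand-in for a pretrained sub-band CNN, the classifier is a single logistic layer, and the weights are made up. It only shows the data flow, not the joint weight update described in the talk.

```python
import math
from typing import Callable, List, Sequence

Band = List[List[float]]                 # sub-band spectrogram: bins x frames
Encoder = Callable[[Band], List[float]]  # maps a sub-band to a feature vector


def mean_max_encoder(band: Band) -> List[float]:
    """Hypothetical stand-in for a pretrained sub-band CNN: returns [mean, max]."""
    values = [v for row in band for v in row]
    return [sum(values) / len(values), max(values)]


def concat_forward(bands: Sequence[Band], encoders: Sequence[Encoder],
                   weights: Sequence[float], bias: float) -> float:
    """Concatenate per-band features and apply a logistic output layer."""
    features: List[float] = []
    for band, encode in zip(bands, encoders):
        features.extend(encode(band))
    if len(features) != len(weights):
        raise ValueError("one weight is needed per concatenated feature")
    logit = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-logit))


low_band = [[1.0, 3.0], [2.0, 2.0]]
high_band = [[0.0, 0.0], [0.0, 4.0]]
prob = concat_forward([low_band, high_band],
                      [mean_max_encoder, mean_max_encoder],
                      weights=[0.5, -0.5, 1.0, 0.0], bias=0.0)
print(round(prob, 4))
```

In a real system the encoders and the output layer would be trained together with backpropagation in a deep learning framework, which is what allows the weights of the entire framework to be updated jointly.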

so in it what we do

we are also making use of the whole information but

not using only one cnn

you in

in

n independent cnns and then we give these n independent features

concatenated and then we train

the classifier is trained on

so this is our proposed method which we test on the asvspoof twenty seventeen

and twenty nineteen asvspoof data

so

let me talk about the experimental results we'll start with the baseline

baseline gmm and the cnn

so the gmm baseline is trained on cqcc features

and

the cnn is trained on the spectrogram

as can be seen from the experimental results

we find that the discriminatively trained cnn

in performance

two

sufficiency handcrafted gmm baseline

on both the data sets

and one thing to note on the training dataset is that the increased equal error rate

on gmm is due to the fact that we apply

a preprocessing step

on the audio signals

which involves discarding the cost of p zero value of silence

from evidence only not prior work in interspeech trained and

and that's the reason why our baseline cqcc gmm is quite

different than the

baseline and the for this effect
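The equal error rate quoted throughout is the operating point where the false acceptance rate and the false rejection rate coincide. A minimal sketch of how it could be computed from two score lists is shown below, assuming that higher scores mean "more genuine"; the function name and the toy scores are illustrative, and evaluation toolkits typically interpolate between thresholds rather than picking the closest one.

```python
from typing import Optional, Sequence, Tuple


def equal_error_rate(genuine_scores: Sequence[float],
                     spoof_scores: Sequence[float]) -> Tuple[float, Optional[float]]:
    """Return (eer, threshold) at the threshold where FAR and FRR are closest."""
    best_gap, eer, best_thr = float("inf"), 1.0, None
    for thr in sorted(set(genuine_scores) | set(spoof_scores)):
        # False rejection: genuine trials scored below the threshold.
        frr = sum(s < thr for s in genuine_scores) / len(genuine_scores)
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer, best_thr = gap, (far + frr) / 2.0, thr
    return eer, best_thr


genuine = [2.0, 1.5, 0.4, 3.1]
spoof = [-1.0, 0.5, -0.3, 0.1]
eer, thr = equal_error_rate(genuine, spoof)
print(eer, thr)  # → 0.25 0.5
```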

so now let me talk about our first experiment

where we

split the spectral input into uniform sub-bands

and one thing to note here is that

in this paper we have adopted a very simple strategy that is uniformly

segmenting the input into non-overlapping sub-bands

so we just in the in between two

six

we

we

train independent cnns on each of the two

and then having trained the models independently we combine them to train a

concatenated model or a joint model

so i would be calling this joint framework concat in the experimental results

which basically is that

the framework of this except that we trained

pictures and classifier is trained one

so let's look at the experimental results

as can be seen from this

eer distribution

on the twenty seventeen data set we find that the information in high frequency regions seems

to be more discriminative in contrast to low frequency regions

on the contrary

on the twenty nineteen dataset

we see the low

frequency regions to be more discriminative in contrast to the high frequency region

nonetheless on both the datasets our proposed

of frame more difficult compare

the on off between models anymore

seems to improve performance on both the data sets

so in the second experimental setup what we do now is

further split our input

into four

uniform segments each of two khz sub-bands

now with this

we can look at

more detailed discriminative information rather than just having

four khz sub-bands

so this is our second experimental setup so we have four independent cnns trained

on two khz information

and then there's models are placed on mine

we have been a feature is it into this classifier and the whole framework is

trained

well over the so

you know if all the with this

so

let's take a look at the experimental results so as can be seen on

different this we find that

the

information in between

two khz to six khz seems to be not informative

in contrast to

information present in the last two khz sub-band and the first

so in contrast on the twenty nineteen dataset we find that the first two khz sub

band is more informative

how valuable of the safety

as in previous case

our joint model

gives some of the best results

on both the data sets

so

in the next experimental setup what we do

is we now

split this

input further into

eight sub-bands so each of one khz

so essentially what we do is we train eight independent cnns each on one khz of

information
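The three setups differ only in how finely the same frequency range is cut. The band widths quoted in the talk (four bands of two khz, eight bands of one khz) imply a total range of eight khz; the sketch below assumes a range of 0 to 8000 Hz on that basis and lists the resulting band edges. `subband_edges_hz` is an illustrative helper, not part of the original work.

```python
from typing import List, Tuple


def subband_edges_hz(bandwidth_hz: int, n_bands: int) -> List[Tuple[int, int]]:
    """Return (low, high) edges in Hz for n_bands uniform, non-overlapping bands."""
    if n_bands <= 0 or bandwidth_hz % n_bands != 0:
        raise ValueError("bandwidth must divide evenly into n_bands")
    width = bandwidth_hz // n_bands
    return [(i * width, (i + 1) * width) for i in range(n_bands)]


# Assumed 0-8000 Hz range, split as in the three experiments.
for n in (2, 4, 8):
    print(n, subband_edges_hz(8000, n))
```

With eight bands the last entry is the seven to eight khz band and the first is the zero to one khz band, the two regions singled out in the results.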

and is a this frequency entirely online again to

previous

a framework it is it's data it units all data

do

you can improve the

do so okay performance

so these are the experimental results for one khz sub-band cnns

so this distribution across different sub-bands allows us to actually

understand the impact of different bands

so on the twenty seventeen dataset what this

is

the last one khz information seems to be the most informative

as we obtain an

eer of two point one which is

a low eer in contrast to the other frequency bands and an interesting thing we find is

this isn't between

of different this and we just one two khz and

and six to seven khz is the least informative as can be seen from the high

eer

and the second most informative frequency band seems to be the first one khz

and of course we have our concat model operating on

all eight

which seems to give good performance but then we also tried just the last

seven to eight khz band and the first one khz

which seems to give us the best performance here

on the twenty nineteen dataset what we found is that the first one

khz is most informative

in contrast of the fifty s

so as i mentioned earlier this is due to the fact that the twenty

seventeen and twenty nineteen datasets are completely different so the twenty nineteen is a simulated data

while the twenty seventeen dataset is

the real data that was

recorded and played back

using a

speaker verification it has it all right

so this kind of explains the

difference or mismatch in the behavior

the final set of experiments we performed in this study is in terms of cross dataset

a simple ones

where we took some of the best

models

trained on the twenty seventeen dataset and twenty nineteen dataset and tested

them on the asvspoof twenty nineteen real pa test set

which is a very small set

of

thousand utterances that was

instinctively conditions like the is organized as

and we want to see how these models

perform on realistic conditions as can be seen from the high

error rate distributions for all our models

this suggests that models trained on the current asvspoof datasets don't actually

generalize much to the realistic replay conditions

so this suggests that we might have to think about how we design our

training and validation sets for or standing s

so to conclude

in this paper work with the we will be basically

but at all events and in

by discriminatively training independent cnns on

and

so

if variable a figment

and then these are later on combined

and

an independent classifier is trained on top of that

using the proposed methodology we found improved performance on both the twenty seventeen and twenty nineteen datasets

and an interesting observation but not for or it which is a language that some

of the for this war is that

on the twenty seventeen dataset

the

seven to eight khz frequency information seems to be more informative

which however doesn't hold true on the twenty nineteen dataset

on the twenty nineteen dataset the first one khz information seems to be more informative

and we also found that

these models do not generalize well on the realistic replay conditions

which suggests

that there is still room for improvement in

designing and validating

the datasets for training effective

replay spoofing detection models

so with that i would like to conclude my talk

thank you very much