My name is [inaudible]. The title of this talk is "Text-Dependent Speaker Verification Using a Small Development Set".

Okay. So this is the background for this work: in 2010, a speaker recognition evaluation was held by Wells Fargo Bank. The evaluation focused mostly on text-dependent speaker verification, and IBM Research participated in it.

We presented the results of this evaluation at last Interspeech, and they were quite satisfactory. However, there was some criticism regarding the setup of the evaluation, because the development set was quite large: about two hundred speakers, with four sessions per speaker. The criticism was that for many practical applications, it is not practical for customers to collect such a large dataset.

So it was very interesting to see what the results of the technology would be when using a small dataset. The reduced dataset was specified as consisting of one hundred speakers, with only one session per speaker. So there is no way to model multi-session variability; there is only one session per speaker.

Okay, so the outline of the talk is as follows: first I will quickly describe the evaluation, then I will describe the speaker verification systems that we use, then we will talk about how we coped with the reduced development set, and finally we will present results.

Okay, so there were three text-dependent authentication conditions in the evaluation. The first one is denoted the global condition, where a global digit string, such as "zero" through "nine", is used for authentication. The second authentication condition uses a speaker-dependent password, also made of digit strings; this is denoted the speaker condition.

Now, of course, there is the issue of whether the password is known to the impostor or not. In the evaluation, the assumption is that the impostor knows the password: all the trials use the same digit string as the target password.

The last condition is called the prompted condition, where a prompted random digit string is used for authentication. This is the hardest condition to handle accurately, but it is the most resilient against attacks such as playback of recordings.
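As a compact summary of the protocol, here is the setup expressed as an illustrative Python mapping; this encoding is mine, not part of the evaluation specification.

```python
# Illustrative summary of the three text-dependent authentication conditions
# as described in the talk; the keys and wording are mine, for reference only.
CONDITIONS = {
    "global":   "one global digit string (e.g. 'zero' through 'nine') "
                "shared by all speakers",
    "speaker":  "a speaker-dependent digit-string password, assumed known "
                "to the impostor in the evaluation",
    "prompted": "a random digit string prompted at test time; hardest to "
                "handle accurately, but most resilient to playback attacks",
}
```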


Okay, so basically the Wells Fargo corpus looks like this: there are seven hundred fifty speakers, of which two hundred were used for development and five hundred fifty for evaluation. The data was recorded over four weeks, in four sessions for each speaker: two landline and two cellular. Each session consists of all of these authentication conditions, plus a lot of additional data that we are going to use in the future, such as free text instead of digit strings.

Okay, so for the global condition we use three repetitions of the password for enrollment. Basically, if someone wants to enroll in the system, he has to say the password three times, for example "zero" through "nine", and for verification he says it just one time.

The development data is supposed to be used as follows. For the global condition, we may use the same digit strings as in evaluation: if the password is "zero" through "nine", we will use repetitions of "zero" through "nine" to build the models. For the speaker and prompted conditions, we are not allowed to use repetitions of the same digit strings. The reduced development set consists of one hundred of the speakers, with a single session each.

The development speakers are thus each recorded in a single session. We were also allowed to use any other publicly available resources, such as NIST or Switchboard data, on top of these datasets.

Okay, so these are the systems used for the evaluation. We use three text-independent systems: the first one is a joint factor analysis (JFA) based system, the second one is an i-vector based system, and the third one is a GMM-NAP system. We also use a text-dependent system, which is HMM-supervector based with NAP compensation, and currently we use this system only for the global condition. The final score is a fusion of the scores of all these systems, which are weighted using a simple rule-based scheme.
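To make the fusion step concrete, here is a minimal sketch in Python. The system names and weight values are placeholders of mine; the talk only says that the final score is a weighted combination of the subsystem scores, with simple rule-based weights.

```python
# Minimal sketch of the score fusion described above (not the actual
# implementation). Each subsystem produces a score for a trial; the final
# score is a weighted sum. All weight values here are hypothetical.

SYSTEM_WEIGHTS = {
    "jfa": 0.25,       # hypothetical weight
    "ivector": 0.25,   # hypothetical weight
    "gmm_nap": 0.25,   # hypothetical weight
    "hmm_nap": 0.25,   # text-dependent system, used only for the global condition
}

def fuse_scores(scores, weights):
    """Weighted linear fusion over the subsystems present for this trial."""
    return sum(weights[name] * s for name, s in scores.items() if name in weights)

# Example trial with made-up scores; for the speaker and prompted conditions
# the HMM-based system would simply be absent from the dict.
trial = {"jfa": 1.2, "ivector": 0.8, "gmm_nap": 2.1, "hmm_nap": 1.7}
print(fuse_scores(trial, SYSTEM_WEIGHTS))
```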

Okay, just a few details about the GMM-NAP system. It is quite standard, but we have two specific modifications, which were also presented at Interspeech: the first one is robust scoring, and the second one is an approximation used in scoring.

We built one variant of the system using only telephone NIST data, without using the Wells Fargo data for building the system, and another variant that does use the Wells Fargo data. For the variant that does not use it in training, score normalization is actually still done using the Wells Fargo data.

The same goes for the i-vector based system: it is trained from the same data sources, and it only uses the Wells Fargo data for score normalization.

The NAP system, on the other hand, actually makes extensive use of the development data, the Wells Fargo data: we train the UBM and the NAP from the development data, and we match the text as much as possible. For example, for the global condition we train the UBM and NAP just from the same text that is being used in verification. For the speaker and prompted conditions we are not allowed to do that, so we just use digit strings, but not the exact text. We found that we gain a lot by doing this.

We also use a variant of NAP, which we call two-wire NAP: on top of removing the channel subspace, we also remove the top components of the inter-speaker variability subspace, because we have consistently found over the years that this helps.
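As an illustration, here is a minimal sketch of this projection in Python. It assumes the channel and inter-speaker subspaces are given as matrices with orthonormal columns (estimated elsewhere), and the number of removed inter-speaker components is a placeholder, since the talk does not specify it.

```python
import numpy as np

# Two-wire NAP sketch: standard NAP projects the channel (within-speaker)
# subspace out of a supervector; this variant additionally removes the top
# few components of the inter-speaker variability subspace.

def remove_subspace(v, basis):
    """Project out the subspace spanned by the orthonormal columns of basis."""
    return v - basis @ (basis.T @ v)

def two_wire_nap(supervector, channel_basis, speaker_basis, n_speaker_dims=2):
    v = remove_subspace(supervector, channel_basis)            # standard NAP
    v = remove_subspace(v, speaker_basis[:, :n_speaker_dims])  # extra removal
    return v
```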

We also use a geometric-mean-based kernel for comparing supervectors. And we do score normalization, again using the Wells Fargo data.
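As an illustration of what score normalization against development data can look like, here is a plain z-norm sketch in Python; the talk does not specify the exact scheme (it may well be something richer such as ZT-norm), so treat this as the basic idea only.

```python
import numpy as np

# Z-norm sketch: a raw verification score is shifted and scaled by the
# statistics of the target model's scores against a cohort of development
# (here, Wells Fargo) utterances. Building the cohort itself is outside
# the scope of this sketch.
def znorm(raw_score, cohort_scores):
    cohort = np.asarray(cohort_scores, dtype=float)
    return (raw_score - cohort.mean()) / cohort.std()
```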

The HMM-supervector based system is very similar to the GMM-NAP system. The only difference is that instead of extracting GMM supervectors we extract HMM supervectors, and the rest of the system is the same. Basically, the HMM supervectors are obtained as follows: instead of training a UBM, we train a speaker-independent HMM from the development data. Then, to extract the supervector for a session, we use that session's data to estimate a session-dependent HMM using MAP adaptation, and we just take the GMM means from the different states, normalize them, and concatenate them.
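Here is a minimal sketch of that extraction in Python. It assumes the session's frames have already been aligned to HMM states (for example, by forced alignment) with per-Gaussian posteriors; the relevance factor and the variance normalization are assumptions of mine for illustration.

```python
import numpy as np

def map_adapt_means(prior_means, frames, posteriors, r=16.0):
    """Relevance-MAP update of the GMM means of one HMM state.

    prior_means: (G, D); frames: (T, D); posteriors: (T, G).
    """
    n = posteriors.sum(axis=0)                    # soft counts per Gaussian
    f = posteriors.T @ frames                     # first-order statistics
    ml_means = f / np.maximum(n, 1e-8)[:, None]   # ML estimate of the means
    alpha = (n / (n + r))[:, None]                # data-dependent interpolation
    return alpha * ml_means + (1.0 - alpha) * prior_means

def hmm_supervector(states, aligned_data):
    """states: list of (prior_means, prior_vars), one pair per HMM state;
    aligned_data: list of (frames, posteriors) per state for one session."""
    parts = []
    for (mu0, var0), (frames, post) in zip(states, aligned_data):
        mu = map_adapt_means(mu0, frames, post)
        parts.append((mu / np.sqrt(var0)).ravel())  # variance-normalize
    return np.concatenate(parts)                    # one long supervector
```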

Okay, so now I will talk about how we were able to cope with the reduced dataset. If we look at the four different systems, we can see that the JFA and i-vector based systems should not be very sensitive to the reduced dataset, because we are not using the development data at all, except for score normalization. So for the moment we did not work on these systems; we just use them as they are and see what happens.

For the NAP-based systems the problem is much more serious, because they use the development set very extensively. First of all, we have less data for training the UBM and the speaker-independent HMM. Moreover, we do not have any multi-session speakers, so if we want, for example, to train NAP, we will not be able to estimate inter-session variability, and score normalization also becomes problematic. So we had to do something for these two systems, the GMM-based NAP and the HMM-based NAP systems. As we will see in the results slides, we focus on these systems because they work much better than JFA and i-vectors on this task, so it is very important to handle this.

Okay, so for the GMM-based NAP system, the first component is the UBM, and we compare two ways to estimate it: training it on the reduced dataset, or training it on NIST data. For NAP we compare three methods. The first one is to train NAP from the NIST data. The second one is to estimate NAP from the reduced data alone, although we do not have multi-session speakers, by using an approach that we call common-speaker-subspace compensation, which we used back in 2007 and which I will explain in a bit more detail shortly. And of course, the third method is to just combine the two compensations and use both of them.

So this common-speaker-subspace compensation works basically as follows. First, we estimate the subspace from a large set of supervectors from all speakers. In our case there are one hundred speakers, so we just extract supervectors for these one hundred sessions and do PCA on them. The dominant principal components in some way represent what is common to speakers as such: the speaker subspace. Now, maybe contrary to intuition, instead of focusing the recognition on this speaker subspace, we just remove its dominant components. Actually, since each supervector also contains components of the channel subspace, we remove those as well. What we get after this removal we call the speaker-unique subspace, because in the space that remains we do not expect to have any information that is common to many speakers; we have already removed the subspace that is common across speakers. The intuition, which we have also examined empirically, is that it may be wise to do verification in this speaker-unique subspace, and we got quite interesting results.
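A minimal sketch of this procedure in Python, assuming the development supervectors are given as the rows of a matrix; the number of removed principal directions is a placeholder, since the talk does not specify it (with one hundred sessions there are at most one hundred meaningful directions).

```python
import numpy as np

# Common-speaker-subspace sketch: PCA on the dev supervectors (one session
# per speaker) gives the dominant directions shared across speakers (which
# also absorb channel variability); projecting them out leaves the
# "speaker-unique" residual used for verification.

def common_speaker_subspace(dev_supervectors, n_dims=50):
    """PCA basis of the centered dev supervectors; returns a (D, n_dims) matrix."""
    x = dev_supervectors - dev_supervectors.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)  # rows of vt = directions
    return vt[:n_dims].T

def to_speaker_unique_subspace(sv, basis):
    """Remove the common speaker subspace from a supervector."""
    return sv - basis @ (basis.T @ sv)
```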

Okay, for the HMM-based NAP system: for the speaker-independent HMM we cannot use the NIST data, because we need it to be text-dependent, so the only choice is to use the reduced development set. For NAP we have three different methods: the first one is to train it with the common-speaker-subspace method on the reduced dev set; the second one is to use feature-space NAP, which is trained from the NIST data; and the third one is a combination of the two.

Okay, just before presenting the results, to give a sense of the quality of the systems: on NIST 2008, on the standard telephone condition, males only, the GMM-NAP system gets quite reasonable results, and the JFA and i-vector systems are also reasonable for this setup.

Okay, so these are the results for the JFA and i-vector based systems. First, for the matched condition, where both enrollment and verification sessions come from the same channel type, landline or cellular: what we see here is that we get a degradation of around twenty-five percent for JFA, and something similar for i-vectors. We do not really understand why. For the mixed-channel condition we also see a similar degradation for JFA and i-vectors, of roughly seven percent or more. Okay, this is as expected, because we have only one hundred sessions for score normalization instead of the full set of conversations per speaker.

Okay, so for the GMM-NAP system: we see, for example, that training the UBM from the NIST data does not give results as good as training it from the reduced dev set. Also, when we do NAP, it is actually better to train the NAP on the reduced dataset using the common-speaker-subspace method. And of course, if we just combine these subspaces, we get the best results. Still, we get quite a large degradation for the global condition, forty-one percent relative. This is because the global condition makes the most use of training on the development data, whereas the speaker and prompted conditions cannot make such use of the data, because they are not text-matched; thus, for them, the degradation is not as severe. For the mismatched condition we see quite similar trends.

This is for the HMM-based system. Again, we see that it is better to train the NAP from the reduced development set, because of the text mismatch of the NIST data; but we do get some improvement when we fuse with the feature-space NAP trained on the NIST data, so the combination does help somewhat. We then tried to analyze the HMM system, which is the best system for the global condition, the most important condition of all, to see what is the main source of the degradation, because we do see a significant degradation.

So what we can see from these results is that if we compare the system using the full development set to a system in which the development set is used for everything except NAP training, we see that we do not get such a significant degradation. So the bottom line is that the main source of the degradation is probably the NAP estimation.

Okay, now when we fuse all the sources: we see that we get a degradation of thirty percent or more, but we can still maintain quite good results, especially for the global condition, which is the important one in this task. We still get around 0.6% EER for the matched-channel condition, where there is no mismatch; with channel mismatch in addition, the error is higher.

So, to conclude: we validated our systems on all the authentication conditions, using the full development set and the reduced development set. The JFA and i-vector degradation is roughly five to fifteen percent. For the NAP-based systems the degradation is more dramatic, due to the heavy use of the Wells Fargo data, especially for the global condition. For the speaker-independent HMM, training on the reduced dev set is fine; you get only a small degradation from it. But for NAP it is important to do something, namely a combination of NAP trained from the NIST data and the common-speaker-subspace method on the reduced set, in order to get decent results. For the fused system we got an average degradation in the thirty percent range. Therefore, we conclude that we can build a text-dependent system even if we do not have any multi-session development data.


Also, for the global condition we are allowed to use the same text in the development set, the one hundred sessions, which is useful. But for the speaker condition and the prompted condition we are not allowed to use the same text, so we use other digit strings.


Yes, and it is not obvious. Okay, the global condition is just a fixed digit string. What I can say is that in practice the use cases vary: the global condition is the use case where you always use the same text for both enrollment and verification, and the other use case is where each person has his own password.

Well, the only difference is the use of the development data. And the prompted condition is where you are prompted with a random digit string.

Okay, we actually did not really check that in the results that I presented. But basically, we did look at it, and we do not feel that it is a problem for this application; we only need a single password.


So the idea there is that, for example, for the development set here, for the global condition, we actually needed to record speakers saying "zero" through "nine". Now, what happens if somebody wants to change the password to a different one? Then we would have to go and record the speakers again, saying the new password, because we are actually using this text for development. I think it is not really a technical issue, but our business and marketing people say that, from their experience with customers, this is really not practical.

When you want to deploy such a system, most of the time you will not be able to collect so many recordings. We do think it is practical to take one hundred speakers and record each of them once, but I do not think it is practical to take two hundred speakers and record them over four weeks in four sessions.

Yes, because this is text-dependent: if you have a development set that uses the same text, then you get much better results. If you train your models on utterances actually saying "zero" through "nine", and we have this in the paper from last Interspeech, you will get something like a fifty percent reduction in error rate, compared to trying to use models trained on other text.


There are also other reasons, which are not interesting from a technological perspective.