My name is Hagai Aronowitz, and the title of this talk is "Text-Dependent Speaker Verification Using a Small Development Set".
Okay, so here is the background for this work. In 2010 a speaker recognition evaluation was held by Wells Fargo Bank. The evaluation focused mostly on text-dependent speaker verification, and IBM Research participated in it. We presented the results of this evaluation at last year's Interspeech, and they were quite satisfactory.
However, there was some criticism regarding the setup of the evaluation, because the development set that was used was quite large: about two hundred speakers and four sessions per speaker. The criticism was that for many practical applications it is not practical to ask customers to collect such a large dataset. So it was very interesting to see what the results of the technology would be when using a small development set. The small set was specified as consisting of one hundred speakers and only one session per speaker.
So there is no way to do any multi-session modeling; there is only one session per speaker.
Okay, so the outline of the talk is as follows. First I will quickly describe the evaluation, then I will describe the speaker verification systems that we use, then we will talk about how we cope with the reduced development set, and finally we will present results.
Okay, so there were three text-dependent authentication conditions in the evaluation. The first one is denoted the global condition, where we use a global digit string, such as the digits zero through nine, for authentication. The second authentication condition uses a speaker-dependent password, which is also a digit string; this is denoted the speaker condition.
Now, of course, there is the issue of whether the impostor knows the password or not. In the evaluation the assumption is that the impostor does know the password, so all the trials use the same password as the target.
The last condition is called the prompted condition, where a prompted random digit string is used for authentication. This is the hardest condition to do accurately, but it is the most resilient against attacks such as recording attacks.
so basically that was follow the looks like this the last seven hundred fifty speakers
one the where useful development and five hundred fifty four evaluation data was recorded over
four weeks
and four sessions of error for the speaker to landline do so
and each session consists of all these authentication conditions and a lot of more data
that we are going to use the future like
instead of using the constraints just text
it's
Okay, so for all of the conditions we use three repetitions of the password for enrollment. Basically, to enroll in the system I have to say the password three times, for example "zero one two ... nine", and for verification I say it just one time.
The development data is supposed to be used as follows. For the global condition we may use the same digit string as in the evaluation, so if the password is "zero one ... nine" we may use repetitions of it when building the models. For the speaker and prompted conditions we are not allowed to use repetitions of the same digit strings. The reduced development set is a subset of one hundred of the speakers, with a single session each. We were also allowed to use other publicly available resources, such as NIST or Switchboard data, on top of these two sets.
Okay, so these are the systems we use for the evaluation. We use three text-independent systems: the first one is a fairly standard joint factor analysis (JFA) based system, the second one is an i-vector based system, and the third one is a GMM-NAP system. We also use a text-dependent system, which is HMM-supervector based with NAP compensation; we currently use this system only for the global condition. The final score is a fusion of the scores of all these systems, weighted using a simple rule-based scheme.
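Just to make the fusion concrete, here is a minimal sketch of a weighted linear fusion; the subsystem names and weights below are purely illustrative, not the actual rule we used:

```python
def fuse_scores(scores, weights):
    """Weighted linear fusion of per-subsystem scores for a single trial."""
    return sum(weights[name] * s for name, s in scores.items())

# Hypothetical weights and scores, for illustration only:
weights = {"jfa": 0.25, "ivector": 0.25, "gmm_nap": 0.30, "hmm_nap": 0.20}
trial = {"jfa": 1.2, "ivector": 0.8, "gmm_nap": 2.1, "hmm_nap": 1.7}
print(fuse_scores(trial, weights))
```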
Okay, so just a few details about the JFA-based system. It is quite standard, but we have two specific modifications that were also presented at Interspeech: the first one is robust scoring, and the second one concerns the score estimation. We built this system from telephone NIST data only; we do not use the Wells Fargo data for building the system. The only component that uses the Wells Fargo data is score normalization, which is done using the reduced development set.
The same goes for the i-vector based system: it is trained from the same data sources, and it only uses the Wells Fargo data for score normalization.
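Score normalization here can be pictured as something like z-norm; the following is a minimal sketch under that assumption (the function names are mine, not from the talk):

```python
import numpy as np

def znorm(raw_score, score_fn, cohort_utterances):
    """Z-norm sketch: standardize a trial score against an impostor cohort.

    score_fn          -- scores an utterance against the claimed speaker model
    cohort_utterances -- impostor utterances, e.g. drawn from the (reduced)
                         Wells Fargo development set
    """
    cohort = np.array([score_fn(u) for u in cohort_utterances])
    return (raw_score - cohort.mean()) / cohort.std()
```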
The NAP system, on the other hand, makes extensive use of the development data. We train the UBM and NAP from the development data, matching the text as much as possible. For example, for the global condition we train the UBM and NAP from exactly the same text that is used in verification. For the speaker and prompted conditions we are not allowed to do that, so we use, for example, the same kind of digit strings but not the same text. We found that this helps a lot.
We also use a variant of NAP which we call two-way NAP: on top of removing the channel subspace, we also remove some dominant components of the inter-speaker variability subspace, because we have consistently found in recent years that this helps.
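As a reminder of what NAP itself does, here is a minimal sketch: the nuisance subspace is estimated from within-speaker variability across sessions (which is exactly what requires multi-session speakers), and supervectors are projected onto its complement. The names and the SVD-based estimation are my own illustration; the two-way variant would additionally remove a few dominant inter-speaker directions.

```python
import numpy as np

def estimate_nuisance_subspace(supervectors_by_speaker, k):
    """Nuisance (channel) subspace: the k dominant directions of
    within-speaker variability across sessions."""
    deviations = []
    for sessions in supervectors_by_speaker.values():
        S = np.asarray(sessions)               # (n_sessions, dim)
        deviations.append(S - S.mean(axis=0))  # remove the speaker mean
    W = np.vstack(deviations)
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    return Vt[:k].T                            # (dim, k), orthonormal columns

def nap_project(s, U):
    """Remove the subspace spanned by U from supervector s: (I - U U^T) s."""
    return s - U @ (U.T @ s)
```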
We also use a geometric-mean comparison kernel, which we found helpful. And we do score normalization, again using the Wells Fargo data.
The HMM-supervector based system is very similar to the GMM-NAP system. The only difference is that instead of extracting GMM supervectors we extract HMM supervectors; the rest of the system is the same. Basically, instead of training a UBM we train a speaker-independent HMM from the development data. Then, to extract the supervector for a session, we use that session's data to estimate a session-dependent HMM using MAP adaptation, take the GMM means from the different states, normalize them, and concatenate them.
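A minimal sketch of that extraction, assuming standard MAP mean adaptation (the data layout and the relevance factor value are illustrative):

```python
import numpy as np

def hmm_supervector(si_state_means, state_stats, tau=16.0):
    """MAP-adapt the Gaussian means of each HMM state to one session and
    concatenate them into a supervector.

    si_state_means -- per-state (n_gauss, dim) means of the
                      speaker-independent HMM
    state_stats    -- per-state (counts, sums): counts is (n_gauss,)
                      occupation counts, sums is (n_gauss, dim) first-order
                      statistics of the frames aligned to that state
    """
    pieces = []
    for m0, (n, f) in zip(si_state_means, state_stats):
        alpha = (n / (n + tau))[:, None]              # adaptation weight
        empirical = f / np.maximum(n[:, None], 1e-8)  # per-Gaussian mean
        pieces.append((alpha * empirical + (1 - alpha) * m0).ravel())
    return np.concatenate(pieces)
```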
Okay, so now let's talk about how we were able to cope with the reduced dataset. If we look at the four different systems, we can see that the JFA and i-vector based systems should not be very sensitive to the reduced set, because they do not use the development data heavily; they only use it for score normalization. So for the moment we did not work on these systems; we just use them as-is and see what happens. For the NAP-based systems the problem is much more serious, because they use the development set very extensively. First of all, we have less data for the UBM or the speaker-independent HMM. Furthermore, we do not have any multi-session speakers, so if we want, for example, to train NAP, we will not be able to; and score normalization may take a hit as well.
so
or vice versa for these two systems for the gmm based mapping the hmm based
not systems
and a weekend
we have also a we consider in the in some slides in the results
we focus on these systems because they walk much better than jfa i-vector on this
task so
it's very important to do this
Okay, so for the GMM-based NAP system, the first component is the UBM. We compare two ways to estimate it: training it on the reduced dataset, or training it on NIST data. For NAP we compare three methods. The first one is to train NAP from the NIST data. The second one is to estimate NAP from the reduced data, even though we do not have multi-session speakers, by using an approach that we call common-speaker-subspace compensation, which we used back in 2007; I will explain this approach a bit more in a moment. And of course the third method is to just combine the two compensations and use both of them.
So this common-speaker-subspace compensation works as follows. First, we estimate the subspace from a large set of supervectors from all speakers. In our case we have the one hundred speakers: we extract supervectors for these one hundred sessions and just do PCA on these supervectors. The dominant eigenvectors in some way represent the typical speaker, so we call their span the common speaker subspace. Now, maybe contrary to the usual logic, instead of focusing the recognition on this speaker subspace, we remove the dominant components of the speaker subspace. Actually, since each supervector also contains components of the channel subspace, those get removed as well. What we get after this removal we call the speaker-unique subspace, because in the space that remains we do not expect to have any information that is common to many speakers; we have already removed the subspace that is common across speakers. The intuition, which we have also examined empirically, is that it may be wise to do verification in this speaker-unique subspace, and we got quite interesting results.
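A minimal sketch of this idea, assuming plain PCA on one supervector per speaker (the function names are mine):

```python
import numpy as np

def common_speaker_subspace(supervectors, k):
    """PCA over one supervector per speaker (one hundred single-session
    speakers in the reduced dev set); the top-k directions span the
    'common speaker' subspace."""
    X = np.asarray(supervectors)            # (n_speakers, dim)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T                         # (dim, k) dominant directions

def to_speaker_unique_subspace(s, U):
    """Project a supervector onto the 'speaker-unique' subspace by removing
    the dominant common directions: s -> (I - U U^T) s."""
    return s - U @ (U.T @ s)
```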
Okay, for the HMM-based NAP system, for the speaker-independent HMM we cannot use the NIST data, because we need text-dependent data, so the only choice is to use the reduced dev set. For NAP we have three different methods: the first is training NAP with the common-speaker-subspace method from the reduced dev set; the second is to use feature-space NAP, which is trained from the NIST data; and the third is a combination of the two.
Okay, so before presenting the results, just to show the quality of the systems: on the standard NIST 2008 telephone condition, males only, we get quite reasonable equal error rates for the GMM-NAP, JFA and i-vector systems, so these are competitive systems. The question is how they do on the text-dependent task.
Okay, so these are the results for the JFA and i-vector based systems. First, for the matched condition, where both enrollment and verification are on the same channel type, landline or cellular: what we see here is that we get a degradation of around twenty-five percent for JFA, and something similar for i-vectors; we do not really understand why. For the mixed-channel condition we also see a similar degradation for JFA and i-vectors, between roughly seven and fifteen percent. This is more or less what we expected, because we have only one hundred sessions to use for score normalization.
Okay, so for the GMM-NAP system: we see, for example, that training the UBM from NIST data does not give us results as good as training it from the reduced dev set. Also, for NAP, it is actually better to train NAP on the reduced dataset using the common-speaker-subspace method. And of course, if we just combine the two subspaces, we get the best results. We still get quite a large degradation for the global condition, forty-one percent relative. This is because the global condition makes the most use of training from the development data; the speaker and prompted conditions do not make such use of the data, because they are not text-matched, so for them the degradation is not as severe. For the mismatched condition we see quite similar trends.
This is for the HMM-based system. Again, we see that it is beneficial to train NAP from the reduced dataset, and of course the feature-space compensation helps as well; we do get some improvement when we combine the two, so the combination does buy us something.
We tried to analyze the HMM system, which is the best system for the global condition, the most important condition of all, to see what the main source of degradation is, because we do see some significant degradation. What we can see from these results is that if we compare the system trained on the full development set against a system in which the full development set is used for compensation but not for NAP, we do not get such a significant degradation. So the bottom line is that the degradation probably comes mainly from the reduced data used for the other components rather than from NAP itself.
Okay, so for the fused system: we see that we get a degradation of somewhere between thirty and forty percent, but we can still maintain reasonable error rates, especially for the global condition, which is the important one in this task. We still get an equal error rate of around 0.6 percent for the matched-channel condition; with channel mismatch it is higher.
So, to conclude: we evaluated our systems on all authentication conditions using the full development set and the reduced development set. The JFA and i-vector degradation is roughly five to fifteen percent. For the NAP-based systems the degradation is more dramatic, due to their heavy use of the Wells Fargo data, especially for the global condition. For training the speaker-independent HMM, the reduced dev set is fine to use; you get only a small degradation from it. But for NAP it is important to combine NAP trained on NIST data with the common-speaker-subspace method trained on the reduced set; remember that this combination gave the best results. For the fused system we got a degradation of roughly thirty-five percent on average. We therefore conclude that we can build a text-dependent system with reasonable accuracy even if we do not have any multi-session development data.
For the global condition we are allowed to use the same text, so the development set, the one hundred sessions, is used with the same digit string. But for the speaker condition and the prompted condition we are not allowed to use the same text, so we use other digit strings.
Yes, and it is not obvious. In the global condition the password is just a fixed digit string: in practice, every speaker in the system always uses the same text, both for enrollment and for verification. The speaker condition is the use case where each speaker has his own password. From the system's point of view, the only real difference between these conditions is how the development data may be used. And the prompted condition is the one where you are prompted with a random string.
Okay, we actually did not really work on that, so for the results that I presented, in some cases you could say we do not address it. But basically we did look at it, and we do not feel that it is a problem for this application.
So the idea there is that, for example, for the development set for the global condition, we actually needed to record speakers saying "zero one ... nine". Now, what happens if someone wants to change the password to a different one? Then we would need to go and record the speakers again, saying the new password, because we actually use this text for development. I am not a business and marketing person, but from their experience with customers this is really not practical. When you want to deploy such a system, most of the time you will not be able to collect so many recordings. We think it is practical to take one hundred speakers and record them once, but it is not practical to take two hundred speakers and hold them over four weeks for four sessions. And this matters because the task is text-dependent: if your development set is recorded using the same text, you get much better results. If you train your models on utterances actually saying "zero one ... nine" (we have this in the paper from last Interspeech), you get something like a fifty percent reduction in error rate compared to training your models on other text.
There are also other reasons, which are not from a technological perspective.