0:01:00 My name is Hagai Aronowitz.
The title of this talk is "Text-Dependent Speaker Verification Using a Small Development Set".
0:01:09 So this is the background for this work. In 2010 a speaker recognition evaluation was held by Wells Fargo Bank. The evaluation focused mostly on text-dependent speaker verification, and IBM Research participated in it. We presented the results of this evaluation at last Interspeech, and they were quite satisfactory.
0:01:38 However, there was some criticism regarding the setup of the evaluation, because the development set that was used was quite large: about two hundred speakers with four sessions per speaker. The criticism was that for many practical applications it is not practical to collect such a large dataset from customers. So it was very interesting to see what the results of the technology would be when using a small development set. The small dev set was specified as consisting of one hundred speakers with only one session per speaker, so there is no multi-session data at all; there is only one session per speaker.
0:02:28 Okay, so the outline of the talk is as follows: first I will quickly describe the evaluation, then I will describe the speaker verification systems that we used, then I will talk about how we coped with the reduced development set, and finally I will present the results.
0:02:48 Okay, so there were three text-dependent authentication conditions in the evaluation. The first one is denoted the global condition, where a global digit string, such as "zero ... nine", is used for authentication. The second authentication condition uses a speaker-dependent password, also a digit string, and this is denoted the speaker condition.
0:03:14 Now, of course, there is the issue of whether the password is known to the impostor or not. In the evaluation the assumption, in most cases, is that the impostor knows the password: all the trials use the same password, i.e., the impostor speaks the target's password.
0:03:33 The last condition is called the prompted condition, where a prompted random digit string is used for authentication. This is the hardest condition in terms of accuracy, but it is the most resilient against attacks such as replayed recordings.
0:03:57 So basically the protocol looks like this: the dataset has seven hundred fifty speakers; two hundred were used for development and five hundred fifty for evaluation. The data was recorded over four weeks, in four sessions per speaker, two landline and two cellular. Each session consists of all the authentication conditions, plus a lot of additional data that we are going to use in the future, for example free text instead of digit strings.
0:04:31 Okay, so for all the conditions we use three repetitions of the password for enrollment. Basically, to enroll in the system I have to say, for example, "zero ... nine" three times, and for verification I say it just one time.
0:04:53 The development data is supposed to be used as follows. For the global condition we are allowed to use the same digit string as in evaluation: if the password is "zero ... nine", we may use repetitions of "zero ... nine" from the development set. For the speaker and prompted conditions we are not allowed to use repetitions of the same digit strings. The reduced development set consists of one hundred of the speakers, with a single session each. Beyond that, we were allowed to use any other publicly available resources, such as NIST or Switchboard data, on top of these development sets.
0:05:47 Okay, so here are the systems we used for the evaluation. We used three text-independent systems: the first one is a joint factor analysis (JFA) based system, the second one is an i-vector based system, and the third one is a GMM-NAP based system. We also used a text-dependent system, which is HMM-supervector based with NAP compensation, and we currently use this system only for the global condition. The final score is a fusion of the scores of all these systems, weighted using a simple rule-based scheme.
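The talk does not give the fusion rule or its weights, so the following is only a minimal sketch of what rule-based weighted score fusion could look like; the per-condition weights and the system ordering are hypothetical placeholders, not the actual configuration.

```python
import numpy as np

# Hypothetical per-condition fusion weights over the four systems
# (jfa, ivector, gmm_nap, hmm_nap); the real rule/weights are not given.
FUSION_WEIGHTS = {
    "global":   np.array([0.15, 0.15, 0.30, 0.40]),  # HMM system used only here
    "speaker":  np.array([0.25, 0.25, 0.50, 0.00]),
    "prompted": np.array([0.25, 0.25, 0.50, 0.00]),
}

def fuse_scores(condition: str, scores: np.ndarray) -> float:
    """Linear fusion of per-system scores for one trial."""
    return float(np.dot(FUSION_WEIGHTS[condition], scores))

# Example: four system scores for one global-condition trial.
print(fuse_scores("global", np.array([1.2, 0.8, 2.1, 2.5])))
```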
0:06:34 Okay, just a few details about the JFA-based system. It is quite standard, but we have two specific modifications for verification, which were also presented at Interspeech: the first one is robust scoring, and the second one is a modified score estimation. We built the system from telephone data only; we don't use the password data for building the system at all. The only component that uses the password data is score normalization, which is actually done using the password development data. The same holds for the i-vector based system: it is trained on the same data sources, and it only uses the password data for score normalization.
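The talk does not say which normalization scheme was used; purely as an illustration, here is a minimal Z-norm sketch in which the impostor cohort scores are assumed to come from trials built on the password development data.

```python
import numpy as np

def znorm(raw_score: float, cohort_scores: np.ndarray) -> float:
    """Z-norm: standardize a trial score by the mean and standard
    deviation of impostor (cohort) scores for the same target model.
    Here the cohort is assumed to be built from dev-set sessions."""
    return (raw_score - cohort_scores.mean()) / cohort_scores.std()

# Example with made-up cohort scores for one target model.
cohort = np.array([-1.3, -0.7, -1.1, -0.9, -1.5])
print(znorm(0.4, cohort))
```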
0:07:33 The GMM-NAP system actually makes extensive use of the development data: we train the UBM and NAP from the development data, and we match the text as much as possible. For example, for the global condition we train the UBM and NAP from exactly the same text that is used in verification. For the speaker and prompted conditions we are not allowed to do that, so we use, for example, the same kind of digit strings but not the exact text. We found that this text matching helps a lot.
0:08:10 We also use a variant of NAP, which we call two-wire NAP: on top of removing the channel subspace, we also remove the top components of the inter-speaker variability subspace, because we consistently found over the years that this helps. We also use a geometric-mean-compressing kernel, which we found beneficial. And we do score normalization, again using the password data.
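The talk only sketches two-wire NAP verbally: remove the channel subspace and, on top of that, the top inter-speaker directions. The following numpy sketch is a rough illustration of that idea, assuming plain SVD-based subspace estimates; the subspace ranks and the function name are placeholders, not the actual implementation.

```python
import numpy as np

def two_wire_nap(sv, spk_ids, n_channel=50, n_speaker=2):
    """Remove a channel subspace (estimated from within-speaker
    variation) plus the top inter-speaker directions from supervectors.
    sv: (n_sessions, dim); spk_ids: speaker label per session.
    The ranks n_channel and n_speaker are arbitrary placeholders."""
    spk_ids = np.asarray(spk_ids)
    means = {s: sv[spk_ids == s].mean(axis=0) for s in set(spk_ids)}
    # Channel subspace: dominant directions of within-speaker variation.
    within = np.vstack([sv[i] - means[s] for i, s in enumerate(spk_ids)])
    _, _, v_ch = np.linalg.svd(within, full_matrices=False)
    # Inter-speaker subspace: dominant directions of the speaker means.
    spk_means = np.vstack(list(means.values()))
    _, _, v_sp = np.linalg.svd(spk_means - spk_means.mean(axis=0),
                               full_matrices=False)
    # Project out channel directions plus top speaker directions.
    nuisance = np.vstack([v_ch[:n_channel], v_sp[:n_speaker]])
    q, _ = np.linalg.qr(nuisance.T)          # orthonormal nuisance basis
    return sv - (sv @ q) @ q.T
```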
0:08:50 The HMM-supervector based system is very similar to the GMM-NAP system; the only difference is that instead of extracting GMM supervectors we extract HMM supervectors, and the rest of the system is the same. HMM supervectors are extracted as follows: instead of training a UBM, we train a speaker-independent HMM from the development data. Then, to extract a supervector for a session, we use the session data to estimate a session-dependent HMM using MAP adaptation, and we take the GMM means from the different states, normalize and concatenate them.
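As a rough illustration of the extraction just described, here is a numpy sketch of relevance-MAP adaptation of the means of a single GMM, followed by supervector concatenation; in the HMM case the same adaptation would be applied per state and the per-state means concatenated. The relevance factor and the sqrt(weight)/standard-deviation normalization are common choices assumed here, not details given in the talk.

```python
import numpy as np

def map_adapt_means(frames, means, covars, weights, r=16.0):
    """Relevance-MAP adaptation of GMM means to one session.
    frames: (T, d); means: (G, d); covars: (G, d) diagonal; weights: (G,).
    r is a typical relevance factor, not a value given in the talk."""
    d = frames.shape[1]
    # Log-likelihood of each frame under each diagonal Gaussian.
    log_p = np.stack(
        [-0.5 * (((frames - means[g]) ** 2 / covars[g]).sum(axis=1)
                 + np.log(covars[g]).sum() + d * np.log(2 * np.pi))
         for g in range(len(weights))], axis=1) + np.log(weights)
    post = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)       # frame posteriors
    n = post.sum(axis=0)                          # soft counts
    f = post.T @ frames                           # first-order stats
    alpha = (n / (n + r))[:, None]                # adaptation weights
    return alpha * (f / np.maximum(n, 1e-10)[:, None]) + (1 - alpha) * means

def supervector(adapted_means, covars, weights):
    """Concatenate adapted means, scaled by sqrt(weight)/std per Gaussian."""
    return (np.sqrt(weights)[:, None] / np.sqrt(covars) * adapted_means).ravel()

# Tiny example with random stand-in data: 4 Gaussians, dim 3.
rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 3))
means, covars, weights = rng.standard_normal((4, 3)), np.ones((4, 3)), np.full(4, 0.25)
print(supervector(map_adapt_means(frames, means, covars, weights),
                  covars, weights).shape)  # (12,)
```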
0:09:33 Okay, so now let's talk about how we were able to cope with the reduced dataset. If we look at the four different systems, we can see that the JFA and i-vector based systems should not be very sensitive to the reduced dev set, because we only use it for score normalization. So for the moment we didn't work on these systems; we just used them as-is to see what happens. For the NAP-based systems the problem is much more serious, because they use the development set very extensively. First of all, we have less data for training the UBM and the speaker-independent HMM. Moreover, we don't have any multi-session speakers, so we cannot train NAP in the standard way, and score normalization also becomes less reliable. So we had to adapt these two systems, the GMM-based NAP and the HMM-based NAP systems. In the next slides and in the results we focus on these systems, because they work much better than JFA and i-vector on this task, so it is very important to handle this well.
0:10:57 Okay, so for the GMM-based NAP system, the first component is the UBM. We compared two ways to estimate it: training it on the reduced dataset or training it on NIST data. For NAP we compared three methods. The first one is to train NAP from the NIST data. The second one is to estimate NAP from the reduced data, even though we don't have multi-session speakers, by using an approach that we call common-speaker-subspace compensation, which we used back in 2007 and which I will explain in a bit more detail. And of course, the third method is to combine the two compensations and use both of them.
0:11:46 The common-speaker-subspace compensation works as follows. First, we estimate a subspace from a large set of sessions from many speakers. In our case we have the one hundred speakers, so we extract supervectors for these one hundred sessions and do PCA on them. Because each session comes from a distinct speaker, the resulting subspace in some way represents the speaker subspace. Now, maybe contrary to intuition, instead of focusing the recognition on this speaker subspace, we remove its dominant components. Actually, since each supervector also contains components of the channel subspace, those get removed as well. What we are left with after this removal we call the speaker-unique subspace, because in the remaining space we don't expect to have any information that is common to many speakers; we have already removed the subspace that is common across speakers. The intuition, which we have also examined experimentally, is that it may be wise to do the verification in this speaker-unique subspace, and we got quite interesting results. So this is what I mean by common-speaker-subspace compensation.
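A minimal sketch of this step, assuming plain PCA via SVD: estimate the common speaker subspace from the one hundred single-session development supervectors, then project its dominant components out of any supervector, keeping the "speaker-unique" residual. The number of removed components k is a placeholder; the talk does not state the value used.

```python
import numpy as np

def remove_common_speaker_subspace(dev_sv, sv, k=40):
    """PCA on the development supervectors (one session per speaker),
    then remove the top-k principal directions from supervectors sv.
    dev_sv: (n_dev, dim); sv: (n, dim); k is a placeholder rank."""
    mean = dev_sv.mean(axis=0)
    _, _, vt = np.linalg.svd(dev_sv - mean, full_matrices=False)
    basis = vt[:k]                       # dominant common directions
    centered = sv - mean
    return centered - (centered @ basis.T) @ basis

# Example with random stand-in data: 100 dev sessions, dim 1000.
rng = np.random.default_rng(0)
dev = rng.standard_normal((100, 1000))
trials = rng.standard_normal((3, 1000))
print(remove_common_speaker_subspace(dev, trials).shape)  # (3, 1000)
```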
0:13:23 Okay, for the HMM-based NAP system: for the speaker-independent HMM we cannot use the NIST data, because it needs to be text-dependent, so the only choice is to train it on the reduced dev set. For NAP we have three different methods: the first one is training NAP on the reduced dev set using the common-speaker-subspace method; the second one is to use a feature-space NAP, which is trained from the NIST data; and the third one is a combination of the two.
0:14:01 Okay, just before presenting the results, to show the quality of the systems: on the NIST 2008 evaluation, on the standard telephone condition, males only, we get quite reasonable results, and the JFA and i-vector scores are also reasonable for this task.
0:14:34 Okay, so these are the results for the JFA and i-vector based systems. First, for the matched condition, where both enrollment and verification use the same channel type, landline or cellular: what we see here is that we get a degradation of around twenty-five percent for JFA, and something similar for i-vector; we don't really understand why. For the mixed-channel condition we also see a similar degradation for JFA and i-vector, between seven and eighteen percent. This is as expected, because we have only one hundred sessions, one per speaker.
0:15:24 Okay, so for the GMM-NAP system: we see, for example, that training the UBM from NIST data doesn't give results as good as training it on the reduced dev set. Also for NAP, it is actually better to train NAP on the reduced dataset using the common-speaker-subspace method. And if we combine these subspaces, we get the best results. We still get quite a large degradation for the global condition, forty-one percent relative. This is because the global condition makes the most use of training on the development data, while the speaker and prompted conditions don't make as much use of the data, because they are not text-matched; thus for them the degradation is not as severe. For the mismatched condition we see quite similar trends.
0:16:32 This is for the HMM-based system. Again we see that it is beneficial to train NAP on the reduced dev set with the common-speaker-subspace method, and we get some further improvement when we add the feature-space NAP; the combination does help somewhat.
0:17:02 We tried to analyze the HMM system, which is the best system for the global condition, the most important condition of all, to see what the main source of the degradation is, because we do see significant degradation. What we can see from these results is that if we compare the system trained on the full development set to a system in which the development set is still used for compensation but not for NAP, we see that we don't get such a significant degradation. So the bottom line is that the main source of the degradation is probably the NAP estimation.
0:17:53 Okay, now for the fused system. We see that we get a degradation of between thirty and forty percent, but we can still get quite good results, especially for the global condition, which is important in this task. We still get around 0.6 for the matched-channel condition, but as we said, for the mismatched condition the error rates are higher.
0:18:25 So to conclude: we evaluated our systems on the authentication conditions using the full development set and the reduced development set. The JFA and i-vector degradation is roughly five to fifteen percent. For the NAP-based systems the degradation is more dramatic, due to their heavy use of the password data, especially for the global condition. For speaker-dependent HMM training the reduced dev set is fine to use; you get only a small degradation from it. But for NAP it is important to do something, namely a combination of NAP trained from NIST data and the common-speaker-subspace method on the reduced set, in order to get the best results. For the fused system we got a degradation of around thirty-five percent on average. Therefore we conclude that we can build a text-dependent system with reasonable accuracy even if we don't have any multi-session development data.
0:19:33 [Audience question, inaudible]
0:19:50 For the global condition we are allowed to use the same text in the development set, that is, in the one hundred sessions. But for the speaker condition and the prompted condition we are not allowed to use the same text; we only use the same digit-string constraint.
0:20:16 [Follow-up question, inaudible]
0:20:26 Yes, and it's not obvious.
0:20:45 Okay, the global condition is just a fixed digit string, the same for all speakers. In practice this is what you would typically deploy, because you always use the same text for both enrollment and verification. The use case for the speaker condition is that each person has their own password. In the evaluation, the only difference between these conditions is the use of the development data. And the prompted condition is where you are prompted with a random digit string.
0:21:39 [Question inaudible]
0:21:41 Okay, we actually didn't really look into that; the results that I presented don't cover it. [Exchange partially inaudible] Basically, we did look at it, and we don't feel that it is a problem for this application; we only need a single session.
0:22:39 So the idea there is that, for example, for the development set for the global condition, we actually needed to record speakers saying "zero ... nine". Now, what happens if someone wants to change the password to a different one? Then you would have to go and record speakers again, saying the new text, because we actually use this text for development. I am not a business or marketing person, but people from business and marketing say, from their experience with customers, that this is not practical: when you want to deploy such a system, most of the time you will not be able to collect so many recordings. They think it is practical to take one hundred speakers and record each one once, but not practical to take two hundred speakers and hold them over four weeks for four sessions.
0:23:54 Yes, because this is text-dependent: if you have a development set recorded using the same text, then you get much better results. If you train your models on actual utterances of "zero ... nine" — we have this in the paper from last Interspeech — you get much better accuracy, something like a fifty percent reduction in error rate, compared to training your models on other text.
0:24:42 [Question inaudible] There are some cases where some of them are not saying [the same text]. [Exchange partially inaudible] The other reasons I cannot address from a technological perspective.