0:01:00 | my name is Hagai Aronowitz |
---|
0:01:03 | and the title of this talk is "Text-Dependent Speaker Verification Using a Small Development Set" |
---|
0:01:09 | so this is the background for this work: in two thousand ten |
---|
0:01:15 | a speaker recognition evaluation |
---|
0:01:17 | was held by the Wells Fargo bank |
---|
0:01:22 | the evaluation focused mostly on text-dependent speaker verification |
---|
0:01:25 | IBM Research participated in this evaluation |
---|
0:01:29 | so basically, we presented the results of this evaluation at last Interspeech |
---|
0:01:34 | and they were quite satisfactory |
---|
0:01:38 | however, there was some criticism regarding the setup of the evaluation |
---|
0:01:45 | because the development set used in the evaluation was quite large: about two hundred speakers |
---|
0:01:51 | and four sessions per speaker |
---|
0:01:54 | and the criticism was that for many practical applications it is |
---|
0:02:00 | not practical for customers to collect such a large dataset |
---|
0:02:04 | so it was very interesting to see what the results of the technology are when |
---|
0:02:10 | using a small dev set; and the small dev set was specified as |
---|
0:02:14 | consisting of one hundred speakers |
---|
0:02:18 | and only one session per speaker |
---|
0:02:20 | so there is no way to model multi-session variability |
---|
0:02:25 | there is only one session per speaker |
---|
0:02:28 | okay, so the outline of my talk is as follows |
---|
0:02:31 | first I will quickly describe the evaluation, then I will describe the speaker verification systems that we use |
---|
0:02:37 | and then we will talk about how we coped with |
---|
0:02:42 | the reduced dev set |
---|
0:02:43 | and we present results and conclusions |
---|
0:02:48 | okay, so there were three text-dependent authentication conditions in the evaluation; the first one is denoted |
---|
0:02:55 | by the global condition |
---|
0:02:57 | where we use a global digit string, such as zero to nine, for authentication |
---|
0:03:03 | the second authentication condition uses a |
---|
0:03:06 | speaker-dependent password |
---|
0:03:08 | also under digit-string constraints |
---|
0:03:11 | and this is denoted by the speaker condition |
---|
0:03:14 | now of course there is the issue of whether the password is secret or not; so |
---|
0:03:18 | in the evaluation the assumption, in most cases, is that the password is not |
---|
0:03:24 | secret |
---|
0:03:25 | that is, all the trials, including impostor trials, use the same password as the |
---|
0:03:30 | target's |
---|
0:03:31 | password |
---|
0:03:33 | and the last condition is called the prompted condition, where a prompted random digit |
---|
0:03:38 | string |
---|
0:03:40 | is used for authentication; this is the hardest case to verify accurately, but it is |
---|
0:03:46 | the most resilient condition against attacks |
---|
0:03:50 | such as playback of recordings |
---|
0:03:57 | so basically the Wells Fargo corpus looks like this: it has seven hundred fifty speakers |
---|
0:04:04 | two hundred were used for development and five hundred fifty for evaluation; the data was recorded over |
---|
0:04:09 | four weeks |
---|
0:04:10 | in four sessions per speaker, two landline and two cellular |
---|
0:04:15 | and each session consists of all these authentication conditions and a lot of more data |
---|
0:04:20 | that we are going to use in the future |
---|
0:04:23 | such as free text instead of digit strings |
---|
0:04:31 | okay, so for the global and speaker conditions we use |
---|
0:04:36 | three repetitions of the password for enrollment |
---|
0:04:41 | basically, to enroll in the system one has to say the password three times, for example "zero |
---|
0:04:45 | one two three four five six seven eight nine" |
---|
0:04:47 | and for verification just one time |
---|
0:04:53 | and the development data is supposed to be used as follows: for the global |
---|
0:04:57 | condition |
---|
0:04:58 | we are allowed to use the same digit strings as in evaluation, so if the password is |
---|
0:05:03 | "zero to nine", we will use repetitions of "zero to nine" in the model training |
---|
0:05:09 | whereas for the speaker and prompted conditions we are not allowed to use repetitions of the |
---|
0:05:15 | same digit strings |
---|
0:05:19 | the reduced development set consists of one hundred speakers |
---|
0:05:24 | with a single session each |
---|
0:05:25 | so each speaker is recorded only once |
---|
0:05:30 | and we were allowed to use any other publicly available |
---|
0:05:37 | resources |
---|
0:05:38 | such as the NIST or Switchboard corpora |
---|
0:05:41 | on top of this dev set |
---|
0:05:47 | okay, so these are the |
---|
0:05:48 | systems we used for the evaluation; we use three text-independent systems: the first |
---|
0:05:55 | one is a GMM joint factor analysis based |
---|
0:05:58 | system, the second one is an i-vector based system |
---|
0:06:06 | and the third one is a GMM-NAP system |
---|
0:06:08 | we also use a text-dependent system, which is HMM supervector based |
---|
0:06:15 | with NAP compensation; we use this system currently only for the global condition |
---|
0:06:20 | and finally, the final score is a fusion of the scores of all |
---|
0:06:26 | these systems |
---|
0:06:27 | which are weighted by |
---|
0:06:29 | simple rule-based weights |
---|
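To make the fusion step concrete, here is a minimal sketch of rule-based weighted score fusion; the subsystem names and weights below are illustrative assumptions, not the values used in this work.

```python
# Minimal sketch of weighted score-level fusion; weights and system
# names are illustrative assumptions, not the ones used in the talk.
from typing import Dict

# Hypothetical rule-based weights per subsystem (sum to 1 here).
FUSION_WEIGHTS: Dict[str, float] = {
    "jfa": 0.2,
    "ivector": 0.2,
    "gmm_nap": 0.3,
    "hmm_nap": 0.3,  # text-dependent system, used only for the global condition
}

def fuse_scores(scores: Dict[str, float], global_condition: bool) -> float:
    """Weighted sum of per-system scores for one verification trial."""
    weights = dict(FUSION_WEIGHTS)
    if not global_condition:
        # The HMM system is only used for the global condition.
        weights.pop("hmm_nap")
        total = sum(weights.values())
        weights = {k: v / total for k, v in weights.items()}
    return sum(weights[name] * scores[name] for name in weights)

print(fuse_scores({"jfa": 1.2, "ivector": 0.8, "gmm_nap": 2.1, "hmm_nap": 1.9},
                  global_condition=True))
```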
0:06:34 | okay so |
---|
0:06:36 | just a few details about the GMM-based NAP system |
---|
0:06:41 | it's quite standard, but we have two specific |
---|
0:06:47 | scoring techniques that we presented at last Interspeech |
---|
0:06:57 | and we build the system only from telephone |
---|
0:07:02 | NIST data |
---|
0:07:03 | we don't use the Wells Fargo data for building the system |
---|
0:07:07 | the only component that uses the Wells Fargo data |
---|
0:07:12 | is score normalization, which is actually done |
---|
0:07:17 | using the Wells Fargo dev data |
---|
0:07:21 | the same goes for the i-vector based system: it is built from the same data sources |
---|
0:07:26 | and only uses the Wells Fargo data for score normalization |
---|
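For readers unfamiliar with score normalization, here is a minimal sketch of Z-norm, one standard way such normalization is done; whether this exact variant was used is an assumption on my part.

```python
import numpy as np

def znorm(raw_score: float, impostor_scores: np.ndarray) -> float:
    """Z-norm: standardize a trial score using impostor scores obtained by
    scoring the claimed model against a normalization cohort (here, the
    sessions of the Wells Fargo dev set)."""
    mu = impostor_scores.mean()
    sigma = impostor_scores.std()
    return (raw_score - mu) / sigma

# Usage: cohort scores would come from scoring the claimed speaker's model
# against the 100 dev-set sessions; random placeholders are used here.
cohort = np.random.default_rng(0).normal(0.0, 1.0, size=100)
print(znorm(2.5, cohort))
```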
0:07:33 | the NAP system actually makes heavy use of the development data |
---|
0:07:39 | we train the UBM and NAP from the development data, and we match |
---|
0:07:45 | the text as much as possible; so, for example, for the global condition we |
---|
0:07:50 | train the UBM and NAP exactly from the same text that is being used |
---|
0:07:54 | in verification |
---|
0:07:56 | for the speaker and prompted conditions we are not allowed to do that, so we just use, for example |
---|
0:08:01 | the same digit constraints |
---|
0:08:02 | but not the same text |
---|
0:08:06 | and we found that this helps a lot |
---|
0:08:10 | we also use a variant of NAP, which we call two-wire NAP, in which |
---|
0:08:15 | on top of removing the channel subspace, we also |
---|
0:08:21 | remove the dominant components |
---|
0:08:23 | of the inter-speaker variability subspace |
---|
0:08:27 | because we consistently found in past years that this |
---|
0:08:32 | helps |
---|
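To illustrate the compensation itself, here is a minimal sketch of a NAP-style projection that removes learned nuisance directions from a supervector; the dimensions and the random basis are placeholders, and stacking channel directions with dominant inter-speaker directions is my reading of the "two-wire" variant.

```python
import numpy as np

def nap_project(supervector: np.ndarray, nuisance_basis: np.ndarray) -> np.ndarray:
    """Remove the span of the nuisance directions from a supervector.

    nuisance_basis: (dim, k) matrix with orthonormal columns, e.g. the
    channel subspace stacked with the dominant inter-speaker directions
    (the "two-wire NAP" idea described in the talk)."""
    V = nuisance_basis
    return supervector - V @ (V.T @ supervector)

# Usage with placeholder data: a 10-dim supervector, 3 nuisance directions.
rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.normal(size=(10, 3)))  # orthonormal basis
x = rng.normal(size=10)
print(nap_project(x, V))
```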
0:08:34 | we also use a geometric mean compressing kernel |
---|
0:08:46 | and we do score normalization, again using the Wells Fargo data |
---|
0:08:50 | the HMM supervector based system is very similar to the GMM-NAP system |
---|
0:08:56 | the only difference is that instead of extracting GMM supervectors we extract HMM supervectors |
---|
0:09:02 | and the rest of the system is the same; so basically, HMM supervectors are |
---|
0:09:06 | extracted by, instead of training a UBM, training a speaker-independent HMM from the development |
---|
0:09:12 | data |
---|
0:09:13 | and then, whenever we want to extract these supervectors, we just take a |
---|
0:09:18 | session, use its data to estimate a session-dependent HMM using MAP |
---|
0:09:25 | adaptation, and we just take the GMM means from the different states, normalize and concatenate |
---|
0:09:31 | them |
---|
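A minimal sketch of this extraction step, under the assumption that MAP adaptation has already produced per-state Gaussian means; normalizing by speaker-independent standard deviations is one common choice and is assumed here.

```python
import numpy as np

def hmm_supervector(adapted_state_means: list[np.ndarray],
                    state_stds: list[np.ndarray]) -> np.ndarray:
    """Build an HMM supervector: take the MAP-adapted GMM means of every
    HMM state, normalize each by speaker-independent standard deviations,
    and concatenate them into one long vector."""
    pieces = []
    for means, stds in zip(adapted_state_means, state_stds):
        pieces.append((means / stds).ravel())  # means: (n_gauss, feat_dim)
    return np.concatenate(pieces)

# Usage: 5 HMM states, 4 Gaussians per state, 13-dim features (placeholders).
rng = np.random.default_rng(0)
means = [rng.normal(size=(4, 13)) for _ in range(5)]
stds = [np.ones((4, 13)) for _ in range(5)]
print(hmm_supervector(means, stds).shape)  # (5 * 4 * 13,) = (260,)
```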
0:09:33 | okay so |
---|
0:09:35 | now let's talk about how we were able to cope with the reduced dataset |
---|
0:09:41 | if we look at the four different systems, we can see that the JFA |
---|
0:09:45 | and i-vector based systems |
---|
0:09:48 | should not be very sensitive to the reduced dev set, because we are |
---|
0:09:53 | using it only for score normalization |
---|
0:09:57 | so for the moment we didn't work on these systems; we just |
---|
0:10:01 | use these systems as is and see what happens |
---|
0:10:05 | for the NAP based systems the problem is much more serious, because |
---|
0:10:11 | they use the development set very extensively; first of all, we |
---|
0:10:15 | have less data for training the speaker-independent HMM |
---|
0:10:20 | and worse, we don't have any multi-session speakers |
---|
0:10:24 | so if we want, for example, to train NAP, we will not be able to |
---|
0:10:30 | do it in the standard way, and score normalization may also be affected |
---|
0:10:34 | so |
---|
0:10:36 | we had to do something for these two systems, the GMM based NAP and the HMM based |
---|
0:10:40 | NAP systems |
---|
0:10:43 | and as we will see in some slides, in the results |
---|
0:10:48 | we focus on these systems because they work much better than JFA and i-vector on this |
---|
0:10:52 | task, so |
---|
0:10:53 | it's very important to get this right |
---|
0:10:57 | okay, so for the GMM based NAP system, the first component is the UBM |
---|
0:11:02 | we compare two ways to estimate it: training on the |
---|
0:11:07 | reduced dataset, or training on NIST data |
---|
0:11:10 | for NAP we compare three methods: the first one is to train NAP from the |
---|
0:11:16 | NIST data |
---|
0:11:17 | the second one is to estimate NAP from the reduced data, although we |
---|
0:11:22 | don't have any multi-session speakers |
---|
0:11:25 | by using an approach that we call common speaker subspace |
---|
0:11:30 | compensation, which we used in two thousand seven |
---|
0:11:34 | and I will explain this approach |
---|
0:11:38 | in a bit more detail soon |
---|
0:11:39 | and of course the third method is to just combine the two compensations and |
---|
0:11:44 | use both of them |
---|
0:11:46 | so this common speaker subspace compensation basically works |
---|
0:11:51 | as follows: firstly |
---|
0:11:53 | we estimate this subspace from a large dataset with a single session from each speaker |
---|
0:11:59 | so in our case we have the one hundred speakers, and we |
---|
0:12:03 | just extract supervectors for these one hundred sessions and we just do PCA on these |
---|
0:12:10 | supervectors |
---|
0:12:12 | okay, and the dominant eigenvectors we get in some way |
---|
0:12:18 | represent the speaker subspace as such |
---|
0:12:25 | now, maybe contrary to the logical way one would use such a subspace |
---|
0:12:30 | instead of focusing the recognition in the speaker subspace, we just remove |
---|
0:12:35 | the dominant components of the speaker subspace |
---|
0:12:38 | actually the estimated speaker subspace also contains |
---|
0:12:42 | components of the channel subspace |
---|
0:12:45 | and we remove this subspace |
---|
0:12:47 | and what we get after removing it we call the speaker-unique subspace |
---|
0:12:53 | because |
---|
0:12:54 | in the space that you get after this removal we |
---|
0:12:58 | don't expect to have any information that is common to many speakers |
---|
0:13:03 | because we already removed |
---|
0:13:05 | the subspace that is common across |
---|
0:13:07 | speakers |
---|
0:13:08 | and the intuition, which we have also examined, is that it may be wise to |
---|
0:13:12 | do verification in this speaker-unique subspace, and we got quite interesting results |
---|
0:13:18 | so this is the idea |
---|
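A minimal sketch of this common-speaker-subspace compensation as described: PCA over one supervector per speaker, then removal of the top directions; the number of removed components and the toy dimensions are illustrative assumptions.

```python
import numpy as np

def speaker_unique_projection(supervectors: np.ndarray, k: int) -> np.ndarray:
    """Estimate the common speaker subspace by PCA over session supervectors
    (one per speaker) and return a projection that removes its top-k
    directions, keeping the 'speaker-unique' residual."""
    X = supervectors - supervectors.mean(axis=0)   # (n_sessions, dim)
    # Right singular vectors = principal directions of the supervectors.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:k].T                                   # (dim, k) dominant directions
    return np.eye(X.shape[1]) - V @ V.T            # removal projection matrix

# Usage: 100 single-session speakers, 50-dim toy supervectors, remove top 10.
rng = np.random.default_rng(0)
P = speaker_unique_projection(rng.normal(size=(100, 50)), k=10)
compensated = P @ rng.normal(size=50)
```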
0:13:21 | okay, for the HMM based NAP: for the speaker-independent HMM we cannot use the NIST |
---|
0:13:29 | data, because it needs to be text dependent, so |
---|
0:13:33 | the only choice is to use the reduced dev set |
---|
0:13:37 | for NAP |
---|
0:13:38 | we have three different methods: the first one |
---|
0:13:42 | is training NAP using the common speaker subspace method on the reduced |
---|
0:13:48 | dev set |
---|
0:13:50 | the second is to use a feature-space NAP |
---|
0:13:54 | which is trained from the NIST data, and the third one is a combination |
---|
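Feature-space NAP applies the removal per frame rather than per supervector, which is why it can be trained on text-independent NIST data; here is a minimal sketch under the assumption that a frame-level nuisance basis has already been estimated.

```python
import numpy as np

def feature_space_nap(frames: np.ndarray, nuisance_basis: np.ndarray) -> np.ndarray:
    """Remove nuisance directions from every feature frame.

    frames: (n_frames, feat_dim); nuisance_basis: (feat_dim, k) orthonormal.
    Because the projection acts on individual frames, it can be trained on
    text-independent data and still applied to text-dependent data."""
    V = nuisance_basis
    return frames - (frames @ V) @ V.T

# Usage with placeholder data: 200 frames of 13-dim features, 2 nuisance directions.
rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.normal(size=(13, 2)))
clean = feature_space_nap(rng.normal(size=(200, 13)), V)
```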
0:14:01 | okay, so just before presenting the results, just to show the |
---|
0:14:06 | quality of the systems: for NIST two thousand eight, on |
---|
0:14:11 | the standard telephone |
---|
0:14:15 | condition, males only |
---|
0:14:18 | we get |
---|
0:14:21 | quite reasonable results for the GMM-NAP, JFA and i-vector |
---|
0:14:28 | systems on this task |
---|
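The percentages quoted in this talk are relative changes in error rate; assuming the standard equal error rate (EER) metric of speaker verification, here is a minimal sketch of computing EER from target and impostor scores.

```python
import numpy as np

def eer(target_scores: np.ndarray, impostor_scores: np.ndarray) -> float:
    """Equal error rate: the point where false-accept and false-reject
    rates cross, found by sweeping a threshold over all observed scores."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best, best_gap = 0.0, np.inf
    for t in thresholds:
        frr = np.mean(target_scores < t)     # targets rejected
        far = np.mean(impostor_scores >= t)  # impostors accepted
        if abs(far - frr) < best_gap:
            best_gap, best = abs(far - frr), (far + frr) / 2
    return best

# Usage with synthetic scores: well-separated distributions give EER ~0.16.
rng = np.random.default_rng(0)
print(eer(rng.normal(2, 1, 1000), rng.normal(0, 1, 1000)))
```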
0:14:34 | okay, so these are the results for the JFA and i-vector based systems |
---|
0:14:39 | first, for the matched condition, where enrollment and verification both use the |
---|
0:14:44 | same channel type, landline or cellular |
---|
0:14:47 | what we see here is that |
---|
0:14:50 | we get a degradation of around twenty five percent for JFA and |
---|
0:14:57 | something similar for i-vectors |
---|
0:14:59 | we don't really understand why this is so |
---|
0:15:03 | now, for the mixed channel condition we also see a similar |
---|
0:15:09 | degradation for JFA and i-vectors |
---|
0:15:13 | of between seven and eight percent |
---|
0:15:17 | okay, this is as expected, because we have only one hundred sessions |
---|
0:15:23 | for score normalization, one per speaker |
---|
0:15:24 | okay, so for the GMM-NAP system |
---|
0:15:28 | we see |
---|
0:15:29 | that, for example, training the UBM from NIST data doesn't give us as |
---|
0:15:34 | good results as training from the reduced dev set |
---|
0:15:38 | and also when we look at the NAP |
---|
0:15:42 | it's actually better to train the NAP on the reduced dataset using the common speaker subspace |
---|
0:15:49 | method |
---|
0:15:51 | and of course if we just combine these subspaces |
---|
0:15:55 | we get the best results |
---|
0:15:58 | still, we see that |
---|
0:15:59 | we get quite a large degradation for the global condition, forty one percent relative |
---|
0:16:04 | this is because the global condition makes the most use of training from |
---|
0:16:08 | the development data |
---|
0:16:10 | whereas the speaker and prompted conditions don't make such use of |
---|
0:16:14 | the data, because it is not text matched |
---|
0:16:17 | so for them the degradation |
---|
0:16:20 | is not as severe |
---|
0:16:23 | for the mismatched condition we see quite similar |
---|
0:16:28 | trends |
---|
0:16:32 | this is for the HMM based system |
---|
0:16:36 | again |
---|
0:16:37 | we see that it helps to train the NAP on the reduced dev set, and of course |
---|
0:16:43 | we cannot train supervector-space NAP from NIST data here |
---|
0:16:46 | but we do get some improvement when we just |
---|
0:16:50 | add the feature-space NAP trained on NIST data |
---|
0:16:55 | and the combination does help somewhat |
---|
0:17:02 | we tried to analyze the HMM system, which is the best |
---|
0:17:06 | system for the global condition, which is |
---|
0:17:08 | the most important of all |
---|
0:17:11 | to see what is the main source of degradation, because we see that we |
---|
0:17:15 | have some significant degradation |
---|
0:17:19 | what we can see from these results is that |
---|
0:17:23 | if we compare to a system |
---|
0:17:29 | which is restricted to the reduced development set for everything except the NAP training, which still uses the full set |
---|
0:17:34 | then we see that we don't get such a |
---|
0:17:39 | significant degradation |
---|
0:17:47 | so the bottom line is that the main source of degradation is probably the NAP training |
---|
0:17:53 | okay, and now when we fuse the systems |
---|
0:17:55 | we see that we get a degradation of between thirty and forty percent |
---|
0:18:02 | but we can |
---|
0:18:04 | still get quite reasonable results |
---|
0:18:07 | especially for the global condition, which is the important one in this task |
---|
0:18:12 | so we still |
---|
0:18:13 | get around zero point six for the matched channel condition, where there is no mismatch |
---|
0:18:20 | with channel mismatch the results degrade |
---|
0:18:25 | so to conclude: we evaluated our systems |
---|
0:18:28 | on all authentication conditions, using the full development set and the reduced dev set |
---|
0:18:34 | the JFA and i-vector degradation is roughly five to fifteen percent |
---|
0:18:39 | for the NAP based systems the degradation is more dramatic, due to the |
---|
0:18:43 | strong use of the Wells Fargo data |
---|
0:18:46 | especially for the global condition |
---|
0:18:50 | for training the speaker-independent HMM, the reduced dev set is fine |
---|
0:18:55 | to use; you get only some small degradation due to it |
---|
0:18:58 | but for NAP it's important to do something, that is, to |
---|
0:19:03 | do a combination of NAP trained on NIST |
---|
0:19:06 | and the common speaker subspace method |
---|
0:19:09 | in order not to get a dramatic degradation |
---|
0:19:12 | finally, for the fused system we got a degradation of roughly thirty-five |
---|
0:19:16 | percent on average |
---|
0:19:17 | therefore we conclude that we can build a text-dependent system with a reduced dev set |
---|
0:19:24 | even if we don't have any multi-session data |
---|
0:19:26 | only single sessions |
---|
0:19:50 | also, for the global condition we are allowed to use the same text |
---|
0:19:56 | for the dev set, that is, for the one hundred |
---|
0:20:02 | sessions |
---|
0:20:03 | but for the speaker condition and the prompted condition we are |
---|
0:20:07 | not allowed to use the same |
---|
0:20:09 | text |
---|
0:20:11 | we only use the same digit constraint |
---|
0:20:26 | yes, and it's not obvious |
---|
0:20:45 | okay, the global password is just a |
---|
0:20:47 | fixed digit string |
---|
0:20:53 | for the speaker condition, in practice the text varies between speakers, but as in the global |
---|
0:21:00 | condition, you always use the same text for both enrollment and |
---|
0:21:07 | verification |
---|
0:21:10 | the use case is that each speaker has their own password |
---|
0:21:16 | so the only difference in this evaluation between these conditions is the use of the development |
---|
0:21:20 | data |
---|
0:21:21 | and the prompted condition is where you are prompted with a random digit string |
---|
0:21:41 | okay, we actually didn't really work |
---|
0:21:44 | on this |
---|
0:22:16 | but basically we did look at it, and we don't feel |
---|
0:22:20 | that it is a problem for this application; we only need a single class |
---|
0:22:39 | so the idea there is that, for example, for the development |
---|
0:22:44 | set here for the global condition |
---|
0:22:46 | we actually needed to record speakers saying "zero to nine"; now what happens if |
---|
0:22:52 | someone wants to change the password to a different one? then we would |
---|
0:22:56 | have to go and record speakers again |
---|
0:22:58 | saying the new password |
---|
0:23:01 | because we are actually using this text for development |
---|
0:23:04 | i think it's not a technology issue; people from business and marketing |
---|
0:23:11 | say that, from their experience with customers, this |
---|
0:23:17 | is not practical |
---|
0:23:18 | when you want to deploy such a system, most times |
---|
0:23:23 | you will not be able to collect |
---|
0:23:26 | so many recordings; they do think that it is practical to take one hundred |
---|
0:23:32 | speakers and record each once, but i don't think it's practical |
---|
0:23:36 | to take two hundred speakers and |
---|
0:23:39 | record them over four weeks, four sessions each |
---|
0:23:54 | yes, because this is text dependent; if you have a development set of speakers |
---|
0:23:58 | saying the same text, then you get much better results |
---|
0:24:02 | if you train your models on utterances actually saying "zero to nine" |
---|
0:24:09 | and we have this in the paper from last Interspeech, you |
---|
0:24:13 | will get something like a fifty percent reduction in error rate |
---|
0:24:18 | or seventy |
---|
0:24:20 | whereas if you try to use other text to train the models |
---|
0:24:24 | you lose much of this gain |
---|
0:24:42 | and there are some cases where they are not saying this |
---|
0:24:59 | there are other reasons |
---|
0:25:01 | which are not at all from a technological perspective |
---|