0:00:07 | Yeah, I guess it's the end of a long day, so thanks for staying.
---|
0:00:15 | This has a lot of overlap with what the first speaker did for this particular session, which is basically trying to find out if we can have different background models for different sets of speakers.
---|
0:00:35 | What we are proposing, at least in this paper, is that we can have speakers clustered according to the vocal tract length, and another way of doing it is to use the similarity between the MLLR matrices.
---|
0:00:49 | And we show that using a few of these speaker clusters we can get some improvement in performance as opposed to using a single UBM.
---|
0:00:59 | So the overview of the talk is pretty much as indicated. One is to give you a review of conventional speaker verification, where we very often use a single background model, and then the reason why we might want to use speaker-cluster-wise background models.
---|
0:01:18 | There are two ways you could do the clustering; at least in this paper that is what we are suggesting: one is to use the vocal tract length parameter itself, and the other is to use a speaker-dependent MLLR matrix supervector.
---|
0:01:32 | Then we show how we can build background models for each of these individual speaker clusters, and we compare the performance, first with a single gender-independent UBM, and then with a gender-dependent UBM and the gender-dependent speaker cluster models.
---|
0:01:53 | Some of this overlaps with what the first speaker presented here. As was pointed out, it's basically a binary decision problem: given the features and some claimed identity, we're trying to find out if the identity is true. We compare the log likelihood ratio between the claimed speaker's model and an alternate model, and if it's beyond a certain threshold we accept the claim; otherwise we reject it.
---|
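The accept/reject rule just described can be sketched in a few lines. This is a hedged illustration, not the paper's code: `gmm_loglik` and `verify` are hypothetical names, the models are diagonal-covariance GMMs, and the threshold is left as a free parameter.

```python
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood of frames (T, D) under a
    diagonal-covariance GMM with M mixtures."""
    T, D = frames.shape
    diff = frames[:, None, :] - means[None, :, :]                    # (T, M, D)
    log_norm = -0.5 * (D * np.log(2 * np.pi)
                       + np.sum(np.log(variances), axis=1))          # (M,)
    log_exp = -0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2)
    log_comp = np.log(weights)[None, :] + log_norm[None, :] + log_exp
    # log-sum-exp over mixtures, then average over frames
    m = log_comp.max(axis=1, keepdims=True)
    return float(np.mean(m[:, 0] + np.log(np.sum(np.exp(log_comp - m), axis=1))))

def verify(frames, speaker_gmm, background_gmm, threshold=0.0):
    """Accept the claimed identity iff the log likelihood ratio between the
    claimed speaker model and the background model clears the threshold."""
    llr = gmm_loglik(frames, *speaker_gmm) - gmm_loglik(frames, *background_gmm)
    return llr, llr >= threshold
```

Each GMM here is a `(weights, means, variances)` tuple; in the talk the background model is either the single UBM or, later, the cluster-specific background model.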
0:02:21 | And the question is what should be the alternate hypothesis. One is to say that the alternate hypothesis is a universal background model, which is a single model that is used for all speakers in the database.
---|
0:02:40 | Then there are other approaches where we have a set of speaker models, cohorts that are close to a particular speaker, and we take a linear combination of these scores, or we could build a background model for that particular speaker using these cohorts themselves. So one approach has one background model for all speakers; the other has a background model for each speaker.
---|
0:03:06 | The other way of doing it is some compromise between the two, which is to say that we have a background model for a group of speakers, and then the question becomes how to group the speakers. So we're proposing two different ways that we can group the speakers: one is basically using the vocal tract length parameter, and the other is to use a speaker-specific MLLR matrix.
---|
0:03:30 | So this is the basic idea: instead of using one background model and comparing the likelihood of that background model and the corresponding claimed speaker model, we actually have different sets of models for different speaker clusters. How we build these speaker cluster background models is what we talk about on the next slide, and the speaker clustering itself was basically done using either the vocal tract length parameter or the maximum-likelihood MLLR supervector.
---|
0:04:04 | The motivation for trying to use the vocal tract length parameter for speaker clustering is that physiological differences in the vocal tract are going to give rise to differences in the speech spectra.
---|
0:04:21 | What is shown here is, in the dashed lines, a male speaker and, in the solid lines, a female speaker. There are differences in the spectra for the same vowel, for the simple reason that the physiology of the production system is very different between a male speaker and a female speaker in terms of size.
---|
0:04:39 | Therefore we assume that if a group of speakers have a similar vocal tract length parameter, that is, similar physiology, they probably produce a very similar set of spectral characteristics for the same sound, and therefore we can group those speakers together and assume that they have very similar characteristics in terms of the features they produce for a particular sound.
---|
0:05:00 | Obviously, for VTLN we do not have a reference speaker, and therefore one has to use some sort of reference model. We use the background model itself as the reference, against which we score the features warped with different vocal tract length parameters and choose the one that does best. So each speaker basically has his or her vocal tract length parameter estimated with respect to the background model.
---|
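The warping-factor estimation against the background model can be sketched as a grid search. This is a toy illustration under strong assumptions: the "UBM" here is a single diagonal Gaussian rather than a full GMM, frequency warping is stood in for by a linear scaling of the feature vectors, and the function names and the grid of candidate factors are mine, not the paper's.

```python
import numpy as np

def ubm_loglik(frames, mean, var):
    """Per-frame average log-likelihood under a single diagonal Gaussian
    standing in for the GMM-UBM reference model."""
    d = frames.shape[1]
    return float(np.mean(-0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(var))
                                 + np.sum((frames - mean) ** 2 / var, axis=1))))

def estimate_warp(frames, mean, var, alphas=np.arange(0.80, 1.21, 0.02)):
    """Grid-search the VTLN warping factor: 'warp' the features for each
    candidate alpha, score against the reference model, keep the best.
    Real systems warp the frequency axis of the spectrum; scaling the
    feature vectors is only a stand-in to keep the sketch short."""
    scores = {round(float(a), 2): ubm_loglik(frames * a, mean, var) for a in alphas}
    return max(scores, key=scores.get)
```

A speaker whose features were generated at a warp of roughly 0.9 relative to the reference should get assigned a factor near 0.9 by this search.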
0:05:34 | This is similar to what we do in speech recognition too, except that we use the UBM instead of the speaker-independent model.
---|
0:05:41 | The other way that we could possibly classify speakers into groups is to use the MLLR matrix itself. There has been lots of evidence that MLLR does capture quite a bit of information about a particular speaker. So we stack the columns of the MLLR matrix to form a supervector, and then we do a very simple clustering of these MLLR supervectors among the speakers in the database, using the k-means algorithm with a simple Euclidean distance to cluster the different speakers.
---|
0:06:15 | So given the UBM and the speaker training data, we get an MLLR matrix for each speaker, we stack its columns to form a supervector, and this essentially characterizes a speaker. Then we group the speakers depending on the clusters formed by those supervectors.
---|
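The stack-and-cluster step can be sketched as follows. A minimal sketch, assuming a transform of the form `Ax + b` per speaker; `mllr_supervector` and `kmeans` are hypothetical helper names, and the deterministic farthest-point seeding is my simplification, not necessarily what the authors used.

```python
import numpy as np

def mllr_supervector(A, b):
    """Stack the columns of the MLLR transform (A, b) into one supervector."""
    # A.T.reshape(-1) lays out A column by column; the bias goes at the end.
    return np.concatenate([A.T.reshape(-1), b])

def kmeans(vectors, k, iters=50):
    """Plain k-means with Euclidean distance, as used to group the speakers."""
    # Greedy farthest-point initialisation, then standard Lloyd iterations.
    centers = [vectors[0]]
    for _ in range(1, k):
        d = np.min([((vectors - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(vectors[int(np.argmax(d))])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((vectors[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = vectors[labels == j].mean(axis=0)
    return labels, centers
```

Speakers whose transforms are similar end up with the same label, and each label then defines one speaker cluster.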
0:06:36 | So now that we have grouped the speakers into different classes, we build a different background model for each of these groups of speakers. What we have done here is just a simple MLLR adaptation of the UBM to get a new set of means for each of these speaker cluster background models. Each of these cluster-adapted models is obtained from the UBM by a transformation of the means, where the transformation matrix is estimated using all the data from a particular speaker cluster. So given the UBM, each cluster gets its own background model.
---|
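Applying an MLLR mean transform to the UBM is just an affine map of every mixture mean. A sketch under two simplifications: the transform here is estimated by plain least squares between paired mean vectors (the real MLLR estimate comes from EM statistics weighted by occupation counts), and the function names are mine.

```python
import numpy as np

def estimate_affine(ubm_means, cluster_means):
    """Least-squares estimate of an affine transform mapping UBM mixture
    means onto cluster-specific means: a stand-in for the full EM-based
    MLLR estimate."""
    M, D = ubm_means.shape
    X = np.hstack([ubm_means, np.ones((M, 1))])            # extended means [mu; 1]
    W, *_ = np.linalg.lstsq(X, cluster_means, rcond=None)  # (D+1, D)
    return W.T[:, :D], W.T[:, D]                           # A (D, D), b (D,)

def adapt_means(ubm_means, A, b):
    """Cluster background model means: mu'_m = A @ mu_m + b for every mixture."""
    return ubm_means @ A.T + b
```

With the transform in hand, the cluster background model keeps the UBM weights and variances and only replaces the means, which is what the talk describes.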
0:07:20 | The clustering can be based either on the VTLN parameter, so that speakers with warping factors close to one another are grouped together, or on a set of MLLR-clustered speakers.
---|
0:07:35 | These are the implementation aspects. Given the UBM, I first estimate the VTLN parameter for each of the speakers in the database. Let's say, if I'm looking at a VTLN parameter of 1.20, I find that speakers three, four and six all have this parameter, so I group them together. Then if I'm looking at a VTLN parameter of 0.82, the speaker IDs two, eight and nine possibly belong to this, so I group them together.
---|
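The bookkeeping on this slide, mapping each warping factor to the list of speaker IDs that share it, is just a dictionary grouping. A minimal sketch; the speaker IDs and factors in the usage below mirror the example given in the talk, and the function name is mine.

```python
from collections import defaultdict

def group_by_warp(speaker_alphas):
    """Group speaker IDs that share the same estimated warping factor,
    e.g. alpha 1.20 -> speakers 3, 4, 6."""
    groups = defaultdict(list)
    for spk, alpha in speaker_alphas.items():
        groups[round(alpha, 2)].append(spk)
    return dict(groups)
```

Each resulting group is then the training set for one cluster background model.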
0:08:03 | Then, using these speaker groups, I transform the GMM-UBM to form a background model, which is basically an MLLR adaptation towards that particular group of speakers. Then I do the individual speaker modelling by MAP adaptation: starting from the background model of each individual speaker's cluster, I use the corresponding speaker's data to do MAP adaptation.
---|
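The MAP adaptation step can be sketched for the means only, in the standard relevance-factor form (in practice weights and variances can be adapted as well). The function name and the relevance factor value are assumptions, not taken from the paper.

```python
import numpy as np

def map_adapt_means(frames, weights, means, variances, r=16.0):
    """Classic MAP mean adaptation with relevance factor r: mixtures that
    see a lot of the speaker's data move toward it; the rest stay at the
    background means."""
    diff = frames[:, None, :] - means[None]                       # (T, M, D)
    log_comp = (np.log(weights)[None]
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)[None]
                - 0.5 * np.sum(diff ** 2 / variances[None], axis=2))
    m = log_comp.max(axis=1, keepdims=True)
    resp = np.exp(log_comp - m)
    resp /= resp.sum(axis=1, keepdims=True)                       # posteriors (T, M)
    n = resp.sum(axis=0)                                          # occupation counts
    ex = (resp.T @ frames) / np.maximum(n[:, None], 1e-10)        # data means E_m
    alpha = (n / (n + r))[:, None]                                # adaptation weight
    return alpha * ex + (1 - alpha) * means
```

Run against the cluster background model's parameters with one speaker's training frames, this yields that speaker's model, which is then scored against the same background model at test time.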
0:08:31 | The same procedure would be used if I had clustered the speakers based on MLLR itself.
---|
0:08:41 | If you look at the test phase, it is almost the same as the conventional case except for two small differences. Given the test utterance, I find the log likelihood ratio by comparing the speaker model and the background model. In the conventional case there will be one single UBM, the speaker model is obtained by adapting it, and then I make a threshold-based decision whether to accept or reject.
---|
0:09:12 | The exact same thing is done here, but slightly different models are used: the background model is built specifically for that particular speaker's cluster, and the speaker model is obtained by adapting this cluster-specific UBM, so the speaker model is slightly different. Then again I take a log likelihood ratio. So basically both systems have identical computational cost, except that the models used for the background are slightly different.
---|
0:09:40 | This is just the standard database that we used: NIST 2002 for background modelling, and the evaluation is one-side train and one-side test on NIST 2004.
---|
0:09:58 | What we notice is that, depending on the number of VTLN clusters that we form, as the number of clusters increases you do see some decrease in the EER. This is what you would get if you use a single gender-independent UBM, this is the EER you would get if you use VTLN, and this is the EER you would get if you use MLLR-based speaker clustering.
---|
0:10:29 | We find that MLLR is slightly better than VTLN, but both of them give significantly better performance than the single-UBM-based setup. The same thing holds true for the minimum DCF also.
---|
0:10:43 | So a couple of things to notice. One is that both VTLN and MLLR give some improvement in performance as opposed to a single UBM, and MLLR performs slightly, sometimes quite a bit, better than VTLN. And we try to find the number of clusters that gives the best performance.
---|
0:11:07 | This is the corresponding DET curve, which again shows MLLR doing a little better than the black curve, which is obtained by VTLN clustering, and the blue one is the regular single-UBM baseline.
---|
0:11:22 | So the question to be asked is why MLLR is performing better than VTLN. If you look at the black and the white bars at the bottom, the black corresponds to female speakers and the white to male speakers. Here we have chosen fourteen clusters, which was the number that gave the maximum performance for VTLN, and you see that there are a lot of clusters where VTLN has both male and female speakers. If you look at warping factors around 1.0, there are both male and female speakers for that particular value; similarly, for 0.9 and 0.96 you see that there is some overlap between the male and female speakers.
---|
0:12:05 | On the other hand, when you look at the MLLR supervectors, the black and white bars are very distinct: some clusters pick up only the female speakers, and the other MLLR clusters pick up only the male speakers. So there seems to be quite a nice purity, in terms of gender, in the clustering when you use MLLR supervectors, and we think possibly that's one of the reasons why MLLR seems to consistently perform better than VTLN.
---|
0:12:39 | So we wanted to go one step further and see, if that was indeed the case, whether separating the clusters according to gender would make the gap between MLLR and VTLN disappear, so that we would get very similar performance using both of them. That's what the next set of experiments basically indicates.
---|
0:12:56 | So here we have gender-wise UBMs. With two UBMs, one for males and one for females, you obviously see some improvement in performance compared to the gender-independent UBM. But also what we conjectured seems to hold: once we do gender-wise splitting of the clusters, VTLN and MLLR give almost comparable performance. MLLR is still slightly better, but nevertheless the performance is almost comparable, and the same thing holds true for the minimum DCF also.
---|
0:13:33 | So the point that we want to make is that if you use just the VTLN parameter for clustering, it sometimes gives less good performance, for the simple reason that it picks up both male and female speakers for the same alpha. But if we do gender-wise clustering, then MLLR and VTLN give almost the same, comparable performance, and in any case both of these methods of clustering outperform the gender-wise single UBM for each gender.
---|
0:14:04 | And that's reflected also in the DET curve. You can see that both the MLLR-clustered and the VTLN-clustered models, both of which are gender-wise clustered now, have very similar performance, and they always do better than a gender-wise UBM.
---|
0:14:22 | So the bottom line is that if you are willing to increase the number of background models, and not by much (we find that something like two male and two female clusters is enough), you get some gain in performance.
---|
0:14:41 | This holds both in the gender-independent and the gender-dependent case. The computational cost at test time is the same as for a single UBM, because we are just comparing two models. The MLLR supervector performs better than VTLN in most cases, but the gap narrows if you're willing to use gender-wise speaker clustering. So that's it.
---|
0:15:09 | we have time for one last question |
---|
0:15:20 | Q: You cluster the speakers, and they use a different UBM depending on the training speakers, right? You cluster the training speakers, so one training speaker is associated with one UBM?
---|
0:15:38 | A: That's right.
---|
0:15:41 | Q: But when you have a new sample...
---|
0:15:45 | A: Are you talking about speaker verification? Oh, I see, you mean the speaker verification task.
---|
0:15:52 | Q: Yeah, one particular person only, and then he scores against that UBM?
---|
0:15:57 | A: It's not per speaker; that would be much more expensive. Here each cluster of speakers, not each speaker, has one associated background model.
---|
0:16:10 | Q: Right, okay. But the reason it's not more expensive is that you're only considering one training speaker.
---|
0:16:15 | A: Right.
---|
0:16:23 | Okay, I think there's no more time, so thank you very much for attending this session, and thanks to the speakers.
---|