0:00:13 | So, as was said, you have already been told what the i-vector is; let me go quickly through it.
0:00:21 | The i-vector is an information-rich, low-dimensional, fixed-length representation (a voice print) of an arbitrarily long utterance.
0:00:28 | We like these representations because they remove the time dimension and turn the speaker ID task into a pattern recognition problem, and you have already been shown how to deal with them.
0:00:40 | So, just to go quickly over the estimation again: what we want to model is the data that comes in. Here we have an example of an utterance.
0:00:54 | We usually model it using a Gaussian mixture model; we forget about the variances, remember only the means, and construct a supervector of the means.
0:01:06 | Now, what we do is look at more data: we extract the means of all the utterances, and we see that they follow some kind of distribution.
0:01:18 | This is what we assume in the i-vector model. We see that there is some offset, which is represented by the UBM mean, the m symbol in this picture, and then we have the total variability space, which is represented by the arrows: they show in which directions we can shift the mean to adapt it to the incoming utterance.
0:01:46 | That is, they describe the directions of the variability.
0:01:51 | The vector w is a latent variable, so we can impose a prior on it; we choose a Gaussian, the standard normal prior.
0:02:03 | Given some incoming data X, we compute the posterior, which is also Gaussian, with mean w_X and precision matrix L_X.
0:02:17 | Basically, what we call the i-vector is the mean of this posterior.
0:02:22 | Without going into any detail (this is just a cookbook-style talk): to compute the i-vector we need the statistics extracted with the UBM, that is, the zero-order statistics and the first-order statistics.
0:02:41 | Before we go any further, we do a little trick: we center the data around the UBM, so we find which cluster of the data belongs to which component of the UBM, and we shift it around accordingly.
0:02:56 | We also whiten the data using the UBM covariance matrices, which, as you may have already realized, virtually makes the covariances of the individual GMM components equal to identity.
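(A minimal NumPy sketch of this statistics extraction with the centering and whitening tricks just described; the names are illustrative, not from any particular toolkit, and a diagonal-covariance UBM is assumed.)

```python
import numpy as np

def collect_stats(frames, weights, ubm_means, ubm_covs):
    """frames: (T, F) features; weights: (C,) UBM weights;
    ubm_means: (C, F); ubm_covs: (C, F) diagonal covariances."""
    # Frame-by-component log-likelihoods under the diagonal-covariance UBM.
    ll = -0.5 * (((frames[:, None, :] - ubm_means) ** 2) / ubm_covs).sum(-1)
    ll += np.log(weights) - 0.5 * np.log(2 * np.pi * ubm_covs).sum(-1)
    # Posterior occupation probabilities (responsibilities), shape (T, C).
    gamma = np.exp(ll - ll.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)
    N = gamma.sum(axis=0)            # zero-order statistics, (C,)
    f = gamma.T @ frames             # first-order statistics, (C, F)
    f -= N[:, None] * ubm_means      # centering around the UBM means
    f /= np.sqrt(ubm_covs)           # whitening: component covariances -> I
    return N, f
```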
0:03:22 | So here is the cookbook equation for computing the i-vector. The i-vector w_X is basically a product of the inverse of the posterior precision, the factor loading matrix T, which describes the subspace, and the first-order statistics: w_X = L_X^{-1} T^T f(X).
0:03:42 | The precision matrix L_X is basically a sum, over all the Gaussians, of the associated pieces T_c^T T_c of the T matrix, weighted by the zero-order statistics of the incoming utterance: L_X = I + sum_c N_c(X) T_c^T T_c.
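(Put together with the statistics above, a sketch of this standard extraction could look as follows, assuming centered and whitened statistics and T_c being the (F, M) block of T that belongs to component c.)

```python
import numpy as np

def extract_ivector(N, f, T, F, M):
    """N: (C,) zero-order stats; f: (C, F) centered, whitened first-order
    stats; T: (C*F, M) factor loading matrix."""
    L = np.eye(M)                    # prior precision is the identity
    for c in range(len(N)):
        Tc = T[c * F:(c + 1) * F]    # (F, M) block for component c
        L += N[c] * (Tc.T @ Tc)      # L_X = I + sum_c N_c(X) T_c' T_c
    return np.linalg.solve(L, T.T @ f.reshape(-1))  # w_X = L_X^{-1} T' f(X)
```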
0:03:59 | Now a little analysis of what this function actually computes. We have C GMM components, an F-dimensional feature space, and an M-dimensional subspace, which describes our total variability space.
0:04:21 | The M^3 term is the inversion; there is not much we can do about that. The biggest problem is actually the sum in the precision computation, and then we have the product of the individual matrices with the first-order statistics.
0:04:45 | As for the memory complexity, we have to store everything we use in the computation. When computing the precision, we can pre-compute these products in advance, because they do not depend on the data, so the memory demands are really high for this model.
0:05:04 | If we mention that in a typical model we have thousands of Gaussians, this can be really large. The other thing we have to store is the T matrix itself. These two terms bound the memory complexity of the extractor.
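(To make this concrete, a back-of-the-envelope count; these numbers are mine, derived from the configuration quoted later in the talk, assuming 4-byte floats.)

```python
C, F, M = 2048, 60, 400          # Gaussians, feature dim, subspace dim
tt_blocks = C * M * M * 4 / 1e9  # pre-computed T_c' T_c terms: ~1.3 GB
t_matrix  = C * F * M * 4 / 1e9  # the T matrix itself:         ~0.2 GB
```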
0:05:21 | So the motivation for the simplification of these formulas was that we actually wanted to port the application to small-scale devices, as part of the MOBIO project.
0:05:30 | We also wanted to prepare the i-vector framework for discriminative training, where we thought that such equations could be quite difficult to compute gradients for.
0:05:43 | Let's first take a look at the first simplification. What we assume here, in the first assumption shown in the picture, is that the proportion of the data generated by each Gaussian of the UBM is constant across utterances, and that these proportions are given by the UBM weights.
0:06:05 | What happens is that the sum in the precision computation becomes independent of the data, and we can effectively pre-compute it in advance.
0:06:17 | So each time we compute the precision, instead of evaluating the whole sum in the formula, we only have a scaled addition of two matrices.
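(A sketch of what this buys, under the stated assumption N_c(X) ≈ n(X) ω_c, where n(X) is the utterance frame count and ω_c are the UBM weights; names again illustrative.)

```python
import numpy as np

def precompute_weighted_sum(T, weights, F, M):
    """Done once, offline: S = sum_c omega_c T_c' T_c."""
    S = np.zeros((M, M))
    for c in range(len(weights)):
        Tc = T[c * F:(c + 1) * F]
        S += weights[c] * (Tc.T @ Tc)
    return S

def extract_ivector_simplified(n_frames, f, T, S, M):
    L = np.eye(M) + n_frames * S     # scaled addition, no per-utterance sum
    return np.linalg.solve(L, T.T @ f.reshape(-1))
```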
0:06:42 | A little analysis: we effectively got rid of the C M^2 term in the computational complexity, and almost all of the memory complexity is gone as well; basically, we got rid of most of the data that we were storing before.
0:06:59 | Just as a reminder before the results section: the number of Gaussians is in the thousands, as was said, and the typical size of the subspace is in the hundreds, say four hundred.
0:07:14 | So that was the first simplification. We also had the thought, or we tried to assume, that we can find a linear orthogonalization transformation G, some G that would diagonalize the T_c^T T_c terms, the component-associated parts of the factor loading matrix T, which are bothering us in the precision computation.
0:07:44 | With such a transformation, we can multiply the equation from both sides and get something like this; then, to get the original precision, we would just multiply from both sides by the inverse of G, if our assumption was correct.
0:08:03 | The nice thing here is that we would be summing diagonal matrices, which can be implemented effectively, in C or in MATLAB, and the other nice thing is that the resulting precision matrix is diagonal as well.
0:08:21 | If you remember, we were inverting it in the i-vector extraction formula, so the inversion of this diagonal matrix is trivial here.
0:08:34 | To write it effectively, we can pack the diagonals of the G^T T_c^T T_c G terms into a single matrix, and we can then simply take a dot product with the vector of zero-order statistics; that is the first equation here.
0:08:58 | The lower-case diag symbol stands for extracting the diagonal of a matrix into a column vector, and the capital Diag symbol maps a column vector back onto a diagonal matrix.
0:09:13 | The i-vector extraction is then given by the second equation here. A nice thing about it is that the G transpose in the middle of the equation can be projected directly into the T matrix, which can give us some benefit, and, as we said, the L matrix can be inverted effectively.
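(A sketch of this diagonalized extraction, assuming a single G has been found that makes every G^T T_c^T T_c G approximately diagonal; the result lives in the G-rotated space, so multiply by G to map back if the original coordinates are needed.)

```python
import numpy as np

def precompute_diagonals(T, G, F, M):
    """Done once, offline: row c holds diag(G' T_c' T_c G)."""
    C = T.shape[0] // F
    D = np.empty((C, M))
    for c in range(C):
        TcG = T[c * F:(c + 1) * F] @ G
        D[c] = np.einsum('fm,fm->m', TcG, TcG)  # column-wise sums of squares
    return D

def extract_ivector_diag(N, f, T_proj, D):
    """T_proj = T @ G, i.e. G absorbed ("projected") into T offline."""
    l = 1.0 + N @ D                        # diagonal of the precision, (M,)
    return (T_proj.T @ f.reshape(-1)) / l  # trivial inversion of a diagonal
```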
0:09:31 | If we look at the analysis again: in the computational complexity we got rid of the quadratic terms, since we now work only with the diagonal, and in the memory complexity we got an extra C M term, but we got rid of the C M^2 term.
0:09:56 | The question is how we compute the G matrix. The first idea was to use PCA, which, as we will see, works.
0:10:06 | The second idea was to use heteroscedastic linear discriminant analysis (HLDA). Here is a simple example of what it does: basically it wants to rotate those two covariance matrices by forty-five degrees, so that the average within-class covariance becomes an identity matrix here.
0:10:30 | That was the inspiration; it is used with success in LVCSR tasks.
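(For the PCA option, one plausible reading, my sketch rather than the exact recipe from the talk: take G as the eigenvectors of the UBM-weight-averaged matrix, which diagonalizes the average exactly and each individual term only approximately.)

```python
import numpy as np

def compute_G_pca(T, weights, F, M):
    S = np.zeros((M, M))
    for c in range(len(weights)):
        Tc = T[c * F:(c + 1) * F]
        S += weights[c] * (Tc.T @ Tc)  # weighted average of T_c' T_c terms
    _, G = np.linalg.eigh(S)           # orthonormal eigenvectors as columns
    return G
```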
0:10:36 | Just a quick slide where I say something about training the T matrix, for those who do not know it. There is a pair of accumulators that we have to accumulate while training the T matrix: we go over all training utterances, do some computation, accumulate, and do an update at the end of this procedure.
0:10:59 | Without a deep theoretical explanation: inside this computation we see that we use both w, which is the latent i-vector, and the precision matrix. So if we know how to simplify this precision matrix, we can simplify the accumulators, and the whole training procedure gets simplified as well.
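(For reference, a sketch of the usual maximum-likelihood EM accumulators for T, shown only to make visible where w and the precision L enter, and hence where the simplifications help.)

```python
import numpy as np

def accumulate(N, f, T, F, M):
    """Per-utterance E-step contributions; the M-step after summing over
    utterances is T_c = Cacc_c @ inv(Aacc_c)."""
    L = np.eye(M)
    for c in range(len(N)):
        Tc = T[c * F:(c + 1) * F]
        L += N[c] * (Tc.T @ Tc)
    Linv = np.linalg.inv(L)
    w = Linv @ (T.T @ f.reshape(-1))          # posterior mean, E[w]
    Eww = Linv + np.outer(w, w)               # second moment, E[w w']
    Aacc = N[:, None, None] * Eww             # (C, M, M)
    Cacc = f[:, :, None] * w[None, None, :]   # (C, F, M), i.e. f_c E[w]'
    return Aacc, Cacc
```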
0:11:20 | The memory usage is reduced here as well: with this simple trick we get to about a half of the memory, which means we can effectively try to increase the other parameters. The numbers of parameters are shown here for comparison.
0:11:42 | Now for the experimental setup: we used MFCC features, the standard thing, with short-time cepstral mean and variance normalization, and we used delta and double-delta coefficients.
0:11:56 | For the training set we used different combinations of Switchboard II Phases 2 and 3, Switchboard Cellular, and the NIST 2004 to 2006 data; these were also used for training the T matrix.
0:12:09 | For the test set, we evaluated on the NIST SRE'10 extended core condition 5, which is telephone-telephone, female and male.
0:12:18 | One thing to mention on this slide is that we used exactly the same scoring as was mentioned in the previous talk, that is, the cosine distance with within-class covariance normalization.
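(A sketch of this scoring under the usual WCCN recipe; W below is the average within-speaker covariance of training i-vectors. This is the standard formulation, not code from the talk.)

```python
import numpy as np

def wccn_cosine_score(w_enroll, w_test, W):
    B = np.linalg.cholesky(np.linalg.inv(W))  # WCCN projection: W^{-1} = B B'
    a, b = B.T @ w_enroll, B.T @ w_test
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```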
0:12:28 | For the performance tests, because we measured the speed and the memory demands, we used the MATLAB environment, which was set to single-core, single-thread operation, and ran on one of our internal processors.
0:12:46 | We measured the speed on fifty randomly picked utterances from the Mixer corpus, for which we had the statistics computed in advance, so the statistics collection is not included in the analysis.
0:12:59 | The UBM was a diagonal-covariance, 2048-component UBM, and it was trained on the Fisher data.
0:13:09 | So, a summary of the numbers: we used 2048 Gaussians, the feature dimension was 60, and we used a 400-dimensional subspace.
0:13:19 | The 400 had been chosen as a trade-off between performance and technical conditions, by which I mean the configuration of the machines that computed the i-vectors.
0:13:31 | As I said, in one of our simplifications we were able to decrease the memory demands, so to be fair we also had to use an 800-dimensional subspace, just to see what happens.
0:13:50 | Here is a little constellation plot of the results. The X here is the baseline, and of course the little block down at the bottom marks the 800-dimensional traditional i-vector extraction.
0:14:09 | You can see that the simplified systems perform slightly poorer than the baseline, but this is just an informative picture. We see that the best non-traditional i-vector extractor goes from about 3.6 to about 3.8 percent equal error rate.
0:14:30 | The same can be said, analogically, for the norm minDCF: the systems are slightly worse, but this work was aimed elsewhere.
0:14:42 | Now to the analysis of the speed of the computation. With the baseline, extracting those fifty i-vectors took about thirteen seconds.
0:15:01 | You can see the relative numbers here. Compared with the 800-dimensional baseline there is a huge drop, because the complexity grows there.
0:15:17 | The nice thing to observe is that if we are able to train such a system somehow on our hardware, we can afford to use 800-dimensional i-vectors and still get to about ten percent of the original time that was necessary to compute those fifty i-vectors.
0:15:38 | Now let's take a look at the comparison of memory usage. For the baseline, the first column, I mean the second column, is what is constant: something that we cannot change, something that we have to store in memory.
0:15:53 | Going into the specific numbers, they show a dramatic decrease in memory needs for the simplified algorithms.
0:16:04 | So, if we want to use 800-dimensional i-vectors, we still need only a fraction of the memory of the traditional 800-dimensional baseline system, which, again, counts practically.
0:16:24 | This is just a proof that we can use those simplifications in the i-vector training procedure as well. We save space, but the simplifications also make the training process a lot faster.
0:16:38 | These numbers just show that the difference between a T matrix trained using the traditional i-vector extraction and one trained using the simplified i-vector extraction is small, so we can really train directly with the simplified one.
0:16:57 | So the conclusion is that we managed to simplify this state-of-the-art technique in terms of speed and memory, while sacrificing some of the recognition performance.
0:17:11 | We have also simplified the formulas so that they are easily differentiable, for the future work, which is going to be the discriminative training of the i-vector extractor: the matrix T, or G, and the others.
0:17:27 | And finally, we managed to fit the i-vector based system into a cellphone application, which was one of the tasks in the MOBIO project, which was on bi-modal mobile speaker recognition.
0:17:47 | Thank you.
0:17:53 | (Chair) Thank you. We have time for one or two questions.
0:18:02 | (Audience) No questions? Then I have one. You made two assumptions to simplify your algorithm. Did you verify in some way which of the assumptions was the worse one? That is, did you verify it in some way with the data, and not just by looking at the scores?
0:18:23 | No. We were only looking at the recognition performance, whether one or the other assumption was wrong; that is all.
0:18:33 | But there was a mismatch, of course: the proportion of the data generated by the Gaussians differs across utterances; it is not always equal to the UBM weights.
0:18:51 | And because we were using 2048 Gaussians, finding one single orthogonalization matrix is probably also not appropriate here. But we tried it, and it worked to some extent.
0:19:09 | (Chair) Okay. Any other questions?
0:19:20 | No, I did not combine the techniques; I have not combined them so far.
0:19:31 | Oh, I am sorry, I misheard. Yes, it was better than PCA.
0:19:40 | Than the baseline? No. Yeah, thank you, that is a good point.
0:19:52 | (Chair) Okay, there are no other questions, so let's thank the speaker again.