0:00:14 | [Session chair introduction; largely inaudible.]
0:00:22 | Okay.
0:00:23 | The talk I'm going to give is about how to make uncertainty propagation run fast and also consume less memory.
0:00:32 | My name is Man-Wai Mak, from The Hong Kong Polytechnic University.
0:00:37 | Here is the outline of my presentation. I will first give an overview of i-vector PLDA, explain how uncertainty propagation can model the uncertainty of the i-vector, and then show how to make uncertainty propagation run faster and possibly use less memory. We evaluated the proposed approach on NIST 2012 SRE trials, and finally I will conclude.
0:01:08 | So here is the i-vector PLDA framework. Probably you all know this already, so I will go through it very quickly.
0:01:22 | We use the posterior mean of the latent factor as a low-dimensional representation of the speaker. Given the MFCC vectors of an utterance, we compute the posterior mean of the latent factor and call it the i-vector. T is the total variability matrix, which defines the channel and speaker subspaces, or rather the subspace in which the i-vectors vary.
0:02:00 | So here is the procedure for i-vector extraction: given a sequence of MFCC vectors, we extract the i-vector as the posterior mean of the latent factor.
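As a rough sketch of what this extraction step computes (a minimal reconstruction from the standard i-vector posterior formulas, not the authors' code; all names are illustrative):

```python
import numpy as np

def ivector_posterior(T, Sigma_inv, N, f):
    """Posterior of the latent factor given Baum-Welch statistics.

    T         : (C*F, D) total variability matrix (C mixtures, F-dim features)
    Sigma_inv : (C*F,)   inverse diagonal UBM covariances, flattened
    N         : (C,)     zeroth-order statistics (soft frame counts per mixture)
    f         : (C*F,)   centered first-order statistics, flattened
    """
    C = len(N)
    D = T.shape[1]
    F = T.shape[0] // C
    # Precision: L = I + sum_c N_c * T_c^T Sigma_c^{-1} T_c
    L = np.eye(D)
    for c in range(C):
        Tc = T[c * F:(c + 1) * F]                               # (F, D) partition
        L += N[c] * Tc.T @ (Sigma_inv[c * F:(c + 1) * F, None] * Tc)
    cov = np.linalg.inv(L)              # posterior covariance (the uncertainty)
    w = cov @ (T.T @ (Sigma_inv * f))   # posterior mean = the i-vector
    return w, cov
```

The same routine also yields the posterior covariance that the uncertainty-propagation part of the talk relies on.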
0:02:11 | Because we would like to use Gaussian PLDA, we need to suppress the non-Gaussian behavior of the i-vectors through some preprocessing, for example whitening and also length normalization.
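A minimal sketch of this preprocessing, assuming a whitening transform `W` and global mean `mu` estimated on development i-vectors (illustrative names):

```python
import numpy as np

def preprocess_ivector(w, W, mu):
    """Whiten, then length-normalize an i-vector."""
    w = W @ (w - mu)              # whitening (W from, e.g., PCA/WCCN on dev data)
    return w / np.linalg.norm(w)  # length normalization onto the unit sphere
```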
0:02:29 | After this preprocessing step, we get the preprocessed i-vectors, and these preprocessed i-vectors can be modeled by PLDA. The model has the form w_ij = m + V h_i + ε_ij, where V represents the speaker subspace and h_i is the speaker factor. Notice that for all sessions j of the i-th speaker, we only have one latent factor h_i. And ε_ij represents the variability that cannot be represented by the speaker subspace.
0:03:14 | Now, at scoring time we have a test i-vector w_t, and we also have the target-speaker i-vector w_s. We compute the likelihood assuming that w_s and w_t come from the same speaker, and we also have the alternative hypothesis where w_s and w_t come from different speakers.
0:03:45 | Therefore, after some mathematical manipulation, we arrive at this very nice equation. In this equation we only have matrix and vector multiplications, and the nice thing is that the matrices can all be precomputed, as you can see from this set of equations here at the bottom. All of these, Σ_ac, Σ_tot and so on, can be precomputed from the PLDA model parameters. That explains why PLDA scoring is very fast.
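One common form of this fast verification score (a reconstruction following Garcia-Romero and Espy-Wilson's formulation, which may differ in notation from the slide) is:

```latex
\text{score}(\mathbf{w}_s,\mathbf{w}_t)
  = \mathbf{w}_s^\top \mathbf{Q}\,\mathbf{w}_s
  + \mathbf{w}_t^\top \mathbf{Q}\,\mathbf{w}_t
  + 2\,\mathbf{w}_s^\top \mathbf{P}\,\mathbf{w}_t + \text{const},
```

where, with the total and across-class covariances defined as Σ_tot = VVᵀ + Σ and Σ_ac = VVᵀ,

```latex
\mathbf{Q} = \Sigma_{\mathrm{tot}}^{-1}
  - \bigl(\Sigma_{\mathrm{tot}} - \Sigma_{\mathrm{ac}}\,\Sigma_{\mathrm{tot}}^{-1}\Sigma_{\mathrm{ac}}\bigr)^{-1},
\qquad
\mathbf{P} = \Sigma_{\mathrm{tot}}^{-1}\,\Sigma_{\mathrm{ac}}
  \bigl(\Sigma_{\mathrm{tot}} - \Sigma_{\mathrm{ac}}\,\Sigma_{\mathrm{tot}}^{-1}\Sigma_{\mathrm{ac}}\bigr)^{-1}.
```

Since P and Q depend only on model parameters, they are computed once, and each trial then costs only a few matrix-vector products.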
0:04:26 | But one problem of this conventional i-vector PLDA framework is that it does not have the ability to represent the reliability of the i-vector. Whether the utterance is very long or very short, we still use a low-dimensional i-vector to represent the speaker characteristics of the whole utterance.
0:04:52 | This poses a problem for short-utterance speaker verification. It is not a problem for very long utterances, say when we have three minutes or so of speech. But if the utterance is only about ten seconds, or three seconds, then the variability, or uncertainty, of the i-vector will be so high that the PLDA score will favor the same-speaker hypothesis, even if the test utterance is given by an impostor. The reason is that if the utterance is very short, we will not have enough acoustic vectors for the MAP estimation; that is, we do not have enough acoustic vectors to compute the posterior mean of the latent factor in the factor analysis model.
0:05:44 | So the idea of uncertainty propagation is that we not only extract the i-vector but also the posterior covariance matrix of the latent factor. This diagram illustrates the idea: this Gaussian represents the posterior density of the latent factor, and the i-vector is its mean, so it is a point estimate. This equation shows the procedure for computing it.
0:06:16 | Here T_c is the c-th partition of the total variability matrix. As you can see, if the variance of this Gaussian is very large, then the point estimate will not be very accurate, and this happens when the utterance is very short. If the utterance is very short, N_c, which is the zeroth-order sufficient statistic, will be very small. If this quantity is small, then the posterior covariance matrix L^{-1} will be very big, which means that the variance is large. As a result, the point estimate might not be very reliable.
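Written out (consistent with the extraction step earlier; this is the standard posterior-precision formula, reconstructed rather than copied from the slide):

```latex
\mathbf{L} \;=\; \mathbf{I} \;+\; \sum_{c=1}^{C} N_c\,\mathbf{T}_c^{\top}\boldsymbol{\Sigma}_c^{-1}\mathbf{T}_c,
\qquad
\operatorname{cov}(\mathbf{w}) \;=\; \mathbf{L}^{-1}.
```

With only a few frames, every N_c is small, L stays close to the identity, and the posterior covariance L^{-1} stays large, which is exactly the unreliable-point-estimate situation described here.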
0:06:57 | That is why, in 2013, Kenny proposed the idea of PLDA with uncertainty propagation. The idea is that, in addition to extracting the i-vector, we also extract the posterior covariance matrix of the latent factor, and that represents the uncertainty of the i-vector.
0:07:19 | With some preprocessing, as I have mentioned, because we want to use Gaussian PLDA as the final stage of the modeling and scoring, we also need to preprocess this covariance matrix, giving a preprocessed version of the posterior covariance matrix, so that we can do the PLDA modeling. Now, where does the uncertainty propagation come in? It comes from this generative model.
0:07:56 | In the generative model we have w_ij = m + V h_i + U_ij z_ij + ε_ij. As you can see, this U plays a role like the eigenchannels in the conventional PLDA model. But unlike an eigenchannel matrix, U_ij depends on the session: it depends on the i-th speaker and the j-th session of that speaker. As a result, z_ij also depends on i and j.
0:08:28 | Now, the trouble with this is that for every test utterance we also need to compute this U_ij. Unlike the eigenchannel situation, where we only need to precompute the matrix once and make use of it during scoring, in uncertainty propagation this U_ij has to be computed at scoring time, because it is session-dependent. To compute this U_ij, we perform a Cholesky decomposition of the posterior covariance matrix, and that is why we have this session-dependent intra-speaker covariance matrix.
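A minimal sketch of this step (illustrative names; `cov_tilde` stands for the preprocessed posterior covariance and `Sigma` for the PLDA residual covariance):

```python
import numpy as np

def session_loading(cov_tilde):
    # Cholesky factor U_ij of the preprocessed posterior covariance,
    # so that U_ij @ U_ij.T reproduces the i-vector's uncertainty
    return np.linalg.cholesky(cov_tilde)

def within_speaker_cov(Sigma, U_ij):
    # Session-dependent intra-speaker covariance under uncertainty propagation
    return Sigma + U_ij @ U_ij.T
```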
0:09:05 | Finally, during scoring with PLDA-UP, we have this equation, which is very similar to the equation of the conventional PLDA; as you can see, it is all matrix and vector multiplications. But the difference is that this time the A, B, C and D all depend on the test utterance. As you can see from this set of equations, these matrices, with subscripts s and t, all depend on the test utterance, which means they cannot be precomputed; only a very small number of matrices can be precomputed. This set has to be computed during scoring time, and only this set can be computed before scoring time. So we cannot save much computation, because of the use of the covariance matrices.
0:10:07 | This slide summarizes the computations that need to be performed. For the conventional PLDA we have almost nothing to compute; all we need to compute are these matrix and vector multiplications. But for PLDA with UP, we have to compute this whole set of matrices on the right. As you can see, that increases the computational complexity a lot, and it also increases the memory consumption, because for every target speaker we need to store these session-dependent matrices.
0:10:47 | So we propose a way of speeding up the computation and also reducing the memory consumption. The whole idea comes from this equation. In this equation, the posterior covariance matrix only depends on N_c, and at testing time N_c will be the zeroth-order sufficient statistic of the test utterance.
0:11:15 | So if two i-vectors come from utterances of similar duration, we assume that their posterior covariance matrices are similar, because, as you can see, the covariance depends on the MFCCs, that is, on the acoustic vectors, only through the zeroth-order sufficient statistics.
0:11:36 | Having this hypothesis, we propose to group the i-vectors according to their reliability. We use a scalar to quantify the reliability, and for each group the i-vector reliability is modeled by one representative covariance matrix, which we obtain from the development data.
0:12:06 | So here the index k stands for the k-th group, and this U_k is independent of the session. If you look at the bottom of the slide, we have U_ij, which depends on the session. But now, as you can see here, we have successfully made the session-dependent U_ij become a session-independent U_k. Having this U_k session-independent, we can do a lot of precomputation.
0:12:48 | So we group the i-vectors using these three approaches. One is based on the utterance duration: it is intuitive to group the i-vectors based on duration, because we believe that duration is related to the uncertainty, that is, to the reliability of the i-vector.
0:13:19 | We have also tried using the mean of the diagonal elements of the posterior covariance matrix. This is a nice thing to do because the mean of the diagonal elements is a scalar, so grouping becomes very easy. And the last one we tried is the largest eigenvalue of the covariance matrix.
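A sketch of how such a grouping could be built on development data; the three reliability measures follow the talk, while the equal-occupancy binning and the group-mean representatives are illustrative choices of mine:

```python
import numpy as np

def reliability(cov, method="mean_diag", duration=None):
    """Scalar reliability measure for one i-vector."""
    if method == "duration":                    # approach 1: utterance duration
        return duration
    if method == "mean_diag":                   # approach 2: mean diagonal of cov
        return float(np.mean(np.diag(cov)))
    return float(np.linalg.eigvalsh(cov)[-1])   # approach 3: largest eigenvalue

def build_groups(dev_covs, scores, K):
    """Bin dev covariances into K groups by reliability score and return one
    representative covariance per group (here, the group mean)."""
    edges = np.quantile(scores, np.linspace(0.0, 1.0, K + 1))
    ids = np.clip(np.searchsorted(edges, scores) - 1, 0, K - 1)
    reps = [np.mean([c for c, g in zip(dev_covs, ids) if g == k], axis=0)
            for k in range(K)]
    return edges, reps
```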
0:13:38 | This slide basically tells us how to perform the grouping. For example, if this is the time axis, then this region corresponds to extremely short utterances, this one to medium-length utterances, and this case to long utterances. In each group we find one representative to represent the whole group. So U_1, or U_1 U_1-transpose, represents the posterior covariance matrix of the extremely short utterances, and U_K, or U_K U_K-transpose, corresponds to the posterior covariance matrix, that is, the uncertainty, of the very long utterances.
0:14:22 | So now, during scoring time, all we need to do is find the group identity. By using the three approaches to quantify the reliability of the i-vector, we will be able to find the group indices m and n, so that we can replace all the session-dependent matrices with precomputed, group-dependent ones such as A_mn and C_n.
0:14:51 | Notice that, compared with the conventional, original PLDA-UP, where these A, B, C, D are all session-dependent (t stands for the test utterance and s stands for the target-speaker utterance), now the A_mn, C_n, D_n and so on have all been precomputed already using the development data.
0:15:20 | So as you can see, this is where the computation saving comes from: we use the precomputed matrices rather than computing the covariance matrices on the fly. This slide shows in more detail the computation savings that we can have. With PLDA-UP using our fast scoring, we only need to determine the group identities m and n; but for the conventional PLDA with uncertainty propagation, we have to compute all these matrices during scoring.
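A sketch of the resulting scoring loop; the names `A`, `B`, `C`, `D` for the precomputed group-dependent matrices and the exact bilinear form are illustrative stand-ins for the paper's derivation:

```python
import numpy as np

def group_id(score, edges):
    """Map a reliability scalar to its precomputed group index."""
    return int(np.clip(np.searchsorted(edges, score) - 1, 0, len(edges) - 2))

def fast_up_score(w_s, w_t, m, n, A, B, C, D):
    """PLDA-UP score from matrices precomputed for the group pair (m, n):
    target-speaker i-vector in group m, test i-vector in group n."""
    return (w_s @ A[m][n] @ w_s + w_t @ B[m][n] @ w_t
            + 2.0 * w_s @ C[m][n] @ w_t + D[m][n])
```

All the heavy matrix inversions happen once per group pair at development time; at test time only the two table lookups and a few matrix-vector products remain.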
0:15:58 | We performed experiments on the NIST 2012 SRE trials, common condition 2, using classical 60-dimensional MFCC vectors, 1024 Gaussians, and 500 total factors in the total variability matrix. We tried the three different ways of grouping the i-vectors, that is, of grouping the posterior covariance matrices.
0:16:28 | This diagram summarizes the results. The conventional PLDA is ultra fast; this bar represents the scoring time, or in fact the total time for the whole evaluation on common condition 2. But unfortunately the performance is not very good. The reason it is not very good is that we use utterances of arbitrary duration: we did segmentation, cutting the utterances into short, medium-length and long pieces.
0:17:10 | That is, we do not use the original data for training and testing; instead, some of the utterances are short, some are of medium length, and some are very long. So we create a situation with arbitrary durations in both the training and test utterances.
0:17:29 | Now, PLDA with UP performs extremely well, but unfortunately its scoring time is also very high. With our fast scoring approach, we successfully reduced the scoring time from here to here, with only a very small increase in the EER. If we use more groups, that is, if the number of groups is larger, we can make the EER almost the same as the one achieved by PLDA with uncertainty propagation. So what happened is that we successfully reduced the computation time without increasing the EER. The same situation occurred in the minimum DCF; the details are in the paper.
0:18:18 | We do not show system 3 here, because the performance of systems 2 and 3 is very similar, so I only show system 2; system 1 is based on the utterance duration. Now for the memory consumption: it shows a similar trend.
0:18:44 | PLDA uses a very small amount of memory, while PLDA-UP uses a much larger amount of memory, because we need to store all of the posterior covariance matrices of the utterances; we are talking about gigabytes here. System 1 reduces the memory consumption almost by half, and system 2 has about the same memory consumption. If we increase the number of groups, obviously the memory requirement will increase; but even if the number of groups is as large as 45, it still uses less memory than the original PLDA with uncertainty propagation.
0:19:41 | Here is the DET curve. As you can see, this curve corresponds to the conventional PLDA, with poor performance. All the other systems, systems 1, 2 and 3, and also the one with UP, are much better, because with uncertainty propagation you can deal with utterances of arbitrary duration. We also find that system 1 is a bit poorer than systems 2 and 3, but system 1 has the largest reduction in terms of computation time.
0:20:27 | In conclusion, we proposed a very fast scoring method for PLDA with uncertainty propagation. The whole idea is to group the posterior covariance matrices, or the loading matrices representing the reliability of the i-vectors, and to precompute as much as possible. To do this precomputation, we need to do the grouping first, during development time.
0:20:58 | We found three ways of performing the grouping, and all of them are based on some scalar, just like the k-means algorithm, where you need a distance as the criterion for finding the groups. What do we mean by that scalar? We use the mean of the diagonal elements of the posterior covariance matrix, or the maximum eigenvalue of the posterior covariance matrix, or the duration, as the criterion for the grouping. All of these are computationally light. As a result, the proposed approach performs similarly to the standard UP but needs only 2.3 percent of the scoring time.
0:22:03 | Thank you.
0:22:12 | [Chair] We have time for questions. [Question off-mic.]
0:22:17 | [Mak] We do not truncate them randomly; rather, for every one-second interval we make a cut. So for three seconds, four seconds, five seconds, we randomly extract segments from the speech data, and the position where we extract is also random.
0:22:40 | [Questioner] So the durations range between three seconds and how much?
0:22:45 | [Mak] As long as the test utterance; some utterances are around five minutes. Therefore, different utterances will have different durations.
0:22:58 | [Audience] I wonder if you could just comment on this. My experience with this method is that I found it works well in situations other than the specific problem for which it was intended. If there is a gross mismatch between enrollment and test, such as telephone enrollment and microphone test channels, or a huge mismatch in the duration, then I found that this works well; but I was a bit disappointed with the performance on the specific problem that you're addressing here, which is just the problem of duration variability.
0:23:43 | [Mak] In fact, in our experiments we also have duration mismatch, because we deliberately generated duration mismatch in order to create a situation with utterances of arbitrary duration; therefore the test utterance and the target-speaker utterance will have different durations. Of course, there is a very small chance that the two utterances have similar durations; but because everything is random, there will be a lot of utterances with various durations, so there will be duration mismatch in the test.
0:24:40 | [Audience] I'd be very interested to see what happens in the upcoming NIST evaluation, where this problem is going to be at the forefront. The durations will be truncated to between ten seconds and sixty seconds, so I think we're all looking at up to five percent equal error rate, even before we move to the Chinese and Tagalog verification trials.
0:25:17 | [Chair] Okay, let's thank the speaker.