0:00:06 | So, as I mentioned, we are going to use MLLR sufficient statistics for the speaker identification problem. |
---|
0:00:14 | We are not building any speech recognition as such in this particular talk. |
---|
0:00:19 | The idea is that we are looking specifically at the case where there is a large number of speakers, we want to identify one of them, and we want to do it in a computationally efficient way. |
---|
0:00:31 | This work was actually done by my students. |
---|
0:00:39 | So, just to give you a brief overview of the talk: I am going to go briefly over the speaker identification problem, that is, identifying one out of a set of L speakers. |
---|
0:00:50 | I will talk about the commonly used technique, MAP adaptation followed by top-C mixture based likelihood estimation. |
---|
0:00:58 | Then we show that if you have a large number of speakers, evaluating the likelihood across all the speakers and then choosing the best one is obviously very computationally expensive, and the number of speakers in the population can be very large. |
---|
0:01:14 | So we propose to use MLLR matrices for the adaptation of the speaker models. |
---|
0:01:20 | The reason is that then we just need to store the MLLR matrices, and we show that if you have the MLLR matrices, estimating the likelihood of the different speakers is a very fast step: it is just a matrix multiplication with the MLLR row vectors. |
---|
0:01:36 | And we give some comparison with the performance of the conventional GMM-UBM based system. We show that the MLLR system gives some degradation in performance. |
---|
0:01:48 | Therefore, finally we propose a sort of cascade system, where the MLLR system reduces the search space from this huge population, and then the final GMM-UBM system can look at a small set of speakers and identify the best speaker from that set. |
---|
0:02:06 | So this is the basic flow of the talk. |
---|
0:02:10 | So, as I said, the idea is that we are doing speaker identification. There are L speakers, and it is a closed set, so we assume that there are L speakers in the population. |
---|
0:02:22 | Given a test feature, we are going to find the likelihood with respect to all the L speaker models and choose the one that maximizes the likelihood. |
---|
0:02:30 | Obviously, when the number of speakers in the population is large, I have to evaluate this for each and every speaker in the population, and therefore the computational complexity keeps growing as the number of speakers in the population becomes large. |
---|
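For reference, a standard way to write the decision rule being described (my notation, not taken from the slides): given test frames $X = \{\mathbf{x}_1, \dots, \mathbf{x}_T\}$ and speaker GMMs $\lambda_1, \dots, \lambda_L$, identification picks

$$\hat{s} \;=\; \arg\max_{1 \le s \le L} \; \sum_{t=1}^{T} \log p(\mathbf{x}_t \mid \lambda_s),$$

so the cost of the naive search grows linearly with the population size $L$ (and with the utterance length $T$).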
0:02:47 | So what would be the conventional method? The most popular method used for speaker identification, and pretty much the same thing is used for speaker verification, is that given a universal background model, for each of the speakers we basically do a MAP adaptation to get the speaker models from the universal background model. So these are our speaker-adapted models. |
---|
0:03:14 | Then, as Doug Reynolds pointed out, there is a way you can do the scoring efficiently, and that is: given the test data and such models, we first align the test data with respect to the UBM and find the top-C mixtures for that particular test data. |
---|
0:03:32 | So when you want to evaluate the likelihood, you do not have to compute each of those 2048 mixtures (assuming there are 2048 mixtures in the background model) for each of the speaker models. Instead, you do one first evaluation with respect to all the 2048 mixtures of the UBM, but then for each of the speaker models you just need C of those mixtures to be evaluated. |
---|
0:03:55 | But nevertheless, as L becomes large, there is a large increase in the computation. So it is still, as we will show, expensive, especially as L becomes large. |
---|
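A minimal sketch of the top-C scoring just described (an illustration, not the authors' code; the dict layout and all names are assumptions, with `ubm` and each speaker model holding weights `w`, means `mu`, and diagonal variances `var`). Each frame is aligned against the full UBM once, and every MAP-adapted model is then evaluated only on the C best-scoring mixture indices:

```python
import numpy as np

def log_gauss_diag(x, means, variances):
    """Per-component diagonal-Gaussian log-densities for one frame x."""
    d = x - means                                  # broadcasts to (M, D)
    return -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)
                   + np.sum(d * d / variances, axis=1))

def top_c_scores(X, ubm, speakers, C=5):
    """GMM-UBM top-C scoring: align each frame to the UBM once, then
    evaluate only the C best mixtures under each MAP-adapted model."""
    scores = np.zeros(len(speakers))
    for x in X:                                    # one pass over the frames
        lp_ubm = np.log(ubm["w"]) + log_gauss_diag(x, ubm["mu"], ubm["var"])
        top = np.argsort(lp_ubm)[-C:]              # indices of the top-C mixtures
        for s, spk in enumerate(speakers):         # C (not M) Gaussians per model
            lp = np.log(spk["w"][top]) + log_gauss_diag(
                x, spk["mu"][top], spk["var"][top])
            scores[s] += np.logaddexp.reduce(lp)
    return scores                                  # argmax gives the identified speaker
```

Note the per-speaker work is still proportional to the number of frames, which is why the cost keeps growing with L and with the utterance length.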
0:04:09 | So what we are proposing is again adaptation, but we are saying that instead of doing MAP adaptation, why don't we build speaker models using just MLLR adaptation. |
---|
0:04:23 | The idea is that for each speaker, given that we already have the UBM model, instead of MAP adaptation we now have a speaker model which has gone through MLLR speaker adaptation. |
---|
0:04:38 | This is where I think the confusion came from: we are using MLLR adaptation, which we have borrowed from the speech recognition literature. |
---|
0:04:47 | So for each speaker, the means of the speaker model are nothing but a matrix transformation of the means of the universal background model. The idea is that I just need this matrix, the MLLR matrix, to characterize a speaker. |
---|
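In the usual MLLR notation from the speech recognition literature (my rendering, not a formula shown here), the adapted mean of UBM mixture $m$ for speaker $s$ is

$$\boldsymbol{\mu}_m^{(s)} \;=\; \mathbf{A}_s \boldsymbol{\mu}_m + \mathbf{b}_s \;=\; \mathbf{W}_s \boldsymbol{\xi}_m, \qquad \boldsymbol{\xi}_m = [1,\; \boldsymbol{\mu}_m^\top]^\top,$$

so the single $D \times (D{+}1)$ matrix $\mathbf{W}_s = [\mathbf{b}_s \;\; \mathbf{A}_s]$ is all that needs to be stored per speaker.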
0:05:02 | So in essence we are not forming individual speaker models; instead, each speaker is now codified by his or her speaker-specific MLLR matrix. |
---|
0:05:12 | This is the stage that actually builds the speaker-specific MLLR matrix, and the identification problem then becomes one where we have L such matrices instead of L such models, and of course these matrices are what characterize the L speakers. |
---|
0:05:35 | So here the likelihood calculation essentially boils down to finding, for the test utterance, what its likelihood is with respect to the background model transformed by each of these L matrices, which are already stored, since you have done MLLR adaptation for each of these individual speakers. |
---|
0:05:53 | At this point it still looks like we need to compute all the L likelihoods, and therefore it looks like we have not solved anything as yet. But the advantage is that if I want to compute these individual likelihoods, it is now very simple: all that I need to do is some matrix multiplications to get the likelihoods for each of the individual speakers. |
---|
0:06:19 | So the idea is again borrowed from the speech recognition literature, because it is basically using the equations from MLLR matrix estimation. |
---|
0:06:32 | We make use of the auxiliary function. In conventional speech recognition, what you would do if you are doing MLLR estimation is actually try to estimate the matrix W_s given the adaptation data. |
---|
0:06:48 | The idea is: given the adaptation utterance X, what are the elements of the matrix that will maximize the likelihood, or in this case the auxiliary function that you are looking at. |
---|
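A hedged sketch of the objective being referred to, in the usual Leggetter-Woodland form (my reconstruction, constants dropped): conventional MLLR estimation maximizes the EM auxiliary function over the transform,

$$\mathbf{W}_s \;=\; \arg\max_{\mathbf{W}} \; -\tfrac{1}{2} \sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_m(t)\, \bigl(\mathbf{x}_t - \mathbf{W}\boldsymbol{\xi}_m\bigr)^{\!\top} \boldsymbol{\Sigma}_m^{-1} \bigl(\mathbf{x}_t - \mathbf{W}\boldsymbol{\xi}_m\bigr),$$

where $\gamma_m(t)$ is the UBM occupancy of mixture $m$ at frame $t$. In the identification setting described next, the same auxiliary function is only evaluated at the L stored transforms, never re-maximized.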
0:06:58 | Now we pose the same problem in a speaker identification framework. The idea is that I already know the L speaker matrices: for each individual speaker I already know the MLLR matrix. The problem is now one of finding out which of those L matrices maximizes the likelihood. |
---|
0:07:21 | So in this case I am not estimating the MLLR matrices; I have already computed the MLLR matrices and stored them for each of the individual speakers, and the only thing I am doing here is finding the one of those L MLLR matrices that maximizes the likelihood. |
---|
0:07:39 | This is done very efficiently and, as I said, is again borrowed from speech recognition. We already have these L matrices, each of which is represented by the row vectors w_1 through w_D; these are row vectors. In MLLR these row vectors are what is estimated when you actually do speaker adaptation; here they are already precomputed and stored, and so we are only computing the likelihood. |
---|
0:08:07 | So why is it efficient? As I said, I do need to compute all the L likelihoods, but I can do that very efficiently. Why is it very fast? Because I just need to do one alignment of the data with respect to the UBM, and that is exactly the same thing that is normally done in MAP plus top-C likelihood estimation: I anyway have to do an alignment to find out which mixtures are dominant for that data. So that step is exactly the same as what we do with MAP plus top-C. |
---|
0:08:37 | The extra step that we do, which is again borrowed from speech recognition, is to basically compute, for the given test utterance, the corresponding sufficient statistics k_i and G_i. These are sufficient statistics computed from the alignment and the data: the gammas come from the alignment, and then the data comes in. |
---|
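For diagonal covariances, this is my reconstruction of the quantities the speaker calls k_i and G_i, in standard MLLR notation (an assumption about the exact definitions used):

$$\mathbf{k}_i \;=\; \sum_{m} \frac{\boldsymbol{\xi}_m}{\sigma_{m,i}^2} \sum_{t} \gamma_m(t)\, x_{t,i}, \qquad \mathbf{G}_i \;=\; \sum_{m} \frac{\boldsymbol{\xi}_m \boldsymbol{\xi}_m^\top}{\sigma_{m,i}^2} \sum_{t} \gamma_m(t),$$

and, dropping terms that are the same for every speaker, the auxiliary score of speaker $s$ with rows $\mathbf{w}_{s,i}$ reduces to

$$Q(\mathbf{W}_s) \;\doteq\; \sum_{i=1}^{D} \Bigl( \mathbf{w}_{s,i}\, \mathbf{k}_i \;-\; \tfrac{1}{2}\, \mathbf{w}_{s,i}\, \mathbf{G}_i\, \mathbf{w}_{s,i}^\top \Bigr),$$

so once $\mathbf{k}_i$ and $\mathbf{G}_i$ are in hand, each additional speaker costs only a few small matrix-vector products.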
0:08:58 | So for each of the L speakers, I now need just one matrix multiplication using these k_i and G_i statistics. The k_i and G_i are computed only once, irrespective of the number of speakers. |
---|
0:09:13 | The likelihood calculation now uses the individual row vectors from the corresponding speaker's MLLR matrix. There are D of these, and each row vector belongs to the model of that particular speaker, so the calculation is just a matrix multiplication. |
---|
0:09:35 | In a sense, this is the most crucial step that is happening: the likelihood can be easily computed for each of those L speakers by using the corresponding MLLR hypothesis and doing one matrix multiplication involving k_i and G_i, and that is where we get the maximum gain in performance, that is, in computation time. |
---|
0:09:57 | Just to go through the whole flow: given the feature vector, I am assuming that I have already taken each individual speaker's training data and computed the MLLR matrices for all the L speakers. |
---|
0:10:08 | Given a test feature, I first do an alignment with the background model and also compute the k_i and G_i statistics; this is done only once, using X, the test feature, and the UBM model. Then, for each of those L matrices, I just need to multiply the matrix with the statistics to get that speaker's likelihood. |
---|
0:10:32 | So this is very computationally efficient, because it only involves matrix multiplications. |
---|
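Putting the whole flow into a short sketch (my illustration under the assumptions above: diagonal-covariance UBM, precomputed per-speaker matrices W_s; all names are mine):

```python
import numpy as np

def mllr_statistics(X, ubm):
    """One pass over the test frames: UBM alignment (the gammas) plus the
    k_i and G_i sufficient statistics described above."""
    mu, var, w = ubm["mu"], ubm["var"], ubm["w"]      # (M, D), (M, D), (M,)
    M, D = mu.shape
    xi = np.hstack([np.ones((M, 1)), mu])             # extended means, (M, D+1)
    k = np.zeros((D, D + 1))
    G = np.zeros((D, D + 1, D + 1))
    for x in X:
        lp = np.log(w) - 0.5 * (np.sum(np.log(2 * np.pi * var), axis=1)
                                + np.sum((x - mu) ** 2 / var, axis=1))
        gamma = np.exp(lp - np.logaddexp.reduce(lp))  # occupancies gamma_m(t)
        for i in range(D):
            g = gamma / var[:, i]                     # gamma_m(t) / sigma_{m,i}^2
            k[i] += (g * x[i]) @ xi
            G[i] += np.einsum("m,mj,ml->jl", g, xi, xi)
    return k, G

def mllr_scores(k, G, W_all):
    """Per-speaker score: small matrix products only, no pass over frames.
    Equals Q(W_s) up to a constant that is the same for every speaker."""
    # W_all: (L, D, D+1), the stacked per-speaker MLLR matrices
    lin = np.einsum("sij,ij->s", W_all, k)            # sum_i w_si . k_i
    quad = np.einsum("sij,ijl,sil->s", W_all, G, W_all)
    return lin - 0.5 * quad                           # argmax identifies the speaker
```

The point of the design is visible in the split: everything that touches the frames happens once in `mllr_statistics`, and the per-speaker loop in `mllr_scores` is independent of the utterance length.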
0:10:39 | Please stop me if you have any questions. |
---|
0:10:42 | So the proof of the pudding is basically to go through some time and complexity analysis. What we are doing now is comparing the conventional MAP plus top-C approach, that is, the GMM-UBM, with the fast MLLR system, the one that we have, with MLLR matrices that capture the speaker characteristics. |
---|
0:11:07 | What is shown on the left is again from the NIST 2004 data. We have two different test conditions: one is ten-second speech and the other is one-side speech, and there is a set of enrolled speakers in this identification task. |
---|
0:11:30 | What we are trying to do is, given the test data, to identify one of these enrolled speaker models. These are the ten-second and one-side cases. The blue is basically the conventional approach; here we have taken C to be 15, the top fifteen mixtures. |
---|
0:11:48 | You see that obviously there is a degradation in performance in the case of MLLR; we will look at that in a little more detail later. |
---|
0:12:00 | For the one-side speech, the GMM-UBM obviously does better, and there is also a corresponding improvement for the MLLR case, but again there is a gap in performance between the conventional case and the proposed approach. |
---|
0:12:16 | But the advantage comes in the right half of the figure. Here we are just using a fixed computer configuration and trying to find the average time taken to identify the optimal speaker; this is a summary of that. |
---|
0:12:36 | You can see that there is a huge gain in terms of complexity, of computation time: while the conventional system takes about 10.3 seconds on average, the proposed system takes about a second on average for the ten-second data over this population of speakers. |
---|
0:12:50 | When the test utterance becomes longer, it obviously takes much more time to compute: the conventional system takes about 44 seconds versus a few seconds for the MLLR. |
---|
0:13:00 | So the bottom line is that you get a huge gain, something like one to seven or one to ten, by using fast MLLR. This is useful if you have, say, two thousand speakers in your population and you want to identify which one of them spoke the utterance. |
---|
0:13:17 | But then there is a downside: you lose something in terms of performance. And obviously, when the test utterances are longer, the GMM-UBM takes a lot more time, and that is where you gain more. |
---|
0:13:37 | So this is a little more analysis, a little more detail of what is happening between the proposed fast MLLR and the GMM-UBM. Since the likelihood has to be computed for every speaker, the left-hand figure shows the computation time as the number of speakers in the database increases. |
---|
0:14:02 | The blue line is the conventional approach; for ten-second speech it obviously takes less time than for one-side speech. You can see that there is a sort of linear relationship with the number of speakers in the database: as the number of speakers in the database increases, the computation time essentially increases linearly. |
---|
0:14:21 | On the other hand, if you look at the MLLR system, which is the brown, sort of dark, line, it is almost flat as the number of speakers increases. That is because the main complexity comes basically from doing the alignment and things like that; the actual likelihood estimation does not vary significantly with the number of speakers, since it is just matrix multiplications with the MLLR matrices. |
---|
0:14:48 | So you can see that for a population of two thousand there is going to be a huge gain in terms of computation time. |
---|
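Rough bookkeeping of the two costs, in my own notation (T frames, M UBM mixtures, feature dimension D, population size L; these symbols are not from the slides): the conventional top-C system pays about $O(TMD)$ for the UBM pass plus $O(LTCD)$ for the speaker models, so the per-speaker cost grows with the utterance length $T$ and the total is visibly linear in $L$. The fast MLLR system pays the same $O(TMD)$ alignment plus a one-time accumulation of the $\mathbf{k}_i, \mathbf{G}_i$ statistics, after which scoring all speakers costs about $O(LD^3)$, independent of $T$; that term is tiny next to the alignment cost, which is why the measured curve looks nearly flat in $L$.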
0:14:57 | The other interesting thing is to look at the N-best performance of these two systems: that is, if I look at, say, the top forty speakers, how often does the correct speaker occur within that set. |
---|
0:15:11 | We see that as the number of speakers in the top N increases, the two systems obviously start converging; the blue is the GMM-UBM and the red, or rather the brown, is the MLLR. |
---|
0:15:24 | So the top-N performance, that is, identifying the correct speaker at least within the top hundred, is similar for the two systems. |
---|
0:15:35 | So we thought that we could sort of exploit the advantage of the GMM-UBM, which is obviously superior to MLLR in terms of performance, and still get some computational gain by using the MLLR to identify, from the population of a thousand or so, the top one or two hundred speakers, and then use only that reduced set of speakers in the final GMM-UBM system. That is what led to the cascade. |
---|
0:16:01 | So the idea is that the fast MLLR system first takes the test utterance and reduces the search space of speakers: we identify the top N most probable speakers, and the choice of N, as usual, has an impact on performance. Then we let the conventional GMM-UBM operate only on this reduced set of speakers to identify the best speaker. |
---|
0:16:26 | This is basically the same thing in implementation, which shows that we do not lose much in terms of additional cost and computation. The conventional approach would have taken the test feature, done an alignment with the UBM, found the top-C mixtures, and used the GMM-UBM based system to actually identify the speaker. |
---|
0:16:49 | Here we are doing exactly the same thing: there is an alignment step that goes on here, but we do an additional computation of the sufficient statistics; this is only done once. |
---|
0:16:59 | Then we have the MLLR system, which is done in the training phase: in the training phase we have already built the MLLR matrices for each of those individual speakers. Using the statistics, the features, and the MLLR hypotheses, we identify the N most probable speakers, and once we identify the N most probable speakers, we feed them to the GMM-UBM system to get the final identified speaker. So in both cases the alignment step is the same. |
---|
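A sketch of the cascade as described, reusing the hypothetical helpers sketched earlier (top_c_scores, mllr_statistics, mllr_scores); N is the tunable top-N that trades accuracy for speed:

```python
import numpy as np

def cascade_identify(X, ubm, speakers, W_all, N=30, C=5):
    """Stage 1: fast MLLR scoring prunes the population to the N most
    probable speakers. Stage 2: conventional GMM-UBM top-C scoring runs
    only on that short list and picks the final speaker."""
    k, G = mllr_statistics(X, ubm)                  # computed once per utterance
    shortlist = np.argsort(mllr_scores(k, G, W_all))[-N:]
    backend = [speakers[s] for s in shortlist]      # reduced model set
    best = np.argmax(top_c_scores(X, ubm, backend, C=C))
    return shortlist[best]                          # index of the identified speaker
```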
0:17:32 | So that is a compromise between complexity and performance. If I look at the N-best performance, that is, if I reduce the set of speakers to only the top N, then for the ten-second case there is a degradation in performance, but as N increases this degradation decreases, and if I look at the top-thirty performance, there is still obviously some hit in performance, but it is not very significant. |
---|
0:18:03 | On the other hand, even for the top thirty, I do get a significant gain in terms of computational complexity. As the number of speakers N increases, the backend GMM-UBM system has to work on more speakers, so obviously the computation time is going to grow, and therefore the speedup is reduced. But it is still significant: you do get about a five-times gain in terms of computation. |
---|
0:18:30 | The same sort of thing is repeated for the one-side case. The point about the one-side case is that there is a huge amount of data, about five minutes of speech. |
---|
0:18:41 | Again, if you look at the top-N performance, if I take the top ten there is obviously a huge hit in performance, about 2.5% absolute loss, but if I go to the top forty, then I get only about 0.7% degradation. |
---|
0:19:00 | But the price I pay is in the top-N based backend: since the one-side segments are long, even though I reduce the number of speakers to forty, the backend GMMs still have to operate on all these forty speakers. Therefore, compared to the ten-second case, the gains are not as significant, but we still get almost a three-times gain in computation. |
---|
0:19:22 | So this is the basic idea of our proposed method. It is a compromise: you can actually put the operating point at any of these N-best values, and you trade a small loss in performance for a gain in terms of computation. |
---|
0:19:39 | So basically we are using the idea of exploiting MLLR matrices to do fast likelihood calculation for the speaker models. But using MLLR adaptation decreases the performance, slightly or significantly depending on whether you use the ten-second or the one-side data. |
---|
0:19:59 | Therefore we combine this with the N-best scheme to reduce the search space, so that you retain accuracy but still gain in terms of computation time. |
---|
0:20:11 | For the ten-second case on this database, if you choose the top ten, you get these as the performance degradation and speedup; for the one-side case, with the top twenty, we get about a 3.1-times speedup. |
---|
0:20:25 | So this is basically it. |
---|
0:20:42 | Thank you very much. |
---|
0:20:51 | Q: How much more would you need to achieve the same result? I mean, if you want to achieve the same performance, not nearly the same performance. |
---|
0:21:26 | A: MLLR adaptation obviously has some hit compared to MAP; that is generally true, and I think that is what we noticed. So if you move to the top hundred or two hundred, you will get closer and closer to the conventional GMM-UBM, but you will never get exactly the same; you are always going to take some hit in performance. |
---|
0:21:48 | And the closer you go to the complete set, obviously, the smaller the gain in computation time becomes, if you want to get comparable performance. So what we think is that you will have some hit in performance; how much of a hit is in your hands, and depending on how much you are willing to go down in performance, you can get that much more gain in computation. |
---|
0:22:11 | So your question is: can I achieve GMM-UBM performance and still get a speedup? I am not sure about that; I think you will always lose something. |
---|
0:22:31 | Q: In speech recognition I noticed that using this kind of system you need more adaptation data than with MAP adaptation. |
---|
0:22:41 | A: The opposite is true, right? With more data, MAP is always better than MLLR. |
---|
0:22:50 | Q: But the alignment that you do for MLLR, because you are estimating... I mean, the constrained MLLR... |
---|
0:23:01 | A: No, MAP is what is used in most conventional cases; if you have enough data, obviously you should go back to MAP. |
---|
0:23:12 | Q: If I understand it well, in the case of MLLR you accumulate sufficient statistics, but in the case of the GMM-UBM you evaluate things frame by frame, right? You could use the same trick there: you could accumulate sufficient statistics even for the original MAP-adapted model. |
---|
0:23:45 | A: So what is your question? |
---|
0:23:48 | Q: I am just saying that the speedup comes from collecting sufficient statistics and evaluating the MLLR system quickly. But you could use the same trick with the MAP-adapted model: you can apply the auxiliary function instead of evaluating things frame by frame, rather than evaluating the GMM frame by frame and using that to score. |
---|
0:24:16 | A: So you are saying I could do a similar thing for MAP, I mean collect the sufficient statistics? |
---|
0:24:20 | Q: Exactly, yes. This is what we do, and it leads to something much faster; it would probably be even faster than the frame-by-frame scoring, without losing any performance. |
---|
0:24:33 | A: OK, so you think I could also do this for MAP; is that the question? |
---|
0:24:46 | Q: I just think you are basically comparing two different things. If you want to compare, you should compare both models with the sufficient statistics, and I guess that would be about the same. |
---|
0:25:05 | A: I am not very familiar with that, so maybe I should look at it: why do we always evaluate all of the top C mixtures and not do that instead? OK, maybe I should have a look. |
---|
0:25:25 | Q: Going back to your original premise: you were primarily focused on saying that you are dealing with a large population set, but I also saw the comparisons concerning durations. It was not just the large population; it was also the duration of the test utterance, right, ten seconds versus one side; that was one of the comparisons you had. |
---|
0:25:51 | So I see the MLLR approach you have is done kind of independently of the duration of the test utterance, except for the UBM statistics, right. But what other approaches have people taken? In speech recognition, why don't you look at the notion of beam pruning? That is a well-known thing: within frames you can drop a lot of hypotheses very quickly, so you do not necessarily have to go through and keep every hypothesis at any time. And if it is a speech recognizer, alternatively you can bail out. |
---|
0:26:30 | A: Yes, we actually mention in the paper that there are other methods that you can use to get a speedup, for example pruning or downsampling and things of that sort. We are not saying that this is the only way of doing fast computation; it is one of the ways we could possibly do it. |
---|
0:26:55 | Q: Right, but the question, since this is a research paper, is: you chose this method, and your baseline was full-frame scoring without the classical other ways of speeding up. Why was this the right comparison? |
---|
0:27:09 | A: Even in the case of pruning, I am sure you would take some hit in performance; I do not think you can get absolutely the same performance as the full GMM-UBM, because there is the possibility that while pruning you throw some speakers out. So the full system would be the ultimate performance that one would try to achieve. |
---|
0:27:34 | Q: The point is whether the errors introduced by your method are more than those introduced by pruning. OK. |
---|
0:27:48 | Q: I would like to know what kind of application you have in mind for this; this is obviously motivated by some kind of application. |
---|
0:28:15 | A: In this case we just thought that the need we have is computing the likelihood in an efficient manner, and I am sure there are a lot of applications, maybe in audio indexing, where you might have large populations and you might want to identify somebody in a big database. |
---|
0:28:33 | We have not specifically looked at any particular application; we just think that there are a lot of applications, at least where there are large databases and one is interested in identification, where something like this might work. That is possibly the way to see it, rather than saying we have a particular application for which we want to find a method. |
---|
0:29:00 | Q: Well, what is the application space? That is what I would like to know more about. |
---|
0:29:08 | A: Sure, right. |
---|
0:29:24 | Q: Did you try to use more than one MLLR transformation per speaker? |
---|
0:29:28 | A: We could do that; I think that is something we have been thinking of doing, but we have not yet. It should hopefully improve things, but we have not tried it. |
---|
0:29:46 | Q: It would be interesting to compare this with another type of scoring where, once you have the sufficient statistics for the test utterance, you actually estimate an MLLR transform for the test utterance as well, and then compare the MLLR transforms for the model and the test utterance, for example by taking an inner product or using an SVM. |
---|
0:30:13 | A: Yes; we are just using likelihoods. You are saying that, given the test utterance, I could use the test utterance's MLLR transform and compare it with the speaker's MLLR transform. |
---|
0:30:27 | Q: It would probably be even more efficient, because once you get the MLLR matrix, its dimension is lower than that of your sufficient statistics. You only have to note that these are just of the dimension of the feature vectors, so this is very small. |
---|