0:00:13 | hello, i'm [inaudible] from [inaudible] university, and this is joint work with my adviser [inaudible] |
---|
0:00:21 | the title of our work is "log-spectral enhancement using speaker-dependent priors for speaker verification" |
---|
0:00:29 | and the key idea behind this work is how we can use bayesian parameter estimation techniques to improve the robustness of speaker verification systems to noise and mismatch |
---|
0:00:42 | the reason we want to use a bayesian technique is that a bayesian approach allows us a principled way of accounting for parameter uncertainty in the noise estimation task |
---|
0:00:57 | and, like in most pattern recognition systems, the front end is a key component: it is used to extract the parameters of interest from the raw signal; in this case we have speech which is corrupted by noise, and we want to extract features of interest that we then use in a classification algorithm |
---|
0:01:20 | the noise makes our parameter estimates erroneous in some cases, and depending on the severity of the noise, this determines how much of an effect it has on the parameter estimates; if we can instead put in a bayesian estimate |
---|
0:01:43 | we can probably enhance our speaker verification system; and here we can see the two main causes of performance degradation: noise, which we have just discussed, and mismatch, because in a speaker verification system we need a model of each speaker's distribution, and the acoustic environment of the training data may not be the same environment in which you are using the system; this results in mismatch, and hence performance degradation |
---|
0:02:22 | that is the backdrop for what we are trying to do here; so the aim of our work, as the title suggests, is enhancement using speaker-dependent priors in the log-spectral domain, and the key idea is that we want to link two systems which we feel are closely matched: the speech enhancement system and the recognition system |
---|
0:02:41 | the intuition behind it is that when you do speech enhancement you are enhancing features, and with speaker-dependent priors, if you have a better idea of who is speaking and you have a good prior in that domain, then you can do a better job of enhancing the signal, and you can do a better job of recognition |
---|
0:03:07 | so there is an interplay between these two systems, and the way we capture this interplay is as message passing along the nodes of a graphical model; this message passing will fall out of our formulation |
---|
0:03:27 | just a brief outline of what the rest of the talk will be like: i will briefly go over speaker verification for any members of the audience who may need it, then go into bayesian inference, then how it leads into variational bayesian inference, which is the framework we work in, then discuss our model, and then go into the experimental results |
---|
0:03:57 | in verification the task is: you are given an utterance and a claimed identity, and the task is a hypothesis test: given the speech segment X, is the speech from speaker S or not? what we do is model our target speakers using speaker-specific GMMs |
---|
0:04:27 | and then we use a universal background model to test the alternative hypothesis; this is the usual baseline system, the GMM-UBM system, which is the starting point for most verification systems; more advanced ones are built on top of it; this is the most basic system, and it is where we will try our enhancement in the log-spectral domain to see if we can get an improvement |
---|
0:04:56 | so for the classification decision, once we have the MFCCs, you compute a score, which is just the log-likelihood ratio, and then compare it against a threshold to decide which hypothesis is correct |
---|
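As a sketch of the GMM-UBM decision rule just described (assuming diagonal-covariance mixtures; the parameter layout is illustrative, not the system's actual configuration):

```python
import numpy as np

def gmm_logpdf(X, weights, means, variances):
    # Per-frame log-density under a diagonal-covariance GMM.
    # X: (T, D); weights: (K,); means, variances: (K, D)
    comp = np.stack([
        np.log(w)
        - 0.5 * np.sum(np.log(2.0 * np.pi * v))
        - 0.5 * np.sum((X - m) ** 2 / v, axis=1)
        for w, m, v in zip(weights, means, variances)
    ], axis=1)                                   # (T, K)
    mx = comp.max(axis=1, keepdims=True)         # log-sum-exp over components
    return (mx + np.log(np.exp(comp - mx).sum(axis=1, keepdims=True)))[:, 0]

def llr_score(X, target_gmm, ubm):
    # Trial score: average per-frame log-likelihood ratio
    # Lambda(X) = (1/T) * sum_t [ log p(x_t | target) - log p(x_t | UBM) ];
    # accept the claimed identity if this exceeds a threshold.
    return float(np.mean(gmm_logpdf(X, *target_gmm) - gmm_logpdf(X, *ubm)))
```

Frames resembling the target model push the score above the threshold; frames better explained by the UBM push it below.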
0:05:15 | and we can plot DET curves as our performance metric, and also compute the equal error rate to determine the trade-off between missed detections and false alarms |
---|
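The equal error rate mentioned here can be computed from trial scores with a simple threshold sweep; a minimal sketch (a brute-force scan, not an efficient ROC-based implementation):

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    # Sweep the decision threshold over every observed score; the EER is
    # the operating point where the miss rate (true trials scoring below
    # the threshold) equals the false-alarm rate (impostor trials at or
    # above it). Returned as the average of the two at the closest point.
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    best_gap, eer = np.inf, None
    for th in np.sort(np.concatenate([target_scores, impostor_scores])):
        miss = np.mean(target_scores < th)
        fa = np.mean(impostor_scores >= th)
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), 0.5 * (miss + fa)
    return eer
```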
0:05:31 | so that is the speaker verification part; now just a little bit on bayesian inference |
---|
0:05:39 | we can say that there are two main approaches to parameter estimation: you can go the maximum likelihood route or the bayesian inference route |
---|
0:05:49 | here we see that if you have data X, represented in this figure by X, the generative model is governed by a parameter theta |
---|
0:06:00 | now in the maximum likelihood paradigm we assume that this parameter is an unknown constant; the quantity of interest is the likelihood, and we can estimate theta based on the maximum likelihood criterion |
---|
0:06:16 | in the bayesian paradigm, on the other hand, we assume that theta is a random variable governed by a prior, and this is where the robustness to parameter uncertainty comes in: the fact that we have a prior distribution over the parameter of interest |
---|
0:06:37 | and then the key quantity in this case is the posterior, which, by bayes' rule, is proportional to the product of the likelihood and the prior |
---|
0:06:49 | the issue is how we obtain estimates: we obtain bayesian estimates that minimize an expected cost; for instance, if the cost is the squared norm of the difference between the estimate and the true value, as in this expression, it is well known that the resulting estimate, the minimum mean-square-error estimate, is just the posterior mean |
---|
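The "MMSE estimate = posterior mean" point can be made concrete with the standard conjugate Gaussian example (a textbook illustration, not the talk's speech model):

```python
def gaussian_posterior_mean(x_bar, n, sigma2, mu0, tau2):
    # Conjugate Gaussian model: theta ~ N(mu0, tau2) (prior),
    # x_1..x_n | theta ~ N(theta, sigma2) (likelihood).
    # Under squared-error cost the optimal (MMSE) estimate is the
    # posterior mean: a precision-weighted blend of prior mean and
    # sample mean, so the prior "pulls" harder when data are few or noisy.
    posterior_precision = n / sigma2 + 1.0 / tau2
    return (n * x_bar / sigma2 + mu0 / tau2) / posterior_precision
```

With one noisy observation the estimate sits halfway between prior and data; with many observations it converges to the sample mean.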
0:07:23 | note that this is easy to write, but what happens is that in most practical cases, and even in the one we consider here, it is almost impossible to perform this computation exactly; so now what do we do? |
---|
0:07:37 | if the problem lies in the intractability of the posterior, then we can apply approximate bayesian techniques; for instance, we can use VB, or variational bayes, where we approximate our true posterior by one that is constrained to lie in a tractable family of distributions |
---|
0:08:20 | and we need a metric so that we know what the closest approximation to the true posterior is within the tractable family; we obtain the approximation that minimizes the KL divergence between our approximation and the true posterior |
---|
0:08:42 | in cases where our parameter set consists of N parameters, we can ensure tractability by assuming that the posterior factorizes like the product shown in this expression |
---|
0:09:01 | so now the question boils down to computing the forms of this approximate posterior for each of the factors, and then updating the sufficient statistics |
---|
0:09:21 | it can be shown that the expression for the approximate form of each factor is computed by taking an expectation of the logarithm of the joint distribution between the observations and the parameters of interest |
---|
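In standard mean-field notation, the factorization, the KL objective, and the optimal-factor result just described take the form (a generic statement of variational Bayes, not specific to this paper's variables):

```latex
q(\theta) = \prod_{i=1}^{N} q_i(\theta_i),
\qquad
q^{*} = \arg\min_{q}\; \mathrm{KL}\big(q(\theta)\,\big\|\,p(\theta \mid X)\big),
\qquad
\log q_j^{*}(\theta_j) = \mathbb{E}_{i \neq j}\big[\log p(X,\theta)\big] + \mathrm{const}
```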
0:09:44 | so with that, let's get back to our speaker verification context, and in particular let's discuss the probabilistic model |
---|
0:09:56 | here we work in the log-spectral domain, and what we assume is that our observed signal y(t) is corrupted by additive noise; if we take the DFT we can compute the log spectrum as shown |
---|
0:10:20 | and then it can be shown that there is a nice approximate relationship between the log spectrum of the observed signal, the clean log spectrum, and the log spectrum of the noise |
---|
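The approximate relationship referred to here is presumably the usual log-sum interaction model; a minimal sketch, assuming the additive-noise power spectra add with the cross term neglected:

```python
import numpy as np

def noisy_log_spectrum(x, n):
    # With additive noise, the power spectra approximately add
    # (ignoring the phase cross term): |Y|^2 ~= |X|^2 + |N|^2.
    # Taking logs gives the log-spectral-domain relationship:
    #   y ~= log(exp(x) + exp(n)) = x + log(1 + exp(n - x))
    # where x and n are the clean-speech and noise log spectra.
    x, n = np.asarray(x, dtype=float), np.asarray(n, dtype=float)
    return x + np.log1p(np.exp(n - x))
```

When the noise is far below the speech, y collapses to x; at equal levels it sits log 2 above either.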
0:10:35 | this gives us our likelihood; we are in the bayesian paradigm, so we have the likelihood and the prior, and this is our likelihood |
---|
0:10:49 | now we need to write out the joint distribution and how it factorizes, because this will help us when we come to compute the approximate distribution: recall that the expression for each of the optimal factors depends on an expectation of the logarithm of the joint distribution |
---|
0:11:14 | so this is how the joint distribution factorizes in this context: you have the observed log spectrum, the clean log spectrum, an indicator variable that we introduce, which i will explain later, and the noise; so here you have the likelihood term, and the prior over the clean speech log spectrum, which we assume is speaker dependent |
---|
0:11:46 | so what happens with the speaker-dependent prior is this: in a speaker ID context this would mean that we learn models for each speaker, but in a verification context what we do is approximate that, because it would not be feasible; in the verification context we assume that we can model the library of speakers as just the target speaker and the UBM, so the library is dynamic for each utterance that you are testing; and what the indicator variable tells you is who is speaking, in other words whether it is the target or the UBM, and which mixture component is active |
---|
0:12:39 | so this slide just shows you the forms of the factors that we compute, and we can see that they are well-known forms; the VB algorithm works by iteratively updating the sufficient statistics, in our case the mean and the covariance, which are functions of the observations and the prior, and cycling through until some convergence criterion is reached |
---|
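A toy illustration of this update cycle (not the paper's actual speech model) is mean-field VB for a univariate Gaussian with unknown mean and precision, where the two factors' sufficient statistics are refreshed in turn until convergence:

```python
import numpy as np

def vb_gaussian(x, mu0=0.0, beta0=1e-3, a0=1e-3, b0=1e-3, iters=50):
    # Mean-field VB for x_i ~ N(mu, 1/lam) with conjugate priors
    # mu | lam ~ N(mu0, 1/(beta0*lam)) and lam ~ Gamma(a0, b0),
    # using the factorization q(mu, lam) = q(mu) q(lam).
    # Each factor's update plugs in the expected sufficient
    # statistics of the other factor, cycled to convergence.
    x = np.asarray(x, dtype=float)
    N, xbar = len(x), float(np.mean(x))
    mu_N = (beta0 * mu0 + N * xbar) / (beta0 + N)   # q(mu) mean (fixed)
    a_N = a0 + 0.5 * (N + 1)                         # q(lam) shape (fixed)
    e_lam = a0 / b0                                  # initial E[lam]
    for _ in range(iters):
        var_mu = 1.0 / ((beta0 + N) * e_lam)         # q(mu) variance
        b_N = b0 + 0.5 * (np.sum((x - mu_N) ** 2) + N * var_mu
                          + beta0 * ((mu_N - mu0) ** 2 + var_mu))
        e_lam = a_N / b_N                            # q(lam) update
    return mu_N, e_lam
```

With weak priors the posterior mean lands near the sample mean, and the expected precision near the inverse sample variance.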
0:13:10 | and what is good is that once you obtain an estimate of the clean posterior, we can derive MFCCs easily from it for verification |
---|
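The step from enhanced log spectra to cepstral features is just a truncated DCT; a minimal sketch (the mel filterbank, liftering, and deltas of a full MFCC pipeline are omitted here):

```python
import numpy as np

def logspec_to_mfcc(log_mel, n_ceps=13):
    # Given an estimate of the clean log (mel) spectrum, cepstral
    # coefficients are its type-II DCT, truncated to n_ceps terms.
    # log_mel: (frames, bands) -> (frames, n_ceps)
    log_mel = np.asarray(log_mel, dtype=float)
    B = log_mel.shape[-1]
    k = np.arange(B)[:, None]      # cepstral index
    n = np.arange(B)[None, :]      # log-spectral band index
    basis = np.cos(np.pi * k * (2 * n + 1) / (2.0 * B))  # DCT-II basis
    return log_mel @ basis[:n_ceps].T
```

A flat log spectrum maps entirely onto the zeroth coefficient, which is why c0 tracks overall energy.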
0:13:24 | so, some experimental results; we used three datasets: initially we used TIMIT, then the MIT mobile device speaker verification corpus, and then we also tried it out on the SRE 2004 corpora |
---|
0:13:42 | the initial results here are for TIMIT; we trained a UBM using training data from a subset of the speakers (there are six hundred and thirty speakers), and then we corrupted the speech using additive white gaussian noise; i will present results for realistic noise later; and then we used two test utterances per speaker |
---|
0:14:11 | so what happens is that from the six hundred and thirty speakers we can generate twelve hundred and sixty true trials, and then we select a random subset of ten speakers as impostors, and we compute scores for each trial |
---|
0:14:28 | and we also compared against our implementation of FDIC, which is a feature-domain intersession compensation technique; it entails learning a projection matrix to project the features into a session-independent subspace, so that verification would be more robust; the details are in the paper, but i will not go through them |
---|
0:14:58 | and here is just a brief table of some results for the TIMIT case: we add additive white gaussian noise and sweep through some SNRs, and from the raw data we compute MFCCs; the top line shows what happens if we just obtain the MFCCs without first applying anything, just from the raw signal |
---|
0:15:26 | and then the second line shows what we obtain if we compute the MFCCs after we have enhanced the log spectra using the VB technique; our implementation of FDIC was able to work in the low-SNR cases, but in the high-SNR case it seems to have broken down in our implementation, which we can investigate |
---|
0:15:53 | this is the DET plot for the twenty dB case for TIMIT, and we see that the equal error rate dropped by half in that case, and across the SNRs we investigated |
---|
0:16:07 | we also looked at other types of noise; here we had factory noise, which was obtained from the NOISEX-92 dataset, and the results are similar; only the figures differ, at different SNRs, because of the type of noise; but we do see that the gain in this last condition is not as good, and this is because it is an almost clean condition |
---|
0:16:40 | now, when we applied this to the MIT dataset, we wanted to show the difference, what happens when we have mismatch: we trained on data obtained in an office and tested with data from a noisy street intersection; we observe the mismatch, where the error rate jumps up to twenty percent when the test data is from the intersection but the models were trained on office data, and when we apply the VB technique it reduces to twenty-four percent |
---|
0:17:28 | for the SRE experiments we used the SRE 2004 corpora; briefly, the details: we used a UBM with five hundred and twelve mixture components, and nineteen-dimensional MFCCs with mean normalization applied |
---|
0:17:45 | what happened is that we only obtained modest gains when we applied our methods, and this may be due to the baseline: the baseline system had an EER of thirteen point eight, and we were only able to get it down to thirteen point four; this may be due to the fact that the formulation assumes models trained on clean speech, and we think this is responsible for the lower gains compared to what we get on TIMIT and the MIT dataset |
---|
0:18:20 | and that's it; thank you |
---|
0:18:26 | i think we have time for one quick question |
---|
0:18:38 | i have a question: did you try to use a standard type of speech enhancement, such as wiener filtering, to obtain the enhanced speech, and then use the enhanced speech to do speaker verification? |
---|
0:18:55 | no, we did not, but what we tried was using [inaudible], and we were getting gains in a speaker ID context, but not in this context; that is something we should do |
---|
0:19:10 | okay, yes, thank you |
---|
0:19:12 | let's thank our speaker; okay |
---|