0:00:13 | I'm going to be talking about robust speech recognition using dynamic noise adaptation. I guess we'll go right into it.
---|
0:00:22 | In outline, the talk will start with a brief overview of the area and DNA, then review the DNA model, briefly go over how we do inference, and then jump right to experiments and results.
---|
0:00:40 | OK, so model-based robust ASR: it's a well-established paradigm. Basically, you use explicit models of noise, channel distortion, and their interaction with speech to model noisy speech. Many interesting modeling and inference techniques have been developed over the last, I'd say, twenty-five years. For interaction models there are many references; I'll have to refer you to the paper.
---|
0:01:04 | One thing that is conjecture to some degree, but that I believe, is that the relevance of these techniques to commercial-grade ASR has not been definitively established. Why is that? Well, it's because we have promising word error rate reductions on less sophisticated ASR systems, due to small training or test sets, either Aurora or simple pipelines, and/or artificially mixed data. So we don't know, as of this moment, whether these techniques actually improve state-of-the-art ASR. But I'll give you some evidence of at least one model that does improve a state-of-the-art ASR system.
---|
0:01:47 | OK, so dynamic noise adaptation, just as a review. It's a model-based approach: essentially you use a GMM of one sort or another to model speech, and a model of noise, in this case a Gaussian process model. Some features of the method: it's a mismatch model, there's no training required for the noise model, and there's no system retraining required. And the technique models uncertainty in the noise estimate; I'll show you briefly why that's important to good speech and noise tracking.
---|
0:02:26 | In terms of previous results: DNA has actually been kicking around for a while now. We dust it off every couple of years and try it on more realistic data. Previous publications have shown that it significantly outperforms techniques like the AFE and fMLLR on the Aurora 2 and DNA + Aurora 2 test sets, but these are artificial data sets; we had never tried it on real data.
---|
0:02:53 | So in the work I'll talk about today, I'll review some new results on real data, using the best embedded speech recognizer that we have. I'll discuss how DNA has been adapted for low-latency deployments, and I'll show you some results with significant gains, for example a twenty-two percent word error rate reduction at or below six dB SNR, on a system that includes spectral subtraction, fMLLR, and fMPE.
---|
0:03:22 | First, briefly, here's the DNA generative model. I'll get into the components right away, but basically the idea is that this is a general model which is modular: there's a channel model, a speech model, and a noise model, combined through an interaction model. You can play with how the pieces are structured and put together within the general framework, which is very useful if you want to extend or improve the model, or change the inference algorithm to make it stronger.
---|
0:03:54 | OK, so here's the interaction model. In the time domain, this is the standard distortion-plus-noise model. In the frequency domain — I think most in this room are familiar with this — we approximate away the term with the phase, the third term in the sum, in this paper as in most other papers at this scale. There are some better interaction models out there, by Li Deng and others; they're a little more computationally intensive, but they are more accurate.
---|
0:04:28 | So our probability model for the data we observe, y, given the speech x, the channel h, and the noise n, in the log mel power spectral domain, is a normal distribution centred on the interaction function f as shown.
---|
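A minimal numeric sketch of the interaction function just described (illustrative only, not code from the talk; the variable names are assumptions):

```python
import numpy as np

def interaction(x, h, n):
    """Phase-insensitive interaction function in the log mel power domain:
    y = log(exp(x + h) + exp(n)), i.e. distorted speech power plus noise
    power. logaddexp keeps the computation numerically stable."""
    return np.logaddexp(x + h, n)
```

Under the DNA likelihood, the observation y is then modeled as Gaussian around f(x, h, n), with the variance absorbing the neglected phase term.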
0:04:49 | OK, jumping right into the speech model. One thing we changed is that we use a band-quantized GMM instead of a regular GMM. Essentially, each state is mapped to a reduced number of Gaussians at each frequency: a map takes acoustic state s to a shared Gaussian a in a given frequency band f, and the idea is that the number of shared Gaussians in each band is a lot smaller than the number of states. So this model can be computed and stored very efficiently. I'll show an example in a second.
---|
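The band-quantized evaluation described above could be sketched like this (a hypothetical illustration, not the talk's implementation; the map and array shapes are assumptions):

```python
import numpy as np

def log_gauss(y, mu, var):
    """Log-density of a diagonal Gaussian, elementwise."""
    return -0.5 * (np.log(2 * np.pi * var) + (y - mu) ** 2 / var)

def bq_gmm_state_loglik(y, atom_map, atom_mu, atom_var):
    """Per-state log-likelihood of a frame y under a band-quantized GMM.
    atom_map[s, f] indexes a small per-band codebook of shared Gaussians,
    so per frame we evaluate only n_atoms * n_bands Gaussians rather than
    n_states * n_bands."""
    F = y.shape[0]
    # Evaluate each shared Gaussian once per band...
    band_ll = log_gauss(y[None, :], atom_mu, atom_var)   # (n_atoms, F)
    # ...then each state's score is a table lookup plus a sum over bands.
    return band_ll[atom_map, np.arange(F)].sum(axis=1)   # (n_states,)
```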
0:05:21 | The noise model for DNA is a simple Gaussian process model. Essentially, we model the noise level as a stationary process: the prediction for the next frame's noise level, given the previous frame's, is that same value, plus some propagation noise gamma, added to the model to capture how the noise evolves over time.
---|
0:05:52 | This is combined with a simple transient noise model, which is just a Gaussian. The idea here is that the noise level is changing slowly with respect to the frame rate, but the transient noise actually ends up dominating. So introducing this additional layer makes it possible to track something that doesn't look smooth at all when you look at it in the log power spectrum; with this extra layer in the model, which in essence filters out the transient noise, it is a very good model.
---|
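A generative sketch of the two-layer noise model just described (illustrative; the variance values are arbitrary assumptions, not the paper's settings):

```python
import numpy as np

def sample_noise(T, level0=0.0, sigma_prop=0.1, sigma_trans=1.0, seed=0):
    """Generate from the two-layer noise model: a slowly evolving noise
    level (random walk with small propagation noise) plus per-frame
    transient noise around that level."""
    rng = np.random.default_rng(seed)
    level = level0 + np.cumsum(rng.normal(0.0, sigma_prop, T))  # layer 1
    observed = level + rng.normal(0.0, sigma_trans, T)          # layer 2
    return level, observed
```

The observed trajectory looks jagged, but the underlying level moves smoothly, which is what the extra layer lets the tracker recover.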
0:06:29 | As for the channel model: in the past we didn't have one, because with artificial data we didn't need one. For this work, the channel model is just a stochastically adapted parameter vector. This is actually the same model that DNA used in the past for modeling noise, and it works quite well.
---|
0:06:53 | OK, so now we're back to the generative model I showed you a few slides ago, and now I can explain it briefly. For the channel model, I drew a little square for the variable because it's not actually a random variable; it's a parameter, and the arrows show that it adapts over time. The speech model, the part inside the big grey box, is essentially the variables k_{f,t}, which are the Gaussian indices for a given time and frequency, and the clean speech that's generated, x_{f,t}. And of course there's the noise model that I just described, which has two layers to facilitate robust noise tracking, and the interaction model I described.
---|
0:07:39 | One interesting thing is that everything in the box — that's actually a plate, in graphical-model notation, indicated by the little f in the corner — is duplicated for all frequencies. Actually, the only thing that binds the estimates over frequency is the speech component, the state s_t: given s_t, the entire model factors. More precisely, for a given segment of data over time and frequency, given all the states of the speech model, the model factors over time and frequency.
---|
0:08:25 | OK, so how do we do inference in this model? Very quickly, since this has been shown before. Consider the exact noise posterior: because the noise is a Gaussian process, every time you evaluate a GMM you increase the number of components in the noise posterior by a factor of the number of components in the GMM, so the number of components in the noise posterior grows exponentially with time. In this work we simply approximate it with a Gaussian. We have done some investigation of more complex methods, and so far we haven't seen any advantage from them; but that was on synthetic data, and it still needs to be tried on real data. This slide just shows how you compute the mean and the variance of that Gaussian; I'll skip right over that.
---|
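The Gaussian approximation of the growing mixture posterior amounts to moment matching; a one-dimensional sketch (illustrative, not the paper's code):

```python
import numpy as np

def collapse_to_gaussian(w, mu, var):
    """Moment-match a 1-D Gaussian mixture with a single Gaussian:
    the collapsed mean is the weighted mean, and the collapsed variance
    adds the spread of the component means to the component variances."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    mu = np.asarray(mu, dtype=float)
    m = np.sum(w * mu)
    v = np.sum(w * (np.asarray(var) + (mu - m) ** 2))
    return m, v
```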
0:09:15 | OK, so the likelihood approximation. I showed you the interaction function before; it's nonlinear, so basically we iteratively linearize it — a Gauss-Newton approach, a variant of iterative VTS. One interesting thing is that the linearization weight on the two factors in the model, basically the balance of speech and noise, actually ends up being just the power ratio of the distorted speech over the total power under the model.
---|
0:09:53 | That's kind of difficult to understand, so here's a picture. For a given time and frequency, you have a joint prior for speech and noise, shown in the first image as a diagonal-covariance Gaussian. The likelihood function at a given frequency — in this case the observation has an intensity of ten dB — looks like the second image, and the exact posterior of the model looks like the third image.
---|
0:10:21 | So what do we do? We linearize and approximate. In this case you can see that if we linearize once and compute the posterior, it's actually nothing like the true posterior. But if we iterate the process, we get a result that is very faithful to the true posterior shown in the previous slide. So there are two points here. Iteration is important; we know that. And modeling the uncertainty in the noise, when the noise is adapting, is clearly very important: if I didn't have uncertainty in the noise estimate, then this would have been my answer, which is wrong.
---|
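A scalar sketch of the iterative linearization just described (illustrative only; the channel is dropped and `psi` is an assumed observation variance, not a value from the talk). Note how the linearization weight `a` is exactly the power ratio of speech over total power:

```python
import numpy as np

def iterative_posterior(y, mu_x, v_x, mu_n, v_n, psi=0.01, iters=5):
    """Iteratively linearize y = log(e^x + e^n) around the current posterior
    mean and apply a linear-Gaussian update (a variant of iterative VTS)."""
    x0, n0 = mu_x, mu_n
    for _ in range(iters):
        a = 1.0 / (1.0 + np.exp(n0 - x0))       # speech power / total power
        f0 = np.logaddexp(x0, n0)               # interaction at the expansion point
        # residual of the linearized observation model under the prior
        r = y - (f0 + a * (mu_x - x0) + (1 - a) * (mu_n - n0))
        s = a**2 * v_x + (1 - a)**2 * v_n + psi  # innovation variance
        # posterior means; re-expand around them on the next iteration
        x0 = mu_x + (a * v_x / s) * r
        n0 = mu_n + ((1 - a) * v_n / s) * r
    return x0, n0
```

When the noise is far below the speech, the update reduces to a standard scalar Gaussian posterior for the speech alone, which is a useful sanity check.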
0:11:04 | OK, so to reconstruct the speech — this is a front-end technique, at least that's how we've been using it for this work — we run DNA, reconstruct, and feed the result to the back end for recognition. The reconstruction is just the expected value under a Gaussian — a mixture of Gaussians, rather, a highly structured mixture of Gaussians — at a given time step, given the prior for noise.
---|
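The reconstruction step, the expected value under the structured mixture, can be sketched as follows (illustrative; the inputs are assumed to be per-component posterior quantities, not the talk's actual data structures):

```python
import numpy as np

def mmse_reconstruct(prior_w, component_lik, cond_means):
    """MMSE clean-speech estimate under a mixture posterior:
    p(k | y) is proportional to prior_w[k] * component_lik[k], and
    x_hat = sum_k p(k | y) * E[x | k, y]."""
    post = np.asarray(prior_w, dtype=float) * np.asarray(component_lik)
    post /= post.sum()
    return float(post @ np.asarray(cond_means))
```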
0:11:31 | OK, so let's jump straight into some results. The data we tested on was US English in-car speech, recorded in various noise conditions: mostly, you know, cars speeding by, acceleration of the motor, and so on — the usual in-car noise. The training data is eight hundred thousand utterances, about seven hundred and eighty-six hours. The test data is from one hundred and twenty-eight held-out speakers, approximately forty hours of data. There are forty-seven tasks that span a few domains — navigation, command and control, radio, digits, dialling — and seven regional US accents.
---|
0:12:17 | For the models: in the back end we use a pretty standard word-internal system with plus/minus two phones of context and 8,865 context-dependent states, with forty-dimensional features. We built three models based on those: an ML model, a "clean" ML model trained only on data at twenty dB SNR or above, and an fMPE model. Once trained, we compress them using hierarchical band quantization, for efficiency.
---|
0:12:49 | The DNA speech model runs in the front end; DNA has its own speech model, a little GMM. It is set in the log mel domain and actually has just two hundred and fifty-six speech components and sixteen silence components. We compress this model as well, to a band-quantized GMM as I described, so there are actually only a few shared Gaussians per dimension that you need to evaluate to evaluate the model. In terms of the number of Gaussians, it's similar to what a speech detector would use.
---|
0:13:23 | The adaptation algorithms that we tried in conjunction at test time include spectral subtraction and DNA alone — mean normalisation was always in the pipeline — and we tried switching fMLLR and fMPE on and off to see how the techniques interact.
---|
0:13:40 | For spectral subtraction we basically used a standard spectral subtraction module: a model-based speech detector, with the noise estimated from speech-free frames, and a floor value based on the current noise estimate to avoid musical noise. The fMLLR is an online version, stochastically adapted every forty-five frames. The fMPE model implements the feature transformation using five hundred and twelve Gaussians, with seventeen frames of inner context and nine frames of outer context.
---|
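A minimal sketch of the flooring scheme just described (illustrative; `floor_frac` is an assumed parameter, not the module's actual setting):

```python
import numpy as np

def spectral_subtract(power, noise_est, floor_frac=0.1):
    """Power spectral subtraction with flooring: the cleaned power is
    floored at a fraction of the current noise estimate, which avoids the
    near-zero or negative values that produce musical noise."""
    return np.maximum(power - noise_est, floor_frac * noise_est)
```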
0:14:14 | OK, so this is a pretty busy graph; we'll spend a few minutes taking a look at it. The first thing I want you to look at: compare the red lines to the blue lines. This compares DNA off to DNA on. The red curve shows the word error rate as a function of SNR; there's also a histogram in the background of how much data falls in each SNR bin, and you can read the number of words on the right-hand y-axis.
---|
0:14:52 | Anyhow, you can see the general trend: going from the red to the blue curves — forget the green curves for now — we get significant reductions in word error rate when we turn on DNA. A little more concretely, the red dashed curve with diamonds is with spectral subtraction alone, and when we turn on DNA together with spectral subtraction we get the dashed blue curve; we're getting big gains, particularly at low SNR.
---|
0:15:31 | An interesting thing is that when we use the clean DNA model and a clean back-end model — the green curve with the upside-down triangle — we can see that that curve tracks the spectral subtraction curve that uses a multi-condition back end. So basically you're doing as well as spectral subtraction without having seen any of the noisy training data, which is rather impressive.
---|
0:16:06 | OK, so that was with fMLLR off; now you can see what happens when you turn fMLLR on. If you're doing noise-robustness experiments without fMLLR, maybe you should rethink that: it has a huge impact on performance. Fortunately, in this case DNA complements it and helps to generate better alignments. Again, comparing the case where only spectral subtraction is on to the case where DNA and spectral subtraction are on — actually, in this case there was not a big difference at low SNR. But of course these are only maximum-likelihood system results, so we're still not using the best system.
---|
0:16:51 | So then, when we turn on fMPE, we see again that in general the word error rate drops substantially everywhere — except at low SNR; it's actually up a little in the lowest SNR bins. That makes quite a bit of sense, because you don't get a lot of training data for those bins, and it puts a lot of stress on the fMPE model: it has to memorise combinations of speech and noise to do well in that area, and that's tricky when noise is modifying the features a lot. Here you can see that, comparing spectral subtraction on — that's the red diamonds — to the DNA results with and without spectral subtraction, we're getting big gains at very low SNR, and basically at about fifteen dB and below it's definitely better to turn on DNA.
---|
0:17:52 | One thing — I'm not going to go through this huge table, which is kind of daunting; you can look in the paper for details. One thing we noticed is that we're getting consistent gains at low SNR, but because of the amount of data at higher SNR, and because there seems to be just a little degradation there from the version of DNA I'm describing today, it ends up not helping the fMPE system overall. Since then we've actually figured out how to make the model more parsimonious — I can't get into the details here — but at this time the latest version we have improves on this database: it gives ten percent relative gains overall, and below ten dB it's heading towards twenty percent relative. So it's improving our system quite a bit.
---|
0:18:54 | OK, at this point I'll just wrap up. In summary, DNA is working — and this data is pretty dated now; it's actually working a lot better, but you'll have to wait until the next conference for me to tell you about the rest, because of patent issues.
---|
0:19:11 | Turning to the future work: as many of you in the crowd are already aware, when you use a graphical-modeling framework the model is modular, so you can make parts of the model stronger, and likewise improve your inference algorithms. Obviously, the inference algorithm we have for the noise is very approximate: we have a huge GMM that we approximate by a Gaussian at each time step. Another direction would be tighter back-end integration; of course there are a lot of people working on that, including Mark Gales' group and Microsoft Research. I guess that's pretty much it — so, good news for the model-based approach.
---|
0:20:03 | [Session chair]: Yes, we have time for a couple of questions.
---|
0:20:11 | (inaudible)
---|
0:20:29 | [Question]: It seems like — that's a great slide — it seems like it works really great in the cases where your word error rate is very high, but around the thirty-five dB SNRs you start to get degradation from DNA. Here, the little red triangle — it looks like your word error rate has doubled when you turn on DNA. Am I reading that right?
---|
0:20:57 | [Answer]: Yeah — I mean, that is pretty bad; at the time, that was the big problem with it. But we've since solved that problem. [Question]: With the iteration? You solved that problem? [Answer]: In general, we solved it, but unfortunately I can't tell you how yet.
---|
0:21:16 | [Question]: Just a clarification question — I didn't quite see which numbers correspond to clean training versus noisy training. [Answer]: In general, except for the green curves here, which have a clean back end and a clean DNA model, everything else is using multi-condition data. [Question]: OK.
---|
0:21:52 | [Question]: A question about speech recognition rather than, say, speech enhancement — do you reconstruct the noisy speech? [Answer]: Well, actually — maybe I didn't make that clear — we do reconstruct. [Question]: That's why I had the question: have you looked at the reconstructed spectra, or listened to them? [Answer]: Oh yeah, sure — there have been a few publications on it in the past, and they have spectrograms; I didn't put them in this time. And it's in the mel domain, so I haven't gone to the trouble of regenerating the signal to listen to it.
---|
0:22:51 | [Question]: One quick question: when you compared the performance — the one with DNA against the others — you used SS, the spectral subtraction, right? [Answer]: Yes. [Question]: And how does that differ from the standard approach — standard VTS? [Answer]: VTS would be a step up, I'd say, from spectral subtraction. This one line here pretty much sums it up: you use a speech detector, and then, based on your segmentation, you estimate the noise from the frames that are speech-free, and then you adapt it according to some forgetting factor. [Question]: I think it would probably make more sense to compare this with VTS, or a version that uses a single noise estimate.
---|
0:23:47 | [Answer]: Yeah, that's a fair point. Actually, this spectral subtraction routine is pretty well tuned, so when we ran DNA with the noise adaptation turned off, the results were essentially the same as with spectral subtraction; we got quite a bit of gain by turning on the adaptation of the noise. So if we just initialize the noise model the way VTS does — what everybody does: initialize on the first ten frames and then run VTS — that actually doesn't improve our system. You have to adapt it. The reason that was shocking to me is that in this database the utterances are two to five seconds long. That means that cars are passing and so on, and that's affecting the noise estimate significantly enough that you need to adapt it during the utterance. That's important.
---|
0:24:46 | (inaudible)
---|