0:00:14 | okay |
---|
0:00:15 | thank you |
---|
0:00:16 | and so I come to this very last presentation, on model-based speech enhancement |
---|
0:00:22 | at first |
---|
0:00:24 | first, I would like to give the outline of the talk |
---|
0:00:27 | I will start with a short introduction |
---|
0:00:29 | after that I will give a brief overview of the model-based noise reduction scheme used here for speech enhancement |
---|
0:00:38 | and I will then present our SNR-dependent MMSE estimator |
---|
0:00:42 | where SNR-dependent means that we have different MMSE estimators, and the input SNR decides which one we choose |
---|
0:00:51 | then I will show some test results, give a short demonstration, and finally conclude with the summary |
---|
0:00:59 | okay, first let me introduce the notation we use in this presentation |
---|
0:01:04 | we consider, for example, a scenario where we record the noisy microphone signal Y |
---|
0:01:11 | which consists of a clean speech signal S that is additively distorted by a noise signal N |
---|
0:01:20 | in the frequency domain we use this representation here, where we use capital letters with a frame index and a frequency index |
---|
0:01:28 | all estimates in the following are denoted by a hat |
---|
0:01:32 | for example, here at the output of our noise suppression we denote the enhanced speech signal |
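The additive signal model just described can be sketched in a few lines. This is only an illustrative example with arbitrary frame length and signal values, not code from the talk: the noisy DFT coefficients Y are the clean-speech coefficients S plus the noise coefficients N, per frame and frequency bin.

```python
import numpy as np

# Illustrative sketch of the additive signal model in the DFT domain.
# Frame length and signal values are arbitrary assumptions.
frame_len = 8
clean = np.sin(2 * np.pi * np.arange(frame_len) / frame_len)  # "speech"
noise = 0.1 * np.ones(frame_len)                              # "noise"

s = np.fft.rfft(clean)   # clean speech coefficients S
n = np.fft.rfft(noise)   # noise coefficients N
y = s + n                # noisy microphone spectrum, Y = S + N
```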
---|
0:01:39 | okay |
---|
0:01:41 | in the literature, so-called statistical noise reduction approaches are often used for the purpose of speech enhancement |
---|
0:01:49 | among them, for example, the Wiener filter or the weighting rules by Ephraim and Malah |
---|
0:01:55 | these techniques usually assume a certain distribution for the speech and the noise signal, for example a Gaussian or a Laplacian PDF |
---|
0:02:04 | and apply mathematical criteria like MMSE, maximum likelihood, or MAP in order to estimate the speech signal |
---|
0:02:13 | so the estimation is based here on memoryless a priori knowledge |
---|
0:02:17 | in contrast, the advantage of model-based approaches is that they can additionally consider correlation across time and/or frequency |
---|
0:02:26 | for example by using a specific model of the speech signal |
---|
0:02:32 | so here we can exploit additional a priori information |
---|
0:02:36 | and |
---|
0:02:38 | one example of such a model-based approach is the modified Kalman filter that I will present in the following |
---|
0:02:45 | the system consists of two steps |
---|
0:02:47 | in the first step, called propagation, we try to exploit the temporal correlation of the speech DFT coefficients |
---|
0:02:54 | this is illustrated here: we are working in the frequency domain, and you see here the previous M enhanced speech coefficients for one specific frequency bin |
---|
0:03:08 | which are used in order to predict the current speech coefficient |
---|
0:03:12 | for this we use conventional linear prediction techniques based on an AR model of order M |
---|
0:03:18 | with AR coefficients which have to be known or estimated beforehand |
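The propagation step described above can be sketched as a one-step linear predictor per frequency bin. The variable names, model order, and values below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def propagate(prev_coeffs, ar_coeffs):
    """Predict the current complex DFT coefficient as a linear
    combination of the M previous enhanced coefficients (AR model).
    prev_coeffs[0] is the most recent enhanced coefficient."""
    return np.sum(np.asarray(ar_coeffs) * np.asarray(prev_coeffs))

# Example: order M = 3, arbitrary complex history and AR coefficients
history = np.array([0.8 + 0.2j, 0.5 - 0.1j, 0.3 + 0.3j])
a = np.array([0.9, -0.3, 0.1])
s_pred = propagate(history, a)  # predicted speech coefficient
```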
---|
0:03:24 | in the second step, called the update step, we then only have to estimate the prediction error that we have made in the first step |
---|
0:03:32 | this prediction error is denoted in the following as E_S |
---|
0:03:37 | and in order to estimate this prediction error we consider the difference signal D, which is the noisy input coefficient minus the first speech prediction |
---|
0:03:47 | and as we will see later, in order to estimate E_S we perform a spectral weighting of the difference signal D by a weighting gain G |
---|
0:04:03 | once we have estimated E_S, we can then update our first prediction |
---|
0:04:09 | and finally get the enhanced speech coefficient |
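The update step just described can be sketched for one frequency bin as follows. The gain value and coefficients here are arbitrary placeholders, only meant to show how the difference signal is weighted and added back to the prediction:

```python
def update(y_noisy, s_pred, gain):
    """Update step for one frequency bin: weight the difference signal D
    by the gain G to estimate the prediction error E_S, then correct
    the prediction with it."""
    d = y_noisy - s_pred      # difference signal D
    e_s_hat = gain * d        # spectral weighting: estimate of E_S
    return s_pred + e_s_hat   # enhanced speech coefficient

# Example with arbitrary complex values and gain
s_hat = update(y_noisy=1.0 + 0.5j, s_pred=0.6 + 0.24j, gain=0.7)
```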
---|
0:04:13 | such a complete Kalman filter system is applied separately for each frequency bin |
---|
0:04:21 | and we then finally transform the whole frame back into the time domain |
---|
0:04:28 | the system can be extended to the noise signal in order to also exploit possible correlation of the noise signal |
---|
0:04:37 | therefore we apply the propagation step also to the noise signal |
---|
0:04:43 | where we use the previous enhanced noise estimates from the past in order to predict the current noise coefficient |
---|
0:04:55 | in the update step we then have to estimate two prediction errors: that of the speech signal, E_S, and that of the noise signal, E_N |
---|
0:05:04 | so let's have a closer look at this problem |
---|
0:05:09 | so the objective here in the update step, as I just mentioned, is to estimate E_S and E_N based on the difference signal D |
---|
0:05:18 | in this case D is given as the noisy input coefficient, S plus N, minus the first speech prediction minus the first noise prediction |
---|
0:05:28 | and this expression can also be stated as the sum of the two prediction errors, E_S plus E_N |
---|
0:05:36 | so we have a classical noise reduction problem in the update step |
---|
0:05:42 | we have a target signal E_S that we want to estimate, which is distorted by an additive noise signal E_N, and we have access only to the noisy difference signal D |
---|
0:05:52 | and this allows us to use here a conventional statistical estimator which is adapted to the statistics of E_S and E_N |
---|
0:06:00 | and we can perform here a spectral weighting of the difference signal D by a weighting gain G in order to estimate E_S, or by one minus G in order to estimate E_N |
---|
0:06:13 | in order to derive all these weighting gains G, the original Kalman filter approach assumes a Gaussian PDF for E_S and E_N and minimizes the mean square error between E_S and its estimate |
---|
0:06:27 | and this leads to the well-known Wiener solution for the weighting gain G, as can be seen here |
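Under the Gaussian assumption, the MMSE gain reduces to the Wiener gain, i.e. the ratio of the prediction-error variances. A minimal sketch (variable names are illustrative, and the variances would in practice come from the Kalman recursion):

```python
def wiener_gain(var_es, var_en):
    """Wiener weighting gain under Gaussian assumptions:
    G = var(E_S) / (var(E_S) + var(E_N))."""
    return var_es / (var_es + var_en)

# Example: speech prediction-error variance 3.0, noise 1.0
g = wiener_gain(3.0, 1.0)
```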
---|
0:06:34 | however, we measured the statistics of the speech prediction error signal E_S |
---|
0:06:40 | and the distribution of E_S is not Gaussian but super-Gaussian, as we showed at ICASSP 2008 |
---|
0:06:49 | and this fact can now be exploited in the update step if we do not use the Wiener filter but a statistical estimator which can be adapted to the measured statistics |
---|
0:06:59 | for example the MMSE estimator by Erkelens and his colleagues, which assumes a generalized gamma distribution for the target signal |
---|
0:07:11 | so far we had measured the PDF of E_S for a large SNR range and averaged the results, so at the end we had one single histogram |
---|
0:07:23 | in this contribution we now performed an SNR-dependent measurement of the statistics |
---|
0:07:29 | therefore we distorted our speech signals by white Gaussian noise at different input SNR values and measured the histograms |
---|
0:07:40 | the result can be seen here: the normalized PDF of the magnitude of E_S, depending on the input SNR, which varies here from minus 20 to 35 dB |
---|
0:07:52 | and you can clearly see that the input SNR has an influence on the histograms |
---|
0:07:59 | the higher the input SNR, the higher the probability that only small prediction errors occur |
---|
0:08:06 | and this fact can now also be exploited in our system if we use an SNR-dependent MMSE estimator in the update step |
---|
0:08:16 | for this we use the MMSE estimator I mentioned before, which is now adapted to each of the histograms we have just seen |
---|
0:08:26 | so for each quantized SNR value, with a step size of five dB, we use here a different MMSE estimator |
---|
0:08:36 | so the gain G now also depends on the input SNR |
---|
0:08:43 | in order to estimate the input SNR in our system, we simply use the enhanced speech and noise coefficients from previous frames |
---|
0:08:52 | with such a system we of course increase the computational complexity and the memory requirements compared to a conventional statistical estimator |
---|
0:09:02 | compared to the Wiener filter, for example, we increase the complexity roughly by a factor of six |
---|
0:09:09 | and additionally we of course have to store previous frames for the prediction part, and a lookup table for each MMSE estimator |
---|
0:09:22 | okay, let's come to the results |
---|
0:09:25 | some more facts about the system settings: we use here relatively low model orders, a model order of three for the speech signal and a model order of two for the noise signal |
---|
0:09:35 | the AR coefficients are estimated in each frame using the Levinson-Durbin algorithm, which is applied to enhanced estimates from previous frames |
---|
0:09:43 | the noise statistics needed in the update step, i.e. the noise power, are estimated likewise |
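The Levinson-Durbin recursion mentioned above can be sketched as follows; this is the textbook algorithm applied to an autocorrelation sequence, not the paper's exact code, and the example autocorrelation values are assumptions:

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for AR coefficients.
    r: autocorrelation sequence r[0..order]."""
    a = np.zeros(order)
    err = r[0]  # prediction error power
    for m in range(order):
        # reflection coefficient for order m+1
        k = (r[m + 1] - np.dot(a[:m], r[m:0:-1])) / err
        # update lower-order coefficients, then append the new one
        a[:m] = a[:m] - k * a[:m][::-1]
        a[m] = k
        err *= (1.0 - k * k)
    return a

# An AR(1) process with coefficient 0.9 has autocorrelation r[m] = 0.9**m,
# so a second-order fit should recover [0.9, 0.0].
coeffs = levinson_durbin(np.array([1.0, 0.9, 0.81]), order=2)
```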
---|
0:09:49 | what can we achieve with such a system? at first, objective measurements averaged over five different noise signals |
---|
0:09:57 | you see here the segmental speech SNR plotted over the noise attenuation, with the input SNR varying here from minus ten to 35 dB |
---|
0:10:09 | the objective here is to achieve a high noise attenuation and a high segmental speech SNR, so the more these curves are placed in the upper right corner, the better the performance |
---|
0:10:19 | in blue and red you see the results of two purely statistical estimators: the Wiener filter and a Laplacian MMSE estimator, which assumes a Laplacian distribution for the speech signal |
---|
0:10:32 | and in green and lilac you see the two proposed Kalman filter approaches |
---|
0:10:39 | in green the one where we use the SNR-independent MMSE estimator in the update step |
---|
0:10:48 | and in lilac the new approach with the SNR-dependent MMSE estimator |
---|
0:10:53 | and overall you can see here that with the two Kalman filter approaches we outperform the two purely statistical estimators |
---|
0:11:03 | look here, for example, at an input SNR of five dB: if we keep the segmental speech SNR constant, we achieve a much higher noise attenuation with the two model-based approaches |
---|
0:11:17 | and we gain here about two to three dB noise attenuation if we compare the Wiener filter and the new SNR-dependent Kalman filter |
---|
0:11:27 | I would also like to give a short demonstration of these four investigated techniques |
---|
0:11:34 | at first I will play the noisy signal, then the enhanced signals of the Wiener filter and the Laplacian MMSE estimator, then the two Kalman filter approaches, and at last the noisy signal once again |
---|
0:11:53 | [audio demonstration] |
---|
0:12:43 | so you could hear that with the newly proposed SNR-dependent Kalman filter we achieve the highest noise attenuation, while achieving almost the same speech quality as the others |
---|
0:12:55 | additional objective measurements showing a similar behavior can be found in the paper |
---|
0:13:02 | in the meantime we also conducted an informal listening test, which cannot be found in the paper |
---|
0:13:10 | on the left side we compared the estimators which were not adapted to the measured statistics: here we compared the SNR-independent Kalman filter with the Wiener filter |
---|
0:13:22 | and on the right side we compared the estimators which are explicitly adapted to the measured statistics: here we compared the SNR-dependent Kalman filter with the Laplacian MMSE estimator |
---|
0:13:36 | in this test we had nineteen test persons who judged the overall quality of the respective techniques |
---|
0:13:45 | and in both figures you can see a clear preference for the newly proposed Kalman filter |
---|
0:13:55 | okay, to summarize: we presented here a modified Kalman filter approach which is able to exploit the temporal correlation of the speech and noise signals |
---|
0:14:05 | we showed that in the update step the input SNR has an influence on the statistics of the speech prediction error signal |
---|
0:14:12 | this fact can be exploited by using an SNR-dependent MMSE estimator in the update step, which is adapted to the measured histograms |
---|
0:14:23 | and we showed in objective and subjective evaluations that we can improve on the results of the statistical estimators |
---|
0:14:32 | thank you |
---|
0:14:37 | first question |
---|
0:14:46 | it's hard to hear, but I thought I detected an increased amount of musical noise in your last examples that you played; can you comment, please? |
---|
0:14:53 | yeah, that's true |
---|
0:14:59 | I mean, there is a trade-off between noise attenuation, speech distortion, and musical tones |
---|
0:15:04 | and in the first two aspects we are better, but we unfortunately get a slight increase of musical noise |
---|
0:15:11 | there you could use some post-processing techniques to reduce the remaining musical noise |
---|
0:15:23 | john |
---|
0:15:26 | in the plot that you had for different types of noise, which noise types were you looking at, and which did you choose? |
---|
0:15:40 | we used five different types of noise and averaged across all five |
---|
0:15:44 | yeah |
---|
0:15:44 | and the one you played, was that white Gaussian noise? that example was factory noise |
---|
0:15:50 | okay |
---|
0:15:50 | oh, I just have a quick question: the AR order you set equal to three for speech and two for noise, did you try varying that for different types of speech, like growls or consonants? |
---|
0:16:01 | no, that is the average over a large database of speech |
---|
0:16:11 | and the values you set for this, does it make a difference for the different types of noise? |
---|
0:16:18 | yeah, it depends of course on the type of noise |
---|
0:16:24 | less for white Gaussian noise, and more, of course, for babble |
---|
0:16:33 | okay, I don't see further comments |
---|
0:16:36 | so I would like to thank all the speakers of the session |
---|