0:00:14 Okay, thank you.
0:00:16 And so we come to this very last presentation, on model-based speech enhancement.
0:00:22 First, the outline of the talk: I will start with a short introduction, after which I will give a brief overview of the model-based noise reduction used here for speech enhancement.
0:00:38 I will then present our SNR-dependent MMSE estimator, where SNR-dependent means that we have different estimators and the input SNR decides which one we choose.
0:00:51 Then I will show some test results, give a short demonstration, and finally conclude with the summary.
0:00:59 Okay, first let me introduce the notation we use in this presentation.
0:01:04 We consider, for example, a scenario where we record the noisy microphone signal y, which consists of a clean speech signal s that is disturbed by an additive noise signal n.
0:01:20 In the frequency domain we use this representation here, with capital letters, a frame index, and a frequency bin index.
0:01:28 All estimates in the following are denoted by a hat; for example, here at the output of our noise suppression you see the enhanced speech signal.
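The additive model just described can be sketched numerically. This is only an illustrative toy example (the signal values, frame length, and variable names are made up), showing that the linearity of the DFT carries the time-domain model y = s + n over to Y = S + N in every frequency bin:

```python
import numpy as np

# Hypothetical sketch of the additive signal model in the DFT domain:
# Y(l, k) = S(l, k) + N(l, k) for frame l and frequency bin k.
rng = np.random.default_rng(0)
frame_len = 8
s = rng.standard_normal(frame_len)          # clean speech frame (toy data)
n = 0.3 * rng.standard_normal(frame_len)    # additive noise frame
y = s + n                                   # noisy microphone signal

# Capital letters denote the DFT-domain representation.
S = np.fft.rfft(s)
N = np.fft.rfft(n)
Y = np.fft.rfft(y)

# The DFT is linear, so the additive model holds per frequency bin.
assert np.allclose(Y, S + N)
```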
0:01:41 In the literature, so-called statistical noise reduction approaches are often used for the purpose of speech enhancement, among them, for example, the Wiener filter or the weighting rules by Ephraim and Malah.
0:01:56 These techniques usually assume a certain distribution for the speech and the noise signal, for example a Gaussian or a Laplacian PDF, and apply a mathematical criterion like MMSE, maximum likelihood, or MAP in order to estimate the speech signal.
0:02:13 So the estimation is based here on memoryless a priori knowledge.
0:02:17 In contrast, the advantage of model-based approaches is that they can additionally consider correlation across time and/or frequency, for example by using a specific model of the speech signal.
0:02:32 So here we can exploit a priori information about the signal.
0:02:38 One example of such a model-based approach is the modified Kalman filter that I will present in the following.
0:02:45 The system consists of two steps. The first step is called propagation; there we try to exploit the temporal correlation of the speech DFT coefficients.
0:02:54 This is illustrated here: we are working in the frequency domain, and you see the previous K enhanced speech coefficients for one specific frequency bin, which are used to predict the current speech coefficient.
0:03:12 For this we use conventional linear prediction techniques based on an AR model of order K, with AR coefficients that have to be known or estimated.
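The propagation step described above can be sketched for a single frequency bin. The AR coefficients and the previous enhanced coefficients below are made-up toy values, assumed known for the sake of illustration:

```python
import numpy as np

# Sketch of the propagation step for one frequency bin (hypothetical values):
# predict the current speech DFT coefficient from the K previous enhanced
# coefficients using an AR model of order K.
a = np.array([0.6, 0.25, 0.1])       # AR coefficients (assumed known/estimated)
s_prev = np.array([1.0 + 0.5j,       # S_hat(l-1, k): most recent enhanced coeff.
                   0.8 + 0.4j,       # S_hat(l-2, k)
                   0.5 + 0.2j])      # S_hat(l-3, k)

# Linear prediction: S_pred(l, k) = sum_i a_i * S_hat(l-i, k)
s_pred = np.dot(a, s_prev)
print(s_pred)   # (0.85+0.42j)
```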
0:03:24 In the second step, called the update step, we then only have to estimate the prediction error that we have made in the first step.
0:03:32 This prediction error is denoted in the following as E_S, and in order to estimate it we consider the difference signal D, which is the noisy input coefficient minus the first speech prediction.
0:03:47 As we will see later, in order to estimate E_S we perform a spectral weighting of the difference signal D by a weighting gain G.
0:04:03 Once we have estimated E_S, we can then update our first prediction and finally get the enhanced speech coefficient.
0:04:13 Such a Kalman filter system is applied separately for each frequency bin, and we then finally transform the whole frame back into the time domain.
0:04:28 The system can also be extended to noise signals in order to exploit possible correlation of the noise signal as well. Therefore we apply the propagation step also to the noise signal, where we use the previous enhanced noise estimates from the past in order to predict the current noise coefficient.
0:04:55 In the update step we then have to estimate two prediction errors: that of the speech signal, E_S, and that of the noise signal, E_N.
0:05:04 So let's take a closer look at this problem.
0:05:09 So the objective here in the update step, as I just mentioned, is to estimate E_S and E_N based on the difference signal D.
0:05:18 In this case D is given as the noisy input coefficient, S plus N, minus the first speech prediction, minus the first noise prediction.
0:05:28 This expression can also be stated as the sum of the two prediction errors, E_S plus E_N.
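The identity just stated can be verified with toy complex coefficients (all values below are made up for illustration):

```python
import numpy as np

# Toy complex DFT coefficients for one time-frequency point (made-up values).
S = 1.0 + 0.8j            # clean speech coefficient
N = 0.3 - 0.2j            # noise coefficient
Y = S + N                 # noisy input coefficient

S_pred = 0.9 + 0.7j       # a priori speech prediction from the propagation step
N_pred = 0.25 - 0.15j     # a priori noise prediction

E_S = S - S_pred          # speech prediction error
E_N = N - N_pred          # noise prediction error

# Difference signal used in the update step ...
D = Y - S_pred - N_pred
# ... equals the sum of the two prediction errors.
assert np.isclose(D, E_S + E_N)
```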
0:05:39 So we have a classical noise reduction problem in the update step: we have a target signal E_S that we want to estimate, which is disturbed by an additive noise signal E_N, and we have access only to the noisy difference signal D.
0:05:52 This allows us to use here a conventional statistical estimator which is adapted to the statistics of E_S and E_N, and we can perform the spectral weighting of the difference signal D by a weighting gain G in order to estimate E_S, or by one minus G in order to estimate E_N.
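The weighting and the subsequent update can be sketched as follows; the gain value, the difference signal, and the prediction are made-up toy numbers:

```python
import numpy as np

# Sketch of the update step: estimate the prediction errors by spectrally
# weighting the difference signal D with a gain G in [0, 1] (toy values).
D = 0.15 + 0.05j          # difference signal for one bin
G = 0.7                   # weighting gain from some statistical estimator

E_S_hat = G * D           # estimate of the speech prediction error
E_N_hat = (1.0 - G) * D   # estimate of the noise prediction error

# Updating the a priori prediction yields the enhanced speech coefficient.
S_pred = 0.9 + 0.7j       # prediction from the propagation step (assumed)
S_hat = S_pred + E_S_hat

# The two weighted estimates always add back up to D.
assert np.isclose(E_S_hat + E_N_hat, D)
```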
0:06:16 In order to derive these weighting gains G, the original Kalman filter approach assumes a Gaussian PDF for E_S and E_N and minimizes the mean square error between E_S and its estimate, and this leads to the well-known Wiener solution for the weighting gain G, as can be seen here.
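Under the Gaussian assumption, the Wiener-type gain reduces to a ratio of error variances; a minimal sketch with illustrative variance values (not taken from the talk):

```python
# Wiener-type gain under a Gaussian assumption for both prediction errors:
# G = var(E_S) / (var(E_S) + var(E_N)). The values below are illustrative.
def wiener_gain(var_es, var_en):
    return var_es / (var_es + var_en)

print(wiener_gain(4.0, 1.0))   # strong speech prediction error -> gain near 1
print(wiener_gain(0.25, 1.0))  # weak speech prediction error -> strong attenuation
```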
0:06:34 However, we measured the statistics of the speech prediction error signal E_S, and the distribution of E_S is not Gaussian but super-Gaussian, as we showed at ICASSP 2008.
0:06:49 This fact can be exploited in the update step if we do not use the Wiener filter but a statistical estimator which can be adapted to these statistics, for example an MMSE estimator which assumes a generalized gamma distribution for the target signal.
0:07:11 So far we had measured the PDF of E_S over a whole SNR range and averaged the results, so at the end we had one single histogram.
0:07:23 In this contribution we now performed an SNR-dependent measurement of these statistics. For this we distorted our speech signals by white Gaussian noise at different input SNR values and measured the histograms.
0:07:40 The result can be seen here: the normalized PDF of the magnitude of E_S depending on the input SNR, which varies here from minus 20 to 35 dB.
0:07:52 You can clearly see that the input SNR has an influence on the histograms: the higher the input SNR, the higher the probability that only small prediction errors occur.
0:08:06 This fact can now also be exploited in our system if we use an SNR-dependent MMSE estimator in the update step.
0:08:17 For this we use the MMSE estimator I mentioned before, which is now adapted to each of the histograms we have just seen.
0:08:27 So for each quantized SNR value, with a step size of 5 dB, we use here a different MMSE estimator; the gain G now also depends on the input SNR.
0:08:43 In order to estimate this input SNR, in our system we simply use the enhanced speech and noise coefficients from previous frames.
0:08:52 With such a system we of course increase the computational complexity and the memory requirements compared to a conventional statistical estimator.
0:09:02 Compared to the Wiener filter, for example, we increase the complexity by a factor of roughly six, and additionally we of course have to store previous frames for the prediction part and a lookup table for each quantized SNR value.
0:09:20 Okay, now coming to the results, first some system settings: we use relatively low model orders here, an order of three for the speech signal and an order of two for the noise signal.
0:09:35 The AR coefficients are estimated in each frame using the Levinson-Durbin algorithm, which is applied to estimates from previous frames, and we use the minimum statistics approach in the update step to track the noise power.
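The Levinson-Durbin recursion mentioned above can be written compactly; this is a generic, minimal sketch (not the speaker's implementation), tried on a toy autocorrelation sequence of an AR(1) process:

```python
import numpy as np

# Minimal Levinson-Durbin recursion: solves the normal equations for the AR
# coefficients given an autocorrelation sequence r[0..order].
def levinson_durbin(r, order):
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                     # prediction error power
    for i in range(1, order + 1):
        # Reflection coefficient from the current prediction residual.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        a_new = a.copy()
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]
        a_new[i] = k
        a = a_new
        err *= (1.0 - k * k)
    return a, err

# Toy autocorrelation of an AR(1) process with coefficient 0.5: r[m] = 0.5**m.
a, err = levinson_durbin(np.array([1.0, 0.5, 0.25]), order=2)
print(a)   # [ 1.  -0.5  0. ]; the prediction coefficients are -a[1:]
```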
0:09:49 What can we achieve with such a system? At first, objective measurements averaged over five different noise signals: you see the segmental speech SNR plotted over the noise attenuation, with the input SNR varying here from minus 10 to 35 dB.
0:10:09 The objective is to achieve a high noise attenuation and a high segmental speech SNR, so the more these curves are placed in the upper right corner, the better the performance.
0:10:21 In blue and red you see the results of two purely statistical estimators: the Wiener filter and a Laplacian MMSE estimator, which assumes a Laplacian distribution for the speech signal.
0:10:32 In green and magenta you see the two proposed Kalman filter approaches: in green the one where we use the SNR-independent MMSE estimator in the update step, and in magenta the new approach with the SNR-dependent MMSE estimator.
0:10:53 Overall you can see here that with the two Kalman filter approaches we outperform the two statistical estimators.
0:11:03 Look, for example, at an input SNR of 5 dB: if we keep the segmental speech SNR constant, we achieve a much higher noise attenuation with the two model-based approaches, and we gain about two to three dB of noise attenuation if we compare the Wiener filter and the new SNR-dependent Kalman filter.
0:11:27 I would also like to give you a short demonstration of these four investigated techniques. At first I will play the noisy signal, then the enhanced signals from the Wiener filter and the Laplacian MMSE estimator, then the two Kalman filter approaches, and at last the noisy signal once more.
0:11:53 [audio demonstration]
0:12:43 So I hope you could hear that with the newly proposed SNR-dependent Kalman filter we get the best noise attenuation while achieving almost the same speech quality as the others.
0:12:55 Additional objective measurements showing a similar behavior can be found in the paper.
0:13:02 In the meantime we also conducted an informal listening test, which cannot be found in the paper.
0:13:10 On the left side we compared the estimators which were not adapted to the super-Gaussian statistics, that is, the SNR-independent Kalman filter with the Wiener filter.
0:13:22 On the right side we compared the estimators which are adapted to the measured statistics: here we compared the SNR-dependent Kalman filter with the super-Gaussian MMSE estimator.
0:13:36 We had nineteen test persons who judged the overall quality of the respective techniques, and in both figures you can see a clear preference for the newly proposed Kalman filter.
0:13:55 Okay, to summarize: we presented here a modified Kalman filter approach which is able to exploit the temporal correlation of the speech and noise signals.
0:14:05 We showed that in the update step the input SNR has an influence on the statistics of the speech prediction error signal, and that this fact can be exploited by using an SNR-dependent MMSE estimator which is adapted to the measured histograms in the update step.
0:14:23 We showed in objective and subjective evaluations that we can improve on the results of the statistical estimators.
0:14:32 Thank you.
0:14:37 First question?
0:14:46 [Question] It's hard to hear, but I thought I detected an increased amount of musical noise in your last examples that you played. Can you comment?
0:14:55 Yeah, that's true. There is a trade-off between noise attenuation, speech distortion, and musical tones, and while on the first two aspects we are better, we unfortunately see a slight increase in musical noise. Well, you could use some post-processing techniques to reduce the remaining musical noise.
0:15:26 [Question] In the plot that you showed for different types of noise, were you averaging across the noise types?
0:15:40 We used five different types of noise and averaged across all five.
0:15:44 And what you played, was that white Gaussian noise? — No, the example that was played was factory noise. — Okay.
0:15:50 [Question] Just a quick question: the order terms you set, three for speech and two for noise, did you try tuning that for different types of speech, like vowels or consonants? — No, that is the average over a large database of speech.
0:16:11 And the value you set for the noise, does that make a difference for the different types of noise? — Yes, the order of course depends on the type of noise: less for white Gaussian noise and, of course, more for babble noise.
0:16:33 Okay, I don't see further questions, so I would like to thank all the speakers of the session.