0:00:09 | Thank you very much, and thank you for coming to my talk. |
0:00:14 | When I was asked how long the talk would be, I said anything between half an hour and one and a half hours, so let's see how long it will actually take. |
0:00:26 | You are welcome to ask questions in between, and I guess the first question that you have I will answer right away: |
0:00:34 | where is Paderborn? So, Paderborn is here in the state of North Rhine-Westphalia, in the east of the state, and maybe the closest town that you might know is Dortmund, because you probably know the football team. |
0:00:51 | Dortmund is about one hundred kilometres west of Paderborn; so this is Paderborn. |
0:00:57 | Okay, so I am going to talk about beamforming. |
0:01:02 | This is our group in Paderborn, and before I start I would like to say that this is of course joint work of the whole group; in particular I would like to mention Jahn Heymann, Lukas Drude and Aleksej Chinaev. |
0:01:22 | So here is what I am going to talk about today. |
0:01:26 | What you see here is, so to say, the scenario: we have an enclosure with a speaker and maybe some distorting acoustic events. |
0:01:35 | Then there is a microphone array and the beamformer processing the signals, and after the beamformed signal we might have an automatic speech recognition unit. |
0:01:46 | The adaptation, or the computation, of the beamforming coefficients is controlled by a parameter estimation device. |
0:01:56 | I am first going to talk a bit about spatial filtering objective functions, so this part here. |
0:02:03 | This part, I think, is rather basic, but since you are maybe more computer science people than electrical engineering people, I will spend a bit more time on it. |
0:02:14 | Then we discuss how we can actually estimate the beamforming coefficients; that pertains to this block here, parameter estimation. |
0:02:25 | Eventually I will look at the combination of beamforming and speech recognition, and finally, if time allows, I could also spend a few words, or have a few slides, on other applications of beamforming beyond noise reduction. |
0:02:45 | So let's start with this first block, spatial filtering objective functions. |
0:02:51 | This is, so to say, the elementary setup: assume we have just a monofrequent signal, a complex exponential like this, and let's first model what we then receive at the microphones. |
0:03:08 | Here is a closer look: we have N elements, the N microphones, and we have this monofrequent signal impinging on the microphone array from an angle theta; d is the inter-element distance between two microphones. |
0:03:26 | Now, formulating this mathematically, the beamformer output signal is the weighted sum of the microphone signals; the weights are the w's here. |
0:03:37 | What we actually have at the microphones in this simple setup is the complex exponential of the source signal, delayed by a certain amount of time which depends on the microphone considered. |
0:03:52 | If you look here, for example, and take this first microphone as the reference, then the signal will arrive at the second microphone after some delay tau_1, and so on. |
0:04:03 | And you can see that the delay tau depends on the angle of arrival theta, so you can express these taus by this expression here. |
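For reference, the delay at microphone n in the far field of a uniform linear array takes the standard textbook form (reconstructed here from the described geometry, with c the speed of sound, not copied from the slide):

$$\tau_n = \frac{n\,d\,\sin\theta}{c}, \qquad n = 0,\dots,N-1.$$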
0:04:18 | So this is the beamformer output signal. Using vector notation, we have the source signal, the complex exponential, and then we have here the weight vector containing the beamformer coefficients w_0 to w_{N-1}. |
0:04:32 | And the c here comprises all these delay terms, all these complex exponential terms. This term here is called the steering vector, and it has these elements: the first microphone signal is not delayed, the second one is delayed by tau_1, and so on. |
0:04:50 | So this is the description of this scenario here. |
0:04:56 | To get a feeling for how we can do spatial filtering with this setup, I have some beam patterns here. |
0:05:07 | What I have on the left-hand side are the beamformer coefficients, for which I assume we have steered the beamformer towards the direction, or angle, theta equal to zero. |
0:05:23 | What is now plotted here is the so-called beam pattern, which is the inner product between the beamforming vector and the steering vector from the last slide. |
0:05:35 | What you see here, as a function of the angle theta, is the response of the beamformer, this beam pattern. |
0:05:45 | And you see that if theta is, say, zero, then we have a high sensitivity, and for other directions the gain, so to say, is small or can even be zero. |
0:06:00 | Now, I have four beam patterns here. This first one on the left-hand side corresponds to the situation where the signal arrives at so-called broadside; by broadside I mean the signal comes from this direction here. |
0:06:17 | That is called broadside, and this direction here is called endfire. |
0:06:23 | So this is the broadside direction, and here the desired signal is in the endfire direction, so it looks like that: broadside and endfire. |
0:06:36 | The lower two beam patterns indicate that the sensitivity, or the spatial selectivity, of the beamformer depends very much on the geometry. |
0:06:47 | Here, the ratio between the distance between two microphones and the wavelength is very small, so the aperture is small, and then the sensitivity is almost omnidirectional, as you can see. |
0:07:05 | Here, the ratio between the inter-element distance of the microphones and the wavelength is very large; the distance is even larger than the wavelength. |
0:07:18 | If the spacing is larger than half a wavelength, we get spatial aliasing, just as you know temporal aliasing; that is the reason why you have these grating lobes here. This is called spatial aliasing. |
0:07:32 | So here we have a low inter-element distance, and here a high inter-element distance of the microphones relative to the wavelength. |
0:07:41 | So we can do spatial filtering with this setup. |
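A minimal numerical sketch of such a beam pattern, assuming a far-field uniform linear array; the array size, spacing, and weights below are illustrative choices, not values from the talk:

```python
import numpy as np

def beampattern(w, d, wavelength, thetas):
    """|w^H c(theta)| for an N-microphone uniform linear array.

    w: complex beamformer weights, shape (N,)
    d: inter-element spacing in metres
    wavelength: wavelength of the monofrequent signal in metres
    thetas: candidate angles of arrival in radians
    """
    n = np.arange(len(w))
    # steering vectors: phase shifts induced by the inter-microphone delays
    c = np.exp(-2j * np.pi * (d / wavelength) * np.outer(np.sin(thetas), n))
    return np.abs(c @ w.conj())

# Uniform weights steer towards broadside (theta = 0); with d = lambda/2
# there is no spatial aliasing, i.e. no grating lobes.
N, d, lam = 8, 0.05, 0.10
w = np.ones(N, dtype=complex) / N
print(beampattern(w, d, lam, np.deg2rad([0, 30, 60, 90])))
```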
0:07:48 | Now I go a step further, to real environments. What I mean by that is, first, we have a speech signal — we want to work with speech — so we have a wideband signal. |
0:08:02 | We don't have a single monofrequent sine wave, a complex exponential; we have a signal with a bandwidth of, say, eight or sixteen kilohertz or whatever. |
0:08:13 | What we then do is go to the short-time Fourier transform domain, and for each frequency we can then look at it as a small narrowband beamforming problem again. |
0:08:26 | Then we have interferences: distorting sources like noise or whatever, which we would like to suppress. |
0:08:35 | So we need an appropriate objective function from which we derive the beamforming coefficients, such that the desired signal is enhanced and the other one is suppressed. |
0:08:44 | Then we have reverberation, which means: if we are in an enclosure like this lecture hall, we have signal propagation via the direct path, but also via reflections and multiple reflections and so on. |
0:09:03 | All of this is called reverberation, and it is modeled by an acoustic impulse response, or acoustic transfer function, from the source to the microphones. |
0:09:16 | And finally, this acoustic transfer function is unknown, and even time-variant: if I move, or something else here in the lecture hall moves, then the impulse response, or transfer function, will change. So it can be time-variant, and it is unknown and needs to be estimated. |
0:09:38 | So that is what we are going to consider now, and we do this by data-dependent, statistically optimum beamforming. |
0:09:52 | We first formulate the model in the short-time Fourier transform domain. |
0:09:58 | So we go from the time domain to the short-time Fourier transform domain, which means we take chunks of the signal and compute the DFT on them; then we move this chunk a bit forward, take a new DFT, and so on. |
0:10:12 | Here the two parameters are the time frame index and the frequency bin index. |
0:10:18 | So Y is now the vector of microphone signals, Y_0 up to Y_{N-1}, and with some assumptions this can be modeled as the product of the source signal S we are interested in and the acoustic transfer function vector from the source to each of the microphones, plus a noise term. |
0:10:45 | I should at least mention that this model is already an approximation, because it assumes that the impulse response is shorter than the analysis window of the DFT. But I will take this model for the whole talk. |
0:11:07 | The beamformer output signal then is the beamforming coefficients times this input signal, and in the following I will leave out the arguments t and f. So this is the filter output. |
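As a toy illustration of this narrowband model and the filter output for a single (t, f) bin — random numbers stand in for real STFT data, and the matched-filter weights are just one arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6                                              # number of microphones

# STFT-domain model for one time-frequency bin:  y = a * s + n
a = rng.normal(size=N) + 1j * rng.normal(size=N)   # acoustic transfer function vector
s = rng.normal() + 1j * rng.normal()               # desired source signal
n = 0.1 * (rng.normal(size=N) + 1j * rng.normal(size=N))  # noise
y = a * s + n                                      # microphone signal vector

w = a / (a.conj() @ a)                  # weights with w^H a = 1 (distortionless)
z = w.conj() @ y                        # beamformer output, approximately s
```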
0:11:26 | Now, how do we determine these beamforming coefficients w in a certain statistically optimum way? The first criterion that we would probably all come up with is the MSE criterion, the minimum mean squared error criterion. |
0:11:42 | We would like to determine the beamforming coefficients such that the mean squared error between the beamformer output and some desired signal is the smallest possible — that is what you know from other optimisation tasks as well. |
0:11:59 | And what is the desired signal? |
0:12:03 | One could use as the desired signal, of course, the source signal S, which we would like to enhance. |
0:12:11 | That would mean that the desired signal is equal to the source signal S, and then the beamformer has the task of both beamforming and dereverberation, because what we would like to restore with this desired signal is the source signal at the position of the source, not at the microphone. So it should also dereverberate, that is, suppress the effect of the sound propagation from the source to the microphones. |
0:12:44 | Or one could use an alternative criterion, where the desired signal is the image of the source signal, X, at the microphone. Then we want to do beamforming only, where beamforming means suppression of noise from other directions. |
0:13:04 | So let's now solve this problem here. |
0:13:09 | We have here the mean squared error; this is the beamformer output, so beamforming coefficients times microphone signal, and this is the desired signal. |
0:13:17 | If we just plug in our definition of Y, which we had here — that is no big deal — then we can rewrite it in this way, |
0:13:31 | where sigma_S squared is the power, or variance, of the source signal, and this Sigma_NN here is the covariance matrix of the noise at the microphones: we have N elements, so we have an N-by-N covariance matrix of the noise. |
0:13:51 | And you can see that the mean squared error consists of two terms: there is a speech distortion term and a noise term. |
0:14:02 | Speech distortion is the deviation of the beamformer output from the desired output, and this is the contribution of the noise, which is independent of the desired speech signal. |
0:14:18 | If we formulate it in this way, it is really not difficult to carry out this minimization here, and this is the result: these are the optimal beamforming coefficients which minimize the mean squared error. |
0:14:34 | Sigma_NN was the noise covariance matrix, and a was the acoustic transfer function vector from the source to the individual microphones. |
0:14:45 | This one is called the multichannel Wiener filter. |
0:14:52 | There are variations on that. One is that you plug in a trade-off parameter mu here, |
0:15:03 | and with this mu, by tuning it, you can trade off speech distortion against noise suppression. |
0:15:12 | For example, for a small mu you force the beamformer to have as little speech distortion as possible, and if you increase mu, this term is weighted down and the beamformer is pushed more towards suppressing the noise. So this is what you can control with mu. |
0:15:33 | First of all, if you just introduce the mu here, the beamforming coefficients don't change much; the one is simply replaced by mu here, and this is called the speech distortion weighted multichannel Wiener filter. |
0:15:49 | But you can look at the extreme cases as well, mu going to zero or to infinity. |
0:15:56 | If mu goes to zero, this term gets a very high weight. |
0:16:04 | So, for mu equal to one, that is what we had already. For mu going to zero, the speech distortion term gets a very high weight, so we want to make sure that there is no speech distortion. |
0:16:19 | If we let mu go to zero, the resulting beamformer is called the minimum variance distortionless response, MVDR, beamformer. |
0:16:26 | Its objective function is: minimize the noise at the beamformer output, while making sure that the speech is not distorted. |
0:16:39 | The other extreme case is mu going to infinity. Then we don't care about the speech distortion, but we would like to have the noise suppressed as much as possible at the beamformer output. |
0:16:51 | So we would like to maximize the signal-to-noise ratio at the beamformer output; for mu going to infinity, this is called the maximum SNR beamformer. |
0:17:01 | The beamforming coefficients for mu equal to zero you can read off here right away — the mu just disappears — and for mu going to infinity it is some scaling factor times the numerator here, the inverse noise covariance matrix times the acoustic transfer function vector. |
0:17:20 | So the different criteria can be visualised like this: we have this trade-off parameter mu, |
0:17:30 | and if we let mu go to zero, we make sure that the speech is preserved, so it is not distorted, but we might not have a lot of noise suppression; then we are at the MVDR case, minimum variance distortionless response. |
0:17:44 | And if we go to the other end, with a very high mu, we have the largest possible noise suppression, but the speech might sound distorted at the beamformer output. |
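The whole family just described can be written down compactly. A sketch in my own notation, assuming the narrowband model y = a·s + n with known Σ_NN, a, and source variance σ_s²:

```python
import numpy as np

def sdw_mwf(sigma_nn, a, sigma_s2, mu=1.0):
    """Speech distortion weighted multichannel Wiener filter for one bin.

    mu = 1:    multichannel Wiener filter (plain MMSE)
    mu -> 0:   MVDR beamformer (distortionless)
    mu -> inf: maximum SNR beamformer (up to a complex scalar)
    """
    num = np.linalg.solve(sigma_nn, a)            # Sigma_NN^{-1} a
    return sigma_s2 * num / (mu + sigma_s2 * (a.conj() @ num))

def mvdr(sigma_nn, a):
    num = np.linalg.solve(sigma_nn, a)
    return num / (a.conj() @ num)                 # the mu -> 0 limit

def max_snr_direction(sigma_nn, a):
    return np.linalg.solve(sigma_nn, a)           # mu -> inf, scaling arbitrary
```

All three share the same numerator Σ_NN⁻¹a and differ only in a scalar denominator, which is exactly the single-channel postfilter observation made next in the talk.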
0:18:02 | What is also interesting to see, and what we can see from the last slide already, is that these different criteria, like MVDR, MWF and maximum SNR, differ only in a complex scalar, which means in a single-channel filter at the output, called a postfilter. |
0:18:23 | If you look here, the numerator is always the same; if we change mu, we just change the scalar in the denominator. |
0:18:29 | So this is just a complex scalar; no multichannel processing is necessary to go from one beamforming objective function to the next. |
0:18:40 | So what we could do is design, here, a maximum SNR beamformer, and then use an appropriate single-channel filter, called a postfilter; with that we could turn this maximum SNR beamformer into an MVDR beamformer. From here to here, it is overall an MVDR beamformer. |
0:19:07 | So what I have said so far is the following: we should look at acoustic transfer functions, and not only at the steering vector with the delays, if we talk about reverberant environments — and reverberant environments are always present if we are in a room. |
0:19:23 | Outdoors we don't need to consider reverberation, but in a room we have to, and then acoustic transfer functions have to be used instead of just pure delays. |
0:19:35 | And the beamformer criteria differ only in a single-channel linear filter. |
0:19:40 | What I am going to look at now is that the acoustic transfer function vector a and this noise covariance matrix Sigma_NN are unknown, and possibly time-variant, so we need to estimate them. |
0:19:58 | The goal is to estimate them from the noisy speech signal at the microphones; that is what we consider now. |
0:20:08 | So, this parameter estimation here, which then delivers the beamformer coefficients for one of the criteria. |
0:20:16 | One method to determine this acoustic transfer function — there are other methods, for example one which exploits the nonstationarity of the speech signal — but the method that we have already been working on for quite some time is: we estimate this acoustic transfer function by eigenvalue decomposition. |
0:20:35 | That goes as follows. This was our signal model: the vector of microphone signals is the acoustic transfer function vector times the desired source signal — this one we call X — plus the noise. |
0:20:49 | If we compute the covariance matrix of Y, so the expectation of Y times Y Hermitian, then, if S and N are uncorrelated, which we can assume, the spatial covariance matrix is the sum of two parts: the speech-related part and the noise part. |
0:21:16 | And here it is clear that the covariance matrix can be written as a times a-Hermitian times the variance of the speech term, plus the covariance matrix of the noise. |
0:21:30 | So this is the spatial covariance matrix of the microphone signals. |
0:21:36 | And, for example, if you just look at the speech part here, it is easy to see that the principal eigenvector of this part here is just a times some scalar — depending on how you normalize a. |
0:21:54 | Because if you plug this into the eigenvalue equation — Sigma_XX times eigenvector equals some lambda times eigenvector — and use this as the eigenvector, you see that it really solves this equation. |
0:22:13 | Maybe I should write it down; it is really not difficult. |
0:22:20 | So: Sigma_XX times — let's call the eigenvector v — equals lambda times v. Now, Sigma_XX is a times a-Hermitian times the variance of the speech signal, applied to the eigenvector, and for the eigenvector I use some scalar zeta times a. |
0:22:44 | So you can write it as sigma_S squared times a times a-Hermitian times zeta times a, and then you see that a-Hermitian times a is a scalar, so altogether this is a scalar times a — this scalar would be the eigenvalue lambda — and here we have a again. |
0:23:06 | So indeed this solves the eigenvector equation. |
0:23:16 | So if we do an eigenvector decomposition of Sigma_XX, we can recover the acoustic transfer function; we can estimate the acoustic transfer function. That is what I wanted to say with this slide. |
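A sketch of this estimate: the principal eigenvector of the speech covariance matrix recovers a up to a complex scalar (the normalization is left open, as in the talk):

```python
import numpy as np

def atf_estimate(sigma_xx):
    """Acoustic transfer function estimate, up to a complex scalar zeta."""
    eigvals, eigvecs = np.linalg.eigh(sigma_xx)   # Hermitian EVD, ascending order
    return eigvecs[:, -1]                         # principal eigenvector ~ zeta * a
```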
0:23:26 | Or we could also look at a generalized eigenvalue problem, where we also take the Sigma_NN into account. |
0:23:34 | If you look at this eigenvalue problem, the principal eigenvector solving this generalized eigenvalue problem is, in principle, again a complex scalar times this one here, where we have the inverse of the noise covariance matrix times the acoustic transfer function vector. |
0:23:52 | So we can estimate this term by eigenvector decomposition. |
0:24:00 | Now to this slide. With the principal eigenvector of the generalized eigenvalue problem, this one here, we can realise right away the maximum SNR beamformer, because the principal eigenvector is, in principle, this one here. |
0:24:18 | And actually, if we have the right routine, it is not even necessary that Sigma_NN be invertible; we just need to solve the generalized eigenvalue problem, so in principle it is also possible if Sigma_NN is not invertible. |
0:24:31 | However, there is an arbitrary scaling factor here, because any scaling will again result in an eigenvector of that problem. |
0:24:41 | The other beamformers, like the MVDR beamformer, we can realise as well: then we do an eigenvector decomposition of this covariance matrix of the speech-related part of the microphone signals, because this gives us a, the acoustic transfer function vector, and with the denominator we obtain the corresponding MVDR beamforming filter. |
0:25:05 | So we can also realise an MVDR beamformer, but then we also need the inverse of Sigma_NN, whereas here it is not really necessary to compute the inverse explicitly. |
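The generalized eigenvalue variant can be sketched the same way; `scipy.linalg.eigh` solves the generalized Hermitian problem directly, so Σ_NN is never explicitly inverted:

```python
import numpy as np
from scipy.linalg import eigh

def gev_beamformer(sigma_xx, sigma_nn):
    """Maximum SNR (GEV) beamformer: principal generalized eigenvector of
    Sigma_XX w = lambda * Sigma_NN w. The scaling of w is arbitrary."""
    eigvals, eigvecs = eigh(sigma_xx, sigma_nn)   # eigenvalues in ascending order
    return eigvecs[:, -1]                         # eigenvector of largest eigenvalue
```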
0:25:18 | So, with eigenvector decomposition we can determine the acoustic transfer function, and with that the beamforming coefficients. |
0:25:31 | So we now know how to determine the acoustic transfer function, but what we still need are these covariance matrices of the speech-related part of the microphone signals and of the noise. |
0:25:49 | So we have solved one problem and gotten a new problem, because now we need to estimate Sigma_XX and Sigma_NN, the covariance matrices of the speech term and of the noise term of the microphone signals. |
0:26:03 | Now, there are a couple — well, many — procedures for how to do so, and basically what most of them do is a two-stage procedure. |
0:26:16 | That means they first determine, for each time-frequency point, whether it is dominated by the speech or by the noise. |
0:26:26 | That is called speech presence probability estimation; it is, so to say, a voice activity detector with a very high resolution, a time-frequency-point resolution. |
0:26:36 | We would like to determine for each time-frequency point: is it just a pure noise term, or is it dominated by speech? |
0:26:45 | If we have the speech presence probability map, or mask, then we can estimate these matrices from that, and that is the way I am going to deal with it in the following. |
0:26:58 | So this speech presence probability estimation, which should determine for each time-frequency point whether it is speech or noise, is basically something like this: |
0:27:08 | we have a noisy spectrogram, and what we would like to have is the mask, the identification of those time-frequency points which are dominated by speech; that looks something like this. |
0:27:26 | To do that, there have been a lot of techniques which are based on so-called a priori and a posteriori SNR estimation and local spectro-temporal smoothing. |
0:27:38 | I am not going to talk about that; those were the preferred methods several years ago. |
0:27:46 | Then we in Paderborn developed a method which I found was very elegant. |
0:27:54 | What we did is: we interpreted this as a two-dimensional hidden Markov model, with correlations, or transition probabilities, along the time axis and along the frequency axis, |
0:28:12 | and then we did inference in this two-dimensional hidden Markov model to determine the posterior — and the posterior was the speech presence probability. |
0:28:21 | But then, eventually, it turned out that a neural network did a much better job, and now I am finally at the other half of my talk title, "neural network supported". So what I will discuss now is how we can do speech presence probability estimation with a neural network. |
0:28:44 | So let's look at that. Here is the setup: a neural network is used for speech presence probability estimation. |
0:28:54 | We use it in the following way. We have the microphone signals, and we have one network for each channel; however, we tie the weights between the individual networks here. |
0:29:08 | The input to the neural network is the magnitude spectrum, and the network is supposed to predict an ideal mask — there is a slide on that later. |
0:29:20 | So it should predict, for each time-frequency point, whether it is dominated by speech or by noise. |
0:29:27 | We apply this to each channel separately, and then we somehow merge, or pool, the channels. |
0:29:33 | This can be done by averaging the outputs or by taking the median; the median turned out to be a bit more robust in the case that one of the channels was broken. |
0:29:43 | And the output of this one here is now — well, it could be the probability, for each time-frequency point, of being speech, and here of being noise. |
0:29:58 | So once we have these masks, or presence probabilities, we can compute the spatial covariance matrices of the speech and of the noise; this is illustrated here. |
0:30:12 | We estimate the spatial covariance matrix of the speech by this outer product; however, we take only those time-frequency points where our neural network said: this is really speech. |
0:30:26 | And for the noise estimation, we take only those time-frequency points where the network said: for this time-frequency point, it is really noise. |
0:30:34 | And with that we estimate these covariance matrices. |
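A sketch of this mask-weighted estimation; the array shapes and the normalization by the mask mass are my choices for illustration:

```python
import numpy as np

def masked_covariances(Y, speech_mask, noise_mask):
    """Estimate spatial covariance matrices from masked STFT data.

    Y: microphone STFTs, shape (T, F, N)
    speech_mask, noise_mask: weights in [0, 1], shape (T, F)
    Returns two (F, N, N) arrays: Sigma_XX and Sigma_NN per frequency.
    """
    def weighted_cov(mask):
        # sum of mask-weighted outer products y y^H over the frames
        cov = np.einsum('tf,tfn,tfm->fnm', mask, Y, Y.conj())
        return cov / np.maximum(mask.sum(axis=0), 1e-10)[:, None, None]

    return weighted_cov(speech_mask), weighted_cov(noise_mask)
```

The channel pooling mentioned above (averaging or taking the median of the per-channel masks) would happen before this step.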
0:30:38 | Once we have the covariance matrices, we plug them into one of these optimisation criteria — MVDR or maximum SNR — to get the beamforming vector w. |
0:30:50 | So that is basically the whole approach. |
0:30:54 | Yes, please? [Audience question] |
0:31:13 | Yes, you are right: the input here is Y, the noisy signal, and one could perhaps first subtract an estimate of the noise. We tried that, but we did not find an effect, or an improvement, from it, so we stuck with this one. But you are right. |
0:31:39 | [Audience question] No, for the mask estimation we do not use the phase. Basically we look at whether the mixture magnitude at each point is below some threshold, or above — something like that. |
0:31:55 | The phase, of course, is necessary for the beamforming coefficients; but for the mask estimation it is not used — the phase basically enters through the estimation of these covariance matrices. |
0:32:13 | Here is the network in more detail. We have the noisy speech signal at the input of the network, and at the output we would like to predict the speech mask and the noise mask. |
0:32:31 | So, for each time-frequency point: if it is dominated by speech, the output should be high here and low here. |
0:32:39 | The neural network is operated like a classifier: it is a classifier which has to predict one or zero, speech or noise, for each time-frequency point, and the objective function, because it is a classifier, is simply cross entropy. |
0:33:02 | This is one configuration which worked pretty well. Here we had four layers: the first one was a bidirectional LSTM layer, followed by three feed-forward layers. |
0:33:16 | At the input we had the magnitude spectrum for all frequencies, and the outputs are the speech mask and the noise mask. |
0:33:29 | These values here can be between zero and one; they do not need to be binary, and they also do not need to sum to one. So it could be that a time-frequency point is considered neither speech nor noise, because it is somewhere in between. |
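A sketch of such a network in PyTorch; the layer sizes and the particular framework are my assumptions — only the architecture (one BLSTM layer, three feed-forward layers, two independent sigmoid mask outputs trained with cross entropy) follows the description:

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    def __init__(self, n_freq=513, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(n_freq, hidden, batch_first=True,
                             bidirectional=True)
        self.ff = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * n_freq), nn.Sigmoid(),   # two masks
        )

    def forward(self, magnitude):            # magnitude: (batch, T, n_freq)
        h, _ = self.blstm(magnitude)
        masks = self.ff(h)                   # values in [0, 1] per bin
        speech_mask, noise_mask = masks.chunk(2, dim=-1)
        return speech_mask, noise_mask       # need not be binary or sum to one
```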
0:33:47 | So, what did we do here with this mask estimation? As you have seen, it is single channel: there is one neural network per channel, with tied weights, but we treat each channel separately. |
0:34:02 | So it is independent of the array configuration and of the number of microphones: we could train it with six-microphone data and use it with three-microphone data in the test, and it could be a linear array in training and a circular array in the test. That is possible. |
0:34:18 | You can see that as an advantage, but I would also say it is a disadvantage, because for the mask estimation we do not exploit spatial information — we look at just a single channel. |
0:34:30 | What is different from most of the parametric approaches, before the neural network era, is that at the input we have the whole DFT vector, so we treat all frequencies jointly, whereas usually in beamforming you treat each frequency separately; here we treat them jointly. |
0:34:56 | It is not immediately suitable for online processing, because we have the BLSTM layer; there we need the backward pass, so in a configuration like this it is currently an offline method. |
0:35:13 | So, here are some example speech and noise masks that have been estimated with this method on CHiME data. |
0:35:24 | You can see that it recovers pretty well the harmonic structure of speech here, and this is the noise mask, where we have high values, for example, here and in between. |
0:35:38 | And now I can play the input signal and the beamformer output signal for this one here. [Plays audio samples of the noisy input and the beamformer output.] |
0:36:11 | Of course I had to pick a very good example — no, they do not all work like that, and this one was a good one. |
0:36:26 | Here is a different view, or maybe a minor aspect of it. Here we compared the maximum SNR beamformer, here called the generalized eigenvalue beamformer — so the one which maximizes the signal-to-noise ratio — and the MVDR beamformer, which makes sure that there is no speech distortion. |
0:36:47 | What you see here is the SNR at the beamformer output for individual utterances of the CHiME challenge, and what you have here on this axis is the log of the condition number of this noise covariance matrix, Sigma_NN. |
0:37:04 | In the MVDR case we have to compute the inverse of Sigma_NN to determine the coefficients, because in the numerator there was the inverse noise covariance matrix times a. |
0:37:13 | And what this perhaps shows is: if the log condition number is high, which means the noise covariance matrix is ill-conditioned, then it seems to be the case that the generalized eigenvalue beamformer, which does the generalized eigenvalue decomposition, gives a bit higher SNRs at the output than the MVDR. |
0:37:41 | Maybe you can see it, but I do not want to make a strong point out of that. |
0:37:46 | In the MVDR we have to explicitly compute the inverse of Sigma_NN, and that may be problematic for some of the utterances, where there is just a bit of noise or just few observations. |
0:37:57 | So in our case, the maximum SNR criterion worked a bit better than the MVDR criterion, but people from NTT in Japan, who use a similar approach, found in their case that the MVDR worked better than the GEV. So maybe it is about the same. |
0:38:17 | One point I would like to make: you may have noted that if we take the maximum SNR beamformer, we do not care about speech distortions. |
0:38:28 | And indeed, if we do not take any care of that, the resulting signal sounds — I have an example on the next slide — it depends: if the noise is low-pass, |
0:38:41 | so if the noise is predominant at the low frequencies, then after beamforming the signal sounds a bit high-pass filtered, because the beamformer has suppressed the frequencies with high noise. |
0:38:52 | That means the speech signal has been distorted, because now it sounds high-pass. |
0:38:57 | If we do speech recognition, this really is not a big deal, because it can be learnt by the acoustic model if the input signal looks a bit different; we did not find a big difference whether we accounted for this distortion by a postfilter or not — the speech recognition results were about the same. |
0:39:17 | But if you want to do speech enhancement, to reconstruct the time-domain signal, then you should be careful to reduce these speech distortions. |
0:39:28 | And we developed a method to control these speech distortions; I can explain that in more detail later on. |
0:39:37 | Without going too much into detail: what we try to do is design a single-channel postfilter g, such that the combination of the acoustic transfer function vector and the beamformer gives a response of one in the desired direction. |
0:40:00 | This can actually be solved for g if we assume an anechoic transmission; so if we have no reverberation, we can compute g from that. That is of course an approximation — in reality we have reverberation — but in this way we could compute it, and we then had this single-channel postfilter which really removed most of these speech distortions. |
0:40:23 | And here I have an example, with no post-processing and with this normalization. |
0:40:36 | This is not such a good example as before. [Plays the noisy input sample: "... primarily on the basis of ..."] |
0:40:45 | Now, if we take the maximum SNR beamformer without taking any care of the speech distortions... [Plays the beamformer output.] You hear it: the speech signal sounds different. |
0:41:01 | And with this blind analytic normalisation, we can reduce this high-pass filtering effect. [Plays the normalized output.] |
0:41:13 | But of course this also comes at the expense of less SNR gain. |
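For reference, the blind analytic normalization gain is usually given in the following form (after Warsitz and Haeb-Umbach; a sketch, applied per frequency to the GEV weights w, not copied from the slide):

```python
import numpy as np

def blind_analytic_normalization(w, sigma_nn):
    """Single-channel BAN postfilter for a max-SNR/GEV beamformer.

    Rescales w so that the response approximates a distortionless
    (MVDR-like) response in the target direction.
    """
    N = len(w)
    num = np.sqrt(np.abs(w.conj() @ sigma_nn @ sigma_nn @ w) / N)
    den = np.abs(w.conj() @ sigma_nn @ w)
    return (num / den) * w
```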
0:41:26 | Now I have some results on the CHiME challenge data. |
0:41:31 | So, CHiME-3 and CHiME-4: there were recordings from four different environments — café, street, bus, and pedestrian area. |
0:41:42 | In CHiME-3 there was just the six-channel scenario; in CHiME-4 there were also a two-channel and a one-channel scenario. |
0:41:50 | And there were two kinds of recordings in these environments: there was simulated data, where they artificially added noise recorded in these environments, but there was also speech actually recorded in these environments. |
0:42:03 | The recording was like this: they had this tablet, with six microphones here at the frame, and the person held the tablet in front of him or her and spoke the sentences he was supposed to speak, in the bus or in the pedestrian area or wherever. So that was the scenario. |
0:42:26 | One should say that this is, in a sense, not the most difficult scenario, because there is only slow variation of the speaker position: you hold the tablet like that; it is not that the microphone is over there and you walk around the floor or whatever. So there is slow position variation. |
0:42:46 | And we have simulated recordings and real noisy recordings. |
0:42:50 | Here are some results concerned with speech enhancement, measured by the PESQ score, which is supposed to measure speech quality — though I do not know how well it really represents speech quality. |
0:43:04 | What this figure here shows — I have taken it from another publication, so there are some results I am not going to discuss here — is the PESQ score of the speech output after the beamformer. |
0:43:19 | This one here, "oracle", means: if we knew, by an oracle, which time-frequency bin is corrupted by noise and which one represents speech — so if we had the oracle speech mask and the oracle noise mask — this is the quality that we could achieve; the higher, the better. |
0:43:37 | This was the result with the estimation by the BLSTM network, which is almost as good as the oracle one. |
0:43:46 | These ones I am going to skip; these were other network configurations and other training scenarios. |
0:43:52 | And here are two results from parametric approaches; this one is also from my group, from a few years ago — that was the previously mentioned two-dimensional hidden Markov model — and you can see that the neural network supported mask estimation gave better speech quality than these parametric methods. |
0:44:16 | And now I have some speech recognition results on CHiME-3. |
0:44:22 | There were development sets and evaluation sets, and there were simulated scenarios, where the noise was artificially added, and real recordings in the noisy environments, where people really spoke in the bus and so on. |
0:44:37 | This was the baseline setup delivered by the organizers; these are the word error rates here. |
0:44:46 | This is our speech presence probability estimation method of a few years back; this is a method from NTT, also from a few years back. |
0:44:58 | This one here uses the BeamformIt beamformer — some of you may know it; it is a weighted delay-and-sum beamformer. |
0:45:12 | And here is the maximum SNR beamformer with a neural network used for mask estimation. You can see that it performed pretty well, also on the real recordings in noisy environments. |
0:45:35 | So... [Audience question] |
0:45:50 | Yes — exactly. That is the point I want to make now, because it is an important one. |
0:46:02 | So, what I have talked about so far: we use this neural network based speech presence probability estimation, or neural network based mask estimation, to identify the time-frequency bins from which we estimate the power spectral density matrices of speech and of noise, and from these matrices the beamformer coefficients can be estimated. Now, your point was: the neural network training requires stereo data. |
0:46:26 | We must have, separately, the clean signal at the microphones — the signal which came from the desired source S — and we must have, separately, the noise, so that we can mix them artificially and have the target for the neural network training, the ideal binary masks: the speech mask and the noise mask. |
0:46:51 | So we need this stereo data for training. |
0:46:54 | Furthermore, the mask definition is actually somewhat heuristic: what do you declare as the time-frequency points dominated by speech? We said that, in the speech-only case, we take those time-frequency points which amount to ninety-nine percent of the total power of the signal. |
0:47:15 | But it is debatable whether this is the best choice; so there is some heuristics in that. |
0:47:21 | So the question is now: can we overcome this strict requirement of stereo data? |
0:47:28 | And what we tried is to overcome this limitation by end-to-end training; that is what I would like to talk about next. |
0:47:39 | So now I am here at this block: beamforming and speech recognition. |
0:47:47 | Now, end-to-end training — that is a term used with multiple connotations. What I mean by it is the following, which is depicted here. |
0:47:55 | We have the whole processing chain, starting with the microphone signals. |
0:48:01 | Then we have a neural network for mask estimation; then the pooling, or condensing, of the masks to a single mask for speech and noise; then the speech and noise covariance estimation; the beamformer; and our postfilter to remove the speech distortions. So that is up to here. |
0:48:23 | Now comes the speech recognition: here is the filterbank — the mel filterbank — and the computation of the delta and delta-delta coefficients; then here is the acoustic model, the neural network for the acoustic model; and then we have the decoder. |
0:48:44 | What I mean by end-to-end training is that we would like to propagate the gradient from the cross entropy criterion of the acoustic model training backwards, all the way over these processing blocks, up to the neural network here for mask estimation. |
0:49:04 | If we can manage to do that, we do not need a target like an ideal speech mask for the training here; we can derive the gradient from the cross entropy criterion of the acoustic model training, and then we do not need stereo data anymore. So that is what we tried to do. |
0:49:31 | What we have to take care of is that in between, these computations are in the complex domain: the beamforming coefficients are complex-valued, the steering vectors are complex-valued, the covariance matrices are complex-valued. So we are here in the real-valued domain, then in between in the complex-valued domain, and here we are back in the real-valued domain again. |
0:49:56 | So we have to consider gradients with complex-valued arguments. |
0:50:03 | What I have denoted here: the cross entropy criterion of the acoustic model training is a function of the spatial covariance matrices, from which we compute the beamforming coefficients, and those are complex-valued. And eventually, what we want to train are the coefficients of the neural network for mask estimation, which are of course real-valued again. |
0:50:26 | What we did here — I am not going into detail on that; there is a report on it — is use the Wirtinger calculus to compute complex derivatives, because the cost function is not a holomorphic function. |
0:50:42 | This Wirtinger calculus is well known in adaptive filter theory; people who do adaptive filtering use it a lot, because there you often have complex-valued coefficients. |
0:50:55 | And with this, one can then compute these gradients. |
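For reference, the Wirtinger derivatives of a cost J with respect to a complex argument z = x + iy are the standard definitions (not specific to this talk):

$$\frac{\partial J}{\partial z} = \frac{1}{2}\left(\frac{\partial J}{\partial x} - i\,\frac{\partial J}{\partial y}\right), \qquad
\frac{\partial J}{\partial z^{*}} = \frac{1}{2}\left(\frac{\partial J}{\partial x} + i\,\frac{\partial J}{\partial y}\right),$$

and for a real-valued cost the steepest-descent update follows the conjugate derivative, $\Delta z \propto -\,\partial J / \partial z^{*}$.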
---|
0:51:01 | and the actually the |
---|
0:51:02 | crucial step was |
---|
0:51:05 | uh we have this makes a mess in our beamformer whose coefficients are determined by |
---|
0:51:09 | eigenvalue decomposition |
---|
0:51:11 | so we have to compute the derivative of the principal eigenvector |
---|
0:51:16 | of this |
---|
0:51:17 | generalized eigenvalue problem with respect to the |
---|
0:51:20 | psd mattresses will come out of the neural network mask estimator |
---|
0:51:24 | and uh for that's uh we have a reports |
---|
0:51:28 | which uh |
---|
0:51:30 | and i also have it with me |
---|
0:51:31 | where you can look up how this is done because there is a quite |
---|
0:51:36 | longer along but they are vacation |
---|
0:51:41 | so now |
---|
0:51:43 | for the chime challenge of this |
---|
0:51:45 | really worked and here are |
---|
0:51:47 | some results |
---|
0:51:49 | let's see whether we can make a |
---|
0:51:51 | and sends out of it |
---|
0:51:53 | so |
---|
0:51:54 | first here so here in this |
---|
0:51:57 | uh again the |
---|
0:51:58 | baseline results so to say we have the |
---|
0:52:02 | as a beamformer used this beamforming in a sum beamformer by example you and we |
---|
0:52:07 | have actually |
---|
0:52:08 | and uh |
---|
0:52:10 | did it separate acoustic model training |
---|
0:52:13 | so that was set the baseline word error rates |
---|
0:52:18 | here we have the system web if |
---|
0:52:20 | trained separately the new |
---|
0:52:22 | no i want our beamformer using a new network with ideal binary mask just targets |
---|
0:52:27 | so as we did before |
---|
0:52:29 | and separately training of the acoustic model neural network for the acoustic model and these |
---|
0:52:34 | are the results |
---|
0:52:36 | then here we try to do it as on my last slide so we would |
---|
0:52:41 | like to jointly train both |
---|
0:52:43 | acoustic uh both networks the one for not mask estimation the one for the acoustic |
---|
0:52:48 | model |
---|
0:52:49 | and we started both from |
---|
0:52:52 | from random initialization |
---|
0:52:54 | and you can see that this leads to a somewhat uh worse |
---|
0:52:59 | what error rate so what error rate increased |
---|
0:53:03 | and the interesting result actually is no but next one |
---|
0:53:07 | here we pre-trained the acoustic model |
---|
0:53:11 | for a menu a network for acoustic model |
---|
0:53:14 | but then the new network for the beamformer mask estimation |
---|
0:53:18 | was trained |
---|
0:53:19 | by back propagating the |
---|
0:53:22 | gradient from the acoustic model to the new network for mask estimation so it was |
---|
0:53:26 | randomly initialised and then train all the way back |
---|
0:53:29 | and with that we are
---|
0:53:30 | even a little bit better than with the separate training, so this here shows
---|
0:53:35 | that at least for
---|
0:53:36 | this CHiME challenge it is possible
---|
0:53:39 | to dispense with the need for stereo data, and you can achieve the
---|
0:53:44 | same or a little bit better results also by just training on the noisy data
---|
0:53:49 | and the lower one is here, where we also pre-trained
---|
0:53:53 | the network for mask estimation with the ideal binary mask targets and then just
---|
0:53:58 | later on backpropagated the gradient from the acoustic model to this first neural
---|
0:54:03 | network to fine-tune it, if you like; that was a bit better still
---|
0:54:07 | oh, here, for this
---|
0:54:09 | the matched data, that's the only data
---|
0:54:11 | that one requires
---|
0:54:13 | but i would like to emphasise "for this data", because we then tried to
---|
0:54:17 | achieve the same
---|
0:54:18 | on the AMI corpus, and so far we have not been successful
---|
0:54:22 | so it's not that easy, but CHiME was perhaps a very nice corpus in this respect
---|
0:54:32 | so that was |
---|
0:54:35 | basically |
---|
0:54:36 | the story about beamforming for noise reduction |
---|
0:54:40 | now i have a few slides
---|
0:54:42 | where we also have some multichannel processing, but not for noise reduction but
---|
0:54:47 | for two other tasks
---|
0:54:52 | so the first one is |
---|
0:54:54 | the speech recognition of reverberated speech |
---|
0:54:58 | and what we did here is |
---|
0:55:02 | use the same |
---|
0:55:03 | setup, so we have multichannel data and we use a neural network for mask
---|
0:55:08 | estimation |
---|
0:55:09 | but now
---|
0:55:10 | our distortion is no longer noise
---|
0:55:14 | but reverberation
---|
0:55:15 | so we |
---|
0:55:17 | in the case |
---|
0:55:18 | that you know the impulse responses |
---|
0:55:21 | yeah |
---|
0:55:22 | of the training data, what you can then do is
---|
0:55:24 | you can |
---|
0:55:25 | by yourself |
---|
0:55:27 | determine the ideal |
---|
0:55:29 | speech mask and the ideal mask for the |
---|
0:55:32 | for the distortion |
---|
0:55:33 | and
---|
0:55:34 | for the targets we take the dry signal, the non-reverberant
---|
0:55:39 | data, and convolve it with the early part of the room impulse response
---|
0:55:44 | so the first fifty milliseconds
---|
0:55:47 | and for the interference
---|
0:55:48 | which was the noise in the earlier case but now is the reverberation: to form
---|
0:55:51 | the interference we convolve the dry signal with the late part of the room impulse
---|
0:55:56 | response, so after fifty milliseconds, the tail part
---|
0:56:00 | and with that we can derive |
---|
0:56:02 | also ideal binary masks for the target and for the interference
---|
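A minimal sketch of this target construction (not the speaker's code; the 50 ms split point is from the talk, while the STFT settings and the per-bin dominance criterion are assumptions):

```python
import numpy as np
from scipy.signal import fftconvolve, stft

def ibm_targets(dry, rir, fs=16000, split_ms=50, nperseg=512):
    """Ideal binary masks for dereverberation training.

    Target: dry signal convolved with the early RIR part (first 50 ms).
    Interference: dry signal convolved with the (delay-preserving) late RIR tail.
    """
    split = int(fs * split_ms / 1000)
    early = fftconvolve(dry, rir[:split])
    late_rir = np.concatenate([np.zeros(split), rir[split:]])  # keep the tail's delay
    late = fftconvolve(dry, late_rir)

    _, _, X = stft(early, fs=fs, nperseg=nperseg)  # target spectrogram
    _, _, N = stft(late, fs=fs, nperseg=nperseg)   # interference spectrogram
    L = min(X.shape[1], N.shape[1])                # align frame counts

    speech_mask = (np.abs(X[:, :L]) > np.abs(N[:, :L])).astype(np.float32)
    return speech_mask, 1.0 - speech_mask          # speech mask, distortion mask
```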
0:56:07 | and then the rest remains the same
---|
0:56:09 | yeah, we can then again compute the masks, from those the covariance matrices, and
---|
0:56:13 | from those the beamforming weights
---|
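And a compact sketch of that unchanged remainder of the pipeline (hypothetical function; Y is the multichannel STFT with shape (channels, frequencies, frames), and the masks come either from the network or from the ideal targets above):

```python
import numpy as np
from scipy.linalg import eigh

def gev_weights(Y, speech_mask, noise_mask, eps=1e-10):
    """Mask-weighted PSD matrices and max-SNR (GEV) weights per frequency bin."""
    D, F, T = Y.shape
    W = np.zeros((F, D), dtype=complex)
    for f in range(F):
        Yf = Y[:, f, :]  # (D, T) snapshots at frequency f
        Phi_xx = (speech_mask[f] * Yf) @ Yf.conj().T / (speech_mask[f].sum() + eps)
        Phi_nn = (noise_mask[f] * Yf) @ Yf.conj().T / (noise_mask[f].sum() + eps)
        # principal generalized eigenvector of Phi_xx w = lambda * Phi_nn w
        _, vecs = eigh(Phi_xx, Phi_nn + eps * np.eye(D))
        W[f] = vecs[:, -1]  # eigh returns eigenvalues in ascending order
    return W  # beamformer output: X_hat[f, t] = W[f].conj() @ Y[:, f, t]
```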
0:56:16 | and that we tested on the
---|
0:56:18 | dataset of the REVERB challenge, where WSJCAM0 data are convolved
---|
0:56:22 | with measured room impulse responses
---|
0:56:25 | and there are again simulated test data and also real recordings in a reverberant
---|
0:56:32 | environment |
---|
0:56:33 | and here are some results |
---|
0:56:35 | for that
---|
0:56:37 | here in the real
---|
0:56:38 | recordings there was a distinction between
---|
0:56:42 | uh, near, which means the distance between the microphones and the speaker was about, i
---|
0:56:47 | don't know, one meter or fifty centimetres, and far, which was about two meters
---|
0:56:52 | but you can see the difference in the word error rate is not very large |
---|
0:56:56 | and for the
---|
0:56:58 | GMM-HMM baseline we have these results here; the
---|
0:57:02 | baseline results were single-channel, there was no multichannel baseline
---|
0:57:07 | and then with the
---|
0:57:09 | method i just explained, where we take the late reverberation
---|
0:57:14 | part for the
---|
0:57:16 | distortion masks
---|
0:57:18 | and use the same setup as before, we obtain these
---|
0:57:22 | word error rates on these two parts of the dataset
---|
0:57:26 | and with a better acoustic model it can be further improved
---|
0:57:31 | so it also worked in this case, to suppress reverberation
---|
0:57:37 | so that was one example of another application; here is my final and last example
---|
0:57:43 | where now this
---|
0:57:45 | neural network based mask estimation is used
---|
0:57:49 | for noise tracking in single-channel speech enhancement
---|
0:57:56 | here's a |
---|
0:57:57 | typical setup of, say, traditional single-channel speech enhancement
---|
0:58:04 | yeah, we have the
---|
0:58:05 | noisy speech signal at the input; we are already in the STFT domain here
---|
0:58:10 | and then we manipulate only the magnitude, really; the phase is usually left unchanged
---|
0:58:17 | and then we compute a gain function
---|
0:58:20 | a time-varying gain function, with which we multiply the microphone signal to suppress the noise
---|
0:58:25 | and this time-varying gain function
---|
0:58:28 | is computed from the so-called a priori SNR
---|
0:58:32 | and to compute this we need the noise power spectral density
---|
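In formulas (the classical decision-directed scheme, given here as background; the talk does not commit to a specific gain rule, and the Wiener gain below is one common choice): with \(\gamma\) the a posteriori SNR and \(\hat{\xi}\) the a priori SNR estimate,

```latex
\gamma(k,\ell) = \frac{|Y(k,\ell)|^{2}}{\sigma_{N}^{2}(k,\ell)},
\qquad
\hat{\xi}(k,\ell)
  = \alpha\,\frac{|\hat{S}(k,\ell-1)|^{2}}{\sigma_{N}^{2}(k,\ell-1)}
  + (1-\alpha)\,\max\bigl(\gamma(k,\ell)-1,\,0\bigr), \\
G(k,\ell) = \frac{\hat{\xi}(k,\ell)}{1+\hat{\xi}(k,\ell)},
\qquad
\hat{S}(k,\ell) = G(k,\ell)\,Y(k,\ell).
```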
0:58:36 | and this noise power spectral density is now estimated with the neural network
---|
0:58:41 | that's
---|
0:58:42 | where we made the change
---|
0:58:45 | so, noise tracking by a neural network: we did it with a
---|
0:58:48 | similar, or the same, methodology as before
---|
0:58:53 | as in mask-based speech enhancement
---|
0:58:56 | we estimate a noise spectral mask which indicates for each time-frequency bin whether it's
---|
0:59:02 | dominated by noise or not, and if it is dominated by noise we can update
---|
0:59:06 | the noise estimate for this a priori SNR estimator, and if it's
---|
0:59:10 | dominated by speech we just hold the old estimate
---|
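A minimal sketch of this update-or-hold noise tracker (hypothetical interface; the smoothing constant, the threshold and the initialisation are assumptions, only the gating logic is from the talk):

```python
import numpy as np

def track_noise_psd(Y_mag2, noise_presence, alpha_n=0.8, threshold=0.5):
    """Noise PSD tracking gated by a neural-network noise mask.

    Y_mag2: |Y(k, l)|^2 with shape (F, T); noise_presence: NN output in [0, 1].
    """
    F, T = Y_mag2.shape
    sigma_n2 = np.empty((F, T))
    sigma_n2[:, 0] = Y_mag2[:, 0]  # crude initialisation from the first frame
    for l in range(1, T):
        update = noise_presence[:, l] > threshold  # noise-dominated bins
        # recursive averaging where noise dominates ...
        sigma_n2[update, l] = (alpha_n * sigma_n2[update, l - 1]
                               + (1.0 - alpha_n) * Y_mag2[update, l])
        # ... and hold the previous estimate where speech dominates
        sigma_n2[~update, l] = sigma_n2[~update, l - 1]
    return sigma_n2  # feeds the a priori SNR / gain computation above
```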
0:59:15 | and so only this part
---|
0:59:17 | is changed with respect to an otherwise traditional speech enhancement system
---|
0:59:23 | uh, here
---|
0:59:24 | are also some examples: this is the noisy spectrogram
---|
0:59:28 | this is the ideal binary mask for the noise, so black and white
---|
0:59:33 | and here is the, um
---|
0:59:36 | noise presence prob-
---|
0:59:37 | ability estimated by this neural network, and this is one of the methods to compare it
---|
0:59:43 | with
---|
0:59:44 | and uh
---|
0:59:46 | it looks a little bit similar to what we had before, but we
---|
0:59:50 | can, with this mask estimation, estimate the noise
---|
0:59:54 | for the
---|
0:59:56 | signal
---|
0:59:58 | and here are some results
---|
1:00:00 | where we have on the left-hand
---|
1:00:02 | side the performance of the noise estimator
---|
1:00:05 | and on the right-hand side the performance of the speech enhancement system |
---|
1:00:09 | and so we tried really a lot of the state-of-the-art noise estimation methods, as
---|
1:00:15 | you can see here |
---|
1:00:17 | and what we have plotted here are
---|
1:00:20 | two error measures for the noise estimate, the log-error
---|
1:00:24 | variance
---|
1:00:26 | versus the log-error mean
---|
1:00:28 | and the variance should be small and the mean should be small, so ideally
---|
1:00:32 | the best method should be here in the lower left corner
---|
1:00:36 | and this is actually the DNN-based noise estimator, this one here, and these are
---|
1:00:40 | all the methods |
---|
1:00:42 | and here we have the speech enhancement performance |
---|
1:00:46 | the output snr |
---|
1:00:48 | versus the speech quality measured by this perceptual quality measure
---|
1:00:53 | and yeah the upper right corner is the best |
---|
1:00:56 | and again |
---|
1:00:57 | this neural network based noise mask estimator worked pretty well
---|
1:01:05 | so those
---|
1:01:07 | were
---|
1:01:08 | our
---|
1:01:10 | applications using a neural network for speech and/or noise mask estimation, and
---|
1:01:16 | i think it's a pretty powerful and versatile
---|
1:01:19 | tool
---|
1:01:20 | and at least for the CHiME challenge, i should say
---|
1:01:23 | the requirement of stereo data can be overcome by end-to-end training
---|
1:01:27 | but i think there is still a lot
---|
1:01:30 | to be done
---|
1:01:31 | first of all it's not online
---|
1:01:34 | or in most of the cases the presented results were not online
---|
1:01:38 | and one would like to have an online system with low latency |
---|
1:01:42 | then |
---|
1:01:43 | uh i think matters change if we have a moving |
---|
1:01:46 | speaker |
---|
1:01:47 | here it was stationary, with the tablet held by the person speaking
---|
1:01:52 | and of course
---|
1:01:54 | it's much more difficult if there is also overlapping speech, which we
---|
1:01:59 | didn't consider here
---|
1:02:03 | so that's it, and that was our list
---|
1:02:05 | of references
---|
1:02:08 | thank you |
---|
1:02:42 | i think that's no problem, that's easy to implement
---|
1:02:46 | but um
---|
1:02:49 | whereas earlier we would have said
---|
1:02:51 | the max-SNR is better than the MVDR, now, i would say
---|
1:02:55 | it's about the same, and so it doesn't matter whether we take mu equal to
---|
1:02:58 | one or to zero point one or whatever; so i think it would not improve
---|
1:03:03 | matters but also not degrade them, but that's
---|
1:03:05 | that's my feeling
---|
1:03:24 | i think we did not listen to it, because we didn't go back to the time domain
---|
1:03:27 | you know, we were staying in the short-time frequency domain; but that's a good
---|
1:03:31 | point, we should
---|
1:03:33 | listen to it
---|
1:03:39 | yeah, those
---|
1:03:42 | yeah
---|
1:03:43 | the spectrograms i have seen as short-time spectrograms, but that was not for the
---|
1:03:47 | end-to-end training; but i presume that it looks similar because
---|
1:03:52 | the results were not that different between the two
---|
1:04:16 | i think at the moment it is mainly the overlapping speech, and there are also
---|
1:04:20 | very short utterances which are too short for our covariance estimation
---|
1:04:35 | i think at six or i don't |
---|
1:04:53 | oh yes, there was one more question, yeah
---|
1:05:12 | you know, the neural network, i had this on some of the slides, i
---|
1:05:16 | had some figures; so it's not as large as a neural network for an acoustic
---|
1:05:21 | model, by far not, but it is still significantly larger than a parametric noise tracker
---|
1:05:27 | yeah, that's for sure
---|
1:05:30 | more detail i cannot give
---|
1:06:04 | basically i think the
---|
1:06:07 | motivation for doing it in that domain is
---|
1:06:11 | yeah, in the days of the parametric approaches we always needed the magnitude domain when
---|
1:06:15 | we were
---|
1:06:16 | doing speech enhancement
---|
1:06:19 | and i think we tried the log domain, but no, i don't have
---|
1:06:24 | a comparison
---|
1:06:25 | a comparison for that
---|
1:06:30 | yeah |
---|
1:06:32 | yeah |
---|
1:06:35 | ah, i know what you mean
---|
1:06:39 | i cannot tell you much about it; i think we tried it and
---|
1:06:42 | then we stuck with this, but i don't know
---|
1:07:17 | i think the
---|
1:07:18 | the difference here is that we have a multichannel signal
---|
1:07:22 | and for the beamforming we exploit the phase
---|
1:07:25 | so i think for multichannel data, doing it
---|
1:07:30 | with an explicit beamforming, i think, is a good idea, you know
---|
1:07:36 | yeah yes yeah |
---|
1:07:38 | yeah |
---|
1:07:40 | yeah |
---|
1:07:44 | yeah, no, as far as the last application is concerned, i also think there
---|
1:07:48 | are other solutions which are at least as good as this one; but it nicely fitted
---|
1:07:52 | you know, into my story, yeah
---|
1:08:04 | yeah |
---|
1:08:10 | or |
---|
1:08:12 | we tried with just
---|
1:08:14 | feed-forward only
---|
1:08:16 | that was in the results i skipped
---|
1:08:20 | there it was a
---|
1:08:22 | just feed-forward network without a recurrent layer
---|
1:08:26 | and it was a bit worse, but not too much
---|
1:08:34 | so i think the online and latency thing is not the issue, but if the
---|
1:08:38 | speaker moves a lot
---|
1:08:40 | i think you have to do also something on the
---|
1:08:43 | on the test data, and not rely solely on the
---|
1:08:48 | trained system with the mask estimation
---|
1:08:52 | and you have to also do some
---|
1:08:54 | tracking or whatever on the test set
---|
1:08:59 | i think this is the larger issue |
---|
1:09:09 | yes one |
---|
1:09:35 | note that |
---|
1:09:47 | no it is like |
---|
1:09:49 | the example i had discussed before in detail, with the noise
---|
1:09:53 | suppression
---|
1:09:55 | and on this slide i just showed how i obtain the targets
---|
1:10:00 | for the neural network mask estimator
---|
1:10:05 | and in the noise suppression it was the speech signal, or the
---|
1:10:10 | speech presence
---|
1:10:11 | that was the target for the
---|
1:10:14 | um
---|
1:10:15 | for the speech mask estimator, and the time-frequency bins where there was just
---|
1:10:20 | noise were the target for the noise estimator
---|
1:10:22 | and then we use these covariance matrices in the beamformer objective function
---|
1:10:27 | and here we use the same beamformer objective function
---|
1:10:31 | also with these covariance matrices, sigma x and sigma n
---|
1:10:35 | the question is what to
---|
1:10:36 | use as sigma x and sigma n
---|
1:10:38 | what do we consider as signal x: as x we consider the early part of
---|
1:10:41 | the signal
---|
1:10:43 | and as signal n we consider the later part of the signal
---|
1:10:46 | so to estimate sigma n
---|
1:10:48 | we say, uh, we have the inputs
---|
1:10:52 | so we have the signal which is not reverberated
---|
1:10:54 | and we convolve it with the late part
---|
1:10:57 | of the impulse response, and this gives us the distortion
---|
1:11:01 | and then we use this beamformer framework to remove the distortion |
---|
1:11:05 | it's a bit difficult to call it beamforming, actually, but
---|
1:11:10 | it is |
---|
1:11:18 | yeah |
---|
1:11:19 | yes |
---|
1:11:23 | sixty four |
---|