0:00:09 Thank you very much, and thank you for coming to my talk. When I was asked, there were some problems with the scheduling, and I was told the talk could be anything between half an hour and one and a half hours, so I wonder myself how long it will take; let's see. You're welcome to ask questions in between, and I guess the first question that you have I will answer right away: where is Paderborn? So, Paderborn is here, in the German state of North Rhine-Westphalia, in the east of that state. Maybe the closest town that you might know is Dortmund; you may know the football team. Dortmund is about one hundred kilometres west of Paderborn. So this is Paderborn.
0:00:57 Okay, so I'm going to talk about beamforming. This is our group in Paderborn, and before I start I would like to say that this is of course joint work of the whole group; in particular I would like to mention Jahn Heymann, Lukas Drude and Aleksej Chinaev.
0:01:22 So here's what I'm going to talk about today. What you see here is, so to say, the scenario: we have an enclosure with a speaker and maybe some distorting acoustic events, and then there is a microphone array and a beamformer processing the signals. After the beamformer we might have an automatic speech recognition unit, and the adaptation, or the computation of the beamforming coefficients, is controlled by a parameter estimation device. I'm going to first talk a bit about spatial filtering objective functions, so this part here. This part, I think, is rather basic, but because you are, let's say, more computer science people than electrical engineering people, I will spend a bit more time on it. Then we discuss how we can actually estimate the beamforming coefficients, so that pertains to this block, parameter estimation. Eventually I will look at the combination of beamforming and speech recognition, and finally, if time allows, I could also spend a few words, or a few slides, on other applications of beamforming beyond noise reduction.
0:02:45 So let's start with the first block, spatial filtering objective functions. This is, so to say, the elementary setup: assume we have just a monofrequent signal, like this complex exponential, and let us first model what we receive at the microphones. Here's a closer look: we have N elements, the N microphones, and this monofrequent signal, the complex exponential, impinges on the microphone array from an angle theta; and d is the inter-element distance between two microphones.
0:03:26 Now, formulating this mathematically, the beamformer output signal is the weighted sum of the microphone signals; the weights are the w's here. What we actually have at the microphones in this simple setup is the complex exponential of the source signal, however delayed by a certain amount of time which depends on the microphone considered. If you look here, for example, and take this first microphone as the reference, then the signal will arrive at the second microphone after some delay tau one, and so on. And you can see that the delay tau depends on the angle of arrival theta, so you can actually express these taus by this expression here.
0:04:18 So this is the beamformer output signal, and using vector notation we have the source signal, the complex exponential, and then the weight vector containing the beamformer coefficients w_0 up to w_{N-1}; and the vector c comprises all these delay terms, all these complex exponential terms. This term here is called the steering vector, and it has these elements: the first microphone signal is not delayed, the second one is delayed by tau_1, and so on. So this is the description of this scenario.
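(Side note: the delay and steering-vector model just described can be sketched in a few lines of Python; the geometry and frequency values below are illustrative, not taken from the slides.)

```python
import numpy as np

# Narrowband model for a uniform linear array; all parameter values are made up.
c = 343.0                     # speed of sound in m/s
N = 6                         # number of microphones
d = 0.05                      # inter-element distance in m
f = 1000.0                    # frequency of the monofrequent source in Hz
theta = np.deg2rad(30)        # direction of arrival

# Delays tau_n relative to the first (reference) microphone
tau = np.arange(N) * d * np.sin(theta) / c

# Steering vector: relative phase shifts of the complex exponential at each mic
steering = np.exp(-2j * np.pi * f * tau)

# Delay-and-sum beamformer steered towards theta
w = steering / N

# Beamformer output for a unit-amplitude plane wave from direction theta
y = np.conj(w) @ steering     # equals 1 when steered correctly
print(abs(y))
```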
0:04:56 To get a feeling for how spatial filtering works in this scenario, I have some beam patterns here. What I have on the left-hand side are the beamformer coefficients, which, I assume, have been chosen to steer the beamformer towards a certain direction, or angle, theta zero. What is plotted here is the so-called beam pattern, which is the inner product between the beamforming vector and the steering vector from the last slide. What you see here is the response of the beamformer, this beam pattern, as a function of the angle theta, and you see that if theta is equal to theta zero, we have a high sensitivity, and for other directions the gain, so to say, is small, and can even be zero.
0:06:00 Now, I have four beam patterns here. The first one, on the left-hand side, corresponds to the situation where the signal arrives at so-called broadside; with broadside I mean the signal comes from this direction here, that is called broadside, whereas this direction would be called endfire. So this is the broadside direction; here the desired signal is in the endfire direction, so it looks like that. So: broadside and endfire. The lower two beam patterns indicate that the sensitivity, or the spatial selectivity, of the beamformer depends very much on the geometry. Here the ratio between the distance between two microphones and the wavelength is very small, so it is a small aperture; then the sensitivity is almost omnidirectional, as you can see. Here the ratio between the inter-element distance of the microphones and the wavelength is very large; the distance is even larger than the wavelength. If the spacing is more than half a wavelength, we get spatial aliasing, just as you know it from temporal aliasing; that is the reason why you have these grating lobes here, they are caused by spatial aliasing. So here we have a low inter-element distance and here a high inter-element distance of the microphones, relative to the wavelength. So we can do spatial filtering with this setup.
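(Side note: these beam patterns can be reproduced qualitatively by evaluating |w^H c(theta)| over a grid of angles; the parameter values below are my own, chosen only to show the small-spacing versus large-spacing behaviour, including the grating lobes.)

```python
import numpy as np

# Beam pattern of a delay-and-sum beamformer steered to theta0; toy values only.
c = 343.0
N = 6
f = 2000.0
lam = c / f
theta0 = np.deg2rad(0)                     # steering direction (broadside)

def steering_vector(theta, d):
    tau = np.arange(N) * d * np.sin(theta) / c
    return np.exp(-2j * np.pi * f * tau)

for d in (0.1 * lam, 1.5 * lam):           # small vs. large spacing relative to lambda
    w = steering_vector(theta0, d) / N
    thetas = np.deg2rad(np.linspace(-90, 90, 181))
    pattern = [abs(np.conj(w) @ steering_vector(t, d)) for t in thetas]
    # For d well above lambda/2 the 'pattern' shows grating lobes (spatial aliasing).
```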
0:07:48 Now I go a step further, to real environments. What I mean by that is the following. First, we have a speech signal, we want to work with speech, so we have a wideband signal: we don't have a single monofrequent sine wave or complex exponential, but a signal with a bandwidth of eight or sixteen kilohertz or whatever. What we then do is go to the short-time Fourier transform domain, and for each frequency we can then look at it as a narrowband beamforming problem again. Then we have interferences, distorting sources like noise, which we would like to suppress, so we need an appropriate objective function from which we derive the beamforming coefficients, such that the desired signal is enhanced and the others are suppressed.
0:08:44 Then we have reverberation, which means: if we are in an enclosure like this lecture hall, we have signal propagation via the direct path, but also via reflections and multiple reflections and so on, and this is called reverberation. It is modeled by an acoustic impulse response, or acoustic transfer function, from the source to the microphones. And finally, this acoustic transfer function is unknown and even time-variant: if I move, or something else in the lecture hall moves, the impulse response or transfer function changes, so it can be time-variant. It is unknown and needs to be estimated. So that's what we are going to consider now, and we do this by data-dependent, statistically optimum beamforming.
0:09:51 So we first formulate the model in the short-time Fourier transform domain. We go from the time domain to the short-time Fourier transform domain, which means we take chunks of the signal and compute the DFT on each of them, then we move the chunk a bit forward and take a new DFT, and so on. Here the two parameters are the time frame index and the frequency bin index. So y is now the vector of microphone signals, y_0 up to y_{N-1}, and, with some assumptions, this can be modeled as the product of the source signal s, which we are interested in, and the acoustic transfer function vector from the source to each of the microphones, plus a noise term. I should at least mention that this model is already an approximation, because it assumes that the impulse response is shorter than the analysis window of the DFT; but I will use this model for the whole talk. The beamformer output signal is then the beamforming coefficients times this input signal, and in the following I will leave out the arguments t and f. So that is the filter output.
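(Side note: the narrowband model y = a s + n for one STFT bin can be sketched numerically as follows; all dimensions, statistics and the matched-filter weights are illustrative choices, not the speaker's.)

```python
import numpy as np

# Toy STFT-domain model for one frequency bin: y = a * s + n.
rng = np.random.default_rng(0)
N, T = 6, 500                                      # microphones, time frames
a = rng.normal(size=N) + 1j * rng.normal(size=N)   # acoustic transfer function vector
s = rng.normal(size=T) + 1j * rng.normal(size=T)   # source signal in this bin
noise = 0.3 * (rng.normal(size=(N, T)) + 1j * rng.normal(size=(N, T)))
y = np.outer(a, s) + noise                         # (N, T) observed microphone STFTs

# A beamformer produces one output per frame: z = w^H y
w = a / (np.conj(a) @ a)                           # simple matched filter, for illustration
z = np.conj(w) @ y
```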
0:11:25 Now, how do we determine these beamforming coefficients w in a statistically optimum way? The first criterion that one would probably come up with is the MSE criterion, the minimum mean squared error criterion. We would like to determine the beamforming coefficients such that the mean squared error between the beamformer output and some desired signal is as small as possible; that is what you know from other optimisation tasks as well. And what is the desired signal? One could of course use as desired signal the source signal s, which we would like to enhance. That would mean that this variable I introduced here is one, so the desired signal is equal to the source signal s, and then the beamformer has the task of both beamforming and dereverberation, because what we would like to restore with this desired signal is the source signal at the position of the source, not at the microphone; so it should also dereverberate, that is, suppress the effect of the sound propagation from the source to the microphones. Or one could use an alternative criterion, where the desired signal is the image of the source signal, x, at the microphone, and then we want to do beamforming only, where beamforming means suppression of noise from other directions.
0:13:04 So let's now solve this problem. We have here the mean squared error: this is the beamformer output, the beamforming coefficients times the microphone signal, and this is the desired signal. If we just plug in our definition of y that we had, and that is no big deal, then we can rewrite it in this way, where sigma_s squared is the power, or variance, of the source signal, and this Sigma_NN here is the covariance matrix of the noise at the microphones; we have N elements, so we have an N times N covariance matrix of the noise. You can see that the mean squared error consists of two terms: there is a speech distortion term and a noise term. Speech distortion is the deviation of the beamformer output from the desired output, and this is the contribution of the noise, which is independent of the desired speech signal. Once we formulate it in this way, it is really not difficult to carry out this minimization, and this is the result: these are the optimal beamforming coefficients which minimize the mean squared error. Sigma_NN was the noise covariance matrix, and a was the acoustic transfer function vector from the source to the individual microphones. This one is called the multichannel Wiener filter, MWF.
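(Side note: the multichannel Wiener filter under the rank-one speech model of the previous slides can be written in a few lines; this is a generic sketch, not the speaker's implementation.)

```python
import numpy as np

# Multichannel Wiener filter for the rank-one model Sigma_x = sigma_s^2 * a a^H:
# w_MWF = (sigma_s^2 * a a^H + Sigma_N)^{-1} a * sigma_s^2
def mwf(a, sigma_s2, Sigma_N):
    Sigma_y = sigma_s2 * np.outer(a, np.conj(a)) + Sigma_N
    return sigma_s2 * np.linalg.solve(Sigma_y, a)

# Beamformer output for a frame y would then be z = w^H y:
# z = np.conj(mwf(a, sigma_s2, Sigma_N)) @ y
```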
0:14:52 There are variations upon that. One is that you plug in a trade-off parameter mu here, and by tuning mu you can trade off speech distortion against noise suppression. For example, for a small mu you force the beamformer to have as little speech distortion as possible, and if you increase mu, this term is weighted down and the beamformer is forced more towards suppressing the noise. So this is what you can control with mu. First of all, if you just introduce the mu here, the beamforming coefficients don't change much; the one is simply replaced by mu here, and this is called the speech distortion weighted multichannel Wiener filter. But you can look at the extreme cases as well: mu going to zero or to infinity. If mu goes to zero, this term gets a very high weight. So, mu equal to one is what we had already; with mu going to zero, the speech distortion term gets a very high weight, so we want to make sure that there is no speech distortion. If we let mu go to zero, the resulting beamformer is called the minimum variance distortionless response beamformer, the MVDR beamformer.
0:16:26 So there the objective function is to minimize the noise at the beamformer output while making sure that the speech is not distorted. The other extreme case is mu going to infinity: then we don't care about the speech distortion, but we would like to have the noise suppressed as much as possible at the beamformer output, so we would like to maximize the signal-to-noise ratio of the beamformer output. If mu goes to infinity, this is called the maximum SNR beamformer. As for the beamforming coefficients: for mu equal to zero you can read them off here right away, this mu just disappears, and for mu going to infinity it is some scaling factor times the numerator here, the inverse noise covariance matrix times the acoustic transfer function vector. So the different criteria can be visualised like this: we have this trade-off parameter mu, and if we let mu go to zero, we make sure that the speech is preserved, so it is not distorted, but we might not get a lot of noise suppression; then we are at the extreme case of the minimum variance distortionless response. And if we go to the other end, with a very high mu, we have the largest possible noise suppression, but the speech might sound distorted at the beamformer output.
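(Side note: using the matrix inversion lemma, the whole trade-off family collapses to a common numerator and a mu-dependent scalar, which is exactly the observation made on the next slide; a generic sketch with my own variable names follows.)

```python
import numpy as np

# Speech-distortion-weighted MWF family, via the matrix inversion lemma:
# w(mu) = sigma_s^2 * Sigma_N^{-1} a / (mu + sigma_s^2 * a^H Sigma_N^{-1} a)
# mu = 1: MWF, mu -> 0: MVDR, mu -> infinity: scaled max-SNR direction Sigma_N^{-1} a.
def sdw_mwf(a, sigma_s2, Sigma_N, mu=1.0):
    sn_inv_a = np.linalg.solve(Sigma_N, a)          # common numerator Sigma_N^{-1} a
    return sigma_s2 * sn_inv_a / (mu + sigma_s2 * (np.conj(a) @ sn_inv_a))

# MVDR as the mu -> 0 limit:
def mvdr(a, Sigma_N):
    sn_inv_a = np.linalg.solve(Sigma_N, a)
    return sn_inv_a / (np.conj(a) @ sn_inv_a)
```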
0:18:02 What is also interesting to see, and what we could already see on the last slide, is that these different criteria, like MVDR, MWF and maximum SNR, differ only in a complex scalar, which means in a single-channel filter at the output, often called a postfilter. If you look here, the numerator is always the same; if we change mu, we just change the scalar in the denominator. So it is just a complex scalar; no multichannel processing is necessary to go from one beamforming objective function to the next. What we could do, then, is design a maximum SNR beamformer here, and then use an appropriate single-channel filter, called a postfilter, and with that we could turn this maximum SNR beamformer into an MVDR beamformer; so from here to here it is, overall, an MVDR beamformer.
0:19:07 So what I have said so far is the following. We should look at acoustic transfer functions, and not only at the steering vector with the pure delays, if we talk about reverberant environments; and reverberant environments are always present if we are in a room. Outdoors we don't need to consider reverberation, but in a room we have to, and then acoustic transfer functions have to be used instead of just pure delays. And the beamformer criteria differ only in a single-channel linear filter. What I am going to look at now is that the acoustic transfer function vector a and this noise covariance matrix Sigma_NN are unknown, and possibly time-variant, so we need to estimate them; and the goal is to estimate them from the noisy speech signal at the microphones. That's what we consider now.
0:20:08 So, this parameter estimation here, which then delivers the beamformer coefficients for one of the criteria. There are several methods to determine this acoustic transfer function, for example one which exploits the nonstationarity of the speech signal, but the method we have been working on for quite some time is to estimate the acoustic transfer function by eigenvalue decomposition. That goes as follows.
0:20:37 This was our signal model: the vector of microphone signals is the acoustic transfer function vector times the desired source signal, this part we call x, plus the noise. If we compute the covariance matrix of y, so the expectation of y times y Hermitian, then, if s and n are uncorrelated, which we can assume, we get the spatial covariance matrix as the sum of two parts: the speech-related part and the noise part. And it is clear that the speech-related part can be written as a a-Hermitian times the variance of the speech term, plus the covariance matrix of the noise. So this is the spatial covariance matrix of the microphone signals.
0:21:35 And, for example, if you just look at this part here, it is easy to see that the principal eigenvector of this part is just a times some scalar, depending on how you normalize it. Because if you plug this into the eigenvalue equation, Sigma_x times eigenvector equals some lambda times eigenvector, and use a as the eigenvector, you see that this really solves the equation. Maybe I should write it down; it is really not difficult. So, Sigma_x times the eigenvector, let's call it v, equals lambda times v. Now Sigma_x is a a-Hermitian times the variance of the speech signal, and for the eigenvector we use some scalar c times a. Then a-Hermitian times a is a scalar, so altogether we get a scalar times a; this scalar here would be the eigenvalue lambda times c, and indeed the eigenvector equation is satisfied. So if we do an eigenvector decomposition of Sigma_x, we can recover the acoustic transfer function, we can estimate the acoustic transfer function. That's what I wanted to say with this slide.
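(Side note: a quick numerical check of this argument, with made-up values: the principal eigenvector of sigma_s^2 * a a^H is indeed a, up to a complex scale factor.)

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
a = rng.normal(size=N) + 1j * rng.normal(size=N)   # unknown acoustic transfer function
Sigma_x = 2.0 * np.outer(a, np.conj(a))            # sigma_s^2 * a a^H, rank one

eigvals, eigvecs = np.linalg.eigh(Sigma_x)         # Hermitian eigendecomposition
principal = eigvecs[:, np.argmax(eigvals)]

# 'principal' equals a up to a complex scale factor:
scale = principal[0] / a[0]
print(np.allclose(principal, scale * a))           # True
```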
0:23:26 Or we could look at a generalized eigenvalue problem, where we also bring in the Sigma_NN. If you look at this generalized eigenvalue problem, the principal eigenvector solving it is, in principle, again a complex scalar times this term here, where we have the inverse of the noise covariance matrix times the acoustic transfer function vector. So we can estimate this term by an eigenvector decomposition.
0:24:00 Now, to this slide: with the principal eigenvector of the generalized eigenvalue problem, this one here, we can realize the maximum SNR beamformer right away, because the principal eigenvector is, in principle, exactly this term here. And actually, if we have the right routine, it is not even necessary that Sigma_NN is invertible; we just need to solve the generalized eigenvalue problem, so in principle it also works if Sigma_NN is not invertible. However, there remains an arbitrary scaling factor, because any scaling still results in an eigenvector of that problem. The other beamformers, like the MVDR beamformer, we can realize as well if we do an eigenvector decomposition of this covariance matrix of the speech-related part of the microphone signals, because that gives us, up to a scalar, the acoustic transfer function vector a, and together with the denominator this corresponds to the MVDR beamforming filter. So we can also realize an MVDR beamformer, but then we also need the inverse of Sigma_NN, whereas here it is not really necessary to compute the inverse explicitly. So, with an eigenvector decomposition we can determine the acoustic transfer function and, with that, the beamforming coefficients.
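(Side note: in code, both variants can be obtained from standard eigensolvers; a hedged sketch, assuming Hermitian covariance estimates with a positive-definite noise covariance, which scipy's routine does require even though the talk notes that other routines can relax this.)

```python
import numpy as np
from scipy.linalg import eigh

# Max-SNR (GEV) beamformer: principal generalized eigenvector of (Sigma_x, Sigma_N).
def gev_beamformer(Sigma_x, Sigma_N):
    eigvals, eigvecs = eigh(Sigma_x, Sigma_N)     # solves Sigma_x v = lambda Sigma_N v
    return eigvecs[:, np.argmax(eigvals)]         # arbitrary complex scale remains

# MVDR beamformer using the ATF estimate from the principal eigenvector of Sigma_x.
def mvdr_beamformer(Sigma_x, Sigma_N):
    eigvals, eigvecs = np.linalg.eigh(Sigma_x)
    a_hat = eigvecs[:, np.argmax(eigvals)]        # ATF up to a scalar
    num = np.linalg.solve(Sigma_N, a_hat)
    return num / (np.conj(a_hat) @ num)
```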
0:25:31 So what did we do? We now know how to determine the acoustic transfer function, but what we still need are these covariance matrices, of the speech-related part of the microphone signal and of the noise. So we have solved one problem and got a new one, because now we need to estimate Sigma_x and Sigma_NN, the covariance matrices of the speech term and the noise term of the microphone signals. There are a couple, or rather many, procedures for how to do so, and basically most of them do a two-stage procedure: they first determine, for each time-frequency point, whether it is dominated by speech or dominated by noise. That is called speech presence probability estimation, so it is, so to say, a voice activity detector with a very high resolution, a time-frequency point resolution. We would like to determine, for each time-frequency point, whether it contains just noise, pure noise, or whether it is dominated by speech. Once we have this speech presence probability map, or mask, we can estimate these matrices from it, and that is the way I am going to deal with it in the following.
0:26:58 So this speech presence probability estimation, which should determine for each time-frequency point whether it is speech or noise, is basically something like this: we have a noisy spectrogram, and what we would like to have is the mask, or the identification of those time-frequency points which are dominated by speech; that looks something like this.
0:27:26 To do that, there have been a lot of techniques which are based on so-called a priori and a posteriori SNR estimation and local spectro-temporal smoothing. I am not going to talk about those; they were the preferred methods several years ago. Then we, in Paderborn, developed a method which I found very elegant: we interpreted this as a two-dimensional hidden Markov model, with correlations, or transition probabilities, along the time axis and along the frequency axis, and then we did inference in this two-dimensional hidden Markov model to determine the posterior; and the posterior was the speech presence probability. But eventually it turned out that a neural network did a much better job.
0:28:27 And now I am finally at the other half of my talk's title, 'neural network supported': what I discuss now is how we can do this speech presence probability estimation with a neural network. So much for those earlier methods. Here is the setup: a neural network is used for speech presence probability estimation, and we use it in the following way. We have the microphone signals, and we have a network for each channel; however, we tie the weights between the individual networks. The input to the neural network is the magnitude spectrum, and the network is supposed to predict an ideal mask, more on that on a later slide; so it should predict, for each time-frequency point, whether it is dominated by speech or dominated by noise.
0:29:27 We apply this to each channel separately, and then we somehow merge, or pool, the channels. This can be done by averaging the outputs or by taking the median; the median turned out to be a bit more robust in the case that one of the channels was broken. The output of this block here is then, for each time-frequency point, the probability of being speech, and here, of being noise.
0:29:58 So once we have these masks, or presence probabilities, we can compute the spatial covariance matrices of the speech and of the noise; this is illustrated here. We estimate the spatial covariance matrix of the speech by this outer product, however we take only those time-frequency points where our neural network has said: this is really speech. And for the noise estimation, we take only those time-frequency points where the network has said: for this time-frequency point, it is really noise. With that we estimate these covariance matrices, and once we have them, we plug them into the optimisation function, MVDR or maximum SNR, to get the beamforming vector w. So that is basically the whole story already.
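(Side note: the mask-weighted covariance estimation might look roughly like this per frequency bin; the normalization and the way the masks are applied are my assumptions, and the real system additionally pools the masks over channels.)

```python
import numpy as np

# Mask-based spatial covariance estimation for one frequency bin.
# Y: (N_mics, T) complex STFT values of the bin; mask: (T,) values in [0, 1].
def masked_psd(Y, mask):
    weighted = Y * mask[np.newaxis, :]              # emphasize the selected frames
    return (weighted @ np.conj(Y).T) / np.maximum(mask.sum(), 1e-10)

# Sigma_x from the speech mask, Sigma_N from the noise mask, then a beamformer
# (e.g. the GEV sketch above) gives the weights for this bin:
# Sigma_x = masked_psd(Y, speech_mask); Sigma_N = masked_psd(Y, noise_mask)
# w = gev_beamformer(Sigma_x, Sigma_N); z = np.conj(w) @ Y
```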
0:30:54yeah please
0:31:13 Yeah, you're right, this is computed on y, the noisy signal, and one could perhaps separately subtract the noise part. We tried that, but we didn't find an effect or an improvement from it, so we stick with this one. But you're right.
0:31:39 No, for the mask estimation we don't use the phase; basically we look at whether the magnitude at each time-frequency point is below or above some threshold, something like that. The phase is of course necessary for the beamforming coefficients, but it enters only through the estimation of these covariance matrices, not through the mask estimation.
0:32:13 Here is the network in more detail. We have the noisy speech signal at the input of the network, and at the output we would like to predict the speech mask and the noise mask; so for each time-frequency point, if it is currently dominated by speech, the value should be high here and low here. What the neural network does is: it is operated like a classifier; it is a classifier which has to predict one or zero, speech or noise, for each time-frequency point, and the objective function, because it is a classifier, is simply cross entropy. This is one configuration which worked pretty well: we had four layers, the first was a bidirectional LSTM layer, followed by three feed-forward layers. At the input we had the magnitude spectrum for all frequencies, and the outputs are the speech mask and the noise mask. These values can be between zero and one, they don't need to be binary, and they also don't need to sum to one; so it could be that a time-frequency point is considered neither speech nor noise, because it was somewhere in between.
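(Side note: a rough PyTorch sketch of such a mask estimator; the talk specifies one BLSTM layer, three feed-forward layers, outputs between zero and one, and a cross-entropy objective, while the layer sizes and activations below are my own assumptions.)

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    def __init__(self, num_bins=513, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(num_bins, hidden, bidirectional=True, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * num_bins), nn.Sigmoid(),   # speech mask | noise mask
        )

    def forward(self, magnitude):            # magnitude: (batch, frames, num_bins)
        h, _ = self.blstm(magnitude)
        masks = self.ff(h)
        speech_mask, noise_mask = masks.chunk(2, dim=-1)
        return speech_mask, noise_mask

# Training targets are ideal binary masks, e.g.:
# loss = F.binary_cross_entropy(speech_mask, ibm_speech) + \
#        F.binary_cross_entropy(noise_mask, ibm_noise)
```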
0:33:47 So, what did we do here with this mask estimation? As you have seen, it is single-channel: there is one neural network per channel, with tied weights, but we treat each channel separately. So it is independent of the array configuration and the number of microphones: we could train it with six-microphone data and use it with three-microphone data in the test, or have a linear array in training and a circular array in the test; that is possible. One can see that as an advantage, but I would also say it is a disadvantage, because for the mask estimation we don't exploit spatial information, since we look at just a single channel. What is different from most of the parametric approaches that came before the neural network is that at the input we have the whole DFT vector, so we treat all frequencies jointly, whereas usually in beamforming you treat each frequency separately; here we treat them jointly.
0:34:56 It is not immediately suitable for online processing, because we have the BLSTM layer, and there we need the backward pass as well; so in a configuration like this it is currently an offline system.
0:35:13 Here are some example speech and noise masks that have been estimated with this method from CHiME data, and you can see that it recovers the harmonic structure of speech pretty well. This is the noise mask, where we have high values here, for example, and in between the harmonics. Now I can play the input signal and the beamformer output signal for this one here, if I am able to do that. So, the input, and the output. [plays audio] Of course I have to pick a very good example; they do not all sound like that, this one was a good one.
0:36:26 Here is a different look, or maybe a more minor aspect of it. Here we compared the maximum SNR beamformer, here called the generalized eigenvalue beamformer, so the one which maximizes the signal-to-noise ratio, and the MVDR beamformer, which makes sure that there is no speech distortion. What you see here is the SNR at the beamformer output for individual utterances of the CHiME challenge, and what you have on the other axis is the log of the condition number of this noise covariance matrix, Sigma_NN. In the MVDR case we have to compute the inverse of Sigma_NN to determine the coefficients, because in the numerator there was the inverse noise covariance matrix times a. What this perhaps shows is that if the log condition number is high, which means the noise covariance matrix is ill-conditioned, then it seems to be the case that the generalized eigenvalue beamformer, which does the generalized eigenvalue decomposition, gives slightly higher SNRs at the output than the MVDR. Maybe you can see that, but I don't want to make a strong point out of it. In the MVDR we have to explicitly compute the inverse of Sigma_NN, and that may be problematic for some of the utterances where there is just a little noise or just few observations. So in our case the maximum SNR criterion worked a bit better than the MVDR criterion, but people from NTT in Japan did a similar approach, and in their case the MVDR worked a bit better than the GEV; so maybe it's about the same.
0:38:17 One point I would like to make: you may wonder, if we take the maximum SNR beamformer, we don't care about speech distortions. And indeed, if we don't take any care of that, the resulting signal sounds, and I have an example on the next slide, a bit odd depending on where the noise sits: if the noise is predominant at the low frequencies, then after beamforming the signal sounds a bit high-pass filtered, because the beamformer has suppressed the frequencies with high noise more. That means the speech signal has been distorted, because now it sounds high-pass filtered. If we do speech recognition, that is really not a big deal, because it can be learnt by the acoustic model if the input signal looks a bit different; so we didn't find a big difference whether we accounted for this distortion by a postfilter or not, the speech recognition results were about the same. But if you want to do speech enhancement, that is, reconstruct the time-domain signal, then you should be careful to reduce these speech distortions. We developed a method to control these speech distortions; I can explain it in more detail later on.
0:39:39 Without going into too much detail: what we try to do is design a single-channel postfilter such that the combination of the beamformer and the acoustic transfer function vector gives a response of one in the desired direction. This can actually be solved for g if we assume an anechoic transmission; so if we have no reverberation, we can compute g from that. That is of course an approximation, in reality we have reverberation, but in this way we could compute it, and we then have a single-channel postfilter which largely removes these speech distortions.
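(Side note: the single-channel gain he refers to, called blind analytic normalization a little later in the talk, is computed from the beamformer weights and the noise covariance matrix; the exact expression below is how I recall it from the related literature and should be treated as an assumption, not as read off the slides.)

```python
import numpy as np

# Blind analytic normalization (BAN) style postfilter: a single-channel gain
# applied after the max-SNR beamformer to undo its high-pass-like distortion.
# Formula as recalled from the GEV beamforming literature; treat as an assumption.
def ban_gain(w, Sigma_N):
    N = len(w)
    num = np.sqrt(np.real(np.conj(w) @ Sigma_N @ Sigma_N @ w) / N)
    den = np.real(np.conj(w) @ Sigma_N @ w)
    return num / np.maximum(den, 1e-10)

# Applied per frequency bin: z_post(f) = ban_gain(w_f, Sigma_N_f) * (w_f^H y_f)
```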
0:40:23 And here I have an example, with no post-processing and with this normalization. This is not as good an example as the one before. So, the input: [plays audio: "... primarily on the basis of ..."]. Now, the maximum SNR beamformer without taking any care of the speech distortions: [plays audio]. You hear that the speech signal sounds different. And with this blind analytic normalisation we can reduce the high-pass filtering effect: [plays audio: "... primarily on the basis of ..."]. But of course this also comes at the expense of a smaller SNR gain.
0:41:26 Now I have some results on the CHiME challenge data. In CHiME-3 and CHiME-4 there were recordings from four different environments: café, street, bus, and pedestrian area. In CHiME-3 there was just the six-channel scenario; in CHiME-4 there were also a two-channel and a one-channel scenario. There were two kinds of recordings in these environments: simulated data, where noise recorded in these environments was artificially added, and speech really recorded in these environments. The recording was done like this: they had this tablet, with six microphones here at the frame, and the person held the tablet in front of himself and then spoke the sentences he was supposed to speak, in the bus or in the pedestrian area or wherever; so that was the scenario. One should say that this is, in a sense, not the most difficult scenario, because there is only slow variation of the speaker position: you hold the tablet like that, it is not the case that the microphones stay somewhere and you walk around the room; so there is only slow position variation. And we have both simulated and real noisy recordings.
0:42:50 Here are some results concerning speech enhancement, measured by the PESQ score, which is supposed to measure speech quality, although I don't know how well it really represents speech quality. What this figure shows (I have taken it from another publication, so there are some results I am not going to discuss here) is the PESQ score of the speech output after the beamformer. This bar, 'oracle', means: if we knew, by an oracle, which time-frequency bin is corrupted by noise and which one represents speech, so if we had the oracle speech mask and the oracle noise mask, this is the quality that you could achieve; the higher the better. This was the result with the mask estimation by the BLSTM network, which is almost as good as the oracle one. These ones I am going to skip; they were other network configurations and other training scenarios. And here are two results from parametric approaches; this one is also from my group, from a few years ago, which was this two-dimensional hidden Markov model from a previous slide. You can see that the neural network supported mask estimation gave better speech quality than these parametric methods.
0:44:37this was uh scenario the setup delivered by the by the organizers the baseline scenario
0:44:44these are the word error rates here
0:44:46this is our
0:44:48uh speech presence probability estimation or method of a few years back this is a
0:44:54uh method from n t also few years back
0:44:58this year uses the beamformer it beamformer
0:45:02some of you may know it's a delay and sum beamformer we have a
0:45:12and uh here is
0:45:14the uh
0:45:15make some snr beamformer with a new network
0:45:18uh for mask estimation to be used
0:45:20you can see that it
0:45:22uh performed pretty well also on the real of recordings
0:45:28real recordings in noisy environments
0:45:35 So, yeah. Yes; right, exactly, that's a point I want to make now, because that's an important one.
0:46:02 So, what I have talked about so far: we use this neural network based speech presence probability estimation, or neural network based mask estimation, to identify the time-frequency bins from which we estimate the power spectral density matrices of speech and of noise, and from these matrices the beamformer coefficients can be estimated. Now, the point you raised is: the neural network training requires stereo data. We have to have, separately, the clean signal at the microphones, so the signal which came from the desired source s, and separately the noise, so that we have the target for the neural network training, the ideal binary masks, the speech mask and the noise mask. So we need this stereo data for training. Furthermore, the mask definition is actually somewhat heuristic: what do you declare as time-frequency points dominated by speech? We said that, in the speech-only case, we take those time-frequency points which account for ninety-nine percent of the total power of the signal; but it is debatable whether this is the best choice, so there are some heuristics in it. The question is now: can we overcome this strict requirement of stereo data? And we tried to overcome this limitation by end-to-end training; that is what I would like to talk about next.
0:47:39 So now I am at this part: beamforming and speech recognition. End-to-end training is a term that is used with many connotations; what I mean by it is the following, which is depicted here. We have the whole processing chain, starting with the microphone signals; then a neural network for mask estimation, then the pooling, or median computation, condensing the masks to a single mask for speech and one for noise; then the speech and noise covariance estimation, the beamformer, and our postfilter to remove the speech distortions. So that is up to here. Now comes the speech recognition: here is the filterbank, the mel filterbank, and the computation of the delta and delta-delta coefficients; then here is the neural network for the acoustic model, and then we have the decoder. What I mean by end-to-end training is that we would like to propagate the gradient from the cross entropy criterion of the acoustic model training backwards, all the way through these processing blocks, up to the neural network for mask estimation. If we manage to do that, we don't need a target like an ideal speech mask for the training here; instead we can derive the gradient from the cross entropy criterion of the acoustic model training, and then we don't need stereo data anymore. So that's what we tried to do.
0:49:31 What we have to take care of is that in between these computations we are in the complex domain: the beamforming coefficients are complex-valued vectors, and the covariance matrices are complex-valued. So here we are in the real-valued domain, in between we are in the complex-valued domain, and here we are back in the real-valued domain again. So we have to consider gradients with complex-valued arguments. What I have denoted here: the cross entropy criterion of the acoustic model training is a function of the spatial covariance matrices, from which we compute the beamforming coefficients, and these are complex-valued; and eventually, what we want to train are the coefficients of the neural network for mask estimation, which are of course real-valued again. What we did, and I am not going to go into detail, there is a report on it, is to use the Wirtinger calculus to compute complex derivatives, because the cost function is not a holomorphic function. This Wirtinger calculus is well known in adaptive filter theory; people who do adaptive filtering use it a lot, because there you often have complex-valued coefficients. With this, one can compute these gradients. The crucial step was: we have this maximum SNR beamformer whose coefficients are determined by eigenvalue decomposition, so we have to compute the derivative of the principal eigenvector of this generalized eigenvalue problem with respect to the PSD matrices that come out of the neural network mask estimator. For that we have a report, which I also have with me, where you can look up how this is done, because it is quite a long derivation.
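(Side note: the Wirtinger derivatives he alludes to are the standard textbook definitions below; they are not taken from the slides.)

```latex
% Wirtinger derivatives for z = x + iy (standard definitions):
\frac{\partial}{\partial z}   = \frac{1}{2}\left(\frac{\partial}{\partial x} - i\,\frac{\partial}{\partial y}\right),
\qquad
\frac{\partial}{\partial z^*} = \frac{1}{2}\left(\frac{\partial}{\partial x} + i\,\frac{\partial}{\partial y}\right).
% For a real-valued cost J(z), the steepest-descent update is usually written in
% terms of the conjugate derivative \partial J / \partial z^*.
```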
0:51:41 So now, for the CHiME challenge this really worked, and here are some results; let's see whether we can make sense of them. First, here are, so to say, the baseline results: as the beamformer we used the BeamformIt delay-and-sum beamformer, and we did a separate acoustic model training; that gave these baseline word error rates.
0:52:20trained separately the new
0:52:22no i want our beamformer using a new network with ideal binary mask just targets
0:52:27so as we did before
0:52:29and separately training of the acoustic model neural network for the acoustic model and these
0:52:34are the results
0:52:36then here we try to do it as on my last slide so we would
0:52:41like to jointly train both
0:52:43acoustic uh both networks the one for not mask estimation the one for the acoustic
0:52:48model
0:52:49and we started both from
0:52:52from random initialization
0:52:54and you can see that this leads to a somewhat uh worse
0:52:59what error rate so what error rate increased
0:53:03and the interesting result actually is no but next one
0:53:07here we pre-trained the acoustic model
0:53:11for a menu a network for acoustic model
0:53:14but then the new network for the beamformer mask estimation
0:53:18was trained
0:53:19by back propagating the
0:53:22gradient from the acoustic model to the new network for mask estimation so it was
0:53:26randomly initialised and then train all the way back
0:53:29and b that we are
0:53:30even a little bit better than in the separate training so this year shows
0:53:35that at least four
0:53:36this time challenge is possible
0:53:39to uh this minutes of the need for stereo data and you can achieve the
0:53:44same or little bit better results also just training of the noisy data
0:53:49and the lower one he's here where we also pre-trained
0:53:53the acoustic model for mask estimation with the ideal binary mask target and then just
0:53:58later on july propagate the gradient from uh the acoustic model to this first new
0:54:03network four point you like if that it better
0:54:07oh here for this
0:54:09a much data that's the only data
0:54:11i don't wanna required
0:54:13but i would like to emphasise for this data because we can to try to
0:54:17achieve the same
0:54:18on the ami corpus and so far we have a have not been successful yeah
0:54:22so it's not that easy but i'm was perhaps a very nice corpus this respect
0:54:32 So that was basically the story about beamforming for noise reduction. Now I have a few slides where we also do multichannel processing, but not for noise reduction; for two other tasks.
0:54:52 The first one is speech recognition of reverberated speech. What we did here is use the same setup: we have multichannel data and we use a neural network for mask estimation, but now our distortion is no longer noise but reverberation. So, in the case that you know the impulse responses of the training data, what you can do is determine yourself the ideal speech mask and the ideal mask for the distortion. For the target we take the dry signal, the non-reverberant data, and convolve it with the early part of the room impulse response, so the first fifty milliseconds. And for the interference, which was the noise in the earlier case but is now the reverberation, we convolve the dry signal with the late part of the room impulse response, so everything after fifty milliseconds, the tail. With that we can derive ideal binary masks for the target and for the interference, and then the rest remains the same: we can again compute the masks, from those the covariance matrices, and from those the beamforming weights.
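(Side note: constructing these training targets by splitting the room impulse response at roughly fifty milliseconds could look like this; the signal lengths and the toy impulse response are purely illustrative.)

```python
import numpy as np

def split_rir(rir, fs, early_ms=50):
    """Split a room impulse response into early and late parts at early_ms."""
    n_early = int(fs * early_ms / 1000)
    early = np.zeros_like(rir); early[:n_early] = rir[:n_early]
    late = np.zeros_like(rir);  late[n_early:] = rir[n_early:]
    return early, late

fs = 16000
dry = np.random.randn(fs)                                   # stand-in for a dry utterance
rir = np.random.randn(fs // 2) * np.exp(-np.arange(fs // 2) / 2000.0)  # toy RIR

early, late = split_rir(rir, fs)
target = np.convolve(dry, early)       # "desired" signal: direct path + early reflections
interference = np.convolve(dry, late)  # "distortion": late reverberation
# Ideal masks are then derived per STFT bin from |target| vs. |interference|,
# analogous to the speech/noise masks in the noise-reduction case.
```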
0:56:16 We tested that on the data of the REVERB challenge, where the clean data are convolved with measured room impulse responses; there are again simulated test data and also real recordings in a reverberant environment. Here are some results. In the real recordings there was a distinction between 'near', which means the distance between the microphones and the speaker was, I don't know, about fifty centimetres to one metre, and 'far', which was about two metres; but you can see the difference in the word error rates is not very large. With the GMM-HMM baseline we have these results here; the baseline results were single-channel, there was no multichannel baseline. Then, with the method I just explained, taking the late reverberation part for the distortion mask and using the same setup as before, we obtain these word error rates on these two parts of the dataset, and with a better acoustic model it can be further improved. So it also worked in this case, to suppress reverberation.
0:57:37 So that was one example of another application, and here is my final example: now this neural network based mask estimation is used for noise tracking for single-channel speech enhancement. Here is a typical setup of, let's say, traditional single-channel speech enhancement: we have the noisy speech signal at the input, we are already in the STFT domain here, and then we manipulate only the magnitude; the phase is usually left unchanged. We compute a time-varying gain function with which we multiply the microphone signal to suppress the noise, and this time-varying gain function is computed from the so-called a priori SNR. To compute that, we need the noise power spectral density, and this noise power spectral density is now estimated with the neural network; that is really the only change. So, noise tracking by a neural network: we did it with a similar, or rather the same, methodology as before, as in mask-based speech enhancement. We estimate a spectral noise mask which indicates, for each time-frequency bin, whether it is dominated by noise or not; if it is dominated by noise, we can update the noise estimate for this a priori SNR estimator, and if it is dominated by speech, we just hold the old estimate. So only this part is changed with respect to otherwise traditional speech enhancement.
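(Side note: a minimal sketch of this mask-controlled noise tracker and gain computation; the smoothing constant, the Wiener-type gain rule and the flooring are common textbook choices, not necessarily the speaker's.)

```python
import numpy as np

def enhance(noisy_stft, noise_mask, alpha=0.9, gain_floor=0.1):
    """noisy_stft: (frames, bins) complex STFT; noise_mask: same shape, values in [0, 1]."""
    noise_psd = np.abs(noisy_stft[0]) ** 2            # crude initialisation from frame 0
    enhanced = np.empty_like(noisy_stft)
    for t, (y, m) in enumerate(zip(noisy_stft, noise_mask)):
        # Update the noise PSD only where the mask says "noise"; hold it otherwise.
        update = alpha * noise_psd + (1 - alpha) * np.abs(y) ** 2
        noise_psd = np.where(m > 0.5, update, noise_psd)
        # A priori SNR via a simple ML estimate, then a Wiener-type gain.
        snr_prio = np.maximum(np.abs(y) ** 2 / np.maximum(noise_psd, 1e-10) - 1, 0)
        gain = np.maximum(snr_prio / (1 + snr_prio), gain_floor)
        enhanced[t] = gain * y                          # phase is left unchanged
    return enhanced
```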
0:59:23 Here are also some examples: this is the noisy spectrogram, this is the ideal binary mask for the noise, and here is the noise presence probability estimated by this neural network; this is another method to compare it with. It looks a little bit similar to what we had before: with this mask estimation we can estimate the noise for the signal.
0:59:58 And here are some results, where we have on the left-hand side the performance of the noise estimator and on the right-hand side the performance of the speech enhancement system. We compared against really a lot of state-of-the-art noise estimation methods, as you can see here. What we have plotted are two error measures for the noise estimate, the log-error variance versus the log-error mean; both should be small, so ideally the best method sits in the lower left corner, and that is actually the DNN-based noise estimator, this one here; these are all the other methods. And here we have the speech enhancement performance: the output SNR versus the speech quality, measured by this quality measure, and here the upper right corner is the best. Again, this neural network based noise mask estimator worked pretty well.
1:01:05 So, these were applications of using a neural network for speech and/or noise mask estimation, and I think it is a pretty powerful and versatile tool. At least for the CHiME challenge, I should say, the requirement of stereo data can be overcome by end-to-end training. But I think there is still a lot to be done. First of all, it is not online; in most cases, the presented results were not online, and one would like to have an online system with low latency. Then, I think matters change if we have a moving speaker; here the speaker was more or less stationary, with this tablet held by the person speaking. And of course it is much more difficult if there is also overlapping speech, which we did not consider here. So that's it, and that was our list of references.
1:02:08thank you
1:02:42 I think that's no problem, that's easy to implement. But, as you say, earlier we would have said the maximum SNR is better than the MVDR, and now, at the current state, I say it's about the same; so it doesn't matter much whether we take mu equal to one or zero or zero point one or whatever. I think it would not improve matters, but also not degrade them; that's my feeling.
1:03:24 I think we did not listen to it, because we didn't go back to the time domain; we stayed in the short-time frequency domain. But that's a good point, we should listen to it.
1:03:39 Yeah. The spectrograms I have seen are short-time spectrograms, but that was not for the end-to-end training; I presume it looks similar, though, because the results were not that different between the two.
1:04:16 I think at the moment it is mainly the overlapping speech, and there are also very short utterances, which are too short for our covariance estimation.
1:04:35 I think it was six, but I don't know exactly.
1:04:53 Oh yeah, there was one more question, yeah.
1:05:12 You mean the neural network? I had this on some of the slides, I had some figures; it is not larger than the neural network for the acoustic model, by far not, but it is still significantly larger than a parametric noise tracker, that's for sure. In more detail I cannot say.
1:06:04 Basically, I think the motivation for doing it that way is that in the days of the parametric approaches we always needed the magnitude domain when we were doing speech enhancement. I think we tried the log domain, but no, I don't have a comparison. Ah, now I know what you mean. I cannot tell you much about it; I think we tried it and then we stuck with this, but I don't know.
1:07:17 I think the difference here is that we have a multichannel signal, and for the beamforming we exploit the phase. So for multichannel data, doing it with no explicit beamforming, I think that is a good idea, you know. Yeah, yes. As far as the last application is concerned, I also think there are other solutions which achieve the goal of this one, but it nicely fitted into my story.
1:08:04 Yeah, we tried with feed-forward only; that was in the results I skipped. There was a feed-forward network without the recurrent layer, and it was a bit worse, but not too much.
1:08:34 So I think the online and latency issue is not the big problem, but if the speaker moves a lot, I think you also have to do something on the test data, not rely solely on the trained system with the mask estimation; you also have to do some tracking or whatever on the test set. I think this is the larger issue.
1:09:09 Yes? No, it is like the example I discussed before in detail, with the noise suppression; I just showed briefly how I take the targets for the neural network mask estimator. In the noise suppression case, the speech signal, or the speech presence, was the target for the speech mask estimator, and the time-frequency bins which contained just noise were the target for the noise estimator; and then we use these covariance matrices in the beamformer objective function. Here we use the same beamformer objective function, also with these covariance matrices, Sigma_x and Sigma_NN; the question is only what we consider signal and what we consider noise. As the signal x we consider the early part of the signal, and as the noise we consider the later part of the signal. So, to estimate Sigma_NN, we take the signal which is not reverberated, convolve it with the late part of the impulse response, and this gives us the distortion; and then we use this beamformer framework to remove the distortion. It is a bit difficult to call it beamforming, actually, but it is.
1:11:18 Yeah. Yes. Sixty-four.