Thank you very much, and thank you for coming to my talk. When they asked me about the talk, they said it could be anything between half an hour and one and a half hours, so I wonder how long it will take; let's see. You're welcome to ask questions in between. And I guess the first question that you have I will answer right away: where is Paderborn? So, Paderborn is here, in the state of North Rhine-Westphalia, in the east of that state, and maybe the closest town that you might know is Dortmund, which you may have heard of because of the football team. Dortmund is about one hundred kilometres west of Paderborn, so this point is Paderborn.
Okay, so I'm going to talk about beamforming. This is our group in Paderborn, and before I start I would like to say that this is of course joint work of the whole group, and in particular I would like to mention
Jahn Heymann, Lukas Drude and Aleksej Chinaev.
So here's what I'm going to talk about today. What you see here is, so to say, the scenario: we have an enclosure with a speaker and maybe some distorting acoustic events, and then there is a microphone array and the beamformer processing the signals. After the beamformer we might have an automatic speech recognition unit, and the adaptation, or the computation, of the beamforming coefficients is controlled by a parameter estimation device. I'm going to first talk a bit about spatial filtering objective functions, so this part here. This part, I think, is rather basic, but since you are maybe more computer science people than electrical engineering people, I will spend a bit more time on it. Then we discuss how we can actually estimate the beamforming coefficients, which pertains to this block here, parameter estimation. Eventually I will look at the combination of beamforming and speech recognition, and finally, if time allows, I could also spend a few words, or a few slides, on other applications of beamforming beyond noise reduction.
So let's start with the first block, spatial filtering objective functions. This is, so to say, the elementary setup: assume we have just a monofrequent signal, like this complex exponential here, and let us first model what we then receive at the microphones. Here's a closer look: we have the N elements, the N microphones, and we have this monofrequent signal, the complex exponential, impinging on the microphone array from an angle theta, and d is the inter-element distance between two microphones.
Now, formulating this mathematically: the beamformer output signal is the weighted sum of the microphone signals; the weights are the w's here. And what we actually have at the microphones in this simple setup is the complex exponential of the source signal, however delayed by a certain amount of time which depends on the microphone considered. If you look here, for example, and take this first microphone as the reference, then the signal will arrive at the second microphone after some delay tau_1, and so on. And you can see that the delay tau depends on the angle of arrival theta, so you can actually express these taus by this expression here. So this is the beamformer output signal, and using vector notation we have the source signal, the complex exponential, and then we have here the weight vector containing the beamformer coefficients w_0 to w_{N-1}, and the vector c comprises all these delay terms, all these complex exponential terms. This c here is called the steering vector, and it has these elements: the first microphone signal is not delayed, the second one is delayed by tau_1, and so on. So this is the description of this scenario here.
And to get a feeling for how you can do spatial filtering with this setup, I have some beam patterns here. What I have on the left-hand side are the beamformer coefficients, with which I assume we have steered the beamformer towards the direction, or angle, theta equal to zero. What is plotted here is the so-called beam pattern, which is the inner product between the beamforming vector and the steering vector from the last slide. What you see here, as a function of the angle theta, is the response of the beamformer, this beam pattern. And you see that if theta is equal to zero, then we have a high sensitivity, and for other directions the gain, so to say, is small, or can even be zero.
I have four beam patterns here. The first one, on the left-hand side, corresponds to the situation where the signal arrives at so-called broadside. With broadside I mean the signal comes from this direction here; that is called broadside, while this direction would be called endfire. So this is the broadside direction, and here the desired signal is in the endfire direction, so it looks like that: broadside, endfire.
The lower two beam patterns indicate that the sensitivity, or the spatial selectivity, of the beamformer depends very much on the geometry. Here the ratio between the distance between two microphones and the wavelength is very small, so the aperture is small; then the sensitivity is almost omnidirectional, as you can see. Here the ratio between the inter-element distance of the microphones and the wavelength is very large; the distance is even larger than the wavelength. If the spacing is larger than the wavelength, we get spatial aliasing, just as you know temporal aliasing; that's the reason why you have these grating lobes here, they are caused by spatial aliasing. So here we have a low inter-element distance and here a high inter-element distance of the microphones, relative to the wavelength. So we can do spatial filtering with this setup.
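As a concrete illustration of the monofrequent setup just described (not from the talk's slides), here is a minimal numpy sketch of the steering vector and the resulting beam pattern for a uniform linear array, with the angle measured from broadside; all parameter values and names are illustrative.

    import numpy as np

    def steering_vector(theta, n_mics, d, freq, c=343.0):
        # Per-microphone phase of a plane wave from angle theta (uniform linear array).
        n = np.arange(n_mics)
        tau = n * d * np.sin(theta) / c            # delay at microphone n
        return np.exp(-1j * 2 * np.pi * freq * tau)

    def beam_pattern(w, n_mics, d, freq, thetas):
        # |w^H c(theta)| as a function of the look angle theta.
        return np.array([np.abs(np.conj(w) @ steering_vector(t, n_mics, d, freq))
                         for t in thetas])

    # Steer towards theta = 0 (broadside): weights equal to the steering vector.
    n_mics, d, freq = 6, 0.05, 2000.0              # 6 mics, 5 cm spacing, 2 kHz tone
    w = steering_vector(0.0, n_mics, d, freq) / n_mics
    pattern = beam_pattern(w, n_mics, d, freq, np.linspace(-np.pi / 2, np.pi / 2, 181))

Increasing d relative to the wavelength in this sketch reproduces the grating lobes mentioned above; decreasing it flattens the pattern towards omnidirectional.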
Now I go a step further, to real environments. What I mean by that is the following. First, we have a speech signal; we want to work with speech, so we have a wideband signal. We don't have a single monofrequent sine wave or complex exponential; we have a signal with a bandwidth of, say, eight or sixteen kilohertz or whatever. What we then do is go to the short-time Fourier transform domain, and for each frequency we can then look at it as a narrowband beamforming problem again. Then we have interferences, distorting sources like noise, which we would like to suppress, so we need an appropriate objective function from which we derive the beamforming coefficients, such that the desired signal is enhanced and the other one is suppressed.
Then we have reverberation, which means: if we are in an enclosure like this lecture hall, we have signal propagation via the direct path, but also via reflections and multiple reflections, and so on. This is called reverberation, and it is modelled by an acoustic impulse response, or acoustic transfer function, from the source to the microphones. And finally, this acoustic transfer function is unknown and even time-variant: if I move, or something else here in the lecture hall moves, then the impulse response or transfer function will change, so it can be time-variant. It is unknown and needs to be estimated. So that's what we are going to consider now, and we do this by data-dependent, statistically optimum beamforming.
So we first formulate the model in the short-time Fourier transform domain. We go from the time domain to the short-time Fourier transform domain, which means we take chunks of the signal and compute the DFT on each; then we move this chunk a bit forward, take a new DFT, and so on. The two parameters here are the time frame index and the frequency bin index. So y is now the vector of microphone signals, y_0 up to y_{N-1}, and, with some assumptions, this can be modelled as the product of the source signal s we are interested in and the acoustic transfer function vector from the source to each of the microphones, plus a noise term. I should at least mention that this model is already an approximation, because it assumes that the impulse response is shorter than the analysis window of the DFT; but I will use this model for the whole talk. The beamformer output signal then is the beamforming coefficients times this input signal, and in the following I will leave out the arguments t and f. So this is the filter output.
Now, how do we determine these beamforming coefficients w in a statistically optimum way? The first criterion that one would probably come up with is the MSE criterion, the minimum mean squared error criterion: we would like to determine the beamforming coefficients such that the mean squared error between the beamformer output and some desired signal is as small as possible. That is what you know from other optimisation tasks as well. And what is the desired signal? One could of course use as desired signal the source signal s which we would like to enhance; that would mean that the desired signal d that I introduced here is equal to the source signal s. Then the beamformer has the task of both beamforming and dereverberation, because what we would like to restore with this desired signal is the source signal at the position of the source, not at the microphone; so it should also dereverberate, that is, suppress the effect of the sound propagation from the source to the microphones. Or one could use an alternative criterion, where the desired signal is the image x of the source signal at the microphone, and then we want to do beamforming only, where beamforming means suppression of noise from other directions.
So let's now solve this problem. We have here the mean squared error; this is the beamformer output, the beamforming coefficients times the microphone signals, and this is the desired signal. If we just plug in our definition of y, which we had before, which is no big deal, then we can rewrite it in this way, where sigma_s squared is the power, or variance, of the source signal, and Sigma_N here is the covariance matrix of the noise at the microphones; we have N microphones, so we have an N times N covariance matrix of the noise. And you can see that the mean squared error consists of two terms: there is a speech distortion term and a noise term. The speech distortion is the deviation of the beamformer output from the desired output, and this is the contribution of the noise, which is independent of the desired speech signal.
If we formulate it in this way, it is really not difficult to carry out this minimization, and this is the result: these are the optimal beamforming coefficients which minimize the mean squared error. Sigma_N is again the noise covariance matrix, and a is the acoustic transfer function vector from the source to the individual microphones. This w is called the multichannel Wiener filter. There are variations on it. One is that you plug in a trade-off parameter mu here, and by tuning mu you can trade off speech distortion against noise suppression: for a small mu you force the beamformer to introduce as little speech distortion as possible, and if you increase mu, this term gets less weight and the beamformer is forced more towards suppressing the noise. So with this mu you can control the trade-off. First of all, if you just introduce the mu here, the beamforming coefficients don't change much; the one is simply replaced by mu here, and this is called the speech-distortion-weighted multichannel Wiener filter.
But you can also look at the extreme cases, mu going to zero or to infinity. If mu goes to zero, this term gets a very high weight. So, mu equal to one is what we had already; for mu going to zero, the speech distortion term gets a very high weight, so we would like to make sure that there is no speech distortion. If we let mu go to zero, the resulting beamformer is called the minimum variance distortionless response (MVDR) beamformer. Its objective function is: minimize the noise at the beamformer output, but make sure that the speech is not distorted. The other extreme case is mu going to infinity: then we don't care about the speech distortion, but we would like the noise to be suppressed as much as possible at the beamformer output, so we would like to maximize the signal-to-noise ratio at the beamformer output. For mu going to infinity, this is called the maximum SNR beamformer. The beamforming coefficients for mu equal to zero you can read off right away: the mu simply disappears. And for mu going to infinity, it is some scaling factor times the numerator here, the inverse of the noise covariance matrix times the acoustic transfer function vector.
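To make the relation between these criteria concrete, here is a hedged numpy sketch using one common parameterization of the speech-distortion-weighted multichannel Wiener filter, w = sigma_s^2 Sigma_N^{-1} a / (mu + sigma_s^2 a^H Sigma_N^{-1} a); conventions differ between papers, so this is not necessarily the exact form on the slide. It reduces to the MVDR beamformer for mu going to zero and, up to a scalar, to the maximum SNR beamformer for mu going to infinity.

    import numpy as np

    def sdw_mwf_weights(a, sigma_n, sigma_s2, mu=1.0):
        # Speech-distortion-weighted MWF for one frequency bin.
        # a: (M,) acoustic transfer function vector, sigma_n: (M, M) noise covariance.
        num = sigma_s2 * np.linalg.solve(sigma_n, a)   # sigma_s^2 * Sigma_N^{-1} a
        return num / (mu + np.vdot(a, num))            # vdot conjugates its first argument

    def mvdr_weights(a, sigma_n):
        # MVDR: distortionless response in the direction of a (the mu -> 0 case).
        num = np.linalg.solve(sigma_n, a)
        return num / np.vdot(a, num)

Only the scalar denominator changes with mu, which is the single-channel postfilter observation the talk comes back to shortly.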
So the different criteria can be visualised like this. We have this trade-off parameter mu, and if we let mu go to zero, we make sure that the speech is preserved, so it is not distorted, but we might not get a lot of noise suppression; then we are at this end, the minimum variance distortionless response case. If we go to the other end, with a very high mu, we have the largest possible noise suppression, but the speech might sound distorted at the beamformer output. What is also interesting to see, and what we could already see on the last slide, is that these different criteria, like MVDR and maximum SNR, differ only in a complex scalar, which means in a single-channel filter at the output, called a postfilter. If you look here, the numerator is always the same; if we change mu, we only change the scalar in the denominator. So this is just a complex scalar; no multichannel processing is necessary to go from one beamforming objective function to the next. So what we could do is design a maximum SNR beamformer and then use an appropriate single-channel filter, called a postfilter, and with that turn this maximum SNR beamformer into an MVDR beamformer; so from here to here, the overall system is an MVDR beamformer.
So what I have said so far is the following. We should look at acoustic transfer functions, and not only at the steering vector with the pure delays, if we talk about reverberant environments, and reverberant environments are always present if we are in a room. Outdoors we don't need to consider reverberation, but in a room we have to, and then acoustic transfer functions have to be used instead of pure delays. And the beamformer criteria differ only in a single-channel linear filter. What I'm going to look at now is that the acoustic transfer function vector a and this noise covariance matrix Sigma_N are unknown, and possibly time-variant, so we need to estimate them, and the goal is to estimate them from the noisy speech signal at the microphones. So that's what we consider now: this parameter estimation block, which then delivers the beamformer coefficients for one of the criteria.
One method to determine this acoustic transfer function: there are other methods, for instance one which exploits the nonstationarity of the speech signal, but the method that we have been working on for quite some time is to estimate this acoustic transfer function by eigenvalue decomposition. That goes as follows. This was our signal model: the vector of microphone signals is the acoustic transfer function vector times the desired source signal, this one we call x, plus the noise. If we compute the covariance matrix of y, so the expectation of y times y Hermitian, then, if s and n are uncorrelated, which we can assume, we get the spatial covariance matrix of this speech-related part here plus that of the noise. And it is clear that the speech-related covariance matrix is a times a Hermitian times the variance of the speech term, plus the covariance matrix of the noise. So this is the spatial covariance matrix of the microphone signals.
Now, for example, if you just look at this part here, it is easy to see that the principal eigenvector of this part is just a times some scalar, depending on how you normalize it. Because if you plug this into the eigenvalue equation, Sigma_x times eigenvector equals some lambda times eigenvector, and use this a as the eigenvector, you will see that it really solves this equation.
Maybe I should write it down; it is really not difficult. So, Sigma_x times the eigenvector, let's call it v, should equal lambda times v. Now Sigma_x is a a Hermitian times the variance of the speech signal, and for the eigenvector we use some scalar, say c, times a. So we get sigma_s squared times a a Hermitian times c times a; and since a Hermitian times a is a scalar, altogether this is a scalar times c times a, so a scalar times the eigenvector we started with. This scalar is the eigenvalue lambda, and so we indeed have a solution of this eigenvector equation. So if we do an eigenvector decomposition of Sigma_x, we can recover the acoustic transfer function; we can estimate the acoustic transfer function. That's what I wanted to say with this slide.
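Just to back up this derivation numerically, here is a tiny check (my own sketch, not from the talk) that the principal eigenvector of a rank-one speech covariance matrix is the acoustic transfer function up to a complex scalar.

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.standard_normal(4) + 1j * rng.standard_normal(4)   # some ATF vector
    sigma_x = 2.0 * np.outer(a, a.conj())                      # sigma_s^2 = 2

    eigvals, eigvecs = np.linalg.eigh(sigma_x)                  # Hermitian EVD
    principal = eigvecs[:, -1]                                  # largest eigenvalue comes last
    # 'principal' is parallel to a: rescaling it by a single complex factor recovers a.
    print(np.allclose(principal * (a[0] / principal[0]), a))    # -> True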
Or we could also look at a generalized eigenvalue problem, where we also take Sigma_N into account. If you look at this generalized eigenvalue problem, the principal eigenvector solving it is, in principle, also a complex scalar times this one here, where we have the inverse of the noise covariance matrix times the acoustic transfer function vector. So we can estimate these terms by eigenvector decomposition; that is the message of this slide. With the principal eigenvector of the generalized eigenvalue problem, this one here, we can realize the maximum SNR beamformer, because the principal eigenvector is, in principle, exactly this expression. And actually, if we have the right routine, it is not necessary that Sigma_N be invertible; we just need to solve the generalized eigenvalue problem, so in principle it also works if Sigma_N is not invertible. However, there is an arbitrary scaling factor, because any scaling still results in an eigenvector of that problem. The other beamformers, like the MVDR beamformer, we can realize as well: then we do an eigenvector decomposition of this covariance matrix of the speech-related part of the microphone signals, because this gives us a, the acoustic transfer function vector, and this is the denominator corresponding to the MVDR beamforming filter. So we can also realize an MVDR beamformer, but then we also need the inverse of Sigma_N, whereas here it is not necessary to compute the inverse explicitly. So, with eigenvector decomposition we can determine the acoustic transfer function and, with that, the beamforming coefficients.
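A minimal sketch of this estimation path, per frequency bin, might look as follows; the scaling of the GEV weights is arbitrary, as said above, and the names are illustrative.

    import numpy as np
    from scipy.linalg import eigh

    def gev_weights(sigma_x, sigma_n):
        # Maximum SNR (GEV) weights: principal generalized eigenvector of (Sigma_x, Sigma_N).
        eigvals, eigvecs = eigh(sigma_x, sigma_n)    # generalized Hermitian problem
        return eigvecs[:, -1]                        # eigenvalues are sorted ascending

    def mvdr_from_evd(sigma_x, sigma_n):
        # MVDR weights using the principal eigenvector of Sigma_x as the ATF estimate.
        _, eigvecs = np.linalg.eigh(sigma_x)
        a_hat = eigvecs[:, -1]
        num = np.linalg.solve(sigma_n, a_hat)
        return num / np.vdot(a_hat, num)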
So what we did is: we now know how to determine the acoustic transfer function, but what we still need are these covariance matrices, of the speech-related microphone signal and of the noise. So we have solved one problem and gotten a new one, because now we need to estimate Sigma_x and Sigma_N, the covariance matrices of the speech term and of the noise term of the microphone signals. There are a couple, or rather many, procedures for doing that, and basically what most of them do is a two-stage procedure: they first determine, for each time-frequency point, whether it is dominated by speech or by noise. That is called speech presence probability estimation; it is, so to say, a voice activity detector with a very high resolution, a time-frequency-point resolution. We would like to determine, for each time-frequency point, whether it contains just noise, pure noise, or whether it is dominated by speech. If we have this speech presence probability map, or mask, then we can estimate these matrices from it. And that is the way I'm going to deal with it in the following.
So this speech presence probability estimation, which should determine for each time-frequency point whether it is speech or noise, is basically something like this: we have a noisy spectrogram, and what we would like to have is the mask, or the identification of those time-frequency points which are dominated by speech; that looks something like this. To do that, there have been a lot of techniques which are based on so-called a priori and a posteriori SNR estimation and local spectro-temporal smoothing. I'm not going to talk about those; they were the preferred methods until a few years ago. Then we in Paderborn developed a method which we found very elegant: we interpreted this as a two-dimensional hidden Markov model, with correlations, or transition probabilities, along the time axis and along the frequency axis, and then we did inference in this two-dimensional hidden Markov model to determine the posterior, and the posterior was the speech presence probability. But then it eventually turned out that a neural network did a much better job, and now I'm finally at the other half of my talk title, "neural network supported". So what I discuss now is how we can do the speech presence probability estimation with a neural network; let's jump right to that.
So here is the setup. A neural network is used for speech presence probability estimation, and we use it in the following way. We have the microphone signals, and we have a network for each channel; however, we tie the weights between the individual networks. The input to the neural network is the magnitude spectrum, and the network is supposed to predict an ideal mask, I have a slide on that later, so it should predict, for each time-frequency point, whether it is dominated by speech or by noise. We apply this to each channel separately and then we somehow merge, or pool, the channels. This can be done by averaging the outputs or by taking the median; the median turned out to be a bit more robust in the case that one of the channels was broken. And the output of this stage is, for each time-frequency point, the probability of it being speech, and here, of it being noise.
So once we have these masks, or presence probabilities, we can compute the spatial covariance matrices of the speech and of the noise; this is illustrated here. We estimate the spatial covariance matrix of the speech by this outer product, however we take only those time-frequency points where our neural network has said: this is really speech. And for the noise estimation we take only those time-frequency points where the network has said: at this time-frequency point it is really noise. With that we estimate these covariance matrices, and once we have them, we plug them into one of these optimisation criteria, MVDR or maximum SNR, to get the beamforming vector w.
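A hedged sketch of this mask-based covariance estimation, for one frequency bin: the masks simply weight the per-frame outer products of the STFT vectors. Shapes, names and the normalization are illustrative; details differ between implementations.

    import numpy as np

    def masked_covariance(Y, mask):
        # Y: (T, M) complex STFT vectors over T frames, mask: (T,) weights in [0, 1].
        weighted = mask[:, None] * Y
        return weighted.T @ Y.conj() / np.maximum(mask.sum(), 1e-10)   # (M, M), Hermitian

    # sigma_x = masked_covariance(Y_f, speech_mask_f)
    # sigma_n = masked_covariance(Y_f, noise_mask_f)
    # w = gev_weights(sigma_x, sigma_n)   # from the sketch above, or mvdr_from_evd(...)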
So that is basically already the whole approach. Yes, please? Yes, you're right, the input here is y, the noisy signal; you suggest we should perhaps first subtract an estimate of the noise? We tried that, but we didn't find an effect, an improvement, so we stuck with this one. But you're right.
No, for the mask estimation we don't use the phase. Basically we look at whether the magnitude at each point is below or above some threshold, something like that. The phase, of course, is necessary for the beamforming coefficients; but as far as the mask estimation is concerned, the phase basically only enters through the estimation of these covariance matrices.
Here is the network in more detail. We have the noisy speech signal at the input of the network, and at the output we would like to predict the speech mask and the noise mask. So for each time-frequency point, if it is dominated by speech, this output should be high and that one low. The neural network is thus operated like a classifier; it is a classifier which has to predict one or zero, speech or noise, for each time-frequency point, and the objective function is, since it is a classifier, simply the cross entropy. That is one configuration which worked pretty well: here we had four layers, the first was a bidirectional LSTM layer, followed by three feed-forward layers. At the input we had the magnitude spectrum for all frequencies, and the outputs are the speech mask and the noise mask. These values here can be between zero and one, they don't need to be binary, and they also don't need to sum up to one, so it can happen that a time-frequency point is considered neither speech nor noise because it is somewhere in between.
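A rough PyTorch sketch of an architecture of this kind: one bidirectional LSTM layer followed by feed-forward layers, magnitude spectra in, a speech mask and a noise mask out, trained with cross entropy. The layer sizes are placeholders, not the ones used in the talk.

    import torch
    import torch.nn as nn

    class MaskEstimator(nn.Module):
        def __init__(self, n_freq=513, hidden=256):
            super().__init__()
            self.blstm = nn.LSTM(n_freq, hidden, batch_first=True, bidirectional=True)
            self.ff = nn.Sequential(
                nn.Linear(2 * hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 2 * n_freq), nn.Sigmoid(),   # mask values in [0, 1]
            )

        def forward(self, magnitude):                 # (batch, frames, n_freq)
            h, _ = self.blstm(magnitude)
            masks = self.ff(h)
            speech_mask, noise_mask = masks.chunk(2, dim=-1)
            return speech_mask, noise_mask

    # Training target: ideal binary masks; loss: binary cross entropy per bin, e.g.
    # loss = nn.functional.binary_cross_entropy(speech_mask, ibm_speech) \
    #      + nn.functional.binary_cross_entropy(noise_mask, ibm_noise)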
So what did we do here with this mask estimation? As you have seen, it is single channel: there is one neural network per channel, with tied weights, but we treat each channel separately. So it is independent of the array configuration and of the number of microphones: we could train it with six-microphone data and use it with three-microphone data in the test, and it could be a linear array in training and a circular array in the test; that is possible. One can see this as an advantage, but I would also say it is a disadvantage, because for the mask estimation we don't exploit spatial information, since we look at just a single channel. What is different from most of the parametric approaches that came before the neural network is that at the input we have the whole DFT vector, so we treat all frequencies jointly, whereas usually in beamforming you treat each frequency separately; here we treat them jointly. It is not immediately suitable for online processing, because we have a bidirectional LSTM layer whose backward pass runs over the whole utterance, so in its current configuration it is an offline method.
Here are some example speech and noise masks that have been estimated with this method from CHiME data. You can see that it recovers the harmonic structure of speech pretty well, and this is the noise mask, where we have high values here, for example, and in between the speech. And now I can play the input signal and the beamformer output signal for this one, if I am able to do that. I should play both, the input and the output. Of course I have picked a very good example; they do not all sound like that, this one was a good one.
Here is a different view, or another, maybe more minor, aspect of it. Here we compared the maximum SNR beamformer, here called the generalized eigenvalue (GEV) beamformer, which maximizes the signal-to-noise ratio, and the MVDR beamformer, which makes sure that there is no speech distortion. What you see here is the SNR at the beamformer output for individual utterances of the CHiME challenge, and what you have on the other axis is the log of the condition number of this noise covariance matrix, Sigma_N. In the MVDR case we have to compute the inverse of Sigma_N to determine the coefficients, because in the numerator there is the inverse noise covariance matrix times a. What this perhaps shows is that if the log condition number is high, which means the noise covariance matrix is ill-conditioned, then it seems that the generalized eigenvalue beamformer, which does the generalized eigenvalue decomposition, gives a bit higher SNR at the output than the MVDR. Maybe you can see it, but I don't want to make a strong point out of that. In the MVDR we have to explicitly compute the inverse of Sigma_N, and that may be problematic for some of the utterances where there is just little noise, or just few observations. So in our case the maximum SNR criterion worked a bit better than the MVDR criterion, but people from NTT in Japan, who use a similar approach, found in their case that the MVDR worked better than the GEV; so maybe it is about the same.
One point I would like to make: you may have noted that if we take the maximum SNR beamformer, we don't care about speech distortions. And indeed, if we don't take any care of that, the resulting signal sounds distorted; I have an example on the next slide. It depends on the noise: if the noise is predominantly at the low frequencies, then after beamforming the signal sounds a bit high-pass filtered, because the beamformer has suppressed the frequencies with high noise more. That means that the speech signal has been distorted, because now it sounds high-pass filtered. If we do speech recognition this is really not a big deal, because the acoustic model can learn that the input signal looks a bit different; we didn't find a big difference whether we accounted for this distortion by a postfilter or not, the speech recognition results were about the same. But if you want to do speech enhancement, that is, reconstruct the time-domain signal, then you should be careful to reduce these speech distortions. We developed a method to control these speech distortions; I can explain it in more detail later on, not here.
Without going too much into detail: what we try to do is design a single-channel postfilter g such that the combination of the beamformer with the acoustic transfer function vector gives a response of one in the desired direction. This can actually be solved for g if we assume an anechoic transmission; so if we have no reverberation we can compute g from that. This is of course an approximation, in reality we have reverberation, but in this way we could compute it, and we then had a single-channel postfilter which largely removed these speech distortions. And here I have an example, with no post-processing and with this normalization.
This is not such a good example as the one before. So this is the input. Now, if we take the maximum SNR beamformer without taking any care of the speech distortions, as I explained, the speech signal sounds different; it sounds high-pass filtered. And with this blind analytic normalisation we can reduce this high-pass filtering effect. But of course this also comes at the expense of a smaller SNR gain.
Now I have some results on the CHiME challenge data. In CHiME-3 and CHiME-4 there were recordings from four different environments: café, street, bus, and pedestrian area. In CHiME-3 there was just the six-channel scenario; in CHiME-4 there were also a two-channel and a one-channel scenario. And there were two kinds of recordings in these environments: there were simulated data, where noise recorded in these environments was artificially added, but there was also real speech recorded in these environments. The recording was done like this: they had this tablet with six microphones here at the frame, and the person held the tablet in front of him and then spoke the sentences he was supposed to speak, in the bus or in the pedestrian area or wherever. So that was the scenario. One should say that this is, in a sense, not the most difficult scenario, because there is only slow variation of the speaker position: you hold the tablet like this; it is not that the microphones are fixed somewhere and you walk around on the floor. So there is little position variation, and we have simulated and real noisy recordings.
Here are some results concerned with speech enhancement, measured by the PESQ score, which is supposed to measure speech quality, although I don't know how well it really represents speech quality. What this figure shows (I have taken it from another publication, so there are some results I'm not going to discuss here) is the PESQ score of the speech output after the beamformer. "Oracle" means: if we knew, by an oracle, which time-frequency bin is corrupted by noise and which one represents speech, so if we had the oracle speech mask and the oracle noise mask, this is the quality that we could achieve; the higher the better. This was the result with the estimation by the BLSTM network, which is almost as good as the oracle one. These ones I'm going to skip; they were other network configurations and other training scenarios. And here are two results from parametric approaches; this one is also from my group, from a few years ago, which was the previously mentioned two-dimensional hidden Markov model. You can see that the neural-network-supported mask estimation gave better speech quality than these parametric methods.
And now I have some speech recognition results on CHiME-3. There were development sets and evaluation sets, and there were simulated scenarios, where the noise was artificially added, and real recordings in the noisy environments, where people really spoke in the bus and so on. This is the baseline setup delivered by the organizers, and these are the word error rates. This is our speech presence probability estimation method from a few years back, and this is a method from NTT, also a few years back. This one uses the BeamformIt beamformer, which some of you may know; it is a weighted delay-and-sum beamformer. And here is the maximum SNR beamformer with the neural network used for mask estimation. You can see that it performed pretty well, also on the real recordings, the real recordings in noisy environments.
Yes? Exactly, that is a point I want to make now, because it is an important one.
So what I have talked about so far: we use this neural-network-based speech presence probability estimation, or neural-network-based mask estimation, to identify the time-frequency bins from which we estimate the power spectral density matrices of speech and of noise, and from these matrices the beamformer coefficients can be estimated. Now your point is: the neural network training requires stereo data. We must have, separately, the clean signal at the microphones, that is, the signal which came from the desired source s, and, separately, the noise, so that we can compute the target for the neural network training, the ideal binary masks, the speech mask and the noise mask. So we need this stereo data for training. Furthermore, the mask definition is actually somewhat heuristic: what do you declare as time-frequency points dominated by speech? We set it such that, in the speech-only signal, we take those time-frequency points which account for ninety-nine percent of the total power of the signal. But it is debatable whether that is the best choice; so there are some heuristics in there.
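For illustration, here is one way such a power-based heuristic could be implemented (my reading of the description, not necessarily the exact rule that was used): declare as speech-dominated the bins of the clean speech that together account for ninety-nine percent of its total power.

    import numpy as np

    def ideal_mask_by_power(clean_spec, fraction=0.99):
        # clean_spec: (T, F) complex STFT of the clean speech signal.
        power = np.abs(clean_spec) ** 2
        order = np.argsort(power, axis=None)[::-1]           # strongest bins first
        cumulative = np.cumsum(power.ravel()[order])
        keep = order[cumulative <= fraction * power.sum()]   # bins covering 99 % of the power
        mask = np.zeros(power.size, dtype=bool)
        mask[keep] = True
        return mask.reshape(power.shape)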
So the question is now: can we overcome this strict requirement of stereo data? We tried to overcome this limitation by end-to-end training; that is what I would like to talk about next. So now I am at this part: beamforming and speech recognition. With end-to-end training I mean the following, which is depicted here. We have the whole processing chain starting with the microphone signals; then we have a neural network for mask estimation; this is the pooling, or condensing, of the masks into a single mask for speech and for noise; then the speech and noise covariance estimation, the beamformer, and our postfilter to remove the speech distortions. So that is everything up to here. Now comes the speech recognition: here is the filterbank (the mel filterbank and the computation of the delta and delta-delta coefficients), then here is the acoustic model, the neural network for the acoustic model, and then we have the decoder. What I mean by end-to-end training is that we would like to back-propagate the gradient from the cross-entropy criterion of the acoustic model training all the way through these processing blocks, up to the neural network for mask estimation. If we manage to do that, we don't need a target like an ideal speech mask for training this network; we can derive the gradient from the cross-entropy criterion of the acoustic model training, and then we don't need stereo data anymore.
So that is what we tried to do, and what we have to take care of is that in between these computations we are in the complex domain: the beamforming coefficients are complex-valued vectors, and the covariance matrices are complex-valued. So here we are in the real-valued domain, in between we are in the complex-valued domain, and here we are back in the real-valued domain again. So we have to consider gradients with respect to complex-valued arguments. What I have denoted here: the cross-entropy criterion of the acoustic model training is a function of the spatial covariance matrices from which we compute the beamforming coefficients, and these are complex-valued; and eventually we want to train the coefficients of the neural network for mask estimation, which are of course real-valued again. What we did, and I'm not going to go into detail because there is a report on it, is to use the Wirtinger calculus to compute the complex derivatives, because the cost function is not a holomorphic function. This Wirtinger calculus is well known in adaptive filter theory; people who do adaptive filtering use it, because there you often have complex-valued coefficients. With this, one can compute these gradients. The crucial step was: we have this maximum SNR beamformer, whose coefficients are determined by an eigenvalue decomposition, so we have to compute the derivative of the principal eigenvector of this generalized eigenvalue problem with respect to the PSD matrices that come out of the neural network mask estimator. For that we have a report, which I also have with me, where you can look up how this is done, because it is a quite long derivation.
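The hand-derived Wirtinger gradients are in that report; purely to illustrate the computational path, here is a sketch of how one could nowadays let an autodiff framework back-propagate through the same chain, by whitening with a Cholesky factor so that the generalized problem becomes an ordinary Hermitian one. This is my own illustration, not the implementation from the talk, and it glosses over numerical issues such as degenerate eigenvalues.

    import torch

    def gev_weights(sigma_x, sigma_n):
        # Principal generalized eigenvector of (Sigma_x, Sigma_N), differentiable.
        # sigma_x, sigma_n: (..., M, M) complex Hermitian tensors.
        L = torch.linalg.cholesky(sigma_n)           # Sigma_N = L L^H
        L_inv = torch.linalg.inv(L)
        phi = L_inv @ sigma_x @ L_inv.mH             # whitened speech covariance
        _, vecs = torch.linalg.eigh(phi)             # eigenvalues ascending
        w = L_inv.mH @ vecs[..., -1:]                # undo the whitening
        return w.squeeze(-1)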
So, for the CHiME challenge this really worked, and here are some results; let's see whether we can make sense of them. First, here are again the baseline results, so to say: as a beamformer we used BeamformIt, the weighted delay-and-sum beamformer, and we did a separate acoustic model training; those are the baseline word error rates. Here we have the system where we trained everything separately: our beamformer, using a neural network with ideal binary masks as targets, as we did before, and, separately, the neural network for the acoustic model; these are the results. Then here we tried to do it as on my last slide: we would like to jointly train both networks, the one for mask estimation and the one for the acoustic model, and we started both from random initialization. You can see that this leads to a somewhat worse word error rate; the word error rate increased. The interesting result is actually the next one: here we pre-trained the acoustic model, the neural network for the acoustic model, but the neural network for the beamformer mask estimation was trained by back-propagating the gradient from the acoustic model to the mask estimation network; so it was randomly initialised and then trained all the way back. And with that we are even a little bit better than with the separate training. So this shows that, at least for this CHiME challenge, it is possible to dispense with the need for stereo data, and you can achieve the same or slightly better results by training just on the noisy data. And the lowest one here is where we also pre-trained the network for mask estimation with the ideal binary mask targets and only later on propagated the gradient from the acoustic model to this first network; with that it got a bit better still. So here, for this task, only the noisy data is required. But I would like to emphasise that this holds for this data, because we then tried to achieve the same on the AMI corpus, and so far we have not been successful. So it is not that easy; CHiME is perhaps a particularly benign corpus in this respect.
So that was basically the story about beamforming for noise reduction. Now I have a few slides where we also do multichannel processing, but not for noise reduction, for two other tasks. The first one is speech recognition of reverberated speech. What we did here is use the same setup: we have multichannel data and we use a neural network for mask estimation, but now our distortion is no longer noise but reverberation. In the case that you know the impulse responses of the training data, you can determine by yourself the ideal speech mask and the ideal mask for the distortion. For the target we take the dry signal, the non-reverberant data, and convolve it with the early part of the room impulse response, so the first fifty milliseconds. And for the interference, which was the noise in the earlier case but now is the reverberation, we convolve that dry signal with the late part of the room impulse response, so everything after fifty milliseconds, the tail of the impulse response. With that we can derive ideal binary masks for the target and for the interference, and then the rest remains the same: we can again compute the masks, from them the covariance matrices, and from those the beamforming weights.
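A small sketch of how such target and interference images could be built, following the description; the 50 ms split and the names are illustrative.

    import numpy as np

    def early_late_images(dry, rir, fs, split_ms=50.0):
        # Convolve the dry signal with the early and the late part of the room
        # impulse response; the two images serve as target and interference.
        split = int(fs * split_ms / 1000.0)
        early_rir = rir.copy(); early_rir[split:] = 0.0    # direct path + early reflections
        late_rir = rir.copy();  late_rir[:split] = 0.0     # late reverberation tail
        return np.convolve(dry, early_rir), np.convolve(dry, late_rir)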
We tested that on the data set of the REVERB challenge, where clean speech data are convolved with measured room impulse responses, and there are again simulated test data and also real recordings in a reverberant environment.
And here are some results for that. In the real recordings there was a distinction between "near", which means the distance between the microphones and the speaker was about, I don't know, one metre or fifty centimetres, and "far", which was about two metres; but you can see the difference in the word error rate is not very large. For the GMM-HMM baseline we have these results here; the baseline results were single channel, there was no multichannel baseline. And then, with the method I just explained (taking the late reverberation part for the distortion mask and otherwise using the same setup as before) we obtained these word error rates on these two parts of the data set, and with a better acoustic model it can be further improved. So it also worked in this case, to suppress reverberation.
So that was one example of another application; here is my final and last example, where this neural-network-based mask estimation is used for noise tracking for single-channel speech enhancement. Here is a typical setup of, let's say, traditional single-channel speech enhancement. We have the noisy speech signal at the input (we are already in the STFT domain here) and we manipulate only the magnitude; the phase is usually left unchanged. We compute a gain function, a time-varying gain function, with which we multiply the microphone signal to suppress the noise, and this time-varying gain function is computed from the so-called a priori SNR. To compute this, we need the noise power spectral density, and this noise power spectral density is now estimated with a neural network; that is the change we made. So, noise tracking by a neural network: we did it with a similar, or the same, methodology as before, as in mask-based speech enhancement. We estimate a noise spectral mask which indicates for each time-frequency bin whether it is dominated by noise or not; if it is dominated by noise, we can update the noise estimate for this a priori SNR estimator, and if it is dominated by speech we just hold the old estimate. So only this part is changed with respect to an otherwise traditional speech enhancement system.
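A hedged sketch of this mask-gated noise tracking: where the network says a time-frequency bin is noise-dominated, the noise PSD estimate is updated recursively, otherwise it is held, and a simple gain derived from the estimated SNR is applied. The smoothing constant, the gain rule and the flooring are illustrative, not the talk's exact recipe.

    import numpy as np

    def enhance(noisy_spec, noise_mask, alpha=0.9, floor=0.1):
        # noisy_spec: (T, F) complex STFT, noise_mask: (T, F) values in [0, 1].
        noisy_power = np.abs(noisy_spec) ** 2
        noise_psd = noisy_power[0].copy()              # crude initialization
        enhanced = np.empty_like(noisy_spec)
        for t in range(noisy_spec.shape[0]):
            # update only where the bin is (softly) noise dominated, else hold the old estimate
            update = alpha * noise_psd + (1 - alpha) * noisy_power[t]
            noise_psd = noise_mask[t] * update + (1 - noise_mask[t]) * noise_psd
            snr_post = noisy_power[t] / np.maximum(noise_psd, 1e-10)
            gain = np.maximum(1.0 - 1.0 / np.maximum(snr_post, 1e-10), floor)   # Wiener-like gain
            enhanced[t] = gain * noisy_spec[t]         # the phase is left unchanged
        return enhanced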
Here are also some examples. This is a noisy spectrogram, this is the ideal binary mask for the noise, and here is the noise presence probability estimated by this neural network; this is another method to compare it with. It looks a little bit similar to what we had before, but with this mask estimation we can estimate the noise power spectral density of the signal.
And here are some results, where we have on the left-hand side the performance of the noise estimator and on the right-hand side the performance of the speech enhancement system. We compared with really a lot of the state-of-the-art noise estimation methods, as you can see here. What we have plotted are two error measures for the noise estimate, the log-error variance versus the log-error mean. Both the variance and the mean should be small, so ideally the best method should be in the lower left corner, and that is actually the DNN-based noise estimator, this one here; these are all the other methods. And here we have the speech enhancement performance, the output SNR versus the speech quality measured by this quality measure, and here the upper right corner is the best. Again, this neural-network-based noise mask estimator worked pretty well.
So these were applications of using a neural network for speech and/or noise mask estimation, and I think it is a pretty powerful and versatile tool. And at least for the CHiME challenge, I should say, the requirement of stereo data can be overcome by end-to-end training. But I think there is still a lot to be done. First of all, it is not online; in most cases the presented results were not obtained online, and one would like to have an online system with low latency. Then, I think matters change if we have a moving speaker; here the speaker was more or less stationary, with the tablet held by the person speaking. And of course it becomes much more difficult if there is also overlapping speech, which we didn't consider. So that's it; this is our list of references. Thank you.
I think that's no problem; that would be easy to implement. But, as you say, earlier we would have said the maximum SNR is better than the MVDR, whereas now I say it is about the same, so it doesn't matter much whether we take mu equal to one or to zero point one or whatever. So I think it would not improve matters, but also not degrade them; that's my feeling, I don't have an example.
I think we did not listen to it, because we didn't go back to the time domain; we stayed in the short-time frequency domain. But that's a good point, we should listen to it.
Yes. The spectrograms I have seen are short-time spectrograms, but that was not for the end-to-end training; I presume it looks similar, because the results were not that different between the two.
I think at the moment it is mainly the overlapping speech, and there are also very short utterances which are too short for the covariance estimation. Oh yes, there was one more question.
Regarding the neural network: I had this on some of the slides, I had some figures. It is not as large as a neural network for an acoustic model, by far not, but it is still significantly larger than a parametric noise tracker, that's for sure. More details I cannot give.
Basically, I think the motivation for doing it in this domain is that in the days of the parametric approaches we always worked on the magnitudes when we were doing speech enhancement. I think we tried the log domain, but I don't have numbers for a comparison.
Ah, now I know what you mean. I cannot tell you much about it; I think we tried it and then stuck with this, but I don't know for sure.
I think the difference is that we have a multichannel signal, and for the beamforming we exploit the phase. So for multichannel data, doing it with no explicit beamforming, I think that can also be a good idea. Yes.
As far as the last application is concerned, I also think there are other solutions which achieve its goal at least as well, but it fitted nicely into my story.
Oh, we tried it with feed-forward layers only; those were the results I skipped. That was a feed-forward network without the recurrent layer, and it was a bit worse, but not by too much.
So I think the online operation and the latency are not the issue; but if the speaker moves a lot, I think you also have to do something on the test data and not rely solely on the trained system with the mask estimation; you have to also do some tracking or whatever on the test set. I think this is the larger issue.
Yes? No, it is like the example I discussed before in detail, with the noise suppression, where I briefly showed how we obtain the targets for the neural network mask estimator. In the noise suppression case, the speech presence was the target for the speech mask estimator, and the time-frequency bins containing just noise were the target for the noise estimator; and then we used these covariance matrices in the beamformer objective function. Here we use the same beamformer objective function, also with these covariance matrices Sigma_x and Sigma_N; the question is what we consider the signal x and what we use for Sigma_N. As x we consider the early part of the signal, and as the distortion we consider the late part of the signal. So to estimate Sigma_N, we take the input signal which is not reverberated and convolve it with the late part of the impulse response, and this gives us the distortion; then we use this beamformer framework to remove that distortion. It is a bit difficult to call it beamforming, actually, but it is.
Yes. Sixty-four.