Thank you very much for the introduction, and thank you for coming to my talk.

When I was asked how long the talk would be, I said anything between half an hour and one and a half hours, so let's see how long it will actually take.

You are welcome to ask questions in between, and I guess the first question that you have I will answer right away:

Where is Paderborn? Paderborn is here, in the state of North Rhine-Westphalia, in the east of that state. Maybe the closest town that you might know is Dortmund, which you may know because of the football team; Dortmund is about one hundred kilometres west of Paderborn. So this is Paderborn.

Okay, so I am going to talk about beamforming, and this is our group in Paderborn. Before I start I would like to say that this is of course joint work of the whole group, and in particular I would like to mention Jahn Heymann, Lukas Drude, and Aleksej Chinaev.

So here is what I am going to talk about today. What you see here is, so to say, the scenario: we have an enclosure with a speaker and possibly some distorting acoustic events, then there is a microphone array and a beamformer processing the signals, and after the beamformer we might have an automatic speech recognition unit. The adaptation, or the computation, of the beamforming coefficients is controlled by a parameter estimation device.

I am going to first talk a bit about spatial filtering objective functions, so this part here. This part is rather basic, but since you are, let's say, more computer science people than electrical engineering people, I will spend a bit more time on it. Then we discuss how we can actually estimate the beamforming coefficients, which pertains to this block, parameter estimation. Eventually I will look at the combination of beamforming and speech recognition, and finally, if time allows, I could also spend a few words, or a few slides, on other applications of beamforming beyond noise reduction.

So let's start with the first block, spatial filtering objective functions. This is, so to say, the elementary setup: assume we have just a monofrequent signal, like this complex exponential, and let us first model what we then receive at the microphones. Here is a closer look: we have the N elements, the N microphones, and we have this monofrequent signal impinging on the microphone array from an angle theta, and d is the inter-element distance between two microphones.

Formulating this mathematically, the beamformer output signal is the weighted sum of the microphone signals; the weights are the w_n here. What we actually have at the microphones in this simple setup is the complex exponential of the source signal, delayed by a certain amount of time which depends on the microphone considered. If you take, for example, this first microphone as the reference, then the signal arrives at the second microphone after some delay tau_1, and so on. You can see that the delay tau_n depends on the angle of arrival theta, so you can express these tau_n by this expression here.

So this is the beamformer output signal, and using vector notation we have the source signal, the complex exponential, the weight vector containing the beamformer coefficients w_0 to w_{N-1}, and the vector that comprises all these delay terms, all these complex exponential terms. This term is called the steering vector, and it has these elements: the first microphone signal is not delayed, the second one is delayed by tau_1, and so on. So this is the mathematical description of this scenario.

To get a feeling for how spatial filtering works in this scenario, I have some beam patterns here. What I have on the left-hand side are the beamformer coefficients, where I assume we have steered the beamformer towards the direction, or angle, theta equal to zero. What is plotted here is the so-called beam pattern, which is the inner product between the beamforming vector and the steering vector from the last slide. What you see is the response of the beamformer, this beam pattern, as a function of the angle theta, and you see that if theta is equal to zero we have a high sensitivity, while for other directions the gain, so to say, is small or can even be zero.

I have four beam patterns here. The first one, on the left-hand side, corresponds to the situation where the signal arrives at so-called broadside; by broadside I mean that the signal comes from this direction, perpendicular to the array axis. The direction along the array axis is called endfire. So this is the broadside case, and here the desired signal comes from the endfire direction.

The lower two beam patterns indicate that the sensitivity, or the spatial selectivity, of the beamformer depends very much on the geometry, here on the ratio between the distance between two microphones and the wavelength. If this ratio is very small, so we have a small aperture, then the sensitivity is almost omnidirectional, as you can see. Here, on the other hand, the ratio between the inter-element distance of the microphones and the wavelength is very large; the spacing is even larger than the wavelength. If the spacing is larger than half a wavelength we get spatial aliasing, just as you know temporal aliasing, and that is the reason why you see these grating lobes here; this is called spatial aliasing. So here we have a low inter-element distance and here a high inter-element distance of the microphones relative to the wavelength. So we can do spatial filtering with this setup.
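As a side note, here is a minimal numpy sketch of this narrowband picture, assuming a far-field uniform linear array and illustrative values for the spacing and frequency (not the ones on the slides):

```python
import numpy as np

def steering_vector(theta, n_mics, d, freq, c=343.0):
    # Far-field steering vector of a uniform linear array:
    # element n is delayed by tau_n = n * d * sin(theta) / c.
    n = np.arange(n_mics)
    tau = n * d * np.sin(theta) / c
    return np.exp(-1j * 2 * np.pi * freq * tau)

# Delay-and-sum weights steered towards theta = 0 (broadside).
n_mics, d, freq = 6, 0.05, 2000.0            # 6 mics, 5 cm spacing, 2 kHz
w = steering_vector(0.0, n_mics, d, freq) / n_mics

# Beam pattern: |w^H d(theta)| as a function of the arrival angle theta.
thetas = np.linspace(-np.pi / 2, np.pi / 2, 361)
pattern = [abs(np.conj(w) @ steering_vector(t, n_mics, d, freq)) for t in thetas]
# The response is 1 at theta = 0 and smaller (or zero) for other directions;
# increasing d beyond half a wavelength produces grating lobes (spatial aliasing).
```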

Now I go a step further, to real environments. What I mean by that is the following. First, we have a speech signal, we want to work with speech, so we have a wideband signal: not a single monofrequent sine wave or complex exponential, but a signal with, say, a bandwidth of eight or sixteen kilohertz or whatever. What we then do is go to the short-time Fourier transform domain, and in each frequency bin we can look at it as a narrowband beamforming problem again.

Then we have interferences, distorting sources like noise, which we would like to suppress, so we need an appropriate objective function from which we derive the beamforming coefficients, such that the desired signal is enhanced and the interference is suppressed.

Then we have reverberation, which means that if we are in an enclosure like this lecture hall, we have signal propagation via the direct path but also via reflections and multiple reflections and so on. This is called reverberation, and it is modeled by an acoustic impulse response, or acoustic transfer function, from the source to the microphones.

And finally, this acoustic transfer function is unknown and even time-variant: if I move, or something else in the lecture hall moves, the impulse response or transfer function will change. So it can be time-variant, and it is unknown and needs to be estimated.

That is what we are going to consider now, and we do this by data-dependent, statistically optimum beamforming.

We first formulate the model in the short-time Fourier transform domain. So we go from the time domain to the short-time Fourier transform domain, which means we take chunks of the signal, compute a DFT on each chunk, then move the chunk a bit forward, take a new DFT, and so on. The two parameters here are the time frame index and the frequency bin index.

So y is now the vector of microphone signals y_0 up to y_{N-1}, and, under some assumptions, it can be modeled as the product of the source signal s we are interested in and the acoustic transfer function vector from the source to each of the microphones, plus a noise term. I should at least mention that this model is already an approximation, because it assumes that the impulse response is shorter than the analysis window of the DFT; but I will use this model for the whole talk.

The beamformer output signal is then the beamforming coefficients applied to this input signal, and in the following I will leave out the arguments t and f; so this is the filter output.
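In code, applying a set of per-frequency beamforming coefficients to the multichannel STFT is just an inner product per time-frequency bin; a minimal sketch with my own array-shape convention:

```python
import numpy as np

def apply_beamformer(W, Y):
    """W: (freqs, mics) beamforming coefficients, Y: (frames, freqs, mics) STFT.
    The output at bin (t, f) is w(f)^H y(t, f)."""
    return np.einsum('fm,tfm->tf', np.conj(W), Y)
```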

Now, how do we determine these beamforming coefficients w in a statistically optimum way? The first criterion one would probably come up with is the MMSE criterion, the minimum mean squared error criterion: we would like to determine the beamforming coefficients such that the mean squared error between the beamformer output and some desired signal is as small as possible. That is what we know from other optimisation tasks as well.

And what is the desired signal? One could of course use as desired signal the source signal s which we would like to enhance. Then the beamformer has the task of both beamforming and dereverberation, because what we would like to restore with this desired signal is the source signal at the position of the source, not at the microphone; so it should also dereverberate, that is, remove the effect of the sound propagation from the source to the microphones. Or one could use an alternative criterion, where the desired signal is the image of the source signal, x, at the microphone, and then we want to do beamforming only, where beamforming means suppressing noise from other directions.

So let's now solve this problem. We have here the mean squared error; this is the beamformer output, so beamforming coefficients times microphone signal, and this is the desired signal. If we just plug in our definition of y from before, which is no big deal, then we can rewrite it in this way, where sigma_s squared is the power, or variance, of the source signal, and this Sigma_N is the covariance matrix of the noise at the microphones; we have N elements, so it is an N-by-N covariance matrix of the noise. You can see that the mean squared error consists of two terms: a speech distortion term and a noise term. The speech distortion is the deviation of the beamformer output from the desired output, and this is the contribution of the noise, which is independent of the desired speech signal.

If we formulate it in this way, it is really not difficult to carry out this minimization, and this is the result: these are the optimal beamforming coefficients which minimize the mean squared error. Sigma_N is the noise covariance matrix, and a is the acoustic transfer function vector from the source to the individual microphones. This w is called the multichannel Wiener filter.

There are variations on that. One is that you plug in a trade-off parameter mu here, and by tuning mu you can trade off speech distortion against noise suppression. For example, for a small mu you force the beamformer to introduce as little speech distortion as possible, and if you increase mu, this term is weighted less and the beamformer is forced more towards suppressing the noise. So this is what you can control with mu. If you just introduce the mu here, the beamforming coefficients do not change much; the one is simply replaced by mu, and this is called the speech distortion weighted multichannel Wiener filter.

But you can also look at the extreme cases, mu going to zero or to infinity. Mu equal to one is what we had already. For mu going to zero, the speech distortion term gets a very high weight, so we make sure that there is no speech distortion; the resulting beamformer is called the minimum variance distortionless response, MVDR, beamformer. This objective function minimizes the noise at the beamformer output while making sure that the speech is not distorted.

The other extreme case is mu going to infinity. Then we do not care about the speech distortion, but we would like to have the noise suppressed as much as possible at the beamformer output, so we would like to maximize the signal-to-noise ratio at the beamformer output. For mu going to infinity this is called the maximum SNR beamformer.

The beamforming coefficients for mu equal to zero you can read off right away, the mu simply disappears; and for mu going to infinity it is some scaling factor times the numerator here, the inverse noise covariance matrix times the acoustic transfer function vector.
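To make these relationships concrete, here is a small numpy sketch of the variants for a single frequency bin, given an acoustic transfer function vector a, a noise covariance matrix Sigma_N, and the speech variance; the formulas follow the standard forms of the SDW-MWF, MVDR, and maximum-SNR beamformers discussed above:

```python
import numpy as np

def beamformers(a, Sigma_N, sigma_s2, mu=1.0):
    """One frequency bin: a is the ATF vector, Sigma_N the noise covariance,
    sigma_s2 the speech variance, mu the speech-distortion/noise trade-off."""
    Sn_inv_a = np.linalg.solve(Sigma_N, a)              # Sigma_N^{-1} a
    quad = (np.conj(a) @ Sn_inv_a).real                 # a^H Sigma_N^{-1} a

    w_sdw_mwf = sigma_s2 * Sn_inv_a / (mu + sigma_s2 * quad)   # general mu
    w_mwf     = sigma_s2 * Sn_inv_a / (1.0 + sigma_s2 * quad)  # mu = 1
    w_mvdr    = Sn_inv_a / quad                          # mu -> 0: distortionless
    w_max_snr = Sn_inv_a                                 # mu -> inf, up to a scalar
    return w_sdw_mwf, w_mwf, w_mvdr, w_max_snr

# All variants share the direction Sigma_N^{-1} a and differ only by a scalar,
# i.e. by a single-channel postfilter, as discussed below.
```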

So the different criteria can be visualised like this: we have this trade-off parameter mu. If we let mu go towards zero, we make sure that the speech is preserved, so it is not distorted, but we might not get a lot of noise suppression; then we are at this end, the minimum variance distortionless response case. And if we go to the other end, with a very high mu, we have the largest possible noise suppression, but the speech might sound distorted at the beamformer output.

What is also interesting to see, and what you can already see from the last slide, is that these different criteria, like MVDR and maximum SNR, differ only in a complex scalar, which means in a single-channel filter at the output, also called a postfilter. If you look here, the numerator is always the same; if we change mu, we only change the scalar in the denominator. So this is just a complex scalar; no multichannel processing is necessary to go from one beamforming objective function to the next. What we could do, therefore, is design a maximum SNR beamformer and then use an appropriate single-channel filter, called a postfilter, to turn this maximum SNR beamformer into an MVDR beamformer; so from here to here it is, overall, an MVDR beamformer.

So what I have said so far is the following. We should look at acoustic transfer functions, and not only at the steering vector with the delays, if we talk about reverberant environments, and reverberant environments are always present if we are in a room. Outdoors we do not need to consider reverberation, but in a room we do, and then acoustic transfer functions have to be used instead of pure delays. And the beamformer criteria differ only in a single-channel linear filter.

What I am going to look at now is that the acoustic transfer function vector a and this noise covariance matrix Sigma_N are unknown, and possibly time-variant, so we need to estimate them, and the goal is to estimate them from the noisy speech signal at the microphones. That is what we consider now: this parameter estimation block, which then delivers the beamformer coefficients for one of the criteria.

There are several methods to determine this acoustic transfer function, for example one which exploits the nonstationarity of the speech signal, but the method that we have been working on for quite some time is to estimate the acoustic transfer function by eigenvalue decomposition. That works as follows.

This was our signal model: the vector of microphone signals is the acoustic transfer function vector times the desired source signal, let's call this product x, plus the noise. If we compute the covariance matrix of y, so the expectation of y times y Hermitian, then, since s and n are uncorrelated, which we can assume, we get the spatial covariance matrix of the speech-related part plus that of the noise. And it is clear that the speech-related covariance matrix can be written as a a-Hermitian times the variance of the speech term. So the spatial covariance matrix of the microphone signals is a a-Hermitian times the speech variance plus the covariance matrix of the noise.

For example, if you just look at this speech part here, it is easy to see that the principal eigenvector of this part is just a times some scalar, depending on how you normalize a, because if you plug it into the eigenvalue equation, Sigma_X times the eigenvector equals some lambda times the eigenvector, you see that it really solves this equation. Maybe I should write it down, it is really not difficult: Sigma_X times the eigenvector, and as the eigenvector I use some scalar c times a. So we get a a-Hermitian times the variance of the speech signal, times c times a. Now a-Hermitian times a is a scalar, so altogether this is a scalar times the vector a, namely lambda times c times a, with lambda equal to the speech variance times a-Hermitian a. So indeed this solves the eigenvector equation. Therefore, if we do an eigenvector decomposition of Sigma_X, we can recover the acoustic transfer function; we can estimate the acoustic transfer function. That is what I wanted to say with this slide.
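A tiny numpy check of this rank-one argument, with a randomly chosen toy transfer function vector (purely illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
n_mics, sigma_s2 = 4, 2.0
a = rng.normal(size=n_mics) + 1j * rng.normal(size=n_mics)   # toy ATF vector

Sigma_X = sigma_s2 * np.outer(a, np.conj(a))      # rank-one speech covariance
eigvals, eigvecs = np.linalg.eigh(Sigma_X)        # Hermitian eigendecomposition
v = eigvecs[:, -1]                                # principal eigenvector

# v equals a up to a complex scaling factor, and the principal eigenvalue
# equals sigma_s2 * a^H a, as derived above.
print(np.allclose(a / v, (a / v)[0]))
print(np.isclose(eigvals[-1], sigma_s2 * np.linalg.norm(a) ** 2))
```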

Or we can look at the generalized eigenvalue problem, where we also include Sigma_N. If you look at this eigenvalue problem, the principal eigenvector solving this generalized eigenvalue problem is, in principle, a complex scalar times this term here, the inverse of the noise covariance matrix times the acoustic transfer function vector. So we can estimate this term by eigenvector decomposition; that is the message of this slide.

So with the principal eigenvector of the generalized eigenvalue problem we can realize right away the maximum SNR beamformer, because the principal eigenvector is, in principle, exactly this term here. And if we have the right routine, it is not even necessary that Sigma_N be invertible; we just need to solve the generalized eigenvalue problem, so in principle this also works if Sigma_N is not invertible. However, there is an arbitrary scaling factor, because any scaling of an eigenvector is again an eigenvector of that problem.

The other beamformers, like the MVDR beamformer, we can realize as well if we do an eigenvector decomposition of the covariance matrix of the speech-related part of the microphone signals, because this gives us a, the acoustic transfer function vector, and with it the numerator and denominator of the MVDR beamforming filter. So we can also realize an MVDR beamformer, but then we also need the inverse of Sigma_N, whereas here it is not necessary to compute the inverse explicitly.

So with eigenvector decomposition we can determine the acoustic transfer function and, with that, the beamforming coefficients.
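As a sketch of how this could look in code, assuming the two covariance matrices are already available and Sigma_N is positive definite (scipy's generalized Hermitian eigensolver factorizes it internally, so no explicit inverse is formed):

```python
import numpy as np
from scipy.linalg import eigh

def gev_beamformer(Sigma_X, Sigma_N):
    # Principal eigenvector of the generalized eigenvalue problem
    # Sigma_X w = lambda * Sigma_N w; this is the max-SNR (GEV) beamformer,
    # defined only up to an arbitrary complex scaling factor.
    eigvals, eigvecs = eigh(Sigma_X, Sigma_N)   # ascending eigenvalues
    return eigvecs[:, -1]

def mvdr_beamformer(Sigma_X, Sigma_N):
    # ATF estimate: principal eigenvector of the speech covariance matrix,
    # then the classical MVDR formula (this one does need Sigma_N^{-1}).
    a = np.linalg.eigh(Sigma_X)[1][:, -1]
    Sn_inv_a = np.linalg.solve(Sigma_N, a)
    return Sn_inv_a / (np.conj(a) @ Sn_inv_a)
```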

So what we did is: we estimate the acoustic transfer function by eigenvalue decomposition. Actually, we now know how to determine the acoustic transfer function, but what we still need are these covariance matrices of the speech-related part of the microphone signals and of the noise. So we have solved one problem and gained a new one, because now we need to estimate Sigma_X and Sigma_N, the covariance matrices of the speech term and of the noise term of the microphone signals.

Now there are a couple, or rather many, procedures for doing that, and basically most of them use a two-stage procedure. They first determine, for each time-frequency point, whether it is dominated by speech or by noise. That is called speech presence probability estimation; it is, so to say, a voice activity detector with a very high resolution, namely at the level of individual time-frequency points. We would like to determine, for each time-frequency point, whether it contains just noise, pure noise, or whether it is dominated by speech. If we have this speech presence probability map, or mask, then we can estimate these matrices from it. And that is the way I am going to deal with it in the following.

So this speech presence probability estimation, which should determine for each time-frequency point whether it is speech or noise, is basically the following: we have a noisy spectrogram, and what we would like to have is the mask, the identification of those time-frequency points which are dominated by speech; it looks something like this.

To do that, there have been a lot of techniques based on so-called a priori and a posteriori SNR estimation and local spectro-temporal smoothing. I am not going to talk about those; they were the preferred methods until a few years ago.

Then we, in Paderborn, developed a method which I found very elegant: we interpreted this as a two-dimensional hidden Markov model, with correlations, or transition probabilities, along the time axis and along the frequency axis, and then we did inference in this two-dimensional hidden Markov model to determine the posterior, and the posterior was the speech presence probability. But eventually it turned out that a neural network did a much better job, and with that I am finally at the other half of my talk title, 'neural network supported'. So what I discuss now is how we can do this speech presence probability estimation with a neural network.

Here is the setup. A neural network is used for speech presence probability estimation, and we use it in the following way. We have the microphone signals, and we have a network for each channel; however, we tie the weights between the individual networks. The input to the neural network is the magnitude spectrum, and the network is supposed to predict an ideal mask; more on that on the next slide. So it should predict, for each time-frequency point, whether it is dominated by speech or by noise. We apply this to each channel separately and then merge, or pool, the channels. This can be done by averaging the outputs or by taking the median; the median turned out to be a bit more robust in case one of the channels was broken. And the output here is, for each time-frequency point, the probability of being speech, and here, of being noise.

Once we have these masks, or presence probabilities, we can compute the spatial covariance matrices of the speech and of the noise; this is illustrated here. We estimate the spatial covariance matrix of the speech by this outer product, but we take only those time-frequency points where our neural network has said that this is really speech, and for the noise estimate we take only those time-frequency points where the network has said that this is really noise. With that we estimate these covariance matrices. And once we have the covariance matrices, we plug them into one of the optimisation criteria, MVDR or maximum SNR, to get the beamforming vector w.
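A minimal numpy sketch of this mask-weighted covariance estimation, assuming the multichannel STFT and the pooled masks are already available (the array shapes are my own convention, not the one from the slides):

```python
import numpy as np

def masked_covariances(Y, speech_mask, noise_mask):
    """Y: (frames, freqs, mics) complex STFT;
    masks: (frames, freqs) values in [0, 1] from the mask estimator."""
    def weighted_psd(mask):
        # sum_t mask(t,f) * y(t,f) y(t,f)^H, normalized by the mask sum per bin
        num = np.einsum('tf,tfm,tfn->fmn', mask, Y, np.conj(Y))
        return num / np.maximum(mask.sum(axis=0), 1e-10)[:, None, None]
    return weighted_psd(speech_mask), weighted_psd(noise_mask)

# Sigma_X, Sigma_N = masked_covariances(Y, speech_mask, noise_mask)
# followed, per frequency bin f, by e.g. gev_beamformer(Sigma_X[f], Sigma_N[f]).
```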

So that is basically it already.

Yes, please?

Yes, you are right; one could perhaps estimate that separately, or subtract it. We tried that, but we did not find an effect or an improvement, so we stuck with this approach. But you are right.

No, for the mask estimation we do not use the phase; basically we look at whether the magnitude at each point is below or above some threshold, something like that. The phase, of course, is necessary for the beamforming coefficients, but not for the mask estimation; the phase basically enters through the estimation of these covariance matrices.

Here is the network in more detail. We have a noisy speech signal at the input of the network, and at the output we would like to predict the speech mask and the noise mask. So for each time-frequency point, if it is currently dominated by speech, this output should be high and this one low. The neural network is thus operated like a classifier: it has to predict one or zero, speech or noise, for each time-frequency point, and the objective function is, since it is a classifier, simply the cross entropy.

This is one configuration which worked pretty well. Here we had four layers: the first layer was a bidirectional LSTM layer, followed by three feed-forward layers. At the input we had the magnitude spectrum for all frequencies, and the outputs are the speech mask and the noise mask. These values can be anywhere between zero and one, they do not need to be binary, and they also do not need to sum to one, so it can happen that a time-frequency point is considered neither speech nor noise because it lies somewhere in between.
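A rough PyTorch sketch of such a mask estimator, just to make the structure concrete; the layer sizes, activations, and feature dimension are placeholders of mine, not the configuration from the slides:

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """One BLSTM layer followed by three feed-forward layers, with sigmoid
    outputs for a speech mask and a noise mask (values in [0, 1])."""
    def __init__(self, n_freq=513, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(n_freq, hidden, batch_first=True, bidirectional=True)
        self.ff = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * n_freq), nn.Sigmoid(),
        )

    def forward(self, magnitude):             # (batch, frames, n_freq)
        h, _ = self.blstm(magnitude)
        out = self.ff(h)                      # (batch, frames, 2 * n_freq)
        speech_mask, noise_mask = out.chunk(2, dim=-1)
        return speech_mask, noise_mask

# Trained as a classifier with binary cross entropy against ideal binary masks,
# independently for the speech and the noise output (they need not sum to one).
loss_fn = nn.BCELoss()
```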

So what did we do here with this mask estimation? As you have seen, it is single channel: there is one neural network per channel, with tied weights, but each channel is treated separately. So it is independent of the array configuration and of the number of microphones. We could train it with six-microphone data and test it with three-microphone data, and it could be a linear array in training and a circular array in the test; that is possible. You can see that as an advantage, but I would also say it is a disadvantage, because for the mask estimation we do not exploit spatial information, since we only look at a single channel.

What is different from most of the parametric approaches that existed before the neural network era is that at the input we have the whole DFT vector, so we treat all frequencies jointly, whereas in beamforming one usually treats each frequency separately; here they are treated jointly. It is not immediately suitable for online processing, because we have the BLSTM layer, so we would need to avoid the backward pass; in the configuration considered here it is currently an offline method.

Here are some example speech and noise masks that were estimated with this method on the CHiME database, and you can see that it recovers the harmonic structure of speech pretty well. This is the noise mask, where we have high values, for example, in between the words.

Now I can play the input signal and the beamformer output signal for this example. [Audio examples are played.] Of course I have picked a very good example; they do not all sound like that, but this one was a good one.

Here is a different view, maybe a less flattering aspect of it.

Here we compared the maximum SNR beamformer, here called the generalized eigenvalue (GEV) beamformer, so the one which maximizes the signal-to-noise ratio, and the MVDR beamformer, which makes sure that there is no speech distortion. What you see is the SNR at the beamformer output for individual utterances of the CHiME challenge, and on the ordinate the log of the condition number of the noise covariance matrix Sigma_N. In the MVDR case we have to compute the inverse of Sigma_N to determine the coefficients, because in the numerator there is Sigma_N inverse times a. What this perhaps shows is that if the log condition number is high, which means the noise covariance matrix is ill conditioned, then the generalized eigenvalue beamformer, which solves the generalized eigenvalue problem, seems to give somewhat higher SNRs at the output than the MVDR. Maybe you can see that, but I do not want to make a strong point out of it. In the MVDR we have to explicitly compute the inverse of Sigma_N, and that may be problematic for some of the utterances where there is just a little noise or just few observations. So in our case the maximum SNR criterion worked a bit better than the MVDR criterion; but people from NTT in Japan, who use a similar approach, found in their case the MVDR to work better than the GEV, so maybe it is about the same.

One point I would like to make is this: if we take the maximum SNR beamformer, we do not care about speech distortions. And indeed, if we take no care of that at all, the resulting signal sounds, as in the example on the next slide, different depending on where the noise is located. If the noise is predominant at the low frequencies, then after beamforming the signal sounds a bit high-pass filtered, because the beamformer has suppressed the low frequencies where the noise was strong. That means the speech signal has been distorted, because it now sounds high-pass filtered. If we do speech recognition, this really is not a big deal, because the acoustic model can learn it if the input signal looks a bit different; we did not find a big difference in speech recognition results whether we accounted for this distortion with a postfilter or not. But if you want to do speech enhancement, that is, reconstruct the time domain signal, then you should be careful to reduce these speech distortions, and we developed a method to control them; I can explain that in more detail later on.

Without going into too much detail here: what we try to do is design a single-channel postfilter g such that the concatenation of the acoustic transfer function vector, the beamformer, and the postfilter gives a response of one in the desired direction. This can actually be solved for g if we assume an anechoic transmission; so if there is no reverberation we can compute g from that. This is of course an approximation, since in reality we have reverberation, but in this way we could compute it, and we then had a single-channel postfilter which removed most of these speech distortions.
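For reference, a published form of this blind analytic normalisation gain for the GEV/max-SNR beamformer looks as follows; I am quoting it from memory, so treat the exact expression as an assumption rather than as the formula from the slides:

```python
import numpy as np

def ban_gain(w, Sigma_N):
    """Blind analytic normalization postfilter for a max-SNR (GEV) beamformer:
    g = sqrt(w^H Sigma_N Sigma_N w / M) / (w^H Sigma_N w), with M microphones."""
    n_mics = len(w)
    num = np.sqrt((np.conj(w) @ Sigma_N @ Sigma_N @ w).real / n_mics)
    den = (np.conj(w) @ Sigma_N @ w).real
    return num / den

# Per frequency bin f and frame t:
# z[t, f] = ban_gain(w[f], Sigma_N[f]) * np.conj(w[f]) @ y[t, f]
```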

Here I have an example, with no post-processing and with this normalisation. This is not as good an example as the one before. First the input signal. [Audio example is played.] Now the maximum SNR beamformer output without taking any care of the speech distortions. [Audio example is played.] You hear that the speech signal sounds different. And with this blind analytic normalisation we can reduce this high-pass filtering effect. [Audio example is played.] But of course this also comes at the expense of a smaller SNR gain.

Now I have some results on the CHiME challenge data, CHiME-3 and CHiME-4. There were recordings from four different environments: cafe, street, bus, and pedestrian area. In CHiME-3 there was just the six-channel scenario; in CHiME-4 there were also a two-channel and a one-channel scenario. There were two kinds of recordings in these environments: simulated data, where noise recorded in these environments was artificially added, but also real speech recorded in these environments. The recording setup was a tablet with six microphones mounted at the frame; the person held the tablet in front of themselves and spoke the sentences they were supposed to speak, on the bus or in the pedestrian area or wherever. So that was the scenario.

One should say that this is, in a sense, not the most difficult scenario, because there is only a slow variation of the speaker position: you hold the tablet like this; it is not the case that the microphones are fixed and you walk around the room. So there is little position variation, and we have both simulated noise and real recordings.

Here are some results concerned with speech enhancement, measured by the PESQ score, which is supposed to measure speech quality, although I do not know how well it really represents speech quality. This figure, which I have taken from another publication, so it contains some results I am not going to discuss, shows the PESQ score of the speech output after the beamformer. 'Oracle' here means: if we knew by an oracle which time-frequency bins are corrupted by noise and which ones represent speech, so if we had the oracle speech mask and the oracle noise mask, this is the quality we could achieve; the higher the better. This was the result with the estimation by the BLSTM network, which is almost as good as the oracle one. These ones I am going to skip; they were other network configurations and other training scenarios. And here are two results from parametric approaches; this one is also from my group, from a few years ago, the previously mentioned two-dimensional hidden Markov model. You can see that the neural network supported mask estimation gave better speech quality than the parametric methods.

Now I have some speech recognition results on CHiME-3. There were development sets and evaluation sets, there were simulated scenarios, where the noise was artificially added, and real recordings in the noisy environments, where people really spoke on the bus and so on. This is the setup delivered by the organizers, the baseline scenario; these are the word error rates. This is our speech presence probability estimation method from a few years back, and this is a method from NTT, also from a few years back. This one here uses the BeamformIt beamformer, which some of you may know; it is a weighted delay-and-sum beamformer. And here is the maximum SNR beamformer with the neural network used for mask estimation. You can see that it performed pretty well, also on the real recordings, the real recordings in noisy environments.

So, yes?

Right, exactly; that is a point I want to make now, because it is an important one.

So, what I have talked about so far: we use this neural network based speech presence probability estimation, or neural network based mask estimation, to identify the time-frequency bins from which we estimate the power spectral density matrices of speech and of noise, and from these matrices the beamformer coefficients can be estimated. Now your point was: the neural network training requires stereo data. We need to have, separately, the clean signal at the microphones, so the signal component which came from the desired source s, and, separately, the noise, so that we can construct the targets for the neural network training, the ideal binary masks, that is, the speech mask and the noise mask. So we need this stereo data for training.

Furthermore, the mask definition is itself somewhat heuristic: what do you declare as time-frequency points dominated by speech? We said that, in the speech-only case, we take those time-frequency points which account for ninety-nine percent of the total power of the signal. But it is debatable whether that is the best choice, so there are some heuristics in it.
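A small numpy sketch of that heuristic target, under my assumption that the strongest bins are selected over the whole utterance (whether the selection was done per utterance or per frequency is not stated in the talk):

```python
import numpy as np

def ideal_speech_mask(clean_stft, keep=0.99):
    """Mark as 'speech' the time-frequency bins that together account for
    `keep` (e.g. 99%) of the total power of the clean speech signal."""
    power = np.abs(clean_stft) ** 2
    order = np.argsort(power, axis=None)[::-1]            # strongest bins first
    cumulative = np.cumsum(power.ravel()[order])
    n_keep = np.searchsorted(cumulative, keep * power.sum()) + 1
    mask = np.zeros(power.size, dtype=bool)
    mask[order[:n_keep]] = True
    return mask.reshape(power.shape)
```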

So the question is: can we overcome this strict requirement of stereo data? We tried to overcome this limitation by end-to-end training; that is what I would like to talk about next.

So now I am at this part, beamforming and speech recognition. By end-to-end training I mean training through the whole chain of computations, which is depicted here. We have the whole processing chain, starting with the microphone signals; then the neural network for mask estimation; then the pooling, or condensing, of the per-channel masks into a single mask for speech and one for noise; then the speech and noise covariance estimation; the beamformer; and our postfilter to remove the speech distortions. That is up to here. Now comes the speech recognition: here is the mel filterbank and the computation of the delta and delta-delta coefficients, then the acoustic model, the neural network for the acoustic model, and then the decoder.

What I mean by end-to-end training is that we would like to propagate the gradient of the cross entropy criterion of the acoustic model training back through all these processing blocks, up to the neural network for mask estimation. If we manage to do that, we do not need a target like an ideal speech mask for training this network; we can derive the gradient from the cross entropy criterion of the acoustic model training, and then we do not need stereo data anymore. So that is what we tried to do.

What we have to take care of is that in between these computations we are in the complex domain: the beamforming coefficients are complex-valued, and the covariance matrices are complex-valued. So we are here in the real-valued domain, in between in the complex-valued domain, and here we are back in the real-valued domain again. We therefore have to consider gradients with complex-valued arguments. What I have denoted here is that the cross entropy criterion of the acoustic model training is a function of the spatial covariance matrices from which we compute the beamforming coefficients, and these are complex-valued; and what we eventually want to train are the coefficients of the neural network for mask estimation, which are of course real-valued again.

What we did, and I am not going to go into detail because there is a report on it, is to use the Wirtinger calculus to compute the complex derivatives, because the cost function is not a holomorphic function. This Wirtinger calculus is well known in adaptive filter theory; people who do adaptive filtering use it a lot, because there you often have complex-valued coefficients. With it one can compute these gradients. The crucial step was that we have this maximum SNR beamformer whose coefficients are determined by an eigenvalue decomposition, so we have to compute the derivative of the principal eigenvector of this generalized eigenvalue problem with respect to the PSD matrices which come out of the neural network mask estimator. For that we have a report, which I also have with me, where you can look up how this is done, because it is a rather long derivation.

So, for the CHiME challenge this really worked, and here are some results; let's see whether we can make sense of them. First, here are again the baseline results, so to say: as the beamformer we used the BeamformIt delay-and-sum beamformer mentioned before, and we did a separate acoustic model training; that gives these baseline word error rates. Here we have the system where we trained the components separately: the neural network for the beamformer mask estimation, using ideal binary masks as targets, as we did before, and separately the acoustic model, the neural network for the acoustic model; these are the results.

Then here we tried to do it as on my last slide: we jointly train both networks, the one for mask estimation and the one for the acoustic model, and we started both from random initialization. You can see that this leads to a somewhat worse word error rate; the word error rate increased. The interesting result is actually the next one: here we pre-trained the acoustic model, the neural network for the acoustic model, but the neural network for the beamformer mask estimation was trained by back-propagating the gradient from the acoustic model; so it was randomly initialised and then trained all the way from the back. With that we are even a little bit better than with the separate training. So this shows that, at least for this CHiME challenge, it is possible to dispense with the need for stereo data, and you can achieve the same or slightly better results by training just on the noisy data. And the last row here is where we additionally pre-trained the network for mask estimation with the ideal binary mask targets and only later propagated the gradient from the acoustic model to this first network; that made it yet a little bit better.

But I would like to emphasise: for this data. Because we then tried to achieve the same on the AMI corpus, and so far we have not been successful. So it is not that easy, and CHiME was perhaps a particularly nice corpus in this respect.

So that was basically the story about beamforming for noise reduction. Now I have a few slides where we also do multichannel processing, but not for noise reduction; for two other tasks.

The first one is speech recognition of reverberated speech. What we did here is use the same setup: we have multichannel data and we use a neural network for mask estimation, but now the distortion is no longer noise but reverberation. In the case that you know the impulse responses of the training data, you can determine yourself the ideal speech mask and the ideal mask for the distortion.

For the target we take the dry signal, the non-reverberant data, and convolve it with the early part of the room impulse response, so the first fifty milliseconds. For the interference, which was the noise in the earlier case but is now the reverberation, we convolve the dry signal with the late part of the room impulse response, so everything after fifty milliseconds, the tail. With that we can derive ideal binary masks for the target and for the interference, and then the rest remains the same: we can again compute the masks, from them the covariance matrices, and from those the beamforming weights.
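A small sketch of how such targets could be generated from a known room impulse response, assuming a 50 ms split point as in the talk (the exact windowing details are my own simplification):

```python
import numpy as np
from scipy.signal import fftconvolve

def early_late_targets(dry, rir, fs, split_ms=50.0):
    """Convolve the dry signal with the early and the late part of the RIR;
    the early part defines the target, the late tail the interference."""
    split = int(round(split_ms * 1e-3 * fs))
    early, late = rir.copy(), rir.copy()
    early[split:] = 0.0            # direct path + early reflections
    late[:split] = 0.0             # late reverberation tail
    target = fftconvolve(dry, early)
    interference = fftconvolve(dry, late)
    return target, interference    # target + interference == fftconvolve(dry, rir)
```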

We tested that on the data set of the REVERB challenge, where WSJ data are convolved with measured room impulse responses, and there are again simulated test data as well as real recordings in a reverberant environment.

And here are some results for that. In the real recordings there was a distinction between 'near', which means the distance between the microphones and the speaker was about, I do not remember exactly, fifty centimetres to one metre, and 'far', which was about two metres; but you can see that the difference in word error rate is not very large. With the GMM-HMM baseline we get these results; the baseline results are single channel, there was no multichannel baseline. Then, with the method I just explained, taking the late reverberation part for the distortion mask and otherwise using the same setup as before, we obtain these word error rates on these two parts of the data set, and with a better acoustic model it can be further improved. So it also worked in this case, for suppressing reverberation.

So that was one example of another application. And here is my final and last example, where this neural network based mask estimation is used for noise tracking in single-channel speech enhancement. Here is a typical setup of, let's say, traditional single-channel speech enhancement: we have the noisy speech signal at the input, we are already in the STFT domain here, and then we manipulate only the magnitude; the phase is usually left unchanged. We compute a gain function, a time-varying gain function, by which we multiply the microphone signal to suppress the noise. This time-varying gain function is computed from the so-called a priori SNR, and to compute that we need the noise power spectral density. This noise power spectral density is now estimated with the neural network; that is really the only change.

So, noise tracking by a neural network: we did it with a similar, or the same, methodology as before, as in the mask-based speech enhancement. We estimate a spectral mask for the noise, which indicates for each time-frequency bin whether it is dominated by noise or not. If it is dominated by noise, we can update the noise estimate for this a priori SNR estimator, and if it is dominated by speech, we just hold the old estimate.
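A minimal per-frame sketch of this idea, combining the mask-gated noise update with a standard decision-directed a priori SNR estimate and a Wiener gain; the smoothing constants are illustrative placeholders, not values from the talk:

```python
import numpy as np

def enhance_step(noisy_power, noise_psd, prev_clean_power, noise_mask,
                 alpha_n=0.9, alpha_dd=0.98):
    """One frame, all arrays of length n_freq.
    noise_mask comes from the neural network (close to 1 where noise dominates)."""
    # Mask-gated noise tracking: update only where the network indicates noise,
    # otherwise hold the previous noise PSD estimate.
    noise_psd = np.where(noise_mask > 0.5,
                         alpha_n * noise_psd + (1 - alpha_n) * noisy_power,
                         noise_psd)
    snr_post = noisy_power / np.maximum(noise_psd, 1e-12)
    snr_prio = (alpha_dd * prev_clean_power / np.maximum(noise_psd, 1e-12)
                + (1 - alpha_dd) * np.maximum(snr_post - 1.0, 0.0))
    gain = snr_prio / (1.0 + snr_prio)          # Wiener gain from the a priori SNR
    clean_power = gain ** 2 * noisy_power       # fed back for the next frame
    return gain, noise_psd, clean_power
```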

So only this part is changed with respect to the otherwise traditional speech enhancement. Here are also some examples: this is a noisy spectrogram, this is the ideal binary mask for the noise, and here is the noise presence probability estimated by the neural network; this is another method to compare it with. It looks fairly similar to what we had before: with this mask estimation we can estimate the noise in the signal.

And here are some results, where we have on the left-hand side the performance of the noise estimator and on the right-hand side the performance of the speech enhancement system. We compared against quite a lot of state-of-the-art noise estimation methods, as you can see here, and what we have plotted are two error measures for the noise estimate: the log-error variance versus the log-error mean. Both the variance and the mean should be small, so ideally the best method should be here, in the lower left corner, and that is actually the DNN-based noise estimator, this one here; these are all the other methods. And here we have the speech enhancement performance, the output SNR versus the speech quality measured by this quality measure, where the upper right corner is the best. Again, this neural network based noise mask estimator worked pretty well.

So these were applications of using a neural network for speech and/or noise mask estimation, and I think it is a pretty powerful approach. And, at least for the CHiME challenge, I should say, the requirement of stereo data can be overcome by end-to-end training. But I think there is still a lot to be done. First of all, it is not online, or for most of the presented results it was not online, and one would like to have an online system with low latency.

Then, I think matters change if we have a moving speaker; here the speaker was more or less stationary, with the tablet held by the person speaking. And of course it is much more difficult if there is also overlapping speech, which we did not consider here. So that's it; this was our list of references. Thank you.

I think that is no problem; that is easy to implement. But, as I said, earlier we would have said the maximum SNR is better than the MVDR, whereas now I would rather say it is about the same, so it does not matter much whether we take mu equal to one or to zero or whatever. I think it would neither improve nor degrade matters; but that is just my feeling.

I think we did not listen to it, because we did not go back to the time domain; we stayed in the short-time frequency domain. But that is a good point, we should listen to it.

Yes. The spectrograms I have seen are short-time spectrograms, but those were not from the end-to-end training; I presume it looks similar, though, because the results were not that different between the two.

I think at the moment it is mainly the overlapping speech, and also the very short utterances, which are too short for our covariance estimation; I think six seconds or so, but I do not know exactly.

Oh yes, there was one more question, yes.

You mean the size of the neural network? I had some figures on some of the slides; it is not as large as a neural network for an acoustic model, by far not, but it is still significantly larger than a parametric noise tracker, that is for sure. In more detail I cannot say.

Basically, I think the motivation for doing it in that domain is that in the days of the parametric approaches we always worked in the magnitude domain when we were doing speech enhancement. I think we tried the log domain as well, but no, I do not have a comparison.

Yes. Ah, now I know what you mean. I cannot tell you much about that; I think we tried it and then we stuck with this, but I do not know for sure.

I think the difference is that we have a multichannel signal, and for the beamforming we exploit the phase. So for multichannel data, I think doing it with explicit beamforming is a good idea.

yeah yes yeah

yeah

yeah

No, as far as the last application is concerned, I also think there are other solutions which achieve the goal of this one, but it fitted nicely into my story.

We tried it with feed-forward layers only; those were the results I skipped. That was a pure feed-forward network without the recurrent layer, and it was a bit worse, but not too much.

So I think the online processing and the latency are not the issue, but if the speaker moves a lot, I think you also have to do something on the test data, not rely solely on the trained system with the mask estimation; you have to do some tracking or whatever on the test set as well. I think that is the larger issue.

Yes. No, it is like the example I discussed before in detail, with the noise suppression; I only briefly showed how I obtain the targets for the neural network mask estimator. In the noise suppression case, the speech presence was the target for the speech mask estimator, and the time-frequency bins that contained just noise were the target for the noise estimator. Then we use these covariance matrices in the beamformer objective function, and here we use the same beamformer objective function, also with these covariance matrices Sigma_X and Sigma_N.

The question is what to take as the signal and what as the noise: as the signal x we consider the early part of the signal, and as the 'noise' we consider the late part of the signal. So to estimate Sigma_N, we take the input signal which is not reverberated and convolve it with the late part of the impulse response; this gives us the distortion, and then we use this beamformer framework to remove that distortion. It is a bit difficult to call it beamforming, actually, but that is what it is.

yeah

Yes. Sixty-four.