first let me introduce mike seltzer from microsoft
he has been there since two thousand and three
and his interests are in noise robust speech recognition, microphone array processing, acoustic model adaptation and speech enhancement
in two thousand and seven he received the best young author paper award from the ieee signal processing society
and from two thousand and six to two thousand and eight he was a member of the speech and language technical committee
and he was also the editor in chief of its electronic newsletter
so many of us used to receive emails from him whenever the newsletter came out
he is also an associate editor of the ieee transactions on speech and audio processing
and the title of his talk is robust speech recognition: more than just a lot of noise
so here's michael
good afternoon thanks for that great introduction george and thanks to the organising committee for inviting me to give this talk it's really an honour to be here with all of you
and i hope to keep things interesting in this session after lunch and hope the food coma won't set in too badly
so let's get started
so
i've been in the field for about ten years or so and from my vantage point it really seems like we're in what i'll call almost a golden age of speech recognition
as we all know there's a number of mainstream products that everyone obviously is using that involve speech recognition of course there's a huge proliferation of mobile phones and data plans and voice search and things like that
speech is also widely deployed now in automobiles
in fact i'd like to point to the ford sync as one example of a system that took speech in cars from a high end add-on for luxury models to a kind of low end feature in standard packages for low end ford models at a moderate price
and then most recently there's the launch of the kinect add-on for the xbox
which has gesture and voice input
in addition to these three examples of technologies that are out there we're also in many ways swimming in, you know, drowning in data
there's the proverbial fire hose of data coming at us and with cloud based systems all the data being logged on the system, having data in many cases is not a problem
it's knowing what to do with that data that is sort of the challenge these days
and finally on a personal note i think the one that is most meaningful for me is the fact that all this is happening so that i no longer have to keep explaining to my mother what it is i do on a daily basis
i'm not sure she'd be so happy that i used her face on the slide
i won't tell her if you won't
nevertheless in spite of all this success there are lots of challenges still out there there's new applications and as an example here this is the virtual receptionist from dan bohus and colleagues at msr, sort of a project in situated interaction and multiparty engagement
there's always new devices this is a particularly interesting one from tanja schultz and her students using an electromyography interface where it actually measures the muscle activity at your skin as the input
i have another colleague who was working on speech input using microphone arrays inside a space helmet for, you know, systems that are deployed for spacewalks
and of course as thomas friedman wrote the world is becoming flatter and flatter and there's always new languages and new cultures that our systems come in contact with
so addressing these challenges takes data and data is time consuming to collect and it's also expensive to collect
as an alternative i'd like to propose that we can extract additional utility from the data we already have
the idea is that by reusing and recycling the existing data we have we can potentially reduce the need to actually collect new data
and you can think of it informally as making better use of our resources so what i'd like to focus my talk on today is how we can take ideas from this process to help speech recognition go green
so the symbol for my talk today will be how speech recognition goes green through reduce, recycle and reuse of information
much like the logo you've probably seen a lot
okay so first i'm gonna talk about one aspect of this, in the sense of reduce
so
we know that our systems suffer, because these are just statistical pattern classifiers, when there's mismatch between the acoustic models we have and the data that we see at runtime and one of the biggest sources of this mismatch is environmental noise
so the best solution of course is to retrain with matched data and this is either expensive or impossible depending on what your definition of matched data is
if matched data is, you know, i'm in a car on a highway then that's reasonable to collect it's just a little bit time consuming if matched data is i'm gonna be in this particular model of car on this particular road with this particular noise then of course it's impossible to do
so as an alternative we have standard adaptation techniques that are tried and true and part of the standard toolkits things like map or mllr adaptation they're really great because they're generic and computationally efficient the only downside is they need sufficient data in order to be trained properly
as an alternative what i'd like to discuss here is the way we can exploit a model of the environment
and by doing this the adaptation method will get some sort of structure imposed on it and as a result we get lots of efficiencies in the adaptation process
so before we go into the details let's just take a quick look at the effect of noise on speech through the processing chain i'm showing the processing chain for mfccs and for similar features like plps or lpcs it's much the same
so we know that in the linear domain, the waveform domain, speech and noise are additive
that's not too bad to handle
once you're in the power domain it's again additive except there's this additional cross term here which is the correlation between speech and noise we generally assume speech and noise are uncorrelated so we kind of ignore that term and sweep it away
now things get a little bit trickier once you go through the mel filterbank and the log operation because in the log domain you get this little bit of a nasty relation which says that the noisy features y that we observe can be described as the clean speech features plus some nonlinear function of the clean speech and the noise
and then that goes through a linear transform, the dct, and we get a vector version of the same equation
for the purpose of this talk i'm going to back up to before the dct because it's easier to visualise things in one or two dimensions rather than thirty nine dimensions
so let's talk about this equation here
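as a minimal sketch of that relation (my own illustration, not from the talk), here's the scalar mismatch function in numpy, assuming a single log filterbank channel and uncorrelated speech and noise:

```python
import numpy as np

def noisy_log_mel(x, n):
    """Log filterbank energy of speech plus noise in one channel,
    assuming power spectra add and the cross term is ignored:
        y = log(exp(x) + exp(n)) = x + log(1 + exp(n - x))"""
    return x + np.log1p(np.exp(n - x))

# the two ends of the curve discussed below
print(noisy_log_mel(10.0, 0.0))  # high snr: ~10.0, y is essentially the clean speech
print(noisy_log_mel(0.0, 10.0))  # low snr:  ~10.0, y is essentially the noise
```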
so because speech and noise have a symmetric relationship, x plus n is n plus x, we can swap the positions of x and n in this equation here and if we do that and bring common terms to each side of the equation you get a slightly different expression
and what's interesting here is that because these are log domain operations this is basically a function of two signal-to-noise ratios, something that in speech enhancement and signal processing is called the a posteriori snr, which is the snr of the observed speech compared to the noise, and the a priori snr, which is the snr of the unknown clean speech compared to the noise
and so if we look at what the relationship is along this function we have a curve like this
and this curve makes a lot of intuitive sense if we look at different points along it so we can say for example up here in the upper right of the curve we have high snr, basically noise is not much of a factor and the noisy speech is equal to the clean speech
in a similar way at the other end of the curve we have a low snr and it doesn't really matter what the clean speech is it's completely dominated by the noise so y equals n in that case
and of course, you know, the million dollar question is how do we handle things in the middle the nonlinearity is something that needs to be dealt with
there's an added complication to this which is that earlier we sort of swept this cross correlation between speech and noise under the rug
but it turns out that yes it is zero in expectation but it's actually not zero it's a distribution that has nonnegligible variance
if we plot data on this curve and see how well it matches the curve you see the trend is that the data lies along that line but there's actually significant spread around it
what that means is that even if we're given the exact value of the clean speech and the exact value of the noise in the feature domain we actually can't predict exactly what the noisy feature will be the best we can do is predict what its distribution will be
and handling this additional uncertainty makes things we wanna do, like model adaptation, even more complicated
so there have been a number of ways to look at transferring this equation into the model domain
if we do that, again this nonlinearity presents some pretty great challenges
if we look at the extremes of the curve it's quite straightforward at high snrs if we do an adaptation the noisy distribution is gonna be exactly the same as the clean distribution
if we go over here to the low snrs, the lower left of the curve, the noisy speech distribution is just the noise distribution
and the real trick is how do we handle this area in the middle right because even if we assume that the speech and the noise are gaussian, when we put these things through this nonlinear relationship what comes out is definitely not gaussian
but of course this is speech recognition and i'd argue that if it's not gaussian we're just gonna assume it's gaussian anyway and so there are various approximations that are made to do this
so the most famous example of doing this, of adapting to noise, is to simply take a linear approximation, to linearize around this curve this is the famous vector taylor series algorithm by pedro moreno
the idea here is you simply have an expansion point that's given by the mean of the gaussian you're trying to adapt and the mean of your noise
and you simply linearize the nonlinear function around that point and once you have a linear function doing the adaptation is very straightforward we know how to transform gaussians subject to a linear transformation
now the trick here is that the transformation is only determined by the means, and the size of the variance of the clean speech and the noise model will determine the accuracy of the linearisation if the variance is very broad, if it's a very wide bell, then the linearisation will not be very accurate because you'll be subject to more of the nonlinearity
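as a rough sketch of that recipe (illustrative only, static log filterbank features, diagonal covariances, channel term omitted):

```python
import numpy as np

def vts_adapt_gaussian(mu_x, var_x, mu_n, var_n):
    """First-order VTS adaptation of one diagonal Gaussian in the log
    filterbank domain: linearise y = x + log(1 + exp(n - x)) around
    the expansion point (mu_x, mu_n)."""
    g = np.log1p(np.exp(mu_n - mu_x))      # nonlinearity at the expansion point
    G = 1.0 / (1.0 + np.exp(mu_n - mu_x))  # dy/dx there (diagonal Jacobian); dy/dn = 1 - G
    mu_y = mu_x + g
    var_y = G**2 * var_x + (1.0 - G)**2 * var_n
    return mu_y, var_y

# illustrative numbers: a three-channel clean speech Gaussian and a noise estimate
mu_y, var_y = vts_adapt_gaussian(np.array([5.0, 4.0, 3.0]), np.ones(3),
                                 np.full(3, 2.0), np.full(3, 0.5))
```

notice that as G goes to zero at low snr the adapted mean and variance collapse onto the noise model, which is exactly the behaviour of the curve described above.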
so a refinement of that idea that we looked into is something called linear spline interpolation, which sort of posits, well, if one line works well many lines must work better
and so the idea is to simply take this function and approximate it using a linear spline, which is the idea that you have a series of knots, which are basically the places where you see the dots in this figure, and between the dots you're doing a linear approximation which is quite accurate
and in fact because it's a simple linear regression you can have a variance, an error, associated with each segment when you learn your model and that'll account for that spread of the data around the curve
and then when you figure out what to do at runtime you can use all the splines weighted by the pdf rather than just having to pick a single one determined by the mean
so essentially, depending on how much mass of the probability is under each of the segments, that tells you how much contribution of that linearisation you're gonna use in your final approximation
so you're incorporating a linearisation based on the entire distribution rather than just the mean and the nice thing is the spline parameters can be trained from stereo data or they can also be trained in an integrated way using maximum likelihood in an hmm framework
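a small sketch of that runtime weighting idea (my own illustration with made-up knot positions): the probability mass of the prior that falls in each spline segment decides how much that segment's linearisation contributes.

```python
import numpy as np
from scipy.stats import norm

def segment_weights(knots, mu, sigma):
    """Mass of a scalar Gaussian prior falling between consecutive spline
    knots; the tails are lumped onto the two outer segments."""
    cdf = norm.cdf(knots, loc=mu, scale=sigma)
    w = np.diff(cdf)
    w[0] += cdf[0]
    w[-1] += 1.0 - cdf[-1]
    return w / w.sum()

knots = np.linspace(-20.0, 20.0, 9)   # hypothetical knots along the snr axis
print(segment_weights(knots, mu=2.0, sigma=5.0))
```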
so those were just two examples of a linearisation approach another approach is the sampling based methods
i think the most famous example of this is the data-driven pmc work from mark gales in ninety six, but that method requires, you know, tens of thousands of samples for every gaussian you're trying to adapt so it's completely infeasible, though it is a good upper bound you can compute
but the unscented transform is a very elegant way to sort of do clever sampling the idea is you just take certain sigma points, and again because we can assume things are gaussian there's a simple recipe for what these sampling points are
you take a small set of points, in this case it's typically, you know, less than a hundred points, pass them through the nonlinear function you know to be true under your model, and then you can compute the moments and basically estimate p of y
again, depending on how spread out the variance of the distribution you're trying to adapt is, that will determine how accurate this adaptation is
so there's a further refinement of this method, proposed recently, called the unscented gaussian mixture filter in this case you take a very broad gaussian and simply chop it up into a gaussian mixture where within each gaussian the variance is small and a simple linear approximation works quite well and the sampling works quite efficiently, and then you combine all the gaussians back on the other side
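here's a bare-bones sketch of the unscented transform itself (generic and illustrative, diagonal covariance; in this setting f would be the log-domain mismatch applied to a stacked speech-and-noise vector):

```python
import numpy as np

def unscented_moments(mu, var, f, kappa=1.0):
    """Propagate a diagonal Gaussian N(mu, diag(var)) through a
    nonlinearity f using 2d+1 sigma points instead of random samples."""
    d = len(mu)
    spread = np.sqrt((d + kappa) * var)
    pts = [mu] + [mu + spread[i] * np.eye(d)[i] for i in range(d)] \
               + [mu - spread[i] * np.eye(d)[i] for i in range(d)]
    w = np.full(2 * d + 1, 1.0 / (2.0 * (d + kappa)))
    w[0] = kappa / (d + kappa)
    y = np.array([f(p) for p in pts])
    mu_y = w @ y
    var_y = w @ (y - mu_y) ** 2
    return mu_y, var_y

# example: push a gaussian over [speech, noise] through y = x + log(1 + exp(n - x))
f = lambda p: np.array([p[0] + np.log1p(np.exp(p[1] - p[0]))])
print(unscented_moments(np.array([5.0, 2.0]), np.array([1.0, 0.5]), f))
```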
so those were just four examples and there are a handful of others out there in the literature
but one thing i've tried to convey here is that in contrast to standard adaptation you'll notice i didn't talk at all about data and observations we just talked about how to adapt the model all we had was the hmm parameters for x and the noise model for n
so what's nice about these systems, excuse me, is that basically all we need is an estimate of what the noise is in the signal and given that we can actually adapt every single gaussian in the system because of the structure imposed on the adaptation process
and in fact if we can sort of sniff what the environment is before we even see any speech we can do this in the first pass which is very nice and of course you can refine this in a second pass by doing, you know, em type updates of your noise parameters
so of course under this model the accuracy of the technique is largely due to the accuracy of the approximation you're using those were the four examples i showed earlier and essentially people who work in this area are basically trying to come up with better approximations to that nonlinear function
other alternatives also focus on more explicitly modeling that uncertainty between the speech and the noise that accounts for the spread in the data in the earlier figure
so just to give a sense of how these things work this is aurora two which is a standard noise robustness task it's a noisy connected digit task
for people who care it's the complex back-end, clean-trained system, so it's sort of the best baseline you can create with this data
and you can see that doing standard things like cmn is not great when you do cmllr, and again this is on one utterance, you may not have enough data to do the adaptation correctly so you get a small gain but not a huge win
the etsi advanced front-end shown there is, i guess, representative of the state-of-the-art in the front-end signal processing approach to doing this, where no models are used you treat this as a noisy signal and enhance it in the front end
and if you do vts, the plain algorithm, ignoring that correlation between speech and noise, that spread of the data, you get about the same performance
and now if you actually account for that variance in the data by tuning a weight in your update, which i won't get into the details of, you get a pretty significant gain
that's a really nice result the problem with it is that the value of the weight that's actually optimal is theoretically implausible and breaks your entire model so that part is a little bit unsatisfying
in addition that weight often does not generalise across corpora and then we see that you get about the same results if you use the spline interpolation method where you have the linear regression model that does account for the spread in a more natural way
and again all of these numbers are first pass numbers they could be refined further with a second pass
so while this shows we can get nice gains by adapting with this structure, there's been a little bit of dirty laundry i was trying to cover up, which is that the environmental model is completely dependent on the assumption that the hmm is trained on clean speech
and as you all know clean speech is kind of an artificial construct it's something we can collect in the lab but it's not very generic
it also means that if we deploy a system out in the world and we collect the data that comes in, that data is extremely valuable for updating and refining our system, but if it's noisy and our system can only take clean data we can't use that data and we have a problem
so a solution to that problem has been proposed, referred to as noise adaptive training, also proposed as joint adaptive training
and the idea is basically, you can think of it as a sort of little brother or little sister to speaker adaptive training
in the same way that speaker adaptive training tries to remove speaker variability from your acoustic model by having some other transform absorb the speaker variability, we wanna have the same kind of operation happen to absorb the environmental variability
what this allows you to do is actually incorporate training data from different sources into a single model, which is helpful if you think about a multi-style model we can take all kinds of data from all different conditions and mix it all together
the model will model the noisy speech correctly but it'll have a lot of variance that is just modeling the fact that the data is coming from different environments
and that's not gonna help you with phonetic classification
and if you're in a data scarce scenario this can become very important
so again just to make it a little bit more explicit here's the general flow for speaker adaptive training you have some multi-speaker data and a speaker independent hmm
that then goes into a process where you iteratively update your hmm and some speaker transforms, most commonly using cmllr, and this process goes back and forth until convergence and what's left over is a speaker adapted hmm
so in noise adaptive training the exact same process happens, except the goal is to remove the environmental variability from multi-style, multi-environment data
so what happens here is we have, again, i guess you could call it an environment independent model but that's a bit of a misnomer, it's really, for parallel structure i'll call it that, a model trained on data from lots of environments
and then in your iterative process you're basically trying to model and account for the noise or channel distortion that's in all of your data with other parameters so that the hmm is free to model the phonetic variability
in this case typically the noise, that is the environmental, parameters are updated on a per utterance basis rather than a per speaker basis because there are few parameters and so you're able to estimate those
and what comes out is a noise adapted hmm
again the nice thing here is that because you can do this potentially in the first pass you don't need to keep the first environment independent or noise independent model around like you do in speaker adaptive training you can directly operate all the time on the noise adapted hmm
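in pseudo-python the loop looks roughly like this (a sketch, not the actual training recipe; estimate_noise_params, vts_adapt_model, collect_em_stats and update_hmm are hypothetical placeholders for the per-utterance noise estimation, the structure-based adaptation of every gaussian, and the usual em accumulation and update):

```python
def noise_adaptive_training(hmm, utterances, n_iters=4):
    """Sketch of noise (joint) adaptive training: per-utterance noise
    parameters absorb the environment so the HMM can model the phonetics."""
    for _ in range(n_iters):
        stats = []
        for utt in utterances:
            noise = estimate_noise_params(hmm, utt)   # few parameters, so one utterance suffices
            adapted = vts_adapt_model(hmm, noise)     # adapt every Gaussian via the noise model
            stats.append(collect_em_stats(adapted, utt, noise))
        hmm = update_hmm(hmm, stats)                  # leaves a "noise adapted" HMM
    return hmm
```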
so here are some results with noise adaptive training
this is aurora two with noisy multi-style training data and you can see this is the result for cmn, just cepstral mean normalisation
now if we try to apply the vts algorithm, which assumes the model is clean, in this case that assumption is broken, and so while we do get an improvement over the baseline the results are not nearly as good
and then we get back to getting nice gains when we actually do this adaptive training
and we see similar performance on the aurora three task an interesting thing there is that because that's real data collected in a car there is no clean data to train on, and so you actually need an approach like this to run this technique successfully on a corpus like that
so to summarise our first prong of the triangle, reduce
model adaptation as you all know can reduce environmental mismatch
when you impose this environmental structure, determined by the model, the adaptation is incredibly data efficient if you think about it, in general you need an estimate of your noise mean and your noise variance and potentially also a channel mean, so that's basically thirty nine plus thirty nine plus thirty nine, you know, about a hundred and twenty parameters to estimate, which is really very little
and you could even, for example, if you assume that your noise is stationary, eliminate even the delta and delta-delta features of your noise and only estimate the static features so that's even fewer parameters
doing the adaptation unfortunately is computationally quite a chore i mean adapting every gaussian in your system is probably overkill to do on an utterance-by-utterance basis but you can improve the performance by using regression classes as has been shown in earlier work
the other thing is that we can reduce the environmental variability in the final model we have by doing this noise adaptive training and this is helpful when we're in scenarios where there's not much data to work with
the other consideration to remember is that although i've shown maximum likelihood systems here, these can be integrated with discriminative training
and there is a huge sort of parallel literature to this where the same exact algorithms are used in the front-end, where you replace the hmm with a gmm and you do this as a front-end feature enhancement scheme it's basically the same exact operation with the goal of generating an enhanced version of the cepstra and it uses the exact same mathematical framework
and the nice thing there is that if the data that you work with is noisy you can also do the same adaptive training technique on the front-end gmm and still use those techniques
so now i wanna move on from reduce to recycle
and in this case what i'm gonna talk about is changing gears from the noise to the channel, and talking about how we can recycle narrowband data that we have
i think it's not a very controversial statement to say that voice over data is replacing voice over the wire
especially in speech applications, when you're speaking to a smart phone your voice is not, you know, making a telephone call anymore it's going over the data network to some server
and when you do that you're not restricted to telephone bandwidth you can basically capture, subject to, you know, bandwidth constraints or latency constraints, arbitrary bandwidth and the point here is that, where possible, wideband data is preferable
the gains do vary, you know, depending on whether you build an equivalent system with narrowband or wideband data but they are consistent
for example if you look at a car the gains you get are larger in that noisy context because a lot of the noise in the car is at low frequencies sort of the rumble of the highway and the tires creates a lot of low frequency noise, so having the high frequency energy in the plosives and affricates is really helpful for discriminability
and of course wideband is also sort of becoming the standard for just human communication there are wideband codecs, amr wideband is the european standard, and skype now is going to a wideband codec or even an ultra wideband codec
so the fact that people prefer it sort of also implies that machines would probably prefer it too
that said there are existing stockpiles of narrowband data from all the systems we've been building over the years, and for many low resource languages in the developing world mobile phones are still prevalent and i don't think they're gonna go away that soon, so we want the ability to do something useful with that data
so what i'd like to propose is, is there a way to use the narrowband data to help augment the wideband data we have in data scarce scenarios to build a better wideband acoustic model
the inspiration for this came from the signal processing literature maybe ten or fifteen years ago people proposed bandwidth extension speech processing
again it comes from the fact that we know that people prefer wideband speech it turns out it's not any more intelligible unless you're looking at isolated phones, they're actually both equally intelligible, but things like listener fatigue and just personal preference come across much higher for wideband speech
and so the way these algorithms operated was that they basically learned correlations between the low and high frequency spectrum of the signal
so here's just a poor, first grade drawing of a spectrum i'd like to say that my four year old did this but i did it myself
so this is sort of, you know, a spectral envelope with a couple of formants, and if i ask you guys to predict what is sort of on the other side of the line, you know, you'd maybe predict something like that
it seems pretty reasonable you'd probably, you know, maybe draw a different slope or put a formant in a different location, but it's not for example gonna go up, you would doubt that it would
and so what we can do is basically use something like a gaussian mixture model to learn gaussian dependent mappings from low to high band spectra
and then a simple thing we could do is to say let's just generate wideband features from narrowband features
and if you're familiar with the missing feature literature, in missing features you say i have some components of my features that are too corrupted by noise, so i decide to remove them and then try to fill them in from the surrounding reliable data
this is like doing missing features with a deterministic mask given by the telephone channel
you're simply taking some amount of wideband data and some potentially large amount of narrowband data, you're trying to convert that narrowband data into pseudo wideband features, and you go and train an acoustic model that way
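as a rough illustration of that mapping (not the system in the talk): given a joint gmm over stacked low band and high band log filterbank features, the missing high band for one frame can be predicted as the mmse estimate, the responsibility-weighted sum of per-component conditional means.

```python
import numpy as np
from scipy.stats import multivariate_normal

def extend_frame(low, weights, means, covs):
    """MMSE high-band prediction for one low-band frame under a joint
    full-covariance GMM over [low, high] features (parameters given)."""
    d = low.shape[0]
    resps, conds = [], []
    for w, mu, S in zip(weights, means, covs):
        mu_l, mu_h = mu[:d], mu[d:]
        S_ll, S_hl = S[:d, :d], S[d:, :d]
        # responsibility from the low band only, conditional mean of the high band
        resps.append(w * multivariate_normal.pdf(low, mean=mu_l, cov=S_ll))
        conds.append(mu_h + S_hl @ np.linalg.solve(S_ll, low - mu_l))
    resps = np.array(resps) / np.sum(resps)
    return np.sum(resps[:, None] * np.array(conds), axis=0)
```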
so this actually works okay, works pretty well, and here's an example
this is a wideband log mel-spectrogram on the left and this is that same speech but through a telephony channel you can see obviously the information below three hundred hz and above thirty four hundred hz has gone missing, so to speak
and the idea of this bandwidth extension in the feature domain is to say can we do something to fill it back in
and in this particular case, you know, it's not perfect, but where there's red in one picture it's generally red in the other picture, so we're sort of capturing the gross features of that data, and we could use that then to train our system
so this is good, but the downside is that if you do it this way in the feature domain you end up with a point estimate of what your wideband feature should be, and if that estimate is poor or it's wrong, you really have no way of informing the model during training to not use that data as much as other estimates that may be more reliable
and so to get this to work you have to do some ad hoc things like corpus weighting, to say okay we have a little bit of wideband data but i'm gonna count those statistics much more heavily than the statistics of my narrowband data, which i have extended and therefore don't trust quite as much, so it's not theoretically optimal
and as a result, you know, a better way is to incorporate this directly into the em algorithm normally we train an hmm with em where the state sequence is the hidden variable, so you can think of this as doing the exact same thing but you're adding additional hidden variables for all the missing frequency components that you don't have in the telephone channel
so if you do this you get something that looks like this where the narrowband data goes directly into the training procedure with the wideband data, you have this bandwidth extension em algorithm, and what comes out is a wideband hmm
now i'm not gonna try to go into too many details and i'll really try to keep equations to a minimum, but i just want to point out a few notable things
this is the variance update equation and a few things are interesting i think about this update equation
first of all i should mention the notation i've adopted here is from the missing feature literature, so o is something that you would observe and m is something that's missing you can consider o to be the telephone band frequency components and m to be the missing high frequency components you're trying to model in your hmm
the second thing is that the posterior computation is only computed over the low bands that you have you've actually marginalised out the components you don't have over all your models, and so therefore erroneous estimates that you make in this process don't corrupt your posterior calculations, because you're only computing posteriors based on reliable information that you know is intact
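for a diagonal-covariance system that marginalisation is trivial, you just drop the missing dimensions, roughly like this (an illustrative sketch, names mine):

```python
import numpy as np

def log_gauss_observed(frame, mu, var, observed):
    """Log likelihood of one frame under a diagonal Gaussian using only
    the observed (telephone band) dimensions; the missing high-band
    dimensions are marginalised out, so poor high-band estimates never
    touch the state posteriors."""
    o, m, v = frame[observed], mu[observed], var[observed]
    return -0.5 * np.sum(np.log(2.0 * np.pi * v) + (o - m) ** 2 / v)
```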
the other interesting thing is that rather than having an estimate that's global across all your data you actually have a state conditional estimate, where the estimate of the wideband feature is determined by the observation at time t as well as the state you're in
and so that says the extended wideband feature can be a function of both the data i see as well as whether i'm in a vowel or a fricative or a plosive, for example
and finally there's this variance piece at the end here which then says, in general, for this particular gaussian, how much uncertainty overall is there in trying to do this mapping
so maybe i'm in a case where doing this mapping is really hard because there's very little correlation between the bands at that time-frequency point, so we will have high variance there and the model can reflect the fact that the estimates we're using may be poor
so if we look at the performance here we've taken a wall street journal task we took the training data and partitioned it into a wideband set and a narrowband set at some proportion
and the idea is that if you look at the performance of the all wideband system, that's the lower line, it's about ten percent
and if you take the entire training set and sort of telephonize it all you end up with the upper purple curve, that's sort of the narrowband system
the goal of this is to say, given some wideband data and bandwidth extending the rest of the narrowband data, how much can we close that gap
so here we're comparing the results of the feature domain version and the model domain version, and we can see that when we have a split of eighty twenty the performance is about the same, and so in that case, you know, why go through all the extra computation, the feature version works quite well
interestingly, once you go to a more extreme case where only ten percent of the training set is actually wideband and the rest is narrowband, doing the feature version actually does worse than just training an entire narrowband system, because there's lots of uncertainty in the extension that you do in the front end which is not reflected in your model at all
but if we do the training in this integrated framework we end up with performance that again is better than or equal to all narrowband
so to summarise this second prong of the triangle here, recycle
narrowband data can potentially be recycled for use with wideband data this may allow us to use the existing piles of legacy data we have
and for a new system that we want to build, narrowband data may be easier to collect and may be able to supplement the small amount of wideband data
you can do this in the front end or we can come up with this sort of integrated training framework
and like the noise robustness case, there is a front-end version that i talked about and there are advantages to that i should sort of point out
it allows you, if you do this in the front end, to use whatever features you want you can then take the output and postprocess it, use bottleneck features, stack a bunch of frames and so on, and so you have a little bit more flexibility in what you wanna do downstream from this process
and the other interesting thing is that the same technology can be used in the reverse scenario, where the input may be narrowband and the model is actually wideband
you may think, where does this happen, but this actually happens in systems a lot, as soon as someone puts on a bluetooth headset
you could have a wideband enabled system and somebody decides that they wanna, you know, be safe and hands-free and puts on a bluetooth headset, and all of a sudden what comes into your system is narrowband
if you don't do something about it, well, you're gonna get killed if you don't go hands-free, but anyway, if you don't do something about it your performance is gonna suffer
and so, you know, one option would be to maintain two models on your server the other idea is you can actually do bandwidth extension in the front end and process that by your wideband recognizer
the nice thing there is that you don't have to be as good as true wideband performance, you just have to be better than or as good as what the narrowband performance would be, and then it's worth it to do that
so finally i'd like to move on to the last component here, reuse, and talk about the reuse of speaker transforms
so one of the things that we've found is that the utterances in the applications that are being deployed commercially now are really short
and so, you know, in seattle obviously people search for starbucks quite a bit, or movie showtimes, or in the living room scenario 'xbox play movie' may be all that you get
in addition to that these are rarely rich dialogue interactive systems, so these are sort of one shot things where you speak, you get a result and you're done
so the combination of these two things makes it really difficult to obtain sufficient data for doing conventional speaker adaptation from a single session of use, so doing things like mllr or cmllr becomes quite difficult in the single utterance case
and so an obvious solution to this is to say well let's just accumulate the data over time across sessions we have users, you know, making multiple queries to the system, so let's aggregate it all together and then we'll have enough data to build a transform
the difficulty comes in because these are now applications on mobile phones, which means the people are obviously mobile too, and across all these different uses they're actually in different environments
and that creates additional variability in the data that we accumulate over time
so as a little cartoon, i guess, or a metaphor here, let's imagine a user calls the system and the observation comes in as y, and that's some combination of the phonetic content, which i'm showing as a white box, some speaker specific information, shown as a blue box, and some, you know, environmental background information, shown as the red box
so the system gets the speech and says oh okay we'll do our adaptation and store away the transform, so the next time this user calls it will be loaded up and ready to go
so sure enough sometime later the user calls back
and the phonetic content, you know, may or may not be the same
the speaker is the same
but now, you know, he or she is in a different location or different environment, and so the observation is now green instead of purple, and as a result we can do adaptation on that call using the stored transform but mismatch persists so this is not optimal
and so what we would like is a solution where the variability, when we do something like adaptation, can be separated or factored apart
so that we can say let's just hold onto the part that's related to the speaker and sort of throw away the part that's due to the environment, or alternatively store the part that's for the environment, so that if we ever see a different user call back from that same environment we can actually reuse that as well
so in order to do this sort of factorisation or separation of the different sources of variability, you actually need an explicit way to do joint compensation it's very hard to separate these things if you don't have a model that explicitly models them as individual sources of variability
and so to do this there are several pieces of work that have been proposed
it's sort of like being at a diner where you get to choose one from column a and one from column b you can sort of take, you know, all your favourite speaker adaptation algorithms, and you can take all the schemes that apply for environmental adaptation, pick one from each column and combine them, and then you can have a usable model
the amusing thing is that this was sort of proposed ten years ago, but as far as i can tell, with the exception of joint factor analysis in two thousand five, there's not been that much work on it since, and now it sort of seems to have come on the scene again, which is good i think it's nice to have more people working on this
so of all the possible combinations of methods that can do this joint compensation together, i'm gonna talk about one particular instance using cmllr transforms, mostly because i've already talked about how vts is used, and so i'm trying to show several different ways you can go about doing compensation for noise
so in this case we're gonna talk about the idea that you can use a cascade of cmllr transforms, one that captures environmental variability and one that captures speaker variability
a nice thing about using transforms like this is that while we give up the benefit of all the structure we had in an environmental model using solutions like vts, we get the ability to be much more flexible, meaning that we have no restriction on what features we can use or what data the system is trained from, and we don't have to do adaptive training schemes like noise adaptive training
the idea is quite simply to find the set of environmental transforms and the set of speaker transforms that maximise the likelihood of a sample of training or adaptation data
now of course, you know, it's not hard to see that this cascade of linear transforms is itself a linear transform, and as a result you can take a linear transform and factor it into two separate transforms in an arbitrary number of ways, many of which will not be meaningful
and so the way that we're gonna get around this is to borrow heavily from the key idea, i think, in joint factor analysis from speaker recognition, which is to say let's learn the transformations on partitions of the training data where we're able to sort of isolate the variability that we're interested in
so pictorially, this slide is a bit busy and i apologise if it gives you a headache, but you can think about the idea that you're basically gonna group the data by speaker, and given those partitions you can update your speaker transforms
then you're gonna repartition your data by environment, keep your speaker transforms fixed, and update your environment transforms, and then go back and forth in this manner
now of course doing this operation assumes that you have a sense of what your speaker clusters and your environment clusters are
there are some cases where it sounds reasonable to assume the labels are given to you so for example if it's a phone, you know, a mobile phone data plan scenario, you can have a caller id or a user id or the hardware address, and so you can have high confidence that you know who the speaker is
similarly for certain applications like the xbox in the living room we can be fairly certain and say okay, this thing is not driving down the road at sixty miles an hour, it probably is in the living room, so we can assume the environment in that case
or if we don't have this information you can rely on environment clustering algorithms or speaker clustering
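as a sketch of that alternating estimation (pseudo-python of my own; estimate_cmllr, apply_transform and identity_transform are hypothetical placeholders for the usual cmllr machinery, and the order of the cascade is a modelling choice):

```python
def factor_transforms(model, utts, speaker_of, env_of, n_iters=3):
    """Alternately estimate speaker and environment CMLLR transforms on
    data partitioned by speaker, then by environment, holding the other
    set of transforms fixed."""
    spk_T = {speaker_of(u): identity_transform() for u in utts}
    env_T = {env_of(u): identity_transform() for u in utts}
    for _ in range(n_iters):
        for s in spk_T:   # partition by speaker; environment transforms held fixed
            data = [apply_transform(env_T[env_of(u)], u) for u in utts if speaker_of(u) == s]
            spk_T[s] = estimate_cmllr(model, data)
        for e in env_T:   # partition by environment; speaker transforms held fixed
            data = [apply_transform(spk_T[speaker_of(u)], u) for u in utts if env_of(u) == e]
            env_T[e] = estimate_cmllr(model, data)
    return spk_T, env_T
```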
and so just to show some results here, the idea is you can again take the training data from a variety of environments and a variety of speakers and estimate some environment transforms on the training data
to do that of course you have to estimate the speaker transforms as well, but in this case the speakers in training and test are distinct, and so those speaker transforms are not useful for us in the reuse scenario
and so what we've tried here is to say let's estimate the speaker transform given data from a single environment, in this case the subway
we then take that transform, either estimated in this way where the sources of variability are factored, or estimated using the conventional cmllr approach, and apply it to data from the same speaker in six different environments, three of which are environments that you've seen in training and three of which are not seen
and you can see in both cases you get a benefit by having the additional transform in there to absorb the variability from the noise, so that the speaker transform can focus on just the variability that comes from the speaker that you care about
and so you can see there's a gain over doing cmllr alone, and that comes again from the fact that this transform is presumably not learning the mapping of the environment plus the speaker, it's ideally learning the transform of just the speaker alone
so in scenarios where speaker data is scarce, reuse is important for adaptation
now if this were a case where each utterance is, you know, ten or fifteen or twenty seconds, these techniques are not nearly as important, but in the case where you only have a second or two of data you wanna be able to aggregate all this data and build a model for that speaker
but when that data comes from places where there's a high degree of variability from other sources, the problem becomes a little more challenging
and this can be environments, it can be devices you have, you know, some data from a phone that's being held up like this, and then you have far field data, and then you have additional data that's from four feet away on your couch
all these things are different, with different microphones, and all these sources are things that are basically blurring the speaker transform you're trying to learn, and you want to isolate those in order to reuse the speaker transform
so doing it this way basically allows a secondary transform to absorb this unwanted variability
and there are various ways of doing it obviously if you have transforms that are specifically modeling different things explicitly it'll be easier to get the separation
if you have things like two generic linear transforms then you need to sort of resort to these data partitioning schemes, which, you know, makes things a little bit more difficult
so here i've just tried to hit a little bit on, you know, three aspects of speech recognition going green in this reduce, reuse, recycle framework
before i conclude i just wanted to briefly touch on, as someone who's worked, i guess, strongly in robustness and these ideas, three personalities that i think people sort of take on with respect to this area, and i wanna sort of address each of them, and you may find yourself thinking you are one of these personas in turn
so i think there's people who are the believers, there's people who are the sceptics, and there's people who i would call the willing, which are sort of the people who say oh well maybe i'll give this a try
and, you know, i think about sort of the resurgence in neural net acoustic modeling as a good example of this, or maybe some auditory inspired signal processing is another example, where there were true believers in sort of acoustic models using neural nets, then there were folks who said well they can't beat an hmm, you know, put that aside, and then, you know, results kind of improved and people said i'd give this a try again, they moved from being sceptics to the willing, and now they've got good results and they're all believers again
and so i wanna sort of talk to each of these very briefly
so i would say to the sceptics, you know, one thing that i think is interesting is that research in robustness in speech recognition has been going on for a long time there's lots of sessions and lots of papers
but if you look at the tasks that have become standard for robust speech recognition, like the ones i talked about today, they're all very small vocabulary tasks compared to today's state-of-the-art systems for things like switchboard and gale and meeting recognition
and in these very large scale systems, like switchboard and gale and meetings, robustness techniques are not really a part of the puzzle there
and so i think it's very fair to ask, are all these methods really necessary in any sort of actually deployed system
to that i would just say yes, it depends, and i sorta wanna give a few very anecdotal examples to sort of motivate why i think this is
if you think of production quality systems that do have all the bells and whistles that everyone knows about, that are common in large scale systems, we see in things like voice search, you know, that in fact the gains are small, and so, you know, it's not really a huge win to employ these techniques, and so it's a fair critique to say there we don't need robustness
as you move to something like the car it turns out that actually the gains are pretty big, and, you know, you can make the system much more usable by incorporating some elements of noise robustness into your system
finally i would actually say with the xbox and kinect, it turns out that, i would say, these systems are actually unusable
if i consider robustness as the entire sort of audio processing front-end plus whatever happens in the recognizer
if we throw all that away and just say let's use this microphone to listen and we'll do everything in the model space, the systems are actually unusable
and so there actually is a large place for this technology in certain scenarios
speaking now to the willing, if someone says well, you know, what's the easiest thing to try, what's the biggest bang for the buck, what i would say is what's loosely called noise adaptive training in the feature space
the idea is very simple you have some training data, and you have some way you believe you can enhance the data at runtime you need to take the training data, put it through the same exact process, and retrain your acoustic model
you can think of this as basically very akin to doing cmllr for speaker adaptive training you're basically updating your features before you retrain your model
it turns out that if you do this you tend to get performance that generally is far superior to trying to compensate noisy speech to recognise it with a clean trained hmm
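a minimal sketch of that recipe (illustrative; enhance, train_acoustic_model and decode are hypothetical placeholders for your front-end, e.g. spectral subtraction, and your usual trainer and decoder):

```python
def feature_space_nat(train_utts, transcripts):
    """Feature-space noise adaptive training sketch: run the same
    enhancement used at test time over the training data, then retrain
    the acoustic model on the enhanced features."""
    enhanced = [enhance(u) for u in train_utts]    # same front-end as at runtime
    return train_acoustic_model(enhanced, transcripts)

def recognise_matched(model, utt):
    return decode(model, enhance(utt))             # matched processing at test time
```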
and if you are gonna try this i think, you know, the standard algorithms are fine, things like spectral subtraction i mean the fanciest ones work great but the improvements are small i think getting the basics working is important
but the important thing is you need to tune and optimize the right objective function i've had, you know, people talk to me and say oh we got, you know, a spectral subtraction component from my friend who's in the speech enhancement part of our lab and i just tried it and it, you know, didn't work at all, and the reason is that these things are optimized completely differently
and so you do need to understand all the details and nuances of what's happening but generally there's a whole set of parameters and floors and weights and things
and those things can all be tuned and you can tune them to, you know, minimize word error rate and that would be great you can do that in a greedy way let's just sweep a whole bunch of parameters until we get the best result
you can also use something called pesq, which is a computational proxy that stands for the perceptual evaluation of speech quality it's basically like a model of what human listeners would say
it turns out that pesq scores are quite correlated with speech recognition performance, and so if you can maximise that, or your signal processing buddies have some algorithm that maximizes pesq, that's a good place to start
and it turns out that doing things like maximising snr is about the worst thing you can do it creates all kinds of distortion artifacts
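a greedy sweep of that kind might look roughly like this (purely illustrative; score stands in for dev-set word error rate or negative pesq):

```python
import itertools

def sweep_front_end(floors, weights, score):
    """Greedy grid search over two hypothetical front-end parameters,
    keeping whichever setting gives the lowest score."""
    return min(itertools.product(floors, weights),
               key=lambda fw: score(floor=fw[0], weight=fw[1]))
```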
so with that i just want to conclude and say that we've proposed that potentially there's goodness to be had by reusing existing data, and we've sort of put this under the metaphor of going green
in each case i've just tried to provide, you know, one example of the way that we can reduce, recycle and reuse the data that we have, either from an environmental mismatch point of view, a bandwidth point of view or a speaker adaptation point of view
there are many other ways to do this, i've just talked about a few, and of course there's more work to be done
and so with that i will thank you
let's thank the speaker
we have plenty of time for questions
so, who has questions for mike
great talk mike i was wondering if you can address some other problems in the robustness area
for example there are many cases where there are gross nonlinear distortions that are applied to the signal, strange things in the communication channel beyond what you talked about
i mean the transform techniques could obviously work on it or anything, but i'm wondering if you have any comments on what you do when there are gross nonlinear distortions of the signal, where the signal is still basically intelligible but it doesn't fit any of the classical speech plus noise models
well the one thing i would say is that that is a hard problem, and thank you, i don't even have a great answer for it
so that feature space adaptive training technique is generic across any kind of distortion, so if you actually have the ability, if you know what that coding or distortion is and you can model it somehow, you can just pass data through it that's probably the best way to handle it it's not very fancy but i think it'll work
the other thing is a lot of these things are bursty, and if that's the case you can actually just detect them by building, you know, whatever classifier, and at that point you can for example say i'm gonna, you know, discount my decoder score or i'm just giving up on these frames, there's no content here that's another way you can do it
i think sort of trying to have an explicit model for, you know, that kind of distortion, i think that probably won't work
and i'd like to believe that there is some way, you know, that we can extend the linear transformation schemes to nonlinear transformations, like some kind of mlp based mllr kind of thing, but, you know, that remains to be seen
and again that doesn't really quite get at the sort of occasional gobbledygook that comes in, i don't think that would really address that, so i think those two other techniques are probably the way to go
i think the one thing that's interesting is the correlation between how people speak and the noise background, or, a kind of, what does adding noise do
so the lombard effect has the obvious loudness of speech thing, which we're pretty able to compensate for, you know, we normalize stuff
but there's the lombard spectral tilt, which means that the louder the noise is, the more vocal effort there is and the more tilt there is to the spectrum, and all that sort of thing
how do the techniques you're talking about address that
it's a whole kind of different problem because the environment model really doesn't capture it, unless you know the signal to noise ratio directly
right, so i think what's interesting about those is those are speaker effects that are manifested by the environment
and so like you said having environment models is not gonna capture that at all
so i don't know, i don't have the exact answer, although i would think that having an environment informed speaker transform kind of thing would be useful
so, you know, potentially your choice of, you know, vtln warp parameters for example could be affected by what you perceive in the environment and the level of speaker effect that you detect
and the other thing of course is sort of the poor man's answer, which would be, you know, i'm not sure how much of this can be modelled again by existing speaker adaptation techniques
again i think a lot of these effects are gonna be nonlinear, and so it's hard to sweep them under the rug with an mllr transform
but so i think that comes at it, you know, in a sense opposite to what i was trying to talk about
i talked about separation of the speech and the noise, and i think you're actually suggesting the opposite, which is a jointly informed transform, which i think is a very enticing area
i don't imagine there's been too much work on it
might the greener features be features that came in and were themselves insensitive to some of this noise
absolutely
well, if i agree with you then i'm through, my whole talk is... i can't agree with you now, maybe at the coffee break i can agree with you
but no, i think that that's true, right, and i think a lot of this comes with the biologically inspired kind of features, and i think that's true
and i think actually, in fact, the work that oriol and suman kind of did kind of shows that, if i remember correctly, they trained a deep net on aurora and got, you know, a high degree of noise robustness just from running the network, it potentially learned some kind of noise invariant features
so no, i think that's true the only problem i think right now, where we are, is that it's hard to come up with sort of a one size fits all scheme
so there's time for one other question, but that's about it
about using gmms to augment the data, with the specific example you gave
basically, as far as i understand, the gmm you mentioned was trained unsupervised, so it basically doesn't consider the transcriptions
in the gmm case, right
but you could also do an hmm there
well
that is, easily, you can use the transcriptions, like phone level transcriptions
can you improve that way
absolutely, yeah, that's what was shown, so
only with the model based technique, or is it also possible with the pure speech feature based technique
yeah, well, that's a good question well yes, but i think you don't necessarily need a very strong model
so, you know, i guess you could have, for example, a phone-loop hmm in the front end, and that is, in a way, using a model based technique
but, you know, getting the state sequence right is actually a problem in the feature technique as well
if you don't put constraints on the search space you can have it, within a phone, skipping around states, and you have inconsistent hypotheses for what the missing band is
and you can alleviate that to some extent if you do a sort of cheap decoding in the front end where you have a phone hmm with a phone language model
and you could do that just so that the benefit of the model is actually restraining your state space to sort of possible sequences of phones
once you have that, i think whether you use it to enhance features or do it in the model domain, you know, both are options
yeah, i mean, i also agree, i think the model domain will be optimal
i think if you start saying well my system runs with eleven stacked frames and hlda and all this other stuff it becomes a little harder to do that
you know, you can sort of just say it's gonna be a blind transform like mllr, but if you wanna put structure in the transform to map the low to high frequencies that gets a little more difficult
okay
let's thank the speaker again