first let me introduce mike seltzer from microsoft
he has been there since two thousand and three
and his interests are in noise robust speech recognition, microphone array processing, acoustic model adaptation and speech enhancement
in two thousand and seven he received the best young author paper award from the ieee signal processing society
and from two thousand and six to two thousand and eight he was a member of the speech and language technical committee
and he was also the editor in chief of its electronic newsletter
so many of us used to receive emails from him whenever the newsletter came out
he is also an associate editor of the ieee transactions on speech and audio processing
and the title of his talk is robust speech recognition: more than just a lot of noise
so here's michael
good afternoon thanks for that great introduction george and thanks to the organising committee for inviting me to give this talk it's really an honour to be here with all of you
and i hope to keep things interesting in this session after lunch and hope the food coma won't set in too badly
so let's get started
so
i've been in the field for about ten years or so and from my vantage point it really seems like we're in what i'll call almost a golden age of speech recognition
as we all know there's a number of mainstream products that everyone obviously is using that involve speech recognition of course there's a huge proliferation of mobile phones and data plans and voice search and things like that
speech is also widely deployed now in automobiles
in fact i'd like to point to the ford sync as one example of a system that took speech in cars from a high end add-on for luxury models to a kind of low end feature in standard packages for low end ford models at a moderate price
and then most recently there's the launch of the kinect add-on for the xbox
which has gesture and voice input
in addition to these three examples of technologies that are out there we're also in many ways swimming in, you know, drowning in data
there's the proverbial fire hose of data coming at us and with cloud based systems all the data being logged on the system, having data in many cases is not a problem
it's knowing what to do with that data that is sort of the challenge these days
and finally on a personal note i think the one that is most meaningful for me is the fact that all this is happening so that i no longer have to keep explaining to my mother what it is i do on a daily basis
i'm not sure she'd be so happy that i used her face on the slide
i won't tell her if you won't
nevertheless in spite of all this success there are lots of challenges still out there there's new applications and as an example here this is the virtual receptionist from dan bohus and colleagues at msr, sort of a project in situated interaction and multiparty engagement
there's always new devices this is a particularly interesting one from tanja schultz and her students using an electromyography interface where it actually measures the muscle activity at your skin as the input
i have another colleague who was working on speech input using microphone arrays inside a space helmet for, you know, systems that are deployed for spacewalks
and of course as thomas friedman wrote the world is becoming flatter and flatter and there's always new languages and new cultures that our systems come in contact with
so addressing these challenges takes data and data is time consuming to collect and it's also expensive to collect
as an alternative i'd like to propose that we can extract additional utility from the data we already have
the idea is that by reusing and recycling the existing data we have we can potentially reduce the need to actually collect new data
and you can think of it informally as making better use of our resources so what i'd like to focus my talk on today is how we can take ideas from this process to help speech recognition go green
so the symbol for my talk today will be how speech recognition goes green through reduce, recycle and reuse of information
much like the logo you've probably seen a lot
okay so first i'm gonna talk about one aspect of this, in the sense of reduce
so
we know that our systems suffer, because these are just statistical pattern classifiers, when there's mismatch between the acoustic models we have and the data that we see at runtime and one of the biggest sources of this mismatch is environmental noise
so the best solution of course is to retrain with matched data and this is either expensive or impossible depending on what your definition of matched data is
if matched data is, you know, i'm in a car on a highway then that's reasonable to collect it's just a little bit time consuming if matched data is i'm gonna be in this particular model of car on this particular road with this particular noise then of course it's impossible to do
so as an alternative we have standard adaptation techniques that are tried and true and part of the standard toolkits things like map or mllr adaptation they're really great because they're generic and computationally efficient the only downside is they need sufficient data in order to be trained properly
as an alternative what i'd like to discuss here is the way we can exploit a model of the environment
and by doing this the adaptation method will get some sort of structure imposed on it and as a result we get lots of efficiencies in the adaptation process
so before we go into the details let's just take a quick look at the effect of noise on speech through the processing chain i'm showing the processing chain for mfccs and for similar features like plps or lpcs it's much the same
so we know that in the linear domain, the waveform domain, speech and noise are additive
that's not too bad to handle
once you're in the power domain it's again additive except there's this additional cross term here which is the correlation between speech and noise we generally assume speech and noise are uncorrelated so we kind of ignore that term and sweep it away
now things get a little bit trickier once you go through the mel filterbank and the log operation because in the log domain you get this little bit of a nasty relation which says that the noisy features y that we observe can be described as the clean speech features plus some nonlinear function of the clean speech and the noise
and then that goes through a linear transform, the dct, and we get a vector version of the same equation
for the purpose of this talk i'm going to back up to before the dct because it's easier to visualise things in one or two dimensions rather than thirty nine dimensions
so let's talk about this equation here
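as a minimal sketch of that relation (my own illustration, not from the talk), here's the scalar mismatch function in numpy, assuming a single log filterbank channel and uncorrelated speech and noise:

```python
import numpy as np

def noisy_log_mel(x, n):
    """Log filterbank energy of speech plus noise in one channel,
    assuming power spectra add and the cross term is ignored:
        y = log(exp(x) + exp(n)) = x + log(1 + exp(n - x))"""
    return x + np.log1p(np.exp(n - x))

# the two ends of the curve discussed below
print(noisy_log_mel(10.0, 0.0))  # high snr: ~10.0, y is essentially the clean speech
print(noisy_log_mel(0.0, 10.0))  # low snr:  ~10.0, y is essentially the noise
```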
so because speech and noise have a symmetric relationship, x plus n is n plus x, we can swap the positions of x and n in this equation here and if we do that and bring common terms to each side of the equation you get a slightly different expression
and what's interesting here is that because these are log domain operations this is basically a function of two signal-to-noise ratios, something that in speech enhancement and signal processing is called the a posteriori snr, which is the snr of the observed speech compared to the noise, and the a priori snr, which is the snr of the unknown clean speech compared to the noise
and so if we look at what the relationship is along this function we have a curve like this
and this curve makes a lot of intuitive sense if we look at different points along it so we can say for example up here in the upper right of the curve we have high snr, basically noise is not much of a factor and the noisy speech is equal to the clean speech
in a similar way at the other end of the curve we have a low snr and it doesn't really matter what the clean speech is it's completely dominated by the noise so y equals n in that case
and of course, you know, the million dollar question is how do we handle things in the middle the nonlinearity is something that needs to be dealt with
there's an added complication to this which is that earlier we sort of swept this cross correlation between speech and noise under the rug
but it turns out that yes it is zero in expectation but it's actually not zero it's a distribution that has nonnegligible variance
if we plot data on this curve and see how well it matches the curve you see the trend is that the data lies along that line but there's actually significant spread around it
what that means is that even if we're given the exact value of the clean speech and the exact value of the noise in the feature domain we actually can't predict exactly what the noisy feature will be the best we can do is predict what its distribution will be
and handling this additional uncertainty makes things we wanna do, like model adaptation, even more complicated
so there have been a number of ways to look at transferring this equation into the model domain
if we do that, again this nonlinearity presents some pretty great challenges
if we look at the extremes of the curve it's quite straightforward at high snrs if we do an adaptation the noisy distribution is gonna be exactly the same as the clean distribution
if we go over here to the low snrs, the lower left of the curve, the noisy speech distribution is just the noise distribution
and the real trick is how do we handle this area in the middle right because even if we assume that the speech and the noise are gaussian, when we put these things through this nonlinear relationship what comes out is definitely not gaussian
but of course this is speech recognition and i'd argue that if it's not gaussian we're just gonna assume it's gaussian anyway and so there are various approximations that are made to do this
so the most famous example of doing this, of adapting to noise, is to simply take a linear approximation, to linearize around this curve this is the famous vector taylor series algorithm by pedro moreno
the idea here is you simply have an expansion point that's given by the mean of the gaussian you're trying to adapt and the mean of your noise
and you simply linearize the nonlinear function around that point and once you have a linear function doing the adaptation is very straightforward we know how to transform gaussians subject to a linear transformation
now the trick here is that the transformation is only determined by the means, and the size of the variance of the clean speech and the noise model will determine the accuracy of the linearisation if the variance is very broad, if it's a very wide bell, then the linearisation will not be very accurate because you'll be subject to more of the nonlinearity
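as a rough sketch of that recipe (illustrative only, static log filterbank features, diagonal covariances, channel term omitted):

```python
import numpy as np

def vts_adapt_gaussian(mu_x, var_x, mu_n, var_n):
    """First-order VTS adaptation of one diagonal Gaussian in the log
    filterbank domain: linearise y = x + log(1 + exp(n - x)) around
    the expansion point (mu_x, mu_n)."""
    g = np.log1p(np.exp(mu_n - mu_x))      # nonlinearity at the expansion point
    G = 1.0 / (1.0 + np.exp(mu_n - mu_x))  # dy/dx there (diagonal Jacobian); dy/dn = 1 - G
    mu_y = mu_x + g
    var_y = G**2 * var_x + (1.0 - G)**2 * var_n
    return mu_y, var_y

# illustrative numbers: a three-channel clean speech Gaussian and a noise estimate
mu_y, var_y = vts_adapt_gaussian(np.array([5.0, 4.0, 3.0]), np.ones(3),
                                 np.full(3, 2.0), np.full(3, 0.5))
```

notice that as G goes to zero at low snr the adapted mean and variance collapse onto the noise model, which is exactly the behaviour of the curve described above.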
so a refinement of that idea that we looked into is something called linear spline interpolation, which sort of posits, well, if one line works well many lines must work better
and so the idea is to simply take this function and approximate it using a linear spline, which is the idea that you have a series of knots, which are basically the places where you see the dots in this figure, and between the dots you're doing a linear approximation which is quite accurate
and in fact because it's a simple linear regression you can have a variance, an error, associated with each segment when you learn your model and that'll account for that spread of the data around the curve
and then when you figure out what to do at runtime you can use all the splines weighted by the pdf rather than just having to pick a single one determined by the mean
so essentially, depending on how much mass of the probability is under each of the segments, that tells you how much contribution of that linearisation you're gonna use in your final approximation
so you're incorporating a linearisation based on the entire distribution rather than just the mean and the nice thing is the spline parameters can be trained from stereo data or they can also be trained in an integrated way using maximum likelihood in an hmm framework
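a small sketch of that runtime weighting idea (my own illustration with made-up knot positions): the probability mass of the prior that falls in each spline segment decides how much that segment's linearisation contributes.

```python
import numpy as np
from scipy.stats import norm

def segment_weights(knots, mu, sigma):
    """Mass of a scalar Gaussian prior falling between consecutive spline
    knots; the tails are lumped onto the two outer segments."""
    cdf = norm.cdf(knots, loc=mu, scale=sigma)
    w = np.diff(cdf)
    w[0] += cdf[0]
    w[-1] += 1.0 - cdf[-1]
    return w / w.sum()

knots = np.linspace(-20.0, 20.0, 9)   # hypothetical knots along the snr axis
print(segment_weights(knots, mu=2.0, sigma=5.0))
```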
so those were just two examples of a linearisation approach another approach is the sampling based methods
i think the most famous example of this is the data-driven pmc work from mark gales in ninety six, but that method requires, you know, tens of thousands of samples for every gaussian you're trying to adapt so it's completely infeasible, though it is a good upper bound you can compute
but the unscented transform is a very elegant way to sort of do clever sampling the idea is you just take certain sigma points, and again because we can assume things are gaussian there's a simple recipe for what these sampling points are
you take a small set of points, in this case it's typically, you know, less than a hundred points, pass them through the nonlinear function you know to be true under your model, and then you can compute the moments and basically estimate p of y
again, depending on how spread out the variance of the distribution you're trying to adapt is, that will determine how accurate this adaptation is
so there's a further refinement of this method, proposed recently, called the unscented gaussian mixture filter in this case you take a very broad gaussian and simply chop it up into a gaussian mixture where within each gaussian the variance is small and a simple linear approximation works quite well and the sampling works quite efficiently, and then you combine all the gaussians back on the other side
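here's a bare-bones sketch of the unscented transform itself (generic and illustrative, diagonal covariance; in this setting f would be the log-domain mismatch applied to a stacked speech-and-noise vector):

```python
import numpy as np

def unscented_moments(mu, var, f, kappa=1.0):
    """Propagate a diagonal Gaussian N(mu, diag(var)) through a
    nonlinearity f using 2d+1 sigma points instead of random samples."""
    d = len(mu)
    spread = np.sqrt((d + kappa) * var)
    pts = [mu] + [mu + spread[i] * np.eye(d)[i] for i in range(d)] \
               + [mu - spread[i] * np.eye(d)[i] for i in range(d)]
    w = np.full(2 * d + 1, 1.0 / (2.0 * (d + kappa)))
    w[0] = kappa / (d + kappa)
    y = np.array([f(p) for p in pts])
    mu_y = w @ y
    var_y = w @ (y - mu_y) ** 2
    return mu_y, var_y

# example: push a gaussian over [speech, noise] through y = x + log(1 + exp(n - x))
f = lambda p: np.array([p[0] + np.log1p(np.exp(p[1] - p[0]))])
print(unscented_moments(np.array([5.0, 2.0]), np.array([1.0, 0.5]), f))
```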
so those were just four examples and there are a handful of others out there in the literature
but one thing i've tried to convey here is that in contrast to standard adaptation you'll notice i didn't talk at all about data and observations we just talked about how to adapt the model all we had was the hmm parameters for x and the noise model for n
so what's nice about these systems, excuse me, is that basically all we need is an estimate of what the noise is in the signal and given that we can actually adapt every single gaussian in the system because of the structure imposed on the adaptation process
and in fact if we can sort of sniff what the environment is before we even see any speech we can do this in the first pass which is very nice and of course you can refine this in a second pass by doing, you know, em type updates of your noise parameters
so of course under this model the accuracy of the technique is largely due to the accuracy of the approximation you're using those were the four examples i showed earlier and essentially people who work in this area are basically trying to come up with better approximations to that nonlinear function
other alternatives also focus on more explicitly modeling that uncertainty between the speech and the noise that accounts for the spread in the data in the earlier figure
so just to give a sense of how these things work this is aurora two which is a standard noise robustness task it's a noisy connected digit task
for people who care it's the complex back-end, clean-trained system, so it's sort of the best baseline you can create with this data
and you can see that doing standard things like cmn is not great when you do cmllr, and again this is on one utterance, you may not have enough data to do the adaptation correctly so you get a small gain but not a huge win
the etsi advanced front-end shown there is, i guess, representative of the state-of-the-art in the front-end signal processing approach to doing this, where no models are used you treat this as a noisy signal and enhance it in the front end
and if you do vts, the plain algorithm, ignoring that correlation between speech and noise, that spread of the data, you get about the same performance
and now if you actually account for that variance in the data by tuning a weight in your update, which i won't get into the details of, you get a pretty significant gain
that's a really nice result the problem with it is that the value of the weight that's actually optimal is theoretically implausible and breaks your entire model so that part is a little bit unsatisfying
in addition that weight often does not generalise across corpora and then we see that you get about the same results if you use the spline interpolation method where you have the linear regression model that does account for the spread in a more natural way
and again all of these numbers are first pass numbers they could be refined further with a second pass
so while this shows we can get nice gains by adapting with this structure, there's been a little bit of dirty laundry i was trying to cover up, which is that the environmental model is completely dependent on the assumption that the hmm is trained on clean speech
and as you all know clean speech is kind of an artificial construct it's something we can collect in the lab but it's not very generic
it also means that if we deploy a system out in the world and we collect the data that comes in, that data is extremely valuable for updating and refining our system, but if it's noisy and our system can only take clean data we can't use that data and we have a problem
so a solution to that problem has been proposed, referred to as noise adaptive training, also proposed as joint adaptive training
and the idea is basically, you can think of it as a sort of little brother or little sister to speaker adaptive training
in the same way that speaker adaptive training tries to remove speaker variability from your acoustic model by having some other transform absorb the speaker variability, we wanna have the same kind of operation happen to absorb the environmental variability
what this allows you to do is actually incorporate training data from different sources into a single model, which is helpful if you think about a multi-style model we can take all kinds of data from all different conditions and mix it all together
the model will model the noisy speech correctly but it'll have a lot of variance that is just modeling the fact that the data is coming from different environments
and that's not gonna help you with phonetic classification
and if you're in a data scarce scenario this can become very important
so again just to make it a little bit more explicit here's the general flow for speaker adaptive training you have some multi-speaker data and a speaker independent hmm
that then goes into a process where you iteratively update your hmm and some speaker transforms, most commonly using cmllr, and this process goes back and forth until convergence and what's left over is a speaker adapted hmm
so in noise adaptive training the exact same process happens, except the goal is to remove the environmental variability from multi-style, multi-environment data
so what happens here is we have, again, i guess you could call it an environment independent model but that's a bit of a misnomer, it's really, for parallel structure i'll call it that, a model trained on data from lots of environments
and then in your iterative process you're basically trying to model and account for the noise or channel distortion that's in all of your data with other parameters so that the hmm is free to model the phonetic variability
in this case typically the noise, that is the environmental, parameters are updated on a per utterance basis rather than a per speaker basis because there are few parameters and so you're able to estimate those
and what comes out is a noise adapted hmm
again the nice thing here is that because you can do this potentially in the first pass you don't need to keep the first environment independent or noise independent model around like you do in speaker adaptive training you can directly operate all the time on the noise adapted hmm
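in pseudo-python the loop looks roughly like this (a sketch, not the actual training recipe; estimate_noise_params, vts_adapt_model, collect_em_stats and update_hmm are hypothetical placeholders for the per-utterance noise estimation, the structure-based adaptation of every gaussian, and the usual em accumulation and update):

```python
def noise_adaptive_training(hmm, utterances, n_iters=4):
    """Sketch of noise (joint) adaptive training: per-utterance noise
    parameters absorb the environment so the HMM can model the phonetics."""
    for _ in range(n_iters):
        stats = []
        for utt in utterances:
            noise = estimate_noise_params(hmm, utt)   # few parameters, so one utterance suffices
            adapted = vts_adapt_model(hmm, noise)     # adapt every Gaussian via the noise model
            stats.append(collect_em_stats(adapted, utt, noise))
        hmm = update_hmm(hmm, stats)                  # leaves a "noise adapted" HMM
    return hmm
```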
so here are some results with noise adaptive training
this is aurora two with noisy multi-style training data and you can see this is the result for cmn, just cepstral mean normalisation
now if we try to apply the vts algorithm, which assumes the model is clean, in this case that assumption is broken, and so while we do get an improvement over the baseline the results are not nearly as good
and then we get back to getting nice gains when we actually do this adaptive training
and we see similar performance on the aurora three task an interesting thing there is that because that's real data collected in a car there is no clean data to train on, and so you actually need an approach like this to run this technique successfully on a corpus like that
so to summarise our first prong of the triangle, reduce
model adaptation as you all know can reduce environmental mismatch
when you impose this environmental structure, determined by the model, the adaptation is incredibly data efficient if you think about it, in general you need an estimate of your noise mean and your noise variance and potentially also a channel mean, so that's basically thirty nine plus thirty nine plus thirty nine, you know, about a hundred and twenty parameters to estimate, which is really very little
and you could even, for example, if you assume that your noise is stationary, eliminate even the delta and delta-delta features of your noise and only estimate the static features so that's even fewer parameters
doing the adaptation unfortunately is computationally quite a chore i mean adapting every gaussian in your system is probably overkill to do on an utterance-by-utterance basis but you can improve the performance by using regression classes as has been shown in earlier work
the other thing is that we can reduce the environmental variability in the final model we have by doing this noise adaptive training and this is helpful when we're in scenarios where there's not much data to work with
the other consideration to remember is that although i've shown maximum likelihood systems here, these can be integrated with discriminative training
and there is a huge sort of parallel literature to this where the same exact algorithms are used in the front-end, where you replace the hmm with a gmm and you do this as a front-end feature enhancement scheme it's basically the same exact operation with the goal of generating an enhanced version of the cepstra and it uses the exact same mathematical framework
and the nice thing there is that if the data that you work with is noisy you can also do the same adaptive training technique on the front-end gmm and still use those techniques
so now i wanna move on from reduce to recycle
and in this case what i'm gonna talk about is changing gears from the noise to the channel, and talking about how we can recycle narrowband data that we have
i think it's not a very controversial statement to say that voice over data is replacing voice over the wire
especially in speech applications, when you're speaking to a smart phone your voice is not, you know, making a telephone call anymore it's going over the data network to some server
and when you do that you're not restricted to telephone bandwidth you can basically capture, subject to, you know, bandwidth constraints or latency constraints, arbitrary bandwidth and the point here is that, where possible, wideband data is preferable
the gains do vary, you know, depending on whether you build an equivalent system with narrowband or wideband data but they are consistent
for example if you look at a car the gains you get are larger in that noisy context because a lot of the noise in the car is at low frequencies sort of the rumble of the highway and the tires creates a lot of low frequency noise, so having the high frequency energy in the plosives and affricates is really helpful for discriminability
and of course wideband is also sort of becoming the standard for just human communication there are wideband codecs, amr wideband is the european standard, and skype now is going to a wideband codec or even an ultra wideband codec
so the fact that people prefer it sort of also implies that machines would probably prefer it too
that said there are existing stockpiles of narrowband data from all the systems we've been building over the years, and for many low resource languages in the developing world mobile phones are still prevalent and i don't think they're gonna go away that soon, so we want the ability to do something useful with that data
so what i'd like to propose is, is there a way to use the narrowband data to help augment the wideband data we have in data scarce scenarios to build a better wideband acoustic model
the inspiration for this came from the signal processing literature maybe ten or fifteen years ago people proposed bandwidth extension speech processing
again it comes from the fact that we know that people prefer wideband speech it turns out it's not any more intelligible unless you're looking at isolated phones, they're actually both equally intelligible, but things like listener fatigue and just personal preference come across much higher for wideband speech
and so the way these algorithms operated was that they basically learned correlations between the low and high frequency spectrum of the signal
so here's just a poor, first grade drawing of a spectrum i'd like to say that my four year old did this but i did it myself
so this is sort of, you know, a spectral envelope with a couple of formants, and if i ask you guys to predict what is sort of on the other side of the line, you know, you'd maybe predict something like that
it seems pretty reasonable you'd probably, you know, maybe draw a different slope or put a formant in a different location, but it's not for example gonna go up, you would doubt that it would
and so what we can do is basically use something like a gaussian mixture model to learn gaussian dependent mappings from low to high band spectra
and then a simple thing we could do is to say let's just generate wideband features from narrowband features
and if you're familiar with the missing feature literature, in missing features you say i have some components of my features that are too corrupted by noise, so i decide to remove them and then try to fill them in from the surrounding reliable data
this is like doing missing features with a deterministic mask given by the telephone channel
you're simply taking some amount of wideband data and some potentially large amount of narrowband data, you're trying to convert that narrowband data into pseudo wideband features, and you go and train an acoustic model that way
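as a rough illustration of that mapping (not the system in the talk): given a joint gmm over stacked low band and high band log filterbank features, the missing high band for one frame can be predicted as the mmse estimate, the responsibility-weighted sum of per-component conditional means.

```python
import numpy as np
from scipy.stats import multivariate_normal

def extend_frame(low, weights, means, covs):
    """MMSE high-band prediction for one low-band frame under a joint
    full-covariance GMM over [low, high] features (parameters given)."""
    d = low.shape[0]
    resps, conds = [], []
    for w, mu, S in zip(weights, means, covs):
        mu_l, mu_h = mu[:d], mu[d:]
        S_ll, S_hl = S[:d, :d], S[d:, :d]
        # responsibility from the low band only, conditional mean of the high band
        resps.append(w * multivariate_normal.pdf(low, mean=mu_l, cov=S_ll))
        conds.append(mu_h + S_hl @ np.linalg.solve(S_ll, low - mu_l))
    resps = np.array(resps) / np.sum(resps)
    return np.sum(resps[:, None] * np.array(conds), axis=0)
```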
so this actually works okay, works pretty well, and here's an example
this is a wideband log mel-spectrogram on the left and this is that same speech but through a telephony channel you can see obviously the information below three hundred hz and above thirty four hundred hz has gone missing, so to speak
and the idea of this bandwidth extension in the feature domain is to say can we do something to fill it back in
and in this particular case, you know, it's not perfect, but where there's red in one picture it's generally red in the other picture, so we're sort of capturing the gross features of that data, and we could use that then to train our system
so this is good, but the downside is that if you do it this way in the feature domain you end up with a point estimate of what your wideband feature should be, and if that estimate is poor or it's wrong, you really have no way of informing the model during training to not use that data as much as other estimates that may be more reliable
and so to get this to work you have to do some ad hoc things like corpus weighting, to say okay we have a little bit of wideband data but i'm gonna count those statistics much more heavily than the statistics of my narrowband data, which i have extended and therefore don't trust quite as much, so it's not theoretically optimal
and as a result, you know, a better way is to incorporate this directly into the em algorithm normally we train an hmm with em where the state sequence is the hidden variable, so you can think of this as doing the exact same thing but you're adding additional hidden variables for all the missing frequency components that you don't have in the telephone channel
so if you do this you get something that looks like this where the narrowband data goes directly into the training procedure with the wideband data, you have this bandwidth extension em algorithm, and what comes out is a wideband hmm
now i'm not gonna try to go into too many details and i'll really try to keep equations to a minimum, but i just want to point out a few notable things
this is the variance update equation and a few things are interesting i think about this update equation
first of all i should mention the notation i've adopted here is from the missing feature literature, so o is something that you would observe and m is something that's missing you can consider o to be the telephone band frequency components and m to be the missing high frequency components you're trying to model in your hmm
the second thing is that the posterior computation is only computed over the low bands that you have you've actually marginalised out the components you don't have over all your models, and so therefore erroneous estimates that you make in this process don't corrupt your posterior calculations, because you're only computing posteriors based on reliable information that you know is intact
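for a diagonal-covariance system that marginalisation is trivial, you just drop the missing dimensions, roughly like this (an illustrative sketch, names mine):

```python
import numpy as np

def log_gauss_observed(frame, mu, var, observed):
    """Log likelihood of one frame under a diagonal Gaussian using only
    the observed (telephone band) dimensions; the missing high-band
    dimensions are marginalised out, so poor high-band estimates never
    touch the state posteriors."""
    o, m, v = frame[observed], mu[observed], var[observed]
    return -0.5 * np.sum(np.log(2.0 * np.pi * v) + (o - m) ** 2 / v)
```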
the other interesting thing is that rather than having an estimate that's global across all your data you actually have a state conditional estimate, where the estimate of the wideband feature is determined by the observation at time t as well as the state you're in
and so that says the extended wideband feature can be a function of both the data i see as well as whether i'm in a vowel or a fricative or a plosive, for example
and finally there's this variance piece at the end here which then says, in general, for this particular gaussian, how much uncertainty overall is there in trying to do this mapping
so maybe i'm in a case where doing this mapping is really hard because there's very little correlation between the bands at that time-frequency point, so we will have high variance there and the model can reflect the fact that the estimates we're using may be poor
so if we look at the performance here we've taken a wall street journal task we took the training data and partitioned it into a wideband set and a narrowband set at some proportion
and the idea is that if you look at the performance of the all wideband system, that's the lower line, it's about ten percent
and if you take the entire training set and sort of telephonize it all you end up with the upper purple curve, that's sort of the narrowband system
the goal of this is to say, given some wideband data and bandwidth extending the rest of the narrowband data, how much can we close that gap
so here we're comparing the results of the feature domain version and the model domain version, and we can see that when we have a split of eighty twenty the performance is about the same, and so in that case, you know, why go through all the extra computation, the feature version works quite well
interestingly, once you go to a more extreme case where only ten percent of the training set is actually wideband and the rest is narrowband, doing the feature version actually does worse than just training an entire narrowband system, because there's lots of uncertainty in the extension that you do in the front end which is not reflected in your model at all
but if we do the training in this integrated framework we end up with performance that again is better than or equal to all narrowband
so to summarise this second prong of the triangle here, recycle
narrowband data can potentially be recycled for use with wideband data this may allow us to use the existing piles of legacy data we have
and for a new system that we want to build, narrowband data may be easier to collect and may be able to supplement the small amount of wideband data
you can do this in the front end or we can come up with this sort of integrated training framework
and like the noise robustness case, there is a front-end version that i talked about and there are advantages to that i should sort of point out
it allows you, if you do this in the front end, to use whatever features you want you can then take the output and postprocess it, use bottleneck features, stack a bunch of frames and so on, and so you have a little bit more flexibility in what you wanna do downstream from this process
and the other interesting thing is that the same technology can be used in the reverse scenario, where the input may be narrowband and the model is actually wideband
you may think, where does this happen, but this actually happens in systems a lot, as soon as someone puts on a bluetooth headset
you could have a wideband enabled system and somebody decides that they wanna, you know, be safe and hands-free and puts on a bluetooth headset, and all of a sudden what comes into your system is narrowband
if you don't do something about it, well, you're gonna get killed if you don't go hands-free, but anyway, if you don't do something about it your performance is gonna suffer
and so, you know, one option would be to maintain two models on your server the other idea is you can actually do bandwidth extension in the front end and process that by your wideband recognizer
the nice thing there is that you don't have to be as good as true wideband performance, you just have to be better than or as good as what the narrowband performance would be, and then it's worth it to do that
so finally i'd like to move on to the last component here, reuse, and talk about the reuse of speaker transforms
so one of the things that we've found is that the utterances in the applications that are being deployed commercially now are really short
and so, you know, in seattle obviously people search for starbucks quite a bit, or movie showtimes, or in the living room scenario 'xbox play movie' may be all that you get
in addition to that these are rarely rich dialogue interactive systems, so these are sort of one shot things where you speak, you get a result and you're done
so the combination of these two things makes it really difficult to obtain sufficient data for doing conventional speaker adaptation from a single session of use, so doing things like mllr or cmllr becomes quite difficult in the single utterance case
and so an obvious solution to this is to say well let's just accumulate the data over time across sessions we have users, you know, making multiple queries to the system, so let's aggregate it all together and then we'll have enough data to build a transform
the difficulty comes in because these are now applications on mobile phones, which means the people are obviously mobile too, and across all these different uses they're actually in different environments
and that creates additional variability in the data that we accumulate over time
so as a little cartoon, i guess, or a metaphor here, let's imagine a user calls the system and the observation comes in as y, and that's some combination of the phonetic content, which i'm showing as a white box, some speaker specific information, shown as a blue box, and some, you know, environmental background information, shown as the red box
so the system gets the speech and says oh okay we'll do our adaptation and store away the transform, so the next time this user calls it will be loaded up and ready to go
so sure enough sometime later the user calls back
and the phonetic content, you know, may or may not be the same
the speaker is the same
but now, you know, he or she is in a different location or different environment, and so the observation is now green instead of purple, and as a result we can do adaptation on that call using the stored transform but mismatch persists so this is not optimal
and so what we would like is a solution where the variability, when we do something like adaptation, can be separated or factored apart
so that we can say let's just hold onto the part that's related to the speaker and sort of throw away the part that's due to the environment, or alternatively store the part that's for the environment, so that if we ever see a different user call back from that same environment we can actually reuse that as well
so in order to do this sort of factorisation or separation of the different sources of variability, you actually need an explicit way to do joint compensation it's very hard to separate these things if you don't have a model that explicitly models them as individual sources of variability
and so to do this there are several pieces of work that have been proposed
it's sort of like being at a diner where you get to choose one from column a and one from column b you can sort of take, you know, all your favourite speaker adaptation algorithms, and you can take all the schemes that apply for environmental adaptation, pick one from each column and combine them, and then you can have a usable model
the amusing thing is that this was sort of proposed ten years ago, but as far as i can tell, with the exception of joint factor analysis in two thousand five, there's not been that much work on it since, and now it sort of seems to have come on the scene again, which is good i think it's nice to have more people working on this
so of all the possible combinations of methods that can do this joint compensation together, i'm gonna talk about one particular instance using cmllr transforms, mostly because i've already talked about how vts is used, and so i'm trying to show several different ways you can go about doing compensation for noise
so in this case we're gonna talk about the idea that you can use a cascade of cmllr transforms, one that captures environmental variability and one that captures speaker variability
a nice thing about using transforms like this is that while we give up the benefit of all the structure we had in an environmental model using solutions like vts, we get the ability to be much more flexible, meaning that we have no restriction on what features we can use or what data the system is trained from, and we don't have to do adaptive training schemes like noise adaptive training
the idea is quite simply to find the set of environmental transforms and the set of speaker transforms that maximise the likelihood of a sample of training or adaptation data
now of course, you know, it's not hard to see that this cascade of linear transforms is itself a linear transform, and as a result you can take a linear transform and factor it into two separate transforms in an arbitrary number of ways, many of which will not be meaningful
and so the way that we're gonna get around this is to borrow heavily from the key idea, i think, in joint factor analysis from speaker recognition, which is to say let's learn the transformations on partitions of the training data where we're able to sort of isolate the variability that we're interested in
so pictorially, this slide is a bit busy and i apologise if it gives you a headache, but you can think about the idea that you're basically gonna group the data by speaker, and given those partitions you can update your speaker transforms
then you're gonna repartition your data by environment, keep your speaker transforms fixed, and update your environment transforms, and then go back and forth in this manner
now of course doing this operation assumes that you have a sense of what your speaker clusters and your environment clusters are
there are some cases where it sounds reasonable to assume the labels are given to you so for example if it's a phone, you know, a mobile phone data plan scenario, you can have a caller id or a user id or the hardware address, and so you can have high confidence that you know who the speaker is
similarly for certain applications like the xbox in the living room we can be fairly certain and say okay, this thing is not driving down the road at sixty miles an hour, it probably is in the living room, so we can assume the environment in that case
or if we don't have this information you can rely on environment clustering algorithms or speaker clustering
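as a sketch of that alternating estimation (pseudo-python of my own; estimate_cmllr, apply_transform and identity_transform are hypothetical placeholders for the usual cmllr machinery, and the order of the cascade is a modelling choice):

```python
def factor_transforms(model, utts, speaker_of, env_of, n_iters=3):
    """Alternately estimate speaker and environment CMLLR transforms on
    data partitioned by speaker, then by environment, holding the other
    set of transforms fixed."""
    spk_T = {speaker_of(u): identity_transform() for u in utts}
    env_T = {env_of(u): identity_transform() for u in utts}
    for _ in range(n_iters):
        for s in spk_T:   # partition by speaker; environment transforms held fixed
            data = [apply_transform(env_T[env_of(u)], u) for u in utts if speaker_of(u) == s]
            spk_T[s] = estimate_cmllr(model, data)
        for e in env_T:   # partition by environment; speaker transforms held fixed
            data = [apply_transform(spk_T[speaker_of(u)], u) for u in utts if env_of(u) == e]
            env_T[e] = estimate_cmllr(model, data)
    return spk_T, env_T
```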
and so just to show some results here, the idea is you can again take the training data from a variety of environments and a variety of speakers and estimate some environment transforms on the training data
to do that of course you have to estimate the speaker transforms as well, but in this case the speakers in training and test are distinct, and so those speaker transforms are not useful for us in the reuse scenario
and so what we've tried here is to say let's estimate the speaker transform given data from a single environment, in this case the subway
we then take that transform, either estimated in this way where the sources of variability are factored, or estimated using the conventional cmllr approach, and apply it to data from the same speaker in six different environments, three of which are environments that you've seen in training and three of which are not seen
and you can see in both cases you get a benefit by having the additional transform in there to absorb the variability from the noise, so that the speaker transform can focus on just the variability that comes from the speaker that you care about
and so you can see there's a gain over doing cmllr alone, and that comes again from the fact that this transform is presumably not learning the mapping of the environment plus the speaker, it's ideally learning the transform of just the speaker alone
so in scenarios where speaker data is scarce, reuse is important for adaptation
now if this were a case where each utterance is, you know, ten or fifteen or twenty seconds, these techniques are not nearly as important, but in the case where you only have a second or two of data you wanna be able to aggregate all this data and build a model for that speaker
but when that data comes from places where there's a high degree of variability from other sources, the problem becomes a little more challenging
and this can be environments, it can be devices you have, you know, some data from a phone that's being held up like this, and then you have far field data, and then you have additional data that's from four feet away on your couch
all these things are different, with different microphones, and all these sources are things that are basically blurring the speaker transform you're trying to learn, and you want to isolate those in order to reuse the speaker transform
so doing it this way basically allows a secondary transform to absorb this unwanted variability
and there are various ways of doing it obviously if you have transforms that are specifically modeling different things explicitly it'll be easier to get the separation
if you have things like two generic linear transforms then you need to sort of resort to these data partitioning schemes, which, you know, makes things a little bit more difficult
so here i've just tried to hit a little bit on, you know, three aspects of speech recognition going green in this reduce, reuse, recycle framework
before i conclude i just wanted to briefly touch on, as someone who's worked, i guess, strongly in robustness and these ideas, three personalities that i think people sort of take on with respect to this area, and i wanna sort of address each of them, and you may find yourself thinking you are one of these personas in turn
so i think there's people who are the believers, there's people who are the sceptics, and there's people who i would call the willing, which are sort of the people who say oh well maybe i'll give this a try
and, you know, i think about sort of the resurgence in neural net acoustic modeling as a good example of this, or maybe some auditory inspired signal processing is another example, where there were true believers in sort of acoustic models using neural nets, then there were folks who said well they can't beat an hmm, you know, put that aside, and then, you know, results kind of improved and people said i'd give this a try again, they moved from being sceptics to the willing, and now they've got good results and they're all believers again
and so i wanna sort of talk to each of these very briefly
so i would say to the sceptics, you know, one thing that i think is interesting is that research in robustness in speech recognition has been going on for a long time there's lots of sessions and lots of papers
but if you look at the tasks that have become standard for robust speech recognition, like the ones i talked about today, they're all very small vocabulary tasks compared to today's state-of-the-art systems for things like switchboard and gale and meeting recognition
and in these very large scale systems, like switchboard and gale and meetings, robustness techniques are not really a part of the puzzle there
and so i think it's very fair to ask, are all these methods really necessary in any sort of actually deployed system
to that i would just say yes, it depends, and i sorta wanna give a few very anecdotal examples to sort of motivate why i think this is
if you think of production quality systems that do have all the bells and whistles that everyone knows about, that are common in large scale systems, we see in things like voice search, you know, that in fact the gains are small, and so, you know, it's not really a huge win to employ these techniques, and so it's a fair critique to say there we don't need robustness
as you move to something like the car it turns out that actually the gains are pretty big, and, you know, you can make the system much more usable by incorporating some elements of noise robustness into your system
finally i would actually say with the xbox and kinect, it turns out that, i would say, these systems are actually unusable
if i consider robustness as the entire sort of audio processing front-end plus whatever happens in the recognizer
if we throw all that away and just say let's use this microphone to listen and we'll do everything in the model space, the systems are actually unusable
and so there actually is a large place for this technology in certain scenarios
speaking now to the willing, if someone says well, you know, what's the easiest thing to try, what's the biggest bang for the buck, what i would say is what's loosely called noise adaptive training in the feature space
the idea is very simple you have some training data, and you have some way you believe you can enhance the data at runtime you need to take the training data, put it through the same exact process, and retrain your acoustic model
you can think of this as basically very akin to doing cmllr for speaker adaptive training you're basically updating your features before you retrain your model
it turns out that if you do this you tend to get performance that generally is far superior to trying to compensate noisy speech to recognise it with a clean trained hmm
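a minimal sketch of that recipe (illustrative; enhance, train_acoustic_model and decode are hypothetical placeholders for your front-end, e.g. spectral subtraction, and your usual trainer and decoder):

```python
def feature_space_nat(train_utts, transcripts):
    """Feature-space noise adaptive training sketch: run the same
    enhancement used at test time over the training data, then retrain
    the acoustic model on the enhanced features."""
    enhanced = [enhance(u) for u in train_utts]    # same front-end as at runtime
    return train_acoustic_model(enhanced, transcripts)

def recognise_matched(model, utt):
    return decode(model, enhance(utt))             # matched processing at test time
```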
and if you are gonna try this i think, you know, the standard algorithms are fine, things like spectral subtraction i mean the fanciest ones work great but the improvements are small i think getting the basics working is important
but the important thing is you need to tune and optimize the right objective function i've had, you know, people talk to me and say oh we got, you know, a spectral subtraction component from my friend who's in the speech enhancement part of our lab and i just tried it and it, you know, didn't work at all, and the reason is that these things are optimized completely differently
and so you do need to understand all the details and nuances of what's happening but generally there's a whole set of parameters and floors and weights and things
and those things can all be tuned and you can tune them to, you know, minimize word error rate and that would be great you can do that in a greedy way let's just sweep a whole bunch of parameters until we get the best result
you can also use something called pesq, which is a computational proxy that stands for the perceptual evaluation of speech quality it's basically like a model of what human listeners would say
it turns out that pesq scores are quite correlated with speech recognition performance, and so if you can maximise that, or your signal processing buddies have some algorithm that maximizes pesq, that's a good place to start
and it turns out that doing things like maximising snr is about the worst thing you can do it creates all kinds of distortion artifacts
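a greedy sweep of that kind might look roughly like this (purely illustrative; score stands in for dev-set word error rate or negative pesq):

```python
import itertools

def sweep_front_end(floors, weights, score):
    """Greedy grid search over two hypothetical front-end parameters,
    keeping whichever setting gives the lowest score."""
    return min(itertools.product(floors, weights),
               key=lambda fw: score(floor=fw[0], weight=fw[1]))
```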
so with that i just want to conclude and say that we've proposed that potentially there's goodness to be had by reusing existing data, and we've sort of put this under the metaphor of going green
in each case i've just tried to provide, you know, one example of the way that we can reduce, recycle and reuse the data that we have, either from an environmental mismatch point of view, a bandwidth point of view or a speaker adaptation point of view
there are many other ways to do this, i've just talked about a few, and of course there's more work to be done
and so with that i will thank you
let's thank the speaker
we have plenty of time for questions
so, who has questions for mike
great talk mike i was wondering if you can address some other problems in the robustness area
for example there are many cases where there are gross nonlinear distortions that are applied to the signal, strange things in the communication channel beyond what you talked about
i mean the transform techniques could obviously work on it or anything, but i'm wondering if you have any comments on what you do when there are gross nonlinear distortions of the signal, where the signal is still basically intelligible but it doesn't fit any of the classical speech plus noise models
well the one thing i would say is that that is a hard problem, and thank you, i don't even have a great answer for it
so that feature space adaptive training technique is generic across any kind of distortion, so if you actually have the ability, if you know what that coding or distortion is and you can model it somehow, you can just pass data through it that's probably the best way to handle it it's not very fancy but i think it'll work
the other thing is a lot of these things are bursty, and if that's the case you can actually just detect them by building, you know, whatever classifier, and at that point you can for example say i'm gonna, you know, discount my decoder score or i'm just giving up on these frames, there's no content here that's another way you can do it
i think sort of trying to have an explicit model for, you know, that kind of distortion, i think that probably won't work
and i'd like to believe that there is some way, you know, that we can extend the linear transformation schemes to nonlinear transformations, like some kind of mlp based mllr kind of thing, but, you know, that remains to be seen
and again that doesn't really quite get at the sort of occasional gobbledygook that comes in, i don't think that would really address that, so i think those two other techniques are probably the way to go
i think the one thing that's interesting is the correlation between how people speak and the noise background, or, a kind of, what does adding noise do
so the lombard effect has the obvious loudness of speech thing, which we're pretty able to compensate for, you know, we normalize stuff
but there's the lombard spectral tilt, which means that the louder the noise is, the more vocal effort there is and the more tilt there is to the spectrum, and all that sort of thing
how do the techniques you're talking about address that
it's a whole kind of different problem because the environment model really doesn't capture it, unless you know the signal to noise ratio directly
right, so i think what's interesting about those is those are speaker effects that are manifested by the environment
and so like you said having environment models is not gonna capture that at all
so i don't know, i don't have the exact answer, although i would think that having an environment informed speaker transform kind of thing would be useful
so, you know, potentially your choice of, you know, vtln warp parameters for example could be affected by what you perceive in the environment and the level of speaker effect that you detect
and the other thing of course is sort of the poor man's answer, which would be, you know, i'm not sure how much of this can be modelled again by existing speaker adaptation techniques
again i think a lot of these effects are gonna be nonlinear, and so it's hard to sweep them under the rug with an mllr transform
but so i think that comes at it, you know, in a sense opposite to what i was trying to talk about
i talked about separation of the speech and the noise, and i think you're actually suggesting the opposite, which is a jointly informed transform, which i think is a very enticing area
i don't imagine there's been too much work on it
might the greener features be features that came in and were themselves insensitive to some of this noise
absolutely
well, if i agree with you then i'm through, my whole talk is... i can't agree with you now, maybe at the coffee break i can agree with you
but no, i think that that's true, right, and i think a lot of this comes with the biologically inspired kind of features, and i think that's true
and i think actually, in fact, the work that oriol and suman kind of did kind of shows that, if i remember correctly, they trained a deep net on aurora and got, you know, a high degree of noise robustness just from running the network, it potentially learned some kind of noise invariant features
so no, i think that's true the only problem i think right now, where we are, is that it's hard to come up with sort of a one size fits all scheme
so there's time for one other question, but that's about it
about using gmms to augment the data, with the specific example you gave
basically, as far as i understand, the gmm you mentioned was trained unsupervised, so it basically doesn't consider the transcriptions
in the gmm case, right
but you could also do an hmm there
well
that is, easily, you can use the transcriptions, like phone level transcriptions
can you improve that way
absolutely, yeah, that's what was shown, so
only with the model based technique, or is it also possible with the pure speech feature based technique
yeah, well, that's a good question well yes, but i think you don't necessarily need a very strong model
so, you know, i guess you could have, for example, a phone-loop hmm in the front end, and that is, in a way, using a model based technique
but, you know, getting the state sequence right is actually a problem in the feature technique as well
if you don't put constraints on the search space you can have it, within a phone, skipping around states, and you have inconsistent hypotheses for what the missing band is
and you can alleviate that to some extent if you do a sort of cheap decoding in the front end where you have a phone hmm with a phone language model
and you could do that just so that the benefit of the model is actually restraining your state space to sort of possible sequences of phones
once you have that, i think whether you use it to enhance features or do it in the model domain, you know, both are options
yeah, i mean, i also agree, i think the model domain will be optimal
i think if you start saying well my system runs with eleven stacked frames and hlda and all this other stuff it becomes a little harder to do that
you know, you can sort of just say it's gonna be a blind transform like mllr, but if you wanna put structure in the transform to map the low to high frequencies that gets a little more difficult
okay
let's thank the speaker again