That's actually... that's a Morgan kind of introduction; it didn't say too much. Thank you, Brian.
Actually, before I get to the talk, I should mention something. I had a brief discussion with someone about the posters, and we realised that to some extent the optimum strategy for a poster would be to make it seem like it's really interesting but completely impossible to understand, so that people will want to come up and have you explain it.
Anyway, here we are, back again. Someone else suggested that perhaps this talk should be called "deja vu all over again", from that same philosopher, Yogi Berra.
But let me start with a little story. Just to remind you: Arthur Conan Doyle wrote a series of stories about a detective, I'm sure you know the name, Sherlock Holmes, and he had a colleague, Dr. Watson, who really didn't know quite so much.
So Holmes and Watson went on a camping trip. They shared a good meal, had a bottle of wine, and retired to their tent for the night.
At three in the morning, Holmes nudges Watson and says, "Look up at the sky and tell me what you see." Watson says, "I see millions of stars." Holmes asks, "And what does that tell you?" Watson replies, "Astronomically, it tells me there are billions of galaxies and potentially millions of planets. Astrologically, it tells me that Saturn is in Leo. Theologically, it tells me that God is great and we are small and insignificant. Horologically, it tells me that it's about three o'clock. Meteorologically, it tells me we'll have a beautiful day tomorrow. What does it tell you, Holmes?" And Holmes says, "Watson, someone has stolen our tent."
So: what might we be missing?
And there are some great, really exciting results. There are a lot of people who are interested now in neural nets for a number of application areas, but in particular in speech recognition, which is what brings us here. But there might be a few things that we're missing along the way, and perhaps it might be useful to look at some historical context to help us with that.
So, as Brian alluded to earlier in the day, there has been a great deal of history of neural networks for speech, and of neural networks in general, before this. And I think of this as occurring in three waves.
The first wave was in the fifties and sixties, with the development of perceptrons and the like; I think of this as the basic structure, or the BS.
In the eighties and nineties we had back propagation, which had actually been developed before that but was really applied a lot then, and multilayer perceptrons, or MLPs, which applied more structure to the problem: sort of an MS.
And now we have things that are piled higher and deeper, so it's the PhD level.
Now, for ASR, speech recognition: in the fifties and sixties we had digits, pretty much, or other very small vocabulary tasks. In the eighties and nineties we actually graduated to large vocabulary continuous speech recognition. And in this new wave there's really quite widespread use of the technology, and the tasks have got harder still.
Now, this talk isn't about the history of speech recognition, but I think I can't really do a history of neural nets for speech recognition without doing a little bit of that.
That also had an early start. The best-known early paper was a 1952 paper from Bell Labs, but before that there was Radio Rex. Now, if you haven't seen or heard about Radio Rex: Radio Rex was a little toy dog in a dog house, and you'd say "Rex!" and Rex would pop out. Of course, if you did that, Rex would also probably pop out for just about anything that had enough energy at five, six, seven hundred hertz or so, because the little doghouse actually resonated at some of those low frequencies, and when it resonated and vibrated it would break a connection to an electromagnet, and a spring would push the dog out.
So you could think of it as speech recognition with really bad rejection.
Now, the first paper that I know of, anyway, that was doing real speech recognition was this paper by Davis and colleagues on digit recognition from Bell Labs. It approximated the energy in the first couple of formants, really just how much energy there was over time in different frequency regions, and it already had some kind of robust estimation; in particular it was quite insensitive to the amplitude.
And it worked very well under limited circumstances. That is, it was pristine recording conditions: very quiet, very good signal-to-noise ratio, in the laboratory, and also for a single speaker. It was tuned to a single speaker, and really tuned, because it was a big bunch of resistors and capacitors. It also took a fair amount of space. That was the 1952 digit recogniser; it wasn't something that you would fit into a 1952 phone.
Now, I should say that this system had a reported accuracy of ninety-seven or ninety-eight percent, and since every commercial system since then has reported an accuracy of ninety-seven to ninety-eight percent, you might think there's been no progress. But of course there has been; the problems have got much harder.
And that said, speech recognition accuracy isn't the real point here; this talk is mostly history.
Fundamentally, the early ASR was based on some kind of templates or examples, and distances between the incoming speech and those examples. In the last thirty to forty years, the systems have pretty much been based on statistical models, especially in the last twenty-five. The hidden Markov model technology, however, is based on mathematics from the late sixties, and the biggest source of gains since then (this is a slightly unfair statement that I'll justify in a moment) has been having lots of computing.
Now, obviously there are a lot of people, including a lot of people here, who have contributed many important engineering ideas since the late sixties. But those ideas were enabled by having lots of computing and lots of storage.
Statistical models are trained with examples; this is the basic approach we all know about. The examples are represented by some kind of choice of features, the estimators generate likelihoods for what was said, and then there is a model that integrates over time these sort of point-wise-in-time likelihoods.
Now, artificial neural nets can be used in this picture either to generate the features that are then processed by some kind of probability estimator that isn't a neural net, or to generate the likelihoods that are actually used in the hidden Markov model.
Going back to these three waves: in the first wave (and actually, I should say, a lot of the things from the early wave carried through to later ones) the idea was the McCulloch-Pitts neuron model. And there were training algorithms, learning algorithms, that were developed around this model: perceptrons, Adalines, and other more complex things, an example of which is what's called discriminant analysis iterative design, or DAID.
Now, going into these a little bit. The McCulloch-Pitts model was basically that you had a bunch of inputs coming in from other neurons, they were weighted in some way, and when the weighted sum exceeded some threshold, the neuron fired.
Now, the perceptron algorithm was based on changing those weights when the firing was incorrect; that is, for a classification problem, when it said it was a particular class and it really wasn't.
(By the way, I'm going to have almost no equations in this presentation, so if that's a problem for you, too bad.)
So the perceptron learning algorithm adjusted these weights using the outputs, using whether the neuron fired or not. The Widrow-Hoff approach, by contrast, was a linear processing approach, where the weights were adjusted using the weighted sum itself. The initial versions of all the experiments with both of these were done with a single layer, so they were single-layer perceptrons and single-layer Adalines.
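(As an aside, here is a minimal sketch, not from the talk and purely illustrative, of the two single-layer update rules just described: the perceptron rule only changes the weights when the thresholded "fired / didn't fire" decision is wrong, while the Widrow-Hoff LMS rule adjusts the weights in proportion to the error in the weighted sum itself.)

```python
import numpy as np

def perceptron_update(w, x, target, lr=0.1):
    """Perceptron rule: change the weights only when the thresholded output
    disagrees with the target label (both in {0, 1})."""
    fired = 1 if np.dot(w, x) > 0 else 0
    return w + lr * (target - fired) * x   # no change when the decision was right

def lms_update(w, x, target, lr=0.1):
    """Widrow-Hoff (Adaline/LMS) rule: adjust in proportion to the error in the
    linear weighted sum, whether or not the thresholded decision was correct."""
    return w + lr * (target - np.dot(w, x)) * x
```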
And in the late sixties there was the famous book by Minsky and Papert, "Perceptrons", that pointed out that such a simple network could not even solve the exclusive-OR problem. But in fact multiple layers were used as early as the early sixties; an example of that is this DAID algorithm.
So DAID was not a homogeneous neural net like the kind of nets that we mostly use today. It had Gaussians at the first layer and a perceptron at the output layer; it was somewhat similar to the later radial basis function networks, which also had some kind of radial basis function, a Gaussian-like function, at the first layer.
It also had a clever weighting scheme: when you loaded up the covariance matrices for the Gaussians, you would give particular weight to the patterns that had resulted in errors, and it used an exponential loss function of the output to do that.
This wasn't really used for speech, but it was used for a wide variety of problems by McDonnell Douglas and, you know, other governmental and commercial organisations. A lot of people don't know about it; I happened to know about it because I ran across it at one point. But anyway.
So, moving to neural nets for speech recognition. In the early sixties at Stanford, Bernard Widrow's students did a system for digit recognition where they had a series of these Adalines, these adaptive linear units. And it worked quite well within-speaker, much as the 1952 system had, except that this was automatic; you didn't have to tune a bunch of resistors. And it had terrible error rates across speakers. But it was sort of comparable, and it was using this kind of technology.
Moving into the eighties: wave two. Various people did consonant classification with such systems; I had the good fortune to be able to play around with such things for voiced/unvoiced classification for a commercial task. And competing systems started coming up by the mid to late eighties. People at CMU, Alex Waibel, and Geoff Hinton who was there at the time, and Kevin Lang, did this kind of classification for stop consonants using such systems. And there were many others; I don't have enough room on one slide for how many there were, but Kohonen in Finland, Kammerer and Kupper in Germany, Peeling and Moore in the UK, and many others built up these systems and did, typically, isolated word recognition.
Then, by the end of the eighties, we got to real speech recognition; that is, continuous speech recognition, speaker-independent, et cetera.
So, I have had the good fortune to have really clever friends, and together with some of them I did some of this work. Hervé Bourlard came to visit ICSI in '88, and he and I started a long collaboration where we developed an approach for using feed-forward neural networks for speech recognition.
And there were a range of other people who did related things, in particular in Germany and at CMU.
There was also work on recurrent networks. With the feed-forward nets you just go from input to output and nothing comes back, whereas the recurrent nets actually fed back. And this was really pioneering; I mean, there were a number of people who worked with recurrent networks, but for applying them to large vocabulary continuous speech recognition, the real centre for that was Cambridge: Tony Robinson and, while he was still alive, Frank Fallside.
And what both approaches had in common was that through the training they generated posterior probabilities of phone classes, and then they derived state emission likelihoods for hidden Markov models from them. Typically we found it worked better in most cases to divide by the prior probabilities of each of the phone classes and get scaled likelihoods. And we attached a moniker to this; we called it the hybrid HMM/MLP, or hybrid HMM/ANN, system.
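(A rough sketch of the scaled-likelihood step just described, my own illustration with made-up variable names: the MLP gives posteriors P(q|x), and dividing by the class priors P(q) gives a quantity proportional to the emission likelihood p(x|q), which is what the HMM decoder wants.)

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Convert per-frame MLP phone posteriors into scaled emission scores.

    By Bayes' rule, P(q|x) / P(q) = p(x|q) / p(x); since p(x) is shared by all
    states q at a given frame, the ratio can be used directly as the HMM
    state emission score.
    """
    return log_posteriors - log_priors   # shape: (n_frames, n_phone_classes)

# hypothetical usage, with posteriors from a softmax output layer and priors
# counted from the training alignments:
# scores = scaled_log_likelihoods(np.log(posteriors), np.log(priors))
```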
So, with MLPs you would use error back propagation, using the chain rule to spread the blame or credit back through the layers. It was simple to use, simple to train, and gave powerful transformations. MLPs were also used for classification and prediction, but in the hybrid system the idea was to use them for probability estimation. And initially we did this for a limited number of classes, typically monophones.
This slide has the only equation in the talk. We did understand that having some representation of context could be beneficial, but it was kind of hard to deal with twenty-some years ago, and the notion of having thousands and thousands of outputs just didn't seem like a particularly good one, at least given the limited amount of training data and computation that we had to work with.
So we came up with a factored version. In this equation, Q stands for the states, which in this case were typically monophones; C stands for context; and X is the feature vector. And you can break it up, without any assumptions, no independence assumption, into two different factorisations: the probability of the state given the context and the input, times the probability of the context given the input; or, the other one on the right, the probability of the context given the state and the input, times the monophone probability. And the latter one means that you could take the monophone net that you had already trained and just multiply it by this other one.
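(Writing out the one equation on the slide, as reconstructed from the description above:)

```latex
% Q: state (typically a monophone), C: context class, X: acoustic input
\begin{aligned}
P(Q, C \mid X) &= P(Q \mid C, X)\, P(C \mid X)\\
               &= P(C \mid Q, X)\, P(Q \mid X)
\end{aligned}
% The second form keeps the already-trained monophone net P(Q | X) and simply
% multiplies it by a second net estimating P(C | Q, X).
```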
As with other things, we were a bit naive initially, so for the first six months to a year it didn't work at all. And then our colleagues at SRI, who were very helpful, came up with some really good smoothing methods, which, given the limited amount of data we were working with, were really necessary to make context work.
And then, a few years later, Fritsch at CMU took this to an extreme, where you actually had a tree of such MLPs, so you could implement this factorisation over and over and get finer and finer, and down at the leaves you actually had tens of thousands or even a hundred thousand generalized triphones of some sort. And it worked very well; it was actually quite comparable to other systems at the time. But it was really complicated, and most people at this point had really focused in on Gaussian mixture systems, so it never really took off.
Now, if you look at where all this was around 2000: the Gaussian mixture approaches had matured. People had really learned how to use them, and many refinements had been developed. Think about Gaussians: you have means, you have covariances, and people typically use diagonal, variance-only covariance matrices, so there are lots of simple things that you can do with them. Many of these were developed: MLLR, SAT, and a whole list of others; all sorts of alphabet soup.
These didn't come easily, and they didn't come at all to the MLP world, because the MLP world, at least for large vocabulary continuous speech recognition, was at this point really confined to a few places. Almost everybody was working with Gaussian mixtures, so it was kind of hard to keep up.
But we still wanted to use the nets, and we liked them. One important reason for us was that they worked really well with different front ends. So if you came up with some really weird thing, say you listened to Christoph talking about the neurons and said "let's try that", you could feed it to the MLP and the MLP didn't mind. We had experience with a colleague of ours, for instance John Lazzaro, who was building these funny little chips that implemented, in subthreshold MOS, various functions that people had found in the cochlear nuclei and so on. You'd feed those into HTK and it would just roll over and die; we fed them into our systems and they didn't mind at all. Because of the nature of the nonlinearities, the MLP really was very agnostic to the kind of inputs.
So the question was how to take advantage of both. Well, what happened at this time: we were working with Hynek Hermansky, who was at OGI, and with Dan Ellis, who was at ICSI, and there was this competition happening for a standard for distributed speech recognition, the idea being that you would compute the features on the phone and then somewhere else you would actually do the rest of the recognition; so the goal was to replace MFCCs with something better. The models were required to be HMM/GMM; you couldn't change that. And we still liked the nets.
So the solution these guys came up with was to use the outputs of the MLP as features, not as probabilities. They weren't the only ones ever to use the outputs of MLPs as features, but there was a particular way of doing it that could be dropped into large vocabulary or small vocabulary systems (the original work was with digits), and this was called the tandem approach.
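(A minimal sketch of the tandem idea, my own illustration rather than the original recipe: take the MLP's phone posteriors, log them to make them more Gaussian-like, decorrelate with PCA, and hand the result to an otherwise unchanged HMM/GMM system as if they were ordinary acoustic features.)

```python
import numpy as np

def tandem_features(posteriors, pca_mean, pca_basis, n_keep=24):
    """Turn per-frame MLP phone posteriors into tandem features for an HMM/GMM.

    posteriors: (n_frames, n_phones) softmax outputs
    pca_mean, pca_basis: statistics estimated on training-set log posteriors
    """
    logp = np.log(posteriors + 1e-10)        # log warps the very skewed posteriors
    centered = logp - pca_mean               # remove the global mean
    return centered @ pca_basis[:, :n_keep]  # decorrelate and reduce dimension
```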
Now, there was also a sort of social and cultural advantage for our research. The nice thing was that instead of having to convince everybody that the hybrid systems were the way to go, we could just say, here are some cool features, you should try them out. And we could, and did in fact, collaborate with other people's systems that way. And I should give some credit, at the bottom there, to some related work that was being done in speaker recognition.
There were also other variants. Once you get the idea that you have some interesting use of neural nets to generate features, you can also focus on temporal approaches, which Hynek and his folks did with TRAPS, where you have neural nets each looking at just a part of the spectrum over a long stretch of time, so they are kind of forced into learning something about the temporal properties that helps you with phonetic identification. ICSI's version of this was called HATS, for hidden activation TRAPS.
And in all of these there was the germ of what people do now with the layer-by-layer stuff, because you'd train something up and then you'd feed that into another net; in the case of HATS, you'd train something up, then throw away the last layer and feed the hidden activations in as features for something else. Then there were a bunch of things that worked with Gabor filters and modulation-based inputs, and you could, using a tandem approach, end up getting features from those as well.
And then a much more recent version is bottleneck features, which are kind of tandem-like. It's not exactly the same thing, since they're not coming from posteriors, but it is using an output from the net as the features.
So, the third wave. Why did things go the way they did? There's nothing wrong with the original hybrid approach; I mean, it worked fine. The GMM approach sort of had the victory because, when you get a lot of people moving in the same direction, a lot of things can happen; but also, given the computation and storage and so forth available, it was a lot more straightforward, I think, to make progress with modifications to the GMM-based approaches.
So the fundamental issues with going further with the hybrid approach were how to apply many parameters usefully, and how to get these emission probabilities for many phonetic categories. And aspects of the solution were already there. As I already mentioned, in a number of these approaches we were already generating MLPs layer by layer. For many phonetic categories, there was some work on context dependence, but that needed to be pushed further. Learning approaches: second-order methods and so forth; there were many papers on these sorts of things, on variants of conjugate gradient and the like, in the eighties, and conjugate gradient of course is much older than the eighties.
But someone had to do all this. And when I make these reflections about the earlier time, I don't want to cast aspersions on the people who are doing great things now; someone actually had to put these things together and push them forward. And in that kind of discussion you have to start with Geoff Hinton. Geoff is kind of an excitable guy: he was very excited by back propagation in the eighties, he's been excited about these things since, and he is very good at spreading that excitement.
So he developed particular initialisation techniques, some of them unsupervised techniques in particular, which he likes because they seem biologically plausible. And this permitted the use of many parameters in all the layers, because when you have many layers, back propagation isn't too effective: down at the early layers, the credit and blame get watered down. This excitement spread to Microsoft Research, and they extended what had gone before to many phonetic categories and large vocabulary speech recognition, and lots of other very talented people at Google, IBM, and elsewhere followed.
So, initialisation, having a good starting point for the weights before you start discriminative training of some sort, was often used for the limited-data case. It was often the case back in the early nineties, when we were going into some situation where we had relatively little data, that we would train on something else first and then start from those weights; maybe we wouldn't even train all the way, we'd just do an epoch or two, and then we would go to the other language or the other task. And we often found that to be very helpful.
So Hinton developed a general unsupervised approach, applied to multiple layers, in general called deep learning. A lot of the early versions were sometimes called deep belief nets, and more generally DNNs, and were applied to other applications than speech. And again, it gave reasonable weights for the layers far from the targets, because even if back propagation training doesn't move those weights much at all, at least the early layers are doing something useful. Later speech work, a lot of the things that you see in the posters and papers of the last couple of years, actually skips this step and does something else, for instance layer-by-layer training done discriminatively. And many approaches use some kind of regularisation to avoid overfitting.
So the recent work, which you'll hear much more about later today, shows significant improvements over comparable GMMs. And there's a mixture of approaches: sometimes tandem-like or bottleneck-like, sometimes hybrid mode; I think they're usually hybrid mode. And I have to say, it's great that they're called deep neural nets, but they're still multilayer perceptrons; they're just multilayer perceptrons with, you know, a certain number of layers. And you can say, well, okay, but it's really different with seven hidden layers than it used to be. Maybe. But we do have to ask: how deep do they need to be?
So: many experiments show continued improvements with more layers, and at some point there are diminishing returns; but the underlying assumption there is that there's no limit on parameters. So we started asking the question: what if there were a limit? Now, why would you want a limit? Well, because in any practical situation you actually are under some kind of limit; at least there's a cost. You could think of the number of parameters as a proxy for the cost, for the resources in general, for the time it takes to train, the time it takes to run, the amount of storage. And, well, there are people who will say that's not an issue, but I have to say, you know, even if you've got a million machines, you probably want a hundred million users, so it still matters how many parameters you use.
So at Interspeech we presented something which I'm just going to summarise for a minute or two here, which we called "deep on a budget". We said: suppose we have a fixed but very large number of parameters (we wanted to make sure that nobody thinks we didn't use enough parameters), and then you compare between narrow-and-deep versus wide-and-shallow. We often see comparisons where people try, you know, the earlier style that we often used of one big hidden layer versus a bunch of layers; but we wanted to do it all along the way, step by step: two hidden layers, three hidden layers, four hidden layers, keeping the architecture otherwise the same. And it was only one task, and a pretty small task at that, Aurora 2, but that allowed us to look at varying signal-to-noise ratios. We asked: if you did this on a budget, what works best?
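(The bookkeeping behind "deep on a budget" is simple; here is a sketch, mine and with illustrative numbers rather than the published setup, of how you pick the hidden-layer width so that nets with different numbers of hidden layers all land on the same total parameter count.)

```python
def total_params(n_in, n_hidden, n_layers, n_out):
    """Weights plus biases of an MLP with n_layers equal-width hidden layers."""
    sizes = [n_in] + [n_hidden] * n_layers + [n_out]
    return sum((a + 1) * b for a, b in zip(sizes[:-1], sizes[1:]))

def width_for_budget(budget, n_in, n_layers, n_out):
    """Largest equal hidden width that stays within a fixed parameter budget."""
    width = 1
    while total_params(n_in, width + 1, n_layers, n_out) <= budget:
        width += 1
    return width

# e.g. fix roughly two million parameters for a 351-dim input and 56 outputs
# (both numbers are just for illustration):
# for depth in (1, 2, 3, 4, 5):
#     print(depth, width_for_budget(2_000_000, 351, depth, 56))
```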
And, you know, there's more to it: there are different kinds of additive noise (train station, babble, and so forth), and this was a mismatched case, clean training and noisy test; we didn't do the multi-style training. And it turns out that the answer is all over the map. But in particular, for the cases with kind of usable signal-to-noise ratios, and by usable I mean cases that gave you a few percent error on digits, as opposed to twenty or thirty or forty percent, which you just couldn't use for anything, two hidden layers was actually better. And, to deal a little bit with the question of "well, maybe you just picked the wrong number of parameters", we tried it with double the number of parameters and half the number of parameters, and we saw similar results.
So when I gave the longer version of this at Interspeech, some of the comments were along the lines of "why do you think two is better?" and so forth. I just want to be clear: I'm not saying that two is better than anything. What I'm saying is that if you are thinking of something actually going into practical use, you should do some experiments where you keep the number of parameters the same; you might then expand and so forth, but you should do some experiments with the number of parameters held constant, and then you get an idea about what's best, and it's probably going to be task-dependent.
So, we focus on neural networks, but we do have to be sure we ask the right questions. One question is what we feed into the nets; you know, there are all these questions about what's the right data, how many layers we have, and so forth. Some people, not naming any names, whom I'd characterize as true believers, think that features aren't important. Actually, to clarify: I had a discussion just after Interspeech, I think it was, where I made this comment, and he said, no, I think features are important, just not the usual ones. So anyway, features are important.
actually
to verify slightly a discussion just a
that interspeech i think it wasn't and
the
i made this comment and he said no i think features of importance just usual
so anyway features are important
and this goes back to the old general computer
axiom garbage in garbage are
people have done some very interesting experiments with feeding waveforms in and i should say
back and today we did some experiments hynek like this in experiments with a feeling
waveforms in comparing the plp needed waveforms way worse they have made some progress there
actually are doing better
But if you actually look in detail at what these experiments do: in one case, for instance, they take the absolute value, they floor it, they take the logarithm, they average over a bunch of frames; all sorts of things which actually obscure the phase. And that's kind of the point, because you can have waveforms of extraordinarily different shape that really sound pretty much the same. There are more recent results that use maxout pooling in convolutional neural nets, which also had, you know, a nice result; and again, this maximum-style pooling also tends to obscure the phase. But in both of those cases, and the other cases I've heard of anyway, it completely falls apart when you have mismatch, when the testing is different from the training.
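(The phase point can be made with a two-line experiment; this is my own sketch: randomise the phase of a signal's spectrum and the waveform shape changes completely, yet the magnitude-based quantities these pipelines end up computing are untouched.)

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)                 # stand-in for a stretch of audio
X = np.fft.rfft(x)

phase = rng.uniform(0.0, 2.0 * np.pi, X.shape)
phase[0], phase[-1] = 0.0, 0.0                 # keep the DC and Nyquist bins real
y = np.fft.irfft(np.abs(X) * np.exp(1j * phase), n=len(x))

print(np.corrcoef(x, y)[0, 1])                         # near 0: very different waveform shape
print(np.allclose(np.abs(X), np.abs(np.fft.rfft(y))))  # True: identical magnitude spectrum
```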
So what is the role of having a front end? After all, all the available data is in the waveform. There are some assumptions there, that, you know, you could learn everything from it, but let's ignore that for the moment. In fact, front ends do consistently improve speech recognition. And I have this quote, which I like, that I learned from Hynek: that the goal of front ends is to destroy information. That's a little extreme (he says these things sometimes), but I think it's true that some information is misleading and some information is irrelevant, and we want to focus on the discriminative information. Because the waveform that you receive is not just the spoken language: it also has noise and reverberation and channel effects and characteristics of the speaker, and if you're not doing speaker recognition, maybe you don't care so much about that. And so the front end can help to focus on the things that you care about for your particular task. And a good front end in principle, or to carry it to an extreme, can make recognition extremely simple. In principle, at least.
So what about the connection to MLPs? Well, as I alluded to earlier, MLPs have few distributional assumptions. MLPs can also easily integrate information over time and over multiple feature streams, and they can provide a useful way to incorporate more parameters. So yes, depth does give you a nice way, especially with good regularisation and initialisation and so forth, to incorporate more features, more parameters I should say, usefully.
But multiple streams can do this too. By multiple streams I mean having different sets of MLPs that look at the signal in different ways; you can really expand out the number of parameters, and in a way that is often quite useful. And I might as well throw in another acronym: if you use this together with depth, you can call it a DWN, a deep and wide network. You can combine these different streams easily, because the outputs are posteriors, and we know how to combine probabilities.
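(Because each stream's output layer is a posterior distribution over the same phone classes, combination can be as simple as a weighted average in the log domain. A minimal sketch of that kind of stream combination, under my own simple equal-weight assumption:)

```python
import numpy as np

def combine_stream_posteriors(stream_posteriors, weights=None):
    """Merge per-frame phone posteriors from several MLP streams.

    stream_posteriors: list of (n_frames, n_phones) arrays, one per stream.
    Uses a weighted geometric mean (log-domain average), then renormalises.
    """
    if weights is None:
        weights = [1.0 / len(stream_posteriors)] * len(stream_posteriors)
    log_combined = sum(w * np.log(p + 1e-10)
                       for w, p in zip(weights, stream_posteriors))
    combined = np.exp(log_combined)
    return combined / combined.sum(axis=1, keepdims=True)
```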
Here's an early example. A very talented guy at our place, fifteen or so years ago, built what he called a tonotopic MLP. The idea is that you have a bunch of different sets of layers that are looking at different critical bands; this is like the HATS and TRAPS and so forth, the difference being that it was trained all at once. And in fact this worked okay.
A recent example (and there are a bunch of such examples around; I just picked this one because it was done by one of my students, actually, in China) is one in which he had streams coming from high modulation frequencies and low modulation frequencies. And SPCA here is not the society for the prevention of cruelty to animals: it's sparse PCA, and it's used to pick out particular filters, Gabor filters in this case, that are particularly useful for the discrimination. These then go into deep neural nets, six-layer deep neural nets, and the output of one deep neural net goes into another, so I guess it's really deep, but you also have some width in there. This was used to some effect on very noisy data for the RATS program, which is data that's been transmitted through radio channels and is really extremely awful by the time you get it at the other side.
So, whether you call them DNNs or deep MLPs, nearly all of them are still based, essentially, on this McCulloch-Pitts model. There is some nice work (there's also a poster here) on more complex units, but certainly for large vocabulary tasks, for real word error rate measurements, they're not particularly better, which is a little disappointing. Maybe that work has just started. The complexity and power is not supplied by having more complex units, at least so far; it is supplied by the data, and also, as I said, with multiple streams, by the width. You also can represent signal correlations to some extent by pooling, and again by acoustic context. And so far, at least, the most effective learning methods are not biologically plausible.
So given all that, how can we benefit from biological models? Well, why would we want to benefit from biological models? Because we want to have stable perception in noise and reverberation, which human hearing can do and our systems certainly can't. The cocktail party effect, picking one voice out of many: there are some laboratory demonstrations of such things, but in general they don't really work. Rapid adjustment to changing conditions: I remember telling someone at one point that if our sponsors wanted us to have the best recognition anyone could have in this room, we'd collect a thousand hours in this room; then if the sponsors came back the next year and said, now we want it in that conference room down the hall, we'd have to collect another thousand hours. Okay, I'm exaggerating slightly; there is a set of techniques, adaptation, but it's really very minor compared to what people can do: we just walk into this room, or walk into the other room, and we just keep going, pretty much. And real speaker independence: we often call our systems speaker-independent speech recognizers, but when you have a voice that's particularly different, they do badly.
So can we learn from the brain? These are pictures from the same source as some of those in the first talk: ECoG. This is direct cortical measurement, as was explained. You get data from people who are in the hospital for certain neurosurgery because they have extreme cases of epilepsy which have not been sufficiently well treated by drugs; surgery is an option, but you have to figure out where the focus of the seizures is, and you also want to know where not to cut, in terms of language.
so
at each angle was mentioned earlier and new remotes grounding
had a lovely paper in nature couple years ago where they're making all kinds of
noise measurements during source separation and in this experiment they would play two speakers speaking
once
and
by the design of the experiment we get the subject to focus first on one
speaker and then on the other and sort of the changes and signal process
so this is giving clues about source separation and noise robustness and what's really exciting
about this from his that this isn't kind of intermediate so between E G which
is something i used to work with a long time ago we're on the scale
have really it spatial wrote
resolution you a pretty good temporal resolution
and the
single or
modest number of electrodes that directly like then there is on the surface intermediate region
and looks like we've got a lot of new kinds of information and the technology
on this is rapidly changing
people working on sensors are making these things with the sensor with the sensor with
the electorate closer and closer together
So the hope is that measurements like these, and like the things that Christoph spoke about, will lead to completely new processing steps. For instance, computational auditory scene analysis is based on psychoacoustics, and there's a range of things you can do to try to pick out one speaker from some other background; but if we actually had a better handle on what's really going on inside the system, we might be able to design those things better, rather than just relying on psychoacoustics. And this includes structures and things at the signal level and the computational level. There's also work that's been done, which will be talked about on Thursday night, by Steve for instance, on understanding what statistical systems can learn and what their limitations are. What that has in common with the other: it's not from the brain, but it is analysis of what's actually going on, and it can give you a handle on how to proceed. We need feature stability under different kinds of conditions (noise, room reverberation, and so on), and models that can handle dependent variables.
So, in conclusion: there is more than fifty years of effort here, including some with speech recognition. The current methods include tandem and hybrid approaches; multiple layers and initialisation do sometimes help. But, as with automatic speech recognition in general, the fundamental algorithms of the neural nets used for speech recognition are actually reasonably old as well. The engineering efforts to make use of the computational capabilities have helped, of course. I would argue that features still matter, and that wide is important, not just deep. And where is that missing tent? ASR still performs badly for conditions unseen during training, so we have to keep looking.
And that's it. Thank you very much.
Okay, we can take some questions.
I can't resist commenting on one of the things. I liked, you know, the question of architecture, really, because the idea of using hidden units from one task and reusing them again, which we used in '89 in what we called modular neural networks at the time, was extremely successful work; but it was discarded at the time, because people said, okay, the theory says that with one hidden layer you can represent any convex classification function, so we don't need the architectural, multilayer way. So this discarded a lot of work on, effectively, multilayer deep neural networks, even though at the time it had already shown promise. Now, what bothers me still today, with the work that's going on right now, is that people really don't look very much at how to do automatic architecture learning. In other words, how we learn the architecture, whether by creating another layer, or making a layer wider or narrower, or creating different delays, we do all of this by repeating the same experiments over and over again. Think about how humans learn: they do it in developmental stages. We don't all, you know, sit in the corner and run back propagation for twenty years and then wake up and know speech; we learn to babble, we learn words, et cetera, so there must be some schedule by which we build architectures in a developmental way. And the more we look at low-resource settings, multiple languages, et cetera, I think having some mechanism for building these architectures while learning is some fundamental research that is still missing, in my view. But I'd like to hear your comment on that.
I guess that's more of a comment than a question, but the only thing I would add, and I mean, I agree with you, is that in that 1961 approach I mentioned, the idea was that it actually also built itself up automatically. In that case it was also a feature selection system as well: it would look at a superset of possible features, take a bunch of them, build up a unit based on that, and then consider what other group of features to use. So it actually did build itself up; not a completely general architecture, but it did a fair amount of automatic learning of structure. And that was 1961, at Cornell.
Yes, right.
Okay, other questions or comments?
So, you've worked through this cosine function, right? You know, going up, then going down, and now being up again. So do you think this cosine curve is going to continue, will it go down again, or will we manage to stay on the up for the rest of our productive lives? Or is it going to...?
Well, no one knows, okay, but I think it depends on to what extent we believe the exaggerated claims. You know, speech recognition works really well under many circumstances and fails miserably under others; so if we push the hype too hard, and people believe too much that we have already found the holy grail, then after a while, when they start using it and having it fail, funding will go down and interest will go down, for the whole field of speech recognition, but in particular for any particular method.
So I think, and my own feeling is, I mean, obviously I like using artificial neural networks; I've been doing it for a long time. I started using them thirty-three years ago because I had a particular task and tried a whole bunch of methods, and it just so happened, I mean it was just luck, that the neural net I was using was the best of the different things for that particular small voiced/unvoiced speech task. So I like them. But I think they're only a part of the solution, and this is why I emphasise that what you feed them, and I should also say what you do with their outputs, are both at least as important, probably more important, than the stuff that we're currently mostly excited about.
And so I think that, well, Gaussian mixtures had a great run, didn't they? And I think people will still use them; they're another tool. There are very nice things about Gaussians, there are nice things about sigmoids, there are nice things about other kinds of nonlinear units; people have rectified linear units nowadays. But I think the level of excitement will probably go down somewhat, because, you know, after a while of being excessively excited, with papers saying very similar things, it sort of dies down. But if people start using these things in different ways, feeding them different things, making use of the outputs in different ways, et cetera, the interest can be sustained.
You mentioned that one of the big advantages of neural nets is that they can take a lot of abuse as to what you feed them, as long as it carries the right kind of information. I also feel that there is great potential for various architectures built on them; you mentioned the one that takes the high and low modulation frequencies, selects outputs from that, and combines them with other streams, so I think there is plenty of opportunity for us to be busy for some time. The one worry I have is the one you mentioned: that you can try all kinds of things and just report whichever happens to work. And I would somehow like to encourage the community, and maybe I see it slightly differently: you know, one hope is that the architecture could actually pop out somehow automatically. I don't know whether we can do it all automatically, but I think we still need to build models, and I see work like what Christoph presented here, basically learning from the way the auditory system is working, as plenty of inspiration for various architectures in this new movement, because neural nets are indeed simple and very tolerant of abuse in terms of what you feed them.
I mean, I agree, and maybe I'd emphasise it even more than you did. As I see it, we have right now this real separation: there's the front end, and somebody works on the front end; and then there are the neural nets; and then there are the HMMs and the language models and so forth. These are really quite separate, but in the long run they really need to be very integrated. The particular example I showed was already kind of mixed together: you had some of the signal processing going on later, and some of the learning going on earlier, and all of that. But when you start opening that up, and you say, you know, it's not just adding a unit or something like that, like the 1961 approach, it can be anything, then I think you're really lost unless you have some example to work from.
So for me, I mean, I have no problem, and I think Hynek doesn't either, with a purely engineering approach: if we come up with something that has nothing to do with brains and it just works better, fine; we're engineers, that's okay. The problem is that the design space is infinite, and so how do you figure out what direction to even go in? And that, I feel, is the appeal that the brain-related, biologically inspired stuff has for us: it's a working system, something that already works, and so it really does reduce the space that you have to consider. Is someone else going to come up with some information-theoretic approach that ends up being better? Maybe; fine. But this is where it occurs to us to look.
Other questions?
So, you mentioned that HMM-GMM systems at some point got much stronger, and one of the aspects is that they could be adapted well. So one would think about adapting neural networks in some sort of similar manner. I mean, for a recognition task you want it to be adapted to the speaker, and from my limited knowledge I think that adaptation methods for neural networks are still being figured out, but all the intuition for doing adaptation comes from, you know, the experience that we have with HMM-GMM systems, at least for me. So, okay, if you talk about something like speaker adaptive training, could you think of a neural network sort of becoming speaker-independent through speaker adaptive training? What do you think: is there a direction to build a speaker-independent, truly speaker-independent, DNN? And I guess I mean speaker-independent by being very speaker-dependent and adaptive.
Actually, if you do a little literature search, there's a bunch of work on adapting neural nets for speech recognition from the early nineties. This work was largely done at Cambridge, by Tony Robinson and his crew, and at INESC in Portugal by João Neto and colleagues, and we were actually in a collaboration with them at the time. And there were four methods, as I recall, that were used. One was to have a linear input transformation: so if you had, say, thirteen PLP coefficients, you'd just have a thirteen-by-thirteen matrix coming in. Another was at the output, so if you were doing monophones, something like fifty-by-fifty. A third was to have an additional hidden layer off to the side that you just sort of added, and trained up with the limited data that you had for the new speaker. These were all supervised adaptation. And my favourite, the one I proposed, was to just retrain everything; the original objection to that being that you might have millions of parameters, but my feeling was that you'd just move everything a little bit.
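(As one concrete illustration of the first of those four methods, a sketch under my own assumptions rather than the original code: the linear-input style of adaptation puts a small square matrix in front of a frozen speaker-independent net and trains only that matrix on the adaptation data.)

```python
import numpy as np

class LinearInputAdapter:
    """Speaker adaptation via a single linear input transform (e.g. 13x13 for
    13 PLP coefficients); the speaker-independent net itself stays frozen."""

    def __init__(self, dim):
        self.W = np.eye(dim)           # start at the identity: no adaptation yet

    def forward(self, frames):
        return frames @ self.W.T       # transformed features feed the frozen net

    def sgd_step(self, frames, grad_wrt_transformed, lr=1e-3):
        """grad_wrt_transformed: gradient of the loss w.r.t. the net's input,
        obtained by backpropagating through the frozen network."""
        self.W -= lr * grad_wrt_transformed.T @ frames
```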
And they all worked, to varying degrees, I think it's fair to say. But neither the HMM-GMM adaptations nor those neural net adaptations really solve the problem; they all move you a little bit. We did some experimentation as part of the project that Steve is going to talk about on Thursday, where we used MLLR, for instance, to try to adapt to distant-microphone recordings given close-microphone training, and it helps, but it's not like it makes the problem go away. So I'd say that you can use any of these methods, for both neural nets and for Gaussians; there are methods for both, but none of them really solves the problem.
Any other questions? There's one there, and a couple back here.
Thank you for a very interesting talk. I was just curious whether there's any research that looks at adaptation in speech recognition compared with human speech recognition. The reason I ask is that, at least to me, it seems we should look at the places where human recognition breaks down; I was on a call the other day with a really bad connection and I just couldn't understand what was being said. We don't really go and look at how our systems would do in exactly the same conditions where a human wouldn't be able to understand. Is there hope that systems could actually be better than humans there, and shouldn't that really be the aim? Or am I off track?
Well, when my phone gets answered by a fax machine, I don't understand it at all, so there a machine can do better. But I think in general we're pretty far from that. There are individual examples that you could think of; I think my favourite is anything involving attention. My wife used to work with these large American Express call centres, and when we first got together I was always telling her that humans are so good at speech recognition and, you know, machines are so bad; and she said, well, the humans aren't ideal either. It turned out that the people in the call centres really are great, definitely much better than anything we do with machines, for simple tasks like a string of numbers, right after they've had coffee. And they're terrible after lunch. Now, they do have, I mean, I didn't talk about recovery mechanisms, but the saving grace for people is that they can say, "could you repeat that, please?", and although we have some of that in our systems, humans are better at it.
So I think there are other tasks for which machines can clearly be much better, because people are not trained for them and there is no evolutionary guidance towards their being better at them. So, for instance, doing speaker recognition with speakers that you don't know very well: I think machines can be better. I used to do some work with EEG, and EEG analysis isn't something we, you know, grow up doing, and so machines can do classification there that's much better than people.
But I think for sort of straight-out, typical speech recognition: take that noisy example, feed it to any of our recognizers, and look at some of the signal-to-noise ratios that Christoph was showing earlier, basically zero dB signal-to-noise ratio. If human beings are paying attention, listening to strings of digits, they just get them. And our systems: look at any of them, even with the best noise-robust front ends people have in their papers, and their performance at zero dB signal-to-noise ratio is dismal. And that's with the best that we have, not the worst. So I think we're just so far from that for straight-out speech recognition. But maybe someday we'll be saying, well, use the automatic system so that we can figure out what was said.
but maybe someday be saying well of this automatic system that we can figure out
high so you use like computer vision under networks are very appealing you can speak
and visualise what are being learned at the hidden layers so you can see that
explaining stuff specific parts of the faceplates and
so in speech you have an intuition about what is being learned in those hidden
layers
Well, I mean, there have been some experiments where people have looked at some of these things. Again I'll make reference to the work I mentioned before, the tonotopic multilayer perceptron: he found that it was attempting to mimic what was happening with the nets that were trained on individual critical bands. And he did another one where he just threw the whole spectrum in, and what was learned at the layers did in fact include interesting shapes, interesting Gabor-like shapes and so forth. And there have been a number of experiments where people have looked at some of those early layers. Once you get pretty deep, especially at six or seven layers, I think it would be pretty hard to do, but I would think it's possible; I know there's been some work.