That's actually... that's a Morgan kind of introduction; it didn't say too much. Thank you, Brian.
Actually, before I get to the talk, I should mention something. I had a brief discussion with someone about the posters, and we realised that to some extent the optimum strategy for a poster would be to make it seem like it's really interesting but completely impossible to understand, so that people will want to come up and have you explain it.
Anyway, here we are, back again. Someone else suggested that perhaps this talk should be called "deja vu all over again", from that same philosopher, Yogi Berra.
But let me start with a little story. Just to remind you: Arthur Conan Doyle wrote a series of stories about a detective, I'm sure you know the name, Sherlock Holmes, and he had a colleague, Dr. Watson, who really didn't know quite so much.
So Holmes and Watson went on a camping trip. They shared a good meal, had a bottle of wine, and retired to their tent for the night.
At three in the morning, Holmes nudges Watson and says, "Look up at the sky and tell me what you see." Watson says, "I see millions of stars." Holmes asks, "And what does that tell you?" Watson replies, "Astronomically, it tells me there are billions of galaxies and potentially millions of planets. Astrologically, it tells me that Saturn is in Leo. Theologically, it tells me that God is great and we are small and insignificant. Horologically, it tells me that it's about three o'clock. Meteorologically, it tells me we'll have a beautiful day tomorrow. What does it tell you, Holmes?" And Holmes says, "Watson, someone has stolen our tent."
So: what might we be missing?
And there are some great, really exciting results. There are a lot of people who are interested now in neural nets for a number of application areas, but in particular in speech recognition, which is what brings us here. But there might be a few things that we're missing along the way, and perhaps it might be useful to look at some historical context to help us with that.
So, as Brian alluded to earlier in the day, there has been a great deal of history of neural networks for speech, and of neural networks in general, before this. And I think of this as occurring in three waves.
The first wave was in the fifties and sixties, with the development of perceptrons and the like; I think of this as the basic structure, or the BS.
In the eighties and nineties we had back propagation, which had actually been developed before that but was really applied a lot then, and multilayer perceptrons, or MLPs, which applied more structure to the problem: sort of an MS.
And now we have things that are piled higher and deeper, so it's the PhD level.
Now, for ASR, speech recognition: in the fifties and sixties we had digits, pretty much, or other very small vocabulary tasks. In the eighties and nineties we actually graduated to large vocabulary continuous speech recognition. And in this new wave there's really quite widespread use of the technology, and the tasks have got harder still.
Now, this talk isn't about the history of speech recognition, but I think I can't really do a history of neural nets for speech recognition without doing a little bit of that.
That also had an early start. The best-known early paper was a 1952 paper from Bell Labs, but before that there was Radio Rex. Now, if you haven't seen or heard about Radio Rex: Radio Rex was a little toy dog in a dog house, and you'd say "Rex!" and Rex would pop out. Of course, if you did that, Rex would also probably pop out for just about anything that had enough energy at five, six, seven hundred hertz or so, because the little doghouse actually resonated at some of those low frequencies, and when it resonated and vibrated it would break a connection to an electromagnet, and a spring would push the dog out.
So you could think of it as speech recognition with really bad rejection.
Now, the first paper that I know of, anyway, that was doing real speech recognition was this paper by Davis and colleagues on digit recognition from Bell Labs. It approximated the energy in the first couple of formants, really just how much energy there was over time in different frequency regions, and it already had some kind of robust estimation; in particular it was quite insensitive to the amplitude.
And it worked very well under limited circumstances. That is, it was pristine recording conditions: very quiet, very good signal-to-noise ratio, in the laboratory, and also for a single speaker. It was tuned to a single speaker, and really tuned, because it was a big bunch of resistors and capacitors. It also took a fair amount of space. That was the 1952 digit recogniser; it wasn't something that you would fit into a 1952 phone.
Now, I should say that this system had a reported accuracy of ninety-seven or ninety-eight percent, and since every commercial system since then has reported an accuracy of ninety-seven to ninety-eight percent, you might think there's been no progress. But of course there has been; the problems have got much harder.
And that said, speech recognition accuracy isn't the real point here; this talk is mostly history.
Fundamentally, the early ASR was based on some kind of templates or examples, and distances between the incoming speech and those examples. In the last thirty to forty years, the systems have pretty much been based on statistical models, especially in the last twenty-five. The hidden Markov model technology, however, is based on mathematics from the late sixties, and the biggest source of gains since then (this is a slightly unfair statement that I'll justify in a moment) has been having lots of computing.
Now, obviously there are a lot of people, including a lot of people here, who have contributed many important engineering ideas since the late sixties. But those ideas were enabled by having lots of computing and lots of storage.
Statistical models are trained with examples; this is the basic approach we all know about. The examples are represented by some kind of choice of features, the estimators generate likelihoods for what was said, and then there is a model that integrates over time these sort of point-wise-in-time likelihoods.
Now, artificial neural nets can be used in this picture either to generate the features that are then processed by some kind of probability estimator that isn't a neural net, or to generate the likelihoods that are actually used in the hidden Markov model.
Going back to these three waves: in the first wave (and actually, I should say, a lot of the things from the early wave carried through to later ones) the idea was the McCulloch-Pitts neuron model. And there were training algorithms, learning algorithms, that were developed around this model: perceptrons, Adalines, and other more complex things, an example of which is what's called discriminant analysis iterative design, or DAID.
Now, going into these a little bit. The McCulloch-Pitts model was basically that you had a bunch of inputs coming in from other neurons, they were weighted in some way, and when the weighted sum exceeded some threshold, the neuron fired.
Now, the perceptron algorithm was based on changing those weights when the firing was incorrect; that is, for a classification problem, when it said it was a particular class and it really wasn't.
(By the way, I'm going to have almost no equations in this presentation, so if that's a problem for you, too bad.)
So the perceptron learning algorithm adjusted these weights using the outputs, using whether the neuron fired or not. The Widrow-Hoff approach, by contrast, was a linear processing approach, where the weights were adjusted using the weighted sum itself. The initial versions of all the experiments with both of these were done with a single layer, so they were single-layer perceptrons and single-layer Adalines.
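(As an aside, here is a minimal sketch, not from the talk and purely illustrative, of the two single-layer update rules just described: the perceptron rule only changes the weights when the thresholded "fired / didn't fire" decision is wrong, while the Widrow-Hoff LMS rule adjusts the weights in proportion to the error in the weighted sum itself.)

```python
import numpy as np

def perceptron_update(w, x, target, lr=0.1):
    """Perceptron rule: change the weights only when the thresholded output
    disagrees with the target label (both in {0, 1})."""
    fired = 1 if np.dot(w, x) > 0 else 0
    return w + lr * (target - fired) * x   # no change when the decision was right

def lms_update(w, x, target, lr=0.1):
    """Widrow-Hoff (Adaline/LMS) rule: adjust in proportion to the error in the
    linear weighted sum, whether or not the thresholded decision was correct."""
    return w + lr * (target - np.dot(w, x)) * x
```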
And in the late sixties there was the famous book by Minsky and Papert, "Perceptrons", that pointed out that such a simple network could not even solve the exclusive-OR problem. But in fact multiple layers were used as early as the early sixties; an example of that is this DAID algorithm.
So DAID was not a homogeneous neural net like the kind of nets that we mostly use today. It had Gaussians at the first layer and a perceptron at the output layer; it was somewhat similar to the later radial basis function networks, which also had some kind of radial basis function, a Gaussian-like function, at the first layer.
It also had a clever weighting scheme: when you loaded up the covariance matrices for the Gaussians, you would give particular weight to the patterns that had resulted in errors, and it used an exponential loss function of the output to do that.
This wasn't really used for speech, but it was used for a wide variety of problems by McDonnell Douglas and, you know, other governmental and commercial organisations. A lot of people don't know about it; I happened to know about it because I ran across it at one point. But anyway.
So, moving to neural nets for speech recognition. In the early sixties at Stanford, Bernard Widrow's students did a system for digit recognition where they had a series of these Adalines, these adaptive linear units. And it worked quite well within-speaker, much as the 1952 system had, except that this was automatic; you didn't have to tune a bunch of resistors. And it had terrible error rates across speakers. But it was sort of comparable, and it was using this kind of technology.
Moving into the eighties: wave two. Various people did consonant classification with such systems; I had the good fortune to be able to play around with such things for voiced/unvoiced classification for a commercial task. And competing systems started coming up by the mid to late eighties. People at CMU, Alex Waibel, and Geoff Hinton who was there at the time, and Kevin Lang, did this kind of classification for stop consonants using such systems. And there were many others; I don't have enough room on one slide for how many there were, but Kohonen in Finland, Kammerer and Kupper in Germany, Peeling and Moore in the UK, and many others built up these systems and did, typically, isolated word recognition.
Then, by the end of the eighties, we got to real speech recognition; that is, continuous speech recognition, speaker-independent, et cetera.
So, I have had the good fortune to have really clever friends, and together with some of them I did some of this work. Hervé Bourlard came to visit ICSI in '88, and he and I started a long collaboration where we developed an approach for using feed-forward neural networks for speech recognition.
And there were a range of other people who did related things, in particular in Germany and at CMU.
There was also work on recurrent networks. With the feed-forward nets you just go from input to output and nothing comes back, whereas the recurrent nets actually fed back. And this was really pioneering; I mean, there were a number of people who worked with recurrent networks, but for applying them to large vocabulary continuous speech recognition, the real centre for that was Cambridge: Tony Robinson and, while he was still alive, Frank Fallside.
And what both approaches had in common was that through the training they generated posterior probabilities of phone classes, and then they derived state emission likelihoods for hidden Markov models from them. Typically we found it worked better in most cases to divide by the prior probabilities of each of the phone classes and get scaled likelihoods. And we attached a moniker to this; we called it the hybrid HMM/MLP, or hybrid HMM/ANN, system.
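(A rough sketch of the scaled-likelihood step just described, my own illustration with made-up variable names: the MLP gives posteriors P(q|x), and dividing by the class priors P(q) gives a quantity proportional to the emission likelihood p(x|q), which is what the HMM decoder wants.)

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Convert per-frame MLP phone posteriors into scaled emission scores.

    By Bayes' rule, P(q|x) / P(q) = p(x|q) / p(x); since p(x) is shared by all
    states q at a given frame, the ratio can be used directly as the HMM
    state emission score.
    """
    return log_posteriors - log_priors   # shape: (n_frames, n_phone_classes)

# hypothetical usage, with posteriors from a softmax output layer and priors
# counted from the training alignments:
# scores = scaled_log_likelihoods(np.log(posteriors), np.log(priors))
```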
So, with MLPs you would use error back propagation, using the chain rule to spread the blame or credit back through the layers. It was simple to use, simple to train, and gave powerful transformations. MLPs were also used for classification and prediction, but in the hybrid system the idea was to use them for probability estimation. And initially we did this for a limited number of classes, typically monophones.
This slide has the only equation in the talk. We did understand that having some representation of context could be beneficial, but it was kind of hard to deal with twenty-some years ago, and the notion of having thousands and thousands of outputs just didn't seem like a particularly good one, at least given the limited amount of training data and computation that we had to work with.
So we came up with a factored version. In this equation, Q stands for the states, which in this case were typically monophones; C stands for context; and X is the feature vector. And you can break it up, without any assumptions, no independence assumption, into two different factorisations: the probability of the state given the context and the input, times the probability of the context given the input; or, the other one on the right, the probability of the context given the state and the input, times the monophone probability. And the latter one means that you could take the monophone net that you had already trained and just multiply it by this other one.
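(Writing out the one equation on the slide, as reconstructed from the description above:)

```latex
% Q: state (typically a monophone), C: context class, X: acoustic input
\begin{aligned}
P(Q, C \mid X) &= P(Q \mid C, X)\, P(C \mid X)\\
               &= P(C \mid Q, X)\, P(Q \mid X)
\end{aligned}
% The second form keeps the already-trained monophone net P(Q | X) and simply
% multiplies it by a second net estimating P(C | Q, X).
```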
As with other things, we were a bit naive initially, so for the first six months to a year it didn't work at all. And then our colleagues at SRI, who were very helpful, came up with some really good smoothing methods, which, given the limited amount of data we were working with, were really necessary to make context work.
And then, a few years later, Fritsch at CMU took this to an extreme, where you actually had a tree of such MLPs, so you could implement this factorisation over and over and get finer and finer, and down at the leaves you actually had tens of thousands or even a hundred thousand generalized triphones of some sort. And it worked very well; it was actually quite comparable to other systems at the time. But it was really complicated, and most people at this point had really focused in on Gaussian mixture systems, so it never really took off.
Now, if you look at where all this was around 2000: the Gaussian mixture approaches had matured. People had really learned how to use them, and many refinements had been developed. Think about Gaussians: you have means, you have covariances, and people typically use diagonal, variance-only covariance matrices, so there are lots of simple things that you can do with them. Many of these were developed: MLLR, SAT, and a whole list of others; all sorts of alphabet soup.
These didn't come easily, and they didn't come at all to the MLP world, because the MLP world, at least for large vocabulary continuous speech recognition, was at this point really confined to a few places. Almost everybody was working with Gaussian mixtures, so it was kind of hard to keep up.
But we still wanted to use the nets, and we liked them. One important reason for us was that they worked really well with different front ends. So if you came up with some really weird thing, say you listened to Christoph talking about the neurons and said "let's try that", you could feed it to the MLP and the MLP didn't mind. We had experience with a colleague of ours, for instance John Lazzaro, who was building these funny little chips that implemented, in subthreshold MOS, various functions that people had found in the cochlear nuclei and so on. You'd feed those into HTK and it would just roll over and die; we fed them into our systems and they didn't mind at all. Because of the nature of the nonlinearities, the MLP really was very agnostic to the kind of inputs.
So the question was how to take advantage of both. Well, what happened at this time: we were working with Hynek Hermansky, who was at OGI, and with Dan Ellis, who was at ICSI, and there was this competition happening for a standard for distributed speech recognition, the idea being that you would compute the features on the phone and then somewhere else you would actually do the rest of the recognition; so the goal was to replace MFCCs with something better. The models were required to be HMM/GMM; you couldn't change that. And we still liked the nets.
So the solution these guys came up with was to use the outputs of the MLP as features, not as probabilities. They weren't the only ones ever to use the outputs of MLPs as features, but there was a particular way of doing it that could be dropped into large vocabulary or small vocabulary systems (the original work was with digits), and this was called the tandem approach.
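(A minimal sketch of the tandem idea, my own illustration rather than the original recipe: take the MLP's phone posteriors, log them to make them more Gaussian-like, decorrelate with PCA, and hand the result to an otherwise unchanged HMM/GMM system as if they were ordinary acoustic features.)

```python
import numpy as np

def tandem_features(posteriors, pca_mean, pca_basis, n_keep=24):
    """Turn per-frame MLP phone posteriors into tandem features for an HMM/GMM.

    posteriors: (n_frames, n_phones) softmax outputs
    pca_mean, pca_basis: statistics estimated on training-set log posteriors
    """
    logp = np.log(posteriors + 1e-10)        # log warps the very skewed posteriors
    centered = logp - pca_mean               # remove the global mean
    return centered @ pca_basis[:, :n_keep]  # decorrelate and reduce dimension
```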
Now, there was also a sort of social and cultural advantage for our research. The nice thing was that instead of having to convince everybody that the hybrid systems were the way to go, we could just say, here are some cool features, you should try them out. And we could, and did in fact, collaborate with other people's systems that way. And I should give some credit, at the bottom there, to some related work that was being done in speaker recognition.
There were also other variants. Once you get the idea that you have some interesting use of neural nets to generate features, you can also focus on temporal approaches, which Hynek and his folks did with TRAPS, where you have neural nets each looking at just a part of the spectrum over a long stretch of time, so they are kind of forced into learning something about the temporal properties that helps you with phonetic identification. ICSI's version of this was called HATS, for hidden activation TRAPS.
And in all of these there was the germ of what people do now with the layer-by-layer stuff, because you'd train something up and then you'd feed that into another net; in the case of HATS, you'd train something up, then throw away the last layer and feed the hidden activations in as features for something else. Then there were a bunch of things that worked with Gabor filters and modulation-based inputs, and you could, using a tandem approach, end up getting features from those as well.
And then a much more recent version is bottleneck features, which are kind of tandem-like. It's not exactly the same thing, since they're not coming from posteriors, but it is using an output from the net as the features.
So, the third wave. Why did things go the way they did? There's nothing wrong with the original hybrid approach; I mean, it worked fine. The GMM approach sort of had the victory because, when you get a lot of people moving in the same direction, a lot of things can happen; but also, given the computation and storage and so forth available, it was a lot more straightforward, I think, to make progress with modifications to the GMM-based approaches.
So the fundamental issues with going further with the hybrid approach were how to apply many parameters usefully, and how to get these emission probabilities for many phonetic categories. And aspects of the solution were already there. As I already mentioned, in a number of these approaches we were already generating MLPs layer by layer. For many phonetic categories, there was some work on context dependence, but that needed to be pushed further. Learning approaches: second-order methods and so forth; there were many papers on these sorts of things, on variants of conjugate gradient and the like, in the eighties, and conjugate gradient of course is much older than the eighties.
But someone had to do all this. And when I make these reflections about the earlier time, I don't want to cast aspersions on the people who are doing great things now; someone actually had to put these things together and push them forward. And in that kind of discussion you have to start with Geoff Hinton. Geoff is kind of an excitable guy: he was very excited by back propagation in the eighties, he's been excited about these things since, and he is very good at spreading that excitement.
So he developed particular initialisation techniques, some of them unsupervised techniques in particular, which he likes because they seem biologically plausible. And this permitted the use of many parameters in all the layers, because when you have many layers, back propagation isn't too effective: down at the early layers, the credit and blame get watered down. This excitement spread to Microsoft Research, and they extended what had gone before to many phonetic categories and large vocabulary speech recognition, and lots of other very talented people at Google, IBM, and elsewhere followed.
So, initialisation, having a good starting point for the weights before you start discriminative training of some sort, was often used for the limited-data case. It was often the case back in the early nineties, when we were going into some situation where we had relatively little data, that we would train on something else first and then start from those weights; maybe we wouldn't even train all the way, we'd just do an epoch or two, and then we would go to the other language or the other task. And we often found that to be very helpful.
So Hinton developed a general unsupervised approach, applied to multiple layers, in general called deep learning. A lot of the early versions were sometimes called deep belief nets, and more generally DNNs, and were applied to other applications than speech. And again, it gave reasonable weights for the layers far from the targets, because even if back propagation training doesn't move those weights much at all, at least the early layers are doing something useful. Later speech work, a lot of the things that you see in the posters and papers of the last couple of years, actually skips this step and does something else, for instance layer-by-layer training done discriminatively. And many approaches use some kind of regularisation to avoid overfitting.
So the recent work, which you'll hear much more about later today, shows significant improvements over comparable GMMs. And there's a mixture of approaches: sometimes tandem-like or bottleneck-like, sometimes hybrid mode; I think they're usually hybrid mode. And I have to say, it's great that they're called deep neural nets, but they're still multilayer perceptrons; they're just multilayer perceptrons with, you know, a certain number of layers. And you can say, well, okay, but it's really different with seven hidden layers than it used to be. Maybe. But we do have to ask: how deep do they need to be?
So: many experiments show continued improvements with more layers, and at some point there are diminishing returns; but the underlying assumption there is that there's no limit on parameters. So we started asking the question: what if there were a limit? Now, why would you want a limit? Well, because in any practical situation you actually are under some kind of limit; at least there's a cost. You could think of the number of parameters as a proxy for the cost, for the resources in general, for the time it takes to train, the time it takes to run, the amount of storage. And, well, there are people who will say that's not an issue, but I have to say, you know, even if you've got a million machines, you probably want a hundred million users, so it still matters how many parameters you use.
So at Interspeech we presented something which I'm just going to summarise for a minute or two here, which we called "deep on a budget". We said: suppose we have a fixed but very large number of parameters (we wanted to make sure that nobody thinks we didn't use enough parameters), and then you compare between narrow-and-deep versus wide-and-shallow. We often see comparisons where people try, you know, the earlier style that we often used of one big hidden layer versus a bunch of layers; but we wanted to do it all along the way, step by step: two hidden layers, three hidden layers, four hidden layers, keeping the architecture otherwise the same. And it was only one task, and a pretty small task at that, Aurora 2, but that allowed us to look at varying signal-to-noise ratios. We asked: if you did this on a budget, what works best?
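(The bookkeeping behind "deep on a budget" is simple; here is a sketch, mine and with illustrative numbers rather than the published setup, of how you pick the hidden-layer width so that nets with different numbers of hidden layers all land on the same total parameter count.)

```python
def total_params(n_in, n_hidden, n_layers, n_out):
    """Weights plus biases of an MLP with n_layers equal-width hidden layers."""
    sizes = [n_in] + [n_hidden] * n_layers + [n_out]
    return sum((a + 1) * b for a, b in zip(sizes[:-1], sizes[1:]))

def width_for_budget(budget, n_in, n_layers, n_out):
    """Largest equal hidden width that stays within a fixed parameter budget."""
    width = 1
    while total_params(n_in, width + 1, n_layers, n_out) <= budget:
        width += 1
    return width

# e.g. fix roughly two million parameters for a 351-dim input and 56 outputs
# (both numbers are just for illustration):
# for depth in (1, 2, 3, 4, 5):
#     print(depth, width_for_budget(2_000_000, 351, depth, 56))
```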
And, you know, there's more to it: there are different kinds of additive noise (train station, babble, and so forth), and this was a mismatched case, clean training and noisy test; we didn't do the multi-style training. And it turns out that the answer is all over the map. But in particular, for the cases with kind of usable signal-to-noise ratios, and by usable I mean cases that gave you a few percent error on digits, as opposed to twenty or thirty or forty percent, which you just couldn't use for anything, two hidden layers was actually better. And, to deal a little bit with the question of "well, maybe you just picked the wrong number of parameters", we tried it with double the number of parameters and half the number of parameters, and we saw similar results.
So when I gave the longer version of this at Interspeech, some of the comments were along the lines of "why do you think two is better?" and so forth. I just want to be clear: I'm not saying that two is better than anything. What I'm saying is that if you are thinking of something actually going into practical use, you should do some experiments where you keep the number of parameters the same; you might then expand and so forth, but you should do some experiments with the number of parameters held constant, and then you get an idea about what's best, and it's probably going to be task-dependent.
So, we focus on neural networks, but we do have to be sure we ask the right questions. One question is what we feed into the nets; you know, there are all these questions about what's the right data, how many layers we have, and so forth. Some people, not naming any names, whom I'd characterize as true believers, think that features aren't important. Actually, to clarify: I had a discussion just after Interspeech, I think it was, where I made this comment, and he said, no, I think features are important, just not the usual ones. So anyway, features are important.
actually
to verify slightly a discussion just a
that interspeech i think it wasn't and
the
i made this comment and he said no i think features of importance just usual
so anyway features are important
and this goes back to the old general computer
axiom garbage in garbage are
people have done some very interesting experiments with feeding waveforms in and i should say
back and today we did some experiments hynek like this in experiments with a feeling
waveforms in comparing the plp needed waveforms way worse they have made some progress there
actually are doing better
But if you actually look in detail at what these experiments do: in one case, for instance, they take the absolute value, they floor it, they take the logarithm, they average over a bunch of frames; all sorts of things which actually obscure the phase. And that's kind of the point, because you can have waveforms of extraordinarily different shape that really sound pretty much the same. There are more recent results that use maxout pooling in convolutional neural nets, which also had, you know, a nice result; and again, this maximum-style pooling also tends to obscure the phase. But in both of those cases, and the other cases I've heard of anyway, it completely falls apart when you have mismatch, when the testing is different from the training.
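(The phase point can be made with a two-line experiment; this is my own sketch: randomise the phase of a signal's spectrum and the waveform shape changes completely, yet the magnitude-based quantities these pipelines end up computing are untouched.)

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)                 # stand-in for a stretch of audio
X = np.fft.rfft(x)

phase = rng.uniform(0.0, 2.0 * np.pi, X.shape)
phase[0], phase[-1] = 0.0, 0.0                 # keep the DC and Nyquist bins real
y = np.fft.irfft(np.abs(X) * np.exp(1j * phase), n=len(x))

print(np.corrcoef(x, y)[0, 1])                         # near 0: very different waveform shape
print(np.allclose(np.abs(X), np.abs(np.fft.rfft(y))))  # True: identical magnitude spectrum
```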
So what is the role of having a front end? After all, all the available data is in the waveform. There are some assumptions there, that, you know, you could learn everything from it, but let's ignore that for the moment. In fact, front ends do consistently improve speech recognition. And I have this quote, which I like, that I learned from Hynek: that the goal of front ends is to destroy information. That's a little extreme (he says these things sometimes), but I think it's true that some information is misleading and some information is irrelevant, and we want to focus on the discriminative information. Because the waveform that you receive is not just the spoken language: it also has noise and reverberation and channel effects and characteristics of the speaker, and if you're not doing speaker recognition, maybe you don't care so much about that. And so the front end can help to focus on the things that you care about for your particular task. And a good front end in principle, or to carry it to an extreme, can make recognition extremely simple. In principle, at least.
So what about the connection to MLPs? Well, as I alluded to earlier, MLPs have few distributional assumptions. MLPs can also easily integrate information over time and over multiple feature streams, and they can provide a useful way to incorporate more parameters. So yes, depth does give you a nice way, especially with good regularisation and initialisation and so forth, to incorporate more features, more parameters I should say, usefully.
But multiple streams can do this too. By multiple streams I mean having different sets of MLPs that look at the signal in different ways; you can really expand out the number of parameters, and in a way that is often quite useful. And I might as well throw in another acronym: if you use this together with depth, you can call it a DWN, a deep and wide network. You can combine these different streams easily, because the outputs are posteriors, and we know how to combine probabilities.
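(Because each stream's output layer is a posterior distribution over the same phone classes, combination can be as simple as a weighted average in the log domain. A minimal sketch of that kind of stream combination, under my own simple equal-weight assumption:)

```python
import numpy as np

def combine_stream_posteriors(stream_posteriors, weights=None):
    """Merge per-frame phone posteriors from several MLP streams.

    stream_posteriors: list of (n_frames, n_phones) arrays, one per stream.
    Uses a weighted geometric mean (log-domain average), then renormalises.
    """
    if weights is None:
        weights = [1.0 / len(stream_posteriors)] * len(stream_posteriors)
    log_combined = sum(w * np.log(p + 1e-10)
                       for w, p in zip(weights, stream_posteriors))
    combined = np.exp(log_combined)
    return combined / combined.sum(axis=1, keepdims=True)
```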
Here's an early example. A very talented guy at our place, fifteen or so years ago, built what he called a tonotopic MLP. The idea is that you have a bunch of different sets of layers that are looking at different critical bands; this is like the HATS and TRAPS and so forth, the difference being that it was trained all at once. And in fact this worked okay.
A recent example (and there are a bunch of such examples around; I just picked this one because it was done by one of my students, actually, in China) is one in which he had streams coming from high modulation frequencies and low modulation frequencies. And SPCA here is not the society for the prevention of cruelty to animals: it's sparse PCA, and it's used to pick out particular filters, Gabor filters in this case, that are particularly useful for the discrimination. These then go into deep neural nets, six-layer deep neural nets, and the output of one deep neural net goes into another, so I guess it's really deep, but you also have some width in there. This was used to some effect on very noisy data for the RATS program, which is data that's been transmitted through radio channels and is really extremely awful by the time you get it at the other side.
So, whether you call them DNNs or deep MLPs, nearly all of them are still based, essentially, on this McCulloch-Pitts model. There is some nice work (there's also a poster here) on more complex units, but certainly for large vocabulary tasks, for real word error rate measurements, they're not particularly better, which is a little disappointing. Maybe that work has just started. The complexity and power is not supplied by having more complex units, at least so far; it is supplied by the data, and also, as I said, with multiple streams, by the width. You also can represent signal correlations to some extent by pooling, and again by acoustic context. And so far, at least, the most effective learning methods are not biologically plausible.
So given all that, how can we benefit from biological models? Well, why would we want to benefit from biological models? Because we want to have stable perception in noise and reverberation, which human hearing can do and our systems certainly can't. The cocktail party effect, picking one voice out of many: there are some laboratory demonstrations of such things, but in general they don't really work. Rapid adjustment to changing conditions: I remember telling someone at one point that if our sponsors wanted us to have the best recognition anyone could have in this room, we'd collect a thousand hours in this room; then if the sponsors came back the next year and said, now we want it in that conference room down the hall, we'd have to collect another thousand hours. Okay, I'm exaggerating slightly; there is a set of techniques, adaptation, but it's really very minor compared to what people can do: we just walk into this room, or walk into the other room, and we just keep going, pretty much. And real speaker independence: we often call our systems speaker-independent speech recognizers, but when you have a voice that's particularly different, they do badly.
So can we learn from the brain? These are pictures from the same source as some of those in the first talk: ECoG. This is direct cortical measurement, as was explained. You get data from people who are in the hospital for certain neurosurgery because they have extreme cases of epilepsy which have not been sufficiently well treated by drugs; surgery is an option, but you have to figure out where the focus of the seizures is, and you also want to know where not to cut, in terms of language.
so
at each angle was mentioned earlier and new remotes grounding
had a lovely paper in nature couple years ago where they're making all kinds of
noise measurements during source separation and in this experiment they would play two speakers speaking
once
and
by the design of the experiment we get the subject to focus first on one
speaker and then on the other and sort of the changes and signal process
so this is giving clues about source separation and noise robustness and what's really exciting
about this from his that this isn't kind of intermediate so between E G which
is something i used to work with a long time ago we're on the scale
have really it spatial wrote
resolution you a pretty good temporal resolution
and the
single or
modest number of electrodes that directly like then there is on the surface intermediate region
and looks like we've got a lot of new kinds of information and the technology
on this is rapidly changing
people working on sensors are making these things with the sensor with the sensor with
the electorate closer and closer together
So the hope is that measurements like these, and like the things that Christoph spoke about, will lead to completely new processing steps. For instance, computational auditory scene analysis is based on psychoacoustics, and there's a range of things you can do to try to pick out one speaker from some other background; but if we actually had a better handle on what's really going on inside the system, we might be able to design those things better, rather than just relying on psychoacoustics. And this includes structures and things at the signal level and the computational level. There's also work that's been done, which will be talked about on Thursday night, by Steve for instance, on understanding what statistical systems can learn and what their limitations are. What that has in common with the other: it's not from the brain, but it is analysis of what's actually going on, and it can give you a handle on how to proceed. We need feature stability under different kinds of conditions (noise, room reverberation, and so on), and models that can handle dependent variables.
So, in conclusion: there is more than fifty years of effort here, including some with speech recognition. The current methods include tandem and hybrid approaches; multiple layers and initialisation do sometimes help. But, as with automatic speech recognition in general, the fundamental algorithms of the neural nets used for speech recognition are actually reasonably old as well. The engineering efforts to make use of the computational capabilities have helped, of course. I would argue that features still matter, and that wide is important, not just deep. And where is that missing tent? ASR still performs badly for conditions unseen during training, so we have to keep looking.
And that's it. Thank you very much.
Okay, we can take some questions.
I can't resist commenting on one of the things. I liked, you know, the question of architecture, really, because the idea of using hidden units from one task and reusing them again, which we used in '89 in what we called modular neural networks at the time, was extremely successful work; but it was discarded at the time, because people said, okay, the theory says that with one hidden layer you can represent any convex classification function, so we don't need the architectural, multilayer way. So this discarded a lot of work on, effectively, multilayer deep neural networks, even though at the time it had already shown promise. Now, what bothers me still today, with the work that's going on right now, is that people really don't look very much at how to do automatic architecture learning. In other words, how we learn the architecture, whether by creating another layer, or making a layer wider or narrower, or creating different delays, we do all of this by repeating the same experiments over and over again. Think about how humans learn: they do it in developmental stages. We don't all, you know, sit in the corner and run back propagation for twenty years and then wake up and know speech; we learn to babble, we learn words, et cetera, so there must be some schedule by which we build architectures in a developmental way. And the more we look at low-resource settings, multiple languages, et cetera, I think having some mechanism for building these architectures while learning is some fundamental research that is still missing, in my view. But I'd like to hear your comment on that.
I guess that's more of a comment than a question, but the only thing I would add, and I mean, I agree with you, is that in that 1961 approach I mentioned, the idea was that it actually also built itself up automatically. In that case it was also a feature selection system as well: it would look at a superset of possible features, take a bunch of them, build up a unit based on that, and then consider what other group of features to use. So it actually did build itself up; not a completely general architecture, but it did a fair amount of automatic learning of structure. And that was 1961, at Cornell.
Yes, right.
Okay, other questions or comments?
So, you've worked through this cosine function, right? You know, going up, then going down, and now being up again. So do you think this cosine curve is going to continue, will it go down again, or will we manage to stay on the up for the rest of our productive lives? Or is it going to...?
Well, no one knows, okay, but I think it depends on to what extent we believe the exaggerated claims. You know, speech recognition works really well under many circumstances and fails miserably under others; so if we push the hype too hard, and people believe too much that we have already found the holy grail, then after a while, when they start using it and having it fail, funding will go down and interest will go down, for the whole field of speech recognition, but in particular for any particular method.
So I think, and my own feeling is, I mean, obviously I like using artificial neural networks; I've been doing it for a long time. I started using them thirty-three years ago because I had a particular task and tried a whole bunch of methods, and it just so happened, I mean it was just luck, that the neural net I was using was the best of the different things for that particular small voiced/unvoiced speech task. So I like them. But I think they're only a part of the solution, and this is why I emphasise that what you feed them, and I should also say what you do with their outputs, are both at least as important, probably more important, than the stuff that we're currently mostly excited about.
And so I think that, well, Gaussian mixtures had a great run, didn't they? And I think people will still use them; they're another tool. There are very nice things about Gaussians, there are nice things about sigmoids, there are nice things about other kinds of nonlinear units; people have rectified linear units nowadays. But I think the level of excitement will probably go down somewhat, because, you know, after a while of being excessively excited, with papers saying very similar things, it sort of dies down. But if people start using these things in different ways, feeding them different things, making use of the outputs in different ways, et cetera, the interest can be sustained.
You mentioned that one of the big advantages of neural nets is that they can take a lot of abuse as to what you feed them, as long as it carries the right kind of information. I also feel that there is great potential for various architectures built on them; you mentioned the one that takes the high and low modulation frequencies, selects outputs from that, and combines them with other streams, so I think there is plenty of opportunity for us to be busy for some time. The one worry I have is the one you mentioned: that you can try all kinds of things and just report whichever happens to work. And I would somehow like to encourage the community, and maybe I see it slightly differently: you know, one hope is that the architecture could actually pop out somehow automatically. I don't know whether we can do it all automatically, but I think we still need to build models, and I see work like what Christoph presented here, basically learning from the way the auditory system is working, as plenty of inspiration for various architectures in this new movement, because neural nets are indeed simple and very tolerant of abuse in terms of what you feed them.
I mean, I agree, and maybe I'd emphasise it even more than you did. As I see it, we have right now this real separation: there's the front end, and somebody works on the front end; and then there are the neural nets; and then there are the HMMs and the language models and so forth. These are really quite separate, but in the long run they really need to be very integrated. The particular example I showed was already kind of mixed together: you had some of the signal processing going on later, and some of the learning going on earlier, and all of that. But when you start opening that up, and you say, you know, it's not just adding a unit or something like that, like the 1961 approach, it can be anything, then I think you're really lost unless you have some example to work from.
So for me, I mean, I have no problem, and I think Hynek doesn't either, with a purely engineering approach: if we come up with something that has nothing to do with brains and it just works better, fine; we're engineers, that's okay. The problem is that the design space is infinite, and so how do you figure out what direction to even go in? And that, I feel, is the appeal that the brain-related, biologically inspired stuff has for us: it's a working system, something that already works, and so it really does reduce the space that you have to consider. Is someone else going to come up with some information-theoretic approach that ends up being better? Maybe; fine. But this is where it occurs to us to look.
Other questions?
So, you mentioned that HMM-GMM systems at some point got much stronger, and one of the aspects is that they could be adapted well. So one would think about adapting neural networks in some sort of similar manner. I mean, for a recognition task you want it to be adapted to the speaker, and from my limited knowledge I think that adaptation methods for neural networks are still being figured out, but all the intuition for doing adaptation comes from, you know, the experience that we have with HMM-GMM systems, at least for me. So, okay, if you talk about something like speaker adaptive training, could you think of a neural network sort of becoming speaker-independent through speaker adaptive training? What do you think: is there a direction to build a speaker-independent, truly speaker-independent, DNN? And I guess I mean speaker-independent by being very speaker-dependent and adaptive.
Actually, if you do a little literature search, there's a bunch of work on adapting neural nets for speech recognition from the early nineties. This work was largely done at Cambridge, by Tony Robinson and his crew, and at INESC in Portugal by João Neto and colleagues, and we were actually in a collaboration with them at the time. And there were four methods, as I recall, that were used. One was to have a linear input transformation: so if you had, say, thirteen PLP coefficients, you'd just have a thirteen-by-thirteen matrix coming in. Another was at the output, so if you were doing monophones, something like fifty-by-fifty. A third was to have an additional hidden layer off to the side that you just sort of added, and trained up with the limited data that you had for the new speaker. These were all supervised adaptation. And my favourite, the one I proposed, was to just retrain everything; the original objection to that being that you might have millions of parameters, but my feeling was that you'd just move everything a little bit.
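(As one concrete illustration of the first of those four methods, a sketch under my own assumptions rather than the original code: the linear-input style of adaptation puts a small square matrix in front of a frozen speaker-independent net and trains only that matrix on the adaptation data.)

```python
import numpy as np

class LinearInputAdapter:
    """Speaker adaptation via a single linear input transform (e.g. 13x13 for
    13 PLP coefficients); the speaker-independent net itself stays frozen."""

    def __init__(self, dim):
        self.W = np.eye(dim)           # start at the identity: no adaptation yet

    def forward(self, frames):
        return frames @ self.W.T       # transformed features feed the frozen net

    def sgd_step(self, frames, grad_wrt_transformed, lr=1e-3):
        """grad_wrt_transformed: gradient of the loss w.r.t. the net's input,
        obtained by backpropagating through the frozen network."""
        self.W -= lr * grad_wrt_transformed.T @ frames
```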
And they all worked, to varying degrees, I think it's fair to say. But neither the HMM-GMM adaptations nor those neural net adaptations really solve the problem; they all move you a little bit. We did some experimentation as part of the project that Steve is going to talk about on Thursday, where we used MLLR, for instance, to try to adapt to distant-microphone recordings given close-microphone training, and it helps, but it's not like it makes the problem go away. So I'd say that you can use any of these methods, for both neural nets and for Gaussians; there are methods for both, but none of them really solves the problem.
Any other questions? There's one there, and a couple back here.
Thank you for a very interesting talk. I was just curious whether there's any research that looks at adaptation in speech recognition compared with human speech recognition. The reason I ask is that, at least to me, it seems we should look at the places where human recognition breaks down; I was on a call the other day with a really bad connection and I just couldn't understand what was being said. We don't really go and look at how our systems would do in exactly the same conditions where a human wouldn't be able to understand. Is there hope that systems could actually be better than humans there, and shouldn't that really be the aim? Or am I off track?
Well, when my phone gets answered by a fax machine, I don't understand it at all, so there a machine can do better. But I think in general we're pretty far from that. There are individual examples that you could think of; I think my favourite is anything involving attention. My wife used to work with these large American Express call centres, and when we first got together I was always telling her that humans are so good at speech recognition and, you know, machines are so bad; and she said, well, the humans aren't ideal either. It turned out that the people in the call centres really are great, definitely much better than anything we do with machines, for simple tasks like a string of numbers, right after they've had coffee. And they're terrible after lunch. Now, they do have, I mean, I didn't talk about recovery mechanisms, but the saving grace for people is that they can say, "could you repeat that, please?", and although we have some of that in our systems, humans are better at it.
So I think there are other tasks for which machines can clearly be much better, because people are not trained for them and there is no evolutionary guidance towards their being better at them. So, for instance, doing speaker recognition with speakers that you don't know very well: I think machines can be better. I used to do some work with EEG, and EEG analysis isn't something we, you know, grow up doing, and so machines can do classification there that's much better than people.
But I think for sort of straight-out, typical speech recognition: take that noisy example, feed it to any of our recognizers, and look at some of the signal-to-noise ratios that Christoph was showing earlier, basically zero dB signal-to-noise ratio. If human beings are paying attention, listening to strings of digits, they just get them. And our systems: look at any of them, even with the best noise-robust front ends people have in their papers, and their performance at zero dB signal-to-noise ratio is dismal. And that's with the best that we have, not the worst. So I think we're just so far from that for straight-out speech recognition. But maybe someday we'll be saying, well, use the automatic system so that we can figure out what was said.
but maybe someday be saying well of this automatic system that we can figure out
high so you use like computer vision under networks are very appealing you can speak
and visualise what are being learned at the hidden layers so you can see that
explaining stuff specific parts of the faceplates and
so in speech you have an intuition about what is being learned in those hidden
layers
Well, I mean, there have been some experiments where people have looked at some of these things. Again I'll make reference to the work I mentioned before, the tonotopic multilayer perceptron: he found that it was attempting to mimic what was happening with the nets that were trained on individual critical bands. And he did another one where he just threw the whole spectrum in, and what was learned at the layers did in fact include interesting shapes, interesting Gabor-like shapes and so forth. And there have been a number of experiments where people have looked at some of those early layers. Once you get pretty deep, especially at six or seven layers, I think it would be pretty hard to do, but I would think it's possible; I know there's been some work.