Thank you, and welcome back after the lunch break. My name is Frank Seide, I'm from Microsoft Research in Beijing, and this is a collaboration with my colleague Dong Yu, who happens to be Chinese but is actually based in Redmond. Of course there are a lot of contributors to this work, inside the company and outside, and also thank you very much to the people who shared slide material.
Okay, let me start with a personal story of how I got into this, because I'm sort of an unlikely expert on this: until two thousand ten I had no idea what neural networks were, deep ones or otherwise. So in two thousand ten my colleague Dong Yu, who cannot be here today, came to visit us in Beijing and told us about this new speech recognition result that they had, and he told me about a technology that I had never heard about, called a DBN, which was invented by some professor in Toronto whom I also had never heard about. He and his manager at the time had invited Geoffrey Hinton, this professor, to come to Redmond with a few students and work on applying this to speech recognition.
At that time he had gotten a sixteen percent relative error reduction out of applying deep neural networks, and this was for an internal voice search task with a relatively small number of hours of training data. You know, sixteen percent is really big; a lot of people spend ten years to get a sixteen percent error reduction. So my first thought about this was: sixteen percent, wow, what's wrong with the baseline? So we said, well, why don't we collaborate on this and try how it carries over to a large-scale task, namely Switchboard.
And the key thing that was actually invented here was: take the classic ANN-HMM, a reference that is probably familiar after what we heard this morning from Nelson, so I'm a little bit late with it, make it deep, so the classic ANN-HMM plus the deep network, the DBN, which, as I learned at that point, does not stand for dynamic Bayesian network, and then Dong Yu put in this idea of just using tied triphone states as the modeling targets, like we do in GMM-based systems. Okay, so.
Then fast forward: I spent something like half a year reading papers and tutorials to get started, and finally we got to the point where we got first results. So this is our GMM baseline, and I started the training; the next day I had the first iteration, which was something like twenty-two percent, so okay, it seems to not be completely off. The next day I come back: twenty percent, so already around fourteen percent relative, and I sent the congratulatory email to my colleague, right?
I let it run, and the next day it came back: eighteen percent. From that moment on I was basically just sitting at the computer waiting for the next result to come out and submitting the next run to see if it got better. We got seventeen point three, then seventeen point one. Then we regenerated the alignment, which is one thing Dong Yu had already determined on the smaller setup, and we got it down to sixteen point four; then we looked at sparseness: sixteen point one. Altogether a thirty-two percent error reduction. That's a very large reduction out of a single technology.
We also ran this over different test sets with the same model, and you could see the error rate reductions were all sort of in a similar range; it didn't matter which set, although for some the gains were slightly smaller. We also looked at other setups; for example, at some point we trained a two-thousand-hour model that is suitable for a product, like the system that you have on your phone right now, and we got something like fifteen percent error reduction. And other companies also started publishing: for example IBM on broadcast news, I think the total gain was thirteen to eighteen percent in the most up-to-date papers, and on YouTube I think it was about nineteen percent. So the gains were really convincing across the board.
Okay, so that was our work. So what is this actually? Now, I thought ASRU also has a fair portion of understanding people who might not know DNNs in detail, so I would like to go through and explain the basics of how this works a little bit more. I don't know how many understanding people are really here today; I hope it's not going to be too boring.
So the basic idea is: the DNN looks at, for example, a spectrogram, takes a rectangular patch out of that, a range of vectors, and feeds this into a processing chain which basically multiplies this input vector, this rectangle here, with a matrix, adds a bias, and applies a nonlinearity; then you get something like two thousand values. After that you do the same thing several times; the top layer does the same thing, except the nonlinearity is a softmax.
So these are the formulas for that. What is a softmax actually? It's this form here, which is essentially nothing else but a linear classifier, and it is linear because if you look at the class boundary between two classes it is a hyperplane; so it's actually a relatively weak classifier up there. The hidden layers are actually very similar: they have the same form; the only differences are that there are only two classes, membership and non-membership, instead of all the different speech states here, and that the second class has its parameters set to zero.
So what is this really? It is sort of a classifier that classifies membership or non-membership in some class, but we don't actually know what those classes are. And this representation is also kind of sparse: typically only maybe five to ten percent of the activations are active in any given frame. So these class memberships are really a kind of descriptive features of your input.
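To make those layer formulas concrete, here is a minimal NumPy sketch of the forward pass just described: each hidden layer is a sigmoid of an affine transform, and the top layer is a softmax. The layer sizes below are illustrative values of my own, not the actual system's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()                      # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def dnn_forward(x, hidden_layers, top_layer):
    """Run one input vector through sigmoid hidden layers and a softmax output."""
    h = x
    for W, b in hidden_layers:           # each hidden unit: a soft class-membership detector
        h = sigmoid(W @ h + b)
    W_out, b_out = top_layer
    return softmax(W_out @ h + b_out)    # posterior over the output classes

# Toy example: a 2-hidden-layer net on a random "spectrogram patch" input.
rng = np.random.default_rng(0)
dims = [429, 2048, 2048, 9304]           # input, hidden, hidden, senones (illustrative sizes)
params = [(0.05 * rng.standard_normal((dims[i + 1], dims[i])), np.zeros(dims[i + 1]))
          for i in range(len(dims) - 1)]
posterior = dnn_forward(rng.standard_normal(dims[0]), params[:-1], params[-1])
print(posterior.shape, posterior.sum())  # (9304,) 1.0
```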
So another way of looking at it is: basically what it does is take an input vector and project it onto something like a basis vector, one column; this would be like a direction vector we project onto; there is a bias term we add, and then you run it through this nonlinearity, which is just a sort of soft binarization. So what this does is give you sort of a, let's say, a coordinate system for your inputs. And yet another way of looking at it is: this here is actually a correlation, so the parameters have the same sort of physical meaning as the inputs you put in there. For example, for the first layer the model parameters are also of the nature of a rectangular patch of spectrogram. And this is what they look like; I think there was a little bit of discussion on this earlier in Nelson's talk. So what does this mean? Each of these is, in this case, a patch of some twenty-three frames along the time axis, and this is the frequency axis here. What happens is that these patches are basically overlaid over the input here, the correlation is computed, and whenever the input matches this particular pattern, this one for example is sort of a peak detector, it slides over time and you get the hidden activation.
You can see all these different patterns; many of them really look like the kinds of filters we know, but these are automatically learned by the system; there is no knowledge that was put in there. You have edge detectors, you have peak detectors, you have some sliding detectors, and you also have a lot of noise in there; I don't know what that is for; I think the later stages probably just ignore those. The harder problem is how to interpret the hidden layers: the hidden layers don't have any sort of spatial relationship to the input or anything, so the only thing that I could think of is that
they are representing something like logical operations. So think of this again: this is the direction vector, and this is the hyperplane that is described by the bias, right? So if your inputs, for example, are binary, this one is one, this one is zero, and this is obviously a two-dimensional example, then you could put the plane here and it indicates an OR operation, okay, kind of a soft OR because it's not strictly binary, or you put it here and it is like an AND operation. So my personal intuition of what the DNN actually does is this:
on the lower layers it extracts these landmarks, and on the higher layers it assembles them into more complicated classes. And you can imagine interesting things: for example, one node on one layer might discover, say, a female version of an 'a', and another node would give you a male version of an 'a'; then the next layer would just say it's an 'a', whether female or male. So this gives an idea of where the modeling power of this structure comes from. Okay, so the take-away: the lowest layer matches landmarks, the higher layers I think act as sort of soft logical operators, and the top layer is just a really primitive linear classifier.
Okay, so how do we use this in speech? You take those outputs, these posterior probabilities of speech states, and turn them into likelihoods using Bayes rule, and these are directly used in the hidden Markov model. And the key thing here is that these classes are tied triphone states and not monophone states; that is the thing that really made a big difference.
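In the hybrid decoder, the DNN posterior is divided by the state prior to get a scaled likelihood (Bayes rule, up to the constant p(x)). A minimal sketch, with the prior estimated from state counts in the training alignment; the counts here are made up for illustration:

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_counts, floor=1e-8):
    """Convert DNN log-posteriors log p(s|x) into scaled log-likelihoods
    log p(x|s) + const = log p(s|x) - log p(s), used as HMM emission scores."""
    priors = state_counts / state_counts.sum()           # p(s) from the training alignment
    return log_posteriors - np.log(np.maximum(priors, floor))

# Toy usage: 4 tied-triphone states, one frame of DNN output.
posteriors = np.array([0.70, 0.20, 0.05, 0.05])
counts = np.array([1000, 4000, 2500, 2500])              # hypothetical alignment counts
print(scaled_log_likelihoods(np.log(posteriors), counts))
```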
Okay, just before we move on, to give you a rough idea of what these word error rates actually mean, we want to play a little video clip where our executive vice president of research gave an on-stage demo, and you can see what accuracies come out of a speaker-independent DNN; it has not been adapted to his voice.
[video clip plays: live on-stage transcription of the speech]
Okay, so you see the recognition output together with the audio. [video clip continues] So this is basically perfect, right? And this is really a speaker-independent system.
And you can do interesting things with that, just for the fun of it. I'm going to play a later part of the video where we actually use this output to drive translation, translating it into Chinese with a synthesized voice. [video clip plays: the recognized speech is translated into Chinese and spoken aloud] So that's the kind of fun you can have with a model like that.
Okay, so now for this talk: there have been invited talks about DNNs at pretty much every one of these conferences lately, one-hour talks covering everything, for example last year, by Andrew Senior and others. When I prepared this talk I found that I basically ended up redoing Andrew's talk, and I thought that's maybe not a good idea; I want to do it slightly differently. So I will focus: I'm not going to give you an exhaustive overview of everything, but I will focus on what is needed to build real-life, large-scale systems; so, for example, you will not see a TIMIT result. And it is structured along three areas: training, features, and runtime. Training is the biggest one, and I'm going to start with that.
So how do you train this model? I think we're pretty much all familiar with back-propagation: you give it a sample vector, run it through the network, get a posterior distribution, compare it against what it should be, and then basically nudge the system a little bit in the direction of doing a better job next time.
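As a concrete illustration, here is a minimal NumPy sketch of one such nudge for a single softmax layer trained with cross-entropy; a full multi-layer version just chains the same idea backwards through the layers. All sizes and the learning rate are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sgd_step(W, b, x, target_id, lr=0.1):
    """One back-propagation/SGD step for a softmax classifier with cross-entropy loss."""
    p = softmax(W @ x + b)                 # forward: posterior distribution
    grad_z = p.copy()
    grad_z[target_id] -= 1.0               # dLoss/dz = p - onehot(target)
    W -= lr * np.outer(grad_z, x)          # nudge weights against the gradient
    b -= lr * grad_z
    return -np.log(p[target_id])           # cross-entropy on this frame

rng = np.random.default_rng(0)
W, b = 0.01 * rng.standard_normal((10, 40)), np.zeros(10)
x, y = rng.standard_normal(40), 3
print([round(sgd_step(W, b, x, y), 3) for _ in range(5)])   # loss should go down
```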
And the problem is, when you do this with a deep network, the system often does not converge well or gets stuck in a local optimum. So the thing that started this whole revolution with Geoffrey Hinton, the thing that he proposed, is the restricted Boltzmann machine. The idea is basically that you train layer by layer; we extend the network so it can be run backwards as well. You run the sample through, you get a representation, you run it backwards, and then you can see how well the thing that comes out actually matches my input. Then you can tune the system so that it matches the input as closely as possible. If you can do that, and don't forget this is sort of a binary representation, that means you have a representation of the data that is meaningful; this thing extracts something meaningful about the data. That's the idea. Now you do the same thing with the next layer: you freeze this one, take it as a feature extractor, do this with the next layer, and so on. Then you put a softmax on top and train with back-propagation.
Now, I had no idea about deep neural networks or anything when I started this, so I thought: why do we do this so complicated? I mean, we had already run experiments on how many layers you need and so on, so we already had a network with a single hidden layer. So why not just take that one as initialization, rip out its softmax layer, then put in another hidden layer and another softmax on top, and iterate the entire stack? After that, again rip this guy off and do it again, and so on, and once you are at the top, iterate the whole thing. We call this greedy layer-wise discriminative pre-training.
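A minimal sketch of that layer-growing schedule (my own illustration, not our actual training code): the network is a list of (W, b) layers ending in a softmax; to grow it, you discard the softmax, insert a fresh random hidden layer, put a fresh softmax back on top, and then run a few back-prop sweeps (as in the sketch above) before growing again.

```python
import numpy as np

def new_layer(n_out, n_in, rng, scale=0.05):
    return scale * rng.standard_normal((n_out, n_in)), np.zeros(n_out)

def grow(stack, hidden_dim, n_classes, rng):
    """Greedy layer-wise discriminative pre-training step:
    drop the current softmax, insert a new hidden layer, add a fresh softmax."""
    hidden = stack[:-1]                                   # keep the trained hidden layers
    in_dim = hidden[-1][0].shape[0]                       # output size of last hidden layer
    hidden.append(new_layer(hidden_dim, in_dim, rng))     # new random hidden layer
    hidden.append(new_layer(n_classes, hidden_dim, rng))  # new random softmax on top
    return hidden

rng = np.random.default_rng(0)
input_dim, hidden_dim, n_classes = 429, 2048, 9304        # illustrative sizes
stack = [new_layer(hidden_dim, input_dim, rng), new_layer(n_classes, hidden_dim, rng)]
for depth in range(2, 8):
    # ... here: run a few back-prop sweeps over the data (not to convergence) ...
    stack = grow(stack, hidden_dim, n_classes, rng)
    print(depth, [W.shape for W, _ in stack])
```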
And it turns out that actually works really well. If we look at this, the DBN pre-training of Geoffrey Hinton is the green curve here; if you do what I just described, you get the red one; they give essentially the same word error rate. This is for different numbers of layers; it is not the progression over training, but the accuracy for different numbers of layers. So the more layers you add, the better it gets, and you see the two basically track each other. The layer-wise pre-training is slightly worse, but then Dong Yu, who understands neural networks much better than I do, said you shouldn't iterate the model all the way to the end; you should just let it iterate a little bit until you are in the ballpark, and then move on. It turns out that made the system slightly better, and actually the sixteen point eight here is how this discriminative pre-training method works. One concern is that it's expensive, because every time you have this full nine-thousand-senone top layer there; but it turns out you don't need to do that: you can actually use monophones, and it works equally well. Okay, so the take-away: pre-training still seems to help, but greedy discriminative pre-training is sufficient and much simpler than the RBM pre-training, because we just reuse existing code and don't need any extra coding.
Okay, another important topic is sequence training. The question here is: we have trained this network to classify the signal into those segments of speech independently of each other, but in speech recognition we have a dictionary, we have language models, we have the hidden Markov model that gives you the sequence structure, and so on. So if we integrate that into the training, we should actually get a better result, right? The frame-classification criterion is written this way: you maximize the log posterior of the correct state for every single frame. If you write down sequence training, you find that it has exactly the same form, except that this here is not the state posterior derived from the DNN, but the state posterior taking all the additional knowledge into account; so this one takes into account the HMMs, the dictionary, and the language models. The way to run this is: you run your data through a speech recognizer and compute those posteriors; in practical terms you would do this with word lattices; and then you do back-propagation.
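For reference, a hedged sketch of the two criteria in the usual notation (this is the standard MMI-style formulation; details such as the acoustic scale κ vary between papers): the frame criterion maximizes the per-frame log posterior of the reference state, while the sequence criterion maximizes the posterior of the reference word sequence computed with the HMM, dictionary, and language model, in practice evaluated over word lattices.

```latex
% Frame-level cross-entropy criterion (reference state s_{u,t} at frame t of utterance u):
F_{\mathrm{CE}} = \sum_{u}\sum_{t} \log p_{\mathrm{DNN}}\!\left(s_{u,t} \mid o_{u,t}\right)

% Sequence-level (MMI-style) criterion over utterances u with reference words W_u:
F_{\mathrm{SEQ}} = \sum_{u} \log
  \frac{p\!\left(O_u \mid W_u\right)^{\kappa} P(W_u)}
       {\sum_{W} p\!\left(O_u \mid W\right)^{\kappa} P(W)}

% Its gradient w.r.t. the log-likelihood of state s at frame t has the same "posterior"
% form as the frame criterion, but with lattice-based numerator/denominator posteriors:
\frac{\partial F_{\mathrm{SEQ}}}{\partial \log p\!\left(o_{u,t}\mid s\right)}
  = \kappa\left(\gamma^{\mathrm{num}}_{u,t}(s) - \gamma^{\mathrm{den}}_{u,t}(s)\right)
```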
So we did that. We started with the CE baseline, and we did the first iteration of this sequence training, and the error rate went up instead of down. So that kind of didn't work. We observed that over time it sort of diverges; it doesn't look like it's training. So we tried to dig in: what is the problem here? There are four hypotheses: are we actually using the right models for lattice generation; lattice sparseness; randomization of the data; and the objective function, since there are multiple objective functions to choose from. Today I will talk about the lattice sparseness part.
The first thing we found was that there was an increasing problem of speech getting replaced by silence, a deletion problem: we saw that the silence scores kept growing while the other scores were not. Basically, what happens is that the lattice is very biased: the lattice typically doesn't have negative hypotheses for silence, because silence is so far away from speech, but it has a lot of positive examples of silence. So this thing was just biasing the system towards recognizing silence, giving silence a high bias.
So what we did: we said, okay, one easy fix is to not update the silence state and also skip all silence frames. That already gave us something much better; it already looks like it's converging. We could also do this slightly more systematically: we could explicitly add silence arcs into the lattice, the ones that should have been there in the first place. Once you do that, you actually get even slightly better, so that kind of confirms the missing-silence hypothesis.
But then another problem is that the lattices are rather sparse. We find that at any given frame we only have something like three hundred out of nine thousand senones in the lattice; the others are not there because they basically had zero probability. But as the model moves along, at some point they may no longer have zero probability, so they should be there in the lattice, but they're not, and the system cannot train properly. So we thought: why don't we just regenerate lattices after one iteration? You can see it helps a little bit; at least it keeps stable here.
Now, can we do this slightly better? Basically we wanted to take this idea of adding silence arcs and extend it to adding speech arcs, but you can't really do that. A similar effect, however, can be achieved by interpolating your sequence criterion with the frame criterion, and when we do that we get very good convergence.
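In formula form, the smoothed objective is just a weighted combination of the two criteria from before; the weight value shown here is only illustrative, not the one we actually used.

```latex
F = (1 - H)\, F_{\mathrm{SEQ}} + H\, F_{\mathrm{CE}},
\qquad 0 < H < 1 \quad (\text{e.g. } H \approx 0.1, \ \text{illustrative})
```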
Now, we're not the only people who observed these issues with the training. For example, Karel Vesely and his co-workers observed that if you look at the posterior probability of the ground-truth path over time, you sometimes find that it's very low; it's not always zero, but sometimes it is zero, and that messes things up. What they found is that if you just reject those frames, they call it frame rejection, you get a much better convergence behavior; the red curve is without it and the blue curve is with the frame rejection.
And of course Brian Kingsbury observed exactly the same thing, but he said: no, I'm going to do the smart thing, something much better, I'm going to use a second-order method. With a second-order method you approximate the objective function as a second-order function, so that, in theory, you can hop right to the optimum. This can be done without explicitly computing the Hessian, and this is the Hessian-free method that Martens, a student of Hinton, sort of optimized. The nice thing is that it's actually a batch method, so it doesn't suffer from these previous issues like lattice sparseness and all of that. And also, I think at this conference there's a paper that says it works with a partially iterated CE model, so you don't even have to do a full CE iteration; that's also very nice.
And I should say that others, Brian for example, were actually first to show the effectiveness of sequence training for Switchboard, while I was still doing my homework.
Okay, so here are some results. This is the GMM system, this is basically the CE-trained CD-DNN, and this is the sequence-trained one. This is all on Switchboard, on Hub5'00 and RT03. We get something like twelve percent relative, others got eleven percent, and Brian on RT03 got, I think, fourteen percent; so all in a similar range.
I also want to point out one thing: going from here to here, the DNN has now given us forty-two percent relative, and that's a fair comparison because this baseline is also sequence-trained, right? So the only difference is that the GMM is replaced by the DNN. And it also works on a larger dataset. Okay, so the take-away: sequence training gives us gains of around ten to thirteen percent; SGD works, but you need some tricks, namely the smoothing and the rejection of bad frames; the Hessian-free method requires no tricks but is much more complicated, so to start with I would probably go with the SGD method.
So another big question is parallelizing the training. Just to give you an idea: the model we used in that demo video was trained on two thousand hours, and it took sixty days. Now, most of you probably don't work with Windows; we do, and that causes a very specific problem, because you have probably heard of something called Patch Tuesday. Basically, every two to four weeks Microsoft IT forces us to update some virus scanner or something like that, and those machines have to be rebooted. So running a job for sixty days is actually hard. We were running this on a GPU, so we had a very strong motivation to look at parallelization. But don't get your hopes up.
So one way of trying to parallelize the training is to switch to batch methods; Brian had already shown that Hessian-free works very well and can be parallelized. One of our interns at Microsoft actually tried to use Hessian-free also for the CE training, but the take-away was basically that it takes a lot of iterations to get there, so it was actually not faster. So, back to SGD. SGD also has a problem, because if we do mini-batches of, say, one thousand twenty-four frames, then every one thousand twenty-four frames we have to exchange a lot of data.
So that's a big challenge. The first group, actually a company, that did this successfully was Google, with their asynchronous SGD. The way that works is: you have your machines, you group them, each group takes a part of the model, and then you split your data so that each group of machines computes on different data. At any given time, whenever one of them has a gradient computed, it sends it to a parameter server, or a set of parameter servers, and the parameter servers aggregate it into the model. And then, whenever they feel like it and the bandwidth allows, they send the model back. Now, that's a completely asynchronous process; the model to think of is just independent threads: one thread is just computing with whatever is in memory, another thread is just sharing and exchanging data in whatever way, with minimal synchronization.
So why would that work? Well, it's very simple: SGD sort of implies an assumption of additivity, right? Basically, every parameter update contributes independently to the objective function, so it's okay to miss some of them.
And there is also something that we call delayed update; let me quickly explain that. In the simplest form of training, as explained in the beginning, at every point in time you take a sample, you take the model, compute a gradient, and update the model with that gradient; then you do it again after one frame, and again, and again. So basically your new model is equal to the old model plus the gradient step. We can also do this differently: you can choose not to advance the model, but use the same model multiple times and then update; in this example, for four frames, you do four model updates; the frames are still these frames, right, but the model stays the same model; then you do this again, and so on. That's actually what we call a mini-batch based update, mini-batch training.
Now, if you want to do parallelization, you need to deal with the problem that we want to do computation and data exchange in parallel. So you would do something like this: you would have a model, and you would start sending it into the network, so at some point another node can do a model update while this one keeps computing the next batch. Then you do the next overlapped section: once these are computed, you send the result over while these are being received and applied; so you get this sort of overlapped processing; we call it the double-buffered update. It has exactly the same form; with this formula you can write it in exactly the same way. And asynchronous SGD is basically just a random version of this, where the delay is not fixed but jumps between, say, one or two, just like that.
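To illustrate why a delayed update still behaves like mini-batch SGD, here is a small NumPy toy of my own (not our training code): plain SGD and a version that applies each gradient only after a fixed delay both converge on a simple quadratic objective, as long as the step size is small enough.

```python
import numpy as np

def grad(theta, x):
    """Gradient of the per-sample loss 0.5*(theta - x)^2; the optimum is the data mean."""
    return theta - x

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=2000)
lr, delay = 0.05, 8

theta_sgd, theta_delayed = 0.0, 0.0
pending = []                       # gradients computed with a stale model
for x in data:
    # plain SGD: gradient of the current model, applied immediately
    theta_sgd -= lr * grad(theta_sgd, x)
    # delayed update: gradient computed now, applied 'delay' steps later
    pending.append(grad(theta_delayed, x))
    if len(pending) > delay:
        theta_delayed -= lr * pending.pop(0)
print(round(theta_sgd, 2), round(theta_delayed, 2))   # both end up near the true mean 3.0
```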
So why am I telling you this? Why would this work? Because the delay is not different from a mini-batch, and to make it work the only thing you need to make sure is that you still stay in this sort of stable regime. It also means that as the training progresses you can increase your mini-batch size, and we observed that this also means you can increase your delay, which means you can use more machines; the more machines you use, the more delay you incur, because of the network latency, right?
Okay, but then: there were actually three times that colleagues came to me and said, look at this paper, we only need to do this; and then, like three months later, I asked them how it was coming along, and the answer was: well, it didn't scale. That actually happened three times. So why does this not work?
Let's look at the different ways of parallelizing something: model parallelism, data parallelism, or layer parallelism. Model parallelism means you're splitting the model over different nodes; then, after each computation step, since each node only computes part of the output vector, each having computed a different sub-range of the dimensions, they have to exchange their outputs with all the others. The same thing has to happen on the way back. Data parallelism means you break your mini-batch into sub-batches, so each node computes a sub-gradient, and then after every batch they have to exchange these gradients; each has to send its gradient to all the other nodes. So you can already see there is a lot of communication going on.
The third variant is something that we tried, called layer parallelism. It works something like this: you distribute the layers over nodes. The first batch comes in, and when it's done, this node sends its output to the next one, and we compute the next batch here; but this is not exactly correct, because we haven't updated the model yet. Well, we keep going; we just ignore that problem. Then, in this case after four steps, this guy finally comes back with an update to the model. So why would that work? It's just the delayed update again, exactly the same form as before, except the delay is different in different layers; but there's nothing fundamentally strange about it.
So now, a very interesting question: how far can you actually go; what is sort of the optimal number of nodes that you can parallelize over? A colleague of mine brought up a very simple idea. He simply said: you are optimal when you max out all the resources, using all your computation and all your network resources. That basically means that the time it takes to compute a mini-batch is equal to the time it takes to transfer the result to all the others. And you would do this in sort of an overlapped fashion: you would compute one batch, then you start the transfer while you do the next one, and you are ideal when, at the time the transfer is completed, you are just ready to compute the next batch.
So then you can write down what the optimal number of nodes is. The formula is a bit more complicated, but the basic idea is that it is proportional to the model size; bigger models allow better parallelization, but they don't get faster. So a GPU can parallelize less. It also has to do, of course, with how much data you have to exchange and what your bandwidth is. For data parallelization the mini-batch size is also a factor, because for a larger mini-batch size you have to exchange less often. And for layer parallelism it's not really that interesting, because it's limited by the number of layers.
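Just to make the balance condition concrete, here is a back-of-envelope calculator of my own (not the actual formula from the talk): for data parallelism, the per-node compute time should at least cover the time to exchange the gradient and model, which gives a break-even node count. All the hardware numbers below are made-up illustrations.

```python
def max_useful_nodes_data_parallel(minibatch_frames, flops_per_frame,
                                   node_flops_per_sec, model_bytes,
                                   bandwidth_bytes_per_sec):
    """Break-even node count K: compute time split across K nodes still hides the
    time to exchange a full gradient (send) and model (receive) per mini-batch."""
    compute_time_one_node = minibatch_frames * flops_per_frame / node_flops_per_sec
    exchange_time = 2.0 * model_bytes / bandwidth_bytes_per_sec   # gradient out + model in
    return max(1, int(compute_time_one_node / exchange_time))

# Illustrative numbers only: ~45M-parameter model, 1 TFLOP/s effective per GPU node,
# 10 Gbit/s network, ~270 MFLOP per frame for forward+backward.
for mb in (1024, 8192):
    print(mb, max_useful_nodes_data_parallel(
        minibatch_frames=mb, flops_per_frame=270e6,
        node_flops_per_sec=1e12, model_bytes=45e6 * 4,
        bandwidth_bytes_per_sec=10e9 / 8))
```

With these illustrative numbers only a handful of nodes pay off, which matches the point above: SGD is hard to parallelize, and bigger mini-batches (later in training) buy you more nodes.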
So let me ask you: what do you think we would get here for model parallelism? Just consider that Google is doing ImageNet on something like sixteen thousand cores. So give me a number... I'm going to tell you: not sixteen thousand. I implemented this, with a lot of care, on three GPUs. This is the best you can do: a one point eight times speedup, not two times, let alone three times, because GPUs get less efficient the smaller the chunks of data they process. And once I went to four GPUs, it was actually much worse than that.
Now, data parallelism is not much better. What do we expect for a mini-batch size of one thousand twenty-four? Of course, if you can use bigger mini-batches as training progresses, this becomes a bigger number. And in reality what you get is this: Google's ASGD system, parallelizing over eighty nodes, each node being a machine with something like twenty-four CPU cores; if you look at what you get compared to using a single such machine, that is eighty times the hardware, but you only get a speedup of five point eight. That's what you can actually read out of the paper: about two point two comes out of model parallelism and two point six comes out of data parallelism. So, of course, not that much.
Then there's another group, at the Academy of Sciences in Beijing; they parallelized over NVIDIA K20x GPUs, which is sort of the state of the art, and they also got a three point two times speedup. Okay, not that great. And I'm not going to give a better answer here; I just wanted to point this out.
The last thing is layer parallelism. In this experiment we found that if you do it the right way you can use more GPUs and get about a three times speedup, but we already had to use model parallelism on top of it, and if you don't do that you have a load-balancing problem because the layer sizes are so different. So this is actually the reason why I do not recommend layer parallelism.
Okay, so the take-away: parallelizing SGD is actually really hard, and if your colleagues come to you and say, can't we just implement asynchronous SGD, then maybe show them this. Okay.
So much about parallelization. Next, let me talk about adaptation. Adaptation can be done, as was mentioned this morning, for example by sticking a linear transform in at the bottom, the linear input network; we call it fDLR, in analogy to fMLLR. It can also be things like VTLN. Another thing we can do, as Nelson explained, is to just train the whole stack a little bit, or do this with regularization.
So here is what we have observed. We did this transform-based approach on Switchboard. Applied to the GMM system, we get a thirteen percent error reduction; applied to a shallow neural network, one with only a single hidden layer, you get something very similar; but if we do it on the deep network, we get much less.
So this is not such a great example. On the other hand, let me tell you an anecdote I forgot to put on the slide. When we prepared this on-stage demo for our vice president, we tried to actually adapt the model to him. We took something like four hours of his internal talks, did adaptation on that, tested on another two talks, and we got something like thirty percent. But then we moved on and did an actual dry run with him, and it turns out that on that one it didn't work at all. I think what happened there is that the DNN did not actually learn his voice; it learned the channel of those particular recordings, and that didn't carry over. There are a couple of other numbers here, but let me cut it short: what we seem to be observing is that the gain of adaptation diminishes with a large amount of training data; that is what we have seen so far, except when the adaptation is done for the purpose of domain adaptation.
And maybe the reason is that the DNN is already very good at learning invariant representations, especially across speakers, which also means there may be a limit on what is achievable by adaptation; keep this in mind if you're considering doing research there. On the other hand, I think other groups, George's for example, reported very good results on adaptation, so maybe what I'm saying is not correct; you'd better check out their papers in the session.
Okay, so we are done with training; let me now look at alternative architectures. ReLUs, rectified linear units, are very popular: you basically replace the sigmoid nonlinearity with something like this.
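This really is the proverbial two-line change; a sketch:

```python
import numpy as np

def sigmoid(z):                      # the original hidden nonlinearity
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):                         # the rectified-linear replacement
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))
print(relu(z))                       # sparse: negative inputs are cut to zero
```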
And that also came out of Geoffrey Hinton's school. It turns out that on vision tasks this works really well: it converges very fast, you basically don't need to do pre-training, and it seems to outperform the sigmoid version on basically everything non-speech. For speech there was a really encouraging paper by Andrew Ng's students, "Rectifier Nonlinearities Improve Neural Network Acoustic Models", where they were able to reduce the error rate down to something like seventeen. So, great: I started on it; it is actually just two lines of code. And I didn't get anywhere.
I was not able to reproduce these results. So I read the paper again, and there is one sentence: network training stops after two complete passes. Well, if we only do two passes, our system is at nineteen point two as well, and we normally do many more passes, as you can see. So actually there was something wrong with the baseline.
It turns out, when I talk to people, that on the large Switchboard set it seems to be very difficult to get ReLUs to work. One group that actually did get it to work is IBM, together with George Dahl, but with a rather complicated method: they use Bayesian optimization to tune the training, it learns the hyper-parameters of the training, and this way they get something like five percent relative gain. I don't know if they're still doing that, or if it has become a bit easier now. So the point is: it looks easy, but it actually isn't, for large setups.
The other one is convolutional networks. The idea is basically this: look at these filters here; they are tracking some sort of formant, right? But the formant positions, the resonance frequencies, depend on your body height; for example, for women they are typically at a slightly different position compared to men. So why not share these filters across that? At the moment the system wouldn't do that. The idea is to apply these filters shifted slightly, apply them over a range of shifts, and that's basically what is represented by this picture here. Then the next layer reduces this: you pick the maximum over all these different results. It turns out that you can actually get something like four to seven percent word error rate reduction; I think there's even a little bit more in the latest papers.
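A minimal NumPy sketch of that shift-and-max idea along the frequency axis: a 1-D correlation over the filterbank channels followed by max-pooling; the sizes are illustrative.

```python
import numpy as np

def conv1d_freq(frame, filt):
    """Apply one filter at every frequency shift (valid 1-D correlation)."""
    n = len(frame) - len(filt) + 1
    return np.array([frame[i:i + len(filt)] @ filt for i in range(n)])

def max_pool(x, size):
    """Keep only the maximum over each group of neighboring shifts."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

rng = np.random.default_rng(0)
frame = rng.standard_normal(40)          # one frame of 40 log filterbank values
filt = rng.standard_normal(8)            # one learned filter (1-D here for simplicity)
responses = conv1d_freq(frame, filt)     # the same filter applied over a range of shifts
pooled = max_pool(responses, size=3)     # invariance to small formant shifts
print(responses.shape, pooled.shape)     # (33,) (11,)
```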
So the take-away for those alternative architectures: ReLUs are definitely not easy to get to work; they seem to work for smaller setups; some people tell me they get really good results on something like twenty-four-hour datasets, but on the big three-hundred-hour set it's very difficult and expensive. On the other hand, the CNNs are much simpler, and the gains are sort of in the range of what we get with feature adaptation. Okay.
That's the end of the training section; let me now talk a little bit about features. For GMM features a lot of work has been done: because GMMs are typically used with diagonal covariances, a lot of work went into decorrelating the features. Do we actually need to do this for the DNN? Well, how do you decorrelate? With a linear transform; and the first thing a DNN does is a linear transform, so it can kind of learn that by itself.
We start with a GMM baseline of twenty-three point six; if you put in fMPE, to be fair, twenty-two point six. Then you do a CD-DNN, just a normal DNN using those same features, the fMPE features, and you get to seventeen. Now take that out, and this minus sign means take it out, so it's just a PLP system: still seventeen. That kind of makes sense, because the fMPE was trained specifically for this GMM structure. Then you can also take out the HLDA, and it gets a bit better; HLDA is obviously decorrelation over a longer range, and the DNN already handles that. You can also take out the DCT that is part of the PLP or MFCC process; now you have a slightly different dimension, you have more features here, and I think a lot of people now use this particular setup. You can even take out the deltas, but then you have to compensate for it: you have to make the context window wider so we still see the same frames, and in our case it still holds up. And you can go really extreme and completely eliminate the filter bank and just look at FFT features directly; you still get something in the ballpark.
So what we just did basically undid thirty years of feature research. And there is also something kind of neat: if you really care about the filter bank, you can actually learn it; this is another poster, tomorrow. You see the blue bars and the red curves: the blue are the mel filters, and the red curves are basically learned versions of that. So the DNN can also kind of learn the filter bank itself.
So the take-away: DNNs greatly simplify feature extraction; just use the filter bank with a wider context window. One thing I should add: you still need to do the mean normalization; that you cannot drop.
that cannot
now
now we talk about features for dnns we can also trying to around right basically
you know ask not what the features can do for the dnn but what the
dnn and do for the features
i think that was
said by the same speech researcher
so we can use dnns as feature extractor so the idea is basically is one
of the factors that contributed to the success
long span features
discriminative training
and the hierarchical nonlinear feature map
right so
and trying to that is actually the major contributor so why not use this combined
with the gmm so we go really back to what the now some talked about
right
There are many ways of doing the tandem approach, as we heard this morning: you can do the classic tandem using the output posteriors; you can do a bottleneck, where you take an intermediate layer that has a much smaller dimension; or you can use the top hidden layer as sort of a bottleneck without making it smaller, just take it as is. In each of those cases you would typically do something like a PCA to reduce the dimensionality.
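A sketch of that last variant (my own illustration): take the top-hidden-layer activations for all frames, reduce them with PCA, and hand the result to a conventional GMM-HMM as its feature stream.

```python
import numpy as np

def pca_reduce(activations, out_dim):
    """PCA on top-hidden-layer activations: project onto the leading principal axes."""
    centered = activations - activations.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    basis = eigvecs[:, ::-1][:, :out_dim]             # top out_dim principal components
    return centered @ basis

rng = np.random.default_rng(0)
hidden = rng.standard_normal((5000, 2048))            # frames x top-hidden-layer units
gmm_features = pca_reduce(hidden, out_dim=39)         # feature stream for the GMM-HMM
print(gmm_features.shape)                             # (5000, 39)
```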
So does that work? If you take a DNN, this is the hybrid system here, and compare it with the GMM system where we take the top layer, do PCA, and then apply the GMM: well, it's not really that good. But now we have one really big advantage: we are back in the world of GMMs, so we can capitalize on anything that worked in the GMM world, right? For example, we were able to use region-dependent linear transforms, a little bit like fMPE; once you apply that, it's already better. You can also just do MMI training very easily; in this case it's not really as good, but at least you can do it out of the box without any of these problems with, you know, silence and all that; and you can apply adaptation just as you always would.
You can also do something more interesting: you can say, what if I train my DNN feature extractor on a smaller set and then do the GMM training on the larger set, because we have this scalability problem? This can really help with scalability, and you can see: close, not quite as good, but we were able to do it. Imagine a situation where you have something like a ten-thousand-hour product database that we couldn't train the DNN on. If on the DNN side we also use the full data, we definitely get better results here; but it might still make sense if we combine this, for example, with the idea of building the CE model only partially, and then see; we don't know that yet, actually. So this gets a lot of attention.
Another idea for using DNNs as feature extractors is to transfer learning from one language to another. The idea is to feed the network a training set of multiple languages, and the output layer, for every frame, is chosen based on what the language of that frame is. This way you train these shared hidden representations, and it turns out that if you do that, you can improve each individual language, and it even works for another language that has not been part of this set here.
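A structural sketch of that setup (my own illustration, with made-up shapes): one shared hidden stack, one softmax per language, and each training frame only goes through the shared stack plus the softmax of its own language.

```python
import numpy as np

rng = np.random.default_rng(0)
def layer(n_out, n_in): return 0.05 * rng.standard_normal((n_out, n_in)), np.zeros(n_out)

shared_stack = [layer(1024, 440), layer(1024, 1024)]          # hidden layers shared by all
output_layers = {lang: layer(n_senones, 1024)                 # language-specific softmaxes
                 for lang, n_senones in [("FRA", 1800), ("DEU", 2000), ("ESP", 1500)]}

def forward(x, lang):
    """Shared hidden stack, then the softmax of the frame's language."""
    h = x
    for W, b in shared_stack:
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))
    W, b = output_layers[lang]
    z = W @ h + b
    e = np.exp(z - z.max())
    return e / e.sum()
    # back-prop would update output_layers[lang] and the shared stack only

for lang in output_layers:
    print(lang, forward(rng.standard_normal(440), lang).shape)
```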
The only thing is that this typically works for low-resource languages. If the target gets larger, for example Soltau has a paper, I think it's here, where he shows that if you go up to something like two hundred seventy hours of training, then the gain is reduced to something like three percent. So this does not seem to work very well for large settings. Okay, so the take-away: the DNN as a hierarchical nonlinear feature transform is really the key to the success of DNNs, and you can use this directly and put a GMM on top of it as the classifier; that brings you back to the GMM world with all its techniques, including parallelization and scalability and so on. And on the transfer-learning side, it works for small setups, but not so much for large ones. Okay.
last topic runtime
runtime is an issue
this one problem for gmms
you can actually do on-demand computation
for dnns
a large amount of parameters actually the shared layers you can do on the map
so
all dnns are
you have to compute
and so it's important to look at how can speed up so for example the
demo video that i showed you in the beginning if i that was run with
the with the my gpu was doing the live likely to evaluation if you don't
do that it would like three times real time
wouldn't infeasible
so
the way to approach this and that was done both by some colleagues of microsoft
also ibm
is to ask we actually needles full weight matrices
i and so this is that the question is based on two observations
one is that we saw early on that actually you can set something like two
thirds of the parameters to zero
and still you get the same our
and what ibm observed is that this top hidden they're the
the number of
how to the number of nodes the actual active is relatively limited
so can you basically just decompose all the ideas you singular value decomposition
those weight matrix
and the ideas you basically this is your network there
the weight matrix nonlinearity replace this by two matrices and in the middle you have
a low-rank
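A minimal NumPy sketch of that factorization (the rank here is arbitrary, for illustration): split a layer's weight matrix W into two thinner matrices via a truncated SVD; afterwards you would fine-tune the whole network with back-propagation to recover the accuracy.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Replace one weight matrix W (n_out x n_in) by two: A (n_out x rank) and
    B (rank x n_in), so that W is approximated by A @ B with far fewer parameters."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * np.sqrt(s[:rank])              # absorb sqrt of singular values
    B = np.sqrt(s[:rank])[:, None] * Vt[:rank]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((9304, 2048))                # e.g. top layer: senones x hidden units
A, B = low_rank_factorize(W, rank=256)
print(W.size, A.size + B.size)                       # parameter count before vs after
print(np.linalg.norm(W - A @ B) / np.linalg.norm(W)) # relative approximation error
# (a random W is not low-rank; trained weight matrices compress much better)
# ...then fine-tune with back-propagation to restore accuracy.
```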
So does that work? Well, this is the GMM baseline, just for reference; the DNN, with thirty million parameters, on a Microsoft-internal task, starts with a word error rate of twenty-five point six. Now we apply the singular value decomposition; if you just do it straight away, it gets much worse; but you can then do back-propagation again, and you get back to exactly the same number, while gaining something like a one-third parameter reduction. You can also do this with all the layers, not just the top one; if you do that, you can bring it down by a factor of four. And that is actually a very good result; this basically brings the runtime back into a feasible range.
Let me just show you one more thing, again to give you a very rough idea. It's only a very short example, an apples-to-apples comparison between the old GMM system and the DNN system for speech recognition in a product. You see two devices: the one on the left runs what we had previously on board, the one on the right uses the DNN. [video clip plays: both devices are asked to find a good pizza place] The results are very similar; what is interesting is to look down here at the latency, which is counted from when I stop talking until we see the recognition result; there the difference is on the order of a second. So I just want to give you proof that this runtime section actually works in practice.
Okay, so I think I have covered the whole range; I would like to recap all the take-aways. We went through the CD-DNN: as Nelson already said, it's actually nothing else than an MLP; the outputs are the tied triphone states, and that's important. They're not really that hard to train, we know that now, but doing it fast is still sort of a frustrating enterprise, and I would at the moment recommend to just get a GPU, and if you have multiple GPUs, just run multiple trainings rather than trying to parallelize a single training. Pre-training does help a bit, but the greedy layer-wise discriminative version is simpler and seems to be sufficient. Sequence training gives us regularly good improvements, around ten to thirteen percent, but if you use SGD then you have to use these little tricks, the smoothing and the frame rejection. Adaptation helps much less than for GMMs, which might be because the DNN possibly already learns very good internal representations, so there may be a limit to what you can achieve. ReLUs are definitely not as easy as changing two lines of code, especially for large datasets; on the other hand the CNNs give us something like five percent, which is not really that hard to get, and they make good sense. DNNs really simplify the feature extraction; we were able to eliminate thirty years of feature extraction research; but you can also turn it around and use DNNs as feature extractors. And DNNs are definitely not slowing down decoding if you use this SVD trick.
To conclude, where do I see the challenges going forward? There are of course open issues in training. When we talk to people in the company, we are always thinking about what kind of computers we should buy for the future and whether we should optimize them for SGD; but we always think, you know what, in one year we will laugh about this, there will be some batch method and we will just not need all of this. So far that has not happened; I think it's fair to say there's no method like that on the rise that would give us easy parallelization. And what we found is that our learning rate control is not sufficient; this is really important, because if you don't do it right you can run into unreliable results, and I have a hunch that the ReLU result we saw was a little bit like that. It also has to do with parallelizability, because the smaller the learning rate, the bigger your mini-batch can be, and the more you can parallelize. DNNs still have an issue with robustness to real-life situations. In a way they have, not solved speech, but gotten very close to solving speech under perfect recording conditions; but it still fails if you do speech recognition over a distance of a meter or more in a room with two microphones or something like that. So DNNs are not inherently, automatically robust to noise: they are robust to seen variability, but not to unseen variability.
Then, personally, I wonder: can we move towards more machine learning? For example, there is already work that tries to eliminate the HMM and replace it by an RNN, and I think that is very interesting; the same thing has already been done very successfully with language models. And there's the question of jointly training everything in one big step; but on the other hand, the problem with that is that different aspects of the model use different kinds of data with different costs attached to them, so it might actually never be possible or necessary to do a fully joint training. And the final question that I sort of have is: what do DNNs teach us about how humans process speech, and will we get more ideas from that? So, that concludes my talk; thank you very much.
I think we have like six minutes for questions.
I'm not an expert on neural networks, so I was wondering: if I train a neural network on conventional speech data and I try to recognize data which is much cleaner, will it then not be as good, because we don't model the noise?
So what is the configuration you have in mind: you want to train on what? The case where they train their neural nets on the noisy data and then run them on clean data?
I don't know exactly; that's my question.
Okay, so I actually did skip one slide; let me show this one. This table here shows results on Aurora, so basically, in this case, multi-style training. The idea there was not to train on noisy and test on clean; this is training and testing on the same set of noise conditions. There are a lot of numbers here; this is the GMM baseline, if you look at this line here, thirteen point four. I'm not a specialist on robustness, but I think this is about the best you can do with a GMM, pulling in all the tricks that you could possibly put in. And the DNN, just like that, without any tricks, just training on the data, gets you almost exactly the same number. So what this means, I think, is that the DNN is very good at learning variability of the input, including noise, that it sees in the training data. But we have other experiments where we see that if the variability is not covered in the training data, the DNN is not very robust against it. So I don't know what happens if you train on noisy and test on clean; if clean is not one of the conditions in your training, I could imagine that it will have a problem, but it would be interesting to try on your data.
I don't think you can quite get away with saying thirty years; maybe that wasn't entirely accurate. You were obviously talking tongue in cheek, right: what you're talking about is going back before some of the developments of the eighties, and most of the effort on feature extraction at conferences over the last twenty years has actually been about robustness, about dealing with unseen variability, and this doesn't give you an answer to that question.
Are there more questions or comments?
A question about features: what do we need for future research? Is it enough to use a large temporal context? This seems to be one thing that keeps coming up, but in contrast to that...
Okay, I don't exactly have an answer to that, I'm sorry.
Any more comments?
A kind of personal question: you said that you knew nothing about neural nets until two or three years back, something like that. Do you see this as rather an advantage or a drawback: being maybe less sentimental about throwing away some of the things that the people who have been in the field for many years consider untouchable, or the other way round?
I think it helps to come in with a little bit of an outsider's mind. For example, I think it helped me to understand this parallelization thing: that basically SGD, delayed updates, and layer parallelism are all forms of mini-batch training. The regular definition of a mini-batch is that you take the average over the samples; maybe you noticed that I didn't actually divide by the number of frames when I used that formula. That, for example, is something where I, as an engineer coming in and looking at it, wondered: why do you treat mini-batches as an average? It doesn't seem to make sense; you're just accumulating multiple frames over time. And that helped me understand those kinds of parallelization questions in a different way. But these are probably details.
Okay, any other questions? Okay, then please thank the speaker again.