Thank you, and welcome back after the lunch break. My name is Frank Seide, I'm from Microsoft Research in Beijing, and this is a collaboration with my colleague Dong Yu, who happens to be Chinese but is actually based in Redmond. Of course there are a lot of contributors to this work, inside the company and outside, and also thank you very much to the people who shared slide material.
Okay, let me start with a personal story of how I got into this, because I'm sort of an unlikely expert on this: until two thousand ten I had no idea what neural networks were, deep ones or otherwise. So in two thousand ten my colleague Dong Yu, who cannot be here today, came to visit us in Beijing and told us about this new speech recognition result that they had, and he told me about a technology that I had never heard about, called a DBN, which was invented by some professor in Toronto whom I also had never heard about. He and his manager at the time had invited Geoffrey Hinton, this professor, to come to Redmond with a few students and work on applying this to speech recognition.
At that time he had gotten a sixteen percent relative error reduction out of applying deep neural networks, and this was for an internal voice search task with a relatively small number of hours of training data. You know, sixteen percent is really big; a lot of people spend ten years to get a sixteen percent error reduction. So my first thought about this was: sixteen percent, wow, what's wrong with the baseline? So we said, well, why don't we collaborate on this and try how it carries over to a large-scale task, namely Switchboard.
And the key thing that was actually invented here was: take the classic ANN-HMM, a reference that is probably familiar after what we heard this morning from Nelson, so I'm a little bit late with it, make it deep, so the classic ANN-HMM plus the deep network, the DBN, which, as I learned at that point, does not stand for dynamic Bayesian network, and then Dong Yu put in this idea of just using tied triphone states as the modeling targets, like we do in GMM-based systems. Okay, so.
Then fast forward: I spent something like half a year reading papers and tutorials to get started, and finally we got to the point where we got first results. So this is our GMM baseline, and I started the training; the next day I had the first iteration, which was something like twenty-two percent, so okay, it seems to not be completely off. The next day I come back: twenty percent, so already around fourteen percent relative, and I sent the congratulatory email to my colleague, right?
I let it run, and the next day it came back: eighteen percent. From that moment on I was basically just sitting at the computer waiting for the next result to come out and submitting the next run to see if it got better. We got seventeen point three, then seventeen point one. Then we regenerated the alignment, which is one thing Dong Yu had already determined on the smaller setup, and we got it down to sixteen point four; then we looked at sparseness: sixteen point one. Altogether a thirty-two percent error reduction. That's a very large reduction out of a single technology.
We also ran this over different test sets with the same model, and you could see the error rate reductions were all sort of in a similar range; it didn't matter which set, although for some the gains were slightly smaller. We also looked at other setups; for example, at some point we trained a two-thousand-hour model that is suitable for a product, like the system that you have on your phone right now, and we got something like fifteen percent error reduction. And other companies also started publishing: for example IBM on broadcast news, I think the total gain was thirteen to eighteen percent in the most up-to-date papers, and on YouTube I think it was about nineteen percent. So the gains were really convincing across the board.
Okay, so that was our work. So what is this actually? Now, I thought ASRU also has a fair portion of understanding people who might not know DNNs in detail, so I would like to go through and explain the basics of how this works a little bit more. I don't know how many understanding people are really here today; I hope it's not going to be too boring.
So the basic idea is: the DNN looks at, for example, a spectrogram, takes a rectangular patch out of that, a range of vectors, and feeds this into a processing chain which basically multiplies this input vector, this rectangle here, with a matrix, adds a bias, and applies a nonlinearity; then you get something like two thousand values. After that you do the same thing several times; the top layer does the same thing, except the nonlinearity is a softmax.
So these are the formulas for that. What is a softmax actually? It's this form here, which is essentially nothing else but a linear classifier, and it is linear because if you look at the class boundary between two classes it is a hyperplane; so it's actually a relatively weak classifier up there. The hidden layers are actually very similar: they have the same form; the only differences are that there are only two classes, membership and non-membership, instead of all the different speech states here, and that the second class has its parameters set to zero.
So what is this really? It is sort of a classifier that classifies membership or non-membership in some class, but we don't actually know what those classes are. And this representation is also kind of sparse: typically only maybe five to ten percent of the activations are active in any given frame. So these class memberships are really a kind of descriptive features of your input.
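To make those layer formulas concrete, here is a minimal NumPy sketch of the forward pass just described: each hidden layer is a sigmoid of an affine transform, and the top layer is a softmax. The layer sizes below are illustrative values of my own, not the actual system's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()                      # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def dnn_forward(x, hidden_layers, top_layer):
    """Run one input vector through sigmoid hidden layers and a softmax output."""
    h = x
    for W, b in hidden_layers:           # each hidden unit: a soft class-membership detector
        h = sigmoid(W @ h + b)
    W_out, b_out = top_layer
    return softmax(W_out @ h + b_out)    # posterior over the output classes

# Toy example: a 2-hidden-layer net on a random "spectrogram patch" input.
rng = np.random.default_rng(0)
dims = [429, 2048, 2048, 9304]           # input, hidden, hidden, senones (illustrative sizes)
params = [(0.05 * rng.standard_normal((dims[i + 1], dims[i])), np.zeros(dims[i + 1]))
          for i in range(len(dims) - 1)]
posterior = dnn_forward(rng.standard_normal(dims[0]), params[:-1], params[-1])
print(posterior.shape, posterior.sum())  # (9304,) 1.0
```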
So another way of looking at it is: basically what it does is take an input vector and project it onto something like a basis vector, one column; this would be like a direction vector we project onto; there is a bias term we add, and then you run it through this nonlinearity, which is just a sort of soft binarization. So what this does is give you sort of a, let's say, a coordinate system for your inputs. And yet another way of looking at it is: this here is actually a correlation, so the parameters have the same sort of physical meaning as the inputs you put in there. For example, for the first layer the model parameters are also of the nature of a rectangular patch of spectrogram. And this is what they look like; I think there was a little bit of discussion on this earlier in Nelson's talk. So what does this mean? Each of these is, in this case, a patch of some twenty-three frames along the time axis, and this is the frequency axis here. What happens is that these patches are basically overlaid over the input here, the correlation is computed, and whenever the input matches this particular pattern, this one for example is sort of a peak detector, it slides over time and you get the hidden activation.
You can see all these different patterns; many of them really look like the kinds of filters we know, but these are automatically learned by the system; there is no knowledge that was put in there. You have edge detectors, you have peak detectors, you have some sliding detectors, and you also have a lot of noise in there; I don't know what that is for; I think the later stages probably just ignore those. The harder problem is how to interpret the hidden layers: the hidden layers don't have any sort of spatial relationship to the input or anything, so the only thing that I could think of is that
they are representing something like logical operations. So think of this again: this is the direction vector, and this is the hyperplane that is described by the bias, right? So if your inputs, for example, are binary, this one is one, this one is zero, and this is obviously a two-dimensional example, then you could put the plane here and it indicates an OR operation, okay, kind of a soft OR because it's not strictly binary, or you put it here and it is like an AND operation. So my personal intuition of what the DNN actually does is this:
on the lower layers it extracts these landmarks, and on the higher layers it assembles them into more complicated classes. And you can imagine interesting things: for example, one node on one layer might discover, say, a female version of an 'a', and another node would give you a male version of an 'a'; then the next layer would just say it's an 'a', whether female or male. So this gives an idea of where the modeling power of this structure comes from. Okay, so the take-away: the lowest layer matches landmarks, the higher layers I think act as sort of soft logical operators, and the top layer is just a really primitive linear classifier.
Okay, so how do we use this in speech? You take those outputs, these posterior probabilities of speech states, and turn them into likelihoods using Bayes rule, and these are directly used in the hidden Markov model. And the key thing here is that these classes are tied triphone states and not monophone states; that is the thing that really made a big difference.
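In the hybrid decoder, the DNN posterior is divided by the state prior to get a scaled likelihood (Bayes rule, up to the constant p(x)). A minimal sketch, with the prior estimated from state counts in the training alignment; the counts here are made up for illustration:

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_counts, floor=1e-8):
    """Convert DNN log-posteriors log p(s|x) into scaled log-likelihoods
    log p(x|s) + const = log p(s|x) - log p(s), used as HMM emission scores."""
    priors = state_counts / state_counts.sum()           # p(s) from the training alignment
    return log_posteriors - np.log(np.maximum(priors, floor))

# Toy usage: 4 tied-triphone states, one frame of DNN output.
posteriors = np.array([0.70, 0.20, 0.05, 0.05])
counts = np.array([1000, 4000, 2500, 2500])              # hypothetical alignment counts
print(scaled_log_likelihoods(np.log(posteriors), counts))
```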
Okay, just before we move on, to give you a rough idea of what these word error rates actually mean, we want to play a little video clip where our executive vice president of research gave an on-stage demo, and you can see what accuracies come out of a speaker-independent DNN; it has not been adapted to his voice.
[video clip plays: live on-stage transcription of the speech]
Okay, so you see the recognition output together with the audio. [video clip continues] So this is basically perfect, right? And this is really a speaker-independent system.
And you can do interesting things with that, just for the fun of it. I'm going to play a later part of the video where we actually use this output to drive translation, translating it into Chinese with a synthesized voice. [video clip plays: the recognized speech is translated into Chinese and spoken aloud] So that's the kind of fun you can have with a model like that.
Okay, so now for this talk: there have been invited talks about DNNs at pretty much every one of these conferences lately, one-hour talks covering everything, for example last year, by Andrew Senior and others. When I prepared this talk I found that I basically ended up redoing Andrew's talk, and I thought that's maybe not a good idea; I want to do it slightly differently. So I will focus: I'm not going to give you an exhaustive overview of everything, but I will focus on what is needed to build real-life, large-scale systems; so, for example, you will not see a TIMIT result. And it is structured along three areas: training, features, and runtime. Training is the biggest one, and I'm going to start with that.
So how do you train this model? I think we're pretty much all familiar with back-propagation: you give it a sample vector, run it through the network, get a posterior distribution, compare it against what it should be, and then basically nudge the system a little bit in the direction of doing a better job next time.
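As a concrete illustration, here is a minimal NumPy sketch of one such nudge for a single softmax layer trained with cross-entropy; a full multi-layer version just chains the same idea backwards through the layers. All sizes and the learning rate are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sgd_step(W, b, x, target_id, lr=0.1):
    """One back-propagation/SGD step for a softmax classifier with cross-entropy loss."""
    p = softmax(W @ x + b)                 # forward: posterior distribution
    grad_z = p.copy()
    grad_z[target_id] -= 1.0               # dLoss/dz = p - onehot(target)
    W -= lr * np.outer(grad_z, x)          # nudge weights against the gradient
    b -= lr * grad_z
    return -np.log(p[target_id])           # cross-entropy on this frame

rng = np.random.default_rng(0)
W, b = 0.01 * rng.standard_normal((10, 40)), np.zeros(10)
x, y = rng.standard_normal(40), 3
print([round(sgd_step(W, b, x, y), 3) for _ in range(5)])   # loss should go down
```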
And the problem is, when you do this with a deep network, the system often does not converge well or gets stuck in a local optimum. So the thing that started this whole revolution with Geoffrey Hinton, the thing that he proposed, is the restricted Boltzmann machine. The idea is basically that you train layer by layer; we extend the network so it can be run backwards as well. You run the sample through, you get a representation, you run it backwards, and then you can see how well the thing that comes out actually matches my input. Then you can tune the system so that it matches the input as closely as possible. If you can do that, and don't forget this is sort of a binary representation, that means you have a representation of the data that is meaningful; this thing extracts something meaningful about the data. That's the idea. Now you do the same thing with the next layer: you freeze this one, take it as a feature extractor, do this with the next layer, and so on. Then you put a softmax on top and train with back-propagation.
Now, I had no idea about deep neural networks or anything when I started this, so I thought: why do we do this so complicated? I mean, we had already run experiments on how many layers you need and so on, so we already had a network with a single hidden layer. So why not just take that one as initialization, rip out its softmax layer, then put in another hidden layer and another softmax on top, and iterate the entire stack? After that, again rip this guy off and do it again, and so on, and once you are at the top, iterate the whole thing. We call this greedy layer-wise discriminative pre-training.
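A minimal sketch of that layer-growing schedule (my own illustration, not our actual training code): the network is a list of (W, b) layers ending in a softmax; to grow it, you discard the softmax, insert a fresh random hidden layer, put a fresh softmax back on top, and then run a few back-prop sweeps (as in the sketch above) before growing again.

```python
import numpy as np

def new_layer(n_out, n_in, rng, scale=0.05):
    return scale * rng.standard_normal((n_out, n_in)), np.zeros(n_out)

def grow(stack, hidden_dim, n_classes, rng):
    """Greedy layer-wise discriminative pre-training step:
    drop the current softmax, insert a new hidden layer, add a fresh softmax."""
    hidden = stack[:-1]                                   # keep the trained hidden layers
    in_dim = hidden[-1][0].shape[0]                       # output size of last hidden layer
    hidden.append(new_layer(hidden_dim, in_dim, rng))     # new random hidden layer
    hidden.append(new_layer(n_classes, hidden_dim, rng))  # new random softmax on top
    return hidden

rng = np.random.default_rng(0)
input_dim, hidden_dim, n_classes = 429, 2048, 9304        # illustrative sizes
stack = [new_layer(hidden_dim, input_dim, rng), new_layer(n_classes, hidden_dim, rng)]
for depth in range(2, 8):
    # ... here: run a few back-prop sweeps over the data (not to convergence) ...
    stack = grow(stack, hidden_dim, n_classes, rng)
    print(depth, [W.shape for W, _ in stack])
```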
And it turns out that actually works really well. If we look at this, the DBN pre-training of Geoffrey Hinton is the green curve here; if you do what I just described, you get the red one; they give essentially the same word error rate. This is for different numbers of layers; it is not the progression over training, but the accuracy for different numbers of layers. So the more layers you add, the better it gets, and you see the two basically track each other. The layer-wise pre-training is slightly worse, but then Dong Yu, who understands neural networks much better than I do, said you shouldn't iterate the model all the way to the end; you should just let it iterate a little bit until you are in the ballpark, and then move on. It turns out that made the system slightly better, and actually the sixteen point eight here is how this discriminative pre-training method works. One concern is that it's expensive, because every time you have this full nine-thousand-senone top layer there; but it turns out you don't need to do that: you can actually use monophones, and it works equally well. Okay, so the take-away: pre-training still seems to help, but greedy discriminative pre-training is sufficient and much simpler than the RBM pre-training, because we just reuse existing code and don't need any extra coding.
Okay, another important topic is sequence training. The question here is: we have trained this network to classify the signal into those segments of speech independently of each other, but in speech recognition we have a dictionary, we have language models, we have the hidden Markov model that gives you the sequence structure, and so on. So if we integrate that into the training, we should actually get a better result, right? The frame-classification criterion is written this way: you maximize the log posterior of the correct state for every single frame. If you write down sequence training, you find that it has exactly the same form, except that this here is not the state posterior derived from the DNN, but the state posterior taking all the additional knowledge into account; so this one takes into account the HMMs, the dictionary, and the language models. The way to run this is: you run your data through a speech recognizer and compute those posteriors; in practical terms you would do this with word lattices; and then you do back-propagation.
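For reference, a hedged sketch of the two criteria in the usual notation (this is the standard MMI-style formulation; details such as the acoustic scale κ vary between papers): the frame criterion maximizes the per-frame log posterior of the reference state, while the sequence criterion maximizes the posterior of the reference word sequence computed with the HMM, dictionary, and language model, in practice evaluated over word lattices.

```latex
% Frame-level cross-entropy criterion (reference state s_{u,t} at frame t of utterance u):
F_{\mathrm{CE}} = \sum_{u}\sum_{t} \log p_{\mathrm{DNN}}\!\left(s_{u,t} \mid o_{u,t}\right)

% Sequence-level (MMI-style) criterion over utterances u with reference words W_u:
F_{\mathrm{SEQ}} = \sum_{u} \log
  \frac{p\!\left(O_u \mid W_u\right)^{\kappa} P(W_u)}
       {\sum_{W} p\!\left(O_u \mid W\right)^{\kappa} P(W)}

% Its gradient w.r.t. the log-likelihood of state s at frame t has the same "posterior"
% form as the frame criterion, but with lattice-based numerator/denominator posteriors:
\frac{\partial F_{\mathrm{SEQ}}}{\partial \log p\!\left(o_{u,t}\mid s\right)}
  = \kappa\left(\gamma^{\mathrm{num}}_{u,t}(s) - \gamma^{\mathrm{den}}_{u,t}(s)\right)
```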
So we did that. We started with the CE baseline, and we did the first iteration of this sequence training, and the error rate went up instead of down. So that kind of didn't work. We observed that over time it sort of diverges; it doesn't look like it's training. So we tried to dig in: what is the problem here? There are four hypotheses: are we actually using the right models for lattice generation; lattice sparseness; randomization of the data; and the objective function, since there are multiple objective functions to choose from. Today I will talk about the lattice sparseness part.
The first thing we found was that there was an increasing problem of speech getting replaced by silence, a deletion problem: we saw that the silence scores kept growing while the other scores were not. Basically, what happens is that the lattice is very biased: the lattice typically doesn't have negative hypotheses for silence, because silence is so far away from speech, but it has a lot of positive examples of silence. So this thing was just biasing the system towards recognizing silence, giving silence a high bias.
So what we did: we said, okay, one easy fix is to not update the silence state and also skip all silence frames. That already gave us something much better; it already looks like it's converging. We could also do this slightly more systematically: we could explicitly add silence arcs into the lattice, the ones that should have been there in the first place. Once you do that, you actually get even slightly better, so that kind of confirms the missing-silence hypothesis.
But then another problem is that the lattices are rather sparse. We find that at any given frame we only have something like three hundred out of nine thousand senones in the lattice; the others are not there because they basically had zero probability. But as the model moves along, at some point they may no longer have zero probability, so they should be there in the lattice, but they're not, and the system cannot train properly. So we thought: why don't we just regenerate lattices after one iteration? You can see it helps a little bit; at least it keeps stable here.
Now, can we do this slightly better? Basically we wanted to take this idea of adding silence arcs and extend it to adding speech arcs, but you can't really do that. A similar effect, however, can be achieved by interpolating your sequence criterion with the frame criterion, and when we do that we get very good convergence.
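In formula form, the smoothed objective is just a weighted combination of the two criteria from before; the weight value shown here is only illustrative, not the one we actually used.

```latex
F = (1 - H)\, F_{\mathrm{SEQ}} + H\, F_{\mathrm{CE}},
\qquad 0 < H < 1 \quad (\text{e.g. } H \approx 0.1, \ \text{illustrative})
```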
Now, we're not the only people who observed these issues with the training. For example, Karel Vesely and his co-workers observed that if you look at the posterior probability of the ground-truth path over time, you sometimes find that it's very low; it's not always zero, but sometimes it is zero, and that messes things up. What they found is that if you just reject those frames, they call it frame rejection, you get a much better convergence behavior; the red curve is without it and the blue curve is with the frame rejection.
And of course Brian Kingsbury observed exactly the same thing, but he said: no, I'm going to do the smart thing, something much better, I'm going to use a second-order method. With a second-order method you approximate the objective function as a second-order function, so that, in theory, you can hop right to the optimum. This can be done without explicitly computing the Hessian, and this is the Hessian-free method that Martens, a student of Hinton, sort of optimized. The nice thing is that it's actually a batch method, so it doesn't suffer from these previous issues like lattice sparseness and all of that. And also, I think at this conference there's a paper that says it works with a partially iterated CE model, so you don't even have to do a full CE iteration; that's also very nice.
And I should say that others, Brian for example, were actually first to show the effectiveness of sequence training for Switchboard, while I was still doing my homework.
Okay, so here are some results. This is the GMM system, this is basically the CE-trained CD-DNN, and this is the sequence-trained one. This is all on Switchboard, on Hub5'00 and RT03. We get something like twelve percent relative, others got eleven percent, and Brian on RT03 got, I think, fourteen percent; so all in a similar range.
I also want to point out one thing: going from here to here, the DNN has now given us forty-two percent relative, and that's a fair comparison because this baseline is also sequence-trained, right? So the only difference is that the GMM is replaced by the DNN. And it also works on a larger dataset. Okay, so the take-away: sequence training gives us gains of around ten to thirteen percent; SGD works, but you need some tricks, namely the smoothing and the rejection of bad frames; the Hessian-free method requires no tricks but is much more complicated, so to start with I would probably go with the SGD method.
So another big question is parallelizing the training. Just to give you an idea: the model we used in that demo video was trained on two thousand hours, and it took sixty days. Now, most of you probably don't work with Windows; we do, and that causes a very specific problem, because you have probably heard of something called Patch Tuesday. Basically, every two to four weeks Microsoft IT forces us to update some virus scanner or something like that, and those machines have to be rebooted. So running a job for sixty days is actually hard. We were running this on a GPU, so we had a very strong motivation to look at parallelization. But don't get your hopes up.
So one way of trying to parallelize the training is to switch to batch methods; Brian had already shown that Hessian-free works very well and can be parallelized. One of our interns at Microsoft actually tried to use Hessian-free also for the CE training, but the take-away was basically that it takes a lot of iterations to get there, so it was actually not faster. So, back to SGD. SGD also has a problem, because if we do mini-batches of, say, one thousand twenty-four frames, then every one thousand twenty-four frames we have to exchange a lot of data.
So that's a big challenge. The first group, actually a company, that did this successfully was Google, with their asynchronous SGD. The way that works is: you have your machines, you group them, each group takes a part of the model, and then you split your data so that each group of machines computes on different data. At any given time, whenever one of them has a gradient computed, it sends it to a parameter server, or a set of parameter servers, and the parameter servers aggregate it into the model. And then, whenever they feel like it and the bandwidth allows, they send the model back. Now, that's a completely asynchronous process; the model to think of is just independent threads: one thread is just computing with whatever is in memory, another thread is just sharing and exchanging data in whatever way, with minimal synchronization.
So why would that work? Well, it's very simple: SGD sort of implies an assumption of additivity, right? Basically, every parameter update contributes independently to the objective function, so it's okay to miss some of them.
And there is also something that we call delayed update; let me quickly explain that. In the simplest form of training, as explained in the beginning, at every point in time you take a sample, you take the model, compute a gradient, and update the model with that gradient; then you do it again after one frame, and again, and again. So basically your new model is equal to the old model plus the gradient step. We can also do this differently: you can choose not to advance the model, but use the same model multiple times and then update; in this example, for four frames, you do four model updates; the frames are still these frames, right, but the model stays the same model; then you do this again, and so on. That's actually what we call a mini-batch based update, mini-batch training.
Now, if you want to do parallelization, you need to deal with the problem that we want to do computation and data exchange in parallel. So you would do something like this: you would have a model, and you would start sending it into the network, so at some point another node can do a model update while this one keeps computing the next batch. Then you do the next overlapped section: once these are computed, you send the result over while these are being received and applied; so you get this sort of overlapped processing; we call it the double-buffered update. It has exactly the same form; with this formula you can write it in exactly the same way. And asynchronous SGD is basically just a random version of this, where the delay is not fixed but jumps between, say, one or two, just like that.
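To illustrate why a delayed update still behaves like mini-batch SGD, here is a small NumPy toy of my own (not our training code): plain SGD and a version that applies each gradient only after a fixed delay both converge on a simple quadratic objective, as long as the step size is small enough.

```python
import numpy as np

def grad(theta, x):
    """Gradient of the per-sample loss 0.5*(theta - x)^2; the optimum is the data mean."""
    return theta - x

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=2000)
lr, delay = 0.05, 8

theta_sgd, theta_delayed = 0.0, 0.0
pending = []                       # gradients computed with a stale model
for x in data:
    # plain SGD: gradient of the current model, applied immediately
    theta_sgd -= lr * grad(theta_sgd, x)
    # delayed update: gradient computed now, applied 'delay' steps later
    pending.append(grad(theta_delayed, x))
    if len(pending) > delay:
        theta_delayed -= lr * pending.pop(0)
print(round(theta_sgd, 2), round(theta_delayed, 2))   # both end up near the true mean 3.0
```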
So why am I telling you this? Why would this work? Because the delay is not different from a mini-batch, and to make it work the only thing you need to make sure is that you still stay in this sort of stable regime. It also means that as the training progresses you can increase your mini-batch size, and we observed that this also means you can increase your delay, which means you can use more machines; the more machines you use, the more delay you incur, because of the network latency, right?
Okay, but then: there were actually three times that colleagues came to me and said, look at this paper, we only need to do this; and then, like three months later, I asked them how it was coming along, and the answer was: well, it didn't scale. That actually happened three times. So why does this not work?
Let's look at the different ways of parallelizing something: model parallelism, data parallelism, or layer parallelism. Model parallelism means you're splitting the model over different nodes; then, after each computation step, since each node only computes part of the output vector, each having computed a different sub-range of the dimensions, they have to exchange their outputs with all the others. The same thing has to happen on the way back. Data parallelism means you break your mini-batch into sub-batches, so each node computes a sub-gradient, and then after every batch they have to exchange these gradients; each has to send its gradient to all the other nodes. So you can already see there is a lot of communication going on.
The third variant is something that we tried, called layer parallelism. It works something like this: you distribute the layers over nodes. The first batch comes in, and when it's done, this node sends its output to the next one, and we compute the next batch here; but this is not exactly correct, because we haven't updated the model yet. Well, we keep going; we just ignore that problem. Then, in this case after four steps, this guy finally comes back with an update to the model. So why would that work? It's just the delayed update again, exactly the same form as before, except the delay is different in different layers; but there's nothing fundamentally strange about it.
So now, a very interesting question: how far can you actually go; what is sort of the optimal number of nodes that you can parallelize over? A colleague of mine brought up a very simple idea. He simply said: you are optimal when you max out all the resources, using all your computation and all your network resources. That basically means that the time it takes to compute a mini-batch is equal to the time it takes to transfer the result to all the others. And you would do this in sort of an overlapped fashion: you would compute one batch, then you start the transfer while you do the next one, and you are ideal when, at the time the transfer is completed, you are just ready to compute the next batch.
So then you can write down what the optimal number of nodes is. The formula is a bit more complicated, but the basic idea is that it is proportional to the model size; bigger models allow better parallelization, but they don't get faster. So a GPU can parallelize less. It also has to do, of course, with how much data you have to exchange and what your bandwidth is. For data parallelization the mini-batch size is also a factor, because for a larger mini-batch size you have to exchange less often. And for layer parallelism it's not really that interesting, because it's limited by the number of layers.
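Just to make the balance condition concrete, here is a back-of-envelope calculator of my own (not the actual formula from the talk): for data parallelism, the per-node compute time should at least cover the time to exchange the gradient and model, which gives a break-even node count. All the hardware numbers below are made-up illustrations.

```python
def max_useful_nodes_data_parallel(minibatch_frames, flops_per_frame,
                                   node_flops_per_sec, model_bytes,
                                   bandwidth_bytes_per_sec):
    """Break-even node count K: compute time split across K nodes still hides the
    time to exchange a full gradient (send) and model (receive) per mini-batch."""
    compute_time_one_node = minibatch_frames * flops_per_frame / node_flops_per_sec
    exchange_time = 2.0 * model_bytes / bandwidth_bytes_per_sec   # gradient out + model in
    return max(1, int(compute_time_one_node / exchange_time))

# Illustrative numbers only: ~45M-parameter model, 1 TFLOP/s effective per GPU node,
# 10 Gbit/s network, ~270 MFLOP per frame for forward+backward.
for mb in (1024, 8192):
    print(mb, max_useful_nodes_data_parallel(
        minibatch_frames=mb, flops_per_frame=270e6,
        node_flops_per_sec=1e12, model_bytes=45e6 * 4,
        bandwidth_bytes_per_sec=10e9 / 8))
```

With these illustrative numbers only a handful of nodes pay off, which matches the point above: SGD is hard to parallelize, and bigger mini-batches (later in training) buy you more nodes.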
So let me ask you: what do you think we would get here for model parallelism? Just consider that Google is doing ImageNet on something like sixteen thousand cores. So give me a number... I'm going to tell you: not sixteen thousand. I implemented this, with a lot of care, on three GPUs. This is the best you can do: a one point eight times speedup, not two times, let alone three times, because GPUs get less efficient the smaller the chunks of data they process. And once I went to four GPUs, it was actually much worse than that.
Now, data parallelism is not much better. What do we expect for a mini-batch size of one thousand twenty-four? Of course, if you can use bigger mini-batches as training progresses, this becomes a bigger number. And in reality what you get is this: Google's ASGD system, parallelizing over eighty nodes, each node being a machine with something like twenty-four CPU cores; if you look at what you get compared to using a single such machine, that is eighty times the hardware, but you only get a speedup of five point eight. That's what you can actually read out of the paper: about two point two comes out of model parallelism and two point six comes out of data parallelism. So, of course, not that much.
Then there's another group, at the Academy of Sciences in Beijing; they parallelized over NVIDIA K20x GPUs, which is sort of the state of the art, and they also got a three point two times speedup. Okay, not that great. And I'm not going to give a better answer here; I just wanted to point this out.
The last thing is layer parallelism. In this experiment we found that if you do it the right way you can use more GPUs and get about a three times speedup, but we already had to use model parallelism on top of it, and if you don't do that you have a load-balancing problem because the layer sizes are so different. So this is actually the reason why I do not recommend layer parallelism.
Okay, so the take-away: parallelizing SGD is actually really hard, and if your colleagues come to you and say, can't we just implement asynchronous SGD, then maybe show them this. Okay.
So much about parallelization. Next, let me talk about adaptation. Adaptation can be done, as was mentioned this morning, for example by sticking a linear transform in at the bottom, the linear input network; we call it fDLR, in analogy to fMLLR. It can also be things like VTLN. Another thing we can do, as Nelson explained, is to just train the whole stack a little bit, or do this with regularization.
So here is what we have observed. We did this transform-based approach on Switchboard. Applied to the GMM system, we get a thirteen percent error reduction; applied to a shallow neural network, one with only a single hidden layer, you get something very similar; but if we do it on the deep network, we get much less.
So this is not such a great example. On the other hand, let me tell you an anecdote I forgot to put on the slide. When we prepared this on-stage demo for our vice president, we tried to actually adapt the model to him. We took something like four hours of his internal talks, did adaptation on that, tested on another two talks, and we got something like thirty percent. But then we moved on and did an actual dry run with him, and it turns out that on that one it didn't work at all. I think what happened there is that the DNN did not actually learn his voice; it learned the channel of those particular recordings, and that didn't carry over. There are a couple of other numbers here, but let me cut it short: what we seem to be observing is that the gain of adaptation diminishes with a large amount of training data; that is what we have seen so far, except when the adaptation is done for the purpose of domain adaptation.
And maybe the reason is that the DNN is already very good at learning invariant representations, especially across speakers, which also means there may be a limit on what is achievable by adaptation; keep this in mind if you're considering doing research there. On the other hand, I think other groups, George's for example, reported very good results on adaptation, so maybe what I'm saying is not correct; you'd better check out their papers in the session.
Okay, so we are done with training; let me now look at alternative architectures. ReLUs, rectified linear units, are very popular: you basically replace the sigmoid nonlinearity with something like this.
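This really is the proverbial two-line change; a sketch:

```python
import numpy as np

def sigmoid(z):                      # the original hidden nonlinearity
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):                         # the rectified-linear replacement
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))
print(relu(z))                       # sparse: negative inputs are cut to zero
```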
And that also came out of Geoffrey Hinton's school. It turns out that on vision tasks this works really well: it converges very fast, you basically don't need to do pre-training, and it seems to outperform the sigmoid version on basically everything non-speech. For speech there was a really encouraging paper by Andrew Ng's students, "Rectifier Nonlinearities Improve Neural Network Acoustic Models", where they were able to reduce the error rate down to something like seventeen. So, great: I started on it; it is actually just two lines of code. And I didn't get anywhere.
I was not able to reproduce these results. So I read the paper again, and there is one sentence: network training stops after two complete passes. Well, if we only do two passes, our system is at nineteen point two as well, and we normally do many more passes, as you can see. So actually there was something wrong with the baseline.
It turns out, when I talk to people, that on the large Switchboard set it seems to be very difficult to get ReLUs to work. One group that actually did get it to work is IBM, together with George Dahl, but with a rather complicated method: they use Bayesian optimization to tune the training, it learns the hyper-parameters of the training, and this way they get something like five percent relative gain. I don't know if they're still doing that, or if it has become a bit easier now. So the point is: it looks easy, but it actually isn't, for large setups.
The other one is convolutional networks. The idea is basically this: look at these filters here; they are tracking some sort of formant, right? But the formant positions, the resonance frequencies, depend on your body height; for example, for women they are typically at a slightly different position compared to men. So why not share these filters across that? At the moment the system wouldn't do that. The idea is to apply these filters shifted slightly, apply them over a range of shifts, and that's basically what is represented by this picture here. Then the next layer reduces this: you pick the maximum over all these different results. It turns out that you can actually get something like four to seven percent word error rate reduction; I think there's even a little bit more in the latest papers.
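A minimal NumPy sketch of that shift-and-max idea along the frequency axis: a 1-D correlation over the filterbank channels followed by max-pooling; the sizes are illustrative.

```python
import numpy as np

def conv1d_freq(frame, filt):
    """Apply one filter at every frequency shift (valid 1-D correlation)."""
    n = len(frame) - len(filt) + 1
    return np.array([frame[i:i + len(filt)] @ filt for i in range(n)])

def max_pool(x, size):
    """Keep only the maximum over each group of neighboring shifts."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

rng = np.random.default_rng(0)
frame = rng.standard_normal(40)          # one frame of 40 log filterbank values
filt = rng.standard_normal(8)            # one learned filter (1-D here for simplicity)
responses = conv1d_freq(frame, filt)     # the same filter applied over a range of shifts
pooled = max_pool(responses, size=3)     # invariance to small formant shifts
print(responses.shape, pooled.shape)     # (33,) (11,)
```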
So the take-away for those alternative architectures: ReLUs are definitely not easy to get to work; they seem to work for smaller setups; some people tell me they get really good results on something like twenty-four-hour datasets, but on the big three-hundred-hour set it's very difficult and expensive. On the other hand, the CNNs are much simpler, and the gains are sort of in the range of what we get with feature adaptation. Okay.
That's the end of the training section; let me now talk a little bit about features. For GMM features a lot of work has been done: because GMMs are typically used with diagonal covariances, a lot of work went into decorrelating the features. Do we actually need to do this for the DNN? Well, how do you decorrelate? With a linear transform; and the first thing a DNN does is a linear transform, so it can kind of learn that by itself.
We start with a GMM baseline of twenty-three point six; if you put in fMPE, to be fair, twenty-two point six. Then you do a CD-DNN, just a normal DNN using those same features, the fMPE features, and you get to seventeen. Now take that out, and this minus sign means take it out, so it's just a PLP system: still seventeen. That kind of makes sense, because the fMPE was trained specifically for this GMM structure. Then you can also take out the HLDA, and it gets a bit better; HLDA is obviously decorrelation over a longer range, and the DNN already handles that. You can also take out the DCT that is part of the PLP or MFCC process; now you have a slightly different dimension, you have more features here, and I think a lot of people now use this particular setup. You can even take out the deltas, but then you have to compensate for it: you have to make the context window wider so we still see the same frames, and in our case it still holds up. And you can go really extreme and completely eliminate the filter bank and just look at FFT features directly; you still get something in the ballpark.
So what we just did basically undid thirty years of feature research. And there is also something kind of neat: if you really care about the filter bank, you can actually learn it; this is another poster, tomorrow. You see the blue bars and the red curves: the blue are the mel filters, and the red curves are basically learned versions of that. So the DNN can also kind of learn the filter bank itself.
So the take-away: DNNs greatly simplify feature extraction; just use the filter bank with a wider context window. One thing I should add: you still need to do the mean normalization; that you cannot drop.
that cannot
now
now we talk about features for dnns we can also trying to around right basically
you know ask not what the features can do for the dnn but what the
dnn and do for the features
i think that was
said by the same speech researcher
so we can use dnns as feature extractor so the idea is basically is one
of the factors that contributed to the success
long span features
discriminative training
and the hierarchical nonlinear feature map
right so
and trying to that is actually the major contributor so why not use this combined
with the gmm so we go really back to what the now some talked about
right
There are many ways of doing the tandem approach, as we heard this morning: you can do the classic tandem using the output posteriors; you can do a bottleneck, where you take an intermediate layer that has a much smaller dimension; or you can use the top hidden layer as sort of a bottleneck without making it smaller, just take it as is. In each of those cases you would typically do something like a PCA to reduce the dimensionality.
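A sketch of that last variant (my own illustration): take the top-hidden-layer activations for all frames, reduce them with PCA, and hand the result to a conventional GMM-HMM as its feature stream.

```python
import numpy as np

def pca_reduce(activations, out_dim):
    """PCA on top-hidden-layer activations: project onto the leading principal axes."""
    centered = activations - activations.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    basis = eigvecs[:, ::-1][:, :out_dim]             # top out_dim principal components
    return centered @ basis

rng = np.random.default_rng(0)
hidden = rng.standard_normal((5000, 2048))            # frames x top-hidden-layer units
gmm_features = pca_reduce(hidden, out_dim=39)         # feature stream for the GMM-HMM
print(gmm_features.shape)                             # (5000, 39)
```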
So does that work? If you take a DNN, this is the hybrid system here, and compare it with the GMM system where we take the top layer, do PCA, and then apply the GMM: well, it's not really that good. But now we have one really big advantage: we are back in the world of GMMs, so we can capitalize on anything that worked in the GMM world, right? For example, we were able to use region-dependent linear transforms, a little bit like fMPE; once you apply that, it's already better. You can also just do MMI training very easily; in this case it's not really as good, but at least you can do it out of the box without any of these problems with, you know, silence and all that; and you can apply adaptation just as you always would.
You can also do something more interesting: you can say, what if I train my DNN feature extractor on a smaller set and then do the GMM training on the larger set, because we have this scalability problem? This can really help with scalability, and you can see: close, not quite as good, but we were able to do it. Imagine a situation where you have something like a ten-thousand-hour product database that we couldn't train the DNN on. If on the DNN side we also use the full data, we definitely get better results here; but it might still make sense if we combine this, for example, with the idea of building the CE model only partially, and then see; we don't know that yet, actually. So this gets a lot of attention.
Another idea for using DNNs as feature extractors is to transfer learning from one language to another. The idea is to feed the network a training set of multiple languages, and the output layer, for every frame, is chosen based on what the language of that frame is. This way you train these shared hidden representations, and it turns out that if you do that, you can improve each individual language, and it even works for another language that has not been part of this set here.
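A structural sketch of that setup (my own illustration, with made-up shapes): one shared hidden stack, one softmax per language, and each training frame only goes through the shared stack plus the softmax of its own language.

```python
import numpy as np

rng = np.random.default_rng(0)
def layer(n_out, n_in): return 0.05 * rng.standard_normal((n_out, n_in)), np.zeros(n_out)

shared_stack = [layer(1024, 440), layer(1024, 1024)]          # hidden layers shared by all
output_layers = {lang: layer(n_senones, 1024)                 # language-specific softmaxes
                 for lang, n_senones in [("FRA", 1800), ("DEU", 2000), ("ESP", 1500)]}

def forward(x, lang):
    """Shared hidden stack, then the softmax of the frame's language."""
    h = x
    for W, b in shared_stack:
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))
    W, b = output_layers[lang]
    z = W @ h + b
    e = np.exp(z - z.max())
    return e / e.sum()
    # back-prop would update output_layers[lang] and the shared stack only

for lang in output_layers:
    print(lang, forward(rng.standard_normal(440), lang).shape)
```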
The only thing is that this typically works for low-resource languages. If the target gets larger, for example Soltau has a paper, I think it's here, where he shows that if you go up to something like two hundred seventy hours of training, then the gain is reduced to something like three percent. So this does not seem to work very well for large settings. Okay, so the take-away: the DNN as a hierarchical nonlinear feature transform is really the key to the success of DNNs, and you can use this directly and put a GMM on top of it as the classifier; that brings you back to the GMM world with all its techniques, including parallelization and scalability and so on. And on the transfer-learning side, it works for small setups, but not so much for large ones. Okay.
last topic runtime
runtime is an issue
this one problem for gmms
you can actually do on-demand computation
for dnns
a large amount of parameters actually the shared layers you can do on the map
so
all dnns are
you have to compute
and so it's important to look at how can speed up so for example the
demo video that i showed you in the beginning if i that was run with
the with the my gpu was doing the live likely to evaluation if you don't
do that it would like three times real time
wouldn't infeasible
so
the way to approach this and that was done both by some colleagues of microsoft
also ibm
is to ask we actually needles full weight matrices
i and so this is that the question is based on two observations
one is that we saw early on that actually you can set something like two
thirds of the parameters to zero
and still you get the same our
and what ibm observed is that this top hidden they're the
the number of
how to the number of nodes the actual active is relatively limited
so can you basically just decompose all the ideas you singular value decomposition
those weight matrix
and the ideas you basically this is your network there
the weight matrix nonlinearity replace this by two matrices and in the middle you have
a low-rank
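A minimal NumPy sketch of that factorization (the rank here is arbitrary, for illustration): split a layer's weight matrix W into two thinner matrices via a truncated SVD; afterwards you would fine-tune the whole network with back-propagation to recover the accuracy.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Replace one weight matrix W (n_out x n_in) by two: A (n_out x rank) and
    B (rank x n_in), so that W is approximated by A @ B with far fewer parameters."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * np.sqrt(s[:rank])              # absorb sqrt of singular values
    B = np.sqrt(s[:rank])[:, None] * Vt[:rank]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((9304, 2048))                # e.g. top layer: senones x hidden units
A, B = low_rank_factorize(W, rank=256)
print(W.size, A.size + B.size)                       # parameter count before vs after
print(np.linalg.norm(W - A @ B) / np.linalg.norm(W)) # relative approximation error
# (a random W is not low-rank; trained weight matrices compress much better)
# ...then fine-tune with back-propagation to restore accuracy.
```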
So does that work? Well, this is the GMM baseline, just for reference; the DNN, with thirty million parameters, on a Microsoft-internal task, starts with a word error rate of twenty-five point six. Now we apply the singular value decomposition; if you just do it straight away, it gets much worse; but you can then do back-propagation again, and you get back to exactly the same number, while gaining something like a one-third parameter reduction. You can also do this with all the layers, not just the top one; if you do that, you can bring it down by a factor of four. And that is actually a very good result; this basically brings the runtime back into a feasible range.
Let me just show you one more thing, again to give you a very rough idea. It's only a very short example, an apples-to-apples comparison between the old GMM system and the DNN system for speech recognition in a product. You see two devices: the one on the left runs what we had previously on board, the one on the right uses the DNN. [video clip plays: both devices are asked to find a good pizza place] The results are very similar; what is interesting is to look down here at the latency, which is counted from when I stop talking until we see the recognition result; there the difference is on the order of a second. So I just want to give you proof that this runtime section actually works in practice.
Okay, so I think I have covered the whole range; I would like to recap all the take-aways. We went through the CD-DNN: as Nelson already said, it's actually nothing else than an MLP; the outputs are the tied triphone states, and that's important. They're not really that hard to train, we know that now, but doing it fast is still sort of a frustrating enterprise, and I would at the moment recommend to just get a GPU, and if you have multiple GPUs, just run multiple trainings rather than trying to parallelize a single training. Pre-training does help a bit, but the greedy layer-wise discriminative version is simpler and seems to be sufficient. Sequence training gives us regularly good improvements, around ten to thirteen percent, but if you use SGD then you have to use these little tricks, the smoothing and the frame rejection. Adaptation helps much less than for GMMs, which might be because the DNN possibly already learns very good internal representations, so there may be a limit to what you can achieve. ReLUs are definitely not as easy as changing two lines of code, especially for large datasets; on the other hand the CNNs give us something like five percent, which is not really that hard to get, and they make good sense. DNNs really simplify the feature extraction; we were able to eliminate thirty years of feature extraction research; but you can also turn it around and use DNNs as feature extractors. And DNNs are definitely not slowing down decoding if you use this SVD trick.
To conclude, where do I see the challenges going forward? There are of course open issues in training. When we talk to people in the company, we are always thinking about what kind of computers we should buy for the future and whether we should optimize them for SGD; but we always think, you know what, in one year we will laugh about this, there will be some batch method and we will just not need all of this. So far that has not happened; I think it's fair to say there's no method like that on the rise that would give us easy parallelization. And what we found is that our learning rate control is not sufficient; this is really important, because if you don't do it right you can run into unreliable results, and I have a hunch that the ReLU result we saw was a little bit like that. It also has to do with parallelizability, because the smaller the learning rate, the bigger your mini-batch can be, and the more you can parallelize. DNNs still have an issue with robustness to real-life situations. In a way they have, not solved speech, but gotten very close to solving speech under perfect recording conditions; but it still fails if you do speech recognition over a distance of a meter or more in a room with two microphones or something like that. So DNNs are not inherently, automatically robust to noise: they are robust to seen variability, but not to unseen variability.
Then, personally, I wonder: can we move towards more machine learning? For example, there is already work that tries to eliminate the HMM and replace it by an RNN, and I think that is very interesting; the same thing has already been done very successfully with language models. And there's the question of jointly training everything in one big step; but on the other hand, the problem with that is that different aspects of the model use different kinds of data with different costs attached to them, so it might actually never be possible or necessary to do a fully joint training. And the final question that I sort of have is: what do DNNs teach us about how humans process speech, and will we get more ideas from that? So, that concludes my talk; thank you very much.
I think we have like six minutes for questions.
I'm not an expert on neural networks, so I was wondering: if I train a neural network on conventional speech data and I try to recognize data which is much cleaner, will it then not be as good, because we don't model the noise?
So what is the configuration you have in mind: you want to train on what? The case where they train their neural nets on the noisy data and then run them on clean data?
I don't know exactly; that's my question.
Okay, so I actually did skip one slide; let me show this one. This table here shows results on Aurora, so basically, in this case, multi-style training. The idea there was not to train on noisy and test on clean; this is training and testing on the same set of noise conditions. There are a lot of numbers here; this is the GMM baseline, if you look at this line here, thirteen point four. I'm not a specialist on robustness, but I think this is about the best you can do with a GMM, pulling in all the tricks that you could possibly put in. And the DNN, just like that, without any tricks, just training on the data, gets you almost exactly the same number. So what this means, I think, is that the DNN is very good at learning variability of the input, including noise, that it sees in the training data. But we have other experiments where we see that if the variability is not covered in the training data, the DNN is not very robust against it. So I don't know what happens if you train on noisy and test on clean; if clean is not one of the conditions in your training, I could imagine that it will have a problem, but it would be interesting to try on your data.
I don't think you can quite get away with saying thirty years; maybe that wasn't entirely accurate. You were obviously talking tongue in cheek, right: what you're talking about is going back before some of the developments of the eighties, and most of the effort on feature extraction at conferences over the last twenty years has actually been about robustness, about dealing with unseen variability, and this doesn't give you an answer to that question.
Are there more questions or comments?
A question about features: what do we need for future research? Is it enough to use a large temporal context? This seems to be one thing that keeps coming up, but in contrast to that...
Okay, I don't exactly have an answer to that, I'm sorry.
Any more comments?
A kind of personal question: you said that you knew nothing about neural nets until two or three years back, something like that. Do you see this as rather an advantage or a drawback: being maybe less sentimental about throwing away some of the things that the people who have been in the field for many years consider untouchable, or the other way round?
I think it helps to come in with a little bit of an outsider's mind. For example, I think it helped me to understand this parallelization thing: that basically SGD, delayed updates, and layer parallelism are all forms of mini-batch training. The regular definition of a mini-batch is that you take the average over the samples; maybe you noticed that I didn't actually divide by the number of frames when I used that formula. That, for example, is something where I, as an engineer coming in and looking at it, wondered: why do you treat mini-batches as an average? It doesn't seem to make sense; you're just accumulating multiple frames over time. And that helped me understand those kinds of parallelization questions in a different way. But these are probably details.
Okay, any other questions? Okay, then please thank the speaker again.