good morning everybody
I'm very happy to see you all this morning
Professor Li Deng will give the keynote this morning
it's not so easy to introduce him, because
he is very well known in the community
he is a fellow of several societies, such as
ISCA, IEEE, and the Acoustical Society of America
he has published several hundred papers over the years
and given many talks
Li Deng did his PhD at the University of Wisconsin
and he started his career at the University of Waterloo
he will talk to us today about two very important topics
very important to all of us
one is how to move beyond the GMM
that's not so bad, because I started my career with the GMM
and I need some new ideas to do
something else
the second topic will deal with the dynamics of speech
we all know that dynamics are very important
I will not take more time from his talk, I prefer to listen to him
thank you, Li
thank you, and thanks to the organizers and to Haizhou
for inviting me to come here and give this talk
it is the first time I've attended Odyssey
I've read a lot about the things this community has been doing
as Jean has said in the introduction
I think not only in speech recognition but also in speaker recognition
there are a few fundamental tools so far
that we have in common: one is the GMM, the other is the MFCC
over the last year, I've learned a lot of other things from this community
it turns out that the main point of this talk is to say
that both of these components can potentially be replaced, with much better results
I will touch a little bit on MFCC; I don't like MFCC
and I think Hynek hates MFCC too
it is only recently, now that we are doing deep learning
that there is evidence showing that all of these components may be replaced, certainly in speech recognition; people
have seen that it is coming
hopefully, after this talk, you may think about whether in speaker recognition these components can
be replaced
to get better performance
the outline has three parts
in the first part, I will give a quick tutorial
I have accumulated several hours of tutorial material
over the last few months, so it is a little challenging to compress it down
to this short tutorial
rather than talking about all the technical details
I've decided to just tell the story
I also noticed that in the session right after this talk
there are a few papers related to this:
restricted Boltzmann machines, deep belief networks
and deep neural networks in connection with HMMs
at the end of this talk, you may be convinced that these components may be replaced
as well
and that in the future we can expect much better speech recognition performance than what
we have now
I will also cover the deep convex network, or deep stacking network
then, over the last 20 years, people have been working on segment models and hidden
dynamic models
and 12 years ago I even led
a project at Johns Hopkins University working on this
and the results were not very promising
now we are beginning to understand why the great ideas we proposed there
did not work well at that time
it is only after doing this deep learning work that we realize how we can put them together
and that is the final part
so, the first part
how many people here have attended one of my tutorials over the last year?
OK, it's a small number of people
this you have to know: deep learning, sometimes called hierarchical learning in the
literature
essentially refers to a class of machine learning techniques
largely developed since 2006
by... you know, actually, this is the key paper
that introduced a fast learning algorithm for what is called the deep belief network
in the beginning, this was mainly applied to image recognition, information retrieval and other applications
and we, actually Microsoft, were the first to collaborate with University of Toronto
researchers to bring it to speech recognition
and we showed very quickly that not only does it do very well for small vocabulary
but it does even better for large vocabulary
this really happened
you know, in the past, things that worked well for small tasks sometimes
failed for larger ones
but here, the bigger the task, the better the success; I will try
to explain
to you why that happens
Boltzmann machines will come up again in the following talks; I think Patrick
has two papers on that
and on restricted Boltzmann machines
and this is a little bit confusing: if you read the literature
very often deep neural network and deep belief network
which are defined here and are totally different concepts
(one is a component of the other)
get mixed up; just for the sake of convenience, the authors often
call the deep neural network a DBN
and DBN also refers to the dynamic Bayesian network
which is even more confusing
one more thing:
for people who attended my tutorial, I gave a quiz on this
so they know all of this
last week, we got a paper accepted for publication, one I wrote together with
Geoffrey Hinton and with 10 authors all together
working in this area
we try to clarify all this, so we have a unified terminology
and when you read the literature, you know how to map one term to another
there is also the deep auto-encoder, which I don't have time to go into here, and I will say something
about some new developments
which to me are more interesting because of the limitations of some of the others
this is a hot topic; here I list a whole set of recent workshops and special issues
and actually, at Interspeech 2012
you will see tens of papers in this area, most of them in speech recognition
in one of the areas alone, there are two full sessions on
this topic, just for recognition
and in some other areas there are more, plus a special issue of
PAMI, mainly related to the machine learning aspects and also computer vision applications
I tried to put a few speech papers there as well
and there is a DARPA program
from 2009; I think it stopped last year
and I think in December there is another workshop related to this topic; it is
very popular
I think that is because people see the good results coming, and I hope that
one message of this talk will convince you that this is good technology, so
you will want to seriously consider adopting some of its essence
let me tell some stories about this
so this was the first time
that deep learning showed promise in speech recognition
and activities have grown rapidly since then; that was around
two and a half years ago
or three and a half years ago, whatever
at NIPS; NIPS is a machine learning workshop held
every year
I think one year before that
I actually talked with Geoffrey Hinton
a professor at Toronto, and he showed me
the Science paper; he actually had a poster there
the paper was well written and the results were really promising
in terms of information retrieval, for document retrieval
so I looked at this, and after that we started talking about
maybe working on speech
he had worked on speech a long time ago
so we decided to organize this workshop, and actually we had worked together before
my colleague Dong Yu, myself and Geoffrey decided to get
a proposal accepted which presented the whole deep learning approach and the preliminary work
at that time most people worked on TIMIT, a small experiment
and it turned out that this workshop generated a lot of excitement
so we gave a tutorial, 90 minutes
about 45 minutes each: I talked about speech, and Geoffrey talked about deep
learning at that time, and we decided
to get people interested in this
the custom at NIPS is as follows
at the end of the final day of workshops
each organizer presents a summary of their workshop
and the instruction is that it should be a short presentation, it should be funny
and should not be too serious
every organizer is instructed to prepare a few slides that summarize
the workshop and convey your impression to the people who attended
this is the slide we prepared
a speechless summary presentation of the workshop on speech
because we didn't really want to talk too much, we would just go up there and show
that slide
no speech, just animations
so it goes like this: we met this year
this one is supposed to be the industry people
and this one is supposed to be the academic people
so they are smart, and deeper
and the industry people say, can you understand human speech?
and the academics say, we can recognize phonemes
and they say, that's a nice first step, and what else do you want?
and they said they want to recognize speech in noisy environments
and then he said, maybe we can work together
so we got all the concepts together
and that was the whole presentation
we decided to do small vocabulary first
and then quickly, I think in December of 2010
we moved to very large vocabulary
to our surprise, the bigger the vocabulary, the better the success
which is very unusual
I analyzed the errors myself in detail
you know, we had been working on this for twenty-some years before
one surprise, which convinced me to work in this area personally
was that every error pattern I see from this recognizer is very different from
the HMM's
it is absolutely better, and the errors are very different; that means it is worthwhile for
me to do this
anyway, let me talk about the DBN
one key concept is the deep belief network, which is what Hinton published in
2006
in two papers
they have nothing to do with speech; it's called the deep belief network, and it's pretty hard to read
if you have not been in the field for a while
and this is the other DBN, the dynamic Bayesian network
a few months ago, Geoffrey sent me an email saying, look at this
acronym: DBN, DBN
he suggested that before you give any talk, you check which one is meant
mostly, in speech recognition, people mean the dynamic Bayesian network
anyway, I will give a little bit of technical content on it; time is running out quickly
number one, the first concept is the restricted Boltzmann machine
actually, I have 20 slides on this, so I just take one slide out of those 20
so think of this as the visible layer
it can include the label; the label can be one of the visible units
when we do discriminative learning; the other part is the observation
forget about this for now
think about MFCCs, and think about the label
the speech label, a senone or some other label
so we put them together as the visible layer, and we have the hidden layer here
and then the difference between the Boltzmann machine and the neural network is that
the standard neural network is one-directional, from the bottom up
while the Boltzmann machine is bidirectional, you can go up and down; but the connections between
neighboring units within this layer and within that layer are cut off
if you don't do that, it is very hard to learn
so one of the things in deep learning is that they start with the restricted Boltzmann machine
the idea is that
if you have bidirectional connections
and you do all the detailed math, writing down the energy function, you can write down
the conditional probabilities of the hidden units given the visible units, and the other way around
and if you choose the energy right, you can actually make the conditional probability of the visible units
given the hidden units Gaussian
which is something people like; with this conditional you can interpret the whole thing
as a Gaussian mixture model
so you may think that this is just a Gaussian mixture model, so the two can do the same thing
as each other
the difference is that here you get an almost exponentially large number of mixture components
rather than a finite number
I think in speaker recognition it's about 400 or 1000 mixtures or so
and here, if you have 100 hidden units
you get an almost unlimited number of components
but they are tied together
Geoffrey has done very detailed mathematics to show that this is a very powerful way of
doing Gaussian modeling
you actually get a product of experts rather than a mixture of experts
to me that is one of the key insights that we got from him
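To make that concrete, here is a rough sketch of the energy function and the resulting conditionals for a Gaussian-Bernoulli RBM, in standard textbook notation (my notation, not necessarily the exact form on the slide):

```latex
E(\mathbf{v},\mathbf{h}) = \sum_i \frac{(v_i - b_i)^2}{2\sigma_i^2}
  - \sum_j c_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} W_{ij} h_j,
\qquad
p(h_j{=}1 \mid \mathbf{v}) = \mathrm{sigmoid}\!\Big(c_j + \sum_i W_{ij}\,\frac{v_i}{\sigma_i}\Big),
\qquad
p(v_i \mid \mathbf{h}) = \mathcal{N}\!\Big(v_i;\; b_i + \sigma_i \sum_j W_{ij} h_j,\; \sigma_i^2\Big).
```

Marginalizing over the 2^J binary hidden vectors gives a mixture of 2^J tied Gaussians, which is where the "exponentially many components" and the product-of-experts view come from.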
that is the RBM; so think of this as the RBM
this is the visible layer
the observation, and this is the hidden layer, and we put them together and we have it
it is very hard to do speech recognition with it directly
it is a generative model, and you can do speech recognition with it, but if you do that
the results are not very good
when you tackle a discrimination task with a generative model you are limited, because
you don't directly focus on what you want
however, you can use it as a building block
to build the DBN (deep belief network)
the way we do it, actually the way they do it in Toronto
is to think of this as a building block
you can do the learning; after you do the learning of this, I will just skip the details
it would take a whole hour to talk about that learning, but assume that you know
how to do it
after you learn this, you can treat it as feature extraction: what you get
here
you treat as input and stack up
deep learning researchers argue that this becomes the feature for the next layer
and then you can go further; I think of it as a brain-like architecture
think of the visual cortex, with 6 layers
you can keep building up: whatever you learn here becomes the hidden feature
hopefully, if you learn it right, you can extract the important information from the data that
you have
and then you build features on top of features and keep stacking up
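A minimal sketch of that layer-by-layer stacking, assuming Bernoulli-type RBMs trained with one-step contrastive divergence (names like train_rbm and pretrain_stack are mine and purely illustrative; a real speech front end would use a Gaussian-visible first layer):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.05):
    """Train one RBM with CD-1 and return its weights and hidden bias."""
    n_visible = data.shape[1]
    W = 0.01 * np.random.randn(n_visible, n_hidden)
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        h_prob = sigmoid(data @ W + b_h)                       # up
        h_sample = (np.random.rand(*h_prob.shape) < h_prob).astype(float)
        v_recon = sigmoid(h_sample @ W.T + b_v)                # down
        h_recon = sigmoid(v_recon @ W + b_h)                   # up again
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
        b_v += lr * (data - v_recon).mean(axis=0)
        b_h += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_h

def pretrain_stack(data, layer_sizes):
    """Greedy stacking: hidden activations of one RBM become data for the next."""
    weights, features = [], data
    for n_hidden in layer_sizes:
        W, b_h = train_rbm(features, n_hidden)
        weights.append((W, b_h))
        features = sigmoid(features @ W + b_h)                 # features on features
    return weights

# toy example: three hidden layers pre-trained on random "frames"
stack = pretrain_stack(np.random.rand(100, 39), [256, 256, 256])
```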
why does stacking up work? actually, there is an interesting theoretical result
which shows that if you unroll a single DBN
sorry, one layer of RBM
as a belief network, it is actually equivalent to one of infinite depth
because every time you do the learning
the learning actually goes up and down, and every time you go up and down
it can be shown that
you actually get one layer higher; now the restriction here is that
all the weights have to be tied, so it is not very powerful
but we can untie the weights by doing separate learning for each layer
and once we do that, it becomes a very powerful model
anyway, the reason why this one goes down and this one goes up and
down is that, if you
actually, I don't have time to go into it here, but believe me
if you stack this up, one layer at a time
then you can show mathematically that it is equivalent to having
just one RBM at the top and then a belief network going down
and this is actually called a Bayes network
so you can see that the belief network is similar to a Bayes network
but now if you look at this, it is very difficult to learn
because for the arrows going down here, there is something in machine learning called the explaining-away
effect
so inference becomes very hard, while generation is easy
and then the next invention in this whole theory is
to just reverse the order
and you can turn it into a neural network; it turns out the theory is not
as solid in that respect
but in practice it works really well
actually, I am looking into some of the theory of this
so this is the full picture of the DBN
the DBN consists of bidirectional connections here
and then single-direction connections going down
if you do this, you can actually use it as a generative model and
do recognition with it
unfortunately, the results are not good
there were a lot of steps before people reached the current state
and I am going to show you all the steps here
so number one, the RBM is useful; it gives you feature extraction
and you stack RBMs a few layers up
and you get a DBN; actually, at the end you need to do some discriminative
learning
uh, so let's see; but in general, the generative capability is just very good
the first time I saw
the generative capability, from Geoffrey, I was amazed
this is the example that he gave me
you train using these digits
the database is called MNIST
an image database everybody uses, like TIDIGITS for us in speech
you put the images here and you learn
according to this standard technique
and now you put the digit 1 here because you want to synthesize a 1
you set the 1 unit to on and all the others to 0, and then you run it
and you actually get something really nice; the same if you put 0 here
this is different from the traditional generative process
the reason they are different is the stochastic process
it does not just memorize
some of the generated digits are corrupted
but most of the time you get realistic ones
last time, in one of the tutorials I gave
I showed this result, and there were some speech synthesis people in
the audience
they said, that's great, we will do speech synthesis this way now
with a conventional model you get one fixed output, not a human-like one
when humans write, the writing comes out differently every time
they immediately went back to write a draft proposal and asked me to help them
this is very good: with the stochastic components in there, the results look like what humans do
now, we want to use it for recognition; this is the architecture
I was amazed; I had a lot of discussion with Patrick yesterday
I just feel that when you have a generative model you really want to use it this way
you put the image here, it moves up, and this becomes the feature
and all you do is turn on these label units one by one
and run for a long time until convergence
and you look at the probability for this
to get the number, OK
then you turn on the other units, run and run, and see which number has the highest probability
I suggest that you don't do that; don't waste your time
number one, it takes a long time to do recognition; number two, we don't know how
to generalize it to sequences
and Geoffrey said the results are not very good, so we did not do it
we abandoned the concept of generation, of doing everything generatively; that's what we did
and that's how the deep neural network was born
so all you do is treat all the connections as going one way, bottom up
that's why at the end my conclusion is that the theory of deep learning is
very weak
ideally the DBN goes down; it is a generative model
in practice, you say that is not good, so just forget about it
eliminate this, and make all the weights point upward
we modify it; the easiest way is to just forget about the downward direction
and make everything go up; at first people don't like it
in the beginning, I thought it was horrible, that it was crazy to do it
you just break the theory used to build the DBN
but finally, what gives the best result? what we do is really the same as
what the multilayer perceptron has been doing, except that it
has very many layers
now if you do that the typical way, you randomize
all the weights, and you run into the standard argument
from twenty-some years ago, which says that
mathematically, the deeper you go
the lower the level you reach, because the label is at the top level
so you do back-propagation, taking the derivative of the error from here down to here
and the gradient becomes very small
you know, the sigmoid derivative is sigmoid times (1 - sigmoid)
so the lower you go, the more chance that the gradient term vanishes
people didn't even try back-propagation for deep networks; it seemed impossible to
learn, so they gave up
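A one-line version of that argument (standard analysis, not a slide from the talk): the sigmoid derivative never exceeds 1/4, and back-propagation multiplies in one such factor per layer, so with small random weights the error signal reaching the bottom layers shrinks roughly geometrically with depth:

```latex
\sigma'(z) = \sigma(z)\bigl(1-\sigma(z)\bigr) \le \tfrac{1}{4},
\qquad
\frac{\partial E}{\partial \mathbf{z}^{(l)}}
= \mathrm{diag}\!\bigl(\sigma'(\mathbf{z}^{(l)})\bigr)\,\bigl(W^{(l+1)}\bigr)^{\!\top}
  \frac{\partial E}{\partial \mathbf{z}^{(l+1)}}.
```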
and then one of the very interesting things that came out of deep learning was
to say
rather than using random numbers, it may be interesting to use the DBN weights and plug them in there
although that is something I don't entirely like
look at the argument for why it is good: what we do is that we train
this DBN
over here
the DBN weights; you just use generative training
and once it is trained, you fix these weights, and you just copy the whole set of weights into
this deep neural network to initialize it
after that you do back-propagation
again, the gradient in the lower layers is very small, but that's OK
you already have the DBN weights over here
well, the RBM; it should be RBM, not DBN anymore
so it is not too bad
so you see exactly how to train this; the point is just that random initialization is
not good
if you use the DBN's weights down here it is not too bad, and up here you
modify them with back-propagation
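A minimal sketch of that "copy the generative weights in, then back-propagate" step (PyTorch, with made-up layer sizes and random tensors standing in for the pre-trained RBM weights):

```python
import torch
import torch.nn as nn

sizes = [429, 2048, 2048, 2048, 9000]          # 11-frame input, 3 hidden layers, senone outputs (assumed)
pretrained = [torch.randn(sizes[i + 1], sizes[i]) * 0.01 for i in range(3)]  # stand-ins for RBM weights

layers = []
for i in range(len(sizes) - 1):
    layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.Sigmoid()]
layers[-1] = nn.LogSoftmax(dim=1)              # softmax output over senones
dnn = nn.Sequential(*layers)

with torch.no_grad():                          # initialize hidden layers from the generative pre-training
    for i, W in enumerate(pretrained):
        dnn[2 * i].weight.copy_(W)

# fine-tune: ordinary back-propagation / SGD on labeled frames (dummy batch here)
opt = torch.optim.SGD(dnn.parameters(), lr=0.1)
x, y = torch.randn(32, 429), torch.randint(0, 9000, (32,))
loss = nn.NLLLoss()(dnn(x), y)
loss.backward()
opt.step()
```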
then you just run recognition on MNIST
and the error goes down to 1.2%; that was Geoffrey Hinton's whole idea
and he published a Science paper about this; at that time, it seemed to be
very good
but I will tell you that while that MNIST result is 1.2% error, with a few
more generations of networks, which I will show you, we are able to get 0.7%
and the same kind of philosophy carries over to speech recognition
I will go quickly; in speech, all of you think about how to do sequence
modeling
it is very simple
now we have the deep neural network
what we do is normalize the output using a softmax
to make it, similar to the talk yesterday, a kind of calibration
so we get posterior probabilities, divide them by the priors to get (scaled) generative likelihoods, and just
use the HMM on top of that
that is why it is called the DNN-HMM
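In the usual hybrid formulation (standard notation, my symbols), the softmax posteriors are turned into scaled likelihoods that the HMM decoder can use:

```latex
p(\mathbf{x}_t \mid s) \;\propto\; \frac{p(s \mid \mathbf{x}_t)}{p(s)},
```

where $p(s \mid \mathbf{x}_t)$ is the DNN softmax output for senone $s$, $p(s)$ is its prior estimated from the training alignments, and the common factor $p(\mathbf{x}_t)$ cancels in Viterbi decoding.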
the first experiment we did was on TIMIT
with just phonemes, which is easy
each state, one of three states per phoneme; very good results, as I will show
you
then we moved to large vocabulary; one thing that we do in our company
you know, at Microsoft we call them senones
rather than having a phone, we cut it into context-dependent units
that is our infrastructure
so we don't change any of this
but rather than using 40 phones, what happens if we use 9000 outputs?
you know, the senones; a long time ago people could not do that, 9000 outputs here seemed crazy
with 3000 by 5000 units you have 15 million weights in a layer, which is very hard to
train
but now we have bought very big machines
GPU machines, parallel computing
so we replace this by ... it can be very large
the output is very large, and the input is very large as well
because we use a big window
so we have a big output, a big input, and a very deep network; those are the 3 components
why a big input, a long window
which could not be done with the HMM?
do you know why? because
I had a discussion with some experts; it could not be done for speaker recognition either, with the
UBM
for speech recognition, the reason it couldn't be done is that
first of all you have to use diagonal covariances in the HMM
and the window cannot be too big; if you make it too big, the Gaussians run into sparseness problems
in the covariance matrix
in the end, all we do is make it as simple as possible: just plug in the whole
long window
and feed in the whole thing; we get millions of parameters
typically, this number is around 2000
2000 units here in every layer: 4 million parameters here, another 4 million, another 4 million
and we just use GPUs to train the model together
and that is not too bad
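Rough arithmetic behind those numbers, under assumed sizes (an 11-frame window of 39 coefficients, 2048-unit hidden layers, and on the order of 9000 senones; the exact figures vary from system to system):

```latex
11 \times 39 = 429 \ \text{inputs}, \qquad
429 \times 2048 \approx 0.9\text{M}, \qquad
2048 \times 2048 \approx 4.2\text{M per hidden-to-hidden layer}, \qquad
2048 \times 9000 \approx 18\text{M for the output layer}.
```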
so we use about 11 frames
and now it has even been extended to 30 frames
but with the HMM, we never imagined doing that
we don't even normalize this; we just use the raw
values here
in the beginning I still used MFCC, delta MFCC, delta-delta
multiplied by 11 or 15 frames or whatever
so we have a big input
which is still small compared with the hidden layer size
and we train this whole thing, and everything works really well
and we don't need to worry about correlation modeling, because the correlation is automatically captured by
the weights here
the reason I bring this up is just to show you that this is not just about
phones
we went through the history, the literature; we had never seen this kind of input used for speech until this
first appeared
now just to give you a picture here: the GMM everybody knows
HMM, GMM; the whole point is to show you
that with the same kind of architecture, if you look at the HMM
you can see that the GMM is very shallow
all you do is that for each state the output is one score from the GMM
whereas over here, you can see many layers
so you build features up layer by layer; this shows deep versus shallow
here are the results; we wrote the paper together, and it will appear in November
and that result summarizes
the research of four groups together over the last three years
since 2009
the University of Toronto, Google, IBM
and our group at Microsoft Research, which was the first one to
do serious work on this for speech recognition
with Google data and IBM data as well
they all confirm the same kind of effectiveness
here is the TIMIT result
it is very nice; people think TIMIT is very small
but if you don't start with something like this, you get scared away
I will come back to this in the second part of the talk: this is the monophone
hidden trajectory model I built many years ago
getting this number took two years
I wrote the training algorithm, and very kindly my colleagues wrote the decoder for me; this
is a very good number
for TIMIT, and the decoding is very hard to do
the first time we tried this DBN
this deep neural network
I wrote this paper with my colleagues; we do MMI training
you can do back-propagation through the MMI objective for the whole sequence
so we got about 22%, which is almost 3% better
and then we looked at the errors: the errors from this and from this are very different, especially for
very short segments
there it is not really better, but on the longer segments it is much better
I had never seen that before
so compare this kind of work with the HMM
those results were obtained twenty-some years ago
that is the error, around 27%, about 4% higher
and over around 10 or 15 years, the error dropped about 3%
so this and this are very similar in terms of error rate
but the errors themselves are very different
so the first large experiment was voice search
at that time, voice search was a very important task, and now voice search is
everywhere
Siri has voice search, on Windows Phone we have it
even on Android phones
it is a very important topic
so we had the data, we had worked on this one, a very large vocabulary task
and in the summer of 2010
we were the first in our group to just try it, because it is so different
from TIMIT
and we actually didn't even change the parameters at all
all the parameters, the learning rates
came from our previous work on TIMIT
and we got the error down to here; that is the paper we wrote
which just appeared this year
and this is the result that we got
if you actually want to see exactly how this is done
most of it is provided
in this paper
which tells you how to train the system
but you need a GPU to implement it; without a GPU, it takes 3 months just
for the experiments
for large vocabulary; with a GPU it is really quick
most of it is the same: you do this, you do this
we tried to provide as much theory as possible
so if you want to know how to do this in some application, take a
look at this paper
so this was the first time we saw
the effect of increasing the depth of the DNN for large vocabulary
as we add layers, the accuracy of our system goes up like this
and the baseline, an HMM with discriminative training, MPE learning
is around 65; this curve is just the neural network
a single hidden layer is already doing better than all of that
and as you increase the depth, you gain more
beyond a certain point there is some kind of overfitting; the data is limited, we had labeled
24 hours
of data at that time, so we said
let's do more, and we tried 48 hours
and this error drops a lot
so the more data you have, the better you get
some of my colleagues asked why we don't use Switchboard
I said, that is too big for me, we won't do it
but then we actually did Switchboard
and we got a huge gain
even more gain than what I showed you here
just because of more data
the typical voice search problem
is not really spontaneous speech, but Switchboard is spontaneous
so this works for spontaneous speech as well
it seems that with limited data we already go up quite a lot
and then you get one or two orders of magnitude more data
and you have many more GPUs to run, much better software
and everything runs well
it turns out we get the same kind of gains
and we published them here
let me show you some of the results
this is the result, the table in our recent paper
with the Toronto group
the standard GMM-based HMM
with 300 hours of data
has an error rate of about 23-point-something percent
we did this very carefully
and tuned the parameters, including the number of layers
and we went from here to here
and that actually attracted a lot of people's attention
and then we realized
that we had 2000 hours, and the result from that is even better
at that time, that was the Microsoft result
and then one recent paper published that result
of course, when you do this, people argue that you have 29 million parameters
and people always, you know
are picky, people in the speech community
they say, obviously, uh, you've got more parameters, of course you're going to win
so what if you use the same number of parameters?
we said fine, we'll do that
so we used sparseness
to actually cut out most of the weights
and the number of non-zero parameters is 15 million
and with the smaller number of parameters
we get an even better result
it's amazing... the capacity of the deep network is just tremendous
you cut out most of the parameters
in the beginning we didn't do this
typically, you would expect the result to be similar, right
but if you get rid of the low-magnitude weights
you gain slightly
though that may not carry over once we get more data anyway
so this is maybe
within the statistical variation, but still
with a smaller number of parameters
than the GMM-HMM trained with discriminative training
we get about a 30-something percent error reduction
more than on TIMIT, and also more than on
our voice search task
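As an aside, a minimal sketch of the kind of magnitude-based sparsification I am describing (illustrative only, not the exact recipe in the paper): zero out the smallest-magnitude weights, remember the mask, and keep training only the surviving connections.

```python
import numpy as np

def sparsify(W, keep_fraction=0.5):
    """Return a pruned copy of W and the binary mask of surviving weights."""
    threshold = np.quantile(np.abs(W), 1.0 - keep_fraction)
    mask = (np.abs(W) >= threshold).astype(W.dtype)
    return W * mask, mask

W = np.random.randn(2048, 2048)                   # one hidden-to-hidden weight matrix
W_sparse, mask = sparsify(W, keep_fraction=0.5)   # roughly half the weights survive
print(int(mask.sum()), "non-zero parameters out of", W.size)
# during further training, updates are applied only where mask == 1:
#   W_sparse -= lr * grad * mask
```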
and then there was another paper, and then IBM came along
and then Google came along; they saw, you know, the better results, and I think they
wanted to do it as well
so you can see this is Google's result
this is about 5000 hours, amazing, right
they just have better infrastructure
MapReduce and all that, so they managed to do it on 5000 or 6000 hours
so this number just came out
actually that number
will be in the Interspeech papers, if you go to see them
one thing about Google's papers is that they don't include this baseline result
they just give a number
so you have to ask what baseline number they have
so... sorry... sorry
with more data they have this; with the same data they don't give the number either
they
just didn't bother to do it
they all believe more data is better
so with a lot more data they got this
and then we, with about, how many, about
uh, with this much data
I get about 12%; it is better when we get more data
they should put a number here, anyway
so I'm not, we're not nitpicking on this
and this is the number I showed, this is Microsoft's result, the numbers from here to here
and from here to here
for two different test sets
and all of you here should know these; this is very important
to review
ah now, this is the IBM result
ah sorry, this is the voice search result I showed you earlier
this is 20%
which is not bad
because you have only about 20 hours of data, so
it turns out that the more data you have
the more error reduction you get
and for TIMIT, we got only about 3-4% absolute, about ten-something percent relative
now, this is the
broadcast news result, which is from IBM
and I heard that at Interspeech they have a much better result than this
so if you're interested, look it up
my understanding is that
from what I heard, their result is comparable to this
and some people say even better
so if you want to know exactly what IBM is doing, they have even better
infrastructure
in terms of distributed learning
compared with most other places
but anyway, this kind of error reduction
has been unheard of in the history of this area, for about 25 years
the first time we got these results, we were just stunned
and this is also Google's result, even for YouTube speech, which is much more
difficult
spontaneous, with all the noise
they also managed to get a gain, from here to here
this time they were pretty honest and put the baseline here with the same amount of
data
with the 1,400 hours they got a good gain
but in our case, when we went to 2000 hours, we actually got more gain
a rapid gain, yes
so the more data you have, the better
and then of course, to get this, you have to tune the depth
the more data you have, the deeper you can go
and the bigger you may want to make the network
and the more gain you get
and this is the point I want to make
all without having to change major things in the system architecture
OK, so
one thing that we found
my colleague Dong Yu and myself and others
recently found is that
in most of this work
I believe in the old days IBM
and Google and our early work
all used the DBN to initialize the model offline
we asked, can we get rid of that? that training is very tricky, and not many
people know how to do it
for a given recipe, you have to look at the patterns
it's not an obvious thing to do, because of the learning
there's a keyword in the learning called contrastive divergence; you may hear that term
in the later
part of the talks today
in contrastive divergence theory
essentially the idea is that you should, you know, iterate
you should run the Markov chain
with Gibbs sampling for infinitely many steps
but in practice that is too long
so you cut it down to one step
and beyond that you may have to use variational bounds
to get better results
it's a bit tricky
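Roughly, in standard RBM notation (my symbols, not the slide's), the exact log-likelihood gradient and its one-step contrastive-divergence approximation are:

```latex
\frac{\partial \log p(\mathbf{v})}{\partial W_{ij}}
= \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}},
\qquad
\Delta W_{ij} \approx \eta \bigl( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{1} \bigr),
```

where the model expectation would require running Gibbs sampling to equilibrium, and CD-1 replaces it with the statistics collected after a single up-down-up reconstruction step.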
that's why it's better to get rid of it
so my colleagues actually filed a patent just a few months ago
on this, and there is also a paper from my colleagues
who actually used it for the
Switchboard task
and they showed that
you can actually do comparably to RBM learning
so now I would say that for large vocabulary
we don't even have to learn much about the DBN
so... the theory so far is not clear about
exactly what kind of power it gives you
but my sense is that
if you have a lot of unlabeled data in the future
it might help
but we also did some preliminary experiments suggesting that may not be the case anymore
so it's not clear how to settle that
so I think at this point we really need a better theory
to get a better theory and also do careful comparisons
you know
although all these issues cannot be settled yet
so the idea of discriminative pre-training is that
you just train a standard, um
standard
multi-layer perceptron, you know
this is easy to train; a shallow one you can train, the result is just not very good
and then every time
you fix this
you add a new layer, and you train again; you keep the lower layers
from the previous, shallower network
and that's good, that's the spirit; it's very similar to
OK, the spirit is very similar to layer-by-layer learning
but now every time
we add a new layer, we inject
discriminative label information
and that's very important; if you do that, nothing goes wrong
whereas if you just use random numbers all the way up and do
that, nothing is going to work
uh, well, except
there are some exceptions here, but I'm not going to say much about them
but once you do this
layer by layer
the spirit is still similar to the DBN, right, layer by layer
but you inject discriminative learning
I believe it's a very natural thing to do
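A minimal sketch of that grow-one-layer-at-a-time procedure (PyTorch, with illustrative sizes; whether the lower layers are frozen or merely trained briefly varies between recipes, here they are held fixed as described):

```python
import torch
import torch.nn as nn

def train_briefly(model, x, y, steps=20, lr=0.1):
    """A short burst of supervised training with labels (discriminative)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nn.CrossEntropyLoss()(model(x), y).backward()
        opt.step()

n_in, n_hidden, n_out = 429, 512, 183                  # assumed sizes
x, y = torch.randn(256, n_in), torch.randint(0, n_out, (256,))

hidden_stack, prev_width = [], n_in
for depth in range(5):                                 # grow to 5 hidden layers
    new_hidden = nn.Linear(prev_width, n_hidden)
    output_layer = nn.Linear(n_hidden, n_out)          # fresh output layer each round
    model = nn.Sequential(*hidden_stack, new_hidden, nn.Sigmoid(), output_layer)
    train_briefly(model, x, y)                         # label information injected at every depth
    for p in new_hidden.parameters():
        p.requires_grad_(False)                        # keep the newly trained layer fixed next round
    hidden_stack += [new_hidden, nn.Sigmoid()]
    prev_width = n_hidden
# afterwards, unfreeze everything and fine-tune the whole network with back-propagation
```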
we talked about this, right
so we do
the generative learning in the DBN
you know, layer by layer, and you have to be very careful
not to overdo it
and then if you inject some discriminative information
it's bound to help
you get new information in there, not just from looking at the data itself
and it turns out that if we do this, we get
in some experiments we even get slightly better results than with DBN pre-training
so it's not clear that generative learning
plays, or is going to play, as important a role
as some people have claimed
OK, so I'm done with
the
the deep neural network, so I will spend a few minutes to tell you a bit
more about
a different kind of architecture called the deep convex network
which to me is more interesting
so I spend most of my time on this now
we actually have a few papers published, and it turns out that
the idea of this network is that
well, this was actually done for MNIST
when we use this architecture
we actually get a much better result than with the DBN
so we're very excited about this
but the point is that for the learning, you know
we have to simplify the network
it turns out that the learning
of the whole thing is then convex optimization
I do not have time to go through all of this
and it allows parallel implementation
which is almost impossible
for the deep neural network
the reason, for those of you who have actually worked on neural networks, as you
know, is that
the learning for
the discriminative learning phase
which is called the fine-tuning phase
is typically stochastic gradient descent
which you cannot distribute
so that one cannot be distributed
so I'm not going to go further
I really want to use
this architecture to try speech recognition tasks, and we have had lots of discussion
so maybe one year from now
if it works well for your discriminative learning tasks
I'm glad that
there is now going to be a defined task
for discrimination, as I have
discussed, so
that gives me the opportunity to try this
I would love to try it, and I would love to report the results
even if they're negative, I'm happy to share them with you
OK, so this is a good architecture
and another architecture that we tried
is one where we split each hidden layer into two parts
and take the cross-product between them; that overcomes
one of the DBN's weaknesses
namely not being able to model correlation in the input directly
people have tried a few tricks for that
you know, to model the correlation
and it did not work well; it's almost impossible
whereas this is very easy to implement
and most of the learning here is convex optimization
and it often gets very good results compared with the others
there's another architecture, the tensor version
the same kind of correlation
modeling in the tensor version
can also be carried over into
the deep neural network
my colleagues and I actually submitted a paper on this to Interspeech
I think if you're interested in this, you should go there and take a look
at it
so the whole point is that
rather than doing the stacking using input-output concatenation
you can actually do the same thing within each hidden layer of the neural network
in this paper we actually evaluated that on Switchboard
and we get an additional 5% relative gain over the best we had so
far, so this is good stuff
the learning becomes trickier
because when you do the back-propagation
you have to think about how to do it through the cross-product
it adds some additional nuisance in terms of efficient computation
but the results are good
so now I'm going to the second part; I'm going to skip most of it
OK, skip most of it
OK, so this, uh... I actually wrote a book on this
so this is
the dynamic Bayesian network as a deep model
the reason it's deep is that there are many layers
you have the target
you have the articulation
you have the environment, all together like this
so we tried that
and the implementation of this is very hard
so I will just go quickly to the bottom
uh, so, this is one of the papers that
I wrote, uh, together with
one of the experts, who actually
this is my colleague who actually invented this variational Bayes method
and I got to work with him
to implement variational Bayes
in this kind of
dynamic Bayesian network
and the results are very good
the tracking in the journal paper we published looks wonderful
you can actually synthesize
you can track all these formants in a very precise manner
and also some articulatory targets; it's very impressive, but once you do recognition
the results are not very good
and I'm going to tell you why, if we have time
and then, of course
one of the problems
so around 2006 we actually
we realized that this kind of learning is very tricky
essentially you approximate things and you don't know what you are approximating
that's one of the problems of the deep Bayesian approach; it's very tricky
but you can get some insights
if you work with all the experts in the [ ... ]
but at the end, the bottom line is
we really don't know how to interpret it
it's just that
you don't know how much you lose, right
so we actually built a simplified version that I spent a lot of time working on, and
that gives me this result
that's actually this paper
so this is about
about 2-3 percent better than the best
context-dependent HMM
I was happy at that time; we stopped there
but once we did this deep work
it was so much better than this
so in other words, the deep network
at least on the TIMIT task
does so much better than
than the dynamic Bayesian kind of work
and then we were happy about this
now of course I won't go through all of it
yes, so this is the history of dynamic models
with a whole bunch of things going on there
and the key is how to embed
such dynamic properties into the deep network framework
if you try to embed the properties of
the deep network into the
dynamic Bayesian network, it is not going to work, due to technical reasons
but the other way around has hope; that's one of the points
which I'm running out of time
I'm actually going to show you
first of all some of the lessons
so thesis's the deep belief network or Deep Neural Network
and this, I used the * here, to refer that to as Dynamic Bayesian Network
so one
so all these hidden dynamic models .. is the special case of the Bayesian network
you can see that, or otherwise I showed you earlier on
there were a few key differences that we learned
one is that for DBN
it's distributed implementation
so in our current system, for this system
in our HMM/GMM system
we have the concept that this particular model
is related to a
this particular model is related to e right
you have this concept right, and of course you need training to mix them together
but you still have the concept
whereas in this neural network.. no .. each weight
codes all class information
I think it's very powerful concept here
so you learn things and get distributed
it's like neural system right
you don't say this particular neuron contains visual information
it can also code audio information together
so this has better
neuron basis compared with conventional techniques
also... when we built the earlier models
we got one single thing wrong
at that time, we all said we want a parsimonious model representation
that's just wrong
5 years ago, 10 years ago, that may have been OK, right
but in the current age
you just use massive numbers of parameters, as long as you know how to learn them
and you know how to regularize them well
and it turns out that the deep network has a mechanism
to automatically regularize things well
that is not proven yet; I don't have a theory to prove it
but in our experience, you know, every time you stack up a layer
you can intuitively understand that
you don't overfit, right
because if you did overfit, you would have seen it many years ago
but if you keep going deeper, you don't overfit, because
whatever information gets applied to
the new parameters
actually sort of takes into account
the features from the lower parameters, so they don't act as independent lower
model parameters any more; so automatically you have a mechanism to handle this
but in the dynamic Bayesian network, you don't have that property
you need to stop; it doesn't have that property
so this is a very strong point
and another key difference
is something I talked about earlier
product versus mixture
a mixture sums up probability distributions
and a product takes the product of them
and when you take the product, you exponentially expand the power of the representation
so these are the key differences between these two types of model
Another important thing is that for this learning we combine generative and discriminative.
Although from the final results we got, we still think that discriminative is more important than
generative.
But at least in the initialization, we use the generative model, the DBN, to initialize
the whole system, and discriminative learning to adjust the parameters.
The generative models we built earlier were purely generative.
Finally, longer windows versus shorter windows.
In the earlier work, I am still not very happy about the long window.
Because every time you model dynamics, which I've talked about, with these methods
for building dynamics into the model, they all use a very short history, not
a long history.
Now, the history of research that focused on dynamics:
there were so many limitations, you had to use a short window; with a long window, nothing
worked; we've tried all of these.
So the deep recurrent network is something that many people are working on now.
In our lab, in the summer, almost all the projects relate to this. Maybe
not all, but at least a very large percentage.
It has worked well for both acoustic modeling and language modeling. I would say that
the recurrent network has been working well for acoustic modeling.
In language modeling, there are a lot of good projects on recurrent networks.
The weakness of this approach is that it only has a generic temporal dependency.
It has no idea of what the dependency is; there is no constraint, just one frame following another. That
kind of temporal modeling is not very strong.
The dynamics in the dynamic Bayesian network are much better.
In terms of interpretation, in terms of generative capability, in terms of the physical speech production
mechanism, it is just better. The key is how to combine them together.
We don't like this, and we have shown that all this does not capture the
essence of speech production dynamics.
There is a huge amount of information redundancy: think about having a long window here
where every time you shift by ten milliseconds, 90% of the information is overlapping.
And some people may argue that it doesn't matter, and they did experiments to show
that it doesn't help at all.
An important optimization technique here is the Hessian-free method.
I am not sure about language modeling, you may not need it there, but in
acoustic modeling this is a very popular technique.
Another point is that the recursive neural network for parsing in NLP has been
very successful.
I think last year at ICML they actually presented results with a recursive neural net,
which is not quite the same as this, but it used the structure for parsing,
and they actually got state-of-the-art results for parsing.
The conclusion of this slide is that it's an active and exciting research area to work
on.
So the summary is as follows. I have provided a historical account of two fairly separate research threads.
One is based on the DBN, the other on the dynamic Bayesian network in
speech.
I hope I have shown you that speech research motivates the use of deep architectures
from speech production and perception mechanisms.
And the HMM is a shallow architecture with the GMM linking linguistic units to observations.
Now I have shown you, although I didn't have time to talk about this in detail, that the
point is that this kind of model has had less success than was expected.
And now we are beginning to understand why there is a limitation here, and
actually I have shown some potential ways of overcoming that kind of limitation in the
neural network framework.
So one of the things we now understand is why this kind of model that has
been developed in the past has not been able to take advantage of the dynamics
the way the deep network can.
It's because we didn't have distributed representations, didn't have massive parameters, didn't have fast
parallel computing, and we didn't have the product of experts.
All these things are good for this side, while the dynamics are actually good for that side,
and how to merge them together is, I think, a very interesting direction to actually
work on.
You could actually make the deep network scientific in terms of speech perception
and recognition.
So the outlook, the future direction, is that so far we have the DBN-DNN to
replace the HMM-GMM.
I would expect that within three to five years, you may not be able to see the
GMM any more, especially in recognition.
At least in industry. If I am wrong, then shoot me.
The dynamic properties modeled by the dynamic Bayesian network for speech have the potential to
replace the HMM.
And then the deep recurrent neural networks, for which I have tried to argue that there
is a need to go beyond unconstrained temporal dependency while making them easier to learn.
Adaptive learning is so far not very successful yet; we tried a few projects, and it
is harder to do.
Scalable learning is hard, at least for industry; academics don't need to worry about
it.
As long as NIST defines small tasks, you will be very happy to
work on them. But for industry this is a big issue:
reinventing our infrastructure at the industrial scale. I don't think we have time to go through
all the applications.
Spoken language understanding has been one of the successful applications I've shown you.
Information retrieval, language modeling, NLP, image recognition too, but speaker recognition not yet.
The final bottom line here is that deep learning so far is weak in
theory; I hope I have convinced you of that with all the critiques.
In Bengio's case, he randomizes everything first, and if you do that, of course
it is bad.
So the key is that if you want to get the best out of it, I think to
me the generative model may be useful in that case. But the key to this learning
is that if you put a little bit of discrimination in here, it is probably better.
So probably the best is to use this structure here together with that, and we
know how to train that now. I think both width and depth are important.
We tried that; we didn't fix anything by hand, we just used an algorithm to cut out
the weights all the way. We didn't lose anything; in fact, from the results I showed you,
it still gains a little bit.
Cross validation?
No way, there is no theory on how to do that.
But in particular cases, for some of the networks that I've shown you, I have theory
for that; I can control it.
For some networks you can do the theory, which means you can automatically determine it from
data. But for this deep belief network, the theory is weak.
He is also doing deep graphical models.
Two years ago, he gave a talk on how to learn the topology of a deep
neural network, in terms of width and depth.
And he was using the Indian Buffet Process.
In the end, everything has to be done by Monte Carlo simulation, and for a five-by-five
network, he said the simulation takes several days.
I think that approach is not scalable, unless people improve that aspect.
That also motivates more academic research on machine learning to make it scale.
I think the idea is good, but the technique is too slow to do anything
with at this point.
For the deep neural network, stochastic gradient still does the best; it is good enough.
But my understanding is, and we are actually playing around with this, that when you add
recurrence or some more complex architecture, stochastic gradient isn't strong enough.
There is a very nice paper from Hinton's group, by one of his PhD students,
who actually used Hessian-free optimization to do the deep network learning.
They actually showed the result in just one single figure, which is very hard to interpret;
the paper is in ICML 2010. It does better than using the DBN
to initialize the neural network.
To me, that is very significant. We are still borrowing this: for more complex networks,
a more complex second-order method will probably be necessary.
And the other advantage of Hessian-free being second-order is that it can be
parallelized with big-batch training rather than minibatch training, and that makes a big difference.
We tried that; it doesn't work that well for the DBN, and we have a
lot of data. Probably the best for the DBN-type network is still stochastic gradient.
But if you are using the other networks, some of the later networks that we have talked about,
they are naturally suited for batch training.
In the more modern versions of the network, batch training is desirable; they are designed
for those architectures, and that is what enables parallelization.