First of all, I want to thank the organisers for giving me this opportunity to share with you some of my personal views on this very hot topic. So, I think the goal of this tutorial really is to help diversify the deep learning approach, just like the theme of this conference, Interspeech, is diversifying the language, okay.
So I have a long list of people here to thank. Especially Geoff Hinton; I worked with him for some period of time. And Dong Yu and a whole bunch of Microsoft colleagues who contributed a lot to the material I'm going to go through.
And also I would like to thank many of the colleagues sitting here who have had a lot of discussions with me. Their opinions also shaped some of the content that I am going to go through with you over the next hour.
Yeah, so the main message of this talk
is that deep learning is not the same as deep neural networks. I think in this community most people equate deep learning with deep neural networks.
Deep learning itself is something everybody here knows about. I counted close to 90 papers at this conference related to deep learning, or at least to that approach, and the number of papers has been increasing roughly exponentially over the last twelve years.
So a deep neural network is essentially a neural network that you unfold in space, forming a big network, or unfold over time, or both. If you unfold the neural network over time, you get a recurrent network, okay.
But there's another very big branch of deep learning, which I would call Deep Generative
Model.
Just like a neural network, you can also unfold it in space and in time.
If it's unfolded in time, you would call it a dynamic model.
Essentially the same concept. You unfold the network.
They are unfolded in the same direction in terms of time, but in terms of space they are unfolded in the opposite direction. I'm going to elaborate on this. For example, our most commonly used model, the Gaussian mixture model / hidden Markov model, is really this kind of generative model unfolded in time. But if you also make it deep by unfolding it in space, you get a deep generative model, which hasn't been very popular in our community.
I'm going to survey a whole bunch of work related to this area, informed by my discussions with many people here.
But anyway, the main message of this talk is a hope, and I think it is a promising direction that is already taking place in the machine learning community. I don't know how many of you actually went to the International Conference on Machine Learning (ICML) this year, just a couple of months ago in Beijing, but there is a huge amount of work on deep generative models and some very interesting developments, which I'd like to share with you at a high level. So although deep learning just started, in terms of application, in our speech community, and we should be very proud of that, in the machine learning community there is now a huge amount of work going on in deep generative models. I hope I can share some of these recent developments with you, to reinforce the message that a good combination of the two approaches, which have complementary strengths and weaknesses, can further advance deep learning in our community.
Okay, so now. These are very dense slides; I'm not going to go through all the details. I'm just going to highlight a few things in order to reinforce the message that generative models and neural network models can help each other. I'm just going to highlight a few key attributes of both approaches. They are very different approaches.
I'm going to highlight that very briefly. First of all, in terms of structure, they are both graphical in nature, as networks, okay. If you think about a deep generative model, some of these we call a dynamic Bayesian network: you actually have a joint probability between the label and the observation, which is not the case for a deep neural network, okay.
In the literature you see many other terms that relate to deep generative models, like probabilistic graphical models, stochastic neurons, or sometimes stochastic generative networks. They all belong to this category. So if your mindset is only over here, on the neural network side, even though you see some neural-sounding words describing these models, you won't be able to read all this literature; the mindsets are very different when you study these two.
So a key strength of deep generative models, and this is very important to me, is interpretability, okay.
Everybody that I talk to, including the students at lunchtime, complains about this. I ask: have you heard about deep neural networks? And everybody says yes, we have. To what extent have you started looking into them? And they say: we don't want to, because we cannot even interpret what's in the hidden layers, right.
And that's true, and it is actually quite deliberate. If you read the ?? science literature on connectionist models, the whole design is that the representation is meant to be distributed: each neuron can represent different concepts, and each concept can be represented by different neurons. So by its very design it's not meant to be interpretable, okay.
And that actually creates difficulty for many people. This model is just the opposite: it's very easy to interpret, because of the very nature of the generative story. You can tell what the process is, and then of course if you want to do classification or some other machine learning application, you simply use Bayes rule to invert the model. That's exactly what our community has been doing for thirty years with the hidden Markov model: you have the prior, you have the generative model, you multiply them, and you decode. Except at that time we didn't know how to make this type of model deep, and there are some pieces of work on that which I'm going to survey.
So that's one big part of the advantage of this model.
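To make that inversion concrete, here is a minimal sketch, not from the talk, of classifying with a generative model via Bayes rule; the classes, the one-dimensional feature, and all the numbers are invented purely for illustration.

```python
import numpy as np
from scipy.stats import norm

# Toy class-conditional generative models (1-D Gaussians) and class priors.
# All numbers are made up for illustration only.
priors = {"aa": 0.6, "iy": 0.4}
models = {"aa": norm(loc=700.0, scale=80.0),   # e.g. a formant-like feature
          "iy": norm(loc=300.0, scale=60.0)}

def classify(x):
    # Bayes rule: p(label | x) is proportional to p(x | label) * p(label).
    scores = {c: models[c].pdf(x) * priors[c] for c in priors}
    return max(scores, key=scores.get)

print(classify(350.0))   # -> 'iy'
print(classify(650.0))   # -> 'aa'
```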
Of course everybody knows what I just mentioned there. In a deep generative model the information flow is from the top down: the top simply means the label, or a higher-level concept, and the lower level simply means the observation you generate to fit it. Everybody knows that in a neural network the information flow is from the bottom up, okay: you feed in the data, you compute whatever output you want, and you go from there.
In this case the information comes from the top down: you generate the information, and then if you want to do classification or any other machine learning application, you can go Bayesian; Bayes rule is essential for this.
There's a whole list of other attributes that I don't have time to go through, but those are the highlights we have to mention.
So the main strength of deep neural networks, which actually gained popularity over the previous years, is really mainly due to these points. It's easier to do the computation; what I wrote here is that the compute is regular, okay. If you look at exactly what kind of compute is involved, it's just millions and millions of multiplications of a big matrix by a vector, done many times; it's very regular. Therefore the GPU is ideally suited for this kind of computation, and that's not the case for this model.
So if you compare these two, you will understand that if you can pull some of the advantages from this column into this model, and some from this column into that one, you get an integrated model. That's the message I'm going to convey, and I'm going to give you examples to show how this can be done.
Okay, so interpretability is very much related to how to incorporate domain knowledge and the constraints of the problem into the model. For a deep neural network that is very hard. People have tried; I have seen many people at this conference and also in ?? try very hard, and it's not very natural.
For this model it is very easy: you can encode your domain knowledge directly into the system. For example, for distorted or noisy speech, in the spectral domain or the waveform domain, the observation you get is the summation of the noise plus the clean speech. That's so simple you can just code it into one layer as a summation, or you can express it as a Bayesian probability very easily.
This is not that easy to do in a deep neural network; people have tried, and it's just not as easy. So being able to encode domain knowledge and the constraints of the problem into your deep learning system is a great advantage.
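Here is a minimal sketch, my own illustration rather than the talk's exact model, of encoding the "noisy observation equals clean speech plus noise" constraint directly as one generative layer, here written in the power-spectral domain under a simple Gaussian modelling-error assumption.

```python
import numpy as np

def noisy_loglik(y_pow, x_pow, n_pow, var=1.0):
    """log p(y | x, n) when the observed power spectrum is assumed to be
    (approximately) the sum of clean-speech and noise power, plus Gaussian
    modelling error.  y_pow, x_pow, n_pow: arrays of spectral power."""
    resid = y_pow - (x_pow + n_pow)          # the additivity constraint itself
    return -0.5 * np.sum(resid ** 2 / var + np.log(2 * np.pi * var))

# Usage: plug this one "layer" into a larger generative model of clean speech
# and noise and run inference; the domain constraint costs a single line.
y = np.array([4.0, 9.0, 1.0])
print(noisy_loglik(y, x_pow=np.array([3.0, 7.5, 0.5]),
                      n_pow=np.array([1.0, 1.5, 0.5])))
```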
So this is just a sample of things, you know. There's a very nice paper over here on acoustic phonetics, all this knowledge about speech production, and this kind of nonlinear phonology. And this is an example on noise robustness: if you put in the phase information of the speech and the noise, you can come up with a very nice conditional distribution. It's kind of complicated, but it can be put directly into a generative model, and this is an example of that. Whereas in a deep neural network it's very hard to do.
So the question is: do we want to throw away all this knowledge in deep learning? My answer is of course no, and most people will say no, okay. Some people outside the speech community, some people in machine learning I have talked to, would say yes. Anyway, since this is a speech conference I really want to emphasise that.
So the real, solid, reliable knowledge that we have attained from speech science, which has been reflected in talks here, such as yesterday's talk about how sound patterns have been shaped by ?? and perception, can really play a role in deep generative models. But it is very hard to exploit in a deep neural network.
So with this main message in mind
I'm going to go through three parts of the talk as I put them in
my abstract here.
So I need to go very briefly
through all these three topics.
Okay, so the first part gives a very brief history of how deep learning in speech recognition started.
So this is a very simple list; there are so many papers from before the rise of deep learning around 2009 and 2010. I hope I actually have a reasonable sample of the work here.
I don't have time to go through it all, but especially for those of you who were at the ?? open house: there was, I think in 1988, an ASRU, and at that time there was no U, it was just ASR. There were some very nice papers around then, and then the field was quickly superseded by the hidden Markov model approach.
So I'm not going to go through all of these, except to point out that neural networks were very popular for a while. But over the ten or so years before deep learning actually took over, the neural network approach essentially didn't make as strong an impact as the deep learning networks people have been seeing.
So I'll give you just one example to show how unpopular neural networks were at that time.
This was about 2006 or 2008, about nine years ago. This was the organization that I think is the predecessor of IARPA. They got several of us together and locked us up in a hotel near a Washington, D.C. airport somewhere. Essentially the goal was: well, speech recognition is stuck, so come over here and help us brainstorm the next generation of speech recognition and understanding technology. We spent about four or five days in that hotel, and at the end we wrote a very thick report, twenty-some pages.
There was some interesting discussion about history, and the premise was: if the government gives you unlimited resources and fifteen years, what could you do, right? So in our discussion, what most people focused on is essentially what you see here: margins here, Markov random fields here, conditional random fields here, and graphical models here. That was just a couple of years before deep learning actually came out, so at that time neural networks were just one of the tools around; they hadn't really made a big impact.
On the other hand, graphical models were actually mentioned here, and that is related to deep generative models.
So I'm going to show you a little bit; this is a slide about deep generative models, and I actually made a list over here. But anyway, let's go over here. I just want to highlight a couple of things related to the introduction of deep neural networks into our field.
Okay, so one of these: this is with John Bridle; we actually spent a summer together at ?? in 1998 or so, fifteen-some years ago. We spent a really interesting summer together.
That's the kind of model, a deep generative model, that we put together in two versions, and at the end we produced a very thick report, about eighty pages.
So this is a deep generative model, and it turns out that both of those versions had a neural network in the implementation. Think of the neural network as simply a function, a mapping: the mapping from the hidden representation, which is part of the deep generative model, to whatever observation you have, MFCC (everybody used MFCC at the time), was done by a neural network in both versions. And this is the statistical version, which we call the hidden dynamic model; it's one version of a deep generative model.
It didn't succeed. I'll show you the reason why; now we understand why.
Okay, so interestingly enough, if you read the report, whose details are all in there, it turns out that the learning in this workshop actually used back-propagation; Geoff told me the video of this workshop is still around somewhere, so you can find it. But the direction isn't from the top down: since the model is top-down, the propagation must go bottom-up.
Nowadays when we do speech recognition, the error function is a softmax cross-entropy, or sometimes you use mean squared error, and the error is measured with respect to the labels. This is the opposite: the error is measured in terms of how well the generative model matches the observation. And then when you want to learn, you do bottom-up learning, which actually turns out to be back-propagation as well. So back-propagation doesn't have to go from top to bottom; it can go bottom-up, depending on what kind of model you have. The key is that it is a gradient descent method.
So we actually got disappointing results on Switchboard, partly because we were a bit off our game. Now we understand why; we didn't at the time. I'm sure some of you have experienced this. I have done a lot of thinking about how deep learning and this kind of model can be integrated together.
So at the same time... okay, this is a fairly simple model, okay. You have this hidden representation, and it has specific constraints built into the model, which by the way is very hard to do in a bottom-up neural network. In a generative model you can put them in very easily. For example, the articulatory trajectory has to be smooth, and the specific form of that smoothness can be built in directly, simply by writing down the generative probabilities. Not so in a deep neural network.
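As an illustration, here is a minimal sketch, not the talk's exact model, of a smoothness constraint on an articulatory-like hidden trajectory written directly as generative probabilities: each phone label has a target vector, and the hidden state moves smoothly toward the current target with a little process noise. The labels, targets, and parameters are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

targets = {"b": np.array([0.2, -0.5]), "aa": np.array([1.0, 0.3])}
gamma, noise_std = 0.3, 0.05         # smoothness and process-noise parameters

def sample_trajectory(labels):
    z = np.zeros(2)
    traj = []
    for lab in labels:
        # z_t = z_{t-1} + gamma * (target - z_{t-1}) + noise: smooth by design.
        z = z + gamma * (targets[lab] - z) + rng.normal(0.0, noise_std, size=2)
        traj.append(z.copy())
    return np.array(traj)

print(sample_trajectory(["b", "b", "aa", "aa", "aa"]).round(2))
```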
At the same time, and this was also done in ??, we were able to put in nonlinear phonology, writing the phonemes in terms of their individual constituents at the top level; ?? also has a very nice paper, some fifteen years ago, talking about this. And robustness can be directly integrated into the articulatory model simply through the generative model; for a deep neural network that is very hard to do.
For example, and this slide is not meant to be read in detail, this is one of the conditional likelihoods that covers one of the links. Every time you have a link, you have a conditional dependency from parent to children, each with different neighbours, and you can specify them in terms of conditional distributions. Once you do that you have formed a model, and you can embed whatever knowledge you have, whatever you think is good, into the system. But the problem is that the learning is very hard, and that learning problem was only solved in the machine learning community within roughly the last year.
At that time we just didn't know; we were so naive. We didn't really understand all the limitations of the learning. So, just to show you: one of the things we did, I actually worked on with my colleague Hagai Attias. He was working not far away from me at that time, some ten years ago. He was one of the people who invented variational Bayes, which is very well known.
So the idea was as follows. You have to break the model up into modules, right. For each module you have this continuous dependence on the continuous hidden representation, and it turns out that the principled way to learn this is EM (expectation-maximization); specifically, variational EM.
So the idea is a bit crazy. You cannot solve the E-step rigorously, and that's well known; the network is loopy. So you just cut all the important dependencies and carry on, hoping that the M-step can make up for it. That's a rather crazy idea.
But that was the best available at the time. It turns out you get an auxiliary function, and what you form is still something very similar to the EM we know from HMMs. For a shallow model you don't have to approximate; you can get a rigorous solution.
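Here is a minimal sketch, my own toy illustration rather than the workshop's model, of the mean-field idea he is describing: cut the dependency between two latent variables in the E-step and update each factor in turn, hoping the rest makes up for it.

```python
import numpy as np

# Toy model: z1 ~ N(0,1), z2 ~ N(0,1), x | z1,z2 ~ N(z1 + z2, s2).
# The true posterior over (z1, z2) is correlated; mean field ignores that.
x, s2 = 1.5, 0.1                       # one observation, observation variance

m1 = m2 = 0.0                          # means of the factorized q(z1), q(z2)
v = 1.0 / (1.0 + 1.0 / s2)             # both factors end up with this variance
for _ in range(20):                    # coordinate-ascent "E-step"
    m1 = v * (x - m2) / s2             # update q(z1) with z2 frozen at its mean
    m2 = v * (x - m1) / s2             # update q(z2) with z1 frozen at its mean

print(m1, m2, v)   # factorized posterior; variance is underestimated
```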
But when the model is deep it's very hard; you have to make approximations. And those approximations are arguably just as much of a hack as the ones many people criticize in deep neural networks; deep generative models probably have more of them, even though they present themselves as being very rigorous. But if you really work on this, and I can pick out one example: with this approach we got surprisingly good inference results for the continuous variables.
In one version what we did was use formant-like variables as the hidden representation, and it turned out the model tracked the formants really precisely. As a byproduct of this work we created a database for formant tracking. But when we did inference on the linguistic units, which is the actual recognition problem, we didn't make nearly as much progress.
Anyway, I'm going to show you some of these preliminary results, to show how this is one path that led to the deep neural network.
So when we simplified the model in order to make decoding feasible, this is actually the ?? result, we carried out an analysis of the errors for different kinds of phones.
When we used this kind of generative model with deep structure, it actually corrected many errors related to short phones. And you can understand why: you designed the model to make that happen, and if everything is done reasonably well you actually get the result. We found that it not only corrected short vowels but also corrected a lot of consonants, because they interact with each other.
That's because, by the model design, whatever hidden trajectory you get is influenced, the part for the vowel is influenced, by the adjacent sounds. That is due to coarticulation. This can be very naturally built into the system, and one of the things I struggle with in deep neural networks is that you can't build in even this kind of information that easily, okay.
This is to convince you how the two can be bridged. It's very easy to interpret the results: we look at the errors and we see, wow, these come from quite a strong modelling assumption. For example, these are the same sounds, okay; you just speak fast and you get something like this. We looked at the errors and said: oh, that's exactly what happened. The mistake was made by the Gaussian mixture model because it doesn't take these particular dynamics into account, while this model corrected the error.
And I'm going to show you that with the deep neural network things are reversed; that's related to ??. But around the same time, in the machine learning community, outside of speech, a very interesting deep generative model was developed, called the Deep Belief Network.
Okay, so in the earlier literature, before about three or four years ago, DBN (Deep Belief Network) and DNN were mixed up with each other, even by the authors, simply because most people didn't understand what a DBN is. This very interesting paper, from 2006, is regarded by most people in machine learning as the start of deep learning. And it is a generative model; so you could prefer to say that deep generative models, rather than deep neural networks, actually started deep learning.
But this model has some intriguing properties that really attracted my attention at the time. They are totally not obvious, okay.
For those of you who know RBMs and DBNs: when you stack up this undirected model several times, you get a DBN. You might think the whole thing would be undirected, a bottom-up machine, but no: it's actually a directed model, generating top-down. You have to read the paper to understand why. When I first saw it I thought something was wrong; I couldn't understand what happened. On the other hand, it's much simpler than the model I showed you earlier, because there are no temporal dynamics in this one.
So the most intriguing aspect of the DBN, as described in this paper, is that inference is easy. Normally you think inference is hard; that's the tradition, given that if you have multiple dependencies at the top it's very hard to invert. But there is a special constraint built into this model, namely the restriction in the connections of the RBM, and because of that inference becomes easy; it's a special case.
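A minimal sketch of why the restricted connections make inference easy: given the layer below, the hidden units are conditionally independent, so the upward pass is a single sigmoid per layer with no iterations. The weights here are random placeholders, not a trained DBN.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

layers = [(rng.normal(0, 0.1, (50, 20)), np.zeros(20)),   # (W, bias), layer 1
          (rng.normal(0, 0.1, (20, 10)), np.zeros(10))]   # (W, bias), layer 2

def infer(v):
    h = v
    for W, b in layers:
        h = sigmoid(h @ W + b)    # factorial posterior, no iterations needed
    return h

print(infer(rng.random(50)).round(2))
```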
This is very intriguing, so I thought this idea might help the deep generative model I showed you earlier. So Geoff came to Redmond, and we discussed it. It took him a while to explain what this paper really does; most people at Microsoft at that time couldn't follow what was going on.
So now let's see how... Of course we tried to put together this deep generative model, the DBN, and the other deep generative model I told you about, which I had worked on for almost ten years at Microsoft; we were working very hard on this. And we came to the conclusion that we would need a few kluges to fix the problems, because the two don't match, okay. Why they don't match is a whole other story. The main reason is not just the treatment of time; it's the way you parameterize the models and the way they represent information, which is very different, despite the fact that they're both generative models.
It turned out that this model is very good for speech synthesis, and ?? has a very nice paper using this model to do synthesis. It's also very nice for image generation; I've seen it do that very nicely. But for continuous speech recognition it is very hard to use. For synthesis it's good, because if you take a segment with the whole context into account, like a syllable in Chinese, it works; for English it is not that easy.
Anyway, we needed a few kluges to merge these two models together, and that is sort of what led to the DNN.
So the first kluge: the temporal dependency is very hard. If you have temporal dependency in the hidden representation, you automatically get loops, and everybody in machine learning at that time knew, and most speech people knew, that the model I showed you earlier just didn't work out well; people well versed in machine learning said there was no way to learn it. So, cut the dependency. That's the way to do it: cut the dependency in the hidden representation, and lose all that power of the deep generative model. And that was Geoff Hinton's point: it doesn't matter, just use a big window of frames. That kluge actually is one of the things that helped to solve the problem.
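A minimal sketch of that first kluge: drop the temporal dependency from the model and instead feed the network a fixed window of neighbouring frames stacked into one input vector. The frame dimensions and window size below are just examples.

```python
import numpy as np

def stack_context(frames, left=5, right=5):
    """frames: (T, D) array of acoustic frames -> (T, (left+1+right)*D)."""
    T, D = frames.shape
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(left + 1 + right)])

x = np.random.default_rng(0).random((100, 40))   # e.g. 100 frames of 40-d fbank
print(stack_context(x).shape)                     # -> (100, 440): 11-frame input
```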
The second kluge is that you can reverse the direction, because inference in the generative model is very hard to do, as I showed earlier. If you reverse the direction from top-down to bottom-up, then you don't have to solve that inference problem at all, and what you get is just a deep neural network, okay. Of course everybody said: we don't know how to train those; that was in 2009. And then Geoff said: that's where the DBN can help. He had done a fair amount of work on using the DBN to initialize that approach.
So this was a very well-timed academic-industrial collaboration. First of all, the speech recognition industry had been searching for new solutions, because the principled deep generative models could not deliver, okay; everybody was very frustrated about this at the time. At the same time, academia developed the deep learning tools: DBN, DNN, all the hybrid stuff that was going on. Also, the CUDA library was released around that time; this was probably one of the earliest fields to catch on to GPU computing power. And of course big training data for ASR had been around, and with the Gaussian mixture model HMM, with a lot of data the performance saturates, right. And then this is one of the things that in the end is really powerful: you can increase the size and the depth and put a lot into it to make it really powerful. That's the scalability advantage I showed you early on, which is not the case for any shallow model.
Okay, so in 2009, three of my colleagues and I, not knowing quite what would happen, got together to organize this workshop, to show that this is a useful thing and to bring people together. It wasn't popular at all. I remember Geoff Hinton and I got together to decide whom we should invite to give talks at this workshop. One invitee, who shall remain nameless here, said: give me one week to think about it, and at the end he said: it's not worth my time to fly to Vancouver. That's one of them.
The second invitee, and I remember this clearly, said: this is a crazy idea. In the e-mail he said: what you are doing is not clear enough for us. We had said that the waveform may be useful for ASR, and the e-mail said: oh, why? We said it's just like using pixels for image recognition, which was popular; for example, convolutional networks work on pixels. We take a similar approach, except it is the waveform. And the answer was: no, no, no, that's not the same as pixels; it's more like using photons. Making a kind of joke, essentially. That one didn't show up either. But anyway, the workshop had a lot of brainstorming, including all the error analysis I showed you early on. It was a really good workshop, about five years ago now.
So now I move to part two, to discuss achievements. In my original plan I had a whole bunch of slides on vision. The message for vision is that if you go to the vision community, deep learning there is maybe thirty times more popular than deep learning in speech.
The first time they got the results, no one believed it was real. At the time I was giving a lecture at Microsoft about deep learning, actually Bishop was doing the lecture together with me, and this deep learning result had just come out; Geoff Hinton sent me an e-mail: look at the margin, how much bigger it is. I showed them, and people said: I don't believe it, maybe it's a special case. And it turned out to be just as good, even better than in speech. I actually cut all those slides out; maybe some other time I will show you.
So that is a big area on its own. Today I am going to focus on speech.
So one of the things we found during that time, a very interesting discovery, came from using both the model I showed you there and the deep neural network, and analyzing the error patterns very carefully. TIMIT is very good for that: you can disable the language model, right, and then you can understand the acoustic errors very effectively. I tried to do the same afterwards on other tasks, and it's very hard; once you put the language model in, you just can't do that kind of analysis. So it was very good that we did this analysis at the time.
For the error-pattern comparison, I don't have time to go through it, except to mention this: the DNN made many new errors on short, undershot vowels. It sort of undid what the generative model was designed to do. We thought about why that would happen, and of course: we had a very big input window, so if the sound is very short, the information is captured over here, and your input of eleven or fifteen frames captures what is effectively noise coming from the neighbouring phones, so of course errors are made over here. So we can understand why.
Then we asked why the generative model corrects these errors. It's because you deliberately make the hidden representation reflect what the sound pattern looks like in the hidden space. That's nice when you can see it; but if the hidden variables are articulatory, how do you see them? So sometimes we used formants to illustrate what's going on there.
Another important discovery at Microsoft was that using the spectrogram we produced much better auto-encoding results for speech analysis, and that was very surprising at the time. It really conforms to the basic deep learning theme that the rawer features are better than the processed features. So let me show you; this is actually a project we did together in 2009. We used a deep auto-encoder to do binary coding of the spectrogram. I don't have time to go through it; you can read about auto-encoders in the literature.
The key is that you use a target that is the same as the input, with a small number of bits in the middle, and you want to see whether that bottleneck can still represent the signal. The way to evaluate it is to look at what kind of errors you get. What we did was use a vector quantizer with 312 bits as the baseline. The reconstruction looks like this: this is the original, this is the shallow model, right. Now, using the deep auto-encoder, we get much closer to the original; we simply have a much lower coding error using an identical number of bits.
Both ?? you condense more
information in terms of reconstructing the original signal.
And then we actually found that
for spectrogram this result is the best.
Now for MFCC we still get some gain, but gain is not nearly as much,
sort of indirectly
convinces me. There's Geoff Hinton's
original activities ?? everybody's
to spectogram.
So maybe we should have do the waveform, probably not anyway.
Okay, so of course the next step came once we were all convinced by the error analysis that deep learning can correct a lot of errors, not all, but some, and we understand why: it's the power and capacity it has. On average it does a little bit better, based on this analysis. But if you look at the error patterns you really can see that it has a lot of power and it also has some shortcomings. Both approaches have pros and cons, but their errors are very different, and that gives you the hint that this is worthwhile to pursue.
Of course this was all very interesting evidence. Then, to scale up to industrial scale, we had to do a lot of things, and many of my colleagues were working with me on this. First of all, we needed to extend the output layer from a small number of phones or states to a very large number, and at that time this was motivated by how to preserve Microsoft's huge investment in its speech decoder software. If you don't do this, if you use some other kind of output coding, you would also have to ?? to do it, and nobody fully believed yet that this was going to work; if you need to change the decoder, we just had to say, wait a little bit.
At the same time we found that, for large tasks, the context-dependent model gives much higher accuracy than the context-independent model, okay. For small tasks we didn't find it so much better; I think that is related to a capacity saturation problem. But with a lot of training data on large tasks, you are keen to use a very large output layer, and that turns out to have a double benefit: one, you increase accuracy, and two, you don't have to change anything about the decoder. And industry loves that.
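A minimal sketch of why the large senone output layer keeps the old decoder: the DNN gives p(senone | frame), and dividing by the senone prior turns it into a scaled likelihood that drops into the existing HMM decoder unchanged. The sizes and priors below are placeholders.

```python
import numpy as np

def scaled_loglik(dnn_posteriors, senone_priors, eps=1e-10):
    """dnn_posteriors: (T, S) softmax outputs; senone_priors: (S,) estimated
    from training alignment counts.  Returns (T, S) scaled log-likelihoods."""
    return np.log(dnn_posteriors + eps) - np.log(senone_priors + eps)

T, S = 3, 9000                                   # e.g. ~9k tied triphone states
post = np.random.default_rng(0).dirichlet(np.ones(S), size=T)
priors = np.full(S, 1.0 / S)
print(scaled_loglik(post, priors).shape)         # feed this to the HMM decoder
```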
You get both, and I think that is a large part of why it actually took off.
So then we summarized what enabled this type of model. Industrial knowledge about how to construct the very large set of output units for the DNN is very important, and that essentially comes from the context-dependent modelling everybody here has used with Gaussian mixture models, which has been around for almost twenty-some years. It also depends on industrial knowledge of how to make the decoding with such a huge output highly efficient using our conventional HMM decoding technology, and of course on how to make things practical.
Another very important enabling factor: if the GPU hadn't come along and become popular at roughly that time, all these experiments would have taken months to do without that infrastructure, and people might not have had the patience to wait for the results and push things forward.
So let me show you a very brief summary of the major results obtained in the early days.
If we use three hours of training data, this is TIMIT for example, we get the numbers I showed you: not much more than ?? percent relative gain. If you increase the data by ten times, to thirty-some hours, you get around a twenty percent relative error reduction. If you do more, for Switchboard, this is the paper my colleagues published here, you get another ten times more data, two orders of magnitude more in total, and the relative gain actually increases: roughly ten percent, twenty percent, thirty percent. Of course if you increase the size of the training data, the baseline improves as well, but the relative gain gets even bigger.
And when people look at these results, nobody in their right mind would say not to use this. And then of course a lot of companies implemented it; the DNN is fairly easy for everybody to implement. I missed one point over there: it turns out that if you use a large amount of data, the original idea of using the DBN to regularize the model doesn't help anymore, and in the beginning we didn't understand how that happened.
But anyway, now let me come back to the main theme of the talk: how generative models and deep neural networks may help each other.
So kluge one was to use a fixed window. At that time we had to keep it; now, at this conference, we see ?? using LSTM recurrent networks, and that fixes this problem. So this problem is fixed automatically.
At that time we thought we needed to use the DBN. Now, with big data, there is no need anymore. That's very well understood now; actually there are many ways to understand it. You can think of it from a regularization viewpoint, and yesterday at the table with students I mentioned that, and people said: what is regularization? You can also understand it from an optimization viewpoint: if you stare at the back-propagation formula for ten minutes you figure out why. I actually have a slide on that; it's very easy to understand from many perspectives. With lots of data you really don't need it, so that kluge is automatically fixed: by industrialization, we tried lots of data and it's fixed.
Now, this kluge is not fixed yet, and it is the main topic for my next twenty minutes.
Before I do that, let me try to summarize some of the major advances. My colleagues and I wrote this book, and in this chapter we grouped the major advances in deep neural networks into several categories, so I'm going to go through them quickly.
One category is optimization innovations. I think the most important advance over the early successes I showed you was the development of sequence discriminative training, which contributed roughly an additional ten percent relative error reduction. Many groups have done this. For us at Microsoft, this was our first intern coming to work on it. We tried it on TIMIT; we didn't know all the subtleties of the importance of regularization, we got all the formulas right, everything right, and the result wasn't very good. But Interspeech accepted our paper, and since then we have understood this much better; later on more and more papers appeared, a lot of them published at Interspeech. That's very good.
Okay, the next theme is 'Towards Raw Input', okay. What I showed you early on was the speech coding and analysis part, where we know the raw input is good; we don't need MFCC anymore. So it's goodbye MFCC: it will probably disappear from our community slowly over the next few years. We would also like to say goodbye to the Fourier transform, but I put a question mark there. At this Interspeech, I think two days ago, Herman ?? had a very nice paper on this, and I encourage everybody to take a look.
You just put the raw waveform in there, which was actually done about three years ago by Geoff Hinton's students; they truly believed in it. I had tried something like that around 2004, in the hidden Markov model era, and we understood all kinds of problems about how to normalize the raw input, so I said it's crazy. When they published the result at ICASSP, I looked at the numbers and the error rate was terrible; there was so much error, so nobody paid attention. This year the attention came back to this, and the result is now almost as good as using the Fourier transform. So far we don't want to throw it away yet, but maybe next year people will.
One nice thing, I was very curious about this: to get that result they just initialized everything randomly, rather than using the Fourier transform to initialize it, and that's very intriguing.
There are too many references to list; I ran out of room. Yesterday when I went through the adaptation sessions there were so many good papers around; I just don't have the patience to list them all anymore. So do go back to the adaptation papers; there are a lot of new advances.
Another important topic is transfer learning, which plays a very important role in multilingual acoustic modelling. There was a tutorial on that, actually Tanja was giving it at a workshop I was attending.
I should also mention that for generative models, for the shallow models we had before, multilingual modelling of course improved things, but it almost never actually beat the baseline; think about cross-lingual settings, for example. With multilingual and cross-lingual deep learning, it actually beats the baseline. There is a whole bunch of papers in this area which I won't have time to go through here.
Another important area of innovation is nonlinearities and regularization. For regularization there is dropout; if you don't know dropout, it's good to know about. It's a special technique: essentially you randomly kill units, and you get a better result.
In terms of hidden units, the rectified linear unit is now very popular, and there are many interesting theoretical analyses of why it is better than the logistic unit. In my experience, actually, when I programmed this, going from the logistic unit to the ReLU changed our lives; the learning speed really increases, and we understand now why that happens. In terms of accuracy, different groups report different results: some groups report reduced error rates, and nobody has reported increased error rates so far. In any case it speeds up convergence dramatically.
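A minimal sketch of the two hidden-unit types being compared, illustrating one common explanation for the faster convergence: the logistic unit saturates, so its gradient vanishes for large inputs, while the ReLU's gradient does not.

```python
import numpy as np

a = np.linspace(-6, 6, 7)
logistic = 1.0 / (1.0 + np.exp(-a))
relu = np.maximum(0.0, a)
print(np.round(logistic * (1 - logistic), 3))   # logistic gradient: vanishes
print((a > 0).astype(float))                    # ReLU gradient: 0 or 1
```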
So now I'm going to show you another architecture, which will link to the generative model. This is a model called the Deep Stacking Network. By its very design it's a deep neural network, okay: information flows bottom-up. The difference between this model and a conventional deep neural network is that for every layer you concatenate the original input with the previous layer's output and then do some special processing. In particular, you can alternate linear and nonlinear layers; if you do that, you can dramatically increase the speed of convergence in learning. There is also some theoretical analysis, which is in one of the books I wrote: you can convert much of this non-convex back-propagation problem into something closely related to convex optimization, so you can understand its properties. We did that a few years ago and wrote a paper on it.
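Here is a minimal sketch, my own simplified version, of one stacking step in such a network: each module sees the raw input concatenated with the previous module's prediction, and its linear output weights have a closed-form least-squares solution, which is the convex part just mentioned. Sizes and data are toy placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def dsn_module(X, Y, W_hidden):
    H = sigmoid(X @ W_hidden)                        # nonlinear hidden layer
    U = np.linalg.lstsq(H, Y, rcond=None)[0]         # convex: least squares
    return H @ U                                     # this module's prediction

X = rng.random((200, 30)); Y = rng.random((200, 5))  # toy inputs and targets
pred = dsn_module(X, Y, rng.normal(0, 0.3, (30, 50)))
X2 = np.hstack([X, pred])                            # stack for the next module
pred2 = dsn_module(X2, Y, rng.normal(0, 0.3, (35, 50)))
```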
This idea can also be used for a related network, which I don't have time to go through here. The reason I bring it up is that it's related to some recent work I have seen on generative models, where the two are essentially conversions of each other, so let me compare the two to give you an example of how the two kinds of network can help each other.
When we developed this deep stacking network, the activation function had to be fixed, either logistic or ReLU, which both work reasonably well compared with each other.
Now look at this architecture: almost identical. But now the activation function is changed to something quite unusual; I don't expect you to know anything about this. This is work done by the Mitsubishi people; there is a very nice paper over here in the technical program. I spent a lot of time talking to them, and they even came to Microsoft, so I listened to some of their talks and their demo.
The model is called deep unfolding, and its activation function is derived from the inference method of a generative model; it is not fixed like the ones I showed you earlier. So this model looks like a deep neural network, right? But it begins from a generative model, specifically, I hope many of you know it, non-negative matrix factorization, which is a shallow generative model.
It makes a very simple assumption: that the observed noisy speech, or the mixed speakers' speech, is the sum of two sources in the spectral domain. Once they make that assumption, they of course have to enforce that each vector is non-negative, because these are magnitudes of spectra. Inference is an iterative technique, and the model automatically embeds the domain knowledge of how the observation is obtained, through the mixing of the two sources.
Then this work essentially says: take that iterative inference technique and treat every single iteration as a different layer. After that, they do back-propagation training. The backward pass is possible because the problem is very simple: the application here is speech enhancement, so the objective function is a mean squared error, very easy. The generative model gives you the generated observation, your target output is the clean speech, you compute the mean squared error, and you adapt all the parameters this way.
The results are very impressive. So this shows that you can design a deep neural network where, by using this type of activation function, you automatically build in the constraints that you used in the generative model. That's a very good example of the message I put at the beginning, the hope for deep generative models. This is a shallow generative model, so it is relatively easy to do; for a deep generative model it's much harder.
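Here is a minimal sketch of the unfolding idea, my own simplified illustration rather than the Mitsubishi system: take the iterative inference of a shallow generative model (NMF with speech and noise dictionaries, assuming noisy magnitude is the sum of a speech part and a noise part) and treat every iteration as one layer. The dictionaries below are random placeholders; in the real work they are learned and then fine-tuned by back-propagation through these layers.

```python
import numpy as np

rng = np.random.default_rng(0)
F, Ks, Kn = 129, 20, 20                       # freq bins, speech/noise atoms
W = np.abs(rng.normal(size=(F, Ks + Kn)))     # nonnegative dictionary [Ws|Wn]

def unfolded_enhance(y, n_layers=10, eps=1e-9):
    h = np.ones(Ks + Kn)                      # nonnegative activations
    for _ in range(n_layers):                 # each iteration = one "layer"
        h *= (W.T @ (y / (W @ h + eps))) / (W.T.sum(axis=1) + eps)
    speech = W[:, :Ks] @ h[:Ks]               # generative estimate of speech
    return speech * y / (W @ h + eps)         # simple Wiener-like masking

y = np.abs(rng.normal(size=F))                # a noisy magnitude spectrum
print(unfolded_enhance(y).shape)
```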
One of the reasons I chose this as a topic today is that at the ICML conference just three months ago in Beijing, there was a very nice development in learning methods for deep generative models. They actually link the neural network and the Bayes net together through a transformation. There is a whole bunch of papers, including from Michael Jordan and a lot of very well known people in machine learning working on deep generative models. The main point of this set of work, and I just want to use one simple sentence to summarize it, is this: in the E-step I showed you early on, you originally had to factorize the posterior in order to get the E-step done, and that was an approximation. A very nice ?? was developed, but the ?? is so large that it's practically useless for inferring the top-layer discrete events.
Now the whole point is that we can relax that factorization constraint. Before, say three years ago, if you kept the rigorous dependency you didn't get any reasonable analytical solution, so you could not do EM. The new idea is that you can approximate that dependency in the E-step, not through factorization, which is the mean-field approximation, but with a deep neural network.
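A minimal sketch of that idea, my illustration with placeholder weights: replace the factorized E-step with a "recognition" network that maps the observation x directly to the parameters of an approximate posterior q(z | x), and draw samples with the reparameterization trick so everything can be trained by stochastic gradients on the variational bound.

```python
import numpy as np

rng = np.random.default_rng(0)
W_mu, W_logvar = rng.normal(0, 0.1, (40, 8)), rng.normal(0, 0.1, (40, 8))

def approximate_posterior(x):
    """One linear 'recognition network' layer: x -> (mean, variance) of q(z|x).
    In practice this is a deep network; a larger net gives a tighter fit."""
    return x @ W_mu, np.exp(x @ W_logvar)

def sample_z(mu, var):
    # Reparameterization: z = mu + sqrt(var) * eps keeps gradients flowing.
    return mu + np.sqrt(var) * rng.standard_normal(mu.shape)

x = rng.random(40)                      # one observation
mu, var = approximate_posterior(x)
z = sample_z(mu, var)                   # a posterior sample for the E-step
print(mu.shape, z.shape)
```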
So this is an example of a deep neural network actually helping you solve the deep generative model problem. This is work by Max Welling, a very good friend of mine in machine learning, and he told me about it. They developed a theorem showing that if the network is large enough, the approximation error can approach zero, and therefore the looseness of variational learning can essentially be eliminated. That's a powerful engine, and it gives me evidence that this is a promising approach. The machine learning community develops the tools; our speech community develops the applications and the methodology as well; and if we cross-connect with each other we are going to make much more progress. This type of development really points in a promising direction, towards the main message I put out at the beginning.
Okay, so now I'm going to show you some further results. Another, better architecture that we now have is the recurrent network; if you read this Beaufays paper on LSTMs, look at those results: for voice search the error rate dropped to about ten percent, which is very impressive. Another type of architecture integrates convolutional and non-convolutional layers together, as in the previous result, and the authors report even better results there. And these are the state-of-the-art numbers for the Switchboard (SWBD) task.
So now I'm going to concentrate on this type of
recurrent network here.
Okay, so this comes down to one of my main messages. We fixed this kluge with the recurrent network. We also fixed this kluge automatically, just by using big data. Now, how do we fix this kluge?
First I'll show you some analysis of the recurrent network versus the deep generative model, the hidden dynamic model I showed you early on, okay. So far this analysis hasn't been applied to LSTM; some further analysis might actually show LSTM-like units arising automatically from this kind of model.
This analysis is very preliminary. If you stare at the equations for the recurrent network, essentially you have a state equation, and it's recursive, okay, from the previous hidden state to the current one, and then you have an output that produces the label. Now look at this deep generative model, the hidden dynamic model: the equations are essentially identical, okay? So what's the difference? The difference is that the input is now the label. You cannot drive the dynamics with a symbolic label directly, so you have to make some connection between the labels and the continuous variables, which is what phoneticians call the phonology-to-phonetics interface, okay. We used a very basic assumption: the interface is simply that each label corresponds to a target vector; in the actual implementation it is a distribution, so you can account for speaker differences, et cetera. The output of this recursion gives you the observation, and that is a recurrent, filter-type model, an engineering model, whereas the other is a neural network model, okay. Every time I taught ??, I lectured on this, so we fully understood all the constraints for this type of model.
Now, this model looks the same, right? If you reverse the direction, you convert one model into the other.
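A minimal side-by-side sketch of the two recursions being compared; all parameter matrices are random placeholders and the dimensions are invented. The equations are nearly identical; what differs is the direction: the RNN maps data to labels, the hidden dynamic model maps labels (via per-phone targets) to data.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.tanh
Wh, Wx, Wy = (rng.normal(0, 0.3, (8, 8)), rng.normal(0, 0.3, (40, 8)),
              rng.normal(0, 0.3, (8, 5)))
A, B, C = (rng.normal(0, 0.3, (8, 8)), rng.normal(0, 0.3, (5, 8)),
           rng.normal(0, 0.3, (8, 40)))

def rnn_step(h, x):                      # discriminative, bottom-up
    h = f(h @ Wh + x @ Wx)               # h_t = f(W h_{t-1} + U x_t)
    return h, h @ Wy                     # y_t: scores over the labels

def hdm_step(z, label_onehot):           # generative, top-down
    z = f(z @ A + label_onehot @ B)      # z_t = f(A z_{t-1} + B target(label_t))
    return z, z @ C                      # x_t: predicted observation (e.g. MFCC)

h, y = rnn_step(np.zeros(8), rng.random(40))
z, x_hat = hdm_step(np.zeros(8), np.eye(5)[2])
```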
And for this model it's very easy to put in constraints. For example, the matrix that governs the internal dynamics in the hidden domain can be made sparse, and you can put realistic constraints there: in our earlier implementation we imposed critically damped dynamics, so you can guarantee the trajectory doesn't oscillate within the phone boundaries. This is the speech production mechanism, and you can encode it simply by fixing the sparse matrix; one of the slides I was going to show is all about this. In the other model you cannot do that; everything has to be learned as unstructured parameters. There's just no way to say that you want the dynamics to behave in a certain way; you don't have any mechanism to design the structure, whereas over here it is very natural, designed by the physical properties. Now, because of this correspondence, and because we can now do deep inference, if all this machine learning technology gets fully developed, we can very naturally bridge the two models together.
It turns out that if you do a more rigorous analysis, making the inference of this model fancier, our hope is that this kind of multiplicative gating unit would emerge automatically from this type of model; that has not been shown yet.
Of course this is just a very high-level comparison of the two; there are a lot of detailed comparisons you can make in order to bridge them. My colleague Dong Yu and I wrote this book, which is coming out very soon, and in one of the chapters we put all these comparisons: interpretability, parameterization, methods of learning, nature of the representation, and all the other differences. It gives you a chance to understand how the deep generative model, in terms of its dynamics, and the recurrent network, in terms of its recurrence, can be matched with each other; you can read about it there.
So I have three to five more minutes; I will go very quickly. Every time I give this talk I run out of time here.
The key concept is called embedding. You can find this basic idea in the literature of the eighties and nineties; for example, in this special issue of Artificial Intelligence there are very nice papers, and I had the chance to read them all; very insightful, and some of the chapters are very good. The idea is that each physical or linguistic entity, a word, a phrase, even a whole paragraph or article, can be embedded into a continuous-space vector. It could be a big vector, you know. Just so you know, there is a whole special issue on this topic, and that's why it's an important concept.
The second important concept, which is much more advanced, is described in a few books over here; I really enjoyed reading some of them, and I invite those authors to come visit me; we have a lot to discuss. You can actually embed a structure, a symbolic structure, into a vector in such a way that you can recover the structure completely through vector operations; the concept is called tensor-product representation. If only I had three hours I could go through all of this; for now I'm going to elaborate on embedding for the next two minutes.
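A minimal toy sketch of a tensor-product representation, my own example: bind role vectors and filler vectors with outer products, superpose them into one object, and recover each filler exactly by multiplying with the dual (pseudo-inverse) of the role matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
roles = rng.normal(size=(2, d))              # role vectors: subject, object
fillers = {"subject": rng.normal(size=d), "object": rng.normal(size=d)}

# Bind and superpose: T = sum_i filler_i (outer product) role_i
T = (np.outer(fillers["subject"], roles[0]) +
     np.outer(fillers["object"], roles[1]))

# Unbind: use the dual roles (columns of the pseudo-inverse of the role matrix).
dual = np.linalg.pinv(roles)                 # shape (d, 2)
recovered_subject = T @ dual[:, 0]
print(np.allclose(recovered_subject, fillers["subject"]))   # -> True
```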
So this is the recurrent neural network language model, a very nice, fairly informative paper showing that embeddings can be obtained as a byproduct of the recurrent neural network; that paper was published at Interspeech several years ago.
Then I'll talk very quickly about semantic embedding at MSR. The difference between this set of work and the previous work is that the previous work is completely unsupervised, and in a company, if supervision is available, you should grab it, right. So we took the initiative to make some very smart exploitation of supervision signals that come at virtually no cost.
The idea here is that we have a model where each branch is a deep neural network, and the different branches are linked together through a cosine distance, so the similarity can be measured between vectors in a vector space. Then we do MMI-style learning. So if you have "hot dog" in this one, and your document is talking about fast food or something, even if there is no word in common, you pick it up, because the supervision links them together. And if you have "dog racing" here, they share a word but will end up very far apart from each other. And that can all be done automatically.
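A minimal sketch of the two-branch semantic model just described; the weights are random placeholders and the input representation (bag-of-features counts) and sizes are assumptions for illustration. Each branch maps text to a vector, the branches are tied by cosine similarity, and training pushes related pairs (e.g. a query and a clicked document) closer together.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(0, 0.1, (3000, 300)), rng.normal(0, 0.1, (300, 128))

def embed(bow):                              # one branch of the network
    h = np.tanh(bow @ W1)                    # (real models use several layers)
    return np.tanh(h @ W2)                   # 128-d semantic vector

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query, doc = rng.random(3000), rng.random(3000)   # e.g. letter-trigram counts
print(cosine(embed(query), embed(doc)))           # similarity used in training
```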
Some people told me that topic models can do similar things, so we compared with topic models, and it turned out that this deep semantic model does much, much better.
Now, multi-modal; just one more slide. It turns out that not only text can be embedded this way: images can be embedded, speech can be embedded, and you can do something very similar to what I showed you earlier. There was a paper in yesterday's session about embedding; it's very nice, a very similar concept. I looked at it and said, wow, it's just like the model we built for text, although the application is very different. I don't have time to go through it here; I encourage you to read some of the papers over here. Let's skip this.
This is just to show you some applications of this semantic model. You can do many things with it: we applied it to web search quite successfully; for machine translation you take one entity to be one language and the other to be the other language; in the list of published papers you can find the details. You can also do summarization and entity ranking.
Let's skip this. This is the final slide, the real final slide. I don't have separate summary slides; this is my summary. I copied the main message here, elaborated a bit more, after going through the whole hour of presentation.
In terms of applications, we have seen speech recognition: the green is the neural network side, the red is the deep generative model side. I said a few words about the deep generative, hidden dynamic model on the generative side, and LSTM is on the other side. For speech enhancement I showed you these types of models; on the generative side I showed you this one, a shallow generative model that can give rise to a deep structure corresponding to the deep stacking network I showed you early on.
Now, for algorithms, on the deep neural network side we have back-propagation, the single, unchallenged algorithm.
For deep generative models there are two algorithms, and they are both called BP. One is belief propagation, for those of you who know machine learning. The other is BP as in back-propagation, the same as over here; that one only came up within the last two years, due to this new advance of putting a deep neural network into the inference step of this type of model. So I call them BP and BP.
In terms of neuroscience, you could call this one wake and the other one sleep: in the sleep phase you generate things, you get hallucination, and when you're awake you have perception; you take information in. I think that's all I want to say. Thank you very much.
Okay. Anyone? One or two quick questions?
Very interesting talk. I don't want to argue with your main point, which is very interesting, but just very briefly about one of your side messages, the one about waveforms. You know, in the ?? paper they weren't really putting in raw waveforms: they take the waveform, take the absolute value, floor it, take a logarithm, average over time; you have to do a lot of things. Secondly, there has been a modest amount of work in the last few years on doing this sort of thing, and pretty generally people do it with matched training and test conditions. If you have mismatched conditions, good luck with the waveform. I always hate to say something is impossible, but good luck.
Thank you very much. ?? good for everything. And I looked at the presentation, it was very nice; thank you.
Any other quick questions?
If not I invite Haizhou
to give a plaque.