0:00:17 | good morning everybody |
---|
0:00:21 | I'm very happy to see you all this morning |
---|
0:00:38 | Professor Li Deng, who is giving the keynote this morning |
---|
0:00:43 | it's not so easy to introduce him, because |
---|
0:00:47 | he is very well known in the community |
---|
0:00:49 | he is a fellow of several societies, such as |
---|
0:00:53 | ISCA, IEEE, and the Acoustical Society of America |
---|
0:00:57 | he has published several hundred papers over the last years |
---|
0:01:05 | and given many talks |
---|
0:01:08 | Li Deng did his PhD at the University of Wisconsin |
---|
0:01:13 | He started his career at the University of Waterloo |
---|
0:01:31 | He will talk to us today about two very important topics |
---|
0:01:37 | very important to all of us |
---|
0:01:39 | one is how to move beyond the GMM |
---|
0:01:44 | it's not so bad because I started my career with the GMM |
---|
0:01:49 | I need some new ideas to do |
---|
0:01:53 | something else |
---|
0:01:54 | the second topic will deal with the dynamics of speech |
---|
0:01:59 | we all know that dynamics are very important |
---|
0:02:13 | we will not take more time before his talk; I prefer to listen to him |
---|
0:02:20 | thank you, Li |
---|
0:02:27 | thank you, and thanks to the organizers and to Haizhou |
---|
0:02:31 | for inviting me to come here to give this talk |
---|
0:02:34 | it is the first time I've attended Odyssey |
---|
0:02:37 | I've read a lot of the things that the community has been doing |
---|
0:02:41 | As Jean has introduced |
---|
0:02:45 | now I think not only in speech recognition but also in speaker recognition |
---|
0:02:51 | there are a few fundamental tools so far |
---|
0:02:56 | one is GMM, one is MFCC |
---|
0:03:01 | in common |
---|
0:03:03 | over the last year, I've learned a lot of other things from this community |
---|
0:03:07 | it turns out that the main point of this talk is to say |
---|
0:03:11 | both of these components have the potential to be replaced, with much better results |
---|
0:03:18 | I will touch a little bit on MFCC; I don't like MFCC |
---|
0:03:23 | so I think Hynek hates MFCC also |
---|
0:03:25 | only recently, since we started doing deep learning |
---|
0:03:29 | there is evidence to show that these components may be replaced, certainly in speech recognition; people |
---|
0:03:36 | have seen that it is coming |
---|
0:03:39 | hopefully, after this talk, you may think about whether in speaker recognition, these components can |
---|
0:03:45 | be replaced |
---|
0:03:46 | to get better performance |
---|
0:03:49 | the outline has three parts |
---|
0:03:54 | In the first part, I will give a quick tutorial |
---|
0:03:59 | I have given several hours of tutorial material |
---|
0:04:01 | over the last few months, so it is a little challenging to compress it down |
---|
0:04:07 | to this short tutorial |
---|
0:04:11 | rather than talking about all the technical details |
---|
0:04:14 | I've decided to just tell the story |
---|
0:04:26 | I also notice that in the next session after this talk |
---|
0:04:30 | there are a few papers related to this |
---|
0:04:35 | Restricted Boltzmann Machines, Deep Belief Network |
---|
0:04:39 | Deep Neural Network in connection with HMM |
---|
0:04:46 | at the end of this talk, you may be convinced that this may be replaced |
---|
0:04:49 | as well |
---|
0:04:49 | something we can consider in the future, with much better speech recognition performance than what |
---|
0:04:56 | we have |
---|
0:05:00 | and also Deep Convex Network, Deep Stacking Network |
---|
0:05:16 | so I think over the last 20 years, people have been working on segment models, hidden |
---|
0:05:22 | dynamic models |
---|
0:05:22 | and 12 years ago, I even had |
---|
0:05:25 | a project with Johns Hopkins University working on this |
---|
0:05:29 | and the results were not very promising |
---|
0:05:35 | now we are beginning to understand why the great idea we proposed there |
---|
0:05:39 | did not work well at that time |
---|
0:05:41 | It is only after we did this that we realized how we can put them together |
---|
0:05:45 | and that is the final part |
---|
0:05:51 | the first part |
---|
0:05:56 | how many people here have ever attended one of my tutorials over the last year? |
---|
0:06:01 | OK, it's a small number of people |
---|
0:06:09 | this one you have to know: deep learning, sometimes called hierarchical learning in the |
---|
0:06:14 | literature |
---|
0:06:14 | essentially refers to a class of machine learning techniques |
---|
0:06:18 | largely developed since 2006 |
---|
0:06:21 | by ... you know actually, this is the key paper |
---|
0:06:26 | that actually introduced a fast learning algorithm for what is called the Deep Belief Network |
---|
0:06:36 | in the beginning, this was mainly done on image recognition, information retrieval and other applications |
---|
0:06:43 | and we, actually Microsoft, was the first to collaborate with University of Toronto |
---|
0:06:51 | researchers to bring that to speech recognition |
---|
0:06:54 | and we showed very quickly that not only for small vocabulary |
---|
0:06:58 | does it do very well, but for large vocabulary it does even better |
---|
0:07:02 | this really happens |
---|
0:07:03 | you know in the past, for small-vocabulary recognition, it worked well; for larger ones sometimes it |
---|
0:07:07 | failed |
---|
0:07:09 | but here, the bigger the task we have, the better the success we have; I will try |
---|
0:07:14 | to analyze |
---|
0:07:14 | for you why that happens |
---|
0:07:17 | and the Boltzmann machine; we will hear about Boltzmann machines in the following talks; I think Patrick |
---|
0:07:22 | has two papers on that |
---|
0:07:24 | and Restricted Boltzmann machine |
---|
0:07:27 | and this is a little bit confusing, so if you read the literature |
---|
0:07:31 | very often deep neural network and deep belief network |
---|
0:07:36 | which are defined over here, are totally different concepts |
---|
0:07:40 | one is a component of another |
---|
0:07:44 | just for the sake of convenience, the authors often get confused |
---|
0:07:49 | they call the deep neural network a DBN |
---|
0:07:52 | and DBN also refers to the Dynamic Bayes network |
---|
0:07:55 | even more confusing |
---|
0:07:57 | one of the things is that |
---|
0:07:59 | for people who attended my tutorial, I gave a quiz |
---|
0:08:06 | people know all this |
---|
0:08:18 | last week, we got a paper accepted for publication, the one I wrote together with |
---|
0:08:24 | Geoffrey Hinton, with 10 authors all together |
---|
0:08:27 | working in this area |
---|
0:08:29 | we try to clarify all this, so we have unified the terminology |
---|
0:08:31 | when you read the literature, you know how to map one to another |
---|
0:08:38 | and the deep auto-encoder, which I don't have time to go into here; and I will say |
---|
0:08:42 | something about some new developments |
---|
0:08:43 | to me it is more interesting because of some limitations of the others |
---|
0:08:50 | This is a hot topic; here I list all the recent workshops and special issues |
---|
0:08:59 | and actually, in Interspeech 2012 |
---|
0:09:03 | you see tens of papers in this area, most in speech recognition |
---|
0:09:07 | and actually, one of the areas has 2 full sessions for |
---|
0:09:16 | this topic, just for recognition |
---|
0:09:19 | and in some others we have more, and a special issue |
---|
0:09:26 | in PAMI, it's mainly related to machine learning aspects and also computer vision applications |
---|
0:09:33 | I tried to put a few speech papers there as well. |
---|
0:09:36 | and a DARPA program |
---|
0:09:40 | 2009, I think last year they stopped |
---|
0:09:47 | and I think in December, there is another workshop related to this topic, it is |
---|
0:09:54 | very popular |
---|
0:09:55 | I think because people see the good results coming, and I hope that |
---|
0:10:00 | one message of this talk is to convince you that this is a good technology, so |
---|
0:10:06 | you may want to seriously consider adopting some of its essence |
---|
0:10:10 | let me tell some stories about this |
---|
0:10:14 | so the first time, this is the first time |
---|
0:10:17 | when deep learning showed promise in speech recognition |
---|
0:10:20 | and activity has grown rapidly since then; that was around |
---|
0:10:24 | two and half years ago |
---|
0:10:28 | or three and half years ago, whatever |
---|
0:10:31 | in NIPS, NIPS is a machine learning workshop |
---|
0:10:35 | every year |
---|
0:10:46 | so I think one year before that |
---|
0:10:49 | so actually, I talked with Geoffrey Hinton |
---|
0:10:52 | a professor at Toronto; he showed me |
---|
0:10:56 | he showed me the Science paper; he actually had a poster there |
---|
0:11:00 | the paper was well written and the diagram was really promising |
---|
0:11:05 | in terms of information retrieval, for document retrieval |
---|
0:11:08 | so I looked at this, and after that we started talking about |
---|
0:11:12 | maybe we can work on speech |
---|
0:11:15 | he worked on speech long time ago |
---|
0:11:24 | so we decided to have this workshop, and actually we had worked together before |
---|
0:11:30 | my colleague Dong Yu, myself and Geoffrey, we actually decided to have |
---|
0:11:37 | a proposal accepted which presented the whole of deep learning and the preliminary work |
---|
0:11:43 | and at that time most people did TIMIT, a small experiment |
---|
0:11:46 | and it turned out that this workshop generated a lot of excitement |
---|
0:11:53 | so I gave a tutorial, 90 minutes |
---|
0:11:58 | about 45 minutes of tutorial each: I talked about speech, and Geoffrey talked about deep |
---|
0:12:02 | learning at that time, and we decided |
---|
0:12:05 | to get people interested in this |
---|
0:12:07 | so the custom is as follows, for NIPS |
---|
0:12:12 | at the end of the final day of the workshop |
---|
0:12:18 | each organizer presents a summary of the workshop |
---|
0:12:24 | and the instruction is that it is a short presentation; it should be funny, |
---|
0:12:30 | it should not be too serious |
---|
0:12:32 | every organizer is instructed to prepare a few slides to summarize |
---|
0:12:41 | their workshop in a way that conveys their impression to the people attending the workshop |
---|
0:12:47 | this is the slide we prepared |
---|
0:13:05 | a speechless summary presentation of the workshop on speech |
---|
0:13:10 | because we didn't really want to talk too much, just go up there and show |
---|
0:13:15 | that slide |
---|
0:13:16 | no speech there, just animations |
---|
0:13:20 | so we said that, we met in this year |
---|
0:13:24 | so this is supposed to be industrial people |
---|
0:13:31 | and this is supposed to be academic people |
---|
0:13:33 | so they are smart and deeper |
---|
0:13:37 | and they say, can you understand human speech? |
---|
0:13:41 | and they say that they can recognize phonemes |
---|
0:13:47 | and they say, that's a nice first step, and what else do you want? |
---|
0:13:52 | and they said they want to recognize speech in noisy environments and then |
---|
0:13:58 | and then he said maybe we can work together |
---|
0:14:01 | so we have all the concepts together |
---|
0:14:14 | that's the whole presentation |
---|
0:14:24 | we decided to do small vocabulary first |
---|
0:14:31 | and then quickly we moved, I think in December of 2010, |
---|
0:14:36 | to very large vocabulary |
---|
0:14:38 | to our surprise, the bigger the vocabulary you have, the better the success you get |
---|
0:14:43 | very unusual |
---|
0:14:44 | and I myself analyzed the errors in detail |
---|
0:14:47 | you know we have been working on this for some 20 years |
---|
0:14:54 | one surprise to me, which convinced me to work in this area personally, |
---|
0:15:02 | was that every error pattern that I see from this recognizer is very different from |
---|
0:15:08 | the HMM |
---|
0:15:08 | absolutely, it is better, and the errors are very different; that means it is good for |
---|
0:15:13 | me to do that |
---|
0:15:14 | anyway, let me talk about the DBN |
---|
0:15:20 | one of the concepts is the deep belief network; that is the one Hinton published in |
---|
0:15:28 | 2006 |
---|
0:15:28 | 2 papers there |
---|
0:15:34 | nothing to do with speech, it's called the deep belief network. It's pretty hard to read |
---|
0:15:38 | if you have not been in the field for a while |
---|
0:15:40 | and this is another DBN, called the dynamic Bayesian network |
---|
0:15:44 | a few months ago, Geoffrey sent me an email saying, look at this |
---|
0:15:50 | acronym DBN, DBN |
---|
0:15:59 | he suggests that before you give any talk you check |
---|
0:16:03 | mostly, in speech recognition, people use DBN to mean Dynamic Bayes Network |
---|
0:16:09 | anyway, I will give a little bit of technical content on it; time is running out quickly |
---|
0:16:17 | number one, the first concept is the restricted Boltzmann machine |
---|
0:16:21 | actually, I have 20 slides, so I just take one slide out of these 20 |
---|
0:16:26 | so think about this as the visible layer |
---|
0:16:30 | it can be the label; the label can be one of the visible units |
---|
0:16:33 | when we do discriminative learning; the other thing is the observation, think of it as the observation, and the other thing |
---|
0:16:38 | forget about this |
---|
0:16:39 | think about MFCC, think about the label |
---|
0:16:43 | or speech labels, senones or other labels |
---|
0:16:47 | so we put them together as the observation and we have a hidden layer here |
---|
0:16:51 | and then the difference between the Boltzmann machine and the neural network is that |
---|
0:16:57 | the standard neural network is one-directional, from the bottom up |
---|
0:17:00 | the Boltzmann machine is bidirectional, you can go up and down; now connections between |
---|
0:17:04 | neighboring units within this layer and within that layer are cut off |
---|
0:17:08 | if you don't do that, it is very hard to learn |
---|
0:17:11 | so one of the reasons that in deep learning they start with the restricted Boltzmann machine |
---|
0:17:16 | is that |
---|
0:17:17 | if you have bidirectional connections |
---|
0:17:21 | and if you do all the detailed math, write the energy function, you can write down |
---|
0:17:25 | the conditional probabilities of the hidden units given the visible ones, and the other way around. |
---|
0:17:29 | so if you set up the energy right, you can actually get the conditional probability of this |
---|
0:17:34 | given this to be Gaussian |
---|
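To make the equations being described concrete, here is a minimal worked form of a Gaussian-Bernoulli RBM (my own notation, not the slide's): the energy function and the two conditionals that follow from it, the second of which is the Gaussian just mentioned.

```latex
% Gaussian-Bernoulli RBM: real-valued visible units v, binary hidden units h (sketch, my notation)
E(\mathbf{v},\mathbf{h}) = \sum_i \frac{(v_i-b_i)^2}{2\sigma_i^2}
  - \sum_j c_j h_j - \sum_{i,j}\frac{v_i}{\sigma_i}\,W_{ij}\,h_j ,
\qquad
p(h_j=1\mid\mathbf{v}) = \sigma\!\Big(c_j + \sum_i W_{ij}\,\frac{v_i}{\sigma_i}\Big),
\qquad
p(v_i\mid\mathbf{h}) = \mathcal{N}\!\Big(v_i;\; b_i + \sigma_i\sum_j W_{ij}h_j,\; \sigma_i^2\Big)
```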
0:17:35 | which is something people like; this is conditional, and you can interpret the whole thing |
---|
0:17:41 | as a Gaussian mixture model |
---|
0:17:42 | so you may think that this is just a Gaussian mixture model, so I can do it |
---|
0:17:47 | each other |
---|
0:17:48 | the difference is that with this you can get an almost exponentially large number of mixture components |
---|
0:17:55 | rather than finite |
---|
0:17:56 | I think in speaker recognition, it's about 400 or 1000 mixtures, whatever |
---|
0:18:06 | and here if you have 100 hidden units |
---|
0:18:11 | you get an almost unlimited number of components |
---|
0:18:13 | but they are tied together |
---|
0:18:15 | Geoffrey has done very detailed mathematics to show that this is a very powerful way of |
---|
0:18:22 | doing Gaussian modeling |
---|
0:18:23 | actually, you get a product of experts rather than a mixture of experts |
---|
0:18:33 | to me that is one of the key insights that we got from him |
---|
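Written out, the point about exponentially many tied components (again in my own notation): marginalizing over H binary hidden units gives

```latex
p(\mathbf{v}) = \sum_{\mathbf{h}\in\{0,1\}^{H}} p(\mathbf{h})\,
  \mathcal{N}\!\big(\mathbf{v};\; \mathbf{b} + \boldsymbol{\sigma}\odot(W\mathbf{h}),\; \mathrm{diag}(\boldsymbol{\sigma}^2)\big)
```

a mixture with 2^H components whose means all share the same W, so 100 hidden units implicitly index on the order of 2^100 tied components, versus the few hundred or thousand free components of an ordinary GMM; and because each hidden unit contributes a multiplicative factor to the unnormalized density, the model behaves as a product of experts rather than a sum.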
0:18:37 | that is RBM, so think about this as RBM |
---|
0:18:40 | think about this as visible |
---|
0:18:44 | this is the observation and this is the hidden layer; we put them together and we have it |
---|
0:18:47 | it is very hard to do speech recognition on it |
---|
0:18:52 | this is a generative model, you can do speech recognition, but if you do that, |
---|
0:18:57 | the result is not very good |
---|
0:18:59 | when dealing with discrimination tasks with a generative model you are limited because |
---|
0:19:07 | you don't directly focus on what you want |
---|
0:19:11 | however, you can use it as a building block |
---|
0:19:16 | to build a DBN (deep belief network) |
---|
0:19:18 | the way we do it actually in Toronto |
---|
0:19:24 | if we think about this as a building block |
---|
0:19:28 | you can do learning; after you do the learning of this, I just skip |
---|
0:19:33 | it would take a whole hour to talk about that learning, but assume that you know |
---|
0:19:35 | how to do that |
---|
0:19:36 | after you learn this, you can treat this as feature extraction from what you get |
---|
0:19:40 | here |
---|
0:19:40 | and you treat it as stacking up |
---|
0:19:43 | deep learning researchers argue that this becomes the feature of that |
---|
0:19:52 | and then you can go further; I think of it as a brain-like architecture |
---|
0:19:56 | think of the visual cortex, 6 layers |
---|
0:19:59 | you can build up; whatever is learned over here becomes the hidden feature |
---|
0:20:03 | hopefully, if you learn that right you can extract the important information from data that |
---|
0:20:08 | you have |
---|
0:20:08 | and then you can use features of the features and keep stacking up |
---|
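Here is a minimal sketch, in Python with numpy, of the layer-by-layer stacking just described: train one RBM (with the CD-1 learning the speaker skips over), then treat its hidden activations as features and train the next RBM on them. The binary units, layer sizes and learning rate are illustrative assumptions; a real speech front end would use Gaussian visible units over real-valued features.

```python
# Greedy layer-wise stacking of RBMs (sketch): each RBM's hidden activations become
# the "data" for the next RBM above it.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.01, batch=64):
    """Train a binary-binary RBM with one step of contrastive divergence (CD-1)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch):
            v0 = data[i:i + batch]
            h0 = sigmoid(v0 @ W + b_h)                       # positive phase
            h_sample = (rng.random(h0.shape) < h0).astype(float)
            v1 = sigmoid(h_sample @ W.T + b_v)               # one Gibbs step (reconstruction)
            h1 = sigmoid(v1 @ W + b_h)
            W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)      # CD-1 update: data term minus reconstruction term
            b_v += lr * (v0 - v1).mean(axis=0)
            b_h += lr * (h0 - h1).mean(axis=0)
    return W, b_h

def stack_rbms(data, layer_sizes):
    """Greedy stacking: hidden activations of one RBM are the input of the next ("features of features")."""
    weights, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b_h = train_rbm(layer_input.copy(), n_hidden)
        weights.append((W, b_h))
        layer_input = sigmoid(layer_input @ W + b_h)
    return weights

# toy usage: 500 random binary vectors, three stacked hidden layers
toy = (rng.random((500, 100)) < 0.3).astype(float)
pretrained = stack_rbms(toy, [80, 60, 40])
```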
0:20:12 | why are we stacking up? actually there are interesting theoretical results |
---|
0:20:16 | that actually show that if you unroll this single DBN |
---|
0:20:20 | sorry, one layer of RBM |
---|
0:20:23 | in terms of a belief network, it is actually equivalent to one of infinite depth |
---|
0:20:28 | because every time, this is related to learning |
---|
0:20:33 | learning actually goes up and down, and every time you go up and down, it |
---|
0:20:37 | can be shown that |
---|
0:20:39 | you actually get one layer higher, now the restriction here is that |
---|
0:20:46 | all the weights have to be tied, so it is not very powerful |
---|
0:20:50 | but now we can untie the weights by doing separate learning |
---|
0:20:54 | when we do that, it is a very powerful model |
---|
0:20:55 | anyway, so the reason why this one goes down and this one goes up and |
---|
0:21:00 | down is that if you |
---|
0:21:02 | actually, I don't have time to go into it here, but believe me |
---|
0:21:05 | so if you stack up this one, one layer up |
---|
0:21:10 | and then you can mathematically show that this is equivalent to having |
---|
0:21:15 | just a one-layer RBM at the top and then a belief network going down |
---|
0:21:20 | and this is actually called a Bayes network |
---|
0:21:23 | so you can see that the belief network is similar to a Bayes network |
---|
0:21:26 | but now if you look at this, it is very difficult to learn |
---|
0:21:30 | so for anything going down over here, there is something in machine learning called the explaining |
---|
0:21:36 | away effect |
---|
0:21:37 | so the inference becomes very hard, generation is easy |
---|
0:21:41 | and then the next invention in this whole theory is that |
---|
0:21:47 | just reverse the order |
---|
0:21:51 | and you can turn it into a neural network; it turns out that there is not much theory |
---|
0:21:56 | for why that works well |
---|
0:21:59 | but in practice it works really well |
---|
0:22:00 | actually, I am looking into some of the theory of this |
---|
0:22:04 | so this is the full picture of the DBN |
---|
0:22:08 | so the DBN consists of bidirectional connections here |
---|
0:22:11 | and then single-direction connections going down |
---|
0:22:13 | so if you do this, you actually can use that as a generative model that you |
---|
0:22:17 | can do recognition with |
---|
0:22:18 | unfortunately, the result is not good |
---|
0:22:21 | there were a lot of steps for people to reach the current state |
---|
0:22:25 | I am going to show you all the steps here |
---|
0:22:27 | so number one, the RBM is useful, it gives you feature extraction |
---|
0:22:31 | and you stack up RBMs a few layers up |
---|
0:22:34 | and you can get a DBN; actually at the end you need to do some discriminative |
---|
0:22:39 | learning at the end |
---|
0:22:40 | uh, so let's see, but generally, the capacity is just very good |
---|
0:22:46 | this was the first time I saw |
---|
0:22:48 | the generative capability from Geoffrey, I was also amazed |
---|
0:22:53 | so this is the example that he gave me |
---|
0:22:59 | so if you train, using these digits |
---|
0:23:05 | the database is called MNIST |
---|
0:23:12 | an image database everybody uses, like our TIDIGITS in speech |
---|
0:23:19 | you put them here and you learn it |
---|
0:23:21 | you know according to this standard technique |
---|
0:23:24 | you now actually put one of the digits here; say you want to synthesize a 1 |
---|
0:23:29 | you put a 1 here and all the others are 0, and then you run it |
---|
0:23:35 | and you can actually get something really nice, also if you put a 0 here |
---|
0:23:37 | this is different from the traditional generative process |
---|
0:23:42 | the reason why they are different is the stochastic process |
---|
0:23:46 | it can memorize |
---|
0:23:50 | some of the numbers are corrupted |
---|
0:23:53 | but most of the time you get realistic ones |
---|
0:23:54 | last time, in one of the tutorials I gave |
---|
0:23:58 | I gave the tutorial and showed this result; there were a number of speech synthesis people in |
---|
0:24:03 | the audience |
---|
0:24:04 | they said, that is great, I will do speech synthesis now |
---|
0:24:07 | but you get one fixed output, not a human-like one |
---|
0:24:10 | humans, when they write, produce different writing every time |
---|
0:24:14 | immediately, they went back to write a draft proposal, and asked me to help them |
---|
0:24:22 | this is very good; with the stochastic components there, the result looks like what humans are doing |
---|
0:24:29 | now, we want to use it for recognition; this is the architecture |
---|
0:24:39 | I was amazed; I had a lot of discussion with Patrick yesterday |
---|
0:24:42 | I just feel that when you have a generative model you really need to |
---|
0:24:54 | you put the image here, and move up here, and this becomes the feature |
---|
0:24:58 | and all you do is turn on this unit, one by one |
---|
0:25:00 | and run for a long time until convergence |
---|
0:25:04 | and you look at the probability for this |
---|
0:25:05 | to get the number, OK |
---|
0:25:06 | and turn on the other units, and run and run, and see which number is high |
---|
0:25:13 | I suggest that you don't do that and waste your time |
---|
0:25:16 | number one, it takes a long time to do recognition; number two, we don't know how |
---|
0:25:21 | to generalize it to sequences |
---|
0:25:23 | and he said the result is not very good, so we did not do it |
---|
0:25:27 | we abandoned the concept of doing everything generatively; that's what we did. |
---|
0:25:36 | and that's how the deep neural network was born |
---|
0:25:39 | so all you do is that you just treat all the connections to be |
---|
0:25:47 | that is why at the end my conclusion is that the theory of deep learning is |
---|
0:25:51 | very weak |
---|
0:25:52 | ideally the DBN goes down; it is a generative model |
---|
0:25:57 | in practice, you say it is not good, so just forget about this, think about |
---|
0:26:01 | and eliminate this, and make the whole set of weights go up |
---|
0:26:05 | we modify this; the easiest way to do it is to just forget about this, you know |
---|
0:26:09 | just change it to make it go up, make this go down again; people don't like it |
---|
0:26:14 | in the beginning, I supposed it was horrible, that it was crazy to do it |
---|
0:26:19 | it just breaks the theory used to build the DBN |
---|
0:26:22 | finally, what gives the best result, what we do, is really the same as |
---|
0:26:28 | what the multilayer perceptron has been doing except that it |
---|
0:26:33 | has many more layers |
---|
0:26:35 | and now if you do that, typically you randomize |
---|
0:26:40 | you know, all the weights; then you hear the standard arguments from |
---|
0:26:44 | 20 some years ago saying that |
---|
0:26:46 | mathematics proves that the deeper you go, the more |
---|
0:26:51 | the lower the level you go to, because the label is at the top level |
---|
0:26:54 | so you do back-propagation, taking the derivative of the error from here going down to here |
---|
0:26:59 | the gradient is very small |
---|
0:27:02 | you know, the derivative of the sigmoid function is sigmoid times (1 - sigmoid) |
---|
0:27:05 | so the lower you go, the greater the chance that the gradient term vanishes |
---|
0:27:14 | they didn't even do back-propagation for deep networks; so looking at it, it seemed impossible to |
---|
0:27:20 | learn, so they gave up |
---|
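A compact form of the back-propagation argument just made (standard algebra, not from the slides): with sigmoid hidden layers the error signal at layer l obeys

```latex
\boldsymbol{\delta}^{(l)} = \mathrm{diag}\!\big(\sigma'(\mathbf{z}^{(l)})\big)\,
  \big(W^{(l+1)}\big)^{\!\top}\boldsymbol{\delta}^{(l+1)},
\qquad
\sigma'(z) = \sigma(z)\big(1-\sigma(z)\big) \le \tfrac{1}{4}
```

so going down many layers multiplies the signal by a factor of at most 1/4 per layer (times the weight norms), which is why training very deep networks from random initialization looked hopeless.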
0:27:21 | and then now, one of the very interesting things that comes out of deep learning is |
---|
0:27:27 | to say |
---|
0:27:28 | rather than using random numbers, it can be interesting to use the DBN to plug in there; that is |
---|
0:27:32 | something I don't like |
---|
0:27:33 | look at the argument for why it is good; what we do is that we train |
---|
0:27:38 | this DBN |
---|
0:27:38 | over here |
---|
0:27:41 | the DBN weights: you just use the generative model for the training |
---|
0:27:46 | and once you have trained it, you fix these weights, and you just copy all the weights into |
---|
0:27:52 | this deep neural network to initialize it |
---|
0:27:54 | after that you do back-propagation |
---|
0:27:58 | again, these gradients are very small, but it's OK |
---|
0:28:02 | you have already got the DBN over here |
---|
0:28:03 | you have got RBMs; it should be RBM, not DBN anymore |
---|
0:28:09 | it is not too bad, |
---|
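A minimal numpy sketch of the recipe just described: copy generatively pre-trained weights into a deep MLP and then run supervised back-propagation. The "pretrained" matrices below are random stand-ins for what the RBM stacking sketched earlier would produce, and the sizes, learning rate and single update step are illustrative assumptions.

```python
# Initialize a DNN from (stand-in) pre-trained hidden-layer weights, then fine-tune with backprop.
import numpy as np

rng = np.random.default_rng(0)
sizes = [429, 2048, 2048, 100]   # stacked-MFCC-like input, two hidden layers, senone-like output

# stand-ins for generatively pre-trained hidden-layer weights (429x2048 and 2048x2048)
pretrained = [0.01 * rng.standard_normal((m, n)) for m, n in zip(sizes[:-2], sizes[1:-1])]

# the DNN: hidden layers copied from "pre-training", output layer random
W = pretrained + [0.01 * rng.standard_normal((sizes[-2], sizes[-1]))]
b = [np.zeros(n) for n in sizes[1:]]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def finetune_step(x, y_onehot, lr=0.1):
    """One step of cross-entropy back-propagation: the 'fine-tuning' after copying the weights."""
    acts = [x]
    for Wl, bl in zip(W[:-1], b[:-1]):
        acts.append(sigmoid(acts[-1] @ Wl + bl))      # forward through the copied hidden layers
    probs = softmax(acts[-1] @ W[-1] + b[-1])
    delta = (probs - y_onehot) / len(x)               # gradient at the softmax output
    for l in range(len(W) - 1, -1, -1):
        grad_W, grad_b = acts[l].T @ delta, delta.sum(axis=0)
        if l > 0:                                     # propagate down before updating this layer
            delta = (delta @ W[l].T) * acts[l] * (1 - acts[l])
        W[l] -= lr * grad_W
        b[l] -= lr * grad_b

# toy usage: one mini-batch of random "frames" with random senone-like labels
x = rng.standard_normal((32, sizes[0]))
y = np.eye(sizes[-1])[rng.integers(0, sizes[-1], 32)]
finetune_step(x, y)
```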
0:28:15 | so you see exactly how to train this; it is just that using random initialization is |
---|
0:28:25 | not good |
---|
0:28:27 | if you use the DBN's weights over here it is not too bad, but over here, you |
---|
0:28:32 | modify them |
---|
0:28:32 | you just run recognition, for MNIST |
---|
0:28:38 | the error goes down to 1.2%; that is all Geoffrey Hinton's idea |
---|
0:28:47 | and he published a paper about this; at that time, it seemed to be |
---|
0:28:51 | very good |
---|
0:28:52 | but I am going to tell you that, for that MNIST result of 1.2% error, with a few |
---|
0:28:58 | more generations of networks, as I will show you, we are able to get 0.7% |
---|
0:29:05 | and the same kind of philosophy carries over to speech recognition |
---|
0:29:12 | I will go quickly, in speech all of you think about how to do sequence |
---|
0:29:17 | modeling |
---|
0:29:18 | it is very simple |
---|
0:29:21 | now we have deep neural network |
---|
0:29:24 | what we do is we normalize that using a softmax |
---|
0:29:28 | to make that, similar to the talk yesterday, a kind of calibration |
---|
0:29:35 | and we get posterior probabilities, and dividing by the prior you get generative probabilities, and you just |
---|
0:29:40 | use the HMM to do that |
---|
0:29:42 | that is why it is called the DNN-HMM |
---|
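The conversion just described is the standard "hybrid" formula; writing it out (notation mine):

```latex
P(s\mid\mathbf{x}_t) = \frac{\exp(z_{t,s})}{\sum_{s'}\exp(z_{t,s'})}
\;\;\text{(DNN softmax over states)},
\qquad
p(\mathbf{x}_t\mid s) \;\propto\; \frac{P(s\mid\mathbf{x}_t)}{P(s)}
```

where P(s) is the state (senone) prior estimated from the training alignment; these scaled likelihoods simply replace the GMM scores inside the existing HMM decoder.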
0:29:49 | the first experiment we did was on TIMIT |
---|
0:29:53 | with just phonemes, easy |
---|
0:29:55 | each state is one of the three states of a phoneme; very good results, I can show |
---|
0:29:59 | you |
---|
0:30:00 | then we moved to large vocabulary; one of the things that we do in our company |
---|
0:30:05 | you know, at Microsoft we call them senones |
---|
0:30:14 | rather than having a phone, we cut it into context-dependent units |
---|
0:30:18 | that becomes our infrastructure |
---|
0:30:20 | so we don't change all this |
---|
0:30:22 | rather than using 40 phones, what happens if we use 9000? |
---|
0:30:25 | you know, the senones; a long time ago people could not do that, 9000 outputs here, crazy |
---|
0:30:30 | 300, 5000, every time you have 15 million weights here, it is very hard to |
---|
0:30:37 | train |
---|
0:30:37 | now we bought very big machines |
---|
0:30:39 | a GPU machine, parallel computing |
---|
0:30:45 | so we replace this by ... it can be very large |
---|
0:30:52 | this is very large, and input is also very large as well |
---|
0:31:01 | so we use a big window |
---|
0:31:03 | we have a big output, big input, very deep, so there are 3 components |
---|
0:31:09 | why a big input, a long window? |
---|
0:31:11 | which could not be done with the HMM |
---|
0:31:13 | do you know why? because |
---|
0:31:15 | I had a discussion with some experts; it could not be done for speaker recognition with the |
---|
0:31:22 | UBM |
---|
0:31:22 | for speech recognition, the reason why it couldn't be done is because |
---|
0:31:26 | first of all you have to use diagonal covariances in the HMM |
---|
0:31:32 | but it's not big; if you make it too big, the Gaussian has a sparseness problem |
---|
0:31:37 | in the covariance matrix |
---|
0:31:39 | in the end, all we do is make it as simple as possible, just plug in the whole |
---|
0:31:43 | long window |
---|
0:31:44 | and then feed the whole thing in; we get millions of parameters |
---|
0:31:48 | typically, this number is around 2000 |
---|
0:31:50 | 2000 here, every layer, 4 million parameters here, another 4 million, another 4 million |
---|
0:31:55 | and just use a GPU to train the whole model together |
---|
0:31:57 | it is not too bad |
---|
0:31:59 | so if we use about 11 frames |
---|
0:32:04 | now, it is even extended to 30 frames |
---|
0:32:11 | but with the HMM, we never imagined doing that |
---|
0:32:14 | we don't even normalize this, we just use the raw |
---|
0:32:16 | values over here |
---|
0:32:17 | in the beginning, I still used MFCC, delta MFCC, delta-delta |
---|
0:32:23 | multiplied by 11 or 15, whatever |
---|
0:32:26 | then we have a big input |
---|
0:32:28 | which is still small compared with the hidden layer size |
---|
0:32:31 | and we train this whole thing, and everything works really well |
---|
0:32:33 | and we don't need to worry about correlation modeling, because the correlation is automatically captured by |
---|
0:32:38 | all the weights here |
---|
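A tiny sketch of the long-window input construction being described (frame counts and feature dimensions are illustrative assumptions, not the exact system configuration):

```python
# Stack +/- 5 neighbouring frames around each frame so the network sees a long acoustic
# context: the kind of wide input a diagonal-covariance GMM-HMM could not comfortably model.
import numpy as np

def stack_context(features, context=5):
    """features: (num_frames, dim) array, e.g. MFCC + deltas; returns (num_frames, dim*(2*context+1))."""
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(features)] for i in range(2 * context + 1)])

utterance = np.random.randn(300, 39)      # 300 frames of 39-dim MFCC+deltas (toy data)
dnn_input = stack_context(utterance)      # shape (300, 429): the "11 frames" input
```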
0:32:40 | the reason I bring it up here is just to show you that this is not just |
---|
0:32:47 | phone |
---|
0:32:55 | we went through the history and the literature; we never saw this applied to speech until this |
---|
0:33:02 | first work |
---|
0:33:03 | now just to give you a picture here: the GMM, everybody knows |
---|
0:33:09 | HMM, GMM; so the whole point is to show you that |
---|
0:33:15 | with the same kind of architecture, if you look at the HMM |
---|
0:33:18 | you can also see the GMM is very shallow |
---|
0:33:21 | all you do is that for each state the output is one score from the GMM |
---|
0:33:26 | over here, you can see many layers |
---|
0:33:28 | so you build features up layer by layer; this shows deep versus shallow |
---|
0:33:33 | here is the result. We wrote the paper together, it will appear in November |
---|
0:33:41 | and that paper summarizes |
---|
0:33:45 | the research of four groups together over the last three years |
---|
0:33:49 | since 2009 |
---|
0:33:51 | University of Toronto, Google, and |
---|
0:33:55 | our Microsoft Research was the first one that |
---|
0:33:58 | actually did serious work on speech recognition |
---|
0:34:01 | Google data and IBM data |
---|
0:34:03 | they all confirm the same kind of effectiveness |
---|
0:34:05 | here is the TIMIT result |
---|
0:34:10 | it is very nice; everybody thinks that TIMIT is very small |
---|
0:34:14 | if you don't start with this, you get scared away. |
---|
0:34:18 | I will come back to this in the 2nd part of the talk; this is the monophone |
---|
0:34:24 | hidden trajectory model I did many years ago |
---|
0:34:26 | to get this number took 2 years |
---|
0:34:29 | I wrote the training algorithm, and my colleagues wrote the decoder for me; this |
---|
0:34:36 | is a very good number |
---|
0:34:38 | for TIMIT, and it is very hard to do decoding |
---|
0:34:48 | the first time we tried this DBN |
---|
0:34:50 | deep neural network |
---|
0:34:55 | I wrote this paper with ...; what we do is MMI training |
---|
0:35:03 | you can do back-propagation through the MMI objective function for the whole sequence |
---|
0:35:09 | so we got 22%, which is almost 3% better |
---|
0:35:18 | and then we looked at the errors; between this and this they are very different, especially for |
---|
0:35:22 | very short segments |
---|
0:35:23 | there it is not really good, but on the very long side it is much better |
---|
0:35:27 | I've never seen that before |
---|
0:35:30 | so doing this, this kind of work is compared with the HMM |
---|
0:35:34 | this result was done some 20 years ago |
---|
0:35:37 | this is the error, 27% error, around 4% higher |
---|
0:35:42 | over around 10 or 15 years, the error dropped 3% |
---|
0:35:50 | and this and this are very similar in terms of error |
---|
0:35:58 | so you see the error is very different |
---|
0:36:06 | so the first experiment is voice search |
---|
0:36:10 | at that time, voice search was a very important task, and now voice search is |
---|
0:36:16 | everywhere |
---|
0:36:18 | Siri has voice search, in Windows Phone we have that |
---|
0:36:23 | even in Android phones |
---|
0:36:25 | very important topic |
---|
0:36:27 | so we have data, we have worked on this one, very large vocabulary |
---|
0:36:33 | in the summer of 2010 |
---|
0:36:35 | we were the first in our group to just try it, because it is so different |
---|
0:36:44 | from TIMIT |
---|
0:36:47 | and we actually didn't even change the parameters at all |
---|
0:36:49 | all the parameters, the learning rate |
---|
0:36:52 | came from our previous work on TIMIT |
---|
0:36:54 | and we got down here, that is the paper we wrote |
---|
0:36:57 | it just appeared this year |
---|
0:37:04 | and then this is the result that we got |
---|
0:37:07 | if you actually want to look at exactly how this is done |
---|
0:37:11 | most of what is provided |
---|
0:37:13 | in this paper is a recipe |
---|
0:37:15 | to tell you how to train the system |
---|
0:37:17 | but you need to use a GPU to implement it; without a GPU, it takes 3 months just |
---|
0:37:21 | for experiments |
---|
0:37:22 | for large vocabulary; with a GPU it is really quick |
---|
0:37:25 | most of it is the same, you do this, you do this |
---|
0:37:32 | we try to provide as much theory as possible |
---|
0:37:36 | so if you want to know how to do this in some applications take a |
---|
0:37:40 | look at this |
---|
0:37:40 | so this is the first time |
---|
0:37:44 | we saw the effects of increasing the depth of the DNN for large vocabulary |
---|
0:37:50 | so in our systems, the accuracy goes up like this |
---|
0:37:58 | and the baseline, using the HMM with discriminative MPE training |
---|
0:38:05 | is around 65; this is just the neural network |
---|
0:38:08 | a single hidden layer network is doing better than all this |
---|
0:38:12 | and as you increase it, you get more |
---|
0:38:15 | when you go there, there is some kind of overfitting; the data is not much, we labeled |
---|
0:38:19 | 24 hours |
---|
0:38:20 | of data at that time, so we said |
---|
0:38:23 | do more, so we tried 48 hours |
---|
0:38:25 | and this one drops a lot |
---|
0:38:26 | so the more data you have, the better you can get |
---|
0:38:29 | some of my colleagues asked why we don't use Switchboard |
---|
0:38:36 | I said this is too big for me, we won't do it |
---|
0:38:38 | but actually, we did do Switchboard |
---|
0:38:40 | and then we got a huge gain |
---|
0:38:41 | even more gain than I showed you here |
---|
0:38:43 | just because of more data |
---|
0:38:45 | so the typical problem |
---|
0:38:46 | is not really spontaneous speech, but this one is spontaneous as well |
---|
0:38:52 | so this works for spontaneous speech as well |
---|
0:38:55 | it seems that with limited data we go up here quite a lot |
---|
0:38:58 | and then you get 1 or 2 orders of magnitude more data there |
---|
0:39:02 | so you have many more GPUs to run, much better software |
---|
0:39:05 | everything runs well |
---|
0:39:08 | it turns out that the same kind of recipe |
---|
0:39:10 | we published over here |
---|
0:39:14 | let me show you some of the results |
---|
0:39:16 | this is the result, this is the table in our recent paper |
---|
0:39:24 | with the Toronto group |
---|
0:39:29 | so the standard GMM-based HMM |
---|
0:39:31 | with 300 hours of data |
---|
0:39:33 | has an error rate of about 23 percent |
---|
0:39:38 | we very carefully |
---|
0:39:43 | tuned the parameters; this parameter has been tuned (the number of layers) |
---|
0:39:44 | and we got from here to here |
---|
0:39:47 | and that actually attracted a lot of people's attention |
---|
0:39:49 | and then we realized that |
---|
0:39:53 | we got 2000 hours, and the result from that is even better |
---|
0:39:56 | and at that time, that was the Microsoft result |
---|
0:40:01 | and then one recent paper publishes the result that |
---|
0:40:08 | of course, when you do that people argue that you have 29 million parameters |
---|
0:40:14 | and people always you know |
---|
0:40:16 | nit-picking; people in the speech community |
---|
0:40:19 | obviously, uh, you've got more parameters, of course you're going to win, right? |
---|
0:40:21 | so what if you use the same number of parameters? |
---|
0:40:23 | we said fine, we'll do that |
---|
0:40:24 | so we used a sparseness constraint |
---|
0:40:26 | to actually prune the weights |
---|
0:40:28 | and the number of non-zero parameters is 15 million |
---|
0:40:33 | with the smaller number of parameters, |
---|
0:40:34 | we get an even better result |
---|
0:40:36 | it's amazing... the capacity of the deep network is just tremendous |
---|
0:40:40 | you cut the parameters |
---|
0:40:41 | in the beginning, we don't |
---|
0:40:42 | typically, you expect it to be similar, right |
---|
0:40:44 | you get rid of the small ones |
---|
0:40:46 | you get a slight gain |
---|
0:40:47 | but that doesn't carry over once we get more data anyway |
---|
0:40:49 | so this is, maybe |
---|
0:40:50 | within the statistical variation, but so |
---|
0:40:53 | with the smaller number of parameters |
---|
0:40:55 | than the GMM-HMM which is trained using discriminative training |
---|
0:40:58 | we get about a 30-something % error reduction |
---|
0:41:02 | more than on TIMIT, and also more than on |
---|
0:41:06 | our voice search |
---|
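The kind of weight pruning just described, to match parameter counts, can be sketched as simple magnitude-based sparsification (the threshold-by-count approach here is an illustrative assumption; the paper's exact sparseness method may differ):

```python
# Magnitude pruning: keep only the largest-magnitude weights so the number of
# non-zero parameters matches a budget (e.g. the baseline system's parameter count).
import numpy as np

def prune_to_budget(weight_matrices, budget):
    """Zero out the smallest-magnitude weights across all layers, keeping about `budget` non-zeros."""
    all_mags = np.concatenate([np.abs(W).ravel() for W in weight_matrices])
    if budget >= all_mags.size:
        return weight_matrices
    threshold = np.partition(all_mags, all_mags.size - budget)[all_mags.size - budget]
    return [np.where(np.abs(W) >= threshold, W, 0.0) for W in weight_matrices]

# toy usage: three random layers, pruned from ~5.3M weights down to ~1M non-zeros
rng = np.random.default_rng(0)
layers = [rng.standard_normal(s) for s in [(429, 2048), (2048, 2048), (2048, 100)]]
pruned = prune_to_budget(layers, budget=1_000_000)
print(sum(int((W != 0).sum()) for W in pruned))   # approximately 1,000,000
```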
0:41:10 | and then this is another paper, and then IBM came along |
---|
0:41:12 | and then Google came along; they saw, you know, a better result, and I think they |
---|
0:41:16 | wanted to do it as well |
---|
0:41:18 | so you can see that this is Google's result |
---|
0:41:19 | this is about 5000 hours, amazing right |
---|
0:41:22 | they just have better infrastructure |
---|
0:41:24 | mapping and all that, so they managed to do that on 5000, 6000 hours |
---|
0:41:29 | so this number just came up |
---|
0:41:32 | actually that number |
---|
0:41:33 | so actually this will be in the Interspeech papers, if you go to see them |
---|
0:41:38 | so one of the things Google does is that they don't put in this baseline result |
---|
0:41:42 | they just give a number, |
---|
0:41:44 | just ask what number they have |
---|
0:41:47 | so... sorry.. sorry |
---|
0:41:50 | with more data they have it; with the same data they don't have the number; either |
---|
0:41:52 | they |
---|
0:41:53 | they just don't bother to do it |
---|
0:41:54 | they all believe more data is better |
---|
0:41:56 | so with a lot more data they got this |
---|
0:41:58 | and then we just with about how many, about three |
---|
0:42:01 | uh, with this much data |
---|
0:42:04 | it is about 12%; it is better when you get more data |
---|
0:42:07 | they should put a number here, anyway |
---|
0:42:09 | so I'm, we're not nit-picking on this |
---|
0:42:12 | and this is the number I showed, this is Microsoft's result, the number from here to here |
---|
0:42:16 | from here to here for different |
---|
0:42:18 | these are 2 different test sets |
---|
0:42:19 | and all these, all the people are here, you should know, this is very important |
---|
0:42:23 | for our review |
---|
0:42:24 | ah now, this is the IBM result |
---|
0:42:27 | ah sorry, this is the voice search result that I showed you earlier |
---|
0:42:29 | this is 20% |
---|
0:42:31 | it's not bad |
---|
0:42:32 | that is because you have only 20 hours of data, so |
---|
0:42:34 | it turns out the more data you have |
---|
0:42:36 | the more error reduction you have |
---|
0:42:38 | and for TIMIT, we get only about 3-4 absolute, about ten-something percent relative |
---|
0:42:43 | now, and this is the |
---|
0:42:46 | so this broadcast news result is from IBM |
---|
0:42:50 | and I heard that at Interspeech, they have a much better result than this |
---|
0:42:55 | so if you're interested, look at it |
---|
0:42:57 | my understanding is that |
---|
0:42:59 | from what I heard, their result is comparable to this |
---|
0:43:02 | some people say even better |
---|
0:43:04 | so if you want to know exactly what IBM is doing: they have an even better |
---|
0:43:07 | infrastructure |
---|
0:43:08 | in terms of distributed learning |
---|
0:43:10 | compared to most other places |
---|
0:43:12 | but anyway so this kind of error reduction |
---|
0:43:15 | has been unheard of in the history of this area, I mean, for about 25 years |
---|
0:43:20 | and the first time we got these results, we were just stunned |
---|
0:43:22 | and Google, this is also Google's result, and even YouTube speech, which is much more |
---|
0:43:27 | difficult |
---|
0:43:27 | spontaneous with all the noise |
---|
0:43:29 | they also managed to get something from here |
---|
0:43:31 | this time they're pretty honest to put this over here with the same amount of |
---|
0:43:34 | data |
---|
0:43:35 | 14 hours they got more |
---|
0:43:37 | but in our case, when we also got 2000 hours, we actually got more gain |
---|
0:43:40 | rapid gain ah yes |
---|
0:43:41 | so the more data you have |
---|
0:43:43 | and then of course, to get this, you have to tune the depth |
---|
0:43:45 | the more data you have, the deeper you can go |
---|
0:43:47 | and the bigger you may want to go |
---|
0:43:50 | and the more gain you have |
---|
0:43:51 | and this is the story I want to comment on |
---|
0:43:52 | without having to change major things in the system architecture |
---|
0:43:58 | OK, so once |
---|
0:44:00 | one thing that we found |
---|
0:44:01 | so my colleagues Dong Yu and myself and ah and |
---|
0:44:06 | recently found was that |
---|
0:44:10 | so in most of the thing that we |
---|
0:44:13 | I believe in old days IBM and |
---|
0:44:15 | and Google and our early work |
---|
0:44:17 | we actually used the DBN to initialize our model off-line |
---|
0:44:20 | we said can we get rid of that, that training is very tricky, not many |
---|
0:44:23 | people know how to do that |
---|
0:44:27 | if for certain recipe, you have to look at the pattern |
---|
0:44:30 | it's not an obvious thing how to do that because of the learning |
---|
0:44:33 | there's a keyword in the learning called contrastive divergence; you might hear that word |
---|
0:44:37 | in the later |
---|
0:44:38 | part of the talk today |
---|
0:44:42 | contrastive divergence, in theory, |
---|
0:44:44 | essentially the idea is you should, you know, iterate |
---|
0:44:47 | you should do Monte Carlo simulation |
---|
0:44:49 | Gibbs sampling for an infinite number of turns |
---|
0:44:52 | but in practice, it's too long |
---|
0:44:55 | so you cut it to one |
---|
0:44:56 | and of course for that, you have to use variational learning |
---|
0:45:00 | a variational bound, to get a better result |
---|
0:45:02 | it's a bit tricky |
---|
0:45:04 | that's why it's better to get rid of it |
---|
0:45:06 | so our colleagues actually have a patent filed just a few months ago |
---|
0:45:10 | on this, and also a paper from my colleague |
---|
0:45:13 | which actually uses ... for the |
---|
0:45:17 | the Switchboard task |
---|
0:45:18 | and they show that |
---|
0:45:19 | you actually can do things comparable to RBM learning |
---|
0:45:23 | so I would say now, for large vocabulary |
---|
0:45:26 | we don't even have to learn much about the DBN |
---|
0:45:30 | so .. the theory so far is not clear |
---|
0:45:34 | exactly what kind of power you have |
---|
0:45:36 | but my sense is that |
---|
0:45:39 | if you have a lot of unlabeled data in the future |
---|
0:45:42 | it might help |
---|
0:45:44 | but we also did some preliminary experiments to show it may not be the case any more |
---|
0:45:47 | so it's not clear how to do that |
---|
0:45:49 | so I think at this point we really have to get a better theory |
---|
0:45:52 | to get a better theory, and also do comparable experiments |
---|
0:45:54 | you know it's a |
---|
0:45:56 | so that all these issues can be settled |
---|
0:45:58 | so the idea of discriminative pre-training is that |
---|
0:46:01 | you just train the standard ..um |
---|
0:46:04 | standard |
---|
0:46:07 | multi-layer perceptron using you know |
---|
0:46:10 | this is easy to train. For a shallow one, you can train it; the result's not very good |
---|
0:46:14 | and then every time |
---|
0:46:15 | you fix this |
---|
0:46:17 | you add a new one, and you do it again. You need to fix the lower layers |
---|
0:46:21 | from the previous shallower network |
---|
0:46:24 | and that's good, that's the spirit, It's very similar to |
---|
0:46:28 | OK .. the spirit is very similar to layer by layer learning |
---|
0:46:31 | now every time |
---|
0:46:32 | when we add a new layer, we inject |
---|
0:46:35 | discriminative label information |
---|
0:46:37 | and that's very important, if you do that, nothing goes wrong |
---|
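A minimal sketch of this discriminative, layer-by-layer pre-training: grow the network one hidden layer at a time, training each new layer against the labels with a temporary softmax, then freeze it and add the next. Layer sizes, epochs and learning rate are illustrative assumptions, not the recipe used in the talk.

```python
# Discriminative layer-wise pre-training (sketch): label information is injected every time
# a new hidden layer is added; previously trained layers are kept fixed.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_new_layer(feats, y_onehot, n_hidden, epochs=30, lr=0.1):
    """Train one new hidden layer plus a temporary softmax on top of frozen features."""
    W_h = 0.01 * rng.standard_normal((feats.shape[1], n_hidden))
    W_o = 0.01 * rng.standard_normal((n_hidden, y_onehot.shape[1]))
    for _ in range(epochs):
        h = sigmoid(feats @ W_h)
        delta_o = (softmax(h @ W_o) - y_onehot) / len(feats)   # label information enters here
        delta_h = (delta_o @ W_o.T) * h * (1 - h)
        W_o -= lr * h.T @ delta_o
        W_h -= lr * feats.T @ delta_h
    return W_h

def discriminative_pretrain(x, y_onehot, hidden_sizes):
    """Grow the net layer by layer; lower layers are frozen (the 'fix') after each step."""
    weights, feats = [], x
    for n_hidden in hidden_sizes:
        W_h = train_new_layer(feats, y_onehot, n_hidden)
        weights.append(W_h)
        feats = sigmoid(feats @ W_h)        # deeper features for the next step
    return weights                           # afterwards: add a final softmax and fine-tune everything

# toy usage: random "frames" and labels, three layers grown discriminatively
x = rng.standard_normal((64, 50))
y = np.eye(10)[rng.integers(0, 10, 64)]
stack = discriminative_pretrain(x, y, [40, 30, 20])
```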
0:46:39 | so if you just use random numbers to go over here and do |
---|
0:46:42 | that, nothing is going to work |
---|
0:46:44 | uh, but except |
---|
0:46:45 | there's some exception here, but I'm not going to say much about it |
---|
0:46:48 | but once you do this |
---|
0:46:50 | layer by layer with |
---|
0:46:52 | the spirit is still similar to DBN right, layer by layer |
---|
0:46:55 | but you inject discriminative learning |
---|
0:46:57 | I believe it's a very natural thing to do |
---|
0:46:59 | we talked about this right |
---|
0:47:00 | so we learn |
---|
0:47:02 | the generative learning in the DBN |
---|
0:47:04 | you know, layer by layer, you have to be very careful |
---|
0:47:07 | you don't just do it too much |
---|
0:47:08 | and then if you inject some discriminative information |
---|
0:47:12 | it's bound to help |
---|
0:47:13 | you get new information there, not just looking at the data itself |
---|
0:47:16 | and it turns out that if we do, we get |
---|
0:47:19 | in some experiments we actually even get a slightly better result than with DBN pre-training |
---|
0:47:23 | so it's not clear that generative learning |
---|
0:47:26 | plays, is going to play a more important role |
---|
0:47:29 | as some people claimed |
---|
0:47:32 | OK so I'm done with |
---|
0:47:33 | the |
---|
0:47:35 | the deep neural network, so I will spend a few minutes to tell you a bit |
---|
0:47:39 | more about |
---|
0:47:39 | some other, different kind of architecture called the deep convex network |
---|
0:47:45 | which to me is kind of more interesting |
---|
0:47:47 | so I spend most time on this |
---|
0:47:49 | so actually we have a few papers published, it turned out that |
---|
0:47:54 | so the idea of this network is that |
---|
0:47:56 | well, this was actually done for MNIST |
---|
0:47:58 | so when we use this architecture |
---|
0:48:00 | we actually get a much better result than the DBN |
---|
0:48:04 | so we're very excited about this |
---|
0:48:05 | but the point is that the learning has to.. you know |
---|
0:48:08 | we have to simplify this network |
---|
0:48:10 | it turns out learning now |
---|
0:48:11 | the whole thing is actually convex optimization |
---|
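As a rough illustration of the convex-optimization point: in the deep stacking / deep convex network literature, each module is a simple one-hidden-layer net whose upper weights have a closed-form least-squares (ridge regression) solution, and each module's input is the raw input concatenated with the outputs of the modules below. The sketch below is written under those assumptions (hidden sizes, ridge constant and random lower weights are illustrative), not as the exact architecture on the slides.

```python
# Deep stacking / deep convex network module (sketch): lower weights W, sigmoid hidden layer,
# and upper weights U solved in closed form by ridge regression (the convex step).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_module(x, targets, n_hidden, ridge=1e-3):
    W = 0.1 * rng.standard_normal((x.shape[1], n_hidden))     # lower weights (could be RBM-initialized)
    H = sigmoid(x @ W)                                         # hidden activations
    # convex step: U = argmin ||H U - targets||^2 + ridge * ||U||^2  (closed form)
    U = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ targets)
    return W, U

def train_dsn(x, targets, n_modules=3, n_hidden=50):
    modules, preds = [], []
    for _ in range(n_modules):
        module_input = np.hstack([x] + preds)                  # raw input + all previous module outputs
        W, U = train_module(module_input, targets, n_hidden)
        modules.append((W, U))
        preds.append(sigmoid(module_input @ W) @ U)            # this module's prediction
    return modules

# toy usage: fit random 10-class one-hot targets from random 20-dim inputs
x = rng.standard_normal((200, 20))
targets = np.eye(10)[rng.integers(0, 10, 200)]
dsn = train_dsn(x, targets)
```

Because each module's expensive step is a batch least-squares solve rather than stochastic gradient descent, this is also the part that parallelizes easily, which is the contrast with the DNN fine-tuning drawn just below.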
0:48:14 | So I do not have time to go through all this |
---|
0:48:15 | it allows for a parallel implementation |
---|
0:48:17 | which is almost impossible |
---|
0:48:19 | for the deep neural network |
---|
0:48:22 | and the reason, for those of you who've actually been working on neural networks, you |
---|
0:48:25 | will notice, is that |
---|
0:48:26 | the learning for |
---|
0:48:27 | the discriminative... the discriminative learning phase |
---|
0:48:30 | which is called the fine-tuning phase |
---|
0:48:32 | typically uses stochastic gradient descent |
---|
0:48:34 | which you cannot distribute |
---|
0:48:35 | so this one cannot be distributed |
---|
0:48:36 | so I'm not going to |
---|
0:48:38 | I really want to use |
---|
0:48:39 | this architecture to try on the speech recognition task, and we have had lots of discussion |
---|
0:48:42 | so maybe 1 year from now |
---|
0:48:44 | so if it works well for your discriminative learning task |
---|
0:48:48 | I'm glad that |
---|
0:48:49 | this now is going to define the task |
---|
0:48:51 | for discrimination that I... I... had |
---|
0:48:54 | discussions about, so |
---|
0:48:56 | that gives me the opportunity to try this |
---|
0:48:58 | I would love to try it, and I would love to report the result |
---|
0:49:00 | even if it's negative, I'm happy to share it with you |
---|
0:49:02 | OK, so this is a good architecture |
---|
0:49:04 | and another architecture that we tried |
---|
0:49:06 | is that we split the hidden layers into 2 parts |
---|
0:49:09 | we take the cross product, so that overcomes |
---|
0:49:12 | some of the DBN weaknesses |
---|
0:49:14 | of originally not being able to model correlation in the input |
---|
0:49:18 | and people just tried a few tricks |
---|
0:49:20 | you know, to model the correlation |
---|
0:49:22 | it did not work well, almost impossible |
---|
0:49:24 | so this is very easy to implement |
---|
0:49:26 | and most of the learning here is convex optimization |
---|
0:49:28 | and we often get a very good result compared with others |
---|
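The "split the hidden layer into two parts and take their cross product" idea can be sketched as follows (sizes are illustrative assumptions); the outer product lets the layer above see multiplicative, correlation-style terms between the two hidden parts rather than only additive ones.

```python
# Two hidden parts whose outer product feeds the next layer (tensor-style module, sketch).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = rng.standard_normal((8, 39))            # a mini-batch of acoustic feature vectors
W1 = 0.1 * rng.standard_normal((39, 16))    # weights into hidden part 1
W2 = 0.1 * rng.standard_normal((39, 16))    # weights into hidden part 2

h1 = sigmoid(x @ W1)                        # (8, 16)
h2 = sigmoid(x @ W2)                        # (8, 16)
# cross product of the two parts, flattened: (8, 256) is the input to the layer above
h_cross = np.einsum('bi,bj->bij', h1, h2).reshape(len(x), -1)
```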
0:49:31 | there's another architecture called the tensor version |
---|
0:49:34 | so the same kind of correlation |
---|
0:49:36 | modeling from the tensor version |
---|
0:49:38 | can also be carried over into |
---|
0:49:40 | the deep neural network |
---|
0:49:41 | so my colleague and I actually submitted a paper to Interspeech |
---|
0:49:44 | I think if you're interested in this one, you should go there to take a look |
---|
0:49:47 | at it |
---|
0:49:47 | so the whole point is that |
---|
0:49:48 | now rather than doing the stacking using input-output concatenation |
---|
0:49:52 | you can actually do the same thing for each hidden layer of the neural network |
---|
0:49:56 | so in this paper, we actually evaluated that on Switchboard |
---|
0:50:00 | and we get an additional 5% relative gain over the best we have got so |
---|
0:50:03 | far. So this is good stuff |
---|
0:50:05 | so the learning becomes trickier |
---|
0:50:06 | because when you do... the back-propagation |
---|
0:50:11 | you have to think about how to do this |
---|
0:50:13 | it adds some additional nuisance in terms of efficient computation |
---|
0:50:16 | but the result is good |
---|
0:50:18 | so now I'm going to the second part; I'm going to skip most of it |
---|
0:50:22 | OK skip most of them |
---|
0:50:25 | OK so this uh... I actually wrote a book on this |
---|
0:50:27 | so this is |
---|
0:50:29 | the dynamic Bayesian network as a deep model |
---|
0:50:31 | the reason why it's deep is that there are many layers |
---|
0:50:34 | so you get the target |
---|
0:50:35 | you get articulation |
---|
0:50:36 | you get the environment, all together like this |
---|
0:50:38 | so we tried that |
---|
0:50:40 | and the implementation of this is very hard |
---|
0:50:43 | so I will just go quickly and then go to the bottom line |
---|
0:50:47 | uh, so, uh, this is one of the papers that |
---|
0:50:50 | uh, I wrote uh, together with |
---|
0:50:53 | one of the experts, who actually |
---|
0:50:54 | this is my colleague who actually invented this variational Bayes method |
---|
0:50:58 | and then ... I worked with him |
---|
0:51:01 | to implement this variational Bayes |
---|
0:51:03 | into this kind of ... |
---|
0:51:06 | dynamic Bayesian network |
---|
0:51:07 | and the result is very good |
---|
0:51:09 | and the journal paper we published is wonderful |
---|
0:51:11 | so you can actually synthesize |
---|
0:51:13 | you can track all these formants in a very precise manner |
---|
0:51:17 | and solve some articulatory problems; it's very amazing, but once you do recognition |
---|
0:51:21 | the result is not very good |
---|
0:51:22 | so I'm going to tell you why, if we have time |
---|
0:51:25 | and then of course |
---|
0:51:26 | one of the problems |
---|
0:51:27 | so in 2006 we actually |
---|
0:51:31 | realized that that kind of learning is very tricky |
---|
0:51:34 | essentially you approximate things and you don't know what you are approximating |
---|
0:51:39 | that's one of the problems of deep Bayesian networks, it's very |
---|
0:51:42 | but you can get some insights |
---|
0:51:43 | you work with all the experts in the [ ... ] |
---|
0:51:45 | in the end, the bottom line is |
---|
0:51:47 | we really don't know how to interpret it |
---|
0:51:49 | but you... but is just |
---|
0:51:51 | you don't know how much you lose, right |
---|
0:51:52 | so we actually have a simplified version that I spent a lot of time working on, and |
---|
0:51:56 | that gives me this result |
---|
0:51:58 | that's actually the paper |
---|
0:51:59 | so this is .. is about |
---|
0:52:01 | about 2-3 percent better than the best |
---|
0:52:04 | context-dependent HMM |
---|
0:52:06 | I was happy at that time; we stopped at this |
---|
0:52:08 | once we did this |
---|
0:52:09 | and it's so much better than this |
---|
0:52:10 | so in other words, DBN- |
---|
0:52:12 | related methods, at least on the TIMIT task, |
---|
0:52:14 | do so much better |
---|
0:52:16 | than the dynamic Bayesian kind of work |
---|
0:52:19 | and so we're happy about this |
---|
0:52:21 | now of course I won't |
---|
0:52:22 | yes, so this is the history of dynamic models |
---|
0:52:25 | and there is a whole bunch of things going on there |
---|
0:52:27 | and the key is how to embed |
---|
0:52:29 | such dynamic properties into the DBN framework |
---|
0:52:33 | if you embed the property of |
---|
0:52:36 | big chunk into |
---|
0:52:37 | the dynamic Bayesian network, it is not going to work ... due to technical reasons |
---|
0:52:42 | but the other way around has hope; that's one of the |
---|
0:52:46 | so part 3 is going to tell you |
---|
0:52:49 | for which I'm running out of time |
---|
0:52:50 | I'm actually going to show you |
---|
0:52:52 | first of all some of the lessons |
---|
0:52:54 | so this is the deep belief network or Deep Neural Network |
---|
0:52:57 | and this, I used the * here, to refer to the Dynamic Bayesian Network |
---|
0:53:02 | so one |
---|
0:53:05 | so all these hidden dynamic models... are special cases of the Bayesian network |
---|
0:53:10 | you can see that, or otherwise I showed you earlier on |
---|
0:53:13 | there were a few key differences that we learned |
---|
0:53:15 | one is that for DBN |
---|
0:53:17 | it's a distributed representation |
---|
0:53:20 | so in our current system, for this system |
---|
0:53:23 | in our HMM/GMM system |
---|
0:53:25 | we have the concept that this particular model |
---|
0:53:28 | is related to a |
---|
0:53:29 | this particular model is related to e right |
---|
0:53:31 | you have this concept right, and of course you need training to mix them together |
---|
0:53:34 | but you still have the concept |
---|
0:53:35 | whereas in this neural network.. no .. each weight |
---|
0:53:39 | encodes all the class information |
---|
0:53:41 | I think there is a very powerful concept here |
---|
0:53:43 | so you learn things and they get distributed |
---|
0:53:45 | it's like the neural system, right |
---|
0:53:47 | you don't say this particular neuron contains visual information |
---|
0:53:50 | it can also encode audio information together |
---|
0:53:53 | so this has a better |
---|
0:53:55 | neural basis compared with conventional techniques |
---|
0:53:58 | also ...... when we did this model |
---|
0:54:01 | we just got one single thing wrong |
---|
0:54:04 | at that time, we all said ... we need a parsimonious model representation. |
---|
0:54:08 | that's just wrong |
---|
0:54:10 | 5 years ago, 10 years ago, that was maybe OK, right |
---|
0:54:12 | now in our current age |
---|
0:54:14 | just use massive numbers of parameters if you know how to learn them |
---|
0:54:17 | and if you also know how to regularize them well |
---|
0:54:19 | and it just turns out that the DBN has a mechanism |
---|
0:54:21 | to automatically regularize things well |
---|
0:54:24 | and that is not proven yet, I don't have the theory to prove that |
---|
0:54:26 | but in our... you know, every time you stack up |
---|
0:54:29 | you can intuitively understand that |
---|
0:54:31 | you don't overfit, right |
---|
0:54:32 | because if you were going to overfit, you would have done so many years ago |
---|
0:54:36 | but if you do this, you know, keep going deeper, you don't overfit because |
---|
0:54:39 | whatever information you get applied |
---|
0:54:41 | the new parameters |
---|
0:54:43 | actually sort of take into account |
---|
0:54:46 | the features from the lower parameters, so they don't count as lower |
---|
0:54:50 | model parameters any more, so you automatically have a mechanism to do this |
---|
0:54:53 | but in the DBN, you don't have that property |
---|
0:54:55 | you need to stop; it doesn't have that property |
---|
0:54:57 | so this is a very strong point |
---|
0:54:59 | and another key difference |
---|
0:55:01 | is something I talked about earlier |
---|
0:55:03 | product vs mixture |
---|
0:55:06 | mixture means you sum up probability distributions |
---|
0:55:08 | and product means you take the product of them |
---|
0:55:11 | so when you take the product, you actually exponentially expand the power of representation |
---|
0:55:16 | So these are all the key differences between these two types of model. |
---|
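To make the contrast concrete, here is a minimal sketch in equations (the notation is mine, not from the slides): a mixture sums its component densities, while a product of experts multiplies them and renormalizes, so each expert can sharply constrain the overall distribution.

```latex
\begin{align*}
  \text{mixture (e.g.\ a GMM):} \quad
    p(x) &= \sum_{k} \pi_k\, p_k(x), \qquad \sum_k \pi_k = 1 \\
  \text{product of experts (e.g.\ an RBM):} \quad
    p(x) &= \frac{1}{Z} \prod_{k} p_k(x), \qquad Z = \int \prod_k p_k(x')\, dx'
\end{align*}
```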
0:55:19 | Another important thing is that for this learning we combine generative and discriminative training.
---|
0:55:26 | Although, with the final result we got, we still think that discriminative learning is more important than
---|
0:55:31 | generative learning.
---|
0:55:32 | But at least for the initialization, we use the generative DBN to initialize
---|
0:55:38 | the whole system and then discriminative learning to adjust the parameters.
---|
0:55:42 | The model we did earlier was purely generative.
---|
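As an illustration of that two-stage recipe, here is a minimal sketch with made-up layer sizes, random stand-in data, and toy hyperparameters; it is not the actual system described in the talk, only the general pattern of greedy generative RBM pre-training followed by discriminative fine-tuning.

```python
# Toy sketch of generative pre-training plus discriminative fine-tuning.
# All sizes, data, and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.05):
    """One-step contrastive divergence (CD-1) for a single RBM layer."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        v0 = data
        h0 = sigmoid(v0 @ W + b_h)                          # positive phase
        h0_sample = (h0 > rng.random(h0.shape)).astype(float)
        v1 = sigmoid(h0_sample @ W.T + b_v)                 # reconstruction
        h1 = sigmoid(v1 @ W + b_h)                          # negative phase
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(data)
        b_v += lr * (v0 - v1).mean(axis=0)
        b_h += lr * (h0 - h1).mean(axis=0)
    return W, b_h

# Stage 1: generative, layer-by-layer pre-training (no labels used).
X = rng.random((256, 39))          # stand-in for acoustic feature frames
y = rng.integers(0, 10, size=256)  # stand-in for HMM-state / senone labels
stack, layer_input = [], X
for n_hidden in (128, 128):
    W, b_h = train_rbm(layer_input, n_hidden)
    stack.append((W, b_h))
    layer_input = sigmoid(layer_input @ W + b_h)

# Stage 2: discriminative fine-tuning with a softmax output layer.
W_out = 0.01 * rng.standard_normal((128, 10))
b_out = np.zeros(10)
for _ in range(50):
    acts = [X]
    for W, b_h in stack:                       # forward through pre-trained stack
        acts.append(sigmoid(acts[-1] @ W + b_h))
    logits = acts[-1] @ W_out + b_out
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = p.copy()
    grad[np.arange(len(y)), y] -= 1.0          # cross-entropy gradient
    # Only the output layer is updated in this sketch; a full system would
    # also backpropagate the error into the pre-trained weights.
    W_out -= 0.1 * acts[-1].T @ grad / len(y)
    b_out -= 0.1 * grad.mean(axis=0)
```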
0:56:02 | Finally, longer windows or shorter windows. |
---|
0:56:07 | In the earlier case, I am still not very happy about the longer window.
---|
0:56:15 | Because every time you model dynamics, and I've actually talked about this, about a few methods for
---|
0:56:21 | how to build dynamics into the model, they both have a very short history, not a
---|
0:56:29 | long history.
---|
0:56:30 | There is no long history of research actually focused on dynamics.
---|
0:56:34 | There are so many limitations; you have to use a short window. With a long window, nothing
---|
0:56:39 | really works, we've tried all of these.
---|
0:56:46 | So the deep recurrent network is something that many people are working on now.
---|
0:56:52 | In our lab, over the summer, almost all the projects relate to this. Maybe
---|
0:56:58 | not all, but at least a very large percentage.
---|
0:57:01 | It has worked well for both acoustic models and language models. I would say that
---|
0:57:07 | the recurrent network has been working well for acoustic modeling.
---|
0:57:29 | In language modeling, there are a lot of good projects using the recurrent network.
---|
0:57:47 | The weakness of this approach is that there is only a generic temporal dependency.
---|
0:57:59 | I have no idea what it is; there is no constraint on how one state follows another. This
---|
0:58:06 | kind of temporal modeling does not buy you very much.
---|
0:58:09 | The dynamics in the Dynamic Bayesian Network are much better.
---|
0:58:15 | In terms of interpretation, in terms of generative capability, in terms of the physical speech production
---|
0:58:19 | mechanism, it is just better. The key is how to combine them together.
---|
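A rough way to write down that contrast (my notation, not from the slides): a generic recurrent layer learns an unconstrained temporal map, whereas a hidden-dynamic model of the kind discussed earlier constrains the hidden state to move smoothly toward a phonetic target.

```latex
\begin{align*}
  \text{generic recurrence:} \quad
    h_t &= \sigma\!\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right) \\
  \text{target-directed hidden dynamics:} \quad
    z_t &= \Lambda\, z_{t-1} + (I - \Lambda)\, t_{s} + \epsilon_t
\end{align*}
```

Here $t_s$ stands for the articulatory or vocal-tract-resonance target of the current phonetic unit $s$, and $\Lambda$ controls how quickly the state moves toward it.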
0:58:23 | We don't like this, and we have shown that all this does not capture the |
---|
0:58:32 | essence of speech production dynamics. |
---|
0:58:35 | There is a huge amount of information redundancy: think about it, you have a long window here,
---|
0:58:41 | and every time you shift by ten milliseconds, 90% of the information is overlapping.
---|
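As a quick worked example with assumed, typical numbers (an 11-frame context window of 25 ms frames at a 10 ms hop; these figures are mine, not from the talk):

```latex
\begin{align*}
  \text{window span} &\approx 25\,\text{ms} + 10 \times 10\,\text{ms} = 125\,\text{ms} \\
  \text{overlap after one 10 ms shift} &\approx \frac{125 - 10}{125} \approx 92\%
\end{align*}
```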
0:58:59 | And some people may argue that it doesn't matter, and they did experiments to show
---|
0:59:03 | that it doesn't help at all. |
---|
0:59:04 | On the importance of optimization techniques: one example is the Hessian-free method.
---|
0:59:18 | I am not sure about language modeling, you may not do that there actually, but in
---|
0:59:22 | acoustic modeling, this is a very popular technique. |
---|
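For readers less familiar with the technique, here is a minimal sketch of the core idea on a toy quadratic problem of my own, not the acoustic-modeling systems cited in the talk: the update direction approximately solves H d = -g by conjugate gradient, using only Hessian-vector products so the Hessian is never formed. Full implementations add more machinery, such as damping and Gauss-Newton curvature products.

```python
# Toy illustration of the Hessian-free idea: Newton-like steps computed with
# conjugate gradient and Hessian-vector products only (no explicit Hessian).
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)

def loss_grad(w):
    r = A @ w - b
    return 0.5 * r @ r, A.T @ r

def hessian_vec(v):
    # For this quadratic loss, H = A.T @ A; a neural network would instead use
    # an R-operator or Gauss-Newton product rather than forming any matrix.
    return A.T @ (A @ v)

def conjugate_gradient(hvp, g, iters=20):
    """Approximately solve H d = -g using only Hessian-vector products."""
    d = np.zeros_like(g)
    r = -g.copy()          # residual of H d = -g at d = 0
    p = r.copy()
    for _ in range(iters):
        Hp = hvp(p)
        alpha = (r @ r) / (p @ Hp)
        d += alpha * p
        r_new = r - alpha * Hp
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return d

w = np.zeros(20)
for step in range(5):
    _, g = loss_grad(w)
    w += conjugate_gradient(hessian_vec, g)   # Newton-like update direction
```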
0:59:25 | And also, another point is that the recursive neural network for parsing in NLP has been
---|
0:59:31 | very successful. |
---|
0:59:32 | I think last year at ICML, they actually presented the result of a recursive neural net,
---|
0:59:36 | which is not quite the same as this, but they used the structure for the parsing, and
---|
0:59:40 | they actually got a state-of-the-art result for the parsing.
---|
0:59:43 | The conclusion of this slide is that it's an active and exciting research area to work
---|
0:59:47 | on. |
---|
0:59:48 | So the summary is as follows. I have provided historical accounts of two fairly separate lines of research.
---|
0:59:57 | One is based upon the Deep Belief Network, the other one is based on the Dynamic Bayesian Network in
---|
1:00:05 | speech. |
---|
1:00:05 | So I have hopefully shown you that speech research motivates the use of deep architectures
---|
1:00:13 | from speech production and perception mechanisms. |
---|
1:00:16 | And the HMM is a shallow architecture that uses the GMM to link linguistic units to observations.
---|
1:00:26 | Now I have shown you that, though I didn't have time to talk about all of it; the
---|
1:00:31 | point is that this kind of model has had less success than was expected.
---|
1:00:34 | And now we are beginning to understand why there is a limitation here, and
---|
1:00:40 | I have actually shown some potential possibilities for overcoming that kind of limitation in the
---|
1:00:47 | neural network framework.
---|
1:00:48 | So one of the things we now understand is why the kind of models that have
---|
1:00:53 | been developed in the past have not been able to take advantage of the dynamics
---|
1:00:58 | inside a deep network.
---|
1:01:00 | It's because we didn't have the distributed representation, didn't have massive parameters, didn't have fast |
---|
1:01:06 | parallel computing and we didn't have product of experts. |
---|
1:01:09 | All these things are good on this side, but the dynamics are actually good on the other side,
---|
1:01:13 | and how to merge them together, I think, is a very promising direction of research to actually
---|
1:01:17 | work on.
---|
1:01:18 | You can actually make the deep network scientific in terms of speech perception
---|
1:01:23 | and recognition.
---|
1:01:24 | So the outlook, the future direction, is that so far we have the DBN-DNN to
---|
1:01:32 | replace the HMM-GMM.
---|
1:01:34 | I would expect that within three to five years, you may not be able to see the
---|
1:01:40 | GMM any more, especially in recognition.
---|
1:01:41 | I mean in industry at least. If I am wrong, then shoot me.
---|
1:01:49 | The dynamic properties of this Dynamic Bayesian Network model of speech have the potential to
---|
1:02:01 | replace the HMM.
---|
1:02:15 | And for the Deep Recurrent Neural Networks, I have tried to argue that there
---|
1:02:21 | is a need to go beyond unconstrained temporal dependency while making it easier to learn.
---|
1:02:27 | Adaptive learning is so far not so successful yet; we tried a few projects, and it
---|
1:02:33 | is harder to do.
---|
1:02:35 | Scalable learning is hard, for industry at least; for academics, don't worry about
---|
1:02:41 | it.
---|
1:02:42 | As long as NIST defines it into small tasks, you will be very happy to
---|
1:02:47 | work on that. But for industry this is a big issue.
---|
1:02:50 | We are reinventing our infrastructure at the industrial scale. I don't think we have time to go through
---|
1:02:59 | all the applications.
---|
1:03:00 | Spoken language understanding has been one of the successful applications I've shown you.
---|
1:03:08 | Information retrieval, language modeling, NLP, image recognition; but speaker recognition, not yet.
---|
1:03:24 | The final bottom line here is that deep learning so far is weak in
---|
1:03:30 | theory; I hope I have convinced you of that with all the critiques.
---|
1:05:18 | In Bengio's case, he randomizes everything first. And then if you do that, of course,
---|
1:05:24 | it is bad.
---|
1:05:26 | So the key is that, if you can get something initialized well, I think, to
---|
1:05:31 | me, the generative model may be useful in that case. But the key of this learning
---|
1:05:36 | is that if you put a little bit of discrimination in here, it is probably better.
---|
1:06:47 | So probably the best is to use the structure here and also this, and we
---|
1:06:52 | know how to train that now. I think both width and depth are important.
---|
1:07:09 | We tried that; we didn't fix the measurement, we just used the algorithm to cut it out
---|
1:07:15 | all the way. We didn't lose anything; in fact, from the result I showed you,
---|
1:07:20 | it still gains a little bit.
---|
1:07:29 | Cross validation. |
---|
1:07:32 | There's no way; there is no theory on how to do that.
---|
1:07:35 | But in particular cases, for some of the networks that I've shown you, I have theory
---|
1:07:39 | to do that; I can control that.
---|
1:07:44 | For some networks, you can do the theory. That means you can automatically determine it from
---|
1:07:49 | the data. But for this Deep Belief Network, it is weak in theory.
---|
1:08:31 | He is also doing deep graphical models.
---|
1:08:48 | Two years ago, he gave this talk on how to learn the topology of a deep
---|
1:08:54 | neural network, in terms of width and depth.
---|
1:08:57 | And he was using the Indian Buffet Process.
---|
1:09:03 | In the end, everything has to be done by Monte Carlo simulation, and for a five
---|
1:09:10 | by five case, he said the simulation takes about several days.
---|
1:09:15 | I think that approach is not scalable, unless people improve that aspect. |
---|
1:09:27 | That also motivates more academic research on machine learning to make that scale.
---|
1:09:31 | I think the idea is good, but the technique is too slow to do anything
---|
1:09:35 | about this.
---|
1:09:50 | For the deep neural network, stochastic gradient is still doing the best; it is good enough.
---|
1:09:55 | But my understanding is, we are actually playing around with this. If you want to add
---|
1:10:01 | recurrence or some more complex architecture, stochastic gradient isn't strong enough.
---|
1:10:05 | There is a very nice paper from Hinton's group, by one of his PhD students,
---|
1:10:12 | who actually used Hessian-free optimization to do DBN learning.
---|
1:10:20 | They actually showed that; the result is just one single figure, very hard to interpret that
---|
1:10:27 | one, in the ICML 2010 paper. It does better with this compared with using the DBN
---|
1:10:34 | to initialize the neural network.
---|
1:10:36 | To me, it is very significant. We are still borrowing this; for a more complex network,
---|
1:10:44 | a more complex second-order method will probably be necessary.
---|
1:10:50 | And also, the other advantage of Hessian-free being second order is that it can be
---|
1:10:54 | parallelized for bigger batch training rather than minibatch training, and that makes a big difference.
---|
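A toy sketch of why that matters (my own least-squares example, not the systems discussed in the talk): a full-batch gradient decomposes into shard gradients that could be computed on separate machines and summed, while minibatch SGD is a long chain of small, inherently sequential updates.

```python
# Toy contrast between minibatch SGD and full-batch updates on linear
# least squares; sizes and learning rates are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 20))
w_true = rng.standard_normal(20)
y = X @ w_true + 0.1 * rng.standard_normal(10_000)

def grad(w, Xb, yb):
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Minibatch SGD: many small updates that must be applied one after another.
w_sgd = np.zeros(20)
for epoch in range(5):
    for i in range(0, len(X), 256):
        w_sgd -= 0.01 * grad(w_sgd, X[i:i+256], y[i:i+256])

# Full-batch step: the gradient is an average over equal data shards, so each
# shard could be computed on a different machine and the pieces combined.
w_batch = np.zeros(20)
for step in range(50):
    shard_grads = [grad(w_batch, Xs, ys)
                   for Xs, ys in zip(np.array_split(X, 4), np.array_split(y, 4))]
    w_batch -= 0.1 * np.mean(shard_grads, axis=0)
```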
1:11:06 | We tried that one; it doesn't work well for the DBN, where we need to have a
---|
1:11:15 | lot of data. Probably the best for the DBN network is still stochastic gradient.
---|
1:11:22 | If you are using the other networks, some of the later networks that we have talked about,
---|
1:11:31 | they are naturally suited for batch training.
---|
1:11:35 | In some more modern versions of the network, batch training is desirable. They are designed
---|
1:11:47 | for those architectures; it is for parallelization.
---|