0:00:17 | good morning everybody |
---|
0:00:21 | I'm very happy to see you all this morning |
---|
0:00:38 | Professor Li Deng, who is giving the keynote this morning |
---|
0:00:43 | it's not so easy to introduce him, because |
---|
0:00:47 | he is very well known in the community |
---|
0:00:49 | he is a fellow of several societies, such as |
---|
0:00:53 | ISCA, IEEE, and the Acoustical Society of America |
---|
0:00:57 | he has published several hundred papers over the last years |
---|
0:01:05 | and given many talks |
---|
0:01:08 | Li Deng did his PhD at the University of Wisconsin |
---|
0:01:13 | He started his career at the University of Waterloo |
---|
0:01:31 | He will talk to us today about two very important topics |
---|
0:01:37 | very important to all of us |
---|
0:01:39 | one is how to move beyond the GMM |
---|
0:01:44 | it's not so bad because I started my career with the GMM |
---|
0:01:49 | I need some new ideas to do |
---|
0:01:53 | something else |
---|
0:01:54 | the second topic will deal with the dynamics of speech |
---|
0:01:59 | we all know that dynamics are very important |
---|
0:02:13 | we will not take more time before his talk; I prefer to listen to him |
---|
0:02:20 | thank you, Li |
---|
0:02:27 | thank you, and thanks to the organizers and to Haizhou |
---|
0:02:31 | for inviting me to come here to give this talk |
---|
0:02:34 | it is the first time I've attended Odyssey |
---|
0:02:37 | I've read a lot of the things that the community has been doing |
---|
0:02:41 | As Jean has introduced |
---|
0:02:45 | now I think not only in speech recognition but also in speaker recognition |
---|
0:02:51 | there are a few fundamental tools so far |
---|
0:02:56 | one is GMM, one is MFCC |
---|
0:03:01 | in common |
---|
0:03:03 | over the last year, I've learned a lot of other things from this community |
---|
0:03:07 | it turns out that the main point of this talk is to say |
---|
0:03:11 | both of these components have the potential to be replaced, with much better results |
---|
0:03:18 | I will touch a little bit on MFCC; I don't like MFCC |
---|
0:03:23 | so I think Hynek hates MFCC also |
---|
0:03:25 | only recently, since we started doing deep learning |
---|
0:03:29 | there is evidence to show that these components may be replaced, certainly in speech recognition; people |
---|
0:03:36 | have seen that it is coming |
---|
0:03:39 | hopefully, after this talk, you may think about whether in speaker recognition, these components can |
---|
0:03:45 | be replaced |
---|
0:03:46 | to get better performance |
---|
0:03:49 | the outline has three parts |
---|
0:03:54 | In the first part, I will give a quick tutorial |
---|
0:03:59 | I have given several hours of tutorial material |
---|
0:04:01 | over the last few months, so it is a little challenging to compress it down |
---|
0:04:07 | to this short tutorial |
---|
0:04:11 | rather than talking about all the technical details |
---|
0:04:14 | I've decided to just tell the story |
---|
0:04:26 | I also notice that in the next session after this talk |
---|
0:04:30 | there are a few papers related to this |
---|
0:04:35 | Restricted Boltzmann Machines, Deep Belief Network |
---|
0:04:39 | Deep Neural Network in connection with HMM |
---|
0:04:46 | at the end of this talk, you may be convinced that this may be replaced |
---|
0:04:49 | as well |
---|
0:04:49 | something we can consider in the future, with much better speech recognition performance than what |
---|
0:04:56 | we have |
---|
0:05:00 | and also Deep Convex Network, Deep Stacking Network |
---|
0:05:16 | so I think over the last 20 years, people have been working on segment models, hidden |
---|
0:05:22 | dynamic models |
---|
0:05:22 | and 12 years ago, I even had |
---|
0:05:25 | a project with Johns Hopkins University working on this |
---|
0:05:29 | and the results were not very promising |
---|
0:05:35 | now we are beginning to understand why the great idea we proposed there |
---|
0:05:39 | did not work well at that time |
---|
0:05:41 | It is only after we did this that we realized how we can put them together |
---|
0:05:45 | and that is the final part |
---|
0:05:51 | the first part |
---|
0:05:56 | how many people here have ever attended one of my tutorials over the last year? |
---|
0:06:01 | OK, it's a small number of people |
---|
0:06:09 | this one you have to know: deep learning, sometimes called hierarchical learning in the |
---|
0:06:14 | literature |
---|
0:06:14 | essentially refers to a class of machine learning techniques |
---|
0:06:18 | largely developed since 2006 |
---|
0:06:21 | by ... you know actually, this is the key paper |
---|
0:06:26 | that actually introduced a fast learning algorithm for what is called the Deep Belief Network |
---|
0:06:36 | in the beginning, this was mainly done on image recognition, information retrieval and other applications |
---|
0:06:43 | and we, actually Microsoft, was the first to collaborate with University of Toronto |
---|
0:06:51 | researchers to bring that to speech recognition |
---|
0:06:54 | and we showed very quickly that not only for small vocabulary |
---|
0:06:58 | does it do very well, but for large vocabulary it does even better |
---|
0:07:02 | this really happens |
---|
0:07:03 | you know in the past, for small-vocabulary recognition, it worked well; for larger ones sometimes it |
---|
0:07:07 | failed |
---|
0:07:09 | but here, the bigger the task we have, the better the success we have; I will try |
---|
0:07:14 | to analyze |
---|
0:07:14 | for you why that happens |
---|
0:07:17 | and the Boltzmann machine; we will hear about Boltzmann machines in the following talks; I think Patrick |
---|
0:07:22 | has two papers on that |
---|
0:07:24 | and Restricted Boltzmann machine |
---|
0:07:27 | and this is a little bit confusing, so if you read the literature |
---|
0:07:31 | very often deep neural network and deep belief network |
---|
0:07:36 | which are defined over here, are totally different concepts |
---|
0:07:40 | one is a component of another |
---|
0:07:44 | just for the sake of convenience, the authors often get confused |
---|
0:07:49 | they call the deep neural network a DBN |
---|
0:07:52 | and DBN also refers to the Dynamic Bayes network |
---|
0:07:55 | even more confusing |
---|
0:07:57 | one of the things is that |
---|
0:07:59 | for people who attended my tutorial, I gave a quiz |
---|
0:08:06 | people know all this |
---|
0:08:18 | last week, we got a paper accepted for publication, the one I wrote together with |
---|
0:08:24 | Geoffrey Hinton, with 10 authors all together |
---|
0:08:27 | working in this area |
---|
0:08:29 | we try to clarify all this, so we have unified the terminology |
---|
0:08:31 | when you read the literature, you know how to map one to another |
---|
0:08:38 | and the deep auto-encoder, which I don't have time to go into here; and I will say |
---|
0:08:42 | something about some new developments |
---|
0:08:43 | to me it is more interesting because of some limitations of the others |
---|
0:08:50 | This is a hot topic; here I list all the recent workshops and special issues |
---|
0:08:59 | and actually, in Interspeech 2012 |
---|
0:09:03 | you see tens of papers in this area, most in speech recognition |
---|
0:09:07 | and actually, one of the areas has 2 full sessions for |
---|
0:09:16 | this topic, just for recognition |
---|
0:09:19 | and in some others we have more, and a special issue |
---|
0:09:26 | in PAMI, it's mainly related to machine learning aspects and also computer vision applications |
---|
0:09:33 | I tried to put a few speech papers there as well. |
---|
0:09:36 | and a DARPA program |
---|
0:09:40 | 2009, I think last year they stopped |
---|
0:09:47 | and I think in December, there is another workshop related to this topic, it is |
---|
0:09:54 | very popular |
---|
0:09:55 | I think because people see the good results coming, and I hope that |
---|
0:10:00 | one message of this talk is to convince you that this is a good technology, so |
---|
0:10:06 | you may want to seriously consider adopting some of its essence |
---|
0:10:10 | let me tell some stories about this |
---|
0:10:14 | so the first time, this is the first time |
---|
0:10:17 | when deep learning showed promise in speech recognition |
---|
0:10:20 | and activity has grown rapidly since then; that was around |
---|
0:10:24 | two and half years ago |
---|
0:10:28 | or three and half years ago, whatever |
---|
0:10:31 | in NIPS, NIPS is a machine learning workshop |
---|
0:10:35 | every year |
---|
0:10:46 | so I think one year before that |
---|
0:10:49 | so actually, I talked with Geoffrey Hinton |
---|
0:10:52 | a professor at Toronto; he showed me |
---|
0:10:56 | he showed me the Science paper; he actually had a poster there |
---|
0:11:00 | the paper was well written and the diagram was really promising |
---|
0:11:05 | in terms of information retrieval, for document retrieval |
---|
0:11:08 | so I looked at this, and after that we started talking about |
---|
0:11:12 | maybe we can work on speech |
---|
0:11:15 | he worked on speech long time ago |
---|
0:11:24 | so we decided to have this workshop, and actually we had worked together before |
---|
0:11:30 | my colleague Dong Yu, myself and Geoffrey, we actually decided to have |
---|
0:11:37 | a proposal accepted which presented the whole of deep learning and the preliminary work |
---|
0:11:43 | and at that time most people did TIMIT, a small experiment |
---|
0:11:46 | and it turned out that this workshop generated a lot of excitement |
---|
0:11:53 | so I gave a tutorial, 90 minutes |
---|
0:11:58 | about 45 minutes of tutorial each: I talked about speech, and Geoffrey talked about deep |
---|
0:12:02 | learning at that time, and we decided |
---|
0:12:05 | to get people interested in this |
---|
0:12:07 | so the custom is as follows, for NIPS |
---|
0:12:12 | at the end of the final day of the workshop |
---|
0:12:18 | each organizer presents a summary of the workshop |
---|
0:12:24 | and the instruction is that it is a short presentation; it should be funny, |
---|
0:12:30 | it should not be too serious |
---|
0:12:32 | every organizer is instructed to prepare a few slides to summarize |
---|
0:12:41 | their workshop in a way that conveys their impression to the people attending the workshop |
---|
0:12:47 | this is the slide we prepared |
---|
0:13:05 | a speechless summary presentation of the workshop on speech |
---|
0:13:10 | because we didn't really want to talk too much, just go up there and show |
---|
0:13:15 | that slide |
---|
0:13:16 | no speech there, just animations |
---|
0:13:20 | so we said that, we met in this year |
---|
0:13:24 | so this is supposed to be industrial people |
---|
0:13:31 | and this is supposed to be academic people |
---|
0:13:33 | so they are smart and deeper |
---|
0:13:37 | and they say, can you understand human speech? |
---|
0:13:41 | and they say that they can recognize phonemes |
---|
0:13:47 | and they say, that's a nice first step, and what else do you want? |
---|
0:13:52 | and they said they want to recognize speech in noisy environments and then |
---|
0:13:58 | and then he said maybe we can work together |
---|
0:14:01 | so we have all the concepts together |
---|
0:14:14 | that's the whole presentation |
---|
0:14:24 | we decided to do small vocabulary first |
---|
0:14:31 | and then quickly we moved, I think in December of 2010, |
---|
0:14:36 | to very large vocabulary |
---|
0:14:38 | to our surprise, the bigger the vocabulary you have, the better the success you get |
---|
0:14:43 | very unusual |
---|
0:14:44 | and I myself analyzed the errors in detail |
---|
0:14:47 | you know we have been working on this for some 20 years |
---|
0:14:54 | one surprise to me, which convinced me to work in this area personally, |
---|
0:15:02 | was that every error pattern that I see from this recognizer is very different from |
---|
0:15:08 | the HMM |
---|
0:15:08 | absolutely, it is better, and the errors are very different; that means it is good for |
---|
0:15:13 | me to do that |
---|
0:15:14 | anyway, let me talk about the DBN |
---|
0:15:20 | one of the concepts is the deep belief network; that is the one Hinton published in |
---|
0:15:28 | 2006 |
---|
0:15:28 | 2 papers there |
---|
0:15:34 | nothing to do with speech, it's called the deep belief network. It's pretty hard to read |
---|
0:15:38 | if you have not been in the field for a while |
---|
0:15:40 | and this is another DBN, called the dynamic Bayesian network |
---|
0:15:44 | a few months ago, Geoffrey sent me an email saying, look at this |
---|
0:15:50 | acronym DBN, DBN |
---|
0:15:59 | he suggests that before you give any talk you check |
---|
0:16:03 | mostly, in speech recognition, people use DBN to mean Dynamic Bayes Network |
---|
0:16:09 | anyway, I will give a little bit of technical content on it; time is running out quickly |
---|
0:16:17 | number one, the first concept is the restricted Boltzmann machine |
---|
0:16:21 | actually, I have 20 slides, so I just take one slide out of these 20 |
---|
0:16:26 | so think about this as the visible layer |
---|
0:16:30 | it can be the label; the label can be one of the visible units |
---|
0:16:33 | when we do discriminative learning; the other thing is the observation, think of it as the observation, and the other thing |
---|
0:16:38 | forget about this |
---|
0:16:39 | think about MFCC, think about the label |
---|
0:16:43 | or speech labels, senones or other labels |
---|
0:16:47 | so we put them together as the observation and we have a hidden layer here |
---|
0:16:51 | and then the difference between the Boltzmann machine and the neural network is that |
---|
0:16:57 | the standard neural network is one-directional, from the bottom up |
---|
0:17:00 | the Boltzmann machine is bidirectional, you can go up and down; now connections between |
---|
0:17:04 | neighboring units within this layer and within that layer are cut off |
---|
0:17:08 | if you don't do that, it is very hard to learn |
---|
0:17:11 | so one of the reasons that in deep learning they start with the restricted Boltzmann machine |
---|
0:17:16 | is that |
---|
0:17:17 | if you have bidirectional connections |
---|
0:17:21 | and if you do all the detailed math, write the energy function, you can write down |
---|
0:17:25 | the conditional probabilities of the hidden units given the visible ones, and the other way around. |
---|
0:17:29 | so if you set up the energy right, you can actually get the conditional probability of this |
---|
0:17:34 | given this to be Gaussian |
---|
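To make the equations being described concrete, here is a minimal worked form of a Gaussian-Bernoulli RBM (my own notation, not the slide's): the energy function and the two conditionals that follow from it, the second of which is the Gaussian just mentioned.

```latex
% Gaussian-Bernoulli RBM: real-valued visible units v, binary hidden units h (sketch, my notation)
E(\mathbf{v},\mathbf{h}) = \sum_i \frac{(v_i-b_i)^2}{2\sigma_i^2}
  - \sum_j c_j h_j - \sum_{i,j}\frac{v_i}{\sigma_i}\,W_{ij}\,h_j ,
\qquad
p(h_j=1\mid\mathbf{v}) = \sigma\!\Big(c_j + \sum_i W_{ij}\,\frac{v_i}{\sigma_i}\Big),
\qquad
p(v_i\mid\mathbf{h}) = \mathcal{N}\!\Big(v_i;\; b_i + \sigma_i\sum_j W_{ij}h_j,\; \sigma_i^2\Big)
```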
0:17:35 | which is something people like; this is conditional, and you can interpret the whole thing |
---|
0:17:41 | as a Gaussian mixture model |
---|
0:17:42 | so you may think that this is just a Gaussian mixture model, so I can do it |
---|
0:17:47 | each other |
---|
0:17:48 | the difference is that with this you can get an almost exponentially large number of mixture components |
---|
0:17:55 | rather than finite |
---|
0:17:56 | I think in speaker recognition, it's about 400 or 1000 mixtures, whatever |
---|
0:18:06 | and here if you have 100 hidden units |
---|
0:18:11 | you get an almost unlimited number of components |
---|
0:18:13 | but they are tied together |
---|
0:18:15 | Geoffrey has done very detailed mathematics to show that this is a very powerful way of |
---|
0:18:22 | doing Gaussian modeling |
---|
0:18:23 | actually, you get a product of experts rather than a mixture of experts |
---|
0:18:33 | to me that is one of the key insights that we got from him |
---|
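Written out, the point about exponentially many tied components (again in my own notation): marginalizing over H binary hidden units gives

```latex
p(\mathbf{v}) = \sum_{\mathbf{h}\in\{0,1\}^{H}} p(\mathbf{h})\,
  \mathcal{N}\!\big(\mathbf{v};\; \mathbf{b} + \boldsymbol{\sigma}\odot(W\mathbf{h}),\; \mathrm{diag}(\boldsymbol{\sigma}^2)\big)
```

a mixture with 2^H components whose means all share the same W, so 100 hidden units implicitly index on the order of 2^100 tied components, versus the few hundred or thousand free components of an ordinary GMM; and because each hidden unit contributes a multiplicative factor to the unnormalized density, the model behaves as a product of experts rather than a sum.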
0:18:37 | that is RBM, so think about this as RBM |
---|
0:18:40 | think about this as visible |
---|
0:18:44 | this is the observation and this is the hidden layer; we put them together and we have it |
---|
0:18:47 | it is very hard to do speech recognition on it |
---|
0:18:52 | this is a generative model, you can do speech recognition, but if you do that, |
---|
0:18:57 | the result is not very good |
---|
0:18:59 | when dealing with discrimination tasks with a generative model you are limited because |
---|
0:19:07 | you don't directly focus on what you want |
---|
0:19:11 | however, you can use it as a building block |
---|
0:19:16 | to build a DBN (deep belief network) |
---|
0:19:18 | the way we do it actually in Toronto |
---|
0:19:24 | if we think about this as a building block |
---|
0:19:28 | you can do learning; after you do the learning of this, I just skip |
---|
0:19:33 | it would take a whole hour to talk about that learning, but assume that you know |
---|
0:19:35 | how to do that |
---|
0:19:36 | after you learn this, you can treat this as feature extraction from what you get |
---|
0:19:40 | here |
---|
0:19:40 | and you treat it as stacking up |
---|
0:19:43 | deep learning researchers argue that this becomes the feature of that |
---|
0:19:52 | and then you can go further; I think of it as a brain-like architecture |
---|
0:19:56 | think of the visual cortex, 6 layers |
---|
0:19:59 | you can build up; whatever is learned over here becomes the hidden feature |
---|
0:20:03 | hopefully, if you learn that right you can extract the important information from data that |
---|
0:20:08 | you have |
---|
0:20:08 | and then you can use features of the features and keep stacking up |
---|
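Here is a minimal sketch, in Python with numpy, of the layer-by-layer stacking just described: train one RBM (with the CD-1 learning the speaker skips over), then treat its hidden activations as features and train the next RBM on them. The binary units, layer sizes and learning rate are illustrative assumptions; a real speech front end would use Gaussian visible units over real-valued features.

```python
# Greedy layer-wise stacking of RBMs (sketch): each RBM's hidden activations become
# the "data" for the next RBM above it.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.01, batch=64):
    """Train a binary-binary RBM with one step of contrastive divergence (CD-1)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch):
            v0 = data[i:i + batch]
            h0 = sigmoid(v0 @ W + b_h)                       # positive phase
            h_sample = (rng.random(h0.shape) < h0).astype(float)
            v1 = sigmoid(h_sample @ W.T + b_v)               # one Gibbs step (reconstruction)
            h1 = sigmoid(v1 @ W + b_h)
            W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)      # CD-1 update: data term minus reconstruction term
            b_v += lr * (v0 - v1).mean(axis=0)
            b_h += lr * (h0 - h1).mean(axis=0)
    return W, b_h

def stack_rbms(data, layer_sizes):
    """Greedy stacking: hidden activations of one RBM are the input of the next ("features of features")."""
    weights, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b_h = train_rbm(layer_input.copy(), n_hidden)
        weights.append((W, b_h))
        layer_input = sigmoid(layer_input @ W + b_h)
    return weights

# toy usage: 500 random binary vectors, three stacked hidden layers
toy = (rng.random((500, 100)) < 0.3).astype(float)
pretrained = stack_rbms(toy, [80, 60, 40])
```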
0:20:12 | why are we stacking up? actually there are interesting theoretical results |
---|
0:20:16 | that actually show that if you unroll this single DBN |
---|
0:20:20 | sorry, one layer of RBM |
---|
0:20:23 | in terms of a belief network, it is actually equivalent to one of infinite depth |
---|
0:20:28 | because every time, this is related to learning |
---|
0:20:33 | learning actually goes up and down, and every time you go up and down, it |
---|
0:20:37 | can be shown that |
---|
0:20:39 | you actually get one layer higher, now the restriction here is that |
---|
0:20:46 | all the weights have to be tied, so it is not very powerful |
---|
0:20:50 | but now we can untie the weights by doing separate learning |
---|
0:20:54 | when we do that, it is a very powerful model |
---|
0:20:55 | anyway, so the reason why this one goes down and this one goes up and |
---|
0:21:00 | down is that if you |
---|
0:21:02 | actually, I don't have time to go into it here, but believe me |
---|
0:21:05 | so if you stack up this one, one layer up |
---|
0:21:10 | and then you can mathematically show that this is equivalent to having |
---|
0:21:15 | just a one-layer RBM at the top and then a belief network going down |
---|
0:21:20 | and this is actually called a Bayes network |
---|
0:21:23 | so you can see that the belief network is similar to a Bayes network |
---|
0:21:26 | but now if you look at this, it is very difficult to learn |
---|
0:21:30 | so for anything going down over here, there is something in machine learning called the explaining |
---|
0:21:36 | away effect |
---|
0:21:37 | so the inference becomes very hard, generation is easy |
---|
0:21:41 | and then the next invention in this whole theory is that |
---|
0:21:47 | just reverse the order |
---|
0:21:51 | and you can turn it into a neural network; it turns out that there is not much theory |
---|
0:21:56 | for why that works well |
---|
0:21:59 | but in practice it works really well |
---|
0:22:00 | actually, I am looking into some of the theory of this |
---|
0:22:04 | so this is the full picture of the DBN |
---|
0:22:08 | so the DBN consists of bidirectional connections here |
---|
0:22:11 | and then single-direction connections going down |
---|
0:22:13 | so if you do this, you actually can use that as a generative model that you |
---|
0:22:17 | can do recognition with |
---|
0:22:18 | unfortunately, the result is not good |
---|
0:22:21 | there were a lot of steps for people to reach the current state |
---|
0:22:25 | I am going to show you all the steps here |
---|
0:22:27 | so number one, the RBM is useful, it gives you feature extraction |
---|
0:22:31 | and you stack up RBMs a few layers up |
---|
0:22:34 | and you can get a DBN; actually at the end you need to do some discriminative |
---|
0:22:39 | learning at the end |
---|
0:22:40 | uh, so let's see, but generally, the capacity is just very good |
---|
0:22:46 | this was the first time I saw |
---|
0:22:48 | the generative capability from Geoffrey, I was also amazed |
---|
0:22:53 | so this is the example that he gave me |
---|
0:22:59 | so if you train, using these digits |
---|
0:23:05 | the database is called MNIST |
---|
0:23:12 | an image database everybody uses, like our TIDIGITS in speech |
---|
0:23:19 | you put them here and you learn it |
---|
0:23:21 | you know according to this standard technique |
---|
0:23:24 | you now actually put one of the digits here; say you want to synthesize a 1 |
---|
0:23:29 | you put a 1 here and all the others are 0, and then you run it |
---|
0:23:35 | and you can actually get something really nice, also if you put a 0 here |
---|
0:23:37 | this is different from the traditional generative process |
---|
0:23:42 | the reason why they are different is the stochastic process |
---|
0:23:46 | it can memorize |
---|
0:23:50 | some of the numbers are corrupted |
---|
0:23:53 | but most of the time you get realistic ones |
---|
0:23:54 | last time, in one of the tutorials I gave |
---|
0:23:58 | I gave the tutorial and showed this result; there were a number of speech synthesis people in |
---|
0:24:03 | the audience |
---|
0:24:04 | they said, that is great, I will do speech synthesis now |
---|
0:24:07 | but you get one fixed output, not a human-like one |
---|
0:24:10 | humans, when they write, produce different writing every time |
---|
0:24:14 | immediately, they went back to write a draft proposal, and asked me to help them |
---|
0:24:22 | this is very good; with the stochastic components there, the result looks like what humans are doing |
---|
0:24:29 | now, we want to use it for recognition; this is the architecture |
---|
0:24:39 | I was amazed; I had a lot of discussion with Patrick yesterday |
---|
0:24:42 | I just feel that when you have a generative model you really need to |
---|
0:24:54 | you put the image here, and move up here, and this becomes the feature |
---|
0:24:58 | and all you do is turn on this unit, one by one |
---|
0:25:00 | and run for a long time until convergence |
---|
0:25:04 | and you look at the probability for this |
---|
0:25:05 | to get the number, OK |
---|
0:25:06 | and turn on the other units, and run and run, and see which number is high |
---|
0:25:13 | I suggest that you don't do that and waste your time |
---|
0:25:16 | number one, it takes a long time to do recognition; number two, we don't know how |
---|
0:25:21 | to generalize it to sequences |
---|
0:25:23 | and he said the result is not very good, so we did not do it |
---|
0:25:27 | we abandoned the concept of doing everything generatively; that's what we did. |
---|
0:25:36 | and that's how the deep neural network was born |
---|
0:25:39 | so all you do is that you just treat all the connections to be |
---|
0:25:47 | that is why at the end my conclusion is that the theory of deep learning is |
---|
0:25:51 | very weak |
---|
0:25:52 | ideally the DBN goes down; it is a generative model |
---|
0:25:57 | in practice, you say it is not good, so just forget about this, think about |
---|
0:26:01 | and eliminate this, and make the whole set of weights go up |
---|
0:26:05 | we modify this; the easiest way to do it is to just forget about this, you know |
---|
0:26:09 | just change it to make it go up, make this go down again; people don't like it |
---|
0:26:14 | in the beginning, I supposed it was horrible, that it was crazy to do it |
---|
0:26:19 | it just breaks the theory used to build the DBN |
---|
0:26:22 | finally, what gives the best result, what we do, is really the same as |
---|
0:26:28 | what the multilayer perceptron has been doing except that it |
---|
0:26:33 | has many more layers |
---|
0:26:35 | and now if you do that, typically you randomize |
---|
0:26:40 | you know, all the weights; then you hear the standard arguments from |
---|
0:26:44 | 20 some years ago saying that |
---|
0:26:46 | mathematics proves that the deeper you go, the more |
---|
0:26:51 | the lower the level you go to, because the label is at the top level |
---|
0:26:54 | so you do back-propagation, taking the derivative of the error from here going down to here |
---|
0:26:59 | the gradient is very small |
---|
0:27:02 | you know, the derivative of the sigmoid function is sigmoid times (1 - sigmoid) |
---|
0:27:05 | so the lower you go, the greater the chance that the gradient term vanishes |
---|
0:27:14 | they didn't even do back-propagation for deep networks; so looking at it, it seemed impossible to |
---|
0:27:20 | learn, so they gave up |
---|
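A compact form of the back-propagation argument just made (standard algebra, not from the slides): with sigmoid hidden layers the error signal at layer l obeys

```latex
\boldsymbol{\delta}^{(l)} = \mathrm{diag}\!\big(\sigma'(\mathbf{z}^{(l)})\big)\,
  \big(W^{(l+1)}\big)^{\!\top}\boldsymbol{\delta}^{(l+1)},
\qquad
\sigma'(z) = \sigma(z)\big(1-\sigma(z)\big) \le \tfrac{1}{4}
```

so going down many layers multiplies the signal by a factor of at most 1/4 per layer (times the weight norms), which is why training very deep networks from random initialization looked hopeless.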
0:27:21 | and then now, one of the very interesting things that comes out of deep learning is |
---|
0:27:27 | to say |
---|
0:27:28 | rather than using random numbers, it can be interesting to use the DBN to plug in there; that is |
---|
0:27:32 | something I don't like |
---|
0:27:33 | look at the argument for why it is good; what we do is that we train |
---|
0:27:38 | this DBN |
---|
0:27:38 | over here |
---|
0:27:41 | the DBN weights: you just use the generative model for the training |
---|
0:27:46 | and once you have trained it, you fix these weights, and you just copy all the weights into |
---|
0:27:52 | this deep neural network to initialize it |
---|
0:27:54 | after that you do back-propagation |
---|
0:27:58 | again, these gradients are very small, but it's OK |
---|
0:28:02 | you have already got the DBN over here |
---|
0:28:03 | you have got RBMs; it should be RBM, not DBN anymore |
---|
0:28:09 | it is not too bad, |
---|
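A minimal numpy sketch of the recipe just described: copy generatively pre-trained weights into a deep MLP and then run supervised back-propagation. The "pretrained" matrices below are random stand-ins for what the RBM stacking sketched earlier would produce, and the sizes, learning rate and single update step are illustrative assumptions.

```python
# Initialize a DNN from (stand-in) pre-trained hidden-layer weights, then fine-tune with backprop.
import numpy as np

rng = np.random.default_rng(0)
sizes = [429, 2048, 2048, 100]   # stacked-MFCC-like input, two hidden layers, senone-like output

# stand-ins for generatively pre-trained hidden-layer weights (429x2048 and 2048x2048)
pretrained = [0.01 * rng.standard_normal((m, n)) for m, n in zip(sizes[:-2], sizes[1:-1])]

# the DNN: hidden layers copied from "pre-training", output layer random
W = pretrained + [0.01 * rng.standard_normal((sizes[-2], sizes[-1]))]
b = [np.zeros(n) for n in sizes[1:]]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def finetune_step(x, y_onehot, lr=0.1):
    """One step of cross-entropy back-propagation: the 'fine-tuning' after copying the weights."""
    acts = [x]
    for Wl, bl in zip(W[:-1], b[:-1]):
        acts.append(sigmoid(acts[-1] @ Wl + bl))      # forward through the copied hidden layers
    probs = softmax(acts[-1] @ W[-1] + b[-1])
    delta = (probs - y_onehot) / len(x)               # gradient at the softmax output
    for l in range(len(W) - 1, -1, -1):
        grad_W, grad_b = acts[l].T @ delta, delta.sum(axis=0)
        if l > 0:                                     # propagate down before updating this layer
            delta = (delta @ W[l].T) * acts[l] * (1 - acts[l])
        W[l] -= lr * grad_W
        b[l] -= lr * grad_b

# toy usage: one mini-batch of random "frames" with random senone-like labels
x = rng.standard_normal((32, sizes[0]))
y = np.eye(sizes[-1])[rng.integers(0, sizes[-1], 32)]
finetune_step(x, y)
```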
0:28:15 | so you see exactly how to train this; it is just that using random initialization is |
---|
0:28:25 | not good |
---|
0:28:27 | if you use the DBN's weights over here it is not too bad, but over here, you |
---|
0:28:32 | modify them |
---|
0:28:32 | you just run recognition, for MNIST |
---|
0:28:38 | the error goes down to 1.2%; that is all Geoffrey Hinton's idea |
---|
0:28:47 | and he published a paper about this; at that time, it seemed to be |
---|
0:28:51 | very good |
---|
0:28:52 | but I am going to tell you that, for that MNIST result of 1.2% error, with a few |
---|
0:28:58 | more generations of networks, as I will show you, we are able to get 0.7% |
---|
0:29:05 | and the same kind of philosophy carries over to speech recognition |
---|
0:29:12 | I will go quickly, in speech all of you think about how to do sequence |
---|
0:29:17 | modeling |
---|
0:29:18 | it is very simple |
---|
0:29:21 | now we have deep neural network |
---|
0:29:24 | what we do is we normalize that using a softmax |
---|
0:29:28 | to make that, similar to the talk yesterday, a kind of calibration |
---|
0:29:35 | and we get posterior probabilities, and dividing by the prior you get generative probabilities, and you just |
---|
0:29:40 | use the HMM to do that |
---|
0:29:42 | that is why it is called the DNN-HMM |
---|
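The conversion just described is the standard "hybrid" formula; writing it out (notation mine):

```latex
P(s\mid\mathbf{x}_t) = \frac{\exp(z_{t,s})}{\sum_{s'}\exp(z_{t,s'})}
\;\;\text{(DNN softmax over states)},
\qquad
p(\mathbf{x}_t\mid s) \;\propto\; \frac{P(s\mid\mathbf{x}_t)}{P(s)}
```

where P(s) is the state (senone) prior estimated from the training alignment; these scaled likelihoods simply replace the GMM scores inside the existing HMM decoder.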
0:29:49 | the first experiment we did was on TIMIT |
---|
0:29:53 | with just phonemes, easy |
---|
0:29:55 | each state is one of the three states of a phoneme; very good results, I can show |
---|
0:29:59 | you |
---|
0:30:00 | then we moved to large vocabulary; one of the things that we do in our company |
---|
0:30:05 | you know, at Microsoft we call them senones |
---|
0:30:14 | rather than having a phone, we cut it into context-dependent units |
---|
0:30:18 | that becomes our infrastructure |
---|
0:30:20 | so we don't change all this |
---|
0:30:22 | rather than using 40 phones, what happens if we use 9000? |
---|
0:30:25 | you know, the senones; a long time ago people could not do that, 9000 outputs here, crazy |
---|
0:30:30 | 300, 5000, every time you have 15 million weights here, it is very hard to |
---|
0:30:37 | train |
---|
0:30:37 | now we bought very big machines |
---|
0:30:39 | a GPU machine, parallel computing |
---|
0:30:45 | so we replace this by ... it can be very large |
---|
0:30:52 | this is very large, and input is also very large as well |
---|
0:31:01 | so we use a big window |
---|
0:31:03 | we have a big output, big input, very deep, so there are 3 components |
---|
0:31:09 | why a big input, a long window? |
---|
0:31:11 | which could not be done with the HMM |
---|
0:31:13 | do you know why? because |
---|
0:31:15 | I had a discussion with some experts; it could not be done for speaker recognition with the |
---|
0:31:22 | UBM |
---|
0:31:22 | for speech recognition, the reason why it couldn't be done is because |
---|
0:31:26 | first of all you have to use diagonal covariances in the HMM |
---|
0:31:32 | but it's not big; if you make it too big, the Gaussian has a sparseness problem |
---|
0:31:37 | in the covariance matrix |
---|
0:31:39 | in the end, all we do is make it as simple as possible, just plug in the whole |
---|
0:31:43 | long window |
---|
0:31:44 | and then feed the whole thing in; we get millions of parameters |
---|
0:31:48 | typically, this number is around 2000 |
---|
0:31:50 | 2000 here, every layer, 4 million parameters here, another 4 million, another 4 million |
---|
0:31:55 | and just use a GPU to train the whole model together |
---|
0:31:57 | it is not too bad |
---|
0:31:59 | so if we use about 11 frames |
---|
0:32:04 | now, it is even extended to 30 frames |
---|
0:32:11 | but with the HMM, we never imagined doing that |
---|
0:32:14 | we don't even normalize this, we just use the raw |
---|
0:32:16 | values over here |
---|
0:32:17 | in the beginning, I still used MFCC, delta MFCC, delta-delta |
---|
0:32:23 | multiplied by 11 or 15, whatever |
---|
0:32:26 | then we have a big input |
---|
0:32:28 | which is still small compared with the hidden layer size |
---|
0:32:31 | and we train this whole thing, and everything works really well |
---|
0:32:33 | and we don't need to worry about correlation modeling, because the correlation is automatically captured by |
---|
0:32:38 | all the weights here |
---|
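A tiny sketch of the long-window input construction being described (frame counts and feature dimensions are illustrative assumptions, not the exact system configuration):

```python
# Stack +/- 5 neighbouring frames around each frame so the network sees a long acoustic
# context: the kind of wide input a diagonal-covariance GMM-HMM could not comfortably model.
import numpy as np

def stack_context(features, context=5):
    """features: (num_frames, dim) array, e.g. MFCC + deltas; returns (num_frames, dim*(2*context+1))."""
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(features)] for i in range(2 * context + 1)])

utterance = np.random.randn(300, 39)      # 300 frames of 39-dim MFCC+deltas (toy data)
dnn_input = stack_context(utterance)      # shape (300, 429): the "11 frames" input
```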
0:32:40 | the reason I bring it up here is just to show you that this is not just |
---|
0:32:47 | phone |
---|
0:32:55 | we went through the history and the literature; we never saw this applied to speech until this |
---|
0:33:02 | first work |
---|
0:33:03 | now just to give you a picture here: the GMM, everybody knows |
---|
0:33:09 | HMM, GMM; so the whole point is to show you that |
---|
0:33:15 | with the same kind of architecture, if you look at the HMM |
---|
0:33:18 | you can also see the GMM is very shallow |
---|
0:33:21 | all you do is that for each state the output is one score from the GMM |
---|
0:33:26 | over here, you can see many layers |
---|
0:33:28 | so you build features up layer by layer; this shows deep versus shallow |
---|
0:33:33 | here is the result. We wrote the paper together, it will appear in November |
---|
0:33:41 | and that paper summarizes |
---|
0:33:45 | the research of four groups together over the last three years |
---|
0:33:49 | since 2009 |
---|
0:33:51 | University of Toronto, Google, and |
---|
0:33:55 | our Microsoft Research was the first one that |
---|
0:33:58 | actually did serious work on speech recognition |
---|
0:34:01 | Google data and IBM data |
---|
0:34:03 | they all confirm the same kind of effectiveness |
---|
0:34:05 | here is the TIMIT result |
---|
0:34:10 | it is very nice; everybody thinks that TIMIT is very small |
---|
0:34:14 | if you don't start with this, you get scared away. |
---|
0:34:18 | I will come back to this in the 2nd part of the talk; this is the monophone |
---|
0:34:24 | hidden trajectory model I did many years ago |
---|
0:34:26 | to get this number took 2 years |
---|
0:34:29 | I wrote the training algorithm, and my colleagues wrote the decoder for me; this |
---|
0:34:36 | is a very good number |
---|
0:34:38 | for TIMIT, and it is very hard to do decoding |
---|
0:34:48 | the first time we tried this DBN |
---|
0:34:50 | deep neural network |
---|
0:34:55 | I wrote this paper with ...; what we do is MMI training |
---|
0:35:03 | you can do back-propagation through the MMI objective function for the whole sequence |
---|
0:35:09 | so we got 22%, which is almost 3% better |
---|
0:35:18 | and then we looked at the errors; between this and this they are very different, especially for |
---|
0:35:22 | very short segments |
---|
0:35:23 | there it is not really good, but on the very long side it is much better |
---|
0:35:27 | I've never seen that before |
---|
0:35:30 | so doing this, this kind of work is compared with the HMM |
---|
0:35:34 | this result was done some 20 years ago |
---|
0:35:37 | this is the error, 27% error, around 4% higher |
---|
0:35:42 | over around 10 or 15 years, the error dropped 3% |
---|
0:35:50 | and this and this are very similar in terms of error |
---|
0:35:58 | so you see the error is very different |
---|
0:36:06 | so the first experiment is voice search |
---|
0:36:10 | at that time, voice search was a very important task, and now voice search is |
---|
0:36:16 | everywhere |
---|
0:36:18 | Siri has voice search, in Windows Phone we have that |
---|
0:36:23 | even in Android phones |
---|
0:36:25 | very important topic |
---|
0:36:27 | so we have data, we have worked on this one, very large vocabulary |
---|
0:36:33 | in the summer of 2010 |
---|
0:36:35 | we were the first in our group to just try it, because it is so different |
---|
0:36:44 | from TIMIT |
---|
0:36:47 | and we actually didn't even change the parameters at all |
---|
0:36:49 | all the parameters, the learning rate |
---|
0:36:52 | came from our previous work on TIMIT |
---|
0:36:54 | and we got down here, that is the paper we wrote |
---|
0:36:57 | it just appeared this year |
---|
0:37:04 | and then this is the result that we got |
---|
0:37:07 | if you actually want to look at exactly how this is done |
---|
0:37:11 | most of what is provided |
---|
0:37:13 | in this paper is a recipe |
---|
0:37:15 | to tell you how to train the system |
---|
0:37:17 | but you need to use a GPU to implement it; without a GPU, it takes 3 months just |
---|
0:37:21 | for experiments |
---|
0:37:22 | for large vocabulary; with a GPU it is really quick |
---|
0:37:25 | most of it is the same, you do this, you do this |
---|
0:37:32 | we try to provide as much theory as possible |
---|
0:37:36 | so if you want to know how to do this in some applications take a |
---|
0:37:40 | look at this |
---|
0:37:40 | so this is the first time |
---|
0:37:44 | we saw the effects of increasing the depth of the DNN for large vocabulary |
---|
0:37:50 | so in our systems, the accuracy goes up like this |
---|
0:37:58 | and the baseline, using the HMM with discriminative MPE training |
---|
0:38:05 | is around 65; this is just the neural network |
---|
0:38:08 | a single hidden layer network is doing better than all this |
---|
0:38:12 | and as you increase it, you get more |
---|
0:38:15 | when you go there, there is some kind of overfitting; the data is not much, we labeled |
---|
0:38:19 | 24 hours |
---|
0:38:20 | of data at that time, so we said |
---|
0:38:23 | do more, so we tried 48 hours |
---|
0:38:25 | and this one drops a lot |
---|
0:38:26 | so the more data you have, the better you can get |
---|
0:38:29 | some of my colleagues asked why we don't use Switchboard |
---|
0:38:36 | I said this is too big for me, we won't do it |
---|
0:38:38 | but actually, we did do Switchboard |
---|
0:38:40 | and then we got a huge gain |
---|
0:38:41 | even more gain than I showed you here |
---|
0:38:43 | just because of more data |
---|
0:38:45 | so the typical problem |
---|
0:38:46 | is not really spontaneous speech, but this one is spontaneous as well |
---|
0:38:52 | so this works for spontaneous speech as well |
---|
0:38:55 | it seems that with limited data we go up here quite a lot |
---|
0:38:58 | and then you get 1 or 2 orders of magnitude more data there |
---|
0:39:02 | so you have many more GPUs to run, much better software |
---|
0:39:05 | everything runs well |
---|
0:39:08 | it turns out that the same kind of recipe |
---|
0:39:10 | we published over here |
---|
0:39:14 | let me show you some of the results |
---|
0:39:16 | this is the result, this is the table in our recent paper |
---|
0:39:24 | with the Toronto group |
---|
0:39:29 | so the standard GMM-based HMM |
---|
0:39:31 | with 300 hours of data |
---|
0:39:33 | has an error rate of about 23 percent |
---|
0:39:38 | we very carefully |
---|
0:39:43 | tuned the parameters; this parameter has been tuned (the number of layers) |
---|
0:39:44 | and we got from here to here |
---|
0:39:47 | and that actually attracted a lot of people's attention |
---|
0:39:49 | and then we realized that |
---|
0:39:53 | we got 2000 hours, and the result from that is even better |
---|
0:39:56 | and at that time, that was the Microsoft result |
---|
0:40:01 | and then one recent paper publishes the result that |
---|
0:40:08 | of course, when you do that people argue that you have 29 million parameters |
---|
0:40:14 | and people always you know |
---|
0:40:16 | nit-picking; people in the speech community |
---|
0:40:19 | obviously, uh, you've got more parameters, of course you're going to win, right? |
---|
0:40:21 | so what if you use the same number of parameters? |
---|
0:40:23 | we said fine, we'll do that |
---|
0:40:24 | so we used a sparseness constraint |
---|
0:40:26 | to actually prune the weights |
---|
0:40:28 | and the number of non-zero parameters is 15 million |
---|
0:40:33 | with the smaller number of parameters, |
---|
0:40:34 | we get an even better result |
---|
0:40:36 | it's amazing... the capacity of the deep network is just tremendous |
---|
0:40:40 | you cut the parameters |
---|
0:40:41 | in the beginning, we don't |
---|
0:40:42 | typically, you expect it to be similar, right |
---|
0:40:44 | you get rid of the small ones |
---|
0:40:46 | you get a slight gain |
---|
0:40:47 | but that doesn't carry over once we get more data anyway |
---|
0:40:49 | so this is, maybe |
---|
0:40:50 | within the statistical variation, but so |
---|
0:40:53 | with the smaller number of parameters |
---|
0:40:55 | than the GMM-HMM which is trained using discriminative training |
---|
0:40:58 | we get about a 30-something % error reduction |
---|
0:41:02 | more than on TIMIT, and also more than on |
---|
0:41:06 | our voice search |
---|
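The kind of weight pruning just described, to match parameter counts, can be sketched as simple magnitude-based sparsification (the threshold-by-count approach here is an illustrative assumption; the paper's exact sparseness method may differ):

```python
# Magnitude pruning: keep only the largest-magnitude weights so the number of
# non-zero parameters matches a budget (e.g. the baseline system's parameter count).
import numpy as np

def prune_to_budget(weight_matrices, budget):
    """Zero out the smallest-magnitude weights across all layers, keeping about `budget` non-zeros."""
    all_mags = np.concatenate([np.abs(W).ravel() for W in weight_matrices])
    if budget >= all_mags.size:
        return weight_matrices
    threshold = np.partition(all_mags, all_mags.size - budget)[all_mags.size - budget]
    return [np.where(np.abs(W) >= threshold, W, 0.0) for W in weight_matrices]

# toy usage: three random layers, pruned from ~5.3M weights down to ~1M non-zeros
rng = np.random.default_rng(0)
layers = [rng.standard_normal(s) for s in [(429, 2048), (2048, 2048), (2048, 100)]]
pruned = prune_to_budget(layers, budget=1_000_000)
print(sum(int((W != 0).sum()) for W in pruned))   # approximately 1,000,000
```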
0:41:10 | and then this is another paper, and then IBM came along |
---|
0:41:12 | and then Google came along; they saw, you know, a better result, and I think they |
---|
0:41:16 | wanted to do it as well |
---|
0:41:18 | so you can see that this is Google's result |
---|
0:41:19 | this is about 5000 hours, amazing right |
---|
0:41:22 | they just have better infrastructure |
---|
0:41:24 | mapping and all that, so they managed to do that on 5000, 6000 hours |
---|
0:41:29 | so this number just came up |
---|
0:41:32 | actually that number |
---|
0:41:33 | so actually this will be in the Interspeech papers, if you go to see them |
---|
0:41:38 | so one of the things Google does is that they don't put in this baseline result |
---|
0:41:42 | they just give a number, |
---|
0:41:44 | just ask what number they have |
---|
0:41:47 | so... sorry.. sorry |
---|
0:41:50 | with more data they have it; with the same data they don't have the number; either |
---|
0:41:52 | they |
---|
0:41:53 | they just don't bother to do it |
---|
0:41:54 | they all believe more data is better |
---|
0:41:56 | so with a lot more data they got this |
---|
0:41:58 | and then we just with about how many, about three |
---|
0:42:01 | uh, with this much data |
---|
0:42:04 | it is about 12%; it is better when you get more data |
---|
0:42:07 | they should put a number here, anyway |
---|
0:42:09 | so I'm, we're not nit-picking on this |
---|
0:42:12 | and this is the number I showed, this is Microsoft's result, the number from here to here |
---|
0:42:16 | from here to here for different |
---|
0:42:18 | these are 2 different test sets |
---|
0:42:19 | and all these, all the people are here, you should know, this is very important |
---|
0:42:23 | for our review |
---|
0:42:24 | ah now, this is the IBM result |
---|
0:42:27 | ah sorry, this is the voice search result that I showed you earlier |
---|
0:42:29 | this is 20% |
---|
0:42:31 | it's not bad |
---|
0:42:32 | that is because you have only 20 hours of data, so |
---|
0:42:34 | it turns out the more data you have |
---|
0:42:36 | the more error reduction you have |
---|
0:42:38 | and for TIMIT, we get only about 3-4 absolute, about ten-something percent relative |
---|
0:42:43 | now, and this is the |
---|
0:42:46 | so this broadcast news result is from IBM |
---|
0:42:50 | and I heard that at Interspeech, they have a much better result than this |
---|
0:42:55 | so if you're interested, look at it |
---|
0:42:57 | my understanding is that |
---|
0:42:59 | from what I heard, their result is comparable to this |
---|
0:43:02 | some people say even better |
---|
0:43:04 | so if you want to know exactly what IBM is doing: they have an even better |
---|
0:43:07 | infrastructure |
---|
0:43:08 | in terms of distributed learning |
---|
0:43:10 | compared to most other places |
---|
0:43:12 | but anyway so this kind of error reduction |
---|
0:43:15 | has been unheard of in the history of this area, I mean, for about 25 years |
---|
0:43:20 | and the first time we got these results, we were just stunned |
---|
0:43:22 | and Google, this is also Google's result, and even YouTube speech, which is much more |
---|
0:43:27 | difficult |
---|
0:43:27 | spontaneous with all the noise |
---|
0:43:29 | they also managed to get something from here |
---|
0:43:31 | this time they're pretty honest to put this over here with the same amount of |
---|
0:43:34 | data |
---|
0:43:35 | 14 hours they got more |
---|
0:43:37 | but in our case, when we also got 2000 hours, we actually got more gain |
---|
0:43:40 | rapid gain ah yes |
---|
0:43:41 | so the more data you have |
---|
0:43:43 | and then of course, to get this, you have to tune the depth |
---|
0:43:45 | the more data you have, the deeper you can go |
---|
0:43:47 | and the bigger you may want to go |
---|
0:43:50 | and the more gain you have |
---|
0:43:51 | and this is the story I want to comment on |
---|
0:43:52 | without having to change major things in the system architecture |
---|
0:43:58 | OK, so once |
---|
0:44:00 | one thing that we found |
---|
0:44:01 | so my colleagues Dong Yu and myself and ah and |
---|
0:44:06 | recently found was that |
---|
0:44:10 | so in most of the thing that we |
---|
0:44:13 | I believe in old days IBM and |
---|
0:44:15 | and Google and our early work |
---|
0:44:17 | we actually used the DBN to initialize our model off-line |
---|
0:44:20 | we said can we get rid of that, that training is very tricky, not many |
---|
0:44:23 | people know how to do that |
---|
0:44:27 | if for certain recipe, you have to look at the pattern |
---|
0:44:30 | it's not an obvious thing how to do that because of the learning |
---|
0:44:33 | there's a keyword in the learning called contrastive divergence; you might hear that word |
---|
0:44:37 | in the later |
---|
0:44:38 | part of the talk today |
---|
0:44:42 | contrastive divergence, in theory, |
---|
0:44:44 | essentially the idea is you should, you know, iterate |
---|
0:44:47 | you should do Monte Carlo simulation |
---|
0:44:49 | Gibbs sampling for an infinite number of turns |
---|
0:44:52 | but in practice, it's too long |
---|
0:44:55 | so you cut it to one |
---|
0:44:56 | and of course for that, you have to use variational learning |
---|
0:45:00 | a variational bound, to get a better result |
---|
0:45:02 | it's a bit tricky |
---|
0:45:04 | that's why it's better to get rid of it |
---|
0:45:06 | so our colleagues actually have a patent filed just a few months ago |
---|
0:45:10 | on this, and also a paper from my colleague |
---|
0:45:13 | which actually uses ... for the |
---|
0:45:17 | the Switchboard task |
---|
0:45:18 | and they show that |
---|
0:45:19 | you actually can do things comparable to RBM learning |
---|
0:45:23 | so I would say now, for large vocabulary |
---|
0:45:26 | we don't even have to learn much about the DBN |
---|
0:45:30 | so .. the theory so far is not clear |
---|
0:45:34 | exactly what kind of power you have |
---|
0:45:36 | but my sense is that |
---|
0:45:39 | if you have a lot of unlabeled data in the future |
---|
0:45:42 | it might help |
---|
0:45:44 | but we also did some preliminary experiments to show it may not be the case any more |
---|
0:45:47 | so it's not clear how to do that |
---|
0:45:49 | so I think at this point we really have to get a better theory |
---|
0:45:52 | to get a better theory, and also do comparable experiments |
---|
0:45:54 | you know it's a |
---|
0:45:56 | so that all these issues can be settled |
---|
0:45:58 | so the idea of discriminative pre-training is that |
---|
0:46:01 | you just train the standard ..um |
---|
0:46:04 | standard |
---|
0:46:07 | multi-layer perceptron using you know |
---|
0:46:10 | this is easy to train. For a shallow one, you can train it; the result's not very good |
---|
0:46:14 | and then every time |
---|
0:46:15 | you fix this |
---|
0:46:17 | you add a new one, and you do it again. You need to fix the lower layers |
---|
0:46:21 | from the previous shallower network |
---|
0:46:24 | and that's good, that's the spirit, It's very similar to |
---|
0:46:28 | OK .. the spirit is very similar to layer by layer learning |
---|
0:46:31 | now every time |
---|
0:46:32 | when we add a new layer, we inject |
---|
0:46:35 | discriminative label information |
---|
0:46:37 | and that's very important, if you do that, nothing goes wrong |
---|
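A minimal sketch of this discriminative, layer-by-layer pre-training: grow the network one hidden layer at a time, training each new layer against the labels with a temporary softmax, then freeze it and add the next. Layer sizes, epochs and learning rate are illustrative assumptions, not the recipe used in the talk.

```python
# Discriminative layer-wise pre-training (sketch): label information is injected every time
# a new hidden layer is added; previously trained layers are kept fixed.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_new_layer(feats, y_onehot, n_hidden, epochs=30, lr=0.1):
    """Train one new hidden layer plus a temporary softmax on top of frozen features."""
    W_h = 0.01 * rng.standard_normal((feats.shape[1], n_hidden))
    W_o = 0.01 * rng.standard_normal((n_hidden, y_onehot.shape[1]))
    for _ in range(epochs):
        h = sigmoid(feats @ W_h)
        delta_o = (softmax(h @ W_o) - y_onehot) / len(feats)   # label information enters here
        delta_h = (delta_o @ W_o.T) * h * (1 - h)
        W_o -= lr * h.T @ delta_o
        W_h -= lr * feats.T @ delta_h
    return W_h

def discriminative_pretrain(x, y_onehot, hidden_sizes):
    """Grow the net layer by layer; lower layers are frozen (the 'fix') after each step."""
    weights, feats = [], x
    for n_hidden in hidden_sizes:
        W_h = train_new_layer(feats, y_onehot, n_hidden)
        weights.append(W_h)
        feats = sigmoid(feats @ W_h)        # deeper features for the next step
    return weights                           # afterwards: add a final softmax and fine-tune everything

# toy usage: random "frames" and labels, three layers grown discriminatively
x = rng.standard_normal((64, 50))
y = np.eye(10)[rng.integers(0, 10, 64)]
stack = discriminative_pretrain(x, y, [40, 30, 20])
```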
0:46:39 | so if you just use random numbers to go over here and do |
---|
0:46:42 | that, nothing is going to work |
---|
0:46:44 | uh, but except |
---|
0:46:45 | there's some exception here, but I'm not going to say much about it |
---|
0:46:48 | but once you do this |
---|
0:46:50 | layer by layer with |
---|
0:46:52 | the spirit is still similar to DBN right, layer by layer |
---|
0:46:55 | but you inject discriminative learning |
---|
0:46:57 | I believe it's a very natural thing to do |
---|
0:46:59 | we talked about this right |
---|
0:47:00 | so we learn |
---|
0:47:02 | the generative learning in the DBN |
---|
0:47:04 | you know, layer by layer, you have to be very careful |
---|
0:47:07 | you don't just do it too much |
---|
0:47:08 | and then if you inject some discriminative information |
---|
0:47:12 | it's bound to help |
---|
0:47:13 | you get new information there, not just looking at the data itself |
---|
0:47:16 | and it turns out that if we do, we get |
---|
0:47:19 | in some experiments we actually even get a slightly better result than with DBN pre-training |
---|
0:47:23 | so it's not clear that generative learning |
---|
0:47:26 | plays, is going to play a more important role |
---|
0:47:29 | as some people claimed |
---|
0:47:32 | OK so I'm done with |
---|
0:47:33 | the |
---|
0:47:35 | the deep neural network, so I will spend a few minutes to tell you a bit |
---|
0:47:39 | more about |
---|
0:47:39 | some other, different kind of architecture called the deep convex network |
---|
0:47:45 | which to me is kind of more interesting |
---|
0:47:47 | so I spend most time on this |
---|
0:47:49 | so actually we have a few papers published, it turned out that |
---|
0:47:54 | so the idea of this network is that |
---|
0:47:56 | well, this was actually done for MNIST |
---|
0:47:58 | so when we use this architecture |
---|
0:48:00 | we actually get a much better result than the DBN |
---|
0:48:04 | so we're very excited about this |
---|
0:48:05 | but the point is that the learning has to.. you know |
---|
0:48:08 | we have to simplify this network |
---|
0:48:10 | it turns out learning now |
---|
0:48:11 | the whole thing is actually convex optimization |
---|
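As a rough illustration of the convex-optimization point: in the deep stacking / deep convex network literature, each module is a simple one-hidden-layer net whose upper weights have a closed-form least-squares (ridge regression) solution, and each module's input is the raw input concatenated with the outputs of the modules below. The sketch below is written under those assumptions (hidden sizes, ridge constant and random lower weights are illustrative), not as the exact architecture on the slides.

```python
# Deep stacking / deep convex network module (sketch): lower weights W, sigmoid hidden layer,
# and upper weights U solved in closed form by ridge regression (the convex step).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_module(x, targets, n_hidden, ridge=1e-3):
    W = 0.1 * rng.standard_normal((x.shape[1], n_hidden))     # lower weights (could be RBM-initialized)
    H = sigmoid(x @ W)                                         # hidden activations
    # convex step: U = argmin ||H U - targets||^2 + ridge * ||U||^2  (closed form)
    U = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ targets)
    return W, U

def train_dsn(x, targets, n_modules=3, n_hidden=50):
    modules, preds = [], []
    for _ in range(n_modules):
        module_input = np.hstack([x] + preds)                  # raw input + all previous module outputs
        W, U = train_module(module_input, targets, n_hidden)
        modules.append((W, U))
        preds.append(sigmoid(module_input @ W) @ U)            # this module's prediction
    return modules

# toy usage: fit random 10-class one-hot targets from random 20-dim inputs
x = rng.standard_normal((200, 20))
targets = np.eye(10)[rng.integers(0, 10, 200)]
dsn = train_dsn(x, targets)
```

Because each module's expensive step is a batch least-squares solve rather than stochastic gradient descent, this is also the part that parallelizes easily, which is the contrast with the DNN fine-tuning drawn just below.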
0:48:14 | So I do not have time to go through all this |
---|
0:48:15 | it allows for a parallel implementation |
---|
0:48:17 | which is almost impossible |
---|
0:48:19 | for the deep neural network |
---|
0:48:22 | and the reason, for those of you who've actually been working on neural networks, you |
---|
0:48:25 | will notice, is that |
---|
0:48:26 | the learning for |
---|
0:48:27 | the discriminative... the discriminative learning phase |
---|
0:48:30 | which is called the fine-tuning phase |
---|
0:48:32 | typically uses stochastic gradient descent |
---|
0:48:34 | which you cannot distribute |
---|
0:48:35 | so this one cannot be distributed |
---|
0:48:36 | so I'm not going to |
---|
0:48:38 | I really want to use |
---|
0:48:39 | this architecture to try on the speech recognition task, and we have had lots of discussion |
---|
0:48:42 | so maybe 1 year from now |
---|
0:48:44 | so if it works well for your discriminative learning task |
---|
0:48:48 | I'm glad that |
---|
0:48:49 | this now is going to define the task |
---|
0:48:51 | for discrimination that I... I... had |
---|
0:48:54 | discussions about, so |
---|
0:48:56 | that gives me the opportunity to try this |
---|
0:48:58 | I would love to try it, and I would love to report the result |
---|
0:49:00 | even if it's negative, I'm happy to share it with you |
---|
0:49:02 | OK, so this is a good architecture |
---|
0:49:04 | and another architecture that we tried |
---|
0:49:06 | is that we split the hidden layers into 2 parts |
---|
0:49:09 | we take the cross product, so that overcomes |
---|
0:49:12 | some of the DBN weaknesses |
---|
0:49:14 | of originally not being able to model correlation in the input |
---|
0:49:18 | and people just tried a few tricks |
---|
0:49:20 | you know, to model the correlation |
---|
0:49:22 | it did not work well, almost impossible |
---|
0:49:24 | so this is very easy to implement |
---|
0:49:26 | and most of the learning here is convex optimization |
---|
0:49:28 | and we often get a very good result compared with others |
---|
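The "split the hidden layer into two parts and take their cross product" idea can be sketched as follows (sizes are illustrative assumptions); the outer product lets the layer above see multiplicative, correlation-style terms between the two hidden parts rather than only additive ones.

```python
# Two hidden parts whose outer product feeds the next layer (tensor-style module, sketch).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = rng.standard_normal((8, 39))            # a mini-batch of acoustic feature vectors
W1 = 0.1 * rng.standard_normal((39, 16))    # weights into hidden part 1
W2 = 0.1 * rng.standard_normal((39, 16))    # weights into hidden part 2

h1 = sigmoid(x @ W1)                        # (8, 16)
h2 = sigmoid(x @ W2)                        # (8, 16)
# cross product of the two parts, flattened: (8, 256) is the input to the layer above
h_cross = np.einsum('bi,bj->bij', h1, h2).reshape(len(x), -1)
```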
0:49:31 | there's another architecture called the tensor version |
---|
0:49:34 | so the same kind of correlation |
---|
0:49:36 | modeling from the tensor version |
---|
0:49:38 | can also be carried over into |
---|
0:49:40 | the deep neural network |
---|
0:49:41 | so my colleague and I actually submitted a paper to Interspeech |
---|
0:49:44 | I think if you're interested in this one, you should go there to take a look |
---|
0:49:47 | at it |
---|
0:49:47 | so the whole point is that |
---|
0:49:48 | now rather than doing the stacking using input-output concatenation |
---|
0:49:52 | you can actually do the same thing for each hidden layer of the neural network |
---|
0:49:56 | so in this paper, we actually evaluated that on Switchboard |
---|
0:50:00 | and we get an additional 5% relative gain over the best we have got so |
---|
0:50:03 | far. So this is good stuff |
---|
0:50:05 | so the learning becomes trickier |
---|
0:50:06 | because when you do... the back-propagation |
---|
0:50:11 | you have to think about how to do this |
---|
0:50:13 | it adds some additional nuisance in terms of efficient computation |
---|
0:50:16 | but the result is good |
---|
0:50:18 | so now I'm going to the second part; I'm going to skip most of it |
---|
0:50:22 | OK skip most of them |
---|
0:50:25 | OK so this uh... I actually wrote a book on this |
---|
0:50:27 | so this is |
---|
0:50:29 | the dynamic Bayesian network as a deep model |
---|
0:50:31 | the reason why it's deep is that there are many layers |
---|
0:50:34 | so you get the target |
---|
0:50:35 | you get articulation |
---|
0:50:36 | you get the environment, all together like this |
---|
0:50:38 | so we tried that |
---|
0:50:40 | and the implementation of this is very hard |
---|
0:50:43 | so I will just go quickly and then go to the bottom line |
---|
0:50:47 | uh, so, uh, this is one of the papers that |
---|
0:50:50 | uh, I wrote uh, together with |
---|
0:50:53 | one of the experts, who actually |
---|
0:50:54 | this is my colleague who actually invented this variational Bayes method |
---|
0:50:58 | and then ... I worked with him |
---|
0:51:01 | to implement this variational Bayes |
---|
0:51:03 | into this kind of ... |
---|
0:51:06 | dynamic Bayesian network |
---|
0:51:07 | and the result is very good |
---|
0:51:09 | and the journal paper we published is wonderful |
---|
0:51:11 | so you can actually synthesize |
---|
0:51:13 | you can track all these formants in a very precise manner |
---|
0:51:17 | and solve some articulatory problems; it's very amazing, but once you do recognition |
---|
0:51:21 | the result is not very good |
---|
0:51:22 | so I'm going to tell you why, if we have time |
---|
0:51:25 | and then of course |
---|
0:51:26 | one of the problems |
---|
0:51:27 | so in 2006 we actually |
---|
0:51:31 | realized that that kind of learning is very tricky |
---|
0:51:34 | essentially you approximate things and you don't know what you are approximating |
---|
0:51:39 | that's one of the problems of deep Bayesian networks, it's very |
---|
0:51:42 | but you can get some insights |
---|
0:51:43 | you work with all the experts in the [ ... ] |
---|
0:51:45 | in the end, the bottom line is |
---|
0:51:47 | we really don't know how to interpret it |
---|
0:51:49 | but you... but is just |
---|
0:51:51 | you don't know how much you lose, right |
---|
0:51:52 | so we actually have a simplified version that I spent a lot of time working on, and |
---|
0:51:56 | that gives me this result |
---|
0:51:58 | that's actually the paper |
---|
0:51:59 | so this is .. is about |
---|
0:52:01 | about 2-3 percent better than the best |
---|
0:52:04 | context-dependent HMM |
---|
0:52:06 | I was happy at that time; we stopped at this |
---|
0:52:08 | once we did this |
---|
0:52:09 | and it's so much better than this |
---|
0:52:10 | so in other words, DBN- |
---|
0:52:12 | related methods, at least on the TIMIT task, |
---|
0:52:14 | do so much better |
---|
0:52:16 | than the dynamic Bayesian kind of work |
---|
0:52:19 | and so we're happy about this |
---|
0:52:21 | now of course I won't |
---|
0:52:22 | yes, so this is the history of dynamic models |
---|
0:52:25 | and there is a whole bunch of things going on there |
---|
0:52:27 | and the key is how to embed |
---|
0:52:29 | such dynamic properties into the DBN framework |
---|
0:52:33 | if you embed the property of |
---|
0:52:36 | big chunk into |
---|
0:52:37 | the dynamic Bayesian network, it is not going to work ... due to technical reasons |
---|
0:52:42 | but the other way around has hope; that's one of the |
---|
0:52:46 | so part 3 is going to tell you |
---|
0:52:49 | for which I'm running out of time |
---|
0:52:50 | I'm actually going to show you |
---|
0:52:52 | first of all some of the lessons |
---|
0:52:54 | so this is the deep belief network or Deep Neural Network |
---|
0:52:57 | and this, I used the * here, to refer to the Dynamic Bayesian Network |
---|
0:53:02 | so one |
---|
0:53:05 | so all these hidden dynamic models... are special cases of the Bayesian network |
---|
0:53:10 | you can see that, or otherwise I showed you earlier on |
---|
0:53:13 | there were a few key differences that we learned |
---|
0:53:15 | one is that for DBN |
---|
0:53:17 | it's a distributed representation |
---|
0:53:20 | so in our current system, for this system |
---|
0:53:23 | in our HMM/GMM system |
---|
0:53:25 | we have the concept that this particular model |
---|
0:53:28 | is related to a |
---|
0:53:29 | this particular model is related to e right |
---|
0:53:31 | you have this concept right, and of course you need training to mix them together |
---|
0:53:34 | but you still have the concept |
---|
0:53:35 | whereas in this neural network.. no .. each weight |
---|
0:53:39 | encodes all the class information |
---|
0:53:41 | I think there is a very powerful concept here |
---|
0:53:43 | so you learn things and they get distributed |
---|
0:53:45 | it's like the neural system, right |
---|
0:53:47 | you don't say this particular neuron contains visual information |
---|
0:53:50 | it can also encode audio information together |
---|
0:53:53 | so this has a better |
---|
0:53:55 | neural basis compared with conventional techniques |
---|
0:53:58 | also ...... when we did this model |
---|
0:54:01 | we just got one single thing wrong |
---|
0:54:04 | at that time, we all said ... we need a parsimonious model representation. |
---|
0:54:08 | that's just wrong |
---|
0:54:10 | 5 years ago, 10 years ago, that was maybe OK, right |
---|
0:54:12 | now in our current age |
---|
0:54:14 | just use massive numbers of parameters if you know how to learn them |
---|
0:54:17 | and if you also know how to regularize them well |
---|
0:54:19 | and it just turns out that the DBN has a mechanism |
---|
0:54:21 | to automatically regularize things well |
---|
0:54:24 | and that is not proven yet, I don't have the theory to prove that |
---|
0:54:26 | but in our... you know, every time you stack up |
---|
0:54:29 | you can intuitively understand that |
---|
0:54:31 | you don't overfit, right |
---|
0:54:32 | because if you were going to overfit, you would have done so many years ago |
---|
0:54:36 | but if you do this, you know, keep going deeper, you don't overfit because |
---|
0:54:39 | whatever information you get applied |
---|
0:54:41 | the new parameters |
---|
0:54:43 | actually sort of take into account |
---|
0:54:46 | the features from the lower parameters, so they don't count as lower |
---|
0:54:50 | model parameters any more, so you automatically have a mechanism to do this |
---|
0:54:53 | but in the DBN, you don't have that property |
---|
0:54:55 | you need to stop; it doesn't have that property |
---|
0:54:57 | so this is a very strong point |
---|
0:54:59 | and another key difference |
---|
0:55:01 | is something I talked about earlier |
---|
0:55:03 | product vs mixture |
---|
0:55:06 | mixture means you sum up probability distributions |
---|
0:55:08 | and product means you take the product of them |
---|
0:55:11 | so when you take the product, you actually exponentially expand the power of representation |
---|
0:55:16 | So these are all the key differences between these two types of model. |
---|
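To make the contrast concrete, here is a minimal sketch in equations (the notation is mine, not from the slides): a mixture sums its component densities, while a product of experts multiplies them and renormalizes, so each expert can sharply constrain the overall distribution.

```latex
\begin{align*}
  \text{mixture (e.g.\ a GMM):} \quad
    p(x) &= \sum_{k} \pi_k\, p_k(x), \qquad \sum_k \pi_k = 1 \\
  \text{product of experts (e.g.\ an RBM):} \quad
    p(x) &= \frac{1}{Z} \prod_{k} p_k(x), \qquad Z = \int \prod_k p_k(x')\, dx'
\end{align*}
```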
0:55:19 | Another important thing is that for this learning we combine generative and discriminative training.
---|
0:55:26 | Although, with the final result we got, we still think that discriminative learning is more important than
---|
0:55:31 | generative learning.
---|
0:55:32 | But at least for the initialization, we use the generative DBN to initialize
---|
0:55:38 | the whole system and then discriminative learning to adjust the parameters.
---|
0:55:42 | The model we did earlier was purely generative.
---|
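As an illustration of that two-stage recipe, here is a minimal sketch with made-up layer sizes, random stand-in data, and toy hyperparameters; it is not the actual system described in the talk, only the general pattern of greedy generative RBM pre-training followed by discriminative fine-tuning.

```python
# Toy sketch of generative pre-training plus discriminative fine-tuning.
# All sizes, data, and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.05):
    """One-step contrastive divergence (CD-1) for a single RBM layer."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        v0 = data
        h0 = sigmoid(v0 @ W + b_h)                          # positive phase
        h0_sample = (h0 > rng.random(h0.shape)).astype(float)
        v1 = sigmoid(h0_sample @ W.T + b_v)                 # reconstruction
        h1 = sigmoid(v1 @ W + b_h)                          # negative phase
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(data)
        b_v += lr * (v0 - v1).mean(axis=0)
        b_h += lr * (h0 - h1).mean(axis=0)
    return W, b_h

# Stage 1: generative, layer-by-layer pre-training (no labels used).
X = rng.random((256, 39))          # stand-in for acoustic feature frames
y = rng.integers(0, 10, size=256)  # stand-in for HMM-state / senone labels
stack, layer_input = [], X
for n_hidden in (128, 128):
    W, b_h = train_rbm(layer_input, n_hidden)
    stack.append((W, b_h))
    layer_input = sigmoid(layer_input @ W + b_h)

# Stage 2: discriminative fine-tuning with a softmax output layer.
W_out = 0.01 * rng.standard_normal((128, 10))
b_out = np.zeros(10)
for _ in range(50):
    acts = [X]
    for W, b_h in stack:                       # forward through pre-trained stack
        acts.append(sigmoid(acts[-1] @ W + b_h))
    logits = acts[-1] @ W_out + b_out
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = p.copy()
    grad[np.arange(len(y)), y] -= 1.0          # cross-entropy gradient
    # Only the output layer is updated in this sketch; a full system would
    # also backpropagate the error into the pre-trained weights.
    W_out -= 0.1 * acts[-1].T @ grad / len(y)
    b_out -= 0.1 * grad.mean(axis=0)
```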
0:56:02 | Finally, longer windows or shorter windows. |
---|
0:56:07 | In the earlier case, I am still not very happy about the longer window.
---|
0:56:15 | Because every time you model dynamics, and I've actually talked about this, about a few methods for
---|
0:56:21 | how to build dynamics into the model, they both have a very short history, not a
---|
0:56:29 | long history.
---|
0:56:30 | There is no long history of research actually focused on dynamics.
---|
0:56:34 | There are so many limitations; you have to use a short window. With a long window, nothing
---|
0:56:39 | really works, we've tried all of these.
---|
0:56:46 | So the deep recurrent network is something that many people are working on now.
---|
0:56:52 | In our lab, over the summer, almost all the projects relate to this. Maybe
---|
0:56:58 | not all, but at least a very large percentage.
---|
0:57:01 | It has worked well for both acoustic models and language models. I would say that
---|
0:57:07 | the recurrent network has been working well for acoustic modeling.
---|
0:57:29 | In language modeling, there are a lot of good projects using the recurrent network.
---|
0:57:47 | The weakness of this approach is that there is only a generic temporal dependency.
---|
0:57:59 | I have no idea what it is; there is no constraint on how one state follows another. This
---|
0:58:06 | kind of temporal modeling does not buy you very much.
---|
0:58:09 | The dynamics in the Dynamic Bayesian Network are much better.
---|
0:58:15 | In terms of interpretation, in terms of generative capability, in terms of the physical speech production
---|
0:58:19 | mechanism, it is just better. The key is how to combine them together.
---|
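A rough way to write down that contrast (my notation, not from the slides): a generic recurrent layer learns an unconstrained temporal map, whereas a hidden-dynamic model of the kind discussed earlier constrains the hidden state to move smoothly toward a phonetic target.

```latex
\begin{align*}
  \text{generic recurrence:} \quad
    h_t &= \sigma\!\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right) \\
  \text{target-directed hidden dynamics:} \quad
    z_t &= \Lambda\, z_{t-1} + (I - \Lambda)\, t_{s} + \epsilon_t
\end{align*}
```

Here $t_s$ stands for the articulatory or vocal-tract-resonance target of the current phonetic unit $s$, and $\Lambda$ controls how quickly the state moves toward it.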
0:58:23 | We don't like this, and we have shown that all this does not capture the |
---|
0:58:32 | essence of speech production dynamics. |
---|
0:58:35 | There is a huge amount of information redundancy: think about it, you have a long window here,
---|
0:58:41 | and every time you shift by ten milliseconds, 90% of the information is overlapping.
---|
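As a quick worked example with assumed, typical numbers (an 11-frame context window of 25 ms frames at a 10 ms hop; these figures are mine, not from the talk):

```latex
\begin{align*}
  \text{window span} &\approx 25\,\text{ms} + 10 \times 10\,\text{ms} = 125\,\text{ms} \\
  \text{overlap after one 10 ms shift} &\approx \frac{125 - 10}{125} \approx 92\%
\end{align*}
```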
0:58:59 | And some people may argue that it doesn't matter, and they did experiments to show
---|
0:59:03 | that it doesn't help at all. |
---|
0:59:04 | On the importance of optimization techniques: one example is the Hessian-free method.
---|
0:59:18 | I am not sure about language modeling, you may not do that there actually, but in
---|
0:59:22 | acoustic modeling, this is a very popular technique. |
---|
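For readers less familiar with the technique, here is a minimal sketch of the core idea on a toy quadratic problem of my own, not the acoustic-modeling systems cited in the talk: the update direction approximately solves H d = -g by conjugate gradient, using only Hessian-vector products so the Hessian is never formed. Full implementations add more machinery, such as damping and Gauss-Newton curvature products.

```python
# Toy illustration of the Hessian-free idea: Newton-like steps computed with
# conjugate gradient and Hessian-vector products only (no explicit Hessian).
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)

def loss_grad(w):
    r = A @ w - b
    return 0.5 * r @ r, A.T @ r

def hessian_vec(v):
    # For this quadratic loss, H = A.T @ A; a neural network would instead use
    # an R-operator or Gauss-Newton product rather than forming any matrix.
    return A.T @ (A @ v)

def conjugate_gradient(hvp, g, iters=20):
    """Approximately solve H d = -g using only Hessian-vector products."""
    d = np.zeros_like(g)
    r = -g.copy()          # residual of H d = -g at d = 0
    p = r.copy()
    for _ in range(iters):
        Hp = hvp(p)
        alpha = (r @ r) / (p @ Hp)
        d += alpha * p
        r_new = r - alpha * Hp
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return d

w = np.zeros(20)
for step in range(5):
    _, g = loss_grad(w)
    w += conjugate_gradient(hessian_vec, g)   # Newton-like update direction
```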
0:59:25 | And also, another point is that the recursive neural network for parsing in NLP has been
---|
0:59:31 | very successful. |
---|
0:59:32 | I think last year at ICML, they actually presented the result of a recursive neural net,
---|
0:59:36 | which is not quite the same as this, but they used the structure for the parsing, and
---|
0:59:40 | they actually got a state-of-the-art result for the parsing.
---|
0:59:43 | The conclusion of this slide is that it's an active and exciting research area to work
---|
0:59:47 | on. |
---|
0:59:48 | So the summary is as follows. I have provided historical accounts of two fairly separate lines of research.
---|
0:59:57 | One is based upon the Deep Belief Network, the other one is based on the Dynamic Bayesian Network in
---|
1:00:05 | speech. |
---|
1:00:05 | So I have hopefully shown you that speech research motivates the use of deep architectures
---|
1:00:13 | from speech production and perception mechanisms. |
---|
1:00:16 | And the HMM is a shallow architecture that uses the GMM to link linguistic units to observations.
---|
1:00:26 | Now I have shown you that, though I didn't have time to talk about all of it; the
---|
1:00:31 | point is that this kind of model has had less success than was expected.
---|
1:00:34 | And now we are beginning to understand why there is a limitation here, and
---|
1:00:40 | I have actually shown some potential possibilities for overcoming that kind of limitation in the
---|
1:00:47 | neural network framework.
---|
1:00:48 | So one of the things we now understand is why the kind of models that have
---|
1:00:53 | been developed in the past have not been able to take advantage of the dynamics
---|
1:00:58 | inside a deep network.
---|
1:01:00 | It's because we didn't have the distributed representation, didn't have massive parameters, didn't have fast |
---|
1:01:06 | parallel computing and we didn't have product of experts. |
---|
1:01:09 | All these things are good on this side, but the dynamics are actually good on the other side,
---|
1:01:13 | and how to merge them together, I think, is a very promising direction of research to actually
---|
1:01:17 | work on.
---|
1:01:18 | You can actually make the deep network scientific in terms of speech perception
---|
1:01:23 | and recognition.
---|
1:01:24 | So the outlook, the future direction, is that so far we have the DBN-DNN to
---|
1:01:32 | replace the HMM-GMM.
---|
1:01:34 | I would expect that within three to five years, you may not be able to see the
---|
1:01:40 | GMM any more, especially in recognition.
---|
1:01:41 | I mean in industry at least. If I am wrong, then shoot me.
---|
1:01:49 | The dynamic properties of this Dynamic Bayesian Network model of speech have the potential to
---|
1:02:01 | replace the HMM.
---|
1:02:15 | And for the Deep Recurrent Neural Networks, I have tried to argue that there
---|
1:02:21 | is a need to go beyond unconstrained temporal dependency while making it easier to learn.
---|
1:02:27 | Adaptive learning is so far not so successful yet; we tried a few projects, and it
---|
1:02:33 | is harder to do.
---|
1:02:35 | Scalable learning is hard, for industry at least; for academics, don't worry about
---|
1:02:41 | it.
---|
1:02:42 | As long as NIST defines it into small tasks, you will be very happy to
---|
1:02:47 | work on that. But for industry this is a big issue.
---|
1:02:50 | We are reinventing our infrastructure at the industrial scale. I don't think we have time to go through
---|
1:02:59 | all the applications.
---|
1:03:00 | Spoken language understanding has been one of the successful applications I've shown you.
---|
1:03:08 | Information retrieval, language modeling, NLP, image recognition; but speaker recognition, not yet.
---|
1:03:24 | The final bottom line here is that deep learning so far is weak in
---|
1:03:30 | theory; I hope I have convinced you of that with all the critiques.
---|
1:05:18 | In Bengio's case, he randomizes everything first. And then if you do that, of course,
---|
1:05:24 | it is bad.
---|
1:05:26 | So the key is that, if you can get something initialized well, I think, to
---|
1:05:31 | me, the generative model may be useful in that case. But the key of this learning
---|
1:05:36 | is that if you put a little bit of discrimination in here, it is probably better.
---|
1:06:47 | So probably the best is to use the structure here and also this, and we
---|
1:06:52 | know how to train that now. I think both width and depth are important.
---|
1:07:09 | We tried that; we didn't fix the measurement, we just used the algorithm to cut it out
---|
1:07:15 | all the way. We didn't lose anything; in fact, from the result I showed you,
---|
1:07:20 | it still gains a little bit.
---|
1:07:29 | Cross validation. |
---|
1:07:32 | There's no way; there is no theory on how to do that.
---|
1:07:35 | But in particular cases, for some of the networks that I've shown you, I have theory
---|
1:07:39 | to do that; I can control that.
---|
1:07:44 | For some networks, you can do the theory. That means you can automatically determine it from
---|
1:07:49 | the data. But for this Deep Belief Network, it is weak in theory.
---|
1:08:31 | He is also doing deep graphical models.
---|
1:08:48 | Two years ago, he gave this talk on how to learn the topology of a deep
---|
1:08:54 | neural network, in terms of width and depth.
---|
1:08:57 | And he was using the Indian Buffet Process.
---|
1:09:03 | In the end, everything has to be done by Monte Carlo simulation, and for a five
---|
1:09:10 | by five case, he said the simulation takes about several days.
---|
1:09:15 | I think that approach is not scalable, unless people improve that aspect. |
---|
1:09:27 | That also motivates more academic research on machine learning to make that scale.
---|
1:09:31 | I think the idea is good, but the technique is too slow to do anything
---|
1:09:35 | about this.
---|
1:09:50 | For the deep neural network, stochastic gradient is still doing the best; it is good enough.
---|
1:09:55 | But my understanding is, we are actually playing around with this. If you want to add
---|
1:10:01 | recurrence or some more complex architecture, stochastic gradient isn't strong enough.
---|
1:10:05 | There is a very nice paper from Hinton's group, by one of his PhD students,
---|
1:10:12 | who actually used Hessian-free optimization to do DBN learning.
---|
1:10:20 | They actually showed that; the result is just one single figure, very hard to interpret that
---|
1:10:27 | one, in the ICML 2010 paper. It does better with this compared with using the DBN
---|
1:10:34 | to initialize the neural network.
---|
1:10:36 | To me, it is very significant. We are still borrowing this; for a more complex network,
---|
1:10:44 | a more complex second-order method will probably be necessary.
---|
1:10:50 | And also, the other advantage of Hessian-free being second order is that it can be
---|
1:10:54 | parallelized for bigger batch training rather than minibatch training, and that makes a big difference.
---|
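A toy sketch of why that matters (my own least-squares example, not the systems discussed in the talk): a full-batch gradient decomposes into shard gradients that could be computed on separate machines and summed, while minibatch SGD is a long chain of small, inherently sequential updates.

```python
# Toy contrast between minibatch SGD and full-batch updates on linear
# least squares; sizes and learning rates are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 20))
w_true = rng.standard_normal(20)
y = X @ w_true + 0.1 * rng.standard_normal(10_000)

def grad(w, Xb, yb):
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Minibatch SGD: many small updates that must be applied one after another.
w_sgd = np.zeros(20)
for epoch in range(5):
    for i in range(0, len(X), 256):
        w_sgd -= 0.01 * grad(w_sgd, X[i:i+256], y[i:i+256])

# Full-batch step: the gradient is an average over equal data shards, so each
# shard could be computed on a different machine and the pieces combined.
w_batch = np.zeros(20)
for step in range(50):
    shard_grads = [grad(w_batch, Xs, ys)
                   for Xs, ys in zip(np.array_split(X, 4), np.array_split(y, 4))]
    w_batch -= 0.1 * np.mean(shard_grads, axis=0)
```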
1:11:06 | We tried that one; it doesn't work well for the DBN, where we need to have a
---|
1:11:15 | lot of data. Probably the best for the DBN network is still stochastic gradient.
---|
1:11:22 | If you are using the other networks, some of the later networks that we have talked about,
---|
1:11:31 | they are naturally suited for batch training.
---|
1:11:35 | In some more modern versions of the network, batch training is desirable. They are designed
---|
1:11:47 | for those architectures; it is for parallelization.
---|