0:00:17 | Hello everyone, I'm Oriol Vinyals. |
---|
0:00:20 | I'm going to be talking about |
---|
0:00:22 | deep learning, and more concretely about using deep learning for tandem features, |
---|
0:00:29 | and analyzing how it performs for robust ASR, basically |
---|
0:00:35 | seeing how it deals with noise. |
---|
0:00:37 | So, a bit of related work |
---|
0:00:41 | and background. |
---|
0:00:42 | Deep learning: I'm not going to go into |
---|
0:00:45 | many details, but it's basically the idea of having many layers of computation, |
---|
0:00:49 | and typically |
---|
0:00:52 | it's just a neural network |
---|
0:00:53 | with better initialization than just random. In 2006, Hinton introduced these |
---|
0:00:59 | RBMs, |
---|
0:01:00 | which apparently helped a lot in training these deep models. |
---|
0:01:05 | And |
---|
0:01:06 | since then, many groups have been working on deep learning; you can see that by the number of publications |
---|
0:01:12 | at machine learning conferences, and also at related conferences like computer vision conferences such as CVPR. |
---|
0:01:19 | So |
---|
0:01:20 | some people have applied deep learning to speech, and it's quite recent, just the last couple of years. |
---|
0:01:26 | But estimating phone posteriors using neural networks, |
---|
0:01:32 | or |
---|
0:01:33 | deep neural networks, |
---|
0:01:34 | is not a new idea. |
---|
0:01:36 | Basically, there are |
---|
0:01:39 | two main approaches: one that uses the phone posteriors |
---|
0:01:44 | to get the acoustic |
---|
0:01:46 | model for the HMM, the so-called |
---|
0:01:48 | hybrid model, and the other that uses tandem, which means |
---|
0:01:53 | we take just the posteriors as features |
---|
0:01:56 | and then use them in an otherwise standard GMM-HMM system. |
---|
0:02:01 | The tandem approach is quite attractive because it keeps relying on an existing GMM-HMM system, so |
---|
0:02:06 | that's the kind of approach |
---|
0:02:08 | we are looking at in this work. |
---|
0:02:10 | So, just to |
---|
0:02:17 | explain briefly how |
---|
0:02:19 | tandem is used, for those who don't know: we get some sort of estimate, |
---|
0:02:24 | per frame, of the posterior probabilities for the phones, |
---|
0:02:27 | as shown on the top, and then there are several techniques and tricks that have been applied and found useful |
---|
0:02:34 | ten years ago or so. |
---|
0:02:36 | We take the log of those posterior probabilities, and then we |
---|
0:02:41 | whiten them, so that they better match the diagonal-covariance assumption of the GMMs; |
---|
0:02:47 | we do mean and variance normalization; and lastly we just concatenate them with the MFCCs or some |
---|
0:02:53 | spectral features, |
---|
0:02:55 | and we train, or decode, with this extended feature set. |
---|
0:03:00 | So it's pretty |
---|
0:03:02 | easy to understand, |
---|
0:03:04 | and easy to implement as well. |
---|
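The tandem pipeline just described (log posteriors, whitening, mean/variance normalization, concatenation with MFCCs) can be sketched as follows. This is a minimal illustration, not the exact recipe from the talk: the dimensions, the PCA-based whitening, and the random stand-in data are all assumptions.

```python
import numpy as np

# Illustrative sketch of a tandem feature pipeline: per-frame phone
# posteriors -> log -> whitening -> mean/variance normalization ->
# concatenation with MFCCs. Sizes and data are made up for the example.
rng = np.random.default_rng(0)
n_frames, n_phones, n_mfcc = 1000, 41, 39

# Fake per-frame phone posteriors (rows sum to 1) and MFCC features.
logits = rng.normal(size=(n_frames, n_phones))
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
mfcc = rng.normal(size=(n_frames, n_mfcc))

# 1) Log posteriors: makes the heavily skewed probabilities more Gaussian.
x = np.log(posteriors + 1e-10)

# 2) Whitening via PCA: decorrelates the dimensions so they better match
#    the diagonal-covariance assumption of the GMMs.
x0 = x - x.mean(axis=0)
cov = x0.T @ x0 / n_frames
eigval, eigvec = np.linalg.eigh(cov)
x_white = x0 @ eigvec / np.sqrt(eigval + 1e-10)

# 3) Mean and variance normalization (per dimension).
x_norm = (x_white - x_white.mean(axis=0)) / (x_white.std(axis=0) + 1e-10)

# 4) Concatenate with the spectral (MFCC) features.
tandem = np.concatenate([x_norm, mfcc], axis=1)
print(tandem.shape)  # (1000, 80)
```

The extended features would then be fed to an otherwise unchanged GMM-HMM system for training and decoding.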
0:03:07 | So, |
---|
0:03:10 | that brings me to the main points of this work. |
---|
0:03:15 | First, we want to |
---|
0:03:17 | see how |
---|
0:03:18 | phone posteriors coming from |
---|
0:03:21 | a deep neural network |
---|
0:03:22 | combine with spectral features: whether there is any gain when we add them to |
---|
0:03:27 | the MFCCs |
---|
0:03:29 | in this tandem fashion. |
---|
0:03:32 | Also, and this is probably more interesting, |
---|
0:03:36 | and I don't know if it has been analyzed yet: how does noise affect deep neural |
---|
0:03:41 | net based systems? |
---|
0:03:42 | In particular, we want to analyze, or |
---|
0:03:47 | kind of rule out, which |
---|
0:03:49 | parts of the deep |
---|
0:03:50 | architecture are helping in which situations. |
---|
0:03:53 | So for that, |
---|
0:03:55 | as I said, we have some questions regarding deep learning. For example: |
---|
0:04:00 | why does |
---|
0:04:01 | having a deep structure matter? |
---|
0:04:03 | That's the first question. Then we can also ask ourselves: what about pre-training, the RBM training that I |
---|
0:04:09 | was talking about; is it important or not? |
---|
0:04:12 | And lastly, |
---|
0:04:14 | we know that training neural networks gets tricky sometimes, especially when they are deep, so does |
---|
0:04:20 | the optimization technique used matter? |
---|
0:04:23 | In the paper, the focus was on the first two points: |
---|
0:04:28 | does |
---|
0:04:29 | depth matter, |
---|
0:04:30 | and does pre-training matter. |
---|
0:04:33 | The last one is something I've been working on; it is not in the paper, |
---|
0:04:37 | but I will talk about it in this talk. |
---|
0:04:41 | So, referring to those questions, here is |
---|
0:04:44 | one way to see neural networks. |
---|
0:04:47 | The good part of deep neural networks is that they are powerful models: they are very |
---|
0:04:52 | expressive, and they can represent very complicated nonlinear relations, which is good because we know our brain probably does that. |
---|
0:04:59 | They are also attractive because the gradient is easy to compute, and in fact now with |
---|
0:05:05 | GPU computing, |
---|
0:05:07 | since all it involves is some matrix operations, they can actually be trained pretty fast and |
---|
0:05:12 | very efficiently. |
---|
0:05:14 | There are some bad things too. It is a non-convex optimization problem, and there is the vanishing gradients |
---|
0:05:20 | problem: as we backpropagate, the gradients tend to zero, so |
---|
0:05:24 | it's kind of |
---|
0:05:25 | hard; they are not easy to train, especially when the neural network is very deep. |
---|
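The vanishing-gradient point can be seen numerically: with logistic units, the derivative sigma'(x) = sigma(x)(1 - sigma(x)) is at most 0.25, so with modest weights the backpropagated error shrinks roughly geometrically with depth. Here is a toy sketch; the layer sizes and weight scale are arbitrary choices for illustration.

```python
import numpy as np

# Toy illustration of vanishing gradients in a deep logistic network:
# backpropagate an error vector through many sigmoid layers and watch
# its norm shrink. Layer sizes and weight scale are arbitrary.
rng = np.random.default_rng(0)
depth, width = 10, 100

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Forward pass, keeping activations for backprop.
h = rng.normal(size=width)
weights, acts = [], []
for _ in range(depth):
    W = rng.normal(scale=0.1, size=(width, width))
    h = sigmoid(W @ h)
    weights.append(W)
    acts.append(h)

# Backward pass: delta_{l-1} = W_l^T (delta_l * sigma'(a_l)),
# where sigma'(a) = a * (1 - a) for the stored sigmoid activations.
delta = np.ones(width)
norms = [np.linalg.norm(delta)]
for W, a in zip(reversed(weights), reversed(acts)):
    delta = W.T @ (delta * a * (1.0 - a))
    norms.append(np.linalg.norm(delta))

# The gradient signal decays by orders of magnitude over 10 layers.
print(norms[0], norms[-1])
```

Each layer multiplies the error by at most ~0.25 through the sigmoid derivative, so ten layers in, almost nothing is left to update the early weights with.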
0:05:30 | Also, the number of parameters can grow very large; in fact, |
---|
0:05:34 | our brain has several orders of magnitude more parameters than the neural nets that we |
---|
0:05:38 | can train nowadays. |
---|
0:05:40 | So overfitting is an issue that |
---|
0:05:42 | people are worried about, of course, as in many other machine learning techniques. |
---|
0:05:46 | And something that people don't like about |
---|
0:05:49 | neural networks is that they are kind of difficult to interpret: it's hard to tell what's going on. |
---|
0:05:54 | There are some exceptions, though. |
---|
0:05:57 | There are some people in computer vision, and also in speech, who are analyzing what |
---|
0:06:02 | the neurons are actually learning, |
---|
0:06:04 | and it's impressive: in computer vision, for example, you can see that |
---|
0:06:10 | the first layer is learning basically what V1 in our brain is doing, these Gabor-like filters |
---|
0:06:15 | for computer vision, and not much else. |
---|
0:06:18 | So it is actually becoming possible, in some sense, to interpret |
---|
0:06:22 | these deep nets, and hence deep learning. |
---|
0:06:26 | So, |
---|
0:06:27 | just to |
---|
0:06:30 | be concrete on the exact experiment that we did: |
---|
0:06:33 | we trained this kind of neural network. |
---|
0:06:36 | It was deep because, as you can see, it has three hidden layers. On the |
---|
0:06:42 | left we have just the input, |
---|
0:06:44 | with the thirty-nine acoustic observations and nine frames of context, |
---|
0:06:49 | and then we have the following layers, with five hundred, a thousand, and fifteen hundred binary |
---|
0:06:54 | logistic units. |
---|
0:06:55 | And the last layer is the one where we estimate the |
---|
0:06:59 | phone posteriors, through a softmax layer. |
---|
0:07:02 | I need to say that I didn't |
---|
0:07:05 | tune or optimize this: I came up with this architecture and I've been using it as is; I |
---|
0:07:10 | haven't changed the parameters and so on, |
---|
0:07:13 | because I wanted to see the effects and compare with the same network. |
---|
0:07:17 | But better numbers could |
---|
0:07:20 | probably be found just by |
---|
0:07:21 | trying different architectures. |
---|
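A minimal forward pass matching the architecture just described (39 acoustic features times 9 frames of context in; hidden layers of 500, 1000, and 1500 logistic units; softmax phone posteriors out) might look like this. The 41-phone output size and the random weights are illustrative assumptions, not values from the talk.

```python
import numpy as np

# Sketch of the described network: 39 acoustic features x 9 context frames
# -> 500 -> 1000 -> 1500 logistic hidden units -> softmax phone posteriors.
# Weights are random here (pre-training is covered separately in the talk),
# and the 41-phone output size is an assumption for the example.
rng = np.random.default_rng(0)
sizes = [39 * 9, 500, 1000, 1500, 41]
params = [(rng.normal(scale=0.01, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def phone_posteriors(x):
    h = x
    for W, b in params[:-1]:                 # logistic hidden layers
        h = sigmoid(h @ W + b)
    W, b = params[-1]                        # softmax output layer
    z = h @ W + b
    z -= z.max(axis=-1, keepdims=True)       # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

x = rng.normal(size=(4, 39 * 9))             # 4 frames, each with 9-frame context
p = phone_posteriors(x)
print(p.shape)                               # (4, 41); each row sums to 1
```

In the tandem setup these per-frame posteriors are the network's only job; the GMM-HMM back-end consumes the transformed posteriors as features.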
0:07:24 | So, jumping into the experimental setup: |
---|
0:07:27 | we used Aurora2, so it's fairly small, with around one point four million samples |
---|
0:07:32 | for training, |
---|
0:07:34 | at a ten-millisecond frame rate. |
---|
0:07:38 | And, |
---|
0:07:39 | as we know, the testing conditions are with added noise |
---|
0:07:44 | at different SNR levels, |
---|
0:07:46 | with noises such as train station, airport, and so on. |
---|
0:07:50 | And, |
---|
0:07:51 | importantly, we trained our models on clean speech and then we are |
---|
0:07:55 | testing on the several noisy conditions, just to see |
---|
0:07:58 | whether the DNN is being affected by the noisy conditions, if at all. |
---|
0:08:03 | And then we just used the standard HMM model proposed in the Aurora2 setup, with |
---|
0:08:08 | the same decoding scheme. |
---|
0:08:11 | So, |
---|
0:08:13 | the first table of results; let me just explain it. |
---|
0:08:18 | In the rows, as you can see, are just the |
---|
0:08:21 | different noise conditions, starting from clean and adding more noise. |
---|
0:08:25 | And in the parentheses you see the relevant differences: if we rerun these experiments |
---|
0:08:32 | with different random seeds, |
---|
0:08:33 | we observe around point two to point four differences in word error rate, so that's kind |
---|
0:08:38 | of the significance level of these results. |
---|
0:08:41 | Then, the first |
---|
0:08:42 | column of results is just the standard MFCC model; |
---|
0:08:46 | we can see that, as we add noise, it degrades. |
---|
0:08:49 | The next two columns are the |
---|
0:08:52 | tandem MLP baseline, |
---|
0:08:53 | which we borrowed from ICSI. |
---|
0:08:55 | The first of the two would be the MLP using just the |
---|
0:09:00 | MLP features, no MFCCs, and "tandem" denotes concatenating both. |
---|
0:09:04 | So we can see that concatenating MFCCs helps, because all the numbers are basically lower, |
---|
0:09:10 | and it |
---|
0:09:11 | improves the word error rates across all the noise conditions. |
---|
0:09:17 | Now, with the deep belief network that I showed, |
---|
0:09:21 | we basically get |
---|
0:09:22 | better results on almost all conditions. |
---|
0:09:25 | In particular, |
---|
0:09:28 | on clean speech we get an improvement, which is, |
---|
0:09:31 | I guess, comparable with other people's findings on TIMIT and so on. |
---|
0:09:36 | And the improvements are consistently better than the tandem MLP |
---|
0:09:42 | approach that was proposed several years ago, |
---|
0:09:45 | which is good news. |
---|
0:09:46 | Also note, as I said, that |
---|
0:09:49 | MFCCs usually help, but when there is a lot of noise, in the five dB or zero dB |
---|
0:09:54 | cases, |
---|
0:09:56 | they actually hurt, |
---|
0:09:57 | and using just the DBN phone posteriors |
---|
0:10:02 | is better. |
---|
0:10:03 | So |
---|
0:10:05 | those were the first questions: |
---|
0:10:07 | how do phone posteriors |
---|
0:10:10 | do, |
---|
0:10:10 | combined in that tandem fashion, when we use deep-learning-based nets instead of just MLP features, |
---|
0:10:17 | and how does noise affect them? It seems that deep neural nets are also good for noisy, not |
---|
0:10:22 | only clean, speech. |
---|
0:10:24 | And now |
---|
0:10:25 | I'm going to jump to some more recent results. I was actually able to run these because I've been working |
---|
0:10:32 | on a |
---|
0:10:32 | second-order optimization method proposed at ICML 2010, |
---|
0:10:37 | which kind of |
---|
0:10:39 | suggests that maybe pre-training with these RBMs is not necessary if you use some sort of second-order |
---|
0:10:45 | optimization in the backprop step. |
---|
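The second-order idea referred to here (presumably Hessian-free optimization, as in Martens' ICML 2010 paper) never forms the Hessian explicitly: it only needs Hessian-vector products, and solves for the update direction with conjugate gradient. Here is a toy sketch on a quadratic stand-in objective; the matrix and all sizes are made up for illustration, and a real implementation adds damping, mini-batching, and a Gauss-Newton curvature matrix.

```python
import numpy as np

# Toy sketch of the "Hessian-free" second-order step: Hessian-vector
# products via finite differences of the gradient, plus conjugate gradient
# to solve H d = -g. The quadratic 0.5 w'Aw - b'w stands in for the loss.
rng = np.random.default_rng(0)
A_ = rng.normal(size=(5, 5))
A = A_ @ A_.T + 5 * np.eye(5)        # symmetric positive definite "curvature"
b = rng.normal(size=5)

def grad(w):
    return A @ w - b                 # gradient of 0.5 w'Aw - b'w

def hess_vec(w, v, eps=1e-4):
    # Hessian-vector product without ever building the Hessian.
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

def cg_solve(w, g, iters=50, tol=1e-10):
    # Conjugate gradient for H d = -g, using only hess_vec products.
    d = np.zeros_like(g)
    r = -g - hess_vec(w, d)
    p = r.copy()
    for _ in range(iters):
        Hp = hess_vec(w, p)
        alpha = (r @ r) / (p @ Hp)
        d += alpha * p
        r_new = r - alpha * Hp
        if np.linalg.norm(r_new) < tol:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return d

w = np.zeros(5)
w = w + cg_solve(w, grad(w))         # one (approximate) Newton step
print(np.allclose(w, np.linalg.solve(A, b), atol=1e-5))  # → True
```

On this quadratic, one such step lands on the exact minimizer; on a real network loss the step is repeated, with CG truncated early.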
0:10:47 | But let me go step by step through these questions and the columns to look at. |
---|
0:10:51 | So first: does optimization matter? |
---|
0:10:55 | What we see here, in the columns: the first two columns are the same as |
---|
0:10:59 | previously, |
---|
0:11:00 | so the tandem MLP was trained using standard techniques, let's say stochastic gradient descent, |
---|
0:11:05 | and it's stopped after seven hundred or so iterations, |
---|
0:11:10 | when we don't see an improvement in performance. |
---|
0:11:12 | The last column, the tandem MLP with the little star: |
---|
0:11:15 | there we were actually able to train a bigger model, basically with as many parameters |
---|
0:11:22 | as the DBN |
---|
0:11:23 | model that I showed at the beginning. |
---|
0:11:27 | And as we can see, at least for the |
---|
0:11:32 | low-noise region, |
---|
0:11:34 | the tandem MLP with the second-order |
---|
0:11:38 | optimization and more parameters |
---|
0:11:40 | actually outperforms |
---|
0:11:42 | the tandem MLP without this better optimization. |
---|
0:11:45 | But then, in the |
---|
0:11:48 | higher-noise conditions, it did worse. |
---|
0:11:51 | So that is kind of disappointing: maybe, because there are so many parameters, there is some sort of overfitting, |
---|
0:11:56 | and the model doesn't deal |
---|
0:11:58 | well with that. |
---|
0:11:59 | So |
---|
0:12:00 | that brings us to the next point: |
---|
0:12:02 | does depth matter? |
---|
0:12:04 | So now, |
---|
0:12:05 | let's take the parameters of the |
---|
0:12:07 | single-hidden-layer tandem MLP, |
---|
0:12:09 | which is around three million parameters, by the way, |
---|
0:12:11 | and let's use them in the deep neural network that I showed at the beginning, but with |
---|
0:12:16 | no pre-training. So it's not the deep belief network that Hinton proposed; it's just a standard neural |
---|
0:12:21 | net with many layers. |
---|
0:12:23 | And what we see here is that the performance is almost identical, |
---|
0:12:26 | but in the high-noise situations, the degradation that we saw before goes away, and we actually get |
---|
0:12:33 | a bit better. |
---|
0:12:34 | So my hypothesis here is that maybe adding the deepness |
---|
0:12:38 | has some sort of effect on being able to cancel the noise better than if you |
---|
0:12:43 | just have a shallow network. |
---|
0:12:45 | Now, |
---|
0:12:46 | obviously this |
---|
0:12:47 | is just a hypothesis, but |
---|
0:12:49 | that is what we can probably see from the results. |
---|
0:12:52 | Lastly: does pre-training matter? This is basically, from the first table, the |
---|
0:12:58 | same neural net but with the pre-training step. |
---|
0:13:01 | We see that |
---|
0:13:02 | it improves upon the deep neural net that has not been pre-trained, |
---|
0:13:06 | and it improves across all the noise conditions. So |
---|
0:13:12 | I think what this means is that pre-training |
---|
0:13:14 | basically helps generalization. |
---|
0:13:17 | We know actually that for overfitting, pre-training helps quite a lot: for the MNIST dataset, |
---|
0:13:22 | which was probably the setting of the Science paper, |
---|
0:13:25 | there is huge overfitting and pre-training helps a lot. |
---|
0:13:27 | But in this case it helped |
---|
0:13:29 | not only on the clean condition, but even when the SNR is quite low. |
---|
0:13:33 | I guess pre-training the weights somehow drives them toward some sort of |
---|
0:13:39 | generality, and not only the discriminative objective function. |
---|
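The pre-training being discussed stacks RBMs trained layer by layer on unlabeled data. One contrastive-divergence (CD-1) update for a single binary RBM layer could be sketched like this; the sizes, learning rate, and the fake binary batch are all illustrative assumptions.

```python
import numpy as np

# Sketch of one CD-1 update for a binary restricted Boltzmann machine,
# the building block of the layer-wise pre-training discussed in the talk.
# Sizes, learning rate, and the fake "data" batch are illustrative.
rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 100, 50, 0.1
W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0):
    global W, b_v, b_h
    # Positive phase: hidden probabilities given the data, then a sample.
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to a reconstruction.
    pv1 = sigmoid(h0 @ W.T + b_v)
    ph1 = sigmoid(pv1 @ W + b_h)
    n = v0.shape[0]
    # CD-1 gradient approximation: <v h>_data - <v h>_reconstruction.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    b_v += lr * (v0 - pv1).mean(axis=0)
    b_h += lr * (ph0 - ph1).mean(axis=0)
    return float(np.mean((v0 - pv1) ** 2))   # reconstruction error

v = (rng.random((20, n_visible)) < 0.3).astype(float)  # fake binary batch
errors = [cd1_update(v) for _ in range(50)]
print(errors[0], errors[-1])   # reconstruction error should decrease
```

In the full recipe, the trained RBM's hidden activations become the "data" for the next RBM up, and the stacked weights initialize the deep net before supervised backprop.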
0:13:42 | And |
---|
0:13:44 | so, |
---|
0:13:45 | to conclude this discussion about |
---|
0:13:47 | depth and pre-training and so on, |
---|
0:13:50 | I looked at the |
---|
0:13:51 | frame-level phone error rate of all these three networks, |
---|
0:13:55 | for some phones picked at random, |
---|
0:13:59 | and we can see that |
---|
0:14:00 | the phone error rates seem similar on clean speech, |
---|
0:14:03 | but then, when we add the noise, the DBNs |
---|
0:14:06 | learn more robust representations, because when we add |
---|
0:14:11 | a large amount of noise, |
---|
0:14:13 | the deep neural net does better |
---|
0:14:16 | than the shallow neural net, even when the latter is trained with the better optimization technique. |
---|
0:14:21 | So I believe that |
---|
0:14:23 | depth, maybe its hierarchy, helps to learn basically better representations of the data, as |
---|
0:14:29 | has been found also in computer vision and so on. |
---|
0:14:32 | So basically, to conclude: |
---|
0:14:35 | I think it is now clear |
---|
0:14:38 | that |
---|
0:14:38 | deep learning |
---|
0:14:41 | works also in tandem systems, not only in hybrid systems, which is good news for those who |
---|
0:14:46 | have a lot of engineering work around GMM-HMM systems. |
---|
0:14:50 | And |
---|
0:14:51 | furthermore, I think the MFCC results say |
---|
0:14:54 | that for those who are working on hybrid systems, maybe |
---|
0:14:57 | they should |
---|
0:14:58 | incorporate |
---|
0:14:59 | more spectral information somehow, especially if there's not a lot of noise. |
---|
0:15:04 | Then, |
---|
0:15:04 | pre-training: we know it helps with overfitting, but it also helps with generalization |
---|
0:15:10 | in the case where we have this quite clear mismatch |
---|
0:15:15 | between training, which was |
---|
0:15:16 | done on clean speech, and testing. |
---|
0:15:19 | And also, |
---|
0:15:21 | the deep models, given the same number of parameters, seem to be more robust |
---|
0:15:26 | in very high-noise situations, |
---|
0:15:28 | which was also found in computer vision. |
---|
0:15:31 | And obviously these conclusions are, for now, based on a fairly small task, |
---|
0:15:35 | so I think, for future work, |
---|
0:15:38 | it would be interesting to go to a larger dataset, which we are actually working on, |
---|
0:15:44 | and also to compare |
---|
0:15:45 | between the |
---|
0:15:47 | so-called deep neural net HMM hybrid and this deep tandem estimate. |
---|
0:15:52 | Thank you very much. Are there any questions? |
---|
0:15:59 | (moderator) Are there |
---|
0:16:00 | questions? |
---|
0:16:01 | We have a microphone available. |
---|
0:16:05 | (audience) I have a question, or a comment, on how the networks work. Can you go back to the slides |
---|
0:16:09 | you showed us, |
---|
0:16:11 | this one, |
---|
0:16:11 | where at the beginning you said something is a good thing and something is a bad thing? |
---|
0:16:16 | (speaker) This one, or that one? |
---|
0:16:17 | (audience) Yeah. |
---|
0:16:19 | (audience) So which one did you say was the bad thing for the deep network? |
---|
0:16:22 | Here you have the two points. |
---|
0:16:25 | (speaker) Well, for deep networks there is one problem, which is the vanishing gradients, |
---|
0:16:30 | and the other is overfitting. |
---|
0:16:32 | (audience) Yeah, |
---|
0:16:32 | and to me these two points seem |
---|
0:16:35 | contradictory, |
---|
0:16:37 | in a sense, |
---|
0:16:40 | because if you are in the vanishing-gradient regime, you are not getting help, |
---|
0:16:44 | and the weights stay the same, which means the model stays basically the same, |
---|
0:16:50 | and if you cannot change the parameters, you cannot do this |
---|
0:16:55 | overfitting. |
---|
0:16:56 | (speaker) Does this happen in some cases, or |
---|
0:16:59 | in all cases? I think the two happen in |
---|
0:17:04 | different regimes. In my experiments, actually, I did not observe a whole |
---|
0:17:08 | lot of overfitting; |
---|
0:17:10 | I was just doing L2 regularization on the weights, but I didn't have overfitting. |
---|
0:17:15 | But in other cases, like if you read the Science paper from Hinton, there is |
---|
0:17:20 | a lot of it: you have only like twenty thousand samples and they are overfitting; |
---|
0:17:25 | you basically get to zero percent training error. In those cases, obviously, |
---|
0:17:29 | the optimization method doesn't matter that much, and it's |
---|
0:17:32 | more about how you bias the weights, which is done |
---|
0:17:34 | using these RBMs. |
---|
0:17:37 | (moderator) The next question; I don't know which one is next. |
---|
0:17:42 | (audience) You said that there is some pre-training happening, so that means |
---|
0:17:46 | you must be using some |
---|
0:17:47 | portion of the data which is not used for the training? |
---|
0:17:51 | (speaker) Right, no, not in this case: the pre-training is unsupervised, so in theory you could put in a lot of |
---|
0:17:56 | data, a lot more than the data used to train. |
---|
0:17:58 | So, |
---|
0:17:59 | yeah. |
---|
0:17:59 | (audience) So my question is about a fair comparison for the neural network: |
---|
0:18:08 | you could take the deep belief net, |
---|
0:18:17 | keep the same model, |
---|
0:18:21 | the same network, |
---|
0:18:23 | but with the pre-training removed, |
---|
0:18:32 | so it comes down to a network |
---|
0:18:35 | at the same level, |
---|
0:18:39 | and lastly, |
---|
0:18:40 | train that net by |
---|
0:18:45 | taking the whole objective function of the neural net and doing backprop |
---|
0:18:49 | with random weight initialization; |
---|
0:18:51 | that would be a fair comparison. |
---|
0:18:53 | (speaker) Uh-huh, okay. |
---|
0:18:54 | (audience) Thanks. |
---|
0:18:55 | (speaker) Yeah. |
---|
0:18:57 | (moderator) Another question? |
---|
0:18:59 | (audience) Basically, |
---|
0:19:02 | when you are concatenating the outputs |
---|
0:19:04 | from the MLP to the MFCCs, you probably want to have |
---|
0:19:08 | the most different information you can from the MLP. So I wonder: |
---|
0:19:13 | when you do the backpropagation, |
---|
0:19:17 | you are basically forcing the MLP |
---|
0:19:20 | to focus on something which discriminates between the classes that you decided to use. |
---|
0:19:26 | Did you try to look at how that would work |
---|
0:19:29 | if you did concatenate the features coming from |
---|
0:19:33 | a network that is not trained, |
---|
0:19:34 | like, not doing the training? |
---|
0:19:37 | You might then get something quite different. |
---|
0:19:40 | (speaker) So actually, |
---|
0:19:41 | the deep neural net is trained before the concatenation happens. |
---|
0:19:46 | So in a sense, |
---|
0:19:47 | you just have the phone targets and you train the neural net first, and then you concatenate |
---|
0:19:53 | for the HMM system. |
---|
0:19:54 | So I'm not sure, |
---|
0:19:55 | because I... |
---|
0:20:00 | (audience) I think you are using the outputs of the... |
---|
0:20:04 | (speaker) Yeah, so, all right: |
---|
0:20:07 | you train your network first, and then, once you have it trained, you keep it fixed, and |
---|
0:20:11 | then you don't do any more backprop after that. You could do that, but... |
---|
0:20:15 | (audience) Sure, but whether it is trained before the backpropagation or not, you still have the same number |
---|
0:20:23 | of output neurons as you would have after the backpropagation, right? |
---|
0:20:27 | Yeah, so |
---|
0:20:28 | you can choose: |
---|
0:20:29 | you can concatenate to the MFCCs the outputs you have now, or the ones you would have after |
---|
0:20:33 | the backpropagation, right? |
---|
0:20:36 | Or am I missing something? |
---|
0:20:38 | (speaker) Hmm, |
---|
0:20:40 | sorry, |
---|
0:20:41 | yeah, I don't... I don't understand. |
---|
0:20:43 | (moderator) We probably need to move to the next presentation; you can discuss that offline. Okay. |
---|