0:00:16 | thank you, so welcome back after lunch |
---|
0:00:19 | my name's frank seide, i'm from microsoft research in beijing, and this is a joint |
---|
0:00:24 | collaboration with my colleague dong yu, who happens to be chinese but is actually based |
---|
0:00:29 | in redmond. and of course there are a lot of contributors to this work inside the company and |
---|
0:00:33 | outside, and also thank you very much to the people sharing slide material |
---|
0:00:38 | okay, let me start with a personal story of how i got into this, because |
---|
0:00:42 | i'm sort of an unlikely expert on this, because until two thousand eleven, no, |
---|
0:00:47 | two thousand ten, i had no idea what neural networks were, deep ones or not |
---|
0:00:51 | so in two thousand ten |
---|
0:00:52 | my colleague dong yu, who cannot be here today, came to visit us in beijing and told |
---|
0:00:58 | us about this new speech recognition result that they had |
---|
0:01:02 | and he told me about a technology that i had never heard about, called dbn |
---|
0:01:07 | and said |
---|
0:01:08 | this was sort of invented by some professor in toronto that i also had never heard |
---|
0:01:12 | about |
---|
0:01:14 | so he and his manager at the time had invited geoffrey hinton, |
---|
0:01:19 | this professor, to come to redmond with a few students and work on applying |
---|
0:01:23 | this to speech recognition |
---|
0:01:25 | and at that time he got |
---|
0:01:26 | sixteen percent relative reduction |
---|
0:01:29 | out of applying deep neural networks |
---|
0:01:31 | and this was for an internal voice search task, a relatively small number of hours of training |
---|
0:01:36 | you know, sixteen percent is really big, a lot of people spend ten years |
---|
0:01:40 | to get a sixteen percent error reduction |
---|
0:01:42 | so my first thought about this was |
---|
0:01:44 | sixteen percent, wow, what's wrong with the baseline |
---|
0:01:55 | so we said, well, why don't we collaborate on this and try how this carries over to |
---|
0:01:59 | a large-scale task, that is switchboard |
---|
0:02:02 | and the key thing that was actually invented here, well, there's the classic ann-hmm |
---|
0:02:07 | i think this reference was probably covered |
---|
0:02:10 | this morning by nelson |
---|
0:02:12 | a little bit |
---|
0:02:13 | too late |
---|
0:02:15 | so there's the classic ann-hmm, and then there's the deep network, the dbn |
---|
0:02:19 | which actually does not stand for dynamic bayesian network, as i learned |
---|
0:02:23 | at that point |
---|
0:02:24 | and then dong yu put in this idea of |
---|
0:02:26 | just using tied triphone states as modeling targets, like we do in gmm based systems |
---|
0:02:32 | okay so |
---|
0:02:34 | then fast forward like half a year of reading papers and tutorials to get started, and |
---|
0:02:38 | finally we got to the point where we got the first |
---|
0:02:41 | results. so this is our gmm baseline, and i started the training, and the next day i had |
---|
0:02:47 | the first iteration |
---|
0:02:48 | was like twenty two percent, so okay, that seems to not be completely off |
---|
0:02:53 | the next day i come back |
---|
0:02:55 | twenty percent |
---|
0:02:56 | so fourteen percent relative, and i sent a congratulations email to my colleague, right |
---|
0:03:00 | then it kept running, and the next day i came back |
---|
0:03:03 | eighteen percent |
---|
0:03:04 | and really from that moment on i was just sitting at the computer waiting |
---|
0:03:07 | for the next result to come out, submitting it, and seeing whether it had gotten better |
---|
0:03:11 | we got seventeen point three |
---|
0:03:13 | something point one |
---|
0:03:15 | then we redid the alignment, that's one thing dong yu had already determined on the |
---|
0:03:20 | smaller setup, and we got it down to sixteen point four. then we looked at sparseness, |
---|
0:03:24 | sixteen point one. so we got a thirty two percent error reduction |
---|
0:03:27 | that's a very large reduction |
---|
0:03:29 | all out of a single technology |
---|
0:03:33 | we also ran this over different test sets with the same model, and you could see |
---|
0:03:37 | the error rate reductions were all sort of in a similar range |
---|
0:03:40 | where the data didn't match as well, the gains were slightly worse |
---|
0:03:44 | we also looked at other ones, for example at some point we finally trained the two |
---|
0:03:48 | thousand hour model that we could ship for a product, like the windows phone system that you |
---|
0:03:54 | have right now, and we got something like fifteen percent error reduction |
---|
0:03:58 | and also other companies started publishing, for example ibm on broadcast news, i think the |
---|
0:04:02 | total gain was thirteen to eighteen percent, that's i think in an up to date paper, and |
---|
0:04:07 | then google on youtube, i think there it was about nineteen percent. so the gains were |
---|
0:04:11 | really convincing across the board |
---|
0:04:14 | okay, so that was our early work. so what is this actually |
---|
0:04:17 | now, i thought this audience has very different proportions of understanding, people might not |
---|
0:04:22 | you know, know dnns inside out, so i think i would like to |
---|
0:04:26 | go through and explain |
---|
0:04:27 | a little bit more of the basics of how this works. i don't know how many |
---|
0:04:31 | dnn experts are really here today, i hope it's not gonna be too boring |
---|
0:04:34 | so the basic idea is |
---|
0:04:36 | the dnn looks at, for example, a spectrogram |
---|
0:04:40 | a rectangular patch out of that, a range of vectors |
---|
0:04:44 | and feeds this into a processing chain where it basically multiplies this input vector, this rectangle |
---|
0:04:49 | here, with a matrix, adds some bias, and applies a nonlinearity, and then you get |
---|
0:04:54 | something like two thousand values out of that. you do this several times |
---|
0:04:58 | and the top layer does the same thing except the nonlinearity is a softmax |
---|
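As a rough illustration of the processing chain just described, here is a minimal numpy sketch of a DNN forward pass over a window of feature frames; the layer sizes and helper names are made up for the example, not taken from the talk.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())            # subtract max for numerical stability
    return e / e.sum()

def dnn_forward(x, hidden_weights, hidden_biases, W_out, b_out):
    """Run one input window through several sigmoid layers and a softmax output.

    x              : flattened rectangular patch of the spectrogram (context window)
    hidden_weights : list of weight matrices, one per hidden layer
    hidden_biases  : list of bias vectors, one per hidden layer
    W_out, b_out   : parameters of the top softmax layer (one row per tied triphone state)
    """
    h = x
    for W, b in zip(hidden_weights, hidden_biases):
        h = sigmoid(W @ h + b)         # multiply by a matrix, add a bias, apply the nonlinearity
    return softmax(W_out @ h + b_out)  # posterior over the output classes

# toy example: 11 frames of 40-dimensional features, 3 hidden layers of 2048 units, 9000 states
rng = np.random.default_rng(0)
x = rng.standard_normal(11 * 40)
Ws = [rng.standard_normal((2048, 11 * 40)) * 0.01] + \
     [rng.standard_normal((2048, 2048)) * 0.01 for _ in range(2)]
bs = [np.zeros(2048) for _ in range(3)]
posteriors = dnn_forward(x, Ws, bs, rng.standard_normal((9000, 2048)) * 0.01, np.zeros(9000))
print(posteriors.shape, posteriors.sum())   # (9000,) and ~1.0
```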
0:05:02 | so |
---|
0:05:04 | these are the formulas for that. so what is that actually, well, a softmax |
---|
0:05:08 | is this form here |
---|
0:05:09 | that is essentially nothing else but sort of a linear classifier, and it is linear |
---|
0:05:13 | because if you look at the class boundaries between two classes, they are linear, so it's actually a |
---|
0:05:17 | relatively weak classifier we have there |
---|
0:05:20 | the hidden layers are actually very similar, they have the same form, the only difference |
---|
0:05:25 | is that there are sort of only two classes |
---|
0:05:28 | instead of, you know, all the different speech states here, and the second class |
---|
0:05:32 | has its parameters set to zero |
---|
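To make that relation explicit, here is the standard way to write it down (my notation, not copied from the slide): the softmax over classes, and a sigmoid hidden unit as a two-class softmax whose second class has zero parameters.

```latex
% softmax output layer over classes j
P(j \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_j^\top \mathbf{x} + b_j)}
                            {\sum_{k} \exp(\mathbf{w}_k^\top \mathbf{x} + b_k)}

% a sigmoid hidden unit has the same form with two classes,
% where the second class has weights and bias equal to zero:
\sigma(\mathbf{w}^\top \mathbf{x} + b)
  = \frac{\exp(\mathbf{w}^\top \mathbf{x} + b)}
         {\exp(\mathbf{w}^\top \mathbf{x} + b) + \exp(0)}
```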
0:05:34 | so what is this really, this is sort of a classifier that classifies class |
---|
0:05:38 | membership or non-membership in some class, but we don't know what those classes are, |
---|
0:05:42 | actually |
---|
0:05:43 | and this representation is actually also kind of sparse, typically you get only |
---|
0:05:48 | maybe ten percent of the activations, five to ten percent, |
---|
0:05:52 | to be active in any given frame |
---|
0:05:54 | so these are really sort of class memberships, a kind of descriptive features |
---|
0:05:58 | of your input |
---|
0:06:00 | so another way of looking at it is |
---|
0:06:03 | basically what it does is take an input vector and project it onto something like a basis vector, |
---|
0:06:07 | one column |
---|
0:06:09 | this would be like a direction vector you project onto, there's a bias term we |
---|
0:06:13 | add on it, and then you run it through this nonlinearity, which is just sort of a soft |
---|
0:06:16 | binarization |
---|
0:06:18 | so what this does is give you sort of a, well, something like |
---|
0:06:21 | a coordinate system for your inputs |
---|
0:06:25 | and get another |
---|
0:06:27 | way of looking at it is |
---|
0:06:28 | well |
---|
0:06:30 | this one here is actually a correlation, so the parameters have the same sort |
---|
0:06:36 | of physical meaning as the inputs you put in there |
---|
0:06:40 | so for example for the first layer the model parameters are also of the nature |
---|
0:06:44 | of being a rectangular patch |
---|
0:06:45 | of spectrogram |
---|
0:06:46 | so and this is what they look like, i think there was a little bit |
---|
0:06:49 | of discussion on this earlier in nelson's talk |
---|
0:06:52 | so what does this mean, each of these |
---|
0:06:55 | is, in this case, twenty three frames wide |
---|
0:06:59 | this is the frequency |
---|
0:07:01 | axis here |
---|
0:07:02 | and what happens is that these things are basically overlaid over here and then a |
---|
0:07:05 | correlation is made, and wherever it detects this particular pattern, this one is sort of a |
---|
0:07:09 | peak detector for a peak that is sliding over time |
---|
0:07:13 | then you get a high output |
---|
0:07:14 | okay |
---|
0:07:15 | you can see all these different patterns here, and many of them really look |
---|
0:07:18 | like filters |
---|
0:07:20 | but these are automatically learned by the system, there's no knowledge that was put in there |
---|
0:07:24 | you have these edge detectors, you have peak detectors, you have some sliding detectors, you |
---|
0:07:29 | have a lot of noise in there actually, i don't know what that's for, i think |
---|
0:07:32 | these are probably ignored by the later stages |
---|
0:07:36 | the harder problem is how to interpret the hidden layers |
---|
0:07:39 | the hidden layers don't have any sort of spatial relationship to the input |
---|
0:07:44 | or something, so the only thing that i could think of is that |
---|
0:07:47 | they are representing something like |
---|
0:07:49 | logical operations. so think of this, again this is the direction vector, this is the |
---|
0:07:53 | hyperplane that is described by the bias, right. so if your inputs for example are |
---|
0:07:58 | one, this one is one, this is obviously a |
---|
0:08:01 | two dimensional vector, this one is zero |
---|
0:08:04 | it could be this one or this one, and you could put a plane here that indicates an or operation |
---|
0:08:09 | okay, kind of a soft or, because it's not strictly binary |
---|
0:08:12 | or you put it here, and it's like an and operation |
---|
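As a toy illustration of this "soft logical operation" intuition (my own example, not from the slides): with binary inputs, a sigmoid unit with weights of one acts like a soft OR or a soft AND, depending only on where the bias places the hyperplane.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_gate(x1, x2, bias, scale=10.0):
    """A single sigmoid hidden unit over two binary inputs.

    With weights (1, 1), bias -0.5 puts the plane so either input turns the unit on (soft OR),
    bias -1.5 puts it so both inputs are needed (soft AND). 'scale' just sharpens the decision.
    """
    return sigmoid(scale * (1.0 * x1 + 1.0 * x2 + bias))

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2,
          round(soft_gate(x1, x2, bias=-0.5), 2),   # soft OR:  on if x1 or x2
          round(soft_gate(x1, x2, bias=-1.5), 2))   # soft AND: on only if x1 and x2
```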
0:08:14 | so i think, and this is my personal intuition, what the dnn actually does |
---|
0:08:18 | is |
---|
0:08:19 | on the lower layers it extracts these landmarks |
---|
0:08:22 | and on the higher layers it assembles them into more complicated classes |
---|
0:08:27 | and it can do interesting things, you can imagine |
---|
0:08:30 | that for example one node in a layer discovers, say, a female version of an a, and |
---|
0:08:34 | then another node would give you a male version of a |
---|
0:08:37 | then the next layer would say it's an a, either |
---|
0:08:40 | a female or male a |
---|
0:08:42 | so this is an idea of where the modeling power of this |
---|
0:08:45 | comes from |
---|
0:08:47 | okay so take away |
---|
0:08:49 | the lowest layer matches landmarks, the higher layers i think are sort of soft logical operators |
---|
0:08:54 | and the top layer is just a really primitive linear classifier |
---|
0:08:57 | okay, so how do we do this in speech, how is this used in speech |
---|
0:09:02 | you take those outputs, these probabilities, posterior probabilities of speech states, |
---|
0:09:08 | posteriors, you know |
---|
0:09:10 | and you turn them into |
---|
0:09:12 | likelihoods using bayes rule, and these are directly used in the hidden markov model |
---|
0:09:16 | in a |
---|
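The conversion being referred to is the usual hybrid trick: divide the network's state posterior by the state prior to get a scaled likelihood for the HMM (standard notation, not copied from the slide).

```latex
% scaled likelihood used as the HMM emission score for state s at frame t:
p(\mathbf{x}_t \mid s) \;\propto\; \frac{P(s \mid \mathbf{x}_t)}{P(s)}
% P(s | x_t): DNN softmax output
% P(s): prior of the tied triphone state, estimated from the training alignment
```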
0:09:19 | and the key thing here is that these classes are tied triphone states and not |
---|
0:09:23 | monophone states, that is the thing that really made a big difference |
---|
0:09:26 | okay, so just before we move on, just to give a rough idea of |
---|
0:09:30 | what these word error rates actually mean, i want to play a little video clip |
---|
0:09:36 | where our executive vice president of research gave an on stage demo |
---|
0:09:41 | and you can see what accuracies come out of a speaker independent |
---|
0:09:45 | dnn, so this has not been adapted to his voice |
---|
0:09:53 | still far error rate for our work we have the one point five |
---|
0:10:04 | what you hear research my research university |
---|
0:10:10 | okay together with the other in your recognition so |
---|
0:10:19 | i use i tell you know what i weight given red color your |
---|
0:10:31 | so this is, this is basically perfect, right, and this is really a speaker independent |
---|
0:10:35 | system |
---|
0:10:36 | and you can, i think, do interesting things with that, so just for the fun of it |
---|
0:10:39 | i'm gonna play a later part of the video where we actually use this |
---|
0:10:42 | as input to drive translation |
---|
0:10:46 | translated into chinese you and vocal here we see i am i know |
---|
0:11:05 | i |
---|
0:11:07 | there i here |
---|
0:11:09 | you people one |
---|
0:11:17 | that is there |
---|
0:11:21 | side |
---|
0:11:31 | for this is a very |
---|
0:11:35 | you do initial values you well |
---|
0:11:41 | if you hear that right down by various people |
---|
0:11:48 | so what we see |
---|
0:11:54 | so that's the kind of fun you can have with a model like that |
---|
0:11:58 | okay so |
---|
0:11:59 | now in this talk |
---|
0:12:02 | i would like to |
---|
0:12:03 | you know, people have been giving invited talks about dnns |
---|
0:12:08 | at conferences like this, like one hour talks |
---|
0:12:12 | for example at last year's conference, or andrew senior |
---|
0:12:16 | i think gave one as well, so when i prepared this |
---|
0:12:20 | talk i found that it ended up |
---|
0:12:23 | being andrew's talk |
---|
0:12:26 | so i thought that's maybe not a good idea, i wanna do it slightly differently |
---|
0:12:29 | so what i want to do is be somewhat more focused |
---|
0:12:31 | i'm not gonna give you an exhaustive overview of everything, but i will focus |
---|
0:12:35 | on |
---|
0:12:36 | what is needed to build real life systems, large-scale systems, so for example you will |
---|
0:12:40 | not see a timit result |
---|
0:12:42 | and it's structured along three areas: training, features and runtime. training is the biggest one, and i'm |
---|
0:12:47 | gonna start with it |
---|
0:12:50 | so |
---|
0:12:51 | how do you train this model i think we're pretty much all familiar with back-propagation |
---|
0:12:55 | you give it |
---|
0:12:56 | a sample vector, run it through the network, get a posterior distribution, compare it against what it |
---|
0:13:00 | should be |
---|
0:13:01 | and then basically nudge the system a little bit in the direction to do a |
---|
0:13:05 | better job next time |
---|
0:13:07 | and so the problem is when you do this with the deep network often the |
---|
0:13:11 | system does not converge well or will get stuck in a local optimum |
---|
0:13:14 | so the thing that this whole revolution with geoffrey hinton started with, |
---|
0:13:19 | the thing that, |
---|
0:13:19 | sorry, the thing that he proposed, is the restricted boltzmann machine |
---|
0:13:24 | and the idea is basically you train |
---|
0:13:26 | layer by layer. so here we extend the network sort of in a way that it |
---|
0:13:30 | can run backwards |
---|
0:13:31 | so you can run the sample through |
---|
0:13:34 | you get a representation, you run it backwards, and then you can see, okay, how |
---|
0:13:37 | well does the thing that comes out actually match my input |
---|
0:13:40 | then you can tune that system so that it matches the input as closely as possible |
---|
0:13:45 | if you can do that, and don't forget this is sort of a binary representation, |
---|
0:13:48 | that means you have a representation of the data that is meaningful, this thing extracts something |
---|
0:13:53 | meaningful about the data, and that's sort of the idea |
---|
0:13:56 | so now you do the same thing with the next layer, you freeze this, it is |
---|
0:13:59 | taken as a feature extractor |
---|
0:14:00 | and do this with the next layer and so on |
---|
0:14:02 | then you put |
---|
0:14:04 | a softmax on top and then train the whole configuration |
---|
0:14:08 | now, i had no idea about |
---|
0:14:10 | deep neural networks or anything when i started this, so i thought, why would we do |
---|
0:14:13 | this, it's so complicated. i mean, we had already run this experiment on how many layers you |
---|
0:14:18 | need and so on, so we already had |
---|
0:14:20 | a network that had like a single hidden layer |
---|
0:14:23 | so why not just take that one as initialization |
---|
0:14:25 | rip out its softmax layer and then put another |
---|
0:14:30 | hidden layer and another softmax on top of it |
---|
0:14:32 | and then iterate the entire stack here |
---|
0:14:34 | and then after that, again rip this guy off and do it again and so |
---|
0:14:38 | on, and once you are at the top, you iterate this whole thing |
---|
0:14:41 | so we call this greedy layer-wise discriminative pre-training |
---|
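Here is a compact sketch of that greedy layer-wise discriminative pre-training loop; the training function is only a stand-in (not real code from the system), since any cross-entropy backprop trainer would slot in there.

```python
import numpy as np

rng = np.random.default_rng(0)

def new_layer(n_in, n_out, scale=0.01):
    """Randomly initialized affine layer (weight matrix and bias vector)."""
    return {"W": rng.standard_normal((n_out, n_in)) * scale, "b": np.zeros(n_out)}

def train_with_backprop(hidden_layers, softmax_layer, data, epochs):
    """Stand-in for a cross-entropy backprop trainer over the whole current stack.

    In the recipe described in the talk this runs only a few sweeps (not to convergence)
    before the next layer is inserted; here it is a placeholder so the loop structure is clear.
    """
    return hidden_layers, softmax_layer   # pretend-trained

def discriminative_pretrain(layer_sizes, n_inputs, n_states, data):
    hidden_layers = [new_layer(n_inputs, layer_sizes[0])]
    softmax_layer = new_layer(layer_sizes[0], n_states)
    hidden_layers, softmax_layer = train_with_backprop(hidden_layers, softmax_layer, data, epochs=1)
    for n_out in layer_sizes[1:]:
        # rip off the softmax, insert a fresh hidden layer, put a fresh softmax on top
        hidden_layers.append(new_layer(len(hidden_layers[-1]["b"]), n_out))
        softmax_layer = new_layer(n_out, n_states)
        # iterate the entire stack a little (only into the ballpark), then move on
        hidden_layers, softmax_layer = train_with_backprop(hidden_layers, softmax_layer, data, epochs=1)
    # final full fine-tuning of the complete stack
    return train_with_backprop(hidden_layers, softmax_layer, data, epochs=20)

layers, top = discriminative_pretrain([2048, 2048, 2048], n_inputs=440, n_states=9000, data=None)
print(len(layers), "hidden layers,", top["W"].shape, "softmax")
```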
0:14:44 | and it turns out that actually works really well so if we look at this |
---|
0:14:48 | the dbn pretraining from geoffrey hinton, this is the green curve here |
---|
0:14:53 | if you do what i just described, you get the red one, which is essentially |
---|
0:14:58 | the same word error rate |
---|
0:15:00 | and this is for different numbers of layers, this is not progression over training, it's the accuracy |
---|
0:15:05 | for different numbers of layers, right |
---|
0:15:07 | so the more layers you add, the better it gets, and |
---|
0:15:09 | you see these two basically |
---|
0:15:11 | track each other |
---|
0:15:12 | the layer-wise pretraining is slightly worse, but then dong yu, who understands neural networks much better |
---|
0:15:17 | than i do, |
---|
0:15:18 | said you shouldn't maybe iterate the model all the way to the end, you should |
---|
0:15:22 | just let it iterate a little bit, until it's in the ballpark, then move on. it |
---|
0:15:25 | turns out that made the system slightly better, and actually the sixteen point eight here, |
---|
0:15:29 | this is with this modified pre-training method, it works like that |
---|
0:15:34 | now, i think this is expensive |
---|
0:15:35 | because every time you have this full nine thousand senone top layer there, but |
---|
0:15:39 | it turns out you don't need to do that, you can actually use monophones |
---|
0:15:42 | and it actually works equally well and is much cheaper |
---|
0:15:46 | okay, so the take away: pre-training still sort of helps |
---|
0:15:50 | but greedy discriminative pre-training is sufficient and much simpler than the rbm pre-training, because |
---|
0:15:55 | we just use the existing code, we don't need new coding |
---|
0:15:59 | okay another important topic is |
---|
0:16:02 | sequence training |
---|
0:16:03 | so the question here is |
---|
0:16:06 | so far we have actually trained this network to classify these signals into those segments of |
---|
0:16:11 | speech independently of each other, but in speech recognition |
---|
0:16:14 | we have dictionaries, of course language models, we have the hidden markov model that gives you |
---|
0:16:18 | sequences and so on |
---|
0:16:19 | so if we integrate that into the system, if we do that, |
---|
0:16:23 | we should actually get a better result, right |
---|
0:16:25 | so |
---|
0:16:27 | the frame-classification criterion is written this way, you maximise the log posterior of every single |
---|
0:16:32 | frame, you know, the posterior of the correct state |
---|
0:16:36 | if you write down sequence training, you actually find |
---|
0:16:40 | that it has exactly the same form |
---|
0:16:42 | except this here is not the state posterior derived from the dnn, but it is the state |
---|
0:16:47 | posterior taking all the additional knowledge into account |
---|
0:16:51 | so this one takes into account the hmms, the dictionary and the language models |
---|
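In standard notation (mine, not the slide's), the two criteria look like this: frame cross-entropy maximises the DNN's own state posterior per frame, while sequence training (written here as MMI) maximises a posterior that folds in the HMM, lexicon and language model, computed in practice from word lattices.

```latex
% frame-level cross-entropy criterion:
F_{\mathrm{CE}} = \sum_{t} \log P_{\mathrm{DNN}}(s_t \mid \mathbf{x}_t)

% sequence (MMI) criterion over utterances u with reference words W_u:
F_{\mathrm{MMI}} = \sum_{u} \log
  \frac{p(\mathbf{X}_u \mid W_u)^{\kappa}\, P(W_u)}
       {\sum_{W} p(\mathbf{X}_u \mid W)^{\kappa}\, P(W)}
% the denominator sum over competing word sequences W is approximated by a lattice;
% kappa is the acoustic scaling factor
```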
0:16:55 | so the way to run this is you run your data through, and you have |
---|
0:16:59 | here your decoder from speech recognition |
---|
0:17:01 | which computes these posteriors |
---|
0:17:02 | and in practical terms you would do this with word lattices |
---|
0:17:05 | and then you do back-propagation and |
---|
0:17:08 | so we did that |
---|
0:17:10 | we start with the baseline fifty one six percent |
---|
0:17:13 | we did the first iteration of this sequence training |
---|
0:17:16 | i want to |
---|
0:17:17 | the one |
---|
0:17:18 | for |
---|
0:17:19 | so that kind of didn't work |
---|
0:17:22 | so |
---|
0:17:24 | well we observe that it sort of time for each so |
---|
0:17:27 | don't like we're training |
---|
0:17:30 | so we tried to dig in, what is the problem here, so there are four |
---|
0:17:33 | hypotheses |
---|
0:17:34 | are we actually using the right models for lattice generation, are there problems with lattice sparseness, |
---|
0:17:39 | with randomization of data, and the objective function, there are multiple objective functions to choose from. and today |
---|
0:17:44 | i will talk about the lattice sparseness part |
---|
0:17:46 | so finally, one thing we found was that |
---|
0:17:49 | there was an increasing |
---|
0:17:51 | sort of |
---|
0:17:52 | problem of speech getting replaced by silence |
---|
0:17:57 | a deletion problem. we saw that the silence scores of course were growing |
---|
0:18:01 | and the other scores were not |
---|
0:18:03 | so basically what happens is that |
---|
0:18:05 | the lattice is very biased the lattice typically doesn't have negative hypotheses for silence because |
---|
0:18:11 | it's so far away from speech but it has a lot a lot of positive |
---|
0:18:15 | examples of silence |
---|
0:18:16 | so this thing was just biasing the system towards recognizing silence, really, you know, giving silence a |
---|
0:18:21 | high bias |
---|
0:18:22 | so what we did is we said, okay, why don't we just |
---|
0:18:24 | not update |
---|
0:18:26 | the silence states and also skip all silence frames |
---|
0:18:29 | so that already gave us something much better |
---|
0:18:31 | it already looks like it's converging |
---|
0:18:34 | we could also do this slightly more systematically, we could actually explicitly add silence arcs |
---|
0:18:39 | into the lattice |
---|
0:18:41 | right, those that should have been there in the first place |
---|
0:18:44 | so once you do that |
---|
0:18:46 | you actually get even slightly better, so that kind of confirms that the missing-silence hypothesis |
---|
0:18:50 | was right |
---|
0:18:52 | but then |
---|
0:18:53 | another problem is that the lattices are rather sparse |
---|
0:18:56 | so we find that at any given frame |
---|
0:18:58 | we only have like three hundred out of nine thousand senones covered, and |
---|
0:19:02 | that |
---|
0:19:03 | the others are not there because they basically had zero probability |
---|
0:19:07 | but as the model moves along, maybe they at some point no longer have zero |
---|
0:19:11 | probability, so they should be there in the lattice, but they're not |
---|
0:19:14 | so the system cannot train properly |
---|
0:19:16 | so we thought, why don't we just regenerate lattices after one iteration |
---|
0:19:20 | and we see that helps a little bit, the difference is it at least keeps stable here |
---|
0:19:25 | now we thought can we do this slightly better so basically we take this idea |
---|
0:19:28 | of adding silence |
---|
0:19:30 | and sort of also adding speech arcs, but you can't really do that |
---|
0:19:33 | but a similar effect can be achieved by interpolating your sequence criterion |
---|
0:19:38 | with the frame criterion |
---|
0:19:40 | so and then, basically, when we do that we get |
---|
0:19:43 | a very good convergence |
---|
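This interpolation is often called frame smoothing; written out (again in my own notation), the training objective becomes a weighted sum of the sequence criterion and the frame cross-entropy criterion.

```latex
% frame-smoothed sequence training objective:
F = (1 - H)\, F_{\mathrm{MMI}} + H\, F_{\mathrm{CE}}
% H is a small interpolation weight on the frame criterion, chosen on a development set
```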
0:19:46 | so |
---|
0:19:47 | now, we're not the only people that observed that problem of running into this |
---|
0:19:51 | issue with the training, so for example karel vesely |
---|
0:19:55 | and his coworkers |
---|
0:19:57 | observed that |
---|
0:19:58 | if you look at the posterior probability of the ground truth path |
---|
0:20:02 | over time, you sometimes find that it's very low, it's not always zero, but sometimes it is zero |
---|
0:20:07 | and that means a lot |
---|
0:20:09 | so |
---|
0:20:09 | what they found is that |
---|
0:20:11 | if you just skip those frames, they called it frame rejection, you get a much better |
---|
0:20:15 | convergence behavior, so the red curve is without and the blue curve is |
---|
0:20:19 | with that frame removal |
---|
0:20:23 | and of course |
---|
0:20:25 | brian also observed exactly the same thing but he said no i'm gonna do the |
---|
0:20:28 | smart thing |
---|
0:20:29 | i'm gonna do something much better, i'm gonna use a second order method |
---|
0:20:33 | so with a second order method you approximate the objective function as a second |
---|
0:20:37 | order function, and then you can, like, hop right to the optimum, theoretically |
---|
0:20:41 | and this can be done without explicitly computing the hessian, and this is |
---|
0:20:44 | the method that martens, a student of hinton, |
---|
0:20:48 | sort of optimized |
---|
0:20:49 | and the nice thing is it's actually a batch method |
---|
0:20:52 | so it doesn't |
---|
0:20:54 | suffer from these previous issues of, like, lattice sparseness and so on, so |
---|
0:20:59 | a lot of that goes away |
---|
0:21:01 | and also, i think at this conference there's a paper that says that it |
---|
0:21:04 | works with a partially iterated ce model, you don't even have to do a full ce |
---|
0:21:08 | iteration, that's also very attractive |
---|
0:21:11 | and |
---|
0:21:12 | and i need to say that brian actually was the first to show the |
---|
0:21:16 | effectiveness of sequence training |
---|
0:21:18 | for switchboard |
---|
0:21:19 | okay so you have some results |
---|
0:21:22 | so this is the gmm system, the ce trained cd-dnn, and the |
---|
0:21:27 | sequence trained one |
---|
0:21:28 | so this is all on switchboard, on hub5 and on rt03 |
---|
0:21:32 | so we get like twelve percent |
---|
0:21:35 | basically, and others got eleven percent, and brian on the rt03 set got |
---|
0:21:39 | also fourteen percent, so it's all in a similar range |
---|
0:21:42 | we also |
---|
0:21:43 | then, i wanna point out one thing |
---|
0:21:46 | going from here to here |
---|
0:21:47 | now the dnn has given us forty two percent relative |
---|
0:21:51 | and that's a fair comparison because this is also a sequence trained baseline |
---|
0:21:55 | right, so the only difference is, your good old gmm is replaced by the dnn |
---|
0:22:01 | and also, it works on a larger dataset |
---|
0:22:05 | okay, the take away: sequence training gives us gains of up to thirty percent |
---|
0:22:10 | doing it with sgd works, but you need some tricks there, |
---|
0:22:13 | those are smoothing and rejection of bad frames |
---|
0:22:16 | and the hessian-free method requires no tricks but is actually much more complicated, so to |
---|
0:22:20 | start with, i would probably start with the sgd method |
---|
0:22:27 | so another big question is parallelizing the training |
---|
0:22:30 | so just to give an idea, the model that we used in this demo video |
---|
0:22:34 | was trained on two thousand hours |
---|
0:22:37 | it took sixty days |
---|
0:22:40 | now |
---|
0:22:41 | most of you probably don't work with windows |
---|
0:22:44 | we do, and that causes a very specific problem, because you've probably heard of something |
---|
0:22:49 | called patch tuesday |
---|
0:22:51 | so basically |
---|
0:22:52 | every two to four weeks microsoft it forces us to update some virus scanner |
---|
0:22:57 | or something like that |
---|
0:22:58 | and so basically those machines have to be rebooted |
---|
0:23:02 | so running a job for sixty days is actually a problem |
---|
0:23:06 | so |
---|
0:23:07 | we were running this on one gpu, so we had a very strong motivation to look |
---|
0:23:11 | at that |
---|
0:23:12 | but don't get your hopes up |
---|
0:23:14 | so |
---|
0:23:15 | one way of trying to parallelize the training is to switch to batch methods |
---|
0:23:20 | brian had already shown hessian-free works very well and can be parallelized |
---|
0:23:24 | so actually someone who was an intern at microsoft |
---|
0:23:29 | tried to use hessian-free also for the ce training |
---|
0:23:34 | but the take away was basically it takes a lot of iterations to get |
---|
0:23:38 | there, so it was actually not faster |
---|
0:23:41 | so back to sgd |
---|
0:23:42 | sgd is also a problem, because if we do mini-batches of, say, one thousand twenty |
---|
0:23:47 | four frames, every one thousand twenty four frames you have to exchange a lot of data |
---|
0:23:51 | so that's a big challenge. so the first group, actually a company, that did |
---|
0:23:55 | this successfully was google, with the asynchronous sgd that they use |
---|
0:24:00 | so the way that works is |
---|
0:24:02 | you have your machines, you group them, a first group of them goes together, each |
---|
0:24:06 | of them takes a part of the model, and then you split your data, and |
---|
0:24:08 | each chunk computes a different gradient |
---|
0:24:11 | so that at any given time, |
---|
0:24:13 | whenever one of them has a gradient computed, |
---|
0:24:16 | it sends that to a |
---|
0:24:18 | parameter server, or a set of parameter servers, and those parameter servers aggregate |
---|
0:24:23 | the model, update it with the gradient |
---|
0:24:25 | and then |
---|
0:24:26 | whenever they feel like it and the bandwidth allows, they send |
---|
0:24:31 | the model back |
---|
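To make the asynchronous flow concrete, here is a deliberately tiny sketch of the pattern using Python threads, a shared parameter vector as the "parameter server", and a fake gradient function; all names and the toy objective are mine, and a real system would shard the model and run this across machines.

```python
import threading
import numpy as np

# toy objective: minimize ||params - target||^2, standing in for the DNN loss
target = np.array([1.0, -2.0, 0.5, 3.0])
params = np.zeros(4)                  # the "parameter server" state
lock = threading.Lock()               # kept simple; real servers aggregate without a single global lock

def fake_gradient(theta):
    """Stand-in for a minibatch gradient computed by one worker on its data shard."""
    return 2.0 * (theta - target)

def worker(steps, lr=0.05):
    for _ in range(steps):
        with lock:
            local_copy = params.copy()    # fetch the current model whenever bandwidth allows
        grad = fake_gradient(local_copy)  # computed on a possibly stale copy: the asynchronous part
        with lock:
            params[:] -= lr * grad        # push the gradient; the server applies it right away

threads = [threading.Thread(target=worker, args=(200,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(np.round(params, 2))   # close to target despite workers having used stale copies
```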
0:24:32 | now that's a completely asynchronous process, the model, think of this as just being independent |
---|
0:24:36 | threads, one thread is just computing with whatever's in memory |
---|
0:24:39 | another thread is just sharing and exchanging data in whatever way, there's no synchronisation |
---|
0:24:45 | so why would that work |
---|
0:24:47 | well, it's very simple, because |
---|
0:24:50 | sgd sort of implies an assumption, you know, already, which we make, |
---|
0:24:55 | which is basically that |
---|
0:24:57 | every parameter update contributes independently to the objective function |
---|
0:25:01 | so it's okay to miss some of them |
---|
0:25:05 | and also there is something that i call delayed update, let me quickly explain |
---|
0:25:08 | that |
---|
0:25:08 | so in the simplest form, as i explained the training in the beginning, at every point |
---|
0:25:12 | in time t you take a sample x, you take the model, |
---|
0:25:16 | compute the gradient, update the model with the gradient |
---|
0:25:20 | and then do it again, after one frame you do it again, and do it again |
---|
0:25:24 | and written down, |
---|
0:25:26 | your new model is equal to the old model plus the gradient step |
---|
0:25:29 | we can also do this differently, you can also not advance |
---|
0:25:33 | the model, that is, use the same model multiple times |
---|
0:25:36 | and update, for example, in this example four times |
---|
0:25:39 | so you do four model updates, the frames are still these frames, right, but the |
---|
0:25:43 | model is the same stale model |
---|
0:25:45 | and you do this again and so on |
---|
0:25:47 | so that's actually what we call mini-batch based update right |
---|
0:25:51 | mini-batch training |
---|
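Written as formulas (my notation), per-sample SGD and the minibatch, or delayed, update differ only in which model the gradients are evaluated on:

```latex
% per-sample SGD: every gradient uses the newest model
\theta_{t+1} = \theta_t + \epsilon\, g(x_t;\, \theta_t)

% minibatch / delayed update with batch size N: N gradients all use the stale model \theta_t
\theta_{t+N} = \theta_t + \epsilon \sum_{\tau = t}^{t+N-1} g(x_\tau;\, \theta_t)
% g(x; \theta) denotes the update direction (negative gradient of the per-frame loss);
% asynchronous and double-buffered variants have the same form, with the delay N
% varying somewhat randomly instead of being fixed
```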
0:25:53 | so now if you want to do parallelization, you need to deal with the problem that |
---|
0:25:56 | we need to do computation and data exchange in parallel, so you would do something like |
---|
0:26:00 | this, you know, you would have a model and you would start sending that into |
---|
0:26:04 | the network, so at some point it can do the model update while it keeps computing |
---|
0:26:09 | the next batch |
---|
0:26:11 | and then |
---|
0:26:11 | you do this in an overlapped fashion, once these are computed you send the result over while |
---|
0:26:15 | these are being received and updated, so you get this sort of overlapped processing, and we |
---|
0:26:20 | call it the double buffered update |
---|
0:26:22 | it has exactly the same form, so with this formula you can write it in exactly |
---|
0:26:25 | the same form |
---|
0:26:27 | and asgd is basically just sort of a random version of this, where you have |
---|
0:26:31 | no fixed delay, it's just |
---|
0:26:34 | somewhere jumping between one value or another, just like that |
---|
0:26:38 | so why am i telling you this |
---|
0:26:40 | well, why would this work, because this is not different from a minibatch |
---|
0:26:44 | and to make it work, the only thing you need to make sure is that we |
---|
0:26:47 | still stay in this |
---|
0:26:48 | sort of linear regime |
---|
0:26:50 | it also means that as your training progresses you can increase your mini-batch size |
---|
0:26:54 | we observed that, and it also means you can increase |
---|
0:26:57 | your delay |
---|
0:26:59 | which means you can use more machines |
---|
0:27:00 | the more machines you use, the more delay you incur, because of network latency, right |
---|
0:27:06 | okay |
---|
0:27:07 | so |
---|
0:27:09 | okay so but then |
---|
0:27:11 | actually |
---|
0:27:13 | there were three times |
---|
0:27:15 | that colleagues told me |
---|
0:27:17 | they would just implement this, it's all in the paper, |
---|
0:27:19 | and then |
---|
0:27:20 | like three months later i asked them, so where are we with this today, and did |
---|
0:27:23 | it scale well |
---|
0:27:24 | that actually happened three times. so why does it not work |
---|
0:27:27 | so let's look at this, what are the different ways of parallelizing something: model parallelism, |
---|
0:27:31 | data parallelism, or layer parallelism |
---|
0:27:34 | model parallelism means you're splitting the model over different nodes |
---|
0:27:37 | then in each computation step they each only compute part of the output |
---|
0:27:41 | vector |
---|
0:27:43 | each computes a different sub range of your dimensions, so after every computation they have to |
---|
0:27:47 | exchange |
---|
0:27:48 | the output with all the others |
---|
0:27:50 | the same thing has to happen on the way back |
---|
0:27:53 | now, data parallelism means |
---|
0:27:56 | you break your mini-batch into sub-batches |
---|
0:27:59 | so each node computes a sub-gradient |
---|
0:28:02 | and then, sorry, |
---|
0:28:03 | after every batch they have to exchange these sub-gradients, each has to send its |
---|
0:28:08 | gradient to all the other nodes |
---|
0:28:10 | so you can already see that has a lot of communication going on |
---|
0:28:13 | the third variant is something that we tried, called layer parallelism |
---|
0:28:17 | it works something like this, you distribute the layers |
---|
0:28:21 | so maybe the first batch comes in |
---|
0:28:23 | and then when it's done it sends |
---|
0:28:25 | its output to the next one, and now we compute the next batch here, but |
---|
0:28:29 | this is actually not correct because we haven't updated the model |
---|
0:28:33 | so, well, we keep going, we just ignore the problem |
---|
0:28:36 | then in this case, after four steps, |
---|
0:28:37 | this guy has finally come back with an update to the model |
---|
0:28:41 | so |
---|
0:28:42 | why would that work, it's just the delayed update, it's exactly the same form as |
---|
0:28:45 | before, except the delay is kind of different in different layers, but there's nothing |
---|
0:28:48 | fundamentally strange about this |
---|
0:28:51 | so |
---|
0:28:52 | no |
---|
0:28:54 | a very interesting question is how far you can actually go, what is sort of the optimum number |
---|
0:28:58 | of nodes that you can |
---|
0:29:01 | parallelize over |
---|
0:29:02 | so my colleague brought up a very simple idea |
---|
0:29:05 | he simply said |
---|
0:29:06 | you are optimal when you max out all the resources |
---|
0:29:10 | using all your computation and all your network |
---|
0:29:14 | resources. that basically means that the time that it takes |
---|
0:29:17 | to compute a mini-batch |
---|
0:29:19 | is equal to the time that it takes to transfer the result to all |
---|
0:29:23 | the others |
---|
0:29:25 | and you would do this in sort of an overlapped fashion, so you would compute one, |
---|
0:29:28 | then you start the transfer and you do the next one |
---|
0:29:31 | and you are ideal when the time that it takes to transfer, let's say, |
---|
0:29:35 | when the transfer is completed, at that moment you're ready to compute the next |
---|
0:29:38 | batch |
---|
0:29:39 | so then you can write down, okay, what's the optimal |
---|
0:29:42 | number of nodes here. well, the formula is a bit more complicated, but the basic |
---|
0:29:46 | idea is that this is proportional to the model size, bigger models allow better parallelization, but |
---|
0:29:51 | it goes down the faster your nodes get |
---|
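The balance condition he is describing can be written roughly like this (a simplified form, not the exact formula on the slide): compute time per minibatch share equals communication time, which gives the largest useful number of nodes.

```latex
% overlap is perfect when per-node compute time equals per-node transfer time:
\frac{T_{\mathrm{minibatch}}}{K} \;=\; T_{\mathrm{transfer}}(K)
% T_minibatch grows with model size and minibatch size; T_transfer grows with the
% amount of data to exchange divided by bandwidth. solving for K gives the optimal
% node count: bigger models and faster networks allow larger K, while faster
% compute nodes (e.g. GPUs) push the optimum down
```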
0:29:53 | so gpus can parallelize less |
---|
0:29:57 | and also it has to do, of course, with how much data you have to exchange |
---|
0:29:59 | and what your bandwidth is |
---|
0:30:01 | for data parallelization the mini-batch size is also a factor, because for a larger mini-batch size you |
---|
0:30:06 | have to exchange less often |
---|
0:30:09 | and for layer parallelism that's not really that interesting because it's limited by the number of layers |
---|
0:30:14 | so |
---|
0:30:16 | so let me ask you |
---|
0:30:17 | what do you think, for model parallelism, what would we get here |
---|
0:30:20 | so just |
---|
0:30:22 | consider that google is doing imagenet with like sixteen thousand nodes |
---|
0:30:26 | so give me a number |
---|
0:30:31 | i'm gonna tell you |
---|
0:30:36 | it's not sixteen thousand |
---|
0:30:39 | no. so i implemented that, you need to do it with a lot of |
---|
0:30:43 | care: three gpus |
---|
0:30:45 | that is the best you can do, and we get a one point eight speed up |
---|
0:30:47 | not two, let alone three times speedup, because gpus get less efficient the smaller the chunks |
---|
0:30:51 | of data they process |
---|
0:30:52 | and once i went to four it was actually much worse than this |
---|
0:30:58 | now, data parallelism isn't much better either. so what do you think we get |
---|
0:31:07 | for a mini-batch size of one thousand twenty four. now that changes of course if |
---|
0:31:11 | you can use bigger mini-batches as you progress with training |
---|
0:31:14 | this becomes a bigger number |
---|
0:31:16 | and in reality what you get is, well, google's asgd system |
---|
0:31:20 | parallelizes over eighty nodes |
---|
0:31:23 | and eighty nodes, each node is twenty four intel cores |
---|
0:31:27 | so if we see what you get compared to using, |
---|
0:31:29 | compared to using a single twenty four core machine |
---|
0:31:34 | that's eighty times the nodes, but you only get a speedup of five point eight |
---|
0:31:38 | that's what you can actually get out of the paper there, and about two point |
---|
0:31:42 | two of that comes out of model parallelism and two point six comes out of |
---|
0:31:46 | data parallelism |
---|
0:31:48 | of course, that's not that much |
---|
0:31:49 | then there's another group, at the chinese academy of sciences in beijing |
---|
0:31:53 | they parallelized over nvidia k20x gpus, that's sort of the state-of-the-art |
---|
0:31:58 | and they got three point two |
---|
0:31:59 | speedup also |
---|
0:32:02 | okay not that great |
---|
0:32:05 | and i'm not gonna give a better answer here, but i just wanted to point this out |
---|
0:32:08 | okay |
---|
0:32:09 | so the last thing is layer parallelism. okay, so we ran this experiment, and we found |
---|
0:32:14 | that if you do it the right way you can use more gpus and you get |
---|
0:32:17 | a three point two or three times speedup, but we already had to use model |
---|
0:32:20 | parallelism as well |
---|
0:32:22 | and if you don't do that, you have a problem with load balancing, because the layer sizes are also |
---|
0:32:26 | different |
---|
0:32:27 | and so this is actually the reason why i do not recommend layer parallelism |
---|
0:32:31 | okay, so the take away |
---|
0:32:33 | parallelizing sgd is actually really hard, and if your colleagues come to you and say, hey, let's |
---|
0:32:38 | implement asgd, then maybe show them this |
---|
0:32:41 | okay |
---|
0:32:43 | so |
---|
0:32:45 | so much about parallelization |
---|
0:32:51 | okay, now let me talk about adaptation. so adaptation can be done, |
---|
0:32:56 | as was mentioned this morning, for example by sticking in a transform at the bottom, called |
---|
0:33:01 | the lin transform, we call it fdlr, analogous to |
---|
0:33:05 | mllr |
---|
0:33:06 | it can also be things like vtln |
---|
0:33:09 | another thing we can do is, as nelson explained, just train the whole stack just |
---|
0:33:13 | a little bit, or you can do this with regularization |
---|
0:33:17 | so |
---|
0:33:18 | what we observed is this |
---|
0:33:20 | we do this approach with this transform on switchboard |
---|
0:33:23 | we apply it to the gmm system, we get thirteen percent error reduction |
---|
0:33:29 | we apply it to a shallow neural network, that's one layer only, |
---|
0:33:33 | we get something very similar to that |
---|
0:33:35 | if we do it on the deep network |
---|
0:33:40 | and |
---|
0:33:41 | so |
---|
0:33:44 | so this is sort of not such a great example, but then on the |
---|
0:33:48 | other hand, let me tell you one thing i forgot to put on the slide: when we |
---|
0:33:51 | prepared this on stage demo with |
---|
0:33:54 | our vice president, we tried to actually adapt the models |
---|
0:33:58 | so we took something like four hours of his internal talks |
---|
0:34:01 | and did adaptation on that |
---|
0:34:04 | and tested on another two talks of his, and we got like thirty percent |
---|
0:34:11 | but then we moved on an actually did an actual dry run with him |
---|
0:34:15 | it turns out |
---|
0:34:16 | on that one it barely worked |
---|
0:34:20 | so i think what happened there is that the dnn actually did not model his |
---|
0:34:22 | voice |
---|
0:34:23 | but modeled the channel |
---|
0:34:25 | of that particular recording, and that seems to be it. so basically there's a |
---|
0:34:29 | couple of other numbers here, but let me just cut it short, so what we |
---|
0:34:31 | seem to be observing is that |
---|
0:34:34 | the gain of adaptation diminishes with large amounts of training data, that's what we |
---|
0:34:37 | have seen so far on that, except if the adaptation is done for the purpose |
---|
0:34:42 | of domain adaptation |
---|
0:34:45 | so maybe the reason why this is, is that the dnn is already |
---|
0:34:48 | very good at learning invariant representations, especially across speakers, which also means maybe there's a |
---|
0:34:54 | limit on what is achievable by adaptation, so keep this in mind if you're considering |
---|
0:34:57 | doing research on this |
---|
0:35:00 | on the other hand, i think there are quite good adaptation results, for example from george |
---|
0:35:03 | and others, so maybe what i'm saying is not correct, so you better check |
---|
0:35:06 | out their papers in the session |
---|
0:35:11 | okay, so to round off the training section, let me talk about alternative architectures |
---|
0:35:16 | so one of these, |
---|
0:35:18 | so relus are very popular |
---|
0:35:21 | they basically replace the sigmoid nonlinearity with |
---|
0:35:25 | something like this |
---|
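For reference, the rectified linear unit he is pointing at is simply:

```latex
\mathrm{relu}(z) = \max(0, z)
\quad\text{instead of}\quad
\sigma(z) = \frac{1}{1 + e^{-z}}
```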
0:35:27 | and that came, again, also out of geoffrey hinton's school |
---|
0:35:31 | and it turns out that on vision tasks |
---|
0:35:34 | it works really well, it converges very fast |
---|
0:35:36 | you get, |
---|
0:35:37 | basically, you don't need to do pre-training |
---|
0:35:39 | and it seems to outperform the sigmoid version on basically everything |
---|
0:35:44 | and for speech there was a really, you know, |
---|
0:35:48 | encouraging paper |
---|
0:35:49 | by andrew ng's students, rectifier nonlinearities improve neural network acoustic models |
---|
0:35:54 | and they were able to reduce the error rate from my point five seventy |
---|
0:35:58 | so, great, i started coding it, it is actually two lines of code |
---|
0:36:02 | and i didn't get anywhere |
---|
0:36:04 | i was not able to reproduce these results |
---|
0:36:07 | then i read the paper again |
---|
0:36:08 | and i saw the |
---|
0:36:10 | sentence |
---|
0:36:11 | network training stops after two complete passes |
---|
0:36:15 | if we only do two passes, our system is at nineteen point two, and we do |
---|
0:36:19 | all the passes, as you can see |
---|
0:36:22 | so actually there's something wrong with their baseline |
---|
0:36:25 | so it turns out that when i talk to people |
---|
0:36:28 | on the large switchboard set it seems to be very difficult to get relus to |
---|
0:36:33 | work |
---|
0:36:33 | so one group that actually did get it to work is ibm, together with |
---|
0:36:38 | george dahl, but with a rather complicated method: they use |
---|
0:36:40 | bayesian optimization, an optimisation system, on top of the network training |
---|
0:36:44 | that trains the hyper parameters of the training, and this way they were able to get |
---|
0:36:47 | something like five percent relative gain |
---|
0:36:49 | i don't know if they are still doing that or if it's a bit easier |
---|
0:36:52 | now, but |
---|
0:36:54 | so |
---|
0:36:55 | the point is |
---|
0:36:57 | the point is that it looks easy but it actually isn't |
---|
0:37:00 | for large |
---|
0:37:02 | another one is convolutional networks |
---|
0:37:04 | and the idea is, basically, look at these filters here, these are tracking some |
---|
0:37:08 | sort of formant, right, but the formant positions, the resonance frequencies, |
---|
0:37:13 | depend on your body height |
---|
0:37:14 | for example for women they are typically at a slightly different position compared to |
---|
0:37:18 | men, so |
---|
0:37:19 | why can't we share these filters across that, at the moment the system wouldn't do that |
---|
0:37:24 | so the idea would be to apply these filters and shift them slightly, apply them |
---|
0:37:28 | over a range of shifts, and that's basically represented by this picture here |
---|
0:37:33 | and then the next layer does a reduction, you pick the maximum |
---|
0:37:36 | over all these different shifted results, right. and so it turns out that actually you |
---|
0:37:41 | can get something like four to seven percent word error rate reduction, i think there are even |
---|
0:37:45 | slightly better results, i urge you to read the papers |
---|
0:37:49 | so the take away for those alternative architectures |
---|
0:37:52 | relus are definitely not easy to get to work |
---|
0:37:55 | they seem to work for smaller setups |
---|
0:37:57 | some people tell me they get really good results on something like twenty four hour |
---|
0:38:01 | datasets, but on the big set, three hundred hours, it's very difficult and expensive |
---|
0:38:06 | on the other hand, the cnns are much simpler, the gains are sort of in the range of |
---|
0:38:09 | what we get |
---|
0:38:10 | with adaptation, feature adaptation |
---|
0:38:14 | okay |
---|
0:38:15 | that's the end of the training section |
---|
0:38:17 | now let me talk a little bit about features |
---|
0:38:23 | so for features for gmms |
---|
0:38:27 | a lot of work has been done |
---|
0:38:29 | because the gmms we typically use don't model correlations, |
---|
0:38:33 | a lot of work was done to decorrelate features |
---|
0:38:36 | do we actually need to do this for the dnn |
---|
0:38:38 | well, you decorrelate with a linear transform, and the first thing the dnn does is |
---|
0:38:42 | a linear transform |
---|
0:38:44 | so it can kind of do that just by itself, well, let's see |
---|
0:38:48 | so we start with a gmm baseline twenty three point six if you put in |
---|
0:38:51 | fmpe to be fair twenty two point six |
---|
0:38:54 | and then you do a cd-dnn, just a normal dnn, using those features here, |
---|
0:38:59 | the fmpe features, you get to seventeen |
---|
0:39:02 | get rid of that, so this minus means take it out |
---|
0:39:06 | now it's just a plp system |
---|
0:39:08 | seventeen |
---|
0:39:08 | that kind of makes sense, because the fmpe was basically trained specifically for this gmm |
---|
0:39:16 | structure |
---|
0:39:18 | then you can also take out the hlda, and it gets better |
---|
0:39:21 | the hlda obviously does decorrelation over a longer range, and the dnn already handles that |
---|
0:39:29 | you can also take out the dct that's part of plp or mfcc process |
---|
0:39:34 | and now we have a slightly different dimension |
---|
0:39:37 | you have more features here, and so on |
---|
0:39:41 | i think a lot of people are now using this particular set up |
---|
0:39:44 | you can even take out the deltas |
---|
0:39:46 | but you have to account for that, you have to make the window wider |
---|
0:39:49 | so we still see the same frames, and in our case it's still about the same |
---|
0:39:54 | you can go really extreme and completely eliminate the filterbank, just look at fft |
---|
0:39:59 | features directly |
---|
0:40:00 | now it gets somewhat worse, but it's still in the ballpark here, right |
---|
0:40:03 | so |
---|
0:40:05 | actually what we just did, we basically undid thirty years of feature research |
---|
0:40:10 | so |
---|
0:40:13 | that |
---|
0:40:13 | there is also something kind of really cool: if you really care about the filter bank |
---|
0:40:16 | you can actually learn it, sort of, this is another poster, i think tomorrow, so |
---|
0:40:20 | you see the blue bars and the red curves, right, the blue are the mel filters |
---|
0:40:24 | and the red curves are basically |
---|
0:40:26 | learned versions of that |
---|
0:40:34 | so the dnn can also kind of learn that, sorry |
---|
0:40:38 | so take away: dnns greatly simplify feature extraction, just use the filterbank with a wider |
---|
0:40:43 | window |
---|
0:40:44 | one thing i didn't mention, you still need to do the mean normalization |
---|
0:40:47 | that cannot be dropped |
---|
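As an illustration of how little remains of the classic front-end, here is a rough numpy-only sketch of the kind of input the talk ends up with: log mel-like filterbank energies, per-utterance mean normalization, and a wide context window spliced into one input vector. Filter shapes, sizes, and the triangular-filter helper are simplified stand-ins, not the exact pipeline.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft_bins, sample_rate=16000):
    """Very rough triangular filters spaced on a mel-like warped axis (illustrative only)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv(np.linspace(mel(0), mel(sample_rate / 2), n_filters + 2))
    bins = np.floor(edges / (sample_rate / 2) * (n_fft_bins - 1)).astype(int)
    fb = np.zeros((n_filters, n_fft_bins))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l: fb[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c: fb[i, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb

def dnn_input_features(power_spectra, n_filters=40, context=5):
    """log filterbank + utterance mean normalization + splicing of +/- context frames."""
    fb = mel_filterbank(n_filters, power_spectra.shape[1])
    logfb = np.log(power_spectra @ fb.T + 1e-10)
    logfb -= logfb.mean(axis=0)                       # mean normalization is still needed
    padded = np.pad(logfb, ((context, context), (0, 0)), mode="edge")
    # stack 2*context+1 neighboring frames into one wide input vector per frame
    return np.hstack([padded[i:i + len(logfb)] for i in range(2 * context + 1)])

frames = np.abs(np.random.default_rng(0).standard_normal((300, 257))) ** 2  # fake power spectra
x = dnn_input_features(frames)
print(x.shape)   # (300, 440): 40 filters times an 11-frame window
```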
0:40:49 | now |
---|
0:40:50 | now, we talked about features for dnns, we can also turn it around, right, basically |
---|
0:40:54 | you know, ask not what the features can do for the dnn, but what the |
---|
0:40:57 | dnn can do for the features |
---|
0:40:59 | i think that was |
---|
0:41:01 | said by some famous speech researcher |
---|
0:41:05 | so we can use dnns as feature extractors. so the idea is, basically, what are the |
---|
0:41:09 | factors that contributed to the success: |
---|
0:41:12 | long span features |
---|
0:41:13 | discriminative training |
---|
0:41:15 | and the hierarchical nonlinear feature mapping |
---|
0:41:18 | right, so |
---|
0:41:19 | and it turns out that last one is actually the major contributor, so why not use this combined |
---|
0:41:24 | with the gmm, so we go really back to what nelson talked about |
---|
0:41:27 | right |
---|
0:41:28 | so there are many ways of doing that: the tandem approach |
---|
0:41:31 | we heard about this morning, you can also do the tandem with a |
---|
0:41:34 | bigger layer, there's work on that, so basically using the outputs here |
---|
0:41:39 | you can do a bottleneck, where you take an intermediate layer that has a much |
---|
0:41:43 | smaller dimension |
---|
0:41:44 | or you can also |
---|
0:41:46 | use the top hidden layer |
---|
0:41:49 | as sort of the bottleneck, but not make it smaller, just take it. in each |
---|
0:41:52 | of those cases you would typically do something like a pca to reduce your dimensionality |
---|
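A minimal sketch of that last variant, reusing the toy forward pass from earlier: run the data through the trained hidden layers, take the top hidden layer's activations, and reduce them with PCA before handing them to a GMM toolkit (the PCA-by-SVD here is generic numpy, not the actual recipe).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def top_hidden_activations(X, hidden_weights, hidden_biases):
    """Run frames X (n_frames x n_inputs) through all hidden layers; skip the softmax."""
    H = X
    for W, b in zip(hidden_weights, hidden_biases):
        H = sigmoid(H @ W.T + b)
    return H

def pca_reduce(H, n_components=39):
    """PCA projection of the top hidden layer, e.g. down to a GMM-friendly size."""
    H_centered = H - H.mean(axis=0)
    _, _, Vt = np.linalg.svd(H_centered, full_matrices=False)
    return H_centered @ Vt[:n_components].T

# toy trained network: 440-dim spliced inputs, two hidden layers of 2048 units
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((2048, 440)) * 0.01, rng.standard_normal((2048, 2048)) * 0.01]
bs = [np.zeros(2048), np.zeros(2048)]
X = rng.standard_normal((500, 440))

features_for_gmm = pca_reduce(top_hidden_activations(X, Ws, bs))
print(features_for_gmm.shape)   # (500, 39), ready to train a gmm-hmm system on
```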
0:41:56 | so does that work |
---|
0:41:58 | well okay so if you take |
---|
0:42:00 | a dnn |
---|
0:42:01 | this is the hybrid system here, and then you compare it with this gmm system |
---|
0:42:05 | where we take the top layer |
---|
0:42:07 | do pca and then apply the gmm |
---|
0:42:09 | well, it's not really that good |
---|
0:42:12 | but now we have one really big advantage: we are back in the world of gmms |
---|
0:42:16 | we can capitalise on anything that worked in the gmm world, right |
---|
0:42:20 | so for example we were able to use region dependent linear transforms, which are a little bit like |
---|
0:42:24 | fmpe |
---|
0:42:26 | so once you apply that |
---|
0:42:27 | already better |
---|
0:42:29 | we can also just do mmi training very easily, okay, in this case it's not really |
---|
0:42:33 | as good, but at least you can do it out of the box without any |
---|
0:42:36 | of these problems with, you know, silence and that, and you can apply adaptation just |
---|
0:42:41 | as we always did |
---|
0:42:42 | you can also do something more interesting, you can say, what if i train my dnn |
---|
0:42:47 | feature extractor on a smaller set |
---|
0:42:49 | and then do the gmm training on a larger set |
---|
0:42:52 | because we have this scalability problem |
---|
0:42:54 | so this can really help with the scalability problem, and you can see, well, |
---|
0:43:00 | it's close, not quite as good, but we're able to do that |
---|
0:43:04 | i mean, imagine the situation where this is like a ten thousand hour product database |
---|
0:43:07 | that we couldn't train the dnn on, and then |
---|
0:43:10 | if on the dnn side we also use the same data we definitely get |
---|
0:43:13 | better |
---|
0:43:14 | here, and it might still make sense if we combine this for example |
---|
0:43:18 | with the idea of training the model only partially, and then see if |
---|
0:43:23 | that works, we don't know that actually |
---|
0:43:24 | so that still needs some attention |
---|
0:43:26 | another thing, another idea of using dnns as feature extractors, |
---|
0:43:31 | is to transfer learning from one language |
---|
0:43:35 | to another. so the idea is to feed the network a training set of multiple |
---|
0:43:40 | languages |
---|
0:43:41 | and the output layer |
---|
0:43:43 | for every frame is chosen based on what that language was, right, and this way you |
---|
0:43:47 | can train |
---|
0:43:48 | these hidden representations and it turns out if you do that |
---|
0:43:51 | you can improve each individual language and it even works for another language that has |
---|
0:43:56 | not been part of this set here |
---|
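The architecture being described is usually drawn as shared hidden layers with one softmax per language; a bare-bones sketch of the parameter layout and the per-frame routing might look like this (sizes and names invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# shared hidden layers, trained on frames from all languages pooled together
shared = [
    (rng.standard_normal((1024, 440)) * 0.01, np.zeros(1024)),
    (rng.standard_normal((1024, 1024)) * 0.01, np.zeros(1024)),
]

# one language-specific softmax layer per language (different senone inventories)
output_layers = {
    "french":  (rng.standard_normal((1800, 1024)) * 0.01, np.zeros(1800)),
    "german":  (rng.standard_normal((2100, 1024)) * 0.01, np.zeros(2100)),
    "italian": (rng.standard_normal((1500, 1024)) * 0.01, np.zeros(1500)),
}

def forward(x, language):
    """For each training frame the output layer is chosen by the frame's language;
    gradients from all languages flow into the same shared hidden layers."""
    h = x
    for W, b in shared:
        h = sigmoid(W @ h + b)
    W_out, b_out = output_layers[language]
    return softmax(W_out @ h + b_out)

print(forward(rng.standard_normal(440), "german").shape)   # (2100,) posteriors for german senones
```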
0:43:58 | the only thing is, this is typically something that works for low resource languages |
---|
0:44:03 | but if you go larger, so for example there is a |
---|
0:44:08 | paper here, or a paper which shows that if you |
---|
0:44:11 | go up to something like two hundred seventy hours of training |
---|
0:44:14 | then your gain really is reduced to something like three percent |
---|
0:44:18 | so this is actually something that does not seem to work very well for large |
---|
0:44:21 | settings |
---|
0:44:26 | okay so take away |
---|
0:44:28 | the dnn acts as a hierarchical nonlinear feature transform |
---|
0:44:31 | that's really the key to the success of dnns, and you can use this directly |
---|
0:44:36 | and put a gmm on top of that as the classification layer |
---|
0:44:40 | and that brings it back into the gmm world, with all the techniques, including parallelization and |
---|
0:44:45 | scalability and so on |
---|
0:44:47 | and on the transfer learning side, it works for small set ups |
---|
0:44:52 | but not so much for large ones |
---|
0:44:55 | okay |
---|
0:44:58 | last topic runtime |
---|
0:45:00 | runtime is an issue |
---|
0:45:02 | this is one problem compared to gmms |
---|
0:45:05 | for gmms you can actually do on-demand computation |
---|
0:45:08 | for dnns |
---|
0:45:09 | a large amount of the parameters is actually in the shared layers, which you cannot compute on demand |
---|
0:45:14 | so |
---|
0:45:15 | the whole dnn |
---|
0:45:16 | you have to compute |
---|
0:45:18 | and so it's important to look at how we can speed this up. so for example the |
---|
0:45:22 | demo video that i showed you in the beginning, that was run with |
---|
0:45:25 | a gpu doing the likelihood evaluation, if you don't |
---|
0:45:30 | do that it would be like three times real time |
---|
0:45:32 | which would be infeasible |
---|
0:45:34 | so |
---|
0:45:35 | the way to approach this, and that was done both by some colleagues at microsoft and |
---|
0:45:38 | also ibm, |
---|
0:45:40 | is to ask, do we actually need those full weight matrices |
---|
0:45:44 | and so this question is based on two observations |
---|
0:45:48 | one is that we saw early on that actually you can set something like two |
---|
0:45:52 | thirds of the parameters to zero |
---|
0:45:55 | and still you get the same error rate |
---|
0:45:57 | and what ibm observed is that in this top hidden layer |
---|
0:46:02 | the number of |
---|
0:46:03 | nodes that are actually active is relatively limited |
---|
0:46:07 | so the idea is basically to just decompose, using singular value decomposition, |
---|
0:46:12 | those weight matrices |
---|
0:46:14 | and the idea is, basically, this is your network layer, |
---|
0:46:17 | the weight matrix and the nonlinearity, and you replace this by two matrices, and in the middle you have |
---|
0:46:23 | a low-rank bottleneck |
---|
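A small numpy sketch of that factorization: take a trained weight matrix, keep only the top singular values, and replace the single layer by two thinner ones (after which, as he says next, you fine-tune with backpropagation); the sizes here are illustrative.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (n_out x n_in) by B @ A with B (n_out x rank) and A (rank x n_in)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    B = U[:, :rank] * s[:rank]          # absorb the singular values into one factor
    A = Vt[:rank, :]
    return B, A

rng = np.random.default_rng(0)
W = rng.standard_normal((9000, 2048)) * 0.01     # e.g. the big output layer: 9000 senones x 2048 hidden units

B, A = low_rank_factorize(W, rank=256)
original_params = W.size                          # 9000 * 2048
factored_params = B.size + A.size                 # 9000 * 256 + 256 * 2048
print(factored_params / original_params)          # ~0.15: large reduction in parameters and multiplies

# at run time the layer y = W @ h becomes y = B @ (A @ h), i.e. two thin matrix products;
# a final backpropagation pass over the factored network recovers the original accuracy
h = rng.standard_normal(2048)
print(np.linalg.norm(W @ h - B @ (A @ h)) / np.linalg.norm(W @ h))  # approximation error before fine-tuning
```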
0:46:26 | so does that work |
---|
0:46:27 | well |
---|
0:46:28 | so this is the gmm baseline, just for reference, and the dnn |
---|
0:46:32 | with thirty million parameters, on a microsoft internal task |
---|
0:46:35 | we start with a word error rate of twenty five point six |
---|
0:46:38 | now we apply this singular value decomposition |
---|
0:46:41 | if you just do it straight away it gets much worse |
---|
0:46:44 | but you can then do back-propagation again |
---|
0:46:47 | and then you will get back to exactly the same number |
---|
0:46:50 | and you gain like a one third parameter reduction |
---|
0:46:53 | you can actually also do that with |
---|
0:46:55 | all the layers, not just the top layer, and if you do that you can bring it |
---|
0:46:58 | down |
---|
0:47:00 | by a factor of four |
---|
0:47:02 | and that is actually a very good result, so this basically brings the runtime back to feasible |
---|
0:47:08 | so let me just show you one more, to again give you a very rough idea |
---|
0:47:12 | my classes |
---|
0:47:21 | so this is only a very short example, just to give an idea, this is an apples to |
---|
0:47:25 | apples comparison between the old gmm system and the dnn system |
---|
0:47:29 | for speech recognition, so let's look at some of those things that you |
---|
0:47:34 | know well, so these are devices, the one on the left uses what |
---|
0:47:37 | we had previously on board, the one on the right uses the dnn |
---|
0:47:42 | we're gonna find a good pizza and |
---|
0:47:50 | the accuracy is very similar what's specifically interesting is to look here down at the latency which is
---|
0:47:56 | counted from when i stop talking to when we see the recognition result around a second
---|
0:48:01 | for this approach
---|
0:48:02 | so i just wanted to give you that this is proof that this stuff actually works
---|
0:48:06 | okay so |
---|
0:48:08 | i think we have covered the whole range so i would like to recap
---|
0:48:13 | all the takeaways
---|
0:48:14 | okay so we went through |
---|
0:48:16 | cd dnn hmms remember it's just an
---|
0:48:18 | mlp as i already said nothing else the outputs are the tied triphone states and that's
---|
0:48:24 | important
---|
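As a reminder of why the tied-triphone (senone) outputs matter, here is a tiny sketch of the standard hybrid trick: the network's senone posteriors are divided by the senone priors to give scaled likelihoods that a conventional HMM decoder can consume. The senone count and the flat prior below are placeholders.

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Hybrid DNN-HMM trick: log p(x|s) = log p(s|x) - log p(s) + constant."""
    return log_posteriors - log_priors

# toy numbers: one frame of posteriors over 9000 senones, flat prior as a stand-in
num_senones = 9000
rng = np.random.default_rng(0)
log_post = np.log(rng.dirichlet(np.ones(num_senones)))
log_prior = np.full(num_senones, -np.log(num_senones))
scores = scaled_log_likelihoods(log_post, log_prior)    # fed to the HMM decoder
```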
0:48:25 | they're not really that hard to train we know now but doing it fast
---|
0:48:29 | is still sort of a frustrating enterprise and i would at the moment recommend just getting
---|
0:48:33 | a gpu and if you have multiple gpus just run multiple trainings rather than trying
---|
0:48:37 | to parallelize a single training
---|
0:48:40 | pre-training is |
---|
0:48:41 | helpful but the greedy layer-wise variant is simpler and it seems to be sufficient
---|
0:48:48 | sequence training regularly gives us good improvements of up to thirty percent but if you use
---|
0:48:52 | sgd then you have to use these little tricks smoothing
---|
0:48:56 | and frame rejection
---|
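To make those two tricks a bit more concrete, here is a rough numpy sketch of what they look like inside a sequence-training update: blend the sequence-level gradient with the frame-level cross-entropy gradient (smoothing), and drop frames whose reference senone has negligible lattice posterior (rejection). The interpolation weight and threshold are illustrative assumptions, not the values actually used.

```python
import numpy as np

def smoothed_seq_gradient(grad_seq, grad_ce, ref_post, h=0.2, reject_below=1e-3):
    """Sketch of frame smoothing plus frame rejection for sequence training.

    grad_seq, grad_ce : per-frame gradients, shape (T, D)
    ref_post          : lattice posterior of the reference senone per frame, shape (T,)
    """
    blended = (1.0 - h) * grad_seq + h * grad_ce        # frame smoothing
    keep = ref_post >= reject_below                     # frame rejection
    return blended[keep].sum(axis=0)

# toy example: 300 frames, 50 parameters
rng = np.random.default_rng(0)
g = smoothed_seq_gradient(rng.normal(size=(300, 50)),
                          rng.normal(size=(300, 50)),
                          rng.uniform(size=300))
```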
0:48:57 | adaptation helps much less than for gmms |
---|
0:49:00 | which might be because the dnn possibly learns
---|
0:49:04 | very good internal representations already so there might be a limit to what
---|
0:49:07 | we can actually achieve
---|
0:49:09 | these things are definitely not as easy as changing two lines of code especially for large
---|
0:49:14 | datasets
---|
0:49:16 | but on the other hand the cnns
---|
0:49:17 | give us like five percent which is not really that hard to get and they make
---|
0:49:20 | good sense
---|
0:49:23 | dnns really simplify the feature extraction we're able to eliminate thirty years of feature extraction |
---|
0:49:27 | research |
---|
0:49:30 | but you can also go the other way round and use dnns as feature extractors
---|
0:49:35 | so dnns are definitely not slowing down decoding if you use these speed-up tricks
---|
0:49:40 | so |
---|
0:49:40 | to conclude where do i see the challenges going forward
---|
0:49:44 | there are of course open issues with training
---|
0:49:46 | i mean when we talk to people in the company we are always thinking what
---|
0:49:51 | kind of computers we'll buy in the future and do we optimize them for sgd but
---|
0:49:55 | we always think you know what in one year we will laugh
---|
0:49:57 | about this whole mini-batch method and we will just not need any of
---|
0:50:01 | this but so far i think it's fair to say there's not
---|
0:50:03 | a method like this on the rise that immediately allows parallelization
---|
0:50:08 | and what we have found so far on learning rate control is not sufficient this is kind of really
---|
0:50:11 | important because if you don't do this right you might run into unreliable results and
---|
0:50:15 | i have a hunch that one of the results we saw earlier was a little bit like that
---|
0:50:19 | and it also has to do with parallelizability because the smaller the learning rate the bigger
---|
0:50:23 | your mini-batch can be by that factor and the more parallelization you can get
---|
0:50:30 | dnns still have an issue with robustness to real life situations |
---|
0:50:35 | they may have sort of not solved speech but they got very close to
---|
0:50:39 | solving speech under perfect recording conditions but it still fails if you do speech
---|
0:50:44 | recognition over like one meter fifty in a room with two microphones or something like
---|
0:50:48 | that so dnns are not
---|
0:50:49 | inherently automatically robust to noise
---|
0:50:52 | they are robust to seen variability but not to unseen variability
---|
0:50:57 | then personally i wonder can we move to a kind of more machine learning driven approach
---|
0:51:00 | so for example there's already work that tries to eliminate the hmm and replace it
---|
0:51:04 | by rnns and i think that could become very interesting and the same thing has already been
---|
0:51:08 | done very successfully with language models
---|
0:51:11 | and there's the question of |
---|
0:51:13 | can we jointly train everything in one big step but on the other hand
---|
0:51:16 | the problem with that is that different
---|
0:51:19 | aspects of the model use different kinds of data that have different costs attached to
---|
0:51:24 | them so it might actually never be possible or necessary to do a joint training
---|
0:51:28 | and the final question that i sort of have is what do dnns teach us about how
---|
0:51:32 | humans process speech
---|
0:51:35 | maybe we will also get
---|
0:51:36 | more ideas on that
---|
0:51:38 | i don't know
---|
0:51:40 | so that concludes my talk thank you very much |
---|
0:51:51 | i think we have like six minutes for questions |
---|
0:52:12 | i'm not an expert about neural nets and i was wondering if i train a
---|
0:52:19 | neural network on conventional speech data and i try to recognize data which is
---|
0:52:26 | much more clean will it therefore not be as good because we don't have the noise
---|
0:52:31 | so what is the configuration you want do you want to train on what
---|
0:52:34 | exactly so they train their neural nets on the noisy data and then they're running
---|
0:52:39 | on the clean data
---|
0:52:41 | so i don't know exactly that's my question
---|
0:52:44 | okay so i actually did skip
---|
0:52:46 | one slide let me show this one
---|
0:52:50 | so |
---|
0:52:51 | the dnn is actually |
---|
0:52:56 | way |
---|
0:53:04 | so you get like |
---|
0:53:10 | so this table here shows results on aurora so basically in this case doing multi-style training
---|
0:53:19 | so the idea was not to train on noisy and test on clean
---|
0:53:22 | but this is basically training and testing on the same |
---|
0:53:26 | set of noise conditions |
---|
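For illustration only, a minimal sketch of how such multi-condition training data is typically built: mix noise into the clean utterances at a range of signal-to-noise ratios and train on the result. The function, the SNR list, and the random signals below are assumptions for the example, not the Aurora recipe itself.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix a noise segment into clean speech at a given SNR (both float arrays)."""
    noise = np.resize(noise, clean.shape)               # loop or trim noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# hypothetical multi-condition set: each clean utterance mixed at several SNRs
rng = np.random.default_rng(0)
clean = rng.normal(size=16000)      # stand-in for one second of 16 kHz speech
noise = rng.normal(size=8000)       # stand-in for a noise recording
train_set = [mix_at_snr(clean, noise, snr) for snr in (20, 15, 10, 5, 0)]
```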
0:53:28 | and so there are a lot of numbers here this is the gmm baseline if you look
---|
0:53:31 | at this line here thirteen point four
---|
0:53:34 | so i'm not a specialist on robustness but i think this is about the best you can
---|
0:53:38 | do with a gmm
---|
0:53:39 | pulling in all the tricks that you could possibly put in
---|
0:53:42 | and the dnn |
---|
0:53:43 | it's just |
---|
0:53:44 | without any tricks just training on the data you get
---|
0:53:48 | you know more or less exactly the same number
---|
0:53:51 | so what this means i think is that the dnn is very good at learning
---|
0:53:55 | variability in the input including noise that it sees in the training data
---|
0:54:02 | but we have other experiments where we see that if the
---|
0:54:06 | variability is not covered in the training data
---|
0:54:09 | the dnn is not very robust against it
---|
0:54:12 | so i don't know what happens if you train on noisy and test on clean
---|
0:54:15 | and clean is not one of the conditions that you have in your training i could imagine
---|
0:54:18 | that it would hurt but i'd be interested to see it on the data
---|
0:54:25 | i don't think you can really get away with saying thirty years thirty years maybe
---|
0:54:30 | but you were
---|
0:54:33 | apparently talking tongue in cheek right what you're talking about is going back before
---|
0:54:38 | some of the developments of the eighties right and most of the effort on feature
---|
0:54:43 | extraction in the last twenty years at conferences has actually been more about robustness dealing with unseen variability
---|
0:54:51 | and this doesn't get you out of that situation
---|
0:54:59 | some more questions or comments |
---|
0:55:02 | thinking about features what do you see for future
---|
0:55:08 | research
---|
0:55:10 | is it to use a larger temporal context this has also been one that was
---|
0:55:16 | coming up or
---|
0:55:19 | in contrast
---|
0:55:21 | is it something else
---|
0:55:24 | okay to be honest i don't have a good answer to that okay
---|
0:55:33 | any more comments
---|
0:55:36 | kind of a personal question you said that you didn't know anything about neural nets like
---|
0:55:40 | until two three years back or something like that so do you see this as rather an
---|
0:55:45 | advantage or a drawback maybe you were less sentimental
---|
0:55:48 | in throwing away some old stuff that you know the guys who've been in the field for
---|
0:55:54 | many years considered somewhat untouchable or the other way round
---|
0:55:58 | i think so i think it helps to come in with sort of a little bit
---|
0:56:02 | of an outsider's mind so i think for example it helped me to understand this
---|
0:56:06 | parallelization thing right that basically what you do in sgd is plain
---|
0:56:11 | small mini-batch training
---|
0:56:13 | and normally the regular definition of mini-batches is that you take the average over the
---|
0:56:18 | samples
---|
0:56:18 | maybe you might have noticed that i didn't actually divide by the number of frames
---|
0:56:23 | when i used this formula right which is interesting if you think about it
---|
0:56:27 | so that for example is something where for me as an engineer coming in and looking at
---|
0:56:30 | that i wondered why do you do mini-batches as an average it doesn't seem to
---|
0:56:33 | make sense you're just accumulating multiple frames over time and that helped me understand those kinds of
---|
0:56:38 | parallelization questions in a different way
---|
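A tiny sketch of that sum-versus-average point, with made-up sizes: if you sum the per-frame gradients instead of averaging them, the effective step grows with the mini-batch size, so the learning rate and the mini-batch size trade off directly against each other, which is exactly what matters for the parallelization question.

```python
import numpy as np

def sgd_step_sum(w, grads, lr):
    """Update with the summed per-frame gradients (no division by the batch size)."""
    return w - lr * grads.sum(axis=0)

def sgd_step_mean(w, grads, lr):
    """Update with the averaged gradient (the textbook mini-batch definition)."""
    return w - lr * grads.mean(axis=0)

# toy example: 256 frames, 10 parameters
rng = np.random.default_rng(0)
w = np.zeros(10)
grads = rng.normal(size=(256, 10))
# summing with lr is identical to averaging with lr * 256, which is why the
# learning rate and the mini-batch size trade off against each other
a = sgd_step_sum(w, grads, lr=1e-4)
b = sgd_step_mean(w, grads, lr=1e-4 * 256)
assert np.allclose(a, b)
```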
0:56:41 | but these are probably details
---|
0:56:49 | okay any other questions
---|
0:56:54 | okay so the speaker is given a present
---|