0:00:09 | Welcome to the next edition of the invited talks on vision, graphics and speech. |
0:00:16 | It's a series that is run mainly by my colleague, but today |
0:00:21 | I'm happy to invite a very good speech and NLP guy, Tomáš |
0:00:25 | Mikolov, who actually started at this Faculty of IT |
0:00:28 | in 2002. |
0:00:31 | Then in 2006 and 2007 he was working on a diploma project on |
0:00:35 | language modeling for Czech; maybe he still remembers something of it. |
0:00:41 | He started his PhD in 2007 on language modeling, and to |
0:00:46 | be frank, we didn't have much language modeling expertise here, |
0:00:51 | so we kept sending him abroad. He spent considerable time at Johns |
0:00:55 | Hopkins University with Sanjeev Khudanpur and at the |
0:00:59 | University of Montreal with Yoshua Bengio. |
0:01:03 | And, well, he had a very influential paper at Interspeech 2010. |
0:01:11 | That was basically a room like this one, full of senior language modeling people, |
0:01:16 | and Tomáš basically came up and said that his language model |
0:01:20 | works the best. |
0:01:22 | Well, they were smiling, but it worked the best. |
0:01:26 | He eventually defended his PhD in 2012, |
0:01:30 | was immediately hired by Google Brain, and then moved to Facebook |
0:01:36 | AI Research in 2014, where he's now a research |
0:01:43 | scientist. He will still be here until March, so thank you |
0:01:48 | for coming. |
0:01:55 | Is it... it's fine, |
0:01:57 | I guess. Okay. |
0:01:58 | Thanks also for the introduction. My talk will be like a |
0:02:02 | mixture of very many small things; |
0:02:05 | I was asked to talk about everything, |
0:02:08 | so let's say it will be about neural networks in NLP. |
0:02:21 | Okay. |
0:02:24 | So, for the introduction: |
0:02:30 | NLP is an important topic for many companies nowadays, like |
0:02:36 | Google and Facebook. All these companies have huge |
0:02:40 | text datasets that are coming either from the web or from the users; you can imagine |
0:02:44 | how much text users send to Facebook every day. |
0:02:48 | And then of course these companies want to do something useful |
0:02:51 | with the text. There is a list here of |
0:02:55 | some important applications, but there are many others. |
0:02:59 | Just detecting spam is something important; |
0:03:02 | users don't want to see it |
0:03:04 | when they are using these services. So |
0:03:10 | the core business of these companies is to be able to deal |
0:03:13 | with text. |
0:03:16 | For that, I will be talking about rather |
0:03:19 | basic things in the beginning and then their extensions using neural networks. |
0:03:27 | The first part will be about unsupervised learning |
0:03:32 | of word representations, |
0:03:35 | the word2vec project, which I think is a very nice, |
0:03:39 | simple introduction. |
0:03:42 | Then supervised text classification; I will not talk about it much. |
0:03:47 | It's the very simple fastText model we published last year at Facebook that extends the |
0:03:51 | word vectors to supervised classification, and again |
0:03:54 | it's quite successful because it's very scalable. |
0:03:57 | And then the recurrent network language model, |
0:04:01 | which, as was already mentioned, is something that is |
0:04:04 | nowadays very common at conferences. |
0:04:08 | The last part of the talk will be about |
0:04:11 | what we can do next, maybe in the future; maybe some people here will be interested. |
0:04:16 | It is actually not that hard to try to do something better than what we have now; I |
0:04:21 | think that would be a great goal, and we are trying to do it |
0:04:24 | ourselves. |
0:04:26 | And of course many companies are very interested in getting better |
0:04:31 | performance. |
0:04:32 | Of course, one can focus on incremental improvements by just taking what |
0:04:37 | exists and trying to make it bigger or something of that sort, |
0:04:41 | but I will talk about some high-level goals that |
0:04:45 | we are thinking of right now, like how to build machines |
0:04:50 | that are really smart models. |
0:04:53 | I will not show any solution, because we don't have it, |
0:04:56 | but I think it's good to at least mention the problems that we are |
0:05:00 | facing. |
0:05:03 | So I will start with very basic concepts, |
0:05:07 | because it seems that not all of the people here have a big |
0:05:11 | background in machine learning. |
0:05:14 | I will start with basic |
0:05:16 | models of sequences and |
0:05:19 | representations of text, |
0:05:22 | and then I will show that neural networks basically |
0:05:25 | can extend and improve |
0:05:27 | all these basic representations and models. |
0:05:33 | The artificial neural networks can be seen as a unified framework |
0:05:38 | that is in some sense simple to understand |
0:05:42 | once you know the basic concepts, and the older techniques are still needed |
0:05:46 | to be able to define the features and so on. |
0:05:48 | So, for the n-grams: |
0:05:50 | that's the standard approach to language modeling, and it's a core technology in many |
0:05:56 | important applications like speech recognizers and |
0:05:59 | machine translation systems. These need to be able to output some |
0:06:04 | text, and for that |
0:06:06 | you use some statistical model of the language. |
0:06:09 | That is basically what is written on the last line: |
0:06:13 | some sentences are basically more likely than others. For example, |
0:06:20 | "this is a sentence" |
0:06:23 | is really going to have higher probability than |
0:06:25 | the sequence of words "sentence a is this", |
0:06:28 | because that doesn't make much sense. |
0:06:31 | And even that should have a higher probability of occurring |
0:06:34 | in English than some random string of characters. |
0:06:38 | So the n-grams are |
0:06:40 | estimated from counts, usually. |
0:06:44 | It's very simple, but if you look at the first equation |
0:06:47 | and just think about what the probability of a sentence is, I think |
0:06:51 | it's a very broad concept: if we were able |
0:06:54 | to estimate this probability very well, |
0:06:57 | then the model behind it |
0:07:00 | should be able to understand the language, or would actually have to understand the language. |
0:07:04 | For example, I can write here that the probability of |
0:07:10 | the sentence "Paris is the capital city of France" |
0:07:14 | should be higher than "Berlin is the capital city of France", because the second |
0:07:18 | sentence is incorrect. |
0:07:20 | But the models we have now, I would say, |
0:07:24 | can do this a little bit, but not in a general sense. |
0:07:28 | I will try to get to this at the end of the talk, |
0:07:31 | what the limitations of our best language models are. |
0:07:35 | But just to get the motivation: language modeling is quite interesting, there are |
0:07:39 | a lot of unsolved problems, and if we were able to solve them |
0:07:42 | very well, then it would be |
0:07:44 | possibly interesting for artificial intelligence research. |
0:07:49 | And here is how it looks with the |
0:07:52 | techniques that used to be state-of-the-art ten years ago, |
0:07:56 | which were based on n-grams. They are scalable, meaning that we can |
0:08:01 | estimate this model on large datasets very quickly; it's just counting. If you want |
0:08:07 | to compute the probability of the sentence "this is a sentence", you |
0:08:09 | compute the probability of the word "this", which you just get from some training corpus: just count |
0:08:14 | how many times the word "this" appeared and divide it by the total word |
0:08:18 | count, so you get its probability. |
0:08:20 | And then you just multiply the |
0:08:22 | probabilities of each word given its context. |
0:08:26 | There are some more advanced things on top of it, like smoothing and so |
0:08:30 | on, but this is basically |
0:08:32 | the technique that used to be state-of-the-art in statistical language modeling for |
0:08:37 | a few decades. |
0:08:39 | It looks very simple, but it took people a lot of effort |
0:08:43 | to overcome it |
0:08:45 | convincingly across different corpora and domains, and the winner, |
0:08:51 | as I will show later, will be the recurrent networks. |
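A minimal sketch in Python of the counting just described, on a toy corpus and without any smoothing; the helper names are illustrative, not from the talk:

```python
from collections import Counter

# Toy corpus; in practice the counts come from a large training text.
corpus = "this is a sentence . this is another sentence .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def p_word(w):
    # Unigram probability: count of the word divided by total word count.
    return unigrams[w] / total

def p_next(w, prev):
    # Bigram probability of w given the previous word (no smoothing).
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(words):
    # P(w1) * product of P(w_i | w_{i-1}), as in the product on the slide.
    p = p_word(words[0])
    for prev, w in zip(words, words[1:]):
        p *= p_next(w, prev)
    return p

print(sentence_prob("this is a sentence".split()))
```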
0:08:51 | i don't relate right uh will not be the recurrent networks |
---|
0:08:54 | uh then for the basic representations of takes the uh one and coding or one |
---|
0:09:00 | hundred presentations is something that this like a uh |
---|
0:09:03 | very basic on so that people should know about the |
---|
0:09:06 | uh usually the it when we want to represent some text uh |
---|
0:09:10 | uh especially in english of you we compute first of vocabulary and then |
---|
0:09:14 | represent each corpus uh basically separate uh id |
---|
0:09:18 | uh which show |
---|
0:09:19 | uh has um the advantages and some disadvantages it's very simple uh is it understand |
---|
0:09:25 | the disadvantage is that the |
---|
0:09:27 | as you can see mandy and use the of completely or particular presentation |
---|
0:09:32 | uh data sharing parameters and the |
---|
0:09:34 | and it's up to the model that's using these ones representations to figure out that |
---|
0:09:38 | are they are really it for example so that the |
---|
0:09:40 | it would be able to generalise better |
---|
0:09:42 | so these are the basic representations and the ability to later that we can ensure |
---|
0:09:47 | present work so that some |
---|
0:09:48 | uh |
---|
0:09:49 | better more richer vectors uh |
---|
0:09:53 | actually it's a |
---|
0:09:54 | uh like nice improvements in many applications |
---|
0:09:59 | Bag-of-words representations are then just sums of these one-hot vectors when |
0:10:03 | we want to represent a phrase or a sentence. |
0:10:07 | For example, if we have this small vocabulary and we want to represent |
0:10:11 | the sentence |
0:10:13 | "Today is a Monday", |
0:10:16 | it will look like the counts of the words; if some word repeats in your sentence, |
0:10:18 | there is something special about it in the vector. |
0:10:22 | This representation can still be improved by |
0:10:24 | considering the local context, by |
0:10:26 | using bag-of-bigrams, and even if it may |
0:10:30 | seem surprising, you will see that for |
0:10:32 | many applications, really most of the applications nowadays, |
0:10:37 | whatever the companies require is built |
0:10:40 | on this very simple representation. |
0:10:44 | So that's maybe the challenge for the future. |
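A small sketch of these representations, assuming a toy vocabulary: a one-hot vector per word, a bag-of-words count vector for a sentence, and the bag-of-bigrams extension that keeps some local context:

```python
from collections import Counter

vocab = ["today", "is", "a", "monday", "nice", "day"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # 1-of-N encoding: all zeros except the position of the word.
    v = [0] * len(vocab)
    v[index[word]] = 1
    return v

def bag_of_words(sentence):
    # Sum of one-hot vectors = word counts over the vocabulary.
    counts = Counter(sentence.lower().split())
    return [counts[w] for w in vocab]

def bag_of_bigrams(sentence):
    # Local context: count consecutive word pairs instead of single words.
    words = sentence.lower().split()
    return Counter(zip(words, words[1:]))

print(one_hot("monday"))
print(bag_of_words("today is a monday"))
print(bag_of_bigrams("today is a monday"))
```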
0:10:48 | Another important concept is word classes. |
0:10:52 | As I said, |
0:10:53 | words that are similar should be kind of related to each other, and one way |
0:10:58 | how to think of it is to define some |
0:11:01 | set of classes. For example Italy, Germany, France, Spain: all these words |
0:11:05 | denote names of countries in Europe, |
0:11:10 | and maybe we can just group them together and call them one class. |
0:11:14 | So this is one of the most successful NLP concepts in |
0:11:18 | practice. |
0:11:21 | It was introduced, I think, |
0:11:23 | in the early nineties; |
0:11:25 | one particular paper that I think is very nice is |
0:11:29 | the one from Peter Brown, "Class-Based n-gram Models of Natural Language". |
0:11:34 | The classes are computed |
0:11:36 | automatically, again |
0:11:38 | from some training corpus, and the main idea behind it is that |
0:11:43 | the words that |
0:11:45 | share the contexts, that appear in the same contexts, should |
0:11:50 | really belong to the same class. |
0:11:53 | Once you get these classes, then we can improve |
0:11:56 | the representation shown before, because we can represent the word as one-of-N |
0:12:02 | plus a one-of-N class representation, and then |
0:12:07 | there will be some generalization in the system that is trained on top of this representation. |
0:12:13 | That was more like the historical overview. There |
0:12:18 | are several other important concepts that people should know about |
0:12:21 | and that are |
0:12:24 | basically the stepping stones to understanding the neural networks. |
0:12:29 | The most frequent ones are probably unsupervised dimensionality reduction using principal component analysis and |
0:12:38 | unsupervised clustering with k-means. |
0:12:39 | These algorithms are quite important, and then supervised classification, |
0:12:44 | especially logistic regression, |
0:12:47 | is very important. |
0:12:50 | I will not describe them in detail, because I wouldn't finish. |
0:12:53 | So now I will jump to a quick introduction to neural networks. |
0:13:00 | Again, it will be just a quick overview, so that people can get |
0:13:05 | some idea of |
0:13:06 | what the neural networks actually are. |
0:13:08 | I will try to describe the basic algorithms that |
0:13:13 | people are using all the time, |
0:13:14 | and then I will |
0:13:15 | also try to give some short explanation of what |
0:13:19 | deep learning means, because I think that's a term that |
0:13:22 | is becoming very popular now, and it would be good to know what it is |
0:13:26 | about. |
0:13:27 | So, why neural networks |
0:13:30 | in natural language processing? |
0:13:32 | The motivation is simply to come up with better, more precise techniques than |
0:13:37 | what I was showing before: something better than the |
0:13:40 | bag of words, something better than just the n-grams. So how can it |
0:13:47 | be done, and why would it be good? |
0:13:49 | If we can come up with some better representation, then we can |
0:13:53 | get slightly better performance in applications, and that's important for many people: |
0:13:57 | it's important for the companies because they want to be the best, and it's important |
0:14:02 | for researchers because they want to publish the most interesting papers |
0:14:08 | and they are competing in |
0:14:10 | all kinds of competitions. |
0:14:12 | So it's basically important for everyone to develop the best techniques. |
0:14:17 | That's about the motivation. This is how the |
0:14:19 | neuron basically looks. |
0:14:22 | It is a mathematical, or graphical, representation of the |
0:14:26 | model; it's a simple mathematical model of the |
0:14:31 | function that people believed the neuron, |
0:14:35 | the biological neuron, computes, but it's very simplified, so I would warn about |
0:14:40 | drawing parallels between the |
0:14:43 | artificial |
0:14:45 | neurons and the biological neurons, since it is really |
0:14:48 | a very different thing. |
0:14:50 | So the artificial neuron looks like this: basically there are |
0:14:55 | incoming signals that come |
0:14:58 | to the neuron. It's called a synapse, the term is from |
0:15:02 | biology, but basically it's just some arrows, something coming |
0:15:07 | into the neuron |
0:15:08 | from some other neurons. |
0:15:11 | And these signals are multiplied by |
0:15:15 | weights: |
0:15:17 | each of these input arrows is associated with one number, |
0:15:21 | basically the weight that multiplies the incoming signal. |
0:15:25 | So we have three incoming numbers here, |
0:15:28 | and they get summed together in the neuron, |
0:15:33 | after which there is the application of the activation function, which |
0:15:38 | needs to be nonlinear for it to be a proper neuron. |
0:15:42 | The simplest one is probably the |
0:15:44 | so-called rectified |
0:15:46 | linear activation function, which is basically just taking the max between zero and the value that we |
0:15:51 | computed, so all the values that are below zero basically get translated |
0:15:56 | to zero. |
0:15:58 | This value that we compute this way |
0:16:00 | is the output of the neuron at the given time, |
0:16:05 | and this output can actually be connected to many other neurons, |
0:16:09 | it does not have to be connected to just |
0:16:10 | one, |
0:16:12 | but it's a single number that goes out of the single neuron. |
0:16:16 | And here is the equation. |
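A minimal sketch of that computation, a weighted sum of the inputs followed by a rectified linear activation; the example weights are made up:

```python
def relu(x):
    # Rectified linear activation: max(0, x).
    return max(0.0, x)

def neuron(inputs, weights, bias=0.0):
    # Weighted sum of the incoming signals, then the nonlinearity.
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return relu(s)

# Three incoming numbers, one weight per input arrow (values are made up).
print(neuron([0.5, -1.0, 2.0], [0.1, 0.4, 0.3]))  # relu(0.05 - 0.4 + 0.6) = 0.25
```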
0:16:19 | I think that |
0:16:21 | the biological neurons are actually |
0:16:23 | also connected to other neurons, but there are so many differences |
0:16:27 | that it doesn't even make sense to |
0:16:29 | start comparing the two. |
0:16:31 | The artificial neural networks were somewhat inspired by |
0:16:36 | the biological neurons |
0:16:38 | in the beginning, but it's a different thing now. |
0:16:43 | Maybe the name, I think, is |
0:16:47 | misleading: people start working on these |
0:16:50 | techniques and start believing that maybe they can just solve artificial intelligence |
0:16:54 | because, after all, they have neurons in their model, because after all the model |
0:16:59 | is called a neural network, right? Well, this is the logic that |
0:17:03 | I sometimes hear from some older professors, and I think it's really misleading, |
0:17:07 | and it's part of the |
0:17:09 | marketing, so just |
0:17:10 | don't take it too seriously. I think |
0:17:12 | if the name of these |
0:17:14 | artificial neural networks were "nonlinear data projections", it would maybe be |
0:17:18 | better, but then |
0:17:19 | nobody would use it, because it wouldn't sound interesting, right? |
0:17:23 | So, |
0:17:24 | this is the representation |
0:17:26 | of a whole network, where we have multiple of these neurons. |
0:17:31 | Usually there is some structure; this is the typical feed-forward structure, where we have |
0:17:35 | the input layer, |
0:17:37 | which is made out of some features. It can be |
0:17:40 | the bag-of-words features, or the one-hot encodings that I was talking |
0:17:43 | about before. |
0:17:45 | So these are the features; you specify them somehow. |
0:17:48 | Then there's the hidden layer, where we compute the |
0:17:52 | neurons, and then there's the output layer. |
0:17:54 | Again, it's the application of the same equations, |
0:17:57 | so nothing special there. The output layer |
0:18:00 | is usually what you want the network to be doing, |
0:18:04 | for example classification: we want, for example, to say that at the input layer there |
0:18:08 | is some encoding of the sentence, and at the output layer there can be a classification of whether |
0:18:13 | the sentence is spam or not. |
0:18:15 | So there can be just one neuron that would be |
0:18:18 | doing a binary decision. |
0:18:21 | The training is done with the backpropagation |
0:18:24 | algorithm. |
0:18:31 | As the training is done with backpropagation, I will not really describe exactly |
0:18:35 | how it is done, because it's a lot of math, and you can |
0:18:39 | find some nice lectures on the web; |
0:18:43 | I think on Coursera there are some nice courses about neural networks. It would |
0:18:46 | take quite some time to |
0:18:49 | explain it. Basically, what we need to do is to define some objective function, which |
0:18:55 | will say what is the error that the network |
0:19:00 | makes for a particular training example. So when we train the network, we show it |
0:19:04 | basically some input features, |
0:19:06 | and we know what the output is that the network should have produced, and we |
0:19:09 | know what the network did actually compute using the current set of |
0:19:13 | weights. And then, using the backpropagation and the stochastic gradient descent algorithms, we |
0:19:19 | compute how much, and in what direction, we should change the |
0:19:24 | weights, |
0:19:25 | so that next time the network sees the same example, it will make |
0:19:30 | a smaller error. |
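A tiny sketch of that "compute the error, nudge the weights" loop for a single linear neuron with a squared-error objective; the gradient expressions are worked out by hand for this one-neuron case and the data are made up:

```python
# One linear neuron trained by stochastic gradient descent on a squared error.
# Toy data: learn y = 2*x1 - 3*x2.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -3.0), ([1.0, 1.0], -1.0)]
w = [0.0, 0.0]
lr = 0.1  # learning rate

for epoch in range(200):
    for x, target in data:
        out = sum(wi * xi for wi, xi in zip(w, x))       # forward pass
        err = out - target                               # error on this example
        # Gradient of 0.5*err^2 w.r.t. each weight is err * x_i; backpropagation
        # reduces to exactly this for a single linear neuron.
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]

print(w)  # approaches [2.0, -3.0]
```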
0:19:33 | And then there's the simplified graphical representation |
0:19:36 | that is used in some papers, |
0:19:38 | where we don't actually draw all the individual neurons, but just draw |
0:19:44 | boxes with the arrows. |
0:19:47 | Then there are other things that |
0:19:49 | have to be decided if one actually wants to implement |
0:19:53 | this. There are these |
0:19:55 | hyperparameters that the training doesn't |
0:19:59 | choose: what is the type of activation function that we use, |
0:20:03 | and there are many of them; |
0:20:05 | how many hidden layers do we have and what are their sizes; how they |
0:20:09 | are connected; we can have some skip connections, we can have recurrent connections, we can |
0:20:13 | have some weight sharing like in the convolutional networks. So there are actually |
0:20:17 | quite a lot of things. Of course, I will not describe all |
0:20:20 | of them, because that would be like a full course. |
0:20:24 | But I think what works, for me, for starting to work with |
0:20:28 | neural networks is to take some existing setup and try to play with it |
0:20:32 | by making some modifications and |
0:20:34 | observing what the difference is. |
0:20:37 | So maybe that's the best way to start. |
0:20:40 | For deep learning: |
0:20:43 | this popular term |
0:20:46 | is basically still the same thing. It's a neural network |
0:20:50 | that has |
0:20:52 | more hidden layers, usually; |
0:20:54 | if there are at least two or three hidden layers, then people basically |
0:20:58 | call it deep learning. Or maybe we can also add some recurrent connections, |
0:21:03 | which make the outputs depend on all the previous |
0:21:06 | input features, which is actually very deep, because there are so many nonlinearities |
0:21:11 | that influence the output of the model. |
0:21:13 | So basically any network, |
0:21:18 | any model, that goes through several nonlinearities before it computes the |
0:21:23 | output can be considered as deep learning, |
0:21:28 | although some people probably call everything deep learning nowadays, which I think is completely silly. |
0:21:37 | There was also this controversy for, I think, maybe twenty years, |
0:21:43 | where basically the common knowledge was that |
0:21:46 | training these deep neural networks is not possible to be done with |
0:21:50 | stochastic gradient descent. |
0:21:52 | When I was a student myself, |
0:21:56 | whatever book I was reading, everybody basically claimed that training these deep |
0:22:01 | networks does not work, and that's it, and that we need to develop some magical |
0:22:05 | algorithms. |
0:22:06 | Actually, it's not the case; people now train deep networks normally with SGD, |
0:22:11 | and it |
0:22:12 | just works. It's probably because we have more data than what people had |
0:22:16 | in the nineties, and maybe also more computational power, |
0:22:22 | but there is basically this |
0:22:24 | long chain of successful results starting maybe in 2005 or |
0:22:28 | 2006, where people were able to train some deeper networks. |
0:22:16 | in the nine so i didn't know much durable power exponential there but the uh |
---|
0:22:21 | there are be about the |
---|
0:22:22 | there are basically this uh |
---|
0:22:24 | uh a long chain of sex as a result starting maybe in two thousand five |
---|
0:22:28 | six the lower people are able to find it remains some deeper networks are |
---|
0:22:35 | there's also like this mathematical justification why prediction need to the |
---|
0:22:40 | the models so |
---|
0:22:42 | uh coming from seeing more popular and marvin means key |
---|
0:22:46 | in their book perceptrons it is uh |
---|
0:22:49 | so the very mathematical i would say about the about the argument is very interest |
---|
0:22:54 | think there are functions that uh |
---|
0:22:57 | that we can't represent action maybe give just a single hidden layer |
---|
0:23:01 | and uh |
---|
0:23:02 | actually that's the logic that i will be using at the end of the talk |
---|
0:23:05 | show that they are actually |
---|
0:23:06 | a function that even the deep learning models so |
---|
0:23:09 | cannot uh a learn a gently |
---|
0:23:12 | maybe represent us not very large |
---|
0:23:15 | so |
---|
0:23:16 | uh |
---|
0:23:17 | so i would see that the wall down deep learning maybe was invented a neural |
---|
0:23:22 | network and you're like about the |
---|
0:23:24 | but the these ideas are much older |
---|
0:23:27 | uh like you have the motivation uh our people to argue that we really need |
---|
0:23:32 | to |
---|
0:23:32 | use something else then use the |
---|
0:23:35 | these uh simple perceptrons |
---|
0:23:41 | So this is the graphical representation; |
0:23:45 | it's basically just several hidden layers, |
0:23:49 | and that's about it. Of course it can be more complicated than this if |
0:23:53 | there are some recurrent connections or something of that sort. |
0:23:57 | But how to train these models, |
0:24:00 | I would even say that it is still an open research problem, because |
0:24:05 | when you have a |
0:24:08 | very deep model, then it is |
0:24:13 | possible to show in many cases that it can represent |
0:24:14 | solutions to some |
0:24:16 | interesting problems, but |
0:24:18 | the question is |
0:24:20 | whether there is |
0:24:25 | a good approach by which we can find the solution when we train the |
0:24:31 | network, which is actually not always the case, especially for some complex problems. As I will |
0:24:36 | be showing at the end, when the network, for example, has to learn |
0:24:40 | some complex structures, |
0:24:41 | then, |
0:24:44 | because there are a lot of local optima, |
0:24:50 | it seems that we need to develop something better than what we have now. |
0:24:53 | And now I will be talking about the |
0:24:56 | most basic application of |
0:25:01 | neural nets to text problems, which is how to compute the distributed |
0:25:03 | representations of words, |
0:25:08 | and I will show some nice examples, I think, of |
0:25:13 | some linguistic regularities in the vector space. |
0:25:17 | So this is how we can actually train the most basic word vectors. |
0:25:22 | As was mentioned here, when I was writing my diploma |
0:25:26 | thesis in 2006, this was the first model I implemented: we just try |
0:25:27 | to predict |
0:25:33 | the next word given the previous word, using a simple neural network with |
0:25:34 | one hidden layer. |
0:25:37 | When we train this model |
0:25:39 | on some text corpus, |
0:25:43 | then the byproduct of this learning is |
0:25:46 | that the matrix of weights between the |
0:25:49 | input layer and the hidden layer |
0:25:54 | will basically contain the word representations in some |
0:25:56 | vector format: the word |
0:26:00 | can then be seen as |
0:26:05 | this row of numbers, of weights, from this matrix. |
0:26:08 | And it has interesting properties; for example, |
0:26:12 | it will group words with similar meaning together, so that |
0:26:15 | the vector representations |
0:26:18 | of, for example, France and Italy will be |
0:26:21 | close to each other, while, for example, |
0:26:24 | I don't know, |
0:26:29 | France and China will probably be farther away, or maybe not. |
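A sketch of how such embeddings are usually inspected once trained; the vectors below are tiny made-up placeholders, just to illustrate that a word vector is one row of the learned input weight matrix and that similarity is measured with the cosine:

```python
import math

# Toy "input-to-hidden weight matrix": one row of weights per vocabulary word.
# Real models use hundreds of dimensions and are learned, not hand-written.
embedding = {
    "france": [0.8, 0.1, 0.3],
    "italy":  [0.7, 0.2, 0.3],
    "china":  [0.1, 0.9, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(cosine(embedding["france"], embedding["italy"]))  # higher: related words
print(cosine(embedding["france"], embedding["china"]))  # lower: less related
```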
0:26:34 | So this is basically the simplest application of the neural networks, |
0:26:43 | and it is kind of fun to play with it. |
0:26:48 | Of course it's not perfect, so the word vectors coming from this model will not be comparable to the state of the art |
0:26:52 | today, but it already works as a start. |
0:26:55 | Sometimes these word vectors are also called word embeddings; I'm not completely |
0:27:01 | sure why, |
0:27:03 | but that's the alternative name. |
0:27:06 | Usually the dimensionality of this representation is something like fifty to one thousand, so |
0:27:10 | each word is represented by, say, one hundred floats after we train the model. |
0:27:16 | Compare that to the word classes I was showing before: |
0:27:20 | France and Italy can go to the same class, but |
0:27:25 | with the word vectors these representations can be much richer, because, |
0:27:28 | unlike with the word classes, we can have multiple degrees of similarity |
0:27:32 | encoded in these word vectors, and as I will show later, |
0:27:37 | it actually makes sense. |
0:27:44 | Of course, one thing is that it is fun to have |
0:27:48 | these vectors just to study the language, and that actually increased our |
0:27:52 | interest in these techniques, but the other thing is that |
0:27:56 | we can also use them in some other applications. |
0:28:00 | For example, Ronan Collobert showed, in his famous paper |
0:28:05 | "Natural Language Processing (Almost) from Scratch", |
0:28:10 | that one can solve many NLP problems |
0:28:13 | at state-of-the-art performance |
0:28:19 | by using some pre-trained word vectors. |
0:28:23 | So the word vectors can basically be features for some other models, like neural |
0:28:24 | networks, instead of, or in addition to, the one-hot encodings. |
0:28:29 | Historically, there were |
0:28:32 | several models proposed before for training |
0:28:35 | these word representations. |
0:28:38 | Usually people started with the most complicated things, so they started with some model that had |
0:28:43 | many hidden layers, |
0:28:45 | and it was kind of working, |
0:28:47 | so it was considered a big success of deep learning. |
0:28:51 | Well, I wasn't convinced about it, because I knew from my previous results |
0:28:54 | with just one hidden layer |
0:28:56 | that the vectors were already quite good. |
0:29:00 | So I wanted to show that actually the shallow models, models that |
0:29:04 | don't have many hidden layers but just one, |
0:29:06 | can actually be quite competitive. For that, I needed to be able to compare |
0:29:11 | to the word vectors of other people's approaches, |
0:29:16 | and that wasn't actually easy, because |
0:29:19 | people were showing results after training the models on different datasets, |
0:29:25 | and these datasets are not public; and if you compare two techniques that were trained |
0:29:29 | on different data, then the comparison is not going to be very good. |
0:29:34 | So one of the interesting properties, which I actually used for developing this |
0:29:41 | analogy relation test set, |
0:29:43 | was that |
0:29:44 | these word vectors can be used for |
0:29:48 | doing these small, |
0:29:51 | analogy-like calculations with the words. One can, for example, |
0:29:55 | ask the question: when we take the vector for "king" and |
0:30:00 | subtract from it the vector that represents "man", |
0:30:03 | then add the vector that represents "woman", and do the nearest-neighbor search |
0:30:09 | while excluding the query words around this position, |
0:30:12 | then we will find the word "queen", for any |
0:30:16 | reasonably good word vector model. |
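A sketch of that vector arithmetic with hand-made placeholder vectors (two dimensions, roughly "royalty" and "gender"); real word2vec vectors are learned and much higher-dimensional, but the nearest-neighbor search that excludes the three query words looks the same:

```python
import math

embedding = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.0, 0.5],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def analogy(a, b, c):
    # vector(a) - vector(b) + vector(c), then nearest neighbor by cosine,
    # excluding the three query words themselves.
    target = [x - y + z for x, y, z in zip(embedding[a], embedding[b], embedding[c])]
    candidates = (w for w in embedding if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embedding[w], target))

print(analogy("king", "man", "woman"))  # "queen" with these toy vectors
```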
0:30:21 | Similarly, we can actually calculate |
0:30:23 | with the words, and there are a lot of |
0:30:26 | questions of this type. |
0:30:28 | It's kind of funny how accurate it can get. |
0:30:32 | In the picture below it is shown that there can basically be |
0:30:36 | multiple degrees of similarity: "king" is related to "queen" in some way, but |
0:30:41 | it's also related to its plural form "kings" |
0:30:46 | in some other way, |
0:30:47 | and we want to capture all these things. |
0:30:50 | So the idea that a word would be a member of a single class |
0:30:54 | would not allow us to capture this. |
0:31:00 | So, for the relation test set, I constructed this dataset with |
0:31:03 | almost twenty thousand questions, basically written by hand and then generated |
0:31:09 | using permutations, |
0:31:12 | automatically. |
0:31:13 | And these are a few examples; |
0:31:15 | I think it would be quite challenging even for |
0:31:17 | people to answer some of these |
0:31:21 | analogy questions, so maybe try to compute them |
0:31:24 | yourselves. |
0:31:26 | For example, Athens is to Greece as Oslo is to Norway; I think that's quite easy. |
0:31:31 | But the second one, |
0:31:33 | the currencies, is harder: |
0:31:38 | the currency in Angola, and the currency in Iran, which I think is the |
0:31:41 | rial. |
0:31:42 | So I think that's more complicated. |
0:31:45 | And then there are others that are actually very simple, |
0:31:48 | like brother to sister, grandson to |
0:31:51 | granddaughter, and so on. |
0:31:53 | So we can measure the performance of |
0:31:55 | different models |
0:31:59 | on these questions. |
0:32:04 | And it can actually be scaled up |
0:32:07 | to phrases, |
0:32:08 | so that we can compute things like: New York is to New York Times as Baltimore is to, I |
0:32:13 | think, Baltimore Sun. |
0:32:15 | All these datasets are public; |
0:32:18 | they were published with the papers. |
0:32:21 | Now, compared to |
0:32:24 | the simple word vector model that I will show later, this one, which was |
0:32:29 | kind of considered state-of-the-art |
0:32:31 | back in the days, |
0:32:33 | used two hidden layers, |
0:32:37 | starting with a context of three or four words |
0:32:41 | on the input to predict the next word |
0:32:44 | by going through the projection layer and the hidden layer. |
0:32:49 | And the main complexity of this model, after we do some tricks |
0:32:53 | so that we can deal with the huge output matrix, is |
0:32:58 | in the dense hidden-layer matrix, because we need to touch all its parameters for every |
0:33:01 | training example, |
0:33:03 | and the model takes ages to train. |
0:33:05 | So |
0:33:07 | what I did was basically to remove the hidden layer |
0:33:09 | and make the projection layer |
0:33:12 | slightly different, and, |
0:33:14 | as I will show later, it actually works quite fine. So again, the |
0:33:18 | idea is that we can take the bigram model |
0:33:21 | and just extend it so |
0:33:22 | that we are summing the context around the word we are trying to predict: we |
0:33:27 | just sum the word representations in the projection layer and make the prediction right away. |
0:33:31 | This model will not be able to learn the n-grams, so it's not |
0:33:34 | suitable for language modeling, but it's just fine to learn the word vectors this way. |
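A rough sketch of that continuous bag-of-words idea: sum the projection-layer vectors of the surrounding words and score the middle word directly from that sum. This is a bare-bones illustration with a full softmax over a tiny vocabulary, not the tricks used in the real tool:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
V, D = len(vocab), 8
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))    # projection (input) vectors, one per word
W_out = rng.normal(scale=0.1, size=(D, V))   # output weights

def cbow_probs(context_ids):
    # Sum the projection-layer vectors of the context words (no hidden layer),
    # then score every vocabulary word and normalize with a softmax.
    h = W_in[context_ids].sum(axis=0)
    scores = h @ W_out
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Predict the middle word "sat" from the context ["the", "cat", "on", "mat"].
context = [0, 1, 3, 4]
p = cbow_probs(context)
print(p[2])  # probability of "sat"; training would push this up via SGD
```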
0:33:43 | The mirror model to the previous one is the skip-gram model, |
0:33:48 | which tries to predict the context given the |
0:33:51 | current word. |
0:33:53 | The two should work quite similarly if they are trained |
0:33:58 | properly. |
0:34:00 | The training is still the same thing: stochastic gradient descent and |
0:34:04 | backpropagation. |
0:34:06 | The words at the output layer are encoded as one-of-N, and the |
0:34:10 | same for the input layer. |
0:34:11 | We cannot really use the full softmax |
0:34:14 | function in the output layer, which |
0:34:16 | gives a proper probability distribution, because we would have to compute |
0:34:19 | all the outputs, which would take too long. So there are these |
0:34:23 | two fast approximations: one that still keeps the |
0:34:28 | probabilities correctly summing to one, which is the hierarchical softmax, and the |
0:34:32 | second one, |
0:34:33 | which actually gives up on the |
0:34:35 | assumption that the model should be a proper probabilistic model |
0:34:38 | and just takes a |
0:34:40 | bunch of words as negative examples |
0:34:42 | at the output layer, plus the positive example, and that's all |
0:34:47 | that is done. |
0:34:48 | And nowadays the second option, negative sampling, seems to be preferable. |
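A sketch of one skip-gram training step with negative sampling, under the same toy setup as above: the center word should score high against the observed context word and low against a few randomly drawn negative words, so only a handful of output rows are touched per example. This is a generic illustration, not the exact word2vec code:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 50                      # toy vocabulary size and vector dimension
W_in = rng.normal(scale=0.1, size=(V, D))
W_out = rng.normal(scale=0.1, size=(V, D))
lr, k = 0.025, 5                     # learning rate, number of negative samples

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(center, context):
    # Positive example (label 1) plus k random negatives (label 0).
    samples = [(context, 1.0)] + [(int(rng.integers(V)), 0.0) for _ in range(k)]
    v = W_in[center]
    grad_v = np.zeros(D)
    for w, label in samples:
        p = sigmoid(v @ W_out[w])            # predicted "is this a real pair?"
        g = p - label                        # gradient of the logistic loss w.r.t. the score
        grad_v += g * W_out[w]
        W_out[w] -= lr * g * v               # update only the sampled output vectors
    W_in[center] -= lr * grad_v              # update the center word's vector

train_pair(center=3, context=17)
```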
0:34:53 | And there was another trick |
0:34:55 | that actually improves the performance quite a lot: |
0:34:59 | probabilistically, or stochastically, discarding the most frequent words. This both speeds up the training and, interestingly, |
0:35:06 | can even improve the accuracy, because we don't show billions and billions of examples where we |
0:35:11 | try to |
0:35:12 | relate words like "the", |
0:35:15 | "is", "a" and so on. |
0:35:19 | So these are not removed from the training set |
0:35:22 | completely, |
0:35:24 | but some proportion of them is actually removed, so that their importance is reduced |
0:35:29 | when it comes to the objective function. |
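A sketch of that subsampling step; the keep-probability formula follows the commonly cited word2vec heuristic of sqrt(threshold / frequency), stated here as an assumption rather than the exact code of the tool, and the threshold is enlarged to make the effect visible on a toy corpus:

```python
import math
import random
from collections import Counter

def subsample(words, t=0.01, rng=random.Random(0)):
    # Randomly drop a proportion of very frequent words ("the", "is", "a", ...).
    # Real corpora use a much smaller threshold, e.g. t = 1e-5.
    counts = Counter(words)
    total = len(words)
    kept = []
    for w in words:
        freq = counts[w] / total
        keep_prob = min(1.0, math.sqrt(t / freq))   # frequent word -> low keep_prob
        if rng.random() < keep_prob:
            kept.append(w)
    return kept

corpus = ("the cat sat on the mat because the mat was warm " * 1000).split()
kept = subsample(corpus)
print(Counter(corpus)["the"], Counter(kept)["the"])  # "the" is thinned out heavily
print(Counter(corpus)["cat"], Counter(kept)["cat"])  # rarer words survive more often
```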
0:35:32 | And here is the comparison, as I said, |
0:35:34 | on this analogy test set. |
0:35:37 | There was this big gap in both the training time and the |
0:35:40 | accuracy compared to what people had published before. So that's what I wanted to |
0:35:45 | prove: that one does not have to train a language model to obtain |
0:35:48 | good word representations. |
0:35:51 | And here, the last two lines, are the very simple models that |
0:35:55 | are invariant to the word order; they don't understand |
0:35:59 | n-grams, they just see the single words, and |
0:36:02 | still they can compute very accurate word representations that are actually |
0:36:07 | way better than what people could compute before, |
0:36:11 | while the training time goes from months and weeks to minutes, |
0:36:15 | and maybe even seconds. |
0:36:18 | So this was published as open-source code; |
0:36:22 | it's called the word2vec project. |
0:36:24 | Actually, many people |
0:36:25 | find it useful, because |
0:36:27 | they can train it on their own datasets |
0:36:31 | to improve |
0:36:32 | many other applications. |
0:36:36 | So I think it's a |
0:36:38 | nice way to gain a few percent of accuracy, especially when |
0:36:41 | people are dealing with datasets where there is not a |
0:36:45 | huge number of |
0:36:46 | supervised training examples. |
0:36:52 | Here are some examples of the nearest neighbors, |
0:36:55 | just to give an |
0:36:56 | idea of |
0:36:58 | how big the gap was between what was state-of-the-art before and |
0:37:03 | after |
0:37:03 | these models were introduced. |
0:37:06 | For example, "Havel" is an infrequent word in |
0:37:10 | English, |
0:37:11 | but it's still present in |
0:37:13 | all these models, |
0:37:16 | and we can see that the nearest neighbors for the first one |
0:37:21 | barely make any sense; |
0:37:22 | with this one you can at least get the idea that it is probably a name of |
0:37:25 | some person; |
0:37:27 | while the last one is obviously much better when it comes to the nearest neighbors. |
0:37:34 | And of course this improvement of the quality comes from the fact that |
0:37:37 | the models are trained on much more data and |
0:37:39 | have a larger dimensionality, and that all was possible because the |
0:37:44 | training complexity was reduced by many orders of magnitude. |
0:37:50 | There are some more fun examples: |
0:37:53 | we can calculate things like |
0:37:58 | sushi minus Japan plus Germany, which gives something like bratwurst, |
0:38:02 | and so on. I think it's |
0:38:04 | kind of fun. Of course, we don't have to look just at the nearest |
0:38:06 | token, we can look at the top ten |
0:38:09 | tokens. |
0:38:10 | So I wouldn't say that it works all the time; |
0:38:13 | maybe like sixty percent of the time the nearest words are |
0:38:16 | looking reasonable. |
0:38:19 | But it is still fun to play with it, and |
0:38:22 | there are many pre-trained models now available on the web. |
0:38:28 | One thing that data scientists actually find useful is that |
0:38:32 | these word vectors can be visualized to get some understanding of what is going on |
0:38:37 | in the dataset |
0:38:38 | that they are using. |
0:38:40 | The regularities are actually so strong that when we train this model on |
0:38:45 | the Google News dataset |
0:38:46 | and then visualize in two dimensions the representations for countries and the |
0:38:53 | capital cities, |
0:38:55 | then we can actually see the correlation |
0:38:57 | between them: |
0:38:58 | there is this single direction |
0:39:02 | for how to get from a country to its capital city, and even the countries |
0:39:07 | are actually related to each other in this |
0:39:12 | representation in some |
0:39:14 | interesting way. For example, we can see that |
0:39:17 | the European countries, or the eastern European ones, are in some part of the image, |
0:39:20 | the rest of the world is somewhere in the middle, |
0:39:22 | and then the Asian countries are more |
0:39:26 | at the top of the image. |
0:39:32 | So, for the summary: |
0:39:35 | I think it's always good to think about whether |
0:39:37 | things can be done in a simpler way, and as was shown here, |
0:39:42 | not everything has to be deep, and |
0:39:44 | neural networks are fine even if we |
0:39:47 | actually remove many of the hidden layers, especially in the NLP applications. It's |
0:39:51 | a different story, for example, for acoustic modeling or for image |
0:39:56 | classifiers, where |
0:39:58 | I am not aware of any |
0:40:00 | model that would be able to be competitive with the |
0:40:03 | deep models |
0:40:04 | without having many nonlinearities. But for the NLP tasks it's |
0:40:09 | the other way around, so |
0:40:11 | I am not completely convinced that deep learning actually works for NLP so |
0:40:14 | far, |
0:40:16 | but maybe in the future it will |
0:40:18 | work better. |
0:40:20 | Although, there is this straightforward |
0:40:23 | extension of word2vec: basically, instead of predicting the middle word given |
0:40:26 | the context, we can predict a label for the sentence, using the same algorithms. |
0:40:33 | And this is what we published as the fastText library last year. |
0:40:38 | It's very simple, but at the same time very useful. |
0:40:41 | And compared to what |
0:40:43 | people are probably doing nowadays at |
0:40:49 | the deep learning conferences, |
0:40:52 | when we did the comparison to some |
0:40:55 | convolutional networks with |
0:40:57 | several hidden layers trained on |
0:40:59 | many GPUs, |
0:41:01 | we did find out that we can get better accuracy while being a |
0:41:05 | hundred, or even a |
0:41:06 | hundred thousand, times faster. |
0:41:08 | So I think it's always worth thinking about the baselines and doing the simple things |
0:41:12 | first. |
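A sketch of that classifier idea (not the actual fastText implementation): average the word vectors of a sentence and feed the result to a linear softmax classifier, trained end-to-end with SGD; the `texts` and `labels` here are placeholders for whatever supervised data you have:

```python
import numpy as np

texts = ["good movie", "great film", "bad movie", "awful film"]   # placeholder data
labels = [1, 1, 0, 0]                                             # 1 = positive

vocab = sorted({w for t in texts for w in t.split()})
idx = {w: i for i, w in enumerate(vocab)}
V, D, C = len(vocab), 10, 2
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, D))    # word vectors, learned jointly
W = np.zeros((D, C))                      # linear classifier on top of the average

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for epoch in range(200):
    for text, y in zip(texts, labels):
        ids = [idx[w] for w in text.split()]
        h = E[ids].mean(axis=0)             # sentence = average of its word vectors
        p = softmax(h @ W)
        g = p.copy(); g[y] -= 1.0           # gradient of cross-entropy w.r.t. scores
        E[ids] -= 0.1 * (W @ g) / len(ids)  # backprop into the word vectors
        W -= 0.1 * np.outer(h, g)

print(softmax(E[[idx[w] for w in "good film".split()]].mean(axis=0) @ W))
```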
0:41:15 | So the next |
0:41:17 | part will be about the recurrent networks, because |
0:41:20 | I think it's quite obvious that |
0:41:22 | the word representations can be learned easily with shallow networks, but it's a |
0:41:26 | different story for language modeling; there is actually some success of |
0:41:30 | deep learning there, because the state-of-the-art models |
0:41:34 | nowadays are recurrent, and that's basically this deep model. |
0:41:37 | And then I will talk also about the limitations of these models. |
0:41:42 | The history |
0:41:43 | of the recurrent networks is quite long; |
0:41:46 | there were a lot of people working on these models in the nineties, |
0:41:49 | like Jeff Elman, Michael Jordan, Michael Mozer and so on, because the |
0:41:55 | model is actually very |
0:41:57 | interesting: it's a |
0:41:58 | simple modification for how to get some sort of short-term memory into the model. |
0:42:05 | Here is the graphical representation. |
0:42:07 | Again, we can take the |
0:42:11 | bigram model and just extend it |
0:42:13 | so that the hidden layer |
0:42:17 | is connected to the hidden layer from the previous time step, |
0:42:19 | which creates |
0:42:20 | this loop in the model. |
0:42:22 | So the hidden layer |
0:42:26 | sees the features at the input layer, but also its own state from the previous time step, |
0:42:31 | and that in turn saw |
0:42:33 | the previous state, and so on. So basically |
0:42:37 | every prediction depends not just on the current |
0:42:42 | input feature but on all the inputs from the time steps that |
0:42:46 | came before. |
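A sketch of that recurrence for a simple (Elman-style) recurrent layer: the new hidden state is a nonlinear function of the current input and the previous hidden state, so information can persist across time steps. The shapes and the tanh nonlinearity are common choices, not details taken from the slide:

```python
import numpy as np

D_in, D_h = 10, 16                               # input and hidden sizes (arbitrary)
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(D_h, D_in))   # input -> hidden weights
W_hh = rng.normal(scale=0.1, size=(D_h, D_h))    # hidden -> hidden (the loop)

def step(x, h_prev):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1}); h_t is both the output and the "memory".
    return np.tanh(W_xh @ x + W_hh @ h_prev)

h = np.zeros(D_h)
for x in rng.normal(size=(5, D_in)):             # a sequence of 5 input vectors
    h = step(x, h)                               # the final h depends on the whole sequence
print(h[:4])
```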
0:42:49 | So one can say that the hidden layer then represents some sort of |
0:42:53 | memory |
0:42:54 | that this model has. |
0:42:56 | There's this interesting paper from Jeff Elman, "Finding Structure in Time", where |
0:43:01 | you can find this motivation. |
0:43:05 | Well, after |
0:43:06 | this period when the recurrent networks were studied, |
0:43:10 | the excitement kind of went away, |
0:43:16 | because some people started believing that these models, even if they |
0:43:21 | are looking very good, cannot be trained with |
0:43:23 | SGD. And you can see that |
0:43:26 | this is a recurring theme: |
0:43:27 | it keeps occurring again and again; whenever people are |
0:43:30 | failing to do something, they will blame the technique and say that it just doesn't work. |
0:43:35 | And of course these claims turned out to be wrong, |
0:43:39 | and the recurrent networks can actually be trained normally; one just has |
0:43:44 | to do some small trick. |
0:43:50 | So what I did: I showed in 2010 |
0:43:55 | that one can actually train state-of-the-art language models based on recurrent networks, and it was |
0:43:59 | very easy to apply them to |
0:44:01 | a range of tasks like language modeling, machine translation, speech recognition, data compression and so |
0:44:06 | on, and |
0:44:07 | in each of these I was able to improve the existing systems and |
0:44:11 | achieve new state-of-the-art results, |
0:44:14 | sometimes by quite a significant margin. For example, in language modeling I think the |
0:44:19 | perplexity reduction over n-grams |
0:44:22 | with an ensemble of several recurrent networks |
0:44:25 | was usually like fifty percent or more, so that's quite a lot. |
0:44:31 | And companies started using this toolkit that I published, |
0:44:36 | really many of them. |
0:44:41 | Then I was looking, with Yoshua Bengio, at |
0:44:45 | how it comes that the model actually works for me, while people who tried |
0:44:50 | to do it before just couldn't make it work. |
0:44:55 | There was this problem that I did observe at some point: |
0:44:58 | as I was trying to train the network on more and more data, |
0:45:03 | it started misbehaving in some chaotic way, |
0:45:07 | and the training was not stable, so sometimes it converged, sometimes not, |
0:45:11 | and the more data I used, the lower was the chance that the |
0:45:15 | network would converge; |
0:45:17 | mostly the results were just rubbish. |
0:45:21 | So it took me quite a few days to figure out what |
0:45:24 | was going on, |
0:45:26 | and I did find out that there are some rare cases where |
0:45:29 | the gradients align in such a way that |
0:45:35 | the changes of the weights that are computed |
0:45:38 | become exponentially larger as they get propagated through the recurrent matrix, |
0:45:42 | so that they become so huge |
0:45:44 | that the whole weight matrix gets overwritten with |
0:45:48 | these huge numbers, or with values that are |
0:45:51 | not even valid numbers anymore. |
0:45:53 | So what I did was |
0:45:55 | the simplest thing one could think of: |
0:45:57 | because these gradient explosions happened |
0:46:01 | just very rarely, |
0:46:03 | I simply capped the gradients so that they wouldn't be able to become |
0:46:06 | larger than some value, some threshold. |
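A sketch of that capping step; this version clips each gradient element to a fixed range before the weight update (clipping the norm of the whole gradient is the other common variant), with a made-up threshold:

```python
import numpy as np

def clip_gradient(grad, threshold=1.0):
    # Cap every element of the gradient to [-threshold, threshold] so a rare
    # "exploding" update cannot overwrite the weight matrix with huge values.
    return np.clip(grad, -threshold, threshold)

# Typical use inside an SGD step (W, grad, lr stand for whatever the training loop has):
W = np.zeros((4, 4))
grad = np.array([[1e6, -0.3, 0.1, 0.0]] * 4)   # one pathological, huge component
lr = 0.1
W -= lr * clip_gradient(grad)                   # the update stays bounded
print(W[0])
```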
0:46:12 | And I learned later that probably nobody was |
0:46:16 | aware of this trick in the nineties; |
0:46:18 | this idea was only being discussed |
0:46:21 | around 2011. |
0:46:25 | So maybe that was the reason why things didn't work back then, I don't know, |
0:46:30 | but as I said, it was definitely not the case that |
0:46:33 | SGD wouldn't work for training these models. |
0:46:37 | And as I said, it was quite easy to obtain pretty good results; one |
0:46:41 | just had to wait very long for the training of the models, because they were quite |
0:46:45 | expensive. |
0:46:47 | So, back to the original setup for speech recognition: it was a small, simple dataset, and the reduction of the word error rate was over twenty percent compared to the best n-gram models. |
---|
0:47:02 | One can see that as the number of neurons in the hidden layer gets bigger, so basically as the size of the model is scaled up, the perplexity goes down; perplexity is just a measure of how good the network is at predicting the next word, basically the lower the better, and the word error rate goes down as well. |
---|
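(Editor's note: the perplexity referred to here is the standard per-word measure; for a test sequence $w_1 \dots w_N$,)

$$
\mathrm{PPL} \;=\; \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \ln P(w_i \mid w_1, \dots, w_{i-1})\right),
$$

(so lower values mean the model assigns higher probability to the words that actually follow.)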
0:47:28 | So basically the best n-gram model, trained with no count cutoffs, gave something like twelve and sixteen point six percent word error rate on the evaluation sets, and with a combination of these recurrent networks we got to something like nine and thirteen percent. |
---|
0:47:49 | That was quite a big gain coming just from a change of the language modeling technique, which I think hadn't been seen before. When I compared these results to other techniques that were being developed, for example at Johns Hopkins University, people there are usually happy with a zero point three percent improvement of the word error rate, and here I could get something like three and a half percent absolute. So that was quite an interesting finding. |
---|
0:48:21 | Another interesting observation was that the more training data was used, the bigger the gain of the recurrent networks over the n-gram models became. |
---|
0:48:37 | That was quite the opposite of what Joshua Goodman reported in his technical report in two thousand one, I think it was two thousand one: a very famous study where he basically evaluated all these cache models and so on, everything that had been considered for improving language models, and found that they helped less and less as more data was used. |
---|
0:49:01 | You could see that people were slowly losing all hope that n-gram models could ever be beaten, in the sense of being outperformed by some better model; and then that actually happened, so it was quite funny. And the last graph is on a large dataset from IBM; it's pretty much the same thing as the Wall Street Journal setup, just much bigger and much better tuned, coming from a commercial company. |
---|
0:49:28 | Their best result is the green line, which was something like thirteen percent word error rate, and on the x-axis there is the size of the hidden layer of the recurrent network. You can see that as the networks get bigger and bigger, the word error rate keeps going down. |
---|
0:49:49 | In the end the experiments were limited by the computational complexity, because it took many tricks to train the biggest models, and that was quite challenging. Still, I could get another sizeable relative reduction, and I think it would have been even more if I could have trained bigger models. But already this result was very convincing, and people from the companies got interested. |
---|
0:50:19 | Later, all of this became much more accessible, because actually implementing stochastic gradient descent correctly is quite painful in this model: one has to use the backpropagation-through-time algorithm, and if one makes a mistake there, it is very hard to find it later. |
---|
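(Editor's note: a standard way to catch exactly these hard-to-find BPTT bugs is a finite-difference gradient check; the sketch below is a generic illustration, with a user-supplied `loss_fn` assumed.)

```python
import numpy as np

def gradient_check(loss_fn, params, analytic_grad, eps=1e-5, n_checks=20, tol=1e-4):
    """Compare the analytic gradient from BPTT against centered finite differences.

    loss_fn:        function mapping the 1-D parameter vector to a scalar loss
    params:         1-D numpy array of flattened weights (modified in place, then restored)
    analytic_grad:  1-D numpy array produced by the backpropagation-through-time code
    """
    rng = np.random.default_rng(0)
    for idx in rng.integers(0, params.size, size=n_checks):
        old = params[idx]
        params[idx] = old + eps
        loss_plus = loss_fn(params)
        params[idx] = old - eps
        loss_minus = loss_fn(params)
        params[idx] = old                      # restore the original weight
        numeric = (loss_plus - loss_minus) / (2 * eps)
        denom = max(1e-8, abs(numeric) + abs(analytic_grad[idx]))
        if abs(numeric - analytic_grad[idx]) / denom > tol:
            print(f"suspicious gradient at parameter {idx}: "
                  f"numeric {numeric:.6g} vs analytic {analytic_grad[idx]:.6g}")
```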
0:50:37 | So I think these toolkits are also very useful; maybe the most popular ones now are TensorFlow, Theano and Torch, but there are many others. |
---|
0:50:50 | And using graphics processing units, GPUs, people can scale the training to billions of training words, using thousands of neurons, so that's quite a bit bigger than what I was able to use back then. |
---|
0:51:06 | Today recurrent networks are used in many tasks like speech recognition and machine translation. I think the Google guys published a paper a few months ago where they investigate how to get recurrent networks into the production system for Google Translate. |
---|
0:51:25 | I think it will still take some time, but let's hope it will happen, because that would be great for example for translating from English to Czech, so that finally the morphology wouldn't be as painful as it usually is. |
---|
0:51:42 | On the other hand, I think the downside is that because these toolkits, TensorFlow, Torch and so on, make recurrent networks very easily accessible, people are using them for all kinds of problems that don't really require them. |
---|
0:52:01 | Especially when people try to compute representations of sentences or documents, I would always warn them to think about the simpler baselines, because just a big bag of n-grams can usually beat these models, or at least be around the same accuracy when it comes to representations; so it's different there from language modeling. |
---|
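(Editor's note: a minimal sketch of the kind of simple baseline being recommended, assuming scikit-learn is available; the tiny dataset is a placeholder.)

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder data; in practice use the real documents and labels.
train_texts = ["this movie was great", "utter rubbish", "loved every minute", "boring and slow"]
train_labels = [1, 0, 1, 0]

# Bag of unigrams and bigrams feeding a linear classifier.
baseline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(train_texts, train_labels)
print(baseline.predict(["it was great, loved it"]))
```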
0:52:26 | So one can ask: what can we do better? Do we really need new architectures, or do recurrent networks already work pretty well, and sometimes maybe adding more layers helps for some problems to get better results? |
---|
0:52:44 | Can we build this great language model that I mentioned in the introduction, one that would be able to tell us what the capital city of some country is, maybe just starting from the recurrent networks we have? I'm not that much convinced, because there are very simple things that these models cannot learn, and that is actually an opportunity for new people, a new generation, to develop better models. |
---|
0:53:07 | One simple pattern, for example, that is very difficult to learn is memorization of a variable-length sequence of symbols: just to see a few symbols and be able to repeat them later. That is something that nobody can, in general, train recurrent networks to do. |
---|
0:53:29 | There are even simpler patterns where we don't have to memorize a sequence of symbols at all; we just need to count. |
---|
0:53:39 | We can generate very simple algorithmic patterns, sequences with some strong regularity, and see what recurrent networks can actually learn. I think people know, for example from theoretical computer science, that there are very simple languages like the a^n b^n language, where there is the same number of a symbols and b symbols. |
---|
0:54:08 | We can generate quite a few examples and train a sequential predictive model, like a recurrent network, to predict the next symbol. If it can actually count, then it should be able to predict correctly all the b symbols, though not all the symbols in the sequence, because the length of the first part is not predictable. So even this is quite challenging. |
---|
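(Editor's note: a small illustration of what such a^n b^n next-symbol-prediction data looks like; this is not code from the talk.)

```python
import random

def generate_anbn(max_n=20):
    """Return one a^n b^n string such as 'aaabbb', with n chosen at random."""
    n = random.randint(1, max_n)
    return "a" * n + "b" * n

def next_symbol_pairs(seq, eos="."):
    """Turn a sequence into (context, next symbol) pairs for a predictive model."""
    seq = seq + eos
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

for context, target in next_symbol_pairs(generate_anbn(max_n=5)):
    print(repr(context), "->", repr(target))
# Once the first 'b' appears, all remaining symbols (the rest of the b's and the
# end marker) are fully determined, so a model that can count should predict them.
```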
0:54:40 | But then we can talk about plenty of these tasks that currently cannot be done, and it can get confusing what we should focus on: should we study these artificial grammars? How is that related to the real language? And if you can solve them, will it in the end actually improve some language model? I think those are the natural questions. |
---|
0:55:03 | I think the answer is quite complicated, but what I think is that it's good to set some big goal in the beginning and then try to define a plan for how to actually accomplish that goal. |
---|
0:55:23 | So that's what we did: we wrote one paper where we discussed the ultimate goal first. We started from that, instead of trying to improve some existing setup, and we tried to define a new setup that would be more like artificial intelligence, something that people can see in science fiction, something that is really exciting; that's what we actually want to optimize, not just the objective function or the performance of some speech recognizer, something more fun. |
---|
0:55:56 | So we thought about which properties of an AI would be really useful for us, and it seems that any useful artificial intelligence would have to be able to somehow communicate with us, hopefully in some natural way. |
---|
0:56:18 | Again, if you look at science fiction movies or books, usually the artificial intelligence is either some machine, a robot, that can be controlled by talking to it, or it's some computer that we can interact with. So embodiment doesn't seem to be necessary, but there needs to be some communication channel, so that we can actually state some goal that the AI can then accomplish for us. |
---|
0:56:44 | If we can communicate with the machines, of course it will help, and maybe we could even avoid programming, because currently we program computers by writing exact instructions for what we want them to do; there is no way we can just start talking to the computer and expect it to accomplish a task for us. That framework basically doesn't exist yet. |
---|
0:57:05 | I think that in the future this will become possible, though it may take long, and I think we should start thinking about it, because I don't believe we can improve the language models much more with just some crazy recurrent architecture. |
---|
0:57:23 | So in the paper we describe a pretty minimal set of components that we think an intelligent machine has to consist of, and then some ideas that may actually be good for constructing these machines. These are the ideas as they stand now, maybe later we will improve them, and we have been discussing them at conferences. |
---|
0:57:46 | The main requirement is that the machine has to be scalable, so that it will actually be able to grow to full intelligence. The components are, as I said, the ability to communicate, and the ability to set some tasks for the machine so that it will actually do something useful, so some motivation component; again, that is something that is normally missing in predictive models like the language models and so on. |
---|
0:58:13 | And then it needs some learning skills, and it seems many current models are actually missing these. For example, long-term memory is not really part of any model we have today: neural networks represent long-term memory in the weight matrices, and those get overwritten as the network keeps getting gradients from new examples, which is basically not a good model of long-term memory. So we need to do something better there. |
---|
0:58:46 | I will go over this just quickly, because it would be a long discussion to explain why we think about all of these things in this way. We think there has to be some incremental structure in how the machine will be trained; instead of training it the way we normally train language models, it seems that it has to be trained in some incremental way, similar to how humans learn language. |
---|
0:59:16 | And for that we are thinking about some sort of simulated environment that would be used first to develop the algorithms that are still missing, and then, once we have these algorithms, to train the most basic intelligent machines, with the most basic properties that we can think of. |
---|
0:59:37 | So this is basically what we are thinking about, and we wanted to start with quite simple experiments. There are these components: the learner, which stands for the intelligent machine, that lives in this environment and can do some actions. But everything is actually very simple; we try to minimize the complexity. |
---|
1:00:01 | The learner basically receives an input signal, a sequence, and produces an output signal, which is a sequence as well, and it receives some reward, which is used to measure the performance of the learner. And then there is the teacher, which defines the goals and assigns the rewards. That's it. |
---|
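(Editor's note: a toy sketch of the interface just described; the class and method names are the editor's illustration, not the actual API of the environment discussed later.)

```python
import random

class Teacher:
    """Defines a task, feeds the learner's input channel, and assigns reward.
    This toy task just asks the learner to echo the previous input symbol."""
    def __init__(self):
        self.expected = "a"

    def step(self, learner_output):
        reward = 1 if learner_output == self.expected else 0
        self.expected = random.choice("abc")      # next symbol the learner should echo
        return self.expected, reward


class Learner:
    """The intelligent machine: it only sees the input stream and the rewards,
    and must figure out what to output to maximize the average incoming reward."""
    def next(self, input_symbol, reward):
        return input_symbol                       # a hard-coded policy for the echo task


def run(learner, teacher, steps=10):
    out, total = " ", 0
    for _ in range(steps):
        inp, reward = teacher.step(out)
        out = learner.next(inp, reward)
        total += reward
    return total

print(run(Learner(), Teacher()))
```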
1:00:25 | So this is the description of the environment that you can see here. Of course we want this to be scalable: later, once we have a learner that can learn these very simple patterns, the expectation is that the teacher would be replaced by humans, so humans would directly be teaching the machine and assigning the rewards. |
---|
1:00:48 | And once the machine gets to some sufficient level, the expectation is that we can start using it for doing something actually useful for us. |
---|
1:01:00 | So the communication is really the core: the learner just has this input channel and the output channel, and all it has to do is figure out what it should be outputting at a given time, given the inputs, to maximize the average incoming reward. It seems quite simple, but of course it is not. |
---|
1:01:23 | This is a graphical representation, just to make it more obvious what we are aiming to do. There is the input channel and the output channel; the task specification given by the teacher is something like "find the apple", and then the learner, if we assume it has already learned to do this task, says to the environment "I move", and that's how it accomplishes that action. |
---|
1:01:47 | So we don't need a predefined set of all possible actions: the learner can do anything it is allowed to do just by saying it. If it wants to go forward, or if it wants to turn left, it can just say it, and at the end of the task it gets the reward, for example for finding that apple. |
---|
1:02:12 | We think that learning quickly will be completely crucial here, and that's the same point as the incrementality of the learning: when the tasks are getting more and more complex in some incremental way, the learner should be able to learn each new one from a few examples at most, not by brute-forcing the search space. |
---|
1:02:36 | The algorithms that we have at the moment would basically break on this type of problem; as I said before, it still seems hopeless for now, because we don't have algorithms that are able to deal with even these basic problems. |
---|
1:03:00 | And then, if we have these intelligent machines that can work with the input and output channels, of course we can add the real world as basically an additional channel that the machine can control: for example, it can send queries through the output channel to the internet and receive the results on the input channel. So the framework is very simple, but it seems to be sufficient for intelligent machines. |
---|
1:03:28 | And here is a list of things that seem very simple to learn, but you can't really do it with recurrent networks, even when you add long short-term memory units to the recurrence, or all kinds of crazy things. They are very simple patterns, but they are very challenging to learn, even when we have supervision about what the next symbol should be; and if you try to learn these things just from rewards, it's even worse. |
---|
1:04:02 | So these are the things that we believe, especially the last two, are basically all open research problems, and maybe they even have to be addressed together. It's quite challenging, but I think it's good for people who are trying to start their own research to think about challenging problems. |
---|
1:04:29 | As a small step forward, we published a paper showing that recurrent networks can actually learn some of these algorithmic patterns when we extend them with a memory structure that the recurrent network learns to control. That actually addresses several of the problems I mentioned before. |
---|
1:04:58 | If this memory is unbounded in size, like a stack for example, then suddenly the model can, at least theoretically, be Turing-complete, so in principle it should be able to learn a representation of any algorithm, which seems to be necessary; we as humans can do it. |
---|
1:05:19 | It also addresses, or could address, the problems I mentioned before with neural networks that keep changing their weight matrices all the time and therefore forget things: if you had a controlled way to grow these memory structures, that could be a way to actually represent long-term memory better. |
---|
1:05:40 | But as I said, this is just a first step forward. Of course, we found out that there had already been earlier work on this, the first papers with this idea were published back in the nineties, but what we found is that our solution is again simpler and works better than what people had published before. So this is how it looks. |
---|
1:06:04 | There's not much added complexity: basically, the hidden layer decides at each step which action to do, by producing a probability distribution over the actions it can take. It can either push some value on top of the stack, pop the value from the top of the stack, or decide to do nothing with the stack; and of course there can be multiple stacks that the network controls. |
---|
1:06:31 | If it wants to write some specific value, that again depends on the state of the hidden layer. And the fun thing is that it can actually be trained with plain stochastic gradient descent, so we don't need to do anything crazy, and it seems to be working for at least some of the simpler sequences. |
---|
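(Editor's note: a rough reconstruction of the soft stack update being described; the weight names are placeholders, and this is a sketch of the general idea rather than the exact published model. The three action probabilities mix the outcomes of push, pop and no-op, which keeps everything differentiable so plain SGD applies.)

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def stack_step(h, stack, W_action, W_value):
    """One differentiable stack update controlled by the hidden state h.

    stack: 1-D array with stack[0] as the top.  Returns the new stack contents."""
    a = softmax(W_action @ h)                       # [p_push, p_pop, p_noop]
    v = 1.0 / (1.0 + np.exp(-(W_value @ h)))        # value to push, squashed to (0, 1)

    pushed = np.concatenate(([v], stack[:-1]))      # stack if we pushed v on top
    popped = np.concatenate((stack[1:], [0.0]))     # stack if we removed the top
    return a[0] * pushed + a[1] * popped + a[2] * stack
```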
1:06:54 | Like here: the bold characters are the predictable, deterministic ones, and we could solve basically all of these problems, so that was quite interesting. |
---|
1:07:14 | And of course the plain recurrent networks cannot do it. The funny thing is that the LSTM models, which were actually originally developed to address exactly these problems, can do it, because they can count thanks to their linear component; so that's sort of cheating, because the models were developed for this particular reason. |
---|
1:07:37 | Of course, we can show that the LSTMs break as well if we just scale the complexity up a bit: instead of requiring the model to count, we can require it to start memorizing sequences. As I said before, we just show a bunch of characters of variable length that have to be repeated, and that already breaks the LSTMs. |
---|
1:07:59 | For the people who don't know them, LSTMs are a modification or extension of the basic recurrent network obtained by adding these linear units with gating connections, a somewhat complicated architecture whose point is to get some more stable memory into the recurrent network, so that gradients propagate more smoothly across time. |
---|
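(Editor's note: for readers who don't know them, the "linear units with gating connections" are, in the standard LSTM formulation,)

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), &\quad f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f),\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), &\quad \tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, &\quad h_t &= o_t \odot \tanh(c_t),
\end{aligned}
$$

(where the additive update of the cell state $c_t$ is the near-linear path that lets information, and simple counters, survive across many time steps.)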
1:08:21 | So we could solve the memorization, but then of course one can say that the stacks were kind of developed for exactly this kind of regularity. So a more interesting test for our model was binary addition, which is quite a bit more complicated, and interestingly it also did quite well there. |
---|
1:08:45 | Here we are showing these examples, which are binary inputs: the addition of two binary numbers together with the result, and the network learns to predict the next symbol in the string, so it's like a language model. And it turned out that it actually learned to operate the stacks in quite a complicated way to solve this problem. |
---|
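(Editor's note: a small illustration of what such binary-addition language-modeling strings can look like; the exact format used in the experiments may differ. As noted just below, presenting the numbers in reverse order, least significant bit first, makes the task much easier for a stack-based model.)

```python
import random

def binary_addition_example(max_bits=8, reverse=True):
    """Generate one 'a+b=c' training string over the symbols {0, 1, +, =, .}."""
    a = random.randint(0, 2 ** max_bits - 1)
    b = random.randint(0, 2 ** max_bits - 1)
    to_bits = (lambda n: format(n, "b")[::-1]) if reverse else (lambda n: format(n, "b"))
    return f"{to_bits(a)}+{to_bits(b)}={to_bits(a + b)}."

for _ in range(3):
    print(binary_addition_example())
# A sequence model trained on such strings only needs to predict the part after '=';
# everything before it is unpredictable input, exactly as in a language model.
```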
1:09:09 | It actually stores the first number on, I think, two stacks, so there is some redundancy; actually I think three of them hold some of the previous information. Then it reads the second number, and then it is able to correctly produce the addition of these two numbers. |
---|
1:09:31 | I think it's quite a funny example. Of course, there were some hacks that had to be used to help the model: because the stacks push values on top, it's actually much easier to do the memorization of the strings in reverse order, and the same holds for the binary addition. |
---|
1:09:56 | So I wouldn't say that we can actually learn general algorithmic patterns with this model. Of course, we could do better if we didn't use just the stacks but, for example, tapes or other additional memory structures with all kinds of topologies and so on. |
---|
1:10:15 | But picking the solution together with the task doesn't seem great, so I would refer back to the direction of the paper I mentioned: try to define the tasks first, before thinking about the solution. |
---|
1:10:26 | before think about the solution but in any case we could show that the |
---|
1:10:30 | that we can learn a interesting car |
---|
1:10:33 | uh in there's think a complex patterns are |
---|
1:10:37 | that the normal recurrent networks couldn't lower |
---|
1:10:42 | and the model is turing complete the say set and has some sort of long |
---|
1:10:45 | the memory |
---|
1:10:47 | but it's not the long-term limited |
---|
1:10:49 | like to have |
---|
1:10:50 | you does not to the properties that we |
---|
1:10:53 | uh we you want |
---|
1:10:55 | so there is that and a lot of uh things that should be tried |
---|
1:10:58 | and to |
---|
1:11:00 | let's see what to |
---|
1:11:02 | well happen in the future |
---|
1:11:07 | So, as a conclusion of this last part of the talk: achieving artificial intelligence was my motivation when I started my PhD, and so far I have failed to do it, but at least there were these side products that turned out to be useful. I think we first need to think a lot about the goal; I just have the feeling that people are often working hard on the wrong tasks, the tasks are too small and too isolated, and I think it's time to think about something bigger. |
---|
1:11:38 | And there will be a lot of new ideas needed to define a framework in which we can develop this AI, the same way the framework in which the first speech recognizers were built took quite a few years just to define how to measure the word error rates and so on, and how to annotate the datasets. |
---|
1:12:02 | I think that for AI we basically need to rethink some of the basic concepts that we take for granted now and that are probably wrong, for example the central role of supervised learning in machine learning techniques. I think that has to be revisited, and we have to develop techniques that are much more unsupervised and built on somewhat different principles. |
---|
1:12:27 | And of course, one of the goals of this talk is to motivate more people to think about these problems, because I think that is how we can make progress. That was the last slide, so thanks for your attention. |
---|
1:12:45 | Now there is space for questions. |
---|
1:13:24 | So my question is: how do you properly define intelligence, not artificial intelligence, just intelligence? And the second question, which ties into the first one: we know that the Turing machine is limited, it cannot solve everything, so do you believe that intelligence, however you define it, is achievable with your Turing-complete machine? |
---|
1:13:48 | Well, I'm not sure the two questions are actually related; for me they are like two separate questions. First, for the definition of intelligence: there are actually many opinions on this, and I would say that pretty much every researcher defines intelligence in a different way. |
---|
1:14:10 | The most general definition that I can think of, maybe too philosophical, is basically that there are patterns that exist in the universe that could be thought of as intelligent. We can say that life is basically just some organization of matter that tends to preserve its form, through evolution and everything. |
---|
1:14:39 | This goes back to old ideas, for example that the universe can be seen as one big cellular automaton and everything we observe is just a consequence of that; then you can see life as just a pattern that exists in this structure, and intelligence is just a mechanism that this pattern developed in order to preserve itself. |
---|
1:15:07 | For the second question, you said that Turing machines are limited; I'm not so sure in what sense, maybe you mean that normal computers are not Turing machines in the strict sense, so I don't know exactly which problems you mean that cannot be solved. |
---|
1:15:28 | I was talking more about Turing completeness in the sense that the Turing machine is basically this concept where there is a finite description of all the patterns in this computational model. If you take a weaker model, like finite state machines, then you know that for some algorithms there simply does not exist a finite description; for example, you cannot count if you limit yourself to finite state machines. |
---|
1:15:59 | In the context of recurrent networks I think it gets even more confusing, because there have been papers claiming that recurrent networks are Turing-complete, and one can then draw the conclusion, which I actually heard for example from a speech researcher, that because recurrent networks are Turing-complete, they are just fine and they should generalize to all these things that I was showing. |
---|
1:16:24 | What I want to say is that when we try to train a normal recurrent network with SGD, it doesn't learn even counting, and it doesn't even learn plain sequence memorization. So that's one thing, what is learnable, and that is actually quite different from what can be represented. |
---|
1:16:44 | And if we take the argument of all these people strictly, then I would say that the recurrent networks as we have them now, including LSTMs, are not Turing-complete, because the proofs of their Turing completeness assume that there is infinity hidden somewhere in the model, usually in the precision of the values stored in the neurons. |
---|
1:17:08 | That does not hold for the neural networks we are using now: we use something like thirty-two-bit precision. You can theoretically argue that you can store an infinite amount of information in one value; it's the same argument as saying that you can save the whole universe in a single number using arithmetic coding. Sure you can, but do you actually want that representation in a neural network, where one value stores everything and has to be decoded at every time step? |
---|
1:17:40 | If you want to model it that way, it makes sense to say that recurrent networks are Turing-complete, but in my view, strictly speaking, they are not; maybe some other versions are, but they're just not practical. Of course, Turing machines are also not a very practical model; I'm talking here about Turing completeness, not about practicality. |
---|
1:18:02 | I see that you're thinking a lot about AI creation. There is actually a huge discussion right now in the field about achieving the singularity, and about what happens when you create an AI with such traits that gets connected to the internet. Do you share any of their concerns about a rogue AI, or a superintelligent AI, which would basically do something silly? |
---|
1:18:39 | Well, I have different views on this. I think that thinking about superintelligence and the singularity now is a little like, I don't know what I would relate it to, maybe like when scientists first got nuclear power and were afraid that it would just blow up the whole world through some chain reaction. It's basically just technology, and we should be aware of it. |
---|
1:19:08 | And the same goes for the state of the research: as I keep saying, if you don't want to be unfair to yourself, it is clear that we cannot teach AI even many very simple things, so talking about the singularity seems premature to me. |
---|
1:19:27 | Of course, people have the argument that the gap between having something that doesn't work at all and suddenly having some intelligence that can improve itself doesn't have to be that big, and that maybe we will achieve these machines sooner than we expect, even if some people are sceptical that it can ever be done. |
---|
1:19:47 | But if I take this argument, then I would say it depends on how we construct these machines. In the framework I was describing, we are supposed to make machines that try to accomplish some goals, and as long as we are able to define the goals for the machines, then for me the machine is basically something that extends your own abilities. |
---|
1:20:14 | If you are sitting in a car, you are able to move much faster than using your own legs, because the car is a physical tool for you; the car just does what you want it to do, because you decide where it should go. It can knock people over, it can kill someone, but then the driver is responsible. |
---|
1:20:32 | So I think the AI, even if it were to be very clever, as long as its whole purpose is just to accomplish the goals for the human who specifies those goals, is basically an extension of our mental capabilities, the same way cars extend our ability to move. |
---|
1:20:56 | Well, it was just that in your slides there was this step where you let the learner learn by itself, ask its own questions, and that is the tricky part, because at that point you are no longer part of the loop. |
---|
1:21:18 | Which slide was that? The one about the AI being connected to the internet? Oh yeah, I see. |
---|
1:21:35 | OK, I don't remember exactly which one, but maybe I understand you correctly: the last step was to let the learner learn by itself from other sources, which means it is no longer under control. |
---|
1:21:49 | Well, sure, that's a good question: given that the learner will learn from other sources, how far can it drift from the external reward that we assign at the top? |
---|
1:22:06 | You can actually make the same argument about people: they are also born with some kind of internal reward mechanism that was hardwired, maybe by evolution, so for example if you eat sugar you feel happy, or whatever, because these are hardcoded things. And that still doesn't prevent people from behaving quite differently once they become adults, because they can, for example, just decide to stop eating sugar and simply not follow the external rewards, or the hardcoded rewards that are in the brain stem. |
---|
1:22:42 | So it's more a question of whether the AI would become so independent that it would have some sort of free will, and you can of course imagine that this could turn into something bad. But if you think about AI not as a single entity but as many of them, many of them working with us, then my vision is basically that it extends our own abilities; and it's the same as saying that pretty much any piece of technology can be used for good and for bad purposes. That's where it belongs. |
---|
1:23:23 | Are there any other questions? |
---|
1:23:24 | I was wondering whether it would be better to have more local, targeted updates: something that would work inside the network and actually change just some subset of the weights, rather than propagating the information through the whole model, something more unsupervised. Is anyone using something like that these days? |
---|
1:24:05 | Hmm, I think I have seen something, but I won't be able to give you references because I don't remember them right now, and I myself wasn't able to find a fulfilling answer there, because I think those approaches are limited. |
---|
1:24:24 | I guess the property that we should be able to get into our models, whether they are neural networks or something else, is the ability to grow in complexity, and that's something that normal neural networks don't have. Once you start seeing the network as having some sort of memory mechanism, the ability to extend its memory structure, then the topology allows you to update not all the parameters but just some subset. That's what I was thinking of, but of course that doesn't mean it's the solution; maybe it will come from something else. |
---|
1:25:02 | I just think that even if you do something that will again do local updates, I would be a bit worried about the model itself not being unlimited in the computational sense. Of course, the human brain is also finite, with a finite number of neurons, but at a given time maybe only some neurons are firing. |
---|
1:25:23 | And the final argument from me would be that as a human you can actually navigate in a topological environment: the environment around you is three-dimensional, it has a topology, and if you want to, you can use a piece of paper and so on, so you can actually extend your memory as long as you are operating in the environment. The environment works as the tape in a Turing machine, and then you can see the whole system as Turing-complete. |
---|
1:25:47 | So if the model actually starts living in the environment, I think it gets much more capable; it can also change the environment, and that becomes much more interesting than if you just have a neural network sitting in one place, observing input vectors and producing output vectors, without being able to control anything that has a topology. |
---|
1:26:10 | For example, when I was talking about the stacks: a stack can be seen as a one-dimensional environment that the stack RNN lives in and can operate on; then you can have a 2D environment, which is basically just more dimensions but kind of the same thing, and the 3D one is just the real world. If you are able to influence the state of the world, I think you won't be as limited. That's kind of my understanding of this. |
---|
1:26:48 | Does the research agenda OpenAI is pursuing have any overlap with the framework that you have suggested? |
---|
1:26:55 | OpenAI? Yeah. |
---|
1:26:57 | Yeah, those are the guys in California; they published what is called, I think, OpenAI Universe a month or so ago. So there is some overlap in the goals, in the sense that they also try to define, I think, a thousand tasks or something of that sort, and they are trying to make machines that can learn from this collection of tasks in general; I guess it's some sort of machine that can work across a range of tasks, not a single one. |
---|
1:27:43 | But it's actually quite crucially different from what I was describing, because there is a difference between this and incremental or gradual learning, I think there are several other names for it, where you assume that a machine that has seen some tasks and learned them, when you then try to teach it task n plus one, should be able to learn it faster if the new task is related to the old ones; and you can actually measure this, because you can construct the tasks yourself, artificially, and make them related. |
---|
1:28:14 | From what I have seen so far, and I'm not an expert on what they are doing, maybe they are still changing direction, I think they are trying just to solve a bunch of tasks together, which is multitask learning, and that's a different thing; it's actually compatible with the neural networks we already have, which don't have to be modified much to approach these problems. But they try to do it with reinforcement learning, which again is quite challenging, because you don't say in a supervised way what the model should be doing; you just give rewards for the correct behavior. |
---|
1:28:50 | So that part of what they are trying to do is somewhat related to what I was describing. I don't think multitask learning is the big problem, because that actually just works fine: you can have a single neural network that recognizes speech and does image classification and language modeling at the same time, because you represent all these things at the input layer, the network is quite large, and everything just gets encoded in different parts of the network. |
---|
1:29:17 | I think their hope is that the tasks will actually start boosting each other's performance: if you train this network to do all these things together, then it will somehow share the abilities. So let's see what they come up with. |
---|
1:29:33 | From my point of view, I think it's better to try to isolate the biggest problems and try to solve those; I was, for example, giving preference to decomposing things into core subproblems and going to the simplest things that you cannot represent even with one hidden layer and a very simple setup. |
---|
1:29:53 | And from my point of view, if we try to analyze what is going wrong with the current algorithms by taking a huge dataset of, say, a thousand different problems, training some model on top of it, and then making some claims about whether it works or doesn't work and what went wrong, I think that analysis will be very hard. It will be amazing for PR videos, which of course is one of the main things they are after, but beyond that, I don't know. |
---|
1:30:33 | So don't you think that multitask training is actually crucial in all of this? It can cover a lot of things, and the system can learn what not to do instead of just learning what to do. |
---|
1:30:46 | Well, as for multitask learning... |
---|
1:30:49 | I'm not saying it's a crucial problem, or that it's a problem at all; I'm just saying it's part of real life: you never learn just one thing, you always observe many things at once, and if you want to take inspiration from real life... |
---|
1:31:06 | Sure, I mean, that's completely fine. For example, when I was describing this framework with the learner and the teacher, the point is that the teacher will be interested in teaching the learner, and the tasks can be defined the way people do when they work on multiple tasks and assign a set of them; we have that there as well. |
---|
1:31:29 | But it is different whether you assume that you train the model on all the tasks together and then measure the performance on those same tasks, or whether you train the model on some tasks and then try to teach it quickly on different tasks. That is actually what I think is much more challenging, and what I think we should focus on, because you won't get there if you just fool yourself by training on a mix of tasks and then showing that the model does well, maybe simply because the task was in the training set; then you kind of miss the point. |
---|
1:32:01 | So of course having the ability to work on multiple tasks at once is part of the problem, but on top of that we want the machine to learn new tasks quickly. |
---|
1:32:13 | You've mentioned steps to be taken toward creating an environment for AI. Do you know what the state of the art is in using anything built on these principles? |
---|
1:32:29 | Has anybody established such an environment? Yes: we published a simple environment last year, and we also presented it at NIPS. It's on GitHub, and it's called "communication-based artificial intelligence environment". |
---|
1:32:52 | I think the short name is CommAI-env, which is a pretty silly shortcut that nobody likes, but we ended up with this one; the story behind it would be longer. So that is the environment we published. |
---|
1:33:12 | When it comes to others: well, there was this discussion about OpenAI Universe, which is one such thing, and I think DeepMind published at the same conference something called DeepMind Lab, for 3D exploration, where you can basically play games in 3D environments and learn how to navigate just by observing pixels. |
---|
1:33:36 | That kind of environment is quite different, because again the focus is on single tasks, which differs from this focus on the incrementality of the learning. So I'm not sure anyone else has something directly comparable to what we have; but then, there are so many researchers that you never know about. |
---|
1:33:56 | Well, that's encouraging for the rest of us. |
---|
1:34:16 | do you think we have enough data for training and building language models, and now
---|
1:34:22 | we should focus only on algorithms,
---|
1:34:24 | or should we
---|
1:34:25 | also keep growing the data sources, including non-textual data?
---|
1:34:32 | well, of course, the more data you have, the better the models you can
---|
1:34:36 | build, and I would say
---|
1:34:38 | there's never enough data
---|
1:34:40 | so
---|
1:34:41 | if you try to improve all these tasks that I mentioned in
---|
1:34:44 | the first part of the talk, like speech recognition, machine translation
---|
1:34:48 | and spam detection or whatever,
---|
1:34:50 | then sure, more data will be good
---|
1:34:54 | and
---|
1:34:55 | the amount of written
---|
1:34:57 | text data on the web is increasing all the time, so
---|
1:35:00 | I think that in the future we will have even bigger models trained
---|
1:35:04 | on even more data,
---|
1:35:05 | and the accuracies of these models will be higher and the perplexities lower;
---|
1:35:09 | things will only get a bit better; there's this
---|
1:35:13 | argument, I think going back to Shannon, on the question of whether
---|
1:35:17 | these models are
---|
1:35:19 | actually able to capture all the regularities in the language just because the
---|
1:35:25 | amount of data you have goes to infinity and the n goes to infinity as
---|
1:35:28 | well,
---|
1:35:29 | which is nice, and which basically says that the more data you have, the
---|
1:35:32 | better you will do, but the gains just keep getting smaller and smaller,
---|
1:35:36 | and I don't think this is the way to
---|
1:35:40 | get to AI, because even if you had billions
---|
1:35:43 | of times more data than you have now, then sure, you would get maybe a
---|
1:35:46 | two-point improvement in machine translation, and that's fine, or maybe one or two percent
---|
1:35:52 | lower word error rate in speech recognition,
---|
1:35:54 | but there are diminishing gains, diminishing returns, so at some point it would just
---|
1:36:00 | not be worth doing
---|
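A quick worked illustration of this diminishing-returns point, assuming (purely for the sake of the sketch, with made-up constants) that held-out loss falls roughly as a power law in the amount of training data:

```python
# Back-of-the-envelope illustration (assumed constants, not measurements):
# if held-out loss falls roughly as a power law in training-set size,
#     loss(N) = a * N**(-alpha) + c,
# then every extra 10x of data still helps, but by less each time.
a, alpha, c = 6.0, 0.1, 1.0   # purely illustrative constants

def loss(n_tokens: float) -> float:
    return a * n_tokens ** (-alpha) + c

prev = None
for n in [1e9, 1e10, 1e11, 1e12]:
    cur = loss(n)
    gain = "" if prev is None else f"  (improvement: {prev - cur:.3f})"
    print(f"{n:.0e} tokens -> loss {cur:.3f}{gain}")
    prev = cur
# The absolute improvement shrinks with every additional 10x of data,
# which is the diminishing-returns argument made above.
```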
1:36:03 | of course, then there's this thing about
---|
1:36:05 | getting more data in domains where there is actually only a very small amount of
---|
1:36:09 | data today;
---|
1:36:12 | there, of course, you can expect big gains in the accuracies later, so
---|
1:36:16 | for example, for English language models
---|
1:36:18 | I think that, well,
---|
1:36:20 | it's really just
---|
1:36:21 | about maximizing the size of the model these days
---|
1:36:25 | and minimizing perplexity given how complex the training data is, whereas
---|
1:36:29 | for different languages there can be some more fun; on
---|
1:36:33 | that side maybe I would have more hope,
---|
1:36:37 | because there's less data,
---|
1:36:40 | so yeah, maybe for the Czech language or
---|
1:36:44 | so on
---|
1:36:45 | there is something to be done; all these morphologically rich languages are interesting
---|
1:36:50 | for various reasons
---|
1:36:51 | so yeah,
---|
1:36:52 | the answer is basically yes, more data is good,
---|
1:36:56 | but
---|
1:36:57 | if you want to get to AI, then I don't think more data alone will get
---|
1:37:00 | us there
---|