0:00:09 Welcome to the next edition of PGS IT — the invited talks on video, graphics and speech. This series is usually run by the graphics people, but today I am happy to invite a very good speech and NLP guy, Tomáš Mikolov. He actually started at this Faculty of IT in 2002. Then, in 2006–2007, he was working on a diploma project on language modeling for Czech — maybe he still remembers something of it. He started his PhD in 2007, on language modeling, and to be frank we did not have much language modeling expertise here, so we kept sending him abroad: he spent considerable time at Johns Hopkins University with Sanjeev Khudanpur and at the University of Montreal with Yoshua Bengio.
0:01:03 He had a very influential paper at Interspeech 2010. That was basically a room like this one, full of senior language modeling people, and Tomáš basically came up and said that his language model works the best. Well, they were smiling — but it worked the best. He eventually defended his PhD in 2012, was immediately hired by Google Brain, and moved to Facebook AI Research in 2014, where he is now a research scientist. He will still be here tomorrow — Tomáš, it's yours for now, and thank you for coming.
0:01:55 Okay, I guess this is fine. Thanks for the introduction. My talk will be a mixture of a lot of small things — I was basically asked to talk about everything — so let's say the topic will be neural networks in NLP.
0:02:24 So, for the introduction: NLP is an important topic for many companies nowadays — Google, Facebook, all these companies that deal with huge text datasets coming either from the web or from the users. You can imagine how much text the users send to Facebook every day, and of course these companies want to do something useful with that text. There is a list of some important applications here, but there are many others — just detecting spam is something important, because users do not want to see spam when they are using these services. So being able to deal with text is basically part of the core business of these companies.
0:03:16 For that, I will be talking about a lot of basic things in the beginning, and then about their extensions using neural networks. The first part will be about unsupervised learning of word representations — the word2vec project — which I think is a very nice, simple introduction. Then supervised text classification: I will not talk about it much; the fastText library we published last year at Facebook extends the word vectors to supervised classification, and again it is quite successful because it is very scalable. Then the recurrent network language models — as was mentioned in the introduction, that is something very common at conferences nowadays. The last part of the talk will be about what we can do next, maybe in the future. Maybe some people here will get started on it: it is not easy to do something better than what we have now, but I think that would be a great goal, and we are trying to do it ourselves at Facebook.
0:04:26 Of course, the companies are very interested in getting better performance. One can focus on incremental improvements, by just taking what exists and trying to make it bigger or tune it, but I will talk about some high-level goals that we are thinking of right now — how to build machines that are really smart. I will not show any solution, because we do not have one, but I think it is good to at least mention the problems we are facing.
0:05:03 So I will start with very basic concepts, because it seems that some people here do not have a big background in machine learning. I will start with basic models of sequences and basic representations of text, and then I will show that neural networks can basically extend and improve all of these representations and models. The artificial neural networks can be seen as a unified framework that is, in some sense, simple to understand — once you know the older concepts, which we need anyway in order to define the input features.
0:05:48 So, the n-grams: that is the standard approach to language modeling, and it is a core technology in many important applications — speech recognizers, machine translation systems — that need to be able to output text. For that you use a statistical model of the language, which is basically what is written on the last line: some sentences are more likely than others. For example, the sentence "this is a sentence" is really going to have a higher probability than the sequence of words "sentence a is this", because that one does not make much sense — and even that should have a higher probability of occurring in English than some random string of characters.
0:06:38 The n-grams are usually estimated from counts, so it is very simple, but look at the first equation and just think about what the probability of a sentence is — it is a very broad concept. If we were able to estimate this probability very well, the model behind it should be able to understand the language — it would actually have to understand the language. For example, I can write an equation here saying that the probability of the sentence "Paris is the capital city of France" should be higher than the probability of "Berlin is the capital city of France", because the second sentence is incorrect. The models we have now can do this a little bit, I would say, but not in a general sense; I will try to get to that at the end of the talk — what the limitations of our best language models are. But just to get the motivation: language modeling is quite interesting, there are a lot of open problems, and if we were able to solve them very well, it would be quite interesting for artificial intelligence research.
0:07:49 And here is how it looks with the techniques that used to be state-of-the-art ten years ago, which were based on n-grams. They are scalable, meaning we can train — estimate — this kind of model from a large corpus very quickly; it is trivial. If you want to compute the probability of a sentence, you just compute the probability of every word: you take some training corpus, count how many times the word appears, and divide by the total word count — that gives you its probability. And then you just multiply the probability of each word given its context. There are some advanced things on top of it, like smoothing and back-off, but this is basically the technique that was state-of-the-art in statistical language modeling for something like thirty years. It looks very simple, but it took people a lot of effort to overcome it convincingly across corpora and domains — and as I will describe later, the winner would be the recurrent networks.
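To make the counting concrete, here is a minimal toy sketch of a bigram model estimated from raw counts (an illustration added here for clarity, not code from the talk; the tiny corpus and function names are invented). Unseen word pairs get zero probability, which is exactly the problem that smoothing and back-off address:

```python
from collections import Counter

def train_bigram_counts(corpus):
    """Count unigrams and bigrams from a list of sentences (plain strings)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    return unigrams, bigrams

def sentence_probability(sentence, unigrams, bigrams):
    """Multiply P(w_t | w_{t-1}) = count(w_{t-1}, w_t) / count(w_{t-1}) over the sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(words[:-1], words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]   # zero if the bigram was never seen
    return p

unigrams, bigrams = train_bigram_counts(["this is a sentence", "this is another sentence"])
print(sentence_probability("this is a sentence", unigrams, bigrams))   # plausible word order
print(sentence_probability("sentence a is this", unigrams, bigrams))   # zero: unseen bigrams
```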
0:08:54 Then, for the basic representations of text: one-hot encoding — the 1-of-N representation — is something very basic that people should know about. Usually, when we want to represent some text, especially in English, we first compute a vocabulary and then represent each word basically as a separate ID. This has some advantages and some disadvantages: it is very simple and easy to understand; the disadvantage is that, as you can see, "Monday" and "Tuesday" have completely orthogonal representations — there is no sharing of parameters — and it is up to the model using these one-hot representations to figure out that the words are related, so that it can generalize better. These are the basic representations, and I will show later that neural networks can give us better, richer vectors, which actually bring nice improvements in many applications.
0:09:59 Bag-of-words representations are then just sums of these one-hot encodings, when we want to represent something longer than a word. For example, if we had this small vocabulary and wanted to represent the sentence "today is a Monday", we would basically take the counts of the words, so the word order is lost — there is nothing special about it. This representation can still be improved by considering the local context, by using bags of bigrams, and even if it may seem surprising, we will see that for many applications — really, most applications nowadays — this is all that is required, even though it is a very simple representation. So that is maybe the challenge for the future.
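As a small illustration of these two representations (a toy sketch added for clarity; the tiny vocabulary is invented for the example):

```python
import numpy as np

vocab = ["today", "is", "a", "monday", "tuesday"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """1-of-N encoding: a vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

def bag_of_words(sentence):
    """Sum of the one-hot vectors: word counts, with word order thrown away."""
    return sum(one_hot(w) for w in sentence.lower().split())

print(one_hot("monday"))                   # [0. 0. 0. 1. 0.]
print(bag_of_words("today is a monday"))   # [1. 1. 1. 1. 0.]
```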
0:10:48 Another important concept is word classes. As I said, many words are related to each other in some way, and one way to exploit this is to define some set of classes: for example Italy, Germany, France, Spain — all these words denote names of countries in Europe, and maybe we can just group them together and call that a class. This is one of the most successful NLP concepts in practice. It was introduced, I think, in the early nineties — one particular paper that I find very nice is from Peter Brown, "Class-based n-gram models of natural language". The classes are computed automatically, again from some training corpus, and the main idea behind it is that words that share contexts — that appear in similar contexts — should belong to the same class. Once you have these classes, we can improve the representation I was showing before, because we can represent each word as its one-hot representation plus a one-hot representation of its class, and then there will be some generalization in the system that is trained on top of this representation.
0:12:13 That was more of a historical overview. There are several other important concepts that people should know about, which are basically the stepping stones to understanding neural networks. The most frequent ones are probably unsupervised dimensionality reduction using principal component analysis and unsupervised clustering with k-means — these algorithms are quite important — and then supervised classification, especially logistic regression, which is very important. I will not describe them in detail, because otherwise I would not finish.
0:12:53 So now I will jump to a quick introduction to neural networks. Again, it will be just a quick overview, so that people can get some idea of what the neural networks actually are. I will try to describe the basic algorithms that people are using all the time, and then I will also try to give a short explanation of what deep learning means, because that term is becoming very popular now and it would be good to know what it is about.
0:13:27 When it comes to neural networks in natural language processing, the motivation is simply to come up with better, more precise techniques than what I was showing before — something better than the bag of words, something better than just the n-grams. How can that be done, and why would it be good? If we can come up with some better representation, then we can get slightly better performance, and that is important for many people: it is important for the companies, because they want to be the best, and it is important for researchers, because they want to publish the most interesting papers and win all kinds of competitions. So it is basically important for everyone to develop the best techniques.
0:14:19uh your own basically looks like uh
0:14:22a is like a mathematical or graphical representation of the of the
0:14:26model it's the simple mathematical model the
0:14:29uh the
0:14:31function that the people didn't really uh the your own so
0:14:35uh the biological neurons and but it's very simplified so i would uh warm about
0:14:40uh yeah giving some parallels between the
0:14:43artificial
0:14:45neurons and the and the biological on your own since it is likely to really
0:14:48about it is very different thing
0:14:50so uh the are concerned you and your own looks like yeah basically they are
0:14:54um
0:14:55uh incoming signals that are coming cut
0:14:58uh to be in your own it's called sign at this uh the time from
0:15:02the biology but uh basically just needed some errors that are something you know
0:15:07uh to be in your own
0:15:08uh it's coming from some other neurons are
0:15:11and uh these signals are multiplied by the
0:15:15by the way that each yeah
0:15:17each year this input arrow results today with one of a small number
0:15:21uh the basic of the weight that multiplies the incoming signal
0:15:25so we had three incoming numbers that
0:15:28and uh they really get a sense together in the uh in your honour
0:15:33after which uh there is the application of the activation function of each yeah um
0:15:38needs to be known in europe you want a proper you wanna or
0:15:42and the simplest one is probably the
0:15:44uh so called the rectified the
0:15:46a linear activation function which is basically just taking my between zero and evaluated that
0:15:51compute so that all the volume that are below zero will basically get a translate
0:15:56the zero
0:15:57and uh
0:15:58this value that we compute is weights a
0:16:00is the output of the your honour in the given find that the and the
0:16:05and uh this uh this output can be connected actually too many pattern your own
0:16:09so it does not be connected
0:16:10one
0:16:12but it's a single number uh goes out of the single in your own
0:16:16and here the creation
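The equation corresponds to something like the following sketch (toy numbers invented for illustration; a bias term is included even though it is not discussed here):

```python
import numpy as np

def neuron(inputs, weights, bias=0.0):
    """A single artificial neuron: weighted sum of the inputs followed by ReLU."""
    pre_activation = np.dot(weights, inputs) + bias
    return max(0.0, pre_activation)          # rectified linear activation

x = np.array([0.5, -1.2, 2.0])               # three incoming signals
w = np.array([0.8, 0.1, 0.4])                # one weight per incoming connection
print(neuron(x, w))                          # a single number, sent on to other neurons
```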
0:16:19 The biological neurons, although they are also connected to other neurons, have so many differences that it does not even make sense to start comparing the two. The artificial neural networks were somewhat inspired by the biological neurons in the beginning, but they are a different thing now. Maybe the name is misleading: people start working with these techniques and start believing that maybe they can just solve artificial intelligence, because after all the model has "neurons" in it — well, that is the logic I sometimes hear, and I think it is really misleading; it is part of the marketing, so just do not take it too seriously. If the name of these artificial neural networks were "non-linear data projections", I think it would maybe be better — but then nobody would use it, because it would not sound interesting, right?
0:17:23 This is the representation of a whole network, when we have multiple of these neurons. Usually there is some structure; this is the typical feed-forward structure, where we have some input layer, which is made of features — it can be the bag-of-words features or the one-hot encoding I was talking about before. So these are the features, and you specify them somehow. Then there is the hidden layer, where the neurons compute their values, and then there is the output layer — again it is the application of the same equations, so nothing special there. The output layer is usually whatever you want the network to be doing — say classification: for example, the input layer is some encoding of a sentence, and at the output layer there can be a classification of whether the sentence is spam or not, so there can be just one neuron there making a binary decision.
0:18:21 The training is done with back-propagation. I will not describe exactly how it works, because it is a lot of math; you can find some nice lectures on the web — on Coursera, I think, there are some nice courses about neural networks — and it would take quite some time to explain. Basically, what we need to do is define some objective function, which says what error the network makes on a particular training example. When we train the network, we show it some input features; we know what output the network should have produced, and we know what the network actually did compute using the current set of weights. Then, using the back-propagation and stochastic gradient descent algorithms, we compute by how much, and in which direction, we should change the weights, so that the next time the network sees the same example it will make a smaller error.
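As a very rough sketch of that idea (a toy example with a single linear unit and squared error, added here for illustration; a real network chains this rule backwards through every layer, which is what back-propagation does):

```python
import numpy as np

def sgd_step(w, x, target, lr=0.1):
    """One stochastic gradient descent step for a linear unit with squared error."""
    prediction = np.dot(w, x)          # what the current weights compute
    error = prediction - target        # how wrong the network is on this example
    gradient = error * x               # d(0.5 * error**2) / dw
    return w - lr * gradient           # move the weights against the gradient

w = np.zeros(3)
x, target = np.array([1.0, 2.0, -1.0]), 0.5
for _ in range(50):                    # showing the same example repeatedly
    w = sgd_step(w, x, target)
print(np.dot(w, x))                    # approaches the target 0.5, i.e. a smaller error
```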
0:19:33 Then there is the simplified graphical representation that is used in some papers, where we do not actually draw all the individual neurons, but just draw boxes with arrows.
0:19:47 Another question is what else has to be decided if one actually wants to implement these networks, because there are these hyper-parameters that the training itself does not choose: what type of activation function to use — there are many of them; how many hidden layers we have and what their sizes are; how they are connected — we can have skip connections, we can have recurrent connections, we can have weight sharing (convolutional networks) — so there are actually quite a lot of choices. Of course I will not describe all of them, because that would take a whole course, but what works for me, and for anyone starting to work with neural networks, is to take some existing setup and play with it by making some modifications and observing what the difference is. Maybe that is the best way to start.
0:20:40 As for deep learning — this popular term — it is basically still the same thing: it is a neural network that has more hidden layers, usually; if there are at least two or three hidden layers, then it is basically called deep learning. Or we can also add some recurrent connections, which make the outputs depend on all the previous input features — that is actually very "deep", because there are many nonlinearities that influence the output of the model. So basically any model that goes through several nonlinearities before it computes the output can be considered deep learning, although some people nowadays probably call everything deep learning, which I think is completely silly.
0:21:37 There was also this controversy, for maybe twenty years, where the common knowledge was that training these deep neural networks is not possible with stochastic gradient descent. When I was a student myself, whatever book I was reading, everybody claimed that training these deep networks simply does not work and that we need to develop some magical algorithms. Actually, that is not the case: people now train deep networks routinely and it just works. It is probably because we have more data, and more computational power, than people had in the nineties; in any case, there is this long chain of successes, starting maybe in 2005 or 2006, where people were able to train deeper and deeper networks.
0:22:35 There is also a mathematical justification of why we actually need the deep models, coming from Seymour Papert and Marvin Minsky in their book Perceptrons. It is very mathematical, but the argument is very interesting: there are functions that we cannot represent efficiently with just a single hidden layer. That is actually the same logic I will be using at the end of the talk, to show that there are functions that even the deep learning models cannot learn efficiently — or maybe cannot represent, unless they are very large. So I would say that the word "deep learning" was invented together with the recent neural network revival, but these ideas are much older — including the motivation of people arguing that we really need to use something else than these simple perceptrons.
0:23:41 So this is the graphical representation, where we basically just have multiple hidden layers — that is about it. The networks can be more complicated than this, if there are some recurrent connections or something of that sort. When it comes to training these deep models, I would even say that it is still an open research problem: when you have a very deep model, it is possible to show in many cases that it can represent solutions to some interesting problems, but the big question is whether a good learning algorithm exists — whether we can actually find that solution when we train the network. That is not always the case, especially for some complex problems; as I will show at the end, when the network is, for example, supposed to learn some complex patterns or structures, then, because there are a lot of local optima, it seems that we would need something better than what we have now.
0:24:50 And now I will be talking about the most basic application of neural networks to text problems, which is how to compute the distributed representations of words, and I will show some nice examples, I think, of linguistic regularities in the vector space.
0:25:13 This is how we can actually train the most basic word vectors, and it is where it started for me: as was mentioned in the introduction, when I was writing my diploma thesis in 2006, this was the first model I implemented. We just try to predict the next word given the previous word, using a simple neural network with one hidden layer. When we train this model on some text corpus, a by-product of this learning is that the matrix of weights between the input layer and the hidden layer will basically contain the word representations in some vector format — the word vector is this row of numbers, the weights from this matrix. And it has interesting properties: for example, it groups words with similar meaning together, so that the vector representations of, say, France and Italy will be close to each other, while, for example, France and China will probably be farther apart — though maybe not. So basically this is the simplest application of the neural networks, and it is kind of fun to play with. Of course it is not perfect — the word vectors coming from this model would not be comparable to the state-of-the-art today — but this is where it started.
0:26:55 Sometimes these word vectors are also called "word embeddings" — I am not completely sure why, but that is the alternative name. Usually the representation has a dimensionality of, say, fifty to one thousand, so each word is represented by, let's say, one hundred floats, and then we train the model. The by-product is similar in purpose to the word classes I showed before — France and Italy can go to the same class — but with word vectors the representations can be much richer, because unlike the word classes we can have multiple degrees of similarity encoded in the vectors, as I will show later. So one thing is that it is fun to have these vectors just to study the language — and that actually increased the overall interest in these techniques — but the other thing is that we can also use them in other applications. For example, Ronan Collobert showed, in his famous paper "Natural Language Processing (Almost) from Scratch", that one can solve many NLP problems at close to state-of-the-art performance by using some pre-trained word vectors.
0:28:19 So the word vectors can basically serve as features for other models, like neural networks, instead of — or in addition to — the one-hot encoding. Historically, there were several models proposed before for training these word representations. Usually people started with the most complicated things — models with many hidden layers — and it was kind of working, so it was considered a big success of deep learning. Well, I was not convinced about it, because I knew from my previous results that vectors from just one hidden layer were already quite good. So I wanted to show that the shallow models — models that do not have many hidden layers, just one — can actually be quite competitive. For that, I needed to be able to compare to the word vectors from other people's approaches, and that was not actually easy, because people were showing results after training their models on different datasets, and these datasets are not public — and if you compare two techniques trained on different data, the comparison is not going to be very good.
0:29:35 One of the interesting properties that I actually used for developing this new evaluation set was that these word vectors can be used for doing simple analogy-like calculations with the words. One can ask, for example, what is "king" minus "man" plus "woman": we take the vector for "king", subtract from it the vector that represents "man", then add the vector that represents "woman", and do a nearest-neighbour search while excluding the input words around this position — and we will find the word "queen", for any reasonably good word vector model. Similarly, we can calculate with the words and answer a lot of questions of this type — it is kind of funny how accurate it can get. The picture below shows that there can be multiple degrees of similarity: "king" is related to "queen" in some way, but it is also related to its plural form, "kings", in some other way, and we want to capture all of these things; the idea that a word is just a member of a single class does not allow us to capture this.
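A sketch of that nearest-neighbour calculation (toy code added for illustration; the vectors below are random placeholders, so the answer is only meaningful once real trained vectors, e.g. from word2vec, are loaded instead):

```python
import numpy as np

vectors = {w: np.random.randn(100) for w in ["king", "man", "woman", "queen"]}  # placeholders

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c),
    excluding the three input words from the search."""
    target = vectors[a] - vectors[b] + vectors[c]
    def cosine(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("king", "man", "woman", vectors))   # "queen" with good trained vectors
```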
0:31:00 For the evaluation, I constructed a dataset with almost twenty thousand questions, basically written by hand and then expanded automatically using permutations. Here are a few examples — I think some of these analogy questions would be quite challenging even for people. Maybe try to answer them yourself: for example, "Athens" is to "Greece" as "Oslo" is to "Norway" — I think that one is quite easy — but the second one, the currency of Angola and the currency of Iran (I think it is the rial), is more complicated. And then there are others that are actually very simple, like brother to sister, or grandson to granddaughter, and so on. So we can measure the performance of different models on these individual questions.
0:32:03 It can actually be scaled up to phrases as well, so that we can compute things like: "New York" is to "New York Times" as "Baltimore" is to — I think — "Baltimore Sun". All these datasets are public; they have been published. Let me go on.
0:32:24 The simple log-linear word vector models I will show in a moment are these; what was kind of the state-of-the-art back then was a model with hidden layers: starting with a context of the N previous words, each encoded as one-of-V, the model predicts the next word by going through a projection layer and a hidden layer. After we apply some tricks so that the output layer can be handled efficiently, the main complexity of this model sits in the dense hidden layer, because we need to touch all of its parameters for every training example — and the model takes ages to train.
0:33:07 What I did was basically to remove the hidden layer and make the projection layer slightly different, and as I will show in a second, it works quite fine. So again, the idea is that we can take the bigram model and just extend it, so that we show the context around the word we are trying to predict, simply sum the word representations at the projection layer, and make the prediction right away. This model will not be able to learn n-grams, so it is not suitable for language modeling, but it is just fine for learning the word vectors.
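A minimal sketch of this continuous bag-of-words idea — sum the context vectors, predict the middle word — added here for illustration with toy dimensions and a full softmax (the released word2vec code is far more optimized and uses the output-layer tricks discussed next):

```python
import numpy as np

def cbow_step(W_in, W_out, context_ids, target_id, lr=0.05):
    """One CBOW training step: sum the context word vectors at the projection layer,
    predict the middle word, and update both weight matrices in place."""
    h = W_in[context_ids].sum(axis=0)         # projection layer is just a sum
    scores = h @ W_out                        # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # softmax over the whole vocabulary
    grad = probs.copy()
    grad[target_id] -= 1.0                    # cross-entropy gradient w.r.t. the scores
    W_in[context_ids] -= lr * (W_out @ grad)  # the same update flows to each context word
    W_out -= lr * np.outer(h, grad)
    return -np.log(probs[target_id])          # loss, just for monitoring

V, D = 1000, 100                              # toy vocabulary size and dimensionality
W_in = 0.01 * np.random.randn(V, D)           # these rows become the word vectors
W_out = 0.01 * np.random.randn(D, V)
print(cbow_step(W_in, W_out, context_ids=[3, 17, 42, 5], target_id=7))
```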
0:33:42 The mirror model to the previous one is the skip-gram model, which tries to predict the context given the current word. The two work quite similarly — if they are tuned properly, the results are comparable.
0:34:00 The training is still the same thing — stochastic gradient descent with back-propagation. The words at the output layer are encoded as one-of-N, the same as for the input layer. We cannot really use the full softmax function in the output layer, which would give a proper probability distribution, because we would have to compute all the output values, and that takes too long. So there are these two fast approximations: one that still keeps the probabilities correctly summing to one, which is the hierarchical softmax; and a second one, which drops the assumption that the model has to be a proper probabilistic model and just takes a bunch of random words as negative examples to be pushed down at the output layer, plus the positive example — and that is all that is done. This is negative sampling, and this second option seems to be preferable.
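A sketch of the negative-sampling update for the skip-gram model (a simplified illustration; in the word2vec paper the negative words are drawn from a smoothed unigram distribution rather than uniformly as here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_negative_sampling_step(W_in, W_out, center_id, context_id, neg_ids, lr=0.05):
    """Push the true context word's score up and a few random 'negative' words down;
    only k+1 output vectors are touched instead of the whole vocabulary."""
    v = W_in[center_id]
    ids = np.array([context_id] + list(neg_ids))
    labels = np.array([1.0] + [0.0] * len(neg_ids))   # positive example, then negatives
    u = W_out[ids]                                    # (k+1, D) output vectors
    grad = sigmoid(u @ v) - labels                    # logistic-loss gradient
    W_in[center_id] -= lr * (grad @ u)
    W_out[ids] -= lr * np.outer(grad, v)

V, D = 1000, 100
W_in = 0.01 * np.random.randn(V, D)
W_out = 0.01 * np.random.randn(V, D)                  # one output vector per word
negatives = np.random.randint(V, size=5)              # 5 random negative words
skipgram_negative_sampling_step(W_in, W_out, center_id=7, context_id=42, neg_ids=negatives)
```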
0:34:53 Another trick, which actually improves the performance quite a lot, is to probabilistically — stochastically — discard the most frequent words. This both speeds up the training and, interestingly, can even improve the accuracy, because we do not need to see billions and billions of examples where we try to relate words like "the" and "is" and so on. These words are not removed from the training set completely, but some proportion of their occurrences is removed, so that their importance is reduced when it comes to the objective function.
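The sub-sampling rule from the word2vec paper can be sketched like this (a toy wrapper added for illustration; t is a threshold, typically around 1e-5, and f(w) is the word's relative frequency in the corpus):

```python
import random

def keep_occurrence(word, freq, t=1e-5):
    """Sub-sampling of frequent words: each occurrence of word w is discarded
    with probability 1 - sqrt(t / f(w)), so frequent words lose most occurrences."""
    p_discard = max(0.0, 1.0 - (t / freq[word]) ** 0.5)
    return random.random() > p_discard

freq = {"the": 0.05, "aardvark": 1e-7}        # relative frequencies (made-up numbers)
print(keep_occurrence("the", freq))           # usually False: most occurrences dropped
print(keep_occurrence("aardvark", freq))      # always True: rare words are always kept
```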
0:35:32 And here is the comparison, as I said, on this analogy dataset: there was this big gap, in both the training time and the accuracy, against whatever had been published before. That is what I wanted to prove — that one does not have to train a full language model to obtain good word representations. These last two lines are the new simple models: they are invariant to the word order, they do not understand n-grams, they just see the single words — and yet they compute very accurate word representations that are actually way better than what people could train before, while the training time went from weeks to minutes, and maybe even seconds.
0:36:18 This is available as open-source code — it is called the word2vec project — and many people find it useful, because they can train it on large datasets of unannotated text and use the result to improve many other applications. So I think it is a nice way to bring in extra knowledge when people are dealing with datasets where there is not a huge number of supervised training examples.
0:36:52 Here are some examples of the nearest neighbours, just to give an idea of how big the gap was between what was state-of-the-art before and after these models were introduced. Take, for example, a fairly infrequent word in English which is still present in the vocabularies of all these models: we can see that the nearest neighbours from the first model barely make any sense; with the second one you can at least get the idea that it is probably the name of some person; while the last one is obviously much better when it comes to the nearest neighbours. Of course, this improvement in quality comes from the fact that the models are trained on much more data and have a larger dimensionality — and that is all possible because the training complexity was reduced by many orders of magnitude.
0:37:50 There are some more fun examples: we can calculate things like sushi is to Japan as bratwurst is to Germany, and so on. It is kind of fun; of course, we do not have to look only at the nearest token, we can look at the top ten tokens. I would not say that it works all the time — maybe sixty percent of the time the nearest neighbours look reasonable — but it is still fun to play with, and there are many pre-trained models now available on the web.
0:38:28 One thing that data scientists actually find useful is that these word vectors can be visualised to get some understanding of what is going on in the dataset they are using. The regularities are so strong that when we train this model on the Google News dataset and then visualise, in two dimensions, the representations of countries and their capital cities, we can actually see the correlation between them: there is a single direction for how to get from a country to its capital city, and even the countries are related to each other in this representation in some interesting way. For example, we can see that the European countries are in one part of the image, the rest of the world is somewhere in the middle, and the Asian countries are more towards the top of the image.
0:39:32 So, for the summary: I think it is always good to ask whether things can be done more simply, and as was shown, not everything has to be deep — neural networks are fine even if we remove many of the hidden layers, especially in the NLP applications. It is a different story, for example, for acoustic modeling or for image classifiers, where I am not aware of any model that can be competitive with the deep models without having many nonlinearities; but for the NLP tasks it is the other way around, so I am not completely convinced that deep learning really works for NLP so far. Maybe in the future we will do better.
0:40:20 There is, though, this extension of the word vectors: instead of predicting the middle word given the context, we can predict a label for the whole sentence using the same algorithms. This is what we published as the fastText library last year; it is very simple, but at the same time very useful. Compared to what people are probably doing nowadays at the deep learning conferences — we did the comparison to some convolutional networks with several hidden layers trained on GPUs — we found that we can get about the same accuracy while being a hundred to a hundred thousand times faster. So I think it always pays off to think about the baselines and to do the simple things first.
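The idea can be sketched like this (a simplified illustration: real fastText also adds n-gram features, trains the embeddings jointly with the classifier, and uses a hierarchical softmax when there are many labels):

```python
import numpy as np

def fasttext_like_predict(W_emb, W_cls, word_ids):
    """Average the embedding vectors of the words in a document,
    then apply a linear classifier with a softmax over the labels."""
    h = W_emb[word_ids].mean(axis=0)          # averaged bag-of-words representation
    scores = h @ W_cls                        # one score per label
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()

V, D, L = 1000, 50, 2                         # vocabulary, embedding size, labels (e.g. spam / not spam)
W_emb = 0.01 * np.random.randn(V, D)
W_cls = 0.01 * np.random.randn(D, L)
print(fasttext_like_predict(W_emb, W_cls, word_ids=[4, 99, 512]))
```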
0:41:15 The next part will be about the recurrent networks. I think it is quite obvious by now that word representations can be obtained easily with shallow networks, but it is a different story for language modeling — there, there actually is some success of deep learning, because the state-of-the-art models nowadays are recurrent, and that is basically this model. Then I will also talk about the limitations of these models.
0:41:42 The history of the recurrent networks is quite long; a lot of people worked on these models — Jeff Elman, Mike Jordan, Michael Mozer and so on — because the model is actually very interesting: it is a simple modification that gives the model some sort of short-term memory. Here is the graphical representation: again, we can take the bigram model and just let the hidden layer be connected also to the hidden-layer state from the previous time step, so that h(t-1) creates this loop in the model. The hidden layer thus sees the features at the input layer plus its own state from the previous time step — which itself saw the state before that, and so on — so basically every prediction depends not only on the current input feature but on the input features from all the time steps seen before. One can say that the hidden layer represents some sort of memory that this model has. There is this interesting paper from Jeff Elman, "Finding Structure in Time", with this sort of motivation.
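One step of such an Elman-style recurrent language model can be sketched like this (toy code with made-up sizes, added for illustration; the tanh nonlinearity and the softmax output follow the usual formulation):

```python
import numpy as np

def rnn_lm_step(x_onehot, h_prev, W_xh, W_hh, W_hy):
    """One time step: the new hidden state mixes the current word with the previous
    hidden state, and the output is a distribution over the next word."""
    h = np.tanh(W_xh @ x_onehot + W_hh @ h_prev)    # recurrent hidden state ("memory")
    scores = W_hy @ h
    probs = np.exp(scores - scores.max())
    return h, probs / probs.sum()

V, H = 1000, 100
W_xh = 0.01 * np.random.randn(H, V)
W_hh = 0.01 * np.random.randn(H, H)
W_hy = 0.01 * np.random.randn(V, H)
h = np.zeros(H)
for word_id in [5, 17, 8]:                          # walking through a sentence
    x = np.zeros(V); x[word_id] = 1.0
    h, next_word_probs = rnn_lm_step(x, h, W_xh, W_hh, W_hy)
```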
0:43:05 After this period, when the recurrent networks were studied, the excitement kind of vanished, because some people started believing that these models — even though they look very good — cannot be trained with SGD. This is a pattern that keeps recurring again and again: whenever people are failing to do something, they claim that it just does not work — and of course they usually turn out to be wrong. The recurrent networks actually train with SGD perfectly well; one just has to use a few small tricks.
0:43:50 So what I did: I showed in 2010 that one can actually train state-of-the-art language models based on the recurrent networks, and it was very easy to apply them to a range of tasks — language modeling, machine translation, speech recognition, data compression and so on. In each of these I was able to improve the existing systems and achieve new state-of-the-art results, sometimes by quite a significant margin. For language modeling, the perplexity reduction over n-grams with an ensemble of several recurrent networks was usually around fifty percent or more, which is quite a lot.
0:44:31 Companies started using the toolkit I published, and really many others as well. Later I was discussing with Yoshua Bengio why the model actually worked for me, when people had tried it before and just could not make it work. There was this problem that I did face at some point: when I was trying to train the network on more and more data, it would start behaving in some chaotic way. The training was unstable, so sometimes it converged and sometimes it did not, and the more data I used, the lower the chance that the network would converge — and mostly the results were just rubbish.
0:45:21 It took me quite a few days of trying to figure out what was going on, and I found that there are some rare cases where the SGD updates align in such a way that the changes of the weights become exponentially larger as they get propagated through the recurrent matrix — they become so huge that the whole weight matrix gets overwritten with these enormous numbers and the network never recovers; it is just broken afterwards. So what I did is the simplest thing one can think of: because these gradient explosions happened only very rarely, I simply clipped the gradients so that they could not become larger than some value — some threshold.
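The fix itself is only a few lines; here is a sketch of it (clipping can be done per element or, as below, by rescaling the gradient norm — the threshold value here is arbitrary):

```python
import numpy as np

def clip_gradient(grad, threshold=15.0):
    """If the gradient explodes, cap it so it can never overwrite the weights
    with huge numbers; applied before every weight update."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```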
0:46:12 It turned out that probably nobody had been aware of this behaviour before; I was describing this idea in 2011. So maybe that was the reason why things did not work for others — I don't know — but as I said, it was definitely not the case that SGD would not work for training these models. It was actually quite easy to obtain pretty good results; one just had to wait quite long for the training, because the models were quite expensive.
0:46:47 The application in the original setup was speech recognition, on a small, simple dataset, and the reduction of the word error rate was over twenty percent relative compared to the best n-gram models. One can see that as the number of neurons in the hidden layer gets bigger — as we scale the size of the model — the perplexity goes down. Perplexity is just a measure of how good the network is at predicting the next word, the lower the better, and the word error rate goes down with it. The best n-gram model, with no count cut-offs, got something like twelve and sixteen point six percent word error rate on the evaluation sets, and with a combination of these recurrent networks we could get to roughly nine and thirteen percent. That was quite a big gain coming just from a change of the language modeling technique, which I think was unheard of before. When I compared these results to other techniques being developed, for example at Johns Hopkins University, where people are usually happy with something like a 0.3 percent improvement of the word error rate, here I could get about three and a half percent absolute. So that was quite an interesting finding.
0:48:21 Another interesting observation was that the more training data was used, the bigger the gain of the recurrent networks over the n-gram models. That was quite the opposite of what Joshua Goodman published in his technical report — I think it was in 2001 — where he did this very famous, very large comparison of all the advanced language modeling techniques that people had proposed for improving language models, and found that they actually helped less and less as more data was used. People took from it that there was little hope that the n-gram models could ever be beaten — well, it turned out that they can be beaten with the recurrent models, so that actually happened, which is quite nice.
0:49:14 And the last graphs are on a large dataset from IBM. It is pretty much the same story, just at a much bigger scale, and with a much better tuned baseline coming from a commercial company. The green line is their best result — around thirteen percent word error rate — and on the x-axis there is the size of the hidden layer of the recurrent networks, so you can see that as the networks get bigger and bigger, the word error rate keeps going down. In the end the experiment was bounded by the computational complexity, because it took many tricks to train the biggest models, and that was quite challenging. Still, I could get another significant relative reduction — and I think I would get even more if I could train bigger models — but already this result was very convincing, and suddenly people from the companies were interested.
0:50:18 Later, the recurrent networks became much more accessible, because implementing the stochastic gradient descent correctly is kind of painful in this model: one has to use the back-propagation through time algorithm, and if you make a mistake there, it is very hard to find it later. So the toolkits are also very useful — maybe the most popular ones now are TensorFlow, Theano and Torch, but there are many others. And using the graphics processing units, people could scale the training to billions of training words, using thousands of neurons, which is quite a bit bigger than what I was using in 2010.
0:51:06 Today the recurrent networks are used in many tasks like speech recognition and machine translation. I think the Google guys published a paper a few months ago where they were investigating how to get the recurrent networks into the production system for Google Translate. I think it will still take some time, but let's hope it will happen, because it would be great — for example for translating from English to Czech, so that finally the morphology would not be as painful as it usually is.
0:51:42 On the other hand, I think the downside is that, because toolkits like TensorFlow and so on make the recurrent networks very easily accessible, people are using them for all kinds of problems that do not really require them — especially when people try to compute representations of sentences or documents. I would always warn people to think about the simpler baselines, because just a bag of n-grams can usually beat these models, or at least be around the same accuracy, when it comes to representations; it is different for language modeling.
0:52:26 So one can ask: what can we do better — do we really need anything more? As I said, the recurrent networks may work pretty well, and sometimes adding more layers helps for some problems and can be used to get better results. Can we build that great language model I mentioned at the beginning — one that would be able to tell us what the capital city of some country is? Maybe we could start with a huge recurrent network. Well, I am not that convinced, because there are very simple things that these models cannot learn — and that is actually an opportunity for new people, a new generation, to develop better models.
0:53:07 A simple pattern that is, for example, very difficult to learn is memorization of a variable-length sequence of symbols — something like asking the network to see a sequence of symbols and then be able to repeat it later. That is something that, in general, nobody can train the recurrent networks to do. There are even simpler patterns: we do not have to memorize the sequence of symbols, we can just do a little bit of counting. We can generate, with a very simple algorithm, sequences with some strong regularity, and see what the recurrent networks can actually learn.
0:53:54 I think people may know, from theoretical computer science, that there are very simple formal languages, like the a^n b^n language, where a block of a-symbols is followed by the same number of b-symbols. We can show the network quite a few examples and train a sequential predictive model — a recurrent network — to predict the next symbol. If it actually learns the pattern, it should be able to predict correctly all the symbols in the second half of each sequence; the first half is not predictable, because the information about how many a's there will be is not available in advance. And this turns out to be quite challenging.
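Generating training data for such a language takes only a couple of lines (a toy sketch added for illustration):

```python
import random

def generate_anbn(max_n=10):
    """One string from the a^n b^n language: after the a's have been seen,
    every following symbol is fully predictable."""
    n = random.randint(1, max_n)
    return "a" * n + "b" * n

print([generate_anbn() for _ in range(3)])   # e.g. ['aabb', 'aaabbb', 'ab']
```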
0:54:40 We could talk about plenty of these tasks that the recurrent networks currently cannot do, and one can get confused about what to focus on: should we study these artificial grammars, how is that related to the real language, and can we show in the end that it would improve some language model? I think these are natural questions.
0:55:03 I think the answer is quite complicated, but what I believe is that it is good to set some big goal in the beginning and then try to define a plan for how to actually accomplish it. So we did write one paper — a roadmap — where we discussed exactly this. We started with the ultimate goal: instead of trying to improve some existing setup, we are trying to define a new setup that would be more like artificial intelligence — something people can see in science fiction, something really exciting — and that is what we actually want to optimize the objective function for, not just some speech recognizers; something more fun.
0:55:56So we thought about which properties of the AI would be really useful for us, and it seems that any useful artificial intelligence would have to be able to somehow communicate with us, hopefully in some natural way.
0:56:15Again, if you look at science fiction movies or books, usually the artificial intelligence is some machine that either is a robot, or can be controlled with voice, or it's some computer that we can interact with. So the embodiment doesn't seem to be necessary, but there needs to be some communication channel so that we can actually state some goal that the AI can then accomplish for us.
0:56:43And if we can communicate with the machines, of course it will help; maybe we could even go beyond programming, because currently the way we communicate with computers, if we want them to do something, is by typing instructions in some programming language. There is no way we can just start talking to the computer and expect it to accomplish a task for us — that's basically not a framework we have now. I think that in the future this will become possible, though it may take a long time, and I think we should start thinking about it, because I don't think we can improve the language models much more with some crazy recurrent architecture.
0:57:23So in the roadmap we described a pretty minimal set of components that we think an intelligent machine should consist of, and then some directions that may actually be good for constructing these machines. The idea is that this is how it looks now, and maybe later we will improve it; we have been discussing it at conferences. The main requirement is that it has to be scalable in many dimensions, so that it can actually grow into full intelligence.
0:57:55The components are, as I said, the ability to communicate, and the ability to set some tasks for the machine so that it will do something useful — some motivation component. Again, that is something normally missing in predictive models like the language models and so on.
0:58:13And then some learning skills, which it seems many current models are missing. For example, long-term memory is not really part of any model we have today: neural networks represent long-term memory in the weight matrices, and those get overwritten — the network keeps receiving gradients from new examples — which is basically not a good model of long-term memory. So we need to do something better.
0:58:44I will go over this quickly, because it would be a long discussion to explain why we think about all these things, but we believe there has to be some incremental structure in how the machine is trained. The training cannot look like how we normally train language models; it seems it has to be done in some incremental way, similar to how humans learn language.
0:59:15And for that we are thinking about some sort of simulated environment that would be used to develop both the algorithms that are missing and then, once we have these algorithms, to train the most basic intelligent machines with the most basic properties we can think of.
0:59:36So this is basically what we are thinking about, and we ran some experiments with it. There are these components: the learner, which stands for the intelligent machine that lives in this environment and can do some actions — but everything is actually very simple, we tried to minimize the complexity. The learner basically receives some input signal, a sequence, and produces an output signal, which is a sequence as well; it also receives some reward, which is used to measure the performance of the learner. And then there is the teacher, which defines the goals and assigns the rewards — and that's it.
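A minimal sketch of that loop — one learner, one teacher, symbols flowing over plain channels and a scalar reward. The class and method names and the toy "repeat the last symbol" task are mine, for illustration only; the published environment is organized differently:

```python
import random

class Teacher:
    """Defines a task, feeds the input channel one character at a time,
    and hands out a scalar reward (toy task: repeat the previous character)."""
    def __init__(self):
        self.last = None
    def step(self, learner_output):
        reward = 1 if (self.last is not None and learner_output == self.last) else 0
        self.last = random.choice("ab")
        return self.last, reward          # next input symbol, reward for last output

class Learner:
    """The machine: one input symbol in, one output symbol out, reward observed."""
    def respond(self, input_symbol, reward):
        return input_symbol               # a trivial policy: echo the input

def run(teacher, learner, steps=1000):
    out, total = None, 0
    for _ in range(steps):
        inp, r = teacher.step(out)
        total += r
        out = learner.respond(inp, r)
    return total / steps                  # average incoming reward

print(run(Teacher(), Learner()))          # the echo learner solves this toy task
```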
1:00:25That's the description of the environment shown on the screen. Of course we want to have artificial teachers as well — we want this to be scalable — so later, once we have a learner that can learn these very simple patterns, the expectation is that the teacher would be replaced by humans: humans would be directly teaching the machine and assigning the rewards. And once the machine gets to some sufficient level, the expectation is that we can start using it for doing something actually useful for us.
1:00:58yeah
1:01:00So the communication is really the core: the learner just has this input channel and this output channel, and all it has to do is figure out what it should be outputting at a given time, given the inputs, to maximize the average incoming reward.
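Written down (my notation, not a formula from the talk), the learner searches for a policy $\pi$ mapping the history of inputs and rewards to the next output symbol so as to maximize the long-run average reward:

$$\pi^{*} = \arg\max_{\pi} \; \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big[\, r_t \mid \pi \,\big]$$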
1:01:18It seems to be quite simple, but of course it is not. This is a graphical representation, just so it's more obvious what we are aiming at: there is an input channel and an output channel; the task specification is given by the teacher — for example, an instruction to go and find an apple; and then the learner, once it has learned the task, simply tells the environment that it moves, and that's how it performs the action. So we don't need a separate actuator for every possible action: the learner can do anything it is allowed to do just by saying it — if it wants to go forward or turn left, it can just say so — and at the end of the task it gets the reward, for instance for finding the apple.
1:02:09We think that learning quickly will be completely crucial here, and that's the same point about incrementality of the learning: when the tasks get more and more complex in some incremental way, the learner should be able to learn from a few examples at most; otherwise we are just brute-forcing the search space. The algorithms we have at the moment would basically break on this type of problem. And as I said before, of course we can get to documents and harder things later, but it still seems too soon, because we don't have algorithms that can deal even with the basic problems.
1:03:00And then, if we have these intelligent machines that can work with the input and output channels, we can of course add the real world as basically an additional input channel that the machine can control; for example, it can send queries through the output channel to the internet and receive the results on the input channel. So the framework is very simple, but it seems to be sufficient for intelligent machines.
1:03:28And here I selected a few things that seem very simple to learn, but you cannot really do them with recurrent networks, even with long short-term memory units in the recurrent networks and all kinds of crazy additions. They are very simple, but they are very challenging to learn even when we have supervision about what the next symbol is; if we tried to learn them just through rewards, it would be even worse.
1:03:58These are the things we believe in — especially the last two. Basically all of these are open research problems, and maybe they even have to be addressed together, so it's quite challenging; but I think it's good for people who are starting their own research to think about challenging problems.
1:04:27As a small step forward, we published a paper where we showed that recurrent networks can actually learn some of these algorithmic patterns if we extend them with a memory structure that the recurrent network learns to control.
1:04:49That actually addresses several of the problems I mentioned before: if this memory is unbounded in size — like a stack, for example — then suddenly the model can, at least theoretically, be Turing-complete, so in principle it can learn to find a representation of any algorithm, which seems to be necessary, and we as humans can do it.
1:05:14It also addresses, or at least could address, the problem I mentioned before with neural networks that keep changing their weight matrices all the time and therefore forget things: if you had this controlled way to grow some memory structures, that could be a way to represent long-term memory better. But as I said, it's just a first step forward.
1:05:43Of course, we later found out that many people had already worked on this idea — the first work with this idea was published back in the nineties — but what is likely novel about our solution is that it is again simpler, and works better, than what people proposed before. So the model looks like this.
1:06:02There is not much complexity: basically, the hidden layer decides which action to take with the stack by producing a softmax, a probability distribution over the actions it can perform. It can either push some value on top of the stack, pop the value from the top of the stack, or decide to do nothing; and of course there can be multiple stacks that the network controls. If it wants to write some specific value, that value again depends on the state of the hidden layer. And the nice thing is that it can be trained with plain stochastic gradient descent, so we don't need to do anything crazy.
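A minimal sketch of the kind of differentiable stack update being described (my simplification in PyTorch, not the published code): the hidden layer emits a softmax over push/pop/no-op plus a value to push, and the stack cells are updated as a soft mixture of the three actions, so the whole thing stays trainable with plain SGD/backprop.

```python
import torch
import torch.nn as nn

class StackRNNCell(nn.Module):
    """Simplified stack-augmented RNN cell (a sketch, not the original implementation)."""

    def __init__(self, input_size, hidden_size, stack_depth=20):
        super().__init__()
        self.rnn = nn.Linear(input_size + hidden_size + 1, hidden_size)  # +1: stack top is fed back
        self.action = nn.Linear(hidden_size, 3)    # push, pop, no-op
        self.push_val = nn.Linear(hidden_size, 1)  # value written on a push

    def forward(self, x, h, stack):
        # stack: (batch, stack_depth); stack[:, 0] is the top element
        top = stack[:, :1]
        h = torch.sigmoid(self.rnn(torch.cat([x, h, top], dim=1)))
        act = torch.softmax(self.action(h), dim=1)             # (batch, 3)
        val = torch.sigmoid(self.push_val(h))                  # (batch, 1)
        push, pop, noop = act[:, :1], act[:, 1:2], act[:, 2:3]

        # Soft (differentiable) stack update: mixture of the three discrete actions.
        pushed = torch.cat([val, stack[:, :-1]], dim=1)                        # after a push
        popped = torch.cat([stack[:, 1:], torch.zeros_like(top)], dim=1)       # after a pop
        new_stack = push * pushed + pop * popped + noop * stack
        return h, new_stack
```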
1:06:48And it seems to work for at least some of these simple synthetic sequences, like the ones shown here — the bold characters are the predictable, deterministic ones — and we could solve basically all of these problems, which was quite interesting.
1:07:14And of course plain recurrent networks cannot do it. The funny thing is that the LSTM models, which were actually originally developed exactly to address these problems, can do it, because they can count thanks to their linear component; so that's sort of cheating, because the model was developed for this particular reason. But we can show that the LSTMs break if we just make the task a bit harder: instead of just requiring counting, we can require memorizing sequences — as I said before, we just show a bunch of characters of variable length that have to be repeated — and that breaks the LSTMs,
1:07:59which, for those who don't know them, are a modification or extension of the basic recurrent network that adds linear units with gated connections — a fairly complicated architecture for getting a more stable memory into the recurrent network so that gradients propagate more smoothly across time. So with the stack model we could solve the memorization task.
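For concreteness, one way to generate the memorization (copy) task being described; the exact formatting is my guess, and only the part after the separator is deterministic, so that is where prediction accuracy would be measured:

```python
import random

def copy_example(min_len=1, max_len=10, alphabet="abc"):
    """A variable-length string, a separator, then the same string repeated."""
    n = random.randint(min_len, max_len)
    s = "".join(random.choice(alphabet) for _ in range(n))
    return s + "|" + s + "."   # e.g. "bac|bac."
```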
1:08:23But then of course one can say that stacks were kind of developed exactly for this type of regularity, so the interesting test for our model was
1:08:37binary addition, which is quite a bit more complicated — and interestingly, it could do that as well. Here we are showing these examples, which are binary inputs: the addition of two binary numbers together with the result, and the recurrent network learns to predict the next symbol as it goes through this stream, so it's like a language model. And it turned out that it actually learned to operate the stacks in a quite complicated way to solve this problem.
1:09:07It actually stores the first number in, I think, two stacks — there is some redundancy, it actually uses three of them for various pieces of information — then it stores the second number, and then it is able to produce the addition of these two numbers correctly. So I think it's quite a funny example.
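Here is one way such training sequences could be generated (a sketch; the exact formatting in the experiments is my guess). Writing the numbers least-significant-bit first is the reversal trick mentioned next, which makes the task easier for a stack-based model:

```python
import random

def addition_example(max_bits=8, reverse=True):
    """One 'a+b=c.' sequence over binary digits, predicted symbol by symbol
    like a language model; only the result (and the separators) is deterministic."""
    a = random.randint(0, 2 ** max_bits - 1)
    b = random.randint(0, 2 ** max_bits - 1)
    fmt = (lambda n: format(n, "b")[::-1]) if reverse else (lambda n: format(n, "b"))
    return fmt(a) + "+" + fmt(b) + "=" + fmt(a + b) + "."
```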
1:09:33Of course, there was a trick we used to help the model: because the stacks push values on top, it's actually much easier to do the memorization of the strings in reverse order, and the same holds for the binary addition.
1:09:53So I wouldn't say that we can actually learn general algorithmic patterns with this model.
1:10:00And of course we could do better if we didn't use just stacks but, for example, tapes or other additional memory structures with all kinds of topologies and so on. But that feels like tweaking the solution towards the task, which doesn't seem great, so I would refer back to the paper I mentioned: try to define the tasks first, before thinking about the solution. In any case, we could show that we can learn interesting and reasonably complex patterns that normal recurrent networks could not learn.
1:10:42And the model is Turing-complete, as I said, and has some sort of longer memory, but it is not the long-term memory we would like to have; it doesn't have the properties we want. So there are still a lot of things that should be tried — let's see what happens in the future.
1:11:05So, for the conclusion of this last part of the talk: I would say that achieving artificial intelligence, which was my motivation when I started my PhD, is something I have so far failed to do, but at least there were these side products that turned out to be useful. I think we first need to think a lot about the goal; I just have the feeling that many people are working hard on the wrong tasks — the tasks are too small and too isolated — and I think it's time to think about something bigger.
1:11:38And a lot of new ideas will be needed to define a framework in which we can develop the AI, the same way as the framework in which the first speech recognizers were built: it also took quite a few years to define how to measure word error rates and so on, and how to annotate the datasets.
1:12:02I think we will basically need to rethink some of the basic concepts that we take for granted now and that are probably wrong — like, for example, the central role of supervised learning in machine learning techniques. I think that has to be revisited, and we have to move to techniques that are much more unsupervised and maybe built on somewhat different principles.
1:12:27And of course, one of the goals of this talk is to motivate more people to think about these problems, because that's how I think we can make progress faster. So I think that was the last slide — thanks for your attention. (applause) So, are there any questions?
1:13:24So my question is: how do you properly define intelligence — not artificial intelligence, just intelligence? And the second question, which is tied to the first one: we know that the Turing machine is limited, it cannot solve everything; so do you believe that intelligence, as you define it, is achievable with your Turing-complete machine?
1:13:48Well, I'm not sure the two questions are actually related — for me these are two separate questions. First, for the definition of intelligence: there are many opinions on this, and I would say that pretty much every researcher defines intelligence in a different way.
1:14:08The most general definition I can think of — it is maybe too philosophical — is basically about the patterns that exist in the universe: we can say that life is basically just some organization of matter that tends to preserve its own form, through evolution and everything.
1:14:39It goes back to old ideas, for example that the universe can be seen as one big cellular automaton and everything we observe is just a consequence of that; then you can see life as just a pattern that exists in this topological structure, and intelligence is just a mechanism that this pattern developed to preserve itself.
1:15:07For the second question: you said that Turing machines are limited — I'm not sure in what sense; maybe you mean that normal computers are not Turing machines in the strict sense.
1:15:24So I don't know exactly which problems you mean that cannot be solved. I was talking more about Turing completeness in the sense that the Turing machine is this concept where there is a finite description for all the patterns in the computational model; if you take a weaker model, like finite state machines, then you know that for some algorithms there does not exist a finite description — for example, you cannot count if you limit yourself to finite state machines.
1:15:59In the context of recurrent networks, I think it gets more confusing, because there have been papers written claiming that recurrent networks are Turing-complete, and one can then draw the conclusion — and I have actually heard people argue exactly this — that since recurrent networks are Turing-complete, they are just fine and should in general be able to learn all these things that I was showing.
1:16:24What I want to say is that when we try to train it with SGD, a normal recurrent network doesn't learn even the counting, and it doesn't even learn something like plain sequence memorization. So that is one thing: what is learnable, which is actually quite different from what can be represented.
1:16:43And if I take the argument of all these people strictly, then I would say that the recurrent networks as we have them now, including the LSTMs, are not Turing-complete, because the proofs of their Turing completeness assume that there is infinity hidden somewhere in the model — usually in the values stored in the neurons. And that does not match the neural networks we are using now: we use 32-bit precision, and you cannot really claim that you can store an infinite
1:17:18amount of information in a single number. It's the same argument as saying that you can save the whole universe in a single number using arithmetic coding — sure you can, but do you actually want this representation in a neural network, where one value stores everything and you have to encode and decode it at every time step? If you model it that way, it makes sense to say that recurrent networks are not Turing-complete; strictly speaking there are versions that maybe are, but they are just not practical — and of course the Turing machine itself is not a very practical model either. So when I talk about Turing completeness, it's in that sense, not about practical implementations.
1:18:00I see that you're thinking a lot about AI creation. There is actually a huge discussion right now in the field about achieving the singularity — about what happens whenever we create a human-level AI that gets connected to the internet. Do you share any of their concerns about a rogue AI, or a superintelligent AI, which could basically do something silly?
1:18:37Well, I have different views on this. I think that this thinking about superintelligence and the singularity is a little like — I don't know what I would relate it to — maybe like asking whether the Chinese, when they invented gunpowder, should have been afraid of the whole world blowing up
1:19:00in some chain reaction. I mean, it's basically just a technology, and we should be aware of it; and the same holds for the state of the research: as I was saying, if you don't want to fool yourself, then it is clear that we cannot yet teach the AI even many very simple things, so talking about the singularity now seems premature to me.
1:19:27Of course, some people argue that the gap between having something that doesn't work at all and suddenly having some intelligence that can improve itself doesn't have to be that big, so maybe we will get these machines sooner than we expect, even if some people are skeptical that it can ever happen.
1:19:47But if I take this argument, then I would say it depends on how we construct these machines. In the framework I was describing, we are supposed to make machines that try to accomplish some goals, and as long as we are able to define the goals for the machines, then for me the machine is basically something that extends your own abilities. If you are sitting in a car, you are able to move much faster than using your own legs, because the car is a physical tool for you.
1:20:22The car just does what you want it to do, because you decide where it should go; it can also knock people down, it can kill someone, but then the driver is responsible. So I think that the AI, even if it is very clever, as long as its only purpose is to accomplish the goals of the human who specifies those goals, is basically an extension of our mental capabilities, the same way cars extend our ability to move.
1:20:56That was just — in your vision, the third step was to let it learn by itself, and that's the tricky part, because at that point you are no longer part of it; it's on the AI's side. — Sorry, which slide was that? The one about the AI connecting to the internet, or just about the — oh yeah, I see.
1:21:35Okay, I don't remember exactly — maybe I didn't understand you correctly. — Actually, the last point was to let the learner learn by itself from other sources, which means you have no control over it.
1:21:49Well, sure, that's the question: if the learner learns from other sources, how far can it drift from the external rewards set by the teacher? You can actually make the same argument about people: they are also born with some kind of internal reward mechanism that was hardcoded, maybe largely by evolution — for example, if you eat sugar you feel happy, or whatever, because it's a hardcoded thing.
1:22:21And that still doesn't prevent people from behaving quite differently once they become adults, because they can, for example, just decide to stop eating sugar and simply not follow the rewards — external or internal, the basic hardcoded rewards that are in the brain.
1:22:42So it's more the question of whether the AI would become so independent that it would have some sort of free will, and you can of course imagine that turning into something bad. But if you think about AI — not a single one, but many of them, and many of them working with us — then my vision is basically that it extends our own abilities, and it's the same as saying that pretty much any piece of technology can be used for good and for bad purposes; it just depends on how we use it.
1:23:23Any other questions?
1:23:24I was wondering whether learning should be more local — without propagating targets through the whole network — something which would work inside the network and would actually change just some subset of the weights, so it wouldn't propagate the information globally, and it would be mostly unsupervised, something like CD. Is anyone using something like that these days?
1:24:05Hmm, I think I have seen something, but I wouldn't be able to give you references because I don't recall them right now. I myself was never very fond of these approaches, because I think they are quite limited.
1:24:21So, well, I don't know. I guess the property that we should be able to get into our models — whether neural networks or something else — is this ability to grow in complexity, and that's something normal neural networks don't have. Once you start giving the network some sort of memory mechanism, the ability to extend its memory structure, that's how I see it.
1:24:46And then the topology allows you to update not all the parameters but just some subset — that's what I was thinking of. But of course that doesn't mean it's the solution; maybe it will come from something else.
1:25:00I just think that even if you go with, let's say, something that does local updates, I would be a bit worried about the model itself being limited in the computational sense. Of course, you can argue that the human brain itself is finite — a finite number of neurons, maybe — but then my final argument would be that as a human you can actually navigate in a topological environment: the environment around you is three-dimensional, it has a topology, and if you want to work something out you can use a piece of paper and so on. So you can see yourself as a finite machine, but within the environment the paper works like the tape in a Turing machine,
1:25:44and then you can actually see the whole system as Turing-complete. So if the model actually starts living in an environment, I think it gains many more abilities — it can also change the environment — and it becomes much more interesting than if you have just a neural network sitting there, observing input vectors and producing output vectors, without being able to control anything like the topology. For example, when I was talking about the stacks, the stack can be seen as a one-dimensional environment that the model lives in and can operate on; a 2D environment is basically just more dimensions, but it's kind of the same thing, and our own world is basically 3D, just really big. If the model is able to influence the state of the world, that changes things; otherwise I think it will be quite limited. So that's kind of my understanding of it.
1:26:48Does the research agenda OpenAI is pursuing have any overlap with the framework that you have suggested? — OpenAI? — Yes.
1:26:57Yeah, those are the guys in California. They recently released OpenAI Universe, I think maybe a month ago, so there is some overlap in the goals, in the sense that they tried to define, I think, a thousand tasks or something of that sort, and they are trying to make machines that — coming from their definition of general AI, I guess — are some sort of machine that can work across a range of tasks, not a single task but many tasks.
1:27:43But it's actually quite crucially different from what I was describing, because there is a difference with incremental or gradual learning — I think there are several other names for it — where you assume that the machine has learned, let's say, N tasks, and when you then try to teach it task N+1, it should be able to learn faster if this new task is related to the old ones; and you can actually measure this, because you can construct these tasks yourself, artificially.
1:28:13From what I have seen so far — and I'm not an expert on what they are doing, maybe they are still changing direction — I think they are trying to solve a bunch of tasks together, which is multitask learning; that is a different thing, and it's actually something today's neural networks can already do — you can approach those problems with them.
1:28:33They try to do it with reinforcement learning, which again is quite challenging, because there you don't state the supervised labels of what the model should be doing; you just give rewards for the correct behaviour. So that part of what they are trying to do is somewhat related to what I was describing.
1:28:55But I don't think multitask learning itself is a big problem, because it can actually just work fine: you can have one neural network recognize speech, do image classification, and do language modeling at the same time, because you represent all these things at the input layer in quite different ways, so they would simply be encoded in different parts of the network.
1:29:17I think their hope is that the tasks will start boosting each other's performance — that if you train the network to do all these things together, it will somehow share the abilities.
1:29:31So let's see what they come up with. From my point of view, I think it's good to try to isolate the biggest problems and try to solve them — I was, for example, showing how to split things into subproblems and trying to get to the core, to the simplest things that these models cannot learn even with one hidden layer, very simple patterns. If instead we try to analyze what is going wrong with the current algorithms by taking a huge dataset of, say, a thousand different problems, training some model on top of it, and then making some claims about whether it works or doesn't work and what went wrong, I think the analysis will be very hard. It will be great, amazing for PR videos,
1:30:18which of course is one of the main things they are after; but beyond that, I don't really see it.
1:30:33So don't you think that multitask training is actually crucial in these things? It can cover a lot of things, and it can learn what not to do instead of just learning what to do. — Well, multitask learning — I'm not calling it a crucial problem or saying it's a problem; I'm just saying it's part of the real-life setting: you never learn just one thing, you always observe many things at once, and if you want to take inspiration from real life —
1:31:06Sure, I mean, that's completely fine. For example, when I was describing this framework with the learner and the teacher and so on, the point is that the teacher would be incrementally teaching the learner, and there can be many tasks — it can be defined the way people usually set up work on multiple tasks, and we have that there too. But it's different whether you assume that you train the model on all the tasks together and then try to measure the performance on those same tasks,
1:31:37or whether you train the model on some tasks and then try to teach it quickly a different task — and that's actually what I think is much more challenging, and what I think we should focus on, because it will be needed. If you just train on a million tasks and then show that the model performs all the combinations very well — maybe because they were in the training set — then I don't really see the point. So of course it's part of the problem to have the AI work on multiple tasks at once, but we also have to make it able to learn new tasks quickly.
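One way to make that distinction concrete (a sketch; the helper names model.update, task.sample, and task.evaluate are hypothetical, not from the talk): in the multitask setting you train on tasks 1..N jointly and test on those same tasks, while in the incremental setting you train on tasks 1..N and then count how many examples are needed to reach a threshold on a held-out task N+1.

```python
def examples_to_learn(model, task, threshold=0.95, max_examples=10_000, batch=10):
    """How many examples the learner needs on a *new* task before it reaches
    the accuracy threshold (lower is better); a proxy for learning speed."""
    seen = 0
    while seen < max_examples:
        model.update(task.sample(batch))     # a few supervised examples (or rewards)
        seen += batch
        if task.evaluate(model) >= threshold:
            break
    return seen

# Multitask evaluation: train on tasks 1..N jointly, test on the same tasks.
# Incremental evaluation (the harder setting argued for here): train on tasks 1..N,
# then report examples_to_learn(model, task_N_plus_1).
```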
1:32:13You've mentioned steps that could be taken toward creating an environment for AI. Do you know what the state of the art is — is anybody using anything with these principles, does such an environment exist? — Well, we built such a simple environment ourselves; we published it last year, and we presented it, together with some baseline models, at a recent conference.
1:32:44It's on GitHub; it's called the communication-based artificial intelligence environment — I think the short name is CommAI-env. It's a pretty silly acronym, nobody really likes it, but we ended up with this one because the full name would be longer. So that's our environment that we published.
1:33:12When it comes to others — well, there was this discussion about OpenAI Universe, which is one; and I think DeepMind published, around the same time, something like DeepMind Lab, for playing games in 3D environments and learning to navigate just by observing pixels. But these environments are, I'd say, quite different, because they again focus on single tasks, or at least they miss this focus on the incrementality of the learning, so I'm not sure there is something comparable to what we have.
1:33:49But of course there are so many researchers that you never know about everything. — Well, that's encouraging for the rest of us.
1:34:16Do you think we have enough data for training, for building language models, and that we should now focus only on algorithms? Or should we also keep growing the data sources and adding textual data?
1:34:32Well, of course the more data you have, the better models you can build, and I would say there is never enough data. If you try to improve all these tasks that I mentioned in the first part of the talk — speech recognition, machine translation, spam detection or whatever — then sure, more data will be good. And the amount of written text data on the web keeps increasing all the time,
1:35:00so I think that in the future we will have even bigger models trained on even more data, and the accuracies of these models will be higher and the perplexities lower; things will keep getting a bit better. There is this argument, I think going back to Shannon, on the question of whether these models would actually be able to capture all the regularities in the language if the amount of data were unlimited and the context grew as well; it basically says that the more data you have, the better you will do, but the gains just keep getting smaller and smaller.
1:35:36And I don't think this is the way to get to AI, because even if you had, say, a billion times more data than you have now, then sure, you would get maybe a two-point improvement in machine translation, which is fine, or maybe one or two percent lower word error rate in speech recognition, but the returns are diminishing, so at some point it just stops being worth doing.
1:36:03Of course, there is also the point that adding data in domains where you only have a very small amount of it today can still bring big gains in accuracy. For English language models, I think it's now mostly just about maximizing the size of the model and of the training data. For other languages there can be more fun; on that side maybe I would have more hope, because there is less data.
1:36:40So yeah, maybe for the Czech language, for example, there is something to be done; all these morphologically rich languages are interesting for various reasons.
1:36:52So the answer is basically yes, more data is good, but if you want to get to AI, then I don't think it will get us there.