0:00:09 Welcome to the next edition of PGS IT — the invited talks on video, graphics and speech. This series is usually run by the graphics people, but today I am happy to invite a very good speech and NLP guy, Tomáš Mikolov. He actually started at this Faculty of IT in 2002. Then, in 2006–2007, he was working on a diploma project on language modeling for Czech — maybe he still remembers something of it. He started his PhD in 2007, on language modeling, and to be frank we did not have much language modeling expertise here, so we kept sending him abroad: he spent considerable time at Johns Hopkins University with Sanjeev Khudanpur and at the University of Montreal with Yoshua Bengio.
0:01:03 He had a very influential paper at Interspeech 2010. That was basically a room like this one, full of senior language modeling people, and Tomáš basically came up and said that his language model works the best. Well, they were smiling — but it worked the best. He eventually defended his PhD in 2012, was immediately hired by Google Brain, and moved to Facebook AI Research in 2014, where he is now a research scientist. He will still be here tomorrow — Tomáš, it's yours for now, and thank you for coming.
0:01:55 Okay, I guess this is fine. Thanks for the introduction. My talk will be a mixture of a lot of small things — I was basically asked to talk about everything — so let's say the topic will be neural networks in NLP.
0:02:24 So, for the introduction: NLP is an important topic for many companies nowadays — Google, Facebook, all these companies that deal with huge text datasets coming either from the web or from the users. You can imagine how much text the users send to Facebook every day, and of course these companies want to do something useful with that text. There is a list of some important applications here, but there are many others — just detecting spam is something important, because users do not want to see spam when they are using these services. So being able to deal with text is basically part of the core business of these companies.
0:03:16 For that, I will be talking about a lot of basic things in the beginning, and then about their extensions using neural networks. The first part will be about unsupervised learning of word representations — the word2vec project — which I think is a very nice, simple introduction. Then supervised text classification: I will not talk about it much; the fastText library we published last year at Facebook extends the word vectors to supervised classification, and again it is quite successful because it is very scalable. Then the recurrent network language models — as was mentioned in the introduction, that is something very common at conferences nowadays. The last part of the talk will be about what we can do next, maybe in the future. Maybe some people here will get started on it: it is not easy to do something better than what we have now, but I think that would be a great goal, and we are trying to do it ourselves at Facebook.
0:04:26 Of course, the companies are very interested in getting better performance. One can focus on incremental improvements, by just taking what exists and trying to make it bigger or tune it, but I will talk about some high-level goals that we are thinking of right now — how to build machines that are really smart. I will not show any solution, because we do not have one, but I think it is good to at least mention the problems we are facing.
0:05:03 So I will start with very basic concepts, because it seems that some people here do not have a big background in machine learning. I will start with basic models of sequences and basic representations of text, and then I will show that neural networks can basically extend and improve all of these representations and models. The artificial neural networks can be seen as a unified framework that is, in some sense, simple to understand — once you know the older concepts, which we need anyway in order to define the input features.
0:05:48 So, the n-grams: that is the standard approach to language modeling, and it is a core technology in many important applications — speech recognizers, machine translation systems — that need to be able to output text. For that you use a statistical model of the language, which is basically what is written on the last line: some sentences are more likely than others. For example, the sentence "this is a sentence" is really going to have a higher probability than the sequence of words "sentence a is this", because that one does not make much sense — and even that should have a higher probability of occurring in English than some random string of characters.
0:06:38 The n-grams are usually estimated from counts, so it is very simple, but look at the first equation and just think about what the probability of a sentence is — it is a very broad concept. If we were able to estimate this probability very well, the model behind it should be able to understand the language — it would actually have to understand the language. For example, I can write an equation here saying that the probability of the sentence "Paris is the capital city of France" should be higher than the probability of "Berlin is the capital city of France", because the second sentence is incorrect. The models we have now can do this a little bit, I would say, but not in a general sense; I will try to get to that at the end of the talk — what the limitations of our best language models are. But just to get the motivation: language modeling is quite interesting, there are a lot of open problems, and if we were able to solve them very well, it would be quite interesting for artificial intelligence research.
0:07:49 And here is how it looks with the techniques that used to be state-of-the-art ten years ago, which were based on n-grams. They are scalable, meaning we can train — estimate — this kind of model from a large corpus very quickly; it is trivial. If you want to compute the probability of a sentence, you just compute the probability of every word: you take some training corpus, count how many times the word appears, and divide by the total word count — that gives you its probability. And then you just multiply the probability of each word given its context. There are some advanced things on top of it, like smoothing and back-off, but this is basically the technique that was state-of-the-art in statistical language modeling for something like thirty years. It looks very simple, but it took people a lot of effort to overcome it convincingly across corpora and domains — and as I will describe later, the winner would be the recurrent networks.
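To make the counting concrete, here is a minimal toy sketch of a bigram model estimated from raw counts (an illustration added here for clarity, not code from the talk; the tiny corpus and function names are invented). Unseen word pairs get zero probability, which is exactly the problem that smoothing and back-off address:

```python
from collections import Counter

def train_bigram_counts(corpus):
    """Count unigrams and bigrams from a list of sentences (plain strings)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    return unigrams, bigrams

def sentence_probability(sentence, unigrams, bigrams):
    """Multiply P(w_t | w_{t-1}) = count(w_{t-1}, w_t) / count(w_{t-1}) over the sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(words[:-1], words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]   # zero if the bigram was never seen
    return p

unigrams, bigrams = train_bigram_counts(["this is a sentence", "this is another sentence"])
print(sentence_probability("this is a sentence", unigrams, bigrams))   # plausible word order
print(sentence_probability("sentence a is this", unigrams, bigrams))   # zero: unseen bigrams
```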
0:08:54 Then, for the basic representations of text: one-hot encoding — the 1-of-N representation — is something very basic that people should know about. Usually, when we want to represent some text, especially in English, we first compute a vocabulary and then represent each word basically as a separate ID. This has some advantages and some disadvantages: it is very simple and easy to understand; the disadvantage is that, as you can see, "Monday" and "Tuesday" have completely orthogonal representations — there is no sharing of parameters — and it is up to the model using these one-hot representations to figure out that the words are related, so that it can generalize better. These are the basic representations, and I will show later that neural networks can give us better, richer vectors, which actually bring nice improvements in many applications.
0:09:59 Bag-of-words representations are then just sums of these one-hot encodings, when we want to represent something longer than a word. For example, if we had this small vocabulary and wanted to represent the sentence "today is a Monday", we would basically take the counts of the words, so the word order is lost — there is nothing special about it. This representation can still be improved by considering the local context, by using bags of bigrams, and even if it may seem surprising, we will see that for many applications — really, most applications nowadays — this is all that is required, even though it is a very simple representation. So that is maybe the challenge for the future.
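As a small illustration of these two representations (a toy sketch added for clarity; the tiny vocabulary is invented for the example):

```python
import numpy as np

vocab = ["today", "is", "a", "monday", "tuesday"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """1-of-N encoding: a vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

def bag_of_words(sentence):
    """Sum of the one-hot vectors: word counts, with word order thrown away."""
    return sum(one_hot(w) for w in sentence.lower().split())

print(one_hot("monday"))                   # [0. 0. 0. 1. 0.]
print(bag_of_words("today is a monday"))   # [1. 1. 1. 1. 0.]
```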
0:10:48 Another important concept is word classes. As I said, many words are related to each other in some way, and one way to exploit this is to define some set of classes: for example Italy, Germany, France, Spain — all these words denote names of countries in Europe, and maybe we can just group them together and call that a class. This is one of the most successful NLP concepts in practice. It was introduced, I think, in the early nineties — one particular paper that I find very nice is from Peter Brown, "Class-based n-gram models of natural language". The classes are computed automatically, again from some training corpus, and the main idea behind it is that words that share contexts — that appear in similar contexts — should belong to the same class. Once you have these classes, we can improve the representation I was showing before, because we can represent each word as its one-hot representation plus a one-hot representation of its class, and then there will be some generalization in the system that is trained on top of this representation.
0:12:13 That was more of a historical overview. There are several other important concepts that people should know about, which are basically the stepping stones to understanding neural networks. The most frequent ones are probably unsupervised dimensionality reduction using principal component analysis and unsupervised clustering with k-means — these algorithms are quite important — and then supervised classification, especially logistic regression, which is very important. I will not describe them in detail, because otherwise I would not finish.
0:12:53 So now I will jump to a quick introduction to neural networks. Again, it will be just a quick overview, so that people can get some idea of what the neural networks actually are. I will try to describe the basic algorithms that people are using all the time, and then I will also try to give a short explanation of what deep learning means, because that term is becoming very popular now and it would be good to know what it is about.
0:13:27 When it comes to neural networks in natural language processing, the motivation is simply to come up with better, more precise techniques than what I was showing before — something better than the bag of words, something better than just the n-grams. How can that be done, and why would it be good? If we can come up with some better representation, then we can get slightly better performance, and that is important for many people: it is important for the companies, because they want to be the best, and it is important for researchers, because they want to publish the most interesting papers and win all kinds of competitions. So it is basically important for everyone to develop the best techniques.
0:14:19uh your own basically looks like uh
0:14:22a is like a mathematical or graphical representation of the of the
0:14:26model it's the simple mathematical model the
0:14:29uh the
0:14:31function that the people didn't really uh the your own so
0:14:35uh the biological neurons and but it's very simplified so i would uh warm about
0:14:40uh yeah giving some parallels between the
0:14:43artificial
0:14:45neurons and the and the biological on your own since it is likely to really
0:14:48about it is very different thing
0:14:50so uh the are concerned you and your own looks like yeah basically they are
0:14:54um
0:14:55uh incoming signals that are coming cut
0:14:58uh to be in your own it's called sign at this uh the time from
0:15:02the biology but uh basically just needed some errors that are something you know
0:15:07uh to be in your own
0:15:08uh it's coming from some other neurons are
0:15:11and uh these signals are multiplied by the
0:15:15by the way that each yeah
0:15:17each year this input arrow results today with one of a small number
0:15:21uh the basic of the weight that multiplies the incoming signal
0:15:25so we had three incoming numbers that
0:15:28and uh they really get a sense together in the uh in your honour
0:15:33after which uh there is the application of the activation function of each yeah um
0:15:38needs to be known in europe you want a proper you wanna or
0:15:42and the simplest one is probably the
0:15:44uh so called the rectified the
0:15:46a linear activation function which is basically just taking my between zero and evaluated that
0:15:51compute so that all the volume that are below zero will basically get a translate
0:15:56the zero
0:15:57and uh
0:15:58this value that we compute is weights a
0:16:00is the output of the your honour in the given find that the and the
0:16:05and uh this uh this output can be connected actually too many pattern your own
0:16:09so it does not be connected
0:16:10one
0:16:12but it's a single number uh goes out of the single in your own
0:16:16and here the creation
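The equation corresponds to something like the following sketch (toy numbers invented for illustration; a bias term is included even though it is not discussed here):

```python
import numpy as np

def neuron(inputs, weights, bias=0.0):
    """A single artificial neuron: weighted sum of the inputs followed by ReLU."""
    pre_activation = np.dot(weights, inputs) + bias
    return max(0.0, pre_activation)          # rectified linear activation

x = np.array([0.5, -1.2, 2.0])               # three incoming signals
w = np.array([0.8, 0.1, 0.4])                # one weight per incoming connection
print(neuron(x, w))                          # a single number, sent on to other neurons
```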
0:16:19 The biological neurons, although they are also connected to other neurons, have so many differences that it does not even make sense to start comparing the two. The artificial neural networks were somewhat inspired by the biological neurons in the beginning, but they are a different thing now. Maybe the name is misleading: people start working with these techniques and start believing that maybe they can just solve artificial intelligence, because after all the model has "neurons" in it — well, that is the logic I sometimes hear, and I think it is really misleading; it is part of the marketing, so just do not take it too seriously. If the name of these artificial neural networks were "non-linear data projections", I think it would maybe be better — but then nobody would use it, because it would not sound interesting, right?
0:17:23 This is the representation of a whole network, when we have multiple of these neurons. Usually there is some structure; this is the typical feed-forward structure, where we have some input layer, which is made of features — it can be the bag-of-words features or the one-hot encoding I was talking about before. So these are the features, and you specify them somehow. Then there is the hidden layer, where the neurons compute their values, and then there is the output layer — again it is the application of the same equations, so nothing special there. The output layer is usually whatever you want the network to be doing — say classification: for example, the input layer is some encoding of a sentence, and at the output layer there can be a classification of whether the sentence is spam or not, so there can be just one neuron there making a binary decision.
0:18:21 The training is done with back-propagation. I will not describe exactly how it works, because it is a lot of math; you can find some nice lectures on the web — on Coursera, I think, there are some nice courses about neural networks — and it would take quite some time to explain. Basically, what we need to do is define some objective function, which says what error the network makes on a particular training example. When we train the network, we show it some input features; we know what output the network should have produced, and we know what the network actually did compute using the current set of weights. Then, using the back-propagation and stochastic gradient descent algorithms, we compute by how much, and in which direction, we should change the weights, so that the next time the network sees the same example it will make a smaller error.
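As a very rough sketch of that idea (a toy example with a single linear unit and squared error, added here for illustration; a real network chains this rule backwards through every layer, which is what back-propagation does):

```python
import numpy as np

def sgd_step(w, x, target, lr=0.1):
    """One stochastic gradient descent step for a linear unit with squared error."""
    prediction = np.dot(w, x)          # what the current weights compute
    error = prediction - target        # how wrong the network is on this example
    gradient = error * x               # d(0.5 * error**2) / dw
    return w - lr * gradient           # move the weights against the gradient

w = np.zeros(3)
x, target = np.array([1.0, 2.0, -1.0]), 0.5
for _ in range(50):                    # showing the same example repeatedly
    w = sgd_step(w, x, target)
print(np.dot(w, x))                    # approaches the target 0.5, i.e. a smaller error
```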
0:19:33 Then there is the simplified graphical representation that is used in some papers, where we do not actually draw all the individual neurons, but just draw boxes with arrows.
0:19:47 Another question is what else has to be decided if one actually wants to implement these networks, because there are these hyper-parameters that the training itself does not choose: what type of activation function to use — there are many of them; how many hidden layers we have and what their sizes are; how they are connected — we can have skip connections, we can have recurrent connections, we can have weight sharing (convolutional networks) — so there are actually quite a lot of choices. Of course I will not describe all of them, because that would take a whole course, but what works for me, and for anyone starting to work with neural networks, is to take some existing setup and play with it by making some modifications and observing what the difference is. Maybe that is the best way to start.
0:20:40 As for deep learning — this popular term — it is basically still the same thing: it is a neural network that has more hidden layers, usually; if there are at least two or three hidden layers, then it is basically called deep learning. Or we can also add some recurrent connections, which make the outputs depend on all the previous input features — that is actually very "deep", because there are many nonlinearities that influence the output of the model. So basically any model that goes through several nonlinearities before it computes the output can be considered deep learning, although some people nowadays probably call everything deep learning, which I think is completely silly.
0:21:37 There was also this controversy, for maybe twenty years, where the common knowledge was that training these deep neural networks is not possible with stochastic gradient descent. When I was a student myself, whatever book I was reading, everybody claimed that training these deep networks simply does not work and that we need to develop some magical algorithms. Actually, that is not the case: people now train deep networks routinely and it just works. It is probably because we have more data, and more computational power, than people had in the nineties; in any case, there is this long chain of successes, starting maybe in 2005 or 2006, where people were able to train deeper and deeper networks.
0:22:35 There is also a mathematical justification of why we actually need the deep models, coming from Seymour Papert and Marvin Minsky in their book Perceptrons. It is very mathematical, but the argument is very interesting: there are functions that we cannot represent efficiently with just a single hidden layer. That is actually the same logic I will be using at the end of the talk, to show that there are functions that even the deep learning models cannot learn efficiently — or maybe cannot represent, unless they are very large. So I would say that the word "deep learning" was invented together with the recent neural network revival, but these ideas are much older — including the motivation of people arguing that we really need to use something else than these simple perceptrons.
0:23:41 So this is the graphical representation, where we basically just have multiple hidden layers — that is about it. The networks can be more complicated than this, if there are some recurrent connections or something of that sort. When it comes to training these deep models, I would even say that it is still an open research problem: when you have a very deep model, it is possible to show in many cases that it can represent solutions to some interesting problems, but the big question is whether a good learning algorithm exists — whether we can actually find that solution when we train the network. That is not always the case, especially for some complex problems; as I will show at the end, when the network is, for example, supposed to learn some complex patterns or structures, then, because there are a lot of local optima, it seems that we would need something better than what we have now.
0:24:50 And now I will be talking about the most basic application of neural networks to text problems, which is how to compute the distributed representations of words, and I will show some nice examples, I think, of linguistic regularities in the vector space.
0:25:13 This is how we can actually train the most basic word vectors, and it is where it started for me: as was mentioned in the introduction, when I was writing my diploma thesis in 2006, this was the first model I implemented. We just try to predict the next word given the previous word, using a simple neural network with one hidden layer. When we train this model on some text corpus, a by-product of this learning is that the matrix of weights between the input layer and the hidden layer will basically contain the word representations in some vector format — the word vector is this row of numbers, the weights from this matrix. And it has interesting properties: for example, it groups words with similar meaning together, so that the vector representations of, say, France and Italy will be close to each other, while, for example, France and China will probably be farther apart — though maybe not. So basically this is the simplest application of the neural networks, and it is kind of fun to play with. Of course it is not perfect — the word vectors coming from this model would not be comparable to the state-of-the-art today — but this is where it started.
0:26:55 Sometimes these word vectors are also called "word embeddings" — I am not completely sure why, but that is the alternative name. Usually the representation has a dimensionality of, say, fifty to one thousand, so each word is represented by, let's say, one hundred floats, and then we train the model. The by-product is similar in purpose to the word classes I showed before — France and Italy can go to the same class — but with word vectors the representations can be much richer, because unlike the word classes we can have multiple degrees of similarity encoded in the vectors, as I will show later. So one thing is that it is fun to have these vectors just to study the language — and that actually increased the overall interest in these techniques — but the other thing is that we can also use them in other applications. For example, Ronan Collobert showed, in his famous paper "Natural Language Processing (Almost) from Scratch", that one can solve many NLP problems at close to state-of-the-art performance by using some pre-trained word vectors.
0:28:19 So the word vectors can basically serve as features for other models, like neural networks, instead of — or in addition to — the one-hot encoding. Historically, there were several models proposed before for training these word representations. Usually people started with the most complicated things — models with many hidden layers — and it was kind of working, so it was considered a big success of deep learning. Well, I was not convinced about it, because I knew from my previous results that vectors from just one hidden layer were already quite good. So I wanted to show that the shallow models — models that do not have many hidden layers, just one — can actually be quite competitive. For that, I needed to be able to compare to the word vectors from other people's approaches, and that was not actually easy, because people were showing results after training their models on different datasets, and these datasets are not public — and if you compare two techniques trained on different data, the comparison is not going to be very good.
0:29:35 One of the interesting properties that I actually used for developing this new evaluation set was that these word vectors can be used for doing simple analogy-like calculations with the words. One can ask, for example, what is "king" minus "man" plus "woman": we take the vector for "king", subtract from it the vector that represents "man", then add the vector that represents "woman", and do a nearest-neighbour search while excluding the input words around this position — and we will find the word "queen", for any reasonably good word vector model. Similarly, we can calculate with the words and answer a lot of questions of this type — it is kind of funny how accurate it can get. The picture below shows that there can be multiple degrees of similarity: "king" is related to "queen" in some way, but it is also related to its plural form, "kings", in some other way, and we want to capture all of these things; the idea that a word is just a member of a single class does not allow us to capture this.
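A sketch of that nearest-neighbour calculation (toy code added for illustration; the vectors below are random placeholders, so the answer is only meaningful once real trained vectors, e.g. from word2vec, are loaded instead):

```python
import numpy as np

vectors = {w: np.random.randn(100) for w in ["king", "man", "woman", "queen"]}  # placeholders

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c),
    excluding the three input words from the search."""
    target = vectors[a] - vectors[b] + vectors[c]
    def cosine(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("king", "man", "woman", vectors))   # "queen" with good trained vectors
```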
0:31:00 For the evaluation, I constructed a dataset with almost twenty thousand questions, basically written by hand and then expanded automatically using permutations. Here are a few examples — I think some of these analogy questions would be quite challenging even for people. Maybe try to answer them yourself: for example, "Athens" is to "Greece" as "Oslo" is to "Norway" — I think that one is quite easy — but the second one, the currency of Angola and the currency of Iran (I think it is the rial), is more complicated. And then there are others that are actually very simple, like brother to sister, or grandson to granddaughter, and so on. So we can measure the performance of different models on these individual questions.
0:32:03 It can actually be scaled up to phrases as well, so that we can compute things like: "New York" is to "New York Times" as "Baltimore" is to — I think — "Baltimore Sun". All these datasets are public; they have been published. Let me go on.
0:32:24 The simple log-linear word vector models I will show in a moment are these; what was kind of the state-of-the-art back then was a model with hidden layers: starting with a context of the N previous words, each encoded as one-of-V, the model predicts the next word by going through a projection layer and a hidden layer. After we apply some tricks so that the output layer can be handled efficiently, the main complexity of this model sits in the dense hidden layer, because we need to touch all of its parameters for every training example — and the model takes ages to train.
0:33:07 What I did was basically to remove the hidden layer and make the projection layer slightly different, and as I will show in a second, it works quite fine. So again, the idea is that we can take the bigram model and just extend it, so that we show the context around the word we are trying to predict, simply sum the word representations at the projection layer, and make the prediction right away. This model will not be able to learn n-grams, so it is not suitable for language modeling, but it is just fine for learning the word vectors.
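A minimal sketch of this continuous bag-of-words idea — sum the context vectors, predict the middle word — added here for illustration with toy dimensions and a full softmax (the released word2vec code is far more optimized and uses the output-layer tricks discussed next):

```python
import numpy as np

def cbow_step(W_in, W_out, context_ids, target_id, lr=0.05):
    """One CBOW training step: sum the context word vectors at the projection layer,
    predict the middle word, and update both weight matrices in place."""
    h = W_in[context_ids].sum(axis=0)         # projection layer is just a sum
    scores = h @ W_out                        # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # softmax over the whole vocabulary
    grad = probs.copy()
    grad[target_id] -= 1.0                    # cross-entropy gradient w.r.t. the scores
    W_in[context_ids] -= lr * (W_out @ grad)  # the same update flows to each context word
    W_out -= lr * np.outer(h, grad)
    return -np.log(probs[target_id])          # loss, just for monitoring

V, D = 1000, 100                              # toy vocabulary size and dimensionality
W_in = 0.01 * np.random.randn(V, D)           # these rows become the word vectors
W_out = 0.01 * np.random.randn(D, V)
print(cbow_step(W_in, W_out, context_ids=[3, 17, 42, 5], target_id=7))
```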
0:33:42 The mirror model to the previous one is the skip-gram model, which tries to predict the context given the current word. The two work quite similarly — if they are tuned properly, the results are comparable.
0:34:00 The training is still the same thing — stochastic gradient descent with back-propagation. The words at the output layer are encoded as one-of-N, the same as for the input layer. We cannot really use the full softmax function in the output layer, which would give a proper probability distribution, because we would have to compute all the output values, and that takes too long. So there are these two fast approximations: one that still keeps the probabilities correctly summing to one, which is the hierarchical softmax; and a second one, which drops the assumption that the model has to be a proper probabilistic model and just takes a bunch of random words as negative examples to be pushed down at the output layer, plus the positive example — and that is all that is done. This is negative sampling, and this second option seems to be preferable.
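A sketch of the negative-sampling update for the skip-gram model (a simplified illustration; in the word2vec paper the negative words are drawn from a smoothed unigram distribution rather than uniformly as here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_negative_sampling_step(W_in, W_out, center_id, context_id, neg_ids, lr=0.05):
    """Push the true context word's score up and a few random 'negative' words down;
    only k+1 output vectors are touched instead of the whole vocabulary."""
    v = W_in[center_id]
    ids = np.array([context_id] + list(neg_ids))
    labels = np.array([1.0] + [0.0] * len(neg_ids))   # positive example, then negatives
    u = W_out[ids]                                    # (k+1, D) output vectors
    grad = sigmoid(u @ v) - labels                    # logistic-loss gradient
    W_in[center_id] -= lr * (grad @ u)
    W_out[ids] -= lr * np.outer(grad, v)

V, D = 1000, 100
W_in = 0.01 * np.random.randn(V, D)
W_out = 0.01 * np.random.randn(V, D)                  # one output vector per word
negatives = np.random.randint(V, size=5)              # 5 random negative words
skipgram_negative_sampling_step(W_in, W_out, center_id=7, context_id=42, neg_ids=negatives)
```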
0:34:53 Another trick, which actually improves the performance quite a lot, is to probabilistically — stochastically — discard the most frequent words. This both speeds up the training and, interestingly, can even improve the accuracy, because we do not need to see billions and billions of examples where we try to relate words like "the" and "is" and so on. These words are not removed from the training set completely, but some proportion of their occurrences is removed, so that their importance is reduced when it comes to the objective function.
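The sub-sampling rule from the word2vec paper can be sketched like this (a toy wrapper added for illustration; t is a threshold, typically around 1e-5, and f(w) is the word's relative frequency in the corpus):

```python
import random

def keep_occurrence(word, freq, t=1e-5):
    """Sub-sampling of frequent words: each occurrence of word w is discarded
    with probability 1 - sqrt(t / f(w)), so frequent words lose most occurrences."""
    p_discard = max(0.0, 1.0 - (t / freq[word]) ** 0.5)
    return random.random() > p_discard

freq = {"the": 0.05, "aardvark": 1e-7}        # relative frequencies (made-up numbers)
print(keep_occurrence("the", freq))           # usually False: most occurrences dropped
print(keep_occurrence("aardvark", freq))      # always True: rare words are always kept
```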
0:35:32 And here is the comparison, as I said, on this analogy dataset: there was this big gap, in both the training time and the accuracy, against whatever had been published before. That is what I wanted to prove — that one does not have to train a full language model to obtain good word representations. These last two lines are the new simple models: they are invariant to the word order, they do not understand n-grams, they just see the single words — and yet they compute very accurate word representations that are actually way better than what people could train before, while the training time went from weeks to minutes, and maybe even seconds.
0:36:18 This is available as open-source code — it is called the word2vec project — and many people find it useful, because they can train it on large datasets of unannotated text and use the result to improve many other applications. So I think it is a nice way to bring in extra knowledge when people are dealing with datasets where there is not a huge number of supervised training examples.
0:36:52 Here are some examples of the nearest neighbours, just to give an idea of how big the gap was between what was state-of-the-art before and after these models were introduced. Take, for example, a fairly infrequent word in English which is still present in the vocabularies of all these models: we can see that the nearest neighbours from the first model barely make any sense; with the second one you can at least get the idea that it is probably the name of some person; while the last one is obviously much better when it comes to the nearest neighbours. Of course, this improvement in quality comes from the fact that the models are trained on much more data and have a larger dimensionality — and that is all possible because the training complexity was reduced by many orders of magnitude.
0:37:50 There are some more fun examples: we can calculate things like sushi is to Japan as bratwurst is to Germany, and so on. It is kind of fun; of course, we do not have to look only at the nearest token, we can look at the top ten tokens. I would not say that it works all the time — maybe sixty percent of the time the nearest neighbours look reasonable — but it is still fun to play with, and there are many pre-trained models now available on the web.
0:38:28 One thing that data scientists actually find useful is that these word vectors can be visualised to get some understanding of what is going on in the dataset they are using. The regularities are so strong that when we train this model on the Google News dataset and then visualise, in two dimensions, the representations of countries and their capital cities, we can actually see the correlation between them: there is a single direction for how to get from a country to its capital city, and even the countries are related to each other in this representation in some interesting way. For example, we can see that the European countries are in one part of the image, the rest of the world is somewhere in the middle, and the Asian countries are more towards the top of the image.
0:39:32 So, for the summary: I think it is always good to ask whether things can be done more simply, and as was shown, not everything has to be deep — neural networks are fine even if we remove many of the hidden layers, especially in the NLP applications. It is a different story, for example, for acoustic modeling or for image classifiers, where I am not aware of any model that can be competitive with the deep models without having many nonlinearities; but for the NLP tasks it is the other way around, so I am not completely convinced that deep learning really works for NLP so far. Maybe in the future we will do better.
0:40:20 There is, though, this extension of the word vectors: instead of predicting the middle word given the context, we can predict a label for the whole sentence using the same algorithms. This is what we published as the fastText library last year; it is very simple, but at the same time very useful. Compared to what people are probably doing nowadays at the deep learning conferences — we did the comparison to some convolutional networks with several hidden layers trained on GPUs — we found that we can get about the same accuracy while being a hundred to a hundred thousand times faster. So I think it always pays off to think about the baselines and to do the simple things first.
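The idea can be sketched like this (a simplified illustration: real fastText also adds n-gram features, trains the embeddings jointly with the classifier, and uses a hierarchical softmax when there are many labels):

```python
import numpy as np

def fasttext_like_predict(W_emb, W_cls, word_ids):
    """Average the embedding vectors of the words in a document,
    then apply a linear classifier with a softmax over the labels."""
    h = W_emb[word_ids].mean(axis=0)          # averaged bag-of-words representation
    scores = h @ W_cls                        # one score per label
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()

V, D, L = 1000, 50, 2                         # vocabulary, embedding size, labels (e.g. spam / not spam)
W_emb = 0.01 * np.random.randn(V, D)
W_cls = 0.01 * np.random.randn(D, L)
print(fasttext_like_predict(W_emb, W_cls, word_ids=[4, 99, 512]))
```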
0:41:15 The next part will be about the recurrent networks. I think it is quite obvious by now that word representations can be obtained easily with shallow networks, but it is a different story for language modeling — there, there actually is some success of deep learning, because the state-of-the-art models nowadays are recurrent, and that is basically this model. Then I will also talk about the limitations of these models.
0:41:42 The history of the recurrent networks is quite long; a lot of people worked on these models — Jeff Elman, Mike Jordan, Michael Mozer and so on — because the model is actually very interesting: it is a simple modification that gives the model some sort of short-term memory. Here is the graphical representation: again, we can take the bigram model and just let the hidden layer be connected also to the hidden-layer state from the previous time step, so that h(t-1) creates this loop in the model. The hidden layer thus sees the features at the input layer plus its own state from the previous time step — which itself saw the state before that, and so on — so basically every prediction depends not only on the current input feature but on the input features from all the time steps seen before. One can say that the hidden layer represents some sort of memory that this model has. There is this interesting paper from Jeff Elman, "Finding Structure in Time", with this sort of motivation.
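One step of such an Elman-style recurrent language model can be sketched like this (toy code with made-up sizes, added for illustration; the tanh nonlinearity and the softmax output follow the usual formulation):

```python
import numpy as np

def rnn_lm_step(x_onehot, h_prev, W_xh, W_hh, W_hy):
    """One time step: the new hidden state mixes the current word with the previous
    hidden state, and the output is a distribution over the next word."""
    h = np.tanh(W_xh @ x_onehot + W_hh @ h_prev)    # recurrent hidden state ("memory")
    scores = W_hy @ h
    probs = np.exp(scores - scores.max())
    return h, probs / probs.sum()

V, H = 1000, 100
W_xh = 0.01 * np.random.randn(H, V)
W_hh = 0.01 * np.random.randn(H, H)
W_hy = 0.01 * np.random.randn(V, H)
h = np.zeros(H)
for word_id in [5, 17, 8]:                          # walking through a sentence
    x = np.zeros(V); x[word_id] = 1.0
    h, next_word_probs = rnn_lm_step(x, h, W_xh, W_hh, W_hy)
```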
0:43:05 After this period, when the recurrent networks were studied, the excitement kind of vanished, because some people started believing that these models — even though they look very good — cannot be trained with SGD. This is a pattern that keeps recurring again and again: whenever people are failing to do something, they claim that it just does not work — and of course they usually turn out to be wrong. The recurrent networks actually train with SGD perfectly well; one just has to use a few small tricks.
0:43:50 So what I did: I showed in 2010 that one can actually train state-of-the-art language models based on the recurrent networks, and it was very easy to apply them to a range of tasks — language modeling, machine translation, speech recognition, data compression and so on. In each of these I was able to improve the existing systems and achieve new state-of-the-art results, sometimes by quite a significant margin. For language modeling, the perplexity reduction over n-grams with an ensemble of several recurrent networks was usually around fifty percent or more, which is quite a lot.
0:44:31 Companies started using the toolkit I published, and really many others as well. Later I was discussing with Yoshua Bengio why the model actually worked for me, when people had tried it before and just could not make it work. There was this problem that I did face at some point: when I was trying to train the network on more and more data, it would start behaving in some chaotic way. The training was unstable, so sometimes it converged and sometimes it did not, and the more data I used, the lower the chance that the network would converge — and mostly the results were just rubbish.
0:45:21 It took me quite a few days of trying to figure out what was going on, and I found that there are some rare cases where the SGD updates align in such a way that the changes of the weights become exponentially larger as they get propagated through the recurrent matrix — they become so huge that the whole weight matrix gets overwritten with these enormous numbers and the network never recovers; it is just broken afterwards. So what I did is the simplest thing one can think of: because these gradient explosions happened only very rarely, I simply clipped the gradients so that they could not become larger than some value — some threshold.
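The fix itself is only a few lines; here is a sketch of it (clipping can be done per element or, as below, by rescaling the gradient norm — the threshold value here is arbitrary):

```python
import numpy as np

def clip_gradient(grad, threshold=15.0):
    """If the gradient explodes, cap it so it can never overwrite the weights
    with huge numbers; applied before every weight update."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```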
0:46:12 It turned out that probably nobody had been aware of this behaviour before; I was describing this idea in 2011. So maybe that was the reason why things did not work for others — I don't know — but as I said, it was definitely not the case that SGD would not work for training these models. It was actually quite easy to obtain pretty good results; one just had to wait quite long for the training, because the models were quite expensive.
0:46:47 The application in the original setup was speech recognition, on a small, simple dataset, and the reduction of the word error rate was over twenty percent relative compared to the best n-gram models. One can see that as the number of neurons in the hidden layer gets bigger — as we scale the size of the model — the perplexity goes down. Perplexity is just a measure of how good the network is at predicting the next word, the lower the better, and the word error rate goes down with it. The best n-gram model, with no count cut-offs, got something like twelve and sixteen point six percent word error rate on the evaluation sets, and with a combination of these recurrent networks we could get to roughly nine and thirteen percent. That was quite a big gain coming just from a change of the language modeling technique, which I think was unheard of before. When I compared these results to other techniques being developed, for example at Johns Hopkins University, where people are usually happy with something like a 0.3 percent improvement of the word error rate, here I could get about three and a half percent absolute. So that was quite an interesting finding.
0:48:21 Another interesting observation was that the more training data was used, the bigger the gain of the recurrent networks over the n-gram models. That was quite the opposite of what Joshua Goodman published in his technical report — I think it was in 2001 — where he did this very famous, very large comparison of all the advanced language modeling techniques that people had proposed for improving language models, and found that they actually helped less and less as more data was used. People took from it that there was little hope that the n-gram models could ever be beaten — well, it turned out that they can be beaten with the recurrent models, so that actually happened, which is quite nice.
0:49:14 And the last graphs are on a large dataset from IBM. It is pretty much the same story, just at a much bigger scale, and with a much better tuned baseline coming from a commercial company. The green line is their best result — around thirteen percent word error rate — and on the x-axis there is the size of the hidden layer of the recurrent networks, so you can see that as the networks get bigger and bigger, the word error rate keeps going down. In the end the experiment was bounded by the computational complexity, because it took many tricks to train the biggest models, and that was quite challenging. Still, I could get another significant relative reduction — and I think I would get even more if I could train bigger models — but already this result was very convincing, and suddenly people from the companies were interested.
0:50:18 Later, the recurrent networks became much more accessible, because implementing the stochastic gradient descent correctly is kind of painful in this model: one has to use the back-propagation through time algorithm, and if you make a mistake there, it is very hard to find it later. So the toolkits are also very useful — maybe the most popular ones now are TensorFlow, Theano and Torch, but there are many others. And using the graphics processing units, people could scale the training to billions of training words, using thousands of neurons, which is quite a bit bigger than what I was using in 2010.
0:51:06 Today the recurrent networks are used in many tasks like speech recognition and machine translation. I think the Google guys published a paper a few months ago where they were investigating how to get the recurrent networks into the production system for Google Translate. I think it will still take some time, but let's hope it will happen, because it would be great — for example for translating from English to Czech, so that finally the morphology would not be as painful as it usually is.
0:51:42 On the other hand, I think the downside is that, because toolkits like TensorFlow and so on make the recurrent networks very easily accessible, people are using them for all kinds of problems that do not really require them — especially when people try to compute representations of sentences or documents. I would always warn people to think about the simpler baselines, because just a bag of n-grams can usually beat these models, or at least be around the same accuracy, when it comes to representations; it is different for language modeling.
0:52:26 So one can ask: what can we do better — do we really need anything more? As I said, the recurrent networks may work pretty well, and sometimes adding more layers helps for some problems and can be used to get better results. Can we build that great language model I mentioned at the beginning — one that would be able to tell us what the capital city of some country is? Maybe we could start with a huge recurrent network. Well, I am not that convinced, because there are very simple things that these models cannot learn — and that is actually an opportunity for new people, a new generation, to develop better models.
0:53:07 A simple pattern that is, for example, very difficult to learn is memorization of a variable-length sequence of symbols — something like asking the network to see a sequence of symbols and then be able to repeat it later. That is something that, in general, nobody can train the recurrent networks to do. There are even simpler patterns: we do not have to memorize the sequence of symbols, we can just do a little bit of counting. We can generate, with a very simple algorithm, sequences with some strong regularity, and see what the recurrent networks can actually learn.
0:53:54 I think people may know, from theoretical computer science, that there are very simple formal languages, like the a^n b^n language, where a block of a-symbols is followed by the same number of b-symbols. We can show the network quite a few examples and train a sequential predictive model — a recurrent network — to predict the next symbol. If it actually learns the pattern, it should be able to predict correctly all the symbols in the second half of each sequence; the first half is not predictable, because the information about how many a's there will be is not available in advance. And this turns out to be quite challenging.
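Generating training data for such a language takes only a couple of lines (a toy sketch added for illustration):

```python
import random

def generate_anbn(max_n=10):
    """One string from the a^n b^n language: after the a's have been seen,
    every following symbol is fully predictable."""
    n = random.randint(1, max_n)
    return "a" * n + "b" * n

print([generate_anbn() for _ in range(3)])   # e.g. ['aabb', 'aaabbb', 'ab']
```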
0:54:40 We could talk about plenty of these tasks that the recurrent networks currently cannot do, and one can get confused about what to focus on: should we study these artificial grammars, how is that related to the real language, and can we show in the end that it would improve some language model? I think these are natural questions.
0:55:03 I think the answer is quite complicated, but what I believe is that it is good to set some big goal in the beginning and then try to define a plan for how to actually accomplish it. So we did write one paper — a roadmap — where we discussed exactly this. We started with the ultimate goal: instead of trying to improve some existing setup, we are trying to define a new setup that would be more like artificial intelligence — something people can see in science fiction, something really exciting — and that is what we actually want to optimize the objective function for, not just some speech recognizers; something more fun.
0:55:56So we thought about which properties of the AI would be really useful for us, and it seems that any useful artificial intelligence would have to be able to somehow communicate with us, hopefully in some natural way.
0:56:15Again, if you look at science fiction movies or books, usually the artificial intelligence is some machine that either is a robot, or can be controlled with voice, or it's some computer that we can interact with. So the embodiment doesn't seem to be necessary, but there needs to be some communication channel so that we can actually state some goal that the AI can then accomplish for us.
0:56:43And if we can communicate with the machines, of course it will help; maybe we could even go beyond programming, because currently the way we communicate with computers, if we want them to do something, is by typing instructions in some programming language. There is no way we can just start talking to the computer and expect it to accomplish a task for us — that's basically not a framework we have now. I think that in the future this will become possible, though it may take a long time, and I think we should start thinking about it, because I don't think we can improve the language models much more with some crazy recurrent architecture.
0:57:23So in the roadmap we described a pretty minimal set of components that we think an intelligent machine should consist of, and then some directions that may actually be good for constructing these machines. The idea is that this is how it looks now, and maybe later we will improve it; we have been discussing it at conferences. The main requirement is that it has to be scalable in many dimensions, so that it can actually grow into full intelligence.
0:57:55The components are, as I said, the ability to communicate, and the ability to set some tasks for the machine so that it will do something useful — some motivation component. Again, that is something normally missing in predictive models like the language models and so on.
0:58:13And then some learning skills, which it seems many current models are missing. For example, long-term memory is not really part of any model we have today: neural networks represent long-term memory in the weight matrices, and those get overwritten — the network keeps receiving gradients from new examples — which is basically not a good model of long-term memory. So we need to do something better.
0:58:44I will go over this quickly, because it would be a long discussion to explain why we think about all these things, but we believe there has to be some incremental structure in how the machine is trained. The training cannot look like how we normally train language models; it seems it has to be done in some incremental way, similar to how humans learn language.
0:59:15And for that we are thinking about some sort of simulated environment that would be used to develop both the algorithms that are missing and then, once we have these algorithms, to train the most basic intelligent machines with the most basic properties we can think of.
0:59:36So this is basically what we are thinking about, and we ran some experiments with it. There are these components: the learner, which stands for the intelligent machine that lives in this environment and can do some actions — but everything is actually very simple, we tried to minimize the complexity. The learner basically receives some input signal, a sequence, and produces an output signal, which is a sequence as well; it also receives some reward, which is used to measure the performance of the learner. And then there is the teacher, which defines the goals and assigns the rewards — and that's it.
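A minimal sketch of that loop — one learner, one teacher, symbols flowing over plain channels and a scalar reward. The class and method names and the toy "repeat the last symbol" task are mine, for illustration only; the published environment is organized differently:

```python
import random

class Teacher:
    """Defines a task, feeds the input channel one character at a time,
    and hands out a scalar reward (toy task: repeat the previous character)."""
    def __init__(self):
        self.last = None
    def step(self, learner_output):
        reward = 1 if (self.last is not None and learner_output == self.last) else 0
        self.last = random.choice("ab")
        return self.last, reward          # next input symbol, reward for last output

class Learner:
    """The machine: one input symbol in, one output symbol out, reward observed."""
    def respond(self, input_symbol, reward):
        return input_symbol               # a trivial policy: echo the input

def run(teacher, learner, steps=1000):
    out, total = None, 0
    for _ in range(steps):
        inp, r = teacher.step(out)
        total += r
        out = learner.respond(inp, r)
    return total / steps                  # average incoming reward

print(run(Teacher(), Learner()))          # the echo learner solves this toy task
```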
1:00:25That's the description of the environment shown on the screen. Of course we want to have artificial teachers as well — we want this to be scalable — so later, once we have a learner that can learn these very simple patterns, the expectation is that the teacher would be replaced by humans: humans would be directly teaching the machine and assigning the rewards. And once the machine gets to some sufficient level, the expectation is that we can start using it for doing something actually useful for us.
1:00:58yeah
1:01:00So the communication is really the core: the learner just has this input channel and this output channel, and all it has to do is figure out what it should be outputting at a given time, given the inputs, to maximize the average incoming reward.
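Written down (my notation, not a formula from the talk), the learner searches for a policy $\pi$ mapping the history of inputs and rewards to the next output symbol so as to maximize the long-run average reward:

$$\pi^{*} = \arg\max_{\pi} \; \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big[\, r_t \mid \pi \,\big]$$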
1:01:18It seems to be quite simple, but of course it is not. This is a graphical representation, just so it's more obvious what we are aiming at: there is an input channel and an output channel; the task specification is given by the teacher — for example, an instruction to go and find an apple; and then the learner, once it has learned the task, simply tells the environment that it moves, and that's how it performs the action. So we don't need a separate actuator for every possible action: the learner can do anything it is allowed to do just by saying it — if it wants to go forward or turn left, it can just say so — and at the end of the task it gets the reward, for instance for finding the apple.
1:02:09We think that learning quickly will be completely crucial here, and that's the same point about incrementality of the learning: when the tasks get more and more complex in some incremental way, the learner should be able to learn from a few examples at most; otherwise we are just brute-forcing the search space. The algorithms we have at the moment would basically break on this type of problem. And as I said before, of course we can get to documents and harder things later, but it still seems too soon, because we don't have algorithms that can deal even with the basic problems.
1:03:00And then, if we have these intelligent machines that can work with the input and output channels, we can of course add the real world as basically an additional input channel that the machine can control; for example, it can send queries through the output channel to the internet and receive the results on the input channel. So the framework is very simple, but it seems to be sufficient for intelligent machines.
1:03:28And here I selected a few things that seem very simple to learn, but you cannot really do them with recurrent networks, even with long short-term memory units in the recurrent networks and all kinds of crazy additions. They are very simple, but they are very challenging to learn even when we have supervision about what the next symbol is; if we tried to learn them just through rewards, it would be even worse.
1:03:58These are the things we believe in — especially the last two. Basically all of these are open research problems, and maybe they even have to be addressed together, so it's quite challenging; but I think it's good for people who are starting their own research to think about challenging problems.
1:04:27As a small step forward, we published a paper where we showed that recurrent networks can actually learn some of these algorithmic patterns if we extend them with a memory structure that the recurrent network learns to control.
1:04:49That actually addresses several of the problems I mentioned before: if this memory is unbounded in size — like a stack, for example — then suddenly the model can, at least theoretically, be Turing-complete, so in principle it can learn to find a representation of any algorithm, which seems to be necessary, and we as humans can do it.
1:05:14It also addresses, or at least could address, the problem I mentioned before with neural networks that keep changing their weight matrices all the time and therefore forget things: if you had this controlled way to grow some memory structures, that could be a way to represent long-term memory better. But as I said, it's just a first step forward.
1:05:43Of course, we later found out that many people had already worked on this idea — the first work with this idea was published back in the nineties — but what is likely novel about our solution is that it is again simpler, and works better, than what people proposed before. So the model looks like this.
1:06:02There is not much complexity: basically, the hidden layer decides which action to take with the stack by producing a softmax, a probability distribution over the actions it can perform. It can either push some value on top of the stack, pop the value from the top of the stack, or decide to do nothing; and of course there can be multiple stacks that the network controls. If it wants to write some specific value, that value again depends on the state of the hidden layer. And the nice thing is that it can be trained with plain stochastic gradient descent, so we don't need to do anything crazy.
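A minimal sketch of the kind of differentiable stack update being described (my simplification in PyTorch, not the published code): the hidden layer emits a softmax over push/pop/no-op plus a value to push, and the stack cells are updated as a soft mixture of the three actions, so the whole thing stays trainable with plain SGD/backprop.

```python
import torch
import torch.nn as nn

class StackRNNCell(nn.Module):
    """Simplified stack-augmented RNN cell (a sketch, not the original implementation)."""

    def __init__(self, input_size, hidden_size, stack_depth=20):
        super().__init__()
        self.rnn = nn.Linear(input_size + hidden_size + 1, hidden_size)  # +1: stack top is fed back
        self.action = nn.Linear(hidden_size, 3)    # push, pop, no-op
        self.push_val = nn.Linear(hidden_size, 1)  # value written on a push

    def forward(self, x, h, stack):
        # stack: (batch, stack_depth); stack[:, 0] is the top element
        top = stack[:, :1]
        h = torch.sigmoid(self.rnn(torch.cat([x, h, top], dim=1)))
        act = torch.softmax(self.action(h), dim=1)             # (batch, 3)
        val = torch.sigmoid(self.push_val(h))                  # (batch, 1)
        push, pop, noop = act[:, :1], act[:, 1:2], act[:, 2:3]

        # Soft (differentiable) stack update: mixture of the three discrete actions.
        pushed = torch.cat([val, stack[:, :-1]], dim=1)                        # after a push
        popped = torch.cat([stack[:, 1:], torch.zeros_like(top)], dim=1)       # after a pop
        new_stack = push * pushed + pop * popped + noop * stack
        return h, new_stack
```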
1:06:48And it seems to work for at least some of these simple synthetic sequences, like the ones shown here — the bold characters are the predictable, deterministic ones — and we could solve basically all of these problems, which was quite interesting.
1:07:14And of course plain recurrent networks cannot do it. The funny thing is that the LSTM models, which were actually originally developed exactly to address these problems, can do it, because they can count thanks to their linear component; so that's sort of cheating, because the model was developed for this particular reason. But we can show that the LSTMs break if we just make the task a bit harder: instead of just requiring counting, we can require memorizing sequences — as I said before, we just show a bunch of characters of variable length that have to be repeated — and that breaks the LSTMs,
1:07:59which, for those who don't know them, are a modification or extension of the basic recurrent network that adds linear units with gated connections — a fairly complicated architecture for getting a more stable memory into the recurrent network so that gradients propagate more smoothly across time. So with the stack model we could solve the memorization task.
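For concreteness, one way to generate the memorization (copy) task being described; the exact formatting is my guess, and only the part after the separator is deterministic, so that is where prediction accuracy would be measured:

```python
import random

def copy_example(min_len=1, max_len=10, alphabet="abc"):
    """A variable-length string, a separator, then the same string repeated."""
    n = random.randint(min_len, max_len)
    s = "".join(random.choice(alphabet) for _ in range(n))
    return s + "|" + s + "."   # e.g. "bac|bac."
```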
1:08:23But then of course one can say that stacks were kind of developed exactly for this type of regularity, so the interesting test for our model was
1:08:37binary addition, which is quite a bit more complicated — and interestingly, it could do that as well. Here we are showing these examples, which are binary inputs: the addition of two binary numbers together with the result, and the recurrent network learns to predict the next symbol as it goes through this stream, so it's like a language model. And it turned out that it actually learned to operate the stacks in a quite complicated way to solve this problem.
1:09:07It actually stores the first number in, I think, two stacks — there is some redundancy, it actually uses three of them for various pieces of information — then it stores the second number, and then it is able to produce the addition of these two numbers correctly. So I think it's quite a funny example.
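Here is one way such training sequences could be generated (a sketch; the exact formatting in the experiments is my guess). Writing the numbers least-significant-bit first is the reversal trick mentioned next, which makes the task easier for a stack-based model:

```python
import random

def addition_example(max_bits=8, reverse=True):
    """One 'a+b=c.' sequence over binary digits, predicted symbol by symbol
    like a language model; only the result (and the separators) is deterministic."""
    a = random.randint(0, 2 ** max_bits - 1)
    b = random.randint(0, 2 ** max_bits - 1)
    fmt = (lambda n: format(n, "b")[::-1]) if reverse else (lambda n: format(n, "b"))
    return fmt(a) + "+" + fmt(b) + "=" + fmt(a + b) + "."
```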
1:09:33Of course, there was a trick we used to help the model: because the stacks push values on top, it's actually much easier to do the memorization of the strings in reverse order, and the same holds for the binary addition.
1:09:53So I wouldn't say that we can actually learn general algorithmic patterns with this model.
1:10:00And of course we could do better if we didn't use just stacks but, for example, tapes or other additional memory structures with all kinds of topologies and so on. But that feels like tweaking the solution towards the task, which doesn't seem great, so I would refer back to the paper I mentioned: try to define the tasks first, before thinking about the solution. In any case, we could show that we can learn interesting and reasonably complex patterns that normal recurrent networks could not learn.
1:10:42And the model is Turing-complete, as I said, and has some sort of longer memory, but it is not the long-term memory we would like to have; it doesn't have the properties we want. So there are still a lot of things that should be tried — let's see what happens in the future.
1:11:05So, for the conclusion of this last part of the talk: I would say that achieving artificial intelligence, which was my motivation when I started my PhD, is something I have so far failed to do, but at least there were these side products that turned out to be useful. I think we first need to think a lot about the goal; I just have the feeling that many people are working hard on the wrong tasks — the tasks are too small and too isolated — and I think it's time to think about something bigger.
1:11:38And a lot of new ideas will be needed to define a framework in which we can develop the AI, the same way as the framework in which the first speech recognizers were built: it also took quite a few years to define how to measure word error rates and so on, and how to annotate the datasets.
1:12:02I think we will basically need to rethink some of the basic concepts that we take for granted now and that are probably wrong — like, for example, the central role of supervised learning in machine learning techniques. I think that has to be revisited, and we have to move to techniques that are much more unsupervised and maybe built on somewhat different principles.
1:12:27And of course, one of the goals of this talk is to motivate more people to think about these problems, because that's how I think we can make progress faster. So I think that was the last slide — thanks for your attention. (applause) So, are there any questions?
1:13:24So my question is: how do you properly define intelligence — not artificial intelligence, just intelligence? And the second question, which is tied to the first one: we know that the Turing machine is limited, it cannot solve everything; so do you believe that intelligence, as you define it, is achievable with your Turing-complete machine?
1:13:48Well, I'm not sure the two questions are actually related — for me these are two separate questions. First, for the definition of intelligence: there are many opinions on this, and I would say that pretty much every researcher defines intelligence in a different way.
1:14:08The most general definition I can think of — it is maybe too philosophical — is basically about the patterns that exist in the universe: we can say that life is basically just some organization of matter that tends to preserve its own form, through evolution and everything.
1:14:39It goes back to old ideas, for example that the universe can be seen as one big cellular automaton and everything we observe is just a consequence of that; then you can see life as just a pattern that exists in this topological structure, and intelligence is just a mechanism that this pattern developed to preserve itself.
1:15:07For the second question: you said that Turing machines are limited — I'm not sure in what sense; maybe you mean that normal computers are not Turing machines in the strict sense.
1:15:24So I don't know exactly which problems you mean that cannot be solved. I was talking more about Turing completeness in the sense that the Turing machine is this concept where there is a finite description for all the patterns in the computational model; if you take a weaker model, like finite state machines, then you know that for some algorithms there does not exist a finite description — for example, you cannot count if you limit yourself to finite state machines.
1:15:59In the context of recurrent networks, I think it gets more confusing, because there have been papers written claiming that recurrent networks are Turing-complete, and one can then draw the conclusion — and I have actually heard people argue exactly this — that since recurrent networks are Turing-complete, they are just fine and should in general be able to learn all these things that I was showing.
1:16:24What I want to say is that when we try to train it with SGD, a normal recurrent network doesn't learn even the counting, and it doesn't even learn something like plain sequence memorization. So that is one thing: what is learnable, which is actually quite different from what can be represented.
1:16:43And if I take the argument of all these people strictly, then I would say that the recurrent networks as we have them now, including the LSTMs, are not Turing-complete, because the proofs of their Turing completeness assume that there is infinity hidden somewhere in the model — usually in the values stored in the neurons. And that does not match the neural networks we are using now: we use 32-bit precision, and you cannot really claim that you can store an infinite
1:17:18amount of information in a single number. It's the same argument as saying that you can save the whole universe in a single number using arithmetic coding — sure you can, but do you actually want this representation in a neural network, where one value stores everything and you have to encode and decode it at every time step? If you model it that way, it makes sense to say that recurrent networks are not Turing-complete; strictly speaking there are versions that maybe are, but they are just not practical — and of course the Turing machine itself is not a very practical model either. So when I talk about Turing completeness, it's in that sense, not about practical implementations.
1:18:00I see that you're thinking a lot about AI creation. There is actually a huge discussion right now in the field about achieving the singularity — about what happens whenever we create a human-level AI that gets connected to the internet. Do you share any of their concerns about a rogue AI, or a superintelligent AI, which could basically do something silly?
1:18:37Well, I have different views on this. I think that this thinking about superintelligence and the singularity is a little like — I don't know what I would relate it to — maybe like asking whether the Chinese, when they invented gunpowder, should have been afraid of the whole world blowing up
1:19:00in some chain reaction. I mean, it's basically just a technology, and we should be aware of it; and the same holds for the state of the research: as I was saying, if you don't want to fool yourself, then it is clear that we cannot yet teach the AI even many very simple things, so talking about the singularity now seems premature to me.
1:19:27Of course, some people argue that the gap between having something that doesn't work at all and suddenly having some intelligence that can improve itself doesn't have to be that big, so maybe we will get these machines sooner than we expect, even if some people are skeptical that it can ever happen.
1:19:47But if I take this argument, then I would say it depends on how we construct these machines. In the framework I was describing, we are supposed to make machines that try to accomplish some goals, and as long as we are able to define the goals for the machines, then for me the machine is basically something that extends your own abilities. If you are sitting in a car, you are able to move much faster than using your own legs, because the car is a physical tool for you.
1:20:22The car just does what you want it to do, because you decide where it should go; it can also knock people down, it can kill someone, but then the driver is responsible. So I think that the AI, even if it is very clever, as long as its only purpose is to accomplish the goals of the human who specifies those goals, is basically an extension of our mental capabilities, the same way cars extend our ability to move.
1:20:56That was just — in your vision, the third step was to let it learn by itself, and that's the tricky part, because at that point you are no longer part of it; it's on the AI's side. — Sorry, which slide was that? The one about the AI connecting to the internet, or just about the — oh yeah, I see.
1:21:35Okay, I don't remember exactly — maybe I didn't understand you correctly. — Actually, the last point was to let the learner learn by itself from other sources, which means you have no control over it.
1:21:49Well, sure, that's the question: if the learner learns from other sources, how far can it drift from the external rewards set by the teacher? You can actually make the same argument about people: they are also born with some kind of internal reward mechanism that was hardcoded, maybe largely by evolution — for example, if you eat sugar you feel happy, or whatever, because it's a hardcoded thing.
1:22:21And that still doesn't prevent people from behaving quite differently once they become adults, because they can, for example, just decide to stop eating sugar and simply not follow the rewards — external or internal, the basic hardcoded rewards that are in the brain.
1:22:42So it's more the question of whether the AI would become so independent that it would have some sort of free will, and you can of course imagine that turning into something bad. But if you think about AI — not a single one, but many of them, and many of them working with us — then my vision is basically that it extends our own abilities, and it's the same as saying that pretty much any piece of technology can be used for good and for bad purposes; it just depends on how we use it.
1:23:23Any other questions?
1:23:24I was wondering whether learning should be more local — without propagating targets through the whole network — something which would work inside the network and would actually change just some subset of the weights, so it wouldn't propagate the information globally, and it would be mostly unsupervised, something like CD. Is anyone using something like that these days?
1:24:05Hmm, I think I have seen something, but I wouldn't be able to give you references because I don't recall them right now. I myself was never very fond of these approaches, because I think they are quite limited.
1:24:21So, well, I don't know. I guess the property that we should be able to get into our models — whether neural networks or something else — is this ability to grow in complexity, and that's something normal neural networks don't have. Once you start giving the network some sort of memory mechanism, the ability to extend its memory structure, that's how I see it.
1:24:46And then the topology allows you to update not all the parameters but just some subset — that's what I was thinking of. But of course that doesn't mean it's the solution; maybe it will come from something else.
1:25:00I just think that even if you go with, let's say, something that does local updates, I would be a bit worried about the model itself being limited in the computational sense. Of course, you can argue that the human brain itself is finite — a finite number of neurons, maybe — but then my final argument would be that as a human you can actually navigate in a topological environment: the environment around you is three-dimensional, it has a topology, and if you want to work something out you can use a piece of paper and so on. So you can see yourself as a finite machine, but within the environment the paper works like the tape in a Turing machine,
1:25:44and then you can actually see the whole system as Turing-complete. So if the model actually starts living in an environment, I think it gains many more abilities — it can also change the environment — and it becomes much more interesting than if you have just a neural network sitting there, observing input vectors and producing output vectors, without being able to control anything like the topology. For example, when I was talking about the stacks, the stack can be seen as a one-dimensional environment that the model lives in and can operate on; a 2D environment is basically just more dimensions, but it's kind of the same thing, and our own world is basically 3D, just really big. If the model is able to influence the state of the world, that changes things; otherwise I think it will be quite limited. So that's kind of my understanding of it.
1:26:48Does the research agenda OpenAI is pursuing have any overlap with the framework that you have suggested? — OpenAI? — Yes.
1:26:57Yeah, those are the guys in California. They recently released OpenAI Universe, I think maybe a month ago, so there is some overlap in the goals, in the sense that they tried to define, I think, a thousand tasks or something of that sort, and they are trying to make machines that — coming from their definition of general AI, I guess — are some sort of machine that can work across a range of tasks, not a single task but many tasks.
1:27:43But it's actually quite crucially different from what I was describing, because there is a difference with incremental or gradual learning — I think there are several other names for it — where you assume that the machine has learned, let's say, N tasks, and when you then try to teach it task N+1, it should be able to learn faster if this new task is related to the old ones; and you can actually measure this, because you can construct these tasks yourself, artificially.
1:28:13From what I have seen so far — and I'm not an expert on what they are doing, maybe they are still changing direction — I think they are trying to solve a bunch of tasks together, which is multitask learning; that is a different thing, and it's actually something today's neural networks can already do — you can approach those problems with them.
1:28:33They try to do it with reinforcement learning, which again is quite challenging, because there you don't state the supervised labels of what the model should be doing; you just give rewards for the correct behaviour. So that part of what they are trying to do is somewhat related to what I was describing.
1:28:55But I don't think multitask learning itself is a big problem, because it can actually just work fine: you can have one neural network recognize speech, do image classification, and do language modeling at the same time, because you represent all these things at the input layer in quite different ways, so they would simply be encoded in different parts of the network.
1:29:17I think their hope is that the tasks will start boosting each other's performance — that if you train the network to do all these things together, it will somehow share the abilities.
1:29:31So let's see what they come up with. From my point of view, I think it's good to try to isolate the biggest problems and try to solve them — I was, for example, showing how to split things into subproblems and trying to get to the core, to the simplest things that these models cannot learn even with one hidden layer, very simple patterns. If instead we try to analyze what is going wrong with the current algorithms by taking a huge dataset of, say, a thousand different problems, training some model on top of it, and then making some claims about whether it works or doesn't work and what went wrong, I think the analysis will be very hard. It will be great, amazing for PR videos,
1:30:18which of course is one of the main things they are after; but beyond that, I don't really see it.
1:30:33So don't you think that multitask training is actually crucial in these things? It can cover a lot of things, and it can learn what not to do instead of just learning what to do. — Well, multitask learning — I'm not calling it a crucial problem or saying it's a problem; I'm just saying it's part of the real-life setting: you never learn just one thing, you always observe many things at once, and if you want to take inspiration from real life —
1:31:06Sure, I mean, that's completely fine. For example, when I was describing this framework with the learner and the teacher and so on, the point is that the teacher would be incrementally teaching the learner, and there can be many tasks — it can be defined the way people usually set up work on multiple tasks, and we have that there too. But it's different whether you assume that you train the model on all the tasks together and then try to measure the performance on those same tasks,
1:31:37or whether you train the model on some tasks and then try to teach it quickly a different task — and that's actually what I think is much more challenging, and what I think we should focus on, because it will be needed. If you just train on a million tasks and then show that the model performs all the combinations very well — maybe because they were in the training set — then I don't really see the point. So of course it's part of the problem to have the AI work on multiple tasks at once, but we also have to make it able to learn new tasks quickly.
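One way to make that distinction concrete (a sketch; the helper names model.update, task.sample, and task.evaluate are hypothetical, not from the talk): in the multitask setting you train on tasks 1..N jointly and test on those same tasks, while in the incremental setting you train on tasks 1..N and then count how many examples are needed to reach a threshold on a held-out task N+1.

```python
def examples_to_learn(model, task, threshold=0.95, max_examples=10_000, batch=10):
    """How many examples the learner needs on a *new* task before it reaches
    the accuracy threshold (lower is better); a proxy for learning speed."""
    seen = 0
    while seen < max_examples:
        model.update(task.sample(batch))     # a few supervised examples (or rewards)
        seen += batch
        if task.evaluate(model) >= threshold:
            break
    return seen

# Multitask evaluation: train on tasks 1..N jointly, test on the same tasks.
# Incremental evaluation (the harder setting argued for here): train on tasks 1..N,
# then report examples_to_learn(model, task_N_plus_1).
```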
1:32:13You've mentioned steps that could be taken toward creating an environment for AI. Do you know what the state of the art is — is anybody using anything with these principles, does such an environment exist? — Well, we built such a simple environment ourselves; we published it last year, and we presented it, together with some baseline models, at a recent conference.
1:32:44It's on GitHub; it's called the communication-based artificial intelligence environment — I think the short name is CommAI-env. It's a pretty silly acronym, nobody really likes it, but we ended up with this one because the full name would be longer. So that's our environment that we published.
1:33:12When it comes to others — well, there was this discussion about OpenAI Universe, which is one; and I think DeepMind published, around the same time, something like DeepMind Lab, for playing games in 3D environments and learning to navigate just by observing pixels. But these environments are, I'd say, quite different, because they again focus on single tasks, or at least they miss this focus on the incrementality of the learning, so I'm not sure there is something comparable to what we have.
1:33:49But of course there are so many researchers that you never know about everything. — Well, that's encouraging for the rest of us.
1:34:16Do you think we have enough data for training, for building language models, and that we should now focus only on algorithms? Or should we also keep growing the data sources and adding textual data?
1:34:32Well, of course the more data you have, the better models you can build, and I would say there is never enough data. If you try to improve all these tasks that I mentioned in the first part of the talk — speech recognition, machine translation, spam detection or whatever — then sure, more data will be good. And the amount of written text data on the web keeps increasing all the time,
1:35:00so I think that in the future we will have even bigger models trained on even more data, and the accuracies of these models will be higher and the perplexities lower; things will keep getting a bit better. There is this argument, I think going back to Shannon, on the question of whether these models would actually be able to capture all the regularities in the language if the amount of data were unlimited and the context grew as well; it basically says that the more data you have, the better you will do, but the gains just keep getting smaller and smaller.
1:35:36And I don't think this is the way to get to AI, because even if you had, say, a billion times more data than you have now, then sure, you would get maybe a two-point improvement in machine translation, which is fine, or maybe one or two percent lower word error rate in speech recognition, but the returns are diminishing, so at some point it just stops being worth doing.
1:36:03Of course, there is also the point that adding data in domains where you only have a very small amount of it today can still bring big gains in accuracy. For English language models, I think it's now mostly just about maximizing the size of the model and of the training data. For other languages there can be more fun; on that side maybe I would have more hope, because there is less data.
1:36:40So yeah, maybe for the Czech language, for example, there is something to be done; all these morphologically rich languages are interesting for various reasons.
1:36:52So the answer is basically yes, more data is good, but if you want to get to AI, then I don't think it will get us there.