Welcome to the next edition of PGS-IT, the invited talk series on video, graphics and speech. The series is run mainly by the graphics and vision people, but today I am happy to invite a very good speech, or NLP, guy: Tomáš Mikolov. Tomáš actually started at this faculty, FIT, in 2002. In 2006 and 2007 he was working on a diploma project on language modeling for Czech — maybe he still remembers something of it. Then in 2007 he started his PhD on language modeling, and to be frank we did not have much language modeling expertise here, so we kept sending him abroad: he spent considerable time at Johns Hopkins University with Sanjeev Khudanpur and at the University of Montreal with Yoshua Bengio. He had a very influential paper at Interspeech 2010 — that was basically a room like this, full of senior language modeling people, and Tomáš basically came up and said that his language model works the best. Well, they were smiling, but it worked the best. He eventually defended his PhD in 2012, was immediately hired by Google Brain, and moved to Facebook AI Research in 2014, where he is now a research scientist. So, Tomáš, the floor is yours now, and thank you for coming.
Is it fine? I guess, okay. I also welcome interaction and questions. My talk will be a mixture of quite a few small things — Honza asked me to talk about everything, so let's hope to cover what I think is important about neural networks in NLP.
So, for the introduction: NLP is an important topic for many companies nowadays — Google, Facebook, all these companies that deal with huge text data sets coming either from the web or from the users. You can imagine how much text the users send to Facebook every day. And of course these companies want to do something useful with the text. There is a list of some important applications here, but there are many others — just detecting spam is something important, because users do not want to see spam when they are using these services. So being able to deal with text is basically at the core of the business of these companies.
I will be talking about a lot of basic things in the beginning, and then their extensions using neural networks. The first part will be about unsupervised learning of word representations — the word2vec project, which I think is a very nice and simple introduction. Then supervised text classification — I will not talk about it much; it is a simple extension we published last year at Facebook that extends the word vectors to supervised classification, and it is quite successful because it is very scalable. Then the recurrent network language model — as Honza mentioned, that is something very common nowadays at the conferences. And the last part of the talk will be about what we can do next, maybe in the future — maybe some people here will get started on it. It is relatively easy to try to do something better than the current state of the art; I think that would be a great goal, and we are trying to do it ourselves. All the big companies are very interested in getting better performance. Of course, one can focus on incremental improvements, by just taking what exists and trying to make it bigger or some such, but I will also talk about some high-level goals that we are thinking of right now — how to build machines that really understand us, really smart models. I will not show any solution, because we do not have it, but I think it is good to at least mention the problems that we are facing.
I will start with very basic concepts, because it seems that people here — not all of them — have a big background in machine learning. So I will start with basic models of sequences and basic representations of text, and then I will show that neural networks can basically extend and improve all of these representations and models. The artificial neural network can be seen as a unified framework that is, in some sense, simple to understand once you know the underlying concepts, and we need those concepts to be able to define the features for these models.
So, the n-grams. That is the standard approach to language modeling, a core technology in many important applications like speech recognizers or machine translation systems — anything that needs to output text somehow. For that you use some statistical model of the language. The idea, basically what is written on the last line, is that some sentences are more likely than others: for example, the sentence "this is a sentence" is going to have a higher probability than the sequence of words "sentence a is this", because that does not make much sense — and even that should have a higher probability of occurring in English than some random string of characters.
The n-grams are usually estimated from counts. It is very simple, but if you look at the first equation and just think about what the probability of a sentence is, it is actually a very broad concept: a model that would be able to estimate this probability very well would have to be able to understand the language. For example, I can write here an equation saying that the probability of the sentence "Paris is the capital city of France" should be higher than the probability of "Berlin is the capital city of France", because the second sentence is incorrect. The models we have now can do this a little bit, I would say, but not in a general sense — I will try to get to the limitations of our best language models at the end of the talk. But just for the motivation: language modeling is quite interesting, there are a lot of open problems, and if we were able to solve them very well, it would possibly be quite interesting for artificial intelligence research.
And here is how it looks with the technique that used to be state of the art maybe ten years ago, which was based on n-grams. It is scalable, meaning we can estimate this model from a lot of data very quickly, and it is trivial to use: if you want to compute the probability of a sentence, you just compute the probability of each word given its context — you get the counts from some training corpus, count how many times the word appeared after the given context, and divide by the count of the context. Then you just multiply these probabilities of each word given its context. There are some advanced things on top of it, like smoothing and back-off, but this is basically the technique that used to be state of the art in statistical language modeling for a few decades. It looks very simple, but it took people a lot of effort to overcome it convincingly across different corpora — and, as I will get to later, what did overcome it were the recurrent networks.
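To make the count-based idea concrete, here is a minimal sketch of a bigram language model; the corpus is a toy example and the add-one smoothing is my own simplification (real systems use Kneser-Ney or similar back-off):

```python
# Count-based bigram language model: P(sentence) = product of P(word | previous word)
from collections import Counter

corpus = ["this is a sentence", "this is another sentence", "a sentence is short"]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    words = ["<s>"] + line.split() + ["</s>"]
    unigrams.update(words[:-1])                 # counts of the contexts
    bigrams.update(zip(words[:-1], words[1:]))  # counts of (context, word) pairs

vocab_size = len(set(w for line in corpus for w in line.split())) + 2

def bigram_prob(prev, word):
    # P(word | prev) estimated from counts, with add-one smoothing
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sentence_prob(sentence):
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(words[:-1], words[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob("this is a sentence"))   # higher probability
print(sentence_prob("sentence a is this"))   # lower probability
```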
Then, for the basic representations of text: the one-hot encoding, or one-of-N representation, is something very basic that people should know about. Usually, when we want to represent some text, especially in English, we first compute a vocabulary and then represent each word basically as a separate ID. That has some advantages and some disadvantages: it is very simple and easy to understand; the disadvantage is that, as you can see, "Monday" and "Tuesday" get completely orthogonal representations. There is no sharing of parameters, and it is up to the model that uses these one-hot representations to figure out that the words are related, so that it can generalize better. These are the basic representations, and I will show later that we can represent words with better, richer vectors, which actually gives nice improvements in many applications.
Bag-of-words representations are then just sums of these one-hot encodings, used when we want to represent something longer than a word. For example, if we have this small vocabulary and we want to represent the sentence "Today is Monday", we basically get the counts of the words — the order of the words in the sentence is lost, nothing special about it is kept. This representation can still be improved by considering the local context, by using bags of bigrams. And even if it may seem surprising, you will see that for many applications — really, most of the applications nowadays — this very simple representation is hard to beat. So that is maybe a challenge for the future.
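Here is a minimal sketch of these two representations; the vocabulary and the sentence are toy examples of my own:

```python
# One-hot and bag-of-words representations over a tiny vocabulary
import numpy as np

vocab = ["today", "is", "monday", "tuesday", "a", "sentence"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

def bag_of_words(sentence):
    # sum of one-hot vectors: word order is lost, only counts remain
    v = np.zeros(len(vocab))
    for w in sentence.lower().split():
        v[word_to_id[w]] += 1.0
    return v

print(one_hot("monday"))                 # [0. 0. 1. 0. 0. 0.]
print(bag_of_words("Today is Monday"))   # [1. 1. 1. 0. 0. 0.]
```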
Another important concept is the word classes. As I said, words that are related should somehow be grouped together, and one way to think of it is to define some set of classes: for example Italy, Germany, France, Spain all denote names of countries in Europe, and maybe we can just group them together and call them one class. This is one of the most successful NLP concepts of the past; it was introduced, I think, in the early nineties — one particularly nice paper is from Peter Brown, "Class-Based n-gram Models of Natural Language". The classes are computed automatically, again from some training corpus, and the main idea behind it is that words that share contexts — that appear in similar contexts — should belong to the same class. Once you have these classes, you can improve the representation I was showing before: we can represent a word as its one-hot representation plus a one-hot representation of its class, so that there is some generalization in the system that is trained on top of this representation.
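A minimal sketch of that class-augmented representation; the class assignments here are a toy example (in practice they would come from Brown clustering or a similar algorithm run on a corpus):

```python
# One-hot word representation concatenated with a one-hot class representation
import numpy as np

vocab = ["italy", "germany", "france", "monday", "tuesday"]
word_class = {"italy": 0, "germany": 0, "france": 0, "monday": 1, "tuesday": 1}
num_classes = 2

def word_plus_class(word):
    v = np.zeros(len(vocab) + num_classes)
    v[vocab.index(word)] = 1.0                 # word-specific part
    v[len(vocab) + word_class[word]] = 1.0     # shared part: the class bit
    return v

# "italy" and "france" now share one active dimension (the country class),
# which gives a model trained on top of this some ability to generalize.
print(word_plus_class("italy"))
print(word_plus_class("france"))
```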
That was more of a historical overview. There are several other important concepts that people should know about, which are basically stepping stones to understanding the neural networks. The most frequent ones are probably unsupervised dimensionality reduction with principal component analysis, unsupervised clustering with k-means, and supervised classification, especially logistic regression — these algorithms are quite important. I will not describe them in detail, because otherwise I would not finish.
So now I will jump to a quick introduction to neural networks. Again, it will be just a quick overview, so that people can get some idea of what neural networks actually are. I will try to describe the basic algorithms that people are using all the time, and then I will also try to give a short explanation of what deep learning means, because that is a buzzword that is becoming very popular now, and it is good to know what it is about.
For neural networks in natural language processing, the motivation is simply to come up with better, more precise techniques than what I was showing before — something better than the bag of words, something better than just the n-grams. How can we do that, and why would we even want it? If we can come up with some better representation, we can get slightly better performance in many applications. That is important for many people: it is important for the companies, because they want to be the best, and it is important for the researchers, because they want to publish the most interesting papers and to win all kinds of competitions. So it is basically important for everyone to develop better techniques.
That is about the motivation. Now, this is how the artificial neuron basically looks. It is a mathematical, or graphical, representation of a simple mathematical model of the function that people believed the biological neuron computes — but it is very simplified, so I would warn against drawing parallels between artificial neurons and biological neurons; they are really very different things. So how does the artificial neuron look? There are incoming signals — the connections are called synapses, a term taken from biology, but basically these are just arrows carrying numbers into the neuron, usually coming from other neurons. These signals are multiplied by weights: each input arrow is associated with one number, the weight that multiplies the incoming signal. So if we have three incoming numbers, they get multiplied by the weights and summed together in the neuron, after which an activation function is applied — this is what is needed to make it a proper neural network. The simplest activation function is probably the so-called rectified linear one, which is basically just taking the maximum of zero and the value we computed, so all values below zero get truncated to zero. The value we compute this way is the output of the neuron at the given time, and this output can be connected as input to many other neurons — it does not have to be connected to just one — but it is a single number that goes out of a single neuron. And here is the equation.
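A minimal sketch of that single neuron — weighted sum of inputs followed by the rectified-linear activation; the weights and inputs are arbitrary example values:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def neuron(inputs, weights, bias=0.0):
    # output = max(0, sum_i w_i * x_i + b)
    return relu(np.dot(weights, inputs) + bias)

x = np.array([0.5, -1.0, 2.0])   # three incoming signals
w = np.array([0.1, 0.4, 0.3])    # one weight per incoming connection
print(neuron(x, w))              # a single number goes out of the neuron
```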
I think that the biological neurons, although they are also connected to other neurons, are so different that it does not even make sense to start comparing the two. The artificial neural networks were somewhat inspired by the biological neurons in the beginning, but it is a different thing now. Maybe the name itself is misleading: people start working on these techniques and start believing that maybe they can reach artificial intelligence just because they have "neurons" in their model — after all, the brain is made of neurons, right? This is the logic I sometimes hear from some of the older professors, and I think it is really misleading; it is part of the marketing, so just don't take it too seriously. If the name of these artificial neural networks were something like "nonlinear data projections", I think it would maybe be more accurate, but then nobody would use it, because it would not sound as interesting.
This is the representation of a whole network, where we have many of these neurons, usually in some structure. This is the typical feed-forward structure: we have the input layer, which is made of the features — it can be the bag-of-words features or the one-hot encoding I was talking about before; the features are specified by us somehow. Then there is the hidden layer, which is what the neurons compute, and then there is the output layer — again just the application of the same equations, so nothing special there. The output layer is usually what we want the network to be doing, for example classification: at the input layer there can be some encoding of a sentence, and at the output layer there can be the classification of whether the sentence is spam or not. So there can be just one neuron there, making a binary decision.
The training is done with back-propagation. I will not describe exactly how it works, because it is a lot of math, and you can find some nice lectures on the web — on Coursera there are some nice courses about neural networks — and it would take quite some time to explain. Basically, what we need to do is define some objective function that says what error the network made on the current training example. When we train the network, we show it some input features, we know what output the network should have produced, and we know what the network actually computed using the current set of weights. Then, using back-propagation and the stochastic gradient descent algorithm, we compute how much, and in what direction, we should change the weights, so that the next time the network sees the same example, it makes a smaller error.
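A minimal sketch of that whole loop — a one-hidden-layer feed-forward network trained with back-propagation and stochastic gradient descent on a toy binary task (say, spam / not spam from a bag-of-words vector); all sizes and data here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 6, 4
W1, b1 = rng.normal(0, 0.1, (n_hidden, n_in)), np.zeros(n_hidden)
w2, b2 = rng.normal(0, 0.1, n_hidden), 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, target, lr=0.1):
    global W1, b1, w2, b2
    # forward pass
    z1 = W1 @ x + b1
    h = np.maximum(0.0, z1)              # ReLU hidden layer
    y = sigmoid(w2 @ h + b2)             # output neuron: P(spam)
    # backward pass: gradients of the cross-entropy loss
    dz2 = y - target
    dW2, db2 = dz2 * h, dz2
    dz1 = dz2 * w2 * (z1 > 0)
    dW1, db1 = np.outer(dz1, x), dz1
    # stochastic gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    w2 -= lr * dW2; b2 -= lr * db2
    return y

x = np.array([1, 1, 0, 0, 2, 0], dtype=float)  # toy bag-of-words features
for _ in range(100):
    y = train_step(x, target=1.0)
print(y)   # moves toward 1.0 as the error gets smaller
```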
Then there is a simplified graphical representation that is used in some papers, where we do not draw all the individual neurons, just boxes with arrows. In this section there are various things that have to be decided if one actually wants to implement such a network — the hyper-parameters that the training does not choose for you: what type of activation function to use (there are many of them), how many hidden layers to have and what their sizes are, how they are connected — we can have skip connections, we can have recurrent connections, we can have weight sharing as in convolutional networks. So there are quite a lot of things, and of course I will not describe all of them, because that would be a whole course. What worked for me, for starting to work with neural networks, is to take some existing setup and try to play with it, making some modifications and observing what the difference is. So maybe that is the best way to start.
For deep learning — this popular term — it is basically still the same thing: it is a neural network that has more hidden layers, usually. If there are at least two or three hidden layers, people basically call it deep learning; or we can also add some recurrent connections, so that the outputs depend on all the previous input features, which is also "deep" in the sense that there are many nonlinearities influencing the output of the model. So basically any model that goes through several nonlinearities before it computes the output can be considered deep learning — although some people nowadays even call logistic regression deep learning, which I think is completely silly. There was also this controversy for, I think, maybe twenty years, where the common knowledge was that training these deep neural networks is not possible with stochastic gradient descent. When I was a student myself, whatever book I was reading, everybody claimed that training these deep networks simply does not work and that we would need to develop some magical algorithms. Actually that is not the case — people now train deep networks normally with SGD and it just works. It is probably because we have more data than what people had in the nineties, and also much more computational power.
There is basically a long chain of successes, starting maybe in 2005 or 2006, where people were able to train some deeper networks. There is also a mathematical justification for why we would need the deep models, coming from Seymour Papert and Marvin Minsky in their book Perceptrons. It is very mathematical, I would say, but the argument is very interesting: there are functions that we cannot represent efficiently with just a single hidden layer. And actually that is the logic I will be using at the end of the talk, to show that there are functions that even the deep learning models cannot learn efficiently — or maybe cannot even represent unless they are very large. So I would say that the word "deep learning" was maybe invented in the neural network community only recently, but these ideas are much older — already back then people argued that we really need to use something else than these simple perceptrons.
Here is the graphical representation — basically just multiple hidden layers. And that is about it; it can be more complicated than this if there are some recurrent connections or something of that sort. But training these deep models — I would even say it is still an open research problem: when you have a very deep model, it is possible to show in many cases that it can represent the solutions to some interesting problems, but whether the greedy approach, the SGD, can actually find that solution when we train the network is not always the case, especially for some complex problems. As I will be showing at the end, when the network has to learn, for example, some complex control of memory structures, then, because there is a lot of local optima, it seems that we need something better than what we have now.
So now I will be talking about the most basic application of neural networks to text problems, which is how to compute distributed representations of words, and I will show some nice examples of linguistic regularities in the vector space.
This is how we can actually train the most basic word vectors. It started with this bigram neural network — back when I was writing my diploma thesis in 2006, this was the first model I implemented: we just try to predict the next word given the previous word, using a simple neural network with one hidden layer. When we train this model on some text corpus, the by-product of the learning is the weight matrix between the input layer and the hidden layer: its rows basically contain the word representations in a vector format, so each word is associated with a row of numbers, the weights from this matrix. And this has interesting properties — for example, it groups words with similar meaning together, so that the vector representations of, say, France and Italy will be close to each other, while, for example, France and China will probably be farther away from each other — or maybe not, depending on the data. So this is basically a simple application of neural networks, and it is kind of fun to play with. Of course it is not perfect; the word vectors coming from this model would not be comparable to the state of the art of today, but already it is a fun place to start.
Sometimes these word vectors are also called word embeddings — I am not completely sure why, but that is the alternative name. Usually the dimensionality of the representation is something like fifty to one thousand, so each word is represented by, say, one hundred floats after we train the model. A nice property is that this generalizes the word classes I mentioned before: France and Italy can go into the same class, but with the word vectors the representations can be much richer, because unlike with the word classes we can have multiple degrees of similarity encoded in the word vectors, as I will show later.
It actually makes sense to do this. One thing is that it is fun to have these vectors just to study the language — that is actually what increased our interest in these techniques — but the other thing is that we can also use them in other applications. For example, Ronan Collobert showed in his famous paper "Natural Language Processing (Almost) from Scratch" that one can achieve state-of-the-art performance on many NLP problems by using some pre-trained word vectors. So the word vectors can basically serve as features for other models like neural networks, instead of, or in addition to, the one-hot encoding.
Historically, there were several models proposed before for training these word representations. Usually people started with the most complicated things — models with many hidden layers — and it was kind of working, so it was considered a big success of deep learning. I was not convinced about that, because I knew from my previous results that just one hidden layer already produces quite good vectors. I wanted to show that the shallow models — models that do not have many hidden layers but just one — can actually be quite competitive. For that I needed to be able to compare my word vectors to other people's approaches, and that was not actually easy, because people were reporting results after training their models on different datasets, those datasets were not public, and if you compare two techniques trained on different data, the comparison is not going to be very meaningful.
One of the interesting properties, which I actually used for developing this evaluation set, is that these word vectors can be used for doing small analogy-like calculations with words. For example, one can ask: when we take the vector for "king", subtract from it the vector that represents "man", add the vector that represents "woman", and do a nearest-neighbour search while excluding the input words around this position, then we will find the word "queen" — for any reasonably good word vectors. Similarly, we can calculate with words many other questions of this type, and it is kind of funny how accurate it can get. The picture below shows that there can basically be multiple degrees of similarity: "king" is related to "queen" in some way, but it is related to its plural form "kings" in some other way, and we would want to capture all these things — the idea that a word would be a member of a single class does not allow us to capture this. So for the evaluation I constructed a dataset with almost twenty thousand questions, basically written by hand and then generated automatically using permutations.
Here are a few examples — I think it would be quite challenging even for people to answer some of these analogy questions. Try to complete, for example: Athens is to Greece as Oslo is to Norway — I think that one is quite easy — but the second one is harder: the currency of Angola is the kwanza, and the currency of Iran is, I think, the rial, so that one is more complicated. And then there are ones that are actually very simple, like brother to sister and grandson to granddaughter, and so on. So we can measure the performance of different models on these analogy questions. It can actually be scaled up to phrases as well, so that we can compute things like: New York is to New York Times as Baltimore is to, I think, Baltimore Sun.
These datasets are public, published with the papers. The simple vector models I will show in a moment should be compared to the model that was kind of the state of the art back in those days — the feed-forward neural network language model, which used two hidden layers: starting with a context of three or four words at the input, it predicts the next word by going through a projection layer and a hidden layer. The main complexity of this model — after we apply some tricks so that we can deal with the huge output-layer matrix — is in the dense hidden layer, because we need to touch all of its parameters for every training example, and the model takes ages to train.
What I did was basically remove the hidden layer and make the projection layer slightly different, and, as I will show later, it actually works quite fine. So again, the idea is that we take the bigram model and extend it: we show the context around the word we are trying to predict, we just sum the word representations at the projection layer, and we make the prediction right away. This model will not be able to learn n-grams, so it is not suitable for language modeling, but it is just fine for learning the word vectors this way. This is the continuous bag-of-words (CBOW) model.
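A minimal sketch of that CBOW idea — sum the context word vectors at the projection layer and predict the middle word directly, with no hidden layer. Sizes are toy values, and a full softmax is used here for clarity, whereas the real tool uses hierarchical softmax or negative sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 100                      # vocabulary size, embedding dimension
W_in = rng.normal(0, 0.1, (V, D))     # input word vectors (the by-product we keep)
W_out = rng.normal(0, 0.1, (D, V))    # output weights

def cbow_step(context_ids, target_id, lr=0.05):
    h = W_in[context_ids].sum(axis=0)         # projection layer: sum of context vectors
    scores = h @ W_out
    p = np.exp(scores - scores.max())
    p /= p.sum()                              # softmax over the whole vocabulary
    grad = p.copy()
    grad[target_id] -= 1.0                    # gradient of cross-entropy w.r.t. scores
    W_out[:] -= lr * np.outer(h, grad)        # update output weights
    W_in[context_ids] -= lr * (W_out @ grad)  # update the context word vectors
    return -np.log(p[target_id])              # loss for this example

loss = cbow_step(context_ids=[3, 17, 42, 99], target_id=7)
print(loss)
```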
Very similar to the CBOW model is the skip-gram model, which tries to predict the context words given the current word. The two work quite similarly and perform comparably. The training is still the same thing: stochastic gradient descent with back-propagation.
The words at the output layer are encoded as one-of-N, the same as at the input layer. We cannot use the usual softmax function in the output layer — the one that gives a proper probability distribution — because we would have to compute all the output values, which would take too long. So there are two fast approximations. One still keeps the probabilities correctly summing to one — that is the hierarchical softmax. The second one actually drops the assumption that the model has to be a proper probabilistic model: it just takes a bunch of words as negative examples to be weighted down at the output layer, plus the positive example, and that is all that is done. This second option, the negative sampling, seems to be preferable.
Another trick that actually improves the performance quite a lot is to probabilistically, or stochastically, discard the most frequent words. This speeds up the training and, interestingly, can even improve the accuracy: we do not want to show the model billions and billions of examples where we try to relate words like "the", "is", "a" and so on. These are not removed from the training set completely, but some proportion of them is removed, so that their importance is reduced when it comes to the objective function.
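A hedged usage sketch with the gensim library (not the original C word2vec tool), showing the knobs just discussed: skip-gram versus CBOW, negative sampling, and subsampling of frequent words. Parameter names follow gensim 4.x (older versions use, for example, `size` instead of `vector_size`), and the corpus file name is an assumption:

```python
from gensim.models import Word2Vec

sentences = [line.split() for line in open("corpus.txt")]  # assumed tokenized corpus

model = Word2Vec(
    sentences,
    vector_size=300,   # dimensionality of the word vectors
    window=5,          # context size around the predicted word
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # number of negative samples per positive example
    sample=1e-3,       # threshold for subsampling very frequent words
    min_count=5,
)

print(model.wv.most_similar("france", topn=5))
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```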
And here is the comparison, as I said, on this analogy dataset: there was a huge gap, both in the training time and in the accuracy, compared to whatever people published before. That is what I wanted to prove — that one does not have to train a full language model to obtain good word representations. The last two lines are these very simple models that are invariant to the word order: they do not understand n-grams, they just see the single words, and still they compute very accurate word representations, actually way better than what people could train before — while the training time goes down from weeks to minutes, and maybe even seconds. This is available as open-source code; it is called the word2vec project. Many people find it useful, because they can train it on their own datasets and improve many other applications. I think it is a nice way to add a few percent of accuracy when you are dealing with datasets where there is not a huge number of supervised training examples.
Here are some examples of the nearest neighbours, just to give an idea of how big the gap was between what was state of the art before and after these models were introduced. Take for example "havel" — that is quite an infrequent word in English, but it is still present in the vocabularies of all these models. We can see that the nearest neighbours from the first model barely make any sense, the second one at least gets the idea that it is probably a name of some person, while the last one obviously gives much better nearest neighbours. Of course, this improvement of the quality comes from the fact that the models were trained on much more data and with a larger dimensionality, and that was only possible because the training complexity was reduced by many orders of magnitude.
There are a few more fun examples: we can calculate things like sushi minus Japan plus Germany, which gives bratwurst, and so on. It is kind of fun — and of course we do not have to look only at the single nearest token, we can look at the top ten tokens. I would not say that it works all the time; maybe sixty percent of the time the nearest words look reasonable. But it is still fun to play with, and there are many pre-trained models now available on the web. One thing that data scientists actually find useful is that these word vectors can be visualized to get some understanding of what is going on in the dataset they are using.
The regularities are so strong that when we train this model on the Google News dataset and then visualize in two dimensions the representations of countries and capital cities, we can actually see the relation between them: there is a single direction for how to get from a country to its capital city. And even the countries are related to each other in this representation in some interesting way: we can see that the European countries, or the Southern European ones, are in one part of the image, the rest of the world is somewhere in the middle, and the Asian countries are more towards the top of the image.
So, for the summary: I think it is always good to think about whether things can be done in a simpler way, and as was shown, not everything has to be deep — the neural networks work fine even if we actually remove many of the hidden layers, especially in the NLP applications. It is a different story for, say, acoustic modeling or image classifiers: there I am not aware of any model that would be able to be competitive with the deep models without having many nonlinearities. But for the NLP tasks it is the other way around, so I am not completely convinced that deep learning actually works for NLP, so far. Maybe in the future we will do better.
There is, though, one straightforward extension of word2vec: instead of predicting the middle word given the context, we can predict a label for the sentence, using the same algorithms. This is what we published as the fastText library last year. It is very simple, but at the same time very useful. Compared to what people are publishing nowadays at the machine learning conferences — we did the comparison to convolutional networks with several hidden layers trained on GPUs — we found out that we can get the same or better accuracy while being a hundred times, sometimes a hundred thousand times, faster. So I think it is always good to think about the baselines and to do the simple things first.
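A hedged usage sketch of the fastText Python bindings for supervised text classification; the file names are assumptions, and the library expects one example per line with labels prefixed by "__label__":

```python
import fasttext

# train.txt lines look like:  __label__spam  win a free phone now
model = fasttext.train_supervised(
    input="train.txt",
    lr=0.5,
    epoch=5,
    wordNgrams=2,      # use a bag of bigrams in addition to single words
)

print(model.predict("win a free phone now"))   # predicted label and probability
print(model.test("test.txt"))                  # (number of examples, precision, recall)
```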
The next part will be about the recurrent networks. I think it is quite obvious by now that word representations can be learned with shallow networks, but it is a different story for language modeling: there, there actually is some success of deep learning, because the state-of-the-art models nowadays are recurrent, and it is basically this model. Then I will also talk about the limitations of these models. The history of the recurrent networks is quite long — a lot of people worked on these models, people like Jeff Elman, Michael Jordan, Mike Mozer and so on — because the model is actually very interesting: it is a simple modification of the feed-forward network that gets some sort of short-term memory into the model.
Here is the graphical representation. Again, we can take the bigram model and just connect the hidden layer to the hidden-layer state from the previous time step — the h(t-1) — which creates a loop in the model. So the hidden layer sees the features from the input layer plus its own state from the previous time step, and that previous state in turn saw the state before it, and so on. Basically, every prediction depends on the whole history of input features presented to the network in the time steps before. So one can say that the hidden layer represents some sort of memory that this model has. There is an interesting paper about this from Jeff Elman, "Finding Structure in Time", if you want to see the original motivation.
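A minimal sketch of that recurrent step — the hidden state at time t depends on the current input and on the hidden state from the previous time step. Sizes and weights are toy values; a real recurrent language model would add an output softmax over the vocabulary and train with back-propagation through time:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 1000, 64                        # vocabulary size, hidden layer size
W_ih = rng.normal(0, 0.1, (H, V))      # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (H, H))      # recurrent hidden-to-hidden weights

def rnn_step(word_id, h_prev):
    x = np.zeros(V)
    x[word_id] = 1.0                   # one-hot encoding of the current word
    # h_t = f(W_ih * x_t + W_hh * h_{t-1})
    return np.tanh(W_ih @ x + W_hh @ h_prev)

h = np.zeros(H)
for word_id in [12, 7, 256, 3]:        # a toy word-id sequence
    h = rnn_step(word_id, h)           # h now summarizes the whole history
print(h[:5])
```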
Well, after this period when the recurrent networks were studied, the excitement kind of went away, because some people started believing that these models, even though they look very good on paper, cannot be trained with SGD. You can see this is a recurring theme: again and again, whenever people fail to make something work with gradient descent, they claim it just does not work — and of course they usually turn out to be wrong. The recurrent networks can actually be trained with SGD normally; one just has to do one small trick.
So what I did — I showed in 2010 that one can actually train state-of-the-art language models based on the recurrent networks, and that it is very easy to apply them to a range of tasks like language modeling, machine translation, speech recognition, data compression and so on. In each of these I was able to improve the existing systems and achieve new state-of-the-art results, sometimes by quite a significant margin: for language modeling, for example, the perplexity reduction over n-grams with an ensemble of recurrent networks was for me usually around fifty percent or more, which is quite a lot. Companies started using this toolkit, and what was surprising to me was how many of them there were.
Then I was looking, together with Yoshua Bengio, at why the model actually worked for me — people had tried to do it before and they just could not make it work. There was this problem that I noticed at some point: as I was trying to train the network on more and more data, the training started misbehaving in some subtle way. The training was unstable — sometimes it converged, sometimes not — and the more data I used, the lower the chance that the network would converge; mostly the results were just rubbish. It took me quite a few days of trying to figure out what was going on, and I found that there are some rare cases where the SGD updates align in such a way that the changes of the weights become exponentially larger as they get propagated through the recurrent matrix. They become so huge that the whole weight matrix gets overwritten with these huge numbers, and the model is ruined.
So what I did was the simplest thing I could think of: because these gradient explosions happen just very rarely, I simply clipped the gradients so that they could not become larger than some value, some threshold. It turned out that probably nobody had been doing this before, or at least nobody was discussing this idea, until around 2011. Maybe that was the reason why things did not work for others — I do not know — but, as I said, it was simply not the case that SGD would not work for training these models. And it was quite easy to obtain pretty good results; one just had to wait fairly long for the training of the models, because they were quite expensive.
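A minimal sketch of that gradient-clipping trick: before the SGD update, rescale or cap the gradient whenever it exceeds a threshold, so the rare exploding updates cannot destroy the weights. The threshold value is an arbitrary example:

```python
import numpy as np

def clip_gradient(grad, threshold=1.0):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)   # keep the direction, cap the size
    return grad

def sgd_update(weights, grad, lr=0.1, threshold=1.0):
    return weights - lr * clip_gradient(grad, threshold)

w = np.array([0.5, -0.2])
huge_grad = np.array([1e6, -2e6])          # a rare "exploding" gradient
print(sgd_update(w, huge_grad))            # weights change only by a bounded step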
Here is the application to my original setup for speech recognition — a small, simple dataset — where the reduction of the word error rate was over twenty percent relative compared to the best n-gram models. One can see that as the number of neurons in the hidden layer gets bigger — so basically as we scale up the size of the model — the perplexity goes down (perplexity is just a measure of how good the network is at predicting the next word; the lower the better), and the word error rate goes down as well. The best n-gram model, the one with no count cutoffs, gets something like twelve and sixteen point six percent word error rate on the evaluation data sets, and with a combination of these recurrent networks we get to something like nine and thirteen percent. That was quite a big gain, coming just from a change of the language modeling technique, which I think was unheard of before: when I compared these results to other techniques being developed, for example at Johns Hopkins University, people there were usually happy with a 0.3 percent improvement of the word error rate, while here I could get something like three and a half percent absolute. So that was a quite interesting finding.
Another interesting observation was that the more training data was used, the bigger the gain of the recurrent networks over the n-gram models. That was quite the opposite of what Joshua Goodman published in his technical report — I think it was in 2001 — where he basically evaluated all these advanced language modeling techniques, maximum entropy models and so on, that were considered for improving language modeling, and found that they actually helped less and less as more data was used. Seeing that, people were also losing hope that the n-gram models could ever be beaten. But with the recurrent networks the opposite actually happened, so that was kind of lucky.
The last graph is from a large dataset from IBM — it is pretty much the same task, just much bigger and with a much better tuned baseline coming from a commercial company. The green line is their best result, something like thirteen percent word error rate, and on the x-axis there is the size of the recurrent network models. You can see that as the networks get bigger and bigger, the word error rate keeps going down. The experiment was limited by the computational complexity — it took many tricks to train the biggest models, and that was quite challenging — but in the end I could get another several percent of relative error reduction, and I think it would have been even more if I could have trained bigger models. Already this result was very convincing, and people from the companies got interested.
Later, recurrent networks became much more accessible, because actually implementing the stochastic gradient descent correctly is kind of painful in this model — one has to use the back-propagation-through-time algorithm, and if you make a mistake there, it is very hard to find it later. So the toolkits are very useful; the most popular ones now are probably TensorFlow, Theano and Torch, but there are many others. Using graphics processing units, people could scale the training to billions of training words, using thousands of neurons — quite a bit bigger than what I was using back then. Today the recurrent networks are used in many tasks like speech recognition and machine translation. I think the Google guys published a paper a few months ago where they investigate how to get the recurrent networks into the production system for Google Translate. I think it will still take some time, but let's hope it will happen, because it would be great — for example for translating from English to Czech, so that finally the morphology would not be as painful as it usually is.
On the other hand, I think the downside is that, because these toolkits like TensorFlow and so on make the recurrent networks very easily accessible, people are using them for all kinds of problems where they are not really required. Especially when people try to compute representations of sentences or documents, I would always warn them to think about the simpler baselines, because just a bag of n-grams can usually beat these models, or at least be around the same accuracy, when it comes to representations. It is different for language modeling.
So one can ask: what can we do better, and do we really need it? As I said, the models we have can work pretty well, and sometimes adding more layers helps for some problems and gives better results. But can we build, say, the grand language model I mentioned at the beginning, one that would be able to tell us what is the capital city of some country? Maybe we could start with recurrent networks, but I am not that convinced, because there are very simple things that these models cannot learn — and that is actually an opportunity for new people, a new generation, to develop better models.
A simple pattern that is, for example, very difficult to learn is memorization of a variable-length sequence of symbols — something like seeing a sequence of symbols and being able to repeat it later. That is something that, in general, nobody can train the recurrent networks to do. There are even simpler patterns: we do not have to memorize a sequence of symbols, we can just ask the network to count. We can generate, with some very simple algorithm, sequences with some strong regularity, and see whether the networks can actually learn it.
I think people know from the computer-science curriculum that there are very simple formal languages, like the a^n b^n language, where there is the same number of a symbols and b symbols. We can show the model quite a few examples and train a sequential predictive model, like a recurrent network, to predict the next symbol. If it can actually count, then it should be able to predict correctly all the symbols in the second half of the sequence — not the beginning, because the information coming from the first part, the number of a's, is not predictable. And this turns out to be quite challenging.
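A minimal sketch of generating such a^n b^n sequences for testing whether a sequence model learned to count: once the first "b" appears, every remaining symbol (the rest of the b's and the end marker) is fully determined, so a model that can count should predict that part perfectly.

```python
import random

def make_example(max_n=20):
    n = random.randint(1, max_n)
    return "a" * n + "b" * n + "."          # "." marks the end of the sequence

def deterministic_positions(seq):
    # indices i such that the symbol at i+1 is fully determined by the prefix
    first_b = seq.index("b")
    return list(range(first_b, len(seq) - 1))

seq = make_example()
print(seq)
print(deterministic_positions(seq))  # positions a counting model must get right
```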
We can list plenty of these tasks that the models currently cannot do, and one can get confused about what we should focus on: should we study these artificial grammars, how is that related to the real language, and even if we solve them, will it in the end improve some language model? I think those are natural questions, and the answer is quite complicated. What I think is that it is good to set some big goal in the beginning and then try to define some plan for how to actually accomplish that goal.
So we wrote one paper where we discuss such an ultimate goal. The point was to start a bit differently: instead of trying to improve some existing setup, we are trying to define a new setup that would be more like artificial intelligence — something like what people can see in the science-fiction movies, something that is really exciting — and that is what we actually want to optimize as the objective, not just some speech recognizer; something more fun.
So we thought about which properties of an AI would be really useful for us, and it seems that any useful artificial intelligence would have to be able to somehow communicate with us, hopefully in some natural way. Again, if you look at the science-fiction movies or books, usually the artificial intelligence is some machine, either a robot that can be controlled with natural language, or some computer that, again, we can interact with. So the embodiment does not seem to be necessary, but there needs to be some communication channel, so that we can actually state some goal and the AI can accomplish that goal for us. If we can communicate with the machines, it will of course help — maybe we could even go beyond programming: currently we communicate with computers by writing sequences of instructions for what we want them to do, and there is no way I can just start talking to a computer and expect it to accomplish a task for me; that is basically not a framework we have now. I think that in the future this will become possible, though it may take a long time, and I think we should start thinking about it, because I do not think we can improve the language models much more with just some crazier recurrent network.
So, in the roadmap we describe a pretty minimal set of components that we think the intelligent machines should consist of, and some of the properties that may actually be needed for constructing these machines. These are the ideas we have now, and maybe later we will improve them — we are only discussing them at the conferences so far. The machine needs to be scalable in many dimensions, so that it will actually be able to grow towards full intelligence as components are added. As I said, it needs the ability to communicate, and we need the ability to set tasks for the machine so that it will do something useful — so, some motivation component; again, that is something normally missing in the predictive models like the language models. And then it needs some learning skills, which, it seems, many of our current models are also missing: for example, long-term memory is not really part of any model we have today. In neural networks, the long-term memory is represented in the weight matrices, and these get overwritten as the network keeps accumulating gradients from new examples — that is basically not a good model for long-term memory. So we need to do something better, basically.
I will go over this quickly, because it would be a long discussion to explain why we actually think about all these things. We think there has to be some incremental structure in how the machine will be trained: it does not seem workable to train it the way we normally train the language models; it seems it has to be trained in some incremental way, similar to the way humans learn the language. And for that we are thinking about some sort of simulated environment, which would be used to develop both the algorithms that are missing and, once we have those algorithms, to train the most basic intelligent machines with the most basic properties we can think of.
So this is basically what we are thinking about, and we want to do quite a few experiments with it. There are a few components. The learner — that stands for the intelligent machine — lives in this environment and can do some actions, but everything is kept very simple; we try to minimize the complexity. It basically receives an input signal, a sequence, and produces an output signal, which is a sequence as well, and it receives a reward, which is used to measure the performance of the learner. Then there is the teacher, which defines the goals and assigns the rewards, and that is it. This is the description of the environment, which is based on strings. Of course we want the teacher to scale as well: later, once we have a learner that can learn these very simple patterns, the expectation is that the teacher would be replaced by humans, so humans would directly be teaching the machine and assigning the rewards. And once the machine gets to some sufficient level, the expectation is that we can start using it for doing something actually useful for us.
So the communication is really the core: the learner just has this input channel and the output channel, and all it has to do is to figure out what it should be outputting at a given time, given the inputs, to maximize the average incoming reward. That seems quite simple, but of course it is not. Here is a graphical representation, just so that what we are aiming at looks more obvious: there is the input channel, the output channel, and the task specification given by the teacher — something like "move and find the apple". Then the learner — here we assume it has already learned how to do this task — says to the environment "I move", and that is how it accomplishes the action. We do not need an actuator for every possible action: the learner can do anything it is allowed to do just by saying it — if it wants to go forward, or if it wants to turn, it can just say so. And at the end of the task it gets the reward, for example for finding the apple.
We think that learning quickly will be absolutely crucial here — that is the same thing I said about the incrementality of the learning. As the tasks get more and more complex in some incremental way, the learner should be able to learn from a few examples at most, not by brute-force searching the space of solutions. The algorithms we have at the moment would basically break on this type of problem — that is what I was arguing before, and of course we will get to concrete arguments later — but it seems hopeless for now, because we do not have algorithms that would be able to deal with even these basic problems.
and then
if we have this uh this intelligent machines and that can
uh were with the input and output channels that of course we can add the
real world basically this additional
input uh
channel that the much again
one troll for example it can give where is that the output channel to the
internet and the received the resulted input so
uh the framework is very simple about the
uh it seems to be sufficient for the intelligent machines
and as i was saying, there are these tasks that seem to be very simple to learn, but you cannot really do it with recurrent networks, even with long short-term memory units, structured recurrent networks and all kinds of crazy things. the tasks themselves are very simple, but they are very challenging to learn, even when we have supervision about what the next symbol is; the networks just try to learn these things by rote memorization, or do something even worse.
so these are the things that we believe are important, especially the last two. basically all of these are open research problems, and maybe they even have to be addressed together, so it's quite challenging, but i think it's good for people who are trying to start their own research to think about challenging problems.
so, as a small step forward, we published a paper showing that recurrent networks can actually learn some of these algorithmic patterns when we extend them with a memory structure that the recurrent network learns to control. that actually addresses several of the problems i mentioned before: if this memory is unbounded in size, like in this example, then suddenly the model can, at least theoretically, be Turing-complete, so it can in principle learn a finite representation of any algorithm, which seems to be necessary, and we as humans can do it. it also addresses, or could address, the problem i mentioned before with neural networks that are changing their weight matrices all the time and therefore forgetting things: if you have this controlled way to grow the memory structures, that could be a way to represent long-term memory better. but as i said, it's just a first step. of course we did find out later that people had already worked on the idea behind this; the first papers with this idea were published, i think, already back in the nineties, but what we found is that our solution is again simpler and works better than what people published before.
so the model looks like this; there's not much complexity. basically the hidden layer decides on the action to take by producing a softmax, a probability distribution over the actions it can perform: it can either push some value on top of the stack, pop the value off the top of the stack, or decide to do nothing with the stack, and of course there can be multiple stacks that the network controls. if it wants to push some specific value, that value again depends on the state of the hidden layer. and the fun thing is that it can actually be trained with SGD, plain stochastic gradient descent, so we don't need to do anything crazy. and it seems to be working for at least some of the simpler synthetic sequences, like the ones here; at least some of them we were able to solve. the bold characters are the predictable, deterministic ones, and we could solve all these problems, which i found quite interesting.
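to make the mechanism concrete, here is a rough sketch of one time step of such a stack-augmented recurrent network: the hidden layer emits a softmax over push, pop and no-op, the pushed value is a function of the hidden state, and the stack update is a soft mixture so that everything stays differentiable and trainable with SGD. this is a simplified illustration in the spirit of the model described above, not the exact published implementation, and the parameter shapes are arbitrary choices.

```python
# simplified single time step of a continuous, differentiable stack controlled
# by the hidden layer of a recurrent network. parameters are random here; in
# the real model they are trained jointly with the RNN using plain SGD.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
hidden_size, stack_depth = 16, 8

W_action = rng.normal(scale=0.1, size=(3, hidden_size))  # push / pop / no-op
W_value  = rng.normal(scale=0.1, size=(hidden_size,))    # value to push

def stack_step(h, stack):
    """Update one continuous stack given the current hidden state h."""
    a = softmax(W_action @ h)                 # probabilities of push, pop, no-op
    d = 1.0 / (1.0 + np.exp(-(W_value @ h)))  # pushed value depends on hidden state
    new_stack = np.empty_like(stack)
    # top element: pushed value, or the element below it (pop), or unchanged (no-op)
    new_stack[0] = a[0] * d + a[1] * stack[1] + a[2] * stack[0]
    # deeper elements shift down on push, up on pop, stay in place on no-op
    for i in range(1, stack_depth - 1):
        new_stack[i] = a[0] * stack[i - 1] + a[1] * stack[i + 1] + a[2] * stack[i]
    new_stack[-1] = a[0] * stack[-2] + a[2] * stack[-1]
    return new_stack

h = rng.normal(size=hidden_size)   # hidden state at some time step
stack = np.zeros(stack_depth)
print(stack_step(h, stack)[:3])
```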
and of course the plain recurrent networks cannot do it. the funny thing is that the LSTM models, which were actually originally developed to address exactly these problems, can do it, because they can count thanks to their linear component. so that is sort of cheating, because the model was developed for this particular reason. of course we can show that the LSTM will also break if we just scale the complexity up a bit: instead of just requiring the model to count, we can require it to start memorizing sequences, as i said before; we really just show it a bunch of characters, with variable length, that have to be repeated, and that already breaks the LSTMs. for people who don't know them, LSTM is a modification, or extension, of the basic recurrent network that adds these linear units with gated connections, basically a more complicated architecture that lets a more stable memory propagate more smoothly across time. so we could solve the memorization task.
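to give an idea of the kind of memorization task meant here, the sketch below generates variable-length strings that have to be reproduced after a delimiter. the alphabet, lengths and delimiter symbol are my own choices for illustration; the exact setup in the experiments may differ.

```python
# toy generator for the variable-length memorization task mentioned above:
# a random string has to be reproduced after a delimiter. the part after '|'
# is fully predictable, so a perfect model must carry the whole string in memory.

import random

def memorization_example(min_len=2, max_len=8, alphabet="abc", reverse=False):
    n = random.randint(min_len, max_len)
    s = "".join(random.choice(alphabet) for _ in range(n))
    target = s[::-1] if reverse else s   # repeating in reverse order is easier for a stack
    return s + "|" + target

for _ in range(3):
    print(memorization_example())
```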
but then of course one can say that the stacks are kind of designed for exactly this type of regularity. so the interesting thing is that our model was also run on a task that is quite a bit more complicated, binary addition, and interestingly it also did quite well there. here we are showing these examples, which are binary inputs: the addition of two binary numbers, together with the result, and the network learns to predict the next symbol in the string, so it's like a language model. and it turned out that it actually could learn to operate the stacks in quite a complicated way to solve this problem: it saves the first number onto a stack, actually with some redundancy, i think it keeps three copies of the previous information, then it reads the second number, and then it is able to produce the addition correctly from these two numbers. so i think it's quite a funny example. of course, there was a hack we used to help the model: because the stacks push values on top, it's actually much easier to do the memorization of the strings in the reverse order, and the same is the case for the binary addition.
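as an illustration, the sketch below generates sequences for this binary-addition task, presented as one string so that a language model can be trained to predict the next symbol, with an option to write the numbers least-significant bit first (the "reverse order" trick mentioned above). the exact formatting used in the original experiments may differ; this is just a plausible toy version.

```python
# toy generator for the binary-addition language-modeling task described above:
# two binary numbers and their sum in one sequence; the sum is deterministic
# given the two operands, so it can in principle be predicted symbol by symbol.

import random

def addition_example(max_bits=6, reverse=True):
    a = random.randint(0, 2 ** max_bits - 1)
    b = random.randint(0, 2 ** max_bits - 1)
    fmt = lambda x: bin(x)[2:]
    a_s, b_s, sum_s = fmt(a), fmt(b), fmt(a + b)
    if reverse:                      # least-significant bit first: easier with stacks
        a_s, b_s, sum_s = a_s[::-1], b_s[::-1], sum_s[::-1]
    return a_s + "+" + b_s + "=" + sum_s + "."

for _ in range(3):
    print(addition_example())
```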
so i wouldn't say that we can actually learn general algorithmic patterns with this model. of course we could do better if we did not use just the stacks but, for example, other additional memory structures with all kinds of topologies and so on, but that seems like tweaking the solution together with the task, which doesn't seem great; i would refer back to the paper i mentioned before: try to define the tasks first, before thinking about the solution. in any case, we could show that we can learn interesting, reasonably complex patterns that the normal recurrent networks couldn't learn, and the model is Turing-complete, as i said, and has some sort of long-term memory. but it's not the long-term memory we would like to have; it does not have the properties that we want. so there are still a lot of things that should be tried, and let's see what will happen in the future.
so, for the conclusion of the last part of the talk: to achieve general machine intelligence, which was my motivation when i started my phd, so far i have failed to do it, but at least there were these side products that happen to be useful. i think that we need to first think a lot about the goal. i have the feeling that many people are working hard on the wrong tasks; the tasks are too small and too isolated, and i think it's time to think about something bigger. there will be a lot of new ideas needed to define a framework in which we can develop AI, in the same way as the framework in which the first speech recognizers were built: it also took quite a few years to define how to measure the word error rates and so on, and how to annotate the data sets. and i think we will basically need to rethink some of the basic concepts that we take for granted now and that are probably wrong, for example the central role of supervised learning in machine learning techniques; i think that has to be revisited in favor of techniques that are much more unsupervised and that come from somewhat different principles. and of course one of the goals of this talk is to motivate more people to think about these problems, because i think that's how we can progress faster. so that's the last slide; thanks for your attention.
so now there is space for questions.
yeah, go ahead.
so my question is: how do you properly define intelligence, not artificial intelligence but just intelligence? and the second question, which is tied to the first one: okay, we know that the Turing machine is limited, it cannot solve everything, so do you believe that intelligence, as you define it, is achievable with your Turing-complete machine?
well, i'm not sure that the two questions are actually related; for me these are two separate questions. first, for the definition of intelligence, there are actually many opinions on this; probably, i would say, pretty much every researcher defines intelligence in a different way. the most general definition that i can think of, and it may be too philosophical, is basically that a pattern that exists in the universe could be thought of as intelligent. we can say that life is basically just some organization of matter that tends to preserve its own form, through evolution and everything. it goes back to old ideas, for example that the universe can be seen as one big automaton, and then everything that we observe is just a consequence of that; and then you can really see life as just a pattern that exists in this topological structure, and intelligence is just a mechanism that this pattern developed to preserve itself.
for the second question, you said that Turing machines are limited; i'm not sure in what sense, maybe you mean that normal computers are not Turing machines in the strict sense, so i don't know which problems you mean that you cannot do with a Turing machine. i was talking more about Turing completeness in the sense that the Turing machine is basically this concept where there is a finite description of all the patterns in this computational model. if you take a weaker model, like finite state machines, you know that for some algorithms there does not exist a finite description; for example, you cannot count if you limit yourself to finite state machines. in the context of recurrent networks i think it gets more confusing, because there have been papers written that claim that recurrent networks are Turing-complete, and in some sense one can make that conclusion, and i have heard people repeat this argument: that recurrent networks are Turing-complete, so they are just fine and they should in general learn all these things that i was showing. what i want to say is that when we try to train it with SGD, a normal recurrent network does not learn even counting, and it does not even learn plain sequence memorization. so that is one thing, what is learnable, and that is actually quite different from what can be represented. and if we take the argument of all these people strictly, then i would say that the recurrent networks as we have them now, including LSTMs, are not Turing-complete, because the proofs of their Turing completeness assume that there is infinity somewhere hidden in the model, usually in the values stored in the neurons. and that does not seem to be the case for the neural networks that we are using now: we use, say, thirty-two-bit precision, and you cannot really store an infinite number of patterns in that. it is the same argument as saying that you can save the whole universe in a single number using arithmetic coding; sure you can, but do you actually want this representation in some neural network, where one value stores everything and you need some complicated encoder and decoder at every time step? if you want a more practical definition, it makes sense to say that recurrent networks are not Turing-complete; strictly speaking there are maybe some versions that are, but they are just not practical. of course the Turing machine is also not a very practical model, but here i am talking about Turing completeness, not about practicality.
yeah, i see that you are thinking a lot about AI creation. there is actually a huge discussion right now in the field about achieving the singularity, about what happens whenever we create a human-level AI which gets connected to the internet. do you share any of those concerns about a rogue AI, or a superintelligent AI, which will basically do something silly?
well, i have different views on this. i think that this thinking about superintelligence and the singularity, i don't know what i would relate it to; it's a bit like asking whether the Chinese, when they discovered gunpowder, should have been afraid of burning up the whole world in some chain reaction. i mean, it is basically just technology, and we should be aware of it, and the same holds when it comes to the state of the research: as i was saying, if you don't want to fool yourself, then it is clear that we cannot teach the AI even many very simple things, so talking about the singularity, i think it's just too far away. of course there are people who argue that the gap between having something that doesn't work at all and suddenly having some intelligence that can improve itself doesn't have to be that big, and that maybe we can achieve this machine sooner than we expect, even if some people are sceptical and think it will come much later. but if i take this argument, then i would say it depends on how we construct this machine. in the framework i was describing, we are supposed to make machines that try to optimize some goals, and as long as we are able to define the goals for the machines, then i would say the machine is basically something that extends your own abilities. if you are sitting in a car, then you are able to move much faster than using your own legs, because the car is a physical tool for you; the car just does what you want it to do because you are steering it. it could go crazy, it could knock over people, it could kill someone, but then the driver is responsible. so i think that the AI, even if it would be very clever, as long as its only purpose is to accomplish the goals of the human who specifies those goals, is basically an extension of our mental capabilities, the same way as cars extend our ability to move.
well, that was just the first step; in your slides there was also this later step where the AI learns by itself, which is the tricky part, because whenever you...
which part do you mean, the one about the AI connecting to the internet, or...?
okay, i don't remember exactly which one, maybe i'm not quoting you correctly, but the last one was to let the AI learn by itself from other sources, which means you no longer have any control.
well, sure, that's a good question: if the learner learns from other sources, how much can it drift away from the external reward. but you can actually make the same argument about people: they are also born with some kind of internal reward mechanism that was hardcoded, maybe largely by evolution; for example, if you eat sugar then you feel happy, or whatever, because these are hardcoded things. and that still doesn't prevent people from behaving quite differently when they become adults, because they can, for example, just decide to stop eating sugar and simply not follow the external, hardcoded rewards that are built into the brain. so it's more a question of whether the AI would become so independent that it would have some sort of free will, and you can of course imagine that turning into something bad. but if you think about AI not as a single thing but as many of them, and many of them working with us, then my vision is basically that it extends our own abilities, and it's the same as saying that pretty much any piece of technology can be used for good and for bad purposes.
any other questions?
i was wondering whether it would be better to have more local learning, not backpropagating through the whole network, but something where learning would change just some subset of the weights instead of propagating the information through the whole model, maybe in an unsupervised way. is someone using something like that these days?
hmm, i think i have seen something like that, but i wouldn't be able to give you references because i don't remember them right now. i was thinking about this myself and didn't find a satisfying answer, because i think it's quite limited. so, well, i don't know; i guess that the property we should be able to get into our models, whether they are neural networks or something else, is this ability to grow in complexity, and that's something that normal neural networks don't have. once you start seeing the network as having some sort of memory mechanism, or the ability to extend its memory structure, i think that's how i see it, then the topology allows you to update not all the parameters but just some subset. so that's what i was thinking of, but of course that doesn't mean that's the solution; maybe it will come from something else.
i just think that if you take the current models and do something that will again just do local updates to them, i would be a bit worried about the model itself being limited in the computational sense. of course, you can point to the human brain: it is finite, it has a finite number of neurons, and at any given time maybe only some neurons are firing. but the final argument from me would be that as a human you can actually navigate in a topological environment: the environment around you is three-dimensional, it has a topology, and if you actually want to do something hard you can use a piece of paper and so on, so you can actually extend your memory as long as you are working in the environment. the environment works like the paper tape in the Turing machine, and then you can see the whole system as Turing-complete. so if the model actually starts living in the environment, i think it gets much more interesting, because it can also change the environment; it becomes much more interesting than if you have just a neural network in a box, observing input vectors and producing output vectors, without being able to control anything, with no topology. for example, when i was talking about the stack RNNs, you can see the stack as a one-dimensional environment that the stack RNN lives in and can operate on, and then a two-dimensional environment is basically just more dimensions, but it's the same kind of thing; and you can keep going up to three dimensions, to the real world, and if you are able to influence the state of the world, great, otherwise i think you will be quite limited. so that's kind of my understanding of this.
does the research agenda OpenAI is pursuing have any overlap with the framework that you have suggested?
OpenAI?
yeah.
yeah, that's the guys in California; they published, i think a month or so ago, something called OpenAI Universe. it somewhat overlaps with our goals in the sense that they defined, i think, a thousand tasks or something of that sort, and they are trying to make machines, coming from their definition of intelligence i guess, some sort of machine that can work across a range of tasks, not a single task. but it's actually quite crucially different from what i was describing, because there is a difference between that and incremental, or gradual, learning, i think there are several other names for it, where you assume that the machine has learned tasks one to n, and then you try to teach it task n plus one, and it should be able to learn it faster if this new task is related to the old ones; and you can actually measure this, because you can construct these subtasks yourself, artificially, and measure it.
from what i have seen so far, and i'm not an expert on what they are doing, maybe they are still changing direction, i think they are trying to just solve a bunch of tasks together, which is multitask learning, and that is a different thing; actually even the current neural networks, which don't have the properties i talked about, can approach those problems. they try to do it with, i think, reinforcement learning, which again is quite challenging, because you don't say directly what the model should be doing; you are just giving rewards for the correct behaviour. so part of what they are trying to do is somewhat related to what i was describing, but i don't think multitask learning as such is a big problem, because it actually just works fine: you can have one network learn to recognize speech and do image classification and language modeling at the same time, because you will represent all these things at the input layer in quite different ways, so they will just be encoded in different parts of the network.
i think their hope is that these tasks will actually start boosting each other's performance: if you train this network to do all these things together, then it will somehow share the abilities. so let's see what will come out of it. from my point of view, i think it's better to try to isolate the biggest problems and try to solve those; i was, for example, giving preference to breaking things down into small subproblems and trying to get to the core, to the simplest things, the ones you can represent even with one hidden layer and that are still very simple. and from my point of view, if we try to analyze what is going wrong with the current algorithms by taking a huge dataset of, say, thousands of different problems, training some model on top of it, and then making some claims about what works, what doesn't work and what went wrong, i think that analysis will be very hard. it will be great for PR videos, which of course is one of the main things that they do, but... let's see.
so don't you think that multitask training is actually crucial in these things, because it can cover a lot of things, and the model can learn what not to do instead of just learning what to do?
well, multitask learning, i'm not saying it's a crucial problem or that it's a problem at all, i'm just saying it's...
it's a part of real life: you never learn just one thing, you always observe many things at once, and if you want to take inspiration from real life...
sure, i mean, that's completely fine. for example, when i was describing this framework with the learner and the teacher and so on, the point is that the teacher would be teaching the learner many things, and these tasks can be quite diverse; that can be defined so that the model works on multiple tasks, and we have that in the framework. but it is different whether you assume that you are training the model on all the tasks together and then you try to measure the performance on the same tasks, or whether you train the model on some tasks and then you try to teach it quickly on different tasks; and that's what i think is much more challenging and what i think we should try to focus on, because it will be needed. if you just train on a million tasks and then always show that the model performs well, maybe it's just because the task was in the training set, so you don't learn much; that was my point. so of course it's part of the problem to have a model that can work on multiple tasks at once, but we want it to learn new tasks quickly.
you've mentioned steps that should be taken toward creating an environment for AI. do you know what the state of the art is in using anything with these principles? does anybody use such an environment?
yes, we actually established such an environment: we built a simple environment, published it last year, and we also presented it, i think at the NIPS conference. it's on GitHub, it's called the communication-based artificial intelligence environment; i think the short name is CommAI-env, with a dash. it's a pretty silly shortcut, nobody likes it, but we ended up with this one because otherwise the name would be too long; that's how it is.
so that's the environment that we published. when it comes to other similar efforts, well, there was this discussion about OpenAI Universe, which is one of them, and i think DeepMind published, around the same conference, something like DeepMind Lab, which is about playing games in three-dimensional environments and learning how to navigate just by observing pixels. those environments are quite different, because they again focus on solving single tasks, without this focus on the incrementality of the learning, so i'm not sure there is something directly comparable to what we have; but then, there are so many researchers out there that you never know.
that's encouraging for the rest of us.
do you think we have enough data for training and building language models, so that now we should focus only on algorithms, or should we also keep adding data sources and collecting more textual data?
well, of course, the more data you have, the better models you can build, and i would say there's never enough data. if you try to improve all these tasks that i mentioned in the first part of the talk, speech recognition, machine translation, spam detection or whatever, then sure, more data will be good, and the amount of written text data on the web is increasing all the time. so i think that in the future we will have even bigger models trained on even more data, and the accuracy of these models will be higher: the perplexities will go down, the word error rates will go down, things will just keep getting a bit better. there is this argument, going back i think to Shannon, on the question of whether our models would actually be able to capture all the regularity in the language if the amount of data were infinite and the context length were unlimited as well, which basically says that the more data you have, the better you will be, but the gains just keep getting smaller and smaller. so i don't think this is the way to get to AI, because even if you had billions of times more data than you have now, sure, you get maybe a two-point improvement in machine translation, and that's fine, or maybe one or two percent lower word error rate in speech recognition, but these are diminishing returns, so at some point it just stops being worth doing. of course, there's also the question of getting more data in domains where we have only a very small amount of data today; there you can of course expect big gains in accuracy. for example, for English language models i think it's now mostly about maximizing the size of the model and how quickly we can train it on the data we already have; whereas for other languages there can be more fun, there i would have more hope, because there is less data. so maybe for the Czech language, or for morphologically rich languages in general, there is something to be done; they are interesting for some reasons. so yes, the answer is basically: yes, more data is good, but if you want to get to AI, i don't think it gets us there.