Welcome to the next edition of PGS-IT, the invited talk series on video, graphics and speech. The series is run mainly by the graphics and vision people, but today I am happy to invite a very good speech, or NLP, guy: Tomáš Mikolov. Tomáš actually started at this faculty, FIT, in 2002. In 2006 and 2007 he was working on a diploma project on language modeling for Czech — maybe he still remembers something of it. Then in 2007 he started his PhD on language modeling, and to be frank we did not have much language modeling expertise here, so we kept sending him abroad: he spent considerable time at Johns Hopkins University with Sanjeev Khudanpur and at the University of Montreal with Yoshua Bengio. He had a very influential paper at Interspeech 2010 — that was basically a room like this, full of senior language modeling people, and Tomáš basically came up and said that his language model works the best. Well, they were smiling, but it worked the best. He eventually defended his PhD in 2012, was immediately hired by Google Brain, and moved to Facebook AI Research in 2014, where he is now a research scientist. So, Tomáš, the floor is yours now, and thank you for coming.
Is it fine? I guess, okay. I also welcome interaction and questions. My talk will be a mixture of quite a few small things — Honza asked me to talk about everything, so let's hope to cover what I think is important about neural networks in NLP.
So, for the introduction: NLP is an important topic for many companies nowadays — Google, Facebook, all these companies that deal with huge text data sets coming either from the web or from the users. You can imagine how much text the users send to Facebook every day. And of course these companies want to do something useful with the text. There is a list of some important applications here, but there are many others — just detecting spam is something important, because users do not want to see spam when they are using these services. So being able to deal with text is basically at the core of the business of these companies.
I will be talking about a lot of basic things in the beginning, and then their extensions using neural networks. The first part will be about unsupervised learning of word representations — the word2vec project, which I think is a very nice and simple introduction. Then supervised text classification — I will not talk about it much; it is a simple extension we published last year at Facebook that extends the word vectors to supervised classification, and it is quite successful because it is very scalable. Then the recurrent network language model — as Honza mentioned, that is something very common nowadays at the conferences. And the last part of the talk will be about what we can do next, maybe in the future — maybe some people here will get started on it. It is relatively easy to try to do something better than the current state of the art; I think that would be a great goal, and we are trying to do it ourselves. All the big companies are very interested in getting better performance. Of course, one can focus on incremental improvements, by just taking what exists and trying to make it bigger or some such, but I will also talk about some high-level goals that we are thinking of right now — how to build machines that really understand us, really smart models. I will not show any solution, because we do not have it, but I think it is good to at least mention the problems that we are facing.
I will start with very basic concepts, because it seems that people here — not all of them — have a big background in machine learning. So I will start with basic models of sequences and basic representations of text, and then I will show that neural networks can basically extend and improve all of these representations and models. The artificial neural network can be seen as a unified framework that is, in some sense, simple to understand once you know the underlying concepts, and we need those concepts to be able to define the features for these models.
So, the n-grams. That is the standard approach to language modeling, a core technology in many important applications like speech recognizers or machine translation systems — anything that needs to output text somehow. For that you use some statistical model of the language. The idea, basically what is written on the last line, is that some sentences are more likely than others: for example, the sentence "this is a sentence" is going to have a higher probability than the sequence of words "sentence a is this", because that does not make much sense — and even that should have a higher probability of occurring in English than some random string of characters.
The n-grams are usually estimated from counts. It is very simple, but if you look at the first equation and just think about what the probability of a sentence is, it is actually a very broad concept: a model that would be able to estimate this probability very well would have to be able to understand the language. For example, I can write here an equation saying that the probability of the sentence "Paris is the capital city of France" should be higher than the probability of "Berlin is the capital city of France", because the second sentence is incorrect. The models we have now can do this a little bit, I would say, but not in a general sense — I will try to get to the limitations of our best language models at the end of the talk. But just for the motivation: language modeling is quite interesting, there are a lot of open problems, and if we were able to solve them very well, it would possibly be quite interesting for artificial intelligence research.
And here is how it looks with the technique that used to be state of the art maybe ten years ago, which was based on n-grams. It is scalable, meaning we can estimate this model from a lot of data very quickly, and it is trivial to use: if you want to compute the probability of a sentence, you just compute the probability of each word given its context — you get the counts from some training corpus, count how many times the word appeared after the given context, and divide by the count of the context. Then you just multiply these probabilities of each word given its context. There are some advanced things on top of it, like smoothing and back-off, but this is basically the technique that used to be state of the art in statistical language modeling for a few decades. It looks very simple, but it took people a lot of effort to overcome it convincingly across different corpora — and, as I will get to later, what did overcome it were the recurrent networks.
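To make the count-based idea concrete, here is a minimal sketch of a bigram language model; the corpus is a toy example and the add-one smoothing is my own simplification (real systems use Kneser-Ney or similar back-off):

```python
# Count-based bigram language model: P(sentence) = product of P(word | previous word)
from collections import Counter

corpus = ["this is a sentence", "this is another sentence", "a sentence is short"]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    words = ["<s>"] + line.split() + ["</s>"]
    unigrams.update(words[:-1])                 # counts of the contexts
    bigrams.update(zip(words[:-1], words[1:]))  # counts of (context, word) pairs

vocab_size = len(set(w for line in corpus for w in line.split())) + 2

def bigram_prob(prev, word):
    # P(word | prev) estimated from counts, with add-one smoothing
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sentence_prob(sentence):
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(words[:-1], words[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob("this is a sentence"))   # higher probability
print(sentence_prob("sentence a is this"))   # lower probability
```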
Then, for the basic representations of text: the one-hot encoding, or one-of-N representation, is something very basic that people should know about. Usually, when we want to represent some text, especially in English, we first compute a vocabulary and then represent each word basically as a separate ID. That has some advantages and some disadvantages: it is very simple and easy to understand; the disadvantage is that, as you can see, "Monday" and "Tuesday" get completely orthogonal representations. There is no sharing of parameters, and it is up to the model that uses these one-hot representations to figure out that the words are related, so that it can generalize better. These are the basic representations, and I will show later that we can represent words with better, richer vectors, which actually gives nice improvements in many applications.
Bag-of-words representations are then just sums of these one-hot encodings, used when we want to represent something longer than a word. For example, if we have this small vocabulary and we want to represent the sentence "Today is Monday", we basically get the counts of the words — the order of the words in the sentence is lost, nothing special about it is kept. This representation can still be improved by considering the local context, by using bags of bigrams. And even if it may seem surprising, you will see that for many applications — really, most of the applications nowadays — this very simple representation is hard to beat. So that is maybe a challenge for the future.
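Here is a minimal sketch of these two representations; the vocabulary and the sentence are toy examples of my own:

```python
# One-hot and bag-of-words representations over a tiny vocabulary
import numpy as np

vocab = ["today", "is", "monday", "tuesday", "a", "sentence"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

def bag_of_words(sentence):
    # sum of one-hot vectors: word order is lost, only counts remain
    v = np.zeros(len(vocab))
    for w in sentence.lower().split():
        v[word_to_id[w]] += 1.0
    return v

print(one_hot("monday"))                 # [0. 0. 1. 0. 0. 0.]
print(bag_of_words("Today is Monday"))   # [1. 1. 1. 0. 0. 0.]
```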
Another important concept is the word classes. As I said, words that are related should somehow be grouped together, and one way to think of it is to define some set of classes: for example Italy, Germany, France, Spain all denote names of countries in Europe, and maybe we can just group them together and call them one class. This is one of the most successful NLP concepts of the past; it was introduced, I think, in the early nineties — one particularly nice paper is from Peter Brown, "Class-Based n-gram Models of Natural Language". The classes are computed automatically, again from some training corpus, and the main idea behind it is that words that share contexts — that appear in similar contexts — should belong to the same class. Once you have these classes, you can improve the representation I was showing before: we can represent a word as its one-hot representation plus a one-hot representation of its class, so that there is some generalization in the system that is trained on top of this representation.
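A minimal sketch of that class-augmented representation; the class assignments here are a toy example (in practice they would come from Brown clustering or a similar algorithm run on a corpus):

```python
# One-hot word representation concatenated with a one-hot class representation
import numpy as np

vocab = ["italy", "germany", "france", "monday", "tuesday"]
word_class = {"italy": 0, "germany": 0, "france": 0, "monday": 1, "tuesday": 1}
num_classes = 2

def word_plus_class(word):
    v = np.zeros(len(vocab) + num_classes)
    v[vocab.index(word)] = 1.0                 # word-specific part
    v[len(vocab) + word_class[word]] = 1.0     # shared part: the class bit
    return v

# "italy" and "france" now share one active dimension (the country class),
# which gives a model trained on top of this some ability to generalize.
print(word_plus_class("italy"))
print(word_plus_class("france"))
```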
That was more of a historical overview. There are several other important concepts that people should know about, which are basically stepping stones to understanding the neural networks. The most frequent ones are probably unsupervised dimensionality reduction with principal component analysis, unsupervised clustering with k-means, and supervised classification, especially logistic regression — these algorithms are quite important. I will not describe them in detail, because otherwise I would not finish.
So now I will jump to a quick introduction to neural networks. Again, it will be just a quick overview, so that people can get some idea of what neural networks actually are. I will try to describe the basic algorithms that people are using all the time, and then I will also try to give a short explanation of what deep learning means, because that is a buzzword that is becoming very popular now, and it is good to know what it is about.
For neural networks in natural language processing, the motivation is simply to come up with better, more precise techniques than what I was showing before — something better than the bag of words, something better than just the n-grams. How can we do that, and why would we even want it? If we can come up with some better representation, we can get slightly better performance in many applications. That is important for many people: it is important for the companies, because they want to be the best, and it is important for the researchers, because they want to publish the most interesting papers and to win all kinds of competitions. So it is basically important for everyone to develop better techniques.
That is about the motivation. Now, this is how the artificial neuron basically looks. It is a mathematical, or graphical, representation of a simple mathematical model of the function that people believed the biological neuron computes — but it is very simplified, so I would warn against drawing parallels between artificial neurons and biological neurons; they are really very different things. So how does the artificial neuron look? There are incoming signals — the connections are called synapses, a term taken from biology, but basically these are just arrows carrying numbers into the neuron, usually coming from other neurons. These signals are multiplied by weights: each input arrow is associated with one number, the weight that multiplies the incoming signal. So if we have three incoming numbers, they get multiplied by the weights and summed together in the neuron, after which an activation function is applied — this is what is needed to make it a proper neural network. The simplest activation function is probably the so-called rectified linear one, which is basically just taking the maximum of zero and the value we computed, so all values below zero get truncated to zero. The value we compute this way is the output of the neuron at the given time, and this output can be connected as input to many other neurons — it does not have to be connected to just one — but it is a single number that goes out of a single neuron. And here is the equation.
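A minimal sketch of that single neuron — weighted sum of inputs followed by the rectified-linear activation; the weights and inputs are arbitrary example values:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def neuron(inputs, weights, bias=0.0):
    # output = max(0, sum_i w_i * x_i + b)
    return relu(np.dot(weights, inputs) + bias)

x = np.array([0.5, -1.0, 2.0])   # three incoming signals
w = np.array([0.1, 0.4, 0.3])    # one weight per incoming connection
print(neuron(x, w))              # a single number goes out of the neuron
```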
I think that the biological neurons, although they are also connected to other neurons, are so different that it does not even make sense to start comparing the two. The artificial neural networks were somewhat inspired by the biological neurons in the beginning, but it is a different thing now. Maybe the name itself is misleading: people start working on these techniques and start believing that maybe they can reach artificial intelligence just because they have "neurons" in their model — after all, the brain is made of neurons, right? This is the logic I sometimes hear from some of the older professors, and I think it is really misleading; it is part of the marketing, so just don't take it too seriously. If the name of these artificial neural networks were something like "nonlinear data projections", I think it would maybe be more accurate, but then nobody would use it, because it would not sound as interesting.
This is the representation of a whole network, where we have many of these neurons, usually in some structure. This is the typical feed-forward structure: we have the input layer, which is made of the features — it can be the bag-of-words features or the one-hot encoding I was talking about before; the features are specified by us somehow. Then there is the hidden layer, which is what the neurons compute, and then there is the output layer — again just the application of the same equations, so nothing special there. The output layer is usually what we want the network to be doing, for example classification: at the input layer there can be some encoding of a sentence, and at the output layer there can be the classification of whether the sentence is spam or not. So there can be just one neuron there, making a binary decision.
The training is done with back-propagation. I will not describe exactly how it works, because it is a lot of math, and you can find some nice lectures on the web — on Coursera there are some nice courses about neural networks — and it would take quite some time to explain. Basically, what we need to do is define some objective function that says what error the network made on the current training example. When we train the network, we show it some input features, we know what output the network should have produced, and we know what the network actually computed using the current set of weights. Then, using back-propagation and the stochastic gradient descent algorithm, we compute how much, and in what direction, we should change the weights, so that the next time the network sees the same example, it makes a smaller error.
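A minimal sketch of that whole loop — a one-hidden-layer feed-forward network trained with back-propagation and stochastic gradient descent on a toy binary task (say, spam / not spam from a bag-of-words vector); all sizes and data here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 6, 4
W1, b1 = rng.normal(0, 0.1, (n_hidden, n_in)), np.zeros(n_hidden)
w2, b2 = rng.normal(0, 0.1, n_hidden), 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, target, lr=0.1):
    global W1, b1, w2, b2
    # forward pass
    z1 = W1 @ x + b1
    h = np.maximum(0.0, z1)              # ReLU hidden layer
    y = sigmoid(w2 @ h + b2)             # output neuron: P(spam)
    # backward pass: gradients of the cross-entropy loss
    dz2 = y - target
    dW2, db2 = dz2 * h, dz2
    dz1 = dz2 * w2 * (z1 > 0)
    dW1, db1 = np.outer(dz1, x), dz1
    # stochastic gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    w2 -= lr * dW2; b2 -= lr * db2
    return y

x = np.array([1, 1, 0, 0, 2, 0], dtype=float)  # toy bag-of-words features
for _ in range(100):
    y = train_step(x, target=1.0)
print(y)   # moves toward 1.0 as the error gets smaller
```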
Then there is a simplified graphical representation that is used in some papers, where we do not draw all the individual neurons, just boxes with arrows. In this section there are various things that have to be decided if one actually wants to implement such a network — the hyper-parameters that the training does not choose for you: what type of activation function to use (there are many of them), how many hidden layers to have and what their sizes are, how they are connected — we can have skip connections, we can have recurrent connections, we can have weight sharing as in convolutional networks. So there are quite a lot of things, and of course I will not describe all of them, because that would be a whole course. What worked for me, for starting to work with neural networks, is to take some existing setup and try to play with it, making some modifications and observing what the difference is. So maybe that is the best way to start.
For deep learning — this popular term — it is basically still the same thing: it is a neural network that has more hidden layers, usually. If there are at least two or three hidden layers, people basically call it deep learning; or we can also add some recurrent connections, so that the outputs depend on all the previous input features, which is also "deep" in the sense that there are many nonlinearities influencing the output of the model. So basically any model that goes through several nonlinearities before it computes the output can be considered deep learning — although some people nowadays even call logistic regression deep learning, which I think is completely silly. There was also this controversy for, I think, maybe twenty years, where the common knowledge was that training these deep neural networks is not possible with stochastic gradient descent. When I was a student myself, whatever book I was reading, everybody claimed that training these deep networks simply does not work and that we would need to develop some magical algorithms. Actually that is not the case — people now train deep networks normally with SGD and it just works. It is probably because we have more data than what people had in the nineties, and also much more computational power.
There is basically a long chain of successes, starting maybe in 2005 or 2006, where people were able to train some deeper networks. There is also a mathematical justification for why we would need the deep models, coming from Seymour Papert and Marvin Minsky in their book Perceptrons. It is very mathematical, I would say, but the argument is very interesting: there are functions that we cannot represent efficiently with just a single hidden layer. And actually that is the logic I will be using at the end of the talk, to show that there are functions that even the deep learning models cannot learn efficiently — or maybe cannot even represent unless they are very large. So I would say that the word "deep learning" was maybe invented in the neural network community only recently, but these ideas are much older — already back then people argued that we really need to use something else than these simple perceptrons.
Here is the graphical representation — basically just multiple hidden layers. And that is about it; it can be more complicated than this if there are some recurrent connections or something of that sort. But training these deep models — I would even say it is still an open research problem: when you have a very deep model, it is possible to show in many cases that it can represent the solutions to some interesting problems, but whether the greedy approach, the SGD, can actually find that solution when we train the network is not always the case, especially for some complex problems. As I will be showing at the end, when the network has to learn, for example, some complex control of memory structures, then, because there is a lot of local optima, it seems that we need something better than what we have now.
So now I will be talking about the most basic application of neural networks to text problems, which is how to compute distributed representations of words, and I will show some nice examples of linguistic regularities in the vector space.
This is how we can actually train the most basic word vectors. It started with this bigram neural network — back when I was writing my diploma thesis in 2006, this was the first model I implemented: we just try to predict the next word given the previous word, using a simple neural network with one hidden layer. When we train this model on some text corpus, the by-product of the learning is the weight matrix between the input layer and the hidden layer: its rows basically contain the word representations in a vector format, so each word is associated with a row of numbers, the weights from this matrix. And this has interesting properties — for example, it groups words with similar meaning together, so that the vector representations of, say, France and Italy will be close to each other, while, for example, France and China will probably be farther away from each other — or maybe not, depending on the data. So this is basically a simple application of neural networks, and it is kind of fun to play with. Of course it is not perfect; the word vectors coming from this model would not be comparable to the state of the art of today, but already it is a fun place to start.
Sometimes these word vectors are also called word embeddings — I am not completely sure why, but that is the alternative name. Usually the dimensionality of the representation is something like fifty to one thousand, so each word is represented by, say, one hundred floats after we train the model. A nice property is that this generalizes the word classes I mentioned before: France and Italy can go into the same class, but with the word vectors the representations can be much richer, because unlike with the word classes we can have multiple degrees of similarity encoded in the word vectors, as I will show later.
It actually makes sense to do this. One thing is that it is fun to have these vectors just to study the language — that is actually what increased our interest in these techniques — but the other thing is that we can also use them in other applications. For example, Ronan Collobert showed in his famous paper "Natural Language Processing (Almost) from Scratch" that one can achieve state-of-the-art performance on many NLP problems by using some pre-trained word vectors. So the word vectors can basically serve as features for other models like neural networks, instead of, or in addition to, the one-hot encoding.
Historically, there were several models proposed before for training these word representations. Usually people started with the most complicated things — models with many hidden layers — and it was kind of working, so it was considered a big success of deep learning. I was not convinced about that, because I knew from my previous results that just one hidden layer already produces quite good vectors. I wanted to show that the shallow models — models that do not have many hidden layers but just one — can actually be quite competitive. For that I needed to be able to compare my word vectors to other people's approaches, and that was not actually easy, because people were reporting results after training their models on different datasets, those datasets were not public, and if you compare two techniques trained on different data, the comparison is not going to be very meaningful.
One of the interesting properties, which I actually used for developing this evaluation set, is that these word vectors can be used for doing small analogy-like calculations with words. For example, one can ask: when we take the vector for "king", subtract from it the vector that represents "man", add the vector that represents "woman", and do a nearest-neighbour search while excluding the input words around this position, then we will find the word "queen" — for any reasonably good word vectors. Similarly, we can calculate with words many other questions of this type, and it is kind of funny how accurate it can get. The picture below shows that there can basically be multiple degrees of similarity: "king" is related to "queen" in some way, but it is related to its plural form "kings" in some other way, and we would want to capture all these things — the idea that a word would be a member of a single class does not allow us to capture this. So for the evaluation I constructed a dataset with almost twenty thousand questions, basically written by hand and then generated automatically using permutations.
Here are a few examples — I think it would be quite challenging even for people to answer some of these analogy questions. Try to complete, for example: Athens is to Greece as Oslo is to Norway — I think that one is quite easy — but the second one is harder: the currency of Angola is the kwanza, and the currency of Iran is, I think, the rial, so that one is more complicated. And then there are ones that are actually very simple, like brother to sister and grandson to granddaughter, and so on. So we can measure the performance of different models on these analogy questions. It can actually be scaled up to phrases as well, so that we can compute things like: New York is to New York Times as Baltimore is to, I think, Baltimore Sun.
These datasets are public, published with the papers. The simple vector models I will show in a moment should be compared to the model that was kind of the state of the art back in those days — the feed-forward neural network language model, which used two hidden layers: starting with a context of three or four words at the input, it predicts the next word by going through a projection layer and a hidden layer. The main complexity of this model — after we apply some tricks so that we can deal with the huge output-layer matrix — is in the dense hidden layer, because we need to touch all of its parameters for every training example, and the model takes ages to train.
What I did was basically remove the hidden layer and make the projection layer slightly different, and, as I will show later, it actually works quite fine. So again, the idea is that we take the bigram model and extend it: we show the context around the word we are trying to predict, we just sum the word representations at the projection layer, and we make the prediction right away. This model will not be able to learn n-grams, so it is not suitable for language modeling, but it is just fine for learning the word vectors this way. This is the continuous bag-of-words (CBOW) model.
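A minimal sketch of that CBOW idea — sum the context word vectors at the projection layer and predict the middle word directly, with no hidden layer. Sizes are toy values, and a full softmax is used here for clarity, whereas the real tool uses hierarchical softmax or negative sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 100                      # vocabulary size, embedding dimension
W_in = rng.normal(0, 0.1, (V, D))     # input word vectors (the by-product we keep)
W_out = rng.normal(0, 0.1, (D, V))    # output weights

def cbow_step(context_ids, target_id, lr=0.05):
    h = W_in[context_ids].sum(axis=0)         # projection layer: sum of context vectors
    scores = h @ W_out
    p = np.exp(scores - scores.max())
    p /= p.sum()                              # softmax over the whole vocabulary
    grad = p.copy()
    grad[target_id] -= 1.0                    # gradient of cross-entropy w.r.t. scores
    W_out[:] -= lr * np.outer(h, grad)        # update output weights
    W_in[context_ids] -= lr * (W_out @ grad)  # update the context word vectors
    return -np.log(p[target_id])              # loss for this example

loss = cbow_step(context_ids=[3, 17, 42, 99], target_id=7)
print(loss)
```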
Very similar to the CBOW model is the skip-gram model, which tries to predict the context words given the current word. The two work quite similarly and perform comparably. The training is still the same thing: stochastic gradient descent with back-propagation.
The words at the output layer are encoded as one-of-N, the same as at the input layer. We cannot use the usual softmax function in the output layer — the one that gives a proper probability distribution — because we would have to compute all the output values, which would take too long. So there are two fast approximations. One still keeps the probabilities correctly summing to one — that is the hierarchical softmax. The second one actually drops the assumption that the model has to be a proper probabilistic model: it just takes a bunch of words as negative examples to be weighted down at the output layer, plus the positive example, and that is all that is done. This second option, the negative sampling, seems to be preferable.
Another trick that actually improves the performance quite a lot is to probabilistically, or stochastically, discard the most frequent words. This speeds up the training and, interestingly, can even improve the accuracy: we do not want to show the model billions and billions of examples where we try to relate words like "the", "is", "a" and so on. These are not removed from the training set completely, but some proportion of them is removed, so that their importance is reduced when it comes to the objective function.
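A hedged usage sketch with the gensim library (not the original C word2vec tool), showing the knobs just discussed: skip-gram versus CBOW, negative sampling, and subsampling of frequent words. Parameter names follow gensim 4.x (older versions use, for example, `size` instead of `vector_size`), and the corpus file name is an assumption:

```python
from gensim.models import Word2Vec

sentences = [line.split() for line in open("corpus.txt")]  # assumed tokenized corpus

model = Word2Vec(
    sentences,
    vector_size=300,   # dimensionality of the word vectors
    window=5,          # context size around the predicted word
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # number of negative samples per positive example
    sample=1e-3,       # threshold for subsampling very frequent words
    min_count=5,
)

print(model.wv.most_similar("france", topn=5))
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```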
And here is the comparison, as I said, on this analogy dataset: there was a huge gap, both in the training time and in the accuracy, compared to whatever people published before. That is what I wanted to prove — that one does not have to train a full language model to obtain good word representations. The last two lines are these very simple models that are invariant to the word order: they do not understand n-grams, they just see the single words, and still they compute very accurate word representations, actually way better than what people could train before — while the training time goes down from weeks to minutes, and maybe even seconds. This is available as open-source code; it is called the word2vec project. Many people find it useful, because they can train it on their own datasets and improve many other applications. I think it is a nice way to add a few percent of accuracy when you are dealing with datasets where there is not a huge number of supervised training examples.
Here are some examples of the nearest neighbours, just to give an idea of how big the gap was between what was state of the art before and after these models were introduced. Take for example "havel" — that is quite an infrequent word in English, but it is still present in the vocabularies of all these models. We can see that the nearest neighbours from the first model barely make any sense, the second one at least gets the idea that it is probably a name of some person, while the last one obviously gives much better nearest neighbours. Of course, this improvement of the quality comes from the fact that the models were trained on much more data and with a larger dimensionality, and that was only possible because the training complexity was reduced by many orders of magnitude.
There are a few more fun examples: we can calculate things like sushi minus Japan plus Germany, which gives bratwurst, and so on. It is kind of fun — and of course we do not have to look only at the single nearest token, we can look at the top ten tokens. I would not say that it works all the time; maybe sixty percent of the time the nearest words look reasonable. But it is still fun to play with, and there are many pre-trained models now available on the web. One thing that data scientists actually find useful is that these word vectors can be visualized to get some understanding of what is going on in the dataset they are using.
The regularities are so strong that when we train this model on the Google News dataset and then visualize in two dimensions the representations of countries and capital cities, we can actually see the relation between them: there is a single direction for how to get from a country to its capital city. And even the countries are related to each other in this representation in some interesting way: we can see that the European countries, or the Southern European ones, are in one part of the image, the rest of the world is somewhere in the middle, and the Asian countries are more towards the top of the image.
So, for the summary: I think it is always good to think about whether things can be done in a simpler way, and as was shown, not everything has to be deep — the neural networks work fine even if we actually remove many of the hidden layers, especially in the NLP applications. It is a different story for, say, acoustic modeling or image classifiers: there I am not aware of any model that would be able to be competitive with the deep models without having many nonlinearities. But for the NLP tasks it is the other way around, so I am not completely convinced that deep learning actually works for NLP, so far. Maybe in the future we will do better.
There is, though, one straightforward extension of word2vec: instead of predicting the middle word given the context, we can predict a label for the sentence, using the same algorithms. This is what we published as the fastText library last year. It is very simple, but at the same time very useful. Compared to what people are publishing nowadays at the machine learning conferences — we did the comparison to convolutional networks with several hidden layers trained on GPUs — we found out that we can get the same or better accuracy while being a hundred times, sometimes a hundred thousand times, faster. So I think it is always good to think about the baselines and to do the simple things first.
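A hedged usage sketch of the fastText Python bindings for supervised text classification; the file names are assumptions, and the library expects one example per line with labels prefixed by "__label__":

```python
import fasttext

# train.txt lines look like:  __label__spam  win a free phone now
model = fasttext.train_supervised(
    input="train.txt",
    lr=0.5,
    epoch=5,
    wordNgrams=2,      # use a bag of bigrams in addition to single words
)

print(model.predict("win a free phone now"))   # predicted label and probability
print(model.test("test.txt"))                  # (number of examples, precision, recall)
```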
The next part will be about the recurrent networks. I think it is quite obvious by now that word representations can be learned with shallow networks, but it is a different story for language modeling: there, there actually is some success of deep learning, because the state-of-the-art models nowadays are recurrent, and it is basically this model. Then I will also talk about the limitations of these models. The history of the recurrent networks is quite long — a lot of people worked on these models, people like Jeff Elman, Michael Jordan, Mike Mozer and so on — because the model is actually very interesting: it is a simple modification of the feed-forward network that gets some sort of short-term memory into the model.
Here is the graphical representation. Again, we can take the bigram model and just connect the hidden layer to the hidden-layer state from the previous time step — the h(t-1) — which creates a loop in the model. So the hidden layer sees the features from the input layer plus its own state from the previous time step, and that previous state in turn saw the state before it, and so on. Basically, every prediction depends on the whole history of input features presented to the network in the time steps before. So one can say that the hidden layer represents some sort of memory that this model has. There is an interesting paper about this from Jeff Elman, "Finding Structure in Time", if you want to see the original motivation.
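A minimal sketch of that recurrent step — the hidden state at time t depends on the current input and on the hidden state from the previous time step. Sizes and weights are toy values; a real recurrent language model would add an output softmax over the vocabulary and train with back-propagation through time:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 1000, 64                        # vocabulary size, hidden layer size
W_ih = rng.normal(0, 0.1, (H, V))      # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (H, H))      # recurrent hidden-to-hidden weights

def rnn_step(word_id, h_prev):
    x = np.zeros(V)
    x[word_id] = 1.0                   # one-hot encoding of the current word
    # h_t = f(W_ih * x_t + W_hh * h_{t-1})
    return np.tanh(W_ih @ x + W_hh @ h_prev)

h = np.zeros(H)
for word_id in [12, 7, 256, 3]:        # a toy word-id sequence
    h = rnn_step(word_id, h)           # h now summarizes the whole history
print(h[:5])
```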
Well, after this period when the recurrent networks were studied, the excitement kind of went away, because some people started believing that these models, even though they look very good on paper, cannot be trained with SGD. You can see this is a recurring theme: again and again, whenever people fail to make something work with gradient descent, they claim it just does not work — and of course they usually turn out to be wrong. The recurrent networks can actually be trained with SGD normally; one just has to do one small trick.
So what I did — I showed in 2010 that one can actually train state-of-the-art language models based on the recurrent networks, and that it is very easy to apply them to a range of tasks like language modeling, machine translation, speech recognition, data compression and so on. In each of these I was able to improve the existing systems and achieve new state-of-the-art results, sometimes by quite a significant margin: for language modeling, for example, the perplexity reduction over n-grams with an ensemble of recurrent networks was for me usually around fifty percent or more, which is quite a lot. Companies started using this toolkit, and what was surprising to me was how many of them there were.
Then I was looking, together with Yoshua Bengio, at why the model actually worked for me — people had tried to do it before and they just could not make it work. There was this problem that I noticed at some point: as I was trying to train the network on more and more data, the training started misbehaving in some subtle way. The training was unstable — sometimes it converged, sometimes not — and the more data I used, the lower the chance that the network would converge; mostly the results were just rubbish. It took me quite a few days of trying to figure out what was going on, and I found that there are some rare cases where the SGD updates align in such a way that the changes of the weights become exponentially larger as they get propagated through the recurrent matrix. They become so huge that the whole weight matrix gets overwritten with these huge numbers, and the model is ruined.
So what I did was the simplest thing I could think of: because these gradient explosions happen just very rarely, I simply clipped the gradients so that they could not become larger than some value, some threshold. It turned out that probably nobody had been doing this before, or at least nobody was discussing this idea, until around 2011. Maybe that was the reason why things did not work for others — I do not know — but, as I said, it was simply not the case that SGD would not work for training these models. And it was quite easy to obtain pretty good results; one just had to wait fairly long for the training of the models, because they were quite expensive.
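A minimal sketch of that gradient-clipping trick: before the SGD update, rescale or cap the gradient whenever it exceeds a threshold, so the rare exploding updates cannot destroy the weights. The threshold value is an arbitrary example:

```python
import numpy as np

def clip_gradient(grad, threshold=1.0):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)   # keep the direction, cap the size
    return grad

def sgd_update(weights, grad, lr=0.1, threshold=1.0):
    return weights - lr * clip_gradient(grad, threshold)

w = np.array([0.5, -0.2])
huge_grad = np.array([1e6, -2e6])          # a rare "exploding" gradient
print(sgd_update(w, huge_grad))            # weights change only by a bounded step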
Here is the application to my original setup for speech recognition — a small, simple dataset — where the reduction of the word error rate was over twenty percent relative compared to the best n-gram models. One can see that as the number of neurons in the hidden layer gets bigger — so basically as we scale up the size of the model — the perplexity goes down (perplexity is just a measure of how good the network is at predicting the next word; the lower the better), and the word error rate goes down as well. The best n-gram model, the one with no count cutoffs, gets something like twelve and sixteen point six percent word error rate on the evaluation data sets, and with a combination of these recurrent networks we get to something like nine and thirteen percent. That was quite a big gain, coming just from a change of the language modeling technique, which I think was unheard of before: when I compared these results to other techniques being developed, for example at Johns Hopkins University, people there were usually happy with a 0.3 percent improvement of the word error rate, while here I could get something like three and a half percent absolute. So that was a quite interesting finding.
Another interesting observation was that the more training data was used, the bigger the gain of the recurrent networks over the n-gram models. That was quite the opposite of what Joshua Goodman published in his technical report — I think it was in 2001 — where he basically evaluated all these advanced language modeling techniques, maximum entropy models and so on, that were considered for improving language modeling, and found that they actually helped less and less as more data was used. Seeing that, people were also losing hope that the n-gram models could ever be beaten. But with the recurrent networks the opposite actually happened, so that was kind of lucky.
The last graph is from a large dataset from IBM — it is pretty much the same task, just much bigger and with a much better tuned baseline coming from a commercial company. The green line is their best result, something like thirteen percent word error rate, and on the x-axis there is the size of the recurrent network models. You can see that as the networks get bigger and bigger, the word error rate keeps going down. The experiment was limited by the computational complexity — it took many tricks to train the biggest models, and that was quite challenging — but in the end I could get another several percent of relative error reduction, and I think it would have been even more if I could have trained bigger models. Already this result was very convincing, and people from the companies got interested.
Later, recurrent networks became much more accessible, because actually implementing the stochastic gradient descent correctly is kind of painful in this model — one has to use the back-propagation-through-time algorithm, and if you make a mistake there, it is very hard to find it later. So the toolkits are very useful; the most popular ones now are probably TensorFlow, Theano and Torch, but there are many others. Using graphics processing units, people could scale the training to billions of training words, using thousands of neurons — quite a bit bigger than what I was using back then. Today the recurrent networks are used in many tasks like speech recognition and machine translation. I think the Google guys published a paper a few months ago where they investigate how to get the recurrent networks into the production system for Google Translate. I think it will still take some time, but let's hope it will happen, because it would be great — for example for translating from English to Czech, so that finally the morphology would not be as painful as it usually is.
On the other hand, I think the downside is that, because these toolkits like TensorFlow and so on make the recurrent networks very easily accessible, people are using them for all kinds of problems where they are not really required. Especially when people try to compute representations of sentences or documents, I would always warn them to think about the simpler baselines, because just a bag of n-grams can usually beat these models, or at least be around the same accuracy, when it comes to representations. It is different for language modeling.
So one can ask: what can we do better, and do we really need it? As I said, the models we have can work pretty well, and sometimes adding more layers helps for some problems and gives better results. But can we build, say, the grand language model I mentioned at the beginning, one that would be able to tell us what is the capital city of some country? Maybe we could start with recurrent networks, but I am not that convinced, because there are very simple things that these models cannot learn — and that is actually an opportunity for new people, a new generation, to develop better models.
A simple pattern that is, for example, very difficult to learn is memorization of a variable-length sequence of symbols — something like seeing a sequence of symbols and being able to repeat it later. That is something that, in general, nobody can train the recurrent networks to do. There are even simpler patterns: we do not have to memorize a sequence of symbols, we can just ask the network to count. We can generate, with some very simple algorithm, sequences with some strong regularity, and see whether the networks can actually learn it.
I think people know from the computer-science curriculum that there are very simple formal languages, like the a^n b^n language, where there is the same number of a symbols and b symbols. We can show the model quite a few examples and train a sequential predictive model, like a recurrent network, to predict the next symbol. If it can actually count, then it should be able to predict correctly all the symbols in the second half of the sequence — not the beginning, because the information coming from the first part, the number of a's, is not predictable. And this turns out to be quite challenging.
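A minimal sketch of generating such a^n b^n sequences for testing whether a sequence model learned to count: once the first "b" appears, every remaining symbol (the rest of the b's and the end marker) is fully determined, so a model that can count should predict that part perfectly.

```python
import random

def make_example(max_n=20):
    n = random.randint(1, max_n)
    return "a" * n + "b" * n + "."          # "." marks the end of the sequence

def deterministic_positions(seq):
    # indices i such that the symbol at i+1 is fully determined by the prefix
    first_b = seq.index("b")
    return list(range(first_b, len(seq) - 1))

seq = make_example()
print(seq)
print(deterministic_positions(seq))  # positions a counting model must get right
```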
We can list plenty of these tasks that the models currently cannot do, and one can get confused about what we should focus on: should we study these artificial grammars, how is that related to the real language, and even if we solve them, will it in the end improve some language model? I think those are natural questions, and the answer is quite complicated. What I think is that it is good to set some big goal in the beginning and then try to define some plan for how to actually accomplish that goal.
So we wrote one paper where we discuss such an ultimate goal. The point was to start a bit differently: instead of trying to improve some existing setup, we are trying to define a new setup that would be more like artificial intelligence — something like what people can see in the science-fiction movies, something that is really exciting — and that is what we actually want to optimize as the objective, not just some speech recognizer; something more fun.
So we thought about which properties of an AI would be really useful for us, and it seems that any useful artificial intelligence would have to be able to somehow communicate with us, hopefully in some natural way. Again, if you look at the science-fiction movies or books, usually the artificial intelligence is some machine, either a robot that can be controlled with natural language, or some computer that, again, we can interact with. So the embodiment does not seem to be necessary, but there needs to be some communication channel, so that we can actually state some goal and the AI can accomplish that goal for us. If we can communicate with the machines, it will of course help — maybe we could even go beyond programming: currently we communicate with computers by writing sequences of instructions for what we want them to do, and there is no way I can just start talking to a computer and expect it to accomplish a task for me; that is basically not a framework we have now. I think that in the future this will become possible, though it may take a long time, and I think we should start thinking about it, because I do not think we can improve the language models much more with just some crazier recurrent network.
So, in the roadmap we describe a pretty minimal set of components that we think the intelligent machines should consist of, and some of the properties that may actually be needed for constructing these machines. These are the ideas we have now, and maybe later we will improve them — we are only discussing them at the conferences so far. The machine needs to be scalable in many dimensions, so that it will actually be able to grow towards full intelligence as components are added. As I said, it needs the ability to communicate, and we need the ability to set tasks for the machine so that it will do something useful — so, some motivation component; again, that is something normally missing in the predictive models like the language models. And then it needs some learning skills, which, it seems, many of our current models are also missing: for example, long-term memory is not really part of any model we have today. In neural networks, the long-term memory is represented in the weight matrices, and these get overwritten as the network keeps accumulating gradients from new examples — that is basically not a good model for long-term memory. So we need to do something better, basically.
I will go over this quickly, because it would be a long discussion to explain why we actually think about all these things. We think there has to be some incremental structure in how the machine will be trained: it does not seem workable to train it the way we normally train the language models; it seems it has to be trained in some incremental way, similar to the way humans learn the language. And for that we are thinking about some sort of simulated environment, which would be used to develop both the algorithms that are missing and, once we have those algorithms, to train the most basic intelligent machines with the most basic properties we can think of.
So this is basically what we are thinking about, and we want to do quite a few experiments with it. There are a few components. The learner — that stands for the intelligent machine — lives in this environment and can do some actions, but everything is kept very simple; we try to minimize the complexity. It basically receives an input signal, a sequence, and produces an output signal, which is a sequence as well, and it receives a reward, which is used to measure the performance of the learner. Then there is the teacher, which defines the goals and assigns the rewards, and that is it. This is the description of the environment, which is based on strings. Of course we want the teacher to scale as well: later, once we have a learner that can learn these very simple patterns, the expectation is that the teacher would be replaced by humans, so humans would directly be teaching the machine and assigning the rewards. And once the machine gets to some sufficient level, the expectation is that we can start using it for doing something actually useful for us.
So the communication is really the core: the learner just has this input channel and the output channel, and all it has to do is to figure out what it should be outputting at a given time, given the inputs, to maximize the average incoming reward. That seems quite simple, but of course it is not. Here is a graphical representation, just so that what we are aiming at looks more obvious: there is the input channel, the output channel, and the task specification given by the teacher — something like "move and find the apple". Then the learner — here we assume it has already learned how to do this task — says to the environment "I move", and that is how it accomplishes the action. We do not need an actuator for every possible action: the learner can do anything it is allowed to do just by saying it — if it wants to go forward, or if it wants to turn, it can just say so. And at the end of the task it gets the reward, for example for finding the apple.
We think that learning quickly will be absolutely crucial here — that is the same thing I said about the incrementality of the learning. As the tasks get more and more complex in some incremental way, the learner should be able to learn from a few examples at most, not by brute-force searching the space of solutions. The algorithms we have at the moment would basically break on this type of problem — that is what I was arguing before, and of course we will get to concrete arguments later — but it seems hopeless for now, because we do not have algorithms that would be able to deal with even these basic problems.
and then
if we have this uh this intelligent machines and that can
uh were with the input and output channels that of course we can add the
real world basically this additional
input uh
channel that the much again
one troll for example it can give where is that the output channel to the
internet and the received the resulted input so
uh the framework is very simple about the
uh it seems to be sufficient for the intelligent machines
and as i was saying, there are these tasks that seem to be very simple to learn, but you cannot really do it with recurrent networks, even with long short-term memory units, structured recurrent networks and all kinds of crazy things. the tasks themselves are very simple, but they are very challenging to learn, even when we have supervision about what the next symbol is; the networks just try to learn these things by rote memorization, or do something even worse.
so these are the things that we believe are important, especially the last two. basically all of these are open research problems, and maybe they even have to be addressed together, so it's quite challenging, but i think it's good for people who are trying to start their own research to think about challenging problems.
so, as a small step forward, we published a paper showing that recurrent networks can actually learn some of these algorithmic patterns when we extend them with a memory structure that the recurrent network learns to control. that actually addresses several of the problems i mentioned before: if this memory is unbounded in size, like in this example, then suddenly the model can, at least theoretically, be Turing-complete, so it can in principle learn a finite representation of any algorithm, which seems to be necessary, and we as humans can do it. it also addresses, or could address, the problem i mentioned before with neural networks that are changing their weight matrices all the time and therefore forgetting things: if you have this controlled way to grow the memory structures, that could be a way to represent long-term memory better. but as i said, it's just a first step. of course we did find out later that people had already worked on the idea behind this; the first papers with this idea were published, i think, already back in the nineties, but what we found is that our solution is again simpler and works better than what people published before.
so the model looks like this; there's not much complexity. basically the hidden layer decides on the action to take by producing a softmax, a probability distribution over the actions it can perform: it can either push some value on top of the stack, pop the value off the top of the stack, or decide to do nothing with the stack, and of course there can be multiple stacks that the network controls. if it wants to push some specific value, that value again depends on the state of the hidden layer. and the fun thing is that it can actually be trained with SGD, plain stochastic gradient descent, so we don't need to do anything crazy. and it seems to be working for at least some of the simpler synthetic sequences, like the ones here; at least some of them we were able to solve. the bold characters are the predictable, deterministic ones, and we could solve all these problems, which i found quite interesting.
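to make the mechanism concrete, here is a rough sketch of one time step of such a stack-augmented recurrent network: the hidden layer emits a softmax over push, pop and no-op, the pushed value is a function of the hidden state, and the stack update is a soft mixture so that everything stays differentiable and trainable with SGD. this is a simplified illustration in the spirit of the model described above, not the exact published implementation, and the parameter shapes are arbitrary choices.

```python
# simplified single time step of a continuous, differentiable stack controlled
# by the hidden layer of a recurrent network. parameters are random here; in
# the real model they are trained jointly with the RNN using plain SGD.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
hidden_size, stack_depth = 16, 8

W_action = rng.normal(scale=0.1, size=(3, hidden_size))  # push / pop / no-op
W_value  = rng.normal(scale=0.1, size=(hidden_size,))    # value to push

def stack_step(h, stack):
    """Update one continuous stack given the current hidden state h."""
    a = softmax(W_action @ h)                 # probabilities of push, pop, no-op
    d = 1.0 / (1.0 + np.exp(-(W_value @ h)))  # pushed value depends on hidden state
    new_stack = np.empty_like(stack)
    # top element: pushed value, or the element below it (pop), or unchanged (no-op)
    new_stack[0] = a[0] * d + a[1] * stack[1] + a[2] * stack[0]
    # deeper elements shift down on push, up on pop, stay in place on no-op
    for i in range(1, stack_depth - 1):
        new_stack[i] = a[0] * stack[i - 1] + a[1] * stack[i + 1] + a[2] * stack[i]
    new_stack[-1] = a[0] * stack[-2] + a[2] * stack[-1]
    return new_stack

h = rng.normal(size=hidden_size)   # hidden state at some time step
stack = np.zeros(stack_depth)
print(stack_step(h, stack)[:3])
```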
and of course the plain recurrent networks cannot do it. the funny thing is that the LSTM models, which were actually originally developed to address exactly these problems, can do it, because they can count thanks to their linear component. so that is sort of cheating, because the model was developed for this particular reason. of course we can show that the LSTM will also break if we just scale the complexity up a bit: instead of just requiring the model to count, we can require it to start memorizing sequences, as i said before; we really just show it a bunch of characters, with variable length, that have to be repeated, and that already breaks the LSTMs. for people who don't know them, LSTM is a modification, or extension, of the basic recurrent network that adds these linear units with gated connections, basically a more complicated architecture that lets a more stable memory propagate more smoothly across time. so we could solve the memorization task.
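to give an idea of the kind of memorization task meant here, the sketch below generates variable-length strings that have to be reproduced after a delimiter. the alphabet, lengths and delimiter symbol are my own choices for illustration; the exact setup in the experiments may differ.

```python
# toy generator for the variable-length memorization task mentioned above:
# a random string has to be reproduced after a delimiter. the part after '|'
# is fully predictable, so a perfect model must carry the whole string in memory.

import random

def memorization_example(min_len=2, max_len=8, alphabet="abc", reverse=False):
    n = random.randint(min_len, max_len)
    s = "".join(random.choice(alphabet) for _ in range(n))
    target = s[::-1] if reverse else s   # repeating in reverse order is easier for a stack
    return s + "|" + target

for _ in range(3):
    print(memorization_example())
```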
but then of course one can say that the stacks are kind of designed for exactly this type of regularity. so the interesting thing is that our model was also run on a task that is quite a bit more complicated, binary addition, and interestingly it also did quite well there. here we are showing these examples, which are binary inputs: the addition of two binary numbers, together with the result, and the network learns to predict the next symbol in the string, so it's like a language model. and it turned out that it actually could learn to operate the stacks in quite a complicated way to solve this problem: it saves the first number onto a stack, actually with some redundancy, i think it keeps three copies of the previous information, then it reads the second number, and then it is able to produce the addition correctly from these two numbers. so i think it's quite a funny example. of course, there was a hack we used to help the model: because the stacks push values on top, it's actually much easier to do the memorization of the strings in the reverse order, and the same is the case for the binary addition.
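as an illustration, the sketch below generates sequences for this binary-addition task, presented as one string so that a language model can be trained to predict the next symbol, with an option to write the numbers least-significant bit first (the "reverse order" trick mentioned above). the exact formatting used in the original experiments may differ; this is just a plausible toy version.

```python
# toy generator for the binary-addition language-modeling task described above:
# two binary numbers and their sum in one sequence; the sum is deterministic
# given the two operands, so it can in principle be predicted symbol by symbol.

import random

def addition_example(max_bits=6, reverse=True):
    a = random.randint(0, 2 ** max_bits - 1)
    b = random.randint(0, 2 ** max_bits - 1)
    fmt = lambda x: bin(x)[2:]
    a_s, b_s, sum_s = fmt(a), fmt(b), fmt(a + b)
    if reverse:                      # least-significant bit first: easier with stacks
        a_s, b_s, sum_s = a_s[::-1], b_s[::-1], sum_s[::-1]
    return a_s + "+" + b_s + "=" + sum_s + "."

for _ in range(3):
    print(addition_example())
```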
so i wouldn't say that we can actually learn general algorithmic patterns with this model. of course we could do better if we did not use just the stacks but, for example, other additional memory structures with all kinds of topologies and so on, but that seems like tweaking the solution together with the task, which doesn't seem great; i would refer back to the paper i mentioned before: try to define the tasks first, before thinking about the solution. in any case, we could show that we can learn interesting, reasonably complex patterns that the normal recurrent networks couldn't learn, and the model is Turing-complete, as i said, and has some sort of long-term memory. but it's not the long-term memory we would like to have; it does not have the properties that we want. so there are still a lot of things that should be tried, and let's see what will happen in the future.
so, for the conclusion of the last part of the talk: to achieve general machine intelligence, which was my motivation when i started my phd, so far i have failed to do it, but at least there were these side products that happen to be useful. i think that we need to first think a lot about the goal. i have the feeling that many people are working hard on the wrong tasks; the tasks are too small and too isolated, and i think it's time to think about something bigger. there will be a lot of new ideas needed to define a framework in which we can develop AI, in the same way as the framework in which the first speech recognizers were built: it also took quite a few years to define how to measure the word error rates and so on, and how to annotate the data sets. and i think we will basically need to rethink some of the basic concepts that we take for granted now and that are probably wrong, for example the central role of supervised learning in machine learning techniques; i think that has to be revisited in favor of techniques that are much more unsupervised and that come from somewhat different principles. and of course one of the goals of this talk is to motivate more people to think about these problems, because i think that's how we can progress faster. so that's the last slide; thanks for your attention.
so now there is space for questions.
yeah, go ahead.
so my question is: how do you properly define intelligence, not artificial intelligence but just intelligence? and the second question, which is tied to the first one: okay, we know that the Turing machine is limited, it cannot solve everything, so do you believe that intelligence, as you define it, is achievable with your Turing-complete machine?
well, i'm not sure that the two questions are actually related; for me these are two separate questions. first, for the definition of intelligence, there are actually many opinions on this; probably, i would say, pretty much every researcher defines intelligence in a different way. the most general definition that i can think of, and it may be too philosophical, is basically that a pattern that exists in the universe could be thought of as intelligent. we can say that life is basically just some organization of matter that tends to preserve its own form, through evolution and everything. it goes back to old ideas, for example that the universe can be seen as one big automaton, and then everything that we observe is just a consequence of that; and then you can really see life as just a pattern that exists in this topological structure, and intelligence is just a mechanism that this pattern developed to preserve itself.
for the second question, you said that Turing machines are limited; i'm not sure in what sense, maybe you mean that normal computers are not Turing machines in the strict sense, so i don't know which problems you mean that you cannot do with a Turing machine. i was talking more about Turing completeness in the sense that the Turing machine is basically this concept where there is a finite description of all the patterns in this computational model. if you take a weaker model, like finite state machines, you know that for some algorithms there does not exist a finite description; for example, you cannot count if you limit yourself to finite state machines. in the context of recurrent networks i think it gets more confusing, because there have been papers written that claim that recurrent networks are Turing-complete, and in some sense one can make that conclusion, and i have heard people repeat this argument: that recurrent networks are Turing-complete, so they are just fine and they should in general learn all these things that i was showing. what i want to say is that when we try to train it with SGD, a normal recurrent network does not learn even counting, and it does not even learn plain sequence memorization. so that is one thing, what is learnable, and that is actually quite different from what can be represented. and if we take the argument of all these people strictly, then i would say that the recurrent networks as we have them now, including LSTMs, are not Turing-complete, because the proofs of their Turing completeness assume that there is infinity somewhere hidden in the model, usually in the values stored in the neurons. and that does not seem to be the case for the neural networks that we are using now: we use, say, thirty-two-bit precision, and you cannot really store an infinite number of patterns in that. it is the same argument as saying that you can save the whole universe in a single number using arithmetic coding; sure you can, but do you actually want this representation in some neural network, where one value stores everything and you need some complicated encoder and decoder at every time step? if you want a more practical definition, it makes sense to say that recurrent networks are not Turing-complete; strictly speaking there are maybe some versions that are, but they are just not practical. of course the Turing machine is also not a very practical model, but here i am talking about Turing completeness, not about practicality.
yeah, i see that you are thinking a lot about AI creation. there is actually a huge discussion right now in the field about achieving the singularity, about what happens whenever we create a human-level AI which gets connected to the internet. do you share any of those concerns about a rogue AI, or a superintelligent AI, which will basically do something silly?
well, i have different views on this. i think that this thinking about superintelligence and the singularity, i don't know what i would relate it to; it's a bit like asking whether the Chinese, when they discovered gunpowder, should have been afraid of burning up the whole world in some chain reaction. i mean, it is basically just technology, and we should be aware of it, and the same holds when it comes to the state of the research: as i was saying, if you don't want to fool yourself, then it is clear that we cannot teach the AI even many very simple things, so talking about the singularity, i think it's just too far away. of course there are people who argue that the gap between having something that doesn't work at all and suddenly having some intelligence that can improve itself doesn't have to be that big, and that maybe we can achieve this machine sooner than we expect, even if some people are sceptical and think it will come much later. but if i take this argument, then i would say it depends on how we construct this machine. in the framework i was describing, we are supposed to make machines that try to optimize some goals, and as long as we are able to define the goals for the machines, then i would say the machine is basically something that extends your own abilities. if you are sitting in a car, then you are able to move much faster than using your own legs, because the car is a physical tool for you; the car just does what you want it to do because you are steering it. it could go crazy, it could knock over people, it could kill someone, but then the driver is responsible. so i think that the AI, even if it would be very clever, as long as its only purpose is to accomplish the goals of the human who specifies those goals, is basically an extension of our mental capabilities, the same way as cars extend our ability to move.
well, that was just the first step; in your slides there was also this later step where the AI learns by itself, which is the tricky part, because whenever you...
which part do you mean, the one about the AI connecting to the internet, or...?
okay, i don't remember exactly which one, maybe i'm not quoting you correctly, but the last one was to let the AI learn by itself from other sources, which means you no longer have any control.
well, sure, that's a good question: if the learner learns from other sources, how much can it drift away from the external reward. but you can actually make the same argument about people: they are also born with some kind of internal reward mechanism that was hardcoded, maybe largely by evolution; for example, if you eat sugar then you feel happy, or whatever, because these are hardcoded things. and that still doesn't prevent people from behaving quite differently when they become adults, because they can, for example, just decide to stop eating sugar and simply not follow the external, hardcoded rewards that are built into the brain. so it's more a question of whether the AI would become so independent that it would have some sort of free will, and you can of course imagine that turning into something bad. but if you think about AI not as a single thing but as many of them, and many of them working with us, then my vision is basically that it extends our own abilities, and it's the same as saying that pretty much any piece of technology can be used for good and for bad purposes.
any other questions?
i was wondering whether it would be better to have more local learning, not backpropagating through the whole network, but something where learning would change just some subset of the weights instead of propagating the information through the whole model, maybe in an unsupervised way. is someone using something like that these days?
hmm, i think i have seen something like that, but i wouldn't be able to give you references because i don't remember them right now. i was thinking about this myself and didn't find a satisfying answer, because i think it's quite limited. so, well, i don't know; i guess that the property we should be able to get into our models, whether they are neural networks or something else, is this ability to grow in complexity, and that's something that normal neural networks don't have. once you start seeing the network as having some sort of memory mechanism, or the ability to extend its memory structure, i think that's how i see it, then the topology allows you to update not all the parameters but just some subset. so that's what i was thinking of, but of course that doesn't mean that's the solution; maybe it will come from something else.
i just think that if you take the current models and do something that will again just do local updates to them, i would be a bit worried about the model itself being limited in the computational sense. of course, you can point to the human brain: it is finite, it has a finite number of neurons, and at any given time maybe only some neurons are firing. but the final argument from me would be that as a human you can actually navigate in a topological environment: the environment around you is three-dimensional, it has a topology, and if you actually want to do something hard you can use a piece of paper and so on, so you can actually extend your memory as long as you are working in the environment. the environment works like the paper tape in the Turing machine, and then you can see the whole system as Turing-complete. so if the model actually starts living in the environment, i think it gets much more interesting, because it can also change the environment; it becomes much more interesting than if you have just a neural network in a box, observing input vectors and producing output vectors, without being able to control anything, with no topology. for example, when i was talking about the stack RNNs, you can see the stack as a one-dimensional environment that the stack RNN lives in and can operate on, and then a two-dimensional environment is basically just more dimensions, but it's the same kind of thing; and you can keep going up to three dimensions, to the real world, and if you are able to influence the state of the world, great, otherwise i think you will be quite limited. so that's kind of my understanding of this.
does the research agenda OpenAI is pursuing have any overlap with the framework that you have suggested?
OpenAI?
yeah.
yeah, that's the guys in California; they published, i think a month or so ago, something called OpenAI Universe. it somewhat overlaps with our goals in the sense that they defined, i think, a thousand tasks or something of that sort, and they are trying to make machines, coming from their definition of intelligence i guess, some sort of machine that can work across a range of tasks, not a single task. but it's actually quite crucially different from what i was describing, because there is a difference between that and incremental, or gradual, learning, i think there are several other names for it, where you assume that the machine has learned tasks one to n, and then you try to teach it task n plus one, and it should be able to learn it faster if this new task is related to the old ones; and you can actually measure this, because you can construct these subtasks yourself, artificially, and measure it.
from what i have seen so far, and i'm not an expert on what they are doing, maybe they are still changing direction, i think they are trying to just solve a bunch of tasks together, which is multitask learning, and that is a different thing; actually even the current neural networks, which don't have the properties i talked about, can approach those problems. they try to do it with, i think, reinforcement learning, which again is quite challenging, because you don't say directly what the model should be doing; you are just giving rewards for the correct behaviour. so part of what they are trying to do is somewhat related to what i was describing, but i don't think multitask learning as such is a big problem, because it actually just works fine: you can have one network learn to recognize speech and do image classification and language modeling at the same time, because you will represent all these things at the input layer in quite different ways, so they will just be encoded in different parts of the network.
i think their hope is that these tasks will actually start boosting each other's performance: if you train this network to do all these things together, then it will somehow share the abilities. so let's see what will come out of it. from my point of view, i think it's better to try to isolate the biggest problems and try to solve those; i was, for example, giving preference to breaking things down into small subproblems and trying to get to the core, to the simplest things, the ones you can represent even with one hidden layer and that are still very simple. and from my point of view, if we try to analyze what is going wrong with the current algorithms by taking a huge dataset of, say, thousands of different problems, training some model on top of it, and then making some claims about what works, what doesn't work and what went wrong, i think that analysis will be very hard. it will be great for PR videos, which of course is one of the main things that they do, but... let's see.
so don't you think that multitask training is actually crucial in these things, because it can cover a lot of things, and the model can learn what not to do instead of just learning what to do?
well, multitask learning, i'm not saying it's a crucial problem or that it's a problem at all, i'm just saying it's...
it's a part of real life: you never learn just one thing, you always observe many things at once, and if you want to take inspiration from real life...
sure, i mean, that's completely fine. for example, when i was describing this framework with the learner and the teacher and so on, the point is that the teacher would be teaching the learner many things, and these tasks can be quite diverse; that can be defined so that the model works on multiple tasks, and we have that in the framework. but it is different whether you assume that you are training the model on all the tasks together and then you try to measure the performance on the same tasks, or whether you train the model on some tasks and then you try to teach it quickly on different tasks; and that's what i think is much more challenging and what i think we should try to focus on, because it will be needed. if you just train on a million tasks and then always show that the model performs well, maybe it's just because the task was in the training set, so you don't learn much; that was my point. so of course it's part of the problem to have a model that can work on multiple tasks at once, but we want it to learn new tasks quickly.
you've mentioned steps that should be taken toward creating an environment for AI. do you know what the state of the art is in using anything with these principles? does anybody use such an environment?
yes, we actually established such an environment: we built a simple environment, published it last year, and we also presented it, i think at the NIPS conference. it's on GitHub, it's called the communication-based artificial intelligence environment; i think the short name is CommAI-env, with a dash. it's a pretty silly shortcut, nobody likes it, but we ended up with this one because otherwise the name would be too long; that's how it is.
so that's the environment that we published. when it comes to other similar efforts, well, there was this discussion about OpenAI Universe, which is one of them, and i think DeepMind published, around the same conference, something like DeepMind Lab, which is about playing games in three-dimensional environments and learning how to navigate just by observing pixels. those environments are quite different, because they again focus on solving single tasks, without this focus on the incrementality of the learning, so i'm not sure there is something directly comparable to what we have; but then, there are so many researchers out there that you never know.
that's encouraging for the rest of us.
do you think we have enough data for training and building language models, so that now we should focus only on algorithms, or should we also keep adding data sources and collecting more textual data?
well, of course, the more data you have, the better models you can build, and i would say there's never enough data. if you try to improve all these tasks that i mentioned in the first part of the talk, speech recognition, machine translation, spam detection or whatever, then sure, more data will be good, and the amount of written text data on the web is increasing all the time. so i think that in the future we will have even bigger models trained on even more data, and the accuracy of these models will be higher: the perplexities will go down, the word error rates will go down, things will just keep getting a bit better. there is this argument, going back i think to Shannon, on the question of whether our models would actually be able to capture all the regularity in the language if the amount of data were infinite and the context length were unlimited as well, which basically says that the more data you have, the better you will be, but the gains just keep getting smaller and smaller. so i don't think this is the way to get to AI, because even if you had billions of times more data than you have now, sure, you get maybe a two-point improvement in machine translation, and that's fine, or maybe one or two percent lower word error rate in speech recognition, but these are diminishing returns, so at some point it just stops being worth doing. of course, there's also the question of getting more data in domains where we have only a very small amount of data today; there you can of course expect big gains in accuracy. for example, for English language models i think it's now mostly about maximizing the size of the model and how quickly we can train it on the data we already have; whereas for other languages there can be more fun, there i would have more hope, because there is less data. so maybe for the Czech language, or for morphologically rich languages in general, there is something to be done; they are interesting for some reasons. so yes, the answer is basically: yes, more data is good, but if you want to get to AI, i don't think it gets us there.