Hello everyone, my name is Yanchao Yu, from the Interaction Lab at Heriot-Watt University in Edinburgh,
and today I want to present our paper on training an adaptive dialogue policy for interactive learning of visually grounded word meanings.
So in this talk I want to cover two aspects. The first is an overview of the system architecture, and then we will show, based on a video, how the system works. And also, based on this system architecture, we investigate the effectiveness of different dialogue capabilities and dialogue policies in the interactive learning process, and based on that investigation we train an adaptive dialogue policy.
Now let's move to the motivation.
What we want to do in this case is to build a teachable multimodal system which can learn and understand its users' natural language about visual objects and attributes. And in contrast to other approaches, we learn everything, the visual concepts and the word meanings, online, through natural tutoring interactions, rather than from text-based descriptions or manual annotations. And also, the system uses really small amounts of training data, maybe just one or a handful of examples for each concept.
And then, this puts the system in a really different position: we put it in the position of a child, rather than a second-language learner. A second-language learner already has all of the visual knowledge, what colour means, what shape means, and all they need to do is to associate that visual knowledge with the specific words or phrases in another language. But the child is quite different, because it doesn't have any such knowledge, and it has to learn the word meanings and the visual concepts together, from scratch.
As we know, there are lots of recent works trying to deal with the symbol grounding problem: they try to generate natural language descriptions of images, or they try to identify or describe visual objects using visual features like colours, shapes, or materials. But to our knowledge, none of these methods on its own is enough for a teachable robot or multimodal system, and these aspects should be combined together.
So here we present a table comparing our project with others, and as we can see, almost all other works are focused on one or some of the aspects in this table, but our work considers all of these aspects, including interaction, online learning, natural language, and incrementality.
Now let's move to the system architecture. This is the general architecture: it combines a vision module and a dialogue module. On the left we can see the vision module, a set of attribute classifiers, which are grounded in the semantic representations in the language processing module. The predictions from the classifiers go to the dialogue side: the visual observation module produces a semantic analysis of the scene, and this then becomes the non-linguistic context for parsing and generation, for example for pronoun or reference resolution. And on the other hand, in the dialogue module, the DS-TTR module, we parse the dialogue with the users word by word, and through the parsing, the tutor's judgements are used as labels to update the classifiers incrementally through the interaction.
Now talking about the vision module: the vision module extracts high-dimensional feature vectors, including the HSV colour space for colour and bag-of-visual-words features for shape. And then it incrementally trains a binary classifier for each visual attribute, using logistic regression with a stochastic gradient descent (SGD) model. Finally, after the classification, it produces the visual context based on the predictions and the corresponding confidence scores, grounding the semantic atoms in the particular classifiers.
Well, the DS-TTR module: the DS-TTR module combines Dynamic Syntax and Type Theory with Records, a word-by-word incremental semantic model of dialogue, including a parser and a generator, and it produces the semantic and contextual representations in the form of TTR record types. What I want to highlight here is that our work is quite similar to Kennington and Schlangen's work, but what they do is to directly ground the individual words to the classifiers, whereas we ground at the level of the TTR logical forms.
OK, here's an example of the incremental parser, and here's a graph of the whole process. It shows the DAG for the dialogue context, and it tracks the dialogue context from all of the participants in the dialogue, including the learner and the tutor. Each node here represents the semantic state, the TTR record type, at a particular point, and each edge represents a particular word parsed by the DS-TTR parser. So when we get a new word, the graph just grows: it gets a new node and updates the record type again, and we get another one. And there's always a final record type, for instance for "what is this", and the learner answers "a square". So it just continues to process and update, resolving the question type with the answer about "square". And the "yes" here is a kind of acknowledgement from the tutor, so the node of the previous dialogue turn gets updated by that judgement, and the learner checks it in the dialogue context.
Here is another example, about the grounded natural language semantics in the visual dialogue. We can see the system captures an object from the webcam, extracts features, pushes them into the classifiers, and gets a set of prediction labels with the corresponding confidence scores. And based on that, instead of using a probability distribution, since we are using binary classifiers for each of the attributes, we just pick the label with the highest score for each group. We use that to generate a visual context, a record type, where the colour is "red" with a score of 0.75 and the shape is "square" with 0.88, and when we push that into the generator, with that meaning, it means: "I can see a red square".
In our system, the vision module treats all of the visual attributes equally, assigning binary classifiers; it does not itself identify the meaning of a specific word. But in the language processing module, the grammar knows they are in different categories: "red" is a kind of colour and "square" is a kind of shape. And, as we mentioned already, our system is in the position of the child, so we don't need to learn a mapping between pre-existing classifiers and semantic items. Instead, we just create classifiers on the fly for the new semantic items we encounter in the dialogue, and we retrain them incrementally through the interaction.
Now I want to show you how the system works. Here's a really simple dialogue: the system shows the dialogue on the left, and it's very simple, just teaching the specific words for the colours or shapes. And then we get a new dialogue, which tests what the system has learned from the previous one. We get an object, and we get the visual context: based on the visual features we get a set of classification results, and based on those results we generate the window that shows the visual context produced by the classification. And then this dialogue context window shows the TTR record types parsed from the previous tutor utterances. And then this window shows the generation goal: it shapes the answer by unifying the dialogue context and the visual context, and we put that into the generator to get the final sentence, for example "it is a square". OK, because the time is almost finished, what I want to highlight here is that we used very simple colours and shapes in this video, but the system is a really general framework, and it should scale to more complex visual scenes and classifiers in future work.
Now let's move to the experiments. In the experiments, we aim to explore the effectiveness of different dialogue capabilities and policies on learning grounded word meanings, with three factors: uncertainty, context-dependency, and initiative. And then, based on that exploration, we learn an adaptive dialogue strategy which takes the context into account, namely the reliability of the classifier results.
OK, in Experiment 1 we designed a two-by-two-by-two factorial experiment and considered three factors. The first one is initiative, which determines who takes the initiative in the whole dialogue. And then the second is context-dependency, which determines whether the learner can process and produce context-dependent expressions, like short answers or incrementally constructed turns, as in the example here. And then we considered uncertainty, which determines whether and how the classification confidence scores affect the learner's dialogue behaviour.
As we know, after classification the system gets a set of confidence scores over its predictions. For the agents that consider uncertainty, we try to find the point at which the agent can believe its own predictions, and we call this point the confidence threshold. An agent that considers uncertainty behaves like a teachable agent in active learning: like a human, it will only ask questions when it is not very sure about its answer, about its predictions. And on the other hand, under the condition without uncertainty, the agent always asks for confirmation or more information from the tutor, so it is more costly as well. And as we know, the classification scores are not always reliable: they are worse at the very beginning, because you don't have enough training examples yet, and this reliability obviously improves during the interactive learning, when you get more and more examples. So under the uncertainty condition the agent takes the risk of missing some information from the users: it may not ask any questions even when its answer is wrong.
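As a minimal sketch of the behaviour with uncertainty (the threshold value here is invented for illustration, not taken from the paper):

```python
def decide_dialogue_act(confidence, threshold=0.7):
    # Below the threshold the learner distrusts its own prediction and
    # asks the tutor; at or above it, it commits and asserts the prediction.
    return "ask_tutor" if confidence < threshold else "assert_prediction"

print(decide_dialogue_act(0.55))  # ask_tutor
print(decide_dialogue_act(0.88))  # assert_prediction
```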
To evaluate the interactive learning performance, we came up with a metric integrating the classification accuracy and the tutoring cost, which reflects the effort invested by the tutor in the interaction with the system. We define an overall performance score as the increase in accuracy against the cost to the tutor, so this score captures the trade-off between accuracy and cost. If we plot these scores on a graph, we get a curve, and the overall score can be represented using the gradient of the curve. And what we want to do is to find, or learn, a suitable dialogue policy to maximise this performance score.
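In other words, the idea is roughly the following (this is just the idea, not the paper's exact formula, and the numbers are invented):

```python
def performance_score(cost, accuracy, cost0=0.0, accuracy0=0.5):
    # Gradient of the accuracy-vs-cost curve: accuracy gained per unit
    # of tutoring cost spent, relative to a starting point.
    return (accuracy - accuracy0) / (cost - cost0)

s = performance_score(cost=500, accuracy=0.85)  # ~0.0007 per unit of cost
```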
OK, here are the results from Experiment 1. The x-axis represents the units of cost paid by the tutor, and the y-axis represents the accuracy. From this graph we can see that the agents using learner initiative with uncertainty, the green and the blue curves, perform much better than the others. However, because this policy takes more risks, it can accept wrong answers from itself, so it cannot achieve as high an accuracy as the others: the others reach nearly 0.9, but this one reaches around 0.75, something like that. So we conclude that, because the confidence scores are not really reliable during the learning process, the threshold shouldn't be kept constant over the whole learning task. We assume that a certainty threshold that changes dynamically over time should lead to a better trade-off between accuracy and cost. Therefore we trained an adaptive dialogue policy, using an MDP model and reinforcement learning; because of the time, I can't go into the details, so please refer to the paper.
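To give a flavour of the approach, here is a generic tabular Q-learning sketch; this is NOT the exact MDP from the paper, and the states, actions, and reward below are invented for illustration (states are binned confidence levels, actions are dialogue moves, and the toy reward favours asking the tutor only when confidence is low):

```python
import random

random.seed(0)
ACTIONS = ["ask_tutor", "assert_prediction"]
# Q-table over 5 confidence bins (0 = very unsure .. 4 = very sure).
Q = {(s, a): 0.0 for s in range(5) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2

def choose(state):
    # Epsilon-greedy action selection.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for t in range(500):
    s = t % 5
    a = choose(s)
    # Toy reward: ask when unsure (s < 2), assert when sure (s >= 2).
    r = 1.0 if (s < 2) == (a == "ask_tutor") else -1.0
    s_next = (t + 1) % 5
    best_next = max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

After training, the learned values prefer asking at low confidence and asserting at high confidence, which is the qualitative behaviour the adaptive policy is meant to acquire.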
So here are the results, and in these results we keep all of the other conditions constant: the learner takes the initiative, and takes context-dependency into account as well. And from the results we can find that the adaptive strategy, the right curve, achieves much higher accuracy... well, it cannot really beat the constant-threshold one, the gap is not really that big, the constant one is already good enough, but what we can see is that the adaptive strategy achieves high accuracy much faster, especially in the first one thousand units of cost, where we can see it is much better. So we conclude that an agent with the adaptive strategy is more feasible in the interactive learning task.
And in conclusion: in this paper we presented a fully integrated multimodal interactive teachable system for language grounding, and we trained an adaptive dialogue strategy for the language grounding task. We investigated the impact of different strategies and conditions on the learning process, and we found that the learned policy, which takes the uncertainty into account with an adaptive threshold, shows the best overall performance.
In future work, we are trying to train the dialogue policy with a human tutor, using a human data collection, rather than the simulated tutor we used to train the learner. And then we are trying to learn a word-level adaptive dialogue policy using reinforcement learning, based on the DS-TTR module. And finally, in order to deal with previously unseen words, features, or concepts, we are trying to integrate distributional semantics into the system.
And here is the list of references. Thanks for your attention, thank you very much.
Actually, the reason we want to use it is because we are considering the uncertainty from the visual knowledge. And we are thinking about whether we could actually use different things as well, like using the entropy or something else to measure the reliability of the classifiers. And we consider that, because we build all of the classifiers from scratch, a classifier doesn't have any knowledge, it has only one or two examples at the very beginning. So we are trying to use this kind of strategy: we try to assign a quite high threshold at the very beginning, so the learner asks more questions, which allows it to gather more examples quickly, and then when it gets more examples we just reduce the threshold, and try to get rid of the questions that the learner doesn't really need.
And what we want to do is this kind of scenario: imagine you buy a robot for your home, and you try to teach the robot all of the information from the user's own perspective. So we posit that situation because everyone has different knowledge about visual things, and that's why we need to consider the situation for the confidence threshold part.
Well, actually, that's a very good question. In this case we don't really think about that: we just think overall about the uncertainty in the visual knowledge, rather than the ASR as well. I think I would try to figure that out, but currently it's not done yet; that's maybe future work.
Sorry? So your question is how I generate the representations for the objects, right? So I just used MATLAB: we get the HSV colour space bins for the colour, and the bag of visual words for the shape. We just build a kind of dictionary ourselves, get the frequency of each visual word over the pixels, and put the histograms together to make the feature vectors.
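The speaker mentions MATLAB; a rough Python equivalent of the colour part of that idea (invented function name and bin count, illustration only) would be to histogram the HSV pixel values per channel and concatenate the histograms into a single feature vector:

```python
import numpy as np

def hsv_histogram_features(hsv_pixels, bins=8):
    # hsv_pixels: (N, 3) array of H, S, V values scaled to [0, 1].
    hists = [np.histogram(hsv_pixels[:, ch], bins=bins, range=(0.0, 1.0))[0]
             for ch in range(3)]
    feats = np.concatenate(hists).astype(float)
    return feats / feats.sum()  # normalise so object size doesn't matter

rng = np.random.default_rng(0)
feats = hsv_histogram_features(rng.random((500, 3)))
print(feats.shape)  # (24,)
```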
Well, actually, we know there are a lot of people working on classification using deep learning, for example convolutional neural networks. But we were trying to think about it this way: the child doesn't have any knowledge, so all of the classifiers are built from scratch, and we don't really want the system to already know the meanings, what the group of colours is, or what the shapes are. So we use a binary classifier for each attribute, and they are all treated equally. Afterwards, through the interaction, we get more knowledge and can figure out, OK, this "red" is a kind of colour; and then when we get a new feature like "yellow", we know yellow is quite similar to red, so it is also in the same group. Right. That's not really the way we... you mean the distribution of results across all of the classifiers in the same group, right? That's different: we just use binary classifiers, and all of the attributes, even the semantically similar colours, are encoded in the same way; there isn't any difference between them.