Hello everyone, my name is Yanchao Yu, and I'm from the Interaction Lab at Heriot-Watt University.

Today I want to present our paper about training an adaptive dialogue policy for interactive learning of visually grounded word meanings.

In this talk I want to cover two things. First, I will give an overview of the system architecture, and then show, based on a video, how the system works.

Second, using this architecture, we investigate the effectiveness of different dialogue capabilities and policies in the interactive learning process, and based on that investigation we train an adaptive dialogue policy.

Now let's move to the motivation. What we want to do is build a teachable multimodal system which can learn visual attributes incrementally from its users' natural language, through everyday utterances like "this is a red square" or "what colour is this?", or something like that.

and then

In contrast to other approaches, we learn everything, both the visual concepts and the word meanings, online, through live interactions with a tutor, rather than from text-based descriptions or manual annotations. The system also learns from really small amounts of training data, maybe only a handful of examples.

We also put the system into a really different position: we treat it as a child rather than as a second-language learner. A second-language learner already has all of the visual knowledge, what colour means, what shape means, and all it needs to do is associate that visual knowledge with specific words or phrases in another language. But a child is quite different, because it doesn't have any such knowledge: it has to learn the perceptual groundings and the word meanings together, from scratch.

As we know, there is a lot of recent work on the symbol grounding problem: systems that generate natural language descriptions of images, or that identify and describe visual objects using visual features like colours, shapes, or materials. But to our knowledge, none of these methods alone is enough for a teachable robot or multimodal system; these aspects need to be combined together.

So here we present a table comparing our project with others. As we can see, almost all existing works focus on only one or a few of the aspects in this table, but our work considers all of them, including interaction, online learning, natural language, and incrementality.

Now let's move to the system architecture. It is a really general architecture which combines a vision module and a dialogue (DS-TTR) module.

On the left we can see a set of visual attribute classifiers, which ground the semantic representations in the language processing module. The predictions from these classifiers go to the dialogue side: the visual observation module produces a semantic analysis of the scene, which then becomes the non-linguistic context of the dialogue, used for parsing and generation, for example for pronoun or reference resolution.

On the other side, in the dialogue module, the DS-TTR module parses the dialogue with the users word by word, and through this parsing the object judgements are used as labels to update the classifiers incrementally through the interaction.

Now let's talk about the vision module. The vision module extracts high-dimensional feature vectors, including the HSV colour space for colour and a bag of visual words for shape.
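To make the colour feature concrete, here is a minimal sketch of one way such a colour representation could be computed. This is only an illustration, not the system's actual feature extractor: the pixel format, bin count, and use of hue alone are all assumptions made for this toy example.

```python
import colorsys

def hsv_colour_histogram(pixels, bins=8):
    """Quantise the hue of each RGB pixel into a normalised histogram.

    `pixels` is a list of (r, g, b) tuples in [0, 255]. This is a toy
    stand-in for the HSV-space colour feature described in the talk.
    """
    hist = [0.0] * bins
    for r, g, b in pixels:
        h, _s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hist[min(int(h * bins), bins - 1)] += 1
    total = sum(hist) or 1.0
    return [count / total for count in hist]

# A patch of pure red pixels: all mass falls into the first hue bin.
red_patch = [(255, 0, 0)] * 16
print(hsv_colour_histogram(red_patch))
```

The resulting fixed-length vector is the kind of input an attribute classifier can consume, regardless of image size.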

It then incrementally trains a binary classifier for each visual attribute, using logistic regression with stochastic gradient descent. Finally, after classification, it produces the visual context based on the predictions and the corresponding confidence scores, and grounds each semantic atom in a particular classifier.
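As a rough sketch of the incremental training just described, here is a per-attribute binary logistic-regression classifier updated one example at a time with SGD. The learning rate, feature layout, and training loop are invented for illustration; the real system's hyperparameters are not specified in the talk.

```python
import math

class AttributeClassifier:
    """Binary logistic-regression classifier for one visual attribute
    (e.g. 'red'), updated one example at a time with SGD.
    A toy sketch of the incremental training the talk describes."""

    def __init__(self, n_features, lr=0.5):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def confidence(self, x):
        # Sigmoid of the linear score: P(attribute applies | features).
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, label):
        # One SGD step on the log loss for a single (features, label) pair.
        err = self.confidence(x) - label
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err

# One classifier per attribute word, trained from tutor feedback.
red = AttributeClassifier(n_features=2)
for _ in range(50):
    red.update([1.0, 0.0], 1)   # tutor confirms: this object is red
    red.update([0.0, 1.0], 0)   # tutor corrects: this one is not red
print(red.confidence([1.0, 0.0]))
```

Because each update touches only one example, a classifier like this can be retrained on the fly after every tutor turn, which is exactly the style of learning the architecture relies on.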

Now for the dialogue module. The DS-TTR module contains the Dynamic Syntax parser, which processes what the tutor types directly, and it models the semantics of the dialogue word by word and incrementally, covering both parsing and generation. It produces semantic and contextual representations as Type Theory with Records (TTR) record types.

One thing I want to highlight here is that this is quite similar to Kennington and Schlangen's work, but where they do one-to-one grounding of individual words in classifiers, we ground via the TTR record-type logical forms instead.

Here is an example of the incremental parser, and here is the graph of the whole process. It shows the directed graph of the dialogue context. Each such dialogue context covers all of the participants in the dialogue, including the learner and the tutor.

Each node here represents the dialogue state, the TTR record type, at a particular point, and each edge represents a particular word parsed by the Dynamic Syntax parser. So when we get a new word, the graph just grows: it gets a new node and the current record type is updated again, and so on.
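The word-by-word growth of the record type can be caricatured very simply. In this sketch a record type is just a Python dict gaining one field per word; real TTR record types and Dynamic Syntax trees are far richer, so treat this purely as an intuition pump.

```python
def extend(record_type, word, field):
    """Return a new 'record type' (a toy dict stand-in for a TTR
    record type) extended with the semantic contribution of one word.
    Illustrative of word-by-word growth only, not real Dynamic Syntax."""
    new = dict(record_type)
    new[field] = word
    return new

# Parse "red square" word by word: each word adds a field to the state.
state = {}
state = extend(state, "red", "colour")
state = extend(state, "square", "shape")
print(state)  # {'colour': 'red', 'shape': 'square'}
```

Each intermediate `state` corresponds to a node in the parse graph, and each call to `extend` corresponds to an edge labelled with the word just consumed.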

and

The parse continues across turns. For instance, the tutor asks a question such as "what is this?", and the learner sees a square. The learner updates the context by resolving the question's judgement type, and answers with something like "a square".

Then the tutor says "yes", as a kind of acknowledgement, so the content of the previous dialogue turn is grounded by that judgement, and the learner updates its dialogue context accordingly.

Here is another example, of grounding natural language semantics in the visual classifiers. The system captures an object from the webcam, extracts its features, pushes them into the classifiers, and gets a set of prediction labels with corresponding confidence scores.

Based on that, since we use binary classifiers for each attribute rather than a probability distribution, we just pick the label with the highest score for each attribute group, and use those to generate the visual context as a record type. In this example the classifier for the colour "red" responds with a score of about 0.75, and the one for "square" with about 0.88, and when we push these into the generator, it produces the meaning "I can see a red square".
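The selection step just described, taking the top-scoring label per attribute group, is simple enough to sketch directly. The scores below are invented for illustration; in the system they come from the trained classifiers.

```python
def visual_context(scores):
    """For each attribute group (colour, shape, ...), pick the label
    whose binary classifier gave the highest confidence score."""
    return {group: max(labels, key=labels.get)
            for group, labels in scores.items()}

scores = {
    "colour": {"red": 0.75, "green": 0.30},
    "shape": {"square": 0.88, "circle": 0.12},
}
print(visual_context(scores))  # {'colour': 'red', 'shape': 'square'}
```

The winning labels are then what gets packaged into the visual-context record type and handed to the generator.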

In our system, the vision module treats all of the visual attributes equally, as a set of binary classifiers: it does not itself identify what a specific word means. But in the language processing module, the grammar does know that they are of different kinds, that "red" is a kind of colour and "square" is a kind of shape.

And since, as we mentioned already, our system is in the position of a child, we don't need to hand-specify the mapping between classifiers and semantic items. Instead, we just create new classifiers on the fly for the new semantic items we encounter in the dialogue, and we retrain them incrementally through the interaction.

Now I want to show you how the system works. Here is a really simple dialogue, shown on the left: it just teaches the system specific words for colours and shapes.

Then we start a new dialogue to test what the system has learned from the previous one. We get an object, extract its visual features, run the classifiers, and get a set of classification results. Based on those results, we generate the window that shows the visual context produced by the classification. Another window shows the dialogue context, the TTR record types parsed from the previous tutor utterances.

And then this window shows the generation goal: the system shapes its answer by unifying the dialogue context and the visual context, and pushes that into the generator to get the final sentence, something like "it is a red square", and then the dialogue is finished.

One thing I want to highlight here is that we use very simple colours and shapes in this demo, but this is a really general framework, and it should scale to more complex visual scenes and classifiers in future work.

Now let's move to the experiments. In the experiments we aim to explore the effectiveness of different dialogue capabilities and policies on learning grounded word meanings, with three factors: uncertainty, context-dependency, and initiative. Then, based on that exploration, we train an adaptive dialogue strategy that takes the reliability of the classifier results into account.

In experiment one we designed a 2x2x2 factorial experiment considering three factors. The first is initiative, which determines who takes the initiative in the dialogue. The second is context-dependency, which determines whether the learner can process context-dependent expressions, like short answers or incrementally constructed turns, as in the example here. The third is uncertainty, which determines whether and how the classification confidence scores affect the learner's dialogue behaviour.

As we know, for each classification the system gets a set of confidence scores along with its predictions. For an agent that takes uncertainty into account, we try to find the point at which it can believe its own predictions; we call this point the confidence threshold. Such an agent behaves like an active learner: it only asks questions, or seeks confirmation, when it is not very sure about its answer or its predictions.
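The active-learning rule just described boils down to a single comparison. The 0.75 default below is an assumed value for illustration, not the threshold used in the paper.

```python
def dialogue_action(confidence, threshold=0.75):
    """Only bother the tutor when the classifier is unsure.
    The default threshold here is an assumption for illustration."""
    if confidence < threshold:
        return "ask_tutor"      # e.g. "is this red?"
    return "accept_prediction"  # trust the classifier, say nothing

print(dialogue_action(0.60))  # ask_tutor
print(dialogue_action(0.90))  # accept_prediction
```

The experiments then ask whether this threshold should be fixed, and at what value, or whether it should change over the course of learning.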

On the other hand, in the condition without uncertainty, the agent always asks for confirmations or more information from the tutor, so it is more costly as well.

As we know, the classification scores are not always reliable, especially at the very beginning, because the classifiers have not yet seen enough training examples. This reliability obviously improves during interactive learning, as the agent gets more and more examples. So a learner that relies on its uncertainty estimates takes the risk of missing some information from the users: if it doesn't ask any questions, it may believe predictions that are actually wrong.

To evaluate interactive learning performance, we came up with a metric integrating the classification accuracy and the tutoring cost, i.e. the effort expended by the tutor in the interaction with the system. We define an overall performance score for the increase in accuracy against the cost to the tutor, so the score captures the trade-off between accuracy and cost.

If we plot these quantities on a graph, accuracy against cost forms a curve, and the overall score can be represented as the gradient of that curve. What we want to do is learn a suitable dialogue strategy that maximises this performance score.
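One simple reading of that metric is the average slope of the accuracy-vs-cost curve, accuracy gained per unit of tutoring cost. The numbers below are a toy learning curve; the paper's exact formulation may weight things differently.

```python
def performance_score(costs, accuracies):
    """Average gradient of the accuracy-vs-cost curve:
    accuracy gained per unit of tutoring cost spent."""
    d_acc = accuracies[-1] - accuracies[0]
    d_cost = costs[-1] - costs[0]
    return d_acc / d_cost

# Toy learning curve: accuracy rises from 0.5 to 0.9 over 200 cost units.
print(performance_score([0, 100, 200], [0.5, 0.8, 0.9]))
```

Under this reading, a policy that reaches the same accuracy with half the tutoring cost scores twice as well, which is exactly the trade-off the experiments measure.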

OK, here are the results of experiment one. The x-axis represents the cost paid by the tutor, in cost units per training instance, and the y-axis represents the accuracy.

From this graph we can see that the agents with the learner taking the initiative and using uncertainty, the green and blue curves, perform much better than the others. However, because those policies take more risks, they don't always get the correct answers confirmed by the tutor, so they cannot reach as high a final accuracy as the others: the others achieve nearly 0.9, while these reach only around 0.75 or so.

So we conclude that, because the confidence scores are not really reliable early in the learning process, the threshold shouldn't be kept constant over the whole learning task. We assume that a certainty threshold that changes dynamically over time should lead to a better trade-off between accuracy and cost.

Therefore we trained an adaptive dialogue policy, using an MDP model and reinforcement learning. Because of time I cannot go into the details, so please refer to the paper.
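Since the talk defers the details to the paper, here is only a heavily simplified caricature of the idea: tabular Q-learning over coarse confidence states, choosing between asking the tutor (reliable but costly) and accepting the prediction (free but risky at low confidence). The states, actions, and reward numbers are all invented for illustration; the paper's MDP is much richer.

```python
import random

def train_policy(episodes=3000, alpha=0.1, epsilon=0.2, seed=0):
    """Toy tabular Q-learning sketch of learning when to trust the
    classifier. Everything here (states, rewards, dynamics) is an
    invented stand-in for the MDP described in the paper."""
    rng = random.Random(seed)
    states = ["low_conf", "high_conf"]
    actions = ["ask", "accept"]
    # Toy rewards: asking has a small fixed payoff (correct but costly);
    # accepting pays off at high confidence and backfires at low.
    reward = {("low_conf", "ask"): 0.5, ("low_conf", "accept"): -1.0,
              ("high_conf", "ask"): 0.5, ("high_conf", "accept"): 1.0}
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = rng.choice(states)
        if rng.random() < epsilon:                      # explore
            a = rng.choice(actions)
        else:                                           # exploit
            a = max(actions, key=lambda x: Q[(s, x)])
        # One-step episodes: the TD target is just the reward.
        Q[(s, a)] += alpha * (reward[(s, a)] - Q[(s, a)])
    return {s: max(actions, key=lambda x: Q[(s, x)]) for s in states}

print(train_policy())  # should learn: ask when unsure, accept when sure
```

Even this toy version recovers the qualitative behaviour the experiments motivate: question-asking concentrated where the classifiers are unreliable.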

Here is the second set of results. In this experiment we keep all the other conditions constant, with the learner taking the initiative and with context-dependency taken into account as well.

From the results we can see that the adaptive strategy curve achieves much higher accuracy much faster. It cannot really beat the best constant-threshold policy on final accuracy, which was already good enough, but it reaches high accuracy much sooner, especially over the first thousand units of cost, as we can see here.

So we conclude that an agent with the adaptive strategy is more feasible for interactive learning tasks.

In conclusion, in this paper we presented a fully integrated multimodal interactive teachable system for natural language grounding, and we trained a principled adaptive dialogue strategy for the word-learning task. We investigated the impact of different strategies and conditions on the learning process, and we found that the learned policy, which takes uncertainty into account through an adaptive threshold, shows the best overall performance.

In future work, we are planning to train the dialogue policy with human tutors, through a human data collection, rather than training the tutor and learner agents together in simulation. We also want to learn a word-level adaptive dialogue policy using reinforcement learning, based on the DS-TTR module. And finally, in order to deal with previously unseen words, features, and concepts, we are trying to integrate distributional semantics into the system.

Here is the list of references. Thank you very much for your attention.

[In response to an audience question:] Actually, we chose this because we are considering the uncertainty that comes from the visual knowledge. We have also thought about using different measures, for instance the entropy or something else, to gauge the reliability of the classifiers.

And because we build all of the classifiers from scratch, each has only one or two examples at the very beginning. So we use the kind of strategy where we assign a high threshold at the very beginning, and the learner asks more questions, which allows it to be more accurate; then, when it gets more examples, we reduce the threshold and try to get rid of the questions that the learner doesn't need to ask.

What we want to do is something like this: imagine you buy a robot for your home, and you try to teach the robot everything from the user's own perspective. We have to consider that situation, because everyone has different knowledge about visual things, so we are trying to work out how the confidence threshold needs to be set for that case.

Well, actually, that's a very good question. In this case we don't really deal with that: we only think about the overall uncertainty in the visual knowledge, rather than ASR uncertainty as well. I would like to figure that out, but it's currently not done yet; that's maybe future work.

Sorry, so your question is how I generate the representations for the objects, right? We are using MATLAB: we get the HSV colour-space bins for the colour, and for the bag of visual words we just build a kind of dictionary ourselves, get the frequency of each visual word over the pixels, and put them together to generate the feature vectors.

Well, actually, we know there are a lot of people working on classification using deep learning and neural networks. But we are trying to model a child: it doesn't have any knowledge, and we train all of the classifiers from scratch. We don't really want the system to already know the meanings, or which group the colours or shapes belong to, so we treat every attribute with its own binary classifier, all equally.

Afterwards, through the interaction, as it gets more knowledge, the system can figure out that, say, this "red" is a kind of colour, and then when we get a new feature word like "yellow", it knows yellow is quite similar to red, so it is also in the same group.

No, that's not really the weights. You mean the distribution of results across all of the classifiers in the same group, right? That's different: we just use binary classifiers, and all of them, even the semantically related colours, are encoded independently, so there is no difference between them.