Hi everyone. This is joint work with my collaborators, and today I am going to talk about the dialogue policy learning problem for task-oriented visual dialogue.
First, let me introduce the problem. The setting we want to study is a physically situated one, in which a question agent engages with the user in dialogue in order to identify the user's target image. As you can see here, twenty similar images are presented to the agent. At the first turn the user can provide a brief description of what they want, and from then on the agent plays the more proactive role by asking relevant, discriminative questions. Once the agent is confident enough, it makes a decision and guesses the target image, and the goal is to finish the task within a minimal number of turns.
In this setting, there are two main challenges: the agent needs to understand the multimodal representations, and it also has to be aware of the dynamic dialogue context, especially when receiving signals that matter for its decisions, such as wrong information in the answers or its own wrong guesses. So the main goal for the agent is to learn an efficient dialogue policy to accomplish the task.
This motivation also accounts for potential real applications. Imagine a virtual online shopping assistant that helps customers by proposing or recommending products based on the user's preferences and the multimodal context accumulated through the dialogue. As a side note, we are also working on a customer-oriented visual dialogue dataset based on fashion data, and hopefully we will have something interesting to show next year.
Previous research on visual dialogue has mainly focused on vision and language understanding and generation, where the question bot and the answer bot converse with each other for a fixed number of turns. We instead focus on the dialogue policy learning problem for the question agent, so that within the dialogue the question bot can play a more constructive role in helping the human accomplish the task. We want to evaluate the efficiency and the robustness of the dialogue policies in terms of task-oriented dialogue metrics.
It is also worth mentioning that our work is related to hierarchical reinforcement learning. Basically, we view this as a two-stage problem: first we want a dialogue policy over the whole dialogue that chooses between making an information query and making a decision, that is, performing the image retrieval; then we have a lower-level policy over the primitive actions, such as which question to ask. Hierarchical reinforcement learning has been applied in multi-domain dialogue systems, but here we have a multimodal context and action space. Our architecture also resembles feudal reinforcement learning, which has some nice properties such as state abstraction, state sharing, and sequential execution.
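To make this two-stage view concrete, here is a minimal sketch of such a two-level decision loop; all of the objects and method names (high_policy, question_policy, simulator) are hypothetical stand-ins, not our actual implementation.

```python
# Minimal sketch of the two-level decision loop (hypothetical interfaces).
ASK, GUESS = 0, 1

def run_dialog(high_policy, question_policy, simulator, max_turns=10):
    state = simulator.reset()                         # initial dialogue state
    for turn in range(max_turns):
        option = high_policy.select(state)            # high level: ask vs. guess
        if option == ASK:
            question = question_policy.select(state)  # low level: which question
            answer = simulator.answer(question)
            state = simulator.update(state, question, answer)
        else:
            guess = simulator.best_guess(state)       # retrieve the most likely image
            if simulator.is_target(guess):
                return True, turn + 1                 # task completed
            state = simulator.note_wrong_guess(state, guess)
    return False, max_turns
```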
Here is an overview of the information flow in our proposed framework. We have a user simulator module which handles the dialogue state transitions and also provides the reward signals. The generated answers are fed into the visual dialogue matching and embedding module, which updates the visual state with the new question-answer pairs and also communicates with the dialogue state tracking module through an attention signal. The dialogue state tracking module then forms the dialogue state representation, and the high-level policy module uses this dialogue state to learn the policy over asking a question versus making a guess. Finally, we have a specialized question selection module that learns which question to ask.
The first module is the visual dialogue matching and embedding module. The goal of this module is to learn an encoder that maps the image and the dialogue information into a joint space. The intuition is that we want the agent to be able to understand the visual input and the semantic relation between the image and the dialogue context. Besides, we pre-train this module so that the encoder supports robust and efficient reinforcement learning training, and the resulting embedding is also applicable to image retrieval. To verify the performance of this module, we performed a sanity check and observed high image retrieval accuracies in this setting, which means it can provide reliable signals for the reinforcement learning training.
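As a rough illustration of the idea (not the exact architecture we use), the matching module can be thought of as two encoders trained so that matching image and dialogue pairs score higher than mismatched ones; the layer sizes and the contrastive loss below are assumptions for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Sketch: project image features and dialogue features into a shared space."""
    def __init__(self, img_dim=2048, txt_dim=512, joint_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, img_feat, dialog_feat):
        img_emb = F.normalize(self.img_proj(img_feat), dim=-1)
        txt_emb = F.normalize(self.txt_proj(dialog_feat), dim=-1)
        return img_emb, txt_emb

def matching_loss(img_emb, txt_emb, temperature=0.1):
    """Contrastive-style pre-training: matched image/dialogue pairs sit on the diagonal."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```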
In the visual dialogue state tracking module, we track three types of state information. The visual belief state represents the agent's internal decision-making model and is the output of the visual dialogue matching and embedding module. The vision context state captures the visual features of the environment, and here we apply a technique we call state adaptation. The intuition is that we want to adapt the vision context so that it conforms better to the visual belief state, that is, to the decision-making model of the agent, based on feedback such as attention signals. The attention signal is calculated from the semantic similarity score between the visual belief state and each image vector, and we take the weighted average; in case of a wrong guess, we set the corresponding attention signal to zero. We also include dialogue alignment information: the number of questions asked, the number of image guesses, and the last action.
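Here is a minimal sketch of the state adaptation step as just described, assuming cosine similarity as the scoring function; the function and variable names are illustrative, not our actual code.

```python
import numpy as np

def adapt_vision_context(belief_vec, image_vecs, wrong_guess_ids=()):
    """belief_vec: (d,) visual belief state; image_vecs: (n, d) candidate image vectors."""
    # attention = semantic similarity between the belief state and each image vector
    sims = image_vecs @ belief_vec
    sims = sims / (np.linalg.norm(image_vecs, axis=1) * np.linalg.norm(belief_vec) + 1e-8)
    sims = np.clip(sims, 0.0, None)        # keep the attention weights non-negative
    for i in wrong_guess_ids:
        sims[i] = 0.0                      # wrongly guessed images get zero attention
    weights = sims / (sims.sum() + 1e-8)
    return weights @ image_vecs            # adapted vision context state
```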
Given the dialogue state, we have the policy learning modules. Since we have two separate policies to optimize, we apply the double DQN method, together with prioritized experience replay and target-network tracking, to improve the training sample efficiency.
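For reference, here is a hedged sketch of the double DQN target computation we rely on; the network interfaces and hyperparameters are placeholders rather than our exact training setup.

```python
import torch

def double_dqn_targets(online_net, target_net, batch, gamma=0.99):
    """batch holds tensors: state, action (long), reward, next_state, done (0/1 floats)."""
    with torch.no_grad():
        # the online network selects the next action, the target network evaluates it
        next_actions = online_net(batch["next_state"]).argmax(dim=1, keepdim=True)
        next_q = target_net(batch["next_state"]).gather(1, next_actions).squeeze(1)
        targets = batch["reward"] + gamma * (1.0 - batch["done"]) * next_q
    q = online_net(batch["state"]).gather(1, batch["action"].unsqueeze(1)).squeeze(1)
    td_error = targets - q                 # can also drive prioritized-replay priorities
    return q, targets, td_error
```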
Another important aspect of the reinforcement learning here is the reward design. The reward for training this model can be decomposed into the turn reward, the question reward, and the image retrieval reward. We apply a reward-shaping technique to the question selection reward, which captures the information gain of the question asked; concretely, we calculate it as the difference in similarity score between the visual belief state and the target image vector before and after the question.
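In other words, the shaping term for a question measures how much closer the belief state moved toward the target image in the embedding space. A minimal sketch, assuming cosine similarity as the scoring function:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def question_shaping_reward(belief_before, belief_after, target_image_vec):
    """Information-gain style shaping: how much the similarity to the target
    image improves after the question-answer pair is incorporated."""
    return cosine(belief_after, target_image_vec) - cosine(belief_before, target_image_vec)
```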
The question selection module learns to select the most informative question to ask at each turn, based on the shared vision context state. We use a reinforcement learning relevance network that is able to handle a large, discrete, text-based action space: the Q-value is estimated from the embedding vectors of the vision context and of the candidate questions. The reward is the intermediate question reward we just discussed, and we use an exploration strategy on top of the learned policy during training.
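The way a relevance network of this kind handles the large discrete action space is by scoring every candidate question against the current state embedding instead of keeping one output unit per action. Here is a rough sketch; the bilinear scoring form and the dimensions are assumptions, not necessarily our exact network.

```python
import torch
import torch.nn as nn

class QuestionRelevanceNet(nn.Module):
    """Sketch: Q(state, question) is a bilinear score between the vision-context
    state embedding and the embedding of each candidate question."""
    def __init__(self, state_dim=256, question_dim=256):
        super().__init__()
        self.bilinear = nn.Bilinear(state_dim, question_dim, 1)

    def forward(self, state_emb, question_embs):
        # state_emb: (d,), question_embs: (num_questions, d)
        state = state_emb.unsqueeze(0).expand(question_embs.size(0), -1).contiguous()
        return self.bilinear(state, question_embs).squeeze(-1)   # (num_questions,)

# Greedy selection: q_values = net(state_emb, question_embs); ask = q_values.argmax()
```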
To train the reinforcement learning agent we need a simulator, so we propose a corpus-based simulator: each instance consists of a set of similar images, and each target image corresponds to ten rounds of question-answer pairs. The simulator provides the reward signals, answers questions related to the target image, and also checks each guess against the target. There are two types of terminating conditions: either the agent guesses the correct image, or the maximum number of guesses is reached; the maximum number of dialogue turns depends on the experimental setting. We define the win and loss rewards as positive and negative ten, and there is an additional penalty for each wrong guess.
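A simplified sketch of how the simulator's termination conditions and terminal rewards can be wired together; the reward magnitudes below are illustrative placeholders rather than the exact values from our experiments.

```python
def terminal_reward(guess_is_correct, num_guesses, max_guesses,
                    win_reward=10.0, loss_reward=-10.0, wrong_guess_penalty=-1.0):
    """Sketch of the simulator's terminal logic; returns (reward, done).
    Reward magnitudes are illustrative placeholders."""
    if guess_is_correct:
        return win_reward, True            # correct image guessed: episode won
    if num_guesses >= max_guesses:
        return loss_reward, True           # out of guesses: episode lost
    return wrong_guess_penalty, False      # wrong guess, dialogue continues
```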
To evaluate the contribution of each component within our framework, we focus on five policy models. The simplest baseline is a random policy that, at any state, randomly decides whether to ask a question or make a guess and which question to ask. We then add DQN to optimize the top-level decision making, and a hierarchical version in which the lower-level question selection is also a learned policy. We also evaluate the state adaptation and reward shaping techniques to see how they affect policy learning.
Because we want to evaluate the efficiency and robustness of the dialogue policy, we construct three sets of experiments, increasing the difficulty step by step. In the first experiment the agent only selects questions from the ten predefined question-answer pairs generated by humans for the target image; this relatively simple setting allows us to verify the effectiveness of our framework. We then increase the task difficulty by enlarging the number of questions: there are two hundred questions generated by humans, and the answers are generated using a pretrained visual question answering model with respect to the target image. In the third experiment we scale up the testing process to question-answer pairs generated automatically using pretrained question generation and question answering models, which simulates a noisier, more realistic setting with a different answer distribution. We evaluate the policy models every one thousand iterations during the training process: we freeze the policy and report evaluation metrics such as win rate and average number of dialogue turns.
Here are the results for experiment one, where we constrain the maximum number of dialogue rounds to ten with the ten predefined questions. We report the win rate and the average turn reward, and we can see that the best model is the full policy model with the hierarchical policy, question selection, and state adaptation, which achieves the highest win rate, converges fastest, and outperforms the other models.
The next question we want to answer is whether the hierarchical reinforcement learning policy enables efficient decision making. Here we define oracle baselines in which the agent keeps asking questions in a fixed order and only makes its guess at the end of the dialogue; in other words, the oracle agent asks a preset number of rounds of questions and then makes a single decision. We found that our optimal dialogue policy achieves a significantly higher win rate than the oracle at seven rounds, and a comparable win rate to the oracle baseline at eight rounds, with no statistically significant difference. The oracles at nine and ten rounds have higher win rates because they can gather more information over longer dialogues. So we can see that our hierarchical reinforcement learning policy does enable efficient decision making.
We further evaluate the robustness of our dialogue policies. In experiment two we increase the number of questions and use a pretrained visual question answering model as the user simulator to generate answers, and we can see that our proposed model still achieves the best performance in this noisier setting.
In experiment three we further increase the task difficulty. As we know, the training and the test environments can be very different, so here we use a different answer simulator for the testing set. We observe that performance in general drops, but the proposed reward is more robust to noise. We also think there is a potential application in using this setup to check datasets constructed by two humans talking to each other, since the collected dialogues can be largely goal-free chit-chat, which may not be very suitable for task-oriented applications.
Here are some sample dialogues from our systems, for example one from experiment two and a failure example from experiment three. As we can see in the example from experiment two, the dialogue policy is able to select relevant questions, for instance about the color and the birds in the image; some wrong guesses happen, and the answer to several questions is "no", but the agent does a good job of self-correcting and makes the right guess in the end. In the failure case, since the candidate questions are over-generated using a sequence-to-sequence model, the selected questions are more generic rather than specific to the target image.
To summarize, we propose a task-oriented visual dialogue setting that is applicable and extensible to real applications. We also propose a hierarchical reinforcement learning framework to jointly learn the multimodal state representation and an efficient dialogue policy, along with a state adaptation technique that makes the vision context representation more relevant to the visual belief state. We evaluate the dialogue system in environments with different semantics to validate task completion efficiency and robustness. For future work, we plan to extend and apply the framework to real application scenarios such as online shopping, and we can also explore ways to incorporate domain knowledge such as ontologies and database interactions into the multimodal dialogue system, to enable large-scale information retrieval tasks.
Thanks.

Question from the audience: How do you define the reward signals in the different modules? Basically, how do you model the rewards?
Answer: As I mentioned, the reinforcement learning part trains the high-level policy and the question selection module. The reward consists of three parts, as I mentioned: the turn reward, the question reward, and the penalty for making a wrong guess. The reward for the question selection module applies the reward shaping technique, so we basically measure the similarity between the two embedding vectors.
Question from the audience: Is this a real environment? How does the system determine on its own whether a guess is wrong?
Answer: Because we run simulations, we have a predefined target image, so this is controlled by the simulator module, which evaluates at each state whether the guess is correct or not, and that is how we obtain the signals during the training process.
Question from the audience: About the question selection: do you have any ideas on how to find the most informative question? Here it is chosen from a fixed set; have you considered generating the most informative question directly, rather than selecting it from a predefined group of questions?
Answer: Here it is basically a discriminative approach to selecting questions: because there is a given question pool, we can only select from that set of questions. A more interesting question is how we can generate discriminative questions, and do so in an online fashion; I think that is something to explore in future work.