0:00:14 Hi everyone. I'm going to present joint work with my co-authors on the dialogue policy learning problem for task-oriented visual dialogue.
0:00:36 First, let me introduce the problem. The setting we want to study is a visually grounded dialogue task in which an agent engages with a user to identify the user's desired target image. Here you can see twenty similar images presented to the agent, and in the first turn the user provides a description of the image they want.
0:01:17 The agent then plays a more proactive role by asking relevant, discriminative questions, and once it is confident enough it makes a guess, so that the task is finished within a minimal number of turns.
0:01:37 In this setting there are two main challenges. The agent needs to learn to understand multimodal representations, and it also needs to be aware of the dynamic dialogue context, especially when it receives negative signals from its decisions, for example wrong answers or wrong guesses. The main goal for the agent is to learn an efficient dialogue policy to accomplish the task.
0:02:14 This setting is motivated by potential real-world applications. Imagine a virtual online shopping assistant that helps customers by recommending products based on the user's preferences and the multimodal context accumulated through the dialogue. As a side note, we are also working on collecting a task-oriented visual dialogue dataset based on a fashion dataset, and hopefully we will have something interesting to show next year.
0:02:53 Previous research mainly focuses on visually grounded language understanding and generation, where a question bot and an answer bot converse with each other for a fixed number of turns. We instead focus on the dialogue policy learning problem for the question bot. Within this setting, the question bot can play a more constructive role in helping the human accomplish the task, and we want to evaluate the efficiency and robustness of the dialogue policies under different task-oriented dialogue scenarios.
0:03:40 It is also worth mentioning that our work is related to hierarchical reinforcement learning. We view this as a two-stage problem: first, a high-level dialogue policy decides whether to make an information query or to commit to image retrieval, that is, a guess; then a lower-level policy selects the primitive action, such as which question to ask.
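To make the two-stage structure concrete, here is a minimal Python sketch of how such a hierarchy could act greedily at inference time. The helper functions `high_level_q`, `question_q`, and `image_scores` are hypothetical placeholders, not names from the talk.

```python
import numpy as np

# Hypothetical helpers (illustrative only):
#   high_level_q(state)  -> Q-values over the two options {ask, guess}
#   question_q(state)    -> Q-values over the candidate question pool
#   image_scores(state)  -> similarity of each candidate image to the belief state

def select_action(state, high_level_q, question_q, image_scores):
    """Two-stage decision: the high-level policy picks the option,
    the low-level component picks the primitive action."""
    option = int(np.argmax(high_level_q(state)))   # 0 = ask a question, 1 = make a guess
    if option == 0:
        return "ask", int(np.argmax(question_q(state)))
    return "guess", int(np.argmax(image_scores(state)))
```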
0:04:12 Hierarchical reinforcement learning has been applied to multi-domain dialogue systems before, but our setting involves a multimodal context and action space. Our architecture also resembles feudal reinforcement learning, which has some nice properties such as state abstraction, state sharing, and sequential execution.
0:04:35 Here is an overview of the information flow in our proposed framework. The user simulator module keeps track of the game state transitions and provides feedback and reward signals. The generated answers are fed into the visual dialogue image matching and embedding module, which updates the visual belief state over the candidate images and communicates with the dialogue state tracking module through attention signals. The dialogue state tracking module then assembles the complete dialogue state representation, which the high-level policy learning module uses to choose between asking a question and making a guess. A specialized question selection module learns which question to ask.
0:05:42 The first module is the visual dialogue image matching and embedding module. Its goal is to learn an encoder that embeds the images and the dialogue information into a joint space. The intuition is that we want the agent to understand the visual content and the semantic relation between an image and the dialogue context. We pretrain this module so that the encoder supports robust and efficient reinforcement learning training, and the resulting embedding can also be used directly for image retrieval. To verify the performance of this module we performed a sanity check and observed high image retrieval accuracy in this guessing-game setting, which means it can provide reliable signals for the reinforcement learning training.
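As a rough illustration of the matching idea (not the actual architecture from the talk), the sketch below maps pre-extracted image features and a dialogue feature vector into a shared space with two linear projections and ranks candidate images by cosine similarity. The dimensions and random initialisation are placeholders; in practice the projections would be trained.

```python
import numpy as np

rng = np.random.default_rng(0)
IMG_DIM, TXT_DIM, JOINT_DIM = 2048, 300, 512                 # illustrative sizes

W_img = rng.normal(scale=0.01, size=(JOINT_DIM, IMG_DIM))    # image projection (would be learned)
W_txt = rng.normal(scale=0.01, size=(JOINT_DIM, TXT_DIM))    # dialogue projection (would be learned)

def _normalize(v):
    return v / (np.linalg.norm(v) + 1e-8)

def embed_image(img_feat):
    return _normalize(W_img @ img_feat)

def embed_dialog(dialog_feat):
    return _normalize(W_txt @ dialog_feat)

def rank_candidates(dialog_feat, image_feats):
    """Rank candidate images by cosine similarity to the dialogue embedding."""
    d = embed_dialog(dialog_feat)
    scores = np.array([embed_image(f) @ d for f in image_feats])
    return np.argsort(-scores), scores
```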
0:06:59 The visual dialogue state tracking module keeps track of three types of state information. The visual belief state represents the agent's internal decision-making model and is the output of the visual dialogue image embedding module. The visual context state captures the visual features of the environment; here we apply what we call the state adaptation technique. The intuition is that we want the visual context representation to become more consistent with the visual belief state, that is, with the decision-making model of the agent, based on feedback attention signals. The attention signal is computed from the semantic similarity scores between the visual belief state and the image vectors, and we then take the weighted average of the image vectors; in the case of a wrong guess we set the corresponding attention weight to zero. We also encode the alignment information, the number of questions asked, the number of image guesses made, and the last action.
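Here is a small numpy sketch of how I understand the attention-based state adaptation: weight each candidate image vector by its similarity to the visual belief state, zero out images that were already guessed wrongly, and take the weighted average. The softmax normalisation is my own assumption about how the similarity scores become weights.

```python
import numpy as np

def _softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def adapt_visual_context(belief_vec, image_vecs, wrong_guess_ids=()):
    """Attention-weighted average of candidate image vectors, where the
    attention comes from similarity to the visual belief state and
    wrongly guessed images receive zero weight (assumed formulation)."""
    image_vecs = np.asarray(image_vecs, dtype=float)
    sims = np.array([v @ belief_vec /
                     (np.linalg.norm(v) * np.linalg.norm(belief_vec) + 1e-8)
                     for v in image_vecs])
    weights = _softmax(sims)
    weights[list(wrong_guess_ids)] = 0.0
    if weights.sum() > 0:
        weights = weights / weights.sum()
    adapted = (weights[:, None] * image_vecs).sum(axis=0)
    return adapted, weights
```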
0:08:15 Given the full dialogue state, we have the policy learning modules. Since we have two separate discrete action spaces, we apply the DQN method, together with experience replay, to improve the sample efficiency of the training.
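For readers less familiar with DQN training, the fragment below sketches the two pieces mentioned here: a replay buffer and the one-step TD targets. `q_target` is a hypothetical target-network function and the discount factor is illustrative; this is not the actual training code from the talk.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)

    def push(self, transition):
        self.buf.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)

def td_targets(batch, q_target, gamma=0.95):
    """One-step TD targets  r + gamma * max_a' Q_target(s', a')  for a sampled batch."""
    targets = []
    for state, action, reward, next_state, done in batch:
        y = reward if done else reward + gamma * float(np.max(q_target(next_state)))
        targets.append((state, action, y))
    return targets
```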
0:08:39 Another important part of the reinforcement learning setup is the reward design. The reward for training can be decomposed into an image retrieval (guess) reward and a question reward. We apply a reward shaping technique to the question reward so that it reflects the information gain of the question asked: we compute it as the difference between the cosine similarity of the visual belief state and the target image vector before and after the question is answered.
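A minimal sketch of that shaping term, under the assumption that the belief states and the target image live in the same joint embedding space described earlier:

```python
import numpy as np

def _cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

def question_shaping_reward(belief_before, belief_after, target_image_vec):
    """Information-gain style reward for a question: how much closer the
    visual belief state moved toward the target image once the answer
    has been incorporated."""
    return _cosine(belief_after, target_image_vec) - _cosine(belief_before, target_image_vec)
```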
0:09:29 The question selection module learns to select the most informative question to ask at each turn, based on the shared visual context state. We use a reinforcement learning network that can handle a large discrete action space: the Q-value is estimated from the embedding vectors of the visual context and the candidate questions. The reward here is the intermediate, information-gain reward we just discussed, and we use a standard exploration strategy during training.
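As a sketch of how Q-values over a large question pool could be scored against the shared context, the snippet below uses a simple dot product between the context embedding and each question embedding, with ε-greedy exploration. Both the scoring function and the exploration scheme are assumptions on my part, not details confirmed in the talk.

```python
import numpy as np

def question_q_values(context_vec, question_embeddings):
    """Score every candidate question against the shared visual context
    embedding; a dot product scales naturally to a large discrete pool."""
    return np.asarray(question_embeddings) @ context_vec

def pick_question(context_vec, question_embeddings, epsilon=0.1,
                  rng=np.random.default_rng()):
    q = question_q_values(context_vec, question_embeddings)
    if rng.random() < epsilon:          # assumed epsilon-greedy exploration
        return int(rng.integers(len(q)))
    return int(np.argmax(q))
```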
0:10:18 To train the reinforcement learning agent we need a user simulator, so we propose a corpus-based one. Each game consists of a set of similar candidate images and a target image, and the target image is associated with ten rounds of question-answer pairs. The simulator provides the reward signal of whether a guess matches the target image and also checks the terminating conditions. There are three terminating conditions: the agent's guess is correct, the maximum number of guesses is reached, or the maximum number of dialogue turns is reached, which depends on the experiment setting. We define the winning and losing rewards as plus ten and minus ten, and each wrong guess incurs an additional negative penalty.
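The game loop below is a hypothetical sketch of such a simulated episode, just to make the terminating conditions and rewards concrete. The `agent` and `qa_lookup` objects, the guess budget, and the wrong-guess penalty value are illustrative assumptions; only the plus/minus ten win and loss rewards are taken from the talk.

```python
def run_episode(agent, num_turns, target_id, qa_lookup,
                max_guesses=2, win_reward=10.0, lose_reward=-10.0,
                wrong_guess_penalty=-2.0):
    """One simulated guessing game. Ends when the target is guessed,
    the guess budget is used up, or the turn limit is reached."""
    guesses, history = 0, []
    for _ in range(num_turns):
        kind, action = agent.act(history)                 # ('ask', q_id) or ('guess', img_id)
        if kind == "ask":
            history.append((action, qa_lookup[action]))   # question id -> simulated answer
        else:
            guesses += 1
            if action == target_id:
                agent.observe(win_reward)
                return "win"
            agent.observe(wrong_guess_penalty)
            if guesses >= max_guesses:
                break
    agent.observe(lose_reward)
    return "lose"
```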
0:11:11 To evaluate the contribution of each component within our framework, we focus on five policy models. The simplest baseline is a random policy that randomly selects a question or makes a guess at any state. We then add a DQN to optimize the high-level decision making, and another DQN for the lower-level question selection process. We also evaluate the state adaptation and reward shaping techniques to see how they affect policy learning.
0:11:50 Because we want to evaluate the efficiency and robustness of the dialogue policy, we construct three sets of experiments of increasing difficulty. In the first experiment, the agent can only select questions from a predefined set, and it observes the question-answer pairs generated by humans for the target image. This relatively controlled setting allows us to verify the effectiveness of our framework. We then increase the task difficulty by enlarging the question pool to two hundred questions generated by humans, with the answers generated by a pretrained visual question answering model with respect to the target image.
0:12:39 In the third experiment we scale up the testing process to question-answer pairs generated automatically with pretrained models, which simulates a noisier, more realistic setting with a different question distribution. We evaluate the policy models every one thousand iterations during training: we freeze the policy and report evaluation metrics such as win rate and the average number of dialogue turns.
0:13:15 Here are the results of the first experiment, where we constrain the maximum number of dialogue rounds to ten and use the predefined set of ten questions. The plots show the win rate and the average turn reward. We can see that the full policy model is the best: it converges faster and better, and it outperforms the variant that uses the hierarchical policy, question selection, and state adaptation but not reward shaping.
0:13:54 One question we want to answer is whether the hierarchical reinforcement learning policy enables efficient decision making. We define oracle baselines in which the agent asks the questions in a fixed order and only makes a guess at the end of the dialogue; an oracle at k rounds means the agent asks k rounds of questions and then makes a single guess. Our optimal dialogue policy achieves a significantly higher win rate than the oracle baseline at seven rounds, and a comparable win rate to the oracle baseline at eight rounds, where the difference is not statistically significant. The oracle baselines at nine and ten rounds have higher win rates because they can gather more information over the longer dialogues. So we can see that our hierarchical reinforcement learning policy enables efficient decision making.
0:15:09 We further want to evaluate the robustness of our dialogue policies. In the second experiment we increase the number of questions and use a pretrained visual question answering model as the user simulator to generate the answers. We can see that our full policy model achieves the best performance in this noisier setting.
0:15:43 In the third experiment we further increase the task difficulty. When a system is deployed, the distributions of the training data and the test data can be very different, so here we use automatically generated questions to simulate a different testing dataset. We observe that performance drops in general, but the proposed reward shaping is more robust to the noise, and we think this could potentially be used as a bootstrapping approach when datasets constructed by humans are limited.
0:16:26 As we just discussed, the automatically generated questions tend to be quite generic, so they may not always be well suited to task-oriented applications. Here are sample dialogues from our system: a winning example from experiment two and a failure example from experiment three. In the winning example, the dialogue policy is able to select relevant questions, for instance about the color and about the birds in the image, and although some wrong guesses happen and some of the answers are noisy, the agent does a good job of self-correcting and makes the right guess in the end. In the failure case, the questions were generated automatically with a sequence-to-sequence model, so they tend to be general rather than specific to the target image.
0:17:38 To summarize, we propose a task-oriented image-guessing visual dialogue setting that is applicable and extensible to real applications, and we propose a hierarchical reinforcement learning framework to effectively learn the multimodal state representation and an efficient dialogue policy. We also propose a state adaptation technique to make the visual context representation more relevant to the visual dialogue state. We evaluated the dialogue agents in settings of increasing difficulty to validate the task completion efficiency and robustness. For future work, we plan to extend and apply the framework to more realistic task-oriented application scenarios, and to explore ways of incorporating domain knowledge such as ontologies and database interactions into the multimodal dialogue system to enable large-scale information retrieval tasks.
0:18:48 Thanks.
0:19:19 How do you propose the reward signals for the different modules? Basically, how do you model the rewards in this framework?
0:19:27 As I mentioned, the reinforcement learning part trains both the high-level policy and the question selection module. The reward consists of three parts: the guess reward, the question reward, and the penalty for making a wrong guess. The question reward is where we apply the reward shaping technique: we measure the change in similarity between the two embedding vectors, the visual belief state and the target image.
0:20:35 In a real environment, how does the system know that a guess is wrong?
0:20:49 Because we work with simulations, we have a predefined target image, so the simulator module evaluates at each state whether a guess is correct or not, and the agent gets that signal during the training process.
0:21:30 Regarding the question selection: here the questions come from a fixed set. Do you have any ideas on how to find, or even generate, the most informative or discriminative question, perhaps end to end?
0:22:12 I think that's a good question. Here it is basically a discriminative approach: we select questions from a given question pool, because a predefined set of questions is available, so we can only choose among them. A more interesting direction is how to generate discriminative questions in an online fashion; I think that is something to explore in future work.