0:00:14 | Hi everyone. This is joint work done with my co-authors. |
---|
0:00:27 | Today I am going to talk about the dialogue policy learning problem for task-oriented visual dialogue. |
---|
0:00:36 | First, let me introduce the problem. |
---|
0:00:41 | The specific setting we want to study is one where a visual dialogue agent engages with the user to help |
---|
0:00:52 | locate and identify the target image. |
---|
0:00:57 | So here you can see |
---|
0:00:59 | there are twenty similar images presented to the agent, and at the first turn |
---|
0:01:09 | the user can provide a description of the image they want. |
---|
0:01:17 | Then the agent plays a more proactive role by asking relevant, discriminative questions, |
---|
0:01:27 | and hopefully, once its confidence is high enough, it makes a decision, a guess, |
---|
0:01:33 | to finish the task within a minimal number of turns. |
---|
0:01:37 | So in this setting |
---|
0:01:38 | there are two main challenges: |
---|
0:01:42 | the agent needs to learn to understand the multimodal representations, |
---|
0:01:48 | and also to be aware of the dynamic dialogue context, |
---|
0:01:52 | especially when receiving signals |
---|
0:01:56 | relevant to its decisions, for example wrong information in the answers or its own wrong guesses. |
---|
0:02:03 | So the main goal for the agent is to learn an |
---|
0:02:07 | efficient dialogue policy to accomplish the task. |
---|
0:02:14 | The motivation here comes from |
---|
0:02:17 | some potential real-world applications. |
---|
0:02:21 | Imagine, for example, a virtual online shopping assistant |
---|
0:02:26 | that helps customers |
---|
0:02:28 | by proposing or recommending products based on the user's preferences and the |
---|
0:02:35 | multimodal context throughout the dialogue. |
---|
0:02:39 | As a side note, we are also working on collecting a task-oriented visual dialogue dataset based on |
---|
0:02:44 | a fashion dataset, |
---|
0:02:47 | and hopefully we will have something interesting to share |
---|
0:02:50 | next year. |
---|
0:02:53 | So, |
---|
0:02:54 | previous research on visual dialogue mainly focuses on visual |
---|
0:02:59 | and language understanding and generation, |
---|
0:03:02 | where a question bot and an answer bot converse with each other within |
---|
0:03:07 | a fixed number of turns. |
---|
0:03:09 | However, we focus on the dialogue policy learning problem of the question bot. |
---|
0:03:16 | Equipped with a learned dialogue policy, |
---|
0:03:19 | the question bot can play a more constructive role in helping |
---|
0:03:24 | the human to accomplish the task. |
---|
0:03:28 | We want to evaluate the efficiency and the robustness of the dialogue policies in |
---|
0:03:35 | terms of more task-oriented dialogue system metrics. |
---|
0:03:40 | It is also worth mentioning that our |
---|
0:03:43 | work is related to hierarchical reinforcement learning. |
---|
0:03:47 | Basically, we view this as a two-stage problem. |
---|
0:03:50 | First, we want to obtain a high-level dialogue policy |
---|
0:03:53 | that decides, at each turn, whether to make an information query or to make a decision, that is, to |
---|
0:03:59 | do the image retrieval. |
---|
0:04:01 | Then we have a lower-level policy that selects the primitive |
---|
0:04:06 | actions, |
---|
0:04:07 | like which question to ask. |
---|
0:04:12 | Hierarchical reinforcement learning has been applied in multi-domain dialogue systems, but not with |
---|
0:04:18 | our multimodal context and action space. |
---|
0:04:23 | Our architecture also resembles feudal reinforcement learning, which has some nice properties |
---|
0:04:28 | such as state abstraction, state sharing, and sequential execution. |
---|
0:04:35 | Here is an overview of the information flow in our proposed framework. |
---|
0:04:44 | Importantly, we have a simulator module, which handles the game state transitions and also provides feedback, |
---|
0:04:53 | that is, the reward signals, and |
---|
0:04:54 | the generated answers are |
---|
0:04:56 | fed into the visual dialogue matching and embedding module to |
---|
0:05:01 | update the visual belief state with new question-answer pairs, and it also communicates with |
---|
0:05:08 | the dialogue state tracking module via an attention signal. The dialogue state tracker |
---|
0:05:15 | then forms the belief over the dialogue state representations, and the high-level dialogue policy |
---|
0:05:21 | module uses this |
---|
0:05:26 | to learn the policy in terms of asking questions or making a guess, |
---|
0:05:31 | and we have a specialized question selection module to |
---|
0:05:35 | learn the decision of which question to ask. |
---|
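A minimal sketch of the two-level decision loop described above, assuming a hypothetical environment interface (`env.reset`, `env.step`, `env.retrieve_image`) and already-trained Q-functions; it only illustrates the control flow, not the exact implementation from the talk:

```python
import numpy as np

ASK, GUESS = 0, 1  # meta-actions of the high-level dialogue policy

def run_episode(env, high_level_q, question_q, question_pool, max_turns=10):
    """One game: at each turn, either ask a question or guess an image."""
    state = env.reset()  # initial caption and candidate-image pool
    reward = 0.0
    for _ in range(max_turns):
        meta_action = int(np.argmax(high_level_q(state)))  # high-level decision
        if meta_action == ASK:
            # low-level policy: pick the most informative candidate question
            scores = [question_q(state, q) for q in question_pool]
            action = ("ask", question_pool[int(np.argmax(scores))])
        else:
            # guess: retrieve the closest image in the joint embedding space
            action = ("guess", env.retrieve_image(state))
        state, reward, done = env.step(action)
        if done:  # correct guess, or guess/turn limit reached
            break
    return reward
```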
0:05:42 | The first module is the visual dialogue |
---|
0:05:45 | matching and embedding module. |
---|
0:05:47 | The goal of this module is to |
---|
0:05:50 | learn an encoder |
---|
0:05:55 | that maps the vision and the text information into a joint space. |
---|
0:06:00 | The intuition is |
---|
0:06:02 | that we want the agent to be able to |
---|
0:06:06 | understand the visual content and |
---|
0:06:10 | the semantic relation between the image and |
---|
0:06:15 | the dialogue context. |
---|
0:06:18 | Besides, |
---|
0:06:20 | we also need to pretrain this module |
---|
0:06:23 | in order to have |
---|
0:06:27 | robust and efficient reinforcement learning training, |
---|
0:06:32 | and its output can also be used for image retrieval. |
---|
0:06:38 | To verify the performance of this module, we performed a sanity check, |
---|
0:06:43 | and we observed high image retrieval accuracies |
---|
0:06:49 | in this guessing-game setting, |
---|
0:06:50 | which means it can provide reliable signals for our reinforcement learning training. |
---|
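As a rough illustration of the joint-embedding idea and the retrieval sanity check mentioned above, the sketch below ranks the candidate images against the encoded dialogue; the encoder networks and the cosine-similarity scoring are assumptions, not the exact pretraining setup used in this work:

```python
import torch.nn.functional as F

def retrieve(dialog_feats, image_feats, text_encoder, image_encoder):
    """Rank candidate images by similarity to the dialogue in the joint space."""
    d = F.normalize(text_encoder(dialog_feats), dim=-1)      # (1, dim)
    imgs = F.normalize(image_encoder(image_feats), dim=-1)   # (num_images, dim)
    scores = imgs @ d.squeeze(0)                             # cosine similarities
    return int(scores.argmax()), scores

def retrieval_accuracy(games, text_encoder, image_encoder):
    """Sanity check: how often the top-ranked image is the true target."""
    hits = 0
    for dialog_feats, image_feats, target_idx in games:
        pred, _ = retrieve(dialog_feats, image_feats, text_encoder, image_encoder)
        hits += int(pred == target_idx)
    return hits / len(games)
```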
0:06:59 | In the visual dialogue state tracking module, we track three types of |
---|
0:07:03 | state information. |
---|
0:07:05 | The vision belief state represents the agent's internal decision-making model, |
---|
0:07:11 | which is the output of the visual dialogue matching and embedding module, and |
---|
0:07:17 | the vision context state captures the |
---|
0:07:21 | visual features of the environment. Here we applied a novel |
---|
0:07:27 | technique called state adaptation. |
---|
0:07:31 | Basically, the intuition is that we want to adapt the vision context |
---|
0:07:35 | to be more focused on, and better matched to, |
---|
0:07:39 | the vision belief state, or the decision-making model of the agent, |
---|
0:07:44 | based on some feedback, namely the attention signals. |
---|
0:07:48 | The attention signal here is calculated from the semantic similarity scores between |
---|
0:07:54 | the vision belief state and the image vectors, and |
---|
0:07:58 | then we take the weighted average. |
---|
0:08:00 | In case of a wrong guess, we set the corresponding attention signal to |
---|
0:08:05 | zero. |
---|
0:08:06 | We also track other information such as the number of questions asked, the |
---|
0:08:10 | number of image guesses, and the last action. |
---|
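A minimal sketch of the state adaptation step as described: attention weights come from the similarity between the belief state and each candidate image vector, a wrongly guessed image is zeroed out, and the adapted vision context is the weighted average. The clipping and normalization details are assumptions:

```python
import numpy as np

def adapt_vision_context(belief_vec, image_vecs, wrong_guess_idx=None):
    """Re-weight candidate-image features by their similarity to the belief state."""
    # semantic similarity between the vision belief state and every candidate image
    sims = image_vecs @ belief_vec / (
        np.linalg.norm(image_vecs, axis=1) * np.linalg.norm(belief_vec) + 1e-8)
    attn = np.clip(sims, 0.0, None)   # assumed: keep only non-negative scores
    if wrong_guess_idx is not None:
        attn[wrong_guess_idx] = 0.0   # attention on a wrongly guessed image -> 0
    attn /= attn.sum() + 1e-8         # assumed normalization to a distribution
    return attn @ image_vecs          # weighted average = adapted vision context
```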
0:08:15 | Given the tracked dialogue state, we have |
---|
0:08:21 | the policy learning modules. Since we have two separate Q-networks, we applied a |
---|
0:08:29 | DQN-based method, |
---|
0:08:31 | and we applied prioritized experience replay |
---|
0:08:36 | to improve the sample efficiency. |
---|
0:08:39 | Another important part of reinforcement learning is the reward design. |
---|
0:08:46 | The reward for the model training can be decomposed into |
---|
0:08:52 | turn rewards, |
---|
0:08:54 | question rewards, and image retrieval rewards, and we |
---|
0:09:00 | apply reward shaping to the question selection reward, which |
---|
0:09:09 | captures the information gain |
---|
0:09:13 | of the question asked. |
---|
0:09:15 | Here we calculate it as the |
---|
0:09:19 | difference between the similarity scores of the visual belief state and the target image vector, before and after the question. |
---|
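A hedged reading of the shaped question reward just described, as the change in how close the belief state is to the target image before and after the new question-answer pair; the use of cosine similarity here is an assumption:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def question_reward(belief_before, belief_after, target_image_vec):
    """Shaping term: information gained about the target by asking this question."""
    return cosine(belief_after, target_image_vec) - cosine(belief_before, target_image_vec)
```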
0:09:29 | The question selection module aims to select |
---|
0:09:34 | the most informative question to ask at each turn, |
---|
0:09:39 | based on the shared visual context state. |
---|
0:09:43 | We use a so-called deep reinforcement relevance network, |
---|
0:09:48 | which is able to handle a large, discrete, text-based |
---|
0:09:52 | action space. |
---|
0:09:54 | The Q-value can be estimated |
---|
0:09:58 | from the embedding |
---|
0:10:00 | vectors of the visual context |
---|
0:10:04 | and the questions. |
---|
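To illustrate the relevance-network idea for the large discrete question space, the sketch below embeds the state and each candidate question separately and takes their dot product as the Q-value; the layer sizes and the dot-product interaction are assumptions:

```python
import torch
import torch.nn as nn

class RelevanceQNetwork(nn.Module):
    """Q(s, q) = <f(state), g(question)> over a large set of text actions."""
    def __init__(self, state_dim, question_dim, hidden=256):
        super().__init__()
        self.state_net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.question_net = nn.Sequential(
            nn.Linear(question_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

    def forward(self, state, question_embs):
        s = self.state_net(state)             # (hidden,)
        q = self.question_net(question_embs)  # (num_questions, hidden)
        return q @ s                          # one Q-value per candidate question
```

An epsilon-greedy policy over these per-question Q-values would then pick a random question with probability epsilon and the highest-scoring one otherwise, matching the exploration strategy mentioned next.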
0:10:06 | The reward here is the intermediate, shaped question reward we just discussed, |
---|
0:10:10 | and we use an epsilon-greedy strategy |
---|
0:10:14 | as the exploration policy. |
---|
0:10:18 | To train with reinforcement learning, we need a simulator, so we |
---|
0:10:23 | propose a corpus-based one. Each game instance consists |
---|
0:10:27 | of a pool of similar images, |
---|
0:10:30 | and each image |
---|
0:10:32 | corresponds to ten rounds of question-answer pairs. |
---|
0:10:36 | This module provides the reward signals and the answers related to the target image, |
---|
0:10:42 | and also checks the termination conditions. There are |
---|
0:10:46 | three types of termination conditions: first, the agent gets the correct answer; |
---|
0:10:51 | second, the maximum number of guesses is reached; |
---|
0:10:54 | and third, the dialogue turn limit is reached, which depends on the experiment setting. |
---|
0:11:00 | We define the winning and losing rewards |
---|
0:11:03 | as plus and minus ten, and |
---|
0:11:06 | a wrong guess incurs an additional negative penalty. |
---|
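A hedged sketch of the simulator's termination and reward logic as described: the game ends on a correct guess, on hitting the guess limit, or on hitting the turn limit, with win/loss rewards of plus and minus ten. The wrong-guess penalty value, the guess limit, and the `game` interface are placeholders, not values from the talk:

```python
WIN_REWARD, LOSS_REWARD = 10.0, -10.0   # win/loss rewards as stated in the talk
WRONG_GUESS_PENALTY = -5.0              # placeholder; the exact value is not specified here

def simulator_step(game, action, turn, num_guesses, max_turns=10, max_guesses=3):
    """Return (reward, done) for one agent action against the corpus-based simulator."""
    kind, payload = action
    if kind == "ask":
        _answer = game.lookup_answer(payload)   # answer drawn from the target image's QA pairs
        out_of_turns = turn + 1 >= max_turns
        return (LOSS_REWARD if out_of_turns else 0.0), out_of_turns
    # kind == "guess"
    if payload == game.target_image:
        return WIN_REWARD, True                 # correct guess: win, game over
    if num_guesses + 1 >= max_guesses or turn + 1 >= max_turns:
        return LOSS_REWARD, True                # out of guesses or turns: loss
    return WRONG_GUESS_PENALTY, False           # wrong guess, game continues
```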
0:11:11 | To evaluate the contribution of each component within our framework, we compare |
---|
0:11:17 | five policy models. |
---|
0:11:20 | The first baseline is |
---|
0:11:22 | a random policy that selects the questions or makes a guess at any state. |
---|
0:11:27 | Then we add a DQN to optimize the top-level |
---|
0:11:31 | decision making, and after that we add |
---|
0:11:34 | the lower-level question selection policy. |
---|
0:11:38 | We also want to evaluate the state adaptation and reward shaping techniques to see |
---|
0:11:45 | how they affect the policy learning. |
---|
0:11:50 | Because we want to |
---|
0:11:52 | evaluate the efficiency and the robustness of the dialogue policy, |
---|
0:11:57 | we construct three sets of experiments, |
---|
0:11:59 | step by step. |
---|
0:12:02 | In the first experiment, the agent only selects |
---|
0:12:06 | questions from a predefined set, and it obtains question-answer pairs generated by humans |
---|
0:12:12 | for the target image. |
---|
0:12:14 | This |
---|
0:12:16 | relatively clean setting allows us to verify the effectiveness of our |
---|
0:12:20 | framework. |
---|
0:12:21 | Then we increase the task difficulty by enlarging the number of questions, |
---|
0:12:27 | so there are two hundred questions generated by humans, |
---|
0:12:30 | and the answers are generated using a visual question answering model |
---|
0:12:37 | with respect to the target image. |
---|
0:12:40 | In the third experiment, we scale up the testing process to use |
---|
0:12:45 | question-answer pairs generated automatically by question generation and question answering models, |
---|
0:12:52 | which simulates a more noisy and realistic setting. |
---|
0:12:58 | We evaluate the policy models every one thousand iterations during the |
---|
0:13:04 | training process: we freeze the policy and look at evaluation metrics like win rate |
---|
0:13:09 | and the average number of dialogue turns. |
---|
0:13:15 | Here are the results of experiment one, where we constrain |
---|
0:13:20 | the maximum number of dialogue rounds to ten, |
---|
0:13:25 | with a predefined set of ten questions. |
---|
0:13:29 | The figures show |
---|
0:13:32 | the curves for the win rate and the average turn reward, |
---|
0:13:38 | and we can see the optimal model is the last policy model, which has |
---|
0:13:43 | the fastest convergence; this is the model with the hierarchical |
---|
0:13:50 | policy, question selection, and state adaptation. |
---|
0:13:54 | The next question we want to answer is whether the hierarchical reinforcement |
---|
0:13:58 | learning policy enables efficient decision making. |
---|
0:14:02 | Here we define an oracle baseline where the agent keeps asking questions |
---|
0:14:09 | in order |
---|
0:14:10 | and only makes the guess at the end of the dialogue, |
---|
0:14:14 | which means |
---|
0:14:19 | the agent asks a fixed number of rounds of questions and then only |
---|
0:14:26 | makes the decision. We found our optimal dialogue policy |
---|
0:14:33 | achieves a significantly higher win rate than the oracle at seven rounds, |
---|
0:14:38 | and has a comparable win rate with the oracle baseline at eight, where there is |
---|
0:14:48 | no statistically significant difference. |
---|
0:14:51 | We also note that the oracles at nine and ten have higher win rates |
---|
0:14:56 | because they can gather more information over longer turns. |
---|
0:15:01 | So we can see that our hierarchical reinforcement learning policy enables efficient decision making. |
---|
0:15:11 | We further want to evaluate |
---|
0:15:16 | the robustness of our dialogue policies. |
---|
0:15:19 | So in experiment two, |
---|
0:15:22 | we increase the number of questions, and we also use a |
---|
0:15:28 | visual question answering model as the user simulator to generate answers, and we |
---|
0:15:33 | can see our proposed approach achieves the best performance in this more noisy setting. |
---|
0:15:43 | In experiment three, we further |
---|
0:15:47 | increase the |
---|
0:15:49 | task difficulty. |
---|
0:15:50 | As we know, in real-world settings, |
---|
0:15:54 | the training and the test data can be very different, |
---|
0:15:58 | so here we use a different simulator to construct a different testing dataset. |
---|
0:16:07 | We observe that the performance drops in general, but our proposed model |
---|
0:16:13 | is more robust to the noise, and we think there is a potential application of using |
---|
0:16:20 | this setup as an alternative to datasets constructed |
---|
0:16:26 | purely by humans: as we just talked about, |
---|
0:16:32 | the quality of such dialogues tends to be casual chitchat, |
---|
0:16:37 | so they may not be very suitable for task-oriented |
---|
0:16:40 | applications. Here are some sample dialogues from our |
---|
0:16:47 | system: a success example from experiment two and a failure example from experiment three. |
---|
0:16:52 | As we can see in the example from experiment two, the dialogue policy |
---|
0:16:58 | is able to select the relevant questions, for example those related to color |
---|
0:17:04 | and birds, |
---|
0:17:05 | and although some wrong guesses happen and some of the answers are wrong, |
---|
0:17:14 | the agent can do a good job of self-correcting and then makes the right guess |
---|
0:17:18 | in the end. |
---|
0:17:20 | In the failure example, |
---|
0:17:23 | since the questions are generated using a sequence-to-sequence model, |
---|
0:17:28 | the questions in the test set are more generic and |
---|
0:17:32 | not very specific. |
---|
0:17:38 | To summarize, we propose |
---|
0:17:41 | a task-oriented visual dialogue setting that is |
---|
0:17:46 | applicable and extensible to real applications, and we propose a hierarchical reinforcement learning framework |
---|
0:17:53 | to effectively learn the multimodal state |
---|
0:17:56 | representation and an efficient dialogue policy. |
---|
0:18:00 | We also propose a state adaptation technique to make the vision context |
---|
0:18:07 | representation more relevant to the visual dialogue belief state, |
---|
0:18:11 | and we evaluate with dialogue system metrics in different simulated scenarios |
---|
0:18:17 | to validate the task completion efficiency and robustness. |
---|
0:18:23 | For future work, we plan to extend and apply the proposed framework to |
---|
0:18:28 | more realistic application scenarios, such as online shopping, and we |
---|
0:18:34 | can also explore ways to incorporate domain knowledge like the ontology |
---|
0:18:39 | and database interactions into the multimodal dialogue system to enable large-scale |
---|
0:18:45 | information retrieval tasks. |
---|
0:18:48 | Thanks. |
---|
0:19:05 | (inaudible) |
---|
0:19:08 | Okay. |
---|
0:19:19 | How do you pass the signals between the different modules? |
---|
0:19:22 | Basically, how do you model the rewards, |
---|
0:19:24 | and how do the rewards work, I guess? |
---|
0:19:27 | So as I mentioned, |
---|
0:19:30 | the rewards are mostly used for |
---|
0:19:35 | the reinforcement learning part, for |
---|
0:19:38 | the high-level dialogue policy and the question selection module. |
---|
0:19:44 | So this part |
---|
0:19:47 | consists of three |
---|
0:19:48 | kinds of rewards, as I mentioned: the turn reward, the question reward, and |
---|
0:19:53 | also the image retrieval reward, which penalizes making a wrong guess. |
---|
0:19:57 | The reward for the questions is actually where we |
---|
0:20:04 | apply the reward shaping technique, |
---|
0:20:07 | so we measure |
---|
0:20:13 | basically the similarity between the two embedding vectors. |
---|
0:20:35 | Is it a real environment? |
---|
0:20:42 | How does the system know its guess is wrong? |
---|
0:20:49 | Okay, |
---|
0:20:52 | because we have |
---|
0:20:54 | the simulation, we have predefined the target image, |
---|
0:20:59 | so the ground truth is controlled by the simulator module, which can evaluate, |
---|
0:21:04 | at each state, whether |
---|
0:21:08 | the guess is correct or not, |
---|
0:21:10 | and so we can get the signals |
---|
0:21:15 | during the training process. |
---|
0:21:30 | Sorry, regarding the question selection, do you have any idea how to |
---|
0:21:36 | find |
---|
0:21:38 | the most important, most distinguishing question? |
---|
0:21:42 | Here it is a fixed set of predefined questions. |
---|
0:21:48 | So the questions you have here |
---|
0:21:53 | are predefined, rather than generated; |
---|
0:21:56 | have you considered generating |
---|
0:22:02 | the most distinguishing |
---|
0:22:03 | question on the fly? (partly inaudible) |
---|
0:22:18 | That is kind of an open question; I think it is a good question. |
---|
0:22:22 | So here it is basically a discriminative approach |
---|
0:22:28 | to select the questions |
---|
0:22:30 | from a predefined pool, because there is a given question pool, |
---|
0:22:36 | so we can just select from that pool of questions. But, |
---|
0:22:40 | okay, a more interesting question is how we can generate discriminative questions, |
---|
0:22:46 | and in an online fashion. |
---|
0:22:50 | I think that is something to explore in the future. |
---|