0:00:17 | hi everyone, my name is Yanchao Yu, from the Interaction Lab at Heriot-Watt University in |
---|
0:00:21 | Edinburgh |
---|
0:00:22 | and today I want to talk about our paper on training an adaptive dialogue policy |
---|
0:00:28 | for interactive learning of visually grounded word meanings |
---|
0:00:32 | so in this talk I want to cover a couple of things, |
---|
0:00:38 | mainly two aspects: first, I will give an overview of the system |
---|
0:00:43 | architecture |
---|
0:00:44 | and then show, |
---|
0:00:46 | based on a movie, how the system works |
---|
0:00:49 | and also, |
---|
0:00:50 | based on this system architecture, we investigate the effectiveness of different |
---|
0:00:56 | dialogue strategies and capabilities |
---|
0:01:00 | in the |
---|
0:01:01 | interactive learning process, and based on that investigation we trained an |
---|
0:01:08 | adaptive |
---|
0:01:10 | dialogue policy |
---|
0:01:11 | now let's move to the motivation, okay |
---|
0:01:14 | what we want to do in this case is to build a teachable |
---|
0:01:18 | multimodal system which can learn |
---|
0:01:21 | and identify visual attributes |
---|
0:01:23 | using natural language from users, with utterances like "this is a square" or "what does this look like?" |
---|
0:01:29 | or something like that |
---|
0:01:31 | and then, |
---|
0:01:32 | in contrast to other work, we learn everything: we learn |
---|
0:01:37 | the visual concepts, we learn the |
---|
0:01:39 | knowledge, online, using |
---|
0:01:41 | natural dialogue interactions with users, rather than texts, database descriptions or manual annotations |
---|
0:01:48 | and also |
---|
0:01:49 | the system uses a really small amount of training |
---|
0:01:54 | data, maybe only a handful of examples |
---|
0:01:57 | well, and then we put the system into a really different position: we put it |
---|
0:02:02 | into the shoes of a child, rather than a second language learner, because for a second language |
---|
0:02:06 | learner, |
---|
0:02:08 | it has all of the visual knowledge: it knows what's the meaning |
---|
0:02:12 | of colour, what's the meaning of shape, and all it needs to |
---|
0:02:15 | do is to try to associate that |
---|
0:02:17 | visual knowledge with the specific words or phrases in another language |
---|
0:02:22 | but the child is quite different, because it doesn't have any |
---|
0:02:25 | knowledge about that, and it has to try to learn the |
---|
0:02:30 | word meanings from scratch |
---|
0:02:33 | as we know, there are lots of recent works trying |
---|
0:02:39 | to deal with the symbol grounding problem: they try to generate |
---|
0:02:44 | natural language descriptions of images or videos, |
---|
0:02:47 | or they try to identify or describe visual objects using visual features |
---|
0:02:51 | like colours, shapes or materials |
---|
0:02:54 | but |
---|
0:02:56 | to our knowledge none of these methods allows for a teachable robot |
---|
0:03:00 | or multimodal system, |
---|
0:03:02 | and these aspects should be combined together |
---|
0:03:05 | so here we present a table comparing our project with others, and as we |
---|
0:03:11 | can see, almost all other works focus on only |
---|
0:03:15 | some of the aspects in this table, but our work considers all of the |
---|
0:03:20 | aspects, |
---|
0:03:21 | including interaction, |
---|
0:03:24 | online learning, natural language and incrementality, okay |
---|
0:03:29 | now let's move to the system architecture. it's a really general |
---|
0:03:35 | architecture: it combines the dialogue module, the DS-TTR module, |
---|
0:03:42 | and the |
---|
0:03:43 | vision module |
---|
0:03:45 | we can see on the left the |
---|
0:03:48 | set of visual attribute classifiers, which ground the semantic representations in the |
---|
0:03:53 | language processing module |
---|
0:03:55 | and then |
---|
0:03:56 | it sends |
---|
0:04:00 | the predictions from the classifiers to the dialogue, and the visual observation module |
---|
0:04:06 | produces a semantic analysis of the real scene |
---|
0:04:09 | and then this becomes the non-linguistic |
---|
0:04:12 | context as part of the dialogue, |
---|
0:04:15 | used |
---|
0:04:16 | for parsing and generation, |
---|
0:04:18 | for example for pronoun or reference resolution |
---|
0:04:22 | and on the other hand, for the |
---|
0:04:24 | dialogue module, the DS-TTR module, we parse the |
---|
0:04:30 | dialogue with the users, and through the parsing, |
---|
0:04:35 | the object judgements are used as the labels to |
---|
0:04:40 | update |
---|
0:04:41 | the classifiers incrementally during the interaction |
---|
0:04:46 | now let's talk about the vision |
---|
0:04:48 | module. the vision module can extract high-dimensional feature vectors, |
---|
0:04:53 | including the HSV |
---|
0:04:56 | space for colour and the bag of visual words for shape |
---|
0:04:59 | and then |
---|
0:05:00 | it incrementally trains a binary classifier for each visual attribute |
---|
0:05:05 | using logistic regression with the stochastic gradient descent model |
---|
0:05:11 | finally, after the classification, |
---|
0:05:14 | it produces the visual context based on the predictions and the corresponding confidence |
---|
0:05:20 | scores, |
---|
0:05:22 | trying to ground the semantic atoms onto the particular classifiers |
---|
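The incremental training just described can be sketched roughly as follows. This is a minimal illustration in Python, not the system's actual implementation; the feature dimensionality, learning rate and example vectors are all invented for the sketch.

```python
import numpy as np

class AttributeClassifier:
    """One binary logistic-regression classifier per visual attribute
    (e.g. "red", "square"), updated online with stochastic gradient
    descent as labelled examples arrive from the dialogue."""

    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        # Sigmoid of the linear score: the confidence that the attribute applies.
        return 1.0 / (1.0 + np.exp(-(self.w @ x + self.b)))

    def partial_fit(self, x, y):
        # One SGD step on the log-loss for a single (features, label) pair.
        err = self.predict_proba(x) - y
        self.w -= self.lr * err * x
        self.b -= self.lr * err

# Tiny illustration: learn "red" from a few tutor-labelled examples.
red_clf = AttributeClassifier(n_features=3)
red_features = np.array([1.0, 0.0, 0.0])    # toy colour features
green_features = np.array([0.0, 1.0, 0.0])
for _ in range(200):
    red_clf.partial_fit(red_features, 1)    # tutor: "this is red"
    red_clf.partial_fit(green_features, 0)  # tutor: "no, this is not red"
```

Because each update uses only one example, a classifier like this can be created on the fly for a new word and refined turn by turn.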
0:05:28 | as for the DS-TTR module: |
---|
0:05:30 | the DS-TTR module contains |
---|
0:05:32 | a parser based on dynamic syntax and TTR record types; it is a |
---|
0:05:38 | word-by-word incremental semantic model of the dialogue, including the parser and the generator |
---|
0:05:45 | and it produces the semantic and contextual |
---|
0:05:49 | representations in the form of TTR record types |
---|
0:05:52 | and one thing |
---|
0:05:54 | I want to highlight here is that our work is quite similar to |
---|
0:05:58 | Kennington and Schlangen's work, but what they are trying to do is |
---|
0:06:03 | directly ground the |
---|
0:06:05 | words to the classifiers, whereas we are using the TTR |
---|
0:06:10 | record type logical forms instead |
---|
0:06:13 | okay, here's an example about the incremental parser, |
---|
0:06:18 | and here's the graph about the whole parse |
---|
0:06:22 | it shows the DAG for the dialogue context, |
---|
0:06:25 | and it tracks the dialogue context from all of the participants in the dialogue, |
---|
0:06:29 | including the learner and the |
---|
0:06:31 | tutor |
---|
0:06:32 | and |
---|
0:06:34 | each node here represents the parse state, or the TTR record |
---|
0:06:40 | type, at a particular point |
---|
0:06:43 | and an edge, |
---|
0:06:44 | each edge, represents a particular word |
---|
0:06:48 | parsed by the incremental parser |
---|
0:06:51 | so |
---|
0:06:52 | when we get a new word |
---|
0:06:53 | the graph just grows: it gets a new node and updates the |
---|
0:06:58 | record type again, so it gets another one |
---|
0:07:01 | and |
---|
0:07:02 | there's always a sentence and a final record type, for |
---|
0:07:06 | instance for "what is this?" |
---|
0:07:08 | and the learner's answer, "a square" |
---|
0:07:11 | so |
---|
0:07:11 | it just continues |
---|
0:07:12 | the parse and updates it by resolving |
---|
0:07:16 | the question |
---|
0:07:19 | judgement type |
---|
0:07:21 | and the result is an answer about "square" |
---|
0:07:26 | and how does it work? |
---|
0:07:28 | well, in this case the "yes", as a kind of acknowledgement, comes from the |
---|
0:07:33 | tutor, so the previous turn of the previous dialogue |
---|
0:07:38 | has been grounded by that judgement, and |
---|
0:07:43 | the learner checks it into the dialogue context in this case |
---|
0:07:47 | here's another example, about the grounded natural language semantics in the visual classifiers |
---|
0:07:55 | and the system, |
---|
0:07:56 | we can see the system captures the objects from the webcam and extracts features, |
---|
0:08:04 | pushes them into the classifiers, gets a bunch of |
---|
0:08:07 | prediction labels and the corresponding confidence scores |
---|
0:08:11 | and based on that, instead of using a probability distribution, we are using |
---|
0:08:18 | binary classifiers for each of the attributes, so we just pick the labels |
---|
0:08:24 | with the highest scores |
---|
0:08:26 | for each group here, and we use that to generate |
---|
0:08:30 | the visual context, which is a TTR record type |
---|
0:08:35 | containing "red", grounded with a score of zero point seven |
---|
0:08:40 | five, |
---|
0:08:41 | and the predicate "square", |
---|
0:08:43 | which |
---|
0:08:45 | is grounded to the square classifier with a score of zero point eight eight |
---|
0:08:50 | and when we push them into the generator, that means |
---|
0:08:53 | "I can see a red square" |
---|
0:08:57 | in our system |
---|
0:08:59 | the vision module treats all of the visual attributes equally and assigns a binary |
---|
0:09:04 | classifier to each |
---|
0:09:05 | it does not |
---|
0:09:07 | identify the meaning of a specific word |
---|
0:09:11 | but in the language model, the language processing module, the grammar |
---|
0:09:15 | knows them, and knows they are of different kinds: "red" is |
---|
0:09:20 | a kind of colour and "square" is a kind of shape |
---|
0:09:25 | and, as we mentioned already, our system is in the position |
---|
0:09:29 | of the child, so we don't need to learn |
---|
0:09:32 | the mappings between the classifiers and the semantic items |
---|
0:09:35 | instead |
---|
0:09:36 | we just create new classifiers on the fly for the |
---|
0:09:40 | new semantic items |
---|
0:09:42 | we might encounter, or contribute, |
---|
0:09:45 | in the dialogue, and we retrain them incrementally through the interaction |
---|
0:09:52 | now I want to show you how the system works |
---|
0:10:16 | here's a really simple dialogue |
---|
0:10:19 | the system has the |
---|
0:10:21 | dialogue on the left |
---|
0:10:24 | it's very simple, just teaching specific words about the colours or shapes |
---|
0:10:30 | and then |
---|
0:10:31 | we get a new dialogue that is trying to test what the system has learned |
---|
0:10:36 | from |
---|
0:10:36 | the previous turns |
---|
0:10:40 | and then we get the |
---|
0:10:43 | object and we get the visual context |
---|
0:10:45 | based on the visual features we |
---|
0:10:50 | have a bunch of classification results |
---|
0:10:52 | and |
---|
0:10:53 | based on the results |
---|
0:10:54 | we generate the window that shows the visual context |
---|
0:10:58 | produced by the classification results |
---|
0:11:03 | and then this dialogue context window shows the |
---|
0:11:07 | TTR record types parsed from the previous tutor |
---|
0:11:11 | utterances |
---|
0:11:18 | and then this window shows the generation goal, and it shapes the answer |
---|
0:11:23 | by unifying |
---|
0:11:24 | the dialogue context and the visual context |
---|
0:11:29 | and we put that into the generator to get the final sentence, |
---|
0:11:33 | for example "this is a square" |
---|
0:11:36 | and because of the time, I will |
---|
0:11:38 | just finish there |
---|
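The unification step in the demo can be illustrated with plain dictionaries standing in for TTR record types. Real DS-TTR records are far richer, so this is only a rough sketch, and the field names are invented for the illustration.

```python
def unify(record_a, record_b):
    """Toy stand-in for TTR record-type unification: succeed only if the
    two records agree on every shared field, and merge the rest."""
    merged = dict(record_a)
    for field, value in record_b.items():
        if field in merged and merged[field] != value:
            return None          # incompatible fields: unification fails
        merged[field] = value
    return merged

# Dialogue context: the tutor asked about object o1.
dialogue_ctx = {"x": "o1", "head": "x"}
# Visual context: the classifiers' best guesses for o1 (illustrative).
visual_ctx = {"x": "o1", "p_colour": "red(x)", "p_shape": "square(x)"}

goal = unify(dialogue_ctx, visual_ctx)
# A generator could then realise the merged record as "this is a red square".
```

The key property mirrored here is that unification fails when the two contexts disagree, and otherwise yields a single merged record for the generator.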
0:11:41 | okay |
---|
0:11:42 | what I want to stress, |
---|
0:11:43 | what I want to highlight here, is that we are using very simple colours and |
---|
0:11:47 | shapes in this work, but this vision module is a |
---|
0:11:52 | really generalised framework, and it should scale to more complex visual scenes and classifiers |
---|
0:11:58 | in the future work |
---|
0:12:01 | now let's move to the experiments |
---|
0:12:03 | in the experiments |
---|
0:12:05 | we aim to explore the effectiveness of the different dialogue capabilities and policies |
---|
0:12:10 | on the learning of grounded word meanings |
---|
0:12:12 | with three factors: uncertainty, context dependency and initiative |
---|
0:12:19 | and then, based on the exploration, we learn an adaptive dialogue strategy |
---|
0:12:24 | which takes |
---|
0:12:28 | into account the reliability of the classifier results |
---|
0:12:33 | okay, in experiment one we |
---|
0:12:36 | designed a two by two by two factorial experiment and considered three |
---|
0:12:42 | factors. the first one is initiative, which determines who takes the initiative in the whole |
---|
0:12:47 | dialogue |
---|
0:12:48 | and then |
---|
0:12:49 | the second one is |
---|
0:12:50 | context dependency, which determines whether the learner can process context-dependent |
---|
0:12:57 | expressions, like short answers or incrementally constructed turns, |
---|
0:13:03 | as in the example here |
---|
0:13:06 | and then |
---|
0:13:07 | we consider uncertainty: |
---|
0:13:08 | the uncertainty determines whether and how the classification scores affect the learner's |
---|
0:13:14 | dialogue behaviours |
---|
0:13:16 | and |
---|
0:13:16 | as we know, |
---|
0:13:19 | for |
---|
0:13:20 | the system, after the classification, the system receives a bunch of confidence |
---|
0:13:27 | scores along with the predictions, and for the agents considering the uncertainty |
---|
0:13:32 | we're trying to find out the point |
---|
0:13:34 | where it can believe its own predictions, so we call this point |
---|
0:13:38 | the threshold, the confidence threshold |
---|
0:13:41 | and the agents that consider the uncertainty will behave as in |
---|
0:13:47 | a kind of active learning: you will only ask questions |
---|
0:13:53 | when you are not very sure about your answer or about your predictions |
---|
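This uncertainty-driven behaviour boils down to a simple decision rule. A sketch, with an illustrative fixed threshold of 0.7 (not a value from the paper):

```python
def choose_act(label, confidence, threshold=0.7):
    """Active-learning style decision: assert a prediction only when its
    confidence clears the threshold; otherwise ask the tutor.
    The 0.7 default is illustrative, not the paper's value."""
    if confidence >= threshold:
        return f"this is {label}"      # trust the prediction
    return f"is this {label}?"         # not sure enough: ask a question

print(choose_act("red", 0.88))      # confident -> "this is red"
print(choose_act("square", 0.41))   # uncertain -> "is this square?"
```

Raising the threshold makes the agent ask more questions (more cost, more labels); lowering it makes the agent trust itself more (cheaper, riskier), which is exactly the trade-off the experiments measure.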
0:13:57 | and on the other hand, |
---|
0:13:59 | under |
---|
0:14:00 | the condition without uncertainty, |
---|
0:14:03 | the agent always seeks confirmations or more information from the tutor, |
---|
0:14:08 | so it is more costly as well |
---|
0:14:11 | and to our knowledge the classification scores are not always reliable, especially in the |
---|
0:14:16 | very beginning, because you don't have sufficient training examples |
---|
0:14:19 | and |
---|
0:14:20 | this reliability will obviously improve during the interactive learning |
---|
0:14:28 | when you get more and more examples |
---|
0:14:30 | so |
---|
0:14:32 | the agents that consider the uncertainty will take the risk |
---|
0:14:37 | of missing some information from the users |
---|
0:14:39 | so they will not ask any questions even if maybe |
---|
0:14:42 | the answer is wrong |
---|
0:14:45 | to evaluate the interactive learning performance we come up with this kind |
---|
0:14:51 | of metric, by integrating the classification accuracy and the tutoring cost, which |
---|
0:14:57 | reflects the effort needed by the tutor in the interaction with the |
---|
0:15:02 | system |
---|
0:15:02 | and we provide an overall performance score for the increase in the accuracy against |
---|
0:15:08 | the cost to the tutor, and this score captures the trade-off between accuracy and |
---|
0:15:14 | cost |
---|
0:15:16 | and if we plot these scores on a graph |
---|
0:15:22 | it will be a curve, and the score can be represented using the gradient |
---|
0:15:28 | of the curve, and what we want to do is we want to |
---|
0:15:32 | learn a suitable dialogue policy |
---|
0:15:36 | to maximise the performance score |
---|
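As a sketch, this trade-off score, accuracy gained per unit of tutoring cost, i.e. the overall gradient of the accuracy-versus-cost curve, can be computed like this; the curve values below are invented, not results from the paper:

```python
def performance_score(costs, accuracies):
    """Overall score = average gradient of the accuracy-vs-cost curve,
    i.e. accuracy gained per unit of tutoring cost."""
    d_acc = accuracies[-1] - accuracies[0]
    d_cost = costs[-1] - costs[0]
    return d_acc / d_cost

# Illustrative curve: accuracy rises from 0.5 to 0.9 over 1000 cost units.
costs = [0, 250, 500, 750, 1000]
accs = [0.5, 0.7, 0.8, 0.85, 0.9]
score = performance_score(costs, accs)
```

A steeper curve (same accuracy gain for less tutor effort) yields a higher score, which is what the adaptive policy is trained to maximise.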
0:15:39 | okay, here are the results from experiment one |
---|
0:15:41 | and the |
---|
0:15:42 | x-axis represents the unit of the cost |
---|
0:15:50 | paid by the tutor per training instance, and the y-axis |
---|
0:15:55 | represents the accuracy |
---|
0:15:57 | from this graph we can see the agents |
---|
0:16:00 | with the learner taking initiative with uncertainty, the green and the |
---|
0:16:06 | blue curves, |
---|
0:16:08 | perform much better than the others. however, because this policy takes |
---|
0:16:14 | more risks, |
---|
0:16:15 | it cannot get confirmed answers from the tutor, so |
---|
0:16:20 | it cannot achieve the really high accuracy of the others: the others achieve near |
---|
0:16:25 | zero point nine, but this one achieves zero point seventy five or |
---|
0:16:29 | something like that |
---|
0:16:30 | so we conclude that, because the confidence scores are not really reliable in the |
---|
0:16:36 | learning process, |
---|
0:16:37 | the threshold shouldn't be kept constant over the whole running time of the learning |
---|
0:16:41 | task |
---|
0:16:42 | so we assume that a certainty threshold that can change dynamically over time |
---|
0:16:48 | should lead to a good trade-off between accuracy and cost |
---|
0:16:53 | therefore we trained an adaptive dialogue policy by |
---|
0:17:00 | using an MDP model and reinforcement learning |
---|
0:17:06 | and because of time limits I cannot talk about the details, so |
---|
0:17:10 | please find them in the paper |
---|
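Since the details are in the paper, here is only a generic sketch of how such a policy could be trained: tabular reinforcement learning over coarse stages of learning, with actions that raise, keep, or lower the confidence threshold. The states, actions and reward below are all invented for the illustration; they are not the paper's actual MDP.

```python
import random

random.seed(0)

ACTIONS = ["raise", "keep", "lower"]

def toy_reward(stage, action):
    # Invented reward: early on, a high threshold (asking more) pays off;
    # later, a low threshold (trusting the classifiers) is cheaper.
    if stage == "early":
        return {"raise": 1.0, "keep": 0.2, "lower": -1.0}[action]
    return {"raise": -1.0, "keep": 0.2, "lower": 1.0}[action]

# Tabular value estimates with epsilon-greedy exploration.
Q = {(s, a): 0.0 for s in ("early", "late") for a in ACTIONS}
alpha, epsilon = 0.1, 0.2
for episode in range(500):
    for stage in ("early", "late"):
        if random.random() < epsilon:
            a = random.choice(ACTIONS)                       # explore
        else:
            a = max(ACTIONS, key=lambda x: Q[(stage, x)])    # exploit
        Q[(stage, a)] += alpha * (toy_reward(stage, a) - Q[(stage, a)])

policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in ("early", "late")}
```

Under this toy reward, the learned policy raises the threshold early (ask more while classifiers are unreliable) and lowers it later, which is the qualitative behaviour the talk attributes to the adaptive strategy.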
0:17:13 | so here are the other results, and in these results |
---|
0:17:17 | we keep all of the other conditions constant, with the learner taking, you know, |
---|
0:17:22 | the initiative, |
---|
0:17:23 | and take the uncertainty and the context dependency into account as well |
---|
0:17:27 | and through the results we can find that |
---|
0:17:32 | the adaptive strategy, the right curve, |
---|
0:17:34 | achieves much, |
---|
0:17:38 | much higher accuracy |
---|
0:17:41 | but it cannot really beat the other constant-threshold ones; the gap is not |
---|
0:17:47 | really that |
---|
0:17:49 | big; the others are already good enough, but what we know is |
---|
0:17:54 | it achieves the high accuracy much faster, especially in the first one |
---|
0:17:58 | thousand |
---|
0:17:59 | units of cost, so we can find it is much |
---|
0:18:02 | better |
---|
0:18:04 | so we conclude that the agents with the adaptive strategy are more feasible |
---|
0:18:09 | in the interactive learning tasks |
---|
0:18:14 | and in conclusion: |
---|
0:18:16 | in this paper we use a fully integrated multimodal and interactive teachable system |
---|
0:18:22 | for the language grounding, and we trained a principled adaptive dialogue |
---|
0:18:27 | strategy for the language grounding |
---|
0:18:30 | and then we inspected the impact of different strategies and conditions on the learning |
---|
0:18:36 | process |
---|
0:18:37 | and we know |
---|
0:18:38 | the learned policy, which takes the uncertainty into account with this adaptive |
---|
0:18:44 | threshold, shows the best overall performance |
---|
0:18:48 | in future work, we are |
---|
0:18:52 | trying to craft and train the dialogue policy using a human tutor, |
---|
0:18:58 | using a human data collection, and we will train the tutor and the learner |
---|
0:19:04 | at the same time |
---|
0:19:06 | and |
---|
0:19:07 | then we will try to learn a word-level adaptive dialogue policy |
---|
0:19:12 | using reinforcement learning based on the DS-TTR module |
---|
0:19:17 | and finally, we are trying, |
---|
0:19:20 | in order to deal with previously unseen words, features or concepts, |
---|
0:19:25 | to integrate distributional semantics into this system |
---|
0:19:30 | and here are the references. thank you so much for your attention |
---|
0:20:30 | actually, the reason we want to use it is because we are considering the |
---|
0:20:36 | uncertainty from the visual knowledge |
---|
0:20:38 | and we think about, okay, well, actually we're trying to use different things as |
---|
0:20:43 | well, like using the entropy or something else to measure |
---|
0:20:46 | the reliability of the classifiers, and we consider, |
---|
0:20:50 | okay, |
---|
0:20:51 | because we learn all of the classifiers from scratch, |
---|
0:20:54 | each one has only one or two examples in the |
---|
0:20:58 | very beginning, so we're trying to use a kind of strategy like, okay, |
---|
0:21:02 | we try to assign a higher threshold at the very beginning, and it will |
---|
0:21:08 | ask more questions, |
---|
0:21:09 | which allows it to pick up the knowledge it actually needs, and then when it |
---|
0:21:14 | gets more examples we just reduce, maybe just reduce the threshold |
---|
0:21:19 | and try to get rid of |
---|
0:21:21 | the cases where maybe the learner doesn't need to ask questions |
---|
0:21:26 | and what we want to do is this kind of |
---|
0:21:28 | teachable agent, maybe a robot at home: |
---|
0:21:31 | we're just trying to |
---|
0:21:32 | teach the robot all of the information from the user's perspective |
---|
0:21:37 | so we are |
---|
0:21:40 | facing that situation, because everyone has different knowledge about the visual things, so |
---|
0:21:46 | that is |
---|
0:21:47 | actually why we need to consider the situation for the |
---|
0:21:51 | confidence threshold part |
---|
0:21:54 | does that answer that? |
---|
0:22:11 | well, actually, |
---|
0:22:14 | that's a very good question |
---|
0:22:17 | in this case we don't really think about that |
---|
0:22:21 | we just think overall about the uncertainty in the visual |
---|
0:22:25 | knowledge, rather than the ASR as well |
---|
0:22:30 | I think I will try to figure that out |
---|
0:22:48 | to be honest, currently not yet |
---|
0:22:51 | that's maybe future |
---|
0:22:53 | work |
---|
0:23:03 | sorry |
---|
0:23:06 | so your question is how I generate the representations for the objects, right? |
---|
0:23:10 | so we are using the, |
---|
0:23:12 | I just used the MATLAB libraries: using those we extract |
---|
0:23:19 | the features for the, |
---|
0:23:22 | the colour space for the colour, and the bag of visual words, and we just build a |
---|
0:23:26 | kind of dictionary ourselves, and we get the frequency of each visual word in |
---|
0:23:31 | the pixels and put them together to |
---|
0:23:35 | concatenate them into a feature vector |
---|
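A minimal sketch of the colour half of such a feature vector, a normalised histogram over an object's hue channel, might look like this. The bin count and pixel layout are illustrative, and the real system also appends bag-of-visual-words counts for shape.

```python
import numpy as np

def hsv_colour_histogram(hsv_pixels, bins=8):
    """Toy colour feature: a frequency-normalised histogram over the hue
    channel of an object's pixels. Bin count is illustrative, not the
    system's actual configuration."""
    hue = np.asarray(hsv_pixels)[:, 0]               # hue in [0, 1)
    hist, _ = np.histogram(hue, bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)                 # normalise to frequencies

# A fake "red" object: 100 pixels whose hues all sit near 0.
pixels = np.column_stack([np.full(100, 0.02), np.ones(100), np.ones(100)])
vec = hsv_colour_histogram(pixels)
```

The shape half would be built the same way, as a frequency histogram over a learned dictionary of visual words, and the two histograms concatenated into one vector.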
0:23:50 | well, actually, |
---|
0:23:52 | we know there are a lot of people working on classification, and they are using |
---|
0:23:56 | deep learning, or recurrent or convolutional neural networks, |
---|
0:24:02 | and we're trying to think about, you know, a child: it doesn't have any knowledge, |
---|
0:24:06 | and we try to learn all of the |
---|
0:24:08 | classifiers from scratch |
---|
0:24:11 | and we're trying to think, okay, we don't really want the system to already know |
---|
0:24:16 | what's the meaning, what's the group of the colours or shapes, |
---|
0:24:19 | so we use a binary classifier for each attribute, and treat them equally |
---|
0:24:24 | and afterwards, through the interaction, we get more knowledge and try to |
---|
0:24:29 | figure out, okay, this "red" is a kind of colour, and then we get a new |
---|
0:24:33 | feature like "yellow", and I know "yellow" is quite similar to "red", so it is |
---|
0:24:38 | also |
---|
0:24:39 | in the same group |
---|
0:24:47 | right |
---|
0:24:53 | that's not really the weights |
---|
0:24:56 | you mean the distribution of |
---|
0:25:00 | results across all of the classifiers in the same group, right? |
---|
0:25:05 | that's different: we just use the binary classifiers, and all |
---|
0:25:09 | of them, even the similar colours, are encoded equally; |
---|
0:25:13 | there isn't any difference between them |
---|