0:00:15 | Hi everyone, I'm Abhishek Das, a PhD student at Georgia Tech, |
---|
0:00:19 | and I'll be presenting our work on Embodied Question Answering. This is joint work |
---|
0:00:24 | with my collaborators at Georgia Tech and Facebook AI Research. |
---|
0:00:29 | So in this work we propose a new task called Embodied Question Answering. The task |
---|
0:00:33 | is that there's an agent that's spawned at a random location in an unseen environment and |
---|
0:00:38 | asked a question, such as "what color is the car?" |
---|
0:00:41 | In order to succeed, the agent must understand the question, navigate the environment, find the object |
---|
0:00:46 | that the question asks about, and respond back with the answer. |
---|
0:00:51 | So we begin by proposing a dataset of questions and environments for this task. |
---|
0:00:56 | For environments, we use House3D, which is work out of Facebook |
---|
0:01:00 | AI Research on building a rich and interactive environment out of the SUNCG dataset. |
---|
0:01:06 | And to give a sense of what this data looks like, here are a few environments |
---|
0:01:09 | from House3D. |
---|
0:01:14 | Here are a few living rooms, |
---|
0:01:20 | and here are a few bathrooms. |
---|
0:01:23 | So as you can see, there's a wide and diverse set of colors, textures, objects, |
---|
0:01:26 | and their spatial configurations. |
---|
0:01:29 | In total, we use eight hundred environments from House3D for this work, |
---|
0:01:33 | consisting of twelve room types and fifty object types, and we make sure that there's no |
---|
0:01:38 | overlap between the training, validation, and test environments, so we strictly check for generalization to |
---|
0:01:42 | novel environments. |
---|
0:01:45 | Coming to questions: our questions are generated programmatically, in a manner similar to CLEVR, in |
---|
0:01:49 | that we have a set of several primitive functions that can be combined and executed on these |
---|
0:01:54 | environments to generate a whole bunch of questions. |
---|
0:01:58 | To give an example, executing select(objects) on an environment returns a list of objects present, and |
---|
0:02:03 | then passing that list through a singleton filter keeps only objects that occur |
---|
0:02:08 | exactly once. |
---|
0:02:10 | We can then query the location: for each object in that set, we generate |
---|
0:02:13 | a whole bunch of location questions, such as "what room is the piano located in?", "what |
---|
0:02:17 | room is the dog located in?", "what room is the cutting board located in?", and so on. |
---|
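The composition of primitive functions described above can be sketched as follows. This is a hedged illustration, not the paper's actual generation engine: the names `select_objects`, `unique`, and `location_questions`, the dictionary environment representation, and the template string are all assumptions made for this sketch.

```python
from collections import Counter

def select_objects(environment):
    """Return all (object, room) pairs present in the environment."""
    return [(obj, room) for room, objs in environment.items() for obj in objs]

def unique(pairs):
    """Singleton filter: keep only objects that occur exactly once."""
    counts = Counter(obj for obj, _ in pairs)
    return [(obj, room) for obj, room in pairs if counts[obj] == 1]

def location_questions(pairs):
    """Instantiate the 'location' template for each remaining object."""
    return [(f"what room is the {obj} located in?", room) for obj, room in pairs]

# A toy environment: the mug appears twice, so it is filtered out.
env = {"kitchen": ["cutting board", "mug"],
       "living room": ["piano", "mug"]}
qa = location_questions(unique(select_objects(env)))
```

Combining the same primitives in a different order would produce the color questions mentioned next.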
0:02:23 | Here's another example, where we combine these primitive functions in a different combination to generate |
---|
0:02:27 | a whole bunch of color questions of the form "what color is the <object> in the living |
---|
0:02:30 | room?", "what color is the <object> in the gym?", and so on. |
---|
0:02:33 | In total we have several question types, but for this initial work we focus on |
---|
0:02:38 | location, color, and template-based preposition questions that ask about a single |
---|
0:02:43 | target object. |
---|
0:02:44 | Additionally, as a post-processing step, we make sure that the answer distributions for these |
---|
0:02:49 | questions are not peaky, so that the agent actually has to navigate to be able to |
---|
0:02:52 | answer accurately and cannot exploit biases. |
---|
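One way such a peakiness check could be implemented is sketched below; the function name, the threshold value, and the exact balancing criterion are illustrative assumptions, not the paper's procedure.

```python
from collections import Counter

def is_peaky(answers, threshold=0.5):
    """True if any single answer accounts for more than `threshold` of the
    answers for a question type, i.e. the distribution could be exploited
    as a language-only prior without navigating at all."""
    counts = Counter(answers)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(answers) > threshold
```

Question templates flagged this way would be rebalanced or dropped.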
0:02:57 | All of this data is publicly available for download at embodiedqa.org. |
---|
0:03:01 | Coming to our model, it consists of four components: vision, language, navigation, and answering. |
---|
0:03:06 | The vision module is a four-layer convolutional neural network which is pretrained |
---|
0:03:11 | for RGB reconstruction, semantic segmentation, and depth estimation. |
---|
0:03:15 | Once it's pretrained, we throw away the decoders and just use the encoder as |
---|
0:03:18 | a fixed feature extractor. |
---|
0:03:20 | Our language module is an LSTM that extracts a fixed-size representation of |
---|
0:03:25 | the question. |
---|
0:03:26 | We have a hierarchical navigation policy, consisting of a planner that decides which action to |
---|
0:03:31 | perform, and a controller that decides how many time steps to execute each action for. |
---|
0:03:36 | And so here's what it looks like in practice: we extract image features using the |
---|
0:03:41 | CNN, and conditioned on these image features and the question, the planner decides which action |
---|
0:03:45 | to perform. So in this case it decides to turn right. |
---|
0:03:48 | Control is then passed to the controller. |
---|
0:03:51 | The controller then has to decide whether to continue turning right or return control to |
---|
0:03:55 | the planner. So in this case it decides to return control, and that completes one |
---|
0:04:00 | time step of the planner. |
---|
0:04:01 | Okay, and at the next time step the planner looks at the image features and |
---|
0:04:04 | the question and decides which action to perform. So here it decides to move forward; control is passed |
---|
0:04:09 | to the controller; the controller decides to continue moving forward for three time steps before |
---|
0:04:13 | handing control back to the planner. |
---|
0:04:15 | And this sort of continues until finally the planner decides to stop. |
---|
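The planner-controller interaction just walked through can be sketched as a simple loop. In the actual model both modules are recurrent networks conditioned on CNN features and the question; here they are stubbed out as plain callables so only the control flow is shown.

```python
def navigate(planner, controller, observe, max_steps=100):
    """Hierarchical navigation loop: the planner picks an action, then the
    controller repeats that action until it decides to return control."""
    trajectory = []
    for _ in range(max_steps):
        action = planner(observe())
        if action == "stop":
            break
        trajectory.append(action)
        # The controller keeps executing the same action until it declines.
        while controller(observe(), action):
            trajectory.append(action)
    return trajectory
```

For example, with a scripted planner that says "forward" then "stop", and a controller that asks for two extra repetitions, the trajectory contains three forward steps in total.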
0:04:22 | For answering, |
---|
0:04:24 | we extract a question representation using an LSTM, and we compute attention over the |
---|
0:04:29 | last five image frames from the navigation trajectory. We combine these attended image features with |
---|
0:04:34 | the question representation to make a prediction of the answer. |
---|
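As a toy sketch of this answering step: the dot-product attention form, the concatenation of the attended feature with the question vector, and the linear scoring layer are illustrative assumptions about the architecture, with toy-sized vectors in place of real features.

```python
import math

def answer_scores(question_vec, frame_feats, weight_rows):
    """question_vec: length-d list; frame_feats: five length-d frame features;
    weight_rows: one length-2d weight vector per candidate answer."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # Softmax attention of the question over the five frames.
    scores = [dot(f, question_vec) for f in frame_feats]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    attn = [e / z for e in exps]
    # Attention-weighted average of the frame features.
    attended = [sum(a * f[i] for a, f in zip(attn, frame_feats))
                for i in range(len(question_vec))]
    # Combine attended image features with the question representation.
    combined = attended + question_vec
    return [dot(row, combined) for row in weight_rows]
```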
0:04:39 | Now that we have these four modules, coming to training. As a reminder, |
---|
0:04:43 | the agent is spawned at a random location in |
---|
0:04:47 | an environment; here I'm showing the top-down map. |
---|
0:04:50 | We ask a question, such as "what room is the <object> located in?" The red |
---|
0:04:53 | star shows the location of this target object, so that's where the agent is expected to |
---|
0:04:56 | navigate. A shortest path might look something like this, and here's the first-person video |
---|
0:05:03 | of what a shortest-path expert agent would see. |
---|
0:05:07 | Given the shortest path, we can pretrain our answering module to be |
---|
0:05:11 | able to predict the answer from the last five frames, |
---|
0:05:13 | and we pretrain our navigation module in a teacher-forcing manner to predict each action in |
---|
0:05:18 | the shortest path. |
---|
0:05:20 | Once we have these two modules pretrained, we fine-tune using reinforcement learning: we put the agent |
---|
0:05:25 | in an environment, sample actions from this navigation policy, execute these actions in the environment, |
---|
0:05:30 | and assign an intermediate reward when it makes progress towards the target. |
---|
0:05:35 | And when the agent chooses to stop, we execute the answering module |
---|
0:05:39 | and assign a terminal reward if the agent gets the answer right. |
---|
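The reward structure just described can be sketched like this; the scale constants and the exact shaping formula are illustrative placeholders, not the values used in the paper.

```python
def step_reward(prev_dist, new_dist, stopped=False, answer_correct=False,
                progress_scale=0.005, terminal_reward=5.0):
    """Intermediate reward proportional to the reduction in distance to the
    target, plus a terminal reward when the agent stops and answers right."""
    reward = progress_scale * (prev_dist - new_dist)  # positive when closer
    if stopped and answer_correct:
        reward += terminal_reward
    return reward
```

Moving away from the target yields a negative intermediate reward, so the policy is pushed toward the object the question asks about.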
0:05:44 | In terms of metrics, again I'm showing the top-down map. The red path shows |
---|
0:05:49 | what an agent's trajectory might look like. Given an agent's final location, we can |
---|
0:05:53 | evaluate what the final distance to the target is and what the improvement in distance is. We |
---|
0:05:58 | also compute whether the agent ends up in the right room, |
---|
0:06:02 | or if it ever enters it. And for answering, we look at |
---|
0:06:05 | the mean rank of the ground-truth answer in the softmax distribution predicted by the model. |
---|
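Two of these metrics can be sketched as follows; this is a hedged illustration, and details such as tie-breaking in the ranking are assumptions rather than the paper's exact definitions.

```python
def distance_improvement(start_dist, final_dist):
    """Positive when the agent ends closer to the target than it started."""
    return start_dist - final_dist

def mean_rank(predicted_probs, gt_indices):
    """Mean rank (1-indexed) of the ground-truth answer when candidate
    answers are sorted by predicted probability, highest first."""
    ranks = []
    for probs, gt in zip(predicted_probs, gt_indices):
        order = sorted(range(len(probs)), key=lambda i: -probs[i])
        ranks.append(order.index(gt) + 1)
    return sum(ranks) / len(ranks)
```

A mean rank of 1.0 would mean the model always puts the ground-truth answer first; higher values are worse.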
0:06:10 | So in terms of results on the distance-to-target metric, lower is |
---|
0:06:13 | better. Here I'm showing a few baselines. First, adding in question information over |
---|
0:06:18 | a question-agnostic navigation module helps the agent end up closer to the target by about |
---|
0:06:22 | half a meter. Adding in memory in the form of an LSTM helps it do |
---|
0:06:26 | even better, by about another half a meter. |
---|
0:06:28 | And finally, our hierarchical policy ends up closest to the target. |
---|
0:06:34 | So here are a few qualitative examples. For the question "what color is the |
---|
0:06:38 | fish tank in the living room?", I'm showing the baseline LSTM model on the left. |
---|
0:06:42 | The baseline model turns, looks at the fish tank, but walks right out of |
---|
0:06:45 | the house, so it doesn't know where to stop, and it finally gets the answer |
---|
0:06:49 | wrong. |
---|
0:06:51 | Our model, in contrast, turns, looks at the fish tank, knows where to |
---|
0:06:54 | stop, and gets the answer right. |
---|
0:06:57 | Here's another example; the question is "what color is the bathtub?" |
---|
0:07:00 | The baseline model turns but gets stuck against a wall, |
---|
0:07:03 | whereas our model navigates to the bathtub, stops, and gets the answer right. |
---|
0:07:08 | So, to summarize: I introduced the task of Embodied Question Answering, which involves |
---|
0:07:12 | navigation and question answering in these simulated House3D environments. We proposed a dataset for |
---|
0:07:17 | this task, and we proposed a hierarchical navigation policy that performs well against competitive |
---|
0:07:22 | baselines. |
---|
0:07:23 | All of this data and code is publicly available, so I'd encourage you to check |
---|
0:07:27 | it out. |
---|
0:07:28 | That's it, thank you. |
---|
0:07:52 | [Audience] So by baking the navigator into your model, you're making an assumption about how |
---|
0:07:58 | the system can navigate in the environment you're building. |
---|
0:08:00 | If you have a legged system or a wheeled system, you can imagine learning very |
---|
0:08:05 | different policies, or if you're in a multi-storey building. Any thoughts on how you might |
---|
0:08:10 | generalize the model? And is this the right abstraction to really try to understand |
---|
0:08:14 | how to solve the problem? |
---|
0:08:18 | I mean, that's a good question. I don't think we're tied to |
---|
0:08:21 | any specific hardware. Right now we're abstracting away all details related to what |
---|
0:08:25 | the specific hardware might be, and we are assuming no stochasticity in |
---|
0:08:30 | the environment: |
---|
0:08:31 | we are assuming that executing "forward" will always move the agent forward by 0.5 meters. |
---|
0:08:37 | [Audience] [inaudible follow-up] |
---|
0:08:42 | I mean... |
---|
0:08:44 | Well, |
---|
0:08:45 | so the action space will change depending on what specific hardware you have access to. |
---|
0:08:50 | You could... |
---|
0:08:51 | I could imagine |
---|
0:08:53 | training some of these models |
---|
0:08:56 | conditioned on the specific hardware parameters that they might have, |
---|
0:09:00 | if we had access to those. |
---|
0:09:02 | But beyond that, I'd say I don't have anything yet. |
---|
0:09:09 | [Audience] I think, if it... |
---|
0:09:11 | Where do the errors of the model come from: from the perception part, from the language part, from |
---|
0:09:16 | the navigation? |
---|
0:09:17 | Sorry, I missed the first part: where do the errors of the model come from? |
---|
0:09:22 | So, |
---|
0:09:24 | the way the task is set up, the agent has to navigate purely from |
---|
0:09:27 | first-person vision; it doesn't have a map of the environment. |
---|
0:09:29 | I think that's where most of the errors come from: navigating just from first-person |
---|
0:09:33 | vision, even in a simulated environment, is extremely hard to get to work. |
---|
0:09:38 | I skipped those details in this presentation, but we have numbers for |
---|
0:09:42 | that. |
---|
0:09:43 | For evaluation, we evaluate the agent at different difficulty levels: we initially spawn it |
---|
0:09:49 | ten steps back from the target, then thirty, then fifty, and see how well it |
---|
0:09:53 | does. So I'd |
---|
0:09:56 | say that at the most difficult level it has to cross just about one room, and for anything |
---|
0:10:02 | beyond that |
---|
0:10:04 | it doesn't do a really good job. So I think navigation is |
---|
0:10:07 | the hardest part. |
---|