Hi everyone, I'm Abhishek Das, a PhD student at Georgia Tech, and today I'll be presenting our work on Embodied Question Answering. This is joint work with my collaborators at Georgia Tech and Facebook AI Research.
So in this work we propose a new task called Embodied Question Answering. The setup is that there's an agent that's spawned at a random location in an unseen environment and asked a question such as "what color is the car?"
In order to succeed, the agent must understand the question, navigate the environment, find the object that the question asks about, and respond back with the answer.
So we begin by proposing a dataset of questions and environments for this task. For environments we use House3D, which is work out of Facebook AI Research on building a rich and interactive environment out of the SUNCG dataset.
And to give you a sense of what this data looks like, here are a few living rooms from House3D, and here are a few bathrooms. As you can see, there's a wide and diverse set of colors, textures, objects, and their spatial configurations.
So in total we use 800 environments from House3D for this work, consisting of 12 room types and 50 object types, and we make sure that there's no overlap between the training, validation, and test environments, so we strictly check for generalization to novel environments.
Coming to questions: our questions are generated programmatically, in a manner similar to CLEVR, in that we have a set of primitive functions that can be combined and executed on these environments to generate a whole bunch of questions.
To give an example, executing select(objects) on an environment returns a list of the objects present in it, and then passing that list through unique() will filter it down to objects that occur only once.
And we can then query the location of each object in that set to generate a whole bunch of location questions, such as "what room is the piano located in?", "what room is the dog located in?", "what room is the cutting board located in?", and so on.
Here's another example, where we combine these primitive functions in a different order to generate a whole bunch of color questions: "what color is the base station in the living room?", "what color is the chair in the gym?", and so on.
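To make this concrete, here is a minimal Python sketch of the functional-program idea. The primitive names (select_objects, unique) and the environment representation are illustrative stand-ins, not the actual EQA generation code:

```python
from collections import Counter

def select_objects(env):
    """Return all object names present in the environment (list of (object, room))."""
    return [obj for obj, room in env]

def unique(objects):
    """Keep only objects that occur exactly once, so the question is unambiguous."""
    counts = Counter(objects)
    return [o for o, c in counts.items() if c == 1]

def generate_location_questions(env):
    """Instantiate the 'what room is the <obj> located in?' template."""
    rooms = dict(env)
    return [(f"what room is the {obj} located in?", rooms[obj])
            for obj in unique(select_objects(env))]

env = [("piano", "living room"), ("cutting board", "kitchen"),
       ("towel", "bathroom"), ("towel", "bedroom")]
qa_pairs = generate_location_questions(env)
# "towel" appears twice, so it is filtered out as an ambiguous target.
```

Recombining the same primitives in a different order yields other templates, such as the color questions.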
In total we have several question types, but for this initial work we focus on location, color, and preposition questions: template-based questions that ask about a single target object.
And additionally, as a post-processing step, we make sure that the answer distributions for these questions are not peaky, so that the agent actually has to navigate to be able to answer accurately and cannot exploit biases.
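One simple way to implement such a bias check is to look at the entropy of each template's answer distribution and drop templates whose distribution is too peaked, since those could be answered without navigating. This is a sketch with an arbitrary threshold, not necessarily the filtering used in the dataset:

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (bits) of the empirical answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def is_exploitable(answers, min_entropy=0.9):
    """Flag a question template whose answers are too predictable."""
    return answer_entropy(answers) < min_entropy

balanced = ["red", "blue", "brown", "grey"]   # uniform: 2.0 bits
peaky = ["red", "red", "red", "blue"]         # dominated by one answer
```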
And all of this data is publicly available for download at embodiedqa.org.
Coming to our model, it consists of four components: vision, language, navigation, and answering. The vision module is a four-layer convolutional neural network which is pretrained for RGB reconstruction, semantic segmentation, and depth estimation.
Once it's pretrained, we throw away the decoders and just use the encoder as a fixed feature extractor.
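Conceptually, after pretraining the encoder behaves like a frozen function from pixels to features. In this toy sketch a single frozen linear-plus-ReLU layer stands in for the four-layer CNN, and all shapes are made up:

```python
import numpy as np

# Frozen "encoder" weights: after multi-task pretraining the decoders are
# discarded and these weights are never updated again.
rng = np.random.default_rng(0)
W = rng.standard_normal((128, 3 * 8 * 8)) * 0.01

def encode(image):
    """Map an 8x8 RGB image to a fixed 128-d feature vector (no learning)."""
    x = image.reshape(-1)          # flatten H x W x C
    return np.maximum(W @ x, 0.0)  # linear layer + ReLU, weights frozen

frame = rng.random((8, 8, 3))
feat = encode(frame)
```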
Our language module is an LSTM that extracts a fixed-size representation of the question.
We have a hierarchical navigation policy, consisting of a planner that decides which action to perform, and a controller that decides how many time steps to execute each action for.
And so here's what it looks like in practice. We extract image features using the CNN, and conditioned on these image features and the question, the planner decides which action to perform; so in this case it decides to turn right.
Control is then passed to the controller. The controller has to decide whether to continue turning right or return control to the planner; in this case it decides to return control, and that completes one time step of the planner.
Okay, and at the next time step the planner looks at the image features and the question and decides which action to perform, so here to go forward. Control is passed to the controller, and the controller decides to continue moving forward for three time steps before handing control back to the planner.
And this sort of continues until finally the planner decides to stop.
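The planner/controller interaction described above can be sketched as a simple nested loop. Both policies are scripted here purely for illustration; in the actual model they are learned recurrent networks conditioned on image features and the question:

```python
def navigate(planner, controller, max_steps=20):
    """Planner picks an action; controller repeats it until it yields control."""
    trajectory = []
    t = 0
    while t < max_steps:
        action = planner(t)                 # planner: which action to perform
        if action == "stop":
            break
        trajectory.append(action)
        t += 1
        # controller: keep executing the same action, or return control?
        while t < max_steps and controller(action, t):
            trajectory.append(action)
            t += 1
    return trajectory

# Scripted example: turn right once, then move forward three steps, then stop.
plan = iter(["turn-right", "forward", "stop"])
planner = lambda t: next(plan)
controller = lambda a, t: a == "forward" and t < 4
path = navigate(planner, controller)
```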
For answering, we extract a question representation using an LSTM, and we compute attention over the last five image frames from the navigation trajectory. We combine these attended image features with the question representation to make a prediction of the answer.
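A rough sketch of this answering step, using plain dot-product attention over the last five frame features. The dimensions, the elementwise fusion, and the single linear answer head are simplifications of the actual architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def answer(question_vec, frame_feats, W_ans):
    """Attend over frames with the question, fuse, and classify the answer."""
    scores = frame_feats @ question_vec   # (5,) attention logits
    alpha = softmax(scores)               # attention weights over 5 frames
    attended = alpha @ frame_feats        # weighted sum of frame features
    fused = attended * question_vec       # combine with question representation
    return softmax(W_ans @ fused)         # distribution over candidate answers

rng = np.random.default_rng(0)
q = rng.standard_normal(16)               # question encoding
frames = rng.standard_normal((5, 16))     # last five frame features
W = rng.standard_normal((10, 16))         # 10 candidate answers
probs = answer(q, frames, W)
```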
Now that we have these four modules, coming to training. As a reminder, the agent is spawned at a random location in an environment; here I'm showing the top-down map.
We ask it a question, say "what room is the piano located in?", and the red star shows the location of the target object, so that's where the agent is expected to navigate. A shortest path might look something like this, and here's the first-person video along that shortest path; at the end of it, an expert agent would say the answer.
And given these shortest paths, we can pretrain our answering module to predict the answer from the last five frames, and we pretrain our navigation module in a teacher-forcing manner to predict each action along the shortest path.
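The teacher-forcing objective amounts to average cross-entropy of the policy's action distribution against the expert's shortest-path actions. A tiny sketch, with a tabular stand-in for the recurrent policy:

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability the policy assigns to the expert action."""
    return -math.log(probs[target])

def teacher_forcing_loss(policy_probs, expert_actions):
    """Average cross-entropy along the shortest path."""
    losses = [cross_entropy(p, a) for p, a in zip(policy_probs, expert_actions)]
    return sum(losses) / len(losses)

# A perfectly confident policy on a 3-step expert path incurs zero loss.
probs = [{0: 1.0}, {1: 1.0}, {0: 1.0}]
expert = [0, 1, 0]
loss = teacher_forcing_loss(probs, expert)
```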
And once we have these two modules pretrained, we fine-tune with reinforcement learning: we put the agent in an environment, sample actions from the navigation policy, execute these actions in the environment, and assign an intermediate reward when it makes progress towards the target.
And when the agent chooses to stop, we execute the answering module and assign a terminal reward if the agent gets the answer right.
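The reward structure can be sketched like this; the magnitudes (a small per-step progress bonus, a larger terminal answer reward) are illustrative choices here, not necessarily the exact values used in the paper:

```python
def episode_rewards(distances, answered_correctly,
                    step_bonus=0.005, answer_reward=5.0):
    """Per-step progress bonus plus a terminal reward for a correct answer."""
    rewards = []
    for prev, cur in zip(distances, distances[1:]):
        rewards.append(step_bonus if cur < prev else 0.0)
    rewards.append(answer_reward if answered_correctly else 0.0)
    return rewards

def discounted_return(rewards, gamma=0.99):
    """Discounted sum of rewards, as used by a REINFORCE-style update."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Agent moves 3 -> 2 -> 1 meters from the target, then answers correctly.
rs = episode_rewards([3.0, 2.0, 1.0], True)
ret = discounted_return(rs)
```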
In terms of metrics, again I'm showing the top-down map, and the right plot shows what an agent's trajectory might look like. So given an agent's final location, we can evaluate what is the final distance to the target and what is the improvement in distance. We also compute whether the agent enters or ends up in the right room, and whether it ever chooses to stop at all.
And for answering, we look at the mean rank of the ground-truth answer in the softmax distribution predicted by the answering module.
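These metrics are straightforward to compute. A small sketch of the final distance, distance improvement, and mean-rank computations (names here are mine, not the paper's notation):

```python
def nav_metrics(d_init, d_final):
    """Final distance to target, and improvement in distance over the episode."""
    return {"d_T": d_final, "d_delta": d_init - d_final}

def mean_rank(predictions, ground_truths):
    """Mean rank of the ground-truth answer; rank 1 = highest probability."""
    ranks = []
    for probs, gt in zip(predictions, ground_truths):
        order = sorted(probs, key=probs.get, reverse=True)
        ranks.append(order.index(gt) + 1)
    return sum(ranks) / len(ranks)

m = nav_metrics(5.0, 1.5)
mr = mean_rank([{"red": 0.7, "blue": 0.3}, {"red": 0.2, "blue": 0.8}],
               ["red", "red"])
```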
So in terms of results on the distance-to-target metric, lower is better, and here I'm showing a few baselines. First, adding in question information over a purely prior-based navigation module makes the agent end up closer to the target by about half a meter. Adding memory in the form of an LSTM helps it do even better, by about another half a meter.
And finally, our hierarchical policy ends up closest to the target.
So here are a few qualitative examples. For the question "what color is the fish tank in the living room?", I'm showing the baseline LSTM model on the left. The baseline model turns away from the fish tank and walks right out of the house, so it doesn't know where to stop, and it finally gets the answer wrong.
Whereas our model turns, looks at the fish tank, walks up to it, stops, and gets it right.
Here's another example, where the question is "what color is the bathtub?" The baseline model turns but gets stuck against a wall, whereas our model navigates to the bathtub, stops, and gets the answer right.
So to summarize: I introduced the task of Embodied Question Answering, which involves navigation and question answering in these simulated House3D environments. We proposed a dataset for this task, and we proposed a hierarchical navigation policy that outperforms competitive baselines.
All of this data and code is publicly available, so I'd encourage you to check it out.
That's it. Thank you.
Q: So by baking the navigator into your model, you're making an assumption about how the system can navigate, and you're building that in. If you have a lidar system or a wheeled system, you can imagine learning very different policies, or likewise in a multi-storey building. Have you assessed how you might generalize the model, and is this the right abstraction to really try to understand how to solve the problem?
A: I mean, that's a good question; I don't think I have a definitive answer right now. We're abstracting away everything that's related to what the specific hardware might be, and we are assuming no stochasticity in the environment: we are assuming that executing "forward" will always move the agent 0.5 meters, that turning will always rotate it by a fixed angle, and so on.
So the action space will change depending on what specific hardware you have access to. I could imagine training some of these models conditioned on the specific hardware parameters that they might have to work with, if we had access to those. But as of now, I don't have a concrete answer beyond that.
Q: Where do the errors of the model come from: from the vision part, from the language part, or from the navigation?

A: Sorry, I missed the first part. Where do the errors of the model come from?
So, the way the task is set up, the agent has to navigate purely from first-person vision; it doesn't have a map of the environment. I think that's where most of the errors come from. Navigating just from first-person vision, even in a simulated environment, is extremely hard to get to work. I skipped those details in this presentation, but they're in the paper.
For evaluation, we evaluate the agent at different difficulty levels, where we initially spawn it ten steps back from the target, then thirty, then fifty, and see how well it does. At the most difficult level it has to cross just about one room, and anything beyond that it doesn't do a really good job at. So I think navigation is the hardest part.