Speech Transcript - Embodied Question Answering

0:00:15	hi everyone i'm abhishek i'm a phd student at georgia tech
0:00:19	and i'll be presenting our work on embodied question answering this is joint work
0:00:24	with my collaborators at georgia tech and facebook ai research
0:00:29	so in this work we propose a new task called embodied question answering the task
0:00:33	is that there's an agent that's spawned at a random location in an unseen environment and
0:00:38	asked a question such as what colour is the car
0:00:41	in order to succeed the agent must understand the question navigate the environment find the object
0:00:46	that the question asks about and respond back with the answer
0:00:51	so we begin by proposing a data set of questions in environments for this task
0:00:56	so for environments we use house three d which is work out of facebook
0:01:00	ai research in building a rich and interactive environment out of the suncg dataset
0:01:06	and so to give a sense of what this data looks like here are a few kitchens
0:01:09	from house three d
0:01:14	here are a few living rooms
0:01:20	and here are a few buttons rooms
0:01:23	so as you can see there's a rich and diverse set of colours textures objects
0:01:26	and their spatial configurations
0:01:29	so in total we use eight hundred environments from house three d for this work
0:01:33	consisting of twelve room types and fifty object types and we make sure that there's no
0:01:38	overlap between the training validation and test environments so we strictly check for generalization to
0:01:42	novel environments
0:01:45	coming to questions our questions are generated programmatically in a manner similar to clevr in
0:01:49	that we have several primitive functions that can be combined and executed on these
0:01:54	environments to generate a whole bunch of questions
0:01:58	to give an example executing select objects on an environment returns a list of objects present and
0:02:03	then passing that list to singleton will filter it to retain objects that
0:02:08	occur only once
0:02:10	and we can then query the location for each object in that set to generate
0:02:13	a whole bunch of location questions such as what room is the piano located in what
0:02:17	room is the dog located in what room is the cutting board located in and so on
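
The question generation step described above can be sketched in Python as follows. This is a minimal illustration, not the dataset's actual generation code: the function names `select_objects`, `singleton`, and `query_location`, and the toy environment `ENV`, are assumptions made for this example.

```python
from collections import Counter

# Toy environment: (object name, room) pairs. Hypothetical data for illustration only.
ENV = [
    ("piano", "living room"),
    ("dog", "garage"),
    ("cutting board", "kitchen"),
    ("chair", "kitchen"),
    ("chair", "living room"),
]

def select_objects(env):
    """Return every object instance present in the environment."""
    return list(env)

def singleton(objects):
    """Keep only objects whose name occurs exactly once."""
    counts = Counter(name for name, _ in objects)
    return [(name, room) for name, room in objects if counts[name] == 1]

def query_location(objects):
    """Turn each remaining object into a (question, answer) pair."""
    return [(f"what room is the {name} located in", room) for name, room in objects]

# Compose the primitives: select, filter to unique objects, then ask about location.
questions = query_location(singleton(select_objects(ENV)))
for question, answer in questions:
    print(question, "->", answer)
```

The chair appears in two rooms, so `singleton` drops it and no ambiguous location question is generated for it.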
0:02:23	here's another example where we combine these primitive functions in a different combination to generate
0:02:27	a whole bunch of colour questions so what colour is the base station in the living
0:02:30	room what colours that are in the gym and so on
0:02:33	in total we have several question types but for this initial work we focus on
0:02:38	location colour and template based preposition questions that ask about a single
0:02:43	target object
0:02:44	and additionally as a post-processing step we make sure that the answer distributions for these
0:02:49	questions aren't peaky so that the agent actually has to navigate to be able to
0:02:52	answer accurately and cannot exploit biases
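
One way to check that an answer distribution is not peaky is to measure its normalized entropy and drop questions that fall below a cutoff. The sketch below assumes that approach; the talk does not say how the check is done, and the function names and the 0.5 threshold are illustrative choices made for this example.

```python
import math
from collections import Counter

def normalized_entropy(answers):
    """Entropy of the empirical answer distribution, scaled to the range [0, 1]."""
    counts = Counter(answers)
    if len(counts) < 2:
        return 0.0  # a single possible answer carries no uncertainty
    total = len(answers)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))

def keep_question(answers, threshold=0.5):
    """Hypothetical filter: keep a question only if its answers are spread out."""
    return normalized_entropy(answers) >= threshold

# Answers to the same question collected across many environments (made-up data).
peaky = ["brown"] * 19 + ["black"]
spread = ["brown", "black", "white", "red"] * 5
print(keep_question(peaky), keep_question(spread))
```

A question whose answer is almost always the same can be guessed without looking at the environment, which is the bias the post-processing step is meant to remove.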
0:02:57	and all this data is publicly available for download on embodiedqa dot org
0:03:01	coming to our model it consists of four components vision language navigation and answering
0:03:06	modules the vision module is a four layer convolutional neural network which is pretrained for
0:03:11	input reconstruction semantic segmentation and depth estimation
0:03:15	once it's pretrained we throw away the decoders and just use the encoder as
0:03:18	a fixed feature extractor
0:03:20	our language module is an lstm that extracts a fixed size representation of
0:03:25	the question
0:03:26	we have a hierarchical navigation policy consisting of a planner that picks which action to
0:03:31	perform and a controller that decides how many time steps to execute each action for
0:03:36	and so here's what it looks like in practice we extract image features using the
0:03:41	cnn and conditioned on these image features and the question the planner decides which action
0:03:45	to perform so in this case it decides to turn right
0:03:48	control is then passed to the controller
0:03:51	the controller then has to decide whether to continue turning right or return control to
0:03:55	the planner so in this case it decides to return control and that completes one
0:04:00	time step of the planner
0:04:01	okay and at the next time step the planner looks at the image features and
0:04:04	the question and decides which action to perform so here it's forward control is passed
0:04:09	to the controller the controller decides to continue moving forward for three time steps before
0:04:13	handing back control to the planner
0:04:15	and this sort of continues until finally the planner decides to stop
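
The planner and controller hand-off walked through above can be sketched as a nested loop. This is a minimal illustration with scripted stand-ins for the learned networks: `run_episode`, `scripted_planner`, `scripted_controller`, and the use of a step counter as the observation are all assumptions made for this example.

```python
def run_episode(planner, controller, observe, question, max_steps=100):
    """Roll out a planner-controller policy and return the executed actions."""
    executed = []
    while len(executed) < max_steps:
        # The planner picks the next action from the current observation and question.
        action = planner(observe(len(executed)), question)
        if action == "stop":
            break
        executed.append(action)  # each planned action is executed at least once
        # The controller repeats the action until it hands control back.
        while len(executed) < max_steps and controller(observe(len(executed)), action):
            executed.append(action)
    return executed

def scripted_planner(obs, question):
    # In this toy setup the observation is just the number of steps taken so far.
    if obs == 0:
        return "turn right"
    if obs < 4:
        return "forward"
    return "stop"

def scripted_controller(obs, action):
    # Keep moving forward until four steps have been taken; never repeat a turn.
    return action == "forward" and obs < 4

trajectory = run_episode(scripted_planner, scripted_controller,
                         observe=lambda t: t, question="what colour is the car")
print(trajectory)
```

With these scripted stand-ins the rollout is one right turn followed by three forward steps, after which the planner stops.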
0:04:22	for answering
0:04:24	we extract a question representation using an lstm and we compute attention over the
0:04:29	last five image frames from the navigation trajectory we combine these attended image features with
0:04:34	the question representation to make a prediction of the answer
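
The answering step can be sketched as attention over the last five frame features followed by a combination with the question vector. The talk does not specify the scoring function, so dot-product scoring and concatenation are assumptions here, as are the function names and the tiny two-dimensional feature vectors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(frame_features, question_vec):
    """Weight each frame by its dot product with the question, then average."""
    scores = [sum(f * q for f, q in zip(frame, question_vec)) for frame in frame_features]
    weights = softmax(scores)
    dim = len(question_vec)
    attended = [sum(w * frame[i] for w, frame in zip(weights, frame_features))
                for i in range(dim)]
    return weights, attended

# Made-up features for the last five frames of a trajectory and for the question.
frames = [[0.1, 0.0], [0.2, 0.1], [0.0, 0.3], [0.9, 0.1], [1.0, 0.0]]
question = [2.0, 0.0]
weights, attended = attend(frames, question)
combined = attended + question  # concatenation that a full model would feed to a classifier
print(weights)
```

In this toy input the final frame matches the question vector best, so it receives the largest attention weight.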
0:04:39	now that we have these four modules coming to training details as a reminder
0:04:43	the agent is spawned at a random location in
0:04:47	an environment here i'm showing the top-down map
0:04:50	we ask it a question such as what room is the csi located in the red
0:04:53	star shows the location of this object so that's where the agent is expected to
0:04:56	navigate to a shortest path might look something like this and here's the first person video
0:05:03	of that shortest path this is what an expert agent would see
0:05:07	and given the shortest path we can pretrain our answering module to be
0:05:11	able to predict the answer from the last five frames
0:05:13	and we pretrain our navigation module in a teacher forcing manner to predict each action in
0:05:18	the shortest path
0:05:20	and once we have these two modules pretrained we fine-tune using reinforcement learning we spawn the agent
0:05:25	in an environment sample actions from this navigation policy execute these actions in the environment
0:05:30	and assign an intermediate reward for when it makes progress towards the target
0:05:35	and when the agent chooses to stop we execute the answerer
0:05:39	and assign a terminal reward if the agent gets the answer right
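
The two reward signals mentioned above can be sketched as a per-step term for progress towards the target plus a terminal term for a correct answer. The exact reward values are not given in the talk, so the distance-difference form and the bonus of 5.0 are assumptions, as are the function names.

```python
def step_reward(prev_dist, new_dist):
    """Intermediate reward: positive when the agent moved closer to the target."""
    return prev_dist - new_dist

def terminal_reward(predicted, ground_truth, bonus=5.0):
    """Terminal reward: a fixed bonus (value assumed) only for a correct answer."""
    return bonus if predicted == ground_truth else 0.0

def episode_return(distances, predicted, ground_truth):
    """Sum the per-step progress rewards and add the terminal answer reward."""
    shaped = sum(step_reward(a, b) for a, b in zip(distances, distances[1:]))
    return shaped + terminal_reward(predicted, ground_truth)

# Distance to target after each step of a made-up episode, in meters.
distances = [4.0, 3.5, 3.0, 3.5, 1.0]
print(episode_return(distances, "brown", "brown"))
print(episode_return(distances, "black", "brown"))
```

Because the step rewards are differences of consecutive distances, they sum to the total reduction in distance, so a detour that is later recovered costs nothing overall in this sketch.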
0:05:44	in terms of metrics again i'm showing the top-down map so the right plot shows
0:05:49	what an agent's trajectory might look like so given an agent's final location we can
0:05:53	evaluate what is the final distance to target and what is the improvement in distance we
0:05:58	also compute whether the agent ends up in the right room
0:06:02	or if it ever chooses to stop or not and for answering we look at
0:06:05	the mean rank of the ground truth answer in the softmax distribution predicted by the
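
The navigation and answering metrics listed above can be computed along these lines. This is a sketch under assumptions: the key names `d_T` and `d_delta`, the function names, and the example numbers are chosen for this illustration and are not taken from the released evaluation code.

```python
def navigation_metrics(start_dist, final_dist):
    """Final distance to target and improvement over the starting distance."""
    return {"d_T": final_dist, "d_delta": start_dist - final_dist}

def answer_rank(probs, ground_truth):
    """1-based rank of the ground truth answer when sorted by predicted probability."""
    ordered = sorted(probs, key=probs.get, reverse=True)
    return ordered.index(ground_truth) + 1

def mean_rank(predictions):
    """Average rank of the ground truth answer over (distribution, answer) pairs."""
    ranks = [answer_rank(probs, truth) for probs, truth in predictions]
    return sum(ranks) / len(ranks)

# Made-up softmax outputs for two questions, each paired with its ground truth answer.
predictions = [
    ({"brown": 0.6, "black": 0.3, "white": 0.1}, "brown"),
    ({"brown": 0.5, "black": 0.2, "white": 0.3}, "white"),
]
print(navigation_metrics(start_dist=5.0, final_dist=1.5))
print(mean_rank(predictions))
```

A mean rank of 1 would mean the ground truth answer is always the top prediction, so lower is better for this metric as well.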
0:06:10	so in terms of results on the distance to target metric lower is
0:06:13	better so here i'm showing a few baselines first adding in question information over
0:06:18	a prior based navigation module helps the agent end up closer to the target by about
0:06:22	half a meter adding in memory in the form of an lstm helps it do
0:06:26	even better by about half a meter
0:06:28	and finally our hierarchical policy ends up closest to the target
0:06:34	so here are a few qualitative examples for the question what color is the
0:06:38	fish tank in the living room i'm showing the baseline lstm model on the left
0:06:42	so the baseline model turns looks at the fish tank but walks right out of
0:06:45	the house so it doesn't know when to stop and it finally gets the answer
0:06:49	wrong
0:06:51	whereas our model turns looks at the fish tank and walks up to it
0:06:54	stops and gets the answer correct
0:06:57	here's another example so the question is what colour is the bathtub
0:07:00	the baseline model turns but gets stuck against a wall
0:07:03	whereas our model walks up to the bathtub stops and gets the answer
0:07:08	right so to summarize i introduced the task of embodied question answering which involves
0:07:12	navigation and question answering in these simulated house three d environments we propose a dataset for
0:07:17	this task and we proposed a hierarchical navigation policy that performs reasonably against competitive
0:07:22	baselines
0:07:23	all of this data and code is publicly available so i encourage you to check
0:07:27	that out
0:07:28	that's it thank you
0:07:52	so by baking the navigator into your model you make an assumption about how
0:07:58	the system can navigate in your building
0:08:00	if you have a legged system or a wheeled system you can imagine learning very
0:08:05	different policies value that you multi storey building you assess on how you might
0:08:10	generalize the model and is this the right abstraction to really try to understand
0:08:14	how to solve the problem
0:08:18	i mean that's a good question i don't think i'm the type or seem to
0:08:21	be on single right now we're abstracting away all the details related to what
0:08:25	the specific hardware might be and we are assuming no stochasticity in
0:08:30	the environment
0:08:31	we are assuming that executing forward will always move point five meters
0:08:37	were taken for seven how can we go
0:08:42	i mean
0:08:44	one
0:08:45	so the action space will change depending on what specific hardware you have access to
0:08:50	you could
0:08:51	i could imagine
0:08:53	training some of these models
0:08:56	conditioned on the specific hardware parameters that they have
0:09:00	if we had access to those
0:09:02	but i'd say i don't have anything beyond that
0:09:09	i think if it
0:09:11	what errors of the model come from the vision part from the language part from
0:09:16	the navigation
0:09:17	so i missed the first point where do the errors of the model come from
0:09:22	so
0:09:24	the way the task is set up the agent has to navigate purely from first
0:09:27	person vision it doesn't have a map of the environment
0:09:29	i think that's where most of the errors come from navigating just from first person
0:09:33	vision even in the simulated environment is extremely hard to get to work so in
0:09:38	more so i skipped those details in this presentation but in the paper we
0:09:42	have that
0:09:43	for evaluating we evaluate the agent at different difficulty levels where we initially bring it
0:09:49	back ten steps from the target then thirty then fifty and see how well it
0:09:53	does so i
0:09:56	not at the most difficult level it has to just cross one room but anything
0:10:02	beyond
0:10:04	it doesn't do a really good job at so i think navigation is the is
0:10:07	the hardest part

Embodied Question Answering

Special Session: Late-breaking and work-in-progress talks

Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, Dhruv Batra