0:00:15 | So I am going to present my work on the topic of language-guided adaptive |
---|
0:00:21 | perception |
---|
0:00:22 | for efficient grounded communication |
---|
0:00:25 | with robotic manipulators in cluttered environments. It is a bit of a mouthful; I hope you |
---|
0:00:30 | will understand it by the end of the presentation. |
---|
0:00:33 | But what this is about |
---|
0:00:37 | is language understanding in physically situated settings, an |
---|
0:00:43 | interesting problem in robotics. |
---|
0:00:45 | Okay. The ability to |
---|
0:00:48 | interact with collaborative robots using natural language, and then |
---|
0:00:52 | to perceive, plan, and establish |
---|
0:00:55 | common ground, |
---|
0:00:57 | is critical |
---|
0:00:58 | for effective human-robot interaction. Let us look at an example. |
---|
0:01:02 | The user says, |
---|
0:01:04 | "Pick up the leftmost cube." |
---|
0:01:06 | Then the robot |
---|
0:01:08 | perceives the scene and grounds a specific object, |
---|
0:01:12 | and once that is done, the |
---|
0:01:13 | user continues and says, "Put it on top of the |
---|
0:01:16 | red box," and so on. |
---|
0:01:19 | So a few things to note about this: there is diversity in the language, |
---|
0:01:27 | diversity in terms of the instructions that the user can give |
---|
0:01:30 | to the robot and the way in which the instructions are said. |
---|
0:01:35 | There are challenges because the environments are unstructured; they could be cluttered, |
---|
0:01:40 | like the one shown here. |
---|
0:01:43 | And you need real-time interaction with the robot, |
---|
0:01:47 | but perception takes time. |
---|
0:01:50 | So that is what this work is specifically about: how to efficiently perceive environments |
---|
0:01:56 | for fast and accurate grounding of a variety of natural language |
---|
0:02:01 | instructions, demonstrated in the context of robotic manipulation. |
---|
0:02:07 | Let me give you some background |
---|
0:02:09 | on perception and representations. What perception usually refers to is: you have |
---|
0:02:15 | sensor measurements that come from the robot's sensors, you have some perception pipeline, and the |
---|
0:02:20 | perception pipeline compresses these high-dimensional sensor measurements and gives you a representation |
---|
0:02:26 | of the environment, something called a world model. |
---|
0:02:30 | To give you an example of visual perception: |
---|
0:02:34 | you can feed in a sequence of RGBD images, |
---|
0:02:37 | and what you get out of it is some representation of the world. |
---|
0:02:41 | The representation varies based on the application. For example, here it is just a point cloud |
---|
0:02:47 | representation; it can be a |
---|
0:02:49 | 3D voxelized map, you can have an occupancy grid or a semantic map, |
---|
0:02:54 | or, if you want to |
---|
0:02:56 | pick a specific object, you may want to model the pose, the six-degrees-of-freedom |
---|
0:03:01 | pose of those objects, and you can get something like that. |
---|
0:03:04 | Going even further, you can have articulation modeling of the components of an object. |
---|
0:03:10 | So the point to note is that the representations vary based on the |
---|
0:03:15 | application. |
---|
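As an aside, the progression of representation types just described can be sketched in code. This is purely my illustration (the type names are mine, not from the talk): each step down the list adds detail, and every added field is something the perception pipeline must spend extra computation to estimate.

```python
# Illustrative sketch only: world-model representations at increasing
# levels of detail. Richer representations cost more compute to build.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class BoundingBox:
    """Simplest representation: just where the object is."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

@dataclass
class SemanticObject:
    """Adds a semantic label, e.g. 'mustard bottle'."""
    box: BoundingBox
    label: str

@dataclass
class PosedObject:
    """Adds a full six-degree-of-freedom pose (x, y, z, roll, pitch, yaw)."""
    obj: SemanticObject
    pose: Tuple[float, float, float, float, float, float]

@dataclass
class ArticulatedObject:
    """Decomposes the object into parts, e.g. a bottle into lid and body."""
    parts: List[PosedObject] = field(default_factory=list)
```

The hierarchy makes the talk's point concrete: a planner that only needs to pick something up may get away with a `SemanticObject`, while opening a lid requires an `ArticulatedObject`.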
0:03:18 | One more point to note is that as we move from |
---|
0:03:23 | simple representations to more detailed representations, |
---|
0:03:25 | here it is just the bounding box representation of the object, then you have |
---|
0:03:29 | semantics, so you know that it is a mustard bottle, then you know its six-degrees-of-freedom |
---|
0:03:33 | pose, and going further you have decomposed the bottle into a lid and a body, |
---|
0:03:39 | the more detailed the representations you have, |
---|
0:03:43 | the more computation they cost the robot to build and use. |
---|
0:03:46 | So |
---|
0:03:51 | highly detailed models allow reasoning and planning for a wide variety of complex |
---|
0:03:55 | tasks, but that leads us to the problem: |
---|
0:03:58 | building and maintaining such exhaustively detailed world models |
---|
0:04:02 | of cluttered environments is computationally expensive, and it inhibits real-time interaction |
---|
0:04:07 | and dialogue with a collaborative robot. |
---|
0:04:09 | So one common approach is to have task-specific representations, where you know beforehand |
---|
0:04:15 | what the robot is supposed to do, and |
---|
0:04:17 | you hardcode the perception pipeline accordingly. But how to best represent environments to |
---|
0:04:24 | facilitate planning, |
---|
0:04:28 | grounding, and reasoning for a wide variety of complex tasks is an open question. |
---|
0:04:34 | What we observe is that |
---|
0:04:35 | in the case of exhaustive modeling, if you model all the properties |
---|
0:04:41 | of all objects in the world, one problem is that some of these properties are |
---|
0:04:45 | inconsequential to interpreting the meaning of some of the instructions. |
---|
0:04:51 | In this case, modeling the difference between the lid and the body of the coffee can |
---|
0:04:55 | is irrelevant for the task of picking up the mustard bottle, and vice versa. |
---|
0:05:02 | So what we propose in our work is learning a model of language and perception, |
---|
0:05:07 | specifically to adapt the configuration of the perception pipeline, |
---|
0:05:13 | in order to infer task-optimal representations of the world that facilitate the grounding of |
---|
0:05:20 | natural language instructions. For example, |
---|
0:05:23 | this is |
---|
0:05:24 | the environment representation inferred for the task of picking up the leftmost cube, where it |
---|
0:05:30 | just segments out the cubes, and this is for the task of picking up |
---|
0:05:35 | the nearest object, where you are ignoring the blue objects and inferring only the properties needed |
---|
0:05:40 | for the task. |
---|
0:05:43 | Now, to give you some background about the models that have been used in this |
---|
0:05:47 | paper: we are not the first ones to do language understanding. Generalized Grounding |
---|
0:05:52 | Graphs is one of the models; it was developed by Tellex and colleagues, |
---|
0:05:57 | and they demonstrated |
---|
0:06:01 | its utility on the task of lifting stuff using forklift |
---|
0:06:05 | trucks. |
---|
0:06:07 | One advancement over that model was DCG; I will talk more about this |
---|
0:06:12 | model later, but it |
---|
0:06:15 | basically exploited conditional independence assumptions across the constituents |
---|
0:06:19 | of language and the semantic constituents to infer high-level motion planning constraints |
---|
0:06:25 | given an instruction. |
---|
0:06:27 | There is one more model that was used to infer abstract visual concepts, for example to |
---|
0:06:33 | learn what it means to be the middle block in the row of five blocks on the |
---|
0:06:38 | right. |
---|
0:06:39 | So all of these language models |
---|
0:06:44 | assume some fixed, flat representation of the world. |
---|
0:06:50 | There has also been work at the intersection of perception and language understanding |
---|
0:06:55 | that talks about how we can leverage language to |
---|
0:07:00 | augment perception. |
---|
0:07:01 | In this case, language was used to add semantic labels to |
---|
0:07:08 | regions in the map, in addition to the occupancy grid representation. |
---|
0:07:14 | In this work, language was used to |
---|
0:07:18 | aid the process of fitting kinematic models. |
---|
0:07:23 | And another piece of work applies it when, for an |
---|
0:07:28 | instruction like "go to the hydrant behind the cone," the robot cannot see what is behind the |
---|
0:07:31 | cone, and so |
---|
0:07:33 | the instruction is used to hypothesize models with which the representation can be augmented. |
---|
0:07:40 | These models augment the representation, but they do not consider how to |
---|
0:07:45 | efficiently convert raw observations into representations that can |
---|
0:07:50 | speed up the grounding process. The work most closely related to ours is by |
---|
0:07:57 | Matuszek and colleagues, |
---|
0:07:58 | who use a joint language-perception model to select a subset of objects |
---|
0:08:03 | based on some color and geometric properties, and there is work by Hu and colleagues |
---|
0:08:10 | on segmentation from natural language expressions, where you have an RGB image and |
---|
0:08:16 | a given instruction, and the model outputs the corresponding segments. |
---|
0:08:19 | What is different in our work is that we are expanding the breadth |
---|
0:08:22 | and complexity of the perceptual classifiers used in the work, |
---|
0:08:27 | we work with RGBD data, |
---|
0:08:30 | and we present an approach to adapt the configuration of the perception pipeline in order to |
---|
0:08:36 | infer task-specific representations. Moving on to the technical approach, we pose the general |
---|
0:08:43 | high-level language understanding problem as |
---|
0:08:47 | finding the most likely trajectory given some natural language expression and some observations, |
---|
0:08:53 | which could be a sequence of RGBD images; |
---|
0:08:58 | in our case it is just a single RGBD frame. |
---|
0:09:02 | Solving this inference directly is computationally expensive, as the space of |
---|
0:09:05 | trajectories is quite large |
---|
0:09:08 | for complicated environments and robots. |
---|
0:09:12 | So, similar to contemporary techniques, we propose to |
---|
0:09:16 | restructure this problem as a symbol grounding problem, |
---|
0:09:19 | where we infer a distribution over symbols |
---|
0:09:23 | given the language and a world model. So we are moving from high-dimensional |
---|
0:09:29 | sensor |
---|
0:09:31 | measurements to |
---|
0:09:32 | a structured representation of the world, the world model, |
---|
0:09:36 | which is a function of the perception pipeline of the robot. |
---|
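Written out (my reconstruction, using notation common to the DCG literature; the exact symbols on the slides may differ), the restructuring looks like this:

```latex
% Original problem: the most likely trajectory x_{1..T} given the
% instruction \Lambda and the observations z_{1..T}:
x_{1..T}^{*} = \arg\max_{x_{1..T}} \; p(x_{1..T} \mid \Lambda, z_{1..T})

% Restructured as symbol grounding: infer the most likely symbols \Gamma
% given the language and a world model \Upsilon, where \Upsilon is produced
% by the robot's perception pipeline from the observations:
\Gamma^{*} = \arg\max_{\Gamma} \; p(\Gamma \mid \Lambda, \Upsilon)
```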
0:09:41 | So what does this |
---|
0:09:44 | symbol space exactly consist of? |
---|
0:09:50 | In the DCG model, the symbol space basically consists of the objects in the |
---|
0:09:55 | world, the properties which are perceived, the regions that are found in the world model, |
---|
0:09:59 | spatial relations, and action symbols. |
---|
0:10:03 | So the symbols span a discrete space of interpretations in which |
---|
0:10:07 | an instruction will be understood. |
---|
0:10:12 | We are specifically using DCG in our work. DCG is a probabilistic graphical model; |
---|
0:10:17 | it builds a factor graph for the parsed instruction. On this axis |
---|
0:10:22 | are the phrases, the linguistic components, and on the vertical axis are the constituents |
---|
0:10:27 | of the symbol space. This is an example of one of the factors: |
---|
0:10:31 | it links a linguistic |
---|
0:10:33 | phrase to one of the symbols, |
---|
0:10:35 | which could represent objects or regions of the world, and connects them with a |
---|
0:10:39 | correspondence variable. |
---|
0:10:42 | What DCG does is |
---|
0:10:44 | try to find |
---|
0:10:47 | the most likely set of correspondence variables in the context of the grounding language, the |
---|
0:10:54 | child correspondence variables, and the world model, and it does this by maximizing the product |
---|
0:11:00 | of the individual factors across the linguistic components and symbol constituents. |
---|
0:11:09 | These factors are estimated with log-linear models |
---|
0:11:11 | in DCG. |
---|
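To make the inference concrete, here is a toy sketch of that factorized maximization (entirely my own illustrative code, not the authors' implementation): each log-linear factor scores a binary correspondence variable linking one phrase to one symbol, and the conditional independence assumption lets each factor be maximized separately.

```python
# Toy sketch of DCG-style factorized inference. Each factor is a
# log-linear model over a binary correspondence variable; independence
# across (phrase, symbol) pairs turns the global argmax into many
# small local argmaxes.
import math

def log_linear_factor(weights, features):
    """Unnormalized log-linear score: exp(w . f)."""
    return math.exp(sum(w * f for w, f in zip(weights, features)))

def infer_correspondences(phrases, symbols, featurize, weights):
    """Pick the correspondence value (True/False) that maximizes each factor."""
    best = {}
    for phrase in phrases:
        for symbol in symbols:
            scores = {
                c: log_linear_factor(weights, featurize(phrase, symbol, c))
                for c in (True, False)
            }
            best[(phrase, symbol)] = max(scores, key=scores.get)
    return best

# Toy feature: fires positively when the phrase text appearing in the symbol
# name agrees with the correspondence variable.
def toy_featurize(phrase, symbol, c):
    return [1.0 if (phrase in symbol) == c else -1.0]
```

With `toy_featurize`, the phrase "ball" corresponds to the symbol `red_ball` and not to `blue_cube`; a trained model replaces the toy feature with learned weights over many features.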
0:11:12 | A problem here is that the runtime of DCG |
---|
0:11:18 | is directly proportional to the world model fidelity, |
---|
0:11:22 | and this is because the size of the symbol space increases as the number of |
---|
0:11:26 | objects |
---|
0:11:27 | in the world increases. |
---|
0:11:28 | What we observe is that some objects, and the symbols modeled based on those |
---|
0:11:32 | objects, are inconsequential to interpreting the meaning of the instruction. So we can |
---|
0:11:38 | hypothesize that there exists an optimal world model that expresses the |
---|
0:11:43 | necessary and sufficient information to solve this problem. |
---|
0:11:47 | So we go from the previous equation to this one, with the star notation, and |
---|
0:11:53 | we hypothesize that the runtime to solve this equation will be lower than |
---|
0:11:57 | before. |
---|
0:11:59 | What we propose is using language as a means to guide the process of |
---|
0:12:05 | generating these optimal world models. So we make the world model a function of the perception |
---|
0:12:09 | pipeline, observations, and language. |
---|
0:12:13 | Now we have added this NLU part, |
---|
0:12:16 | which takes in language and yields some constraints on the perception based on the task, |
---|
0:12:20 | and perception gives an optimal world model back, on which the grounding model reasons. |
---|
0:12:27 | To achieve this, we define a new symbol space that is |
---|
0:12:31 | specific only to perception, the perception symbol space. |
---|
0:12:33 | What it basically consists of is different color detectors, geometry detectors, pose detectors, |
---|
0:12:39 | semantic object detectors, and so on. |
---|
0:12:42 | These need not be just the detectors I listed; it could be, |
---|
0:12:46 | say, a detector to infer the likelihood of an object having some property, |
---|
0:12:51 | something like that. |
---|
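A minimal sketch of what such a perception symbol space might look like in code (the detector names here are my own examples, not the authors'): each symbol is a switch for one detector, and the adapted pipeline runs only the switched-on subset.

```python
# Illustrative perception symbol space: each entry names one detector the
# pipeline can run. The NLU infers a subset; everything else is skipped.
PERCEPTION_SYMBOLS = {
    "color:red", "color:blue",
    "geometry:sphere", "geometry:cube",
    "label:mustard_bottle",
    "pose:6dof",
    "region:table",
}

def select_detectors(constraints):
    """Validate and return the detectors the instruction actually needs."""
    unknown = constraints - PERCEPTION_SYMBOLS
    if unknown:
        raise ValueError(f"no detector registered for: {sorted(unknown)}")
    return constraints
```

For "pick up the leftmost cube", for instance, the inferred constraints might be just `{"geometry:cube"}`, so the color, label, and pose detectors never run.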
0:12:53 | So we use these detectors, and we adapt the model to infer these perception |
---|
0:13:00 | symbols, modifying this equation. |
---|
0:13:02 | We no longer have the world model in this equation, |
---|
0:13:05 | and we are reasoning in the perception symbol space. |
---|
0:13:08 | To give you some details: the symbolic representation that we use |
---|
0:13:14 | is made up of two |
---|
0:13:15 | different sets of symbols, independent symbols and conditionally dependent symbols. Independent perceptual symbols |
---|
0:13:22 | are basically the individual detectors that exist in the perception pipeline, |
---|
0:13:28 | like a cube detector, a red color detector, and so on; |
---|
0:13:33 | they form the set of all unconditioned detectors. We also recognize that to |
---|
0:13:38 | interpret some complex phrases, such as "pick up the red ball," you would need |
---|
0:13:43 | conditionally dependent symbols, |
---|
0:13:45 | which have some conditional dependence built in, so that we just run the |
---|
0:13:51 | sphere detector, in this case, on objects which are red, and thereby |
---|
0:13:55 | get a faster interpretation. |
---|
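The conditionally dependent case can be sketched like this (my own toy code; the real detectors operate on point clouds, not dictionaries): the cheap color check gates the expensive shape check, so the sphere detector only ever sees the red subset.

```python
# Toy sketch of a conditionally dependent perceptual symbol: run the
# expensive detector only on objects accepted by the cheap one.
def detect_red(obj):
    return obj.get("color") == "red"

def detect_sphere(obj):
    # Stands in for an expensive geometric fit over the object's point cloud.
    return obj.get("shape") == "sphere"

def ground_red_ball(objects):
    red_objects = [o for o in objects if detect_red(o)]   # cheap filter first
    return [o for o in red_objects if detect_sphere(o)]   # costly check on fewer objects
```

This gating is exactly why conditionally dependent symbols yield faster interpretation in clutter: the cost of the expensive detector scales with the filtered subset, not the whole scene.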
0:13:59 | Going forward |
---|
0:14:01 | to the experiments, this is the system architecture. We have an RGBD |
---|
0:14:04 | sensor that feeds into the adaptive perception module. |
---|
0:14:08 | We have a parser that takes the instruction, parses it, and feeds it to two |
---|
0:14:11 | NLU models. The first one is for inferring the language perception constraints; we call this the |
---|
0:14:17 | language perception model. |
---|
0:14:19 | The second one is |
---|
0:14:22 | the NLU used for symbol grounding. |
---|
0:14:27 | The first one takes in the language and gives you the perception constraints, that is, the detectors |
---|
0:14:32 | in the pipeline suitable for the task. Then adaptive perception takes the |
---|
0:14:37 | observation and the constraints and gives you an optimal world model, on which |
---|
0:14:41 | symbol grounding infers high-level motion planning constraints that go to the motion planner. |
---|
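The data flow just described can be summarized in a few lines (function names are mine; each stage is passed in as a callable so the sketch stays self-contained):

```python
# End-to-end sketch of the described architecture: parser -> NLU #1
# (language-perception constraints) -> adaptive perception -> NLU #2
# (symbol grounding) -> motion-planning constraints.
def understand(instruction, observation, parser, lpn, perception, grounder):
    parse = parser(instruction)
    constraints = lpn(parse)                            # which detectors to enable
    world_model = perception(observation, constraints)  # compact, task-optimal model
    return grounder(parse, world_model)                 # motion planning constraints
```

Passing the stages as callables mirrors the modularity of the architecture: the baseline in the comparative study is obtained simply by replacing `lpn` with a constant that enables every detector.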
0:14:46 | alright |
---|
0:14:48 | We do a comparative study in which we compare our proposed model with a |
---|
0:14:52 | baseline; the only difference is that the LPN block is missing in |
---|
0:14:56 | that architecture. We time the different processes: the time required to infer the constraints, the |
---|
0:15:02 | time required for adaptive perception, versus the time required to |
---|
0:15:07 | run complete perception, where we use all the detectors, all of the models, all |
---|
0:15:12 | of the modules in the perception pipeline, |
---|
0:15:16 | and correspondingly the symbol grounding time. |
---|
0:15:20 | There are a few assumptions in the experiments. |
---|
0:15:24 | For the environment, we have a Baxter robot, and we vary the |
---|
0:15:28 | arrangement over a wide range to get different world configurations, so we have |
---|
0:15:37 | many different world arrangements in our work, and the number of objects in the clutter |
---|
0:15:43 | varies from fifteen to twenty. |
---|
0:15:46 | This is the architecture of the perception pipeline. It has different components, |
---|
0:15:51 | like color detectors, geometry detectors, label detectors, different types of |
---|
0:15:56 | pose detectors, region detectors, et cetera. |
---|
0:16:02 | For all of those detectors we have independent symbols and conditionally dependent symbols, where |
---|
0:16:08 | the latter is a set of symbols which depend on, say, geometry and color labels, |
---|
0:16:16 | where the model basically chooses |
---|
0:16:19 | the expression of the symbol, say, engaging a geometry detector of a specific type conditioned on |
---|
0:16:24 | a color detector of a specific type. |
---|
0:16:29 | The symbolic representation for the symbol grounding model basically consists of several different |
---|
0:16:34 | things, |
---|
0:16:34 | which are objects in the world, labels, colors, geometries, regions in the world, et cetera. |
---|
0:16:41 | The corpus consists of syntactically parsed instructions, about a |
---|
0:16:47 | hundred instructions, annotated twice: once with the |
---|
0:16:53 | perception symbols and once with the grounding symbols. |
---|
0:16:57 | The linguistic patterns that we followed were |
---|
0:17:02 | inspired by the work done in this earlier paper, |
---|
0:17:05 | which collected data using Amazon Mechanical Turk, so we use similar linguistic patterns. |
---|
0:17:11 | In our experiments we have two hypotheses. |
---|
0:17:14 | The first one is that adaptively inferring task-optimal representations will |
---|
0:17:19 | reduce the perception runtime compared to exhaustively detailed uniform modeling of the world, |
---|
0:17:25 | and the second hypothesis is that reasoning in the context of these compact |
---|
0:17:29 | representations will reduce the symbol grounding time as well. |
---|
0:17:33 | We have three experiments. The first is just the learning characteristics |
---|
0:17:39 | of the NLU model: we observe, as the training fraction increases, what happens to accuracy. The second |
---|
0:17:44 | one is more interesting: how does adaptive perception impact the perception runtime? And |
---|
0:17:50 | the third is: how does it impact the symbol grounding runtime? |
---|
0:17:53 | We hypothesize that |
---|
0:17:56 | as the training fraction increases, the accuracy of inference should increase; second, |
---|
0:18:03 | as the number of objects increases, if you are using complete perception the runtime should |
---|
0:18:08 | increase, and when using adaptive perception it should stay lower than that; |
---|
0:18:16 | and similarly in the case of symbol grounding. |
---|
0:18:18 | The linear or exponential shape here is just to illustrate the trend. |
---|
0:18:24 | In our results, we find that this is basically the |
---|
0:18:28 | learning characteristic, just as we expected. |
---|
0:18:31 | In the second one, the blue curve shows |
---|
0:18:36 | the time required to perceive the world as the number of objects changes from fifteen |
---|
0:18:40 | to twenty, which visibly rises, while here it is roughly independent of the object count. And we see the |
---|
0:18:47 | same for the symbol grounding runtime. |
---|
0:18:53 | To summarize, |
---|
0:18:54 | this table shows the average perception runtime over all the instructions when we |
---|
0:18:59 | use complete exhaustive modeling of the world versus adaptive perception. You see |
---|
0:19:03 | a good decrease in the perception runtime here, and |
---|
0:19:07 | similarly for the symbol grounding runtime. |
---|
0:19:11 | The point to note is that the symbol grounding accuracy is fairly the same in |
---|
0:19:15 | both the cases. |
---|
0:19:19 | Coming back to the hypotheses: we had these two hypotheses, which we verified through the |
---|
0:19:23 | experiments. |
---|
0:19:26 | In conclusion: |
---|
0:19:29 | real-time interaction is important |
---|
0:19:31 | for physically situated dialogue with a robot, |
---|
0:19:34 | and the problem is that exhaustive modeling of cluttered environments is a perception |
---|
0:19:39 | bottleneck in such cases. |
---|
0:19:41 | So we propose |
---|
0:19:42 | a language perception model |
---|
0:19:48 | that takes an instruction, infers the perception constraints, and configures the perception pipeline of the |
---|
0:19:53 | robot to give optimal world models, which in turn |
---|
0:19:57 | speed up the symbol grounding |
---|
0:19:58 | process, and we verified that through the experiments. |
---|
0:20:03 | Thank you. |
---|
0:20:19 | This is really great. |
---|
0:20:23 | So, in relation to |
---|
0:20:26 | the efficiency optimization you have in mind: |
---|
0:20:31 | your language interpreter is a parser? |
---|
0:20:34 | So your language interpreter is the parser; |
---|
0:20:38 | I mean, how does it handle examples? |
---|
0:20:44 | Are you asking whether |
---|
0:20:48 | it has to run in real time, incrementally, or whether it |
---|
0:20:52 | just waits until the end of the utterance and then parses the whole thing? |
---|
0:20:55 | So, parsing is not the main contribution of this work; we just use the parsed |
---|
0:21:00 | instructions. |
---|
0:21:01 | So the NLU model is what interprets the instructions. Does it |
---|
0:21:08 | interpret the instructions word by word, or does it wait for the end of the instruction? |
---|
0:21:15 | I ask because I think we might see further |
---|
0:21:20 | efficiency gains if you interpret the utterance word by word; there is evidence from the |
---|
0:21:25 | visual world paradigm that humans do that, as seen in their eye movements while |
---|
0:21:31 | listening, and |
---|
0:21:34 | it could speed up the process. |
---|
0:21:40 | But this work |
---|
0:21:44 | interprets the instruction after it is received by the NLU, and it does so phrase by phrase. So |
---|
0:21:50 | the interpretation involves the lexical leaf phrases and the root phrase, as in the case of "pick |
---|
0:21:55 | up the |
---|
0:21:56 | blue ball." |
---|
0:21:57 | The interpretation at the root phrase is a function of its child phrases. |
---|
0:22:02 | So, say, to pick up a blue ball, you |
---|
0:22:05 | need not know the six-degrees-of-freedom pose of the ball, because it is a |
---|
0:22:09 | symmetric object. |
---|
0:22:11 | In that case it will reason that you do not need the six-degrees-of-freedom pose |
---|
0:22:15 | estimator. |
---|
0:22:16 | As opposed to that, in the case of "pick up the blue box," you would need |
---|
0:22:21 | six-degrees-of-freedom |
---|
0:22:24 | pose estimation of the object, and so for that it |
---|
0:22:27 | will engage the six-degrees-of-freedom pose detector and reason in the context of the |
---|
0:22:30 | child phrases. |
---|
0:22:37 | Any more questions? |
---|
0:22:51 | Over the course of back-and-forth dialogue, we are going to have discussion of different objects. |
---|
0:22:56 | I noticed in your conclusion slide you have an example using the word "it": |
---|
0:23:02 | in the second instruction, "put it on the top of the red box." So I |
---|
0:23:04 | was wondering how you are currently handling dialogue history, like the previous utterances, and how |
---|
0:23:09 | you might track |
---|
0:23:13 | longer histories in the future. |
---|
0:23:15 | In this work we are not tracking the dialogue history; it is basically the first, monologue |
---|
0:23:19 | part of the dialogue. |
---|
0:23:21 | It is something that is supposed to speed up the entire dialogue by speeding up |
---|
0:23:24 | the perception, |
---|
0:23:26 | but |
---|
0:23:27 | we are not currently modeling what "it" means in the context. |
---|
0:23:32 | Any other questions? |
---|
0:23:38 | Okay, about the estimation: |
---|
0:23:46 | it is a special case where, for the detectors in the perception pipeline, |
---|
0:23:51 | the time required for the detections was also a function of the size of the object. |
---|
0:23:55 | So in this specific case there were lots of objects, but they were small |
---|
0:23:59 | in size, unlike the other ones; |
---|
0:24:01 | this applies specifically to the geometry detectors, because they depend on the point cloud: |
---|
0:24:04 | with a larger point cloud, they need to reason about more points, |
---|
0:24:13 | but still the time required to do the perception was lower here. |
---|