So I'm going to present my work on the topic of language-guided adaptive perception for efficient grounded communication with robotic manipulators in cluttered environments. It's kind of a long title; I hope you will understand it by the end of the presentation.
So: situated language understanding, that is, language understanding in physically situated settings, is an interesting problem in robotics. The ability to interact with collaborative robots using natural language, and to perceive, plan, and establish common ground, is critical for effective human-robot interaction. Let's look at an example.
The user says, "Pick up the leftmost cube." The robot perceives the scene, grounds the phrase to a specific object, and picks it up. The user then continues, "Put it on the top of the red crate," and so on.
There are a few things to note about this. There is diversity in the language: in the instructions that a user can give to the robot, and in the way those instructions are phrased. There are challenges because the environments are unstructured; they can be cluttered, like the one shown here. And you need real-time interaction with the robot, while perception takes time. So that's what this work specifically addresses: how to efficiently perceive environments for fast and accurate grounding of a variety of natural language instructions, demonstrated in the context of robotic manipulation.
To give you some background on perception representations: perception usually refers to taking the sensor measurements that come from the robot's sensors and running them through a perception pipeline, which compresses these high-dimensional measurements and gives you a representation of the world, something called a world model. To give an example from visual perception, you can feed in a sequence of RGB-D images, and what you get out is some representation of the world. The representation varies based on the application. For example, here it's just a point cloud representation; from that you can make a 3D voxel map; you can have an occupancy grid or a semantic map; or, if you want to manipulate specific objects, you can model the six-degree-of-freedom poses of those objects. Going even further, you can model the articulation of the components of individual objects. The point to note is that the representations vary based on the application.
One more point to note: as we move from simple representations to more detailed ones (here it's just a bounding-box representation of the object; then you add semantics, so you know what the object is; then you know its six-degree-of-freedom pose; and going further you have decomposed the object into its articulated parts), the more detailed the representations you have, the more complicated the tasks the robot can perform.
So highly detailed models allow reasoning and planning for a wide variety of complex tasks. But that leads us to the problem: always inferring such exhaustively detailed world models of entire environments is computationally expensive, and it impedes real-time interaction and dialogue with a collaborative robot.
One common approach is to use task-specific representations: you know what tasks the robot is supposed to perform, and you hard-code the perception pipeline accordingly. But how to best represent environments to facilitate planning and grounding for a wide variety of complex tasks is an open question.
What we observe is that, in the case of exhaustive modeling, if you model all the properties of all objects in the world, some of those properties are inconsequential to interpreting the meaning of any one instruction. In this scene, for example, modeling the articulation between the lid and the body of one object is irrelevant for the task of picking up a different object, and vice versa.
So what we propose in our work is learning a model of language and perception, specifically to adapt the configuration of the perception pipeline at runtime, in order to infer task-optimal representations of the world that are sufficient for grounding the given language instructions. For example, this is the environment representation inferred for the task of picking up the leftmost cube, where it only segments the cubes; and this is the representation for the task of picking up the nearest red object, where it ignores the blue objects, inferring only the properties needed to interpret the instruction.
To give you some background on the models used in this paper: we are not the first to do language understanding. Generalized Grounding Graphs is one of the models, developed by Tellex and colleagues, and they demonstrated its utility on the task of lifting pallets using an autonomous forklift. One advancement over that model was the Distributed Correspondence Graph, or DCG; I'll come back to this model later. It exploited conditional independence assumptions across the constituents of language and the semantic constituents to infer high-level motion planning constraints given an instruction. There is one more model that was used to infer abstract visual concepts, for example to learn what it means to pick up the middle block in a row of five blocks.
All of these language models assume some fixed, flat representation of the world. Meanwhile, there has been work at the intersection of perception and language understanding that asks how we can leverage language to aid perception. In this case, language was used to add semantic labels to regions in the map, in addition to the occupancy grid representation.
In this work, language was also used to aid the process of inferring kinematic models of articulated objects. And in this one, another part of that work, for an instruction like "go to the hydrant behind the cone," the robot cannot see what is behind the cone, so the instruction itself can be used to augment the representation.
These models augment the representation, but they do not consider how to efficiently convert raw observations into representations that can speed up the grounding process. The work most closely related to ours is a joint language-perception model that selects a subset of objects based on their color and geometric properties, and work on segmenting objects from natural language expressions, where, given an RGB image and an instruction, the model segments the referenced objects. What is different in our work is that we expand the breadth and complexity of the perceptual classifiers used, and we work with real RGB-D data.
We present an approach to adapt the configuration of the perception pipeline in order to infer task-specific representations. Moving on to the technical approach: we pose the general language understanding problem, at a very high level, as finding the most likely trajectory given a natural language instruction and some observations. The observations could be a sequence of RGB-D frames; in our case it is a single RGB-D frame. Solving this inference directly is computationally expensive, because the space of trajectories is quite large for complicated environments and robots.
So, in line with contemporary techniques, we structure this as a symbol grounding problem: we infer a distribution over symbols given the language and a world model. We are thus moving from high-dimensional sensor measurements to a structured representation of the world, the world model, which is a function of the robot's perception pipeline.
What does the symbol space consist of, exactly? In the DCG model, the symbol space basically consists of the objects in the world, the properties which are perceived, the regions of the world model, spatial relations, and robot action symbols. These symbols span a discrete space of interpretations in which an instruction will be understood.
We specifically use DCG in our work. DCG is a probabilistic graphical model, a factor graph constructed over the parsed instruction. Along this axis are the phrases, the linguistic components, and along the vertical axis are the constituents of the symbol space. This is an example of one of the factors: it links a linguistic phrase to one of the symbols (which could represent objects, regions, and so on), and each such link is associated with a correspondence variable. What DCG does is find the most likely set of correspondence variables in the context of the grounding language, the child correspondence variables, and the world model. It does this by maximizing the product of the individual factors across the linguistic components and symbolic constituents, where the factors are likelihoods estimated with log-linear models.
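As a concrete illustration, here is a minimal, brute-force sketch of the kind of factorized inference just described. Everything in it (the symbol names, features, and weights) is invented for illustration; the real model learns log-linear weights over many features and operates on much richer symbol spaces.

```python
import itertools
import math

# Toy symbol space: candidate groundings for one phrase.
SYMBOLS = ["object_1", "object_2", "region_left"]

# Hand-picked features and weights standing in for a learned
# log-linear model: score(phrase, symbol, corr) = exp(w . features).
def features(phrase, symbol, corr):
    return [
        1.0 if ("left" in phrase and "left" in symbol and corr) else 0.0,
        1.0 if ("object" in symbol and corr) else 0.0,
        1.0 if not corr else 0.0,  # mild bias toward "not expressed"
    ]

WEIGHTS = [2.0, 0.5, 0.2]

def factor(phrase, symbol, corr):
    return math.exp(sum(w * f for w, f in zip(WEIGHTS, features(phrase, symbol, corr))))

def ground(phrase):
    # Most likely assignment of boolean correspondence variables,
    # found by maximizing the product of per-symbol factors.
    best, best_score = None, -1.0
    for assignment in itertools.product([True, False], repeat=len(SYMBOLS)):
        score = 1.0
        for symbol, corr in zip(SYMBOLS, assignment):
            score *= factor(phrase, symbol, corr)
        if score > best_score:
            best, best_score = assignment, score
    return dict(zip(SYMBOLS, best))

print(ground("the leftmost object"))
```

Note how the brute-force loop over assignments is exponential in the number of symbols, which is exactly why the size of the symbol space, and hence the number of objects in the world model, drives the runtime.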
A problem here is that the runtime of DCG is directly proportional to the fidelity of the world model: the size of the symbol space increases as the number of objects in the world increases. What we observe is that some objects, and the symbols instantiated for those objects, are inconsequential to interpreting the meaning of a given instruction. So we hypothesize that there exists an optimal world model that expresses the necessary and sufficient information to solve this problem. We go from the previous equation, conditioned on the exhaustive world model, to this one, conditioned on the optimal world model, and we hypothesize that the runtime to solve this equation will be lower.
What we propose is using the language itself to guide the process of generating these optimal world models: we make the world model a function of the perception pipeline, the observations, and the language. So now we have added an NLU component which takes in the language and yields constraints on perception based on the task; perception then produces an optimal world model, over which the grounding model directly reasons.
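In equations, the change can be written as follows. This is a paraphrase using notation loosely based on the talk, not necessarily the paper's exact symbols: let $\Lambda$ be the instruction, $z$ the observations, $\Upsilon$ the world model, and $\Gamma$ the grounding symbols.

```latex
% Baseline: ground against an exhaustively detailed world model
\Gamma^{*} = \operatorname*{arg\,max}_{\Gamma} \; p(\Gamma \mid \Lambda, \Upsilon),
\qquad \Upsilon = f_{\text{percep}}(z)

% Proposed: language first constrains perception, yielding a compact,
% task-optimal world model to ground against
\Upsilon^{*} = f_{\text{percep}}(z, \Lambda),
\qquad \Gamma^{*} = \operatorname*{arg\,max}_{\Gamma} \; p(\Gamma \mid \Lambda, \Upsilon^{*})
```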
To achieve this, we define a new symbol space specific to perception. What it basically consists of is different color detectors, geometry detectors, pose detectors, semantic object detectors, and so on. These need not be just low-level detectors; it could be, for example, a detector that infers the likelihood of an object having some higher-level property. So we use these to infer the perception symbols, modifying the equation: we no longer have the full world model in this equation, and we reason in this perceptual symbol space.
To give you some details: the symbolic representation that we use is made up of two different sets of symbols, independent symbols and conditionally dependent symbols. Independent perceptual symbols correspond to the individual detectors that exist in the perception pipeline, like a cube detector, a red color detector, and so on; together they form the set of all such detectors. We also recognized that to interpret more complex phrases, like "pick up the red ball," you would need some conditional structure. A conditionally dependent symbol runs the sphere detector, in this case, only on objects which are red, so we get a faster interpretation.
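A minimal sketch of the distinction just made, with invented detector functions over toy object records: the conditionally dependent symbol runs the sphere detector only over objects the red detector has already accepted, rather than running every detector on every object.

```python
# Toy object records; in a real pipeline these would come from
# RGB-D segmentation.
objects = [
    {"id": 1, "color": "red", "shape": "sphere"},
    {"id": 2, "color": "blue", "shape": "sphere"},
    {"id": 3, "color": "red", "shape": "cube"},
]

# Independent perceptual symbols: one detector per property.
def detect_red(objs):
    return [o for o in objs if o["color"] == "red"]

def detect_sphere(objs):
    return [o for o in objs if o["shape"] == "sphere"]

# Conditionally dependent symbol for "red ball": the sphere detector
# runs only on the subset the red detector accepted.
def detect_red_sphere(objs):
    return detect_sphere(detect_red(objs))

print([o["id"] for o in detect_red_sphere(objects)])  # -> [1]
```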
Going forward to the experiments, this is the system architecture. We have an RGB-D sensor that feeds into the adaptive perception module. We have a parser that takes the instruction and parses it, and the parse is fed to two NLU models: the first one infers the language-conditioned perception constraints, and the second one is the NLU model used for symbol grounding. The first model takes in the language and gives you the perception constraints that configure a perception pipeline suitable for the task; adaptive perception then takes the observations and the constraints and gives you an optimal world model, over which the second model reasons. Symbol grounding then infers the high-level motion planning constraints that go to the motion planner.
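The data flow of the architecture can be sketched end to end. All function names here are invented placeholders, and the constraint inference is reduced to keyword matching, whereas the real system uses learned NLU models for both stages.

```python
# Hypothetical vocabulary mapping words to detector requests.
DETECTOR_VOCAB = {
    "red": ("color", "red"),
    "blue": ("color", "blue"),
    "ball": ("geometry", "sphere"),
    "cube": ("geometry", "box"),
}

def infer_perception_constraints(instruction):
    # Stand-in for the first NLU model: request only the detectors
    # the instruction actually needs.
    words = instruction.lower().split()
    return [DETECTOR_VOCAB[w] for w in words if w in DETECTOR_VOCAB]

def adaptive_perception(observation, constraints):
    # Stand-in for adaptive perception: run only the requested
    # detectors and return a compact world model.
    return [o for o in observation if all(o.get(k) == v for k, v in constraints)]

def ground_instruction(instruction, world_model):
    # Stand-in for the symbol grounding model: here, just return
    # the surviving candidate objects.
    return world_model

observation = [
    {"id": 1, "color": "red", "geometry": "sphere"},
    {"id": 2, "color": "blue", "geometry": "sphere"},
]
constraints = infer_perception_constraints("pick up the red ball")
world_model = adaptive_perception(observation, constraints)
print(ground_instruction("pick up the red ball", world_model))  # object 1 only
```

The point of the sketch is the ordering: constraints are inferred before perception runs, so the world model handed to grounding is already compact.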
We did a comparative study in which we compare our proposed model against a baseline whose only difference is that the language-conditioned perception block is missing from the architecture. We time the different processes: the time required to infer the constraints, the time required for perception (for the baseline, this is complete perception, using all the detectors and all of the modules in the perception pipeline), and the time required for symbol grounding.
There are a few assumptions in our experiments. For the environment, we have a Baxter robot, and we evaluate over a number of different world arrangements; the number of objects in the cluttered scenes varies from fifteen to twenty.
This is the actual structure of the perception pipeline. It has different components like color detectors, geometry detectors, label detectors, different types of pose detectors, region detectors, and so on. For all of those detectors we have the independent symbols, and then the conditionally dependent symbols, which depend jointly on geometry and color labels: the expression of such a symbol says, engage the geometry detector of a specific type conditioned on the color detector of a specific type.
The symbolic representation for the symbol grounding model basically consists of several different things: objects in the world, labels, colors, geometries, regions in the world, and so on. The corpus consists of syntactically diverse instructions, about a hundred in total, annotated once with the perception symbols and once with the grounding symbols. The linguistic patterns we followed were inspired by prior work that collected data using Amazon Mechanical Turk; we used similar linguistic patterns.
In our experiments we have two hypotheses. The first is that adaptively inferring task-optimal representations will reduce the perception runtime relative to exhaustively detailed, uniform modeling of the world. The second is that reasoning in the context of these compact representations will reduce the symbol grounding time as well. We ran three experiments. The first simply examines the learning characteristics of the model: as the training fraction increases, what happens to the accuracy. The second, more interesting one asks how adaptive perception impacts the perception runtime, and the third asks how it impacts the symbol grounding runtime. We hypothesized that as the training fraction increases, the accuracy of inference should increase; that as the number of objects increases, the runtime of complete perception grows steeply, while with adaptive perception it should stay lower; and similarly for the symbol grounding runtime. The curve shapes in the hypothesis sketch are just to illustrate the expected trend.
In our results, the learning characteristics are just as we expected. In the second plot, the blue curve shows the time required to perceive the world as the number of objects changes from fifteen to twenty when using complete perception: it rises steadily, whereas with adaptive perception the runtime is nearly independent of the number of objects. We see the same trend for the symbol grounding runtime. To summarize, this table shows the average perception runtime over all the instructions when using complete, exhaustive modeling of the world versus our adaptive approach; you see a good decrease in the perception runtime, and similarly for the symbol grounding runtime. The point to note is that the symbol grounding accuracy is fairly the same in both cases.
Coming back to our hypotheses: we had these two hypotheses, and we verified them through the experiments. In conclusion, real-time interaction is important for physically situated dialogue with a robot, and the problem is that exhaustive modeling of cluttered environments is a perception bottleneck in such cases. So we propose a language-perception model that takes an instruction, infers the perception constraints, and configures the perception pipeline of the robot to give optimal world models, which in turn speed up the symbol grounding process. We verified this through our experiments. Thank you.
Audience: This is really great. In relation to the efficiency optimization you have in mind: your language interpreter is a parser? I mean, how does it work? Do you have it run in real time, incrementally, over the whole parse, or does it wait until the end of the utterance and then parse the whole thing?
Presenter: The parser is not one of the main contributions; we just use it to parse the instructions. The NLU model is what interprets the instructions.

Audience: Does it interpret the instructions word by word, or does it wait until the end of the instruction? I ask because I think we might see further efficiency gains if you interpret the utterance word by word. There is evidence from the visual-world paradigm that humans do that; you can see it in their eye movements as they listen. So you could speed up the process.
Presenter: In this work, the NLU interprets the instruction only after it has been fully received. It does this phrase by phrase: interpretation starts at the lexical leaf phrases, like "the ball" in "pick up the ball," and the interpretation of a parent phrase is a function of its child phrases.
For example, to pick up a blue ball you need not know the six-degree-of-freedom pose of the ball, because its semantic and color properties suffice; in that case the model reasons that the six-degree-of-freedom pose estimator is not needed. As opposed to that, in the case of picking up the blue box, you would need six-degree-of-freedom pose estimation of the object, so for that instruction it engages the pose estimator, reasoning in the context of the given phrases.
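The compositional reasoning described in that answer can be sketched as follows. The per-phrase requirements here are invented stand-ins for what the learned model infers; the idea is only that a parent phrase's perception requirements are the union of its children's.

```python
# Hypothetical per-phrase perception requirements. A symmetric object
# like a ball can be grasped from coarse properties alone, while a box
# needs full 6-DoF pose; a real system would learn this mapping.
PHRASE_REQUIREMENTS = {
    "blue": {"color"},
    "ball": {"geometry"},
    "box": {"geometry", "pose_6dof"},
}

def required_detectors(parsed_phrases):
    # Requirements compose bottom-up: the parent phrase needs the
    # union of its child phrases' requirements.
    needed = set()
    for phrase in parsed_phrases:
        needed |= PHRASE_REQUIREMENTS.get(phrase, set())
    return needed

print(required_detectors(["pick up", "the", "blue", "ball"]))
print(required_detectors(["pick up", "the", "blue", "box"]))
```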
Other questions?
Audience: Over the course of back-and-forth dialogue, we're going to have discussion of different objects. I noticed in your conclusion slide you have an example using the word "it": the user says "put it on the top of the red crate." So I was wondering how you are currently handling dialogue history, like the previous utterances, and how you might track longer histories in the future.
Presenter: In this work we are not tracking the dialogue history; it's basically the first monologue part of the dialogue. The expectation is that this speeds up the entire dialogue by speeding up the perception, but we are not currently modeling what history means in this context.
Any other questions?
Audience: [Question about the pose estimation runtime; partly inaudible.]
Presenter: It's a special case where, for the detectors in our perception pipeline, the time required for detection is also a function of the size of the objects. In this specific case there were lots of objects, but they were small in size, unlike in the other scenes. This especially affects the geometry detectors, because they depend on the point cloud: if the point cloud is bigger, they need to reason about more points. That's why the time required for perception in this case is lower than you would otherwise expect.