So I'm going to present my work on the topic of language-guided adaptive perception for efficient grounded communication with robotic manipulators in cluttered environments. It's kind of a long title; I hope you will understand it by the end of the presentation.
So: situated language understanding, that is, language understanding in physically situated settings, is an interesting problem in robotics. The ability to interact with collaborative robots using natural language, and to perceive, plan, and establish common ground, is critical for effective human-robot interaction. Let's look at an example.
The user says, "Pick up the leftmost cube." The robot perceives the scene, grounds the phrase to a specific object, and picks it up. The user then continues, "Put it on the top of the red crate," and so on.
There are a few things to note about this. There is diversity in the language: in the instructions that a user can give to the robot, and in the way those instructions are phrased. There are challenges because the environments are unstructured; they can be cluttered, like the one shown here. And you need real-time interaction with the robot, while perception takes time. So that's what this work specifically addresses: how to efficiently perceive environments for fast and accurate grounding of a variety of natural language instructions, demonstrated in the context of robotic manipulation.
To give you some background on perception representations: perception usually refers to taking the sensor measurements that come from the robot's sensors and running them through a perception pipeline, which compresses these high-dimensional measurements and gives you a representation of the world, something called a world model. To give an example from visual perception, you can feed in a sequence of RGB-D images, and what you get out is some representation of the world. The representation varies based on the application. For example, here it's just a point cloud representation; from that you can make a 3D voxel map; you can have an occupancy grid or a semantic map; or, if you want to manipulate specific objects, you can model the six-degree-of-freedom poses of those objects. Going even further, you can model the articulation of the components of individual objects. The point to note is that the representations vary based on the application.
One more point to note: as we move from simple representations to more detailed ones (here it's just a bounding-box representation of the object; then you add semantics, so you know what the object is; then you know its six-degree-of-freedom pose; and going further you have decomposed the object into its articulated parts), the more detailed the representations you have, the more complicated the tasks the robot can perform.
So highly detailed models allow reasoning and planning for a wide variety of complex tasks. But that leads us to the problem: always inferring such exhaustively detailed world models of entire environments is computationally expensive, and it impedes real-time interaction and dialogue with a collaborative robot.
One common approach is to use task-specific representations: you know what tasks the robot is supposed to perform, and you hard-code the perception pipeline accordingly. But how to best represent environments to facilitate planning and grounding for a wide variety of complex tasks is an open question.
What we observe is that, in the case of exhaustive modeling, if you model all the properties of all objects in the world, some of those properties are inconsequential to interpreting the meaning of any one instruction. In this scene, for example, modeling the articulation between the lid and the body of one object is irrelevant for the task of picking up a different object, and vice versa.
So what we propose in our work is learning a model of language and perception, specifically to adapt the configuration of the perception pipeline at runtime, in order to infer task-optimal representations of the world that are sufficient for grounding the given language instructions. For example, this is the environment representation inferred for the task of picking up the leftmost cube, where it only segments the cubes; and this is the representation for the task of picking up the nearest red object, where it ignores the blue objects, inferring only the properties needed to interpret the instruction.
To give you some background on the models used in this paper: we are not the first to do language understanding. Generalized Grounding Graphs is one of the models, developed by Tellex and colleagues, and they demonstrated its utility on the task of lifting pallets using an autonomous forklift. One advancement over that model was the Distributed Correspondence Graph, or DCG; I'll come back to this model later. It exploited conditional independence assumptions across the constituents of language and the semantic constituents to infer high-level motion planning constraints given an instruction. There is one more model that was used to infer abstract visual concepts, for example to learn what it means to pick up the middle block in a row of five blocks.
All of these language models assume some fixed, flat representation of the world. Meanwhile, there has been work at the intersection of perception and language understanding that asks how we can leverage language to aid perception. In this case, language was used to add semantic labels to regions in the map, in addition to the occupancy grid representation.
In this work, language was also used to aid the process of inferring kinematic models of articulated objects. And in this one, another part of that work, for an instruction like "go to the hydrant behind the cone," the robot cannot see what is behind the cone, so the instruction itself can be used to augment the representation.
These models augment the representation, but they do not consider how to efficiently convert raw observations into representations that can speed up the grounding process. The work most closely related to ours is a joint language-perception model that selects a subset of objects based on their color and geometric properties, and work on segmenting objects from natural language expressions, where, given an RGB image and an instruction, the model segments the referenced objects. What is different in our work is that we expand the breadth and complexity of the perceptual classifiers used, and we work with real RGB-D data.
We present an approach to adapt the configuration of the perception pipeline in order to infer task-specific representations. Moving on to the technical approach: we pose the general language understanding problem, at a very high level, as finding the most likely trajectory given a natural language instruction and some observations. The observations could be a sequence of RGB-D frames; in our case it is a single RGB-D frame. Solving this inference directly is computationally expensive, because the space of trajectories is quite large for complicated environments and robots.
So, in line with contemporary techniques, we structure this as a symbol grounding problem: we infer a distribution over symbols given the language and a world model. We are thus moving from high-dimensional sensor measurements to a structured representation of the world, the world model, which is a function of the robot's perception pipeline.
What does the symbol space consist of, exactly? In the DCG model, the symbol space basically consists of the objects in the world, the properties which are perceived, the regions of the world model, spatial relations, and robot action symbols. These symbols span a discrete space of interpretations in which an instruction will be understood.
We specifically use DCG in our work. DCG is a probabilistic graphical model, a factor graph constructed over the parsed instruction. Along this axis are the phrases, the linguistic components, and along the vertical axis are the constituents of the symbol space. This is an example of one of the factors: it links a linguistic phrase to one of the symbols (which could represent objects, regions, and so on), and each such link is associated with a correspondence variable. What DCG does is find the most likely set of correspondence variables in the context of the grounding language, the child correspondence variables, and the world model. It does this by maximizing the product of the individual factors across the linguistic components and symbolic constituents, where the factors are likelihoods estimated with log-linear models.
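As a concrete illustration, here is a minimal, brute-force sketch of the kind of factorized inference just described. Everything in it (the symbol names, features, and weights) is invented for illustration; the real model learns log-linear weights over many features and operates on much richer symbol spaces.

```python
import itertools
import math

# Toy symbol space: candidate groundings for one phrase.
SYMBOLS = ["object_1", "object_2", "region_left"]

# Hand-picked features and weights standing in for a learned
# log-linear model: score(phrase, symbol, corr) = exp(w . features).
def features(phrase, symbol, corr):
    return [
        1.0 if ("left" in phrase and "left" in symbol and corr) else 0.0,
        1.0 if ("object" in symbol and corr) else 0.0,
        1.0 if not corr else 0.0,  # mild bias toward "not expressed"
    ]

WEIGHTS = [2.0, 0.5, 0.2]

def factor(phrase, symbol, corr):
    return math.exp(sum(w * f for w, f in zip(WEIGHTS, features(phrase, symbol, corr))))

def ground(phrase):
    # Most likely assignment of boolean correspondence variables,
    # found by maximizing the product of per-symbol factors.
    best, best_score = None, -1.0
    for assignment in itertools.product([True, False], repeat=len(SYMBOLS)):
        score = 1.0
        for symbol, corr in zip(SYMBOLS, assignment):
            score *= factor(phrase, symbol, corr)
        if score > best_score:
            best, best_score = assignment, score
    return dict(zip(SYMBOLS, best))

print(ground("the leftmost object"))
```

Note how the brute-force loop over assignments is exponential in the number of symbols, which is exactly why the size of the symbol space, and hence the number of objects in the world model, drives the runtime.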
A problem here is that the runtime of DCG is directly proportional to the fidelity of the world model: the size of the symbol space increases as the number of objects in the world increases. What we observe is that some objects, and the symbols instantiated for those objects, are inconsequential to interpreting the meaning of a given instruction. So we hypothesize that there exists an optimal world model that expresses the necessary and sufficient information to solve this problem. We go from the previous equation, conditioned on the exhaustive world model, to this one, conditioned on the optimal world model, and we hypothesize that the runtime to solve this equation will be lower.
What we propose is using the language itself to guide the process of generating these optimal world models: we make the world model a function of the perception pipeline, the observations, and the language. So now we have added an NLU component which takes in the language and yields constraints on perception based on the task; perception then produces an optimal world model, over which the grounding model directly reasons.
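In equations, the change can be written as follows. This is a paraphrase using notation loosely based on the talk, not necessarily the paper's exact symbols: let $\Lambda$ be the instruction, $z$ the observations, $\Upsilon$ the world model, and $\Gamma$ the grounding symbols.

```latex
% Baseline: ground against an exhaustively detailed world model
\Gamma^{*} = \operatorname*{arg\,max}_{\Gamma} \; p(\Gamma \mid \Lambda, \Upsilon),
\qquad \Upsilon = f_{\text{percep}}(z)

% Proposed: language first constrains perception, yielding a compact,
% task-optimal world model to ground against
\Upsilon^{*} = f_{\text{percep}}(z, \Lambda),
\qquad \Gamma^{*} = \operatorname*{arg\,max}_{\Gamma} \; p(\Gamma \mid \Lambda, \Upsilon^{*})
```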
To achieve this, we define a new symbol space specific to perception. What it basically consists of is different color detectors, geometry detectors, pose detectors, semantic object detectors, and so on. These need not be just low-level detectors; it could be, for example, a detector that infers the likelihood of an object having some higher-level property. So we use these to infer the perception symbols, modifying the equation: we no longer have the full world model in this equation, and we reason in this perceptual symbol space.
To give you some details: the symbolic representation that we use is made up of two different sets of symbols, independent symbols and conditionally dependent symbols. Independent perceptual symbols correspond to the individual detectors that exist in the perception pipeline, like a cube detector, a red color detector, and so on; together they form the set of all such detectors. We also recognized that to interpret more complex phrases, like "pick up the red ball," you would need some conditional structure. A conditionally dependent symbol runs the sphere detector, in this case, only on objects which are red, so we get a faster interpretation.
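A minimal sketch of the distinction just made, with invented detector functions over toy object records: the conditionally dependent symbol runs the sphere detector only over objects the red detector has already accepted, rather than running every detector on every object.

```python
# Toy object records; in a real pipeline these would come from
# RGB-D segmentation.
objects = [
    {"id": 1, "color": "red", "shape": "sphere"},
    {"id": 2, "color": "blue", "shape": "sphere"},
    {"id": 3, "color": "red", "shape": "cube"},
]

# Independent perceptual symbols: one detector per property.
def detect_red(objs):
    return [o for o in objs if o["color"] == "red"]

def detect_sphere(objs):
    return [o for o in objs if o["shape"] == "sphere"]

# Conditionally dependent symbol for "red ball": the sphere detector
# runs only on the subset the red detector accepted.
def detect_red_sphere(objs):
    return detect_sphere(detect_red(objs))

print([o["id"] for o in detect_red_sphere(objects)])  # -> [1]
```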
Going forward to the experiments, this is the system architecture. We have an RGB-D sensor that feeds into the adaptive perception module. We have a parser that takes the instruction and parses it, and the parse is fed to two NLU models: the first one infers the language-conditioned perception constraints, and the second one is the NLU model used for symbol grounding. The first model takes in the language and gives you the perception constraints that configure a perception pipeline suitable for the task; adaptive perception then takes the observations and the constraints and gives you an optimal world model, over which the second model reasons. Symbol grounding then infers the high-level motion planning constraints that go to the motion planner.
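The data flow of the architecture can be sketched end to end. All function names here are invented placeholders, and the constraint inference is reduced to keyword matching, whereas the real system uses learned NLU models for both stages.

```python
# Hypothetical vocabulary mapping words to detector requests.
DETECTOR_VOCAB = {
    "red": ("color", "red"),
    "blue": ("color", "blue"),
    "ball": ("geometry", "sphere"),
    "cube": ("geometry", "box"),
}

def infer_perception_constraints(instruction):
    # Stand-in for the first NLU model: request only the detectors
    # the instruction actually needs.
    words = instruction.lower().split()
    return [DETECTOR_VOCAB[w] for w in words if w in DETECTOR_VOCAB]

def adaptive_perception(observation, constraints):
    # Stand-in for adaptive perception: run only the requested
    # detectors and return a compact world model.
    return [o for o in observation if all(o.get(k) == v for k, v in constraints)]

def ground_instruction(instruction, world_model):
    # Stand-in for the symbol grounding model: here, just return
    # the surviving candidate objects.
    return world_model

observation = [
    {"id": 1, "color": "red", "geometry": "sphere"},
    {"id": 2, "color": "blue", "geometry": "sphere"},
]
constraints = infer_perception_constraints("pick up the red ball")
world_model = adaptive_perception(observation, constraints)
print(ground_instruction("pick up the red ball", world_model))  # object 1 only
```

The point of the sketch is the ordering: constraints are inferred before perception runs, so the world model handed to grounding is already compact.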
We did a comparative study in which we compare our proposed model against a baseline whose only difference is that the language-conditioned perception block is missing from the architecture. We time the different processes: the time required to infer the constraints, the time required for perception (for the baseline, this is complete perception, using all the detectors and all of the modules in the perception pipeline), and the time required for symbol grounding.
There are a few assumptions in our experiments. For the environment, we have a Baxter robot, and we evaluate over a number of different world arrangements; the number of objects in the cluttered scenes varies from fifteen to twenty.
This is the actual structure of the perception pipeline. It has different components like color detectors, geometry detectors, label detectors, different types of pose detectors, region detectors, and so on. For all of those detectors we have the independent symbols, and then the conditionally dependent symbols, which depend jointly on geometry and color labels: the expression of such a symbol says, engage the geometry detector of a specific type conditioned on the color detector of a specific type.
The symbolic representation for the symbol grounding model basically consists of several different things: objects in the world, labels, colors, geometries, regions in the world, and so on. The corpus consists of syntactically diverse instructions, about a hundred in total, annotated once with the perception symbols and once with the grounding symbols. The linguistic patterns we followed were inspired by prior work that collected data using Amazon Mechanical Turk; we used similar linguistic patterns.
In our experiments we have two hypotheses. The first is that adaptively inferring task-optimal representations will reduce the perception runtime relative to exhaustively detailed, uniform modeling of the world. The second is that reasoning in the context of these compact representations will reduce the symbol grounding time as well. We ran three experiments. The first simply examines the learning characteristics of the model: as the training fraction increases, what happens to the accuracy. The second, more interesting one asks how adaptive perception impacts the perception runtime, and the third asks how it impacts the symbol grounding runtime. We hypothesized that as the training fraction increases, the accuracy of inference should increase; that as the number of objects increases, the runtime of complete perception grows steeply, while with adaptive perception it should stay lower; and similarly for the symbol grounding runtime. The curve shapes in the hypothesis sketch are just to illustrate the expected trend.
In our results, the learning characteristics are just as we expected. In the second plot, the blue curve shows the time required to perceive the world as the number of objects changes from fifteen to twenty when using complete perception: it rises steadily, whereas with adaptive perception the runtime is nearly independent of the number of objects. We see the same trend for the symbol grounding runtime. To summarize, this table shows the average perception runtime over all the instructions when using complete, exhaustive modeling of the world versus our adaptive approach; you see a good decrease in the perception runtime, and similarly for the symbol grounding runtime. The point to note is that the symbol grounding accuracy is fairly the same in both cases.
Coming back to our hypotheses: we had these two hypotheses, and we verified them through the experiments. In conclusion, real-time interaction is important for physically situated dialogue with a robot, and the problem is that exhaustive modeling of cluttered environments is a perception bottleneck in such cases. So we propose a language-perception model that takes an instruction, infers the perception constraints, and configures the perception pipeline of the robot to give optimal world models, which in turn speed up the symbol grounding process. We verified this through our experiments. Thank you.
Audience: This is really great. In relation to the efficiency optimization you have in mind: your language interpreter is a parser? I mean, how does it work? Do you have it run in real time, incrementally, over the whole parse, or does it wait until the end of the utterance and then parse the whole thing?
Presenter: The parser is not one of the main contributions; we just use it to parse the instructions. The NLU model is what interprets the instructions.

Audience: Does it interpret the instructions word by word, or does it wait until the end of the instruction? I ask because I think we might see further efficiency gains if you interpret the utterance word by word. There is evidence from the visual-world paradigm that humans do that; you can see it in their eye movements as they listen. So you could speed up the process.
Presenter: In this work, the NLU interprets the instruction only after it has been fully received. It does this phrase by phrase: interpretation starts at the lexical leaf phrases, like "the ball" in "pick up the ball," and the interpretation of a parent phrase is a function of its child phrases.
For example, to pick up a blue ball you need not know the six-degree-of-freedom pose of the ball, because its semantic and color properties suffice; in that case the model reasons that the six-degree-of-freedom pose estimator is not needed. As opposed to that, in the case of picking up the blue box, you would need six-degree-of-freedom pose estimation of the object, so for that instruction it engages the pose estimator, reasoning in the context of the given phrases.
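The compositional reasoning described in that answer can be sketched as follows. The per-phrase requirements here are invented stand-ins for what the learned model infers; the idea is only that a parent phrase's perception requirements are the union of its children's.

```python
# Hypothetical per-phrase perception requirements. A symmetric object
# like a ball can be grasped from coarse properties alone, while a box
# needs full 6-DoF pose; a real system would learn this mapping.
PHRASE_REQUIREMENTS = {
    "blue": {"color"},
    "ball": {"geometry"},
    "box": {"geometry", "pose_6dof"},
}

def required_detectors(parsed_phrases):
    # Requirements compose bottom-up: the parent phrase needs the
    # union of its child phrases' requirements.
    needed = set()
    for phrase in parsed_phrases:
        needed |= PHRASE_REQUIREMENTS.get(phrase, set())
    return needed

print(required_detectors(["pick up", "the", "blue", "ball"]))
print(required_detectors(["pick up", "the", "blue", "box"]))
```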
Other questions?
Audience: Over the course of back-and-forth dialogue, we're going to have discussion of different objects. I noticed in your conclusion slide you have an example using the word "it": the user says "put it on the top of the red crate." So I was wondering how you are currently handling dialogue history, like the previous utterances, and how you might track longer histories in the future.
Presenter: In this work we are not tracking the dialogue history; it's basically the first monologue part of the dialogue. The expectation is that this speeds up the entire dialogue by speeding up the perception, but we are not currently modeling what history means in this context.
Any other questions?
Audience: [Question about the pose estimation runtime; partly inaudible.]
Presenter: It's a special case where, for the detectors in our perception pipeline, the time required for detection is also a function of the size of the objects. In this specific case there were lots of objects, but they were small in size, unlike in the other scenes. This especially affects the geometry detectors, because they depend on the point cloud: if the point cloud is bigger, they need to reason about more points. That's why the time required for perception in this case is lower than you would otherwise expect.