0:00:15 | So I am going to present my work on the topic of language-guided adaptive |
---|
0:00:21 | perception |
---|
0:00:22 | for efficient grounded communication |
---|
0:00:25 | with robotic manipulators in cluttered environments. It is a bit of a mouthful; I hope you |
---|
0:00:30 | will understand it by the end of the presentation. |
---|
0:00:33 | But what this is about |
---|
0:00:37 | is language understanding in physically situated settings, an |
---|
0:00:43 | interesting problem in robotics. |
---|
0:00:45 | Okay. The ability to |
---|
0:00:48 | interact with collaborative robots using natural language, and then |
---|
0:00:52 | to perceive, plan, and establish |
---|
0:00:55 | common ground, |
---|
0:00:57 | is critical |
---|
0:00:58 | for effective human-robot interaction. Let us look at an example. |
---|
0:01:02 | The user says, |
---|
0:01:04 | "Pick up the leftmost cube." |
---|
0:01:06 | Then the robot |
---|
0:01:08 | perceives the scene and grounds a specific object, |
---|
0:01:12 | and once that is done, the |
---|
0:01:13 | user continues and says, "Put it on top of the |
---|
0:01:16 | red box," and so on. |
---|
0:01:19 | So a few things to note about this: there is diversity in the language, |
---|
0:01:27 | diversity in terms of the instructions that the user can give |
---|
0:01:30 | to the robot and the way in which the instructions are said. |
---|
0:01:35 | There are challenges because the environments are unstructured; they could be cluttered, |
---|
0:01:40 | like the one shown here. |
---|
0:01:43 | And you need real-time interaction with the robot, |
---|
0:01:47 | but perception takes time. |
---|
0:01:50 | So that is what this work is specifically about: how to efficiently perceive environments |
---|
0:01:56 | for fast and accurate grounding of a variety of natural language |
---|
0:02:01 | instructions, demonstrated in the context of robotic manipulation. |
---|
0:02:07 | Let me give you some background |
---|
0:02:09 | on perception and representations. What perception usually refers to is: you have |
---|
0:02:15 | sensor measurements that come from the robot's sensors, you have some perception pipeline, and the |
---|
0:02:20 | perception pipeline compresses these high-dimensional sensor measurements and gives you a representation |
---|
0:02:26 | of the environment, something called a world model. |
---|
0:02:30 | To give you an example of visual perception: |
---|
0:02:34 | you can feed in a sequence of RGBD images, |
---|
0:02:37 | and what you get out of it is some representation of the world. |
---|
0:02:41 | The representation varies based on the application. For example, here it is just a point cloud |
---|
0:02:47 | representation; it can be a |
---|
0:02:49 | 3D voxelized map, you can have an occupancy grid or a semantic map, |
---|
0:02:54 | or, if you want to |
---|
0:02:56 | pick a specific object, you may want to model the pose, the six-degrees-of-freedom |
---|
0:03:01 | pose of those objects, and you can get something like that. |
---|
0:03:04 | Going even further, you can have articulation modeling of the components of an object. |
---|
0:03:10 | So the point to note is that the representations vary based on the |
---|
0:03:15 | application. |
---|
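As an aside, the progression of representation types just described can be sketched in code. This is purely my illustration (the type names are mine, not from the talk): each step down the list adds detail, and every added field is something the perception pipeline must spend extra computation to estimate.

```python
# Illustrative sketch only: world-model representations at increasing
# levels of detail. Richer representations cost more compute to build.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class BoundingBox:
    """Simplest representation: just where the object is."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

@dataclass
class SemanticObject:
    """Adds a semantic label, e.g. 'mustard bottle'."""
    box: BoundingBox
    label: str

@dataclass
class PosedObject:
    """Adds a full six-degree-of-freedom pose (x, y, z, roll, pitch, yaw)."""
    obj: SemanticObject
    pose: Tuple[float, float, float, float, float, float]

@dataclass
class ArticulatedObject:
    """Decomposes the object into parts, e.g. a bottle into lid and body."""
    parts: List[PosedObject] = field(default_factory=list)
```

The hierarchy makes the talk's point concrete: a planner that only needs to pick something up may get away with a `SemanticObject`, while opening a lid requires an `ArticulatedObject`.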
0:03:18 | One more point to note is that as we move from |
---|
0:03:23 | simple representations to more detailed representations, |
---|
0:03:25 | here it is just the bounding box representation of the object, then you have |
---|
0:03:29 | semantics, so you know that it is a mustard bottle, then you know its six-degrees-of-freedom |
---|
0:03:33 | pose, and going further you have decomposed the bottle into a lid and a body, |
---|
0:03:39 | the more detailed the representations you have, |
---|
0:03:43 | the more computation they cost the robot to build and use. |
---|
0:03:46 | So |
---|
0:03:51 | highly detailed models allow reasoning and planning for a wide variety of complex |
---|
0:03:55 | tasks, but that leads us to the problem: |
---|
0:03:58 | building and maintaining such exhaustively detailed world models |
---|
0:04:02 | of cluttered environments is computationally expensive, and it inhibits real-time interaction |
---|
0:04:07 | and dialogue with a collaborative robot. |
---|
0:04:09 | So one common approach is to have task-specific representations, where you know beforehand |
---|
0:04:15 | what the robot is supposed to do, and |
---|
0:04:17 | you hardcode the perception pipeline accordingly. But how to best represent environments to |
---|
0:04:24 | facilitate planning, |
---|
0:04:28 | grounding, and reasoning for a wide variety of complex tasks is an open question. |
---|
0:04:34 | What we observe is that |
---|
0:04:35 | in the case of exhaustive modeling, if you model all the properties |
---|
0:04:41 | of all objects in the world, one problem is that some of these properties are |
---|
0:04:45 | inconsequential to interpreting the meaning of some of the instructions. |
---|
0:04:51 | In this case, modeling the difference between the lid and the body of the coffee can |
---|
0:04:55 | is irrelevant for the task of picking up the mustard bottle, and vice versa. |
---|
0:05:02 | So what we propose in our work is learning a model of language and perception, |
---|
0:05:07 | specifically to adapt the configuration of the perception pipeline, |
---|
0:05:13 | in order to infer task-optimal representations of the world that facilitate the grounding of |
---|
0:05:20 | natural language instructions. For example, |
---|
0:05:23 | this is |
---|
0:05:24 | the environment representation inferred for the task of picking up the leftmost cube, where it |
---|
0:05:30 | just segments out the cubes, and this is for the task of picking up |
---|
0:05:35 | the nearest object, where you are ignoring the blue objects and inferring only the properties needed |
---|
0:05:40 | for the task. |
---|
0:05:43 | Now, to give you some background about the models that have been used in this |
---|
0:05:47 | paper: we are not the first ones to do language understanding. Generalized Grounding |
---|
0:05:52 | Graphs is one of the models; it was developed by Tellex and colleagues, |
---|
0:05:57 | and they demonstrated |
---|
0:06:01 | its utility on the task of lifting stuff using forklift |
---|
0:06:05 | trucks. |
---|
0:06:07 | One advancement over that model was DCG; I will talk more about this |
---|
0:06:12 | model later, but it |
---|
0:06:15 | basically exploited conditional independence assumptions across the constituents |
---|
0:06:19 | of language and the semantic constituents to infer high-level motion planning constraints |
---|
0:06:25 | given an instruction. |
---|
0:06:27 | There is one more model that was used to infer abstract visual concepts, for example to |
---|
0:06:33 | learn what it means to be the middle block in the row of five blocks on the |
---|
0:06:38 | right. |
---|
0:06:39 | So all of these language models |
---|
0:06:44 | assume some fixed, flat representation of the world. |
---|
0:06:50 | There has also been work at the intersection of perception and language understanding |
---|
0:06:55 | that talks about how we can leverage language to |
---|
0:07:00 | augment perception. |
---|
0:07:01 | In this case, language was used to add semantic labels to |
---|
0:07:08 | regions in the map, in addition to the occupancy grid representation. |
---|
0:07:14 | In this work, language was used to |
---|
0:07:18 | aid the process of fitting kinematic models. |
---|
0:07:23 | And another piece of work applies it when, for an |
---|
0:07:28 | instruction like "go to the hydrant behind the cone," the robot cannot see what is behind the |
---|
0:07:31 | cone, and so |
---|
0:07:33 | the instruction is used to hypothesize models with which the representation can be augmented. |
---|
0:07:40 | These models augment the representation, but they do not consider how to |
---|
0:07:45 | efficiently convert raw observations into representations that can |
---|
0:07:50 | speed up the grounding process. The work most closely related to ours is by |
---|
0:07:57 | Matuszek and colleagues, |
---|
0:07:58 | who use a joint language-perception model to select a subset of objects |
---|
0:08:03 | based on some color and geometric properties, and there is work by Hu and colleagues |
---|
0:08:10 | on segmentation from natural language expressions, where you have an RGB image and |
---|
0:08:16 | a given instruction, and the model outputs the corresponding segments. |
---|
0:08:19 | What is different in our work is that we are expanding the breadth |
---|
0:08:22 | and complexity of the perceptual classifiers used in the work, |
---|
0:08:27 | we work with RGBD data, |
---|
0:08:30 | and we present an approach to adapt the configuration of the perception pipeline in order to |
---|
0:08:36 | infer task-specific representations. Moving on to the technical approach, we pose the general |
---|
0:08:43 | high-level language understanding problem as |
---|
0:08:47 | finding the most likely trajectory given some natural language expression and some observations, |
---|
0:08:53 | which could be a sequence of RGBD images; |
---|
0:08:58 | in our case it is just a single RGBD frame. |
---|
0:09:02 | Solving this inference directly is computationally expensive, as the space of |
---|
0:09:05 | trajectories is quite large |
---|
0:09:08 | for complicated environments and robots. |
---|
0:09:12 | So, similar to contemporary techniques, we propose to |
---|
0:09:16 | restructure this problem as a symbol grounding problem, |
---|
0:09:19 | where we infer a distribution over symbols |
---|
0:09:23 | given the language and a world model. So we are moving from high-dimensional |
---|
0:09:29 | sensor |
---|
0:09:31 | measurements to |
---|
0:09:32 | a structured representation of the world, the world model, |
---|
0:09:36 | which is a function of the perception pipeline of the robot. |
---|
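Written out (my reconstruction, using notation common to the DCG literature; the exact symbols on the slides may differ), the restructuring looks like this:

```latex
% Original problem: the most likely trajectory x_{1..T} given the
% instruction \Lambda and the observations z_{1..T}:
x_{1..T}^{*} = \arg\max_{x_{1..T}} \; p(x_{1..T} \mid \Lambda, z_{1..T})

% Restructured as symbol grounding: infer the most likely symbols \Gamma
% given the language and a world model \Upsilon, where \Upsilon is produced
% by the robot's perception pipeline from the observations:
\Gamma^{*} = \arg\max_{\Gamma} \; p(\Gamma \mid \Lambda, \Upsilon)
```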
0:09:41 | So what does this |
---|
0:09:44 | symbol space exactly consist of? |
---|
0:09:50 | In the DCG model, the symbol space basically consists of the objects in the |
---|
0:09:55 | world, the properties which are perceived, the regions that are found in the world model, |
---|
0:09:59 | spatial relations, and action symbols. |
---|
0:10:03 | So the symbols span a discrete space of interpretations in which |
---|
0:10:07 | an instruction will be understood. |
---|
0:10:12 | We are specifically using DCG in our work. DCG is a probabilistic graphical model; |
---|
0:10:17 | it builds a factor graph for the parsed instruction. On this axis |
---|
0:10:22 | are the phrases, the linguistic components, and on the vertical axis are the constituents |
---|
0:10:27 | of the symbol space. This is an example of one of the factors: |
---|
0:10:31 | it links a linguistic |
---|
0:10:33 | phrase to one of the symbols, |
---|
0:10:35 | which could represent objects or regions of the world, and connects them with a |
---|
0:10:39 | correspondence variable. |
---|
0:10:42 | What DCG does is |
---|
0:10:44 | try to find |
---|
0:10:47 | the most likely set of correspondence variables in the context of the grounding language, the |
---|
0:10:54 | child correspondence variables, and the world model, and it does this by maximizing the product |
---|
0:11:00 | of the individual factors across the linguistic components and symbol constituents. |
---|
0:11:09 | These factors are estimated with log-linear models |
---|
0:11:11 | in DCG. |
---|
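To make the inference concrete, here is a toy sketch of that factorized maximization (entirely my own illustrative code, not the authors' implementation): each log-linear factor scores a binary correspondence variable linking one phrase to one symbol, and the conditional independence assumption lets each factor be maximized separately.

```python
# Toy sketch of DCG-style factorized inference. Each factor is a
# log-linear model over a binary correspondence variable; independence
# across (phrase, symbol) pairs turns the global argmax into many
# small local argmaxes.
import math

def log_linear_factor(weights, features):
    """Unnormalized log-linear score: exp(w . f)."""
    return math.exp(sum(w * f for w, f in zip(weights, features)))

def infer_correspondences(phrases, symbols, featurize, weights):
    """Pick the correspondence value (True/False) that maximizes each factor."""
    best = {}
    for phrase in phrases:
        for symbol in symbols:
            scores = {
                c: log_linear_factor(weights, featurize(phrase, symbol, c))
                for c in (True, False)
            }
            best[(phrase, symbol)] = max(scores, key=scores.get)
    return best

# Toy feature: fires positively when the phrase text appearing in the symbol
# name agrees with the correspondence variable.
def toy_featurize(phrase, symbol, c):
    return [1.0 if (phrase in symbol) == c else -1.0]
```

With `toy_featurize`, the phrase "ball" corresponds to the symbol `red_ball` and not to `blue_cube`; a trained model replaces the toy feature with learned weights over many features.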
0:11:12 | A problem here is that the runtime of DCG |
---|
0:11:18 | is directly proportional to the world model fidelity, |
---|
0:11:22 | and this is because the size of the symbol space increases as the number of |
---|
0:11:26 | objects |
---|
0:11:27 | in the world increases. |
---|
0:11:28 | What we observe is that some objects, and the symbols modeled based on those |
---|
0:11:32 | objects, are inconsequential to interpreting the meaning of the instruction. So we can |
---|
0:11:38 | hypothesize that there exists an optimal world model that expresses the |
---|
0:11:43 | necessary and sufficient information to solve this problem. |
---|
0:11:47 | So we go from the previous equation to this one, with the star notation, and |
---|
0:11:53 | we hypothesize that the runtime to solve this equation will be lower than |
---|
0:11:57 | before. |
---|
0:11:59 | What we propose is using language as a means to guide the process of |
---|
0:12:05 | generating these optimal world models. So we make the world model a function of the perception |
---|
0:12:09 | pipeline, observations, and language. |
---|
0:12:13 | Now we have added this NLU part, |
---|
0:12:16 | which takes in language and yields some constraints on the perception based on the task, |
---|
0:12:20 | and perception gives an optimal world model back, on which the grounding model reasons. |
---|
0:12:27 | To achieve this, we define a new symbol space that is |
---|
0:12:31 | specific only to perception, the perception symbol space. |
---|
0:12:33 | What it basically consists of is different color detectors, geometry detectors, pose detectors, |
---|
0:12:39 | semantic object detectors, and so on. |
---|
0:12:42 | These need not be just the detectors I listed; it could be, |
---|
0:12:46 | say, a detector to infer the likelihood of an object having some property, |
---|
0:12:51 | something like that. |
---|
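A minimal sketch of what such a perception symbol space might look like in code (the detector names here are my own examples, not the authors'): each symbol is a switch for one detector, and the adapted pipeline runs only the switched-on subset.

```python
# Illustrative perception symbol space: each entry names one detector the
# pipeline can run. The NLU infers a subset; everything else is skipped.
PERCEPTION_SYMBOLS = {
    "color:red", "color:blue",
    "geometry:sphere", "geometry:cube",
    "label:mustard_bottle",
    "pose:6dof",
    "region:table",
}

def select_detectors(constraints):
    """Validate and return the detectors the instruction actually needs."""
    unknown = constraints - PERCEPTION_SYMBOLS
    if unknown:
        raise ValueError(f"no detector registered for: {sorted(unknown)}")
    return constraints
```

For "pick up the leftmost cube", for instance, the inferred constraints might be just `{"geometry:cube"}`, so the color, label, and pose detectors never run.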
0:12:53 | So we use these detectors, and we adapt the model to infer these perception |
---|
0:13:00 | symbols, modifying this equation. |
---|
0:13:02 | We no longer have the world model in this equation, |
---|
0:13:05 | and we are reasoning in the perception symbol space. |
---|
0:13:08 | To give you some details: the symbolic representation that we use |
---|
0:13:14 | is made up of two |
---|
0:13:15 | different sets of symbols, independent symbols and conditionally dependent symbols. Independent perceptual symbols |
---|
0:13:22 | are basically the individual detectors that exist in the perception pipeline, |
---|
0:13:28 | like a cube detector, a red color detector, and so on; |
---|
0:13:33 | they form the set of all unconditioned detectors. We also recognize that to |
---|
0:13:38 | interpret some complex phrases, such as "pick up the red ball," you would need |
---|
0:13:43 | conditionally dependent symbols, |
---|
0:13:45 | which have some conditional dependence built in, so that we just run the |
---|
0:13:51 | sphere detector, in this case, on objects which are red, and thereby |
---|
0:13:55 | get a faster interpretation. |
---|
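The conditionally dependent case can be sketched like this (my own toy code; the real detectors operate on point clouds, not dictionaries): the cheap color check gates the expensive shape check, so the sphere detector only ever sees the red subset.

```python
# Toy sketch of a conditionally dependent perceptual symbol: run the
# expensive detector only on objects accepted by the cheap one.
def detect_red(obj):
    return obj.get("color") == "red"

def detect_sphere(obj):
    # Stands in for an expensive geometric fit over the object's point cloud.
    return obj.get("shape") == "sphere"

def ground_red_ball(objects):
    red_objects = [o for o in objects if detect_red(o)]   # cheap filter first
    return [o for o in red_objects if detect_sphere(o)]   # costly check on fewer objects
```

This gating is exactly why conditionally dependent symbols yield faster interpretation in clutter: the cost of the expensive detector scales with the filtered subset, not the whole scene.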
0:13:59 | Going forward |
---|
0:14:01 | to the experiments, this is the system architecture. We have an RGBD |
---|
0:14:04 | sensor that feeds into the adaptive perception module. |
---|
0:14:08 | We have a parser that takes the instruction, parses it, and feeds it to two |
---|
0:14:11 | NLU models. The first one is for inferring the language perception constraints; we call this the |
---|
0:14:17 | language perception model. |
---|
0:14:19 | The second one is |
---|
0:14:22 | the NLU used for symbol grounding. |
---|
0:14:27 | The first one takes in the language and gives you the perception constraints, that is, the detectors |
---|
0:14:32 | in the pipeline suitable for the task. Then adaptive perception takes the |
---|
0:14:37 | observation and the constraints and gives you an optimal world model, on which |
---|
0:14:41 | symbol grounding infers high-level motion planning constraints that go to the motion planner. |
---|
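The data flow just described can be summarized in a few lines (function names are mine; each stage is passed in as a callable so the sketch stays self-contained):

```python
# End-to-end sketch of the described architecture: parser -> NLU #1
# (language-perception constraints) -> adaptive perception -> NLU #2
# (symbol grounding) -> motion-planning constraints.
def understand(instruction, observation, parser, lpn, perception, grounder):
    parse = parser(instruction)
    constraints = lpn(parse)                            # which detectors to enable
    world_model = perception(observation, constraints)  # compact, task-optimal model
    return grounder(parse, world_model)                 # motion planning constraints
```

Passing the stages as callables mirrors the modularity of the architecture: the baseline in the comparative study is obtained simply by replacing `lpn` with a constant that enables every detector.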
0:14:46 | alright |
---|
0:14:48 | We do a comparative study in which we compare our proposed model with a |
---|
0:14:52 | baseline; the only difference is that the LPN block is missing in |
---|
0:14:56 | that architecture. We time the different processes: the time required to infer the constraints, the |
---|
0:15:02 | time required for adaptive perception, versus the time required to |
---|
0:15:07 | run complete perception, where we use all the detectors, all of the models, all |
---|
0:15:12 | of the modules in the perception pipeline, |
---|
0:15:16 | and correspondingly the symbol grounding time. |
---|
0:15:20 | There are a few assumptions in the experiments. |
---|
0:15:24 | For the environment, we have a Baxter robot, and we vary the |
---|
0:15:28 | arrangement over a wide range to get different world configurations, so we have |
---|
0:15:37 | many different world arrangements in our work, and the number of objects in the clutter |
---|
0:15:43 | varies from fifteen to twenty. |
---|
0:15:46 | This is the architecture of the perception pipeline. It has different components, |
---|
0:15:51 | like color detectors, geometry detectors, label detectors, different types of |
---|
0:15:56 | pose detectors, region detectors, et cetera. |
---|
0:16:02 | For all of those detectors we have independent symbols and conditionally dependent symbols, where |
---|
0:16:08 | the latter is a set of symbols which depend on, say, geometry and color labels, |
---|
0:16:16 | where the model basically chooses |
---|
0:16:19 | the expression of the symbol, say, engaging a geometry detector of a specific type conditioned on |
---|
0:16:24 | a color detector of a specific type. |
---|
0:16:29 | The symbolic representation for the symbol grounding model basically consists of several different |
---|
0:16:34 | things, |
---|
0:16:34 | which are objects in the world, labels, colors, geometries, regions in the world, et cetera. |
---|
0:16:41 | The corpus consists of syntactically parsed instructions, about a |
---|
0:16:47 | hundred instructions, annotated twice: once with the |
---|
0:16:53 | perception symbols and once with the grounding symbols. |
---|
0:16:57 | The linguistic patterns that we followed were |
---|
0:17:02 | inspired by the work done in this earlier paper, |
---|
0:17:05 | which collected data using Amazon Mechanical Turk, so we use similar linguistic patterns. |
---|
0:17:11 | In our experiments we have two hypotheses. |
---|
0:17:14 | The first one is that adaptively inferring task-optimal representations will |
---|
0:17:19 | reduce the perception runtime compared to exhaustively detailed uniform modeling of the world, |
---|
0:17:25 | and the second hypothesis is that reasoning in the context of these compact |
---|
0:17:29 | representations will reduce the symbol grounding time as well. |
---|
0:17:33 | We have three experiments. The first is just the learning characteristics |
---|
0:17:39 | of the NLU model: we observe, as the training fraction increases, what happens to accuracy. The second |
---|
0:17:44 | one is more interesting: how does adaptive perception impact the perception runtime? And |
---|
0:17:50 | the third is: how does it impact the symbol grounding runtime? |
---|
0:17:53 | We hypothesize that |
---|
0:17:56 | as the training fraction increases, the accuracy of inference should increase; second, |
---|
0:18:03 | as the number of objects increases, if you are using complete perception the runtime should |
---|
0:18:08 | increase, and when using adaptive perception it should stay lower than that; |
---|
0:18:16 | and similarly in the case of symbol grounding. |
---|
0:18:18 | The linear or exponential shape here is just to illustrate the trend. |
---|
0:18:24 | In our results, we find that this is basically the |
---|
0:18:28 | learning characteristic, just as we expected. |
---|
0:18:31 | In the second one, the blue curve shows |
---|
0:18:36 | the time required to perceive the world as the number of objects changes from fifteen |
---|
0:18:40 | to twenty, which visibly rises, while here it is roughly independent of the object count. And we see the |
---|
0:18:47 | same for the symbol grounding runtime. |
---|
0:18:53 | To summarize, |
---|
0:18:54 | this table shows the average perception runtime over all the instructions when we |
---|
0:18:59 | use complete exhaustive modeling of the world versus adaptive perception. You see |
---|
0:19:03 | a good decrease in the perception runtime here, and |
---|
0:19:07 | similarly for the symbol grounding runtime. |
---|
0:19:11 | The point to note is that the symbol grounding accuracy is fairly the same in |
---|
0:19:15 | both the cases. |
---|
0:19:19 | Coming back to the hypotheses: we had these two hypotheses, which we verified through the |
---|
0:19:23 | experiments. |
---|
0:19:26 | In conclusion: |
---|
0:19:29 | real-time interaction is important |
---|
0:19:31 | for physically situated dialogue with a robot, |
---|
0:19:34 | and the problem is that exhaustive modeling of cluttered environments is a perception |
---|
0:19:39 | bottleneck in such cases. |
---|
0:19:41 | So we propose |
---|
0:19:42 | a language perception model |
---|
0:19:48 | that takes an instruction, infers the perception constraints, and configures the perception pipeline of the |
---|
0:19:53 | robot to give optimal world models, which in turn |
---|
0:19:57 | speed up the symbol grounding |
---|
0:19:58 | process, and we verified that through the experiments. |
---|
0:20:03 | Thank you. |
---|
0:20:19 | This is really great. |
---|
0:20:23 | So, in relation to |
---|
0:20:26 | the efficiency optimization you have in mind: |
---|
0:20:31 | your language interpreter is a parser? |
---|
0:20:34 | So your language interpreter is the parser; |
---|
0:20:38 | I mean, how does it handle examples? |
---|
0:20:44 | Are you asking whether |
---|
0:20:48 | it has to run in real time, incrementally, or whether it |
---|
0:20:52 | just waits until the end of the utterance and then parses the whole thing? |
---|
0:20:55 | So, parsing is not the main contribution of this work; we just use the parsed |
---|
0:21:00 | instructions. |
---|
0:21:01 | So the NLU model is what interprets the instructions. Does it |
---|
0:21:08 | interpret the instructions word by word, or does it wait for the end of the instruction? |
---|
0:21:15 | I ask because I think we might see further |
---|
0:21:20 | efficiency gains if you interpret the utterance word by word; there is evidence from the |
---|
0:21:25 | visual world paradigm that humans do that, as seen in their eye movements while |
---|
0:21:31 | listening, and |
---|
0:21:34 | it could speed up the process. |
---|
0:21:40 | But this work |
---|
0:21:44 | interprets the instruction after it is received by the NLU, and it does so phrase by phrase. So |
---|
0:21:50 | the interpretation involves the lexical leaf phrases and the root phrase, as in the case of "pick |
---|
0:21:55 | up the |
---|
0:21:56 | blue ball." |
---|
0:21:57 | The interpretation at the root phrase is a function of its child phrases. |
---|
0:22:02 | So, say, to pick up a blue ball, you |
---|
0:22:05 | need not know the six-degrees-of-freedom pose of the ball, because it is a |
---|
0:22:09 | symmetric object. |
---|
0:22:11 | In that case it will reason that you do not need the six-degrees-of-freedom pose |
---|
0:22:15 | estimator. |
---|
0:22:16 | As opposed to that, in the case of "pick up the blue box," you would need |
---|
0:22:21 | six-degrees-of-freedom |
---|
0:22:24 | pose estimation of the object, and so for that it |
---|
0:22:27 | will engage the six-degrees-of-freedom pose detector and reason in the context of the |
---|
0:22:30 | child phrases. |
---|
0:22:37 | Any more questions? |
---|
0:22:51 | Over the course of back-and-forth dialogue, we are going to have discussion of different objects. |
---|
0:22:56 | I noticed in your conclusion slide you have an example using the word "it": |
---|
0:23:02 | in the second instruction, "put it on the top of the red box." So I |
---|
0:23:04 | was wondering how you are currently handling dialogue history, like the previous utterances, and how |
---|
0:23:09 | you might track |
---|
0:23:13 | longer histories in the future. |
---|
0:23:15 | In this work we are not tracking the dialogue history; it is basically the first, monologue |
---|
0:23:19 | part of the dialogue. |
---|
0:23:21 | It is something that is supposed to speed up the entire dialogue by speeding up |
---|
0:23:24 | the perception, |
---|
0:23:26 | but |
---|
0:23:27 | we are not currently modeling what "it" means in the context. |
---|
0:23:32 | Any other questions? |
---|
0:23:38 | Okay, about the estimation: |
---|
0:23:46 | it is a special case where, for the detectors in the perception pipeline, |
---|
0:23:51 | the time required for the detections was also a function of the size of the object. |
---|
0:23:55 | So in this specific case there were lots of objects, but they were small |
---|
0:23:59 | in size, unlike the other ones; |
---|
0:24:01 | this applies specifically to the geometry detectors, because they depend on the point cloud: |
---|
0:24:04 | with a larger point cloud, they need to reason about more points, |
---|
0:24:13 | but still the time required to do the perception was lower here. |
---|