0:00:19 | [unintelligible]
0:00:23 | and I have great pleasure in introducing
0:00:29 | the second keynote speaker of the conference, Dan Bohus from Microsoft Research.
0:00:39 | He is a senior researcher at Microsoft Research,
0:00:45 | which is where he has been for the last twelve years, and he's going to talk to us about situated interaction.
0:00:52 | Okay, thanks a lot. Thanks, Ingrid, for the introduction and also for the invitation to talk.
0:00:57 | It's great to be back here. I think I missed the last couple of years, but this is always a special one to come back to.
0:01:05 | So the title of the talk is situated interaction, and I think it's going to dovetail pretty well with the panel discussion we had at the end of yesterday
0:01:15 | about the narrowing versus broadening of the field, and the interesting questions we might all be working on.
0:01:23 | There are basically two main points that I would like to highlight in this talk. The first one
0:01:28 | is that dialogue is really a multimodal, highly coordinated, complex affair
0:01:35 | that goes well beyond the spoken word.
0:01:38 | I don't know how many of you are familiar with the work of Ray Birdwhistell,
0:01:41 | an anthropologist who did some of the seminal work on kinesics
0:01:46 | back in the sixties,
0:01:48 | basically studying the role of body movement in communication. And in one of his books
0:01:55 | he essentially comments on how, perhaps,
0:01:59 | the problem with the early records that we have of studies of communication
0:02:03 | is that they were done by literate people.
0:02:06 | Now, all joking aside, it is the case that if you look at
0:02:10 | most of the work we do today in dialogue,
0:02:12 | it is really heavily anchored in text, in the written word, and at best in the spoken word.
0:02:19 | But in reality we do a lot of work with our bodies when we interact
0:02:23 | with each other, when we communicate with each other, and the surrounding physical context also
0:02:27 | plays a very important role in these interactions:
0:02:30 | from where we place ourselves in space relative to each other, the stance we adopt,
0:02:35 | to where our gaze goes moment by moment, to facial expressions, head nods, hand gestures,
0:02:41 | prosodic contours.
0:02:43 | All of these channels come into play when we interact with each other,
0:02:47 | and so that's the view of dialogue that I would like to highlight today.
0:02:50 | The second point that I'm going to try to make in this talk is that I think
0:02:54 | we're also at a very interesting time,
0:02:57 | when in the last decade we've seen very fast-paced advances based on deep learning
0:03:01 | in areas like vision and, in general, perception and sensing.
0:03:05 | And I think these advances are getting us to the point where we're able to
0:03:10 | start building machines that understand
0:03:12 | people in physical space, and how people move and behave in physical space.
0:03:17 | I think it's a very interesting time in that sense. Just like in the nineties
0:03:21 | advances in speech recognition broke open the field and opened up this whole area
0:03:25 | of spoken dialogue systems, with all the research that has come from that,
0:03:29 | and that today has led to these mobile assistants in our pockets, I think these
0:03:33 | advances in vision and in the perceptual technologies
0:03:36 | give us a chance to again broaden the field in this direction of
0:03:40 | physically situated dialogue and, more generally, situated interaction.
0:03:45 | So what I'm going to do in this talk is try to give you a
0:03:47 | sense of this area based on some research vignettes from our own work at MSR
0:03:52 | over the last ten years or so in this space,
0:03:56 | and hopefully I'll be able to convey to you my excitement about it, and maybe get
0:04:00 | more of you guys to look in this direction,
0:04:02 | because I think there are a lot of interesting and open problems in this space, and
0:04:07 | I think a lot of the people in this room have quite a bit to
0:04:10 | contribute to solving these problems.
0:04:13 | So finally, before I get going, before we dive in, I want to make sure I
0:04:17 | thank the collaborators that I've had over the years. I've been lucky to work with
0:04:22 | fabulous people at MSR,
0:04:25 | and have had long-term collaborations with folks like Eric Horvitz, and Sean Andrist, who is
0:04:29 | here, and also many other researchers,
0:04:33 | talented engineers, and great interns we've had
0:04:35 | over the years, and
0:04:37 | some of the work you've seen and the work we've done at MSR in
0:04:39 | this space would not have been possible without their help, so I
0:04:42 | thank them.
0:04:43 | Okay, so let's get started. Situated interaction. Well,
0:04:47 | I started working in this space shortly after I joined MSR, around
0:04:51 | two thousand and eight, and the main question that has been driving my research agenda
0:04:56 | since has been basically: how do we get computers
0:05:00 | to reason about the physical space around them,
0:05:02 | and to interact with people in this kind of open-world, physically situated setting in a
0:05:07 | fluid and seamless manner?
0:05:09 | And the general approach I've taken towards that space has been one where
0:05:13 | we build a variety of systems
0:05:15 | and we deploy them in the wild. And by deploying in the wild, what I mean in
0:05:20 | this case is placing them in some public space in our building where people would
0:05:24 | naturally encounter and interact with them without much instruction.
0:05:28 | So it's not a controlled setting; they're just deployed somewhere
0:05:31 | out there, and people just come and interact with them.
0:05:33 | Then we observe the interactions, and we let that drive what the research problems are
0:05:39 | that we want to address. We find what problems we need to solve
0:05:42 | by observing what happens in this kind of ecologically more valid setting,
0:05:47 | and try to let that
0:05:49 | give us direction. And so, to make this concrete, and to give you a sense
0:05:52 | of the variety of systems we've built, I'm going to start by showing you a few videos,
0:05:56 | and then we can go more into some of the research questions we've looked at.
0:06:02 | The first video I'm going to show is from the system that we refer to as
0:06:06 | the assistant.
0:06:07 | It's a virtual-agent-based system that's placed outside Eric's office and interacts with people
0:06:13 | that come by whenever he is not available, or maybe when he is available but busy in
0:06:17 | his office.
0:06:18 | And basically the system does some simple assistive-type tasks, like
0:06:23 | handling meetings and taking, you know, some notes to relay, and so on.
0:06:29 | It's connected to quite a wide infrastructure: it has access to Eric's calendar, but also
0:06:33 | to other machine-learned models that predict his availability: when is he going to be back in
0:06:38 | his office, you know, what's the likelihood that he will attend a particular meeting,
0:06:42 | and so on.
0:06:43 | But what I want to highlight with this video is not so much that part
0:06:46 | as much as
0:06:47 | the multiparty dialogue and interaction capabilities. Here the system has a camera, a wide-angle
0:06:53 | camera at the top, and a microphone array, and
0:06:56 | it's able to basically reason about multiple people,
0:06:59 | and understand who it engages with, and have a dialogue in this kind of multiparty
0:07:04 | setting,
0:07:05 | based on the roles and needs that these people have.
0:07:16 | Are you here for the five o'clock meeting with Eric?
0:07:20 | I'm sorry, you're here for Eric's meeting? Could you please [unintelligible]
0:07:26 | Eric is not here. Let me look at his calendar to see
0:07:30 | if I can tell when he's returning. Let's see: he should be meeting in his
0:07:34 | office now, but I'm actually not sure; he will probably be back in about fifteen minutes.
0:07:40 | Would you like to wait, or maybe come back later?
0:07:46 | You could also try sending him an email message. I'm expecting Eric to be back
0:07:51 | in about a minute. [unintelligible]
0:08:05 | Great.
0:08:08 | Thanks.
0:08:10 | So over the years we built a variety of these systems based on virtual
0:08:14 | agents. This is a prototype receptionist aiming to do shuttle reservations on campus,
0:08:18 | for people moving from one building to another. When you go into the lobby, you
0:08:21 | can say I'm going to this building, and get a shuttle.
0:08:24 | We built a fun trivia questions game that we deployed in a corridor, or one
0:08:29 | of our kitchens, where the system would try to engage people that go by into
0:08:33 | this questions game. It would ask you what's the longest river in the world, and you
0:08:37 | try to figure out the answer. But the interesting bit here is that the system is
0:08:40 | trying to do this, in some sense,
0:08:42 | cooperatively: it's trying to get people to reach a consensus before revealing the answer and moving
0:08:47 | to the next question.
0:08:48 | We did a lot of interesting studies on engagement there, on how do you attract a
0:08:51 | bystander. A lot of times people kind of sit back and watch from a distance what
0:08:56 | happens, so we worked on how do you attract bystanders
0:08:59 | into an interaction. So again, studying various problems related to multiparty dialogue in open-world
0:09:03 | settings.
0:09:05 | We've done work that also has nothing to do with language, so I'm using the term
0:09:08 | situated interaction
0:09:10 | purposefully,
0:09:12 | because my focus is on, my interests are in, sort of, how do we
0:09:15 | get machines to interact with people, whether there's language or not.
0:09:18 | This is an example of a system we call the third-generation elevator.
0:09:23 | What you're seeing here is a view from the top in our atrium.
0:09:28 | There's, basically, let's see if this works, the elevator doors are over there. This is
0:09:32 | a fisheye-distorted view from the top, but this is in front of the bank
0:09:35 | of elevators where people are going by.
0:09:37 | So we built a simple model that just does optical flow, and based on features
0:09:41 | from optical flow
0:09:42 | it tries to anticipate, by about three seconds, when the button will be pushed.
0:09:46 | So as you walk towards the elevator, it pushes the button for you. The idea was,
0:09:49 | let's build a Star Trek elevator. But if you just simply go by, you
0:09:53 | know, nothing happens.
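To give a flavor of what such a model involves, here is a minimal sketch in Python (an illustration only, assuming OpenCV; the exact features, region of interest, and classifier used in the deployed system are not specified in the talk):

```python
import cv2
import numpy as np

def flow_features(prev_gray, gray, roi):
    """Summarize dense optical flow over a region of interest in front of
    the elevator doors, e.g. roi = (slice(100, 300), slice(0, 200))."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx = flow[roi][..., 0]
    fy = flow[roi][..., 1]
    mag = np.hypot(fx, fy)
    # Amount of motion plus its mean direction; a classifier trained on
    # these features, with labels rolled back three seconds from actual
    # button presses, can anticipate the press.
    return np.array([mag.mean(), mag.max(), fx.mean(), fy.mean()])
```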
0:09:56 | And it's not necessarily that I think this is how elevators will work in
0:09:59 | the future, but it's our own exploration, and a nod
0:10:02 | to this idea that machines should be able to reason about and think about how
0:10:07 | people behave in physical space,
0:10:09 | and drive interesting interactions from that. And the system has been running for years in
0:10:13 | our lobby, and by now
0:10:15 | no one even notices it's there; in some sense it just works.
0:10:20 | In the last years we've also started looking in the direction of interaction with robots,
0:10:24 | so human-robot interaction, and a system that we've done a lot of research with are
0:10:29 | these directions robots. We have three of these guys; we have them deployed on each
0:10:33 | of the floors in our building, as you come up off the elevator,
0:10:36 | and they can give you directions inside the building. So you can ask for meeting
0:10:40 | rooms or various people,
0:10:41 | and they can direct you there.
0:10:47 | Conference room three hundred?
0:10:50 | Go down this hallway,
0:10:53 | turn right, and continue down the hall. When you arrive,
0:10:56 | conference room three hundred will be the first room on your right.
0:11:02 | [unintelligible]
0:11:07 | John is in office number forty-one twenty.
0:11:10 | Here,
0:11:11 | take the elevator to the fourth floor,
0:11:14 | turn right when you exit the elevator, and continue to the end of this
0:11:18 | hall.
0:11:19 | John's office will be at the very end.
0:11:25 | Okay, so hopefully this gives you guys a sense of the class of systems we've
0:11:29 | been building and working with and doing research with over the years.
0:11:33 | Now, when you try to build these things and have them actually work in the
0:11:37 | wild, in these kinds of uncontrolled settings, you quickly run into a number of
0:11:41 | problems that otherwise you might not even think of or consider.
0:11:46 | A lot of the problems with interactions, I think we as humans solve them
0:11:51 | unconsciously; this is so ingrained in us that we don't think about it.
0:11:55 | But, you know, once you try to do something with a machine and computationalize
0:11:59 | it, you run into the actual problems. So the first problem you have to solve is
0:12:03 | that of engagement: knowing
0:12:05 | who am I engaged in an interaction with, and when.
0:12:08 | Like, this is all obvious to us whenever we're in an interaction,
0:12:11 | but a machine has to reason about it. For instance, here it needs to reason that
0:12:14 | even though these two guys are looking away from it at this moment,
0:12:19 | they're actually still engaged in an interaction with the machine; they're looking away because the
0:12:24 | robot just pointed over there. And she, well, she's been looking at the machine all
0:12:28 | the time, but she's actually not engaged in this conversation. And going one step further, the
0:12:32 | robot might reason that, well, perhaps she's in a group with them and waiting for
0:12:35 | them,
0:12:36 | or perhaps she's not in a group with them but has an intention to engage with
0:12:40 | the robot once they're done. There's all this reasoning that we assume, as kind of
0:12:43 | automatic, and we don't think about it, but you have to kind of program the
0:12:47 | machine to do it.
0:12:49 | Once you can solve the problem of engagement, another problem you have to solve
0:12:52 | is that of turn-taking. And, you know, the standard dialogue model we are all familiar with and
0:12:58 | work with is one where dialogue is a flow of utterances by the system
0:13:02 | and user, and system and user. This breaks to pieces immediately once you're in a
0:13:06 | multiparty setting.
0:13:08 | You need to reason not only about when utterances are happening, but you need to reason
0:13:11 | about who's producing them,
0:13:13 | who the utterances are addressed to, and who the producer expects to talk
0:13:19 | next. So who is the next ratified speaker here?
0:13:22 | Should I as a robot inject myself at the end of this utterance that I
0:13:25 | heard, or should I wait, because someone else is going to respond?
0:13:28 | So the problem gets more complex.
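As a caricature of the kind of reasoning involved (the variables and the policy below are invented for illustration, not the deployed model), the decision the robot faces at the end of each utterance looks roughly like this:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Utterance:
    speaker: str
    addressee: Optional[str]      # inferred from gaze, pose, and content
    expected_next: Optional[str]  # whom the speaker seems to yield the floor to

def should_take_turn(u: Utterance, robot: str = "robot") -> bool:
    """Decide whether the robot should speak when utterance u ends."""
    if u.addressee == robot:
        return True               # the robot is the ratified next speaker
    if u.expected_next is not None and u.expected_next != robot:
        return False              # the floor was yielded to someone else
    # Ambiguous case: a real system would wait briefly and monitor
    # gaze and prosody before injecting itself.
    return False
```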
0:13:30 | And again, all of this
0:13:32 | we do on automatic, and it's regulated with gaze, with prosody, with
0:13:36 | how we move our bodies, and so on. And only once you can kind of
0:13:41 | deal with these two problems can you start worrying about speech recognition and decoding the
0:13:45 | signals, and understanding what is actually contained in the signals that we send to each
0:13:49 | other,
0:13:50 | and doing the high-level interaction planning and dialogue control. So in some sense we
0:13:56 | view this as
0:13:59 | almost like a minimal set of communicative competencies that you need to have to
0:14:03 | do this kind of interaction in open-world settings.
0:14:05 | And over the years our research agenda
0:14:08 | has been basically looking at various problems in these processes by trying to
0:14:16 | leverage the information we have about the situated context: the who, the what, and the
0:14:20 | why of the surroundings.
0:14:23 | So that's kind of the very high-level, kind of fuzzy, one slide about
0:14:26 | what the research has been about at MSR in the last ten years
0:14:29 | in this space. I'm going to dive in now and show you two different examples in
0:14:36 | a little bit more detail. I'm not going to go very technically deep; I'll point you to
0:14:40 | the papers, and I'm happy to talk more
0:14:42 | offline, but I want to give you a sense of what the research
0:14:46 | problems look like. I'm going to start with a problem that has to do with engagement.
0:14:52 | I've already mentioned
0:14:54 | engagement. As it has been defined in the literature, this is the process by which
0:14:57 | participants
0:14:58 | initiate, maintain, and terminate the conversations that they jointly undertake. Now, you know, in a lot
0:15:04 | of classical dialogue work, I mean in telephony applications or mobile phones and
0:15:09 | so on,
0:15:11 | this is a trivial problem to solve, right? I push a button, I know I'm engaged; or I
0:15:14 | pick up a phone call, now I'm engaged. I don't have a really big
0:15:17 | problem to solve. However, if you have a robot or a system that's embodied and situated
0:15:20 | in space, this becomes a more complex problem.
0:15:24 | And just to illustrate sort of the diversity of behaviors
0:15:28 | with respect to engagement that one might have,
0:15:32 | we sort of
0:15:34 | captured this video many years ago, at the start of this
0:15:38 | work.
0:15:38 | It's a video from the receptionist prototype, the one that was doing the shuttle
0:15:42 | reservations,
0:15:43 | and it mostly highlights how, by reasoning about
0:15:46 | three engagement variables, in particular the engagement state (am I in a conversation or not), engagement
0:15:53 | actions, which regulate the transitions between the states, and engagement intentions, which are different from the
0:15:58 | states,
0:15:59 | by reasoning about these three key variables, you can construct fairly sophisticated policies
0:16:04 | in terms of how you manage engagement in, you know, a group setting.
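A minimal sketch of these three variables (the names and values here are illustrative; the actual models track probabilities over them for each actor in the scene):

```python
from dataclasses import dataclass
from enum import Enum

class EngagementState(Enum):
    NOT_ENGAGED = "not-engaged"
    ENGAGED = "engaged"
    SUSPENDED = "suspended"       # engagement on hold, to be resumed

class EngagementAction(Enum):
    INITIATE = "initiate"         # actions regulate transitions between states
    MAINTAIN = "maintain"
    SUSPEND = "suspend"
    TERMINATE = "terminate"

@dataclass
class EngagementBelief:
    """Tracked for each actor the system sees in the scene."""
    state: EngagementState
    intends_to_engage: bool       # intention, distinct from the current state
```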
0:16:08 | So I'll play this video for you in a second. Just before I do that,
0:16:11 | to help you with the legend here and all this annotation:
0:16:14 | a yellow line
0:16:16 | below a face means this is who the system is engaged with at some point.
0:16:19 | This is the system's viewpoint; what we see is what one of these avatar heads
0:16:23 | sees, but rendered for us.
0:16:25 | A dotted line is an engagement that is currently suspended.
0:16:28 | The red dot moving around, right now it's on Eric's face, shows the direction
0:16:33 | of the avatar's gaze.
0:16:35 | So I'll run this for you.
0:16:38 | Sorry for the quality of the audio here.
0:16:42 | [demo video plays; audio mostly unintelligible]
0:17:21 | Alright, thank you.
0:17:24 | [demo continues; audio mostly unintelligible]
0:18:19 | So there are many behaviors in here that fly by pretty fast. Like, for instance, when
0:18:24 | the receptionist turns from Eric to me and my attention is in my cellphone, it says
0:18:28 | excuse me, and waits for my attention to come up to continue that engagement. Or
0:18:33 | at the end, when I'm pacing somewhere far away in the distance,
0:18:37 | the moment I turn my attention towards it, even though I'm at a distance, it
0:18:40 | initiates this engagement, because, you know, as I still have this task of getting
0:18:44 | the shuttle, it can give me an update.
0:18:47 | There are a lot of behaviors that you can create from relatively simple inferences.
0:18:51 | Now, obviously, this is a demonstration video that was shot in the
0:18:54 | lab, and probably we had to do it, I don't know, three, five times to
0:18:57 | get it right.
0:18:58 | This stuff does not work that well when you put it out there in the
0:19:02 | wild, and I will show you in a second how well it works in the wild.
0:19:05 | But this is almost like a North Star video, like a North Star direction for
0:19:09 | us in our research work:
0:19:11 | we want to be able to create systems where the underlying inference models are so robust
0:19:16 | that
0:19:16 | we can actually have these kinds of fluid interactions out there in the wild.
0:19:21 | So let me
0:19:23 | show you how it works in practice, and talk about, give an example
0:19:27 | of, a research problem in this space.
0:19:30 | I'll start with this video that kind of motivates it. Pay attention to how badly in
0:19:36 | this case, this is a video from the directions robot,
0:19:39 | how badly the robot is negotiating disengagement, so the moment of breaking off the
0:19:44 | interaction.
0:19:48 | Do you need help finding something?
0:19:52 | It's a room down that hallway, on my [unintelligible]
0:19:55 | By the way, would you mind swiping your badge on the reader, so I know
0:20:00 | who I've been talking with?
0:20:03 | Thank you. Is there anything else I can help you find? Nothing?
0:20:07 | Okay.
0:20:08 | Thanks, that's it.
0:20:10 | [unintelligible]
0:20:12 | Can I help you find something else? No, thank you.
0:20:17 | Okay then.
0:20:19 | Bye.
0:20:21 | Not very good; he's rushing off, so...
0:20:26 | So what happens here? Well, what happens here is that at this point in time
0:20:30 | it's obvious to all of us that this interaction is over.
0:20:34 | But all the machine sees is just
0:20:37 | the rectangle of where the face is; back in the day, that's all the tracking
0:20:40 | we were doing. It doesn't understand this gesture.
0:20:44 | And so at this point the robot continues the dialogue with "is there anything else I
0:20:48 | can help you find", and this is quite a long production. Now, what's interesting here
0:20:54 | is that just a couple of seconds right after that, by this
0:20:58 | point, by this frame,
0:20:59 | the robot's engagement model can actually tell that this person is disengaging. But by that
0:21:04 | time it's already too late, because we've already started producing this "is there anything else",
0:21:09 | and the person hears the sentence. And we're in this bad loop now, where we
0:21:14 | are basically not negotiating this disengagement properly, and the person starts coming back, so now they're
0:21:18 | engaged again,
0:21:19 | and we get into this problem.
0:21:22 | So what's interesting here is that the robot eventually knows.
0:21:25 | And so the idea that comes to mind is:
0:21:27 | well,
0:21:28 | if we could somehow forecast from here that at some future time this person is likely
0:21:33 | to disengage, with some
0:21:36 | good probability,
0:21:37 | we could perhaps use hesitations to mitigate the uncertainty. People have often used hesitations in situations
0:21:43 | of uncertainty. So if we could somehow forecast, even if we're far from perfect in that forecast, that
0:21:48 | at t-zero plus delta this person might be disengaging, instead of launching
0:21:52 | this production we could launch a filler, like a hesitation, like "so...",
0:21:56 | and then, if at t-zero plus delta we find them disengaging, we say "so...
0:22:00 | well, guess I'll catch you later then",
0:22:02 | or if, alternatively, they are not, we can still say "so... is there anything
0:22:06 | else I can help you find?", and that doesn't sound too bad. And so the
0:22:10 | core idea here is:
0:22:12 | let's forecast what's going to happen in the future,
0:22:15 | and maybe use hesitations to mitigate the associated uncertainty.
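A minimal sketch of that policy (the timings, thresholds, and the forecaster interface are placeholders, not the actual system):

```python
import time

FILLER_HORIZON = 1.5  # seconds bought by the filler; value is illustrative

def respond_at_turn_end(forecast_disengagement, speak):
    """Hedge with a filler when disengagement is forecast, then commit."""
    if forecast_disengagement(horizon=FILLER_HORIZON) > 0.5:
        speak("So ...")                      # hesitation buys time to observe
        time.sleep(FILLER_HORIZON)           # stand-in for async monitoring
        if forecast_disengagement(horizon=0.0) > 0.5:
            speak("Well, I guess I'll catch you later then.")
        else:
            speak("Is there anything else I can help you find?")
    else:
        speak("Is there anything else I can help you find?")
```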
0:22:20 | Now, how do we do this? Well, we have an interesting approach here that is,
0:22:24 | in some sense, self-supervised.
0:22:26 | The machine eventually knows, so we can leverage that knowledge: you basically roll back time, and
0:22:31 | you can learn from your own experience, basically without the need for any manual supervision.
0:22:36 | So you have a variety of features. I'm illustrating here three features, like the location
0:22:40 | of the face in the image and the size of the face,
0:22:42 | which, in this instance, this is where they start moving away, right? The
0:22:45 | size of the face is kind of a proxy for how far away from you
0:22:48 | they are. We have all sorts of probabilistic models, for instance for inferring where their
0:22:52 | attention is: is the attention on the robot, or is their attention somewhere else?
0:22:56 | And there are many such features in the system.
0:22:59 | Now, the idea is you start with a
0:23:01 | very conservative heuristic for detecting disengagement. You want to be conservative because
0:23:06 | the flip side of the equation, breaking the engagement when someone is still engaged, is even
0:23:11 | more painful: you don't want to kind of stop talking to someone while
0:23:14 | they're talking to you. So you stay on the conservative side, which means you're going to be
0:23:18 | late in detecting when they disengage.
0:23:20 | But you will eventually detect that they disengaged: at some point you will exceed some
0:23:24 | probability threshold that says they're disengaging. And then what you can do is, like I
0:23:29 | said, roll back time. So let's say you want to anticipate that moment by five seconds:
0:23:32 | it's easy to automatically construct a label that looks like that and, five seconds
0:23:37 | ahead of time, predicts that event.
0:23:39 | And then you train a model, from all these features that you have, to predict
0:23:42 | this label.
0:23:44 | Now, this model is going to be far from perfect, but you'll
0:23:47 | probably detect that moment a bit earlier on.
0:23:50 | So if you use the same threshold of point eight, you might, you know,
0:23:55 | be able to detect it by this much earlier. We call this the early
0:23:58 | detection.
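In sketch form, the self-supervised label construction and training look something like this (the data layout and the choice of scikit-learn are assumptions; the talk only commits to the roll-back idea):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

ANTICIPATION = 5.0  # seconds to roll the labels back

def make_labels(times, heuristic_detections):
    """Mark the ANTICIPATION seconds *before* each late-but-reliable
    heuristic disengagement detection as positive; no manual labels needed."""
    labels = np.zeros_like(heuristic_detections)
    for t in times[heuristic_detections == 1]:
        labels[(times >= t - ANTICIPATION) & (times <= t)] = 1
    return labels

def train_forecaster(features, times, heuristic_detections):
    """features: one row per frame (face location/size, inferred attention,
    dialogue state, time in state, ...)."""
    y = make_labels(times, heuristic_detections)
    return LogisticRegression(max_iter=1000).fit(features, y)
```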
0:23:59 | And so then you go and train models with all these features, and really the
0:24:02 | technical details are not that important here; the point I want to make is a high-level
0:24:05 | point in this case, I think.
0:24:07 | We used logistic regression, boosted trees, whatever your favorite machine learning technique is, and you can
0:24:12 | see that, you know, for the same false positive rate you can
0:24:15 | detect the disengagement earlier than the baseline heuristics.
0:24:19 | The other sort of high-level lesson is that
0:24:21 | by using multimodal features you tend to improve your performance. We used features
0:24:26 | related to the focus of attention, location, and tracking confidence scores,
0:24:30 | dialogue features like dialogue state, how long they've been in there, and so on.
0:24:33 | Each of these individually does something, and when you add them all up together
0:24:37 | you get better results, which is generally
0:24:40 | something that tends to happen with multimodal systems.
0:24:44 | Again, the high-level point I want to make here is:
0:24:47 | forecasting as a construct, I think, is very interesting. Like, there's been a lot of
0:24:52 | work recently in dialogue on incrementality, and I think forecasting goes hand in hand with that,
0:24:57 | because it's very important in order to be able to achieve
0:25:01 | the kind of fluid coordination we want; we probably have to anticipate more.
0:25:06 | And then it also presents these interesting opportunities for learning, easily, from experience, without manually
0:25:12 | labeling data, because in general, if you want to forecast an event, you have the
0:25:16 | label: you know when it happens, you just know it too late. But you can
0:25:19 | still learn from all of that, and you can do that online, and the system
0:25:23 | can adapt to the particular situation it's in.
0:25:26 | So I think those are a couple of interesting lessons, sort of at the high level,
0:25:29 | from this work.
0:25:31 | I'm going to switch gears
0:25:32 | and talk about a different problem that lives more,
0:25:37 | relatively speaking, in the turn-taking space. You know, just like engagement is this
0:25:43 | rich, mixed-initiative process, you know, by which we regulate how we initiate interactions,
0:25:48 | turn-taking is also, you know, mixed-initiative and incrementally controlled by the participants. It is this process
0:25:55 | by which we regulate who takes the floor and who talks
0:25:58 | in a conversation. And as I mentioned before, again, in a lot of traditional dialogue
0:26:04 | work we make the simple turn-taking assumption of: you speak, then I speak,
0:26:08 | then you speak, then I speak. Maybe there are barge-ins that are being handled.
0:26:11 | In multiparty settings you really need to build a more sophisticated model, because you need to understand
0:26:16 | who's talking to whom at any given point in time,
0:26:19 | and when it is your turn to speak.
0:26:20 | And we've done a
0:26:22 | bunch of work in that direction. I'm not going to show you that; I'm going to show
0:26:25 | you a different problem that relates to turn-taking, that I think illustrates even better
0:26:30 | this high degree of coordination and multimodality in situated dialogue, and this has to
0:26:37 | do with coordination between speech and attention.
0:26:39 | And in some sense this work was prompted by reading some of Goodwin's work on
0:26:45 | disfluencies and attention. So Goodwin made this interesting observation about disfluencies in one of
0:26:52 | his papers.
0:26:53 | We all know that if you look at transcripts of conversational speech, it's full of false
0:26:58 | starts and restarts and disfluencies. So they're going to look like,
0:27:01 | you know, the speaker says: "anyway,
0:27:03 | we went to... I went to bed", or "Brian, you're going to have...
0:27:06 | you can still have to go", or... I can't even read these; this is part of a
0:27:10 | transcript, transcribed, like, very literally, you
0:27:14 | know, conversational speech. These are everywhere, and they create problems for the speech recognition people and the
0:27:19 | language modeling people and so on. Conversational speech is hard.
0:27:23 | Well, Goodwin had the interesting insight of looking at this in conjunction with gaze.
0:27:28 | So here's the listener's gaze,
0:27:30 | and the region in red dots
0:27:32 | is where the listener is not looking at the speaker.
0:27:36 | This is the point where mutual gaze gets reestablished, and then we have mutual gaze
0:27:40 | between listener and speaker.
0:27:42 | And something that's really interesting in these examples is that things become much more
0:27:45 | grammatical
0:27:47 | in regions of mutual gaze.
0:27:49 | And this leads to kind of an interesting hypothesis: that maybe disfluencies are not just
0:27:54 | errors in production. Maybe some of these disfluencies
0:27:58 | actually fulfill a coordinative purpose; they are used to regulate and coordinate and make sure
0:28:03 | that either I'm able to attract your attention back if it has drifted away,
0:28:09 | or that, whenever I deliver what I want to deliver, I really have your attention.
0:28:13 | And so,
0:28:15 | partly inspired by this work and partly inspired by behaviors in our systems,
0:28:20 | we did a bunch of work on coordinating speech and attention.
0:28:25 | So let me show you an example. In contrast to
0:28:28 | what humans are able to do without thinking about it,
0:28:31 | here's our robot, which is not able to reason about where the person's attention is.
0:28:36 | There are a bunch of speech recognition errors in this interaction as well, but I'd like
0:28:40 | you to pay more attention to basically how the robot is not able to take
0:28:44 | into account where the participant's attention is as the interaction is happening. She's just looking at her
0:28:49 | phone, trying to get
0:28:50 | the number for the meeting she's going to, but the robot is ignoring all that.
0:28:56 | [demo video plays; audio mostly unintelligible]
0:29:19 | Or maybe not. So she's, you know, she's just looking at her phone trying
0:29:24 | to find the room, and the robot keeps pushing this question:
0:29:27 | where are you going, where are you going? And so that's, you know, quite different from what people are doing.
0:29:32 | So, inspired by Goodwin's work, we did some work on
0:29:36 | basically coordinating speech with
0:29:40 | attention, and the idea here was to have a model where, on one hand,
0:29:44 | we model the attentional demands,
0:29:46 | like where does the robot expect the person's attention to be,
0:29:50 | and on the other hand we model the attentional supply: where is the actual attention going.
0:29:55 | So attentional demands are defined at the phrase level. For every output that the
0:29:59 | robot is producing, at the phrase level,
0:30:01 | we have an expectation about where attention should be. In most cases it probably should
0:30:04 | be on the robot, but that is not always the case. When I point
0:30:08 | over there and say "to get to thirty-eight hundred", I might expect that
0:30:11 | your attention will go over there, and if your attention doesn't go over there, maybe
0:30:14 | we have a problem.
0:30:16 | So these are manually specified, basically, just like in natural
0:30:20 | language generation: for every output we have one of these
0:30:23 | expected attention targets. And then, on the other hand, we make inferences about where
0:30:27 | your attention is,
0:30:28 | and we do that based on machine learning models that use video features and so on
0:30:33 | and so forth.
0:30:34 | Whenever there's a difference between the two, instead of just ballistically producing the speech synthesis,
0:30:40 | we use this coordinative policy that basically interjects the same kinds of pauses, and filled
0:30:46 | pauses, and false starts and restarts
0:30:49 | that humans do. It basically creates these disfluencies
0:30:52 | to get to a point where attention is exactly where we expect it to be, and
0:30:55 | only then do we continue. So instead of saying "to get to thirty-eight hundred", we
0:30:59 | might pause for a while, say "excuse me", say the first two words "to get",
0:31:03 | pause more, and so on, before we actually produce the utterance.
0:31:08 | So all this is, again, done on a phrase-by-phrase
0:31:13 | basis.
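A sketch of that phrase-level coordinative policy (the attention tracker and synthesis calls are placeholders, and the filler schedule is invented for illustration):

```python
def produce_phrase(phrase, expected_target, attention, speak, max_fillers=3):
    """Hold the phrase until the inferred attentional supply matches the
    attentional demand declared for this phrase, hedging with disfluencies."""
    fillers = ["...", "excuse me,", phrase.split()[0] + " ..."]  # pause, summons, false start
    used = 0
    while attention() != expected_target and used < max_fillers:
        speak(fillers[min(used, len(fillers) - 1)])
        used += 1
    speak(phrase)  # deliver, or give up coordinating after max_fillers

# Every generated output carries an expected attention target, e.g.:
# produce_phrase("to get to thirty-eight hundred,", "hallway-left", attention, speak)
```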
0:31:16 | Here is again a demonstration video of
0:31:16 | Eric and I, bad actors, trying to kind of illustrate this behavior.
0:31:23 | [demo video plays; audio mostly unintelligible]
0:31:58 | So, still a bit clunky, you know, but you get the sense and the idea. Let
0:32:02 | me show you a few interactions captured in the wild, once we deployed this coordinative
0:32:08 | mechanism.
0:32:09 | In here, basically,
0:32:11 | the regions in black are the production that, you know, the robot normally produces, the
0:32:17 | synthesis; these are phrase boundary delimiters;
0:32:20 | and the regions in orange are
0:32:24 | these filled pauses and interjections that are dynamically injected on the fly,
0:32:29 | based on where the user's attention is.
0:32:40 | [demo video plays; audio mostly unintelligible]
0:32:57 | Excuse me...
0:33:00 | [demo continues; audio mostly unintelligible]
0:33:25 | So that "excuse me" might be a bit aggressive, you know. There's a lot of
0:33:28 | tuning; once you put this in there, you realize the next layer of
0:33:31 | problems that you have, in how the synthesis is not quite conversational enough, and, you know,
0:33:36 | the nuances of saying "so" and "excuse me" and so on.
0:33:42 | And while these videos again might make it look like, wow, we can
0:33:45 | go quite far, I don't want to leave you with the wrong impression: a lot of
0:33:49 | work remains to be done.
0:33:51 | These things often fail. The videos I've shown you are ones where things work
0:33:54 | relatively well, I would say.
0:33:56 | But these things often fail, and I want to show you one interesting example
0:34:00 | of a failure.
0:34:02 | [demo video plays; audio mostly unintelligible]
0:34:19 | So this is the moment to say whoops.
0:34:21 | So what actually happens here? Well, what happens here is that
0:34:25 | we are coordinating, you know, paying a lot of attention to coordinating our speech with
0:34:31 | the participant's attention,
0:34:33 | but we're completely ignoring what his upper body and torso are signalling. So what happens
0:34:38 | here is:
0:34:38 | the robot gets to this phrase where it says "to get there, walk to the
0:34:41 | end of this hallway",
0:34:43 | at which point the person feels that maybe this is the end of the instructions.
0:34:47 | So they start turning both their face and their body to kind of indicate that
0:34:51 | they might be leaving, right?
0:34:54 | The robot sees their attention go away and thinks, well, I'm going to wait for their
0:34:58 | attention to come back, and the long pause that gets created further reinforces the
0:35:03 | person's belief that this is the end of the directions, so "I'm just going",
0:35:06 | even though the robot had all these other things to say, right?
0:35:09 | And so the robot, in some sense, ignores the signal from his upper
0:35:14 | body. If the robot could take into account that signal, we could
0:35:17 | be a bit smarter, and maybe not wait there, maybe use a different mechanism to
0:35:21 | get their attention back,
0:35:22 | or maybe just
0:35:23 | blast through it; you don't always have to coordinate exactly that way, right? And
0:35:27 | so...
0:35:29 | I love this example because it really highlights and drives home this point I'm trying
0:35:34 | to make. I think that
0:35:35 | dialogue is really highly coordinated and highly multimodal. Dialogue between people in face-to-face settings
0:35:41 | has these properties, you know.
0:35:44 | We've talked about coordinating speech and gaze,
0:35:47 | and we've seen in this example how not reasoning about body pose gets us into
0:35:51 | trouble.
0:35:52 | There are many other things going on. We do head gestures, like nods and shakes and
0:35:57 | all sorts of other head gestures, and there's a myriad of hand gestures, you know,
0:36:01 | from the metaphoric to the iconic to deictic gestures;
0:36:05 | facial expressions, smiles, frowns, expressions of uncertainty;
0:36:09 | where we
0:36:10 | put our bodies and how we move dynamically; prosodic contours. All of these things
0:36:14 | come into play, and they're highly coordinated frame by frame, moment by moment. And the coordination that happens is
0:36:21 | not just across the channels;
0:36:23 | it's across people
0:36:25 | and these channels. And so I'd like us to think about dialogue in this view:
0:36:29 | more than a view of, you know, a sequence of turns, a view of a
0:36:35 | multimodal, incrementally co-produced process.
0:36:38 | And I think if we do that, there are a lot of interesting opportunities,
0:36:42 | because of these enabling technologies that are coming up these days.
0:36:46 | So I've shown you a couple of problems in the space of turn-taking and
0:36:51 | engagement. There are many more problems, and every time we touch one of these we really
0:36:55 | feel like we've barely scratched the surface.
0:36:58 | Take, for instance, engagement. I talked for a bit about
0:37:01 | how to forecast disengagement, and maybe negotiate the disengagement process better, but there are many other
0:37:08 | problems. How do we build robust models for making inferences about those engagement variables, like
0:37:13 | states, engagement actions, and intentions?
0:37:16 | How do we construct measures of engagement that are more continuous? Here, all the
0:37:21 | work we've done is on "I'm engaged" or "I'm not engaged". Well, in an educational or tutoring
0:37:25 | or other kind of setting, you want a more continuous measure of engagement.
0:37:28 | How do you reason about that?
0:37:31 | Similarly, there are many other problems in turn-taking and understanding. How do we ground all these things
0:37:35 | in the physical situation? There are interesting challenges with rapport, with negotiation, with grounding. Well, lots of
0:37:43 | open space, lots of interesting problems once you start thinking about how the physical world
0:37:46 | and all these channels interact with each other.
0:37:50 | Like I said, I think we have these interesting opportunities because
0:37:53 | there has been a lot of progress in the vision and perception space:
0:37:57 | face tracking, facial expression tracking, smiles, affect recognition and so on, that can
0:38:03 | help us sort of in this direction.
0:38:06 | I think the other thing that I really want to highlight, besides the
0:38:09 | current technological advances, that I think is very important,
0:38:12 | is all this body of work that comes from connected fields like anthropology, sociology,
0:38:18 | psycholinguistics, sociolinguistics, conversation analysis, context analysis, and so on.
0:38:23 | There's a wide body of work: basically,
0:38:25 | as soon as people got their hands on video tapes in the fifties and sixties,
0:38:28 | they started looking carefully at
0:38:30 | human communicative behaviors.
0:38:32 | And all that work was done
0:38:34 | based on, you know, small snippets of video. And if you think about it, today
0:38:37 | we have millions of videos
0:38:40 | and interesting, powerful data techniques. So there are interesting questions about how do we bring
0:38:46 | this work into the present: how do we leverage all the knowledge and the
0:38:49 | theoretical models that have been built in the past?
0:38:51 | I've put here just some names; there are many more
0:38:54 | people that have done work in this space, and I picked one title from each
0:38:57 | of them. Each of these guys
0:38:58 | has full bodies of work. I really recommend that,
0:39:01 | as a community, we look back more on all this work that has
0:39:05 | been done already on human communication, and try to understand how to leverage that
0:39:09 | when we think of dialogue.
0:39:12 | So,
0:39:14 | with that, I guess I have ten minutes left. I want to kind of
0:39:17 | switch gears a bit and talk more about
0:39:20 | challenges. Because, you know,
0:39:22 | there's a lot of opportunity, there's a lot of open field,
0:39:25 | but working in this space is not necessarily easy either. And when I think of
0:39:30 | challenges, at a
0:39:32 | high level I think of three kinds of categories. There are obviously the research challenges that
0:39:37 | we have, like: I want to work on this problem of forecasting disengagement; how will I
0:39:41 | solve it? There's obviously the research challenge,
0:39:44 | but I'm going to leave those aside and try to talk about two other kinds
0:39:48 | of challenges. One is data and experimentation challenges, and we touched briefly on this in
0:39:52 | the panel yesterday. I think getting data for these kinds of systems is not
0:39:57 | easy.
0:39:59 | If you look at a lot of our adjacent fields, like machine translation and speech
0:40:03 | recognition, NLP, and so on,
0:40:05 | a lot of progress has been accomplished through, you know,
0:40:09 | challenges with datasets and clear evaluation metrics and so on.
0:40:12 | In dialogue this is not easy to do, and it is not easy to do because
0:40:15 | dialogue is an interactive process: you cannot easily study it on a fixed dataset,
0:40:20 | because by the time you've
0:40:22 | made an improvement or changed something, the whole thing behaves differently.
0:40:26 | And so that creates challenges generally for dialogue, and even more so for multimodal
0:40:31 | dialogue, in the multimodal space, right?
0:40:33 | Then, apart from the data challenges, there are also experimentation challenges.
0:40:38 | We've done a lot of the work we've done in the wild, because I feel
0:40:42 | like you see the real problems, you see ecologically valid settings, and you see what
0:40:47 | really happens.
0:40:48 | Some of these phenomena are actually even
0:40:52 | challenging and hard to study in controlled lab settings, like studying how engagement, how
0:40:56 | these breakdowns happen, and so on. You can think of all sorts of things with confederates, and
0:40:59 | you can try to, you know, figure out controlled experiments, but it's not easy. And
0:41:04 | on the other hand, experimenting in the wild is not easy either, for many reasons.
0:41:09 | One of the
0:41:10 | other kinds of challenges in here is purely building up the systems, right? So
0:41:14 | in our work over the last ten years, the way we've gotten our data is by
0:41:17 | building systems and deploying them, right?
0:41:21 | But building systems is hard. And so in the last five minutes I want to talk
0:41:25 | a bit about actual engineering challenges, because I think they're just as important, in that
0:41:29 | they kind of create a damper on the research, and they kind of stifle
0:41:34 | things from moving faster forward. Building these kinds of multimodal systems is hard for a number of reasons.
0:41:41 | First, there's a problem of integration: these systems leverage many different kinds of technologies
0:41:45 | that
0:41:47 | are of different types and operate on different time scales. The sheer complexity and the number
0:41:51 | of boxes you have in one of these systems kind of makes the problem
0:41:55 | challenging.
0:41:56 | But then there's another thing: constructs that are pervasive in these systems, like time,
0:42:01 | space, and uncertainty, are nowhere in our
0:42:04 | programming fabrics. Like,
0:42:06 | it's kind of clear to me that time, for instance, is not a first-order
0:42:10 | citizen in any programming language that I can think of. So every time I want to
0:42:14 | do something that's over time, or streaming,
0:42:16 | I have to go implement my buffers and my streaming, and, you know, I
0:42:19 | kind of have to go from scratch. And it's similar for space and uncertainty.
0:42:23 | But it is very important, because
0:42:26 | we want to create systems that are fluid,
0:42:27 | but the sensing, thinking, acting, all of these things take time.
0:42:32 | Being fast is not even enough. Oftentimes you need to do fusion in these
0:42:36 | systems, and things arrive with different latencies, so you need to coordinate, basically. So
0:42:40 | you need to kind of reason about time in a deeper sense, deep down
0:42:45 | in the fabric. And the same things can be said, I think, in these systems, about
0:42:49 | the notions of space and the notions of uncertainty.
0:42:52 | And finally, the other thing that kind of puts a damper on it is the fact
0:42:55 | that the development tools we have
0:42:58 | are not there for this class of systems, right? So the development environments and debuggers
0:43:03 | and all of this stuff were not
0:43:05 | developed with this class of systems in mind.
0:43:09 | And if I think back on all the work we've done, I don't know if
0:43:12 | half the time was maybe spent on building the tools to build the systems, rather than
0:43:16 | building the systems or doing the research, right? And so,
0:43:20 | basically driven by a lot of the lessons we've learned over the years,
0:43:25 | in the last three or four years at MSR we basically embarked
0:43:29 | on this project, and I wanted to spend the last couple of minutes telling you
0:43:32 | about it, because if there are any people in the room that are interested in
0:43:36 | joining the space, this might be useful for them.
0:43:39 | We've worked on developing an open-source platform that
0:43:42 | basically aims to simplify building these systems,
0:43:46 | the end goal being to lower the barrier to entry and enable more research in this
0:43:51 | space. So it's a framework that really targets researchers.
0:43:55 | It's open source, and it
0:43:59 | supports the construction of these kinds of situated interactive systems.
0:44:04 | We call it
0:44:06 | Platform for Situated Intelligence, which is kind of a mouthful, so we abbreviate it \psi, pronounced like
0:44:10 | the Greek letter psi.
0:44:12 | And I want to just give you a whirlwind tour in two minutes, just to kind
0:44:15 | of give you a sense of what's available in there.
0:44:19 | The platform consists of three layers: there's a runtime layer,
0:44:23 | a set of tools, and a set of components. The runtime basically provides all
0:44:28 | this infrastructure
0:44:30 | for building systems that operate over streaming data and have latency constraints; any time you have
0:44:34 | something interactive,
0:44:36 | it's latency-constrained.
0:44:38 | So there's a certain model for parallel, coordinated computation that actually feels pretty natural: you
0:44:43 | just kind of connect components with streams of data, you know, so it's the standard sort
0:44:48 | of dataflow model.
0:44:50 | But the streams have really interesting properties, and I don't have time to
0:44:55 | get here into
0:44:56 | the full detail and all the glory,
0:44:59 | but I want to kind of highlight some of the important aspects. So, for instance, I
0:45:03 | mentioned about time, how time should be a first-order citizen. Well, we baked that in from
0:45:08 | day one, deep below in the fabric. All messages that are flowing through are timestamped at
0:45:13 | the origin, when they're captured,
0:45:15 | and then, as they flow
0:45:16 | through the pipeline,
0:45:18 | we have access not only to the
0:45:20 | time the message was created by the component that created it, but also to that originating
0:45:24 | time.
0:45:25 | So we know this message has a latency of four hundred and thirty milliseconds; at
0:45:29 | all points in the entire graph we know the latency,
0:45:32 | which enables synchronization. So we provide a whole time algebra and synchronization mechanisms, when you
0:45:37 | work with streaming data,
0:45:39 | that pair these messages correctly, and so on.
0:45:41 | So it's basically all about enabling coordinated computation where time is really a first-order citizen.
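\psi itself is a .NET framework, so purely as an illustration of the idea (not the actual API), carrying originating timestamps and synchronizing on them looks roughly like this:

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class Message:
    data: Any
    originating_time: float  # when the underlying signal was captured
    creation_time: float     # when this component emitted the message

    @property
    def latency(self) -> float:
        # Known at every point in the pipeline graph.
        return self.creation_time - self.originating_time

def join_nearest(a: List[Message], b: List[Message],
                 tolerance: float) -> List[Tuple[Message, Message]]:
    """Pair messages from two streams by *originating* time rather than
    arrival time, so fusion stays correct when branch latencies differ."""
    pairs = []
    for ma in a:
        if not b:
            break
        mb = min(b, key=lambda m: abs(m.originating_time - ma.originating_time))
        if abs(mb.originating_time - ma.originating_time) <= tolerance:
            pairs.append((ma, mb))
    return pairs
```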
0:45:49 | The streams can be automatically persisted, so there's a logging infrastructure
0:45:53 | that is there for free for any data type: you know, you can stream any of
0:45:57 | your data types and we can automatically persist those. And because we persist them with
0:46:02 | all of this timing information,
0:46:04 | we can enable more interesting replay scenarios, where I say, well, forget about these
0:46:08 | sensors, let's play back from disk
0:46:10 | and tune this component. And I can play this back from disk exactly as it
0:46:14 | happened, in real time, or I can speed it up or slow it down. Time is
0:46:18 | entirely under our control, because it's baked deep down in the fabric.
0:46:22 | So these are some of the properties of the runtime; there's a lot more.
0:46:25 | It's basically a very lightweight, very efficient kind of
0:46:29 | system for constructing things that work with streaming data.
0:46:32 | At this level we don't care, we don't know anything about speech or dialogue or
0:46:36 | components;
0:46:37 | it's agnostic to that. You can use it for anything that operates with streaming
0:46:40 | data and temporal constraints.
0:46:42 | the set of tools we built |
---|
0:46:44 | basically are heavily centred on visualisation this is a snapshot from a |
---|
0:46:49 | the visualisation tool we have on the right there someone's actually eating it and this |
---|
0:46:52 | video sped up a bit but these are the streams that were persisted in application |
---|
0:46:56 | these are just visualise there's for different kinds of streams that can get composer didn't |
---|
0:47:00 | overlaid |
---|
0:47:00 | so this is a visualiser for and in each stream this is a visualiser for |
---|
0:47:05 | face detection results stream this is audio this is a voice activity detection that's a |
---|
0:47:09 | speech recognition result is a visualiser for all three d conversational scene analysis and the |
---|
0:47:15 | basic idea is that can composite overlaid is visualise there's |
---|
0:47:18 | and then you can navigate over time left and right ensue mean and look at |
---|
0:47:22 | particular moments this is very powerful in enabling especially when coupled with debugging |
---|
0:47:28 | and word evolving this to visualize not just the data collected and running through the |
---|
0:47:33 | systems |
---|
0:47:34 | but also all |
---|
0:47:35 | the architecture of the system itself and you know the view of the component graph |
---|
0:47:42 | and also tools for supporting data annotation |
---|
0:47:45 | finally at the components layer we are hoping to create an ecosystem of components where |
---|
0:47:51 | people can plug and play different kinds of components we're bootstrapping this with things like |
---|
0:47:56 | sensors imaging components vision audio speech output these are relatively simple components that we |
---|
0:48:01 | have in the initial ecosystem |
---|
0:48:03 | but the idea is that |
---|
0:48:05 | it's meant to be an open ecosystem and people are meant to contribute into it |
---|
0:48:08 | it's an open source project there's already boise state casey kennington has his own repository |
---|
0:48:13 | of psi components |
---|
0:48:15 | and so people are starting to use this and the hope is that as more |
---|
0:48:18 | people use it |
---|
0:48:19 | if i can get you to have eighty percent of what you need off-the-shelf and |
---|
0:48:24 | just focus on your research |
---|
0:48:26 | that's the key idea |
---|
0:48:28 | the last thing i'll say is that something we haven't released yet but we are planning to |
---|
0:48:32 | release in the next few months |
---|
0:48:34 | is an array of components that we refer to as the situated interaction foundation it's |
---|
0:48:41 | basically a set of components at that level that |
---|
0:48:43 | plus a set of representations |
---|
0:48:45 | that we want to further abstract and accelerate the development of these physically situated interactive systems |
---|
0:48:51 | basically what we are planning to construct is |
---|
0:48:56 | the ability to instantiate the perception pipeline where you as a developer of the system |
---|
0:49:00 | just say where your sensors are and what sensors you have |
---|
0:49:03 | so in this instance you know there's a kinect sensor the big box |
---|
0:49:08 | there represents my office and there's a kinect sensor sitting on top of the screen |
---|
0:49:12 | and if you tell me i have three sensors i'm gonna use the data from |
---|
0:49:15 | all the three sensors and fuse it we're gonna configure the perception pipeline automatically from all the sensors |
---|
0:49:20 | with the right fusion |
---|
0:49:22 | and provide the |
---|
0:49:24 | the kind of |
---|
0:49:25 | analysis a deep scene analysis object that runs at frame rate at thirty frames |
---|
0:49:30 | per second i'm gonna tell you things like here's where the people are in the |
---|
0:49:34 | scene and what their body poses are here's where everyone's attention is |
---|
0:49:39 | in this case there's an actual engagement happening between the two of us and an |
---|
0:49:43 | agent that's on the screen |
---|
0:49:44 | and stewart is you know directing the utterance towards |
---|
0:49:50 | you know the agent and at some later point |
---|
0:49:53 | we have peeled off we've gone more towards the back of the office towards the |
---|
0:49:56 | whiteboard |
---|
0:49:57 | and we're just talking to each other and so we're trying to provide all these |
---|
0:50:00 | rich analyses of |
---|
0:50:02 | the conversation and the conversation space including issues of engagement turn taking utterances sources targets |
---|
0:50:07 | and all of that |
---|
0:50:08 | from the available sensors and |
---|
0:50:10 | if you give me more sensors |
---|
0:50:12 | the idea is that you get the same object back |
---|
0:50:15 | but at a higher fidelity because we have more sensors and we can fuse data |
---|
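the "same object back, at higher fidelity" contract could look something like this sketch; `SceneAnalysis`, `Person`, and the fidelity heuristic are all invented here for illustration, not taken from the upcoming release:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Person:
    position: tuple          # 3-d location in the scene
    body_pose: list          # joint positions
    attention_target: str    # what or whom this person is attending to

@dataclass
class SceneAnalysis:
    # the same object regardless of sensor count; fidelity reflects
    # how many sensors contributed to the fused estimate
    people: List[Person] = field(default_factory=list)
    engagements: List[tuple] = field(default_factory=list)  # (source, target)
    fidelity: float = 0.0

def analyze(sensor_frames: list) -> SceneAnalysis:
    scene = SceneAnalysis(fidelity=min(1.0, len(sensor_frames) / 3))
    # ... fuse detections from each sensor frame into scene.people ...
    return scene
```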
0:50:19 | this part has not been released yet we see it coming out probably in the next couple |
---|
0:50:22 | of months |
---|
0:50:23 | but our hope with the entire framework is basically to accelerate research in this space |
---|
0:50:27 | to get people to be able to |
---|
0:50:29 | build and experiment with these kinds of systems without having to spend two years to |
---|
0:50:33 | construct all the infrastructure that's necessary |
---|
0:50:37 | and so this brings me basically to |
---|
0:50:40 | the end of my talk i'll conclude on this slide |
---|
0:50:44 | i've tried to adopt this view of dialogue in this talk and portrayed this |
---|
0:50:48 | view of dialogue as a |
---|
0:50:50 | multimodal incrementally co-produced process where participants in an interaction really |
---|
0:50:56 | do fine grained coordination across all these different modalities |
---|
0:50:59 | i think there is a |
---|
0:51:01 | tremendous number of opportunities here and i think it's up to us to basically |
---|
0:51:05 | broaden the field in this direction because the |
---|
0:51:08 | underlying technologies are coming and they are starting to get to the point where |
---|
0:51:12 | they're reliable enough to start to do interesting work and again there's this |
---|
0:51:18 | big body of work in human communication dynamics that we can leverage and that |
---|
0:51:22 | we can draw upon |
---|
0:51:24 | so i'll stop here thank you all for listening and i'll take any questions |
---|
0:51:37 | thanks very much dan |
---|
0:51:44 | thanks dan it was so great to see |
---|
0:51:47 | all this work again and how |
---|
0:51:50 | impressive the research program over the number of years has been to get to this point |
---|
0:51:53 | i'm really looking forward to that |
---|
0:51:55 | situated interaction foundation |
---|
0:51:58 | coming out |
---|
0:51:59 | i've a question i guess related partly to that but |
---|
0:52:03 | one of the problems with integration is not just taking a bunch of pieces and |
---|
0:52:07 | putting them together but |
---|
0:52:08 | the maintenance of that over time as you add new pieces so |
---|
0:52:11 | in particular for this last thing |
---|
0:52:15 | how much can you just add a new component and expect everything else to |
---|
0:52:21 | work the way it did and just have some value added by getting new information |
---|
0:52:24 | and how much do you have to re-engineer the whole architecture to make sure that |
---|
0:52:30 | you're not undoing things or getting into problems i'm thinking |
---|
0:52:34 | you know in terms of engineering the recent plane crashes seem to stem |
---|
0:52:38 | from this kind of thing where |
---|
0:52:41 | different engineers design systems very well given a set of assumptions about what else would |
---|
0:52:45 | be there |
---|
0:52:46 | or not and then that changed under them and that's what seems to have caused the problem |
---|
0:52:50 | right |
---|
0:52:51 | i mean i completely agree i mean the ideal world is one where |
---|
0:52:56 | you know everything works and you plug your thing in but in reality it's |
---|
0:52:59 | never that way right there's gonna be like different people with different |
---|
0:53:02 | research agendas you know viewing things differently they have different mental models and different |
---|
0:53:08 | viewpoints from which they look at a problem and attack it |
---|
0:53:11 | and i think that does create challenges that way i don't know how to solve all those |
---|
0:53:15 | challenges |
---|
0:53:15 | well all i can say is that we're kind of aware of that and when |
---|
0:53:19 | we're constructing this we're trying to |
---|
0:53:21 | make as few commitments in some sense as possible to allow for the flexibility that's |
---|
0:53:26 | needed for research |
---|
0:53:27 | because i think there's actual value in all those different viewpoints and different architectures and |
---|
0:53:31 | exploration |
---|
0:53:33 | and so |
---|
0:53:33 | yes i think what i can say is that we are purposefully trying to |
---|
0:53:37 | not make hard commitments to what an utterance is you know i don't |
---|
0:53:42 | wanna tell you what an utterance is i wanna have you have your own opinion |
---|
0:53:45 | of what an utterance is |
---|
0:53:46 | but that also might mean that again when you try to plug in your speech recognizer |
---|
0:53:50 | in my system |
---|
0:53:51 | there might need to be some wrangling and so on you know making these |
---|
0:53:54 | components work together i don't know how we can solve this problem |
---|
0:53:57 | i'm not a big believer in we'll all come together around the big beautiful |
---|
0:54:01 | standard that we'll all agree to i don't see that happening |
---|
0:54:04 | we're just trying to design towards |
---|
0:54:07 | flexibility i would say |
---|
0:54:10 | and |
---|
0:54:11 | i think that was a wonderful talk and you're highlighting these things and you're right |
---|
0:54:16 | the time is now ripe for us to be able to address them and we |
---|
0:54:20 | should be working more on this going beyond the simple turn and |
---|
0:54:26 | sorry i might be introducing something even more complex down the line i wanna ask about |
---|
0:54:31 | user adaptation users are very good humans are very good at changing their behaviour based |
---|
0:54:36 | on the system that's in front of them you know if it's a human that |
---|
0:54:42 | you're calling and there's a delay we will not backchannel because it |
---|
0:54:46 | screws up the conversation |
---|
0:54:48 | and people can adapt to this quite fast |
---|
0:54:52 | and that might be confusing to our learning systems will we then be able |
---|
0:54:56 | to |
---|
0:54:58 | choose the effects that |
---|
0:55:01 | we want users adapting to rather than the most natural ones have you thought about how |
---|
0:55:06 | to |
---|
0:55:07 | either not get the human to adapt or to be able to control |
---|
0:55:11 | how the human adapts to the particular system |
---|
0:55:14 | and the policies that you're using for adaptation |
---|
0:55:18 | no i think it's a very interesting question so i think |
---|
0:55:20 | so there's a couple things here so one is i've noticed in a lot of |
---|
0:55:24 | the data that we do observe a large variability |
---|
0:55:27 | between people's attitudes and what people do |
---|
0:55:30 | both in you know just the initial attitude that they come towards |
---|
0:55:33 | the system with and the expectations they have and also in how they do or do |
---|
0:55:38 | not adapt to whatever the system is doing |
---|
0:55:40 | well i guess my view is one thing i would say is i think more |
---|
0:55:45 | of these systems should be learning continuously because you are basically now in a continuous dance with |
---|
0:55:51 | the person on the other end in this adaptation you know and |
---|
0:55:54 | doing things in big batches |
---|
0:55:56 | is likely to create more friction than doing things in a continuously adaptive way so |
---|
0:56:00 | i think that's an interesting direction in which to attack the problem |
---|
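as a toy illustration of "continuous small adjustments rather than big batches", consider an exponentially smoothed update to a turn-taking pause threshold; the class, parameters, and numbers below are entirely hypothetical, just to show the shape of the idea:

```python
class ContinuousAdapter:
    # adjust a turn-taking pause threshold a little after every exchange,
    # rather than retraining in big batches -- small continuous steps keep
    # the system close to the user even as the user also shifts behaviour
    def __init__(self, threshold: float = 0.6, rate: float = 0.05):
        self.threshold = threshold  # seconds of silence before taking the turn
        self.rate = rate            # how fast we drift toward observed behaviour

    def observe(self, user_pause: float) -> None:
        # exponential moving average toward the pauses this user produces
        self.threshold += self.rate * (user_pause - self.threshold)

adapter = ContinuousAdapter()
for pause in [0.4, 0.5, 0.45, 0.3]:   # pauses observed in a hypothetical session
    adapter.observe(pause)
print(f"adapted threshold: {adapter.threshold:.2f} s")
```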
0:56:04 | i feel |
---|
0:56:05 | a lot of the work i do and the way i'm thinking of it is i want |
---|
0:56:08 | to reduce this impedance mismatch in interaction between where machines are and where people are and i |
---|
0:56:13 | think we still have a long way to travel with the machines this way |
---|
0:56:17 | people always accommodate the machines and mediate but i think i want the machine going |
---|
0:56:21 | to be closer to where the human is and that would make things easier |
---|
0:56:25 | so i think of all |
---|
0:56:26 | the work we've done and the way i see it is kind of |
---|
0:56:30 | i'm gonna try to reduce that impedance from the machine side as much as possible |
---|
0:56:35 | but you're right people will adapt and sometimes with clever designs you can actually you |
---|
0:56:40 | know create interesting experiences that leverage that adaptation when you know it's gonna happen |
---|
0:56:46 | but i think in most cases i'm in favour of systems that just |
---|
0:56:49 | incrementally adjust themselves to be able to be at the right spot 'cause it continues |
---|
0:56:54 | to shift |
---|
0:56:55 | i don't know if that really addresses the question or just sort of goes around it |
---|
0:57:00 | hi i'm robert from technological university dublin speaking maybe as one of the many |
---|
0:57:06 | people here who over the years have wasted two years of our lives building dialogue |
---|
0:57:10 | systems from the ground up i think what you presented there at the end |
---|
0:57:14 | is fantastic but my question is a bit more specific |
---|
0:57:18 | and in terms of the work you did on interjections being used and hesitations being |
---|
0:57:23 | used to sort of keep the user's engagement |
---|
0:57:26 | in the work in the wild did you do any variation in terms of the |
---|
0:57:30 | multimodal aspects of the task in other words the avatar that's being used the gestures that |
---|
0:57:36 | were being used in fact whether or not using an avatar was a good idea |
---|
0:57:39 | that's my fine grained question and then just a more general question is have you |
---|
0:57:45 | looked at all at the issues |
---|
0:57:47 | of engagement in terms of activity modeling because it's always struck me that a big problem |
---|
0:57:51 | in situated interaction |
---|
0:57:53 | when you move away from the kiosk style where the user is asking a question is |
---|
0:57:59 | that users are engaged in activities and for us to truly get the situated interaction working |
---|
0:58:05 | we necessarily need to track the user and what they're doing to be able |
---|
0:58:10 | to make sensible contributions to the dialogue beyond just answering questions yep |
---|
0:58:14 | so to the first part of the question the short answer is no but we |
---|
0:58:18 | should have |
---|
0:58:19 | like i think there's a there's |
---|
0:58:22 | there's a rich set of nuances in basically how you do hesitations and |
---|
0:58:26 | interjections and all these policies and definitely the corresponding nonverbal behaviors |
---|
0:58:31 | would affect that |
---|
0:58:32 | and we've seen this just in the prosodic contours of the |
---|
0:58:35 | so you know the so was not such a good choice because the so |
---|
0:58:39 | as a hesitation sometimes |
---|
0:58:41 | brings people back like |
---|
0:58:42 | so what |
---|
0:58:43 | you know what i wanna say is that those are hard to synthesize based on |
---|
0:58:47 | the technology we had at the time |
---|
0:58:50 | so i should say that yes we definitely should consider those aspects |
---|
0:58:56 | the second part of the question remind me what it was |
---|
0:59:03 | so i think you're absolutely right a lot of the work i've shown that |
---|
0:59:07 | we've done actually in the last you know ten years there has been |
---|
0:59:11 | well focused on interaction and communication where in the interaction like there's some |
---|
0:59:17 | communication happens between the machine and the person |
---|
0:59:19 | but the whole task is this conversation that we're having |
---|
0:59:22 | we're actually just now starting to do more work with systems where the human |
---|
0:59:28 | is involved in an actual task not just the communicative task |
---|
0:59:31 | and we're trying to see how the machine can play a supporting role in that |
---|
0:59:35 | and i think you're absolutely right like that kind of brings up the next interesting |
---|
0:59:39 | level of how we really get collaboration going rather than just this kind of back |
---|
0:59:44 | and forth of i can ask or answer questions and so on i think that's |
---|
0:59:47 | a very interesting space and we're just starting to play in that space |
---|
0:59:54 | thank you very much for a very interesting talk i think this is great but |
---|
0:59:58 | with this |
---|
1:00:00 | going out in the wild approach i was just wondering have you |
---|
1:00:05 | i still assume that the microsoft research office has |
---|
1:00:09 | a certain type of people who are in there |
---|
1:00:13 | so it's not completely out in the wild so it's sort of a question |
---|
1:00:17 | of have you considered the sort of other i mean i guess children or |
---|
1:00:22 | the other types of user groups that |
---|
1:00:25 | or other types of problems that you might have since this is a more sort of accepting |
---|
1:00:29 | group or something no so the short answer is again no we haven't |
---|
1:00:33 | but i completely agree like the population we have is just a very narrow very |
---|
1:00:37 | specific one |
---|
1:00:39 | it's interesting to me |
---|
1:00:40 | how much variability i see even in that narrow cross section which makes me wonder |
---|
1:00:44 | like you know and it's interesting there's a lot of variability even in |
---|
1:00:49 | that narrow population |
---|
1:00:50 | but you're absolutely right like it's not |
---|
1:00:53 | truly in-the-wild it's not a public space like |
---|
1:00:56 | and so it'd be very interesting to go there and see what happens 'cause |
---|
1:01:00 | yes the populations are different there |
---|
1:01:05 | we haven't done much outside this |
---|
1:01:08 | okay let's thank dan again for a really nice talk |
---|