0:00:19 ...and I have great pleasure in introducing the second keynote speaker of the conference, Dan Bohus from Microsoft Research. Dan is a senior researcher at Microsoft Research, which is where he has been for the last twelve years, and he is going to talk to us about situated interaction.
0:00:52 Okay, thanks a lot. Thanks, Ingrid, for the introduction and also for the invitation to talk. It's great to be back here; I think I missed the last couple of years, but this is always a great venue to come back to.
0:01:05 So the title of the talk is situated interaction, and I think it's gonna dovetail pretty well with the panel discussion we had at the end of yesterday about narrowing versus broadening of the field and the interesting questions we might all be working on. There are basically two main points that I would like to highlight in this talk. The first one is that dialogue is really a multimodal, highly coordinated, complex affair that goes well beyond the spoken word.
0:01:38 I don't know how many of you are familiar with the work of Ray Birdwhistell, an anthropologist who did some of the seminal work on kinesics back in the sixties, basically studying the role of body movement in communication. In one of his books he essentially comments on how perhaps the problem with the early records that we have of studies of communication is that they were done by literate people.
0:02:06 Now, all joking aside, it is the case that if you look at most of the work we do today in dialogue, it is really heavily anchored in text, in the written word, and at best in the spoken word. But in reality we do a lot of work with our bodies when we interact with each other, when we communicate with each other, and the surrounding physical context also plays a very important role in these interactions: from where we place ourselves in space relative to each other, the stance we adopt, to where our gaze goes moment by moment, to facial expressions, head nods, hand gestures, prosodic contours. All of these channels come into play when we interact with each other, and so that's the view of dialogue that I would like to highlight today.
0:02:50 The second point that I'm gonna try to make in this talk is that I think we're also at a very interesting time, when in the last decade we have seen very fast-paced advances based on deep learning in areas like vision and, more generally, perception and sensing. I think these advances are getting us to the point where we're able to start building machines that understand people in physical space, how people move and behave in physical space. I think it's a very interesting time in that sense: just like in the nineties advances in speech recognition broke open the field and opened up this whole area of spoken dialogue systems, with all the research that has come from that, and that today has led to these mobile assistants in our pockets, I think these advances in vision and in the perceptual technologies give us a chance to again broaden the field, in this direction of physically situated dialogue and, more generally, situated interaction.
0:03:45 So what I'm gonna do in this talk is try to give you a sense of this area based on some research vignettes from our own work at MSR over the last ten years or so in this space, and hopefully I'll be able to convey my excitement about it and maybe get more of you guys to look into this direction, because I think there are a lot of interesting and open problems in this space, and I think a lot of the people in this room have quite a bit to contribute to solving them.
0:04:13 So finally, before I get going, before we dive in, I want to make sure I thank the collaborators I've had over the years. I've been lucky to work with fabulous people at MSR, and I've had long-term collaborations with folks like Eric Horvitz and Sean Andrist, and also many other researchers, talented engineers, and great interns we've had over the years. Some of the work you'll see, the work we've done at MSR in this space, would not have been possible without their help, so I want to thank them.
0:04:43 Okay, so let's get started: situated interaction. Well, I started working in this space shortly after I joined MSR, around 2008, and the main question that has been driving my research agenda since has been basically: how do we get computers to reason about the physical space around them and to interact with people in this kind of open-world, physically situated setting in a fluid and seamless manner?
0:05:09 And the general approach I've taken towards that has been one where we've built a variety of systems and we've deployed them in the wild. By deploying in the wild, what I mean in this case is placing them in some public space in our building where people would naturally encounter and interact with them without much instruction. So it's not a controlled setting; they're just deployed somewhere out there and people just come and interact with them. Then we observe the interactions and we let that drive what the research problems are that we want to address: we find out what problems we need to solve by observing what happens in this kind of ecologically more valid setting, and we try to let that give us direction. So to make this concrete and to give you a sense of the variety of systems we've built, I'm gonna start by showing you a few videos, and then we can go more into some of the research questions we've looked at.
0:06:02 The first video I'm gonna show you is from the system that we refer to as the assistant. It's a virtual-agent-based system that's placed outside Eric's office and interacts with people that come by whenever he is not available, or maybe when he is available but busy in his office. Basically the system does some simple assistive-type tasks like handling meetings and taking, you know, some notes to relay, and so on. It's connected to quite a wide infrastructure: it has access to Eric's calendar, but also to other machine-learned models that predict his availability, when he is gonna be back in his office, you know, what's the likelihood that he will attend a particular meeting, and so on.
0:06:43 But what I want to highlight with this video is not so much that part as much as the multiparty dialogue and interaction capabilities. The system has a wide-angle camera at the top and a microphone array, and it's able to basically reason about multiple people, understand who it is engaged with, and have a dialogue in this kind of open multiparty setting based on the roles that these people have.
0:07:16 (Demo video: a visitor arrives for a meeting with Eric; the assistant explains that Eric is not in, checks his calendar, estimates that he will probably be back in about fifteen minutes, and offers the options of waiting, coming back later, or sending him an email.)
0:08:10 So over the years we built a variety of these systems based on virtual agents. This is a receptionist prototype aiming to do shuttle reservations on campus, for people moving from one building to another: when you go into a lobby you can say "I'm going to this building" and get a shuttle. We built a fun trivia-questions game that we deployed in a corridor near one of our kitchens, where the system would try to engage people that go by into this questions game; it might ask you what the longest river in the world is, and then you try to figure out the answer. The interesting bit here is that the system is trying to do this, in some sense, cooperatively: it's trying to get people to reach a consensus before revealing the answer and moving to the next question. We did a lot of interesting studies on engagement there: how do you attract a bystander? A lot of times people kind of sit back and watch from a distance what happens, so we worked on how to draw bystanders into an interaction. So again, studying various problems related to multiparty dialogue in open-world settings.
0:09:05 We've done work that has nothing to do with language as well; I'm using the term situated interaction purposefully, because my interests are in how we get machines to interact with people, whether there's language involved or not.
0:09:18 This is an example of a system we call the third-generation elevator. What you're seeing here is a view from the top in our atrium; there's, let's see if this works, the elevator doors are over there. This is a fisheye-distorted view from the top, and this is in front of the bank of elevators where people are going by. So we built a simple model that just uses optical flow and, based on features from optical flow, tries to anticipate by about three seconds when the button will be pushed. So as you walk towards the elevator it pushes the button for you; the idea was, let's build a Star Trek elevator. But if you just simply go by, you know, nothing happens. Now, it's not necessarily that I think this is how elevators will work in the future, but it's an exploration and a nod to this idea that machines should be able to reason about and think about how people behave in physical space, and drive interesting interactions off of that. The system has been running for years in our lobby, and by now no one even notices it's there; in some sense it just works.
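The talk doesn't spell out the details of that model, but the general pattern is simple enough to sketch. Below is a minimal, hypothetical illustration (not the actual system): optical flow over a short sliding window is summarized into a few features, and a classifier is trained to predict whether the button will be pressed within the next few seconds. The feature set and the logistic-regression choice are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def flow_features(flow_window):
    """Summarize a short window of optical-flow frames (T, H, W, 2) into a
    fixed-length feature vector: motion magnitude, net direction, and trend."""
    mag = np.linalg.norm(flow_window, axis=-1)            # (T, H, W)
    vx, vy = flow_window[..., 0], flow_window[..., 1]
    return np.array([mag.mean(), mag.max(),
                     vx.mean(), vy.mean(),
                     mag[-1].mean() - mag[0].mean()])      # is motion increasing?

def make_dataset(flow_frames, press_times, fps=10, window_s=1.0, horizon_s=3.0):
    """Label each sliding window 1 if a button press occurs within horizon_s seconds."""
    win = int(window_s * fps)
    X, y = [], []
    for t in range(win, len(flow_frames)):
        X.append(flow_features(flow_frames[t - win:t]))
        t_sec = t / fps
        y.append(int(any(0 <= p - t_sec <= horizon_s for p in press_times)))
    return np.array(X), np.array(y)

# toy usage: random arrays stand in for real optical-flow fields
rng = np.random.default_rng(0)
flow = rng.normal(size=(600, 24, 32, 2))     # 60 s of flow at 10 fps
presses = [12.0, 31.5, 48.2]                 # observed button-press times (s)
X, y = make_dataset(flow, presses)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("P(press within 3 s):", clf.predict_proba(X[-1:])[0, 1])
```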
0:10:20 Within the last years we've also started looking in the direction of interaction with robots, so human-robot interaction, and a system that we've done a lot of research with is these directions robots. We have three of these guys, deployed on each of the floors in a building as you come out of the elevator, and they can give you directions inside the building: you can ask for meeting rooms or various people, and they can direct you there.
0:10:47 (Demo video: the robot directs a visitor to a conference room, "turn right and go down the hall; it will be the first room on your right," and later directs another visitor to John's office, number 4120: "take the elevator to the fourth floor, turn right when you exit, and continue to the end of the hall.")
0:11:25 Okay, so hopefully this gives you guys a sense of the class of systems we've been building, working with, and doing research with over the years. Now, when you try to build these things and have them actually work in the wild, in this kind of uncontrolled setting, you quickly run into a number of problems that otherwise you might not even think of or consider. A lot of the problems in interaction, I think, we as humans solve subconsciously; this is so ingrained in us that we don't think about it. But once you try to do something with a machine and computationalize it, you run into the actual problems. So the first problem you have to solve is that of engagement: knowing who am I engaged in an interaction with, and when. This is all obvious to us whenever we're in an interaction, but a machine needs to reason about it. For instance, here it needs to reason that even though these two guys are looking away from it at this moment, they're actually still engaged in an interaction with the machine; they're looking away because the robot just pointed over there. And she, well, she's been looking at the machine all the time, but she's actually not engaged in this conversation. Going one step further, the robot might reason that, well, perhaps she is in a group with them and waiting for them, or perhaps she's not in a group with them but has an intention to engage with the robot once they're done. There's all this reasoning that we do as kind of, you know, automatic, and we don't think about it, but you have to kind of program the machine to do it.
0:12:49 Once you can solve the problem of engagement, the next problem you have to solve is that of turn-taking. You know, the standard dialogue model we all often work with is one where dialogue is a sequence of alternating utterances by the system and the user, system and user; this breaks to pieces immediately once you're in a multiparty setting. You need to reason not only about when utterances are happening, but about who is producing them, who the utterances are addressed to, and whom the producer expects to talk next: who is the next ratified speaker here? Should I, as a robot, inject myself at the end of this utterance that I heard, or should I wait, because someone else is gonna respond? So the problem gets more complex, and again, all of this we do on automatic, and it's regulated with gaze, with prosody, with how we move our bodies, and so on. And only once you can kind of deal with these two problems can you start worrying about speech recognition and decoding the signals, about understanding what is actually contained in the signals that we send to each other,
0:13:50 and about doing the high-level interaction planning and dialogue control. So in some sense we view this as almost like a minimal set of communicative competencies that you need to have to do this kind of interaction in open-world settings. And over the years our research agenda has basically been looking at various problems in these processes, trying to leverage the information we have about the situated context: the who, the what, and the why of the surroundings.
0:14:23 So that's kind of the very high-level, kind of fuzzy, one-slide view of what the research has been about at MSR in the last ten years in this space. I'm gonna dive in now and show you two different examples in a little bit more detail. I'm not gonna go very technically deep; I'll point you to the papers, and I'm happy to talk more offline, but I want to give you a sense of what the research problems look like. I'm gonna start with a problem that has to do with engagement, which I've already mentioned.
0:14:54 Engagement, as Candace Sidner refers to it, is the process by which participants initiate, maintain, and terminate the conversations that they jointly undertake. Now, you know, in a lot of classical dialogue work, I mean in telephony applications or mobile phones and so on, this is a trivial problem to solve, right? I push a button, I know I'm engaged; I pick up a phone call, I'm engaged. I don't have a really big problem to solve. However, if you have a robot or a system that's embodied and situated in space, this becomes a more complex problem.
0:15:24 And just to illustrate sort of the diversity of behaviors with respect to engagement that one might have, we captured this video; this was many years ago, at the start of this work. It's a video from the receptionist prototype, the one that was doing the shuttle reservations, and it mostly highlights how, by reasoning about three engagement variables in particular (engagement state: am I in a conversation or not; engagement actions, which regulate the transitions between the states; and engagement intentions, which are different from the states), you can construct fairly sophisticated policies in terms of how you manage engagement in, you know, group settings.
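Purely as an illustration of how those three variables might be represented and combined, here is a hypothetical sketch, not the actual receptionist's model; the thresholds and rules below are invented for the example.

```python
from dataclasses import dataclass
from enum import Enum

class EngagementState(Enum):
    NOT_ENGAGED = 0
    ENGAGED = 1
    SUSPENDED = 2            # engagement currently on hold

@dataclass
class ActorBelief:
    state: EngagementState
    p_engage_action: float        # inferred engagement action (e.g. approach + sustained gaze)
    p_intends_to_engage: float    # inferred intention to engage later

def engagement_decision(actor: ActorBelief) -> str:
    """Toy policy: decide how the system should act towards one actor."""
    if actor.state is EngagementState.ENGAGED:
        return "maintain"                 # keep the current conversation going
    if actor.p_engage_action > 0.8:
        return "engage"                   # they are initiating; reciprocate
    if actor.p_intends_to_engage > 0.6:
        return "acknowledge"              # e.g. a glance or "I'll be right with you"
    return "ignore"

print(engagement_decision(ActorBelief(EngagementState.NOT_ENGAGED, 0.2, 0.7)))
```

In the actual systems the state, action, and intention estimates come from probabilistic inference over the sensed scene rather than being set by hand.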
0:16:08 So I'll play this video for you in a second; just before I do that, to help you with the legend here and all this annotation: a yellow line below a face means this is who the system is engaged with at some point (this is the system's viewpoint, what it sees as one of these avatar heads up there); a dotted line is an engagement that is currently suspended; and the red dot moving around, right now it's on Eric's face, shows the direction of the avatar's gaze. So I'll run this for you; sorry for the quality of the audio here.
0:16:42 (Demo video plays: the receptionist manages engagement with multiple participants.)
0:18:19 So there are many behaviors in here that fly by pretty fast. Like, for instance, when the receptionist turns from Eric to me and my attention is on my cellphone, it says "excuse me" and waits for my attention to come up to continue that engagement. Or, at the end, when I'm standing somewhat farther away in the distance, the moment I turn my attention towards it, even though I'm at a distance, it briefly initiates this engagement because, you know, as I still have this task of getting the shuttle, it can give me an update. There's a lot of behaviors that you can create from relatively simple inferences. Now, obviously, this is a demonstration video that was shot in the lab, and probably we had to do it, I don't know, three or five times to get it right. This stuff does not work that well when you put it out there in the wild, and I will show you in a second how well it works in the wild. But this is almost like a North Star video, a North Star direction for us in our research work: we wanna be able to create systems where the underlying inference models are so robust that we can actually have this kind of fluid interactions out there in the wild.
0:19:21 So let me show you how it works in practice and talk about a particular example of a research problem in this space. Let me start with this video that kind of motivates it; this is a video from the directions robot, and pay attention to how badly the robot is negotiating disengagement, the moment of breaking off the interaction.
0:19:48 (Demo video: "Do you need help finding something?" The robot gives directions to a room down the hallway, asks the person to swipe their badge, thanks them, and asks "Is there anything else I can help you find?" "Nothing." "Okay." As the person starts to leave, the robot asks again, "Can I help you find something else?" "No, thank you." "Okay then. Bye.")
0:20:21 Not very good, right? So what happens here? Well, what happens here is that at this point in time it's obvious to all of us that this interaction is over, but all the machine sees is just the rectangle of where the face is; back in the day that's all the tracking we were doing, and it doesn't understand this gesture. And so at this point the robot continues the dialogue with "is there anything else I can help you find?", and this is quite a long production. Now, what's interesting here is that just a couple of seconds right after that, by this point, by this frame, the robot's engagement model can actually tell that this person is disengaging. But by that time it's already too late, because we've already started producing this "is there anything else...", and the person hears the sentence, and we're in this bad loop now where we are basically not negotiating the disengagement properly, and the person starts coming back, so now they're engaged again, and we get into this problem.
0:21:22 So what's interesting here is that the robot eventually knows. And so the idea that comes to mind is: well, if we could somehow forecast, from here, that at some future time this person is likely to disengage with some good probability, we could perhaps use hesitations to mitigate the uncertainty; people often use hesitations in situations of uncertainty. So if we could somehow forecast, even if we're far from perfect in that forecast, that at t0 plus delta this person might be disengaging, then instead of launching this production we could launch a filler, like a hesitation, like "so...". And then, if at t0 plus delta we find them disengaging, we say "so... well, I guess I'll catch you later then", or if, alternatively, they are not, we can still say "so... is there anything else I can help you find?", and that doesn't sound too bad. And so the core idea here is: let's forecast what's gonna happen in the future, and maybe use hesitations to mitigate the associated uncertainty.
0:22:20 Now, how do we do this? Well, we have an interesting approach here that is in some sense self-supervised: the machine eventually knows, so we can leverage that knowledge. You basically roll back time and learn from your own experience, basically without the need for any manual supervision. So you have a variety of features; I'm illustrating here three features, like the location of the face in the image and the size of the face, which kind of sense when they start moving away (the size of the face is kind of a proxy for how far away from you they are). We have all sorts of probabilistic models, for instance for inferring where their attention is: is their attention on the robot, or is it somewhere else? And there are many such features in the system.
0:22:59 Now, the idea is that you start with a very conservative heuristic for detecting disengagement. You wanna be conservative, because the flip side of the equation, breaking the engagement when someone is still engaged, is even more painful; you don't want to kind of stop talking to someone while they're talking to you. So you stay on the conservative side, which means you're gonna be late in detecting when they disengage, but you will eventually detect that they disengaged: at some point you will exceed some probability threshold that says they're disengaging. And then what you can do is, like I said, roll back time. So let's say you want to anticipate that moment by five seconds: it's easy to automatically construct a label that looks like that and, five seconds ahead of time, predicts that event. And then you train a model from all these features that you have to predict this label. Now, this model is gonna be far from perfect, but you'll probably detect that moment a bit earlier on. So if you use the same threshold of 0.8, you might be able to detect it by this much earlier; we call this the early detection.
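As a rough sketch of that self-supervised recipe (an illustration under assumptions, not the actual system: the per-frame feature set, the 5-second lead, and the logistic-regression classifier are stand-ins), you can shift the late but reliable detections of the conservative heuristic back in time to create labels, and fit a standard classifier on the multimodal features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rollback_labels(heuristic_fired, lead_frames):
    """Shift the (late but reliable) disengagement detections earlier in time:
    frames within lead_frames before each detection become positive examples."""
    y = np.zeros_like(heuristic_fired)
    for t in np.flatnonzero(heuristic_fired):
        y[max(0, t - lead_frames):t + 1] = 1
    return y

# one feature vector per frame, e.g. [face_x, face_y, face_size, p_attention_on_robot]
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4))
heuristic = (rng.random(2000) > 0.995).astype(int)    # conservative, hence late, detections
y = rollback_labels(heuristic, lead_frames=5 * 30)    # anticipate by 5 s at 30 fps

model = LogisticRegression(max_iter=1000).fit(X, y)
p_disengage_soon = model.predict_proba(X[-1:])[0, 1]
if p_disengage_soon > 0.8:
    print("launch a hesitation ('so...') instead of the full prompt")
```

Because the labels come from the system's own later detections, this kind of training can in principle also happen online, which is the point the talk returns to a bit further on.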
0:23:59 And so then you go and train models with all these features. Really, the technical details are not that important here; the point I wanna make is a high-level one. In this case I think we used logistic regression and boosted trees, whatever your favorite machine learning technique is, and you can see that, for the same false positive rate, you can detect the disengagement earlier than with the baseline heuristic. The other sort of high-level lesson is that by using multimodal features you tend to improve performance: we used features related to the focus of attention, location, and tracking confidence scores, and dialogue features like the dialogue state, how long we've been in it, and so on. Each of these individually does something, and when you add them all up together you get better results, which is generally something that tends to happen with multimodal systems.
0:24:44 Again, the high-level point I wanna make here is that forecasting is a construct I think is very interesting. There's been a lot of work recently in dialogue on incrementality, and I think forecasting goes hand in hand with that, because it's very important in order to be able to achieve the kind of fluid coordination we want; we probably have to anticipate more. And then it also presents these interesting opportunities for learning from experience without manually labeling data, because in general, if you wanna forecast an event, you have the label: you know when it happens, you just know it too late. But you can still learn from all of that, and you can do that online, and the system can adapt to the particular situation it's in. So I think those are a couple of interesting lessons, sort of the high-level takeaways from this work.
0:25:31 I'm gonna switch gears and talk about a different problem that lives, relatively speaking, more in the turn-taking space. You know, just like engagement is this rich, mixed-initiative process by which we regulate how we initiate interactions, turn-taking is also, you know, a mixed-initiative process, incrementally controlled by the participants; it is the process by which we regulate who takes the floor, who talks next in conversation. And as I mentioned before, in a lot of traditional dialogue work we make the simple turn-taking assumption of you speak, then I speak, then you speak, then I speak; maybe there are barge-ins that are being handled. In multiparty settings you really need to build a more sophisticated model, because you need to understand who's talking to whom at any given point in time and when it is your turn to speak.
0:26:20 And we've done a bunch of work in that direction. I'm not gonna show you that; I'm gonna show you a different problem that relates to turn-taking, one that I think illustrates even better this high degree of coordination and multimodality in situated dialogue, and it has to do with coordination between speech and attention. In some sense this work was prompted by reading some of Goodwin's work on disfluencies and attention. So Goodwin made this interesting observation about disfluencies in, you know, one of his papers.
0:26:53 We all know that if you look at transcripts of conversational speech, it's full of false starts and restarts and disfluencies, so they're gonna look like, you know, the speaker says "anyway, we went to... I went to bed", or "Brian, you're gonna have... you can still have to go", these kinds of restarts, transcribed very literally. In conversational speech these are everywhere, and they create problems for the speech recognition people and the language modeling people and so on; conversational speech is hard. Well, Goodwin had the interesting insight of looking at this in conjunction with gaze. So here's the listener's gaze, and the region in red dots is where the listener is not looking at the speaker; this is the point where mutual gaze gets reestablished, and then we have mutual gaze between listener and speaker. And something that's really interesting in these examples is that things become much more grammatical in the regions of mutual gaze.
0:27:49 And this leads to kind of an interesting hypothesis: that maybe disfluencies are not just errors in production; maybe some of these disfluencies we produce actually fulfill a coordinative purpose. They are used to regulate and coordinate and make sure that either I'm able to attract your attention back if it has drifted away, or that whenever I deliver what I want to deliver, I really have your attention. And so, partly inspired by this work and partly inspired by behaviors in our systems, we did a bunch of work on coordinating speech and attention.
0:28:25 So let me show you an example, in contrast to what humans are able to do without thinking about it. Here our robot is not able to reason about where the person's attention is. There are a bunch of speech recognition errors in this interaction as well, but I'd like you to pay more attention to basically how the robot is not able to take into account where the participant's attention is as the interaction is happening: she's just looking at her phone, trying to get the number for the meeting she's going to, but the robot is ignoring all that.
0:28:56 (Demo video: the robot keeps asking where she is going while she is looking down at her phone.)
0:29:19 Or maybe not; so she's, you know, just looking at her phone trying to find the meeting, and the robot keeps pushing this question over and over, "where are you going, where are you going", and so that's, you know, quite different from what people do.
0:29:32 So, inspired by Goodwin's work, we did some work on basically coordinating speech with attention, and the idea here was to have a model where, on one hand, we model the attentional demands, like where does the robot expect the person's attention to be, and on the other hand we model the attentional supply: where is the actual attention going? So attentional demands are defined at the phrase level: for every output that the robot is producing, at the phrase level, we have an expectation about where attention should be. In most cases it probably should be on the robot, but that is not always the case: when I point over there and say "to get to 3800...", I might expect that your attention will go over there, and if your attention doesn't actually go over there, maybe we have a problem. So we specify these; they are manually specified, basically, just like the natural language generation: for every output we have one of these expected attention targets. And then, on the other hand, we make inferences about where your attention actually is, and we do that based on machine learning models that use video features and so on and so forth.
0:30:34 Whenever there's a difference between the two, instead of just ballistically producing the speech synthesis, we use this coordinative policy that basically interjects the same kinds of pauses and filled pauses and false starts and restarts that humans do; it basically creates these disfluencies to get to a point where attention is exactly where we expect it to be, and only then do we continue. So instead of saying "to get to 3800...", we might pause for a while, say "excuse me", say the first two words "to get...", pause more, and so on, before we actually produce the utterance. And all of this is again done on a phrase-by-phrase basis.
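To make the shape of that policy concrete, here is a minimal sketch, assuming a per-phrase expected attention target from generation and a per-frame attention estimate from perception (the function names, repair moves, and thresholds are illustrative assumptions, not the actual implementation):

```python
import time

# each output phrase carries an expected attention target, set at generation time
phrases = [
    ("to get to 3800,", "robot"),
    ("walk to the end of this hallway.", "hallway"),   # accompanied by a pointing gesture
]

def attention_on(target):
    """Stand-in for the perception stack: True if the user's inferred
    attentional focus currently matches `target`."""
    return False   # placeholder

def speak(text):
    """Stand-in for speech synthesis."""
    print(text)

def produce(phrases, max_repairs=3, wait_s=0.5):
    fillers = ["...", "so...", "excuse me,"]
    for text, expected_target in phrases:
        repairs = 0
        # coordinate before committing to the phrase: pause, filler, partial restart
        while not attention_on(expected_target) and repairs < max_repairs:
            filler = fillers[repairs] if repairs < len(fillers) else text.split()[0] + "..."
            speak(filler)
            time.sleep(wait_s)
            repairs += 1
        speak(text)   # attention recovered, or we gave up waiting: deliver the phrase

produce(phrases)
```

The point of the sketch is only the control flow: commit to a phrase once the inferred attention matches the expected target, and spend the waiting time on fillers and partial restarts rather than on silence or a fully ballistic delivery.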
0:31:13 Here is again a demonstration video, with Eric and me as bad actors, trying to illustrate this behavior. (Demo video plays: the robot pauses and interjects "excuse me" when the listener's attention drifts away, then resumes the directions.)
0:31:58 So it's still a bit clunky, you know, but you get the sense and the idea. Let me show you a few interactions captured in the wild once we deployed this coordinative mechanism. Here, basically, the regions in black are the production that, you know, the robot normally produces with the synthesis (these are phrase-boundary delimiters), and the regions in orange are these filled pauses and interjections that are dynamically injected on the fly, based on where the user's attention is.
0:32:40 (In-the-wild recordings play: the robot inserts pauses and "excuse me" interjections as users look away during the directions.)
0:33:25 So that "excuse me" might be a bit aggressive; you know, there's a lot of tuning. Once you put this in there, you realize the next layer of problems that you have, like how the synthesis is not quite conversational enough, and, you know, the nuances of saying a "so" versus an "excuse me" and so on. And while these videos, again, might make it look like we can go quite far, I don't wanna leave you with the wrong impression: a lot of work remains to be done. These things often fail; the videos I've shown you are moments where things work relatively well, I would say, but these things often fail, and I want to show you one interesting example of a failure.
0:34:02 (Demo video of a failure case plays: midway through the directions, the person starts to walk away while the robot pauses and waits for their attention to return.)
0:34:19 So, this is a moment to say "whoops". What actually happens here? Well, what happens here is that we are coordinating; you know, we're paying a lot of attention to coordinating our speech with the participant's attention, but we're completely ignoring what his upper body and torso are signalling. So what happens here is the robot gets to this phrase where it says "to get there, walk to the end of this hallway", at which point the person feels that maybe this is the end of the instructions, so they start turning both their face and their body to kind of indicate that they might be leaving, right? The robot sees their attention go away and thinks, "well, I'm gonna wait for their attention to come back", and the long pause that gets created further reinforces the person's belief that this is the end of the directions, so they just go, even though the robot had all these other things to say, right?
0:35:09 And so the robot, in some sense, ignores the signal from his upper body, and if the robot could take into account that signal, we could be a bit smarter and maybe not wait there; maybe use a different mechanism to get their attention back, or maybe just blast through it, because you don't always have to coordinate in exactly that way, right? And so I love this example, because it really highlights and drives home this point I'm trying to make: I think that dialogue is really highly coordinated and highly multimodal. Dialogue between people in face-to-face settings has these properties, you know: we've talked about coordinating speech and gaze, and we've seen in this example how not reasoning about body pose gets us into trouble.
0:35:52 There are many other things going on: we do head gestures, like nods and shakes and all sorts of other head gestures; there's a myriad of hand gestures, you know, from beats to metaphoric to iconic to deictic gestures; facial expressions, smiles, frowns, expressions of uncertainty; where we put our bodies and how we move dynamically; prosodic contours. All of these things come into play, and they're highly coordinated, frame by frame, moment by moment, and the coordination that happens is not just across the channels, it's across people and these channels. And so I'd like us to think about dialogue in this way, moving from a view of, you know, a sequence of turns to a view of a multimodal, incrementally co-produced process. And I think if we do that, there are a lot of interesting opportunities, because of these enabling technologies that are coming up these days.
0:36:46 So I've shown you a couple of problems in the space of turn-taking and engagement. There are many more problems, and every time we touch one of these we really feel like we've barely scratched the surface. Take, for instance, engagement: I talked for a bit about how to forecast disengagement and maybe negotiate the disengagement process better, but there are many other problems. How do we build robust models for making inferences about those engagement variables, like states, engagement actions, and intentions? How do we construct measures of engagement that are more continuous? Here all the work we've done is on "I'm engaged" or "I'm not engaged", but in educational or tutoring or other kinds of settings you want a more continuous measure of engagement; how do you reason about that? Similarly, there are many other problems in turn-taking, in understanding how we ground all these things in the physical situation; there are interesting challenges with rapport, with negotiation, with grounding. There's lots of open space, lots of interesting problems, once you start thinking about how the physical world and all these channels interact with each other.
0:37:50 Like I said, I think we have these interesting opportunities because there has been a lot of progress in the vision and perception space: head tracking, facial expression tracking, smile and affect recognition, and so on, that can help us move in this direction. I think the other thing that I really want to highlight, besides the current technological advances, and that I think is very important, is all this body of work that comes from connected fields like anthropology, sociology, psycholinguistics, sociolinguistics, conversation analysis, context analysis, and so on.
0:38:23 There's a wide body of work: basically, as soon as people got their hands on videotapes in the fifties and sixties, they started looking carefully at human communicative behaviors, and all that work was done based on, you know, small snippets of video. If you think about it, today we have millions of videos and interesting, powerful data techniques, so there are interesting questions about how we bring this work into the present, how we leverage all the knowledge and the theoretical models that have been built in the past. I've put here just some names (there are many more people that have done work in this space) and I've picked one title from each of them; each of these researchers has full bodies of work. I really recommend that, as a community, we look back more on all this work that has already been done on human communication and try to understand how to leverage it when we think of dialogue.
0:39:12 So, with that, I guess I have about ten minutes left; I want to kind of switch gears a bit and talk more about challenges, because, you know, there's a lot of opportunity, there's a lot of open field, but working in this space is not necessarily easy either. And when I think of challenges, at a high level I think of three kinds of categories. There are obviously the research challenges that we have, like "I wanna work on this problem of forecasting disengagement, how will I solve it"; there are obviously those research challenges, but I'm gonna leave them aside and try to talk about two other kinds of challenges. One is data and experimentation challenges, and we touched briefly on this in the panel yesterday: I think getting data for these kinds of systems is not easy.
0:39:59 If you look at a lot of our adjacent fields, like machine translation, speech recognition, NLP and so on, a lot of progress has been accomplished by, you know, challenges with datasets and clear evaluation metrics and so on. In dialogue this is not easy to do, and it is not easy to do because dialogue is an interactive process: you cannot easily study it on a fixed dataset, because by the time you've made an improvement or changed something, the whole thing behaves differently. And so that creates challenges generally for dialogue, and even more so for multimodal dialogue, in the multimodal space.
0:40:33 Then, apart from the data challenges, there are also experimentation challenges. We've done a lot of the work we've done in the wild, because I feel like you see the real problems, you see ecologically valid settings, and you see what really happens. Some of these phenomena are actually quite challenging and hard to study in controlled lab settings, like studying how engagement works and how these breakdowns happen; you can think of all sorts of setups with confederates and try to, you know, figure out controlled experiments, but it's not easy. On the other hand, experimenting in the wild is not easy either, for many reasons. One of the other kinds of challenges here is purely building up the systems, right? In our work over the last ten years, the way we've gotten our data is by building systems and deploying them, right?
0:41:21 But building systems is hard, and so in the last five minutes I wanna talk a bit about the engineering challenges, because I think they're just as important, in that they kind of put a damper on the research and stifle things from moving forward faster. Building these kinds of multimodal systems is hard for a number of reasons. First, there's a problem of integration: they leverage many different kinds of technologies that are of different types and operate on different time scales, and the sheer complexity and the number of boxes you have to have in one of these systems kind of makes the problem challenging. But then there are other things, where constructs that are pervasive in these systems, like time, space, and uncertainty, are nowhere in our programming fabrics. Like, it's kind of clear to me that time, for instance, is not a first-order citizen in any programming language that I can think of, so every time I wanna do something that works over time, or over streaming data, I have to go implement my buffers and my streaming and, you know, kind of start from scratch; and it's similar for space and uncertainty.
0:42:23 But it is very important, because we want to create systems that are fluid, but the sensing, thinking, acting, all of these things take time. Being fast is not even enough; oftentimes you need to do fusion in these systems, and things arrive with different latencies, so you need to coordinate; basically you need to deal with and reason about time in a deeper sense, deep down in the fabric. And the same things can be said, I think, in these systems, about the notions of space and the notions of uncertainty. And finally, the other thing that kind of puts a damper on this is the fact that the development tools we have are not there for this class of systems, right? The development environments and debuggers and all of this stuff were not developed with this class of systems in mind. And if I think back on all the work we've done, I don't know, half of the time was maybe spent on building the tools to build the systems, rather than building the systems or doing the research, right?
0:43:20 And so, basically driven by a lot of the lessons we've learned over the years, in the last three or four years at MSR we basically embarked on this project, and I wanted to spend the last couple of minutes telling you about it, because if there are any people in the room that are interested in joining this space, it might be useful for them. We've worked on developing an open-source platform that basically aims to simplify building these systems, the end goal being to lower the barrier to entry and enable more research in this space. So it's a framework that is targeted at researchers; it's open source, and it supports the construction of this kind of situated, interactive system. We call it Platform for Situated Intelligence, which is kind of a mouthful, so we abbreviate it as psi, written and pronounced like the Greek letter psi. And I want to just give you a whirlwind tour in two minutes, just to kind of give you a sense of what's available in there.
0:44:19 The platform consists of three layers: there's a runtime layer, a set of tools, and a set of components. The runtime basically provides all this infrastructure for building systems that operate over streaming data and have latency constraints; anytime you have something interactive, it's latency constrained. So there's a certain model for parallel, coordinated computation that actually feels pretty natural: you just kind of connect components with streams of data, so it's the standard sort of dataflow model. But the streams have really interesting properties, and I don't have time to get into the full detail and all the glory here.
0:44:59 But I wanna kind of highlight some of the important aspects. So, for instance, I mentioned time, and how time should be a first-order citizen: well, we baked that in from day one, deep below in the fabric. All messages that are flowing through are time-stamped at the origin, when they're captured, and then, as they flow through the pipeline, we have access not only to the time the message was created by the component that created it, but also to that originating time. So we know this message has a latency of four hundred and thirty milliseconds; so across the entire graph we can see latency at all points, which enables synchronization. So we provide a whole time algebra and synchronization mechanisms for when you work with streaming data, that pair these messages correctly, and so on. So it's basically all about enabling coordinated computation where time is really a first-order citizen.
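Purely as a conceptual illustration of that idea (this is not the actual psi API, which is a .NET framework; the types and the join function below are hypothetical), the key point is that every message carries its originating time separately from its creation time, so latency is always known and streams can be synchronized on originating time:

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Message:
    data: Any
    originating_time: float   # when the underlying observation was captured
    creation_time: float      # when the producing component emitted the message

    @property
    def latency(self) -> float:
        return self.creation_time - self.originating_time

def join_by_originating_time(a: List[Message], b: List[Message], tolerance=0.033):
    """Pair messages from two streams whose originating times are within
    `tolerance` seconds of each other, regardless of how late each one arrived."""
    pairs = []
    for ma in a:
        closest = min(b, key=lambda mb: abs(mb.originating_time - ma.originating_time))
        if abs(closest.originating_time - ma.originating_time) <= tolerance:
            pairs.append((ma, closest))
    return pairs

audio = [Message("audio frame", t, t + 0.05) for t in (0.00, 0.03, 0.06)]
video = [Message("face detection", t, t + 0.43) for t in (0.00, 0.03, 0.07)]
for a, v in join_by_originating_time(audio, video):
    print(a.data, "+", v.data, "| video latency:", round(v.latency, 2), "s")
```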
0:45:49 The streams can be automatically persisted, so there's a logging infrastructure that is there for free for any data type: you know, you can stream any of your data types and we can automatically persist those. And because we persist them with all this rich timing information, we can enable more interesting replay scenarios, where I say, well, forget about these sensors, let's play back from disk and tune this component; and I can play it back from disk exactly as it happened in real time, or I can speed it up or slow it down. Time is entirely under our control, because it's baked deep down in the fabric. So these are some of the properties of the runtime; there's a lot more. It's basically a very lightweight, very efficient kind of system for constructing things that work with streaming data. At this level we don't care, we don't know anything about speech or dialogue or such components; it's agnostic to that, and you can use it for anything that operates over streaming data with temporal constraints.
0:46:42 The set of tools we built is heavily centered on visualization. This is a snapshot from the visualization tool we have; on the right there someone's actually editing it, and this video is sped up a bit, but these are the streams that were persisted in an application, and these are just visualizers for different kinds of streams that can get composited and overlaid. So this is a visualizer for an image stream, this is a visualizer for a face-detection results stream, this is audio, this is a voice activity detection, that's a speech recognition result, and this is a visualizer for a full 3D conversational scene analysis. The basic idea is that you can composite and overlay these visualizers, and then you can navigate over time, left and right, and zoom in and look at particular moments. This is very powerful, especially when coupled with debugging. And we're evolving this to visualize not just the data collected and running through the systems, but also the architecture of the system itself, you know, the view of the component graph, and also towards annotation, for supporting data annotation.
0:47:45 Finally, at the components layer, we are hoping to create an ecosystem of components where people can plug and play different kinds of components. We're bootstrapping this with things like sensors, imaging components, vision, audio, and speech output; it's a relatively simple set of components that we have in the initial ecosystem, but the idea is that it's meant to be an open ecosystem and people are meant to contribute to it. It's an open-source project, and there's already, at Boise State, Casey Kennington, who has his own repository of psi components. So people are starting to use this, and the hope is that, as more people use it, if I can get you to have eighty percent of what you need off the shelf, you can just focus on your research. That's the key idea.
0:48:28 The last thing I'll say, and this is something we haven't released yet but are planning to release in the next few months, is an array of components that we refer to as the situated interaction foundation. It's basically a set of components at that level, plus a set of representations, that aim to further abstract and accelerate the development of these physically situated interactive systems. Basically, what we are planning to construct is the ability to instantiate a perception pipeline where you, as the developer of the system, just say where your sensors are and what sensors you have. So in this instance, there's, you know, a Kinect sensor: the big box there represents my office, and there's a Kinect sensor sitting on top of the screen. And if you tell me you have three sensors, I'm gonna use the data from all three sensors and fuse it; we're gonna configure the perception pipeline automatically from all the sensors we have, with the right fusion.
0:49:22 And we'll provide this kind of deep scene analysis object that runs at frame rate, at thirty frames per second, and tells you things like: here's where the people are in the scene and what their body poses are, here's where everyone's attention is. In this case there's an actual engagement happening between the two of us and an agent that's on the screen, and Stewart is, you know, directing his utterance towards, you know, the agent; and at some later point we have peeled off, we've gone more towards the back of the office, towards the whiteboard, and we're just talking to each other. And so we're trying to provide all this rich analysis of the conversation and the conversational scene, including issues of engagement, turn-taking, utterances, sources, targets, and all of that, from the available sensors. And if you give me more sensors, the idea is that you get the same object back, but at a higher fidelity, because we have more sensors and we can fuse data.
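To give a feel for what such a scene-analysis object might look like to a developer, here is a hypothetical sketch of the kind of per-frame data structure described above (the field names are invented for illustration and are not the released API):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Dict

@dataclass
class Person:
    id: int
    position: Tuple[float, float, float]               # 3D location in the room, meters
    body_pose: Dict[str, Tuple[float, float, float]]   # joint name -> 3D position
    attention_target: Optional[str]                    # e.g. "agent", "person:2", "whiteboard"

@dataclass
class Conversation:
    participants: List[int]          # person ids; may include the on-screen agent
    current_speaker: Optional[int]
    addressees: List[int]

@dataclass
class SceneAnalysis:
    originating_time: float
    people: List[Person] = field(default_factory=list)
    conversations: List[Conversation] = field(default_factory=list)

# a downstream dialogue component might poll this at frame rate, for example:
def agent_is_addressed(scene: SceneAnalysis, agent_id: int = 0) -> bool:
    return any(agent_id in c.addressees for c in scene.conversations)
```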
0:50:19 This part has not been released yet; it's coming out probably in the next couple of months. But our hope with the entire framework is basically to accelerate research in this space, to get people to be able to build and experiment with these kinds of systems without having to spend two years constructing all the infrastructure that's necessary.
0:50:37 And so this brings me basically to the end of my talk; I'll conclude on this slide. I've tried to adopt this view of dialogue in this talk, and to portray this view of dialogue as a multimodal, incrementally co-produced process, where participants in an interaction really do fine-grained coordination across all these different modalities. I think there is a tremendous number of opportunities here, and I think it's up to us to basically broaden the field in this direction, because the underlying technologies are coming, and they are starting to get to the point where they're reliable enough to start doing interesting work. And again, there's this big body of work on human communication dynamics that we can leverage and that we can draw upon. So I'll stop here; thank you all for listening, and I welcome all the questions.
0:51:37 Thanks very much, Dan. Thanks, Dan, it was so great to see all this work again and how impressive the research program has been over the years to get to this point. I'm really looking forward to the situated interaction foundation coming out.
0:51:59 I have a question, I guess related partly to that. One of the problems with integration is not just taking a bunch of pieces and putting them together, but the maintenance of that over time as you add new pieces. So, in particular for this last thing: how much can you, just by adding a new component, expect everything else to work the way it did and just get some value added from the new information, and how much do you have to re-engineer the whole architecture to make sure that you're not undoing things or getting into a problem? Thinking, you know, in terms of engineering, the recent plane crashes seem to stem from this kind of thing, where different engineers designed systems very well given a set of assumptions about what else would be there or not, and then that changed under them, and that's what seems to have caused the problem, right?
0:52:51 I mean, I completely agree. I mean, the ideal world is one where, you know, everything works and you plug your thing in, but in reality it's never that way, right? There are gonna be, like, different people with different research agendas, you know, viewing things differently; they have different mental models and different viewpoints from which they look at a problem and attack it. And I think that does create challenges. I don't know how to solve all those challenges; all I can say is that we were kind of aware of that, and when we were constructing this, we were trying to make as few commitments, in some sense, as possible, to allow for the flexibility that's needed for research, because I think there's actual value in all those different viewpoints and different architectures and exploration.
0:53:33 And so, yes, what I can say is that we are purposefully trying to not make hard commitments to, say, what an utterance is: I don't wanna tell you what an utterance is; I want you to have your own opinion of what an utterance is. But that also might mean that, again, when you try to plug your speech recognizer into my system, there might need to be some wrangling and so on, you know, to make these components work together. I don't know how we can solve this problem; I'm not a big believer in "oh, it will all come together with a big, beautiful standard that we'll all agree to", I don't see that happening. We're just trying to design towards flexibility, I would say.
0:54:10 I think that was a wonderful talk, and you're highlighting these things that, you're right, are now ripe for us to be able to address, and we should be working more on this, beyond the simple turn. Sorry, I might be introducing something even more complex down the line, but what about user adaptation? Users, humans, are very good at changing their behaviour based on the system that's in front of them; you know, if it's a human on a phone call and there's a delay, we will not backchannel, because it screws up the conversation.
0:54:48and people can adapt to this forty dollars
0:54:52and that might be confusing to our learning this will then allowed to be able
0:54:56to the
0:54:58two shows the affects that
0:55:01to windsor good adapting to rather the most natural ones of you thought about how
0:55:06to
0:55:07to hear about not getting the human to adapt or to be able to control
0:55:11how the human adapts to the particular system
0:55:14and the policies that you're doing that are adaptation
0:55:18no i think it's a very interesting question so i think
0:55:20there's a couple of things here one is i do notice in a lot of
0:55:24the data that we observe a large variability
0:55:27between people's attitudes and what people do
0:55:30both in you know just the initial stance that they come towards
0:55:33the system with and the expectations they have and also in how they do or do
0:55:38not adapt to whatever the system is doing
0:55:40well i guess my view one thing i would say is i think more
0:55:45of these systems should be learning continuously because you are basically in a continuous dance with
0:55:51the person on the other end in this adaptation you know and
0:55:54doing things in big batches
0:55:56is likely to create more friction than doing things in a continuous adaptive way so
0:56:00i think that's an interesting angle for attacking that problem
0:56:04i feel
0:56:05a lot of the work i do the way i'm thinking of it is i want
0:56:08to reduce this impedance mismatch in the interaction between where machines are and where people are and i
0:56:13think we still have a long way to travel with the machines in this regard
0:56:17people will always come to wherever the machines are and meet them but i think i want the machines
0:56:21to be closer to where the human is and that would make things easier
0:56:25so i think of all of
0:56:26the work we've done in that way i see it as
0:56:30i'm gonna try to reduce that impedance from the machine side as much as possible
0:56:35but you're right people will adapt and sometimes with clever designs you can actually you
0:56:40know create interesting experiences that leverage that adaptation when you know it's gonna happen
0:56:46but i think in most cases i'm in favour of systems that just
0:56:49incrementally adjust themselves to be able to be at the right spot because it continues
0:56:54to shift
0:56:55i don't know if that really addresses the question it was a somewhat roundabout answer
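As a toy illustration of the "adjust continuously rather than in big batches" point (my own sketch, with made-up names and numbers, not the system described in the talk), one could keep a running estimate of the user's response latency and nudge the system's turn-taking wait time after every exchange:

```python
class ContinuousTurnTimer:
    """Toy example: continuously adapt how long the system waits before
    taking the turn, instead of re-fitting a policy in large batches."""

    def __init__(self, initial_wait: float = 0.7, rate: float = 0.1):
        self.expected_latency = initial_wait  # running estimate, in seconds
        self.rate = rate                      # small per-exchange step size

    def observe(self, user_latency: float) -> None:
        # Exponential moving average: each exchange nudges the estimate a
        # little, so the system keeps tracking the user as their behaviour shifts.
        self.expected_latency += self.rate * (user_latency - self.expected_latency)

    def wait_time(self, margin: float = 0.2) -> float:
        # Wait a bit longer than the user's typical latency before concluding
        # they are not going to respond.
        return self.expected_latency + margin

timer = ContinuousTurnTimer()
for latency in (0.9, 1.1, 0.8, 0.6, 0.5):  # hypothetical observed latencies
    timer.observe(latency)
print(round(timer.wait_time(), 2))  # the threshold has drifted toward this user
```

The same per-exchange update keeps working as the user's behaviour drifts, which is the contrast with batch re-training drawn in the answer above.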
0:57:00hi i'm robert from technological university dublin speaking maybe as one of the many
0:57:06people here who over the years have wasted two years of our lives building dialogue
0:57:10systems from the ground up i think what you presented there at the end
0:57:14is fantastic but my question is a bit more specific
0:57:18in terms of the work you did on interjections and hesitations being
0:57:23used to sort of keep the user's engagement
0:57:26in the work in the wild did you do any variation in terms of the
0:57:30multimodal aspects of the task in other words the avatar that was being used the gestures that
0:57:36were being used in fact whether or not using an avatar was a good idea at all
0:57:39that's my fine-grained question and then just a more general question is have you
0:57:45looked at all at the issue
0:57:47of engagement in terms of activity modeling because it's always struck me that a big problem
0:57:51in situated interaction
0:57:53when you move away from the kiosk style where the user is asking a question is
0:57:59that users are engaged in activities and for us to truly get situated interaction working
0:58:05we necessarily need to track the user and what they're doing to be able
0:58:10to make sensible contributions to the dialogue not just answer questions yep
0:58:14so to the first part of the question the short answer is no but we
0:58:18should have
0:58:19like i think there's
0:58:22a rich set of nuances in basically how you do hesitations and
0:58:26interjections and all these policies and definitely the corresponding nonverbal behaviors
0:58:31would affect that
0:58:32and we just saw in the process that the prosodic contours of an
0:58:35"uh" you know "uh" was not such a good choice because
0:58:39as a hesitation it sometimes
0:58:41brings people back in like
0:58:42"so uh what
0:58:43i wanna say" but those are hard to synthesize given the state of
0:58:47the technology we had at the time
0:58:50so i should say that yes we definitely should have considered those aspects
0:58:56the second part of the question remind me what was it
0:59:03so i think you're absolutely right a lot of the work i've shown that
0:59:07we've done actually in the last you know ten years has been
0:59:11well focused on interaction where communication is the whole point like some
0:59:17communication happens between the human and the machine
0:59:19but the whole task is this conversation that we're having
0:59:22we're actually just now starting to do more work with systems where the human
0:59:28is involved in an actual task and not just the communicative task
0:59:31and we're trying to see how the machine can play a supporting role in that
0:59:35and i think you're absolutely right like that kind of brings up the next interesting
0:59:39level of how we really get collaboration going rather than just this kind of back
0:59:44and forth of i can ask or answer a question and so on i think that's
0:59:47a very interesting space and we're just starting to play in that space
0:59:54thank you very much for a very interesting talk i think it's great but
0:59:58this
1:00:00going-out-in-the-wild approach i was just wondering have you
1:00:05i still assume that the microsoft research office has
1:00:09a certain type of people who are in there
1:00:13so it's not completely out in the wild is it so it's sort of a question
1:00:17of have you considered sort of other i mean i guess children or
1:00:22other types of user groups
1:00:25or other types of problems that you might have in a more sort of open setting
1:00:29or something no we haven't so the short answer is again no we haven't
1:00:33but i completely agree like the population we have is just a very narrow very
1:00:37specific one
1:00:39it's interesting to me
1:00:40how much variability i see even in that narrow cross-section which makes me wonder
1:00:44like you know it's interesting that there's a lot of variability even in
1:00:49that narrow population
1:00:50but you're absolutely right like it's not
1:00:53truly in the wild it's not out in a public space
1:00:56and so it would be very interesting to go there and see what happens because
1:01:00yes those populations are different and
1:01:05we haven't done much outside this
1:01:08okay let's thank dan again for a really nice talk