and I have great pleasure in introducing the second keynote speaker of the conference, Dan Bohus from Microsoft Research. Dan is a senior researcher at Microsoft Research, which is where he has been for the last twelve years, and he's going to talk to us about situated interaction.
okay, thanks a lot, thanks Ingrid, thanks for the introduction and also for the invitation to talk. it's great to be back here. I think I missed the last couple of years, but this is always a great place to come back to.
so the title of the talk is situated interaction, and I think it's gonna dovetail pretty well with the panel discussion we had at the end of yesterday about narrowing versus broadening of the field, and interesting questions we might all be working on. there are basically
two main points that i would like to highlight in this talk the first one
is that dialogue is really a multimodal highly coordinated complex affair
that goes well beyond the spoken word
I don't know how many of you are familiar with the work of Ray Birdwhistell, an anthropologist who did some of the seminal work on kinesics back in the sixties, basically studying the role of body movement in communication. and in one of his books he essentially comments on how perhaps the problem with the early records that we have of studies of communication is that they were done by literate people.
now, all joking aside, it is the case that if you look at most of the work we do today in dialogue, it is really heavily anchored in text, in the written word, and at best in the spoken word.
but in reality we do a lot of work with our bodies when we interact
with each other when we communicate with each other and the surrounding physical context also
plays a very important role in these interactions
from where we place ourselves in space relative to each other, the stance we adopt, to where our gaze goes moment by moment, to facial expressions, head nods, hand gestures, prosodic contours.
all of these channels come into play when we interact with each other
and so that's the view of dialogue that i would like to highlight today
the second point that I'm gonna try to make in this talk is that I think we're also at a very interesting time, where in the last decade we've also seen very fast-paced advances based on deep learning in areas like vision and, in general, perception and sensing. and I think these advances are getting us to the point where we're able to start building machines that understand people in physical space, how people move and behave in physical space.
I think it's a very interesting time in that sense. just like in the nineties, advances in speech recognition broke open the field and opened up this whole area of spoken dialogue systems, with all the research that has come from that, and that today has led to these mobile assistants in our pockets, I think these advances in vision and in the perceptual technologies give us a chance to again broaden the field, in this direction of physically situated dialogue and more generally situated interaction.
so what I'm gonna do in this talk is try to give you a sense of this area based on some research vignettes from our own work at MSR over the last ten years or so in this space. and hopefully I'll be able to convey to you my excitement about it, and maybe get more of you to look into this direction, because I think there are a lot of interesting and open problems in this space, and I think a lot of the people in this room have quite a bit to contribute to solving these problems.
so finally, before I get going, before I dive in, I want to make sure I thank the collaborators that I've had over the years. I've been lucky to work with fabulous people at MSR, and to have long-term collaborations with folks like Eric Horvitz and Sean Andrist, who is here, and also many other researchers, talented engineers, and great interns we've had over the years. some of the work you'll see, the work we've done at MSR in this space, would not have been possible without their help, so I want to thank them.
okay, so let's get started. situated interaction. well, I started working in this space shortly after I joined MSR, around two thousand and eight, and the main question that has been driving my research agenda since has been basically: how do we get computers to reason about the physical space around them, and to interact with people in this kind of open-world, physically situated setting, in a fluid and seamless manner?
and the general approach I've taken towards that space has been one where we built a variety of systems and deployed them in the wild. and by deploying in the wild, what I mean in this case is placing them in some public space in our building where people would naturally encounter and interact with them without much instruction. so it's not a controlled setting; they're just deployed somewhere where people just come and interact with them. then we observe the interactions and we let that drive what are the research problems that we want to address; we find what are the problems we need to solve by observing what happens in this kind of ecologically more valid setting, and try to let that
give us direction. and so, to make this concrete and to give you a sense of the variety of systems we've built, I'm gonna start by showing you a few videos, and then we can go more into some of the research questions we've looked at. the first video I'm gonna show you is from a system that we refer to as the assistant. it's a virtual-agent-based system that's placed outside Eric's office and interacts with people that come by whenever he is not available, or maybe when he is available but busy in his office.
and basically the system does some simple assistive-type tasks like handling meetings and taking, you know, some notes to relay, and so on. it's connected to quite a wide infrastructure: it has access to Eric's calendar, but also to other machine-learned models that predict his availability, when is he gonna be back in his office, you know, what's the likelihood that he will attend a particular meeting, and so on. but what I want to highlight with this video is not so much that part as much as the multiparty dialogue, or interaction, capabilities. here the system has a wide-angle camera at the top and a microphone array, and it's able to basically reason about multiple people, understand who it is engaged with, and have dialogue in this kind of open multiparty setting, based on the roles that these people have.
[video plays: two visitors arrive for a five o'clock meeting with Eric; the assistant explains he should be back in about fifteen minutes, offers to let them wait or come back later, and suggests sending him an email message, which he is expected to see within about a minute.]
so over the years we built a variety of these systems based on virtual agents. this is a receptionist prototype aiming to do shuttle reservations on campus, for people moving from one building to another: you go into the lobby, you can say I'm going to this building, and get a shuttle. we built a fun trivia-questions game that we deployed in a corridor near one of our kitchens, where the system would try to engage people that go by into this questions game. like, it would ask you what's the longest river in the world, and then you try to figure out the answer. but the interesting bit here is that it is almost trying to do this, in some sense, cooperatively: it is trying to get people to reach a consensus before revealing the answer and moving to the next question.
we did a lot of interesting studies on engagement, on how do you attract a bystander. a lot of times people kind of sit back and watch from a distance what happens, so we worked on how do you attract bystanders inside an interaction. so again, studying various problems related to multiparty dialogue in open-world settings.
we've also done work that has nothing to do with language. I'm using the term situated interaction purposefully, because my focus is on, my interests are in, sort of, how do we get machines to interact with people, whether there's language or not.
this is an example of a system we call the third-generation elevator. what you're seeing here is a view from the top in our atrium. there's, basically, let's see if this works, the elevator doors are over there; this is a fisheye-distorted view from the top, but this is in front of the bank of elevators where people are going by. so we built a simple model that just does optical flow and, based on features from the optical flow, tries to anticipate by about three seconds when the button will be pushed. so as you walk towards the elevator, it pushes the button for you; the idea was, let's build a star trek elevator. but if you just simply go by, you know, nothing happens. and it's not necessarily that I think this is how elevators will work in the future, but it's an exploration, and a nod to this idea that machines should be able to reason about and think about how people behave in physical space, and drive interesting interactions off of that. and the system has been running for years in our lobby, and by now no one even notices it's there; in some sense it just works.
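to make that idea concrete, here is a minimal sketch of how such an anticipation model could be wired up; this is an illustration under assumptions, not the deployed system: optical flow is summarized into a few motion features per frame, and a classifier trained on past footage (both the feature choice and the classifier here are hypothetical) estimates whether the button will be needed within the next few seconds.

```python
# Illustrative sketch (not the actual MSR system): anticipate an elevator
# button press a few seconds ahead from optical-flow features of an
# overhead camera.
import cv2
import numpy as np

def flow_features(prev_gray, gray):
    # Dense optical flow over the frame; summarize as overall motion
    # magnitude plus a direction histogram weighted by magnitude.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hist, _ = np.histogram(ang, bins=8, range=(0, 2 * np.pi), weights=mag)
    return np.concatenate(([mag.mean()], hist / (hist.sum() + 1e-6)))

def will_push_button_soon(classifier, prev_gray, gray, threshold=0.8):
    # classifier: any model trained on (features, pressed-within-3s) pairs.
    p = classifier.predict_proba([flow_features(prev_gray, gray)])[0, 1]
    return p > threshold
```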
in the last years we've also started looking in the direction of interaction with robots, so human-robot interaction. a system that we've done a lot of research with are these directions robots. we have three of these guys; we have them deployed on each of the floors in our building, as you come off the elevator, and they can give you directions inside the building. so you can ask for meeting rooms or various people, and they can direct you there.
[video plays: the robot directs one visitor to conference room three hundred, telling them to go down the hallway and turn right, and that it will be the first room on the right; it then tells another visitor that John is in office forty-one twenty, to take the elevator to the fourth floor, turn right out of the elevator, and continue to the end of the hall.]
okay, so hopefully this gives you guys a sense of the class of systems we've been building and working with and doing research with over the years. now, when you try to build these things and have them actually work in the wild, in this kind of uncontrolled setting, you quickly run into a number of problems that otherwise you might not even think of or consider.
so, a lot of the problems with interaction, I think we as humans solve unselfconsciously; this is so ingrained in us that we don't think about it. but, you know, once you try to do something with a machine and computationalize it, you run into the actual problems. so the first problem you have to solve is that of engagement: knowing who am I engaged in an interaction with, and when. like, this is all obvious to us whenever we're in an interaction, but a machine has to reason about it. for instance, here it needs to reason that even though these two guys are looking away from it at this moment, they're actually still engaged in an interaction with the machine; they're looking away because the robot just pointed over there. and she, well, she's been looking at the machine all the time, but she's actually not engaged in this conversation. and going one step further, the robot might reason that, well, perhaps she's in a group with them and waiting for them, or perhaps she's not in a group with them but has an intention to engage with the robot once they're done. there's all this reasoning that we do kind of on automatic and we don't think about, but you have to kind of program the machine to do it.
once you can solve the problem of engagement, the next problem you have to solve is that of turn taking. and, you know, the standard dialogue model we all often work with is one where dialogue is a volley of utterances by the system and user and system and user. this breaks to pieces immediately once you're in a multiparty setting. you need to reason not only about when utterances are happening, but about who's producing them, who the utterances are addressed to, and who the producer expects to talk next, so who is the next ratified speaker here. should I as a robot inject myself at the end of this utterance that I heard, or should I wait, 'cause someone else is gonna respond? so the problem gets more complex. and again, all of this we do on automatic, and it's regulated with gaze, with prosody, with how we move our bodies, and so on. and only once you can kind of deal with these two problems can you start worrying about speech recognition and decoding the signals, understanding what is actually contained in the signals that we send to each other, and doing the high-level interaction planning and dialogue control. so in some sense we view this as almost like a minimal set of communicative competencies that you need to have to do this kind of interaction in open-world settings.
and over the years our research agenda has been basically looking at various problems in these processes, by trying to leverage the information we have about the situated context: the who, the what, and the why of the surroundings. so that's kind of the very high-level, kind of fuzzy, one slide about what the research has been about at MSR in the last ten years in this space. and I'm gonna dive in now and show you two different examples in a little bit more detail. I'm not gonna go very technically deep; I'll point you to the papers, and I'm happy to talk more offline, but I want to give you a sense of what the research problems look like. I'm gonna start with a problem that has to do with engagement, which I've already mentioned.
engagement, as Candace Sidner refers to it, is the process by which participants initiate, maintain, and terminate the conversations that they jointly undertake. now, in a lot of classical dialogue work, I mean, in telephony applications or mobile phones and so on, this is a trivial problem to solve, right: I push a button, I know I'm engaged, or I pick up a phone call, I hang up, I'm not engaged; I don't have a really big problem to solve. however, if you have a robot or a system that's embodied and situated in space, this becomes a more complex problem.
and just to illustrate sort of the diversity of behaviors with respect to engagement that one might have, we captured this video many years ago, at the start of this work. it's a video from the receptionist prototype, the one that was doing the shuttle reservations, and it mostly highlights how, by reasoning about three engagement variables in particular: engagement state, am I in a conversation or not; engagement actions, which regulate the transitions between the states; and engagement intentions, which are different from the states. by reasoning about these three key variables, you can construct fairly sophisticated policies in terms of how you manage engagement in, you know, a group setting.
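to pin down what I mean by these variables, here is one minimal way to represent them per actor; this is my reading of the description above, not the actual model, and the enum values are illustrative.

```python
# A minimal sketch of the three engagement variables described above
# (an illustration, not the MSR system's internal representation).
from dataclasses import dataclass
from enum import Enum

class EngagementState(Enum):
    NOT_ENGAGED = 0
    ENGAGED = 1

class EngagementAction(Enum):
    INITIATE = 0      # start an engagement
    MAINTAIN = 1      # keep it going
    SUSPEND = 2       # put it on hold (e.g., turn to someone else)
    TERMINATE = 3     # end it

@dataclass
class EngagementInfo:
    state: EngagementState          # am I in a conversation with this actor?
    intention: float                # P(actor intends to engage); may differ from state
    last_action: EngagementAction   # action regulating the latest transition
```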
so I'll play this video for you in a second. just before I do that, to help you with the legend here and all this annotation: a yellow line below a face means this is who the system is engaged with at that point. this is the system's viewpoint, what it sees; it's one of these avatar heads, that's what it looks like for us. a dotted line is an engagement that is currently suspended. the red dot moving around, right now it's on Eric's face, shows the direction of the avatar's gaze. so I'll run this for you.
sorry for the quality of the audio here
[video plays: the receptionist manages engagement with multiple people, suspending and resuming engagements as participants come and go.]
so there's many behaviors in here that fly by pretty fast. like, for instance, when the receptionist turns from Eric to me and my attention is on my cellphone, it says excuse me and waits for my attention to come up to continue that engagement. or at the end, when I'm passing by further away in the distance, the moment I turn my attention towards it, even though I'm at a distance, it initiates this engagement, because, you know, as I still have this pending task of getting the shuttle, it can give me an update. there's a lot of behaviors that you can create from relatively simple inferences. now, obviously, this is a demonstration video that was shot in the lab, and we probably had to do it, I don't know, three, five times to get it right. this stuff does not work that well when you put it out there in the wild, and I will show you in a second how well it works in the wild. but this is almost like a north-star video, a north-star direction for us in our research work: we wanna be able to create systems where the underlying inference models are so robust that we can actually have these kinds of fluid interactions out there in the wild.
so let me show you how it works in practice and give an example of a particular research problem in this space. I'll start with this video that kind of motivates it. pay attention to how badly, in this case, this is a video from the directions robot, how badly the robot is negotiating disengagement, so the moment of breaking off the interaction.
[video plays: a person asks the robot for help finding a room and gets directions; the robot asks them to swipe their badge, then asks "is there anything else I can help you find?"; the person says no and starts to leave, but the long follow-up question pulls them back before they finally disengage.]
not very good, right?
so what happens here? well, what happens here is that at this point in time it's obvious to all of us that this interaction is over. but all the machine sees is just the rectangle of where the face is; back in the day that's all the tracking we were doing, and it doesn't understand this gesture. and so at this point the robot continues the dialogue with "is there anything else I can help you find", and this is quite a long production. now, what's interesting here is that just a couple of seconds right after that, by this point, by this frame, the robot's engagement model can actually tell that this person is disengaging. but by that time it's already too late, because we've already started producing this "is there anything else", and the person hears the sentence, and we end up in this bad loop where we are basically not negotiating the disengagement properly, and the person starts coming back, so now they're engaged again, and we get into this problem.
so what's interesting here is that the robot eventually knows. and so the idea that comes to mind is, well, if we could somehow forecast from here that at some future time this person is likely to disengage, with some good probability, we could perhaps use hesitations to mitigate the uncertainty; people often use hesitations in situations of uncertainty. so if we could somehow forecast, and we don't even need to be perfect in that forecast, that at t zero plus delta this person might be disengaging, then instead of launching this production we could launch a filler, like a hesitation, like "so". and then if at t zero plus delta we find them disengaging, we say "so... well, guess I'll catch you later then". or if alternatively they're not, we can still say "so, is there anything else I can help you find?", and that doesn't sound too bad. and so the core idea here is: let's forecast what's gonna happen in the future, and maybe use hesitations to mitigate the associated uncertainty.
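as a toy sketch of that policy, assuming hypothetical forecast_disengagement, is_disengaged, and speak callbacks (none of these names come from the actual system), the idea might look like this:

```python
# A toy sketch of the hesitation policy described above (an illustration,
# not the deployed system).
import time

def respond_with_hesitation(forecast_disengagement, is_disengaged, speak,
                            delta=2.0, threshold=0.5):
    # If disengagement looks likely `delta` seconds from now, buy time
    # with a filler instead of committing to the full production.
    if forecast_disengagement(horizon=delta) > threshold:
        speak("so ...")           # filler under uncertainty
        time.sleep(delta)         # in a real system: keep sensing meanwhile
        if is_disengaged():
            speak("well, I guess I'll catch you later then.")
            return
    speak("is there anything else I can help you find?")
```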
now, how do we do this? well, we have an interesting approach here that is in some sense self-supervised. the machine eventually knows, so we can leverage that knowledge: you basically roll back time and you can learn from your own experience, basically without the need for any manual supervision. so you have a variety of features; I'm illustrating here three features, like the location of the face in the image and the size of the face, which kind of, you can see this is where they start moving away, right, and the size of the face is kind of a proxy for how far away from you they are. we have all sorts of probabilistic models, for instance for inferring where their attention is: is the attention on the robot, or is their attention somewhere else. and there's many such features in the system.
now, the idea is you start with a very conservative heuristic for detecting disengagement. you wanna be conservative because the flip side of the equation, breaking the engagement when someone is still engaged, is even more painful; you don't want to kind of stop talking to someone while they're talking to you. so you stay on the conservative side, which means you're gonna be late in detecting when they disengage. but you will eventually detect that they disengaged: at some point you will exceed some probability threshold that says they're disengaging. and then what you can do is, like I said, you roll back time. so let's say you want to anticipate that moment by five seconds; it's easy to automatically construct a label that looks like that and, five seconds ahead of time, predicts that event. and then you train a model, from all these features that you have, to predict this label. now, this model is gonna be far from perfect, but you'll probably detect that moment a bit earlier on. so if you use the same threshold of point eight, you might be able to detect it by this much earlier; we call this the early detection.
and so then you go and train models with all these features, and really the technical details are not that important here; the point I wanna make is a high-level point. in this case I think we used logistic regression, boosted trees, whatever your favorite machine learning technique is, and you can see that for the same false positive rate you can detect the disengagement earlier than the baseline heuristic.
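a rough sketch of that self-supervised setup, under the assumptions of per-frame feature matrices, one conservative heuristic detection time per session, a ten-frames-per-second rate, and a five-second horizon; logistic regression stands in for whatever classifier one prefers, and none of this is the exact MSR pipeline:

```python
# Self-supervised label construction by rolling back time from the (late)
# heuristic detection, then training a forecaster on multimodal features.
import numpy as np
from sklearn.linear_model import LogisticRegression

def make_labels(n_frames, detection_frame, fps=10, horizon_s=5.0):
    # The heuristic fires late, at detection_frame; label the preceding
    # horizon_s seconds as positive so the model learns to predict the
    # event early. No manual annotation is needed.
    labels = np.zeros(n_frames, dtype=int)
    start = max(0, detection_frame - int(horizon_s * fps))
    labels[start:detection_frame + 1] = 1
    return labels

def train_forecaster(features, detection_frames):
    # features: list of (n_frames, n_dims) arrays of multimodal features
    # (face location/size, inferred attention, dialogue state, ...).
    X = np.vstack(features)
    y = np.concatenate([make_labels(len(f), d)
                        for f, d in zip(features, detection_frames)])
    return LogisticRegression(max_iter=1000).fit(X, y)
```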
the other sort of high-level lesson is that by using multimodal features you tend to improve your performance. we used features related to the focus of attention, location, and tracking confidence scores, and dialogue features like the dialog state, how long we've been in there, and so on. each of these individually does something, and then when you add them all up together you get better results, which is generally something that tends to happen with multimodal systems.
again, the high-level point I wanna make here is: forecasting as a construct, I think, is very interesting. there's been a lot of work recently in dialogue on incrementality, and I think forecasting goes hand in hand with that, because in order to be able to achieve the kind of fluid coordination we want, we probably have to anticipate more. and then it also presents these interesting opportunities for learning directly from experience, without manually labeling data, because in general, if you wanna forecast an event, you have the label, you know when it happens, you just know it too late. but you can still learn from all of that, and you can do that online, and the system can adapt to the particular situation it's in. so I think those are a couple of interesting high-level lessons from this work.
I'm gonna switch gears and talk about a different problem that lives, relatively speaking, more in the turn-taking space. just like engagement is a rich mixed-initiative process by which we regulate how we initiate interactions, turn taking is also, you know, mixed-initiative, incrementally controlled by the participants; it's this process by which we regulate who gets to talk in a conversation. and as I mentioned before, in a lot of traditional dialogue work we make the simple turn-taking assumption of: you speak, then I speak, then you speak, then I speak; maybe there's barge-ins that are being handled. in multiparty settings you really need to develop a more sophisticated model, 'cause you need to understand who's talking to whom at any given point in time, and when is your time to speak.
and we've done a bunch of work in that direction. I'm not gonna show you that; I'm gonna show you a different problem that relates to turn taking, that I think illustrates even better this high degree of coordination and multimodality in situated dialogue, and this has to do with coordination between speech and attention.
and in some sense this work was prompted by reading some of Goodwin's work on disfluencies and attention. so Goodwin made this interesting observation about disfluencies in one of his papers. we all know that if you look at transcripts of conversational speech, it's full of false starts and restarts and disfluencies. so they're gonna look like, you know, the speaker says "anyway, we went to, I went to...", with restarts and repairs; these are parts of transcripts transcribed very literally. in conversational speech these are everywhere, and they create problems for speech recognition people and language modeling people and so on; conversational speech is hard.
well, Goodwin had the interesting insight of looking at this in conjunction with gaze. so here's the listener's gaze, and the region in red dots is where the listener is not looking at the speaker. this is the point where mutual gaze gets re-established, and then we have mutual gaze between listener and speaker. and something that's really interesting in these examples is that things become much more grammatical in regions of mutual gaze. and this leads to kind of an interesting hypothesis: that maybe disfluencies are not just errors in production; maybe some of these disfluencies actually fulfill a coordinative purpose. they are used to regulate and coordinate and make sure that either I'm able to attract your attention back if it has drifted away, or that whenever I deliver what I want to deliver, I really have your attention.
and so, partly inspired by this work and partly inspired by behaviors in our systems, we did a bunch of work on coordinating speech and attention. so let me show an example, in contrast to what humans are able to do without thinking about it. here's our robot, which is not able to reason about where the person's attention is. there's a bunch of speech recognition errors in this interaction as well, but I'd like you to pay more attention to basically how the robot is not able to take into account where the participant's attention is as the interaction is happening. she's just looking at her phone, trying to get the number for the meeting she's going to, but the robot is ignoring all that.
[video plays: the robot repeatedly asks the person where she is going while she is looking down at her phone.]
so she's, you know, she's just looking at her phone trying to find the room, and the robot keeps pushing this question of where are you going, where are you going. and so that's, you know, quite different from what people are doing.
so, inspired by Goodwin's work, we did some work on basically coordinating speech with attention. and the idea here was to have a model where, on one hand, we model the attentional demands, like where does the robot expect the person's attention to be, and on the other hand we model the attentional supply, where is the actual attention going. so attentional demands are defined at the phrase level: for every output that the robot is producing, at the phrase level, we have an expectation about where attention should be. in most cases it probably should be on the robot, but that is not always the case; when I point over there and say "to get to thirty-eight hundred", I might expect that your attention will go over there, and if your attention actually doesn't go over there, maybe we have a problem. so we are specifying these, they are manually specified, basically, just like the natural language generation; for every output we have one of these expected attention targets. and then, on the other hand, we make inferences about where your attention is, and we do that based on machine learning models that use various features and so on and so forth.
whenever there's a difference between the two, instead of just ballistically producing the speech synthesis, we use this coordinative policy that basically interjects the same kinds of pauses and filled pauses and false starts and restarts that humans do; it basically creates these disfluencies to get to a point where attention is exactly where we expect it to be, and only then do we continue. so instead of saying "to get to thirty-eight hundred", we might pause for a while, say "excuse me", say the first two words "to get", pause more, and so on, before we actually produce the utterance. and all of this is again done on a phrase-by-phrase basis.
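a toy sketch of what such a phrase-by-phrase policy could look like, with hypothetical infer_attention, speak, and pause callbacks; the escalation order here (pause, false start, "excuse me") is just one plausible choice, not the exact deployed policy:

```python
# Phrase-level coordination of speech with attention (an illustration of
# the idea, not the deployed implementation).
def produce(phrases, infer_attention, speak, pause, max_retries=3):
    # phrases: list of (text, expected_attention_target), e.g.
    #   [("to get to thirty-eight hundred,", "hallway"),
    #    ("walk to the end of this hallway.", "robot")]
    for text, expected in phrases:
        for attempt in range(max_retries):
            if infer_attention() == expected:
                break                         # attention where we expect it
            if attempt == 0:
                pause(0.5)                    # silent pause first
            elif attempt == 1:
                speak(text.split()[0] + " ...")  # false start, then wait
                pause(0.7)
            else:
                speak("excuse me,")           # explicit attention bid
                pause(0.5)
        speak(text)
```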
here is again a demonstration video of Eric and I, bad actors, trying to kind of illustrate this behavior.
[video plays: the robot gives directions, pausing and saying "excuse me" when the listener's attention drifts.]
so, still a bit clunky, you know, but you get the sense and the idea. let me show you a few interactions captured in the wild once we deployed this coordinative mechanism. in here, basically, the regions in black are the production that, you know, the robot normally produces, the synthesis; these are phrase boundary delimiters. and the regions in orange are these filled pauses and interjections that are dynamically injected on the fly, based on where the user's attention is.
[video plays: in-the-wild interactions where the robot injects "excuse me" and other filled pauses when the user's attention is elsewhere, then continues with the directions.]
so that "excuse me" might be a bit aggressive; you know, there's a lot of tuning. once you put this in there, you realize the next layer of problems that you have, like how the synthesis is not quite conversational enough, and, you know, the nuances of saying "so" versus "so...", and "excuse me", and so on. and while these videos again might make it look like, wow, we can go quite far, I don't wanna leave you with the wrong impression: a lot of work remains to be done. these things often fail; the videos I've shown you are moments when things work relatively well, I would say. but these things often fail, and I want to show you one interesting example of a failure.
[video plays: the robot starts giving directions, pauses to wait for the person's attention, and the person walks away before the rest of the instructions are delivered.]
so what actually happens here? well, what happens here is that we are paying a lot of attention to coordinating our speech with the participant's attention, but we're completely ignoring what his upper body and torso are signalling. so what happens here is: the robot gets to this phrase where it says "to get there, walk to the end of this hallway", at which point the person feels that maybe this is the end of the instructions. so they start turning both their face and their body, to kind of indicate that they might be leaving, right? the robot sees their attention go away and thinks, well, I'm gonna wait for their attention to come back, and the long pause that gets created further reinforces the person's belief that this is the end of the directions, so "I'm just going", even though the robot had all these other things to say, right?
and so the robot in some sense ignores the signal from his upper body. and if the robot could take into account that signal, we could be a bit smarter and maybe not wait there, maybe use a different mechanism to get their attention back, or maybe just blast through it; you don't always have to coordinate exactly that way, right? and so I love this example, because it really highlights and drives home this point I'm trying to make, that dialogue is really highly coordinated and highly multimodal. dialogue between people in face-to-face settings has these properties, you know. we've talked about coordinating speech and gaze, and we've seen in this example how not reasoning about body pose gets us into trouble.
there's many other things going on. we do head gestures, like nods and shakes and all sorts of other head gestures, and there's a myriad of hand gestures, you know, from beat to metaphoric to iconic to deictic gestures; facial expressions, smiles, frowns, expressions of uncertainty; where we put our bodies and how we move dynamically; prosodic contours. all of these things come into play, and they're highly coordinated, frame by frame, moment by moment, and the coordination that happens is not just across the channels, it's across people and these channels. and so I'd like us to think about dialogue in this view, less from a view of, you know, a sequence of turns and more in the view of a multimodal, incrementally co-produced process. and I think if we do that, there's a lot of interesting opportunities, because of these enabling technologies that are coming up these days.
so I've shown you a couple of problems in the space of turn taking and engagement. there's many more problems, and every time we touch one of these we really feel like we've barely scratched the surface. take for instance engagement: I talked for a bit about how to forecast disengagement and maybe negotiate the disengagement process better, but there's many other problems. how do we build robust models for making inferences about those engagement variables, like engagement states, engagement actions, and intentions? how do we construct measures of engagement that are more continuous? here all the work we've done is on I'm engaged or I'm not engaged; well, in an educational or tutoring or other kind of setting, you want a more continuous measure of engagement. how do you reason about that? similarly, many other problems in turn taking, in understanding, in how do we ground all these things in the physical situation. there's interesting challenges with rapport, with negotiation, with grounding; lots of open space, lots of interesting problems, once you start thinking about how the physical world and all these channels interact with each other.
like I said, I think we have these interesting opportunities because there has been a lot of progress in the visual and perception space: face tracking, facial expression tracking, smiles, affect recognition and so on, that can help us go in this direction. I think the other thing that I really want to highlight, besides the current technological advances, that I think is very important, is all this body of work that comes from connected fields like anthropology, sociology, psychology, sociolinguistics, conversation analysis, context analysis and so on.
there's a wide body of work: basically, as soon as people got their hands on video tapes in the fifties and sixties, they started looking carefully at human communicative behaviors. and all that work was done based on, you know, small snippets of video, and if you think about it, today we have millions of videos and interesting, powerful data techniques. so there's interesting questions about how do we bring this work into the present, how do we leverage all the knowledge and the theoretical models that have been built in the past. I've put here just some names; there's many more people that have done work in this space, and I picked one title from each of them. each of these guys has full bodies of work, and I really recommend that as a community we look back more on all this work that has been done already on human communication, and try to understand how to leverage that when we think of dialogue.
so, with that, I guess I have about ten minutes left. I want to kind of switch gears a bit and talk more about challenges, because, you know, there's a lot of opportunity, there's a lot of open field, but working in this space is not necessarily easy either. and when I think of challenges, at a high level I think of three kinds of categories. there's obviously the research challenges that we have, like I wanna work on this problem of forecasting disengagement, how will I solve it; there's obviously the research challenges. but I'm gonna leave those aside and try to talk about two other kinds of challenges. one is data and experimentation challenges, and we touched briefly on this in the panel yesterday. I think getting data for these kinds of systems is not easy.
if you look at a lot of our adjacent fields, like machine translation and speech recognition and NLP and so on, a lot of progress has been accomplished by, you know, challenges with datasets and clear evaluation metrics and so on. in dialogue this is not easy to do, and it's not easy to do because dialogue is an interactive process; you cannot easily study it on a fixed dataset, because by the time you've made an improvement or changed something, the whole thing behaves differently. and so that creates challenges generally for dialogue, and even more so for multimodal dialogue, in the multimodal space. then, apart from the data challenges, there's also kind of experimentation challenges. we've done a lot of our work in the wild because I feel like you see the real problems, you see ecologically valid settings, and you see what really happens. some of these phenomena are actually probably challenging and hard to study in controlled lab settings, like studying how engagement breaks apart and so on; you can think of all sorts of things with confederates and you can try to, you know, figure out controlled experiments, but it's not easy. and on the other hand, experimenting in the wild is not easy either, for many reasons.
one of the other kinds of challenges here is purely building up the systems, right? so in our work over the last ten years, the way we've gotten our data is by building systems and deploying them. but building systems is hard, and so in the last five minutes I wanna talk a bit about the actual engineering challenges, because I think they're just as important, in that they kind of put a damper on the research and they kind of stifle things from moving forward faster. building these kinds of multimodal systems is hard for a number of reasons.
first, there's a problem of integration: these systems leverage many different kinds of technologies that are of different types and operate on different time scales, and the sheer complexity and the number of boxes you have in one of these systems kind of makes the problem challenging. but then there's other things: constructs that are pervasive in these systems, like time, space, and uncertainty, are nowhere in our programming fabrics. it's kind of clear to me that time, for instance, is not a first-order citizen in any programming language that I can think of. so every time I wanna do something that's over time, or streaming, I have to go implement my buffers and my streaming, and, you know, I kind of have to go from scratch; and it's similar for space and uncertainty.
but it is very important, because we want to create systems that are fluid, but the sensing, thinking, acting, all of these things take time. being fast is not even enough; oftentimes you need to do fusion in these systems, and things arrive with different latencies, so you need to coordinate. basically, you need to deal with time in a deeper sense, deep down below. and the same things can be said, I think, in these systems about the notions of space and notions of uncertainty.
and finally, the other thing that kind of puts a damper on it is the fact that the development tools we have are not there for this class of systems. so the development environments and debuggers and all of this stuff were not developed with this class of systems in mind. and if I think back over all the work we've done, I don't know, half the time has maybe been spent on building the tools to build the systems, rather than building the systems or doing the research.
and so, basically driven by a lot of the lessons we've learned over the years, in the last three or four years at MSR we basically embarked on this project, and I wanted to spend the last couple of minutes telling you about it, because if there are any people in the room that are interested in joining this space, this might be useful for them. we've worked on developing an open-source platform that basically aims to simplify building these systems, the end goal being to lower the barrier to entry and enable more research into this space. so it's a framework that is targeted at researchers; it's open source, and it supports the construction of this kind of situated interactive systems. we call it Platform for Situated Intelligence, which is kind of a mouthful, so we abbreviate it \psi, pronounced like the greek letter psi. and I want to just give you a whirlwind tour in two minutes, just to kind of give you a sense of what's available in there.
the platform consists of three layers: there's a runtime layer, a set of tools, and a set of components. the runtime basically provides all this infrastructure for building systems that operate over streaming data and have latency constraints; anytime you have something interactive, it's latency constrained. so there's a certain model for parallel, coordinated computation that actually feels pretty natural: you just kind of connect components with streams of data, so it's the standard sort of data-flow model. but the streams have some really interesting properties, and I don't have time to get into the full detail and all the glory here,
but I wanna kind of highlight some of the important aspects. so for instance, I mentioned about time, how time should be a first-order citizen; well, we baked that in from day one, deep below in the fabric. all messages that are flowing through are timestamped at the origin, when they're captured, and then, as they flow through the pipeline, we have access not only to the time the message was created by the component that created it, but also to that originating time. so we know this message has a latency of four hundred and thirty milliseconds, so in the entire graph we can see latency at all points, which enables synchronization. so we provide a whole time algebra and synchronization mechanisms, when you work with streaming data, that pair these messages correctly and so on. so it's basically all about enabling coordinated computation where time is really a first-order citizen.
the streams can be automatically persisted, so there's a logging infrastructure that is there for free for any data type; you can stream any of your data types and we can automatically persist those. and because we persist them with all this timing information, we can enable more interesting replay scenarios, where I say, well, forget about these sensors, let's play it back from disk and tune this component. and I can play it back from disk exactly as it happened in real time, or I can speed it up or slow it down; time is entirely under our control, because it's baked deep down in the fabric.
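to illustrate the idea (the actual \psi runtime is a .NET framework, so this Python sketch is not its API), each message carries its originating time alongside its creation time, which makes latency visible everywhere and lets streams be paired on originating time, whether they arrive live or are replayed from a store.

```python
# Illustrative sketch of originating-time-stamped messages and a simple
# originating-time join; not the \psi API, just the underlying idea.
from dataclasses import dataclass
from typing import Any

@dataclass
class Message:
    data: Any
    originating_time: float   # when the underlying signal was captured
    creation_time: float      # when this component produced the message

    @property
    def latency(self) -> float:
        return self.creation_time - self.originating_time

def join(stream_a, stream_b, tolerance=0.033):
    # Pair messages whose originating times are within `tolerance` seconds,
    # e.g. fuse audio and video that arrive with different latencies.
    # Assumes both streams are sorted by originating time.
    pairs, j = [], 0
    for a in stream_a:
        while j < len(stream_b) and stream_b[j].originating_time < a.originating_time - tolerance:
            j += 1
        if j < len(stream_b) and abs(stream_b[j].originating_time - a.originating_time) <= tolerance:
            pairs.append((a, stream_b[j]))
    return pairs
```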
so these are some of the properties of the runtime; there's a lot more. it's basically a very lightweight, very efficient kind of system for constructing things that work with streaming data. at this level we don't care, we don't know anything about speech or dialogue or components; it's agnostic to that. you can use it for anything that operates with streaming data under temporal constraints.
the set of tools we built are basically heavily centered on visualization. this is a snapshot from the visualization tool we have; on the right there, someone's actually using it, and this video is sped up a bit. but these are the streams that were persisted in an application; these are different visualizers for different kinds of streams that can get composited and overlaid. so this is a visualizer for an image stream, this is a visualizer for a face detection results stream, this is audio, this is voice activity detection, that's a speech recognition result, this is a visualizer for a full 3D conversational scene analysis. and the basic idea is that you can composite and overlay these visualizers, and then you can navigate over time, left and right, and zoom in and look at particular moments. this is very powerful, especially when coupled with debugging. and we're evolving this to visualize not just the data collected and running through the systems, but also the architecture of the system itself, you know, the view of the component graph, and also towards annotation, for supporting data annotation.
finally, at the components layer, we are hoping to create an ecosystem of components where people can plug and play different kinds of components. we're bootstrapping this with things like sensors, imaging components, vision, audio, speech; those are relatively simple components that we have in the initial ecosystem. but the idea is that it is meant to be an ecosystem, and people are meant to contribute into it. it is an open-source project; there's already Boise State, Casey Kennington has his own repository of \psi components. and so people are starting to use this, and the hope is that more people will use it. if I can get you to have eighty percent of what you need off the shelf, and just focus on your research, that's the key idea.
the last thing I'll say is that something we haven't released yet, but are planning to release in the next few months, is an array of components that we refer to as the situated interaction foundation. it's basically a set of components at that level, plus a set of representations, that aim to further abstract and accelerate the development of these physically situated interactive systems. basically, what we are planning to construct is the ability to instantiate a perception pipeline where you, as a developer of the system, just say where your sensors are and what sensors you have. so in this instance, there's a kinect sensor; the big box there represents my office, and there's a kinect sensor sitting on top of the screen. and if you tell me you have three sensors, I'm gonna use the data from all three sensors and fuse it; we're gonna configure the perception pipeline automatically from all the sensors we have, with the right fusion, and provide this kind of analysis, a deep scene analysis object, that runs at frame rate, at thirty frames per second, and it's gonna tell you things like: here's where the people are in the scene and what their body poses are, here's where everyone's attention is.
in this case, there's an actual engagement happening between the two of us and an agent that's on the screen, and Stewart is, you know, directing the utterance towards the agent. and at some later point, we have peeled off, we've gone more towards the back of the office, towards the whiteboard, and we're just talking to each other. and so we're trying to provide all this rich analysis of the conversation and the conversational scene, including issues of engagement, turn taking, utterances, sources, targets, and all of that, from the available sensors. and if you give me more sensors, the idea is that you get the same object back, but at a higher fidelity, because we have more sensors and we can fuse data.
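as a purely hypothetical sketch of what such a frame-rate scene analysis result could contain (the actual situated interaction foundation representations are not yet released, so these names and fields are my illustration, not its schema):

```python
# Hypothetical shape of a per-frame scene analysis result.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Person:
    id: str
    body_pose: List[Tuple[float, float, float]]   # 3D joint positions
    attention_target: Optional[str]                # id of person/agent/object attended to

@dataclass
class SceneAnalysis:
    timestamp: float
    people: List[Person] = field(default_factory=list)
    engagements: List[Tuple[str, str]] = field(default_factory=list)  # who is engaged with whom
    utterances: List[dict] = field(default_factory=list)              # source, targets, text
```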
this part has not been released yet; it will be coming out probably in the next couple of months. but our hope with the entire framework is basically to accelerate research in this space, to get people to be able to build and experiment with these kinds of systems without having to spend two years to construct all the infrastructure that's necessary.
and so this brings me basically to the end of my talk; I'll conclude on this slide. I've tried to adopt this view of dialogue in this talk, and portrayed this view of dialogue as a multimodal, incrementally co-produced process, where participants in the interaction really do fine-grained coordination across all these different modalities. I think there is a tremendous number of opportunities here, and I think it's up to us to basically broaden the field in this direction, because the underlying technologies are coming, and they are starting to get to the point where they're reliable enough to start to do interesting work. and again, there's this big body of work in human communication dynamics that we can leverage and that we can draw upon. so I'll stop here; thank you all for listening, and I welcome your questions.
thanks very much, Dan. thank you, Dan, it was so great to see all this work again, and how impressive the research program has been over the number of years to get to this point. I'm really looking forward to the situated interaction foundation coming out.
I have a question, I guess, related partly to that. one of the problems with integration is not just taking a bunch of pieces and putting them together, but the maintenance of that over time, as you add new pieces. so, in particular for this last thing: how much can you, just by adding a new component, expect everything else to work the way it did, and just have some value added by getting new information? and how much do you have to re-engineer the whole architecture to make sure that you're not undoing things or getting into a problem? thinking, you know, in terms of engineering, the recent plane flight crashes seem to stem from this kind of thing, where different engineers designed systems very well given a set of assumptions about what else would be there, or not, and then that changed under them, and that seems to have caused the problem, right?
I mean, I completely agree. the ideal world is one where, you know, everything works, you just plug your thing in, but in reality it's never that way, right. it is gonna be different people with different research agendas, who view things differently, have different mental models or different viewpoints from which they look at a problem and attack it. and I think that does create challenges; I don't know how to solve all those challenges. all I can say is that we were kind of aware of that, and when we were constructing this, we were trying to make as few commitments, in some sense, as possible, to allow for the flexibility that's needed for research, because I think there's actual value in all those different viewpoints and different architectures and exploration. and so, yes, I think what I can say is that we are purposefully trying to not make hard commitments to, say, what is an utterance; I don't wanna tell you what an utterance is, I want to let you have your own opinion of what an utterance is. but that also might mean that, again, when you try to plug your speech recognizer into my system, there might need to be some wrangling and so on, you know, to make these components work together. I don't know how we can solve this problem; I'm not a big believer in "oh, we'll all come together with the big beautiful standard that we'll all agree to", I don't see that happening. we're just trying to design towards flexibility, I would say.
and I think that was a wonderful talk, and you're highlighting these things, and you're right, it is time for us to be able to address them, and we should be working more on this, beyond the simple turn. sorry, I might be introducing something even more complex down the line, but I wonder about user adaptation. users are very good, humans are very good at changing their behavior based on the system that's in front of them: you know, if it's a human, or if it's a phone call and there's a delay, we will or will not backchannel, because it screws up the conversation. and people can adapt to this quite fast, and that might be confusing to our learning systems, which then might not be able to tease apart the effects of users adapting from the most natural behaviors. have you thought about how to deal with that, about not getting the human to adapt, or being able to control how the human adapts to the particular system and the policies that you're using for adaptation?
no, I think it's a very interesting question. so there's a couple of things here. one is, I do know that in a lot of the data we have, we observe a large variability between people's attitudes and what people do, both in, you know, just the initial intent with which they come towards the system and the expectations they have, and also in how they do or do not adapt to whatever the system is doing. I guess my view, one thing I would say, is I think more of these systems should be learning continuously, because you are basically in a continuous dance with the person on the other end in this adaptation, you know, and doing things in big batches is likely to create more friction than doing things in this continuous, adaptive way. so I think that's an interesting direction for attacking this problem.
I feel a lot of the work, the way I'm thinking of it, is I want to reduce this impedance mismatch in interaction between where machines are and where people are, and I think we still have a lot to travel with the machines this way. people always accommodate to whatever the machines do and mediate, but I think I want the machine to go be closer to where the human is, and that would make things easier. so I think of all the work we've done, and the way I see it, as: I'm gonna try to reduce that impedance from the machine side as much as possible. but you're right that sometimes, with clever designs, you can actually, you know, create interesting experiences that leverage that adaptation, when you know it's gonna happen. but I think in most cases I'm in favor of systems that just incrementally adjust themselves to be able to be at the right spot, 'cause it continues to shift. I don't know if that really addresses the question, or somewhat addresses it.
hi, I'm Robert Ross from Technological University Dublin, speaking maybe as one of the many people here who over the years have wasted two years of our lives building dialogue systems from the ground up. I think what you presented there at the end is fantastic, but my question is a bit more specific, in terms of the work you did on interjections being used and hesitations being used to sort of keep the user's engagement. in the work in the wild, did you do any variation in terms of the multimodal aspects of that, in other words the avatar that's being used, the gestures that were being used, in fact whether or not using an avatar was a good idea? that's my fine-grained question. and then, just a more general question: have you looked at all at the issues of engagement in terms of activity modeling? because it's always struck me that a big problem in situated interaction, when you move away from the kiosk style where the user is asking a question, is that users are engaged in activities, and for us to truly get situated interaction working, we necessarily need to track the user and what they're doing, to be able to make sensible contributions to the dialogue, not just answer questions. yep.
so, to the first part of the question, the short answer is no, but we should have. like, I think there's a rich set of nuances, basically, in how you do hesitations and interjections and all these policies, and definitely the corresponding nonverbal behaviors would affect that. and we've just seen it in the prosodic contours of the "so"; you know, "so" was also not such a good choice, because as a hesitation it sometimes brings people back, like "so, what?", and those nuances are hard to synthesize with the speech synthesis technology we had at the time. so I would say that, yes, we definitely should consider those aspects. the second part of the question, remind me, what was it?
so, I think you're absolutely right. a lot of the work I've shown, that we've done actually in the last, you know, ten years, has been focused on interactions where the whole task is the communication, like the communication that happens between the machine and the person; that conversation we're having is the whole task. we're actually just now starting to do more work with systems where the human is involved in an actual task, not just the communicative task, and we're trying to see how the machine can play a supporting role in that. and I think you're absolutely right that that kind of brings up the next interesting level of how we really get collaboration going, rather than just this kind of back and forth of I can ask or answer a question, and so on. I think that's a very interesting space, and we're just starting to play in that space.
thank you very much for a very interesting talk; I think it's great, this going-out-in-the-wild approach. I was just wondering: I still assume that a Microsoft Research office is a certain type of people who are in there, so it's not completely out in the wild. so it's sort of a question of, have you considered other, I mean, I guess children or other types of user groups, or other types of problems that you might have in a more sort of open setting, or something?
no, we haven't, so the short answer is again no, we haven't. but I completely agree: the population we have is just a very narrow, very specific one. it's interesting to me how much variability I see even in that narrow cross-section, which makes me wonder, like, you know, it's interesting, there's a lot of variability even in that narrow population. but you're absolutely right, it's not truly in the wild, it's not a true public space, and so it would be very interesting to go there and see, 'cause yes, the populations are different. we haven't done much outside this.
okay, let's thank Dan again for a really interesting talk.