and I have great pleasure in introducing the second keynote speaker of the conference, Dan Bohus from Microsoft Research. Dan is a senior researcher at Microsoft Research, which is where he has been for the last twelve years, and he's going to talk to us about situated interaction.
okay, thanks a lot, thanks Ingrid, thanks for the introduction and also for the invitation to talk. it's great to be back here. I think I missed the last couple of years, but this is always a great place to come back to.
so the title of the talk is situated interaction, and I think it's gonna dovetail pretty well with the panel discussion we had at the end of yesterday about narrowing versus broadening of the field, and interesting questions we might all be working on. there are basically
two main points that i would like to highlight in this talk the first one
is that dialogue is really a multimodal highly coordinated complex affair
that goes well beyond the spoken word
I don't know how many of you are familiar with the work of Ray Birdwhistell, an anthropologist who did some of the seminal work on kinesics back in the sixties, basically studying the role of body movement in communication. and in one of his books he essentially comments on how perhaps the problem with the early records that we have of studies of communication is that they were done by literate people.
now, all joking aside, it is the case that if you look at most of the work we do today in dialogue, it is really heavily anchored in text, in the written word, and at best in the spoken word.
but in reality we do a lot of work with our bodies when we interact
with each other when we communicate with each other and the surrounding physical context also
plays a very important role in these interactions
from where we place ourselves in space relative to each other, the stance we adopt, to where our gaze goes moment by moment, to facial expressions, head nods, hand gestures, prosodic contours.
all of these channels come into play when we interact with each other
and so that's the view of dialogue that i would like to highlight today
the second point that I'm gonna try to make in this talk is that I think we're also at a very interesting time, where in the last decade we've also seen very fast-paced advances based on deep learning in areas like vision and, in general, perception and sensing. and I think these advances are getting us to the point where we're able to start building machines that understand people in physical space, how people move and behave in physical space.
I think it's a very interesting time in that sense. just like in the nineties, advances in speech recognition broke open the field and opened up this whole area of spoken dialogue systems, with all the research that has come from that, and that today has led to these mobile assistants in our pockets, I think these advances in vision and in the perceptual technologies give us a chance to again broaden the field, in this direction of physically situated dialogue and more generally situated interaction.
so what I'm gonna do in this talk is try to give you a sense of this area based on some research vignettes from our own work at MSR over the last ten years or so in this space. and hopefully I'll be able to convey to you my excitement about it, and maybe get more of you to look into this direction, because I think there are a lot of interesting and open problems in this space, and I think a lot of the people in this room have quite a bit to contribute to solving these problems.
so finally, before I get going, before I dive in, I want to make sure I thank the collaborators that I've had over the years. I've been lucky to work with fabulous people at MSR, and to have long-term collaborations with folks like Eric Horvitz and Sean Andrist, who is here, and also many other researchers, talented engineers, and great interns we've had over the years. some of the work you'll see, the work we've done at MSR in this space, would not have been possible without their help, so I want to thank them.
okay, so let's get started. situated interaction. well, I started working in this space shortly after I joined MSR, around two thousand and eight, and the main question that has been driving my research agenda since has been basically: how do we get computers to reason about the physical space around them, and to interact with people in this kind of open-world, physically situated setting, in a fluid and seamless manner?
and the general approach I've taken towards that space has been one where we built a variety of systems and deployed them in the wild. and by deploying in the wild, what I mean in this case is placing them in some public space in our building where people would naturally encounter and interact with them without much instruction. so it's not a controlled setting; they're just deployed somewhere where people just come and interact with them. then we observe the interactions and we let that drive what are the research problems that we want to address; we find what are the problems we need to solve by observing what happens in this kind of ecologically more valid setting, and try to let that
give us direction. and so, to make this concrete and to give you a sense of the variety of systems we've built, I'm gonna start by showing you a few videos, and then we can go more into some of the research questions we've looked at. the first video I'm gonna show you is from a system that we refer to as the assistant. it's a virtual-agent-based system that's placed outside Eric's office and interacts with people that come by whenever he is not available, or maybe when he is available but busy in his office.
and basically the system does some simple assistive-type tasks like handling meetings and taking, you know, some notes to relay, and so on. it's connected to quite a wide infrastructure: it has access to Eric's calendar, but also to other machine-learned models that predict his availability, when is he gonna be back in his office, you know, what's the likelihood that he will attend a particular meeting, and so on. but what I want to highlight with this video is not so much that part as much as the multiparty dialogue, or interaction, capabilities. here the system has a wide-angle camera at the top and a microphone array, and it's able to basically reason about multiple people, understand who it is engaged with, and have dialogue in this kind of open multiparty setting, based on the roles that these people have.
[video plays: two visitors arrive for a five o'clock meeting with Eric; the assistant explains he should be back in about fifteen minutes, offers to let them wait or come back later, and suggests sending him an email message, which he is expected to see within about a minute.]
so over the years we built a variety of these systems based on virtual agents. this is a receptionist prototype aiming to do shuttle reservations on campus, for people moving from one building to another: you go into the lobby, you can say I'm going to this building, and get a shuttle. we built a fun trivia-questions game that we deployed in a corridor near one of our kitchens, where the system would try to engage people that go by into this questions game. like, it would ask you what's the longest river in the world, and then you try to figure out the answer. but the interesting bit here is that it is almost trying to do this, in some sense, cooperatively: it is trying to get people to reach a consensus before revealing the answer and moving to the next question.
we did a lot of interesting studies on engagement, on how do you attract a bystander. a lot of times people kind of sit back and watch from a distance what happens, so we worked on how do you attract bystanders inside an interaction. so again, studying various problems related to multiparty dialogue in open-world settings.
we've also done work that has nothing to do with language. I'm using the term situated interaction purposefully, because my focus is on, my interests are in, sort of, how do we get machines to interact with people, whether there's language or not.
this is an example of a system we call the third-generation elevator. what you're seeing here is a view from the top in our atrium. there's, basically, let's see if this works, the elevator doors are over there; this is a fisheye-distorted view from the top, but this is in front of the bank of elevators where people are going by. so we built a simple model that just does optical flow and, based on features from the optical flow, tries to anticipate by about three seconds when the button will be pushed. so as you walk towards the elevator, it pushes the button for you; the idea was, let's build a star trek elevator. but if you just simply go by, you know, nothing happens. and it's not necessarily that I think this is how elevators will work in the future, but it's an exploration, and a nod to this idea that machines should be able to reason about and think about how people behave in physical space, and drive interesting interactions off of that. and the system has been running for years in our lobby, and by now no one even notices it's there; in some sense it just works.
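to make that idea concrete, here is a minimal sketch of how such an anticipation model could be wired up; this is an illustration under assumptions, not the deployed system: optical flow is summarized into a few motion features per frame, and a classifier trained on past footage (both the feature choice and the classifier here are hypothetical) estimates whether the button will be needed within the next few seconds.

```python
# Illustrative sketch (not the actual MSR system): anticipate an elevator
# button press a few seconds ahead from optical-flow features of an
# overhead camera.
import cv2
import numpy as np

def flow_features(prev_gray, gray):
    # Dense optical flow over the frame; summarize as overall motion
    # magnitude plus a direction histogram weighted by magnitude.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hist, _ = np.histogram(ang, bins=8, range=(0, 2 * np.pi), weights=mag)
    return np.concatenate(([mag.mean()], hist / (hist.sum() + 1e-6)))

def will_push_button_soon(classifier, prev_gray, gray, threshold=0.8):
    # classifier: any model trained on (features, pressed-within-3s) pairs.
    p = classifier.predict_proba([flow_features(prev_gray, gray)])[0, 1]
    return p > threshold
```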
in the last years we've also started looking in the direction of interaction with robots, so human-robot interaction. a system that we've done a lot of research with are these directions robots. we have three of these guys; we have them deployed on each of the floors in our building, as you come off the elevator, and they can give you directions inside the building. so you can ask for meeting rooms or various people, and they can direct you there.
[video plays: the robot directs one visitor to conference room three hundred, telling them to go down the hallway and turn right, and that it will be the first room on the right; it then tells another visitor that John is in office forty-one twenty, to take the elevator to the fourth floor, turn right out of the elevator, and continue to the end of the hall.]
okay, so hopefully this gives you guys a sense of the class of systems we've been building and working with and doing research with over the years. now, when you try to build these things and have them actually work in the wild, in this kind of uncontrolled setting, you quickly run into a number of problems that otherwise you might not even think of or consider.
so, a lot of the problems with interaction, I think we as humans solve unselfconsciously; this is so ingrained in us that we don't think about it. but, you know, once you try to do something with a machine and computationalize it, you run into the actual problems. so the first problem you have to solve is that of engagement: knowing who am I engaged in an interaction with, and when. like, this is all obvious to us whenever we're in an interaction, but a machine has to reason about it. for instance, here it needs to reason that even though these two guys are looking away from it at this moment, they're actually still engaged in an interaction with the machine; they're looking away because the robot just pointed over there. and she, well, she's been looking at the machine all the time, but she's actually not engaged in this conversation. and going one step further, the robot might reason that, well, perhaps she's in a group with them and waiting for them, or perhaps she's not in a group with them but has an intention to engage with the robot once they're done. there's all this reasoning that we do kind of on automatic and we don't think about, but you have to kind of program the machine to do it.
once you can solve the problem of engagement, the next problem you have to solve is that of turn taking. and, you know, the standard dialogue model we all often work with is one where dialogue is a volley of utterances by the system and user and system and user. this breaks to pieces immediately once you're in a multiparty setting. you need to reason not only about when utterances are happening, but about who's producing them, who the utterances are addressed to, and who the producer expects to talk next, so who is the next ratified speaker here. should I as a robot inject myself at the end of this utterance that I heard, or should I wait, 'cause someone else is gonna respond? so the problem gets more complex. and again, all of this we do on automatic, and it's regulated with gaze, with prosody, with how we move our bodies, and so on. and only once you can kind of deal with these two problems can you start worrying about speech recognition and decoding the signals, understanding what is actually contained in the signals that we send to each other, and doing the high-level interaction planning and dialogue control. so in some sense we view this as almost like a minimal set of communicative competencies that you need to have to do this kind of interaction in open-world settings.
and over the years our research agenda has been basically looking at various problems in these processes, by trying to leverage the information we have about the situated context: the who, the what, and the why of the surroundings. so that's kind of the very high-level, kind of fuzzy, one slide about what the research has been about at MSR in the last ten years in this space. and I'm gonna dive in now and show you two different examples in a little bit more detail. I'm not gonna go very technically deep; I'll point you to the papers, and I'm happy to talk more offline, but I want to give you a sense of what the research problems look like. I'm gonna start with a problem that has to do with engagement, which I've already mentioned.
engagement, as Candace Sidner refers to it, is the process by which participants initiate, maintain, and terminate the conversations that they jointly undertake. now, in a lot of classical dialogue work, I mean, in telephony applications or mobile phones and so on, this is a trivial problem to solve, right: I push a button, I know I'm engaged, or I pick up a phone call, I hang up, I'm not engaged; I don't have a really big problem to solve. however, if you have a robot or a system that's embodied and situated in space, this becomes a more complex problem.
and just to illustrate sort of the diversity of behaviors with respect to engagement that one might have, we captured this video many years ago, at the start of this work. it's a video from the receptionist prototype, the one that was doing the shuttle reservations, and it mostly highlights how, by reasoning about three engagement variables in particular: engagement state, am I in a conversation or not; engagement actions, which regulate the transitions between the states; and engagement intentions, which are different from the states. by reasoning about these three key variables, you can construct fairly sophisticated policies in terms of how you manage engagement in, you know, a group setting.
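to pin down what I mean by these variables, here is one minimal way to represent them per actor; this is my reading of the description above, not the actual model, and the enum values are illustrative.

```python
# A minimal sketch of the three engagement variables described above
# (an illustration, not the MSR system's internal representation).
from dataclasses import dataclass
from enum import Enum

class EngagementState(Enum):
    NOT_ENGAGED = 0
    ENGAGED = 1

class EngagementAction(Enum):
    INITIATE = 0      # start an engagement
    MAINTAIN = 1      # keep it going
    SUSPEND = 2       # put it on hold (e.g., turn to someone else)
    TERMINATE = 3     # end it

@dataclass
class EngagementInfo:
    state: EngagementState          # am I in a conversation with this actor?
    intention: float                # P(actor intends to engage); may differ from state
    last_action: EngagementAction   # action regulating the latest transition
```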
so I'll play this video for you in a second. just before I do that, to help you with the legend here and all this annotation: a yellow line below a face means this is who the system is engaged with at that point. this is the system's viewpoint, what it sees; it's one of these avatar heads, that's what it looks like for us. a dotted line is an engagement that is currently suspended. the red dot moving around, right now it's on Eric's face, shows the direction of the avatar's gaze. so I'll run this for you.
sorry for the quality of the audio here
[video plays: the receptionist manages engagement with multiple people, suspending and resuming engagements as participants come and go.]
so there's many behaviors in here that fly by pretty fast. like, for instance, when the receptionist turns from Eric to me and my attention is on my cellphone, it says excuse me and waits for my attention to come up to continue that engagement. or at the end, when I'm passing by further away in the distance, the moment I turn my attention towards it, even though I'm at a distance, it initiates this engagement, because, you know, as I still have this pending task of getting the shuttle, it can give me an update. there's a lot of behaviors that you can create from relatively simple inferences. now, obviously, this is a demonstration video that was shot in the lab, and we probably had to do it, I don't know, three, five times to get it right. this stuff does not work that well when you put it out there in the wild, and I will show you in a second how well it works in the wild. but this is almost like a north-star video, a north-star direction for us in our research work: we wanna be able to create systems where the underlying inference models are so robust that we can actually have these kinds of fluid interactions out there in the wild.
so let me show you how it works in practice and give an example of a particular research problem in this space. I'll start with this video that kind of motivates it. pay attention to how badly, in this case, this is a video from the directions robot, how badly the robot is negotiating disengagement, so the moment of breaking off the interaction.
[video plays: a person asks the robot for help finding a room and gets directions; the robot asks them to swipe their badge, then asks "is there anything else I can help you find?"; the person says no and starts to leave, but the long follow-up question pulls them back before they finally disengage.]
not very good, right?
so what happens here? well, what happens here is that at this point in time it's obvious to all of us that this interaction is over. but all the machine sees is just the rectangle of where the face is; back in the day that's all the tracking we were doing, and it doesn't understand this gesture. and so at this point the robot continues the dialogue with "is there anything else I can help you find", and this is quite a long production. now, what's interesting here is that just a couple of seconds right after that, by this point, by this frame, the robot's engagement model can actually tell that this person is disengaging. but by that time it's already too late, because we've already started producing this "is there anything else", and the person hears the sentence, and we end up in this bad loop where we are basically not negotiating the disengagement properly, and the person starts coming back, so now they're engaged again, and we get into this problem.
so what's interesting here is that the robot eventually knows. and so the idea that comes to mind is, well, if we could somehow forecast from here that at some future time this person is likely to disengage, with some good probability, we could perhaps use hesitations to mitigate the uncertainty; people often use hesitations in situations of uncertainty. so if we could somehow forecast, and we don't even need to be perfect in that forecast, that at t zero plus delta this person might be disengaging, then instead of launching this production we could launch a filler, like a hesitation, like "so". and then if at t zero plus delta we find them disengaging, we say "so... well, guess I'll catch you later then". or if alternatively they're not, we can still say "so, is there anything else I can help you find?", and that doesn't sound too bad. and so the core idea here is: let's forecast what's gonna happen in the future, and maybe use hesitations to mitigate the associated uncertainty.
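as a toy sketch of that policy, assuming hypothetical forecast_disengagement, is_disengaged, and speak callbacks (none of these names come from the actual system), the idea might look like this:

```python
# A toy sketch of the hesitation policy described above (an illustration,
# not the deployed system).
import time

def respond_with_hesitation(forecast_disengagement, is_disengaged, speak,
                            delta=2.0, threshold=0.5):
    # If disengagement looks likely `delta` seconds from now, buy time
    # with a filler instead of committing to the full production.
    if forecast_disengagement(horizon=delta) > threshold:
        speak("so ...")           # filler under uncertainty
        time.sleep(delta)         # in a real system: keep sensing meanwhile
        if is_disengaged():
            speak("well, I guess I'll catch you later then.")
            return
    speak("is there anything else I can help you find?")
```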
now, how do we do this? well, we have an interesting approach here that is in some sense self-supervised. the machine eventually knows, so we can leverage that knowledge: you basically roll back time and you can learn from your own experience, basically without the need for any manual supervision. so you have a variety of features; I'm illustrating here three features, like the location of the face in the image and the size of the face, which kind of, you can see this is where they start moving away, right, and the size of the face is kind of a proxy for how far away from you they are. we have all sorts of probabilistic models, for instance for inferring where their attention is: is the attention on the robot, or is their attention somewhere else. and there's many such features in the system.
now, the idea is you start with a very conservative heuristic for detecting disengagement. you wanna be conservative because the flip side of the equation, breaking the engagement when someone is still engaged, is even more painful; you don't want to kind of stop talking to someone while they're talking to you. so you stay on the conservative side, which means you're gonna be late in detecting when they disengage. but you will eventually detect that they disengaged: at some point you will exceed some probability threshold that says they're disengaging. and then what you can do is, like I said, you roll back time. so let's say you want to anticipate that moment by five seconds; it's easy to automatically construct a label that looks like that and, five seconds ahead of time, predicts that event. and then you train a model, from all these features that you have, to predict this label. now, this model is gonna be far from perfect, but you'll probably detect that moment a bit earlier on. so if you use the same threshold of point eight, you might be able to detect it by this much earlier; we call this the early detection.
and so then you go and train models with all these features, and really the technical details are not that important here; the point I wanna make is a high-level point. in this case I think we used logistic regression, boosted trees, whatever your favorite machine learning technique is, and you can see that for the same false positive rate you can detect the disengagement earlier than the baseline heuristic.
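a rough sketch of that self-supervised setup, under the assumptions of per-frame feature matrices, one conservative heuristic detection time per session, a ten-frames-per-second rate, and a five-second horizon; logistic regression stands in for whatever classifier one prefers, and none of this is the exact MSR pipeline:

```python
# Self-supervised label construction by rolling back time from the (late)
# heuristic detection, then training a forecaster on multimodal features.
import numpy as np
from sklearn.linear_model import LogisticRegression

def make_labels(n_frames, detection_frame, fps=10, horizon_s=5.0):
    # The heuristic fires late, at detection_frame; label the preceding
    # horizon_s seconds as positive so the model learns to predict the
    # event early. No manual annotation is needed.
    labels = np.zeros(n_frames, dtype=int)
    start = max(0, detection_frame - int(horizon_s * fps))
    labels[start:detection_frame + 1] = 1
    return labels

def train_forecaster(features, detection_frames):
    # features: list of (n_frames, n_dims) arrays of multimodal features
    # (face location/size, inferred attention, dialogue state, ...).
    X = np.vstack(features)
    y = np.concatenate([make_labels(len(f), d)
                        for f, d in zip(features, detection_frames)])
    return LogisticRegression(max_iter=1000).fit(X, y)
```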
the other sort of high-level lesson is that by using multimodal features you tend to improve your performance. we used features related to the focus of attention, location, and tracking confidence scores, and dialogue features like the dialog state, how long we've been in there, and so on. each of these individually does something, and then when you add them all up together you get better results, which is generally something that tends to happen with multimodal systems.
again, the high-level point I wanna make here is: forecasting as a construct, I think, is very interesting. there's been a lot of work recently in dialogue on incrementality, and I think forecasting goes hand in hand with that, because in order to be able to achieve the kind of fluid coordination we want, we probably have to anticipate more. and then it also presents these interesting opportunities for learning directly from experience, without manually labeling data, because in general, if you wanna forecast an event, you have the label, you know when it happens, you just know it too late. but you can still learn from all of that, and you can do that online, and the system can adapt to the particular situation it's in. so I think those are a couple of interesting high-level lessons from this work.
I'm gonna switch gears and talk about a different problem that lives, relatively speaking, more in the turn-taking space. just like engagement is a rich mixed-initiative process by which we regulate how we initiate interactions, turn taking is also, you know, mixed-initiative, incrementally controlled by the participants; it's this process by which we regulate who gets to talk in a conversation. and as I mentioned before, in a lot of traditional dialogue work we make the simple turn-taking assumption of: you speak, then I speak, then you speak, then I speak; maybe there's barge-ins that are being handled. in multiparty settings you really need to develop a more sophisticated model, 'cause you need to understand who's talking to whom at any given point in time, and when is your time to speak.
and we've done a bunch of work in that direction. I'm not gonna show you that; I'm gonna show you a different problem that relates to turn taking, that I think illustrates even better this high degree of coordination and multimodality in situated dialogue, and this has to do with coordination between speech and attention.
and in some sense this work was prompted by reading some of Goodwin's work on disfluencies and attention. so Goodwin made this interesting observation about disfluencies in one of his papers. we all know that if you look at transcripts of conversational speech, it's full of false starts and restarts and disfluencies. so they're gonna look like, you know, the speaker says "anyway, we went to, I went to...", with restarts and repairs; these are parts of transcripts transcribed very literally. in conversational speech these are everywhere, and they create problems for speech recognition people and language modeling people and so on; conversational speech is hard.
well, Goodwin had the interesting insight of looking at this in conjunction with gaze. so here's the listener's gaze, and the region in red dots is where the listener is not looking at the speaker. this is the point where mutual gaze gets re-established, and then we have mutual gaze between listener and speaker. and something that's really interesting in these examples is that things become much more grammatical in regions of mutual gaze. and this leads to kind of an interesting hypothesis: that maybe disfluencies are not just errors in production; maybe some of these disfluencies actually fulfill a coordinative purpose. they are used to regulate and coordinate and make sure that either I'm able to attract your attention back if it has drifted away, or that whenever I deliver what I want to deliver, I really have your attention.
and so, partly inspired by this work and partly inspired by behaviors in our systems, we did a bunch of work on coordinating speech and attention. so let me show an example, in contrast to what humans are able to do without thinking about it. here's our robot, which is not able to reason about where the person's attention is. there's a bunch of speech recognition errors in this interaction as well, but I'd like you to pay more attention to basically how the robot is not able to take into account where the participant's attention is as the interaction is happening. she's just looking at her phone, trying to get the number for the meeting she's going to, but the robot is ignoring all that.
[video plays: the robot repeatedly asks the person where she is going while she is looking down at her phone.]
so she's, you know, she's just looking at her phone trying to find the room, and the robot keeps pushing this question of where are you going, where are you going. and so that's, you know, quite different from what people are doing.
so, inspired by Goodwin's work, we did some work on basically coordinating speech with attention. and the idea here was to have a model where, on one hand, we model the attentional demands, like where does the robot expect the person's attention to be, and on the other hand we model the attentional supply, where is the actual attention going. so attentional demands are defined at the phrase level: for every output that the robot is producing, at the phrase level, we have an expectation about where attention should be. in most cases it probably should be on the robot, but that is not always the case; when I point over there and say "to get to thirty-eight hundred", I might expect that your attention will go over there, and if your attention actually doesn't go over there, maybe we have a problem. so we are specifying these, they are manually specified, basically, just like the natural language generation; for every output we have one of these expected attention targets. and then, on the other hand, we make inferences about where your attention is, and we do that based on machine learning models that use various features and so on and so forth.
whenever there's a difference between the two, instead of just ballistically producing the speech synthesis, we use this coordinative policy that basically interjects the same kinds of pauses and filled pauses and false starts and restarts that humans do; it basically creates these disfluencies to get to a point where attention is exactly where we expect it to be, and only then do we continue. so instead of saying "to get to thirty-eight hundred", we might pause for a while, say "excuse me", say the first two words "to get", pause more, and so on, before we actually produce the utterance. and all of this is again done on a phrase-by-phrase basis.
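a toy sketch of what such a phrase-by-phrase policy could look like, with hypothetical infer_attention, speak, and pause callbacks; the escalation order here (pause, false start, "excuse me") is just one plausible choice, not the exact deployed policy:

```python
# Phrase-level coordination of speech with attention (an illustration of
# the idea, not the deployed implementation).
def produce(phrases, infer_attention, speak, pause, max_retries=3):
    # phrases: list of (text, expected_attention_target), e.g.
    #   [("to get to thirty-eight hundred,", "hallway"),
    #    ("walk to the end of this hallway.", "robot")]
    for text, expected in phrases:
        for attempt in range(max_retries):
            if infer_attention() == expected:
                break                         # attention where we expect it
            if attempt == 0:
                pause(0.5)                    # silent pause first
            elif attempt == 1:
                speak(text.split()[0] + " ...")  # false start, then wait
                pause(0.7)
            else:
                speak("excuse me,")           # explicit attention bid
                pause(0.5)
        speak(text)
```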
here is again a demonstration video of Eric and I, bad actors, trying to kind of illustrate this behavior.
[video plays: the robot gives directions, pausing and saying "excuse me" when the listener's attention drifts.]
so, still a bit clunky, you know, but you get the sense and the idea. let me show you a few interactions captured in the wild once we deployed this coordinative mechanism. in here, basically, the regions in black are the production that, you know, the robot normally produces, the synthesis; these are phrase boundary delimiters. and the regions in orange are these filled pauses and interjections that are dynamically injected on the fly, based on where the user's attention is.
[video plays: in-the-wild interactions where the robot injects "excuse me" and other filled pauses when the user's attention is elsewhere, then continues with the directions.]
so that "excuse me" might be a bit aggressive; you know, there's a lot of tuning. once you put this in there, you realize the next layer of problems that you have, like how the synthesis is not quite conversational enough, and, you know, the nuances of saying "so" versus "so...", and "excuse me", and so on. and while these videos again might make it look like, wow, we can go quite far, I don't wanna leave you with the wrong impression: a lot of work remains to be done. these things often fail; the videos I've shown you are moments when things work relatively well, I would say. but these things often fail, and I want to show you one interesting example of a failure.
[video plays: the robot starts giving directions, pauses to wait for the person's attention, and the person walks away before the rest of the instructions are delivered.]
so what actually happens here? well, what happens here is that we are paying a lot of attention to coordinating our speech with the participant's attention, but we're completely ignoring what his upper body and torso are signalling. so what happens here is: the robot gets to this phrase where it says "to get there, walk to the end of this hallway", at which point the person feels that maybe this is the end of the instructions. so they start turning both their face and their body, to kind of indicate that they might be leaving, right? the robot sees their attention go away and thinks, well, I'm gonna wait for their attention to come back, and the long pause that gets created further reinforces the person's belief that this is the end of the directions, so "I'm just going", even though the robot had all these other things to say, right?
and so the robot in some sense ignores the signal from his upper body. and if the robot could take into account that signal, we could be a bit smarter and maybe not wait there, maybe use a different mechanism to get their attention back, or maybe just blast through it; you don't always have to coordinate exactly that way, right? and so I love this example, because it really highlights and drives home this point I'm trying to make, that dialogue is really highly coordinated and highly multimodal. dialogue between people in face-to-face settings has these properties, you know. we've talked about coordinating speech and gaze, and we've seen in this example how not reasoning about body pose gets us into trouble.
there's many other things going on. we do head gestures, like nods and shakes and all sorts of other head gestures, and there's a myriad of hand gestures, you know, from beat to metaphoric to iconic to deictic gestures; facial expressions, smiles, frowns, expressions of uncertainty; where we put our bodies and how we move dynamically; prosodic contours. all of these things come into play, and they're highly coordinated, frame by frame, moment by moment, and the coordination that happens is not just across the channels, it's across people and these channels. and so I'd like us to think about dialogue in this view, less from a view of, you know, a sequence of turns and more in the view of a multimodal, incrementally co-produced process. and I think if we do that, there's a lot of interesting opportunities, because of these enabling technologies that are coming up these days.
so I've shown you a couple of problems in the space of turn taking and engagement. there's many more problems, and every time we touch one of these we really feel like we've barely scratched the surface. take for instance engagement: I talked for a bit about how to forecast disengagement and maybe negotiate the disengagement process better, but there's many other problems. how do we build robust models for making inferences about those engagement variables, like engagement states, engagement actions, and intentions? how do we construct measures of engagement that are more continuous? here all the work we've done is on I'm engaged or I'm not engaged; well, in an educational or tutoring or other kind of setting, you want a more continuous measure of engagement. how do you reason about that? similarly, many other problems in turn taking, in understanding, in how do we ground all these things in the physical situation. there's interesting challenges with rapport, with negotiation, with grounding; lots of open space, lots of interesting problems, once you start thinking about how the physical world and all these channels interact with each other.
like I said, I think we have these interesting opportunities because there has been a lot of progress in the visual and perception space: face tracking, facial expression tracking, smiles, affect recognition and so on, that can help us go in this direction. I think the other thing that I really want to highlight, besides the current technological advances, that I think is very important, is all this body of work that comes from connected fields like anthropology, sociology, psychology, sociolinguistics, conversation analysis, context analysis and so on.
there's a wide body of work: basically, as soon as people got their hands on video tapes in the fifties and sixties, they started looking carefully at human communicative behaviors. and all that work was done based on, you know, small snippets of video, and if you think about it, today we have millions of videos and interesting, powerful data techniques. so there's interesting questions about how do we bring this work into the present, how do we leverage all the knowledge and the theoretical models that have been built in the past. I've put here just some names; there's many more people that have done work in this space, and I picked one title from each of them. each of these guys has full bodies of work, and I really recommend that as a community we look back more on all this work that has been done already on human communication, and try to understand how to leverage that when we think of dialogue.
so, with that, I guess I have about ten minutes left. I want to kind of switch gears a bit and talk more about challenges, because, you know, there's a lot of opportunity, there's a lot of open field, but working in this space is not necessarily easy either. and when I think of challenges, at a high level I think of three kinds of categories. there's obviously the research challenges that we have, like I wanna work on this problem of forecasting disengagement, how will I solve it; there's obviously the research challenges. but I'm gonna leave those aside and try to talk about two other kinds of challenges. one is data and experimentation challenges, and we touched briefly on this in the panel yesterday. I think getting data for these kinds of systems is not easy.
if you look at a lot of our adjacent fields, like machine translation and speech recognition and NLP and so on, a lot of progress has been accomplished by, you know, challenges with datasets and clear evaluation metrics and so on. in dialogue this is not easy to do, and it's not easy to do because dialogue is an interactive process; you cannot easily study it on a fixed dataset, because by the time you've made an improvement or changed something, the whole thing behaves differently. and so that creates challenges generally for dialogue, and even more so for multimodal dialogue, in the multimodal space. then, apart from the data challenges, there's also kind of experimentation challenges. we've done a lot of our work in the wild because I feel like you see the real problems, you see ecologically valid settings, and you see what really happens. some of these phenomena are actually probably challenging and hard to study in controlled lab settings, like studying how engagement breaks apart and so on; you can think of all sorts of things with confederates and you can try to, you know, figure out controlled experiments, but it's not easy. and on the other hand, experimenting in the wild is not easy either, for many reasons.
one of the other kinds of challenges here is purely building up the systems, right? so in our work over the last ten years, the way we've gotten our data is by building systems and deploying them. but building systems is hard, and so in the last five minutes I wanna talk a bit about the actual engineering challenges, because I think they're just as important, in that they kind of put a damper on the research and they kind of stifle things from moving forward faster. building these kinds of multimodal systems is hard for a number of reasons.
first, there's a problem of integration: these systems leverage many different kinds of technologies that are of different types and operate on different time scales, and the sheer complexity and the number of boxes you have in one of these systems kind of makes the problem challenging. but then there's other things: constructs that are pervasive in these systems, like time, space, and uncertainty, are nowhere in our programming fabrics. it's kind of clear to me that time, for instance, is not a first-order citizen in any programming language that I can think of. so every time I wanna do something that's over time, or streaming, I have to go implement my buffers and my streaming, and, you know, I kind of have to go from scratch; and it's similar for space and uncertainty.
but it is very important, because we want to create systems that are fluid, but the sensing, thinking, acting, all of these things take time. being fast is not even enough; oftentimes you need to do fusion in these systems, and things arrive with different latencies, so you need to coordinate. basically, you need to deal with time in a deeper sense, deep down below. and the same things can be said, I think, in these systems about the notions of space and notions of uncertainty.
and finally, the other thing that kind of puts a damper on it is the fact that the development tools we have are not there for this class of systems. so the development environments and debuggers and all of this stuff were not developed with this class of systems in mind. and if I think back over all the work we've done, I don't know, half the time has maybe been spent on building the tools to build the systems, rather than building the systems or doing the research.
and so, basically driven by a lot of the lessons we've learned over the years, in the last three or four years at MSR we basically embarked on this project, and I wanted to spend the last couple of minutes telling you about it, because if there are any people in the room that are interested in joining this space, this might be useful for them. we've worked on developing an open-source platform that basically aims to simplify building these systems, the end goal being to lower the barrier to entry and enable more research into this space. so it's a framework that is targeted at researchers; it's open source, and it supports the construction of this kind of situated interactive systems. we call it Platform for Situated Intelligence, which is kind of a mouthful, so we abbreviate it \psi, pronounced like the greek letter psi. and I want to just give you a whirlwind tour in two minutes, just to kind of give you a sense of what's available in there.
the platform consists of three layers: there's a runtime layer, a set of tools, and a set of components. the runtime basically provides all this infrastructure for building systems that operate over streaming data and have latency constraints; anytime you have something interactive, it's latency constrained. so there's a certain model for parallel, coordinated computation that actually feels pretty natural: you just kind of connect components with streams of data, so it's the standard sort of data-flow model. but the streams have some really interesting properties, and I don't have time to get into the full detail and all the glory here,
but I wanna kind of highlight some of the important aspects. so for instance, I mentioned about time, how time should be a first-order citizen; well, we baked that in from day one, deep below in the fabric. all messages that are flowing through are timestamped at the origin, when they're captured, and then, as they flow through the pipeline, we have access not only to the time the message was created by the component that created it, but also to that originating time. so we know this message has a latency of four hundred and thirty milliseconds, so in the entire graph we can see latency at all points, which enables synchronization. so we provide a whole time algebra and synchronization mechanisms, when you work with streaming data, that pair these messages correctly and so on. so it's basically all about enabling coordinated computation where time is really a first-order citizen.
the streams can be automatically persisted, so there's a logging infrastructure that is there for free for any data type; you can stream any of your data types and we can automatically persist those. and because we persist them with all this timing information, we can enable more interesting replay scenarios, where I say, well, forget about these sensors, let's play it back from disk and tune this component. and I can play it back from disk exactly as it happened in real time, or I can speed it up or slow it down; time is entirely under our control, because it's baked deep down in the fabric.
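to illustrate the idea (the actual \psi runtime is a .NET framework, so this Python sketch is not its API), each message carries its originating time alongside its creation time, which makes latency visible everywhere and lets streams be paired on originating time, whether they arrive live or are replayed from a store.

```python
# Illustrative sketch of originating-time-stamped messages and a simple
# originating-time join; not the \psi API, just the underlying idea.
from dataclasses import dataclass
from typing import Any

@dataclass
class Message:
    data: Any
    originating_time: float   # when the underlying signal was captured
    creation_time: float      # when this component produced the message

    @property
    def latency(self) -> float:
        return self.creation_time - self.originating_time

def join(stream_a, stream_b, tolerance=0.033):
    # Pair messages whose originating times are within `tolerance` seconds,
    # e.g. fuse audio and video that arrive with different latencies.
    # Assumes both streams are sorted by originating time.
    pairs, j = [], 0
    for a in stream_a:
        while j < len(stream_b) and stream_b[j].originating_time < a.originating_time - tolerance:
            j += 1
        if j < len(stream_b) and abs(stream_b[j].originating_time - a.originating_time) <= tolerance:
            pairs.append((a, stream_b[j]))
    return pairs
```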
so these are some of the properties of the runtime; there's a lot more. it's basically a very lightweight, very efficient kind of system for constructing things that work with streaming data. at this level we don't care, we don't know anything about speech or dialogue or components; it's agnostic to that. you can use it for anything that operates with streaming data under temporal constraints.
the set of tools we built are basically heavily centered on visualization. this is a snapshot from the visualization tool we have; on the right there, someone's actually using it, and this video is sped up a bit. but these are the streams that were persisted in an application; these are different visualizers for different kinds of streams that can get composited and overlaid. so this is a visualizer for an image stream, this is a visualizer for a face detection results stream, this is audio, this is voice activity detection, that's a speech recognition result, this is a visualizer for a full 3D conversational scene analysis. and the basic idea is that you can composite and overlay these visualizers, and then you can navigate over time, left and right, and zoom in and look at particular moments. this is very powerful, especially when coupled with debugging. and we're evolving this to visualize not just the data collected and running through the systems, but also the architecture of the system itself, you know, the view of the component graph, and also towards annotation, for supporting data annotation.
finally, at the components layer, we are hoping to create an ecosystem of components where people can plug and play different kinds of components. we're bootstrapping this with things like sensors, imaging components, vision, audio, speech; those are relatively simple components that we have in the initial ecosystem. but the idea is that it is meant to be an ecosystem, and people are meant to contribute into it. it is an open-source project; there's already Boise State, Casey Kennington has his own repository of \psi components. and so people are starting to use this, and the hope is that more people will use it. if I can get you to have eighty percent of what you need off the shelf, and just focus on your research, that's the key idea.
the last thing I'll say is that something we haven't released yet, but are planning to release in the next few months, is an array of components that we refer to as the situated interaction foundation. it's basically a set of components at that level, plus a set of representations, that aim to further abstract and accelerate the development of these physically situated interactive systems. basically, what we are planning to construct is the ability to instantiate a perception pipeline where you, as a developer of the system, just say where your sensors are and what sensors you have. so in this instance, there's a kinect sensor; the big box there represents my office, and there's a kinect sensor sitting on top of the screen. and if you tell me you have three sensors, I'm gonna use the data from all three sensors and fuse it; we're gonna configure the perception pipeline automatically from all the sensors we have, with the right fusion, and provide this kind of analysis, a deep scene analysis object, that runs at frame rate, at thirty frames per second, and it's gonna tell you things like: here's where the people are in the scene and what their body poses are, here's where everyone's attention is.
in this case, there's an actual engagement happening between the two of us and an agent that's on the screen, and Stewart is, you know, directing the utterance towards the agent. and at some later point, we have peeled off, we've gone more towards the back of the office, towards the whiteboard, and we're just talking to each other. and so we're trying to provide all this rich analysis of the conversation and the conversational scene, including issues of engagement, turn taking, utterances, sources, targets, and all of that, from the available sensors. and if you give me more sensors, the idea is that you get the same object back, but at a higher fidelity, because we have more sensors and we can fuse data.
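as a purely hypothetical sketch of what such a frame-rate scene analysis result could contain (the actual situated interaction foundation representations are not yet released, so these names and fields are my illustration, not its schema):

```python
# Hypothetical shape of a per-frame scene analysis result.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Person:
    id: str
    body_pose: List[Tuple[float, float, float]]   # 3D joint positions
    attention_target: Optional[str]                # id of person/agent/object attended to

@dataclass
class SceneAnalysis:
    timestamp: float
    people: List[Person] = field(default_factory=list)
    engagements: List[Tuple[str, str]] = field(default_factory=list)  # who is engaged with whom
    utterances: List[dict] = field(default_factory=list)              # source, targets, text
```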
this part has not been released yet; it will be coming out probably in the next couple of months. but our hope with the entire framework is basically to accelerate research in this space, to get people to be able to build and experiment with these kinds of systems without having to spend two years to construct all the infrastructure that's necessary.
and so this brings me basically to the end of my talk; I'll conclude on this slide. I've tried to adopt this view of dialogue in this talk, and portrayed this view of dialogue as a multimodal, incrementally co-produced process, where participants in the interaction really do fine-grained coordination across all these different modalities. I think there is a tremendous number of opportunities here, and I think it's up to us to basically broaden the field in this direction, because the underlying technologies are coming, and they are starting to get to the point where they're reliable enough to start to do interesting work. and again, there's this big body of work in human communication dynamics that we can leverage and that we can draw upon. so I'll stop here; thank you all for listening, and I welcome your questions.
thanks very much, Dan. thank you, Dan, it was so great to see all this work again, and how impressive the research program has been over the number of years to get to this point. I'm really looking forward to the situated interaction foundation coming out.
I have a question, I guess, related partly to that. one of the problems with integration is not just taking a bunch of pieces and putting them together, but the maintenance of that over time, as you add new pieces. so, in particular for this last thing: how much can you, just by adding a new component, expect everything else to work the way it did, and just have some value added by getting new information? and how much do you have to re-engineer the whole architecture to make sure that you're not undoing things or getting into a problem? thinking, you know, in terms of engineering, the recent plane flight crashes seem to stem from this kind of thing, where different engineers designed systems very well given a set of assumptions about what else would be there, or not, and then that changed under them, and that seems to have caused the problem, right?
I mean, I completely agree. the ideal world is one where, you know, everything works, you just plug your thing in, but in reality it's never that way, right. it is gonna be different people with different research agendas, who view things differently, have different mental models or different viewpoints from which they look at a problem and attack it. and I think that does create challenges; I don't know how to solve all those challenges. all I can say is that we were kind of aware of that, and when we were constructing this, we were trying to make as few commitments, in some sense, as possible, to allow for the flexibility that's needed for research, because I think there's actual value in all those different viewpoints and different architectures and exploration. and so, yes, I think what I can say is that we are purposefully trying to not make hard commitments to, say, what is an utterance; I don't wanna tell you what an utterance is, I want to let you have your own opinion of what an utterance is. but that also might mean that, again, when you try to plug your speech recognizer into my system, there might need to be some wrangling and so on, you know, to make these components work together. I don't know how we can solve this problem; I'm not a big believer in "oh, we'll all come together with the big beautiful standard that we'll all agree to", I don't see that happening. we're just trying to design towards flexibility, I would say.
and I think that was a wonderful talk, and you're highlighting these things, and you're right, it is time for us to be able to address them, and we should be working more on this, beyond the simple turn. sorry, I might be introducing something even more complex down the line, but I wonder about user adaptation. users are very good, humans are very good at changing their behavior based on the system that's in front of them: you know, if it's a human, or if it's a phone call and there's a delay, we will or will not backchannel, because it screws up the conversation. and people can adapt to this quite fast, and that might be confusing to our learning systems, which then might not be able to tease apart the effects of users adapting from the most natural behaviors. have you thought about how to deal with that, about not getting the human to adapt, or being able to control how the human adapts to the particular system and the policies that you're using for adaptation?
no, I think it's a very interesting question. so there's a couple of things here. one is, I do know that in a lot of the data we have, we observe a large variability between people's attitudes and what people do, both in, you know, just the initial intent with which they come towards the system and the expectations they have, and also in how they do or do not adapt to whatever the system is doing. I guess my view, one thing I would say, is I think more of these systems should be learning continuously, because you are basically in a continuous dance with the person on the other end in this adaptation, you know, and doing things in big batches is likely to create more friction than doing things in this continuous, adaptive way. so I think that's an interesting direction for attacking this problem.
I feel a lot of the work, the way I'm thinking of it, is I want to reduce this impedance mismatch in interaction between where machines are and where people are, and I think we still have a lot to travel with the machines this way. people always accommodate to whatever the machines do and mediate, but I think I want the machine to go be closer to where the human is, and that would make things easier. so I think of all the work we've done, and the way I see it, as: I'm gonna try to reduce that impedance from the machine side as much as possible. but you're right that sometimes, with clever designs, you can actually, you know, create interesting experiences that leverage that adaptation, when you know it's gonna happen. but I think in most cases I'm in favor of systems that just incrementally adjust themselves to be able to be at the right spot, 'cause it continues to shift. I don't know if that really addresses the question, or somewhat addresses it.
hi, I'm Robert Ross from Technological University Dublin, speaking maybe as one of the many people here who over the years have wasted two years of our lives building dialogue systems from the ground up. I think what you presented there at the end is fantastic, but my question is a bit more specific, in terms of the work you did on interjections being used and hesitations being used to sort of keep the user's engagement. in the work in the wild, did you do any variation in terms of the multimodal aspects of that, in other words the avatar that's being used, the gestures that were being used, in fact whether or not using an avatar was a good idea? that's my fine-grained question. and then, just a more general question: have you looked at all at the issues of engagement in terms of activity modeling? because it's always struck me that a big problem in situated interaction, when you move away from the kiosk style where the user is asking a question, is that users are engaged in activities, and for us to truly get situated interaction working, we necessarily need to track the user and what they're doing, to be able to make sensible contributions to the dialogue, not just answer questions. yep.
so, to the first part of the question, the short answer is no, but we should have. like, I think there's a rich set of nuances, basically, in how you do hesitations and interjections and all these policies, and definitely the corresponding nonverbal behaviors would affect that. and we've just seen it in the prosodic contours of the "so"; you know, "so" was also not such a good choice, because as a hesitation it sometimes brings people back, like "so, what?", and those nuances are hard to synthesize with the speech synthesis technology we had at the time. so I would say that, yes, we definitely should consider those aspects. the second part of the question, remind me, what was it?
so, I think you're absolutely right. a lot of the work I've shown, that we've done actually in the last, you know, ten years, has been focused on interactions where the whole task is the communication, like the communication that happens between the machine and the person; that conversation we're having is the whole task. we're actually just now starting to do more work with systems where the human is involved in an actual task, not just the communicative task, and we're trying to see how the machine can play a supporting role in that. and I think you're absolutely right that that kind of brings up the next interesting level of how we really get collaboration going, rather than just this kind of back and forth of I can ask or answer a question, and so on. I think that's a very interesting space, and we're just starting to play in that space.
thank you very much for a very interesting talk; I think it's great, this going-out-in-the-wild approach. I was just wondering: I still assume that a Microsoft Research office is a certain type of people who are in there, so it's not completely out in the wild. so it's sort of a question of, have you considered other, I mean, I guess children or other types of user groups, or other types of problems that you might have in a more sort of open setting, or something?
no, we haven't, so the short answer is again no, we haven't. but I completely agree: the population we have is just a very narrow, very specific one. it's interesting to me how much variability I see even in that narrow cross-section, which makes me wonder, like, you know, it's interesting, there's a lot of variability even in that narrow population. but you're absolutely right, it's not truly in the wild, it's not a true public space, and so it would be very interesting to go there and see, 'cause yes, the populations are different. we haven't done much outside this.
okay, let's thank Dan again for a really interesting talk.