another form of communication which is just as important, namely nonverbal communication. In my talk I will discuss how to enrich the precise and efficient functioning of computers with the human ability to show and to interpret nonverbal behaviors. You see here the collaboration between a woman and a robot: they are not just collaborating, there is even a kind of close affective bond between them, and this is actually the focus of my research.
My talk will be structured as follows. I will first talk about the recognition of social cues in human-robot interaction, although the technology is of course also useful for other kinds of applications: there are social signals in human-human interaction as well, and in human-virtual agent interaction. Then I will talk about the generation of social cues in human-robot interaction: the robot should not just be able to interpret the human signals, it should also be able to respond to them appropriately. The next topic will be dialogue management: a social virtual human or robot should be able to handle turn taking, and also social cues such as mutual gaze and backchannels. And to handle all these challenges we need of course a lot of data, so the last part of my talk will be on cooperative learning approaches, which ease the effort of human annotators by combining active learning with semi-supervised techniques.
So let's start with the recognition of social cues in human-robot interaction. What kind of social cues are we interested in? Basically speech and facial expressions, gaze, posture, gestures, body movements, and proxemics. And we are not only interested in the social cues of an individual person, but also in interaction patterns such as synchrony, or interpersonal attitudes, for example the dominance of a person or agent in an interaction, and also engagement: how engaged are the participants in an interaction?
If you look at the literature, most attention has been paid to facial features. I do not want to go into detail here; I will just mention the Facial Action Coding System, which is widely applied to recognize, but also to generate, facial expressions. The basic idea is to define action units that characterize emotional expressions, such as raised lip corners, which are usually an indicator of happiness.
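To make the idea concrete, here is a minimal sketch in Python; the AU combinations are common FACS-based heuristics such as AU6 + AU12 for happiness, not the mapping of any particular system. It classifies a set of detected action units by overlap with prototypical combinations:

```python
# Hypothetical mapping of FACS action units to prototypical emotions.
PROTOTYPES = {
    "happiness": {6, 12},        # cheek raiser + lip corner puller
    "sadness":   {1, 4, 15},     # inner brow raiser + brow lowerer + lip corner depressor
    "surprise":  {1, 2, 5, 26},  # brow raisers + upper lid raiser + jaw drop
    "anger":     {4, 5, 7, 23},  # brow lowerer + lid tightener + lip tightener
}

def classify_aus(detected: set[int]) -> str:
    """Return the prototype whose action units best match the detected set."""
    score = lambda aus: len(aus & detected) / len(aus)
    return max(PROTOTYPES, key=lambda emo: score(PROTOTYPES[emo]))

print(classify_aus({6, 12}))  # -> "happiness"
```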
A lot of effort has also been spent on vocal emotion recognition. Just for inspiration, I show you the signal of the same utterance spoken with different emotions; you can see that the pitch contour is quite different depending on the emotion expressed. There has been some effort to find robust predictors of vocal emotion, and I would like to mention the Geneva minimalistic acoustic parameter set, which was recently introduced. Interestingly, if you compare it to brute-force feature sets consisting of thousands of acoustic features, or to deep neural network approaches, the results obtained with the Geneva minimalistic feature set are quite comparable.
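As an illustration, a minimalistic acoustic feature set of this kind can be extracted with the openSMILE Python package by audEERING; this is a generic sketch, not the exact setup of the studies mentioned, and "utterance.wav" is a hypothetical file:

```python
import opensmile

# Extract the extended Geneva minimalistic parameter set (eGeMAPS)
# as utterance-level functionals.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("utterance.wav")  # one row of 88 functionals
print(features.shape)
```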
If you look at the literature, you might get the impression that you obtain very high recognition rates for emotions. It is even a little bit scary: if you take a trained model and test it in the real world, you may find out that the accuracy sometimes comes close to randomized results. So why is that? Previous research has focused on the analysis of acted basic emotions, emotions that are quite extreme, prototypical emotions such as happiness, sadness, disgust, and anger. But the emotional responses of real users can usually not be mapped to basic emotions. We see here, for example, a woman who appears moderately happy in the interaction with the robot, but it is not a clear prototypical expression.
A couple of years ago, colleagues of mine did a study that is relevant here. They investigated the emotion recognition rate for acted emotions, for read emotions, and for emotions arising in a Wizard-of-Oz setting, which of course sound more natural. The task was just to distinguish between emotion and no emotion, so not a very difficult task. For acted emotions they got one hundred percent. For read emotions, which are a little bit more natural than acted emotions, they got eighty percent, which is okay but not really exciting, because chance is already fifty percent if we just need to distinguish between neutral and emotional. And finally, for the Wizard-of-Oz scenario, they got just seventy percent. So obviously, systems developed under laboratory conditions may perform poorly in less controlled scenarios.
The challenge is actually adaptive real-time applications. If you look at the recognition rates people report, you will find that most studies are offline studies: they take a corpus, and the corpus is usually prepared; for example, expressions that cannot be unambiguously annotated with emotional states are simply removed. They also start from the assumption that the recordings are already segmented in some way. But in real life we have to handle noise and corrupted data, so we might have missing information. Furthermore, our classifiers can only rely on previously seen data; we cannot look into the future. And of course, the system has to respond in real time.
So the question is what we can do about that. One thing we might consider is the context. If you look at this picture: can you guess in which emotional state this couple is? We have no idea, because we do not know the context. Any ideas what the emotional state might be? [Audience responds.] You are quite good. Usually people say distress, or sadness. You are actually very good, because it is jealousy; you are the first audience that immediately guessed the correct emotion. Nevertheless, even a system that was able to analyze the facial action units in a perfect manner would have problems figuring this out without knowing the context, and the same holds for the other channels. There is some recent research that considers the context, and it indeed shows some improvement.
A couple of years ago, for example, we investigated gender-specific differences in emotion recognition, and we were able to improve the recognition rates by training gender-specific models.
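A minimal sketch of the idea, with synthetic stand-in data rather than a real emotion corpus: train one classifier per gender and route each new sample to the matching model.

```python
import numpy as np
from sklearn.svm import SVC

def train_gender_specific(X, y, g):
    """One emotion classifier per gender value."""
    return {v: SVC().fit(X[g == v], y[g == v]) for v in np.unique(g)}

# Synthetic stand-in data: 40 samples, 5 acoustic features, 2 emotion classes.
rng = np.random.default_rng(0)
X, y, g = rng.normal(size=(40, 5)), rng.integers(0, 2, 40), rng.integers(0, 2, 40)

models = train_gender_specific(X, y, g)
x_new, g_new = rng.normal(size=5), 1
print(models[g_new].predict(x_new.reshape(1, -1)))  # route by detected gender
```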
A related approach was taken by Cristina Conati: she considers the success and failure of the user's goals during an application. For example, if a student is having a hard time and is smiling while interacting with a learning application, then probably the student is not really happy; it might be that the student does not take the system seriously. Even though this approach is quite reasonable, it has not been picked up so much.
We also considered the dialogue context, for example for the dialogue behavior of a virtual agent in a job interview training scenario. When the job interviewer asks difficult questions, for example about the weaknesses of the candidate, then this helps to predict the user's likely emotional state. And there are also approaches to learn the temporal context, using bidirectional long short-term memory neural networks. So the context might be a good option to consider.
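A minimal sketch (PyTorch, hypothetical dimensions) of such a bidirectional LSTM: each frame-wise prediction can draw on both past and future temporal context.

```python
import torch
import torch.nn as nn

class FrameEmotionBLSTM(nn.Module):
    """Map a sequence of per-frame features to per-frame emotion scores."""
    def __init__(self, n_features=88, n_hidden=64, n_classes=4):
        super().__init__()
        self.blstm = nn.LSTM(n_features, n_hidden, batch_first=True,
                             bidirectional=True)
        self.head = nn.Linear(2 * n_hidden, n_classes)  # forward + backward states

    def forward(self, x):                  # x: (batch, time, n_features)
        h, _ = self.blstm(x)               # h: (batch, time, 2 * n_hidden)
        return self.head(h)                # per-frame class logits

logits = FrameEmotionBLSTM()(torch.randn(2, 100, 88))  # e.g. 100 frames
print(logits.shape)                                    # (2, 100, 4)
```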
Another, maybe obvious, thing to consider is to use multiple modalities. Here you can see two pictures of a tennis player; it is actually the same person in both. If you only look at the faces, at least for me it is not possible to recognize any difference. But if you look at the body, you can match the two pictures: on the right, the player has obviously just won the point; on the left he is not very happy about it. Nonetheless, you cannot tell this from the face alone.
So multimodal fusion might help. There is an interesting meta-study by D'Mello et al. on multimodal affect detection. The meta-study investigated how much multimodal approaches outperform unimodal approaches, and it showed that the improvement correlates with the naturalness of the corpus, which is exactly the issue. For acted emotions you get quite high recognition rates, and by using multiple modalities you can even get an improvement of more than ten percent. But for the more difficult task, namely spontaneous emotions, the improvement was less than five percent, which is really bad news: why should we ask the user to wear additional devices just to gain less than five percent in recognition rates? The explanation is that in natural interaction, people do not express an emotion in all channels at the same time; one channel may or may not show the emotion, so not all channels express it in the same manner.
We investigated this assumption ourselves. We took a corpus and annotated the affect once based on the video alone and once based on the audio alone, and then we analyzed where the annotations mismatch. And indeed, where the annotations mismatched, we also got low recognition rates.
I will show you an example; look at the woman here. In the second recording, the woman shows a neutral face while the voice is happy, and a little bit later it is the other way round: the face looks happy, but the voice is neutral. So the question is what a fusion approach should do in such a situation. Here I sketch a potential solution: the modality-specific recognizers decide themselves when to contribute to the fused result, and the values in between are interpolated; by this interpolation we get better recognition results.
If you look at the literature, most fusion approaches are synchronous fusion approaches. Synchronous fusion approaches are characterized by the combination of the multiple modalities within the same time frame: for example, people take a complete sentence, analyze the face over the complete sentence, and analyze the voice over the complete sentence. Asynchronous fusion approaches, in contrast, tolerate that the modalities do not contribute at exactly the same time. They do not assume that, for example, audio and video express an emotion at the same moment, and they are therefore able to track the temporal structure of the individual modalities. So it is very important, if you use a fusion approach, to use one that is able to consider temporal dependencies: the dependencies within modalities, but also the interdependencies between modalities. And that is only possible if you go for a frame-wise recognition approach.
We followed this approach as well. We adopted an event-based fusion approach, where we consider events as an additional layer of abstraction between raw signals and higher-level emotional states; events are social cues such as a smile or laughter. In this way we were able to work out the temporal relationships between the channels and to learn when to fuse the information. And in case some data are missing, this approach still delivers reasonable recognition results.
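A minimal sketch of the event-based idea (the event format and the cue-to-valence weights are illustrative assumptions, not the actual system): detectors emit discrete cue events with time stamps and confidences, and the fusion layer combines whatever events are available, decaying older ones, so a temporarily silent channel simply contributes nothing instead of breaking the fusion.

```python
import math
from dataclasses import dataclass

@dataclass
class Event:
    t: float      # time stamp in seconds
    cue: str      # e.g. "smile", "laughter", "raised_voice"
    conf: float   # detector confidence in [0, 1]

CUE_VALENCE = {"smile": +1.0, "laughter": +1.0, "raised_voice": -0.5}

def fuse(events, now, half_life=2.0):
    """Confidence- and recency-weighted valence estimate from recent events."""
    num = den = 0.0
    for e in events:
        w = e.conf * math.exp(-(now - e.t) * math.log(2) / half_life)
        num += w * CUE_VALENCE.get(e.cue, 0.0)
        den += w
    return num / den if den else 0.0

events = [Event(0.2, "smile", 0.9), Event(1.1, "laughter", 0.7)]
print(fuse(events, now=1.5))  # positive valence from the smile and laughter events
```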
Let's have a look at an example; it is a simplified example. Here we have audio and we have facial expressions, and the fusion approach combines them with certain degrees of confidence. Now let's assume that, for some reason, the audio is no longer available: by interpolation, we still get quite reasonable results.
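A minimal sketch of this situation with synthetic numbers: frames where the audio dropped out are filled by interpolation and then fused with the video stream at a reduced weight.

```python
import numpy as np

def fill_gaps(scores, conf):
    """Linearly interpolate frames where the modality had zero confidence."""
    ok = conf > 0
    t = np.arange(len(scores))
    return np.interp(t, t[ok], scores[ok])

audio  = np.array([0.8, 0.7, np.nan, np.nan, 0.6])   # valence from voice; 2 frames lost
a_conf = np.array([1.0, 1.0, 0.0, 0.0, 1.0])
video  = np.array([0.2, 0.3, 0.4, 0.5, 0.5])         # valence from face, always present

audio = fill_gaps(audio, a_conf)                     # -> 0.8 0.7 0.67 0.63 0.6
w_audio = np.where(a_conf > 0, 1.0, 0.3)             # interpolated frames weigh less
fused = (w_audio * audio + video) / (w_audio + 1.0)
print(fused)
```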
We compared a number of asynchronous fusion approaches, synchronous fusion approaches, and event-driven fusion. For the asynchronous fusion approaches, we considered, for example, recurrent neural networks that take into account the temporal history of the signals, and also bidirectional long short-term memory networks, which are able to look into the future as well as to learn from the temporal history. What you can see here, which is quite exciting, is that the asynchronous fusion approaches actually outperform the synchronous fusion approaches. So the message, I would say, is: if you fuse modalities, you should go for an approach that is able to consider the temporal dependencies within the modalities, but also the interdependencies between the modalities.
And actually, to support the development of social signal processing approaches for online recognition tasks, we developed a framework which is called SSI, for Social Signal Interpretation. This framework synchronizes the modalities, and it supports the complete machine learning pipeline, offering various kinds of machine learning approaches. We are able to record with all modalities simultaneously, and whenever a new sensor or device becomes available, people can write a plug-in for it. We support motion capture, gaze tracking with various kinds of eye trackers, stationary as well as mobile ones, and also devices such as the Kinect; basically, all kinds of sensors that are currently available.
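SSI itself is implemented in C++, so the following is only a conceptual sketch in Python of the synchronization idea, not its API: sensor streams arriving at different rates are resampled onto one common frame clock before fusion.

```python
import numpy as np

def resample(ts, values, frame_clock):
    """Linearly interpolate a (timestamp, value) stream onto the shared clock."""
    return np.interp(frame_clock, ts, values)

clock = np.arange(0.0, 2.0, 0.04)   # 25 fps master clock

# Two hypothetical sensor streams with different native rates.
audio_feat = resample(np.arange(0, 2, 0.01), np.random.rand(200), clock)   # 100 Hz
gaze_x     = resample(np.arange(0, 2, 1/60), np.random.rand(120), clock)   # 60 Hz

assert audio_feat.shape == gaze_x.shape   # aligned, frame-synchronous streams
```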
So this was the part on emotion recognition; now I would like to come to the other side, namely the generation of social cues by the robot. Obviously, it is not sufficient to recognize emotions; you also need to respond appropriately, or at least to prepare appropriate responses. I guess it is clear why we need nonverbal signals: they not only express emotions, but also attitudes and intentions. They also convey interpersonal relations; they reveal, for example, whether you are interested in talking to somebody or not. Nonverbal behaviors can of course also help to understand the verbal messages, and in general they make the communication more natural and plausible.
We did an experiment a couple of years ago with a NAO robot. Of course, the NAO robot does not have an expressive face, so we had to look for other options, and we looked at action tendencies, which are related to emotions. An action tendency is what you show before you start an action; this is very common in sports: you see a pose that precedes an action, and even though the action has not started yet, it is quite clear what is coming next. With NAO we simulated action tendencies such as approach, attack, and submission, and it turned out that people were able to recognize the simulated action tendencies.
Later we actually got a robot head, and here we tried to simulate facial expressions. We again started from the facial action coding system I mentioned earlier. Around forty action units have been identified for the human face, and we asked the question: can we simulate these action units on the robot? It turned out that the robot allowed the simulation of just seven action units. The robot has a synthetic skin, and underneath the skin there are motors; the motors can move and thereby deform the skin. So we were only able to simulate seven action units, and the question is whether this is enough. I will show you a video.
The video is in German, with English subtitles. The robot introduces a talk; the focus here is on the nonverbal signals, so it is not necessary that you understand what is being said. Note that the machine did not consider, at this stage, the semantics of the utterances.
[Video plays.]
Okay, just to show you that the system really does not consider the semantics, here is another example. [Video plays.] This just shows that you can maintain a conversation on the basis of emotional signals alone; but that is of course not the whole story. Humans use a different, deeper form of empathy, so maybe we should aim at that.
So what is empathy? Empathy is an emotional response that stems from the comprehension of the emotional state of another person. The emotional state of the other person might be similar to your own emotion, but it does not have to be the same emotion. Empathy requires, first, the perception of the emotional state of the other person, and this is what we can model with signal processing technology. But it also requires an appraisal of the situation of the other person: you somehow need to know what the other person is feeling and why, not just to observe it. And you are also required to decide how to respond to the other person's emotion. For example, in a tutoring system, if the student is in a very negative emotional state, depressed, it could be a disaster if the virtual agent would simply mirror the emotional state of the student, because it might make the student even more depressed.
So what we realized is a kind of empathy mechanism: we perceive emotions and try to understand the emotional state; after understanding the emotional state of the other person, we choose an internal reaction; and then the question is how the external reaction should look, that is, what the virtual agent or robot should actually show, which need not be the same as the internal reaction.
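A minimal sketch of this decoupling (the labels and responses are purely illustrative assumptions): the internal appraisal and the externally shown reaction are separate steps, so a depressed user is met with encouragement rather than with a mirrored emotion.

```python
def internal_reaction(user_emotion: str) -> str:
    # Naive appraisal: the agent "feels with" the user.
    return {"depressed": "concern", "happy": "joy"}.get(user_emotion, "neutral")

def external_reaction(user_emotion: str, internal: str) -> str:
    if user_emotion == "depressed":
        return "encouraging_smile"   # deliberately do NOT mirror the sadness
    return {"joy": "smile"}.get(internal, "attentive_gaze")

emo = "depressed"
print(internal_reaction(emo), "->", external_reaction(emo, internal_reaction(emo)))
```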
In another example, empathic behavior was actually simulated with an appraisal model; I should say that the dialogue in the video I will show you is scripted. First of all, the robot appraises what happens in the dialogue and labels it with emotions, and we also let it comment on the user's emotions. The story is about a forgotten medication: the robot shows concern about the forgotten medication in order to increase the user's awareness, but it does so in a subtle manner. The user is actually annoyed, and the robot then shows its intentions while calming the user down. I will now play the video; what is actually kind of amazing is how clearly the robot's disappointment comes across. [Video plays.]
Okay. To develop a better understanding of the emotions of users, we are currently investigating how to combine social signal processing with an affective theory of mind; this is a cooperation with a partner group. Our partner has developed a computational model to simulate emotional behaviors, and the basic idea is to run such an emotion simulation and then to check how well what we recognize in terms of social cues actually matches the simulation. Let me go into a little more detail: we do not just consider how a person displays an emotional state; we also consider how people regulate their emotions. Let me show you an example.
Let's consider shame. If you do not regulate your emotions at all, like the person shown here, then this is the typical emotional expression we would expect. But people usually regulate their emotions, actually in order to better cope with the emotional state, and they choose different ways to regulate them. Avoidance is one reaction; but you may also deceive yourself, for example by saying "okay, I am fine"; or you blame another person. And what you can see is that we get quite different signals depending on the way people regulate their emotions. If you use a typical machine learning approach to analyze the social cues, you will never be able to infer the underlying emotions, because you do not know how people have regulated the display of their emotional state. Here, and we had this discussion already yesterday, maybe we can use machine learning approaches as black boxes to recognize certain signals, and combine them with some actual understanding in order to map the social cues onto emotional states.
This is even more important if the system has to respond to an emotional state. Imagine you talk to somebody and the other person does not really understand what your problem is, and is just responding in a schematic manner; we would not call that empathic behavior.
Towards the end of my talk, I would like to come to another topic, namely dialogue between humans and robots. This is work from a PhD project on engagement in human-robot interaction. We looked at signs of engagement in human-robot dialogue, such as the amount of mutual gaze, directed gaze, and turn taking. I will show you an example here; it is a game between a robot and a user. The user wears eye-tracking glasses, so that the robot knows where the user is looking.
In this specific scenario, we first simulated directed gaze, which is a kind of functional gaze: the robot is able to detect which object the user is focusing on, and this makes the interaction more efficient, because the user is no longer forced to describe the object. We also implemented social gaze in the whole scenario; the social gaze, in contrast, does not have a functional role. The dialogue was completely understandable without the social gaze; we just wanted to know whether it makes any difference. So, very quickly: for directed gaze, the robot has the following two options, pointing at the object or just looking at the object; and for mutual gaze, both interaction partners establish eye contact.
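A minimal sketch (NumPy, hypothetical 3D coordinates) of how mutual gaze could be detected from an eye tracker and the robot's head pose: both gaze rays must point at the other's face within a small cone.

```python
import numpy as np

def looks_at(origin, gaze_dir, target, max_angle_deg=10.0):
    """True if the gaze ray from `origin` points at `target` within a cone."""
    to_target = target - origin
    cos = to_target @ gaze_dir / (np.linalg.norm(to_target) * np.linalg.norm(gaze_dir))
    return np.degrees(np.arccos(np.clip(cos, -1, 1))) < max_angle_deg

user_eye, robot_head = np.array([0., 1.6, 0.]), np.array([0., 1.2, 1.5])
user_gaze = robot_head - user_eye            # the user happens to look at the robot
mutual = looks_at(user_eye, user_gaze, robot_head) and \
         looks_at(robot_head, user_eye - robot_head, user_eye)
print(mutual)  # True -> the robot can respond, e.g. by holding eye contact
```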
The next thing we realized was gaze-based disambiguation. Gaze-based disambiguation is interesting insofar as people fixate an object, look away, and then fixate it again. So we need a different disambiguation approach than, for example, for pointing gestures: when people point, they usually point just once, and that's it; gaze behaves differently. We also realized some typical gaze behaviors that occur in turn taking. Speakers usually look away from the addressee to indicate that they are in the process of thinking about what to say next, and also to show that they do not want to be interrupted; and typically, at the end of an utterance, speakers look at the other person, because they want to see how the others respond and what they are thinking about what has been said.
So basically, we realized a shared attention mode, where the robot follows the user's hand movements and the robot follows the user's gaze; we also realized social gaze, where the robot recognizes and establishes mutual gaze; and finally, the robot maintains eye contact with the user. I will show you a video. [Video plays: the user asks for "the red wine", which is of course ambiguous, and the robot resolves the reference.]
We did an evaluation of this work, and what we found was that the object grounding was more effective than the social grounding: people were able to interact more efficiently with object grounding, the dialogues were much shorter, and there were fewer misconceptions. The social grounding, on the other hand, did not improve the perception of the interaction, which is of course a pity, because we spent quite some time on the mutual gaze. One assumption is that people were very concentrated on the task instead of on the social interaction with the robot. We might investigate whether, with a more social task, for example looking at family photos together, the social gaze might become more important. Another assumption, which we have not yet tried out, is that some people focus more on the task while others focus more on the social interaction; people can be classified like this, and the latter group might appreciate the social gaze more. This needs further analysis.
Finally, I would like to come to recent developments. We started to record interactions in human dialogue and to learn from these data, on the one hand to make machines interactive, and on the other hand so that a robot can interact with a human. In a project that was already mentioned yesterday, we have collected a corpus of dialogues between humans, and the dialogue data then have to be labeled. We integrated active learning and cooperative learning into the annotation workflow. The basic idea is that the system decides which samples it should present to the human for labeling and which samples it can label automatically: the user is asked to label the examples for which the classifier has low confidence. With this approach we were able to make the annotation process significantly more efficient.
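A minimal sketch of the selection step (scikit-learn, synthetic stand-in data): the model keeps the samples it is confident about and forwards only the low-confidence ones to the human annotator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_seed, y_seed = rng.normal(size=(30, 4)), rng.integers(0, 2, 30)  # small labelled seed
X_pool = rng.normal(size=(200, 4))                                 # unlabelled pool

model = LogisticRegression().fit(X_seed, y_seed)
conf = model.predict_proba(X_pool).max(axis=1)

auto = conf >= 0.75
X_auto, y_auto = X_pool[auto], model.predict(X_pool[auto])  # machine-labelled
X_to_annotate = X_pool[~auto]                               # sent to the human
print(len(X_to_annotate), "samples need manual labels")
```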
This is basically realized as an integration of our annotation tool with the SSI system I mentioned earlier. For the interaction studies, it is important that we annotate, among other things, the interruptions in the corpora of dialogues between humans.
So, to come to a conclusion: I think that human-robot interaction cannot be successful until we treat the problem of appropriate social interaction between robots and humans, in particular if a robot is employed in people's homes. What we need is a fully integrated system consisting of perception, reasoning, learning, and responding. In particular, there is at the moment a big gap between the perception and the reasoning: the reasoning is kind of neglected at the moment in favor of black-box approaches.
Black-box approaches are useful to detect social cues, such as laughter; but after that, we need to reason about what the social signal actually means. And of course, interdisciplinary expertise is necessary in order to emulate aspects of social intelligence; that is why we collaborate a lot with, for example, psychologists. We have also made a lot of software publicly available, in particular the SSI system for social signal interpretation, our annotation tools, and a dialogue manager realized as a finite state automaton, which we have connected to various virtual agents, but also to all kinds of robots.
[Question from the audience, partly inaudible.]

That is actually a good point, because thanks to the eye tracking, the robot is able to recognize where the user is looking with a much higher level of accuracy than any human would be. And some people explicitly pointed as well; of course, humans flexibly combine different kinds of referring acts. In this particular video, the gaze features are just used for the illustration. In the study, some people used pointing and some people did not use pointing at all; nevertheless it worked, because people usually do not point without also looking, so the robot always got the information it needed. And because in a study people usually believe they are being evaluated, they really concentrate on the task, and that is probably why they did not appreciate the social gaze so much.
When the turn-taking behavior was realized, the dialogue was indeed more efficient, because it was clear when the robot was expecting input from the user. But in terms of the subjective evaluation, the participants did not consider the robot's behavior more natural or more social; again, it is really a task-based scenario.
[Question from the audience, partly inaudible.]

I do not have time to show the video of humans collaborating on the same task, but we do have examples of the human-human interaction next to the human-robot interaction. In the human-human case, we had two humans who knew each other very well; their turn taking was very quick, and they hardly had to look at each other, only at the objects on the table. With the robot, by contrast, people adapted: they spoke more clearly and behaved in a more expressive manner. It was also interesting to see how people related to the robot: some used pointing, some did not use pointing at all, and some talked to it almost as to a human partner. [Partly inaudible.] Sometimes they were also surprised; there was one lady who, when the robot's gaze was really clear, said something like "it is just plastic" and found it strange how the robot reacted. By and large, a lot of people seem to find it easier to talk to a robot that expresses itself nonverbally.
[Question from the audience.]

That probably depends on the setting, because, for example, in certain games people intentionally display a particular emotional state, whereas when people regulate their emotions, they usually do not really think about it. [Partly inaudible.] The general question is whether, just by looking at the signals with machine learning, you would be able to recognize the emotional state the person is actually in, and in which situation.
I believe that the face is quite important. I was once at a presentation by a company that was really proud of their robot, and the robot did not have any facial expressions; and somebody in the audience said: I do not understand, then it is just a loudspeaker, what is the point? So I think the body matters, but the face is just as important. [Question from the audience, partly inaudible.] Yes, we had looked at this before, and that was possible with the gaze, in addition to the head pose.