There is another mode of communication which is just as important, namely nonverbal communication. In my talk I will discuss how to enrich the precise and useful functions of computers with the human ability to show and understand nonverbal behaviors. You can see here the collaboration between a woman and a robot: they are not just collaborating, there is even a kind of close affective bond between them, and this is actually the focus of my research.

My talk will be structured as follows. I will first talk about the recognition of social signals in human-robot interaction; of course, the technology is also useful for recognizing social signals in human-human interaction, or in human-virtual agent interaction. Then I will talk about the generation of social signals in human-robot interaction: of course, the robot should not just be able to interpret human signals, it should also be able to respond to them appropriately. The next topic will be dialogue management, in particular gaze-based human-robot dialogue; typical social signals here are mutual gaze and backchannels. And to handle all these challenges we need, of course, a lot of data, so the last part of my talk will be on learning approaches that ease the annotation effort for humans.

So let's start with the recognition of social cues in human-robot interaction. What kind of social cues are we interested in? Basically in speech, facial expressions, gaze, posture, gestures, body movements, and proxemics. But we are not only interested in the social cues of an individual person; we are also interested in interaction patterns, such as synchrony, or interpersonal attitude, for example dominance between the persons or agents in an interaction, and also engagement: how engaged are the participants in an interaction?

If you look at the literature, most attention has been paid to facial features. I do not want to go into detail here; I will just mention the Facial Action Coding System, which is widely applied to recognize, but also to generate, facial expressions. The basic idea is to define action units that characterize emotional expressions, such as the raised lip corners, which are usually an indicator of happiness.

A lot of effort has also been spent on vocal emotion recognition. Just for illustration, I show you here the signal of the same utterance spoken with different emotions. You can see that the pitch contour is quite different depending on the emotion expressed.

And there has been some effort to find good predictors for vocal emotion. I would like to mention the Geneva minimalistic acoustic parameter set, which was recently introduced and which actually attains quite reasonable results if you compare it to brute-force feature sets consisting of thousands of acoustic features, which try to capture everything in the speech signal, or to deep neural network approaches. It is quite instructive to put the results reported in the literature side by side with the results obtained with the Geneva minimalistic feature set.
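Just to make the idea concrete: here is a minimal sketch in the spirit of such a minimalistic parameter set, computing a handful of pitch and energy functionals per utterance with librosa. This is an illustrative stand-in, not the actual GeMAPS definition, and the file name is a placeholder.

```python
# Illustrative stand-in (not the real GeMAPS): a few pitch and energy
# functionals per utterance. "utterance.wav" is a placeholder file name.
import numpy as np
import librosa

def minimal_prosodic_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)   # frame-wise pitch in Hz
    rms = librosa.feature.rms(y=y)[0]               # frame-wise energy
    return {
        "f0_mean": float(np.mean(f0)),
        "f0_std": float(np.std(f0)),    # pitch variability differs across emotions
        "f0_range": float(np.percentile(f0, 95) - np.percentile(f0, 5)),
        "rms_mean": float(np.mean(rms)),
        "rms_std": float(np.std(rms)),
    }

features = minimal_prosodic_features("utterance.wav")
```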

Now, if you look at the literature, you might get the impression that we obtain very high recognition rates for emotions. It is even a little bit scary: if you apply such a model and test it in the real world, you may find out that the accuracy sometimes even comes close to the results of random guessing.

So why is that? Previous research has focused on the analysis of acted basic emotions, emotions that are quite extreme and prototypical, such as happiness, sadness, disgust, and anger. Emotional responses of users can usually not be mapped onto these basic emotions. You see here, for example, that the woman appears to be quite happy in the interaction with the robot, but her state does not clearly correspond to one of the prototypical categories.

A couple of years ago, colleagues of mine conducted a study we were very interested in. They investigated the emotion recognition rate for acted emotions, for read emotions, and for emotions arising in a Wizard-of-Oz setting, which of course sound more natural. The task was just to distinguish between emotional and non-emotional utterances, so not a very difficult task. For acted emotions they got one hundred percent, wow. For read emotions, which are a little bit more natural than acted emotions, they got eighty percent, which is okay but not really exciting, because chance is at fifty percent if you just need to distinguish between neutral and emotional. And finally, for the Wizard-of-Oz scenario, they just got seventy percent. So obviously, systems developed under laboratory conditions perform poorly in less controlled scenarios.

And the challenge is actually adaptive real-time applications. If you look at the recognition rates people report in the literature, you will find that most studies are offline studies. They take a corpus, and the corpus is usually prepared: for example, expressions that cannot be unambiguously annotated with one of the considered emotional states are simply thrown out. And of course, they start from the assumption that the whole corpus is segmented in some way. But in real life we have, on the one hand, noise in the recorded data, so we might have missing information; on the other hand, our classifiers can only rely on previously seen data, so we cannot look into the future. And of course, the system has to respond in real time.
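As a minimal sketch of this online constraint, here is what a causal recognizer loop might look like; `classify` stands in for any trained frame-level model and is an assumption, not a specific system.

```python
# Sketch of the online constraint: a causal sliding window that only ever
# sees past frames. `classify` stands in for any trained frame-level model.
from collections import deque
import numpy as np

class CausalEmotionRecognizer:
    def __init__(self, classify, window_frames=100):
        self.classify = classify                   # e.g. model.predict_proba
        self.buffer = deque(maxlen=window_frames)  # bounded history, no future

    def push(self, frame_features):
        """Called once per incoming frame; never looks ahead."""
        self.buffer.append(frame_features)
        window = np.mean(self.buffer, axis=0)      # crude functional over the past
        return self.classify(window.reshape(1, -1))
```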

So the question is what we can do about that. One thing we might consider is the context. If you look at the picture: can you guess in which emotional state this couple is? Do the people here who do not know the picture have any idea what the emotional state could be? You are quite good! Usually people say distress, or anger is a candidate. You are actually very good at guessing, because it is actually jealousy; you are the first audience that immediately found the correct emotion. Nevertheless, I would say a software system, even one able to detect the facial action units in a perfect manner, would have problems finding that out without knowing the context, or at least consulting other channels. There has been some recent research that considers the context, and it resulted in some improvement.

For example, a couple of years ago we investigated gender-specific aspects of emotion recognition, and we were able to improve the recognition rates by training gender-specific models.
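A minimal sketch of the idea, assuming gender labels are available as metadata: train one classifier per subgroup and route each sample to the matching model (the feature extraction is left out).

```python
# Sketch: one model per gender, routed at prediction time.
# X is a feature matrix, gender a per-sample label array ("f"/"m").
import numpy as np
from sklearn.svm import SVC

class GenderSpecificModel:
    def __init__(self):
        self.models = {"f": SVC(), "m": SVC()}

    def fit(self, X, y, gender):
        gender = np.asarray(gender)
        for g, model in self.models.items():
            idx = gender == g
            model.fit(X[idx], y[idx])    # separate training set per subgroup
        return self

    def predict(self, X, gender):
        return np.array([self.models[g].predict(x.reshape(1, -1))[0]
                         for x, g in zip(X, gender)])
```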

Another approach, by a colleague of ours, considered the success and failure of events during an application: for example, if a student is failing all the time while interacting with a learning application and is smiling, then probably the student is not really happy; it might be that the student does not take the system seriously. Even though this approach is quite reasonable, it has not been picked up very much.

We might also consider the dialogue behavior of the virtual agent, for example in a job training scenario: when, in a job interview, the agent asks difficult questions about the weaknesses of the candidate, this also helps to predict the likely emotional state of the user. And there have been attempts to learn such temporal context using bidirectional long short-term memory neural networks. So the context might be a good option to consider.
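For illustration, a bidirectional LSTM tagger over frame-level features might look roughly like this in PyTorch; the layer sizes are arbitrary, and note that the backward pass over future frames is exactly what makes BLSTMs better suited to offline analysis than to strict real-time use.

```python
# Sketch of a bidirectional LSTM for frame-wise emotion tagging (PyTorch).
# Hidden size and the rest of the training setup are arbitrary choices.
import torch.nn as nn

class BLSTMEmotionTagger(nn.Module):
    def __init__(self, n_features, n_classes, hidden=64):
        super().__init__()
        self.blstm = nn.LSTM(n_features, hidden,
                             batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)  # forward + backward states

    def forward(self, x):          # x: (batch, time, n_features)
        out, _ = self.blstm(x)     # each frame sees past AND future context
        return self.head(out)      # frame-wise emotion logits
```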

Another, maybe obvious, thing to consider is multimodality. Here you can see two pictures of a tennis player, one where she has just won a point and one where she has just lost one. For me it is not possible to recognize any difference in the faces. But if you look at the body, you can match the pictures: the one on the right is obviously the winning moment, and I guess she is very happy about it, but the happiness is conveyed by the body, at least more than is demonstrated by the face.

So, does multimodal fusion of the data help? There is an interesting meta-study on multimodal affect detection. It examined the many studies that compared multimodal approaches with unimodal ones, and it showed that the improvement correlates with the naturalness of the corpus, which is actually bad news. For acted emotions you get quite high recognition rates if you use multiple modalities; you can even get an improvement of more than ten percent. But for the more difficult task, namely spontaneous emotions, the improvement is marginal, which is really disappointing, because you would have to equip the user with additional devices just to get less than five percent improvement in recognition rates.

One assumption is that in natural interactions people often show emotions sequentially: at a given moment, one channel may show the emotion while another channel may not, so not all channels express the emotion in the same synchronized manner. We investigated this assumption: we took a corpus, annotated affect just based on the video, and then just based on the audio, and then we noted where the annotations mismatch. And we looked at the recognition rates, and indeed, where the annotations mismatched, we observed low recognition rates.
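The corpus check just described could be sketched like this, assuming the two label tracks are frame-aligned arrays; the helper is illustrative, not our actual tooling.

```python
# Sketch of the corpus check: compare video-only and audio-only affect
# annotations and report where the two channels disagree.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def modality_mismatch(video_labels, audio_labels):
    v, a = np.asarray(video_labels), np.asarray(audio_labels)
    disagree = v != a
    return {
        "mismatch_rate": float(disagree.mean()),      # share of disagreeing frames
        "kappa": cohen_kappa_score(v, a),             # chance-corrected agreement
        "mismatch_frames": np.flatnonzero(disagree),  # where fusion gets hard
    }
```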

Let me show you another example; look at the woman here. In the second frame the woman shows a neutral face while the voice is happy, and a little bit later it is the other way round: the face looks happy, but the voice is neutral. The obvious question is what a fusion approach should do in such a situation. Here I sketch a potential solution: each modality-specific recognizer might decide when to contribute to the combined result, and the gaps are then interpolated; and via this interpolation we get better recognition results.

If you look at the literature, most fusion approaches are synchronous fusion approaches. Synchronous fusion approaches are characterized by considering the contributions of multiple modalities within the same time frame: for example, people take a complete sentence, analyze the face over the complete sentence, and analyze the voice over the complete sentence. Asynchronous fusion approaches, in contrast, tolerate that the modalities are not aligned at all times. They do not assume that, for example, audio and video express the emotion at the same time, and therefore they are able to track the temporal nature of the single modalities. So it is very important, if you use a fusion approach, to use one that is able to consider not only the contribution of the individual modalities but also the interdependencies between the modalities, and that is only possible if you go for a frame-wise recognition approach.

We followed this route, but went a step further: we adopted an event-based fusion approach, where we consider events as an additional layer of abstraction between the raw signals and the higher-level emotional states. Events are, for example, a laugh, a smile, or similar kinds of social cues. In this way we were able to track the temporal relationships between the channels and learn when events provide information; and in case some data are missing, this approach still delivers reasonable recognition results.

Let's have a look at an example; it is a simplified example, of course. Here we have audio and we have facial expressions, and the fusion approach combines them with certain weights, so we get a certain degree of, let's say, happiness. Now let's assume that for some reason the audio is no longer available: via interpolation we still get quite reasonable results.
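Here is a toy sketch of that behavior: each modality posts timestamped events with a confidence, the fused estimate is a confidence-weighted combination, and an event's influence decays over time, so a channel that falls silent degrades the estimate gracefully instead of breaking it. The numbers and the valence scale are made up for illustration.

```python
# Toy event-based fusion: timestamped events with confidences; influence
# decays over time, so a silent channel degrades the estimate gracefully.
class EventFusion:
    def __init__(self, half_life=2.0):
        self.half_life = half_life   # seconds until an event's weight halves
        self.events = []             # (time, valence, confidence)

    def add_event(self, t, valence, confidence):
        self.events.append((t, valence, confidence))

    def estimate(self, now):
        num = den = 0.0
        for t, v, c in self.events:
            w = c * 0.5 ** ((now - t) / self.half_life)  # decayed confidence
            num, den = num + w * v, den + w
        return num / den if den else None  # None: no evidence yet

fusion = EventFusion()
fusion.add_event(t=1.0, valence=0.8, confidence=0.9)  # e.g. smile from video
fusion.add_event(t=1.5, valence=0.3, confidence=0.5)  # e.g. neutral prosody
print(fusion.estimate(now=4.0))  # audio gone since t=1.5: estimate persists, decayed
```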

We compared a number of asynchronous fusion approaches, synchronous fusion approaches, and event-driven fusion. As asynchronous fusion approaches we considered, for example, recurrent neural networks, which take into account the temporal history of the signals, and also bidirectional long short-term memory networks, which are in addition able to look into the future and to learn the temporal history. And what you can see here, which is quite enlightening, is that the asynchronous fusion approaches actually outperform the synchronous fusion approaches.

So the message, I would say, is: if you fuse modalities, you should go for an approach that is able to consider the contribution of the individual modalities, but also the interdependencies between the modalities.

To support the development of social signal processing approaches for online recognition tasks, we developed a framework which is called SSI, for Social Signal Interpretation. This framework synchronizes the modalities, and it supports complete machine learning pipelines, offering various kinds of machine learning approaches. We are able to integrate additional modalities and sensors whenever a new sensor or device becomes available; my people then write a wrapper for it.

So we support motion capture as well as various kinds of eye trackers, stationary ones as well as mobile ones, and so on; basically all kinds of sensors that are commercially available.

This was the part on emotion recognition; now I would like to come to the other side, namely to the generation of social cues by the robot. As I said, it is not sufficient to recognize emotions; you also need to respond appropriately, or at least to prepare appropriate responses. I guess it is clear why we would like robots to show nonverbal signals as well: nonverbal signals not only express emotions but also attitudes and intentions; they also convey interpersonal relations, for example whether you are interested in talking to somebody or not. And nonverbal behaviors can of course also help the other to understand the verbal messages and, in general, make the communication more natural and plausible.

We started this work a couple of years ago with a simple robot platform. Of course, that robot did not have a head or an expressive face, so we had to look for other options, and we looked at action tendencies, which are related to action selection. An action tendency is actually what you show before you start an action; this is very common in sports: you observe the posture of a sportsperson, the action has not yet started, but it is quite clear what is coming next. With this robot we simulated action tendencies such as approach, panic, attack, and submission, and it turned out that people were able to recognize these action tendencies.

Later we actually got a robot head with a synthetic skin, and here we tried to simulate facial expressions. We again started from the facial action coding system I mentioned before, which identifies more than forty action units of the human face, and we asked the question: can we simulate the relevant action units on the robot head? The robot allows the simulation of just seven action units: it has a synthetic skin, underneath the skin there are motors, and the motors can move and thereby deform the skin. So we were only able to simulate seven action units.
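Purely as a hypothetical sketch of what a mapping from action units to the motor channels of such a head could look like; the motor names and the set_motor interface are invented here, only the AU numbers follow the standard coding system.

```python
# Hypothetical AU-to-motor mapping; motor names and the set_motor interface
# are invented, only the AU numbers follow the Facial Action Coding System.
AU_TO_MOTOR = {
    12: ("lip_corner", 1.0),   # AU12 lip corner puller (smiling)
    1:  ("inner_brow", 0.7),   # AU1 inner brow raiser
    4:  ("brow_lower", 0.8),   # AU4 brow lowerer (frowning)
}

def show_expression(robot, action_units):
    """action_units: dict mapping AU number to intensity in [0, 1]."""
    for au, intensity in action_units.items():
        if au in AU_TO_MOTOR:            # only the supported subset is rendered
            motor, gain = AU_TO_MOTOR[au]
            robot.set_motor(motor, gain * intensity)
```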

The question, of course, is whether these seven action units are enough, and I will show you a video. The video is in German with English subtitles; the robot is interviewed and talks about nonverbal signals. It is not necessary that you understand what is said: the interesting point is what the machine perceives. At this stage, the machine did not consider the semantics of the utterances at all; it only reacted to the nonverbal information. (Video plays.)

Okay, just to show you that the system really does not consider the semantics, here is another example. (Video plays.) The system does not yet work robustly online, and the end-of-turn detection is not really reliable yet. But it shows that you can have a conversation based on emotional signals alone; it is of course not perfect, and with a few more years, and maybe different funding, we might get further.

So what is empathy? Empathy is an emotional response that stems from the comprehension of the emotional state of another person. The emotional state of the other person might be similar to your own emotion, but it does not have to be the same emotion. Empathy requires, first, the perception of the emotional state of another person, and this is what we can cover with signal processing technology. But it also requires reasoning: you somehow need to know what the other person is feeling, and why, and not just to observe it. And you are also required to decide how to respond to the other person's emotion.

For example, in a tutoring system, if the student is in a very negative emotional state, depressed, it could be a disaster if the virtual agent would actually mirror the emotional state of the student, because it might make the student even more depressed.

So pure mirroring is actually a questionable strategy. What you see here is a potential architecture, a kind of sense-reason-act cycle: we perceive emotions, we try to understand the emotional state, and after understanding the emotional state of the other person, we choose an internal reaction. Then the question is whether this reaction should be externalized, and in what way, by the virtual human or the robot.
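Schematically, and with all names being illustrative rather than our actual architecture, the decision could be captured like this; the rule encodes the tutoring example above, namely that distress must not be mirrored.

```python
# Schematic only: perceive a state, form an internal reaction, then decide
# whether to externalize it. The rule encodes the tutoring example above.
def choose_response(perceived_state):
    internal = {"happy": "joy", "depressed": "concern",
                "angry": "calm"}.get(perceived_state, "neutral")
    if perceived_state == "depressed":
        return "encourage"   # externalize support, never mirrored sadness
    return internal          # otherwise a mirrored reaction is acceptable

print(choose_response("depressed"))  # -> encourage
```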

In another example, which I will show now, we simulated an appraisal model. The dialogue I will show you is actually scripted, of course. First of all, what we do in this kind of demo is that we perceive emotions, and we also comment on the user's emotions. The story is that the user has forgotten her medication, and the function of the scene is that the robot shows concern about the forgotten medication in order to increase awareness, but it does so in a subtle manner; we did not want the robot to be too pushy. The robot also shows its intentions with gestures, such as turning the palm down towards the user. So I will play the video; what is actually kind of amazing is how fine-grained the behavior is.

(Video plays.)

Okay. To develop a better understanding of the emotions of users, we are currently investigating how to combine social signal processing with an affective theory of mind; this is actually done in a cooperation with partners. Our partners developed a theory-of-mind model to simulate emotional behaviors, and the basic idea is that we run an emotion simulation and then check whether what we recognize in terms of social cues actually matches the simulation.

And, going even a little bit further, we do not just consider how a person displays an emotional state; we also consider how people regulate their emotions. To show you an example: if you do not regulate your emotion at all, say as a person who has just received really bad news, then we see the typical emotional expression we would expect. But people usually regulate their emotions, actually to better control their emotional state, and there are different ways to regulate emotions. Avoidance is one reaction, but you can also protect yourself, for example by saying, okay, it is not my fault, and blaming another person.

What you can see is that people may show quite different signals depending on the way they regulate their emotion. And if you use a typical machine learning approach to analyze these social signals, you will never be able to infer the underlying emotions, because you do not know how people are regulating their emotional state. So here, and we had this discussion already yesterday, maybe you can use machine learning approaches as black boxes to recognize certain signals, but you need some deeper understanding actually to map these signals onto emotional states.

And this is even more important if the system has to respond to the emotional state. Imagine you talk to somebody and the other person does not really understand what your problem is, and just behaves in a seemingly empathic way, responding in a schematic manner; that is not the behavior we want.

Towards the end of my talk, I would also like to address gaze-based dialogue between humans and robots. In a publicly funded project, we decided to work on engagement in human-robot interaction. We looked at signs of engagement in human-robot dialogue, such as the amount of mutual gaze, directed gaze, and turn-taking. I will just show you an example: here we have a game between a robot and a user, and the user is wearing eye-tracking glasses, so that the robot knows where the user is looking.

In this specific scenario, we simulated directed gaze, which is a kind of functional gaze: the robot is able to detect which object the user is focusing on, and this makes the interaction more efficient, because the user is no longer forced to describe the object at length. Besides the object detection, we also implemented social gaze in this scenario, that is, mutual gaze. The social gaze does not have a real functional role: the dialogue was completely understandable without the social gaze; we just wanted to know whether it makes any difference.

Very quickly: for directed gaze, the robot has the following two options, pointing at the object or just looking at the object; and for mutual gaze, both interactants establish eye contact. The next thing we realized was gaze-based disambiguation. Gaze-based disambiguation is interesting insofar as people look at an object, look away, and then look at it again, so we need a different disambiguation approach than, for example, for pointing gestures: when people point, they usually just point once and that's it, they do not point at the same object multiple times. Gazes are different.
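A minimal sketch of gaze-based reference resolution under this observation, assuming timestamped fixation events from the eye tracker (the data format is illustrative): repeated looks at a candidate accumulate, unlike a single pointing stroke.

```python
# Sketch: accumulate fixation time per candidate object over a short window;
# unlike a single pointing stroke, repeated looks add up.
from collections import defaultdict

def resolve_reference(fixations, candidates, window=2.0):
    """fixations: list of (timestamp, object_id, duration) events."""
    now = max(t for t, _, _ in fixations)
    score = defaultdict(float)
    for t, obj, dur in fixations:
        if obj in candidates and now - t <= window:
            score[obj] += dur            # revisits accumulate evidence
    return max(score, key=score.get) if score else None

fixes = [(0.2, "red_cup", 0.3), (0.9, "red_ball", 0.2), (1.4, "red_cup", 0.4)]
print(resolve_reference(fixes, {"red_cup", "red_ball"}))  # -> red_cup
```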

We also realized some typical gaze behaviors that are used in turn-taking. Speakers usually look away from the addressee to indicate that they are in the process of thinking about what to say next, and also to show that they do not want to be interrupted. And typically, at the end of an utterance, speakers look at the other person, because they want to know how the other person responds, what the other is thinking about what has been said. So basically, we realized shared attention, where the robot follows the user's hand movements and the user's gaze; we realized social grounding, where the robot recognizes mutual gaze; and finally, the robot tries to establish eye contact with the user. I will show you the video.

(Video plays.) In the video, the user asks for "the red one", which is of course ambiguous, since there is more than one red object; the robot asks which one is meant and resolves the reference with the help of the user's gaze.

We did an evaluation of this work, and what we found was that the object grounding was more effective than the social grounding. People were able to interact more efficiently with object grounding: the dialogues were much shorter, and there were fewer misconceptions. The social grounding, however, did not improve the perception of the interaction, which is of course a pity, because we spent quite some time on the mutual gaze. One assumption is that people were completely concentrating on the task instead of on the social interaction with the robot, and we might investigate whether, in a more social task, for example looking at family photos together, the social gaze becomes more important. Another assumption, which we have not yet tried out, is that some people focus more on the task and some focus more on the social interaction, people can be classified like this, and specific groups of people might appreciate the social gaze more than others.

Finally, I would like to come to recent developments. We started to record interactions in our dialogue scenarios, and data from both sides is valuable: to make machines interactive, but also to teach machines, for example a robot, how they can interact with a human.

In a project which was already mentioned yesterday, we collected a corpus of dialogues between humans, and the dialogue data had to be annotated with labels. We integrated active learning and cooperative learning into the annotation workflow. Basically, the idea is that the system decides which samples should be given to humans to label, picking the most informative ones, and it also decides which samples can be labeled automatically. In particular, the human is asked to label the examples for which the classifier has a low confidence. With this approach, we were able to make the annotation process significantly more efficient.
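The core of such a cooperative annotation loop can be sketched in a few lines; the confidence threshold and the seed model are assumptions for illustration.

```python
# Sketch of the cooperative annotation loop: the model keeps confident
# samples for itself and queues low-confidence ones for the human.
import numpy as np

def annotation_round(model, X_pool, threshold=0.75):
    proba = model.predict_proba(X_pool)
    confidence = proba.max(axis=1)
    auto = confidence >= threshold       # labeled automatically by the machine
    human_queue = np.flatnonzero(~auto)  # sent to the human annotator
    return model.predict(X_pool[auto]), human_queue

# usage sketch, e.g. with sklearn: model = LogisticRegression().fit(X_seed, y_seed)
# auto_labels, human_queue = annotation_round(model, X_pool)
```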

This is basically integrated into the SSI system which I mentioned earlier. Regarding the interactions themselves, what is actually interesting in this domain is the large number of interruptions in the dialogues between humans.

So, to come to a conclusion: I think that human-robot interaction research cannot continue to ignore the problem of appropriate social interaction between robots and humans, in particular if robots are employed in people's homes. What we need is a fully integrated system consisting of perception, reasoning, learning, and responding. In particular, there is at the moment a big gap between the perception and the reasoning: reasoning is kind of neglected in favor of black-box approaches. Black-box approaches are useful to detect social cues, such as laughter, but after that, we need to reason about what a social signal actually means. And of course, interdisciplinary expertise is necessary in order to emulate aspects of social intelligence; that is why we cooperate a lot with psychologists.

We have made a lot of our software publicly available, in particular the SSI system for social signal interpretation, and also our dialogue management tools, which are based on finite state automata and which we have connected to various virtual agents, but also to all kinds of robots.

(Question from the audience.) That is actually a good point. Thanks to the eye tracking, the robot is at some point able to recognize where the user is looking with a much higher level of accuracy than any human would be, and some people exploited that: they just used their gaze explicitly instead of pointing, which of course enables a quite flexible kind of referring act. In that particular video, by the way, the gestures are mostly there for illustration. Some people use pointing, some people do not use pointing at all; but nevertheless, even the people who do not point do look at the objects, so the robot can work with the gaze information.

And because we ran this as a lab study, people believed they were being evaluated, and so they were really concentrating on the task; that is probably why they did not appreciate the social gaze so much. It was not all negative, though: people did notice the turn-taking behavior. When the turn-taking signaling was realized, the dialogue was more efficient, because it was clear when the robot was expecting user input; but in terms of the subjective evaluation, the users did not judge the robot's behavior as more natural or more social. Again, in our case it was really a task-based scenario.

I did not have time to show the video of two humans collaborating on the task. We have some examples of human-human interaction, in a setting similar to the human-robot interaction, and interestingly, when two humans collaborated, it worked very well: the turn-taking was very smooth, and they looked at each other and at the objects on the table. We also compared two robot conditions, one robot which looks as if it could point and one which does not, and intuitively people of course adapted to the respective robot, for example by describing the objects more expressively and articulating more clearly.

People also related to the robot in interesting ways. We brought a robot into a home for the elderly, and at first people asked why they would want a robot at home; they would rather be visited by people. Then they said, okay, as long as the robot just talks, it is okay. Sometimes we were also surprised: one lady, she was nearly a hundred years old, and for her it was really clear what the robot is; she said, "it's just plastic." We found that reaction a bit strange at first, but of course she was right.

(Question from the audience: do people perhaps find it easier to talk to a robot and to express details to it?)

It probably depends on the setting, because, for example, in a poker game, people actually intentionally show or hide a particular emotional state, whereas when regulating their emotions in everyday life, they usually do not really think about it. And there are quite some situations where people suppress their emotions, so the general problem remains: if you just use machine learning and only look at the surface signals, you will hardly be able to recognize the emotional state the person is actually in; you need to take the situation into account.

I also believe that the face is quite important. I was at a presentation by a company that was really proud of their robot, and it did not have any facial expressions; and somebody in the audience said: then I do not understand the point, it is just a loudspeaker. So, to come back to your question, I think the face is as important as the voice. Now we are almost out of time. Okay, we touched on this before: yes, that was possible with eye gaze, apart from head pose, thanks to the eye-tracking glasses.