Thank you very much for waking up early to start the day.

This is really exciting: it is the first time in two years that I am giving a talk in this room, so it is at the same time kind of emotional for me. I am really happy to share the recent research I have done on human communication analysis, and I will also briefly mention some of the earlier projects I have been doing on this topic.

As you know really well, a lot of this work is done with my students and also with my collaborators. This is the MultiComp Lab: there is one at CMU, and there is one that Stefan Scherer is leading. This is the team, and we are all working together

with the goal of building algorithms to analyze, recognize, and eventually maybe even predict human communicative behaviors, and to really get to this understanding of why: why human communication, and why multimodal. That is the magic word; I know it is impossible for me to give a talk without telling you about multimodal.

I believe really strongly that when we analyze dialogue, dialogue is powerful in how people express what they are saying, and that is a really strong component of dialogue and conversation analysis. But I also strongly believe that nonverbal communication, both vocal and visual, is really important.

For that reason I am going to show you an example. Some of you may have seen it, so do not tell your neighbor the answer, but I want to show you this short clip where we have an interview between two people.

I want two tasks from you, an easy one and a hard one. The easy one: you have the interviewer and the interviewee; what emotion does the interviewee feel? And then I will give you the hard one, the second of the two tasks: what is the cause? That is the hardest, but it is also the most interesting. So let us watch it together; try to have no prior about the video.

[The clip plays.] The anchor introduces the guest as the editor of a technology website: 'Hello, good morning.' 'Good morning.' 'Were you surprised by the verdict today?' 'I am very surprised to see this verdict come upon me, because I was not expecting that. When I came, they told me something else.' So there is something of a big surprise.

What emotion does he feel? It is an easy question. Surprise, right, exactly.

Now let us look at it from the point of view of a computer, which is probably just going to do some kind of word embedding and matching. Why is he surprised? If we look at the question, probably because of the verdict; that is the really quick answer. But if we read more carefully, we do see that there was something unexpected, and maybe even that it was related to him.

So let us add one more modality: which word does he decide to emphasize?

[The clip plays again, this time listening for which word is emphasized.]

Okay, so which word in his second sentence did he decide to emphasize? 'Me.' He strongly emphasized the 'me.' So the surprise does not seem to be so much about the verdict, but mostly about the fact that it came upon him.

So let us add another modality: you see the surprise, but now you want to look at the timing of things. That is one of the other take-home messages I want to bring in: it is not just multimodal; it is the alignment of the modalities that is really important.

So let us look at the visual modality a second time.

[The clip plays again, this time with attention to the facial expression and its timing.]

Okay, so the surprise actually came a lot earlier than we thought, much earlier: if you look carefully, it is right around the introduction. Given that information, what is the cause, how can you explain the surprise? Probably it is related to the title; there is probably something wrong with the title. And that is interesting; that is where the timing is important.

He is really surprised right at the introduction. If you look at it with named entity recognition, there are different entities: there is the name of the person, and there is the position and the place. If you look carefully, it is the second one that is the problem. Based on that, you infer that his name is fine, but his job title, editor of a technology website, is not.

The last piece I have to give you, because you would never have known it without the context, is who he actually is:

he is a taxi driver. The taxi driver goes there for a small job interview, one of those small jobs, and they tell him 'that's great, come with me.' They put the makeup on him and the microphone, and he thinks that this is part of the job interview, that this is how they do things here. And then at some point he realizes, 'oh my gosh, this is not the job interview, this is something live,' and that is what you see. What you would never have known, and what is also really interesting, is the reaction of the interviewer: she keeps a straight face. The only thing she says is that they will come back after the commercial. He never comes back. That is also part of the story.

So what we saw here is this:

we as humans express our communicative behavior through what I call the three Vs: verbal, vocal, and visual.

The verbal: the word you decide to use, maybe slightly more positive or slightly more negative, is a choice you make. It is a choice because you want to emphasize the sentiment, or because you want to be polite, and that is really important for discourse. The way you decide to phrase the sentence also brings a lot.

The vocal: every word you say can be emphasized differently, and you can decide to put more or less tension or vibration in the voice. There are also the vocal expressions such as laughter, or the pauses, that are important.

The visual: I come from a computer vision background, which is the reason I put a full column on visual; it may be my bias, but I strongly believe there is also a lot in the gestures. I am doing beat gestures right now, and I may do some iconic gestures. There is the way I gesture as I speak, and the body language is important too: both the posture of the body and also the proxemics with others.

Proxemics is really culture specific. I always give this great example of a brilliant student, who has graduated by now, who had just arrived from China. We were having a wonderful discussion, I go to the whiteboard, I turn, and he was right there. I tried to have a conversation, but my Canadian bubble, well... I survived only twenty seconds, and then we had a wonderful conversation about keeping a little more distance between us.

Then there is eye gaze and head gaze.

One of the first cues I look at, almost always, in any video analysis I do, is eye gaze. Eye gaze is extremely important; it relates to cognitive states and also to emotions, so eye gaze is really important. And I have a bias for facial expressions as well, so I believe the face brings a lot. We have about forty-two muscles on the face, depending on how exactly you count, but around forty-two, and all of them have been assigned a number by Paul Ekman's famous coding scheme. I am interested not just in the basic emotions, like when someone is happy or starts to cry, but also in these other cognitive states, things like confusion and understanding. Those are maybe even more important when we think about learning and education, for example.

So that is the view of the three Vs: verbal, vocal, and visual.

The vision behind this research has been in people's minds for many years. If you look back sixty years, and by the way, happy birthday, it is the sixtieth anniversary of artificial intelligence, it was there from the beginning; we just did not have the technology. These days we have the technology to do a lot of the low-level sensing: finding facial landmarks, analyzing the voice. Everyone knows speech recognition is getting better, so we can, in real time, at least most of the time, parse speech. And we can start going after some of the original goal of inferring behaviors and emotions.

So personally, when I look at this challenge of studying human communication dynamics, I think of four types of dynamics.

The first one is behavioral dynamics: not every smile is born equal. Some smiles seem to show politeness, some are real feelings, and there is also what we call, and I have to give this to my peers in speech as opposed to vision, the backchannel, which means that the same word can mean really different things through a change of prosody. For people working in speech and conversation analysis, it helps to find out who is speaking and who is listening. Listen to these.

[Audio examples of the same backchannel word spoken with different prosody are played.]

Okay, these were from only one hour of audio.

Do you know who did this? It was Nick Campbell, and it was from one of his experiments, data from the interactions they recorded. From only one hour you can see the variety: some of them are just acknowledgment, which is more like 'please continue'; some clearly show some common ground and alignment; and some of them maybe even show agreement. So just from the prosody, the same word changes.

The second type, which by now you have hopefully bought into, is the multimodal dynamics, where the modalities align. The third one is really important; I think that is where a lot of the research in this conference, moving forward, is needed: the interpersonal dynamics. And the fourth one is the cultural and societal dynamics; there is a lot of study of both the differences and also what is shared between cultures.

So today I will focus primarily on the first three, and I will try to explain some of the mathematics behind them: how we can use them and develop new algorithms to be able to sense and understand the behaviors.

What makes me personally excited about this field? I am in it partly because of its potential in healthcare. There is a lot of potential in being able to help the doctor during their assessment or treatment of depression, schizophrenia, and other disorders. The other area, which is really important, is education. The way people learn these days is shifting completely; we are seeing more and more online learning. Online learning brings a lot of advantages, but one of the big disadvantages is that you lose the face-to-face interaction. How can you bring that back in this new era?

And the internet is wonderful. There is so much data out there: people like to talk about themselves, to talk about what they love, their hobbies, everything. There is so much data, in every language and every culture, a lot of it audiovisual, and some of it transcribed already. It gives us a great opportunity for gathering data and studying people's behavior.

So today, on purpose, I have put the talk in three phases. The first phase is probably where one half of my heart is, which is health behavior informatics; I will present some of the work we have done, from when I was also at USC, on how to analyze and understand behavior to help doctors. The core of the talk will be about the mathematics of communication. There is a little bit of math, but you can always ignore the bottom half of the screen if you do not want to see mathematical equations, and I will give an intuition for every algorithm I present. What I want you to believe and understand is that we can get a lot from mathematics and algorithms when studying communication. And the last part is the interpersonal dynamics; I will show some results, but I think this is where there is a need for working together and pushing this part of the research a lot further.

So let me start with health behavior informatics.

You are going to recognize right away the image of a person who has been really important for SIGDIAL this year; I realize I have been using her as my patient throughout my slides. But let us suppose that we have a patient, or anybody else in this room, and we want to study the interaction between the patient and the doctor. During that interaction we will have some cameras, let us say a Samsung 360 just sitting on the table. If we are lucky and we are at ICT, where we did this work, then we can also have a virtual interviewer. The advantage of the virtual interviewer over a human is standardization: the virtual interviewer is going to ask the questions always in the same way, as long as we ask it to.

The core of my research there is, while the interaction is happening, to be able to pick up on the communicative cues that may be related to depression, anxiety, or schizophrenia, and to bring them back to the clinician so that they can do a better assessment of depression. That is the long-term vision.

What was really lucky is that we started this when it was primarily computer scientists, with one strong believer on the clinical side who believed in this, worked with us, and made it possible. Now the medical field is seeing it as more and more important, and there are a lot more clinical connections going on.

So let me introduce Ellie. Probably a lot of you have seen her; she has changed her clothing a lot over the years. I am going to show her primarily because I want you to see the technology, which I think is amazing; it took forty-five people and four years to build. I am showing this video as the landmark video in that field, but also so that you can watch the nonverbal sensing happening in real time.

'Hi, I'm Ellie. I was created to talk to people in a safe and secure environment. I'm not a therapist, but I'm here to learn about people and would love to learn about you. I'll ask a few questions to get us started. And please feel free to tell me anything; your answers are totally confidential. Are you okay with this?' 'Yes.' 'So how are you doing today?' 'I'm doing well.' 'That's good. Where are you from originally?' 'From Los Angeles.' 'Oh, I'm from L.A. myself.' 'When was the last time you felt really happy?'

[The clip continues with a few more questions and answers.]

Okay, this was really short. Originally we designed it as a fifteen-minute interaction, and people easily talk twenty or thirty minutes with Ellie. We have one example of a really famous professor, whom I am not going to name, who came visiting, and we told them: be careful, we are going to be watching the videos afterwards, so do not tell her too much. 'No problem.' They start talking a little bit, and eventually they started opening up about everything, and I was not present at that point. Ellie brings that out, and that is really something.

Ellie is there to listen to you; she is a good listener, and she has been designed that way. Emotion is a double-edged sword in this case: you can show emotion and get the person more engaged, or it can go the opposite way. For example, with a bad error in speech recognition, the patient said that their grandmother had died, and Ellie's response was completely off. So you have to be careful, and for those reasons we reduced that aspect.

A lot of the work there was done by David and David on handling the dialogue at a level that makes the interaction go through rapport building, through phases of intimacy: something positive, 'What was a happy moment in the last week?', and something negative as well, 'If you could go back in time, what would you change about yourself?' These are important anchors for our research, because how the person reacts when we ask something positive, and how they react when we ask something negative, tells us a lot, and their reactions allow us to calibrate.

So our view, and that is the core of my research in this case, is how to analyze the patient's behavior today, and how that behavior has changed compared to, let us say, two weeks ago; that allows us to see a change. If you ask me where this technology is going to shine first, it is in treatment, because in treatment you see the same person over time. And as we gather data over time, that may also allow us to do screening with this technology and provide good indicators.

So this is a project that started more than six years ago, and what I will show you in the next few minutes are some of the things we discovered that we did not expect, things that I think had not been seen previously.

The first population we looked at is depression. You think of depressed people and you think: the smile is going to be a great way to tell; just look at the depressed and the non-depressed, it is an obvious one. It turns out that no: the amount of smiling is almost exactly the same between the depressed and the non-depressed. What changes is that the duration is shorter and there is less amplitude. Hypothetically, what it means is that social norms tell you that you have to smile even when you do not feel it, and so what you change is the dynamics of your behavior. And that is why behavior dynamics are so important.

The second population we looked at is post-traumatic stress. And you think: okay, for PTSD, for sure there will be some negative expressions that go with it; it is a given, people with it will probably show more of them. And what did we see? Almost the same rate and intensity, the same intensity of negative expressions. So what did we end up doing? We split it into men and women. What did we find out? Men show an increase in negative facial expressions, while women show a decrease in negative expressions, when they have symptoms related to PTSD.

This is really interesting. So why? That is another interesting question; our results raise nice new research questions. Again, probably because of social norms: for men it is accepted in our culture that they may show more negative expressions, so they are not reducing them, while women, because of social norms again, may tend to reduce them. There is one other hypothesis, which I am just going to say because I am here: maybe it is because they are from Los Angeles, where Botox is so popular. I do not know; we would have to study it. It just gives new, interesting research questions to study.

The third population that we looked at is suicidal ideation. You may know that forty teenagers a week come to the ER in Cincinnati alone with suicidal ideation, either a first attempt or strong suicidal ideation, and the doctor has to make a hard decision: am I keeping all of them here, sending some of them home, putting them on medication or not? It is a hard decision, so we had two tasks in mind. One is telling suicidal from non-suicidal. But where is the money? The money is in detecting repeaters, because the first attempt is often a call for help, while the second attempt is often the most dangerous one.

So we did a lot of research, in collaboration with the children's hospital in Cincinnati and John Pestian, where we studied the behavior of suicidal versus non-suicidal teenagers. The language is really important: you see more pronouns when suicidal, talking about themselves, and you also see more negative words. These are not surprising, but they were confirmations of previous research. What was most challenging is repeaters versus non-repeaters: how can we differentiate them? One of the most interesting results is that the voice is where the differences showed; people were speaking differently. What does 'repeater' mean? We would call again three weeks later to find out if there had been a second attempt. The breathiness of the voice was an indicator. Is it just one indicator? Yes; you would not diagnose just because you heard a breathy voice in itself, but you can put the indicators together, and there are a lot of other indicators you can add to help with this.

The last population we looked at is schizophrenia. Schizophrenia is a really important disorder, and it is related to bipolar disorder; both are problems in the psychosis arena. We were really interested in looking at the facial behaviors, because we wondered: with schizophrenia, are they going to look everywhere, are they going to move around, and all of this? And what did we find out? When they were with the doctor: nothing. They were not moving, their brows did not move, and it was the same whether they were strongly schizophrenic or not. But if they were by themselves, then we could see the difference.

That brings in the really interesting aspect of the interpersonal: when the doctor was there, they were constraining their behavior a little bit, while when they were by themselves you could see a lot in the facial expressions.

So those are some examples; these are the main populations we have been working on. Since then we have started looking at autism, and also at sleep deprivation; all of my PhD students like the idea of getting paid for not sleeping, and the latest of these studies is happening right now. So we are looking at these as well.

If you are interested in doing and pushing forward that kind of research, I strongly suggest you go online right now and download OpenFace. OpenFace is us taking the main components of MultiSense for visual analysis and giving them out: not only for free, not only as open source for recognition, but also, and this is what I mean by all of it, as open source for the training of all the models, which were all trained with public datasets. It is probably not good for my grant proposals, because I am probably giving away too much, but I think it is important for the community, and we are doing it for that.

OpenFace has state-of-the-art performance for facial landmarks, sixty-eight facial landmarks; state-of-the-art performance for twenty-two facial action units; and also for eye gaze, eye gaze just from a webcam, plus or minus a few degrees, as well as head pose. We are adding more every few months. So this is online; be sure to contact Tadas, who is the main person behind all of this software.

So I hope I have gotten you excited about the potential of analyzing nonverbal and verbal behavior for healthcare. Now, how do we do this? How can we go a step further? Right now we sense a couple of unimodal signals, one behavior at a time, but what I am really excited about is how we can put together all of these indicators, from verbal, vocal, and visual, so that we can better infer the disorder, or, in a social interaction, recognize leadership, rapport, and maybe also emotion.

So what are the core challenges? If you have to remember one thing from this lecture, it is these four challenges. When you look at this modeling problem, there are four main challenges. The first one is the temporal dimension, the temporal aspect: I told you about the smile; the dynamics of that smile are really important, and we need to model each behavior over time. But there are also what are called representation, alignment, and fusion.

Representation: I have what the person said and I have their gestures; how can I learn a joint way of representing them, so that if someone says 'I like it' and smiles, these are indicators that are represented close to each other? And by representation I mean numbers that are interpretable by the computer; imagine a vector, in some sense.

Alignment is the second thing: we sometimes move our eyes faster, and of course the face changes faster than our words, so we need to align the modalities. And the last one is fusion: if we want to predict a disorder or an emotion, how do we use all of this information together?

For the first one, I will ask you to use another part of your brain, the one that is slowly waking up because of the coffee, and look at the math and the algorithms; I want to give you a little bit of background on the math side.

So we have the behavior of a person, and we want to look at its subcomponents. What information do we have? You have a plot, like a movie plot, and there are subplots to it; there is a gesture, and there are subcomponents to it. These subcomponents are really important when you look at behaviors.

So how do we do this? Let us see: whose strongest background is in language and NLP? That would be most of you. Anybody with a strong background in vocal, in speech? Okay, great. Anybody with a strong background in visual, in computer vision? Okay, good, thank you; I do not feel lonely.

Well, for each of these modalities there are existing, well-studied problems that look at structure: for example, in language, looking at noun phrases or shallow segmentation; in the visual one, recognizing gestures; or in vocal, looking at the tenseness or the emotion in the voice. And there have been a lot of approaches suggested for these.

Generative approaches are a common answer. Generative, in a nutshell, means looking at each gesture and trying to generate it: if you look at head nods and head shakes, it is going to learn how a head nod is created and how a head shake is created, and then, given a new video, it will say whether it is a head nod or a head shake. A discriminative approach instead looks directly at what differentiates the two. In a lot of our work, it turns out that discriminative approaches perform better, at least for the task of prediction, so I am going to tell you about this kind of approach, knowing really well that there is interesting work on the generative side.

So, what is a conditional random field? My gosh, I did not think I would say that this morning. A conditional random field is what is called a graphical model, and the reason I want you to learn about it is that it is a good entry point into a lot of the research you have heard about: word embeddings, word2vec, deep learning, recurrent neural networks, all of these terms. We are going to go step by step so we can understand them, and at the same time I will give you some of the work we have done with them.

So, given the task, and given a sentence, I want to know what is the beginning of a noun phrase, what is the continuation of a noun phrase, and what is something else entirely. It is a simple classification task. You can imagine, given an observation, where you have a one-hot encoding, zeros and a one for the word, or a word embedding, that you try to predict the relationship between the word and the noun-phrase label.
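Just to make the input concrete, here is a tiny sketch, with a made-up three-word vocabulary, of the two representations just mentioned: a one-hot vector versus a dense embedding looked up from a table.

```python
# Tiny sketch of the two input representations mentioned above: a one-hot
# vector (a 1 at the word's index, 0 elsewhere) versus a learned dense
# embedding. The vocabulary and the embedding values are made up.
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}

def one_hot(word, size=len(vocab)):
    v = np.zeros(size)
    v[vocab[word]] = 1.0
    return v

embedding_table = np.random.default_rng(0).standard_normal((len(vocab), 4))

def embed(word):
    return embedding_table[vocab[word]]

print(one_hot("cat"))   # [0. 1. 0.]
print(embed("cat"))     # a 4-dimensional dense vector
```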

If you want to do it in a discriminative way, what does discriminative mean? It means that you model the probability of the label given the input, P(y given x). Now, this equation is simpler than it looks. There is one component that looks at how my observation relates to the label; this is what is called the singleton potential. And the second part says: if I am at the beginning of a noun phrase, what is the likely label afterwards? If I am at the beginning of a noun phrase, the next one is likely a continuation of a noun phrase, or another label, but if I am at the continuation of a noun phrase, it is maybe much less likely that I go into a verb right after. That is the kind of intuition you put into this model.
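For reference, here is the linear-chain CRF written the way it is described here, with a singleton potential tying each observation to its label and a transition potential tying consecutive labels. The feature functions f and g and the weights theta and eta are generic placeholders rather than the exact parameterization on the slide.

```latex
% linear-chain CRF: singleton (observation-label) and transition (label-label) potentials
P(y \mid x) = \frac{1}{Z(x)}
  \exp\!\Big( \sum_{t=1}^{T} \theta^{\top} f(y_t, x_t)
            + \sum_{t=2}^{T} \eta^{\top} g(y_{t-1}, y_t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big( \sum_{t=1}^{T} \theta^{\top} f(y'_t, x_t)
                           + \sum_{t=2}^{T} \eta^{\top} g(y'_{t-1}, y'_t) \Big)
```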

This is a model I have used for years to recognize behaviors, and it can do it, but there is always a 'but.' This problem would be so much easier if I knew the part-of-speech tags. It would be so much easier if I had an undergrad in the box, as the annotator, doing part-of-speech tagging for us: the task would be so much easier, because from 'this is a pronoun' you know it is likely the beginning of a noun phrase, this is the beginning of a verb phrase, this is the verb, and so on. So why do we not just do that? Because it is hard: the IRB does not allow us to put undergrads in a box, and it is a time-consuming process.

So this is the one thing I want you to remember from this part of the lecture:

latent variables. I am going to replace that annotation by a latent variable. A latent variable is a number, from one up to, let us say, ten, and it is going to do the job for you. Latent variables are there to help: they can group the words together for you, but you do not have to give them the name of each group; they define the grouping automatically, in a way that works for the purpose of your task, which in this case is noun phrases.

So you tell it: hey, learn this grouping of all the words for me. And you can do that with a small trick, by saying: for the beginning of a noun phrase, I am allowing you these four hidden states; for the middle, the continuation of a noun phrase, I am allowing you to group all the words into four other states; and I do the same for all the other labels. So you see, it is almost, but not quite, unsupervised clustering, because the grouping happens with a task in mind, a discriminative model's task in mind.

And if you do this, what is beautiful is that the complexity of this algorithm is almost the same as the CRF, with a simple extra summation.

Now, what do you end up learning with this grouping? The most important part is this link. You end up learning what is called the intrinsic dynamics: if I want to recognize a head nod, the intrinsic part tells me the head is going up and down, and this is its dynamic, while a head shake has a different dynamic; this is specific to the gesture. The extrinsic part tells you, if I am doing a head nod, how likely I am to switch; this is between the labels: how likely am I, from a head nod, to come back to neutral and then do a head shake. That is the intuition behind this.
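In equations, the latent-variable idea sketched above looks roughly like the following; this is the general form of a latent-dynamic CRF, so the details of the original model may differ. Each label y_t gets its own disjoint set of hidden states H_{y_t}, and the label probability sums the hidden-state CRF over all compatible hidden sequences.

```latex
% latent-variable CRF: disjoint hidden states per label, summed out at prediction time
P(y \mid x) = \sum_{h \,:\, h_t \in \mathcal{H}_{y_t} \ \forall t} P(h \mid x),
\qquad
P(h \mid x) = \frac{1}{Z(x)}
  \exp\!\Big( \sum_{t} \theta^{\top} f(h_t, x_t)
            + \sum_{t} \eta^{\top} g(h_{t-1}, h_t) \Big)
```

Because the sets of hidden states are disjoint across labels, the summation folds into the same forward-backward recursion as a plain CRF, which is why the complexity stays almost unchanged.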

So you do this and you apply it to a famous task, noun-phrase segmentation, also called shallow parsing, and then you ask: for each word, which hidden state is the most likely one? I want to know what my model learned, what grouping it learned. And when you look at what it did learn, it is really beautiful: it learned automatically that the beginning of a noun phrase is a determiner or a pronoun, and it also gives me an intuition about the kinds of part-of-speech tags that go into each hidden state. It learned something like part-of-speech tags automatically, simply from the words and the way these words appear in noun phrases.

So this is the take-home message: latent variables are there to group things for you; they do temporal grouping. That is the first ingredient we will need.

You have probably heard the term recurrent neural network, and you may like the fancy name but have no clue what it is. A recurrent neural network looks a lot like this model. The only thing that changes is that instead of having one latent state that takes a value from one to ten or so, I am going to have many neurons that are binary, zero or one. A recurrent neural network is what you get when someone looks at a neural network the way you look at a painting and thinks it would look better horizontally: you take a neural network and unroll it horizontally, and that gives you the temporal part. If I showed it to you the other way around, you would see just a normal neural network; by shifting it this way, it is temporal data that I model.

The problem with these is that they forget. They forget, and they have problems during learning. So there is this famous algorithm, developed in Germany more than twenty years ago, that has become super famous recently:

long short-term memory. The long short-term memory unit is really similar to the previous neural network, but in addition you have a memory. And how do you guard the memory? You put gates on it, so that only what you want goes into the memory and only what you want comes out of it, and then you say: I am going to forget things sometimes, but I am going to decide what I forget. This is a really high-level view, but you can imagine by now that this is exactly the same setup, the words and the label; the only difference is that I am going to memorize, and when I memorize, I memorize what happened before: what the words were, and what the grouping was, before this point.

I wanted to show you this just so that when you see these terms you have at least an intuition that there is a way to approach temporal modeling, either through the latent variables I talked about, or through neural networks.
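To make the gating idea concrete, here is a minimal NumPy sketch of a single LSTM step; the weights are random placeholders rather than a trained model, and this is the textbook formulation, not any specific system from this talk.

```python
# Minimal sketch of one LSTM step, to make the "gated memory" intuition concrete.
# Weights are random placeholders; a real model would learn them.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One time step: x_t is the input, h_prev the previous output,
    c_prev the memory cell carried from the past."""
    z = W @ x_t + U @ h_prev + b                   # all four gates computed at once
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input / forget / output gates
    g = np.tanh(g)                                 # candidate memory content
    c_t = f * c_prev + i * g                       # choose what to forget and what to write
    h_t = o * np.tanh(c_t)                         # choose what to expose from the memory
    return h_t, c_t

d_in, d_hid = 8, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((4 * d_hid, d_in))
U = rng.standard_normal((4 * d_hid, d_hid))
b = np.zeros(4 * d_hid)

h, c = np.zeros(d_hid), np.zeros(d_hid)
for x_t in rng.standard_normal((5, d_in)):         # a toy 5-step sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```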

Okay. Now I want to address the second challenge, which is one of the most interesting from my perspective. I have worked a lot of my life on temporal modeling, but I think the next frontier is representation: how do you look at what someone says, how they say it, and their gestures, and find a common representation? What should this common representation look like?

I want a representation such that, if I have a video segment of someone saying 'I like it,' a part of a video where someone smiles, and a part where there is a joyful tone, these are all represented similarly: if you look at the numbers representing them, they should be really similar, 'I like it,' the happy face, the joyful tone. And if I have someone who looks a little bit tense or depressed, or there is some tenseness in the voice, I want their numbers to be alike as well, whether it comes from the video clip or the audio clip, once I have put them through this transformation; I want them to be near each other for someone who is depressed. Or if I have someone who looks surprised and I hear 'wow,' I want these to look alike.

This was the dream. I personally had this dream back more than ten years ago, and some really smart researchers in Toronto showed us a path toward it, Ruslan and his colleagues at the University of Toronto.

There is a lot of interesting work where neural networks are allowing us to make this dream come true. It is not all solved, do not worry, but they have done the first step, and that is really important; I am going to show you a result in a second. What they showed is that the visual can be represented with multiple layers of neurons, and the verbal can be represented with multiple layers of neurons. What I have here is not exactly word2vec, for people who know about it, but it is a representation where a word becomes a vector, and here I have images that also become a nice vector.

By the way, if you wonder why multimodal was not working before, it is all the fault of the computer vision people. The reason multimodal was not working is that images were so hard that recognizing any object barely worked. But suddenly, around two thousand eleven, computer vision started working at a level that is really impressive; we can recognize objects really efficiently, and now we can get a high-level representation of the image that is actually useful. Words were always quite informative in themselves, but now that vision has caught up we can put the two together in one representation. There has been a lot of really interesting work on this starting from around two thousand ten, and there is still a lot of work in that field.

I am going to show you one result that demonstrated to me that it may be possible; this is the work from Toronto. What they did is they took images from the web, from Flickr, a bunch of images, each with one word or a few words describing them, and they trained the two representations to point to the same place. When you do that, you get a representation for any image and a representation for any word, in the same space.

But now I am going to do something like multilingual work, and here is the beauty of it. I am going to take an image and get its number, its representation; I am going to take a word and get its number; I am going to subtract the word's number from the image's number, add the number of another word, and finally take the resulting number and ask what part of the space it lands in, what kind of image lives there. You take the image of a blue car, subtract 'blue,' add 'red,' and you get a red car.

What that means for me is that I finally believe in the Babel, the magic language where everything can be encoded, where everybody can go from French, or English, or images, into the same place. We finally got a piece of that magic language, where computer vision people can live happily with natural language people and speech people. They can do it for the plane flying and the boat sailing; it is beautiful. They did not solve all of it: the one thing I mentioned about communicative behavior, they do not yet have the happy smile that goes with 'I like it,' but you can see the path toward that now.
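Here is a toy sketch of the shared-space idea, not the Toronto model itself: two linear maps with made-up random weights project image features and word vectors into one joint space, where the 'blue car minus blue plus red' style of arithmetic becomes a nearest-neighbor query.

```python
# Toy sketch of a shared image/word embedding space (not the Toronto model).
# Two linear maps project each modality into a common space, where arithmetic
# such as embed(image of blue car) - embed("blue") + embed("red") can be matched
# against other images by cosine similarity. All vectors here are random stand-ins.
import numpy as np

rng = np.random.default_rng(1)
d_img, d_txt, d_joint = 512, 300, 128

W_img = rng.standard_normal((d_joint, d_img)) / np.sqrt(d_img)   # image -> joint space
W_txt = rng.standard_normal((d_joint, d_txt)) / np.sqrt(d_txt)   # word  -> joint space

def embed_image(x):          # x: raw image feature vector (e.g. from a CNN)
    v = W_img @ x
    return v / np.linalg.norm(v)

def embed_word(w):           # w: word vector (e.g. from word2vec)
    v = W_txt @ w
    return v / np.linalg.norm(v)

# Pretend features for the example in the talk.
blue_car_img = rng.standard_normal(d_img)
blue_vec, red_vec = rng.standard_normal(d_txt), rng.standard_normal(d_txt)

query = embed_image(blue_car_img) - embed_word(blue_vec) + embed_word(red_vec)
query /= np.linalg.norm(query)

# Retrieve the closest image in a (random) gallery by cosine similarity.
gallery = rng.standard_normal((1000, d_img))
scores = np.stack([embed_image(g) for g in gallery]) @ query
print("best match:", int(np.argmax(scores)))
```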

So I am now going to show you an algorithm that brings together what you learned earlier, latent variables, whose role is grouping, and a new ingredient, neural networks, whose goal is to find a better way of representing the input. I do not like a one-hot representation for words, zeros and a one; I want something more informative. And I do not like raw images; I want something much more informative. So I am going to learn, at the same time, how to group temporally, what my temporal dynamics are, and what my way of representing the input is.

So, given the same input, and with the goal of maybe doing emotion recognition, or let us say recognizing what is positive or negative (I am changing the task because noun-phrase segmentation is not really a multimodal problem), I am thinking of a task like positive versus negative, visual sentiment, something like that.

At the first layer here, I am showing it this way, but what it really is, is that the word is multidimensional, and this layer is also multidimensional because you have neurons; so I am replacing the input with one layer of neurons. And then I am going to add our famous latent variables.

So what is happening here, and this is really important: the job of this first layer is to take what is gibberish to the computer (to me a word is not gibberish, and neither is an image, but to the computer it is) and convert it into a format that is going to be useful, to surface the useful information and to see what is similar between the different modalities. So this is what you get: here, the grouping, what should I group together; here, how should I go from the raw numbers to something that is useful for my computer; and here, the same as earlier, the transitions between latent variables, between groups.

This is beautiful, because you do the translation from gibberish to something useful and the clustering at the same time. One of the most challenging things when you train this is that each layer is hidden, latent; you do not have ground-truth labels for it. So when you have many of them, what can happen is that one layer tries to learn the same thing as the next one. You want diversity inside your layers, and good neural networks do what is called dropout, or you can also impose some sparsity, so that this layer ends up really different from that one.

And when you do this for emotion recognition, you get a huge boost over any of the prior work, because we are not doing only late fusion; we are really, at the same time, modeling the representation and the temporal clustering.
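In symbols, the combination just described looks roughly like this; it is a generic form, not necessarily the exact parameterization of the original model. A neural map phi_W turns the raw input into a learned representation, and the latent-variable CRF from before sits on top of it, so the representation, the grouping, and the temporal dynamics are all trained jointly for the discriminative task.

```latex
% latent-variable CRF on top of a learned neural representation \phi_W
P(y \mid x) = \sum_{h \,:\, h_t \in \mathcal{H}_{y_t} \ \forall t}
  \frac{1}{Z(x)} \exp\!\Big( \sum_{t} \theta^{\top} f\big(h_t, \phi_W(x_t)\big)
                           + \sum_{t} \eta^{\top} g(h_{t-1}, h_t) \Big)
```

Here W, theta, and eta are learned together by maximizing the conditional likelihood, with dropout or a sparsity penalty used to keep the different hidden layers from collapsing onto each other.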

Okay, everyone survived; that was the last equation. That was my goal in presenting it to you: the representation, and how we go from temporal modeling to a jointly learned representation. There are two more challenges that I want to present quickly. One is about alignment.

How do you align the visual, which is really fast, thirty frames per second, with language, which is, I do not know how many words per second, I am probably on the high end, but probably five to six words per second, maybe a little more? How do you manage to take the really high frame rate and align it with something much slower? Put another way, I have a video and I want to summarize that video so that at the end I keep only the important parts.

If you look at it the way computer vision people do, at the pixels, here there is a lot of pixel change, here there is really little change, very little change here, and a lot of pixels changing here. So if you just look at the pixels and you say 'I have all of these frames and I want to find out how I am going to merge them,'

there are two obvious ways to do it. One is to ignore one out of every two frames; if it is a really long sequence you just subsample, and a lot of the people using neural networks often do exactly that, taking one frame out of ten. The more interesting way is to look at one image, check whether it looks like the previous one, and if they look alike, merge them; if they do not, keep them separate. But what is even more important, the magic ingredient, is this: remember latent variables? Latent variables are going to do the merging for you, with a task in mind, which is recognizing gestures. When the merging happens because frames look alike in that learned space, it is much more informative, and you get a boost in performance for recognizing gestures.

Let me give you one more intuition. Say I have an HMM. HMMs are a lot like Finding Nemo, or rather Finding Dory: like Dory, they have really short-term memory; they do not remember much, only the last thing they have seen. So if you give them something at a really high frame rate, the only thing they remember is the previous frame, and what do they learn? That my previous frame always looks a lot like my current frame: smoothing. But if I give them frames that are actually different from each other, they will learn temporal information that is more useful. That is why a lot of these models work so much better on language, where every word is quite different from the previous one, while neighboring frames in a video are really similar to each other, and that hurts these models.

And when you do this merging you get a nice clustering of the frames, because it does not look only at the similarity; it really uses the grouping that you get from the latent variables.
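As a point of comparison, here is a minimal sketch of the naive baseline described above, which merges runs of consecutive frames whose features are very similar. The approach in the talk goes further and lets the latent variables decide the grouping with the recognition task in mind; this sketch only illustrates the alignment problem, not that model.

```python
# Naive "merge adjacent frames that look alike" baseline: pool runs of
# consecutive frames whose cosine similarity exceeds a threshold. The model
# in the talk instead lets latent variables drive the grouping.
import numpy as np

def merge_similar_frames(features, threshold=0.95):
    """features: (T, D) array of per-frame descriptors.
    Returns a shorter (T', D) array where similar consecutive frames are averaged."""
    merged, current = [], [features[0]]
    for f in features[1:]:
        prev = current[-1]
        sim = f @ prev / (np.linalg.norm(f) * np.linalg.norm(prev) + 1e-8)
        if sim >= threshold:
            current.append(f)                         # still the same "moment"
        else:
            merged.append(np.mean(current, axis=0))   # close the segment
            current = [f]
    merged.append(np.mean(current, axis=0))
    return np.stack(merged)

rng = np.random.default_rng(2)
base = rng.standard_normal((10, 64))                  # 10 distinct "moments"
video = np.repeat(base, 30, axis=0) + 0.01 * rng.standard_normal((300, 64))
summary = merge_similar_frames(video)
print(video.shape, "->", summary.shape)               # roughly 10 merged segments
```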

The last challenge is fusion, and there is a lot more work to be done on fusion. This one says: okay, I modeled the temporal part, I modeled the representation, I aligned my modalities, but now I want to make my final prediction, and I want to use all the information I have to make it. There are a lot of new ways to do that. If you think about it, each modality has its own dynamics: the voice is really quick, the words are slower, and you do not want to lose that. So you have one dynamic for each modality, where part of it is private to the modality and part of it interacts with the other modalities. You learn a dynamic for audio and a dynamic for visual, and then you learn how to synchronize them. I am going through this quickly, but I just want to give you the intuition that the last layer is the one that learns the dynamics and learns to synchronize them at the same time, and when you do that you improve a lot.
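Here is a minimal PyTorch-style sketch of that idea, assuming PyTorch is available: one recurrent model per modality, each running at its own frame rate, with a small fusion layer on top of the final states. It is a generic per-modality-recurrence-plus-fusion design, not the exact multi-view model from this work.

```python
# Generic sketch of modality-specific dynamics plus a fusion layer
# (not the exact multi-view model described in the talk).
import torch
import torch.nn as nn

class TwoModalityFusion(nn.Module):
    def __init__(self, d_audio=40, d_visual=64, d_hid=32, n_classes=2):
        super().__init__()
        self.audio_rnn = nn.LSTM(d_audio, d_hid, batch_first=True)    # fast modality
        self.visual_rnn = nn.LSTM(d_visual, d_hid, batch_first=True)  # slower modality
        self.fusion = nn.Linear(2 * d_hid, n_classes)                 # combine both views

    def forward(self, audio, visual):
        _, (h_a, _) = self.audio_rnn(audio)    # h_a: (1, batch, d_hid)
        _, (h_v, _) = self.visual_rnn(visual)
        joint = torch.cat([h_a[-1], h_v[-1]], dim=-1)
        return self.fusion(joint)

model = TwoModalityFusion()
audio = torch.randn(4, 200, 40)    # 4 clips, 200 audio frames each
visual = torch.randn(4, 60, 64)    # the same 4 clips, 60 video frames each
print(model(audio, visual).shape)  # torch.Size([4, 2])
```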

So let me come back and close the loop, going back to the earlier work on distress, depression, and PTSD. I take the verbal, the acoustic, and the visual, and I want to predict how distressed you are. Here are the results you get when you do multimodal fusion. What you have is about a hundred participants who interacted with Ellie, each of them with their level of distress in blue, and some of them with PTSD or depression. In green is the prediction you get by putting together the verbal indicators, the vocal, and the visual. You can also do this, and I am going to skip it because of time, for sentiment in videos; sentiment in YouTube videos is another application of this, but I will skip that one.

I went quickly because I want to get to the last point I want to make, which is interpersonal dynamics. You have been amazing: you have been nodding, smiling, yawning, checking emails. I saw you.

Okay. Interpersonal dynamics are, I think, the next frontier for these algorithms, because people show, for example, synchrony in their behaviors. Synchrony in behavior is great: it shows some kind of rapport. We see it in our videos, and in some of our videos with the virtual human we see people mimicking each other. In negotiation, you also see asymmetry, or divergence, which is also really informative: if I move forward and you move backward, that is important; it matters in negotiation but also in learning. If I look at the behavior of one speaker and another, I can find moments where they synchronize, and I can also find moments where there is asynchrony, and in our data these are often related to a rejection, or to doing badly on their homework, because they are not working well together; there is a disagreement, and the asynchrony shows it.

We can also use the behavior to tell, for example, who the real leader or the real expert is: some people act as if they are knowledgeable, but they are not always the ones who actually are, and it is hard to differentiate; the voice is a good cue for that.

Another example: are you going to accept my offer or not during a negotiation? To predict that, I will look at your behavior, I will look at my behavior as the proposer, and I will look at our history together; when we put the dyad together, we get a huge improvement. And what I think is nice is that your behavior matters: if you nod, or if you look bothered, you are more or less likely to accept; but my behavior is important too. By the way, the best way to have someone accept what you are offering is to nod yourself as you put out your request.

The last one: you guys are good listeners. How do I create a crowd like you, a crowd of good listeners? I can do that from data: I can look at each of you, how you react to the speaker, learn which cues are the most predictive, and eventually be able to create a virtual listener.

These are the top four speaker features that are most predictive of listener backchannel. If I pause, you are likely to nod; that is not a surprise. If I look at you, you are likely to nod right after; also not a big surprise. If I say a word like 'and,' by itself it is not a good predictor, but if I am in the middle of a sentence and I pause and look at you, you are really likely to give feedback. So this is the power of multimodality; and indeed, if I do not look at you, you are unlikely to nod.
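Just to make the combination of cues concrete, here is a toy sketch that turns the speaker cues listed above (a pause, gaze toward the listener, being mid-sentence) into a backchannel probability. The weights and the rule are invented for illustration; they are not the learned model from this study.

```python
# Toy illustration of combining speaker cues into a backchannel ("nod") score.
# The features mirror the ones discussed above; the weights are made up and are
# not the learned predictor from the actual work.
def backchannel_probability(pausing, looking_at_listener, mid_sentence):
    score = 0.0
    if pausing:
        score += 0.4                    # a pause invites feedback
    if looking_at_listener:
        score += 0.3                    # gaze toward the listener invites feedback
    if pausing and looking_at_listener and mid_sentence:
        score += 0.2                    # the combination is the strongest cue
    return min(score, 1.0)

print(backchannel_probability(True, True, True))    # 0.9
print(backchannel_probability(False, False, True))  # 0.0
```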

But not all of you are the same. You are all a little bit different: you are not all smiling at the same things, and some of it I do not even know why you are smiling about. So I can learn a model for one person, a model for another person, and another, and then what I would like to do is find the prototypical groupings. Grouping again: latent variables, a bit like model selection, but now grouping people, to find what is common between people. And what do you find? You find that some people are driven by the prosody, by the audio; I am one of them: if I speak in French and pause at the right time, even if I am saying something silly, you will nod, just because of the timing. Some people are visual; they do not even really listen. And some people are driven by the language, and noun phrases turn out to be a good predictor for them.

Okay, so I want to show some work from Stacy Marsella that is here; it is a really great demonstration that puts all of this interpersonal dynamics into one video, and I could never have done better than that. So what I want you to see is this: it is a movie clip, and we are only going to take the audio track and the text, only the audio and the text, and we are going to animate two virtual humans. One of them is the speaker, so the speaking behavior is generated from the speech: you listen to the speech and you predict the gestures, what the head is doing, which facial expression, the speaker behavior. But we also want to predict the listener's behaviors, directly from the speech of the speaker. So look at it; it is beautiful, and I hope you enjoy the movie.

[The movie clip plays, with the two virtual humans animated automatically from the audio and the text.]

This was all automatic, from the audio and, for the visuals, only some of the text: you get the communicative cues from the audio, you get the emotion. So this is an example of putting everything together. These are some of the applications you can build by bringing together the behavioral dynamics (not every smile is equal, so model it with latent variables), the multimodal representation, the alignment and the fusion, and then the interpersonal dynamics. So with that, thank you for your attention.

Okay. Let me answer the second one, and maybe the first one we can discuss more afterwards.

About the second one, on multimodal alignment: right now we are looking at alignment at a really instantaneous level, so it is only a really small piece of the big problem of alignment; right now we only align over really short spans. I personally believe the next level of alignment needs to be at the segment level, so you need to be able to do segmentation at the same time as you do the alignment. And to go with the other example that you mention, you do not see mimicry instantaneously; the classic example is, I think, four seconds or something like that, so the problem is the temporal contingency: you need to model that, and, as I said, right now a lot of our models have short memories, so we need the infrastructure to be able to remember. I think all the points you mention are wonderful; I agree with you. This is why I am excited about where we are: we now have the building blocks, and I think we need to study these next steps. Thank you.

Okay, on the next question. Right now we try to work with a calibration for each speaker, by having a first phase of rapport building. But where we get more robust indicators is in the difference between how the person reacts to the positive questions and how they react to the negative ones; looking at that delta is the most informative, because that measure is a lot less dependent on the individual user. It is not completely independent, but it is a lot less dependent than just counting how often the person smiles when it is positive and how often they smile when it is negative; the delta is more informative.

The other point, if you ask me where this research is going first, is treatment. What we are working on with Harvard Medical School is this: you get a schizophrenic patient at their worst, you follow them as they go through treatment, and by the time they go back home you can create a profile of that patient at their best, and then use it to monitor their behavior after they go back. So the work we are pushing forward with Harvard Medical School is to be able to create these profiles of people. The word 'profile' does not sound great, so we call it a signature; it sounds a little like Big Brother, but the idea is a profile of the person at their best.

So, thank you all for your attention. Thank you.