thank you very much for waking up early
this is really exciting this is the first time
i am giving a talk in this room in two years
it is at the same time kind of emotional for me
and so i'm really happy to share
the recent research i've done on human communication analysis
and i will also talk a little bit about some
of the earlier projects i've been doing
on this topic
and as you know really well
a lot of the work i'm presenting here is done
with my students and also with my collaborators
this is
this is the new multicomp lab at cmu and there is another one
at usc that stefan scherer is leading
this is the team and we are all working together
with the goal of building algorithms
to analyze
recognize and eventually sometimes synthesize
human communicative behaviors
and to really get into this understanding of why
human communication and why multimodal the magic word i know it's impossible for me to
give a talk without talking about multimodal
i really strongly believe that when we analyze dialogue
language is powerful in what people are saying
and this is a really strong component
of dialogue and conversation analysis
but i also strongly believe that nonverbal communication both vocal and visual
is also really important
and for that reason i'm gonna show you an example some of you may have
seen it so don't tell your neighbor the answer but i want to give
you this short clip where we have an interview
between two people
and i want two tasks from you an easy one and a hard one
the easy one is to find out so you have the interviewer and
interviewee
what emotion
does the interviewee
feel
and after that there is a hard one
so there are just the two tasks
the second thing i want you to find
is the cause
that's the hardest but it is also the most interesting
so let's read it together first and of course try to have
no prior
about it
guy kewney is editor of the technology website newswireless
hello good morning good morning
were you surprised by the verdict today i'm very surprised to see this verdict come on me
because i was not expecting that
when i came they told me something else
so anyway a big surprise
what emotion does he feel
it is an easy question
so right exactly surprise
and let's look at it like a computer
which is probably just gonna do some kind of word embedding and matching things
why is he surprised
let's look at the question probably because of the verdict
that follows
really quickly
but if we look more carefully
we do see that there was something unexpected
and maybe even something related to him
so let's add one more modality
that is which word does he decide to emphasise
guy kewney is editor of the technology website newswireless
hello good morning good morning
were you surprised by the verdict today
i'm very surprised to see this verdict come on me because i was not expecting that
yes
okay so
which word
in his second answer did he decide to emphasise
me
he strongly emphasised the me
so this surprise doesn't seem as much about the verdict
but mostly because it came on him
so let's add another modality
where you see surprise but now you want to look at the timing
of things
and that's one of the other take home messages i want to bring in
it's not just multimodal
it's the alignment of the modalities that's really important
so let's look at the visual modality this time
guy kewney is editor of the technology website newswireless hello good morning
good morning were you surprised by the verdict today
i'm very surprised to see this verdict come on me because i was not expecting that
when i came they told me something else
so anyway a big surprise
okay so
with that the surprise came a lot earlier than we thought
much earlier
and where was it
if you look carefully it is around that
so given that information
what is the cause or how can you explain the surprise
probably it is related to his title there's probably something wrong with the title
okay and that's really interesting so that's where the timing is important
he is really surprised right at the introduction so if you look at named
entity recognition there are different entities there is the name of the person and there is the
position and the place if you look carefully it is the second one
so
based on that you infer that his name is guy but his job title is not
editor of a website
the last bit i have to give you you would never have known it
without the context but effectively his name is guy
but he's a taxi driver
the taxi driver goes there for a job interview
and they tell him that's great come in
they put on him the makeup and the microphone and he thinks that is part of the job interview
and everything
and then he realises oh my god this is not a
job interview this is something live
but what i find really interesting is to see the
professionalism of the interviewer she keeps a straight face
the only thing she says is we will come back after the commercial
he never comes back of course so what we saw here is
we as humans are expressing our communicative behavior through what i call the
three v verbal vocal and visual
the word you decide to use
is maybe slightly more powerful or positive or negative
this is a choice you make
this is a choice because you want to emphasize the sentiment
or because you want to be polite and that's really important for discourse
the way you decide to phrase the sentence will bring a lot also
the vocal every word you speak can be emphasized differently
and also you can decide to put more or less tension or breathiness in
the voice
there are also the vocal expressions like laughter
or the pauses that are important
the visual i come from a computer vision background the reason i put the
longest column on visual
is my bias but i strongly believe there's also a lot in the gestures
i'm doing beat gestures i may do some iconic gestures
there is the eye gaze and i will also do other kinds of gesture
the body language is important it's both the posture of the body and also
the proxemics with others
and that is really also culture specific i always have this great example
of a brilliant student who has graduated by now
but had just come from china
and we had a wonderful discussion and i go to the whiteboard and i turn
and he was right there
and i tried to have a conversation but my canadian bubble was
violated
i survived only twenty seconds and then we had a wonderful conversation about
proxemics
eye gaze and head gaze
one of the first cues i look at almost always in any video analysis i do
is eye gaze eye gaze is extremely important
it is also sometimes cognitive and for emotions also eye gaze is really important
and i have a bias for facial expression also so i believe the face brings
a lot
we have about forty two muscles in the face depending how you count exactly but about forty
two
all of them have been assigned a number by paul ekman's famous coding scheme
and i'm interested not just in the basic emotions like happiness sadness
or surprise
i'm also interested in these other cognitive states like thinking confusion
and understanding
they are a lot more important when we think about learning and education for
example
so that gives you the three v verbal vocal and visual
and
the idea of this research has been in people's minds for many years
if you look back sixty years ago and by the way happy birthday it
is the sixtieth anniversary of artificial intelligence
the ideas were there from the beginning but we didn't have all the technology now these
days we have technology to do a lot of the low-level sensing finding facial landmarks
and analyzing the voice
even speech recognition is getting better
so we can in real time at last sense and process speech
and finally be able to start doing some of the original goal of inferring
behaviour and emotion
so personally when i look at this challenge of looking at human communication dynamics
i see four types of dynamics
the first one is behavioural dynamics
not every smile is born equal there are some smiles that seem to show
politeness some show real feeling and there is also what we call and this is
i have to give this to my
appear as opposed to
but if the size of
which means that the same word
can really mean a lot of things by the change of prosody and for people
working in speech and conversation analysis try to find out who is speaking
okay this was only one word
this was from only one hour of audio
do you know who did it
it is nick campbell and it was from one of his experiments data from the interactions
they had but only from one hour of audio you can see the variety
some of them are just
which is more like a continuation please continue
some clearly show some common ground
an acknowledgment
and some of them maybe eventually agreement so just from the prosody the same word
changes
the second one which by now you hopefully bought into is the idea of multimodal
dynamics with alignment
the third one is really important i think that's where a lot of the research
in this conference
and moving forward is needed it is the interpersonal dynamics
and the fourth one is the cultural and societal dynamics
there are a lot of studies of the differences between cultures
so today i will focus
primarily on these three
and try to explain some of the mathematics behind them
how can we use it
and develop new algorithms to be able to sense
the behaviors so
why am i personally excited about this field
first and foremost because of its potential in healthcare
there's a lot of potential in being able to help the doctor
during their assessment or treatment
of depression
ptsd suicidal ideation and autism
and the other area which is really important is education
the way people are learning these days is shifting completely we are seeing more
and more online learning
online learning brings a lot of advantages
but one of the disadvantages is you lose the face-to-face interaction
how can you bring that back in this new era
and
the internet is wonderful
there is so much there people like to talk about themselves and talk
about what they love or their hobbies and everything there is so much data in every language
every culture and a lot of it
is transcribed already
it gives us a great opportunity for gathering data and studying people's behaviour so
today i on purpose put it in three phases
the first phase is probably where one half of my heart is which is
health behavior informatics i will present some of the work we have done when
i was also at usc
working on how to analyse communicative behavior to help doctors
the core of this talk
will be about the mathematics
of communication
and there is a little bit of math but you can always ignore the
bottom half of the screen if you don't
want to see mathematical equations and i will give an intuition on every
algorithm that i present
but i want you to believe and understand
that we can get a lot from mathematics and algorithms
when studying
communication
and the last one is the interpersonal dynamics i will show some results but i
think this is where there's a need
for working together and pushing this part of the research
a lot further
and so let me start with health behavior informatics
you're gonna recognise right away
the image of a person who's been really important for sigdial this year
thank you for your emails i don't know if you realise but i am using
you as my patient throughout my slides
but let's suppose that we have a patient
it works for anybody else in this room
and we want to look at the interaction between the patient and the doctor
during that interaction we will have some camera let's say a samsung three sixty
just sitting on the table
if we are lucky and are at ict or we are working with
ict then we can also have a virtual interviewer
the advantage of the virtual interviewer versus the human is the standardization
the virtual interviewer is gonna ask the question always the same way as long as
we ask it to do it
the core of my research there
is while the interaction is happening
to be able to pick up on the communicative cues
that may be related to depression
anxiety ptsd or schizophrenia
we bring it back to the clinician
and then they can do a better assessment of depression
this is the vision long-term
what is really lucky
is when we started this
it was primarily computer scientists
with one strong believer which is skip rizzo
he believed in this and working with us
made it possible but now the medical field is seeing it
as more and more important and with a lot more links going on after that
so let me
introduce ellie probably a lot of you have seen her she changed clothing a lot
and you may ask you know in three
i'm gonna show that primarily because i want you yes to see the
technology which i think is amazing because it took forty five people and four years
to build
i'm showing this video as the landmark video in that field but also
to look at the nonverbal happening in real time and the sensing of this
hi i'm ellie
thanks for coming in today
i was created to talk to people in a safe and secure environment
i'm not a therapist but i'm here to learn about people and would love to
learn about you i'll ask a few questions to get us started
and please feel free to tell me anything your answers are totally confidential
are you okay with this
yes
so
how are you doing today
i'm doing well
that's good
where are you from originally
from los angeles
oh i'm from la myself
when was the last time you felt really happy
and
i time and i i'd rather be happy
like a skinny nine
okay thanks but you get an yourself to twenty years ago
and
i it's not a lean
it
an example that is that i'll
okay this is really short it turns out we originally designed it for fifteen
minute interactions and people easily talk twenty thirty minutes with ellie
we have one example a really famous professor i'm not gonna name
that person came in visiting and we told them
be careful we're gonna be watching behind the window
don't tell too much we are there
they just said oh no problem
they started talking a little bit and eventually they started telling the whole thing about
the boss and about everything as if we were not present at that point
ellie brings that out in people
and ellie is there to listen to you she is a good listener
she has been designed with that in mind otherwise you know she does not show
so much emotion
emotion is double edged in this case
you can show emotion and get the person more engaged but it can go the opposite
way for example with a bad error in speech recognition the patient said
my grandmother died and ellie reacted the wrong way
and so you definitely have to be careful so ellie reduces that aspect
and a lot of the work there was done by david and david
on handling the dialogue at a level
that makes the interaction go through rapport
through phases of intimacy with questions like what was a positive
moment in the last week
a negative as well
if you could go back in time what would you change about yourself
these are important
for our research because
how does the person react to a positive question
and how they react to a negative one will tell you a lot about
their reactions and allow us to calibrate
so our view
and that's the core of my research in this case is how to analyze
the patient behavior today
and how their behavior today compares to like two weeks ago
that allows us to see a change so if you ask me where the technology
is going to be first
it's in treatment
because in treatment you see the same person over time
and now over time we have gathered enough data that allows us also to maybe
do screening with this technology and give good indicators
so this is the project that started more than six years ago and let
me show you in a few minutes
what are the things we discovered that we did not expect
and things i think that were not seen previously
and so the first
population we looked at is depression
and you think of depressed people and you think smile
smile is gonna be a great way to tell them apart you look at the rate in depressed
and non-depressed this is an obvious one it turns out that no
the count of smiles
is almost exactly the same between depressed and non-depressed
what changes is the duration shorter
and less amplitude
hypothetically what it means is that social norms tell you that you have to smile
even when you don't feel it
and so you change the dynamics of your behavior
and that's why behavior dynamics are so important
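The point about smile count versus smile dynamics can be made concrete with a small sketch. It assumes a per-frame smile intensity signal is already available (for example from a facial action unit detector); the threshold, frame rate and intensity values below are hypothetical and chosen only for illustration, not taken from the study.

```python
def smile_episodes(intensity, threshold=0.5):
    """Return (start, end, peak) for each run of frames above the threshold."""
    episodes = []
    start = None
    for i, value in enumerate(intensity):
        if value > threshold and start is None:
            start = i  # a smile begins
        elif value <= threshold and start is not None:
            episodes.append((start, i, max(intensity[start:i])))
            start = None  # the smile ended on the previous frame
    if start is not None:  # smile still ongoing at the end of the signal
        episodes.append((start, len(intensity), max(intensity[start:])))
    return episodes


def summarize(intensity, frame_rate=30.0, threshold=0.5):
    """Count smiles and describe their dynamics (duration and amplitude)."""
    episodes = smile_episodes(intensity, threshold)
    count = len(episodes)
    if count == 0:
        return {"count": 0, "mean_duration_sec": 0.0, "mean_peak": 0.0}
    mean_frames = sum(end - start for start, end, _ in episodes) / count
    mean_peak = sum(peak for _, _, peak in episodes) / count
    return {
        "count": count,
        "mean_duration_sec": mean_frames / frame_rate,
        "mean_peak": mean_peak,
    }


# Made-up signals: both contain two smiles, but with different dynamics.
long_smiles = [0, 1, 2, 3, 3, 2, 1, 0, 0, 2, 3, 2, 0]
short_smiles = [0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(summarize(long_smiles))
print(summarize(short_smiles))
```

Both signals give the same smile count, so a count-only feature would not separate them, while duration and peak amplitude do.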
the second population we looked at
is posttraumatic stress
and you're like okay ptsd for sure there are some negative expressions
with this
it is a given
people with ptsd will probably show more
and what did we see almost the same rate and intensity
the same intensity of negative expressions
what did we end up doing we split it men and women
what did we find out
men
see an increase in negative facial expressions while women see a decrease in
negative expressions when they have symptoms related to ptsd
this is really interesting
so why
another interesting question
as researchers we have a nice research question
again probably maybe because of social norms
for men it is accepted in our culture
that they may show more negative expressions
so they are not
reducing them while women because of the social norm again may be reducing them
the other hypothesis i'm just gonna say it because i'm here
is that maybe it is because they're from los angeles and botox is so popular
i don't know we have to study that it gives a new
interesting research question to study
the third population that we looked at is suicidal ideation
did you know that there are forty teenagers a week going to the er in cincinnati
alone
for suicidal ideation either a first attempt or strong suicidal ideation
and the doctor has to make this hard decision
am i keeping all of them here
sending some of them home putting them on medication or not
it is a hard decision so we had two tasks in mind
one is finding suicidal versus non-suicidal
but where is the money
the money is in detecting repeaters
because the first time is always
a phrase that then the second item bits of and the most and to
so we did a lot of research and this is in collaboration with
john pestian in cincinnati
where we studied the behavior between suicidal and non-suicidal
and the language is really important
you see more pronouns when suicidal about themselves
and you also see more negatives
these are not surprising but they were confirmation of previous research
what was the most challenging is repeaters and non-repeaters
how can we differentiate them and one of the most interesting results is that the
voice
was the differentiator
people were speaking differently
when they were going to repeat what happened is we would call again three weeks later to find out
if there was a second attempt
and so the breathiness of the voice was an indicator
it is just one indicator it is not enough
in itself
but it is an indicator and then we can add to this
you know there are a lot of other indicators that you can add
to help with this
the last population we also looked at is schizophrenia
schizophrenia is a really important
disorder
and it is also related to bipolar bipolar and schizophrenia are both
in the psychosis
area
and so we were really interested to look at the facial behaviors because we
were like oh schizophrenia are they gonna look everywhere are they gonna move and all
this
and what did we find out
when they were with the doctor nothing
they were not moving it was more or less the same whether
they were strongly schizophrenic or not
but
if they were by themselves
then we could see the difference
so that brings in the really interesting aspect of interpersonal
when the doctor is there
they're kind of constraining a little bit their behaviour while when they were by
themselves you could see a lot in the facial expression
so these are some of the examples these are some of the populations we've been
working on
since then we started looking at autism
and also at sleep deprivation
all of my phd students were like can we really get paid for not
sleeping
and yes they're the lattice that is
onthe-fly and so we're looking at these as well
if you're interested in doing and pushing forward that kind of research
i strongly suggest
you go online right now and download openface
openface is us
taking multisense and taking the main components of multisense for visual analysis
and giving it
not only for free
not only giving the open source for recognition
but also giving you all the open source for the training
of all the models which were all trained with public datasets
it's probably not good for my grant proposals and all this because i'm probably gonna
give too much but i think it is important for the community and we're doing
it for that
openface has state-of-the-art performance for facial landmarks sixty eight facial landmarks
state-of-the-art performance for twenty two facial action units
also for eye gaze
eye gaze just from a webcam plus or minus by degree and also head position
we're adding more and more every few months also
so this is online
and be sure to contact tadas he is the main person behind all of
this work
so i think i got you hopefully excited about the potential of analysing nonverbal
and verbal behaviour for healthcare
so how do we do this
how can we go a step ahead right now we just looked at a couple of uni
modal
single behaviors
but what i'm really excited about is how can we add together
all of these indicators from verbal vocal and visual
so that we can better infer
the type of disorder or in a social interaction recognize leadership
rapport
and also maybe emotion
so
what are the core challenges
if you have to remember one thing from this lecture it is these four challenges
when you look at communication there are four main challenges the first one is the
temporal dimension the temporal aspect i told you about the smile the dynamics of the smile are
really important
we need to model each behaviour
but there is also what's called representation alignment and fusion
representation i have what the person said and i have the gesture how can i
learn a joint way of representing them
so that if someone says i like it
and they smile
these should be indicators that are represented close to each other
and by representation what i mean
i mean numbers that are interpretable by the computer
imagine a vector in some sense
the alignment is the second thing
we move sometimes faster and our face changes faster than our words so we
need to align the modalities and the last one is the fusion
we want to predict a disorder or emotion how do you fuse this information
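As a rough illustration of the alignment challenge, the sketch below pools a fast visual signal onto the slower word timeline by averaging the frames that fall inside each word. This is only one simple alignment strategy, not necessarily the one used in the work described; the word timings and per-frame smile values are made up, and times are in milliseconds.

```python
def align_frames_to_words(word_spans, frames):
    """Average the frame values that fall inside each word's time span.

    word_spans: list of (word, start_ms, end_ms) from a transcript
    frames: list of (time_ms, value), e.g. one smile intensity per video frame
    """
    aligned = []
    for word, start, end in word_spans:
        inside = [value for time, value in frames if start <= time < end]
        mean = sum(inside) / len(inside) if inside else None
        aligned.append((word, mean))
    return aligned


# Hypothetical data: three words and a smile that builds up during "like".
word_spans = [("i", 0, 200), ("like", 200, 500), ("it", 500, 700)]
frames = [(0, 0.0), (100, 0.0), (200, 0.2), (300, 0.6),
          (400, 1.0), (500, 1.0), (600, 0.8)]

for word, smile in align_frames_to_words(word_spans, frames):
    print(word, smile)
```

Once every word carries a visual value on the same timeline, the verbal and visual features can be concatenated per word, which is one basic way to approach the fusion step.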
so for the first one i will ask you to use another part
of your brain
the one that is slowly waking up because of the coffee for looking at math
and algorithms but i want to give you a little bit of background on
the math side
so we have the behavior of a person
and we wanna be looking at
what are the sub
components of it you have a plot
like a movie plot and there are subplots to it
there is a gesture and there are subcomponents to it
these subcomponents are really important when you look at behaviors
so how do we do this so first let's see
whose strongest background is in language and nlp
that would be most of you
anybody with a strong background in vocal and audio and speech
okay great
anybody with a strong background in visual computer vision
okay good thank you
i don't feel lonely well for each of these modalities
there are existing problems that are well studied looking at structure for example in language
looking at noun phrases or shallow segmentation
in visual recognizing gestures or in vocal looking at the tenseness or emotion
in the voice
and there have been a lot of approaches suggested for that
generative and discriminative approaches are the common ones
generative in a nutshell is looking at each gesture and trying to generate it so
if you look at head nod and head shake it's gonna learn how a head nod is generated
and how the head shake is created
and if i'm given a new video it says whether it looks more like a head nod or a head
shake a discriminative approach is really looking at what differentiates the two
and so in a lot of our work it turns out the discriminative approaches perform
better at least for the task of prediction
and so i'm gonna give you
information about this kind of approach
knowing really well there is interesting work on the generative side
so
what is a conditional random field
my god i didn't think i would see that this morning
but a conditional random field is what's called a graphical model
and the reason i want you to learn about it is that this is a
good entryway to a lot of the research that you've heard about word
embedding
or word2vec or deep learning or recurrent neural networks you hear all of these
terms
we're gonna go step by step to be able to understand them and at the
same time i will give you some of the work we've done with that
so given the task and given the sentence
i want to know what is the beginning of a noun phrase
or what is the continuation of a noun phrase or what is other like a verb
so it is a simple classification task
and you could imagine given an observation
where you have a one hot encoding
zero and one for the words or a word embedding
you can try to predict
the relationship between the word and the noun phrase
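For readers who have not seen a one hot encoding before, here is a minimal sketch. The vocabulary is a hypothetical toy list; a real system would build it from a corpus, and a word embedding would replace these sparse vectors with dense learned ones.

```python
# Hypothetical toy vocabulary; any fixed word list would do.
VOCAB = ["the", "dog", "chased", "a", "cat"]
WORD_TO_INDEX = {word: index for index, word in enumerate(VOCAB)}


def one_hot(word):
    """Return a vector of zeros with a single one at the word's index."""
    vector = [0] * len(VOCAB)
    vector[WORD_TO_INDEX[word]] = 1
    return vector


sentence = ["the", "dog", "chased", "a", "cat"]
encoded = [one_hot(word) for word in sentence]
for word, vector in zip(sentence, encoded):
    print(word, vector)
```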
if you wanna do it in a discriminative way what does discriminative mean
it means that you model the probability of the label
given the input p of y given x
now this equation is simpler than it looks
there is one component that looks at
how much does my observation look like the label this is what's called the singleton potential
and the second part is if i'm at the beginning of a noun phrase what
is the likely label afterwards
if i tell you that i'm at the beginning of a noun phrase what is likely
afterwards is a continuation of a noun phrase or other but if i
am in the continuation of a noun phrase
it's maybe less likely that i go
to certain labels after that so this is the kind of intuition you put
in this model
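The two components described here, a singleton potential per word and a potential between consecutive labels, can be sketched as a tiny linear-chain conditional random field. The weights below are hand-set and hypothetical (a real model learns them from data), and the normalization is done by brute-force enumeration, which only works for very short sentences; practical implementations use dynamic programming instead.

```python
import itertools
import math

LABELS = ["B", "I", "O"]  # beginning / continuation of a noun phrase / other

# Hypothetical hand-set weights; in a real CRF these are learned.
SINGLETON = {  # how much a word looks like a label
    ("the", "B"): 2.0, ("dog", "I"): 2.0, ("barks", "O"): 2.0,
}
PAIRWISE = {  # how likely one label is to follow another
    ("B", "I"): 1.0, ("I", "I"): 0.5, ("I", "O"): 0.5, ("O", "B"): 0.5,
    ("O", "I"): -2.0,  # a continuation should not directly follow "other"
}


def score(words, labels):
    """Unnormalized log-score of one label sequence for the sentence."""
    total = sum(SINGLETON.get((w, y), 0.0) for w, y in zip(words, labels))
    total += sum(PAIRWISE.get(pair, 0.0) for pair in zip(labels, labels[1:]))
    return total


def conditional_probability(words, labels):
    """P(labels | words): exp(score) normalized over all label sequences."""
    numerator = math.exp(score(words, tuple(labels)))
    partition = sum(
        math.exp(score(words, candidate))
        for candidate in itertools.product(LABELS, repeat=len(words))
    )
    return numerator / partition


def best_labels(words):
    """Most likely label sequence, found by exhaustive search."""
    candidates = itertools.product(LABELS, repeat=len(words))
    return max(candidates, key=lambda candidate: score(words, candidate))


words = ["the", "dog", "barks"]
print(best_labels(words))
print(conditional_probability(words, ["B", "I", "O"]))
```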
this model by itself can recognize behaviour and it can do it but
there's always a but
this problem would be
so much easier
if i knew the part-of-speech tags it would be so much easier
if i had an undergrad in a box acting as the annotator
and annotating all of this for us
the task would be so much easier from this pronoun you know it's likely the
beginning of a noun phrase
beginning of a noun phrase
this is the verb so
why don't we just do that well it is hard the irb doesn't
allow us to put undergrads in a box and it is a time-consuming
process to do that so
this is the one thing i want you to remember from this part of the
lecture
latent variable i'm gonna replace that by a latent variable a latent variable is a number
from one to let's say ten
that's gonna do the job for you
latent variables are there for grouping
they can group the words together for you
but you don't have to give them the name of each group
they can find the grouping naturally that works for the purpose of your task which is
in this case
noun phrase
so you tell it hey learn this grouping for me of all the words
and you can do that with a small trick by saying for the
beginning of a noun phrase i'm allowing you
these four groups
for the continuation of a noun phrase i'm allowing
you to group all the words in four other groups
and i would do it also for all the other ones
so you see it is almost
unsupervised clustering but not quite because the grouping will be happening because i have a
task in mind
a discriminative task in mind
so if you do this what's beautiful is the complexity of this algorithm is
almost the same as the crf with a simple summation over the latent variable
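The trick of giving each label its own private set of hidden states, and then summing over them, can be sketched as follows. This is a toy illustration of the idea under stated assumptions, not the authors' implementation: it uses two hidden states per label instead of four, hand-set hypothetical weights instead of learned ones, and brute-force enumeration instead of dynamic programming.

```python
import itertools
import math

# Assumption: two hidden states per label; the state names are arbitrary.
HIDDEN_STATES = {"B": ["B1", "B2"], "I": ["I1", "I2"], "O": ["O1", "O2"]}
ALL_HIDDEN = [h for group in HIDDEN_STATES.values() for h in group]

# Hypothetical weights between words and hidden states, and between hidden states.
WORD_WEIGHT = {
    ("the", "B1"): 2.0,   # B1 behaves like a "determiner" group
    ("she", "B2"): 2.0,   # B2 behaves like a "pronoun" group
    ("dog", "I1"): 2.0,
    ("runs", "O1"): 2.0,
}
TRANSITION_WEIGHT = {("B1", "I1"): 1.0, ("I1", "O1"): 0.5, ("B2", "O1"): 1.0}


def score(words, hidden):
    """Unnormalized log-score of one hidden-state sequence."""
    total = sum(WORD_WEIGHT.get((w, h), 0.0) for w, h in zip(words, hidden))
    total += sum(TRANSITION_WEIGHT.get(pair, 0.0) for pair in zip(hidden, hidden[1:]))
    return total


def label_probability(words, labels):
    """P(labels | words): sum over hidden sequences inside each label's group."""
    partition = sum(
        math.exp(score(words, h))
        for h in itertools.product(ALL_HIDDEN, repeat=len(words))
    )
    allowed = [HIDDEN_STATES[label] for label in labels]
    numerator = sum(math.exp(score(words, h)) for h in itertools.product(*allowed))
    return numerator / partition


words = ["the", "dog", "runs"]
print(label_probability(words, ["B", "I", "O"]))
```

Because the hidden-state sets of different labels are disjoint, every hidden sequence belongs to exactly one label sequence, so the label probabilities still sum to one.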
now what do you end up learning with this grouping
the most important is this link
what you end up learning is what's called the intrinsic dynamics what is
that if i want to recognize a head nod the intrinsic tells me i'm going down
and up this is the dynamic
but a head shake has a different dynamic this is specific to the gesture
extrinsic tells you if i am in a head nod how likely am i to switch gesture
this is between the labels how likely am i to head shake right now really un
likely i first come back and then i can head shake
it's an intuition behind this
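One way to picture the intrinsic versus extrinsic distinction is to split the transition weights between hidden states according to whether both states belong to the same gesture label. The state names and weights below are hypothetical and only meant to illustrate the bookkeeping.

```python
# Hypothetical hidden states, each owned by exactly one gesture label.
STATE_TO_LABEL = {
    "nod_down": "head_nod", "nod_up": "head_nod",
    "shake_left": "head_shake", "shake_right": "head_shake",
}

# Hypothetical transition weights between hidden states.
TRANSITIONS = {
    ("nod_down", "nod_up"): 1.5,
    ("nod_up", "nod_down"): 1.2,
    ("shake_left", "shake_right"): 1.4,
    ("shake_right", "shake_left"): 1.3,
    ("nod_down", "shake_left"): -2.0,  # switching gesture mid-nod is unlikely
    ("nod_up", "shake_left"): -0.5,    # switching after coming back up is easier
}


def split_dynamics(transitions, state_to_label):
    """Separate within-label (intrinsic) from between-label (extrinsic) weights."""
    intrinsic, extrinsic = {}, {}
    for (source, target), weight in transitions.items():
        same_label = state_to_label[source] == state_to_label[target]
        bucket = intrinsic if same_label else extrinsic
        bucket[(source, target)] = weight
    return intrinsic, extrinsic


intrinsic, extrinsic = split_dynamics(TRANSITIONS, STATE_TO_LABEL)
print("intrinsic:", intrinsic)
print("extrinsic:", extrinsic)
```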
so if you do this and you apply this to the famous task
of noun phrase
segmentation also called shallow parsing
and then you
look at the hidden states the most likely one for each word
i want to know what did my model learn what is the grouping
that it learned
and if you look at what it did learn
it's really beautiful
it learned automatically that the beginning of a noun phrase is a determiner or a pronoun
and it also gives me intuition
about the kind of part-of-speech tags
i never gave it any part-of-speech tags it just learned automatically
because of the words and the way these words happen in the phrase
so this is the take home message
latent variables are there to do grouping
for you
they are doing temporal grouping
that's the first ingredient we will need
you probably heard the words recurrent neural network
and you're like that fancy name i have no clue what it is i don't wanna use that
right away a recurrent neural network looks a lot like this model
the only thing that changes is instead of having one latent state from one
to ten
i'm gonna have many neurons that are binary
zero or one
and so a recurrent neural network is like someone looking at a neural network like looking
at a painting and being like oh it would look better horizontally so it's taking
a neural network and turning it horizontally and that is your temporal model
so if i was to show it to you the other way around you would see
just the neural network the normal one
by shifting it this way this is the temporal
that i model and so this is great
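A minimal sketch of a recurrent neural network step is given below. It assumes the common Elman-style formulation with real-valued tanh units rather than strictly binary neurons, and the weights are hypothetical fixed numbers, where a real network would learn them.

```python
import math


def rnn_step(x, h_prev, w_in, w_rec, bias):
    """One time step: new hidden state from the current input and the previous state."""
    h_new = []
    for j in range(len(h_prev)):
        total = bias[j]
        total += sum(w_in[j][i] * x[i] for i in range(len(x)))
        total += sum(w_rec[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(inputs, w_in, w_rec, bias):
    """Apply the same step at every position, carrying the hidden state forward."""
    h = [0.0] * len(bias)
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_in, w_rec, bias)
        states.append(h)
    return states


# Hypothetical weights: two inputs, two hidden neurons.
W_IN = [[0.5, -0.3], [0.1, 0.8]]
W_REC = [[0.2, 0.0], [-0.4, 0.3]]
BIAS = [0.0, 0.0]

# The same input appears at the first and last step.
inputs = [[1, 0], [0, 1], [1, 0]]
states = run_rnn(inputs, W_IN, W_REC, BIAS)
for t, h in enumerate(states):
    print(t, h)
```

The first and last inputs are identical, yet the hidden states differ, because the state at each step also depends on what came before. That dependence is the temporal part of the model.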
the problem with these
is they forget
they forget they have a problem in the learning
so there is this famous algorithm that was invented in germany
more than twenty years ago that is becoming super famous recently
it's long short-term memory
and the long short-term memory is really similar to the previous neural network
but in addition you have the memory
but how do you guard the memory
you're going to put a gate
that lets in only what you want in the memory
and lets out only what you want out of the memory you're putting a gating and
then you think hey i'm gonna sometimes forget things but i'm gonna decide what
i forget this is a really high level view but you could imagine by now
this is the exact same thing
the word
and the label
and the only difference is i'm going to memorise and when i memorise i memorise what
happened before
i'm gonna memorise what were the words and also the grouping that happened before
i wanted to show you that
just so that when you see these terms you have at least an intuition
that there is a way to approach
temporal modeling through the latent variables that i talked about
or through neural networks
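The gating idea can be sketched with a single-unit long short-term memory cell. This follows the standard LSTM equations with an input gate, a forget gate and an output gate; the scalar setup and the parameter values are hypothetical simplifications chosen so the effect of each gate is easy to see.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM.

    params maps each gate name to (weight_on_input, weight_on_previous_output, bias).
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    input_gate = sigmoid(pre_activation("input"))    # what gets into the memory
    forget_gate = sigmoid(pre_activation("forget"))  # what is kept from the old memory
    output_gate = sigmoid(pre_activation("output"))  # what gets out of the memory
    candidate = math.tanh(pre_activation("candidate"))

    c_new = forget_gate * c_prev + input_gate * candidate
    h_new = output_gate * math.tanh(c_new)
    return h_new, c_new


# Hypothetical parameters: gates driven only by their bias to make the effect obvious.
KEEP = {"input": (0.0, 0.0, -20.0), "forget": (0.0, 0.0, 20.0),
        "output": (0.0, 0.0, 0.0), "candidate": (1.0, 0.0, 0.0)}
ERASE = {"input": (0.0, 0.0, -20.0), "forget": (0.0, 0.0, -20.0),
         "output": (0.0, 0.0, 0.0), "candidate": (1.0, 0.0, 0.0)}

_, kept_memory = lstm_step(1.0, 0.0, 0.7, KEEP)
_, erased_memory = lstm_step(1.0, 0.0, 0.7, ERASE)
print(kept_memory, erased_memory)
```

With the forget gate open and the input gate closed the old memory passes through almost unchanged; with the forget gate closed it is dropped. In a trained network these gate values are computed from the input and the previous output rather than fixed by a bias.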
okay
now i want to address the second challenge
that's one of the most interesting from my perspective i worked a lot of
my life on temporal modeling so it is hard for me to say that but i think the next
frontier
is how do you work on representation how can you look at someone
what they say
and how they say it and the gesture
and find a common representation
what should this common representation look like
i wanna representation so that if i know why a video and i have a
segment of someone saying i like it
i a part of the video it as someone smiling
part of the video i
a joyful tone
i want these
to all be represent that mainly similar from each other if you look at the
numbers representing
this it should be really similar i like it from happy forms artful
and if i have someone who looks a little bit tense or depressed or
with some tenseness in their voice i want the numbers when i take this
audio clip
and i try to represent it with this transformation
i want them to be near those of someone who looks depressed
or if i have someone who looks surprised and i hear
wow
i want these to look alike
and this was the dream
i personally had this dream
back more than ten years ago
and this really smart researcher at toronto
showed us a path for that
and it is ruslan at the university of toronto
but there is a lot of interesting work
where neural networks
are allowing us to make this dream come true
it is not solved yet don't worry but they've done the first step that's really
important i'm gonna show you a result in a second
what they say is that visual
can be represented with multiple layers of neurons
and verbal can be represented
with multiple layers of neurons
what i have here
is something like word2vec for people who know about it it's a
representation of a word that becomes a vector and here i have images that suddenly
also become a nice vector by the way
if you wonder why multimodal was not working
it's all the fault of computer vision people
the reason multimodal was not working is that in images it was so hard to recognize any
object it was barely working
but suddenly in two thousand and eleven
computer vision started working
at a level that is really impressive we can recognize objects really efficiently and now
we can look at
a high-level representation of the image that is useful
words were always quite informative in themselves
but the vision guys solved a lot of that and now we can do
that and put them together
in one representation
and there's been a lot of really interesting work
starting from two thousand ten
and there is still a lot of work in that field
i'm gonna show this one result that showed
to me how it may be possible
and this is the work from toronto
this is what they did
they learned
from images from the web from flickr
they take a bunch of images and then
there were
one word or a few words describing them
and they force the two
to point to the same place
and when you do that
you get for any image
a representation and for any word you get a representation
but now i'm going to do
multimodal
work and here is the beauty of it i'm gonna take an image
and get its number
representation
i'm gonna get a word
and get a number and subtract
the word number from the image number
and i am gonna add another word number
and finally i get this final number out of it and i'm gonna look at what kind
of image
is in that part of the space
then you get a new car
and it becomes red in color
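The subtract-and-add step just described can be illustrated with a toy shared space. The three-dimensional vectors, the names such as `photo_blue_car`, and the choice of colours below are invented for the sketch; a real system would use learned image and word embeddings.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(query, candidates):
    """Name of the candidate vector closest to the query in cosine similarity."""
    return max(candidates, key=lambda name: cosine(query, candidates[name]))

# Hypothetical shared space: axis 0 = "car", axis 1 = "blue", axis 2 = "red".
word_vecs = {
    "blue": np.array([0.0, 1.0, 0.0]),
    "red": np.array([0.0, 0.0, 1.0]),
}
image_vecs = {
    "photo_blue_car": np.array([1.0, 1.0, 0.0]),
    "photo_red_car": np.array([1.0, 0.0, 1.0]),
    "photo_red_apple": np.array([0.0, 0.0, 1.0]),
}

# image number minus one word number plus another word number
query = image_vecs["photo_blue_car"] - word_vecs["blue"] + word_vecs["red"]
result = nearest(query, image_vecs)
print(result)
```

The arithmetic only works because images and words live in the same space, which is exactly what the joint training is meant to provide.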
for me what it means is i finally believe in what is the babel
what is this magic language where everything can be understood
and that's it there is a language
the magic language where everybody can go from french to english and all of that
there is this magic language
it is the same for language and images and we finally got
a piece of that magic language where computer vision people can live happily with natural
language people and speech people
and they can do that for day and night
flying and sailing bowl and box i don't know it is beautiful but they didn't solve
any of the problems i mentioned about communicative behavior they don't have yet
the happy smile that goes with i like it but you can see the path now to
that
so i'm gonna show you now an algorithm
that brings together what you learned earlier
latent variables
which do the grouping
and now i'm gonna add this new ingredient which is neural networks whose goal
is to find a better way of representing i don't like the one-hot
representation for words like zero and one
i want something that's more informative
and i don't like raw images i want something much more informative
so i'm going to learn at the same time
what is my grouping temporally what is my temporal dynamic and what is
my way to
represent
so given the same input
and the goal of maybe
doing emotion recognition or let's say recognizing what is positive or negative i'm
changing the task
because noun phrase
segmentation is not really a multimodal problem
so i'm thinking of a task like positive versus negative like
multimodal sentiment or something like that for example
and i add a first layer here in fact i'm showing it
this way but what it is
is that the word
is multidimensional
and this is also multidimensional because you have neurons
so i'm representing this as one layer
of neurons
and then
i'm gonna add your famous latent variables
so what is happening here
and that's really important
on this layer their job
is that they take all the gibberish here
that's a me is about a false there you don't and those then because i
speak french about of other
and so they take this gibberish and put it in a format
that's going to be useful for the computer and the task here is to say
from this useful information let's try
to see what is similar between the different
between the different modalities
and so this is what you get here
what is the right grouping what should i group
this is here
how should i go from the numbers to something that's useful for my computer and
here it is the same as earlier it is the link between latent variables or
grouping
so this is beautiful because you do at the same time
translate from gibberish to something useful and cluster at the same time
one of the most challenging things when you train that
is that each layer is hidden latent and you don't have any
ground truth labelling it
so when you have many of them what happens is one could try to learn
the same as the next one
so you want diversity in each of your layers
and in good neural networks they will do what's called dropout
or you can also impose some sparsity so that this is gonna be really different
from this one
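As a reminder of what dropout does mechanically, here is a minimal sketch of the common "inverted dropout" variant in NumPy. The function name `dropout` and its arguments are assumptions for illustration; the talk does not say which variant was used.

```python
import numpy as np

def dropout(h, p_drop, rng, train=True):
    """Inverted dropout on a layer of activations h (hypothetical helper).

    During training each unit is zeroed with probability p_drop and the
    survivors are rescaled so the expected activation stays the same.
    At test time the activations pass through unchanged.
    """
    if not train or p_drop == 0.0:
        return h
    keep = 1.0 - p_drop
    mask = rng.random(h.shape) < keep  # True for units that survive
    return h * mask / keep

rng = np.random.default_rng(0)
hidden = np.ones(1000)
dropped = dropout(hidden, p_drop=0.5, rng=rng)
```

Because a different random subset of neurons is silenced on every training step, no neuron can simply copy its neighbour, which pushes the layer toward the diversity described above.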
and when you do this for emotion recognition
you get a huge boost over any of the prior work
because we are not just doing late fusion we're really at the same
time modeling the representation
and the temporal clustering
okay
did everyone survive this is the last equation we have so this was
this is my goal of
presenting for you
the representation how do i go
from temporal to the representation and the last two which i wanna present quickly
one is about alignment
how do you align
visual which is really high thirty frames per second
with language
which is in fact i don't know how many words per second i think i'm
you know on the high end on that
but it's probably five to six words maybe a little bit more per second
so how do you manage to be able to
take a really high frame rate and align it with something much lower
said in another way i have a video
and i want to summarize that video
squeeze it so that at the end
i really have only the important part
and if you look at computer vision people
they only look at the pixels
and this is a lot of change per pixel
and this is really few changes
there is really little change here
there is barely any pixel changing here so if you just look at the pixels
and you try to merge you have all of these frames
and you want to find how am i gonna merge them
there are two obvious ways to do it
one is to ignore one out of two frames
if it is really a long sequence then you just ignore and a lot of the people in neural
networks that's often what they do they take one out of ten frames that's sad
but the more interesting one will be
look at one image does it look like the previous one
if they look alike i'm gonna merge them but if they don't look alike
at this time
then i do not merge them
what is more important our magic ingredient you remember latent variables the latent variables
are gonna group things for you
for a task in mind which is recognizing gestures
and if i do the merging because they look alike in this space
then it is really a more informed fusion
and if you do that you get a huge boost in performance for recognizing gestures
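The merge-if-similar idea can be sketched as a greedy pass over consecutive frame vectors. This is only an illustration: it compares generic feature vectors with cosine similarity, whereas the talk describes measuring similarity in the latent-variable space learned for the task. The function name `merge_similar_frames` and the threshold value are assumptions, and frames are assumed to be non-zero vectors.

```python
import numpy as np

def merge_similar_frames(frames, threshold=0.95):
    """Greedily merge consecutive frames that look alike (hypothetical helper).

    A frame joins the current segment when its cosine similarity to the
    segment mean is at least `threshold`; otherwise it starts a new segment.
    Returns a list of (mean_vector, number_of_frames_merged).
    """
    segments = []  # each entry is (sum_of_vectors, count)
    for frame in frames:
        f = np.asarray(frame, dtype=float)
        if segments:
            total, n = segments[-1]
            mean = total / n
            sim = mean @ f / (np.linalg.norm(mean) * np.linalg.norm(f))
            if sim >= threshold:
                segments[-1] = (total + f, n + 1)
                continue
        segments.append((f.copy(), 1))
    return [(total / n, n) for total, n in segments]

frames = [[1, 0], [1, 0.01], [0, 1], [0, 1], [0.02, 1]]
summary = merge_similar_frames(frames)
print([n for _, n in summary])
```

Unlike keeping one frame out of ten, the output length adapts to the content: long static stretches collapse to one vector while fast changes are preserved.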
and i'm gonna give you one more intuition about it say i have an hmm
so hmms are a lot like finding nemo or finding dory
they are like dory
short memory they don't remember they only remember the last thing they've seen they have
really short term memory
so if you give them something at a really high frame rate
the only thing they will remember is the previous one
so what do they remember they remember that my previous frame always looks a lot
like my current frame
so they are smoothing
but if i give it
these frames here that are different from each other
it will learn some temporal information that's more useful and that's why
a lot of models work so much better on language
because every word is quite different from the previous
but the images in a video the frames are really similar to each other so that hurts
this model
and when you do that you get a nice clustering
of the frames because it's not looking
just at the similarity but it is really
looking at the grouping that you get from the latent variables
the last one is fusion and there's a lot more work to be done on
fusion but this one is like okay
i model the temporal
i model the representation i align my modalities
but now i want to make a prediction i wanna make my final prediction
and i want to use all the information i have
to make my prediction
and to do that there are a lot of new ways to do that
if you think about it each modality has its own dynamics voice is really
quick
words are slower
so you don't want to lose that
so you have one
dynamic for
each modality so one is private and one
will in fact do the synchronisation
okay so you will learn a dynamic for audio and you learn a dynamic for
visual and then you learn how to synchronise them
i'm going quickly through that but i just want to give you the intuition
and the last step is the one that's going to
learn the dynamic and learn also to synchronise at the same time and when you
do that you improve a lot so
i'm gonna come back closing the loop
i'm closing the loop
and going back to the original work on distress depression and ptsd
i'm gonna take verbal acoustic and visual
and i want to predict how
distressed you are
and here are the results you get when you do multimodal fusion
you get this so what you have is a hundred participants
who interacted with ellie
and each of them has a level of distress in blue
and some of them have ptsd and depression
and in green what you get
is in fact the prediction
you get the prediction in green
by putting together the verbal indicators
the vocal and the visual
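To make "putting together the verbal, the vocal and the visual" concrete, here is a deliberately simple sketch: feature-level fusion with an ordinary least-squares regressor. This is not the fusion model from the talk, which learns per-modality dynamics and their synchronisation. The data are synthetic stand-ins generated in the code, and the helper names `design`, `fit` and `predict` are assumptions.

```python
import numpy as np

def design(blocks):
    """Concatenate per-modality feature blocks and add a bias column."""
    n = blocks[0].shape[0]
    return np.hstack(list(blocks) + [np.ones((n, 1))])

def fit(blocks, target):
    """Least-squares weights for predicting target from the fused features."""
    w, *_ = np.linalg.lstsq(design(blocks), target, rcond=None)
    return w

def predict(w, blocks):
    return design(blocks) @ w

# Synthetic stand-in data (NOT real participants): 100 rows, 2 features per modality.
rng = np.random.default_rng(0)
n = 100
verbal = rng.normal(size=(n, 2))
vocal = rng.normal(size=(n, 2))
visual = rng.normal(size=(n, 2))
distress = (verbal[:, 0] + 0.5 * vocal[:, 1] - 0.8 * visual[:, 0]
            + 0.1 * rng.normal(size=n))

w_all = fit([verbal, vocal, visual], distress)
mse_all = float(np.mean((predict(w_all, [verbal, vocal, visual]) - distress) ** 2))

w_verbal = fit([verbal], distress)
mse_verbal = float(np.mean((predict(w_verbal, [verbal]) - distress) ** 2))
```

On this toy data the fused model has a lower training error than the verbal-only model because the synthetic target was built from all three modalities. That says nothing about real distress prediction; it only shows the mechanics of combining feature blocks.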
and you can do that i'm gonna skip that because of time
but you can also do this
for looking at sentiment
in videos sentiment in youtube videos
is another application of that i'm gonna skip this one
just because i want to go quickly to the last point i want to make
but the last part i want to address now is interpersonal dynamics
you guys have been amazing you've been head nodding smiling yawning watching emails
i got you
okay
but interpersonal dynamics is i think the next frontier really in algorithms because people some
people will show synchrony in their behaviors
synchrony in behaviour is great it shows some kind of rapport
we saw it also in the video
in some of our videos with the virtual human mimicking each other
well in negotiation
you also see asymmetry or divergence
which is also really informative
if i move forward and you move backward this is an important cue
this is important in negotiation but also in learning
if i look at the behavior of one speaker and another
i can find moments where they synchronise
and i can also find moments when there is asynchrony
and these are often in our data
related to
a rejection or doing badly in their homework
because they're not working well together
there's a disagreement
and that asynchrony can show there
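One simple way to look for moments of synchrony between two people is a sliding-window correlation of a behaviour signal from each of them, for example head motion over time. The sketch below is an assumption about how such a measure could be computed, not the method used in the talk; `windowed_correlation` and the sine-wave signals are invented for the example.

```python
import numpy as np

def windowed_correlation(a, b, window):
    """Pearson correlation of two equal-length signals in each sliding window.

    Values near +1 suggest the two behaviours move together in that window,
    values near -1 suggest they move in opposite directions. Windows where
    either signal is constant get 0.0.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    out = []
    for start in range(0, len(a) - window + 1):
        x = a[start:start + window]
        y = b[start:start + window]
        x = x - x.mean()
        y = y - y.mean()
        denom = np.sqrt((x @ x) * (y @ y))
        out.append(float(x @ y / denom) if denom > 0 else 0.0)
    return np.array(out)

t = np.linspace(0, 4 * np.pi, 200)
speaker_a = np.sin(t)            # hypothetical behaviour signal of person A
mirrored = np.sin(t)             # person B mimicking A
opposed = -np.sin(t)             # person B moving against A
sync = windowed_correlation(speaker_a, mirrored, window=50)
diverge = windowed_correlation(speaker_a, opposed, window=50)
```

A real analysis would also allow a time lag between the two signals, since mimicry is rarely instantaneous.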
we can also use some of the behaviours to differentiate the
leader from the expert
otherwise you think the leader is the knowledgeable one but they're not
always the knowledgeable one and so it is hard to differentiate that
and voice is a good one for that
and another type is are you gonna accept or not my offer during a negotiation
and to do that i will look
at your behavior
i will look at my behaviour as the proposer and i will look at
our history together if we do that together we get a huge improvement when we
put in the dyad
but what i think is interesting is that
it is your behavior if you head nod you are likely to accept
but my behaviour is important by the way the best way to have someone accept
your offer
tells you have
you put that you put that out in your on a request
so the last one is you guys
good listeners
how do i create a crowd like you guys good listeners like you
i can do that from data
i can look at each of you how you are reacting to the speaker
and learn
what are the most predictive cues
and be able to eventually create a virtual listener
these are the top four most predictive features for listener backchannel so if i pause
you are likely to head nod
that's not a surprise if i look at you you're likely to head nod after a
little while not right away
if i say the word and the word and by itself is not a good
predictor but if i'm in the middle of a sentence and then i pause and look at
you
you are really likely to give feedback
so this is the power of multimodal and lastly if i don't look at you
you are unlikely
to head nod but not all of you guys are the same
you are all a little bit different you are not all smiling and another thing which
i don't know why use all be about the
a
for some of you i can learn a model for one person
i can learn a model for another person
and another person
and then what i would like to do is find out the prototypical grouping
grouping
latent variables again a little like model selection
again that's it
but here you will be grouping people you want to find what is common between people
and what do you find
you find that some people
is that was produced by law on so that they also that the warm i'm
a men's is the than is only about one that if i begin in france
even if i say stupid things you will head nod just because i paused
at the right time
some people will be visual they don't even care about listening
and i do this and noun phrases turn out to be a good predictor
okay so i want to show work from stacy marsella here this is a
really great representation of putting all this interpersonal dynamics in one video i could have
never done better than that
so i want you to just see this
this is a video from a movie and we are only gonna take the audio
track
and the text
only the audio and the text
and we're gonna animate
the virtual humans here we are gonna animate two of them
some of them are going to be speakers so it is speaking behavior based on
the speech
from the speech you want to know how the eyes and the head move
which facial expression that is speaker behaviour
but we also want to predict the listener behaviors
directly from the speech of the speaker and so look at that
it is beautiful
and i hope you enjoy the movie
s two s process i like an answer the question judge the core poor performance
statistical touched
technical difficulty writing style change to do so
i o
i
i
i don't have to answer the question or answer the question
you want answers i entirely
i don't try to
but this was all automatic from the audio
and the visual behavior comes from the text and the audio only
you get the acoustic cues from the audio you get the emotion
so this is an example putting everything together these are some of the applications that
you can build
bringing together the behavior dynamics not every smile is equal so we are going to
model that with the latent variables then you also get the multimodal representation
and alignment and the fusion
and then the interpersonal dynamics so
merci beaucoup for your attention
okay
so
let me answer the second one and maybe the first one we can
discuss more
about the second one about multimodal alignment right now we are looking at alignment
only at a really instantaneous level so it's only a really small piece of the big problem of
alignment
right now we are only aligning
at really short term
i personally believe the next
okay the next level
of alignment needs to be at the segment level so you need to be able
to do segmentation
at the same time as you do the alignment and to go ahead with the other
example that you mention
when you do mimicry you don't mimic instantaneously
the classic example i think it's four seconds or something like that so the
problem is that temporal contingency you need to model that and i think
right now as i said a lot of our models are short memory
and so we need the infrastructure
to be able to remember so
i think all the points you mention are wonderful i agree with you this is
why i'm excited about this
it is that we actually got the building blocks there
and i think we need to study the next step so
thank you
okay the with the money and then
great question
so right now we try to work with the calibration of each speaker
by having a first phase
where we get some indicators
of what is the difference in how they react to positive
and to negative
and looking at the delta
that is the most informative
because the delta is a little bit
it's not completely independent of the user but it is a lot less dependent
than just looking at how often do they smile how often do they smile if it's positive how
often do they smile when it's negative
that is more informative
the other work is if you ask me where this research is going it's in
treatment
and there
what it is and we're working with harvard medical school
is you get a schizophrenic patient at their worst
you get a schizophrenic patient as they go through treatment and then they go
back home
you can create a beautiful patient profile of when they were at their best and
then use that to monitor
their behaviour as they go back
and so that's the work we are putting forward with harvard medical school
it is to be able to create these
profiles of people
but the word profile doesn't sound good so we call it a signature
it sounds a little less big brother but the idea is the profile of that
so
thank you all for your attention thank you