0:00:18 Thank you very much for waking up early this morning. This is really exciting: it is the first time I am giving a talk in this room in two years, so it is at the same time kind of emotional for me. I'm really happy to share the recent research I've done on human communication analysis, and I will also briefly mention some of the earlier projects I've been doing on this topic.
0:00:46 As you know really well, a lot of the work I'm presenting here was done with my students and also with my collaborators. This is the new lab at CMU; there is also one at USC that Stefan Scherer is leading. This is the team, and we are all working together with the goal of building algorithms to analyze, and maybe eventually predict, human communicative behaviors, and to really get to this understanding of why human communication, and why multimodal. Multimodal is the magic word: I know it's impossible for me to give a talk without talking about multimodal.
0:01:32 I strongly believe that when we analyze dialogue, the verbal channel is powerful: what people are saying matters, and this is a really strong component of dialogue and conversation analysis. But I also strongly believe that nonverbal communication, both vocal and visual, is just as important.
0:01:53 For that reason I'm going to show you an example. Some of you may have seen it; if so, don't tell your neighbor the answer. It is a short clip of an interview between two people, and I want two tasks from you, an easy one and a hard one. The easy one is to find out, from the exchange between the interviewer and the interviewee, what emotion the interviewee feels. And then, the hard one, the second of the two tasks: I want you to tell me what is the cause. That's the hardest, but it is also the most interesting.
0:02:35 So let's watch it together; for those with no prior exposure, note the caption at the bottom of the screen: "Guy Kewney, editor of the technology website NewsWireless." [The clip plays:] "Hello, good morning." "Good morning." "Were you surprised by the verdict today?" "I am very surprised to see this verdict come on me, because I was not expecting that. When I came, they told me something else. So it was a big surprise."
0:03:00 What emotion does he feel? Surprise. It is an easy question. So, right, exactly. Now let's look at it from the viewpoint of a computer, which is probably just going to do some kind of word embedding and matching. Why is he surprised? If we look at the question, probably because of the verdict; that is the really quick answer. But if we listen more carefully, we do see that there was something unexpected, and maybe even something related to him.
0:03:36 So let's add one more modality, which is: which words does he decide to emphasize? [The clip plays again, this time listening for the vocal emphasis: "I am very surprised to see this verdict come on ME, because I was not expecting that."] Okay, so which word in his second sentence did he decide to emphasize? "Me." He strongly emphasized the "me." So the surprise doesn't seem to be as much about the verdict, but mostly because it came on him.
0:04:26 So let's add another modality: we see the surprise, but now we want to look at the timing of things. That's one of the other take-home messages I want to bring: it's not just multimodal, it is the alignment of the modalities that is really important. So let's look at the visual modality a second time. [The clip plays again, this time watching his face as the caption with his name and title appears on screen.] Okay, so the double take, the surprise, came a lot earlier than the verdict question; much earlier. And if you look carefully, it is right around the moment the caption appears. So given that information, what is the cause, how can you explain the surprise? It is probably related to the title; there is probably something wrong with the title.
0:05:39 Okay, and that is what is interesting; that is where the timing is important. He is really surprised at the moment the caption appears. If you look at named entity recognition, there are two entities there: the name of the person, and the position and the place. If you look carefully, it is the second one. Based on that, you infer that his name is fine, but his job title, editor of a technology website, is not. The last piece, which you could never have known without the context, is that effectively he is a taxi driver. The taxi driver goes there for a small job interview, and at some point they come and say, "that's great, come with me"; they put him in makeup, they put the microphone on him, and he thinks that is just how job interviews go there, with makeup and everything. And then he realizes: oh my gosh, this is not the job interview, this is something live. What is also interesting, if you watch the whole clip, is the reaction of the interviewer: she keeps a straight face; the only thing she says is that they will come back after the commercial. He never comes back. That is also part of the story.
0:07:00 So what we saw here is that we as humans express our communicative behavior through what I call the three V's: verbal, vocal and visual. The word you decide to use, maybe slightly more positive or negative, is a choice you make: a choice because you want to emphasize the sentiment, or because you want to be polite, and that is really important for discourse. The way you decide to phrase the sentence also brings a lot. Then the vocal: every word you say can be emphasized differently, and you can also decide to put more or less tension in the voice. There are also the vocal expressions such as laughter, and the pauses, that are important.
0:07:46 And the visual. I come from a computer vision background; the reason I put it third in the list is maybe my bias, but I strongly believe there is also a lot in gesture. By gesture I mean iconic gestures, and also deictic gestures like pointing. The body language is important: both the posture of the body and also the proxemics with others, and that is really culture-specific. I always have this great example of a brand new student, graduated by now, who had just arrived from China. We were having a wonderful discussion, I go to the whiteboard and I turn around, and he was right there. I tried to have a conversation, but my Canadian personal-space bubble was violated; I survived only twenty seconds, and then we had a wonderful conversation about trying to make sense of that. And then head and eye gaze: one of the first cues I look at, almost always, in any video analysis I do, is eye gaze. Eye gaze is extremely important.
0:08:53 Gaze also relates to cognitive states and to emotions; eye gaze is really important for these too. And I have a bias for facial expression as well: I believe the face brings a lot. We have about forty-two muscles on the face (it depends how you count exactly, but around forty-two), and all of them have been assigned numbers as part of the famous FACS coding scheme. And I'm interested not just in the basic emotions, such as happiness or sadness, but also in these other cognitive states like attention, confusion and understanding. These are perhaps even more important when we think about learning and education, for example.
0:09:32 So that gives you the three V's: verbal, vocal and visual. The vision behind this research has been in people's minds for many years. If you look back sixty years ago (and by the way, this year is the sixtieth anniversary of artificial intelligence), it was there from the beginning, but we didn't have the technology. Nowadays we have the technology to do a lot of the low-level sensing: finding facial landmarks, analyzing the voice; and speech recognition is getting better, so we can, in real time, at least mostly transcribe speech. So we can start working on some of the original goal of inferring behavior and emotion.
0:10:18 So personally, when I look at this challenge of human communication dynamics, I count four types of dynamics. The first one is behavioral dynamics. Not every smile is equal: there are some smiles that seem to show politeness, some an awkward feeling. And there is also, and I have to give this to my colleagues in the speech community, prosody, which means that the same word can be modulated a lot by a change of prosody; people working in speech and conversation analysis try, for example, to find out who is speaking and how. [A series of audio clips plays: the same word spoken with different prosody.] Okay, this was one word only, and this was from only one hour of audio. Do you know who did it? It is Nick Campbell, and it was from one of his interaction experiments. From only one hour you can see the variety: some of them are just acknowledgments, more like a continuer, "please continue"; some clearly show some common ground and alignment; and some of them maybe eventually agreement. So just from the prosody, the same word changes.
0:11:47 The second type, which by now you have hopefully bought into, is the idea of multimodal dynamics, with alignment. The third one is really important; I think that is where a lot of the research in this conference, moving forward, is needed: the interpersonal dynamics. And the fourth one is the cultural and societal dynamics: there is a lot of study of both the differences and the similarities between cultures. So today I will focus primarily on the first three, and try to explain some of the mathematics behind them: how we can use them and develop new algorithms to be able to understand the behaviors.
0:12:26 What makes me personally excited about this field, why I am motivated, is first of all its potential in healthcare. There is a lot of potential in being able to help the doctor during their assessment or treatment of depression, schizophrenia and other disorders. The other area which is really important to me is education. The way people learn these days is shifting completely; we see more and more online learning. Online learning brings a lot of advantages, but one of its big disadvantages is that you lose the face-to-face interaction. How can we improve that in this new era? And then, the internet is wonderful: there is so much data there. People like to talk about themselves, to talk about what they love, their hobbies, everything. There is so much data, in every language, every culture, and a lot of it is already transcribed. It gives us a great opportunity for gathering data and studying people's behavior.
0:13:32 So today I have, on purpose, put the talk in three phases. The first phase is probably where one half of my heart is, which is health behavior informatics; I will present some of the work we did when I was still at USC, working on how to analyze communicative behavior to help doctors. The core of this talk will be about the mathematics of communication. There is a little bit of math, but you can always ignore the bottom half of the screen if you don't want to see mathematical equations, and I will give an intuition for every algorithm I present. But I want you to believe, and understand, that we can get a lot from mathematics and algorithms when studying communication. And the last part is the interpersonal dynamics; I will show some results, but I think this is where there is a need to work together and push this part of the research a lot further.
0:14:29 So let me start with health behavior informatics. You are going to recognize right away the image of a person who has been really important for SIGDIAL this year; thank you for all your work, and I hope you don't mind that I have been using you as my "patient" throughout my slides. So let's suppose that we have a patient; it could be anybody else in this room. We want to record the interaction between the patient and the doctor. During that interaction we will have some camera, let's say a Samsung 360 camera, just sitting on the table. And if we are lucky and we are at ICT, where we built SimSensei, then we can also have a virtual interviewer. The advantage of the virtual interviewer over a human is standardization: the virtual interviewer is going to ask the questions always the same way, as long as we ask it to. The core of my research there is, while the interaction is happening, to be able to pick up on the communicative cues that may be related to depression, anxiety, PTSD or schizophrenia, and to bring them back to the clinician, so that they can do a better assessment of depression. That is the long-term vision.
0:15:48 What was really lucky is how we started this. It was primarily computer scientists, with one strong believer, Skip Rizzo, who believed in this and worked with us; that is what made it possible. But now the medical field is seeing it as more and more important, and there are a lot more links being built. So let me introduce Ellie. Probably a lot of you have seen her; she has changed clothing a lot over the years. I am going to show her primarily because I want you to see the technology, which I think is amazing: it took forty-five people and four years to build. I am showing this video as the landmark video of that field, but also so you can watch the nonverbal sensing happening in real time.
0:16:40 [The video plays.] "Hi, I'm Ellie. Thanks for coming in today. I was created to talk to people in a safe and secure environment. I'm not a therapist, but I'm here to learn about people, and would love to learn about you. I'll ask a few questions to get us started. And please feel free to tell me anything; your answers are totally confidential. Are you OK with this?" "Yes." "So how are you doing today?" "I'm doing well." "That's good. Where are you from originally?" "I'm from Los Angeles." "Oh, I'm from LA myself." "When was the last time you felt really happy?" [The participant answers, and the interview continues while the real-time sensing is shown on screen.]
0:17:56 Okay, this was really short, but the point is this: we originally designed it for a fifteen-minute interaction, and in practice people easily talk twenty or thirty minutes with Ellie. We have one example of a really famous professor, whom I am not going to name, who came visiting, and we told them: be careful, we are going to be watching the videos from behind, don't tell too much, we will be there. "Yeah, sure, no problem." They start talking a little bit, and eventually they start talking about personal things, about everything, and I made sure not to be present at that point. Ellie brings that out, and that is really interesting. Ellie is there to listen to you; she is a good listener, she has been designed that way, and she does not show too much emotion. Emotion is a double-edged sword in this case: you can show emotion and get the person more engaged, or it can go the opposite way. For example, with a bad error in speech recognition, the patient said "my grandmother died" and Ellie said "great!" So you can definitely hurt, and Ellie is designed to reduce those risks. A lot of the work there was done by David Traum and David DeVault, on handling the dialogue at a level that makes the interaction go first through a rapport phase, then a phase of intimacy, where part of the interview is positive, "what was a happy moment in the last week?", and part is negative as well, "if you could go back in time, what would you change about yourself?" These are important for our research, because how the person transitions from positive to negative, and how they react when the questions shift, tells us a lot and allows us to calibrate.
0:19:47 So our vision, and that is the core of my research in this case, is how to analyze the patient's behavior today, and how that behavior changed compared to, say, two weeks ago; that allows us to see a change. So if you ask me where this technology is going to land first, it is in treatment, because in treatment you see the same person over time. And once, over time, we have gathered enough data, that may also allow us to do screening with this technology and provide good indicators. So this is the project that started more than six years ago, and let me summarize in a few minutes the things we discovered that we did not expect, things that I believe had not been seen previously.
0:20:36 The first population we looked at is depression. You think of depressed people and you think: the smile is going to be a great cue; just look at the rate of smiles in the depressed and the non-depressed, it is an obvious one. It turns out that no: the count of smiles is almost exactly the same between the depressed and the non-depressed. What changes is that the duration is shorter, and there is less amplitude. Hypothetically, what it means is that social norms tell you that you have to smile even where you don't feel it, and so you change the dynamics of your behavior. And that is why behavior dynamics are so important.
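As a rough sketch of how such a finding could be quantified (this is not the study's actual pipeline; the threshold, frame rate, and signal below are invented for illustration), smile episodes can be segmented from a per-frame smile-intensity signal, such as a tracker's AU12 output, and summarized by duration and peak amplitude:

```python
import numpy as np

def smile_stats(intensity, fps=30.0, threshold=1.0):
    """Segment a per-frame smile-intensity signal into smile episodes and
    return (episode count, mean duration in seconds, mean peak amplitude)."""
    active = intensity > threshold
    # rising/falling edges of the boolean "smiling" mask
    edges = np.diff(active.astype(int))
    starts = list(np.where(edges == 1)[0] + 1)
    ends = list(np.where(edges == -1)[0] + 1)
    if active[0]:                 # signal starts mid-smile
        starts.insert(0, 0)
    if active[-1]:                # signal ends mid-smile
        ends.append(len(intensity))
    durations = [(e - s) / fps for s, e in zip(starts, ends)]
    peaks = [intensity[s:e].max() for s, e in zip(starts, ends)]
    return len(durations), float(np.mean(durations)), float(np.mean(peaks))

# toy signal at 1 fps: one long, strong smile and one short, weak one
sig = np.array([0, 0, 2, 3, 3, 2, 0, 0, 1.5, 0, 0])
count, mean_dur, mean_amp = smile_stats(sig, fps=1.0)
```

With a real tracker the same summary would be computed per session, so that shorter durations and lower amplitudes can be compared across groups.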
0:21:18 The second population we looked at is post-traumatic stress. And you say: okay, PTSD, for sure there are some negative expressions associated with it; it is a given, people with PTSD will probably show more of them. And what did we see? Almost the same rate and intensity of negative expressions. So what did we end up doing? We split men and women. And what did we find out? Men show an increase in negative facial expressions, while women show a decrease in negative expressions, when they have symptoms related to PTSD. This is really interesting. So why? Another interesting question; every result gives us a new research question. Again, probably because of social norms: for men it is accepted in our culture to show more negative expressions, so they are not reducing them, while women, because of social norms again, may be reducing them. And this one, I am just going to say it because I am here: maybe it is because they are from Los Angeles, where Botox is so popular. I don't know; we would have to study it. The point is that it gives a new, interesting research question to study.
0:22:35 The third population we looked at is suicidal ideation. Do you know that about forty teenagers are coming to the ER in Cincinnati alone for suicidal ideation, either a first attempt or strong suicidal ideation? And the staff has to make a hard decision: am I keeping all of them here, sending some of them home, putting them on medication or not? It is a hard decision. So we had two tasks in mind. One is distinguishing suicidal versus non-suicidal. But where is the money? The money is in detecting repeaters, because the first attempt is often a call for help, while the second attempt is often the most dangerous one. So we did a lot of research, in collaboration with John Pestian in Cincinnati, where we studied the behavioral differences between suicidal and non-suicidal teens. The language is really important: you see more pronouns with the suicidal, who talk more about themselves, and you also see more negative language. These are not surprising, but they are confirmations of previous research. What was most challenging is repeaters versus non-repeaters: how can we differentiate them? And one of the most interesting results is that the voice is where the differences showed; people were speaking differently. (What is a repeater? We call again three weeks later to find out if there was a second attempt.) The breathiness of the voice was an indicator. It is just one indicator; you cannot diagnose just from the breathiness of the voice in itself, but you can take it together with others, and there are a lot of other indicators that you can add to help with this.
0:24:22 The last population we looked at is schizophrenia. Schizophrenia is a really important disorder, and it is also related to bipolar disorder; there is a spectrum there in psychosis. We were really interested to look at the facial behaviors, because we thought: someone with schizophrenia, are they going to look everywhere, are they going to move a lot, and so on? And what did we find out? When they were with the doctor: nothing. They were not moving; their behavior was more or less the same whether they were strongly schizophrenic or not. But if they were by themselves, then we could see the difference. So that brings in the really interesting aspect of interpersonal dynamics: when the doctor is there, they are constraining their behavior a little bit, while when they were by themselves you could see a lot in the facial expressions. So those are some examples of the populations we have been working on. Since then we have started looking at autism, and also at sleep deprivation; all of my PhD students can relate to that one, not sleeping. So we are looking at these as well.
0:25:37 If you are interested in doing and pushing this kind of research, I strongly suggest you go online right now and download OpenFace. OpenFace is us taking the main components of MultiSense for visual analysis and giving them away: not only for free, not only the open-source code for recognition, but all the open-source code for the training of all the models, which were all trained on public datasets. It is probably not good for my grant proposals, because I am probably giving away too much, but I think it is important for the community, and we are doing it for that reason. OpenFace has state-of-the-art performance for facial landmarks (sixty-eight facial landmarks), state-of-the-art performance for twenty-two facial action units, and also for eye gaze, eye gaze just from a webcam, at plus or minus five degrees, and also head position. And we are adding more every few months. So this is online; be sure to contact Tadas, who is the main person behind all of this work.
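As a minimal sketch of consuming this kind of tracker output: the column names below follow OpenFace's per-frame CSV convention (e.g. `AU12_r` for lip-corner-pull intensity, `gaze_angle_x` for horizontal gaze), but verify them against the documentation of the version you download; the numbers are made up.

```python
import csv
import io

# Hypothetical three-frame excerpt of a per-frame output file;
# real files contain one row per video frame and many more columns.
raw = """frame,gaze_angle_x,AU12_r
1,0.02,0.1
2,0.01,2.4
3,-0.03,2.9
"""

rows = list(csv.DictReader(io.StringIO(raw)))
smile = [float(r["AU12_r"]) for r in rows]         # smile intensity per frame
gaze_x = [float(r["gaze_angle_x"]) for r in rows]  # horizontal gaze angle per frame
peak_smile = max(smile)                            # strongest smile in the clip
```

In practice you would read the file from disk and feed such per-frame signals into the behavior-dynamics analyses described above.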
0:26:45 So I hopefully got you excited about the potential of analyzing nonverbal and verbal behavior for healthcare. So how do we do this? How can we go a step further? Right now we just looked at a couple of unimodal cues, one behavior at a time. What I am really excited about is: how can we add together all of these indicators from verbal, vocal and visual, so that we can better infer a disorder, or, in a social interaction, recognize leadership, rapport, and maybe also emotion?
0:27:19 So what are the core challenges? If you have to remember one thing from this lecture, it is these four challenges. When you look at multimodal communication, there are four main challenges. The first dimension is the temporal aspect: I told you about the smile, the dynamics of the smile are really important; we need to model each behavior over time. But there are also what are called representation, alignment and fusion. Representation: I have what the person said, and I have their gesture; how can I learn a joint way of representing them, so that if someone says "I like it" and smiles, these indicators are represented close to each other? And by representation, what I mean is numbers that are interpretable by the computer; imagine a vector, in some sense.
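The idea can be caricatured in a few lines. The feature values below are hand-made toys, not learned embeddings, and real systems learn a shared space rather than simply concatenating; the point is only that a positive word plus a smile should land near other positive-word-plus-smile moments:

```python
import numpy as np

def joint_repr(text_vec, visual_vec):
    # Naive joint representation: concatenate per-modality feature
    # vectors into one vector describing the moment.
    return np.concatenate([text_vec, visual_vec])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy features: words -> [positivity, intensity], face -> [smile, frown]
like_smile = joint_repr(np.array([0.8, 0.5]), np.array([1.0, 0.0]))
love_smile = joint_repr(np.array([0.9, 0.9]), np.array([1.0, 0.0]))
hate_frown = joint_repr(np.array([-0.9, 0.8]), np.array([0.0, 1.0]))

# "I like it" + smile sits near "I love it" + smile, far from "I hate it" + frown
sim_close = cosine(like_smile, love_smile)
sim_far = cosine(like_smile, hate_frown)
```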
0:28:16 Alignment is the second challenge: we move our eyes quickly, and of course gaze changes faster than our words, so we need to align the modalities. And the last one is fusion: we want to predict a disorder or an emotion; how do we fuse all this information?
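A bare-bones version of the alignment step might look like this, with invented word timestamps (as a forced aligner would produce) and a made-up per-frame visual signal; each word is summarized by averaging the visual stream over its time span:

```python
import numpy as np

# visual features sampled at 30 fps (here one scalar per frame, e.g. smile intensity)
fps = 30
frames = np.arange(0.0, 3.0, 1 / fps)        # frame timestamps in seconds
visual = np.where(frames > 1.5, 1.0, 0.0)    # toy signal: smile starts at 1.5 s

# words with (start, end) times in seconds (hypothetical aligner output)
words = [("i", 0.0, 0.4), ("like", 0.4, 1.0), ("it", 1.6, 2.0)]

# align: summarize the visual stream over each word's span
aligned = {w: float(visual[(frames >= s) & (frames < e)].mean())
           for w, s, e in words}
```

Here the smile overlaps only the word "it", so only that word picks up a nonzero visual feature.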
0:28:33 So for the first part, I will ask you to use another part of your brain, the one that is slowly waking up because of the coffee, the one for looking at math and algorithms; I want to give you a little bit of background on the math side. We have the behavior of a person, and we want to look at what its subcomponents are. What is the information you have? You have a plot, like a movie plot, and there are subplots to it; there is a gesture, and there are subcomponents to it. These components are really important when you look at modeling behaviors.
0:29:12 So how do we do this? Let's see: anybody whose strongest background is in language and NLP? That would be most of you. Anybody with a strong background in vocal, audio, speech? Okay, great. Anybody with a strong background in visual, computer vision? Okay, good, thank you; I don't feel lonely. Well, for each of these modalities there are existing problems that are well studied, looking at structure: for example, in language, looking at noun phrases or shallow segmentation; in visual, recognizing gestures; or in vocal, looking at the tenseness or the emotion in the voice.
0:29:55 And there have been a lot of approaches suggested for these, which fall into two families. A generative approach, in a nutshell, looks at each gesture and tries to generate it: if you look at head nods and head shakes, it is going to learn how a head nod is created and how a head shake is created, and given a new video, it asks which model is more likely to have produced it. A discriminative approach, instead, really looks at what differentiates the two. In a lot of our work, it turns out that the discriminative approaches perform better, at least for the task of prediction, and so I am going to tell you about this kind of approach, knowing really well that there is interesting work on the generative side.
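The contrast can be written in one line each; this is the textbook formulation rather than anything specific to the talk:

```latex
\text{Generative: model } P(x, y) = P(x \mid y)\,P(y), \text{ then classify with }
\hat{y} = \arg\max_{y} P(x \mid y)\,P(y)

\text{Discriminative: model } P(y \mid x) \text{ directly, then }
\hat{y} = \arg\max_{y} P(y \mid x)
```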
0:30:40 So, what is a conditional random field? My gosh, I didn't think I would say that this morning. A conditional random field is what is called a graphical model, and the reason I want you to learn about it is that it is a good entry point to a lot of the research that you have heard about: word embeddings, word2vec, deep learning, recurrent neural networks, all of these terms. We are going to go step by step to be able to understand them, and at the same time I will show you some of the work we have done with them.
0:31:15 So, given the task, and given a sentence, I want to know what is the beginning of a noun phrase, what is the continuation of a noun phrase, and what is anything else, the "other" label. It is a simple classification task. You can imagine, given observations where you have a one-hot encoding, zeros and a one for each word, or a word embedding, you can try to predict the relationship between the words and the noun phrases.
0:31:42 If you want to do it in a discriminative way, what does discriminative mean? It means that you model the probability of the label given the input: P(y|x). Now, this equation is simpler than it looks. There is one component that looks at how my observation relates to the label; this is what is called the unary potential. And the second part is: if I am at the beginning of a noun phrase, what is the likely label afterwards? If I am at the beginning of a noun phrase, it is likely that the next word is the continuation of a noun phrase, or "other"; but if I am in the continuation of a noun phrase, it may be much less likely that I jump straight to a new beginning after that. This is the kind of intuition you put into this model.
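In symbols, the linear-chain CRF being described has exactly these two parts; this is the standard formulation, with f and g as feature functions, theta and eta their weights, and Z(x) the normalizer:

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\!\Bigg(
    \underbrace{\sum_{t} \theta^{\top} f(y_t, x_t)}_{\text{unary: observation--label}}
  + \underbrace{\sum_{t} \eta^{\top} g(y_{t-1}, y_t)}_{\text{pairwise: label--label}}
\Bigg)

Z(x) = \sum_{y'} \exp\!\Bigg( \sum_{t} \theta^{\top} f(y'_t, x_t)
     + \sum_{t} \eta^{\top} g(y'_{t-1}, y'_t) \Bigg)
```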
0:32:31 These models can recognize behavior, and they do it well, but, and there is always a but, this problem would be so much easier if I knew the part-of-speech tags. It would be so much easier if I had a college undergrad in the box, as the annotator, giving us the part-of-speech tags. The task would be so much easier: from this pronoun, it is likely the beginning of a noun phrase; here is the beginning of another phrase; this is the verb; and so on. So why don't we just do that? Well, it is hard: the IRB does not allow us to put undergrads in a box, and it is a time-consuming process.
0:33:17 So this is the one thing I want you to remember from this part of the lecture: latent variables. I am going to replace that annotation by a latent variable, a number from one to, let's say, ten. That is going to do the job for you. Latent variables are there to help: they can group the words together for you, but you do not have to give them the name of each group; they can define the grouping naturally, in a way that works for the purpose of your task, which in this case is noun phrases. So you tell the model: hey, learn this grouping of all the words for me. And you can do that with a small trick, by saying: for the beginning of a noun phrase, I am allowing you these four groups; for the middle, the continuation of a noun phrase, I am allowing you to group all the words into four other groups; and I do the same for all the other labels.
0:34:17so you see it almost
0:34:19it's not unsupervised-clustering because i have the grouping will be happening because i have a
0:34:25task in mind
0:34:26discriminative model task in mind
0:34:29so if you do this once beautiful is the complexity of this algorithm is that
0:34:34almost the same as the c i have with a simple a summation over that
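the hidden-state trick just described can be sketched as follows, in the spirit of a latent-dynamic CRF: each label owns a small disjoint set of hidden states, and P(y|x) sums over the hidden paths consistent with y. state names, scores, and the brute-force sums are all illustrative; a real model learns the potentials and uses dynamic programming:

```python
import math
from itertools import product

# Each label owns a disjoint set of hidden states; the model is free to
# decide what each state means (the "grouping"); we never name them.
STATES_FOR = {"B-NP": ("b1", "b2"), "I-NP": ("i1", "i2"), "O": ("o1", "o2")}
ALL_STATES = [h for states in STATES_FOR.values() for h in states]

# Hand-picked toy scores standing in for learned potentials.
EMIT = {("the", "b1"): 2.0, ("cat", "i1"): 2.0, ("runs", "o1"): 2.0}
TRANS = {("b1", "i1"): 1.5, ("i1", "o1"): 1.0}

def path_score(words, hidden):
    s = sum(EMIT.get((w, h), 0.1) for w, h in zip(words, hidden))
    return s + sum(TRANS.get((a, b), 0.1) for a, b in zip(hidden, hidden[1:]))

def prob(words, labels):
    """P(y | x): sum over the hidden paths consistent with the labels,
    divided by the sum over all hidden paths (brute force for clarity)."""
    num = sum(math.exp(path_score(words, hs))
              for hs in product(*(STATES_FOR[y] for y in labels)))
    z = sum(math.exp(path_score(words, hs))
            for hs in product(ALL_STATES, repeat=len(words)))
    return num / z
```

because the labels partition the hidden states, the probabilities over all label sequences still sum to one, exactly as in the plain CRF.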
0:34:38now what do you end up learning with this grouping
0:34:42the most important is this link
0:34:45what do you end up learning? you learn what's called intrinsic dynamics: what is
0:34:50that? if i want to recognize a head nod, the intrinsic dynamics tell me i'm going down
0:34:55and up; this is the dynamic
0:34:58a head shake has a different dynamic; this is specific to the gesture
0:35:02extrinsic dynamics tell you: if i'm head nodding, how likely am i to switch strategy
0:35:07this is between the labels: how likely am i to head shake now, or how
0:35:12likely, in fact, to come back to a head nod after a head shake
0:35:15that's the intuition behind this
0:35:17so if you do this and you apply it to that famous task
0:35:22of noun phrase
0:35:24segmentation, also called shallow parsing
0:35:26and then you look at
0:35:27which hidden state is the most likely one for each word, because
0:35:31i want to know what my model learned, what is the grouping
0:35:36that it learned
0:35:37and if you look at what it did learn
0:35:39it's really beautiful
0:35:40it learned automatically that the beginning of a phrase is a determiner or a pronoun
0:35:44and it also gives me intuition
0:35:47about the kinds of part-of-speech tags
0:35:49behind each group: it basically learned part-of-speech tags automatically
0:35:54because of the words and the way these words happen in phrases
0:35:58so this is the take-home message of the first stage
0:36:01latent variables are there to group things
0:36:05for you
0:36:06they're a grouping tool, a temporal grouping tool
0:36:08that's the first ingredient we will need
0:36:14you probably heard the words "recurrent neural network"
0:36:18and you like the fancy name but have no clue; i didn't want to use that
0:36:23right away. a recurrent neural network looks a lot like this model
0:36:27the only thing that changes is that instead of having one latent state, taking values
0:36:32from one to ten
0:36:33i'm gonna have many neurons that are binary
0:36:36zero or one
0:36:38and so a recurrent neural network is someone taking a neural network, looking
0:36:43at the painting, and thinking: hey, it would look better horizontally. it's taking
0:36:48a neural network and turning it horizontally, and that axis is your temporal axis
0:36:53so if i was to show it to you the other way around, you would see
0:36:56just a neural network, a normal one
0:36:58by shifting it this way, it becomes the temporal
0:37:02model that we want. so this is great
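the "turned horizontally" picture can be made concrete with a minimal recurrent step: one function with shared weights applied at every time step, where only the hidden vector carries information forward. sizes and weights below are toy values of my own:

```python
import math

def rnn_step(x_t, h_prev, Wxh, Whh):
    """One recurrent step: h_t = tanh(Wxh @ x_t + Whh @ h_prev)."""
    return [math.tanh(sum(w * x for w, x in zip(Wxh[i], x_t)) +
                      sum(w * h for w, h in zip(Whh[i], h_prev)))
            for i in range(len(Wxh))]

def run_rnn(xs, Wxh, Whh):
    """Unroll over time: the SAME two weight matrices are reused at every
    step; only the hidden vector h carries information forward."""
    h = [0.0] * len(Whh)
    states = []
    for x in xs:
        h = rnn_step(x, h, Wxh, Whh)
        states.append(h)
    return states

# Toy weights: 1-dimensional input, 2 hidden units.
Wxh = [[1.0], [0.5]]
Whh = [[0.5, 0.0], [0.0, 0.5]]
states = run_rnn([[1.0], [0.0], [0.0]], Wxh, Whh)
```

note how the hidden state is still non-zero after the input goes back to zero: the horizontal connection is what carries the past forward.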
0:37:05the problem with these
0:37:06is they forget
0:37:08they forget; they have a problem in the learning
0:37:11so there's this famous algorithm that happened in germany
0:37:15more than twenty years ago and that became super famous recently
0:37:19it's long short-term memory
0:37:21and long short-term memory is really similar to the previous neural network
0:37:26but in the middle you have the memory
0:37:30and how do you guard the memory
0:37:33you're going to put a gate
0:37:34that only lets what you want into the memory
0:37:38and only lets what you want get out of the memory; you put in a gating, and
0:37:43then you think: hey, i'm gonna sometimes forget things, but i'm gonna decide what
0:37:47i forget. this is a really high-level view, but you can imagine by now
0:37:51this is the exact same thing
0:37:53the word
0:37:54and the label
0:37:56and the only difference is i'm going to memorise, and when i memorise, i memorise what
0:38:01happened before
0:38:02i'm gonna memorise what are the words, and possibly the groupings, that happened before
0:38:06i wanted to show you that
0:38:08just so that when you see this next time, you have at least an intuition
0:38:11that there is a way to approach
0:38:14temporal modeling, through the latent variables that i talked about
0:38:17or through neural networks
0:38:20okay
0:38:21now i want to address the second challenge
0:38:23the one that is the most interesting from my perspective; i worked a lot of
0:38:27my life on temporal modeling, so it's not to say that, but i think the next
0:38:31frontier
0:38:32is representation: how can you look at someone, at what they say
0:38:38and how they say it, and their gestures
0:38:41and find a common representation
0:38:43what should this common representation look like
0:38:46i want a representation such that if i have a video and i have a
0:38:52segment of someone saying "i like it"
0:38:55a part of the video where someone is smiling
0:38:59a part of the video with
0:39:00a joyful tone
0:39:02i want these
0:39:03to all be represented similarly to each other: if you look at the
0:39:09numbers representing
0:39:11them, they should be really similar: "i like it", a happy smile, a joyful tone
0:39:16and if i have someone who looks a little bit tense or depressed, or with
0:39:20some tenseness in their voice, i want the numbers to align: i take the
0:39:24audio clip
0:39:26and i try to represent it with this transformation
0:39:29and i want it to be really close to someone who looks depressed
0:39:32or if i have someone who looks surprised and i hear
0:39:35"wow"
0:39:36i want these to look alike
0:39:38and this was the dream
0:39:40i personally had this dream
0:39:42back more than ten years ago
0:39:44and a really smart researcher at toronto
0:39:47showed us a path for that
0:39:50it is ruslan at the university of toronto
0:39:53and there is a lot of interesting work
0:39:55where neural networks
0:39:57are allowing us to make this dream come true
0:40:00it's not all solved yet, don't worry, but they've done the first steps, and that's really
0:40:04important; i'm gonna show you results in a second
0:40:06what they say is: the visual
0:40:09could be represented with multiple layers of neurons
0:40:13and the verbal can be represented
0:40:16with multiple layers of neurons
0:40:18what i show here
0:40:19looks a lot like word2vec, for people who know about it: it's a
0:40:23representation of a word that becomes a vector, and here i have images that suddenly
0:40:29become also a nice vector. by the way
0:40:32if you wonder why multimodal models were not working before
0:40:34it's all the fault of computer vision people
0:40:37the reason multimodal was not working is that images were so hard to recognize: any
0:40:43object recognition was barely working
0:40:46but starting in two thousand and eleven
0:40:48computer vision started working
0:40:50at a level that is really impressive; we can recognize objects really efficiently, and now
0:40:56we can look at
0:40:58what is the high-level representation of the image that is useful
0:41:02words were always quite informative in themselves
0:41:05but computer vision solved a lot of the problem, and now we can take images and words
0:41:09and put them together
0:41:11in one representation
0:41:13and there's been a lot of really interesting work
0:41:16starting from around two thousand ten
0:41:18and there is still a lot of work in that field
0:41:21i'm gonna show just one result that shows
0:41:24to me how it may be possible
0:41:26and this is the work from toronto
0:41:29here is what they did
0:41:31they learned
0:41:32from images from the web, from flickr
0:41:35they take a bunch of images, and then
0:41:37they look at
0:41:39the one word or few words describing them
0:41:42and they force the two
0:41:43to point to the same place
0:41:46and when you do that
0:41:48you tie together the image representation
0:41:50and the word representation, and you get one joint representation
0:41:54but now i'm going to do
0:41:55something magical
0:41:57here is the beauty of it: i'm gonna take an image
0:42:01and get the number
0:42:03the representation
0:42:04i'm gonna take a word
0:42:05and get a number, and subtract
0:42:08the word number from the image number
0:42:11and i'm gonna add another word's number
0:42:15and finally i take this final number, and i'm gonna look at what kind
0:42:19of image
0:42:20is at that part of the space
0:42:22so you take an image of a blue car, subtract "blue", add "red"
0:42:24and then it becomes a red car
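the blue-car arithmetic can be mimicked with hand-made 3-D embeddings. real systems learn these vectors from data; every number and name below is invented purely for illustration:

```python
# Toy joint embedding space: dimensions loosely mean (colour, vehicle, size).
emb = {
    "blue":            [1.0, 0.0, 0.0],
    "red":             [-1.0, 0.0, 0.0],
    "blue_car_image":  [1.0, 1.0, 0.5],
    "red_car_image":   [-1.0, 1.0, 0.5],
    "blue_boat_image": [1.0, -1.0, 0.5],
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]
def dist(a, b): return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest_image(query):
    """Retrieve the image embedding closest to the query point."""
    images = [k for k in emb if k.endswith("_image")]
    return min(images, key=lambda k: dist(emb[k], query))

# blue car image - "blue" + "red" lands on the red car image.
query = add(sub(emb["blue_car_image"], emb["blue"]), emb["red"])
```

the point of the sketch is only that because words and images share one space, word arithmetic moves you between image regions.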
0:42:26what it means for me is: i finally believe in the babel
0:42:31the magic language where everything can be represented
0:42:37you know, there is this idea of a language
0:42:41the magic language where everybody can go from french to english and all that
0:42:44this is this magic language
0:42:47this is the dream of a universal language, and we finally got
0:42:50a piece of that magic language, where computer vision people can live happily with natural
0:42:55language people and speech people
0:42:58and they can do that for images of things like
0:43:03flying and sailing boats; i don't know, it is beautiful. but they didn't solve
0:43:09any of the problems i mentioned earlier about communicative behavior: they don't yet have
0:43:14the happy smile that goes with "i like it", but you can see the path now to
0:43:19that
0:43:20so i'm gonna show you now an algorithm
0:43:23that brings together what you learned earlier
0:43:26latent variables
0:43:28whose role is grouping
0:43:30and now i'm gonna add this new ingredient, which is neural networks, whose goal
0:43:35is to find a better way of representing. i don't like one-hot
0:43:41representations for words, just zeros and ones
0:43:44i want something that's more informative
0:43:46and i don't like raw images; i want something much more informative
0:43:50so i'm going to learn at the same time
0:43:52what my temporal grouping is, what my temporal dynamics are, and what is
0:43:57my way to
0:43:58represent
0:43:59so given the same input
0:44:02and the goal of maybe
0:44:04doing emotion recognition, or let's say recognizing what is positive or negative; i'm
0:44:10changing the task
0:44:11because noun phrase
0:44:13segmentation is not really a multimodal problem
0:44:15so i'm thinking of a task like positive versus negative
0:44:19like sentiment, something like that for example
0:44:22and i add this first layer here. i'm showing it
0:44:26this way, but what it is
0:44:27is that the word
0:44:29is multidimensional
0:44:31and this is also multidimensional because you have neurons
0:44:34so i'm replacing this with one layer
0:44:37of neurons
0:44:38and then
0:44:39i'm gonna add your now-famous latent variables
0:44:43so what is happening here
0:44:45and that's really important
0:44:46is that these layers' job
0:44:48is that they all get this gibberish here
0:44:51what to me is obviously a false smile, you may not get, and vice versa, because i
0:44:56speak french, a bunch of other things
0:44:57and so they take this gibberish and turn it into a format
0:45:00that's going to be useful for the computer, and their task here is to
0:45:04distill useful information
0:45:08to see what is similar between the different
0:45:12between the different modalities
0:45:14and so this is what you get here
0:45:16this one is doing the grouping: what should i group
0:45:19and this one here is
0:45:22how should i go from the numbers to something that's useful for my computer, and
0:45:26here it is the same as earlier: how the latent variables are
0:45:31grouping
0:45:32so this is beautiful, because you do, at the same time
0:45:36translation from gibberish to something useful, and clustering, at the same time
0:45:40one of the most challenging things when you train this
0:45:44is that each layer is hidden, latent, and you don't have any
0:45:48ground-truth labelling for it
0:45:49so when you have many of them, what happens is one layer could try to learn
0:45:54the same thing as the next layer
0:45:55so you want diversity in each of your layers
0:45:58and in good neural networks they will do what's called dropout
0:46:02or you can also impose some sparsity, so that this layer is gonna be really different
0:46:06from this one
0:46:08and when you do this for emotion recognition
0:46:10you get a huge boost over any of the prior work
0:46:13because we're not just doing late fusion; we're really at the same
0:46:17time modeling the representation
0:46:19and the temporal clustering
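the dropout trick mentioned above, in a minimal form; the scaling convention is the common "inverted dropout", and the layer values are arbitrary:

```python
import random

def dropout(activations, p_drop, rng):
    """Training-time dropout: zero each unit with probability p_drop and
    rescale survivors by 1/keep, so the expected activation is unchanged.
    Randomly silencing units keeps layers from simply copying each other."""
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

with p_drop = 0 the layer passes through untouched; with p_drop = 0.5 roughly half the units are silenced on each pass and the survivors are doubled.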
0:46:21okay
0:46:23i hope everyone survived; that was the last equation we'll have. so this was
0:46:28my goal of
0:46:31presenting for you
0:46:33the representation: how do i go
0:46:35from the temporal to the representation. and there are two more challenges which i wanna present quickly
0:46:42one is about alignment
0:46:44how do you align
0:46:46visual, which is really high frame rate, thirty frames per second
0:46:49with language
0:46:51which is, in fact, i don't know how many words per second; i'm
0:46:54probably on the high end of that
0:46:56but it's probably five to six words, maybe a little bit more, per second
0:46:59so how do you manage to be able to
0:47:02take the really high frame rate and align it with something much slower
0:47:06to say it another way: i have a video
0:47:09and i want to summarize that video
0:47:12squeeze it, so that at the end
0:47:14i really have only the important parts
0:47:16and if you look at computer vision people
0:47:19they often just look at the pixels
0:47:21and here there is a lot of change in pixels
0:47:23and here there is really little change
0:47:25really little change here
0:47:27and a lot of pixels changing here. so if you just look at the pixels
0:47:31and you try to merge: you have all of these frames
0:47:35and you want to find how you are gonna merge them
0:47:38there's two obvious ways to do it
0:47:40one is: keep only one out of two frames
0:47:44if it's a really long sequence, then you just ignore more, and a lot of the people in neural
0:47:48networks, that's often what they do: they take one out of ten frames. that said
0:47:52the more interesting way would be
0:47:54to look at one image and see if it looks like the previous one
0:47:58if they look alike, i'm gonna merge them, but if they do not look alike
0:48:02at this time
0:48:03then i do not merge them
0:48:05but what is more important, or magic: remember latent variables? latent variables
0:48:11are gonna merge things for you
0:48:12with a task in mind, which is recognizing gestures
0:48:16and if i do the merging because frames look alike in this latent space
0:48:20then it's a really more informed fusion
0:48:22and if you do that, you get a boost in performance for recognizing gestures
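the similarity-based merging can be sketched like this, with the frames and the threshold as toy values; merging in a learned latent space, as the talk suggests, would use the same loop over latent vectors instead of raw features:

```python
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mean(group):
    n = len(group)
    return [sum(f[i] for f in group) / n for i in range(len(group[0]))]

def merge_similar(frames, threshold):
    """Average each run of consecutive frames closer than `threshold`,
    instead of blindly keeping one frame out of N."""
    merged, group = [], [frames[0]]
    for f in frames[1:]:
        if dist(f, group[-1]) < threshold:
            group.append(f)              # looks like the previous frame: same run
        else:
            merged.append(mean(group))   # close the run, start a new one
            group = [f]
    merged.append(mean(group))
    return merged

frames = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
merged = merge_similar(frames, threshold=1.0)
```

here the four near-duplicate frames collapse into two representative ones, which is the summarization the speaker describes.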
0:48:28and i'm gonna give you one more intuition about it. say i have an hmm
0:48:32hmms are a lot like finding nemo, or rather finding dory
0:48:37it's the dory
0:48:39short memory: they don't remember; they only remember the last thing they've seen. they have a
0:48:44really short-term memory
0:48:45so if you give them something at a really high frame rate
0:48:48the only thing they will remember is the previous frame
0:48:51so what do they remember? they remember: my previous frame always looks a lot
0:48:55like my current frame
0:48:56so they learn smoothing
0:48:58but if i instead give it
0:48:59these frames here that are different from each other
0:49:03it will learn some temporal information that's more useful, and that's why
0:49:08a lot of models work so much better on language
0:49:12because every word is quite different from the previous one
0:49:15but every image in a video is really similar to the previous frame, and that hurts
0:49:18this model
0:49:19and when you do the merging, you get a nice clustering of
0:49:23the frames, because it's not looking
0:49:25just at the similarity, but it really
0:49:27also uses the grouping that you get from the latent variables
0:49:32the last one is fusion, and there's a lot more work to be done on
0:49:36fusion, but this one is like: okay
0:49:39i modeled the temporal
0:49:42i modeled the representation, i aligned my modalities
0:49:45but now i want to make a prediction; i wanna make my final prediction
0:49:50and i want to use all the information i have
0:49:52to make my prediction
0:49:54and to do that there are a lot of new ways
0:49:58if you think about it, each modality has its own dynamics: the voice is really
0:50:03quick
0:50:04words are slower
0:50:05so you don't want to lose that
0:50:07so you have one
0:50:09dynamic for
0:50:11each modality: one part is private, and one part
0:50:14will in fact interact with the other modalities
0:50:16okay, so you learn a dynamic for audio and you learn a dynamic for
0:50:21visual, and then you learn how to synchronise them
0:50:25i'm going quickly through this, but i just want to give you the intuition
0:50:28that fusion at the last step is the one that's going to
0:50:33learn the dynamics and learn also to synchronise at the same time, and when you
0:50:36do that, you improve a lot
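one plausible sketch of "private dynamics plus a shared, synchronising layer", with hand-picked scalar weights rather than the actual model from the talk:

```python
import math

def step(h, x, w_in, w_rec):
    """One recurrent step of a single-unit dynamic."""
    return math.tanh(w_in * x + w_rec * h)

def fuse(audio_seq, visual_seq):
    """Each modality keeps its own (private) recurrent state; a shared
    layer combines them at every step before the final prediction."""
    h_a = h_v = 0.0
    fused = []
    for x_a, x_v in zip(audio_seq, visual_seq):
        h_a = step(h_a, x_a, 1.0, 0.2)   # private audio dynamic: fast, short memory
        h_v = step(h_v, x_v, 1.0, 0.9)   # private visual dynamic: slow, long memory
        fused.append(math.tanh(0.7 * (h_a + h_v)))  # shared synchronising layer
    return fused

fused = fuse([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
```

the different recurrent weights give each modality its own decay rate, capturing the point that voice is quick and words are slower, while the shared layer is what the final predictor sees.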
0:50:38so i'm coming back, closing the loop
0:50:41i'm closing the loop
0:50:43and going back to the earlier work on distress, depression and ptsd
0:50:48i'm gonna take verbal, acoustic and visual
0:50:51and i want to predict how
0:50:54distressed you are
0:50:55and here are the results you get when you do multimodal fusion
0:50:59what you have is a hundred participants
0:51:03who interacted with ellie
0:51:05and each of them had a level of distress, in blue
0:51:10and some of them specifically have depression
0:51:13and in green, what you get
0:51:15is in fact the prediction
0:51:18you get the prediction, the green
0:51:20by putting together the verbal indicators
0:51:24the vocal and the visual
0:51:26and you can do more; i'm gonna skip ahead because of time
0:51:29but you can also do this for
0:51:32looking at sentiment
0:51:34in videos, sentiment in youtube videos
0:51:37is another application of this. i'm gonna skip this one
0:51:40because i want to go quickly to the last point i want to make
0:51:44the last part i want to state now is interpersonal dynamics
0:51:49you guys have been amazing: you've been head nodding, smiling, yawning, watching emails
0:51:56i got you
0:51:57okay
0:51:58but interpersonal dynamics is, i think, the next frontier, really, in algorithms, because some
0:52:06people will, for example, show synchrony in their behaviors
0:52:09synchrony in their behaviour is a great signal of rapport
0:52:14we saw it also in the video
0:52:17in some of our videos, with the virtual human mimicking each other
0:52:21but in negotiation
0:52:22you also see asymmetry, or divergence
0:52:26which is also really informative
0:52:28if i move forward and you move backward, that's an important cue
0:52:32this is important in negotiation, but also in learning
0:52:35if i look at the behavior of one speaker and another
0:52:39i can find moments where they synchronise
0:52:41and i can also find ones where there is asynchrony
0:52:45and these are often, in our data
0:52:48related to
0:52:49a rejection, or doing badly in their homework
0:52:53because they're not working well together
0:52:55there's a disagreement
0:52:57and the asynchrony can show that
0:53:00we can also use these behaviours, for example, to tell who is
0:53:03the real leader, or the expert
0:53:06because sometimes one person thinks they have the knowledge, but they're not
0:53:09always actually the most knowledgeable, and it's hard to differentiate that
0:53:14and voice is a good cue for that
0:53:17another one: are you gonna accept or not my offer during negotiation
0:53:23to predict that, i will look
0:53:25at your behavior
0:53:27i will look at my behaviour as the proposer, and i will look at
0:53:31our history together. if we do all of that together, we get a huge improvement when we
0:53:36put the dynamics in
0:53:38and what i think is interesting is
0:53:39in your behavior, if you head nod or are smiling, you are likely to accept
0:53:44but my behaviour is important too; by the way, the best way to have someone accept
0:53:48what you offer
0:53:50is in how
0:53:50you put out your request
0:53:54so the last one is: you guys
0:53:59good listeners
0:54:01how do i create a crowd like you guys, as good listeners
0:54:05i can do that from data
0:54:07i can look at each of you, how you react to the speaker
0:54:11and learn
0:54:12what are the most predictive cues
0:54:14and be able to eventually create a virtual listener
0:54:17these are the top four most predictive listener backchannel features. so if i pause
0:54:23you are likely to head nod
0:54:24that's not a surprise. if i look at you, you're likely to head nod, often
0:54:29right away
0:54:31if i say the word "and", then "and" by itself is not a good
0:54:34predictor, but if i'm in the middle of a sentence and then i pause and look at
0:54:39you
0:54:39you are really likely to give feedback
0:54:41so this is the power of multimodal. and similarly, if i don't look at you
0:54:46you're unlikely
0:54:47to head nod. but not all of you guys are the same
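the cue combination just described can be written as a toy rule; every probability below is illustrative, not a measured value from the study:

```python
def backchannel_probability(pause, gaze_at_listener, mid_sentence):
    """Toy model of listener feedback: a pause alone or gaze alone is a
    weak cue, but pause + gaze mid-sentence is a strong multimodal cue."""
    p = 0.05                          # baseline head-nod rate (made up)
    if pause:
        p += 0.15
    if gaze_at_listener:
        p += 0.15
    if pause and gaze_at_listener and mid_sentence:
        p += 0.40                     # the multimodal combination dominates
    if not gaze_at_listener:
        p *= 0.5                      # speaker not looking: nod unlikely
    return min(p, 1.0)
```

the point is only the shape of the rule: the joint cue is worth far more than the sum of the individual ones, which is what makes the multimodal feature predictive.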
0:54:50you're all a little bit different; you're not all smiling at the same things, which
0:54:55i don't know why you would all be
0:54:58so i can learn a model for one person
0:55:02i can learn a model for another person
0:55:04and another person
0:55:06and then what i would like to do is find the prototypical groupings
0:55:11grouping
0:55:13latent variables again; they really like that, with model selection
0:55:18again, that's their job
0:55:20but here they will be grouping people; we want to find what is common between people
0:55:24and what do you find
0:55:25you find that some people
0:55:27are driven by prosody, by how things are said
0:55:31for them it's only about the sound: even if i begin speaking in french
0:55:35even if i say stupid things, you will head nod, just because i paused
0:55:39at the right time
0:55:40and some people will be visual; they don't even care about listening
0:55:45and when i do this, noun phrases turn out to be a good predictor
0:55:50okay, so i want to show work from stacy marsella's lab here; this is a
0:55:55really great demonstration of putting all this interpersonal dynamics in one video; i could have
0:56:01never done better than that
0:56:02so what they do is this
0:56:04this is a video, a movie, and we're only gonna take the audio
0:56:08track
0:56:09and the text
0:56:10only the audio and the text
0:56:12and we're gonna animate
0:56:13the virtual humans here; we're gonna make two of them
0:56:17one of them is going to be the speaker, so its speaking behavior, based on
0:56:22the speech: you want to know, is it nodding the head
0:56:27which facial expression; that's the speaker behaviour
0:56:30but we also want to predict the listener behaviors
0:56:33directly from the speech of the speaker. so look at this
0:56:38it is beautiful
0:56:40and i hope you enjoy the movie
0:56:42[video clip plays]
0:57:27but this was all automatic, from the audio
0:57:31and, for the visual one, from some of the text only
0:57:34you get the acoustic cues from the audio; you get the emotion
0:57:38so this is an example putting everything together. these are some of the applications
0:57:44you can build
0:57:45bringing together the behavior dynamics (not every nod, not every smile is equal; you model
0:57:50them with the latent variables), the multimodal representation
0:57:57and alignment and fusion
0:57:59and then the interpersonal dynamics. so
0:58:01with that, thank you for your attention
1:00:16okay
1:00:17so
1:00:18let me answer the second one, and maybe the first one we can
1:00:22discuss more
1:00:23about the second one, about modeling alignment: right now we are looking at alignment at
1:00:29a really instantaneous level, so it's only a really small piece of the big problem of
1:00:36alignment
1:00:36right now we are only aligning
1:00:38at a really short term
1:00:40i personally believe the next
1:00:43okay, the next level
1:00:48of alignment needs to be at the segment level, so you need to be able
1:00:52to do segmentation
1:00:54at the same time as you do the alignment. and to come back to the other
1:00:59example that you mention
1:01:01when you look at mimicry, it is not instantaneous
1:01:05the classic example is, i think, four seconds or something like that, so the
1:01:10problem is that temporal contingency: you need to model that, and i think
1:01:14right now, as i said, a lot of models are short on memory
1:01:17and so we need the infrastructure
1:01:20to be able to remember. so
1:01:21i think all the points you mention are wonderful; i agree with you. this is
1:01:25why i'm excited with this: what we've done
1:01:27is that we've actually got the building blocks there
1:01:30and i think we need to study the next steps, so
1:01:33thank you
1:01:35okay, the one with the microphone, and then
1:02:12right, great question
1:02:13so right now we try to work with the calibration of each speaker
1:02:20by having a baseline for each person
1:02:23but where we got more sober indicators
1:02:26is in the difference between how you act when talking about something positive
1:02:29and how you act
1:02:31when talking about something negative
1:02:34looking at the delta
1:02:35is the most informative
1:02:37because the delta is, well
1:02:39it's not completely independent of the user, but it's a lot less dependent
1:02:43than just looking at how often this person smiles: how often they smile when it's positive
1:02:48versus how often they smile when it's negative
1:02:49that is more informative
1:02:52the other direction, if you ask me where this research is going, is in
1:02:56treatment
1:02:57and there
1:02:58what is interesting, and we're working with harvard medical school on this
1:03:01is: you get a schizophrenic patient at their worst
1:03:04you get a schizophrenic patient as they go through treatment, and then they go
1:03:08back home
1:03:09you can create a beautiful patient profile of them when they were at their best, and
1:03:14then use that to monitor
1:03:16their behaviour as they go back
1:03:18and so the work we are putting forward with harvard medical school
1:03:22is to be able to create these
1:03:24profiles of people
1:03:25the word "profile" doesn't sound great, so we call it a signature
1:03:28it sounds a little less big brother, but the idea is a profile like that
1:03:34so
1:03:36thank you all for your attention, thank you