0:00:18 | thank you very much for waking up early to start |
---|
0:00:23 | this is really exciting this is the first time |
---|
0:00:26 | in two years that i am giving a talk in this room |
---|
0:00:29 | so it is at the same time kind of emotional for me |
---|
0:00:33 | and so i'm really happy to share |
---|
0:00:36 | the recent research i've done on human communication analysis |
---|
0:00:41 | and i will also talk briefly |
---|
0:00:43 | about the earlier projects i've been doing |
---|
0:00:45 | on this topic |
---|
0:00:46 | and as you know really well |
---|
0:00:49 | a lot of the work i'm presenting here was done |
---|
0:00:52 | with my students and also with my collaborators |
---|
0:00:56 | this is |
---|
0:00:57 | the multicomp lab there is one at cmu and there is one |
---|
0:01:02 | at usc that stefan scherer is leading |
---|
0:01:05 | it is a team effort and we are all working together |
---|
0:01:09 | with the goal of building algorithms |
---|
0:01:12 | to analyze |
---|
0:01:13 | and eventually maybe even predict |
---|
0:01:16 | human subtle communicative behaviors |
---|
0:01:19 | and to really get into this understanding of why |
---|
0:01:24 | human communication and why multimodal, the magic word, i know it's impossible for me to |
---|
0:01:29 | give a talk without talking about multimodal |
---|
0:01:32 | i believe really strongly that when we analyze dialogue |
---|
0:01:36 | the verbal, what people are saying, is powerful |
---|
0:01:41 | and this is a really strong component |
---|
0:01:44 | of dialogue and conversation analysis |
---|
0:01:47 | but i also strongly believe that nonverbal communication both vocal and visual |
---|
0:01:52 | is also really important |
---|
0:01:53 | and for that reason i'm gonna show you an example some of you may have |
---|
0:01:57 | seen it so don't tell your neighbor the answer but i want to show |
---|
0:02:02 | you this short clip where we have an interview |
---|
0:02:07 | between two people |
---|
0:02:08 | and i want two tasks from you an easy one and a hard one |
---|
0:02:13 | the easy one is to find out so you have the interviewer and the |
---|
0:02:17 | interviewee |
---|
0:02:18 | what emotion |
---|
0:02:20 | does the interviewee |
---|
0:02:22 | feel |
---|
0:02:23 | and then there is the hard one |
---|
0:02:25 | that is the second of the two tasks |
---|
0:02:27 | the second thing i want from you is |
---|
0:02:30 | what is the cause |
---|
0:02:31 | that's the hardest but it is also the most interesting |
---|
0:02:35 | so let's read it together assuming you have |
---|
0:02:39 | no prior knowledge of the clip |
---|
0:02:41 | and note the verbal |
---|
0:02:42 | (clip) guy kewney is the editor of the technology website |
---|
0:02:46 | hello good morning / good morning |
---|
0:02:48 | were you surprised by the verdict today / i am very surprised to see this verdict come upon me |
---|
0:02:53 | because i was not expecting that |
---|
0:02:55 | when i came they told me something else |
---|
0:02:57 | so it was a big surprise |
---|
0:03:00 | what emotion does he feel |
---|
0:03:04 | it is an easy question |
---|
0:03:06 | surprise right exactly |
---|
0:03:09 | now let's look at it from a computer |
---|
0:03:12 | which is probably just gonna do some kind of word embedding and matching |
---|
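As a toy sketch of that word-embedding-and-matching idea (the vectors below are made up for illustration; a real system would use learned embeddings such as word2vec or GloVe):

```python
import math

# Hypothetical 3-d "embeddings", hand-made for this sketch only.
EMB = {
    "surprised": [0.9, 0.1, 0.0],
    "surprise":  [0.85, 0.15, 0.0],
    "happy":     [0.1, 0.9, 0.0],
    "verdict":   [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def closest_emotion(word, emotions=("surprise", "happy")):
    """Return the emotion label whose embedding is closest to the word's."""
    return max(emotions, key=lambda e: cosine(EMB[word], EMB[e]))

print(closest_emotion("surprised"))  # surprise
```

This is exactly the shallow matching the speaker describes: it can label the emotion, but it has no notion of why the surprise occurred.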
0:03:18 | what is the why of this surprise |
---|
0:03:20 | if we look at the question probably because of the verdict |
---|
0:03:23 | that is what follows |
---|
0:03:25 | a really quick answer |
---|
0:03:27 | but if we read more carefully |
---|
0:03:29 | we do see that there was something unexpected |
---|
0:03:32 | and maybe even something related to him |
---|
0:03:36 | so let's add one more modality the vocal |
---|
0:03:38 | that is which words does he decide to emphasize |
---|
0:03:43 | (the interview clip is replayed, this time listening to the vocal emphasis) |
---|
0:04:09 | okay so |
---|
0:04:10 | which word |
---|
0:04:12 | in his second answer did he decide to emphasize |
---|
0:04:16 | me |
---|
0:04:17 | he strongly emphasized the me |
---|
0:04:19 | so this surprise doesn't seem to be as much about the verdict |
---|
0:04:23 | but mostly because it came upon him |
---|
0:04:26 | so let's add another modality |
---|
0:04:28 | you saw the surprise but now you want to look at the timing |
---|
0:04:33 | of things |
---|
0:04:34 | and that's one of the other takeaways i want to bring in |
---|
0:04:37 | it's not just multimodal |
---|
0:04:39 | it is the alignment of the modalities that's really important |
---|
0:04:43 | so let's look at the visual modality this time |
---|
0:04:46 | (the interview clip is replayed, this time watching the visual reaction) |
---|
0:05:11 | okay so |
---|
0:05:13 | the surprise actually came a lot earlier than we thought |
---|
0:05:17 | much earlier |
---|
0:05:18 | and if you |
---|
0:05:22 | look carefully it is right around the introduction |
---|
0:05:26 | so given that information |
---|
0:05:28 | what is the cause how can you explain the surprise |
---|
0:05:33 | it is probably related to his title there's probably something wrong with the title |
---|
0:05:39 | okay and that's what is interesting that's where the timing is important |
---|
0:05:43 | he was really surprised at that part so if you look at named |
---|
0:05:47 | entity recognition there are two distinct entities there is the name of the person and there is the |
---|
0:05:54 | position and the place if you look carefully it is the second one |
---|
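A minimal sketch of that two-entity distinction, using a toy gazetteer instead of a real trained NER model (the phrase lists below are illustrative assumptions):

```python
# Toy gazetteers standing in for a learned named-entity recognizer.
PERSON = {"guy kewney"}                       # known person names
TITLE = {"editor of the technology website"}  # known job-title phrases

def tag_entities(sentence):
    """Return (span, label) pairs found in the sentence via dictionary lookup."""
    found = []
    low = sentence.lower()
    for name in PERSON:
        if name in low:
            found.append((name, "PERSON"))
    for title in TITLE:
        if title in low:
            found.append((title, "TITLE"))
    return found

print(tag_entities("Guy Kewney is editor of the technology website NewsWireless"))
# [('guy kewney', 'PERSON'), ('editor of the technology website', 'TITLE')]
```

The point the speaker makes is that only the second entity (the title) is wrong here, which a system can only discover by separating the two.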
0:05:58 | so |
---|
0:05:58 | based on that you infer that his name is right but his job title is not |
---|
0:06:05 | editor of a technology website |
---|
0:06:06 | the last piece i have to give you you would never have known it |
---|
0:06:10 | without the context but effectively his real job |
---|
0:06:13 | is that he's a taxi driver |
---|
0:06:17 | the taxi driver goes there for a small job interview a minor one |
---|
0:06:22 | and they tell him that's great come in |
---|
0:06:25 | they put on the makeup and the microphone and he thinks that's how the job interview |
---|
0:06:31 | goes with the makeup and everything |
---|
0:06:34 | and at that point he realizes oh my gosh this is not the |
---|
0:06:38 | interview i came for it is something live on the air |
---|
0:06:42 | but what is also interesting is the |
---|
0:06:46 | composure of the interviewer she keeps a straight face |
---|
0:06:50 | the only thing she says is that he will come back after the commercial |
---|
0:06:54 | he never comes back that's also telling so what we saw here is |
---|
0:07:00 | we as humans are expressing our communicative behavior through what i call the |
---|
0:07:07 | three v's verbal vocal and visual |
---|
0:07:09 | the verbal the word you decide to use |
---|
0:07:11 | is maybe slightly more positive or negative |
---|
0:07:16 | this is a choice you make |
---|
0:07:18 | a choice because you want to emphasize the sentiment |
---|
0:07:20 | or because you want to be polite and that's really important for discourse |
---|
0:07:25 | the way you decide to phrase the sentence brings a lot also |
---|
0:07:30 | the vocal every word you say can be emphasized differently |
---|
0:07:35 | and also you can decide to put more or less tension or vibration in |
---|
0:07:39 | the voice |
---|
0:07:40 | there are also the vocal expressions like laughter |
---|
0:07:42 | or the pauses that are important |
---|
0:07:46 | the visual i come from a computer vision background and the reason i put the |
---|
0:07:50 | focus on visual |
---|
0:07:52 | is maybe my bias but i strongly believe there's also a lot in the gestures |
---|
0:07:56 | i'm doing beat gestures and i may do some iconic gestures |
---|
0:08:00 | there is the eye gaze and the way i look around is also a cue |
---|
0:08:05 | the body language is important it's both the posture of the body and also |
---|
0:08:09 | the proxemics with others |
---|
0:08:12 | and that is also really culture specific i always have this great example |
---|
0:08:17 | of a brand new student who graduated by now |
---|
0:08:20 | but had just come from china |
---|
0:08:22 | we were having a wonderful discussion and i go to the whiteboard and i turn |
---|
0:08:27 | and he was right there |
---|
0:08:30 | and i tried to have a conversation but my canadian bubble well |
---|
0:08:36 | was violated |
---|
0:08:36 | i survived only twenty seconds and then we had a wonderful conversation but i tried |
---|
0:08:41 | to keep some distance within it |
---|
0:08:44 | then there is eye gaze and head gaze |
---|
0:08:45 | one of the first cues i look at almost always in any video analysis i do |
---|
0:08:50 | is eye gaze eye gaze is extremely important |
---|
0:08:53 | also for cognitive states and emotions eye gaze is really important |
---|
0:08:58 | and i have a bias for facial expressions also i believe the face brings |
---|
0:09:03 | a lot |
---|
0:09:04 | we have about forty two muscles on the face depending how you count exactly but around forty |
---|
0:09:08 | two |
---|
0:09:09 | all of them have been assigned a number by paul ekman's famous coding scheme |
---|
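To make the numbering concrete, here are a few action units from Ekman and Friesen's Facial Action Coding System (a small excerpt, not the full scheme):

```python
# A small excerpt of FACS action units (AUs); each AU is one facial movement.
ACTION_UNITS = {
    1: "inner brow raiser",
    4: "brow lowerer",
    6: "cheek raiser",
    12: "lip corner puller",   # the core of a smile
    15: "lip corner depressor",
}

# A felt ("Duchenne") smile is often described as AU6 together with AU12:
print(ACTION_UNITS[6], "+", ACTION_UNITS[12])  # cheek raiser + lip corner puller
```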
0:09:15 | and i'm interested not just in the basic emotions like sadness if |
---|
0:09:19 | someone is happy or starts to cry |
---|
0:09:21 | i'm also interested in these other cognitive states things like confusion |
---|
0:09:26 | and understanding |
---|
0:09:27 | they are as important and even more important when we think about learning and education for |
---|
0:09:31 | example |
---|
0:09:32 | so that's the view of the three v's verbal vocal and visual |
---|
0:09:36 | and |
---|
0:09:37 | the vision for this research has been in people's minds for many years |
---|
0:09:42 | if you look back sixty years ago and by the way happy birthday ai it |
---|
0:09:46 | is the sixtieth anniversary of artificial intelligence |
---|
0:09:50 | this was there from the beginning but we didn't have all the technology now these |
---|
0:09:56 | days we have the technology to do a lot of the low-level sensing finding facial landmarks |
---|
0:10:02 | and analyzing the voice |
---|
0:10:03 | even speech recognition is getting better |
---|
0:10:06 | so we can in real time or almost process speech |
---|
0:10:11 | and we can start doing some of the original goal of inferring |
---|
0:10:17 | behaviour and emotion |
---|
0:10:18 | so personally when i look at this challenge of human communication dynamics |
---|
0:10:23 | i target four types of dynamics |
---|
0:10:27 | the first one is behavioural dynamics |
---|
0:10:30 | not every smile is equal there are some smiles that seem to show |
---|
0:10:36 | politeness some show real feeling and there is also what we call and this is |
---|
0:10:42 | a tribute i have to give to my |
---|
0:10:44 | peers in prosody |
---|
0:10:45 | the fact |
---|
0:10:47 | that the same word |
---|
0:10:49 | can be really modulated a lot by the change of prosody and people |
---|
0:10:54 | working in speech and conversation analysis try to find out how it is spoken |
---|
0:11:01 | (audio samples of the same word spoken with different prosody are played) |
---|
0:11:12 | okay and this was only |
---|
0:11:15 | this was from only one hour of audio |
---|
0:11:19 | do you know whose it is |
---|
0:11:21 | it is nick campbell's and it was from one of his experiments data of the interactions |
---|
0:11:27 | they recorded but only from one hour of audio you can see the variety |
---|
0:11:33 | some of them are just backchannels |
---|
0:11:35 | which is more like a continuer please continue |
---|
0:11:38 | some clearly show some common ground |
---|
0:11:41 | and alignment |
---|
0:11:42 | and some of them maybe even agreement so just from the prosody the same word |
---|
0:11:46 | changes |
---|
0:11:47 | the second one which by now you hopefully bought into is the idea of multimodal |
---|
0:11:52 | dynamics where we align modalities |
---|
0:11:54 | the third one is really important i think that's where a lot of the research |
---|
0:11:57 | in this conference |
---|
0:11:59 | moving forward is needed the interpersonal dynamics |
---|
0:12:03 | and the fourth one is the cultural and societal dynamics |
---|
0:12:07 | there is a lot to study both the differences and also what is shared between cultures |
---|
0:12:12 | so today i will focus |
---|
0:12:14 | primarily on these three |
---|
0:12:16 | and try to explain some of the mathematics behind them |
---|
0:12:20 | how we can use them |
---|
0:12:21 | and develop new algorithms to be able to sense |
---|
0:12:24 | the behaviors |
---|
0:12:26 | and what makes me personally excited in this field |
---|
0:12:30 | the reason i am in it is its potential for healthcare |
---|
0:12:35 | there is a lot of potential in being able to help the doctor |
---|
0:12:39 | during their assessment or treatment |
---|
0:12:42 | of depression |
---|
0:12:44 | schizophrenia and autism |
---|
0:12:46 | and the other area which is really important is education |
---|
0:12:50 | the way people are learning these days is shifting completely we are seeing more |
---|
0:12:55 | and more online learning |
---|
0:12:57 | online learning brings a lot of advantages |
---|
0:13:00 | but one of the big disadvantages is you lose the face-to-face interaction |
---|
0:13:04 | how can we still improve that in this new era |
---|
0:13:08 | and |
---|
0:13:09 | the internet is wonderful |
---|
0:13:11 | there is so much there people like to talk about themselves and talk |
---|
0:13:17 | about what they love their hobbies and everything there is so much data in every language |
---|
0:13:21 | every culture so it allows a lot |
---|
0:13:24 | and a lot of it is transcribed already |
---|
0:13:26 | it gives us a great opportunity for gathering data and studying people's behaviour so |
---|
0:13:32 | today i on purpose put the talk in three phases |
---|
0:13:36 | the first phase is probably where one half of my heart is which is |
---|
0:13:40 | health behavior informatics i will present some of the work we have done when |
---|
0:13:45 | i was also at usc |
---|
0:13:47 | working on how you analyse communicative behavior to help doctors |
---|
0:13:52 | the core of this talk |
---|
0:13:54 | will be about the mathematics |
---|
0:13:56 | of communication |
---|
0:13:58 | there is a little bit of math but you can always ignore the |
---|
0:14:01 | bottom half of the screen if you don't |
---|
0:14:03 | want to see mathematical equations and i will give an intuition for every |
---|
0:14:08 | algorithm i present |
---|
0:14:09 | but i want you to believe and understand |
---|
0:14:12 | that we can get a lot from mathematics and algorithms |
---|
0:14:15 | when studying |
---|
0:14:17 | communication |
---|
0:14:17 | and the last one is the interpersonal dynamics i will show some results but i |
---|
0:14:22 | think this is where there's a need |
---|
0:14:23 | of working together and pushing this part of the research |
---|
0:14:27 | a lot further |
---|
0:14:29 | and so let me start with health behavior informatics |
---|
0:14:33 | you're gonna recognise right away |
---|
0:14:36 | the image of a person who's been really important for sigdial this year |
---|
0:14:41 | and i realise i should thank her for letting me use |
---|
0:14:45 | her as my patient in my slides |
---|
0:14:48 | so let's suppose that we have a patient |
---|
0:14:51 | it works for anybody else in this room as well |
---|
0:14:54 | and we want to record the interaction between the patient and the doctor |
---|
0:14:58 | during that interaction we will have some camera let's say a samsung three sixty |
---|
0:15:03 | just sitting on the table |
---|
0:15:05 | if we are lucky and are at ict where we |
---|
0:15:10 | built simsensei then we can also have a virtual interviewer |
---|
0:15:14 | the advantage of the virtual interviewer versus the human is the standardization |
---|
0:15:20 | the virtual interviewer is gonna ask the question always the same way as long as |
---|
0:15:24 | we ask it to do it |
---|
0:15:25 | the core of my research there |
---|
0:15:27 | is while the interaction is happening |
---|
0:15:30 | to be able to pick up on the communicative cues |
---|
0:15:33 | that may be related to depression |
---|
0:15:35 | anxiety or schizophrenia |
---|
0:15:38 | we bring them back to the clinician |
---|
0:15:41 | and then they can do a better assessment of depression |
---|
0:15:44 | that is the long-term vision |
---|
0:15:48 | what is really lucky |
---|
0:15:50 | is that when we started this |
---|
0:15:52 | it was primarily computer scientists |
---|
0:15:55 | with one strong believer skip rizzo |
---|
0:15:57 | who believed in this and working together with us |
---|
0:16:01 | made it possible but now the medical field is seeing it |
---|
0:16:05 | as more and more important with a lot more links going on now |
---|
0:16:10 | so let me |
---|
0:16:11 | introduce ellie probably a lot of you have seen her she hasn't changed her clothing |
---|
0:16:15 | you may notice in three years |
---|
0:16:18 | i'm gonna show her primarily because i want you to see the |
---|
0:16:23 | technology which i think is amazing because it took forty five people and four years |
---|
0:16:29 | to build |
---|
0:16:30 | i'm showing this video as the landmark video in that field but also |
---|
0:16:35 | to look at the nonverbal sensing happening in real time |
---|
0:16:40 | hi i'm ellie |
---|
0:16:42 | thanks for coming in today |
---|
0:16:43 | i was created to talk to people in a safe and secure environment |
---|
0:16:48 | i'm not a therapist but i'm here to learn about people and would love to |
---|
0:16:51 | learn about you i'll ask a few questions to get us started |
---|
0:16:55 | and please feel free to tell me anything your answers are totally confidential |
---|
0:17:00 | are you okay with this |
---|
0:17:03 | yes |
---|
0:17:04 | so |
---|
0:17:05 | how are you doing today |
---|
0:17:08 | hmm well |
---|
0:17:10 | that's good |
---|
0:17:13 | where are you from originally |
---|
0:17:16 | from los angeles |
---|
0:17:19 | i'm from l a myself |
---|
0:17:22 | when was the last time you felt really happy |
---|
0:17:27 | (the participant answers) |
---|
0:17:38 | okay thanks how would you compare yourself to twenty years ago |
---|
0:17:43 | (the participant answers) |
---|
0:17:56 | okay this is really short we originally designed it as a fifteen |
---|
0:18:01 | minute interaction but in fact people easily talk twenty thirty minutes with ellie |
---|
0:18:06 | we have one example of a really famous professor i'm not gonna name |
---|
0:18:11 | that person came in visiting and we told them |
---|
0:18:14 | be careful we're gonna be watching behind the videos |
---|
0:18:18 | don't tell too much we are there |
---|
0:18:21 | sure no problem |
---|
0:18:24 | they start talking a little bit and eventually they started talking about personal things about |
---|
0:18:29 | the bars and about everything and i was not present at that point |
---|
0:18:34 | ellie brings that out and that's really remarkable |
---|
0:18:39 | ellie is there to listen to you she is a good listener |
---|
0:18:42 | she has been designed that way and you may notice she does not show |
---|
0:18:46 | that much emotion |
---|
0:18:48 | emotion is a double-edged sword in this case |
---|
0:18:52 | you can show emotion and get the person more engaged or you can go the opposite |
---|
0:18:55 | way for example after a bad error in speech recognition the patient said |
---|
0:19:01 | my grandmother died and ellie gave a cheerful acknowledgment |
---|
0:19:05 | and so you can definitely hurt someone so we reduced those aspects |
---|
0:19:09 | and a lot of the work there was done by david traum and david devault |
---|
0:19:13 | on handling the dialogue at a high level |
---|
0:19:16 | to make the interaction go through a rapport phase |
---|
0:19:20 | then a phase of intimacy with questions that were positive |
---|
0:19:26 | what was a happy moment in the last week |
---|
0:19:28 | and negative as well |
---|
0:19:30 | if you could go back in time what would you change about yourself |
---|
0:19:33 | these are important anchors |
---|
0:19:36 | for our research because |
---|
0:19:38 | how the person behaves during the positive questions |
---|
0:19:41 | and how they react when they answer will tell you a lot about |
---|
0:19:44 | their reactions and allows us to calibrate |
---|
0:19:47 | so our view |
---|
0:19:50 | and that is core to my research in this case is how to analyze |
---|
0:19:55 | the patient's behavior today |
---|
0:19:58 | and how the behavior today compares to say two weeks ago |
---|
0:20:03 | that allows us to see a change so if you ask me where the technology |
---|
0:20:06 | is going to be first |
---|
0:20:08 | it's in treatment |
---|
0:20:10 | because in treatment you see the same person over time |
---|
0:20:13 | and once over time we have gathered the data that allows us also to maybe |
---|
0:20:18 | do screening with this technology and give great indicators |
---|
0:20:23 | so this is a project that started more than six years ago and let |
---|
0:20:26 | me show you in a few minutes |
---|
0:20:29 | some of the things we discovered that we did not expect |
---|
0:20:33 | things that i think were not seen previously |
---|
0:20:36 | and so the first |
---|
0:20:38 | population we looked at is depression |
---|
0:20:41 | you think of depressed people and you think the |
---|
0:20:44 | smile is gonna be a great way to tell you look at the depressed |
---|
0:20:48 | and the non-depressed this is an obvious one it turns out that no |
---|
0:20:52 | the amount of smile |
---|
0:20:54 | is almost exactly the same between the depressed and the non-depressed |
---|
0:20:59 | what changes is the duration shorter |
---|
0:21:02 | and less amplitude |
---|
0:21:04 | hypothetically what it means is social norms tell us that you have to smile |
---|
0:21:10 | even when you don't feel it |
---|
0:21:12 | and so you change the dynamics of your behavior |
---|
0:21:15 | and that's why behavior dynamics are so important |
---|
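A minimal sketch of measuring those smile dynamics, assuming we already have a per-frame smile-intensity track (for instance AU12 from a face tracker); the threshold, frame rate, and sample track below are illustrative assumptions:

```python
# Measure duration and amplitude of each smile episode in an intensity track.
def smile_episodes(intensity, fps=30, threshold=0.2):
    """Return (duration_seconds, peak_amplitude) for each above-threshold run."""
    episodes, start = [], None
    for i, v in enumerate(intensity + [0.0]):   # sentinel closes a trailing run
        if v > threshold and start is None:
            start = i
        elif v <= threshold and start is not None:
            seg = intensity[start:i]
            episodes.append(((i - start) / fps, max(seg)))
            start = None
    return episodes

# A short synthetic track: one brief, low-amplitude smile.
track = [0.0, 0.1, 0.4, 0.6, 0.5, 0.1, 0.0]
print(smile_episodes(track))   # [(0.1, 0.6)]: 3 frames at 30 fps, peak 0.6
```

With tracks like this, the finding above becomes measurable: similar episode counts, but shorter durations and lower peaks for the depressed group.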
0:21:18 | the second population we looked |
---|
0:21:20 | at is posttraumatic stress |
---|
0:21:23 | and you think okay with ptsd for sure there are some negative expressions |
---|
0:21:28 | with this |
---|
0:21:29 | it is a given |
---|
0:21:30 | people with ptsd will probably show them |
---|
0:21:33 | and what did we see almost the same rate and intensity |
---|
0:21:37 | the same intensity of negative expressions |
---|
0:21:39 | what did we end up doing we split it into men and women |
---|
0:21:43 | what did we find out |
---|
0:21:44 | men |
---|
0:21:46 | see an increase in negative facial expressions while women see a decrease in |
---|
0:21:52 | negative expressions when they have symptoms related to ptsd |
---|
0:21:56 | this is really interesting |
---|
0:21:57 | so why |
---|
0:21:59 | another interesting question |
---|
0:22:01 | i don't have the answer but we have a nice research question |
---|
0:22:03 | again probably maybe because of social norms |
---|
0:22:06 | for men it is accepted in our culture |
---|
0:22:08 | that they may show more negative expressions |
---|
0:22:11 | so they are not |
---|
0:22:13 | reducing them while women because of social norms again tend to reduce them |
---|
0:22:17 | there is another hypothesis i'm just gonna say it because i'm here |
---|
0:22:21 | maybe it is because they're from los angeles and botox is so popular |
---|
0:22:26 | i don't know we would have to study that but it gives a |
---|
0:22:32 | new interesting research question to study |
---|
0:22:35 | the third population that we looked at is suicidal ideation |
---|
0:22:41 | you know that there are forty teenagers going to the er in cincinnati |
---|
0:22:47 | alone |
---|
0:22:49 | for suicidal ideation either a first attempt or strong suicidal ideation |
---|
0:22:53 | and the doctor has to make this hard decision |
---|
0:22:55 | am i keeping all of them here |
---|
0:22:57 | sending some of them home putting them on medication or not |
---|
0:23:01 | it is a hard decision so we had two tasks in mind |
---|
0:23:04 | one is identifying suicidal versus non-suicidal |
---|
0:23:08 | but where is the money |
---|
0:23:09 | the money is in detecting repeaters |
---|
0:23:12 | because after a first attempt |
---|
0:23:14 | the second attempt is often the most dangerous one |
---|
0:23:18 | so we did a lot of research and this is in collaboration with children's hospital |
---|
0:23:22 | in cincinnati and john pestian |
---|
0:23:24 | where we studied the behavior differences between suicidal and non-suicidal patients |
---|
0:23:29 | and the language is really important |
---|
0:23:32 | you see more pronouns when suicidal talking about themselves |
---|
0:23:36 | and you also see more negative words |
---|
0:23:38 | these are not surprising but they were confirmations of previous research |
---|
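A toy sketch of those two language cues, counting first-person pronouns and negative words; the word lists are made up for illustration and this is in no way a clinical tool:

```python
# Illustrative word lists only; real studies use validated lexicons.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
NEGATIVE = {"never", "nothing", "alone", "hopeless", "hurt"}

def pronoun_negative_rates(text):
    """Return (first-person rate, negative-word rate) per token."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    n = len(words)
    fp = sum(w in FIRST_PERSON for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return fp / n, neg / n

print(pronoun_negative_rates("I feel alone and nothing helps me."))
```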
0:23:42 | what was the most challenging is repeaters and non-repeaters |
---|
0:23:48 | how can we differentiate them and one of the most interesting results is that the |
---|
0:23:53 | voice |
---|
0:23:54 | is where the differences showed |
---|
0:23:56 | people were speaking differently |
---|
0:23:58 | for a repeater what's gonna happen is we call again three weeks later to find out |
---|
0:24:03 | if there was a second attempt |
---|
0:24:05 | and so the breathiness of the voice was an indicator |
---|
0:24:08 | it is just one indicator we will not diagnose just because you have a breathy voice |
---|
0:24:13 | in itself |
---|
0:24:13 | but you put the indicators together and then we can add this |
---|
0:24:17 | and there are a lot of other indicators that you can add |
---|
0:24:21 | to help with this |
---|
0:24:22 | the last population we also looked at is schizophrenia |
---|
0:24:27 | schizophrenia is a really important |
---|
0:24:31 | disorder |
---|
0:24:32 | and it is also related to bipolar disorder and schizoaffective disorder |
---|
0:24:36 | in the psychosis |
---|
0:24:37 | arena |
---|
0:24:38 | and so we were really interested to look at the facial behaviors because we |
---|
0:24:43 | wondered are they gonna look everywhere are they gonna move around and all |
---|
0:24:47 | this |
---|
0:24:47 | and what did we find out |
---|
0:24:49 | when they were with the doctor nothing |
---|
0:24:51 | they were not moving their eyebrows there was no movement whether |
---|
0:24:55 | they were strongly schizophrenic or not |
---|
0:24:57 | but |
---|
0:24:58 | if they were by themselves |
---|
0:25:01 | then we could see the differences |
---|
0:25:03 | so that brings in the really interesting aspect of interpersonal dynamics |
---|
0:25:08 | when the doctor was there |
---|
0:25:09 | they were kind of constraining a little bit their behaviour while when they were by |
---|
0:25:13 | themselves you could see a lot in the facial expressions |
---|
0:25:16 | so those are some of the examples there are more populations we have been |
---|
0:25:20 | working on |
---|
0:25:21 | since then we started looking at autism |
---|
0:25:23 | and also at sleep deprivation |
---|
0:25:26 | all of my phd students like it they can really get paid for not |
---|
0:25:30 | sleeping |
---|
0:25:32 | and yes the latest data is gathered |
---|
0:25:34 | on the fly and so we're looking at these as well |
---|
0:25:37 | if you're interested in doing and pushing for that kind of research |
---|
0:25:41 | i strongly suggest |
---|
0:25:43 | you go online right now and download openface |
---|
0:25:47 | openface is us |
---|
0:25:49 | taking the main components of our multi-sensor pipeline for visual analysis |
---|
0:25:55 | and giving them away |
---|
0:25:56 | not only for free |
---|
0:25:59 | not only giving the open source for recognition |
---|
0:26:02 | we also give you all the open source for the training |
---|
0:26:06 | of all the models which were all trained with public datasets |
---|
0:26:10 | it's probably not good for my grant proposals and all this because i'm probably gonna |
---|
0:26:14 | give too much away but i think it is important for the community and we're doing |
---|
0:26:18 | it for that |
---|
0:26:19 | openface has state-of-the-art performance for facial landmarks sixty eight facial landmarks |
---|
0:26:24 | state-of-the-art performance for twenty two facial action units |
---|
0:26:28 | also for eye gaze |
---|
0:26:30 | eye gaze just from a webcam plus or minus five degrees and also head position |
---|
0:26:34 | we're adding more and more every few months also |
---|
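A small sketch of reading OpenFace's CSV output to pull out a smile-related signal; the column names used here ("AU12_r" for AU12 intensity, "success" for the tracking flag) follow recent OpenFace releases, but check the header of your own output file since they can differ between versions:

```python
import csv
import io

# Illustrative two-frame excerpt in the shape of an OpenFace CSV
# (not real tracker output).
sample = """frame,timestamp,success,AU12_r
1,0.033,1,0.00
2,0.066,1,1.75
"""

def au12_track(fileobj):
    """Return (timestamp, AU12 intensity) pairs for successfully tracked frames."""
    reader = csv.DictReader(fileobj)
    return [(float(r["timestamp"]), float(r["AU12_r"]))
            for r in reader if r["success"].strip() == "1"]

print(au12_track(io.StringIO(sample)))  # [(0.033, 0.0), (0.066, 1.75)]
```

Tracks like this are exactly the input assumed by the smile-dynamics sketch earlier in the talk.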
0:26:38 | so this is online |
---|
0:26:40 | and be sure to contact tadas who is the main person behind all of this |
---|
0:26:44 | software |
---|
0:26:45 | so i think i got you hopefully excited about the potential of analysing nonverbal |
---|
0:26:50 | and verbal behaviour for healthcare |
---|
0:26:53 | so how do we do this |
---|
0:26:55 | how can we go a step ahead right now we just looked at a couple of uni |
---|
0:26:59 | modal cues |
---|
0:27:00 | one behavior at a time |
---|
0:27:01 | but what i am really excited about is how can we add together |
---|
0:27:05 | all of these indicators from verbal vocal and visual |
---|
0:27:08 | so that we can better infer |
---|
0:27:10 | the disorder or in a social interaction recognize leadership |
---|
0:27:16 | rapport |
---|
0:27:17 | and also maybe emotion |
---|
0:27:19 | so |
---|
0:27:20 | what are the core challenges |
---|
0:27:23 | if you have to remember one thing from this lecture it is these four challenges |
---|
0:27:28 | when you look at multimodal communication there are four main challenges the first |
---|
0:27:32 | dimension is the temporal aspect i told you about the smile the dynamics of the smile are |
---|
0:27:38 | really important |
---|
0:27:39 | we need to model each behaviour over time |
---|
0:27:41 | but there are also what are called representation alignment and fusion |
---|
0:27:48 | representation i have what the person said and i have these gestures how can i |
---|
0:27:53 | learn a joint way of representing them |
---|
0:27:57 | so that if someone says i like it |
---|
0:27:59 | and then smiles |
---|
0:28:01 | these should be indicators that are represented close to each other |
---|
0:28:05 | and by representation what i mean |
---|
0:28:07 | i mean numerical values that are interpretable by the computer |
---|
0:28:13 | imagine a vector in some sense |
---|
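A toy sketch of that representation idea: each modality contributes a feature vector, and the simplest joint representation is concatenation; all the numbers below are made up for illustration:

```python
import math

def concat(*vectors):
    """Simplest joint representation: concatenate the modality vectors."""
    return [x for v in vectors for x in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# verbal sentiment features + visual smile feature (illustrative values)
like_smile   = concat([0.8, 0.1], [0.9])   # "i like it" + smile
like_neutral = concat([0.8, 0.1], [0.0])   # same words, neutral face
hate_frown   = concat([-0.7, 0.2], [0.1])  # negative words, slight frown

# Positive multimodal behaviors end up closer to each other than to negatives.
print(cosine(like_smile, like_neutral) > cosine(like_smile, hate_frown))  # True
```

Concatenation is only the starting point; learned joint embeddings aim for the same property with representations trained end to end.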
0:28:16 | alignment is the second thing |
---|
0:28:18 | we move our eyes sometimes faster and of course gaze changes faster than our words so we |
---|
0:28:24 | need to align the modalities and the last one is fusion |
---|
0:28:28 | we want to predict a disorder or an emotion how do you fuse this information |
---|
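For the alignment challenge, one classic tool is dynamic time warping (DTW), which matches two signals that unfold at different speeds; a minimal sketch on toy one-dimensional signals:

```python
# Classic O(n*m) dynamic time warping with absolute-difference local cost.
def dtw_cost(a, b):
    """Return the minimal cumulative cost of warping sequence a onto b."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

fast = [0, 1, 0]           # e.g. a quick gaze shift
slow = [0, 0, 1, 1, 0, 0]  # the "same" event, stretched in time
print(dtw_cost(fast, slow))  # 0.0: the warp absorbs the difference in timing
```

A plain frame-by-frame comparison would penalize the timing difference; DTW treats the two as the same event, which is the behavior we want before fusing modalities.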
0:28:33 | so the first one is and i will ask you to use one other part |
---|
0:28:38 | of your brain |
---|
0:28:39 | the one that's slowly waking up because of the coffee the one for looking at maths |
---|
0:28:44 | and algorithms but i want to give you a little bit of a background on |
---|
0:28:47 | the math side |
---|
0:28:48 | so we have the behavior of a person |
---|
0:28:51 | and we wanna be looking at |
---|
0:28:53 | what are the sub |
---|
0:28:56 | components to it and what is the information you have you have a plot |
---|
0:29:00 | like a movie plot and there are sub plots to it |
---|
0:29:04 | there is a gesture and there are subcomponents to it |
---|
0:29:08 | these components are really important when you look at multimodal behaviors |
---|
0:29:12 | so how do we do this so anybody let's see |
---|
0:29:16 | whose strongest background is in language and nlp |
---|
0:29:21 | would be most of you |
---|
0:29:22 | anybody with a strong background in vocal and audio in speech |
---|
0:29:26 | okay great |
---|
0:29:27 | anybody with a strong background in visual computer vision |
---|
0:29:31 | okay good thank you |
---|
0:29:33 | i don't feel lonely well for each of these modalities |
---|
0:29:37 | there are existing problems that are well studied looking at structure for example in language |
---|
0:29:44 | looking at noun phrases or shallow segmentation |
---|
0:29:48 | in the visual one recognizing gestures or in vocal looking at the tenseness or the emotion |
---|
0:29:54 | in the voice |
---|
0:29:55 | and there have been a lot of approaches suggested for that |
---|
0:29:59 | a generative approach addresses this kind of task |
---|
0:30:02 | generative in a nutshell is looking at each gesture and trying to generate it so |
---|
0:30:07 | if you look at head nod and head shake it's gonna learn how a head nod is created |
---|
0:30:12 | and how the head shake is created |
---|
0:30:15 | and if i'm given a new video it will say whether it is a head nod or a head |
---|
0:30:19 | shake a discriminative approach is really looking at what differentiates the two |
---|
0:30:25 | and so in a lot of our work we found that the discriminative approaches perform |
---|
0:30:30 | better at least for the task of prediction |
---|
0:30:32 | and so i'm gonna give you |
---|
0:30:34 | information about this kind of approach |
---|
0:30:37 | knowing really well there is interesting work on the generative side |
---|
0:30:40 | so |
---|
0:30:41 | what is a conditional random field |
---|
0:30:45 | my gosh i didn't think i would say that this morning |
---|
0:30:48 | but a conditional random field is what's called a graphical model |
---|
0:30:52 | and the reason i want you to learn about it is that this is a |
---|
0:30:55 | good entry way to a lot of the research that you've heard about word |
---|
0:31:00 | embeddings |
---|
0:31:01 | or word2vec or deep learning or recurrent neural networks all of these |
---|
0:31:06 | terms |
---|
0:31:07 | we're gonna go step by step to be able to understand them and at the |
---|
0:31:11 | same time i will give you some of the work we've done with them |
---|
0:31:15 | so given the task and given a sentence |
---|
0:31:18 | i want to know what is the beginning of a noun phrase |
---|
0:31:21 | or what is the continuation of a noun phrase or what is other |
---|
0:31:25 | so it is a simple classification task |
---|
0:31:28 | and you could imagine given the observations |
---|
0:31:30 | where you have a one hot encoding |
---|
0:31:32 | zero and one for the words or a word embedding |
---|
0:31:37 | you can try to predict |
---|
0:31:38 | the relationship between the word and the noun phrase |
---|
0:31:42 | if you wanna do it in a discriminative way what does discriminative mean |
---|
0:31:46 | it means that you model the probability of the label |
---|
0:31:50 | given the input p of y given x |
---|
0:31:53 | now this equation is simpler than it looks |
---|
0:31:56 | there is one component that looks at |
---|
0:31:59 | how my observation relates to the label this is what's called a singleton potential |
---|
0:32:06 | and the second part is if i'm at the beginning of a noun phrase what |
---|
0:32:10 | is the likely label afterwards |
---|
0:32:13 | if i'm at the beginning of a noun phrase it is likely |
---|
0:32:16 | that the next word is a continuation of a noun phrase or other but if i |
---|
0:32:22 | am in the continuation of a noun phrase |
---|
0:32:24 | it's really less likely maybe that i go |
---|
0:32:26 | into a verb right after that so this is the kind of intuition you put |
---|
0:32:30 | in this model |
---|
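To make the two kinds of potentials concrete, here is a toy sketch (not the exact model from the talk) of a linear-chain CRF for noun-phrase chunking: singleton potentials score how much each word looks like each label, pairwise potentials score label-to-label transitions, and the probability of a whole label sequence is its exponentiated score divided by the sum over all sequences. All numbers are made up for illustration.

```python
import numpy as np
from itertools import product

# Labels: B = begin noun phrase, I = inside noun phrase, O = other.
labels = ["B", "I", "O"]

# Singleton potentials: how much each observed word "looks like" each label.
# Rows = words of the toy sentence "the red car stopped", columns = labels.
unary = np.array([
    [2.0, 0.1, 0.5],   # "the"     -> strongly B
    [0.3, 2.0, 0.2],   # "red"     -> likely I
    [0.2, 2.5, 0.3],   # "car"     -> likely I
    [0.1, 0.1, 2.0],   # "stopped" -> O
])

# Pairwise potentials: score of moving from label (row) to label (column).
pairwise = np.array([
    #  B     I     O
    [0.2,  1.5,  0.3],   # from B: B -> I is encouraged
    [0.2,  1.0,  0.5],   # from I: going straight to a verb (O) is weaker
    [1.0, -2.0,  0.8],   # from O: O -> I is implausible
])

def score(seq):
    """Unnormalized log-score of a label sequence given the sentence."""
    s = sum(unary[t, y] for t, y in enumerate(seq))
    s += sum(pairwise[a, b] for a, b in zip(seq, seq[1:]))
    return s

# P(y | x) = exp(score(y)) / Z, with Z summing over every label sequence.
# Brute force is fine here: 3 labels ** 4 words = 81 sequences.
seqs = list(product(range(3), repeat=unary.shape[0]))
Z = sum(np.exp(score(s)) for s in seqs)
best = max(seqs, key=score)
print([labels[y] for y in best])   # the most likely chunking
```

With these hand-picked numbers the winning sequence is B I I O, i.e. "the red car" is one noun phrase; real CRFs learn the two potential tables from data and replace the brute-force sum with dynamic programming.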
0:32:31 | this model can recognize behaviour and it can do it but |
---|
0:32:37 | there's always a but |
---|
0:32:39 | this problem would be |
---|
0:32:41 | so much easier |
---|
0:32:43 | if i knew the part-of-speech tags it would be so much easier |
---|
0:32:47 | if i had an undergrad in the box if i had annotators |
---|
0:32:52 | sitting in there tagging part-of-speech for us |
---|
0:32:56 | the task would be so much easier from this pronoun you know it's likely the |
---|
0:33:01 | beginning of a noun phrase |
---|
0:33:02 | the beginning of another one right |
---|
0:33:04 | this is the verb so |
---|
0:33:06 | why don't we just do that well it is hard the i r b doesn't |
---|
0:33:10 | allow us to put undergrads in a box and it is a time-consuming |
---|
0:33:15 | process to do that so |
---|
0:33:17 | this is the one thing i want you to remember from this part of the |
---|
0:33:21 | lecture |
---|
0:33:21 | latent variables i'm gonna replace that by a latent variable a latent variable is a number |
---|
0:33:28 | from one to let's say ten |
---|
0:33:31 | that's gonna do the job for you |
---|
0:33:33 | latent variables are there for grouping |
---|
0:33:36 | they can group the words together for you |
---|
0:33:39 | but you don't have to give them the name of each group |
---|
0:33:44 | they can define the grouping naturally so that it works for the purpose of your task which is |
---|
0:33:49 | in this case |
---|
0:33:49 | noun phrases |
---|
0:33:51 | so you tell it hey learn this grouping for me of all the words |
---|
0:33:56 | and you can do that by doing a small trick by saying for the |
---|
0:34:00 | noun phrase the beginning of a noun phrase i'm allowing you |
---|
0:34:05 | these four groups |
---|
0:34:07 | for the middle for the continuation of a noun phrase i'm allowing |
---|
0:34:11 | you to group all the words in four other groups |
---|
0:34:14 | and i would do it also for all the other ones |
---|
0:34:17 | so you see it's almost |
---|
0:34:19 | it's not unsupervised clustering because the grouping will be happening because i have a |
---|
0:34:25 | task in mind |
---|
0:34:26 | a discriminative task in mind |
---|
0:34:29 | and what is beautiful is the complexity of this algorithm is |
---|
0:34:34 | almost the same as the crf with a simple summation over the latent states |
---|
0:34:38 | now what do you end up learning with this grouping |
---|
0:34:42 | the most important is this link |
---|
0:34:45 | what do you end up learning you end up knowing what's called the intrinsic dynamic what is |
---|
0:34:50 | that if i want to recognize a head nod the intrinsic dynamic tells me i'm going down |
---|
0:34:55 | and up this is the dynamic |
---|
0:34:58 | but a head shake has a different dynamic this is specific to the gesture |
---|
0:35:02 | the extrinsic dynamic tells you if i am doing a head nod how likely am i to switch |
---|
0:35:07 | this is between the labels how likely am i after a head nod |
---|
0:35:12 | how likely is it in fact that i come back then with a head shake |
---|
0:35:15 | that's the intuition behind this |
---|
0:35:17 | so if you do this and you apply it to that famous task |
---|
0:35:22 | of noun phrase |
---|
0:35:24 | segmentation also called shallow parsing |
---|
0:35:26 | and then you look |
---|
0:35:27 | at the hidden states the most likely one for each word |
---|
0:35:31 | i want to know what did my model learn what is the grouping |
---|
0:35:36 | it learned |
---|
0:35:37 | and if you look at what it did learn |
---|
0:35:39 | it's really beautiful |
---|
0:35:40 | it learned automatically that the beginning of a noun phrase is a determiner or pronoun |
---|
0:35:44 | and it also gives me intuition |
---|
0:35:47 | about the kind of part-of-speech tags |
---|
0:35:49 | without being told about part-of-speech tags it just learned them automatically |
---|
0:35:54 | because of the words and the way these words happen in the phrase |
---|
0:35:58 | so this is the first take home message |
---|
0:36:01 | latent variables are there to group things |
---|
0:36:05 | for you |
---|
0:36:06 | they are a grouping thing a temporal grouping |
---|
0:36:08 | that's the first ingredient we will need |
---|
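A toy sketch of the latent-variable trick just described, in the spirit of a latent-dynamic CRF: each label "owns" a few hidden states (the small trick of giving each label its own groups), and classifying means summing over a label's own states, so the model is free to invent its own word groupings per label. All sizes and weights here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_labels, states_per_label, n_feats = 3, 4, 5

# Each label y is restricted to hidden states [y*4, y*4+4): the "small trick".
W = rng.normal(size=(n_labels * states_per_label, n_feats))  # state weights

def label_scores(x):
    """Score each label by marginalizing (log-sum-exp) over its own states."""
    state_scores = W @ x                      # one score per hidden state
    scores = np.empty(n_labels)
    for y in range(n_labels):
        own = state_scores[y * states_per_label:(y + 1) * states_per_label]
        scores[y] = np.logaddexp.reduce(own)  # sum out the latent states
    return scores

x = rng.normal(size=n_feats)                  # a toy word observation
s = label_scores(x)
probs = np.exp(s - np.logaddexp.reduce(s))    # softmax over the 3 labels
print(probs)
```

Because the states are summed out inside each label, inference costs almost the same as a plain CRF, just with that extra summation — which is the point made above.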
0:36:12 | so |
---|
0:36:14 | you probably heard the word recurrent neural network |
---|
0:36:18 | and you like the fancy name but have no clue i don't wanna use it that |
---|
0:36:23 | way a recurrent neural network looks a lot like this model |
---|
0:36:27 | the only thing that changes is instead of having one latent state from one |
---|
0:36:32 | to let's say ten |
---|
0:36:33 | i'm gonna have many neurons that are binary |
---|
0:36:36 | zero or one |
---|
0:36:38 | and so a recurrent neural network is someone looking at a neural network looking |
---|
0:36:43 | at the painting and thinking hey it would look better horizontally so it's taking |
---|
0:36:48 | a neural network and moving it horizontally and that is your temporal axis |
---|
0:36:53 | so if i was to show it to you the other way around you would see |
---|
0:36:56 | just a neural network a normal one |
---|
0:36:58 | by shifting it this way this is the temporal dynamic |
---|
0:37:02 | that i model and so this is great |
---|
0:37:05 | the problem with these |
---|
0:37:06 | is they forget |
---|
0:37:08 | they forget they have a problem in the learning |
---|
0:37:11 | so there's this famous algorithm that happened in germany |
---|
0:37:15 | more than twenty years ago that became super famous recently |
---|
0:37:19 | the long short-term memory |
---|
0:37:21 | and the long short-term memory is really similar to the previous neural network |
---|
0:37:26 | but in addition you have a memory |
---|
0:37:30 | and how do you guard the memory |
---|
0:37:33 | you're going to put a gate |
---|
0:37:34 | that lets only what you want into the memory |
---|
0:37:38 | and only what you want out of the memory you put in a gating and |
---|
0:37:43 | then you think hey i'm gonna sometimes forget things but i'm gonna decide what |
---|
0:37:47 | i forget this is a really high level view but you can imagine by now |
---|
0:37:51 | this is the exact same thing |
---|
0:37:53 | the word |
---|
0:37:54 | and the label |
---|
0:37:56 | and the only difference is i'm going to memorise and when i memorise i memorise what |
---|
0:38:01 | happened before |
---|
0:38:02 | i'm gonna memorise what are the words and what is the grouping that happened before |
---|
0:38:06 | i wanted to show you that |
---|
0:38:08 | just so that when you see these terms you have at least an intuition |
---|
0:38:11 | that there is a way to approach |
---|
0:38:14 | temporal modeling through the latent variables that i talked about |
---|
0:38:17 | or through neural networks |
---|
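The gating story above can be written down in a few lines. This is a minimal from-scratch LSTM cell, not any production implementation: the input gate decides what enters the memory, the forget gate what is erased, and the output gate what leaves it. All dimensions and weights are toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hid = 3, 4

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate (input i, forget f, output o) plus the candidate
# content g, each acting on the concatenated [previous hidden state, input].
Wi, Wf, Wo, Wg = (rng.normal(scale=0.5, size=(d_hid, d_hid + d_in))
                  for _ in range(4))

def lstm_step(x, h, c):
    z = np.concatenate([h, x])
    i = sigmoid(Wi @ z)        # what to let INTO the memory
    f = sigmoid(Wf @ z)        # what to FORGET from the memory
    o = sigmoid(Wo @ z)        # what to let OUT of the memory
    g = np.tanh(Wg @ z)        # candidate content
    c = f * c + i * g          # gated memory update
    h = o * np.tanh(c)         # gated output
    return h, c

h, c = np.zeros(d_hid), np.zeros(d_hid)
for x in rng.normal(size=(6, d_in)):   # a 6-step toy sequence
    h, c = lstm_step(x, h, c)
print(h.shape, c.shape)
```

The memory vector `c` is exactly the "guarded" memory from the talk: whatever survives the forget gate at each step is what the network still remembers about the earlier words and groupings.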
0:38:20 | okay |
---|
0:38:21 | now i want to address the second challenge |
---|
0:38:23 | that's one of the most interesting from my perspective i worked a lot of |
---|
0:38:27 | my life on temporal modeling so it's hard to not say that but i think the next |
---|
0:38:31 | frontier |
---|
0:38:32 | is how do you work on representation how can you look at someone |
---|
0:38:37 | what they say |
---|
0:38:38 | and how they say it and the gestures |
---|
0:38:41 | and find a common representation |
---|
0:38:43 | what does this common representation look like |
---|
0:38:46 | i want a representation so that if i have a video and i have a |
---|
0:38:52 | segment of someone saying i like it |
---|
0:38:55 | a part of the video has someone smiling |
---|
0:38:59 | a part of the video has |
---|
0:39:00 | a joyful tone |
---|
0:39:02 | i want these |
---|
0:39:03 | to all be represented really similar to each other if you look at the |
---|
0:39:09 | numbers representing |
---|
0:39:11 | them they should be really similar i like it the happy voice the smile |
---|
0:39:16 | and if i have someone who looks a little bit tense or depressed or |
---|
0:39:20 | some tenseness in the voice i want the numbers if i take that |
---|
0:39:24 | audio clip |
---|
0:39:26 | and i try to represent it with this transformation |
---|
0:39:29 | i want it to be near those of someone who is depressed |
---|
0:39:32 | or if i have someone who looks surprised and i hear |
---|
0:39:35 | wow |
---|
0:39:36 | i want these to look alike |
---|
0:39:38 | and this was the dream |
---|
0:39:40 | i personally had this dream |
---|
0:39:42 | back more than ten years ago |
---|
0:39:44 | and this really smart researcher at toronto |
---|
0:39:47 | showed us a path for that |
---|
0:39:50 | and it is ruslan salakhutdinov at the university of toronto |
---|
0:39:53 | there is a lot of interesting work |
---|
0:39:55 | where neural network |
---|
0:39:57 | are allowing us to make this dream come true |
---|
0:40:00 | they didn't solve it all don't worry but they've done the first step that's really |
---|
0:40:04 | important i'm gonna show you results in a second |
---|
0:40:06 | what they said is the visual |
---|
0:40:09 | could be represented with multiple layers of neurons |
---|
0:40:13 | and verbal can be represented |
---|
0:40:16 | with multiple layers of neurons |
---|
0:40:18 | what i see here |
---|
0:40:19 | it looks a lot like word2vec for people who know about it it's a |
---|
0:40:23 | representation of a word that becomes a vector and here i have images that suddenly |
---|
0:40:29 | become also a nice vector by the way |
---|
0:40:32 | if you wonder why multimodal was not working |
---|
0:40:34 | it's all the fault of computer vision people |
---|
0:40:37 | the reason multimodal was not working is images were so hard to recognize any |
---|
0:40:43 | object recognition was barely working well |
---|
0:40:46 | but suddenly in two thousand and eleven |
---|
0:40:48 | computer vision started working |
---|
0:40:50 | at a level that is really impressive we can recognize objects really efficiently and now |
---|
0:40:56 | we can look at |
---|
0:40:58 | what is the high-level representation of the image that is useful |
---|
0:41:02 | words were always quite informative in themselves |
---|
0:41:05 | but the vision guys then solved a lot of it and now we can take |
---|
0:41:09 | both and put them together |
---|
0:41:11 | in one representation |
---|
0:41:13 | and there's been a lot of really interesting work |
---|
0:41:16 | starting from around two thousand ten |
---|
0:41:18 | and there is still a lot of work in that field |
---|
0:41:21 | i'm gonna show you this one result that sold |
---|
0:41:24 | to me how it may be possible |
---|
0:41:26 | and this is the work from toronto |
---|
0:41:29 | here is what they did |
---|
0:41:31 | they learned |
---|
0:41:32 | from images from the web from flickr |
---|
0:41:35 | they take a bunch of images and then |
---|
0:41:37 | with each there were |
---|
0:41:39 | one word or a few words describing them |
---|
0:41:42 | and they force the two |
---|
0:41:43 | to point to the same place |
---|
0:41:46 | and when you do that |
---|
0:41:48 | you get for any image |
---|
0:41:50 | a representation and for any word you get a representation |
---|
0:41:54 | but now i'm going to do |
---|
0:41:55 | multimodal |
---|
0:41:57 | arithmetic and here is the beauty of it i'm gonna take an image |
---|
0:42:01 | and get the number |
---|
0:42:03 | its representation |
---|
0:42:04 | i'm gonna take a word |
---|
0:42:05 | and get a number and then subtract |
---|
0:42:08 | the word number from the image number |
---|
0:42:11 | and i am gonna add another number |
---|
0:42:15 | and finally get this final number out of it and i'm gonna know what kind |
---|
0:42:19 | of image |
---|
0:42:20 | corresponds to that part of the space |
---|
0:42:22 | so you take a blue car |
---|
0:42:24 | and then it becomes a red car |
---|
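The image-minus-word-plus-word arithmetic just described can be sketched with toy vectors. These embeddings are fabricated for illustration (a real shared space would come from a trained model): subtracting "blue" from a blue-car image embedding and adding "red" should land near red cars.

```python
import numpy as np

# Hand-built 3-d "embeddings" in a pretend shared image/word space.
emb = {
    "blue":     np.array([1.0, 0.0, 0.0]),
    "red":      np.array([0.0, 1.0, 0.0]),
    "car":      np.array([0.0, 0.0, 1.0]),
    "blue_car": np.array([1.0, 0.0, 1.0]),   # image embedding: color + object
    "red_car":  np.array([0.0, 1.0, 1.0]),
}

def nearest(query, keys):
    """Cosine nearest neighbour among the given keys."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(keys, key=lambda k: cos(query, emb[k]))

query = emb["blue_car"] - emb["blue"] + emb["red"]
print(nearest(query, ["blue_car", "red_car", "car"]))   # -> red_car
```

The arithmetic only works because both modalities were forced to "point to the same place" during training, which is exactly the trick described above.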
0:42:26 | what that means for me is i finally believe in what is the babel |
---|
0:42:31 | what is their magic language where everything can be encoded |
---|
0:42:37 | and now there is a language |
---|
0:42:41 | the magic language where everybody can go from the french thing to this and all that |
---|
0:42:44 | is this magic language |
---|
0:42:47 | this is the same dream for language and images and we finally got |
---|
0:42:50 | a piece of that magic language where computer vision people can live happily with natural |
---|
0:42:55 | language people and speech people |
---|
0:42:58 | and they can do that for they have |
---|
0:43:03 | flying in sailing boat i don't know it is beautiful but they didn't solve |
---|
0:43:09 | any of the problems i mentioned earlier about communicative behavior they don't have yet the |
---|
0:43:14 | happy smile that goes with i like it but you can see the path now to |
---|
0:43:19 | that |
---|
0:43:20 | so i'm gonna show now an algorithm |
---|
0:43:23 | that brings together what you learned earlier |
---|
0:43:26 | latent variables |
---|
0:43:28 | which are a grouping tool |
---|
0:43:30 | and now i'm gonna add this new ingredient which are neural networks whose goal |
---|
0:43:35 | is to find a better way of representing i don't like one hot |
---|
0:43:41 | representations for words like zero and one |
---|
0:43:44 | i want something that's more informative |
---|
0:43:46 | and i don't like raw images i want something much more informative |
---|
0:43:50 | so i'm going to learn at the same time |
---|
0:43:52 | how i group things temporally what is my temporal dynamic and what is |
---|
0:43:57 | my way to |
---|
0:43:58 | represent |
---|
0:43:59 | so given the same input |
---|
0:44:02 | and the goal of maybe |
---|
0:44:04 | doing emotion recognition or let's say recognizing what is positive or negative i'm |
---|
0:44:10 | changing the task |
---|
0:44:11 | because noun phrase |
---|
0:44:13 | segmentation is not really a multimodal problem |
---|
0:44:15 | so i'm thinking of a task like positive versus negative like |
---|
0:44:19 | multimodal sentiment analysis for example |
---|
0:44:22 | and i replace the first layer here this is in fact i'm showing it |
---|
0:44:26 | this way but what it is |
---|
0:44:27 | is that the word |
---|
0:44:29 | is multidimensional |
---|
0:44:31 | and this is also multi dimensional because you have neurons |
---|
0:44:34 | so i'm replacing this with one layer |
---|
0:44:37 | of neurons |
---|
0:44:38 | and then |
---|
0:44:39 | i'm gonna add your famous latent variables |
---|
0:44:43 | so what is happening here |
---|
0:44:45 | and that's really important |
---|
0:44:46 | on this layer their job |
---|
0:44:48 | is that they all take the gibberish here |
---|
0:44:51 | what's gibberish to me would be let's say dutch and danish because i |
---|
0:44:56 | speak french among others |
---|
0:44:57 | and so they take this gibberish and turn it into a format |
---|
0:45:00 | that's going to be useful for the computer and their task here is to take |
---|
0:45:04 | from each modality the useful information we tried to get |
---|
0:45:08 | to see what is similar between the different |
---|
0:45:12 | between the different modalities |
---|
0:45:14 | and so this is what you get here |
---|
0:45:16 | it is doing the grouping what should i group |
---|
0:45:19 | and this one here is |
---|
0:45:22 | how should i go from the numbers to something that's useful for my computer and |
---|
0:45:26 | here it is the same as earlier it is how the latent variables are |
---|
0:45:31 | grouping |
---|
0:45:32 | so this is beautiful because you do at the same time |
---|
0:45:36 | translate from gibberish to something useful and cluster at the same time |
---|
0:45:40 | one of the most challenging things when you train that |
---|
0:45:44 | is that each layer is hidden latent and you don't have a |
---|
0:45:48 | ground truth labelling it |
---|
0:45:49 | so when you have many of them what happens is one could try to learn |
---|
0:45:54 | the same as the next layer |
---|
0:45:55 | so you want diversity inside of your layers |
---|
0:45:58 | and in good neural networks they will do what's called dropout |
---|
0:46:02 | or you can also impose some sparsity so that this one is gonna be really different |
---|
0:46:06 | from this one |
---|
0:46:08 | and when you do this for emotion recognition |
---|
0:46:10 | you get a huge boost over any of the prior work |
---|
0:46:13 | because we were not just doing a late fusion we're really at the same |
---|
0:46:17 | time modeling the representation |
---|
0:46:19 | and the temporal clustering |
---|
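A minimal sketch of the pipeline just described, with invented sizes and random weights (nothing here is the actual trained model): each modality is first projected into a shared space, the shared vectors are fused, and dropout during training keeps the hidden layers from collapsing into copies of each other.

```python
import numpy as np

rng = np.random.default_rng(2)
d_text, d_audio, d_joint = 8, 5, 4

W_text  = rng.normal(scale=0.3, size=(d_joint, d_text))
W_audio = rng.normal(scale=0.3, size=(d_joint, d_audio))
W_out   = rng.normal(scale=0.3, size=(2, 2 * d_joint))   # positive vs negative

def forward(x_text, x_audio, train=True, p_drop=0.5):
    h_t = np.tanh(W_text @ x_text)          # text -> shared space
    h_a = np.tanh(W_audio @ x_audio)        # audio -> shared space
    h = np.concatenate([h_t, h_a])          # fuse the modalities
    if train:                               # dropout: push units to differ
        h = h * (rng.random(h.size) >= p_drop) / (1 - p_drop)
    logits = W_out @ h
    return np.exp(logits) / np.exp(logits).sum()   # sentiment distribution

probs = forward(rng.normal(size=d_text), rng.normal(size=d_audio), train=False)
print(probs)
```

Training all three pieces jointly — the per-modality projections, the fusion, and the prediction — is what distinguishes this from the late fusion mentioned above, where each modality would be classified separately and only the scores combined.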
0:46:21 | okay |
---|
0:46:23 | now that everyone survived this was the last equation we had so this was |
---|
0:46:28 | this was my goal of |
---|
0:46:31 | presenting for you |
---|
0:46:33 | the representation how do i go |
---|
0:46:35 | from temporal to the representation and there are two last ones which i wanna present quickly |
---|
0:46:42 | one is about alignment |
---|
0:46:44 | how do you align |
---|
0:46:46 | visual which is really high frame rate thirty frames per second |
---|
0:46:49 | with language |
---|
0:46:51 | which is in fact i don't know how many words per second i see i'm |
---|
0:46:54 | not on the high end of that |
---|
0:46:56 | but it's probably five to six words maybe a little bit more per second |
---|
0:46:59 | so how do you manage to |
---|
0:47:02 | take the really high frame rate and align it with something much slower |
---|
0:47:06 | said another way i have a video |
---|
0:47:09 | and i want to summarize that video |
---|
0:47:12 | squish it so that at the end |
---|
0:47:14 | i really have only the important parts |
---|
0:47:16 | and if you look at computer vision people |
---|
0:47:19 | they look at the pixels |
---|
0:47:21 | and here there is a lot of change in pixels |
---|
0:47:23 | and here there is really little change |
---|
0:47:25 | really little change here |
---|
0:47:27 | and a lot of pixels changing here so if you just look at the pixels |
---|
0:47:31 | and you try to merge you look at all of these frames |
---|
0:47:35 | and you want to find how you are gonna merge them |
---|
0:47:38 | there's two obvious ways to do it |
---|
0:47:40 | one is to ignore one out of two frames |
---|
0:47:44 | if it's a really long sequence then you just ignore more and a lot of the people in neural |
---|
0:47:48 | networks that's often what they do they take one out of ten frames but i'd say |
---|
0:47:52 | that the more interesting one will be |
---|
0:47:54 | look at one image does it look like the previous one |
---|
0:47:58 | and if they look alike i'm gonna merge them but if they don't look alike |
---|
0:48:02 | at this time |
---|
0:48:03 | then i do not merge them |
---|
0:48:05 | and what is more important here is the magic ingredient you remember latent variables latent variables |
---|
0:48:11 | are gonna merge things for you |
---|
0:48:12 | for the task in mind which is recognizing gestures |
---|
0:48:16 | and if i do the merging because they look alike in that latent space |
---|
0:48:20 | then the really important moments survive the fusion |
---|
0:48:22 | and if you do that you get a boost in performance for recognizing gestures |
---|
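The merging idea can be sketched in a few lines. This toy version (my own illustration, with a hand-picked similarity threshold) merges adjacent frames whose representations barely change and keeps a new segment whenever something really moves, so a sequence model downstream sees changes rather than repetition.

```python
import numpy as np

def merge_frames(frames, threshold=0.98):
    """Average runs of near-duplicate adjacent frames into single segments."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    segments = [[frames[0]]]
    for f in frames[1:]:
        if cos(segments[-1][-1], f) >= threshold:
            segments[-1].append(f)      # barely changed: same segment
        else:
            segments.append([f])        # a real change: start a new segment
    return [np.mean(s, axis=0) for s in segments]

# Five nearly identical frames, then an abrupt change (say, the hand goes up).
frames = [np.array([1.0, 0.01 * t]) for t in range(5)]
frames.append(np.array([0.0, 1.0]))
merged = merge_frames(frames)
print(len(frames), "->", len(merged))
```

In the approach described above, the similarity would be measured in the learned latent space rather than on raw pixels, so "looking alike" means alike for the gesture-recognition task, not just visually.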
0:48:28 | and i'm gonna give you one more intuition about it say i have an hmm |
---|
0:48:32 | hmms are a lot like finding nemo or finding dory |
---|
0:48:37 | dory the fish with the |
---|
0:48:39 | short memory they don't remember they only remember the last thing they've seen a |
---|
0:48:44 | really short term memory |
---|
0:48:45 | so if you give them something really high frame rate |
---|
0:48:48 | the only thing it will remember is the previous frame |
---|
0:48:51 | so what do they learn my previous frame always looks a lot |
---|
0:48:55 | like my current frame |
---|
0:48:56 | so i smooth |
---|
0:48:58 | but if i give it |
---|
0:48:59 | these frames here that are different from each other |
---|
0:49:03 | it will learn some temporal information that's more useful and that's why |
---|
0:49:08 | a lot of models work so much better on language |
---|
0:49:12 | because every word is quite different from the previous |
---|
0:49:15 | but every image in a video is really similar to the previous so that hurts |
---|
0:49:18 | the model |
---|
0:49:19 | and when you do that you get a nice clustering |
---|
0:49:23 | of the frames because it's not looking |
---|
0:49:25 | just at the similarity but it really |
---|
0:49:27 | uses the grouping that you get from the latent variables |
---|
0:49:32 | the last one is fusion and there's a lot more work to be done on |
---|
0:49:36 | fusion but this one is like okay |
---|
0:49:39 | i modeled the temporal |
---|
0:49:42 | i modeled the representation i aligned my modalities |
---|
0:49:45 | but now i want to make a prediction i wanna make my final prediction |
---|
0:49:50 | and i want to use all the information i have |
---|
0:49:52 | to make my prediction |
---|
0:49:54 | and to do that there are a lot of new ways |
---|
0:49:58 | if you think about it each modality has its own dynamics the voice is really |
---|
0:50:03 | quick |
---|
0:50:04 | the words are slower |
---|
0:50:05 | so you don't want to lose that |
---|
0:50:07 | so you have one |
---|
0:50:09 | dynamic for |
---|
0:50:11 | each modality so one part is private and one |
---|
0:50:14 | will in fact interact with the other modalities |
---|
0:50:16 | okay so you will learn a dynamic for audio and you learn a dynamic for |
---|
0:50:21 | visual and then you learn how to synchronise them |
---|
0:50:25 | i'm going quickly through it but i just want to give you the intuition |
---|
0:50:28 | that this latent layer is the one that's going to |
---|
0:50:33 | learn the dynamics and learn also to synchronise at the same time and when you |
---|
0:50:36 | do that you improve a lot so |
---|
0:50:38 | i'm coming back closing the loop |
---|
0:50:41 | i'm closing the loop |
---|
0:50:43 | and going back to the earlier work on distress depression and ptsd |
---|
0:50:48 | i'm gonna take verbal acoustic and visual |
---|
0:50:51 | and i want to predict how |
---|
0:50:54 | distressed you are |
---|
0:50:55 | and here are the results you get when you do multimodal fusion |
---|
0:50:59 | you get this what you have is a hundred participants |
---|
0:51:03 | who interacted with ellie |
---|
0:51:05 | and each of them has a level of distress in blue |
---|
0:51:10 | and some of them have ptsd and depression |
---|
0:51:13 | and in green what you get |
---|
0:51:15 | is in fact the prediction |
---|
0:51:18 | you get the prediction the green |
---|
0:51:20 | by putting together the verbal indicators |
---|
0:51:24 | the vocal and the visual |
---|
0:51:26 | and you can do that i'm gonna skip this because of time |
---|
0:51:29 | but you can also do this a lot for |
---|
0:51:32 | looking at sentiment |
---|
0:51:34 | in videos sentiment in youtube videos |
---|
0:51:37 | is another application of that i'm gonna skip this one |
---|
0:51:40 | because i want to go quickly to the last point i want to make |
---|
0:51:44 | the last part i want to state now is interpersonal dynamics |
---|
0:51:49 | you guys have been amazing you've been head nodding smiling yawning watching emails |
---|
0:51:56 | i got you |
---|
0:51:57 | okay |
---|
0:51:58 | but interpersonal dynamics is i think the next frontier really in algorithms because people some |
---|
0:52:06 | people will like see synchrony in their behaviors |
---|
0:52:09 | synchrony in their behaviour is great it builds up some kind of rapport |
---|
0:52:14 | we saw that also in the video |
---|
0:52:17 | and in some of our videos using the virtual human mimicking each other |
---|
0:52:22 | you also see asymmetry or divergence |
---|
0:52:26 | which is also really informative |
---|
0:52:28 | if i move forward and you move backward that is important too |
---|
0:52:32 | this is important in negotiation but also in learning |
---|
0:52:35 | if i look at the behavior of one speaker and another |
---|
0:52:39 | i can find moments where they synchronise |
---|
0:52:41 | and i can also find ones where there is asynchrony |
---|
0:52:45 | and these are often in our data |
---|
0:52:48 | related to |
---|
0:52:49 | a rejection or doing badly on their homework |
---|
0:52:53 | because they're not working well together |
---|
0:52:55 | there's a there's a disagreement |
---|
0:52:57 | and the asynchrony can show that |
---|
0:53:00 | we can use some of these behaviors for example to tell the |
---|
0:53:03 | real leader from the expert |
---|
0:53:06 | you would otherwise think the leader is the knowledgeable one but they're not |
---|
0:53:09 | always that they are not always the knowledgeable one and so it is hard to differentiate them |
---|
0:53:14 | and voice is a good cue for that |
---|
0:53:17 | and another example what are you gonna do accept or not my offer during a negotiation |
---|
0:53:23 | and to predict that i will look |
---|
0:53:25 | at your behavior |
---|
0:53:27 | i will look at my behaviour as the proposer and i will look at |
---|
0:53:31 | our history together if we do that together we get a huge improvement when we |
---|
0:53:36 | model the dyad |
---|
0:53:38 | and what i think is nice is that |
---|
0:53:39 | in your behavior if you head nod or lean towards me you are likely to accept |
---|
0:53:44 | but my behaviour is important too by the way the best way to have someone accept |
---|
0:53:48 | what you offer |
---|
0:53:49 | is to nod yourself |
---|
0:53:50 | as you put that out as you put out your own request |
---|
0:53:54 | so the last one is that you guys |
---|
0:53:59 | were good listeners |
---|
0:54:01 | how do i create a crowd like you guys as good listeners |
---|
0:54:05 | i can do that from data |
---|
0:54:07 | i can look at each of you how you are reacting to the speaker |
---|
0:54:11 | and learn |
---|
0:54:12 | what are the most predictive ones |
---|
0:54:14 | and be able to eventually create my own listener |
---|
0:54:17 | these are the top four most predictive listener feedback features so if i pause |
---|
0:54:23 | you are likely to head nod |
---|
0:54:24 | that's not a surprise if i look at you you're likely to head nod after a |
---|
0:54:29 | little while often right away |
---|
0:54:31 | if i say a word like and the word and by itself is not a good |
---|
0:54:34 | predictor but if i'm in the middle of a sentence and then i pause and look at |
---|
0:54:39 | you |
---|
0:54:39 | you are really likely to give feedback |
---|
0:54:41 | so this is the power of multimodal and by the way if i don't look at you |
---|
0:54:46 | you are unlikely |
---|
0:54:47 | to head nod but not all of you guys are the same |
---|
0:54:50 | you are all a little bit different you are not all smiling at the same thing which |
---|
0:54:55 | i don't know why you should all be |
---|
0:54:58 | so |
---|
0:54:58 | for some of you i can learn a model for one person |
---|
0:55:02 | i can learn a model for another person |
---|
0:55:04 | and another person |
---|
0:55:06 | and then what i would like to do is find the prototypical groupings |
---|
0:55:11 | grouping |
---|
0:55:13 | latent variables again very much like that model selection |
---|
0:55:18 | again and again |
---|
0:55:20 | but here you will be grouping people you want to find what is common between people |
---|
0:55:24 | and what do you find |
---|
0:55:25 | you find that some people |
---|
0:55:27 | are driven by what is said the verbal ones and some are the warm ones |
---|
0:55:31 | the polite listeners even if i begin in french |
---|
0:55:35 | even if i say stupid things you will head nod just because i pause at |
---|
0:55:39 | the right time |
---|
0:55:40 | and some people will be visual they don't even care about listening |
---|
0:55:45 | and when i do this noun phrases turn out to be a good predictor |
---|
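The per-person models and prototypical groupings described above can be illustrated with a simple two-stage sketch: fit one small feedback model per listener, then cluster the per-listener weight vectors. The two listener "types" and all data below are synthetic assumptions for the sketch; the clustering stands in for the latent-variable model selection mentioned in the talk.

```python
# Illustrative sketch of prototypical listener grouping: learn one model per
# listener, then cluster listeners by their model weights. The two listener
# "types" and all data are synthetic assumptions for this sketch.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_listeners, n_moments = 12, 300

weights = []
for i in range(n_listeners):
    # Speaker cues per moment: [pause, gaze_at_listener, noun_phrase]
    X = rng.integers(0, 2, size=(n_moments, 3))
    if i < 6:   # "audio-driven" listeners: nod mostly after pauses
        p = 0.1 + 0.7 * X[:, 0]
    else:       # "visually-driven" listeners: nod mostly when looked at
        p = 0.1 + 0.7 * X[:, 1]
    y = rng.random(n_moments) < p
    weights.append(LogisticRegression().fit(X, y).coef_[0])

# Group listeners by how they weight the speaker cues.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    np.array(weights))
print(labels)
```

Clustering in weight space, rather than in raw behavior space, is what lets the grouping capture *what drives* each listener instead of merely how often they nod.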
0:55:50 | Okay, so I want to show work from Stacy Marsella here. This is a
---|
0:55:55 | really great demonstration of putting all this interpersonal dynamics in one video; I could
---|
0:56:01 | never have done better than that.
---|
0:56:02 | So what we did is this.
---|
0:56:04 | This is a movie, and we are only going to take the audio
---|
0:56:08 | track
---|
0:56:09 | and the text,
---|
0:56:10 | only the audio and the text,
---|
0:56:12 | and we are going to animate
---|
0:56:13 | the virtual humans here; we are going to make two of them.
---|
0:56:17 | One of them is going to be the speaker, so it needs speaking behavior based on
---|
0:56:21 | the speech:
---|
0:56:22 | given the speech, you want to know where the gaze is, what the head is
---|
0:56:26 | doing,
---|
0:56:27 | which facial expression; that is speaker behavior.
---|
0:56:30 | But we also want to predict the listener behaviors
---|
0:56:33 | directly from the speech of the speaker. So look at it;
---|
0:56:38 | it is beautiful,
---|
0:56:40 | and I hope you enjoy the movie.
---|
0:56:42 | (movie clip plays)
---|
0:57:27 | But this was all automatic, from the audio
---|
0:57:31 | and the visual, and some of it from the text only.
---|
0:57:34 | You get the vocal cues from the audio; you get the emotion.
---|
0:57:38 | So this is an example of putting everything together. These are some of the applications that
---|
0:57:44 | you can build,
---|
0:57:45 | bringing together the behavioral dynamics (not every smile is equal; we are going to model
---|
0:57:50 | that with latent variables), and you get the multimodal representation
---|
0:57:57 | and alignment and the fusion,
---|
0:57:59 | and then the interpersonal dynamics. So
---|
0:58:01 | with that, thank you for your attention.
---|
1:00:16 | okay |
---|
1:00:17 | so |
---|
1:00:18 | Let me answer the second one, and maybe the first one we can
---|
1:00:22 | discuss more.
---|
1:00:23 | About the second one, regarding modeling alignment: right now we are looking at alignment at
---|
1:00:29 | a really instantaneous level, so it is only a really small piece of the big problem of
---|
1:00:36 | alignment.
---|
1:00:36 | Right now we are only aligning
---|
1:00:38 | at a really short term.
---|
1:00:40 | I personally believe that the next
---|
1:00:43 | level
---|
1:00:48 | of alignment needs to be at the segment level, so you need to be able
---|
1:00:52 | to do segmentation
---|
1:00:54 | at the same time as you do the alignment. And to go back to the other
---|
1:00:59 | example that you mentioned,
---|
1:01:01 | when you look at mimicry, it is not instantaneous;
---|
1:01:05 | the classic example, I think, is four seconds or something like that. So the
---|
1:01:10 | problem is that temporal contingency: you need to model that, and I think,
---|
1:01:14 | right now, as I said, a lot of our models are short on memory,
---|
1:01:17 | and so we need the infrastructure
---|
1:01:20 | to be able to remember.
---|
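The temporal-contingency point in this answer (mimicry arriving after a delay of roughly four seconds, which memoryless models miss) can be illustrated by scanning lagged correlations between a speaker signal and a listener signal. The signals and the built-in four-second delay below are synthetic assumptions.

```python
# Illustrative sketch of temporal contingency: find the delay at which a
# listener's signal best mirrors the speaker's. The signals and the built-in
# four-second lag are synthetic assumptions for this sketch.
import numpy as np

rng = np.random.default_rng(2)
true_lag = 4  # seconds (one sample per second, for simplicity)

speaker = rng.random(200)
listener = np.empty_like(speaker)
listener[true_lag:] = speaker[:-true_lag] + 0.1 * rng.random(200 - true_lag)
listener[:true_lag] = rng.random(true_lag)

def lagged_corr(a, b, lag):
    """Correlation between a[t] and b[t + lag]."""
    return np.corrcoef(a[:-lag], b[lag:])[0, 1]

# Scan candidate lags and keep the one with the strongest correlation.
best = max(range(1, 11), key=lambda k: lagged_corr(speaker, listener, k))
print(f"estimated mimicry lag: {best} s")
```

An instantaneous (lag-zero) comparison would miss this relationship entirely, which is why the model needs enough memory to span the contingency window.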
1:01:21 | I think all the points you mention are wonderful; I agree with you. This is
---|
1:01:25 | why I am excited about this domain:
---|
1:01:27 | we actually have the building blocks there,
---|
1:01:30 | and I think we need to study the next steps. So,
---|
1:01:33 | thank you.
---|
1:01:35 | Okay, the one with the microphone, and then
---|
1:02:12 | Right, great question.
---|
1:02:13 | So right now, we try to work with the calibration of each speaker,
---|
1:02:20 | by having a baseline for each person,
---|
1:02:23 | but where we get more robust indicators:
---|
1:02:26 | what is the difference in how they react to positive stimuli
---|
1:02:29 | and how they react
---|
1:02:31 | to negative stimuli?
---|
1:02:34 | Looking at that delta
---|
1:02:35 | is the most informative,
---|
1:02:37 | because the delta is a little bit...
---|
1:02:39 | it is not completely independent of the user, but it is a lot less dependent
---|
1:02:43 | than just looking at how often they smile. How often they smile when it is positive, how often
---|
1:02:48 | they smile when it is negative:
---|
1:02:49 | that delta is more informative.
---|
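The delta idea in this answer (comparing smiling under positive versus negative stimuli instead of the raw smile rate) can be shown with a tiny numeric sketch; the two participants and their smile counts are made up for illustration.

```python
# Minimal sketch of speaker calibration via deltas: compare smiling under
# positive vs. negative stimuli rather than the raw smile rate. The two
# hypothetical participants and their data are made up for illustration.
def smile_delta(smiles_positive, smiles_negative):
    """Smile rate under positive stimuli minus the rate under negative ones."""
    rate_pos = sum(smiles_positive) / len(smiles_positive)
    rate_neg = sum(smiles_negative) / len(smiles_negative)
    return rate_pos - rate_neg

# One participant smiles a lot overall, the other very little, yet both
# differentiate positive from negative stimuli by the same amount.
expressive = smile_delta([1, 1, 1, 1, 0], [1, 1, 0, 0, 0])
reserved = smile_delta([1, 1, 0, 0, 0], [0, 0, 0, 0, 0])
print(expressive, reserved)
```

The raw smile rates differ a lot between the two (0.8 vs. 0.4 under positive stimuli), but the delta is the same, which is the sense in which it is "a lot less dependent" on the individual.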
1:02:52 | The other thing, if you ask me where this research is going: it is in
---|
1:02:56 | treatment.
---|
1:02:57 | And there,
---|
1:02:58 | what we are doing, working with Harvard Medical School,
---|
1:03:01 | is this: you see a schizophrenic patient at their worst,
---|
1:03:04 | you see the patient as they go through treatment, and at their best they go
---|
1:03:08 | back home.
---|
1:03:09 | You can create a beautiful patient profile of them at their best, and
---|
1:03:14 | then use that to monitor
---|
1:03:16 | their behavior after they go back.
---|
1:03:18 | And so the work we are putting forward with Harvard Medical School
---|
1:03:22 | is to be able to create these
---|
1:03:24 | profiles of people.
---|
1:03:25 | The word "profile" does not always sound good, so we call it a signature;
---|
1:03:28 | it sounds a little less Big Brother. But the idea is a profile of the patient.
---|
1:03:34 | so |
---|
1:03:36 | So, thank you all for your attention. Thank you.
---|