0:00:13 | Good morning everybody. |
---|
0:00:15 | This paper is about tracking changes in continuous emotion states |
---|
0:00:20 | using body language and prosody |
---|
0:00:22 | This is joint work with my colleagues at the University of Southern California, where this work was done. |
---|
0:00:29 | So, there is a video |
---|
0:00:34 | which shows an expressive interaction between two actors. |
---|
0:00:37 | [video plays: an expressive interaction between two actors] |
---|
0:01:37 | Alright, so this was an example of an expressive interaction between two actors. |
---|
0:01:41 | You can see that through the course of time there is a continuous flow of verbal and body expressions that unfolds. |
---|
0:01:48 | At the same time, the emotional state of the two actors evolves through the course of time, |
---|
0:01:53 | and we can see that it has variable intensity and clarity through the course of time. |
---|
0:01:58 | So the focus of this work is, |
---|
0:02:01 | first, to examine the emotional content of these body language gestures, |
---|
0:02:06 | and how different body language gestures are indicative of different underlying emotional states of the actor; |
---|
0:02:12 | and secondly, we want to use this body language and prosodic information in order to continuously track the unfolding emotional changes through the course of time. |
---|
0:02:22 | In this paper we pose the problem as a tracking problem: we are trying to track the underlying emotional state through time. |
---|
0:02:32 | We assume that there is a mapping between the emotional state and the observed audiovisual cues, |
---|
0:02:38 | so from the observed audiovisual cues we will try to track the underlying emotional state. |
---|
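To make this setup concrete, here is the tracking formulation in symbols; the x/y notation matches what the talk introduces later, and the rest is our own shorthand:

```latex
% x_t: underlying emotional state (e.g., activation) at time instant t
% y_t: observed audiovisual feature vector at time instant t
% Tracking = estimating the state that best explains each observation:
\hat{x}_t = \arg\max_{x_t} \, p\left(x_t \mid y_t\right), \qquad t = 1, \dots, T
```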
0:02:46 | The database that we use for this work is the USC CreativeIT database; |
---|
0:02:51 | you just saw an example taken from this database. |
---|
0:02:54 | It is a multimodal and multidisciplinary database that was collected as a collaboration between the USC engineering department and the USC Theatre School. |
---|
0:03:03 | It consists of a variety of dyadic theatrical improvisations. |
---|
0:03:08 | We asked actors from the Theatre School to come to our motion capture lab |
---|
0:03:14 | and play improvised theatrical exercises; at the same time, we recorded them with motion capture, placing markers on their bodies, as you can see. |
---|
0:03:24 | So we recorded them with video cameras, motion capture, and close-up audio from microphones. |
---|
0:03:31 | More details about this database can be found at this website, |
---|
0:03:35 | but it is very important to keep in mind that it contains a large variety of very expressive body language expressions. |
---|
0:03:44 | So now we want to have an annotation of the underlying emotional state of the two actors during their interaction. |
---|
0:03:51 | But we found that categorical emotion descriptors, such as angry, happy, or sad, are too restrictive to adequately describe the actors' emotional states. |
---|
0:04:03 | So we went for dimensional emotional descriptors, which are widely used in the emotion research community. These descriptors are: |
---|
0:04:10 | activation, which describes how excited versus calm a person is; |
---|
0:04:16 | valence, which describes how positive versus negative the attitude of a person is; |
---|
0:04:20 | and dominance, which describes how dominant versus submissive a person is in an interaction. |
---|
0:04:26 | Also, we didn't want to chop the recordings into arbitrary segments, because we did not want to interrupt this continuous flow of audiovisual cues. |
---|
0:04:36 | So we decided to go for a continuous annotation of these emotional descriptors through the course of time. |
---|
0:04:42 | We used the Feeltrace instrument, which is a software that allows people, as they watch a video, to give continuous annotations of valence, activation, and dominance. |
---|
0:04:52 | Now I also want to show another demo of the use of Feeltrace, with three different annotators, as they are watching the video, giving a rating of the activation. |
---|
0:05:15 | [video plays: three annotators rate the actor's activation with Feeltrace] |
---|
0:05:30 | So, as you can see, three different people were giving a rating of the male actor over time. |
---|
0:05:35 | It seems reasonable, since in the beginning the actor is talking a lot and moving around, |
---|
0:05:40 | whereas at the end he is turning around, going to the end of the room, and not talking much. |
---|
0:05:45 | So basically the rating decreases. Of course, we can see differences between the different annotators, |
---|
0:05:50 | which brings us to our next issue: how do we define that different annotators agree? |
---|
0:06:05 | So we noticed, by examining our data, |
---|
0:06:08 | that annotators agree more in terms of correlation, on the trends: |
---|
0:06:12 | that is, they agree more on whether an emotional attribute is increasing, decreasing, or |
---|
0:06:17 | staying stable, rather than on the absolute values of the emotional attribute. This is expected, since for humans |
---|
0:06:23 | it seems to be easier to characterize emotion in relative terms, say that something is more active or less active, |
---|
0:06:30 | rather than in absolute terms. |
---|
0:06:32 | So, motivated by this, we decided to define evaluator agreement in this work as positive correlation between the different annotation curves. |
---|
0:06:41 | In this plot you can see an example of annotations: the blue lines are three different annotators rating the recording through the course of time, and the red line is the mean annotation, |
---|
0:06:52 | which will be used as ground truth for our experiments. |
---|
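As a minimal sketch of this agreement definition and the ground-truth construction, assuming each annotator's Feeltrace curve has been resampled onto a common time grid (the array names and shapes here are ours):

```python
import numpy as np

def pairwise_correlations(curves):
    """Pearson correlation between every pair of annotation curves.

    curves: (num_annotators, num_frames) array of continuous ratings.
    Agreement, as defined in this work, means these are all positive.
    """
    rhos = []
    for i in range(curves.shape[0]):
        for j in range(i + 1, curves.shape[0]):
            rhos.append(np.corrcoef(curves[i], curves[j])[0, 1])
    return rhos

# Example with three hypothetical annotators over 500 frames:
curves = np.cumsum(np.random.randn(3, 500), axis=1)  # placeholder curves
print(pairwise_correlations(curves))
ground_truth = curves.mean(axis=0)  # mean annotation used as ground truth
```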
0:06:56 | In this work we are using around thirty different instances of people interacting. |
---|
0:07:04 | Now, moving on to the body language feature extraction: |
---|
0:07:07 | we decided to extract a variety of body language features |
---|
0:07:12 | that we wanted to be intuitive and inspired from psychology, so that they are well suited for analysis purposes. |
---|
0:07:18 | So we have two different types of features. One is absolute features, |
---|
0:07:22 | which have to do with just one person: the body posture of that person and the movement of that person. |
---|
0:07:27 | The second type is relative features of one person with respect to his interlocutor. |
---|
0:07:32 | So they have to do more with things |
---|
0:07:35 | like leaning towards the other person, looking at the other person, or approaching or avoiding the other person. |
---|
0:07:40 | They are interesting because |
---|
0:07:42 | you need both people to extract these features, so they may carry some information about the actors' interaction. |
---|
0:07:48 | I have to add that the purpose of this statistical analysis is to see how the behaviour changes |
---|
0:07:54 | in regions where the underlying emotional attribute either increases, decreases, or stays stable. |
---|
0:08:01 | Specifically, the features were extracted from the motion capture markers |
---|
0:08:08 | in an automatic fashion, in a straightforward manner. |
---|
0:08:11 | You can see a list of some of the features that we used. |
---|
0:08:14 | For example, there are some absolute features, |
---|
0:08:16 | such as the hand velocity, |
---|
0:08:18 | and there are some relative features, such as the relative velocity of one actor with respect to the other actor. |
---|
0:08:24 | From these low-level features we can infer some high-level behaviours. |
---|
0:08:31 | For example, take the relative velocity: if the velocity is positive, that means I am moving |
---|
0:08:37 | towards the other person; if the velocity is negative, that means I am moving away from the other person; and if it is very |
---|
0:08:42 | close to zero, I am not moving much. |
---|
0:08:44 | So by thresholding this value, I can infer these high-level behaviours, |
---|
0:08:49 | that is, whether I am approaching or withdrawing from the other person, as in the sketch below. |
---|
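A minimal sketch of this thresholding step; the dead-zone width and the behaviour labels are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def velocity_to_behavior(rel_vel, eps=0.05):
    """Map a signed relative-velocity signal to coarse behaviours.

    rel_vel: 1-D array; positive values mean moving towards the
             interlocutor, negative mean moving away, and values
             within +/- eps mean not moving much.
    eps:     dead-zone threshold (illustrative value).
    """
    labels = np.full(rel_vel.shape, "still", dtype=object)
    labels[rel_vel > eps] = "approach"
    labels[rel_vel < -eps] = "withdraw"
    return labels

print(velocity_to_behavior(np.array([0.20, -0.12, 0.01])))
# -> ['approach' 'withdraw' 'still']
```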
0:08:52 | Now, for the statistical analysis of these high-level behaviours, |
---|
0:08:57 | we use the ground truth to select regions |
---|
0:09:00 | where we are confident about the mean estimation from the different annotators, |
---|
0:09:05 | and we select regions of increase, decrease, or stability of the emotional attribute. |
---|
0:09:11 | Then we statistically compare the body language behaviours in these different regions, |
---|
0:09:17 | and we check whether there are statistically significant differences in the body language behaviours. |
---|
0:09:23 | We find meaningful results for the activation and dominance emotional attributes. |
---|
0:09:28 | As an example, when we are comparing regions of activation increase with regions of activation decrease, |
---|
0:09:34 | we notice that when a person's activation is increasing, he tends to walk more towards the other person |
---|
0:09:38 | and move his hands more towards the other person; |
---|
0:09:42 | when his activation is decreasing, he moves more away from the other person and keeps his hands to himself more. |
---|
0:09:49 | When we are looking at dominance increase versus dominance decrease, |
---|
0:09:53 | we see that when a person is increasing his dominance, he tends to look at the other person or turn towards the other person, |
---|
0:09:59 | while when he is decreasing his dominance, he looks away or heads away from the other person more. |
---|
0:10:08 | These results seem intuitive, and they are in agreement with some qualitative results from psychology. |
---|
0:10:14 | Unfortunately, we were not able to find such intuitive behaviours for the valence attribute, |
---|
0:10:20 | which is also verified in our later results on tracking. |
---|
0:10:24 | And now let us go to the tracking part. |
---|
0:10:27 | I also have a video here. |
---|
0:10:38 | So our goal for the tracking experiment is to use the audiovisual features to track the underlying emotional state, which is the red curve that you see here. |
---|
0:10:49 | The grey curve is our estimation of the underlying emotional state through the course of time; in this case, it is the activation of the actor for one of our recordings. |
---|
0:10:58 | So, in order to do this, |
---|
0:11:09 | we use the Gaussian mixture model based mapping that has been used, among others, for the problem of acoustic-to-articulatory speech inversion. |
---|
0:11:19 | It consists of finding a mapping |
---|
0:11:22 | between the continuous underlying emotional state at a specific time instant, which is denoted by x, |
---|
0:11:29 | and the corresponding body language or prosodic feature vector at the same time instant, which I denote by y. |
---|
0:11:37 | We model their joint distribution with a Gaussian mixture model. |
---|
0:11:42 | Therefore, |
---|
0:11:44 | the conditional distribution of the underlying emotional state given the observed audiovisual feature vector is also going to be a Gaussian mixture model. |
---|
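In symbols, this is the standard Gaussian conditioning identity; the block notation below is ours:

```latex
p(x_t, y_t) = \sum_{k=1}^{K} w_k\,
  \mathcal{N}\!\left(\begin{bmatrix} x_t \\ y_t \end{bmatrix};
  \begin{bmatrix} \mu_k^{x} \\ \mu_k^{y} \end{bmatrix},
  \begin{bmatrix} \Sigma_k^{xx} & \Sigma_k^{xy} \\
                  \Sigma_k^{yx} & \Sigma_k^{yy} \end{bmatrix}\right)
\quad\Longrightarrow\quad
p(x_t \mid y_t) = \sum_{k=1}^{K} \beta_k(y_t)\,
  \mathcal{N}\!\left(x_t;\;
  \mu_k^{x} + \Sigma_k^{xy}\bigl(\Sigma_k^{yy}\bigr)^{-1}\bigl(y_t - \mu_k^{y}\bigr),\;
  \Sigma_k^{xx} - \Sigma_k^{xy}\bigl(\Sigma_k^{yy}\bigr)^{-1}\Sigma_k^{yx}\right)
```

where the responsibilities satisfy \(\beta_k(y_t) \propto w_k\,\mathcal{N}(y_t;\,\mu_k^{y},\,\Sigma_k^{yy})\).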
0:11:53 | Therefore, we obtain the optimal maximum likelihood mapping |
---|
0:11:59 | if we select the underlying emotional state that maximizes the conditional probability of the emotional state |
---|
0:12:05 | given the observed feature vector. |
---|
0:12:08 | This is an iterative process, |
---|
0:12:10 | through an expectation-maximization algorithm, |
---|
0:12:13 | and it converges to the maximum likelihood mapping. |
---|
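A minimal sketch of such a GMM-based mapping, assuming scikit-learn's GaussianMixture for the joint model. For brevity it returns the responsibility-weighted conditional mean (the closed-form minimum-mean-squared-error variant) rather than iterating to the exact maximum-likelihood trajectory:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_components=8):
    """Fit a GMM on joint vectors z_t = [x_t; y_t] (state + features)."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="full").fit(np.hstack([X, Y]))

def estimate_trajectory(gmm, Y, dim_x):
    """Responsibility-weighted conditional mean E[x_t | y_t] per frame."""
    n = Y.shape[0]
    resp = np.zeros((n, gmm.n_components))
    cond = np.zeros((n, gmm.n_components, dim_x))
    for k in range(gmm.n_components):
        mu, S = gmm.means_[k], gmm.covariances_[k]
        mu_x, mu_y = mu[:dim_x], mu[dim_x:]
        S_xy, S_yy = S[:dim_x, dim_x:], S[dim_x:, dim_x:]
        # responsibility of component k for each observed feature vector
        resp[:, k] = gmm.weights_[k] * multivariate_normal.pdf(Y, mu_y, S_yy)
        # Gaussian conditioning: E[x | y, k] = mu_x + S_xy S_yy^{-1} (y - mu_y)
        cond[:, k, :] = mu_x + (Y - mu_y) @ np.linalg.solve(S_yy, S_xy.T)
    resp /= resp.sum(axis=1, keepdims=True)
    return (resp[:, :, None] * cond).sum(axis=1)   # (frames, dim_x)
```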
0:12:17 | In practice, |
---|
0:12:18 | we don't just want to use information about a single time instant; |
---|
0:12:21 | we want to take into account temporal context. |
---|
0:12:24 | To do this, we also use the first and second derivatives, so we augment the feature vector with these delta features. |
---|
0:12:31 | This enables us to take this temporal context into account |
---|
0:12:34 | and produce smoother emotional trajectory estimates. |
---|
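A minimal sketch of the delta augmentation; the exact derivative estimator (here a simple central difference via numpy.gradient) is our assumption:

```python
import numpy as np

def add_deltas(F):
    """Augment a (frames, dims) feature matrix with its first and
    second temporal derivatives (delta and delta-delta features)."""
    d1 = np.gradient(F, axis=0)     # first derivative over time
    d2 = np.gradient(d1, axis=0)    # second derivative over time
    return np.hstack([F, d1, d2])   # (frames, 3 * dims)
```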
0:12:41 | Now, we would like to use not only body language cues but also speech |
---|
0:12:45 | in our feature vector. However, it seems that |
---|
0:12:48 | body language is relevant for the whole interaction, but audio cues |
---|
0:12:52 | are only relevant when the actor is actually speaking, which is not all of the time. |
---|
0:12:58 | We needed to find a solution for this, so we decided to |
---|
0:13:01 | train two Gaussian mixture model mappings. |
---|
0:13:04 | One mapping is trained on frames where the actor is not speaking; |
---|
0:13:07 | this is the body-only GMM mapping, using only body features. |
---|
0:13:11 | The second is trained on frames where the actor is speaking; |
---|
0:13:14 | this is the audiovisual GMM mapping, using both body and speech features. |
---|
0:13:18 | So at the testing stage, we chop each test recording into consecutive overlapping segments, |
---|
0:13:25 | and we apply the appropriate mapping according to whether |
---|
0:13:29 | the actor is speaking or not speaking, |
---|
0:13:31 | in order to obtain the final curve estimation. |
---|
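A minimal sketch of this switching logic, reusing estimate_trajectory from the earlier sketch; the frame-level voice-activity flags and mapping objects are assumed to be given:

```python
import numpy as np

def track_with_switching(body, speech, is_speaking, body_gmm, av_gmm, dim_x=1):
    """Apply the body-only or audiovisual mapping depending on speech activity.

    body:        (frames, body_dims) body language features
    speech:      (frames, speech_dims) prosodic features
    is_speaking: (frames,) boolean voice-activity flags (assumed given)
    """
    estimate = np.zeros((body.shape[0], dim_x))
    silent = np.where(~is_speaking)[0]
    talking = np.where(is_speaking)[0]
    if silent.size:       # body-only GMM mapping when the actor is silent
        estimate[silent] = estimate_trajectory(body_gmm, body[silent], dim_x)
    if talking.size:      # audiovisual GMM mapping when the actor speaks
        av = np.hstack([body[talking], speech[talking]])
        estimate[talking] = estimate_trajectory(av_gmm, av, dim_x)
    return estimate
```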
0:13:35 | The audio features that we are using are pitch and energy based features. |
---|
0:13:43 | So, here are some of our tracking results. |
---|
0:13:46 | You can see in this plot that |
---|
0:13:48 | the red line is the mean annotation, |
---|
0:13:51 | the blue line is the maximum likelihood estimated trajectory using only body features, and the green one is when using body and speech features together. |
---|
0:14:00 | This is one of our best results: we achieved a correlation between the ground truth |
---|
0:14:06 | and our estimation of the order of 0.8, which increased further |
---|
0:14:10 | when we also used the audio information. |
---|
0:14:13 | The second is a more median result: |
---|
0:14:16 | we have a correlation between the ground truth and the blue line, which uses only body features, |
---|
0:14:22 | of around 0.4; |
---|
0:14:24 | when we also use the audio features, we increase significantly, to around 0.5. |
---|
0:14:30 | In general, what we can see from these two plots is that the two trajectories |
---|
0:14:35 | follow the trends of the underlying emotional state rather than the absolute values. |
---|
0:14:40 | So we can argue that we can track emotional changes |
---|
0:14:43 | rather than absolute emotional values, |
---|
0:14:46 | and this is in part explained by the fact that it is easier to quantify or describe emotion |
---|
0:14:51 | in relative terms rather than absolute terms. |
---|
0:14:56 | So now, moving on to the overall results: |
---|
0:15:00 | we evaluate the performance |
---|
0:15:02 | of tracking changes by measuring the correlation between the ground truth and the estimated curve that we produce. |
---|
0:15:09 | And as an upper bound, to show the difficulty of the problem, |
---|
0:15:13 | we also show the inter-annotator correlation |
---|
0:15:16 | for each specific recording, between the annotators. |
---|
0:15:20 | So these are our results. |
---|
0:15:23 | For the activation case, the median correlation between the ground truth and the body-only MLE mapping is 0.31; |
---|
0:15:34 | when we use the audiovisual mapping, the median correlation increases to 0.42, which is a significant increase. |
---|
0:15:40 | The inter-annotator correlations are 0.55, so one could argue that our performance is comparable to that of a human annotator. |
---|
0:15:50 | For dominance, our results are lower: |
---|
0:15:53 | the body-only and audiovisual mappings perform similarly, around 0.26 to 0.33, |
---|
0:16:00 | while the inter-annotator correlations are 0.47. |
---|
0:16:05 | For the valence case, |
---|
0:16:07 | we were not able to track the changes: |
---|
0:16:09 | we have a median correlation around zero, and this is in agreement with our statistical analysis results, |
---|
0:16:14 | where we were not able to show |
---|
0:16:16 | that valence is meaningfully reflected through our features. |
---|
0:16:23 | So this leads us to a discussion of how observable the underlying emotional states that we are trying to track are, |
---|
0:16:30 | given the features that we extracted. |
---|
0:16:32 | It seems from our results that we are able to capture the activation changes of an actor through the course of time, |
---|
0:16:38 | and some of the dominance changes, |
---|
0:16:40 | but we are not able to track the valence changes. |
---|
0:16:43 | This might mean that body language and prosody may be more informative about the activation and dominance states |
---|
0:16:50 | than the valence state. |
---|
0:16:51 | The valence state may be expressed through other modalities, such as facial expressions |
---|
0:16:56 | or lexical content. |
---|
0:16:58 | Also, it could be the case that we need to do more detailed feature extraction, specifically tailored |
---|
0:17:05 | for the valence attribute, rather than using the same feature set for all three emotional attributes. |
---|
0:17:14 | This is one direction of our future work. |
---|
0:17:16 | We also note that the use of prosodic cues greatly benefits activation tracking; |
---|
0:17:24 | the fact that vocal cues are |
---|
0:17:27 | an informative feature for activation has already been observed in the emotion literature. |
---|
0:17:31 | And finally, our overall conclusion is that we can track emotional changes rather than |
---|
0:17:37 | absolute emotional values with this framework. |
---|
0:17:41 | In terms of future work, we would like to focus on improving our features by extracting |
---|
0:17:46 | specific features for each emotional attribute that we track. |
---|
0:17:49 | Also, we would like to improve the data annotation process, for example by training annotators |
---|
0:17:54 | to achieve higher evaluator agreement, in order to have a more consistent ground truth. |
---|
0:17:59 | And as a longer-term goal, we want to work towards real-life uses of emotional state monitoring, |
---|
0:18:06 | which would enable us to examine the estimated curves over time and find |
---|
0:18:10 | regions where we have a big increase or decrease of, for example, the activation of a person; |
---|
0:18:16 | that is, to find regions where something interesting happens, or, as we could say, |
---|
0:18:21 | the emotionally salient regions of the interaction. |
---|
0:18:26 | These are the references that were used for this presentation. |
---|
0:18:30 | Thank you for your attention. |
---|
0:18:35 | [Q&A: audience question, inaudible] |
---|