0:00:13 Good morning, everybody.
0:00:15 This paper is about tracking changes in continuous emotion states using body language and prosody. This is joint work with my colleagues at the University of Southern California, where this work was done.
0:00:30 So first there is a video I would like to show you, which presents an expressive interaction between two actors.
0:00:44 [video of the dyadic interaction plays]
0:01:37 Alright, so this was an example of an expressive interaction between two actors. You can see that through the course of time there is a continuous flow of body language and expressions that evolves, and at the same time the emotional state of the two actors also evolves through the course of time. We can see that it has variable intensity and clarity over the course of time.
0:01:58 So the focus of this work is, first, to examine what the emotional content of these body language gestures is: how are different body language gestures indicative of different underlying emotional states of the actors? And secondly, we want to use this body language and prosodic information in order to continuously track the evolving emotional state through the course of time.
0:02:22 The method we use in this paper poses the problem as a tracking problem: we are trying to track the underlying emotional state through time, and we assume that there is a mapping between that emotional state and the observed audio-visual cues. So, for each frame for which we observe audio-visual cues, we will try to predict the underlying emotional state.
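To make the setup concrete, here is a minimal formulation of the tracking problem in my own notation (the talk later denotes the emotional state by x and the observed features by y):

```latex
% x_t: continuous emotional attribute (e.g., activation) at time t
% y_t: observed audio-visual feature vector at time t
\hat{x}_t \;=\; \arg\max_{x_t}\; p\!\left(x_t \mid y_t\right), \qquad t = 1, \dots, T
```

The conditional p(x_t | y_t) comes from a model of the joint distribution of state and cues, which is described in more detail later in the talk.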
0:02:46 The database that we use for this work is the USC CreativeIT database; you just saw an example clip from this database. It is a multimodal and multidisciplinary database that was collected as a collaboration between the USC engineering department and the USC theatre school, and it consists of a variety of dyadic theatrical improvisations. We asked actors from the theatre school to come to our motion capture lab and perform improvisations and theatrical exercises, and at the same time we recorded them with motion capture: we placed markers on their bodies, as you can see.
0:03:26 So we recorded them with video cameras, motion capture, and close-talking microphones for the audio. More details about this database can be found at this website, but it is important to keep in mind that it contains a large variety of very expressive body language.
0:03:44 So now we want to have an annotation of the underlying emotional state of the two actors during their interaction. But we found that categorical emotion descriptors, such as angry, happy, or sad, are too restrictive and cannot adequately describe the variety of the actors' emotional states. So we went for dimensional emotion descriptors, which are widely used in the emotion research community. These descriptors are: activation, which describes how excited versus calm a person is; valence, which describes how positive versus negative the attitude of a person is; and dominance, which describes how dominant versus submissive a person is in an interaction.
0:04:26 Also, we did not want to artificially cut the recordings into arbitrary segments, because we did not want to interrupt this continuous flow of audio-visual cues. So we decided to go for continuous annotation, through the course of time, of these emotional descriptors.
0:04:42 So we use the Feeltrace instrument, which is a software tool that allows people, as they are watching a video, to give continuous annotations of valence, activation, and dominance. Now I also want to show another demo of the use of Feeltrace, with three of our annotators, as they are watching the video, giving a continuous rating of activation.
0:05:15 [video of the Feeltrace annotation demo plays]
0:05:30 So, as you can see, our three different annotators were giving a rating of the male actor over time. It seems reasonable: in the beginning the actor is talking a lot and moving around, whereas at the end he is turning around, going towards the end of the room, and not talking much, so basically the activation decreases. Of course, we can see differences between the different annotators, which brings us to the issue of how we define whether different annotators agree.
0:06:05 We noticed, by examining our data, that annotators tend to agree more on the trends: that is, they agree more on whether an emotional attribute is increasing, decreasing, or staying stable, rather than on its absolute value. This is expected, since for humans it seems to be easier to characterize emotion in relative terms, to say that something is more active or less active, rather than in absolute terms.
0:06:33 So, motivated by this, we decided to define evaluator agreement in this work as positive correlation between the different annotation curves. In this plot you can see an example of annotations: the blue lines are three different annotators rating the recording through the course of time, and the red line is the mean annotation, which will be used as ground truth for our experiments.
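As a rough sketch (not the authors' code), the agreement criterion and the mean ground-truth curve could be computed as below, assuming the annotation traces have already been resampled to a common time axis:

```python
import numpy as np

def annotator_agreement(traces):
    """traces: (num_annotators, num_frames) array of continuous ratings.
    Returns pairwise Pearson correlations, the mean annotation curve used as
    ground truth, and whether all pairwise correlations are positive."""
    traces = np.asarray(traces, dtype=float)
    n = traces.shape[0]
    pairwise = [np.corrcoef(traces[i], traces[j])[0, 1]
                for i in range(n) for j in range(i + 1, n)]
    ground_truth = traces.mean(axis=0)      # red curve in the slide
    agree = all(c > 0 for c in pairwise)    # agreement = positive correlation
    return pairwise, ground_truth, agree
```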
0:06:56 In this work we are using around thirty different instances of people interacting.
0:07:04 Now, moving on to the next building block, which is feature extraction: we decided to extract a variety of body language features that we wanted to be intuitive and inspired by psychology, so that they are well suited for analysis purposes. We have two different types of features. The first type is absolute features, which have to do with just one person: the body posture of that person and the movement of that person. The second type is relative features of one person with respect to his interlocutor, so they have more to do with whether someone is facing the other person, looking at the other person, or approaching or avoiding the other person. They are interesting because you need two people to extract these features, so they can carry some information about the interaction itself.
0:07:48 I have to add that the purpose of this analysis is to see how the behaviour changes in regions where the underlying emotional attribute either increases, decreases, or stays stable.
0:08:01 Specifically, the features were extracted from the motion capture data in an automatic, straightforward manner. You can see a list of some of the features that we use. For example, there are some absolute features, such as the hand velocity, and there are some relative features, such as the relative velocity of one actor with respect to the other actor.
0:08:25 From these low-level features we can derive some high-level behaviours. For example, take the relative velocity: when the velocity is positive, that means I am moving towards the other person; when the velocity is negative, it means I am moving away from the other person; and when it is very close to zero, I am not moving much. So, by thresholding this value, I can infer these high-level cues, that is, whether I approach or avoid the other person.
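A small illustrative sketch of this thresholding idea (the frame rate, threshold, and use of a single body position per actor are my assumptions, not details from the talk):

```python
import numpy as np

def approach_avoid(pos_self, pos_other, fps=60.0, thresh=0.05):
    """pos_self, pos_other: (num_frames, 3) positions of the two actors
    (e.g., a torso marker each). Returns one label per frame:
    +1 = approaching the interlocutor, -1 = moving away, 0 = roughly still."""
    pos_self, pos_other = np.asarray(pos_self), np.asarray(pos_other)
    direction = pos_other - pos_self                     # vector towards the interlocutor
    direction /= np.linalg.norm(direction, axis=1, keepdims=True) + 1e-8
    velocity = np.gradient(pos_self, 1.0 / fps, axis=0)  # this actor's velocity per frame
    rel_vel = np.sum(velocity * direction, axis=1)       # component towards the other actor
    labels = np.zeros(len(rel_vel), dtype=int)
    labels[rel_vel > thresh] = 1                         # positive -> approach
    labels[rel_vel < -thresh] = -1                       # negative -> avoid
    return labels
```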
0:08:52 Now, for the statistical analysis of these high-level behaviours, we use the ground truth, that is, the mean annotation over the different annotators, to select regions where we have confidence in the estimation; specifically, we select regions of increase, decrease, or stability of the emotional attribute. We then compare the body language behaviours across these different regions and check whether there are statistically significant differences in the body language behaviours.
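The talk does not say which statistical test is used; as one possible sketch (my assumption, not the authors' method), the per-frame behaviour values in "increase" regions could be compared against those in "decrease" regions with Welch's t-test:

```python
import numpy as np
from scipy import stats

def compare_regions(feature, labels):
    """feature: per-frame behaviour value (e.g., relative velocity towards the interlocutor).
    labels: per-frame ground-truth trend, one of 'inc', 'dec', 'stable'.
    Returns the t statistic and p-value comparing increase vs. decrease regions."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    inc = feature[labels == 'inc']
    dec = feature[labels == 'dec']
    t, p = stats.ttest_ind(inc, dec, equal_var=False)  # Welch's t-test, an assumed choice
    return t, p
```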
0:09:23 So we found meaningful results for the activation and dominance emotion attributes. As an example, when we compare regions of activation increase with regions of activation decrease, we notice that when a person's activation is increasing, he tends to look more towards the other person, to move more towards the other person, and to move his hands more towards the other person; when his activation is decreasing, he moves more away from the other person and keeps his hands more to himself.
0:09:49 When we look at dominance increase versus dominance decrease, we see that when a person's dominance is increasing, he tends to look at or towards the other person, while when his dominance is decreasing, he looks more away, or turns away from the other person. These results seem intuitive, and they are in agreement with some qualitative findings from psychology.
0:10:14 Unfortunately, we were not able to find such intuitive behaviours for the valence attribute, which, as we will see, is also reflected in our later results about tracking. Now let's move on to the tracking part; I also have a video here.
0:10:38 So our goal for the tracking experiment is to use the audio-visual features to track the underlying emotional state, which is the red curve that you see here, while the grey curve is our estimation of the underlying emotional state through the course of time; in this case it is the activation of the actor for one of our recordings.
0:11:09 In order to do this, we use a Gaussian mixture model based mapping that has been used, among other problems, for articulatory-to-speech inversion. It consists of finding a mapping between the continuous underlying emotional state at a specific time instant, which I denote by x, and the corresponding body language or prosodic feature vector at that time instant, which I denote by y. We model their joint distribution with a Gaussian mixture model. Therefore, the conditional distribution of the underlying emotional state given the observed audio-visual feature vector is also a Gaussian mixture model, and we obtain the optimal maximum likelihood mapping if we select the underlying emotional state that maximizes the conditional probability of the underlying emotional state given the observed feature vector. This is an iterative process, based on an expectation-maximization algorithm, and it converges to the maximum likelihood mapping.
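Written out in my own notation (following the standard GMM-based mapping the talk refers to, with superscripts denoting the x and y sub-blocks of each component's mean and covariance), the model and the resulting estimate are roughly:

```latex
% Joint model over emotional state x_t and audio-visual features y_t
p(x_t, y_t) = \sum_{m=1}^{M} w_m \,
  \mathcal{N}\!\left( \begin{bmatrix} x_t \\ y_t \end{bmatrix};
  \begin{bmatrix} \mu_m^{x} \\ \mu_m^{y} \end{bmatrix},
  \begin{bmatrix} \Sigma_m^{xx} & \Sigma_m^{xy} \\
                  \Sigma_m^{yx} & \Sigma_m^{yy} \end{bmatrix} \right)

% Conditional: also a GMM, each component a linear-Gaussian regression on y_t
p(x_t \mid y_t) = \sum_{m=1}^{M} P(m \mid y_t)\,
  \mathcal{N}\!\left( x_t;\;
  \mu_m^{x} + \Sigma_m^{xy} (\Sigma_m^{yy})^{-1} (y_t - \mu_m^{y}),\;
  \Sigma_m^{xx} - \Sigma_m^{xy} (\Sigma_m^{yy})^{-1} \Sigma_m^{yx} \right)

% Maximum likelihood estimate, obtained with iterative EM-style updates
\hat{x}_t = \arg\max_{x_t} \; p(x_t \mid y_t)
```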
0:12:16 In practice, we don't want to use information from a single time instant only; we also want to take temporal context into account. To do this, we use the first and second derivatives as well, so we augment the feature vector with these derivatives. This enables us to take temporal context into account and produce smoother emotional trajectory estimates.
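A minimal sketch of this augmentation step (assuming evenly sampled feature frames; not the authors' exact delta computation):

```python
import numpy as np

def add_deltas(features):
    """features: (num_frames, dim) audio-visual feature matrix.
    Appends first and second temporal derivatives to each frame."""
    delta = np.gradient(features, axis=0)   # first derivative over time
    delta2 = np.gradient(delta, axis=0)     # second derivative over time
    return np.concatenate([features, delta, delta2], axis=1)
```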
0:12:41 Now, we would like to use not only body language cues but also speech as part of our feature vector. However, while body language is relevant for the whole interaction, the audio cues are only relevant when the actor is actually speaking, and the actors do not speak all of the time. So we need a solution for this, and we decided to train two Gaussian mixture model mappings. The first mapping is trained on the frames where the actor is not speaking; this is the body-only GMM mapping, which uses only body features. The second is trained on the frames where the actor is speaking; this is the audio-body GMM mapping, which uses both body and speech features.
0:13:18 Then, at the testing stage, each test recording is cut into consecutive overlapping segments, and in each segment we apply the appropriate mapping, according to whether the actor is speaking or not speaking, in order to obtain the final curve estimate. The audio features that we are using are pitch, energy, and related vocal features.
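A sketch of the segment-wise switching logic described here, assuming the two trained mappings are available as callables and that the per-frame speaking flags come from some voice activity decision (segment length, hop, and the majority-vote rule are my assumptions):

```python
import numpy as np

def track_emotion(body_feats, audio_feats, speaking, body_map, audio_body_map,
                  seg_len=200, hop=100):
    """body_feats: (T, Db), audio_feats: (T, Da), speaking: (T,) boolean flags.
    body_map / audio_body_map: callables mapping a feature segment to an emotion
    curve of the same length. Cuts the recording into overlapping segments,
    applies the appropriate GMM mapping per segment, and averages the overlaps."""
    T = len(body_feats)
    estimate = np.zeros(T)
    counts = np.zeros(T)
    for start in range(0, T, hop):
        end = min(start + seg_len, T)
        if speaking[start:end].mean() > 0.5:   # actor mostly speaking in this segment
            feats = np.concatenate([body_feats[start:end], audio_feats[start:end]], axis=1)
            seg_est = audio_body_map(feats)    # body + speech mapping
        else:                                  # actor mostly silent: body features only
            seg_est = body_map(body_feats[start:end])
        estimate[start:end] += seg_est
        counts[start:end] += 1
    return estimate / np.maximum(counts, 1)   # average over overlapping segments
```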
0:13:42 So here are some of our tracking results. In this plot, the red line is the mean annotation, the blue line is the maximum likelihood trajectory estimated using only body features, and the green one is the estimate when using body and speech features together. This is one of our best results: we achieve a correlation between the ground truth and our estimate of the order of 0.8, which increases to 0.81 when we also use the audio information.
0:14:13 The second plot is a more typical result: the correlation between the ground truth and the blue line, which uses only body features, is around 0.4, and when we also use the audio features it increases significantly, to around 0.5. In general, what we can see from these two plots is that the two trajectories follow the trends of the underlying emotional state rather than its absolute values. So we can argue that we can track emotional changes better than absolute emotional values, and this is in part explained by the fact that it is easier to quantify and annotate emotion in relative terms rather than in absolute terms.
0:14:54 Now, moving on to our overall results: we evaluate the performance of tracking changes by measuring the correlation between the ground truth and the estimated curve. As an upper bound, to show the difficulty of the problem, we also report the inter-annotator correlation between the annotators for each recording.
0:15:20 So these are our overall results. For the activation case, the median correlation between the ground truth and the body-only MLE mapping is 0.31; when we use the audio-visual mapping the median correlation increases to 0.42, which is a significant increase. The inter-annotator correlations are around 0.55, so one could argue that our tracking is comparable to the annotations performed by a human.
0:15:50 For dominance, our results are lower: the body-only and the audio-visual mappings perform similarly, at roughly 0.26 and 0.33, while the inter-annotator correlations are around 0.47.
0:16:04 For the valence case, we were not able to track the changes; we have a median correlation around zero. This is in agreement with our statistical analysis, where we were not able to show that valence is meaningfully reflected in our features.
0:16:22 So this leads us to a discussion of how observable the underlying emotional states that we are trying to track are, given the features that we extract. It seems from our results that we are able to capture the activation changes of an actor through the course of time, and some of the dominance changes, but we are not able to track the valence changes. This might mean that body language and prosody are more informative about the activation and dominance states than about the valence state; the valence state may instead be expressed through other modalities, such as facial expressions or lexical content. It could also be the case that we need to do more detailed feature extraction, specifically tailored to the valence attribute, rather than using the same feature set for all three emotional attributes.
0:17:14 This is one direction for future work.
0:17:16 We also note that the use of prosodic cues greatly benefits activation tracking; the fact that vocal cues are informative features for activation has already been observed in the emotion literature. And finally, our overall conclusion is that with this framework we can track emotional changes rather than absolute emotion values.
0:17:41 As for future work, we would like to focus on improving our features by extracting features specifically tailored to each emotional attribute that we track. We would also like to improve our data annotation process, so as to achieve higher inter-evaluator agreement and therefore a more consistent ground truth. And, as a longer-term goal, we want to move towards continuous monitoring of emotional state, which would enable us to find, through time, regions where we have, for example, an increase or decrease of the activation of a person; that is, to find regions where something interesting happens in the interaction, or, as we could say, the emotionally salient regions of the interaction.
0:18:26 These are the references that were used for this presentation. Thank you for your attention.