0:00:13 | Good morning everybody. |
---|
0:00:15 | This paper is about tracking changes in continuous emotion states |
---|
0:00:20 | using body language and prosody |
---|
0:00:22 | This is joint work with my colleagues at the University of Southern California, where this work was done. |
---|
0:00:29 | So, there is a video |
---|
0:00:34 | which shows an expressive interaction between two actors. |
---|
0:00:37 | [video plays: an expressive interaction between two actors] |
---|
0:01:37 | Alright, so this was an example of an expressive interaction between two actors. |
---|
0:01:41 | You can see that through the course of time there is a continuous flow of verbal and body expressions that unfolds. |
---|
0:01:48 | At the same time, the emotional state of the two actors evolves through the course of time, |
---|
0:01:53 | and we can see that it has variable intensity and clarity through the course of time. |
---|
0:01:58 | So the focus of this work is, |
---|
0:02:01 | first, to examine the emotional content of these body language gestures, |
---|
0:02:06 | and how different body language gestures are indicative of different underlying emotional states of the actor; |
---|
0:02:12 | and secondly, we want to use this body language and prosodic information in order to continuously track the unfolding emotional changes through the course of time. |
---|
0:02:22 | In this paper we pose the problem as a tracking problem: we are trying to track the underlying emotional state through time. |
---|
0:02:32 | We assume that there is a mapping between the emotional state and the observed audiovisual cues, |
---|
0:02:38 | so from the observed audiovisual cues we will try to track the underlying emotional state. |
---|
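To make this setup concrete, here is the tracking formulation in symbols; the x/y notation matches what the talk introduces later, and the rest is our own shorthand:

```latex
% x_t: underlying emotional state (e.g., activation) at time instant t
% y_t: observed audiovisual feature vector at time instant t
% Tracking = estimating the state that best explains each observation:
\hat{x}_t = \arg\max_{x_t} \, p\left(x_t \mid y_t\right), \qquad t = 1, \dots, T
```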
0:02:46 | The database that we use for this work is the USC CreativeIT database; |
---|
0:02:51 | you just saw an example taken from this database. |
---|
0:02:54 | It is a multimodal and multidisciplinary database that was collected as a collaboration between the USC engineering department and the USC Theatre School. |
---|
0:03:03 | It consists of a variety of dyadic theatrical improvisations. |
---|
0:03:08 | We asked actors from the Theatre School to come to our motion capture lab |
---|
0:03:14 | and play improvised theatrical exercises; at the same time, we recorded them with motion capture, placing markers on their bodies, as you can see. |
---|
0:03:24 | So we recorded them with video cameras, motion capture, and close-up audio from microphones. |
---|
0:03:31 | More details about this database can be found at this website, |
---|
0:03:35 | but it is very important to keep in mind that it contains a large variety of very expressive body language expressions. |
---|
0:03:44 | So now we want to have an annotation of the underlying emotional state of the two actors during their interaction. |
---|
0:03:51 | But we found that categorical emotion descriptors, such as angry, happy, or sad, are too restrictive to adequately describe the actors' emotional states. |
---|
0:04:03 | So we went for dimensional emotional descriptors, which are widely used in the emotion research community. These descriptors are: |
---|
0:04:10 | activation, which describes how excited versus calm a person is; |
---|
0:04:16 | valence, which describes how positive versus negative the attitude of a person is; |
---|
0:04:20 | and dominance, which describes how dominant versus submissive a person is in an interaction. |
---|
0:04:26 | Also, we didn't want to chop the recordings into arbitrary segments, because we did not want to interrupt this continuous flow of audiovisual cues. |
---|
0:04:36 | So we decided to go for a continuous annotation of these emotional descriptors through the course of time. |
---|
0:04:42 | We used the Feeltrace instrument, which is a software that allows people, as they watch a video, to give continuous annotations of valence, activation, and dominance. |
---|
0:04:52 | Now I also want to show another demo of the use of Feeltrace, with three different annotators, as they are watching the video, giving a rating of the activation. |
---|
0:05:15 | [video plays: three annotators rate the actor's activation with Feeltrace] |
---|
0:05:30 | So, as you can see, three different people were giving a rating of the male actor over time. |
---|
0:05:35 | It seems reasonable, since in the beginning the actor is talking a lot and moving around, |
---|
0:05:40 | whereas at the end he is turning around, going to the end of the room, and not talking much. |
---|
0:05:45 | So basically the rating decreases. Of course, we can see differences between the different annotators, |
---|
0:05:50 | which brings us to our next issue: how do we define that different annotators agree? |
---|
0:06:05 | So we noticed, by examining our data, |
---|
0:06:08 | that annotators agree more in terms of correlation, on the trends: |
---|
0:06:12 | that is, they agree more on whether an emotional attribute is increasing, decreasing, or |
---|
0:06:17 | staying stable, rather than on the absolute values of the emotional attribute. This is expected, since for humans |
---|
0:06:23 | it seems to be easier to characterize emotion in relative terms, say that something is more active or less active, |
---|
0:06:30 | rather than in absolute terms. |
---|
0:06:32 | So, motivated by this, we decided to define evaluator agreement in this work as positive correlation between the different annotation curves. |
---|
0:06:41 | In this plot you can see an example of annotations: the blue lines are three different annotators rating the recording through the course of time, and the red line is the mean annotation, |
---|
0:06:52 | which will be used as ground truth for our experiments. |
---|
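As a minimal sketch of this agreement definition and the ground-truth construction, assuming each annotator's Feeltrace curve has been resampled onto a common time grid (the array names and shapes here are ours):

```python
import numpy as np

def pairwise_correlations(curves):
    """Pearson correlation between every pair of annotation curves.

    curves: (num_annotators, num_frames) array of continuous ratings.
    Agreement, as defined in this work, means these are all positive.
    """
    rhos = []
    for i in range(curves.shape[0]):
        for j in range(i + 1, curves.shape[0]):
            rhos.append(np.corrcoef(curves[i], curves[j])[0, 1])
    return rhos

# Example with three hypothetical annotators over 500 frames:
curves = np.cumsum(np.random.randn(3, 500), axis=1)  # placeholder curves
print(pairwise_correlations(curves))
ground_truth = curves.mean(axis=0)  # mean annotation used as ground truth
```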
0:06:56 | In this work we are using around thirty different instances of people interacting. |
---|
0:07:04 | Now, moving on to the body language feature extraction: |
---|
0:07:07 | we decided to extract a variety of body language features |
---|
0:07:12 | that we wanted to be intuitive and inspired from psychology, so that they are well suited for analysis purposes. |
---|
0:07:18 | So we have two different types of features. One is absolute features, |
---|
0:07:22 | which have to do with just one person: the body posture of that person and the movement of that person. |
---|
0:07:27 | The second type is relative features of one person with respect to his interlocutor. |
---|
0:07:32 | So they have to do more with things |
---|
0:07:35 | like leaning towards the other person, looking at the other person, or approaching or avoiding the other person. |
---|
0:07:40 | They are interesting because |
---|
0:07:42 | you need both people to extract these features, so they may carry some information about the actors' interaction. |
---|
0:07:48 | I have to add that the purpose of this statistical analysis is to see how the behaviour changes |
---|
0:07:54 | in regions where the underlying emotional attribute either increases, decreases, or stays stable. |
---|
0:08:01 | Specifically, the features were extracted from the motion capture markers |
---|
0:08:08 | in an automatic fashion, in a straightforward manner. |
---|
0:08:11 | You can see a list of some of the features that we used. |
---|
0:08:14 | For example, there are some absolute features, |
---|
0:08:16 | such as the hand velocity, |
---|
0:08:18 | and there are some relative features, such as the relative velocity of one actor with respect to the other actor. |
---|
0:08:24 | From these low-level features we can infer some high-level behaviours. |
---|
0:08:31 | For example, take the relative velocity: if the velocity is positive, that means I am moving |
---|
0:08:37 | towards the other person; if the velocity is negative, that means I am moving away from the other person; and if it is very |
---|
0:08:42 | close to zero, I am not moving much. |
---|
0:08:44 | So by thresholding this value, I can infer these high-level behaviours, |
---|
0:08:49 | that is, whether I am approaching or withdrawing from the other person, as in the sketch below. |
---|
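A minimal sketch of this thresholding step; the dead-zone width and the behaviour labels are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def velocity_to_behavior(rel_vel, eps=0.05):
    """Map a signed relative-velocity signal to coarse behaviours.

    rel_vel: 1-D array; positive values mean moving towards the
             interlocutor, negative mean moving away, and values
             within +/- eps mean not moving much.
    eps:     dead-zone threshold (illustrative value).
    """
    labels = np.full(rel_vel.shape, "still", dtype=object)
    labels[rel_vel > eps] = "approach"
    labels[rel_vel < -eps] = "withdraw"
    return labels

print(velocity_to_behavior(np.array([0.20, -0.12, 0.01])))
# -> ['approach' 'withdraw' 'still']
```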
0:08:52 | Now, for the statistical analysis of these high-level behaviours, |
---|
0:08:57 | we use the ground truth to select regions |
---|
0:09:00 | where we are confident about the mean estimation from the different annotators, |
---|
0:09:05 | and we select regions of increase, decrease, or stability of the emotional attribute. |
---|
0:09:11 | Then we statistically compare the body language behaviours in these different regions, |
---|
0:09:17 | and we check whether there are statistically significant differences in the body language behaviours. |
---|
0:09:23 | We find meaningful results for the activation and dominance emotional attributes. |
---|
0:09:28 | As an example, when we are comparing regions of activation increase with regions of activation decrease, |
---|
0:09:34 | we notice that when a person's activation is increasing, he tends to walk more towards the other person |
---|
0:09:38 | and move his hands more towards the other person; |
---|
0:09:42 | when his activation is decreasing, he moves more away from the other person and keeps his hands to himself more. |
---|
0:09:49 | When we are looking at dominance increase versus dominance decrease, |
---|
0:09:53 | we see that when a person is increasing his dominance, he tends to look at the other person or turn towards the other person, |
---|
0:09:59 | while when he is decreasing his dominance, he looks away or heads away from the other person more. |
---|
0:10:08 | These results seem intuitive, and they are in agreement with some qualitative results from psychology. |
---|
0:10:14 | Unfortunately, we were not able to find such intuitive behaviours for the valence attribute, |
---|
0:10:20 | which is also verified in our later results on tracking. |
---|
0:10:24 | And now let us go to the tracking part. |
---|
0:10:27 | I also have a video here. |
---|
0:10:38 | So our goal for the tracking experiment is to use the audiovisual features to track the underlying emotional state, which is the red curve that you see here. |
---|
0:10:49 | The grey curve is our estimation of the underlying emotional state through the course of time; in this case, it is the activation of the actor for one of our recordings. |
---|
0:10:58 | So, in order to do this, |
---|
0:11:09 | we use the Gaussian mixture model based mapping that has been used, among others, for the problem of acoustic-to-articulatory speech inversion. |
---|
0:11:19 | It consists of finding a mapping |
---|
0:11:22 | between the continuous underlying emotional state at a specific time instant, which is denoted by x, |
---|
0:11:29 | and the corresponding body language or prosodic feature vector at the same time instant, which I denote by y. |
---|
0:11:37 | We model their joint distribution with a Gaussian mixture model. |
---|
0:11:42 | Therefore, |
---|
0:11:44 | the conditional distribution of the underlying emotional state given the observed audiovisual feature vector is also going to be a Gaussian mixture model. |
---|
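In symbols, this is the standard Gaussian conditioning identity; the block notation below is ours:

```latex
p(x_t, y_t) = \sum_{k=1}^{K} w_k\,
  \mathcal{N}\!\left(\begin{bmatrix} x_t \\ y_t \end{bmatrix};
  \begin{bmatrix} \mu_k^{x} \\ \mu_k^{y} \end{bmatrix},
  \begin{bmatrix} \Sigma_k^{xx} & \Sigma_k^{xy} \\
                  \Sigma_k^{yx} & \Sigma_k^{yy} \end{bmatrix}\right)
\quad\Longrightarrow\quad
p(x_t \mid y_t) = \sum_{k=1}^{K} \beta_k(y_t)\,
  \mathcal{N}\!\left(x_t;\;
  \mu_k^{x} + \Sigma_k^{xy}\bigl(\Sigma_k^{yy}\bigr)^{-1}\bigl(y_t - \mu_k^{y}\bigr),\;
  \Sigma_k^{xx} - \Sigma_k^{xy}\bigl(\Sigma_k^{yy}\bigr)^{-1}\Sigma_k^{yx}\right)
```

where the responsibilities satisfy \(\beta_k(y_t) \propto w_k\,\mathcal{N}(y_t;\,\mu_k^{y},\,\Sigma_k^{yy})\).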
0:11:53 | Therefore, we obtain the optimal maximum likelihood mapping |
---|
0:11:59 | if we select the underlying emotional state that maximizes the conditional probability of the emotional state |
---|
0:12:05 | given the observed feature vector. |
---|
0:12:08 | This is an iterative process, |
---|
0:12:10 | through an expectation-maximization algorithm, |
---|
0:12:13 | and it converges to the maximum likelihood mapping. |
---|
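A minimal sketch of such a GMM-based mapping, assuming scikit-learn's GaussianMixture for the joint model. For brevity it returns the responsibility-weighted conditional mean (the closed-form minimum-mean-squared-error variant) rather than iterating to the exact maximum-likelihood trajectory:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_components=8):
    """Fit a GMM on joint vectors z_t = [x_t; y_t] (state + features)."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="full").fit(np.hstack([X, Y]))

def estimate_trajectory(gmm, Y, dim_x):
    """Responsibility-weighted conditional mean E[x_t | y_t] per frame."""
    n = Y.shape[0]
    resp = np.zeros((n, gmm.n_components))
    cond = np.zeros((n, gmm.n_components, dim_x))
    for k in range(gmm.n_components):
        mu, S = gmm.means_[k], gmm.covariances_[k]
        mu_x, mu_y = mu[:dim_x], mu[dim_x:]
        S_xy, S_yy = S[:dim_x, dim_x:], S[dim_x:, dim_x:]
        # responsibility of component k for each observed feature vector
        resp[:, k] = gmm.weights_[k] * multivariate_normal.pdf(Y, mu_y, S_yy)
        # Gaussian conditioning: E[x | y, k] = mu_x + S_xy S_yy^{-1} (y - mu_y)
        cond[:, k, :] = mu_x + (Y - mu_y) @ np.linalg.solve(S_yy, S_xy.T)
    resp /= resp.sum(axis=1, keepdims=True)
    return (resp[:, :, None] * cond).sum(axis=1)   # (frames, dim_x)
```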
0:12:17 | In practice, |
---|
0:12:18 | we don't just want to use information about a single time instant; |
---|
0:12:21 | we want to take into account temporal context. |
---|
0:12:24 | To do this, we also use the first and second derivatives, so we augment the feature vector with these delta features. |
---|
0:12:31 | This enables us to take this temporal context into account |
---|
0:12:34 | and produce smoother emotional trajectory estimates. |
---|
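A minimal sketch of the delta augmentation; the exact derivative estimator (here a simple central difference via numpy.gradient) is our assumption:

```python
import numpy as np

def add_deltas(F):
    """Augment a (frames, dims) feature matrix with its first and
    second temporal derivatives (delta and delta-delta features)."""
    d1 = np.gradient(F, axis=0)     # first derivative over time
    d2 = np.gradient(d1, axis=0)    # second derivative over time
    return np.hstack([F, d1, d2])   # (frames, 3 * dims)
```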
0:12:41 | Now, we would like to use not only body language cues but also speech |
---|
0:12:45 | in our feature vector. However, it seems that |
---|
0:12:48 | body language is relevant for the whole interaction, but audio cues |
---|
0:12:52 | are only relevant when the actor is actually speaking, which is not all of the time. |
---|
0:12:58 | We needed to find a solution for this, so we decided to |
---|
0:13:01 | train two Gaussian mixture model mappings. |
---|
0:13:04 | One mapping is trained on frames where the actor is not speaking; |
---|
0:13:07 | this is the body-only GMM mapping, using only body features. |
---|
0:13:11 | The second is trained on frames where the actor is speaking; |
---|
0:13:14 | this is the audiovisual GMM mapping, using both body and speech features. |
---|
0:13:18 | So at the testing stage, we chop each test recording into consecutive overlapping segments, |
---|
0:13:25 | and we apply the appropriate mapping according to whether |
---|
0:13:29 | the actor is speaking or not speaking, |
---|
0:13:31 | in order to obtain the final curve estimation. |
---|
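A minimal sketch of this switching logic, reusing estimate_trajectory from the earlier sketch; the frame-level voice-activity flags and mapping objects are assumed to be given:

```python
import numpy as np

def track_with_switching(body, speech, is_speaking, body_gmm, av_gmm, dim_x=1):
    """Apply the body-only or audiovisual mapping depending on speech activity.

    body:        (frames, body_dims) body language features
    speech:      (frames, speech_dims) prosodic features
    is_speaking: (frames,) boolean voice-activity flags (assumed given)
    """
    estimate = np.zeros((body.shape[0], dim_x))
    silent = np.where(~is_speaking)[0]
    talking = np.where(is_speaking)[0]
    if silent.size:       # body-only GMM mapping when the actor is silent
        estimate[silent] = estimate_trajectory(body_gmm, body[silent], dim_x)
    if talking.size:      # audiovisual GMM mapping when the actor speaks
        av = np.hstack([body[talking], speech[talking]])
        estimate[talking] = estimate_trajectory(av_gmm, av, dim_x)
    return estimate
```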
0:13:35 | The audio features that we are using are pitch and energy based features. |
---|
0:13:43 | So, here are some of our tracking results. |
---|
0:13:46 | You can see in this plot that |
---|
0:13:48 | the red line is the mean annotation, |
---|
0:13:51 | the blue line is the maximum likelihood estimated trajectory using only body features, and the green one is when using body and speech features together. |
---|
0:14:00 | This is one of our best results: we achieved a correlation between the ground truth |
---|
0:14:06 | and our estimation of the order of 0.8, which increased further |
---|
0:14:10 | when we also used the audio information. |
---|
0:14:13 | The second is a more median result: |
---|
0:14:16 | we have a correlation between the ground truth and the blue line, which uses only body features, |
---|
0:14:22 | of around 0.4; |
---|
0:14:24 | when we also use the audio features, we increase significantly, to around 0.5. |
---|
0:14:30 | In general, what we can see from these two plots is that the two trajectories |
---|
0:14:35 | follow the trends of the underlying emotional state rather than the absolute values. |
---|
0:14:40 | So we can argue that we can track emotional changes |
---|
0:14:43 | rather than absolute emotional values, |
---|
0:14:46 | and this is in part explained by the fact that it is easier to quantify or describe emotion |
---|
0:14:51 | in relative terms rather than absolute terms. |
---|
0:14:56 | So now, moving on to the overall results: |
---|
0:15:00 | we evaluate the performance |
---|
0:15:02 | of tracking changes by measuring the correlation between the ground truth and the estimated curve that we produce. |
---|
0:15:09 | And as an upper bound, to show the difficulty of the problem, |
---|
0:15:13 | we also show the inter-annotator correlation |
---|
0:15:16 | for each specific recording, between the annotators. |
---|
0:15:20 | So these are our results. |
---|
0:15:23 | For the activation case, the median correlation between the ground truth and the body-only MLE mapping is 0.31; |
---|
0:15:34 | when we use the audiovisual mapping, the median correlation increases to 0.42, which is a significant increase. |
---|
0:15:40 | The inter-annotator correlations are 0.55, so one could argue that our performance is comparable to that of a human annotator. |
---|
0:15:50 | For dominance, our results are lower: |
---|
0:15:53 | the body-only and audiovisual mappings perform similarly, around 0.26 to 0.33, |
---|
0:16:00 | while the inter-annotator correlations are 0.47. |
---|
0:16:05 | For the valence case, |
---|
0:16:07 | we were not able to track the changes: |
---|
0:16:09 | we have a median correlation around zero, and this is in agreement with our statistical analysis results, |
---|
0:16:14 | where we were not able to show |
---|
0:16:16 | that valence is meaningfully reflected through our features. |
---|
0:16:23 | So this leads us to a discussion of how observable the underlying emotional states that we are trying to track are, |
---|
0:16:30 | given the features that we extracted. |
---|
0:16:32 | It seems from our results that we are able to capture the activation changes of an actor through the course of time, |
---|
0:16:38 | and some of the dominance changes, |
---|
0:16:40 | but we are not able to track the valence changes. |
---|
0:16:43 | This might mean that body language and prosody may be more informative about the activation and dominance states |
---|
0:16:50 | than the valence state. |
---|
0:16:51 | The valence state may be expressed through other modalities, such as facial expressions |
---|
0:16:56 | or lexical content. |
---|
0:16:58 | Also, it could be the case that we need to do more detailed feature extraction, specifically tailored |
---|
0:17:05 | for the valence attribute, rather than using the same feature set for all three emotional attributes. |
---|
0:17:14 | This is one direction of our future work. |
---|
0:17:16 | We also note that the use of prosodic cues greatly benefits activation tracking; |
---|
0:17:24 | the fact that vocal cues are |
---|
0:17:27 | an informative feature for activation has already been observed in the emotion literature. |
---|
0:17:31 | And finally, our overall conclusion is that we can track emotional changes rather than |
---|
0:17:37 | absolute emotional values with this framework. |
---|
0:17:41 | In terms of future work, we would like to focus on improving our features by extracting |
---|
0:17:46 | specific features for each emotional attribute that we track. |
---|
0:17:49 | Also, we would like to improve the data annotation process, for example by training annotators |
---|
0:17:54 | to achieve higher evaluator agreement, in order to have a more consistent ground truth. |
---|
0:17:59 | And as a longer-term goal, we want to work towards real-life uses of emotional state monitoring, |
---|
0:18:06 | which would enable us to examine the estimated curves over time and find |
---|
0:18:10 | regions where we have a big increase or decrease of, for example, the activation of a person; |
---|
0:18:16 | that is, to find regions where something interesting happens, or, as we could say, |
---|
0:18:21 | the emotionally salient regions of the interaction. |
---|
0:18:26 | These are the references that were used for this presentation. |
---|
0:18:30 | Thank you for your attention. |
---|
0:18:35 | [Q&A: audience question, inaudible] |
---|