0:00:12it's my great honour and pleasure to announce our
0:00:17invited speaker today, Shri Narayanan, who will talk about behavioral signal processing
0:00:23so Shri is the Andrew Viterbi professor at USC
0:00:27his research focuses on human-centred information processing and communication technologies
0:00:33and he seems to be the kind of person who holds four professor appointments
0:00:38I was very impressed to see that, in electrical engineering and computer science but also linguistics
0:00:43and psychology
0:00:44and I don't live in the US, but someone told me that he is also a regular guest on US
0:00:48television, so
0:00:50so please help me welcome Shri; I'm really looking forward to the talk
0:01:03thank you
0:01:04all right, I'm
0:01:05really honoured to be here, and it was great to see a lot of friends of mine I haven't seen
0:01:10in a long time kind of come back to speech, at least to check it out
0:01:14so they were asking, you know, what are these crazy, fringy, funny things I've been up to
0:01:21so that's this talk today
0:01:23and
0:01:24the only little problem I have with this is that I haven't done very much on this topic yet
0:01:30machines but i would share whatever we've been up to in the last couple of years
0:01:35hopefully won't disappoint them able to spend you part yeah
0:01:38so the title is behavioral signal processing; I will momentarily define what I mean by that
0:01:44the case be made of this terms of the got at least say what it is
0:01:49so
0:01:51so this work concerns, you know, human behaviour; as we all know, it's very complex and multifaceted
0:01:58it involves very complex and intricate mind-body relations
0:02:03it is affected by, you know, the environment and by interaction with other people and the environment
0:02:11and, you know, it's reflected in how we communicate, emote, express our personality and interact with other
0:02:18people
0:02:19and also it's characterised by the generation and processing of multimodal cues
0:02:24and is often characterised as typical, atypical, disordered and so on
0:02:29so one wonders, you know, what is the role of signal processing, or signal processing people, in this
0:02:36business
0:02:38so if you look across a number of domains, behavior analysis is, either explicitly or implicitly, essential to them
0:02:46starting from customer care: you know, you want to know whether a person is
0:02:51you know, frustrated or very satisfied with the services that have been rendered, and you want to sell more things
0:02:57you know, you wanna
0:02:58rate behavior at the level of an individual or group, and so on
0:03:02in learning and education, not only do you wanna know whether someone is getting a particular answer
0:03:09right or wrong, you wanna know how they got it, how confident they are
0:03:13and, you know, how you can actually adapt; personalised learning is one of these, you know, grand challenges
0:03:18of engineering, so to be able to do that, you know, we have to understand
0:03:23behavior patterns and the like
0:03:25but more importantly, and something that I've got an increasing passion about, is this whole area of mental health and
0:03:31wellbeing, which I'll try to touch upon today in a couple of my examples
0:03:35where
0:03:37you use behavior analysis very centrally, through observation-based or other means
0:03:43but, you know, when you look across these, while computational tools are used, mostly it's very human based
0:03:50so I thought before we go further I'd also show some videos as examples of, you know, some of these
0:03:57typical problems one could ask
0:03:58so here you're gonna see kids playing with, actually, a computer game, talking to it
0:04:03the question is, you know, can we tell something about the child's cognitive state
0:04:09you know, confident or
0:04:11not
0:04:12so let's look at this little girl
0:04:18right
0:04:20or you can you
0:04:22and mute audio please
0:04:36alright let's try again
0:04:43hold on i checked many times
0:04:47something about you people an idea
0:04:50let's see
0:04:53it's still a
0:04:54okay
0:05:06or answer
0:05:09yeah
0:05:16i
0:05:20where is this
0:05:28well i
0:05:30oh
0:05:31oh
0:05:33i
0:05:36oh
0:05:43i
0:05:45so just looking at these
0:05:50we see, you know, from their sort of vocal cues and, you know, the language they're using, the
0:05:55visual cues, looking around and looking away, you can say something, you know, at least that these are different
0:06:02and, you know, one of the questions we ask is like, okay, can we actually formalise, you know,
0:06:06the these problems of measuring speaker
0:06:09so the next example
0:06:10it's from marital therapy, or couples counselling
0:06:16so what you're gonna see is a couple interacting
0:06:21and the people in this field, typically psychologists who are doing this kind of research, and people
0:06:29who are actually trying to help these couples, you know, look for a lot of things, you know,
0:06:34characterising affect, dynamics and so on
0:06:37looking at who's blaming whom and trying to figure out what that is and trying to plan the treatment based on
0:06:43that, so let's look at this video
0:06:45should I try again
0:06:48i
0:06:50okay
0:06:58no it's not me
0:07:01you know
0:07:02no
0:07:03the right leg
0:07:06right
0:07:10alright
0:07:45oh
0:07:46used car
0:07:48yeah but what you
0:07:50again
0:07:52but
0:07:53the one of these things
0:07:57or we try to make
0:08:01this is an example from
0:08:06the main word
0:08:07colour is actually
0:08:10you interaction with the child
0:08:12the sort of a semi structured interaction following a particular diagnostic for diagnostic test
0:08:23so that is engaged
0:08:28one
0:08:28one
0:08:29trying to figure out
0:08:33things
0:08:34or
0:08:35everything
0:08:37prosody to
0:08:39sure
0:08:41right
0:08:42you know get price that characterising
0:08:45if you ask
0:08:47six
0:08:49so
0:08:52i
0:08:53i
0:08:56oh right
0:09:01right
0:09:03i
0:09:06right
0:09:08i
0:09:10so the thing you should have probably observed is that, you know, with that child, you know, there was a clear place where,
0:09:16you know, the child could have laughed back or looked at the person, but nothing was happening; the child was just
0:09:22you know doing the task memory so why not sway
0:09:25and X I causes rate so these things sort of on a very i'll talk a little later on some
0:09:32scales that we've been developed in the I D S M
0:09:35or just want to confide
0:09:37and all these are some of the things that are happening; as you can see, right, it's very observation based
0:09:41but where people are looking at multimodal cues and trying to so vendor sentiment be
0:09:48so when you look at these human behavior signals, right, they kind of provide
0:09:53a window into these high-level processes, you know; it depends on how big or
0:09:58small the window is
0:09:59some are overtly observable, like vocal and facial expressions and body posture; others are covert, you know, we
0:10:06don't have access to them unless in special cases
0:10:10things like heart rate, electrodermal response or even brain activity, and from a signal point of view,
0:10:16you know, this kind of information resides at, you know, different time scales across these different cues
0:10:22but, you know, the ability to process and, you know, sort of interpret and decode these signals can provide us
0:10:28some insights into understanding mind-body relations
0:10:31but also, more importantly, how people process other people's behaviour patterns; you know, that's a fine distinction: both how behaviours
0:10:40are generated but also how people process them
0:10:45and the measurement and quantification of these kinds of human behaviour, both from the production and perception perspectives, is a
0:10:51fairly challenging problem, I believe
0:10:55so here's my operational definition for what I call behavioral signal processing: basically it refers to the computational methods that try to
0:11:03model human behavioral signals
0:11:05that are manifested in, you know, either overt and/or covert signals
0:11:09and are processed by humans explicitly or implicitly, you know,
0:11:13and that, you know, eventually help facilitate human analysis and decision making
0:11:19so
0:11:20the outcome is, you know, informatics which can be useful across domains, you know, whether to inform diagnostics or,
0:11:26you know, plan treatments, or, you know, fire up an autonomous system to, you know, do personalised teaching
0:11:34and so on
0:11:35but in all these, right, what behavioral signal processing tries to do, at varying levels, is to quantify
0:11:40this human felt sense
0:11:42so
0:11:44that's kind of the idea, and it's challenging along a lot of different dimensions, and I'll try to
0:11:50at least impress upon you some of those
0:11:54so if you think about it, right, of course technology has already helped in this domain quite
0:12:00a bit; all of this relies on the significant foundational advances that have been made
0:12:05in a number of domains, you know, things that have happened and been discussed
0:12:10deeply at this conference, from audio and video diarization, you know, speech recognition and understanding what was spoken
0:12:17to things like what the first talk was about, visual activity recognition, you know, everything from low-level descriptions
0:12:23of, you know, head pose and orientation to
0:12:26complex you know
0:12:29classification of a normal activity
0:12:31to physiological aspect of signal processing
0:12:34but the difference is that, using these as building blocks, you know, what you wanna do is
0:12:39to try to map them to more abstract, domain-relevant behaviours, and that means, you know, new multimodal
0:12:46modeling approaches
0:12:48oh
0:12:49so people have started to work on this already, you know, solving various parts of this puzzle
0:12:55right from sensing, where people have been trying to see how you actually measure human behaviour in a
0:13:01sort of ecologically valid way, that is, not disturbing the process that we're trying to measure
0:13:06from, you know, instrumenting environments with, you know, cameras and microphones and other types of things to actually instrumenting
0:13:13people with sensors, body computing types of techniques
0:13:16in speech, you know, increasingly people are doing richer and richer processing: you know, what's
0:13:23been said, by whom and
0:13:24how
0:13:25affective computing: you see a lot of papers have been published in this area
0:13:30and also social signal processing, about modeling individual and group interaction, turn-taking dynamics and non-verbal cue processing
0:13:39and so on; so these are all kind of, you know, essential building blocks for BSP
0:13:44so
0:13:47in summary, you know, the ingredients for being able to do this: of course, you know, people are working
0:13:52in signal processing areas on acquisition, how do you acquire these things, how do you build these types of systems in a
0:13:58meaningful way; you know, the kind of behaviours you want to track
0:14:04might not happen in a lab, you know; you might wanna do it, you know, in the wild,
0:14:09so to speak, you know, in playgrounds, in classrooms, at home
0:14:13for example, monitoring and modeling behaviour patterns of the elderly
0:14:17and also, you know, body computing, and there's lots of interesting signal processing challenges there; analysis: you know,
0:14:23what features tell you more about particular behaviour patterns of interest
0:14:29and how do you do this robustly, you know, questions that we ask here about noise and so on
0:14:33and more importantly also modeling these behavioural constructs that are defined by the experts
0:14:40and providing the capability of, you know, both descriptive and predictive, you know, modeling
0:14:47so this is kind of not easy because
0:14:51one, the observations of these behaviour patterns, you know, have large amounts of uncertainty
0:14:58and are at best partial
0:14:59there's lots of, you know, as was mentioned in the computer vision talk, questions about representations, you know,
0:15:07what are the representations that we
0:15:10have to define
0:15:11to compute these things in the first place; you know, he mentioned an experiment where they gave visual scenes and asked people to describe them
0:15:18right, so imagine now you are a psychologist observing a couple interacting: what are the things that you're
0:15:24looking for, how do they describe them, before we even set out to actually map
0:15:29observable cues to these representations, so
0:15:32that itself is a first-class research problem: what kind of representations should be specified
0:15:36and given, you know, we are talking about human behaviour, there's vast heterogeneity
0:15:42and that's basically differences in the behaviour patterns of people over time and across people
0:15:49and variability in how these data are generated and used
0:15:53so
0:15:54what do people do? you know, in each of these domains you look at, and I'll show you
0:16:00some examples, they have their own specific constructs; for example, in oral language assessment or, you know, in a
0:16:06learning situation, say literacy
0:16:08when they try to figure out what kind of, you know, help a little child needs when they're
0:16:13learning to read, they're looking not just at whether a child is making a particular sound or
0:16:18not; there are a number of things that come into play, you know, disfluencies, in fact the rate
0:16:23of disfluencies, hesitation, they play an
0:16:25implicit role, as we found when we did some experiments
0:16:28in, you know, pediatric obesity, you know, not only are they monitoring physical activity
0:16:33but also, you know, emotional state, and they want to, you know, model decision making
0:16:38and so on
0:16:39and there are a lot of common features, because after all, you know, the kinds of sensing we have access to are limited
0:16:46you know, we have audio, microphones, video, and you can have some physiological sensors
0:16:51and so the approach, at least at some of the low levels, tends to be
0:16:56the same
0:16:57but the important part is, you know, to see
0:17:00how human experts do it, observe them and learn, and try to see how we
0:17:05can augment their capabilities
0:17:07so that's, I think, one of the hallmarks of the way I look at behavioral signal processing:
0:17:12it is to provide supporting tools that would help the human expert and not supplant them; you know, total automation,
0:17:19replacing what they're doing, I think that's probably not the most beneficial thing to do
0:17:24so
0:17:26so pictorially, if you look at this particular chart, you know, this is what happens today: you know, people observe
0:17:32the
0:17:35phenomena that they're trying to, you know, observe, say for example a child interacting with the teacher, and they
0:17:41get a lot of data, listen to and look at the child and see how confident they are, make some judgements
0:17:45about how the child is reading and provide appropriate scaffolding or, you know, intervention
0:17:52what we're saying is that perhaps, you know, signal processing and machine learning and, you know, computational tools can
0:17:58come in handy: one, based on trying to sort of decode what human experts do, try to learn
0:18:04what are the features that they use, you know, either explicitly or implicitly, learn that,
0:18:09build models that can help with some of these predictive capabilities; there are certain things, you know, that are beyond human
0:18:15processing capabilities, for example, you know, looking at fine pitch dynamics or looking at, you know, what happened at the
0:18:21beginning of the session and the end of the session; some things
0:18:24computer models can do better
0:18:27provide feedback, and hopefully, you know, these can kind of reinforce each other nicely, and you can use
0:18:34it as some informatics; so that's kind of the idea here
0:18:37so
0:18:39with that kind of background, what I'm gonna do in the rest of the talk is to quickly go through
0:18:43some of these building blocks that we need
0:18:46but mostly focus on a couple of examples, you know; I've chosen two examples here, one from this
0:18:51marital therapy domain
0:18:53and two, you know, quickly on the autism domain, just to highlight some of the possibilities and challenges that
0:19:00there are
0:19:02so
0:19:03as I mentioned already, you know, lots of work is happening in multimodal signal acquisition and
0:19:08processing, you know, everything from smart rooms
0:19:11and other instrumented spaces
0:19:13to actually instrumenting people to sense a lot of different things: sensing the user, sensing the environment in
0:19:20which things are happening, because context becomes important
0:19:23and, you know, doing this in a variety of locations
0:19:27from the laboratory to actually in classrooms and clinics and playgrounds and so on
0:19:32one of the important things we learned there is that, depending on the environment, you know, there are lots of constraints that
0:19:37come into play; for example, when we do our work at the hospital with, you know, kids
0:19:41with autism, there are restrictions on where we can place cameras, where we can put the microphones,
0:19:47either
0:19:49because, you know, it interrupts what's happening there, what the psychologist or the clinician is trying to do, or it's just distracting for
0:19:56the child, because they are, you know, sensitive to certain things and these are distracting and so on
0:20:01so
0:20:02even though, you know, we'd like to capture the 3D environment with like ten, fifteen cameras, it's just not possible
0:20:08so we have to work with these kinds of restrictions, and hence, you know, robustness issues in audio processing and,
0:20:15you know, language processing, video processing, you know,
0:20:17are real; you know, we can't just solve it by better sensing
0:20:22likewise, in sensing people we can do a lot of different things, but we also have to worry about,
0:20:26you know, not only the technological constraints but also the corresponding ethics and privacy constraints, all these things
0:20:33so it's a challenging area
0:20:40i
0:20:42so those are two actors i
0:20:44spoken different part
0:20:45so we've been collecting data using actors to study behaviours, you know, in addition to working with actual,
0:20:51you know, populations
0:20:53because, you know, we can do certain things in the laboratory that we can't, you know, do with
0:20:57data that we collect in the field, so they go hand in hand; so this is a motion capture database of dyadic interaction
0:21:03with a lot of different emotional stuff that's been annotated and rated, and, you know, if you're interested, look it up
0:21:08sure
0:21:09likewise, using actors, you know, we're collaborating with the people in the theatre school doing full-body sort of dyadic
0:21:17interaction; in each of these cases, right, the scenarios were chosen to be rich enough so that they cover
0:21:24the entire gamut,
0:21:27from them playing Shakespeare and Chekhov to actually doing improv, so rich enough audio, video and motion capture data
0:21:34to ask different questions
0:21:36looks like this
0:21:42so this actress
0:21:45so that kind of data is very important in our data acquisition and collection, that's the point there; so the next
0:21:50point is, you know, this kind of summarises whatever happens at ASRU; you know, people have
0:21:55been working on not only, you know, turning speech into words, but they're doing a
0:21:59number of different things, extracting a variety of, you know, rich features which may help the, you know, speech understanding
0:22:05problem, the dialogue management problem, you know, the speaker ID problem; all this is important for, you know, doing BSP
0:22:12as well
0:22:13also there's a lot of work on, you know, emotion recognition, again from speech
0:22:18and from other modalities; one of the important questions there is, you know, how do you represent emotions: do we
0:22:24do categorical representations, like, say, happy, sad, or do more dimensional ones, how aroused it is, how negative
0:22:30it is, or, you know, how dominant the person is
0:22:34to actually having profiles, more statistical distributions of emotional behaviour
0:22:41actually now people want to do continuous tracking of emotional state variation; these are all sort of ongoing questions in the community
0:22:48and how people try to map to those representations from multiple modalities is important there also
0:22:56for example, you know, the interplay between, you know, visual and vocal features is pretty well known; it's a
0:23:02very complex interplay, and one could in fact learn things about, okay, how prosody and head motion are related and how
0:23:09they encode
0:23:10for example not only linguistic information but also this paralinguistic information
0:23:16and, you know, you see a number of studies, including our own, that show both the complementarity and
0:23:24redundancy in information coding about, you know, emotions in all these modalities
0:23:29for example, you know, if you run an emotion recognizer with, you know, speech and facial expression, you can show that
0:23:35with speech there's lots of confusion between anger and sort of happiness
0:23:40but, you know, if you use the face, that goes away; if you put them together, of course, like in any multimodal experiment,
0:23:45you can show a boost in performance, but the point again here is that when you're trying to model these abstract
0:23:51types of behaviours
0:23:53the more information that kind of encodes these types of constructs you can get a handle on, the
0:24:01better it is for your computational model
0:24:05so going back to that example I showed, those kids being uncertain or not sure: if you add things like,
0:24:10you know, measures of lexical and nonverbal vocalisations, like that little boy who said "mm", you know, was hesitating, if you kind of
0:24:18detect and model those and, you know, combine them with the visual cues of, you know, hand and head motion, you can
0:24:25come fairly close to human agreement about, you know, whether the child is certain or not in context; so by
0:24:32integrating that you can do things of the sort
0:24:36in fact, in many real-life situations, of course, you know, interactions depend on
0:24:42the other people there, who you're interacting with; so the idea is, if you model, you know, the humans
0:24:48that are there, the mutual influence between, say, a dyad, two people interacting, you know, you can do better in
0:24:54predicting what would come next; so for example, in a dyadic interaction we can model both these
0:25:02people that are in what's it has been why as sort of a data base unit
0:25:06and you can show that by doing that, right, by adding cross dependencies between these people, not only what they
0:25:12did before but also what the other person did before, you can predict the upcoming state slightly better; so
0:25:18this type of thing can be done with the existing machinery, you know, with a number of different things
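
The cross-dependency idea can be illustrated with a small count-based sketch. This is not the model used in the talk; the state labels, the function names `fit_cross_model` and `predict_next`, and the toy sequence are all assumptions made for illustration. It predicts one partner's next state from both partners' previous states.

```python
from collections import Counter, defaultdict


def fit_cross_model(turns):
    """Count partner A's next state given (A's previous state, B's previous state).

    turns: list of (state_a, state_b) tuples, one per time step (hypothetical labels).
    """
    table = defaultdict(Counter)
    for (prev_a, prev_b), (next_a, _) in zip(turns, turns[1:]):
        table[(prev_a, prev_b)][next_a] += 1
    return table


def predict_next(table, prev_a, prev_b):
    """Return the most frequent next state of A for this context, or None if unseen."""
    options = table.get((prev_a, prev_b))
    if not options:
        return None
    return options.most_common(1)[0][0]


# Toy sequence: when A is "calm", A's next state follows what B just did.
example_turns = [
    ("calm", "calm"), ("calm", "angry"), ("angry", "angry"), ("angry", "calm"),
    ("calm", "calm"), ("calm", "angry"), ("angry", "calm"),
]
model = fit_cross_model(example_turns)
print(predict_next(model, "calm", "angry"))
print(predict_next(model, "calm", "calm"))
```

In this toy data, A's own previous state ("calm") is ambiguous on its own; conditioning on B's previous state as well resolves it, which is the intuition behind modeling both people jointly.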
0:25:23so
0:25:24so with that kind of broad, very high-level overview of, you know, some of the computational things that are
0:25:29happening in our field
0:25:30so now we can go to, you know, what I'm asking, you know,
0:25:34how can these types of things be applied to problems that people are asking in these various domains; they're
0:25:41doing this without us, you know, messing with those fields, right; you know, marital therapy has been going
0:25:46on for decades; you know, they want to predict things like, sort of, how long will the
0:25:51marriage last, or can it be mended, those kinds of questions
0:25:54so, you know, we come there and say, well, we have some computational ideas, and ask, can they help
0:26:00so that's right
0:26:02so psychology research depends a lot on observational judgements; you know, many times they in fact record these
0:26:09interactions and then go through
0:26:13very painstaking and careful coding of these behaviours based on, you know, good theoretical research frameworks that a particular lab
0:26:23might have
0:26:24and they develop a lot of coding standards and so on
0:26:28so
0:26:31yeah i'll show you some examples of
0:26:37earlier
0:27:00i
0:27:03so various couples interacting; okay, this is actually not real clinical data
0:27:10what I'm gonna talk about later is actually based on clinical trial data
0:27:15so they create these manuals, but this manual coding process with which they do the analyses is kind of not very scalable; it
0:27:20takes a lot of time, and, you know, training coders, usually, you know, students in psychology or linguistics who
0:27:28are recruited, is not very reliable
0:27:31inter-coder reliability is also tough
0:27:34and so we ask, you know, the very simplistic question of whether technology can help to code these kinds of
0:27:39audio-visual data for these behavioural sort of characterizations
0:27:43and there are measures that in fact are very difficult for humans to make, where it can help, you know,
0:27:49all these
0:27:49measurements of timing and, you know, even with diarization, how long a person speaks is actually very important,
0:27:55as I'll show later on;
0:27:57that tells you quite a bit
0:27:58and we are, you know, consistently able to quantify some aspects of these, at least the low-level
0:28:05human behaviour
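
As a minimal sketch of one such low-level timing measure, the share of speaking time per person can be computed from diarization output. The segment format `(speaker, start_sec, end_sec)`, the function name `speaking_time_share`, and the example values are assumptions for illustration, not something specified in the talk.

```python
from collections import defaultdict


def speaking_time_share(segments):
    """Return each speaker's fraction of total speaking time.

    segments: list of (speaker, start_sec, end_sec) tuples (assumed diarization output).
    """
    totals = defaultdict(float)
    for speaker, start, end in segments:
        if end < start:
            raise ValueError("segment ends before it starts")
        totals[speaker] += end - start
    overall = sum(totals.values())
    if overall == 0:
        return {speaker: 0.0 for speaker in totals}
    return {speaker: t / overall for speaker, t in totals.items()}


# Hypothetical diarization output for a short exchange.
example_segments = [
    ("husband", 0.0, 4.0),
    ("wife", 4.0, 6.0),
    ("husband", 6.0, 8.0),
]
shares = speaking_time_share(example_segments)
print(shares)
```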
0:28:07so here's the same kind of chart; you know, here for example we are interested in a married couple discussing a problem
0:28:13we wanna know, for example, you know, how much
0:28:17a spouse is blaming, how much blame is one spouse putting on the other spouse
0:28:22and it's not necessarily symmetric
0:28:25so this is what we wanna do; to help with that, we have a big corpus from
0:28:31one hundred thirty-four distressed couples who were enrolled in a clinical trial
0:28:38and received couples therapy, so we have access to one hundred hours of data or so; it was not intended for
0:28:45doing this automated processing, you know; it has transcriptions and so on, it also has video, you saw some examples, and this
0:28:54is what we start with
0:28:55so
0:28:56and what is also very nice for us is that it has expert ratings of these interactions at the session level;
0:29:04you know, every couple had a ten-minute-long problem-solving interaction
0:29:09and they coded for a number of things, a number of behavioural patterns that were of interest to researchers in this
0:29:14domain, for example
0:29:16one global coding goal was like,
0:29:18is the husband showing acceptance, so a pretty abstract question, and the description that corresponds to that was
0:29:26"indicates understanding and acceptance of partner's views,
0:29:30feelings and behaviours; listens to the partner with an open mind and positive attitude" and so on; so this is what
0:29:35the coders internalised and rated on a scale of one to nine
0:29:40okay
0:29:41so these are the kinds of behaviours we try to see whether we can predict with these signal
0:29:47cues, right; so we start with the most obvious or simplest thing we know how to do
0:29:52so we said, well, let's focus on a few of those codes, like, you know, acceptance, blame, positive affect,
0:29:58negative affect, sadness
0:30:00and so on, each marked for
0:30:03both husband and the wife
0:30:04and with the ratings, you know, one through nine, these are the histograms of those that were given by
0:30:10people
0:30:11we said, to make it even simpler for us, well, let's just focus on the
0:30:16top twenty percent and bottom twenty percent,
0:30:19you know, separating the extremes
0:30:21and see whether we can do this
0:30:23"'kay"
0:30:24from, say, things that we know how to do, like measure speech properties, you know, and transcribe it and see whether
0:30:32the words tell me something
0:30:33and if I know that, how successful would I be in predicting these codes that the humans gave; that was
0:30:39the problem
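
The top and bottom twenty percent selection can be sketched as follows. The session ids, the ratings, and the function name `split_extremes` are hypothetical, and how ties at the cut-off were handled in the actual study is not stated in the talk; here a stable sort simply decides.

```python
def split_extremes(ratings, fraction=0.2):
    """Return (low_ids, high_ids): the bottom and top `fraction` of sessions by rating.

    ratings: dict mapping a session id to its rating on the one-to-nine scale.
    """
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")
    ranked = sorted(ratings, key=ratings.get)  # session ids, lowest rating first
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


# Hypothetical blame ratings for ten sessions.
example_ratings = {
    "s01": 1.5, "s02": 8.0, "s03": 4.0, "s04": 6.5, "s05": 2.0,
    "s06": 9.0, "s07": 5.0, "s08": 3.0, "s09": 7.0, "s10": 5.5,
}
low_ids, high_ids = split_extremes(example_ratings)
print(low_ids, high_ids)
```

The two groups then serve as the negative and positive classes of a binary classification task, with the middle sixty percent left out.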
0:30:40so that's the chart; it's busy, but what it just says is what most of us here do, right:
0:30:46we kind of get the audio, you know, you get rid of things that are hopeless, and then we
0:30:51do speech signal processing we measured no be due but or now recall that
0:30:56at
0:30:56and measure things like, you know, pitch and intensity and MFCCs and derive lots of different statistical functionals
0:31:05at the utterance level and at different levels of temporal granularity
0:31:10and throw it into our favourite machine learning tool
0:31:15and try to predict the particular category we're interested in
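
A minimal sketch of the statistical-functionals step, assuming frame-level contours have already been extracted by some front end. The feature names, the choice of five functionals, and the helper names `functionals` and `utterance_vector` are assumptions; the talk does not list the exact set used.

```python
import statistics


def functionals(contour):
    """Summarise one frame-level contour with a few statistical functionals."""
    values = [v for v in contour if v is not None]  # drop unvoiced or missing frames
    if not values:
        raise ValueError("contour has no valid frames")
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


def utterance_vector(frame_features):
    """Flatten per-feature functionals into one utterance-level feature dict."""
    vector = {}
    for name, contour in sorted(frame_features.items()):
        for stat, value in functionals(contour).items():
            vector[f"{name}_{stat}"] = value
    return vector


# Hypothetical frame-level measurements for one utterance.
example = {
    "pitch_hz": [200.0, None, 220.0, 240.0],
    "intensity_db": [60.0, 62.0, 64.0, 66.0],
}
vec = utterance_vector(example)
print(vec)
```

The resulting fixed-length vector is what would be handed to whichever classifier is chosen; the same functionals can be recomputed over longer windows for coarser temporal granularities.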
0:31:19likewise we can also do you know transcription generate lattices and then you can use those discourse specific
0:31:27i know for K for classification
0:31:29"'kay" so that's
0:31:30exactly what we did so here's a transcript of interest i don't example so what it's like you know what
0:31:37exactly what you can spend you know everything there where the money is one of the things that we like
0:31:41this like a
0:31:42think that they're worried about in the fight double
0:31:45another thing is that
0:31:46you'll see that when you look at the results
0:31:50and in fact one of the other important things is the detection of all these non-verbal vocalisations and cues that
0:31:56are information bearing; at least that's what the algorithms tell us
0:32:00so as I mentioned, right,
0:32:03a lot of prosodic features and acoustic features and simple binary classification, and here are the results just from a very simple
0:32:09classifier with the acoustic features, right; you know, for many of these constructs like, you know, blame and,
0:32:16you know, positive and negative behaviour, you know, we can do much better than chance
0:32:21just from, you know, these vocal features, and that was very encouraging
0:32:26while there are certain things like, you know, sadness and humour that are harder to do just from acoustics, and the reason is because,
0:32:33you know, we're not capturing any contextual cues or lexical cues or visual cues or anything at all
0:32:39so then we said, well, okay, now let's throw in lexical information; if you look at the transcripts, there are
0:32:44a lot of words that scream at you, saying, hey, this guy's really mad at that person, you know, they're
0:32:49blaming each other; for example, in this transcript, you know, we highlight
0:32:53that they kept saying "it's aggravating"
0:32:56and so we said, well, can we automatically capture these kinds of salient words from the text
0:33:02and, simple again, you know, we built language models
0:33:06and you can score, you know, utterance X against these models to figure out, okay, which particular condition
0:33:15these particular, you know, these words correspond to
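
Scoring an utterance against class-specific language models can be sketched with add-one-smoothed unigram models. The talk does not give the model order or smoothing used, so the unigram choice, the tiny training sets, and the names `train_unigram`, `log_prob` and `classify` are assumptions for illustration.

```python
import math
from collections import Counter


def train_unigram(utterances):
    """Count word occurrences over a list of utterances for one class."""
    return Counter(word for utt in utterances for word in utt.lower().split())


def log_prob(utterance, counts, vocab_size):
    """Log-probability of the utterance under an add-one-smoothed unigram model."""
    total = sum(counts.values())
    return sum(
        math.log((counts[word] + 1) / (total + vocab_size))
        for word in utterance.lower().split()
    )


def classify(utterance, models):
    """Score the utterance against every class model; return (best_label, scores)."""
    vocab = set().union(*(m.keys() for m in models.values()))
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    scores = {label: log_prob(utterance, m, vocab_size) for label, m in models.items()}
    return max(scores, key=scores.get), scores


# Tiny hypothetical training sets for the two extremes of the blame code.
models = {
    "high_blame": train_unigram(["you never help", "you always do this", "it is your fault"]),
    "low_blame": train_unigram(["i think we can fix it", "we should talk about it", "i feel better now"]),
}
label, scores = classify("you always forget", models)
print(label, scores)
```

In this toy setup the second-person "you" pulls the score toward the high-blame model, which mirrors the kind of word-level evidence discussed in the talk.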
0:33:22so you can do it with, you know, lattices, not necessarily just utterances, but the interesting thing is, you know,
0:33:27the kinds of things that are part of these models are very informative, you know, even very simple things
0:33:32like, okay, in the blame situation you can look at the extremes, the high-blame words
0:33:38and the low-blame words; you know, "you", the second person,
0:33:44is actually correlated with high blame quite a bit, in fact very consistent with what psychologists, you know,
0:33:50hypothesize,
0:33:52compared to first person, but you also see words like, you know, teaching
0:33:56because cleaning seems to be a big deal if i
0:34:01comes about living
0:34:03quite a bit so
0:34:05yeah that's right then you know we said well let's a simple thing we do but there that's just not
0:34:10know
0:34:11right a lot of challenges add to this problem domain
0:34:15first of all you know any particular single feature stream is going to provide you just a small window as
0:34:20i pointed out and it's noisy
0:34:22so you know of course we want to do it multimodally and you know you want to also do it
0:34:26in a context sensitive fashion
0:34:29the more important thing is like many of these ratings you know in many of these domains they do it at the
0:34:34session level they wanna get a gestalt judgement of that particular thing
0:34:37but what is not clear is what in that particular unfolding of the interaction led to this particular perceptual
0:34:44judgement by the people
0:34:46so you want to know what was salient
0:34:48so we tried doing you know these are first cuts at it using sort of multiple instance learning to
0:34:54see whether we can do things that are possible
0:34:58then another point is that you know when these ratings are done it's not done you
0:35:02know in a more typical sort of categorisation but they are posed many times you know
0:35:10in a rank ordered list that is one is
0:35:13you know sort of less than two is less than three tuple or no way into one are known or
0:35:17can be also
0:35:19do what people are trying to integrate this
0:35:22yeah then these are kind of things trying to do more efficiently what people are doing
0:35:28there are things that you know that are more than the felt sense case
0:35:31people hypothesize that you know when people interact there's some things about you know
0:35:37synchrony in their interaction that happens that tells you how flexibly that interaction proceeds you know
0:35:43so if you are able to quantify this aspect of what is called entrainment then that'll be
0:35:48useful you wanna know whether we can build signal models that actually try to do this
0:35:54then another point is when people look at you know these are you know a particular behaviour apartment looking for
0:36:00it
0:36:01different experts even you know trained people look at it differently you know and they respond to different portions of the
0:36:07data
0:36:08so you wanna know how we can actually capture this data-dependent human diversity in behaviour processing
0:36:18in our models
0:36:19so doing simple plurality or majority voting based you know machine learning techniques might not necessarily work well for
0:36:27these kinds of abstract
0:36:29processing
0:36:30so the first thing is like you know the easiest thing like we had the language and acoustic information to
0:36:35work together you know of course it's gonna do better yeah at least that's all these expressions a rate including
0:36:41ours
0:36:42and one reason was our asr really was bad
0:36:47because we needed to build these language models for the couples domain but what was encouraging is that
0:36:53even with something like a thirty five percent word error rate asr the information from the
0:37:00from the language models
0:37:03from the you know lattices that we generated and acoustic based classifiers you know put together provided a fairly
0:37:11decent sort of prediction of these codes and psychologists were very excited about that
0:37:17but what we did was to actually make it more multimodal we really need to have information about the nonverbal
0:37:23cues quite a bit so we rigged up our lab you know with a couch
0:37:27for the therapy
0:37:28and several microphone arrays and you know
0:37:33synchronise with about ten htk emerson about that well a motion capture camera to provide data of the sort so
0:37:39it's very useful to do more sort of a careful study of human
0:37:43vocal and nonverbal behavior interactions
0:37:46so here's data like this
0:38:05oh
0:38:06so goes the conversation so you can do a lot of things yeah since we are collecting data in a
0:38:10week and a rice and you know localise and do things of that sort quite well and
0:38:15so we asked some questions like okay
0:38:18describing approach avoidance behaviour which is very important so we need side of course you about
0:38:25in the couple interaction this guy was leaning back quite a bit and you know in effect expresses displeasure in the interaction
0:38:31very subtle cues just like those folks that come on cnn body language experts right we tried to do
0:38:36this
0:38:37with signal processing
0:38:40so approach avoidance is actually no moving toward or away from events or objects
0:38:46and it actually is related in psychology theory to you know emotion motivation and particularly in the couples domain relationship
0:38:53commitment
0:38:55so people are very interested in whether we can quantify that using you know vocal and you know visual cues can we
0:39:01actually predict or model this
0:39:02so that was a problem that we took on we said okay we can pose this as you know we had psychologists
0:39:09rate this on an ordinal scale you know minus four to four a scale of nine
0:39:14and we posed this as sort of an ordinal regression problem basically broke it down into a series of sort of
0:39:20binary classifiers one versus the others one and two versus the others and then we put a logistic
0:39:27regression model on top of that
0:39:29with these multimodal features both you know acoustic and visual features
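The decomposition of an ordinal rating into a series of binary problems can be sketched roughly as below. This is an illustration under stated assumptions, not the setup used in the talk: a toy nearest-centroid rule (`CentroidBinary`, a hypothetical name) stands in for the binary classifiers, and the logistic regression fusion layer mentioned by the speaker is replaced by simply counting how many thresholds a sample is judged to exceed. The three-level scale and the 1-D data are invented.

```python
import numpy as np

class CentroidBinary:
    """Toy stand-in binary classifier: predict the class with the nearer mean."""
    def fit(self, X, y):
        self.mu_pos = X[y == 1].mean(axis=0)
        self.mu_neg = X[y == 0].mean(axis=0)
        return self

    def predict(self, X):
        d_pos = np.linalg.norm(X - self.mu_pos, axis=1)
        d_neg = np.linalg.norm(X - self.mu_neg, axis=1)
        return (d_pos < d_neg).astype(int)

def fit_ordinal(X, y, levels):
    """One binary problem per threshold k: 'is the rating greater than k?'"""
    return [CentroidBinary().fit(X, (y > k).astype(int)) for k in levels[:-1]]

def predict_ordinal(models, X, levels):
    """Count the thresholds each sample exceeds and map that count to a level."""
    votes = sum(m.predict(X) for m in models)
    return np.asarray(levels)[votes]

# Invented 1-D feature and a three-level ordinal scale (assumption).
levels = [-1, 0, 1]
X = np.array([[0.0], [0.2], [1.0], [1.2], [2.0], [2.2]])
y = np.array([-1, -1, 0, 0, 1, 1])

models = fit_ordinal(X, y, levels)
pred = predict_ordinal(models, X, levels)
print(pred)
```

The point of the threshold formulation is that the ordering of the labels is used during training, which a plain multi-class classifier would ignore.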
0:39:34so computer vision is tough so we just took the motion capture data in lieu of actual video data
0:39:41so that we could get very clear you know measurements of you know head body orientation you know the
0:39:47folding of arms or how much they're leaning and so on so at least to get an upper bound
0:39:51idea of you know what kind of visual features are important to measure
0:39:55approach avoidance
0:39:56and the usual audio features that i don't need to tell you guys about
0:40:00pitch and mfcc and all that stuff
0:40:03so interestingly you know we showed that actually this ordinal formulation this that's published by a vector and one other
0:40:10students matlab
0:40:12i guess
0:40:13that ordinal formulation was actually very helpful instead of just formulating it as a plain old sort of classification problem
0:40:21and the chart here shows actually the difference between using an ordinal svm versus just a plain
0:40:26old svm
0:40:27and lower is better it means that's just the difference in the error rates so with audio video labels it's actually
0:40:35better so
0:40:36but again multimodal in all of this again it's a preaching to the choir type of thing it's important but we
0:40:42can actually use these audiovisual cues to measure something like this
0:40:47what psychologists perceive as approach avoidance behaviour that was great
0:40:53so the point so far is that you know a multimodal approach is important
0:40:58the next sort of computational thing i wanna share is this whole notion of okay they often make these
0:41:04sort of gestalt judgements on data and you wanna know what led to it or from a
0:41:12pure learning point of view
0:41:14how to make it more or less that is how do you choose sample the data says that you can
0:41:18maximise the i-th here's you can post
0:41:21two different ways
0:41:22so i will show a little study here
0:41:26so we used multiple instance learning again using this case study of this behavior interaction of these couples to
0:41:33say well can we identify speaker turns
0:41:36that are salient you only have session-level codes so you have a ten minute long session husband and wife
0:41:42you know taking turns you know talking about whatever they're talking about and we have
0:41:47a rating so you wanna know which of these turns would most explain that observed rating okay that's the problem
0:41:56so as usual right you're extracting all features from the signals and you want to identify turns that make the
0:42:04difference so we used an approach called you know diverse density based svm for doing this and the
0:42:11mil problem
0:42:13is as follows so
0:42:14very simple idea so you have this whole notion of bags right positive bags so high blame sessions low blame sessions
0:42:22high acceptance low acceptance sessions you have data from that
0:42:25so you
0:42:27you create your feature space here so acoustic feature space
0:42:31then you build this diverse density and select these local maxima assuming that they must be the prototypes from
0:42:36your data and then when you're ready to kind of evaluate an incoming session you compute the distance
0:42:45minimum distance to these prototypes and use those
0:42:48as your features rather than all the all the
0:42:51simple idea
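The last step described above, turning a whole session into a short feature vector of minimum distances to the selected prototypes, can be sketched as below. This is only the feature mapping; the diverse density search that picks the prototypes is not shown, and the prototype values, the 2-D features and the function name `bag_to_features` are assumptions made for illustration.

```python
import numpy as np

def bag_to_features(bag, prototypes):
    """Map a bag (n_turns x d) to one vector: min distance to each prototype."""
    # Pairwise differences between every turn and every prototype,
    # shape (n_turns, n_prototypes, d).
    diffs = bag[:, None, :] - prototypes[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # For each prototype keep only the closest turn in the session.
    return dists.min(axis=0)

# Hypothetical prototypes, assumed to come from a diverse density search.
prototypes = np.array([[0.0, 0.0], [3.0, 4.0]])

# One session = one bag of turn-level feature vectors (invented numbers).
session = np.array([[0.0, 1.0], [3.0, 0.0], [3.0, 4.0]])

features = bag_to_features(session, prototypes)
print(features)
```

The resulting vector has one entry per prototype regardless of how many turns the session has, so sessions of different lengths can be fed to the same classifier, and the turn achieving each minimum is a natural candidate for a salient instance.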
0:42:53so the features that we considered were you know lexical features here for example i put this table here
0:42:59just to again point out that not only are
0:43:04you know lexical items important but things like you know fillers and nonverbal vocalisations
0:43:10seem to pop up quite a bit by information gain based selection so they are important for these kinds of
0:43:15behaviour
0:43:17signal processing stuff
0:43:18and so we had all these different informative
0:43:22features
0:43:25and created a feature vector per session based on the diverse density
0:43:29and here are some results for the acceptance problem so we could show that with these mil selected
0:43:37features versus all the features so not only you know are
0:43:41the selected features you know
0:43:45sort of meaningful but they also kind of boosted the performance so the way we interpreted that is these are sort
0:43:50of reasonable ways of selecting these
0:43:53salient instances that's our definition of saliency through discrimination
0:43:58but when we added intonation features for this problem at least for some of these constructs it didn't really help
0:44:03maybe the way we added these intonation features as
0:44:07as contours probably wasn't right or maybe they don't bear any information for these behaviour constructs
0:44:13so and that was true for this and the multiple instance based learning was true for many of
0:44:18these behavioural descriptions we were looking for and that was encouraging
0:44:23but what we haven't done
0:44:25is that you know we haven't really validated whether these sort of
0:44:30machine hypothesized instances are in fact something consistent with what humans would do if we asked them
0:44:39if they're salient or not
0:44:41so one of the things i'm interested in doing is how we can actually do human experiments
0:44:48which are underway to make this part of active learning you know so the machine proposes certain things humans
0:44:53can either correct or not
0:44:54and so on that's interesting stuff
0:44:56and you could throw in other features also
0:44:59so
0:45:00the next topic i want to talk about again moving along this line of getting more abstract this
0:45:06is modeling of entrainment
0:45:08so entrainment you know
0:45:10kind of refers to or is also called interaction synchrony this naturally occurring you know coordination between
0:45:17interacting
0:45:19interacting people at multiple levels and along multiple communication channels
0:45:24so if you were at interspeech this year you know julia hirschberg gave a fantastic talk on this
0:45:29local lexical entrainment all
0:45:32so people have been hypothesizing that this is needed humans use this to achieve efficiency in
0:45:37communicating and you know and
0:45:40increasing mutual understanding and so on it's been extensively studied in psychology and psycholinguistics
0:45:47so what we want to try to see is that okay you have these kinds of we hear buttons
0:45:52well measurements of these sorts a set of things
0:45:56can be it can it be done and can didn't want these high level sort of behaviour characterization that yeah
0:46:02so
0:46:04but the thing is here you can't really ask human annotators hey are these people entraining or not
0:46:09it's very difficult to do particularly for vocal and other sort of signal cue based things
0:46:14and also unlike many places where they measure synchrony you know they have signals and then you can do all mutual
0:46:20information correlation measures
0:46:21here because of the turn-taking structure right really you know things are not aligned in time so we have to think
0:46:26about other clever ways of computing this
0:46:29and of course it's also directional how much i am entraining towards you is not necessarily the same as how much you are entraining
0:46:34toward me so
0:46:36you know what we tried to figure out is how to compute how much two people sound alike in the vocal
0:46:42entrainment case
0:46:44as usual so measure acoustic features well tell you what about it
0:46:49what we have maxed in german that here was to actually construct what we call these pca
0:46:55vocal characteristics spaces and then measure similarity between these spaces or project the data onto these spaces to find some
0:47:01similarity measure that was the basic idea
0:47:05so features are as usual you know a pitch and frequency loudness and spectral features for vocal data at the
0:47:13word level
0:47:14and pca spaces are constructed both at the level of the turn and the level of the whole session so
0:47:21we have that
0:47:22and then you can calculate various similarity measures
0:47:25so you're basically doing pca means you're transforming
0:47:30to a different coordinate space so these components are supposed to be aligned if they're similar so measuring the angle cosine
0:47:38you know would give you some notion of a similarity metric you can weight those components by the variance
0:47:45and you can use that as one kind of similarity metric
0:47:49or you can project these data onto these pca spaces and calculate like a likelihood
0:47:56and calculate a number of different similarity metrics
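One way to read the angle-based measure described above is as a variance-weighted cosine between matching principal components of two speakers' feature spaces. The sketch below is an interpretation under assumptions, not the published measure: the pairing of components by rank, the use of speaker A's variance ratios as weights (which makes the score directional), and the toy 2-D data are all choices made here for illustration.

```python
import numpy as np

def pca(X, n_components):
    """Return principal directions (rows) and explained-variance ratios."""
    Xc = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2
    return vt[:n_components], (var / var.sum())[:n_components]

def weighted_cosine_similarity(X_a, X_b, n_components=2):
    """Variance-weighted |cosine| between rank-matched components of A and B."""
    comps_a, w_a = pca(X_a, n_components)
    comps_b, _ = pca(X_b, n_components)
    # Components are unit vectors, so a dot product is the cosine of the angle.
    # The absolute value removes the arbitrary sign of each component.
    cosines = np.abs(np.sum(comps_a * comps_b, axis=1))
    return float(np.sum(w_a * cosines) / np.sum(w_a))

# Invented word-level features for two speakers (assumption): speaker B's
# variation lies along the axis orthogonal to speaker A's.
X_a = np.array([[-2.0, 0.0], [2.0, 0.0], [0.0, -1.0], [0.0, 1.0]])
X_b = X_a[:, ::-1]

print(weighted_cosine_similarity(X_a, X_a))
print(weighted_cosine_similarity(X_a, X_b))
```

Because the weights come from only one of the two speakers, swapping the arguments can change the score when the variance profiles differ, which is in the spirit of the directional entrainment mentioned earlier.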
0:47:59and then you ask questions hey what does this mean
0:48:02so first thing is we thought well as a sanity check you know for real dialogues basically hopefully there
0:48:07must be some difference that these measures reflect against artificial dialogues
0:48:11so we constructed artificial dialogs from these things you know randomized data from other people and created that
0:48:17and just to just to sanity check to make sure that you know like these measures
0:48:22separate these things out you know it doesn't tell you it's entrainment or not but at least tells you you know
0:48:28it's something reflecting real dialogues so that was first
0:48:32the second is you know this is what we sort of reflected on the literature in the psychology
0:48:38domain where they feel that entrainment is actually a useful tool to
0:48:43provide flexibility in you know these couple interactions
0:48:46so in fact people think it's a precursor to you know empathy and so on so you wanna
0:48:51say that you know entrainment is more in positive sort of interactions than in negative interactions
0:48:57so that was sort of indirectly a way of trying to see these entrainment measures
0:49:02so
0:49:03and encouragingly these measures just these entrainment measures right these similarity measures as features
0:49:09were able to you know in a statistically significant way distinguish between these positive and negative interactions
0:49:16that was very you know encouraging so of course you immediately want to build a prediction model and so
0:49:22we put these features in a factorial hmm model and tried to see just using these entrainment features
0:49:29nothing else how well can you predict how negative or positive that interaction was
0:49:36so
0:49:37we could do what you know
0:49:39quite better than chance of stance to present gonna such diverse that's pretty in great
0:49:44again
0:49:46here again there are open questions all this is just a small look at what is a pretty tough problem with
0:49:51a lot of open questions you know how we can actually show entrainment across modalities you know
0:49:59and how do you actually do this in a very dynamic framework what are other different ways of quantifying this
0:50:05and how to actually evaluate it better than just doing this indirectly there are lots of very open both theoretical and you
0:50:11know computational questions
0:50:14finally i wanted to quickly say you know that
0:50:17you know human annotators are the reference in a number of cases
0:50:21and often times we do fusion of various sorts you know whether human classifiers or machine classifiers
0:50:26and
0:50:27we
0:50:28rely on diversity of these classifiers so that when you integrate them you get better results
0:50:35so what we wanna know is how actually we can build mathematical models that reflect this diversity in people so
0:50:41for example you know people have studied reliability weighted you know data-dependent classifier models
0:50:48and they show that that's
0:50:50better than just doing simple plurality
0:50:52and i my student card they did some work on actually modeling this you know in an em framework and
0:50:58very encouraging
0:50:59so the point is you know these days you hear a lot of different things about the wisdom of crowds
0:51:04and you know the wisdom of experts and all these things really i think particularly for modeling abstract things we
0:51:09have to bring
0:51:11explicit models of the evaluators into
0:51:14the
0:51:16the classification problems the learning problems
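The contrast between simple plurality and reliability-weighted fusion of annotators can be sketched as below. This is a deliberately reduced illustration: the reliability weights are fixed, invented numbers, whereas the talk refers to data-dependent reliabilities estimated in an EM framework, which is not shown here. The function names are hypothetical.

```python
from collections import defaultdict

def plurality_vote(labels):
    """Every annotator counts equally."""
    tally = defaultdict(float)
    for label in labels:
        tally[label] += 1.0
    return max(tally, key=tally.get)

def weighted_vote(labels, reliabilities):
    """Each annotator's vote is scaled by a reliability weight."""
    tally = defaultdict(float)
    for label, r in zip(labels, reliabilities):
        tally[label] += r
    return max(tally, key=tally.get)

# Three annotators rate one session; the weights are assumptions.
labels = ["high", "low", "low"]
reliabilities = [0.9, 0.3, 0.4]

print(plurality_vote(labels))
print(weighted_vote(labels, reliabilities))
```

With these numbers the two rules disagree: two less reliable annotators outvote one reliable annotator under plurality, but not once the votes are weighted.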
0:51:20so
0:51:21so these are just you know some of the challenges that i just mentioned you know while attacking these types
0:51:26of behavior questions there are many others but i just wanted to give a feel for it
0:51:30so very quickly you know i know that frank is showing his time thing
0:51:35i wanna share some things about the autism field just a few slides
0:51:40so autism as you know it's like you know we've been hearing a lot about in the news
0:51:44lately the statistics one in wanting to children were diagnosed and so on so you're asking what can technology
0:51:53do here particularly you know people working in speech signal processing and related areas
0:51:58one we can develop computational techniques and tools to help better understand you know these various you know
0:52:03communication and social patterns in children one of the biggest hallmarks
0:52:07is
0:52:08difficulties in social communication and prosody
0:52:12perhaps to better define and quantify these kinds of felt sense five seconds
0:52:17and the second thing is of course building interfaces that can elicit and encourage specific social communication behaviour
0:52:24for example so it is important to pursue these kinds of questions so we've been collecting data of child
0:52:30psychologist interactions that will be about
0:52:33ninety kids to date and you know transcribed and both audio video data
0:52:38and you can ask questions of various sorts with these types of data
0:52:43in dallas
0:52:44so in these interactions the psychologists you know interact with the child and rate the child along a
0:52:50number of dimensions you know everything about you know showing empathy shared enjoyment the prosody and so on
0:52:58and we looked at very simple measures of just what we could do on these interactions like how much
0:53:05speech is spoken by the child relative to the psychologist
0:53:08tells you about the codes that are provided very interesting like what you know that thirty three no ratings that
0:53:14psychologists provided were explained by these just simple measures
0:53:19it's very interesting because it's observation based
0:53:22and this can be done sort of you know consistently is that
0:53:25two
0:53:26the other thing is speaking rate so just look at you know normalized speaking rate that explains other codes
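The two simple measures mentioned here, the child's share of talk relative to the psychologist and speaking rate, can be sketched from a list of time-stamped turns. The turn format `(speaker, start_sec, end_sec, text)`, the function names and the example dialogue are assumptions for illustration; the normalization of speaking rate mentioned in the talk is not specified there and is not reproduced.

```python
def talk_share(turns, speaker):
    """Fraction of total speaking time that belongs to `speaker`."""
    total = sum(end - start for _, start, end, _ in turns)
    own = sum(end - start for who, start, end, _ in turns if who == speaker)
    return own / total

def speaking_rate(turns, speaker):
    """Words per second for `speaker`, pooled over that speaker's turns."""
    words = sum(len(text.split()) for who, _, _, text in turns if who == speaker)
    secs = sum(end - start for who, start, end, _ in turns if who == speaker)
    return words / secs

# Invented example session (assumption: not from the collected corpus).
turns = [
    ("psychologist", 0.0, 4.0, "what did you do at school today"),
    ("child", 4.0, 6.0, "we played outside"),
    ("psychologist", 6.0, 8.0, "that sounds like fun"),
    ("child", 8.0, 10.0, "yeah it was"),
]

print(talk_share(turns, "child"))
print(speaking_rate(turns, "child"))
```

Both quantities are computed from observable timing and transcripts alone, which is what makes them repeatable across sessions.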
0:53:31so
0:53:32even with simple techniques that you have in hand and with the kinds of behaviour constructs people are interested in you can
0:53:38actually provide tools to support these steps
0:53:42of course you can also use these dialogue systems and you know interfaces that a number of colleagues are developing
0:53:48to actually elicit
0:53:49interactions in a very systematic and reproducible way
0:53:53because human interaction is you know sort of variable because psychologists even though they're doing structured interactions are not gonna
0:53:59be the same
0:54:00and we want to see whether children in fact interact naturally with these kinds of characters
0:54:06and if we built that thing with cslu toolkit was robust it and we're in creating we had a number
0:54:11of different emotional reasoning games storytelling and so on like this no principle
0:54:17oh
0:54:20i
0:54:20oh
0:54:21yeah
0:54:24and so on so that i'll they don't wear is price we have collected data no each child came or
0:54:28four times four hours each of what the of fifty hours of data
0:54:32think it
0:54:33and very encouraging we could actually see that they we extracted as they would contract the parents how the parents
0:54:39interaction change to be a physiological data so a lot of very interesting questions we could do that we can
0:54:44measure speech
0:54:45these parameters language that parameters visual things
0:54:48and that and so a lot of interesting questions to supplement what people are doing otherwise so a number increased
0:54:54by that possible yeah i'll cut the slides there so anyway so it's for some other time what i want
0:55:00to summarise at this point is to show that you know
0:55:03what i showed with a couple of examples there are so many open challenges in these domains you know where a community
0:55:09like
0:55:10ours can no doubt contribute everything from you know robust capture and processing of these multimodal signals to actually deriving
0:55:18appropriate representations for computing
0:55:22and you know doing signal processing you know what kind of features you know feature engineering some that are data-driven
0:55:29some that are inspired by human-like processing
0:55:32different modeling schemes mathematical schemes that can bring some quantitative sort of insight to these kinds of
0:55:37very subjective types of human based assessments
0:55:40to actually you know helping with the questions of data privacy issues
0:55:46so lots of interesting possibilities
0:55:48in our lab you know we've been fortunate to work on you know a number of different mental health domains in
0:55:52fact i just touched upon one here and a little bit on autism and so that's why like
0:55:57blocks
0:55:58but there's lots more one could talk about the here like it's fascinating area
0:56:03so in conclusion
0:56:05you know human behavior can be described you know the same people interacting
0:56:11two different sets of people can describe the same thing from different perspectives depending on what they want to look for
0:56:17so that offers a lot of both challenges and opportunities as far as r and d
0:56:23computational advances you know in sensing processing modeling all of it but i think what's most exciting for me is this
0:56:30opportunity for interdisciplinary sort of collaborative scholarship
0:56:34here
0:56:35and so in sum
0:56:37obviously behavioral signal processing you know on the one hand helps us do things that people know how
0:56:43to do well perhaps more efficiently consistently
0:56:46but what is tantalising is that you know we can actually provide you know new tools and data
0:56:53to offer insights that we haven't had before so i think that's the exciting part here
0:56:59so i'd like to thank you and all my collaborators there are like hundreds of them who helped this work
0:57:05and who supported it and my sponsors
0:57:08so with that i'll conclude and i'll show you something funny since it's the holiday season
0:57:14the feedback
0:57:28yeah
0:57:39this was actually if it was wrapper
0:57:41so i convinced him to good don't ask and two
0:57:43right but
0:57:45you can be busting
0:57:47so thank you again
0:57:56yeah thank you very much for this very interesting very enlightening talk we have something like four minutes for questions
0:58:02so i would like to open the floor
0:58:09a question for multimodal signal processing a logo like as we know some people for the more formally
0:58:20oh no we use a
0:58:23also market like a the comparable comfortable distance for the communication but different
0:58:28proxemics you mean yeah yes and you know in fact that
0:58:32the
0:58:33body language data i showed sort of very quickly of these actors doing it so we have distance measures
0:58:39both that are estimated from video but also from motion capture
0:58:44we have a couple of papers i'd like to share on this body language business and how that would reflect
0:58:51can tell you something about
0:58:54i think the dynamics of interaction and
0:58:58proxemics is also sort of a feature in our
0:59:01approach avoidance
0:59:02as when they're trying to come together or move away
0:59:06that in fact a little flip actually just a little mowing rushing away from the center of that interaction
0:59:12well that's you know culturally a very important question i think what you're alluding to is what are the
0:59:17cultural sort of underpinnings of these types of features and how to demonstrate
0:59:22we haven't had data from different cultures in these studies except what we have
0:59:28in the autism data we have data from kids growing up in you know families in los angeles los
0:59:35angeles is very multicultural
0:59:37and
0:59:38we have some data but we haven't had enough information to marginalise those effects yet
0:59:45so the only thing we have
0:59:47body language that things are but the actors
0:59:49so far
0:59:51sense
0:59:54do we have another question
0:59:58okay so well i have a question
1:00:01you mentioned very briefly crowd sourcing so i'm kind of interested in what's your view on what kind of
1:00:06role crowd sourcing could
1:00:08play here especially where there are a lot of subjective measurements and so on
1:00:13yeah so we used it you know for more obvious things right things like transcription or judgements of things
1:00:21that are defined better
1:00:22i
1:00:23ask people for ratings that's easier but what i'm finding difficult is to define these abstract tasks for ratings from
1:00:31a lot of people
1:00:33we're trying right now to do sarcasm
1:00:37sarcasm and snarkiness and stuff
1:00:41where we're trying to see if we can use the wisdom of crowds but at least
1:00:45the biggest challenge is to see how we can
1:00:47partition these cards so that the kids are from people that won't be so we put all these questions one
1:00:53but for behaviour processing the bigger challenge is that all these data are very protected by all kinds of restrictions so
1:01:01we can't farm it out to do crowd sourcing types of things but the actors data we are able to
1:01:07do things so
1:01:09but we still haven't figured out how to do more abstract things because we have to make
1:01:15this concept be internalised by the people that are annotating
1:01:19so simpler tasks are easier to do i think
1:01:25okay
1:01:26is there any more questions from the floor
1:01:34that was a great thank you
1:01:36so a couple years ago julia hirschberg gave a really interesting
1:01:41summary overview of what's being done on detecting lying
1:01:46with obvious applications of course
1:01:49and one of the main conclusions is that in fact
1:01:54with detecting lying you really need to
1:01:57i know the price is there anyway
1:01:59if you don't so it's still
1:02:02it's a it's a step beyond
1:02:04the earlier question about contradiction
1:02:07and i wondered if you've come across any evidence for this thing with the kind of
1:02:12data you're looking at
1:02:14you know in fact this is actually a very important question how we can actually individualise personalise in
1:02:19fact that's one of the i believe strong points of computation
1:02:25if we have enough data we can actually learn particular specific patterns of an individual fairly well
1:02:33in fact in autism right that's actually what people always talk about autism is very heterogeneous right
1:02:40because the symptomatology varies a lot across children but with the children to they are actually very depending on
1:02:46con
1:02:47but the way that they present themselves is fairly individual specific there are gaps and there are strengths in
1:02:54every individual
1:02:55and you can learn that from data fairly well these patterns over time which you would not necessarily get from
1:03:01these forty five minute sets of interactions with a therapist you know or a clinician
1:03:07i do believe that
1:03:09the ability to be able to individualise models you know that people talk about adaptation of bigram
1:03:16modeling all these things all these techniques actually lend themselves
1:03:22so
1:03:23culture cultural aspects are you know slightly harder not because we can't try but because it's very hard to collect
1:03:30data in a systematically controlled way so you can say this is because of that and not this but
1:03:37individual level models are easier i believe and
1:03:41in fact that's why one of the things we did with these
1:03:44computer character based interactions was to bring the same child over and over again because they love to interact with computer characters
1:03:51and have dialogues with these characters
1:03:53and that
1:03:55so we have several hours of data from the same child and you also have them interact and the parents
1:04:00and with the unknown person like sort of randomly persons like also you have these human interaction family run for
1:04:07my personal and human computer interaction
1:04:09you can kind of actually trying to start beginning do a characterises child fairly well would be a real the
1:04:17lexical use their you know what kind of initiative at you know initiatives one because things in that
1:04:23we can we can begin to do even with that simple little speech and entropy ideas we you know we
1:04:29can bring to the table
1:04:31but lying and stuff i don't know
1:04:34i but i'm acting on it
1:04:36yeah that's like killing your times people to okay speaker again
1:04:41thanks