It's my great honour and pleasure to introduce our distinguished invited speaker today, Shri Narayanan, who will talk about behavioral signal processing.
Shri is the Andrew Viterbi Professor at USC. His research focuses on human-centred information processing and communication technologies.
I was very impressed to see that he holds professor appointments not only in electrical engineering and computer science, but also in linguistics and psychology.
I don't live in the US, but someone told me that he is also a regular guest on US television.
So please help me welcome Shri Narayanan; we are really looking forward to the talk.
Thank you.
Right. I'm really honoured to be here, and it was great to see a lot of friends of mine I haven't seen in a long time, to come back to speech, at least to check it out.
They were asking me, you know, what crazy, fringe-y, funny things I've been up to, so that's this talk today.
The only little problem I have with this is that I haven't done very much on this topic yet, but I will share whatever we've been up to in the last couple of years, and hopefully I won't disappoint and it will be worth your time.
So the title is behavioral signal processing. I will momentarily define what I mean by that; if I'm going to use this term, I ought to at least say what it is.
so
So this work concerns human behaviour, which, as we all know, is very complex and multifaceted. It involves very complex and intricate mind-body relations, is affected by the environment and by interactions with other people, and is reflected in how we communicate and emote, in our personality, and in how we interact with other people.
It is also characterised by the generation and processing of multimodal cues, and it often characterises typical and atypical states, disorders, and so on.
So one wonders, you know, what is the role of signal processing, or of signal processing people, in this business?
If you look across a number of domains, behaviour analysis, either explicit or implicit, is essential. Starting from customer care: you want to know whether a person is frustrated or satisfied with the service that has been rendered, whether you want to sell more things or understand behaviour at the level of an individual or a group, and so on.
In learning and education, not only do you want to know whether someone got a particular answer right or wrong, you want to know how they got it and how confident they are. Personalised learning is one of the grand challenges of engineering, and to get there we have to understand behaviour patterns and the like.
But more importantly, and something I've developed an increasing passion about, is this whole area of mental health and wellbeing, which I'll try to touch on today with a couple of my examples, where behaviour analysis figures very centrally, whether observation-based or through other means.
When you look across these domains, while computational tools are used, the analysis is still mostly very human-based. So before we go further, let me show some videos as examples of the typical problems one could ask about.
In this first one you're going to see kids playing with, and actually talking to, a computer game. The question is: can we tell something about the child's cognitive state, whether they are confident or not? So let's look at this little girl.
[video plays]
Can we unmute the audio, please? Alright, let's try again; hold on, I checked this many times. Let's see... okay.
[video plays]
So just looking at those, we see that there are vocal cues, the language the children are using, and visual cues like looking around and looking away, from which you can say something, or at least tell that these cases are different. One of the questions we ask is: can we actually, formally, address these problems of measuring speaker state?
The next example is from marital therapy, classic couples counselling. What you're going to see is a couple interacting. The people in this field, the psychologists doing this kind of research and the people actually trying to help these couples, look for a lot of things: characterising aspects of the dynamics, looking at who is blaming whom, trying to figure out what is going on, and planning treatment based on that. So let's look at this video.
Should I try it again?
[video plays]
One more example. This one is from the autism domain, where a clinician is actually interacting with a child in a sort of semi-structured interaction following a particular diagnostic instrument. The clinician is engaging the child and trying to figure out a number of things, everything from prosody to characterising how the child responds when asked something.
[video plays]
So what you probably observed is that, although there were clear places where the child could have chatted back or looked at the person, nothing was happening; the child just kept doing the task, not swaying toward the interaction, with atypical prosody and so on. These things are rated on scales that have been developed, and I'll talk a little later about some of them; for now I just want to convey the idea.
All of these analyses, as you can see, are very observation-based, with people looking at multimodal cues and trying to render a judgment.
So these human behaviour signals provide a window into high-level processes, and what you can see depends on how big or small that window is.
Some cues are overtly observable, like vocal and facial expressions and body posture; others are covert, and we don't have direct access to them, though they are nonetheless measurable in special cases: things like heart rate, electrodermal response, or even brain activity. And this information arrives at different time scales for these different cues.
The ability to process, interpret, and decode these signals can give us insight into, and understanding of, mind-body relations. But there is also, importantly, how people process other people's behaviour patterns; that's a fine distinction between how behaviour is generated and how it is perceived and processed.
So the measurement and quantification of these kinds of human behaviour, from both the production and the perception perspectives, is a fairly challenging problem, I believe.
So here's my operational definition of what I call behavioral signal processing: it covers the computational methods that try to model human behavioral signals that are manifested in overt and/or covert cues and processed by humans, explicitly or implicitly, and that eventually help facilitate human analysis and decision making.
The outcome is behavioral informatics, which can be useful across domains, whether to inform diagnostics, to plan treatments, or to power an autonomous system that does personalised teaching, and so on.
In all of these, what behavioral signal processing tries to do, at varying levels, is to quantify this human "felt sense".
That, as you might imagine, is challenging along a lot of different dimensions, and I'll try to impress at least some of those upon you.
If you think about it, technology has already helped in this domain quite a bit, and a big reason is that it can rely on the significant foundational advances that have been made in a number of domains, things that have happened and been discussed deeply at this conference: audio and video diarization and segmentation; speech recognition and understanding of what was spoken; the kind of visual activity recognition we heard about earlier, everything from low-level descriptions like head pose orientation to complex classification of human activity; and physiological signal processing.
The difference is that, using these as building blocks, what you want to do is map them to more abstract, domain-relevant behaviours, and that demands new multimodal modeling approaches.
People have already started to work on this, solving various parts of the puzzle.
Starting from sensing: people have been trying to work out how to measure human behaviour in an ecologically valid way, that is, without disturbing the process we're trying to measure, from instrumenting environments with cameras, microphones, and other devices, to actually instrumenting people with sensors, body-computing types of techniques.
In speech, increasingly, people are doing richer and richer processing of not only what's been said, but by whom, and how.
In affective computing, you see a lot of papers being published. And there's the domain of social signal processing, about modeling individual and group interaction, turn-taking dynamics, non-verbal cue processing, and so on. These are all essential building blocks for behavioral signal processing.
so
In summary, these are the ingredients. People are working in signal processing on acquisition: how you acquire these signals and build these types of systems in a meaningful way. There are many dimensions to that; the kinds of behaviour you want to track might not happen in a clinic, so you might want to do it "in the wild", so to speak: in playgrounds, in classrooms, at home, for example monitoring behaviour patterns of the elderly. There's also body computing, and there are lots of interesting signal processing challenges there.
Then there's analysis: what features tell you more about the particular behaviour patterns of interest, and how do you compute them robustly, the usual questions we ask about noise and so on.
And, more importantly, there's modeling of the behavioural constructs as described by the domain experts, providing both descriptive and predictive capability.
This is not easy, because, for one, the observations of these behaviour patterns carry large amounts of uncertainty and are at best partial.
There is also the question, which came up in the computer vision talk, of representations: what are the representations that we have to define to compute these things in the first place? That talk mentioned an experiment where they showed visual scenes and asked people to describe them. Imagine now a psychologist observing a couple interacting: one of the things we're looking for is how they describe it, before we even set out to map observable cues to some representation. That itself is a first-class research problem: what kinds of representations should be specified?
And given that we are talking about human behaviour, there is vast model heterogeneity: differences in behaviour patterns over time and across people, and variability in how these data are generated and used.
so
So what do people do in each of these domains? I'll show you some examples. They have their own specific constructs. For example, in language assessment, or in a learning situation, say literacy: when they try to figure out what kind of help a child needs when learning to read, they're looking at not just whether the child is producing a particular sound right or wrong; a number of other things come into play. Disfluencies, in fact the rate of disfluencies, play an implicit role, as we found when we did some experiments.
In health, for example pediatric obesity, not only are they monitoring physical activity but also emotional state, and they want to model food decision making, and so on.
There are a lot of common features across these, because, after all, the kinds of sensing we have access to are limited: we have audio, microphones, video, and maybe some physiological sensors. So the approach tends to be, at least at the signal level, the same.
But the important part is to see how the human experts observe these signals, to learn from that, and to see how we can augment their capabilities.
That's why I think one of the hallmarks of the way I look at behavioral signal processing is that it provides supporting tools that help the human expert, and does not aim at total automation, at replacing what they're doing; I think that would probably not be the most beneficial thing to do.
so
Pictorially, if you look at this chart, this is what happens today: a human expert observes the phenomena of interest, say a child interacting with a teacher. They gather a lot of data, listen to and look at the child, see how confidently the child is reading, make some judgments, and provide appropriate scaffolding or intervention.
What we're saying is that signal processing, machine learning, and other computational tools can come in handy: first, by trying to decode what the human experts do, learning the features they use, explicitly or implicitly; then, by building models that can help with some of these predictive capabilities. Certain things are beyond human processing capabilities, for example fine pitch dynamics, or comparing what happened at the beginning of a session with the end of the session; some things computational models can do better. These models can provide feedback, and hopefully the two can reinforce each other nicely, with the outcome being a kind of behavioral informatics. So that's the idea here.
so
With that background, what I'm going to do for the rest of the talk is quickly run through some of these building blocks, but mostly focus on a couple of examples: one from the marital therapy domain, and then, quickly, one from the autism domain, just to highlight some of the possibilities and the challenges that exist.
so
As I mentioned already, lots of work is happening in multimodal signal acquisition and processing: everything from smart rooms and instrumented spaces to actually instrumenting people, to sense a lot of different things, sensing the user and sensing the environment in which things are happening, because context becomes important, and doing this in a variety of locations, from the laboratory to classrooms, clinics, playgrounds, and so on.
One of the important things we learned is that, depending on the environment, there are lots of constraints that come into play. For example, when we do our work at the hospital with kids with autism, there are real restrictions on where we can place cameras and where we can put the microphones. Nothing can interrupt what's happening there: the psychologist maintains a certain structure for the child, because these children are sensitive to certain things and find them distracting, and so on.
So even though we'd like to capture the 3D environment with, say, ten or fifteen cameras, it's just not possible. We have to work with these kinds of restrictions, and hence the robustness issues in audio processing, language processing, and behaviour processing are real; we can't just solve them with better sensing.
Likewise, in sensing people, we can do a lot of different things, but we have to worry not only about the technological constraints but also about the corresponding ethical and privacy constraints. So it's a challenging area.
Here are two actors in a dyadic interaction.
We've also been collecting data using actors to study behaviours, in addition to working with actual populations, because there are certain things we can do in the lab, with data we collect, that go hand in hand with the rest. One resource is a multimodal motion capture database of dyadic interactions, with a lot of different emotional content that has been annotated and rated; if you're interested, look it up.
Likewise, using actors, we've been collaborating with people in the theatre school on full-body dyadic interactions. In each of these cases the scenarios were chosen to be rich enough to cover the entire gamut, from the actors playing Shakespeare and Chekhov to doing improvisation, giving audio, video, and motion capture data rich enough to ask different questions.
It looks like this; here is one of the actresses.
So that kind of data is very important; data acquisition and collection is the first point. The next point: this chart summarises what happens around ASR. People have been working not only on recognising the words in the speech, but on a number of different things: extracting a variety of metadata features which may help with the speech understanding problem, the dialogue management problem, the speaker ID problem. All of this is important for doing BSP as well.
There has also been a lot of work on emotion recognition, again from speech and from other modalities. An important question there is how you represent emotions: do we use categorical representations, like happy or sad, or more dimensional ones, like how positive or negative is this, how activated is it, how dominant is it; or do we go further, to having profiles, more like statistical distributions of emotional behaviour? And now people want to do continuous tracking of emotional state variation. These are all ongoing questions in the community.
People also try to map those representations from multiple modalities, and that is important here too. For example, the interplay between visual and vocal features is well known, and it's a very complex interplay; one can in fact learn things about how prosody and head motion are related and how they encode not only linguistic information but also paralinguistic information.
There have been a number of studies, including our own, that show both the complementarity and the redundancy in how information about emotions is coded across these modalities.
For example, if you run most emotion recognizers with speech and facial expressions, you can show that with speech alone there's a lot of confusion between anger and happiness, but if you use the face, that goes away. Put together, like in any multimodal experiment, you get a sure boost in performance. The point, again, is that when you're trying to model these abstract types of behaviours, the more of the information encoding these constructs you can get a handle on, the better it is for your computational model.
Going back to the example of those kids being uncertain or not: you can add things like measured lexical and nonverbal vocalisation cues; that little boy said "mm", he was hesitating; you can detect and model those, and together with the visual cues of hand and head motion, you can come fairly close to human agreement about whether the child is certain or not in context. So with that kind of integration you can do things of this sort.
In fact, in many real-life situations, interactions of course depend on the other people who are there and whom you're interacting with. So the idea is that if you model the mutual influence between, say, two people in a dyadic interaction, you can do better at predicting what will come next. For example, in dyadic interactions we can model both of the people as a single dyadic unit, and you can show that by modeling the cross dependencies between the two, not only what one person did before but also what the other person did before, you can predict the upcoming state slightly better. This type of thing can be done with existing machinery, in a number of different ways.
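The cross-dependency idea can be sketched in a few lines. This is an illustrative toy, not the model from the talk: two synthetic coupled binary state sequences, and a plain logistic regression predicting one partner's next state from their own lagged state alone versus from both partners' lagged states.

```python
# Toy sketch of dyadic "mutual influence" modeling (illustrative only):
# partner B's state tends to follow partner A's previous state, so adding
# A's history as a feature should improve prediction of B's next state.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
T = 400
a = rng.integers(0, 2, T)                 # partner A: independent states
b = np.empty(T, dtype=int)                # partner B: entrained to A
b[0] = 0
for t in range(1, T):
    b[t] = a[t - 1] if rng.random() < 0.8 else rng.integers(0, 2)

X_own = b[:-1].reshape(-1, 1)                    # B's own lagged state
X_dyad = np.column_stack([b[:-1], a[:-1]])       # both partners' lags
y = b[1:]

acc_own = LogisticRegression().fit(X_own, y).score(X_own, y)
acc_dyad = LogisticRegression().fit(X_dyad, y).score(X_dyad, y)
print(acc_own, acc_dyad)
```

With this coupling, the cross-partner feature carries most of the predictive information, which is the point of treating the dyad as a single unit.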
so
So that was a very broad, high-level overview of some of the computational things happening in our field.
Now we come to the goal: how can these types of techniques be applied to the problems that people are asking about in these various domains, and how can we do this without, you know, messing with those fields? Research on marital therapy has been going on for decades; they want to predict things like, based on how a couple interacts, how long the marriage will last, or whether it can be mended, those kinds of questions. So we come in and say, well, we have some computational ideas, and maybe we can help.
Psychology research depends a lot on observational judgments. Many times they in fact record these interactions and then go through very painstaking and careful coding of these behaviours, based on the theoretical research frameworks that a particular lab might have, and they develop a lot of coding standards and so on.
so
I showed you some examples of this earlier: various couples interacting. Those clips were actually not real clinical data; what I'm going to talk about now is based on clinical trial data.
So the field relies on this manual coding process, on which the analyses depend, and it is not very scalable: it takes a lot of time, training coders is involved (usually students in psychology or linguistics are recruited), and inter-coder reliability is also tough.
So we asked the very simplistic question: can technology help to code these kinds of audio-visual data, these behavioural characterizations?
There are also measures that are in fact very difficult for humans to make, where technology can help: measurements of timing, for example. Even a simple thing like how long a person speaks, as I'll show later on, tells you quite a bit. And we can consistently quantify at least some aspects of these low-level human behaviours.
So here's the same kind of chart. Here, for example, we are interested in a couple discussing a problem, and we want to know, say, how much blame one spouse is putting on the other; it's not necessarily symmetric.
That is what we want to help with. We have a big corpus: one hundred and thirty-four distressed couples were enrolled in a clinical trial and received couples therapy, so we have access to about one hundred hours of data. It was not intended for this kind of automated processing: no transcriptions and so on. It also has video; I showed some examples, and this is what we start with.
It also has a feature that is very nice for us: expert ratings of these interactions at the session level. Every couple had a ten-minute-long problem-solving interaction, and it was coded for a number of behavioural patterns that were of interest to researchers in this domain. For example, one global code was "is the husband showing acceptance?", a pretty abstract question, and the description that corresponds to it reads: "indicates understanding, acceptance of partner's views, feelings and behaviours; listens to the partner with an open mind, positive attitude", and so on. This is what the coders had to internalise and rate on a scale of one to nine.
So these are the kinds of behaviours we try to predict from signal cues, and we started with the most obvious, simplest thing we knew how to do.
We said, well, let's focus on a few of those codes: acceptance, blame, positive affect, negative affect, sadness, each marked for both the husband and the wife, with ratings from one through nine; there are histograms of the ratings given by the coders.
Then, to make it even simpler for us, we said, let's just focus on the top twenty percent and the bottom twenty percent of the ratings, separating the extremes, and see what we can do with this. Okay.
The question was: from things we know how to do, like measuring speech properties and transcribing the speech to see whether the words tell us something, how successful can we be in predicting these codes that the humans assigned? That was the problem.
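The extreme-split setup just described can be sketched as follows. The ratings here are synthetic and the variable names are mine; only the idea of keeping the top and bottom twenty percent comes from the study.

```python
# Illustrative sketch: turn 1-9 session-level ratings into a binary
# "extremes" task by keeping only the top and bottom ~20% of sessions.
import numpy as np

rng = np.random.default_rng(1)
ratings = rng.integers(1, 10, size=100)   # one "blame" rating per session

lo, hi = np.quantile(ratings, [0.2, 0.8])
keep = (ratings <= lo) | (ratings >= hi)          # discard the middle
labels = (ratings[keep] >= hi).astype(int)        # 1 = high, 0 = low
print(keep.sum(), labels.mean())
```

Separating the extremes this way gives a cleaner two-class problem at the price of discarding the ambiguous middle sessions.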
So here's the flowchart; it's busy, but it just shows what most of us here do. We first clean up the audio and get rid of the parts that are hopeless, then do speech signal processing: voice activity detection, measuring things like pitch, intensity, and MFCCs, deriving lots of different statistical functionals at the utterance level and at different levels of temporal granularity, and throwing it all into our favourite machine learning tool to try to predict the particular category we're interested in.
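That pipeline can be sketched roughly as below. The frame-level "pitch" tracks are synthetic stand-ins (real systems would extract pitch/intensity/MFCC tracks from audio), and the functionals and classifier choices are illustrative, not the study's configuration.

```python
# Minimal sketch of the described pipeline: frame-level feature tracks ->
# utterance-level statistical functionals -> classifier. Synthetic data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)

def functionals(track):
    """Utterance-level statistics of one frame-level feature track."""
    return np.array([track.mean(), track.std(), track.min(),
                     track.max(), np.ptp(track)])

X, y = [], []
for label in (0, 1):                      # e.g. low-blame vs high-blame
    for _ in range(50):
        # pretend "pitch" frames; class 1 shifted upward on average
        frames = rng.normal(120 + 30 * label, 20, size=200)
        X.append(functionals(frames))
        y.append(label)
X, y = np.array(X), np.array(y)

clf = SVC().fit(X, y)
print(clf.score(X, y))
```

The functionals collapse variable-length tracks into fixed-length vectors, which is what lets a standard classifier consume whole utterances.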
Likewise, we can also do transcription and generate lattices, and then use those with discourse-specific models for classification.
Okay, so that's exactly what we did. Here's a transcript of one of these interactions. As you can see, money is one of the things couples fight about; you'll see that when you look at the results.
In fact, one of the other important things is the detection of all these non-verbal vocalisations and cues, which turn out to be information-bearing, at least that's what the algorithms tell us.
So, as I mentioned, we used a lot of prosodic and acoustic features with simple binary classification, and here are the results. Even from a very simple system with just the acoustic features, for many of these constructs, like blame and positive and negative behaviour, we can do much better than chance. Given that these are purely vocal features, that was very encouraging.
Certain things, though, like sadness and humour, are harder to get just from acoustics, and the reason is that we are not capturing any contextual cues, lexical cues, visual cues, or anything like that.
So then we said, well, okay, now let's throw in the lexical information. If you look at the transcripts, there are a lot of words that scream at you, saying, hey, this person is really mad at that one; they're blaming each other. For example, in this transcript, which we've highlighted, the spouse kept saying "it's aggravating".
So we asked: can we automatically capture these kinds of salient words from the text? With simple maximum-likelihood language models, you can score an utterance against the models for each condition to figure out which condition the words correspond to.
This need not be done only at the utterance level, and the interesting thing is that the kinds of words that end up in these models are very informative. Even with very simple techniques, in the blame situation you can look at the extremes of the hyperplane, at the words most associated with blame: the second-person pronoun "you" is correlated with high blame quite strongly, in fact very consistently with what psychologists predict and hypothesize, compared to the first person. But you also see words like "cleaning", because cleaning seems to be a big deal when couples fight about living together; it comes up quite a bit.
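The per-condition language model scoring idea can be illustrated with maximum-likelihood unigram models and add-one smoothing. The tiny "corpora" here are invented examples, not data from the study.

```python
# Hedged sketch: score an utterance against per-condition unigram
# language models and pick the condition with the higher log-likelihood.
import math
from collections import Counter

def unigram_model(sentences):
    """Return a log-probability scorer for an add-one-smoothed unigram LM."""
    counts = Counter(w for s in sentences for w in s.split())
    total, vocab = sum(counts.values()), len(counts)
    def logprob(utterance):
        return sum(math.log((counts[w] + 1) / (total + vocab + 1))
                   for w in utterance.split())
    return logprob

high_blame = ["you never listen", "you always do this", "it is aggravating"]
low_blame = ["i see your point", "we can work on this", "i understand"]

score_high = unigram_model(high_blame)
score_low = unigram_model(low_blame)

utt = "you always do this it is aggravating"
pred = "high" if score_high(utt) > score_low(utt) else "low"
print(pred)
```

Inspecting which words drive the likelihood difference is also what surfaces the "salient words" mentioned above, such as second-person pronouns for blame.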
So that was the simple thing we did, but that's just a start; this problem domain adds a lot of challenges.
First of all, any single feature stream provides just a small window, as I pointed out, and it's noisy. So of course we want to do this multimodally, and we also want to do it in a context-sensitive fashion.
A more important thing is that many of these ratings, in many domains, are done at the session level; the raters give a gestalt judgment of that particular session. What is not clear is what, in that particular unfolding of the interaction, led to that perceptual judgment. So you want to know what was salient. We made some first attempts at this using multiple instance learning, to see what is possible.
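The multiple-instance view can be sketched as follows: the rating exists only at the session level, so each session is a "bag" of utterance feature vectors, and the most extreme utterance is the candidate salient moment. This is a naive MIL baseline on synthetic data, not the method actually used.

```python
# Toy MIL sketch: session-level labels, utterance-level features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

def make_bag(positive):
    """A session: 20 utterance vectors; positive sessions hide 3 salient ones."""
    bag = rng.normal(0.0, 1.0, size=(20, 2))
    if positive:
        bag[:3] += 3.0
    return bag

bags = [make_bag(True) for _ in range(30)] + [make_bag(False) for _ in range(30)]
bag_labels = np.array([1] * 30 + [0] * 30)

# Naive baseline: propagate each session's label to its utterances,
# train an instance scorer, then score a session by its top utterance.
X = np.vstack(bags)
y = np.repeat(bag_labels, 20)
scorer = LogisticRegression().fit(X, y)

bag_scores = np.array([scorer.decision_function(b).max() for b in bags])
thr = (bag_scores[bag_labels == 1].mean() + bag_scores[bag_labels == 0].mean()) / 2
acc = ((bag_scores > thr).astype(int) == bag_labels).mean()
print(acc)
```

The max over instances is what lets the model point back at which utterance "explained" the session-level judgment.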
Another point is that when these ratings are done, they are not really a typical categorisation setup; many times they are posed as a rank-ordered list, that is, one is less than two, which is less than three, and so on. So you want to know how to integrate this ordinal structure into the models.
Those are the kinds of things where we try to do more efficiently what people are already doing. But there are also things that live more on the "felt sense" side. People hypothesize that when two people interact, there is something about the synchrony in their interaction that tells you how smoothly the interaction proceeds. If you are able to quantify this particular aspect, what is called entrainment, that would be useful; you want to know whether we can build signal models that actually try to do this.
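One simple way to put a number on vocal entrainment, not necessarily the measure used in this work, is to correlate the two speakers' pitch contours over sliding windows. The contours, window size, and coupling below are all synthetic and illustrative.

```python
# Toy entrainment measure: mean windowed Pearson correlation between
# two pitch contours. Entrained pair shares a slow trend; control doesn't.
import numpy as np

rng = np.random.default_rng(4)
T, win = 1000, 100

base = np.cumsum(rng.normal(0, 1, T))            # shared slow trend
pitch_a = 120 + base + rng.normal(0, 2, T)       # speaker A
pitch_b = 180 + base + rng.normal(0, 2, T)       # speaker B, entrained to A
pitch_c = 150 + np.cumsum(rng.normal(0, 1, T))   # unrelated speaker

def sync(x, y):
    cs = [np.corrcoef(x[i:i + win], y[i:i + win])[0, 1]
          for i in range(0, T - win, win)]
    return float(np.mean(cs))

s_ab = sync(pitch_a, pitch_b)
s_ac = sync(pitch_a, pitch_c)
print(s_ab, s_ac)
```

Windowing matters: it captures local coordination rather than a spurious whole-session correlation driven by overall register differences.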
Another point: when people look for a particular behaviour pattern, different experts, even trained ones, look at it differently, and they respond to different portions of the data. So you want to know how we can capture this data-dependent human diversity in behaviour processing in our models. Simple plurality or majority-voting-based machine learning techniques might not necessarily work well for these kinds of abstract constructs.
So the first, easiest thing: we made the language information and the acoustic information work together, and of course that does better; at least that's what all these experiments show, including ours.
One caveat in our case was that our ASR was really bad, because we didn't have the data to build language models for the couples domain. But what was encouraging is that even with something like a thirty-five percent word error rate, the information from the language models, from the lattices we generated, put together with the acoustics-based classifiers, provided a fairly decent prediction of these codes, and the psychologists were very excited about that.
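One naive way to see why combining noisy acoustic and lexical judgments helps, even when one stream (say, a high-WER transcript) is unreliable, is simple posterior averaging. This is a generic late-fusion sketch on simulated scores, not the system's actual fusion scheme.

```python
# Toy late fusion: average two noisy per-stream posteriors and compare
# decision accuracy against either stream alone. Simulated data.
import numpy as np

rng = np.random.default_rng(5)
n = 500
truth = rng.integers(0, 2, n)

# Simulated stream posteriors: right on average, individually noisy.
acoustic = np.clip(truth + rng.normal(0, 0.45, n), 0, 1)
lexical = np.clip(truth + rng.normal(0, 0.45, n), 0, 1)

def acc(scores):
    return float(((scores > 0.5).astype(int) == truth).mean())

fused = (acoustic + lexical) / 2
print(acc(acoustic), acc(lexical), acc(fused))
```

Averaging shrinks the noise of either stream, so errors in the transcript stream can be partially compensated by the acoustics, mirroring the result described above.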
To actually make this more multimodal, since we really needed information about the nonverbal cues, we rigged up a lab with a couch for the therapy sessions, with several microphone arrays, synchronised with about ten HD cameras as well as a motion capture system, to provide data of that sort. It is very useful for a more careful study of human vocal and nonverbal behaviour interactions.
So you get data like this, and so goes the conversation. You can do a lot of things here: since we are collecting data in an instrumented environment, we can localize the speakers and do things of that sort quite well.
So we asked questions like: can we describe approach-avoidance behavior, which is very important in this setting? You can see in this couples interaction that this guy was leaning back quite a bit and expressing displeasure in the interaction through very subtle cues, the kind of thing body language experts point to. We tried to do this with signal processing.
Approach-avoidance is actually moving toward or away from events or objects, and it relates to psychological theory, to emotion and motivation, and particularly in the couples domain to relationship commitment. So people are very interested: if we can quantify this using vocal and visual cues, can we actually predict or model it?
That was a problem we took on. We had psychologists rate this on an ordinal scale, from minus four to four, a nine-point scale, and we posed it as an ordinal regression problem: we basically broke it down into a series of binary classifiers, one per threshold of the scale, and then put a logistic regression model on top of that, with multimodal features, both acoustic and visual.
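The threshold decomposition described above can be sketched as follows. This is a minimal illustration on synthetic data, not the model from the talk: the real system used multimodal acoustic and visual features and SVM-based variants, whereas here the features and labels are made up and the binary learners are plain logistic regressions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for multimodal session features: 400 samples,
# 6 features; the ordinal label 0..4 is driven by feature 0 plus noise.
X = rng.normal(size=(400, 6))
y = np.digitize(X[:, 0] + 0.3 * rng.normal(size=400), [-1.0, -0.3, 0.3, 1.0])

def fit_logistic(X, t, steps=2000, lr=0.5):
    # Plain batch-gradient logistic regression with a folded-in bias term.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - t) / len(t)
    return w

def prob(X, w):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

# Ordinal decomposition: one binary model per cut point k, each estimating
# P(y > k); the ordinal prediction counts how many thresholds are exceeded.
ws = [fit_logistic(X, (y > k).astype(float)) for k in range(4)]
pred = np.sum([prob(X, w) > 0.5 for w in ws], axis=0)
acc = (pred == y).mean()
```

The count-of-exceeded-thresholds rule keeps the prediction on the original nine-point-style scale even if the individual binary models are not perfectly consistent with each other.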
The computer vision here was tough, so we just took the motion capture data in lieu of the actual video data. That way we could get very clean measurements of things like head and body orientation, folding of the arms, how much they are leaning, and so on; at least to get an upper-bound idea of what kinds of visual features are important for measuring approach-avoidance. Plus the usual audio features that I do not need to tell you about: pitch and MFCCs and all that.
Interestingly, we showed that this ordinal formulation, which was published by a couple of my students, was actually very helpful compared with just formulating it as a plain classification problem. The chart here shows the difference between using an ordinal SVM and a plain SVM; higher bars mean a bigger difference in the error rates, and with audio plus video it is actually better. So again, multimodality is important; I know I am preaching to the choir here, but the point is that we can actually use these audiovisual cues to measure something like this, what psychologists perceive as approach-avoidance behavior. That was great.
So the point so far is that a multimodal approach is important. The next computational idea I want to share concerns the fact that raters often make these gestalt judgments on the data, and you want to know, from a pure learning point of view, how to make that more tractable: how do you choose and sample the data so that you can get the most information out? You can pose this in different ways; I will show a little study here.
We used multiple instance learning, again with the case study of these couples' interactions, to ask: can we identify speaker turns that are salient with respect to the session-level code? You have a ten-minute session, husband and wife taking turns talking, and we have a rating for it; you want to know which of these turns would most explain the observed rating. That is the problem. As usual, you extract features from the signals and you want to identify the turns that make the difference; we used a diverse-density-based SVM approach for doing this. The whole idea is as follows.
It is a very simple idea. You have this notion of positive bags and negative bags: high-blame sessions and low-blame sessions, high-acceptance and low-acceptance sessions, and the data from them. You create your feature space, here an acoustic feature space, then you compute the diverse density and select the local maxima, the idea being that these are the prototypes from your data. When you want to evaluate an incoming session, you compute the minimum distance to these prototypes and use those as your features rather than all the raw features. Simple idea.
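A toy sketch of that diverse density idea on synthetic bags. Everything here, including the two-dimensional feature space, the bag construction, and the planted salient region, is made up for illustration; the actual work used acoustic and lexical turn features and a more careful search.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy bags of instances (think: turns in a session, 2-D acoustic features).
# Each positive bag hides one "salient" instance near (3, 3); negative bags
# contain only background instances near the origin.
def make_bag(positive):
    inst = rng.normal(size=(10, 2))
    if positive:
        inst[0] = np.array([3.0, 3.0]) + 0.3 * rng.normal(size=2)
    return inst

pos_bags = [make_bag(True) for _ in range(20)]
neg_bags = [make_bag(False) for _ in range(20)]

def diverse_density(t):
    # Noisy-or diverse density of a candidate point t: high when every
    # positive bag has some instance near t and no negative instance is.
    def p_near(bag):
        d2 = ((bag - t) ** 2).sum(axis=1)
        return 1.0 - np.prod(1.0 - np.exp(-d2))
    dd = 1.0
    for b in pos_bags:
        dd *= p_near(b)
    for b in neg_bags:
        dd *= 1.0 - p_near(b)
    return dd

# Search over all positive-bag instances; the best-scoring local maximum
# plays the role of a prototype.
candidates = np.vstack(pos_bags)
scores = np.array([diverse_density(c) for c in candidates])
prototype = candidates[scores.argmax()]

# Session-level feature: minimum distance from any turn to the prototype.
def bag_feature(bag):
    return np.sqrt(((bag - prototype) ** 2).sum(axis=1)).min()
```

Positive sessions end up with small prototype distances and negative sessions with large ones, which is exactly the compact feature representation described above.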
Among the features we considered are the lexical features here; I put this table up just to point out that not only are the obvious lexical items important, but things like fillers and nonverbal vocalizations seem to pop up quite a bit under information-gain-based selection. So they are important for these kinds of behavioral signal processing tasks.
So we had all these different informative features, created a feature vector per session based on the diverse density, and here are some results for the acceptance problem. We could show that the MIL-selected features, compared with using all the features, were not only meaningful but also boosted the performance. The way we interpret it is that these are reasonable ways of selecting salient instances, with our definition of saliency tied to discrimination.
But when we added intonation features, at least for some of these constructs, it did not really help. Maybe the way we added the intonation features, as contours, was not right, or maybe they do not carry information for these behavioral constructs. That finding, and the usefulness of multiple-instance-based learning, held for many of the behavioral descriptions we were looking at, and that was encouraging.
But what we have not done yet is validate whether these machine-hypothesized salient instances are in fact consistent with what humans would pick if asked whether they are salient or not. So one thing we are interested in doing, and human experiments are underway, is to make this part of an active learning loop: the machine proposes certain instances, and humans can either confirm or correct them, and so on. That is interesting stuff. And you could throw in other features as well.
so
The next topic, moving along this line toward more abstract constructs, is the modeling of entrainment. Entrainment, also called interaction synchrony, refers to the naturally occurring coordination between interacting people, at multiple levels and along multiple communication channels. If you were at Interspeech this year, Julia Hirschberg gave a fantastic talk on this, on lexical entrainment. People have hypothesized that humans use this to achieve efficiency in communicating, to increase mutual understanding, and so on; it has been extensively studied in psychology and psycholinguistics.
What we wanted to see is: given these kinds of behavioral measurements, can we derive this high-level behavioral characterization? The thing is, you cannot really ask human raters, hey, are these people entraining or not; it is very difficult to judge, particularly compared with other signal-cue-based ratings.
Also, unlike many settings where people measure synchrony, where you have two aligned signals and can compute mutual information or correlation measures, here, because of the turn-taking structure, things are not aligned in time, so we have to think of other clever ways of computing it. And of course it is also directional: how much I entrain toward you is not necessarily the same as how much you entrain toward me. So we tried to figure out how to compute how similar two people sound in a spoken exchange.
As usual, we measure acoustic features; let me tell you a bit about them. What we experimented with here was to construct what we call a PCA vocal characteristics space, and then measure the similarity between these spaces, or project the data onto the spaces, to arrive at some similarity measure; that was the basic idea. The features are the usual ones, pitch, loudness and spectral features, extracted from the vocal data at the word level, and the PCA spaces are constructed both at the level of the turn and at the level of the whole session.
Then you can calculate various similarity measures. Doing PCA means you are transforming to a different coordinate space, and the two speakers' components are not necessarily aligned with each other, so measuring the angles between corresponding components gives you some notion of a similarity metric; you can also weight the components by the variance they explain. Or you can project one speaker's data onto the other's PCA space and calculate any number of different similarity metrics.
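The axis-angle variant can be sketched roughly as follows, with synthetic stand-ins for the word-level vocal features; the exact weighting and feature set in the actual study may differ, so treat this as one plausible instantiation of the idea.

```python
import numpy as np

rng = np.random.default_rng(2)

def pca_axes(X, k=3):
    # Principal axes (rows of Vt) and variances of a (frames x features)
    # matrix, via SVD of the centered data.
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k], (s ** 2 / (len(X) - 1))[:k]

def vocal_space_similarity(Xa, Xb, k=3):
    # Cosines of the angles between corresponding principal components of
    # the two speakers' PCA spaces, weighted by the variance explained.
    Va, va = pca_axes(Xa, k)
    Vb, vb = pca_axes(Xb, k)
    cosines = np.abs((Va * Vb).sum(axis=1))   # |cos| per axis pair
    w = (va + vb) / (va + vb).sum()           # variance weights
    return float((w * cosines).sum())

# Toy word-level feature streams (stand-ins for pitch, loudness, spectral
# statistics) with a strongly anisotropic shared structure.
base = rng.normal(size=(400, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.2])
similar = base + 0.1 * rng.normal(size=(400, 5))   # near-identical space
different = rng.normal(size=(400, 5))              # unrelated structure

sim_high = vocal_space_similarity(base, similar)
sim_low = vocal_space_similarity(base, different)
```

A speaker pair whose vocal spaces share structure scores near one, while unrelated feature streams score lower, which is the kind of graded similarity the entrainment measures need.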
Then you ask, what does this mean? The first thing we did was a sanity check: for real dialogues, hopefully there is some evidence that these measures distinguish them from artificial ones. So we constructed artificial dialogues by mixing in randomized data from other people, just as a sanity check to make sure that these measures separate the two cases. It does not tell you whether this is entrainment or not, but at least it tells you the measures reflect something about real dialogues. That was the first step.
The second step reflects the literature in this domain, where the belief is that entrainment is actually a useful mechanism that provides flexibility in these couples' interactions; it is thought to be a precursor to empathy and so on. So you want to see whether entrainment is higher in positive interactions than in negative ones; that was an indirect way of validating these entrainment measures.
so
Encouragingly, using just these entrainment measures, these similarity measures as features, we were able to distinguish between positive and negative interactions in a statistically significant way. Of course, we immediately wanted to build a prediction model, so we put these features into a factorial HMM and tried to see, using nothing but the entrainment features, how well you can predict how negative or positive the interaction was. We could do quite a bit better than chance, which is pretty encouraging.
Again, there are open questions here; this was just a small look at what is a pretty tough problem. How can we actually show entrainment across modalities? How do you do this in a truly dynamic framework? What are other ways of quantifying it, and how do you evaluate it better than just doing it indirectly? There are lots of very open theoretical and computational questions.
Finally, let me quickly say this: human annotators provide the reference in a number of cases, and often we do fusion of various sorts, whether of human raters or machine classifiers, and rely on the diversity of these classifiers so that aggregating them gives a better result. So what we want to know is how we can build mathematical models that reflect this diversity of people. For example, people have studied reliability-weighted data and classifier models, and shown that they do better than simple plurality voting. My student Kartik did some work on actually modeling this in an EM framework, and the results are very encouraging.
The point here is that these data tell us a lot about the wisdom of crowds, and the wisdom of experts. For modeling abstract constructs in particular, I think we have to bring explicit models of the evaluators into the classification and learning problems.
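A minimal sketch of the reliability-weighted EM idea, in the spirit of a one-coin Dawid-Skene model. The annotators, their reliabilities, and the data below are all simulated, and this is not the exact model from the student work mentioned; it just shows how EM can jointly recover annotator reliability and a better consensus label than plurality voting.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy crowd: three annotators label 300 binary items; the third is barely
# better than chance. The true reliabilities drive the simulation only;
# the EM loop never sees them.
truth = rng.integers(0, 2, size=300)
true_rel = np.array([0.9, 0.8, 0.55])
labels = np.array([np.where(rng.random(300) < r, truth, 1 - truth)
                   for r in true_rel])                 # shape (3, 300)

# EM: alternate a soft estimate of the true label (E-step) with
# per-annotator reliability estimates (M-step).
rel = np.full(3, 0.7)                                  # starting guess
for _ in range(25):
    log_odds = np.zeros(300)
    for j in range(3):
        w = np.log(rel[j] / (1.0 - rel[j]))
        log_odds += np.where(labels[j] == 1, w, -w)
    post = 1.0 / (1.0 + np.exp(-log_odds))             # P(truth=1 | labels)
    for j in range(3):
        rel[j] = np.mean(labels[j] * post + (1 - labels[j]) * (1 - post))

em_pred = (post > 0.5).astype(int)
majority = (labels.sum(axis=0) >= 2).astype(int)
```

The E-step weights each annotator's vote by the log-odds of their estimated reliability, so the near-chance annotator is effectively down-weighted instead of counting as a full vote, which is exactly where plain plurality falls short.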
So these are just some of the challenges that come up while attacking these types of behavioral questions; there are many others, but I wanted to give you a feel for them.
Now, very quickly, since I know Frank is showing me the time, I want to share a few slides about autism. Autism, as you know, is something we have been hearing a lot about in the news lately, with striking statistics about how many children are diagnosed. So we are asking what technology can do here, particularly for people working in speech, signal processing, and related areas: what can we do with computational techniques and tools to help better understand the various communication and social patterns in these children? One of the biggest hallmarks of autism is difficulty with social communication and prosody; perhaps we can better define and quantify these kinds of deficits.
The second thing, of course, is building interfaces that can elicit and increase specific social communication behaviors. So it is important to pursue these kinds of questions. We have been collecting data of child-psychologist interactions, about ninety kids to date, with transcribed audio and video data, and you can ask questions of various sorts with these types of data.
In these interactions, the psychologist interacts with the child and rates the child along a number of dimensions: showing empathy, shared enjoyment, prosody, and so on. And we looked at very simple measures, like just the durations in these interactions, how much speech is spoken by the child relative to the psychologist. It is very interesting how much of the ratings the psychologist provided can be explained by just this simple measure. It is interesting because it is observation based, and it can be computed consistently.
The other thing is speaking rate: just looking at a normalized speaking rate explains other codes. So even with the simple techniques we already have in hand, and for the kinds of behavioral constructs people are interested in, you can actually provide tools to support these assessments.
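The two simple measures just mentioned, the child-to-psychologist speech proportion and a normalized speaking rate, can be computed directly from diarized transcript segments. The segment tuples below are made-up placeholders standing in for the real session transcripts.

```python
# Diarized segments: (speaker, start_s, end_s, n_words) from a transcript.
segments = [
    ("child", 0.0, 2.5, 6), ("psych", 2.5, 6.0, 12),
    ("child", 6.0, 7.0, 2), ("psych", 7.0, 11.0, 15),
]

def duration(spk):
    # Total seconds of speech attributed to one speaker.
    return sum(e - s for who, s, e, _ in segments if who == spk)

# Child speech proportion relative to the psychologist.
ratio = duration("child") / duration("psych")

# Normalized speaking rate: words per second of the child's own speech.
child_words = sum(n for who, _, _, n in segments if who == "child")
child_rate = child_words / duration("child")
```

Session-level values like these can then be correlated against the psychologist's codes, which is all the "simple measures" analysis above requires.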
Of course, you can also use dialogue systems and the interfaces that a number of colleagues are developing to elicit interactions in a very systematic and reproducible way, because human-to-human interaction is variable: psychologists, even when they run a structured interaction, are not going to do it exactly the same way each time. And we wanted to see whether children would in fact interact naturally with these kinds of characters. We built this with the CSLU toolkit, which was robust, and we created a number of different emotional reasoning games, storytelling activities, and so on.
Anyway, the data we have: each child came four times, four hours each, for a total of about fifty hours of data. And very encouragingly, we could actually extract things like how the parents' interaction changed, along with physiological data; there are a lot of very interesting questions we can pursue, since we can measure speech parameters, language parameters, and visual features, to supplement what people are doing otherwise. A number of things become possible. I will cut the slides there; we can discuss the rest some other time.
What I wanted to show at this point, with a couple of examples, is that there are so many open challenges in these domains where a community like ours can contribute: everything from robust capture and processing of these multimodal signals, to deriving and finding appropriate representations for computing, to the signal processing itself, what kinds of features and feature engineering help, some data driven, some inspired by human-like processing.
There are different modeling schemes, mathematical schemes, that can bring some quantitative insight to these kinds of very subjective human-based assessments, and there is work to be done on questions such as data privacy.
So there are lots of interesting possibilities. We have been fortunate to work on a number of different mental health domains; in fact, I just touched upon one here, plus a little bit on autism, but there is lots more one could talk about. It is a fascinating area.
So, in conclusion: human behavior can be described in many ways; the same people interacting can be described by different sets of observers, from different perspectives, depending on what they are looking for. That offers a lot of challenges and opportunities for us to develop the computational advances needed in sensing, processing, modeling, and validation. But what is most exciting for me is this opportunity for interdisciplinary, collaborative scholarship.
And so, in sum: obviously, signal processing on the one hand helps us do things that people already know how to do well, perhaps more efficiently and consistently. But what is tantalizing is that we can actually provide new tools and data to offer insights that we have not had before. I think that is the exciting part here.
So I would like to thank you, and all my collaborators, of whom there are literally hundreds who helped with this work, and my sponsors. With that I will conclude, and I will show you something fun, since it is the holiday season.
This was actually a rap video; I convinced him to do it, don't ask.
so thank you again
Yeah, thank you very much for this very interesting, very enlightening talk. We have something like four minutes for questions, so I would like to open the floor.
A question on multimodal signal processing: as we know, some people are more formal, and we also use markers like a comfortable distance in communication, but that differs between people.
Proxemics, you mean? Yes. In fact, the body language data I showed very quickly, of these actors, includes distance measures estimated both from video and from full body motion capture. There will be a couple of papers at ICASSP to share on this body language work and what it can tell you about the dynamics of the interaction. Proxemics is also a feature in the approach-avoidance work, as in whether people are trying to come together or move away; in fact, a telling cue is just a little leaning or rushing away from the center of the interaction.
Is that culturally invariant? That is an important question; I think what you are alluding to is what the cultural underpinnings of these types of features are, and how to demonstrate them. We have not had data from different cultures in these studies, except that in the autism work we have data from kids growing up in families in Los Angeles, and Los Angeles is very multicultural. We have some data, but we have not had enough information to marginalize out those effects yet. So the only thing we have so far is the body language data from the actors.
Do we have another question? Okay, then I have a question myself. You touched very briefly on crowdsourcing; I am curious what your view is on the role crowdsourcing could play here, especially since a lot of these are subjective measurements.
Yes. We have used it for the more obvious things, like transcription, or judgments of things that can be defined well; asking people to rate those is easier. What I am finding difficult is to define these abstract tasks for ratings from a lot of people. We are trying right now to do sarcasm, or snarkiness even, where we try to see whether we can use the wisdom of crowds; the biggest challenge is how to partition these tasks so that you get consistent answers from people. For behavior processing, the bigger challenge is that many of these data are protected by all kinds of restrictions, so we cannot farm them out for crowdsourcing; with the actors' data we are able to do some things. We still have not figured out how to handle the abstract constructs, because we have to get these concepts internalized by the people doing the annotation. Simpler, more intuitive tasks are easier, I think.
Okay, are there any more questions from the floor? That was great, thank you. A couple of years ago, Julia Hirschberg gave a really interesting summary overview of what has been done on detecting lying, with obvious applications of course. And one of the main conclusions was that to detect lying you really need to know the individual's baseline; if you do not, it is still very hard. It goes a step beyond the earlier discussion. I wondered if you have come across any evidence for this with the kind of data you are looking at.
Yes, in fact, this is a very important question: how we can individualize and personalize. I believe that is one of the strong points of this approach: if we have enough data, we can actually learn individual-specific patterns fairly well. In autism that matters; people always talk about how heterogeneous it is, because the symptoms vary so much across children, and even within a child they depend on context. But the way children present themselves is fairly individual-specific; there are gaps and there are strengths for every individual, and you can learn those patterns from data fairly well over time, which you do not necessarily get from a single forty-five-minute set of interactions with a researcher or a clinician. I do believe that the ability to individualize models, with all the adaptation and background modeling techniques people talk about, actually lends itself to this.
The cultural aspects are slightly harder, not because we cannot try, but because it is very hard to collect data in a systematic, controlled way, so that you can say an effect is because of this factor and not that one. But individual-level models are easier, I believe.
In fact, that is one of the things we did with these computer-character-based interactions: bring the same child back over and over again, because they loved interacting with computer characters and having dialogues with them. So we have several hours of data from the same child, and we also have them interact with their parents and with an unknown person, a randomly assigned person; so you have human interaction with both familiar and unfamiliar partners, as well as human-computer interaction. You can actually begin to characterize a child fairly well: their lexical use, what kinds of initiative they take, and so on. We can begin to do that even with the simple speech and entropy ideas we bring to the table. As for lying and such, I do not know; I have not worked on it.
I am afraid we are out of time, so please thank the speaker again. Thanks.