It's my great honour and pleasure to introduce our distinguished invited speaker today, Shri Narayanan, who will talk about behavioral signal processing.
Shri is the Andrew Viterbi Professor at USC. His research focuses on human-centred information processing and communication technologies.
I was very impressed to see that he holds professor appointments not only in electrical engineering and computer science, but also in linguistics and psychology.
I don't live in the US, but someone told me that he is also a regular guest on US television.
So please help me welcome Shri Narayanan; we are really looking forward to the talk.
Thank you.
Right. I'm really honoured to be here, and it was great to see a lot of friends of mine I haven't seen in a long time, to come back to speech, at least to check it out.
They were asking me, you know, what crazy, fringe-y, funny things I've been up to, so that's this talk today.
The only little problem I have with this is that I haven't done very much on this topic yet, but I will share whatever we've been up to in the last couple of years, and hopefully I won't disappoint and it will be worth your time.
So the title is behavioral signal processing. I will momentarily define what I mean by that; if I'm going to use this term, I ought to at least say what it is.
so
So this work concerns human behaviour, which, as we all know, is very complex and multifaceted. It involves very complex and intricate mind-body relations, is affected by the environment and by interactions with other people, and is reflected in how we communicate and emote, in our personality, and in how we interact with other people.
It is also characterised by the generation and processing of multimodal cues, and it often characterises typical and atypical states, disorders, and so on.
So one wonders, you know, what is the role of signal processing, or of signal processing people, in this business?
If you look across a number of domains, behaviour analysis, either explicit or implicit, is essential. Starting from customer care: you want to know whether a person is frustrated or satisfied with the service that has been rendered, whether you want to sell more things or understand behaviour at the level of an individual or a group, and so on.
In learning and education, not only do you want to know whether someone got a particular answer right or wrong, you want to know how they got it and how confident they are. Personalised learning is one of the grand challenges of engineering, and to get there we have to understand behaviour patterns and the like.
But more importantly, and something I've developed an increasing passion about, is this whole area of mental health and wellbeing, which I'll try to touch on today with a couple of my examples, where behaviour analysis figures very centrally, whether observation-based or through other means.
When you look across these domains, while computational tools are used, the analysis is still mostly very human-based. So before we go further, let me show some videos as examples of the typical problems one could ask about.
In this first one you're going to see kids playing with, and actually talking to, a computer game. The question is: can we tell something about the child's cognitive state, whether they are confident or not? So let's look at this little girl.
[video plays]
Can we unmute the audio, please? Alright, let's try again; hold on, I checked this many times. Let's see... okay.
[video plays]
So just looking at those, we see that there are vocal cues, the language the children are using, and visual cues like looking around and looking away, from which you can say something, or at least tell that these cases are different. One of the questions we ask is: can we actually, formally, address these problems of measuring speaker state?
The next example is from marital therapy, classic couples counselling. What you're going to see is a couple interacting. The people in this field, the psychologists doing this kind of research and the people actually trying to help these couples, look for a lot of things: characterising aspects of the dynamics, looking at who is blaming whom, trying to figure out what is going on, and planning treatment based on that. So let's look at this video.
Should I try it again?
[video plays]
One more example. This one is from the autism domain, where a clinician is actually interacting with a child in a sort of semi-structured interaction following a particular diagnostic instrument. The clinician is engaging the child and trying to figure out a number of things, everything from prosody to characterising how the child responds when asked something.
[video plays]
So what you probably observed is that, although there were clear places where the child could have chatted back or looked at the person, nothing was happening; the child just kept doing the task, not swaying toward the interaction, with atypical prosody and so on. These things are rated on scales that have been developed, and I'll talk a little later about some of them; for now I just want to convey the idea.
All of these analyses, as you can see, are very observation-based, with people looking at multimodal cues and trying to render a judgment.
So these human behaviour signals provide a window into high-level processes, and what you can see depends on how big or small that window is.
Some cues are overtly observable, like vocal and facial expressions and body posture; others are covert, and we don't have direct access to them, though they are nonetheless measurable in special cases: things like heart rate, electrodermal response, or even brain activity. And this information arrives at different time scales for these different cues.
The ability to process, interpret, and decode these signals can give us insight into, and understanding of, mind-body relations. But there is also, importantly, how people process other people's behaviour patterns; that's a fine distinction between how behaviour is generated and how it is perceived and processed.
So the measurement and quantification of these kinds of human behaviour, from both the production and the perception perspectives, is a fairly challenging problem, I believe.
So here's my operational definition of what I call behavioral signal processing: it covers the computational methods that try to model human behavioral signals that are manifested in overt and/or covert cues and processed by humans, explicitly or implicitly, and that eventually help facilitate human analysis and decision making.
The outcome is behavioral informatics, which can be useful across domains, whether to inform diagnostics, to plan treatments, or to power an autonomous system that does personalised teaching, and so on.
In all of these, what behavioral signal processing tries to do, at varying levels, is to quantify this human "felt sense".
That, as you might imagine, is challenging along a lot of different dimensions, and I'll try to impress at least some of those upon you.
If you think about it, technology has already helped in this domain quite a bit, and a big reason is that it can rely on the significant foundational advances that have been made in a number of domains, things that have happened and been discussed deeply at this conference: audio and video diarization and segmentation; speech recognition and understanding of what was spoken; the kind of visual activity recognition we heard about earlier, everything from low-level descriptions like head pose orientation to complex classification of human activity; and physiological signal processing.
The difference is that, using these as building blocks, what you want to do is map them to more abstract, domain-relevant behaviours, and that demands new multimodal modeling approaches.
People have already started to work on this, solving various parts of the puzzle.
Starting from sensing: people have been trying to work out how to measure human behaviour in an ecologically valid way, that is, without disturbing the process we're trying to measure, from instrumenting environments with cameras, microphones, and other devices, to actually instrumenting people with sensors, body-computing types of techniques.
In speech, increasingly, people are doing richer and richer processing of not only what's been said, but by whom, and how.
In affective computing, you see a lot of papers being published. And there's the domain of social signal processing, about modeling individual and group interaction, turn-taking dynamics, non-verbal cue processing, and so on. These are all essential building blocks for behavioral signal processing.
so
In summary, these are the ingredients. People are working in signal processing on acquisition: how you acquire these signals and build these types of systems in a meaningful way. There are many dimensions to that; the kinds of behaviour you want to track might not happen in a clinic, so you might want to do it "in the wild", so to speak: in playgrounds, in classrooms, at home, for example monitoring behaviour patterns of the elderly. There's also body computing, and there are lots of interesting signal processing challenges there.
Then there's analysis: what features tell you more about the particular behaviour patterns of interest, and how do you compute them robustly, the usual questions we ask about noise and so on.
And, more importantly, there's modeling of the behavioural constructs as described by the domain experts, providing both descriptive and predictive capability.
This is not easy, because, for one, the observations of these behaviour patterns carry large amounts of uncertainty and are at best partial.
There is also the question, which came up in the computer vision talk, of representations: what are the representations that we have to define to compute these things in the first place? That talk mentioned an experiment where they showed visual scenes and asked people to describe them. Imagine now a psychologist observing a couple interacting: one of the things we're looking for is how they describe it, before we even set out to map observable cues to some representation. That itself is a first-class research problem: what kinds of representations should be specified?
And given that we are talking about human behaviour, there is vast model heterogeneity: differences in behaviour patterns over time and across people, and variability in how these data are generated and used.
so
So what do people do in each of these domains? I'll show you some examples. They have their own specific constructs. For example, in language assessment, or in a learning situation, say literacy: when they try to figure out what kind of help a child needs when learning to read, they're looking at not just whether the child is producing a particular sound right or wrong; a number of other things come into play. Disfluencies, in fact the rate of disfluencies, play an implicit role, as we found when we did some experiments.
In health, for example pediatric obesity, not only are they monitoring physical activity but also emotional state, and they want to model food decision making, and so on.
There are a lot of common features across these, because, after all, the kinds of sensing we have access to are limited: we have audio, microphones, video, and maybe some physiological sensors. So the approach tends to be, at least at the signal level, the same.
But the important part is to see how the human experts observe these signals, to learn from that, and to see how we can augment their capabilities.
That's why I think one of the hallmarks of the way I look at behavioral signal processing is that it provides supporting tools that help the human expert, and does not aim at total automation, at replacing what they're doing; I think that would probably not be the most beneficial thing to do.
so
Pictorially, if you look at this chart, this is what happens today: a human expert observes the phenomena of interest, say a child interacting with a teacher. They gather a lot of data, listen to and look at the child, see how confidently the child is reading, make some judgments, and provide appropriate scaffolding or intervention.
What we're saying is that signal processing, machine learning, and other computational tools can come in handy: first, by trying to decode what the human experts do, learning the features they use, explicitly or implicitly; then, by building models that can help with some of these predictive capabilities. Certain things are beyond human processing capabilities, for example fine pitch dynamics, or comparing what happened at the beginning of a session with the end of the session; some things computational models can do better. These models can provide feedback, and hopefully the two can reinforce each other nicely, with the outcome being a kind of behavioral informatics. So that's the idea here.
so
With that background, what I'm going to do for the rest of the talk is quickly run through some of these building blocks, but mostly focus on a couple of examples: one from the marital therapy domain, and then, quickly, one from the autism domain, just to highlight some of the possibilities and the challenges that exist.
so
As I mentioned already, lots of work is happening in multimodal signal acquisition and processing: everything from smart rooms and instrumented spaces to actually instrumenting people, to sense a lot of different things, sensing the user and sensing the environment in which things are happening, because context becomes important, and doing this in a variety of locations, from the laboratory to classrooms, clinics, playgrounds, and so on.
One of the important things we learned is that, depending on the environment, there are lots of constraints that come into play. For example, when we do our work at the hospital with kids with autism, there are real restrictions on where we can place cameras and where we can put the microphones. Nothing can interrupt what's happening there: the psychologist maintains a certain structure for the child, because these children are sensitive to certain things and find them distracting, and so on.
So even though we'd like to capture the 3D environment with, say, ten or fifteen cameras, it's just not possible. We have to work with these kinds of restrictions, and hence the robustness issues in audio processing, language processing, and behaviour processing are real; we can't just solve them with better sensing.
Likewise, in sensing people, we can do a lot of different things, but we have to worry not only about the technological constraints but also about the corresponding ethical and privacy constraints. So it's a challenging area.
Here are two actors in a dyadic interaction.
We've also been collecting data using actors to study behaviours, in addition to working with actual populations, because there are certain things we can do in the lab, with data we collect, that go hand in hand with the rest. One resource is a multimodal motion capture database of dyadic interactions, with a lot of different emotional content that has been annotated and rated; if you're interested, look it up.
Likewise, using actors, we've been collaborating with people in the theatre school on full-body dyadic interactions. In each of these cases the scenarios were chosen to be rich enough to cover the entire gamut, from the actors playing Shakespeare and Chekhov to doing improvisation, giving audio, video, and motion capture data rich enough to ask different questions.
It looks like this; here is one of the actresses.
So that kind of data is very important; data acquisition and collection is the first point. The next point: this chart summarises what happens around ASR. People have been working not only on recognising the words in the speech, but on a number of different things: extracting a variety of metadata features which may help with the speech understanding problem, the dialogue management problem, the speaker ID problem. All of this is important for doing BSP as well.
There has also been a lot of work on emotion recognition, again from speech and from other modalities. An important question there is how you represent emotions: do we use categorical representations, like happy or sad, or more dimensional ones, like how positive or negative is this, how activated is it, how dominant is it; or do we go further, to having profiles, more like statistical distributions of emotional behaviour? And now people want to do continuous tracking of emotional state variation. These are all ongoing questions in the community.
People also try to map those representations from multiple modalities, and that is important here too. For example, the interplay between visual and vocal features is well known, and it's a very complex interplay; one can in fact learn things about how prosody and head motion are related and how they encode not only linguistic information but also paralinguistic information.
There have been a number of studies, including our own, that show both the complementarity and the redundancy in how information about emotions is coded across these modalities.
For example, if you run most emotion recognizers with speech and facial expressions, you can show that with speech alone there's a lot of confusion between anger and happiness, but if you use the face, that goes away. Put together, like in any multimodal experiment, you get a sure boost in performance. The point, again, is that when you're trying to model these abstract types of behaviours, the more of the information encoding these constructs you can get a handle on, the better it is for your computational model.
Going back to the example of those kids being uncertain or not: you can add things like measured lexical and nonverbal vocalisation cues; that little boy said "mm", he was hesitating; you can detect and model those, and together with the visual cues of hand and head motion, you can come fairly close to human agreement about whether the child is certain or not in context. So with that kind of integration you can do things of this sort.
In fact, in many real-life situations, interactions of course depend on the other people who are there and whom you're interacting with. So the idea is that if you model the mutual influence between, say, two people in a dyadic interaction, you can do better at predicting what will come next. For example, in dyadic interactions we can model both of the people as a single dyadic unit, and you can show that by modeling the cross dependencies between the two, not only what one person did before but also what the other person did before, you can predict the upcoming state slightly better. This type of thing can be done with existing machinery, in a number of different ways.
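The cross-dependency idea can be sketched in a few lines. This is an illustrative toy, not the model from the talk: two synthetic coupled binary state sequences, and a plain logistic regression predicting one partner's next state from their own lagged state alone versus from both partners' lagged states.

```python
# Toy sketch of dyadic "mutual influence" modeling (illustrative only):
# partner B's state tends to follow partner A's previous state, so adding
# A's history as a feature should improve prediction of B's next state.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
T = 400
a = rng.integers(0, 2, T)                 # partner A: independent states
b = np.empty(T, dtype=int)                # partner B: entrained to A
b[0] = 0
for t in range(1, T):
    b[t] = a[t - 1] if rng.random() < 0.8 else rng.integers(0, 2)

X_own = b[:-1].reshape(-1, 1)                    # B's own lagged state
X_dyad = np.column_stack([b[:-1], a[:-1]])       # both partners' lags
y = b[1:]

acc_own = LogisticRegression().fit(X_own, y).score(X_own, y)
acc_dyad = LogisticRegression().fit(X_dyad, y).score(X_dyad, y)
print(acc_own, acc_dyad)
```

With this coupling, the cross-partner feature carries most of the predictive information, which is the point of treating the dyad as a single unit.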
so
So that was a very broad, high-level overview of some of the computational things happening in our field.
Now we come to the goal: how can these types of techniques be applied to the problems that people are asking about in these various domains, and how can we do this without, you know, messing with those fields? Research on marital therapy has been going on for decades; they want to predict things like, based on how a couple interacts, how long the marriage will last, or whether it can be mended, those kinds of questions. So we come in and say, well, we have some computational ideas, and maybe we can help.
Psychology research depends a lot on observational judgments. Many times they in fact record these interactions and then go through very painstaking and careful coding of these behaviours, based on the theoretical research frameworks that a particular lab might have, and they develop a lot of coding standards and so on.
so
I showed you some examples of this earlier: various couples interacting. Those clips were actually not real clinical data; what I'm going to talk about now is based on clinical trial data.
So the field relies on this manual coding process, on which the analyses depend, and it is not very scalable: it takes a lot of time, training coders is involved (usually students in psychology or linguistics are recruited), and inter-coder reliability is also tough.
So we asked the very simplistic question: can technology help to code these kinds of audio-visual data, these behavioural characterizations?
There are also measures that are in fact very difficult for humans to make, where technology can help: measurements of timing, for example. Even a simple thing like how long a person speaks, as I'll show later on, tells you quite a bit. And we can consistently quantify at least some aspects of these low-level human behaviours.
So here's the same kind of chart. Here, for example, we are interested in a couple discussing a problem, and we want to know, say, how much blame one spouse is putting on the other; it's not necessarily symmetric.
That is what we want to help with. We have a big corpus: one hundred and thirty-four distressed couples were enrolled in a clinical trial and received couples therapy, so we have access to about one hundred hours of data. It was not intended for this kind of automated processing: no transcriptions and so on. It also has video; I showed some examples, and this is what we start with.
It also has a feature that is very nice for us: expert ratings of these interactions at the session level. Every couple had a ten-minute-long problem-solving interaction, and it was coded for a number of behavioural patterns that were of interest to researchers in this domain. For example, one global code was "is the husband showing acceptance?", a pretty abstract question, and the description that corresponds to it reads: "indicates understanding, acceptance of partner's views, feelings and behaviours; listens to the partner with an open mind, positive attitude", and so on. This is what the coders had to internalise and rate on a scale of one to nine.
So these are the kinds of behaviours we try to predict from signal cues, and we started with the most obvious, simplest thing we knew how to do.
We said, well, let's focus on a few of those codes: acceptance, blame, positive affect, negative affect, sadness, each marked for both the husband and the wife, with ratings from one through nine; there are histograms of the ratings given by the coders.
Then, to make it even simpler for us, we said, let's just focus on the top twenty percent and the bottom twenty percent of the ratings, separating the extremes, and see what we can do with this. Okay.
The question was: from things we know how to do, like measuring speech properties and transcribing the speech to see whether the words tell us something, how successful can we be in predicting these codes that the humans assigned? That was the problem.
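The extreme-split setup just described can be sketched as follows. The ratings here are synthetic and the variable names are mine; only the idea of keeping the top and bottom twenty percent comes from the study.

```python
# Illustrative sketch: turn 1-9 session-level ratings into a binary
# "extremes" task by keeping only the top and bottom ~20% of sessions.
import numpy as np

rng = np.random.default_rng(1)
ratings = rng.integers(1, 10, size=100)   # one "blame" rating per session

lo, hi = np.quantile(ratings, [0.2, 0.8])
keep = (ratings <= lo) | (ratings >= hi)          # discard the middle
labels = (ratings[keep] >= hi).astype(int)        # 1 = high, 0 = low
print(keep.sum(), labels.mean())
```

Separating the extremes this way gives a cleaner two-class problem at the price of discarding the ambiguous middle sessions.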
So here's the flowchart; it's busy, but it just shows what most of us here do. We first clean up the audio and get rid of the parts that are hopeless, then do speech signal processing: voice activity detection, measuring things like pitch, intensity, and MFCCs, deriving lots of different statistical functionals at the utterance level and at different levels of temporal granularity, and throwing it all into our favourite machine learning tool to try to predict the particular category we're interested in.
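That pipeline can be sketched roughly as below. The frame-level "pitch" tracks are synthetic stand-ins (real systems would extract pitch/intensity/MFCC tracks from audio), and the functionals and classifier choices are illustrative, not the study's configuration.

```python
# Minimal sketch of the described pipeline: frame-level feature tracks ->
# utterance-level statistical functionals -> classifier. Synthetic data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)

def functionals(track):
    """Utterance-level statistics of one frame-level feature track."""
    return np.array([track.mean(), track.std(), track.min(),
                     track.max(), np.ptp(track)])

X, y = [], []
for label in (0, 1):                      # e.g. low-blame vs high-blame
    for _ in range(50):
        # pretend "pitch" frames; class 1 shifted upward on average
        frames = rng.normal(120 + 30 * label, 20, size=200)
        X.append(functionals(frames))
        y.append(label)
X, y = np.array(X), np.array(y)

clf = SVC().fit(X, y)
print(clf.score(X, y))
```

The functionals collapse variable-length tracks into fixed-length vectors, which is what lets a standard classifier consume whole utterances.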
Likewise, we can also do transcription and generate lattices, and then use those with discourse-specific models for classification.
Okay, so that's exactly what we did. Here's a transcript of one of these interactions. As you can see, money is one of the things couples fight about; you'll see that when you look at the results.
In fact, one of the other important things is the detection of all these non-verbal vocalisations and cues, which turn out to be information-bearing, at least that's what the algorithms tell us.
So, as I mentioned, we used a lot of prosodic and acoustic features with simple binary classification, and here are the results. Even from a very simple system with just the acoustic features, for many of these constructs, like blame and positive and negative behaviour, we can do much better than chance. Given that these are purely vocal features, that was very encouraging.
Certain things, though, like sadness and humour, are harder to get just from acoustics, and the reason is that we are not capturing any contextual cues, lexical cues, visual cues, or anything like that.
So then we said, well, okay, now let's throw in the lexical information. If you look at the transcripts, there are a lot of words that scream at you, saying, hey, this person is really mad at that one; they're blaming each other. For example, in this transcript, which we've highlighted, the spouse kept saying "it's aggravating".
So we asked: can we automatically capture these kinds of salient words from the text? With simple maximum-likelihood language models, you can score an utterance against the models for each condition to figure out which condition the words correspond to.
This need not be done only at the utterance level, and the interesting thing is that the kinds of words that end up in these models are very informative. Even with very simple techniques, in the blame situation you can look at the extremes of the hyperplane, at the words most associated with blame: the second-person pronoun "you" is correlated with high blame quite strongly, in fact very consistently with what psychologists predict and hypothesize, compared to the first person. But you also see words like "cleaning", because cleaning seems to be a big deal when couples fight about living together; it comes up quite a bit.
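The per-condition language model scoring idea can be illustrated with maximum-likelihood unigram models and add-one smoothing. The tiny "corpora" here are invented examples, not data from the study.

```python
# Hedged sketch: score an utterance against per-condition unigram
# language models and pick the condition with the higher log-likelihood.
import math
from collections import Counter

def unigram_model(sentences):
    """Return a log-probability scorer for an add-one-smoothed unigram LM."""
    counts = Counter(w for s in sentences for w in s.split())
    total, vocab = sum(counts.values()), len(counts)
    def logprob(utterance):
        return sum(math.log((counts[w] + 1) / (total + vocab + 1))
                   for w in utterance.split())
    return logprob

high_blame = ["you never listen", "you always do this", "it is aggravating"]
low_blame = ["i see your point", "we can work on this", "i understand"]

score_high = unigram_model(high_blame)
score_low = unigram_model(low_blame)

utt = "you always do this it is aggravating"
pred = "high" if score_high(utt) > score_low(utt) else "low"
print(pred)
```

Inspecting which words drive the likelihood difference is also what surfaces the "salient words" mentioned above, such as second-person pronouns for blame.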
So that was the simple thing we did, but that's just a start; this problem domain adds a lot of challenges.
First of all, any single feature stream provides just a small window, as I pointed out, and it's noisy. So of course we want to do this multimodally, and we also want to do it in a context-sensitive fashion.
A more important thing is that many of these ratings, in many domains, are done at the session level; the raters give a gestalt judgment of that particular session. What is not clear is what, in that particular unfolding of the interaction, led to that perceptual judgment. So you want to know what was salient. We made some first attempts at this using multiple instance learning, to see what is possible.
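The multiple-instance view can be sketched as follows: the rating exists only at the session level, so each session is a "bag" of utterance feature vectors, and the most extreme utterance is the candidate salient moment. This is a naive MIL baseline on synthetic data, not the method actually used.

```python
# Toy MIL sketch: session-level labels, utterance-level features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

def make_bag(positive):
    """A session: 20 utterance vectors; positive sessions hide 3 salient ones."""
    bag = rng.normal(0.0, 1.0, size=(20, 2))
    if positive:
        bag[:3] += 3.0
    return bag

bags = [make_bag(True) for _ in range(30)] + [make_bag(False) for _ in range(30)]
bag_labels = np.array([1] * 30 + [0] * 30)

# Naive baseline: propagate each session's label to its utterances,
# train an instance scorer, then score a session by its top utterance.
X = np.vstack(bags)
y = np.repeat(bag_labels, 20)
scorer = LogisticRegression().fit(X, y)

bag_scores = np.array([scorer.decision_function(b).max() for b in bags])
thr = (bag_scores[bag_labels == 1].mean() + bag_scores[bag_labels == 0].mean()) / 2
acc = ((bag_scores > thr).astype(int) == bag_labels).mean()
print(acc)
```

The max over instances is what lets the model point back at which utterance "explained" the session-level judgment.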
Another point is that when these ratings are done, they are not really a typical categorisation setup; many times they are posed as a rank-ordered list, that is, one is less than two, which is less than three, and so on. So you want to know how to integrate this ordinal structure into the models.
Those are the kinds of things where we try to do more efficiently what people are already doing. But there are also things that live more on the "felt sense" side. People hypothesize that when two people interact, there is something about the synchrony in their interaction that tells you how smoothly the interaction proceeds. If you are able to quantify this particular aspect, what is called entrainment, that would be useful; you want to know whether we can build signal models that actually try to do this.
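One simple way to put a number on vocal entrainment, not necessarily the measure used in this work, is to correlate the two speakers' pitch contours over sliding windows. The contours, window size, and coupling below are all synthetic and illustrative.

```python
# Toy entrainment measure: mean windowed Pearson correlation between
# two pitch contours. Entrained pair shares a slow trend; control doesn't.
import numpy as np

rng = np.random.default_rng(4)
T, win = 1000, 100

base = np.cumsum(rng.normal(0, 1, T))            # shared slow trend
pitch_a = 120 + base + rng.normal(0, 2, T)       # speaker A
pitch_b = 180 + base + rng.normal(0, 2, T)       # speaker B, entrained to A
pitch_c = 150 + np.cumsum(rng.normal(0, 1, T))   # unrelated speaker

def sync(x, y):
    cs = [np.corrcoef(x[i:i + win], y[i:i + win])[0, 1]
          for i in range(0, T - win, win)]
    return float(np.mean(cs))

s_ab = sync(pitch_a, pitch_b)
s_ac = sync(pitch_a, pitch_c)
print(s_ab, s_ac)
```

Windowing matters: it captures local coordination rather than a spurious whole-session correlation driven by overall register differences.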
Another point: when people look for a particular behaviour pattern, different experts, even trained ones, look at it differently, and they respond to different portions of the data. So you want to know how we can capture this data-dependent human diversity in behaviour processing in our models. Simple plurality or majority-voting-based machine learning techniques might not necessarily work well for these kinds of abstract constructs.
So the first, easiest thing: we made the language information and the acoustic information work together, and of course that does better; at least that's what all these experiments show, including ours.
One caveat in our case was that our ASR was really bad, because we didn't have the data to build language models for the couples domain. But what was encouraging is that even with something like a thirty-five percent word error rate, the information from the language models, from the lattices we generated, put together with the acoustics-based classifiers, provided a fairly decent prediction of these codes, and the psychologists were very excited about that.
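One naive way to see why combining noisy acoustic and lexical judgments helps, even when one stream (say, a high-WER transcript) is unreliable, is simple posterior averaging. This is a generic late-fusion sketch on simulated scores, not the system's actual fusion scheme.

```python
# Toy late fusion: average two noisy per-stream posteriors and compare
# decision accuracy against either stream alone. Simulated data.
import numpy as np

rng = np.random.default_rng(5)
n = 500
truth = rng.integers(0, 2, n)

# Simulated stream posteriors: right on average, individually noisy.
acoustic = np.clip(truth + rng.normal(0, 0.45, n), 0, 1)
lexical = np.clip(truth + rng.normal(0, 0.45, n), 0, 1)

def acc(scores):
    return float(((scores > 0.5).astype(int) == truth).mean())

fused = (acoustic + lexical) / 2
print(acc(acoustic), acc(lexical), acc(fused))
```

Averaging shrinks the noise of either stream, so errors in the transcript stream can be partially compensated by the acoustics, mirroring the result described above.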
To actually make this more multimodal, since we really needed information about the nonverbal cues, we rigged up a lab with a couch for the therapy sessions, with several microphone arrays, synchronised with about ten HD cameras as well as a motion capture system, to provide data of that sort. It is very useful for a more careful study of human vocal and nonverbal behaviour interactions.
So you get data like this, and so goes the conversation. You can do a lot of things here: since we are collecting data in an instrumented environment, we can localize the speakers and do things of that sort quite well.
So we asked questions like: can we describe approach-avoidance behavior, which is very important in this setting? You can see in this couples interaction that this guy was leaning back quite a bit and expressing displeasure in the interaction through very subtle cues, the kind of thing body language experts point to. We tried to do this with signal processing.
Approach-avoidance is actually moving toward or away from events or objects, and it relates to psychological theory, to emotion and motivation, and particularly in the couples domain to relationship commitment. So people are very interested: if we can quantify this using vocal and visual cues, can we actually predict or model it?
That was a problem we took on. We had psychologists rate this on an ordinal scale, from minus four to four, a nine-point scale, and we posed it as an ordinal regression problem: we basically broke it down into a series of binary classifiers, one per threshold of the scale, and then put a logistic regression model on top of that, with multimodal features, both acoustic and visual.
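The threshold decomposition described above can be sketched as follows. This is a minimal illustration on synthetic data, not the model from the talk: the real system used multimodal acoustic and visual features and SVM-based variants, whereas here the features and labels are made up and the binary learners are plain logistic regressions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for multimodal session features: 400 samples,
# 6 features; the ordinal label 0..4 is driven by feature 0 plus noise.
X = rng.normal(size=(400, 6))
y = np.digitize(X[:, 0] + 0.3 * rng.normal(size=400), [-1.0, -0.3, 0.3, 1.0])

def fit_logistic(X, t, steps=2000, lr=0.5):
    # Plain batch-gradient logistic regression with a folded-in bias term.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - t) / len(t)
    return w

def prob(X, w):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

# Ordinal decomposition: one binary model per cut point k, each estimating
# P(y > k); the ordinal prediction counts how many thresholds are exceeded.
ws = [fit_logistic(X, (y > k).astype(float)) for k in range(4)]
pred = np.sum([prob(X, w) > 0.5 for w in ws], axis=0)
acc = (pred == y).mean()
```

The count-of-exceeded-thresholds rule keeps the prediction on the original nine-point-style scale even if the individual binary models are not perfectly consistent with each other.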
The computer vision here was tough, so we just took the motion capture data in lieu of the actual video data. That way we could get very clean measurements of things like head and body orientation, folding of the arms, how much they are leaning, and so on; at least to get an upper-bound idea of what kinds of visual features are important for measuring approach-avoidance. Plus the usual audio features that I do not need to tell you about: pitch and MFCCs and all that.
Interestingly, we showed that this ordinal formulation, which was published by a couple of my students, was actually very helpful compared with just formulating it as a plain classification problem. The chart here shows the difference between using an ordinal SVM and a plain SVM; higher bars mean a bigger difference in the error rates, and with audio plus video it is actually better. So again, multimodality is important; I know I am preaching to the choir here, but the point is that we can actually use these audiovisual cues to measure something like this, what psychologists perceive as approach-avoidance behavior. That was great.
So the point so far is that a multimodal approach is important. The next computational idea I want to share concerns the fact that raters often make these gestalt judgments on the data, and you want to know, from a pure learning point of view, how to make that more tractable: how do you choose and sample the data so that you can get the most information out? You can pose this in different ways; I will show a little study here.
We used multiple instance learning, again with the case study of these couples' interactions, to ask: can we identify speaker turns that are salient with respect to the session-level code? You have a ten-minute session, husband and wife taking turns talking, and we have a rating for it; you want to know which of these turns would most explain the observed rating. That is the problem. As usual, you extract features from the signals and you want to identify the turns that make the difference; we used a diverse-density-based SVM approach for doing this. The whole idea is as follows.
It is a very simple idea. You have this notion of positive bags and negative bags: high-blame sessions and low-blame sessions, high-acceptance and low-acceptance sessions, and the data from them. You create your feature space, here an acoustic feature space, then you compute the diverse density and select the local maxima, the idea being that these are the prototypes from your data. When you want to evaluate an incoming session, you compute the minimum distance to these prototypes and use those as your features rather than all the raw features. Simple idea.
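A toy sketch of that diverse density idea on synthetic bags. Everything here, including the two-dimensional feature space, the bag construction, and the planted salient region, is made up for illustration; the actual work used acoustic and lexical turn features and a more careful search.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy bags of instances (think: turns in a session, 2-D acoustic features).
# Each positive bag hides one "salient" instance near (3, 3); negative bags
# contain only background instances near the origin.
def make_bag(positive):
    inst = rng.normal(size=(10, 2))
    if positive:
        inst[0] = np.array([3.0, 3.0]) + 0.3 * rng.normal(size=2)
    return inst

pos_bags = [make_bag(True) for _ in range(20)]
neg_bags = [make_bag(False) for _ in range(20)]

def diverse_density(t):
    # Noisy-or diverse density of a candidate point t: high when every
    # positive bag has some instance near t and no negative instance is.
    def p_near(bag):
        d2 = ((bag - t) ** 2).sum(axis=1)
        return 1.0 - np.prod(1.0 - np.exp(-d2))
    dd = 1.0
    for b in pos_bags:
        dd *= p_near(b)
    for b in neg_bags:
        dd *= 1.0 - p_near(b)
    return dd

# Search over all positive-bag instances; the best-scoring local maximum
# plays the role of a prototype.
candidates = np.vstack(pos_bags)
scores = np.array([diverse_density(c) for c in candidates])
prototype = candidates[scores.argmax()]

# Session-level feature: minimum distance from any turn to the prototype.
def bag_feature(bag):
    return np.sqrt(((bag - prototype) ** 2).sum(axis=1)).min()
```

Positive sessions end up with small prototype distances and negative sessions with large ones, which is exactly the compact feature representation described above.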
Among the features we considered are the lexical features here; I put this table up just to point out that not only are the obvious lexical items important, but things like fillers and nonverbal vocalizations seem to pop up quite a bit under information-gain-based selection. So they are important for these kinds of behavioral signal processing tasks.
So we had all these different informative features, created a feature vector per session based on the diverse density, and here are some results for the acceptance problem. We could show that the MIL-selected features, compared with using all the features, were not only meaningful but also boosted the performance. The way we interpret it is that these are reasonable ways of selecting salient instances, with our definition of saliency tied to discrimination.
But when we added intonation features, at least for some of these constructs, it did not really help. Maybe the way we added the intonation features, as contours, was not right, or maybe they do not carry information for these behavioral constructs. That finding, and the usefulness of multiple-instance-based learning, held for many of the behavioral descriptions we were looking at, and that was encouraging.
But what we have not done yet is validate whether these machine-hypothesized salient instances are in fact consistent with what humans would pick if asked whether they are salient or not. So one thing we are interested in doing, and human experiments are underway, is to make this part of an active learning loop: the machine proposes certain instances, and humans can either confirm or correct them, and so on. That is interesting stuff. And you could throw in other features as well.
so
The next topic, moving along this line toward more abstract constructs, is the modeling of entrainment. Entrainment, also called interaction synchrony, refers to the naturally occurring coordination between interacting people, at multiple levels and along multiple communication channels. If you were at Interspeech this year, Julia Hirschberg gave a fantastic talk on this, on lexical entrainment. People have hypothesized that humans use this to achieve efficiency in communicating, to increase mutual understanding, and so on; it has been extensively studied in psychology and psycholinguistics.
What we wanted to see is: given these kinds of behavioral measurements, can we derive this high-level behavioral characterization? The thing is, you cannot really ask human raters, hey, are these people entraining or not; it is very difficult to judge, particularly compared with other signal-cue-based ratings.
Also, unlike many settings where people measure synchrony, where you have two aligned signals and can compute mutual information or correlation measures, here, because of the turn-taking structure, things are not aligned in time, so we have to think of other clever ways of computing it. And of course it is also directional: how much I entrain toward you is not necessarily the same as how much you entrain toward me. So we tried to figure out how to compute how similar two people sound in a spoken exchange.
As usual, we measure acoustic features; let me tell you a bit about them. What we experimented with here was to construct what we call a PCA vocal characteristics space, and then measure the similarity between these spaces, or project the data onto the spaces, to arrive at some similarity measure; that was the basic idea. The features are the usual ones, pitch, loudness and spectral features, extracted from the vocal data at the word level, and the PCA spaces are constructed both at the level of the turn and at the level of the whole session.
Then you can calculate various similarity measures. Doing PCA means you are transforming to a different coordinate space, and the two speakers' components are not necessarily aligned with each other, so measuring the angles between corresponding components gives you some notion of a similarity metric; you can also weight the components by the variance they explain. Or you can project one speaker's data onto the other's PCA space and calculate any number of different similarity metrics.
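The axis-angle variant can be sketched roughly as follows, with synthetic stand-ins for the word-level vocal features; the exact weighting and feature set in the actual study may differ, so treat this as one plausible instantiation of the idea.

```python
import numpy as np

rng = np.random.default_rng(2)

def pca_axes(X, k=3):
    # Principal axes (rows of Vt) and variances of a (frames x features)
    # matrix, via SVD of the centered data.
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k], (s ** 2 / (len(X) - 1))[:k]

def vocal_space_similarity(Xa, Xb, k=3):
    # Cosines of the angles between corresponding principal components of
    # the two speakers' PCA spaces, weighted by the variance explained.
    Va, va = pca_axes(Xa, k)
    Vb, vb = pca_axes(Xb, k)
    cosines = np.abs((Va * Vb).sum(axis=1))   # |cos| per axis pair
    w = (va + vb) / (va + vb).sum()           # variance weights
    return float((w * cosines).sum())

# Toy word-level feature streams (stand-ins for pitch, loudness, spectral
# statistics) with a strongly anisotropic shared structure.
base = rng.normal(size=(400, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.2])
similar = base + 0.1 * rng.normal(size=(400, 5))   # near-identical space
different = rng.normal(size=(400, 5))              # unrelated structure

sim_high = vocal_space_similarity(base, similar)
sim_low = vocal_space_similarity(base, different)
```

A speaker pair whose vocal spaces share structure scores near one, while unrelated feature streams score lower, which is the kind of graded similarity the entrainment measures need.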
Then you ask, what does this mean? The first thing we did was a sanity check: for real dialogues, hopefully there is some evidence that these measures distinguish them from artificial ones. So we constructed artificial dialogues by mixing in randomized data from other people, just as a sanity check to make sure that these measures separate the two cases. It does not tell you whether this is entrainment or not, but at least it tells you the measures reflect something about real dialogues. That was the first step.
The second step reflects the literature in this domain, where the belief is that entrainment is actually a useful mechanism that provides flexibility in these couples' interactions; it is thought to be a precursor to empathy and so on. So you want to see whether entrainment is higher in positive interactions than in negative ones; that was an indirect way of validating these entrainment measures.
so
Encouragingly, using just these entrainment measures, these similarity measures as features, we were able to distinguish between positive and negative interactions in a statistically significant way. Of course, we immediately wanted to build a prediction model, so we put these features into a factorial HMM and tried to see, using nothing but the entrainment features, how well you can predict how negative or positive the interaction was. We could do quite a bit better than chance, which is pretty encouraging.
Again, there are open questions here; this was just a small look at what is a pretty tough problem. How can we actually show entrainment across modalities? How do you do this in a truly dynamic framework? What are other ways of quantifying it, and how do you evaluate it better than just doing it indirectly? There are lots of very open theoretical and computational questions.
Finally, let me quickly say this: human annotators provide the reference in a number of cases, and often we do fusion of various sorts, whether of human raters or machine classifiers, and rely on the diversity of these classifiers so that aggregating them gives a better result. So what we want to know is how we can build mathematical models that reflect this diversity of people. For example, people have studied reliability-weighted data and classifier models, and shown that they do better than simple plurality voting. My student Kartik did some work on actually modeling this in an EM framework, and the results are very encouraging.
The point here is that these data tell us a lot about the wisdom of crowds, and the wisdom of experts. For modeling abstract constructs in particular, I think we have to bring explicit models of the evaluators into the classification and learning problems.
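A minimal sketch of the reliability-weighted EM idea, in the spirit of a one-coin Dawid-Skene model. The annotators, their reliabilities, and the data below are all simulated, and this is not the exact model from the student work mentioned; it just shows how EM can jointly recover annotator reliability and a better consensus label than plurality voting.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy crowd: three annotators label 300 binary items; the third is barely
# better than chance. The true reliabilities drive the simulation only;
# the EM loop never sees them.
truth = rng.integers(0, 2, size=300)
true_rel = np.array([0.9, 0.8, 0.55])
labels = np.array([np.where(rng.random(300) < r, truth, 1 - truth)
                   for r in true_rel])                 # shape (3, 300)

# EM: alternate a soft estimate of the true label (E-step) with
# per-annotator reliability estimates (M-step).
rel = np.full(3, 0.7)                                  # starting guess
for _ in range(25):
    log_odds = np.zeros(300)
    for j in range(3):
        w = np.log(rel[j] / (1.0 - rel[j]))
        log_odds += np.where(labels[j] == 1, w, -w)
    post = 1.0 / (1.0 + np.exp(-log_odds))             # P(truth=1 | labels)
    for j in range(3):
        rel[j] = np.mean(labels[j] * post + (1 - labels[j]) * (1 - post))

em_pred = (post > 0.5).astype(int)
majority = (labels.sum(axis=0) >= 2).astype(int)
```

The E-step weights each annotator's vote by the log-odds of their estimated reliability, so the near-chance annotator is effectively down-weighted instead of counting as a full vote, which is exactly where plain plurality falls short.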
So these are just some of the challenges that come up while attacking these types of behavioral questions; there are many others, but I wanted to give you a feel for them.
Now, very quickly, since I know Frank is showing me the time, I want to share a few slides about autism. Autism, as you know, is something we have been hearing a lot about in the news lately, with striking statistics about how many children are diagnosed. So we are asking what technology can do here, particularly for people working in speech, signal processing, and related areas: what can we do with computational techniques and tools to help better understand the various communication and social patterns in these children? One of the biggest hallmarks of autism is difficulty with social communication and prosody; perhaps we can better define and quantify these kinds of deficits.
The second thing, of course, is building interfaces that can elicit and increase specific social communication behaviors. So it is important to pursue these kinds of questions. We have been collecting data of child-psychologist interactions, about ninety kids to date, with transcribed audio and video data, and you can ask questions of various sorts with these types of data.
In these interactions, the psychologist interacts with the child and rates the child along a number of dimensions: showing empathy, shared enjoyment, prosody, and so on. And we looked at very simple measures, like just the durations in these interactions, how much speech is spoken by the child relative to the psychologist. It is very interesting how much of the ratings the psychologist provided can be explained by just this simple measure. It is interesting because it is observation based, and it can be computed consistently.
The other thing is speaking rate: just looking at a normalized speaking rate explains other codes. So even with the simple techniques we already have in hand, and for the kinds of behavioral constructs people are interested in, you can actually provide tools to support these assessments.
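The two simple measures just mentioned, the child-to-psychologist speech proportion and a normalized speaking rate, can be computed directly from diarized transcript segments. The segment tuples below are made-up placeholders standing in for the real session transcripts.

```python
# Diarized segments: (speaker, start_s, end_s, n_words) from a transcript.
segments = [
    ("child", 0.0, 2.5, 6), ("psych", 2.5, 6.0, 12),
    ("child", 6.0, 7.0, 2), ("psych", 7.0, 11.0, 15),
]

def duration(spk):
    # Total seconds of speech attributed to one speaker.
    return sum(e - s for who, s, e, _ in segments if who == spk)

# Child speech proportion relative to the psychologist.
ratio = duration("child") / duration("psych")

# Normalized speaking rate: words per second of the child's own speech.
child_words = sum(n for who, _, _, n in segments if who == "child")
child_rate = child_words / duration("child")
```

Session-level values like these can then be correlated against the psychologist's codes, which is all the "simple measures" analysis above requires.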
Of course, you can also use dialogue systems and the interfaces that a number of colleagues are developing to elicit interactions in a very systematic and reproducible way, because human-to-human interaction is variable: psychologists, even when they run a structured interaction, are not going to do it exactly the same way each time. And we wanted to see whether children would in fact interact naturally with these kinds of characters. We built this with the CSLU toolkit, which was robust, and we created a number of different emotional reasoning games, storytelling activities, and so on.
Anyway, the data we have: each child came four times, four hours each, for a total of about fifty hours of data. And very encouragingly, we could actually extract things like how the parents' interaction changed, along with physiological data; there are a lot of very interesting questions we can pursue, since we can measure speech parameters, language parameters, and visual features, to supplement what people are doing otherwise. A number of things become possible. I will cut the slides there; we can discuss the rest some other time.
What I wanted to show at this point, with a couple of examples, is that there are so many open challenges in these domains where a community like ours can contribute: everything from robust capture and processing of these multimodal signals, to deriving and finding appropriate representations for computing, to the signal processing itself, what kinds of features and feature engineering help, some data driven, some inspired by human-like processing.
There are different modeling schemes, mathematical schemes, that can bring some quantitative insight to these kinds of very subjective human-based assessments, and there is work to be done on questions such as data privacy.
So there are lots of interesting possibilities. We have been fortunate to work on a number of different mental health domains; in fact, I just touched upon one here, plus a little bit on autism, but there is lots more one could talk about. It is a fascinating area.
So, in conclusion: human behavior can be described in many ways; the same people interacting can be described by different sets of observers, from different perspectives, depending on what they are looking for. That offers a lot of challenges and opportunities for us to develop the computational advances needed in sensing, processing, modeling, and validation. But what is most exciting for me is this opportunity for interdisciplinary, collaborative scholarship.
And so, in sum: obviously, signal processing on the one hand helps us do things that people already know how to do well, perhaps more efficiently and consistently. But what is tantalizing is that we can actually provide new tools and data to offer insights that we have not had before. I think that is the exciting part here.
So I would like to thank you, and all my collaborators, of whom there are literally hundreds who helped with this work, and my sponsors. With that I will conclude, and I will show you something fun, since it is the holiday season.
This was actually a rap video; I convinced him to do it, don't ask.
so thank you again
Yeah, thank you very much for this very interesting, very enlightening talk. We have something like four minutes for questions, so I would like to open the floor.
A question on multimodal signal processing: as we know, some people are more formal, and we also use markers like a comfortable distance in communication, but that differs between people.
Proxemics, you mean? Yes. In fact, the body language data I showed very quickly, of these actors, includes distance measures estimated both from video and from full body motion capture. There will be a couple of papers at ICASSP to share on this body language work and what it can tell you about the dynamics of the interaction. Proxemics is also a feature in the approach-avoidance work, as in whether people are trying to come together or move away; in fact, a telling cue is just a little leaning or rushing away from the center of the interaction.
Is that culturally invariant? That is an important question; I think what you are alluding to is what the cultural underpinnings of these types of features are, and how to demonstrate them. We have not had data from different cultures in these studies, except that in the autism work we have data from kids growing up in families in Los Angeles, and Los Angeles is very multicultural. We have some data, but we have not had enough information to marginalize out those effects yet. So the only thing we have so far is the body language data from the actors.
Do we have another question? Okay, then I have a question myself. You touched very briefly on crowdsourcing; I am curious what your view is on the role crowdsourcing could play here, especially since a lot of these are subjective measurements.
Yes. We have used it for the more obvious things, like transcription, or judgments of things that can be defined well; asking people to rate those is easier. What I am finding difficult is to define these abstract tasks for ratings from a lot of people. We are trying right now to do sarcasm, or snarkiness even, where we try to see whether we can use the wisdom of crowds; the biggest challenge is how to partition these tasks so that you get consistent answers from people. For behavior processing, the bigger challenge is that many of these data are protected by all kinds of restrictions, so we cannot farm them out for crowdsourcing; with the actors' data we are able to do some things. We still have not figured out how to handle the abstract constructs, because we have to get these concepts internalized by the people doing the annotation. Simpler, more intuitive tasks are easier, I think.
Okay, are there any more questions from the floor? That was great, thank you. A couple of years ago, Julia Hirschberg gave a really interesting summary overview of what has been done on detecting lying, with obvious applications of course. And one of the main conclusions was that to detect lying you really need to know the individual's baseline; if you do not, it is still very hard. It goes a step beyond the earlier discussion. I wondered if you have come across any evidence for this with the kind of data you are looking at.
Yes, in fact, this is a very important question: how we can individualize and personalize. I believe that is one of the strong points of this approach: if we have enough data, we can actually learn individual-specific patterns fairly well. In autism that matters; people always talk about how heterogeneous it is, because the symptoms vary so much across children, and even within a child they depend on context. But the way children present themselves is fairly individual-specific; there are gaps and there are strengths for every individual, and you can learn those patterns from data fairly well over time, which you do not necessarily get from a single forty-five-minute set of interactions with a researcher or a clinician. I do believe that the ability to individualize models, with all the adaptation and background modeling techniques people talk about, actually lends itself to this.
The cultural aspects are slightly harder, not because we cannot try, but because it is very hard to collect data in a systematic, controlled way, so that you can say an effect is because of this factor and not that one. But individual-level models are easier, I believe.
In fact, that is one of the things we did with these computer-character-based interactions: bring the same child back over and over again, because they loved interacting with computer characters and having dialogues with them. So we have several hours of data from the same child, and we also have them interact with their parents and with an unknown person, a randomly assigned person; so you have human interaction with both familiar and unfamiliar partners, as well as human-computer interaction. You can actually begin to characterize a child fairly well: their lexical use, what kinds of initiative they take, and so on. We can begin to do that even with the simple speech and entropy ideas we bring to the table. As for lying and such, I do not know; I have not worked on it.
I am afraid we are out of time, so please thank the speaker again. Thanks.