Good morning everyone, and welcome to day three of SIGDIAL. I am delighted to be here to introduce our third keynote speaker, Professor Helen Meng from the Chinese University of Hong Kong. Helen got her PhD from MIT, and she has been a professor at the Chinese University of Hong Kong for some time now; I won't count the number of years. In addition to her work on many aspects of speech and language processing and language learning, she is also involved in university administration, and she has given presentations at the World Economic Forum and the World Peace Conference. So she is not just doing research, but is actually trying to bring information about speech and language to the public and to help other people. So without further ado, I would like to introduce Professor Helen Meng.
Thank you very much for the kind introduction. Good morning, ladies and gentlemen. I'm really delighted to be here, and I wish to thank the organizers for the very kind invitation. As I mentioned, I've been working a lot on language learning in recent years, but upon receiving the invitation from SIGDIAL, I thought this was an excellent opportunity for me to take stock of what I've been doing, rather serendipitously, on dialogue. So I decided to choose this topic, the many facets of dialogue, for my presentation.
The different facets that I am going to cover include dialogue in teaching and learning, dialogue in e-commerce, and dialogue in cognitive assessment; these first three are more application-oriented. The next two are more research-oriented: extracting semantic patterns from dialogues, and modeling user emotion changes in dialogues.
So here we go. The first facet is dialogue in teaching and learning. This project is about investigating student discussion dialogues and learning outcomes in flipped classroom teaching. It is joint work with my PhD student and a research assistant in our team, and we also have three undergraduate student helpers on this project.
This project came about because back in 2012 there was a sweeping change in university education in Hong Kong, where the universities had to migrate from a three-year curriculum to a four-year curriculum. What happened then was that we were admitting students who were one year younger, and we had to design a curriculum for first-year engineering students that is broad-based, meaning all engineering students need to take these courses. Among these is the engineering freshman math course, and because it is broad-based admission, we have really big classes.
After a few years of teaching these big classes, we realized that we needed to serve the students better, especially the elite students. So we designed an elite freshman math course, which has a much more demanding curriculum; of course, students can opt in and opt out of this course. It is basically a freshman-year engineering math course.
For this elite course we have a very dedicated teacher, my colleague Professor Sidharth Jaggi. He is very creative and innovative, and he has been trying out many different ways to teach the elite students, and many different ways to flip his classroom. Eventually he settled upon the mode that I am going to talk about. In general, flipped classroom teaching involves having students watch online video lectures before they come into class, and class time is then all dedicated to in-class discussions.
Students are given in-class exercises and they work in teams; they discuss and try to solve these problems, and sometimes a team gets picked to go up to the front and present their solution to their classmates. This is the setting; in fact it is in a computer lab, so you can see computers. I think it would be ideal if we had reconfigurable furniture in the classroom, but hopefully that will come someday.
As I mentioned, every week the class time is spent on peer-to-peer learning and group discussions, and some groups are selected to present their solutions. We set out to record the student group discussions during class. The dots show where the computer monitors are placed in the room, and the red dots are where we put the speech recorders. You can see the students in groups; we actually got consent from most of the groups, except for the two shown here, to record their discussions.
Schematically, the contents of an audio file look like this. The lecturer would start the class by addressing the whole class, and of course also close the class, so we have lecturer speech at the beginning and at the end. At various points in time during the class, sometimes the lecturer will speak and sometimes the TA will speak, again addressing the whole class. There are also times when a student group finishes an exercise and is invited to go up to the front to present their solution. All the other times are open for the student groups to discuss within the team, trying to solve the problem at hand. So this is the content of the audio file.
We thus have two types of speech: speech that is directed at the whole class, and the student group discussions. We devised a methodology to automatically separate these two types, so that we can filter out the student group discussion speech for further processing and study. This methodology we will be presenting at Interspeech next week.
Within the student group discussions, we segment the audio. The segmentation is based on speaker change, and also on pauses: if there is a pause of more than one second in duration, we segment there.
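As a minimal sketch of this segmentation rule (assuming time-aligned, speaker-labeled speech units are already available; the function name and data layout are illustrative, not our actual pipeline code):

```python
def segment_discussion(units, max_pause=1.0):
    """Split time-aligned speech units into segments.

    units: list of (speaker, start_s, end_s) tuples, ordered in time.
    A new segment starts on a speaker change, or when the silent gap
    between consecutive units exceeds max_pause seconds.
    """
    segments = []
    current = []
    for unit in units:
        if current:
            prev = current[-1]
            speaker_changed = unit[0] != prev[0]
            long_pause = unit[1] - prev[2] > max_pause
            if speaker_changed or long_pause:
                segments.append(current)
                current = []
        current.append(unit)
    if current:
        segments.append(current)
    return segments
```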
We have a lot of student helpers helping us transcribe the speech, and a typical transcription looks like this.
Each segment includes the speaker label and the transcribed content. In fact, although we teach and lecture in English, when the students are discussing openly among themselves, some of them discuss in Putonghua and some discuss in Cantonese. So here the speech is actually in Chinese, but I have translated it for presentation. Just to walk through these segments in turn: in the first segment a male speaker says it really should be the same, and then a female speaker says no, these two are always exactly the same, and so on. I am going to play for you what the audio sounds like, starting with the first segment. [Audio: first segment, second segment, third segment, fourth segment, and the last.] It is very noisy.
So what we have been working on is the transcription. The class exercises generally take one week to solve, and each week there are three classes, so together the recordings compose a set. We have ten groups, and over a semester we were able to record over twelve weeks, so we ended up with one hundred and twenty weekly group discussion sets, which we denote by WGDS. Of these, fifty-two have been transcribed; this is from the previous offering, last year's offering, of the course. The total amount of audio is about five hundred and fifty hours, the total amount of discussion is about two hundred and eighty hours, and we have transcribed about one hundred hours.
As a beginning step, we look at the weekly group discussion sets and examine the discussions of the students to see whether they are relevant to the course topic, and also what level of activity there was in the communicative exchange. We then conduct analyses to tie these measures to the academic performance of the group in the course.
If we look first at measures of relevance to the course topic, we divide these into two components. The first is the number of matching math terms that occur in the speech. For example, here is a group audio excerpt: "If there's a circle, then usually we use polar coordinates, and I've used polar coordinates and then I've used it for integration, but the variable y has some problems." That is what he is saying, and in this segment we see the matching math terms, based on some textbooks and math dictionaries, which are the resources that we have chosen; we take note of those.
The second component is content similarity. We figured that because the discussion is there to solve the in-class exercise, the discussion content should bear similarity to the in-class exercise. To measure that, we trained a word2vec model and used it to compute a segment vector for each segment in the discussion; we also get a document vector from the in-class exercise, and we measure the cosine similarity between them. Here is an example, with a high-similarity segment on top and a low-similarity segment at the bottom. You can see at first glance that the top two segments are indeed about math, and the third one mentions which chapter, so it is probably referring to the textbook, whereas the low-similarity segments are general conversation.
That covers the relevance of the content. We also measure the level of activity in information exchange, and for that we count the number of segments in the discussion dialogue, and also the number of words in the discussion dialogue, adding Chinese characters and English words together. So for each weekly group discussion set we have four features: two pertaining to relevance to the course topic, and two serving as information exchange measures.
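A stdlib-only sketch of these four features per discussion (with a tiny hand-made term list and toy word vectors standing in for our math-term resources and the trained embedding model; all names here are illustrative):

```python
import math
from collections import Counter

MATH_TERMS = {"integral", "polar", "coordinates", "derivative"}  # toy stand-in

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def avg_vector(tokens, word_vectors):
    """Average word vectors of known tokens (stand-in for the trained model)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def wgds_features(segments, exercise_tokens, word_vectors):
    """segments: list of token lists (Chinese characters / English words)."""
    tokens = [t for seg in segments for t in seg]
    doc_vec = avg_vector(exercise_tokens, word_vectors)
    sims = []
    for seg in segments:
        seg_vec = avg_vector(seg, word_vectors)
        if seg_vec is not None and doc_vec is not None:
            sims.append(cosine(seg_vec, doc_vec))
    return {
        "matching_terms": sum(1 for t in tokens if t in MATH_TERMS),
        "content_similarity": sum(sims) / len(sims) if sims else 0.0,
        "num_segments": len(segments),
        "num_words": len(tokens),
    }
```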
The next thing we do is look at the academic performance. The learning outcome that corresponds to each week's course topic is measured through the relevant question components present in the way we set the midterm paper and the final exam paper. Basically, we have a score where the final exam counts sixty percent and the midterm counts forty percent, but we have set the questions so that the course content for each week is present in different components of the midterm and final papers respectively. Therefore, we are able to look at a group's overall performance on the course content for a particular week.
This is the way we did the analysis, and here is a quick summary. Basically, we compared the high-performing groups with the low-performing groups, and it is no surprise that the high-performing groups generally have a much higher average proportion of matching math terms in their discussions. They also have higher content similarity, so the words they use, the discussion content, are much more relevant. In terms of communicative exchange activity, the high-performing groups have many more total segments exchanged and more words. Note that for the first three measures, namely matching math terms, content similarity, and number of segments exchanged, we ran a significance test and the differences are significant; the fourth one is at about 0.08, but I think it is still a relevant and important feature.
What I have presented to you is the first step, where we collected the data and investigated the discussion dialogues in the flipped classroom setting in relation to learning outcomes. In terms of further investigation, what our team would like to understand is how the student discussion can become an effective platform for peer-to-peer learning: how the dialogue facilitates learning and then enhances learning. Furthermore, since the high-performing teams conduct a very efficient exchange in their dialogues, we want to see whether we can use that information to inform group formation. Right now the students form groups at the beginning of the semester and stick with them for the entire semester. We are thinking that since the high-performing groups, as the results show, have very effective discussions, maybe if we are able to swap the groups around and let the benefits of the dialogue exchange to learning spread, then, as a rising tide raises all boats, it may enhance learning for the whole class. That is the direction we would like to take this investigation.
That was the first section; now I will move on to the second section, which is on e-commerce. This is the JD Dialogue Challenge in the summer of 2018. I had a summer intern that year, an undergraduate student, and I said, well, maybe you would be interested in joining the JD Dialogue Challenge, though you have no background in it. Luckily, I also had a part-time postdoctoral fellow on the team, as well as a recent graduate from my group who is now working for the startup SpeechX Limited. In particular, I would like to thank Dr. Xiaodong He and his colleagues at JD AI for running the JD Dialogue Challenge, from which we have benefited a lot, especially my student, the junior undergraduate, who learned a lot.
The goal of this dialogue challenge is to develop a chatbot for e-commerce customer service using JD's very large dataset. They gave us one million Chinese customer-service conversation sessions, which amounts to twenty million conversation utterances, or turns. The data covers ten after-sales topics, and they are unlabeled. Each of these topics may have further subtopics; for example, the topic of invoice modification can have the subtopics of changing the name, changing the invoice type, asking about e-invoices, et cetera.
The task is the following: we have a context, which consists of the two previous conversation turns (the four utterances from those two turns) plus the current query from the user, the customer, and the task is to generate a response for this context. So it is basically a five-utterance group, and we need to generate a response. The response from the system is evaluated by human experts from customer service.
There are two very well-known approaches, the retrieval-based approach and the generation-based approach, and we take advantage of the training data, with its context-response pairs, in building both. Our retrieval-based approach is very standard: basically TF-IDF plus cosine similarity.
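A self-contained sketch of such a retriever (pure-Python TF-IDF; in practice one would use a proper library and Chinese word segmentation, and the tokenized toy data here are illustrative only):

```python
import math
from collections import Counter

def tfidf_index(contexts):
    """contexts: list of token lists. Returns sparse tf-idf dicts and idf."""
    n = len(contexts)
    df = Counter(t for doc in contexts for t in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()}
               for doc in contexts]
    return vectors, idf

def sparse_cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query, contexts, responses, n=20):
    """Return responses of the n training contexts most similar to query."""
    vectors, idf = tfidf_index(contexts)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
    ranked = sorted(range(len(contexts)),
                    key=lambda i: sparse_cosine(q, vectors[i]), reverse=True)
    return [responses[i] for i in ranked[:n]]
```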
Our generation-based approach is also a very standard configuration. We segmented the Chinese context, that is, the two previous dialogue turns together with the current query, and we also segmented the response. We feed those data in and model the statistical relation between the context and the response using a seq2seq model with attention. That covers both the training and the inference phases.
The system that we eventually submitted is a hybrid model, based on a very commonly used rescoring framework. What we did was to use the retrieval-based approach to generate N response alternatives, where we chose N to be twenty, so that there is enough choice but it will not take too long, and then use the generation-based approach to rescore these twenty responses. The nice thing about this is that the generation-based approach considers the relationship between the given context and the chosen response. We rescore and re-rank, and we check whether the highest-scoring response has exceeded a threshold, which was arbitrarily chosen at 0.85. If it exceeds the threshold, we output that response; otherwise, we take that as a sign that our retrieval-based model does not have enough information to choose the right response, so we just use the entire seq2seq model to generate a new response.
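The decision logic of the hybrid can be sketched as follows, under the assumption that the retriever, rescorer, and generator are available as functions (all names are illustrative, and the stubs in the test stand in for the real models):

```python
def hybrid_respond(context, retrieve, rescore, generate, n=20, threshold=0.85):
    """Retrieve n candidate responses, rescore each against the context
    with the generation model, and fall back to free generation when even
    the best candidate scores below the threshold."""
    candidates = retrieve(context, n)
    scored = [(rescore(context, c), c) for c in candidates]
    if scored:
        best_score, best = max(scored)
        if best_score >= threshold:
            return best
    return generate(context)
```

With a stub rescorer that favors one candidate, the function returns that candidate; if every candidate scores low, it falls back to the generator.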
So that is the system, and we received a technology innovation award for it. It has been a very fruitful experience, especially for my undergraduate student: after this JD Dialogue Challenge she decided to pursue a PhD, and she is actually starting her first term as a PhD student in our lab now. We also got valuable data resources from industry during that summer. Moving forward, we would like to look into flexible use of context information for different kinds of user inputs, ranging from chit-chat to one-shot information-seeking enquiries, follow-up questions, multi-intent inputs, et cetera. Yesterday I saw a professor's poster with a very comprehensive decomposition of this problem.
That was my second project; now I will move to the third project, which looks at dialogue in cognitive screening: investigating spoken language markers in neuropsychological dialogues for cognitive screening. This is a recently funded, very big project, and we have a cross-university team: there is the Chinese University team, and we also have colleagues from HKUST and the Polytechnic University. From the Chinese University, not only do we have engineers, we also have linguists, psychologists, neurologists, and geriatric education specialists on our team, so I am really excited about this team. We have our teaching hospital, which is the Prince of Wales Hospital, and we are also building the new CUHK teaching hospital, which is a private hospital, so I think we are going to be able to recruit many subjects to participate in our study.
This study focuses on neurocognitive disorder (NCD), which is another term for dementia. As is well known, the global population is ageing fast, and Hong Kong's population is ageing even faster. NCD is very prevalent among older adults. It has an insidious onset; it is chronic and progressive, and there is a general, global deterioration in memory, communication, thinking, judgement, and other cognitive functions. It is among the most incapacitating diseases. NCD manifests itself in communicative impairments such as uncoordinated articulation, as in dysarthria; the subject may lose capability in language use, as in aphasia; and they may have reduced vocabulary and grammar, and weakened listening, reading, and writing. Existing detection methods include brain scans, blood tests, and face-to-face neuropsychological (NP) assessments, which include structured, semi-structured, and free-form dialogues. One free-form dialogue is where the participant is invited to do a picture description: they are given a picture, or sometimes a photograph, and asked to describe it.
My colleagues in the teaching hospital have been recording their neuropsychological tests (we are allowed to record them), and that provides some initial data for our research. The flow of the conversation includes the MMSE, the Mini-Mental State Examination, together with the Montreal Cognitive Assessment (MoCA) test; it is a combination of both, and some overlapping components are shared. We have about two hundred hours of conversations between the clinicians and the subjects; each is a one-on-one neuropsychological test.
Here is an example. We have normal subjects and also others who are cognitively impaired, and here are some excerpts of the conversations. This one is from a normal subject who was asked about the commonality between a train and a bicycle, and this is the answer. The clinician hints that one is big, and the subject says yes, the train is larger and the bike is smaller, isn't it; the clinician then asks, okay, but what is common between them, and the subject says that both are used for transport. For the cognitively impaired subjects, this next excerpt is more typical; in fact the original dialogue is in Chinese, so we have translated it into English for presentation here, and this is the dialogue for a cognitively impaired subject. We did some very preliminary analysis based on about twenty individuals, gender-balanced.
We looked at the average number of utterances in an NP assessment. You can see that for males the total number of utterances drops as we move from the normal to the cognitively impaired subjects, and the same trend holds for the females. For the gap time, which is essentially the reaction time, there is a general small increase going from the normal to the cognitively impaired; this chart is for the males and this one is for the females. Also, the normal subjects tend to speak faster, so they produce a higher average number of characters per minute and average number of words per minute. This is very preliminary data.
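These simple per-session measures can be computed directly from time-aligned transcripts; here is a sketch (the field names and exact definitions, e.g. of speaking time, are illustrative assumptions):

```python
def session_measures(utterances):
    """utterances: time-ordered list of dicts with 'start', 'end' (seconds)
    and 'text' (transcribed characters) for one subject's utterances."""
    gaps = [nxt["start"] - cur["end"]
            for cur, nxt in zip(utterances, utterances[1:])]
    speaking_minutes = sum(u["end"] - u["start"] for u in utterances) / 60.0
    n_chars = sum(len(u["text"]) for u in utterances)
    return {
        "num_utterances": len(utterances),
        "mean_gap_s": sum(gaps) / len(gaps) if gaps else 0.0,
        "chars_per_minute": n_chars / speaking_minutes if speaking_minutes else 0.0,
    }
```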
We are looking at different linguistic features, such as grammatical quality, information density, and fluency, and also acoustic features, in addition to reaction time, duration of pauses, hesitations, pitch, prosody, et cetera; we will be looking at a whole spectrum of these features. My student has also developed an initial prototype which illustrates how interactive screening may be done, and here is a demonstration video to show you. It starts with a word recall exercise.
System: Please listen carefully. I am going to state three words that I want you to try to remember, and repeat them back to me. Please repeat the following three words to me: season, [unclear], river. Say your response after the beep.

Subject: Well... season... [unclear]... river.

System: Good. Please remember the three words that were presented, and recall them later on. Please try your best to describe what is happening in the picture above. Tap on the button below to begin or complete your response.

Subject: I see a family of four sitting in the living room. There is a father, a mother, a girl, and a boy. They are... [unclear]... I can't really see much clearly, I don't know.

System: Good. Tap on the Done button if you have completed the task. Tap on the Try Again button to redo the picture description task. Please say the three words I asked you to remember earlier; recall and say the three words to me. Say your response after the beep.

Subject: Season... river... I don't remember the last one... [unclear].
So basically the system then tallies the results of the various tasks over the data, and there are score charts relating to, for example, how many correct responses were given, the response time, the gap time, et cetera. I need to state clearly that the voice, that is, the speech, is based on real data, but the real data are in Chinese. My student translated them into English and tried to mimic the pauses, and also the way the subject liked to talk to himself; he mimicked that as well, so this demo is for illustration only. Most of our data will be in Chinese: Cantonese, or maybe Mandarin.
As a quick summary, spoken dialogue offers easy accessibility and high feature resolution, down to even millisecond resolution in terms of reaction time, pause time, et cetera, for cognitive assessment. We want to develop speech, language, and dialogue processing technologies to support holistic assessment of various cognitive functions and domains by combining dialogue interaction with other interactions, and we also want to further develop this platform as a supportive tool for cognitive screening.
That is the end of the third project. Now I will move away from the application-oriented facets to the more research-oriented facets. The fourth project is on extracting semantic patterns from user inputs in dialogues, for which we have been developing a convex polytopic model; this is work done by a postdoctoral fellow in my group, myself, and a colleague. The study uses ATIS-2 and ATIS-3, together about five thousand utterances, to support our investigation. The convex polytopic model is an unsupervised approach that is applicable to short text, and it can help us automatically identify semantic patterns from a dialogue corpus via a geometric technique.
As shown here with well-known ATIS examples, we can see the semantic pattern "show me flights", which is an intent; another semantic pattern of going from an origin to a destination; and another semantic pattern, "on a certain day". We begin with a space of M dimensions, where M is the vocabulary size. Each utterance forms a point in this space, and the coordinates of the point are equal to the sum-normalized word counts along the axes.
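As a sketch, the mapping from an utterance to a point in this vocabulary space (sum-normalized word counts) might look like:

```python
from collections import Counter

def utterance_point(tokens, vocab):
    """Map an utterance to a point in the M-dimensional vocabulary space,
    where coordinate m is the utterance's count of vocab[m], normalized
    so that the coordinates sum to one."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in vocab]
```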
There are two steps in our approach. The first is to embed the utterances into a low-dimensional affine subspace using principal component analysis; this is a very common technique, and the principal components tend to capture features that can optimally distinguish points by their semantic differences. Then, in the second step, we generate a compact convex polytope to enclose all the embedded utterance points, using the Quickhull algorithm.
As an illustration, this is what we call a normal-type convex polytope. All of these points are utterance points; they illustrate the utterances in the corpus residing in the space, namely the affine subspace. Each vertex of the compact convex polytope is a point drawn from the collection of utterance points, so each vertex also corresponds to an utterance.
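In two dimensions, the enclosing polytope is just the convex hull of the embedded points. A minimal stdlib sketch (using Andrew's monotone chain in place of Quickhull, which yields the same hull, and treating the 2-D coordinates as PCA-projected utterance points):

```python
def convex_hull(points):
    """Return the hull vertices (counter-clockwise) of a set of 2-D points.
    Every returned vertex is one of the input points, mirroring the fact
    that every polytope vertex corresponds to an actual utterance."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]
```

Interior points (non-extreme utterances) never appear as vertices, which is why the vertices pick out the extreme, pattern-defining utterances.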
We can then connect the linguistic aspects of the utterances within the corpus to the geometric aspects of the convex polytope. You can think of it this way: the utterances in the dialogue corpus become embedded points in the affine subspace; the scope of the corpus is now encompassed by the compact convex polytope delineated by the boundaries connecting the vertices; and the semantic patterns of the language of the corpus are now represented as the vertices of the compact convex polytope. Because the vertices represent extreme points of the polytope, each utterance can also be formed by a linear combination of the polytope's vertices.
Let us look at the ATIS corpora. As you know, in ATIS we have these intents, which we colour-code here, and we plot the utterances in the ATIS training corpora in the space; this is shown as a two-dimensional space so that you can see all the plots on a plane. We then ran the Quickhull algorithm and it came up with this polytope, the most compact one, and you can see that the most compact polytope has twelve vertices, V1 through V12.
Each vertex also corresponds to an utterance. If you look at vertices one to nine, they are all dark blue in colour, and in fact they all correspond to utterances with the intent class of flight. Vertex ten is light blue, and it actually corresponds to the intent of abbreviation. Vertex eleven is also dark blue, as is vertex twelve. So this is an illustration of the convex polytope.
We can then look at each vertex. V1 to V9 each correspond to an utterance, and you can see V1 to V9 over here; they are very close together, and essentially they capture the semantic pattern of going from some origin to some destination. These are all utterances with the labeled intent of flight. Vertex twelve is very close by, and its constituent utterance is "flights to baltimore", so it has just the destination. We also want to look at vertices ten and eleven, so let us go to the next page.
For vertex ten, shown here in green are the neighboring utterances, and if you look at the constituent utterances, you can see that they are all questions of the form "what is <abbreviation>". As for vertex eleven, its nearest neighbors basically all capture "show me", as in "show me some flights". So you can see that the vertices, generally together with their nearest neighbors, capture some core semantic patterns.
For the convex polytope we do not have any control over the number of vertices, which is usually unknown until you actually run the algorithm. If we want to control the number of vertices, we can use a simplex. Here, again, we want to plot in two dimensions, so we chose a simplex with three vertices; to constrain it to three vertices, we can use a sequential quadratic programming algorithm to come up with the minimum-volume simplex.
Just to recall, this is the normal-type convex polytope, and you can see it has twelve vertices. Now we want to constrain the number of vertices to three, that is, we want to generate a minimum-volume simplex, and here is the output of the algorithm. We now have the minimum-volume simplex with three vertices. If you look at this minimum-volume simplex, with vertices one, two, and three, and compare it with the previous normal-type convex polytope: vertex one of the simplex corresponds to vertex eleven of the normal-type polytope, and it also happens to coincide with an utterance. If we go to vertex three of the simplex, you can see a light blue dot here, which actually corresponds to vertex ten of the normal-type polytope; it is very close by, so vertex three of the simplex is very close to vertex ten of the normal-type polytope.
Now, what about all the vertices from one to nine, and also vertex twelve? These are all grouped in here, covered by a slight extension of vertex two. You can see that the minimum-volume simplex is now encompassing all the utterances; we are no longer guaranteed that a vertex itself is an utterance point, but we have only three vertices, and the resulting minimum-volume simplex is formed by extrapolating the three lines that join vertices of the previous normal-type bounding convex hull, including V10, V11, V12, and V8 and V9, into the three sides.
For each vertex of this minimum-volume simplex we can look further. For example, for the first vertex, you can look at its top nearest neighbors, and here is the list of the utterances corresponding to each point in the nearest-neighbor group; they all have the pattern "show me some flights from someplace to someplace", "show me flights", so that is a semantic pattern. Now let us look at vertex two: here you can see the pattern of going from an origin to a destination. Every vertex also resides in the M-dimensional space, so its coordinates can show us the top words, the strongest words, that are most representative of the vertex; you can also see the list of the ten top words from each vertex's coordinates. Now let us look at V3: its nearest neighbors are shown here, and they are mostly about what is meant by an abbreviation. So the minimum-volume simplex allows us to pick the number of vertices we want to use, and it also shows some of the semantic patterns that are captured.
and we paid three because we wanna be able to plot it
in fact and we can pick any arbitrary number of higher dimensions
so
we can examine at a higher dimensionality that semantic patterns
by analysing the nearest neighbors and also the top words of the verdict sees
So for example we ran one with sixteen dimensions, so we end up with seventeen vertices. I list the first ten here, followed by the next seven, so seventeen altogether, and here are the top words for each vertex and also the representative nearest neighbors.
You can see that, for example, vertex four is capturing the semantic patterns "show me something" and "number x from someplace to someplace". Vertex eight is "what does some abbreviation mean", and vertex nine is asking about ground transportation. We also have vertices one, two, and five, which are all related to locations, and I think that is perhaps due to data sparsity. Vertex three is about "can I get something" or "I would like something", and one vertex is really just a bunch of frequently occurring words, I guess.
Now if we look at the next set of vertices: vertex thirteen is about "flights from someplace", maybe "to someplace" as well; fourteen is "what is something"; sixteen is "list all something"; and again vertices eleven, fifteen, and seventeen are location names. Vertex twelve is an airline name, or really about either a date or an airline, so I think this is a case where we may have been limited by the number of subspace dimensions. If we ran the same experiment with more dimensions, hopefully it would separate the date from the airline.
So basically we are just playing around with this convex polytope topic model as a tool for exploratory data analysis. I like the geometric nature because it helps me interpret the semantic patterns, and my hope is to extend this from semantic pattern extraction to tracking dialogue states in the future.
So that is section four. And now section five, my last section, which is on affective design for conversational agents: modeling user emotion changes in a dialogue.
This is actually the PhD work of a student from Tsinghua University, who also interned in our lab in Hong Kong for a couple of summers, because the direct supervisor is a professor at Tsinghua University. This work was conducted in the Tsinghua-CUHK Joint Research Center for Media Sciences, Technologies and Systems, which is in Shenzhen, and it is funded by the National Natural Science Foundation of China and the Hong Kong Research Grants Council Joint Research Scheme.
Our long-term goal is to impart affect sensitivity into conversational agents, which is important for user engagement and also for supporting socially intelligent conversations.
This work looks at inferring users' emotion changes. The main assumption is that the emotive state change is related to the user's emotive state in the current dialogue turn and also the corresponding system response. So the objective is to infer the user's emotive state and also the emotive state change, which can in the future inform the generation of the system response.
We use the PAD model, the pleasure-arousal-dominance framework, for describing emotions in a three-dimensional continuous space. Pleasure is about positive versus negative emotions, arousal is about mental alertness, and dominance is more about control.
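In this framework an emotive state is just a point in the three-dimensional PAD space, and a state change is the difference between consecutive turns. A minimal sketch, with PAD values invented for illustration (not annotations from the actual corpus):

```python
# Each dialogue turn's emotion is a point (p, a, d) in a continuous space;
# the emotive state change is the per-dimension difference between turns.
def pad_change(prev, curr):
    return tuple(c - p for p, c in zip(prev, curr))

turn_2 = (-0.6, 0.2, -0.4)   # illustrative: negative pleasure, low dominance
turn_3 = (-0.2, 0.3, -0.1)   # illustrative: after a comforting system response
print(pad_change(turn_2, turn_3))
```

All three components increase here, matching the kind of shift described for the comforting response in the example dialogue.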
So this is a real dialogue, originally in Chinese, which I have translated into English here for presentation. It is a dialogue between a chatbot and the user, and we have annotated the PAD values for each dialogue turn. You can see, for example, in dialogue turn two the user said "somebody broke up with me", and the response from the system is "let it go, you deserve a better one", and you see that from that dialogue turn the values of P, A, and D all increase. Then, for example, in dialogue turn eight, the user said something and the system's reply seemed to amuse the user, and it also softened the value of the dominance.
So these are the values that we work with in the PAD space, and this is our approach towards inferring emotive state change. On the left is the speech input; on the right is the output of emotion recognition and the prediction of emotive state change.
We start by integrating the acoustic and lexical features from the speech input. This is basically a multimodal fusion problem, and it is achieved by concatenating the features and then applying a multitask learning convolutional fusion auto-encoder, so it goes through different layers of convolution and max pooling.
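The two operations just named can be sketched in a few lines. This is a toy, pure-Python illustration of concatenation followed by 1-D convolution and max pooling, not the actual multitask auto-encoder (which learns its kernels and works on much higher-dimensional features):

```python
# 1-d "valid" convolution over a feature sequence.
def conv1d(seq, kernel):
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

# Non-overlapping max pooling, which keeps the strongest activation per window.
def max_pool(seq, window=2):
    return [max(seq[i:i + window]) for i in range(0, len(seq) - window + 1, window)]

# Acoustic and lexical features are concatenated before the conv layers.
acoustic = [0.2, 0.8, 0.1]
lexical = [0.5, 0.4, 0.9, 0.3]
fused_in = acoustic + lexical

feat = max_pool(conv1d(fused_in, kernel=[0.5, 0.5]))
print(feat)
```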
Then we also capture the system response as a whole utterance. This is because the holistic message is received by the user, and the entire message plays a role in influencing the user's emotions. The system response encoding uses a long short-term memory recurrent auto-encoder, and it is trained to map the system response into a sentence-level vector representation.
Next, the user's input and the system's response are further combined using convolutional fusion, and the framework then performs emotion recognition using stacked hidden layers. The results are then further used for inferring the emotive state change, and for this we use a multitask learning structured output layer, so that the dependency between the emotive state change and the emotion recognition output is captured. In other words, the emotive state change is conditioned on the recognized emotive state of the current query.
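The conditioning idea can be sketched very simply. This toy example (my own, with made-up linear heads rather than the actual trained network) shows the structural point: the state-change head takes the recognition head's output as an extra input, so its prediction depends on the recognized emotion.

```python
def linear(weights, bias, x):
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict(fused_features, w_emo, b_emo, w_delta, b_delta):
    # Emotion recognition head (here: one scalar, e.g. the pleasure dimension).
    recognized_p = linear(w_emo, b_emo, fused_features)
    # The change head is conditioned on the recognized state by appending it.
    conditioned = fused_features + [recognized_p]
    delta_p = linear(w_delta, b_delta, conditioned)
    return recognized_p, delta_p

rec, delta = predict([0.5, 0.45, 0.65],
                     w_emo=[0.2, -0.1, 0.4], b_emo=0.0,
                     w_delta=[0.1, 0.1, 0.1, -0.5], b_delta=0.05)
print(rec, delta)
```

Training both heads jointly, with the dependency wired in this direction, is the essence of the multitask structured output layer.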
Now, the experimentation is done on IEMOCAP, which is a corpus very widely used in emotion recognition, and also on the Sogou voice assistant corpus. The Sogou corpus has over four million utterances in three domains, transcribed by an ASR engine with a 5.5 percent word error rate. We actually look at the chat dialogues: there are ninety-eight thousand such conversations, between four and forty-nine turns each, and we used a pre-trained emotion DNN to filter out the neutral conversations. So we ended up with about nine thousand emotive conversations, with over fifty-two thousand utterances, which were selected for labeling with the PAD values. And then we run the emotion recognition and also the emotive state change prediction.
We use a whole suite of evaluation criteria on the predicted emotive states in PAD values and also the emotive state changes in PAD values: the unweighted accuracy, the mean accuracy over the different emotion categories, the mean absolute error, and also the concordance correlation coefficient.
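The two regression-style criteria are standard and easy to state exactly. A small sketch with made-up predictions (not results from the talk): MAE is the average absolute deviation, and the concordance correlation coefficient additionally penalizes mean and variance mismatch, not just low correlation.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def ccc(y_true, y_pred):
    """Concordance correlation coefficient (Lin, 1989)."""
    n = len(y_true)
    mt = sum(y_true) / n
    mp = sum(y_pred) / n
    vt = sum((t - mt) ** 2 for t in y_true) / n
    vp = sum((p - mp) ** 2 for p in y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred)) / n
    return 2 * cov / (vt + vp + (mt - mp) ** 2)

# Illustrative pleasure-dimension targets and predictions.
true_p = [0.1, -0.3, 0.5, 0.2]
pred_p = [0.0, -0.2, 0.4, 0.3]
print(round(mae(true_p, pred_p), 3), round(ccc(true_p, pred_p), 3))
```

Perfect predictions give CCC = 1, while a constant predictor gives CCC near 0, which is why CCC is a stricter criterion than accuracy for continuous PAD values.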
This is a benchmark against other recent work using other methods, for IEMOCAP and also for the Sogou data sets. The proposed approach actually achieves competitive performance in emotion recognition. In emotion change prediction, our proposed approach achieves significantly better performance than the other approaches, but there is still room for improvement if you compare with human performance in human annotation.
So to sum up, this is among the first efforts to analyze user input features, both acoustic and lexical, together with the system response, in order to understand how the user's emotion changes due to the system response in the dialogue. We have achieved competitive performance in emotive state change prediction, and we believe that this is a very important step towards socially intelligent virtual assistants, with the incorporation of affect sensitivity for human-computer interaction.
So my talk was in five chunks, but this is the overall summary. Basically, when I look back at all these different projects, a very clear message arises: much can be gleaned from dialogues to understand many important phenomena, including how student group discussions may facilitate learning, how the customer experience can be shaped by chatbot responses, and also the status of an individual's cognitive health. And I guess I'm preaching to the choir here, but I truly believe that we have only seen the tip of the iceberg, and there is tremendous potential, with abundant opportunities and a lot of research to be done. So thank you very much.
Thank you very much. Do we have questions?
Thank you very much. My question is regarding topic three, cognitive impairment. We are also working on that. Severe cognitive impairment is easy to detect: with just a small conversation we can identify that a person has cognitive impairment. But I think the problem is mild cognitive impairment, MCI, which is very difficult to detect. So I think the final goal of this work is maybe to estimate the degree of cognitive impairment using features. What do you think?
So thank you very much for the question. Indeed, in our study we will be covering the range from normal adults to what is now called minor NCD, the new terminology: minor neurocognitive disorder, the mild one, and major neurocognitive disorder, the severe one.
This is what we learnt from our colleagues in neurology. For elderly people, we need to be more diligent in engaging them in these cognitive assessments, because they are really exercises, and there are subjective fluctuations going from one exercise to another. Therefore, the more frequently you can take the assessment, the better. The issue is not the exact scoring; rather, it is at the personal level: if there are any sudden changes, perhaps more drastic changes, in the scoring level of the individual, that would be an important sign. So frequent tracking is important.
Sometimes the minor NCD, the milder cognitive impairment, is harder to detect, and you also have to tease apart the natural cognitive decline due to ageing from the pathological cognitive decline. So it is a complex problem, but nevertheless, because dementia is such a big problem in the ageing global population, and there is no cure, we just have to work very hard on how to do early detection and intervention. Thank you for the question.
Thank you for this very nice talk; the many topics are really impressive. I was wondering, especially in relation to the classrooms and to the cognitive screening: at the moment, if I understood correctly, you are working on transcriptions, right? Have you made any experiments with ASR, and if so, what was your experience there? What is the likelihood of it being sufficiently good?
So, the classroom is very difficult; that is why we have no choice but to work on transcriptions. But as for the way we have recorded these neuropsychological tests, it is actually just between the clinician and the subject. We did not want the recording conditions to be intrusive in any sense, so we just put a phone there, and of course we obtain the subject's consent. Depending on the device, some of it we think is doable, but we would need robust speaker adaptive training and noise-robust speech processing; we need to throw in the kitchen sink to be able to do well.
Thanks for a great talk. On the cognitive assessment, from a discourse structure point of view: I was wondering what sort of processing you plan to do on those descriptions that the subjects provide, apart from, you know, speech processing and lexical cohesion. Any thoughts about discourse coherence and rhetorical relations among the sentences that they provide, and so on?
So thank you for that wonderful question. We must look at that; we haven't looked at it yet, but actually I have heard from our colleagues, the clinicians, that coherence in following the discourse of a dialogue oftentimes shows problems if there is cognitive impairment. So that is definitely one aspect that we must look at, and in fact we would welcome any interested collaborators to look at that together. Thank you for the question.
Thanks for the very interesting talk. I want to ask about the emotion modeling: is the PAD-space modeling just based on speech input, or are you also using nonverbal signals, like laughter or sighing, little things like that?

Right now we don't have that. It would be wonderful if we could have those features, but right now it is really the speech input, so acoustic and lexical input, and also the sentence-level encoding of the system's response.
Hi, my question is about section five. You did two prediction tasks: emotion recognition and emotive change prediction. Even though these seem similar, I think there is a subtle but important difference between the two. So my question is: do you use the same features to do both? Do you think there are features that are more important for the emotive change rather than for the emotion recognition, and what differences have you seen between the two?
Great question. So we think that for the current query, based on the current user input, we want to be able to understand the emotion of the user. But if you think about what comes next: depending on how we respond to the user, that is, the system response, the user's emotion change and the next input may be different. So for example, here the subject is talking about a breakup, and at first the system tries to comfort the subject. Then at some point the user gets inquisitive: "are you real or not? how can a robot know what I like?" And the system says "I know what you like." Then the user says something, and at this point of the dialogue the system can respond in various ways, and in the end the user says "you must be real." So I think the emotive changes depend on the system response. If we can model that, and the way we have modeled it is through multitask training, where the emotive state change is dependent on the recognized emotion, then we can capture this dependency and, in the future, utilize it to choose how to generate the system response, so that you can hopefully guide the emotion change in the dialogue in the way you want.