Good morning, everyone. I am from the Nara Institute of Science and Technology in Japan. Today I would like to talk about our recent work on utilizing unsupervised clustering for positive emotion elicitation in a neural dialogue system.
In this research we particularly look at affective dialogue systems, that is, dialogue systems that take into account affective aspects of the interaction. Dialogue systems started as a way for users to interact naturally with a system, especially to complete certain tasks.
But as the technology develops, we see high potential for dialogue systems to address the emotional needs of the user, and we can see this in the increase of dialogue system applications in various tasks that involve affect: for example, companionship for the elderly, distress assessment, and affect-sensitive tutoring.
The traditional framework in a dialogue system with affective aspects is to infer emotion from utterances. There is emotion recognition, where we try to see what the user is currently feeling, that is, their affective state, and then use this information in the interaction. There is also emotion expression, where the system tries to convey a certain personality or emotion to the user.
However, these two alone do not fully represent the emotion processes in human communication, and as a result there is an increasing interest in emotion elicitation, which focuses on the change of emotion in dialogue.
There is existing work that uses machine translation to "translate" the user's input into a system response that targets a specific emotion. There is also work that implements different affective personalities in a dialogue system and studies how users are impacted by each of these personalities upon interaction.
The drawback, or shortcoming, of this existing work is that it has not yet considered the emotional benefit for the user. It focuses on the intent of the elicitation itself and whether the system is able to achieve this intention, but how this can benefit the user has not yet been studied.
So in this research we draw on this overlooked potential of emotion elicitation to improve the user's emotional state, and its form is a chat-based dialogue system with an implicit goal of positive emotion elicitation.
To formalize this, we follow an emotion model called the circumplex model. It describes emotion in terms of two dimensions: valence, which measures the positivity or negativity of the emotion, and arousal, which captures the activation of the emotion.
Based on this model, what we mean when we say positive emotion is emotion with positive valence. And what we mean when we say positive emotional change, or positive emotion elicitation, is any move in this valence-arousal space toward more positive feelings, so any of the arrows shown here is what we consider positive emotion elicitation.
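The valence-arousal formalization above can be sketched in a few lines of code. This is an illustrative sketch only: the emotion coordinates below are hypothetical, and the check reduces "toward more positive feelings" to an increase in valence.

```python
# Illustrative sketch of the circumplex model: emotions as points in a
# valence-arousal space, with coordinates in [-1, 1] x [-1, 1].
# These coordinates are made up for illustration.
EMOTIONS = {
    "excited": (0.7, 0.8),    # (valence, arousal)
    "content": (0.8, -0.4),
    "sad":     (-0.7, -0.5),
    "angry":   (-0.6, 0.8),
}

def is_positive_elicitation(before, after):
    """A change counts as positive emotion elicitation here when the
    move in valence-arousal space increases valence, i.e. points
    toward more positive feelings."""
    (v0, _), (v1, _) = before, after
    return v1 > v0

# A shift from "sad" toward "content" increases valence:
print(is_positive_elicitation(EMOTIONS["sad"], EMOTIONS["content"]))  # True
```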
Given a query uttered to a dialogue system or social bot, there are many ways to answer it, and in real life each of these answers has a different emotional impact, meaning they elicit different kinds of emotion. As can be seen in the very obvious example shown here, the first response has a negative impact and the second a positive one. And we can actually mine this response-emotion information from conversational data.
Now, if we take a look at chat-based dialogue systems, neural response generators have frequently been reported to perform well and to have promising properties. We have the recurrent encoder-decoder, which encodes the sequence of user input and then uses this representation to generate a sequence of words as the response.
The hierarchical recurrent encoder-decoder, or HRED, takes this a step further and models two different levels of sequences: a sequence of words that makes up a dialogue turn, and a sequence of dialogue turns that makes up the dialogue itself. When we model this in a neural network, we get something that looks like this.
At the bottom we have an utterance encoder that deals with the sequence of words; in the middle we take the dialogue-turn representations and model them sequentially as well. So when we generate a sequence of words as the response, we take into account not only the current turn but also the dialogue context, and this helps to maintain longer dependencies in the dialogue.
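The two-level encoding idea can be sketched as follows, with simple element-wise averaging standing in for the recurrent utterance and context encoders. This is an assumption-laden toy, not the actual GRU-based HRED; the point is only the word-level then turn-level hierarchy.

```python
import hashlib

def embed(word, dim=4):
    """Deterministic toy word vector (stand-in for learned embeddings)."""
    h = hashlib.md5(word.encode()).digest()
    return [b / 255.0 for b in h[:dim]]

def encode_utterance(words):
    """Utterance encoder: collapse a sequence of word vectors into one
    turn-level vector (here: element-wise mean, not a GRU)."""
    vecs = [embed(w) for w in words]
    return [sum(x) / len(vecs) for x in zip(*vecs)]

def encode_context(turns):
    """Context encoder: collapse the sequence of turn vectors into one
    dialogue-level vector, so the decoder can condition on the whole
    dialogue history, not just the current turn."""
    turn_vecs = [encode_utterance(t.split()) for t in turns]
    return [sum(x) / len(turn_vecs) for x in zip(*turn_vecs)]

dialogue = ["how are you", "not so great today", "what happened"]
context = encode_context(dialogue)
print(len(context))  # 4: one fixed-size vector for the whole dialogue
```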
In terms of application to emotion, recent work has proposed a system that can express different kinds of emotion by using an internal emotional state in the neural response generator. But you can see that applications to emotion elicitation using neural networks are still very lacking, if not altogether absent.
What we recently proposed is emotion-sensitive response generation, which was published in the AAAI proceedings this year. The main idea is to have an emotion encoder that takes into account the emotional context of the dialogue and uses this information in generating the response.
So now we have an emotion encoder, shown here, that takes the dialogue context and tries to predict the emotion context of the current turn. When generating the response, we use the combination of both the dialogue context and the emotion context, so in this way the network is emotion-sensitive. And if we train it on data that contains responses that elicit positive emotion, we can achieve positive emotion elicitation. Our subjective evaluation showed that this method works very well.
However, there are two main limitations. The first is that it has not yet learned elicitation strategies from an expert: the corpus we used consists of Wizard-of-Oz conversations, but we would like to see how an expert, that is, people who are knowledgeable in emotional interaction, would elicit positive emotion. The second is that the model still tends toward short and generic responses with positive affect words. This impairs engagement, which is important, especially in chat-oriented interaction.
The main focus of this contribution is to address these limitations. There are several challenges, which I will talk about now.
The first goal is to learn elicitation strategies from an expert, and the challenge is the absence of such resources. If we take a look at emotion-rich corpora, none of them have yet involved an expert in the data collection, and there is also no data that shows positive emotion elicitation strategies in everyday situations. So what we did was construct such a dialogue corpus; we carefully designed the scenario, and I will talk about this in more detail in a bit.
The second goal is to increase variety in the generated responses to improve engagement, and the main challenge here is sparsity. We would like to cover as much of the dialogue-emotion space as possible; however, it is really hard to collect large amounts of data annotated with emotion information reliably. We would like to tackle this problem methodically, so we hypothesize that higher-level information, such as dialogue acts, can help reduce the sparsity, by characterizing the types of responses, that is, the actions available to the system, and emphasizing this information in the training and generation process.
Putting it all together, we try to utilize this information in the response generation. The main difference now is that, using the dialogue state, we not only predict the emotional context of the dialogue but also try to predict the action that is suitable for the response. This gives us the proposed multiple-context HRED (MC-HRED), which uses a combination of these three contexts to generate a response.
Now, on to the corpus construction. As I said, the goal here is to capture expert strategies for emotion elicitation. What we do is record interactions between an expert and a participant; we recruited a professional counselor to take the place of the expert. The main thing is to condition the interaction at the beginning with negative emotion, so that as the dialogue progresses we can see how the expert drives the conversation to allow emotional recovery.
This is what a typical recording session looks like. We start with an opening of small talk, and afterwards we induce the negative emotion. To do this, we show videos, non-fictional videos such as interview clips, about topics that carry a negative sentiment, such as environmental change. The bulk of the session is the discussion that follows, which is what we talked about before.
We recorded sixty sessions, amounting to about twenty-four hours of data; we recruited one counselor and thirty participants. We made two recordings for each participant, and in each of the two sessions we showed a different video for the negative emotion induction.
For the emotion annotation, we rely on self-reported emotion: we have the participants watch the recordings they have just made and, using the GTrace tool with the scale shown on the right-hand side, mark their emotional state at any given time. If we project this over the length of the dialogue, we get an emotion trace that looks like this.
Of course, we also transcribed the recordings, and we use the combination of these two streams of information in training later on. But before that,
the other goal is to obtain higher-level information from the expert's responses. What we would like to have here is information that is roughly equivalent to dialogue acts, but specific to our dialogue scenario, because this is the scenario of particular interest. We would also like these dialogue acts to reflect the intents of the expert.
There are several ways to do this. One is human annotation; the obvious limitations are that it is expensive and that it is hard to reach reliable inter-annotator agreement. We could also use standard dialogue act classifiers, but the constraint there is that they may not cover the specific emotion-oriented actions we are interested in.
So we resorted to unsupervised clustering. We do this by first extracting the responses of the counselor from the corpus, and then, using a pre-trained word2vec model, we obtain a compact representation of each response. We tried out two types of clustering methods. With k-means, you need to define beforehand how many clusters you would like to find; in our case we chose k empirically. With DP-GMM, we do not have to define the model complexity beforehand, as the model itself tries to find the optimal number of components present in the data.
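The mechanics of the k-means step can be sketched as below on toy 2-D points. In the actual work the inputs are word2vec sentence vectors, and in practice one would use library implementations (e.g. scikit-learn's KMeans, and BayesianGaussianMixture for the DP-GMM variant) rather than this minimal version.

```python
# Minimal k-means sketch: alternate between assigning points to their
# nearest centroid and recomputing centroids as cluster means.
def kmeans(points, k, iters=20):
    centroids = points[:k]  # naive init: first k points
    for _ in range(iters):
        # Assignment step: nearest centroid by squared distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: centroid = mean of assigned points.
        centroids = [
            [sum(x) / len(cl) for x in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated groups of toy "response vectors":
pts = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (5.0, 5.1), (5.1, 5.0)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # [2, 3]
```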
Then we did some analysis. This is the t-SNE representation of the response vectors and their cluster labels; this is the result of the k-means clustering with the k that we chose. In one of the clusters we have many sentences that respond directly to the participant's content; in the red cluster we get affirmative or confirmation responses; and in the blue cluster we have listening responses, or backchannels. What we also get, though, is one very large cluster where all the more complex sentences are grouped together.
So we clustered that one more time and found a number of sub-clusters, with some examples shown on the right. In one sub-cluster we have a lot of sentences that contain recollections about the topic, and in the green sub-cluster we have sentences that focus on the participant, so "you" is the most common word there; it sounds like the counselor is trying to probe the participants' opinions and their assessment of the topic.
For the DP-GMM, the characteristics of each cluster are less apparent. This is probably due to the very imbalanced distribution of sentences over the clusters: we have a few very big clusters here and plenty of very small ones. And precisely because the clusters are bigger, it is harder to conclude what they represent.
We then put all of this into an experiment to see whether things work as we hope. This is the experimental setup. The first thing we do is pre-train the model: before we introduce anything action- or emotion-specific, we would like the model to have a prior for the general response generation task.
We use a large-scale dialogue corpus, the SubTle corpus, containing 5.5 million dialogue pairs from movie subtitles, and we train a standard HRED model on it, with no emotion or action information, only the dialogue context. We then fine-tune this pre-trained model on the counseling corpus that we have collected.
For comparison, we fine-tune three types of models: we have Emo-HRED, which relies only on the emotion context; we have MC-HRED, which uses both the emotion and the action contexts; and for completeness we also train a model that relies only on the action context.
Because the models differ, a little bit about how we pre-train and fine-tune. What pre-training does is initialize the weights of the HRED components, that is, the parts that have nothing to do with the additional contexts. In the fine-tuning, because the data that we have is pretty small, we do it selectively: we only optimize the parameters that are affected by the new contexts, so the decoder here and the two additional context encoders.
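The selective fine-tuning described above can be sketched as follows; the component names and the scalar "parameters" are illustrative stand-ins, not the actual parameterization.

```python
# Sketch of selective fine-tuning: pre-trained HRED parts stay frozen,
# and only the components touched by the new contexts (the decoder plus
# the two extra context encoders) receive gradient updates.
params = {
    "utterance_encoder": 1.0,  # pre-trained, frozen
    "context_encoder":   1.0,  # pre-trained, frozen
    "decoder":           1.0,  # affected by the new inputs -> tuned
    "emotion_encoder":   0.0,  # new component -> tuned
    "action_encoder":    0.0,  # new component -> tuned
}
TRAINABLE = {"decoder", "emotion_encoder", "action_encoder"}

def sgd_step(params, grads, lr=0.1):
    """Apply a gradient step, but only to the trainable subset."""
    return {
        name: value - lr * grads[name] if name in TRAINABLE else value
        for name, value in params.items()
    }

grads = {name: 1.0 for name in params}  # dummy gradients
updated = sgd_step(params, grads)
print(updated["utterance_encoder"], updated["decoder"])  # 1.0 0.9
```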
In terms of MC-HRED, we have three different targets during training, and each of those targets has its own loss. We have the negative log-likelihood of the target response; we have the prediction error of the emotion encoder, which tries to predict the emotional state; and we have the prediction error of the action encoder, which tries to predict the action of the response. We combine these losses by linearly interpolating them, and then use backpropagation on this combined loss to update the corresponding parts of the network.
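The linearly interpolated objective can be sketched as below; the interpolation weights here are hypothetical, not the ones used in the work.

```python
# Sketch of the multi-task objective: negative log-likelihood of the
# target response combined with the emotion and action prediction errors.
def combined_loss(nll, emotion_err, action_err, w=(0.8, 0.1, 0.1)):
    """L = w1 * NLL + w2 * L_emotion + w3 * L_action"""
    w1, w2, w3 = w
    return w1 * nll + w2 * emotion_err + w3 * action_err

# Backpropagating this single scalar updates every component that
# contributes to any of the three targets.
print(round(combined_loss(nll=2.0, emotion_err=0.5, action_err=0.3), 6))  # 1.68
```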
The first evaluation we did was to measure the perplexity of the models; for perplexity, lower is better. With Emo-HRED, the perplexity we get is 42.6, and if we use the action information instead we get a slightly better model. However, when we combine the information together, we see different results for each type of action label: for the k-means cluster labels we see some improvement, while for the DP-GMM labels the result is actually slightly worse.
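As a reminder of the metric, perplexity is the exponential of the average per-token negative log-likelihood, so lower is better. A quick sketch, with made-up token probabilities:

```python
import math

def perplexity(token_probs):
    """perplexity = exp( -(1/N) * sum(log p_i) )"""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A model that assigns each token probability 0.25 has perplexity 4,
# i.e. it is as uncertain as a uniform choice over four tokens:
print(round(perplexity([0.25, 0.25, 0.25]), 6))  # 4.0
```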
We analyzed this further by splitting the test set by query length, so we can compare results for shorter versus longer queries. There is a stark difference between the two groups: performance on short queries is consistently better than on long ones, which is not surprising, as long-term dependencies are still challenging for neural networks. The thing with MC-HRED is that it gains a substantial improvement for longer queries; most of the improvement it achieves comes from being able to perform better on long queries. From this we can see that the multiple contexts help especially for longer inputs.
We also did a subjective evaluation: we extracted a hundred queries and had each judged by crowd workers. We asked them to rate the naturalness, emotional impact, and engagement of the responses of two models: Emo-HRED as the baseline and the best-performing MC-HRED as the proposed system. We see improved engagement from the proposed model, while maintaining the emotional impact and naturalness.
And when we look at the responses generated by the systems, we see that the MC-HRED responses are on average two and a half words longer than the baseline's.
In conclusion, we have presented a corpus that captures expert strategies in positive emotion elicitation. We also showed how we use unsupervised clustering methods to obtain higher-level information, and how we use all of this in response generation.
In the future there are many things that need to be worked on, but in particular we would like to look at multimodal information; this is especially important for estimating the emotional context of the dialogue. And of course, evaluation with real user interaction is also important. That concludes my presentation.
So, the pre-training is done using another corpus, one that we did not construct ourselves; the training data is the one shown here, a large-scale corpus of movie subtitles. In the pre-training we did not use any emotion or action information, so the pre-training is only to give the model a prior for general dialogue generation. Then, in the fine-tuning, we give the model the ability to encode the action context and the emotion context, and to use these in the generation.
Right, so, about the word embeddings: there are two of them, used in different ways. The first one is a pre-trained word embedding model, which we use for the counselor dialogue clustering. The other is in the model itself, where we learn the word embeddings during pre-training; they are learned by the utterance encoder on the large-scale data. As for whether we cluster sentences or dialogue turns: for the expert response clustering we cluster sentences, and for that we use the pre-trained word2vec model, averaging the word vectors over the sentence.
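That averaging step can be sketched as follows; the tiny embedding table is made up for illustration, standing in for the pre-trained word2vec model.

```python
# Sketch of the sentence representation used for the clustering: the
# word vectors of a response are averaged element-wise to give one
# fixed-size vector per sentence. The vectors below are illustrative.
WORD_VECS = {
    "that":   [1.0, 0.0, 2.0],
    "sounds": [0.0, 3.0, 1.0],
    "hard":   [2.0, 0.0, 0.0],
}

def sentence_vector(sentence):
    """Average the known word vectors of a sentence element-wise."""
    vecs = [WORD_VECS[w] for w in sentence.split() if w in WORD_VECS]
    return [sum(x) / len(vecs) for x in zip(*vecs)]

print(sentence_vector("that sounds hard"))  # [1.0, 1.0, 1.0]
```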
Right, I only just heard about skip-thought vectors yesterday; that could be a different way to obtain the sentence representations. Thank you.
So, there is definitely an overlap between the actions we would like to find from the expert and the actions in general dialogues. We did find, for example, backchannels and confirmations, which are actions that occur in conversation generally. But the unsupervised clustering is especially helpful for the other, scenario-specific actions, and it does not need any expert annotation at all.
Right. So, what we find is that most of the time, the majority of the time, the counselor is able to reach the positive emotion. In terms of the participants' reactions toward the videos, it varies highly: there are people who are not so reactive, and there are people who are very emotionally sensitive, so we get different types of responses. This is an example from one of the dialogues; the red line here is the valence throughout the dialogue. We can see that it starts out quite positive, and during the video it becomes very negative, but as the dialogue progresses the counselor successfully elicits the positive emotion. We have a more extensive analysis in another paper; I'd be happy to tell you more about it.