0:00:14 | So, good morning everyone. I am from the Nara Institute of Science and Technology
---|
0:00:20 | in Japan.
---|
0:00:22 | Today I would like to talk about our recent work on utilizing unsupervised clustering for positive emotion elicitation
---|
0:00:29 | in a neural dialogue system.
---|
0:00:32 | So in this research we particularly look at affective dialogue systems, that is,
---|
0:00:37 | dialogue systems that take into account
---|
0:00:40 | affective aspects in the interaction.
---|
0:00:43 | Dialogue systems serve as a way for users to interact naturally with a
---|
0:00:48 | system,
---|
0:00:49 | especially to complete certain tasks.
---|
0:00:51 | But as the technology develops, we see high potential for
---|
0:00:56 | dialogue systems to address the emotional needs of the user,
---|
0:00:59 | and we can see this in the increase of dialogue system works and applications
---|
0:01:04 | in various tasks that involve affect:
---|
0:01:07 | for example, companionship for the elderly,
---|
0:01:10 | distress assessment, and affect-sensitive tutoring.
---|
0:01:15 | Traditional frameworks for dialogue systems with affective aspects have revolved around two themes.
---|
0:01:20 | First, there is emotion recognition
---|
0:01:21 | from utterances, where we try to see what the
---|
0:01:27 | user is currently feeling, i.e., their affective state, and then use this information in the
---|
0:01:31 | interaction.
---|
0:01:32 | And there is also emotion expression, where the system tries to convey a certain personality or
---|
0:01:38 | emotion to the user.
---|
0:01:41 | While useful, this does not fully represent emotion processes in human communication,
---|
0:01:47 | and as a result there is an increasing interest in emotion elicitation,
---|
0:01:51 | which focuses on the change of emotion
---|
0:01:55 | in dialogue.
---|
0:01:57 | There is some work, e.g., by Hasegawa et al., that
---|
0:02:01 | uses
---|
0:02:02 | machine translation to translate the user's input into a system response that targets a
---|
0:02:09 | specific emotion.
---|
0:02:10 | There is also work on personality that implements different affective personalities in dialogue systems
---|
0:02:18 | and studies how users are impacted by each of these personalities
---|
0:02:22 | upon interaction.
---|
0:02:24 | So the
---|
0:02:26 | drawback, or the shortcoming, of these existing works is that they have not yet
---|
0:02:31 | considered the emotional benefit for the user.
---|
0:02:34 | They focus on the intent of the elicitation itself and on whether the system will
---|
0:02:38 | be able to achieve this intention,
---|
0:02:41 | but how this can benefit the user has not yet been studied.
---|
0:02:47 | So this research taps into an overlooked potential of emotion elicitation: to improve the user's
---|
0:02:53 | emotional state.
---|
0:02:55 | We formulate this as a chat-based dialogue system
---|
0:02:59 | with an implicit goal of positive emotion elicitation.
---|
0:03:02 | Now, to formalize this, we follow an emotion model which is called the circumplex
---|
0:03:07 | model.
---|
0:03:08 | This describes emotion in terms of two dimensions: there is valence,
---|
0:03:12 | which measures the positivity or negativity of the emotion,
---|
0:03:15 | and there is arousal, which captures the activation of the emotion.
---|
0:03:20 | So based on this model, what we mean when we say a positive emotion is an
---|
0:03:25 | emotion with
---|
0:03:26 | positive valence,
---|
0:03:29 | and what we mean when we say positive emotional change, or positive emotion elicitation,
---|
0:03:34 | is any movement in this valence-arousal space towards more positive feelings. So any
---|
0:03:40 | of the arrows that are shown here we consider positive emotion elicitation.
---|
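To make this formalization concrete, here is a minimal sketch (illustrative names and value ranges, not code from the talk) of the valence-arousal representation and the positive-elicitation check:

```python
# Circumplex model: emotions live in a 2-D valence-arousal space, and
# positive emotion elicitation is any movement toward higher valence.
from dataclasses import dataclass

@dataclass
class EmotionState:
    valence: float  # positivity/negativity of the emotion, e.g. in [-1, 1]
    arousal: float  # activation of the emotion, e.g. in [-1, 1]

def is_positive_elicitation(before: EmotionState, after: EmotionState) -> bool:
    """True if the user's emotion moved toward more positive feelings."""
    return after.valence > before.valence

# Example: a move from mildly negative to mildly positive valence counts.
print(is_positive_elicitation(EmotionState(-0.4, 0.2), EmotionState(0.3, 0.1)))  # True
```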
0:03:46 | So given a query in a chat-based dialogue system, or social bot,
---|
0:03:50 | there are many ways to answer it,
---|
0:03:53 | and actually in real life each of these answers has a different emotional impact,
---|
0:03:58 | meaning they elicit different kinds of emotion.
---|
0:04:01 | As can be seen in the very obvious example shown here, the first one
---|
0:04:05 | has a negative impact and the second one a positive one.
---|
0:04:09 | And we can actually
---|
0:04:11 | find this kind of response information in conversational data.
---|
0:04:16 | Now, if we take a look at chat-based dialogue systems,
---|
0:04:19 | neural response generators have been frequently reported to perform well
---|
0:04:24 | and to have promising properties.
---|
0:04:27 | We have the recurrent encoder-decoder,
---|
0:04:29 | which encodes the sequence of user inputs and then uses this representation
---|
0:04:34 | to output a sequence of
---|
0:04:35 | words
---|
0:04:37 | as the response.
---|
0:04:38 | And Serban et al. took this a step further and defined two levels
---|
0:04:44 | of
---|
0:04:44 | sequences:
---|
0:04:45 | we have the sequence of words
---|
0:04:47 | that makes up a dialogue turn, and then we have the sequence of dialogue turns that
---|
0:04:51 | makes up the dialogue itself.
---|
0:04:53 | When we model that in a neural network, we get something that looks
---|
0:04:57 | like this.
---|
0:04:58 | At the bottom we have an utterance encoder that encodes the sequence of
---|
0:05:02 | words,
---|
0:05:02 | and in the middle we take the dialogue turn representations
---|
0:05:07 | and then
---|
0:05:08 | model them sequentially.
---|
0:05:10 | So when we
---|
0:05:12 | generate a sequence of words as the response, we take into account not only the
---|
0:05:17 | current dialogue turn but also the dialogue context,
---|
0:05:20 | and this helps to maintain
---|
0:05:22 | longer dependencies in the dialogue.
---|
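As a rough sketch of the hierarchical recurrent encoder-decoder (HRED) just described, here is a minimal PyTorch version; the dimensions and module names are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class HRED(nn.Module):
    def __init__(self, vocab_size=10000, emb=128, utt_hid=256, ctx_hid=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        # Bottom level: encodes the sequence of words of one dialogue turn.
        self.utterance_enc = nn.GRU(emb, utt_hid, batch_first=True)
        # Middle level: encodes the sequence of dialogue turns.
        self.context_enc = nn.GRU(utt_hid, ctx_hid, batch_first=True)
        # Decoder: generates the response conditioned on the dialogue context.
        self.decoder = nn.GRU(emb, ctx_hid, batch_first=True)
        self.out = nn.Linear(ctx_hid, vocab_size)

    def forward(self, turns, response_in):
        # turns: (batch, n_turns, n_words); response_in: (batch, n_words)
        b, t, w = turns.shape
        _, turn_vecs = self.utterance_enc(self.embed(turns.view(b * t, w)))
        turn_vecs = turn_vecs.squeeze(0).view(b, t, -1)
        _, ctx = self.context_enc(turn_vecs)      # dialogue context vector
        dec_out, _ = self.decoder(self.embed(response_in), ctx)
        return self.out(dec_out)                  # word logits per step

model = HRED()
logits = model(torch.randint(0, 10000, (2, 3, 12)),  # 2 dialogues, 3 turns each
               torch.randint(0, 10000, (2, 12)))     # response tokens so far
print(logits.shape)  # torch.Size([2, 12, 10000])
```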
0:05:26 | In terms
---|
0:05:27 | of
---|
0:05:28 | applications to emotion,
---|
0:05:32 | there is work that
---|
0:05:36 | proposes a system that can express different kinds of emotion
---|
0:05:40 | by using an internal state in the neural response generator.
---|
0:05:46 | So you can see here that applications to emotion elicitation using neural networks
---|
0:05:51 | are still very lacking, if not altogether absent.
---|
0:05:54 | What we recently proposed is emotion-sensitive response generation,
---|
0:06:00 | which was published in proceedings earlier this year. The main idea is
---|
0:06:05 | to have an emotion encoder that takes into account the emotional context of the dialogue
---|
0:06:10 | and uses this information in generating the response.
---|
0:06:13 | So now we have an emotion encoder, which is
---|
0:06:17 | in here,
---|
0:06:18 | that takes the dialogue context
---|
0:06:21 | and tries to predict
---|
0:06:23 | the emotion context of the current turn,
---|
0:06:26 | and when generating the response we use the combination of both
---|
0:06:30 | the dialogue context and the emotion context.
---|
0:06:32 | So in this way the network is emotion sensitive,
---|
0:06:37 | and if we train it on data that contains responses that elicit positive emotion,
---|
0:06:42 | we can achieve positive emotion elicitation.
---|
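One simple way such an emotion context could be fused with the HRED dialogue context before decoding is concatenation followed by a projection; this is an assumed sketch of the idea, not the exact architecture from the paper:

```python
import torch
import torch.nn as nn

ctx_hid, emo_dim = 256, 64
dialogue_ctx = torch.randn(1, 2, ctx_hid)  # from the dialogue context encoder
emotion_ctx = torch.randn(1, 2, emo_dim)   # predicted emotion context

# Project the concatenated contexts back to the decoder's hidden size,
# so the generated response is conditioned on both.
fuse = nn.Linear(ctx_hid + emo_dim, ctx_hid)
decoder_init = torch.tanh(fuse(torch.cat([dialogue_ctx, emotion_ctx], dim=-1)))
print(decoder_init.shape)  # torch.Size([1, 2, 256])
```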
0:06:48 | And our subjective evaluation actually shows this method works very well.
---|
0:06:53 | However, there are two main limitations.
---|
0:06:56 | The first is that it has not yet learned strategies from an expert, since
---|
0:07:00 | it was trained on
---|
0:07:01 | wizard-of-oz conversations,
---|
0:07:05 | but we would like to see how an expert, or people who are knowledgeable in
---|
0:07:10 | emotional interaction, would elicit positive emotion.
---|
0:07:14 | It also still tends towards short and generic responses with positive affect words.
---|
0:07:20 | This impairs
---|
0:07:21 | engagement, and that is
---|
0:07:23 | important especially in
---|
0:07:25 | emotionally oriented interaction.
---|
0:07:28 | So the main focus of this contribution is to address these limitations.
---|
0:07:35 | There are several challenges,
---|
0:07:36 | which I will talk about now.
---|
0:07:38 | So the first goal is to learn
---|
0:07:41 | elicitation strategies from an expert,
---|
0:07:44 | and the challenge is the absence of such resources: if we take a look
---|
0:07:49 | at
---|
0:07:50 | emotion-rich corpora,
---|
0:07:53 | none of them
---|
0:07:54 | has yet involved an expert in the data collection,
---|
0:08:00 | and there is also no data that shows positive emotion elicitation strategies in everyday situations.
---|
0:08:07 | So what we did is construct such a dialogue corpus. We carefully designed the scenario, and
---|
0:08:13 | I will be talking about this in more detail in a bit.
---|
0:08:17 | The second goal is to increase variety in the generated responses
---|
0:08:21 | to improve engagement, and the main challenge here is sparsity.
---|
0:08:25 | We would like to cover as much as possible of the dialogue and emotion space;
---|
0:08:30 | however, it is really hard to collect large amounts of data annotated with emotion information
---|
0:08:36 | reliably. So we would like to tackle this problem methodically: we hypothesize that higher
---|
0:08:43 | level information, such as dialogue actions, can help reduce the sparsity
---|
0:08:47 | by marking the types of responses, that is, the actions available to
---|
0:08:52 | the system, and
---|
0:08:54 | emphasizing this information in the training and generation process.
---|
0:08:58 | We then put it all together and try to utilize this information in the
---|
0:09:02 | response generation. The main difference here now is that,
---|
0:09:07 | using the dialogue state, not only do we predict the emotion context of the dialogue,
---|
0:09:12 | but we also try to predict the action that fits the most
---|
0:09:17 | in the response.
---|
0:09:19 | So then
---|
0:09:21 | we
---|
0:09:22 | propose the multi-context HRED,
---|
0:09:25 | which uses a combination of these three contexts to generate a response.
---|
0:09:32 | Now, talking about the corpus construction:
---|
0:09:37 | as talked about before, the goal here is to capture expert strategies for emotion elicitation.
---|
0:09:44 | So what we do is record interactions between an expert and a participant.
---|
0:09:49 | We recruited a professional counselor to take the place of the expert,
---|
0:09:55 | and the main thing is to condition the interaction at the beginning with negative emotions, so that
---|
0:10:00 | as the
---|
0:10:01 | dialogue progresses we can see how the expert drives the conversation
---|
0:10:06 | to allow emotional recovery, which is what we seek.
---|
0:10:10 | And this is what a typical recording of such a session looks like.
---|
0:10:15 | We start with an opening for small talk,
---|
0:10:18 | and afterwards we induce the negative emotion, and what
---|
0:10:22 | we do is show videos: non-fictional videos such as interview
---|
0:10:28 | clips
---|
0:10:28 | or reports
---|
0:10:30 | about topics
---|
0:10:31 | that have a negative sentiment, such as
---|
0:10:34 | poverty or environmental change.
---|
0:10:37 | And the bulk of the session is the discussion that
---|
0:10:42 | we talked about before.
---|
0:10:45 | We recorded sixty sessions, amounting to about twenty-four hours of data. We recruited one
---|
0:10:52 | counselor and thirty participants.
---|
0:10:54 | For each participant
---|
0:10:56 | there are two recordings:
---|
0:10:58 | in one of the sessions
---|
0:11:02 | we showed a video that might induce anger, and in the other one a video that might induce
---|
0:11:07 | sadness.
---|
0:11:09 | For the emotion annotation we rely on self-reported emotion annotation, so we
---|
0:11:16 | have the participants
---|
0:11:18 | watch the recordings they have just made
---|
0:11:21 | and, using the GTrace tool and the scale on the right-hand side,
---|
0:11:27 | mark their emotional state over the course of the dialogue,
---|
0:11:30 | at any given time.
---|
0:11:31 | So if we project this over
---|
0:11:35 | the length of the dialogue, we get
---|
0:11:37 | an emotion
---|
0:11:39 | trace that looks like this.
---|
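A small sketch of how such a continuous self-reported trace can be aligned with utterance timestamps, so each dialogue turn gets an emotion value; the workflow and numbers are assumptions for illustration:

```python
import numpy as np

trace_times = np.array([0.0, 5.0, 10.0, 15.0, 20.0])    # seconds into session
trace_valence = np.array([0.2, -0.1, -0.5, -0.2, 0.3])  # GTrace-style ratings

utterance_times = np.array([3.0, 9.0, 18.0])            # turn start times
# Linear interpolation gives a valence value at each utterance.
print(np.interp(utterance_times, trace_times, trace_valence))
```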
0:11:42 | Of course, we also transcribed the data, and we use the combination
---|
0:11:47 | of these two kinds of information
---|
0:11:50 | later on. But before that,
---|
0:11:53 | the other goal is to find higher-level information from the expert's
---|
0:11:59 | responses.
---|
0:12:01 | What we would like to have here is information that is roughly equivalent to dialogue
---|
0:12:06 | acts,
---|
0:12:08 | but we would like it to be specific to our dialogue scenario, because this is the
---|
0:12:12 | scenario of particular interest.
---|
0:12:15 | Importantly,
---|
0:12:16 | we would also like these dialogue acts to reflect the intents of the
---|
0:12:21 | expert.
---|
0:12:22 | There are several ways to do this.
---|
0:12:25 | One is human annotation; the obvious limitations are that it is expensive and hard to reach a
---|
0:12:31 | reliable inter-annotator agreement.
---|
0:12:35 | We could also use standard dialogue act classifiers, but the constraint here is that they may
---|
0:12:40 | not cover the specific intents we aim to capture.
---|
0:12:44 | So we resorted to unsupervised clustering.
---|
0:12:48 | We do that by first extracting the responses of the counselor from the
---|
0:12:53 | corpus,
---|
0:12:53 | and then, using a pre-trained word2vec model, we get a compact representation of each
---|
0:12:59 | response.
---|
0:13:00 | And we tried out two types of clustering methods.
---|
0:13:04 | K-means, which means you need to define beforehand how many clusters you would like to find;
---|
0:13:09 | in our case, we chose k empirically.
---|
0:13:13 | For the DP-GMM,
---|
0:13:15 | we do not need to define the model complexity beforehand; the algorithm itself tries to find
---|
0:13:21 | the optimal number of components to represent the data.
---|
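Assuming each response is represented by its averaged pre-trained word2vec vector, the two clustering steps might look like this in scikit-learn (with random stand-in vectors and illustrative parameters):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
response_vecs = rng.normal(size=(200, 50))  # one averaged vector per response

# k-means: the number of clusters k must be chosen beforehand (empirically).
kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(response_vecs)

# DP-GMM: only an upper bound on components is given; the model prunes the rest.
dpgmm = BayesianGaussianMixture(
    n_components=20, weight_concentration_prior_type="dirichlet_process",
    random_state=0).fit(response_vecs)
dpgmm_labels = dpgmm.predict(response_vecs)
print(len(set(kmeans_labels)), len(set(dpgmm_labels)))
```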
0:13:26 | Then we did some analysis. This is the t-SNE representation of
---|
0:13:30 | the vectors and the labels;
---|
0:13:32 | this is the result of the k-means clustering with the k we chose.
---|
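A t-SNE projection like the one on the slide can be produced with scikit-learn; again a sketch with stand-in vectors:

```python
import numpy as np
from sklearn.manifold import TSNE

vecs = np.random.default_rng(0).normal(size=(200, 50))  # response vectors
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(vecs)
print(embedded.shape)  # (200, 2): one 2-D point per response, colorable by cluster
```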
0:13:37 | In the green cluster we have many sentences that directly correspond to the participants' content;
---|
0:13:45 | in the red cluster we get affirmative responses or confirmation responses;
---|
0:13:49 | and in the blue cluster we have active listening or backchannels.
---|
0:13:54 | What we do get here, though, is a very large cluster where all the more
---|
0:13:58 | complex
---|
0:13:59 | sentences are grouped together.
---|
0:14:02 | So we
---|
0:14:03 | cluster that one more time, and we find further sub-clusters.
---|
0:14:09 | Some examples: in the red cluster we have
---|
0:14:11 | a lot of sentences
---|
0:14:13 | that contain facts or recollections about the topic,
---|
0:14:17 | and in the green cluster
---|
0:14:19 | we have
---|
0:14:21 | sentences that focus on the participant, so "you" is the most common word there,
---|
0:14:26 | and it seems like the
---|
0:14:28 | counselor tries to elicit their opinions and their assessment of the topic.
---|
0:14:32 | For the DP-GMM,
---|
0:14:34 | the
---|
0:14:36 | characteristic of each cluster is less clear-cut. Probably this is due to the
---|
0:14:41 | very imbalanced
---|
0:14:44 | distribution of the sentences in the clusters: we have two very big clusters here,
---|
0:14:50 | and there are plenty of very small clusters around them.
---|
0:14:55 | And just because the clusters are this big, it is harder to conclude what they represent.
---|
0:15:03 | So then we put all of these into the experiment, to see if things are
---|
0:15:07 | working as we hoped.
---|
0:15:10 | This is the experimental setup. The first thing we do is pre-train the model:
---|
0:15:14 | before we start on any action- or emotion-specific tasks, we would
---|
0:15:19 | like the model to have
---|
0:15:20 | a prior for the general
---|
0:15:23 | response generation task.
---|
0:15:25 | So we use a large-scale dialogue corpus, which is the SubTle corpus, containing
---|
0:15:32 | 5.5 million dialogue pairs from movie subtitles,
---|
0:15:35 | and we use it to train the HRED model, so we do not yet encode anything other
---|
0:15:40 | than the dialogue context.
---|
0:15:43 | And then we fine-tune this pre-trained model on the corpus that we have
---|
0:15:48 | collected, like this.
---|
0:15:50 | Also, for comparison, we train three types of models: we have
---|
0:15:55 | Emo-HRED, which only relies on the emotion context;
---|
0:15:59 | we have the MC-HRED variant, which uses both
---|
0:16:03 | the emotion and the action contexts in combination; and for completeness we also train a model that
---|
0:16:08 | only relies on the action context.
---|
0:16:12 | And of course we compare these models afterwards.
---|
0:16:16 | A little bit about how we do the pre-training and fine-tuning.
---|
0:16:21 | So what pre-training does is initialize the weights of the
---|
0:16:25 | general components,
---|
0:16:27 | so the
---|
0:16:29 | parts that have nothing to do with the additional contexts.
---|
0:16:33 | And in doing fine-tuning, because the data that we have is pretty small, we
---|
0:16:38 | do it selectively, so we only optimize parameters that are affected by the new contexts:
---|
0:16:43 | the decoder here and the
---|
0:16:48 | two context encoders.
---|
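A hedged sketch of this selective fine-tuning in PyTorch, with illustrative module names: freeze the pre-trained parts and optimize only the decoder and the two new context encoders:

```python
import torch.nn as nn
from torch.optim import Adam

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.utterance_enc = nn.GRU(128, 256, batch_first=True)  # pre-trained
        self.context_enc = nn.GRU(256, 256, batch_first=True)    # pre-trained
        self.emotion_enc = nn.Linear(256, 64)  # new: emotion context
        self.action_enc = nn.Linear(256, 64)   # new: action context
        self.decoder = nn.GRU(128, 256, batch_first=True)

model = Model()
# Freeze everything, then unfreeze only the parts affected by the new contexts.
for p in model.parameters():
    p.requires_grad = False
for module in (model.decoder, model.emotion_enc, model.action_enc):
    for p in module.parameters():
        p.requires_grad = True

optimizer = Adam([p for p in model.parameters() if p.requires_grad], lr=1e-4)
```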
0:16:50 | In terms of MC-HRED, we have three different targets
---|
0:16:55 | during training.
---|
0:16:57 | So each of these targets
---|
0:16:59 | has
---|
0:17:00 | its own loss. We have the negative log-likelihood of the
---|
0:17:04 | target response;
---|
0:17:05 | the emotion encoder's error in trying to predict the emotional state;
---|
0:17:09 | and we have the prediction error during training as well for the action encoder,
---|
0:17:15 | which predicts the action for the response.
---|
0:17:18 | We combine these losses together, linearly interpolating them, and then use backpropagation
---|
0:17:26 | to update the corresponding weights.
---|
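The linearly interpolated objective might be sketched as follows; the interpolation weights and tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def total_loss(word_logits, target_words, emo_logits, target_emo,
               act_logits, target_act, alpha=0.5, beta=0.5):
    nll = F.cross_entropy(word_logits.transpose(1, 2), target_words)  # response NLL
    emo = F.cross_entropy(emo_logits, target_emo)  # emotion-prediction error
    act = F.cross_entropy(act_logits, target_act)  # action-prediction error
    return nll + alpha * emo + beta * act          # backpropagated jointly

# Dummy tensors stand in for network outputs (which would require grad).
loss = total_loss(torch.randn(2, 12, 100, requires_grad=True),
                  torch.randint(0, 100, (2, 12)),
                  torch.randn(2, 4), torch.randint(0, 4, (2,)),
                  torch.randn(2, 8), torch.randint(0, 8, (2,)))
loss.backward()  # gradients flow back through the connected components
```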
0:17:30 | The first evaluation we do is to look at the perplexity of the models;
---|
0:17:35 | for perplexity, lower is better.
---|
0:17:39 | The value that we get here is 42.6,
---|
0:17:43 | and actually if we use action information we get a slightly better model.
---|
0:17:48 | However, when we combine this information together, we see different things happening for each action
---|
0:17:54 | label set:
---|
0:17:55 | for the k-means cluster labels we see some improvements,
---|
0:17:59 | here and here, and for the DP-GMM labels it is actually slightly worse.
---|
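For reference, perplexity here is the exponentiated average per-token negative log-likelihood on held-out data; a minimal sketch:

```python
import math

def perplexity(total_nll: float, n_tokens: int) -> float:
    """total_nll: summed negative log-likelihood over all tokens."""
    return math.exp(total_nll / n_tokens)

# E.g., a summed NLL of 3760.0 over 1000 tokens gives a perplexity of ~43.
print(perplexity(3760.0, 1000))
```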
0:18:04 | We analyze this further by
---|
0:18:07 | separating the test set by query length,
---|
0:18:09 | so we can see
---|
0:18:10 | perplexity for shorter versus
---|
0:18:15 | longer queries.
---|
0:18:18 | There is a stark difference between the two groups: performance on short queries
---|
0:18:23 | is consistently better than that on long ones, which is not surprising, as long-term dependencies
---|
0:18:28 | are still a limitation
---|
0:18:31 | of neural networks and hurt their performance.
---|
0:18:36 | The thing with MC-HRED, basically, is that
---|
0:18:40 | it gains a substantial improvement for longer queries;
---|
0:18:45 | most of the improvement
---|
0:18:46 | that it gets comes from
---|
0:18:49 | being able to perform better on long queries.
---|
0:18:52 | From this we can see that the multiple contexts help especially for longer inputs.
---|
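A sketch of this length-based breakdown; the token threshold is an assumption for illustration:

```python
def split_by_length(queries, threshold=10):
    """Split held-out queries into short and long groups by token count."""
    short = [q for q in queries if len(q.split()) <= threshold]
    long_ = [q for q in queries if len(q.split()) > threshold]
    return short, long_

short, long_ = split_by_length([
    "how are you",
    "tell me about the video we just watched and how it made you feel",
])
print(len(short), len(long_))  # 1 1  (perplexity is then reported per group)
```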
0:18:59 | Then we also did a subjective evaluation: we extracted a hundred queries
---|
0:19:04 | and had each judged by crowd workers,
---|
0:19:07 | whom we asked to rate the naturalness, emotional impact, and engagement of the responses
---|
0:19:13 | of two models.
---|
0:19:15 | So we have Emo-HRED as the baseline and MC-HRED, the best of
---|
0:19:19 | the proposed systems.
---|
0:19:21 | And we see improved engagement
---|
0:19:26 | from the proposed model, while maintaining the emotional impact and naturalness.
---|
0:19:31 | And when we look at the responses generated by the
---|
0:19:35 | systems, we see that those of MC-HRED are,
---|
0:19:38 | well, on average two and a half words longer than the baseline's.
---|
0:19:43 | So in conclusion, here we have presented a corpus that shows expert
---|
0:19:47 | strategies in positive emotion elicitation.
---|
0:19:50 | We also showed how we used unsupervised clustering methods
---|
0:19:56 | to obtain higher-level information,
---|
0:19:59 | and used all of these in
---|
0:20:01 | response generation.
---|
0:20:04 | In the future, there are many things
---|
0:20:06 | that need to be worked on, but in particular we would like to look at
---|
0:20:09 | multimodal information.
---|
0:20:11 | This is especially important for
---|
0:20:14 | capturing the emotional context of the dialogue.
---|
0:20:17 | And of course, evaluation with user interaction is also important.
---|
0:20:23 | That was my presentation.
---|
0:20:51 | So the pre-training is done
---|
0:20:54 | using another corpus, which we did not construct ourselves;
---|
0:20:58 | we use this model.
---|
0:21:03 | The training data is
---|
0:21:05 | the one here,
---|
0:21:08 | so it's a large-scale corpus
---|
0:21:11 | of movie subtitles.
---|
0:21:16 | Right, so regarding the training:
---|
0:21:20 | in pre-training we did not use any emotion or action information,
---|
0:21:24 | so the pre-training is
---|
0:21:26 | only to prime the network towards dialogue generation,
---|
0:21:32 | and then in fine-tuning we give the model the ability to encode the action and emotion
---|
0:21:38 | contexts, and to use these
---|
0:21:40 | in the generation.
---|
0:21:56 | Right, so,
---|
0:21:59 | the word embeddings.
---|
0:22:07 | So,
---|
0:22:09 | there are two uses of word embeddings, with different weights. The first one is
---|
0:22:14 | using a pre-trained word embedding model;
---|
0:22:19 | we use that for the counselor dialogue clustering. The other is in the model itself,
---|
0:22:24 | where we learn the word embeddings
---|
0:22:26 | during pre-training.
---|
0:22:29 | That one is learned,
---|
0:22:31 | that is, it's learned by the utterance encoder
---|
0:22:35 | on the large-scale data.
---|
0:22:43 | Do we cluster sentences or whole dialogues in our clustering?
---|
0:22:47 | For the expert response clustering, we cluster sentences,
---|
0:22:51 | and for that we use the pre-trained word2vec model.
---|
0:23:01 | We average
---|
0:23:02 | the word vectors over the sentence.
---|
0:23:07 | Right, I just heard about skip-thought yesterday; whether that kind of sentence vector makes a
---|
0:23:13 | difference, we will think about that.
---|
0:23:16 | Thank you.
---|
0:23:49 | So there is definitely an overlap between the actions we would like to find from the
---|
0:23:54 | expert
---|
0:23:55 | and the actions in just general dialogues.
---|
0:23:58 | So we did find, for example, backchannels;
---|
0:24:02 | backchannels and confirmations are actions that occur generally in conversation.
---|
0:24:10 | But the unsupervised clustering is especially helpful for the other, scenario-specific actions,
---|
0:24:18 | and
---|
0:24:20 | it does not need any expert annotation at all.
---|
0:24:57 | Right,
---|
0:25:01 | so
---|
0:25:03 | what we find is that most of the time, the majority of the time, the counselor
---|
0:25:07 | is able to elicit the positive emotions.
---|
0:25:11 | In terms of the participants' reactions towards the videos, it highly varies:
---|
0:25:17 | there are people who are not so reactive, and there are people who are
---|
0:25:22 | more emotionally sensitive,
---|
0:25:24 | so
---|
0:25:26 | we get
---|
0:25:27 | different types of responses.
---|
0:25:30 | But this is an example of one
---|
0:25:34 | of the dialogues. The red line here is the valence throughout the dialogue.
---|
0:25:40 | We can see that the participant came in quite positive, and after the video they
---|
0:25:43 | feel
---|
0:25:44 | very negative, but as the dialogue progresses,
---|
0:25:47 | the
---|
0:25:51 | counselor
---|
0:25:53 | successfully lifts this back up.
---|
0:25:55 | We have a more extensive analysis in another paper;
---|
0:26:00 | I'd be happy to point you to it.
---|