0:00:15 | Okay, so hello, I'm Louisa Pragst from Ulm University, as already introduced. |
---|
0:00:21 | Thank you for having me. |
---|
0:00:23 | And I'm going to talk about changing the level of directness in dialogue. |
---|
0:00:27 | So first, let's have a little motivation of why this could be useful. |
---|
0:00:33 | If we look at human dialogue, for example, one person could say "Do you want to eat a pizza now?", and for some reason the other person decides not to answer that question directly and says "I'd prefer a warm meal." |
---|
0:00:47 | In human dialogue, we can easily infer: okay, that's a no. |
---|
0:00:51 | And then the person could choose to be more polite and not say directly "You should really go on a diet", |
---|
0:00:58 | but just say "That pizza has a lot of calories." |
---|
0:01:01 | And then the other person is not offended and can say "Okay, let's cook something." |
---|
0:01:07 | So if we have a look at the same conversation with a dialogue system which is not equipped to handle indirectness, |
---|
0:01:15 | we can run into a number of problems. |
---|
0:01:17 | so |
---|
0:01:19 | For example, if the system says "Do you want to eat a pizza?" |
---|
0:01:23 | and the human says "I'd rather have a warm meal", |
---|
0:01:27 | then if the system is not equipped to handle this indirectness and just expects a direct answer, it won't understand that. |
---|
0:01:33 | And then of course it has to repeat the question, and the user has to state the answer more directly, |
---|
0:01:42 | which is not that bad, but could be handled better by the system if it could understand the indirect version of the answer. |
---|
0:01:50 | And another problem we have is on the output side, because sometimes, as humans, we expect our conversation partner to not be completely direct. |
---|
0:01:59 | So if the system chooses to be direct and says "You should not eat a pizza", the human will be very angry. |
---|
0:02:06 | So it would be better if the system could handle indirectness well, on the input and on the output side. |
---|
0:02:14 | And that is why the goal of my work is changing the level of directness of an utterance. |
---|
0:02:22 | Now I want to have a look at the algorithm, at how I want to do that. |
---|
0:02:27 | First I will give an overview of the overall algorithm, and then address some challenges specifically. |
---|
0:02:34 | So my algorithm works with three different types of input: the current utterance, the previous utterance in the dialogue, |
---|
0:02:42 | and a pool of utterances that it can choose from to exchange the current utterance. |
---|
0:02:50 | The next step then is to estimate the directness level of those utterances, |
---|
0:02:55 | which gives us of course the directness of the current utterance and the directness of every utterance in the pool. |
---|
0:02:59 | We need the previous utterance because the directness of course depends on what was said before, |
---|
0:03:08 | and the same utterance can have different levels of directness depending on the previous utterance. |
---|
0:03:14 | The next step then is to filter all the utterances, so that we only keep the pool of utterances we can choose from |
---|
0:03:23 | that have the opposite directness of the current utterance. |
---|
0:03:26 | In the last step, we have to see which of those utterances is the most similar in a functional manner to the current utterance, |
---|
0:03:36 | which then leaves us with the utterance we can exchange it for. |
---|
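To make the pipeline concrete, here is a minimal Python sketch of the exchange algorithm as just described. The helpers `directness` and `dialogue_vector` stand in for the trained models introduced later in the talk, and Euclidean distance is only one plausible choice of similarity measure; none of this is taken verbatim from the paper.

```python
import numpy as np

def exchange_utterance(current, previous, pool,
                       directness, dialogue_vector, target_level):
    """Replace `current` with the functionally most similar utterance
    from `pool` that has `target_level` directness in this context."""
    # Keep only candidates whose directness level, given the previous
    # utterance, matches the level we want to switch to.
    candidates = [u for u in pool if directness(u, previous) == target_level]
    if not candidates:
        return None  # no utterance with the desired directness level
    # Among those, pick the one closest to the current utterance in the
    # dialogue vector space, i.e. the most functionally similar one.
    cur = dialogue_vector(current)
    dists = [np.linalg.norm(dialogue_vector(u) - cur) for u in candidates]
    return candidates[int(np.argmin(dists))]
```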
0:03:41 | So there are two challenges in this algorithm: one is the directness level, how can we estimate that, |
---|
0:03:49 | and the other one is how do we decide which utterances are functionally similar. |
---|
0:03:56 | So let's start with the latter: what is functionally similar? |
---|
0:04:02 | I define that as the degree to which two utterances can be used interchangeably in the dialogue, so they fulfil the same function in the dialogue. |
---|
0:04:11 | And as a measure of functional similarity, I decided to use dialogue act models. |
---|
0:04:19 | They are inspired by word vector models, so they follow the same principle: |
---|
0:04:26 | they map utterances into a vector space in such a manner that utterances appearing in the same context are mapped into close vicinity of each other. |
---|
0:04:35 | So if two utterances are used in the same context, it's very likely that they can be exchanged for each other. |
---|
0:04:46 | The distance in this vector space is then used as an approximation of the functional similarity. |
---|
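The talk describes the dialogue act models only as following the word-vector principle, with utterances taking the role of words. A toy sketch of that idea with gensim, treating each dialogue as a sequence of hypothetical utterance IDs:

```python
from gensim.models import Word2Vec

# Toy data: each dialogue is a sequence of utterance IDs. Utterances that
# occur in similar contexts end up close together in the vector space,
# exactly as words do in word2vec. The IDs are made up for illustration.
dialogues = [
    ["GREET", "OFFER_PIZZA", "PREFER_WARM_MEAL", "SUGGEST_COOKING", "AGREE"],
    ["GREET", "OFFER_PIZZA", "AGREE", "ORDER_PIZZA"],
    ["GREET", "OFFER_PIZZA", "DECLINE", "SUGGEST_COOKING", "AGREE"],
]

model = Word2Vec(sentences=dialogues, vector_size=25, window=2,
                 min_count=1, epochs=200, seed=1)

vec = model.wv["PREFER_WARM_MEAL"]             # the utterance's vector
print(model.wv.most_similar("AGREE", topn=2))  # nearby = similar function
```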
0:04:54 | I'm pretty sure that this works, because I have already published a paper about it earlier this year, |
---|
0:04:59 | and I will quickly summarise the findings of that paper, so you can see why this is a good fit. |
---|
0:05:06 | I have evaluated the accuracy of clusters that I formed with k-means in the dialogue vector space, |
---|
0:05:17 | and compared them to the ground truth of clusters given by hand-annotated dialogue acts. |
---|
0:05:21 | So I wanted to see if the grouping in the dialogue vector space corresponds to the annotated dialogue acts. |
---|
0:05:29 | And I did a cross-corpus evaluation, so the dialogue act models were trained on a different corpus than the one the clustering was performed on. |
---|
0:05:38 | As you can see on the left side, the resulting accuracy is very good. |
---|
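The talk does not spell out the exact metric; a purity-style clustering accuracy of the kind described could be computed as in this sketch (assuming integer-encoded dialogue act labels):

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_accuracy(vectors, act_labels, n_clusters):
    """Cluster utterance vectors with k-means and score each cluster by
    its majority hand-annotated dialogue act (act_labels: int array)."""
    assignments = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
    correct = 0
    for c in range(n_clusters):
        acts = act_labels[assignments == c]
        if acts.size:
            correct += np.bincount(acts).max()  # members with majority act
    return correct / act_labels.size
```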
0:05:43 | And that's why I think that dialogue act models work very well for the estimation of functionally similar utterances. |
---|
0:05:54 | So let's get to the estimation of directness, which was the second challenge. |
---|
0:05:59 | You can already see the architecture here: this is a recurrent neural network that is used to estimate the directness with a supervised learning approach. |
---|
0:06:08 | As input, it uses the sum of word vectors on the one hand: for every word in the utterance, we take its word vector and just add all of them up. |
---|
0:06:23 | And I also use the dialogue vector representation of the utterance as an input. |
---|
0:06:27 | And since it is a recurrent network, we have a recurrent connection, so the network also gets the previous state, that is, the information from the previous utterance. |
---|
0:06:38 | The output I have framed as a classification problem, |
---|
0:06:44 | so the output is the probability of the utterance being either very direct, for example "I want a tea to drink", |
---|
0:06:52 | or slightly indirect, where we have "Can I get a tea to drink?", which is not quite the same but still has all the main words in there that are necessary for the meaning, |
---|
0:07:04 | or very indirect, where you just say "I don't like meat" and hopefully the other person can infer what you mean. |
---|
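The talk does not give the exact architecture, but a minimal Keras sketch of such a classifier could look as follows; the dimensionalities, the hidden size, and the plain SimpleRNN cell are all assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

WORD_DIM, DLG_DIM = 300, 50  # assumed dimensionalities

# One timestep per utterance: the summed word vectors concatenated with
# the utterance's dialogue vector. The recurrent layer carries information
# from the previous utterance into the current prediction.
inputs = layers.Input(shape=(None, WORD_DIM + DLG_DIM))
hidden = layers.SimpleRNN(64, return_sequences=True)(inputs)
# Three classes: very direct, slightly indirect, very indirect.
outputs = layers.Dense(3, activation="softmax")(hidden)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```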
0:07:13 | This has not been tested before, so as part of the evaluation for this work, I also evaluated how well the estimation of directness with this approach works. |
---|
0:07:25 | And with that, let's get to the evaluation. |
---|
0:07:29 | So as I said, on the one hand the accuracy of the directness estimation was evaluated, |
---|
0:07:35 | and of course also the accuracy of the actual utterance exchange. |
---|
0:07:40 | And for that we of course need a ground truth. That means we need a dialogue corpus that contains utterances that we can exchange, |
---|
0:07:52 | and we of course need an annotation of the directness level, and an annotation of dialogue acts, in order to see if we made a correct exchange. |
---|
0:08:03 | It was impossible to find a corpus like that. |
---|
0:08:08 | I also wasn't sure we could collect a corpus like that ourselves, because it's very difficult |
---|
0:08:17 | to not inhibit the naturalness of the conversation while still conveying to the participants: |
---|
0:08:22 | okay, we need this meaning in different phrasings at different directness levels, |
---|
0:08:29 | to make sure that there are functionally equivalent utterances in the corpus. |
---|
0:08:33 | So for this reason, I decided to use an automatically generated corpus, which I want to present now. |
---|
0:08:42 | The corpus generation started from a definition of the dialogue domain, with system and user actions |
---|
0:08:49 | and succession rules defining which action could follow which other action. |
---|
0:08:57 | Each action had multiple utterances with which it could be phrased, |
---|
0:09:04 | and of course a directness level depending on the previous utterance. |
---|
0:09:09 | Then we started at the beginning, with the start action, |
---|
0:09:16 | and the generation simply took all the successors, and their successors again, until we reached the end, |
---|
0:09:23 | and thereby generated all the dialogue flows that were possible within the domain which we defined. |
---|
0:09:30 | The wording was then chosen randomly, and this resulted in more than four hundred thousand dialogue flows. |
---|
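A minimal sketch of this flow enumeration, assuming an acyclic toy domain with hypothetical action names (the real domain is far larger, and each action additionally carries several wordings with directness annotations):

```python
def generate_flows(action, successors, prefix=()):
    """Depth-first enumeration of every dialogue flow allowed by the
    succession rules; terminal actions have no successors."""
    flow = prefix + (action,)
    if not successors.get(action):
        return [flow]
    flows = []
    for nxt in successors[action]:
        flows.extend(generate_flows(nxt, successors, flow))
    return flows

# Hypothetical toy domain.
successors = {
    "GREET": ["OFFER_PIZZA"],
    "OFFER_PIZZA": ["ACCEPT", "PREFER_WARM_MEAL"],
    "PREFER_WARM_MEAL": ["SUGGEST_COOKING"],
    "ACCEPT": [], "SUGGEST_COOKING": [],
}
print(generate_flows("GREET", successors))
```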
0:09:41 | And about the wording: there were multiple phrasings for every dialogue act. |
---|
0:09:48 | For example, you can see here that "yes" could be worded as "That is great, I'm looking forward to it" or "That sounds delicious", depending on |
---|
0:09:58 | what the previous utterance was, |
---|
0:10:01 | or "I would like to order a pizza" as "Can I order a pizza from you?". |
---|
0:10:06 | The topics of those conversations were everyday scenarios, so for example ordering a pizza or arranging to cook together. |
---|
0:10:17 | And I tried to incorporate many elements of human conversation. |
---|
0:10:22 | So, for example, I had over-answering, misunderstandings, requests for confirmation, corrections, and things like that, |
---|
0:10:31 | and, as already mentioned, context-dependent directness levels. |
---|
0:10:35 | So, for example, "Do you have time today?" can be answered with "I haven't planned anything", |
---|
0:10:41 | which is not a direct answer, so it has directness level three. |
---|
0:10:45 | And with "What have you planned for today?" followed by "I haven't planned anything", |
---|
0:10:48 | we have a different question before the same answer, so this time it's a direct answer and receives directness level one. |
---|
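As a toy illustration of this context dependence (not the corpus's actual data format), the annotation can be thought of as keyed by the preceding utterance:

```python
# Directness of an answer depends on the question before it
# (1 = very direct, 3 = very indirect); the entries are illustrative.
directness_annotation = {
    ("Do you have time today?", "I haven't planned anything."): 3,
    ("What have you planned for today?", "I haven't planned anything."): 1,
}
```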
0:11:01 | So of course, with an automatically generated corpus there are some limitations. |
---|
0:11:07 | We have less variation than in natural conversations, of course, |
---|
0:11:13 | both with regard to the dialogue flow and to the wording, |
---|
0:11:18 | and that very likely means it's more predictable and therefore easier for the models to handle. |
---|
0:11:25 | However, I also see some advantages to this approach. |
---|
0:11:29 | On the one hand, we have a very controlled environment. |
---|
0:11:32 | We can make sure that, for example, there is an actual answer among the corpus utterances, |
---|
0:11:40 | so we know that there is a valid exchange, and if we don't find it, the fault lies with our algorithm, and not just with there being no correct utterance in the corpus. |
---|
0:11:52 | And also, because the corpus was not annotated by hand but generated together with the ground truth, |
---|
0:12:03 | we know that this ground truth is very dependable. |
---|
0:12:07 | And also, I think it's an advantage that using this approach we have a very complete data set: we have all the possible flows, and we have many different wordings. |
---|
0:12:17 | And I think that having this for a small application can have implications for what happens if we actually have a lot of data and approach this full coverage. |
---|
0:12:29 | For example, if I just collect the dialogues myself, I usually won't have a lot of data and I won't have this full coverage, but a larger company maybe can get that data. |
---|
0:12:38 | Since I just don't get that data, what we do with this small but complete set that we generated can then have some implications for what would happen if I could get that data after all. |
---|
0:12:58 | For our results, this of course means that they do not represent the actual performance in an applied spoken dialogue system, in which we would have natural conversations. |
---|
0:13:11 | So it's very likely that it will perform worse there. |
---|
0:13:15 | But we can see the potential of our approach given ideal circumstances, so I think it still has some merit at any rate. |
---|
0:13:27 | So with that, let's get to the actual results. |
---|
0:13:32 | First, the accuracy of the directness estimation. |
---|
0:13:36 | Here we used as input a dialogue vector model that was trained on our automatically generated corpus, |
---|
0:13:44 | and we used word vector models that were trained on the Google News corpus; you can see the reference for that. |
---|
0:13:54 | As the dependent variable, we of course have the accuracy of correctly predicting the level of directness as annotated. |
---|
0:14:02 | As independent variables, we used versions with and without word vectors as input, to see if they improve the results at all. |
---|
0:14:15 | And we also wanted to see if the size of the training set impacts the classifier. |
---|
0:14:21 | We used of course ten-fold cross-validation, as usual, which leads to a training corpus of ninety percent of the data, |
---|
0:14:32 | and we also tested what happens when we only use ten percent of the data for training. |
---|
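A sketch of this evaluation setup; the feature matrix is random stand-in data, and LogisticRegression merely stands in for the actual recurrent classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 350))   # stand-in utterance features
y = rng.integers(0, 3, size=1000)  # stand-in directness classes

scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(X):
    # Regular fold: train on 90%, test on the held-out 10%. Swapping
    # train_idx and test_idx simulates the 10%-training condition.
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print(np.mean(scores))
```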
0:14:39 | Also, we used different dialogue act models: there, we varied the size of the generated dialogue corpus, i.e. how many of the dialogues we included in the actual training. |
---|
0:14:54 | And here you can see the results. |
---|
0:14:57 | We could achieve a very high accuracy of directness estimation, |
---|
0:15:03 | but keep in mind that it's an automatically generated corpus, so that of course plays a role in this. |
---|
0:15:10 | The baseline of majority-class prediction would have been 0.5291, |
---|
0:15:16 | and we can clearly outperform that. |
---|
0:15:21 | We can see a significant influence of both the size of the training set and of whether or not we include the word vectors. |
---|
0:15:32 | And I think that the word vectors as input improving the estimation results so much really speaks for the quality of those models, for which we need the respective data. |
---|
0:15:50 | But I think extensive word corpora exist, so that should not be a problem. |
---|
0:15:57 | What could be a problem is the size of the training set, |
---|
0:16:01 | because this is annotated data; it's a supervised approach. |
---|
0:16:07 | So if we want to scale this approach, we would need a lot of annotated data. |
---|
0:16:13 | So perhaps in the future we could consider an unsupervised approach for this, one that doesn't need that much annotated data. |
---|
0:16:28 | Now to the accuracy of the utterance exchange. For the functional similarity, we again used the dialogue act models trained on the automatically generated corpus, |
---|
0:16:37 | and for the directness estimation, we used differently trained versions of the classifier that I just presented. |
---|
0:16:44 | The dependent variable is the percentage of correctly exchanged utterances, |
---|
0:16:50 | and the independent variables here were the classifier accuracy and, again, the size of the training corpus for the dialogue act models. |
---|
0:16:59 | Here you can see the results. |
---|
0:17:02 | The best performance we could achieve overall was 0.7, so seventy percent of utterances were correctly exchanged. |
---|
0:17:11 | And we have a significant influence of both the classifier accuracy and the size of the training data for the dialogue act models. |
---|
0:17:22 | A common error made by the algorithm that we could see |
---|
0:17:27 | was that the utterance exchange was done with either more or less information than the original utterance. |
---|
0:17:34 | So, for example, "I want something spicy" was exchanged with "I want a large pepperoni pizza", and "large" of course is not included in the first sentence. |
---|
0:17:44 | So this points to the dialogue act models, as we trained them, not being able to differentiate that well between such variants. |
---|
0:17:55 | But this could be solved by just adding more context to them, so that during training we take into account more utterances in the vicinity. |
---|
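If the dialogue act models are trained in the word2vec style sketched earlier, taking more neighbouring utterances into account is a single-parameter change (values assumed; `dialogues` as in the earlier sketch):

```python
from gensim.models import Word2Vec

# Same training as in the earlier sketch, but with a larger context window,
# so more surrounding utterances shape each utterance's vector.
model = Word2Vec(sentences=dialogues, vector_size=25, window=5,
                 min_count=1, epochs=200, seed=1)
```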
0:18:06 | We can see here the importance of a good classifier and a good similarity measure. |
---|
0:18:12 | For the similarity measure, I don't think that's a problem, because |
---|
0:18:17 | it works on unannotated data, so we can just take large corpora of dialogue data and use that. |
---|
0:18:23 | Again, the annotated data is the real challenge here, and we should consider an unsupervised approach. |
---|
0:18:32 | A short discussion of the results: |
---|
0:18:35 | I think the approach shows high potential, but the evaluation was done in a theoretical setting, |
---|
0:18:41 | and we have not applied it to a full dialogue system; therefore, there are still some questions to be answered. |
---|
0:18:49 | So in this corpus we have less variability than in a natural dialogue. |
---|
0:18:54 | That means that, very likely, the performance of the classifier and the dialogue vector models will decrease in an actual dialogue. |
---|
0:19:09 | and we have the problem that |
---|
0:19:11 | we don't really know if in an actual dialog hope was suitable alternative to exchange |
---|
0:19:17 | actually exist |
---|
0:19:19 | again if we have an increasing amount of data it becomes more likely |
---|
0:19:25 | what was and it's not sure |
---|
0:19:27 | So perhaps, as future work, we can look into the generation of utterances instead of just their exchange. |
---|
0:19:36 | And another point is the interrelation of user experience and the accuracy of the exchange, |
---|
0:19:42 | because at the moment we don't know what accuracy we actually need to achieve to improve the user experience. |
---|
0:19:50 | So that is also something we should look into. |
---|
0:19:54 | So, at the end of my talk, I want to conclude what I presented to you today. |
---|
0:20:02 | I discussed the impact of indirectness in human-computer interaction |
---|
0:20:07 | and proposed an approach to changing the level of directness of an utterance. |
---|
0:20:12 | The directness estimation is done using recurrent neural networks; the functional similarity measure uses dialogue act models. |
---|
0:20:21 | And the evaluation shows the high potential of this approach, but there is also a lot of future work to do. |
---|
0:20:28 | It would be good to have a corpus of natural dialogues annotated with the directness level, to use for the evaluation. |
---|
0:20:37 | There would be benefits to an unsupervised estimation of the directness level, |
---|
0:20:42 | and also, an evaluation on an actual dialogue corpus would give more insight into how that actually impacts the performance. |
---|
0:20:50 | And the generation of suitable utterances would also be desirable, because we don't actually know if the right utterances are in the corpus. |
---|
0:21:00 | And finally, of course, we would like to apply this to an actual full dialogue system. |
---|
0:21:08 | Thank you very much for your attention. |
---|
0:21:52 | No, I did not evaluate that. |
---|
0:22:12 | Yes, a lot of my work is in the area of cultural differences. The directness is a very major difference that exists between cultures, |
---|
0:22:18 | so it is therefore of major interest to me. |
---|
0:23:04 | Yes, I think it would be really good to have such a corpus, |
---|
0:23:10 | and I'm thinking about ways to get one. I think one of the main difficulties there is, |
---|
0:23:17 | as I said, that I'm coming at this from the angle of cultural differences. |
---|
0:23:21 | So, for example, I would expect a German to be even more direct than, for example, a Japanese speaker. |
---|
0:23:27 | But then we have the translation problem: we can't exchange German utterances for Japanese utterances, so that makes it difficult. |
---|
0:23:33 | And I'm not sure how to ensure, for example in a German corpus, that the participants would actually use the indirect versions |
---|
0:23:44 | as well as the direct utterances. |
---|
0:23:47 | So there is a little bit of a problem. |
---|
0:24:28 | That sounds interesting, thank you very much. |
---|
0:24:42 | So that was part of the earlier work; |
---|
0:24:45 | there, I just used the k-means clustering algorithm to find clusters. |
---|
0:24:50 | In this work, I don't actually build clusters, but just use the closest one. |
---|
0:25:15 | No, I used a very basic criterion. |
---|
0:25:21 | It counts as direct, pretty much, |
---|
0:25:25 | if it's a colloquial reformulation, and |
---|
0:25:32 | if the words from the original sentence appear in the exchanged sentence, then it counts here as very direct. |
---|