0:00:16 | So this is joint work with Serge Bibauw from KU Leuven. |
---|
0:00:20 | So, as you know, neural architectures have been increasingly popular for the development of conversational agents, and one major advantage of these approaches is that they can be learned from raw, unannotated dialogues, without needing much domain knowledge or feature engineering. |
---|
0:00:40 | However, they also require large amounts of training data, because they have a large parameter space. |
---|
0:00:46 | So we usually rely on large online resources to train them, such as Twitter conversations, technical web forums like the Ubuntu chat logs, movie scripts, and movie and TV subtitles. These resources are undeniably useful, |
---|
0:01:06 | but they also face some limitations in terms of dialogue modelling. We could talk for a long time about these, but I would like to point out just two limitations, especially ones that are important for subtitles. |
---|
0:01:24 | One of these limitations is that, for movie subtitles, we don't have any explicit turn structure. The corpus itself is only a sequence of sentences, together with timestamps for their start and end times. |
---|
0:01:40 | But we don't know who is speaking, because of course the subtitles don't come together with the audio track and the video, where you could see who is speaking at a given time. So we don't know who is speaking, and we don't know whether a sentence answers another turn or is a continuation of the current turn. |
---|
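The timestamp cue mentioned above can be sketched as a simple threshold on the pause between consecutive subtitle lines. This is an illustrative sketch, not the heuristic used in the work itself; the threshold value and the example data are assumptions:

```python
def likely_new_turn(prev_end, next_start, gap_threshold=1.0):
    """Heuristic: a long pause between two subtitle lines suggests
    that the next line starts a new turn (hypothetical threshold)."""
    return (next_start - prev_end) >= gap_threshold

# Subtitle lines as (start_time, end_time) in seconds (made-up values)
lines = [(0.0, 1.8), (2.0, 3.5), (6.2, 7.0)]
boundaries = [likely_new_turn(lines[i][1], lines[i + 1][0])
              for i in range(len(lines) - 1)]
# Only the long 3.5 -> 6.2 gap is flagged as a probable turn boundary.
```

As the talk notes, a gap threshold alone is a weak cue; lexical and syntactic cues are needed for reliable segmentation.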
0:01:58 | So in this particular example, the actual turn structure is the following. As you can see, there are some strong cues: the timestamps can be used in a few cases, and there are lexical and syntactic cues that can be used to infer the turn structure, but you never have the ground truth. |
---|
0:02:18 | And so that's an important disadvantage when you actually want to build a system that generates responses, and not just continuations of a given dialogue. |
---|
0:02:29 | Another limitation is that many of these data contain references to named entities that might be absent from the inputs, in particular fictional characters. They often refer to a context which is external to the dialogue and which cannot be captured by the inputs alone. |
---|
0:02:50 | So in this particular case, "Mr Holmes" is an entity for which you would require access to an external context in order to make sense of what is happening. There are other limitations of course, but I just wanted to point out these two important ones. |
---|
0:03:08 | So how do we deal with these problems? The key idea I'm going to present here starts from the fact that not all examples of context-response pairs are equally useful or relevant for building conversational models. Some examples, as Oliver Lemon showed in his keynote, might even be detrimental to the development of your model. |
---|
0:03:32 | So we can view this as a kind of domain adaptation problem: there is some kind of discrepancy between the context-response pairs that we observe in a corpus and the ones that we wish to encode in our neural conversational model, in the particular application that we have in mind. |
---|
0:03:51 | So the proposed solution is one that is very well known in the field of domain adaptation, which is simply the inclusion of a weighting model. We try to map each pair of context and response to a particular weight value that corresponds to its importance, to its quality if you want, for the particular purpose of building a conversational model. |
---|
0:04:18 | So how do we assign these weights? Of course, due to the sheer size of our corpora, we cannot annotate each pair manually, and even handcrafted rules per se may be difficult to apply in many cases, because the quality of examples might depend on multiple factors that interact in complex ways. |
---|
0:04:39 | So what we propose here is a data-driven approach, where we learn a weighting model from examples of high-quality responses. And of course, what constitutes a high-quality response might depend on the particular objectives and the particular type of conversational model that one wishes to build. |
---|
0:04:58 | So there is no single answer to what constitutes a high-quality response, but if you have some idea of what kind of responses you want and which ones you don't want, you can often select a subset of high-quality responses and learn a weighting model from these. The weighting model uses a neural architecture, which is the following. |
---|
0:05:21 | So as you can see here, we have two recurrent neural networks with shared weights: an embedding layer and a recurrent layer with LSTM or GRU units. These two networks respectively encode the context and the response as sequences of tokens, |
---|
0:05:43 | and output fixed-size vectors, which are then fed to a dense layer. This dense layer can also incorporate additional inputs, for instance document-level factors: if you have some features that are specific to movie dialogue and that may be of interest for calculating the weights, you can incorporate them |
---|
0:06:07 | in this dense layer. For the subtitles, for instance, we also have information about the time gaps between the context and the response, and that is something that can be used as well. So we include all these data as inputs to this final dense layer, which then outputs a weight for a given context-response pair. |
---|
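The final step of this architecture can be sketched as a dense layer over the two fixed-size encodings plus the extra document-level features, squashed to a bounded weight. This is a toy, dependency-free sketch: in practice the encodings would come from the shared recurrent networks, and all dimensions and parameter values below are made up for illustration:

```python
import math

def weight_head(context_vec, response_vec, extra_feats, params):
    """Dense layer + sigmoid over [context; response; extras] -> weight in (0, 1)."""
    x = context_vec + response_vec + extra_feats  # list concatenation
    z = params["bias"] + sum(w * v for w, v in zip(params["weights"], x))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid keeps the weight bounded

# Hypothetical 2-dim encodings plus one extra feature (time gap in seconds)
params = {"weights": [0.5, -0.2, 0.3, 0.1, -0.4], "bias": 0.0}
w = weight_head([0.2, 0.1], [0.4, -0.3], [1.5], params)
```

The choice of a sigmoid output is an assumption; any bounded activation would serve the same purpose of producing a per-pair weight.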
0:06:30 | So that's the model. Once we have learned a weighting model from examples of high-quality responses, we can then apply it to the full training data, to assign a particular weight to each pair, which we can include in the empirical loss that we minimize when we train the neural model. |
---|
0:06:52 | The exact formula for the empirical loss might depend on what kind of model you're building and what kind of loss function you're using, but the key idea is that the loss function calculates some kind of distance between what the model produces and the ground truth, and then you weight this loss by the weight value that you calculated with the weighting model. |
---|
0:07:18 | So it's a kind of two-pass procedure, where you first calculate the weight of your example, and then, given this weight and the output of the neural model, you can calculate the empirical loss and then optimize the parameters over this weighted sum. |
---|
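The weighted training objective described here can be sketched as the per-example losses multiplied by the instance weights and summed. The per-example loss values and weights below are illustrative, not from the talk:

```python
def weighted_empirical_loss(losses, weights):
    """Weighted sum of per-example losses; each weight comes from
    the separately trained weighting model."""
    assert len(losses) == len(weights)
    return sum(w * l for w, l in zip(weights, losses))

# Toy per-example losses and instance weights (made-up values):
# the low-weight example contributes little to the total loss.
losses = [0.8, 0.2, 1.0]
weights = [1.0, 0.5, 0.1]
total = weighted_empirical_loss(losses, weights)
```

In an actual training loop this quantity would be what the optimizer minimizes, with the weights held fixed during the second pass.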
0:07:37 | So that's the model, and the way the weights are integrated at training time. |
---|
0:07:44 | So how do we evaluate the models? We evaluated them using retrieval-based neural models, because the evaluation metrics are more clearly defined than for generative models. Retrieval-based models seek to compute a score for a given context-response pair, which expresses how relevant the response is given the context. You can then use this score to rank possible responses and to select the most relevant one. |
---|
0:08:13 | The training data comes from OpenSubtitles, which is a large corpus of dialogues that we released last year. We compared three models: a classical TF-IDF model and two dual encoder models, one with uniform weights (so without weighting) and one using the weighting model. And we conducted both an automatic and a human evaluation of this approach. |
---|
0:08:40 | The dual encoder models, which were proposed a few years ago, are actually quite simple models, where you also have two recurrent networks with shared weights, which you then feed to dense layers and combine with a dot product. So it computes some kind of semantic similarity between the response that is predicted given the context and the actual response that you find in the corpus. |
---|
0:09:07 | We made a small modification to this dot product model, to allow the final score to also depend on some features from the response itself. There might be some features that are not due to the similarity between the context and the response, but are due to some aspects of the response itself, and that give clues about whether it is of high or low quality. For instance, unknown words might indicate a response of lower quality. |
---|
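The modified scoring function can be sketched as the usual dot-product similarity plus a linear term over response-only features. The single feature (fraction of unknown words) and all weight values here are illustrative assumptions, not the features from the actual model:

```python
def adequacy_score(pred_vec, resp_vec, resp_feats, feat_weights):
    """Dot-product similarity plus a linear term on response-only features."""
    dot = sum(p * r for p, r in zip(pred_vec, resp_vec))
    intrinsic = sum(w * f for w, f in zip(feat_weights, resp_feats))
    return dot + intrinsic

# Hypothetical response-only feature: fraction of unknown words,
# with a negative weight so unknown words lower the final score.
clean = adequacy_score([0.3, 0.5], [0.4, 0.6], [0.0], [-2.0])
noisy = adequacy_score([0.3, 0.5], [0.4, 0.6], [0.5], [-2.0])
```

With identical vectors, the response containing unknown words ends up with the lower adequacy score, which is the point of the extension.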
0:09:39 | In terms of evaluation, we used, as I said, the subtitles as training data. To select the high-quality responses, we took a subset of this training data for which we knew the turn structure, because we could align it with movie scripts, where you have the speaker names. |
---|
0:09:58 | And then we used two heuristics. We only kept responses that introduce a new turn, so not sentences that simply serve as a continuation of a given turn. And we only used two-party conversations, because in two-party conversations it's easier to define whether the response is a response to the previous speaker or not. |
---|
0:10:22 | And then we also filtered out responses containing fictional names and out-of-vocabulary words. We ended up with a set of about one hundred thousand context-response pairs that we considered to be of high quality. |
---|
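The selection heuristics described above can be sketched as a simple filter over aligned examples. The field names, vocabulary, and name list below are hypothetical stand-ins for the actual corpus resources:

```python
def is_high_quality(pair, vocab, fictional_names):
    """Keep only responses that start a new turn, come from two-party
    conversations, and contain no fictional names or OOV words."""
    if not pair["new_turn"] or pair["num_speakers"] != 2:
        return False
    tokens = pair["response"].lower().split()
    if any(t in fictional_names for t in tokens):
        return False
    return all(t in vocab for t in tokens)

vocab = {"how", "are", "you", "fine", "thanks"}
fictional = {"holmes"}
pairs = [
    {"new_turn": True,  "num_speakers": 2, "response": "fine thanks"},
    {"new_turn": False, "num_speakers": 2, "response": "fine thanks"},
    {"new_turn": True,  "num_speakers": 2, "response": "thanks holmes"},
]
kept = [p for p in pairs if is_high_quality(p, vocab, fictional)]
```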
0:10:37 | For the test data, we used one in-domain and one slightly out-of-domain test set: the Cornell Movie-Dialogs Corpus, which is a collection not of movie subtitles but of movie scripts, and then a small corpus of sixty-two theatre plays that we found on the web. Of course we preprocessed them: we tokenized and POS-tagged them. |
---|
0:11:03 | In terms of experimental design, we considered the context to be limited to the last ten utterances preceding the response, with a maximum of sixty tokens; the response was limited to a maximum of five utterances, in the case of turns with multiple utterances. And then we had a one-to-one ratio between positive examples, which were actual pairs observed in the corpus, and negative examples, which were drawn at random from the same corpus. |
---|
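The one-to-one positive/negative setup can be sketched by keeping each observed context and pairing it once with its real response and once with a response drawn at random from the corpus. The data and the seeding are illustrative:

```python
import random

def make_training_pairs(dialogue_pairs, seed=0):
    """One negative example per positive: the context is kept and the
    response is replaced by a randomly drawn response from the corpus."""
    rng = random.Random(seed)  # seeded for reproducibility
    responses = [r for _, r in dialogue_pairs]
    examples = []
    for ctx, resp in dialogue_pairs:
        examples.append((ctx, resp, 1))                   # observed pair
        examples.append((ctx, rng.choice(responses), 0))  # random negative
    return examples

data = [("how are you", "fine thanks"), ("see you", "bye")]
examples = make_training_pairs(data)
```

A more careful sampler might exclude the true response from the negative draws; this sketch keeps the 1:1 ratio only.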
0:11:32 | We used GRU units instead of LSTMs because they are faster to train, and we didn't see any difference in performance compared to LSTMs. |
---|
0:11:42 | And here are the results. As you can see, TF-IDF doesn't perform well, but that's really well known. |
---|
0:11:51 | So we look at the recall-at-k metric, which considers a set of possible responses, one of which is the actual response observed in the corpus, and then looks at whether the model was able to put the actual response among the top k responses. So R ten at one means that, in a set of ten responses, one of which is the actual response, the model would rank the actual response to be the highest. |
---|
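The R_n@k metric just described can be implemented directly: rank the n candidate scores and check whether the true response lands in the top k. The scores below are a made-up example:

```python
def recall_at_k(scores, true_index, k=1):
    """scores: one model score per candidate response.
    Returns True if the true response is ranked in the top k."""
    ranking = sorted(range(len(scores)),
                     key=lambda i: scores[i], reverse=True)
    return true_index in ranking[:k]

# Ten candidate scores; the actual response (index 3) scores highest,
# so this example counts as a hit for R_10@1.
scores = [0.1, 0.4, 0.2, 0.9, 0.3, 0.05, 0.6, 0.15, 0.25, 0.5]
hit = recall_at_k(scores, true_index=3, k=1)
```

The reported metric is then the fraction of test examples for which this check succeeds.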
0:12:27 | And then we compared the two dual encoder models, and as you can see, the one with the weighting model performs a little better on both test sets. |
---|
0:12:35 | What we found in a subsequent error analysis was that the weighting model gives more importance to cohesive adjacency pairs between the context and the response: responses that were not simply continuations, but actual responses that were clearly from another speaker and that were answering the context. |
---|
0:12:58 | We also performed a human evaluation of the responses selected by the dual encoder models, using crowdsourcing. We picked one hundred and fifteen random contexts from the Cornell corpus and four possible responses: a random response, the two responses from the dual encoder models, and an expert response that was manually authored. So we had a resulting four hundred and sixty pairs, which were each evaluated by human judges, who were asked to rate the consistency between the context and the response on a five-point scale. One hundred and eighteen individuals participated in the evaluation through CrowdFlower. |
---|
0:13:45 | Unfortunately, the results were not conclusive: we could not find any statistically significant difference between the two models, and there was in general a very low agreement between the participants, for all four models. |
---|
0:14:01 | We hypothesize that this was due to the difficulty for the raters to discriminate between the responses, which might be due to the nature of the corpus itself: it is heavily dependent on an external context, namely the movie scenes, and if you don't have access to the movie scenes, it is very difficult to understand what is going on. Even providing a longer dialogue history didn't seem to help. So for a human evaluation, we think another type of test data might be more beneficial. |
---|
0:14:34 | So that was the human evaluation. To conclude: large dialogue corpora usually include many noisy examples, and noise can cover many things. It can be responses that were not actual responses, or responses that include, for instance, fictional names that you don't want to appear in your model; it might also include dull, commonplace responses, or responses that are inconsistent with what the model knows. |
---|
0:15:08 | So not all examples have the same quality or the same relevance for learning conversational models, and a possible remedy to that is to include a weighting model, which can be seen as a form of domain adaptation: instance weighting is a common approach for domain adaptation. |
---|
0:15:26 | And we showed that this weighting model does not need to be handcrafted. If you have a clear idea of how you want to filter your data, then you can of course use handcrafted rules, but in many cases what determines the quality of an example is hard to pinpoint, so it might be easier to use a data-driven approach and learn the weighting model from examples of high-quality responses. |
---|
0:15:53 | What constitutes this quality, what constitutes a good response, is of course dependent on the actual application that you are trying to build. This approach is very general: it is essentially a preprocessing step, so it can be applied to any data-driven model of dialogue. As long as you have examples of high-quality responses, you can use it as a preprocessing step to anything. |
---|
0:16:21 | As future work, we would like to extend it to generative models. In our evaluation we restricted ourselves to one type of retrieval-based model, but it might be very interesting to apply it to other kinds of models, and especially to generative ones, which are known to be quite difficult to train. |
---|
0:16:42 | An additional benefit of weighting models would be that you could filter out examples that are known to be detrimental to the model before you even feed them to the training scheme, so you might get performance benefits in addition to the benefits regarding your metrics, your accuracy. |
---|
0:17:06 | So that's for future work, and possibly also other types of test data than the Cornell movie corpus that we used. That's it, thank you. |
---|
0:17:32 | [Audience] Can you go back to the box plot towards the end? I'm not sure what's in the box plot. The way I read it is that there is no real difference between the models, but you have said that there is very low agreement between the evaluators, so I was wondering whether we are looking at two different things there. Is that right? |
---|
0:18:06 | [Speaker] I agree, it is mostly between the two dual encoder models. There is of course a statistically significant difference between the authored responses and the random ones, and also between the two dual encoder models and the random one, but there is no significant difference between the two dual encoders, with weighting and without weighting. |
---|
0:18:24 | [Audience] So possibly the difference would be more significant with an adjusted test set? [Speaker] Right, I agree. |
---|
0:18:43 | [Audience] Can you elaborate on why you changed the final piece of the dual encoder? What was the reason it was extended? |
---|
0:18:51 | [Speaker] So the idea is that the dot product will give you a similarity between the response predicted from the context and the actual response, right? This is a very important aspect when considering how relevant the response is compared to the context, but there might be aspects that are really intrinsic to the response itself and have nothing to do with the context: for instance unknown words, rare words that are probably typos, wrong punctuation, or lengthy responses. This is not going to be directly captured by the dot product; it is going to be captured by extracting some features from the response and then using these in the final adequacy score. That was something missing in the standard dot product model, and that's why we wanted to modify it. |
---|
0:19:54 | [Audience] I was just wondering if you could elaborate on the extent to which you believe in the generalizability of training a weighting model on a single dataset and having it extend reasonably to enhance performance elsewhere. [Speaker] Compared to training on multiple domains, you mean? [Audience] What I mean is: is the general scheme such that, whenever you are trying to improve performance on a dataset, you would basically find a similar dataset, train the weighting model on that similar dataset, and then use the weighting model on the new dataset? Is that sort of the general scheme for using this? |
---|
0:20:34 | [Speaker] It's not exactly the question that you are asking, but in some cases you might want to use different domains, or to preselect, to prune out some parts of the data that you don't want. In some cases, and that was the case that we had here, it's very difficult to do the preprocessing in advance on the full dataset, because the quality is very hard to determine using simple rules. |
---|
0:21:06 | In particular, here the turn structure is something that is important for determining what constitutes an actual response, but it was near impossible to write rules for that, because it was dependent on pauses and gaps, lexical cues, and many different factors. |
---|
0:21:22 | You could of course build a machine learning classifier that will segment your turns, but then it will be all or nothing, right? There were many examples in my dataset that were probably responses, but for which the classifier wouldn't give me a reliable answer. So it was better to use a weighting function, so that I can still account for some of these examples, but not in the same way as I would for, you know, clearly high-quality responses. |
---|
0:21:53 | Another aspect that I would like to mention is that we could of course have trained only on the high-quality responses, but in that case I would have had to prune ninety-nine point nine percent of my dataset. I didn't want to throw everything away just because I'm not exactly sure about the quality of the responses. I don't know if that answers your question. |
---|
0:22:22 | [Audience] One more question. I'm not sure, maybe I didn't follow the evaluation too closely, but did you try a baseline where you used a simpler heuristic for assigning the weights, rather than building a separate model to learn the weights? |
---|
0:22:54 | [Speaker] No, I didn't, and I'm not exactly sure we could find a very simple one. Something that could be done, although I don't know how well it would perform, would be to use the time gaps between the context and the response as a way to determine the weights. |
---|
0:23:19 | Actually, I did try that in a previous paper, when I was just looking at turn segmentation, and it didn't work very well for that particular task. Here it would be a bit different, since we would assign a weight value instead of just segmenting, but time gaps alone don't work very well; you usually have to use some lexical cues. For instance, a vocative like "Doctor Holmes, ..." is usually an indicator that the next speaker is going to be Doctor Holmes, but you need a classifier for that. |
---|