0:00:14 | Great, thanks everyone for staying for the final session. |
---|
0:00:19 | I'm going to be talking about using dialogue context to improve language |
---|
0:00:24 | understanding performance in multi-domain dialogues. |
---|
0:00:27 | So this is the outline of the talk: I'll first give a brief background |
---|
0:00:30 | of the problem, then talk about the datasets and the model architectures, and then |
---|
0:00:35 | a data augmentation scheme and the experiments. |
---|
0:00:38 | So, what is important in a dialogue system? In goal-oriented |
---|
0:00:42 | dialogue systems, |
---|
0:00:44 | the goal of the system is to help the user complete some task, and |
---|
0:00:47 | the user's goal is to complete some task, as opposed to chat-based dialogue systems, |
---|
0:00:51 | where the user just wants to have a conversation and the system's goal is |
---|
0:00:55 | to engage the user. |
---|
0:00:57 | So this is a typical architecture for a goal-oriented dialogue system. It |
---|
0:01:01 | consists of a pipeline of components, and the first component |
---|
0:01:06 | is the language understanding module: it acts as an interface that takes |
---|
0:01:12 | incoming user utterances and transforms them into a semantic representation. |
---|
0:01:18 | The next component is the state tracker, which keeps track of a probability distribution over |
---|
0:01:22 | dialogue states over the course of the conversation. After that is the policy, which, |
---|
0:01:28 | depending on the dialogue state and the backend state, decides what action |
---|
0:01:33 | to take, |
---|
0:01:33 | which could be making a backend call, asking the user for some |
---|
0:01:37 | information, or informing the user of something. And the last component is the language |
---|
0:01:41 | generation module, which takes the dialogue-act-based representation of the system's output and |
---|
0:01:46 | presents it to the user in natural language. |
---|
0:01:50 | So I'll just briefly talk about the semantic frame representation. Our dialogue understanding is |
---|
0:01:57 | based on frames, and we define frames in connection to backend actions, in the sense that |
---|
0:02:01 | the backend might support certain arguments and intents, and those |
---|
0:02:06 | are basically replicated in the frame. |
---|
0:02:08 | So the backend arguments are replicated as slots, and the backend intents are replicated as intents, |
---|
0:02:13 | and apart from the backend intents we support a bunch of conversational intents and |
---|
0:02:18 | dialogue acts like affirm, deny, compliment, express frustration, etcetera. |
---|
0:02:25 | So what exactly does the language understanding module do? |
---|
0:02:29 | It performs three tasks. The first task is domain classification: given an incoming |
---|
0:02:34 | user utterance, the language understanding module tries to identify which domain it corresponds |
---|
0:02:39 | to, so this is an utterance classification task. |
---|
0:02:42 | The second task is intent classification, so it tries to identify |
---|
0:02:47 | what intents exist in the user's utterance, |
---|
0:02:51 | so this is also an utterance classification task. |
---|
0:02:54 | And the third one is slot filling, and the idea there is to identify |
---|
0:02:58 | attributes which have been defined in the frame within |
---|
0:03:03 | the user utterance. |
---|
0:03:04 | For example, for a query like "flights from Boston", the |
---|
0:03:09 | domain is flights and the user intent might be find flights, and then you're trying to identify |
---|
0:03:13 | attributes like the departure city and the departure date, etcetera. So this is a sequence |
---|
0:03:18 | tagging task, and we treat it as a sequence labeling task. |
---|
0:03:26 | So, to basically sum it up: |
---|
0:03:28 | given a user utterance like "I want to go to Tacolicious, can |
---|
0:03:32 | you look up a table", the goal of the language understanding module |
---|
0:03:36 | is to identify that the domain is restaurant reservation, |
---|
0:03:40 | the intent, what the user is trying to do, is |
---|
0:03:43 | to inform the system about the restaurant name, |
---|
0:03:48 | and then identify the restaurant name slot, and similarly for the rest of the slots. |
---|
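To make the frame output concrete, here is a minimal Python sketch of what the language understanding result for an utterance like the one above might look like. The domain, intent, and slot names are illustrative assumptions, not the schema actually used in the system described in the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SemanticFrame:
    """Hypothetical output of the language understanding module for one user turn."""
    domain: str                      # utterance-level domain label
    intents: List[str]               # one or more intents present in the utterance
    slots: Dict[str, str] = field(default_factory=dict)  # slot name -> surface value

# Illustrative parse for a restaurant-reservation utterance:
frame = SemanticFrame(
    domain="restaurant_reservation",          # assumed domain name
    intents=["inform"],                       # user informs the system of a value
    slots={"restaurant_name": "Tacolicious"}, # slot filled by the sequence tagger
)
print(frame)
```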
0:03:55 | So there has been a lot of related work on using context for dialogue-related tasks. |
---|
0:04:00 | For language understanding, there was work on using memory networks for language understanding in |
---|
0:04:05 | a single domain, and there has been work on using memory networks for end-to-end |
---|
0:04:09 | dialogue systems. |
---|
0:04:11 | And there has been work on using hierarchical recurrent encoder-decoder models for generative |
---|
0:04:16 | query suggestion, which is a slightly unrelated task, but our model is an enhancement |
---|
0:04:20 | of these models. |
---|
0:04:25 | So I'll go over the datasets. |
---|
0:04:29 | We have a collection of single-domain dialogue datasets. |
---|
0:04:32 | The idea there is that the user has a single task that they are trying to |
---|
0:04:35 | complete, and it corresponds to a single domain. We have around a thousand |
---|
0:04:40 | dialogues in these datasets, and they are relatively short since the user has a single goal. |
---|
0:04:47 | Then we have a small multi-domain dialogue dataset, |
---|
0:04:51 | where the training set is around five hundred dialogues, the dev set around a hundred and fifty |
---|
0:04:54 | dialogues, and the test set around two hundred and seventy-two dialogues. These |
---|
0:04:58 | dialogues are longer because the user has multiple goals that they are trying to complete, and they |
---|
0:05:02 | span across multiple domains. |
---|
0:05:06 | The entity sets that we used to |
---|
0:05:08 | create the training and test dialogue sets are non-overlapping, so we have a lot of |
---|
0:05:13 | out-of-vocabulary entities in our dataset: a large fraction of the entities in the test user |
---|
0:05:18 | utterances are out of vocabulary. |
---|
0:05:22 | So, our data collection process |
---|
0:05:24 | relies on the interaction of a policy model and a user simulator, |
---|
0:05:28 | which interact in terms of dialogue acts and backend calls, etcetera, and then we |
---|
0:05:33 | collect natural language manifestations of these turns. |
---|
0:05:38 | The process and the datasets will be covered in an upcoming publication. |
---|
0:05:44 | Okay, so now I'll describe the overall architecture. This is the conceptual picture; the |
---|
0:05:49 | idea is that |
---|
0:05:51 | there is a context encoder that acts on the history of the dialogue |
---|
0:05:55 | to produce a context vector, and then there's a tagger network that takes |
---|
0:05:59 | in the dialogue context and the current user utterance |
---|
0:06:03 | and tries to determine the domain, the intents and the slots, with a single model across multiple domains; |
---|
0:06:08 | everything is jointly modeled. |
---|
0:06:13 | So I'll describe the architecture of the tagger network. |
---|
0:06:18 | We use the same tagger architecture across all the models that we compare; we |
---|
0:06:23 | only vary the context encoding. So this is an RNN-based model that |
---|
0:06:30 | jointly models the domain, the intents and the slots. |
---|
0:06:33 | We feed the embeddings corresponding to the user utterance tokens in- |
---|
0:06:38 | to a bidirectional GRU, which is depicted here in light yellow, if |
---|
0:06:44 | visible. |
---|
0:06:45 | The outputs of the bi-GRU are then fed into an LSTM, which |
---|
0:06:49 | is depicted in light blue. |
---|
0:06:52 | As for the context encoding: the output of the dialogue context |
---|
0:06:56 | encoder is fed into the initial state of the LSTM. We tried a |
---|
0:07:01 | bunch of different configurations, but this one seemed to work best, so that's what we |
---|
0:07:07 | went with. We used an LSTM in the second layer instead of a |
---|
0:07:10 | GRU only because it seemed to work better for slot filling, maybe because it |
---|
0:07:15 | leads to a separation between the internal cell states and the outputs. |
---|
0:07:22 | The final state of the LSTM is fed into the domain and intent classification |
---|
0:07:26 | layers, |
---|
0:07:27 | and the token-level outputs of the LSTM are fed into the |
---|
0:07:31 | slot tagging layer. |
---|
0:07:33 | So this is the tagger network |
---|
0:07:35 | that is used across all the models. |
---|
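As a rough sketch of the tagger network just described (a bidirectional GRU over token embeddings, an LSTM whose initial state comes from the dialogue context vector, utterance-level heads for domain and intent, and a per-token head for slot tags), something like the following PyTorch module would capture the idea. Layer sizes, head names, and the way the context vector is projected into the LSTM state are assumptions, not the exact published configuration.

```python
import torch
import torch.nn as nn

class TaggerNetwork(nn.Module):
    """Sketch of the shared tagger: biGRU -> LSTM (context-initialized) -> output heads."""
    def __init__(self, vocab_size, emb_dim=128, hidden=128,
                 n_domains=5, n_intents=20, n_slot_tags=40, ctx_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.bigru = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)
        # project the dialogue context vector into the LSTM's initial (h, c) states
        self.ctx_to_h = nn.Linear(ctx_dim, hidden)
        self.ctx_to_c = nn.Linear(ctx_dim, hidden)
        self.domain_head = nn.Linear(hidden, n_domains)
        self.intent_head = nn.Linear(hidden, n_intents)   # multi-label intent logits
        self.slot_head = nn.Linear(hidden, n_slot_tags)   # per-token slot tag logits

    def forward(self, token_ids, context_vec):
        x = self.emb(token_ids)                       # (batch, seq, emb)
        gru_out, _ = self.bigru(x)                    # (batch, seq, 2*hidden)
        h0 = self.ctx_to_h(context_vec).unsqueeze(0)  # (1, batch, hidden)
        c0 = self.ctx_to_c(context_vec).unsqueeze(0)
        lstm_out, (h_n, _) = self.lstm(gru_out, (h0, c0))
        final = h_n[-1]                               # utterance-level summary
        return (self.domain_head(final),
                self.intent_head(final),
                self.slot_head(lstm_out))             # logits per token
```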
0:07:40 | So this is basically just a description of the dataset. |
---|
0:07:44 | so |
---|
0:07:46 | why do we need to use context, and not just train the network |
---|
0:07:49 | on the user utterance alone? |
---|
0:07:50 | Suppose the user is having a conversation with a restaurant reservation bot, and the |
---|
0:07:54 | user says "eight". |
---|
0:07:56 | In the absence of context this is a pretty ambiguous statement; it's |
---|
0:07:59 | not easy to make out what the user means. It could mean eight people, or |
---|
0:08:04 | eight pm, or maybe it could be part of a restaurant name. But if you know that |
---|
0:08:09 | the system just asked "what time would you prefer?", then it's pretty obvious that the |
---|
0:08:12 | user meant eight as a time, |
---|
0:08:15 | as opposed to a number of people, at this turn. |
---|
0:08:18 | So this leads us to our first baseline model. |
---|
0:08:21 | The idea there is that we just feed the previous system turn into a GRU |
---|
0:08:25 | and use the final state of the GRU as the dialogue context. |
---|
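A minimal sketch of this first baseline context encoder, under the same assumptions as the tagger sketch above: a single GRU reads the tokens of the previous system turn, and its final hidden state is used as the dialogue context vector handed to the tagger.

```python
import torch
import torch.nn as nn

class PreviousTurnEncoder(nn.Module):
    """Baseline context encoder: encode only the last system turn with a GRU."""
    def __init__(self, vocab_size, emb_dim=128, ctx_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, ctx_dim, batch_first=True)

    def forward(self, system_turn_ids):
        _, h_n = self.gru(self.emb(system_turn_ids))  # h_n: (1, batch, ctx_dim)
        return h_n[-1]                                # context vector for the tagger
```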
0:08:29 | We evaluate four metrics. The first one is domain F1, which is the |
---|
0:08:34 | classification F1 score over domains, |
---|
0:08:36 | the second is intent F1, which is the classification F1 score over |
---|
0:08:40 | intents, and the third one is slot F1. The last one is |
---|
0:08:45 | frame error rate, which is the ratio of utterances where the |
---|
0:08:49 | model gets any one of the predictions wrong, so obviously you want to |
---|
0:08:53 | go for the lowest possible frame error rate. |
---|
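Since frame error rate drives much of the comparison, here is a small illustrative sketch of how it could be computed: an utterance counts as an error if any of its domain, intent, or slot predictions differs from the reference frame. The function and field names are hypothetical.

```python
def frame_error_rate(predictions, references):
    """Fraction of utterances where any of domain / intents / slots is wrong.

    Each item is assumed to be a dict like:
    {"domain": str, "intents": set_of_str, "slots": {slot_name: value}}
    """
    errors = 0
    for pred, ref in zip(predictions, references):
        if (pred["domain"] != ref["domain"]
                or pred["intents"] != ref["intents"]
                or pred["slots"] != ref["slots"]):
            errors += 1
    return errors / len(references)
```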
0:08:56 | So these are the performances of the simple baseline: the model where the |
---|
0:08:59 | previous system turn is encoded by the GRU and then fed into the tagger network. |
---|
0:09:06 | So, do we need context from more than one |
---|
0:09:09 | turn of the dialogue? Suppose the |
---|
0:09:12 | user, instead of just responding to a system question in a system-initiative dialogue, |
---|
0:09:17 | says something on their own at this point, so the user is taking the initiative |
---|
0:09:21 | here. This makes the problem more difficult, because in the absence of context about |
---|
0:09:27 | the previous dialogue it isn't clear what the user is referring to here. |
---|
0:09:33 | It could be a movie name, it could be a time, it could be a restaurant name; |
---|
0:09:36 | there are many options. |
---|
0:09:39 | But suppose you knew that this user has been talking about making a reservation; then |
---|
0:09:44 | you're more likely to get the prediction right. So that's why we want context |
---|
0:09:50 | from all the previous turns of the dialogue. |
---|
0:09:54 | So this is our second baseline, |
---|
0:09:56 | and this is based on the model proposed by Chen et al. in the |
---|
0:10:01 | memory network for language understanding paper. |
---|
0:10:05 | The idea there is to have a GRU layer that |
---|
0:10:09 | acts on the previous utterances to produce the memory vectors, so this |
---|
0:10:13 | memory is really a representation of all the previous utterances. |
---|
0:10:17 | We have another GRU that acts on the current utterance to produce the representation of |
---|
0:10:21 | this utterance. |
---|
0:10:22 | Based on the inner product of this memory and the current utterance |
---|
0:10:27 | vector, we get an attention distribution, and we use it to take a weighted sum of the memory and |
---|
0:10:31 | get the context vector, which is depicted here. So this is |
---|
0:10:35 | the output of this context encoder, which is fed into the tagger network. |
---|
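A rough sketch of this memory-network style context encoder, following the spoken description: one GRU encodes each previous turn into a memory vector, another GRU encodes the current utterance, attention weights come from inner products between the two, and the attention-weighted sum of memories is returned as the context vector. Dimensions and other details are assumptions rather than the exact configuration from the cited paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryNetContextEncoder(nn.Module):
    """Sketch: attention over GRU-encoded memories of previous utterances."""
    def __init__(self, vocab_size, emb_dim=128, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.memory_gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.utterance_gru = nn.GRU(emb_dim, hidden, batch_first=True)

    def encode(self, gru, token_ids):
        _, h_n = gru(self.emb(token_ids))
        return h_n[-1]                                    # (batch, hidden)

    def forward(self, history_token_ids, current_token_ids):
        # history_token_ids: list of (batch, seq_i) tensors, one per previous turn
        memories = torch.stack(
            [self.encode(self.memory_gru, turn) for turn in history_token_ids],
            dim=1)                                        # (batch, turns, hidden)
        current = self.encode(self.utterance_gru, current_token_ids)  # (batch, hidden)
        scores = torch.bmm(memories, current.unsqueeze(2)).squeeze(2) # inner products
        attn = F.softmax(scores, dim=1)                   # attention over past turns
        context = torch.bmm(attn.unsqueeze(1), memories).squeeze(1)   # weighted sum
        return context                                    # fed to the tagger network
```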
0:10:41 | so |
---|
0:10:44 | as you can see, adding |
---|
0:10:47 | memory of the entire dialogue leads to an improvement over |
---|
0:10:52 | all the metrics. For domain we see an improvement of around a percent absolute, |
---|
0:10:56 | for intent around two point three percent, for slot zero point five percent, and a |
---|
0:11:00 | significant reduction in frame error rate compared to the baseline. |
---|
0:11:06 | So, if you remember, |
---|
0:11:08 | we're working on multi-domain dialogues, so the idea is that the user might have multiple |
---|
0:11:12 | goals, and |
---|
0:11:14 | just |
---|
0:11:16 | the knowledge of what the user said, |
---|
0:11:18 | without being able to understand the dialogue history in context, that is, each utterance in context of |
---|
0:11:23 | the rest of the utterances, might not give the complete picture. |
---|
0:11:27 | For example, suppose the user has multiple goals: the user wants to book movie tickets |
---|
0:11:31 | and is also trying to make a reservation. |
---|
0:11:35 | In the absence of |
---|
0:11:37 | knowledge of how these utterances relate to each other, the user utterance is still ambiguous. But if you |
---|
0:11:43 | have a sequential history of the dialogue, where |
---|
0:11:48 | you can understand each utterance in context of the others, |
---|
0:11:52 | then it's more likely that you get the prediction right. |
---|
0:11:55 | so |
---|
0:11:55 | this is the final model that we |
---|
0:11:59 | experimented with, and this is an extension of the memory network. The idea there is, |
---|
0:12:03 | again, you get the |
---|
0:12:05 | memory of the previous dialogue turns, which is depicted here in yellow, and you |
---|
0:12:12 | get a representation of the current utterance, which is depicted in green. |
---|
0:12:17 | But instead of taking an inner product to get an attention distribution, you combine them |
---|
0:12:21 | together with a feedforward layer to get the combined memory vectors, |
---|
0:12:25 | and these are then fed into another GRU, which produces the context |
---|
0:12:33 | vector. So |
---|
0:12:34 | basically what is happening is we produce a |
---|
0:12:38 | representation of each turn of the dialogue history in context with the current utterance, and then we |
---|
0:12:44 | have a recurrent layer that goes over the whole dialogue and |
---|
0:12:48 | tries to combine these utterances together in context of each other, |
---|
0:12:53 | and the final state of that layer is the context vector that is |
---|
0:12:57 | fed to the tagger. |
---|
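To make the sequential dialogue encoder concrete, here is a sketch under the same assumptions: each memory vector is combined with the current-utterance encoding through a small feed-forward layer, the resulting per-turn vectors are run through another GRU over the turns, and that GRU's final state becomes the context vector. This mirrors the spoken description, not necessarily the exact published architecture.

```python
import torch
import torch.nn as nn

class SequentialDialogueEncoder(nn.Module):
    """Sketch: combine each memory with the current utterance, then a GRU over turns."""
    def __init__(self, hidden=128):
        super().__init__()
        self.combine = nn.Sequential(                # feed-forward combination layer
            nn.Linear(2 * hidden, hidden), nn.Tanh())
        self.session_gru = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, memories, current):
        # memories: (batch, turns, hidden) from a turn-level encoder
        # current:  (batch, hidden) encoding of the current utterance
        expanded = current.unsqueeze(1).expand_as(memories)
        combined = self.combine(torch.cat([memories, expanded], dim=-1))
        _, h_n = self.session_gru(combined)          # GRU over the dialogue turns
        return h_n[-1]                               # context vector for the tagger
```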
0:13:02 | So this is an enhancement of the memory network, and it is also, in a |
---|
0:13:05 | sense, an enhancement of the hierarchical recurrent encoder-decoder model that has |
---|
0:13:09 | been used for next utterance prediction and for contextual generative query suggestion. |
---|
0:13:18 | So, rather unexpectedly, what we observe is that this model doesn't perform as well |
---|
0:13:23 | as the memory network. |
---|
0:13:25 | Now, we tried to dig into this, and our hypothesis is that |
---|
0:13:30 | there is a huge train-test shift in our datasets. Our training set is |
---|
0:13:34 | composed largely of single-domain dialogues from the |
---|
0:13:39 | single-domain datasets, |
---|
0:13:44 | and only around five hundred multi-domain dialogues. |
---|
0:13:47 | So we |
---|
0:13:49 | believe that the sequential dialogue encoder network is unable to adapt |
---|
0:13:55 | from the single-domain dialogues to the multi-domain dialogues. |
---|
0:13:58 | So, what do we do? |
---|
0:14:01 | We go with a simple data augmentation scheme. |
---|
0:14:04 | Since there is a distribution shift between our training and test datasets, we try to make our training |
---|
0:14:08 | dataset more similar to the test data. |
---|
0:14:11 | So we take our large single-domain dialogue datasets, |
---|
0:14:14 | and we combine single-domain dialogues, two at a time, |
---|
0:14:18 | synthetically introducing domain switches by |
---|
0:14:23 | basically grafting one single-domain dialogue into another one. |
---|
0:14:29 | So we add around ten thousand dialogues, which are generated |
---|
0:14:32 | by combining |
---|
0:14:33 | pairs of |
---|
0:14:36 | single-domain dialogues. |
---|
0:14:39 | So this is an example of how we combine dialogues. Dialogue A is a dialogue |
---|
0:14:42 | where the user is trying to buy movie tickets; |
---|
0:14:46 | in dialogue B the user is trying to find a restaurant. Then we |
---|
0:14:50 | randomly sample a location in dialogue A and insert dialogue B there to get the |
---|
0:14:55 | combined dialogue, |
---|
0:14:58 | and use this for training. |
---|
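A minimal sketch of the dialogue-combination idea described above: pick a random turn boundary in one single-domain dialogue and splice a second single-domain dialogue in at that point, so the result contains a synthetic domain switch. The turn representation and sampling details are assumptions.

```python
import random

def combine_dialogues(dialogue_a, dialogue_b, rng=random):
    """Graft dialogue_b into dialogue_a at a random turn boundary.

    Each dialogue is assumed to be a list of turns (e.g. dicts holding the
    utterance and its frame annotation); annotations are carried over as-is.
    """
    cut = rng.randrange(1, len(dialogue_a))           # never cut before the first turn
    return dialogue_a[:cut] + dialogue_b + dialogue_a[cut:]

def augment(single_domain_dialogues, n_combined=10000, rng=random):
    """Create synthetic multi-domain dialogues from pairs of single-domain ones."""
    combined = []
    for _ in range(n_combined):
        a, b = rng.sample(single_domain_dialogues, 2)  # two distinct dialogues
        combined.append(combine_dialogues(a, b, rng))
    return combined
```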
0:15:01 | So, the highlighted numbers show |
---|
0:15:05 | the improvement in performance from training on the combined dialogues compared to training |
---|
0:15:10 | without the combined dialogues, |
---|
0:15:13 | and the bold numbers, |
---|
0:15:17 | let me describe them, the bold numbers are the ones where |
---|
0:15:21 | the model beats the baseline on a certain metric. |
---|
0:15:24 | so |
---|
0:15:25 | the sequential dialogue encoder seems to benefit the most from dialogue |
---|
0:15:29 | combination. |
---|
0:15:31 | Dialogue combination leads to performance improvements for almost all the models, but the |
---|
0:15:36 | one that benefits the most is the sequential dialogue encoder, and this is probably because |
---|
0:15:42 | combining dialogues leads to longer dialogues and it adds noise, which |
---|
0:15:47 | acts like regularization, and since the sequential dialogue encoder is the most |
---|
0:15:50 | complex model, we would expect it to benefit the most from this. |
---|
0:15:55 | And this is what we observe: basically, the sequential dialogue encoder does better than |
---|
0:15:59 | the memory network |
---|
0:16:00 | on most metrics, and it's close to the best model |
---|
0:16:05 | on intent classification. |
---|
0:16:08 | so |
---|
0:16:11 | this is an example; it's a degenerate example, but it |
---|
0:16:16 | tries to illustrate what's happening. Here we just look at the attention distributions and try |
---|
0:16:20 | to figure out what the models are doing. So this is a dialogue from |
---|
0:16:24 | the test set, and the entities are in boldface; |
---|
0:16:26 | all of these are out of vocabulary for our dataset because the entity sets are non- |
---|
0:16:30 | overlapping. |
---|
0:16:32 | So |
---|
0:16:34 | you see that the last three utterances have a lot of OOV entities. |
---|
0:16:38 | If you look at the memory network's attention distribution, you notice that its |
---|
0:16:42 | focus |
---|
0:16:44 | is almost entirely on the user utterance where the user asks for a restaurant, |
---|
0:16:48 | whereas the sequential dialogue encoder's focus is spread equally over the last two |
---|
0:16:52 | utterances. |
---|
0:16:54 | And, I should make this clear, the utterance that we are trying |
---|
0:16:59 | to understand is the final one, at the bottom, |
---|
0:17:03 | where the user asks for a table for two. |
---|
0:17:07 | So the goal here is to identify that the domain is restaurants, finding restaurants, and |
---|
0:17:12 | to identify the slots for the table-for-two request. |
---|
0:17:20 | So what we observe is that the encoder-decoder model fails to identify either the domain |
---|
0:17:25 | or the slots. |
---|
0:17:26 | The memory network correctly identifies the domain because it is focusing on the utterance |
---|
0:17:31 | where the user says they want a restaurant, |
---|
0:17:34 | but it fails to incorporate context from the previous system utterance, where the system is |
---|
0:17:38 | offering a restaurant to the user, and so it is unable to identify the slots, |
---|
0:17:43 | whereas the sequential dialogue encoder manages to successfully combine context from both these utterances and |
---|
0:17:49 | recognizes both the domain and the slots. |
---|
0:17:55 | okay i think that's it |
---|
0:17:56 | thanks a lot for listening |
---|
0:18:04 | questions |
---|
0:18:13 | Okay, I have two questions. First one: |
---|
0:18:17 | as a byproduct of what you're doing, you get a memory representation of the |
---|
0:18:24 | context, |
---|
0:18:25 | you have the whole dialogue history. I'm wondering if you considered, maybe, training, because you |
---|
0:18:32 | have access to the simulated user, whether you can train a policy |
---|
0:18:39 | using this representation, because it's very similar to belief tracking |
---|
0:18:44 | in traditional dialogue systems. So the question is more like: maybe you |
---|
0:18:50 | can, instead of doing a modular thing, you can just have the |
---|
0:18:54 | same model do that, though, for end-to-end training. |
---|
0:18:59 | So, |
---|
0:19:01 | that's a very interesting direction, because we have some people running experiments on this, so |
---|
0:19:06 | this is something we're looking into. |
---|
0:19:10 | Because I think that the problem, maybe, |
---|
0:19:13 | the problem usually with such an uninterpretable representation is that when you pick some |
---|
0:19:18 | action like a confirm, you don't know which slot to confirm, but at the same |
---|
0:19:22 | time you have the semantics, so you can make it usable. |
---|
0:19:30 | I think by carefully designing the |
---|
0:19:34 | semantics that we are using, we can |
---|
0:19:37 | alleviate or remove that problem. For example, instead of having a single |
---|
0:19:41 | confirm action, if you have a confirm per slot and then have the model |
---|
0:19:45 | predict, based on the context, what the user is trying to confirm, then |
---|
0:19:50 | that would mitigate this problem, |
---|
0:19:55 | though again there is some uncertainty there, still. |
---|
0:19:59 | Any other questions? |
---|
0:20:15 | So, can you go back to |
---|
0:20:17 | the last but one slide, where you had the Brazilian restaurant? I wanted to ask |
---|
0:20:22 | two questions about this example. |
---|
0:20:25 | First, I thought you said you would train on a synthetic dataset where you combine |
---|
0:20:34 | domains, right? So do you still consider the restaurant domain to be out of domain |
---|
0:20:40 | at this point? That's the first question. And second: |
---|
0:20:45 | how would you deal with something that is |
---|
0:20:48 | truly out of domain, like |
---|
0:20:51 | "the weather is nice" or "today I'm grumpy" or whatever, which is different in many ways |
---|
0:20:55 | from these |
---|
0:20:57 | utterances |
---|
0:20:58 | that are still task-related even if not in the vocabulary that you've got? |
---|
0:21:02 | So, for the first question: the restaurant domain is not out of domain here, |
---|
0:21:06 | because our system can handle movie tickets and restaurants. |
---|
0:21:12 | Given an utterance, the system will try to keep track across the different domains; it will |
---|
0:21:16 | see that this is a different-domain utterance and still be able to handle it. |
---|
0:21:20 | So even though the dialogue is multi-domain, it's handled; |
---|
0:21:26 | the entities are out of vocabulary, though, |
---|
0:21:28 | not out of domain. |
---|
0:21:31 | For the second question: we have out-of-domain utterances in this dataset too, where |
---|
0:21:36 | the system is supposed to say "I cannot handle that", |
---|
0:21:41 | but |
---|
0:21:44 | I think in our dataset there is not enough of that, so we definitely need more out-of- |
---|
0:21:47 | domain data to be able to successfully handle out-of-domain utterances. |
---|
0:21:58 | All right, more questions? |
---|
0:22:07 | My second question: do you use delexicalization, or not, |
---|
0:22:13 | of the input? |
---|
0:22:14 | No, we don't use any delexicalization; so basically this is the model that |
---|
0:22:18 | does the delexicalization, effectively. |
---|
0:22:21 | Right, so this model will basically identify the entities that we are trying to delexicalize. |
---|
0:22:28 | So, |
---|
0:22:30 | because if you use a gazetteer-based approach or something to delexicalize, |
---|
0:22:33 | then it doesn't scale to all the entities that might exist, whereas this model |
---|
0:22:39 | will try to identify them based on context, based on the annotations from |
---|
0:22:43 | the collected data or something. |
---|
0:22:49 | thank you very much |
---|