Great, thanks for the introduction. I'm going to be talking about using dialogue context to improve language understanding performance in multi-domain dialogues.
So this is the outline of the talk: I'll give a brief background of the problem, then talk about the datasets and the model architectures, and then a data augmentation scheme and experiments.
So first, what is a goal-oriented dialogue system? In goal-oriented dialogue systems, the goal of the system is to help the user complete some task, as opposed to chat-based dialogue systems, where the user just wants to have a conversation and the system's goal is to keep the user engaged.
So this is a typical architecture for a goal-oriented dialogue system. It's composed of a pipeline of components. The first component is the language understanding module; it acts as an interface that takes incoming user utterances and transforms them into a semantic representation. The next component is the state tracker, which keeps track of a probability distribution over dialogue states over the course of the conversation. After that is the policy, which, depending on the dialogue state and the backend state, decides what action to take; that could be making a backend call, asking the user for some information, or informing the user of something. And the last component is the language generation module, which transforms the dialogue-act-based representation of the policy's output into natural language for the user.
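To make the pipeline concrete, here is a minimal Python sketch of how the four components connect; the component behavior, names, and templates are my own illustrative assumptions, not the actual system from this talk.

    def language_understanding(utterance: str) -> dict:
        # Toy LU: map an utterance to a semantic frame (domain, intent, slots).
        if "table" in utterance:
            return {"domain": "restaurant", "intent": "reserve", "slots": {}}
        return {"domain": "unknown", "intent": "unknown", "slots": {}}

    def track_state(state: dict, frame: dict) -> dict:
        # Toy state tracker: fold the new frame into the dialogue state.
        state.update(frame["slots"])
        state["domain"] = frame["domain"]
        return state

    def policy(state: dict) -> dict:
        # Toy policy: decide the next system action from the dialogue state.
        if "time" not in state:
            return {"act": "request", "slot": "time"}
        return {"act": "inform", "slot": "reservation"}

    def generate(action: dict) -> str:
        # Toy language generation: dialogue act -> natural language.
        templates = {"request": "What {slot} would you prefer?",
                     "inform": "Your {slot} is confirmed."}
        return templates[action["act"]].format(slot=action["slot"])

    state = {}
    frame = language_understanding("Can you book a table for two?")
    state = track_state(state, frame)
    print(generate(policy(state)))  # -> "What time would you prefer?"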
So let me briefly talk about the semantic frame representation. Our dialogue understanding is based on frames, and we define frames in connection to backend actions, in the sense that the backend might support certain arguments and actions, and those are replicated in the frame: the arguments the backend supports are replicated as slots, and the backend actions are replicated as intents. And apart from the backend intents, we support a bunch of conversational intents, or dialogue acts, like affirm, deny, acknowledge, express frustration, and so on.
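As a sketch, a frame in this spirit might look like the following; the exact field and slot names here are illustrative assumptions, not the paper's schema.

    # Hypothetical frame: backend arguments become slots, backend actions
    # become intents, plus shared conversational intents / dialogue acts.
    RESTAURANT_FRAME = {
        "domain": "reserve_restaurant",
        "intents": ["reserve_restaurant", "find_restaurant"],        # from backend actions
        "slots": ["restaurant_name", "date", "time", "num_people"],  # from backend arguments
        "dialogue_acts": ["affirm", "deny", "acknowledge", "express_frustration"],
    }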
So what does the language understanding module do? It performs three tasks. The first task is domain classification: given an incoming user utterance, the language understanding module tries to identify which domain it corresponds to, so this is an utterance classification task. The second task is intent classification: it tries to identify what intents exist in the user's utterance, and this is also an utterance classification task. And the third one is slot filling, and the idea there is to identify attributes which have been defined in the frame but appear in the user utterances. For example, for a query like "flights from Boston to Seattle", the domain might be flights and the user intent might be find-flights, and we're trying to identify attributes like the departure city and the destination city. This is a sequence tagging task, and we treat it as a sequence labeling task based on IOB tagging.
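For instance, the flights query above would be labeled roughly like this under IOB tagging; the slot names are illustrative.

    # Domain / intent are utterance-level labels; slots are token-level IOB tags.
    tokens = ["flights", "from", "Boston", "to", "Seattle"]
    labels = ["O", "O", "B-departure_city", "O", "B-destination_city"]
    domain, intent = "flights", "find_flights"

    for token, label in zip(tokens, labels):
        print(f"{token:10s} {label}")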
So, to sum it up: given a user utterance where the user asks to book a table at a particular restaurant, the goal of the language understanding module is to identify that the domain is restaurant reservation, that the intent is inform (the user is trying to inform the system about the restaurant name), to tag the restaurant name in the utterance, and similarly for the rest of the slots.
So there has been a lot of related work on using context for dialogue-related tasks. For language understanding, there has been work on using memory networks for language understanding in a single domain, and there has been work on using memory networks for end-to-end dialogue systems. There has also been work on using hierarchical recurrent encoder-decoder models for generative query suggestion, which is a slightly unrelated task, but our model is an enhancement of those models.
So I'll give an overview of the datasets. We have a collection of single-domain dialogue datasets; the idea there is that the user has a single task that they are trying to complete, and each dialogue corresponds to a single domain. We have around a thousand dialogues in these datasets. Then we have a small multi-domain dialogue dataset, where the training set is around five hundred dialogues, the dev set around a hundred and fifty dialogues, and the test set around two hundred and seventy dialogues. These dialogues are longer, because the user has multiple goals that they are trying to complete, and the goals span across multiple domains. The entity sets that we used to create the training and test dialogue sets are non-overlapping, so we have a lot of out-of-vocabulary entities in our dataset: the entities appearing in the test user utterances are out of the vocabulary.
Our data collection process relies on the interaction of a policy model and a user simulator, which interact in terms of dialogue acts, backend calls, and so on, and then we collect natural language realizations of those via crowdsourcing. The process and the datasets will be covered in an upcoming publication.
OK, so now I'll describe the overall architecture at a conceptual level. The idea is that there is a context encoder that acts on the utterances of the dialogue so far and produces a context vector, and then there is a tagger network that takes in the dialogue context and the current user utterance and tries to determine the domain, the intents, and the slots. Since we train a single model on multi-domain data, it does all of this jointly in a single model.
Next, the architecture of the tagger network. We use the same tagger architecture for all the models that we compare; they vary only in the context encoder component. This is an RNN-based model that jointly models the domain, the intents, and the slots. We feed the embeddings corresponding to the user utterance tokens into a bidirectional GRU, which is depicted here in light yellow, if visible. The outputs of the bidirectional GRU are then fed into an LSTM, which is depicted in light blue. As far as the context encoder is concerned, the output of the dialogue context encoder is fed into the initial state of the LSTM; we tried a bunch of different configurations, but this one seemed to work best, so that's what we went with. We used an LSTM in the second layer instead of a GRU only because it seemed to work better for slot filling, maybe because it leads to a separation between the internal states and the outputs. The final state of the LSTM is fed into the domain and intent classifiers, and the token-level outputs of the LSTM are fed into the slot tagger.
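Here is a minimal sketch of a tagger in this style, assuming PyTorch as the framework; the layer sizes and head names are my assumptions, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class TaggerNetwork(nn.Module):
        # Token embeddings -> biGRU -> LSTM (initialized from the dialogue
        # context vector) -> domain / intent / slot heads.
        def __init__(self, vocab, emb=64, hid=128, n_dom=3, n_int=10, n_slot=20):
            super().__init__()
            self.embed = nn.Embedding(vocab, emb)
            self.bigru = nn.GRU(emb, hid, bidirectional=True, batch_first=True)
            self.lstm = nn.LSTM(2 * hid, hid, batch_first=True)
            self.domain_head = nn.Linear(hid, n_dom)
            self.intent_head = nn.Linear(hid, n_int)
            self.slot_head = nn.Linear(hid, n_slot)  # per-token IOB logits

        def forward(self, token_ids, context):
            # context: (batch, hid) vector from the context encoder, used as
            # the LSTM's initial hidden state, as described in the talk.
            x, _ = self.bigru(self.embed(token_ids))
            h0 = context.unsqueeze(0)
            out, (h_n, _) = self.lstm(x, (h0, torch.zeros_like(h0)))
            final = h_n[-1]  # final LSTM state -> utterance-level classifiers
            return (self.domain_head(final), self.intent_head(final),
                    self.slot_head(out))

    model = TaggerNetwork(vocab=100)
    domain, intent, slots = model(torch.randint(0, 100, (2, 7)), torch.zeros(2, 128))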
So this is the tagger network that is shared across all the models; what I'll describe next are the context encoders. So why do we need to use context at all? Why not just feed the tagger network only the current user utterance?
So suppose the user is having a conversation with a restaurant reservation bot, and the user says "five". In the absence of context this is a pretty ambiguous statement, and it's not easy to make out what the user means: it could mean five people, or five pm, or maybe it could even be a restaurant name. But if you know that the system just asked "what time would you prefer?", then it's pretty obvious that the user meant "five" as a time, as opposed to, say, a number of people.
So this leads us to our first baseline model. The idea there is that we just feed the previous system turn into a GRU and use the final state of the GRU as the dialogue context.
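A sketch of this baseline context encoder, under the same illustrative PyTorch setup and sizes as before:

    import torch
    import torch.nn as nn

    emb, hid, vocab = 64, 128, 100
    embed = nn.Embedding(vocab, emb)
    turn_gru = nn.GRU(emb, hid, batch_first=True)

    # Token ids of the previous system turn (dummy data for illustration).
    prev_system_turn = torch.randint(0, vocab, (1, 6))
    _, h_n = turn_gru(embed(prev_system_turn))
    context = h_n[-1]  # (1, hid) final GRU state, fed to the tagger as context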
We evaluate on four metrics. The first one is domain F1, which is the classification F1 score over domains. The second is intent F1, which is the classification F1 score over intents. The third one is slot F1, and the last one is the frame error rate, which is the ratio of utterances where the model gets any one of the predictions wrong; so obviously you want to go for the lowest possible frame error rate.
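The frame error rate, as described, can be computed like this on toy data; the frame encoding here is an illustrative assumption.

    # A frame is wrong if any of its domain / intent / slot predictions is wrong.
    def frame_error_rate(predictions, references):
        wrong = sum(pred != ref for pred, ref in zip(predictions, references))
        return wrong / len(references)

    preds = [("movies", "buy", ("B-title",)), ("restaurant", "find", ("O",))]
    refs  = [("movies", "buy", ("B-title",)), ("restaurant", "reserve", ("O",))]
    print(frame_error_rate(preds, refs))  # 0.5: the second frame has a wrong intent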
So these are the performances of the simple baseline: the model where the previous system turn is encoded with a GRU and then fed into the tagger network.
So why do we need context from the whole dialogue rather than just the previous turn? Suppose the user, instead of just responding to a system-initiative turn at this point, takes the initiative instead. This makes the problem more difficult, because in the absence of context about the previous dialogue it isn't clear what the user is referring to here: it could be a movie name, it could be a time, it could be a restaurant name; there are many options. But suppose you knew that this user had been talking about buying movie tickets earlier; then you're much more likely to get the prediction right. So we want context from all of the previous turns of the dialogue.
So this is our second baseline, and it is based on the model proposed by Chen et al. in the memory networks for language understanding paper. The idea there is to have a GRU layer that encodes the previous utterances to produce memory vectors, so this memory is a representation of all the previous utterances. We have another GRU that encodes the current utterance to produce a representation of that utterance. Based on the inner products of the memory vectors and the current utterance vector we get an attention distribution, and then we take a weighted sum of the memory vectors to get the context vector for the dialogue, which is depicted here. The output of this context encoder is then fed into the tagger network.
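Here is a rough sketch of this memory-attention computation, again assuming PyTorch and illustrative sizes; it sketches the idea rather than Chen et al.'s exact implementation.

    import torch
    import torch.nn as nn

    emb, hid, vocab = 64, 128, 100
    embed = nn.Embedding(vocab, emb)
    mem_gru = nn.GRU(emb, hid, batch_first=True)  # encodes each previous turn
    cur_gru = nn.GRU(emb, hid, batch_first=True)  # encodes the current utterance

    prev_turns = torch.randint(0, vocab, (5, 8))  # 5 previous turns, 8 tokens each
    current = torch.randint(0, vocab, (1, 8))

    _, m = mem_gru(embed(prev_turns))
    memories = m[-1]                               # (5, hid) memory vectors
    _, c = cur_gru(embed(current))
    utt = c[-1]                                    # (1, hid) current utterance

    attn = torch.softmax(memories @ utt.t(), dim=0)  # attention over previous turns
    context = (attn * memories).sum(dim=0)           # weighted sum -> context vector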
So, as you can see, adding a memory of the entire dialogue leads to an improvement on all the metrics: an absolute gain for domain, around 2.3 percent for intent, and around 0.5 percent for slots, along with a significant reduction in the frame error rate. But can we do better than this?
Now, if you remember, we're working on multi-domain dialogues, so the idea is that the user might have multiple goals. Just the knowledge of what the user said in a turn, without being able to understand the dialogue history in the context of the rest of the utterances, might not give the complete picture. For example, suppose the user has multiple goals, say buying movie tickets and making a restaurant reservation. In the absence of knowledge of how these utterances relate to each other, a user utterance is still ambiguous; but if you have a sequential history of the dialogue, where you can understand each utterance in the context of the others, it's more likely that you get the prediction right.
So this is the final model that we experimented with, and it is an extension of the memory network. The idea there is, again, you get the memory of the previous dialogue turns, which is depicted here in yellow, and you get a representation of the current utterance, which is depicted in green. But instead of taking an inner product to get an attention distribution, you combine them together with a feedforward layer to get a contextual memory of the dialogue, and this is then fed into another GRU, which produces the context vector. So basically what is happening is: we get a representation of the entire dialogue history in context with the current utterance, and then we have a GRU that runs over the whole dialogue and tries to combine these utterances together in the context of each other, and the final state of that GRU is the context vector that is fed to the tagger.
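A sketch of this sequential dialogue encoder, continuing with the same illustrative tensors and sizes as the previous sketch; the exact form of the feedforward combiner (here a single sigmoid layer) is my assumption.

    import torch
    import torch.nn as nn

    hid = 128
    combine = nn.Sequential(nn.Linear(2 * hid, hid), nn.Sigmoid())
    session_gru = nn.GRU(hid, hid, batch_first=True)

    memories, utt = torch.randn(5, hid), torch.randn(1, hid)  # stand-in encodings
    # Pair every memory vector with the current utterance representation.
    pairs = torch.cat([memories, utt.expand(5, hid)], dim=1)  # (5, 2*hid)
    contextual = combine(pairs).unsqueeze(0)                  # (1, 5, hid)
    _, h_n = session_gru(contextual)                          # run over the turns
    context = h_n[-1]  # (1, hid) final state -> context vector for the tagger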
So this is an enhancement of the memory network, and it is also, in a sense, an enhancement of the hierarchical recurrent encoder-decoder model that has been used for next utterance prediction and for context-sensitive generative query suggestion.
So, very unexpectedly, what we observed is that this model doesn't perform as well as the memory network. We dug into this, and our hypothesis is that there is a huge train-test distribution shift in our datasets: the training set is composed largely of single-domain dialogues, a few thousand of them from the single-domain datasets, plus only around five hundred multi-domain dialogues. So we believe that the sequential dialogue encoder is unable to adapt from the single-domain dialogues to the multi-domain test set.
So what do we do? We go with a simple data augmentation scheme. Since there is a distribution shift between our training and test datasets, we try to make the training dataset more similar to the test data: we take the large single-domain dialogue datasets and combine single-domain dialogues, sampled in pairs, to synthetically create domain switches, basically by grafting one single-domain dialogue into another one. We end up with around ten thousand of these combined dialogues.
This is an example of how we sample a combined dialogue. Dialogue A is a dialogue where the user is trying to buy movie tickets, and dialogue B is a dialogue where the user is trying to find a restaurant. We randomly sample a location in dialogue A and insert dialogue B at that point; this now becomes the combined dialogue, and we use it for training.
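A minimal sketch of this combination scheme; the turn format is an illustrative assumption.

    import random

    def combine(dialogue_a, dialogue_b, rng=random):
        # Graft dialogue B into dialogue A at a random turn boundary,
        # synthesizing a multi-domain dialogue with a domain switch.
        cut = rng.randrange(len(dialogue_a) + 1)
        return dialogue_a[:cut] + dialogue_b + dialogue_a[cut:]

    movies = ["U: two tickets for tonight please", "S: which movie?"]
    restaurant = ["U: find me a restaurant nearby", "S: which cuisine?"]
    print(combine(movies, restaurant))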
So, as the numbers show, we see a visible improvement in performance just from training on the combined dialogues, compared to training without them. The bold numbers are the ones where a model beats all the other models on a certain metric. Dialogue combination leads to performance improvements for almost all the models, but the one that benefits the most is the sequential dialogue encoder. This is probably because combining dialogues leads to longer dialogues and adds noise, which acts like regularization, and since the sequential dialogue encoder is the most complex model, we would expect it to benefit the most from this.
And this is what we observe: the sequential dialogue encoder does best on domain classification; it is not the best on slot F1, but it is almost the best model on intent classification.
So here is an example; it's a somewhat degenerate example, but it tries to illustrate what's happening. We just look at the attention distributions and try to figure out what the models are doing. This is a dialogue from the test set, and the out-of-vocabulary tokens are in boldface; all of these are out of vocabulary for the dataset because the entity sets are non-overlapping. You can see that the last three utterances have a lot of out-of-vocabulary tokens. If you look at the memory network's attention distribution, you notice that its focus is almost entirely on the user utterance where the user asks for a Brazilian restaurant, whereas the sequential dialogue encoder's focus is spread equally over the last two utterances.
And I should make it clear: the utterance that we are trying to understand is the final one at the bottom, uttered by the user. So the goal here is to identify that the domain is find-restaurant and to identify the slot values the user provides. What we observe is that the encoder-decoder model fails to identify either the domain or the slots. The memory network correctly identifies the restaurant domain, because it is focusing on the utterance where the user says they want a restaurant, but it fails to incorporate context from the previous system utterance, where the system is offering a result to the user, and it is unable to identify the slots. The sequential dialogue encoder, on the other hand, successfully combines context from multiple utterances and recognizes both the domain and the slots.
OK, I think that's it. Thanks a lot for listening. Questions?
Thanks, I have two questions. The first one: as a byproduct of what you're doing, you get a memory representation of the context; you have the whole dialogue history. I'm wondering whether you've considered, since you have access to the simulated user, training a policy using this representation, because it's very similar to belief tracking in traditional architectures. So the question is: instead of doing a modular thing, could you just have the same model do the whole thing end-to-end?
So, that's a very interesting suggestion, because we have some people running experiments on this; this is something we're looking into.
Because I think the problem usually with such an uninterpretable representation is that when you pick an action, say a confirm, you don't know which slot to confirm; but at the same time you have the semantics here, so maybe you can make it usable.
I think by carefully designing the semantics that we're using, we can alleviate or remove that ambiguity. For example, instead of having a single confirm action, you could have a confirm-slot action, and then have the model predict, based on the context, which slot the user is trying to confirm; that would remove some of the ambiguity.
I also have a question. Can you go back to the second-to-last slide, where you had the Brazilian restaurant? I wanted to ask two questions about this example. First, I thought you said you train on a synthetic dataset where you combine domains, right? So do you now consider the restaurant domain to be out of domain at this point? That's the first one. And second, how would you deal with something that is truly out of domain, like "the weather is nice" or "today I'm grumpy" or whatever, in a different way from these utterances, which are still task-related even if not seen in the data?
So, for the first question: this domain is not out of domain, because our system can handle both movie tickets and restaurants. Given an utterance, the system tries to keep track across the different domains, so it will see that this is a different-domain utterance and can handle it, even though the dialogue is multi-domain. It's out-of-vocabulary, not out-of-domain. For the second question: we do have a few out-of-domain utterances in this dataset, to which the system is supposed to say "I cannot handle that", but I think in our dataset there aren't enough of them, so we would definitely need more out-of-domain data to be able to successfully handle out-of-domain utterances.
Right. And my second question: do you use delexicalization of the input, or do you not?
No, we don't use any delexicalization; this is basically the model that does the delexicalization, effectively. This model will identify the entities that we're trying to delexicalize. Because if you use a gazetteer or database-based approach to delexicalize, it doesn't scale to all the restaurants in the world; instead, this model tries to identify the entities based on context, based on the semantic annotations from the data collection.
Thank you very much.