So the next speaker can get set up for the presentation.
Okay everyone, my name is Sanchit, and I'm going to present our work on dialog state tracking with a neural reading comprehension approach. This is joint work with Shuyang, Abhishek, Tagyoung, and Dilek from the Amazon Alexa AI team in Sunnyvale, California.
So I'll first briefly introduce what the dialog state tracking problem is (I guess most of you already know it, but for completeness), then I'll talk about the motivation for our approach, go into the details of the architecture, show some results and ablation studies, and finally conclude with some error analysis.
So let's start with what the dialog state is. The dialog state is basically a condensed representation of the dialogue history: it represents what the user is interested in at any point in the conversation, and typically you represent the dialog state with slots and values.
So here, in the first turn, the user says that he needs to book a hotel in the east that has four stars, and this corresponds to a state with two slots, area and stars, together with their respective values.
The "hotel" prefix represents the domain the user is talking about, and why that matters will become more evident in a moment, because a conversation can involve multiple domains.
In the second turn, the agent responds by asking whether there is a price range the user prefers, and the user says it does not matter as long as the hotel has free wifi and parking. So the state gets updated with three new slots: parking and internet with the value yes, and price range with don't care. The slots from the previous turn simply get carried over.
In the next turn, the agent gives a recommendation and the user says that sounds good, and that he would also like a taxi to the hotel from Cambridge. Here we see that the slots corresponding to the hotel domain get carried over, but there are two new slots, departure and destination, corresponding to a new domain, taxi, which also get added to the dialog state.
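To make the slot-and-value bookkeeping concrete, here is a minimal sketch (illustrative only, not the authors' code) of how the state in this example evolves turn by turn; the domain and slot names just follow the example on the slide.

```python
# Dialog state as a mapping from (domain, slot) to value.
# Values from earlier turns are carried over unless the new turn updates them.

def update_state(state, turn_updates):
    """Return a new state: carry over old values, then apply this turn's updates."""
    new_state = dict(state)
    new_state.update(turn_updates)
    return new_state

state = {}  # every slot implicitly starts as "none"

# Turn 1: "I need to book a hotel in the east that has four stars."
state = update_state(state, {("hotel", "area"): "east",
                             ("hotel", "stars"): "4"})

# Turn 2: "Price doesn't matter, as long as it has free wifi and parking."
state = update_state(state, {("hotel", "parking"): "yes",
                             ("hotel", "internet"): "yes",
                             ("hotel", "price range"): "don't care"})

# Turn 3: "Sounds good. I'd also like a taxi to the hotel from Cambridge."
state = update_state(state, {("taxi", "departure"): "cambridge",
                             ("taxi", "destination"): "the hotel"})

print(state)  # hotel slots carried over; taxi slots added
```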
So now, what is the task of dialog state tracking? Dialog state tracking basically means you want to predict the dialog state of the user: at every turn you are given the dialogue history plus the current user utterance, and you want to predict a distribution over dialog states. We saw that the dialog state is typically represented as slots and values, so this means state trackers output a distribution over the slots and their associated values. The dialogue context typically consists of features like past user utterances and past system responses; it can also include the previous belief state, or even the NLU interpretation if that is available. So this is the task.
Now I want to talk briefly about the traditional approaches to state tracking. One common approach is to encode the dialogue history with some model architecture, put a linear-plus-softmax layer on top, and output a distribution over the vocabulary of each slot type; you do this for every slot in your schema to get the full dialog state. For example, here you see a popular joint state tracking model where they encode the dialogue history with a hierarchical LSTM and then, on top of the hidden representation of the context, they have a feed-forward layer for each slot type, followed by a softmax layer that outputs a distribution over the values that particular slot can take. And these are only the values that have been seen in the training set.
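As a rough sketch of this kind of fixed-ontology tracker (not the exact model on the slide), here is a per-slot classification head over a fixed value vocabulary; the encoder, dimensions, and slot names are placeholders.

```python
import torch
import torch.nn as nn

class FixedVocabTracker(nn.Module):
    """One softmax classifier per slot over that slot's value vocabulary (values seen in training)."""
    def __init__(self, context_dim, slot_vocab_sizes):
        super().__init__()
        # slot_vocab_sizes: dict mapping slot name -> number of candidate values
        self.heads = nn.ModuleDict({
            slot: nn.Linear(context_dim, n_values)
            for slot, n_values in slot_vocab_sizes.items()
        })

    def forward(self, context_vec):
        # context_vec: (batch, context_dim) encoding of the dialogue history,
        # e.g. the last hidden state of a (hierarchical) LSTM encoder.
        return {slot: torch.softmax(head(context_vec), dim=-1)
                for slot, head in self.heads.items()}

# Usage with dummy data: a value not in the training vocabulary can never be predicted.
tracker = FixedVocabTracker(context_dim=128,
                            slot_vocab_sizes={"hotel-area": 7, "hotel-stars": 6})
probs = tracker(torch.randn(2, 128))
print({slot: p.shape for slot, p in probs.items()})
```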
This brings us to the two main problems with such approaches. One is that they cannot handle out-of-vocabulary slot value mentions, because they only output a distribution over values seen in the training set; such approaches assume the vocabulary, or ontology, is known in advance. The second is that they do not scale well for slots with a large vocabulary. For example, for a slot like a place name, you can imagine that the slot can take values from a possibly very large set, so there is not enough data to learn a good distribution over this large vocabulary.
On the other hand, reading comprehension approaches typically do not rely on a fixed vocabulary. This is because reading comprehension is usually structured as extractive question answering, where the goal is to find a span of tokens in the passage that constitutes the answer, so there is no fixed vocabulary. The second thing is that there have been a lot of recent advances in reading comprehension that we can leverage if we structure our state tracking problem as reading comprehension. This is what led us to propose this reading comprehension approach for dialog state tracking, which I'll describe in the next slides.
Before I go into exactly how we formulate the problem, I also want to give a very brief overview of how machine reading comprehension problems are typically posed. The general idea is that you are given a question and a passage, and you are looking for a span of tokens in the passage that can serve as the answer; this is extractive question answering. The way people usually do it is: you encode the passage to get a representation of each token, you encode the question to get a question representation, and on top you generally have two attention heads attending from the question to each token in the passage. One attention head represents the start probability distribution and the other represents the end probability distribution. Once you have these two distributions, you just output, at test time, the most probable span, and that is your answer. The slide shows a popular architecture from Microsoft that at one point was ranked first on the leaderboard; it uses a bunch of self-attention and recurrent layers to encode the passage tokens. But the general idea is the same: you encode the passage and the question, and then you have attention for predicting the start and end of the span.
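Here is a minimal sketch of that generic start/end span prediction (a toy stand-in, not any particular published architecture): passage tokens and the question each get encoded, and two attention-style heads give start and end distributions over passage tokens.

```python
import torch
import torch.nn as nn

class ToySpanQA(nn.Module):
    """Toy extractive QA head: score each passage token as span start / end given a question vector."""
    def __init__(self, dim):
        super().__init__()
        self.start_proj = nn.Linear(dim, dim)   # "start" attention head
        self.end_proj = nn.Linear(dim, dim)     # "end" attention head

    def forward(self, passage_tokens, question_vec):
        # passage_tokens: (seq_len, dim) encoded passage; question_vec: (dim,) encoded question
        start_logits = passage_tokens @ self.start_proj(question_vec)   # (seq_len,)
        end_logits = passage_tokens @ self.end_proj(question_vec)
        return start_logits.softmax(-1), end_logits.softmax(-1)

def best_span(p_start, p_end, max_len=10):
    """Pick the most probable (start, end) pair with start <= end, as done at test time."""
    best, best_score = (0, 0), -1.0
    for i in range(len(p_start)):
        for j in range(i, min(i + max_len, len(p_end))):
            if p_start[i] * p_end[j] > best_score:
                best, best_score = (i, j), p_start[i] * p_end[j]
    return best

qa = ToySpanQA(dim=64)
p_start, p_end = qa(torch.randn(20, 64), torch.randn(64))
print(best_span(p_start.tolist(), p_end.tolist()))
```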
So now let's look at how we formulate the dialog state tracking problem as reading comprehension. This is the same dialogue as before: the user is looking for a hotel, and after the second turn you want to predict the values for each of these slots, hotel area, hotel price range, and so on. The reading comprehension formulation looks like this: the whole dialogue context, the alternating agent and user turns, becomes the passage, and the questions are things like "what is the requested hotel area?", that is, what is the requested value of the slot you want to track, or "is parking required in the hotel?", and so on. Then what you want to find is the answer to these questions. For the first question you look for the answer in the passage and the model should point to something like "east"; similarly, for the second question, if you are looking for the hotel rating, the model should point to the span of tokens "four stars". So it is as simple as that.
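A minimal sketch of this formulation (with made-up helper names) is below: the concatenated turns are the passage, each domain-slot pair plays the role of a question, and the answer is a token span when the value is actually mentioned in the text.

```python
def make_rc_examples(turns, slot_values):
    """Turn a dialogue into (passage_tokens, question, answer_span) triples.

    turns: list of (speaker, utterance) pairs so far.
    slot_values: dict mapping (domain, slot) -> value string to track at this turn.
    """
    passage_tokens = []
    for speaker, utterance in turns:
        passage_tokens += [speaker + ":"] + utterance.lower().split()

    examples = []
    for (domain, slot), value in slot_values.items():
        value_tokens = value.lower().split()
        span = None
        for i in range(len(passage_tokens) - len(value_tokens) + 1):
            if passage_tokens[i:i + len(value_tokens)] == value_tokens:
                span = (i, i + len(value_tokens) - 1)  # inclusive start/end indices
        # span stays None for values like "yes" / "none" / "don't care" that never appear
        # verbatim in the text; those are the cases handled by the auxiliary models later.
        examples.append({"passage": passage_tokens,
                         "question": (domain, slot),
                         "answer_span": span})
    return examples

turns = [("user", "I need to book a hotel in the east that has 4 stars")]
print(make_rc_examples(turns, {("hotel", "area"): "east", ("hotel", "stars"): "4"}))
```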
Now let me describe how we represent the different components. The dialogue history, which is the passage in our formulation, is represented as the concatenation of the agent and user turns so far. It can be either a flat one-dimensional representation or, if you arrange the turns as a matrix, a hierarchical representation, and you can use your favourite encoder to encode it. The slot, which is the question in our formulation, is a domain-plus-slot embedding; we include the domain because, as we saw in the earlier example, the dialog state can span multiple domains. We have a fixed-dimensional vector for each domain-slot combination, and it is learned along with the full model. One thing to note here is that, unlike some prior work, we don't actually convert the slot into a full natural-language question; we just treat the embedding of the slot plus domain as the question itself. And finally, the answer is just the start and end position in the conversation.
"'kay" so this is the main model in our approach is quite the slots and
model
which is just like a typical extract if you're model what it does it predicts
the slot values this panel to consider in the dialogue the you have starting point
does and the starting spend a lot to bilinear tension between the dialogue context and
the slot invading
just like reading completion models and example shown here is
the same dialogue proposed on the uses a user wants to book a hotel in
these four stars so after the first and if you want to track this not
wouldn't hotel at a so in this case will assume that our model outputs a
start and probability which is high for the eight token in the context which represents
basically they down south east
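Here is a minimal sketch of the span model as just described: the learned domain-plus-slot embedding acts as the question, and the start and end pointers come from bilinear attention between that embedding and each token of the dialogue context. Dimensions and parameter names are illustrative, not the paper's exact code.

```python
import torch
import torch.nn as nn

class SlotSpanModel(nn.Module):
    """Start/end pointers over dialogue tokens via bilinear attention with the slot embedding."""
    def __init__(self, n_slots, token_dim, slot_dim):
        super().__init__()
        self.slot_emb = nn.Embedding(n_slots, slot_dim)   # one learned vector per domain-slot pair
        self.start_bilinear = nn.Parameter(torch.randn(slot_dim, token_dim) * 0.01)
        self.end_bilinear = nn.Parameter(torch.randn(slot_dim, token_dim) * 0.01)

    def forward(self, token_reps, slot_id):
        # token_reps: (seq_len, token_dim) contextual representations of the dialogue tokens
        q = self.slot_emb(slot_id)                               # (slot_dim,)
        start_logits = token_reps @ (self.start_bilinear.T @ q)  # (seq_len,)
        end_logits = token_reps @ (self.end_bilinear.T @ q)
        return start_logits, end_logits

model = SlotSpanModel(n_slots=37, token_dim=256, slot_dim=128)
start, end = model(torch.randn(40, 256), torch.tensor(3))
print(start.argmax().item(), end.argmax().item())
```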
Okay, but this model alone is not sufficient, and this is true for question answering in general as well. Certain slots take values from a closed set, like parking and internet with yes/no, so we need to account for that. There are also slots that can take the value don't care, for example price range in the previous example. And many of the slots in the schema are never mentioned in the conversation at all, so you need to fill them with the default none value. These are the cases that cannot be handled directly by the span model. To handle them, we augment our QA model with two auxiliary models: a slot carryover model and a slot type model. The carryover model predicts whether we should update a slot's value in the current turn or carry over its value from the previous turn; at the beginning of the dialogue every slot is initialized to the default none value. The type model is a simple classifier that makes a decision among four classes: yes, no, don't care, or span type.
Now, going into the details of these two models. The carryover model, as I said, predicts whether the slot value for the current turn should be updated or carried over, and it makes this binary decision for all the slots jointly at each turn. One thing I want to clarify is that the name "carryover model" is a bit confusing, because what it really is, is a slot update model: a 1 means you want to update the slot and a 0 means you want to carry it over. I keep this convention because that is what we use in the paper. So here, when you go from the first turn to the second turn, the user has mentioned three new slots, internet, parking, and price range, so those slots get updated with their values, a decision of 1, while area and stars get a 0 because they are just carried over from the previous turn.
The type model is simple: it predicts the slot type given the question, which is the slot, and the dialogue context, and it makes a four-way decision among yes, no, don't care, and span. A simple example: hotel-area in this context would be span type, because you want to find the value "east" in the context, while for the slot hotel-parking the value would just be yes, so that is the answer the type model should output.
Okay, so putting all this together, this is the combined model. At the bottom we have word embeddings that cover the tokens in the passage. Next we have a contextual encoding, which is basically a bidirectional LSTM; we just use a single bidirectional LSTM layer, and this gives us the contextual representation for each of the tokens. We use the last hidden state of the LSTM as the embedding of the whole dialogue. We embed the question using just the slot-plus-domain embedding, which is randomly initialized and learned along with the model. The dialogue embedding vector is then used to predict the slot carryover decisions, so we have a feed-forward layer on top of it that makes the binary decision for each of the slots. The slot type model takes as input the dialogue embedding vector along with the question vector and makes a softmax prediction over the four classes. And finally the span model takes the question vector and computes attention from the question to each of the tokens in the passage, just like any QA model, giving the start and end span predictions.
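Putting the pieces from the last few slides together, here is a rough sketch of the combined architecture as described: word embeddings, a single bidirectional LSTM whose final output is the dialogue embedding, a carryover head that makes the per-slot binary decisions jointly, a four-way type head over the dialogue and slot embeddings, and the bilinear span head from the earlier sketch. All sizes and names are illustrative assumptions, not the published implementation.

```python
import torch
import torch.nn as nn

class DSTReaderSketch(nn.Module):
    """Sketch of the combined model: shared encoder plus carryover, type, and span heads."""
    def __init__(self, vocab_size, n_slots, emb_dim=100, hidden_dim=128, slot_dim=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.slot_emb = nn.Embedding(n_slots, slot_dim)           # learned domain+slot "question" vectors
        dlg_dim = 2 * hidden_dim
        self.carryover_head = nn.Linear(dlg_dim, n_slots)         # one update-vs-carryover logit per slot
        self.type_head = nn.Linear(dlg_dim + slot_dim, 4)         # yes / no / don't care / span
        self.start_bilinear = nn.Parameter(torch.randn(slot_dim, dlg_dim) * 0.01)
        self.end_bilinear = nn.Parameter(torch.randn(slot_dim, dlg_dim) * 0.01)

    def forward(self, token_ids, slot_id):
        # token_ids: (1, seq_len) dialogue history; slot_id: which domain-slot pair to track
        token_reps, _ = self.encoder(self.word_emb(token_ids))    # (1, seq_len, 2*hidden)
        dlg_vec = token_reps[:, -1, :]                            # last time-step output as dialogue embedding
        q = self.slot_emb(slot_id).unsqueeze(0)                   # (1, slot_dim)
        carryover_logits = self.carryover_head(dlg_vec)           # joint binary decisions for all slots
        type_logits = self.type_head(torch.cat([dlg_vec, q], dim=-1))
        start_logits = (token_reps @ (self.start_bilinear.T @ q.squeeze(0))).squeeze(0)
        end_logits = (token_reps @ (self.end_bilinear.T @ q.squeeze(0))).squeeze(0)
        return carryover_logits, type_logits, start_logits, end_logits

model = DSTReaderSketch(vocab_size=5000, n_slots=37)
outputs = model(torch.randint(0, 5000, (1, 60)), torch.tensor(5))
print([t.shape for t in outputs])
```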
At inference time, what happens is we begin with the slot carryover model. If the carryover model outputs 0, we just carry over the slot value from the previous turn. If it outputs 1, which means we want to update the slot, then we invoke the type model. If the type model says yes, no, or don't care, we update the slot value with that. If it says span, then we invoke the span model to get the start and end positions of the slot value, and we extract that span from the conversation and update the slot value with it.
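The inference procedure just described can be sketched as a simple cascade; this is an illustrative rendering under assumed names, where the three arguments stand in for the outputs of the carryover, type, and span models.

```python
def track_slot(prev_value, carryover_update, slot_type, dialogue_tokens, span):
    """Decide a slot's value for the current turn, following the cascade described above.

    carryover_update: bool, True means "update this slot now" (the label 1 in the talk).
    slot_type: one of "yes", "no", "dontcare", "span" from the type model.
    span: (start, end) token indices from the span model (inclusive).
    """
    if not carryover_update:
        return prev_value                      # carry the value over unchanged
    if slot_type in ("yes", "no", "dontcare"):
        return slot_type                       # closed-set values handled by the type model
    start, end = span
    return " ".join(dialogue_tokens[start:end + 1])   # extract the span text as the new value

tokens = "user : i need a hotel in the east with 4 stars".split()
print(track_slot("none", True, "span", tokens, (8, 8)))   # -> "east"
print(track_slot("east", False, "span", tokens, (0, 0)))  # -> "east" (carried over)
```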
Okay, so, evaluation. Like everyone else recently, we evaluate on the MultiWOZ dataset. It is a human-human dialogue collection with roughly 3.4 thousand single-domain and 7 thousand multi-domain dialogues. It has annotations for the dialog state and the system acts; we don't use the user dialogue acts in this model. Some statistics: it has about 10,400 dialogues and about 115,000 turns, with an average of about 13.5 turns per dialogue, and the total number of slots we are tracking is 37, across six domains.
Now, some results. Before that, the metric: it is joint goal accuracy, which basically means that at every turn you want to predict all the slots correctly; if any slot is wrong, the accuracy for that turn is zero, otherwise it is one. So it is a strict metric.
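For reference, a minimal sketch of how joint goal accuracy is computed: a turn only counts as correct if every slot in the predicted state matches the gold state exactly.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns where ALL slot values match the gold state exactly."""
    correct = sum(1 for pred, gold in zip(predicted_states, gold_states) if pred == gold)
    return correct / len(gold_states)

gold = [{"hotel-area": "east"}, {"hotel-area": "east", "hotel-stars": "4"}]
pred = [{"hotel-area": "east"}, {"hotel-area": "east", "hotel-stars": "3"}]
print(joint_goal_accuracy(pred, gold))  # 0.5: one wrong slot makes the whole turn wrong
```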
So, the rows here: the first number is from the original MultiWOZ paper. GLAD and GCE are models that have been around for a while; GLAD uses global and per-slot local self-attentive trackers, and GCE is a simplified version of GLAD, so those are the next two numbers. Then there is the joint state tracking model I showed before, which encodes the dialogue history with a hierarchical LSTM and has a feed-forward layer for each slot type; that number is about 38. Our approach with a single model beats all of these, and we also trained an ensemble model, which basically just takes a majority vote among three models trained with three different seeds.
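The ensemble mentioned here is just a per-slot majority vote over models trained with different seeds; a small sketch:

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: list of per-model predicted values for one slot; returns the most common value."""
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(["east", "east", "centre"]))  # -> "east"
```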
And finally, we also wanted to check how well this works if we combine our approach with a closed-vocabulary approach, like the joint state tracking model. The way we combine them is very simple: for each slot, we choose one of the two approaches based on whichever one is better for that particular slot on the dev set. And this gives us a considerable boost, of about five percent.
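That combination scheme can be sketched like this (tracker names and numbers are made up for illustration): for every slot, pick whichever of the two trackers has the higher dev-set accuracy, and use that tracker for that slot at test time.

```python
def choose_tracker_per_slot(dev_accuracy_reader, dev_accuracy_ontology):
    """Both arguments map slot name -> dev-set accuracy; returns slot -> chosen tracker name."""
    return {slot: ("reader" if dev_accuracy_reader[slot] >= dev_accuracy_ontology[slot]
                   else "ontology")
            for slot in dev_accuracy_reader}

choice = choose_tracker_per_slot({"hotel-name": 0.62, "hotel-parking": 0.95},
                                 {"hotel-name": 0.71, "hotel-parking": 0.90})
print(choice)  # {'hotel-name': 'ontology', 'hotel-parking': 'reader'}
```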
To see why this happens, we did some ablation and oracle studies. The first and most important one: these ablations are for the single model, not for the combination with the closed-vocabulary tracker. If we feed the ground truth to all three models, the carryover, slot type, and span models, we get a joint goal accuracy of 73 on this dataset, which basically means our approach is upper-bounded at 73. The reason is that a significant fraction of slot values are not even present verbatim in the conversation. An example would be something like this: the context says that multiple sports attractions are available in the centre of town, and you want to fill the slot attraction-type. If the ground-truth answer is "multiple sports", our model will never get it exactly right; even if the model points to "sports", that span is not the same string as the ground truth "multiple sports". So this is why we are upper-bounded around 73, and it is also the reason why, when we combine our approach with the closed-vocabulary approach, which is based more on the ontology, we get some boost.
Another ablation gave us about a two percent gain. Then we did oracle experiments for each of the model types. If we replace just the slot type model with the ground truth, on top of this already-trained model, we don't get much gain, about half a percent to one percent. If we replace the slot span model with the ground truth, we get about a four percent gain. But if we replace the slot carryover model with the ground truth, we get about a twenty percent gain. So, as you can see, the carryover model is the bottleneck here. This is also evident if you look at the accuracy of each individual model: the type and span models are at around ninety to ninety-five percent, which is pretty high, but the carryover model only has around seventy to seventy-six percent turn-level accuracy. So this gives a direction for future work: we probably want to improve this carryover model.
We also analyzed how the performance varies with the depth of the conversation, and there is a steady decrease in performance as the conversation gets deeper; this is because of error propagation from the carryover model.
Finally, we did some error analysis. We took about two hundred error samples and bucketed them into four different categories. The first and biggest category is what we call unanswerable slot errors. These are the errors made by our slot carryover, or update, model, and there are two cases here. In the first case the reference is none and the hypothesis is not none, which basically means the reference says we should keep the none value from the previous turn but the model decides to update it. The second case is the opposite. Regarding the first case: even though this is the bulk of the errors, around forty-two percent, when we looked at them manually, many of them are not real errors. The model is making a prediction that is actually correct, but there is a lot of annotation noise in the dataset, because sometimes the annotated states are updated one turn late. Because of this, a lot of these predictions get counted as errors, although a lot of them are not really errors. In the second case, where the ground truth is to update the slot value while our model predicts to just carry it over from the previous turn, there are some real errors.
For example, here you can see the user is trying to book a restaurant in the centre part of town, and eventually the agent is able to make the reservation; then the user says that he also needs an attraction near the restaurant. So here, when you want to fill a slot such as attraction-area, the model says it should be none, which basically means it has not been mentioned. But as you can see, the user says it should be near the restaurant, so the value should be carried over from the previous domain, and our model is unable to do that.
The next category is what we call incorrect reference errors, which basically means there are multiple possible candidates in the context but our model picks the wrong one. In this example, the user is trying to book a hotel for four people, the agent responds that the booking was unsuccessful, and the user then asks for eight people instead. The ground truth is eight, of course, but our model predicts four. We see a lot of this happening where there is a restatement, as in this case, or where the user changes their mind; our model is not robust to these kinds of things. A possible reason is that the model overfits to a particular entity, like "four", which appears more often in the training set. This accounts for about twenty percent of the errors.
The next category is what we call slot resolution errors. Here the context says something like "I want to leave the hotel by two thirty"; the model points to "two thirty", but the ground truth is actually "15:30". This is somewhat expected, because we only do pointing into the context, so these are more of a normalization issue; they account for about thirty percent of the errors. The final category is slot boundary errors, where the span model makes a mistake: it picks a span which is either a superset or a subset of the reference. In this example the reference is just part of what the model guessed, with the model including extra surrounding words like "city centre". But this is only a small portion, around two percent of the errors.
Finally, I also want to add one slide. The numbers I showed were state of the art when we submitted, but since then there has been a new paper, TRADE, "Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems". There they use a pointer-generator network to combine a fixed vocabulary with a distribution over the dialogue history, and they get slightly better accuracy than our model combined with the joint state tracker. The key difference between their approach and ours is that they use a decoder to generate the value token by token, whereas we just use two pointers to mark the start and end of the span. That's all I had. I want to thank you, and I can take questions.
Okay, so we have time for questions.
Thank you for the talk. My question is: when you're considering the different types, yes, no, don't care, and span, there is potentially another case, right? That is when the user doesn't really say the value of the slot but it can be inferred. For instance, the agent asks "what cuisine type do you want?" and the user says "I want some pizza tonight". A classifier could infer that the value for the cuisine type should be Italian, but the user never said "Italian", so the span would not cover that case, right?
So your scenario is the user says "I want some pizza tonight" or something like that. Okay, that's true, those cases are not covered here, because we are only doing pointing. Our type model would probably output span, because it's not one of the other types, and the span model would probably point to "pizza", but we would fail, just like in the other error cases I showed. We do have a future direction where we can take inspiration from reading comprehension: you can do more abstractive question answering, where you use the span as a rationale and then have a generative model that produces the value, which here would be Italian, grounded on that selection. We can do that in the future.
okay
Thanks for the great talk, just one simple question. If I give you a sentence like "I want to go from Cambridge to London", then the departure and the destination are both place values, right? Can your model do better than baseline systems on this kind of case? Because you are still tracking slot by slot, so how does the model know that the destination is London and not Cambridge?
I see. So, because we feed in the whole context, the model can learn cues like "from" and "to" from the context.
But since you predict a span for each slot, isn't it possible that both slots get the same prediction, that they both mark London?

No, because we also feed in the slot, right, so destination versus source.

No, I mean, maybe it's on the final slide, when you are predicting the span...

We also have a question vector, right? It would be either the destination or the source embedding, and based on that the span model can infer which value it should point to.

But the question vector is a user-query embedding, so for those two slots, is it the same user query?

No, it would be different; it would be either the destination embedding or the source embedding.

So it does consider the slot information. Yes. Okay, so the question is actually the slot embedding; then you might be able to tell them apart. Okay, cool, thanks.
Other questions?
Maybe a provocative question, but we have heard many papers about dialog state tracking, and in particular on this particular corpus. So my question is: what do you think we need to take it to the next level, where we don't just talk about going from Cambridge to London or looking for a Chinese restaurant tonight?
So, if you mean specifically improving on this dataset, then to be honest, having experimented with it, I found that there are a lot of errors in it, especially with respect to the dialog state annotations. So if you are just trying to improve on this dataset, it's not a great idea, because we won't even know whether we are actually doing better or not. There are newer datasets, like the DSTC 8 ones, that we can look into to see whether approaches do better there. But otherwise, I feel people have now begun to do more end-to-end approaches, where you don't even need the state explicitly, it's more implicit, but then that again runs into the same debate of to pipeline or not to pipeline. So I don't have a good answer.
Any other questions?
I have one question. Have you considered the way you do evaluation? I'm not sure if the carryover is causing a problem in the evaluation: if you carry over previous slot values, you are sort of propagating errors to the next turns. If you had another metric, like a slot update rate or something like that, would it be possible to evaluate your models more accurately, where a slot update is treated separately?
I see your point. So the number I gave, the 76 percent, is more of a turn-level accuracy: for a particular turn, did the carryover model predict everything correctly. Something more like precision and recall for the update decision would indeed be better. I didn't put it here exactly, but these two error rates give you an idea: you can think of the first one as more of a precision error of the slot update model, where the model predicts that we should update but the ground truth says not to update, and the second one as more of a recall error. The statistics are correlated, though. I don't remember the exact slot-level numbers, something like 80 and 84 percent, and that slot-level number is actually somewhat inflated. It is more meaningful to look at the turn level, because you want all the slots at a turn to be correct; the eventual goal is joint goal accuracy, where you want all the slots to be correctly predicted.
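To make that metric discussion concrete, here is a sketch of slot-update precision and recall over the carryover decisions; this is the questioner's suggested alternative view, not a number reported in the paper.

```python
def update_precision_recall(predicted_updates, gold_updates):
    """Both arguments are lists of 0/1 decisions per slot-turn (1 = update, 0 = carry over)."""
    tp = sum(p and g for p, g in zip(predicted_updates, gold_updates))
    fp = sum(p and not g for p, g in zip(predicted_updates, gold_updates))
    fn = sum((not p) and g for p, g in zip(predicted_updates, gold_updates))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(update_precision_recall([1, 0, 1, 1], [1, 0, 0, 1]))  # (0.666..., 1.0)
```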
Okay, and one important thing is also that we trained the carryover model jointly over all the slots, not per slot. This is important because when we did it per slot we could not get good performance: the update examples, particularly for the carryover model, are highly imbalanced. You can imagine that the number of updates is very small, and most of the time a slot just gets carried over. So if you train it per slot directly, you won't have enough signal for the updates and you will just get biased training.
Okay, it's about time, so let's thank the speaker again.