So the next speaker can get set up for the presentation.
Okay everyone, my name is Sanchit, and I'm going to present our work on dialog state tracking with a neural reading comprehension approach. This is joint work with Shuyang, Abhishek, Tagyoung, and Dilek from the Amazon Alexa AI team in Sunnyvale, California.
So I'll first briefly introduce what the dialog state tracking problem is (I guess most of you already know it, but for completeness), then I'll talk about the motivation for our approach, go into the details of the architecture, show some results and ablation studies, and finally conclude with some error analysis.
So let's start with what the dialog state is. The dialog state is basically a condensed representation of the dialogue history: it represents what the user is interested in at any point in the conversation, and typically you represent the dialog state with slots and values.
So here, in the first turn, the user says that he needs to book a hotel in the east that has four stars, and this corresponds to a state with two slots, area and stars, together with their respective values.
The "hotel" prefix represents the domain the user is talking about, and why that matters will become more evident in a moment, because a conversation can involve multiple domains.
In the second turn, the agent responds by asking whether there is a price range the user prefers, and the user says it does not matter as long as the hotel has free wifi and parking. So the state gets updated with three new slots: parking and internet with the value yes, and price range with don't care. The slots from the previous turn simply get carried over.
In the next turn, the agent gives a recommendation and the user says that sounds good, and that he would also like a taxi to the hotel from Cambridge. Here we see that the slots corresponding to the hotel domain get carried over, but there are two new slots, departure and destination, corresponding to a new domain, taxi, which also get added to the dialog state.
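To make the slot-and-value bookkeeping concrete, here is a minimal sketch (illustrative only, not the authors' code) of how the state in this example evolves turn by turn; the domain and slot names just follow the example on the slide.

```python
# Dialog state as a mapping from (domain, slot) to value.
# Values from earlier turns are carried over unless the new turn updates them.

def update_state(state, turn_updates):
    """Return a new state: carry over old values, then apply this turn's updates."""
    new_state = dict(state)
    new_state.update(turn_updates)
    return new_state

state = {}  # every slot implicitly starts as "none"

# Turn 1: "I need to book a hotel in the east that has four stars."
state = update_state(state, {("hotel", "area"): "east",
                             ("hotel", "stars"): "4"})

# Turn 2: "Price doesn't matter, as long as it has free wifi and parking."
state = update_state(state, {("hotel", "parking"): "yes",
                             ("hotel", "internet"): "yes",
                             ("hotel", "price range"): "don't care"})

# Turn 3: "Sounds good. I'd also like a taxi to the hotel from Cambridge."
state = update_state(state, {("taxi", "departure"): "cambridge",
                             ("taxi", "destination"): "the hotel"})

print(state)  # hotel slots carried over; taxi slots added
```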
So now, what is the task of dialog state tracking? Dialog state tracking basically means you want to predict the dialog state of the user: at every turn you are given the dialogue history plus the current user utterance, and you want to predict a distribution over dialog states. We saw that the dialog state is typically represented as slots and values, so this means state trackers output a distribution over the slots and their associated values. The dialogue context typically consists of features like past user utterances and past system responses; it can also include the previous belief state, or even the NLU interpretation if that is available. So this is the task.
Now I want to talk briefly about the traditional approaches to state tracking. One common approach is to encode the dialogue history with some model architecture, put a linear-plus-softmax layer on top, and output a distribution over the vocabulary of each slot type; you do this for every slot in your schema to get the full dialog state. For example, here you see a popular joint state tracking model where they encode the dialogue history with a hierarchical LSTM and then, on top of the hidden representation of the context, they have a feed-forward layer for each slot type, followed by a softmax layer that outputs a distribution over the values that particular slot can take. And these are only the values that have been seen in the training set.
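As a rough sketch of this kind of fixed-ontology tracker (not the exact model on the slide), here is a per-slot classification head over a fixed value vocabulary; the encoder, dimensions, and slot names are placeholders.

```python
import torch
import torch.nn as nn

class FixedVocabTracker(nn.Module):
    """One softmax classifier per slot over that slot's value vocabulary (values seen in training)."""
    def __init__(self, context_dim, slot_vocab_sizes):
        super().__init__()
        # slot_vocab_sizes: dict mapping slot name -> number of candidate values
        self.heads = nn.ModuleDict({
            slot: nn.Linear(context_dim, n_values)
            for slot, n_values in slot_vocab_sizes.items()
        })

    def forward(self, context_vec):
        # context_vec: (batch, context_dim) encoding of the dialogue history,
        # e.g. the last hidden state of a (hierarchical) LSTM encoder.
        return {slot: torch.softmax(head(context_vec), dim=-1)
                for slot, head in self.heads.items()}

# Usage with dummy data: a value not in the training vocabulary can never be predicted.
tracker = FixedVocabTracker(context_dim=128,
                            slot_vocab_sizes={"hotel-area": 7, "hotel-stars": 6})
probs = tracker(torch.randn(2, 128))
print({slot: p.shape for slot, p in probs.items()})
```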
This brings us to the two main problems with such approaches. One is that they cannot handle out-of-vocabulary slot value mentions, because they only output a distribution over values seen in the training set; such approaches assume the vocabulary, or ontology, is known in advance. The second is that they do not scale well for slots with a large vocabulary. For example, for a slot like a place name, you can imagine that the slot can take values from a possibly very large set, so there is not enough data to learn a good distribution over this large vocabulary.
On the other hand, reading comprehension approaches typically do not rely on a fixed vocabulary. This is because reading comprehension is usually structured as extractive question answering, where the goal is to find a span of tokens in the passage that constitutes the answer, so there is no fixed vocabulary. The second thing is that there have been a lot of recent advances in reading comprehension that we can leverage if we structure our state tracking problem as reading comprehension. This is what led us to propose this reading comprehension approach for dialog state tracking, which I'll describe in the next slides.
Before I go into exactly how we formulate the problem, I also want to give a very brief overview of how machine reading comprehension problems are typically posed. The general idea is that you are given a question and a passage, and you are looking for a span of tokens in the passage that can serve as the answer; this is extractive question answering. The way people usually do it is: you encode the passage to get a representation of each token, you encode the question to get a question representation, and on top you generally have two attention heads attending from the question to each token in the passage. One attention head represents the start probability distribution and the other represents the end probability distribution. Once you have these two distributions, you just output, at test time, the most probable span, and that is your answer. The slide shows a popular architecture from Microsoft that at one point was ranked first on the leaderboard; it uses a bunch of self-attention and recurrent layers to encode the passage tokens. But the general idea is the same: you encode the passage and the question, and then you have attention for predicting the start and end of the span.
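Here is a minimal sketch of that generic start/end span prediction (a toy stand-in, not any particular published architecture): passage tokens and the question each get encoded, and two attention-style heads give start and end distributions over passage tokens.

```python
import torch
import torch.nn as nn

class ToySpanQA(nn.Module):
    """Toy extractive QA head: score each passage token as span start / end given a question vector."""
    def __init__(self, dim):
        super().__init__()
        self.start_proj = nn.Linear(dim, dim)   # "start" attention head
        self.end_proj = nn.Linear(dim, dim)     # "end" attention head

    def forward(self, passage_tokens, question_vec):
        # passage_tokens: (seq_len, dim) encoded passage; question_vec: (dim,) encoded question
        start_logits = passage_tokens @ self.start_proj(question_vec)   # (seq_len,)
        end_logits = passage_tokens @ self.end_proj(question_vec)
        return start_logits.softmax(-1), end_logits.softmax(-1)

def best_span(p_start, p_end, max_len=10):
    """Pick the most probable (start, end) pair with start <= end, as done at test time."""
    best, best_score = (0, 0), -1.0
    for i in range(len(p_start)):
        for j in range(i, min(i + max_len, len(p_end))):
            if p_start[i] * p_end[j] > best_score:
                best, best_score = (i, j), p_start[i] * p_end[j]
    return best

qa = ToySpanQA(dim=64)
p_start, p_end = qa(torch.randn(20, 64), torch.randn(64))
print(best_span(p_start.tolist(), p_end.tolist()))
```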
So now let's look at how we formulate the dialog state tracking problem as reading comprehension. This is the same dialogue as before: the user is looking for a hotel, and after the second turn you want to predict the values for each of these slots, hotel area, hotel price range, and so on. The reading comprehension formulation looks like this: the whole dialogue context, the alternating agent and user turns, becomes the passage, and the questions are things like "what is the requested hotel area?", that is, what is the requested value of the slot you want to track, or "is parking required in the hotel?", and so on. Then what you want to find is the answer to these questions. For the first question you look for the answer in the passage and the model should point to something like "east"; similarly, for the second question, if you are looking for the hotel rating, the model should point to the span of tokens "four stars". So it is as simple as that.
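A minimal sketch of this formulation (with made-up helper names) is below: the concatenated turns are the passage, each domain-slot pair plays the role of a question, and the answer is a token span when the value is actually mentioned in the text.

```python
def make_rc_examples(turns, slot_values):
    """Turn a dialogue into (passage_tokens, question, answer_span) triples.

    turns: list of (speaker, utterance) pairs so far.
    slot_values: dict mapping (domain, slot) -> value string to track at this turn.
    """
    passage_tokens = []
    for speaker, utterance in turns:
        passage_tokens += [speaker + ":"] + utterance.lower().split()

    examples = []
    for (domain, slot), value in slot_values.items():
        value_tokens = value.lower().split()
        span = None
        for i in range(len(passage_tokens) - len(value_tokens) + 1):
            if passage_tokens[i:i + len(value_tokens)] == value_tokens:
                span = (i, i + len(value_tokens) - 1)  # inclusive start/end indices
        # span stays None for values like "yes" / "none" / "don't care" that never appear
        # verbatim in the text; those are the cases handled by the auxiliary models later.
        examples.append({"passage": passage_tokens,
                         "question": (domain, slot),
                         "answer_span": span})
    return examples

turns = [("user", "I need to book a hotel in the east that has 4 stars")]
print(make_rc_examples(turns, {("hotel", "area"): "east", ("hotel", "stars"): "4"}))
```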
Now let me describe how we represent the different components. The dialogue history, which is the passage in our formulation, is represented as the concatenation of the agent and user turns so far. It can be either a flat one-dimensional representation or, if you arrange the turns as a matrix, a hierarchical representation, and you can use your favourite encoder to encode it. The slot, which is the question in our formulation, is a domain-plus-slot embedding; we include the domain because, as we saw in the earlier example, the dialog state can span multiple domains. We have a fixed-dimensional vector for each domain-slot combination, and it is learned along with the full model. One thing to note here is that, unlike some prior work, we don't actually convert the slot into a full natural-language question; we just treat the embedding of the slot plus domain as the question itself. And finally, the answer is just the start and end position in the conversation.
"'kay" so this is the main model in our approach is quite the slots and
model
which is just like a typical extract if you're model what it does it predicts
the slot values this panel to consider in the dialogue the you have starting point
does and the starting spend a lot to bilinear tension between the dialogue context and
the slot invading
just like reading completion models and example shown here is
the same dialogue proposed on the uses a user wants to book a hotel in
these four stars so after the first and if you want to track this not
wouldn't hotel at a so in this case will assume that our model outputs a
start and probability which is high for the eight token in the context which represents
basically they down south east
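Here is a minimal sketch of the span model as just described: the learned domain-plus-slot embedding acts as the question, and the start and end pointers come from bilinear attention between that embedding and each token of the dialogue context. Dimensions and parameter names are illustrative, not the paper's exact code.

```python
import torch
import torch.nn as nn

class SlotSpanModel(nn.Module):
    """Start/end pointers over dialogue tokens via bilinear attention with the slot embedding."""
    def __init__(self, n_slots, token_dim, slot_dim):
        super().__init__()
        self.slot_emb = nn.Embedding(n_slots, slot_dim)   # one learned vector per domain-slot pair
        self.start_bilinear = nn.Parameter(torch.randn(slot_dim, token_dim) * 0.01)
        self.end_bilinear = nn.Parameter(torch.randn(slot_dim, token_dim) * 0.01)

    def forward(self, token_reps, slot_id):
        # token_reps: (seq_len, token_dim) contextual representations of the dialogue tokens
        q = self.slot_emb(slot_id)                               # (slot_dim,)
        start_logits = token_reps @ (self.start_bilinear.T @ q)  # (seq_len,)
        end_logits = token_reps @ (self.end_bilinear.T @ q)
        return start_logits, end_logits

model = SlotSpanModel(n_slots=37, token_dim=256, slot_dim=128)
start, end = model(torch.randn(40, 256), torch.tensor(3))
print(start.argmax().item(), end.argmax().item())
```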
Okay, but this model alone is not sufficient, and this is true for question answering in general as well. Certain slots take values from a closed set, like parking and internet with yes/no, so we need to account for that. There are also slots that can take the value don't care, for example price range in the previous example. And many of the slots in the schema are never mentioned in the conversation at all, so you need to fill them with the default none value. These are the cases that cannot be handled directly by the span model. To handle them, we augment our QA model with two auxiliary models: a slot carryover model and a slot type model. The carryover model predicts whether we should update a slot's value in the current turn or carry over its value from the previous turn; at the beginning of the dialogue every slot is initialized to the default none value. The type model is a simple classifier that makes a decision among four classes: yes, no, don't care, or span type.
Now, going into the details of these two models. The carryover model, as I said, predicts whether the slot value for the current turn should be updated or carried over, and it makes this binary decision for all the slots jointly at each turn. One thing I want to clarify is that the name "carryover model" is a bit confusing, because what it really is, is a slot update model: a 1 means you want to update the slot and a 0 means you want to carry it over. I keep this convention because that is what we use in the paper. So here, when you go from the first turn to the second turn, the user has mentioned three new slots, internet, parking, and price range, so those slots get updated with their values, a decision of 1, while area and stars get a 0 because they are just carried over from the previous turn.
The type model is simple: it predicts the slot type given the question, which is the slot, and the dialogue context, and it makes a four-way decision among yes, no, don't care, and span. A simple example: hotel-area in this context would be span type, because you want to find the value "east" in the context, while for the slot hotel-parking the value would just be yes, so that is the answer the type model should output.
Okay, so putting all this together, this is the combined model. At the bottom we have word embeddings that cover the tokens in the passage. Next we have a contextual encoding, which is basically a bidirectional LSTM; we just use a single bidirectional LSTM layer, and this gives us the contextual representation for each of the tokens. We use the last hidden state of the LSTM as the embedding of the whole dialogue. We embed the question using just the slot-plus-domain embedding, which is randomly initialized and learned along with the model. The dialogue embedding vector is then used to predict the slot carryover decisions, so we have a feed-forward layer on top of it that makes the binary decision for each of the slots. The slot type model takes as input the dialogue embedding vector along with the question vector and makes a softmax prediction over the four classes. And finally the span model takes the question vector and computes attention from the question to each of the tokens in the passage, just like any QA model, giving the start and end span predictions.
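Putting the pieces from the last few slides together, here is a rough sketch of the combined architecture as described: word embeddings, a single bidirectional LSTM whose final output is the dialogue embedding, a carryover head that makes the per-slot binary decisions jointly, a four-way type head over the dialogue and slot embeddings, and the bilinear span head from the earlier sketch. All sizes and names are illustrative assumptions, not the published implementation.

```python
import torch
import torch.nn as nn

class DSTReaderSketch(nn.Module):
    """Sketch of the combined model: shared encoder plus carryover, type, and span heads."""
    def __init__(self, vocab_size, n_slots, emb_dim=100, hidden_dim=128, slot_dim=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.slot_emb = nn.Embedding(n_slots, slot_dim)           # learned domain+slot "question" vectors
        dlg_dim = 2 * hidden_dim
        self.carryover_head = nn.Linear(dlg_dim, n_slots)         # one update-vs-carryover logit per slot
        self.type_head = nn.Linear(dlg_dim + slot_dim, 4)         # yes / no / don't care / span
        self.start_bilinear = nn.Parameter(torch.randn(slot_dim, dlg_dim) * 0.01)
        self.end_bilinear = nn.Parameter(torch.randn(slot_dim, dlg_dim) * 0.01)

    def forward(self, token_ids, slot_id):
        # token_ids: (1, seq_len) dialogue history; slot_id: which domain-slot pair to track
        token_reps, _ = self.encoder(self.word_emb(token_ids))    # (1, seq_len, 2*hidden)
        dlg_vec = token_reps[:, -1, :]                            # last time-step output as dialogue embedding
        q = self.slot_emb(slot_id).unsqueeze(0)                   # (1, slot_dim)
        carryover_logits = self.carryover_head(dlg_vec)           # joint binary decisions for all slots
        type_logits = self.type_head(torch.cat([dlg_vec, q], dim=-1))
        start_logits = (token_reps @ (self.start_bilinear.T @ q.squeeze(0))).squeeze(0)
        end_logits = (token_reps @ (self.end_bilinear.T @ q.squeeze(0))).squeeze(0)
        return carryover_logits, type_logits, start_logits, end_logits

model = DSTReaderSketch(vocab_size=5000, n_slots=37)
outputs = model(torch.randint(0, 5000, (1, 60)), torch.tensor(5))
print([t.shape for t in outputs])
```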
At inference time, what happens is we begin with the slot carryover model. If the carryover model outputs 0, we just carry over the slot value from the previous turn. If it outputs 1, which means we want to update the slot, then we invoke the type model. If the type model says yes, no, or don't care, we update the slot value with that. If it says span, then we invoke the span model to get the start and end positions of the slot value, and we extract that span from the conversation and update the slot value with it.
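The inference procedure just described can be sketched as a simple cascade; this is an illustrative rendering under assumed names, where the three arguments stand in for the outputs of the carryover, type, and span models.

```python
def track_slot(prev_value, carryover_update, slot_type, dialogue_tokens, span):
    """Decide a slot's value for the current turn, following the cascade described above.

    carryover_update: bool, True means "update this slot now" (the label 1 in the talk).
    slot_type: one of "yes", "no", "dontcare", "span" from the type model.
    span: (start, end) token indices from the span model (inclusive).
    """
    if not carryover_update:
        return prev_value                      # carry the value over unchanged
    if slot_type in ("yes", "no", "dontcare"):
        return slot_type                       # closed-set values handled by the type model
    start, end = span
    return " ".join(dialogue_tokens[start:end + 1])   # extract the span text as the new value

tokens = "user : i need a hotel in the east with 4 stars".split()
print(track_slot("none", True, "span", tokens, (8, 8)))   # -> "east"
print(track_slot("east", False, "span", tokens, (0, 0)))  # -> "east" (carried over)
```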
Okay, so, evaluation. Like everyone else recently, we evaluate on the MultiWOZ dataset. It is a human-human dialogue collection with roughly 3.4 thousand single-domain and 7 thousand multi-domain dialogues. It has annotations for the dialog state and the system acts; we don't use the user dialogue acts in this model. Some statistics: it has about 10,400 dialogues and about 115,000 turns, with an average of about 13.5 turns per dialogue, and the total number of slots we are tracking is 37, across six domains.
Now, some results. Before that, the metric: it is joint goal accuracy, which basically means that at every turn you want to predict all the slots correctly; if any slot is wrong, the accuracy for that turn is zero, otherwise it is one. So it is a strict metric.
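For reference, a minimal sketch of how joint goal accuracy is computed: a turn only counts as correct if every slot in the predicted state matches the gold state exactly.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns where ALL slot values match the gold state exactly."""
    correct = sum(1 for pred, gold in zip(predicted_states, gold_states) if pred == gold)
    return correct / len(gold_states)

gold = [{"hotel-area": "east"}, {"hotel-area": "east", "hotel-stars": "4"}]
pred = [{"hotel-area": "east"}, {"hotel-area": "east", "hotel-stars": "3"}]
print(joint_goal_accuracy(pred, gold))  # 0.5: one wrong slot makes the whole turn wrong
```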
So, the rows here: the first number is from the original MultiWOZ paper. GLAD and GCE are models that have been around for a while; GLAD uses global and per-slot local self-attentive trackers, and GCE is a simplified version of GLAD, so those are the next two numbers. Then there is the joint state tracking model I showed before, which encodes the dialogue history with a hierarchical LSTM and has a feed-forward layer for each slot type; that number is about 38. Our approach with a single model beats all of these, and we also trained an ensemble model, which basically just takes a majority vote among three models trained with three different seeds.
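The ensemble mentioned here is just a per-slot majority vote over models trained with different seeds; a small sketch:

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: list of per-model predicted values for one slot; returns the most common value."""
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(["east", "east", "centre"]))  # -> "east"
```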
And finally, we also wanted to check how well this works if we combine our approach with a closed-vocabulary approach, like the joint state tracking model. The way we combine them is very simple: for each slot, we choose one of the two approaches based on whichever one is better for that particular slot on the dev set. And this gives us a considerable boost, of about five percent.
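That combination scheme can be sketched like this (tracker names and numbers are made up for illustration): for every slot, pick whichever of the two trackers has the higher dev-set accuracy, and use that tracker for that slot at test time.

```python
def choose_tracker_per_slot(dev_accuracy_reader, dev_accuracy_ontology):
    """Both arguments map slot name -> dev-set accuracy; returns slot -> chosen tracker name."""
    return {slot: ("reader" if dev_accuracy_reader[slot] >= dev_accuracy_ontology[slot]
                   else "ontology")
            for slot in dev_accuracy_reader}

choice = choose_tracker_per_slot({"hotel-name": 0.62, "hotel-parking": 0.95},
                                 {"hotel-name": 0.71, "hotel-parking": 0.90})
print(choice)  # {'hotel-name': 'ontology', 'hotel-parking': 'reader'}
```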
To see why this happens, we did some ablation and oracle studies. The first and most important one: these ablations are for the single model, not for the combination with the closed-vocabulary tracker. If we feed the ground truth to all three models, the carryover, slot type, and span models, we get a joint goal accuracy of 73 on this dataset, which basically means our approach is upper-bounded at 73. The reason is that a significant fraction of slot values are not even present verbatim in the conversation. An example would be something like this: the context says that multiple sports attractions are available in the centre of town, and you want to fill the slot attraction-type. If the ground-truth answer is "multiple sports", our model will never get it exactly right; even if the model points to "sports", that span is not the same string as the ground truth "multiple sports". So this is why we are upper-bounded around 73, and it is also the reason why, when we combine our approach with the closed-vocabulary approach, which is based more on the ontology, we get some boost.
Another ablation gave us about a two percent gain. Then we did oracle experiments for each of the model types. If we replace just the slot type model with the ground truth, on top of this already-trained model, we don't get much gain, about half a percent to one percent. If we replace the slot span model with the ground truth, we get about a four percent gain. But if we replace the slot carryover model with the ground truth, we get about a twenty percent gain. So, as you can see, the carryover model is the bottleneck here. This is also evident if you look at the accuracy of each individual model: the type and span models are at around ninety to ninety-five percent, which is pretty high, but the carryover model only has around seventy to seventy-six percent turn-level accuracy. So this gives a direction for future work: we probably want to improve this carryover model.
We also analyzed how the performance varies with the depth of the conversation, and there is a steady decrease in performance as the conversation gets deeper; this is because of error propagation from the carryover model.
Finally, we did some error analysis. We took about two hundred error samples and bucketed them into four different categories. The first and biggest category is what we call unanswerable slot errors. These are the errors made by our slot carryover, or update, model, and there are two cases here. In the first case the reference is none and the hypothesis is not none, which basically means the reference says we should keep the none value from the previous turn but the model decides to update it. The second case is the opposite. Regarding the first case: even though this is the bulk of the errors, around forty-two percent, when we looked at them manually, many of them are not real errors. The model is making a prediction that is actually correct, but there is a lot of annotation noise in the dataset, because sometimes the annotated states are updated one turn late. Because of this, a lot of these predictions get counted as errors, although a lot of them are not really errors. In the second case, where the ground truth is to update the slot value while our model predicts to just carry it over from the previous turn, there are some real errors.
For example, here you can see the user is trying to book a restaurant in the centre part of town, and eventually the agent is able to make the reservation; then the user says that he also needs an attraction near the restaurant. So here, when you want to fill a slot such as attraction-area, the model says it should be none, which basically means it has not been mentioned. But as you can see, the user says it should be near the restaurant, so the value should be carried over from the previous domain, and our model is unable to do that.
The next category is what we call incorrect reference errors, which basically means there are multiple possible candidates in the context but our model picks the wrong one. In this example, the user is trying to book a hotel for four people, the agent responds that the booking was unsuccessful, and the user then asks for eight people instead. The ground truth is eight, of course, but our model predicts four. We see a lot of this happening where there is a restatement, as in this case, or where the user changes their mind; our model is not robust to these kinds of things. A possible reason is that the model overfits to a particular entity, like "four", which appears more often in the training set. This accounts for about twenty percent of the errors.
The next category is what we call slot resolution errors. Here the context says something like "I want to leave the hotel by two thirty"; the model points to "two thirty", but the ground truth is actually "15:30". This is somewhat expected, because we only do pointing into the context, so these are more of a normalization issue; they account for about thirty percent of the errors. The final category is slot boundary errors, where the span model makes a mistake: it picks a span which is either a superset or a subset of the reference. In this example the reference is just part of what the model guessed, with the model including extra surrounding words like "city centre". But this is only a small portion, around two percent of the errors.
Finally, I also want to add one slide. The numbers I showed were state of the art when we submitted, but since then there has been a new paper, TRADE, "Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems". There they use a pointer-generator network to combine a fixed vocabulary with a distribution over the dialogue history, and they get slightly better accuracy than our model combined with the joint state tracker. The key difference between their approach and ours is that they use a decoder to generate the value token by token, whereas we just use two pointers to mark the start and end of the span. That's all I had. I want to thank you, and I can take questions.
Okay, so we have time for questions.
Thank you for the talk. My question is: when you're considering the different types, yes, no, don't care, and span, there is potentially another case, right? That is when the user doesn't really say the value of the slot but it can be inferred. For instance, the agent asks "what cuisine type do you want?" and the user says "I want some pizza tonight". A classifier could infer that the value for the cuisine type should be Italian, but the user never said "Italian", so the span would not cover that case, right?
So your scenario is the user says "I want some pizza tonight" or something like that. Okay, that's true, those cases are not covered here, because we are only doing pointing. Our type model would probably output span, because it's not one of the other types, and the span model would probably point to "pizza", but we would fail, just like in the other error cases I showed. We do have a future direction where we can take inspiration from reading comprehension: you can do more abstractive question answering, where you use the span as a rationale and then have a generative model that produces the value, which here would be Italian, grounded on that selection. We can do that in the future.
okay
Thanks for the great talk, just one simple question. If I give you a sentence like "I want to go from Cambridge to London", then the departure and the destination are both place values, right? Can your model do better than baseline systems on this kind of case? Because you are still tracking slot by slot, so how does the model know that the destination is London and not Cambridge?
I see. So, because we feed in the whole context, the model can learn cues like "from" and "to" from the context.
But since you predict a span for each slot, isn't it possible that both slots get the same prediction, that they both mark London?

No, because we also feed in the slot, right, so destination versus source.

No, I mean, maybe it's on the final slide, when you are predicting the span...

We also have a question vector, right? It would be either the destination or the source embedding, and based on that the span model can infer which value it should point to.

But the question vector is a user-query embedding, so for those two slots, is it the same user query?

No, it would be different; it would be either the destination embedding or the source embedding.

So it does consider the slot information. Yes. Okay, so the question is actually the slot embedding; then you might be able to tell them apart. Okay, cool, thanks.
Other questions?
Maybe a provocative question, but we have heard many papers about dialog state tracking, and in particular on this particular corpus. So my question is: what do you think we need to take it to the next level, where we don't just talk about going from Cambridge to London or looking for a Chinese restaurant tonight?
So, if you mean specifically improving on this dataset, then to be honest, having experimented with it, I found that there are a lot of errors in it, especially with respect to the dialog state annotations. So if you are just trying to improve on this dataset, it's not a great idea, because we won't even know whether we are actually doing better or not. There are newer datasets, like the DSTC 8 ones, that we can look into to see whether approaches do better there. But otherwise, I feel people have now begun to do more end-to-end approaches, where you don't even need the state explicitly, it's more implicit, but then that again runs into the same debate of to pipeline or not to pipeline. So I don't have a good answer.
Any other questions?
I have one question. Have you considered the way you do evaluation? I'm not sure if the carryover is causing a problem in the evaluation: if you carry over previous slot values, you are sort of propagating errors to the next turns. If you had another metric, like a slot update rate or something like that, would it be possible to evaluate your models more accurately, where a slot update is treated separately?
I see your point. So the number I gave, the 76 percent, is more of a turn-level accuracy: for a particular turn, did the carryover model predict everything correctly. Something more like precision and recall for the update decision would indeed be better. I didn't put it here exactly, but these two error rates give you an idea: you can think of the first one as more of a precision error of the slot update model, where the model predicts that we should update but the ground truth says not to update, and the second one as more of a recall error. The statistics are correlated, though. I don't remember the exact slot-level numbers, something like 80 and 84 percent, and that slot-level number is actually somewhat inflated. It is more meaningful to look at the turn level, because you want all the slots at a turn to be correct; the eventual goal is joint goal accuracy, where you want all the slots to be correctly predicted.
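To make that metric discussion concrete, here is a sketch of slot-update precision and recall over the carryover decisions; this is the questioner's suggested alternative view, not a number reported in the paper.

```python
def update_precision_recall(predicted_updates, gold_updates):
    """Both arguments are lists of 0/1 decisions per slot-turn (1 = update, 0 = carry over)."""
    tp = sum(p and g for p, g in zip(predicted_updates, gold_updates))
    fp = sum(p and not g for p, g in zip(predicted_updates, gold_updates))
    fn = sum((not p) and g for p, g in zip(predicted_updates, gold_updates))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(update_precision_recall([1, 0, 1, 1], [1, 0, 0, 1]))  # (0.666..., 1.0)
```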
Okay, and one important thing is also that we trained the carryover model jointly over all the slots, not per slot. This is important because when we did it per slot we could not get good performance: the update examples, particularly for the carryover model, are highly imbalanced. You can imagine that the number of updates is very small, and most of the time a slot just gets carried over. So if you train it per slot directly, you won't have enough signal for the updates and you will just get biased training.
Okay, it's about time, so let's thank the speaker again.