Great, thanks for the introduction. I'm going to be talking about using dialogue context to improve language understanding performance in multi-domain dialogues.
So this is the outline of the talk: I'll give a brief background of the problem, then talk about the datasets and the model architectures, and then a data augmentation scheme and experiments.
So first, what is a goal-oriented dialogue system? In goal-oriented dialogue systems, the goal of the system is to help the user complete some task, as opposed to chat-based dialogue systems, where the user just wants to have a conversation and the system's goal is to keep the user engaged.
So this is a typical architecture for a goal-oriented dialogue system. It's composed of a pipeline of components. The first component is the language understanding module; it acts as an interface that takes incoming user utterances and transforms them into a semantic representation. The next component is the state tracker, which keeps track of a probability distribution over dialogue states over the course of the conversation. After that is the policy, which, depending on the dialogue state and the backend state, decides what action to take; that could be making a backend call, asking the user for some information, or informing the user of something. And the last component is the language generation module, which transforms the dialogue-act-based representation of the policy's output into natural language for the user.
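To make the pipeline concrete, here is a minimal Python sketch of how the four components connect; the component behavior, names, and templates are my own illustrative assumptions, not the actual system from this talk.

    def language_understanding(utterance: str) -> dict:
        # Toy LU: map an utterance to a semantic frame (domain, intent, slots).
        if "table" in utterance:
            return {"domain": "restaurant", "intent": "reserve", "slots": {}}
        return {"domain": "unknown", "intent": "unknown", "slots": {}}

    def track_state(state: dict, frame: dict) -> dict:
        # Toy state tracker: fold the new frame into the dialogue state.
        state.update(frame["slots"])
        state["domain"] = frame["domain"]
        return state

    def policy(state: dict) -> dict:
        # Toy policy: decide the next system action from the dialogue state.
        if "time" not in state:
            return {"act": "request", "slot": "time"}
        return {"act": "inform", "slot": "reservation"}

    def generate(action: dict) -> str:
        # Toy language generation: dialogue act -> natural language.
        templates = {"request": "What {slot} would you prefer?",
                     "inform": "Your {slot} is confirmed."}
        return templates[action["act"]].format(slot=action["slot"])

    state = {}
    frame = language_understanding("Can you book a table for two?")
    state = track_state(state, frame)
    print(generate(policy(state)))  # -> "What time would you prefer?"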
So let me briefly talk about the semantic frame representation. Our dialogue understanding is based on frames, and we define frames in connection to backend actions, in the sense that the backend might support certain arguments and actions, and those are replicated in the frame: the arguments the backend supports are replicated as slots, and the backend actions are replicated as intents. And apart from the backend intents, we support a bunch of conversational intents, or dialogue acts, like affirm, deny, acknowledge, express frustration, and so on.
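As a sketch, a frame in this spirit might look like the following; the exact field and slot names here are illustrative assumptions, not the paper's schema.

    # Hypothetical frame: backend arguments become slots, backend actions
    # become intents, plus shared conversational intents / dialogue acts.
    RESTAURANT_FRAME = {
        "domain": "reserve_restaurant",
        "intents": ["reserve_restaurant", "find_restaurant"],        # from backend actions
        "slots": ["restaurant_name", "date", "time", "num_people"],  # from backend arguments
        "dialogue_acts": ["affirm", "deny", "acknowledge", "express_frustration"],
    }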
So what does the language understanding module do? It performs three tasks. The first task is domain classification: given an incoming user utterance, the language understanding module tries to identify which domain it corresponds to, so this is an utterance classification task. The second task is intent classification: it tries to identify what intents exist in the user's utterance, and this is also an utterance classification task. And the third one is slot filling, and the idea there is to identify attributes which have been defined in the frame but appear in the user utterances. For example, for a query like "flights from Boston to Seattle", the domain might be flights and the user intent might be find-flights, and we're trying to identify attributes like the departure city and the destination city. This is a sequence tagging task, and we treat it as a sequence labeling task based on IOB tagging.
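For instance, the flights query above would be labeled roughly like this under IOB tagging; the slot names are illustrative.

    # Domain / intent are utterance-level labels; slots are token-level IOB tags.
    tokens = ["flights", "from", "Boston", "to", "Seattle"]
    labels = ["O", "O", "B-departure_city", "O", "B-destination_city"]
    domain, intent = "flights", "find_flights"

    for token, label in zip(tokens, labels):
        print(f"{token:10s} {label}")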
So, to sum it up: given a user utterance where the user asks to book a table at a particular restaurant, the goal of the language understanding module is to identify that the domain is restaurant reservation, that the intent is inform (the user is trying to inform the system about the restaurant name), to tag the restaurant name in the utterance, and similarly for the rest of the slots.
So there has been a lot of related work on using context for dialogue-related tasks. For language understanding, there has been work on using memory networks for language understanding in a single domain, and there has been work on using memory networks for end-to-end dialogue systems. There has also been work on using hierarchical recurrent encoder-decoder models for generative query suggestion, which is a slightly unrelated task, but our model is an enhancement of those models.
So I'll give an overview of the datasets. We have a collection of single-domain dialogue datasets; the idea there is that the user has a single task that they are trying to complete, and each dialogue corresponds to a single domain. We have around a thousand dialogues in these datasets. Then we have a small multi-domain dialogue dataset, where the training set is around five hundred dialogues, the dev set around a hundred and fifty dialogues, and the test set around two hundred and seventy dialogues. These dialogues are longer, because the user has multiple goals that they are trying to complete, and the goals span across multiple domains. The entity sets that we used to create the training and test dialogue sets are non-overlapping, so we have a lot of out-of-vocabulary entities in our dataset: the entities appearing in the test user utterances are out of the vocabulary.
Our data collection process relies on the interaction of a policy model and a user simulator, which interact in terms of dialogue acts, backend calls, and so on, and then we collect natural language realizations of those via crowdsourcing. The process and the datasets will be covered in an upcoming publication.
OK, so now I'll describe the overall architecture at a conceptual level. The idea is that there is a context encoder that acts on the utterances of the dialogue so far and produces a context vector, and then there is a tagger network that takes in the dialogue context and the current user utterance and tries to determine the domain, the intents, and the slots. Since we train a single model on multi-domain data, it does all of this jointly in a single model.
Next, the architecture of the tagger network. We use the same tagger architecture for all the models that we compare; they vary only in the context encoder component. This is an RNN-based model that jointly models the domain, the intents, and the slots. We feed the embeddings corresponding to the user utterance tokens into a bidirectional GRU, which is depicted here in light yellow, if visible. The outputs of the bidirectional GRU are then fed into an LSTM, which is depicted in light blue. As far as the context encoder is concerned, the output of the dialogue context encoder is fed into the initial state of the LSTM; we tried a bunch of different configurations, but this one seemed to work best, so that's what we went with. We used an LSTM in the second layer instead of a GRU only because it seemed to work better for slot filling, maybe because it leads to a separation between the internal states and the outputs. The final state of the LSTM is fed into the domain and intent classifiers, and the token-level outputs of the LSTM are fed into the slot tagger.
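Here is a minimal sketch of a tagger in this style, assuming PyTorch as the framework; the layer sizes and head names are my assumptions, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class TaggerNetwork(nn.Module):
        # Token embeddings -> biGRU -> LSTM (initialized from the dialogue
        # context vector) -> domain / intent / slot heads.
        def __init__(self, vocab, emb=64, hid=128, n_dom=3, n_int=10, n_slot=20):
            super().__init__()
            self.embed = nn.Embedding(vocab, emb)
            self.bigru = nn.GRU(emb, hid, bidirectional=True, batch_first=True)
            self.lstm = nn.LSTM(2 * hid, hid, batch_first=True)
            self.domain_head = nn.Linear(hid, n_dom)
            self.intent_head = nn.Linear(hid, n_int)
            self.slot_head = nn.Linear(hid, n_slot)  # per-token IOB logits

        def forward(self, token_ids, context):
            # context: (batch, hid) vector from the context encoder, used as
            # the LSTM's initial hidden state, as described in the talk.
            x, _ = self.bigru(self.embed(token_ids))
            h0 = context.unsqueeze(0)
            out, (h_n, _) = self.lstm(x, (h0, torch.zeros_like(h0)))
            final = h_n[-1]  # final LSTM state -> utterance-level classifiers
            return (self.domain_head(final), self.intent_head(final),
                    self.slot_head(out))

    model = TaggerNetwork(vocab=100)
    domain, intent, slots = model(torch.randint(0, 100, (2, 7)), torch.zeros(2, 128))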
So this is the tagger network that is shared across all the models; what I'll describe next are the context encoders. So why do we need to use context at all? Why not just feed the tagger network only the current user utterance?
So suppose the user is having a conversation with a restaurant reservation bot, and the user says "five". In the absence of context this is a pretty ambiguous statement, and it's not easy to make out what the user means: it could mean five people, or five pm, or maybe it could even be a restaurant name. But if you know that the system just asked "what time would you prefer?", then it's pretty obvious that the user meant "five" as a time, as opposed to, say, a number of people.
So this leads us to our first baseline model. The idea there is that we just feed the previous system turn into a GRU and use the final state of the GRU as the dialogue context.
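A sketch of this baseline context encoder, under the same illustrative PyTorch setup and sizes as before:

    import torch
    import torch.nn as nn

    emb, hid, vocab = 64, 128, 100
    embed = nn.Embedding(vocab, emb)
    turn_gru = nn.GRU(emb, hid, batch_first=True)

    # Token ids of the previous system turn (dummy data for illustration).
    prev_system_turn = torch.randint(0, vocab, (1, 6))
    _, h_n = turn_gru(embed(prev_system_turn))
    context = h_n[-1]  # (1, hid) final GRU state, fed to the tagger as context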
We evaluate on four metrics. The first one is domain F1, which is the classification F1 score over domains. The second is intent F1, which is the classification F1 score over intents. The third one is slot F1, and the last one is the frame error rate, which is the ratio of utterances where the model gets any one of the predictions wrong; so obviously you want to go for the lowest possible frame error rate.
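The frame error rate, as described, can be computed like this on toy data; the frame encoding here is an illustrative assumption.

    # A frame is wrong if any of its domain / intent / slot predictions is wrong.
    def frame_error_rate(predictions, references):
        wrong = sum(pred != ref for pred, ref in zip(predictions, references))
        return wrong / len(references)

    preds = [("movies", "buy", ("B-title",)), ("restaurant", "find", ("O",))]
    refs  = [("movies", "buy", ("B-title",)), ("restaurant", "reserve", ("O",))]
    print(frame_error_rate(preds, refs))  # 0.5: the second frame has a wrong intent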
So these are the performances of the simple baseline: the model where the previous system turn is encoded with a GRU and then fed into the tagger network.
So why do we need context from the whole dialogue rather than just the previous turn? Suppose the user, instead of just responding to a system-initiative turn at this point, takes the initiative instead. This makes the problem more difficult, because in the absence of context about the previous dialogue it isn't clear what the user is referring to here: it could be a movie name, it could be a time, it could be a restaurant name; there are many options. But suppose you knew that this user had been talking about buying movie tickets earlier; then you're much more likely to get the prediction right. So we want context from all of the previous turns of the dialogue.
So this is our second baseline, and it is based on the model proposed by Chen et al. in the memory networks for language understanding paper. The idea there is to have a GRU layer that encodes the previous utterances to produce memory vectors, so this memory is a representation of all the previous utterances. We have another GRU that encodes the current utterance to produce a representation of that utterance. Based on the inner products of the memory vectors and the current utterance vector we get an attention distribution, and then we take a weighted sum of the memory vectors to get the context vector for the dialogue, which is depicted here. The output of this context encoder is then fed into the tagger network.
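Here is a rough sketch of this memory-attention computation, again assuming PyTorch and illustrative sizes; it sketches the idea rather than Chen et al.'s exact implementation.

    import torch
    import torch.nn as nn

    emb, hid, vocab = 64, 128, 100
    embed = nn.Embedding(vocab, emb)
    mem_gru = nn.GRU(emb, hid, batch_first=True)  # encodes each previous turn
    cur_gru = nn.GRU(emb, hid, batch_first=True)  # encodes the current utterance

    prev_turns = torch.randint(0, vocab, (5, 8))  # 5 previous turns, 8 tokens each
    current = torch.randint(0, vocab, (1, 8))

    _, m = mem_gru(embed(prev_turns))
    memories = m[-1]                               # (5, hid) memory vectors
    _, c = cur_gru(embed(current))
    utt = c[-1]                                    # (1, hid) current utterance

    attn = torch.softmax(memories @ utt.t(), dim=0)  # attention over previous turns
    context = (attn * memories).sum(dim=0)           # weighted sum -> context vector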
So, as you can see, adding a memory of the entire dialogue leads to an improvement on all the metrics: an absolute gain for domain, around 2.3 percent for intent, and around 0.5 percent for slots, along with a significant reduction in the frame error rate. But can we do better than this?
Now, if you remember, we're working on multi-domain dialogues, so the idea is that the user might have multiple goals. Just the knowledge of what the user said in a turn, without being able to understand the dialogue history in the context of the rest of the utterances, might not give the complete picture. For example, suppose the user has multiple goals, say buying movie tickets and making a restaurant reservation. In the absence of knowledge of how these utterances relate to each other, a user utterance is still ambiguous; but if you have a sequential history of the dialogue, where you can understand each utterance in the context of the others, it's more likely that you get the prediction right.
So this is the final model that we experimented with, and it is an extension of the memory network. The idea there is, again, you get the memory of the previous dialogue turns, which is depicted here in yellow, and you get a representation of the current utterance, which is depicted in green. But instead of taking an inner product to get an attention distribution, you combine them together with a feedforward layer to get a contextual memory of the dialogue, and this is then fed into another GRU, which produces the context vector. So basically what is happening is: we get a representation of the entire dialogue history in context with the current utterance, and then we have a GRU that runs over the whole dialogue and tries to combine these utterances together in the context of each other, and the final state of that GRU is the context vector that is fed to the tagger.
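A sketch of this sequential dialogue encoder, continuing with the same illustrative tensors and sizes as the previous sketch; the exact form of the feedforward combiner (here a single sigmoid layer) is my assumption.

    import torch
    import torch.nn as nn

    hid = 128
    combine = nn.Sequential(nn.Linear(2 * hid, hid), nn.Sigmoid())
    session_gru = nn.GRU(hid, hid, batch_first=True)

    memories, utt = torch.randn(5, hid), torch.randn(1, hid)  # stand-in encodings
    # Pair every memory vector with the current utterance representation.
    pairs = torch.cat([memories, utt.expand(5, hid)], dim=1)  # (5, 2*hid)
    contextual = combine(pairs).unsqueeze(0)                  # (1, 5, hid)
    _, h_n = session_gru(contextual)                          # run over the turns
    context = h_n[-1]  # (1, hid) final state -> context vector for the tagger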
So this is an enhancement of the memory network, and it is also, in a sense, an enhancement of the hierarchical recurrent encoder-decoder model that has been used for next utterance prediction and for context-sensitive generative query suggestion.
So, very unexpectedly, what we observed is that this model doesn't perform as well as the memory network. We dug into this, and our hypothesis is that there is a huge train-test distribution shift in our datasets: the training set is composed largely of single-domain dialogues, a few thousand of them from the single-domain datasets, plus only around five hundred multi-domain dialogues. So we believe that the sequential dialogue encoder is unable to adapt from the single-domain dialogues to the multi-domain test set.
So what do we do? We go with a simple data augmentation scheme. Since there is a distribution shift between our training and test datasets, we try to make the training dataset more similar to the test data: we take the large single-domain dialogue datasets and combine single-domain dialogues, sampled in pairs, to synthetically create domain switches, basically by grafting one single-domain dialogue into another one. We end up with around ten thousand of these combined dialogues.
This is an example of how we sample a combined dialogue. Dialogue A is a dialogue where the user is trying to buy movie tickets, and dialogue B is a dialogue where the user is trying to find a restaurant. We randomly sample a location in dialogue A and insert dialogue B at that point; this now becomes the combined dialogue, and we use it for training.
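A minimal sketch of this combination scheme; the turn format is an illustrative assumption.

    import random

    def combine(dialogue_a, dialogue_b, rng=random):
        # Graft dialogue B into dialogue A at a random turn boundary,
        # synthesizing a multi-domain dialogue with a domain switch.
        cut = rng.randrange(len(dialogue_a) + 1)
        return dialogue_a[:cut] + dialogue_b + dialogue_a[cut:]

    movies = ["U: two tickets for tonight please", "S: which movie?"]
    restaurant = ["U: find me a restaurant nearby", "S: which cuisine?"]
    print(combine(movies, restaurant))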
So, as the numbers show, we see a visible improvement in performance just from training on the combined dialogues, compared to training without them. The bold numbers are the ones where a model beats all the other models on a certain metric. Dialogue combination leads to performance improvements for almost all the models, but the one that benefits the most is the sequential dialogue encoder. This is probably because combining dialogues leads to longer dialogues and adds noise, which acts like regularization, and since the sequential dialogue encoder is the most complex model, we would expect it to benefit the most from this.
And this is what we observe: the sequential dialogue encoder does best on domain classification; it is not the best on slot F1, but it is almost the best model on intent classification.
So here is an example; it's a somewhat degenerate example, but it tries to illustrate what's happening. We just look at the attention distributions and try to figure out what the models are doing. This is a dialogue from the test set, and the out-of-vocabulary tokens are in boldface; all of these are out of vocabulary for the dataset because the entity sets are non-overlapping. You can see that the last three utterances have a lot of out-of-vocabulary tokens. If you look at the memory network's attention distribution, you notice that its focus is almost entirely on the user utterance where the user asks for a Brazilian restaurant, whereas the sequential dialogue encoder's focus is spread equally over the last two utterances.
And I should make it clear: the utterance that we are trying to understand is the final one at the bottom, uttered by the user. So the goal here is to identify that the domain is find-restaurant and to identify the slot values the user provides. What we observe is that the encoder-decoder model fails to identify either the domain or the slots. The memory network correctly identifies the restaurant domain, because it is focusing on the utterance where the user says they want a restaurant, but it fails to incorporate context from the previous system utterance, where the system is offering a result to the user, and it is unable to identify the slots. The sequential dialogue encoder, on the other hand, successfully combines context from multiple utterances and recognizes both the domain and the slots.
OK, I think that's it. Thanks a lot for listening. Questions?
Thanks, I have two questions. The first one: as a byproduct of what you're doing, you get a memory representation of the context; you have the whole dialogue history. I'm wondering whether you've considered, since you have access to the simulated user, training a policy using this representation, because it's very similar to belief tracking in traditional architectures. So the question is: instead of doing a modular thing, could you just have the same model do the whole thing end-to-end?
So, that's a very interesting suggestion, because we have some people running experiments on this; this is something we're looking into.
Because I think the problem usually with such an uninterpretable representation is that when you pick an action, say a confirm, you don't know which slot to confirm; but at the same time you have the semantics here, so maybe you can make it usable.
I think by carefully designing the semantics that we're using, we can alleviate or remove that ambiguity. For example, instead of having a single confirm action, you could have a confirm-slot action, and then have the model predict, based on the context, which slot the user is trying to confirm; that would remove some of the ambiguity.
I also have a question. Can you go back to the second-to-last slide, where you had the Brazilian restaurant? I wanted to ask two questions about this example. First, I thought you said you train on a synthetic dataset where you combine domains, right? So do you now consider the restaurant domain to be out of domain at this point? That's the first one. And second, how would you deal with something that is truly out of domain, like "the weather is nice" or "today I'm grumpy" or whatever, in a different way from these utterances, which are still task-related even if not seen in the data?
So, for the first question: this domain is not out of domain, because our system can handle both movie tickets and restaurants. Given an utterance, the system tries to keep track across the different domains, so it will see that this is a different-domain utterance and can handle it, even though the dialogue is multi-domain. It's out-of-vocabulary, not out-of-domain. For the second question: we do have a few out-of-domain utterances in this dataset, to which the system is supposed to say "I cannot handle that", but I think in our dataset there aren't enough of them, so we would definitely need more out-of-domain data to be able to successfully handle out-of-domain utterances.
Right. And my second question: do you use delexicalization of the input, or do you not?
No, we don't use any delexicalization; this is basically the model that does the delexicalization, effectively. This model will identify the entities that we're trying to delexicalize. Because if you use a gazetteer or database-based approach to delexicalize, it doesn't scale to all the restaurants in the world; instead, this model tries to identify the entities based on context, based on the semantic annotations from the data collection.
Thank you very much.