0:00:15 Well, just by way of introduction: I'm not actually an author of this paper, so I'm presenting it on behalf of the authors, and I should say that I don't work on dialogue state tracking myself.
0:00:36 So for the outline: I'll first give some background and the motivation, then talk about what is meant by update intents, then introduce the problem statement and the model and how they deal with cross-domain generalization, and then the data, the experiments, and finally conclusions and future work.
0:01:01 So in terms of the background: dialogue state tracking is critical to the successful completion of tasks in task-oriented dialogue systems. The belief state expresses a probability distribution over user goals, which are represented as slot-value pairs.
0:01:20 Typically, state tracking approaches use dialogue acts to infer user intentions towards the slot values that have been detected; typical dialogue acts would be inform, deny, request, negate. So for an example utterance like "find me French restaurants in Boston", the SLU output would be inform(cuisine=French), inform(city=Boston).
0:01:50 So basically the motivation of this work is that dialogue acts do not always adequately capture the user's intent towards slot values, and there are several cases where this happens.
0:02:08 One example is implicit denial. In this example the user invites John and Joe for dinner, and then says that Joe can't make it. I think this is called implicit because it doesn't correspond to a deny or negate dialogue act. Here we have the user's utterances on the left and the expected SLU output of a typical system on the right.
0:02:38 Another limitation is expressing preferences for slot values, specifically among values of the same slot. In this example the user asks for French restaurants in one location; a second utterance says "find some in San Jose too", and a third says "find me ones in Gilroy instead". Current SLU with dialogue acts wouldn't distinguish between the second and third utterances, whereas the intent is quite different: the third basically expresses a preference for Gilroy, which would imply replacing whatever is currently in the state, while in the second one you just want to append the new value.
0:03:25 Then another limitation is that it doesn't deal well with numerical updates, where you just want an incremental command. In this example you ask for a table for four, and then you might say "four more seats" or "two more seats", and it's not clear how the expected SLU output of current systems would deal with that.
0:03:48 So the solution that the authors propose is update intents, a new semantic class of intents tied directly to the update of the dialogue state that the user intends.
0:04:02 So here's the list of update intents. The first one is Append: the user specifies a value, or multiple values, for a multi-valued slot, so it's basically the intent to add values to a multi-valued slot. Remove is basically the complement of that: to remove a value from a multi-valued slot. Replace expresses a preference for a slot value: it expresses a new value that is to be preferred over a previous value, so it means replacing the existing value. And then there are Increase_by and Decrease_by, which are specific to numeric slot types.
0:04:49 And here are some examples. What we have here is an utterance, then the conventional SLU output, and then additionally the update intent. So for example, the earlier case of "take Joe off the list" would become an inform where the slot value is Joe, with the update intent Remove. For restaurant search, "find some in San Jose too" would become an inform with Append, whereas "find me ones in Gilroy instead" would become an inform with Replace. And then for the numerical examples, "four more seats" would become an inform with Increase_by four, and similarly "two more seats" an Increase_by of two.
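Just to make the semantics concrete, here is a minimal sketch I put together (it's not from the paper) of how these five update intents could be applied to a toy dialogue state; the state layout, slot names and function name are my own assumptions for illustration.

```python
# Minimal sketch (not from the paper) of applying the five update intents
# to a toy dialogue state. Slot names and the helper name are assumptions.

def apply_update(state, slot, value, intent):
    """Apply one (slot, value, update-intent) triple to the state dict.

    Multi-valued slots are stored as lists, numeric slots as numbers.
    """
    if intent == "append":            # add a value to a multi-valued slot
        state.setdefault(slot, []).append(value)
    elif intent == "remove":          # drop a value from a multi-valued slot
        if value in state.get(slot, []):
            state[slot].remove(value)
    elif intent == "replace":         # prefer the new value over existing ones
        state[slot] = [value]
    elif intent == "increase_by":     # numeric slots only
        state[slot] = state.get(slot, 0) + value
    elif intent == "decrease_by":
        state[slot] = state.get(slot, 0) - value
    return state


state = {"person_names": ["John", "Joe"], "num_guests": 4}
apply_update(state, "person_names", "Joe", "remove")   # "Joe can't make it"
apply_update(state, "num_guests", 4, "increase_by")    # "four more seats"
print(state)   # {'person_names': ['John'], 'num_guests': 8}
```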
0:05:41 OK, so how is the problem formulated? Basically, given a user utterance, identify the update intents of all the slot values mentioned in it. So the input is a user utterance already tagged with slots and values, and the output is the update intent for each slot, one of the five classes. Here we see two examples: "drop one person", where number_of_guests is the slot name, one is the slot value, and the update intent is Decrease_by; and "Joe can't make it", where person_names is the slot name, Joe is the slot value, and the update intent is Remove. So this is formulated as a multiclass classification of each slot value into the five update intent classes.
0:06:29 The modeling treats this as a sequence labeling problem with a bidirectional LSTM. The user utterance is a sequence of tokens, some of which are slot values, and the labels are the update intents on the tokens that correspond to slot values, with a generic label on all other tokens. They also study the effect of delexicalization of the slot values.
0:06:56 So this is what it looks like. On the bottom we have the input, an utterance along the lines of "okay, forget Sunnyvale, try [another location] instead", except that it has been delexicalized: wherever there is a slot value we basically replace it with the slot name, which has been shown in previous work to generalize better with limited training data, since the slot values themselves may be out of vocabulary in the training data. Then we have the embedding layer, then a typical bidirectional LSTM, and finally a softmax layer where we predict the target labels. You can see in this example that "okay", "forget", "try" and "instead" are ordinary tokens and get the generic label, while the location tokens get update intent labels such as Replace.
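To make the architecture a bit more concrete, here is a minimal sketch of this kind of bidirectional-LSTM tagger in PyTorch; the layer sizes, vocabulary handling and label inventory are my own assumptions, not the authors' exact implementation.

```python
# Minimal PyTorch sketch of a BiLSTM sequence tagger over a delexicalized
# utterance. Hyperparameters and vocabulary handling are illustrative.
import torch
import torch.nn as nn

class UpdateIntentTagger(nn.Module):
    def __init__(self, vocab_size, num_labels, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) over delexicalized tokens, e.g.
        # ["okay", "forget", "<location>", "try", "<location>", "instead"]
        emb = self.embedding(token_ids)
        hidden, _ = self.bilstm(emb)
        # per-token logits; the softmax is applied implicitly by the
        # cross-entropy loss during training
        return self.out(hidden)


# one generic label for ordinary tokens plus one label per update intent
labels = ["O", "append", "remove", "replace", "increase_by", "decrease_by"]
model = UpdateIntentTagger(vocab_size=5000, num_labels=len(labels))
logits = model(torch.randint(0, 5000, (2, 8)))   # toy batch: 2 utterances, 8 tokens
print(logits.shape)                              # torch.Size([2, 8, 6])
```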
0:07:50 So the delexicalization of slot values is helpful for generalizing to slot values not seen in the training data, but only really within a single domain. Across domains the slot names may differ, and you may see slot names in the target domain that didn't exist in the source domain. However, if we can group slot names into types, different domains should share the same types of slots.
0:08:25 As an example, the restaurant reservation and online shopping domains both have a number-of-guests and a number-of-grocery-items slot, and both are numeric slots, so if we can delexicalize to the slot type rather than the slot name we may be able to generalize. So the solution is to delexicalize by slot type, and these are the three slot types they define.
0:08:53 There are numeric slots, which can be increased and decreased, and then two types of multi-valued slots: disjunctive multi-valued slots, which can take multiple values in a disjunctive sense, so any one of the values would satisfy the user, and conjunctive multi-valued slots, which take multiple values in conjunction, for instance the names of the people going to dinner or the items on a shopping list.
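As a rough illustration of that type-level delexicalization, here is a small sketch; the slot-name-to-type mapping is my own guess, loosely based on the slot lists described later, not the authors' actual table.

```python
# Rough sketch of type-level delexicalization for cross-domain transfer.
# The slot-name-to-type mapping is illustrative, not the authors' table.
SLOT_TYPES = {
    "num_guests": "numeric",                 # restaurant domain
    "person_names": "multi_conjunctive",     # people attending: a conjunction
    "cuisine": "multi_disjunctive",          # alternatives the user would accept
    "location": "multi_disjunctive",
    "item_quantity": "numeric",              # shopping domain
    "grocery_items": "multi_conjunctive",    # items on the shopping list
    "color": "multi_disjunctive",
}

def delexicalize(tokens, slot_spans, by_type=True):
    """Replace detected slot-value tokens with <slot_name> or <slot_type>."""
    out = list(tokens)
    # process spans right-to-left so earlier indices stay valid
    for start, end, slot in sorted(slot_spans, reverse=True):
        symbol = SLOT_TYPES[slot] if by_type else slot
        out[start:end] = [f"<{symbol}>"]
    return out

tokens = "forget sunnyvale try gilroy instead".split()
print(delexicalize(tokens, [(1, 2, "location"), (3, 4, "location")]))
# ['forget', '<multi_disjunctive>', 'try', '<multi_disjunctive>', 'instead']
```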
0:09:21 OK, so to evaluate this, what was required was a dataset whose dialogues contain numeric and multi-valued slots in the domain ontology and which allows annotation with the proposed update intents. Basically, existing datasets didn't have all of these properties, so the authors created their own dataset.
0:09:48 The dataset covers two domains, restaurants and online shopping, and they had eight different professional editors generate conversations in these domains. The editors were basically asked to create conversations corresponding to a task; the tasks were searching for a restaurant, making a dinner booking, buying groceries, and buying clothes. They were told to assume appropriate bot responses, so this did not require building an end-to-end system. The dialogues the editors generated were then annotated with slot names and the update intents.
0:10:27 Just as a reminder, this is what the annotation essentially looks like: you have the utterance, annotated with the slot name and the slot value, which would be input to the system, and the update intent, which is to be predicted.
0:10:44 And for the restaurant and shopping domains, this is the list of slot names and their types: participant names, number of guests, menu items, cuisine and location for restaurants; and grocery items, quantity of grocery items, colour and size for shopping. You can see that although the slot names are disjoint across the two domains, they still share the same slot types.
0:11:17 OK, so after the data was created and annotated, this is what the distribution looks like. The shopping and restaurant portions have similar distributions, with a similar number of conversations each and around thirteen hundred utterances, and you can see that on average more than one slot value is mentioned in each utterance.
0:11:39 Then, in terms of the update intents themselves, this is the distribution. You can see that in both domains Append is the most common update, followed by Replace. For shopping, Increase_by is noticeably more frequent than for restaurants, something like twelve percent in shopping versus four percent in restaurants.
0:12:08 OK, so in terms of the experiments: they implemented the bidirectional LSTM and optimized it with the Adam optimizer, a batch size of sixty-four, and a cross-entropy loss. The embedding layer was initialized with pre-trained GloVe embeddings trained on the Common Crawl data, and words missing from GloVe were initialized randomly.
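Here is a small sketch of that embedding initialization, assuming standard GloVe text files; the file name, dimensionality and vocabulary handling are placeholders I chose for illustration.

```python
# Sketch of initializing the embedding matrix from pre-trained GloVe vectors,
# with words missing from GloVe initialized randomly, as described.
# The file name and dimensionality are illustrative placeholders.
import numpy as np

def build_embedding_matrix(vocab, glove_path="glove.840B.300d.txt", dim=300):
    matrix = np.random.uniform(-0.1, 0.1, size=(len(vocab), dim))
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], parts[1:]
            if word in vocab and len(vec) == dim:
                matrix[vocab[word]] = np.asarray(vec, dtype=np.float32)
    return matrix

vocab = {"<location>": 0, "forget": 1, "instead": 2}
# emb = build_embedding_matrix(vocab)   # requires the GloVe file on disk
```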
0:12:37 The evaluation was leave-one-out cross-validation: because the data was created by eight individual editors, and the same editor might express things in the same way, they didn't want intra-editor overlap between training and evaluation. So for a given fold they always trained on seven editors' data and tested on the remaining editor's data, and the reported number is the average over all folds. They also did hyperparameter tuning only on the learning rate.
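A minimal sketch of that leave-one-editor-out loop; the data layout and the train-and-evaluate function are placeholders, not the authors' code.

```python
# Sketch of leave-one-editor-out cross-validation: train on seven editors'
# data, test on the held-out editor, and average over the eight folds.
# `data_by_editor` and `train_and_evaluate` are placeholders.
def leave_one_editor_out(data_by_editor, train_and_evaluate):
    scores = []
    for held_out in data_by_editor:
        train = [ex for editor, exs in data_by_editor.items()
                 if editor != held_out for ex in exs]
        test = data_by_editor[held_out]
        scores.append(train_and_evaluate(train, test))   # e.g. F1 on this fold
    return sum(scores) / len(scores)                     # average over folds
```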
0:13:12 They also have some baselines. There's a simple n-gram baseline based on a word window around the slot values as context, with a logistic regression classifier. But because there can of course be multiple slot values in an utterance, they have to decide which slot value a given word or n-gram belongs to. I won't go into the details, but they had two approaches to this: hard segmentation, which is a rule-based approach to deciding which slot value a word should belong to, and soft segmentation, which basically encodes every word as being to the left of, to the right of, or between two slot values, so it increases the size of the feature representation. The other baseline was the full model without delexicalization.
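For concreteness, here is a rough sketch of an n-gram-window baseline of the kind described, using scikit-learn; the window size and featurization details are my assumptions, and the hard/soft segmentation step for assigning words to slot values is omitted.

```python
# Rough sketch of the n-gram baseline: n-gram features from a word window
# around each slot value, fed to a multiclass logistic regression classifier.
# Window size and featurization are assumptions; segmentation is omitted.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def window_around(tokens, slot_pos, size=3):
    """Return the words within `size` tokens of the slot value as one string."""
    lo, hi = max(0, slot_pos - size), min(len(tokens), slot_pos + size + 1)
    return " ".join(tokens[lo:slot_pos] + tokens[slot_pos + 1:hi])

# toy training data: (utterance tokens, position of slot value, intent label)
examples = [
    ("find some in san_jose too".split(), 3, "append"),
    ("find me gilroy ones instead".split(), 2, "replace"),
    ("joe can not make it".split(), 0, "remove"),
    ("four more seats please".split(), 0, "increase_by"),
]
X = [window_around(t, i) for t, i, _ in examples]
y = [label for _, _, label in examples]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(X, y)
print(clf.predict([window_around("add bread to the list too".split(), 1)]))
```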
0:14:17 These are the classification results for the full model. I guess the key point here is that it is pretty accurate: an F1 score over ninety in both domains, and over ninety percent F1 for quite a few of the individual intents. For both domains the most difficult one, for some reason, is Remove. It could be that there isn't enough training data for it, although Increase_by and Decrease_by actually have less.
0:14:54 Then, compared to the baselines, perhaps unsurprisingly the full model does much better than the n-gram baseline, and we can also see that the delexicalization helps a lot: for restaurants it improves the F1 from about eighty percent to ninety percent, and for shopping from eighty-four to ninety.
0:15:24 OK, and then in terms of the cross-domain generalization, just some terminology: in the paper they talk about in-domain versus out-of-domain data. There are basically two settings. One is combined training, where you just train on the combination of in-domain and out-of-domain data. The other is pre-training with fine-tuning, where they pre-train on the out-of-domain data and then fine-tune only on the in-domain data, although the paper says fine-tune on the union, which I wonder is a typo. In both settings they vary the percentage of in-domain data that is used.
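Schematically, the two settings could look something like this; the train function, data layout and in-domain fraction handling are placeholders, not the authors' setup.

```python
# Schematic sketch of the two cross-domain settings described: combined
# training vs. pre-training on out-of-domain data followed by fine-tuning on
# a fraction of in-domain data. `train` and the data layout are placeholders.
def combined_training(model, out_domain, in_domain, frac, train):
    subset = in_domain[: int(frac * len(in_domain))]
    return train(model, out_domain + subset)          # one pass over the union

def pretrain_then_finetune(model, out_domain, in_domain, frac, train):
    model = train(model, out_domain)                  # pre-train out-of-domain
    subset = in_domain[: int(frac * len(in_domain))]
    return train(model, subset)                       # fine-tune on in-domain
```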
0:16:04 So here are the results when restaurant was the out-of-domain source and shopping was the target domain. The green curve is what happens if you train only on in-domain data, another curve is the pre-training approach, and the other is combined training. You can see that even with zero in-domain data the transferred model already does pretty well, somewhere in the eighties, versus the mid-nineties being the optimum, and you can get close to optimal results with only twenty percent of the in-domain data. Going the opposite way the results are still pretty encouraging but not quite as good: with zero in-domain data the F1 is only around seventy percent.
0:16:51 So it seems to me, at least, that this suggests the restaurant data may be richer and more varied, so training on the simpler shopping data just doesn't transfer as well.
0:17:09 OK, so conclusions: the authors propose a new type of slot-specific user intent, update intents, which address user intents involving implicit denials, numerical updates, and preferences for slot values. They present a sequence labeling model for classifying update intents, and also propose a method for transfer learning across domains. They showed strong classification performance on this task and promising domain-independent results. In future work they plan to incorporate update intents into real dialogue state tracking.
0:17:50 So I'm not an author, but I can try to answer some questions, especially if they're clarification-type questions, because I had quite a lot of questions about this myself; I'm not sure I can answer everything. These are the first two authors' email addresses.
0:18:53 [Audience question] I'm not sure about one thing especially: I don't see how you could just replace the NLU with this, because if the user says "four people" for the task you'd still need to get the number four from the NLU.
0:19:28 Sure, I mean, that's the only way it made sense to me to model it as well; it's a question I'd want to put to the authors too, because to me it seems more and more difficult to frame otherwise.
0:19:59 I have a question myself, but I only thought of it last night, so it was too late to ask the authors whether they have it available; it occurred to me as well that it would be interesting to see what exactly is being confused.
0:20:44 I don't... I mean, I'm not sure I can answer that question. I guess there are two steps to the annotation: one is creating the dialogues, and the other is actually annotating the utterances with the slot names, values, and intents. So I guess for the second part you could get inter-annotator agreement on the created dialogues, but I don't believe they report inter-annotator agreement. I mean, the fact that they can get ninety percent F1 suggests that the labels can't be too noisy, because if they were very noisy it would be hard to be that accurate, but of course that's not the same as explicitly measuring it.