0:00:17 | and the next |
---|
0:00:20 | speaker we have is shikib mehri |
---|
0:00:24 | with the paper on structured fusion networks for dialogue which uses an end-to-end dialogue model |
---|
0:00:31 | so |
---|
0:00:32 | please |
---|
0:00:49 | i'm shikib mehri |
---|
0:00:51 | and i'm here today to talk about structured fusion networks for dialogue |
---|
0:00:55 | this work was done with |
---|
0:00:57 | tejas srinivasan and my advisor maxine eskenazi |
---|
0:01:01 | okay let's talk about neural models of dialogue |
---|
0:01:04 | so neural dialogue systems do really well on the task of dialog generation |
---|
0:01:08 | but they have several well-known shortcomings |
---|
0:01:11 | they need a lot of data to train |
---|
0:01:13 | they struggle to generalize to new domains |
---|
0:01:17 | they are difficult to control |
---|
0:01:19 | and |
---|
0:01:20 | they exhibit divergent behavior when tuned with reinforcement learning |
---|
0:01:25 | on the other hand traditional pipelined dialogue systems |
---|
0:01:28 | have structured components |
---|
0:01:30 | that allow us to easily generalize them |
---|
0:01:34 | interpret them and control these systems |
---|
0:01:37 | both these systems have their respective advantages and disadvantages |
---|
0:01:41 | neural dialogue systems can learn from data |
---|
0:01:43 | and they can learn higher level reasoning |
---|
0:01:46 | or a higher level policy |
---|
0:01:47 | on the other hand pipeline systems |
---|
0:01:49 | have a very structured nature which has several benefits |
---|
0:01:53 | yesterday there was this question in the panel of |
---|
0:01:55 | to pipeline or not to pipeline |
---|
0:01:57 | and to me the obvious answer seems why not both and i think that |
---|
0:02:03 | combining these two approaches is a very intuitive thing to do |
---|
0:02:07 | so how do we go about combining these two approaches |
---|
0:02:10 | so in pipelined systems we have structured components so the very first thing to do |
---|
0:02:16 | to bring this structure |
---|
0:02:17 | to neural dialogue systems |
---|
0:02:19 | is to neuralize these components |
---|
0:02:22 | so using the multiwoz dataset we first define and train |
---|
0:02:26 | several neural dialogue modules |
---|
0:02:28 | one for the nlu |
---|
0:02:29 | one for the dm and one for the nlg |
---|
0:02:33 | so for the nlu what we do is |
---|
0:02:35 | we read the dialogue context |
---|
0:02:38 | encode it and then |
---|
0:02:39 | ultimately make a prediction about the belief state |
---|
0:02:43 | for the dialogue manager |
---|
0:02:44 | we look at the belief state as well as some vectorized representation of the database |
---|
0:02:48 | pass the output through several linear layers and ultimately predict the system dialogue act |
---|
0:02:55 | for the nlg we have a conditioned language model |
---|
0:02:58 | where the initial hidden state is a linear combination |
---|
0:03:01 | of the dialogue act the belief state and the database vector and then at every |
---|
0:03:05 | time step |
---|
0:03:06 | the model outputs what the next word should be to ultimately generate the response |
---|
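As a rough illustration of the wiring just described, here is a toy numpy sketch of the three modules; the dimensions, random weights, and function names are all invented for illustration and are not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not the paper's hyperparameters).
HIDDEN, BELIEF, DB, ACT, VOCAB = 8, 6, 4, 5, 10

def nlu(context_states):
    """NLU: encode the dialogue context, predict the belief state."""
    final = context_states[-1]                 # last encoder hidden state
    W = rng.normal(size=(BELIEF, HIDDEN))
    return 1 / (1 + np.exp(-W @ final))        # multi-label belief state

def dm(belief, db_vector):
    """DM: belief state + database vector -> system dialogue act."""
    W = rng.normal(size=(ACT, BELIEF + DB))
    return 1 / (1 + np.exp(-W @ np.concatenate([belief, db_vector])))

def nlg(act, belief, db_vector, steps=3):
    """NLG: conditioned LM; the initial hidden state is a linear map of
    [act; belief; db], then one word is emitted per time step."""
    W0 = rng.normal(size=(HIDDEN, ACT + BELIEF + DB))
    h = np.tanh(W0 @ np.concatenate([act, belief, db_vector]))
    Wr = rng.normal(size=(HIDDEN, HIDDEN))
    Wo = rng.normal(size=(VOCAB, HIDDEN))
    words = []
    for _ in range(steps):
        h = np.tanh(Wr @ h)                    # recurrent update
        words.append(int(np.argmax(Wo @ h)))   # greedy next-word choice
    return words

context = rng.normal(size=(5, HIDDEN))         # pretend encoder outputs
b = nlu(context)
a = dm(b, rng.normal(size=DB))
response = nlg(a, b, rng.normal(size=DB))
```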
0:03:11 | so we have these three neural dialogue modules |
---|
0:03:14 | that mirror the structured components of traditional pipelined systems |
---|
0:03:18 | given these three components |
---|
0:03:21 | how do we actually go about |
---|
0:03:22 | building a system for dialog generation |
---|
0:03:25 | well the simplest thing to do is |
---|
0:03:28 | naive fusion |
---|
0:03:29 | where what we do is we train these systems and then we just combine them |
---|
0:03:34 | naively during inference where instead of passing in the ground truth belief state to the |
---|
0:03:40 | dialogue manager which is what we would do during training we make a prediction |
---|
0:03:44 | using our trained nlu |
---|
0:03:46 | and then pass it into the dialogue manager |
---|
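Naive fusion, as described, just chains the trained modules' own predictions at inference time instead of the ground-truth inputs they saw in training. A minimal sketch with trivial stub modules (all names and values invented):

```python
# Stand-in "trained modules" (trivial stubs; the real ones are neural networks).
def nlu(context):
    return [1.0, 0.0]                                    # predicted belief state

def dm(belief, db):
    return [b + d for b, d in zip(belief, db)]           # dialogue act stub

def nlg(act, belief, db):
    return ["hello"] * len(act)                          # response stub

def naive_fusion(context, db):
    belief = nlu(context)     # inference: use the NLU's *prediction*
    act = dm(belief, db)      # ...instead of the ground-truth belief state
    return nlg(act, belief, db)

print(naive_fusion("ctx", [0.5, 0.5]))   # -> ['hello', 'hello']
```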
0:03:50 | another way of using these dialogue modules |
---|
0:03:53 | after training them independently is multitasking |
---|
0:03:57 | so |
---|
0:03:58 | where we simultaneously learn the dialogue modules |
---|
0:04:01 | as well as the final task of dialog response generation so we have these three |
---|
0:04:06 | independent modules here |
---|
0:04:07 | and then we have these red arrows that correspond to the forward propagation |
---|
0:04:11 | for the task of response generation |
---|
0:04:15 | sharing the parameters in this way results in more structured components |
---|
0:04:19 | now the encoder |
---|
0:04:20 | is both being used for the task of the nlu |
---|
0:04:23 | as well as for the task of response generation |
---|
0:04:25 | so now it would have this notion of structure in it |
---|
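The multitasking setup can be caricatured as one shared encoder whose parameters receive gradients from both the intermediate NLU loss and the end-to-end generation loss. A deliberately tiny sketch (scalar stand-ins, invented numbers):

```python
# Multitasking sketch: one shared encoder, two task heads, summed losses.
def encoder(x, w):
    return w * x                                # stand-in shared encoder

def multitask_loss(x, w, belief_tgt, response_tgt):
    h = encoder(x, w)
    # both losses depend on the same encoder parameter w,
    # so both tasks shape the shared representation
    nlu_loss = (h - belief_tgt) ** 2            # intermediate NLU supervision
    gen_loss = (2 * h - response_tgt) ** 2      # response-generation head
    return nlu_loss + gen_loss

print(multitask_loss(1.0, 0.5, 1.0, 1.0))       # -> 0.25
```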
0:04:29 | another way which is the primary |
---|
0:04:32 | novel work in our paper is structured fusion networks |
---|
0:04:35 | structured fusion networks aim to learn a higher level model |
---|
0:04:39 | on top of pre-trained neural dialogue modules |
---|
0:04:43 | here's a visualization of structured fusion networks |
---|
0:04:45 | and don't worry if this seems like spaghetti i'll come back to this |
---|
0:04:49 | so here what we have is |
---|
0:04:51 | we have the original dialogue modules the nlu the dm and nlg |
---|
0:04:55 | in these grey small boxes in the middle |
---|
0:04:58 | and then what we do is we |
---|
0:04:59 | define these black boxes around them |
---|
0:05:02 | that consist of a higher level module |
---|
0:05:04 | so the nlu gets upgraded to the nlu plus |
---|
0:05:07 | the dm to the dm plus and the nlg to the nlg plus |
---|
0:05:11 | by doing this |
---|
0:05:12 | the higher level model does not need to relearn and remodel the dialogue structure |
---|
0:05:16 | because it's provided to it |
---|
0:05:18 | due to the pre-trained dialogue modules |
---|
0:05:21 | instead the higher level model |
---|
0:05:23 | can focus on the necessary abstract modeling for the task of response generation |
---|
0:05:28 | which includes encoding complex natural language |
---|
0:05:31 | modeling the dialogue policy |
---|
0:05:33 | and generating language conditioned on some latent representation |
---|
0:05:37 | and it can leverage |
---|
0:05:38 | the already provided dialogue structure to do this |
---|
0:05:43 | so let's go through the structured fusion network piece by piece and see how we |
---|
0:05:47 | build it up |
---|
0:05:48 | we start out with these dialogue modules in grey here |
---|
0:05:51 | the combination between them is exactly what you saw in naive fusion |
---|
0:05:56 | first we're gonna add the nlu plus |
---|
0:05:59 | the nlu plus gets the outputted belief state |
---|
0:06:02 | and when it |
---|
0:06:03 | re encodes the dialogue context |
---|
0:06:05 | it has the already predicted belief state concatenated at every time step |
---|
0:06:10 | and in this way the encoder does not need to relearn the structure and can |
---|
0:06:14 | leverage the already computed belief state to better encode the |
---|
0:06:18 | the dialogue context |
---|
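The NLU+ re-encoding step might look roughly like this toy RNN sketch, where the predicted belief state is concatenated to the input at every time step; dimensions and weights are invented, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, BELIEF, HIDDEN = 4, 3, 5   # invented sizes

def nlu_plus_encode(word_embs, belief):
    """Re-encode the dialogue context with the already-predicted belief
    state concatenated to the input at every time step."""
    Wx = rng.normal(size=(HIDDEN, EMB + BELIEF))
    Wh = rng.normal(size=(HIDDEN, HIDDEN))
    h = np.zeros(HIDDEN)
    for x in word_embs:                           # one step per context token
        step_input = np.concatenate([x, belief])  # token + belief state
        h = np.tanh(Wx @ step_input + Wh @ h)
    return h                                      # final higher-level encoding

belief = np.array([1.0, 0.0, 1.0])                # pretend NLU output
context = rng.normal(size=(6, EMB))               # pretend token embeddings
h = nlu_plus_encode(context, belief)
```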
0:06:21 | next we're gonna add the dm plus |
---|
0:06:23 | and the dm plus |
---|
0:06:24 | initially |
---|
0:06:25 | it takes as input a concatenation of four different features |
---|
0:06:29 | the database vector the predicted dialogue act |
---|
0:06:32 | the predicted belief state |
---|
0:06:33 | and the final hidden state of the higher level encoder |
---|
0:06:36 | and then passes that through a linear layer |
---|
0:06:39 | by providing the structure in this way it's our hope that |
---|
0:06:41 | this sort of serves as the policy modeling component |
---|
0:06:44 | in this end-to-end model |
---|
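A minimal sketch of the DM+ as described, concatenating the four features and passing them through a linear layer; all dimensions here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
DB, ACT, BELIEF, HIDDEN, OUT = 3, 4, 5, 6, 7   # invented sizes

def dm_plus(db_vec, act, belief, encoder_final):
    """DM+: concatenate database vector, predicted dialogue act,
    predicted belief state, and the higher-level encoder's final
    hidden state, then apply a linear layer."""
    features = np.concatenate([db_vec, act, belief, encoder_final])
    W = rng.normal(size=(OUT, DB + ACT + BELIEF + HIDDEN))
    b = rng.normal(size=OUT)
    return W @ features + b    # latent "policy" vector passed on to the NLG+

z = dm_plus(rng.normal(size=DB), rng.normal(size=ACT),
            rng.normal(size=BELIEF), rng.normal(size=HIDDEN))
```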
0:06:48 | the nlg plus |
---|
0:06:50 | takes as input the output of the dm plus and uses that to |
---|
0:06:55 | initialize the hidden state and then interfaces with the nlg |
---|
0:06:59 | let's take a closer look into the nlg plus |
---|
0:07:03 | it relies on cold fusion |
---|
0:07:05 | so basically what this means is |
---|
0:07:07 | the nlg the conditioned language model gives us a sense of what the next word |
---|
0:07:12 | could be |
---|
0:07:14 | the decoder on the other hand |
---|
0:07:16 | is more |
---|
0:07:18 | is more so |
---|
0:07:19 | performing higher level reasoning |
---|
0:07:22 | and then |
---|
0:07:22 | we take the logits the output from the nlg about what the next word |
---|
0:07:26 | could be as well as the hidden state from the decoder |
---|
0:07:29 | about the representation of what we should be generating and combine them using cold fusion |
---|
0:07:36 | and then there's a cyclical relationship between the nlg and the higher level |
---|
0:07:40 | decoder |
---|
0:07:41 | in the sense that once cold fusion predicts what the next word should be through a |
---|
0:07:44 | combination of the decoder and nlg it passes that prediction both into the decoder |
---|
0:07:49 | and into the next time step of the nlg |
---|
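A simplified sketch of one cold-fusion decoding step in the spirit described here, gating a language-model feature with the decoder state before predicting the next word; this is a loose illustration with invented dimensions, not the exact cold fusion formulation from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN = 6, 4   # invented sizes

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def cold_fusion_step(decoder_h, nlg_logits):
    """Project the NLG's next-word logits into a feature, gate it
    using the decoder state, and predict the next word jointly."""
    Wl = rng.normal(size=(HIDDEN, VOCAB))
    lm_h = np.tanh(Wl @ nlg_logits)                       # language-model feature
    Wg = rng.normal(size=(HIDDEN, HIDDEN * 2))
    g = sigmoid(Wg @ np.concatenate([decoder_h, lm_h]))   # per-dimension gate
    Wo = rng.normal(size=(VOCAB, HIDDEN * 2))
    logits = Wo @ np.concatenate([decoder_h, g * lm_h])
    return int(np.argmax(logits))                         # fed back to both models

word = cold_fusion_step(rng.normal(size=HIDDEN), rng.normal(size=VOCAB))
```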
0:07:53 | and here's the final combination again which |
---|
0:07:56 | hopefully should make more sense |
---|
0:08:00 | so how do we train the structured fusion network |
---|
0:08:02 | because we have these modules there's three different ways that we can do it |
---|
0:08:06 | the first one is that we can freeze these modules |
---|
0:08:08 | we can freeze the modules since they're pre-trained |
---|
0:08:12 | and then just learn the higher level model on top |
---|
0:08:15 | another way is that we can fine tune these modules for the final task of |
---|
0:08:19 | dialog response generation |
---|
0:08:21 | and then of course we can multitask the modules where we |
---|
0:08:24 | simultaneously fine tune them for response generation and for their original tasks |
---|
0:08:30 | we use the multiwoz dataset and generally follow their experimental setup |
---|
0:08:34 | which means the same hyper parameters and because they use the ground truth belief state |
---|
0:08:38 | we do so as well |
---|
0:08:39 | and you can sort of think of this as the oracle nlu in |
---|
0:08:42 | our case |
---|
0:08:43 | for evaluation we use the same metrics which include bleu score |
---|
0:08:47 | inform rate which |
---|
0:08:49 | measures how often the system has provided the appropriate entities to the user |
---|
0:08:54 | and success rate which is how often the system |
---|
0:08:57 | answers all the attributes the user requests |
---|
0:09:00 | and then we use a combined score which they propose as well |
---|
0:09:03 | which is bleu plus the average of inform and success rate |
---|
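The combined score as described is straightforward to compute; the numbers in the example below are purely illustrative, not results from the paper:

```python
def combined_score(bleu, inform, success):
    """Combined score from the MultiWOZ setup: BLEU plus the
    average of inform rate and success rate (all in percent here)."""
    return bleu + (inform + success) / 2

# Illustrative numbers only (not the paper's reported results).
print(combined_score(20.0, 70.0, 60.0))   # -> 85.0
```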
0:09:07 | so let's take a look at our results |
---|
0:09:09 | first our baseline so as you see here seq2seq with attention gets a combined |
---|
0:09:14 | score of about eighty three point three six |
---|
0:09:17 | next naive fusion both zero shot which means that they're independently |
---|
0:09:21 | pre-trained and just combined at inference |
---|
0:09:23 | and then we also finetune for |
---|
0:09:25 | the task of response generation which does slightly better than the baseline |
---|
0:09:30 | multitasking does not do so well which sort of indicates that |
---|
0:09:33 | the loss functions may be pulling |
---|
0:09:35 | the weights in different directions |
---|
0:09:38 | structured fusion networks with frozen modules |
---|
0:09:41 | also do not do so well |
---|
0:09:43 | but as soon as we start fine tuning |
---|
0:09:45 | we get a significant improvement |
---|
0:09:47 | with strong improvements |
---|
0:09:49 | with slight improvements over these other models |
---|
0:09:51 | in bleu score and then very strong improvements in inform and success rate |
---|
0:09:55 | and we observe |
---|
0:09:57 | somewhat similar patterns with sfn and with multitasking |
---|
0:10:00 | and honestly the seems kind of |
---|
0:10:02 | intuitive when you think about it inform and success rate measure how often we inform |
---|
0:10:08 | the user of the appropriate entities and how often we provide the appropriate attributes |
---|
0:10:12 | and explicitly modeling the belief state explicitly modeling the system act |
---|
0:10:16 | should intuitively help with this |
---|
0:10:18 | if our model is explicitly aware of |
---|
0:10:21 | what attributes the user has requested it's going to better provide that information to the |
---|
0:10:25 | user |
---|
0:10:29 | but of course i talked about several different problems |
---|
0:10:32 | with neural models so let's see if structured fusion networks did anything for those problems |
---|
0:10:37 | the first problem that i mentioned is that neural models are very data hungry |
---|
0:10:41 | and i think that the added structure should result in less data hungry models |
---|
0:10:45 | so we compare seq2seq with attention and structured fusion networks |
---|
0:10:48 | at one percent five percent ten percent and twenty five percent of the training data |
---|
0:10:53 | on the left you see the inform rate graph and on the right you see |
---|
0:10:56 | the success rate graph |
---|
0:10:57 | and varying levels of percentage of data used |
---|
0:11:01 | so the inform rate |
---|
0:11:02 | we're at about thirty |
---|
0:11:04 | thirty percent inform rate with seq2seq |
---|
0:11:06 | and at fifty five |
---|
0:11:08 | with structured fusion networks |
---|
0:11:11 | of course this difference is really big when we're |
---|
0:11:14 | at very small amounts of data as in one percent |
---|
0:11:17 | and then it slowly comes together |
---|
0:11:19 | as we increase the data |
---|
0:11:21 | with success rate we're at about twenty |
---|
0:11:25 | with structured fusion networks |
---|
0:11:27 | and fairly close to about like two or three percent |
---|
0:11:30 | with seq2seq at one percent of the data |
---|
0:11:33 | so for extremely low data scenarios one percent which is about |
---|
0:11:36 | six hundred utterances |
---|
0:11:39 | we do |
---|
0:11:40 | really well with structured fusion networks |
---|
0:11:42 | and the difference |
---|
0:11:43 | remains at about like ten percent improvement across both metrics |
---|
0:11:49 | another problem i mentioned is domain generalisability |
---|
0:11:52 | the added structure should give us more generalisable models |
---|
0:11:55 | so what we do is we compare seq2seq and structured fusion networks |
---|
0:11:59 | by training on two thousand out of domain |
---|
0:12:02 | dialogue examples |
---|
0:12:03 | and fifty in domain examples |
---|
0:12:05 | where in domain is restaurant and then we evaluate entirely on the restaurant domain |
---|
0:12:11 | and what we see here is we get a sizable improvement in the combined score |
---|
0:12:15 | using structured fusion networks |
---|
0:12:17 | with stronger improvements in success and inform |
---|
0:12:19 | the bleu is slightly lower but this drop matches roughly |
---|
0:12:23 | what we saw when using all the data so i don't think it's a |
---|
0:12:27 | problem specific to generalisability |
---|
0:12:30 | the next problem and to me the most interesting one |
---|
0:12:33 | is divergent behavior with reinforcement learning |
---|
0:12:36 | training generative dialogue models with reinforcement learning |
---|
0:12:39 | often results in divergent behavior |
---|
0:12:42 | and degenerate output |
---|
0:12:44 | i'm sure that everybody here has seen the headlines where people claimed that facebook |
---|
0:12:48 | shut down their bot after it started inventing its own language really what happened |
---|
0:12:53 | was it started outputting |
---|
0:12:56 | stuff that doesn't look like english because it loses the structure as soon as you |
---|
0:13:00 | train it with reinforcement learning |
---|
0:13:02 | so why does this happen |
---|
0:13:04 | my theory about why this happens is the notion of the implicit language model |
---|
0:13:09 | seq2seq decoders have the issue of the implicit language model which basically means that the |
---|
0:13:13 | decoder simultaneously learns the task strategy |
---|
0:13:16 | as well as modeling language |
---|
0:13:18 | in image captioning this is very well observed |
---|
0:13:21 | and it's observed that the implicit language model overwhelms the decoder |
---|
0:13:25 | so basically what happens is |
---|
0:13:27 | if the image model detects that there's a giraffe |
---|
0:13:32 | the model always outputs a giraffe standing in a field |
---|
0:13:36 | even if the giraffe is not standing in a field just because |
---|
0:13:39 | that's what the language model has been |
---|
0:13:41 | trained to do |
---|
0:13:44 | in dialogue on the other hand this problem is slightly different in the sense that |
---|
0:13:48 | when we finetune dialogue models with reinforcement learning |
---|
0:13:51 | we're optimising for the strategy |
---|
0:13:53 | and ultimately causing it to unlearn the implicit language model |
---|
0:13:57 | so |
---|
0:13:59 | structured fusion networks have an explicit language model |
---|
0:14:02 | so maybe we don't have this problem |
---|
0:14:05 | so let's try structured fusion networks with reinforcement learning |
---|
0:14:09 | so for this we trained with supervised learning and then we freeze the dialogue modules |
---|
0:14:14 | and finetune only the higher level model with the reward inform rate plus success rate |
---|
0:14:20 | so we're optimising the higher level model for some dialogue strategy |
---|
0:14:23 | while relying on the structured components |
---|
0:14:26 | to maintain the structured nature of the model |
---|
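The RL setup described, tuning only the higher-level parameters while the pre-trained modules stay frozen, can be caricatured with a toy REINFORCE loop; everything here (dimensions, reward, learning rate) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one higher-level parameter vector is tuned with REINFORCE
# while the "pre-trained module" parameters stay frozen.
frozen_module = rng.normal(size=3)        # frozen features, never updated
theta = np.zeros(3)                       # higher-level model, updated

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def reward(action):
    return 1.0 if action == 1 else 0.0    # stand-in for inform + success rate

for _ in range(200):
    p = sigmoid(theta @ frozen_module)    # policy uses the frozen features
    action = int(rng.random() < p)        # sample an action
    # REINFORCE gradient for a Bernoulli policy: r * d(log pi)/d(theta)
    grad = (action - p) * frozen_module
    theta += 0.5 * reward(action) * grad  # only theta moves; modules stay frozen
```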
0:14:29 | and we compare to tiancheng zhao's work at naacl |
---|
0:14:33 | where he explored a similar problem |
---|
0:14:35 | and what we see is we get |
---|
0:14:38 | less divergence in language |
---|
0:14:39 | and fairly similar inform and success rate with the state-of-the-art combined score here |
---|
0:14:47 | so here are all the results for all the models that we compared |
---|
0:14:50 | throughout this presentation |
---|
0:14:53 | we see that |
---|
0:14:54 | adding structure in general seems to help |
---|
0:14:57 | and we get a sizable improvement over our baseline |
---|
0:14:59 | and |
---|
0:15:00 | the model is especially robust to reinforcement learning |
---|
0:15:04 | of course given how fast this field moves |
---|
0:15:07 | while our paper was in review somebody beat our results and we don't have state-of-the-art |
---|
0:15:11 | anymore |
---|
0:15:12 | but |
---|
0:15:12 | one of the core contributions of their work |
---|
0:15:15 | was improving dialogue act prediction |
---|
0:15:18 | and because structured fusion networks have this ability |
---|
0:15:22 | to leverage dialogue act predictions in an explicit component |
---|
0:15:27 | i think there's room for combination here |
---|
0:15:30 | so |
---|
0:15:31 | no dialogue paper is complete without human evaluation so what we did here was we |
---|
0:15:37 | asked mechanical turk workers |
---|
0:15:39 | to read the dialogue context and rate responses on a scale of one to five |
---|
0:15:43 | on the notion of appropriateness |
---|
0:15:46 | and what we see here is that |
---|
0:15:48 | structured fusion networks with reinforcement learning |
---|
0:15:51 | are rated slightly higher |
---|
0:15:54 | with |
---|
0:15:54 | ratings of four or more given |
---|
0:15:58 | more often i should add that everything in bold is statistically significant |
---|
0:16:02 | of course we have a lot more room |
---|
0:16:04 | to improve before we beat the human ground truth but i think adding structure to our |
---|
0:16:09 | models is the way to go |
---|
0:16:12 | thank you for your attention and the code is available here |
---|
0:16:20 | thank you for the talk |
---|
0:16:21 | so now we have |
---|
0:16:23 | actually |
---|
0:16:24 | quite some time for questions so any questions |
---|
0:16:31 | that's a |
---|
0:16:32 | a very interesting piece of work and it looks promising but |
---|
0:16:37 | do you have plans to extend the evaluation and look at whether |
---|
0:16:42 | the system with your architecture can actually engage in dialogue rather than replicating dialogues |
---|
0:16:47 | to the second question i think the structure should help us do that and maintain |
---|
0:16:52 | like not have the issue of when you start training models and evaluating models in an |
---|
0:16:58 | adaptive manner usually what happens is the errors propagate and i think that |
---|
0:17:02 | the structure should make that less likely to happen |
---|
0:17:07 | we |
---|
0:17:08 | i think that's something that we should definitely look into in the literature |
---|
0:17:11 | and just if you put up your comparative slides the first comparison i think |
---|
0:17:16 | you're too |
---|
0:17:18 | quick to |
---|
0:17:20 | cede the rank to the other one as having the |
---|
0:17:23 | the preferred performance because bleu i would say is not something that should be measured |
---|
0:17:29 | in this context it's |
---|
0:17:31 | they're doing much better than you in bleu but it's completely irrelevant whether you give |
---|
0:17:34 | exactly the same words as the original or not |
---|
0:17:37 | and you're actually doing much better in success that's true like my general |
---|
0:17:42 | feeling having looked at the data a lot is that |
---|
0:17:45 | for this type of task at least bleu does relatively well and i think in |
---|
0:17:48 | the original paper they did some correlation analysis with human judgement |
---|
0:17:52 | but i think like |
---|
0:17:53 | bleu like on its own will not measure quality of the system |
---|
0:17:57 | but more so what it's measuring is |
---|
0:18:00 | how structured the language is and how like |
---|
0:18:03 | you disagree |
---|
0:18:06 | okay that's fair i guess with multiple references maybe we can improve this |
---|
0:18:15 | so you have these three components and you said that they're pre-trained but |
---|
0:18:21 | what are they pre-trained on and the second question sorry during training do you |
---|
0:18:27 | also have intermediate supervision there or are they finetuned in an end-to-end fashion |
---|
0:18:34 | right okay good question |
---|
0:18:36 | let me just go back to that slide |
---|
0:18:39 | so in the multiwoz data |
---|
0:18:42 | they |
---|
0:18:43 | they give us the belief state and they give us the system dialogue act |
---|
0:18:46 | so what we do for pre-training these components is |
---|
0:18:50 | the nlu is pre-trained to go from context to belief state |
---|
0:18:53 | the dm from belief state to dialogue act |
---|
0:18:56 | and the nlg from dialogue act to response |
---|
0:18:59 | for your second question |
---|
0:19:00 | we do explore that in our multitask setup |
---|
0:19:04 | we do intermediate supervision but in the other two we don't |
---|
0:19:08 | so it seems to me that you use much more supervision than the usual |
---|
0:19:13 | sequence to sequence model which would be the reason for better performance rather than |
---|
0:19:19 | a different architecture no |
---|
0:19:21 | no like i completely agree with the point but i think |
---|
0:19:26 | the point of our paper is that doing this additional supervision |
---|
0:19:29 | and adding the structure into the model is something that people should |
---|
0:19:33 | be doing fair enough |
---|
0:19:33 | but i do understand that |
---|
0:19:35 | it's not necessarily the architecture on its own that's doing better cool thank you |
---|
0:19:42 | any other questions |
---|
0:19:46 | great talk it definitely looks promising so you talked a bit about |
---|
0:19:51 | generalisability and this issue of divergence with rl but you didn't touch much on the other |
---|
0:19:57 | issue you mentioned in the trade off at the beginning which was controllability and |
---|
0:20:02 | i'm wondering if you have some thoughts on that |
---|
0:20:05 | i guess some of the questions that come into my mind when we design models |
---|
0:20:08 | with respect to control is suppose i wanted it to behave a little bit differently in |
---|
0:20:12 | one case is there any way that this architecture can address that and the other way |
---|
0:20:17 | to look at it is suppose i wanted to invest in improving one of these |
---|
0:20:21 | components can i do it in any other way other than |
---|
0:20:25 | getting more data like how does the architecture support something in that sense okay |
---|
0:20:30 | that's a good question controllability isn't something that we looked at yet but |
---|
0:20:34 | it's definitely something that i do want to look at in the future just |
---|
0:20:37 | because i think doing something as simple as |
---|
0:20:39 | adding rules on top of the dialogue manager |
---|
0:20:42 | to just change and say like output this dialogue act instead if these conditions |
---|
0:20:45 | are met would work really well and the model does leverage those dialogue acts and |
---|
0:20:50 | like i've seen bad predictions from the lower level model |
---|
0:20:54 | result in poor outputs |
---|
0:20:57 | that's definitely something that we should look into in the future |
---|
0:21:01 | remind me what the second thing was oh |
---|
0:21:02 | the other part is is this architecture suitable for decomposability can i invest |
---|
0:21:08 | more on one component like |
---|
0:21:09 | does it lend itself to blame assignment in any sense better and does it you |
---|
0:21:12 | know |
---|
0:21:13 | so |
---|
0:21:15 | um like |
---|
0:21:17 | i'm not entirely sure |
---|
0:21:18 | for when we look at the final task of response generation |
---|
0:21:22 | but we do sort of have a sense just because of the intermediate supervision |
---|
0:21:26 | how well each of the respective lower level components are doing |
---|
0:21:29 | and what i can say is that the nlu does really well |
---|
0:21:32 | the |
---|
0:21:33 | the natural language generation does pretty well |
---|
0:21:36 | the main thing that's struggling is this |
---|
0:21:38 | this act of going from belief state to dialogue act |
---|
0:21:41 | and i think that if i was to recommend a component |
---|
0:21:45 | based on just this pre-training supervision |
---|
0:21:49 | to improve it would be the dialogue manager |
---|
0:21:52 | but like blame assignment in general for the response generation task |
---|
0:21:56 | isn't something that |
---|
0:21:58 | i think is really easy with the current state of the model but i think |
---|
0:22:01 | things might be able to be done to further interpret the model |
---|
0:22:09 | any more questions |
---|
0:22:14 | okay in that case i'll ask |
---|
0:22:18 | one of my own |
---|
0:22:21 | can you |
---|
0:22:23 | explain how exactly you know what is it that the |
---|
0:22:28 | dm and the dm plus predict how does it look like is it some |
---|
0:22:31 | kind of |
---|
0:22:33 | a like |
---|
0:22:36 | dialogue act embedding or is it explicit like a one |
---|
0:22:41 | hot |
---|
0:22:42 | so |
---|
0:22:44 | so you mean like the dialogue act vector or just i mean basically what |
---|
0:22:47 | when you look at the dm |
---|
0:22:51 | well i guess these are two different things when you look at the dm |
---|
0:22:56 | the output is a dialogue act right yes and the dm plus has something different so |
---|
0:23:02 | like okay |
---|
0:23:03 | so for the dm itself because of the supervision |
---|
0:23:07 | we're predicting the dialogue act which is a multi class label |
---|
0:23:10 | and it's basically just ones and zeros |
---|
0:23:13 | like a binary vector okay and that's like |
---|
0:23:17 | inform or request at a single slot level like inform restaurant available type thing right |
---|
0:23:25 | but then for the dm plus |
---|
0:23:28 | it's not structured in that sense and basically what we do is |
---|
0:23:32 | we just treat it as a linear layer that initialises |
---|
0:23:35 | the decoder's hidden state and in the original multiwoz paper they had this type |
---|
0:23:40 | of thing as well |
---|
0:23:41 | where they just had a linear layer between the encoder and decoder that combined |
---|
0:23:46 | more information into the hidden state |
---|
0:23:47 | and they called that the policy and |
---|
0:23:50 | that's sort of what we're hoping |
---|
0:23:52 | by adding the structure beforehand |
---|
0:23:54 | it's actually more like a policy rather than just a linear layer as before |
---|
0:23:58 | right okay thank you |
---|
0:24:01 | any more |
---|
0:24:04 | the last one |
---|
0:24:06 | did you try other baselines because sequence to sequence seems to be |
---|
0:24:10 | basic |
---|
0:24:12 | well we did try the other ways of combining the neural modules |
---|
0:24:15 | like naive fusion and multitasking those ones |
---|
0:24:19 | i can go to that slide |
---|
0:24:21 | but we didn't try transformers or anything like that and i think that |
---|
0:24:24 | that's something that we can look into in the future |
---|
0:24:27 | but we tried like naive fusion and multitasking which are different baselines that we |
---|
0:24:32 | came up with |
---|
0:24:33 | for actually leveraging the structure as well |
---|
0:24:37 | okay thank you thank you |
---|