And the next speaker we have is Shikib Mehri, with the paper on Structured Fusion Networks for Dialog, which uses an end-to-end dialogue model. So, please, Shikib.
Hi, I'm here today to talk about Structured Fusion Networks for Dialog. This work was done with Tejas Srinivasan and my advisor Maxine Eskenazi.
Okay, let's talk about neural models of dialogue. Neural dialogue systems do really well on the task of dialogue generation, but they have several well-known shortcomings: they need a lot of data to train, they struggle to generalize to new domains, they are difficult to control, and they exhibit divergent behavior when tuned with reinforcement learning.
On the other hand, traditional pipelined dialogue systems have structured components that allow us to easily generalize, interpret, and control these systems. Both of these systems have their respective advantages and disadvantages. Neural dialogue systems can learn from data, and they can learn higher-level reasoning, or a higher-level policy. Pipelined systems, in contrast, are very structured in nature, which has several benefits.
Yesterday there was this question in the panel of "to pipeline or not to pipeline," and to me the obvious answer seems to be: why not both? I think combining these two approaches is a very intuitive thing to do. So how do we go about combining them?
In pipelined systems we have structured components, so the very first thing to do to bring that structure to neural dialogue systems is to emulate these components. Using the MultiWOZ dataset, we first define and train several neural dialogue modules: one for the NLU, one for the DM, and one for the NLG.
For the NLU, what we do is read the dialogue context, encode it, and then ultimately make a prediction about the belief state. For the dialogue manager, we look at the belief state as well as some vectorized representation of the database, pass that through several linear layers, and ultimately predict the system dialogue act. For the NLG, we have a conditioned language model where the initial hidden state is a linear combination of the dialogue act, the belief state, and the database vector, and then at every time step the model outputs what the next word should be, to ultimately generate the response.
So we have these three neural dialogue modules that emulate the structured components of traditional pipelined systems. Given these three components, how do we actually go about building a system for dialogue generation?
Well, the simplest thing to do is naive fusion, where we train these modules and then just combine them naively during inference: instead of passing the ground-truth belief state to the dialogue manager, which is what we would do during training, we make a prediction using our trained NLU and then pass that into the dialogue manager.
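For readers of the transcript, the chaining described above can be sketched in a few lines. These functions are toy stand-ins for the neural NLU, DM, and NLG; the names, the tiny database, and the rules are all illustrative assumptions, not the paper's models:

```python
def nlu(context):
    """Toy NLU: predict a belief state from the dialogue context."""
    return {"food": "italian"} if "italian" in context else {}

def dm(belief_state, db):
    """Toy DM: pick a system dialogue act from the belief state and a DB lookup."""
    if not belief_state:
        return ("request", "food")
    matches = [row for row in db
               if all(row.get(k) == v for k, v in belief_state.items())]
    return ("inform", matches[0]["name"]) if matches else ("request", "food")

def nlg(act):
    """Toy NLG: generate a response conditioned on the dialogue act."""
    kind, arg = act
    return f"How about {arg}?" if kind == "inform" else f"What {arg} would you like?"

DB = [{"name": "Pizzeria Roma", "food": "italian"}]

def naive_fusion(context):
    # During training the DM would see the ground-truth belief state; at
    # inference it gets the NLU's *prediction* instead.
    return nlg(dm(nlu(context), DB))

print(naive_fusion("i want italian food"))  # How about Pizzeria Roma?
```

The key point is only the wiring: the modules are trained independently, and at inference the DM consumes the NLU's prediction where it previously saw the ground truth.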
Another way of using these dialogue modules, after training them independently, is multitasking, where we simultaneously learn the dialogue modules as well as the final task of dialogue response generation. So we have the three independent modules here, and then these red arrows that correspond to the forward propagation for the task of response generation. Sharing the parameters in this way results in more structured components: now the encoder is being used both for the task of the NLU and for the task of response generation, so it has this notion of structure in it.
Another way, which is the primary novel work in our paper, is structured fusion networks. Structured fusion networks aim to learn a higher-level model on top of pre-trained neural dialogue modules. Here's a visualization of structured fusion networks; don't worry if it seems like spaghetti, I'll come back to it.
What we have here are the original dialogue modules, the NLU, the DM, and the NLG, in these small grey boxes in the middle, and then we define these black boxes around them that consist of a higher-level module. So the NLU gets upgraded to the NLU+, the DM to the DM+, and the NLG to the NLG+. By doing this, the higher-level model does not need to relearn and remodel the dialogue structure, because it is provided to it through the pre-trained dialogue modules. Instead, the higher-level model can focus on the necessary abstract modeling for the task of response generation, which includes encoding complex natural language, modeling the dialogue policy, and generating language conditioned on some latent representation, and it can leverage the already-provided dialogue structure to do this.
So let's go through the structured fusion network piece by piece and see how we build it up. We start out with these dialogue modules in grey; the combination between them is exactly what you saw in naive fusion. First, we add the NLU+. The NLU+ gets the output belief state, and when it re-encodes the dialogue context, it has the already-predicted belief state concatenated at every time step. In this way, the encoder does not need to relearn the structure and can leverage the already-computed belief state to better encode the dialogue context.
Next, we add the DM+. The DM+ takes as input a concatenation of four different features: the database vector, the predicted dialogue act, the predicted belief state, and the final hidden state of the higher-level encoder, and then passes that through a linear layer. By providing the structure in this way, it's our hope that this serves as the policy-modeling component in this end-to-end model.
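As a rough sketch of the DM+ wiring just described; all dimensions and weights here are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; the real model's dimensions are not given in the talk.
DB_DIM, ACT_DIM, BS_DIM, ENC_DIM, DEC_DIM = 6, 5, 4, 8, 8

W = rng.standard_normal((DEC_DIM, DB_DIM + ACT_DIM + BS_DIM + ENC_DIM)) * 0.1
b = np.zeros(DEC_DIM)

def dm_plus(db_vec, act_vec, belief_vec, enc_final_hidden):
    """Concatenate the four features and pass them through a linear layer.

    The output initializes the decoder's hidden state, which is the sense
    in which this layer hopefully plays the role of the policy.
    """
    features = np.concatenate([db_vec, act_vec, belief_vec, enc_final_hidden])
    return np.tanh(W @ features + b)

h0 = dm_plus(np.ones(DB_DIM), np.ones(ACT_DIM), np.ones(BS_DIM), np.ones(ENC_DIM))
```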
The NLG+ takes as input the output of the DM+, uses that to initialize its hidden state, and then interfaces with the NLG.
Let's take a closer look at the NLG+. It relies on cold fusion. Basically, what this means is that the NLG, a conditioned language model, gives us a sense of what the next word could be; the decoder, on the other hand, is more so performing higher-level reasoning. We take the logits, the output from the NLG about what the next word could be, as well as the hidden state from the decoder, the representation of what we should be generating, and combine them using cold fusion. There is then a cyclical relationship between the NLG and the higher-level decoder, in the sense that once cold fusion predicts what the next word should be through the combination of the decoder and the NLG, it passes that prediction both into the decoder and into the next time step of the NLG.
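A minimal sketch of one cold-fusion step as described above, assuming toy dimensions and random weights in place of learned ones; the gating form follows the general cold-fusion recipe, and the paper's exact parameterization may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

HID, VOCAB, LM_DIM = 16, 10, 8   # toy sizes, for illustration only

# Random parameters standing in for learned weights.
W_lm   = rng.standard_normal((LM_DIM, VOCAB)) * 0.1   # projects NLG logits
W_gate = rng.standard_normal((LM_DIM, HID + LM_DIM)) * 0.1
W_out  = rng.standard_normal((VOCAB, HID + LM_DIM)) * 0.1

def cold_fusion(decoder_hidden, nlg_logits):
    """Fuse the NLG's next-word logits with the decoder's hidden state.

    A gate decides, per dimension, how much of the language-model feature
    to let through; the fused vector is then projected to output logits.
    """
    h_lm = np.tanh(W_lm @ nlg_logits)                 # LM feature
    g = 1 / (1 + np.exp(-W_gate @ np.concatenate([decoder_hidden, h_lm])))
    fused = np.concatenate([decoder_hidden, g * h_lm])
    return W_out @ fused                              # fused next-word logits

logits = cold_fusion(rng.standard_normal(HID), rng.standard_normal(VOCAB))
next_word = int(np.argmax(logits))  # fed back to both decoder and NLG
```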
And here's the final combination again, which hopefully makes more sense now.
So how do we train the structured fusion network? Because we have these modules, there are three different ways we can do it. The first is that we can freeze the modules, which is an obvious option since they're pre-trained, and then just learn the higher-level model on top. Another way is that we can fine-tune the modules for the final task of dialogue response generation. And of course we can multitask the modules, where we simultaneously fine-tune them for response generation and for their original tasks.
We use the MultiWOZ dataset and generally follow their experimental setup, which means the same hyperparameters, and because they use the ground-truth belief state, we do so as well; you can sort of think of this as an oracle NLU in our case. For evaluation we use the same metrics, which include BLEU score; inform rate, which measures how often the system has provided the appropriate entities to the user; and success rate, which is how often the system answers all the attributes the user requests. Then we use a combined score, which they propose as well, which is BLEU plus the average of inform and success rate.
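The combined score is simple enough to write down directly; the numbers in the example are hypothetical, just to show the arithmetic:

```python
def combined_score(bleu, inform_rate, success_rate):
    """Combined = BLEU + (Inform + Success) / 2, the MultiWOZ-style metric."""
    return bleu + (inform_rate + success_rate) / 2

# Hypothetical numbers, only to illustrate the formula:
print(combined_score(18.0, 70.0, 60.0))  # 18.0 + 65.0 = 83.0
```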
So let's take a look at our results. First, our baseline: as you see here, seq2seq with attention gets a combined score of about 83.36. Next we have naive fusion, both zero-shot, which means the modules are independently pre-trained and just combined at inference, and fine-tuned for the task of response generation, which does just slightly better than the baseline. Multitasking does not do so well, which sort of indicates that the loss functions may be pulling the weights in different directions. Structured fusion networks with frozen modules also do not do so well, but as soon as we start fine-tuning, we get a significant improvement: slight improvements over these other models in BLEU score, and very strong improvements in inform and success rate.
We observe somewhat similar patterns with SFN and with multitasking. Honestly, this seems kind of intuitive when you think about it: inform rate and success rate measure how often we inform the user of the appropriate entities and how often we provide the appropriate attributes, and explicitly modeling the belief state and the system act should intuitively help with this. If our model is explicitly aware of what attributes the user has requested, it's going to better provide that information to the user.
But of course, I talked about several problems with neural models, so let's see if structured fusion networks did anything about those problems. The first problem I mentioned is that neural models are very data-hungry, and I think that the added structure should result in less data-hungry models. So we compare seq2seq with attention and structured fusion networks at one percent, five percent, ten percent, and twenty-five percent of the training data. On the left you see the inform rate graph, and on the right the success rate graph, at varying percentages of data used.
For inform rate, we're at about thirty percent with seq2seq and about fifty-five with structured fusion networks. Of course, this difference is really big when we're at very small amounts of data, as in one percent, and then it slowly comes together as we increase the data. For success rate, we're at about twenty with structured fusion networks, while seq2seq is fairly close to about two or three percent at one percent of the data. So for extremely low-data scenarios, one percent, which is about six hundred utterances, we do really well with structured fusion networks, and the difference remains at about a ten percent improvement across both metrics.
Another problem I mentioned is domain generalizability. The added structure should give us more generalizable models. So we compare seq2seq and structured fusion networks by training on two thousand out-of-domain dialogue examples and fifty in-domain examples, where the in-domain is restaurant, and then we evaluate entirely on the restaurant domain. What we see is that we get a sizable improvement in the combined score using structured fusion networks, with stronger improvements in success and inform. The BLEU is slightly lower, but this drop roughly matches what we saw when using all the data, so I don't think it's a problem specific to generalizability.
The next problem, and to me the most interesting one, is divergent behavior with reinforcement learning. Training generative dialogue models with reinforcement learning often results in divergent behavior and degenerate output. I'm sure everybody here has seen the headlines where people claimed that Facebook AI shut down their bot after it started inventing its own language. Really, what happened was that it started outputting stuff that doesn't look like English, because the model loses the structure as soon as you train it with reinforcement learning.
So why does this happen? My theory is the notion of the implicit language model. Stacked decoders have the issue of the implicit language model, which basically means that the decoder simultaneously learns the response strategy as well as modeling language. In image captioning this is very well observed, and it's observed that the implicit language model overwhelms the decoder. So basically, if the image model detects that there's a giraffe, the model always outputs "a giraffe standing in a field," even if the giraffe is not standing in a field, just because that's what the language model has been trained to do.
In dialogue, on the other hand, this problem is slightly different, in the sense that when we fine-tune dialogue models with reinforcement learning, we're optimizing for the strategy and ultimately causing the model to unlearn the implicit language model. Structured fusion networks have an explicit language model, so maybe we don't have this problem.
So let's try structured fusion networks with reinforcement learning. For this, we train with supervised learning, then freeze the dialogue modules and fine-tune only the higher-level model, with inform rate and success rate as the reward. So we're optimizing the higher-level model for some dialogue strategy while relying on the structured components to maintain the structured nature of the model.
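The freeze-and-fine-tune setup can be caricatured as follows; parameter names, values, and the update rule are illustrative assumptions, the point being only that frozen module weights never move:

```python
# Toy parameters: three frozen pre-trained modules plus a higher-level model.
params = {
    "nlu": [0.5], "dm": [0.2], "nlg": [0.1],        # frozen dialogue modules
    "encoder_plus": [0.3], "decoder_plus": [0.4],   # higher-level model
}
FROZEN = {"nlu", "dm", "nlg"}

def apply_reward_update(params, grads, lr=0.01):
    """Apply a policy-gradient-style update, skipping the frozen modules."""
    for name, g in grads.items():
        if name in FROZEN:
            continue  # the dialogue structure is preserved untouched
        params[name] = [p + lr * gi for p, gi in zip(params[name], g)]
    return params

grads = {name: [1.0] for name in params}   # pretend reward gradients
apply_reward_update(params, grads)
```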
We compare to Tiancheng Zhao's work at NAACL, which explored a similar problem, and what we see is that we get less divergence in language and a fairly similar inform and success rate, with the state-of-the-art combined score here.
Here are the results for all the models we compared throughout this presentation. We see that adding structure in general seems to help, we get a sizable improvement over our baseline, and the model is especially robust to reinforcement learning. Of course, given how fast this field moves, while our paper was in review somebody beat our results, and we don't have state of the art anymore.
But one of the core contributions of their work was improving dialogue act prediction, and because structured fusion networks have the ability to leverage dialogue act predictions in an explicit component, I think there's room for combination here.
No dialogue paper is complete without human evaluation, so what we did here was ask Mechanical Turk workers to read the dialogue context and rate responses on a scale of one to five on the notion of appropriateness. What we see is that structured fusion networks with reinforcement learning are rated slightly higher, with ratings of four or more given more often. Everything in bold is statistically significant.
Of course, we have a lot more room to improve before we beat the human ground truth, but I think adding structure to our models is the way to go. Thank you for your attention, and the code is available here.
Thank you for the talk. So now we actually have quite some time for questions. Any questions?
This is very interesting work and looks promising, but do you have plans to extend the evaluation, looking at whether a system with your architecture can actually engage in dialogue rather than replicating dialogues?
To your second question: I think the structure should help us do that. When you start training and evaluating models in an interactive manner, usually what happens is that the errors propagate, and I think the structure should make that less likely to happen. That's something we should definitely look into.
And if you could put up your comparative slides, the first one: I think you're a bit too quick to concede the ranking to the other model as having the preferred performance, because BLEU, I would say, is not something that should be measured in this context. They're doing much better than you in BLEU, but it's completely irrelevant whether you give exactly the same words as the original or not, and you're actually doing much better in success.

That's true. My general feeling, having looked at the data a lot, is that for this type of task at least, BLEU does relatively well, and I think in the original paper they did some correlation analysis with human judgment. But I agree that BLEU on its own will not measure the quality of the system; what it's measuring is more so how structured the language is.

You disagree?

Okay, that's fair. I guess with multiple references maybe we can improve this.
So you have these three components, and you said that they're pre-trained. On what are they pre-trained? And the second question, sorry: during training, do you also have intermediate supervision there, or are they fine-tuned in an end-to-end fashion?
Right, good question. Let me just go back to that slide. So in the MultiWOZ data, they give us the belief state and they give us the system dialogue act. What we do for pre-training these components is: the NLU is pre-trained to go from context to belief state, the DM from belief state to dialogue act, and the NLG from dialogue act to response. For your second question: in our multitask setup we do use intermediate supervision, but in the other two we don't.
So it seems to me that you use much more supervision than the usual sequence-to-sequence model, which would be the reason for the better performance, rather than the different architecture, no?

No, like, I completely agree with that point. I think a point of our paper is that doing this additional supervision and adding the structure into the model is something that people should be doing.

Fair enough.

But I do understand that it's not necessarily the architecture on its own that's doing better.

Cool, thank you.
Any other questions?
Great talk, thank you so much; it looks promising. So you talked a bit about generalizability and about this issue of divergence with RL, but you didn't touch much on the other issue you mentioned in the trade-off at the beginning, which was controllability, and I'm wondering if you have some thoughts on that. I guess some of the questions that come to my mind when we design models with respect to control: suppose I wanted it to behave a little bit differently in one case, is there any way that this architecture can address that? And the other way to look at it: if I'm interested in investing in improving one of these components, can I do it in any way other than getting more data? How does the architecture afford something in that sense?
That's a good question. Controllability isn't something that we've looked at yet, but it's definitely something that I want to look at in the future, just because I think doing something as simple as adding rules on top of the dialogue manager, to just change it and say, output this dialogue act instead if these conditions are met, would work really well. The model does leverage those dialogue acts; like, I've seen bad predictions from the lower-level model result in poor outputs. That's definitely something we should look into in the future.
Sorry, remind me, what was the second thing?

The other part is: is this architecture suitable for decomposability? If I want to invest more in one component, like, can we do blame assignment in any sense?
So, I'm not entirely sure when we look at the final task of response generation, but we do sort of have a sense, just because of the intermediate supervision, of how well each of the respective lower-level components is doing. What I can say is that the NLU does really well, the natural language generation does pretty well, and the main thing that's struggling is this step of going from belief state to dialogue act. So if I were to recommend a component to improve, based on just the pre-training supervision, it would be the dialogue manager. But blame assignment in general for the response generation task isn't something that I think is really easy with the current state of the model, though I think things could be done to further interpret the model.
Any more questions?
Okay, in that case I'll ask one of my own. Can you explain how exactly, you know, what is it that the DM and the DM+ predict? What does it look like? Is it some kind of dialogue act embedding, or is it explicit, like a one-hot?
So you mean, like, the dialogue act vector? Well, these are, I guess, two different things: when you look at the DM, the output is a dialogue act, right?

Yes, and the DM+ has something different, so...

Okay. So for the DM itself, because of the supervision, we're predicting the dialogue act, which is a multi-class label, and it's basically just ones and zeros, like a binary vector.

Okay, and that's like inform or request at a single-slot level, "inform restaurant available," that type of thing, right?
But then for the DM+, it's not as structured in that sense; basically, we just treat it as a linear layer that initializes the decoder's hidden state. In the original MultiWOZ paper they had this type of thing as well, where they just had a linear layer between the encoder and decoder to combine more information into the hidden state, and they called that the policy. That's sort of what we're hoping: that by adding the structure beforehand, it's actually more like a policy rather than just a linear layer.
Right, okay, thank you. Anything more? Okay, the last one.
Did you try other baselines? Because seq2seq seems to be quite basic.
Well, we did try the other ways of combining the neural modules, the naive fusion and the multitasking ones; I can go to that slide. But we didn't try transformers or anything like that, and I think that's something we can look into in the future. We did try naive fusion and multitasking, which are different baselines that we came up with for actually leveraging the structure as well.
Okay, thank you.

Thank you.