But now we will listen to three papers that, as I said, went through the regular review process. The first one up is "DeepCopy: Grounded Response Generation with Hierarchical Pointer Networks", presented by Semih Yavuz.
Hello everyone, I'm Semih Yavuz, a PhD student at the University of California, Santa Barbara. Today I'm going to talk about our work on grounded response generation with hierarchical pointer networks. This is joint work with Abhinav Rastogi, Guan-Lin Chao, and Dilek Hakkani-Tür; we are at different places now, but this was actually work done at Google while I was an intern there last year. Okay, without further ado, let's start.
This paper is about building dialogue models for knowledge-grounded response generation. The problem we want to tackle here is basically to push these models toward more natural and engaging conversations. Previous papers in this domain have pointed out several problems that mostly come down to the models producing generic responses, and that is the basic problem this paper is trying to tackle.
Just to start with an example: say we have a user looking for an Italian restaurant in Los Altos. A response coming from a system like "Poppy's is a nice restaurant in Los Altos serving Italian food" would be a good response, but at the same time, how engaging does this response sound? I would probably prefer something that contains more information. In general, this is the scenario we target: enriching responses with more information.
So the question we ask is: what happens if we were able to use external knowledge to make the content of these responses more informative, or more engaging if you want to call it that? Let's say we have a model that can go and look at the reviews of the restaurant you want to recommend to the user, take pieces of information from these reviews, and then generate a response that looks like the first sentence but also mentions that some specific dishes there are quite popular. That would be a more engaging response to me.
So the general problem we are trying to solve is proposing models that incorporate external knowledge into response generation. Most of the early work in this domain tried to do this with sequence-to-sequence models; it is not exactly the same problem, but they try to model the dialogue without using external knowledge. This requires a lot of data in order to encode world knowledge into the model's parameters, and it has additional drawbacks: depending on the model, you might need to retrain it as new knowledge becomes available. Instead, we can think of this problem as incorporating the knowledge by adding it as an input to the model.
There is an early work that tries to achieve this: they take the conversation history, use additional facts, say from an external knowledge source, pick some of the knowledge from this resource, and incorporate it into the response generation. In this work, we go over the existing models that try to address this exact scenario and then propose further models that we think might be useful.
So the contributions we will talk about are models that try to incorporate external knowledge as an additional input. In more detail, I will go over some baselines, propose further baselines that are not covered in the literature but turn out to be useful models, and at the end I will talk about the model that we propose, which we think is helpful.
There are a bunch of fairly new datasets in this domain where you have conversations accompanied by external knowledge. One of them is the DSTC7 challenge from last year, with its sentence generation track: there are Reddit conversations, and you want to use the relevant external articles to generate better responses. There is Wizard of Wikipedia, with natural conversations between a learner and an expert, and there are other recently released datasets as well. In this work we use the ConvAI2 dataset. One of the reasons we worked on this one is that it does not need any retrieval step: the relevant facts for the dialogue are already given.
Let me describe the dataset in more detail. In this dataset there are two people, each assigned a persona, and they are asked to carry out a conversation based on their persona sentences. Some of the challenging properties of this dataset: you have some facts, the persona sentences, that you want to be able to incorporate into your response generation, which is one of the motivations for having the persona in the first place, but it is hard for the models to do that. There are also facts mentioned in the responses that do not come from the given persona, but which the model still has to be able to produce; that is another main challenge of this dataset. And there is a lot of variety in how people respond, which, judging from the statistics of the dataset, is hard to model.
Okay, so this is the dataset we are going to work with. Some evaluation metrics before we dive into the models: there will be automated metrics that are common for the sentence generation task, which is the main task of this challenge; a human evaluation where we ask humans to rate the responses generated by the models from one to five; at the end I will present some further analysis of the ability of the models to incorporate the facts they are presented with; and finally a diversity analysis, also an automated metric, to see whether the models can generate diverse responses.
The models come in two parts: the baseline models, which I will cover fairly quickly, and the models that we think are helpful for this task. Let's start with the sequence-to-sequence model with attention: you have the dialogue history, concatenated into a single sequence, then a sequence encoder, for which we use an LSTM, and a decoder that generates the response based on this. Then we have sequence-to-sequence with a single fact, where we additionally take the most relevant fact from the persona and append it to the context, so you have a longer sequence which also contains factual information, and you generate a response from that, as in the sketch below.
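As a concrete illustration, here is a minimal sketch of how the input for this single-fact baseline could be assembled; the <turn> and <fact> separator tokens are my own assumption for illustration, not necessarily the exact formatting used in the paper.

```python
# A minimal sketch of building the input for the "seq2seq with a single fact" baseline.
# The <turn> and <fact> separators are assumptions, not the paper's exact format.
def build_model_input(dialogue_history, fact=None):
    """Concatenate the dialogue history into one sequence, optionally appending a fact."""
    context = " <turn> ".join(dialogue_history)
    return context if fact is None else f"{context} <fact> {fact}"

print(build_model_input(
    ["hi , how can i help you ?", "i am looking for italian food in los altos"],
    fact="i love italian food",
))
```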
The most relevant fact is selected in two ways. The first is "best fact by context": you take the dialogue context and find the most relevant fact based on TF-IDF similarity. The second is "best fact by response", where the similarity is instead measured between the facts and the ground-truth response.
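A minimal sketch of what this selection step could look like with scikit-learn's TF-IDF utilities; the preprocessing details are assumptions, not the paper's exact setup.

```python
# Sketch of the fact-selection step using TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def most_relevant_fact(query, facts):
    """Return the fact with the highest TF-IDF cosine similarity to the query.

    For "best fact by context" the query is the dialogue context; for the cheating
    "best fact by response" variant it is the ground-truth response.
    """
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(facts + [query])  # shared vocabulary
    scores = cosine_similarity(matrix[:-1], matrix[-1]).ravel()
    return facts[scores.argmax()]

facts = ["i love italian food", "i have two dogs", "i live in los altos"]
print(most_relevant_fact("looking for an italian restaurant in los altos", facts))
```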
The second one is a cheating model, just to see whether, if we were able to provide the right fact, the model would be able to generate a better response.
Now some results. I will first present the main results, which are automated metrics such as perplexity and BLEU, together with a human evaluation of appropriateness. Here "no fact" is the first model, and what we see is that incorporating a single fact improves the perplexity, as you can see, and incorporating the cheating fact improves it even further, but it hurts appropriateness. One reason, and this is my hypothesis from looking at the results, is that the no-fact model generates very generic responses, which are sometimes rated higher than the ones that try to incorporate the fact but fail to do it well; that is the main reason why this happens. Another interesting thing here is that the appropriateness score of the ground-truth response is 4.4 out of five, so not perfect either; that is another challenge here.
Another line of baselines is memory networks. We encode the context again with a sequence model, take its representation, and attend on the facts. Each fact has a key representation, shown in green, which is basically a vector, and a value representation, shown in blue. We attend on the key representations, obtain a probability distribution over the facts, compute a summary vector out of the values, add it to the context vector, feed it to the decoder, and the decoder generates the response. We will call this the memory network model for this task.
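An illustrative numpy sketch of this key-value read, with the encoders abstracted away; the vectors here are random stand-ins, not the model's actual representations.

```python
# Key-value memory read: attend over fact keys, summarize fact values,
# and add the summary to the context vector that the decoder conditions on.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_network_read(context_vec, fact_keys, fact_values):
    """context_vec: (d,); fact_keys, fact_values: (n_facts, d)."""
    fact_probs = softmax(fact_keys @ context_vec)   # distribution over facts
    fact_summary = fact_probs @ fact_values         # weighted sum of value vectors
    return context_vec + fact_summary               # input to the decoder

rng = np.random.default_rng(0)
print(memory_network_read(rng.normal(size=8),
                          rng.normal(size=(4, 8)),
                          rng.normal(size=(4, 8))).shape)  # (8,)
```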
We also have another version of this, similar to a model covered in previous work, so again a baseline, where the decoder additionally has attention on the context itself; in the previous one there was no per-step decoder attention, but here there is. We also have a fact-attention version, where at every decoder step there is an additional attention on the facts, so when generating, the model can go back and look at the facts. And we have a memory network where both the fact and the context attention are enabled.
If we look at the results of these compared to the earlier baselines, we see that attention on only the facts, as you can see here, gives the best fact incorporation, and additionally the sequence-to-sequence models that we analyzed earlier compare favorably to the memory network models proposed by previous work.
On top of that, the next thing we realized is that the sequence-to-sequence models we analyzed fail to reproduce factual information, such as the examples I showed at the beginning. So we tried incorporating a copy mechanism into the baselines. For that, we basically adopted the pointer-generator network that was proposed two years ago. What it does is that at every decoder step you have a soft combination of word generation and copying of tokens from the input, so that even if something in the input is not in your vocabulary, you can still produce it.
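A sketch of one decoder step of this pointer-generator style mixture, in the spirit of See et al. (2017), with the attention and gating networks abstracted away as toy arrays; this is not the paper's exact implementation.

```python
# One decoder step: mix the generation distribution with a copy distribution
# over input tokens, using a soft switch p_gen.
import numpy as np

def pointer_generator_step(p_vocab, attention, source_ids, p_gen, extended_vocab_size):
    """p_vocab: (vocab_size,) softmax over the fixed vocabulary;
    attention: (src_len,) attention weights over input tokens;
    source_ids: (src_len,) id of each input token in the extended vocabulary;
    p_gen: scalar in [0, 1], probability of generating rather than copying."""
    p_final = np.zeros(extended_vocab_size)
    p_final[: len(p_vocab)] += p_gen * p_vocab                 # generate from vocabulary
    np.add.at(p_final, source_ids, (1 - p_gen) * attention)    # copy from the input
    return p_final

p = pointer_generator_step(np.array([0.7, 0.2, 0.1]),   # 3 frequent vocabulary words
                           np.array([0.5, 0.5]),         # attention over 2 input tokens
                           np.array([1, 4]),              # the second token is out-of-vocabulary
                           p_gen=0.8, extended_vocab_size=5)
print(p, p.sum())  # sums to 1.0
```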
As I said, this is important for producing factual information that may not be in the vocabulary. So what we do is take the sequence-to-sequence models that we explored at the beginning, add the copy mechanism to each one of them, and look at what happens. We immediately see that the copy mechanism improves all of them, and by a good margin. We also see that the model fed with the best fact by response, the cheating one, basically tells us that if you had a way to find the best fact from the response, you would be able to do pretty well; so it again serves as an upper bound.
Now we want to go further and see how we can make use of every token in every fact that is available to us, because the previous models either did not use all the facts, as with the sequence-to-sequence models, which just pick one fact and use that, or, as with the memory network models, used an entire summary of the facts as a whole. Now we want to see what happens if we are able to condition the response on every fact token. This might be important for copying the relevant pieces of information from the facts, even though you are not actually given the best fact.
The basis for this is what we call the multi-sequence-to-sequence model with hierarchical attention. The context encoder is the same, but for the fact encoding we also use an LSTM, so we have a contextual representation for every fact token. At every decoder step we take the decoder state and attend on the tokens of the facts, which gives us a distribution over the fact tokens; then we compute a context vector over these, which gives us the fact summaries. Then we do another attention over the fact summaries, which gives us a distribution over the facts, that is, which fact might be more important. We also have a context summary coming from the attention on the context, and then we have one more attention which attends on the fact summary and the context summary and combines them based on which one is more important. This is all soft attention, so the whole thing is differentiable. Then you generate your response, and the loss is the negative log-likelihood.
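Here is a numpy sketch of this hierarchical attention, assuming the LSTM-encoded states are already given; all the names are mine, chosen for illustration.

```python
# Token-level, fact-level, and fact-vs-context attention, all soft and differentiable.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hierarchical_attention(decoder_state, fact_token_states, context_states):
    """decoder_state: (d,); fact_token_states: list of (len_i, d); context_states: (ctx_len, d)."""
    # 1) Attend over the tokens of each fact -> one summary vector per fact.
    token_probs = [softmax(states @ decoder_state) for states in fact_token_states]
    fact_summaries = np.stack([p @ s for p, s in zip(token_probs, fact_token_states)])

    # 2) Attend over the fact summaries -> which fact matters right now.
    fact_probs = softmax(fact_summaries @ decoder_state)
    fact_summary = fact_probs @ fact_summaries

    # 3) Attend over the context, then softly combine the fact and context summaries.
    context_probs = softmax(context_states @ decoder_state)
    context_summary = context_probs @ context_states
    candidates = np.stack([fact_summary, context_summary])
    mix = softmax(candidates @ decoder_state)
    return mix @ candidates, fact_probs, token_probs, context_probs

rng = np.random.default_rng(0)
combined, *_ = hierarchical_attention(rng.normal(size=8),
                                      [rng.normal(size=(3, 8)), rng.normal(size=(5, 8))],
                                      rng.normal(size=(6, 8)))
print(combined.shape)  # (8,)
```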
DeepCopy is the main model that we propose in this paper. What we exploit here is that everything remains the same as in the previous model, but we use the attention probabilities over the context tokens and the fact tokens as the corresponding copy probabilities. As you can see, you have a distribution over the facts and a distribution over the tokens of every fact, so you can build a single distribution over the unique tokens in your facts. You also have another distribution over the context and the facts, and using those you can combine these two into, again, a single distribution, use that as the copy probabilities of the tokens, and then combine it with the generation distribution. So we already have generation probabilities over the vocabulary, we also have copy probabilities from the context tokens and the fact tokens, and we combine all of them into a single distribution.
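An illustrative sketch of how the fact-level and token-level attention weights could be collapsed into one copy distribution, following the description above rather than the authors' exact code; the result would then be mixed with the context copy and generation distributions as in the earlier pointer-generator sketch.

```python
# Collapse fact-level and token-level attention into one copy distribution over tokens.
import numpy as np

def fact_copy_distribution(fact_probs, token_probs, fact_token_ids, extended_vocab_size):
    """fact_probs: (n_facts,) attention over facts;
    token_probs: list of (len_i,) attention over each fact's tokens;
    fact_token_ids: list of (len_i,) extended-vocabulary id of each fact token."""
    copy_probs = np.zeros(extended_vocab_size)
    for p_fact, p_tokens, ids in zip(fact_probs, token_probs, fact_token_ids):
        np.add.at(copy_probs, ids, p_fact * p_tokens)  # P(fact) * P(token | fact)
    return copy_probs  # later mixed with the context copy and generation distributions
```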
If you look at the results, DeepCopy outperforms all the other models we have seen here on all of the main evaluation metrics. It is also important to note that the best-fact-by-context-plus-copy model that we analyzed is also a competitive model, which is good to keep in mind.
We also did a diversity analysis. This is a metric proposed in one of the previous works, looking at the diversity of the responses that are generated, and DeepCopy is also shown to perform well here compared to the other models.
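A common way to quantify this kind of diversity is a distinct-n style ratio of unique n-grams; here is a minimal sketch of that idea, though the exact metric used in the paper may be computed differently.

```python
# Ratio of unique n-grams to total n-grams over a list of tokenized responses.
def distinct_n(responses, n=2):
    ngrams = [tuple(resp[i:i + n])
              for resp in responses
              for i in range(len(resp) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

print(distinct_n([["i", "love", "italian", "food"],
                  ["i", "love", "los", "altos"]], n=2))  # 5 unique / 6 total
```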
Here is an example where we can see that DeepCopy achieves what we wanted it to do: it can attend on the right persona fact, highlighted here, without knowing beforehand which one is relevant, and it can copy exactly the relevant pieces from that fact and from the current context of the dialogue. You can also see that it can copy and generate at the same time, so it can switch between the two modes.
To summarize: we propose a general model that can take a query, which is the dialogue context in this case, together with external knowledge, which is a set of facts in unstructured text, and generate a response out of them. We propose strong baselines on top of this, and we show that the proposed model performs favorably compared to the existing ones in the literature. That's it, thank you for listening; I can take questions.
Okay, do we have any questions in the audience? Over there.
Hi, this is [inaudible]. A quick question: you said that for the copy, instead of focusing on only one fact, you focus on a few facts. So does it compute the weights of all the facts and then do a weighted sum, instead of just picking the top few?

Are you asking what we do in the proposed model? In the proposed model you basically feed in all the facts, and it can choose which ones to use.

So it doesn't pick exactly one; it actually computes a soft representation out of all of them? Okay, and then uses that as a weighted sum over the vocabulary?
Well, there is actually a copy part. In the copy part you have a vocabulary from which you can generate; say it is of size 5k, so these are the frequent words, right? You have a distribution over this. Then you also have a distribution over the unique tokens that appear either in the facts or in the dialogue context, and a way to combine the two. So you can induce a single probability distribution out of all of this, and you end up with a single output distribution which is computed in a soft way, which means it is differentiable, so you can just train it with the negative log-likelihood.

Okay.
I have a question about the human evaluation: you have appropriateness as one of the measures. The motivation for this work was to create more engaging responses, and appropriateness does not sound like it measures whether they are engaging, so what was the actual instruction given to the raters?
That's a good question, and actually that was something I wanted to address. Basically we have two human evaluations: one is appropriateness, and the other is the fact-inclusion analysis. The fact-inclusion analysis is more relevant to measuring whether a response is more engaging or not, but it is not entirely that, for the following reason. These are binary metrics that we have humans rate. "Fact inclusion" asks whether the response includes a fact; it does not have to be from the persona, it could be a fact outside the given five, or it could come from the dialogue context. Then there are follow-ups: how much of it is coming from the persona, and how much of it is coming from the conversation. That is what we asked the humans. This metric tells you a bit about engagingness, but not exactly, for the following reason: if you look at the ground-truth score, which is the main reference here, for factual information included from the persona, even the ground truth does this only about fifty percent of the time. It means that even the ground-truth responses do not cover the persona facts all the time, because, if you think of an actual conversation between two people, five facts cannot cover the complexity of such a conversation, right? That is why this is also not a perfect metric. So what I am trying to say is that measuring engagingness is a little more difficult. We tried to measure it this way, by looking at whether the generated response includes a relevant fact, but we do not have a perfect evaluation for that.