So, this is joint work with colleagues.
So, as you know, neural architectures have become increasingly popular for the development of conversational agents, and one major advantage of these approaches is that they can be learned from raw, unannotated dialogues without needing much domain knowledge or feature engineering. However, they also require large amounts of training data because they have a large parameter space.
So usually we use large online resources to train them, such as Twitter conversations, technical web forums like the Ubuntu chat logs, movie scripts, and movie subtitles, that is, subtitles for movies and TV series. These resources are undeniably useful, but they all face some limitations in terms of dialogue modelling. We could talk for a long time about these, but I would like to point out just two limitations, especially ones that are important for subtitles.
One of these limitations is that for movie subtitles we don't have any explicit turn structure. The corpus itself is only a sequence of sentences together with timestamps for the start and end times, but we don't know who is speaking, because of course the subtitles do not come bundled with the audio track and video where you can see who is speaking at a given time. So we don't know who is speaking, and we don't know whether a sentence answers another turn or is a continuation of the current turn.
In this particular example, the actual turn structure is the following. As you can see, there are some strong cues: the timestamps can be used in a few cases, and there are lexical and syntactic cues that can be used to infer the turn structure, but you never have the ground truth. That is an important disadvantage when you actually want to build a system that generates responses, and not just continuations, in a given dialogue.
Another limitation is that many of these data contain references to named entities that might be absent from the inputs, in particular fictional characters. They often refer to a context which is external to the dialogue and which cannot be captured by the inputs alone. In this particular case, "Mister Holmes" is an input for which you would need access to an external context in order to make sense of what is happening. There are other limitations of course, but I just wanted to point out two important ones.
So how do we deal with these problems? The key idea I am going to present here starts from the fact that not all examples of context-response pairs are equally useful or relevant for building conversational models. Some examples, as Oliver Lemon showed in his keynote, might even be detrimental to the development of your model. So we can view this as a kind of domain adaptation problem: there is some discrepancy between the context-response pairs that we observe in a corpus and the ones that we wish to encode in our neural conversational model for the particular application that we want.
The proposed solution is one that is very well known in the field of domain adaptation, which is simply the inclusion of a weighting model. We try to map each pair of context and response to a particular weight value that corresponds to its importance, its quality if you want, for the particular purpose of building a conversational model. So how do we assign these weights?
Of course, due to the sheer size of our corpora, we cannot annotate each pair manually, and even handcrafted rules may be difficult to apply in many cases, because the quality of examples might depend on multiple factors that interact in complex ways. So what we propose here is a data-driven approach where we learn a weighting model from examples of high-quality responses. Of course, what constitutes a high-quality response might depend on the particular objectives and the particular type of conversational model that one wishes to build, so there is no single answer to what constitutes a high-quality response. But if you have some idea of which kinds of responses you want and which ones you don't want, you can often select a subset of high-quality responses and learn a weighting model from these.
The weighting model uses a neural architecture which is the following. As you can see here, we have two recurrent neural networks with shared weights: an embedding layer and a recurrent layer with LSTM or GRU units. These two networks respectively encode the context and the response, as sequences of tokens, into fixed-size vectors, which are then fed to a dense layer. This dense layer can also incorporate additional inputs, for instance document-level factors: if you have some features that are specific to the whole dialogue and that may be of interest for calculating the weights, you can incorporate them in this dense layer. For the subtitles, for instance, we also have information about the time gaps between the context and the response, and that is something that can be used as well. So we include all these inputs in this final dense layer, which then outputs a weight for a given context-response pair. So that's the model.
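For illustration only, here is a minimal sketch of such a weighting model in Keras. The layer sizes, the `extra_features` input (standing in for document-level factors such as the time gap) and all variable names are assumptions, not the exact configuration used in this work.

```python
# Minimal sketch of the weighting model described above (hypothetical sizes and names).
from tensorflow.keras import layers, Model

vocab_size, embed_dim, hidden_dim, max_len, n_extra = 20000, 128, 256, 60, 2

# Embedding + recurrent encoder, shared between the context and the response.
embedding = layers.Embedding(vocab_size, embed_dim, mask_zero=True)
encoder = layers.LSTM(hidden_dim)

context_in = layers.Input(shape=(max_len,), name="context_tokens")
response_in = layers.Input(shape=(max_len,), name="response_tokens")
extra_in = layers.Input(shape=(n_extra,), name="extra_features")  # e.g. time gap (assumed)

context_vec = encoder(embedding(context_in))
response_vec = encoder(embedding(response_in))

# Dense layer over both encodings plus the document-level features,
# producing a single weight in (0, 1) for the context-response pair.
merged = layers.Concatenate()([context_vec, response_vec, extra_in])
hidden = layers.Dense(hidden_dim, activation="relu")(merged)
weight = layers.Dense(1, activation="sigmoid", name="weight")(hidden)

weighting_model = Model([context_in, response_in, extra_in], weight)
weighting_model.compile(optimizer="adam", loss="binary_crossentropy")
```

In a sketch like this, the model could be trained with the selected high-quality pairs as positive examples and, for instance, randomly sampled pairs as negatives.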
Once we have learned a weighting model from examples of high-quality responses, we can apply it to the full training data to assign a particular weight to each pair. We can then include these weights in the empirical loss that we minimize when we train the neural model. The exact formula for the empirical loss might depend on what kind of model you are building and what kind of loss function you are using, but the key idea is that the loss function calculates some kind of distance between what the model produces and the ground truth, and you then weight this loss by the weight value computed from the weighting model. So it is a kind of two-pass procedure, where you first calculate the weight of your example and then, given this weight and the output of your neural model, you calculate the empirical loss and optimize the parameters against this weighted sum. So that's the model and the way the weights are integrated at training time.
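In a framework like Keras, this two-pass procedure can be sketched by passing the precomputed weights as per-example sample weights, so that each example's loss term is multiplied by its weight before the gradient step. The snippet below is only a sketch and assumes the hypothetical `weighting_model`, `conversational_model` and training arrays named here.

```python
# First pass (sketch): score every context-response pair in the full training data.
weights = weighting_model.predict([train_context, train_response, train_extra]).ravel()

# Second pass (sketch): train the conversational model with a weighted empirical loss,
#   L(theta) = sum_i w_i * loss(f_theta(x_i), y_i)
# Keras applies this per-example multiplication when sample weights are provided.
conversational_model.fit(
    [train_context, train_response],
    train_labels,            # 1 for observed pairs, 0 for sampled negatives
    sample_weight=weights,
    batch_size=128,
    epochs=5,
)
```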
So how do we evaluate the models? We evaluate using only retrieval-based neural models, because the evaluation metrics are more clearly defined than for generative models. Retrieval-based models seek to compute a score for a given context-response pair, reflecting how relevant the response is given the context; you can then use this score to rank possible responses and select the most relevant one. The training data comes from OpenSubtitles, a large corpus of dialogues that we released last year. We compare three models: a classical TF-IDF model and two dual encoder models, one with uniform weights, so without weighting, and one using the weighting model. We conducted both an automatic and a human evaluation of this approach.
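For reference, a TF-IDF baseline of this kind can be sketched with standard tooling; the class below is a minimal illustration using scikit-learn, not necessarily the exact weighting scheme used in the experiments.

```python
# Hypothetical TF-IDF baseline: rank candidate responses by cosine similarity to the context.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class TfidfResponseRanker:
    def __init__(self, training_texts):
        self.vectorizer = TfidfVectorizer()
        self.vectorizer.fit(training_texts)

    def score(self, context, response):
        # Higher score = response shares more (weighted) vocabulary with the context.
        vecs = self.vectorizer.transform([context, response])
        return cosine_similarity(vecs[0], vecs[1])[0, 0]
```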
The dual encoder models were proposed a few years ago; they are actually quite simple models where you have two recurrent networks with shared weights, which you then feed into dense layers and combine with a dot product. So the model computes a kind of semantic similarity between the response that is predicted given the context and the actual response found in the corpus. We made a small modification to the model, on top of this dot product, to allow the final score to also be defined on some features of the response itself, because there might be features that are not due to the similarity between the context and the response, but to aspects of the response itself, and that give clues about whether it is of high or low quality. For instance, unknown words might indicate a response of lower quality.
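A minimal sketch of a dual encoder with this modification might look as follows; the sigmoid scoring layer, the response features and all names are assumptions for illustration rather than the exact model.

```python
# Minimal sketch of a dual encoder with additional response features (hypothetical sizes/names).
from tensorflow.keras import layers, Model

vocab_size, embed_dim, hidden_dim, max_len, n_resp_feats = 20000, 128, 256, 60, 3

embedding = layers.Embedding(vocab_size, embed_dim, mask_zero=True)
encoder = layers.GRU(hidden_dim)  # shared weights for context and response

context_in = layers.Input(shape=(max_len,), name="context")
response_in = layers.Input(shape=(max_len,), name="response")
feats_in = layers.Input(shape=(n_resp_feats,), name="response_features")  # e.g. unknown-word count (assumed)

c = encoder(embedding(context_in))
r = encoder(embedding(response_in))

# The context encoding is mapped to a "predicted response" vector and compared
# to the actual response encoding with a dot product (semantic similarity).
predicted_r = layers.Dense(hidden_dim)(c)
similarity = layers.Dot(axes=1)([predicted_r, r])

# Modification: the final relevance score also depends on intrinsic response features.
score = layers.Dense(1, activation="sigmoid")(layers.Concatenate()([similarity, feats_in]))

dual_encoder = Model([context_in, response_in, feats_in], score)
dual_encoder.compile(optimizer="adam", loss="binary_crossentropy")
```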
In terms of evaluation, as I said, we used the subtitles as training data. Then, to select the high-quality responses, we took a subset of this training data for which we knew the turn structure, because we could align it with movie scripts where you have the speaker names. We then used two heuristics: we only kept responses that introduce a new turn, so not sentences that were simply a continuation of a given turn, and we only used two-party conversations, because in two-party conversations it is easier to determine whether the response really is a response to the previous speaker or not. We also filtered out responses containing fictional names and out-of-vocabulary words. We ended up with a set of about one hundred thousand context-response pairs that we considered to be of high quality.
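As an illustration, the selection step could be sketched roughly as below; the helper structures (speaker annotations from the script alignment, the list of fictional names, the vocabulary) are placeholders, not the actual pipeline.

```python
# Rough sketch of the selection heuristics (placeholder data structures, not the real pipeline).
def is_high_quality(pair, vocabulary, fictional_names):
    context, response = pair["context"], pair["response"]

    # 1. The response must open a new turn (different speaker than the last context utterance).
    if response["speaker"] == context[-1]["speaker"]:
        return False

    # 2. Keep only two-party conversations, so the response clearly answers the previous speaker.
    speakers = {utt["speaker"] for utt in context} | {response["speaker"]}
    if len(speakers) != 2:
        return False

    # 3. Filter out responses with fictional character names or out-of-vocabulary words.
    tokens = response["tokens"]
    if any(t in fictional_names or t not in vocabulary for t in tokens):
        return False

    return True

high_quality = [p for p in aligned_pairs if is_high_quality(p, vocabulary, fictional_names)]
```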
For the test data, we used one in-domain and one slightly out-of-domain test set: the Cornell Movie Dialogue Corpus, which is a collection of movie scripts (not movie subtitles, but scripts), and a small corpus of sixty-two theatre plays that we found on the web. Of course, we preprocessed, tokenised and POS-tagged them. In terms of experimental design, we limited the context to the last ten utterances preceding the response, with a maximum of sixty tokens; for the response, a maximum of five utterances in the case of turns with multiple utterances. We had a one-to-one ratio between positive examples, which were actual pairs observed in the corpus, and negative examples drawn at random from the same corpus. We used GRU units instead of LSTMs because they are faster to train, and we did not see any difference in performance compared to LSTMs.
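For illustration, the one-to-one negative sampling can be sketched as follows, assuming the corpus is available as a list of (context, response) pairs.

```python
import random

def build_training_examples(pairs, seed=0):
    """One-to-one ratio: each observed pair (label 1) is matched with a
    randomly drawn response from the same corpus (label 0)."""
    rng = random.Random(seed)
    responses = [resp for _, resp in pairs]
    examples = []
    for context, response in pairs:
        examples.append((context, response, 1))               # positive: observed in the corpus
        examples.append((context, rng.choice(responses), 0))  # negative: random response
    rng.shuffle(examples)
    return examples
```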
And here are the results. As you can see, TF-IDF does not perform well, but that is well known. We look at the Recall@k metric, which takes a set of possible responses, one of which is the actual response observed in the corpus, and checks whether the model was able to put the actual response among the top k responses. So Recall 10@1 means that, in a set of ten responses, one of which is the actual response, the model ranks the actual response highest. So that's the metric.
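The metric itself is straightforward to compute; here is a minimal sketch, assuming a model that exposes a hypothetical `score(context, response)` function.

```python
def recall_at_k(model, test_cases, k=1):
    """Recall N@k: each test case contains the true response among N candidates;
    count how often the model ranks the true response in the top k."""
    hits = 0
    for context, true_response, candidates in test_cases:  # candidates include true_response
        ranked = sorted(candidates, key=lambda r: model.score(context, r), reverse=True)
        if true_response in ranked[:k]:
            hits += 1
    return hits / len(test_cases)
```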
We then compared the two dual encoder models, and as you can see, the one with the weighting model performs a little better on both test sets. What we found in a subsequent error analysis was that the weighting model gives more importance to cohesive adjacency pairs between the context and the response, so responses that were not simply continuations but actual responses, clearly uttered by another speaker and answering the context.
We also performed a human evaluation of responses produced by the dual encoder models, using crowdsourcing. We picked one hundred and fifteen random contexts from the Cornell corpus and four possible responses: a random response, the two responses from the dual encoder models, and an expert response that was manually authored. This resulted in four hundred and sixty pairs, each of which was evaluated by human judges who were asked to rate the consistency between the context and the response on a five-point scale. One hundred and eighteen individuals participated in the evaluation through CrowdFlower. Unfortunately, the results were not conclusive: we could not find any statistically significant difference between the two models, and there was in general very low agreement between the participants for all four models. We hypothesise that this was due to the difficulty for the raters to discriminate between the responses, which might be due to the nature of the corpus itself: it is heavily dependent on an external context, namely the movie scenes, and if you don't have access to the movie scenes it is very difficult to understand what is going on. Even providing a longer dialogue history did not seem to help. So for a human evaluation, we think another type of test data might be more beneficial. So that was the human evaluation.
So, to conclude: large dialogue corpora usually include many noisy examples, and noise can cover many things. It can cover responses that were not actual responses, responses that include, for instance, fictional names that you don't want to appear in your models; it might also include dull, commonplace responses, or responses that are inconsistent with what the model knows. So not all examples have the same quality or the same relevance for learning conversational models. A possible remedy is to include a weighting model, which can be seen as a form of domain adaptation; instance weighting is a common approach for domain adaptation. We showed that this weighting model does not need to be handcrafted. If you have a clear idea of how you want to filter your data, then you can of course use handcrafted rules, but in many cases what determines the quality of an example is hard to pinpoint, so it might be easier to use a data-driven approach and learn a weighting model from examples of high-quality responses. What constitutes this quality, what constitutes a good response, of course depends on the actual application that you are trying to build. The approach is very general: it is essentially a preprocessing step, so it can be applied to any data-driven model of dialogue. As long as you have examples of high-quality responses, you can use it as a preprocessing step for anything.
As future work, we would like to extend this to generative models. In the evaluation we restricted ourselves to one type of retrieval-based model, but it might be very interesting to apply it to other kinds of models, especially generative ones, which are known to be quite difficult to train. An additional benefit of weighting models would be that you could filter out examples that are known to be detrimental to the model before you even feed them into the training scheme, so you might gain performance benefits in addition to the benefits in accuracy. So that's for future work, possibly also with other types of test data than the Cornell Movie Dialogue Corpus that we used. Yes, that's it, thank you.
Can you go back to the box plot towards the end? I'm not sure what is in the box plot. The way I read it is that there is no real difference in agreement between the two models, but you said that there was very low agreement between the evaluators, so I was wondering whether we are looking at two different things there. Is that right?
I think it is mostly between the two dual encoder models. There is of course a statistically significant difference between the expert responses and the random ones, and also between the two dual encoder models and the random ones, but there is no significant difference between the two dual encoder models, with weighting and without weighting.
So maybe the difference would be more significant with a different setup.

Right, I agree.
Could you elaborate on why you changed the final piece of the dual encoder, what the reason was for extending it?
So, the idea is that the dot product gives you a similarity between the response that is predicted from the context and the actual response, right? That is a very important aspect when considering how relevant the response is compared to the context. But there might be aspects that are really intrinsic to the response itself and have nothing to do with the context: for instance unknown words or rare words (possibly typos), wrong punctuation, or lengthy responses. This is not going to be directly captured by the dot product; it is captured by extracting some features from the response and then using these in the final adequacy score. So that was something missing in the dot product alone, and that's why we wanted to modify it.
I was just wondering if you could elaborate on the extent to which you believe in the generalisability of training a weighting model on a single dataset and having it extend reasonably to enhance performance elsewhere.

Compared to training on multiple domains, you mean?

What I mean is: is the general scheme such that, whenever you are trying to improve performance on a dataset, you would basically find a similar dataset, train the weighting model on that similar dataset, and then use the weighting model on the new dataset? Is that roughly the general scheme for using this?
That is not exactly the question you are asking, but in some cases you might want to use different domains, or to preselect, to prune out some parts of the data that you don't want. In some cases, and that was the case we had here, it is very difficult to do the preprocessing in advance on the full dataset, because the quality is very hard to determine using, you know, simple rules. In particular, here the turn structure is something that is important for determining what counts as a natural response, but it was nearly impossible to write rules for that, because it depends on pauses and gaps, lexical cues, and many different factors. You could of course build a machine learning classifier that segments your turns, but then it would be all or nothing, right? Many examples in my dataset were probably responses, but the classifier would not give me a reliable answer. So it was better to use a weighting function, so that I can still account for some of these examples, but not in the same way as I would for, you know, clear high-quality responses. Another aspect I would like to mention is that we could of course have trained only on the high-quality responses, but in that case I would have had to throw away ninety-nine point nine percent of my dataset, and I did not want to discard all of that just because I am not exactly sure about the quality of those responses; I would rather treat it as a regression problem.
One more question. I'm not sure, maybe I didn't look at the evaluation closely enough, but did you try a baseline where you used a simpler heuristic for assigning the weights, rather than building a separate model to learn the weights? So, you know, something that is not necessarily learned.
No, I didn't. I'm not exactly sure we could find a very simple one. I guess something that could be done, and I don't know how well it would perform, would be to use the time gaps between the context and the response as a way to determine the weights, but I didn't try that. I tried it in a previous paper where I was just looking at turn segmentation, and it didn't work very well for that particular task; here it would be slightly different, since this would assign a weight value instead of just segmenting. But on its own that doesn't work very well; you usually have to use some lexical cues as well. For example, after someone says "Mister Holmes, ...", that is usually an indicator that the next speaker is going to be Holmes, but you need a classifier for that.