0:00:16 | So this is joint work with Serge Bibauw from KU Leuven. |
---|
0:00:20 | So, as you know, neural architectures have been increasingly popular for the development of conversational agents, and one major advantage of these approaches is that they can be learned from raw, unannotated dialogues, without needing much domain knowledge or feature engineering. |
---|
0:00:40 | However, they also require large amounts of training data, because they have a large parameter space. |
---|
0:00:46 | So we usually rely on large online resources to train them, such as Twitter conversations, technical web forums like the Ubuntu chat logs, movie scripts, and movie and TV subtitles. These resources are undeniably useful, |
---|
0:01:06 | but they also face some limitations in terms of dialogue modelling. We could talk for a long time about these, but I would like to point out just two limitations, especially ones that are important for subtitles. |
---|
0:01:24 | One of these limitations is that, for movie subtitles, we don't have any explicit turn structure. The corpus itself is only a sequence of sentences, together with timestamps for their start and end times. |
---|
0:01:40 | But we don't know who is speaking, because of course the subtitles don't come together with the audio track and the video, where you could see who is speaking at a given time. So we don't know who is speaking, and we don't know whether a sentence answers another turn or is a continuation of the current turn. |
---|
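The timestamp cue mentioned above can be sketched as a simple threshold on the pause between consecutive subtitle lines. This is an illustrative sketch, not the heuristic used in the work itself; the threshold value and the example data are assumptions:

```python
def likely_new_turn(prev_end, next_start, gap_threshold=1.0):
    """Heuristic: a long pause between two subtitle lines suggests
    that the next line starts a new turn (hypothetical threshold)."""
    return (next_start - prev_end) >= gap_threshold

# Subtitle lines as (start_time, end_time) in seconds (made-up values)
lines = [(0.0, 1.8), (2.0, 3.5), (6.2, 7.0)]
boundaries = [likely_new_turn(lines[i][1], lines[i + 1][0])
              for i in range(len(lines) - 1)]
# Only the long 3.5 -> 6.2 gap is flagged as a probable turn boundary.
```

As the talk notes, a gap threshold alone is a weak cue; lexical and syntactic cues are needed for reliable segmentation.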
0:01:58 | So in this particular example, the actual turn structure is the following. As you can see, there are some strong cues: the timestamps can be used in a few cases, and there are lexical and syntactic cues that can be used to infer the turn structure, but you never have the ground truth. |
---|
0:02:18 | And so that's an important disadvantage when you actually want to build a system that generates responses, and not just continuations of a given dialogue. |
---|
0:02:29 | Another limitation is that many of these data contain references to named entities that might be absent from the inputs, in particular fictional characters. They often refer to a context which is external to the dialogue and which cannot be captured by the inputs alone. |
---|
0:02:50 | So in this particular case, "Mr Holmes" is an entity for which you would require access to an external context in order to make sense of what is happening. There are other limitations of course, but I just wanted to point out these two important ones. |
---|
0:03:08 | So how do we deal with these problems? The key idea I'm going to present here starts from the fact that not all examples of context-response pairs are equally useful or relevant for building conversational models. Some examples, as Oliver Lemon showed in his keynote, might even be detrimental to the development of your model. |
---|
0:03:32 | So we can view this as a kind of domain adaptation problem: there is some kind of discrepancy between the context-response pairs that we observe in a corpus and the ones that we wish to encode in our neural conversational model, in the particular application that we have in mind. |
---|
0:03:51 | So the proposed solution is one that is very well known in the field of domain adaptation, which is simply the inclusion of a weighting model. We try to map each pair of context and response to a particular weight value that corresponds to its importance, to its quality if you want, for the particular purpose of building a conversational model. |
---|
0:04:18 | So how do we assign these weights? Of course, due to the sheer size of our corpora, we cannot annotate each pair manually, and even handcrafted rules per se may be difficult to apply in many cases, because the quality of examples might depend on multiple factors that interact in complex ways. |
---|
0:04:39 | So what we propose here is a data-driven approach, where we learn a weighting model from examples of high-quality responses. And of course, what constitutes a high-quality response might depend on the particular objectives and the particular type of conversational model that one wishes to build. |
---|
0:04:58 | So there is no single answer to what constitutes a high-quality response, but if you have some idea of what kind of responses you want and which ones you don't want, you can often select a subset of high-quality responses and learn a weighting model from these. The weighting model uses a neural architecture, which is the following. |
---|
0:05:21 | So as you can see here, we have two recurrent neural networks with shared weights: an embedding layer and a recurrent layer with LSTM or GRU units. These two networks respectively encode the context and the response as sequences of tokens, |
---|
0:05:43 | and output fixed-size vectors, which are then fed to a dense layer. This dense layer can also incorporate additional inputs, for instance document-level factors: if you have some features that are specific to movie dialogue and that may be of interest for calculating the weights, you can incorporate them |
---|
0:06:07 | in this dense layer. For the subtitles, for instance, we also have information about the time gaps between the context and the response, and that is something that can be used as well. So we include all these data as inputs to this final dense layer, which then outputs a weight for a given context-response pair. |
---|
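The final step of this architecture can be sketched as a dense layer over the two fixed-size encodings plus the extra document-level features, squashed to a bounded weight. This is a toy, dependency-free sketch: in practice the encodings would come from the shared recurrent networks, and all dimensions and parameter values below are made up for illustration:

```python
import math

def weight_head(context_vec, response_vec, extra_feats, params):
    """Dense layer + sigmoid over [context; response; extras] -> weight in (0, 1)."""
    x = context_vec + response_vec + extra_feats  # list concatenation
    z = params["bias"] + sum(w * v for w, v in zip(params["weights"], x))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid keeps the weight bounded

# Hypothetical 2-dim encodings plus one extra feature (time gap in seconds)
params = {"weights": [0.5, -0.2, 0.3, 0.1, -0.4], "bias": 0.0}
w = weight_head([0.2, 0.1], [0.4, -0.3], [1.5], params)
```

The choice of a sigmoid output is an assumption; any bounded activation would serve the same purpose of producing a per-pair weight.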
0:06:30 | So that's the model. Once we have learned a weighting model from examples of high-quality responses, we can then apply it to the full training data, to assign a particular weight to each pair, which we can include in the empirical loss that we minimize when we train the neural model. |
---|
0:06:52 | The exact formula for the empirical loss might depend on what kind of model you're building and what kind of loss function you're using, but the key idea is that the loss function calculates some kind of distance between what the model produces and the ground truth, and then you weight this loss by the weight value that you calculated with the weighting model. |
---|
0:07:18 | So it's a kind of two-pass procedure, where you first calculate the weight of your example, and then, given this weight and the output of the neural model, you can calculate the empirical loss and then optimize the parameters over this weighted sum. |
---|
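The weighted training objective described here can be sketched as the per-example losses multiplied by the instance weights and summed. The per-example loss values and weights below are illustrative, not from the talk:

```python
def weighted_empirical_loss(losses, weights):
    """Weighted sum of per-example losses; each weight comes from
    the separately trained weighting model."""
    assert len(losses) == len(weights)
    return sum(w * l for w, l in zip(weights, losses))

# Toy per-example losses and instance weights (made-up values):
# the low-weight example contributes little to the total loss.
losses = [0.8, 0.2, 1.0]
weights = [1.0, 0.5, 0.1]
total = weighted_empirical_loss(losses, weights)
```

In an actual training loop this quantity would be what the optimizer minimizes, with the weights held fixed during the second pass.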
0:07:37 | So that's the model, and the way the weights are integrated at training time. |
---|
0:07:44 | So how do we evaluate the models? We evaluated them using retrieval-based neural models, because the evaluation metrics are more clearly defined than for generative models. Retrieval-based models seek to compute a score for a given context-response pair, which expresses how relevant the response is given the context. You can then use this score to rank possible responses and to select the most relevant one. |
---|
0:08:13 | The training data comes from OpenSubtitles, which is a large corpus of dialogues that we released last year. We compared three models: a classical TF-IDF model and two dual encoder models, one with uniform weights (so without weighting) and one using the weighting model. And we conducted both an automatic and a human evaluation of this approach. |
---|
0:08:40 | The dual encoder models, which were proposed a few years ago, are actually quite simple models, where you also have two recurrent networks with shared weights, which you then feed to dense layers and combine with a dot product. So it computes some kind of semantic similarity between the response that is predicted given the context and the actual response that you find in the corpus. |
---|
0:09:07 | We made a small modification to this dot product model, to allow the final score to also depend on some features from the response itself. There might be some features that are not due to the similarity between the context and the response, but are due to some aspects of the response itself, and that give clues about whether it is of high or low quality. For instance, unknown words might indicate a response of lower quality. |
---|
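The modified scoring function can be sketched as the usual dot-product similarity plus a linear term over response-only features. The single feature (fraction of unknown words) and all weight values here are illustrative assumptions, not the features from the actual model:

```python
def adequacy_score(pred_vec, resp_vec, resp_feats, feat_weights):
    """Dot-product similarity plus a linear term on response-only features."""
    dot = sum(p * r for p, r in zip(pred_vec, resp_vec))
    intrinsic = sum(w * f for w, f in zip(feat_weights, resp_feats))
    return dot + intrinsic

# Hypothetical response-only feature: fraction of unknown words,
# with a negative weight so unknown words lower the final score.
clean = adequacy_score([0.3, 0.5], [0.4, 0.6], [0.0], [-2.0])
noisy = adequacy_score([0.3, 0.5], [0.4, 0.6], [0.5], [-2.0])
```

With identical vectors, the response containing unknown words ends up with the lower adequacy score, which is the point of the extension.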
0:09:39 | In terms of evaluation, we used, as I said, the subtitles as training data. To select the high-quality responses, we took a subset of this training data for which we knew the turn structure, because we could align it with movie scripts, where you have the speaker names. |
---|
0:09:58 | And then we used two heuristics. We only kept responses that introduce a new turn, so not sentences that simply serve as a continuation of a given turn. And we only used two-party conversations, because in two-party conversations it's easier to define whether the response is a response to the previous speaker or not. |
---|
0:10:22 | And then we also filtered out responses containing fictional names and out-of-vocabulary words. We ended up with a set of about one hundred thousand context-response pairs that we considered to be of high quality. |
---|
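The selection heuristics described above can be sketched as a simple filter over aligned examples. The field names, vocabulary, and name list below are hypothetical stand-ins for the actual corpus resources:

```python
def is_high_quality(pair, vocab, fictional_names):
    """Keep only responses that start a new turn, come from two-party
    conversations, and contain no fictional names or OOV words."""
    if not pair["new_turn"] or pair["num_speakers"] != 2:
        return False
    tokens = pair["response"].lower().split()
    if any(t in fictional_names for t in tokens):
        return False
    return all(t in vocab for t in tokens)

vocab = {"how", "are", "you", "fine", "thanks"}
fictional = {"holmes"}
pairs = [
    {"new_turn": True,  "num_speakers": 2, "response": "fine thanks"},
    {"new_turn": False, "num_speakers": 2, "response": "fine thanks"},
    {"new_turn": True,  "num_speakers": 2, "response": "thanks holmes"},
]
kept = [p for p in pairs if is_high_quality(p, vocab, fictional)]
```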
0:10:37 | For the test data, we used one in-domain and one slightly out-of-domain test set: the Cornell Movie-Dialogs Corpus, which is a collection not of movie subtitles but of movie scripts, and then a small corpus of sixty-two theatre plays that we found on the web. Of course we preprocessed them: we tokenized and POS-tagged them. |
---|
0:11:03 | In terms of experimental design, we considered the context to be limited to the last ten utterances preceding the response, with a maximum of sixty tokens; the response was limited to a maximum of five utterances, in the case of turns with multiple utterances. And then we had a one-to-one ratio between positive examples, which were actual pairs observed in the corpus, and negative examples, which were drawn at random from the same corpus. |
---|
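The one-to-one positive/negative setup can be sketched by keeping each observed context and pairing it once with its real response and once with a response drawn at random from the corpus. The data and the seeding are illustrative:

```python
import random

def make_training_pairs(dialogue_pairs, seed=0):
    """One negative example per positive: the context is kept and the
    response is replaced by a randomly drawn response from the corpus."""
    rng = random.Random(seed)  # seeded for reproducibility
    responses = [r for _, r in dialogue_pairs]
    examples = []
    for ctx, resp in dialogue_pairs:
        examples.append((ctx, resp, 1))                   # observed pair
        examples.append((ctx, rng.choice(responses), 0))  # random negative
    return examples

data = [("how are you", "fine thanks"), ("see you", "bye")]
examples = make_training_pairs(data)
```

A more careful sampler might exclude the true response from the negative draws; this sketch keeps the 1:1 ratio only.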
0:11:32 | We used GRU units instead of LSTMs because they are faster to train, and we didn't see any difference in performance compared to LSTMs. |
---|
0:11:42 | And here are the results. As you can see, TF-IDF doesn't perform well, but that's really well known. |
---|
0:11:51 | So we look at the recall-at-k metric, which considers a set of possible responses, one of which is the actual response observed in the corpus, and then looks at whether the model was able to put the actual response among the top k responses. So R ten at one means that, in a set of ten responses, one of which is the actual response, the model would rank the actual response to be the highest. |
---|
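The R_n@k metric just described can be implemented directly: rank the n candidate scores and check whether the true response lands in the top k. The scores below are a made-up example:

```python
def recall_at_k(scores, true_index, k=1):
    """scores: one model score per candidate response.
    Returns True if the true response is ranked in the top k."""
    ranking = sorted(range(len(scores)),
                     key=lambda i: scores[i], reverse=True)
    return true_index in ranking[:k]

# Ten candidate scores; the actual response (index 3) scores highest,
# so this example counts as a hit for R_10@1.
scores = [0.1, 0.4, 0.2, 0.9, 0.3, 0.05, 0.6, 0.15, 0.25, 0.5]
hit = recall_at_k(scores, true_index=3, k=1)
```

The reported metric is then the fraction of test examples for which this check succeeds.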
0:12:27 | And then we compared the two dual encoder models, and as you can see, the one with the weighting model performs a little better on both test sets. |
---|
0:12:35 | What we found in a subsequent error analysis was that the weighting model gives more importance to cohesive adjacency pairs between the context and the response: responses that were not simply continuations, but actual responses that were clearly from another speaker and that were answering the context. |
---|
0:12:58 | We also performed a human evaluation of the responses selected by the dual encoder models, using crowdsourcing. We picked one hundred and fifteen random contexts from the Cornell corpus and four possible responses: a random response, the two responses from the dual encoder models, and an expert response that was manually authored. So we had a resulting four hundred and sixty pairs, which were each evaluated by human judges, who were asked to rate the consistency between the context and the response on a five-point scale. One hundred and eighteen individuals participated in the evaluation through CrowdFlower. |
---|
0:13:45 | Unfortunately, the results were not conclusive: we could not find any statistically significant difference between the two models, and there was in general a very low agreement between the participants, for all four models. |
---|
0:14:01 | We hypothesize that this was due to the difficulty for the raters to discriminate between the responses, which might be due to the nature of the corpus itself: it is heavily dependent on an external context, namely the movie scenes, and if you don't have access to the movie scenes, it is very difficult to understand what is going on. Even providing a longer dialogue history didn't seem to help. So for a human evaluation, we think another type of test data might be more beneficial. |
---|
0:14:34 | So that was the human evaluation. To conclude: large dialogue corpora usually include many noisy examples, and noise can cover many things. It can be responses that were not actual responses, or responses that include, for instance, fictional names that you don't want to appear in your model; it might also include dull, commonplace responses, or responses that are inconsistent with what the model knows. |
---|
0:15:08 | So not all examples have the same quality or the same relevance for learning conversational models, and a possible remedy to that is to include a weighting model, which can be seen as a form of domain adaptation: instance weighting is a common approach for domain adaptation. |
---|
0:15:26 | And we showed that this weighting model does not need to be handcrafted. If you have a clear idea of how you want to filter your data, then you can of course use handcrafted rules, but in many cases what determines the quality of an example is hard to pinpoint, so it might be easier to use a data-driven approach and learn the weighting model from examples of high-quality responses. |
---|
0:15:53 | What constitutes this quality, what constitutes a good response, is of course dependent on the actual application that you are trying to build. This approach is very general: it is essentially a preprocessing step, so it can be applied to any data-driven model of dialogue. As long as you have examples of high-quality responses, you can use it as a preprocessing step to anything. |
---|
0:16:21 | As future work, we would like to extend it to generative models. In our evaluation we restricted ourselves to one type of retrieval-based model, but it might be very interesting to apply it to other kinds of models, and especially to generative ones, which are known to be quite difficult to train. |
---|
0:16:42 | An additional benefit of weighting models would be that you could filter out examples that are known to be detrimental to the model before you even feed them to the training scheme, so you might get performance benefits in addition to the benefits regarding your metrics, your accuracy. |
---|
0:17:06 | So that's for future work, and possibly also other types of test data than the Cornell movie corpus that we used. That's it, thank you. |
---|
0:17:32 | [Audience] Can you go back to the box plot towards the end? I'm not sure what's in the box plot. The way I read it is that there is no real difference between the models, but you have said that there is very low agreement between the evaluators, so I was wondering whether we are looking at two different things there. Is that right? |
---|
0:18:06 | [Speaker] I agree, it is mostly between the two dual encoder models. There is of course a statistically significant difference between the authored responses and the random ones, and also between the two dual encoder models and the random one, but there is no significant difference between the two dual encoders, with weighting and without weighting. |
---|
0:18:24 | [Audience] So possibly the difference would be more significant with an adjusted test set? [Speaker] Right, I agree. |
---|
0:18:43 | [Audience] Can you elaborate on why you changed the final piece of the dual encoder? What was the reason it was extended? |
---|
0:18:51 | [Speaker] So the idea is that the dot product will give you a similarity between the response predicted from the context and the actual response, right? This is a very important aspect when considering how relevant the response is compared to the context, but there might be aspects that are really intrinsic to the response itself and have nothing to do with the context: for instance unknown words, rare words that are probably typos, wrong punctuation, or lengthy responses. This is not going to be directly captured by the dot product; it is going to be captured by extracting some features from the response and then using these in the final adequacy score. That was something missing in the standard dot product model, and that's why we wanted to modify it. |
---|
0:19:54 | [Audience] I was just wondering if you could elaborate on the extent to which you believe in the generalizability of training a weighting model on a single dataset and having it extend reasonably to enhance performance elsewhere. [Speaker] Compared to training on multiple domains, you mean? [Audience] What I mean is: is the general scheme such that, whenever you are trying to improve performance on a dataset, you would basically find a similar dataset, train the weighting model on that similar dataset, and then use the weighting model on the new dataset? Is that sort of the general scheme for using this? |
---|
0:20:34 | [Speaker] It's not exactly the question that you are asking, but in some cases you might want to use different domains, or to preselect, to prune out some parts of the data that you don't want. In some cases, and that was the case that we had here, it's very difficult to do the preprocessing in advance on the full dataset, because the quality is very hard to determine using simple rules. |
---|
0:21:06 | In particular, here the turn structure is something that is important for determining what constitutes an actual response, but it was near impossible to write rules for that, because it was dependent on pauses and gaps, lexical cues, and many different factors. |
---|
0:21:22 | You could of course build a machine learning classifier that will segment your turns, but then it will be all or nothing, right? There were many examples in my dataset that were probably responses, but for which the classifier wouldn't give me a reliable answer. So it was better to use a weighting function, so that I can still account for some of these examples, but not in the same way as I would for, you know, clearly high-quality responses. |
---|
0:21:53 | Another aspect that I would like to mention is that we could of course have trained only on the high-quality responses, but in that case I would have had to prune ninety-nine point nine percent of my dataset. I didn't want to throw everything away just because I'm not exactly sure about the quality of the responses. I don't know if that answers your question. |
---|
0:22:22 | [Audience] One more question. I'm not sure, maybe I didn't follow the evaluation too closely, but did you try a baseline where you used a simpler heuristic for assigning the weights, rather than building a separate model to learn the weights? |
---|
0:22:54 | [Speaker] No, I didn't, and I'm not exactly sure we could find a very simple one. Something that could be done, although I don't know how well it would perform, would be to use the time gaps between the context and the response as a way to determine the weights. |
---|
0:23:19 | Actually, I did try that in a previous paper, when I was just looking at turn segmentation, and it didn't work very well for that particular task. Here it would be a bit different, since we would assign a weight value instead of just segmenting, but time gaps alone don't work very well; you usually have to use some lexical cues. For instance, a vocative like "Doctor Holmes, ..." is usually an indicator that the next speaker is going to be Doctor Holmes, but you need a classifier for that. |
---|