Speech Transcript - Discovering User Groups for Natural Language Generation

0:00:16	hi everyone
0:00:17	i am nikos from saarland university germany
0:00:22	i'm gonna talk to you about
0:00:24	a way to discover user groups for natural language generation in dialogue
0:00:29	and this is work i've done together with christoph teichmann and alexander koller
0:00:39	let's see let's look at this example here
0:00:42	we have a navigation system that tells
0:00:45	the user turn right after melbourne central
0:00:50	so user a succeeds
0:00:52	in finding the
0:00:54	right thing to do
0:00:55	and user b fails
0:00:58	so why could that be
0:01:03	well there are different reasons why
0:01:05	users react differently to such instructions so
0:01:10	most likely here the user is not from melbourne
0:01:15	so
0:01:16	they do not know what melbourne central means but
0:01:20	and we can imagine also other reasons such as the lack
0:01:26	demographics are present a sign or
0:01:29	experience with navigational systems
0:01:34	however such information is often difficult to obtain
0:01:38	so
0:01:40	and
0:01:42	we cannot ask everyone before they use the navigation system where they are from
0:01:47	but in an interactive setting a sensible approach is to
0:01:52	collect observations and react to them so ideally after observing something like that
0:01:58	a system would say okay user a keep using place names from melbourne
0:02:04	and it would adapt to user b and say something like at the roundabout
0:02:09	take the third exit
0:02:14	so people deal with this problem in different ways one approach is of course to
0:02:18	completely ignore it
0:02:21	which we don't want
0:02:24	another approach is
0:02:26	to use
0:02:27	one model for every user
0:02:31	however this requires lots of data for each user and we might lose information
0:02:37	that
0:02:39	might help us from similar users
0:02:44	and another approach would be to use pre-defined groups
0:02:48	so for example have
0:02:50	a group of residents of melbourne and another group for outsiders
0:02:57	but this is hard to annotate and it's also hard to know in advance
0:03:04	which categories could be relevant and then
0:03:09	which categories we can actually find in the dataset
0:03:16	so instead of doing these things
0:03:19	we assume that the users' behavior clusters
0:03:23	into
0:03:24	groups that we cannot observe
0:03:27	and
0:03:29	we use bayesian reasoning to infer those groups from unannotated
0:03:35	training data
0:03:36	and then at test time to dynamically assign users to those groups as the dialogue progresses
0:03:46	so our starting point is a simple log-linear model of language use
0:03:52	where in particular we abstract away from whether we are doing
0:03:56	and
0:03:57	like simulating comprehension or production
0:04:02	so we just say in general that we want to predict the behavior of
0:04:07	the user in response to a stimulus that is coming from
0:04:12	the system so if we are trying to simulate language production
0:04:17	the stimulus can be the communicative goal that the user is trying to achieve and
0:04:22	the behavior would be the utterance that they use or some other linguistic choice that they
0:04:28	make
0:04:29	and
0:04:31	if we want to predict what the user would understand
0:04:35	then the stimulus is a system-produced utterance and the behavior is the meaning that the user
0:04:42	assigns to
0:04:43	the utterance
0:04:47	so this is
0:04:49	this is what our basic model looks like
0:04:52	before we add the user groups
0:04:54	it's a log-linear model with a real-valued parameter vector rho
0:05:00	and a set of feature functions phi over behaviors and stimuli
0:05:05	and this model can be trained with a dataset of pairs of behaviors and
0:05:10	stimuli using
0:05:11	normal gradient descent based methods
0:05:15	and actually we have already used this in previous work for
0:05:20	predicting reference resolution in dialogue
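
The basic log-linear model described above can be sketched as follows. This is only a minimal illustration, assuming the usual softmax form over candidate behaviors; the names `behavior_probs`, `rho` and `phi` and the toy feature values are hypothetical and not taken from the talk.

```python
import numpy as np

def behavior_probs(rho, features):
    """P(b | s; rho) for every candidate behavior b of one stimulus s.

    features: array of shape (num_behaviors, num_features); row i holds the
              feature values phi(b_i, s) for candidate behavior b_i.
    rho:      real-valued parameter vector of shape (num_features,).
    """
    scores = features @ rho
    scores = scores - scores.max()  # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# toy stimulus with two candidate behaviors and two made-up features
phi = np.array([[1.0, 0.0],
                [0.0, 1.0]])
rho = np.array([0.5, -0.5])
probs = behavior_probs(rho, phi)
```

With these toy weights the first behavior gets the higher probability, because its feature carries the larger weight.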
0:05:24	so
0:05:27	now if we want to extend this model with user groups
0:05:33	we just assume that there is a finite number of user groups in the data
0:05:37	k
0:05:39	and we give
0:05:41	each of the groups their own parameter vector
0:05:46	so we replace the single parameter vector from the model before
0:05:53	with group specific parameter vectors so if we know exactly what group a
0:06:00	user belongs to
0:06:01	then all we have to do is just use these new
0:06:06	parameters and
0:06:08	we have a new prediction model that is specific to that group in particular
0:06:16	however as we said
0:06:20	we want to adapt to users that we haven't seen in training data
0:06:26	so
0:06:27	we assume that the training data was generated in the following way
0:06:33	we have a set
0:06:34	of users u
0:06:36	and
0:06:38	each user is assigned
0:06:42	to a group
0:06:45	with a probability
0:06:47	given by pi which is another parameter vector that determines the prior
0:06:53	probability of each group
0:06:56	and then
0:06:57	as we said we have one parameter vector for each group so now the
0:07:02	behavior of the user
0:07:05	not only depends on the stimulus but also on their group assignment and on
0:07:10	the group specific parameter vectors
0:07:16	so now let's suppose that we have trained our system on the training data
0:07:23	and then a new user starts talking to us
0:07:28	since we don't know what their actual group is
0:07:31	we marginalise over all groups using the prior probability
0:07:37	and so we directly have
0:07:40	an idea of what they would do
0:07:46	given the prior probabilities that we have observed in the training data and
0:07:51	we can already use this model for interacting with them and then observe their behaviour
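
Marginalising over the groups for a new user can be sketched like this. It is a minimal illustration, assuming softmax group models as above; `predict_marginal`, the prior values and the group weights are hypothetical.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def predict_marginal(group_weights, rhos, features):
    """P(b | s) = sum over groups g of P(g) * P(b | s; rho_g).

    group_weights: probability of each group, shape (num_groups,)
    rhos:          one parameter vector per group, shape (num_groups, num_features)
    features:      phi(b, s) for each candidate behavior, shape (num_behaviors, num_features)
    """
    per_group = np.array([softmax(features @ rho) for rho in rhos])
    return group_weights @ per_group

prior = np.array([0.7, 0.3])      # made-up prior group probabilities
rhos = np.array([[2.0, 0.0],      # group 0 prefers behavior 0
                 [0.0, 2.0]])     # group 1 prefers behavior 1
phi = np.eye(2)
mix = predict_marginal(prior, rhos, phi)
```

Because group 0 has the larger prior here, the mixed prediction leans towards behavior 0 before any observation of the new user is available.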
0:08:00	so if the user keeps
0:08:02	interacting with the system we start collecting observations for them
0:08:09	so let's say we have
0:08:11	a set d u of observations for user u at a particular time step
0:08:20	we can now use these observations to estimate
0:08:24	find out which group they belong to
0:08:28	so we can do that because
0:08:30	as i said we have group specific
0:08:34	behavior prediction
0:08:36	so we can
0:08:39	calculate the probability on the right-hand side the probability of the observations for the
0:08:46	user given the group specific parameters of each group
0:08:51	and also we have the prior membership probabilities so we can also
0:08:57	compute
0:08:59	the probability that the user belongs to each of the groups g given the data
0:09:04	and
0:09:05	and there's
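
The group membership estimate described above amounts to a Bayes update: prior group probability times the likelihood of the user's observations under each group's parameters, then normalised. The following is a minimal sketch under that reading; `group_posterior` and the toy values are hypothetical.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def group_posterior(prior, rhos, observations):
    """P(g | D_u) proportional to P(g) * product of P(b | s; rho_g) over D_u.

    observations: list of (features, chosen_index) pairs for one user, where
                  features has shape (num_behaviors, num_features).
    """
    log_post = np.log(np.asarray(prior, dtype=float))
    for features, chosen in observations:
        for g, rho in enumerate(rhos):
            log_post[g] += np.log(softmax(features @ rho)[chosen])
    log_post -= log_post.max()      # work in log space to avoid underflow
    post = np.exp(log_post)
    return post / post.sum()

prior = np.array([0.5, 0.5])
rhos = np.array([[2.0, 0.0],
                 [0.0, 2.0]])
phi = np.eye(2)
# the user picked behavior 1 three times in a row
post = group_posterior(prior, rhos, [(phi, 1)] * 3)
```

The resulting posterior can be passed in place of the prior when mixing the group-specific predictions, which gives the adapted behavior prediction the talk describes next.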
0:09:09	so if we plug in this new posterior group membership estimation
0:09:14	in the previous
0:09:16	behavior prediction model
0:09:19	we have
0:09:20	we have a new
0:09:22	behavior prediction model that takes into account
0:09:28	the data that we have seen for this new user and
0:09:31	the new group membership estimation
0:09:35	and as we collect more observations from the user
0:09:41	we hopefully have a more accurate group membership estimate and a better behavior
0:09:45	prediction
0:09:50	now how do we train our system to find the best parameter setting
0:09:58	as i said our model has
0:10:01	parameters pi which are the prior group membership probabilities and
0:10:06	for each of the groups
0:10:09	it has one
0:10:11	parameter vector for the features
0:10:15	now we assume that we have a corpus of
0:10:19	behaviors and stimuli
0:10:21	and for each of these pairs of behavior and stimulus
0:10:25	we know the user that produced them
0:10:29	but we don't know the group of the user
0:10:33	so we will try to maximize the data likelihood
0:10:37	according to
0:10:40	the previous
0:10:43	behavior probabilities
0:10:46	however it is not straightforward to use gradient descent as for the
0:10:52	basic model because we don't know the group assignments
0:10:58	so instead
0:11:00	we use
0:11:01	a method similar to expectation maximization
0:11:05	so
0:11:07	and in the beginning we just initialize all parameters
0:11:13	randomly from a normal distribution
0:11:15	and then at each time step
0:11:18	we compute
0:11:20	the group estimates the group membership probabilities
0:11:24	given the data for each user
0:11:29	using the parameter setting from the previous time step
0:11:32	and
0:11:34	we use these probabilities
0:11:37	as frequencies for pseudo-observations
0:11:42	according to this distribution
0:11:46	so we have a set of pseudo-observations with
0:11:51	observed
0:11:54	group memberships
0:11:55	so now we can use normal gradient ascent to maximize the lower
0:12:01	bound of the likelihood given these pseudo-observations
0:12:06	and we find a new parameter setting and
0:12:12	and we
0:12:14	we go back to step one until the likelihood doesn't improve further
0:12:20	by more than a threshold
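
The training loop described above can be sketched roughly as follows. This is not the authors' implementation: it assumes softmax group models, uses the membership probabilities as weights on each observation, starts from uniform priors, and updates the priors as the mean membership probability, which is the standard mixture-model choice and an assumption here. All function names, learning rates and the toy data are hypothetical.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def user_log_joint(pi, rhos, obs):
    """Per group: log P(g) + sum of log P(b | s; rho_g) over one user's observations."""
    log_p = np.log(pi)
    for feats, chosen in obs:
        log_p = log_p + np.array([np.log(softmax(feats @ r)[chosen]) for r in rhos])
    return log_p

def e_step(pi, rhos, data):
    """Group membership probabilities per user under the current parameters."""
    resp = {}
    for user, obs in data.items():
        log_p = user_log_joint(pi, rhos, obs)
        p = np.exp(log_p - log_p.max())
        resp[user] = p / p.sum()
    return resp

def m_step(rhos, data, resp, lr=1.0, steps=20):
    """Gradient ascent on the membership-weighted log-likelihood."""
    rhos = rhos.copy()
    n_obs = sum(len(obs) for obs in data.values())
    for _ in range(steps):
        grad = np.zeros_like(rhos)
        for user, obs in data.items():
            for feats, chosen in obs:
                for g in range(len(rhos)):
                    expected = softmax(feats @ rhos[g]) @ feats
                    grad[g] += resp[user][g] * (feats[chosen] - expected)
        rhos += lr * grad / n_obs
    pi = np.mean(list(resp.values()), axis=0)  # assumed closed-form prior update
    return pi, rhos

def log_likelihood(pi, rhos, data):
    return sum(np.logaddexp.reduce(user_log_joint(pi, rhos, obs)) for obs in data.values())

def train(data, num_groups, num_features, tol=1e-4, max_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(num_groups, 1.0 / num_groups)           # assumption: uniform start
    rhos = rng.normal(size=(num_groups, num_features))   # random normal initialisation
    prev = -np.inf
    for _ in range(max_iter):
        resp = e_step(pi, rhos, data)
        pi, rhos = m_step(rhos, data, resp)
        cur = log_likelihood(pi, rhos, data)
        if cur - prev < tol:   # stop once the improvement falls below the threshold
            break
        prev = cur
    return pi, rhos

# toy corpus: two users always pick behavior 0, two always pick behavior 1
phi = np.eye(2)
data = {"u1": [(phi, 0)] * 5, "u2": [(phi, 0)] * 5,
        "u3": [(phi, 1)] * 5, "u4": [(phi, 1)] * 5}
pi, rhos = train(data, num_groups=2, num_features=2)
```

The only supervision the loop needs is which user produced each observation; the group labels never appear in `data`.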
0:12:29	so now let's see if
0:12:32	if our method works
0:12:34	if we can discover groups in natural language data
0:12:39	so actually our model is very generic so we can use it in any
0:12:43	component of a dialogue system
0:12:46	for which we need to predict the user's behavior
0:12:51	but for the purpose of this work we evaluated it in
0:12:55	two specific prediction tasks related to natural language generation
0:13:02	and so the first task
0:13:05	is
0:13:06	taken from the referring expression generation literature
0:13:11	in this case the stimulus is a visual scene and the target object
0:13:15	and we want to predict
0:13:17	whether the
0:13:19	speaker will use a spatial relation in describing that object
0:13:26	so for example in this scene if they would say something like the ball in
0:13:30	front of the cube or the small ball
0:13:34	the dataset we use
0:13:36	is gre3d3
0:13:40	which is a commonly used dataset in referring expression generation
0:13:44	and it has
0:13:46	scenes described by sixty three users
0:13:51	and relations are used in thirty five percent of the scenes
0:13:56	so it is difficult to predict
0:13:59	in this dataset just from the scene it is
0:14:03	it is difficult to predict
0:14:05	whether the speaker will use a spatial relation or not
0:14:10	because some users don't use spatial relations at all
0:14:16	some use
0:14:17	spatial relations all the time and some are in between
0:14:21	so
0:14:22	we expect that
0:14:24	our model will capture that
0:14:27	difference
0:14:30	the way we evaluate it is
0:14:32	first we do crossvalidation and we split the data in such a way that the
0:14:37	users that we see in testing are never seen in training before
0:14:42	and we implement two baselines based on the state-of-the-art for this dataset which is work
0:14:50	done by ferreira and paraboni two thousand fourteen
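
Splitting the data so that test users never appear in training can be sketched as below. This is only an illustration of the idea; `user_folds` and the toy user ids are hypothetical, and the talk does not say how many folds were used.

```python
def user_folds(user_ids, num_folds):
    """Return (train_indices, test_indices) pairs where folds are split by user.

    user_ids: one user id per observation, in corpus order.
    """
    users = sorted(set(user_ids))
    folds = []
    for k in range(num_folds):
        test_users = set(users[k::num_folds])  # every num_folds-th user goes to fold k
        train_idx = [i for i, u in enumerate(user_ids) if u not in test_users]
        test_idx = [i for i, u in enumerate(user_ids) if u in test_users]
        folds.append((train_idx, test_idx))
    return folds

# six observations from three made-up users
ids = ["a", "a", "b", "b", "c", "c"]
folds = user_folds(ids, num_folds=3)
```

Each observation lands in exactly one test fold, and all observations of a user move together, so the model is always evaluated on users it has not been trained on.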
0:14:56	so
0:14:58	we see that
0:15:01	are
0:15:03	the version of our model with one group is actually equivalent to one of
0:15:09	the baselines
0:15:10	which is the basic one
0:15:12	and the second baseline also uses some demographic data which also doesn't
0:15:20	help
0:15:23	in improving
0:15:25	the f-score of the prediction task
0:15:29	but as soon as we introduce more than one group
0:15:34	the performance goes up because we are able to actually distinguish between
0:15:39	the different user behaviors
0:15:44	and this is what happens at test time as we see more and more observations
0:15:48	so we see that already after one
0:15:53	after seeing one observation our model is better at predicting what the
0:15:59	user will do next
0:16:01	and the green line is the entropy of the group membership
0:16:05	probability distribution and this falls throughout the testing phase
0:16:12	so this means that our system is more and more certain about
0:16:17	the actual group that the user
0:16:19	belongs to
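
The entropy of the group membership distribution mentioned above can be computed as in this short sketch. The function name `entropy` and the choice of base 2 are assumptions; the talk does not state the unit.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete probability distribution."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                      # treat 0 * log(0) as 0
    return float(-(nz * np.log2(nz)).sum())

uncertain = entropy([0.5, 0.5])   # no idea which of two groups the user is in
certain = entropy([1.0, 0.0])     # fully committed to one group
```

A falling entropy over the course of the dialogue therefore means that the membership probabilities are concentrating on a single group.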
0:16:22	the second task
0:16:24	is related to comprehension
0:16:28	given the stimulus s which is a visual scene and referring expression
0:16:32	we want to predict the object that the user understood as the referent
0:16:38	our baseline is based on our previous work from thousand fifteen
0:16:43	where we also use a log-linear model as the one i showed in the beginning
0:16:47	and
0:16:49	for this experiment we use
0:16:51	as in that paper we use the data from the give two point five challenge
0:16:56	for training and the give two challenge for testing
0:17:01	however in this dataset
0:17:04	we cannot achieve an accuracy improvement compared to the baseline
0:17:10	and we observe that our model cannot decide which group to assign the
0:17:16	users to
0:17:18	and
0:17:20	even as we tried different features
0:17:22	we could not detect any variability
0:17:26	in the data so
0:17:28	we assume that there might be in this case
0:17:32	the user behaviour doesn't actually we cannot actually cluster the
0:17:38	user behavior into
0:17:40	meaningful clusters
0:17:42	and to test that hypothesis we did a third experiment
0:17:48	where we use the same scenes but with one hundred synthetic users
0:17:53	and we artificially introduced two completely different user behaviors in the dataset
0:18:02	so half of the users always select the most visually salient target and the other
0:18:07	half the least salient
0:18:09	and
0:18:10	in this case we did discover that our model can actually distinguish between those two
0:18:16	groups
0:18:17	and using more than two groups doesn't really improve
0:18:25	the accuracy
0:18:28	and again in the test phase we have the same picture as before so
0:18:34	after a couple of observations our model is
0:18:37	fairly certain that the user belongs to one of the groups
0:18:45	so
0:18:47	to sum up
0:18:49	we have shown that we can
0:18:51	cluster users into groups based on their behavior in data for which we don't
0:18:57	have group annotations
0:18:59	and at test time we can dynamically assign unseen users to groups in the course of
0:19:05	the dialogue
0:19:06	and we can use these assignments to provide better and better predictions of their
0:19:13	behaviour
0:19:15	and in future work we want to try
0:19:19	different datasets
0:19:21	and apply the same method to other dialogue-related prediction tasks
0:19:28	and also
0:19:30	slightly more sophisticated underlying models
0:19:35	and with this thank you for your attention
0:19:56	yes of course it's very task dependent so we only wanted
0:20:03	to predict how the users cluster depending on that we can ask
0:20:27	yes
0:20:35	as i said so
0:20:37	i'm not sure if i said it but we evaluated on just recorded data so
0:20:40	we didn't have that but that's of course a very good thing to do when you
0:20:46	have an actual system
0:21:03	well we expected to so in this task
0:21:10	to be honest it is an easy task for the user right so
0:21:14	i don't know if you can see if you can read that so it
0:21:18	says press the button to the right of the lamp so most users get it
0:21:20	right
0:21:21	but there is some fifteen percent of errors
0:21:26	so we were
0:21:28	hoping to find out some hidden pattern
0:21:33	like why some users
0:21:36	some users for example have difficulty with colours
0:21:40	or with spatial relations
0:21:44	well
0:21:45	we didn't
0:21:48	yes it's probably
0:22:16	so for the for the production task
0:22:28	yes so we didn't
0:22:32	so for this task studies in the literature say that
0:22:37	there are basically two clearly distinguishable groups
0:22:41	and some people are in between
0:22:44	so this might be why we have a slight improvement for
0:22:49	six or seven
0:22:51	groups like
0:22:53	maybe
0:22:56	when we have six or seven groups we have like
0:23:01	groups that happen to capture some particular users' behaviour but which have very low prior
0:23:07	probability
0:23:08	but we do find the main two groups which are
0:23:13	people who always use relations and
0:23:17	people who don't
0:23:34	you mean to look at particular feature weights
0:24:01	yes we did look at that i don't remember exactly
0:24:08	what we found out but we
0:24:10	we did find out that there are like
0:24:15	some particular features which
0:24:18	which have completely different weights
0:24:25	but i don't remember which ones
0:24:27	which one

Discovering User Groups for Natural Language Generation

Oral Session 2: Generation 2

Nikos Engonopoulos, Christoph Teichmann, Alexander Koller