Hi everyone. I'm Nikos, from Saarland University in Germany, and I'm going to talk about a way to discover user groups for natural language generation in dialogue. This is joint work with Christoph Teichmann and Alexander Koller.
Let's look at this example. We have a navigation system that tells the user: "Turn right after Melbourne Central." User A succeeds in finding the destination, but user B fails. So why could that be?
Well, there are different reasons why users react differently to such instructions. Most likely, user B is simply not from Melbourne, so they don't know what "Melbourne Central" means. But we can also imagine other reasons, such as demographic factors like age, or experience with navigation systems. However, such information is often difficult to obtain.
We could ask every user of the navigation system where they are from, but in an interactive setting it is more appealing to collect observations and react to them. So, ideally, after observing something like this, a system would conclude: OK, user A understands place names from Melbourne, but I should adapt to user B and say something that doesn't rely on local place names, like "take the third street on the right".
People deal with this problem in different ways. One approach is, of course, to completely ignore it, which we don't want. Another approach is to train one model for every user; however, this requires a lot of data for that user, and we might lose information that could come from similar users. Yet another approach would be to use predefined groups, for example a group for residents of Melbourne and another group for outsiders. But this is hard to annotate, and it is also hard to know in advance which categories would be relevant, and which categories we can actually find in the dataset.
So, instead of doing these things, we assume that the users' behavior clusters into groups that we cannot observe, we use Bayesian reasoning to infer those groups from unannotated training data, and at test time we dynamically assign users to those groups as the dialogue progresses.
Our starting point is a simple log-linear model of language use. In particular, the model is agnostic as to whether we are simulating comprehension or production. In general, we want to predict the behavior of the user in response to a stimulus coming from the system. If we are trying to simulate language production, the stimulus can be the communicative goal the user is trying to achieve, and the behavior would be the utterance that the user produces, or some other linguistic choice they make. If we want to predict what the user will understand, the stimulus is a system-produced utterance, and the behavior is the meaning that the user assigns to that utterance.
This is what our basic model looks like, before we add user groups. It is a log-linear model with a real-valued parameter vector ρ and a set of feature functions φ over behaviors and stimuli. This model can be trained on a dataset of behavior-stimulus pairs with ordinary gradient-based methods. We have actually already used this kind of model in earlier work, for referring expression resolution in dialogue.
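To make this concrete, here is a minimal sketch of such a log-linear behavior model in Python; the candidate behaviors, the stimulus, and the feature function are invented for illustration and are not the actual features used in this work:

```python
import math

def loglinear_prob(b, s, rho, phi, behaviors):
    """p(b | s; rho) proportional to exp(rho . phi(b, s)),
    normalized over all candidate behaviors."""
    def score(b_):
        return sum(r * f for r, f in zip(rho, phi(b_, s)))
    z = sum(math.exp(score(b_)) for b_ in behaviors)
    return math.exp(score(b)) / z

# Toy example: predict whether a speaker uses a spatial relation.
behaviors = ["relation", "no_relation"]
phi = lambda b, s: [1.0 if b == "relation" else 0.0,
                    1.0 if (b == "relation" and s["distractor_close"]) else 0.0]
```

With all weights at zero the model is uniform; training adjusts ρ on observed behavior-stimulus pairs.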
Now, if we want to extend this model with user groups, we just assume that there is a finite number of user groups underlying the data, and we give each group its own parameter vector. So we replace the single vector ρ from before with group-specific parameter vectors ρ_g. If we knew exactly which group a user belongs to, all we would have to do is plug in these new parameters, and we would get a prediction model for that group in particular.
However, we want to adapt to users that we haven't seen in the training data. So we assume that the training data was generated in the following way. We have a set of users u, and each user is assigned to a group with a probability given by π, another parameter vector, which determines the prior probability of each group. And, as we said, we have one parameter vector per group, so the behavior of the user now depends not only on the stimulus, but also on their group assignment, through the group-specific parameter vectors.
Now let's suppose we have trained our system on the training data, and a new user starts talking to us. Since we don't know which group they are in, we marginalize over all groups using the prior probabilities. That way we directly have an idea of what they are likely to do, given the prior probabilities observed in the training data, and we can already use this model for interacting with them and then observe their behavior.
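In code, predicting for a brand-new user is just a prior-weighted mixture of the group-specific models; this sketch assumes each trained group model is given as a function computing p(b | s; ρ_g):

```python
def predict_new_user(b, s, prior, group_models):
    """p(b | s) = sum over g of P(g) * p(b | s; rho_g),
    before any observations of this particular user."""
    return sum(p_g * model(b, s) for p_g, model in zip(prior, group_models))

# Toy usage: two hypothetical group models with opposite preferences.
m1 = lambda b, s: 0.9 if b == "A" else 0.1
m2 = lambda b, s: 0.1 if b == "A" else 0.9
```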
As the user keeps interacting with the system, we start collecting observations for them. So let's say we have a set d_u of observations for user u at a particular time step. We can use these observations to estimate which group u belongs to. We can do that because, as I said, we have group-specific prediction models: we can calculate the probability of the user's observations given the group-specific parameters of each group, and we also have the prior membership probabilities, so by Bayes' rule we can compute the probability that the user belongs to each group g given the data.
If we now plug this posterior group membership estimate into the behavior prediction model from before, we get a new prediction model that takes into account the data we have seen for this new user, through the updated group membership estimate. As we collect more observations from the user, we hopefully get a more accurate group membership estimate, and with it a better behavior prediction.
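The posterior update and the adapted prediction can be sketched as follows, again treating the trained group-specific models as given functions; this is the Bayes-rule computation just described, not the authors' actual code:

```python
def group_posterior(observations, prior, group_models):
    """P(g | d_u) proportional to P(g) * product over (b, s) in d_u
    of p(b | s; rho_g)."""
    joint = []
    for p_g, model in zip(prior, group_models):
        lik = p_g
        for b, s in observations:
            lik *= model(b, s)
        joint.append(lik)
    z = sum(joint)
    return [j / z for j in joint]

def predict_adapted(b, s, observations, prior, group_models):
    """Plug the posterior group memberships back into the mixture."""
    post = group_posterior(observations, prior, group_models)
    return sum(p_g * model(b, s) for p_g, model in zip(post, group_models))
```

With an empty observation set this reduces to the prior-weighted prediction; each new observation sharpens the mixture.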
Now, how do we train the system, that is, find the best parameter setting? As I said, our model has parameters π, the prior group membership probabilities, plus one weight vector ρ_g over the features for each of the groups. We assume we have a corpus of behaviors and stimuli, and for each behavior-stimulus pair we know which user it came from, but we do not know that user's group. So we try to maximize the data likelihood according to the behavior probabilities I showed before. However, it is not straightforward to use gradient ascent as for the basic model, because we don't know the group assignments.
Instead, we use a method similar to expectation maximization. In the beginning, we initialize all parameters randomly from a normal distribution. Then, at each iteration, we first compute the group membership probabilities for each user given the data, using the parameter setting from the previous iteration. We use these probabilities as soft frequencies, as if we had observed group assignments drawn from this distribution, so that we get a set of observations with "observed" group memberships. Now we can use ordinary gradient ascent to maximize the lower bound of the likelihood given these observations, which yields a new parameter setting. Then we go back to step one, and we iterate until the likelihood no longer improves by more than a threshold.
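Here is a toy version of this training loop. To keep it short, each group's behavior model is a single Bernoulli parameter instead of a full log-linear model, and the M-step uses closed-form updates on the soft counts instead of gradient ascent; the E/M structure is the same:

```python
import random

def em_user_groups(user_obs, n_groups, n_iters=50, seed=0):
    """Toy EM: group g emits behavior 1 with probability theta[g];
    user_obs maps each user id to a list of 0/1 behaviors."""
    rng = random.Random(seed)
    pi = [1.0 / n_groups] * n_groups          # prior group probabilities
    theta = [rng.random() for _ in range(n_groups)]
    users = list(user_obs)
    for _ in range(n_iters):
        # E-step: posterior group membership for every user
        post = {}
        for u in users:
            w = []
            for g in range(n_groups):
                lik = pi[g]
                for b in user_obs[u]:
                    lik *= theta[g] if b == 1 else 1.0 - theta[g]
                w.append(lik)
            z = sum(w)
            post[u] = [x / z for x in w]
        # M-step: re-estimate pi and theta from the soft counts
        for g in range(n_groups):
            pi[g] = sum(post[u][g] for u in users) / len(users)
            num = sum(post[u][g] * sum(user_obs[u]) for u in users)
            den = sum(post[u][g] * len(user_obs[u]) for u in users)
            theta[g] = num / den if den > 0 else 0.5
    return pi, theta
```

On data with two clearly separated behavior styles, the two recovered theta values end up near 0 and 1, mirroring the group discovery described in the talk.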
Now let's see whether our method works, whether we can discover groups in natural data. Our model is quite generic, so it could be used in any component of a dialogue system that needs to predict the user's behavior. But for the purposes of this work, we evaluated it on specific prediction tasks related to natural language generation.
The first task is taken from referring expression generation. In this case, the stimulus is a visual scene together with a target object, and we want to predict whether the speaker will use a spatial relation in describing that object; for example, whether in this scene they would say something like "the ball in front of the cube", or just "the small ball". The dataset we use is GRE3D3, a commonly used dataset in referring expression generation. It contains scene descriptions by 63 users, and spatial relations are used in 35 percent of them.
In this dataset, it is difficult to predict from the scene alone whether the speaker will use a spatial relation or not, because some users don't use spatial relations at all, some use them all the time, and some are in between. So we expect our model to capture that difference.
The way we evaluate is as follows: we do cross-validation, splitting the data so that the users seen at test time never appear in training. We implement two baselines based on the state of the art for this dataset, which is work from 2014. The one-group version of our model is actually equivalent to the basic baseline. The second baseline additionally uses some demographic data, which, it turns out, does not help to improve the F-score on this prediction task. But as soon as we introduce more than one group, the performance goes up, because we become able to actually distinguish between the different user behaviors.
And this is what happens at test time, as we see more and more observations. Already after seeing one observation, our model is better at predicting what the user will do next. The green line is the entropy of the group membership distribution, and it drops throughout the test phase. This means that our system becomes more and more certain about the actual group the user belongs to.
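The quantity plotted in green is just the Shannon entropy of the posterior group-membership distribution; it can be computed with the standard formula (this is not the speakers' plotting code):

```python
import math

def membership_entropy(post):
    """Entropy in bits of a group-membership distribution P(g | d_u);
    lower entropy means the system is more certain about the group."""
    return -sum(p * math.log2(p) for p in post if p > 0)
```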
The second task is related to comprehension: given a stimulus, which is a visual scene plus a referring expression, we want to predict the object that the user understood as the referent. Our baseline is based on our previous work from 2015, where we also used a log-linear model like the one I showed in the beginning. For this experiment, as in that paper, we used the data from the GIVE-2.5 challenge for training and from the GIVE-2 challenge for testing.
However, on this dataset we could not achieve an accuracy improvement over the baseline, and we observed that our model cannot decide which group to assign the users to. Even as we tried different features, we could not detect this kind of variability in the data. So we assume that, in this case, the user behavior simply does not cluster into meaningful groups.
To test that hypothesis, we ran a third experiment, where we use the same scenes but with 100 synthetic users, artificially introducing two completely different user behaviors into the dataset: half of the users always select the most visually salient target, and the other half the least salient one. In this case, our model can indeed distinguish between those two groups. Using more than two groups does not further improve the accuracy. And again, in the test phase we see the same picture as before: after a couple of observations, the model is quite certain which of the groups the user belongs to.
So, to sum up: we have shown that we can cluster users into groups based on their behavior, using data for which we have no group annotations; that at test time we can dynamically assign unseen users to groups in the course of the dialogue; and that we can use these assignments to provide better and better predictions of their behavior.
In future work, we want to try different datasets, apply the same method to other dialogue-related prediction tasks, and also explore slightly more sophisticated underlying models. And with this, thank you for your attention.
Yes, of course, it is very task-dependent: we only predict how the users cluster with respect to the particular task we are modeling.
As I said — I'm not sure if I said it explicitly — we evaluated on pre-recorded data, so we didn't have live interaction; but that would of course be a very good thing to do, with actual users.
Well, we expected that, because, to be honest, this is an easy task for the user. I don't know if you can read it; it says "press the button to the right of the lamp", so most users get it right. But there are still some fifteen percent of errors, so we had hoped to find some pattern in them — for example, that some users have difficulty with colors, or with spatial relations. But we didn't find that.
Yes, probably. For the production task, the literature on this task says that there are basically two clearly distinguishable groups, with some people in between. This might be why we see a slight improvement for six or seven groups: with six or seven groups, some of them happen to capture some particular user's behavior, but they have very low prior probability. The two main groups that we do find are, roughly, the people who always use spatial relations and those who don't.
You mean looking at particular feature weights? Yes — well, we didn't look at it in depth, and I don't remember exactly what we found, but we did find that there are some particular features whose weights differ completely across the groups. I don't remember which ones, though.