Okay, my name is Lena, and I'm from the Natural Language and Dialogue Systems lab at UC Santa Cruz, presenting the paper "Controlling Personality-Based Stylistic Variation with Neural Natural Language Generators".
So, the problem: prior work on task-oriented neural NLG from structured data has focused on avoiding semantic errors, which has resulted in stylistically uninteresting outputs.
So for example, here are two references generated for the same MR. Both realize all of the attributes in the MR, but that is really all that they do.
So our goal is to train a neural NLG to use semantic and stylistic variation, by controlling the input data and the amount of supervision available to the model. Neural models really need lots of training data to learn style.
So we use a statistical generator, PERSONAGE, which is able to generate data that varies along the Big Five personality traits, to create stylistic variation. We use five personalities, agreeable, conscientious, disagreeable, extravert, and unconscientious, to generate data using the train and dev MRs of the E2E challenge. With PERSONAGE we can systematically control the types of stylistic variation produced, and we know exactly which stylistic variation is in our references, so we can check what the model is reproducing.
So here are two examples on the screen, one for the agreeable personality and one for the disagreeable personality. The agreeable one uses pragmatic markers, and the disagreeable one uses emphasizers like "actually". Also, the disagreeable example is broken up into five sentences, whereas the agreeable one is all in one sentence.
For our data distribution, we have 88,855 total utterances in training, generated from 3,784 unique MRs, with 17,771 references per personality. For test, we generate 1,390 total utterances from 278 unique MRs, with exactly one reference per personality for each MR.
So for this data, the MRs, our prompts, come from the E2E challenge: the train and test MRs are taken directly from the E2E generation challenge, so the distribution of this data follows the E2E challenge.
The training data is a bit more balanced in the number of attributes per MR, mostly four or five attributes per MR, while the test data has quite a bit more attributes per MR, mostly seven or eight actually. We think this makes the test set a little harder than the training set.
So there are five types of aggregation operation that PERSONAGE can use to combine the attributes of the MR. There is the period operation, "X. Y."; the "with" cue, "X, with Y"; the conjunction operation, "X and Y"; merge, which combines attributes that share the same verb; and the "also" cue, "X, also it has Y".
Aggregation operations are necessary to combine the attributes together. Looking at the distribution, most of the personalities use most of the aggregation operations, but there is still some variety. Disagreeable uses the period operation a lot more than all of the other voices, and extravert is a lot more likely to use the conjunction operation than the others, so we can still see that the voices are different. A small sketch of the five operations follows.
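To make the operations concrete, here is a minimal sketch treating each one as a string template over attribute realizations; the function names and exact templates are our simplification, not PERSONAGE's actual rule engine.

```python
# Toy templates for the five aggregation operations; names and wording
# are our simplification, not PERSONAGE's implementation.

def period(x, y):       return f"{x}. {y}."
def with_cue(x, y):     return f"{x}, with {y}"
def conjunction(x, y):  return f"{x} and {y}"
def merge(subj, preds): return f"{subj} is {' and '.join(preds)}"  # shared verb
def also_cue(x, y):     return f"{x}, also it has {y}"

x = "X is an Italian place in Riverside"
print(period(x, "It has a decent rating"))       # favored by disagreeable
print(conjunction(x, "it has a decent rating"))  # favored by extravert
```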
Here are sample pragmatic markers that PERSONAGE can use. In total we have about thirty-one of these binary parameters.
Some of these are the request confirmation ("let's see what we can find on X"); the emphasizers, like "really", "basically", "actually", "just"; competence mitigation ("come on", "obviously"); and in-group markers. None of these pragmatic markers are necessary for a grammatically correct sentence or utterance.
You can see that not all of the personalities use every pragmatic marker. You end up with some, like the tag question, that are really only used by agreeable, while many of them are used by multiple personalities: some are pretty much equally used by disagreeable and conscientious, and some are a little less exclusive, so the "you know" marker is mostly used by extravert, but agreeable will also use it. A sketch of markers as optional decorations follows.
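As a rough illustration that markers are optional decorations on an already grammatical utterance, here is a sketch with a handful of markers; the inventory and realizations are examples of ours, not the full set of thirty-one parameters.

```python
# Sketch: pragmatic markers as optional edits to a grammatical utterance.
# The marker set and realizations here are illustrative only.

MARKERS = {
    "request_confirmation": lambda u: f"Let's see what we can find on X. {u}",
    "emphasizer":           lambda u: u.replace(" is ", " is really ", 1),
    "tag_question":         lambda u: f"{u.rstrip('.')}, you see?",
    "expletive":            lambda u: f"Oh god. {u}",
}

utterance = "X is an Italian place in Riverside."
for name, decorate in MARKERS.items():
    print(f"{name:21} -> {decorate(utterance)}")
```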
So we begin with the TGen system from Dušek et al., and we have three different models with varying levels of supervision. There is the NoSup model: the NoSup model directly follows the baseline and has no supervision. The Token model adds a single token that specifies the personality, similar to multilingual machine translation approaches. And our Context model directly encodes the thirty-six parameters, the pragmatic markers and aggregation operations from PERSONAGE, as context in a feed-forward network.
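Here is a sketch of how the three levels of supervision could attach to a training pair; TGen's real data interface differs, and the MR string, token name, and dimensions below are illustrative.

```python
import numpy as np

mr = "name[X], eatType[restaurant], food[Italian], area[riverside]"

# NoSup: the model sees only the MR, with no style information.
nosup_input = mr

# Token: one extra token naming the personality, as in multilingual MT.
token_input = "DISAGREEABLE " + mr

# Context: a 36-dim binary vector (5 aggregation operations plus 31
# pragmatic markers) pushed through a feed-forward layer as extra context.
context_vec = np.zeros(36)
context_vec[0] = 1.0                 # e.g. period operation on (our indexing)
W = np.random.randn(64, 36)          # toy feed-forward weights
context_embedding = np.tanh(W @ context_vec)
```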
Here is an example output from our context model. This first realization had no aggregation and no pragmatic markers, so each attribute sits in its own sentence and there is no variety; it is just realizing the attributes.
Then we have three examples from the personalities, the first agreeable. It describes an Italian restaurant in Riverside, moderately priced, with a decent rating, and family friendly. It has a request confirmation at the beginning ("let's see what we can find on..."), an acknowledgment ("well"), a tag question at the end ("you see?"), and it also uses the also-cue for aggregation.
The second one is in the disagreeable voice: "God, I don't know...", describing the same moderately priced Italian place in Riverside, family friendly. It has an expletive ("god") and an initial rejection with the "I don't know", and this one still uses the also-cue ("there is also...").
The final one is in the extravert voice: "Basically, it's really an Italian place... and actually moderately priced... decent rating... you know." So on the one hand it has the emphasizers ("basically", "actually") and the "you know" marker, and it only uses merge and conjunction, so although it expresses everything, it is just one sentence; there is no use of the period operation.
So, automatic metrics. These really just reward systems whose output is similar to the training data, and they are inherently bad for evaluating stylistic variation. Our context model does perform the best here, but the numbers may not mean a great deal; we are mostly showing these for completeness.
And we propose new metrics for evaluating semantic quality and stylistic variation.
So first we evaluate the semantic quality, using four types of errors, comparing the attributes in the original MR to the realizations. The first is deletions, which is when an attribute in the MR is not realized in the output. Repetitions is where an attribute appears in the reference multiple times. Substitution is where an attribute is realized in the reference with an incorrect value; so for example, if the MR said it was an Italian restaurant and the reference said a French restaurant, that would be a substitution. And then hallucinations, which is when an attribute is realized in the reference that was not in the original MR.
So we have a big table here that has the values for each model and each personality for deletions, repetitions, and substitutions. It is a very full table, and it is hard to tell which model is doing the best overall, so we simplified it and computed a slot error rate, which is the sum of those four semantic errors over the number of slots, or attributes. This is modeled after the word error rate.
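In code, the slot error rate is just the four error counts summed and divided by the slot count, by analogy with word error rate; a minimal version:

```python
def slot_error_rate(deletions, repetitions, substitutions,
                    hallucinations, num_slots):
    """Sum of the four semantic error counts over the number of slots."""
    return (deletions + repetitions + substitutions + hallucinations) / num_slots

# e.g. an 8-slot MR with one deletion and one hallucination:
print(slot_error_rate(1, 0, 0, 1, 8))  # 0.25
```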
And now we have a much simpler table where you can actually see the difference between the models. You can see that NoSup has performed the best, but also that this comes at a cost in stylistic variation, and that context really is not that much worse.
So that was the semantic quality, and now we want to measure the stylistic variation. First we compute the Shannon text entropy of each model's output to see how varied the results are. The context model performs the best of the three models and is closest to the original PERSONAGE training data, so it is about as varied as the original data.
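A minimal word-level version of the entropy computation, assuming simple whitespace tokenization (the exact tokenization may differ):

```python
import math
from collections import Counter

def text_entropy(texts):
    """Shannon entropy of the word distribution; higher = more varied."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

print(text_entropy(["X is nice.", "X is really nice, pal."]))
```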
We also want to measure whether the models are faithfully reproducing the pragmatic markers that each personality uses. So for every pragmatic marker we calculated its number of occurrences, and then we computed the Pearson correlation between the PERSONAGE training data and the output, for each model and each personality.
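Concretely, this amounts to correlating two count vectors per personality; a sketch with invented counts:

```python
from scipy.stats import pearsonr

markers = ["request_confirmation", "emphasizer", "tag_question", "expletive"]
personage_counts = [120, 15, 90, 3]   # toy counts from the training data
model_counts     = [110, 25, 70, 1]   # toy counts from a model's output

r, _ = pearsonr(personage_counts, model_counts)
print(f"Pearson r = {r:.2f}")  # near 1.0 means faithful reproduction
```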
The context model does the best for most of the personalities. NoSup only has positive values for two of them; the rest are actually negatively correlated. We think this is because conscientious uses easy-to-reproduce markers, mostly the request confirmation and the initial rejection, which generally sit at the very beginning or the very end of the sentence, which makes them easier to reproduce, and NoSup pretty much exclusively produces just one voice, which is very similar to conscientious.
So we did pretty much the same thing for the aggregation operations: we count the occurrences of each operation and compute the Pearson correlation between each model and the test data. Again, context performs better than the other models,
except for one case, this time disagreeable. And you see that Token actually does pretty well here; it does better than NoSup in all but a couple of instances. We think this is because aggregation operations, unlike the pragmatic markers, have to be used; you cannot have a sentence without one. And so you will see that NoSup does fine on pragmatic markers but less well on the aggregation operations; there is more of an opportunity to do better with the aggregation than with the pragmatic markers.
So overall, our context model gives us the best mix of semantic quality and stylistic variation.
We also evaluated the quality of the output with a Mechanical Turk study. We took our best performing model, the context model, and tested whether people can recognize the personality. As a baseline, we randomly selected a set of ten unique MRs from training along with their PERSONAGE references. We gave the Turkers the Ten-Item Personality Inventory (TIPI), and we also asked the Turkers to rate how natural the utterance sounded. Then we evaluated the same unique MRs, generated from the context model, in the same task. We had five Turkers per HIT, and we measured how frequently the majority selected the correct TIPI item versus the opposite item, to get a ratio, which is what we have highlighted.
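The ratio is just the fraction of HITs whose majority vote lands on the correct TIPI item; a sketch with invented judgments:

```python
def majority_correct_ratio(hits):
    """hits: one list of 5 judgments per HIT, each 'correct' or 'opposite'."""
    wins = sum(h.count("correct") > h.count("opposite") for h in hits)
    return wins / len(hits)

hits = [["correct"] * 4 + ["opposite"],      # majority correct
        ["opposite"] * 3 + ["correct"] * 2]  # majority opposite
print(majority_correct_ratio(hits))          # 0.5
```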
PERSONAGE had over fifty percent for all of the personalities. Our context model was over fifty percent on everything except agreeable; conscientious has the lowest percentage, but it does seem to follow the PERSONAGE trend, just a little bit lower.
We also got the TIPI rating, on a one-to-seven scale, from the Turkers, and we basically averaged the rating of the matching item in each case; so for agreeable, it is the average rating of the agreeable TIPI item. The averages are about the same for PERSONAGE and for the context model for most of them; for agreeable it is about the same, and for unconscientious and conscientious the context model actually does a little better than the original PERSONAGE.
We also got the naturalness rating, again on a one-to-seven scale. The context model again has a couple of instances where it actually sounds a little more natural than the original data, so for disagreeable and unconscientious, people rated our model's outputs as more natural in the overall results.
So we also tested our model for generalizability, and we tried to generate output that matches the characteristics of multiple personalities. So, for example, we take the disagreeable voice and the conscientious voice, and we combine them to generate new sentences. Here is one example of our model's output for a fifty-fifty disagreeable and conscientious personality.
To evaluate it, we look at the average occurrence of the different features. Here are two examples that are pretty clear. The period aggregation operation is a lot more common in disagreeable than in conscientious, and when we combine them, the result lands right in the middle. And the same with the expletive pragmatic marker: it is much more common in disagreeable than in conscientious, and the combined result is right in between. So it really does indicate that the model is not sticking to one voice or the other; it is sort of averaging them and handling unseen data well, as the small check below illustrates.
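The check itself is simple: for each feature, the blended model's average occurrence should land between the two single-personality averages; the numbers below are invented for illustration.

```python
# Invented per-utterance averages for one feature (the period operation).
disagreeable_avg  = 3.0
conscientious_avg = 1.0
blended_avg       = 2.1   # measured on the 50/50 model's outputs

# The blend sits between the two source personalities.
assert conscientious_avg < blended_avg < disagreeable_avg
```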
And this is from a model that we only ever trained on single personalities; we never trained it on mixed personalities, so this shows that a neural model can blend voices to express a novel personality it was never trained on.
So, in conclusion, we show that neural models can be used to generate output that is both syntactically and semantically correct, based on the E2E generation challenge, and that neural models are able to produce stylistic variation in a controlled setting, based on the type of data they are trained on and the amount of supervision they are given in training. We are currently focusing on new forms of stylistic variation. Our dataset is available at the link.
Well, so all of these results are actually from the human test, which I discussed first. We got around the same results as the original data; we really just wanted to show that the neural model, the context model, is still producing these personalities in a way that is recognizable, so people can still tell that the conscientious voice is conscientious. It is not just that we are looking at these pragmatic markers and seeing that it repeats them; it is actually still the same personality as in the training data.