the next talk is about improving interaction quality estimation with bi-lstms
and the impact on dialogue policy learning
great
so welcome everyone to my talk today
on improving interaction quality estimation with bi-lstms
and the impact on dialogue policy learning
i didn't check but i'm quite sure i've already won the award for
the longest paper title
nonetheless
let's get started
and in reinforcement learning
one of the things that has a huge influence on the learnt behaviour
is the reward
and this is also true for
the world of task oriented dialogue
and in a modular statistical dialogue system we use reinforcement learning to learn the
dialogue policy
which maps
a belief state
which represents the
progression over the several
dialogue turns
from the input side
to a system action which is then
transformed into a response to the user
and the task of reinforcement learning is then to find the policy that maximizes the
expected future reward
also called the optimal policy
and
for most task-oriented dialogue systems
the reward that's used is
task success
so you have some measure of task success, usually
the user provides some task information or you can
derive it in some other way
and
you can check what the system responses look like and then you can
evaluate whether the task was successfully achieved
or not
however
i think that the reward should optimise user satisfaction instead and not
task success
and this is mainly for two reasons
first of all user satisfaction better represents
what the user wants
and in fact task success has only
been used because it correlates well with user satisfaction
that's the right order
secondly
user satisfaction can be linked
to task- or domain-independent phenomena
and you don't need information about the underlying task
to illustrate this i have just this quick
example dialogue here, you only see the parameters extracted from it, it's from the let's
go bus information
system
you have the asr status, the asr confidence, the activity type and the reprompt
and the claim is that you can derive some measure of user satisfaction just by
looking at this
whereas if you would actually need to look at
task success
you would have to have knowledge about what was actually going on, what were the
system utterances, the user inputs and so on and so forth, as you can see on the
right
and in this work
i'm proposing a novel bi-lstm reward estimator that first of all improves the
estimation of interaction quality itself
and also improves the dialogue performance
and this is done without explicit modelling of temporal features
so you see this diagram
where we don't evaluate the task success anymore but we estimate
the user satisfaction instead
which is an idea that has originally been published
two years ago already, funnily enough at interspeech, also in stockholm
so i'm talking about this topic in stockholm only, apparently
so to model the user satisfaction
we use the interaction quality
as a less subjective metric
and we use the two terms for the same purpose
and previously the estimation
was making use of a lot of
manually handcrafted features to encode the temporal information
and in the proposed estimator
i'm showing you on the next slide how you can do this
without the need for those handcrafted temporal features
so there are two aspects to this talk, one is the
interaction quality estimation itself
and afterwards i'm going to talk about how
using this as a reward actually influences the dialogue policy
so
first of all let's have a closer look at interaction quality and how it is modelled
and how it used to be modelled with all the handcrafted stuff going on
you see the modular architecture of a dialogue system, and information is extracted from the user
input and the system response, and these together constitute one exchange
so you end up with a sequence of exchanges from the
beginning of the dialog up to the current
turn t, we call it an exchange in this case because it
contains information about both
sides, user and system
and
this is the exchange level
and the temporal information used to be encoded
on the window level, where you look at a bigger picture, a
window of three in this example, and also on the overall dialogue level
and the parameters of both those levels were concatenated
and then fed into a classifier
to estimate interaction quality
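just to make this concrete, here's a rough sketch of what such handcrafted features could look like (an illustrative reconstruction, not the original feature set; the concrete parameters and the mean aggregates over the window and the dialogue are assumptions)

```python
# Illustrative sketch of the handcrafted temporal features: exchange-level
# parameters are aggregated over a window of the last three exchanges and
# over the whole dialogue so far, and everything is concatenated into one
# input vector for the classifier. The parameter names and the use of means
# as aggregates are assumptions for illustration only.

def build_handcrafted_features(exchanges, window_size=3):
    """exchanges: exchange-level parameter dicts from the first exchange
    up to the current one, e.g. {"asr_confidence": 0.8, "reprompt": 0}."""
    current = exchanges[-1]
    window = exchanges[-window_size:]
    keys = sorted(current)

    exchange_level = [current[k] for k in keys]
    window_level = [sum(e[k] for e in window) / len(window) for k in keys]
    dialogue_level = [sum(e[k] for e in exchanges) / len(exchanges) for k in keys]

    # concatenated feature vector that would be fed into the classifier (e.g. an SVM)
    return exchange_level + window_level + dialogue_level

features = build_handcrafted_features([
    {"asr_confidence": 0.9, "reprompt": 0},
    {"asr_confidence": 0.4, "reprompt": 1},
    {"asr_confidence": 0.7, "reprompt": 0},
    {"asr_confidence": 0.3, "reprompt": 1},
])
print(len(features))  # 6: two exchange-level + two window-level + two dialogue-level values
```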
and the interaction quality itself
is then obviously the supervision signal
it's annotated on a scale from five to one
five representing satisfied, one extremely unsatisfied
and it's important to understand that every interaction quality annotation which
exists for every exchange
actually models the whole subdialogue from the beginning up to that exchange
so it's not a measure of
how well was this turn the system reaction or whatever but it's
a measure of how
well
how satisfied was the user from the beginning up to now, so if it
goes down
it might be that the last turn wasn't really great, but also many things before
could also have had an influence
and the novel model
i propose
gets rid of those temporal features, it only uses a very small set
of exchange-level features which you can see here: the asr recognition status, the asr confidence
was it a reprompt or not
what's the modality
the utterance type, is it a statement or a question and so on
or is the system action a confirm, a request or something like that
so these are the parameters we use
and
note that no temporal information needs to be handcrafted anymore
this exchange vector is then
used as input to
an encoder, which encodes the subdialogue using either an lstm or a
bi-lstm
and so
for every subdialogue we want to estimate one interaction quality value, and every subdialogue is
then fed into the encoder
to generate a sequence of hidden states, with an additional attention layer
with the hope of figuring out which
turn actually contributes most to the final estimation
of the
interaction quality itself
the attention layer is a set of attention values
computed based on the context of the other
hidden states
from which the weights are computed
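to make the model concrete, here's a minimal pytorch sketch of a bi-lstm estimator with an attention layer over the hidden states (an illustrative reconstruction, not the implementation from the paper; the feature dimension, hidden size and the exact attention variant are assumptions)

```python
# Minimal sketch of a BiLSTM interaction quality estimator with attention.
# Dimensions and the simple linear attention scoring are illustrative assumptions.
import torch
import torch.nn as nn

class BiLSTMIQEstimator(nn.Module):
    def __init__(self, feature_dim=16, hidden_dim=64, num_classes=5):
        super().__init__()
        # bidirectional LSTM over the sequence of exchange-level feature vectors
        self.encoder = nn.LSTM(feature_dim, hidden_dim, bidirectional=True, batch_first=True)
        # attention: score each hidden state, normalise the scores with a softmax
        self.attn_score = nn.Linear(2 * hidden_dim, 1)
        # classify the attention-weighted summary into one of the five IQ classes
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, exchanges):
        # exchanges: (batch, turns, feature_dim), one subdialogue per row,
        # from the first exchange up to the current one
        hidden, _ = self.encoder(exchanges)                      # (batch, turns, 2*hidden_dim)
        weights = torch.softmax(self.attn_score(hidden), dim=1)  # attention weight per turn
        summary = (weights * hidden).sum(dim=1)                  # weighted sum of hidden states
        return self.classifier(summary)                          # logits over IQ values 1..5

# example: one subdialogue consisting of seven exchanges
model = BiLSTMIQEstimator()
logits = model(torch.randn(1, 7, 16))
predicted_iq = logits.argmax(dim=-1).item() + 1  # map class index 0..4 to IQ 1..5
```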
and the results of applying this to the task of
interaction quality estimation
so those results
you see the
unweighted average recall, it's just the arithmetic average over all class-wise recalls
the grey ones are the
baselines, the one on the right from two thousand fifteen is
using a support vector machine using
the handcrafted temporal features
and the second one from two thousand seventeen is a
neural network approach which
is making use of a different architecture
but still using the temporal features
for the test data we use the lego corpus
which contains two hundred bus information dialogues
with the
let's go system of pittsburgh
and those results are computed in ten-fold dialoguewise cross validation
and
you can see that the best performing
classifier is the bi-lstm with the attention mechanism
we compared all those with pure bi-lstms or pure
lstms
with and without attention
it achieved an
unweighted average recall of zero point five four, which is an increase over the previous
best
result of zero point zero nine
now those numbers
don't seem to be
very useful
because, i mean, if you want to estimate a reward you want to have
a very good quality and you need to have a certain
certainty that what you actually
get as an estimate
can actually be used as a reward, you don't just get
right-or-wrong indicators
another measure we used to evaluate that is the extended accuracy, where we do not
just look at the actual match but also look at neighbouring values
so if you are estimating a five although it was
originally a four it would still be counted as correct
because the way we
transfer those interaction quality values to the reward
makes it not a very big problem if you're off by one
you will see later how this is done, and then we can see
that we can actually get very good values above the ninety
percent accuracy rate, with zero point nine four
for the attention-based approach, which is
three point zero six points
better than the previous
best result
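to make the two metrics concrete, here's a small sketch of how they can be computed; the toy labels at the bottom are made up for illustration

```python
# Unweighted average recall: arithmetic mean of the per-class recalls.
# Extended accuracy: a prediction also counts as correct if it is off by one IQ level.
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

def extended_accuracy(y_true, y_pred):
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)

print(unweighted_average_recall([5, 4, 3, 1], [5, 3, 3, 2]))  # 0.5
print(extended_accuracy([5, 4, 3, 1], [5, 3, 3, 2]))          # 1.0
```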
and
this estimation of the best performing model, the bi-lstm with the attention mechanism
is then used to train dialogue policies
first of all we have to address the question how can we make use of
an interaction quality value
in a reward. here we see that for the reward based on interaction quality we
use a turn penalty of minus one per turn
and then we actually scale
the interaction quality so that
it takes values from zero to twenty
to be in correspondence to the baseline of task success that has been used in
many different papers already
where you also have the turn penalty and then plus twenty if the dialogue was successful
and zero if not
so
you get the same value range but you have a more fine
grained
interpretation of
how the dialogue actually did
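as an illustration, this is roughly how the two reward signals can be written down (a sketch only; the talk states the target range of zero to twenty, the exact linear rescaling of the one-to-five interaction quality is an assumption)

```python
# Task success baseline reward: -1 per turn, plus 20 if the dialogue was successful.
def task_success_reward(num_turns, success):
    return -num_turns + (20.0 if success else 0.0)

# Interaction quality reward: -1 per turn, plus the final IQ estimate rescaled
# from its 1..5 range to 0..20 (the linear rescaling is an assumption).
def interaction_quality_reward(num_turns, final_iq):
    scaled_iq = (final_iq - 1.0) * 20.0 / 4.0
    return -num_turns + scaled_iq

print(task_success_reward(8, True))        # 12.0
print(interaction_quality_reward(8, 4.0))  # 7.0
```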
we compare
the
best performing estimator, the bi-lstm with attention, against the support vector machine, which is the standing
previous work. as the evaluation system we use pydial
with a setup using the focus tracker and the gp-sarsa policy learning algorithm
we use three different simulation environments
with zero percent error rate, fifteen percent error rate
and thirty percent error rate
we used two different evaluation metrics, one is the task success rate, because even though
we want to optimise towards interaction quality or user satisfaction
it is still
very important to also successfully complete
the task; it doesn't help if
your estimator says this was a very nice dialogue
but the user didn't achieve the task, that's of no use
the second metric we use is the average interaction quality
where we just take the estimate
and
compute the average of all final estimates over the dialogues
and to address the aspect of domain independence
we actually look at many different domains
so the estimator has been trained on the let's go domain
there we have the annotations
for
it, but the domains in which the dialogues
happen are actually totally different, so we have the cambridge restaurants domain
the cambridge hotels domain, the san francisco restaurants and hotels
domains, and the laptops domain
they have different complexity, they have different
aspects to them
so
this basically will showcase that
the approach is actually
domain independent you don't need to have information about the
underlying task
so
now the question is how does this actually perform
you see a lot of bars
here obviously because there are a lot of experiments
there are even more in the paper
but i think what's very interesting here is that
for the task success rate
and the different noise levels we can see
that, comparing the black bars, which is
the reward using the support vector machine
with the blue ones
the reward using the
novel
bi-lstm approach
we can see that
overall the task success
increases, and this is especially interesting for higher noise rates, so here we
have, for all domains combined, we can see that especially for higher noise
rates
the improvement in task success is
very prominent and almost
even comparable
to using the actual task success
as the reward signal
so what the slide tells us is
that
even though we are not using any information about the task
just looking at user satisfaction and actually estimating that
we can still get
on average
almost the same task success rate as when
you're optimising on task success
directly
and
obviously also the interaction quality is of importance
here we show the
average interaction quality, as i said earlier, which is computed
at the end of the dialogue
and here we can see that for the task success based
rewards you already get
decent interaction quality estimates
so the users are estimated to be
not completely unsatisfied, so it's quite okay, but by
optimising towards the interaction quality itself
you also get an improvement on that side
this is not very surprising because
you are actually optimising towards the actual value
you are
showing here, so it would be
bad if it weren't the case
so it's mostly more like a proof of concept
as i said earlier this was all done in simulation, these were simulated experiments
in my publication two years ago i already did evaluations with humans
as a validation
we had humans
talking to the system and using this interaction quality estimate directly to learn
the dialogue policy
and
you see the moving average interaction quality and the moving average task success rate, the
green
curve is using interaction quality and the red one is using task success
you can see that in terms of task success there's not a real
visible difference here
however when you look at
the interaction quality you see also the same aspects as in the simulated experiments
that already after a few hundred dialogues you get
an increase
in
interaction quality estimation
so
what have i told you so far today
we used the interaction quality to model user satisfaction for subdialogues
i presented a novel
recurrent neural network
model that outperforms all previous models without explicitly encoding the temporal information
and this bi-lstm model was then
used to learn dialogue policies in unseen domains
it didn't require knowledge about the underlying task
it increased user satisfaction and
was shown to be
more robust to noise
and the simulated experiments were corroborated in a human evaluation
already a while ago
for future work
it would obviously be
very beneficial
to
apply this to more complex tasks
and also to have a
better understanding of
what the actual differences in the learned policies are
to be able to transfer this into
new knowledge
thank you
any questions
hi, i have two questions
one is about the dataset and whether the lexical content matters
when you go to other domains, okay, and my other question is
one problem that i actually have now is that we just have a normalised
satisfaction score of the user for the whole dialogue
and we don't have this annotation at each dialogue turn
so what are your thoughts about that, because for you, you have annotations at
every dialogue turn, so what is your intuition about that, what
kind of model would be suited to that kind of problem with only a global
user satisfaction at the end, for user satisfaction
estimation
i think it's a very interesting question. i think that probably the
the biggest disadvantage of this approach is that
you
seem to need those turn-level annotations. i think they are quite important during the
learning phase, because during dialogue policy learning you see a
lot of
maybe a lot of interrupted dialogues and all those things, and if you don't have
a good estimate for those, it's hard to learn from those, because an
interruption can come from anything, basically somebody hangs up the phone, and
even though the dialogue was pretty good until then, you wouldn't
you can still get something out of it
so i think if you only have those dialogue-level estimates
you can still make it work i think, but what you need is to set
up your
policy learning
more carefully
maybe
get rid of, don't regard some dialogues as actual experience, because you
won't be able to
take anything out of them
but then it can actually work quite well i think
i don't think the estimator itself
needs the turn-level estimates
as i said those are subdialogues
and if you only consider the whole dialogue and you have enough annotations of those
not only the two hundred dialogues we have here
like i don't know
thousands, millions, i don't know at what scale you operate
then i think it's possible
without the turn-level ones
or you can try using the let's go
estimator and then apply it to your dialogues
we have time for two more questions
hi, thanks for the talk. i was wondering, a lot of people have reported, for instance
in the alexa challenge, that
this user satisfaction can be very noisy
now your corpus was collected some years back
did you see this noise in the corpus and in the annotations, and
how do you think this is affecting the way you're predicting
this interaction reward
so the idea of the interaction quality is especially
it is specifically there to avoid, or to reduce, the noisiness
interaction quality was not collected by
people rating their own dialogues
but it was rated by expert raters after the dialogue, so people sitting there
following
guidelines, some general guidelines on how you apply those interaction quality
labels
and based on that
applying them
then you have multiple raters per exchange, which reduces the noisiness
and
this was done to actually reduce the noisiness
so for the data we have
we were able to cope with the noisiness
one last question
hi, i'm from google
did you see cases where the interaction quality predictions within a dialogue change dramatically
where there were patterns that were interesting, so were there cases of interesting recovery cases
within dialogues, or something that could be learned from these sudden
stepwise processes in dialogues
well the estimation is not
one hundred percent accurate, so there you see drops
but in the annotation you don't see any drops
because
based on the guidelines
it was forbidden basically
so the idea was to have a more consistent labelling
and it was
regarded as rather unlikely that only one single event would kind of
drop the user satisfaction level from a three to a one or something like that
so in the annotation you don't see those
but from the learned policies
i haven't yet done the analysis of
what has actually been learned, comparing this to other things, maybe to
human dialogues or even generated dialogues
but this is, as i said, part of the future work. i think this will
hopefully shed a lot of insight into what these
different reward signals actually
learn
and how we can make use of that
thank you