so our next speaker will be ingrid zukerman
and she'll be talking about the influence of time and risk on response
acceptability in a simple spoken dialogue system
okay so this is worse than we'd and e
and the you know why am
and now it
well
that doesn't want to
cool
that works
okay so
what are we doing here
evaluations of dialogue systems are often based on ratings
however
if you look at research in recommender systems you will see that people's ratings are
inconsistent over time and that leads to what is called the magic barrier you can
only get to a certain point in accuracy due to people's inconsistencies
so
we ask ourselves
is this true for dialogue systems
and of course this has implications for the reliability of the evaluations of systems and
about comparative evaluations between systems
and
while we were at it we also wanted to check the effect of situational
risk on how people view the responses of
a dialogue system
so
we did an experiment we conducted a longitudinal study that is over time
in the context of a spoken dialogue system for a household robot
and
the corpus that we used
was a corpus
of spoken requests
the task of the robot is to fetch or move objects in a room
and this study was in two stages
one of the reviewers of the paper called this heroic thank you
and in the first stage
people selected how they would respond to requests
and after
a year and a half yes
we gave people their own responses and other responses and asked them to rate those
responses
so the questions that we want to answer
how well do participants like their stage one response types and we call them response
types rather than dialogue acts
because one of the response types could be just do what you are asked
that's not a dialogue act
do the users prefer their stage one response types to other response types
and three again the situational risk because well
it was something we were interested in
so the first thing let's describe the corpus
the corpus was created in the past when we were developing our system we
had thirty five participants that described twelve objects
in different images we had a total of
four hundred and seventy eight descriptions because people were allowed repetitions
asr performance this is google now
a bit worse than what
you would think
so word error rate thirteen percent the top ranked interpretation was wrong in about
half the cases and all interpretations were wrong in about a third of the cases
some of the wrong things were little things like a or and
and that was thirteen percent of the cases
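The word error rate quoted above is the standard ASR measure: the word-level edit distance between what was said and what was recognised, divided by the number of words in the reference. A minimal sketch follows. The function name and the example strings are illustrative assumptions, not the scoring code or data used in the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a single "the" heard as "a" already counts as an error.
wer = word_error_rate("the plate in the middle of the table",
                      "a plate in the middle of the table")
```

This also illustrates the point about little errors: one small function word is enough to make an otherwise correct output count as wrong.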
we retained
two hundred and ninety two descriptions why so few
for some of them there was inconsistency in rating like some people rated only stage one
others rated only stage two so we couldn't keep them others our system
couldn't process
as they had more than one prepositional phrase and
we can process those but i will
explain later why we got rid of them
so each of those two hundred and ninety two descriptions had
four
top four asr outputs
and okay let's go back a second
why top four we want the participant
to hear quote unquote what this
spoken language understanding system is hearing which is the output of the asr
and then we took our descriptions as i said that were generated in the context
of another study
and
prepended get or move to each asr output to turn them into requests
then this corpus was divided into sets of at most twelve requests
one per image
so those of you who have heard me before will have seen these pictures
so participants were asked to designate one of the objects a b or c like
eventually all three of them but one at a time
so in this case the participant is describing the hard disk under the table
this is what the asr heard
none of them is correct this is true asr output
and then we put the get in front
so get
that thing
in the second image again the participant wants the ball farther
away from the plate
which object they have
this is what the asr heard
and again we add
the get then
and this time one of the interpretations
is
correct yes the first one
this is our third image
the plate in the middle of the table
so we play the same game i can speed up now
and our final image the bookcase yes that's what they said
and
again
this is what the asr heard
and this time we
do move why because it's a big object
we cannot ask anybody to get the bookcase
okay
now
this is how we collected our corpus and now we start with
the trial stage one
we collected demographic information gender english nativeness that is whether they are a native english speaker
age education
and we also collected risk propensity information because we are interested in the
effect of risk
so we took these from prior work that has
six
risk propensity
statements such as i follow the motto nothing ventured nothing gained
and six
risk aversion statements such as my decisions are always made carefully and accurately and
there are six of each and we measure the agreement on a one to five
likert scale
so
these are the demographic characteristics of
our participants in stage one we had forty participants six of those were not reachable in
stage two so we have thirty four people
seventeen female seventeen male eighteen native english speakers sixteen non-native
and these are the age and education
brought five
and for risk proneness just to give you an idea about the human condition
we subtract the
risk aversion from the risk proneness so the sum of
all their scores
and this is what our population looks like they seem to be more
risk prone than
risk averse
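The net risk score described above can be sketched as the sum of the six risk-propensity ratings minus the sum of the six risk-aversion ratings, each on a one to five Likert scale. The function name, the validation and the example ratings below are illustrative assumptions. The talk only states that aversion is subtracted from proneness over the summed scores.

```python
def net_risk_propensity(prone_ratings, averse_ratings):
    """Sum of risk-propensity ratings minus sum of risk-aversion ratings.

    Assumes six statements of each kind rated on a 1-5 Likert scale,
    so the result lies between -24 and +24.
    """
    for ratings in (prone_ratings, averse_ratings):
        if len(ratings) != 6:
            raise ValueError("expected six ratings per statement group")
        if any(r not in (1, 2, 3, 4, 5) for r in ratings):
            raise ValueError("ratings must be integers from 1 to 5")
    return sum(prone_ratings) - sum(averse_ratings)


# Hypothetical participant who agrees more with the risk-prone statements.
score = net_risk_propensity([4, 4, 5, 3, 4, 4], [2, 3, 2, 3, 2, 2])
```

A positive score corresponds to the more risk prone side of the population described above, a negative one to the more risk averse side.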
so now
now we get to the real stage one
so as i said each participant was shown
the top four asr outputs for each request maximum twelve requests one per image
one item in each image
and they were shown versions of the images where all the objects are numbered
why because they could
pick any object to talk about
to respond
we had two risk conditions low and high we told them that in the
low risk condition the respondent is in the same room as the requester
in the high risk condition the respondent is far away and it will
incur a lot of inconvenience if they do the wrong thing
and
they had four response types to
choose from and
they got explanations of what
each response means
in fact they only got this side
this side is for us
so
do means would you just fetch object number
and put the number of the object you would fetch
confirm is
you want to ask did you mean object and again object number
choose which object did you mean
given a list of objects and rephrase is i can't hear you
i want you to restate so they had four response types to choose from
so this is a sample item so now we see the same room we saw
before
but all the objects are numbered
and this is what the survey looks like so
you have the four outputs assuming that you are in the same room
as the speaker
select one of the responses
get object number did you mean object which object did you mean and for rephrase
we actually gave them the option
to say rephrase the object rephrase the position or rephrase the whole sentence
now we distinguish because the asr makes most of the errors on the object not
on the location
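The four response types and the three rephrase variants in the survey can be sketched as a small enumeration with a text renderer. The names `ResponseType` and `render_response` and the exact wording of the rephrase prompt are assumptions for illustration. Only the get, did you mean and which object did you mean phrasings come from the talk.

```python
from enum import Enum


class ResponseType(Enum):
    DO = "do"
    CONFIRM = "confirm"
    CHOOSE = "choose"
    REPHRASE = "rephrase"


def render_response(kind, objects=(), verb="get", part="sentence"):
    """Turn a response type into survey text for numbered objects."""
    if kind is ResponseType.DO:
        # Just carry out the request on one numbered object.
        return f"{verb} object {objects[0]}"
    if kind is ResponseType.CONFIRM:
        return f"did you mean object {objects[0]}"
    if kind is ResponseType.CHOOSE:
        listed = " or ".join(str(number) for number in objects)
        return f"which object did you mean {listed}"
    if kind is ResponseType.REPHRASE:
        # The survey distinguished object, position and whole sentence.
        if part not in ("object", "position", "sentence"):
            raise ValueError("part must be object, position or sentence")
        return f"please rephrase the {part}"
    raise ValueError("unknown response type")


example = render_response(ResponseType.CHOOSE, objects=[2, 5])
```

Passing `verb="move"` covers the big objects for which get was replaced by move.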
and then
we then said assume that
the speaker is in a remote location would you change your answer
and we asked the same question
so
after stage one we got three corpora
one
so we had
five hundred and eighty four responses so
two hundred and ninety two requests times two risk conditions
and
it will become clear why we have three corpora so the first one is
response corpus
response corpus is what answers we got from our participants
and this is the distribution of the answers under the low and the high
risk conditions
so do is clearly the majority class
and we have confirm choose rephrase and as you can see
there are
fewer dos and more confirms chooses
and rephrases under the high risk condition
in addition we developed
two corpora
author corpus and classifier corpus so what is author corpus
one of the authors
responded to everything
why did we want author corpus because there is a lot of
variability between people and we wanted to see how user variability affects
the result
and the final corpus is called classifier corpus
and what we did is we trained a classifier
to
select responses based on the
based both on author corpus and on response corpus
oh and i promised i would explain sorry
so this is why we threw out the
requests with more than one prepositional phrase because we wanted to restrict the features that
we used for training the classifier because we just wanted a simple classifier
okay so
what does our response classifier look like it assumes that
we have a spoken language understanding system that returns ranked interpretations
we have three types of classification features the asr confidence in the correctness of
its own outputs
how well an interpretation matches the description
the risk of the situation and for response corpus we also have the demographic
and risk propensity information
so
a quick example this is a close up of one of the rooms
the description is the brown stool near the table
so
these two stools match the description well
the one
the one over there is a bit closer but they both
are a pretty good match
what about the classifiers so how did the classifiers do
we tested a whole bunch of classifiers and random forest won
now
the main thing to note is
the bottom line of course we are doing better on author corpus than on
the corpus of all the people
why because there was a lot of variability in responses under the exact same
conditions
but this is just
before you think i'm wasting your time
and this is not important for the purposes of this paper
so now
we proceed to experiment two
a year and a half to two years later
so
each participant is shown
the same asr outputs and images as in stage one
two risk conditions again
and
a bunch of candidate responses
sourced from
the response type in response corpus so their own responses
the author's responses
the response picked by the classifier
and also
do confirm pairs so whenever one of these responses was do if there was
no confirm in the above three we added the confirm
similarly
if one of these was confirm and there was no do
we added the do
of course we didn't repeat
if several of these chose the same response we presented it only once
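The way the stage two candidate list is assembled can be sketched as follows: collect the responses from the three sources, show each distinct response once, and complete the do/confirm pair if only one half is present. The function and parameter names are hypothetical, and the three sources are passed in as plain response-type strings for illustration.

```python
def candidate_responses(own, other, classifier):
    """Build the de-duplicated candidate list for one request.

    `own`, `other` and `classifier` are response types ("do", "confirm",
    "choose" or "rephrase") coming from the three sources.
    """
    candidates = []
    for response in (own, other, classifier):
        # Several sources may pick the same response: present it only once.
        if response not in candidates:
            candidates.append(response)

    # Do/confirm pairing: if one of the pair is present, add the other.
    if "do" in candidates and "confirm" not in candidates:
        candidates.append("confirm")
    elif "confirm" in candidates and "do" not in candidates:
        candidates.append("do")
    return candidates


# Hypothetical request where two sources chose do and one chose choose.
shown = candidate_responses("do", "do", "choose")
```

Pairing do with confirm means every participant who sees one of the two also rates the other, which is what allows the later comparison of confirm against do.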
now we had some
small challenges do and rephrase are direct renditions of the selections in stage one
but for confirm and choose
we needed to do some instantiation
so for confirm we chose the pictorial query equivalent to pointing so we
would say is this what you want
and this is your confirmation the particular plate
for choose we had two options there are two plates on the table
and then
we presented
what was
which one do you want or do you want this or that
now
the pictorial version was restricted to only two or three options
if there were more options in the list
i mean nobody says this or this or this or this or that
it's usually t c
i
and this is what the survey looks like again we have the same image
we have the output
and
now they get to choose between all these responses
and they get to rate them on
a likert scale
again
okay going back to our questions so how did we do
participants' ratings of their stage one responses are significantly lower
than the ratings ascribed to these response types under both risk conditions what
do i mean by ascribed
if you recall in stage one
they had to pick a response how would you respond
so we said okay
in order to account for rater bias
we will say okay the one they picked is their best opinion of them all
so if their highest opinion of anything was a five
we ascribe to the response a five if it was a four we ascribe a
four
but the rating was significantly lower
and
this histogram presents the difference in the rating between
their ascribed responses and their stage two ratings
so for a lot of them
they kept it
so whatever we ascribed they also thought was pretty good
but
for quite a lot of them like two
hundred and thirty three for low risk and hundred and sixty nine for high risk
they significantly reduced the rating
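The ascribed rating can be sketched like this: the response a participant picked in stage one is credited with the highest rating that participant gave to any candidate for the same item in stage two, and the histogram then counts the gap between that ascribed rating and the rating they actually gave. The function names and the example ratings are assumptions for illustration, not data from the study.

```python
def ascribed_rating(stage_two_ratings):
    """Highest rating the participant gave any candidate for this item."""
    return max(stage_two_ratings.values())


def rating_drop(stage_one_type, stage_two_ratings):
    """Ascribed rating minus the actual stage two rating of the
    response type the participant chose in stage one."""
    actual = stage_two_ratings[stage_one_type]
    return ascribed_rating(stage_two_ratings) - actual


# Hypothetical item: the participant chose do in stage one but later
# rated confirm highest.
ratings = {"do": 3, "confirm": 5, "choose": 2}
drop = rating_drop("do", ratings)
```

A drop of zero means the participant still rates their own stage one choice as highly as anything else. A positive drop means some other response is now preferred.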
question two
do participants prefer their stage one response type to other response types
in the paper we have both the author and the classifier
here i'm only showing the classifier why the classifier the version of the classifier that
we are using is the one trained on author corpus and it was not even trained on the
users
so what did we do we took
the responses that
are
different
between the classifier and stage one and then checked
the rating
so
only different responses
so in a lot of cases
stage one was better than the classifier
in quite a few cases they were the same and
in enough cases
the classifier that is trained on somebody else did better than their own previous
self
so this is an example
what to get
in stage one
the user
picked choose
but then in stage two they gave choose a rating of one and confirm
a rating of five
but having said that
at the end of the day
participants' ratings of their stage one response types
are not statistically significantly different from the ratings of different response types under both
risk conditions
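The significance test named in the questions at the end of the talk is the Wilcoxon signed-rank test, which compares paired ratings. A minimal sketch of the test statistic is given below. It drops zero differences and averages tied ranks, which is the classic treatment and an assumption about how the study handled them. It does not compute a p-value, and the example ratings are hypothetical.

```python
def wilcoxon_signed_rank(x, y):
    """Return (W+, W-), the sums of ranks of positive and negative
    paired differences x[i] - y[i], with zero differences dropped."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ordered = sorted(diffs, key=abs)

    # Assign 1-based ranks by absolute difference, averaging over ties.
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        i = j

    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Hypothetical paired ratings: own stage one response vs a different one.
own = [5, 4, 3, 2, 5]
different = [3, 4, 4, 1, 1]
w_plus, w_minus = wilcoxon_signed_rank(own, different)
```

The smaller of the two sums is compared against a critical value or a normal approximation to decide significance. That step is left out here.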
so anything goes basically
influence of risk just quickly
people were more conservative under high risk which is as expected fewer dos
effect of risk on specific response types
so do and choose received lower ratings under high risk
and confirm and rephrase were unaffected by risk
regardless of risk
people rated confirm higher than do and choose with pictures higher than choose
text only
so
to conclude
people's preferences are
fluid over time
various reasonable responses may be acceptable and as we saw a classifier trained on
a small non-target
corpus produced fine responses
risk influences people's attitudes to some response types
and what does that mean
well this has implications for training and evaluating dialog systems but this was in a
restricted setting one-turn dialogues with
a pretend robot
so more studies are required
i
we have some time for questions
thanks it's a very interesting experiment
and i think it does show clearly that there's some variation in response preference which
we see in other experiments too i'm not sure how you come to the
conclusion that the users are fluid through time
given that you're actually asking them to do something different like rating their response
rating responses as opposed to choosing responses is a different task
and if you assume that
users don't have just a fixed choice in mind but sort of a probability
distribution or utility distribution and you're forcing a choice so they pick one and if
you sampled again from the same distribution you'd expect a certain amount of variation so
is it really that users are changing over time or that you're rolling the
dice and you get a
a different number sometimes the second time
yes this is a limitation we thought about that one
well all we can assume is
yes whatever they picked
they must have had a reason for choosing it then
they thought they were making perfect sense
and then they were given the exact same options and then
in retrospect
they were okay with other options that's all i mean
to me that e d case louis
should we have done the experiment differently in retrospect
yes probably but
the original intention of the experiment
was not to do this longitudinal study we kind of stumbled upon
the longitudinal part
but at least this indicates that
you know things are not as
cut and dry as
a lot of people believe
they are anything reasonable goes
we have time for another question
can you go back to slide twenty four i think
wow
i'd have to fix the number in my head otherwise
i couldn't
it was the conclusion not so much a graph
oops
the next one
it doesn't one
sorry i had a hard time
following the reasoning here didn't you just show us that
it was different no i showed there were differences
yes but when you do pairwise comparison along with statistical significance
testing there was none
so although it appears sometimes this wins sometimes that wins
when you do
the pairwise it's not statistically significant at all
we did wilcoxon signed-rank
yes
who
alright let's thank the speaker again