0:00:21so our next speaker will be ingrid zukerman
0:00:25and she'll be talking about the influence of time and risk on response
0:00:30acceptability in a simple spoken dialogue system
0:01:05okay so this is worse than we'd and e
0:01:09and the you know why am
0:01:14and now it
0:01:15well
0:01:16that doesn't want to
0:01:17cool
0:01:18that works
0:01:20okay so
0:01:21what are we doing here
0:01:24evaluations of dialogue systems are often based on ratings
0:01:31however
0:01:33if you look at research in recommender systems you will see that people's ratings are
0:01:39inconsistent over time and that leads to what is called the magic barrier you can
0:01:44only get to a certain point in accuracy due to people's inconsistencies
0:01:49so
0:01:50we ask ourselves
0:01:52is this true for dialogue systems
0:01:56and of course this has implications about the reliability of the evaluations of systems and
0:02:02about comparative evaluations between systems
0:02:06and
0:02:07while we were at it we also wanted to check the effect of situational
0:02:11risk on how people view the responses of
0:02:16a dialogue system
0:02:20so
0:02:21we did an experiment we conducted a longitudinal study that is over time
0:02:28and in the context of a spoken dialogue system for a household robot
0:02:33and
0:02:34the corpus that we used
0:02:36was a corpus
0:02:39of spoken requests
0:02:42that task a robot to fetch or move objects in a room
0:02:46and this study was in two stages
0:02:50one of the reviewers of the paper called this heroic thank you
0:02:54and in the first stage
0:02:57people selected how they would respond to requests
0:03:01and after
0:03:02a year and a half to two years
0:03:05we gave people their own responses and other responses and asked them to rate those
0:03:11responses
0:03:15so the questions that we want to answer
0:03:21how well do participants like their stage one response types and we call them response
0:03:28types rather than dialogue acts
0:03:31because one of the response types could be just do what you are asked
0:03:37that's not a dialogue act
0:03:40do the users prefer their stage one response types to other response types
0:03:47and three again the situational risk because well
0:03:52it was something we were interested in
0:03:56so the first thing let's describe the corpus
0:04:00the corpus was created in the past when we were developing our system we
0:04:06had thirty five participants that described twelve objects
0:04:10in different images we had a total of
0:04:13four hundred and seventy eight descriptions because people were allowed repetitions
0:04:19asr performance this is google now
0:04:23a bit worse than what
0:04:25you would think
0:04:28so word error rate thirteen percent the top ranked interpretation was wrong in about
0:04:33half the cases and all interpretations were wrong in about a third of the cases
0:04:39some of the wrong things were little things like a or and
0:04:43and that was thirteen percent of the cases
0:04:47we retained
0:04:48two hundred and ninety two descriptions why so few
0:04:55for some of them there was inconsistency in rating like some people rated only stage one
0:05:00others rated only stage two so we couldn't keep them others our system
0:05:04couldn't process
0:05:07and others had more than one prepositional phrase and
0:05:13we can process those but i will
0:05:16explain later why we got rid of them
0:05:19so each of those two hundred and ninety two descriptions had
0:05:25four
0:05:26top four asr outputs
0:05:30and okay let's go back a second
0:05:33why top four we wanted the participant
0:05:38to hear quote unquote what this
0:05:41spoken language understanding system is hearing which is the output of the asr
0:05:49and then we took our descriptions as i said that were generated in the context
0:05:54of another study
0:05:56and
0:05:57prepended get or move to each asr output to turn them into requests
0:06:05then this corpus was divided into sets of at most twelve requests
0:06:10one per object
0:06:11so those of you who have heard me before will have seen these pictures
0:06:17so participants were asked to designate one of the objects a b or c like
0:06:23eventually all three of them but one at a time
0:06:26so in this case the participant is describing the hard disk under the table
0:06:34this is what the asr heard
0:06:36none of them is correct this is true asr output
0:06:40and then we put the get in front
0:06:43so get
0:06:44that thing
0:06:46in the second image again the participant wants the ball farther
0:06:52away from the plate
0:06:55which object they have
0:06:57this is what the asr heard
0:07:00and again we add
0:07:02the get
0:07:03and this time one of the interpretations
0:07:06is
0:07:07correct yes the first one
0:07:10this is our third image
0:07:12the plate in the middle of the table
0:07:15so we play the same game i can speed up now
0:07:18and our final image the bookcase yes that's what they said
0:07:24and
0:07:25again
0:07:27this is what the asr heard
0:07:29and this time we
0:07:30do move why because it's a big object
0:07:33we cannot ask anybody to get the bookcase
0:07:37okay
0:07:38now
0:07:39this is how we collected our corpus and now we start with
0:07:43the trial stage one
0:07:45we collected demographic information gender english nativeness whether they are native english speakers
0:07:53age education
0:07:55and we also collected risk propensity information because we are interested in the
0:08:01effect of risk
0:08:04so we collected these from work by rohrmann that has
0:08:08six risk
0:08:09proneness
0:08:10statements such as i follow the motto nothing ventured nothing gained
0:08:15and six
0:08:16risk aversion statements my decisions are always made carefully and accurately and
0:08:22there are six of each and we measured the agreement on a one to five
0:08:26likert scale
0:08:29so
0:08:30these are our demographic characteristics of
0:08:34in stage one we had forty participants six of those were not reachable in
0:08:39stage two so we have thirty four people
0:08:41seventeen female seventeen male eighteen native english speakers sixteen non-native
0:08:47and these are the age and education
0:08:51profile
0:08:56and for risk proneness just to give you an idea about the human condition
0:09:02we subtract the
0:09:03risk aversion from risk proneness so the sum of
0:09:07all their scores
0:09:09and this is what our population looks like they seem to be more
0:09:13risk prone than
0:09:15risk averse
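The scoring just described can be sketched as follows. This is a minimal illustration, assuming six risk-proneness and six risk-aversion statements rated on a one-to-five Likert scale; the function name and the input checks are hypothetical and not taken from the talk.

```python
def risk_propensity(prone_ratings, averse_ratings):
    """Sum of risk-proneness ratings minus sum of risk-aversion ratings.

    Assumes six statements of each kind, each rated 1..5, so the score
    ranges from -24 (fully risk averse) to +24 (fully risk prone).
    """
    for ratings in (prone_ratings, averse_ratings):
        if len(ratings) != 6:
            raise ValueError("expected six ratings per scale")
        if any(r < 1 or r > 5 for r in ratings):
            raise ValueError("ratings must be on a 1..5 Likert scale")
    return sum(prone_ratings) - sum(averse_ratings)


# A participant who agrees more with the risk-prone statements
# gets a positive score.
score = risk_propensity([4, 5, 3, 4, 4, 5], [2, 3, 2, 1, 3, 2])
```

A positive score corresponds to the "more risk prone than risk averse" side of the population plot.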
0:09:19so now
0:09:20now we get to the real stage one
0:09:23so as i said each participant was shown
0:09:26the top four asr outputs for each request maximum twelve requests one per
0:09:32item in each image
0:09:36and they were shown versions of the images where all the objects are numbered
0:09:41why because they could
0:09:43pick any object to
0:09:45to respond
0:09:47we had two risk conditions low and high we told them that in the
0:09:51low risk condition the responder is in the same room as the requester
0:09:57in the high risk condition the responder is far away and it will
0:10:01incur a lot of inconvenience if they do the wrong thing
0:10:07and
0:10:08they had four response types to
0:10:11choose from and the
0:10:13they got explanations of what
0:10:14each response means
0:10:16in fact they only got this side
0:10:19this side is for us
0:10:21so
0:10:23do means would you just fetch object number
0:10:28and put the number of the object you would fetch
0:10:31confirm is
0:10:33you want to ask did you mean object again object number
0:10:37choose which object did you mean
0:10:39given a list of objects and rephrase is i can't hear you
0:10:46i want you to restate so they had four response types to choose from
0:10:51so this is a sample item so now we see the same room we saw
0:10:54before
0:10:55but all the objects are numbered
0:10:59and this is what the survey looks like so
0:11:03you have the four outputs assuming that you are in the same room
0:11:07as the speaker
0:11:09select one of the responses
0:11:12get object number did you mean object which object did you mean and for rephrase
0:11:18we actually gave them the option
0:11:21to say rephrase the object rephrase the position or rephrase the whole sentence
0:11:29now we distinguish because the asr makes most of the errors on the object not
0:11:34on the location
0:11:36and then
0:11:37we went assume that
0:11:39the speaker is in a remote location would you change your answer
0:11:43and we asked the same question
0:11:48so
0:11:49after stage one we got three corpora
0:11:53one
0:11:55so we had
0:11:55five hundred and eighty four responses so
0:11:59two hundred and ninety two requests under two risk conditions
0:12:02and
0:12:04it will become clear why we have three corpora so the first one is
0:12:07response corpus
0:12:11response corpus is what answers we got from our
0:12:15participants
0:12:18okay what answers we got from our participants
0:12:22and this is the distribution of the answers under the low and the high
0:12:27risk conditions
0:12:28so do is clearly the majority class
0:12:31and we have confirm choose rephrase and as you can see
0:12:37the
0:12:39there are
0:12:39less dos and more confirms and chooses
0:12:43and rephrases under the high risk condition
0:12:47in addition we developed
0:12:50two corpora
0:12:51author corpus and classifier corpus so what is author corpus
0:12:56an author
0:12:58responded to every request
0:13:01why did we want the author corpus because there is a lot of
0:13:04variability between people and we wanted to see how user variability affects
0:13:11the result
0:13:13and the third and final corpus is called classifier corpus
0:13:18and what we did is we trained a classifier
0:13:23to
0:13:23select responses based on the
0:13:29based both on author corpus and on response corpus
0:13:36oh and i promised i would tell you later sorry
0:13:39so this is why we threw out the
0:13:43requests with more than one prepositional phrase because we wanted to restrict the features that
0:13:48we used for training the classifier because we just wanted a simple classifier
0:13:54okay so
0:13:56what does our response classifier look like it assumes that
0:14:01we have a spoken language understanding system that returns ranked interpretations
0:14:07we have three types of classification features the asr confidence in the correctness of
0:14:13its own outputs
0:14:15how well an interpretation matches the description
0:14:20the risk of the situation and for response corpus we also have the demographic
0:14:24and risk propensity information
0:14:28so
0:14:29a quick example this is a close up of one of the rooms
0:14:34the description is the brown stool near the table
0:14:37so
0:14:39these two stools match the description well
0:14:42the one
0:14:47the one over there is a bit closer but they both
0:14:51are a pretty good match
0:14:57what about the classifier so how did the classifier do
0:15:01we tested a whole bunch of classifiers and random forest won
0:15:06now
0:15:09the main thing to note is
0:15:13the bottom line of course we are doing better on the author's corpus than on
0:15:19the corpus of all the people
0:15:20why because there was a lot of variability in responses under the exact same
0:15:25conditions
0:15:29but this is just
0:15:31before you think i'm wasting your time
0:15:34and this is not important for the purposes of this paper
0:15:40so now
0:15:41we proceed to experiment two
0:15:44a year and a half to two years later
0:15:47so
0:15:48each participant is shown
0:15:51the same asr outputs and images as in stage one
0:15:56two risk conditions again
0:15:59and
0:16:00a bunch of candidate responses
0:16:03sourced from
0:16:06the response type in response corpus so their own responses
0:16:12the author's responses
0:16:14the response types picked by the classifier
0:16:16and also
0:16:18do confirm pairs so whenever one of these responses was a do if there was
0:16:24no confirm in the above three we added the confirm
0:16:28similarly
0:16:30if one of these was a confirm and there was no do
0:16:33we added the do
0:16:34of course we didn't repeat
0:16:37if several of these chose the same response we presented it only once
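The candidate-set rule described here (drop repeats, then complete a do/confirm pair when only one half is present) can be sketched as below. This is an illustration under assumptions: response types are plain strings, and the function name and list-based representation are hypothetical rather than the study's actual code.

```python
def build_candidates(sourced):
    """Build the list of response types shown for one request.

    `sourced` holds the response types collected from the different
    sources for this request, for example ["do", "do", "choose"].
    """
    candidates = []
    for response in sourced:
        # Several sources may pick the same type; show it only once.
        if response not in candidates:
            candidates.append(response)
    # Complete the do/confirm pair when exactly one half is present.
    if "do" in candidates and "confirm" not in candidates:
        candidates.append("confirm")
    elif "confirm" in candidates and "do" not in candidates:
        candidates.append("do")
    return candidates


shown = build_candidates(["do", "do", "choose"])
```

In this sketch the order of the returned list follows the order of the sources, with any added pair member last; the talk does not say how responses were ordered in the survey.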
0:16:44now we had some
0:16:46small challenges do and rephrase are direct renditions of the selections in stage one
0:16:54but for confirm and choose
0:16:57we needed to do some instantiation
0:17:00so for confirm we chose the pictorial query so we
0:17:06would say is this what you want
0:17:09and this is your confirmation the particular plate
0:17:13for choose we had two options there are two plates on the table
0:17:18and then
0:17:19we presented
0:17:21what was
0:17:22which one do you want or do you want this or that
0:17:26now
0:17:27the pictorial version was restricted to only two or three options
0:17:32if there were more options in the list
0:17:34i mean nobody says this or this or this or that
0:17:39it's usually t c
0:17:40i
0:17:43and this is what the survey looks like again we have the same image
0:17:48we have the output
0:17:50and
0:17:51now they get to choose between all these responses
0:17:55and they get to rate them on
0:17:58a likert scale be on u w t
0:18:05again
0:18:08okay going back to our questions so how did we do
0:18:14participants' ratings of their stage one responses are significantly lower
0:18:20than the ratings ascribed to these response types under both risk conditions what
0:18:25do i mean by ascribed
0:18:27if you recall in stage one
0:18:30they had to pick a response how would you respond
0:18:33so we said okay
0:18:36in order to account for rater bias
0:18:39we will say okay the one they picked is their highest opinion of the
0:18:43set so if their highest opinion of anything was a five
0:18:48we ascribed to the response a five if it was a four we ascribed a
0:18:52four
0:18:53but the rating was significantly lower
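The ascription idea can be sketched as follows: the stage one pick is credited with the participant's highest rating, and that ascribed value is compared with the rating the same response type actually received in stage two. This is a minimal sketch; the dictionary layout, the function name, and the choice of which ratings the maximum is taken over are assumptions, since the talk does not spell them out.

```python
def ascribed_gap(stage_two_ratings, stage_one_type):
    """Stage two rating of the stage one pick minus its ascribed rating.

    `stage_two_ratings` maps response type -> Likert rating given by one
    participant. The ascribed rating is assumed to be that participant's
    highest rating in this set, which is one way to account for raters
    who never use the top of the scale.
    """
    ascribed = max(stage_two_ratings.values())
    return stage_two_ratings[stage_one_type] - ascribed


# The participant picked "do" in stage one but later rated it 2,
# while their highest rating for anything was 5.
gap = ascribed_gap({"do": 2, "confirm": 5, "choose": 3}, "do")
```

A gap of zero means the participant still rates their own earlier choice as highly as anything else; a negative gap means they now prefer something different.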
0:18:58and
0:19:00this histogram presents the difference in the rating between
0:19:07their ascribed ratings and their stage two ratings
0:19:12so for a lot of them
0:19:15they kept it
0:19:16so whatever we ascribed it was pretty good
0:19:20but
0:19:22for quite a lot of them like to
0:19:25hundred and thirty three for low risk and hundred and sixty nine for high risk
0:19:31they significantly reduced the rating
0:19:38question two
0:19:41do participants prefer their stage one response type to other response types
0:19:46in the paper we have both the author and the classifier
0:19:50here i'm only showing the classifier why the classifier the version of the classifier that
0:19:56we are using is the one trained on the author and it was not even trained on the
0:20:00users
0:20:01so what did we do we took
0:20:04we took all the responses that
0:20:07are
0:20:08different
0:20:10between stage two and stage one and then checked
0:20:13the rating
0:20:15so
0:20:15only different responses
0:20:18so in a lot of cases
0:20:21stage one was better than the classifier
0:20:24in quite a few cases they were the same and
0:20:28in enough cases
0:20:32the classifier that is trained on somebody else did better than their own previous
0:20:37self
0:20:40so this is an example
0:20:44what do we get
0:20:45in stage one
0:20:48the user
0:20:49picked choose
0:20:51but then in stage two they gave choose a rating of one and confirm
0:20:55a rating of five
0:21:00but having said that
0:21:03at the end of the day
0:21:05participants' ratings of their stage one response types
0:21:09are not statistically significantly different from the ratings of different response types under both
0:21:16risk conditions
0:21:18so anything goes basically
0:21:22influence of risk just quickly
0:21:25people were more conservative under high risk which is as expected fewer dos
0:21:33effect of risk on specific response types
0:21:36so do and choose received lower ratings under high risk
0:21:40and confirm and rephrase were unaffected by risk
0:21:46regardless of risk
0:21:47people rated confirm higher than do and choose with pictures higher than choose
0:21:53text only
0:21:56so
0:21:57to conclude
0:22:00people's preferences are
0:22:02fluid over time
0:22:04various reasonable responses may be acceptable and as we saw a classifier trained on
0:22:11a small non-target
0:22:13corpus produced fine responses
0:22:17risk influences people's attitudes to some response types
0:22:22and what does that mean
0:22:24well this has implications for training and evaluating dialog systems but this was in a
0:22:30restricted setting with one-turn dialogues with
0:22:33a pretend robot
0:22:34so more studies are required
0:22:37i
0:22:44we have some time for questions
0:22:52thanks it's a very interesting experiment
0:22:56and i think it does show clearly that there's some variation in responses permitted which
0:23:03we see in other experiments too i'm not sure how you come to the
0:23:07conclusion that the users are fluid through time
0:23:12given that you're actually asking them to do something different like rating their response
0:23:16rating responses as opposed to choosing responses is a different task
0:23:20and if you assume that
0:23:22users don't have just a fixed choice in mind but sort of a probability
0:23:25distribution or utility distribution and you're forcing a choice so they pick one and if
0:23:30you sampled again from the same distribution you'd expect a certain amount of variation so
0:23:36is it really that users are changing over time or that you're rolling the
0:23:40dice and you get a
0:23:41a different number sometimes the second time
0:23:44yes this is a limitation we spotted that one
0:23:49well all we can assume is
0:23:52yes whatever they picked
0:23:54they must have had a reason for choosing it then
0:23:57they thought they were making perfect sense
0:24:00and then they were given the exact same options and then
0:24:05in retrospect
0:24:07they were okay with other options that's i mean
0:24:10or what i mean
0:24:12to me that indicates fluidity
0:24:14should we have done the experiment differently in retrospect
0:24:18yes probably but
0:24:20to the intention the original intention of the experiment
0:24:24was not to do this longitudinal study we kind of stumbled upon
0:24:28the longitudinal part
0:24:31but to us this indicates that
0:24:36you know things are not as
0:24:38cut and dried as
0:24:40a lot of people believe
0:24:42they are and anything reasonable goes
0:24:47we have time for another question
0:24:57can you go back to slide twenty four i think
0:25:01wow
0:25:02i had to fix the number in my head otherwise
0:25:06i couldn't mm
0:25:09there was the conclusion not so much a graph
0:25:19oops
0:25:19the next one
0:25:24it doesn't one
0:25:30sorry i had a hard time
0:25:32following the reasoning here didn't you just show us that
0:25:37it was different no i showed there were differences
0:25:41yes but when you do pairwise comparison along with statistical significance
0:25:48testing there was none
0:25:51so although it appears sometimes this wins sometimes that wins
0:25:57when you do
0:25:59the pairwise it's not statistically significant at all
0:26:05we did wilcoxon signed-rank
0:26:08yes
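For readers unfamiliar with the test named in this exchange, the core of a Wilcoxon signed-rank comparison on paired ratings can be sketched as below. This is an illustrative pure-Python sketch, not the analysis code used in the study: it only computes the rank sums of positive and negative differences (zero differences dropped, tied magnitudes given average ranks) and does not compute a p-value. The function name and the sample ratings are made up.

```python
def signed_rank_sums(first, second):
    """Return (W_plus, W_minus) for paired samples `first` and `second`."""
    # Keep only the non-zero paired differences.
    diffs = [a - b for a, b in zip(first, second) if a != b]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        # Find the run of differences with the same magnitude.
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        # Average of the 1-based ranks i+1 .. j+1 for the tied run.
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Made-up ratings: own stage one response type vs a different response type.
own = [5, 3, 4, 2, 5]
other = [4, 5, 1, 2, 4]
w_plus, w_minus = signed_rank_sums(own, other)
```

When the two rank sums are close, as the speaker reports for the pairwise comparisons, the test gives no grounds to say one response type is rated higher than the other.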
0:26:10who
0:26:11alright let's thank the speaker again