0:00:21 | so our next speaker will be ingrid zukerman |
---|
0:00:25 | and she'll be talking about the influence of time and risk on response |
---|
0:00:30 | acceptability in a simple spoken dialogue system |
---|
0:01:05 | okay so this is worse than we'd and e |
---|
0:01:09 | and the you know why am |
---|
0:01:14 | and now it |
---|
0:01:15 | well |
---|
0:01:16 | that doesn't want to |
---|
0:01:17 | cool |
---|
0:01:18 | that works |
---|
0:01:20 | okay so |
---|
0:01:21 | what are we doing here |
---|
0:01:24 | evaluations of dialogue systems are often based on ratings |
---|
0:01:31 | however |
---|
0:01:33 | if you look at research in recommender systems you will see that people's ratings are |
---|
0:01:39 | inconsistent over time and that leads to what is called the magic barrier you can |
---|
0:01:44 | only get to a certain point in accuracy due to people's inconsistencies |
---|
0:01:49 | so |
---|
0:01:50 | we ask ourselves |
---|
0:01:52 | is this true for dialogue systems |
---|
0:01:56 | and of course this has implications about the reliability of the evaluations of systems and |
---|
0:02:02 | about comparative evaluations between systems |
---|
0:02:06 | and |
---|
0:02:07 | while we were at it we also wanted to check the effect of situational |
---|
0:02:11 | risk on how people view the responses of |
---|
0:02:16 | a dialogue system |
---|
0:02:20 | so |
---|
0:02:21 | we did an experiment we conducted a longitudinal study that is over time |
---|
0:02:28 | and in the context of a spoken dialogue system for a household robot |
---|
0:02:33 | and |
---|
0:02:34 | the corpus that we used |
---|
0:02:36 | was a corpus |
---|
0:02:39 | of spoken requests |
---|
0:02:42 | that task a robot to fetch or move objects in a room |
---|
0:02:46 | and this study was in two stages |
---|
0:02:50 | one of the reviewers of the paper called this heroic thank you |
---|
0:02:54 | and in the first stage |
---|
0:02:57 | people selected how they would respond to requests |
---|
0:03:01 | and after |
---|
0:03:02 | a year and a half to two years |
---|
0:03:05 | we gave people their own responses and other responses and asked them to rate those |
---|
0:03:11 | responses |
---|
0:03:15 | so the questions that we want to answer |
---|
0:03:21 | how well do the participants like their stage one response types and we call them response |
---|
0:03:28 | types rather than dialogue acts |
---|
0:03:31 | because one of the response types could be just do what you are |
---|
0:03:37 | that's not a dialogue act |
---|
0:03:40 | do the users prefer their stage one response types to other response types |
---|
0:03:47 | and three again the effect of situational risk because well |
---|
0:03:52 | it was something we were interested in |
---|
0:03:56 | so the first thing let's describe the corpus |
---|
0:04:00 | the corpus was created in the past when we were developing our system we |
---|
0:04:06 | had thirty five participants that described twelve objects |
---|
0:04:10 | in different images we had a total of |
---|
0:04:13 | four hundred and seventy eight descriptions because people were allowed repetitions |
---|
0:04:19 | asr performance this is google now |
---|
0:04:23 | a bit worse than what |
---|
0:04:25 | you would think |
---|
0:04:28 | so word error rate thirteen percent the top ranked interpretation was wrong in about |
---|
0:04:33 | half the cases and all interpretations were wrong in about a third of the cases |
---|
0:04:39 | some of the wrong things were little things like a or and |
---|
0:04:43 | and that was thirteen percent of the cases |
---|
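Editorial note: the talk reports a word error rate without saying how it was computed. A common definition is the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. The sketch below follows that definition. The function name and the example sentence pair are hypothetical and are not taken from the study's data or tooling.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical misrecognition: two of six reference words are substituted.
rate = word_error_rate("the hard disk under the table",
                       "the heart is under the table")
```

Because the denominator is the reference length, a hypothesis with many insertions can score above one under this definition.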
0:04:47 | we retained |
---|
0:04:48 | two hundred and ninety two descriptions why so few |
---|
0:04:55 | for some of them there was inconsistency in rating like some people rated only stage one |
---|
0:05:00 | others rated only stage two so we couldn't keep them others our system |
---|
0:05:04 | couldn't process |
---|
0:05:07 | and others had more than one prepositional phrase and |
---|
0:05:13 | we can process those but i will |
---|
0:05:16 | explain later why we got rid of them |
---|
0:05:19 | so each of those two hundred and ninety two descriptions had |
---|
0:05:25 | four |
---|
0:05:26 | top four asr outputs |
---|
0:05:30 | and okay let's go back a second |
---|
0:05:33 | why top four we wanted the participant |
---|
0:05:38 | to hear quote unquote what the |
---|
0:05:41 | spoken language understanding system is hearing which is the output of the asr |
---|
0:05:49 | and then we took our descriptions as i said that were generated in the context |
---|
0:05:54 | of another study |
---|
0:05:56 | and |
---|
0:05:57 | prepended get or move to each asr output to turn them into requests |
---|
0:06:05 | then this corpus was divided into sets of at most twelve requests |
---|
0:06:10 | one pair of g |
---|
0:06:11 | so those of you who have heard me before will have seen these pictures |
---|
0:06:17 | so participants were asked to designate one of the objects a b or c well |
---|
0:06:23 | eventually all three of them but one at a time |
---|
0:06:26 | so in this case the participant is describing the hard disk under the table |
---|
0:06:34 | this is what the asr heard |
---|
0:06:36 | none of them is correct this is true asr output |
---|
0:06:40 | and then we put the get in front |
---|
0:06:43 | so get |
---|
0:06:44 | that thing |
---|
0:06:46 | in the second image again the participant wants the ball farther |
---|
0:06:52 | away from the plate |
---|
0:06:55 | which object they have |
---|
0:06:57 | this is what the asr heard |
---|
0:07:00 | and again we add |
---|
0:07:02 | the get |
---|
0:07:03 | and this time one of the interpretations |
---|
0:07:06 | is |
---|
0:07:07 | correct yes the first one |
---|
0:07:10 | this is our third image |
---|
0:07:12 | the plate in the middle of the table |
---|
0:07:15 | so we play the same game i can speed up now |
---|
0:07:18 | and our final image the cleanable crack yes that's what they said |
---|
0:07:24 | and |
---|
0:07:25 | again |
---|
0:07:27 | this is what the asr heard |
---|
0:07:29 | and this time we |
---|
0:07:30 | do move why because it's a big object |
---|
0:07:33 | we cannot ask anybody to get the bookcase |
---|
0:07:37 | okay |
---|
0:07:38 | now |
---|
0:07:39 | this is how we collected our corpus and now we start with |
---|
0:07:43 | the trial stage one |
---|
0:07:45 | we collected demographic information gender english nativeness whether they are native english speakers |
---|
0:07:53 | age education |
---|
0:07:55 | and we also collected risk propensity information because we are interested in the |
---|
0:08:01 | effect of risk |
---|
0:08:04 | so we collected these from work by rohrmann that has |
---|
0:08:08 | six |
---|
0:08:09 | risk proneness |
---|
0:08:10 | statements such as i follow the motto nothing ventured nothing gained |
---|
0:08:15 | and six |
---|
0:08:16 | risk aversion statements my decisions are always made carefully and accurately and |
---|
0:08:22 | there are six of each and we measured the agreement on a one to five |
---|
0:08:26 | likert scale |
---|
0:08:29 | so |
---|
0:08:30 | these are the demographic characteristics |
---|
0:08:34 | in stage one we had forty participants six of those were not reachable in |
---|
0:08:39 | stage two so we have thirty four people |
---|
0:08:41 | seventeen female seventeen male eighteen native english speakers sixteen non native |
---|
0:08:47 | and these are the age and education |
---|
0:08:51 | profiles |
---|
0:08:56 | and for risk proneness just to give you an idea about the human condition |
---|
0:09:02 | we subtract the |
---|
0:09:03 | risk aversion from risk proneness so the sum of |
---|
0:09:07 | all their scores |
---|
0:09:09 | and this is what our population looks like they seem to be more |
---|
0:09:13 | risk prone than |
---|
0:09:15 | risk averse |
---|
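Editorial note: the score described here can be read as the summed agreement with the six risk-prone statements minus the summed agreement with the six risk-averse statements, each rated on a one to five Likert scale. The sketch below assumes that reading. The function name and the example ratings are hypothetical.

```python
def risk_propensity(prone_ratings, averse_ratings):
    """Sum of risk-prone agreement minus sum of risk-averse agreement.

    Each argument is a list of six Likert ratings between 1 and 5,
    so the result lies between -24 (most averse) and +24 (most prone).
    """
    for ratings in (prone_ratings, averse_ratings):
        if len(ratings) != 6:
            raise ValueError("expected six ratings per statement group")
        if any(r not in (1, 2, 3, 4, 5) for r in ratings):
            raise ValueError("ratings must be integers from 1 to 5")
    return sum(prone_ratings) - sum(averse_ratings)


# Hypothetical participant: 21 points of proneness, 15 of aversion.
score = risk_propensity([3, 4, 2, 5, 3, 4], [2, 3, 3, 2, 4, 1])
```

A positive score marks a participant as leaning risk prone, which is the direction the speaker reports for the population as a whole.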
0:09:19 | so now |
---|
0:09:20 | now we get to the real stage one |
---|
0:09:23 | so as i said each participant was shown |
---|
0:09:26 | the top four asr outputs for each request maximum twelve requests one |
---|
0:09:32 | per item in each image |
---|
0:09:36 | and they were shown versions of the images where all the objects are numbered |
---|
0:09:41 | why because they could |
---|
0:09:43 | pick any object |
---|
0:09:45 | to respond |
---|
0:09:47 | we had two risk conditions low and high we told them that in the |
---|
0:09:51 | low risk condition the responder is in the same room as the requester |
---|
0:09:57 | in the high risk condition the responder is far away and will |
---|
0:10:01 | incur a lot of inconvenience if they do the wrong thing |
---|
0:10:07 | and |
---|
0:10:08 | they had four response types to |
---|
0:10:11 | choose from and the |
---|
0:10:13 | they got explanations of what |
---|
0:10:14 | each response means |
---|
0:10:16 | in fact they only got this side |
---|
0:10:19 | this side is for us |
---|
0:10:21 | so |
---|
0:10:23 | do means you would just fetch object number |
---|
0:10:28 | and put the number of the object you would fetch |
---|
0:10:31 | confirm is |
---|
0:10:33 | you want to ask did you mean object again object number |
---|
0:10:37 | choose which object did you mean |
---|
0:10:39 | given a list of objects and rephrase is i can't hear you |
---|
0:10:46 | i want you to restate so they had four response types to choose from |
---|
0:10:51 | so this is a sample item so now we see the same room we saw |
---|
0:10:54 | before |
---|
0:10:55 | but all the objects are numbered |
---|
0:10:59 | and this is what the survey looks like so |
---|
0:11:03 | you have the four outputs assuming that you are in the same room |
---|
0:11:07 | as the speaker |
---|
0:11:09 | select one of the responses |
---|
0:11:12 | get object number did you mean object which object did you mean and for rephrase |
---|
0:11:18 | we actually gave them the option |
---|
0:11:21 | to say rephrase the object rephrase the position or rephrase the whole sentence |
---|
0:11:29 | now we distinguish because the asr makes most of the errors on the object not |
---|
0:11:34 | on the location |
---|
0:11:36 | and then |
---|
0:11:37 | we said assume that |
---|
0:11:39 | the speaker is in a remote location would you change your answer |
---|
0:11:43 | and we asked the same question |
---|
0:11:48 | so |
---|
0:11:49 | after stage one we got three corpora |
---|
0:11:53 | one |
---|
0:11:55 | so we had |
---|
0:11:55 | five hundred and eighty four responses so |
---|
0:11:59 | two hundred and ninety two requests times two risk conditions |
---|
0:12:02 | and |
---|
0:12:04 | it will become clear why we have three corpora so the first one is |
---|
0:12:07 | response corpus |
---|
0:12:11 | response corpus is what answers we got from our participants |
---|
0:12:15 | we see |
---|
0:12:18 | okay what answers we got from our participants |
---|
0:12:22 | and this is the distribution of the answers under the low and the high |
---|
0:12:27 | risk conditions |
---|
0:12:28 | so do is clearly the majority class |
---|
0:12:31 | and we have confirm choose rephrase and as you can see |
---|
0:12:37 | the |
---|
0:12:39 | there are |
---|
0:12:39 | fewer dos and more confirms and chooses |
---|
0:12:43 | and rephrases under the high risk condition |
---|
0:12:47 | in addition we developed |
---|
0:12:50 | two corpora |
---|
0:12:51 | author corpus and classifier corpus so what is author corpus |
---|
0:12:56 | an author |
---|
0:12:58 | responded to everything |
---|
0:13:01 | why did we want the author corpus because there is a lot of |
---|
0:13:04 | variability between people and we wanted to see how user variability affects |
---|
0:13:11 | the result |
---|
0:13:13 | and the final corpus is called classifier corpus |
---|
0:13:18 | and what we did is we trained a classifier |
---|
0:13:23 | to |
---|
0:13:23 | select responses based on the |
---|
0:13:29 | based both on author corpus and on response corpus |
---|
0:13:36 | oh and i promised i would tell you sorry |
---|
0:13:39 | so this is why we threw out the |
---|
0:13:43 | requests with more than one prepositional phrase because we wanted to restrict the features that |
---|
0:13:48 | we used for training the classifier because we just wanted a simple classifier |
---|
0:13:54 | okay so |
---|
0:13:56 | what does our response classifier look like it assumes that |
---|
0:14:01 | we have a spoken language understanding system that returns ranked interpretations |
---|
0:14:07 | we have three types of classification features the asr confidence in the correctness of |
---|
0:14:13 | its own outputs |
---|
0:14:15 | how well an interpretation matches the description |
---|
0:14:20 | the risk of the situation and for response corpus we also have the demographic |
---|
0:14:24 | and risk propensity information |
---|
0:14:28 | so |
---|
0:14:29 | a quick example this is a close up of one of the rooms |
---|
0:14:34 | the description is the brown stool near the table |
---|
0:14:37 | so |
---|
0:14:39 | these two stools match the description well |
---|
0:14:42 | the one |
---|
0:14:47 | the one over there is a bit closer but they both |
---|
0:14:51 | are a pretty good match |
---|
0:14:57 | what about the classifier so how did the classifier do |
---|
0:15:01 | we tested a whole bunch of classifiers and random forest won |
---|
0:15:06 | now |
---|
0:15:09 | here the main thing to note is |
---|
0:15:13 | the bottom line of course we are doing better on the author corpus than on |
---|
0:15:19 | the corpus of all the people |
---|
0:15:20 | why because there was a lot of variability in responses under the exact same |
---|
0:15:25 | conditions |
---|
0:15:29 | but this is just |
---|
0:15:31 | before you think i'm wasting your time |
---|
0:15:34 | and this is not important for the purposes of this paper |
---|
0:15:40 | so now |
---|
0:15:41 | we proceed to experiment two |
---|
0:15:44 | a year and a half to two years later |
---|
0:15:47 | so |
---|
0:15:48 | each participant is shown |
---|
0:15:51 | the same asr outputs and the same images as in stage one |
---|
0:15:56 | two risk conditions again |
---|
0:15:59 | and |
---|
0:16:00 | a bunch of candidate responses |
---|
0:16:03 | sourced from |
---|
0:16:06 | the response type in response corpus for their own responses |
---|
0:16:12 | the author's response |
---|
0:16:14 | the response picked by the classifier |
---|
0:16:16 | and also |
---|
0:16:18 | do confirm pairs so whenever one of these responses was do if there was |
---|
0:16:24 | no confirm in the above three we added the confirm |
---|
0:16:28 | similarly |
---|
0:16:30 | if one of these was confirm and there was no do |
---|
0:16:33 | we added the do |
---|
0:16:34 | of course we didn't repeat |
---|
0:16:37 | if several of these chose the same response we presented it only once |
---|
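Editorial note: the assembly of the stage two candidate list can be sketched as follows, assuming three sources per request (the participant's own stage one response, a second corpus response, and the classifier's response), removal of duplicates, and completion of do/confirm pairs. Response types are reduced to plain strings here; the real responses also carried object numbers. All names are hypothetical.

```python
def candidate_responses(own, other, classifier):
    """Build the list of response types shown for one request in stage two."""
    candidates = []
    # Collect the three sourced responses, showing each type only once.
    for response in (own, other, classifier):
        if response not in candidates:
            candidates.append(response)
    # Complete the do/confirm pair when only one half is present.
    if "do" in candidates and "confirm" not in candidates:
        candidates.append("confirm")
    elif "confirm" in candidates and "do" not in candidates:
        candidates.append("do")
    return candidates


# Two sources agree on "do", so it appears once and "confirm" is added.
shown = candidate_responses("do", "do", "choose")
```

Pairing do with confirm means each participant rates both the direct action and its more cautious alternative whenever either one was proposed.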
0:16:44 | now we had some |
---|
0:16:46 | small challenges do and rephrase are direct renditions of the selections in stage one |
---|
0:16:54 | but for confirm and choose |
---|
0:16:57 | we needed to do some instantiation |
---|
0:17:00 | so for confirm we chose a pictorial query so we |
---|
0:17:06 | would say is this what you want |
---|
0:17:09 | and this is your confirmation the particular plate |
---|
0:17:13 | for choose we had two options there are two plates on the table |
---|
0:17:18 | and then |
---|
0:17:19 | we presented |
---|
0:17:21 | what was |
---|
0:17:22 | which one do you want or do you want this or that |
---|
0:17:26 | now |
---|
0:17:27 | the pictorial version was restricted to only two or three options |
---|
0:17:32 | if there were more options in the list |
---|
0:17:34 | i mean nobody says this or this or this or that |
---|
0:17:39 | it's usually t c |
---|
0:17:40 | i |
---|
0:17:43 | and this is what the survey looks like again we have the same image |
---|
0:17:48 | we have the output |
---|
0:17:50 | and |
---|
0:17:51 | now they get to choose between all these responses |
---|
0:17:55 | and they get to rate them on |
---|
0:17:58 | a likert scale be on u w t |
---|
0:18:05 | again |
---|
0:18:08 | okay going back to our questions so how did we do |
---|
0:18:14 | participants' ratings of their stage one responses are significantly lower |
---|
0:18:20 | than the ratings ascribed to these response types under both risk conditions what |
---|
0:18:25 | do i mean by ascribed |
---|
0:18:27 | if you recall in stage one |
---|
0:18:30 | they had to pick a response how would you respond |
---|
0:18:33 | so we said okay |
---|
0:18:36 | in order to account for rater bias |
---|
0:18:39 | we will say okay the one they picked is their best opinion of them |
---|
0:18:43 | so if their highest opinion of anything was a five |
---|
0:18:48 | we ascribed to the response a five if it was a four we ascribed a |
---|
0:18:52 | four |
---|
0:18:53 | but the rating was significantly lower |
---|
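Editorial note: one way to read the ascribed rating is that the response a participant picked in stage one is credited with the highest rating that participant gave to anything in stage two, and the quantity of interest is how far the actual stage two rating of that response falls below it. The sketch below assumes the maximum is taken over the candidate responses for a single request; the talk does not say whether it was taken per request or per participant. All names are hypothetical.

```python
def ascribed_rating_drop(stage_two_ratings, stage_one_choice):
    """Ascribed rating minus the actual stage two rating of the stage one choice.

    stage_two_ratings maps each candidate response type to the Likert
    rating the participant gave it in stage two.
    """
    # The stage one pick is assumed to deserve the participant's top rating.
    ascribed = max(stage_two_ratings.values())
    actual = stage_two_ratings[stage_one_choice]
    return ascribed - actual


# The participant picked "choose" in stage one, then rated it 1 and
# rated "confirm" 5 in stage two; the rating for "do" is made up.
drop = ascribed_rating_drop({"choose": 1, "confirm": 5, "do": 3}, "choose")
```

A drop of zero means the participant still rates the earlier pick as highly as anything else; larger values mean the earlier pick has fallen out of favour.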
0:18:58 | and |
---|
0:19:00 | this histogram presents the difference in rating between |
---|
0:19:07 | their ascribed ratings and their stage two ratings |
---|
0:19:12 | so for a lot of them |
---|
0:19:15 | they kept |
---|
0:19:16 | so whatever we ascribed they also thought it was pretty good |
---|
0:19:20 | but |
---|
0:19:22 | for quite a lot of them like to |
---|
0:19:25 | hundred and thirty three for low risk and hundred and sixty nine for high risk |
---|
0:19:31 | they significantly reduced the rating |
---|
0:19:38 | question two |
---|
0:19:41 | do participants prefer their stage one response type to other response types |
---|
0:19:46 | in the paper we have both the author and the classifier |
---|
0:19:50 | here i'm only showing the classifier why the classifier the version of the classifier that |
---|
0:19:56 | we are using is the one trained on the author it was not even trained on the |
---|
0:20:00 | users |
---|
0:20:01 | so what did we do we took |
---|
0:20:04 | we took all the responses that |
---|
0:20:07 | are |
---|
0:20:08 | different |
---|
0:20:10 | between stage two and stage one and then checked |
---|
0:20:13 | the rating |
---|
0:20:15 | so |
---|
0:20:15 | only different responses |
---|
0:20:18 | so in a lot of cases |
---|
0:20:21 | stage one was better than the classifier |
---|
0:20:24 | in quite a few cases they were the same and |
---|
0:20:28 | in enough cases |
---|
0:20:32 | the classifier that is trained on somebody else did better than their own previous |
---|
0:20:37 | self |
---|
0:20:40 | so this is an example |
---|
0:20:44 | what to get |
---|
0:20:45 | in stage one |
---|
0:20:48 | the user |
---|
0:20:49 | picked choose |
---|
0:20:51 | but then in stage two they gave choose a rating of one and confirm |
---|
0:20:55 | a rating of five |
---|
0:21:00 | but having said that |
---|
0:21:03 | at the end of the day |
---|
0:21:05 | participants' ratings of their stage one response types |
---|
0:21:09 | are not statistically significantly different from the ratings of different response types under both |
---|
0:21:16 | risk conditions |
---|
0:21:18 | so anything goes basically |
---|
0:21:22 | influence of risk just quickly |
---|
0:21:25 | people were more conservative under high risk which is as expected fewer dos |
---|
0:21:33 | effect of risk on specific response types |
---|
0:21:36 | so do and choose received lower ratings under high risk |
---|
0:21:40 | and confirm and rephrase were unaffected by risk |
---|
0:21:46 | regardless of risk |
---|
0:21:47 | people rated confirm higher than do and choose with pictures higher than choose |
---|
0:21:53 | text only |
---|
0:21:56 | so |
---|
0:21:57 | to conclude |
---|
0:22:00 | people's preferences are |
---|
0:22:02 | fluid over time |
---|
0:22:04 | various reasonable responses may be acceptable and as we saw a classifier trained on |
---|
0:22:11 | a small non-target |
---|
0:22:13 | corpus produced fine responses |
---|
0:22:17 | risk influences people's attitudes to some response types |
---|
0:22:22 | and what does that mean |
---|
0:22:24 | well this has implications for training and evaluating dialog systems but this was in a |
---|
0:22:30 | restricted setting one turn dialogues with |
---|
0:22:33 | a pretend robot |
---|
0:22:34 | so more studies are required |
---|
0:22:37 | i |
---|
0:22:44 | we have some time for questions |
---|
0:22:52 | thanks it's a very interesting experiment |
---|
0:22:56 | and i think it does show clearly that there's some variation in responses permitted which |
---|
0:23:03 | we see in other experiments too i'm not sure how you come to the |
---|
0:23:07 | conclusion that the users are fluid through time |
---|
0:23:12 | given that you're actually asking them to do something different |
---|
0:23:16 | rating responses as opposed to choosing responses is a different task |
---|
0:23:20 | and if you assume that |
---|
0:23:22 | users don't have just a fixed choice in mind but some kind of a probability |
---|
0:23:25 | distribution or utility distribution and you're forcing a choice so they pick one and if |
---|
0:23:30 | you sampled again from the same distribution you'd expect a certain amount of variation so |
---|
0:23:36 | is it really that users are changing over time or that you're rolling the |
---|
0:23:40 | dice and you get |
---|
0:23:41 | a different number sometimes the second time |
---|
0:23:44 | yes this is a limitation we spot the that one |
---|
0:23:49 | what we can assume is |
---|
0:23:52 | yes whatever the actual |
---|
0:23:54 | they must have had a reason for choosing it then |
---|
0:23:57 | they thought they were making perfect sense |
---|
0:24:00 | and then they were given the exact same options and then |
---|
0:24:05 | in retrospect |
---|
0:24:07 | they were okay with other options that's i mean |
---|
0:24:10 | or what i mean |
---|
0:24:12 | to me that e d case louis |
---|
0:24:14 | should we have done the experiment differently in retrospect |
---|
0:24:18 | yes probably but |
---|
0:24:20 | the intention the original intention of the experiment |
---|
0:24:24 | was not to do this longitudinal study we kind of stumbled upon |
---|
0:24:28 | the longitudinal part |
---|
0:24:31 | but to us this indicates that |
---|
0:24:36 | you know things are not as |
---|
0:24:38 | cut and dried as |
---|
0:24:40 | a lot of people believe that |
---|
0:24:42 | they are and anything reasonable goes |
---|
0:24:47 | we have time for another question |
---|
0:24:57 | can you go back to slide twenty four i think |
---|
0:25:01 | wow |
---|
0:25:02 | i had to fix the number in my head otherwise |
---|
0:25:06 | i couldn't mm |
---|
0:25:09 | there was the conclusion not so much a graph |
---|
0:25:19 | oops |
---|
0:25:19 | the next one |
---|
0:25:24 | it doesn't one |
---|
0:25:30 | sorry i had a hard time |
---|
0:25:32 | following the reasoning here didn't you just show us that |
---|
0:25:37 | it was different no i showed there were differences |
---|
0:25:41 | yes overall when you do pairwise comparison along with statistical significance |
---|
0:25:48 | testing there was none |
---|
0:25:51 | so although it appears sometimes this wins sometimes that wins |
---|
0:25:57 | when you do |
---|
0:25:59 | the math there it's not statistically significant at all |
---|
0:26:05 | we did wilcoxon signed-rank |
---|
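Editorial note: the Wilcoxon signed-rank test compares paired ratings by ranking the absolute differences and summing the ranks separately for positive and negative differences. The sketch below computes only those two rank sums, dropping zero differences and giving tied magnitudes their average rank. It does not compute a p-value, and the study's exact test settings are not stated in the talk. The function name and the ratings are hypothetical.

```python
def signed_rank_sums(first, second):
    """Return (W_plus, W_minus) for two equal-length lists of paired ratings."""
    if len(first) != len(second):
        raise ValueError("the two samples must be paired")
    # Pairs with no difference carry no sign and are dropped.
    diffs = [a - b for a, b in zip(first, second) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        # Extend the run while the next magnitude ties with the current one.
        end = start
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical paired ratings for five items.
w_plus, w_minus = signed_rank_sums([5, 4, 3, 2, 4], [3, 4, 4, 5, 1])
```

When the two rank sums are close, as they are for these made-up ratings, neither side is consistently preferred, which matches the speaker's point that individual wins on either side did not add up to a significant difference.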
0:26:08 | yes |
---|
0:26:10 | who |
---|
0:26:11 | alright let's thank the speaker again |
---|