0:00:14 | right, good evening everyone, again |
---|
0:00:26 | i have tried to make this talk more interesting and exciting |
---|
0:00:33 | alright, so |
---|
0:00:37 | taking a step back and looking at our previous work first, at what the previous work was: |
---|
0:00:42 | so in the last work we looked at |
---|
0:00:44 | fine-grained semantics; like, we tried to understand |
---|
0:00:48 | the scene descriptions by segmenting the target descriptions, that is, the descriptions of the |
---|
0:00:54 | target image, |
---|
0:00:55 | into different parts, |
---|
0:00:57 | into different semantic acts, as we see here, and then |
---|
0:01:02 | tried to understand the images |
---|
0:01:04 | so in this work we take a step back and we try to understand the |
---|
0:01:08 | high-level dialogue acts |
---|
0:01:11 | we try to understand these high-level, different dialogue acts, for instance |
---|
0:01:16 | to incrementally understand what the person is trying to do, and we kind of |
---|
0:01:21 | extend upon the work presented previously |
---|
0:01:25 | alright |
---|
0:01:25 | so the motivation for this work is to achieve fast-paced interaction, so |
---|
0:01:31 | in fast-paced interactions a lot of things happen, like: |
---|
0:01:35 | a single user speech segment can have multiple dialogue acts, and a single dialogue act can |
---|
0:01:40 | span across multiple speech segments |
---|
0:01:43 | and in those cases, what should we do? what kinds of policies should we |
---|
0:01:47 | design? |
---|
0:01:48 | the main thing is, we want to understand a |
---|
0:01:52 | methodology to perform this dialogue act segmentation, and try to understand what the dialogue acts are |
---|
0:01:57 | in an environment which is very fast-paced, and to try |
---|
0:02:03 | to do all these things |
---|
0:02:05 | and then initiate the right dialogue act at the right |
---|
0:02:09 | time; so that's the goal |
---|
0:02:13 | well |
---|
0:02:14 | the structure of this talk will be divided into these parts; so the |
---|
0:02:18 | first thing is, i'll speak a bit about our domain and the previous work, and try |
---|
0:02:22 | to lay out the technical problem and our starting point |
---|
0:02:26 | and then the annotation scheme that we used, and an outline of |
---|
0:02:32 | our dialogue acts |
---|
0:02:33 | then the methods that we use to perform the segmentation and |
---|
0:02:37 | the dialogue act understanding and labeling |
---|
0:02:41 | then we evaluate the components, and then see how it works with the agent |
---|
0:02:47 | so |
---|
0:02:48 | the domain that we use is very similar to the one that we saw in the |
---|
0:02:52 | last talk, so |
---|
0:02:54 | i'll not go into this topic and will fly through: the domain is basically called the RDG-Image |
---|
0:02:58 | game |
---|
0:02:59 | okay, so it's |
---|
0:03:00 | it's a rapid dialogue game in which two people, |
---|
0:03:04 | two people, play an image game; so it's fast, it's |
---|
0:03:08 | very rapid, time-constrained |
---|
0:03:12 | [the demo video of a human-human game starts playing] |
---|
0:03:21 | before that, i'm sorry, |
---|
0:03:22 | before that: the person at the top is the director |
---|
0:03:26 | the director gets to see the images on the screen on a computer |
---|
0:03:32 | and she is basically trying to describe |
---|
0:03:34 | the target, the highlighted target image; and this is the matcher |
---|
0:03:37 | the matcher doesn't see any of those images as highlighted, so the matcher is trying |
---|
0:03:43 | to make the selection, and they can have dialogue exchanges back and forth |
---|
0:03:48 | and it's time-constrained, and they also see the score, so |
---|
0:03:52 | it's an incentivized game |
---|
0:04:01 | [the video plays: the director rapidly describes the target images and the matcher responds] |
---|
0:04:19 | well, as you can see, the game involves very rapid dialogue exchanges, and |
---|
0:04:23 | that's the kind of problem we are dealing with |
---|
0:04:25 | so we built an agent using this data, and this is what we presented |
---|
0:04:29 | in the previous talk: |
---|
0:04:31 | the agent could play the game, this fast-paced game, with real users |
---|
0:04:38 | we had incremental components: we had the asr, the nlu, and the policy, and all these |
---|
0:04:42 | components were operating incrementally |
---|
0:04:46 | and the incremental architecture is very important, because we got better game scores |
---|
0:04:53 | but |
---|
0:04:54 | they're not significantly better than humans, |
---|
0:04:56 | which means, you know, it didn't really perform much better |
---|
0:05:00 | than alternative incremental |
---|
0:05:02 | architectures either, a point we looked at in our previous work |
---|
0:05:08 | and it had favorable subjective evaluations, that is, people interacting with this agent liked interacting |
---|
0:05:13 | with the agent compared to other versions of the agent |
---|
0:05:19 | there are a few limitations of this architecture, though; the limitation is that it |
---|
0:05:24 | assumes every description, every word that the person is speaking, is |
---|
0:05:29 | basically a description of the target image |
---|
0:05:30 | and if that's the case, we can't, you can't |
---|
0:05:33 | have the really fun game-based interactions that the two human players were having |
---|
0:05:38 | so it's |
---|
0:05:38 | not as interactive as human players are, |
---|
0:05:40 | but it is really fast |
---|
0:05:42 | so |
---|
0:05:45 | we built an agent, so i want to show a small video of the agent |
---|
0:05:48 | interacting with a human, just to reinforce the points that i just made |
---|
0:05:55 | at the top you are seeing the human director's screen |
---|
0:06:00 | in the top, eight images, with the human describing the target; and in the bottom screen you are seeing |
---|
0:06:05 | the agent's eight images and its confidence, |
---|
0:06:09 | shown in the bar |
---|
0:06:17 | [the agent demo video plays: the human director describes each image and the agent responds] |
---|
0:06:54 | so the agent is, you know, very efficient, as you can see, |
---|
0:06:59 | and it is really rapid at playing the game, but not very interactive |
---|
0:07:01 | alright, so what we want to do: we want to make the agent more interactive, |
---|
0:07:05 | so we want to make use of the full range of dialogue acts that, you know, |
---|
0:07:08 | humans use in this game |
---|
0:07:10 | and we want to initiate the right dialogue act at the |
---|
0:07:14 | right time, so that we get the right interactions; and for that |
---|
0:07:19 | the agent needs incremental dialogue act segmentation and labeling in some sense, and |
---|
0:07:24 | we'll show how we use it and why we need it |
---|
0:07:28 | and the challenge is |
---|
0:07:30 | in efficiently employing it; for instance, in the previous architecture we had |
---|
0:07:34 | the agent which |
---|
0:07:36 | treated every utterance as basically a target image description, so it was being very efficient |
---|
0:07:41 | in understanding the target images |
---|
0:07:42 | but |
---|
0:07:43 | if we include more dialogue acts, it's very possible that the dialogue act |
---|
0:07:47 | segmenter or the dialogue act labeler makes errors, |
---|
0:07:51 | confusing target descriptions with the other dialogue acts surrounding them for instance, and that can make |
---|
0:07:57 | the game worse; so we wanted to see |
---|
0:08:00 | whether the agent performance indeed takes a hit or not |
---|
0:08:05 | so we collected a human-human dialogue corpus in a lab setting in one of |
---|
0:08:11 | our previous studies |
---|
0:08:13 | and we annotated this data; it was annotated by a human annotator |
---|
0:08:18 | the game characteristic is that |
---|
0:08:21 | it's rapid, okay |
---|
0:08:23 | and there are, like, multiple dialogue acts within a speech segment |
---|
0:08:28 | and the same dialogue act can actually span across different speech segments; for |
---|
0:08:32 | instance here |
---|
0:08:34 | you can see that whenever the countdown clock kicks in, the dialogue is |
---|
0:08:38 | like really fast, there are like a lot of overlaps |
---|
0:08:41 | and then here in this example we can see that there are, like, multiple dialogue acts |
---|
0:08:45 | within a single speech segment |
---|
0:08:48 | so each speech segment here is separated out by these two hundred |
---|
0:08:51 | milliseconds of silence |
---|
0:08:53 | and in this example there is, like, the same dialogue act just spanning across |
---|
0:08:57 | multiple speech segments |
---|
0:09:00 | and from this table we can see there are, like, a lot of dialogue acts |
---|
0:09:03 | in nearly every speech segment, and we hypothesize that if |
---|
0:09:08 | we identify each speech segment just by separating it out at a |
---|
0:09:13 | silence threshold, we won't do a good job of identifying the dialogue |
---|
0:09:18 | acts |
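---|
As a rough illustration of the silence-threshold segmentation being questioned here, the sketch below splits a timed word stream into speech segments at a fixed pause length. The 200 ms threshold and the word-timing format are assumptions taken from the talk, for illustration only.

```python
# Minimal sketch: splitting a word stream into speech segments (inter-pausal
# units) at a fixed silence threshold. The hypothesis above is that these
# segments do NOT line up with dialogue act boundaries.

def split_into_segments(words, silence_threshold=0.2):
    """words: list of (token, start_sec, end_sec); returns a list of segments."""
    segments, current = [], []
    for token, start, end in words:
        if current and start - current[-1][2] >= silence_threshold:
            segments.append(current)  # pause long enough: close the segment
            current = []
        current.append((token, start, end))
    if current:
        segments.append(current)
    return segments

# e.g. a pause before "yeah got it" yields one segment, even though "yeah"
# (acknowledgement) and "got it" (assert-identified) are two dialogue acts
```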
---|
0:09:20 | so a human annotator annotated our corpus toward this goal |
---|
0:09:25 | and the annotation is done at a very fine-grained level, i.e. |
---|
0:09:29 | at the word level |
---|
0:09:31 | so here, for instance, the annotator would kind of identify that this is a |
---|
0:09:35 | question, and that this is its answer to the previous question, and so on for |
---|
0:09:40 | the other dialogue acts |
---|
0:09:43 | so how does the annotation corpus, how does the corpus once annotated, look? |
---|
0:09:47 | so it's very diverse |
---|
0:09:49 | so if we think of this game as simply target descriptions and acknowledgements or |
---|
0:09:54 | assert-identified moves by the person, |
---|
0:09:57 | those dialogue acts would be covering only fifty-six percent of the total |
---|
0:10:01 | dialogue acts |
---|
0:10:01 | so the rest, the forty-four percent of dialogue acts, contains a |
---|
0:10:05 | lot of other |
---|
0:10:06 | kinds of dialogue exchanges |
---|
0:10:09 | some of them are questions, you know, and answers, or echo confirmations, and other |
---|
0:10:15 | game-related acts |
---|
0:10:18 | so, on to the methods |
---|
0:10:19 | so this is the corpus that we are working with; we have a human-human corpus, and |
---|
0:10:23 | our goal is: if we include this data in an agent, |
---|
0:10:28 | how well will the segmentation and the dialogue act labeling perform with the agent? so that's |
---|
0:10:33 | the thing that we wanted to kind of work on, and what |
---|
0:10:37 | we wanted to evaluate |
---|
0:10:38 | and develop methods for |
---|
0:10:40 | the method that we use is kind of divided into two steps |
---|
0:10:47 | so the first step: so we have |
---|
0:10:48 | the asr utterances; the asr is giving out its incremental utterances |
---|
0:10:53 | and we feed these to a linear-chain conditional |
---|
0:10:57 | random field; the crf does a sequential word labeling |
---|
0:11:01 | task, whereby |
---|
0:11:02 | every word is labeled as either part of a new segment or part of |
---|
0:11:06 | the previous segment |
---|
0:11:08 | and then once we have the segment boundaries assigned, we want to identify what each |
---|
0:11:12 | of these segments is |
---|
0:11:15 | so |
---|
0:11:16 | one thing is that it's not a new approach per se, you know, like |
---|
0:11:19 | segmenting the dialogue, segmenting the whole dialogue into segments and then kind |
---|
0:11:23 | of identifying the dialogue act labels; |
---|
0:11:25 | it's been used by many people in the past |
---|
0:11:32 | and we make use of it here |
---|
0:11:34 | so here in this approach, let's say we have the transcripts, which |
---|
0:11:40 | contain these words that are just coming out from the asr |
---|
0:11:44 | so these black boxes are basically the silences of at least three hundred milliseconds separating the |
---|
0:11:48 | speech segments |
---|
0:11:49 | and |
---|
0:11:51 | once these words come in, |
---|
0:11:53 | they are kind of fed to the linear-chain conditional random field; the idea is that |
---|
0:11:58 | it does a sequential labeling task, such that it assigns each word |
---|
0:12:02 | a label for whether this word is part of a new segment or of the previous |
---|
0:12:07 | segment or not |
---|
0:12:08 | so we just use BI tagging, whereby each word is either beginning or inside |
---|
0:12:11 | a segment |
---|
0:12:13 | and then once we have a segment, once we have the segments extracted, |
---|
0:12:17 | we label each one of the segments using an svm classifier |
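---|
As a concrete sketch of this two-step pipeline, the code below uses sklearn-crfsuite for the BI segment tagger and scikit-learn for the dialogue act SVM. These libraries, the toy features, and the training variables (train_word_seqs, train_bi_tags, train_segments, train_da_labels) are stand-ins for illustration, not the system's actual implementation.

```python
# Step 1: a linear-chain CRF tags each word B(egin) or I(nside) of a segment.
# Step 2: an SVM labels each extracted segment with a dialogue act.
import sklearn_crfsuite
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def word_features(words, i):
    # toy lexical features; the real system adds POS, prosody, and context
    return {"word": words[i].lower(),
            "prev": words[i - 1].lower() if i > 0 else "<s>"}

X_seg = [[word_features(seq, i) for i in range(len(seq))] for seq in train_word_seqs]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_seg, train_bi_tags)  # tag sequences like ["B", "I", "I", "B", "I"]

svm = make_pipeline(DictVectorizer(), LinearSVC())
svm.fit([{"w:" + w: 1 for w in seg} for seg in train_segments],
        train_da_labels)       # labels like "describe-target", "question", ...

def segment_and_label(words):
    """Split one utterance into segments, then label each segment."""
    tags = crf.predict([[word_features(words, i) for i in range(len(words))]])[0]
    segments, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B" and current:  # a new segment begins: close the old one
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return [(seg, svm.predict([{"w:" + w: 1 for w in seg}])[0]) for seg in segments]
```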
---|
0:12:22 | so what kind of features do we use to perform these methods? |
---|
0:12:28 | so we use three kinds of features; our first feature set is the lexico-syntactic features, |
---|
0:12:33 | which include the words, the part-of-speech tags, and |
---|
0:12:37 | the top-level syntactic category, which are obtained from the parse trees |
---|
0:12:41 | and then we have the prosodic information, the prosodic features, which we extract from the audio |
---|
0:12:47 | incrementally |
---|
0:12:48 | so every ten milliseconds we run this prosodic feature extractor, |
---|
0:12:54 | and then we obtain the |
---|
0:12:58 | min, the max, the mean, and the standard deviation scores for the pitch and the energy values, which give us an idea about |
---|
0:13:05 | the frequency and energy of the voice |
---|
0:13:08 | and then we have the pause durations between the words, which are also included |
---|
0:13:12 | as a feature |
---|
0:13:14 | then for the contextual features, we believe it would be important |
---|
0:13:18 | for the agent to know what kind of role the person is performing, whether |
---|
0:13:22 | the director or the matcher, because they both have different kinds of dialogue |
---|
0:13:26 | act distributions |
---|
0:13:27 | and then we have the previously recognized dialogue act labels, which are very |
---|
0:13:31 | important for identifying things like confirmations or answers to questions |
---|
0:13:35 | and then the recent words from the other interlocutor, which are very important |
---|
0:13:39 | for identifying echo confirmations |
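---|
To make the three feature families concrete, here is one hedged sketch of what a per-word feature dictionary could look like; every field name and helper input here is hypothetical, mirroring only the description above.

```python
# Hypothetical per-word features combining the three families from the talk:
# lexico-syntactic, prosodic (pooled over 10 ms frames), and contextual.
def features_for_word(i, words, pos_tags, prosody, context):
    return {
        # lexico-syntactic
        "word": words[i].lower(),
        "pos": pos_tags[i],
        # prosodic: min/max/mean/sd of pitch and energy over the word's frames
        "pitch_min": prosody[i]["pitch_min"],
        "pitch_max": prosody[i]["pitch_max"],
        "pitch_mean": prosody[i]["pitch_mean"],
        "pitch_sd": prosody[i]["pitch_sd"],
        "energy_mean": prosody[i]["energy_mean"],
        "pause_before": prosody[i]["pause_before"],  # silence gap to previous word
        # contextual
        "role": context["role"],          # "director" or "matcher"
        "prev_da": context["prev_da"],    # previously recognized dialogue act
        "echoed": words[i] in context["partner_recent_words"],  # echo confirmation cue
    }
```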
---|
0:13:44 | we use these features, and all these modules are operating incrementally, which means with every new |
---|
0:13:49 | asr hypothesis that comes in, |
---|
0:13:51 | the BI tagger |
---|
0:13:54 | splits the utterance into the different segments, and then the classifier that does |
---|
0:13:59 | the dialogue act labeling runs on each of them and identifies the dialogue acts |
---|
0:14:04 | so the dialogue acts can change with every new word, because |
---|
0:14:08 | you know, it has more information to go on in the task |
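---|
A rough sketch of this incremental loop, reusing the hypothetical segment_and_label() from the pipeline sketch above: each new partial ASR hypothesis is re-segmented and re-labeled in full, so the analysis can change with every word.

```python
# Re-run segmentation and labeling on every incremental ASR hypothesis.
def on_asr_hypothesis(partial_words, dialogue_state):
    analysis = segment_and_label(partial_words)  # [(segment_words, da_label), ...]
    dialogue_state["current_das"] = analysis     # later words may revise earlier acts
    return analysis

# e.g. ["yeah"]              -> [(["yeah"], "acknowledgement")]
#      ["yeah", "got", "it"] -> [(["yeah"], "acknowledgement"),
#                                (["got", "it"], "assert-identified")]
```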
---|
0:14:13 | so there are these questions that we want to ask: how well does |
---|
0:14:16 | the segmenter plus dialogue act labeler, a pipeline kind of method, perform in this |
---|
0:14:21 | reference-resolution-to-images task? |
---|
0:14:25 | and what is the impact of asr performance, that is, an asr with a reasonable word |
---|
0:14:30 | error rate, if it is introducing mistakes, |
---|
0:14:34 | how well does our method kind of score? |
---|
0:14:38 | and then, how does the automated pipeline perform overall, |
---|
0:14:42 | i mean, how does it impact image understanding, whether the agent can correctly identify |
---|
0:14:46 | the image the user mentioned |
---|
0:14:47 | evaluation of the components is a little hard, because there are a lot of variables here |
---|
0:14:54 | because the first thing is that there are transcripts from the users, and there are the asr |
---|
0:14:58 | hypotheses which are just coming in |
---|
0:15:00 | and they don't kind of match up, and it's very hard to align them |
---|
0:15:04 | so here in this example, they are not aligned one |
---|
0:15:08 | to one another; it's basically just aligned as the segments are coming in |
---|
0:15:12 | and the human annotator does the segmentation and the dialogue act labeling |
---|
0:15:17 | at the word level, and |
---|
0:15:21 | we have that as gold data |
---|
0:15:23 | now, if we want to measure the performance of the dialogue act labeler, |
---|
0:15:27 | we can just run the dialogue act labeler on this human-transcribed |
---|
0:15:32 | and human-segmented information, and we can get a |
---|
0:15:36 | sense as to how the dialogue act labeler is performing |
---|
0:15:40 | but if you put the segmenter into the picture as well, then we, |
---|
0:15:44 | then you lose the one-to-one mapping |
---|
0:15:49 | between the dialogue acts from the gold annotation and |
---|
0:15:52 | those from the segmenter and the dialogue act labeler |
---|
0:15:55 | so how do we measure? do we go by a word-by-word measure, for |
---|
0:15:58 | instance? |
---|
0:16:00 | and once we have the asr, once we put the asr into the picture, |
---|
0:16:04 | we even lose the one-to-one mapping between the transcribed and annotated gold |
---|
0:16:09 | and the asr |
---|
0:16:11 | output; with asr errors as well, how do we kind of evaluate, you know, |
---|
0:16:16 | a pipeline that's working in such a mode? |
---|
0:16:19 | so previously researchers have used |
---|
0:16:23 | many metrics to kind of measure these things: we have the segmentation error |
---|
0:16:27 | rates, various other error rates, F-scores, and concept error rates, which people have |
---|
0:16:33 | used in the past to measure such systems |
---|
0:16:38 | but each one of these metrics, |
---|
0:16:40 | you know, |
---|
0:16:42 | kind of measures a different thing in the system |
---|
0:16:45 | but what we actually want to know, when we're building the system, is |
---|
0:16:49 | whether |
---|
0:16:51 | the right dialogue act was identified, so that we can take the right action |
---|
0:16:56 | for example, it doesn't matter, you know, if the asr made |
---|
0:17:01 | an error in recognizing the whole utterance, for example if, |
---|
0:17:06 | instead of "oh no", it just gave "no": identifying the "no" answer, in |
---|
0:17:12 | spite of |
---|
0:17:15 | the asr error which was happening, is what matters |
---|
0:17:17 | so if i get the right dialogue act, maybe my agent attains a better |
---|
0:17:21 | performance, i mean, it takes better actions |
---|
0:17:23 | so to measure such a kind of system, we need modified precision and |
---|
0:17:28 | recall metrics, |
---|
0:17:31 | and since i'm short of time i won't go into the details of |
---|
0:17:34 | this metric; let's just keep in mind that the segment-level boundaries |
---|
0:17:39 | for the words |
---|
0:17:40 | are not so important: it's important that we identify the dialogue acts |
---|
0:17:46 | that were actually there |
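---|
One plausible reading of the boundary-tolerant precision/recall described here, sketched below: a predicted dialogue act counts as correct if a gold act with the same label overlaps it, regardless of exact word boundaries. This is an illustration of the idea, not the talk's exact formulation.

```python
# Dialogue-act-level precision/recall that ignores exact segment boundaries.
def da_precision_recall(gold, predicted):
    """gold, predicted: lists of (label, start_word_idx, end_word_idx)."""
    def match(a, b):  # same label and overlapping word spans
        return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]
    tp_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    tp_gold = sum(any(match(g, p) for p in predicted) for g in gold)
    precision = tp_pred / len(predicted) if predicted else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    return precision, recall

# e.g. gold ("assert-identified", 3, 4) is still matched by a prediction
# ("assert-identified", 2, 4) whose left boundary is off by one word
```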
---|
0:17:49 | so the evaluation kind of produces these numbers: if we use the baseline, which |
---|
0:17:54 | is just one dialogue act, you know, per speech segment, |
---|
0:17:59 | we end up with a performance, an accuracy, of seventy-eight percent |
---|
0:18:04 | but once we bring the prosody in, if we perform the segmentation with just the |
---|
0:18:09 | prosodic features, we get like seventy-two percent |
---|
0:18:12 | the drop in performance could also be because it's not able to identify |
---|
0:18:18 | the segments that are not marked out by silence |
---|
0:18:21 | and then if we use the lexical, the lexico-syntactic, and the contextual |
---|
0:18:25 | features, we get to around ninety percent |
---|
0:18:27 | and once we combine all the features, we get a performance increase |
---|
0:18:32 | of, like, one to two percent |
---|
0:18:34 | so it's a really small bump; possibly the prosodic features aren't impacting the performance much, |
---|
0:18:39 | as you can see from that change |
---|
0:18:41 | but it's still not close to human-level performance |
---|
0:18:44 | so these are the numbers that we have here for the modified |
---|
0:18:48 | precision and recall for the describe-target and assert-identified acts, and from this table |
---|
0:18:54 | we can observe that |
---|
0:18:56 | with automation at every level |
---|
0:18:58 | the performance kind of takes a hit; so the numbers drop as we go |
---|
0:19:03 | from human transcripts and human segmentation, to auto-segmentation and automated labeling, and |
---|
0:19:09 | finally the asr |
---|
0:19:12 | but really what we want to see is how well the agent, how |
---|
0:19:16 | the agent performs: is the agent performing equally well or |
---|
0:19:19 | not? |
---|
0:19:20 | so in a previous study we used a simulation method to measure how well the |
---|
0:19:24 | agent |
---|
0:19:25 | performed |
---|
0:19:27 | this offline method of evaluating the agent is called the eavesdropper method, which we explained |
---|
0:19:31 | in our 2015 paper, and you can take a look at that |
---|
0:19:36 | and it gave us a really good picture as to how the agent's performance actually |
---|
0:19:39 | was, so we used that method to kind of evaluate |
---|
0:19:45 | the agent performance on target image identification |
---|
0:19:50 | and we found that |
---|
0:19:51 | there was no significant difference between the conditions |
---|
0:19:55 | finally, the take-away messages are that there are many metrics to measure the dialogue |
---|
0:20:00 | act segmentation, but measuring the final impact on the agent performance is very important, and the |
---|
0:20:05 | individual module performance might give us a different picture |
---|
0:20:09 | than the pipeline performance, which is negatively affected by module errors; and finally, the DA segmentation can facilitate |
---|
0:20:15 | us in building better and more complex dialogue policies |
---|
0:20:19 | and in future work we want to integrate these policies into the agent |
---|
0:20:23 | thank you |
---|
0:20:57 | so that's a very good question |
---|
0:21:00 | so the question was that |
---|
0:21:01 | this domain is really specific, in terms of utterances being of short duration, |
---|
0:21:07 | and does it really scale up to larger domains? |
---|
0:21:13 | so the answer is that i don't know; maybe it could, because the framework |
---|
0:21:17 | is kind of general, in the sense that the features that we use are not |
---|
0:21:22 | very much tuned to this domain, but we should really explore and see how |
---|
0:21:27 | the approach would perform in other domains, for example |
---|
0:21:30 | so the answer is, it's an open question; |
---|
0:21:33 | i can't say for sure |
---|
0:21:42 | [inaudible start of an audience question] |
---|
0:21:47 | [audience] a question about the architecture for segmentation and labeling: why do you have two separate steps, one |
---|
0:21:54 | for segmenting and one for labeling, where you do the labeling after drawing the boundaries? |
---|
0:22:00 | so the question was, why do we have a separate step for segmentation and labeling, |
---|
0:22:06 | and maybe not a joint one; so researchers have looked at other architectures, like they |
---|
0:22:11 | have tried to do the joint method |
---|
0:22:13 | of identifying the boundaries and the labels together, |
---|
0:22:15 | and also doing it in two separate steps |
---|
0:22:18 | so i would say we did try both out, and the joint method kind of worked, but |
---|
0:22:23 | i guess when we measured the performance, the joint method was not working |
---|
0:22:28 | as well as |
---|
0:22:32 | this method |
---|
0:22:50 | that's right, they were a small set, so |
---|
0:22:53 | we probably don't have... we have a long tail of dialogue acts; from this table |
---|
0:22:58 | you can see the dialogue act distribution is kind of long-tailed, and the joint method |
---|
0:23:03 | would probably work better if we had more |
---|
0:23:05 | data, without this issue |
---|
0:23:10 | [inaudible] |
---|
0:23:28 | that's a good question; so the question was, can we look at the n-best |
---|
0:23:33 | lists from the asr, and |
---|
0:23:35 | see how well the performance for dialogue act labeling would |
---|
0:23:41 | be, whether it would work as well; so the answer is, we didn't yet, but we can take a |
---|
0:23:46 | look at the n-best lists; definitely, that's something for future work |
---|