0:00:14 Right, good evening everyone again.
0:00:20 [the speaker introduces the talk] I've tried to make it more interesting and exciting.
0:00:33 Alright, so,
0:00:37 taking a step back and looking at our previous work. In the last work we looked at fine-grained semantics: we tried to understand the scene descriptions by segmenting the target descriptions, what the director said as they described the target, into different parts, into different semantic acts, as we'll see here, and then tried to understand the images. 0:01:04 In this work we take a step back and try to understand the high-level dialogue acts. 0:01:11 We try to understand these high-level dialogue acts, for instance to understand what the person is trying to do, and in that way we kind of extend upon the work presented previously.
0:01:25 Alright. So the motivation for this work is to achieve fast-paced interaction. 0:01:31 In fast-paced interactions a lot of things happen: a single user speech segment can have multiple dialogue acts, and a single dialogue act can span across multiple speech segments. 0:01:43 In those cases, what should we do, and what kind of system should we design? 0:01:48 So the thing is: we want a methodology to perform this dialogue act segmentation and to understand what the dialogue acts are, in an environment which is very fast-paced, and then to initiate the right dialogue act at the right time. 0:02:09 Fine, so that's the goal.
0:02:13 Well, the structure of this talk will be divided into these parts. First I'll speak a bit about our domain and the previous work, and try to set up the technical problem as a starting point. 0:02:26 Then the annotation scheme that we used, and an outline of our dialogue acts. 0:02:33 Then the methods we used to perform the segmentation and the dialogue act understanding. 0:02:40 Sorry. 0:02:41 Then we evaluate the components, and then see how it works with the agent.
0:02:47 So, 0:02:48 the domain that we use is very similar to the one that we saw in the last talk, so I won't spend too much time on it. The domain is basically called RDG-Image. 0:02:59 It's a rapid dialogue game in which two people play an image-matching game, so it's fast: very rapid and time-constrained.
0:03:12 [a clip of the game audio plays briefly] 0:03:17 Okay, this slide:
0:03:21 before that, I'm sorry. Before that: this person here is the director. 0:03:26 The director sees the images on a computer screen, 0:03:32 with the target image highlighted, and is basically trying to describe it; and this is the matcher. 0:03:37 The matcher sees the same images, but without any of them highlighted, and is trying to make the selection. They can have dialogue exchanges back and forth. 0:03:48 And it's time-constrained, and they also see the score, so it's quite an intense game.
0:04:01 [a recording of a human-human game plays; the players rapidly exchange short descriptions: "a little heart, glasses... got it", "yellow glasses with the space... got it", and so on]
0:04:19 Well, as you can see, it's the kind of game with fast and furious dialogue exchanges, and it's a hard problem. 0:04:25 So we built an agent using this data, and this is what we presented in the previous talk: 0:04:31 the agent could play this fast-paced game with real users. 0:04:38 We had incremental components: ASR, NLU, and the policy, and all these components were operating incrementally. 0:04:46 And the incremental architecture is very important, because we got better game scores.
0:04:53 But it's not significantly better than humans, which means, you know, it didn't perform much better than humans; but it did perform much better than the alternative architectures we compared against, which was one point of the previous work. 0:05:08 And it had favorable subjective evaluations: that is, people interacting with this agent liked interacting with it, compared to other versions of the agent. 0:05:19 But there are a few limitations of this architecture. The limitation is that it assumes every description, every word that the person is speaking, is basically a description of the target image. 0:05:30 And if that's the case, we can't have the really fun kinds of interaction that the two human players were having. 0:05:38 So it's not as interactive as human players, but it is really fast.
0:05:42 So, 0:05:45 we built an agent, and I want to show a small reel of the agent interacting with a human, to reinforce the points that I just made. 0:05:55 At the top you're seeing the human director's screen (you won't see the human's face), 0:06:00 and the top eight images are what the human is describing; the bottom screen shows the agent's images and its confidence in each of them.
0:06:17 [the video plays: the human director describes target images ("it is a parrot", "it is asleep", "it is placed indoors") while the agent makes its selections]
0:06:54 So the agent is, you know, very efficient, as you can see; it's really good at playing the game, but monotonous.
0:07:01 Alright, so what we want to do: we want to make the agent more interactive. 0:07:05 We want to make use of the full range of dialogue acts that, you know, people use in this game. 0:07:10 We want to initiate the right dialogue act at the right time, so that we get the right interactions, and for that 0:07:19 the agent needs incremental dialogue act segmentation and labeling, in some sense, and we'll show how we use it and why we need it. 0:07:28 And the challenge is to employ it efficiently. For instance, in the previous architecture we had the agent which 0:07:36 treated every utterance as basically a target image description, so it was being very efficient in understanding the target images. 0:07:43 But if we include more dialogue acts, it's very possible that dialogue acts may get mislabeled: target descriptions might get labeled as the other dialogue acts surrounding them, for instance, and the game performance might take a hit. 0:07:57 So we wanted to see 0:08:00 whether the agent's performance takes a hit.
0:08:05 So we collected a human-human dialogue corpus in a lab setting in one of the previous studies, 0:08:13 and we annotated this data; it was annotated by human annotators. 0:08:18 The game's characteristic is that 0:08:21 it's rapid, 0:08:23 and there are multiple dialogue acts within a speech segment, and the same dialogue act can actually span across different speech segments. For instance, here 0:08:34 you can see that when the countdown gets close, the dialogue is really fast and there are a lot of overlaps. 0:08:41 And here, in this example, we can see that there are multiple dialogue acts within a single speech segment. 0:08:48 Each speech segment is separated out by 200 milliseconds of silence. 0:08:53 And in this example there is a single dialogue act that spans across multiple speech segments. 0:09:00 From this table we can see that there are a lot of dialogue acts in each speech segment, and our hypothesis is that 0:09:08 if we take each speech segment, separating it out by a silence threshold alone, we won't do a good job of identifying the dialogue acts.
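The silence-threshold segmentation just described can be sketched as follows: a minimal illustration of splitting word-level ASR output into speech segments wherever the pause exceeds 200 milliseconds. The helper function and the word timings are made up for illustration, not taken from the talk's actual system.

```python
# Sketch: split word-level timings into speech segments using a silence
# threshold, as described in the talk. Word timings are made-up examples.

def split_into_speech_segments(words, threshold=0.2):
    """words: list of (word, start_sec, end_sec). A new segment starts
    whenever the pause before a word exceeds the threshold (200 ms here)."""
    segments = []
    for word, start, end in words:
        # compare this word's start to the previous word's end time
        if segments and start - segments[-1][-1][2] <= threshold:
            segments[-1].append((word, start, end))   # same segment
        else:
            segments.append([(word, start, end)])     # new segment
    return segments

timed_words = [
    ("yellow", 0.00, 0.30), ("glasses", 0.35, 0.80),  # 50 ms pause: same segment
    ("got", 1.20, 1.35), ("it", 1.40, 1.55),          # 400 ms pause: new segment
]
segments = split_into_speech_segments(timed_words)
print([[w for w, _, _ in seg] for seg in segments])
```

The point of the hypothesis above is exactly that these silence-based segments do not line up with dialogue act boundaries, which is why the word-level tagger is needed.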
0:09:20 So human annotators annotated this data toward our goal, 0:09:25 and the annotation is done at a very fine-grained level: 0:09:29 the word level. 0:09:31 So here, for instance, from a couple of words the annotator can identify that this is a question, and that this is the answer to the previous question, and label them accordingly.
0:09:43 So what does the corpus look like once it's annotated? 0:09:47 It's very diverse. 0:09:49 If we think of this game as consisting simply of target descriptions, acknowledgements, and assert-identified moves by the person, 0:09:57 those dialogue acts would cover only fifty-six percent of the total dialogue acts. 0:10:01 The rest, the other forty-four percent of dialogue acts, contains a lot of other kinds of dialogue exchanges: 0:10:09 some of them are questions, you know, answers, echo confirmations, and other game-related acts.
0:10:18 So, on to the methods. 0:10:19 We have this human-human corpus to work with, and our goal is: if we include this segmentation and dialogue act labeling in an agent, 0:10:28 how does the agent perform? That's the thing that we wanted to work on, and what we want to evaluate 0:10:38 with our methods.
0:10:40 The method that we use is divided into two steps. 0:10:47 In the first step, we have 0:10:48 the ASR utterances: the ASR is giving out its incremental utterance hypotheses. 0:10:53 We feed these to a linear-chain conditional random field. The CRF does sequential word labeling: 0:11:02 every word is labeled as being part of a new segment or part of the previous segment. 0:11:08 And then, once we have the segment boundaries assigned, we want to identify what each of these segments is.
0:11:15 So, 0:11:16 one thing to note is that this is not a new approach: segmenting the whole dialogue into dialogue act segments and then identifying the dialogue acts 0:11:25 has been used by many people in the past.
0:11:32 So here, in this approach, let's see: we have the transcripts, which contain these words coming out from the ASR. 0:11:44 These black boxes are basically stretches of speech separated by at least 200 milliseconds of silence. 0:11:51 Once these words come in, 0:11:53 they are fed to the linear-chain conditional random field. The idea is that the CRF is a sequential labeler: it assigns each word a label saying whether this word is part of a new segment or of the previous segment. 0:12:08 So we just use B/I tagging to mark how each word is part of a segment. 0:12:13 And then, once we have the segments extracted, 0:12:17 we label each one of the segments using an SVM classifier.
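The two-step idea can be sketched like this: word-level B/I boundary tags are turned into segments, and each segment then gets a dialogue act label. In the actual system the tags come from the linear-chain CRF and the labels from the SVM; here both are replaced by toy stand-ins for illustration, and the label names are just examples.

```python
# Sketch of the two-step pipeline: (1) group words into segments using B/I
# boundary tags (B = begins a new segment, I = inside the current one),
# (2) assign each segment a dialogue act label. The keyword "classifier"
# below is a toy stand-in for the SVM, not the system's actual model.

def extract_segments(words, tags):
    """Group words into segments: 'B' starts a new segment, 'I' continues it."""
    segments = []
    for word, tag in zip(words, tags):
        if tag == "B" or not segments:
            segments.append([word])
        else:
            segments[-1].append(word)
    return segments

def toy_label(segment):
    """Stand-in for the SVM dialogue act classifier (keyword rules only)."""
    if segment[-1].endswith("?") or segment[0] in ("is", "does", "which"):
        return "Question"
    if segment[0] in ("yeah", "yes", "no", "got"):
        return "Assert-Identified"
    return "Describe-Target"

words = ["the", "yellow", "glasses", "got", "it"]
tags  = ["B",   "I",      "I",       "B",   "I"]
segs = extract_segments(words, tags)
print([(s, toy_label(s)) for s in segs])
```

Here one stretch of speech contains two dialogue acts, which is exactly the case silence-based segmentation would miss.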
0:12:22 But what kind of segment is each one? 0:12:25 And what kind of features do we use to perform these methods? 0:12:28 We used three kinds of features. Our first kind is lexical-syntactic features, which include the words, the part-of-speech tags, and 0:12:37 the top-level syntactic categories, which are obtained from the parse trees. 0:12:41 Then we have the prosodic information, prosody features, which we extract from the audio incrementally. 0:12:48 Every ten milliseconds we run this prosody feature extractor, 0:12:54 and we obtain statistics such as the max and z-scores for the pitch and intensity values, which give an idea of 0:13:05 the frequency and energy. 0:13:08 And then we have the pause duration between the words, which is also included as a feature. 0:13:14 Then, for the contextual features: we believe it's important for the system to know what kind of role the person is performing, 0:13:22 whether it's the director or the matcher, because they both have different dialogue act distributions. 0:13:27 Then we have the previously recognized dialogue act labels, which are very important to identify things like 0:13:31 confirmations or answers to questions. 0:13:35 And then the recent words from the other interlocutor, which are very important 0:13:39 to identify echo confirmations.
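As a rough illustration of the feature representation just described, here is a per-word feature dictionary combining the three families: lexical-syntactic, prosodic, and contextual. The feature names and the example values are hypothetical, not the system's actual feature set.

```python
# Sketch of the per-word features described in the talk, combining the three
# families: lexical-syntactic, prosodic, and contextual. All names and example
# values below are illustrative assumptions, not the real system's features.

def word_features(word, pos_tag, pause_before, pitch_z, energy_z,
                  role, prev_da, partner_recent_words):
    return {
        "word": word.lower(),                 # lexical: the word itself
        "pos": pos_tag,                       # syntactic: part-of-speech tag
        "pause_before": pause_before,         # prosody: silence before word (sec)
        "pitch_z": pitch_z,                   # prosody: z-scored pitch
        "energy_z": energy_z,                 # prosody: z-scored intensity
        "role": role,                         # contextual: Director or Matcher
        "prev_da": prev_da,                   # contextual: last recognized DA
        # echo-confirmation cue: did the partner just say this word?
        "echoes_partner": word.lower() in partner_recent_words,
    }

feats = word_features("glasses", "NNS", 0.05, 1.2, 0.8,
                      role="Matcher", prev_da="Describe-Target",
                      partner_recent_words={"yellow", "glasses"})
print(feats["echoes_partner"])
```

The `echoes_partner` feature shows why the other interlocutor's recent words matter: the matcher repeating "glasses" right after the director said it is evidence of an echo confirmation.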
0:13:44 We use these features, and all these modules are operating incrementally, which means that with every new ASR hypothesis that comes in, 0:13:51 the B/I tagger 0:13:54 splits the utterance into the different segments, and then the classifier that has the dialogue act labels runs and identifies the dialogue acts. 0:14:04 So the hypothesized dialogue acts can change with every new word, because 0:14:08 with each word the system has more information about the task.
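The incremental behavior just described can be sketched as a loop that re-runs the whole segment-then-label pipeline on every new ASR word, so the hypothesized dialogue acts can change word by word. The segmenter and labeler below are trivial placeholders for the CRF and SVM, with toy boundary cues chosen only to make the example work.

```python
# Sketch of the incremental loop: with every new (partial) ASR hypothesis the
# whole pipeline is re-run, so the hypothesized dialogue acts can change as
# each word arrives. Both functions are placeholders for the CRF and SVM.

def segment(words):
    """Placeholder segmenter: toy boundary cues stand in for the CRF tags."""
    segs, cur = [], []
    for w in words:
        cur.append(w)
        if w in ("glasses", "it"):   # toy segment-final cues
            segs.append(cur)
            cur = []
    return segs + ([cur] if cur else [])

def label(seg):
    """Placeholder dialogue act classifier (stands in for the SVM)."""
    return "Assert-Identified" if seg[0] == "got" else "Describe-Target"

hypothesis = []
for new_word in ["yellow", "glasses", "got", "it"]:
    hypothesis.append(new_word)                  # next incremental ASR word
    das = [(label(s), s) for s in segment(hypothesis)]
    print(das)                                   # DA hypotheses may change each word
```

After "yellow glasses" the pipeline sees one description segment; two words later it has revised its output to a description followed by an assert-identified act.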
0:14:13 So there are these questions that we want to answer. The first is: how well does the segmenter and dialogue act labeler, this pipelined method, perform on this 0:14:21 reference resolution, image identification task? 0:14:25 And what is the impact of ASR performance? That is, if an ASR with a reasonable word error rate is put into the mix, 0:14:34 how well does the pipeline perform? 0:14:38 And then, how does the automated pipeline affect things downstream: 0:14:42 I mean, how does it impact image understanding, whether the agent can correctly identify the image the user is describing?
0:14:47 Evaluation of the components is a little hard, because there are a lot of caveats. 0:14:54 The first thing is that we have transcripts from the users, and there are the ASR hypotheses, which just keep coming in, 0:15:00 and they don't really match up; it's very hard to align them. 0:15:04 So here, in this example, the ASR output is not aligned one-to-one with the transcript; it's basically just arriving, segment by segment, as it comes in. 0:15:12 The human annotator does the segmentation and the dialogue act labeling 0:15:17 at the word level, and 0:15:21 we have that as gold data. 0:15:23 Now, if we want to measure the performance of the dialogue act labeler alone, 0:15:27 we can just run the dialogue act labeler on the human transcriptions 0:15:32 with the gold human segmentation, and we can get a sense of how the dialogue act labeler is performing. 0:15:40 But if we put the automatic segmenter into the picture, then we lose the one-to-one mapping between 0:15:49 the dialogue acts from the gold annotation and 0:15:52 those from the segmenter and dialogue act labeler pipeline. 0:15:55 So how do we measure it? Do we go by a word-level measure, for instance? 0:16:00 And once we put the ASR into the picture, 0:16:04 we even lose the one-to-one word mapping between the transcribed, annotated gold corpus 0:16:11 and the ASR output. So how do we evaluate the pipeline when the whole thing is working in such a mode?
0:16:19 Previously, researchers have used 0:16:23 many metrics to measure these things: we have the segmentation error rates, DA error rates, f-scores, and concept error rates, which people have used in the past 0:16:33 to measure such systems. 0:16:38 But each one of these metrics 0:16:42 measures a different aspect of the system. 0:16:45 What we actually want to know, when we're building the system, is 0:16:51 whether the right dialogue act was identified, so that we can take the right action. 0:16:56 For example, it doesn't matter, you know, if the ASR made an error in recognizing the exact words: say, 0:17:06 instead of "oh no" it gave just "no". If we identify the "no" answer dialogue act in spite of 0:17:15 that ASR error, 0:17:17 then maybe my agent gets better performance, I mean, it takes better actions. 0:17:23 So to measure such a system, we need multiset, DA-level precision and 0:17:28 recall metrics. 0:17:31 As I'm short on time I won't go into the details of this metric, but let's just keep in mind that the segment-level boundaries 0:17:39 for the words 0:17:40 are not so important; what's important is that we identify the right dialogue acts. 0:17:46 That's the basic trade-off.
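The intuition behind these DA-level precision and recall scores, that exact word boundaries do not matter and only the hypothesized dialogue acts count, can be illustrated with multiset precision and recall over the dialogue act labels. This is one plausible reading of the idea, not necessarily the paper's exact metric definition.

```python
# Sketch of dialogue-act-level precision/recall over label multisets:
# exact word boundaries are ignored; what counts is whether the right
# dialogue acts were hypothesized. An illustration of the idea only.
from collections import Counter

def da_precision_recall(gold_das, hyp_das):
    gold, hyp = Counter(gold_das), Counter(hyp_das)
    matched = sum((gold & hyp).values())        # multiset intersection
    precision = matched / sum(hyp.values())     # matched / hypothesized
    recall = matched / sum(gold.values())       # matched / gold
    return precision, recall

gold = ["Describe-Target", "Question", "Assert-Identified"]
hyp  = ["Describe-Target", "Describe-Target", "Assert-Identified"]
p, r = da_precision_recall(gold, hyp)
print(p, r)   # 2 of 3 hypotheses correct, 2 of 3 gold acts found
```

Note that the "no" versus "oh no" example above scores perfectly under such a metric, since the answer dialogue act is still identified despite the ASR error.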
0:17:49 The evaluation produces these numbers. If we use the baseline, which is just one dialogue act per speech segment, 0:17:59 we end up with an accuracy of seventy-eight percent. 0:18:04 But if we perform the segmentation with just the prosody features, we get around seventy-two percent. 0:18:12 The drop in performance could be because prosody alone is not able to identify boundaries that aren't already marked out by silence. 0:18:21 Then, if we use the lexical-syntactic and contextual features, we get about ninety percent. 0:18:27 And once we combine all the features, we get a performance increase of only about one to two percent. 0:18:34 So it really looks like the prosodic features aren't impacting the performance much; you can see that change. 0:18:41 But it's not close to human-level performance.
0:18:44 These are the numbers that we have for the multiset precision and recall for the dialogue acts identified, and from this table 0:18:54 we can observe that 0:18:56 with automation at every level, 0:18:58 the performance takes a hit: the numbers drop as we go 0:19:03 from human transcripts with human segmentation, to automatic segmentation and automatic labeling, and 0:19:09 finally to the ASR.
0:19:12 But really, what we want to see is how the agent performs: is the agent performing equally well or not? 0:19:20 In a previous study we used a simulation method to measure how well the agent 0:19:25 performed. 0:19:27 This offline method of evaluating the agent is called the eavesdropper framework, which we explained in the 2015 paper; I encourage you to look at it. 0:19:36 It gave us a really good picture of how the agent's performance actually was, so we used that metric to evaluate 0:19:45 the agent's performance on target image identification, 0:19:50 and we found 0:19:51 there was no significant difference between the two.
0:19:55 Finally, the take-away messages are that there are many metrics to measure dialogue act segmentation; that measuring the final impact on the agent's performance is very important, and the individual module performances might give us a different picture 0:20:09 than the pipeline performance with the modules feeding into each other; and finally, that DA segmentation can facilitate and assist in building a better and more complex agent. 0:20:19 In the future we want to integrate these policies into the agent. 0:20:23 Thank you.
0:20:57 So, that's a very good question. 0:21:00 The question was that 0:21:01 this domain is really specific, in terms of utterances being of short duration, so does it really scale up to a larger domain? 0:21:13 The answer is that I don't know; maybe it could, because the framework is kind of general, in the sense that the features that it uses are not 0:21:22 very much tied to this domain. But we should really explore and see how 0:21:27 it would perform in other domains. 0:21:30 So the answer is: it could. 0:21:33 I can't say for sure.
0:21:42 [question from the audience]
0:21:47 [Audience] A question about the architecture for segmentation and labeling: why do you have two separate steps for segmenting and labeling, rather than doing it jointly? 0:22:00 So the question was: why do we have a separate step for segmentation and labeling? Researchers have looked at other architectures: they have tried the joint method of identifying the boundaries and the labels together, and also doing it in two separate steps. 0:22:18 I would say both have been tried, and it's kind of an open question, but we did measure the performance, and the joint method was not working as well as 0:22:32 this method.
0:22:50 That's right. 0:22:53 We probably don't have enough data: we have a long tail of dialogue acts, as we saw from the table; the dialogue act distribution is kind of long-tailed, and the joint method would probably work 0:23:05 if we had more data and didn't have this issue.
0:23:10 [inaudible question from the audience]
0:23:28 That's a good question. The question was: can we look at the ASR n-best lists and 0:23:35 see how well the performance would be for dialogue act labeling, whether it would work as well there. 0:23:41 The answer is: we haven't, but we can definitely take a look at the n-best lists; that's something for future work.