Alright, good evening everyone, it's me again. I've tried to make this talk a bit more interesting and exciting. Alright, so.
Let me start by taking a step back and looking at our previous work. In the last work we looked at fine-grained semantics: we tried to understand the scene descriptions by segmenting the target descriptions, as the Director described the target image, into different semantic acts, and then tried to understand the images. In this work we take a step back and try to understand the high-level dialogue acts, that is, to understand what the person is trying to do, and in that sense we extend upon the work presented previously.
Alright. So the motivation for this work is to achieve fast-paced interaction. In fast-paced interactions a lot of things happen: a single user speech segment can have multiple dialogue acts, and a single dialogue act can span across multiple speech segments. In those cases, what should we do, and what kind of system should we design? So what we want is a methodology to perform dialogue act segmentation, and to understand what the dialogue acts are, in an environment which is very fast-paced, and then to initiate the right dialogue act at the right time. That's the goal.
The structure of this talk is divided into these parts. First I'll speak a bit about our domain and the previous work, and set up the technical problem that is our starting point. Then I'll describe the annotation scheme that we used. Then the methods we use to perform the segmentation and the dialogue act labeling. And then we evaluate the components and see how it works with the agent.
So, the domain that we use is very similar to the one that we saw in the last talk, so I won't cover it in depth. The domain is basically the RDG-Image game.
It's a rapid, time-constrained dialogue game played by two people. This person is the Director: the Director sees a set of images on a computer screen, with the target image highlighted, and tries to describe that target image. This person is the Matcher: the Matcher sees the same images, but without the highlighting, and tries to make the selection. They can have dialogue exchanges back and forth, the game is time-constrained, and they also see the score, so there is an incentive to be fast. You can hear it in the clip: short descriptions like "the little hard glasses" or "yellow glasses", quick acknowledgments like "got it", and so on. As you can see, the game moves forward through these rapid dialogue exchanges, and that's what makes it a challenging problem.
So we built an agent using this data, and this is what we presented in previous work: an agent that can play this fast-paced game with real users. It had incremental components, ASR, NLU, and the policy, and all of these components were operating incrementally. The incremental architecture was very important, because we got better game scores with it, and the scores were not significantly different from human players', which means it performed much better than alternative non-incremental architectures, which was one point of our previous work. It also had favorable subjective evaluations: people liked interacting with this agent, compared to other versions of the agent.
But there are a few limitations of this architecture. The limitation is that it assumes that everything, every word that the person is speaking, is basically a description of the target image. And if that's the case, we can't have the really fun kinds of interaction that the two human players were having. So it's not as interactive as human players, but it is really fast.
So I want to show a small video of the agent interacting with a human, to reinforce the points that I just made. At the top you see the human Director's screen with the eight images, with the human describing the target, and the bottom screen shows the agent's view of the images and its confidence values.
[video plays]
As you can see, the agent is very fast, which really helps in the game, but the interaction is quite one-sided.
Alright, so what we want to do is make the agent more interactive. We want to make use of the full range of dialogue acts that humans use in this game, and we want to initiate the right dialogue act at the right time, so that we get the right interactions. For that, the agent needs incremental dialogue act segmentation and labeling, and we'll show how we use it and why we need it. The challenge is doing this without losing efficiency. For instance, in the previous architecture the agent treated every utterance as a target image description, so it was very efficient at understanding the target images. But if we include more dialogue acts, it's very possible that mislabeling happens, for instance words that belong to a target description get assigned to other dialogue acts surrounding it, and the performance takes a hit. So we wanted to see whether the agent's performance takes a hit or not.
So we collected a human-human dialogue corpus in a lab setting in one of our previous studies, and this data was annotated by a human annotator.
So, the game's characteristic is that it's rapid. There are multiple dialogue acts within a single speech segment, and the same dialogue act can actually span across different speech segments. For instance, here you can see that when the countdown starts, the exchange gets really fast and there is a lot of overlap. In this example we can see that there are multiple dialogue acts within a single speech segment; each speech segment here is separated out by a silence threshold of a couple of hundred milliseconds. And in this example there is a single dialogue act that spans across multiple speech segments. From this table we can see that there are a lot of dialogue acts within each speech segment, so our hypothesis is that if we identify speech units just by separating them at a silence threshold, we won't do a good job of identifying the dialogue act boundaries.
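To make that hypothesis concrete, here is a minimal sketch of the kind of pure pause-based segmenter being argued against: it splits a timed word stream wherever the inter-word silence exceeds a threshold, so it can never separate two dialogue acts spoken back-to-back. The word timings and the 300 ms threshold are invented for illustration.

```python
# Minimal sketch of a pause-based segmenter: split a stream of
# (word, start_ms, end_ms) tuples wherever the inter-word silence
# reaches a threshold. Words and times are invented for illustration.

def segment_by_silence(timed_words, threshold_ms=300):
    """Group words into speech segments separated by long pauses."""
    segments = []
    current = []
    prev_end = None
    for word, start, end in timed_words:
        if prev_end is not None and start - prev_end >= threshold_ms:
            segments.append(current)
            current = []
        current.append(word)
        prev_end = end
    if current:
        segments.append(current)
    return segments

stream = [("yellow", 0, 300), ("glasses", 320, 700),   # description
          ("got", 1200, 1350), ("it", 1360, 1450)]     # acknowledgment
print(segment_by_silence(stream))
# The 500 ms gap before "got" splits the stream into two segments, but
# two dialogue acts spoken back-to-back would stay in one segment.
```

This is exactly the failure mode in the table: when several dialogue acts fall inside one pause-delimited segment, silence alone cannot find their boundaries.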
So human annotators annotated this data toward that goal. The annotation is done at a very fine-grained level: the word level. Here, for instance, the annotator identified that this is a question, that this is the answer to the previous question, and so on.
So what does the annotated corpus look like? It's very diverse. If we think of this game as containing just target descriptions, acknowledgments, and assert-identified moves by the Matcher as the dialogue acts, those would cover only fifty-six percent of the total dialogue acts. The remaining forty-four percent of the dialogue acts contain a lot of other kinds of dialogue exchanges: some of them are questions, answers, echo confirmations, and other game-related acts.
So, the methods. We have this human corpus, and our goal is: if we include this data in an agent, how well do the segmentation and the dialogue act labeling perform, and how does the agent perform? That's the question we wanted to work on, and we developed methods for it.
The method that we use is divided into two steps. First, we have the ASR utterances: the ASR gives us its incremental hypotheses. We feed these to a linear-chain conditional random field. The CRF does sequential word labeling: every word is labeled as either the start of a new segment or as part of the previous segment. Then, once we have the segment boundaries assigned, we want to identify what each of these segments is. One thing to note is that this is not a new approach: segmenting the whole dialogue into dialogue act segments and then identifying the dialogue acts has been done by many people in the past.
So in this approach, we have the transcripts, with the words coming out from the ASR. These black boxes are speech segments, separated by at least 300 milliseconds of silence. The word sequence is fed to the linear-chain conditional random field, which does sequential labeling: it assigns each word a label indicating whether the word starts a new segment or is part of the previous segment. We just use B/I tagging, because every word is part of some segment. Then, once we have the segments extracted, we label each one of the segments using an SVM classifier. So what kind of features do we use to perform these methods?
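The two-step shape of the pipeline can be sketched roughly like this. The rule-based tagger and labeler below are trivial stand-ins for the CRF and the SVM, just to show how B/I tags turn into labeled segments; the cue words and act labels are invented.

```python
# Sketch of the two-step pipeline: (1) a word-level B/I tagger marks
# segment boundaries, (2) a segment-level classifier assigns each
# extracted segment a dialogue act. The real system uses a linear-chain
# CRF and an SVM; these rule-based stand-ins only show the data flow.

def bi_to_segments(words, tags):
    """Turn B/I tags into word segments: 'B' starts a new segment."""
    segments = []
    for word, tag in zip(words, tags):
        if tag == "B" or not segments:
            segments.append([word])
        else:
            segments[-1].append(word)
    return segments

def toy_tagger(words):
    """Stand-in for the CRF: open a new segment at a few cue words."""
    cues = {"which", "got", "yes", "no"}
    return ["B" if (i == 0 or w in cues) else "I" for i, w in enumerate(words)]

def toy_labeler(segment):
    """Stand-in for the SVM: map a segment to a coarse dialogue act."""
    if segment[0] == "which":
        return "question"
    if segment[0] in {"got", "yes", "no"}:
        return "acknowledgment"
    return "description"

words = "yellow glasses got it".split()
segments = bi_to_segments(words, toy_tagger(words))
print([(toy_labeler(s), s) for s in segments])
# [('description', ['yellow', 'glasses']), ('acknowledgment', ['got', 'it'])]
```

Notice that the B/I tags are what lets one pause-delimited stretch of speech break into several dialogue acts.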
We use three kinds of features. The first are lexical-syntactic features, which include the words, the part-of-speech tags, and the top-level question patterns, which are obtained from the parse trees. Then we have the prosodic features, which we extract from the audio incrementally: every ten milliseconds we run a prosody feature extractor, for which we use InproTK, and we obtain the min, max, mean, and standard deviation scores for the pitch and intensity values, which give us an idea of the frequency and energy. And then we have the pause duration between the words, which is also included as a feature.
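As a small illustration of those prosodic summary features, here is how the min, max, mean, and standard deviation might be computed over per-frame pitch values. The frame values below are invented; the actual extraction tool and windows are as described in the talk.

```python
# Sketch of the prosodic summary features: given per-frame values
# (one frame every 10 ms) of pitch or intensity, compute the min, max,
# mean, and standard deviation over a window. Frame values are invented.
import statistics

def prosody_summary(frames):
    """Summarize a list of per-frame values into four statistics."""
    return {
        "min": min(frames),
        "max": max(frames),
        "mean": statistics.mean(frames),
        "sd": statistics.pstdev(frames),
    }

pitch_hz = [210.0, 215.0, 230.0, 225.0]   # 40 ms of pitch frames
print(prosody_summary(pitch_hz))
```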
Then, for the contextual features: we believe it's important for the model to know what role the person is performing, whether they are the Director or the Matcher, because the two roles have different dialogue act distributions. Then we have the previously recognized dialogue act labels, which are very important to identify things like confirmations or answers to questions. And then the recent words from the other interlocutor, which are very important to identify echo confirmations.
We use these features, and all of these modules operate incrementally. That means for every new ASR hypothesis that comes in, the B/I tagger splits the utterance into the different segments, and then the classifier that has the dialogue act knowledge runs and identifies the dialogue acts. So the dialogue act hypotheses can change with every new word, because with each word the model has more information about the task.
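To picture that incremental behavior, here is a schematic loop over growing ASR partials: the whole prefix is re-segmented and re-labeled on each update, so the current dialogue act hypotheses can change with every new word. The segmenter and labeler are toy stand-ins, not the actual CRF and SVM.

```python
# Schematic of incremental processing: each time the ASR extends its
# partial hypothesis, the whole prefix is re-segmented and re-labeled.
# The segmenter and labeler are toy stand-ins for the real CRF/SVM.

def toy_segment(words):
    segs, cur = [], []
    for w in words:
        if w == "got" and cur:      # pretend "got" always opens a segment
            segs.append(cur)
            cur = []
        cur.append(w)
    if cur:
        segs.append(cur)
    return segs

def toy_label(seg):
    return "acknowledgment" if seg[0] == "got" else "description"

history = []
for n in range(1, 5):
    partial = ["yellow", "glasses", "got", "it"][:n]   # growing ASR prefix
    history.append([(toy_label(s), s) for s in toy_segment(partial)])
print(history[-1])
# final step: [('description', ['yellow', 'glasses']),
#              ('acknowledgment', ['got', 'it'])]
```

Each entry in `history` is the pipeline's best guess at that point in time, which is what a dialogue policy would consume.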
So the questions we want to ask are: first, how well does this pipeline of segmenter and dialogue act labeler perform on this reference resolution task, identifying the image? Second, what is the impact of ASR performance: if an ASR with a reasonable word error rate enters the mix, how well does the pipeline score? And third, how does the automated pipeline affect the agent: does it impact the image understanding, can the agent still correctly identify the image the user means?
Evaluating the components is a little hard, because there are a lot of variables. The first issue is that we have transcripts from the users, and we have the ASR hypotheses that just keep coming in, and they don't match up, so it's very hard to align them. Here in this example they are not aligned one-to-one; the ASR words just arrive as a stream. The human annotator does the segmentation and the dialogue act labeling at the word level, and we have that as the gold data. Now, if we want to measure the performance of the dialogue act labeler alone, we can just run the labeler on the human transcripts with the human-annotated segmentation, and get a sense of how it performs. But if we put the automatic segmenter into the picture, then we lose the one-to-one mapping between the gold dialogue acts and the dialogue acts coming from the segmenter and labeler. So how do we measure performance, for instance at the word level? And once we put the ASR into the picture, we even lose the one-to-one word mapping between the transcribed, annotated gold standard and the ASR output. So how do we evaluate a pipeline operating in such a mode?
Previously, researchers have used many metrics to measure these things: dialogue act segmentation error rates, joint segmentation-and-labeling error rates, F-scores, and concept error rates, which people have used in the past to evaluate such systems. But each of these metrics measures something different about the system.
What we actually want to know, when we are building the system, is whether the right dialogue act was identified, so that we can take the right action. For example, it doesn't matter if the ASR made a word error, say it recognized "no" instead of "oh no": if we still identify the "no" answer in spite of that ASR error, then the agent can still take the better action. So to measure the system in that way, we need precision and recall metrics defined at the dialogue act level.
for which
it is sorted of time i would be would like would into the details of
this metric but just let let's just keep in mind that the segment level boundaries
for the words
are not so important it's important that we identify a dialogue acts
that was kind of traffic
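Since the talk skips the metric's details, here is one illustrative way a boundary-tolerant, dialogue-act-level precision and recall could be set up. This is purely a sketch of the idea, not necessarily the actual definition used: a predicted act counts as correct if some gold act with the same label overlaps it in word positions.

```python
# Illustrative sketch of a dialogue-act-level precision/recall: a
# predicted act counts as correct if a gold act with the same label
# overlaps it in word positions, regardless of exact boundaries.
# This is a simplification for illustration, not the paper's metric.

def overlaps(a, b):
    """True if word-index spans a=(start, end) and b=(start, end) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def da_precision_recall(pred, gold):
    """pred/gold: lists of (label, (start_word, end_word)) tuples."""
    tp_pred = sum(any(pl == gl and overlaps(ps, gs) for gl, gs in gold)
                  for pl, ps in pred)
    tp_gold = sum(any(pl == gl and overlaps(ps, gs) for pl, ps in pred)
                  for gl, gs in gold)
    return tp_pred / len(pred), tp_gold / len(gold)

gold = [("description", (0, 1)), ("acknowledgment", (2, 3))]
pred = [("description", (0, 2)), ("question", (3, 3))]
p, r = da_precision_recall(pred, gold)
print(p, r)  # 0.5 0.5: boundaries may differ, but labels must match
```

The point of a metric shaped like this is exactly the one made above: an off-by-a-word boundary is forgiven, while a wrong dialogue act label is not.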
So the evaluation produces these numbers. The baseline, which assigns just one dialogue act per speech segment, ends up with an accuracy of 78 percent. If we perform the segmentation with just the prosody features, we get around 72 percent; that drop in performance could be because prosody alone can't identify boundaries beyond the ones already marked by silence. If we use the lexical-syntactic and contextual features, we get around 90 percent. And once we combine all the features, we gain another one to two percent. So it seems the prosody features aren't impacting the performance much; you can see a small change, but overall it's still not close to human-level performance.
These are the numbers we have for the dialogue-act-level precision and recall, here for the describe act and the assert-identified act. From this table we can observe that with automation at every level, the performance takes a hit: the numbers drop as we go from human transcripts with human segmentation, to automatic segmentation, to automatic labeling, and finally to ASR. But what we really want to see is how well the agent performs: is the agent performing equally well or not?
So in a previous study we used a simulation method to measure how well the agent performed. This offline method of evaluating the agent is called the eavesdropper evaluation, which we explained in detail in our 2015 paper, and I encourage you to look at it. It gave us a really good picture of how the agent actually performed, so we used that method to evaluate the agent's performance on target image identification. And we found there was no significant difference between the conditions.
Finally, the take-away messages: there are many metrics for measuring dialogue act segmentation, but measuring the final impact on the agent's performance is very important, and the individual module performances might give us a different picture than the pipeline performance does. And finally, dialogue act segmentation can facilitate building better and more complex dialogue policies; in future work we want to integrate these policies into the agent. Thank you.
So, that's a very good question. The question was that this domain is really specific, in terms of the utterances being of short duration, so does this really scale up to larger domains? The answer is that I don't know; maybe it could, because the framework is fairly general, in the sense that the features we use are not very tuned to this domain, but we should really explore and see how it performs in other domains. So for now I can't say. Next question.
[audience question] So the question was about the architecture for segmentation and labeling: why do we have a separate step for segmenting and a separate step for labeling, rather than doing them jointly? Researchers have looked at alternative architectures: they have tried the joint method of identifying the boundaries and the labels together, as well as doing it in two separate steps. We tried both, and when we measured the performance, the joint method was not working as well as this two-step method.
That's right. We have a long tail of dialogue acts, as we saw in the table: the dialogue act distribution is quite long-tailed, and the joint method would probably work better if we had more data to address this issue.
[audience question] That's a good question. The question was: could we look at the ASR n-best lists, and see how well the dialogue act labeling performs when using them? The answer is that we haven't, but we can certainly take a look at the n-best lists; that's definitely something worth exploring.