Alright, good evening everyone, it's me again. I've tried to make this talk a bit more interesting and exciting. Alright, so.
Let me start by taking a step back and looking at our previous work. In the last work we looked at fine-grained semantics: we tried to understand the scene descriptions by segmenting the target descriptions, as the Director described the target image, into different semantic acts, and then tried to understand the images. In this work we take a step back and try to understand the high-level dialogue acts, that is, to understand what the person is trying to do, and in that sense we extend upon the work presented previously.
Alright. So the motivation for this work is to achieve fast-paced interaction. In fast-paced interactions a lot of things happen: a single user speech segment can have multiple dialogue acts, and a single dialogue act can span across multiple speech segments. In those cases, what should we do, and what kind of system should we design? So what we want is a methodology to perform dialogue act segmentation, and to understand what the dialogue acts are, in an environment which is very fast-paced, and then to initiate the right dialogue act at the right time. That's the goal.
The structure of this talk is divided into these parts. First I'll speak a bit about our domain and the previous work, and set up the technical problem that is our starting point. Then I'll describe the annotation scheme that we used. Then the methods we use to perform the segmentation and the dialogue act labeling. And then we evaluate the components and see how it works with the agent.
So, the domain that we use is very similar to the one that we saw in the last talk, so I won't cover it in depth. The domain is basically the RDG-Image game.
It's a rapid, time-constrained dialogue game played by two people. This person is the Director: the Director sees a set of images on a computer screen, with the target image highlighted, and tries to describe that target image. This person is the Matcher: the Matcher sees the same images, but without the highlighting, and tries to make the selection. They can have dialogue exchanges back and forth, the game is time-constrained, and they also see the score, so there is an incentive to be fast. You can hear it in the clip: short descriptions like "the little hard glasses" or "yellow glasses", quick acknowledgments like "got it", and so on. As you can see, the game moves forward through these rapid dialogue exchanges, and that's what makes it a challenging problem.
So we built an agent using this data, and this is what we presented in previous work: an agent that can play this fast-paced game with real users. It had incremental components, ASR, NLU, and the policy, and all of these components were operating incrementally. The incremental architecture was very important, because we got better game scores with it, and the scores were not significantly different from human players', which means it performed much better than alternative non-incremental architectures, which was one point of our previous work. It also had favorable subjective evaluations: people liked interacting with this agent, compared to other versions of the agent.
But there are a few limitations of this architecture. The limitation is that it assumes that everything, every word that the person is speaking, is basically a description of the target image. And if that's the case, we can't have the really fun kinds of interaction that the two human players were having. So it's not as interactive as human players, but it is really fast.
So I want to show a small video of the agent interacting with a human, to reinforce the points that I just made. At the top you see the human Director's screen with the eight images, with the human describing the target, and the bottom screen shows the agent's view of the images and its confidence values.
[video plays]
As you can see, the agent is very fast, which really helps in the game, but the interaction is quite one-sided.
Alright, so what we want to do is make the agent more interactive. We want to make use of the full range of dialogue acts that humans use in this game, and we want to initiate the right dialogue act at the right time, so that we get the right interactions. For that, the agent needs incremental dialogue act segmentation and labeling, and we'll show how we use it and why we need it. The challenge is doing this without losing efficiency. For instance, in the previous architecture the agent treated every utterance as a target image description, so it was very efficient at understanding the target images. But if we include more dialogue acts, it's very possible that mislabeling happens, for instance words that belong to a target description get assigned to other dialogue acts surrounding it, and the performance takes a hit. So we wanted to see whether the agent's performance takes a hit or not.
So we collected a human-human dialogue corpus in a lab setting in one of our previous studies, and this data was annotated by a human annotator.
So, the game's characteristic is that it's rapid. There are multiple dialogue acts within a single speech segment, and the same dialogue act can actually span across different speech segments. For instance, here you can see that when the countdown starts, the exchange gets really fast and there is a lot of overlap. In this example we can see that there are multiple dialogue acts within a single speech segment; each speech segment here is separated out by a silence threshold of a couple of hundred milliseconds. And in this example there is a single dialogue act that spans across multiple speech segments. From this table we can see that there are a lot of dialogue acts within each speech segment, so our hypothesis is that if we identify speech units just by separating them at a silence threshold, we won't do a good job of identifying the dialogue act boundaries.
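To make that hypothesis concrete, here is a minimal sketch of the kind of pure pause-based segmenter being argued against: it splits a timed word stream wherever the inter-word silence exceeds a threshold, so it can never separate two dialogue acts spoken back-to-back. The word timings and the 300 ms threshold are invented for illustration.

```python
# Minimal sketch of a pause-based segmenter: split a stream of
# (word, start_ms, end_ms) tuples wherever the inter-word silence
# reaches a threshold. Words and times are invented for illustration.

def segment_by_silence(timed_words, threshold_ms=300):
    """Group words into speech segments separated by long pauses."""
    segments = []
    current = []
    prev_end = None
    for word, start, end in timed_words:
        if prev_end is not None and start - prev_end >= threshold_ms:
            segments.append(current)
            current = []
        current.append(word)
        prev_end = end
    if current:
        segments.append(current)
    return segments

stream = [("yellow", 0, 300), ("glasses", 320, 700),   # description
          ("got", 1200, 1350), ("it", 1360, 1450)]     # acknowledgment
print(segment_by_silence(stream))
# The 500 ms gap before "got" splits the stream into two segments, but
# two dialogue acts spoken back-to-back would stay in one segment.
```

This is exactly the failure mode in the table: when several dialogue acts fall inside one pause-delimited segment, silence alone cannot find their boundaries.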
So human annotators annotated this data toward that goal. The annotation is done at a very fine-grained level: the word level. Here, for instance, the annotator identified that this is a question, that this is the answer to the previous question, and so on.
So what does the annotated corpus look like? It's very diverse. If we think of this game as containing just target descriptions, acknowledgments, and assert-identified moves by the Matcher as the dialogue acts, those would cover only fifty-six percent of the total dialogue acts. The remaining forty-four percent of the dialogue acts contain a lot of other kinds of dialogue exchanges: some of them are questions, answers, echo confirmations, and other game-related acts.
So, the methods. We have this human corpus, and our goal is: if we include this data in an agent, how well do the segmentation and the dialogue act labeling perform, and how does the agent perform? That's the question we wanted to work on, and we developed methods for it.
The method that we use is divided into two steps. First, we have the ASR utterances: the ASR gives us its incremental hypotheses. We feed these to a linear-chain conditional random field. The CRF does sequential word labeling: every word is labeled as either the start of a new segment or as part of the previous segment. Then, once we have the segment boundaries assigned, we want to identify what each of these segments is. One thing to note is that this is not a new approach: segmenting the whole dialogue into dialogue act segments and then identifying the dialogue acts has been done by many people in the past.
So in this approach, we have the transcripts, with the words coming out from the ASR. These black boxes are speech segments, separated by at least 300 milliseconds of silence. The word sequence is fed to the linear-chain conditional random field, which does sequential labeling: it assigns each word a label indicating whether the word starts a new segment or is part of the previous segment. We just use B/I tagging, because every word is part of some segment. Then, once we have the segments extracted, we label each one of the segments using an SVM classifier. So what kind of features do we use to perform these methods?
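The two-step shape of the pipeline can be sketched roughly like this. The rule-based tagger and labeler below are trivial stand-ins for the CRF and the SVM, just to show how B/I tags turn into labeled segments; the cue words and act labels are invented.

```python
# Sketch of the two-step pipeline: (1) a word-level B/I tagger marks
# segment boundaries, (2) a segment-level classifier assigns each
# extracted segment a dialogue act. The real system uses a linear-chain
# CRF and an SVM; these rule-based stand-ins only show the data flow.

def bi_to_segments(words, tags):
    """Turn B/I tags into word segments: 'B' starts a new segment."""
    segments = []
    for word, tag in zip(words, tags):
        if tag == "B" or not segments:
            segments.append([word])
        else:
            segments[-1].append(word)
    return segments

def toy_tagger(words):
    """Stand-in for the CRF: open a new segment at a few cue words."""
    cues = {"which", "got", "yes", "no"}
    return ["B" if (i == 0 or w in cues) else "I" for i, w in enumerate(words)]

def toy_labeler(segment):
    """Stand-in for the SVM: map a segment to a coarse dialogue act."""
    if segment[0] == "which":
        return "question"
    if segment[0] in {"got", "yes", "no"}:
        return "acknowledgment"
    return "description"

words = "yellow glasses got it".split()
segments = bi_to_segments(words, toy_tagger(words))
print([(toy_labeler(s), s) for s in segments])
# [('description', ['yellow', 'glasses']), ('acknowledgment', ['got', 'it'])]
```

Notice that the B/I tags are what lets one pause-delimited stretch of speech break into several dialogue acts.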
We use three kinds of features. The first are lexical-syntactic features, which include the words, the part-of-speech tags, and the top-level question patterns, which are obtained from the parse trees. Then we have the prosodic features, which we extract from the audio incrementally: every ten milliseconds we run a prosody feature extractor, for which we use InproTK, and we obtain the min, max, mean, and standard deviation scores for the pitch and intensity values, which give us an idea of the frequency and energy. And then we have the pause duration between the words, which is also included as a feature.
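As a small illustration of those prosodic summary features, here is how the min, max, mean, and standard deviation might be computed over per-frame pitch values. The frame values below are invented; the actual extraction tool and windows are as described in the talk.

```python
# Sketch of the prosodic summary features: given per-frame values
# (one frame every 10 ms) of pitch or intensity, compute the min, max,
# mean, and standard deviation over a window. Frame values are invented.
import statistics

def prosody_summary(frames):
    """Summarize a list of per-frame values into four statistics."""
    return {
        "min": min(frames),
        "max": max(frames),
        "mean": statistics.mean(frames),
        "sd": statistics.pstdev(frames),
    }

pitch_hz = [210.0, 215.0, 230.0, 225.0]   # 40 ms of pitch frames
print(prosody_summary(pitch_hz))
```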
Then, for the contextual features: we believe it's important for the model to know what role the person is performing, whether they are the Director or the Matcher, because the two roles have different dialogue act distributions. Then we have the previously recognized dialogue act labels, which are very important to identify things like confirmations or answers to questions. And then the recent words from the other interlocutor, which are very important to identify echo confirmations.
We use these features, and all of these modules operate incrementally. That means for every new ASR hypothesis that comes in, the B/I tagger splits the utterance into the different segments, and then the classifier that has the dialogue act knowledge runs and identifies the dialogue acts. So the dialogue act hypotheses can change with every new word, because with each word the model has more information about the task.
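To picture that incremental behavior, here is a schematic loop over growing ASR partials: the whole prefix is re-segmented and re-labeled on each update, so the current dialogue act hypotheses can change with every new word. The segmenter and labeler are toy stand-ins, not the actual CRF and SVM.

```python
# Schematic of incremental processing: each time the ASR extends its
# partial hypothesis, the whole prefix is re-segmented and re-labeled.
# The segmenter and labeler are toy stand-ins for the real CRF/SVM.

def toy_segment(words):
    segs, cur = [], []
    for w in words:
        if w == "got" and cur:      # pretend "got" always opens a segment
            segs.append(cur)
            cur = []
        cur.append(w)
    if cur:
        segs.append(cur)
    return segs

def toy_label(seg):
    return "acknowledgment" if seg[0] == "got" else "description"

history = []
for n in range(1, 5):
    partial = ["yellow", "glasses", "got", "it"][:n]   # growing ASR prefix
    history.append([(toy_label(s), s) for s in toy_segment(partial)])
print(history[-1])
# final step: [('description', ['yellow', 'glasses']),
#              ('acknowledgment', ['got', 'it'])]
```

Each entry in `history` is the pipeline's best guess at that point in time, which is what a dialogue policy would consume.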
So the questions we want to ask are: first, how well does this pipeline of segmenter and dialogue act labeler perform on this reference resolution task, identifying the image? Second, what is the impact of ASR performance: if an ASR with a reasonable word error rate enters the mix, how well does the pipeline score? And third, how does the automated pipeline affect the agent: does it impact the image understanding, can the agent still correctly identify the image the user means?
Evaluating the components is a little hard, because there are a lot of variables. The first issue is that we have transcripts from the users, and we have the ASR hypotheses that just keep coming in, and they don't match up, so it's very hard to align them. Here in this example they are not aligned one-to-one; the ASR words just arrive as a stream. The human annotator does the segmentation and the dialogue act labeling at the word level, and we have that as the gold data. Now, if we want to measure the performance of the dialogue act labeler alone, we can just run the labeler on the human transcripts with the human-annotated segmentation, and get a sense of how it performs. But if we put the automatic segmenter into the picture, then we lose the one-to-one mapping between the gold dialogue acts and the dialogue acts coming from the segmenter and labeler. So how do we measure performance, for instance at the word level? And once we put the ASR into the picture, we even lose the one-to-one word mapping between the transcribed, annotated gold standard and the ASR output. So how do we evaluate a pipeline operating in such a mode?
Previously, researchers have used many metrics to measure these things: dialogue act segmentation error rates, joint segmentation-and-labeling error rates, F-scores, and concept error rates, which people have used in the past to evaluate such systems. But each of these metrics measures something different about the system.
What we actually want to know, when we are building the system, is whether the right dialogue act was identified, so that we can take the right action. For example, it doesn't matter if the ASR made a word error, say it recognized "no" instead of "oh no": if we still identify the "no" answer in spite of that ASR error, then the agent can still take the better action. So to measure the system in that way, we need precision and recall metrics defined at the dialogue act level.
for which
it is sorted of time i would be would like would into the details of
this metric but just let let's just keep in mind that the segment level boundaries
for the words
are not so important it's important that we identify a dialogue acts
that was kind of traffic
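Since the talk skips the metric's details, here is one illustrative way a boundary-tolerant, dialogue-act-level precision and recall could be set up. This is purely a sketch of the idea, not necessarily the actual definition used: a predicted act counts as correct if some gold act with the same label overlaps it in word positions.

```python
# Illustrative sketch of a dialogue-act-level precision/recall: a
# predicted act counts as correct if a gold act with the same label
# overlaps it in word positions, regardless of exact boundaries.
# This is a simplification for illustration, not the paper's metric.

def overlaps(a, b):
    """True if word-index spans a=(start, end) and b=(start, end) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def da_precision_recall(pred, gold):
    """pred/gold: lists of (label, (start_word, end_word)) tuples."""
    tp_pred = sum(any(pl == gl and overlaps(ps, gs) for gl, gs in gold)
                  for pl, ps in pred)
    tp_gold = sum(any(pl == gl and overlaps(ps, gs) for pl, ps in pred)
                  for gl, gs in gold)
    return tp_pred / len(pred), tp_gold / len(gold)

gold = [("description", (0, 1)), ("acknowledgment", (2, 3))]
pred = [("description", (0, 2)), ("question", (3, 3))]
p, r = da_precision_recall(pred, gold)
print(p, r)  # 0.5 0.5: boundaries may differ, but labels must match
```

The point of a metric shaped like this is exactly the one made above: an off-by-a-word boundary is forgiven, while a wrong dialogue act label is not.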
So the evaluation produces these numbers. The baseline, which assigns just one dialogue act per speech segment, ends up with an accuracy of 78 percent. If we perform the segmentation with just the prosody features, we get around 72 percent; that drop in performance could be because prosody alone can't identify boundaries beyond the ones already marked by silence. If we use the lexical-syntactic and contextual features, we get around 90 percent. And once we combine all the features, we gain another one to two percent. So it seems the prosody features aren't impacting the performance much; you can see a small change, but overall it's still not close to human-level performance.
These are the numbers we have for the dialogue-act-level precision and recall, here for the describe act and the assert-identified act. From this table we can observe that with automation at every level, the performance takes a hit: the numbers drop as we go from human transcripts with human segmentation, to automatic segmentation, to automatic labeling, and finally to ASR. But what we really want to see is how well the agent performs: is the agent performing equally well or not?
So in a previous study we used a simulation method to measure how well the agent performed. This offline method of evaluating the agent is called the eavesdropper evaluation, which we explained in detail in our 2015 paper, and I encourage you to look at it. It gave us a really good picture of how the agent actually performed, so we used that method to evaluate the agent's performance on target image identification. And we found there was no significant difference between the conditions.
Finally, the take-away messages: there are many metrics for measuring dialogue act segmentation, but measuring the final impact on the agent's performance is very important, and the individual module performances might give us a different picture than the pipeline performance does. And finally, dialogue act segmentation can facilitate building better and more complex dialogue policies; in future work we want to integrate these policies into the agent. Thank you.
So, that's a very good question. The question was that this domain is really specific, in terms of the utterances being of short duration, so does this really scale up to larger domains? The answer is that I don't know; maybe it could, because the framework is fairly general, in the sense that the features we use are not very tuned to this domain, but we should really explore and see how it performs in other domains. So for now I can't say. Next question.
[audience question] So the question was about the architecture for segmentation and labeling: why do we have a separate step for segmenting and a separate step for labeling, rather than doing them jointly? Researchers have looked at alternative architectures: they have tried the joint method of identifying the boundaries and the labels together, as well as doing it in two separate steps. We tried both, and when we measured the performance, the joint method was not working as well as this two-step method.
That's right. We have a long tail of dialogue acts, as we saw in the table: the dialogue act distribution is quite long-tailed, and the joint method would probably work better if we had more data to address this issue.
[audience question] That's a good question. The question was: could we look at the ASR n-best lists, and see how well the dialogue act labeling performs when using them? The answer is that we haven't, but we can certainly take a look at the n-best lists; that's definitely something worth exploring.