Thanks for the introduction.
This is joint work with my student Ramesh Manuvinakurike and his collaborators at Adobe. Let me start with what we mean by image editing. Image editing means changing certain characteristics of an image, and that can be done with software tools such as Adobe Photoshop, Microsoft Photos, et cetera. Here we can see two examples.
In the first example, we add clouds to the sky. In the second example, we have the photograph of a flower and we make it black and white.
Image editing is a very hard task. It requires artistic creativity and patience, that is, a lot of experimentation and trial and error, which in turn makes it a very time-consuming task.
Also, users may not be fully aware of the functionality of a given image editing tool, and some image editing tools are very complex to use, so there is a steep learning curve.
Furthermore, users may not be sure about exactly what changes they want to perform on the image. Here's an example: "make this photo look nicer". This is pretty abstract: how do you make a photo look nicer than it currently looks? Or maybe the request is not abstract, and the desired change is clear, but users do not know the precise steps required. Here's another example: "remove the human from the field". This is pretty concrete, but it may require many editing steps.
So clearly there's a need here. Currently, there are web services and web forums where novice users post their images together with their requests, and expert editors perform the edits, either for free or for a fee. The expert editors and the novice users can exchange messages until the user is happy with the edit.
Typically in these forums the requests are formulated in an abstract manner using natural language. Here is an example: "This is a photo from our last holiday. Can someone please remove my ex from this photo? I'm the one on the right." So it's more likely to get something like this rather than detailed step-by-step instructions.
So the web forums are clearly very helpful, but they have a major drawback: users cannot request changes or provide feedback in real time, and neither side can ask clarification questions or provide suggestions while the editing is being performed. So a user would clearly benefit greatly from conversing in real time with an expert image editor.
So our ultimate goal is to build a dialogue system with such capabilities, and I'm going to play a video that shows what we mean by conversational image editing: a realistic speech interaction between a user and an expert image editor.
But before I play the video, let me step back and talk about incrementality in dialogue systems. Incremental dialogue systems means that user utterances start being processed word by word, before the user has uttered a complete utterance, and the system has to respond as soon as possible.
Now, conversational image editing is a domain particularly well suited for incremental processing, and the reason is that it requires a lot of fine-grained changes. As we will see in this video, users may update their requests rapidly and speak very fast, and the wizard has to process everything very fast and respond as soon as possible.
[video plays]
So we collected a corpus over Skype in a Wizard-of-Oz setting. The user would request edits, and the wizard would perform the edits. The wizard's screen was shared, and only the wizard could control the image editing tool. This was done deliberately, because we wanted to record all the user's input in spoken language format. There were no time constraints; participants could take as long as they wanted.
We did not explicitly tell the users whether they were interacting with a system or a human, but the conversation was very natural, so it was pretty obvious that they were talking to another human.
Here are some statistics of our corpus: we had 20 users and 129 dialogues. We can see that roughly the number of user utterances is double the number of wizard utterances: the users would talk a lot, and occasionally the wizards would provide suggestions and ask questions. We hope to release the corpus to the public in the near future.
So that's our corpus. Our next step is to annotate it with dialogue acts. We define an utterance as a portion of speech between silence intervals greater than 300 milliseconds. Utterances are segmented into utterance segments, and we assign a dialogue act to each utterance segment.
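As a minimal sketch of this pause-based segmentation rule, assuming word-level timestamps from the ASR (the token format here is a hypothetical assumption, not from the talk):

```python
# Sketch of pause-based utterance segmentation. Each ASR token is assumed
# to be a (word, start_time, end_time) tuple with times in seconds.

PAUSE_THRESHOLD = 0.3  # a silence interval greater than 300 ms ends an utterance

def segment_utterances(tokens):
    utterances, current = [], []
    prev_end = None
    for word, start, end in tokens:
        # A silence gap longer than the threshold closes the current utterance.
        if prev_end is not None and start - prev_end > PAUSE_THRESHOLD:
            utterances.append(current)
            current = []
        current.append(word)
        prev_end = end
    if current:
        utterances.append(current)
    return utterances

# segment_utterances([("increase", 0.0, 0.4), ("saturation", 0.5, 1.1),
#                     ("more", 1.6, 1.9)])
# -> [["increase", "saturation"], ["more"]]
```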
Here are the dialogue act labels from our corpus; for this study we are only interested in some of them. The dialogue act labels are: image edit requests, which can be new requests, updates to a previous request, requests to revert to a previous state of the image, or requests to compare the current state of the image with a previous state of the image; comments, which can be like or dislike comments, or image comments, that is, neutral comments, for example "it looks very striking"; yes and no responses; and "other", for anything that cannot be classified into any of the other labels.
These are the dialogue act labels that we were interested in for this study, the most frequent ones: the new image edit requests, the updates, the like comments, the yes responses, and "other".
Here are some examples. "Increase the saturation": this is a new image edit request. "A little bit more": this is an update. "That's not good enough": this is a dislike comment. "Change the saturation back to the original": this is an image edit request revert. "Great": this is a like comment. "Can you show me before and after?": this is an image edit request compare.
We measured inter-annotator agreement: we had two expert annotators annotate the same dialogue session of twenty minutes, and we calculated Cohen's kappa to be 0.81, which shows high agreement.
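As a side note, Cohen's kappa corrects observed agreement for chance agreement, kappa = (p_o - p_e) / (1 - p_e). A minimal sketch with scikit-learn, on made-up word-level labels:

```python
# Sketch: Cohen's kappa between two annotators' word-level dialogue act
# labels, using scikit-learn. The label sequences are illustrative only.
# IER-N = image edit request (new), IER-U = update, COM-L = like comment,
# O = other.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["IER-N", "IER-N", "IER-N", "IER-U", "COM-L", "O"]
annotator_b = ["IER-N", "IER-N", "IER-U", "IER-U", "COM-L", "O"]

print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```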
Note that when we measure agreement, we want to make sure that the annotators agree not only on the dialogue act labels but also on how they segment the utterances. Here's an example; the utterance is:
"crop the photo to remove it". The first annotator assumes that this is one segment and annotates it as an image edit request update, whereas the second annotator assumes that these are two segments, "crop the photo" and "to remove it", and annotates the first segment as an image edit request new and the second segment as an image edit request update.
So we go word by word and compare the annotations, both the segmentation and the dialogue act, and only when the annotators agree on everything does it count as an agreement. In this case we have two agreements out of six, which is 0.33.
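A minimal sketch of this word-level agreement check, under our reading that a word counts as an agreement only when both the dialogue act and its position relative to the segment boundaries match (the exact counting scheme and labels are our illustration):

```python
# Each annotation is a list of (segment_length_in_words, dialogue_act) pairs
# covering the utterance. BIO-style tags make segmentation differences
# break word-level agreement, matching the "agree on everything" criterion.

def to_word_tags(segments):
    tags = []
    for length, act in segments:
        tags.append(("B", act))                   # first word of a segment
        tags.extend([("I", act)] * (length - 1))  # words inside the segment
    return tags

def word_agreement(seg_a, seg_b):
    tags_a, tags_b = to_word_tags(seg_a), to_word_tags(seg_b)
    assert len(tags_a) == len(tags_b)
    return sum(a == b for a, b in zip(tags_a, tags_b)) / len(tags_a)

# "crop the photo to remove it" (6 words):
# annotator 1: one segment, image edit request update (IER-U);
# annotator 2: "crop the photo" (IER-N) + "to remove it" (IER-U).
print(word_agreement([(6, "IER-U")], [(3, "IER-N"), (3, "IER-U")]))  # ~0.33
```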
Now, if you only have the dialogue act label, this is not enough information to perform the image edit. For this reason,
we also have more complex annotations of actions and attributes. Here's an example: for the segment "make it brighter to like one hundred", the dialogue act is an image edit request new, the action is "adjust", the attribute is "brightness", the object is the tree, and the value is 100.
But for this study, we only use information from the dialogue act labels.
Now let's talk about the dialogue act detection models. We split our corpus into training and testing: we have 116 dialogues for training and 13 for testing. We compare neural-network-based models versus traditional classification models. For neural networks we have long short-term memory networks (LSTMs) and convolutional neural networks (CNNs); for more traditional models we have Naive Bayes, conditional random fields (CRFs), and random forests.
We also compare word embeddings trained on image-related corpora versus pre-trained out-of-the-box embeddings.
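To make this concrete, here is a minimal sketch of one such setup: a random forest over a single vector per utterance segment, with the sentence embedding approximated by averaging word vectors. The toy embeddings and training data are invented for illustration:

```python
# Sketch: random forest dialogue act classifier over averaged word vectors
# (a stand-in for the pre-trained or domain-trained embeddings compared here).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

DIM = 8
rng = np.random.default_rng(0)
vocab = ["increase", "saturation", "a", "little", "more", "great", "no"]
embeddings = {w: rng.normal(size=DIM) for w in vocab}  # toy word vectors

def sentence_vector(words):
    vecs = [embeddings[w] for w in words if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

segments = [["increase", "saturation"], ["a", "little", "more"], ["great"], ["no"]]
labels = ["IER-N", "IER-U", "COM-L", "RSP-N"]  # dialogue act per segment

X = np.stack([sentence_vector(s) for s in segments])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict([sentence_vector(["increase", "saturation", "more"])]))
```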
Here are the classification results. Note that we don't do anything incremental yet; here we have the full utterance as input.
We see that the conditional random fields and the LSTMs don't perform very well. These are both sequential models, and we hypothesize that this is because we didn't have enough data for them to capture larger context dependencies.
The random forests are doing well, and so are the CNNs. Using word embeddings trained on image-related corpora is better than using out-of-the-box embeddings, although the difference is not statistically significant. It also helps when we use sentence embeddings, that is, when we generate a vector for the whole sentence rather than one for each word. The CNN performs better than the simple random forest, and this difference is statistically significant; but when we use the random forest with sent2vec, it comes close to the CNN, and the difference is not statistically significant.
Here we can see an example of what happens as we get one more word from the user. On the x-axis we see the incoming words, and for each one of them the confidence score for each dialogue act changes. Initially, for the first word, the top-scoring dialogue act is not the right one, but after a certain word it is pretty clear: the top output is "image edit request new", which happens to be the ground-truth label for this example.
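A minimal sketch of this incremental scoring, reusing the hypothetical clf and sentence_vector from the sketch above: we re-classify every growing prefix of the user's words and watch the top dialogue act and its confidence change.

```python
# Sketch: score growing prefixes of the user's utterance with the trained
# classifier; the top label and confidence can change with every new word.
words = ["increase", "the", "saturation"]
for i in range(1, len(words) + 1):
    prefix = words[:i]
    probs = clf.predict_proba([sentence_vector(prefix)])[0]
    best = probs.argmax()
    print(prefix, "->", clf.classes_[best], round(float(probs[best]), 2))
```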
Now let's talk about incremental prediction. We measure word savings and correctness of prediction when we are getting words one by one from the user. In order to calculate how much we save, and whether we are right or not, as we get these words from the user we have to decide at which point to commit to a prediction.
Here we use a very simple model: we set a confidence threshold. As the words come in from the user, we have the prefixes "I", "I think", "I think that's", and so on, and for each prefix we have the confidence score that the classifier assigns to its top dialogue act.
Let's say that we set a confidence threshold of 0.2. This means that we should make a prediction as soon as the score becomes higher than 0.2, so we should make a prediction here; but the prediction is the "other" class, so this is a wrong prediction.
When we have wrong predictions, we assume that we don't save any words. Now let's say that the confidence threshold is 0.4. This means that we should make a prediction here, because 0.5 is larger than 0.4. Here the classifier predicts a like comment, and this is correct: we have a correct prediction, and we save one word, because we were right when the user had only said "I think that's".
So it helps when we increase the confidence threshold. The problem is when we increase the threshold too much. Let's say we set it to 0.5 or 0.6; then we never get to make a prediction at all, because here all the scores stay below 0.5 or 0.6.
As we will see next, the higher the confidence threshold, the fewer the samples that we use to make a prediction.
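A minimal sketch of this simple threshold policy and of the word-savings bookkeeping, with made-up per-word confidence scores for the "I think that's good" example:

```python
# Sketch: commit to a prediction the first time the top confidence exceeds
# the threshold; count saved words only when that early prediction is correct.

def run_episode(word_scores, predictions, gold_label, threshold):
    """word_scores[i] / predictions[i]: top confidence and top label after
    word i. Returns (committed_label, words_saved)."""
    n = len(word_scores)
    for i, (score, pred) in enumerate(zip(word_scores, predictions)):
        if score > threshold:
            return pred, (n - 1 - i) if pred == gold_label else 0
    # Threshold never exceeded: predict at the end, nothing is saved.
    return predictions[-1], 0

# "I think that's good" -> gold label COM-L (like comment); scores invented.
scores = [0.15, 0.25, 0.50, 0.80]
preds  = ["O", "O", "COM-L", "COM-L"]
print(run_episode(scores, preds, "COM-L", threshold=0.2))  # ('O', 0): wrong, too early
print(run_episode(scores, preds, "COM-L", threshold=0.4))  # ('COM-L', 1): right, saves one word
```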
Okay. In this graph, on the x-axis we have the confidence threshold, and on the y-axis we have percentages. The blue line shows the percentage of correct predictions, and the red line shows the percentage of word savings. The numbers that we see at each point show how many samples we have above a certain threshold: we can see that at a confidence threshold of 0.2 we have the most samples, but as the confidence threshold increases, we have fewer and fewer samples.
Basically this behaves like precision: as the confidence threshold increases, it looks like we're doing better, but the number of samples becomes lower and lower, so this is precision with a very low recall, which is not really very useful. Here is precision as a function of the confidence threshold, and here is recall: it seems that this is a good operating point, but it's not really a good point, because the recall is very low.
Now let's go back to the original graph; we see the same pattern for the percentage of word savings. At a confidence threshold of 0.5 we save the most, but we don't have many examples where the score becomes larger than a confidence threshold of 0.5, so it's the same pattern: as the confidence threshold becomes higher, we have fewer samples.
So the bottom line is that it is not clear where we should set the threshold, which means that just relying on a confidence threshold is not a good model; we certainly need something more sophisticated than that.
And here we can see what would happen if we had an oracle model that knew the best point at which to make the prediction. We can see the word savings for each of the dialogue acts and for all the data: overall we get 39% word savings, and the percentage of correct predictions is 74%. This is actually meaningful, because in this case the number of samples stays constant instead of shrinking with the confidence threshold.
To conclude: we introduced a new domain, conversational image editing. It's a real-world application which combines language and vision, and it is particularly well suited for incremental dialogue processing. We compared models for incremental intent identification; the CNNs and random forests outperformed the rest of the models, and word embeddings trained on image-related corpora outperformed out-of-the-box embeddings. We also calculated the impact of the confidence threshold, above which the classifier's prediction should be considered, on the classification accuracy and on word savings, which is a proxy for time savings. Our experiments provide evidence that incremental intent processing can save time and would be more efficient for real users.
As for future work: obviously, we need a better model for deciding when to commit to a prediction, because just relying on a confidence threshold is not enough. We also need to perform full natural language understanding, taking into account the actions and attributes, because the dialogue act alone does not carry enough information to perform the edit. And the ultimate goal is to build a dialogue system with all these capabilities; for this we're going to need not only natural language processing and dialogue processing, but also computer vision algorithms, because the system should be able to locate the part of the image that the user is referring to.
Thank you very much.

[Session chair]: Okay, thank you. We have time for questions.
[Audience question, off-mic.] Could you please repeat the question? So, what do we mean by an action? We mean the dialogue act label. When the user says something that we cannot categorize, then it's fine for it to be labeled as "other". Right.
Let me see. Actually, let's go back to the annotations, here. So you are asking about the "other" label: we have the labels, and we assign a label to each word, and you're asking whether, if the user says something here that we cannot classify, we mark it as "other". Right. Okay.
[Audience question, off-mic.] Well, we did look at existing dialogue act schemes, for example the recent one by Harry Bunt. Let me go to this slide. I talked about the dialogue acts that we used; there are other dialogue acts, like requests for recommendations, questions about features, questions about the image, about the image location, action directives, et cetera. Obviously some of these dialogue acts are domain-specific, so we looked into other annotation schemes, but we had to adapt those annotation schemes to our data.
[Audience question, off-mic.] The way we ran the convolutional neural networks? Exactly, so you're talking about this here. We feed the input to the convolutional neural network, and we get the class that has the highest probability.
[Audience question, off-mic.] Well, we have a percentage of the users in the training data, but I agree with you that it's a small corpus, and we may have had some effect, for example on the LSTMs and the CRFs, where we think we didn't have enough data. Maybe if we rearranged the split we would get slightly different results, but I think the patterns would still remain the same. Maybe the random forests are quite close to the CNN, so maybe we have an effect there. And of course the models are also very sensitive to the parameters of the neural networks; we had to do a lot of experimentation to set the hyperparameters, and we did that on the training data.
[Audience question, off-mic.] Well, the wizard needed to ask clarification questions or provide suggestions. Sometimes the users would ask, okay, how do I make it brighter, what's the type of feature, what's the button you actually use? So the users were asking questions and the wizard would answer, but also the wizard could ask clarification questions.
[Audience question, off-mic.] So, initially we have quite a low confidence threshold. I think that was in the presentation, but maybe I didn't explain it well. When we calculate the word savings, we only take into account correct predictions. In the beginning we have a low confidence threshold, and most likely our predictions are not right, so we don't really save much. Then, at around 0.5, we save a lot, because whatever predictions we make are correct. But after that, when we have a confidence threshold of 0.6 or 0.7, basically the user has already uttered the whole utterance, so we don't save anything.
And here, when I talked about the blue line, I said that these numbers in blue are the numbers of samples that we consider for the correctness of the predictions. Here are the numbers of samples that we considered for the word savings, and you can see that these are lower than the blue ones, because when we calculate the word savings we only consider the correct predictions. It's quite a complicated graph; I tried to make it as clear as possible.
[Audience question, off-mic.] It doesn't mean that the system has to respond immediately, as long as it keeps processing the user's input as it comes in. Ideally we should have a policy that tells the system when to wait and when there is enough information to perform the edit. In the video, as you saw, everything was happening very fast: the user was changing requests rapidly, and the wizard had to follow; everything happened fast.
But if the user says, I don't know, "tell me about the functionality of this tool", that doesn't mean that the system should jump right in and start talking before the user has finished. So we need a policy to tell us when it makes sense to process things very fast and when it makes sense to wait. We did have a paper last year, at SIGDIAL actually, in a different domain, where we had an incremental dialogue policy which would make decisions on when to wait or when to jump in and perform an action.
[Session chair]: Okay, one last question.

[Audience]: When you make correct predictions, you save time. When you don't make correct predictions... If I understand correctly, it's a tradeoff.
Well, first of all, we don't yet have a full system. If we make a wrong prediction, okay, the system will give a wrong answer. I'm not sure I understood you right. That's true, but this is just an analysis of what happens for each confidence threshold. It doesn't model, let's say, the interdependencies: the system jumping in, the user being happy with it until then, or actually starting to be unhappy because it jumped in at the wrong time. We don't model this at this point; it's just an analysis of the results.
And as I said, the bottom line is that we cannot just rely on a confidence threshold; we need something more sophisticated to decide whether to make a prediction or not. And it should take into account all kinds of things: the context; there should be rewards, so that if the user gives a like comment, it means we're doing well. So this is just an analysis of what happened.

[Session chair]: Okay, thank you very much.