Thanks for the introduction.
This is joint work with my student Ramesh Manuvinakurike and his collaborators at Adobe. Let me start with what we mean by image editing. Image editing means changing certain characteristics of an image, and that can be done with software tools such as Adobe Photoshop, Microsoft Photos, et cetera. Here we can see two examples.
In the first example, we add clouds to the sky. In the second example, we have the photograph of a flower and we make it black and white.
Image editing is a very hard task. It requires artistic creativity and patience, that is, a lot of experimentation and trial and error, which in turn makes it a very time-consuming task.
Also, users may not be fully aware of the functionality of a given image editing tool, and some image editing tools are very complex to use, so there is a steep learning curve.
Furthermore, users may not be sure about exactly what changes they want to perform on the image. Here's an example: "make this photo look nicer". This is pretty abstract: how do you make a photo look nicer than it currently looks? Or maybe the request is not abstract, and the desired change is clear, but users do not know the precise steps required. Here's another example: "remove the human from the field". This is pretty concrete, but it may require many editing steps.
So clearly there's a need here. Currently, there are web services and web forums where novice users post their images together with their requests, and expert editors perform the edits, either for free or for a fee. The expert editors and the novice users can exchange messages until the user is happy with the edit.
Typically in these forums the requests are formulated in an abstract manner using natural language. Here is an example: "This is a photo from our last holiday. Can someone please remove my ex from this photo? I'm the one on the right." So it's more likely to get something like this rather than detailed step-by-step instructions.
So the web forums are clearly very helpful, but they have a major drawback: users cannot request changes or provide feedback in real time, and neither side can ask clarification questions or provide suggestions while the editing is being performed. So a user would clearly benefit greatly from conversing in real time with an expert image editor.
So our ultimate goal is to build a dialogue system with such capabilities, and I'm going to play a video that shows what we mean by conversational image editing: a realistic speech interaction between a user and an expert image editor.
But before I play the video, let me step back and talk about incrementality in dialogue systems. Incremental dialogue systems means that user utterances start being processed word by word, before the user has uttered a complete utterance, and the system has to respond as soon as possible.
Now, conversational image editing is a domain particularly well suited for incremental processing, and the reason is that it requires a lot of fine-grained changes. As we will see in this video, users may update their requests rapidly and speak very fast, and the wizard has to process everything very fast and respond as soon as possible.
[video plays]
So we collected a corpus over Skype in a Wizard-of-Oz setting. The user would request edits, and the wizard would perform the edits. The wizard's screen was shared, and only the wizard could control the image editing tool. This was done deliberately, because we wanted to record all the user's input in spoken language format. There were no time constraints; participants could take as long as they wanted.
We did not explicitly tell the users whether they were interacting with a system or a human, but the conversation was very natural, so it was pretty obvious that they were talking to another human.
Here are some statistics of our corpus: we had 20 users and 129 dialogues. We can see that roughly the number of user utterances is double the number of wizard utterances: the users would talk a lot, and occasionally the wizards would provide suggestions and ask questions. We hope to release the corpus to the public in the near future.
So that's our corpus. Our next step is to annotate it with dialogue acts. We define an utterance as a portion of speech between silence intervals greater than 300 milliseconds. Utterances are segmented into utterance segments, and we assign a dialogue act to each utterance segment.
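As a minimal sketch of this pause-based segmentation rule, assuming word-level timestamps from the ASR (the token format here is a hypothetical assumption, not from the talk):

```python
# Sketch of pause-based utterance segmentation. Each ASR token is assumed
# to be a (word, start_time, end_time) tuple with times in seconds.

PAUSE_THRESHOLD = 0.3  # a silence interval greater than 300 ms ends an utterance

def segment_utterances(tokens):
    utterances, current = [], []
    prev_end = None
    for word, start, end in tokens:
        # A silence gap longer than the threshold closes the current utterance.
        if prev_end is not None and start - prev_end > PAUSE_THRESHOLD:
            utterances.append(current)
            current = []
        current.append(word)
        prev_end = end
    if current:
        utterances.append(current)
    return utterances

# segment_utterances([("increase", 0.0, 0.4), ("saturation", 0.5, 1.1),
#                     ("more", 1.6, 1.9)])
# -> [["increase", "saturation"], ["more"]]
```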
Here are the dialogue act labels from our corpus; for this study we are only interested in some of them. The dialogue act labels are: image edit requests, which can be new requests, updates to a previous request, requests to revert to a previous state of the image, or requests to compare the current state of the image with a previous state of the image; comments, which can be like or dislike comments, or image comments, that is, neutral comments, for example "it looks very striking"; yes and no responses; and "other", for anything that cannot be classified into any of the other labels.
These are the dialogue act labels that we were interested in for this study, the most frequent ones: the new image edit requests, the updates, the like comments, the yes responses, and "other".
Here are some examples. "Increase the saturation": this is a new image edit request. "A little bit more": this is an update. "That's not good enough": this is a dislike comment. "Change the saturation back to the original": this is an image edit request revert. "Great": this is a like comment. "Can you show me before and after?": this is an image edit request compare.
We measured inter-annotator agreement: we had two expert annotators annotate the same dialogue session of twenty minutes, and we calculated Cohen's kappa to be 0.81, which shows high agreement.
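As a side note, Cohen's kappa corrects observed agreement for chance agreement, kappa = (p_o - p_e) / (1 - p_e). A minimal sketch with scikit-learn, on made-up word-level labels:

```python
# Sketch: Cohen's kappa between two annotators' word-level dialogue act
# labels, using scikit-learn. The label sequences are illustrative only.
# IER-N = image edit request (new), IER-U = update, COM-L = like comment,
# O = other.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["IER-N", "IER-N", "IER-N", "IER-U", "COM-L", "O"]
annotator_b = ["IER-N", "IER-N", "IER-U", "IER-U", "COM-L", "O"]

print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```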
Note that when we measure agreement, we want to make sure that the annotators agree not only on the dialogue act labels but also on how they segment the utterances. Here's an example; the utterance is:
"crop the photo to remove it". The first annotator assumes that this is one segment and annotates it as an image edit request update, whereas the second annotator assumes that these are two segments, "crop the photo" and "to remove it", and annotates the first segment as an image edit request new and the second segment as an image edit request update.
So we go word by word and compare the annotations, both the segmentation and the dialogue act, and only when the annotators agree on everything does it count as an agreement. In this case we have two agreements out of six, which is 0.33.
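A minimal sketch of this word-level agreement check, under our reading that a word counts as an agreement only when both the dialogue act and its position relative to the segment boundaries match (the exact counting scheme and labels are our illustration):

```python
# Each annotation is a list of (segment_length_in_words, dialogue_act) pairs
# covering the utterance. BIO-style tags make segmentation differences
# break word-level agreement, matching the "agree on everything" criterion.

def to_word_tags(segments):
    tags = []
    for length, act in segments:
        tags.append(("B", act))                   # first word of a segment
        tags.extend([("I", act)] * (length - 1))  # words inside the segment
    return tags

def word_agreement(seg_a, seg_b):
    tags_a, tags_b = to_word_tags(seg_a), to_word_tags(seg_b)
    assert len(tags_a) == len(tags_b)
    return sum(a == b for a, b in zip(tags_a, tags_b)) / len(tags_a)

# "crop the photo to remove it" (6 words):
# annotator 1: one segment, image edit request update (IER-U);
# annotator 2: "crop the photo" (IER-N) + "to remove it" (IER-U).
print(word_agreement([(6, "IER-U")], [(3, "IER-N"), (3, "IER-U")]))  # ~0.33
```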
Now, if you only have the dialogue act label, this is not enough information to perform the image edit. For this reason,
we also have more complex annotations of actions and attributes. Here's an example: for the segment "make it brighter to like one hundred", the dialogue act is an image edit request new, the action is "adjust", the attribute is "brightness", the object is the tree, and the value is 100.
But for this study, we only use information from the dialogue act labels.
Now let's talk about the dialogue act detection models. We split our corpus into training and testing: we have 116 dialogues for training and 13 for testing. We compare neural-network-based models versus traditional classification models. For neural networks we have long short-term memory networks (LSTMs) and convolutional neural networks (CNNs); for more traditional models we have Naive Bayes, conditional random fields (CRFs), and random forests.
We also compare word embeddings trained on image-related corpora versus pre-trained out-of-the-box embeddings.
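To make this concrete, here is a minimal sketch of one such setup: a random forest over a single vector per utterance segment, with the sentence embedding approximated by averaging word vectors. The toy embeddings and training data are invented for illustration:

```python
# Sketch: random forest dialogue act classifier over averaged word vectors
# (a stand-in for the pre-trained or domain-trained embeddings compared here).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

DIM = 8
rng = np.random.default_rng(0)
vocab = ["increase", "saturation", "a", "little", "more", "great", "no"]
embeddings = {w: rng.normal(size=DIM) for w in vocab}  # toy word vectors

def sentence_vector(words):
    vecs = [embeddings[w] for w in words if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

segments = [["increase", "saturation"], ["a", "little", "more"], ["great"], ["no"]]
labels = ["IER-N", "IER-U", "COM-L", "RSP-N"]  # dialogue act per segment

X = np.stack([sentence_vector(s) for s in segments])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict([sentence_vector(["increase", "saturation", "more"])]))
```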
Here are the classification results. Note that we don't do anything incremental yet; here we have the full utterance as input.
We see that the conditional random fields and the LSTMs don't perform very well. These are both sequential models, and we hypothesize that this is because we didn't have enough data for them to capture larger context dependencies.
The random forests are doing well, and so are the CNNs. Using word embeddings trained on image-related corpora is better than using out-of-the-box embeddings, although the difference is not statistically significant. It also helps when we use sentence embeddings, that is, when we generate a vector for the whole sentence rather than one for each word. The CNN performs better than the simple random forest, and this difference is statistically significant; but when we use the random forest with sent2vec, it comes close to the CNN, and the difference is not statistically significant.
Here we can see an example of what happens as we get one more word from the user. On the x-axis we see the incoming words, and for each one of them the confidence score for each dialogue act changes. Initially, for the first word, the top-scoring dialogue act is not the right one, but after a certain word it is pretty clear: the top output is "image edit request new", which happens to be the ground-truth label for this example.
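A minimal sketch of this incremental scoring, reusing the hypothetical clf and sentence_vector from the sketch above: we re-classify every growing prefix of the user's words and watch the top dialogue act and its confidence change.

```python
# Sketch: score growing prefixes of the user's utterance with the trained
# classifier; the top label and confidence can change with every new word.
words = ["increase", "the", "saturation"]
for i in range(1, len(words) + 1):
    prefix = words[:i]
    probs = clf.predict_proba([sentence_vector(prefix)])[0]
    best = probs.argmax()
    print(prefix, "->", clf.classes_[best], round(float(probs[best]), 2))
```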
Now let's talk about incremental prediction. We measure word savings and correctness of prediction when we are getting words one by one from the user. In order to calculate how much we save, and whether we are right or not, as we get these words from the user we have to decide at which point to commit to a prediction.
Here we use a very simple model: we set a confidence threshold. As the words come in from the user, we have the prefixes "I", "I think", "I think that's", and so on, and for each prefix we have the confidence score that the classifier assigns to its top dialogue act.
Let's say that we set a confidence threshold of 0.2. This means that we should make a prediction as soon as the score becomes higher than 0.2, so we should make a prediction here; but the prediction is the "other" class, so this is a wrong prediction.
When we have wrong predictions, we assume that we don't save any words. Now let's say that the confidence threshold is 0.4. This means that we should make a prediction here, because 0.5 is larger than 0.4. Here the classifier predicts a like comment, and this is correct: we have a correct prediction, and we save one word, because we were right when the user had only said "I think that's".
So it helps when we increase the confidence threshold. The problem is when we increase the threshold too much. Let's say we set it to 0.5 or 0.6; then we never get to make a prediction at all, because here all the scores stay below 0.5 or 0.6.
As we will see next, the higher the confidence threshold, the fewer the samples that we use to make a prediction.
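A minimal sketch of this simple threshold policy and of the word-savings bookkeeping, with made-up per-word confidence scores for the "I think that's good" example:

```python
# Sketch: commit to a prediction the first time the top confidence exceeds
# the threshold; count saved words only when that early prediction is correct.

def run_episode(word_scores, predictions, gold_label, threshold):
    """word_scores[i] / predictions[i]: top confidence and top label after
    word i. Returns (committed_label, words_saved)."""
    n = len(word_scores)
    for i, (score, pred) in enumerate(zip(word_scores, predictions)):
        if score > threshold:
            return pred, (n - 1 - i) if pred == gold_label else 0
    # Threshold never exceeded: predict at the end, nothing is saved.
    return predictions[-1], 0

# "I think that's good" -> gold label COM-L (like comment); scores invented.
scores = [0.15, 0.25, 0.50, 0.80]
preds  = ["O", "O", "COM-L", "COM-L"]
print(run_episode(scores, preds, "COM-L", threshold=0.2))  # ('O', 0): wrong, too early
print(run_episode(scores, preds, "COM-L", threshold=0.4))  # ('COM-L', 1): right, saves one word
```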
Okay. In this graph, on the x-axis we have the confidence threshold, and on the y-axis we have percentages. The blue line shows the percentage of correct predictions, and the red line shows the percentage of word savings. The numbers that we see at each point show how many samples we have above a certain threshold: we can see that at a confidence threshold of 0.2 we have the most samples, but as the confidence threshold increases, we have fewer and fewer samples.
Basically this behaves like precision: as the confidence threshold increases, it looks like we're doing better, but the number of samples becomes lower and lower, so this is precision with a very low recall, which is not really very useful. Here is precision as a function of the confidence threshold, and here is recall: it seems that this is a good operating point, but it's not really a good point, because the recall is very low.
Now let's go back to the original graph; we see the same pattern for the percentage of word savings. At a confidence threshold of 0.5 we save the most, but we don't have many examples where the score becomes larger than a confidence threshold of 0.5, so it's the same pattern: as the confidence threshold becomes higher, we have fewer samples.
So the bottom line is that it is not clear where we should set the threshold, which means that just relying on a confidence threshold is not a good model; we certainly need something more sophisticated than that.
And here we can see what would happen if we had an oracle model that knew the best point at which to make the prediction. We can see the word savings for each of the dialogue acts and for all the data: overall we get 39% word savings, and the percentage of correct predictions is 74%. This is actually meaningful, because in this case the number of samples stays constant instead of shrinking with the confidence threshold.
To conclude: we introduced a new domain, conversational image editing. It's a real-world application which combines language and vision, and it is particularly well suited for incremental dialogue processing. We compared models for incremental intent identification; the CNNs and random forests outperformed the rest of the models, and word embeddings trained on image-related corpora outperformed out-of-the-box embeddings. We also calculated the impact of the confidence threshold, above which the classifier's prediction should be considered, on the classification accuracy and on word savings, which is a proxy for time savings. Our experiments provide evidence that incremental intent processing can save time and would be more efficient for real users.
As for future work: obviously, we need a better model for deciding when to commit to a prediction, because just relying on a confidence threshold is not enough. We also need to perform full natural language understanding, taking into account the actions and attributes, because the dialogue act alone does not carry enough information to perform the edit. And the ultimate goal is to build a dialogue system with all these capabilities; for this we're going to need not only natural language processing and dialogue processing, but also computer vision algorithms, because the system should be able to locate the part of the image that the user is referring to.
Thank you very much.

[Session chair]: Okay, thank you. We have time for questions.
[Audience question, off-mic.] Could you please repeat the question? So, what do we mean by an action? We mean the dialogue act label. When the user says something that we cannot categorize, then it's fine for it to be labeled as "other". Right.
Let me see. Actually, let's go back to the annotations, here. So you are asking about the "other" label: we have the labels, and we assign a label to each word, and you're asking whether, if the user says something here that we cannot classify, we mark it as "other". Right. Okay.
[Audience question, off-mic.] Well, we did look at existing dialogue act schemes, for example the recent one by Harry Bunt. Let me go to this slide. I talked about the dialogue acts that we used; there are other dialogue acts, like requests for recommendations, questions about features, questions about the image, about the image location, action directives, et cetera. Obviously some of these dialogue acts are domain-specific, so we looked into other annotation schemes, but we had to adapt those annotation schemes to our data.
[Audience question, off-mic.] The way we ran the convolutional neural networks? Exactly, so you're talking about this here. We feed the input to the convolutional neural network, and we get the class that has the highest probability.
[Audience question, off-mic.] Well, we have a percentage of the users in the training data, but I agree with you that it's a small corpus, and we may have had some effect, for example on the LSTMs and the CRFs, where we think we didn't have enough data. Maybe if we rearranged the split we would get slightly different results, but I think the patterns would still remain the same. Maybe the random forests are quite close to the CNN, so maybe we have an effect there. And of course the models are also very sensitive to the parameters of the neural networks; we had to do a lot of experimentation to set the hyperparameters, and we did that on the training data.
[Audience question, off-mic.] Well, the wizard needed to ask clarification questions or provide suggestions. Sometimes the users would ask, okay, how do I make it brighter, what's the type of feature, what's the button you actually use? So the users were asking questions and the wizard would answer, but also the wizard could ask clarification questions.
[Audience question, off-mic.] So, initially we have quite a low confidence threshold. I think that was in the presentation, but maybe I didn't explain it well. When we calculate the word savings, we only take into account correct predictions. In the beginning we have a low confidence threshold, and most likely our predictions are not right, so we don't really save much. Then, at around 0.5, we save a lot, because whatever predictions we make are correct. But after that, when we have a confidence threshold of 0.6 or 0.7, basically the user has already uttered the whole utterance, so we don't save anything.
And here, when I talked about the blue line, I said that these numbers in blue are the numbers of samples that we consider for the correctness of the predictions. Here are the numbers of samples that we considered for the word savings, and you can see that these are lower than the blue ones, because when we calculate the word savings we only consider the correct predictions. It's quite a complicated graph; I tried to make it as clear as possible.
[Audience question, off-mic.] It doesn't mean that the system has to respond immediately, as long as it keeps processing the user's input as it comes in. Ideally we should have a policy that tells the system when to wait and when there is enough information to perform the edit. In the video, as you saw, everything was happening very fast: the user was changing requests rapidly, and the wizard had to follow; everything happened fast.
But if the user says, I don't know, "tell me about the functionality of this tool", that doesn't mean that the system should jump right in and start talking before the user has finished. So we need a policy to tell us when it makes sense to process things very fast and when it makes sense to wait. We did have a paper last year, at SIGDIAL actually, in a different domain, where we had an incremental dialogue policy which would make decisions on when to wait or when to jump in and perform an action.
[Session chair]: Okay, one last question.

[Audience]: When you make correct predictions, you save time. When you don't make correct predictions... If I understand correctly, it's a tradeoff.
Well, first of all, we don't yet have a full system. If we make a wrong prediction, okay, the system will give a wrong answer. I'm not sure I understood you right. That's true, but this is just an analysis of what happens for each confidence threshold. It doesn't model, let's say, the interdependencies: the system jumping in, the user being happy with it until then, or actually starting to be unhappy because it jumped in at the wrong time. We don't model this at this point; it's just an analysis of the results.
And as I said, the bottom line is that we cannot just rely on a confidence threshold; we need something more sophisticated to decide whether to make a prediction or not. And it should take into account all kinds of things: the context; there should be rewards, so that if the user gives a like comment, it means we're doing well. So this is just an analysis of what happened.

[Session chair]: Okay, thank you very much.