Right, good afternoon everyone. I'll be presenting our collaborative work between
Bielefeld University and the USC Institute for Creative Technologies.
The work is on incrementally understanding complex scenes.
The motivation for this work is to understand the scenes that we encounter in the world.
So let's take this scene, for example.
Let's imagine we are riding in a self-driving car, we arrive at the street, and we want to stop.
We see the scene as it is shown here, and then we decide: all right, I want my car to park
to the left of that, to the left side of the tree.
In order to do that, I give that instruction to my self-driving car, and the car has to
perform that action. For the car to perform that action, it is very important for
the system to understand what the referring phrase actually means,
for example what "left side of the tree" actually means. And for the car to act
in real time, it is very important that the processing happens incrementally, so that
the actions can be taken at the right point in time.
And if we support dialogue, that's all the better.
So the general research plan for this work
is that we want to understand scene descriptions with multiple objects,
and we want to perform reference resolution. The steps by which we perform
reference resolution are basically divided into three parts.
The first step is that we understand words:
we try to understand the meaning of "red", "green" and so on.
Then we try to understand the meanings of phrases.
And then we try to understand the complete scene description.
Those are the parts, the sub-steps we need in order to resolve references and understand
the scene descriptions.
To do that we follow a pipeline method: we have a segmentation method, and
then we have a segment type classifier, which is basically
trying to understand what type of individual segment the person is speaking about.
So the domain that we are taking a look at in this
work is called RDG-Pento, the pentomino variant of the Rapid Dialogue Game (RDG).
The game is a two-player, collaborative, time-constrained game.
Two people are assigned the roles of the director and the matcher.
The director sees eight images on his or her screen, and one of the images is highlighted
with a red border. Each scene has multiple objects in it, and the director
is trying to provide a description of this image so that the matcher can make a
guess, that is, can select the right image.
So let's take an example dialogue from this game. Here we have a director
who tries to describe this image using the description:
"this one is kind of a, uh, blue T and a wooden W, sort
of, and the T is kind of malformed".
This is the description that the person gives for the highlighted target
image, for the matcher to make a guess.
From that description the matcher understands which
image it is, makes a guess, and says "okay, got it".
If you look at that description, it consists of three segments.
In the first segment the person is trying to describe
this blue shape with the description "kind of a blue T".
The second part is where the person describes the next
object, in the following segment.
And finally, in the third part, he goes back
to the first object that he was trying to describe and describes it further.
And that was enough for the matcher to make the guess among all the images.
One thing to keep in mind is that the matcher does not see the images
in the same order, so they cannot really use
the position of the image in the grid to point out the correct image.
The rest of the talk is divided into these steps.
In the first part, which is the preprocessing step, I will explain how we go
about collecting the data and annotating the data.
Then I go into how we designed the system
using segmentation and labeling, and finally how it resolves the references using these.
And then finally we go to the evaluation and see how well the model works.
The first step is the dialogue data collection.
We used a crowdsourcing paradigm to collect data: the data was collected through
a crowdsourcing framework, Pair Me Up, that was presented last year.
There it was shown that we can collect data between
two people playing the game over crowdsourcing websites like Mechanical Turk,
and that we can record the speech in a real-time way and transcribe it.
So we do the audio collection, and the data collected was then transcribed.
And then finally, the third step is that
once we have the data, we just want to know how much data we collected.
For this domain we have data from forty-two pairs.
We have sixteen complete games, and the rest are
games that people just exited.
The next step is the data annotation. Once we have data collected
from these people, we want to annotate the data.
The motivation for the annotation comes from the domain:
we want to do reference resolution, and in reference resolution a person
is describing the scene, and the scene is described through object descriptions,
where each description refers to one of the objects in the scene.
So we want to annotate the individual objects within each image.
It is also possible that a person decides to speak about multiple objects
within an image.
People also speak about relations, which we also annotate, and basically everything
else is what we call "others".
For example, for this utterance: "this is, uh,
a green T upright, two red crosses to the left of a W and an E".
"Got it."
For this particular utterance we basically annotate in this way, using the scheme.
Here we have this particular utterance: "this is an L on top of the T, and then
Harry Potter sign next to the N".
We annotate the first part, which is "this is the L", and
mark it as "single" together with which object the person is referring to.
We get the labels for each of these objects using OpenCV.
We also mark the relations: for example, "next to" is marked
with one and zero, which basically defines the relationship
between these two numbered objects.
Once the data annotation is performed, the next step is that
we use a language processing pipeline, which includes three steps
that try to identify which image the person is describing.
The first step is the segmentation,
for which we use a linear-chain conditional random field.
Once we have the segmentation, we try to identify the type of each segment, for which we
use a support vector machine. And then we have the reference resolver, which is
basically trying to resolve which image the person is speaking about.
One thing to keep in mind is that we use transcribed speech at this
point in time and not ASR; that is for the future.
The segmenter module is basically trying to split the whole utterance into multiple segments.
The input to the segmenter is the word stream, one word at a time.
The segmenter is basically a linear-chain conditional random field that labels
each word with word boundary information.
In this example we have a phrase ending in "at the top left" at one
stage, and once we have the next word coming in,
the segmenter is basically rerun and the labels are updated.
The task of segmentation is performed again,
and we have a separate set of segments.
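The word-by-word behaviour described here can be sketched roughly as follows. This is a minimal illustration, not the system from the talk: `toy_tagger` is a hypothetical stand-in for the trained linear-chain CRF, and the "B"/"I" boundary tags and the cue words are assumptions made only for the example.

```python
from typing import Callable, Iterator, List

# Hypothetical stand-in for the trained linear-chain CRF. It returns one tag
# per word: "B" (word begins a new segment) or "I" (word continues a segment).
def toy_tagger(words: List[str]) -> List[str]:
    starters = {"and", "next", "on"}  # illustrative cue words only
    return ["B" if i == 0 or w in starters else "I"
            for i, w in enumerate(words)]

def tags_to_segments(words: List[str], tags: List[str]) -> List[List[str]]:
    """Group words into segments according to their boundary tags."""
    segments: List[List[str]] = []
    for word, tag in zip(words, tags):
        if tag == "B" or not segments:
            segments.append([word])
        else:
            segments[-1].append(word)
    return segments

def incremental_segmentation(
    word_stream: List[str],
    tagger: Callable[[List[str]], List[str]] = toy_tagger,
) -> Iterator[List[List[str]]]:
    """Yield a fresh segmentation hypothesis after every incoming word."""
    prefix: List[str] = []
    for word in word_stream:
        prefix.append(word)
        # Re-tag the whole prefix, so earlier boundaries may be revised.
        yield tags_to_segments(prefix, tagger(prefix))

words = "this is the l on top of the t".split()
hypotheses = list(incremental_segmentation(words))
print(hypotheses[-1])
```

Re-tagging the whole prefix on every word is the simplest way to let earlier decisions change when a new word arrives; a real system might restrict the re-tagging to a recent window.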
Once we have the segments, the next step is that we want to identify
what kind of segments they are.
The type of each segment is decided by the segment type classifier, and
for this we use a support vector machine.
In our annotation scheme we had four labels, if you remember:
single, multiple, relations and others. This is the step where we are
trying to identify, for each segment, what type of segment it is.
In the previous example, the phrase ending in "at the top"
was basically labeled as single. But once we see the word "of",
we have "to the top left of": the earlier words remain part of
the previous segment, and this one is relabeled as a relation.
Once we have the segment types assigned,
the next step is to perform the reference resolution.
The reference resolution happens in three separate steps.
The first step is understanding words:
we use the words-as-classifiers model, which I will describe in a bit, where we
try to understand the meaning of words using visual features.
Once we have understood the words, we try to understand the whole segments,
for which we use the scores from the words-as-classifiers model and
compose the meaning,
and we obtain the scores for each segment.
And then finally we try to understand the whole
referring expression, the whole scene description.
The words-as-classifiers model is basically a logistic regression model that is trying
to understand the meaning of every word
using visual features, so every word is represented by visual features.
The visual features are the ones which are extracted from the images, and
in this case we use RGB values, HSV values, the number of edges, orientations,
the x and y positions and so on,
for each one of the objects within an image.
And we get a probability distribution for each one of the words.
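A minimal sketch of scoring objects with a single word's classifier is given below. The sigmoid output, the normalisation over objects, the three-dimensional feature vectors and the weight values are all assumptions made for illustration; they are not the trained model or the feature set from the talk.

```python
import math
from typing import Dict, List

Features = List[float]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def word_fit(weights: Features, bias: float, obj: Features) -> float:
    """How well one object's visual features fit one word (value in 0..1)."""
    return sigmoid(sum(w * x for w, x in zip(weights, obj)) + bias)

def word_distribution(weights: Features, bias: float,
                      objects: Dict[str, Features]) -> Dict[str, float]:
    """Apply the word's classifier to every object and normalise the scores."""
    raw = {name: word_fit(weights, bias, feats)
           for name, feats in objects.items()}
    total = sum(raw.values())
    return {name: score / total for name, score in raw.items()}

# Hypothetical objects described only by normalised (r, g, b) values.
objects = {"o1": [0.1, 0.1, 0.9],   # a bluish object
           "o2": [0.9, 0.1, 0.1]}   # a reddish object
# Made-up weights for the word "blue": reward blue, penalise red.
blue = word_distribution([-2.0, 0.0, 4.0], -1.0, objects)
print(blue)
```

In practice each word would have its own learned weights, and the feature vector would hold all the visual features mentioned in the talk rather than colour alone.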
Once we have understood the words
using the words-as-classifiers model, the next step is to compose the
meaning and get the meaning for the whole segments.
The segment meaning is obtained by composing the meanings that we
obtain for every word.
In this example we just compose the word meanings by
multiplying the scores,
and we obtain the segment meaning for each one of the segments.
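The multiplication step can be sketched as follows. The per-word scores are made-up numbers, and renormalising the product into a distribution over objects is an assumption of this sketch rather than something stated in the talk.

```python
from typing import Dict, List

def compose_segment(word_dists: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine per-word object scores into one per-segment score per object."""
    objects = list(word_dists[0].keys())
    product = {o: 1.0 for o in objects}
    for dist in word_dists:
        for o in objects:
            product[o] *= dist[o]          # multiply the word scores
    total = sum(product.values())
    if total == 0.0:
        return product                      # nothing fits; leave scores at zero
    return {o: p / total for o, p in product.items()}

# Hypothetical scores for the two words of one segment over two objects.
facebook = {"o1": 0.6, "o2": 0.4}
f_shape = {"o1": 0.7, "o2": 0.3}
segment = compose_segment([facebook, f_shape])
print(max(segment, key=segment.get))
```

Multiplying means that an object has to fit every word in the segment reasonably well to end up with a high segment score.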
Let's take an example of how this works. Here, instead of eight
images, for simplicity's sake let's look at just two images and see how
the reference is resolved and how it is understood.
Here we have "Facebook F and brown W", which is the description given
for this target image.
We obtain the probability scores for "Facebook F" and "brown W" individually using
the words-as-classifiers model,
and with this method we try to understand what
"Facebook F" and "brown W" refer to.
For the segmentation, we obtain the segments using the linear-chain conditional random
field, which has already segmented the user's utterance into individual segments,
and it has labeled each segment as type single. So we try to
resolve the reference and understand the meaning.
For "Facebook F" the object is this one, and the W is this one,
and similarly for the other image.
So when the person said "Facebook F", once we have composed the scores we end
up with a distribution which looks like this,
and for "brown W"
it is this one which has the highest score.
So this is the meaning that we obtain for each one of the segments.
The next step is to understand the whole description at once.
In the previous step we understood the meaning of each segment,
"Facebook F" and "brown W", but now we are trying to understand the meaning of
the whole description at once.
The way we do it is that we use the scores that we
obtained from the previous segment-level step
and compose them using a simple summation, and the image which ends up with the
highest score is the target image that the person was trying to describe.
In the same example we have "Facebook F and brown W":
once we have composed the scores obtained for each
object, we end up with different scores for each image.
The matcher, which is now an agent with vision working as the matcher, now
has to understand which target image it should select.
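A rough sketch of the summation step is shown below. The talk only says that the segment scores are summed and the highest-scoring image wins; taking the best-fitting object per segment before summing, and using scores that are comparable across images, are assumptions of this sketch. All numbers are invented.

```python
from typing import Dict, List

# For each image: one dict per segment, mapping object id -> segment score.
ImageScores = List[Dict[str, float]]

def score_image(segment_scores: ImageScores) -> float:
    """Sum, over segments, the score of the best-fitting object in the image."""
    return sum(max(seg.values()) for seg in segment_scores)

def select_image(images: Dict[str, ImageScores]) -> str:
    """Return the id of the image with the highest summed score."""
    return max(images, key=lambda name: score_image(images[name]))

images = {
    # Image A contains one object fitting each of the two segments well.
    "A": [{"a1": 0.9, "a2": 0.2}, {"a1": 0.1, "a2": 0.8}],
    # Image B has no good match for either segment.
    "B": [{"b1": 0.3, "b2": 0.2}, {"b1": 0.4, "b2": 0.1}],
}
print(select_image(images))
```

Because the scores can be recomputed whenever a new segment hypothesis arrives, the same selection can be repeated after every word.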
Now this is how the pipeline works, and the next step is that we have
to understand how well the model performs.
In the evaluation step we need to know how well the segmenter
works and how well the segment type labeling works.
The segmenter works with an F-score of around eighty percent, which is basically
how well it got the boundaries right.
And finally, the segment type classifier is working reasonably well for our purposes.
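One common way to score boundary placement is an F-score over boundary positions, sketched below. The talk does not say exactly how its F-score was computed, so this is an illustrative assumption, and the toy segmentations are invented.

```python
from typing import List, Set

def boundary_positions(segments: List[List[str]]) -> Set[int]:
    """Word offsets at which one segment ends and the next one begins."""
    positions: Set[int] = set()
    offset = 0
    for seg in segments[:-1]:      # the end of the utterance is not counted
        offset += len(seg)
        positions.add(offset)
    return positions

def boundary_f_score(gold: List[List[str]],
                     predicted: List[List[str]]) -> float:
    """Harmonic mean of boundary precision and boundary recall."""
    g, p = boundary_positions(gold), boundary_positions(predicted)
    if not g and not p:
        return 1.0                 # both agree that there are no boundaries
    true_pos = len(g & p)
    precision = true_pos / len(p) if p else 0.0
    recall = true_pos / len(g) if g else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [["a", "b"], ["c"], ["d", "e"]]
predicted = [["a", "b"], ["c", "d", "e"]]
print(round(boundary_f_score(gold, predicted), 3))
```

Here the predicted segmentation finds one of the two gold boundaries and proposes no wrong ones, so precision is 1.0 and recall is 0.5.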
But our goal is that we want to know how well the target
image was selected, and
not really the segmentation or the segment type labeling.
So our main goal is to understand how well
our matcher agent can identify the target image.
Evaluating such a system is
kind of tricky.
We can use multiple ways of running cross-validation.
The first method is called hold-one-pair-out.
We basically say that one of the pairs that we have seen in
our corpus is unseen,
and we train on the rest of the pairs' data.
One advantage is that it gives us an understanding of
how well the system performs
on a new user who is unknown to the system.
The second method of evaluation is called hold-one-episode-out.
In this type of cross-validation,
one episode, that is, one target image description,
is held out, and we train on the rest of the pairs as well as
on the descriptions by the same person that the system has seen in the past.
This gives another advantage to our system:
it gives the classifier a chance to learn the idiosyncratic
word usage of that speaker,
so it is an additional advantage to train on that vocabulary.
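The two ways of splitting the corpus can be sketched as follows. The corpus layout and the function names are hypothetical. For the second split this sketch trains on every other description, including later ones from the same pair; restricting training to that pair's earlier descriptions would be a stricter reading of "seen in the past".

```python
from typing import Dict, Iterator, List, Tuple

Item = Tuple[str, int]             # (pair id, index of a description)
Fold = Tuple[List[Item], List[Item]]

def all_items(corpus: Dict[str, List[str]]) -> List[Item]:
    return [(pair, i) for pair, descs in corpus.items()
            for i in range(len(descs))]

def hold_one_pair_out(corpus: Dict[str, List[str]]) -> Iterator[Fold]:
    """Each fold tests on one whole pair that is absent from training."""
    items = all_items(corpus)
    for held in corpus:
        train = [it for it in items if it[0] != held]
        test = [it for it in items if it[0] == held]
        yield train, test

def hold_one_description_out(corpus: Dict[str, List[str]]) -> Iterator[Fold]:
    """Each fold tests on one description; the same pair's other
    descriptions stay in training, so speaker vocabulary can be learned."""
    items = all_items(corpus)
    for held in items:
        yield [it for it in items if it != held], [held]

# Toy corpus: two pairs with invented descriptions.
corpus = {"pair1": ["blue t", "wooden w"], "pair2": ["green l"]}
print(len(list(hold_one_pair_out(corpus))),
      len(list(hold_one_description_out(corpus))))
```

The first split measures performance on an unseen speaker, while the second lets the classifiers pick up that speaker's own vocabulary.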
Then there is the oracle kind of evaluation, where we assume the segmentation comes from humans,
and finally we use the automated pipeline, where everything is automated end to end.
Another thing is that we vary the number of objects per image in an image set, so
you can make the problem as simple as
a single object or two,
and then we add more and more objects within an
image and try to see how well the reference resolution performs.
The first step of evaluation is that you need a good baseline, and the baseline
here is
the naive Bayes approach that we have used in the past.
I refer you to the paper for the details.
We can see that it is actually working close to random
as the number of objects in the image increases.
Once we have the segmenter and the segment type labeling performed,
that is, once we split the utterance into individual segments and understand what kind
of segments the person is speaking about,
the performance increases. We can see that the target image identification accuracy
increases with segmentation, which is this row.
Without segmentation and labeling the performance was
much closer to random for a high number of objects.
The next thing that we want to see is how well
our segmenter performs compared to the oracle segmentation, which is basically the human segmentation and the
perfect labeling.
The difference is around two to six percent, as you can see
from this table, so the oracle gives some additional boost.
And then finally we want to see the entrainment effect: that is, once we move from the
hold-one-pair-out evaluation to the hold-one-episode-out evaluation,
that is, once we give the classifier a chance to learn
the word usage of this particular person, this particular describer,
the performance increases, and we get here
the best performance. But we can see that the performance is still not ideal,
and as we increase the number of objects within an image the performance drops.
So the take-home message from this work is that
the automated pipeline improves the scene understanding accuracy: once
we have the segmentation,
the task of target image identification improves.
Entrainment improves the performance of the classifier: that is,
if the classifier has a chance to learn from the words that
have been spoken by the same speaker,
its performance increases.
And with an increasing number of objects within an image, it is really hard for the classifier
to get the objects right, because the objects are
very much similar.
special thanks to better are usually and michael us you know and
thank you
We have time for questions.
So this work is with word transcripts.
I was wondering about how you would use the information from the speech signal,
for example the prosodic contour,
in addition to the words?
Exactly, that's a really interesting question.
So here the question was whether, if we use prosodic context instead of
just using the transcript, that would help performance.
Actually I would want to refer this to a separate talk,
where we actually use prosodic features to perform the segmentation.
I don't know for sure, but it possibly helps.
I think prosody can be really useful for the segmentation.
Any other questions?
Okay, so the question was about error analysis: was there any error analysis on
the numbers? Yes: we checked the significance,
looking at each target image identification, how the humans identify the
target and how each one of the methods performs.
We ran significance tests, and when we ran them we
found that a few things are more significant than others,
which is reported in the paper.
Okay, so the question was how the words-as-classifiers model is trained: is it
a separate training module, or is it trained along with the way we get
these numbers? Actually we have the pipeline evaluated as one module.
Instead of training the words-as-classifiers model separately, the
segmenter separately and the segment type labeler separately, and evaluating them separately,
we have
the whole pipeline working as
one particular module.
We obtain the numbers for the segmenter performance and the segment type labeling performance.
Then we have the processing part, which is word understanding, then the composition of
the meaning from the words-as-classifiers model, and then
combining the meanings obtained from the segments into the whole
scene description. All of these are performed in one go, which means
we train the whole pipeline on
N minus one pairs. If you imagine we have N pairs, in
the hold-one-pair-out evaluation we train the whole pipeline on
N minus one pairs, and the remaining one pair is used for
evaluation, and we repeat that for each pair.
So, the unseen phrases: that's a very good example, and that's the
reason why our performance was taking a hit when we don't use entrainment.
That is, if you don't give the classifier the chance to learn the meaning of a
word, for example when a pair uses an unusual description
for an object,
there is no way of learning the meaning of that if
we don't know what the token is.
That is why the performance gets hurt. But once we give the pipeline
a chance to learn the meaning of words like "Facebook" and all
these things, it performs much better.
we well we did do that so far so we have we have mentioned