0:00:15 Right, good afternoon everyone. I'll be presenting our collaborative work between Bielefeld University and the USC Institute for Creative Technologies.
0:00:27 The work is about incrementally understanding complex scenes.
0:00:37 So the motivation for this work is to understand the scenes that we encounter in the world.
0:00:45 Let's take this scene, for example. Imagine we are riding in a self-driving car, we arrive at this street, and we want to stop somewhere.
0:01:01 We see the scene as it is shown here, and then we decide: alright, I want my car to park to the left of that tree, on the left side of the street.
0:01:17 So I give that instruction to my self-driving car, and the car has to perform that action. For the car to perform that action, it is very important for the system to understand what "to the left of that tree" actually means and what "left side of the street" actually means. And for the car to act in real time, it is very important that the processing happens incrementally, so that actions can be taken at the right point in time.
0:01:45 And if we support dialogue, that's all the better.
0:01:49 So the general research plan for this work is that we want to understand scene descriptions with multiple objects, and we want to perform reference resolution. The reference resolution is basically divided into three parts.
0:02:10 The first step is that we understand words: we try to understand the meanings of words like "red", "green" and so on. Then we try to understand the meanings of phrases, for example a whole referring phrase for one object, and then we try to understand the complete scene description.
0:02:28 Those are the parts; that is the set of steps we take to resolve references and understand the scene descriptions.
0:02:34 So we follow a pipeline method: we have a segmentation step, and then we have a segment type classifier, which is basically trying to identify what type of thing the person is speaking about in each individual segment.
0:02:51 So the domain that we are looking at in this work is called the RDG-Image domain, the rapid dialogue game, in its pentomino variant. The game is a two-player, collaborative, time-constrained game where two people are assigned the roles of director and matcher.
0:03:19 The director sees eight images on his or her screen, with one of the images highlighted by a red border. Each scene has multiple objects in it, and the director provides descriptions of the highlighted image so that the matcher can make a guess and select the right image.
0:03:40 So let's take an example dialogue from this game. Here we have a director who tries to describe this image with the description: "this one is kind of a purple or blue P, and a W sort of, and the T is kind of malformed".
0:04:00 So this is the description that the person is giving for the highlighted target image, for the matcher to make a guess. The matcher, from that description, works out which image it is, makes a guess, and says "got it".
0:04:16 If you look at that description, it consists of three individual segments. In the first segment the person is trying to describe this blue shape, with a description along the lines of "kind of a blue P". Then in the second part the person describes the next object, with the following segment. And finally, in the third part, he goes back to the first object that he was trying to describe and describes it again, and that is enough for the matcher to make the guess and pick the right image.
0:04:58 One thing to keep in mind is that the matcher does not see the images in the same order, so they cannot really use the position of an image in the grid to pick the correct image.
0:05:08 So the rest of the talk is divided into these steps. In the first part, which is the preprocessing step, I will describe how we go about collecting the data and annotating the data. Then I go into the steps of designing the system, how we designed it using the segmentation and labeling, and finally how it resolves the references using this pipeline.
0:05:33 And then finally we go to the evaluation and see how well the model works.
0:05:38 So the first step is the dialogue data collection. We used a crowdsourcing paradigm to collect the data: the data is collected through the crowdsourcing framework Pair Me Up, which was presented last year.
0:05:53 It has been shown that we can collect data between two people playing the game on crowdsourcing websites like Mechanical Turk, and we can relay the speech in a real-time way and transcribe the audio.
0:06:12 So we do the audio collection, and the data collected was then transcribed; here it was manually transcribed.
0:06:21 And then, once we have the data, just to give a sense of how much we collected: for this domain we have data from forty-two pairs, of which sixteen are complete games, and the rest are games where people quit early.
0:06:42 So the next step is the data annotation. Once we have the data collected from these people, we want to annotate it. The motivation for the data annotation basically comes from the domain.
0:06:55 We want to do reference resolution, and here a person is basically describing the scene, the scene is described through object descriptions, and these objects are the objects that are in the scene. So we want to annotate the individual objects within each image. It is also possible that a person decides to speak about multiple objects within an image, and a person may also speak about relations, which we also annotate, and basically everything else is what we call "other".
0:07:33 For example, for this utterance, "this is an F, a green T upright, two red crosses to the left of the W, and an E", "got it", we basically annotate it in this way.
0:07:51 Here is how it is actually annotated using the scheme. We had this particular utterance: "this is an L on top of the T, and then a Harry Potter sign next to the N". So we annotate the first part, "this is an L": we mark it as single and mark which object the person is referring to. We get the labels for each one of these objects using OpenCV. And we also mark the relations: for example, "next to" is marked together with the two objects it connects, which basically defines the relationship between those two objects.
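As a rough illustration of the annotation scheme just described, one way a single annotated utterance could be represented is sketched below; the field names and label strings are assumptions for illustration, not the actual annotation format used in this work.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SegmentAnnotation:
        words: List[str]                 # the words that make up this segment
        label: str                       # "single", "multiple", "relation" or "other"
        object_ids: List[int] = field(default_factory=list)  # annotated object(s) in the image
        relation: str = ""               # e.g. "next to", only filled for relation segments

    utterance = [
        SegmentAnnotation(["this", "is", "an", "l"], "single", object_ids=[0]),
        SegmentAnnotation(["on", "top", "of", "the", "t"], "relation",
                          object_ids=[0, 1], relation="on top of"),
    ]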
0:08:28 So once we have the data annotation performed, the next step is the language processing pipeline, which includes three steps. It tries to work out which target image the person is describing.
0:08:46 The first step is the segmentation, for which we use a linear-chain conditional random field. Once we have the segmentation, we try to identify the type of the segments, for which we use a support vector machine. And then we have the reference resolver, which is basically trying to resolve which image the person is speaking about.
0:09:07 One thing to keep in mind is that we use transcribed speech at this point in time, not ASR; that is left for the future.
0:09:16 The segmenter module is basically trying to split the whole utterance into multiple segments.
0:09:23 The input to the segmenter is the word stream, one word at a time, and the segmenter is a linear-chain conditional random field which tags each word with word-boundary information.
0:09:36 So here in this example we have "a blue L at the top" at one stage, and once we have the next word, "left", coming in, the segmenter is rerun and the labels are updated: the task of segmentation is performed again, and we obtain an updated set of segments.
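A minimal sketch of what such an incremental boundary tagger could look like, using the third-party sklearn-crfsuite package; the feature set, the "B"/"I" boundary tags and the toy data below are assumptions for illustration, not the features used in the actual system.

    import sklearn_crfsuite

    def feats(words, i):
        # Very simple per-word features; the real segmenter likely uses richer ones.
        return {"word": words[i].lower(),
                "prev": words[i - 1].lower() if i > 0 else "<s>"}

    # Toy training data: word sequences with boundary tags
    # ("B" = a new segment starts at this word, "I" = the current segment continues).
    utts = [["a", "blue", "l", "at", "the", "top", "left", "of", "the", "t"],
            ["a", "brown", "w", "next", "to", "it"]]
    tags = [["B", "I", "I", "B", "I", "I", "I", "I", "I", "I"],
            ["B", "I", "I", "B", "I", "I"]]

    X = [[feats(u, i) for i in range(len(u))] for u in utts]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, tags)

    # Incremental use: re-tag the growing prefix each time a new word arrives.
    prefix = ["a", "blue", "l", "at", "the", "top"]
    print(crf.predict_single([feats(prefix, i) for i in range(len(prefix))]))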
0:10:00 So once we have the segments, the next step is that we want to identify what kind of segments these are. The type of a segment is decided by the segment type classifier, and for this we use a support vector machine.
0:10:17 In our annotation scheme we had four labels, if you remember: single, multiple, relation and other. So this is the step where we try to identify, for each segment, what type of segment it is.
0:10:32 In the previous example, once we had "a blue L at the top", that was basically labeled as single; but once we see the word "of", we have "at the top left of", with "a blue L" belonging to the previous segment, and this new segment is labeled as a relation segment.
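A minimal sketch of such a segment type classifier, here with simple bag-of-words features and a linear SVM from scikit-learn; the feature choice and the toy training data are assumptions for illustration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy segments with the four labels from the annotation scheme.
    segments = ["a blue l at the top", "at the top left of",
                "two red crosses", "uh let me see"]
    labels = ["single", "relation", "multiple", "other"]

    clf = make_pipeline(CountVectorizer(), LinearSVC())
    clf.fit(segments, labels)
    print(clf.predict(["to the left of the w"]))  # likely 'relation' given the toy data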
0:10:53 So once we have the segment types assigned, the next step is to perform the reference resolution.
0:11:02 The reference resolution happens in three separate steps. The first step is understanding words: we use the words-as-classifiers model, which I will describe in a bit, where we try to understand the meaning of words using visual features.
0:11:19 Once we have understood the words, we try to understand the whole segments, for which we use the scores from the words-as-classifiers model and compose the meanings, obtaining a score for each segment. And then finally we try to understand the whole referring expression, the whole scene description.
0:11:40 So the words-as-classifiers model is basically a logistic regression model that tries to understand the meaning of every word using visual features. Every word is represented through the visual features, which are extracted from the images; in this case we use RGB values, HSV values, edges, orientations, and also the x and y positions.
0:12:03 For each one of the objects within an image, these features are extracted automatically, and we obtain a probability distribution over the objects for each one of the words.
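A minimal sketch of the words-as-classifiers idea: one classifier per word, trained on visual features of objects and applied to every candidate object. The feature vector (a few colour channels plus position) and all numbers below are invented for illustration; the actual system uses a richer feature set.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each row describes one object: [r, g, b, x, y]; label 1 means the word applied to it.
    X_blue = np.array([[0.1, 0.2, 0.9, 0.3, 0.4],
                       [0.2, 0.1, 0.8, 0.7, 0.2],
                       [0.9, 0.1, 0.1, 0.5, 0.5],
                       [0.8, 0.7, 0.2, 0.2, 0.8]])
    y_blue = np.array([1, 1, 0, 0])

    word_classifiers = {"blue": LogisticRegression().fit(X_blue, y_blue)}

    def word_scores(word, objects):
        # Probability that `word` applies to each candidate object.
        return word_classifiers[word].predict_proba(objects)[:, 1]

    candidates = np.array([[0.15, 0.1, 0.85, 0.4, 0.6],   # a blue-ish object
                           [0.90, 0.2, 0.10, 0.6, 0.3]])  # a red-ish object
    print(word_scores("blue", candidates))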
0:12:16 Once we have understood the words using the words-as-classifiers model, the next step is to compose the meaning and obtain the meaning of the whole segment.
0:12:26 The segment meaning is obtained by composing the meanings obtained for every word: in this example we compose the word meanings by multiplying the scores, and we obtain a segment meaning for each one of the segments.
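A minimal sketch of that composition step: multiply the per-word distributions element-wise over the candidate objects to get one score per object for the segment. The numbers are made up, and renormalising is a design choice assumed here, not something stated in the talk.

    import numpy as np

    # Word scores over three candidate objects for the segment "brown w".
    scores = {"brown": np.array([0.7, 0.2, 0.1]),
              "w":     np.array([0.6, 0.3, 0.1])}

    segment_score = np.prod(np.vstack(list(scores.values())), axis=0)
    segment_score /= segment_score.sum()  # renormalise into a distribution
    print(segment_score)                  # the first object gets the highest score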
0:12:44 So let's take an example of how this works. Here the person sees eight images; for simplicity's sake, let's look at just two of them to see how the reference is resolved and understood.
0:12:56 Here we have "facebook F and brown W", which is the description that was given for this target image. We obtain the probability scores for "facebook F" and for "brown W" individually using the words-as-classifiers model, and with this we try to understand what the "facebook F" and the "brown W" are.
0:13:15 The segmentation comes from the linear-chain conditional random field, which has already segmented the utterance from the user into individual segments and has labeled each segment with the type single. So we are trying to resolve the reference and understand the meaning.
0:13:31 For the "facebook F", the object is this one here, and the "W" is this one, and similarly this one and this one in the other image. So when the person said "facebook F", once we have composed the scores, we end up with a distribution that looks like this, and for "brown W" we find that it is this object that has the highest score.
0:13:56 So now this is the meaning that we obtained for each one of the segments. The next step is to understand the whole description at once.
0:14:04 In the previous step we obtained the meaning of each segment, the "facebook F" and the "brown W", but now we are trying to understand the meaning of the whole description at once.
0:14:16 The way we do it is that we take the scores obtained from the previous step and compose them using a simple summation, and the image that ends up with the highest score is the target image that the person was trying to describe.
0:14:33 So in the same example, we had the scores for "facebook F"; now, with "facebook F and brown W", once we have composed the scores obtained for each object, we end up with different scores for each image, and the matcher, which is now an agent with vision acting as the matcher, uses this to decide which target image it should select.
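A minimal sketch of that last step: sum the segment scores per candidate image and pick the image with the highest total (the numbers are illustrative).

    import numpy as np

    # One row per segment ("facebook f", "brown w"), one column per candidate image.
    segment_scores = np.array([[0.8, 0.2],
                               [0.7, 0.3]])

    totals = segment_scores.sum(axis=0)
    print("guess: image", int(np.argmax(totals)))  # image 0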
0:15:00 Now, this is how the pipeline works, and the next step is that we have to understand how well the model actually performs.
0:15:07 In the evaluation we need to know how well the segmenter works and how well the segment type labeling works. The segmenter works with an F-score of around eighty percent, which basically measures how well it gets the boundaries right, and the segment type classifier also works reasonably well for our purposes.
0:15:29 But our real goal is to know how well the target image was selected, not really how well we did the segmentation or the segment type labeling. Our main goal is to understand how well our matcher agent can identify the image.
0:15:48 Evaluating the target image identification of such a system is kind of tricky.
0:15:56 We can use multiple ways of running cross-validation.
0:16:00 The first is called hold-one-pair-out cross-validation, where we basically say that one of the pairs in our corpus is held out, unseen, and we train on the data from the rest of the pairs. The advantage is that it gives us an understanding of how well the system performs on a completely new user.
0:16:25 The second method of evaluating is called hold-one-episode-out. In this type of cross-validation, only one episode, one target image description, is held out, and we train on the rest of the pairs as well as the other descriptions by the same person that the system has seen in the past.
0:16:44 This gives another advantage to our system: it actually knows the words, it gives the classifier a chance to learn the idiosyncratic word usage of a speaker, so it has the additional advantage of training on that speaker's vocabulary.
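A minimal sketch of the hold-one-pair-out setup using scikit-learn's LeaveOneGroupOut, grouping each episode by the pair of players that produced it; the data is just a placeholder. Hold-one-episode-out would instead leave out a single description while keeping that speaker's other episodes in training.

    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut

    X = np.arange(12).reshape(6, 2)        # 6 episodes with dummy features
    y = np.array([0, 1, 0, 1, 0, 1])       # dummy target-image labels
    pairs = np.array([1, 1, 2, 2, 3, 3])   # which player pair produced each episode

    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=pairs):
        print("train on pairs", sorted(set(pairs[train_idx])),
              "test on pair", sorted(set(pairs[test_idx])))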
0:17:01 We run the hold-one-out kinds of evaluation both where we assume the segmentation comes from the humans, the oracle segmentation, and with the fully automated pipeline, where everything is automated end to end.
0:17:13 Another thing is that we use multiple objects per image in an image set: we can make the problem as simple as a single object or two, and then we keep adding more and more objects within an image, and we see how well the reference resolution holds up.
0:17:30 The first step of the evaluation is that you need a good baseline, and the baseline here is the naive Bayes approach that we have used in the past, which we compare against the approach we argue for in this paper.
0:17:48 We can see that the naive Bayes baseline actually works close to random as the number of objects in the images increases.
0:17:58 And once we have the segmenter and the segment type labeling in place, where we obtain the individual segments and understand what kind of segment the person is speaking about, the performance increases. So we can see that the target image identification accuracy increases with the segmentation, which is this row, compared to without it; without segmentation and labeling the performance was close to random for higher numbers of objects.
0:18:28 The next thing we want to see is how well our segmenter performs compared to the oracle segmentation, which is basically the perfect segmentation and the perfect labeling.
0:18:38 The oracle adds around two to six percent, as you can see here from this table, so it gives an additional boost, but not a large one.
0:18:50 And then finally we want to see the entrainment effect, that is, the hold-one-episode-out evaluation rather than the hold-one-pair-out evaluation. Once we give the classifier a chance to learn the word usage of this particular person, this particular describer, the performance increases, and here we see the best performance.
0:19:10 But we can see that the performance is still not at human level, and as we increase the number of objects within an image, the performance drops.
0:19:21 So the take-home messages from this work are the following. First, the automated pipeline improves the scene understanding: once we have the segmentation and the labeling performed, our task of target image identification improves.
0:19:41 Second, entrainment improves the performance of the classifier: if we give it a chance to learn from the words that a particular speaker has used before, the performance increases.
0:19:52 And third, increasing the number of objects within an image makes it really hard for the classifier to get the scene understanding right, because the objects are very similar to each other.
0:20:07 And special thanks to [names inaudible].
0:20:11 Thank you.
0:20:17 We have time for questions.
0:20:26 [Question from the audience] So this was done with word transcripts of what people said. I was wondering about using information from the speech signal in addition, for example the prosodic contour [partially inaudible].
0:20:51 Exactly, that's a really interesting question. So the question was whether, if we used prosodic context in addition to just the transcripts, it would improve performance.
0:21:03 I would refer this to another talk, where prosodic features are actually used to perform the segmentation.
0:21:21 It possibly helps; I think prosody can be really useful for the segmentation.
0:21:33 Any other questions?
0:21:46 Okay, so the question was about error analysis, whether there was any error analysis on the numbers that I showed. Yes: we checked, for the target image identification, how the humans identified the target and how each one of the methods performed, and then we ran significance tests.
0:22:05 When we ran the significance tests we found that some differences are significant and others less so, which we reported in the paper.
0:22:54 Okay, so the question was how the words-as-classifiers model is trained: is it a separate training module, or is it trained along with the rest of the pipeline that produces these numbers?
0:23:03 So actually the pipeline is evaluated as one module. It is not that we train the words-as-classifiers model separately, the segmenter separately and the segment type labeler separately and do separate evaluations; we have the whole pipeline working as one module.
0:23:27 We obtain the numbers on the segmenter performance and the segment type labeling performance, and then we have the processing part, which is the word understanding, then the composition of the meaning from the words-as-classifiers scores, and then combining the meanings obtained from the speech segments over the whole scene description.
0:23:44 All of this is performed together, which means we train the whole pipeline on n minus one pairs: if you imagine we have n pairs, we train the whole pipeline on n minus one of them, the remaining pair is used for evaluation, and we do that for every pair.
0:24:22 So, unseen phrases: that's a very good example, and that's the reason why our performance was taking a hit when we don't use entrainment.
0:24:32 That is, if we don't give the classifier a chance to learn the meaning of a word, for example when a pair starts out and they use an idiosyncratic description for an object, there is no way of learning the meaning of that description if we have never seen the token before.
0:24:49 So that is why the performance gets hurt; but once we give the pipeline a chance to learn the meanings of words like "facebook", "harry potter" and so on, it performs much better.
0:25:05 Well, we did do that; so far we have mentioned...