0:00:15 | Hi everyone. I'll be presenting our collaboration work between |
0:00:23 | Bielefeld University and the USC Institute for Creative Technologies. |
0:00:27 | The work is about incrementally understanding complex scenes. |
0:00:36 | So the motivation for this work is to understand scenes as they are incrementally described. |
0:00:45 | So let's take this scene, for example. |
0:00:49 | Let's imagine riding in a self-driving car: we are driving down |
0:00:57 | the street and we want to stop somewhere. |
0:01:01 | We see the scene as it is shown here, and then |
0:01:06 | we decide: alright, I want to park my car |
0:01:10 | to the left of the tree. |
0:01:17 | So I give that instruction to my self-driving car, and the car has to |
0:01:21 | perform that action. For the car to perform that action, it's very important for |
0:01:26 | the system to understand what that instruction actually means, |
0:01:31 | what "to the left of the tree" and "the left side of the street" actually mean. And for the car to take |
0:01:35 | an action in real time, it's very important that the processing happens incrementally, so that |
0:01:41 | the actions can be taken at the right point of time. |
0:01:45 | And if we support dialogue, that's all the better. |
0:01:49 | So the general research plan for this work |
0:01:56 | is that we want to understand scene descriptions with multiple objects, |
0:02:02 | and we want to perform reference resolution. The steps that we perform for |
0:02:07 | reference resolution are basically divided into three parts. |
0:02:10 | The first step is that we understand words: |
0:02:12 | we try to understand the meaning of words like "red", "green" and so on. |
0:02:17 | Then we try to understand the meanings of phrases, |
0:02:22 | and then we try to understand the complete scene description. |
0:02:28 | So those are the steps that we need in order to resolve references and to understand |
0:02:33 | the scene descriptions. |
0:02:34 | For that we follow a pipeline method: we have a segmentation step, and |
0:02:40 | then we have a segment-type classifier step, which is basically |
0:02:44 | trying to understand what type of thing the person is speaking about in each individual segment. |
0:02:51 | So the domain that we are taking a look at in this |
0:03:02 | work is called RDG-Pento, a pentomino variant of the RDG-Image (Rapid Dialogue Game) domain. |
0:03:07 | The game is a two-player, collaborative, time-constrained game |
0:03:15 | where two people are assigned the roles of the director and the matcher. |
0:03:19 | The director sees eight images on his or |
0:03:23 | her screen, and one of the images is highlighted with a red border. |
0:03:27 | Each scene has multiple objects in it, and the director |
0:03:32 | is trying to provide descriptions of the highlighted image so that the matcher can make a guess |
0:03:37 | and select the right image. |
0:03:40 | So let's take an example dialogue from this game. Here we have a director |
0:03:48 | who tries to describe this image with the following description: |
0:03:52 | "this one is kind of a blue P and a W sort |
0:03:57 | of, and the P is kind of malformed". |
0:04:00 | So this is the description that the person is giving for the highlighted target |
0:04:03 | image, for the matcher to make a guess. |
0:04:06 | The matcher, from that description, understands what |
0:04:12 | the image is, makes a guess, and says "got it". |
0:04:16 | If you look at that description, it consists of three |
0:04:21 | segments. |
0:04:21 | In the first segment, the person is trying to describe |
0:04:25 | this blue shape, with a description like "it's kind of |
0:04:31 | a blue P". |
0:04:32 | Then the second part is where the person is trying to describe the next |
0:04:37 | object; that is the following segment. |
0:04:41 | And then finally, in the third part, he goes back |
0:04:47 | to the first object that he was trying to describe, and he basically describes it again, |
0:04:53 | and that was enough for the matcher to make the right guess out of the eight images. |
0:04:58 | One thing to keep in mind is that the matcher does not see the images |
0:05:01 | in the same order, so they cannot really use |
0:05:04 | the position of the image on the screen to pick out the correct image. |
0:05:08 | So the rest of the talk is divided into these steps. In the |
0:05:13 | first part, which is the preprocessing step, I describe how we go |
0:05:18 | about collecting the data and annotating the data. Then I go into the steps of designing the |
0:05:25 | system, using the segmentation and labeling, and finally how it resolves the references using this |
0:05:32 | pipeline. |
0:05:33 | And then finally we go through the evaluation and see how well the model works. |
0:05:38 | So the first step is the dialogue data collection. |
0:05:41 | We used a crowdsourcing paradigm to collect the data: the data is collected through |
0:05:47 | a web-based crowdsourcing framework that was presented last year. |
0:05:53 | It has been shown that we can collect data between |
0:05:57 | two people playing the game over crowdsourcing websites like Mechanical Turk, |
0:06:02 | that we can relay the speech |
0:06:05 | in a real-time way, and that we can transcribe it. |
0:06:12 | So we do the audio collection, and then the data collected was |
0:06:16 | transcribed. |
0:06:21 | And then finally, the third step is: |
0:06:23 | once we have the data, we want to know how much |
0:06:26 | data we collected. |
0:06:31 | For this domain we have data from forty-two pairs: |
0:06:33 | we have sixteen complete games, and the rest are |
0:06:38 | games where the people just exited early. |
0:06:42 | So then the next step is the data annotation: once we have data collected |
0:06:45 | from these people, we want to annotate that data. |
0:06:49 | The motivation for the data annotation basically comes from the domain. |
0:06:55 | We want to do reference resolution, and in reference resolution a person |
0:07:00 | is basically describing the scene; the |
0:07:05 | scene is described through object descriptions, and these objects are |
0:07:11 | objects that are present in the scene. |
0:07:13 | So we want to annotate the individual objects within each |
0:07:19 | image. |
0:07:19 | It is also possible that a person decides to speak about multiple objects |
0:07:24 | within an image, |
0:07:26 | and a person may also speak about relations, which we also annotate; basically everything |
0:07:30 | else is what we call "other". |
0:07:33 | For example, take this utterance: |
0:07:37 | "a green T upright, two red crosses to the left of the W and E", |
0:07:43 | "got it". |
0:07:43 | For this particular utterance, we basically annotate |
0:07:48 | it in this way, |
0:07:51 | as indicated, |
0:07:52 | using the scheme. |
0:07:54 | We had this particular utterance, "this is an L on top of the T, and then |
0:07:58 | a Harry Potter sign next to the N". |
0:08:00 | We annotate the first part, which is "this is the L", and |
0:08:05 | we mark it as single, along with which object the person is referring to. |
0:08:09 | We get the labels for each one of these objects using OpenCV, |
0:08:13 | and we also mark the relations: for example, "next to" is marked |
0:08:18 | with the IDs of the two objects it relates, here one and zero, which basically defines the relationship |
0:08:22 | between those two objects, |
0:08:24 | by this number. |
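Purely as an illustration of what one annotated utterance could look like under this scheme, here is a small, hypothetical Python representation; the field names and object IDs are invented, not the corpus's actual annotation format:

```python
# Hypothetical sketch of an annotated utterance; field names and IDs are
# illustrative only, not the real annotation format.
annotated_utterance = {
    "utterance": "this is an L on top of the T and then a Harry Potter sign next to the N",
    "segments": [
        {"text": "this is an L", "type": "single", "objects": [1]},
        {"text": "on top of the T", "type": "relation", "objects": [1, 0]},
        {"text": "a Harry Potter sign next to the N", "type": "relation", "objects": [4, 2]},
    ],
}

# Single segments point at one object ID (detected e.g. with OpenCV);
# relation segments carry the IDs of the two objects they relate.
for seg in annotated_utterance["segments"]:
    print(seg["type"], seg["objects"])
```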
0:08:28 | Once we have the data annotation performed, the next step is |
0:08:33 | the language processing pipeline, which includes |
0:08:38 | three steps with which we try to understand which |
0:08:43 | image the person is describing. |
0:08:46 | The first step is the segmentation, |
0:08:50 | for which we use a linear-chain conditional random field. |
0:08:54 | Once we have the segmentation, we try to identify the type of the segments, for which we |
0:08:58 | use a support vector machine. And then the reference resolver is |
0:09:03 | basically trying to resolve which image the person is speaking about. |
0:09:07 | One thing to keep in mind is that we use transcribed speech at this |
0:09:11 | point, not ASR; ASR is left for future work. |
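As a rough, non-incremental sketch of how these three stages fit together (everything below is stubbed out with made-up logic; the real components are a CRF, an SVM, and the words-as-classifiers model):

```python
# Toy sketch of the three-stage pipeline; all stages are deliberately stubbed.

def segment(words):
    # Stub segmenter: pretend every four words form one segment.
    return [words[i:i + 4] for i in range(0, len(words), 4)]

def classify_segment_type(segment_words):
    # Stub segment-type classifier: spatial prepositions signal a relation.
    return "relation" if {"left", "right", "above", "below", "top"} & set(segment_words) else "single"

def resolve_references(labeled_segments, candidate_images):
    # Stub reference resolver: score every candidate image equally.
    return {image: 1.0 for image in candidate_images}

words = "brown w to the left of the blue l".split()
labeled = [(seg, classify_segment_type(seg)) for seg in segment(words)]
scores = resolve_references(labeled, ["image_1", "image_2"])
print(labeled)
print(max(scores, key=scores.get))
```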
0:09:16 | The segmenter module is basically trying to split the whole utterance into multiple |
0:09:21 | segments. |
0:09:23 | The input to |
0:09:25 | the segmenter is the word stream, one word at a time, |
0:09:28 | and the segmenter is basically a linear-chain conditional random field which tags |
0:09:32 | each word with the word-boundary information. |
0:09:36 | So here in this example, at one stage we have a shape description followed by "at the top |
0:09:41 | left", and once we have the word "left" coming in, |
0:09:44 | the segmenter is basically re-run |
0:09:46 | and the labeling is done: |
0:09:51 | the task of segmentation is performed again |
0:09:54 | with each incoming word, and we get an updated set of |
0:09:57 | segments. |
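A minimal sketch of such a boundary-tagging segmenter. It uses the sklearn-crfsuite package and a two-label scheme (B for a word that closes a segment, I for a word inside one); both the package choice and the label scheme are my assumptions, since the talk only says a linear-chain CRF is used:

```python
import sklearn_crfsuite  # assumed implementation; any linear-chain CRF would do

def word_features(words, i):
    # Simple per-word context features; the real feature set is not specified in the talk.
    return {
        "word": words[i].lower(),
        "prev": words[i - 1].lower() if i > 0 else "<s>",
        "next": words[i + 1].lower() if i < len(words) - 1 else "</s>",
    }

# Toy training data: B marks a word after which a segment boundary falls.
sentences = ["gray l on top of the t".split(),
             "brown w to the left of the blue l".split()]
labels = [["I", "B", "I", "I", "I", "I", "B"],
          ["I", "B", "I", "I", "I", "I", "I", "I", "B"]]

X = [[word_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)

# Incremental use would re-run prediction every time a new word arrives.
test = "blue p at the top left".split()
pred = crf.predict([[word_features(test, i) for i in range(len(test))]])[0]
print(list(zip(test, pred)))
```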
0:10:00 | Once we have the segments, the next step is to identify |
0:10:03 | what kind of segments these are. |
0:10:06 | The type of the segments is decided by the segment-type classifier, and |
0:10:12 | for this we use a support vector machine. |
0:10:17 | In our annotation scheme we had four labels, if you remember: |
0:10:23 | single, multiple, relation, and other. So this is the step where we are |
0:10:27 | trying to identify, for each segment, what type of segment |
0:10:32 | it is. |
0:10:32 | So in the previous example, once we had the shape description followed by "at the top", it |
0:10:37 | was basically labeled as single; but once we see the |
0:10:41 | word "of", |
0:10:42 | so that we have "at the top left of" as part of |
0:10:45 | the previous segment, it is relabeled as a relation |
0:10:52 | segment. |
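A minimal sketch of the segment-type classifier, assuming simple bag-of-words features and scikit-learn's SVC; the actual features used in the system are not spelled out in the talk, so treat this purely as an illustration of the four-way labeling:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny, invented training set covering the four segment types.
segments = [
    "gray l", "the blue e", "facebook f",           # single
    "two red crosses", "the w and the e",           # multiple
    "on top of the t", "to the left of the tree",   # relation
    "got it", "yeah okay",                          # other
]
types = ["single", "single", "single",
         "multiple", "multiple",
         "relation", "relation",
         "other", "other"]

# Keep single-letter shape names like "l" and "w" as tokens.
clf = make_pipeline(CountVectorizer(token_pattern=r"(?u)\b\w+\b"), SVC(kernel="linear"))
clf.fit(segments, types)

# Predicted types for two new segments.
print(clf.predict(["next to the n", "brown w"]))
```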
0:10:53 | Once we have the segment types, |
0:10:56 | the next step is to perform the reference resolution. |
0:11:02 | The reference resolution happens in three separate steps. |
0:11:07 | The first step is understanding words: |
0:11:09 | we use the words-as-classifiers model, which I will describe in a bit, where we |
0:11:14 | try to understand the meanings of words using visual features. |
0:11:19 | Once we have understood the words, we try to understand the whole segments, |
0:11:24 | for which we use the scores from the words-as-classifiers model and |
0:11:28 | compose the meaning, |
0:11:29 | obtaining scores for each segment. |
0:11:32 | And then finally we try to understand the whole |
0:11:35 | referring expression, the complete scene description. |
0:11:40 | The words-as-classifiers model is basically a logistic regression model that is trying |
0:11:45 | to understand the meaning of every word |
0:11:47 | using visual features. So every word is represented with visual features, |
0:11:52 | and the visual features are the ones extracted from the images; |
0:11:56 | in this case we use RGB values, HSV values, orientations, |
0:12:00 | and x/y positions. |
0:12:03 | For each one of the objects within an image we extract these features |
0:12:08 | automatically, |
0:12:10 | and we get a probability distribution over the objects for each one of the words. |
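A small sketch of a words-as-classifiers style word model: one binary classifier per word, trained on visual features of objects the word was or was not used for. The feature layout and the numbers are invented; only the idea of a word meaning as a classifier over visual features comes from the talk:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_word_classifier(positive_objects, negative_objects):
    # One classifier per word: positives are objects the word was used to describe.
    X = np.vstack([positive_objects, negative_objects])
    y = np.array([1] * len(positive_objects) + [0] * len(negative_objects))
    return LogisticRegression().fit(X, y)

# Invented feature layout: [R, G, B, H, S, V, orientation, x, y], scaled 0-1.
brown_objects = np.array([[0.55, 0.35, 0.20, 0.08, 0.6, 0.50, 0.0, 0.2, 0.7],
                          [0.60, 0.40, 0.25, 0.09, 0.6, 0.60, 0.5, 0.8, 0.3]])
blue_objects  = np.array([[0.10, 0.20, 0.80, 0.60, 0.8, 0.80, 0.0, 0.4, 0.4],
                          [0.15, 0.25, 0.85, 0.58, 0.8, 0.90, 0.5, 0.6, 0.6]])

clf_brown = train_word_classifier(brown_objects, blue_objects)  # classifier for "brown"
new_object = np.array([[0.58, 0.37, 0.22, 0.08, 0.6, 0.55, 0.2, 0.5, 0.5]])
print(clf_brown.predict_proba(new_object)[0, 1])  # how well "brown" fits this object
```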
0:12:16 | Once we have understood the words |
0:12:19 | using the words-as-classifiers model, the next step is to compose the |
0:12:22 | meaning and get the meaning for |
0:12:25 | the whole segments. |
0:12:26 | The segment meaning is obtained by composing the meanings obtained |
0:12:31 | for every word. |
0:12:32 | So in this example we simply |
0:12:36 | compose the word meanings by |
0:12:38 | multiplying the scores, |
0:12:40 | and we obtain the segment meaning for each one of the segments. |
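A tiny worked sketch of that composition step, with made-up numbers: the word-level fit scores over the objects of one image are multiplied element-wise and can be renormalized into a segment-level distribution:

```python
import numpy as np

# Invented word-level fit scores over three objects of one image.
word_scores = np.array([
    [0.70, 0.20, 0.10],   # scores for the word "brown"
    [0.60, 0.10, 0.05],   # scores for the word "w"
])

segment_scores = word_scores.prod(axis=0)              # compose by multiplication
segment_dist = segment_scores / segment_scores.sum()   # normalize to a distribution
print(segment_dist)   # the first object gets most of the probability mass
```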
0:12:44 | So let's take an example of how this works. Here we are supposed to have eight |
0:12:49 | images; for simplicity's sake, let's take a look at just two images to see how |
0:12:53 | it is resolved and how it is understood. |
0:12:56 | Here we have "Facebook F and brown W", which is the description that was given |
0:13:00 | for this target image. |
0:13:02 | We obtain the probability scores for "Facebook F" and "brown W" individually using |
0:13:07 | the words-as-classifiers model, |
0:13:09 | and with this method we try to understand what |
0:13:12 | the Facebook F and the brown W are. |
0:13:15 | In the segmentation we obtain the segments using the linear-chain conditional random |
0:13:19 | field, which has already segmented the utterance from the user into individual segments |
0:13:25 | and has labeled each segment with the type single. So we are trying to |
0:13:29 | resolve the reference and understand the meaning. |
0:13:31 | For the Facebook F, the intended object is this one, and |
0:13:36 | the brown W is this one; and similarly, these are the corresponding objects in the other image. |
0:13:42 | So when the person said "Facebook F", once we have computed the scores we end |
0:13:46 | up with a distribution which looks like this, |
0:13:48 | and for "brown W" |
0:13:50 | the object we find is |
0:13:53 | this one, which has the highest score. |
0:13:56 | So this is the meaning that we fill in for each one of the segments. |
0:14:01 | The next step is to understand the whole description at once. |
0:14:04 | In the previous step we obtained the meaning of each segment, the |
0:14:08 | Facebook F and the brown W, but now we are trying to understand the meaning of the |
0:14:13 | whole description at once. |
0:14:16 | The way we do it is that we use the scores that |
0:14:19 | we obtained from the previous step |
0:14:22 | and compose them using a simple summation; the image that ends up with the |
0:14:27 | highest score is the target image that the person was trying to |
0:14:31 | describe. |
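A sketch of that final step, again with invented numbers: the per-segment scores are summed per candidate image, and the image with the highest total is returned as the guess:

```python
# Invented per-segment scores ("facebook f", "brown w") for two candidate images.
segment_scores = {
    "image_1": [0.81, 0.74],
    "image_2": [0.32, 0.25],
}

totals = {image: sum(scores) for image, scores in segment_scores.items()}
guess = max(totals, key=totals.get)   # image with the highest summed score
print(totals, "->", guess)
```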
0:14:33 | In the same example, these were the scores so far; |
0:14:38 | but now, once we have composed the scores obtained for "Facebook F" and "brown W" for each |
0:14:43 | object, we end up with different scores for each image, |
0:14:47 | and we end up with |
0:14:49 | the matcher, which is now an agent with vision acting as the matcher, |
0:14:53 | understanding which target image |
0:14:57 | it should select. |
0:15:00 | Now, this is how the pipeline works, and the next step is that we have |
0:15:03 | to understand how well the model performs. |
0:15:07 | So in the evaluation step we need to know how well the segmenter |
0:15:11 | works and how well the segment-type labeling process works. |
0:15:16 | The segmenter works with an F-score of around eighty percent, which is basically |
0:15:21 | how well it got the boundaries right, |
0:15:23 | and the segment-type classifier is working reasonably well for our purposes. |
0:15:29 | But our goal is to know how well the target |
0:15:34 | image was selected, |
0:15:38 | not really the segmentation or the segment-type labeling in isolation. |
0:15:42 | Our main goal is to understand how well |
0:15:45 | our matcher agent can identify the image. |
0:15:48 | Evaluating the target image understanding of such a system is |
0:15:53 | kind of tricky. |
0:15:56 | We can use multiple ways of running cross-validation. |
0:16:00 | The first method is called hold-one-pair-out: |
0:16:06 | we basically say that the data from one of the pairs that we have in |
0:16:10 | our corpus is unseen, |
0:16:12 | and we train on the rest of the pairs' data. |
0:16:15 | One advantage is that it gives us an understanding of |
0:16:20 | how well the system performs |
0:16:22 | on a new user, one the system has never seen before. |
0:16:25 | The second method of evaluating is called hold-one-out. |
0:16:28 | In this type of cross-validation, |
0:16:33 | one episode, that is one target image description, |
0:16:37 | is held out, and we train on the rest of the pairs as well as |
0:16:41 | the other descriptions by the same person that the system has seen in the past. |
0:16:44 | This gives another advantage to our system: |
0:16:50 | it gives the classifier a chance to learn the idiosyncratic |
0:16:55 | word usage of a speaker, |
0:16:57 | so it has the additional advantage of training on that vocabulary. |
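A small sketch of the hold-one-pair-out splitting, using scikit-learn's LeaveOneGroupOut with the pair ID as the group; the data here is random and only illustrates the splitting scheme, not the actual features or models:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))          # 12 descriptions, 5 dummy features
y = rng.integers(0, 2, size=12)       # dummy labels
pairs = np.repeat([0, 1, 2, 3], 3)    # 4 pairs, 3 descriptions each

# Every fold holds out all descriptions from one pair (one unseen pair of players).
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=pairs):
    held_out = pairs[test_idx][0]
    print(f"train on pairs != {held_out}, test on pair {held_out}")
```

The hold-one-out variant would instead leave out a single image description, so the training data can include other descriptions from the same speaker.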
0:17:01 | We also have the oracle kind of evaluation, where we assume the segmentation comes from humans, |
0:17:07 | and finally we use the automated pipeline, where everything is automated end to end. |
0:17:13 | Another thing is that we vary the number of objects per image in an image set. So |
0:17:19 | we can make the problem as simple as |
0:17:21 | a single object or two, |
0:17:23 | and then we add in more and more objects within an |
0:17:26 | image, and then we try to see how well the reference resolution works. |
0:17:30 | The first step of evaluation is that you need a good baseline, and the baseline |
0:17:37 | here is |
0:17:41 | the naive method that we used in past work. |
0:17:48 | We can see that the baseline actually works close to random |
0:17:53 | as the number of objects in the input increases. |
0:18:02 | Once we have the segmenter and the segment-type labeling in place, |
0:18:07 | that is, once we have the individual segments and understand what kind |
0:18:09 | of segments the person is speaking about, |
0:18:15 | the performance increases. So we can see that the target image identification accuracy |
0:18:19 | increases with the segmentation, which is this row; |
0:18:24 | without segmentation and labeling, the performance was |
0:18:28 | much closer to random for a high number of objects. |
0:18:32 | The next thing that we want to see is how well |
0:18:37 | our segmenter performs compared to the oracle segmentation, which is basically the perfect segmentation and the |
0:18:38 | perfect labeling. |
0:18:44 | The oracle adds around two to six percent, as you can see |
0:18:48 | from this table, so it gives an additional boost, but it is |
0:18:50 | not a large one. |
0:18:54 | And then finally we want to see the entrainment effect: once we go from the |
0:18:58 | hold-one-pair-out evaluation to the hold-one-out evaluation, |
0:19:01 | that is, once we give the classifier a chance to learn |
0:19:05 | the word usage of this particular person, this particular describer, |
0:19:10 | the performance increases, and here we get |
0:19:14 | the best performance. But we can see that the performance is still not near human, |
0:19:16 | and as we increase the number of objects within an image, the performance drops. |
0:19:21 | So the take-home message from this work is that |
0:19:26 | the automated pipeline improves the scene understanding: once |
0:19:31 | we have the segmentation, |
0:19:37 | our task of target image identification improves. |
0:19:41 | Entrainment improves the performance of the classifier: that is, |
0:19:44 | if we give the classifier a chance to learn from the words that |
0:19:48 | have been spoken by the same speaker before, |
0:19:50 | it increases the performance. |
0:19:52 | And with an increasing number of objects within an image, it's really hard for the classifier |
0:19:57 | to get the objects right, to get the scene understanding right, because the objects are |
0:20:02 | very similar to each other. |
0:20:06 | And special thanks to [names inaudible]. |
0:20:11 | Thank you. |
0:20:17 | We have time for questions. |
0:20:26 | So this all works with word transcripts |
0:20:31 | of what people said. |
0:20:33 | I was wondering about how you could use the information from the speech signal, |
0:20:40 | for example the prosodic contour. |
0:20:42 | Would that improve the resolution or the segmentation, |
0:20:48 | in addition to what you do? |
0:20:51 | Exactly, that's a really interesting question. |
0:20:55 | So here the question was whether, if we use the prosodic context instead of |
0:20:59 | just using the transcript, that would improve the performance. |
0:21:03 | Actually, I would want to refer this to the next talk, |
0:21:06 | where we actually use the prosody features to perform the segmentation, |
0:21:14 | and I don't want to |
0:21:19 | give too much away, |
0:21:21 | but it possibly helps. |
0:21:24 | I do think prosody can be really useful for the segmentation. |
0:21:33 | Any other questions? |
0:21:46 | Okay, so the question was about error analysis: was there any error analysis on |
0:21:50 | the numbers that I showed? Yes, we checked the significance scores, |
0:21:56 | for example for the target image identification: how the humans identify the |
0:22:01 | target and how each one of the methods performs. |
0:22:05 | And then we did a significance test, and when we ran the significance tests we |
0:22:11 | found that a few things are more significant than others, |
0:22:14 | which is reported in the paper. |
0:22:54 | Okay, so the question was how the words-as-classifiers model is trained: is it |
0:22:58 | a separate training module, or is it trained along with the way we get |
0:23:03 | these numbers? Actually, we have the pipeline, which is evaluated as one module, |
0:23:08 | instead of training the words-as-classifiers model separately, the |
0:23:12 | segmenter separately, and the segment-type labeling separately. We don't do that separate |
0:23:18 | kind of evaluation; we have |
0:23:20 | the whole pipeline working as |
0:23:25 | one particular module. |
0:23:27 | We then obtain the numbers on the segmenter performance and the segment-type labeling performance, |
0:23:30 | and then we have the processing part, which is the word understanding, then the composition of |
0:23:36 | the meaning from the words-as-classifiers model, and then |
0:23:40 | combining the meanings obtained from the speech segments into the whole |
0:23:44 | scene description. All of these are actually performed as one, which means |
0:23:50 | we train the whole pipeline on |
0:23:53 | N minus one pairs: if you imagine we have N pairs, |
0:23:57 | for the evaluation we train the whole pipeline on |
0:24:01 | N minus one pairs, and then the remaining one pair is used for |
0:24:05 | evaluation, and we do that |
0:24:09 | for every possible split. |
0:24:22 | So, the unseen phrases: that's a very good example, and that's the |
0:24:27 | reason why our performance was taking a hit when we don't use |
0:24:30 | entrainment. |
0:24:32 | That is, if you don't provide the classifier the chance to learn the meaning: |
0:24:35 | for example, if a pair, right from the start, uses an idiosyncratic description |
0:24:40 | for an object, |
0:24:42 | there's no way of learning the meaning for that if |
0:24:45 | we don't know the tokens. |
0:24:49 | So that is why the performance gets hurt; but once we give the pipeline |
0:24:53 | a chance to learn the meaning of words like "Facebook F" and all |
0:24:56 | these things, it performs much better. |
0:25:05 | Well, we did do that; so far, we have mentioned |
---|