All right, hello everyone. I'll be presenting our collaborative work between Bielefeld University and the USC Institute for Creative Technologies. The work is about incrementally understanding complex scenes.

The motivation for this work is to understand the kinds of scenes we encounter in the world around us.
So let's take this scene, for example. Imagine riding in a self-driving car: we're driving down the street, and we want to stop somewhere. We see the scene as shown here, and we decide: all right, I want to park my car to the left side of the tree. So I give that instruction to my self-driving car, and the car has to perform that action. For the car to perform that action, it's very important for the system to understand what "the tree" actually means and what "left side of the tree" actually means. And for the car to act in real time, it's very important that the processing happens incrementally, so that actions can be taken at the right point in time. And if we can support dialogue, that's all the better.
The general research plan for this work is that we want to understand scene descriptions with multiple objects, and we want to perform reference resolution. The steps we perform for reference resolution are basically divided into three parts. The first step is that we understand words: we try to understand the meaning of "red", "green", and so on. Then we try to understand the meanings of phrases, for example "blue P" or "brown W". And then we try to understand the complete scene description.

Those are the parts, and that's the subset of the problem we tackle here: understanding references in order to understand the scene descriptions. For that we follow a pipeline method: we have a segmentation module, and then we have a segment-type classifier, which is basically trying to understand what type of individual segments the person is speaking about.
The domain that we're looking at in this work is called RDG-Image, the Rapid Dialogue Game, in a pentomino variant. The game is a two-player, collaborative, time-constrained game where the two players are assigned the roles of director and matcher. The director sees eight images on his or her screen, with one of the images highlighted by a red border. Each scene has multiple objects in it, and the director tries to provide descriptions of these images so that the matcher can make a guess and select the right image.
Let's take an example dialogue from this game. Here we have a director who tries to describe this image with the description: "this one has kind of a blue P, and a brown W sort of, and the P is kind of malformed". This is the description that the person gives for the highlighted target image, for the matcher to make a guess. The matcher, from that description, works out which image it is, makes a guess, and says "got it".

If you look at that description, it consists of three meaningful segments. In the first segment, the person is trying to describe this blue shape, with the description "kind of a blue P". The second part is where the person is trying to describe the next object: that's the following segment. And then finally, in the third part, the person goes back to the first object he was trying to describe and describes it again. And that was enough for the matcher to make the correct guess among all the images. One thing to keep in mind is that the matcher does not see the images in the same order, so they cannot use the position of the image to identify the correct one.
The rest of the talk is divided into these steps. In the first part, the preprocessing step, I'll describe how we go about collecting the data and annotating the data. Then I'll go into how we designed the system, using segmentation and labeling, and how it resolves the references using this pipeline. And then finally we'll go through the evaluation and see how well the model works.
The first step is the dialogue data collection. We used a crowdsourcing paradigm to collect the data: the data was collected with the crowdsourcing framework Pair Me Up, which was presented last year. It has been shown that we can collect data from two people playing the game over crowdsourcing websites like Mechanical Turk, that we can capture the speech in a real-time way, and that we can transcribe it. So we do the audio collection, and the collected data was then transcribed.

Then the third step: once we have the data, we want to know how much data we collected. For this domain we have data from forty-two pairs: we have sixteen complete games, and in the rest of them people just quit partway through.
The next step is the data annotation: once we have the data collected from these pairs, we want to annotate it. The motivation for the annotation scheme basically depends on the domain. We want to do reference resolution, and in reference resolution here, a person is describing the scene, the scene is described through object descriptions, and each of these refers to one of the objects in the scene. So we want to annotate the individual objects within each image. It's also possible that a person decides to speak about multiple objects within an image, and a person may also speak about relations, which we also annotate; basically everything else is what we call "other".

For example, for this utterance, "a green T upright, two red crosses, to the left of W and E, got it", we annotate it in this way, as indicated, using our scheme. For this particular utterance, "S on top of the T, and then the Harry Potter sign next to the N", we annotate the first part, which is the S, and we mark it as a single, along with which object the person is referring to. We get the labels for each one of these objects using OpenCV. And we also mark the relations: for example, "next to" is marked with the identifiers of the two objects it connects, which basically defines the relationship between these two objects.
Once we have the data annotated, the next step is the language-processing pipeline, which consists of three steps that together try to work out which image the person is describing. The first step is the segmentation, for which we use a linear-chain conditional random field. Once we have the segmentation, we identify the type of each segment, for which we use a support vector machine. And then we have the reference resolver, which basically tries to resolve which image the person is speaking about. One thing to keep in mind is that we use transcribed speech at this point; it's not ASR yet, but that's planned for the future.
The segmenter module basically tries to split the whole utterance into multiple segments. The input to the segmenter is the word stream, one word at a time, and the segmenter is a linear-chain conditional random field which tags each word with word-boundary information. In this example, we have "a blue E at the top left" at one stage, and once the next word comes in, the segmenter is rerun, the labels are reassigned, the segmentation task is performed again, and we obtain a separate set of segments.
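The incremental re-labelling behaviour described above can be sketched as follows. This is a toy illustration only: the real segmenter is a trained linear-chain CRF, while here a hand-written rule (my own invention) stands in for it, just to show how earlier boundary decisions get revised as new words arrive.

```python
def tag_boundaries(words):
    """One boundary tag per word: 'B' starts a new segment, 'I' continues it.
    Stand-in rule for the CRF: once "of" has been heard, the locative
    phrase starting at "at" is split off as its own segment."""
    tags = ["B"] + ["I"] * (len(words) - 1)
    if "of" in words and "at" in words:
        tags[words.index("at")] = "B"
    return tags

def to_segments(words, tags):
    """Group the word stream into segments according to the boundary tags."""
    out = []
    for word, tag in zip(words, tags):
        if tag == "B":
            out.append([word])
        else:
            out[-1].append(word)
    return [" ".join(seg) for seg in out]

# Words arrive one at a time; the whole prefix is re-tagged on every step,
# so an earlier single-segment hypothesis can be revised later.
stream = ["a", "blue", "e", "at", "the", "top", "left", "of"]
prefix = []
for word in stream:
    prefix.append(word)
    print(to_segments(prefix, tag_boundaries(prefix)))
```

Running this shows the hypothesis "a blue e at the top left" as one segment until "of" arrives, at which point it is revised into two segments.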
Once we have the segments, the next step is that we want to identify what kind of segments these are. The type of a segment is decided by the segment-type classifier, for which we use a support vector machine. Our annotation scheme had four labels, if you remember: single, multiple, relation, and other. So this is the step where we try to identify, for each segment, what type of segment it is. In the previous example, once we had "a blue E at the top", that was basically labeled as a single; but once we see the word "of", we have "at the top left of", with "a blue E" as part of the previous segment, and this is relabeled as a relation segment.
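The relabelling of segment types could be sketched like this. Only the four labels come from the annotation scheme; the keyword rules below are my own invented stand-in for the trained SVM.

```python
# Stand-in for the SVM segment-type classifier. The four labels come from
# the annotation scheme; the keyword rules are invented for illustration.
RELATION_CUES = {"of", "next", "above", "below", "on"}

def segment_type(segment_words):
    words = set(segment_words)
    if words & RELATION_CUES:
        return "relation"            # locative phrase relating two objects
    if "and" in words:
        return "multiple"            # mentions several objects at once
    if any(len(w) == 1 and w.isalpha() for w in segment_words):
        return "single"              # mentions one letter-shaped piece
    return "other"

print(segment_type(["a", "blue", "e"]))                  # single
print(segment_type(["at", "the", "top", "left", "of"]))  # relation
```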
Once we have the segment types assigned, the next step is to perform the reference resolution. The reference resolution happens in three separate steps. The first step is understanding words: we use the words-as-classifiers model, which I'll describe in a bit, where we try to understand the meaning of words using visual features. Once we have understood the words, we try to understand whole segments, for which we use the scores from the words-as-classifiers and compose the meaning, obtaining scores for each segment. And then finally we try to understand the whole referring expression, the entire scene description.
The words-as-classifiers model is basically a logistic-regression model that tries to understand the meaning of every word using visual features. Every word is represented by the visual features, which are the ones extracted from the images: in this case we use RGB values, HSV values, orientations, and the x, y positions of the objects. For each one of the objects within the image, these features are extracted automatically, and we get a probability distribution over the objects for each one of the words.
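Applying the words-as-classifiers idea could be sketched as below. This is a minimal illustration under assumed numbers: the feature vectors and the per-word weights are invented, whereas in the real system the weights are learned from the annotated data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def word_distribution(weights, bias, objects):
    """P(object | word): one logistic score per object, normalised
    over all objects in the scene."""
    raw = {name: sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias)
           for name, feats in objects.items()}
    total = sum(raw.values())
    return {name: score / total for name, score in raw.items()}

# Visual features per object: (R, G, B, orientation, x, y) -- invented values.
objects = {
    "obj1": (0.2, 0.3, 0.9, 0.0, 0.1, 0.8),   # a blue-ish piece
    "obj2": (0.6, 0.4, 0.2, 0.5, 0.7, 0.3),   # a brown-ish piece
}

# Hypothetical learned weights for the word "blue": high weight on the B channel.
blue_weights, blue_bias = (0.0, 0.0, 4.0, 0.0, 0.0, 0.0), -2.0

dist = word_distribution(blue_weights, blue_bias, objects)
print(dist)   # "blue" puts most probability mass on obj1
```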
Once we have understood the words using the words-as-classifiers model, the next step is to compose the meaning and get the meaning for the whole segments. The segment meaning is obtained by composing the meanings obtained for every word: in this example, we basically compose the word meanings by multiplying the scores, and we obtain the segment meaning for each one of the segments.
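The multiplicative composition step can be written in a few lines. The per-word distributions below are made-up numbers standing in for words-as-classifiers output.

```python
def compose_segment(word_dists):
    """Segment meaning: multiply each word's per-object scores, renormalise."""
    scores = {obj: 1.0 for obj in word_dists[0]}
    for dist in word_dists:
        for obj in scores:
            scores[obj] *= dist[obj]
    total = sum(scores.values())
    return {obj: s / total for obj, s in scores.items()}

# Invented per-word distributions for the segment "brown w":
brown = {"obj1": 0.2, "obj2": 0.8}
w     = {"obj1": 0.3, "obj2": 0.7}

segment = compose_segment([brown, w])
print(segment)   # obj2 ends up with most of the mass
```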
Let's take an example of how this works. Here, the person sees eight images; for simplicity's sake, let's look at just two of them and how the reference is resolved and understood. Here we have "facebook F and brown W", which is the description that was given for this target image. We obtain the probability scores for "facebook F" and for "brown W" individually using the words-as-classifiers model, and with this method we try to understand what the "facebook F" and the "brown W" are.

In the segmentation step we obtained the segments using the linear-chain conditional random field, which has already segmented the utterance from the user into individual segments, and the segment-type classifier has labeled each segment with the type single. So now we try to resolve the references and understand the meaning. For the "facebook F", the object is this one here, and the W is this one; and similarly, this is one and this is the other in the second image. So once the person has said "facebook F" and we have composed the scores, we end up with a distribution that looks like this; and for "brown W", it's this one which has the highest score. So this is the meaning obtained for each one of the segments.
The next step is to understand the whole description at once. In the previous step, we obtained the meaning of each segment, the "facebook F" and the "brown W"; now we're trying to understand the meaning of the whole description at once. The way we do it is that we use the scores that we obtained from the previous step and compose them using a simple summation, and the image which ends up with the highest score is the target image that the person was trying to describe. In the same example, we have the "facebook F", and these were the scores obtained; now, with "facebook F" and "brown W" together, once we have composed the scores obtained for each object, we end up with different scores for each image. And so the matcher, which is now an agent with vision acting as the matcher, can decide which target image it should select.
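The summation over segments, and the final argmax, could look like this; the per-image segment scores below are invented placeholders for the composed words-as-classifiers output.

```python
def pick_target(segment_scores):
    """Sum each segment's per-image scores; the highest total wins.
    segment_scores: list of {image: score} dicts, one per segment."""
    totals = {}
    for scores in segment_scores:
        for image, score in scores.items():
            totals[image] = totals.get(image, 0.0) + score
    return max(totals, key=totals.get), totals

# Invented per-image scores for the segments "facebook f" and "brown w":
facebook_f = {"img1": 0.7, "img2": 0.3}
brown_w    = {"img1": 0.6, "img2": 0.4}

best, totals = pick_target([facebook_f, brown_w])
print(best)   # img1 wins
```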
So that is how the pipeline works, and the next step is to understand how well the model performs. In the evaluation, we first want to know how well the segmenter works and how well the segment-type labeling works. The segmenter works with an F-score of around eighty percent, which measures how well it gets the boundaries right, and the segment-type classifier also works reasonably well for our purposes. But our main goal is to know how well the target image was selected, not really how well the segmentation or the segment-type labeling performed on their own. Our main goal is to understand how well our matcher agent can identify the image. And evaluating such a system is kind of tricky.
We can run cross-validation in multiple ways. The first is called hold-one-pair-out: we basically hold out the data of one of the pairs we have in our corpus as unseen, and we train on the rest of the pairs' data. One advantage is that this gives us an understanding of how well the system performs on a completely new user.

The second method of evaluation is called hold-one-episode-out. In this type of cross-validation, one episode, that is, one target-image description, is held out, and we train on the rest of the pairs as well as the other descriptions by the same person that the system has seen in the past. This gives an additional advantage to the system: it gives the classifier a chance to learn the idiosyncratic word usage of a user, in other words the additional advantage of training on that user's vocabulary.
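The two cross-validation regimes can be sketched over a toy corpus keyed by (pair, episode); the corpus entries below are placeholders.

```python
# Toy corpus: (pair_id, episode_id) -> a target-image description.
corpus = {
    ("pair1", "ep1"): "desc-a", ("pair1", "ep2"): "desc-b",
    ("pair2", "ep1"): "desc-c", ("pair2", "ep2"): "desc-d",
}

def hold_one_pair_out(corpus, pair):
    """All data from one pair is unseen: tests generalisation to new users."""
    train = {k: v for k, v in corpus.items() if k[0] != pair}
    test = {k: v for k, v in corpus.items() if k[0] == pair}
    return train, test

def hold_one_episode_out(corpus, pair, episode):
    """Only one episode is held out: training still sees the same speaker's
    other descriptions, so idiosyncratic word usage can be learned."""
    held = (pair, episode)
    train = {k: v for k, v in corpus.items() if k != held}
    return train, {held: corpus[held]}

train, test = hold_one_pair_out(corpus, "pair1")
print(len(train), len(test))   # 2 2
train, test = hold_one_episode_out(corpus, "pair1", "ep1")
print(len(train), len(test))   # 3 1
```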
We also have an oracle kind of evaluation, where we assume the segmentation comes from humans, and finally we use the automated pipeline, where everything is automated end to end. One more thing: we use multiple objects per image in an image set, so we can make the problem as simple as a single object or two, and then we add more and more objects within an image and see how well the reference resolution holds up.
The first step of the evaluation is that you need a good baseline, and the baseline here is the naive Bayes model that we used in past work on this domain; what we argue in the paper is why the pipeline with the support vector machine and the CRF does better. We can see that the naive Bayes is actually working close to random as the number of objects in the input increases. And once we have the segmenter and the segment-type labeling in place, where we obtain the individual segments and understand what kind of segments the person is speaking about, the performance increases: the target-image identification accuracy increases with the segmentation, which is this row. Without segmentation and labeling, the performance was much closer to random for a high number of objects.
The next thing we want to see is how well our segmenter performs compared to the oracle segmentation, which is basically the perfect segmentation and the perfect labeling. The oracle adds around two to six percent, as you can see from this table, so it gives an additional boost, but not a large one.
And then finally we want to see the entrainment effect: that is, once we move from the hold-one-pair-out evaluation to the hold-one-episode-out evaluation, where we give the classifier a chance to learn the word usage of this particular speaker, the performance increases, and here we obtain the best performance. But we can see that the performance is still not at human level, and as we increase the number of objects within an image, the performance drops.
So the take-home messages from this work: the automated pipeline improves the scene understanding. Once we have the segmentation and the labeling performed well, our task of target-image identification improves. Entrainment improves the performance of the classifier: if we give it a chance to learn from the words that the speaker has already used in earlier examples, the performance increases. And increasing the number of objects within an image makes it really hard for the classifier to get the scene understanding right, because the objects become very similar to each other.
Special thanks to our collaborators. Thank you, and I believe we have time for questions.
Q: So right now this works with word transcripts of what people said. I was wondering whether you looked at information from the speech signal, for example the prosodic contour, in addition to the words.

A: Exactly, that's a really interesting question. So the question was whether using prosodic context, instead of just the transcripts, would improve performance. I would refer you to an earlier talk at this conference where prosodic features were actually used to perform the segmentation. I can't say exactly how much it would help here, but I think prosody can be really useful for the segmentation.
Any other questions?

Okay, so the question was about error analysis: did we do any error analysis on the numbers? Yes. We checked the significance, that is, for each target-image identification, how the humans identified the target and how each one of the methods performed, and we ran significance tests. When we ran them, we found that some methods are significantly better than others, which is reported in the paper.
Okay, so the question was how the words-as-classifiers model is trained: is it a separate training module, or is it trained along with the rest of the pipeline that produces these numbers? We have a pipeline which is evaluated as one module. In training, we train the words-as-classifiers separately, the segmenter separately, and the segment-type labeler separately, but we don't do separate evaluations: we have the whole pipeline working as one module. We obtain the numbers on the segmenter performance and the segment-type labeling performance, and then we have the processing part, which is the word understanding, then the composition of the meaning from the words-as-classifiers, and then the combination of the meanings obtained from the speech segments into the whole scene description. All of these are performed end to end, which means we train the whole pipeline on n minus one pairs: if you imagine we have n pairs, we train the whole pipeline on n minus one pairs, hold out the remaining pair for evaluation, and we do that for every possible pair.
So, about unseen phrases: that's a very good example, and that's the reason why our performance takes a hit when we don't use entrainment. That is, if we don't give the classifier the chance to learn the meaning of a word, for example when a speaker has their own style and uses an idiosyncratic description like "facebook F" for that object, there's no way of learning the meaning if we have never seen that token. That is why the performance takes a hit; but once we give the pipeline a chance to learn the meaning of words like "facebook", it performs much better.

Well, we did do that, and we have mentioned it in the paper.