Good morning everyone, and welcome to the second day of ASRU 2011.

I hope you're enjoying it as much as I am.

It's my pleasure to introduce Professor David Forsyth, who joins us from the University of Illinois at Urbana-Champaign.

I'm going to skip some of the bio here, but he has published more than a hundred and thirty papers.

He's very active in the IEEE community as well: he was a program co-chair for IEEE CVPR twice, in 2000 and 2001, and he was a general co-chair for CVPR in 2006.

He is also active in the SIGGRAPH community. He has received the IEEE Technical Achievement Award, became an IEEE Fellow, and is the author of a well-known computer vision textbook. Please join me in welcoming Professor David Forsyth.

Thank you for those kind words.

So I was a little bit uncertain about what to say, being a vision person talking to a speech audience, and what I'll try to do is identify the themes in vision that might be interesting to you.

I'm going to talk broadly about what we've been up to. A lot of what I'll show is due to my colleagues and students, in one form or another — in particular to Ali Farhadi and Derek Hoiem, to whom I owe a great deal.

Computer vision splits roughly into reconstruction and recognition. Reconstruction is essentially: you make a model of the world from pictures, or video, or other kinds of measurement. And recognition — I think of recognition as being: say what is in the picture.

The field has gone from being, you know, the occupation of a small number of people to a very successful discipline. And with the massive applications we have the standard problem of an academic field, which is that whenever something really works and generates money, we say "that's not really what we do" and ignore it. But there are a whole bunch of those things that have spun off, and we'll see some of them.

I'm not going to talk very much about reconstruction, but I want to mention the state of the art. In this area, if you have multiple cameras or multiple views, you can get astonishing results at huge geometric scale. If you walk around, for example, a quadrangle with lots of big buildings, waving a video camera at those buildings, you can reconstruct the geometry to two centimetres or less. There are reconstructions of whole cities that have been prepared using these methods; the error gets a little bit bigger, and it's very largely automatic. Furthermore, you can do the same with a bunch of scattered images off the web — that's slightly harder, matching what looks like what and so on, but that's kind of in hand.

If you have a single picture, it's much more difficult to reconstruct, but you can make some progress; actually, in the recognition material I'm going to talk about, you'll see some of this. Some of the things that tell you about the shape of the world include the symmetry of objects in the world and stylised shapes — later on we're going to pretend that every room is a box, and that turns out to be a very useful assumption — and contour information, texture information and shading can all tell us something about shape.

I'm going to show you a reconstruction that is maybe about seven years old now, but it gives you some sense of the state of the art; the current state of the art is like this, but bigger. Here's a movie of an agricultural building somewhere out there in the world, and it's been videoed from a bunch of directions. That helps a lot: from this you can reconstruct an enormous number of points lying on the object, and where all of those cameras were when they viewed it. I haven't rendered all the points here, because that would make the rendering slow, but you can see where the cameras went and where the points are, and that's by standard methods — this is a complete, established system. You can join those points up (in a second we're going to do that) to make a mesh, and the mesh will give you some idea about how accurate the geometry is. The points look good, we get a nice mesh, and once we stitch the images onto the mesh you can really see we've got a tremendous amount of information about that building. The difference between what I'm showing you now and what people do today, seven years later, is scale.

OK, so that sort of thing can now be done for a quadrangle full of buildings, or a city, or something of that form; here it's just a single structure. And of course we can texture that mesh, and then we have a very cool graphical reconstruction of what it's like, which we could show to other people and which we could use in augmented reality applications. You can see other applications here as well: if you want to blow up a block of downtown Los Angeles, which is a bit difficult to get permission for, you can fly a helicopter over it, build a model, and blow the model up in a movie. And if you want to join a movie sequence of some real live action to the blowing-up of a model, you need to know the camera path, and we can do that as well. So there are tremendous applications lurking behind this.

That's it for reconstruction; I'm going to talk mainly about recognition.

Why do we care about visual object recognition? The answer is: if you want to act in the world, you have to draw distinctions, and those distinctions could be of a very simple kind or of a very complex kind. If you were building a robot, you have this great advantage of vision, which is that it can predict the future: you can look ahead of you, see things you haven't encountered yet, and figure out what would happen. Is the ground soft? Is that person doing something dangerous? Does it matter if I run into that object? Which end of that object is the sharp one? These are really important questions when you act.

Now, for information systems, it's just really valuable to be able to search for pictures, cluster pictures, or look at pictures to understand what they tell you. All of those are recognition functions; you might not need really good recognition, but you need to build descriptions of what's going on to support them. And of course there are more general engineering applications, which I'll demonstrate in a second.

There is this universal fact about vision systems: pretty much any animal that has vision has a recognition system. They are often pretty lousy. Male horseshoe crabs, for example, identify female horseshoe crabs visually, but what they're looking for is a dark square. If you build the right kind of dark square and leave it lying on the floor of the ocean, a line of amorous male horseshoe crabs will build up behind it, because the vision system just isn't up to the job. So you might not have great recognition, but if you've got vision, you've got recognition.

OK, as an example of a more general engineering application of vision — and I believe Shri Narayanan will talk about this on Thursday as well, probably in more detail — imagine you watch a whole bunch of people, and you measure a bunch of other stuff too: you could look at physiological markers, you could listen to the sounds and the speech, and you could watch them behaving naturally. Then you could do a bunch of things. The first is, if they behave in a way you don't want, you could feed that back to them. The other thing is you could screen.

For example, autism spectrum disorder is an affliction where, if you catch it very early, you sometimes have better chances with interventions, so it would be really nice to screen children very early in life, and it would be very nice to screen everyone. What you'd like to be able to do is to say "this child needs to see someone who knows what to do, and this child doesn't", and you'd like to do that in a very low-skill way. Maybe what you could do is observe them behaving and say, gee, they need to see someone who can tell whether they're really affected.

It turns out you can apply that story to in-home care, to care of demented patients, to stroke recovery, to building design, and so on; models like this look as though they're going to be really valuable. NSF has put a bunch of money into this sort of thing under the Expeditions program, and we have hopes that good things will come of it.

Here's another example: you might want to take pictures and simply predict word tags. Why would you like to predict word tags? People like to search for pictures with words, and lots of pictures don't come with words attached. What you might do is look at the picture and say: based on various classification machinery, and on what I know about how words are correlated, and so on, give me a bunch of word tags to associate with the picture. That would be useful. The state of the art in this activity is moderately advanced, and we have very good experimental methods, so we're getting better. If you retrieve images based on predicted word tags, you can get precision estimates in the thirties, which may not sound all that impressive, but ten years ago they were in the three percent range. That's an order of magnitude, which is wonderful, and this is genuinely useful.
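To make that recipe concrete, here is a minimal sketch of tag prediction in Python: one one-vs-rest classifier per tag, plus a crude co-occurrence adjustment. The features, the vocabulary, the cooc statistics and the 0.5 blending weight are illustrative assumptions, not any particular published system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_taggers(X, T, vocab):
    # X: (n, d) image features; T: (n, len(vocab)) binary tag matrix
    # taken from captioned training images. One classifier per tag.
    return {w: LogisticRegression(max_iter=1000).fit(X, T[:, j])
            for j, w in enumerate(vocab)}

def predict_tags(taggers, x, cooc, top_k=5):
    # Score each tag for one image, then nudge the scores with tag
    # co-occurrence statistics (cooc[w][v] ~ p(v | w), assumed learned
    # from captions) so that correlated words support one another.
    score = {w: c.predict_proba(x[None, :])[0, 1]
             for w, c in taggers.items()}
    adjusted = {w: s + 0.5 * sum(score[v] * cooc[w].get(v, 0.0)
                                 for v in score if v != w)
                for w, s in score.items()}
    return sorted(adjusted, key=adjusted.get, reverse=True)[:top_k]
```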

But words and pictures affect one another in much more complex ways, and there are many interesting problems just sort of emerging from the presence of joint word-and-picture datasets. This example is due to Tamara Berg: these are shoes from catalogues, with their descriptions underneath them. There is no existing vision mechanism for saying that the thing in the picture is an "adorable peep-toe pump"; we just don't know how to do that. The first interesting problem that arises is: if you had a whole bunch of catalogues, you might be able to fish phrases out of the text, fish descriptions out of the pictures, and build classifiers that could predict "adorable".

But there's something else going on. If you read these descriptions, they're fairly comprehensive descriptions of the objects, but they don't tell you what colour the shoes are, and they don't tell you what colour the stitching is. The reason they don't say that is that it's blindingly obvious from the picture; there's no point. But from our perspective, if we're looking for things, or searching for things, or doing things like recommending things to customers, being able to pool information jointly from a picture and a description might add real value.

OK, so, getting to the end of this summary of vision, and then I'll show you some stuff about recognition. I was asked recently to describe what every vision person should know, and it's useful because it gives you a flavour of the discipline.

The big thing is that vision is really useful, it's really hot, and it's still really poorly understood. It's very helpful to know a bunch of techniques. It's also very helpful to have a bunch of scepticism: in hot, poorly understood disciplines there is always somebody who comes along with a revolutionary new solution — they come along every five years or so, and then they go away — so a moderate degree of scepticism is valuable. Opportunism is valuable too.

Vision is difficult because you need to know a lot of stuff, and there's a lot of evidence that knowledge of any one thing doesn't seem to help much; there really are a lot of different ideas that are just sort of boiled together, and we'll see some of them. However, the main thing is to know the general principles of vision, which you can deduce from evolutionary examples and from what has been successful in computer vision, and which are due to come on the next slide. There aren't any. It's not a subject that has general principles; it's just one of those things, and anybody who offers you a general principle is either a fool or a liar, and you can make your own judgement about which.

So now I'm going to set up a series of discussions about the state of the art in recognition. I like to do this with the conclusion first, because then we know where we're going. The first conclusion is: object recognition is subtle, but we actually have really strong methods that work really quite well, based on classification.

Rather loosely, we could believe the following about object recognition. The object categories are fixed and known — this is a cat, that's a cow, that's a motor car; every object belongs to exactly one category, and there are K of them. You can get good training data — I've got a hundred pictures of cats, a hundred pictures of cows, a hundred pictures of motor cars. Then object recognition sort of turns into K-way classification, and it will turn out that detection turns into lots of classification tasks too. In that belief space, which has been very valuable, there's a natural programme of research: you bang together a bunch of features, you do model fitting with classifiers, and you produce a representation. That strategy has been amazingly effective.

We're quite good at features. The summary of about ten years' work on features comes down to two really important points. One is that features need to be illumination invariant: when the lighting changes, the features shouldn't change all that much, and there's an easy way to get that, which is to look at the orientations of image gradients. The second big principle is that the object is never quite where you think it is in the image; it's always shifted around a little bit, which means that if you look at the image gradient at one particular point, you're not going to do well. Instead, you want to look at local pools of image gradients, or histograms of orientation. And it turns out that if you take those two principles and bake them into a feature in a fairly natural fashion, you get HOG and SIFT features.
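As a rough illustration of those two principles — orientation histograms for illumination invariance, local pooling for positional slop — here is a toy HOG-flavoured descriptor. The cell size, bin count and normalization are simplifications, not the published HOG or SIFT recipes.

```python
import numpy as np

def hog_like(gray, cell=8, bins=9):
    # gray: 2-D float image. Gradient orientations pooled into
    # per-cell histograms: orientations give illumination invariance,
    # pooling gives tolerance to small shifts.
    gx = np.zeros_like(gray); gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]   # central differences
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi           # unsigned orientation
    hc, wc = gray.shape[0] // cell, gray.shape[1] // cell
    desc = np.zeros((hc, wc, bins))
    for i in range(hc):
        for j in range(wc):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            b = (ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
                 / np.pi * bins).astype(int).clip(0, bins - 1)
            np.add.at(desc[i, j], b, m)        # magnitude-weighted votes
    # per-cell normalization: rescaling the lighting barely changes this
    desc /= np.linalg.norm(desc, axis=2, keepdims=True) + 1e-6
    return desc.ravel()
```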

I've shown here, for a series of different pictures — on the one side (sorry, I get left and right mixed up) you'll see a woman with a bicycle, and shown next to it is a HOG-style representation. Each of those little blobs is basically a histogram of gradient orientations in a little pool. What it's saying is that at the top of the image, the gradient orientations go in pretty much every direction within a local pool; but when we get down to the sides of the woman, there are lots of gradients that run parallel along the side of her body, and there's strong contrast around the bicycle. Again, in this scene with the traffic, you can see that in the trees the gradients go in all directions, but around the cars they're organised. And in this picture of the bicycle down at the bottom, you can see the rough structure of the wheels and the frame reflected in those patterns of orientation. Essentially what we do is take this information and bang it into a classifier.

When we do this, we get really quite good results. We're rather good at this kind of K-way classification, running up to K of a couple of hundred; when we get into the tens of thousands, things get very interesting, but we'll set that aside. There are standard datasets for investigating methods and features. Caltech 101, for example, is a set of pictures of one hundred and one different categories, picked somewhat at random from a selection of useful-looking categories, and the main thing here is the error rate, which is now likely about twenty percent. If you stick a picture of an isolated object from the Caltech 101 list of objects into a good modern method, you're likely to get the right name out. If the collection of categories you know about is somewhat bigger, you are not as likely to get the right answer out — the accuracy runs up to the fifties if one's very lucky and has lots of training examples — but you've still got a really good chance of getting the right answer. So there are some problems we can do quite well.

And this machinery extends to really very complicated and non-obvious judgements. You can extend these features to work in space-time. What people do now is take movies and get the script of the movie, which has been marked up with time codes by enthusiasts on the internet. They time-align the two, and then say: OK, here's an action description in the script; look for features around that moment that are distinctive; train a classifier; and then run it on something else. You can get really quite effective action spotting like that, for complex actions like answering the phone, getting out of a car, hugging, kissing, sitting down.

On the top row of the slide are a bunch of true positives; on the second row, a bunch of true negatives; on the third row, some false positives. If you look at the answering-the-phone false positive, for example, the guy on the bed leaning to the side looks as though he could be sitting on a bed answering a phone; he just doesn't actually have a phone in his hand. And then of course there are also people answering the phone in unusual circumstances, which get missed. So this machinery extends to really quite complicated judgements.

This machinery can also be used for detection. The way you detect with a classifier is: imagine I have a picture with some interesting things in it that I want to detect. What I'm going to do is take a window of the image, correct the illumination, estimate the orientations, put that window into a classifier, and have it say yes or no. Then I'll go to the next window, say yes or no there, and keep doing that until I've found the best detection responses; if they're good enough, I report them. If I want to find a big instance, I'll make the image small and search it with a fixed-size window again; if I want to find a small one, I'll look at a very high-resolution version of the image.
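A minimal sketch of that recipe, assuming a window classifier score_window trained in advance (say, a linear classifier over HOG features of the window); real systems follow this with non-maximum suppression so that overlapping responses to one object collapse into a single detection.

```python
import numpy as np

def resize_nn(img, h, w):
    # Nearest-neighbour resize; stands in for a proper image resize.
    ys = (np.arange(h) * img.shape[0] // h).astype(int)
    xs = (np.arange(w) * img.shape[1] // w).astype(int)
    return img[ys][:, xs]

def detect(image, score_window, win=(128, 64), stride=8,
           scale_step=1.25, threshold=0.0):
    # Scan a fixed-size window over an image pyramid: shrinking the
    # image finds big objects; the full-resolution level finds small ones.
    detections, scale, img = [], 1.0, image
    while img.shape[0] >= win[0] and img.shape[1] >= win[1]:
        for y in range(0, img.shape[0] - win[0] + 1, stride):
            for x in range(0, img.shape[1] - win[1] + 1, stride):
                s = score_window(img[y:y + win[0], x:x + win[1]])
                if s > threshold:
                    # map the box back into original-image coordinates
                    detections.append((s, y * scale, x * scale,
                                       win[0] * scale, win[1] * scale))
        scale *= scale_step
        img = resize_nn(image, int(image.shape[0] / scale),
                        int(image.shape[1] / scale))
    return detections
```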

This recipe, again, is amazingly successful: we are really quite good at detecting moderately complicated objects. The standard detector has some additional complexity attached to this description; the additional complexity is those little yellow boxes, which are movable parts.

If you look at these rows — each row displays the behaviour of the standard detector on a different category; sorry, I'm getting my rows and columns mixed up — the first row is human detection, the second row is bottle detection, and the third row is car detection. In the first row you'll see that the guy about to step in front of the train has had a little light blue box placed on top of him, with yellow part boxes. Then there is a big group of people which has been incorrectly counted — one of them is missed — but most of them have boxes on top of them, and we know that there are people there. In the third column of the first row, you see somebody hiding behind a bush; he's had a box placed on top of him, and the obvious Monty Python joke is so obvious I'm not going to make it. As my colleague likes to say, it's clearly a cutting-hedge detector: the detectors aren't perfect, and that bush has been marked as a person.

In the second row, you'll see one of the best bottle detectors going. We're pretty good at detecting bottles; we can find them even if they're in people's hands or on tables. But we get bottles and people mixed up, for quite good reasons: detectors really like strong, identifiable, high-contrast curves; people have them around the head and shoulders, and so do bottles, and they tend to look the same. So humans and bottles often get mixed up. We're also very good at detecting cars, and in that case they get mixed up with buses, which is not unreasonable either. The detector I'm referring to here is sort of the standard technology: you can download and run the code; it's all very established, and it's widely used.

A problem with the belief space about recognition that I described is that it is beginning to come apart at the seams, because most of the beliefs are obviously not true. Objects belong to multiple categories, and good training data might be very hard to get, and that presents serious problems.

Here's one example; I think this one is all mine. When I ask audiences what this is, they usually go into vapour lock at about this point, because they know I'm going to get them from the side, but they don't know from which side. If you look at these, depending on what you please, they could easily be monkeys or not: the first one is in fact a monkey; the second and the fourth aren't — I had to check this, I'm not that good on primate taxonomy, but most of these are apes — and the one on the bottom row in the second column is a little plastic toy. So the whole point about categorization here is that the concepts overlap: a thing can belong to more than one category at the same time, perfectly reasonably.

So what we've inherited from the point of view I described is a tremendous amount of information about feature construction; we're really good at building, managing and using classifiers; and a lot of practice improves things. But there are these really evil subtleties out there, and the next thing is to describe some of the efforts to deal with them.

So the big questions, the really big questions of computer vision that are in play right now, are these. What signal representations should we use? That's sort of at the early level, before you get to the classifiers and the learning stuff. To what extent should we use models — what aspects of the world should we represent, and how should we represent them? And then the other one: what should we say about pictures? Those three questions are really very difficult indeed. So let's start looking at the coming technologies and the nasty problems.

One big issue is the unfamiliar. The recipe I described really just doesn't deal with the unfamiliar. Let me show you a little movie of somebody doing something. Almost certainly you've never seen people doing this before; it doesn't happen every day. And at the same time, it doesn't really present you with any problems: you might not have a word to describe it, but you know what's going on, and that's fine.

Here's another, more extreme example of something you really don't see every day; but you can still watch it, and it's just obvious what's going on — and at this point even the donkey is accustomed to it. You don't treat this as baffling just because you don't have training data. You can deal with the unfamiliar in satisfactory ways, and you've probably put together in your mind a little narrative of what's going on and why they're doing what they're doing, and it's all fine, and you can get on with life.

Now, that's a really baffling thing: from the perspective I described to you, we just have no approach to it.

There are methods you can use: you can take the stuff I described and rewrite it. Here is an architecture that people are using quite a lot. I take a picture, I do the feature computation and so on, and instead of building classifiers that say "bird", I build a bunch of classifiers that say the picture has a beak in it, and it's got an eye in it, and it's got feathers in it, and it's got a leg. The reason I would do that is, if I ran into something else, I might not know what it was, but I could say: OK, it's got feathers, so it might be a feather duster or a bird; it's got a beak; and so I can say something useful about it.

This is kind of neat, because you can then build systems that can make predictions for objects they've never seen before, where they haven't even seen that type of object. On the slide, the little yellow boxes are the spatial basis of the predictions in the image, and underneath them are the predictions: the rather baffled-looking man here, for example, is reported as having a head, having an ear, having a snout, having a nose and having a mouth. We'd be able to say something useful about something we'd never seen.
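A minimal sketch of the attribute idea follows, under assumed toy data: the attribute names, the category signatures, and the classifiers are placeholders for what real systems learn from labelled images.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical attribute vocabulary and per-category signatures; real
# systems learn both from labelled data.
ATTRS = ["has_beak", "has_eye", "has_feathers", "has_leg"]
SIGNATURES = {"bird": [1, 1, 1, 1], "feather_duster": [0, 0, 1, 0]}

def train_attribute_classifiers(X, A):
    # X: (n, d) image features; A: (n, len(ATTRS)) binary attribute
    # labels. The K-way classification recipe, reused per attribute.
    return [LogisticRegression(max_iter=1000).fit(X, A[:, j])
            for j in range(A.shape[1])]

def describe(clfs, x):
    # Attribute probabilities for one image: a description that makes
    # sense even for a category we have never seen an example of.
    return np.array([c.predict_proba(x[None, :])[0, 1] for c in clfs])

def closest_category(attr_probs):
    # Compare predicted attributes to per-category signatures, so
    # categories without any training images can still be named.
    return min(SIGNATURES, key=lambda c:
               np.abs(attr_probs - np.array(SIGNATURES[c], float)).sum())
```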

It's harder to get these predictions right. You can see on that example over there: it's got a tail, it's got a snout, it's got a leg; it also says it's got text on it, and that it might be plastic. It says it's got text on it because text is characterized by little dark and light stripes next to each other, and plastic is characterized by uniform bright patches. So these predictions are hard to make, but you can make them.

The other neat thing about this architecture is that if you happen to have seen lots of birds, it's relatively straightforward to add something else that says: OK, this really is a bird — and again, that fits the whole classification recipe I described. If I can say that, I can also look at the list of attributes and say: well, gee, it's a bird, but something's missing, or something's extra. Known objects — things that I know about, whose names I know — can be unfamiliar by being different from the typical, and if they are different from the typical, it's worth mentioning. We can build systems that do that as well: essentially, if we're really sure it's the object, and we're really sure it has a missing attribute or an extra attribute, we can say so.

Here I have a bunch of examples from one recent system. The semantics of the attributes are still somewhat messed up: the dog down there is reported as not having a tail — not because there's compelling evidence that it has no tail, but because we can't see it; that little detail hasn't been sorted out. That aeroplane is reported as not having a jet engine. And, gloriously, that sheep is reported as not having wool; what it has, in fact, is been shorn.

You can report extra stuff as well. Again, there are questions about the semantics that need to be sorted out: the dog in the little yellow box on the end there is reported as having an extra leg — no dog ever has an extra leg, actually, so one should have some more complex interpretation sitting on top. But there's a bicycle with a horn, and an aeroplane with a beak, and a bus with a fin; these are well within the sort of extra, special features of objects, and we can report them.

One nice thing about this: there are technologies emerging that identify regions in images that would like to be objects. If a region would like to be an object, then what we can do is take the attribute machinery, attach it to that region, and report a description. That sort of thing is being discussed in the hallways, but doesn't exist yet.

The second interesting and disturbing thing about modern vision is what we call visual phrases — meaning comes in clumps. I talked about object recognition as something where you spot individual objects, but it's really hard to talk sense about what it means to be an object. If you look at this picture, you could think of this as one object, because if you fish around in your head you can come up with a single word to describe it: it's a flautist. But is it one thing or two? Should we cut her off from the flute, and think of the person as a person and the flute as a flute? That way lies madness, because we can also find a head inside it and say it's a head, find a jacket and say it's a jacket, find shoes inside it, and so on. So what we might want to do is just accept that there is a chunk of meaning at a scale represented by what many people would think of as at least two objects.

As a precedent for this, think of a common notion in vision: that of a scene. A scene is a likely stage where particular kinds of objects and particular kinds of activities might occur — things like ballrooms, or greenhouses, or playgrounds, or bedrooms. And we're really quite good at classifying scenes. You can use the procedure I described previously — get a bunch of labeled images of scenes, compute some features, bang them into a classifier — and it turns out you can be really good at saying that's a picture of a bathroom, that's a picture of a ballroom, that's a picture of a classroom.

The advantage of doing that is that you then have some idea of the kinds of things that might happen. We've known since the early nineties that if you get the scene right, you can predict where to look for objects, although you can't get it exactly right. Here are two examples from Torralba's work. One is an outdoor scene, where we predict, on the top row, that the buildings are sort of at the top, the street is at the bottom, the trees are vertical and might be in front of you, the sky tends to be at the top, and the cars will tend to be at the sides and in the middle. I'm not sure that all of these predictions are right — there aren't any cars in the image — but they tell you where to look for cars if there were any, and that seems to be helpful.

So, thinking about scenes: we've currently talked about meaning as coming in clumps at two scales. One scale is the scene, the whole image; the other is individual objects, although we're a little hazy about what it means to be an object. And very recently there has come good practical evidence that there might exist useful clumps of meaning between the scene and the object, and these are referred to as visual phrases. They're composites, where the composite is easier to recognise than its parts.

One useful visual phrase is "a person drinking from a bottle". It turns out that it's much easier to detect a person drinking from a bottle than it is to detect a person or to detect a bottle, because people who drink from bottles do special things: they hold the bottles in particular ways, they adopt special configurations, and so on. The same goes for things like a person riding a bicycle: it's much easier to detect the whole person-riding-a-bicycle than it is to detect the person and the bicycle and then reason about spatial relations, because the appearance is constrained by the relation.

Once you make this observation, you get into a serious mess about what to report about an image. We might build a person detector, we might build a horse detector, and we might also build a person-riding-a-horse detector. We then have to figure out which, if any, of them is right: if we're really lucky, the person-riding-a-horse detector will report in the same place as the person detector and the horse detector, and we have to figure out just how many people, just how many horses, and just how many people riding horses there are. So what we do is rack up a whole bunch of detectors and then go through a second phase, currently referred to as decoding, where we say: based on all of the evidence from all the detectors, I'm willing to believe you, and you. That judgement is again a discriminative judgement: we essentially take the responses of nearby detectors, report them to the current detector, and construct a second classifier which decides "should we believe you?". And you can get quite good answers out of that procedure.
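In outline, that decoding step might look like the following; the 0.3 overlap threshold and the dictionary-based detection records are illustrative assumptions.

```python
import numpy as np

def iou(a, b):
    # Overlap of two boxes given as (x0, y0, x1, y1).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def context_features(det, all_dets, n_categories):
    # For one detection: its own score plus, per category, the best
    # score of any overlapping detection. A person-riding-horse
    # response overlapping a person response is evidence for both.
    f = np.zeros(1 + n_categories)
    f[0] = det["score"]
    for other in all_dets:
        if other is not det and iou(det["box"], other["box"]) > 0.3:
            c = other["category"]
            f[1 + c] = max(f[1 + c], other["score"])
    return f

# A second, discriminative "should we believe you?" classifier (for
# example, logistic regression) is then trained on these vectors
# against true/false detection labels, and its output replaces the
# original detector score.
```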

It turns out this helps quite a lot. In the top row of pictures are detector responses without any decoding, without a global view of what's going on: you can see a sofa and a bunch of people, and the dog scores very low. If one then looks at the totality of detector responses — which includes more than the top-scoring few — and tries to find a consistent selection, then you get a sofa; and because you've got a sofa, there's a fair amount of evidence that you've got a dog lying on the sofa, because you've got something that looks a bit like a person but more like a dog; and that over there is also a dog. You can significantly improve detection procedures with this kind of global view.

Another thing that gives a global view, and significantly improves detection performance and scene understanding, is geometry. If we know something about the geometry, we can really improve detectors. On the one side, with the blue line on it, I have an image with the horizon marked, and I want to build a pedestrian detector; you can see the boxes around pedestrians and cars. Now, the thing about the horizon is that in perspective cameras, things that get closer to the horizon from below must be smaller, or else they're bigger in 3D. What that means is: if I want to detect a pedestrian, and I think it's a big one, it has to be lower in the image, and the small ones have to be higher. Furthermore, if I get some pedestrian detector responses, I can look at them and say: well, the big ones are here and the small ones are there, and that helps me estimate the horizon. And if I estimate the horizon and my detector reports jointly, I can get much better responses.
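The underlying geometry is worth writing down. For a camera at height h_c above a flat ground plane, a person of height H whose feet project to image row v_f has image height roughly (H / h_c) * (v_f - v0), where v0 is the horizon row and rows grow downwards. A sketch of the resulting consistency test, with nominal camera and person heights as assumptions:

```python
def expected_height(foot_row, horizon_row, camera_h=1.6, person_h=1.7):
    # Ground-plane perspective: image height of a standing person is
    # about (person_h / camera_h) * (foot_row - horizon_row).
    return (person_h / camera_h) * (foot_row - horizon_row)

def consistent(top, bottom, horizon_row, tol=0.4):
    # Keep a pedestrian box only if its size fits the horizon;
    # "pedestrians hovering in the sky" fail immediately.
    if bottom <= horizon_row:    # feet above the horizon: impossible
        return False
    h = expected_height(bottom, horizon_row)
    return abs((bottom - top) - h) < tol * h
```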

For example, on the top row are the raw local detections — the yellow ones are pedestrians, the green ones are cars — just tested against a threshold, and there's a sort of band of spurious pedestrians hovering in the sky. But from those responses and other detector information, we can estimate a horizon. Pedestrians have their feet on the ground most of the time, and that rules out all those false positives up there; and it rules in some small detections close to the horizon, because they're about the right size. Similarly, if we look at a scene with cars and people in it — this is the one at the bottom — by estimating the horizon, several detector responses, the little dotted red ones for pedestrians, have gotten back in, because we know that even though the image data didn't look all that great, the detection really is the right size in the right place to be a pedestrian, and that gives us just a little bit more confidence.

Now, geometry is wonderful stuff. Here are some of the geometric estimates that are making detection better right now. One thing is that you can pretend that the room is a box. Using a variety of standard methods, you can then estimate the box, even if the room isn't exactly one; and when you estimate the box, you can get some idea of where the floor is. Over there we've got a room with a box painted on it — you can see the box isn't quite right — but nonetheless, because we've got the box, we can figure out what the walls look like, what the floor looks like, and what the ceiling looks like. The red is one wall, the yellow is another wall, the green is the floor, the blue is the ceiling, and the purple is stuff that is none of the above, what we call clutter: things that you might bump into, and such.

So, first, we've given an account of free space. But another thing you could do is take the box and say: because I know the box, I can use standard methods to ask what the faces of boxes inside the room would look like if I looked at them frontally. This lets you build a better detector. It turns out that the people who did this, Varsha Hedau and colleagues, actually have the world's best bed detector, which sounds like a slightly eccentric thing to have, but there's a principle here, and you'll see it being useful in a second. If I want to build a good bed detector and I just look at images, I have to deal with the fact that the bed might appear at different orientations, and because it appears at different orientations, it's going to look different. But if I know the box of the room, I can say: beds' axes align with the room, and they typically have one face against a wall; therefore I'm going to rectify with the box of the room so the faces of the bed are frontal, and I can now remove one source of ambiguity in my features and build a better detector.

Now, the thing that's nice about that is: when you know where the beds are, you know something about where the room is, because beds do not penetrate the walls of rooms. So what I can do is estimate the room and the beds simultaneously, and come up with quite good estimates as to where the furniture is and what the free space of the room is. Here at the top you see an estimated box; in the middle you see a bed that's estimated without re-estimating the box; and at the bottom you see a joint estimate of bed and room box. The joint estimate is somewhat better — it's sort of three or four percent — but it's progress.

The nice thing about boxes is that you can do other things with them as well. Very recently, Kevin Karsch has shown that if you know the box of a room, you can figure out where the lights are, and you can figure out what the albedo is on the sides of the room — whether it's black or white, or green or red. And if that's the case, and you know where the lights are, you can stick synthetic stuff into the room. So, going backwards and forwards: we've put some pieces of computer graphics stuff into the room, and you'll notice that statue is behind the ottoman, and as a result it's occluded, and the lighting behaves as it should.

The thing about this which is kind of fun is: if you can do it for a static thing, you can do it for moving stuff. Here's a picture of a billiard room from Flickr — and you can just play billiards on the table. Here's another picture from Flickr; everything I'm showing you comes from a single picture. And here's another picture from Flickr, where a little glowing ball has managed to get into the picture and is going to explore it. You'll notice it gets reflected in the mirror, it casts shadows the way it should, and when it flies under the table, the light dims the way it should.

So these kinds of simple geometric inferences can support amazing functions. The usefulness of this is pretty obvious: you can stick furniture into pictures of your living room; and, if you're inclined to do such things, you can shoot aliens in your dining room in a computer game.

Let's look at the last sort of big, puzzling principle that's kind of emerging in modern vision, and that is selection: what should we say?

A couple of years ago, Julia Hockenmaier and colleagues went out and collected a whole bunch of images, and then set them on Mechanical Turk. They got people who were pretty qualified English speakers — this is kind of important, otherwise things get a bit funny — and asked them to write a sentence about each picture. That way you get multiple sentences about a single picture, and you can look at the sentences. The striking thing about those sentences is what they select. People presented with this picture talk about two girls sitting and talking; one of them is holding something; they mention they're wearing jeans. But they don't talk about the step; they don't talk about the specular reflections in the window at the back of the image; they don't talk about the two people in that window; they don't talk about the chewing gum on the ground. They're capable of looking at this thing and saying: this is important, this is what's worth mentioning, and this isn't.

Now, understanding that is terribly important, and the reason it's important is that pictures are all about selection. If your model is to report every object in the picture, then you're dead, because your report is too big. So we need to know what's worth saying.

We can do some of this. There is a fair amount of work on predicting sentence-level descriptions of images or video. For example, Abhinav Gupta and colleagues took video of baseball games, and they used methods similar to the discriminative methods I described to identify who's pitching, who's catching, who's running. They also built a little generative model of baseball — essentially, you can do this, and then once you've done this, that could happen, or that could happen. You can think of it as being represented by a tree of events, plus some structural rules that allow you to rearrange the tree. Then what you do is say: OK, I've got these detector responses, and these are the structural rules of the game; let me generate a structure that explains those responses. And of course, if I can generate that structure, I can generate things that, without close inspection, look like descriptions of the game.

Now, no sportscaster would emit something like "the pitcher approaches the ball before pitching it; and then simultaneously the batter runs to the base and the fielder runs towards the ball; the fielder catches the ball". It's not the way people talk. But at the same time, it is a description of what's going on — one you could use to produce something sensible — and it's a fairly detailed description of what's happening.

We can generate sentences for ordinary pictures too, although it's still a bit rough and ready. There are methods that essentially say: I'll go from image space to some sort of intermediate space of detector responses; then I'll go from sentence space to that same intermediate space; and then I'll try to align sentences and images in that space, and report the best matching sentence.
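In outline, that matching step might look like the following; the projections img_map and sent_map stand in for whatever learned mappings carry each side into the shared meaning space.

```python
import numpy as np

def best_sentence(img_feat, sent_feats, sentences, img_map, sent_map):
    # Map the image and each candidate sentence into a shared space of
    # detector-like responses (for instance <object, action, scene>
    # scores), then pick the sentence whose meaning vector matches best.
    m = img_map(img_feat)
    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)
    scored = [(cosine(m, sent_map(f)), s)
              for f, s in zip(sent_feats, sentences)]
    return max(scored, key=lambda t: t[0])[1]
```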

The kind of results one gets are shown here. For the top picture, the detectors opine something like "sheep sleeps on ground; animal sleeps on ground" against a gold-standard human sentence, and the kind of sentence one sees generated matches reasonably well. People mark things up in variable ways, to say the least. You might also get "cow in grass field", which is not right — it's a sheep — but, you know, it's not a bad guess. The third one down, "a man stands next to a train on a cloudy day", looks like a wonderful match until you read the image more closely and notice that it's actually a woman. So you can make minor mistakes, because sentences are really condensed sources of information, and sometimes you make howlers: this is not, in fact, a laptop connected to a black monitor; there really isn't all that much right on that one at all.

More recently, Tamara Berg and colleagues enriched this machinery significantly by joining it to the machinery about attributes, and were able to produce descriptions of pictures that look like this: "there are two aeroplanes; the first shiny aeroplane is near the second". Again — and this should be fairly obvious from the examples — we're not really in the sentence generation business; but if you did do sentence generation, you can see there's enough meaning extracted from the image that you could turn it into a reasonable form. "There are one dining table, one chair and two windows; the wooden dining table is by the wooden chair, and against the first window." The kind of objection you would raise to that is that it's too much information with no selection, as opposed to: it's wrong.

OK, now I'm going to show you a movie to illustrate how far this issue of selection seems to go in human vision. It's a fairly wrenching movie, so the first thing is just to warn you that nobody was hurt. Watch it once, and then we'll think about it. It's clearly a surveillance movie on a train platform... and that's about where it gets interesting.

OK, here's the question: how many adults were on the platform, and what were they doing? Right, you don't know. Audiences always give a variety of answers, somewhere in the two-to-seven range; it's just not in your head. You look at that thing, and it is clear what's important and what's not important, and you're really good at homing in on what's important. And the important stuff looks like: what outcome do we expect? How do other people feel? This feeling business is not just because we're nice people and we care about how other people feel; it's because it gives you a really good idea of what they're going to do next, which matters a lot to us. And, of course, what's going to happen to the baby.

Let me show the whole sequence again. Nobody was hurt — the child was not hurt; it says something about how good prams and baby carriages can be. I wouldn't show this if the child had been hurt; it's quite a well-known clip. The baby carriage ended up upside down and was pushed along; the child was annoyed, but not seriously damaged. And if you look at this, your ability to predict the behaviour of that woman who just nearly threw herself in front of the train is pretty good: she's going to react in kind of a strange way for the next ten minutes. What you do is: you look at this, you identify what's important — we're not going to notice this guy, because he isn't important — you build a little narrative around it, and you focus on the important parts. We don't know how to do that. We are trying, but we don't know how to do it yet.

So here are some of the crucial open questions as we move towards the end. One is dataset bias. A distinctive feature of vision is that frequencies in datasets misrepresent applications, for a whole bunch of reasons: the labels are wrong; the things that are chosen to get labelled are not uniform; people collect things in very specific ways. And this is not a charge of misconduct — nobody goes out there and does wicked things with data collection — but it's a real issue.

The bias is pervasive, and we know it's a big deal in vision datasets because Antonio Torralba and Alyosha Efros produced a wonderful paper this year which proved that a good classifier can tell which dataset an image came from — which is very scary news indeed. And if you're a smart vision researcher, you can do it too, very quickly: there's a little test there — for each of the pictures, which dataset does it come from? People run at about sixty to seventy percent; classifiers are a little bit weaker.
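The experiment is easy to sketch: train a classifier to name the dataset, and read off the cross-validated accuracy. Everything here (features, datasets) is assumed input; the point is only that accuracy far above chance reveals learnable bias.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def name_that_dataset(features_by_dataset):
    # features_by_dataset: list of (n_i, d) arrays, one per dataset.
    # Train a classifier to predict which dataset an image came from;
    # accuracy far above chance means the datasets carry distinctive,
    # learnable bias.
    X = np.vstack(features_by_dataset)
    y = np.concatenate([np.full(len(f), i)
                        for i, f in enumerate(features_by_dataset)])
    return cross_val_score(LinearSVC(dual=False), X, y, cv=5).mean()
```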

Size doesn't make bias go away. If you get a really big dataset, that doesn't mean it's an unbiased dataset, and it might make things worse, because you might become complacent. When I collected these pictures from Google, it had about twenty-three million pictures of lions; here are the top however-many. You might think they're unbiased, but have a close look at the kinds of things you could deduce from these pictures about lions. There are two pictures of lions on horseback; there's a lion lying down with a lamb; there's another one with a person putting a hand on it; and there's a lion with a person as well. That's on the first page. If you used that as your resource of lion information, you'd be in serious trouble, because that's just not what lions are like. This is an effect of editorial bias: people are more interested in weird pictures of lions than in common ones.

The problem is that this blows huge holes in what we know about machine learning. Machine learning is based on a form of induction that says the future is going to be like the past; if you can't make the future like the past, then you've got a problem, and current machinery just doesn't speak to this issue. There is good reason to believe this issue is pervasive in object recognition: the world cannot be like the training dataset, because many things are rare — that's why unfamiliar things are common, and why we must deal with them. And of course, if many things are rare, then this exaggerates bias.

Gang Wang produced a little histogram that asks: for all the objects in a marked-up dataset that's common in vision, how many instances of each are there? There's a small number of objects that have, you know, four to five thousand instances, but very quickly you're down in the tail, and after that most objects appear two or three times in the dataset. Most objects are rare. This should be a fairly familiar phenomenon to this audience, but it wasn't really an issue in vision until recently.

There are several things you might do about bias. You could think about appropriate feature representations; what I described about illumination invariance is one form of this — if your features are invariant to illumination, then the fact that your dataset is biased with respect to illumination just doesn't matter. Another thing you might do is build appropriate intermediate representations, such that from those intermediate representations you might be able to make unbiased estimators of classifiers, even for objects that are rare; that's one way of interpreting the attribute machinery. And the other thing is, if you have good representations of things like geometry, you just might be able to escape the effects of dataset bias.

So, my last conclusion, and then we're almost done: object recognition links to utility in complex ways that are not terribly well understood yet.

The biggest question in computer vision right now is: what should we actually say about visual data? A picture goes into a recognition system, or a video goes into a recognition system, and the question is what should come out. One answer is a list of everything that's in the picture. That's a silly answer: there are too many things in the picture. If I look at this room in front of me, it's silly to be describing the nut on the bolt that holds up the emergency exit sign. So the right answer is "a useful representation of reasonable size", which is a lousy answer, because we don't know what it means to be useful, and we don't know how to make the size reasonable.

It seems that object categories depend on utility. When I talked about that monkey — which could also be a plastic toy — whichever other category it occupies really just doesn't matter, because we're not that interested in it, so why care? If you look at this little fellow, who turned up in my slides recently — somebody pointed out that that's a beer bottle — you could think of him as a person, or a child, or a beer drinker, or a tourist, or an obstacle, or a potential hazard. Depending on what you're doing, that object occupies a wide range of different potential categories.

So what I've talked about suggests the emergence of a new belief space about object recognition; we're sort of heading in this direction, and it looks as though it's going to be interesting when we get there. In the new belief space, categories are really fluid: they're opportunistic devices to aid generalisation; they're affected by your problem and by utility; and things can belong to many categories. Some people would refer to this as a cellphone, or a smartphone; if I fling it into the audience, it turns into a projectile immediately, and the fact that it was a smartphone would have nothing to do with whether it was a projectile. So the same instance can belong to different categories — or, sorry, at different times it can belong to different categories.

When we talk about objects being special within their category, that's meaningful: it's not like all birds are the same. Some are interesting because they're missing tails, others are interesting because they have special feathers, and other birds are interesting because they're inside this room flying around, as we had just before the talk. Many categories seem to be rare, and many characterisations might be personal: I might think about some things differently than you do, and if we don't talk about it, it really just doesn't matter.

And in turn, that suggests that recognition is not really just discrimination: it's constantly coping with the unfamiliar, in the presence of massive and unreasonable bias, and we need new tools and machinery to do it.

So I'm done; I've worked through my major points. It remains only to point out that if you want more information, you can get it — but if somebody tries to sell you the one with the brown cover, don't buy it, because that's the first edition and that's ten years old; the second edition appeared physically in November. So they do exist, they're around, and they have fairly up-to-date information about the state of recognition. And, thanks: what I've described has been supported by numerous agencies and organisations, including the Office of Naval Research and the National Science Foundation.

Q: Just a quick question about size. The case where the person was misrecognized as a bottle — is that a misrecognition where we could just go, well, that's something at the wrong scale?

A: So, size: it is really difficult to tell how big something is.

We know that people are amazingly good at making size judgements; the main literature about this is mostly about describing the things that they get wrong. We don't know how they do it, and we don't have methods right now in computer vision that can do size estimation satisfactorily. One reasonable resolution to the person-bottle muddle is: you know bottles are just a lot smaller than people. But how do you know how big the thing you see is, in an absolute sense? One answer might be: I look at some kind of big-scale geometric context around it, I use that to make some estimate of the camera and of where things are, and that tells me something about the sizes; and if I get really gross size mismatches, I can say, no, that isn't going to work. Right now nobody can do that in a satisfactory way; I would regard it as something that's in the air, coming. I would think in three or four years' time we might do factor-of-two size judgements moderately well. More detailed size judgements, I think, are still very mysterious. They do require putting together a whole bunch of contextual machinery, because of the scaling effect of perspective: what looks like a small object in an image might just be a massive object a long way away, so you need some notion of the space that it occupies. And that's one of the attractions — by the way, one reason I wanted to show you that fun movie of the thing moving around in the room is that when you have that degree of understanding of space, you probably can make size predictions, and you could use them to drive recognition; but as far as I know, there's nothing right now that does that.

Q: The datasets are biased — unfairly biased — towards things that are interesting, and I'm wondering why in computer vision we don't use the datasets as the vocabulary from which to describe images. The bias is obviously something that people are drawn to, and it seems that the data itself could be the vocabulary with which you describe: you describe an image in terms of its representation in this huge dataset.

A: So — I think this is really a question of setting, because different agendas react to this very differently. If you think about computer vision as something you do when you stick a camera on your head and walk around the world, then the lion example I showed you is a disaster. But if you think about computer vision as something where what I do is use Google Images to interpret more Google Images, the whole issue of bias is just not an issue, because one is a fair sample of the other.

There is very little explicit writing about what you're referring to, but there is a lot of work that implicitly takes it into account. Much of what I've talked about in recognition actually involves some interesting use of iconographic convention, which is a way of talking about how we talk about what we photograph, and we don't have a good enough understanding of that issue to be able to talk about it clearly. There are two kinds of convention. One is "the lion is interesting": this one's cute, that one's riding a horse. And the other is "we really tend to photograph lions head-on": you won't see all that many pictures of, say, a lion photographed in three-quarter view from behind with the shoulder dominating the picture. One kind of iconographic convention is different from the other: one of them, if you like, is interestingness in terms of properties — it's semantic stuff — and the other is characteristic framing. We just don't have the language to separate those two and talk about them sensibly. But again, I think it's very much on the agenda, because if you really want to learn about the world from Google Images, you're going to have trouble, and we know that we don't really have a handle on it. So that's a bit of a rambling answer, but it's the best I can do.

Q: A comment on what you said about utility — about how what matters in a picture depends on the utility, on the point of view. It seems that when you gave the image today, the image of the two girls, to several people, they came up with pretty much the same description; so there seems to be a sort of baseline utility which is context independent. I was wondering if you could comment on that.

A: I think you're right.

There's a fair amount of experimental work on what people select to mention. The situation is a little bit murky, because it's hard to do the experiments exactly right and to be precise about them, but the evidence suggests there are kinds of things that dispose people to mention them. We're really interested in people, to begin with, and you can explain that, because people have the potential to affect you: when you've got people around, you watch them. So people are a sort of always-interesting baseline. Another thing is that big objects tend to be mentioned. And things that are unusual: if you have a small rhinoceros in a downtown street view, people are going to say, gee, you don't see that very often, and mention it.

Those seem to be rough principles for baseline utility. But we do not yet have the class of understanding required to say: well, OK, there's a baseline utility, and then there's also a component that's linked to the immediate task — though I would guess that that's the situation. If one wanted to take a very extreme point of view, you could say the right way to do vision is with reinforcement learning, because that's the way nature did it: you just whack every vision system on the head if it doesn't do everything right. The downside of that one is that it looks as though it would take an awfully long time. So pulling apart these utility issues and getting a better understanding of the principles seems to be important.

Question?

So, I mean, obviously we all kind of, in our heads, compare how vision people do their stuff and how speech people do their stuff. And the two things that kind of make speech recognition work, in my view, at a very abstract level: one is that we model how the various units that we're trying to recognise change in context; for instance, you know, phones depend, in how they're realised, on what other phones they occur next to. And then we really use, in a massive way, this, what you called joint modeling: we model how phones occur together, how words occur together, how high-level units like topics and other linguistic units at various levels all interact and have co-occurrence statistics that can inform the units within them. So this joint modeling that you just touched on is really massively important for speech recognition.
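As a minimal sketch of that joint-modeling idea (a toy illustration, not any production recogniser), one can estimate smoothed bigram co-occurrence statistics from a small corpus and use them to rescore competing hypotheses; the corpus, hypotheses, and smoothing constant below are all invented:

```python
import math
from collections import Counter

# Toy corpus and hypotheses, invented for illustration only.
corpus = "the cat sat on the mat the dog sat on the rug".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_score(words, alpha=1.0):
    """Add-alpha smoothed log-probability of a word sequence."""
    vocab = len(unigrams)
    score = 0.0
    for prev, cur in zip(words, words[1:]):
        num = bigrams[(prev, cur)] + alpha
        den = unigrams[prev] + alpha * vocab
        score += math.log(num / den)
    return score

# The hypothesis whose words co-occur the way the corpus says wins.
print(bigram_score("the cat sat".split()))   # higher (less negative)
print(bigram_score("sat the cat".split()))   # lower
```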

And so these two aspects, the modeling of how things change as a function of context, and then the modeling of the context itself and its statistics: do you see those as promising but still having a long way to go, or is it just not something that works as well in the vision domain? Can you draw some comparisons there? You've put your finger on a really nasty issue. We know about context; we've been talking about context since the eighties.

And then the question is sort of how, what, and why, and under what circumstances, and where you get the contextual statistics, and all that jazz. And there is a tremendous amount of work on that topic. I guess a reasonable summary is: clever use of contextual information often improves a particular function just a little bit, but there is no example anyone knows of where context just hits the issue out of the park.

And I'm using the word context in the broadest possible sense, of various kinds of co-occurrence too. So, the geometric stuff: for example, you can make pedestrian detectors a little bit better by knowing about geometry, and the little bit is worth having; that's one person who doesn't get run over, or whatever.
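A minimal sketch of that kind of geometric context, not any specific published detector, with made-up numbers throughout: under a ground-plane assumption, a pedestrian's apparent height is roughly proportional to how far their feet fall below the horizon, so detections whose size disagrees with that prediction can be down-weighted:

```python
import math

# HORIZON_Y and K are hypothetical calibration numbers, not from any
# real system.
HORIZON_Y = 200.0   # assumed horizon row in the image, in pixels
K = 0.45            # assumed person-height / (feet_y - horizon_y) ratio

def rescore(detection_score, top_y, bottom_y, sigma=0.3):
    """Down-weight detections whose size disagrees with scene geometry."""
    height = bottom_y - top_y
    expected = K * (bottom_y - HORIZON_Y)
    if expected <= 0:        # feet above the horizon: implausible person
        return detection_score * 1e-3
    # Gaussian penalty on the log-ratio of observed vs expected height.
    err = math.log(height / expected)
    geom_prior = math.exp(-(err ** 2) / (2 * sigma ** 2))
    return detection_score * geom_prior

print(rescore(0.9, top_y=320, bottom_y=400))   # plausible size: barely changed
print(rescore(0.9, top_y=100, bottom_y=400))   # far too tall: crushed
```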

But I know of no example in vision where things get a lot better through heavy-duty contextual information. Now you could argue with that, and people do argue about it, in two ways. One argument is: well, you're not using enough contextual information; if you use much richer contextual models, in more detail and the like, things will get better. If you believe that, fine; there are whole research programs based on that hypothesis. The other argument is: well, those elaborate structures become increasingly subject to issues of bias, issues of variance in estimation, and all that jazz, and basically what you win with one hand you lose with the other, and you're sort of back where you started.

I would say the jury is still out on this question. It's very firmly on the agenda, and it's very aggressively studied. And my own bet would be that contextual information really matters, but it also really matters which contextual information you use and which you ignore, and that second choice is pretty hard: we don't really have the machinery that says this is the good stuff, this is the bad stuff.

Now, I know it's not easy to meaningfully contrast vision and speech; they're just different activities, with different communities doing different things. But I would say we have a bafflingly rich selection of potential contexts to use, everything from camera geometry to geometric context, to special properties of texture or lighting, to co-occurrence statistics of objects, to object-scene co-occurrences, and the like, and one possible source of the difficulty is that we just don't know what to select from that.

This is to address the first question and Jeff's. I don't know if you heard Jeff's talk yesterday morning on these segmental conditional random fields, and the idea he's proposing, which is basically to model speech by incorporating information from multiple detectors, using segmental random fields. I actually don't know enough to know whether that was inspired by the vision world and migrated to speech, or vice versa, but I was wondering if both of you could comment on what commonalities you see between those two approaches, and whether there is anything here you think you might use in Jeff's approach, or, Jeff, whether you see anything here, based on what you've heard from David, for a little bit of cross-pollination between the two areas.

So I think, and I guess Jeff is next at the microphone, that from my perspective there are strong resonances and harmonies. One of them is an idea that's pervasive in vision, which is: if you can carve a picture up into pieces that make sense, you can get much more information about the pieces, because you've got spatial support over which to pool features and the like.
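A minimal sketch of that spatial-support point, on synthetic data: a colour histogram pooled over a segment mask says far more about the object in the segment than the same histogram pooled over the whole image:

```python
import numpy as np

# Synthetic image over 8 quantised colours; the "object" is a block of
# colour 7 placed by hand. Everything here is invented for illustration.
rng = np.random.default_rng(0)
image = rng.integers(0, 8, size=(60, 80))
mask = np.zeros((60, 80), dtype=bool)
mask[20:40, 30:50] = True
image[mask] = 7

def pooled_hist(img, support):
    """Normalised colour histogram pooled over a boolean support mask."""
    h = np.bincount(img[support].ravel(), minlength=8)
    return h / h.sum()

print(pooled_hist(image, mask))                 # peaked at colour 7
print(pooled_hist(image, np.ones_like(mask)))   # washed out by background
```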

I'd say most serious vision people believe that if you could do a good job of carving pictures up, everything would get better.

I'm always uneasy with that belief, because there's no evidence to support it, and it's reasonable to say that the people who believe it simply haven't tested it, or hold an unsupported belief, or are looking at the wrong statistics. So, you know, we're in a position where smart people think it should work out, but right now none of the best detection or classification methods takes any account of spatial support; they just look at the picture as a whole. I think that will change. I will go to my grave believing that if it hasn't changed, we've done something wrong and it will come right later on. But it hasn't changed yet, and that's a very disturbing feature of the vision landscape. So I think there's potential there, but nobody's demonstrated it; that would be my reaction.

I don't... have I got the mike? Let me see... oh, yes.

And yeah, so I thought that was very interesting, and I think there are many points of commonality. Two things struck me. One of them was that, in the vision case, it seemed that the attributes were much clearer than they have been in the speech case: for example, "has wings", "has a beak", "has wheels". Those are high-level attributes that we can sort of write down just by thinking about the problem, and I'm not sure that we have the same attributes available to us when looking at the spectrogram or the speech signal.

And the other thing that occurred to me was that perhaps, in the vision case, there's an interesting extension to these ideas, one which we're dealing with in the speech case, which has to do with the sequential aspect of things: for example, if you're working not with a fixed image but with video, where you have a sequence of scenes, you might want to segment it into segments using some of the attributes that exist within the segments.
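A minimal sketch of that suggestion, on synthetic data: treat each frame as a vector of attribute scores and cut the video where the local mean of that vector shifts sharply; the attribute trajectories, window, and threshold here are all invented:

```python
import numpy as np

# Two synthetic "scenes" with different per-frame attribute profiles.
rng = np.random.default_rng(1)
scene_a = rng.normal([1.0, 0.0, 0.5], 0.1, size=(40, 3))
scene_b = rng.normal([0.0, 1.0, 0.2], 0.1, size=(40, 3))
frames = np.vstack([scene_a, scene_b])   # (80 frames, 3 attributes)

def change_points(x, window=5, thresh=0.5):
    """Flag frames where the mean attribute vector shifts sharply."""
    cuts = []
    for t in range(window, len(x) - window):
        before = x[t - window:t].mean(axis=0)
        after = x[t:t + window].mean(axis=0)
        if np.linalg.norm(after - before) > thresh:
            cuts.append(t)
    return cuts

print(change_points(frames))   # frames around t=40 should be flagged
```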

So, responding to the point about attributes: I said a little about this in the talk, but it's easy to write down a couple of hundred. It's not clear that they're independent of each other, and it's not clear that they cover the game by any manner of means. We don't really have a story about what you do if you don't know the natural attributes. The story people currently use is: if you can come up with something that's discriminative, it's going to be an attribute one way or another, whatever you call that attribute. But there's actually a moderately interesting class of vision problems where we sort of know we don't have attributes and would like to, and the question of developing attributes for things for which it's hard to write down a list is a big deal for us. I think if we can learn about that, we would be pleased to learn.
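A minimal sketch of the attribute story under discussion, with all names and numbers invented: hand-written binary signatures plus hypothetical soft attribute detectors let a category be recognised by matching detector scores to its signature, rather than by a detector trained for the category itself:

```python
import numpy as np

# Hand-written binary attribute signatures for a few categories.
attributes = ["has_wings", "has_beak", "has_wheels"]
signatures = {
    "bird":    np.array([1, 1, 0]),
    "bicycle": np.array([0, 0, 1]),
    "plane":   np.array([1, 0, 1]),
}

def classify(attr_scores):
    """Pick the category whose signature best matches the soft scores."""
    return max(signatures,
               key=lambda c: -np.abs(attr_scores - signatures[c]).sum())

# Hypothetical detector outputs for one image:
# wings yes, beak no, wheels yes.
print(classify(np.array([0.9, 0.1, 0.8])))   # -> "plane"
```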

As to whether time helps segmentation: again, spatio-temporally segmenting videos doesn't seem to go much better than anything we know how to do for the non-temporal, purely spatial segmentation of pictures, which people find puzzling. I'd say most of the serious people in vision believe that's because we're misunderstanding something, but we don't know what it is, and we don't know how to fix it.

Just on what you said: there is a section of the speech community that does believe in feature detectors, like articulatory feature detectors. I'm not saying it's right or wrong, but there was that part of the community that looked at speech recognition from that viewpoint, which is a little more similar. One thing I wasn't sure about, and this is just a clarification of your talk: when you produce these features, are these hard features that are being produced, is that the idea, or were these all soft decisions that were extracted? So is there, like, a set of ten billion possible things, and is there a probability that's thresholded, or do you make a decision: here it's a potato, there's a septic tank, et cetera, et cetera?

Well, the nice thing about a field like this one is that if you make a list of, you know, a bunch of potential alternatives, somebody's written a paper about pretty much any combination. Usually what people do is report one alternative: you know, it's a pedestrian, it's not a pedestrian; it's a cat, it's not. But there's a fair amount of interest in, for example, the top five, because there are a bunch of applications where, as long as you get a good ranking and the right thing lands near the top of the ranking, you're okay, and people are very interested in that.
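A minimal sketch of that top-five criterion: a prediction counts as correct if the true label lands anywhere among the k best-scoring labels; the scores below are invented:

```python
import numpy as np

def topk_correct(scores, true_label, k=5):
    """True if the true label is among the k best-scoring labels."""
    ranked = np.argsort(scores)[::-1][:k]   # best-scoring labels first
    return true_label in ranked

scores = np.array([0.05, 0.30, 0.20, 0.25, 0.10, 0.10])
print(topk_correct(scores, true_label=4, k=1))   # False: label 1 wins
print(topk_correct(scores, true_label=4, k=5))   # True: 4 is in the top 5
```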

Then there's another class of activity, which is: look, if I build these detectors, I can actually think of the outputs as being features, and what I'm going to do is pretend I'm building detectors, then look at the responses, treat them as features, and use them for a completely different activity. So essentially all the alternatives you describe appear in someone's paper somewhere.
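A minimal sketch of the responses-as-features idea, with stand-in "detectors" (random linear filters, purely for illustration): pool each detector's maximum response over the image and hand the resulting vector to whatever classifier the other task needs:

```python
import numpy as np

rng = np.random.default_rng(2)

def detector_bank_features(image, detectors):
    """Max response of each detector anywhere in the image."""
    return np.array([d(image).max() for d in detectors])

# Stand-ins for trained detectors: each returns a 1-D response map.
fake_detectors = [lambda im, w=w: np.convolve(im.ravel(), w, "same")
                  for w in rng.normal(size=(10, 5))]

image = rng.normal(size=(32, 32))
features = detector_bank_features(image, fake_detectors)
print(features.shape)   # (10,) -- one feature per "detector"
```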

And I wouldn't say there's any consensus about what the best thing is, which is unfortunate; it's not as though you can say "do this and you're okay".

One difference between speech and vision that strikes me about images: all the images that seem to be in the datasets seem to be sort of high-quality images; no one seems to post their crappy pictures on the web. So I wonder: do some of these techniques work when the pictures are poor quality, blurry, overexposed or underexposed? Because in speech we have a lot more variability of quality, it seems, and it affects the performance of our systems.

So, I'm not sure that's quite right. The fact is, there are an awful lot of crappy pictures and cruddy videos out there; an hour on YouTube will reassure you on this point. And some things about this are hard: the things that break image feature computations are very different from the acoustical phenomena that give your feature computations problems. But there are some points of contact.

We benefit quite a lot from time. For example, just one moderately good example: if you're interested in human activity recognition, think about things like a soccer field. In a long view of a soccer field with a player running across it, you really just can't resolve the arms and legs. You've got motion blur to worry about, and a limb is about one pixel across anyhow; it's just a mess. But if you look over a longer time scale, you can get a fairly good picture of what's going on just by looking at the sequence of pixels and the motion of pixels.
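A minimal sketch of that use of time, on synthetic frames: even at resolutions where limbs are unresolvable, pooling crude frame-difference energy over longer temporal windows yields a usable motion descriptor; the frame sizes and window length are invented:

```python
import numpy as np

# 60 synthetic low-resolution frames, invented for illustration.
rng = np.random.default_rng(3)
frames = rng.normal(size=(60, 24, 24))

def motion_descriptor(frames, window=15):
    """Per-window average frame-difference energy (a crude motion cue)."""
    diffs = np.abs(np.diff(frames, axis=0))   # (T-1, H, W)
    energy = diffs.mean(axis=(1, 2))          # one number per time step
    n = len(energy) // window
    return energy[:n * window].reshape(n, window).mean(axis=1)

print(motion_descriptor(frames))   # one motion-energy value per window
```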

So I think some of the losses of resolution might not be as destructive as some of the acoustic effects that you encounter, but I'm not sure that that's true.

There are whole problem areas in vision that are basically dead in the water as a result of, say, interreflections of light, whereas I think multipath acoustic distortion probably isn't the biggest thing in your life; you have other things to worry about. So it depends on the kind of situation. There's a lot of interest in low-resolution pictures; agencies care about it, for example for pictures that come out of forward-looking infrared sensors, for somewhat alarming reasons.
