Good morning everyone, and welcome to the second day of SRE 2011. I hope you're enjoying it as much as I am.
It's my pleasure to introduce Professor David Forsyth of the University of Illinois at Urbana-Champaign, who came there from Berkeley.
I'm going to skip some of the bio here, but he has published more than a hundred and thirty papers.
He's very active in the IEEE community as well: he was program co-chair for IEEE CVPR in 2000 and 2001, and he was general co-chair for CVPR in 2006.
He is also active in the SIGGRAPH community. He has received the IEEE Technical Achievement Award and became an IEEE Fellow.
He is also the author of a well-known computer vision textbook, whose second edition came out a couple of years ago.
Thank you for those kind words.
So I was a little bit nervous about what to say here, being a vision person talking to a speech audience, but I'll try to identify the connections as I go.
A lot of what I'm going to show borrows from my colleagues in one form or another, in particular Ali Farhadi and Derek Hoiem.
Roughly speaking, vision divides into two problems: reconstruction and recognition.
Reconstruction is essentially building a model of the world from pictures or video or other kinds of data.
And recognition — I think of recognition as being about saying what is where.
The field has gone from being the occupation of a small number of people to a very successful one, with massive applications. We have the standard problem of an academic field, which is that whenever something really works and generates money, we say that's not really what we do and ignore it. But there are a whole bunch of those things that have spun off, and we'll see some of them.
I'm not going to talk very much about reconstruction, but I want to mention the state of the art, which is: if you have multiple cameras, you can get really astonishing results at huge geometric scale.
So if you walk around, for example, a quadrangle with lots of big buildings, waving a video camera at those buildings, you can reconstruct the geometry to two centimetres or less.
There have been reconstructions of whole cities prepared using these methods; the error gets a little bit bigger, and it is very largely automatic.
Furthermore, you can take a bunch of scattered images off the web and try to do the same; that's slightly harder and some things go wrong, but that's the kind of thing that is possible.
If you have a single picture, it's much more difficult to reconstruct, but you can make some progress; actually, in the recognition material I'm going to talk about, you'll see some of this.
Some of the things that tell you about the shape of the world might include the symmetry of objects in the world, and stylised shapes — later on we're going to pretend that every room is a box, and that turns out to be a very useful assumption. Contour information, texture information and shading can all tell us something about shape as well.
I'm going to show you a reconstruction that is maybe about seven years old now, but it gives you some sense of the state of the art; the current state of the art is like this, but bigger.
Here's a movie of an architectural feature somewhere out there in the world, which has been videoed from a bunch of directions.
From all this you can reconstruct an enormous number of points lying on the object, and recover where all of those cameras were and the way they viewed it.
I haven't rendered all the points here, because that would make the rendering slow, but you can see where the cameras are and where the points are, and that's all by standard methods — this is a complete system.
You can join those points up — in a second we're going to do that — to make a mesh, and the mesh will give you some idea of the geometry.
Once we turn the points into the mesh, you can really see we've got a tremendous amount of information about the geometry.
The difference over the last seven years, between what I'm showing you now and what people do today, is scale.
OK, now that sort of thing works for a quadrangle of buildings, or a city, or something of that form; here it's a single structure.
And of course we can texture that mesh, and then we are in really sweet shape: we have a very good physical reconstruction of what it's like, which we could show to other people and which we could use in virtual reality applications.
You can see other applications here as well. If you want a block of downtown Los Angeles, the colours are a bit difficult to get, but you can fly a helicopter over it, build a model, and put the model in a movie or on your phone.
And if you want to join a movie sequence of real live action to a model, you need to know the camera path, and we can recover that as well.
So there are tremendous applications lurking behind this.
That's it for reconstruction; I'm going to talk mainly about recognition.
Why do we care about visual object recognition? The answer is: if you want to act in the world, you have to draw distinctions, and those distinctions could be of a very simple kind or a very complex kind.
So if you were building a robot, you have this great advantage of vision, that it can predict the future: you can look ahead of you, see things you haven't encountered yet, and figure out what would happen.
Is the ground soft? Is that person doing something dangerous? Does it matter if I run into that object? Which end of that object is the sharp end? These are really important questions when you act.
Now, for information systems, it is just really valuable to be able to search for pictures, cluster pictures, or look at pictures to understand what they tell you. All of those are recognition functions — you might not need really good recognition, but you need to build descriptions of what's going on to support them.
And of course there are the general engineering applications, which I'll demonstrate in a second.
There is a universal fact about vision systems: pretty much any animal that has vision has a recognition system, and they are often pretty lousy. If you look into it, male horseshoe crabs identify female horseshoe crabs visually, but what they're looking for is a dark square. If you build the right kind of dark square and leave it lying on the floor of the ocean, a line of amorous male horseshoe crabs will build up behind it, because the vision system just isn't up to the job. So you might not have great recognition, but if you've got vision, you've got recognition.
OK, as an example of a more general engineering application of vision — and I believe later speakers will talk about this on Thursday as well, probably in more detail — imagine you watch a whole bunch of people, and you measure a bunch of other things as well: you could look at physiological markers, you could listen to the sounds and the speech, and you could watch them behaving naturally.
Then you could do a bunch of things. The first is: if they behave in a way you don't want, you could give feedback. The other thing is you could screen.
So, for example, autism spectrum disorder is an affliction where, if you catch it very early, you sometimes have better chances of intervention, so it would be really nice to screen children very early, and it would be very nice to screen everyone. What you'd like to be able to do is to say "this child needs to see someone who knows what to do, and this child doesn't", and you'd like to do that in a very low-skill way. Maybe what you could do is observe them behaving and say, gee, they need to see someone who can tell whether they're really affected.
And it turns out you can apply that story to in-home care, to care of demented patients, to care during stroke recovery, to building design, and so on; models like this look as though they're going to be really valuable. NSF has put a bunch of money into this sort of thing under the Expeditions program, and we hope good things will come of it.
Here's another example: you might want to take pictures and simply predict word tags. Why would you like to predict tags? Well, people like to search for pictures with words, and lots of pictures don't come with words attached. What you might do is look at the picture and say: based on various classification machinery, and on what I know about how words are correlated, give me a bunch of word tags to associate with the picture. That would be useful.
The state of the art in this activity is moderately advanced, and we have very good experimental methods. If you actually retrieve images based on predicted word tags, you can get accuracies in the thirties of percent, which may not sound all that impressive, but ten years ago they were around three percent. So it's an order of magnitude, which is wonderful, and this is genuinely useful.
But words and pictures affect one another in much more complex ways, so there are many interesting problems that are just sort of emerging from the presence of word-and-picture datasets. This example is due to Tamara Berg: you see on my slides shoes from catalogues, with their descriptions underneath them.
Now, there is no existing vision mechanism for saying that the thing in the picture is "adorable" — we just don't know how to do that. The first interesting problem that arises is that if you had a whole bunch of catalogues, you might actually be able to fish phrases out of the text and descriptions out of the pictures, and build classifiers that could predict "adorable".
There's something else going on when you read these descriptions: they are fairly comprehensive descriptions of the object, but they don't tell you what colour the shoes are, and furthermore they don't tell you what colour the stitching is. The reason they don't do that is that it's blindingly obvious from the picture; there's no point. But from our perspective, if we're looking for things, or searching for things, or doing things like recommending things to a customer, being able to pool information jointly from an image and a description might add real value.
OK.
So, getting to the end of this summary of vision before I show you some stuff about recognition: I was asked just recently to describe what every vision person should know, and it's useful because it gives you a flavour of the discipline.
The big thing is that vision is really useful, it's really hard, and it's still really poorly understood.
It's very helpful to know a bunch of techniques. It's also very helpful to have a bunch of scepticism: in hard, poorly understood disciplines there is always somebody who comes along with a revolutionary new solution — they come along every five years or so, and then they go away — so a moderate degree of scepticism is valuable. Opportunism is sensible too.
Vision is difficult because you need to know a lot of stuff, and there's a lot of evidence that knowledge of any one thing doesn't seem to help much; there really are a lot of different ideas that are just sort of boiled together, and we'll see some of them.
However, the main thing is to know the general principles of vision. And what you can deduce, from evolutionary examples and from what has been successful in computer vision, is the content of the next slide: there aren't any. It's not a subject that has general principles; it's just one of those things. Anybody who offers you a general principle is either a fool or a liar, and you can make your own choice.
So now I'm going to set up a series of discussions about our state in recognition. I like to do this with the conclusion first, because then we know where we're going. The first thing is: object recognition is subtle, but we actually have really strong methods that work quite well, based on classification.
Rather loosely, we could believe the following about object recognition: object categories are fixed and known — this is a cat, that's a cow, that's a motor car; every object belongs to one category, and there are K of them; and you can get good training data — I've got a hundred pictures of cats, a hundred pictures of cows, a hundred pictures of motor cars.
Then object recognition sort of turns into K-way classification, and it will turn out that detection turns into lots of classification tasks.
In that belief space, which has been very valuable, there's a natural programme of research: you bang together a bunch of features, you do careful fitting with classifiers, and you produce a representation. That strategy has been amazingly effective.
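Stripped to its bones, that recipe is: turn each image into a feature vector, then fit any classifier you like. Here is a minimal sketch — the nearest-centroid rule and the two-dimensional toy features are stand-ins for whatever features and classifier a real system would use:

```python
import numpy as np

def train_nearest_centroid(features, labels):
    """Fit one centroid per category from training feature vectors."""
    classes = sorted(set(labels))
    return {c: np.mean([f for f, l in zip(features, labels) if l == c], axis=0)
            for c in classes}

def classify(centroids, feature):
    """K-way classification: report the category whose centroid is nearest."""
    return min(centroids, key=lambda c: np.linalg.norm(centroids[c] - feature))

# Toy example: two well-separated categories in feature space.
train_x = [np.array([0.0, 0.0]), np.array([0.1, 0.0]),
           np.array([5.0, 5.0]), np.array([5.1, 4.9])]
train_y = ["cat", "cat", "cow", "cow"]
model = train_nearest_centroid(train_x, train_y)
print(classify(model, np.array([0.2, 0.1])))  # -> cat
```

Everything interesting in a real system lives in the two stand-ins: better features and better-fitted classifiers.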
We're quite good at features. The summary of about ten years of work on features comes down to two really important points.
One is that features need to be illumination invariant: when the lighting changes, the features shouldn't change all that much, and there's an easy way to get that, which is to look at the orientations of image gradients.
The second big principle is that the object is never quite where you think it is in the image — it's always shifted around a little bit — and that means if you look at the image gradient at a particular point, you're not going to do well. Instead you want to look at local pools of image gradients, or histograms of orientations.
And it turns out that if you take those two principles and combine them in a fairly natural fashion, then you get HOG and SIFT features.
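The pooled-orientation idea can be sketched directly. This is a stripped-down version of the HOG cell stage for a grayscale image; real implementations add block normalisation and interpolation between bins and cells, which are omitted here:

```python
import numpy as np

def orientation_histograms(image, cell=8, bins=9):
    """Pool image gradients into per-cell histograms of orientation,
    weighted by gradient magnitude (a stripped-down HOG cell stage)."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi            # unsigned orientation in [0, pi)
    h, w = image.shape
    H, W = h // cell, w // cell
    hist = np.zeros((H, W, bins))
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    for i in range(H):
        for j in range(W):
            sl = np.s_[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            hist[i, j] = np.bincount(bin_idx[sl].ravel(),
                                     weights=mag[sl].ravel(),
                                     minlength=bins)
    return hist

# A vertical edge produces horizontal gradients, i.e. orientation near 0.
img = np.zeros((16, 16)); img[:, 8:] = 1.0
h = orientation_histograms(img, cell=8)
print(h.shape)  # (2, 2, 9)
```

Because each cell pools over an 8-by-8 neighbourhood, small shifts of the object move gradients within a cell rather than changing which feature they land in — which is exactly the point of pooling.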
I've shown here, for a series of different pictures — on one side, and I may get left and right mixed up — a woman with a bicycle, and shown next to it a HOG-style feature representation: each of those little blobs is basically a histogram of gradient orientations in a little pooling window.
What we're seeing is that at the top of the image the gradient orientations go in pretty much every direction within a local pool, but when we get down to the sides of the woman there are lots of gradients that run along the side of the body, and there is strong contrast around the bicycle.
Again, in this scene with the traffic, you can see that in the trees the gradients go in all directions, but around the car they line up with its outline. And in this picture, with the bicycle down at the bottom, you can see the rough structure of the wheels and the frame reflected in those patterns of orientations.
And essentially what we do is take this information and bang it into a classifier.
When we do this, we get really quite good results; we are rather good at this kind of K-way classification, running up to K of a couple of hundred. When we get into the tens of thousands, things get very interesting, but we'll set that aside.
There are standard datasets for investigating methods and features. Caltech-101, for example, is a set of pictures of one hundred and one different categories, picked somewhat at random from a selection of useful-looking categories, and the main number here is the error rate — the fraction of classifications you get wrong — which is now likely about twenty percent.
So if you stick a picture of an isolated object from the Caltech-101 list of objects into a good modern method, you're likely to get the right name out. If the collection of categories you know about is somewhat bigger, you are not as likely to get the right answer out — the accuracy runs up to the fifties if one is very lucky and has lots of training examples — but you've still got a really good chance of getting the right answer.
So there are some problems we can do quite well.
And this machinery extends to really very complicated and non-obvious judgements. You can extend these features to work in space-time, and then what people do now is take movies, get the script of the movie — which has been marked up with time codes by enthusiasts on the internet — time-align the two, and then say: OK, here's a scene description in the script; look for features around that point that are distinctive in the movie, train a classifier like that, and then run it on something else.
You can get really quite effective action spotters this way for complex actions, like answering the phone, getting out of a car, hugging, kissing, sitting down.
On the top row are a bunch of true positives; on the second row, a bunch of true negatives; on the third row, some false positives. If you look at the "answer phone" false positives, for example, the guy on the bed leaning to the side looks as though he could be sitting on a bed answering a phone — he just doesn't actually have a phone. And then of course there are also people acting in unusual circumstances, or at unusual distances.
So this machinery extends to really quite complicated judgements.
This machinery can also be used for detection. The way you detect with a classifier is: imagine I have a picture with some interesting things in it that I want to detect. What I'm going to do is take a window of the image, correct the illumination, estimate orientations, then put that window into a classifier and get yes or no. Then I'll go to the next window and say yes or no again, and keep doing that, and I'll find the best detection responses; if they're good enough, I report them.
I do this at multiple scales: if I want to find a big instance, I'll make the image small and search it with a fixed-size window again; if I want to find a small one, I'll look at a very high-resolution version of the image.
This recipe, again, has been amazingly successful; we are really quite good at detecting moderately complicated objects.
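That multi-scale window-sweeping loop, in a minimal form — the mean-brightness scorer and the crude subsampling "resize" are toy stand-ins for a trained classifier and proper image rescaling:

```python
import numpy as np

def sliding_window_detect(image, score_window, win=32, stride=8,
                          scales=(1.0, 0.5), thresh=0.5):
    """Slide a fixed-size window over rescaled copies of the image and
    keep windows the classifier accepts; shrinking the image lets the
    same window catch bigger objects."""
    detections = []
    for s in scales:
        step = max(1, round(1 / s))
        small = image[::step, ::step]            # crude subsampled "resize"
        for y in range(0, small.shape[0] - win + 1, stride):
            for x in range(0, small.shape[1] - win + 1, stride):
                score = score_window(small[y:y+win, x:x+win])
                if score > thresh:
                    # map the window back to original-image coordinates
                    detections.append((x / s, y / s, win / s, score))
    return detections

# Toy scorer: mean brightness stands in for a trained classifier.
img = np.zeros((64, 64)); img[16:48, 16:48] = 1.0
dets = sliding_window_detect(img, lambda w: w.mean())
print(len(dets) > 0)  # True
```

Real detectors add non-maximum suppression on top, since many overlapping windows fire on the same object.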
The standard detector has some additional complexity attached to this description; the additional complexity is these little yellow boxes. Each row here displays the behaviour of the standard detector on a different category — sorry, I'm getting my rows and columns mixed up. The first row is human detection, the second row is bottle detection, and the third row is car detection.
In the first row you'll see that the person stepping in front of the train has had a little blue box placed on top of him, with yellow parts inside. Then there is a big group of people which has been incorrectly counted — one of them is missed — but most of them have boxes on top of them, and we know that they are people. In the third column of the first row you see somebody hiding behind a bush; he's had a box placed on top of him — the obvious Monty Python joke is so obvious I won't make it. And, as one of my colleagues likes to say, it's the bleeding edge: the detectors aren't perfect, and that sheep has been marked as a person.
In the second row you'll see a modern bottle detector at work. We're pretty good at detecting bottles: we can find them even if they're in people's hands or on tables, but we get bottles and people mixed up, for quite good reasons. Detectors really like strong, identifiable, high-contrast curves; people have them around the head and shoulders, so do bottles, and they tend to look the same. So humans and bottles often get mixed up.
We're also very good at detecting cars, in which case they get mixed up with buses, which is not unreasonable either.
The detector I've referred to here is sort of the standard technology; you can download and run the code, it's all very established, and it's widely used.
Now, a problem with the belief space about recognition that I described is that it is beginning to come apart at the seams, because most of the beliefs are obviously false. That's just not true that categories are fixed and known: objects belong to multiple categories, and good training data might be very hard to get, and that presents serious problems.
Here's one example — I ask audiences which of these are monkeys. Audiences usually go into vapour lock somewhere around this point, because they know I'm going to get them from one side or the other, but they don't know which side.
If you look at these, depending on what you please, each could be a monkey or not. I had to check this myself — I'm not that good on primate taxonomy — but most of these are monkeys, and the one on the bottom row in the second column is a little plastic toy.
The whole point about categorization here is that the concept is in play: things can belong to more than one category at the same time, perfectly reasonably.
So what we've inherited from the point of view I described is a tremendous amount of information about feature computation and construction; we're really good at building, managing and using classifiers, and a lot of practice keeps improving them.
But there are really evil subtleties lurking there, and the next thing is to describe some of the efforts to deal with them.
So, the big questions — the really big questions of computer vision that are in play right now — are these. What signal representations should we use? This sits at the early level, before you get to the classifiers and the learning machinery. To what extent should we use models — what aspects of the world should we represent, and how should we represent them? And then the other one, which is: what should we say about pictures? Those three questions are really very difficult indeed.
So let's start looking at the coming technologies and the nasty problems.
One big issue is the unfamiliar. The recipe I described just doesn't deal with the unfamiliar. Let me show you a little movie of somebody doing something. Almost certainly you've never seen people doing this before — it doesn't happen every day — and at the same time it doesn't really present you with any problem. You might not have a word to describe it, but you know what's going on, and that's fine.
Here's another, more extreme example of something you really don't see every day, but you can still watch it, and it's just obvious what's going on. Even the donkey is accustomed to it by this point.
You can't treat this with the classification recipe, because you don't have training data. Yet you can deal with the unfamiliar in satisfactory ways, and you've probably put together in your mind a little narrative of what's going on and why they're doing what they're doing, and then it's all over and you can get on with things.
Now that's a really baffling thing: from the perspective I described to you, we just have no approach to it.
There are methods you can use, though. You can take the stuff I described and rewrite it. Here is an architecture that people are using quite a lot: I take a picture, do feature computation and selection and so on, and instead of building classifiers that say "cat", I build a bunch of classifiers that say the picture has a beak, and it's got an eye, and it's got fur, and so on.
The reason I would do that is that if I ran into something else, I might not know what it was, but I could say: oh, OK, it's got fur, it might be a feather duster; or it's got a beak, so I can say something useful about it.
This is kind of neat, because you can then build systems that make predictions for objects they have never seen before — where they haven't seen even one instance of that type of object.
On the slide, the little yellow boxes are the spatial basis of the predictions in the image, and underneath them are the predictions. The rather baffled-looking man here is reported as having a head, having an eye, having a snout, having a nose, and having a mouth. So we would be able to say something useful about something we'd never seen.
It is hard to get these predictions right. You can see for this animal, for example, that it's reported as having a tail, a snout and a leg; it also says it's got text on it, and that it might be plastic. It says it's got text on it because text is characterized by little dark and light stripes next to each other, and plastic is characterized by a wonderful bright sheen. So these predictions are hard to make, but you can make them.
The other neat thing about this architecture is that if you happen to have seen lots of birds, it's relatively straightforward to add something else that says: OK, this really is a bird. And that again fits in the whole recipe of classification that I described.
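The attribute idea, as a sketch: a bank of binary classifiers, one per attribute, reporting a description rather than a category name. The attribute names and the hand-set linear scorers here are made up for illustration — a real system would learn them from labeled data:

```python
import numpy as np

# A bank of binary attribute classifiers; each attribute gets a weight
# vector, and a positive dot product means "present". These weights are
# hypothetical stand-ins for learned classifiers.
ATTRIBUTE_WEIGHTS = {
    "has_snout": np.array([ 1.0, -0.2, 0.0]),
    "has_wheel": np.array([-0.5,  1.0, 0.0]),
    "is_furry":  np.array([ 0.8,  0.0, 0.5]),
}

def describe(feature):
    """Report attributes instead of a category name, so that something
    never seen before still gets a useful description."""
    return sorted(a for a, w in ATTRIBUTE_WEIGHTS.items()
                  if float(w @ feature) > 0)

print(describe(np.array([1.0, 0.0, 1.0])))  # -> ['has_snout', 'is_furry']
```

A category classifier can then sit on top of the attribute layer, which is what makes the "it's a bird, but something's missing" reports below possible.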
If I can say that, I can also look at the list of attributes and say: well, gee, it's a bird, but something's missing, or something's extra. So known objects — things that I know about, whose names I know — could be unfamiliar by being different from the typical, and if they are different from the typical, it's worth mentioning.
We can build systems that do that as well: essentially, if we're really sure it's the object, and we're really sure it has a missing attribute or an extra attribute, we say so.
I have here a bunch of examples from one recent system. The semantics of the attributes are somewhat messed up, so the dog down there is reported as not having a tail — not because there is compelling evidence that it lacks a tail, but simply because we can't see one; that little detail hasn't been sorted out. That aeroplane is reported as not having a jet engine. And, gloriously, that sheep is reported as not having wool — what it has, in fact, is been shorn.
And you can report extra stuff as well. Again, there are questions about the semantics that need to be sorted out here: the bird in the little yellow box on the end there is reported as having an extra leg, and birds never actually have extra legs, so one should have some more complex interpretation sitting on top. But the bicycle, the aeroplane and the bus there are each reported with features that really are extra and unusual for the object, and we can report them.
Now, one nice thing about this: as someone asked recently, there are technologies emerging that identify regions in images that "would like to be" objects. If a region would like to be an object, then what we can do is take all that attribute machinery, attach it to the region, and report a description. That sort of thing is being discussed in the hallways, but doesn't exist yet.
The second interesting and disturbing thing about modern vision is what we call visual phrases. The point is that meaning comes in clumps.
I talked about object recognition as something where you spot individual objects, but it's really hard to talk sense about what it means to be an object. If you look at this picture, you could think of what you see as one object, because if you fish around in your head you can come up with a single word to describe it; but is it one thing or two?
Should we cut the person apart from the thing they're on, and think of the person as a person and the other thing as a separate object? That way lies madness, because we could also carve out the head and call it a head, carve out the jacket and say it's a jacket, carve out the shoes, and so on.
So what we might want to do is just accept that there is a chunk of meaning here, represented by what many people would think of as at least two objects.
As a precedent for this, a common notion in vision is that of a scene. A scene is a likely stage where particular kinds of objects and particular kinds of activities might occur — things like bathrooms or greenhouses or playgrounds or bedrooms. And we're really quite good at classifying scenes: you can use the procedure I described previously — get a bunch of labeled images of scenes, compute some features, put them into a classifier — and it turns out you can be really good at saying that's a picture of a bathroom, that's a picture of a bedroom, and so on.
The advantage of doing that is that you then have some idea of the kinds of things that might happen. We've known since the early nineties that if you get the scene right, you can predict where to look for objects.
Here are two examples of this kind of prediction. One is an outdoor scene, where we predict, on the top row, that the buildings are sort of at the top and the street is at the bottom, that trees are vertical and might be in front of you, that the sky tends to be at the top, and that the cars will tend to be toward the sides and the middle. I'm not sure that all of these predictions are right — there aren't any cars in the image — but they tell you where to look for cars if there were any, and that seems to be helpful.
So, thinking about scenes means we currently talk about meaning as coming in clumps at two scales. One scale is the scene — the whole image. The other is individual objects, however unclear it is what it means to be an object.
And it has turned out very recently that there is good practical evidence that there might exist useful clumps of meaning between the scene and the object, and these are referred to as visual phrases.
These are composites, where the composite is easier to recognise than its parts. One useful visual phrase is "a person drinking from a bottle". It turns out it's much easier to detect a person drinking from a bottle than it is to detect a person or to detect a bottle, because people who drink from bottles do special things: they hold the bottle up, they adopt special configurations, and so on. The same goes for things like "a person riding a bicycle": it's much easier to detect the whole person-riding-a-bicycle than it is to detect the person and the bicycle separately and then reason about spatial relations, because the appearance is constrained by the relation.
Now, when you make this observation, you get into a serious mess about what to report about an image. We might build a person detector, we might build a horse detector, and we might also build a person-riding-a-horse detector. We have to figure out which, if any, of them is right: if we're really lucky, the person-riding-a-horse detector will report in the same place as the person detector and the horse detector, and we have to figure out just how many people, just how many horses, and just how many people riding horses there are.
So what we do is rack up a whole bunch of detectors, and then go through a second phase, which is currently referred to as decoding, where we say: based on all of the evidence from the detectors, I'm willing to believe you, and you.
That judgement is again a discriminative judgement: we essentially take the responses of nearby detectors, report them to the current detector, and construct a second classifier, which answers "should we believe you?". You can get quite good answers out of that procedure, and it turns out to help quite a lot.
If you look at the top row of pictures, these are detector responses without any decoding — without a global view of what's going on. You can see a sofa and a bunch of "person" responses, with the threshold set very low. If one then looks at the totality of detector responses, which includes more than these, and tries to find a consistent selection that makes sense, then you do better: because you've got a sofa, there's a fair amount of evidence that you've got a dog lying on the sofa; and because you've got something that looks a bit like a person but more like a dog, lying on the sofa, that's also a dog.
You can significantly improve detection procedures with this kind of global view.
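A toy version of that decoding step: each detection is rescored using the responses of nearby, compatible detections. The hand-set compatibility table and the linear boost here are stand-ins for the learned second classifier:

```python
def rescore_with_context(detections, boost=0.2, radius=50.0):
    """Decoding sketch: adjust each detection's score using the responses
    of nearby detections of compatible categories. `compatible` and the
    linear boost stand in for a learned discriminative classifier."""
    compatible = {("dog", "sofa"), ("sofa", "dog"),
                  ("person", "sofa"), ("sofa", "person")}
    out = []
    for i, (cat, x, y, score) in enumerate(detections):
        context = sum(s for j, (c2, x2, y2, s) in enumerate(detections)
                      if j != i and (cat, c2) in compatible
                      and abs(x - x2) < radius and abs(y - y2) < radius)
        out.append((cat, x, y, score + boost * context))
    return out

# A weak "dog" near a strong "sofa" gains confidence; an isolated one does not.
dets = [("sofa", 100, 120, 0.9), ("dog", 110, 110, 0.4), ("dog", 400, 50, 0.4)]
for d in rescore_with_context(dets):
    print(d)
```

The real version learns the combination rule from data rather than hand-coding it, but the flow of evidence between detectors is the same.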
Another thing that gives a global view, and significantly improves detection performance and scene understanding, is geometry. If we know something about the geometry, we can really improve detectors.
On the one side, with the blue line on it, I have an image with the horizon marked, and I want to build a pedestrian detector — you can see the boxes around pedestrians and cars. Now, the thing about the horizon is that in perspective cameras, things on the ground that get closer to the horizon from below must be smaller in the image; otherwise they're bigger in 3D. What that means is that if I'm looking for a pedestrian and I think it's a big one, it has to be lower in the image, and the small ones have to be higher.
Furthermore, if I get some pedestrian detector responses, I can look at them and say: well, the big ones are here and the small ones are there, and that helps me estimate the horizon. And if I estimate the horizon and my detector reports jointly, I can get much better responses.
So, for example, on the top row are the local detections — the yellow ones are pedestrians, the green ones are cars — where we just tested against a threshold, and we get things like phantom pedestrians hovering in the sky. But from those and other detector responses we can estimate a horizon; pedestrians have their feet on the ground most of the time, and that just rules out all those false positives up there, and it rules in some small detections close to the horizon, because they're about the right size.
Similarly, if we go looking at a scene with cars and people in it — this is the one on the bottom — then by estimating the horizon, several detector responses, the little dotted red ones for the pedestrians, have come back: because we know that even though the image data didn't look all that great, each really is the right size, in the right place, to be a pedestrian, and that gives us just a little bit more confidence.
no geometry is wonderful stuff the roles of the geometric estimates that are making detection better right now
one thing you can do is pretend that the room is a box.
using a variety of standard methods, you can estimate that box even if the room isn't exactly box-shaped, and when you estimate the box you can get some idea of where the floor is.
so over there we've got a room with a box painted on it — you'll notice the box isn't quite right — but nonetheless, because you've got the box, we can figure out what the walls look like, what the floor looks like, and what the ceiling looks like.
so the red is one wall, the blue is another, the yellow is a third, the green is the floor, the remaining colour is the ceiling, and the purple is stuff that's none of the above — what we call clutter: things that you might bump into and such.
another way you can benefit: so firstly, we've got an account of free space.
but another thing you could do is take that box and say, well, because i know the box, i can use standard methods to ask what the faces of boxes inside the room would look like if i looked at them frontally.
so say i want to build a better detector. it turns out the people who did this — Varsha Hedau and colleagues — actually have the world's best bed detector, which sounds like a slightly eccentric thing to have, but there's a principle here, and you'll see it being useful in a second. if i want to build a good bed detector and i just look at images, i have to deal with the fact that the bed might appear at different orientations — and because it appears at different orientations, it's going to look different.
but if i know the box of the room, i can say: beds have their axes along the walls — they typically have one face against the wall of the room. therefore i'm going to rectify using the box of the room so that the faces of the bed are frontal, and i can now remove some sources of ambiguity in my features and build a better detector.
now, the thing that's nice about that is: when you know where the beds are, you know something about where the room is, because beds do not penetrate the walls of rooms.
so what i can do is estimate the room and the beds simultaneously, and come up with quite good estimates of where the furniture is and what the free space of the room is. so over here at the top you see an estimated box; in the middle you see a bed that's estimated without re-estimating the box; and at the bottom you see a joint estimate of bed and room box.
and that joint estimate is somewhat better — it's sort of three or four percent, but it's worth having.
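that joint estimate can be caricatured as a tiny search over pairs of hypotheses. everything below — the 1-d "boxes", the scores, the penalty — is invented for illustration; the point is only that a bed hypothesis poking through a wall loses even with a good appearance score:

```python
# toy 1-d sketch of joint room-box + furniture estimation (made-up scores).
# a bed hypothesis that pokes through a wall is penalised even if its
# appearance score is high, so room and bed are chosen together.

def joint_score(room, bed, room_score, bed_score):
    """room, bed: (lo, hi) intervals along one floor direction."""
    inside = room[0] <= bed[0] and bed[1] <= room[1]
    compatibility = 0.0 if inside else -10.0   # beds do not penetrate walls
    return room_score + bed_score + compatibility

rooms = {(0.0, 5.0): 1.2, (0.0, 4.0): 1.0}     # candidate room boxes -> score
beds  = {(3.5, 5.5): 0.9, (1.0, 3.0): 0.7}     # candidate beds       -> score

best = max(((r, b) for r in rooms for b in beds),
           key=lambda rb: joint_score(rb[0], rb[1], rooms[rb[0]], beds[rb[1]]))
print(best)
```

here the highest-scoring bed on its own, (3.5, 5.5), sticks out of every candidate room, so the joint optimum picks the compatible pair instead.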
the nice thing about boxes is you can do other things with them as well.
so very recently Kevin Karsch has shown that if you know the box of a room, you can figure out where the lights are, and you can figure out what the albedo is on the sides of the room — whether it's black or white or green or red.
and if you know that, and you know where the lights are, you can stick new stuff into the room.
so i'll go backwards and forwards: we put some pieces of computer graphics objects in the room, and you'll notice that the statue is behind the ottoman, and as a result it's occluded, and the lighting is right.
the thing about this which is kind of fun is: if you can do it for a static thing, you can do it for moving stuff.
so here's a picture of a billiard room from Flickr, and you can just play billiards on the table.
here's another picture from Flickr — and everything i'm showing you comes from a single picture. and here's another picture from Flickr, where a little glowing ball has managed to get into the picture and is going to explore it.
you'll notice it gets reflected in the mirror, it casts shadows the way it should, and when it flies under the table the light dims the way it should.
so these kinds of simple geometric inferences can support amazing functions. the usefulness is pretty obvious: you can stick furniture into pictures of your living room, and, if you're inclined to do such things, you can shoot aliens in your dining room in a computer game.
so let's look at the last big, puzzling principle that's kind of emerging in modern vision, and that is selection: what should we say?
so a couple of years ago Julia Hockenmaier and colleagues went out, collected a whole bunch of images, and put them on Mechanical Turk. they got people who were pretty qualified English speakers — this is kind of important, otherwise things get a bit funny — and asked them to write a sentence about the picture.
and then what you do is you get multiple sentences about a single picture, and you look at the sentences. and the striking thing about those sentences is the consistency.
people presented with this picture talk about two girls sitting and talking; one of them is holding something; they mention the jeans they're wearing. but they don't talk about the step, they don't talk about the specular reflections in the window at the back of the image, they don't talk about the two people in that window, they don't talk about the chewing gum on the ground.
they're capable of looking at this thing and saying: this is important, this is what's worth mentioning, and this is not. and they're moderately consistent.
now, understanding that is terribly important, and the reason it's important is that pictures are all about selection. if your model is to report every object in the picture, then you're dead, because your report is too big.
so we need to know what's worth saying.
we can do some of this. there's a fair amount of work on predicting sentence-level descriptions of images or video.
so for example, Abhinav Gupta and colleagues took video of baseball games, and they used methods similar to the discriminative methods i described to identify who's pitching, who's catching, who's running.
now, they also built a little generative model of baseball: essentially, you can do this, and once you've done this, that could happen, or that could happen.
you can think of it as being represented by a tree of events and some structural rules that allow you to rearrange the tree. and then what you do is say: okay, i've got these detector responses, and these are the structural rules of the game — let me generate a structure that explains those responses. and of course, if i can generate that structure, i can generate things that, without close inspection, look like descriptions of the game.
now, no sportscaster would emit something like "the pitcher pitches the ball, the batter hits it, and then simultaneously the batter runs to the base and the fielder runs towards the ball, and the fielder catches the ball" — it's not the way people talk.
but at the same time, it is a description of what's going on — one you could use to produce something that could be said — and it's a fairly detailed description of what's happening.
we can generate sentences for whole pictures too, although it's still a bit rough and ready. there are methods that essentially say: i'll go from an image space to some sort of intermediate space of detector responses, and i'll go from a sentence space to that same intermediate space, and then i'll try and align sentences and images in that space and report the best-matching sentences.
the kind of results one gets are shown here. so for that top picture, the detectors are saying "animal, sleep, ground" — animals sleeping on the ground — which agrees with the gold standard, and that's the kind of sentence one sees generated.
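the alignment idea — images and sentences both mapped into a shared space of detector-style responses, and matched there — might look like this in miniature. the meaning axes and scores below are invented for illustration, not taken from any actual system:

```python
# minimal sketch of matching images to sentences in a shared "meaning space"
# of detector-style responses (all numbers are hypothetical).
import math

def cosine(a, b):
    """cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# axes: (animal, sleep, ground, vehicle, road)
image_meaning = [0.9, 0.8, 0.7, 0.1, 0.0]      # from image-side detectors

sentences = {                                   # sentence-side mappings
    "an animal sleeps on the ground": [1.0, 1.0, 1.0, 0.0, 0.0],
    "a bus drives down the road":     [0.0, 0.0, 0.2, 1.0, 1.0],
}

best = max(sentences, key=lambda s: cosine(image_meaning, sentences[s]))
print(best)
```

the matcher never generates language itself; it just retrieves whichever human-written sentence lands closest in the intermediate space.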
okay, so people's markups are generous, to say the least.
but you might also get "cow in grass field" — which isn't right, it's a sheep — but, you know, it's not a bad guess.
the third one down: "a man stands next to a train on a cloudy day". it looks wonderful — until you look closely at the image and realise it isn't actually a train.
so you can make minor mistakes, because sentences are really compressed sources of information, and sometimes you make howlers: this is not, in fact, a white laptop connected to a black monitor — there really isn't all that much black in the photo at all.
more recently, Tamara Berg and colleagues enriched this significantly by joining this machinery to machinery about attributes, and were able to produce — again, we're not doing sentence generation, as should be fairly obvious from this — descriptions of pictures that look like this:
"there were two aeroplanes. the first shiny aeroplane is near the second."
again, we're not doing sentence generation, but if you did do sentence generation, you can see there's enough meaning extracted from the image that you could turn it into a reasonable form.
"there are one dining table, one chair and two windows. the wooden dining table is by the wooden chair, and against the first window."
the kind of objection you would raise to that is that it's too much information and not enough selection — as opposed to: it's wrong.
okay, now i'm gonna show you a movie to illustrate how far this kind of selection seems to go in human vision. it's a fairly wrenching movie, so the first thing is just to warn you that nobody was hurt.
watch it once, and then we'll think about it. so: it's clearly a surveillance movie on a train platform... and it's about there that it gets interesting.
okay, here's the question: how many adults were on the platform, and what were they doing?
right — audiences always give a variety of answers, somewhere in the two-to-seven range. you just weren't counting; it's not what's interesting.
you look at that thing, and it is clear what's important and it's clear what's not important, and you're really good at homing in on what's important.
and the important stuff looks like: what outcome do we expect? how do other people feel? this feeling thing is not just because we're nice people and care about what other people feel — it's because it gives you a really good idea of what they're gonna do next, which matters a lot for prediction.
and, of course: what's gonna happen to the baby?
i'll run the whole sequence again. nobody was hurt — the child was not hurt; it says something about how good prams and baby carriages can be... and the train stopped in time.
i wouldn't show it if the child had been hurt. it's quite well known: the baby carriage ended up upside down and was pushed along; the child was annoyed but not seriously damaged.
if you look at this, your ability to predict the behaviour of that woman who just nearly threw herself in front of the train is pretty good: she's gonna react in kind of a strange way for the next ten minutes.
what you do is you look at this, you identify what's important — we're all going to notice this guy, because he is important — and you build a little narrative around it and focus on that.
we don't know how to do that. we are trying to, but we don't know how to do that yet.
so here are some of the crucial open questions as we move towards the end.
one is dataset bias.
a distinctive feature of vision is that frequencies in datasets misrepresent applications, for a whole bunch of reasons: the labels are wrong; the things that are chosen to get labelled are not uniform; people collect things in very specific ways.
and this is not a charge of misconduct — nobody goes out there and does wicked things with data collection — but it's a real issue.
so the bias is pervasive, and we know it's a big deal in vision datasets. Antonio Torralba and Alyosha Efros produced a wonderful paper this year proving that a good classifier can tell which dataset an image comes from — which is very scary news.
and as a smart vision researcher you can do it too, and very quickly: there's a little test there — for each picture, which dataset does it come from? people run about sixty to seventy percent; classifiers are a little bit weaker.
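the "name that dataset" test is easy to reproduce in caricature: generate feature vectors from two synthetic "datasets" with slightly different statistics (the features and means here are entirely made up), and check that even a trivial nearest-centroid classifier beats chance — which is exactly the symptom of bias:

```python
# toy "name that dataset" probe (synthetic features, not real images):
# if a trivial classifier can guess which dataset a sample came from,
# the datasets are biased relative to each other.
import random

random.seed(0)

def sample(dataset):
    """draw a fake 3-d colour statistic; each dataset has its own bias."""
    means = {"A": (0.2, 0.5, 0.7), "B": (0.6, 0.4, 0.2)}
    return [m + random.gauss(0, 0.1) for m in means[dataset]]

train = [(sample(d), d) for d in "AABB" * 25]       # 100 labelled vectors

def centroid(vectors):
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(3)]

cents = {d: centroid([x for x, lab in train if lab == d]) for d in "AB"}

def predict(x):
    """nearest-centroid classification."""
    return min(cents, key=lambda d: sum((a - b) ** 2 for a, b in zip(x, cents[d])))

test = [(sample(d), d) for d in "AB" * 50]
acc = sum(predict(x) == d for x, d in test) / len(test)
print(f"dataset-guessing accuracy: {acc:.2f}")      # well above the 0.50 chance level
```

for unbiased datasets the accuracy would hover around chance; anything well above it means the collections are distinguishable, which is the paper's point.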
size doesn't make bias go away.
if you get a really big dataset, that doesn't mean it's an unbiased dataset — and it might make things worse, because you might become complacent.
so look at these: when i collected these pictures from Google, there were twenty-three million pictures of lions; here are the top however many.
and you might think they're unbiased, but have a close look at the kinds of things you could deduce about lions from these pictures. there were two pictures of lions on horseback. there's a lion lying down with a lamb. there's another one with a person putting a hand on it. and there's another odd one over that way. that's on the first page.
so if you used that as your source of lion information, you'd be in serious trouble — that's just not what lions do. this is an effect of editorial bias: people are more interested in weird pictures of lions than in common ones.
the problem is, this blows huge holes in what we know about machine learning.
machine learning is based on a form of induction that says the future is going to be like the past — and if you can't make the future like the past, then you've got a problem. and current machinery just doesn't address this.
there's good reason to believe this issue is pervasive in object recognition: the world cannot be like the training dataset, because many things are rare. that's why unfamiliar things are common, and we have to deal with them.
and of course, if many things are rare, then this exaggerates bias.
so Gang Wang produced a little histogram that asked: for all the objects in a marked-up dataset that's common in vision, how many instances are there of each?
and there's a small number of objects that have, you know, four or five thousand instances — but very quickly you're down in the tail, and after that most objects appear two or three times in the dataset. so most objects are rare.
this should be a fairly familiar phenomenon to this audience, but it wasn't really an issue in vision until recently.
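the shape of that histogram is the usual long tail. here's a tiny synthetic illustration — Zipf-style counts, not the real dataset numbers — of why most categories end up rare:

```python
# synthetic long-tail illustration (Zipf-like counts, invented numbers):
# a handful of categories soak up most labelled instances, and the vast
# majority of categories have almost none.
n_categories = 5000
counts = [round(5000 / rank) for rank in range(1, n_categories + 1)]

rare = sum(1 for c in counts if c <= 3)        # categories with <= 3 instances
head = sum(counts[:10]) / sum(counts)          # share held by the 10 biggest

print(f"{rare} of {n_categories} categories have 3 or fewer instances")
print(f"the top 10 categories hold {head:.0%} of all instances")
```

under this rough model, a supervised classifier has thousands of examples for a few head categories and essentially nothing for everything else — which is the bias problem in miniature.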
there are several things you might do about bias.
you could think about appropriate feature representations — what i described about illumination invariance is one form of doing that: if your features are invariant to illumination, then the fact that your dataset is biased in illumination just doesn't matter.
another thing you might do is build appropriate intermediate representations, so that from those intermediate representations you might be able to make unbiased estimators of classifiers even for objects that are rare — and that's one way of interpreting this attribute machinery.
and the other thing is: if you have good representations of things like geometry, you just might be able to escape the effects of dataset bias.
so, my last conclusion, and then we're almost done.
object recognition links to utility in complex ways that are not terribly well understood yet.
the biggest question in computer vision right now is: what should we actually say about visual data? a picture — or a video — goes into a recognition system; question: what should come out?
one answer is a list of everything that's in the picture. that's a silly answer: there are too many things in the picture. if i look at this room in front of me, it's silly to be describing the nut on the bolt that holds up the emergency exit sign. so that's a bad answer.
a better answer would be: a useful representation of reasonable size. which is a lousy answer, because we don't know what it means to be useful, and we don't know how to make the size reasonable.
it seems that object categories depend on utility.
so when i talked about that monkey — it could also be a plastic toy, but the other category it can occupy is "irrelevant": it really just doesn't matter, we're not that interested in it, so why care?
if you look at this little fellow, who turned up in my data breaking a bottle recently — somebody pointed out that it's a beer bottle — you know, you could think of him as a person, or a child, or a beer drinker, or a tourist, or a hooligan, or an obstacle, and so on. you can see that you can categorise him in many different ways.
so, just depending on what you're doing, that object occupies a wide range of different potential categories.
so what i've talked about suggests the emergence of a new belief space about object recognition. we're sort of heading in this direction, and it looks as though it's gonna be interesting when we get there.
in this belief space, object categories are really fluid: they're opportunistic devices to aid generalisation; they're affected by your problem and by utility; things can belong to many categories.
some people would refer to this as a cellphone, or as a smartphone; if i fling it into the audience, it would turn into a projectile immediately — and in fact, the fact that it was a smartphone would have nothing to do with whether it was a projectile.
so at the same time, the same instance can belong to different categories — or, sorry, at different times it can belong to different categories. and when we talk about objects as being special within a category, that's meaningful: it's not like all birds are the same bird. some are interesting because they're missing tails, others are interesting because they have special feathers, other birds are interesting because they're inside this room flying around, as we had just before the talk.
many categories seem to be rare, and many characterisations might be personal: i might think about some things differently than you do, and if we don't talk about it, it really just doesn't matter.
and in turn, that suggests that recognition is not really just discrimination — it's constantly coping with the unfamiliar, in the presence of massive and unreasonable bias. and we need new tools and machinery to do that.
so i'm done — i've run through my major points, and it remains only to point out that if you want more information, you can get it here. but if somebody tries to sell you the one with the brown cover, be careful, because that's the first edition, and that's ten years old; the second edition appeared physically in november. so they do exist, and they're around, and it has fairly up-to-date information about the state of recognition.
and thanks — what i've described has been supported by numerous agencies and organisations, including the Office of Naval Research and the National Science Foundation.
and we're done.
just a quick question about size. so, in the case where the person was misrecognised as a bottle — is the issue that this is a misrecognition where we could go, "well, that's something at just the wrong scale"? because i guess size is really difficult — it's hard to tell how big something is.
okay — so, within vision... yes and no.
we know that people are amazingly good at making size judgements. the main literature about this that i know of describes the things that they get wrong, but we don't know how they do it, and we don't have methods right now in computer vision that can do size estimation satisfactorily.
one reasonable resolution to the person-bottle confusion is, you know, that a bottle is just a lot smaller than a person. but how do you know how big the thing you see is, in an absolute sense? well, one way: i look at some kind of big-scale geometric context around it, i use it to make some estimate of the camera and where things are, and that tells me something about size — and if i get really gross size mismatches, then i can say, "no, that isn't gonna work."
right now nobody can do that in a satisfactory way; i would regard it as something that's sort of in the air, coming. i would think in three or four years' time we might do those coarse size judgements moderately well. more detailed size judgements, i think, are still very mysterious.
they do require putting together a whole bunch of contextual machinery, because of the scaling effect of perspective: what looks like a small object in an image might just be a big object a long way away, so you need some notion of the space that it occupies.
and that's one of the attractions, by the way, of that fun movie of the things moving around in the room: when you have that degree of understanding of space, you probably can make size predictions, and you could use them to drive recognition. but as far as i know, there's nothing right now that does.
so: the datasets are biased — biased towards things that are interesting — and i'm wondering why, in computer vision, we don't use the datasets themselves as the vocabulary from which to describe images. the bias is obviously something that people are drawn to, and it seems that the data itself could be the vocabulary with which you describe things: you describe an image in terms of its representation in this huge dataset.
so — i think this is just an interesting setting, because different agendas react to this very differently.
if you think about computer vision as something you do when you stick a camera on your head and walk around, well, then the lion dataset i showed you is useless, right? but if you think about computer vision as something where what i do is use Google Images to interpret more Google Images, the whole issue of bias is just not an issue, because one is a fair sample of the other.
there is very little explicit writing about what you're referring to, but there is a lot of work that implicitly takes it into account.
so much of what i've talked about in recognition actually involves some interesting use of an iconographic convention — which is a way of talking about what you're talking about. we don't have a good enough understanding of that issue to be able to talk about it clearly.
so, you know, there are two kinds of convention. one is "lions are interesting": this one's got something, that one's riding a horse — that's a selection convention. and the other is: we really tend to photograph lions head-on — you won't see all that many pictures of, you know, a lion photographed at three-quarters with the shoulder dominating the picture.
and it seems like one kind of iconographic convention is different from the other: one of them, if you like, is interestingness in terms of properties — that's semantic stuff — and the other is characteristic views.
we just don't have the language to separate those two and talk about them sensibly. again, i think it's very much on the agenda, because if you can't separate these — you know, if you really want to learn about the world from Google Images — you're gonna have trouble, and we know that we don't really have the tools.
so it's sort of a coincidence of interests, but that's the best i can do.
a comment on what you said about utility — about what matters in a picture. you said what matters in the picture depends on the utility at the time of viewing, but yet it seems like when you gave the image today — the image of the two girls — to several people, they came up with pretty much the same description. so there seems to be a sort of baseline utility which is context-independent; i was wondering if you could comment on that. — i think you're right.
so, there's a fair amount of experimental work on what people select to mention. the situation is a little bit murky, because it's hard to do the experiments exactly right and to be precise, but some evidence suggests the kinds of things that predispose people to mention something.
we're really interested in people, to begin with — and you can explain that, because people have the potential to affect you; when you've got a person around, they're sort of always interesting, as a kind of baseline.
another thing is that big things tend to be mentioned.
and things that are unusual: you know, if you have a small rhinoceros in a downtown street view, people are gonna say, "gee, you don't see that very often", and mention it.
those seem to be rough principles for baseline utility, but we do not yet have the class of understanding required to say, "well, okay, there's a baseline utility, and then there's also a component that's linked to the immediate task" — although i would guess that that's the situation.
if one wanted to take a very extreme point of view, you could say the right way to do vision is with reinforcement learning, because that's the way nature did it: you just hit every vision system on the head if it doesn't do everything right. the downside of that one is it would take an awfully long time.
and, you know, it's appealing — opening up these utility issues and getting a better understanding of the principles seems to be important. again, sorry, that's the state of the understanding.
next question.
so, i mean, obviously we've all kind of, in our heads, been comparing how vision people do their stuff and how speech people do their stuff.
and the two things that kind of make speech recognition work, in my view, at a very abstract level, are: one, that we model how the various units we're trying to recognise change in context — for instance, you know, how phones are realised depends on what other phones they're next to — and, two, that we use in a massive way this thing you called joint modeling: we model how phones occur together, how words occur together, how high-level linguistic units like topics and other linguistic units at various levels all interact and have co-occurrence statistics that can inform the units within them.
so this joint modeling that you just touched on is really massively important for speech recognition. and so these two aspects — the modeling of how things change as a function of context, and then the modeling of the context itself and its statistics — do you see those as being... still having a long way to go, or is it just not something that works as well in the vision domain? can you draw some comparisons there?
— you've put your finger on a really nasty issue.
we know about context — we've been talking about context since the eighties — and then the question is sort of how, what, and why, and under what circumstances, and where you get the contextual statistics, and all that jazz. and there is a tremendous amount of work on that topic.
i guess a reasonable summary is: clever use of contextual information often improves a particular function just a little bit.
but there is no example anyone knows of where context just hits the issue out of the park — and i'm using the word context in the broadest possible sense of various kinds of co-occurrence.
so, the geometric stuff: for example, you can make pedestrian detectors a little bit better by knowing about geometry, and the little bit is worth having — maybe that's one person who doesn't get run over, or whatever.
but i know of no example in vision where things get a lot better through heavy-duty contextual information. now, you could argue about that — and people do argue about it, in two ways. one argument is: well, you're not using enough contextual information; if you use much richer contextual models, in more detail and the like, things will get better. if you believe it'll get better — there are whole research programs based on that hypothesis.
the other argument is: well, those elaborate structures become increasingly subject to issues of variance in estimation and all that jazz, and basically what you win with one hand you lose with the other, and you're sort of back where you started.
i would say the jury is still out on this question. it's very firmly on the agenda; it's very aggressively studied.
and my own bet would be: contextual information really matters — but it also really matters which contextual information you use and which you ignore, and that second choice is pretty hard. we don't really have the machinery that says, "this is the good stuff; this is the bad stuff."
now, i know it's not easy to meaningfully contrast vision and speech — they're just different activities, different communities that do different things. but i would say we have a bafflingly rich selection of potential contexts to use — everything from camera geometry, to geometric context, to special properties of texture and lighting, to co-occurrence statistics of objects, or object-scene co-occurrences, and the like — and one possible source of the difficulty is we just don't know what to select out of that.
i'd like to address this first to you, and then to Jeff, who's nearby. i don't know if you heard Jeff's talk yesterday morning on these segmental conditional random fields — the idea he's proposing, which is, you know, basically to model speech by incorporating information from multiple detectors, using these segmental random fields. i actually don't know enough to know whether that was inspired by the vision world and migrated to speech, or vice versa, but i was wondering if both of you could comment on what commonalities you see between those two approaches — and whether there's anything you think you might use from Jeff's talk, or, Jeff, whether you see anything here, based on what you heard from David — some cross-pollination between the two areas.
so i think — and i guess Jeff is next at a microphone — from my perspective there are strong resonances and harmonies, and one of them here is an idea that's pervasive in vision, which is: if you can carve up a picture into the pieces that matter, you can get much more information about the pieces, because you've got spatial support over which to pool features and such.
most serious vision people believe that if you could do a good job of coming up with a segmentation, everything would get better. i say "believe" because there's no evidence to support that view, and it's reasonable to say that the people who believe it are holding an untested, unsupported belief and might be wrong. so, you know, we're in a position where smart people think it should work out, but right now none of the best detection or classification methods takes any account of spatial support — they just look at the picture as a whole.
i think that will change — i will go to my grave believing that if it hasn't changed, we've done something wrong and it'll come right later on. but it hasn't changed yet, and that's a very disturbing feature of the vision landscape.
so i think there's potential there, but nobody's demonstrated it yet — that would be my reaction.
i've got the microphone in my hand, so i'll go next — yes.
yeah, so i thought that was very interesting, and i think there are many points of commonality. two things struck me. one of them was: in the vision case, it seemed that the attributes were much clearer than what we have in the speech case — for example, "has a beak", "has wings", "has wheels". those are high-level attributes that we can write down a lot of just by thinking about the problem, and i'm not sure that we have the same attributes available to us when looking at the spectrogram or the speech signal.
and the other thing that occurred to me was that perhaps in the vision case there's an interesting extension of ideas which we're dealing with in the speech case, which has to do with the sequential aspect of things — for example, if you're working not with a fixed image but with video, where you have a sequence of scenes, you might want to segment that into segments using some of the attributes that exist within the segments.
so, responding to the first point — there's been much discussion about attributes, and a short talk like this one can't summarise all of it — but: it's easy to write down a couple of hundred attributes; it's not clear that they're independent of each other; and it's not clear they cover the game by any manner of means.
we don't really have a story about what you do if you don't know the natural attributes. the story people currently use is: if you can come up with something that's discriminative, it's gonna be an attribute one way or another, whether or not you can name it.
but there's actually a moderately interesting vision problem here, where we sort of know we don't have attributes and would like to — and that question of developing attributes for things where it's hard to write down a list is a big deal for us. if we can learn about it from speech, we'd be pleased to learn.
as for time helping segmentation — again, spatio-temporally segmenting videos doesn't seem to work much better than anything we know how to do for spatially segmenting images, though people would like it to.
you know, i'd say most of the serious people in vision believe that's because we're misunderstanding something, but we don't know what it is, and we don't know how to fix it.
just on what you said — there is a section of the community that does believe in feature detectors, like articulatory feature detectors. i'm not saying it's right or wrong, but there was a part of the community that looked at speech recognition from that viewpoint, which is a little more similar.
one thing i wasn't sure about — this is just a clarification of your talk: when you produce these features, are these hard features that are being produced — that is, either it's there or it's not — or are these all soft decisions that are extracted? so is there, like, a set of ten billion possible things, and a probability that's thresholded? or do you make a decision: here it's a potato, there it's a septic tank, et cetera, et cetera?
Well, the nice thing about this is that you can make a list of all the potential alternatives and find a paper about almost any combination. Usually what people do is report one alternative: it's a pedestrian or it isn't, it's a cat or it isn't. But there's a fair amount of interest in, for example, the top five: there are a bunch of applications where, as long as you get a good ranking, even if you get the wrong thing first, if the right thing is within the top ranking then you're okay, and people are very interested in that. There's another class of activity which says: look, if I build these detectors, I can actually think of the outputs as being features. What I'm going to do is pretend I'm building detectors, then look at the responses, treat them as features, and use them for a completely different activity. So essentially all the alternatives you describe appear in someone's paper somewhere, and I wouldn't say there's any consensus about what the best thing is, which is unfortunate; there isn't a single recipe where you do this and you're okay.
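The "top five" evaluation mentioned above is easy to make concrete: a prediction counts as correct if the true class appears among the k highest-scoring classes. A minimal sketch (the function name is mine, not from the talk); note that the same score vector can also be reused directly as a feature vector for some other task, as described above:

```python
import numpy as np

def top_k_correct(scores, true_label, k=5):
    """Return True if true_label is among the k highest-scoring classes.

    scores: 1-D array with one detector/classifier score per class.
    """
    topk = np.argsort(scores)[::-1][:k]  # class indices, best first
    return true_label in topk
```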
One difference between speech and images that occurs to me is that all the images in these datasets seem to be sort of high-quality images; no one seems to post their crappy pictures on the web. So I wonder how well some of these techniques work when the pictures are of poor quality: blurry, overexposed, or underexposed. Because in speech we have a lot more of that sort of variability in quality, it seems, and it affects the performance of our systems.
So, on whether the pictures are all good: actually, there's an awful lot of pretty pictures and an awful lot of cruddy videos out there, and a look at YouTube will reassure you on this point. And some things are hard. The things that make our feature computations go wrong are very different from the acoustical phenomena that make your feature computations give you problems, but there are some points of contact.
We benefit quite a lot from time. Just one moderately good example: if you're interested in human activity recognition, think about a soccer field, a long view of the field with a player running across it. You really just can't resolve the arms and legs; you've got motion blur to worry about, and a limb is about one pixel across anyhow, so a single frame is just a mess. But if you look over a longer time scale, you can get a fairly good picture of what's going on, just from the sequence of pixels and the motion of those pixels.
So I think some of the losses of resolution might not be as destructive as some of the acoustic effects that you encounter, but I'm not sure that that's true. There are whole areas in vision that are basically dead in the water as a result of, say, interreflections of light, whereas I think multipath acoustic distortion probably isn't the biggest thing in your life; you have other things to worry about. So it depends on the kind of situation.
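The point about benefiting from time can be sketched concretely: average per-frame feature vectors over a sliding temporal window, so that structure invisible in any single blurry, one-pixel-wide frame shows up in the pooled signal. A minimal numpy sketch; the function name and the simple mean-pooling choice are my own assumptions, not a method from the talk:

```python
import numpy as np

def temporal_pool(frame_features, window):
    """Average per-frame features over a sliding window of `window` frames.

    frame_features: (n_frames, d) array, one feature vector per frame.
    Returns (n_frames - window + 1, d) pooled vectors, computed via a
    cumulative sum so each window mean costs O(d).
    """
    c = np.cumsum(frame_features, axis=0, dtype=float)
    c = np.vstack([np.zeros((1, frame_features.shape[1])), c])  # prepend zero row
    return (c[window:] - c[:-window]) / window
```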
There's a lot of interest in low-resolution pictures: agencies care about, for example, poor pictures that come out of forward-looking infrared sensors, for somewhat alarming reasons.