Context Aware Spatial Priors using Entity Relations (CASPER)
Geremy Heitz, Jonathan Laserson, Daphne Koller
December 10th, 2007 – DAGS
Outline
Goal – Scene Understanding
Existing Methods
CASPER
Preliminary Experiments
Future Direction – Going Discriminative
[Figure: example street scene annotated with buildings, a tree, and cars]
Representation
[Figure: street scene annotated with buildings, a tree, and cars]
l = bag of object categories
ρ = locations of the object centroids
I = the image
We model P(ρ, l). Why? Because we use a generative model:
P(ρ, l | I) ∝ P(ρ, l) P(I | ρ, l)
[Figure: two candidate scene layouts with buildings, trees, and cars – one arrangement plausible, one not]
Which one makes more sense?
Does Context matter?
Can it help Object Recognition?
LOOPS
Outline
Goal – Scene Understanding
Existing Methods
CASPER
Preliminary Experiments
Future Direction – Going Discriminative
Fixed Order Model
Each image has the same bag of objects (example: 1 car, 2 buildings, 1 tree)
Object centroids are drawn jointly:
P(ρ, l) = 𝟙{l = l_fixed_order} · P(ρ | l)
Similar to constellation models (Fergus)
Problem: we don't always know the exact set of objects
TDP (Sudderth, 2005)
Each image has a different bag of objects
Object centroids are drawn independently:
P(ρ, l) = P(l) ∏ᵢ P(ρᵢ | lᵢ)
Problem: this doesn't take pairwise constraints into account – we have lost context
Outline
Goal – Scene Understanding
Existing Methods
CASPER
Preliminary Experiments
Future Direction – Going Discriminative
CASPER
Each image has a different bag of objects
Object centroids are drawn jointly given l:
P(ρ, l) = P(l) P(ρ | l)
Questions:
How do we represent P(l)?
How do we represent P(ρ | l)?
How do we learn? How do we infer?
P(l)
Dirichlet Process – we don't want to get into that now
Other options: Multinomial, Uniform
P(ρ | l) - Desiderata
Correlations between the ρ's
Sharing of parameters between l's
Intuitive parameterization
Continuous multivariate distribution
Easy to learn parameters
Easy to evaluate likelihood
Easy to condition
Gaussian?
MV Gaussian - Options
Learn a different Gaussian for every l?
Can't share parameters
Large (∞) number of possible l's
Gaussian Process: ρ(x) ~ GP(μ(x), K(x, x'))
Every finite set of x's produces a Gaussian ρ:
[ρ(x1) ρ(x2) … ρ(xk)] ~ Gaussian
xt is a hidden function of the class lt
μ(xt) = A xt
K(xt, xt') = c exp(-||B(xt - xt')||²)
Two objects of the same class -> same x?
Is correlation the natural space?
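As a rough sketch of this GP option: every finite set of latent inputs x_t yields a joint Gaussian over the centroids, with the linear mean μ(x) = A x and squared-exponential kernel above. The function name and all parameter values below are illustrative assumptions, not the talk's implementation.

```python
import numpy as np

# Sketch of the GP option: mean mu(x) = A x, kernel
# K(x, x') = c * exp(-||B (x - x')||^2).  All names are illustrative.
def gp_prior(X, A, B, c):
    """Mean vector and covariance matrix of rho(x_1), ..., rho(x_k)
    for the k latent inputs in X (a k x d array)."""
    mu = X @ A.T                              # linear mean function
    d = X[:, None, :] - X[None, :, :]         # pairwise differences x_t - x_t'
    BtB = B.T @ B
    # Quadratic form ||B (x_t - x_t')||^2 for every pair, then the kernel
    K = c * np.exp(-np.einsum('ijk,kl,ijl->ij', d, BtB, d))
    return mu, K
```

Any finite subset of inputs then gives a valid joint Gaussian, which is what lets objects of different classes share the same underlying spatial process.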
Spatial Distribution - Options
"Singleton Expert": P(ρi | li) – a Gaussian over the absolute object location
"Pairwise Expert": P(ρi - ρj | li, lj) – a Gaussian offset between objects
An expert can be one of K mixture components
[Figure: pairwise offset experts between cars and a tree, with mixture components k = 1 and k = 2]
CASPER P(ρ | l)
How do we use the experts? Introduce an auxiliary variable d: P(ρ | d, l), where d tells us which experts are 'on'
[Figure: scene graph with an expert variable d_e on each edge between object instances]
For each edge e = (li, lj), de indexes all possible experts for this edge
The default is a uniform expert
P(ρ | d, l) ∝ POEd
POEd = ∏ P(ρi | li) · ∏ P(ρi - ρj | dij, li, lj)
A product of Gaussians is a Gaussian
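To make this concrete, here is a tiny 1-D, two-object sketch (all numbers invented) of how singleton and pairwise offset experts combine in precision (information) form, so that the product of experts is again a Gaussian:

```python
import numpy as np

# Illustrative sketch, not the actual model: each expert adds its
# contribution to a joint precision matrix J and information vector h,
# so POE_d = N(rho; mu_d, Sigma_d) up to a constant.

# Singleton experts N(rho_i; m_i, s2_i) for two objects:
m = np.array([0.0, 2.0])
s2 = np.array([4.0, 4.0])
J = np.diag(1.0 / s2)                  # precision contributions
h = m / s2                             # information-vector contributions

# One pairwise offset expert N(rho_1 - rho_2; mu_off, t2):
mu_off, t2 = -1.5, 0.25
D = np.array([[1.0, -1.0]])            # difference operator: D @ rho = rho_1 - rho_2
J = J + D.T @ D / t2
h = h + (D.T * (mu_off / t2)).ravel()

Sigma_d = np.linalg.inv(J)             # joint covariance Sigma_d
mu_d = Sigma_d @ h                     # joint mean mu_d
```

The tight pairwise expert dominates here: the posterior mean offset mu_d[0] - mu_d[1] lands near -1.5 rather than the -2.0 implied by the singleton means alone.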
CASPER P(ρ|d,l)
POEd = Zd N(ρ; μd, Σd)
P(ρ | d, l) = N(ρ; μd, Σd) = (1/Zd) POEd
P(d | l) ∝ Zd (Multinomial)
P(ρ, d | l) ∝ POEd
[Figure: three cars with two alternative edge-expert assignments d1 and d2]
Example: P(ρ, d | l) ∝ P(ρ2 - ρ1 | d12) P(ρ3 - ρ2 | d32)
P(ρ | d1, l) = P(ρ | d2, l), but Zd2 > Zd1, hence POEd2 > POEd1
Learning the Experts
Training set with supervised (ρ, l) pairs (one pair for each image)
Gibbs over the hidden variables de:
Loop over the edges
Update the experts' sufficient statistics with each update
Does it converge? Not as much as we want it to – work in progress
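A minimal sketch of this Gibbs loop, assuming a toy setup with K = 3 one-dimensional offset experts per edge and a uniform prior over experts (all values invented):

```python
import numpy as np

# Toy Gibbs sweep over the edge expert assignments d_e (illustrative only).
rng = np.random.default_rng(0)
mu_k = np.array([-2.0, 0.0, 2.0])      # expert means (invented)
s2_k = np.array([1.0, 1.0, 1.0])       # expert variances (invented)
K = len(mu_k)
pi_k = np.ones(K) / K                  # uniform prior over experts

edges = [(0, 1), (1, 2)]               # pairs of object instances
rho = np.array([0.0, 1.9, 4.1])        # supervised centroids (training data)
d = np.zeros(len(edges), dtype=int)    # current expert assignment per edge

for sweep in range(50):
    for e, (i, j) in enumerate(edges):
        off = rho[i] - rho[j]
        # Unnormalized posterior over experts for this edge's offset
        lik = np.exp(-0.5 * (off - mu_k) ** 2 / s2_k) / np.sqrt(s2_k)
        post = pi_k * lik
        d[e] = rng.choice(K, p=post / post.sum())
        # (In the full algorithm, each expert's sufficient statistics
        #  would be updated here from the offsets assigned to it.)
```

Both observed offsets sit near -2, so the sampler concentrates on the first expert; resampling the expert means from their assigned offsets is the step that is still being stabilized.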
Outline
Goal – Scene Understanding
Existing Methods
CASPER
Preliminary Experiments
Future Direction – Going Discriminative
Preliminary Experiments
LabelMe Datasets
STREETS BEDROOMS
[Figure: interest points (*) detected over the image; a car instance with centroid ρt]
FEATURES:
Harris Interest Operator -> yi
SIFT Descriptor -> wi
Instance membership -> ti
INSTANCES:
Centroid -> ρt
Class label -> lt
The features are observed: P(I | ρ, l) = P(y, w | ρ, l)
What do the true ρ’s look like?
[Figure: empirical pairwise offset distributions – Car -> Car, Lamp -> Lamp, Bed -> Lamp]
Learning/Inference in Full Model
TDP – three-stage Gibbs:
Assign features to instances (sample ti for every feature)
Assign expert components (sample de for every edge)
Assign instances to classes (sample lt, ρt for every instance)
Training: supervise the (t, l) variables; Gibbs over d and ρ
Testing: introduce new images; Gibbs over (t, l, d, ρ) of the new images
Independent-TDP: ρ's are independent
CASPER-TDP: ρ's are distributed according to CASPER
Learned Experts
[Figure: learned experts over the feature points; panels show IMAGE, GROUNDTRUTH, IND – N = 0.1, IND – N = 0.5]
Evaluation – Gen Model
"Synthetic Appearance": visual words give a strong indicator of the class
Evaluated on detection performance: precision/recall F1 score for centroid and class identification
Results here are with Independent TDP:

Class      N = 0.1   N = 0.3   N = 0.5
Bed        0.6111    0.6286    0.5882
Lamp       0.3077    0.1667    0.0000
Painting   0.5333    0.3333    0.2857
Window     0.9091    0.7692    0.5455
Table      0.6667    0.4211    0.3529

Can we hope to do this well?
Evaluation - Context
Independent-TDP vs CASPER-TDP, N = 0.5:

Class      INDEPENDENT   CASPER
Bed        0.5882        0.5714
Lamp       0.0000        0.0000
Painting   0.2857        0.1333
Window     0.5455        0.4000
Table      0.3529        0.1250

Why isn't context helping here?
Problems with this Setup
Bad feelings about the supervised setting (Detection):
Our model is not trained to maximize detection ability
We will lose to many/most discriminative approaches
Context is NOT the main reason why TDP fails
Bad feelings about the unsupervised setting:
Likelihood? Does anyone care?
Object discovery? Context is a lower-order consideration
How would we show that CASPER > Independent?
Outline
Goal – Scene Understanding
Existing Methods
CASPER
Preliminary Experiments
Future Direction – Going Discriminative
Going Discriminative
Up to now we have been generative:
P(I, ρ, l) = P(I | ρ, l) P(ρ, l)
How do we convert this into a discriminative model?
Include the CASPER distribution over (ρ, l)
Include a term with boosted object detectors
Slap on a partition function:
P(ρ, l | I) = (1/Z) · CASPER · DETECTORS
Discriminative Framework
Boosted detectors "over-detect"
Each "candidate" has: a location ρt, a class variable lt, and a detection score DI(lt)
P(ρ, l | I) ∝ P(ρ, l) ∏t DI(lt)
Goal: reassign the detection candidates to classes, in a way that
respects the "detection strength"
respects the context between the objects
[Figure: two face detection candidates, with scores DI(face) = 0.09 and DI(face) = 0.92]
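A toy illustration of the reassignment idea (the candidate names, detector scores, and context terms below are all invented): a posterior proportional to detector score times a context term can flip the ranking of two candidates.

```python
# Toy rescoring sketch (all numbers invented): each candidate's posterior is
# proportional to its detector score times a context term from P(rho, l).
det_score = {"face_A": 0.92, "face_B": 0.09}
context = {"face_A": 0.05, "face_B": 0.90}   # e.g. face_A sits in an implausible spot

post = {c: det_score[c] * context[c] for c in det_score}
Z = sum(post.values())                        # "slap on a partition function"
post = {c: p / Z for c, p in post.items()}
# Context flips the ranking: the weaker detection wins here.
```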
Similarities to Steve’s work
“Over detection” using boosted detectors
But some detections don’t make sense in context
3D information allows him to “sort out” which detections are correct
CASPER Learning/Inference
Gibbs inference:
Loop over images
Loop over detection candidates t: sample (lt | everything else)
Loop over pairs of candidates: sample (de | everything else)
Training: lt is known; Gibbs over de
Evaluation: precision/recall for detections
Possible Datasets
Short Term Plan
Learn the boosted detectors
Determine our baseline performance
Add Gibbs inference
Submit to a conference that is far, far away… (ICML = Helsinki, Finland)
Alternate Names
Spatial Priors for Arbitrary Groups of Objects
Product of Experts – Precision Space View
P1(x) = N(x; a, A), P2(x) = N(x; b, B)
P1(x) P2(x) = Z N(x; c, C), where
Z = N(a; b, A + B)
C⁻¹ = A⁻¹ + B⁻¹
c = C (A⁻¹ a + B⁻¹ b)
What does this mean? The precision matrices of the experts ADD
Even if each expert has a singular precision A⁻¹, the sum is still PSD
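This identity is easy to verify numerically; the two experts below use arbitrary illustrative numbers:

```python
import numpy as np

def gauss_pdf(x, m, S):
    """Density of a d-dimensional Gaussian N(x; m, S)."""
    d = len(m)
    diff = x - m
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(S))
    return np.exp(-0.5 * diff @ np.linalg.solve(S, diff)) / norm

# Two 2-D experts (numbers are illustrative).
a, A = np.array([0.0, 0.0]), np.array([[2.0, 0.0], [0.0, 1.0]])
b, B = np.array([1.0, 1.0]), np.array([[1.0, 0.0], [0.0, 3.0]])

# Precision-space product: precisions add, information vectors add.
C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
c = C @ (np.linalg.solve(A, a) + np.linalg.solve(B, b))
Z = gauss_pdf(a, b, A + B)

# Check P1(x) P2(x) = Z N(x; c, C) at an arbitrary point.
x = np.array([0.3, -0.7])
assert np.isclose(gauss_pdf(x, a, A) * gauss_pdf(x, b, B), Z * gauss_pdf(x, c, C))
```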