TRANSCRIPT
Advanced Computer Vision (Module 5F16)
Carsten Rother, Pushmeet Kohli
Syllabus (updated)
• L1&2: Intro
  – Intro: Probabilistic models
  – Different approaches for learning
  – Generative/discriminative models, discriminative functions
• L3&4: Labelling Problems in Computer Vision
  – Graphical models
  – Expressing vision problems as labelling problems
• L5&6: Optimization
  – Message Passing (BP, TRW)
  – Submodularity and Graph Cuts
  – Move Making algorithms (Expansion/Swap/Range/Fusion)
  – LP Relaxations
  – Dual Decomposition
Syllabus (updated)
• L7&8 (8.2): Optimization and Learning
  – Compare max-margin vs. maximum likelihood
• L9&10 (15.2): Case Studies – tbd … Decision Trees and Random Fields, Kinect person detection
• L11&12 (22.2): Optimization Comparison, Case Studies (tbd)
Books
1. Advances in Markov Random Fields for Computer Vision. MIT Press 2011. (Edited by Andrew Blake, Pushmeet Kohli and Carsten Rother)
2. Pattern Recognition and Machine Learning, Springer 2006, by Chris Bishop
3. Structured Learning and Prediction in Computer Vision (Sebastian Nowozin and Christoph H. Lampert; Foundations and Trends in Computer Graphics and Vision series of now publishers, 2011).
4. Computer Vision, Springer 2010, by Rick Szeliski
A Gentle Start: Interactive Image Segmentation and Probabilities
Probabilities
• Probability distribution P(x): ∑_x P(x) = 1, P(x) ≥ 0; discrete x ∈ {0,…,L}
• Joint distribution: P(x,z)
• Conditional distribution: P(x|z)
• Sum rule: P(x) = ∑_z P(x,z)
• Product rule: P(x,z) = P(x|z) P(z)
• Bayes’ rule: P(x|z) = P(z|x) P(x) / P(z)
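These rules can be checked mechanically on a tiny discrete example. The following is a minimal numpy sketch; the numbers are made up purely for illustration:

```python
import numpy as np

# Toy discrete example of the sum, product, and Bayes' rules.
# x in {0,1} (label), z in {0,1,2} (observation); numbers are made up.
P_x = np.array([0.7, 0.3])                      # prior P(x)
P_z_given_x = np.array([[0.6, 0.3, 0.1],        # P(z|x=0)
                        [0.1, 0.2, 0.7]])       # P(z|x=1)

P_xz = P_x[:, None] * P_z_given_x               # product rule: P(x,z) = P(z|x) P(x)
P_z = P_xz.sum(axis=0)                          # sum rule:     P(z) = sum_x P(x,z)
P_x_given_z = P_xz / P_z[None, :]               # Bayes' rule:  P(x|z) = P(z|x) P(x) / P(z)

assert np.isclose(P_xz.sum(), 1.0)              # the joint sums to 1
print(P_x_given_z[:, 2])                        # posterior over x after observing z=2
```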
Interactive Segmentation
Goal: given the image z and unknown variables x, infer x from
  P(x|z) = P(z|x) P(x) / P(z)  ∝  P(z|x) P(x)
where z ∈ (R,G,B)^n is the image and x ∈ {0,1}^n the labelling.
• Posterior probability: P(x|z)
• Likelihood (data-dependent): P(z|x)
• Prior (data-independent): P(x); P(z) is a constant
Maximum a Posteriori (MAP): x* = argmax_x P(x|z)
We will express this as an energy minimization problem: x* = argmin_x E(x)
(user-specified pixels are not optimized over)
Likelihood P(x|z) ∝ P(z|x) P(x)
[Figure: foreground and background colour distributions plotted over the Red–Green plane]
Likelihood P(x|z) ∝ P(z|x) P(x)
Maximum likelihood:
  x* = argmax_x P(z|x) = argmax_x ∏_i P(z_i|x_i)
[Figure: per-pixel likelihoods p(z_i|x_i=0) and p(z_i|x_i=1)]
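Without the prior, the maximum-likelihood labelling decomposes into an independent decision per pixel. A minimal sketch, using simple isotropic Gaussian colour models as stand-ins for the foreground/background likelihoods (the means and variance below are assumptions, not values from the lecture):

```python
import numpy as np

# Per-pixel maximum-likelihood labelling: x_i* = argmax_{x_i} p(z_i | x_i).
# p(z|x=k) is an isotropic Gaussian in RGB; means/variance are made-up
# stand-ins for colour models fitted to user-marked pixels.
mu = np.array([[40.0, 40.0, 40.0],      # mean colour for background (x=0)
               [200.0, 60.0, 60.0]])    # mean colour for foreground (x=1)
sigma2 = 30.0 ** 2

def log_likelihood(z, k):
    """log p(z|x=k) for an image z of shape (H, W, 3), up to a constant."""
    return -np.sum((z - mu[k]) ** 2, axis=-1) / (2 * sigma2)

z = np.random.rand(4, 4, 3) * 255        # a random "image" just for illustration
x_star = (log_likelihood(z, 1) > log_likelihood(z, 0)).astype(int)
print(x_star)                            # per-pixel ML labels, no smoothness yet
```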
Prior P(x|z) ∝ P(z|x) P(x)
P(x) = 1/f ∏_{i,j ∈ N4} θ_ij(x_i,x_j)
f = ∑_x ∏_{i,j ∈ N4} θ_ij(x_i,x_j)   “partition function”
θ_ij(x_i,x_j) = exp{-|x_i-x_j|}   “Ising prior”
(exp{-1} ≈ 0.37; exp{0} = 1)
Prior – 4x4 Grid
Pure prior model: P(x) = 1/f ∏_{i,j ∈ N4} exp{-|x_i-x_j|}
[Figure: best and worst solutions sorted by probability]
“The smoothness prior needs the likelihood”
Prior – 4x4 Grid: Distribution
Pure prior model: P(x) = 1/f ∏_{i,j ∈ N4} exp{-|x_i-x_j|}
[Figure: probability over all 2^16 configurations, with samples]
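Because the grid is only 4x4, the pure prior model can be enumerated exactly, which is how distributions like the one above can be plotted. A brute-force sketch (w = 1 corresponds to the exp{-|x_i-x_j|} prior):

```python
import itertools
import numpy as np

# Brute-force enumeration of the pure Ising prior on a 4x4 grid (2^16 states):
# P(x) = 1/f * prod_{(i,j) in N4} exp{-w |x_i - x_j|}.
H, W, w = 4, 4, 1.0
edges = [((r, c), (r, c + 1)) for r in range(H) for c in range(W - 1)] + \
        [((r, c), (r + 1, c)) for r in range(H - 1) for c in range(W)]

def prior_energy(x):
    return w * sum(abs(int(x[a]) - int(x[b])) for a, b in edges)

states = [np.array(bits).reshape(H, W)
          for bits in itertools.product([0, 1], repeat=H * W)]
energies = np.array([prior_energy(x) for x in states])
probs = np.exp(-energies)
probs /= probs.sum()                      # divide by the partition function f

order = np.argsort(-probs)                # sort configurations by probability
print(states[order[0]], states[order[1]], sep="\n")   # most probable
print("least probable prob:", probs[order[-1]])        # least probable
```

The most probable configurations are the two constant labellings (no disagreeing edges), and the least probable are the checkerboards (every edge disagrees), matching the best/worst solutions shown in the figures.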
Prior – 4x4 Grid
Pure prior model: P(x) = 1/f ∏_{i,j ∈ N4} exp{-10|x_i-x_j|}
[Figure: best and worst solutions sorted by probability]
Prior – 4x4 Grid: Distribution
Pure prior model: P(x) = 1/f ∏_{i,j ∈ N4} exp{-10|x_i-x_j|}
[Figure: probability over all 2^16 configurations, with samples]
Putting it together…
Posterior: P(x|z) = P(z|x) P(x) / P(z)
Joint: P(x,z) = P(z|x) P(x)   (… let us look at this later)
P(x|z) = 1/P(z) · 1/f ∏_{i,j ∈ N4} exp{-|x_i-x_j|} · ∏_i p(z_i|x_i)
       = 1/f(z) exp{- ( ∑_{i,j ∈ N4} |x_i-x_j| + ∑_i -log p(z_i|x_i) ) }
       = 1/f(z) exp{- ( ∑_{i,j ∈ N4} |x_i-x_j| + ∑_i [-log p(z_i|x_i=0) (1-x_i) - log p(z_i|x_i=1) x_i] ) }
       = 1/f(z) exp{-E(x,z)}   “Gibbs distribution”
with f(z) = ∑_x exp{-E(x,z)}
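In code, the derivation simply says: add up the pairwise disagreement costs and the per-pixel negative log-likelihoods. A small sketch of E(x,z); the per-pixel log-likelihoods are assumed to be computed elsewhere (e.g. from colour models as above), and the weight w generalizes the derivation, which corresponds to w = 1:

```python
import numpy as np

# Energy of the Gibbs posterior for a labelling x (H x W, in {0,1}) and image z:
# E(x,z) = w * sum_{(i,j) in N4} |x_i - x_j|
#          + sum_i [ -log p(z_i|x_i=1) x_i - log p(z_i|x_i=0) (1 - x_i) ].
def gibbs_energy(x, log_p0, log_p1, w=1.0):
    unary = -(log_p1 * x + log_p0 * (1 - x)).sum()
    pairwise = np.abs(np.diff(x, axis=0)).sum() + np.abs(np.diff(x, axis=1)).sum()
    return unary + w * pairwise

# P(x|z) = exp{-E(x,z)} / f(z); f(z) requires a sum over all labellings, so in
# practice one compares energies (smaller energy = more probable labelling).
```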
Gibbs Distribution is more general
In our case:
Unary term (“encodes our dependency on the data”):
  θ_i(x_i,z_i) = -log p(z_i|x_i=1) x_i - log p(z_i|x_i=0) (1-x_i)
Pairwise term (“encodes our prior knowledge over labellings”):
  θ_ij(x_i,x_j) = |x_i-x_j|
The Gibbs distribution does not have to decompose into prior and likelihood:
  P(x|z) = 1/f(z) exp{-E(x,z)}   with f(z) = ∑_x exp{-E(x,z)}
Energy:
  E(x,z) = ∑_i θ_i(x_i,z) + w ∑_{i,j} θ_ij(x_i,x_j,z) + ∑_{i,j,k} θ_ijk(x_i,x_j,x_k,z) + …   (higher-order terms)
Energy minimization
P(x|z) = 1/f(z) exp{-E(x,z)}   with f(z) = ∑_x exp{-E(x,z)}
-log P(x|z) = -log(1/f(z)) + E(x,z)
The minimum energy solution is the same as the MAP solution:
  MAP: x* = argmax_x P(x|z)   (maximum-a-posteriori solution)
  Global minimum of E: x* = argmin_x E(x,z)
Recap
• Posterior, likelihood, prior: P(x|z) = P(z|x) P(x) / P(z)
• Gibbs distribution: P(x|z) = 1/f(z) exp{-E(x,z)}
• Energy minimization is the same as MAP estimation: x* = argmax_x P(x|z) = argmin_x E(x)
Weighting of Unary and Pairwise term
E(x,z,w) = ∑_i θ_i(x_i,z_i) + w ∑_{i,j} θ_ij(x_i,x_j)
[Figure: segmentation results for w = 0, 10, 40, 200]
Learning versus Optimization/Prediction
Gibbs distribution: P(x|z,w) = 1/f(z,w) exp{-E(x,z,w)}
Training phase: infer w, which does not depend on a test image z:  {x_t,z_t} => w
Testing phase: infer x, which does depend on the test image z:  z, w => x
A simple procedure to learn w
1. Iterate w = 0,…,400:
   1. Compute x*_t for all training images {x_t,z_t}
   2. Compute the average error Er(w) = 1/|T| ∑_t Δ(x_t,x*_t)
      with the Hamming loss Δ(x,x*) = ∑_i [x_i ≠ x_i*]  (number of misclassified pixels)
2. Take the w with the smallest Er (see the sketch below)
Questions: Is this the best and only way? Can we over-fit to the training data?
[Figure: plot of Er against w]
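A sketch of this procedure; `map_inference(z, w)` is a placeholder for whatever solver computes x* = argmin_x E(x,z,w) for this binary model (e.g. graph cut), and is not defined here:

```python
import numpy as np

# Exhaustive search over w: pick the value with the smallest average
# Hamming error on the training set.
def hamming(x, x_star):
    return int((x != x_star).sum())          # number of misclassified pixels

def learn_w(train_pairs, map_inference, w_values=np.arange(0, 401)):
    errors = []
    for w in w_values:
        er = np.mean([hamming(x_t, map_inference(z_t, w)) for x_t, z_t in train_pairs])
        errors.append(er)
    return w_values[int(np.argmin(errors))], errors
```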
Big Picture: Statistical Models in Computer Vision
• Model: discrete or continuous variables? discrete or continuous space? dependence between variables? …
• Optimisation/Prediction/Inference: combinatorial optimization (e.g. graph cut), message passing (e.g. BP, TRW), Iterated Conditional Modes (ICM), LP relaxation (e.g. cutting plane), problem decomposition + subgradient, …
• Learning: maximum likelihood learning, pseudo-likelihood approximation, loss-minimizing parameter learning, exhaustive search, constraint generation, …
• Applications: 2D/3D image segmentation, object recognition, 3D reconstruction, stereo matching, image denoising, texture synthesis, pose estimation, panoramic stitching, …
Machine Learning view: Structured Learning and Prediction
“Normal” machine learning:
  f : Z → N (classification), f : Z → R (regression)
  Input: image, text; output: real number(s)
Structured output prediction:
  f : Z → X
  Input: image, text; output: a complex structured object (labelling, parse tree)
  [Examples: parse tree of a sentence, image labelling, chemical structure]
Structured output – ad hoc definition (from [Nowozin et al. 2011]): data that consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together.
Learning: A simple toy problem
Label generation: a small deviation of a 2x2 foreground (white) square at an arbitrary position.
Data generation:
1. Foreground pixels are white, background pixels black
2. Flip the labels of a few random pixels
3. Add some Gaussian noise
(See the data-generation sketch below; a related example is man-made object detection [Nowozin and Lampert 2011].)
A possible model for the data – Ising model on a 4x4 grid graph:
  P(x|z,w) = 1/f(z,w) exp{-( ∑_i (z_i(1-x_i) + (1-z_i)x_i) + w ∑_{i,j ∈ N4} |x_i-x_j| )}
             (unary terms)                         (pairwise terms)
[Figure: example data z and label x]
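A possible data-generation sketch; the flip probability and noise level are assumptions, since the slide does not give exact values:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_example(H=4, W=4, flip_prob=0.05, noise_std=0.3):
    # Label: a 2x2 foreground square at an arbitrary position.
    x = np.zeros((H, W))
    r, c = rng.integers(0, H - 1), rng.integers(0, W - 1)
    x[r:r + 2, c:c + 2] = 1
    # Data: foreground white / background black, a few flipped pixels, noise.
    z = x.copy()
    flip = rng.random((H, W)) < flip_prob
    z[flip] = 1 - z[flip]
    z = z + rng.normal(0, noise_std, (H, W))
    return x, z

x_t, z_t = make_example()
```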
Decision Theory
Assume w has been learned and P(x|z,w) is as shown:
[Figure: probability over all 2^16 configurations; best and worst solutions sorted by probability]
Which solution x* would you choose?
How to make a decision
Assume the model P(x|z,w) is known.
The risk R is the expected loss, for a chosen “loss function” Δ:
  R = ∑_x P(x|z,w) Δ(x,x*)
Goal: choose the x* which minimizes the risk R.
Decision Theory
Risk: R = ∑_x P(x|z,w) Δ(x,x*)
0/1 loss: Δ(x,x*) = 0 if x* = x, 1 otherwise  =>  MAP: x* = argmax_x P(x|z,w)
[Figure: best and worst solutions sorted by probability]
Decision Theory
Risk: R = ∑_x P(x|z,w) Δ(x,x*)
Hamming loss: Δ(x,x*) = ∑_i [x_i ≠ x_i*]  =>  maximize marginals: x_i* = argmax_{x_i} P(x_i|z,w)
[Figure: best and worst solutions sorted by probability]
Decision Theory
Maximize marginals: x_i* = argmax_{x_i} P(x_i|z,w)
Marginal: P(x_i = k) = ∑_{x_j, j≠i} P(x_1,…,x_i = k,…,x_n)
[Figure: best and worst solutions sorted by probability]
Computing marginals is sometimes called “probabilistic inference”, as opposed to MAP inference.
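For the 4x4 toy model both decisions can be computed exactly by enumeration. A sketch, where `energy(x)` is a hypothetical helper standing in for E(x,z,w) of the model above:

```python
import itertools
import numpy as np

# Probabilistic inference vs. MAP inference by brute force on a tiny model.
def enumerate_posterior(energy, H=4, W=4):
    states = [np.array(b).reshape(H, W) for b in itertools.product([0, 1], repeat=H * W)]
    p = np.exp(-np.array([energy(x) for x in states]))
    p /= p.sum()                                             # normalize by f(z,w)
    return states, p

def map_and_max_marginal(energy, H=4, W=4):
    states, p = enumerate_posterior(energy, H, W)
    x_map = states[int(np.argmax(p))]                        # 0/1-loss decision (MAP)
    marg1 = sum(pi * x for pi, x in zip(p, states))          # P(x_i = 1) per pixel
    x_mpm = (marg1 > 0.5).astype(int)                        # Hamming-loss decision
    return x_map, x_mpm
```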
Recap
A different loss function gives a very different solution !
Two different approaches to learning
1. Probabilistic parameter learning: “P(x|z,w) is needed”
2. Loss-based parameter learning: “E(x,z,w) is sufficient”
Probabilistic Parameter Learning
Training: from the training database {x_t,z_t}, learn the weights by regularized maximum likelihood estimation:
  w* = argmin_w ∑_t -log P(x_t|z_t,w) + |w|²
(It is: P(w|z_t,x_t) ∝ P(x_t|w,z_t) P(w|z_t), so regularized ML can be read as a MAP estimate of w.)
Choose a loss and construct the decision function:
  0/1 loss:      x* = argmax_x P(x|z,w)
  Hamming loss:  x_i* = argmax_{x_i} P(x_i|z,w)
Test time: optimize the decision function for the new test image z, e.g. x* = argmax_x P(x|z,w).
ML estimation for our toy image
Training images z_t with labels x_t.
P(x|z,w) = 1/f(z,w) exp{-( ∑_i (z_i(1-x_i) + (1-z_i)x_i) + w ∑_{i,j ∈ N4} |x_i-x_j| )}
Train: w* = argmin_w ∑_t -log P(x_t|z_t,w)
[Figure: plot of 1/|T| ∑_t -log P(x_t|z_t,w) against w]
How many training images are needed? (A sketch of how to evaluate this objective follows below.)
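A sketch of how the plotted objective can be evaluated for the toy model by enumerating all 2^16 labellings; `energy(x, z, w)` is assumed to implement the Ising model above:

```python
import itertools
import numpy as np

# Average negative log-likelihood of the training set for a given w,
# computed exactly (feasible because the grid is only 4x4).
def neg_log_likelihood(train_pairs, w, energy, H=4, W=4):
    states = [np.array(b).reshape(H, W) for b in itertools.product([0, 1], repeat=H * W)]
    nll = 0.0
    for x_t, z_t in train_pairs:
        energies = np.array([energy(x, z_t, w) for x in states])
        m = (-energies).max()
        log_f = m + np.log(np.exp(-energies - m).sum())   # log partition function f(z_t, w)
        nll += energy(x_t, z_t, w) + log_f                # -log P(x_t | z_t, w)
    return nll / len(train_pairs)
```

Scanning a range of w values and taking the minimum of this curve is what yields the w* = 0.8 reported on the next slide.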
ML estimation for our toy image
Training images z_t with labels x_t.
P(x|z,w) = 1/f(z,w) exp{-( ∑_i (z_i(1-x_i) + (1-z_i)x_i) + w ∑_{i,j ∈ N4} |x_i-x_j| )}
Train (exhaustive search): w* = argmin_w ∑_t -log P(x_t|z_t,w) = 0.8
Testing (1000 images):
1. MAP (0/1 loss):           av. error 0/1: 0.99; av. error Hamming: 0.32
2. Marginals (Hamming loss): av. error 0/1: 0.92; av. error Hamming: 0.17
ML estimation for our toy image
So probabilistic inference does better than MAP inference here, since it uses the more appropriate loss function.
[Figure: example test results]
Two different approaches to learning
1. Probabilistic parameter learning: “P(x|z,w) is needed”
2. Loss-based parameter learning: “E(x,z,w) is sufficient”
Loss-based Parameter Learning
Minimize the risk R = ∑_x P(x|z,w) Δ(x,x*) for the chosen “loss function” Δ.
“Replace this by samples from the true distribution, i.e. the training data”:
  R ≈ 1/|T| ∑_t Δ(x_t,x*_t)   with x*_t = argmax_x P(x|z_t,w)
How much training data is needed?
Loss-based Parameter Learning
Minimize R = 1/|T| ∑_t Δ(x_t,x*_t)   with x*_t = argmax_x P(x|z_t,w)
[Figure: search over w for the 0/1 loss and for the Hamming loss]
Testing:
1. 0/1 loss (w=0.2):     error 0/1: 0.69; error Hamming: 0.11
2. Hamming loss (w=0.1): error 0/1: 0.70; error Hamming: 0.10
Loss-based Parameter Learning: Example test results
[Figure: test results for the 0/1 loss and the Hamming loss]
Which approach is better?
Hamming test error:
1. ML: MAP (0/1 loss) – error 0.32
2. ML: marginals (Hamming loss) – error 0.17
3. Loss-based: MAP (0/1 loss) – error 0.11
4. Loss-based: MAP (Hamming loss) – error 0.10
Why are the loss-based methods so much better? Model mismatch: our model cannot represent the true distribution of the training data, and we probably always have that in vision.
Check: sample from the trained model (w=0.8).
[Figure: data, sampled labels, and my toy data labellings]
Re-training gives w=0.8.
Comment: marginals also give an uncertainty for every pixel, which can be used in bigger systems.
A real world application: Image denoising
[see details in: Putting MAP back on the map, Pletscher et al., DAGM 2010]
Training images z_1..m with ground truths x_1..m.
Model: 4-connected graph with 64 labels and 128 weights in total.
[Figure: noisy input test image, true test image, and zoomed results of ML training decoded with MAP (image 0/1 loss) and with MMSE (pixel-wise squared loss)]
Example – Image denoising
Training images z_1..m with ground truths x_1..m.
[Figure: noisy input test image, true test image, and the result of loss-based MAP (pixel-wise squared loss)]
Comparison of the two pipelines: models
Loss-minimizing: unary potential |z_i-x_i|, pairwise potential |x_i-x_j|
Probabilistic:   unary potential |z_i-x_i|, pairwise potential |x_i-x_j|
[Figure: data z and label x]
Comparison of the two pipelines
[see details in: Putting MAP back on the map, Pletscher et al., DAGM 2010]
[Figure: prediction error plotted against deviation from the true model]
Recap
• Loss functions
• Two pipelines for parameter learning: loss-based and probabilistic
• MAP inference is good, if trained well
Another Machine Learning view
We can identify three different approaches [see details in Bishop, page 42ff]:
• Generative (probabilistic) models
• Discriminative (probabilistic) models
• Discriminative functions
Generative model
Models that model explicitly (or implicitly) the distribution of both the input and the output.
Joint probability: P(x,z) = P(z|x) P(x)   (likelihood × prior)
Pros: 1. the most elaborate model; 2. it is possible to sample both x and z.
Cons: it might not always be possible to write down the full distribution (it involves a distribution over images).
Generative Model: Example
P(x,z) = P(z|x) P(x)
P(z|x) modelled as GMMs; P(x) = 1/f ∏_{i,j ∈ N4} exp{-|x_i-x_j|} (Ising prior)
[Figure: true image, samples of (x,z) from the model, and the most likely configuration]
Why does segmentation still work?
We use the posterior, not the joint, so the image z is given:
  P(x|z) = 1/P(z) P(z,x)
Remember: P(x|z) = 1/f(z) exp{-E(x,z)}
[Figure: image z and samples x from the toy model with a strong likelihood]
Comments:
– a better likelihood p(z|x) may give a better model
– when you test models, keep in mind that data is never random, it is very structured!
Discriminative model
Models that model the posterior directly are discriminative models:
  P(x|z) = 1/f(z) exp{-E(x,z)}
We will later call them “conditional random fields”.
Pros: 1. simpler to write down (no need to model z) and goes directly for the desired output x; 2. the probability can be used in bigger systems.
Cons: we cannot sample images z.
Discriminative model – Example
Gibbs: P(x|z) = 1/f(z) exp{-E(x,z)}
E(x) = ∑_i θ_i(x_i,z_i) + ∑_{i,j ∈ N4} θ_ij(x_i,x_j,z_i,z_j)
θ_ij(x_i,x_j,z_i,z_j) = |x_i-x_j| exp{-β||z_i-z_j||}   with β = 2 (Mean(||z_i-z_j||²))⁻¹
[Figure: θ_ij plotted against ||z_i-z_j|| for the Ising and the edge-dependent potentials]
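A sketch of these edge-dependent (contrast-sensitive) weights, following the slide's formula for β; note that whether the exponent uses ||z_i-z_j|| or its square varies between formulations, so this follows the slide literally:

```python
import numpy as np

# Contrast-sensitive pairwise weights: the Ising cost |x_i - x_j| is modulated
# by the image contrast; beta is set from the mean squared colour difference
# over neighbouring pixels.
def pairwise_weights(z):
    """z: image of shape (H, W, 3). Returns weights for the right/down N4 edges."""
    d_right = np.linalg.norm(z[:, 1:] - z[:, :-1], axis=-1)   # ||z_i - z_j|| per edge
    d_down = np.linalg.norm(z[1:, :] - z[:-1, :], axis=-1)
    all_d = np.concatenate([d_right.ravel(), d_down.ravel()])
    beta = 2.0 / np.mean(all_d ** 2)                           # beta = 2 (mean ||z_i-z_j||^2)^-1
    # theta_ij(x_i,x_j,z_i,z_j) = |x_i - x_j| * exp(-beta * ||z_i - z_j||)
    return np.exp(-beta * d_right), np.exp(-beta * d_down)
```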
Discriminative functions
Models that model the classification problem via a function E(x,z): L^n → R, with prediction x* = argmax_x E(x,z) (argmin_x for an energy).
Examples: an energy that has been loss-based trained, support vector machines, decision trees.
Pros: the most direct approach to model the problem.
Cons: no probabilities.
Recap
• Generative (probabilistic) models
• Discriminative (probabilistic) models
• Discriminative functions
Image segmentation … the full story
… a meeting with the Queen
Segmentation [Boykov & Jolly ICCV ’01]
Input: image z and user input; output: x* = argmin_x E(x), x ∈ {0,1}^n
E(x) = ∑_{p ∈ V} F_p x_p + B_p (1-x_p) + ∑_{pq ∈ E} w_pq |x_p-x_q|
w_pq = w_i + w_c exp(-w_β ||z_p-z_q||²)
Hard constraints from the user input: F_p = ∞, B_p = 0 forces x_p = 0; F_p = 0, B_p = ∞ forces x_p = 1.
Graph cut: global optimum in polynomial time, ~0.3 sec for a 1 MPixel image [Boykov, Kolmogorov, PAMI ’04]
How to prevent the trivial solution?
What is a good segmentation?
Objects (fore- and background) are self-similar with respect to appearance.
E_unary(x, θ_F, θ_B) = -log p(z|x, θ_F, θ_B) = ∑_{p ∈ V} -log p(z_p|θ_F) x_p - log p(z_p|θ_B) (1-x_p)
[Figure: input image and three segmentation options with their foreground/background colour models θ_F, θ_B; E_unary = 460000, 482000, 483000 for options 1, 2, 3]
GrabCut [Rother, Kolmogorov, Blake, Siggraph ’04]
Input: image z and user input; output: x ∈ {0,1}^n and GMMs θ_F, θ_B
E(x, θ_F, θ_B) = ∑_{p ∈ V} F_p(θ_F) x_p + B_p(θ_B) (1-x_p) + ∑_{pq ∈ E} w_pq |x_p-x_q|
with F_p(θ_F) = -log p(z_p|θ_F) and B_p(θ_B) = -log p(z_p|θ_B)
Hard constraints from the user input: F_p = ∞, B_p = 0 forces x_p = 0; F_p = 0, B_p = ∞ forces x_p = 1; the “other” pixels get the GMM-based unaries F_p(θ_F), B_p(θ_B).
[Figure: foreground and background GMMs shown in R-G colour space]
Problem: the joint optimization of x, θ_F, θ_B is NP-hard.
GrabCut: Optimization [Rother, Kolmogorov, Blake, Siggraph ’04]
Given the image z, user input, and an initial segmentation x, iterate:
1. Learning of the colour distributions: min_{θ_F,θ_B} E(x, θ_F, θ_B)
2. Graph cut to infer the segmentation: min_x E(x, θ_F, θ_B)
[Figure: segmentation after iterations 1, 2, 3, 4]
(A code sketch of this alternation follows below.)
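A self-contained sketch of the alternation, not the authors' code: step 2 of GrabCut is a graph cut, which is globally optimal for this binary energy; here a few ICM sweeps stand in so the example needs no graph-cut library, and scikit-learn's GaussianMixture stands in for the colour models. It assumes both regions of the initial mask are non-empty.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# GrabCut-style alternating minimisation (sketch).
# z: (H, W, 3) image, x_init: (H, W) initial binary mask, w: pairwise weight.
def grabcut_sketch(z, x_init, w=5.0, outer_iters=4, n_components=3):
    x = x_init.copy()
    flat = z.reshape(-1, 3)
    H, W = x.shape
    for _ in range(outer_iters):
        # 1. Learn colour models theta_F, theta_B from the current segmentation.
        gmm_f = GaussianMixture(n_components).fit(flat[x.ravel() == 1])
        gmm_b = GaussianMixture(n_components).fit(flat[x.ravel() == 0])
        F = -gmm_f.score_samples(flat).reshape(x.shape)   # F_p = -log p(z_p|theta_F)
        B = -gmm_b.score_samples(flat).reshape(x.shape)   # B_p = -log p(z_p|theta_B)
        # 2. Re-estimate x (graph cut in GrabCut; simple ICM sweeps here).
        for _ in range(5):
            for r in range(H):
                for c in range(W):
                    nbrs = [x[rr, cc] for rr, cc in ((r-1, c), (r+1, c), (r, c-1), (r, c+1))
                            if 0 <= rr < H and 0 <= cc < W]
                    cost0 = B[r, c] + w * sum(n != 0 for n in nbrs)   # energy if x_p = 0
                    cost1 = F[r, c] + w * sum(n != 1 for n in nbrs)   # energy if x_p = 1
                    x[r, c] = int(cost1 < cost0)
    return x
```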
GrabCut: Optimization
[Figure: result and energy after each iteration, starting from iteration 0]
[Figure: foreground and background colour models in R-G space, before and after the iterated graph cut]
Summary
– Intro: Probabilistic models
– Two different approaches for learning
– Generative/discriminative models, discriminative functions
– Advanced segmentation system: GrabCut