TRANSCRIPT
Advanced Computer Vision (Module 5F16)
Carsten Rother, Pushmeet Kohli
Syllabus (updated)
• L1&2: Intro
  – Intro: Probabilistic models
  – Different approaches for learning
  – Generative/discriminative models, discriminative functions
• L3&4: Labelling Problems in Computer Vision
  – Graphical models
  – Expressing vision problems as labelling problems
• L5&6: Optimization
  – Message Passing (BP, TRW)
  – Submodularity and Graph Cuts
  – Move Making algorithms (Expansion/Swap/Range/Fusion)
  – LP Relaxations
  – Dual Decomposition
Syllabus (updated)
• L7&8 (8.2): Optimization and Learning
  – Compare max-margin vs. maximum likelihood
• L9&10 (15.2): Case Studies – tbd … Decision Trees and Random Fields, Kinect person detection
• L11&12 (22.2): Optimization Comparison, Case Studies (tbd)
Books
1. Advances in Markov Random Fields for Computer Vision. MIT Press 2011. (Edited by Andrew Blake, Pushmeet Kohli and Carsten Rother)
2. Pattern Recognition and Machine Learning, Springer 2006, by Chris Bishop
3. Structured Learning and Prediction in Computer Vision (Sebastian Nowozin and Christoph H. Lampert; Foundations and Trends in Computer Graphics and Vision series of now publishers, 2011).
4. Computer Vision, Springer 2010, by Rick Szeliski
A Gentle Start: Interactive Image Segmentation and Probabilities
Probabilities
• Probability distribution P(x): ∑_x P(x) = 1, P(x) ≥ 0; discrete x ∈ {0,…,L}
• Joint distribution: P(x,z)
• Conditional distribution: P(x|z)
• Sum rule: P(x) = ∑_z P(x,z)
• Product rule: P(x,z) = P(x|z) P(z)
• Bayes’ rule: P(x|z) = P(z|x) P(x) / P(z)
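These rules can be checked mechanically on a tiny discrete example. The following is a minimal numpy sketch; the numbers are made up purely for illustration:

```python
import numpy as np

# Toy discrete example of the sum, product, and Bayes' rules.
# x in {0,1} (label), z in {0,1,2} (observation); numbers are made up.
P_x = np.array([0.7, 0.3])                      # prior P(x)
P_z_given_x = np.array([[0.6, 0.3, 0.1],        # P(z|x=0)
                        [0.1, 0.2, 0.7]])       # P(z|x=1)

P_xz = P_x[:, None] * P_z_given_x               # product rule: P(x,z) = P(z|x) P(x)
P_z = P_xz.sum(axis=0)                          # sum rule:     P(z) = sum_x P(x,z)
P_x_given_z = P_xz / P_z[None, :]               # Bayes' rule:  P(x|z) = P(z|x) P(x) / P(z)

assert np.isclose(P_xz.sum(), 1.0)              # the joint sums to 1
print(P_x_given_z[:, 2])                        # posterior over x after observing z=2
```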
Interactive Segmentation
Goal: given the image z and unknown variables x, infer x from
  P(x|z) = P(z|x) P(x) / P(z)  ∝  P(z|x) P(x)
where z ∈ (R,G,B)^n is the image and x ∈ {0,1}^n the labelling.
• Posterior probability: P(x|z)
• Likelihood (data-dependent): P(z|x)
• Prior (data-independent): P(x); P(z) is a constant
Maximum a Posteriori (MAP): x* = argmax_x P(x|z)
We will express this as an energy minimization problem: x* = argmin_x E(x)
(user-specified pixels are not optimized over)
Likelihood P(x|z) ∝ P(z|x) P(x)
[Figure: foreground and background colour distributions plotted over the Red–Green plane]
Likelihood P(x|z) ∝ P(z|x) P(x)
Maximum likelihood:
  x* = argmax_x P(z|x) = argmax_x ∏_i P(z_i|x_i)
[Figure: per-pixel likelihoods p(z_i|x_i=0) and p(z_i|x_i=1)]
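Without the prior, the maximum-likelihood labelling decomposes into an independent decision per pixel. A minimal sketch, using simple isotropic Gaussian colour models as stand-ins for the foreground/background likelihoods (the means and variance below are assumptions, not values from the lecture):

```python
import numpy as np

# Per-pixel maximum-likelihood labelling: x_i* = argmax_{x_i} p(z_i | x_i).
# p(z|x=k) is an isotropic Gaussian in RGB; means/variance are made-up
# stand-ins for colour models fitted to user-marked pixels.
mu = np.array([[40.0, 40.0, 40.0],      # mean colour for background (x=0)
               [200.0, 60.0, 60.0]])    # mean colour for foreground (x=1)
sigma2 = 30.0 ** 2

def log_likelihood(z, k):
    """log p(z|x=k) for an image z of shape (H, W, 3), up to a constant."""
    return -np.sum((z - mu[k]) ** 2, axis=-1) / (2 * sigma2)

z = np.random.rand(4, 4, 3) * 255        # a random "image" just for illustration
x_star = (log_likelihood(z, 1) > log_likelihood(z, 0)).astype(int)
print(x_star)                            # per-pixel ML labels, no smoothness yet
```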
Prior P(x|z) ∝ P(z|x) P(x)
P(x) = 1/f ∏_{i,j ∈ N4} θ_ij(x_i,x_j)
f = ∑_x ∏_{i,j ∈ N4} θ_ij(x_i,x_j)   “partition function”
θ_ij(x_i,x_j) = exp{-|x_i-x_j|}   “Ising prior”
(exp{-1} ≈ 0.37; exp{0} = 1)
Prior – 4x4 Grid
Pure prior model: P(x) = 1/f ∏_{i,j ∈ N4} exp{-|x_i-x_j|}
[Figure: best and worst solutions sorted by probability]
“The smoothness prior needs the likelihood”
Prior – 4x4 Grid: Distribution
Pure prior model: P(x) = 1/f ∏_{i,j ∈ N4} exp{-|x_i-x_j|}
[Figure: probability over all 2^16 configurations, with samples]
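Because the grid is only 4x4, the pure prior model can be enumerated exactly, which is how distributions like the one above can be plotted. A brute-force sketch (w = 1 corresponds to the exp{-|x_i-x_j|} prior):

```python
import itertools
import numpy as np

# Brute-force enumeration of the pure Ising prior on a 4x4 grid (2^16 states):
# P(x) = 1/f * prod_{(i,j) in N4} exp{-w |x_i - x_j|}.
H, W, w = 4, 4, 1.0
edges = [((r, c), (r, c + 1)) for r in range(H) for c in range(W - 1)] + \
        [((r, c), (r + 1, c)) for r in range(H - 1) for c in range(W)]

def prior_energy(x):
    return w * sum(abs(int(x[a]) - int(x[b])) for a, b in edges)

states = [np.array(bits).reshape(H, W)
          for bits in itertools.product([0, 1], repeat=H * W)]
energies = np.array([prior_energy(x) for x in states])
probs = np.exp(-energies)
probs /= probs.sum()                      # divide by the partition function f

order = np.argsort(-probs)                # sort configurations by probability
print(states[order[0]], states[order[1]], sep="\n")   # most probable
print("least probable prob:", probs[order[-1]])        # least probable
```

The most probable configurations are the two constant labellings (no disagreeing edges), and the least probable are the checkerboards (every edge disagrees), matching the best/worst solutions shown in the figures.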
Prior – 4x4 Grid
Pure prior model: P(x) = 1/f ∏_{i,j ∈ N4} exp{-10|x_i-x_j|}
[Figure: best and worst solutions sorted by probability]
Prior – 4x4 Grid: Distribution
Pure prior model: P(x) = 1/f ∏_{i,j ∈ N4} exp{-10|x_i-x_j|}
[Figure: probability over all 2^16 configurations, with samples]
Putting it together…
Posterior: P(x|z) = P(z|x) P(x) / P(z)
Joint: P(x,z) = P(z|x) P(x)   (… let us look at this later)
P(x|z) = 1/P(z) · 1/f ∏_{i,j ∈ N4} exp{-|x_i-x_j|} · ∏_i p(z_i|x_i)
       = 1/f(z) exp{- ( ∑_{i,j ∈ N4} |x_i-x_j| + ∑_i -log p(z_i|x_i) ) }
       = 1/f(z) exp{- ( ∑_{i,j ∈ N4} |x_i-x_j| + ∑_i [-log p(z_i|x_i=0) (1-x_i) - log p(z_i|x_i=1) x_i] ) }
       = 1/f(z) exp{-E(x,z)}   “Gibbs distribution”
with f(z) = ∑_x exp{-E(x,z)}
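In code, the derivation simply says: add up the pairwise disagreement costs and the per-pixel negative log-likelihoods. A small sketch of E(x,z); the per-pixel log-likelihoods are assumed to be computed elsewhere (e.g. from colour models as above), and the weight w generalizes the derivation, which corresponds to w = 1:

```python
import numpy as np

# Energy of the Gibbs posterior for a labelling x (H x W, in {0,1}) and image z:
# E(x,z) = w * sum_{(i,j) in N4} |x_i - x_j|
#          + sum_i [ -log p(z_i|x_i=1) x_i - log p(z_i|x_i=0) (1 - x_i) ].
def gibbs_energy(x, log_p0, log_p1, w=1.0):
    unary = -(log_p1 * x + log_p0 * (1 - x)).sum()
    pairwise = np.abs(np.diff(x, axis=0)).sum() + np.abs(np.diff(x, axis=1)).sum()
    return unary + w * pairwise

# P(x|z) = exp{-E(x,z)} / f(z); f(z) requires a sum over all labellings, so in
# practice one compares energies (smaller energy = more probable labelling).
```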
Gibbs Distribution is more general
In our case:
Unary term (“encodes our dependency on the data”):
  θ_i(x_i,z_i) = -log p(z_i|x_i=1) x_i - log p(z_i|x_i=0) (1-x_i)
Pairwise term (“encodes our prior knowledge over labellings”):
  θ_ij(x_i,x_j) = |x_i-x_j|
The Gibbs distribution does not have to decompose into prior and likelihood:
  P(x|z) = 1/f(z) exp{-E(x,z)}   with f(z) = ∑_x exp{-E(x,z)}
Energy:
  E(x,z) = ∑_i θ_i(x_i,z) + w ∑_{i,j} θ_ij(x_i,x_j,z) + ∑_{i,j,k} θ_ijk(x_i,x_j,x_k,z) + …   (higher-order terms)
Energy minimization
P(x|z) = 1/f(z) exp{-E(x,z)}   with f(z) = ∑_x exp{-E(x,z)}
-log P(x|z) = -log(1/f(z)) + E(x,z)
The minimum energy solution is the same as the MAP solution:
  MAP: x* = argmax_x P(x|z)   (maximum-a-posteriori solution)
  Global minimum of E: x* = argmin_x E(x,z)
Recap
• Posterior, likelihood, prior: P(x|z) = P(z|x) P(x) / P(z)
• Gibbs distribution: P(x|z) = 1/f(z) exp{-E(x,z)}
• Energy minimization is the same as MAP estimation: x* = argmax_x P(x|z) = argmin_x E(x)
Weighting of Unary and Pairwise term
E(x,z,w) = ∑_i θ_i(x_i,z_i) + w ∑_{i,j} θ_ij(x_i,x_j)
[Figure: segmentation results for w = 0, 10, 40, 200]
Learning versus Optimization/Prediction
Gibbs distribution: P(x|z,w) = 1/f(z,w) exp{-E(x,z,w)}
Training phase: infer w, which does not depend on a test image z:  {x_t,z_t} => w
Testing phase: infer x, which does depend on the test image z:  z, w => x
A simple procedure to learn w
1. Iterate w = 0,…,400:
   1. Compute x*_t for all training images {x_t,z_t}
   2. Compute the average error Er(w) = 1/|T| ∑_t Δ(x_t,x*_t)
      with the Hamming loss Δ(x,x*) = ∑_i [x_i ≠ x_i*]  (number of misclassified pixels)
2. Take the w with the smallest Er (see the sketch below)
Questions: Is this the best and only way? Can we over-fit to the training data?
[Figure: plot of Er against w]
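A sketch of this procedure; `map_inference(z, w)` is a placeholder for whatever solver computes x* = argmin_x E(x,z,w) for this binary model (e.g. graph cut), and is not defined here:

```python
import numpy as np

# Exhaustive search over w: pick the value with the smallest average
# Hamming error on the training set.
def hamming(x, x_star):
    return int((x != x_star).sum())          # number of misclassified pixels

def learn_w(train_pairs, map_inference, w_values=np.arange(0, 401)):
    errors = []
    for w in w_values:
        er = np.mean([hamming(x_t, map_inference(z_t, w)) for x_t, z_t in train_pairs])
        errors.append(er)
    return w_values[int(np.argmin(errors))], errors
```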
Big Picture: Statistical Models in Computer Vision
• Model: discrete or continuous variables? discrete or continuous space? dependence between variables? …
• Optimisation/Prediction/Inference: combinatorial optimization (e.g. graph cut), message passing (e.g. BP, TRW), Iterated Conditional Modes (ICM), LP relaxation (e.g. cutting plane), problem decomposition + subgradient, …
• Learning: maximum likelihood learning, pseudo-likelihood approximation, loss-minimizing parameter learning, exhaustive search, constraint generation, …
• Applications: 2D/3D image segmentation, object recognition, 3D reconstruction, stereo matching, image denoising, texture synthesis, pose estimation, panoramic stitching, …
Machine Learning view: Structured Learning and Prediction
“Normal” machine learning:
  f : Z → N (classification), f : Z → R (regression)
  Input: image, text; output: real number(s)
Structured output prediction:
  f : Z → X
  Input: image, text; output: a complex structured object (labelling, parse tree)
  [Examples: parse tree of a sentence, image labelling, chemical structure]
Structured output – ad hoc definition (from [Nowozin et al. 2011]): data that consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together.
Learning: A simple toy problem
Label generation: a small deviation of a 2x2 foreground (white) square at an arbitrary position.
Data generation:
1. Foreground pixels are white, background pixels black
2. Flip the labels of a few random pixels
3. Add some Gaussian noise
(See the data-generation sketch below; a related example is man-made object detection [Nowozin and Lampert 2011].)
A possible model for the data – Ising model on a 4x4 grid graph:
  P(x|z,w) = 1/f(z,w) exp{-( ∑_i (z_i(1-x_i) + (1-z_i)x_i) + w ∑_{i,j ∈ N4} |x_i-x_j| )}
             (unary terms)                         (pairwise terms)
[Figure: example data z and label x]
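A possible data-generation sketch; the flip probability and noise level are assumptions, since the slide does not give exact values:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_example(H=4, W=4, flip_prob=0.05, noise_std=0.3):
    # Label: a 2x2 foreground square at an arbitrary position.
    x = np.zeros((H, W))
    r, c = rng.integers(0, H - 1), rng.integers(0, W - 1)
    x[r:r + 2, c:c + 2] = 1
    # Data: foreground white / background black, a few flipped pixels, noise.
    z = x.copy()
    flip = rng.random((H, W)) < flip_prob
    z[flip] = 1 - z[flip]
    z = z + rng.normal(0, noise_std, (H, W))
    return x, z

x_t, z_t = make_example()
```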
Decision Theory
Assume w has been learned and P(x|z,w) is as shown:
[Figure: probability over all 2^16 configurations; best and worst solutions sorted by probability]
Which solution x* would you choose?
How to make a decision
Assume the model P(x|z,w) is known.
The risk R is the expected loss, for a chosen “loss function” Δ:
  R = ∑_x P(x|z,w) Δ(x,x*)
Goal: choose the x* which minimizes the risk R.
Decision Theory
Risk: R = ∑_x P(x|z,w) Δ(x,x*)
0/1 loss: Δ(x,x*) = 0 if x* = x, 1 otherwise  =>  MAP: x* = argmax_x P(x|z,w)
[Figure: best and worst solutions sorted by probability]
Decision Theory
Risk: R = ∑_x P(x|z,w) Δ(x,x*)
Hamming loss: Δ(x,x*) = ∑_i [x_i ≠ x_i*]  =>  maximize marginals: x_i* = argmax_{x_i} P(x_i|z,w)
[Figure: best and worst solutions sorted by probability]
Decision Theory
Maximize marginals: x_i* = argmax_{x_i} P(x_i|z,w)
Marginal: P(x_i = k) = ∑_{x_j, j≠i} P(x_1,…,x_i = k,…,x_n)
[Figure: best and worst solutions sorted by probability]
Computing marginals is sometimes called “probabilistic inference”, as opposed to MAP inference.
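For the 4x4 toy model both decisions can be computed exactly by enumeration. A sketch, where `energy(x)` is a hypothetical helper standing in for E(x,z,w) of the model above:

```python
import itertools
import numpy as np

# Probabilistic inference vs. MAP inference by brute force on a tiny model.
def enumerate_posterior(energy, H=4, W=4):
    states = [np.array(b).reshape(H, W) for b in itertools.product([0, 1], repeat=H * W)]
    p = np.exp(-np.array([energy(x) for x in states]))
    p /= p.sum()                                             # normalize by f(z,w)
    return states, p

def map_and_max_marginal(energy, H=4, W=4):
    states, p = enumerate_posterior(energy, H, W)
    x_map = states[int(np.argmax(p))]                        # 0/1-loss decision (MAP)
    marg1 = sum(pi * x for pi, x in zip(p, states))          # P(x_i = 1) per pixel
    x_mpm = (marg1 > 0.5).astype(int)                        # Hamming-loss decision
    return x_map, x_mpm
```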
Recap
A different loss function gives a very different solution !
Two different approaches to learning
1. Probabilistic parameter learning: “P(x|z,w) is needed”
2. Loss-based parameter learning: “E(x,z,w) is sufficient”
Probabilistic Parameter Learning
Training: from the training database {x_t,z_t}, learn the weights by regularized maximum likelihood estimation:
  w* = argmin_w ∑_t -log P(x_t|z_t,w) + |w|²
(It is: P(w|z_t,x_t) ∝ P(x_t|w,z_t) P(w|z_t), so regularized ML can be read as a MAP estimate of w.)
Choose a loss and construct the decision function:
  0/1 loss:      x* = argmax_x P(x|z,w)
  Hamming loss:  x_i* = argmax_{x_i} P(x_i|z,w)
Test time: optimize the decision function for the new test image z, e.g. x* = argmax_x P(x|z,w).
ML estimation for our toy image
Training images z_t with labels x_t.
P(x|z,w) = 1/f(z,w) exp{-( ∑_i (z_i(1-x_i) + (1-z_i)x_i) + w ∑_{i,j ∈ N4} |x_i-x_j| )}
Train: w* = argmin_w ∑_t -log P(x_t|z_t,w)
[Figure: plot of 1/|T| ∑_t -log P(x_t|z_t,w) against w]
How many training images are needed? (A sketch of how to evaluate this objective follows below.)
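A sketch of how the plotted objective can be evaluated for the toy model by enumerating all 2^16 labellings; `energy(x, z, w)` is assumed to implement the Ising model above:

```python
import itertools
import numpy as np

# Average negative log-likelihood of the training set for a given w,
# computed exactly (feasible because the grid is only 4x4).
def neg_log_likelihood(train_pairs, w, energy, H=4, W=4):
    states = [np.array(b).reshape(H, W) for b in itertools.product([0, 1], repeat=H * W)]
    nll = 0.0
    for x_t, z_t in train_pairs:
        energies = np.array([energy(x, z_t, w) for x in states])
        m = (-energies).max()
        log_f = m + np.log(np.exp(-energies - m).sum())   # log partition function f(z_t, w)
        nll += energy(x_t, z_t, w) + log_f                # -log P(x_t | z_t, w)
    return nll / len(train_pairs)
```

Scanning a range of w values and taking the minimum of this curve is what yields the w* = 0.8 reported on the next slide.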
ML estimation for our toy image
Training images z_t with labels x_t.
P(x|z,w) = 1/f(z,w) exp{-( ∑_i (z_i(1-x_i) + (1-z_i)x_i) + w ∑_{i,j ∈ N4} |x_i-x_j| )}
Train (exhaustive search): w* = argmin_w ∑_t -log P(x_t|z_t,w) = 0.8
Testing (1000 images):
1. MAP (0/1 loss):           av. error 0/1: 0.99; av. error Hamming: 0.32
2. Marginals (Hamming loss): av. error 0/1: 0.92; av. error Hamming: 0.17
ML estimation for our toy image
So probabilistic inference does better than MAP inference here, since it uses the more appropriate loss function.
[Figure: example test results]
Two different approaches to learning
1. Probabilistic parameter learning: “P(x|z,w) is needed”
2. Loss-based parameter learning: “E(x,z,w) is sufficient”
Loss-based Parameter Learning
Minimize the risk R = ∑_x P(x|z,w) Δ(x,x*) for the chosen “loss function” Δ.
“Replace this by samples from the true distribution, i.e. the training data”:
  R ≈ 1/|T| ∑_t Δ(x_t,x*_t)   with x*_t = argmax_x P(x|z_t,w)
How much training data is needed?
Loss-based Parameter Learning
Minimize R = 1/|T| ∑_t Δ(x_t,x*_t)   with x*_t = argmax_x P(x|z_t,w)
[Figure: search over w for the 0/1 loss and for the Hamming loss]
Testing:
1. 0/1 loss (w=0.2):     error 0/1: 0.69; error Hamming: 0.11
2. Hamming loss (w=0.1): error 0/1: 0.70; error Hamming: 0.10
Loss-based Parameter Learning: Example test results
[Figure: test results for the 0/1 loss and the Hamming loss]
Which approach is better?
Hamming test error:
1. ML: MAP (0/1 loss) – error 0.32
2. ML: marginals (Hamming loss) – error 0.17
3. Loss-based: MAP (0/1 loss) – error 0.11
4. Loss-based: MAP (Hamming loss) – error 0.10
Why are the loss-based methods so much better? Model mismatch: our model cannot represent the true distribution of the training data, and we probably always have that in vision.
Check: sample from the trained model (w=0.8).
[Figure: data, sampled labels, and my toy data labellings]
Re-training gives w=0.8.
Comment: marginals also give an uncertainty for every pixel, which can be used in bigger systems.
A real world application: Image denoising
[see details in: Putting MAP back on the map, Pletscher et al., DAGM 2010]
Training images z_1..m with ground truths x_1..m.
Model: 4-connected graph with 64 labels and 128 weights in total.
[Figure: noisy input test image, true test image, and zoomed results of ML training decoded with MAP (image 0/1 loss) and with MMSE (pixel-wise squared loss)]
Example – Image denoising
Training images z_1..m with ground truths x_1..m.
[Figure: noisy input test image, true test image, and the result of loss-based MAP (pixel-wise squared loss)]
Comparison of the two pipelines: models
Loss-minimizing: unary potential |z_i-x_i|, pairwise potential |x_i-x_j|
Probabilistic:   unary potential |z_i-x_i|, pairwise potential |x_i-x_j|
[Figure: data z and label x]
Comparison of the two pipelines
[see details in: Putting MAP back on the map, Pletscher et al., DAGM 2010]
[Figure: prediction error plotted against deviation from the true model]
Recap
• Loss functions
• Two pipelines for parameter learning: loss-based and probabilistic
• MAP inference is good, if trained well
Another Machine Learning view
We can identify three different approaches [see details in Bishop, page 42ff]:
• Generative (probabilistic) models
• Discriminative (probabilistic) models
• Discriminative functions
Generative model
Models that model explicitly (or implicitly) the distribution of both the input and the output.
Joint probability: P(x,z) = P(z|x) P(x)   (likelihood × prior)
Pros: 1. the most elaborate model; 2. it is possible to sample both x and z.
Cons: it might not always be possible to write down the full distribution (it involves a distribution over images).
Generative Model: Example
P(x,z) = P(z|x) P(x)
P(z|x) modelled as GMMs; P(x) = 1/f ∏_{i,j ∈ N4} exp{-|x_i-x_j|} (Ising prior)
[Figure: true image, samples of (x,z) from the model, and the most likely configuration]
Why does segmentation still work?
We use the posterior, not the joint, so the image z is given:
  P(x|z) = 1/P(z) P(z,x)
Remember: P(x|z) = 1/f(z) exp{-E(x,z)}
[Figure: image z and samples x from the toy model with a strong likelihood]
Comments:
– a better likelihood p(z|x) may give a better model
– when you test models, keep in mind that data is never random, it is very structured!
Discriminative model
Models that model the posterior directly are discriminative models:
  P(x|z) = 1/f(z) exp{-E(x,z)}
We will later call them “conditional random fields”.
Pros: 1. simpler to write down (no need to model z) and goes directly for the desired output x; 2. the probability can be used in bigger systems.
Cons: we cannot sample images z.
Discriminative model – Example
Gibbs: P(x|z) = 1/f(z) exp{-E(x,z)}
E(x) = ∑_i θ_i(x_i,z_i) + ∑_{i,j ∈ N4} θ_ij(x_i,x_j,z_i,z_j)
θ_ij(x_i,x_j,z_i,z_j) = |x_i-x_j| exp{-β||z_i-z_j||}   with β = 2 (Mean(||z_i-z_j||²))⁻¹
[Figure: θ_ij plotted against ||z_i-z_j|| for the Ising and the edge-dependent potentials]
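A sketch of these edge-dependent (contrast-sensitive) weights, following the slide's formula for β; note that whether the exponent uses ||z_i-z_j|| or its square varies between formulations, so this follows the slide literally:

```python
import numpy as np

# Contrast-sensitive pairwise weights: the Ising cost |x_i - x_j| is modulated
# by the image contrast; beta is set from the mean squared colour difference
# over neighbouring pixels.
def pairwise_weights(z):
    """z: image of shape (H, W, 3). Returns weights for the right/down N4 edges."""
    d_right = np.linalg.norm(z[:, 1:] - z[:, :-1], axis=-1)   # ||z_i - z_j|| per edge
    d_down = np.linalg.norm(z[1:, :] - z[:-1, :], axis=-1)
    all_d = np.concatenate([d_right.ravel(), d_down.ravel()])
    beta = 2.0 / np.mean(all_d ** 2)                           # beta = 2 (mean ||z_i-z_j||^2)^-1
    # theta_ij(x_i,x_j,z_i,z_j) = |x_i - x_j| * exp(-beta * ||z_i - z_j||)
    return np.exp(-beta * d_right), np.exp(-beta * d_down)
```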
Discriminative functions
Models that model the classification problem via a function E(x,z): L^n → R, with prediction x* = argmax_x E(x,z) (argmin_x for an energy).
Examples: an energy that has been loss-based trained, support vector machines, decision trees.
Pros: the most direct approach to model the problem.
Cons: no probabilities.
Recap
• Generative (probabilistic) models
• Discriminative (probabilistic) models
• Discriminative functions
Image segmentation … the full story
… a meeting with the Queen
Segmentation [Boykov & Jolly ICCV ’01]
Input: image z and user input; output: x* = argmin_x E(x), x ∈ {0,1}^n
E(x) = ∑_{p ∈ V} F_p x_p + B_p (1-x_p) + ∑_{pq ∈ E} w_pq |x_p-x_q|
w_pq = w_i + w_c exp(-w_β ||z_p-z_q||²)
Hard constraints from the user input: F_p = ∞, B_p = 0 forces x_p = 0; F_p = 0, B_p = ∞ forces x_p = 1.
Graph cut: global optimum in polynomial time, ~0.3 sec for a 1 MPixel image [Boykov, Kolmogorov, PAMI ’04]
How to prevent the trivial solution?
What is a good segmentation?
Objects (fore- and background) are self-similar with respect to appearance.
E_unary(x, θ_F, θ_B) = -log p(z|x, θ_F, θ_B) = ∑_{p ∈ V} -log p(z_p|θ_F) x_p - log p(z_p|θ_B) (1-x_p)
[Figure: input image and three segmentation options with their foreground/background colour models θ_F, θ_B; E_unary = 460000, 482000, 483000 for options 1, 2, 3]
GrabCut [Rother, Kolmogorov, Blake, Siggraph ’04]
Input: image z and user input; output: x ∈ {0,1}^n and GMMs θ_F, θ_B
E(x, θ_F, θ_B) = ∑_{p ∈ V} F_p(θ_F) x_p + B_p(θ_B) (1-x_p) + ∑_{pq ∈ E} w_pq |x_p-x_q|
with F_p(θ_F) = -log p(z_p|θ_F) and B_p(θ_B) = -log p(z_p|θ_B)
Hard constraints from the user input: F_p = ∞, B_p = 0 forces x_p = 0; F_p = 0, B_p = ∞ forces x_p = 1; the “other” pixels get the GMM-based unaries F_p(θ_F), B_p(θ_B).
[Figure: foreground and background GMMs shown in R-G colour space]
Problem: the joint optimization of x, θ_F, θ_B is NP-hard.
GrabCut: Optimization [Rother, Kolmogorov, Blake, Siggraph ’04]
Given the image z, user input, and an initial segmentation x, iterate:
1. Learning of the colour distributions: min_{θ_F,θ_B} E(x, θ_F, θ_B)
2. Graph cut to infer the segmentation: min_x E(x, θ_F, θ_B)
[Figure: segmentation after iterations 1, 2, 3, 4]
(A code sketch of this alternation follows below.)
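A self-contained sketch of the alternation, not the authors' code: step 2 of GrabCut is a graph cut, which is globally optimal for this binary energy; here a few ICM sweeps stand in so the example needs no graph-cut library, and scikit-learn's GaussianMixture stands in for the colour models. It assumes both regions of the initial mask are non-empty.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# GrabCut-style alternating minimisation (sketch).
# z: (H, W, 3) image, x_init: (H, W) initial binary mask, w: pairwise weight.
def grabcut_sketch(z, x_init, w=5.0, outer_iters=4, n_components=3):
    x = x_init.copy()
    flat = z.reshape(-1, 3)
    H, W = x.shape
    for _ in range(outer_iters):
        # 1. Learn colour models theta_F, theta_B from the current segmentation.
        gmm_f = GaussianMixture(n_components).fit(flat[x.ravel() == 1])
        gmm_b = GaussianMixture(n_components).fit(flat[x.ravel() == 0])
        F = -gmm_f.score_samples(flat).reshape(x.shape)   # F_p = -log p(z_p|theta_F)
        B = -gmm_b.score_samples(flat).reshape(x.shape)   # B_p = -log p(z_p|theta_B)
        # 2. Re-estimate x (graph cut in GrabCut; simple ICM sweeps here).
        for _ in range(5):
            for r in range(H):
                for c in range(W):
                    nbrs = [x[rr, cc] for rr, cc in ((r-1, c), (r+1, c), (r, c-1), (r, c+1))
                            if 0 <= rr < H and 0 <= cc < W]
                    cost0 = B[r, c] + w * sum(n != 0 for n in nbrs)   # energy if x_p = 0
                    cost1 = F[r, c] + w * sum(n != 1 for n in nbrs)   # energy if x_p = 1
                    x[r, c] = int(cost1 < cost0)
    return x
```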
GrabCut: Optimization
[Figure: result and energy after each iteration, starting from iteration 0]
[Figure: foreground and background colour models in R-G space, before and after the iterated graph cut]
Summary
– Intro: Probabilistic models
– Two different approaches for learning
– Generative/discriminative models, discriminative functions
– Advanced segmentation system: GrabCut