Chapter 6: Advanced Algorithms for Inference in Gibbs Fields
In this chapter, we study three advanced MCMC methods for computing in Gibbs fields (flat or two-level models):
1. Swendsen-Wang Cuts for segmentation and labeling
2. DDMCMC and its applications in image segmentation
3. C4: Clustering with Cooperative and Competitive Constraints
Previous algorithms for graph partition
Generic algorithms:
– Gibbs sampler (Geman and Geman '84): inefficient
– Swendsen-Wang (Swendsen and Wang '87)
Specialized algorithms:
– Graph Cuts (Boykov, Veksler, and Zabih '01)
– Belief Propagation (Yedidia et al. '00)
– PDE optimization (e.g., region competition): greedy, prone to local minima
1. Swendsen-Wang Cuts
The original idea of cluster sampling and SW

[Figure: two Potts-model states, state A and state B, with clusters V0, V1, V2.]

Each edge in the lattice e = <s,t> is associated with a probability q = 1 - e^{-β}; the edge is turned "on" with that probability when its two sites carry the same label.
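To make the bond-forming step concrete, here is a minimal Python sketch of one SW sweep on a Potts model; the graph representation and the helper name sw_sweep are illustrative assumptions, not part of the lecture.

```python
import math
import random
from collections import deque

def sw_sweep(labels, edges, beta, num_labels):
    """One Swendsen-Wang sweep on a Potts model (toy sketch).

    labels: dict node -> label; edges: list of (s, t) pairs;
    beta: coupling strength. Each same-label edge is turned "on"
    with probability q = 1 - exp(-beta).
    """
    q = 1.0 - math.exp(-beta)
    # 1. Sample bonds: keep an edge only if both ends agree and the coin succeeds.
    adj = {v: [] for v in labels}
    for s, t in edges:
        if labels[s] == labels[t] and random.random() < q:
            adj[s].append(t)
            adj[t].append(s)
    # 2. Find the connected components of the bond graph (BFS).
    seen = set()
    for v in labels:
        if v in seen:
            continue
        comp, queue = [v], deque([v])
        seen.add(v)
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    comp.append(w)
                    queue.append(w)
        # 3. Flip the whole component to a uniformly chosen label.
        new_label = random.randrange(num_labels)
        for u in comp:
            labels[u] = new_label
    return labels
```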
Essential ideas of SW
When two variables are tightly coupled, it is best to move along the direction of their coupling rather than one coordinate at a time.

[Figure: a strongly coupled two-dimensional distribution over (x1, x2); single-site moves crawl along the narrow ridge.]

The clustering process connects the coupled dimensions probabilistically, so the moves become more effective.
Some theoretical results about SW on the Potts model
1. (Gore and Jerrum '97) constructed a "worst case": SW does not mix rapidly if G is a complete graph with n > 2 and a certain β.
2. (Cooper and Frieze '99) had positive results: if G is a tree, the SW mixing time is O(|G|) for any β; if G has constant connectivity O(1), SW has polynomial mixing time for certain β.
The real limits of SW are:
1. It is only valid for Ising/Potts models.
2. It makes no use of the data (external fields) in forming clusters, and it slows down critically in the presence of external fields.
Segmentation and graph partition
Image segmentation:
– Group pixels based on intensity
– For speed, one can use an over-segmentation by edge detection and edge tracing

[Figure pipeline: input image → over-segmentation with atomic regions → adjacency graph → graph partition (labeling) → image segmentation result.]
The graph partition problem
Given:
– A graph Go = (V, E)
  • Nodes V are image elements
  • Edges E represent spatial relationships or similarity
– A probability p(π(V) | I) or energy E(π(V)) defined on partitions π(V)
Find a partition π(V) that maximizes p(π(V) | I).

[Figure: a graph Go and a partition of the graph.]
Improving the clustering step
The edge probability qij is decided by local features Fi, Fj; e.g., in image segmentation, by the KL divergence of the intensity histograms Hi and Hj of the two atomic regions.

[Figure: atomic regions on the input image, their intensity histograms Hi and Hj, and the resulting edge weights.]

In general, qij should approximate a marginal probability of p(W|I), e.g., the probability that the two elements carry the same label.
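A minimal sketch of one plausible choice for qij, assuming normalized histograms and a symmetrized KL divergence with a temperature parameter; the exact functional form used in the lecture may differ.

```python
import math

def kl(p, q, eps=1e-8):
    """KL divergence between two normalized histograms."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def edge_prob(hist_i, hist_j, temperature=1.0):
    """Edge probability q_ij from the symmetrized KL divergence of the two
    regions' intensity histograms: similar regions get q_ij close to 1."""
    d = 0.5 * (kl(hist_i, hist_j) + kl(hist_j, hist_i))
    return math.exp(-d / temperature)
```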
Random connected components
[Figure: three samples (Sample 1, Sample 2, Sample 3) of random connected components at temperatures T = 1, 2, 4, 8.]
The temperature T is used in the marginal probability.
The Swendsen-Wang Cuts algorithm
Swendsen-Wang Cuts: SWC
Input: Go = <V, Eo>, discriminative probabilities qe, e ∈ Eo, and generative posterior probability p(W|I).
Output: Samples W ~ p(W|I).
1. Initialize a graph partition.
2. Repeat, for the current state A = π:
3.   For each subgraph Gl = <Vl, El>, l = 1, 2, ..., n in A:
4.     For e ∈ El, turn e = "on" with probability qe.
5.     Partition Gl into nl connected components gli = <Vli, Eli>, i = 1, ..., nl.
6.   Collect all the connected components in CP = {Vli : l = 1, ..., n, i = 1, ..., nl}.
7.   Select a connected component V0 ∈ CP at random.
8.   Propose to reassign V0 to a subgraph Gl', where l' follows a probability q(l'|V0, A).
9.   Accept the move with probability α(A→B).

[Figure: the initial graph Go; state A with the connected component V0 and cut edges marked ×; state B after V0 is reassigned from V1 to a new subgraph.]
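Below is a minimal Python sketch of one SWC step under simplifying assumptions: a uniform label proposal q(l'|V0, A), so the proposal ratio cancels, and a posterior p(W|I) ∝ exp(-E(W)) supplied as an energy callback. The function name swc_step and the data layout are illustrative; the acceptance ratio in step 9 is the one given by the theorem on the next slide.

```python
import math
import random
from collections import deque

def swc_step(labels, edges, q, num_labels, energy):
    """One Swendsen-Wang Cuts step (toy sketch, uniform label proposal).

    labels: dict node -> label       edges: list of (s, t) pairs
    q:      dict (s, t) -> q_e       energy: callable(labels) -> E(W)
    The posterior is taken as p(W|I) proportional to exp(-E(W)).
    """
    # Steps 3-5: turn on same-label edges with probability q_e, then find
    # the connected components.
    adj = {v: [] for v in labels}
    for s, t in edges:
        if labels[s] == labels[t] and random.random() < q[(s, t)]:
            adj[s].append(t)
            adj[t].append(s)
    comps, seen = [], set()
    for v in labels:
        if v in seen:
            continue
        comp, queue = {v}, deque([v])
        seen.add(v)
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    comp.add(w)
                    queue.append(w)
        comps.append(comp)
    # Steps 6-8: pick a component V0 and propose a new label uniformly.
    V0 = random.choice(comps)
    old_label = labels[next(iter(V0))]
    new_label = random.randrange(num_labels)
    # Step 9: acceptance = (cut weight in state B / cut weight in state A)
    # times the posterior ratio; the uniform proposal ratio cancels.
    def cut_weight(target_label):
        w = 1.0
        for s, t in edges:
            if (s in V0) == (t in V0):
                continue                      # not a boundary edge of V0
            out = t if s in V0 else s
            if labels[out] == target_label:
                w *= 1.0 - q[(s, t)]          # cut edge: must stay "off"
        return w
    proposal = dict(labels)
    for u in V0:
        proposal[u] = new_label
    alpha = min(1.0, (cut_weight(new_label) / cut_weight(old_label))
                     * math.exp(energy(labels) - energy(proposal)))
    if random.random() < alpha:
        labels.update(proposal)
    return labels
```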
SW Cuts: the acceptance probability
By Metropolis-Hastings design:

Theorem (Barbu and Zhu 2003). The acceptance probability for the Swendsen-Wang Cuts algorithm is

α(A→B) = min(1, [∏_{e ∈ C(V0, Vl' \ V0)} (1 − qe)] / [∏_{e ∈ C(V0, Vl \ V0)} (1 − qe)] · [q(l | V0, B) / q(l' | V0, A)] · [p(B|I) / p(A|I)]),

where C(V0, Vl \ V0) is the cut between V0 and the rest of its subgraph Vl in state A, and C(V0, Vl' \ V0) is the corresponding cut in state B.

[Figure: state A and state B.]
Outline of the proof
[Figure: states A and B with component V0 and subgraphs V1, V2; cut edges are marked ×.]

We compute the ratio Q(A→B)/Q(B→A): all configurations of edges that take state A to B must have all edges of the cut C(V0, V1 − V0) turned off.
Outline of the proof
[Figure: states A and B with component V0 and subgraphs V1, V2; cut edges are marked ×.]

Cancellation of the sums occurs because of the symmetry between states A and B: any CP that takes state A to B is also a CP that takes state B to A, and any configuration of "on" edges in state A appears in state B and vice versa.
The reassignment probability
The reassignment probability q(l' | V0, A) can also be data-driven, e.g., by comparing the histogram H(V0) of the selected component with the histograms H(V1 − V0), H(V2), H(V3) of the candidate subgraphs.

[Figure: component V0 with subgraphs V1, V2, V3 and their histograms.]
Comparison with the Gibbs sampler
[Figure: two energy-vs-time(s) plots (energies on the order of 3.2-4.2 × 10^5) comparing the Gibbs sampler with random and uniform initialization against SWC with random and uniform initialization.]
Convergence comparison of SWC and the Gibbs sampler on the cheetah image, starting from a random state or from the state where all nodes have label 0. Right: zoomed-in view of the first 20 seconds.
Examples of segmentation
[Figure: a. input image; b. over-segmentation with atomic regions; c. segmentation result.]
Advantages of the SW-Cuts algorithm
– Generally applicable: allows the use of complex models beyond the scope of the specialized algorithms.
– Computationally efficient: performance comparable with the specialized algorithms.
– Reversible and ergodic: theoretically guaranteed to eventually find the global optimum.
A generalized Gibbs sampler

If we select the probability q(l | V0, A) of reassigning V0 to Vl (obtaining state Al) proportional to the cut weight times the posterior,

q(l | V0, A) ∝ [∏_{e ∈ C(V0, Vl \ V0)} (1 − qe)] · p(Al | I),

we can obtain acceptance probability 1. Then we basically flip the label of the connected subgraph by a generalized Gibbs sampler.
The importance of q(l' | V0, A)
[Figure: energy vs. time(s) (energies on the order of 3.75-4.05 × 10^5) for SWC and for the generalized Gibbs sampler.]
Convergence of SWC with data-driven q(l' | V0, A) (blue) and of the generalized Gibbs sampler (red), starting from a random state.
2. Data-Driven Markov Chain Monte Carlo

The Bayesian formulation for maximizing a posterior probability

Let I be an image and W be a semantic representation of the world:

W* = arg max_W p(W | I) = arg max_W p(I | W) p(W).

In statistics, we sample from the posterior probability to preserve ambiguities:

(W1, W2, ..., Wk) ~ p(W | I).
Example: Image Segmentation
W = (n, {(Ri, li, θi) : i = 1, 2, ..., n})

πn = (R1, R2, ..., Rn) is an n-partition: ∪_{i=1}^{n} Ri = Λ and Ri ∩ Rj = ∅ for i ≠ j.

[Figure: π7 = (R1, R2, ..., R7) is a 7-partition of the lattice.]

The partition space is Ω_π = ∪_{n=1}^{|Λ|} Ω_{πn}, where each Ω_{πn} is defined modulo a permutation group (the regions are unordered).
Likelihood models (no objects or templates)
Grey-scale:
– iid Gaussian for pixel intensities
– non-parametric histograms
– Markov random fields for texture
– spline model for lighting variations
Color:
– iid Gaussian for color (LUV)
– mixture of Gaussians for color
– spline model for smooth color variations (e.g., sky, shading, ...)
Sampling the posterior distribution
To design the transition kernel: we need a Markov chain whose invariant probability is the posterior p(W | I).

[Figure: the solution space, a 7-partition, and its atomic particles.]
Formulating and visualizing the search space
[Figure: a) the solution space Ω; b) a sub-space of 7-partitions π7; c) an atomic space Ωi, with cues C1, C2, C3.]
In Image Segmentation

[Figure: a partition into regions R1, R2, R3; atomic spaces Ωi and parameter spaces for the grey-scale models (flat, clutter, texture, shading) and the color models (flat, shading, texture), with cues C1, C2, C3.]
Basic requirements for MCMC design
We have the following conditions for a valid MCMC design (as in 202C):
1. stochastic: each row of the kernel sums to 1;
2. irreducible: K has one communication class;
3. aperiodic: any power of K has one communication class;
4. globally balanced: the target probability is invariant under K;
5. positive recurrent (not an issue in a finite space).
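As a quick illustration of conditions 1-4 on a finite space, here is a small Python check; the function name check_kernel and the tolerance choices are illustrative.

```python
import numpy as np

def check_kernel(K, tol=1e-9):
    """Sanity-check a finite transition matrix against the conditions above."""
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    stochastic = bool(np.allclose(K.sum(axis=1), 1.0))
    # Irreducible: every state reaches every state, i.e. (I + K)^(n-1) > 0.
    reach = np.linalg.matrix_power(np.eye(n) + K, n - 1)
    irreducible = bool((reach > tol).all())
    # Irreducible + aperiodic (primitive): K^m > 0 for some m <= (n-1)^2 + 1.
    primitive = bool((np.linalg.matrix_power(K, (n - 1) ** 2 + 1) > tol).all())
    # Invariant distribution: left eigenvector of K with eigenvalue 1.
    vals, vecs = np.linalg.eig(K.T)
    p = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    p = p / p.sum()
    return stochastic, irreducible, primitive, p

# Toy usage: a 3-state kernel.
K = [[0.5, 0.25, 0.25],
     [0.25, 0.5, 0.25],
     [0.25, 0.25, 0.5]]
print(check_kernel(K))
```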
What is Data-Driven Markov Chain Monte Carlo?
W ~ p(W | I): the complexity of sampling the posterior lies in the Metropolis-Hastings jumps.

Consider a reversible jump WA ⇌ WB:

α(WA → WB) = min(1, [p(WB | I) G(WB → WA)] / [p(WA | I) G(WA → WB)]),
or
α(WA → WB) = min(1, [p(WB | I) q(WB → WA)] / [p(WA | I) q(WA → WB)]).

Without looking at the data, the pre-designed proposal probabilities are often uniform distributions; thus it is a blind (exhaustive) search!

In DDMCMC,

α(WA → WB) = min(1, [p(WB | I) q(WB → WA | I)] / [p(WA | I) q(WA → WB | I)]).

If q(WA → WB | I) ≈ p(WB | I) and q(WB → WA | I) ≈ p(WA | I), then the MC is well-informed: it may converge (hit W*) in a small number of steps!
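A minimal Python sketch of this contrast on a finite state space: an independence-type Metropolis-Hastings chain whose proposal draws states in proportion to a bottom-up score approximating p(y | I). The names p_post and make_proposal, and the score-based proposal itself, are illustrative assumptions.

```python
import random

def mh_chain(p_post, propose, steps=1000, x0=0):
    """Metropolis-Hastings on a finite space.

    p_post:  dict state -> unnormalized posterior p(W|I)
    propose: callable(x) -> (y, q_xy, q_yx): a proposed state with its
             forward and backward proposal probabilities
    """
    x = x0
    for _ in range(steps):
        y, q_xy, q_yx = propose(x)
        if random.random() < min(1.0, (p_post[y] * q_yx) / (p_post[x] * q_xy)):
            x = y
    return x

def make_proposal(scores):
    """Data-driven (independence) proposal: draw y with prob. scores[y]/Z.
    Passing uniform scores recovers the blind, exhaustive-search baseline."""
    states, total = list(scores), float(sum(scores.values()))
    def propose(x):
        r, acc, y = random.random() * total, 0.0, states[-1]
        for s in states:
            acc += scores[s]
            if r <= acc:
                y = s
                break
        return y, scores[y] / total, scores[x] / total
    return propose
```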
The Markov chain consists of many processes
Suppose a Markov chain consists of many sub-chains, and the transition probability is a linear sum:

K(x, y) = Σ_i p(i) Ki(x, y), with Σ_i p(i) = 1.

If each sub-kernel has the same invariant probability, π Ki = π for all i, then the whole Markov chain follows π K = Σ_i p(i) π Ki = π.
The connectivity at each state x
We denote by Ωi(x) the set of states connected to x by the i-th type of moves; that is, x is connected to a set of states. For example, Ki can propose within the set Ωi(x) with probability proportional to π(·).
MCMC Kernels consist of many components
The transition kernel is a mixture of many sub-kernels corresponding to the various operators:

K = Σ_i p(i) Ki.

Each sub-kernel observes the detailed balance equations, but may not be irreducible by itself. Some sub-kernels are symmetric, while others come in asymmetric pairs, e.g., K1l and K1r.

[Figure: sub-kernels K1l, K1r, K2, ... acting around a state WA.]
Metropolized Gibbs sampler
Consider a pair of reversible jumps Jm between x and y, with scopes Ωil(x) and Ωir(y).

Proposal according to the conditional probabilities, like a Gibbs sampler:

Qil(x, y) = π(y) / Σ_{y' ∈ Ωil(x)} π(y'),  for y ∈ Ωil(x);
Qir(y, x) = π(x) / Σ_{x' ∈ Ωir(y)} π(x'),  for x ∈ Ωir(y).

The proposal matrix Q is asymmetric, and each row x is zero (0, 0, ..., 0) outside the scope Ωil(x).
Key issues
1. How do we decide the sampling dimensions, directions, group transforms, and sets Ωi(x) in a systematic and principled way?
2. How do we schedule the visiting order governed by p(i), i.e., how do we choose the moving directions, groups, and sets?
MCMC Moves in image segmentation

K1l: splitting a region into two.
K1r: merging two regions into one.
K2: switching the model type for a region.
K3: diffusion of the region boundary (region competition).

[Figure: the split, merge, switch-model, and diffusion moves.]
Data-Driven Methods in the object spaces (death-birth)
Kmr(x, y) = Qmr(x, y) · min(1, [π(y) Qml(y, x)] / [π(x) Qmr(x, y)]),  for y ∈ Ωmr(x).

We conjecture that the Metropolized Gibbs sampler is the best design strategy on average: it mixes very fast under the constraints of the scopes. But at each step it needs to evaluate the expensive posterior probability over a rather large scope Ωmr(x).

We replace the conditional probability by bottom-up (discriminative) methods, which are estimated locally with lower cost. We show that such approximations indeed reduce the mixing time.
Metropolized Gibbs sampler
Kil(x, y) = Qil(x, y) · min(1, [Qir(y, x) π(y)] / [Qil(x, y) π(x)]),  for y ≠ x.

Proposal according to the conditional probabilities (the signature of a Gibbs sampler), normalized within the set of connected states:

Qil(x, y) = π(y) / Σ_{y' ∈ Ωil(x)} π(y'),  for y ∈ Ωil(x);
Qir(y, x) = π(x) / Σ_{x' ∈ Ωir(y)} π(x'),  for x ∈ Ωir(y).

[Figure: the scopes Ω1l(x), Ω2(x), ... around state x; the row of the proposal matrix Q for x is zero (0, 0, ..., 0) outside its scope.]
Metropolized Gibbs sampler
α(x, y) = min(1, [Qir(y, x) π(y)] / [Qil(x, y) π(x)])
        = min(1, [Σ_{y' ∈ Ωil(x)} π(y')] / [Σ_{x' ∈ Ωir(y)} π(x')]).

In case the sets are symmetric, Ωil(x) = Ωir(y), the move is always accepted:

α(x, y) = 1.

The Gibbs sampler becomes a special case.
Metropolized Gibbs sampler
Mixing Metropolis and Gibbs designs.
One can improve the traditional Gibbs sampler by prohibiting the MC from staying at its current state in the conditional probability. The proposal thus becomes asymmetric and needs a Metropolis acceptance step to "re-balance":

Q(x, y) = π(y) / (1 − π(x)),  for y ≠ x,  and Q(x, x) = 0;

α(x, y) = min(1, (1 − π(x)) / (1 − π(y))).

The diagonal elements in the proposal matrix are set to zero. This is a desirable property of MC design, in order to make the MC "mix fast".
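A minimal Python sketch of this update on a finite distribution; the list representation of π and the helper name metropolized_gibbs_step are illustrative.

```python
import random

def metropolized_gibbs_step(pi, x):
    """One Metropolized Gibbs update on a finite distribution pi (list of probs).

    Proposes y != x with probability pi[y] / (1 - pi[x]), then accepts with
    probability min(1, (1 - pi[x]) / (1 - pi[y])), as on the slide.
    """
    r, acc, y = random.random() * (1.0 - pi[x]), 0.0, x
    for j, pj in enumerate(pi):
        if j == x:
            continue
        acc += pj
        if r <= acc:
            y = j
            break
    if random.random() < min(1.0, (1.0 - pi[x]) / (1.0 - pi[y])):
        return y
    return x

# Toy usage: unlike a plain Gibbs sampler, which would resample the current
# state with probability pi[x], this chain always proposes to move.
pi = [0.5, 0.3, 0.2]
x = 0
for _ in range(10):
    x = metropolized_gibbs_step(pi, x)
```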
Split and Merge
[Figure: a candidate region: split or not?]
Computing the marginal posterior probabilities: Clustering in Color Space
Mean-shift clustering (Cheng 1995; Meer et al. 2001) yields K weighted clusters (θi, ωi), encoded as

q(θ | I) = Σ_{i=1}^{K} ωi g(θ − θi).

[Figure: input image and saliency maps 1-6; the brightness represents how likely a pixel belongs to a cluster.]
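A minimal sketch of evaluating this mixture, assuming Gaussian windows g of a fixed bandwidth (the kernel choice is an assumption, not specified on the slide):

```python
import math

def q_theta(theta, modes, weights, bandwidth=1.0):
    """Evaluate q(theta | I) = sum_i w_i * g(theta - theta_i) with Gaussian g."""
    z = math.sqrt(2.0 * math.pi) * bandwidth
    return sum(w * math.exp(-0.5 * ((theta - t) / bandwidth) ** 2) / z
               for t, w in zip(modes, weights))
```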
Computing marginal posterior probabilities in the partition space π
Edge detection and tracing at three scales of detail:
Proposals by Edge Detection at Different Scales (before SW-cut was invented)

[Figure: partition maps at Scale 1, Scale 2, and Scale 3.]
Super-pixels and connected components
[Figure: samples (Sample 1, Sample 2, Sample 3) of connected components over super-pixels at temperatures T = 1, 2, 4, 8 (Swendsen-Wang Cut).]
Clustering in the Partition Space
An adjacency graph: each vertex is a basic element (pixels, small regions, edges, ...); each link e = <a, b> is associated with a probability/ratio for similarity:

q(e = "on" | F(I(a)), F(I(b)))  vs.  q(e = "off" | F(I(a)), F(I(b))),

where F(I(a)) and F(I(b)) are local features of the two elements.
Clustering in the Partition Space
Sampling the edges independently, we get connected components; these connected sub-graphs are the clusters in the partition space:

sampling C ~ q(C | F(I)) on Ω_π.
Graph Partitioning – Generalizing SW
The red edges are the bridges (the cut).

Theorem. Accepting the label-change proposal with probability

α(A → B) = min(1, [∏_{e ∈ E(Vc, Vl' \ Vc)} (1 − qe)] / [∏_{e ∈ E(Vc, Vl \ Vc)} (1 − qe)] · [q(l | Vc, B) / q(l' | Vc, A)] · [p(B | I) / p(A | I)])

results in an ergodic and reversible Markov chain.

[Figure: graphs GA and GB before and after the label change of the component Vc.]
Diffusion Processes on the boundary
The Markov chains realize reversible jumps between sub-spaces of varying dimensions. Within a subspace of fixed dimension, there are various diffusion processes expressed as partial differential equations. For example, region competition for curve evolution (Zhu, Lee, and Yuille '95):

Let v(s) = (x(s), y(s)) be a point on the boundary between two regions Ra and Rb; its motion is governed by the region-competition equation

dv(s)/dt = ( μ κ(s) + log [ p(I(x,y) | θa) / p(I(x,y) | θb) ] ) n(s),

where κ(s) is the curvature and n(s) the normal of the boundary at v(s).
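A minimal sketch of one explicit Euler step of this motion for a closed polygonal curve; the finite-difference curvature, the normal orientation, and the callback names log_lik_a/log_lik_b are all assumptions of this toy discretization.

```python
import math

def region_competition_step(curve, log_lik_a, log_lik_b, mu=1.0, dt=0.1):
    """One Euler step of dv/dt = (mu*kappa + log-likelihood ratio) * n
    for a closed curve given as a list of (x, y) points."""
    n_pts = len(curve)
    new_curve = []
    for i, (x, y) in enumerate(curve):
        xp, yp = curve[(i - 1) % n_pts]
        xn, yn = curve[(i + 1) % n_pts]
        # Second difference approximates the curvature vector kappa * n.
        kx, ky = xn - 2.0 * x + xp, yn - 2.0 * y + yp
        # Unit normal from the tangent (sign depends on curve orientation).
        tx, ty = (xn - xp) / 2.0, (yn - yp) / 2.0
        norm = math.hypot(tx, ty) or 1.0
        nx, ny = ty / norm, -tx / norm
        speed = log_lik_a(x, y) - log_lik_b(x, y)   # data term along the normal
        new_curve.append((x + dt * (mu * kx + speed * nx),
                          y + dt * (mu * ky + speed * ny)))
    return new_curve
```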
Stochastic Diffusion and PDE
[Figure: regions R1, R2, R3 with evolving boundaries.]

The continuous Langevin equation simulates a Markov chain with stationary density p(x) ∝ exp(−E(x)/T):

dx(t) = −∇E(x) dt + √(2T) dw_t,  with w_t a Brownian motion.

For example, the movement of a changing (boundary) point is driven by such an equation.
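A minimal sketch of the discretized Langevin dynamics; the step size and the 1-D setting are illustrative.

```python
import math
import random

def langevin_sample(grad_E, x0, temperature=1.0, step=1e-3, n_steps=10000):
    """Discretized Langevin dynamics: x' = x - h*grad E(x) + sqrt(2*T*h)*noise.
    For small step sizes, the stationary density approaches exp(-E(x)/T)."""
    x = x0
    for _ in range(n_steps):
        x = (x - step * grad_E(x)
             + math.sqrt(2.0 * temperature * step) * random.gauss(0.0, 1.0))
    return x

# Toy usage: E(x) = x^2 / 2, so samples approach a Gaussian with variance T.
samples = [langevin_sample(lambda x: x, x0=0.0, n_steps=2000) for _ in range(100)]
```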
Running DDMCMC
Starting with 3 different initial segmentations:

[Figure: energy plots of three MCMCs (MC 1, MC 2, MC 3); the input image, a solution W1 with synthesis I1 ~ p(I | W1), and a solution W2 with synthesis I2 ~ p(I | W2).]
Proposals for region models by clustering

[Figure: saliency maps (the brightness represents how likely a pixel belongs to a cluster), computed by clustering color values (L,u,v) and texture.]
A Demo
[Figure: a snapshot of a solution (segmentation and synthesis) sampled by DDMCMC.]
Experiments: Color Image Segmentation
[Figures: input images, segmented regions, and syntheses I ~ p(I | W*).]
Image Segmentation results on the public dataset
[Figure: input images, segmentations, and syntheses I ~ p(I | W*).]
Performance on the Berkeley Benchmark Study
[Figure: test images, DDMCMC results, and manual segmentations, scored by the "error" measure of David Martin et al. 2001; example errors 0.1083, 0.3082, 0.5627.]
Examples of Failure

[Figure: a. input image; b. segmented regions; c. synthesis I ~ p(I | W*).]
Speed Comparison

[Figure: convergence of uninformed MCMC, MCMC with clustering, MCMC with partition, and MCMC with both, compared against the ground truth.]
Running Time Comparison Against Gibbs Sampler
[Figure: energy vs. time(s) curves, with a zoomed-in view; time is measured in #sweeps. Red curve: Gibbs sampler for graph partition and labeling. Blue curve: improved SW algorithm for graph partition.]
Generic image parsing

[Figure: a parse hierarchy from scene to objects, patterns, parts, and textons. Example: image parsing (Tu et al. 2000-2004).]
From segmentation to parsing
[Figure: face images from the FERET dataset; text images of San Francisco street scenes.]
Adaboost in the Label Space
An example from Viola and Jones, 2001.

[Figure: (a) the first two face features; (b) an example of face detection.]

Adaboost is a learning algorithm which makes decisions by combining a number of simple features. As T and the training samples become large enough, it weakly converges to the log ratio of the posterior probability.
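For reference, the standard form of this statement (from Adaboost theory with exponential loss, not spelled out on the slide) is that the boosted score approaches half the posterior log-odds:

```latex
\lim_{T \to \infty} \sum_{t=1}^{T} \alpha_t h_t(x) \;=\; \frac{1}{2}\,
\log \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)}
```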
Image Parsing Results (Tu, Chen, Yuille, and Zhu, ICCV 2003)

[Figures: input, regions, objects, synthesis.]
An example

[Diagram: integrating top-down generative and bottom-up discriminative methods. A Markov kernel combines sub-kernels for faces, text, region models, and model switching (death/birth, split/merge). Generative inference and discriminative inference exchange weighted particles; from the input image, bottom-up processes provide face detection, text detection, edge detection, and model clustering.]
Summary: generative vs. discriminative

[Figure: Bayesian (top-down) vs. data-driven (bottom-up) inference, and their integration.]
Review: MCMC developments related to vision
Metropolis 1946
Hastings 1970
Waltz 1972 (labeling)
Rosenfeld, Hummel, Zucker 1976 (relaxation)
Heat bath
Kirkpatrick 1983
Geman brothers 1984 (Gibbs sampler)
Swendsen-Wang 1987 (clustering)
Miller, Grenander 1994
Green 1995
DDMCMC 2001-2005
Swendsen-Wang Cut 2003
C4 2009