CSCI 5561: Computer Vision, Prof. Paul Schrater, Spring 2005

Segmentation via Generative Models


Overview

• In the last lecture we introduced classifier-based segmentation models where
– class -> image generative models were used

– P(pixel | class) is learned from training data

Here we show how to use class -> image generative models without labeled data.

The key idea is to simultaneously estimate the class density parameters and the class membership of each piece of image data.


Missing variable problems
• In many vision problems, if some variables were known, the maximum likelihood inference problem would be easy
– fitting: if we knew which line each token came from, it would be easy to determine the line parameters

– segmentation: if we knew the segment each pixel came from, it would be easy to determine the segment parameters

– fundamental matrix estimation: if we knew which feature corresponded to which, it would be easy to determine the fundamental matrix

– etc.

• This sort of thing happens in statistics, too


For Independent data samples


Missing variables - strategy
• We have a problem with parameters, missing variables

• This suggests:

• Iterate until convergence
– replace missing variables with expected values, given fixed values of parameters

– fix missing variables, choose parameters to maximise likelihood given fixed values of missing variables

• e.g., iterate till convergence
– allocate each point to a line with a weight, which is the probability of the point given the line

– refit lines to the weighted set of points

• Converges to local extremum


Density Est: Mixture of Densities

e.g. θ = { µ, σ }


Motion Estimation


Mixture Model Applications

[Figure: velocity plotted against image position]


But only if we are given the distributions and prior


http://www.ncrg.aston.ac.uk/netlab/

* PCA
* Mixtures of probabilistic PCA
* Gaussian mixture model with EM training
* Linear and logistic regression with IRLS
* Multi-layer perceptron with linear, logistic and softmax outputs and error functions
* Radial basis function (RBF) networks with both Gaussian and non-local basis functions
* Optimisers, including quasi-Newton methods, conjugate gradients and scaled conjugate gradients
* Multi-layer perceptron with Gaussian mixture outputs (mixture density networks)
* Gaussian prior distributions over parameters for the MLP, RBF and GLM including multiple hyper-parameters
* Laplace approximation framework for Bayesian inference (evidence procedure)
* Automatic Relevance Determination for input selection
* Markov chain Monte-Carlo including simple Metropolis and hybrid Monte-Carlo
* K-nearest neighbour classifier
* K-means clustering
* Generative Topographic Map
* Neuroscale topographic projection
* Gaussian Processes
* Hinton diagrams for network weights
* Self-organising map


[Figure panels: data sampled from a mixture of 3 Gaussians; spectral clustering]


[Figure panels: original data; Gaussian mixture model classification]


Finite Mixtures

$$P(x) = \sum_{i=1}^{3} a_i \, g_i(x; \theta)$$
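A minimal sketch of EM for a finite mixture like the one above, assuming each component g_i is a 1-D Gaussian with θ = {µ, σ}. The function names, the starting values and the synthetic data are illustrative assumptions, not part of the lecture; the variance floor is a practical guard added here.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_gmm_1d(data, mus, sigmas, weights, n_iter=50):
    """Run EM on a 1-D Gaussian mixture; returns (mus, sigmas, weights)."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k = len(mus)
    for _ in range(n_iter):
        # E-step: expected indicator variables (one weight per point and component).
        resp = []
        for x in data:
            p = [weights[j] * gaussian_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: weighted maximum-likelihood update of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # floor keeps a component from collapsing
            weights[j] = nj / len(data)
    return mus, sigmas, weights

# Synthetic data: three well-separated Gaussians (assumed for illustration).
random.seed(0)
data = ([random.gauss(-4.0, 1.0) for _ in range(200)]
        + [random.gauss(0.0, 1.0) for _ in range(200)]
        + [random.gauss(5.0, 1.0) for _ in range(200)])
mus, sigmas, weights = em_gmm_1d(data, [-3.0, 1.0, 4.0], [2.0, 2.0, 2.0], [1 / 3, 1 / 3, 1 / 3])
```

The result depends on the starting values: EM only reaches a local extremum, so a poor start can merge or split clusters.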


Expectation on indicator variables

Segmentation with EM

[Figure panels: scale estimate map; 6 texture features; EM components; segmentation into ‘blobs’]

Figure from “Color and Texture Based Image Segmentation Using EM and Its Application to Content Based Image Retrieval”, S.J. Belongie et al., Proc. Int. Conf. Computer Vision, 1998, © 1998, IEEE


Fitting
• Choose a parametric object/some objects to represent a set of tokens

• Most interesting case is when criterion is not local
– can’t tell whether a set of points lies on a line by looking only at each point and the next.

• Three main questions:
– what object represents this set of tokens best?

– which of several objects gets which token?

– how many objects are there?

(you could read line for object here, or circle, or ellipse or...)


Fitting and the Hough Transform

• Purports to answer all three questions
– in practice, the answer isn’t usually all that much help

• We do this for lines only

• A line is the set of points (x, y) such that

$$(\sin\theta)\,x + (\cos\theta)\,y + d = 0$$

• Different choices of θ, d > 0 give different lines

• For any (x0, y0) there is a one-parameter family of lines through this point, given by

$$r = -(\sin\theta)\,x_0 - (\cos\theta)\,y_0$$

where r plays the role of the offset d in the line equation.

• Plot these curves in discretized (r, θ) space. Each point (r, θ) is a bucket.

• Each point gets to vote for each line in the family; if there is a line that has lots of votes, that should be the line passing through the points.

• This voting can be done by adding 1 to every (r, θ) bucket that the curves pass through, accumulating across the set of (x0, y0) points.
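The voting scheme can be sketched as follows, using the slide's parameterisation sin(θ)x + cos(θ)y + r = 0. The array sizes, the range of r and the function names are assumptions chosen for illustration; r is allowed to be negative here so that every line through the points has a bucket.

```python
import math

def hough_lines(points, n_theta=180, n_r=201, r_max=10.0):
    """Accumulate votes in a discretized (theta, r) array."""
    acc = [[0] * n_r for _ in range(n_theta)]
    for x0, y0 in points:
        for ti in range(n_theta):
            theta = math.pi * ti / n_theta
            # One-parameter family of lines through (x0, y0).
            r = -math.sin(theta) * x0 - math.cos(theta) * y0
            # Nearest bucket; an odd n_r puts r = 0 at a bucket centre.
            ri = round((r + r_max) / (2.0 * r_max) * (n_r - 1))
            if 0 <= ri < n_r:
                acc[ti][ri] += 1
    return acc

def strongest_line(acc, r_max=10.0):
    """Return (theta, r, votes) for the bucket with the most votes."""
    n_theta, n_r = len(acc), len(acc[0])
    votes, ti, ri = max((acc[t][r], t, r) for t in range(n_theta) for r in range(n_r))
    theta = math.pi * ti / n_theta
    r = ri / (n_r - 1) * 2.0 * r_max - r_max
    return theta, r, votes

# Ten tokens on the line y = x, i.e. theta = 3*pi/4 and r = 0 in this convention.
tokens = [(float(i), float(i)) for i in range(10)]
theta, r, votes = strongest_line(hough_lines(tokens))
```

With noisy tokens the votes spread over neighbouring buckets, which is the cell-size difficulty discussed on the next slide.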


tokens votes


Mechanics of the Hough transform
• Construct an array representing θ, d

• For each point, render the curve (θ, d) into this array, adding one at each cell

• Difficulties
– how big should the cells be? (too big, and we cannot distinguish between quite different lines; too small, and noise causes lines to be missed)

• How many lines?
– count the peaks in the Hough array

• Who belongs to which line?
– tag the votes

• Hardly ever satisfactory in practice, because problems with noise and cell size defeat it


tokens votes


Line fitting can be max. likelihood - but choice of model is important


Who came from which line?

• Assume we know how many lines there are - but which lines are they?
– easy, if we know who came from which line

• Three strategies
– Incremental line fitting

– K-means

– Probabilistic (later!)


Robustness

• As we have seen, squared error can be a source of bias in the presence of noise points
– One fix is EM - we’ll do this shortly

– Another is an M-estimator
• Square nearby, threshold far away

– A third is RANSAC
• Search for good points

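$$\mathrm{Err} = \sum_{i=1}^{N} \rho_\theta\big(y_i - f(x_i)\big)$$

Example:

$$\rho_\theta(r) = r^2 = \big(y_i - f(x_i)\big)^2$$

Influence function:

$$\frac{\partial \rho_\theta\big(y_i - f(x_i)\big)}{\partial y_i} = 2\big(y_i - f(x_i)\big)$$

Example:

$$\rho_\theta(r) = \frac{r^2}{r^2 + \theta^2}$$

Influence:

$$\frac{\partial \rho_\theta(r)}{\partial r} = \frac{2 r\,\theta^2}{\big(r^2 + \theta^2\big)^2}$$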

Modified Error metrics

Euclidean: $d^2$

Robust: $d^2 / (d^2 + s^2)$
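A small sketch comparing the two error metrics above on a set of residuals that contains one outlier. The function names and the example residuals are made up for illustration; the derivative of the robust metric is worked out here rather than taken from the slides.

```python
def rho_euclidean(d):
    """Squared error: grows without bound as the residual d grows."""
    return d * d

def rho_robust(d, s):
    """Robust error d^2 / (d^2 + s^2): roughly quadratic near 0, saturating at 1."""
    return d * d / (d * d + s * s)

def influence_robust(d, s):
    """Derivative of rho_robust with respect to d; tends to 0 for far-away points."""
    return 2.0 * d * s * s / (d * d + s * s) ** 2

# Hypothetical residuals: four small ones and a single gross outlier.
residuals = [0.1, -0.2, 0.05, 0.15, 25.0]
err_euclidean = sum(rho_euclidean(d) for d in residuals)
err_robust = sum(rho_robust(d, s=1.0) for d in residuals)
# Fraction of the total error contributed by the outlier under each metric.
share_euclidean = rho_euclidean(25.0) / err_euclidean
share_robust = rho_robust(25.0, 1.0) / err_robust
```

The scale s decides what counts as "far away": too small and good points are ignored, too large and the metric behaves like squared error.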


Too small


Too large


RANSAC
• Choose a small subset uniformly at random

• Fit to that

• Anything that is close to result is signal; all others are noise

• Refit

• Do this many times and choose the best

• Issues
– How many times?
• Often enough that we are likely to have a good line

– How big a subset?
• Smallest possible

– What does close mean?
• Depends on the problem

– What is a good line?
• One where the number of nearby points is so big it is unlikely to be all outliers
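The steps above can be sketched for line fitting as follows. This is a minimal illustration under stated assumptions: the subset size is 2 (the smallest that defines a line), "close" means within a fixed distance `tol`, the best line is the one with the most inliers, and the final refit is ordinary least squares on y = m*x + k, which assumes the line is not vertical. All names and data are hypothetical.

```python
import math
import random

def line_through(p, q):
    """Unit-normal line (a, b, c), with a*x + b*y + c = 0, through points p and q."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = math.hypot(a, b)
    a, b = a / norm, b / norm
    return a, b, -(a * x1 + b * y1)

def ransac_line(points, n_trials=200, tol=0.1, seed=0):
    """Return slope m, intercept k and the inlier set of the best sampled line."""
    rng = random.Random(seed)
    best = []
    for _ in range(n_trials):
        p, q = rng.sample(points, 2)          # smallest possible subset
        if p == q:
            continue                          # duplicate points define no line
        a, b, c = line_through(p, q)
        inliers = [(x, y) for x, y in points if abs(a * x + b * y + c) < tol]
        if len(inliers) > len(best):
            best = inliers
    # Refit to the inliers only (ordinary least squares, non-vertical line assumed).
    n = len(best)
    mx = sum(x for x, _ in best) / n
    my = sum(y for _, y in best) / n
    sxx = sum((x - mx) ** 2 for x, _ in best)
    sxy = sum((x - mx) * (y - my) for x, y in best)
    m = sxy / sxx
    return m, my - m * mx, best

# 30 points on y = 2x + 1 plus five gross outliers (made-up data).
points = [(0.1 * i, 2.0 * (0.1 * i) + 1.0) for i in range(30)]
points += [(0.5, 5.0), (1.0, -2.0), (2.0, 0.0), (2.5, 9.0), (0.2, 4.0)]
m, k, inliers = ransac_line(points)
```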


Fitting curves other than lines
• In principle, an easy generalisation
– The probability of obtaining a point, given a curve, is given by a negative exponential of distance squared

• In practice, rather hard
– It is generally difficult to compute the distance between a point and a curve


Lines and robustness
• We have one line, and n points

• Some come from the line, some from “noise”

• This is a mixture model:

$$\begin{aligned}
P(\text{point} \mid \text{line and noise params}) &= P(\text{point} \mid \text{line})\,P(\text{comes from line}) + P(\text{point} \mid \text{noise})\,P(\text{comes from noise}) \\
&= P(\text{point} \mid \text{line})\,\lambda + P(\text{point} \mid \text{noise})\,(1 - \lambda)
\end{aligned}$$

• We wish to determine
– line parameters

– p(comes from line)


Estimating the mixture model
• Introduce a set of hidden variables, δ, one for each point. They are one when the point is on the line, and zero when off.

• If these are known, the negative log-likelihood becomes (the line’s parameters are φ, c):

$$L_c(\{x_i, y_i\}; \theta) = \sum_i \left[ \delta_i \, \frac{(x_i \cos\phi + y_i \sin\phi + c)^2}{2\sigma^2} + (1 - \delta_i)\, k_n \right] + K$$

• Here K is a normalising constant, kn is the noise intensity (we’ll choose this later).


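Substituting for delta
• We shall substitute the expected value of δ, for a given θ

• recall θ = (φ, c, λ)

• E(δ_i) = 1 · P(δ_i = 1 | θ) + 0 · ...

$$\begin{aligned}
P(\delta_i = 1 \mid \theta, x_i) &= \frac{P(x_i \mid \delta_i = 1, \theta)\,P(\delta_i = 1)}{P(x_i \mid \delta_i = 1, \theta)\,P(\delta_i = 1) + P(x_i \mid \delta_i = 0, \theta)\,P(\delta_i = 0)} \\
&= \frac{\exp\!\left(-\frac{1}{2\sigma^2}\,[x_i \cos\phi + y_i \sin\phi + c]^2\right)\lambda}{\exp\!\left(-\frac{1}{2\sigma^2}\,[x_i \cos\phi + y_i \sin\phi + c]^2\right)\lambda + \exp(-k_n)\,(1 - \lambda)}
\end{aligned}$$

• Notice that if kn is small and positive, then if distance is small, this value is close to 1 and if it is large, close to zero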


Algorithm for line fitting
• Obtain some start point

$$\theta^{(0)} = \left(\phi^{(0)}, c^{(0)}, \lambda^{(0)}\right)$$

• Now compute δ’s using formula above

• Now compute maximum likelihood estimate of $\theta^{(1)}$
– φ, c come from fitting to weighted points

– λ comes by counting

• Iterate to convergence
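The algorithm can be sketched as below for the line x cos φ + y sin φ + c = 0. This is one possible reading, not the lecture's own code: the weighted fit is done as a weighted total-least-squares fit (an assumption about what "fitting to weighted points" means), "counting" is taken as the mean of the expected δ's, σ and kn are fixed by hand, and the clamp on λ is a numerical guard added here. Names and data are hypothetical.

```python
import math

def e_step(points, phi, c, lam, sigma, kn):
    """Expected value of each indicator delta_i given the current parameters."""
    deltas = []
    for x, y in points:
        d = x * math.cos(phi) + y * math.sin(phi) + c    # signed distance to the line
        on_line = math.exp(-d * d / (2.0 * sigma ** 2)) * lam
        off_line = math.exp(-kn) * (1.0 - lam)
        deltas.append(on_line / (on_line + off_line))
    return deltas

def m_step(points, deltas):
    """Weighted total-least-squares line fit; lambda is the mean weight."""
    w = sum(deltas)
    mx = sum(d * x for d, (x, _) in zip(deltas, points)) / w
    my = sum(d * y for d, (_, y) in zip(deltas, points)) / w
    sxx = sum(d * (x - mx) ** 2 for d, (x, _) in zip(deltas, points))
    syy = sum(d * (y - my) ** 2 for d, (_, y) in zip(deltas, points))
    sxy = sum(d * (x - mx) * (y - my) for d, (x, y) in zip(deltas, points))
    along = 0.5 * math.atan2(2.0 * sxy, sxx - syy)       # direction of the line
    phi = along + math.pi / 2.0                          # direction of its normal
    c = -(mx * math.cos(phi) + my * math.sin(phi))
    lam = min(max(w / len(points), 1e-6), 1.0 - 1e-6)    # keep both classes alive
    return phi, c, lam

def em_line(points, phi, c, lam, sigma=0.1, kn=2.0, n_iter=30):
    """Alternate E and M steps; returns the parameters and the final deltas."""
    for _ in range(n_iter):
        deltas = e_step(points, phi, c, lam, sigma, kn)
        phi, c, lam = m_step(points, deltas)
    return phi, c, lam, e_step(points, phi, c, lam, sigma, kn)

# 20 points on y = x plus two noise points (made-up data); the start is deliberately off.
points = [(0.1 * i, 0.1 * i) for i in range(20)] + [(0.5, 1.8), (1.5, 0.0)]
phi, c, lam, deltas = em_line(points, phi=3 * math.pi / 4 + 0.2, c=0.1, lam=0.5)
```

The two noise points end with δ close to zero, so they contribute almost nothing to the weighted fit.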


The expected values of the deltas at the maximum (notice the one value close to zero).


Closeup of the fit


Choosing parameters

• What about the noise parameter, and the sigma for the line?
– several methods
• from first principles knowledge of the problem (seldom really possible)

• play around with a few examples and choose (usually quite effective, as precise choice doesn’t matter much)

– notice that if kn is large, this says that points very seldom come from noise, however far from the line they lie
• usually biases the fit, by pushing outliers into the line

• rule of thumb: it’s better to fit to the better fitting points, within reason; if this is hard to do, then the model could be a problem


Other examples
• Segmentation
– a segment is a Gaussian that emits feature vectors (which could contain colour; or colour and position; or colour, texture and position).

– segment parameters are mean and (perhaps) covariance

– if we knew which segment each point belonged to, estimating these parameters would be easy

– rest is on same lines as fitting line

• Fitting multiple lines
– rather like fitting one line, except there are more hidden variables

– easiest is to encode as an array of hidden variables, which represent a table with a one where the i’th point comes from the j’th line, zeros otherwise

– rest is on same lines as above


Issues with EM

• Local maxima
– can be a serious nuisance in some problems

– no guarantee that we have reached the “right” maximum

• Starting
– using k-means to cluster the points is often a good way to obtain a start point


Local maximum


which is an excellent fit to some points


and the deltas for this maximum


A dataset that is well fitted by four lines


Result of EM fitting, with one line (or at least, one available local maximum).


Result of EM fitting, with two lines (or at least, one available local maximum).


Seven lines can produce a rather logical answer


Motion segmentation with EM
• Model image pair (or video sequence) as consisting of regions of parametric motion
– affine motion is popular

$$\begin{pmatrix} v_x \\ v_y \end{pmatrix} = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}$$

• Now we need to
– determine which pixels belong to which region

– estimate parameters

• Likelihood
– assume

$$I(x, y, t) = I(x + v_x,\; y + v_y,\; t + 1) + \text{noise}$$

• Straightforward missing variable problem, rest is calculation
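A small sketch of the two ingredients above: the affine flow at a pixel and the brightness-constancy residual that the likelihood is built on. The tiny synthetic frames, the nearest-pixel lookup and the function names are assumptions made for illustration only.

```python
def affine_flow(x, y, a, b, c, d, tx, ty):
    """Velocity (vx, vy) at pixel (x, y) under the affine motion model."""
    vx = a * x + b * y + tx
    vy = c * x + d * y + ty
    return vx, vy

def residual(frame0, frame1, x, y, params):
    """I(x, y, t) - I(x + vx, y + vy, t + 1), using the nearest pixel in frame1."""
    vx, vy = affine_flow(x, y, *params)
    x1, y1 = int(round(x + vx)), int(round(y + vy))
    return frame0[y][x] - frame1[y1][x1]

# Synthetic pair: frame1 is frame0 shifted one pixel to the right (made-up data).
frame0 = [[x + 10 * y for x in range(6)] for y in range(4)]
frame1 = [[frame0[y][x - 1] if x >= 1 else 0 for x in range(6)] for y in range(4)]

shift_right = (0, 0, 0, 0, 1, 0)   # pure translation: tx = 1, ty = 0
no_motion = (0, 0, 0, 0, 0, 0)
r_good = residual(frame0, frame1, 2, 1, shift_right)
r_bad = residual(frame0, frame1, 2, 1, no_motion)
```

In the EM setting each region has its own parameter tuple, and the size of the residual under each region's motion drives the expected membership of the pixel.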


Three frames from the MPEG “flower garden” sequence

Figure from “Representing Images with Layers”, by J. Wang and E.H. Adelson, IEEE Transactions on Image Processing, 1994, © 1994, IEEE


Grey level shows region no. with highest probability

Segments and motion fields associated with them

Figure from “Representing Images with Layers”, by J. Wang and E.H. Adelson, IEEE Transactions on Image Processing, 1994, © 1994, IEEE


If we use multiple frames to estimate the appearance of a segment, we can fill in occlusions; so we can re-render the sequence with some segments removed.

Figure from “Representing Images with Layers”, by J. Wang and E.H. Adelson, IEEE Transactions on Image Processing, 1994, © 1994, IEEE


Some generalities
• Many, but not all problems that can be attacked with EM can also be attacked with RANSAC
– need to be able to get a parameter estimate with a manageably small number of random choices.

– RANSAC is usually better

• Didn’t present in the most general form
– in the general form, the likelihood may not be a linear function of the missing variables

– in this case, one takes an expectation of the likelihood, rather than substituting expected values of missing variables


Model Selection
• We wish to choose a model to fit to data
– e.g. is it a line or a circle?

– e.g. is this a perspective or orthographic camera?

– e.g. is there an aeroplane there or is it noise?

• Issue
– In general, models with more parameters will fit a dataset better, but are poorer at prediction

– This means we can’t simply look at the negative log-likelihood (or fitting error)


Top is not necessarily a better fit than bottom (actually, almost always worse)


We can discount the fitting error with some term in the numberof parameters in the model.


Discounts
• AIC (an information criterion)
– choose model with smallest value of

$$-2L(D; \theta^*) + 2p$$

– p is the number of parameters

• BIC (Bayes information criterion)
– choose model with smallest value of

$$-2L(D; \theta^*) + p \log N$$

– N is the number of data points

• Minimum description length
– same criterion as BIC, but derived in a completely different way
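A short sketch of the two discounts, assuming L(D; θ*) is the log-likelihood at the fitted parameters and the logarithm is natural. The two candidate models and their log-likelihood values are invented numbers, used only to show how the comparison works.

```python
import math

def aic(log_lik, p):
    """AIC: -2 L(D; theta*) + 2 p, with p the number of parameters."""
    return -2.0 * log_lik + 2.0 * p

def bic(log_lik, p, n):
    """BIC: -2 L(D; theta*) + p log N, with N the number of data points."""
    return -2.0 * log_lik + p * math.log(n)

# Hypothetical fits to the same N = 100 points:
# a 2-parameter model and a 5-parameter model that fits slightly better.
n = 100
simple = {"log_lik": -120.0, "p": 2}
complex_ = {"log_lik": -118.5, "p": 5}

aic_simple = aic(simple["log_lik"], simple["p"])
aic_complex = aic(complex_["log_lik"], complex_["p"])
bic_simple = bic(simple["log_lik"], simple["p"], n)
bic_complex = bic(complex_["log_lik"], complex_["p"], n)
```

In this made-up case both criteria prefer the simpler model: the improvement in fit does not pay for three extra parameters, and BIC penalises them more heavily than AIC once N exceeds about 7 (where log N > 2).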


Cross-validation
• Split data set into two pieces, fit to one, and compute negative log-likelihood on the other

• Average over multiple different splits

• Choose the model with the smallest value of this average

• The difference in averages for two different models is an estimate of the difference in KL divergence of the models from the source of the data
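The procedure can be sketched as follows for two candidate models of 1-D data: a Gaussian with unit standard deviation (mean fitted) and a Gaussian with both mean and standard deviation fitted. The models, the half-and-half splits, the number of splits and the synthetic data are all assumptions chosen to keep the example small.

```python
import math
import random

def nll_gauss(data, mu, sigma):
    """Negative log-likelihood of data under a Gaussian N(mu, sigma^2)."""
    const = 0.5 * math.log(2.0 * math.pi)
    return sum(0.5 * ((x - mu) / sigma) ** 2 + math.log(sigma) + const for x in data)

def fit_fixed_sigma(train):
    """Model A: fit the mean only, standard deviation fixed at 1."""
    return sum(train) / len(train), 1.0

def fit_free_sigma(train):
    """Model B: fit both the mean and the standard deviation."""
    mu = sum(train) / len(train)
    var = sum((x - mu) ** 2 for x in train) / len(train)
    return mu, math.sqrt(var)

def cross_validate(data, fit, n_splits=20, seed=0):
    """Average held-out negative log-likelihood over random half/half splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train, test = shuffled[:half], shuffled[half:]
        mu, sigma = fit(train)
        scores.append(nll_gauss(test, mu, sigma))
    return sum(scores) / len(scores)

# Made-up data with standard deviation 3, so model A is mis-specified.
rng = random.Random(1)
data = [rng.gauss(0.0, 3.0) for _ in range(200)]
cv_fixed = cross_validate(data, fit_fixed_sigma)
cv_free = cross_validate(data, fit_free_sigma)
```

Here the model with the smaller average (the one with a fitted standard deviation) would be chosen.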


Model averaging
• Very often, it is smarter to use multiple models for prediction than just one

• e.g. motion capture data
– there are a small number of schemes that are used to put markers on the body

– given we know the scheme S and the measurements D, we can estimate the configuration of the body X

• We want

$$P(X \mid D) = P(X \mid S_1, D)\,P(S_1 \mid D) + P(X \mid S_2, D)\,P(S_2 \mid D) + P(X \mid S_3, D)\,P(S_3 \mid D)$$

• If it is obvious what the scheme is from the data, then averaging makes little difference

• If it isn’t, then not averaging underestimates the variance of X, so we think we have a more precise estimate than we do.
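A numeric sketch of the variance point, assuming a scalar X and that each scheme's estimate P(X | S_k, D) is summarised by a mean and a variance. The posterior weights P(S_k | D), means and variances below are invented numbers for illustration.

```python
def averaged_mean_var(weights, means, variances):
    """Mean and variance of the mixture sum_k P(S_k | D) P(X | S_k, D)."""
    mean = sum(w * m for w, m in zip(weights, means))
    second_moment = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances))
    return mean, second_moment - mean * mean

# Hypothetical posterior over three marker schemes, and each scheme's estimate of X.
weights = [0.5, 0.3, 0.2]      # P(S_k | D), must sum to 1
means = [1.0, 2.0, 4.0]        # E[X | S_k, D]
variances = [0.1, 0.1, 0.1]    # Var[X | S_k, D]

avg_mean, avg_var = averaged_mean_var(weights, means, variances)

# Committing to the single most probable scheme instead:
best = max(range(len(weights)), key=lambda k: weights[k])
best_mean, best_var = means[best], variances[best]
```

With these numbers the averaged variance is 1.39 against 0.1 for the single best scheme: committing to one scheme hides the disagreement between schemes.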
