Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009


Page 1: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Fisher kernels for image representation & generative classification models

Jakob Verbeek

December 11, 2009

Page 2: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Plan for this course

1) Introduction to machine learning

2) Clustering techniques: k-means, Gaussian mixture density

3) Gaussian mixture density continued: parameter estimation with EM

4) Classification techniques 1: introduction, generative methods, semi-supervised, Fisher kernels

5) Classification techniques 2: discriminative methods, kernels

6) Decomposition of images: topic models, …

Page 3: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Classification

• Training data consists of “inputs”, denoted x, and corresponding output “class labels”, denoted y.

• The goal is to correctly predict the class label corresponding to a test data input.

• Learn a “classifier” f(x) from the input data that outputs the class label, or a probability over the class labels.

• Example:
  – Input: image
  – Output: category label, e.g. “cat” vs. “no cat”

• Classification can be binary (two classes), or over a larger number of classes (multi-class).
  – In binary classification we often refer to one class as “positive” and the other as “negative”.

• A binary classifier creates a boundary in the input space between the areas assigned to each class.

Page 4: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Example of classification

Given: training images and their categories. What are the categories of these test images?

Page 5: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Discriminative vs. generative methods

• Generative probabilistic methods
  – Model the density of inputs x from each class: p(x|y)
  – Estimate the class prior probability p(y)
  – Use Bayes’ rule to infer the distribution over the class given the input

• Discriminative (probabilistic) methods
  – Directly estimate the class probability given the input: p(y|x)
  – Some methods do not have a probabilistic interpretation,
    • e.g. they fit a function f(x), and assign to class 1 if f(x) > 0, and to class 2 if f(x) < 0

p(y|x) = p(x|y) p(y) / p(x),   where   p(x) = Σ_y p(y) p(x|y)

Page 6: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Generative classification methods

• Generative probabilistic methods
  – Model the density of inputs x from each class: p(x|y)
  – Estimate the class prior probability p(y)
  – Use Bayes’ rule to infer the distribution over the class given the input

• Modeling class-conditional densities over the inputs x: selection of the model class
  – Parametric models: e.g. Gaussian (for continuous data), Bernoulli (for binary data), …
  – Semi-parametric models: mixtures of Gaussians, mixtures of Bernoullis, …
  – Non-parametric models: histograms over one-dimensional or multi-dimensional data, the nearest-neighbor method, kernel density estimators

• Given the class-conditional models, classification is trivial: just apply Bayes’ rule.

• Adding new classes can be done by adding a new class-conditional model
  – Existing class-conditional models stay as they are

p(y|x) = p(x|y) p(y) / p(x),   where   p(x) = Σ_c p(y=c) p(x|y=c)
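As a concrete illustration of this recipe, the sketch below fits one class-conditional density per class plus class priors, and applies Bayes’ rule to get p(y|x). The choice of a single Gaussian per class and all function names are illustrative assumptions, not something prescribed by the slides.

```python
# Minimal sketch of generative classification via Bayes' rule,
# assuming (for illustration only) a single Gaussian p(x|y=c) per class.
import numpy as np
from scipy.stats import multivariate_normal

def fit_generative(X, y):
    """Fit p(x|y=c) as a Gaussian per class and p(y=c) as class frequencies."""
    classes = np.unique(y)
    priors, densities = {}, {}
    for c in classes:
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)                       # p(y=c)
        densities[c] = multivariate_normal(Xc.mean(axis=0),
                                           np.cov(Xc, rowvar=False))
    return priors, densities

def posterior(x, priors, densities):
    """p(y=c|x) = p(x|y=c) p(y=c) / sum_c' p(x|y=c') p(y=c')."""
    joint = {c: priors[c] * densities[c].pdf(x) for c in priors}
    Z = sum(joint.values())                                # this is p(x)
    return {c: j / Z for c, j in joint.items()}
```

Any other class-conditional model from the list above (Bernoulli, mixture, histogram, …) can be plugged in by swapping the density objects; the Bayes-rule step stays unchanged.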

Page 7: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Histogram methods

• Suppose we
  – have N data points
  – use a histogram with C cells

• How to set the density level in each cell?
  – Maximum (log-)likelihood estimator
  – Proportional to the number of points n in the cell
  – Inversely proportional to the volume V of the cell

• Problems with the histogram method:
  – The number of cells scales exponentially with the dimension of the data
  – Discontinuous density estimate
  – How to choose the cell size?

p_c = n_c / (N V_c)
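A minimal sketch of this estimator in one dimension, assuming the bin edges are given; the function name and interface are illustrative.

```python
# Histogram density estimate p_c = n_c / (N * V_c) for a 1-D sample.
import numpy as np

def histogram_density(x, data, edges):
    """Density at query point x from a histogram with the given bin edges."""
    N = len(data)
    counts, _ = np.histogram(data, bins=edges)   # n_c per cell
    volumes = np.diff(edges)                     # V_c per cell (bin widths)
    c = np.searchsorted(edges, x, side='right') - 1
    if c < 0 or c >= len(counts):
        return 0.0                               # x falls outside the histogram
    return counts[c] / (N * volumes[c])
```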

Page 8: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

The ‘curse of dimensionality’

• The number of bins increases exponentially with the dimensionality of the data.
  – Fine division of each dimension: many empty bins
  – Rough division of each dimension: poor density model

• A probability distribution over D discrete variables takes at least 2^D values
  – At least 2 values for each variable

• The number of cells may be reduced by assuming independence between the components of x: the naïve Bayes model
  – The model is “naïve” since it assumes that all variables are independent…
  – Unrealistic for high-dimensional data, where variables tend to be dependent
    • Poor density estimator
    • Classification performance can still be good using the derived p(y|x)

Page 9: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Example of generative classification

• Hand-written digit classification
  – Input: binary 28x28 scanned digit images, collected in a 784-dimensional vector
  – Desired output: class label of the image

• Generative model
  – Independent Bernoulli model for each class
  – One probability per pixel per class
  – The maximum likelihood estimator is the average value per pixel per class

• Classify using Bayes’ rule:

p(x|y=c) = Π_{d=1..D} p(x_d|y=c)

p(x_d=1|y=c) = θ_dc,   p(x_d=0|y=c) = 1 − θ_dc

p(y|x) = p(x|y) p(y) / p(x),   where   p(x) = Σ_c p(y=c) p(x|y=c)
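A possible implementation of this independent Bernoulli model, assuming binary input vectors of length 784. The smoothing constant eps and the function names are illustrative additions; the slides only state the maximum-likelihood estimator (the average pixel value per class).

```python
# Sketch of the per-class independent Bernoulli model for binary digit images.
# Log-probabilities are used to avoid underflow over 784 pixels.
import numpy as np

def fit_bernoulli(X, y, eps=1e-3):
    """X: (N, 784) binary pixel vectors, y: class labels."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    # theta[c, d] = p(x_d = 1 | y = c); eps keeps probabilities away from 0/1
    theta = np.array([np.clip(X[y == c].mean(axis=0), eps, 1 - eps)
                      for c in classes])
    return classes, priors, theta

def classify(x, classes, priors, theta):
    """Bayes' rule with log p(x|y=c) = sum_d x_d log theta_cd + (1-x_d) log(1-theta_cd)."""
    log_joint = (np.log(priors)
                 + x @ np.log(theta.T)
                 + (1 - x) @ np.log(1 - theta.T))
    return classes[np.argmax(log_joint)]
```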

Page 10: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

k-nearest-neighbor estimation method

• Idea: fix the number of samples in the cell, and find the right cell size.

• The probability to find a point in a sphere A centered on x with volume v is P = ∫_A p(x') dx'

• A smooth density is approximately constant in a small region, and thus P ≈ p(x) v

• Alternatively: estimate P from the fraction of training data falling in the sphere around x: P ≈ k/N

• Combine the above to obtain the estimate p(x) ≈ k / (N v)
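The estimate p(x) ≈ k / (N v) can be computed directly, as in the sketch below; the Euclidean distance and the helper names are assumptions for illustration.

```python
# Sketch of the k-NN density estimate p(x) ~ k / (N v), where v is the volume
# of the smallest sphere around x containing k training points.
import numpy as np
from scipy.special import gamma

def sphere_volume(r, d):
    """Volume of a sphere of radius r in d dimensions."""
    return (np.pi ** (d / 2)) * (r ** d) / gamma(d / 2 + 1)

def knn_density(x, data, k):
    """data: (N, d) training points; returns the k-NN density estimate at x."""
    N, d = data.shape
    dists = np.sort(np.linalg.norm(data - x, axis=1))
    r = dists[k - 1]                      # radius to the k-th nearest neighbour
    return k / (N * sphere_volume(r, d))
```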

Page 11: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

k-nearest-neighbor estimation method

• Method in practice:
  – Choose k
  – For a given x, compute the volume v which contains k samples
  – Estimate the density with p(x) ≈ k / (N v)

• The volume of a sphere with radius r in d dimensions is v(r, d) = π^(d/2) r^d / Γ(d/2 + 1)

• What effect does k have?
  – Data sampled from a mixture of Gaussians, plotted in green
  – Larger k: larger region, smoother estimate

• Selection of k
  – Leave-one-out cross validation
  – Select the k that maximizes the data log-likelihood

Page 12: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

k-nearest-neighbor classification rule

• Use k-nearest-neighbor density estimation to find p(x|category)
• Apply Bayes’ rule for classification: the k-nearest-neighbor classification rule
  – Find the sphere volume v that captures k data points, for the estimate
      p(x) = k / (N v)
  – Use the same sphere for each class, for the estimates
      p(x|y=c) = k_c / (N_c v)
  – Estimate the global class priors
      p(y=c) = N_c / N
  – Calculate the class posterior distribution
      p(y=c|x) = p(x|y=c) p(y=c) / p(x)
               = (k_c / (N_c v)) · (N_c / N) / (k / (N v))
               = k_c / k

(Volume of a sphere with radius r in d dimensions: v(r, d) = π^(d/2) r^d / Γ(d/2 + 1).)
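Since the volumes and the factor N cancel, the rule reduces to counting the labels of the k nearest neighbours; a minimal sketch, with illustrative names:

```python
# k-NN classification: p(y=c|x) = k_c / k, the fraction of the k nearest
# training points that belong to class c.
import numpy as np

def knn_classify(x, X_train, y_train, k):
    """Return the class posterior p(y=c|x) = k_c / k for each class."""
    dists = np.linalg.norm(X_train - x, axis=1)
    neighbours = y_train[np.argsort(dists)[:k]]    # labels of the k nearest points
    classes, counts = np.unique(neighbours, return_counts=True)
    return dict(zip(classes, counts / k))          # k_c / k per class
```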

Page 13: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

k-nearest-neighbor classification rule

• Effect of k on the classification boundary

– Larger number of neighbors

– Larger regions

– Smoother class boundaries

Page 14: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Kernel density estimation methods

• Consider a simple estimator of the cumulative distribution function: F(x) = (1/N) Σ_n [x_n ≤ x]

• Its derivative gives an estimator of the density function, but this is just a set of delta peaks.

• The derivative is defined as p(x) = lim_{h→0} ( F(x+h) − F(x−h) ) / (2h)

• Consider a non-limiting value of h: p(x) ≈ (1 / (2hN)) Σ_n [ |x − x_n| ≤ h ]

• Each data point adds 1/(2hN) within distance h around it (a “block” of width 2h); the sum of these blocks gives the estimate.

Page 15: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Kernel density estimation methods

• Can use a function other than the “block” to obtain a smooth estimator.

• A widely used kernel function is the (multivariate) Gaussian
  – The contribution decreases smoothly as a function of the distance to the data point.

• Choice of the smoothing parameter
  – A larger “kernel” size gives a smoother density estimator
  – Use the average distance between samples
  – Use cross-validation

• The method can be used for multivariate data
  – Or in a naïve Bayes model
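A minimal sketch of the Gaussian kernel density estimator with a single isotropic bandwidth h; the interface is illustrative.

```python
# Gaussian KDE: each training point contributes a Gaussian bump of width h,
# and the density estimate is the average of the bumps.
import numpy as np

def gaussian_kde(x, data, h):
    """data: (N, d) samples, x: query point of shape (d,), h: bandwidth."""
    N, d = data.shape
    sq_dists = np.sum((data - x) ** 2, axis=1)
    norm = (2 * np.pi * h ** 2) ** (d / 2)          # Gaussian normalisation constant
    return np.mean(np.exp(-sq_dists / (2 * h ** 2)) / norm)
```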

Page 16: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Summary of generative classification methods

(Figure: density estimates with the histogram, k-nearest-neighbor, and kernel density estimation methods.)

• (Semi-)parametric models (e.g. p(data|category) = Gaussian or mixture)
  – No need to store the data, but possibly too strong assumptions on the data density
  – Can lead to a poor fit on the data, and a poor classification result

• Non-parametric models
  – Histograms:
    • Only practical in low-dimensional spaces (< 5 dimensions or so)
    • A high-dimensional space will lead to many cells, many of which will be empty
    • Naïve Bayes modeling in higher-dimensional cases
  – K-nearest-neighbor & kernel density estimation:
    • Need to store all training data
    • Need to find nearest neighbors or points with non-zero kernel evaluation (costly)

Page 17: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Discriminative vs. generative methods

• Generative probabilistic methods
  – Model the density of inputs x from each class: p(x|y)
  – Estimate the class prior probability p(y)
  – Use Bayes’ rule to infer the distribution over the class given the input

• Discriminative (probabilistic) methods (next week)
  – Directly estimate the class probability given the input: p(y|x)
  – Some methods do not have a probabilistic interpretation,
    • e.g. they fit a function f(x), and assign to class 1 if f(x) > 0, and to class 2 if f(x) < 0

• Hybrid generative-discriminative models
  – Fit a density model to the data
  – Use properties of this model as input for a classifier
  – Example: Fisher vectors for image representation

p(y|x) = p(x|y) p(y) / p(x),   where   p(x) = Σ_y p(y) p(x|y)

Page 18: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Clustering for visual vocabulary construction

• Clustering of local image descriptors
  – using k-means or a mixture of Gaussians

• Recap of the image representation pipeline
  – Extract image regions at various locations and scales
  – Compute a descriptor for each region (e.g. SIFT)
  – (Soft-)assign each descriptor to the clusters
  – Make a histogram for the complete image
    • Summing of the vector representations of each descriptor
    • Input to the image classification method

(Figure: rows are image regions, columns are cluster indexes.)

Hard assignments (one per region), e.g.:
  0  1  0  0  0  0  0  0
  0  0  0  0  1  0  0  0
  …
Summed into the image histogram:  1  20  0  0  10  3  2  1

Soft assignments, e.g.:
  0  .5  .5  0   0  0   0  0
  0   0   0  0  .9  0  .1  0
  …
Summed into the image histogram:  1.2  20.0  0.2  0.6  10.0  2.5  2.5  1.0
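The hard- and soft-assignment histograms above could be computed along the following lines; the diagonal-covariance mixture and all function names are assumptions for illustration, with cluster centres / mixture parameters taken from k-means or EM fitted on training descriptors.

```python
# Sketch of the histogram step of the pipeline: (soft-)assign each local
# descriptor to the clusters and sum the per-descriptor assignment vectors.
import numpy as np

def hard_histogram(descriptors, centres):
    """descriptors: (N, D), centres: (K, D) -> (K,) visual-word counts."""
    d2 = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    assignments = np.argmin(d2, axis=1)
    return np.bincount(assignments, minlength=len(centres)).astype(float)

def soft_histogram(descriptors, weights, means, variances):
    """Soft-assign with a diagonal-covariance MoG; rows of q sum to one."""
    log_q = (np.log(weights)
             - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
             - 0.5 * (((descriptors[:, None, :] - means[None, :, :]) ** 2)
                      / variances[None, :, :]).sum(axis=2))
    log_q -= log_q.max(axis=1, keepdims=True)      # stabilise before exponentiating
    q = np.exp(log_q)
    q /= q.sum(axis=1, keepdims=True)
    return q.sum(axis=0)                           # summed soft counts per visual word
```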

Page 19: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Fisher Vector motivation

• Feature vector quantization is computationally expensive in practice

• Run-time is linear in
  – N: number of feature vectors, ~10^3 per image
  – D: number of dimensions, ~10^2 (SIFT)
  – K: number of clusters, ~10^3 for recognition

• So in total on the order of 10^8 multiplications per image to assign SIFT descriptors to visual words

• We use the histogram of visual word counts

• Can we do this more efficiently?!

• Reading material: “Fisher Kernels on Visual Vocabularies for Image Categorization”, F. Perronnin and C. Dance, CVPR 2007, Xerox Research Centre Europe, Meylan

Page 20: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Fisher vector image representation

• MoG / k-means stores the number of points per cell
  – Need many clusters to represent the distribution of descriptors in an image
  – But this increases the computational cost

• The Fisher vector adds 1st & 2nd order moments
  – More precise description of the regions assigned to each cluster
  – Fewer clusters needed for the same accuracy
  – Representation is (2D+1) times larger, at the same computational cost
  – The terms are already calculated when computing the soft-assignment

(Figure: per-cluster counts and moments illustrated on a toy example.)

Per-cluster statistics accumulated over the image regions:
  Σ_n q_nk            (soft count)
  Σ_n q_nk (x_n − m_k)        (1st order)
  Σ_n q_nk (x_n − m_k)^2      (2nd order)

q_nk: soft-assignment of image region n to cluster k (Gaussian mixture component)

Page 21: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Image representation using Fisher kernels

• General idea of the Fisher vector representation
  – Fit a probabilistic model to the data
  – Use the derivative of the data log-likelihood as the data representation, e.g. for classification

[Jaakkola & Haussler, “Exploiting generative models in discriminative classifiers”, in Advances in Neural Information Processing Systems 11, 1999.]

• Here, we use a mixture of Gaussians to cluster the region descriptors

• Concatenate the derivatives to obtain the data representation:

L(θ) = Σ_{n=1..N} log p(x_n) = Σ_{n=1..N} log Σ_{k=1..K} π_k N(x_n; m_k, C_k)

∂L(θ)/∂π_k = Σ_{n=1..N} q_nk / π_k

∂L(θ)/∂m_k = Σ_{n=1..N} q_nk C_k^{-1} (x_n − m_k)

∂L(θ)/∂C_k^{-1} = Σ_{n=1..N} q_nk [ ½ C_k − ½ (x_n − m_k)(x_n − m_k)^T ]
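A sketch of how these gradients could be concatenated into a Fisher vector for one image, assuming a diagonal-covariance mixture fitted beforehand with EM. The gradient with respect to the mixing weights ignores the sum-to-one constraint, and no Fisher-information normalisation is applied here, so this is a simplified illustration rather than the full method of Perronnin & Dance.

```python
# Sketch of the Fisher-vector construction from the MoG gradients above,
# with diagonal covariances (variances: (K, D) array of per-dimension variances).
import numpy as np

def fisher_vector(X, weights, means, variances):
    """X: (N, D) region descriptors. Returns the concatenated gradients."""
    # Soft-assignments q_nk (the same responsibilities as in the EM E-step)
    log_q = (np.log(weights)
             - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
             - 0.5 * (((X[:, None, :] - means[None, :, :]) ** 2)
                      / variances[None, :, :]).sum(axis=2))
    log_q -= log_q.max(axis=1, keepdims=True)
    q = np.exp(log_q)
    q /= q.sum(axis=1, keepdims=True)                              # (N, K)

    diff = X[:, None, :] - means[None, :, :]                       # (N, K, D)
    g_w = q.sum(axis=0) / weights                                  # dL/d pi_k
    g_m = (q[:, :, None] * diff / variances[None]).sum(axis=0)     # dL/d m_k, (K, D)
    g_c = 0.5 * (q[:, :, None]
                 * (variances[None] - diff ** 2)).sum(axis=0)      # dL/d C_k^-1 (diagonal)
    return np.concatenate([g_w, g_m.ravel(), g_c.ravel()])
```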

Page 22: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Image representation using Fisher kernels

• Extended representation of the image descriptors using a MoG
  – Displacement of the descriptor from the center
  – Squares of the displacement from the center
  – From 1 number per descriptor per cluster, to 1 + D + D^2 (D = data dimension)

• A simplified version is obtained when
  – using this representation for a linear classifier
  – using diagonal covariance matrices, with the variance in each dimension given by a vector v_k

• For a single image region descriptor, the per-cluster terms are given below; summed over all descriptors this gives us
  – 1: soft count of the regions assigned to the cluster
  – D: weighted average of the assigned descriptors
  – D: weighted variance of the descriptors in all dimensions

Per-descriptor terms for cluster k (corresponding to the parameters π_k, m_k, C_k^{-1}):
  q_nk,   q_nk (x_n − m_k),   q_nk (x_n − m_k)^2

Page 23: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Fisher vector image representation

• MoG / k-means stores the number of points per cell
  – Need many clusters to represent the distribution of descriptors in an image

• The Fisher vector adds 1st & 2nd order moments
  – More precise description of the regions assigned to each cluster
  – Fewer clusters needed for the same accuracy
  – Representation is (2D+1) times larger, at the same computational cost
  – The terms are already calculated when computing the soft-assignment
  – Computational cost is O(NKD): we need the differences between all clusters and all data points

(Figure: per-cluster counts and moments illustrated on a toy example.)

Per-cluster statistics accumulated over the image regions:
  Σ_n q_nk,   Σ_n q_nk (x_n − m_k),   Σ_n q_nk (x_n − m_k)^2

Page 24: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Images from the categorization task PASCAL VOC

• Yearly “competition” for image classification (also object localization, segmentation, and body-part localization)

Page 25: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Fisher Vector: results

• BOV-supervised learns a separate mixture model for each image class, so that some of the visual words are class-specific

• MAP: assign the image to the class for which the corresponding MoG assigns maximum likelihood to the region descriptors

• Other results: based on a linear classifier over the image descriptions

• Similar performance, using 16x fewer Gaussians
• The unsupervised/universal representation works well

Page 26: Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

Plan for this course

1) Introduction to machine learning

2) Clustering techniques: k-means, Gaussian mixture density

3) Gaussian mixture density continued: parameter estimation with EM

4) Classification techniques 1: introduction, generative methods, semi-supervised

Reading for next week: previous papers (!), nothing new. Available on the course website http://lear.inrialpes.fr/~verbeek/teaching

5) Classification techniques 2: discriminative methods, kernels

6) Decomposition of images: topic models, …