CVPR 2010: Advanced IT in CVPR in a Nutshell, Part 3: Feature Selection

High-Dimensional Feature Selection: Images, Genes & Graphs
Francisco Escolano


Page 1: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Tutorial

Advanced Information Theory in CVPR “in a Nutshell”

CVPR, June 13-18, 2010, San Francisco, CA

High-Dimensional Feature Selection: Images, Genes & Graphs

Francisco Escolano

Page 2: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

High-Dimensional Feature Selection

Background. Feature selection in CVPR is mainly motivated by: (i) dimensionality reduction for improving the performance of classifiers, and (ii) feature-subset pursuit for a better understanding/description of the patterns at hand (with either generative or discriminative methods). Exploring the domain of images [Zhu et al., 97] and other patterns (e.g. micro-array/gene-expression data) [Wang and Gotoh, 09] implies dealing with thousands of features [Guyon and Elisseeff, 03]. In terms of Information Theory (IT), this task demands pursuing the most informative features, not only individually but also capturing their statistical interactions, beyond pairs of features. This has been a major challenge since the early 70s due to its intrinsically combinatorial nature.

2/56

Page 3: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Feature Classification

Strong and Weak Relevance

1. Strong relevance: a feature f_i is strongly relevant if its removal degrades the performance of the Bayes-optimal classifier.

2. Weak relevance: f_i is weakly relevant if it is not strongly relevant and there exists a subset of features \vec{F}' such that the performance on \vec{F}' \cup \{f_i\} is better than the performance on \vec{F}' alone.

Redundant and Noisy Features

1. Redundant features: those which can be removed because another feature or subset of features already supplies the same information about the class.

2. Noisy features: neither redundant nor informative.

3/56

Page 4: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Feature Classification (2)

Figure: Feature classification: irrelevant, weakly relevant, and strongly relevant features.

4/56

Page 5: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Wrappers and Filters

Wrappers vs Filters

1. Wrapper feature selection (WFS) consists in selecting features according to the classification results that these features yield (e.g. using cross-validation (CV)). Wrapper feature selection is therefore a classifier-dependent approach.

2. Filter feature selection (FFS) is classifier-independent, as it is based on statistical analysis of the input variables (features), given the classification labels of the samples.

3. Wrappers build a classifier each time a feature set has to be evaluated. This makes them more prone to overfitting than filters.

4. In FFS the classifier itself is built and tested only after feature selection (both approaches are sketched below).
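To make the contrast concrete, here is a small, purely illustrative sketch (not from the slides) that scores single features both ways with scikit-learn; the synthetic data, the k-NN classifier and the univariate MI estimator are all arbitrary choices of this example.

```python
# Illustrative contrast between a wrapper score (CV accuracy of an actual
# classifier) and a filter score (estimated MI with the labels) per feature.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

# Wrapper: classifier-dependent, a classifier is built for every evaluation.
wrapper_scores = [cross_val_score(KNeighborsClassifier(), X[:, [j]], y, cv=5).mean()
                  for j in range(X.shape[1])]

# Filter: classifier-independent; the classifier is built only afterwards.
filter_scores = mutual_info_classif(X, y, random_state=0)

print("best single feature (wrapper):", int(np.argmax(wrapper_scores)))
print("best single feature (filter): ", int(np.argmax(filter_scores)))
```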

5/56

Page 6: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

FS is a Complex Task

The Complexity of FS

The only way to assure that a feature set is optimal (according to what criterion?) is an exhaustive search among feature combinations. The curse of dimensionality limits this search, as the complexity is

O(z), \quad z = \sum_{i=1}^{n} \binom{n}{i}.

Filter design is desirable in order to avoid overfitting, but filters must be as multivariate as wrappers. However, this implies designing and estimating a cost function that captures the high-order interactions of many variables. In the following we will focus on FFS.
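For intuition, z = 2^n − 1 (every non-empty subset), so exhaustive search is hopeless even for modest n; a two-line check (my own aside, not from the slides):

```python
# z = sum_{i=1..n} C(n, i) = 2**n - 1 non-empty feature subsets.
from math import comb

for n in (10, 20, 48, 540):   # 48 and 540 are feature counts used later in this talk
    z = sum(comb(n, i) for i in range(1, n + 1))
    assert z == 2**n - 1
    print(f"n = {n:4d}  ->  z = {z}")
```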

6/56

Page 7: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Mutual Information Criterion

A Criterion is Needed

The primary problem of feature selection is the criterion which evaluates a feature set: it must decide whether a feature subset is suitable for the classification problem or not. The optimal criterion for this purpose would be the Bayesian error rate for the subset of selected features:

E(\vec{S}) = \int_{\vec{S}} p(\vec{S}) \left( 1 - \max_i p(c_i \mid \vec{S}) \right) d\vec{S},

where \vec{S} is the vector of selected features and c_i \in C is a class from the set C of all classes present in the data.

7/56

Page 8: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Mutual Information Criterion (2)

Bounding Bayes Error

An upper bound on the Bayesian error, obtained by Hellman and Raviv (1970), is:

E(\vec{S}) \le \frac{H(C \mid \vec{S})}{2}.

This bound is related to mutual information (MI), because the MI can be expressed as

I(\vec{S}; C) = H(C) - H(C \mid \vec{S}),

and H(C) is the entropy of the class labels, which does not depend on the feature subspace \vec{S}. Hence, minimizing H(C \mid \vec{S}), i.e. tightening the bound, amounts to maximizing I(\vec{S}; C).

8/56

Page 9: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Mutual Information Criterion (3)

Figure: Redundancy and MI: features x_1, ..., x_4 and their mutual information I(X; C) with the class C.

9/56

Page 10: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Mutual Information Criterion (4)

MI and KL Divergence

From the definition of MI we have that:

I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
        = \sum_{y \in Y} p(y) \sum_{x \in X} p(x \mid y) \log \frac{p(x \mid y)}{p(x)}
        = E_Y \left[ KL\bigl( p(x \mid y) \,\|\, p(x) \bigr) \right].

Then, maximizing MI is equivalent to maximizing the expectation of the KL divergence between the class-conditional densities p(\vec{S} \mid C) and the density of the feature subset p(\vec{S}).

10/56

Page 11: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Mutual Information Criterion (5)

The Bottleneck

- Estimating mutual information between a high-dimensional continuous set of features and the class labels is not straightforward, due to the entropy estimation it requires.

- Simplifying assumptions: (i) dependency is not informative about the class, (ii) redundancies between features are independent of the class [Vasconcelos & Vasconcelos, 04]. The basic idea is to simplify the computation of the MI, for instance (a small illustration follows below):

I(\vec{S}^{*}; C) \approx \sum_{x_i \in \vec{S}} I(x_i; C)

- More realistic assumptions, involving feature interactions, are needed!
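To make the limitation concrete (a toy example of mine, not from the slides), summing univariate MI scores cannot distinguish a duplicated feature from a genuinely complementary one; the data and the MI estimator are arbitrary:

```python
# The approximation sum_i I(x_i; C) is blind to redundancy between features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
x1, x2 = X[:, 0], X[:, 1]
x1_copy = x1 + 1e-6 * np.random.default_rng(0).normal(size=x1.shape)

redundant_set = np.column_stack([x1, x1_copy])   # second feature adds nothing new
complementary_set = np.column_stack([x1, x2])    # second feature adds new information

for name, S in [("x1 + copy of x1", redundant_set),
                ("x1 + x2        ", complementary_set)]:
    score = mutual_info_classif(S, y, random_state=0).sum()   # sum_i I(x_i; C)
    print(f"{name}: sum of univariate MI = {score:.3f}")
# The duplicated set scores roughly 2 * I(x1; C) even though it carries no more
# class information than x1 alone; higher-order interactions are simply ignored.
```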

11/56

Page 12: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

The mRMR Criterion

Definition

Maximize the relevance I(x_j; C) of each individual feature x_j \in \vec{F} and simultaneously minimize the redundancy between x_j and the rest of the selected features x_i \in \vec{S}, i \neq j. This criterion is known as the min-Redundancy Max-Relevance (mRMR) criterion, and its formulation for the selection of the m-th feature is [Peng et al., 2005]:

\max_{x_j \in \vec{F} \setminus \vec{S}_{m-1}} \left[ I(x_j; C) - \frac{1}{m-1} \sum_{x_i \in \vec{S}_{m-1}} I(x_j; x_i) \right]

Thus, second-order interactions are considered!
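As an illustration of how the criterion is used in practice (my own sketch, not code from the talk), a minimal greedy mRMR selector can be written with scikit-learn's k-NN-based MI estimates standing in for I(x_j; C) and I(x_j; x_i); the dataset is synthetic and the estimator choice is an assumption:

```python
# Minimal greedy mRMR sketch: at step m, pick the feature maximizing
# I(x_j; C) - (1/(m-1)) * sum_{x_i in S_{m-1}} I(x_j; x_i).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr(X, y, m):
    relevance = mutual_info_classif(X, y, random_state=0)        # I(x_j; C)
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(m):
        best_j, best_score = None, -np.inf
        for j in remaining:
            if selected:   # mean redundancy w.r.t. the already selected features
                redundancy = np.mean([mutual_info_regression(X[:, [i]], X[:, j],
                                                             random_state=0)[0]
                                      for i in selected])
            else:
                redundancy = 0.0
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
print(mrmr(X, y, m=5))    # indices of the 5 features chosen greedily
```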

12/56

Page 13: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

The mRMR Criterion (2)

Properties

- It can be embedded in a forward greedy search, that is, at each iteration of the algorithm the next feature optimizing the criterion is selected.

- Doing so, this criterion is equivalent to a first-order (incremental) version of the Maximum Dependency (MD) criterion:

\max_{\vec{S} \subseteq \vec{F}} I(\vec{S}; C)

Then, the m-th feature is selected according to:

\max_{x_j \in \vec{F} \setminus \vec{S}_{m-1}} I(\vec{S}_{m-1}, x_j; C)

13/56

Page 14: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Efficient MD

Need of an Entropy Estimator

- Whilst in mRMR the mutual information is estimated incrementally, between pairs of one-dimensional variables, in MD the estimation of I(\vec{S}; C) is not trivial, because \vec{S} may consist of a large number of features.

- If we base the MI calculation on the conditional entropy H(\vec{S} \mid C), we have:

I(\vec{S}; C) = H(\vec{S}) - H(\vec{S} \mid C).

To do this, the entropies \sum_c H(\vec{S} \mid C = c)\, p(C = c) have to be calculated. In any case, we must calculate multi-dimensional entropies! (A sketch of this computation follows below.)
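A hedged sketch of how I(S; C) could be assembled from multi-dimensional entropy estimates; `knn_entropy` is a placeholder for an estimator such as the Leonenko k-NN estimator introduced on the following slides, and `mutual_information_md` is a name I introduce here for illustration only:

```python
# I(S; C) = H(S) - sum_c p(C = c) * H(S | C = c), with a generic
# multi-dimensional entropy estimator passed in as `knn_entropy`.
import numpy as np

def mutual_information_md(S, labels, knn_entropy, k=5):
    """S: (N, d) matrix of the currently selected features; labels: (N,) classes."""
    h_s = knn_entropy(S, k=k)                        # H(S)
    h_s_given_c = 0.0
    for c in np.unique(labels):
        Sc = S[labels == c]
        p_c = len(Sc) / len(S)                       # p(C = c)
        h_s_given_c += p_c * knn_entropy(Sc, k=k)    # p(c) * H(S | C = c)
    return h_s - h_s_given_c
```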

14/56

Page 15: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Leonenko’s Estimator

Theoretical Bases

Recently, Leonenko et al. published an extensive study [Leonenko et al., 08] on Rényi and Tsallis entropy estimation, also considering the limit α → 1, which yields the Shannon entropy. Their construction relies on the integral

I_\alpha = E\{ f^{\alpha - 1}(\vec{X}) \} = \int_{\mathbb{R}^d} f^{\alpha}(x)\, dx,

where f(\cdot) is the density of a set of N i.i.d. samples \vec{X} = \{X_1, X_2, \ldots, X_N\}. The integral is valid for α ≠ 1; the limits for α → 1 are also derived in order to cover Shannon entropy estimation.

15/56

Page 16: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Leonenko’s Estimator (2)

Entropy Maximization Distributions

- The α-entropy maximizing distributions are only defined for 0 < α < 1, where the entropy H_α is a concave function. The maximizing distributions are defined under some constraints: the uniform distribution maximizes the α-entropy under the constraint of finite support, and for distributions with a given covariance matrix the maximizing distribution is a Student-t, if d/(d+2) < α < 1, for any number of dimensions d ≥ 1.

- This is a generalization of the property that the Gaussian distribution maximizes the Shannon entropy H.

16/56

Page 17: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Leonenko’s Estimator (3)

Renyi, Tsallis and Shannon Entropy Estimates

For α ≠ 1, the estimator of I_\alpha is

\hat{I}_{N,k,\alpha} = \frac{1}{N} \sum_{i=1}^{N} (\zeta_{N,i,k})^{1-\alpha},

with

\zeta_{N,i,k} = (N-1)\, C_k\, V_d\, \bigl(\rho^{(i)}_{k,N-1}\bigr)^{d},

where \rho^{(i)}_{k,N-1} is the Euclidean distance from X_i to its k-th nearest neighbour among the remaining N-1 samples, V_d = \pi^{d/2} / \Gamma(d/2 + 1) is the volume of the unit ball B(0,1) in \mathbb{R}^d, and C_k = \bigl[\Gamma(k)/\Gamma(k+1-\alpha)\bigr]^{1/(1-\alpha)}.

17/56

Page 18: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Leonenko’s Estimator (4)

Renyi, Tsallis and Shannon Entropy Estimates (cont.)

- The estimator \hat{I}_{N,k,\alpha} is asymptotically unbiased, which is to say that E[\hat{I}_{N,k,\alpha}] \to I_\alpha as N \to \infty. It is also consistent under mild conditions.

- Given these conditions, the estimated Rényi entropy H_\alpha of f is

\hat{H}_{N,k,\alpha} = \frac{\log \hat{I}_{N,k,\alpha}}{1 - \alpha}, \quad \alpha \neq 1,

and for the Tsallis entropy S_\alpha = \frac{1}{\alpha - 1}\bigl(1 - \int f^{\alpha}(x)\, dx\bigr) the estimator is

\hat{S}_{N,k,\alpha} = \frac{1 - \hat{I}_{N,k,\alpha}}{\alpha - 1}, \quad \alpha \neq 1.

As N \to \infty, both estimators are asymptotically unbiased and consistent.

18/56

Page 19: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Leonenko’s Estimator (5)

Shannon Entropy Estimate

The limit of the Tsallis entropy estimator as α → 1 gives the Shannon entropy estimator:

\hat{H}_{N,k,1} = \frac{1}{N} \sum_{i=1}^{N} \log \xi_{N,i,k}, \qquad \xi_{N,i,k} = (N-1)\, e^{-\Psi(k)}\, V_d\, \bigl(\rho^{(i)}_{k,N-1}\bigr)^{d},

where \Psi(k) is the digamma function: \Psi(1) = -\gamma \approx -0.5772, \Psi(k) = -\gamma + A_{k-1}, with A_0 = 0 and A_j = \sum_{i=1}^{j} \frac{1}{i}.
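The following is a minimal sketch of the Shannon estimator above (my own transcription of the formulas, not the authors' code), using SciPy for the k-NN search and the special functions; entropies are in nats.

```python
# Kozachenko-Leonenko-style k-NN Shannon entropy estimator:
# H = (1/N) sum_i log xi_i,  xi_i = (N-1) * exp(-digamma(k)) * V_d * rho_i**d.
import numpy as np
from scipy.special import gamma, digamma
from scipy.spatial import cKDTree

def knn_entropy(X, k=5):
    """X: (N, d) array of i.i.d. samples; returns the estimated Shannon entropy."""
    X = np.asarray(X, dtype=float)
    N, d = X.shape
    tree = cKDTree(X)
    # k + 1 because the query point itself is returned as its own nearest neighbour.
    rho = tree.query(X, k=k + 1)[0][:, -1]           # distance to the k-th neighbour
    V_d = np.pi ** (d / 2) / gamma(d / 2 + 1)        # volume of the unit ball in R^d
    xi = (N - 1) * np.exp(-digamma(k)) * V_d * rho ** d
    return np.mean(np.log(xi))

# Sanity check against the closed form for a 1-D standard Gaussian:
# H = 0.5 * log(2 * pi * e), roughly 1.419 nats.
rng = np.random.default_rng(0)
print(knn_entropy(rng.normal(size=(2000, 1)), k=5))
```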

19/56

Page 20: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Leonenko’s Estimator (6)

Figure: Comparison of the k-NN (5-NN) and k-d partitioning entropy estimators on 1-D to 4-D Gaussian distributions, against the Gaussian entropy for Σ = I and for the real Σ, as the number of samples grows.

20/56

Page 21: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Reminder: Visual Localization

Figure: 3D+2D map for coarse-to-fine localization

21/56

Page 22: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Image Categorization

Problem Statement

- Design a minimal-complexity, maximal-accuracy classifier based on forward MI-FS for coarse localization [Lozano et al., 09].

- We have a bag of K different filters: Harris detector, Canny, horizontal and vertical gradients, gradient magnitude, 12 color filters based on Gaussians over the hue range, and sparse depth.

- We take the statistics of each filter response and compose a histogram with, say, B bins.

- Therefore, the total number of features per image is NF = K × (B − 1) (see the sketch below).
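A schematic of how such a feature vector could be assembled (the filter bank below is a stand-in built from generic SciPy filters, not the exact bank used in the talk): each filter response is summarized by a normalized B-bin histogram, and dropping one redundant bin per histogram yields K × (B − 1) features.

```python
# Per-image feature vector: K filter responses, each histogrammed into B bins,
# keeping B - 1 bins per normalized histogram.
import numpy as np
from scipy import ndimage

def filter_bank(img):
    """Stand-in bank; the talk uses Harris, Canny, colour and depth filters instead."""
    return [ndimage.sobel(img, axis=0),                          # gradient along axis 0
            ndimage.sobel(img, axis=1),                          # gradient along axis 1
            ndimage.gaussian_gradient_magnitude(img, sigma=1.0), # gradient magnitude
            ndimage.laplace(img)]

def image_features(img, B=4):
    feats = []
    for response in filter_bank(img):
        hist, _ = np.histogram(response, bins=B)
        hist = hist / hist.sum()          # normalized histogram of the response
        feats.extend(hist[:-1])           # last bin is redundant after normalization
    return np.array(feats)                # length K * (B - 1)

img = np.random.default_rng(0).random((64, 64))
print(image_features(img).shape)          # 4 filters * 3 bins -> (12,)
```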

22/56

Page 23: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Image Categorization (2)

Figure: Examples of filter responses. From top to bottom and left to right: input image, depth, vertical gradient, gradient magnitude, and five colour filters.

23/56

Page 24: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Image Categorization (3)

Figure: Greedy feature selection (10-fold CV error (%) vs. number of selected features). Left: finding the optimal number of bins B for K = 16 filters (6, 4 and 2 bins). Right: testing whether |C| = 6 classes is a correct choice (2, 4, 6 and 8 classes, 4-bin histograms).

24/56

Page 25: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Image Categorization (4)

Figure: MD, MmD and mRMR feature selection performance (10-fold CV error and test error vs. number of features) on the image-histogram data with 48 features.

25/56

Page 26: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Image Categorization (5)

        C#1    C#2    C#3    C#4    C#5    C#6
C#1     26     0      0      0      0      0
C#2     2/3    63/56  1/4    0/3    0      0
C#3     0      1/0    74/67  1/9    0      0
C#4     4/12   5/6    10/0   96/95  0/2    0
C#5     0      0      0      0      81     0
C#6     0      0      0      0      30/23  78/85

Table: Key question: what is the best classifier after feature selection? K-NN / SVM confusion matrix (entries are K-NN / SVM counts).

26/56

Page 27: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Image Categorization (6)

Figure: Test images and their 1st to 5th nearest neighbours.

27/56

Page 28: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Image Categorization (7)

Figure: Confusion trajectories at coarse (left) and fine (right) localization

28/56

Page 29: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Image Categorization (8)

Figure: Confusion trajectory using only coarse localization

29/56

Page 30: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Image Categorization (9)

Figure: Confusion trajectory after coarse-to-fine localization

30/56

Page 31: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Image Categorization (10)

Results

- With mRMR, the lowest CV error (8.05%) is achieved with a set of 17 features, while with MD a CV error of 4.16% is achieved with a set of 13 features.

- The CV errors yielded by MmD are very similar to those of MD.

- On the other hand, the test errors evolve slightly differently.

- The best test error is given by MD, and the worst one by mRMR.

- This may be explained by the fact that the entropy estimator we use does not degrade in accuracy as the number of dimensions increases.

31/56

Page 32: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Gene Selection

Microarray Analysis

- The basic setup of a microarray platform is a glass slide spotted with DNA fragments or oligonucleotides representing specific gene-coding regions. Purified RNA is then labelled fluorescently or radioactively and hybridized to the slide.

- The hybridization can be done simultaneously with reference RNA to enable comparison of data across multiple experiments. After washing, the raw data are obtained by scanning with a laser or by autoradiographic imaging. The obtained expression levels of tens of thousands of genes are numerically quantified and used for statistical studies.

32/56

Page 33: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Gene Selection (2)

Figure: An illustration of the basic idea of a microarray platform: fluorescent labeling of the sample RNA and of the reference cDNA, combination, and hybridization.

33/56

Page 34: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Gene Selection (3)

Figure: MD-FS performance on the NCI microarray data set with 6,380 features (LOOCV error (%) and I(S;C) vs. number of features). Lowest LOOCV error with a set of 39 features [Bonev et al., 08].

34/56

Page 35: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Gene Selection (4)

Figure: Performance of FS criteria (MD, MmD, mRMR and a LOOCV wrapper; % LOOCV error vs. number of features) on the NCI microarray. Only the selection of the first 36 features (out of 6,380) is represented.

35/56

Page 36: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Gene Selection (5)

Figure: Feature selection on the NCI DNA microarray data. The two panels (MD Feature Selection and mRMR Feature Selection) plot the selected genes against the samples, which are labelled by disease class (melanoma, breast, NSCLC, colon, leukemia, prostate, ovarian, renal, CNS, among others). Features (genes) selected by MD and mRMR are marked with an arrow.

36/56

Page 37: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Gene Selection (6)

Results

- Leukemia: 38 bone marrow samples, 27 Acute Lymphoblastic Leukemia (ALL) and 11 Acute Myeloid Leukemia (AML), over 7,129 probes (features) from 6,817 human genes. A test set of 34 samples is also provided, with 20 ALL and 14 AML. We obtained a 2.94% test error with 7 features, while other authors report the same error selecting 49 features via individualized markers. Other recent works report test errors higher than 5% for the Leukemia data set.

- Central Nervous System Embryonal Tumors: 60 patient samples, 21 survivors and 39 failures, with 7,129 genes in the data set. We selected 9 genes, achieving a 1.67% LOOCV error. The best result in the literature is an 8.33% CV error, reported with 100 features.

37/56

Page 38: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Gene Selection (7)

Results (cont)

- Colon: 62 samples collected from colon-cancer patients. Among them, 40 biopsies are from tumors and 22 normal biopsies are from healthy parts of the colons of the same patients. From 6,500 genes, 2,000 were selected for this data set based on the confidence in the measured expression levels. In our feature selection experiments, 15 of them were selected, resulting in a 0% LOOCV error, while recent works report errors higher than 12%.

- Prostate: 25 tumor and 9 normal samples, with 12,600 genes. Our experiments achieved the best error with only 5 features, with a test error of 5.88%. Other works report test errors higher than 6%.

38/56

Page 39: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Structural Recognition

Reeb Graphs

- Given a surface S and a real function f : S → R, the Reeb graph (RG) represents the topology of S through a graph structure whose nodes correspond to the critical points of f.

- The Extended Reeb Graph (ERG) [Biasotti, 05] is an approximation of the RG obtained by using a fixed number of level sets (63 in this work) that divide the surface into a set of regions; critical regions, rather than critical points, are identified according to the behaviour of f along the level sets; ERG nodes correspond to critical regions, while the arcs are detected by tracking the evolution of the level sets.

39/56

Page 40: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Structural Recognition (2)

Figure: Extended Reeb Graphs. Image courtesy of Daniela Giorgi and Silvia Biasotti. a) Geodesic, b) distance to center, c) sphere.

40/56

Page 41: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Structural Recognition (3)

Catalog of Features

- Subgraph node centrality: quantifies the degree of participation of a node i in a subgraph (see the sketch after this list):

C_S(i) = \sum_{k=1}^{n} \phi_k(i)^2\, e^{\lambda_k},

where n = |V|, \lambda_k is the k-th eigenvalue of the adjacency matrix A and \phi_k its corresponding eigenvector.

- Perron-Frobenius eigenvector: \phi_n, the eigenvector corresponding to the largest eigenvalue of A. The components of this vector denote the importance of each node.

- Adjacency spectra: the magnitudes of the (leading) eigenvalues of A, which have been experimentally validated for graph embedding [Luo et al., 03].
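A small sketch of the subgraph centrality computation above (an illustration using NumPy's symmetric eigendecomposition; the toy graph is arbitrary):

```python
# Subgraph centrality C_S(i) = sum_k phi_k(i)^2 * exp(lambda_k), from the
# eigendecomposition of the symmetric adjacency matrix A.
import numpy as np

def subgraph_centrality(A):
    lam, phi = np.linalg.eigh(A)      # eigenvalues lambda_k, eigenvectors phi_k (columns)
    return (phi ** 2) @ np.exp(lam)   # one centrality value per node i

# Toy graph: a 4-cycle with one chord.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)
print(subgraph_centrality(A))
```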

41/56

Page 42: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Structural Recognition (4)

Catalog of Features (cont.)

- Spectra of the Laplacians: both the un-normalized and the normalized one. The Laplacian spectrum plays a fundamental role in the development of regularization kernels for graphs.

- Fiedler vector: the eigenvector corresponding to the first non-trivial eigenvalue of the Laplacian (\phi_2 in connected graphs). It encodes the connectivity structure of the graph (it is actually the core of graph-cut methods).

- Commute times, coming from either the un-normalized or the normalized Laplacian. They encode the path-length distribution of the graph.

- The heat-flow complexity trace, as seen in the section above.

42/56

Page 43: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Structural Recognition (5)

Backwards Filtering

- The entropies H(·) of a set with a large number n of features can be efficiently estimated using the k-NN-based method developed by Leonenko et al.

- Thus, we take the data set with all its features and determine which feature to discard in order to produce the smallest decrease of I(\vec{S}_{n-1}; C) (a sketch follows this list).

- We then repeat the process on the remaining feature set, until only one feature is left [Bonev et al., 10].

- Then, the subset yielding the minimum 10-fold CV error is selected.

- Most of the spectral features are histogrammed into a variable number of bins: 9 · 3 · (2 + 4 + 6 + 8) = 540 features.
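Putting the earlier sketches together, backward elimination driven by I(S; C) could look roughly like this (an illustration only; it reuses the hypothetical `knn_entropy` and `mutual_information_md` helpers sketched on previous slides, and the final subset would still be chosen by the minimum 10-fold CV error, as stated above):

```python
# Backward elimination sketch: repeatedly discard the feature whose removal
# causes the smallest decrease of I(S; C).
import numpy as np

def backward_elimination(X, labels, knn_entropy, k=5):
    remaining = list(range(X.shape[1]))
    eliminated = []                                   # features in elimination order
    while len(remaining) > 1:
        scores = [mutual_information_md(X[:, [f for f in remaining if f != j]],
                                        labels, knn_entropy, k=k)
                  for j in remaining]                 # I(S \ {x_j}; C) per candidate j
        j = remaining[int(np.argmax(scores))]         # removing j hurts I(S; C) the least
        eliminated.append(j)
        remaining.remove(j)
    return eliminated + remaining                     # full elimination order
```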

43/56

Page 44: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Experimental Results

Elements

- SHREC database (15 classes × 20 objects). Each of the 300 samples is characterized by 540 features and has a class label l ∈ {human, cup, glasses, airplane, chair, octopus, table, hand, fish, bird, spring, armadillo, bust, mechanic, four-leg}.

- The errors are measured by 10-fold cross-validation (10-fold CV). MI is maximized as the number of selected features grows, but a high number of features degrades the classification performance.

- However, the MI curve, as well as the selected features, does not depend on the classifier, since MI is a purely information-theoretic measure.

44/56

Page 45: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Structural Recognition (7)

Figure: Classification error vs Mutual Information.

45/56

Page 46: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Structural Recognition (8)

Figure: Feature selection on the 15-class experiment (left) and the feature statistics for the best-error feature set (right).

46/56

Page 47: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Structural Recognition (9)

Analysis

- For the 15-class problem, the minimal error (23.3%) is achieved with a set of 222 features.

- The coloured areas in the plot represent how much a feature is used with respect to the remaining ones (the height on the Y axis is arbitrary).

- For the 15-class experiment, in the feature sets smaller than 100 features the most important feature is the Fiedler vector, in combination with the remaining features. Commute time is also an important feature. Features that are not relevant include node centrality and the complexity flow. Turning our attention to the graph types, all three appear relevant.

47/56

Page 48: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Structural Recognition (10)

Analysis (cont)

- We can see that the four different binnings of the features do have importance for graph characterization.

- These conclusions concerning the relevance of each feature cannot be drawn without performing additional experiments with different groups of graph classes.

- We therefore perform four different 3-class experiments. The classes share some structural similarities; for example, the 3 classes of the first experiment all have a head and limbs.

- Although in each experiment the minimum error is achieved with very different numbers of features, the participation of each feature is highly consistent with the 15-class experiment.

48/56

Page 49: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Structural Recognition (11)

Figure: Feature selection on 3-class experiments: Human/Armadillo/Four-legged, Aircraft/Fish/Bird.

49/56

Page 50: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Structural Recognition (12)

Figure: Feature selection on 3-class experiments: Cup/Bust/Mechanic, Chair/Octopus/Table.

50/56

Page 51: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Structural Recognition (14)

Figure: Comparison of forward selection and backward elimination (CV error (%) vs. number of features).

51/56

Page 52: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Structural Recognition (15)

Figure: Bootstrapping experiments on forward (left) and backward (right) feature selection, with 25 bootstraps: test error (%) with its standard deviation, and the number of feature coincidences across bootstraps, vs. the number of selected features.

52/56

Page 53: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

Structural Recognition (16)

Conclusions

- The main difference among experiments is that node centrality seems to be more important for discerning among elongated objects with long, sharp characteristics. Although all three graph types are relevant, the sphere graph performs best for blob-shaped objects.

- Bootstrapping shows that the backward method is the better one.

- The IT approach not only yields good classification but also reveals the role of each spectral feature in it!

- Given other points of view (size functions, eigenfunctions, ...), it is desirable to exploit them too!

53/56

Page 54: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

References

- [Zhu et al., 97] Zhu, S.C., Wu, Y.N. and Mumford, D. (1997). Filters, Random Fields And Maximum Entropy (FRAME): towards a unified theory for texture modeling. IJCV, 27(2): 107–126.

- [Wang and Gotoh, 09] Wang, X. and Gotoh, O. (2009). Accurate molecular classification of cancer using simple rules. BMC Medical Genomics, 64(2).

- [Guyon and Elisseeff, 03] Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3: 1157–1182.

- [Vasconcelos and Vasconcelos, 04] Vasconcelos, N. and Vasconcelos, M. (2004). Scalable discriminant feature selection for image retrieval and recognition. In CVPR 2004, pages 770–775.

54/56

Page 55: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

References (2)

- [Peng et al., 2005] Peng, H., Long, F., and Ding, C. (2005). Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. on PAMI, 27(8): 1226–1238.

- [Leonenko et al., 08] Leonenko, N., Pronzato, L., and Savani, V. (2008). A class of Rényi information estimators for multidimensional densities. The Annals of Statistics, 36(5): 2153–2182.

- [Lozano et al., 09] Lozano, M.A., Escolano, F., Bonev, B., Suau, P., Aguilar, W., Saez, J.M., Cazorla, M. (2009). Region and constellation based categorization of images with unsupervised graph learning. Image and Vision Computing, 27(7): 960–978.

55/56

Page 56: CVPR2010: Advanced ITinCVPR in a Nutshell: part 3: Feature Selection

References (3)

- [Bonev et al., 08] Bonev, B., Escolano, F., and Cazorla, M. (2008). Feature selection, mutual information, and the classification of high-dimensional patterns. Pattern Analysis and Applications, 11(3-4): 304–319.

- [Biasotti, 05] Biasotti, S. (2005). Topological coding of surfaces with boundary using Reeb graphs. Computer Graphics and Geometry, 7(1): 31–45.

- [Luo et al., 03] Luo, B., Wilson, R., and Hancock, E. (2003). Spectral embedding of graphs. Pattern Recognition, 36(10): 2213–2223.

- [Bonev et al., 10] Bonev, B., Escolano, F., Giorgi, D., Biasotti, S. (2010). High-dimensional spectral feature selection for 3D object recognition based on Reeb graphs. SSPR 2010 (accepted).

56/56