Feature Selection for SVMs
J. Weston et al., NIPS 2000. 오장민 (2000/01/04)
Second reference: Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999


Page 1:

Feature Selection for SVMs

J. Weston et al., NIPS 2000. 오장민 (2000/01/04)

Second reference: Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999

Page 2:

Abstract

- A method of feature selection for Support Vector Machines (SVMs)
- An efficient wrapper method
- Superior to some standard feature selection algorithms

Page 3:

1. Introduction

- Importance of feature selection in supervised learning: generalization performance, running time requirements, and interpretational issues
- SVM: state-of-the-art in classification and regression
- Objective: to select a subset of features while preserving or improving the discriminative ability of a classifier

Page 4:

2. The Feature Selection Problem

Definition: Find the m << n features with the smallest expected generalization error, which is unknown and must be estimated.

Given a fixed set of functions y = f(x, α), find a preprocessing of the data x → (x ∗ σ), σ ∈ {0,1}^n, and the parameters α minimizing

$$\tau(\sigma, \alpha) = \int V\big(y, f((\mathbf{x} * \sigma), \alpha)\big)\, dF(\mathbf{x}, y)$$

subject to ||σ||_0 = m, where F(x, y) is unknown and V(·, ·) is a loss function.

Page 5:

Four Issues in Feature Selection

- Starting point: forward / backward / middle
- Search organization: exhaustive search (2^N) / heuristic search
- Evaluation strategy: filter (independent of any learning algorithm) / wrapper (uses the bias of a particular induction algorithm)
- Stopping criterion

from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999

Page 6:

Page 7:

Feature Filters

- Forward selection: only addition to the subset. Backward elimination: only deletion from the subset.
- Example: FOCUS [Almuallim and Dietterich 92] finds the minimum combination of features that divides the training data into pure classes.
- The feature minimizing entropy is the most discriminating feature (its value differs between positive and negative examples).

from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999

Page 8:

Wrappers

- Use an induction algorithm to estimate the merit of feature subsets.
- Take the inductive bias of the learner into account.
- Better results but slower (the induction algorithm is called repeatedly).

from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999

Page 9:

Correlation-based Feature Selection

- Relevant: a feature is relevant if its values vary systematically with category membership, i.e.

$$p(C = c \mid V_i = v_i) \neq p(C = c)$$

- Redundant: a feature is redundant if one or more of the other features are correlated with it.
- A good feature subset is one that contains features highly correlated with the class, yet uncorrelated with each other.

from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999

Page 10:

Pearson Correlation Coefficient

$$r_{zc} = \frac{k\,\overline{r_{zi}}}{\sqrt{k + k(k-1)\,\overline{r_{ii}}}}$$

z: outside variable, c: composite, k: number of components, r_zi: correlation between the components and the outside variable, r_ii: inter-correlation between the components.

- Higher r_zi gives higher r_zc
- Lower r_ii gives higher r_zc
- Larger k gives higher r_zc

from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999
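To make the formula concrete, here is a minimal Python sketch (not from the thesis) that scores a candidate feature subset by this merit; the function names cfs_merit and subset_merit, and the use of plain Pearson correlations for both the feature-class and feature-feature terms, are illustrative assumptions.

```python
import numpy as np

def cfs_merit(r_zi_mean, r_ii_mean, k):
    """Merit of a k-feature subset: higher average feature-class correlation (r_zi)
    and lower average inter-correlation (r_ii) give a higher score."""
    return (k * r_zi_mean) / np.sqrt(k + k * (k - 1) * r_ii_mean)

def subset_merit(X, y, subset):
    """Estimate the merit of a column subset of X using Pearson correlations."""
    feats = X[:, subset]
    k = feats.shape[1]
    # average absolute feature-class correlation
    r_zi = np.mean([abs(np.corrcoef(feats[:, i], y)[0, 1]) for i in range(k)])
    # average absolute pairwise feature-feature correlation
    if k > 1:
        pairs = [abs(np.corrcoef(feats[:, i], feats[:, j])[0, 1])
                 for i in range(k) for j in range(i + 1, k)]
        r_ii = np.mean(pairs)
    else:
        r_ii = 0.0
    return cfs_merit(r_zi, r_ii, k)
```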

Page 11:

2. The Feature Selection Problem (cont'd)

- Take advantage of the performance increase of wrapper methods while avoiding their computational complexity.

Page 12:

3. Support Vector Learning

- Idea: map x into a high-dimensional space Φ(x) and construct an optimal hyperplane in that space.
- The mapping Φ(·) is performed implicitly by a kernel function K(·, ·).
- The decision function given by the SVM, and the dual QP problem whose solution is the large-margin optimal hyperplane:

$$f(\mathbf{x}) = \mathbf{w} \cdot \Phi(\mathbf{x}) + b = \sum_i \alpha_i^0\, y_i\, K(\mathbf{x}, \mathbf{x}_i) + b$$

$$\max_{\alpha}\; W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$

$$\text{subject to}\quad \sum_{i=1}^{l} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C,\quad i = 1, \dots, l$$
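As a concrete illustration of the decision function above, the sketch below evaluates f(x) from a given dual solution. The helper names and the choice of a second-order polynomial kernel (the kernel used later for the nonlinear toy problem) are my own; solving the dual QP itself would normally be delegated to an SVM library or a QP solver.

```python
import numpy as np

def poly_kernel(x, z, degree=2):
    """Second-order polynomial kernel K(x, z) = (x . z + 1)^degree."""
    return (np.dot(x, z) + 1.0) ** degree

def decision_function(x, support_vectors, alphas, labels, b, kernel=poly_kernel):
    """Evaluate f(x) = sum_i alpha_i * y_i * K(x, x_i) + b from the dual solution."""
    return sum(a * y * kernel(x, sv)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b
```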

Page 13:

3. Support Vector Learning

Theorem 1. If images Φ(x_1), …, Φ(x_l) of training data of size l belong to a sphere of size R and are separable with the corresponding margin M, then the expectation of the error probability has the bound

$$E[P_{\mathrm{err}}] \;\le\; \frac{1}{l}\, E\!\left[\frac{R^2}{M^2}\right] \;=\; \frac{1}{l}\, E\!\left[R^2\, W^2(\alpha^0)\right],$$

where the expectation is taken over sets of training data of size l.

Theorem 1 justifies that the performance depends on both R and M, where R is controlled by the mapping function Φ(·).

Page 14:

4. Feature Selection for SVMs

Recall the objective: minimize

$$\tau(\sigma, \alpha) = \int V\big(y, f((\mathbf{x} * \sigma), \alpha)\big)\, dF(\mathbf{x}, y)$$

via σ and α. Enlarge the set of functions by folding the feature scaling into the classifier and the kernel:

$$f(\mathbf{x}, \mathbf{w}, b) \;\to\; f_{\sigma}(\mathbf{x}, \mathbf{w}, b) = \mathbf{w} \cdot \Phi(\mathbf{x} * \sigma) + b$$

$$K_{\sigma}(\mathbf{x}, \mathbf{y}) = K(\mathbf{x} * \sigma,\, \mathbf{y} * \sigma) = \Phi(\mathbf{x} * \sigma) \cdot \Phi(\mathbf{y} * \sigma)$$
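The scaled kernel K_σ is straightforward to express in code: the selection vector is folded into any base kernel by elementwise multiplication of its inputs. A minimal sketch, with an assumed wrapper function name:

```python
import numpy as np

def scaled_kernel(kernel, sigma):
    """Return K_sigma(x, y) = K(x * sigma, y * sigma), i.e. the original kernel
    evaluated on elementwise-scaled inputs."""
    sigma = np.asarray(sigma, dtype=float)
    def K_sigma(x, y):
        return kernel(np.asarray(x) * sigma, np.asarray(y) * sigma)
    return K_sigma

# Example: a linear kernel with the third feature switched off.
linear = lambda x, y: float(np.dot(x, y))
K_sigma = scaled_kernel(linear, sigma=[1.0, 1.0, 0.0])
print(K_sigma([1, 2, 3], [4, 5, 6]))   # 14.0: the third coordinate no longer contributes
```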

Page 15:

4. Feature Selection for SVMs

Minimize over σ:

$$R^2 W^2(\sigma) = R^2(\sigma)\; W^2(\alpha^0, \sigma),$$

where

$$R^2(\sigma) = \max_{\beta}\; \sum_i \beta_i K_{\sigma}(\mathbf{x}_i, \mathbf{x}_i) - \sum_{i,j} \beta_i \beta_j K_{\sigma}(\mathbf{x}_i, \mathbf{x}_j)$$

$$\text{subject to}\quad \sum_i \beta_i = 1, \qquad \beta_i \ge 0,\quad i = 1, \dots, l.$$

(Vapnik's statistical learning theory)
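R²(σ) is itself a small QP over β and can be solved numerically. The sketch below uses SciPy's SLSQP solver as one possible choice; it illustrates the maximization above and is not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def radius_squared(K):
    """R^2 of the smallest enclosing sphere in feature space, from Gram matrix K:
    maximize sum_i beta_i K_ii - sum_ij beta_i beta_j K_ij
    subject to sum_i beta_i = 1, beta_i >= 0."""
    l = K.shape[0]
    diag = np.diag(K)
    neg_objective = lambda beta: -(beta @ diag - beta @ K @ beta)
    res = minimize(neg_objective,
                   x0=np.full(l, 1.0 / l),
                   method="SLSQP",
                   bounds=[(0.0, None)] * l,
                   constraints={"type": "eq", "fun": lambda beta: beta.sum() - 1.0})
    return -res.fun
```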

Page 16:

4. Feature Selection for SVMs

- Finding the minimum of R²W² over σ requires searching over all possible subsets of n features, which is a combinatorial problem.
- Approximate algorithm: approximate the binary-valued vector σ ∈ {0,1}^n with a real-valued vector σ ∈ R^n, and optimize R²W² by gradient descent:

$$\frac{\partial R^2 W^2(\sigma)}{\partial \sigma_k} = W^2(\alpha^0, \sigma)\, \frac{\partial R^2(\sigma)}{\partial \sigma_k} + R^2(\sigma)\, \frac{\partial W^2(\alpha^0, \sigma)}{\partial \sigma_k}$$
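Putting the pieces together, a hedged sketch of the gradient-descent loop follows. The paper's algorithm uses the analytical derivative written above; for brevity this sketch approximates it by finite differences, and the callables fit_svm (returning α⁰ and W²), radius_squared, and gram (the Gram matrix under K_σ) are placeholders for components such as those sketched on the previous slides.

```python
import numpy as np

def r2w2(sigma, X, y, fit_svm, radius_squared, gram):
    """Evaluate the bound R^2(sigma) * W^2(alpha0, sigma) for one scaling vector."""
    K = gram(X * sigma)              # Gram matrix of the sigma-scaled data (K_sigma)
    alpha0, w2 = fit_svm(K, y)       # solve the SVM dual; returns alpha0 and W^2(alpha0)
    return radius_squared(K) * w2

def select_by_gradient_descent(X, y, fit_svm, radius_squared, gram,
                               steps=100, lr=0.05, eps=1e-3):
    """Minimize R^2 W^2 over a real-valued sigma by (finite-difference) gradient descent."""
    sigma = np.ones(X.shape[1])
    for _ in range(steps):
        base = r2w2(sigma, X, y, fit_svm, radius_squared, gram)
        grad = np.empty_like(sigma)
        for k in range(sigma.size):
            pert = sigma.copy()
            pert[k] += eps
            grad[k] = (r2w2(pert, X, y, fit_svm, radius_squared, gram) - base) / eps
        sigma = np.clip(sigma - lr * grad, 0.0, None)   # keep scalings non-negative
    return sigma    # afterwards, keep the m features with the largest sigma_k
```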

Page 17:

4. Feature Selection for SVMs

Summary

- Find the minimum of τ(σ, α) by minimizing the approximate integer programming problem

$$\min_{\sigma}\; R^2(\sigma)\, W^2(\alpha^0, \sigma) + \lambda \sum_i \sigma_i^p \qquad \text{subject to}\quad \sigma_i \ge 0,\ i = 1, \dots, n,$$

  using the fact that $\lim_{p \to 0} \sum_i \sigma_i^p$ counts the nonzero elements of σ.
- For λ large enough, as p → 0 only m elements of σ will be nonzero, approximating the original optimization problem τ(σ, α).

Page 18:

5. Experiments (Toy Data)

- Comparison: standard SVMs, the proposed feature selection algorithm, and three classical filter methods (Pearson correlation coefficients, Fisher criterion score, Kolmogorov-Smirnov test).
- Setup: linear SVM for the linear problem, second-order polynomial kernel for the nonlinear problem.
- The 2 best features were selected for comparison.

Page 19:

Fisher criterion score

$$F(r) = \left| \frac{\mu_r^{+} - \mu_r^{-}}{\sigma_r^{+2} + \sigma_r^{-2}} \right|,$$

where μ_r^± is the mean value of the r-th feature in the positive and negative classes and σ_r^± the corresponding standard deviations.

Kolmogorov-Smirnov test

$$KS_{tst}(r) = \sqrt{l}\; \sup \left| \hat{P}\{X \le f_r\} - \hat{P}\{X \le f_r,\, y_r = 1\} \right|,$$

where f_r denotes the r-th feature from each training example, and P̂ is the corresponding empirical distribution.
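A short sketch of the Fisher criterion score per feature, assuming labels y ∈ {+1, -1}; the vectorized form and function name are mine. A related per-feature two-sample Kolmogorov-Smirnov statistic could be obtained with scipy.stats.ks_2samp, though that variant compares the two class-conditional distributions rather than the exact statistic written above.

```python
import numpy as np

def fisher_scores(X, y):
    """Fisher criterion per feature: |mu_plus - mu_minus| / (var_plus + var_minus),
    computed over the positive (y == +1) and negative (y == -1) examples."""
    pos, neg = X[y == 1], X[y == -1]
    num = np.abs(pos.mean(axis=0) - neg.mean(axis=0))
    den = pos.var(axis=0) + neg.var(axis=0)
    return num / den
```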

Page 20:

5. Experiments (Toy Data)

Data

- Linear data: 6 of 202 features are relevant. The probabilities of y = 1 and y = -1 are equal. {x1, x2, x3} are drawn as x_i = y·N(i, 1) with probability 0.7 and x_i = N(0, 1) with probability 0.3; {x4, x5, x6} are drawn as x_i = y·N(i-3, 1) with probability 0.7 and x_i = N(0, 1) with probability 0.3. The other features are noise, x_i = N(0, 20). (A generation sketch follows after this list.)
- Nonlinear data: 2 of 52 features are relevant. If y = -1, {x1, x2} are drawn from N(μ1, Σ) or N(μ2, Σ) with equal probability, with Σ = I. If y = 1, {x1, x2} are drawn from two other normal distributions. The other features are noise, x_i = N(0, 20).
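For reference, a sketch generating the linear toy data as described above; the function name is mine, and N(0, 20) for the noise features is read here as mean 0 and standard deviation 20, which is one possible interpretation of the slide's notation.

```python
import numpy as np

def make_linear_toy_data(n_samples, n_features=202, seed=None):
    """Linear toy problem: 6 of 202 features carry signal, the rest are noise."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n_samples)                  # y = +/-1 with equal probability
    X = rng.normal(0.0, 20.0, size=(n_samples, n_features))  # noise features: N(0, 20)
    for j in range(6):
        mean = j + 1 if j < 3 else j - 2                     # x_i = y*N(i,1) or y*N(i-3,1)
        informative = rng.random(n_samples) < 0.7            # signal with probability 0.7
        X[:, j] = np.where(informative,
                           y * rng.normal(mean, 1.0, n_samples),
                           rng.normal(0.0, 1.0, n_samples))
    return X, y
```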

Page 21:

5. Experiments (Toy Data)

Figure: (a) linear problem, (b) nonlinear problem, both with many irrelevant features. The x-axis is the number of training points, and the y-axis the test error as a fraction of test points.

Page 22:

5. Experiments (Real-Life Data)

Comparison:

(1) Minimizing R²W²

(2) Rank-ordering features according to how much they change the decision boundary, removing those that cause the least change (criterion I(r) below)

(3) Fisher criterion score

$$I(r) = \sum_{i=1}^{N_{sv}} \sum_{j=1}^{N_{sv}} \alpha_i \alpha_j y_i y_j \left( K(\mathbf{x}_i, \mathbf{x}_j) - K(\mathbf{x}_i^r, \mathbf{x}_j^r) \right),$$

where $\mathbf{x}^r$ denotes a training example with the r-th feature removed.

Page 23:

5. Experiments (Real-Life Data)

Data

- Face detection: training set 2,429/13,229, test set 104/2M, dimension 1,740
- Pedestrian detection: training set 924/10,044, test set 124/800K, dimension 1,326
- Cancer morphology classification: training 38 / test 34, dimension 7,129; 1 error using all genes; 0 errors for (1) using 20 genes, 5 errors for (2), 3 errors for (3)

Page 24:

5. Experiments (Real-Life Data)

Figure: (a) ROC curves for face detection; the top curves are for 725 features and the bottom one for 120 features. (b) ROC curves for pedestrian detection using all features and 120 features. Solid: all features; solid with a circle: (1); dotted and dashed: (2); dotted: (3).

Page 25:

6. Conclusion

- Introduced a wrapper method to perform feature selection for SVMs.
- Computationally feasible for high-dimensional datasets compared to existing wrapper methods.
- Superior to some filter methods.