Max-Margin Classification of Data with Absent Features
Presented by Chunping Wang
Machine Learning Group, Duke University
July 3, 2008
by Chechik, Heitz, Elidan, Abbeel and Koller, JMLR 2008
Outline
• Introduction
• Standard SVM
• Max-Margin Formulation for Missing Features
• Three Algorithms
• Experimental Results
• Conclusions
Introduction (1)
Pattern of missing features:
• due to measurement noise or corruption: existing but unknown
• due to the inherent properties of the instances: non-existing
Example 1: Two subpopulations of instances (animals and buildings) with few overlapping features (body parts, architectural aspects).
Example 2: In a web-page task, one useful feature of a given page may be the most common topic of other sites that point to it; however, this particular page may have no such parents.
Introduction (2)
Common methods for handling missing features:
(Assume the features exist but their values are unknown)
• Single imputation: zeros, mean, kNN
• Imputation by building probabilistic generative models
Proposed method (assume the features are structurally absent):
Each data instance resides in a lower dimensional subspace of the feature space, determined by its own existing features. We try to maximize the worst-case margin of the separating hyperplane, while measuring the margin of each data instance in its own lower-dimensional subspace.
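Standard SVM (1)
Binary classification
Real-valued predictors: $x_i \in \mathbb{R}^d$
Binary response: $y_i \in \{-1, 1\}$
A classifier could be defined as $y = \operatorname{sign}[f(x)]$, based on a linear function $f(x) = w^T x + b$
Parameters: $(w, b) \in \mathbb{R}^{d+1}$
[Figure: the separating hyperplane $f(x) = 0$, with normal vector $w$, at distance $|b| / \|w\|$ from the origin]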
Standard SVM (2)
Functional margin for each instance: $\hat{\rho}_i(w) = y_i (w^T x_i + b)$
Geometric margin for each instance: $\rho_i(w) = y_i (w^T x_i + b) / \|w\|$
Geometric margin of a hyperplane $(w, b)$: $\rho(w) = \min_i \rho_i(w) = \min_i y_i (w^T x_i + b) / \|w\|$
SVM: $\max_w \rho(w)$, by fixing the functional margin to 1, i.e., $\min_i y_i (w^T x_i + b) = 1$
$\xi_i$'s: slack variables
C: cost
Quadratic Programming (QP)
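For reference, the slack variables and the cost C belong to the soft-margin problem, which in its usual textbook form reads as below. The exact expression shown on the slide is not preserved in this transcript, so this form is an assumption.

```latex
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad y_i\,(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \forall i
```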
Max-Margin Formulation for Missing Features (1)
A 2-D case with missing data
$\rho_1$: margin in the subspace
$\rho_2$: margin in the full feature space
$\rho_2 < \rho_1$
The margin of instances with missing features is underestimated when it is measured in the full feature space.
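A small numeric sketch of this 2-D case may help. It compares the margin of one instance measured in the subspace of its present features with the margin measured in the full feature space. The function name `margins` and the example numbers are illustrative assumptions, not taken from the slides.

```python
import math

def margins(w, b, x, y, present):
    """Return (subspace_margin, full_space_margin) for one instance.

    w, x: lists of equal length; present: bools marking the existing features.
    Absent entries of x are skipped in the score. Assumes at least one present
    feature has a non-zero weight, otherwise the subspace norm is zero.
    """
    score = y * (sum(wk * xk for wk, xk, p in zip(w, x, present) if p) + b)
    norm_sub = math.sqrt(sum(wk * wk for wk, p in zip(w, present) if p))
    norm_full = math.sqrt(sum(wk * wk for wk in w))
    return score / norm_sub, score / norm_full

# 2-D example: the second feature is absent for this instance.
sub, full = margins(w=[1.0, 1.0], b=0.0, x=[2.0, 0.0], y=1, present=[True, False])
```

Because the full-space norm of `w` can never be smaller than the norm restricted to the present features, the full-space value is at most the subspace value whenever the score is positive.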
Max-Margin Formulation for Missing Features (2)
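Instance margin: $\rho_i(w) = y_i\, w^{(i)} x_i / \|w^{(i)}\|$, where $w^{(i)}$ keeps only the entries of $w$ for the features present in $x_i$
Optimization problem: $\max_w \min_i \rho_i(w)$
• The instance margin is non-convex in $w$
• $\|w^{(i)}\|$ is instance dependent and thus cannot be taken out of the minimization
It is difficult to solve this optimization problem directly.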
Three Algorithms (1)
• A convex formulation for the linearly separable case
Introduce a lower bound $\gamma$ for the instance margins: $y_i\, w^{(i)} x_i \ge \gamma \|w^{(i)}\|$ for all $i$.
For a given $\gamma$, this is a second order cone program (SOCP), which is convex and can be solved efficiently.
To find the optimal $\gamma$, do a bisection search over $\gamma \in \mathbb{R}$.
Unfortunately, extending it to the non-separable case is difficult.
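The bisection search over the margin lower bound can be sketched as follows. The callable `is_feasible` is a hypothetical stand-in for one SOCP feasibility solve at a fixed bound; no real solver is called here. The sketch assumes feasibility is monotone, which holds because a separator that achieves some margin also achieves every smaller one.

```python
def bisect_largest_feasible(is_feasible, lo, hi, tol=1e-6):
    """Largest value in [lo, hi], to within tol, for which is_feasible holds.

    Assumes is_feasible(lo) is True, is_feasible(hi) is False, and that any
    bound below a feasible bound is feasible as well.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_feasible(mid):
            lo = mid   # this margin is achievable: try a larger one
        else:
            hi = mid   # not achievable: shrink the upper end
    return lo

# Stand-in oracle (hypothetical): pretend the problem is feasible up to 0.73.
best = bisect_largest_feasible(lambda g: g <= 0.73, lo=0.0, hi=1.0)
```

Each iteration halves the interval, so about 20 feasibility solves are enough to reach a tolerance of 1e-6 on a unit interval.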
Three Algorithms (2)
Average norm: a convex approximation for the non-separable case
Define the average of $\|w^{(i)}\|$ over all instances and use it in place of each $\|w^{(i)}\|$: this gets rid of the instance dependence.
Three Algorithms (3)
Geometric margin: an exact non-convex approach for the non-separable case
Define per-instance scaling factors $s_i$.
The non-separable problem is a QP for a given set of $s_i$'s.
Three Algorithms (4)
Pseudo-code
Geometric margin: the exact non-convex approach for non-separable case
The convergence is not always guaranteed. Cross validation is used to choose an early stopping point.
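The pseudo-code itself is not preserved in this transcript. One plausible reading, given that the problem is a QP once the per-instance scaling factors are fixed, is an alternation between solving the QP and recomputing the factors. The sketch below assumes the factor for instance i is the norm of w restricted to that instance's present features, divided by the norm of the full w; that definition is recalled from the paper and should be treated as an assumption. `solve_qp` is a hypothetical caller-supplied routine, not a real library call.

```python
import math

def scaling_factors(w, present):
    """For each instance: ||w restricted to its present features|| / ||w||."""
    norm_full = math.sqrt(sum(wk * wk for wk in w))
    return [
        math.sqrt(sum(wk * wk for wk, p in zip(w, row) if p)) / norm_full
        for row in present
    ]

def alternate(solve_qp, present, n_iter=5):
    """Alternate between solving for (w, b) with s fixed and updating s from w.

    solve_qp(s) is a hypothetical routine returning (w, b) for fixed factors s.
    Every iterate is kept so that an early stopping point can be picked
    afterwards, for example by cross validation.
    """
    s = [1.0] * len(present)   # start with no rescaling of any instance
    history = []
    for _ in range(n_iter):
        w, b = solve_qp(s)
        s = scaling_factors(w, present)
        history.append((w, b, s))
    return history
```

Keeping the whole history rather than only the last iterate reflects the point above that convergence is not guaranteed, so the final iterate is not necessarily the one to use.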
Experimental Results (1)
Zero. Missing values were set to zero.
Mean. Missing values were set to the average value of the feature over all data.
Flag. Additional features (“flags”) were added, explicitly denoting whether a feature is missing for a given instance.
kNN. Missing features were set to the mean value obtained from the K nearest neighbor instances.
EM. A Gaussian mixture model (GMM) is learned by iterating between (1) learning a GMM of the filled data and (2) re-filling missing values using cluster means, weighted by the posterior probability that a cluster generated the sample.
Averaged norm (avg |w|). Proposed approximate convex approach.
Geometric margin (geom). Proposed exact non-convex approach.
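The three simplest baselines (Zero, Mean, Flag) can be sketched in plain Python as below. The function names and the list-of-lists data layout are illustrative assumptions. Two details are also assumed: the Mean fill averages only the observed values of each feature, and the Flag variant zero-fills the missing values before appending the flags.

```python
def impute(X, present, method):
    """X: rows of floats; present: matching rows of bools (False = missing)."""
    n_feat = len(X[0])
    if method == "zero":
        fill = [0.0] * n_feat
    elif method == "mean":
        fill = []
        for k in range(n_feat):
            seen = [row[k] for row, m in zip(X, present) if m[k]]
            # Fall back to 0.0 if a feature is never observed.
            fill.append(sum(seen) / len(seen) if seen else 0.0)
    else:
        raise ValueError(method)
    return [[v if p else f for v, p, f in zip(row, m, fill)]
            for row, m in zip(X, present)]

def add_flags(X, present):
    """Zero-fill, then append one 0/1 flag per feature (1 = feature missing)."""
    filled = impute(X, present, "zero")
    return [row + [0.0 if p else 1.0 for p in m]
            for row, m in zip(filled, present)]
```

The Flag variant doubles the dimensionality, which lets a linear classifier learn a separate offset for each missing feature.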
Experimental Results (2)
UCI data sets (missing at random): remove 90% of the features of each sample randomly.
Digits 5 & 6 from MNIST: remove a patch covering 25% of the pixels, with the location of the patch uniformly sampled.
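The two removal schemes can be sketched as mask generators. The function names are illustrative. The patch is assumed to be square, and a 14-by-14 patch on a 28-by-28 MNIST image is used because it covers exactly 25% of the pixels; the slides do not state the patch shape.

```python
import random

def random_feature_mask(n_features, frac_removed=0.9, rng=random):
    """True = feature kept; a random frac_removed of the features is dropped."""
    n_drop = round(frac_removed * n_features)
    dropped = set(rng.sample(range(n_features), n_drop))
    return [k not in dropped for k in range(n_features)]

def random_patch_mask(side=28, patch=14, rng=random):
    """True = pixel kept; a patch-by-patch square is dropped.

    The top-left corner is drawn uniformly among positions that keep the
    square fully inside the image.
    """
    top = rng.randrange(side - patch + 1)
    left = rng.randrange(side - patch + 1)
    return [[not (top <= r < top + patch and left <= c < left + patch)
             for c in range(side)] for r in range(side)]
```

Passing a seeded `random.Random` instance as `rng` makes the masks reproducible across runs.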
Experimental Results (3)
Visual object recognition
Task: to determine whether an automobile is present in a given image.
Feature extraction pipeline:
Local edge information → generative model → likelihood of patches to match each of 19 landmarks → set a threshold → (up to 10) candidate patches (21-by-21 pixels) for each landmark → PCA → first 10 principal components for each patch → concatenate → a feature vector (up to 1900 features)
If the number of candidates for a given landmark is fewer than ten, we consider the rest to be structurally absent.
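The assembly of the feature vector, 19 landmarks times 10 candidate slots times 10 principal components, can be sketched as follows. The function name, the nested-list input layout, and the 0.0 placeholder stored in absent slots are illustrative assumptions; the accompanying mask, not the stored value, is what marks a slot as structurally absent.

```python
N_LANDMARKS, MAX_CANDIDATES, N_COMPONENTS = 19, 10, 10

def build_feature_vector(candidates):
    """candidates[l]: list of up to 10 descriptors (10 floats each) for landmark l.

    Returns (x, present): a 1900-long vector and a mask of the same length.
    Slots for candidates that were not found are marked absent.
    """
    assert len(candidates) == N_LANDMARKS
    x, present = [], []
    for landmark in candidates:
        for j in range(MAX_CANDIDATES):
            if j < len(landmark):
                x.extend(landmark[j])
                present.extend([True] * N_COMPONENTS)
            else:
                x.extend([0.0] * N_COMPONENTS)   # placeholder, never used
                present.extend([False] * N_COMPONENTS)
    return x, present
```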
Experimental Results (4)
An example image: the best 5 candidates matched to the front windshield landmark
Experimental Results (5)
Experimental Results (6)
Metabolic pathway reconstruction
A fragment of the full metabolic pathway network
Arrows: chemical reactions
Purple boxed names: enzymes
Experimental Results (7)
Three types of neighborhood relations between enzyme pairs:
Linear chains (ARO7, PHA2)
Forks (TRP2, ARO7): same input, different outputs
Funnels (ARO9, PHA2): same output, different inputs
One feature vector (represents an enzyme)
Features for linear chain neighbor
Features for fork neighbor
Features for funnel neighbor
A feature vector will have structurally missing entries if the enzyme does not have all types of neighbors, e.g., PHA2 does not have a neighbor of type fork.
Experimental Results (8)
Task: to identify whether a candidate enzyme is in the right “neighborhood”.
Data creation:
Positive samples: from the reactions with known enzymes (in the right “neighborhood”);
Negative samples: for each positive sample, replace the true enzyme with a random impostor, and calculate the features in such a wrong “neighborhood”. The impostor was uniformly chosen from the set of other enzymes.
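The impostor selection for negative samples can be sketched as below. Only the uniform choice among the other enzymes is shown; computing the features in the resulting wrong neighborhood is omitted. The function name is an illustrative assumption.

```python
import random

def pick_impostor(true_enzyme, all_enzymes, rng=random):
    """Pick an impostor uniformly from the enzymes other than the true one."""
    others = [e for e in all_enzymes if e != true_enzyme]
    return rng.choice(others)

# Example with the enzyme names that appear on the slides.
impostor = pick_impostor("PHA2", ["ARO7", "PHA2", "TRP2", "ARO9"])
```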
Experimental Results (9)
Conclusions
1. The authors presented a modified SVM model for max-margin training of classifiers in the presence of missing features, where the pattern of missing features is an inherent part of the domain.
2. The authors directly classified instances by skipping the non-existing features, rather than filling them with hypothetical values.
3. The proposed model was competitive with a range of single imputation approaches when tested in missing-at-random (MAR) settings.
4. One variant (geometric margin) significantly outperformed other methods in two real problems with non-existing features.