
Page 1:

Semi-supervised and Unsupervised Feature Scaling

Boris Babenko

Department of Computer Science and Engineering

University of California, San Diego

Page 2:

Abstract

Feature selection is an important problem in machine learning. In high-dimensional data it is critical to weed out noisy features, which have no discriminatory power, before applying standard learning algorithms. This problem has been studied extensively in the supervised setting, but the literature on semi-supervised and unsupervised feature selection is sparse. Furthermore, while feature selection algorithms pick a discrete subset of features, it is a more challenging and interesting problem to assign each feature a weight based on its discriminatory power. In this sense, feature selection can be thought of as a sub-problem of feature scaling in which the weights are binary. In this project I propose two simple feature scaling algorithms: one for the semi-supervised setting and one for the unsupervised setting.

Page 3:

Introduction

Although feature scaling/selection is reminiscent of dimensionality reduction, it is a fundamentally different problem. Dimensionality reduction methods such as PCA search for directions that explain most of the variance in the data, which is not appropriate for feature selection: there is no clear correlation between variance and discriminatory power (see Fig. 1). Moreover, methods like PCA return an entirely new set of directions, rather than a subset or scaling of the original features.

Figure 1: The y axis explains most of the variance in the data. Nevertheless, the x axis clearly has more power to discriminate the three classes.

Page 4:

Previous Work

Supervised Feature Selection

Filter methods: do not depend on the classifier that is subsequently used; they rely purely on the data. Methods include the popular RELIEF (Kira and Rendell) and FOCUS (Almuallim and Dietterich) algorithms.

Wrapper methods: “wrap” around the classifier algorithm that is used; the classifier is used to evaluate particular feature subsets. The popular AdaBoost (Freund and Schapire) method is analogous to such a method if one considers weak classifiers that are decision stumps (each working on only one feature).

Many of these methods perform feature selection by first computing feature scales and then applying some sort of threshold.

Unsupervised Feature Selection

This field has been sparsely studied. Existing approaches include genetic algorithms that have no performance guarantees, and methods that make assumptions about the class distributions (i.e., they assume the classes are either Gaussian or multinomial).

Semi-supervised Feature Selection

Semi-supervised feature selection remains largely untouched territory. Semi-supervised learning has become increasingly popular in recent years, but surprisingly there is practically no literature on feature selection in this context.

Page 5:

Illustrations

[Three scatter plots of features x1 and x2, one per panel: Supervised Setting, Semi-supervised Setting, Unsupervised Setting]

Figure 2: An illustration of the three different learning settings.

Page 6:

A Simple Algorithm for Feature Scaling

A simple greedy algorithm for feature selection starts with an empty set and keeps adding features that improve some sort of criterion function score.

The running time of the above algorithm is O(Nn) where n is the number of features, and N is the size of the feature subset.
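As a rough illustration (not the project's actual code), a minimal sketch of this greedy forward-selection loop might look as follows, assuming an arbitrary criterion function such as cross-validated accuracy:

import numpy as np

def greedy_forward_selection(X, y, criterion, num_features):
    """Greedily grow a feature subset, each round adding the feature that most
    improves the (assumed) criterion function. Illustrative sketch only."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(num_features):                    # N rounds
        best_j, best_score = None, -np.inf
        for j in remaining:                          # up to n candidates per round
            score = criterion(X[:, selected + [j]], y)
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected                                  # O(N * n) criterion evaluations overall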

Very expensive computationally, and it is not clear how this could be extended to feature scaling.

Instead, score each feature independently.
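As a purely hypothetical sketch (this helper is not part of the project), once each feature has an independent score, turning the scores into a feature scaling can be as simple as normalizing them and multiplying each column by its weight; normalizing to sum to one would, for instance, yield weights like the 0.25 values reported for the synthetic data in Figure 3.

import numpy as np

def scale_features(X, scores):
    """Rescale each column of X by a normalized per-feature score (hypothetical helper)."""
    w = np.asarray(scores, dtype=float)
    w = w / w.sum()            # one simple choice: normalize the weights to sum to one
    return X * w               # broadcasting multiplies each column by its weight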

Page 7:

Estimating the discriminative power of a feature

Semi-supervised case

Note: for this study, assume k = 2 and that there is an equal number of +1-labeled points and -1-labeled points. Project the data (both labeled and unlabeled) onto feature j and use k-means to cluster it. Let C1 and C2 be the two clusters, Ci^p be the set of +1 points in cluster i, and Ci^n be the set of -1 points in cluster i.
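The exact scoring formula from the original slide is not reproduced here, so the following is only a hedged sketch under the stated assumptions (k = 2, equal numbers of +1 and -1 labels), using a simple label-purity score over the two clusters; the function name and the purity measure are illustrative, not necessarily the author's definition.

import numpy as np
from sklearn.cluster import KMeans

def semi_supervised_feature_score(X, y, j):
    """Score feature j: cluster its 1-D projection with k-means (k = 2) and measure
    how purely the labeled points separate across the two clusters.
    y holds +1 / -1 for labeled points and 0 for unlabeled points."""
    proj = X[:, [j]]                                   # project all data onto feature j
    clusters = KMeans(n_clusters=2, n_init=10).fit_predict(proj)
    n_labeled = np.sum(y != 0)
    score = 0.0
    for i in (0, 1):
        n_pos = np.sum((clusters == i) & (y == +1))    # |Ci^p|
        n_neg = np.sum((clusters == i) & (y == -1))    # |Ci^n|
        score += abs(n_pos - n_neg)                    # reward label-pure clusters
    return score / max(n_labeled, 1)                   # 1 = pure split, 0 = fully mixed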

Unsupervised case

Use a stability measure: project the data onto feature j, run k-means N times, and record the positions of the centroids. Let c1, c2, ..., cN be the vectors of centroid positions for each run.
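Again as a hedged sketch: the slide only states that k-means is run repeatedly and the centroid positions are recorded, so the particular stability statistic below (inverse variance of the sorted centroids across runs) is an assumption made for illustration.

import numpy as np
from sklearn.cluster import KMeans

def unsupervised_feature_score(X, j, n_runs=10, k=2):
    """Score feature j by how stable the k-means centroids are over repeated runs
    on its 1-D projection (assumed stability measure; illustrative only)."""
    proj = X[:, [j]]
    centroids = []
    for _ in range(n_runs):
        km = KMeans(n_clusters=k, n_init=1, init="random").fit(proj)
        centroids.append(np.sort(km.cluster_centers_.ravel()))   # sort to align clusters
    centroids = np.array(centroids)               # shape: (n_runs, k)
    spread = centroids.var(axis=0).mean()         # how much each centroid moves across runs
    return 1.0 / (spread + 1e-12)                 # stable centroids -> high score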

Page 8:

Testing on a Synthetic Dataset

The first evaluation of the algorithms was done on a synthetic data set. The dataset contained two classes, with four valuable features and two features of pure noise. The data set is shown in Figure 3.

Feature scores (one value per feature; the last two are the noise features):

Relief:                     0.0086    0.2647    0.5809    0.1458    0.0000    0.0000
Semi-supervised Algorithm:  0.2500    0.2500    0.2500    0.2500    0         0
Unsupervised Algorithm:     25.4548   139.6496  0.0191    165.1732  0         0

Figure 3: Synthetic data set

Page 9:

Semi-supervised Case: Experimental Results on OCR Data

Recognizing characters ‘4’ and ‘9’
Recognizing characters ‘7’ and ‘9’

Figure 4: The semi-supervised algorithm was tested on the OCR dataset with the SVM classifier¹. First, the performance of the SVM was measured on the original data. Then, the data was passed through RELIEF (as described in [1]), a popular supervised feature selection algorithm, and the performance of the SVM was measured again. Lastly, the data was passed through the semi-supervised feature selection algorithm described here, and the performance of the SVM was measured. For each training set size, 100 trials were run and the mean error was computed.

¹ The SVMlight package, by Joachims et al., was used.

Page 10:

Semi-supervised Case: Visualizing Results

[Grid of feature-weight images. Rows: Semi-supervised Feature Selection; Supervised Feature Selection (Relief Algorithm). Columns: two classes ‘4’ VS ‘9’ and ‘7’ VS ‘9’, each with 20 and 100 training points.]

Figure 5: A nice property of working with the OCR data set is that each feature corresponds to a pixel, so it is easy to visualize which pixels are important for discriminating between certain digits. The images above visualize the feature weights (dark pixels correspond to highly weighted features).

Page 11:

Unsupervised Case: Experimental Results on OCR Data

                                  ‘4’ VS ‘9’   ‘7’ VS ‘9’   ‘5’ VS ‘6’
% Error with feature scaling       25.1932      33.1128      25.9585
% Error without feature scaling    46.2491      40.8571       8.3823

Figure 6: Unsupervised algorithm results

Page 12:

Limitations & Future Work

Limitations:
- Assumes features are independent.
- The unsupervised algorithm seems to worsen classification results when the two classes are well separable.

Possible extensions:
- Locate highly correlated features as a pre-processing step.
- Use some sort of information-theoretic approach to evaluating the discriminatory power of features.

Page 13:

References