Feature Selection, Dimensionality Reduction, and Clustering

Summer Course: Data Mining. Feature Selection, Dimensionality Reduction, and Clustering. Presenter: Georgi Nalbantov. August 2009.

TRANSCRIPT

Page 1: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection, Dimensionality Reduction, and Clustering

Presenter: Georgi Nalbantov

Summer Course: Data Mining

August 2009

Page 2: Feature Selection,  Dimensionality Reduction, and Clustering

Structure

Feature Selection: filtering approach, wrapper approach, embedded methods

Clustering: density estimation and clustering, K-means clustering, hierarchical clustering, clustering with Support Vector Machines (SVMs)

Dimensionality Reduction: Principal Components Analysis (PCA), nonlinear PCA (Kernel PCA, CatPCA), Multi-Dimensional Scaling (MDS), Homogeneity Analysis

Page 3: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection, Dimensionality Reduction, and Clustering in the KDD Process

U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth (1995)

Page 4: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection

In the presence of millions of features/attributes/inputs/variables, select the most relevant ones.

Advantages: build better, faster, and easier-to-understand learning machines.

[Figure: data matrix X with n samples and m features, reduced to m' selected features]

Page 5: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection

Goal: select the two best features individually

Any reasonable objective J will rank the features as J(x1) > J(x2) = J(x3) > J(x4). Thus, the chosen pair is [x1, x2] or [x1, x3]. However, x4 is the only feature that provides complementary information to x1.

Page 6: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection

Filtering approach: ranks features or feature subsets independently of the predictor (classifier).

…using univariate methods: consider one variable at a time
…using multivariate methods: consider more than one variable at a time

Wrapper approach: uses a classifier to assess (many) features or feature subsets.

Embedded approach: uses a classifier to build a (single) model with a subset of features that are internally selected.
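To make the three approaches concrete, here is a minimal sketch in Python with scikit-learn; the synthetic dataset, the choice of estimators, and all parameter values are illustrative assumptions rather than part of the original slides.

# Filter, wrapper, and embedded feature selection: a minimal sketch (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

# Filter: rank features independently of the predictor (here: univariate F-test scores).
filt = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# Wrapper: use a classifier to assess feature subsets (recursive feature elimination).
wrap = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)

# Embedded: the classifier selects features internally via an L1 penalty.
emb = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1)).fit(X, y)

print(filt.get_support().sum(), wrap.get_support().sum(), emb.get_support().sum())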

Page 7: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: univariate filtering approach

[Figure: class-conditional densities P(X_i | Y) of a single feature x_i for the two classes]

Issue: determine the relevance of a given single feature.

Page 8: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: univariate filtering approach

Issue: determine the relevance of a given single feature.

Under independence: P(X, Y) = P(X) P(Y)

Measure of dependence (mutual information):

MI(X, Y) = \int \int P(X, Y) \log \frac{P(X, Y)}{P(X) P(Y)} \, dX \, dY = KL( P(X, Y) || P(X) P(Y) )
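A minimal sketch of mutual-information feature ranking in Python; scikit-learn's MI estimator and the synthetic data are assumptions made for the example.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=0)
mi = mutual_info_classif(X, y, random_state=0)   # estimated MI(X_i, Y) for each feature
ranking = np.argsort(mi)[::-1]                   # highest mutual information first
print(ranking[:5])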

Page 9: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: univariate filtering approach

Correlation and MI. Note: correlation is a measure of linear dependence.

Page 10: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: univariate filtering approach

Correlation and MI under the Gaussian distribution
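The figure for this slide is not reproduced in the transcript. For reference, for two jointly Gaussian variables with correlation coefficient rho the two quantities are directly related (in nats):

MI(X, Y) = -\tfrac{1}{2} \log( 1 - \rho^2 )

so under the Gaussian assumption ranking features by |correlation| or by mutual information gives the same order.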

Page 11: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: univariate filtering approach. Criteria for measuring dependence.

Page 12: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: univariate filtering approach

[Figure: class-conditional densities P(X_i | Y = 1) and P(X_i | Y = -1) of a feature x_i, with the positive (+) and negative (-) examples shown along the x_i axis. Legend: Y = 1, Y = -1]

Page 13: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: univariate filtering approach

[Figure: two class-conditional density plots for a feature x_i. Legend: Y = 1, Y = -1]

P(X_i | Y = 1) = P(X_i | Y = -1): the feature is irrelevant for separating the classes.
P(X_i | Y = 1) ≠ P(X_i | Y = -1): the feature carries information about the class.

Page 14: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: univariate filtering approach

[Figure: positive (+) and negative (-) examples of feature x_i with their class means; is this distance significant?]

T-test

• Normally distributed classes with equal but unknown variance σ², estimated from the data as σ²_within.

• Null hypothesis H0: μ+ = μ-

• T statistic: if H0 is true, then t = (μ+ - μ-) / ( σ_within · sqrt( 1/m+ + 1/m- ) ) follows a Student t distribution with m+ + m- - 2 degrees of freedom.
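A minimal sketch of t-test-based feature ranking, assuming SciPy and synthetic two-class data (the informative features are planted for illustration).

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)      # two classes, labels 0/1
X[y == 1, :3] += 1.0                  # shift the first 3 features for the positive class

t_stat, p_val = ttest_ind(X[y == 1], X[y == 0], axis=0, equal_var=True)
ranking = np.argsort(p_val)           # smallest p-value = most significant mean difference
print(ranking[:5])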

Page 15: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: multivariate filtering approach

Guyon-Elisseeff, JMLR 2004; Springer 2006

Page 16: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: search strategies

N features, 2^N possible feature subsets!

Kohavi-John, 1997

Page 17: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: search strategies

Forward selection or backward elimination.

Beam search: keep the k best paths at each step.

GSFS: generalized sequential forward selection – when (n-k) features are left, try all subsets of g features. More trainings at each step, but fewer steps.

PTA(l,r): plus l, take away r – at each step, run SFS l times then SBS r times.

Floating search: One step of SFS (resp. SBS), then SBS (resp. SFS) as long as we find better subsets than those of the same size obtained so far.
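As a concrete illustration of sequential (wrapper-style) search, here is a minimal greedy forward-selection sketch in Python with scikit-learn; the estimator, the cross-validation scoring, and the stopping budget of 5 features are assumptions for the example.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=15, n_informative=4, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
while len(selected) < 5:                          # arbitrary budget of 5 features
    best_score, best_j = -np.inf, None
    for j in remaining:                           # try adding each remaining feature
        cols = selected + [j]
        score = cross_val_score(LogisticRegression(max_iter=1000), X[:, cols], y, cv=5).mean()
        if score > best_score:
            best_score, best_j = score, j
    selected.append(best_j)
    remaining.remove(best_j)
    print(f"added feature {best_j}, CV accuracy {best_score:.3f}")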

Page 18: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: filters vs. wrappers vs. embedding

Main goal: rank subsets of useful features

Page 19: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: feature subset assessment (wrapper)

1) For each feature subset, train predictor on training data.

2) Select the feature subset which performs best on validation data. Repeat and average if you want to reduce variance (cross-validation).

3) Test on test data.

[Figure: data matrix with M samples (rows) and N variables/features (columns), split row-wise into three subsets m1, m2, m3]

Split the data into 3 sets: training, validation, and test set.

Danger of over-fitting with intensive search!
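A minimal sketch of the three-way split with scikit-learn; the proportions and the dataset are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
# First hold out the test set, then split the rest into training and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
# Train each candidate feature subset on (X_train, y_train), pick the best on (X_val, y_val),
# and report the final performance once on (X_test, y_test).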

Page 20: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection via Embedded Methods: L1-regularization

[Figure: coefficient paths plotted against sum(|beta|), the L1 norm of the coefficient vector]
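A minimal sketch of L1-regularized (embedded) selection with scikit-learn's Lasso; the regression dataset and the regularization strength alpha are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5, noise=1.0, random_state=0)
lasso = Lasso(alpha=0.5).fit(X, y)        # the L1 penalty drives many coefficients to exactly zero
selected = np.flatnonzero(lasso.coef_)    # indices of the features with non-zero weight
print(selected)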

Page 21: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: summary

Univariate, linear: T-test, AUC, feature ranking

Univariate, non-linear: mutual information feature ranking

Multivariate, linear: RFE with linear SVM or LDA

Multivariate, non-linear: Nearest Neighbors, Neural Nets, Trees, SVM

Page 22: Feature Selection,  Dimensionality Reduction, and Clustering

Dimensionality Reduction

In the presence of many features, select the most relevant subset of (weighted) combinations of features.

Feature selection: X_1, …, X_m → X_{k_1}, …, X_{k_p} (a subset of the original features)

Dimensionality reduction: X_1, …, X_m → f_1(X_1, …, X_m), …, f_p(X_1, …, X_m) (new features built as combinations of the original ones)

Page 23: Feature Selection,  Dimensionality Reduction, and Clustering

Dimensionality Reduction: (Linear) Principal Components Analysis

PCA finds a linear mapping of dataset X to a dataset X’ of lower dimensionality. The variance of X that is retained in X’ is maximal.

Dataset X is mapped to dataset X’, here of the same dimensionality. The first dimension in X’ (= the first principal component) is the direction of maximal variance. The second principal component is orthogonal to the first.
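A minimal PCA sketch in Python using only NumPy; the synthetic data and the choice of two retained components are assumptions for the example.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # synthetic correlated data
Xc = X - X.mean(axis=0)                                    # center the data

cov = np.cov(Xc, rowvar=False)                             # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)                     # eigen-decomposition (ascending order)
order = np.argsort(eigvals)[::-1]                          # directions of maximal variance first
W = eigvecs[:, order[:2]]                                  # keep the first 2 principal components

X_reduced = Xc @ W                                         # project onto the principal components
print(X_reduced.shape)                                     # (200, 2)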

Page 24: Feature Selection,  Dimensionality Reduction, and Clustering

Dimensionality Reduction: Nonlinear (Kernel) Principal Components Analysis

Original dataset X. Map X to a HIGHER-dimensional space, and carry out LINEAR PCA in that space.

(If necessary,) map the resulting principal components back to the original space.
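A minimal Kernel PCA sketch with scikit-learn; the concentric-circles data, the RBF kernel, and its width are illustrative assumptions.

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)  # nonlinear structure
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0, fit_inverse_transform=True)
X_kpca = kpca.fit_transform(X)            # linear PCA in the implicit high-dimensional space
X_back = kpca.inverse_transform(X_kpca)   # approximate mapping back to the original space
print(X_kpca.shape, X_back.shape)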

Page 25: Feature Selection,  Dimensionality Reduction, and Clustering

Dimensionality Reduction: Multi-Dimensional Scaling

MDS is a mathematical dimension reduction technique that maps the distances between observations from the original (high) dimensional space into a lower (for example, two) dimensional space.

MDS attempts to retain the pairwise Euclidean distances in the low-dimensional space.

Error on the fit is measured using a so-called “stress” function

Different choices for a stress function are possible
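A minimal MDS sketch with scikit-learn; the dataset and the target dimensionality are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.manifold import MDS

X = load_iris().data                      # 4-dimensional observations
mds = MDS(n_components=2, dissimilarity="euclidean", random_state=0)
X_2d = mds.fit_transform(X)               # 2-D map that tries to preserve pairwise distances
print(X_2d.shape, mds.stress_)            # final value of the stress function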

Page 26: Feature Selection,  Dimensionality Reduction, and Clustering

Dimensionality Reduction: Multi-Dimensional Scaling

Raw stress function (identical to PCA):

Sammon cost function:
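The formulas themselves are not reproduced in the transcript; the standard forms, with d_ij the pairwise distance in the original space and ||y_i - y_j|| the distance in the low-dimensional map, are:

Raw stress: \sum_{i<j} ( d_{ij} - || y_i - y_j || )^2

Sammon cost: \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{( d_{ij} - || y_i - y_j || )^2}{d_{ij}}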

Page 27: Feature Selection,  Dimensionality Reduction, and Clustering

Dimensionality Reduction: Multi-Dimensional Scaling (Example)

Input:

Output:

Page 28: Feature Selection,  Dimensionality Reduction, and Clustering

Dimensionality Reduction: Homogeneity Analysis

Homogeneity analysis (homals) finds a lower-dimensional representation of a categorical data matrix X. It may be considered a type of nonlinear extension of PCA.

Page 29: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Similarity measures for hierarchical clustering

Overview of methods per task:

Clustering: k-th Nearest Neighbour, Parzen Window, Unfolding, Conjoint Analysis, Cat-PCA

Classification: Linear Discriminant Analysis, QDA, Logistic Regression (Logit), Decision Trees, LSSVM, NN, VS

Regression: Classical Linear Regression, Ridge Regression, NN, CART

[Figure: scatter plots of positive (+) and negative (-) examples in the (X1, X2) plane]

Page 30: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering

Clustering is an unsupervised learning technique.

Task: organize objects into groups whose members are similar in some way

Clustering finds structures in a collection of unlabeled data

A cluster is a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters.

Page 31: Feature Selection,  Dimensionality Reduction, and Clustering

Density estimation and clustering

Bayesian separation curve (optimal)

Page 32: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: K-means clustering

Minimizes the sum of the squared distances to the cluster centers (reconstruction error)

Iterative process:

Estimate current assignments (construct Voronoi partition)

Given the new cluster assignments, set each cluster center to the center of mass (mean) of its assigned points.
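A minimal K-means sketch in Python using only NumPy; the synthetic three-blob data, K = 3, and the fixed number of iterations are assumptions for the example.

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ((0, 0), (5, 5), (0, 5))])  # 3 blobs
K = 3
centers = X[rng.choice(len(X), size=K, replace=False)]    # initialize with random data points

for _ in range(20):
    # Step 1: assign each point to its nearest center (Voronoi partition).
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    # Step 2: move each center to the center of mass of its assigned points
    # (keep the old center if a cluster happens to be empty).
    centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                        for k in range(K)])

print(centers)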

Page 33: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: K-means clustering

[Figure: four snapshots (Step 1 to Step 4) of the K-means iterations]

Page 34: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Hierarchical clustering

Dendrogram

Clustering based on (dis)similarities. Multilevel clustering: level 1 has n clusters, level n has one cluster

Agglomerative HC: starts with N clusters and combines clusters iteratively

Divisive HC: starts with one cluster and divides iteratively

Disadvantage: a wrong merge/division cannot be undone

Page 35: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Nearest Neighbor algorithm for hierarchical clustering

1. Nearest Neighbor, Level 2, k = 7 clusters.

2. Nearest Neighbor, Level 3, k = 6 clusters.

3. Nearest Neighbor, Level 4, k = 5 clusters.

Page 36: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Nearest Neighbor algorithm for hierarchical clustering

4. Nearest Neighbor, Level 5, k = 4 clusters.

5. Nearest Neighbor, Level 6, k = 3 clusters.

6. Nearest Neighbor, Level 7, k = 2 clusters.

Page 37: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Nearest Neighbor algorithm for hierarchical clustering

7. Nearest Neighbor, Level 8, k = 1 cluster.

Page 38: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Similarity measures for hierarchical clustering

Page 39: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Similarity measures for hierarchical clustering

Pearson Correlation: Trend Similarity

[Figure: three profiles a, b, c with the same trend but different scale/offset]

C_pearson(a, c) = 1
C_pearson(a, b) = 1
C_pearson(b, c) = 1
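The Pearson correlation formula itself is not reproduced in the transcript; the standard definition, with x̄ and ȳ the means of the two profiles, is:

C_pearson(x, y) = \frac{ \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) }{ \sqrt{ \sum_{i=1}^{N} (x_i - \bar{x})^2 } \sqrt{ \sum_{i=1}^{N} (y_i - \bar{y})^2 } }

which equals 1 whenever one profile is a positively scaled and/or shifted version of the other, hence "trend similarity".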

Page 40: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Similarity measures for hierarchical clustering

Euclidean Distance

d(x, y) = \sqrt{ \sum_{n=1}^{N} (x_n - y_n)^2 }, where x = (x_1, …, x_N) and y = (y_1, …, y_N).

Page 41: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Similarity measures for hierarchical clustering

Cosine Correlation

C_cosine(x, y) = \frac{ \sum_{i=1}^{N} x_i y_i }{ ||x|| \, ||y|| }, where x = (x_1, …, x_N) and y = (y_1, …, y_N).

The cosine correlation ranges from +1 (x and y point in the same direction) to -1 (x and y point in opposite directions).
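A minimal sketch comparing the three similarity measures in Python (NumPy/SciPy); the example profiles are illustrative.

import numpy as np
from scipy.spatial.distance import euclidean, cosine
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 0.5 * x + 1.0                      # same trend as x, different scale and offset

r, _ = pearsonr(x, y)
print(euclidean(x, y))                 # Euclidean distance (sensitive to scale and offset)
print(r)                               # Pearson correlation (trend similarity) -> 1.0
print(1.0 - cosine(x, y))              # cosine similarity (SciPy's `cosine` is a distance)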

Page 42: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Similarity measures for hierarchical clustering

Cosine Correlation: Trend + Mean Distance

[Figure: three profiles a, b, c]

C_cosine(a, b) = 1
C_cosine(a, c) = 0.9622
C_cosine(b, c) = 0.9622

Page 43: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Similarity measures for hierarchical clustering

[Figure: three profiles a, b, c with the same trend]

C_cosine(a, b) = 1        C_cosine(a, c) = 0.9622    C_cosine(b, c) = 0.9622
d(a, c) = 1.5875          d(a, b) = 2.8025           d(b, c) = 3.2211
C_pearson(a, c) = 1       C_pearson(a, b) = 1        C_pearson(b, c) = 1

Page 44: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Similarity measures for hierarchical clustering

C_cosine(a, b) = 0.7544   C_cosine(a, c) = 0.8092    C_cosine(b, c) = 0.844
d(a, c) = 0.0255          d(a, b) = 0.0279           d(b, c) = 0.0236
C_pearson(a, c) = 0.1244  C_pearson(a, b) = 0.1175   C_pearson(b, c) = 0.1779

Similar?

Page 45: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Grouping strategies for hierarchical clustering

[Figure: three clusters C1, C2, C3]

Merge which pair of clusters?

Page 46: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Grouping strategies for hierarchical clustering

Single Linkage

[Figure: two clusters C1 and C2]

Dissimilarity between two clusters = minimum dissimilarity between the members of the two clusters.

Tends to generate “long chains”.

Page 47: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Grouping strategies for hierarchical clustering

Complete Linkage

[Figure: two clusters C1 and C2]

Dissimilarity between two clusters = maximum dissimilarity between the members of the two clusters.

Tends to generate “clumps”.

Page 48: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Grouping strategies for hierarchical clustering

Average Linkage

[Figure: two clusters C1 and C2]

Dissimilarity between two clusters = average of the distances over all pairs of objects (one from each cluster).

Page 49: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Grouping strategies for hierarchical clustering

Average Group Linkage

[Figure: two clusters C1 and C2]

Dissimilarity between two clusters = distance between the two cluster means.
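These linkage criteria map directly onto the `method` argument of SciPy's agglomerative clustering; a minimal sketch (the two-blob data and the cut into two clusters are illustrative assumptions, and 'centroid' plays the role of average group linkage).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(20, 2)) for c in ((0, 0), (4, 4))])   # two blobs

# method can be 'single', 'complete', 'average' or 'centroid', matching the strategies above.
Z = linkage(X, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
print(labels)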

Page 50: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Support Vector Machines for clustering

The noiseless case

Objective function:

Ben-Hur, Horn, Siegelmann and Vapnik, 2001
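The objective functions are not reproduced in the transcript. In the support vector clustering formulation of Ben-Hur, Horn, Siegelmann and Vapnik (2001), the noiseless ('hard') problem looks for the smallest sphere in feature space that encloses all mapped points, and the noisy ('soft') variants on the following slides add slack variables with a penalty parameter C:

Noiseless: min_{R, a} R^2 subject to || \Phi(x_j) - a ||^2 \le R^2 for all j

Noisy: min_{R, a, \xi} R^2 + C \sum_j \xi_j subject to || \Phi(x_j) - a ||^2 \le R^2 + \xi_j, \xi_j \ge 0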

Page 51: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Support Vector Machines for clustering

The noisy case

Objective function:

Ben-Hur, Horn, Siegelmann and Vapnik, 2001

Page 52: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Support Vector Machines for clustering

The noisy case (II)

Objective function:

Ben-Hur, Horn, Siegelmann and Vapnik, 2001

Page 53: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Support Vector Machines for clustering

The noisy case (III)

Objective function:

Ben-Hur, Horn, Siegelmann and Vapnik, 2001

Page 54: Feature Selection,  Dimensionality Reduction, and Clustering

Conclusion / Summary / References

Feature Selection: filtering approach, wrapper approach, embedded methods

Clustering: density estimation and clustering, K-means clustering, hierarchical clustering, clustering with Support Vector Machines (SVMs)

Dimensionality Reduction: Principal Components Analysis (PCA), nonlinear PCA (Kernel PCA, CatPCA), Multi-Dimensional Scaling (MDS), Homogeneity Analysis

Ben-Hur, Horn, Siegelmann and Vapnik, 2001

http://www.autonlab.org/tutorials/kmeans11.pdf

Gifi, 1990

Schoelkopf et al., 2001

Borg and Groenen, 2005

http://www.cs.otago.ac.nz/cosc453/student_tutorials/...principal_components.pdf

I. Guyon et al., 2006

Kohavi and John, 1997

MacQueen, 1967

Hastie et al., 2001