Feature Selection, Dimensionality Reduction, and Clustering

Summer Course: Data Mining. Feature Selection, Dimensionality Reduction, and Clustering. Presenter: Georgi Nalbantov. August 2009.

TRANSCRIPT

Page 1: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection, Dimensionality Reduction, and Clustering

Presenter: Georgi Nalbantov

Summer Course: Data Mining

August 2009

Page 2: Feature Selection,  Dimensionality Reduction, and Clustering

Structure

Feature Selection: filtering approach, wrapper approach, embedded methods

Clustering: density estimation and clustering, K-means clustering, hierarchical clustering, clustering with Support Vector Machines (SVMs)

Dimensionality Reduction: Principal Components Analysis (PCA), nonlinear PCA (Kernel PCA, CatPCA), Multi-Dimensional Scaling (MDS), Homogeneity Analysis

Page 3: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection, Dimensionality Reduction, and Clustering in the KDD Process

U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth (1995)

Page 4: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection

In the presence of millions of features/attributes/inputs/variables, select the most relevant ones.

Advantages: build better, faster, and easier-to-understand learning machines.

[Figure: data matrix X with n samples and m features, reduced to m' selected features]

Page 5: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection

Goal: select the two best features individually

Any reasonable objective J will rank the features as J(x1) > J(x2) = J(x3) > J(x4). Thus, the chosen pair is [x1, x2] or [x1, x3]. However, x4 is the only feature that provides complementary information to x1.

Page 6: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection

Filtering approach: ranks features or feature subsets independently of the predictor (classifier).

…using univariate methods: consider one variable at a time
…using multivariate methods: consider more than one variable at a time

Wrapper approach: uses a classifier to assess (many) features or feature subsets.

Embedded approach: uses a classifier to build a (single) model with a subset of features that are internally selected.
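To make the three approaches concrete, here is a minimal sketch in Python with scikit-learn; the synthetic dataset, the choice of estimators, and all parameter values are illustrative assumptions rather than part of the original slides.

# Filter, wrapper, and embedded feature selection: a minimal sketch (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

# Filter: rank features independently of the predictor (here: univariate F-test scores).
filt = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# Wrapper: use a classifier to assess feature subsets (recursive feature elimination).
wrap = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)

# Embedded: the classifier selects features internally via an L1 penalty.
emb = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1)).fit(X, y)

print(filt.get_support().sum(), wrap.get_support().sum(), emb.get_support().sum())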

Page 7: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: univariate filtering approach

[Figure: class-conditional densities P(X_i | Y) of a single feature x_i for the two classes]

Issue: determine the relevance of a given single feature.

Page 8: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: univariate filtering approach

Issue: determine the relevance of a given single feature.

Under independence: P(X, Y) = P(X) P(Y)

Measure of dependence (mutual information):

MI(X, Y) = \int \int P(X, Y) \log \frac{P(X, Y)}{P(X) P(Y)} \, dX \, dY = KL( P(X, Y) || P(X) P(Y) )
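A minimal sketch of mutual-information feature ranking in Python; scikit-learn's MI estimator and the synthetic data are assumptions made for the example.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=0)
mi = mutual_info_classif(X, y, random_state=0)   # estimated MI(X_i, Y) for each feature
ranking = np.argsort(mi)[::-1]                   # highest mutual information first
print(ranking[:5])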

Page 9: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: univariate filtering approach

Correlation and MI. Note: correlation is a measure of linear dependence.

Page 10: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: univariate filtering approach

Correlation and MI under the Gaussian distribution
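The figure for this slide is not reproduced in the transcript. For reference, for two jointly Gaussian variables with correlation coefficient rho the two quantities are directly related (in nats):

MI(X, Y) = -\tfrac{1}{2} \log( 1 - \rho^2 )

so under the Gaussian assumption ranking features by |correlation| or by mutual information gives the same order.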

Page 11: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: univariate filtering approach. Criteria for measuring dependence.

Page 12: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: univariate filtering approach

[Figure: class-conditional densities P(X_i | Y = 1) and P(X_i | Y = -1) of a feature x_i, with the positive (+) and negative (-) examples shown along the x_i axis. Legend: Y = 1, Y = -1]

Page 13: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: univariate filtering approach

[Figure: two class-conditional density plots for a feature x_i. Legend: Y = 1, Y = -1]

P(X_i | Y = 1) = P(X_i | Y = -1): the feature is irrelevant for separating the classes.
P(X_i | Y = 1) ≠ P(X_i | Y = -1): the feature carries information about the class.

Page 14: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: univariate filtering approach

[Figure: positive (+) and negative (-) examples of feature x_i with their class means; is this distance significant?]

T-test

• Normally distributed classes with equal but unknown variance σ², estimated from the data as σ²_within.

• Null hypothesis H0: μ+ = μ-

• T statistic: if H0 is true, then t = (μ+ - μ-) / ( σ_within · sqrt( 1/m+ + 1/m- ) ) follows a Student t distribution with m+ + m- - 2 degrees of freedom.
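A minimal sketch of t-test-based feature ranking, assuming SciPy and synthetic two-class data (the informative features are planted for illustration).

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)      # two classes, labels 0/1
X[y == 1, :3] += 1.0                  # shift the first 3 features for the positive class

t_stat, p_val = ttest_ind(X[y == 1], X[y == 0], axis=0, equal_var=True)
ranking = np.argsort(p_val)           # smallest p-value = most significant mean difference
print(ranking[:5])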

Page 15: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: multivariate filtering approach

Guyon-Elisseeff, JMLR 2004; Springer 2006

Page 16: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: search strategies

N features, 2^N possible feature subsets!

Kohavi-John, 1997

Page 17: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: search strategies

Forward selection or backward elimination.

Beam search: keep the k best paths at each step.

GSFS: generalized sequential forward selection – when (n-k) features are left, try all subsets of g features. More trainings at each step, but fewer steps.

PTA(l,r): plus l, take away r – at each step, run SFS l times then SBS r times.

Floating search: One step of SFS (resp. SBS), then SBS (resp. SFS) as long as we find better subsets than those of the same size obtained so far.
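As a concrete illustration of sequential (wrapper-style) search, here is a minimal greedy forward-selection sketch in Python with scikit-learn; the estimator, the cross-validation scoring, and the stopping budget of 5 features are assumptions for the example.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=15, n_informative=4, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
while len(selected) < 5:                          # arbitrary budget of 5 features
    best_score, best_j = -np.inf, None
    for j in remaining:                           # try adding each remaining feature
        cols = selected + [j]
        score = cross_val_score(LogisticRegression(max_iter=1000), X[:, cols], y, cv=5).mean()
        if score > best_score:
            best_score, best_j = score, j
    selected.append(best_j)
    remaining.remove(best_j)
    print(f"added feature {best_j}, CV accuracy {best_score:.3f}")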

Page 18: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: filters vs. wrappers vs. embedding

Main goal: rank subsets of useful features

Page 19: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: feature subset assessment (wrapper)

1) For each feature subset, train predictor on training data.

2) Select the feature subset which performs best on validation data. Repeat and average if you want to reduce variance (cross-validation).

3) Test on test data.

[Figure: data matrix with M samples (rows) and N variables/features (columns), split row-wise into three subsets m1, m2, m3]

Split the data into 3 sets: training, validation, and test set.

Danger of over-fitting with intensive search!
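A minimal sketch of the three-way split with scikit-learn; the proportions and the dataset are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
# First hold out the test set, then split the rest into training and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
# Train each candidate feature subset on (X_train, y_train), pick the best on (X_val, y_val),
# and report the final performance once on (X_test, y_test).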

Page 20: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection via Embedded Methods: L1-regularization

[Figure: coefficient paths plotted against sum(|beta|), the L1 norm of the coefficient vector]
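A minimal sketch of L1-regularized (embedded) selection with scikit-learn's Lasso; the regression dataset and the regularization strength alpha are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5, noise=1.0, random_state=0)
lasso = Lasso(alpha=0.5).fit(X, y)        # the L1 penalty drives many coefficients to exactly zero
selected = np.flatnonzero(lasso.coef_)    # indices of the features with non-zero weight
print(selected)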

Page 21: Feature Selection,  Dimensionality Reduction, and Clustering

Feature Selection: summary

Univariate, linear: T-test, AUC, feature ranking

Univariate, non-linear: mutual information feature ranking

Multivariate, linear: RFE with linear SVM or LDA

Multivariate, non-linear: Nearest Neighbors, Neural Nets, Trees, SVM

Page 22: Feature Selection,  Dimensionality Reduction, and Clustering

Dimensionality Reduction

In the presence of many features, select the most relevant subset of (weighted) combinations of features.

Feature selection: X_1, …, X_m → X_{k_1}, …, X_{k_p} (a subset of the original features)

Dimensionality reduction: X_1, …, X_m → f_1(X_1, …, X_m), …, f_p(X_1, …, X_m) (new features built as combinations of the original ones)

Page 23: Feature Selection,  Dimensionality Reduction, and Clustering

Dimensionality Reduction: (Linear) Principal Components Analysis

PCA finds a linear mapping of dataset X to a dataset X’ of lower dimensionality. The variance of X that is retained in X’ is maximal.

Dataset X is mapped to dataset X’, here of the same dimensionality. The first dimension in X’ (= the first principal component) is the direction of maximal variance. The second principal component is orthogonal to the first.
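A minimal PCA sketch in Python using only NumPy; the synthetic data and the choice of two retained components are assumptions for the example.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # synthetic correlated data
Xc = X - X.mean(axis=0)                                    # center the data

cov = np.cov(Xc, rowvar=False)                             # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)                     # eigen-decomposition (ascending order)
order = np.argsort(eigvals)[::-1]                          # directions of maximal variance first
W = eigvecs[:, order[:2]]                                  # keep the first 2 principal components

X_reduced = Xc @ W                                         # project onto the principal components
print(X_reduced.shape)                                     # (200, 2)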

Page 24: Feature Selection,  Dimensionality Reduction, and Clustering

Dimensionality Reduction: Nonlinear (Kernel) Principal Components Analysis

Original dataset X. Map X to a HIGHER-dimensional space, and carry out LINEAR PCA in that space.

(If necessary,) map the resulting principal components back to the original space.
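A minimal Kernel PCA sketch with scikit-learn; the concentric-circles data, the RBF kernel, and its width are illustrative assumptions.

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)  # nonlinear structure
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0, fit_inverse_transform=True)
X_kpca = kpca.fit_transform(X)            # linear PCA in the implicit high-dimensional space
X_back = kpca.inverse_transform(X_kpca)   # approximate mapping back to the original space
print(X_kpca.shape, X_back.shape)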

Page 25: Feature Selection,  Dimensionality Reduction, and Clustering

Dimensionality Reduction: Multi-Dimensional Scaling

MDS is a mathematical dimension reduction technique that maps the distances between observations from the original (high) dimensional space into a lower (for example, two) dimensional space.

MDS attempts to retain the pairwise Euclidean distances in the low-dimensional space.

Error on the fit is measured using a so-called “stress” function

Different choices for a stress function are possible
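A minimal MDS sketch with scikit-learn; the dataset and the target dimensionality are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.manifold import MDS

X = load_iris().data                      # 4-dimensional observations
mds = MDS(n_components=2, dissimilarity="euclidean", random_state=0)
X_2d = mds.fit_transform(X)               # 2-D map that tries to preserve pairwise distances
print(X_2d.shape, mds.stress_)            # final value of the stress function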

Page 26: Feature Selection,  Dimensionality Reduction, and Clustering

Dimensionality Reduction: Multi-Dimensional Scaling

Raw stress function (identical to PCA):

Sammon cost function:
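The formulas themselves are not reproduced in the transcript; the standard forms, with d_ij the pairwise distance in the original space and ||y_i - y_j|| the distance in the low-dimensional map, are:

Raw stress: \sum_{i<j} ( d_{ij} - || y_i - y_j || )^2

Sammon cost: \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{( d_{ij} - || y_i - y_j || )^2}{d_{ij}}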

Page 27: Feature Selection,  Dimensionality Reduction, and Clustering

Dimensionality Reduction: Multi-Dimensional Scaling (Example)

Input:

Output:

Page 28: Feature Selection,  Dimensionality Reduction, and Clustering

Dimensionality Reduction: Homogeneity Analysis

Homogeneity analysis (homals) finds a lower-dimensional representation of a categorical data matrix X. It may be considered a type of nonlinear extension of PCA.

Page 29: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Similarity measures for hierarchical clustering

Overview of methods per task:

Clustering: k-th Nearest Neighbour, Parzen Window, Unfolding, Conjoint Analysis, Cat-PCA

Classification: Linear Discriminant Analysis, QDA, Logistic Regression (Logit), Decision Trees, LSSVM, NN, VS

Regression: Classical Linear Regression, Ridge Regression, NN, CART

[Figure: scatter plots of positive (+) and negative (-) examples in the (X1, X2) plane]

Page 30: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering

Clustering is an unsupervised learning technique.

Task: organize objects into groups whose members are similar in some way

Clustering finds structures in a collection of unlabeled data

A cluster is a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters.

Page 31: Feature Selection,  Dimensionality Reduction, and Clustering

Density estimation and clustering

Bayesian separation curve (optimal)

Page 32: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: K-means clustering

Minimizes the sum of the squared distances to the cluster centers (reconstruction error)

Iterative process:

Estimate current assignments (construct Voronoi partition)

Given the new cluster assignments, set each cluster center to the center of mass (mean) of its assigned points.
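A minimal K-means sketch in Python using only NumPy; the synthetic three-blob data, K = 3, and the fixed number of iterations are assumptions for the example.

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ((0, 0), (5, 5), (0, 5))])  # 3 blobs
K = 3
centers = X[rng.choice(len(X), size=K, replace=False)]    # initialize with random data points

for _ in range(20):
    # Step 1: assign each point to its nearest center (Voronoi partition).
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    # Step 2: move each center to the center of mass of its assigned points
    # (keep the old center if a cluster happens to be empty).
    centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                        for k in range(K)])

print(centers)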

Page 33: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: K-means clustering

[Figure: four snapshots (Step 1 to Step 4) of the K-means iterations]

Page 34: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Hierarchical clustering

Dendrogram

Clustering based on (dis)similarities. Multilevel clustering: level 1 has n clusters, level n has one cluster

Agglomerative HC: starts with N clusters and combines clusters iteratively

Divisive HC: starts with one cluster and divides iteratively

Disadvantage: a wrong merge/division cannot be undone

Page 35: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Nearest Neighbor algorithm for hierarchical clustering

1. Nearest Neighbor, Level 2, k = 7 clusters.

2. Nearest Neighbor, Level 3, k = 6 clusters.

3. Nearest Neighbor, Level 4, k = 5 clusters.

Page 36: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Nearest Neighbor algorithm for hierarchical clustering

4. Nearest Neighbor, Level 5, k = 4 clusters.

5. Nearest Neighbor, Level 6, k = 3 clusters.

6. Nearest Neighbor, Level 7, k = 2 clusters.

Page 37: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Nearest Neighbor algorithm for hierarchical clustering

7. Nearest Neighbor, Level 8, k = 1 cluster.

Page 38: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Similarity measures for hierarchical clustering

Page 39: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Similarity measures for hierarchical clustering

Pearson Correlation: Trend Similarity

[Figure: three profiles a, b, c with the same trend but different scale/offset]

C_pearson(a, c) = 1
C_pearson(a, b) = 1
C_pearson(b, c) = 1
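The Pearson correlation formula itself is not reproduced in the transcript; the standard definition, with x̄ and ȳ the means of the two profiles, is:

C_pearson(x, y) = \frac{ \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) }{ \sqrt{ \sum_{i=1}^{N} (x_i - \bar{x})^2 } \sqrt{ \sum_{i=1}^{N} (y_i - \bar{y})^2 } }

which equals 1 whenever one profile is a positively scaled and/or shifted version of the other, hence "trend similarity".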

Page 40: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Similarity measures for hierarchical clustering

Euclidean Distance

d(x, y) = \sqrt{ \sum_{n=1}^{N} (x_n - y_n)^2 }, where x = (x_1, …, x_N) and y = (y_1, …, y_N).

Page 41: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Similarity measures for hierarchical clustering

Cosine Correlation

C_cosine(x, y) = \frac{ \sum_{i=1}^{N} x_i y_i }{ ||x|| \, ||y|| }, where x = (x_1, …, x_N) and y = (y_1, …, y_N).

The cosine correlation ranges from +1 (x and y point in the same direction) to -1 (x and y point in opposite directions).
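A minimal sketch comparing the three similarity measures in Python (NumPy/SciPy); the example profiles are illustrative.

import numpy as np
from scipy.spatial.distance import euclidean, cosine
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 0.5 * x + 1.0                      # same trend as x, different scale and offset

r, _ = pearsonr(x, y)
print(euclidean(x, y))                 # Euclidean distance (sensitive to scale and offset)
print(r)                               # Pearson correlation (trend similarity) -> 1.0
print(1.0 - cosine(x, y))              # cosine similarity (SciPy's `cosine` is a distance)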

Page 42: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Similarity measures for hierarchical clustering

Cosine Correlation: Trend + Mean Distance

[Figure: three profiles a, b, c]

C_cosine(a, b) = 1
C_cosine(a, c) = 0.9622
C_cosine(b, c) = 0.9622

Page 43: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Similarity measures for hierarchical clustering

[Figure: three profiles a, b, c with the same trend]

C_cosine(a, b) = 1        C_cosine(a, c) = 0.9622    C_cosine(b, c) = 0.9622
d(a, c) = 1.5875          d(a, b) = 2.8025           d(b, c) = 3.2211
C_pearson(a, c) = 1       C_pearson(a, b) = 1        C_pearson(b, c) = 1

Page 44: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Similarity measures for hierarchical clustering

C_cosine(a, b) = 0.7544   C_cosine(a, c) = 0.8092    C_cosine(b, c) = 0.844
d(a, c) = 0.0255          d(a, b) = 0.0279           d(b, c) = 0.0236
C_pearson(a, c) = 0.1244  C_pearson(a, b) = 0.1175   C_pearson(b, c) = 0.1779

Similar?

Page 45: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Grouping strategies for hierarchical clustering

[Figure: three clusters C1, C2, C3]

Merge which pair of clusters?

Page 46: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Grouping strategies for hierarchical clustering

Single Linkage

[Figure: two clusters C1 and C2]

Dissimilarity between two clusters = minimum dissimilarity between the members of the two clusters.

Tends to generate “long chains”.

Page 47: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Grouping strategies for hierarchical clustering

Complete Linkage

[Figure: two clusters C1 and C2]

Dissimilarity between two clusters = maximum dissimilarity between the members of the two clusters.

Tends to generate “clumps”.

Page 48: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Grouping strategies for hierarchical clustering

Average Linkage

[Figure: two clusters C1 and C2]

Dissimilarity between two clusters = average of the distances over all pairs of objects (one from each cluster).

Page 49: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Grouping strategies for hierarchical clustering

Average Group Linkage

[Figure: two clusters C1 and C2]

Dissimilarity between two clusters = distance between the two cluster means.
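These linkage criteria map directly onto the `method` argument of SciPy's agglomerative clustering; a minimal sketch (the two-blob data and the cut into two clusters are illustrative assumptions, and 'centroid' plays the role of average group linkage).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(20, 2)) for c in ((0, 0), (4, 4))])   # two blobs

# method can be 'single', 'complete', 'average' or 'centroid', matching the strategies above.
Z = linkage(X, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
print(labels)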

Page 50: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Support Vector Machines for clustering

The noiseless case

Objective function:

Ben-Hur, Horn, Siegelmann and Vapnik, 2001
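The objective functions are not reproduced in the transcript. In the support vector clustering formulation of Ben-Hur, Horn, Siegelmann and Vapnik (2001), the noiseless ('hard') problem looks for the smallest sphere in feature space that encloses all mapped points, and the noisy ('soft') variants on the following slides add slack variables with a penalty parameter C:

Noiseless: min_{R, a} R^2 subject to || \Phi(x_j) - a ||^2 \le R^2 for all j

Noisy: min_{R, a, \xi} R^2 + C \sum_j \xi_j subject to || \Phi(x_j) - a ||^2 \le R^2 + \xi_j, \xi_j \ge 0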

Page 51: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Support Vector Machines for clustering

The noisy case

Objective function:

Ben-Hur, Horn, Siegelmann and Vapnik, 2001

Page 52: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Support Vector Machines for clustering

The noisy case (II)

Objective function:

Ben-Hur, Horn, Siegelmann and Vapnik, 2001

Page 53: Feature Selection,  Dimensionality Reduction, and Clustering

Clustering: Support Vector Machines for clustering

The noisy case (III)

Objective function:

Ben-Hur, Horn, Siegelmann and Vapnik, 2001

Page 54: Feature Selection,  Dimensionality Reduction, and Clustering

Conclusion / Summary / References

Feature Selection: filtering approach, wrapper approach, embedded methods

Clustering: density estimation and clustering, K-means clustering, hierarchical clustering, clustering with Support Vector Machines (SVMs)

Dimensionality Reduction: Principal Components Analysis (PCA), nonlinear PCA (Kernel PCA, CatPCA), Multi-Dimensional Scaling (MDS), Homogeneity Analysis

Ben-Hur, Horn, Siegelmann and Vapnik, 2001

http://www.autonlab.org/tutorials/kmeans11.pdf

Gifi, 1990

Schoelkopf et al., 2001

Borg and Groenen, 2005

http://www.cs.otago.ac.nz/cosc453/student_tutorials/...principal_components.pdf

I. Guyon et al., 2006

Kohavi and John, 1997

MacQueen, 1967

Hastie et al., 2001