
Page 1

MACHINE LEARNING – 2012

kernel CCA, kernel K-means, Spectral Clustering

Page 2

Change in timetable: we have a practical session next week!

Page 3

Structure of today's and next week's class

1) Briefly go through some extensions or variants of the principle of kernel PCA, namely kernel CCA.
2) Look at one particular set of clustering algorithms for structure discovery: kernel K-means.
3) Describe the general concept of Spectral Clustering, highlighting the equivalence between kernel PCA, ISOMAP, etc., and their use for clustering.
4) Introduce the notions of unsupervised and semi-supervised learning and how they can be used to evaluate clustering methods.
5) Compare kernel K-means and the Gaussian Mixture Model for unsupervised and semi-supervised clustering.

Page 4

Canonical Correlation Analysis (CCA)

Video description: x \in \mathbb{R}^N. Audio description: y \in \mathbb{R}^P. Paired observations (x_1, y_1), (x_2, y_2), ...

\max_{w_x, w_y} \operatorname{corr}\left(w_x^T x, \; w_y^T y\right)

Determine features in two (or more) separate descriptions of the dataset that best explain each datapoint.

Extract hidden structure that maximizes correlation across two different projections.

Page 5

Canonical Correlation Analysis (CCA)

Pair of multidimensional zero-mean variables; we have M instances of the pairs (x_1, y_1), (x_2, y_2), ...:

X = \{x_i\}_{i=1}^{M} \in \mathbb{R}^{N \times M}, \quad Y = \{y_i\}_{i=1}^{M} \in \mathbb{R}^{q \times M}

Search two projections w_x and w_y:

X' = w_x^T X \quad \text{and} \quad Y' = w_y^T Y

solutions of:

\max_{w_x, w_y} \operatorname{corr}(X', Y')

Page 6

Canonical Correlation Analysis (CCA)

\max_{w_x, w_y} \operatorname{corr}(X', Y') = \max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x}\;\sqrt{w_y^T C_{yy} w_y}}

Covariance matrices:

C_{xx} = E[XX^T] : N \times N, \qquad C_{yy} = E[YY^T] : q \times q

Cross-covariance matrix:

C_{xy} = E[XY^T] is N \times q. It measures the cross-correlation between X and Y.

Page 7

Canonical Correlation Analysis (CCA)

\max_{w_x, w_y} \operatorname{corr}(X', Y') = \max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x}\;\sqrt{w_y^T C_{yy} w_y}}

The correlation is not affected by rescaling the norm of the vectors, so we can ask that w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1:

\max_{w_x, w_y} w_x^T C_{xy} w_y \quad \text{u.c.} \quad w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1

Page 8

Canonical Correlation Analysis (CCA)

To determine the optimum (maximum) of

\max_{w_x, w_y} w_x^T C_{xy} w_y \quad \text{u.c.} \quad w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1,

solve by Lagrange multipliers:

C_{xy} C_{yy}^{-1} C_{yx} w_x = \lambda^2 C_{xx} w_x

Generalized eigenvalue problem; it can be reduced to a classical eigenvalue problem if C_{xx} is invertible.
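As an illustration of the derivation above (not part of the original slides), here is a minimal NumPy/SciPy sketch that solves the generalized eigenvalue problem for linear CCA. The function name, the small ridge term `reg` and the toy data are assumptions made for the example.

```python
import numpy as np
from scipy.linalg import eigh

def linear_cca(X, Y, reg=1e-8):
    """Linear CCA via the generalized eigenvalue problem
    C_xy C_yy^{-1} C_yx w_x = lambda^2 C_xx w_x.
    X: (N, M), Y: (q, M); columns are paired observations."""
    M = X.shape[1]
    X = X - X.mean(axis=1, keepdims=True)      # zero-mean variables
    Y = Y - Y.mean(axis=1, keepdims=True)
    Cxx = X @ X.T / M + reg * np.eye(X.shape[0])
    Cyy = Y @ Y.T / M + reg * np.eye(Y.shape[0])
    Cxy = X @ Y.T / M
    A = Cxy @ np.linalg.solve(Cyy, Cxy.T)      # C_xy C_yy^{-1} C_yx (symmetric PSD)
    evals, evecs = eigh(A, Cxx)                # generalized symmetric eigenproblem
    order = np.argsort(evals)[::-1]            # largest lambda^2 first
    corr = np.sqrt(np.clip(evals[order], 0.0, 1.0))
    Wx = evecs[:, order]
    Wy = np.linalg.solve(Cyy, Cxy.T @ Wx)      # w_y proportional to C_yy^{-1} C_yx w_x
    return corr, Wx, Wy

# toy example: two views sharing one latent signal
rng = np.random.default_rng(0)
z = rng.normal(size=(1, 200))
X = np.vstack([z, rng.normal(size=(2, 200))])
Y = np.vstack([-z, rng.normal(size=(1, 200))])
corr, Wx, Wy = linear_cca(X, Y)
print(corr)   # the first canonical correlation should be close to 1
```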

Page 9

Kernel Canonical Correlation Analysis

- CCA finds basis vectors such that the correlation between the projections is mutually maximized.
- It is a generalized version of PCA for two or more multi-dimensional datasets.
- CCA depends on the coordinate system in which the variables are described.
- Even if there is a strong linear relationship between the variables, depending on the coordinate system used this relationship might not be visible as a correlation → kernel CCA.

Page 10

Principle of Kernel Methods

Determine a metric which brings out features of the data so as to make subsequent computation easier.

[Figure: Original Space (x1, x2) — After Lifting Data in Feature Space]

Data becomes linearly separable when using an RBF kernel and projecting onto the first 2 PCs of kernel PCA.

Page 11

Principle of Kernel Methods (Recall)

[Figure: Original Space (x1, x2) with basis vectors e1, e2]

Idea: send the data X into a feature space H through the nonlinear map \phi:

\phi : X = \{x^i\}_{i=1\dots M}, \; x^i \in \mathbb{R}^N \;\rightarrow\; \phi(X) = \{\phi(x^1), \dots, \phi(x^M)\} \subset H

In feature space, perform a classical linear computation (a linear transformation in H).

Page 12

Kernel CCA

X = \{x_i\}_{i=1}^{M} \in \mathbb{R}^{N \times M}, \quad Y = \{y_i\}_{i=1}^{M} \in \mathbb{R}^{q \times M}

Project into a feature space:

\phi_x(x_i) \text{ and } \phi_y(y_i), \quad \text{with } \sum_{i=1}^{M} \phi_x(x_i) = 0 \text{ and } \sum_{i=1}^{M} \phi_y(y_i) = 0

Construct the associated kernel matrices:

K_x = F_x^T F_x, \quad K_y = F_y^T F_y, \quad \text{where the columns of } F_x, F_y \text{ are the } \phi_x(x_i), \phi_y(y_i)

The projection vectors can be expressed as a linear combination in feature space:

w_x = F_x \alpha_x \quad \text{and} \quad w_y = F_y \alpha_y

Page 13

Kernel CCA

In linear CCA, we were solving for:

\max_{w_x, w_y} w_x^T C_{xy} w_y \quad \text{u.c.} \quad w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1

Kernel CCA then becomes an optimization problem of the form:

\max_{\alpha_x, \alpha_y} \frac{\alpha_x^T K_x K_y \alpha_y}{\left(\alpha_x^T K_x^2 \alpha_x\right)^{1/2} \left(\alpha_y^T K_y^2 \alpha_y\right)^{1/2}}
= \max_{\alpha_x, \alpha_y} \alpha_x^T K_x K_y \alpha_y \quad \text{u.c.} \quad \alpha_x^T K_x^2 \alpha_x = \alpha_y^T K_y^2 \alpha_y = 1

Page 14

Kernel CCA

\max_{\alpha_x, \alpha_y} \frac{\alpha_x^T K_x K_y \alpha_y}{\left(\alpha_x^T K_x^2 \alpha_x\right)^{1/2} \left(\alpha_y^T K_y^2 \alpha_y\right)^{1/2}}
\quad \text{u.c.} \quad \alpha_x^T K_x^2 \alpha_x = \alpha_y^T K_y^2 \alpha_y = 1

In practice, the intersection between the spaces spanned by K_x \alpha_x and K_y \alpha_y is non-zero; the problem then has a trivial solution, since \rho \sim \cos(K_x \alpha_x, K_y \alpha_y) = 1.

Page 15

Kernel CCA

Generalized eigenvalue problem:

\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}
\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}
= \lambda
\begin{pmatrix} K_x^2 & 0 \\ 0 & K_y^2 \end{pmatrix}
\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}

Add a regularization term to increase the rank of the matrix and make it invertible:

K_x^2 \;\rightarrow\; \left(K_x + \tfrac{\kappa M}{2} I\right)^2

Several methods have been proposed to choose the regularizing term carefully, so as to get projections that are as close as possible to the "true" projections.

Page 16

Kernel CCA

\underbrace{\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}}_{A}
\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}
= \lambda
\underbrace{\begin{pmatrix} \left(K_x + \tfrac{\kappa M}{2} I\right)^2 & 0 \\ 0 & \left(K_y + \tfrac{\kappa M}{2} I\right)^2 \end{pmatrix}}_{B}
\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}

Set B = C^T C; the problem then becomes a classical eigenvalue problem for \left(C^{-1}\right)^T A\, C^{-1}.
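A minimal sketch (not from the slides) of regularized kernel CCA with RBF kernels, following the generalized eigenvalue formulation above. The kernel widths, the regularization constant kappa and the exact regularized form (K + kappa*M/2 * I)^2 are assumptions for this example.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def rbf_gram(X, sigma):
    """Centered RBF Gram matrix; X is (M, dim), rows are samples."""
    K = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma**2))
    M = K.shape[0]
    H = np.eye(M) - np.ones((M, M)) / M        # centering in feature space
    return H @ K @ H

def kernel_cca(X, Y, sigma_x=1.0, sigma_y=1.0, kappa=0.1):
    M = X.shape[0]
    Kx, Ky = rbf_gram(X, sigma_x), rbf_gram(Y, sigma_y)
    Rx = Kx + (kappa * M / 2) * np.eye(M)      # regularized blocks (assumed form)
    Ry = Ky + (kappa * M / 2) * np.eye(M)
    Z = np.zeros((M, M))
    A = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])
    B = np.block([[Rx @ Rx, Z], [Z, Ry @ Ry]])
    evals, evecs = eigh(A, B)                  # generalized symmetric eigenproblem
    alpha = evecs[:, np.argmax(evals)]         # top canonical direction
    return evals.max(), alpha[:M], alpha[M:]

rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 100)
X = np.column_stack([np.cos(t), rng.normal(scale=0.1, size=100)])
Y = np.column_stack([np.sin(t), rng.normal(scale=0.1, size=100)])
rho, ax, ay = kernel_cca(X, Y)
print(rho)
```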

Page 17

Kernel CCA

Two-dataset case:

X = \{x_i\}_{i=1}^{M} \in \mathbb{R}^{N \times M}, \quad Y = \{y_i\}_{i=1}^{M} \in \mathbb{R}^{q \times M}

Can be extended to multiple datasets:

L datasets X^1, \dots, X^L with M observations each, of dimensions N_1, \dots, N_L, i.e. X^i : N_i \times M.

Applying a non-linear transformation \phi to X^1, \dots, X^L, construct the L Gram matrices K_1, \dots, K_L.

Page 18

Kernel CCA

Was formulated as a generalized eigenvalue problem (two-dataset case):

\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}
\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}
= \lambda
\begin{pmatrix} \left(K_x + \tfrac{\kappa M}{2} I\right)^2 & 0 \\ 0 & \left(K_y + \tfrac{\kappa M}{2} I\right)^2 \end{pmatrix}
\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}

Can be extended to multiple datasets:

\begin{pmatrix}
0 & K_1 K_2 & \cdots & K_1 K_L \\
K_2 K_1 & 0 & \cdots & K_2 K_L \\
\vdots & & \ddots & \vdots \\
K_L K_1 & K_L K_2 & \cdots & 0
\end{pmatrix}
\begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_L \end{pmatrix}
= \lambda
\begin{pmatrix}
\left(K_1 + \tfrac{\kappa M}{2} I\right)^2 & & 0 \\
& \ddots & \\
0 & & \left(K_L + \tfrac{\kappa M}{2} I\right)^2
\end{pmatrix}
\begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_L \end{pmatrix}

Page 19

Kernel CCA

M. Kuss and T. Graepel, "The Geometry of Kernel Canonical Correlation Analysis", Technical Report, Max Planck Institute, 2003.

Page 20

Kernel CCA

Matlab Example File: demo_CCA-KCCA.m

Page 21

Applications of Kernel CCA

Goal: To measure correlation between heterogeneous datasets and to extract sets of genes which share similarities with respect to multiple biological attributes.

Kernel matrices K1, K2 and K3 correspond to gene–gene similarities in pathways, genome position, and microarray expression data respectively. Use an RBF kernel with fixed kernel width.

[Figure: Correlation scores in MKCCA: pathway vs. genome vs. expression]

Y. Yamanishi, J.-P. Vert, A. Nakaya, M. Kanehisa, "Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis", Bioinformatics, 2003.

Page 22

Applications of Kernel CCA

Goal: To measure correlation between heterogeneous datasets and to extract sets of genes which share similarities with respect to multiple biological attributes.

[Figure: Correlation scores in MKCCA: pathway vs. genome vs. expression. One plot gives the pairwise correlation between K1 and K2, the other between K1 and K3.]

A readout of the entries with equal projection onto the first canonical vectors gives the genes which belong to each cluster. Two clusters correspond to genes close to each other with respect to their positions in the pathways, in the genome, and to their expression.

Y. Yamanishi, J.-P. Vert, A. Nakaya, M. Kanehisa, "Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis", Bioinformatics, 2003.

Page 23

Applications of Kernel CCA

Goal: To construct appearance models for estimating an object's pose from raw brightness images.

X: set of images. Y: pose parameters (pan and tilt angle of the object w.r.t. the camera, in degrees).

[Figure: example of two image datapoints with different poses]

Use a linear kernel on X and an RBF kernel on Y, and compare performance to applying PCA on the (X, Y) dataset directly.

T. Melzer, M. Reiter and H. Bischof, "Appearance models based on kernel canonical correlation analysis", Pattern Recognition 36 (2003), pp. 1961–1971.

Page 24

Applications of Kernel CCA

Goal: To construct appearance models for estimating an object's pose from raw brightness images.

Kernel CCA performs better than PCA, especially for a small k = testing/training ratio (i.e., for larger training sets).

The kernel-CCA estimators tend to produce fewer outliers (i.e., gross errors) and consequently yield a smaller standard deviation of the pose estimation error than their PCA-based counterparts.

For very small training sets, the performance of both approaches becomes similar.

[Figure: pose estimation error vs. testing/training ratio]

T. Melzer, M. Reiter and H. Bischof, "Appearance models based on kernel canonical correlation analysis", Pattern Recognition 36 (2003), pp. 1961–1971.

Page 25

Kernel K-means
Spectral Clustering

Page 26

Structure Discovery: Clustering

Groups pairs of points according to how similar they are.

Density-based clustering methods (soft K-means, kernel K-means, Gaussian Mixture Models) compare the relative distributions.

Page 27

K-means

[Figure: three cluster centroids μ1, μ2, μ3]

K-means is a hard partitioning of the space through K clusters, whose boundaries are equidistant between centroids according to the 2-norm.

The distribution of data within each cluster is encapsulated in a sphere.

Page 28

K-means Algorithm

1. Initialization: pick K centroids \mu_k, k = 1 \dots K.

Page 29

K-means Algorithm

Iterative method (variant on Expectation-Maximization):

2. Calculate the distance from each data point to each centroid:

d(x_j, \mu_k) = \|x_j - \mu_k\|^p

3. Assignment step: assign the responsibility of each data point to its "closest" centroid (E-step):

\arg\min_k d(x_j, \mu_k)

If a tie happens (i.e. two centroids are equidistant to a data point), one assigns the data point to the smallest winning centroid.

Page 30

K-means Algorithm

Iterative method (variant on Expectation-Maximization):

2. Calculate the distance from each data point to each centroid: d(x_j, \mu_k) = \|x_j - \mu_k\|^p.

3. Assignment step: assign each data point to its "closest" centroid (E-step): \arg\min_k d(x_j, \mu_k). If a tie happens (i.e. two centroids are equidistant to a data point), one assigns the data point to the smallest winning centroid.

4. Update step: adjust the centroids to be the means of all data points assigned to them (M-step).

5. Go back to step 2 and repeat the process until the clusters are stable.
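A compact NumPy sketch of the iterative procedure above (steps 1–5), not taken from the course material; the Euclidean distance (p = 2) and the random initialization are assumptions.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """X: (M, N) data matrix. Returns centroids (K, N) and labels (M,)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]          # 1. init: pick K data points
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # 2. distances
        labels = d.argmin(axis=1)                         # 3. E-step: closest centroid
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])            # 4. M-step: recompute the means
        if np.allclose(new_mu, mu):                       # 5. stop when clusters are stable
            break
        mu = new_mu
    return mu, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ([0, 0], [3, 0], [0, 3])])
mu, labels = kmeans(X, K=3)
print(mu)
```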

Page 31

K-means Clustering: Weaknesses

Two hyperparameters: the number of clusters K and the power p of the metric.

Very sensitive to the choice of the number of clusters K and to the initialization.

Page 32

K-means Clustering: Hyperparameters

Two hyperparameters: the number of clusters K and the power p of the metric.

The choice of power determines the form of the decision boundaries.

[Figure: decision boundaries for p = 1, 2, 3, 4]

Page 33

Kernel K-means

The K-means algorithm consists of the minimization of:

J(\mu_1, \dots, \mu_K) = \sum_{k=1}^{K} \sum_{x_j \in C_k} \|x_j - \mu_k\|^p, \quad \text{with} \quad \mu_k = \frac{1}{M_k} \sum_{x_j \in C_k} x_j

M_k: number of datapoints in cluster C_k.

Project into a feature space via \phi(x^i), i = 1 \dots M:

J(\mu_1, \dots, \mu_K) = \sum_{k=1}^{K} \sum_{x_j \in C_k} \|\phi(x_j) - \mu_k\|^p, \quad \text{with} \quad \mu_k = \frac{1}{M_k} \sum_{x_j \in C_k} \phi(x_j)

We cannot observe the mean in feature space.
Construct the mean in feature space using the image of the points in the same cluster.

Page 34

Kernel K-means

J(\mu_1, \dots, \mu_K) = \sum_{k=1}^{K} \sum_{x_j \in C_k} \|\phi(x_j) - \mu_k\|^2, \quad \text{with}

\|\phi(x_j) - \mu_k\|^2 = k(x_j, x_j) - \frac{2}{M_k} \sum_{x_l \in C_k} k(x_j, x_l) + \frac{1}{M_k^2} \sum_{x_i \in C_k} \sum_{x_l \in C_k} k(x_i, x_l)

Every term is expressed through the kernel function only; the mean \mu_k never needs to be computed explicitly in feature space.

Page 35

Kernel K-means

The kernel K-means algorithm is also an iterative procedure:

1. Initialization: pick K clusters.

2. Assignment step (E-step): assign each data point to its "closest" centroid by computing the distance in feature space:

\min_k d(x_i, C_k) = \min_k \left[ k(x_i, x_i) - \frac{2}{M_k} \sum_{x_j \in C_k} k(x_i, x_j) + \frac{1}{M_k^2} \sum_{x_j, x_l \in C_k} k(x_j, x_l) \right]

If a tie happens (i.e. two centroids are equidistant to a data point), one assigns the data point to the smallest winning centroid.

3. Update step (M-step): update the list of points belonging to each centroid.

4. Go back to step 2 and repeat the process until the clusters are stable.
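A short NumPy sketch (not from the slides) of the assignment step above, computed directly from the Gram matrix; the choice of an RBF kernel and its width are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def kernel_kmeans(X, K, sigma=0.5, n_iter=50, seed=0):
    """Kernel K-means with an RBF Gram matrix; X is (M, dim)."""
    M = len(X)
    G = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma**2))   # Gram matrix k(x_i, x_j)
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=M)                            # 1. random initial clusters
    for _ in range(n_iter):
        D = np.zeros((M, K))
        for k in range(K):
            idx = np.flatnonzero(labels == k)
            Mk = max(len(idx), 1)
            # d(x_i, C_k) = k(x_i,x_i) - 2/Mk * sum_j k(x_i,x_j) + 1/Mk^2 * sum_{j,l} k(x_j,x_l)
            D[:, k] = (np.diag(G) - 2.0 / Mk * G[:, idx].sum(axis=1)
                       + G[np.ix_(idx, idx)].sum() / Mk**2)
        new_labels = D.argmin(axis=1)                           # 2. assignment in feature space
        if np.array_equal(new_labels, labels):                  # 4. stop when stable
            break
        labels = new_labels                                     # 3. update cluster memberships
    return labels

theta = np.linspace(0, 2 * np.pi, 100)
ring = np.column_stack([np.cos(theta), np.sin(theta)])
blob = np.random.default_rng(2).normal(scale=0.1, size=(100, 2))
print(kernel_kmeans(np.vstack([ring, blob]), K=2))
```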

Page 36

Kernel K-means

\min_k d(x_i, C_k) = \min_k \left[ k(x_i, x_i) - \frac{2}{M_k} \sum_{x_j \in C_k} k(x_i, x_j) + \frac{1}{M_k^2} \sum_{x_j, x_l \in C_k} k(x_j, x_l) \right]

With an RBF kernel:
- k(x_i, x_i) is a constant of value 1.
- If x_i is close to all points in cluster k, the (normalized) second sum is close to 1.
- If the points are well grouped in cluster k, the last sum is close to 1.

With a homogeneous polynomial kernel?

Page 37

Kernel K-means

\min_k d(x_i, C_k) = \min_k \left[ k(x_i, x_i) - \frac{2}{M_k} \sum_{x_j \in C_k} k(x_i, x_j) + \frac{1}{M_k^2} \sum_{x_j, x_l \in C_k} k(x_j, x_l) \right]

With a polynomial kernel:
- k(x_i, x_i) is a positive value.
- Some of the terms of the second sum change sign depending on their position with respect to the origin.
- If the points are aligned in the same quadrant, the last sum is maximal.

Page 38

Kernel K-means: examples

[Figure: RBF kernel, 2 clusters]

Page 39

Kernel K-means: examples

[Figure: RBF kernel, 2 clusters]

Page 40

Kernel K-means: examples

[Figure: RBF kernel, 2 clusters; kernel width 0.5 vs. kernel width 0.05]

Page 41

Kernel K-means: examples

[Figure: polynomial kernel, 2 clusters]

Page 42

Kernel K-means: examples

[Figure: polynomial kernel (p=8), 2 clusters]

Page 43

Kernel K-means: examples

[Figure: polynomial kernel, 2 clusters; orders 2, 4 and 6]

Page 44

Kernel K-means: examples

Polynomial kernel, 2 clusters.

The separating line will always be perpendicular to the line passing through the origin (which is located at the mean of the datapoints) and parallel to the axis of the ordinates (because of the change in sign of the cosine function in the inner product).

No better than linear K-means!

Page 45

Kernel K-means: examples

Polynomial kernel, 4 clusters.

[Figure: solutions found with kernel K-means vs. solutions found with K-means]

Can only group datapoints that do not overlap across quadrants with respect to the origin (careful: the data are centered!).

No better than linear K-means! (except less sensitive to random initialization)

Page 46

Kernel K-means: Limitations

The choice of the number of clusters in kernel K-means is important.

Page 47

Kernel K-means: Limitations

The choice of the number of clusters in kernel K-means is important.

Page 48

Kernel K-means: Limitations

The choice of the number of clusters in kernel K-means is important.

Page 49

Limitations of kernel K-means

[Figure: raw data]

Page 50

Limitations of kernel K-means

[Figure: kernel K-means with K=3, RBF kernel]

Page 51

From Non-Linear Manifolds (Laplacian Eigenmaps, Isomap) to Spectral Clustering

Page 52

Non-Linear Manifolds

PCA and kernel PCA belong to a more general class of methods that create non-linear manifolds based on spectral decomposition. (Spectral decomposition of matrices is more frequently referred to as an eigenvalue decomposition.)

Depending on which matrix we decompose, we get a different set of projections:

• PCA decomposes the covariance matrix of the dataset → generates a rotation and projection in the original space.
• Kernel PCA decomposes the Gram matrix → partitions or regroups the datapoints.
• The Laplacian matrix is a matrix representation of a graph. Its spectral decomposition can be used for clustering.

Page 53

Embed Data in a Graph

• Build a similarity graph.
• Each vertex of the graph is a datapoint.

[Figure: original dataset and graph representation of the dataset]

Page 54

Measure Distances in Graph

Construct the similarity matrix S to denote whether points are close or far away, to weight the edges of the graph:

S = \begin{pmatrix} 0.9 & 0.8 & \cdots & 0.2 & 0.2 \\ \vdots & & & & \vdots \\ 0.2 & 0.2 & \cdots & 0.7 & 0.6 \end{pmatrix}

Page 55

Disconnected Graphs

S = \begin{pmatrix} 1 & 1 & \cdots & 0 & 0 \\ \vdots & & & & \vdots \\ 0 & 0 & \cdots & 1 & 1 \end{pmatrix}

Disconnected graph: two data points are connected if:
a) the similarity between them is higher than a threshold,
b) or if they are k-nearest neighbors (according to the similarity metric).

Page 56

Graph Laplacian

Given the similarity matrix S = \begin{pmatrix} 1 & 1 & \cdots & 0 & 0 \\ \vdots & & & & \vdots \\ 0 & 0 & \cdots & 1 & 1 \end{pmatrix},

construct the diagonal matrix D composed of the sum of each line of S:

D = \begin{pmatrix} \sum_i S_{1i} & 0 & \cdots & 0 \\ 0 & \sum_i S_{2i} & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \sum_i S_{Mi} \end{pmatrix}

and then build the Laplacian matrix:

L = D - S

L is positive semi-definite → spectral decomposition possible.
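A small NumPy sketch (not from the slides) that builds an RBF similarity matrix, the degree matrix D and the Laplacian L = D − S, and checks that L is positive semi-definite; the kernel width and the toy data are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def graph_laplacian(X, sigma=1.0):
    """Unnormalized graph Laplacian L = D - S from an RBF similarity matrix."""
    S = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma**2))   # full similarity matrix
    D = np.diag(S.sum(axis=1))                                  # degree matrix (row sums)
    return D - S, S

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.2, (30, 2)), rng.normal([5, 5], 0.2, (30, 2))])
L, S = graph_laplacian(X, sigma=0.5)
eigvals = np.linalg.eigvalsh(L)
print(eigvals[:4])   # all eigenvalues >= 0; the smallest is (numerically) zero
```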

Page 57

Graph Laplacian

Eigenvalue decomposition of the Laplacian matrix:

L = U \Lambda U^T

All eigenvalues of L are non-negative and the smallest eigenvalue of L is zero. If we order the eigenvalues by increasing order:

0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_M

If the graph has k connected components, then the eigenvalue \lambda = 0 has multiplicity k.

Page 58

Spectral Clustering

The multiplicity of the eigenvalue 0 determines the number of connected components in a graph. The associated eigenvectors identify these connected components.

For an eigenvalue \lambda_i = 0, the corresponding eigenvector e^i has the same value for all vertices in one component and a different value for each of the other components.

Identifying the clusters is then trivial when the similarity matrix is composed of zeros and ones (as when using k-nearest neighbors).

What happens when the similarity matrix is full?

Page 59

Spectral Clustering

Similarity map S : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}. S can either be binary (k-nearest neighbor) or continuous, with a Gaussian kernel:

S(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}}

S = \begin{pmatrix} 0.9 & 0.8 & \cdots & 0.2 & 0.2 \\ \vdots & & & & \vdots \\ 0.2 & 0.2 & \cdots & 0.7 & 0.6 \end{pmatrix}

Page 60

Spectral Clustering

Similarity map S : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}, either binary (k-nearest neighbor) or continuous with a Gaussian kernel S(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}}.

1) Build the Laplacian matrix L = D - S.
2) Do an eigenvalue decomposition of the Laplacian matrix: L = U \Lambda U^T.
3) Order the eigenvalues by increasing order: 0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_M.

The first eigenvalue is still zero, but with multiplicity 1 only (fully connected graph)!

Idea: the smallest eigenvalues are close to zero and hence also provide information on the partitioning of the graph (see exercise session).

Page 61

Spectral Clustering

Eigenvalue decomposition of the Laplacian matrix:

L = U \Lambda U^T, \quad U = \begin{pmatrix} e^1_1 & e^2_1 & \cdots & e^K_1 & \cdots \\ \vdots & \vdots & & \vdots & \\ e^1_M & e^2_M & \cdots & e^K_M & \cdots \end{pmatrix}

Construct an embedding of each of the M datapoints through x_i \to y_i:

y_i = \begin{pmatrix} e^1_i \\ \vdots \\ e^K_i \end{pmatrix}

Reduce the dimensionality by picking K < M projections, i = 1 \dots M.

With a clear partitioning of the graph, the entries in y are split into sets of equal values. Each group of points with the same value belongs to the same partition (cluster).

Page 62

Spectral Clustering

Eigenvalue decomposition of the Laplacian matrix: L = U \Lambda U^T, with the eigenvectors e^1, \dots, e^M as columns of U.

Construct an embedding of each of the M datapoints through x_i \to y_i, with y_i = \left(e^1_i, \dots, e^K_i\right)^T. Reduce the dimensionality by picking K < M projections.

When we have a fully connected graph, the entries in Y take any real value.

Page 63

Spectral Clustering

Example: 3 datapoints in a graph composed of 2 partitions (x_1 and x_2 connected, x_3 separate).

The similarity matrix is S = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.

L has eigenvalue \lambda = 0 with multiplicity two.

One solution for the two associated eigenvectors is:

e^1 = (1, 1, 0)^T, \quad e^2 = (0, 0, 1)^T

Another solution for the two associated eigenvectors is:

e^1 = (0.33, 0.33, 0.88)^T, \quad e^2 = (-0.1, -0.1, 0.99)^T

In both cases, the entries in the eigenvectors for the two first datapoints are equal:

y_1 = y_2 = (1, 0)^T for the 1st set of eigenvectors; \quad y_1 = y_2 = (0.33, -0.1)^T for the 2nd set of eigenvectors.

Page 64

Spectral Clustering

Example: 3 datapoints in a fully connected graph.

The similarity matrix is S = \begin{pmatrix} 1 & 0.9 & 0.02 \\ 0.9 & 1 & 0.02 \\ 0.01 & 0.02 & 1 \end{pmatrix}.

L has eigenvalue \lambda_1 = 0 with multiplicity 1. The second eigenvalue is small, \lambda_2 \approx 0.04, whereas the third one is large, \lambda_3 \approx 1.81, with associated eigenvectors (up to sign and scaling):

e^1 = (1, 1, 1)^T, \quad e^2 \approx (0.411, 0.404, -0.81)^T, \quad e^3 \approx (0.8, -0.7, 0.0)^T

Entries in the 2nd eigenvector for the two first datapoints are almost equal: the first two points have almost the same coordinates in the y embedding,

y_1 = (1, 0.41)^T, \quad y_2 = (1, 0.40)^T

Reduce the dimensionality by considering the smallest eigenvalues.

Page 65

Spectral Clustering

Example: 3 datapoints in a fully connected graph.

The similarity matrix is S = \begin{pmatrix} 1 & 0.9 & 0.8 \\ 0.9 & 1 & 0.7 \\ 0.8 & 0.7 & 1 \end{pmatrix}.

L has eigenvalue \lambda_1 = 0 with multiplicity 1. The second and third eigenvalues are both large, \lambda_2 \approx 2.23, \lambda_3 \approx 2.57, with associated eigenvectors (up to sign and scaling):

e^1 = (1, 1, 1)^T, \quad e^2 \approx (-0.21, -0.57, 0.79)^T, \quad e^3 \approx (0.78, -0.57, -0.21)^T

Entries in the 2nd eigenvector for the two first datapoints are no longer equal: the first two points no longer have the same coordinates in the y embedding,

y_1 = (1, -0.21)^T, \quad y_2 = (1, -0.57)^T

The 3rd point is now closer to the two other points.

MACHINE LEARNING – 2012

67

Spectral Clustering

1x

2x

3x

4x

5x

6x

12w

21w

1y 2y

3y

4y

5y

6y

Step 1:

Embedding in y

Idea: Points close to one another have almost the same coordinate on

the eigenvectors of L with small eigenvalues.

Step1: Do an eigenvalue decomposition of the Lagrange matrix L and

project the datapoints onto the first K eigenvectors with smallest eigenvalue

(hence reducing the dimensionality).

Page 67

Spectral Clustering

Step 2: Perform K-means on the set of vectors y_1, \dots, y_M.
Cluster the datapoints x according to their clustering in y.

[Figure: graph on datapoints x_1 ... x_6 and their embedding y_1 ... y_6]
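Putting the two steps together, here is a minimal, self-contained NumPy sketch of the whole spectral-clustering pipeline (similarity matrix → Laplacian → embedding on the eigenvectors with smallest eigenvalues → K-means on the embedding). It is not the course implementation; the RBF width, the plain K-means loop and the toy two-ring dataset are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def spectral_clustering(X, K, sigma=0.5, n_iter=100, seed=0):
    """Unnormalized spectral clustering: similarity -> Laplacian -> embedding -> K-means."""
    S = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma**2))   # similarity matrix
    L = np.diag(S.sum(axis=1)) - S                              # Laplacian L = D - S
    _, eigvecs = np.linalg.eigh(L)                              # eigenvalues in increasing order
    Y = eigvecs[:, :K]                                          # Step 1: rows are the y_i
    rng = np.random.default_rng(seed)
    mu = Y[rng.choice(len(Y), K, replace=False)]                # Step 2: K-means on the y_i
    for _ in range(n_iter):
        labels = cdist(Y, mu).argmin(axis=1)
        new_mu = np.array([Y[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return labels

# two concentric rings: plain K-means fails here, the spectral embedding separates them
rng = np.random.default_rng(1)
t = rng.uniform(0, 2 * np.pi, 200)
r = np.repeat([1.0, 4.0], 100)
X = np.column_stack([r * np.cos(t), r * np.sin(t)]) + rng.normal(scale=0.05, size=(200, 2))
labels = spectral_clustering(X, K=2)
print(labels[:10], labels[100:110])   # the two rings should get two different labels
```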

Page 68

Equivalency to other non-linear Embeddings

Spectral decomposition of the similarity matrix S (which is already positive semi-definite), with embedding

y_i = \begin{pmatrix} \lambda_1 e^1_i \\ \vdots \\ \lambda_K e^K_i \end{pmatrix}

In Isomap, the embedding is normalized by the eigenvalues and uses the geodesic distance to build the similarity matrix; see the supplementary material.

Page 69

Laplacian Eigenmaps

Solution to the optimization problem: \min_y y^T L y such that y^T D y = 1.
Ensures minimal distortion while preventing arbitrary scaling.

Solve the generalized eigenvalue problem:

L y = \lambda D y

The vectors y_i, i = 1 \dots M, form an embedding of the datapoints.

[Figure: Swiss-roll example; projections on each Laplacian eigenvector. Image courtesy of A. Singh.]

MACHINE LEARNING – 2012

72

Equivalency to other non-linear Embeddings

kernel PCA: Eigenvalue decomposition of the matrix of similarity S

TS UDU

The choice of parameters in kernel K-Means can be initialized by doing a

readout of the Gram matrix after kernel PCA.

Page 71

Kernel K-means and Kernel PCA

The optimization problem of kernel K-means is equivalent to:

\max_{H} \operatorname{tr}\left(H^T K H\right), \quad \text{with } H = Y D^{-1/2}

Y : M \times K, each entry of Y is 1 if the datapoint belongs to cluster k, otherwise zero.
D : K \times K, diagonal; each element on the diagonal is the number of datapoints in cluster k.

Since \operatorname{tr}\left(H^T K H\right) \le \sum_{i=1}^{M} \lambda_i, with \lambda_i the M eigenvalues resulting from the eigenvalue decomposition of the Gram matrix, one can look at the eigenvalues to determine the optimal number of clusters.

See the paper by M. Welling, supplementary document on the website.

Page 72

Kernel PCA projections can also help determine the kernel width.

[Figure: from top to bottom, kernel widths of 0.8, 1.5, 2.5]

Page 73

Kernel PCA projections can help determine the kernel width.

The sum of the eigenvalues grows as we get a better clustering.

Page 74

Quick Recap of Gaussian Mixture Model

Page 75

Clustering with Mixture of Gaussians

Alternative to K-means: soft partitioning with elliptic clusters instead of spheres.

[Figure: Clustering with mixtures of Gaussians using spherical Gaussians (left) and non-spherical Gaussians, i.e. with full covariance matrix (right). Notice how the clusters become elongated along the direction of the clusters; the grey circles represent the first and second variances of the distributions.]

Page 76

Gaussian Mixture Model (GMM)

Using a set of M N-dimensional training datapoints X = \{x^j\}_{j=1\dots M}, \; x^j \in \mathbb{R}^N,

the pdf of X will be modeled through a mixture of K Gaussians:

p(X \mid \theta) = \sum_{i=1}^{K} \pi_i \, p(X \mid \mu_i, \Sigma_i), \quad \text{with } p(X \mid \mu_i, \Sigma_i) = \mathcal{N}(\mu_i, \Sigma_i)

\mu_i, \Sigma_i: mean and covariance matrix of Gaussian i.

Mixing coefficients: \sum_{i=1}^{K} \pi_i = 1.

Probability that the data was explained by Gaussian i:

\pi_i = \frac{1}{M} \sum_{j=1}^{M} p(i \mid x^j)

Page 77

Gaussian Mixture Modeling

The parameters of a GMM are the means, covariance matrices and prior pdf:

\theta = \{\mu_1, \dots, \mu_K, \; \Sigma_1, \dots, \Sigma_K, \; \pi_1, \dots, \pi_K\}

Estimation of all the parameters can be done through Expectation-Maximization (E-M). E-M tries to find the optimum of the likelihood of the model given the data, i.e.:

\max_{\theta} L(\theta \mid X) = \max_{\theta} p(X \mid \theta)

See lecture notes for details.
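For concreteness, a compact NumPy sketch of E-M for a GMM with full covariance matrices (not the course code); the initialization, the fixed number of iterations and the small regularization added to the covariances are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    """Fit a K-component GMM to X (M, N) with E-M; returns (pi, mu, Sigma, log-likelihood)."""
    M, N = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(M, K, replace=False)]                # initialize means on random points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(N)] * K)
    pi = np.full(K, 1.0 / K)
    loglik = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities p(i | x_j)
        dens = np.column_stack([pi[i] * multivariate_normal.pdf(X, mu[i], Sigma[i])
                                for i in range(K)])
        loglik = np.log(dens.sum(axis=1)).sum()
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update priors, means and covariances
        Mk = r.sum(axis=0)
        pi = Mk / M
        mu = (r.T @ X) / Mk[:, None]
        for i in range(K):
            d = X - mu[i]
            Sigma[i] = (r[:, i, None] * d).T @ d / Mk[i] + 1e-6 * np.eye(N)
    return pi, mu, Sigma, loglik

rng = np.random.default_rng(3)
X = np.vstack([rng.multivariate_normal(m, c, 100)
               for m, c in [([0, 0], [[0.3, 0.2], [0.2, 0.3]]), ([4, 4], [[0.2, 0], [0, 0.2]])]])
pi, mu, Sigma, ll = gmm_em(X, K=2)
print(pi, mu, ll)
```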

Page 78

Gaussian Mixture Model

Page 79

Gaussian Mixture Model

GMM using 4 Gaussians with random initialization.

Page 80

Gaussian Mixture Model

Expectation-Maximization is very sensitive to initial conditions: GMM using 4 Gaussians with a new random initialization.

Page 81

Gaussian Mixture Model

Very sensitive to the choice of the number of Gaussians. The number of Gaussians can be optimized iteratively using AIC or BIC, as for K-means. Here, GMM using 8 Gaussians.

Page 82

Evaluation of Clustering Methods

Page 83

Evaluation of Clustering Methods

Clustering methods rely on hyperparameters:
• number of clusters
• kernel parameters

We need to determine the goodness of these choices.

Clustering is unsupervised classification: we do not know the real number of clusters nor the data labels. It is difficult to evaluate these choices without ground truth.

Page 84

Evaluation of Clustering Methods

Two types of measures: internal versus external measures.

Internal measures rely on a measure of similarity (e.g. intra-cluster distance versus inter-cluster distance).

E.g.: the Residual Sum of Squares is an internal measure (available in mldemos); it gives the squared distance of each vector from its centroid, summed over all vectors:

\mathrm{RSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2

Internal measures are problematic as the metric of similarity is often already optimized by the clustering algorithm.
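A one-function NumPy sketch (not from the slides) of the RSS computation above, given data, cluster labels and centroids (for instance those returned by the kmeans sketch shown earlier).

```python
import numpy as np

def rss(X, labels, mu):
    """Residual Sum of Squares: squared distance of each point to its centroid, summed."""
    return sum(np.sum((X[labels == k] - mu[k])**2) for k in range(len(mu)))
```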

Page 85

Evaluation of Clustering Methods

K-means, soft K-means and GMM have several hyperparameters (fixed number of clusters, beta, number of Gaussian functions).

Measures to determine how well the choice of hyperparameters fits the dataset (maximum-likelihood measures):

X: dataset; M: number of datapoints; B: number of free parameters; L: maximum likelihood of the model given the B parameters.

- Akaike Information Criterion: \mathrm{AIC} = -2 \ln L + 2B
- Bayesian Information Criterion: \mathrm{BIC} = -2 \ln L + B \ln M

The second term is a penalty for the increase in computational costs.

A lower BIC implies either fewer explanatory variables, a better fit, or both. As the number of datapoints (observations) increases, BIC assigns more weight to simpler models than AIC.

Choosing AIC versus BIC depends on the application: is the purpose of the analysis to make predictions, or to decide which model best represents reality? AIC may have better predictive ability than BIC, but BIC finds a computationally more efficient solution.
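A small sketch (not from the slides) of the two formulas above, applied to GMM model selection; the free-parameter count B for a full-covariance GMM and the reuse of the gmm_em sketch from the earlier slide are assumptions.

```python
import numpy as np

def gmm_free_params(K, N):
    """Assumed free parameters of a full-covariance GMM: K means, K covariances, K-1 priors."""
    return K * N + K * N * (N + 1) // 2 + (K - 1)

def aic_bic(loglik, B, M):
    aic = -2 * loglik + 2 * B
    bic = -2 * loglik + B * np.log(M)
    return aic, bic

# Example use (with the gmm_em sketch defined earlier): fit GMMs for several K and
# keep the model with the lowest BIC.
# for K in range(1, 8):
#     *_, ll = gmm_em(X, K)
#     print(K, aic_bic(ll, gmm_free_params(K, X.shape[1]), len(X)))
```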

Page 86

Evaluation of Clustering Methods

Two types of measures: internal versus external measures.

External measures assume that a subset of datapoints has class labels, and measure how well these datapoints are clustered.

One needs to have an idea of the classes and to have labeled some datapoints. This is interesting mainly in cases where labeling is highly time-consuming, e.g. when the data is very large (as in speech recognition).

Page 87

Evaluation of Clustering Methods

[Figure: raw data]

Page 88

Semi-Supervised Learning

Clustering F-measure (careful: similar to, but not the same as, the F-measure we will see for classification!):

M: number of datapoints; C: the set of classes; K: number of clusters; n_{ik}: number of members of class c_i that are in cluster k.

F(C, K) = \sum_{c_i \in C} \frac{|c_i|}{M} \max_k F(c_i, k)

F(c_i, k) = \frac{2 \, R(c_i, k) \, P(c_i, k)}{R(c_i, k) + P(c_i, k)}, \quad
R(c_i, k) = \frac{n_{ik}}{|c_i|}, \quad
P(c_i, k) = \frac{n_{ik}}{|k|}

The max picks, for each class, the cluster with the maximal number of its datapoints.
Recall R: proportion of datapoints correctly classified/clusterized.
Precision P: proportion of datapoints of the same class in the cluster.

Tradeoff between clustering correctly all datapoints of the same class in the same cluster and making sure that each cluster contains points of only one class.

Page 89

Evaluation of Clustering Methods

RSS with K-means can find the true optimal number of clusters but is very sensitive to random initialization (left and right: two different runs). RSS finds an optimum at K=4 and at K=5 for the right run.

Page 90

Evaluation of Clustering Methods

BIC (left) and AIC (right) perform very poorly here, splitting some clusters into two halves.

Page 91

Evaluation of Clustering Methods

BIC (left) and AIC (right) perform much better at picking the right number of clusters in GMM.

Page 92

Evaluation of Clustering Methods

[Figure: raw data]

Page 93

Evaluation of Clustering Methods

Optimization with BIC using K-means.

Page 94

Evaluation of Clustering Methods

Optimization with AIC using K-means. AIC tends to find more clusters.

Page 95

Evaluation of Clustering Methods

[Figure: raw data]

Page 96

Evaluation of Clustering Methods

Optimization with AIC using kernel K-means with an RBF kernel.

Page 97

Evaluation of Clustering Methods

Optimization with BIC using kernel K-means with an RBF kernel.

Page 98

Semi-Supervised Learning

[Figure: raw data, 3 classes]

Page 99

Semi-Supervised Learning

Clustering with RBF kernel K-means after optimization with BIC.

Page 100

Semi-Supervised Learning

After semi-supervised learning.

Page 101

Summary

We have seen several methods for extracting structure in data using the notion of kernel, and discussed their similarities, differences and complementarity:

• Kernel CCA is a generalization of kernel PCA to determine partial grouping in different dimensions.
• Kernel PCA can be used to bootstrap the choice of hyperparameters in kernel K-means.

We have compared the geometrical division of space yielded by RBF and polynomial kernels.

We have seen that simpler techniques than kernel K-means, such as K-means with norm-p and mixtures of Gaussians, can yield complex non-linear clustering.

Page 102

Summary

When to use what? E.g. K-means versus kernel K-means versus GMM.

When using any machine learning algorithm, you have to balance a number of factors:
• Computing time at training and at testing
• Number of open parameters
• Curse of dimensionality (order of growth with the number of datapoints, dimension, etc.)
• Robustness to initial conditions, optimality of the solution