
Spectral Clustering

by HU Pili

June 16, 2013

Outline

• Clustering Problem

• Spectral Clustering Demo

• Preliminaries

◦ Clustering: K-means Algorithm

◦ Dimensionality Reduction: PCA, KPCA.

• Spectral Clustering Framework

• Spectral Clustering Justification

• Other Spectral Embedding Techniques

Main reference: Hu 2012 [4]

2

Clustering Problem

Figure 1. Abstract your target using a feature vector (axes: weight vs. height).

3

Clustering Problem

Figure 2. Cluster the data points into K groups (K = 2 here).

4

Clustering Problem

Figure 3. Gain insights into your data (the clusters are labelled "Thin" and "Fat").

5

Clustering Problem

Figure 4. The cluster center is representative (knowledge).

6

Clustering Problem

Figure 5. Use the knowledge for prediction.

7

Review: Clustering

We learned the general steps of data mining / knowledge discovery using a clustering example:

1. Abstract your data in the form of vectors.

2. Run learning algorithms.

3. Gain insights / extract knowledge / make predictions.

We focus on step 2.

8

Spectral Clustering Demo

Figure 6. Data Scatter Plot

9

Spectral Clustering Demo

Figure 7. Standard K-Means

10

Spectral Clustering Demo

Figure 8. Our Sample Spectral Clustering

11

Spectral Clustering Demo

The algorithm is simple:

• K-Means:

  [idx, c] = kmeans(X, K);

• Spectral Clustering:

  epsilon = 0.7;
  D = dist(X');
  A = double(D < epsilon);
  [V, Lambda] = eigs(A, K);
  [idx, c] = kmeans(V, K);

12

Review: The Demo

The usual case in data mining:

• No weak algorithms

• Preprocessing is as important as algorithms

• The problem looks easier in another space (the secret is coming soon)

Transformation to another space:

• High to low: dimensionality reduction, low-dimension embedding, e.g. spectral clustering.

• Low to high, e.g. Support Vector Machine (SVM).

13

Secrets of Preprocessing

Figure 9. The similarity graph: connect points within an ε-ball.

D = dist(X'); A = double(D < epsilon);

14

Secrets of Preprocessing

Figure 10. 2-D embedding using the 2 largest eigenvectors

[V, Lambda] = eigs(A, K);

15

Secrets of Preprocessing

Figure 11. Even better after projecting onto the unit circle (not used in our sample, but more widely applicable; Brand 2003 [3]).

16

Secrets of Preprocessing

Figure 12. Angle histograms (panels: all points, cluster 1, cluster 2): the two clusters are concentrated and nearly perpendicular to each other.

17

Notations

• Data points: X_{n×N} = [x_1, x_2, …, x_N]. N points, each of n dimensions, organized in columns.

• Feature extraction: φ(x_i), n̂-dimensional.

• Eigenvalue decomposition: A = U Λ U^T. First d columns of U: U_d.

• Feature matrix: Φ_{n̂×N} = [φ(x_1), φ(x_2), …, φ(x_N)].

• Low-dimension embedding: Y_{N×d}. Embeds the N points into d-dimensional space.

• Number of clusters: K.

18

K-Means

• Initialize K centers m_1, …, m_K.

• Iterate until convergence (a code sketch follows this slide):

◦ Cluster assignment: C_i = argmin_j ‖x_i − m_j‖ for all data points, 1 ≤ i ≤ N.

◦ Update clusters: S_j = {i : C_i = j}, 1 ≤ j ≤ K.

◦ Update centers: m_j = (1/|S_j|) Σ_{i∈S_j} x_i.
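A minimal MATLAB sketch of this loop (naive_kmeans is a hypothetical name; this is an illustration, not the toolbox kmeans used in the demo; it assumes implicit expansion, available since R2016b, and does not handle empty clusters):

% Naive K-means (Lloyd's algorithm). X is n-by-N with points in columns.
function [C, m] = naive_kmeans(X, K, iters)
  N = size(X, 2);
  m = X(:, randperm(N, K));                % initialize centers with K random points
  for t = 1:iters
    D2 = zeros(K, N);
    for j = 1:K
      D2(j, :) = sum((X - m(:, j)).^2, 1); % squared distances to center j
    end
    [~, C] = min(D2, [], 1);               % assignment: C(i) = argmin_j ||x_i - m_j||
    for j = 1:K
      if any(C == j)
        m(:, j) = mean(X(:, C == j), 2);   % update: m_j = mean of cluster j
      end
    end
  end
end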

19

Remarks: K-Means

1. A chicken-and-egg problem.

2. How to initialize the centers?

3. How to determine the hyperparameter K?

4. The decision boundary between two centers is a straight line:

• C_i = argmin_j ‖x_i − m_j‖

• The boundary between clusters i and j satisfies ‖x − m_i‖ = ‖x − m_j‖.

• Squaring and expanding both sides gives m_i^T m_i − m_j^T m_j = 2 x^T (m_i − m_j), which is linear in x.

We address the 4th point by transforming the data into a space where a straight-line boundary is enough.

20

Principal Component Analysis

Figure 13. Error minimization formulation (legend: data, PC, error).

21

Principal Component Analysis

Assume x_i is already centered (easy to preprocess). Project the points onto the space spanned by U_{n×d} with minimum error:

min_{U ∈ R^{n×d}} J(U) = Σ_{i=1}^{N} ‖U U^T x_i − x_i‖²

s.t. U^T U = I

22

Principal Component Analysis

Since ‖U U^T x_i − x_i‖² = ‖x_i‖² − x_i^T U U^T x_i when U^T U = I, minimizing the error is equivalent to trace maximization:

max_{U ∈ R^{n×d}} Tr[U^T (X X^T) U]

s.t. U^T U = I

This is a standard problem in matrix theory:

• The solution U is given by the d largest eigenvectors of X X^T (those corresponding to the largest eigenvalues), i.e. X X^T U = U Λ.

• We usually denote Σ_{n×n} = X X^T, because X X^T can be interpreted as the covariance matrix of the n variables. (Again, note that X is centered.)

23

Principal Component Analysis

About U U^T x_i:

• x_i: the data point in the original n-dimensional space.

• U_{n×d}: the d-dimensional space (its axes) expressed in the coordinates of the original n-D space: the principal axes.

• U^T x_i: the coordinates of x_i in the d-dimensional space; the d-dimensional embedding; the projection of x_i onto the d-D space expressed in d-D coordinates: the principal components.

• U U^T x_i: the projection of x_i onto the d-D space, expressed in n-D coordinates.

24

Principal Component Analysis

From principal axes to principal components:

• One data point: U^T x_i

• Matrix form: (Y^T)_{d×N} = U^T X

Relation between covariance and similarity:

(X^T X) Y = X^T X (U^T X)^T

= X^T X X^T U

= X^T U Λ

= Y Λ

Observation: the columns of Y are eigenvectors of X^T X.

25

Principal Component Analysis

Two operational approaches:

• Decompose (X X^T)_{n×n} and let (Y^T)_{d×N} = U^T X.

• Decompose (X^T X)_{N×N} and get Y directly.

Implications:

• In practice, decompose whichever matrix is smaller (see the sketch below).

• X^T X hints that we can do more with this structure.

Remarks:

• The principal components, i.e. Y, the d-dimensional embedding, are what we want in most cases; e.g. we can cluster on the coordinates given by Y.
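A small MATLAB sketch of the two routes (a sketch under the slide's assumptions: X is n-by-N and already centered, d is the target dimension; eigenvector signs may differ between the routes, which is harmless):

% Route 1: decompose the n-by-n covariance XX', then project.
[U, L1] = eigs(X * X', d);     % largest d eigenvectors of XX'
Y1 = X' * U;                   % N-by-d embedding, Y = X'U (i.e. Y' = U'X)

% Route 2: decompose the N-by-N matrix X'X directly.
[V, L2] = eigs(X' * X, d);     % unit-norm eigenvectors of X'X
Y2 = V;

% L1 and L2 share the same nonzero eigenvalues, and up to column signs
% Y1 = Y2 * sqrt(L2): route 2 gives the same embedding up to scaling.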

26

Kernel PCA

Settings:

• A feature extraction function:

φ: R^n → R^n̂

It maps the original n-dimensional features to n̂ dimensions.

• Matrix organization:

Φ_{n̂×N} = [φ(x_1), φ(x_2), …, φ(x_N)]

• Now Φ plays the role of the “data points” in the PCA formulation. Try to embed them into d dimensions.

27

Kernel PCA

According to the analysis of PCA, we can operate on:

• (Φ Φ^T)_{n̂×n̂}: the covariance matrix of the features.

• (Φ^T Φ)_{N×N}: the similarity / affinity matrix (in spectral clustering language); the Gram / kernel matrix (in kernel PCA language).

The observation:

• n̂ can be very large, e.g. φ: R^n → R^∞.

• (Φ^T Φ)_{i,j} = φ^T(x_i) φ(x_j). We do not need an explicit φ; we only need k(x_i, x_j) = φ^T(x_i) φ(x_j).

28

Kernel PCA

k(x_i, x_j) = φ^T(x_i) φ(x_j) is the “kernel”. One important property, by definition:

• k(x_i, x_j) is a positive semidefinite function.

• K is a positive semidefinite matrix.

Some example kernels:

• Linear: k(x_i, x_j) = x_i^T x_j. Degrades to standard PCA.

• Polynomial: k(x_i, x_j) = (1 + x_i^T x_j)^p

• Gaussian: k(x_i, x_j) = e^{−‖x_i − x_j‖² / (2σ²)} (a KPCA sketch with this kernel follows)
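A hedged MATLAB sketch of kernel PCA with the Gaussian kernel, including the centering step discussed on the next slide (the “double centering” of [2], [4]); sigma and d are illustrative choices, not values from the slides:

% X is n-by-N (points in columns); requires implicit expansion (R2016b+).
sigma = 1.0;  d = 2;
N  = size(X, 2);
sq = sum(X.^2, 1);
D2 = sq' + sq - 2 * (X' * X);           % N-by-N squared pairwise distances
Kmat = exp(-D2 / (2 * sigma^2));        % Gaussian kernel (Gram) matrix
J  = eye(N) - ones(N) / N;              % centering matrix
Kc = J * Kmat * J;                      % double centering: centers the features implicitly
[V, Lambda] = eigs(Kc, d);              % largest d eigenpairs
Y = V * sqrt(Lambda);                   % one common convention for the embedding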

29

Remarks: KPCA

• Avoids explicit construction of high-dimensional (maybe infinite-dimensional) features.

• Enables a research direction: kernel engineering.

• The above discussion assumes Φ is centered! See Bishop 2006 [2] for how to center this matrix using only the kernel function (or the “double centering” technique in [4]).

• Out-of-sample embedding is the real difficulty; see Bengio 2004 [1].

30

Review: PCA and KPCA

• Minimum error formulation of PCA

• Two equivalent implementation approaches:

◦ covariance matrix

◦ similarity matrix

• The similarity matrix is more convenient to manipulate and leads to KPCA.

• The kernel is positive semi-definite (PSD) by definition: K = Φ^T Φ.

31

Spectral Clustering Framework

A bold guess:

• Decomposing K = Φ^T Φ gives a good low-dimension embedding. The inner product measures similarity, i.e. k(x_i, x_j) = φ^T(x_i) φ(x_j), so K is a similarity matrix.

• In the computation, we actually never look at Φ.

• We can specify K directly and perform EVD:

X_{n×N} → K_{N×N}

• What if we directly give a similarity measure K, without the PSD constraint?

That leads to general spectral clustering.

32

Spectral Clustering Framework

1. Get a similarity matrix A_{N×N} from the data points X. (A: affinity matrix; adjacency matrix of a graph; similarity graph.)

2. EVD: A = U Λ U^T. Use U_d (or a post-processed version, see [4]) as the d-dimensional embedding.

3. Perform clustering on the d-D embedding.

33

Spectral Clustering Framework

Review our naive spectral clustering demo:

1. epsilon = 0.7;
   D = dist(X');
   A = double(D < epsilon);

2. [V, Lambda] = eigs(A, K);

3. [idx, c] = kmeans(V, K);

34

Remarks: SC Framework

• We started by relaxing A (i.e. K) in KPCA.

• Losing PSD == losing the KPCA justification? Not exactly: a shift A′ = A + σI is PSD for large enough σ and has the same eigenvectors.

• The real tricks (see [4], Section 2, for details):

◦ How to form A?

◦ Decompose A or a variant of A (e.g. L = D − A).

◦ Use the EVD result directly (e.g. U) or a variant (e.g. U Λ^{1/2}).

35

Similarity graph

Input is high-dimensional data (e.g. it comes in the form of X):

• k-nearest neighbour

• ε-ball

• mutual k-NN

• complete graph (with Gaussian kernel weights; this and the ε-ball graph are sketched below)
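For concreteness, a MATLAB sketch of two of these constructions: the ε-ball graph used in the demo and a complete graph with Gaussian weights (epsilon and sigma are illustrative values):

% X is n-by-N; squared pairwise distances first (implicit expansion, R2016b+).
sq = sum(X.^2, 1);
D2 = sq' + sq - 2 * (X' * X);

% epsilon-ball graph (unweighted), as in the demo:
epsilon = 0.7;
A_eps = double(D2 < epsilon^2);
A_eps = A_eps - diag(diag(A_eps));        % drop self-loops

% complete graph with Gaussian kernel weights:
sigma = 1.0;
A_gauss = exp(-D2 / (2 * sigma^2));
A_gauss = A_gauss - diag(diag(A_gauss));  % optionally zero the diagonal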

36

Similarity graph

Input is a distance matrix: (D^(2))_{i,j} is the squared distance between points i and j (it may not come from raw x_i, x_j).

c = [x_1^T x_1, …, x_N^T x_N]^T

D^(2) = c 1^T + 1 c^T − 2 X^T X

J = I − (1/N) 1 1^T

X^T X = −(1/2) J D^(2) J

Remarks:

• See MDS (classical multidimensional scaling); a code sketch follows.
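A minimal MATLAB sketch of this recovery (double centering, as in classical MDS); D2 is assumed to hold the squared distances and d is the target dimension:

% D2 is the N-by-N matrix of squared distances (the D^(2) above).
N = size(D2, 1);
J = eye(N) - ones(N) / N;        % centering matrix J = I - (1/N) 1 1'
G = -0.5 * J * D2 * J;           % recovered Gram matrix, plays the role of X'X
[V, Lambda] = eigs(G, d);        % largest d eigenpairs
Y = V * sqrt(Lambda);            % classical MDS embedding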

37

Similarity graph

Input is a graph:

• Just use it.

• Or do some enhancement, e.g. geodesic distance. See [4], Section 2.1.3, for some possible methods.

After getting the graph (for whichever input):

• Adjacency matrix A; Laplacian matrix L = D − A (D is the degree matrix).

• Normalized versions:

◦ A_left = D^{−1} A, A_sym = D^{−1/2} A D^{−1/2}

◦ L_left = D^{−1} L, L_sym = D^{−1/2} L D^{−1/2}

38

EVD of the Graph

Matrix types:

• Adjacency series: use the eigenvectors of the largest eigenvalues.

• Laplacian series: use the eigenvectors of the smallest eigenvalues (a sketch follows).
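A hedged MATLAB sketch of the Laplacian route (one standard variant, not exactly our demo, which decomposed A directly; the 'smallestabs' option name needs a recent MATLAB release):

% A is a symmetric N-by-N affinity matrix; K is the number of clusters.
D = diag(sum(A, 2));                  % degree matrix
L = D - A;                            % unnormalized graph Laplacian
[V, ~] = eigs(L, K, 'smallestabs');   % eigenvectors of the K smallest eigenvalues
[idx, c] = kmeans(V, K);              % cluster the rows of the embedding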

39

Remarks: SC Framework

• There are many possibilities in the construction of the similarity matrix and in the post-processing of the EVD.

• Not all of these combinations have justifications.

• Once a combination is shown to work, it may not be very hard to find a justification.

• Existing works actually start from formulations of very different flavours.

• Only one common property: they all involve EVD, a.k.a. “spectral analysis”; hence the name.

40

Spectral Clustering Justification

• Cut-based argument (the mainstream; the origin)

• Random walk escaping probability

• Commute time: L^{−1} encodes the effective resistance (this is where U Λ^{−1/2} comes from).

• Low-rank approximation.

• Density estimation.

• Matrix perturbation.

• Polarization. (the demo)

See [4] for pointers.

41

Cut Justification

Normalized Cut (Shi 2000 [5]):

NCut = Σ_{i=1}^{K} cut(C_i, V − C_i) / vol(C_i)

With the characteristic vector χ_i ∈ {0, 1}^N for C_i:

NCut = Σ_{i=1}^{K} (χ_i^T L χ_i) / (χ_i^T D χ_i)

42

Relax χ_i to real values:

min_{v_i ∈ R^N} Σ_{i=1}^{K} v_i^T L v_i

s.t. v_i^T D v_i = 1

v_i^T v_j = 0, ∀ i ≠ j

This is the generalized eigenvalue problem (solved directly in the snippet below):

L v = λ D v

Equivalent to EVD on:

L_left = D^{−1} L
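In MATLAB the relaxed problem can be attacked directly as a generalized eigenproblem (a sketch; the rows of V are then clustered, as in Shi and Malik [5]):

% L = D - A; solve L v = lambda D v for the K smallest eigenvalues.
[V, ~] = eigs(L, D, K, 'smallestabs');
[idx, c] = kmeans(V, K);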

43

Matrix Perturbation Justification

• When the graph is ideally separable, i.e. it has multiple connected components, A and L have characteristic (or piecewise linear) eigenvectors.

• When it is not ideally separable but a sparse cut exists, A can be viewed as an ideally separable matrix plus a small perturbation.

• A small perturbation of the matrix entries leads to a small perturbation of the eigenvectors.

• The eigenvectors are then not too far from piecewise linear: easy to separate with simple algorithms like K-Means.

44

Low Rank Approximation

The similarity matrix A is generated by inner products in some unknown space that we want to recover. We want to minimize the recovery error:

min_{Y ∈ R^{N×d}} ‖A − Y Y^T‖_F²

This is the standard low-rank approximation problem, which leads to the EVD of A:

Y = U Λ^{1/2}
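In code this is just an EVD plus a scaling (a sketch; it assumes the retained eigenvalues are nonnegative, so the square root is real):

[U, Lambda] = eigs(A, d);    % largest d eigenpairs of the similarity matrix
Y = U * sqrt(Lambda);        % rank-d factor with A approximately Y * Y'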

45

Spectral Embedding Techniques

See [4] for some pointers: MDS, Isomap, PCA, KPCA, LLE, LEmap, HEmap, SDE, MVE, SPE. The difference, as said, lies mostly in the construction of A.

46

Bibliography

[1] Y. Bengio, J. Paiement, P. Vincent, O. Delalleau, N. Le Roux, and M. Ouimet. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. Advances in Neural Information Processing Systems, 16:177–184, 2004.

[2] C. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

[3] M. Brand and K. Huang. A unifying theorem for spectral embedding and clustering. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003.

[4] P. Hu. Spectral clustering survey, May 2012.

[5] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

47

Thanks

Q/A

Some supplementary slides for details are attached.

48

SVD and EVD

Definitions of Singular Value Decomposition (SVD):

X_{n×N} = U_{n×k} Σ_{k×k} V_{N×k}^T

Definition of Eigenvalue Decomposition (EVD):

A = X^T X

A = U Λ U^T

Relations:

X^T X = V Σ² V^T

X X^T = U Σ² U^T
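A quick numeric check of these relations (illustrative only; the matrix sizes are arbitrary):

X = randn(5, 8);
[U, S, V] = svd(X, 'econ');
norm(X' * X - V * S.^2 * V', 'fro')   % ~ 0:  X'X = V Sigma^2 V'
norm(X * X' - U * S.^2 * U', 'fro')   % ~ 0:  XX' = U Sigma^2 U'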

49

Remarks:

• SVD requires U^T U = I, V^T V = I and σ_i > 0 (Σ = diag(σ_1, …, σ_k)). This is to guarantee uniqueness of the solution.

• EVD does not have such constraints; any U and Λ satisfying A U = U Λ are OK. The requirement U^T U = I is likewise there to guarantee uniqueness of the solution (e.g. in PCA). Another benefit is the numerical stability of the subspace spanned by U: an orthogonal layout is more error-resilient.

• The computation of SVD can be done via EVD.

• Watch out for the terms and the objects they refer to.

50

Out of Sample Embedding

• A new data point x ∈ R^n that is not in X: how do we find its low-dimension embedding y ∈ R^d?

• In PCA, we have the principal axes U (X X^T = U Λ U^T). Out-of-sample embedding is simple: y = U^T x.

• U_{n×d} is actually a compact representation of the knowledge.

• In KPCA and the different variants of SC, we operate on the similarity graph and do not have such a compact representation. It is thus hard to compute the out-of-sample embedding explicitly.

• See [1] for some research on this.

51

Gaussian Kernel

The Gaussian kernel (let τ = 1/(2σ²)):

k(x_i, x_j) = e^{−‖x_i − x_j‖² / (2σ²)} = e^{−τ ‖x_i − x_j‖²}

Use the Taylor expansion:

e^x = Σ_{k=0}^{∞} x^k / k!

Rewrite the kernel:

k(x_i, x_j) = e^{−τ (x_i − x_j)^T (x_i − x_j)}

= e^{−τ x_i^T x_i} · e^{−τ x_j^T x_j} · e^{2τ x_i^T x_j}

52

Focus on the last factor:

e^{2τ x_i^T x_j} = Σ_{k=0}^{∞} (2τ x_i^T x_j)^k / k!

It is hard to write out the explicit form when x_i ∈ R^n, n > 1. We demonstrate the case n = 1, where x_i and x_j are single variables:

e^{2τ x_i x_j} = Σ_{k=0}^{∞} (2τ)^k x_i^k x_j^k / k! = Σ_{k=0}^{∞} c(k) · x_i^k x_j^k

53

The feature vector is (infinite-dimensional):

φ(x) = e^{−τ x²} [√c(0), √c(1) x, √c(2) x², …, √c(k) x^k, …]

Verify that:

k(x_i, x_j) = φ(x_i) · φ(x_j)

This shows that the Gaussian kernel implicitly maps 1-D data to an infinite-dimensional feature space.
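A quick numeric check of this expansion for 1-D data, truncating the infinite feature vector at 20 terms (purely illustrative; the values of tau, xi, xj are arbitrary):

tau = 0.5;  xi = 0.3;  xj = -0.7;
kexact = exp(-tau * (xi - xj)^2);               % the Gaussian kernel value
k = 0:20;                                       % truncate the series
c = (2 * tau).^k ./ factorial(k);               % the coefficients c(k)
phi = @(x) exp(-tau * x^2) * (sqrt(c) .* x.^k); % truncated feature map phi(x)
kapprox = phi(xi) * phi(xj)';                   % inner product of feature vectors
% kexact and kapprox agree to machine precision for these values.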

54
