Graph signal processing for clustering
Nicolas Tremblay
PANAMA Team, INRIA Rennes, with Remi Gribonval; Signal Processing Laboratory 2, EPFL, Lausanne, with Pierre Vandergheynst.
Outline:
• Introduction
• Graph signal processing...
• ... applied to clustering
• Conclusion
N. Tremblay, Graph signal processing for clustering, Rennes, 13th of January 2016
What's clustering?
Given a series of N objects:
1/ Find adapted descriptors
2/ Cluster
After step 1, one has:
• N vectors in d dimensions (the descriptor dimension): x1, x2, · · · , xN ∈ R^d
• and their distance matrix D ∈ R^(N×N).

The goal of clustering is to assign a label c(i) ∈ {1, · · · , k} to each object i, in order to organize / simplify / analyze the data.

There exist two general types of methods:
• methods directly based on the xi and/or D, like k-means or hierarchical clustering;
• graph-based methods.
Graph construction from the distance matrix D

Create a graph G = (V, E):
• each node in V is one of the N objects;
• a pair of nodes (i, j) is connected if the associated distance D(i, j) is small enough.

For example, two connectivity possibilities:
• Gaussian kernel:
1. connect all pairs of nodes with links of weight exp(−D(i, j)/σ);
2. remove all links of weight smaller than ε.
• k nearest neighbors: connect each node to its k nearest neighbors.
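As a sketch, the two constructions above can be written in a few lines of NumPy/SciPy (function names and parameter values are ours, for illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def gaussian_graph(X, sigma, eps):
    """Gaussian-kernel graph: weights exp(-D(i,j)/sigma), then threshold at eps."""
    D = squareform(pdist(X))      # N x N distance matrix
    W = np.exp(-D / sigma)
    np.fill_diagonal(W, 0)        # no self-loops
    W[W < eps] = 0                # remove all links weaker than eps
    return W

def knn_graph(X, k):
    """k-nearest-neighbor graph (symmetrized, unit weights)."""
    D = squareform(pdist(X))
    np.fill_diagonal(D, np.inf)
    W = np.zeros_like(D)
    idx = np.argsort(D, axis=1)[:, :k]          # k closest nodes per row
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, idx.ravel()] = 1
    return np.maximum(W, W.T)                   # make the graph undirected
```

Both return a symmetric adjacency matrix W with an empty diagonal, ready for the spectral methods that follow.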
The problem now states: given the graph G representing the similarity between the N objects, find a partition of all nodes into k clusters.

Many methods exist [Fortunato '10]:
• modularity (or other cost-function) optimisation methods [Newman '04]
• random walk methods [Delvenne '10]
• methods inspired from statistical physics [Krzakala '12], information theory [Rosvall '07]...
• spectral methods
• ...
Three useful matrices

The adjacency matrix:
W =
 0  1  1  0
 1  0  1  1
 1  1  0  0
 0  1  0  0

The degree matrix:
S =
 2  0  0  0
 0  3  0  0
 0  0  2  0
 0  0  0  1

The Laplacian matrix:
L = S − W =
  2 −1 −1  0
 −1  3 −1 −1
 −1 −1  2  0
  0 −1  0  1
The same three matrices, for a weighted graph:

W =
  0  .5  .5   0
 .5   0  .5   4
 .5  .5   0   0
  0   4   0   0

S =
 1  0  0  0
 0  5  0  0
 0  0  1  0
 0  0  0  4

L = S − W =
   1  −.5  −.5    0
 −.5    5  −.5   −4
 −.5  −.5    1    0
   0   −4    0    4
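These matrices are quick to check numerically; by construction, every row of L sums to zero:

```python
import numpy as np

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)

S = np.diag(W.sum(axis=1))   # degree matrix
L = S - W                    # combinatorial Laplacian

print(np.diag(S))            # degrees: [2. 3. 2. 1.]
print(L.sum(axis=1))         # each row of L sums to 0: [0. 0. 0. 0.]
```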
The classical spectral clustering algorithm [Von Luxburg '06]:

Given the N-node graph G of adjacency matrix W:
1. Compute Uk = (u1|u2| · · · |uk), the first k eigenvectors of L = S − W.
2. Consider each node i as a point in R^k: fi = Uk^⊤ δi.
3. Run k-means with the Euclidean distance Dij = ||fi − fj|| and obtain k clusters.
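A minimal dense sketch of these three steps, using SciPy's eigensolver and scikit-learn's k-means (unnormalized Laplacian, as on this slide; real implementations use sparse solvers):

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """Classical spectral clustering on adjacency matrix W (dense sketch)."""
    L = np.diag(W.sum(axis=1)) - W
    # step 1: first k eigenvectors of L (the k smallest eigenvalues)
    _, Uk = eigh(L, subset_by_index=[0, k - 1])
    # step 2: row i of Uk is the point f_i in R^k
    # step 3: k-means on those rows
    return KMeans(n_clusters=k, n_init=10).fit_predict(Uk)
```

On two disjoint cliques, for instance, this recovers the two components exactly.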
What's the point of using a graph?

N points in d = 2 dimensions. Result with k-means (k = 2): [figure]

After creating a graph from the N points' interdistances, and running the spectral clustering algorithm (with k = 2): [figure]
Computation bottlenecks of the spectral clustering algorithm

When N and/or k become too large, there are two main bottlenecks in the algorithm:
1. the partial eigendecomposition of the Laplacian;
2. k-means.

Our goal: circumvent both!
What's graph signal processing?
What's a graph signal?

[Figure: example graph signals on a sensor network — monthly temperatures (°C) over the year (Jan–Dec), and hourly temperatures (°C) over a day.]
What's the graph Fourier matrix? [Hammond '11]

The "classical" graph (a ring):
Lcl =
  2 −1  0 · · ·  0 −1
 −1  2 −1 · · ·  0  0
  ⋮
  0  0  0 · · ·  2 −1
 −1  0  0 · · · −1  2

All classical Fourier modes are the eigenvectors of Lcl.

Any graph, with Laplacian L: by analogy, any graph's Fourier modes are the eigenvectors of its Laplacian matrix L.
The graph Fourier matrix

L = S − W

Its eigenvectors U = (u1|u2| · · · |uN) form the graph Fourier orthonormal basis.

Its eigenvalues 0 = λ1 ≤ λ2 ≤ · · · ≤ λN represent the graph frequencies: λi is the squared frequency associated to the Fourier mode ui.
Illustration

[Figure: a low-frequency Fourier mode vs. a high-frequency Fourier mode on the same graph.]
The Fourier transform
• Given f ∈ R^N a signal on a graph of size N.
• Its Fourier transform f̂ is obtained by decomposing f on the eigenvectors ui:
f̂ = (< u1, f >, < u2, f >, · · · , < uN, f >)^⊤, i.e. f̂ = U^⊤ f.
• Inversely, the inverse Fourier transform reads: f = U f̂.
• The Parseval theorem stays valid: ∀(g, h), < g, h > = < ĝ, ĥ >.
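The transform pair and Parseval's identity can be verified numerically, here on the 4-node example graph from earlier (`np.linalg.eigh` returns the orthonormal basis U and the frequencies λ):

```python
import numpy as np

# the 4-node example graph from earlier in the talk
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W

lam, U = np.linalg.eigh(L)     # graph frequencies and Fourier basis U

f = np.array([1.0, 2.0, 0.5, -1.0])   # an arbitrary graph signal
f_hat = U.T @ f                        # forward graph Fourier transform
f_rec = U @ f_hat                      # inverse transform

assert np.allclose(f, f_rec)               # perfect reconstruction
assert np.isclose(f @ f, f_hat @ f_hat)    # Parseval
```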
Filtering

Given a filter function g defined in the Fourier space. [Figure: an example filter g(λ) over the spectrum [0, 2].]

In the Fourier space, the signal filtered by g reads:
f̂g = (f̂(1) g(λ1), f̂(2) g(λ2), · · · , f̂(N) g(λN))^⊤ = Ĝ f̂, with Ĝ = diag(g(λ1), g(λ2), · · · , g(λN)).

In the node space, the filtered signal fg therefore reads:
fg = U Ĝ U^⊤ f = G f.
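A sketch of the identity fg = U Ĝ U^⊤ f, with an exponential low-pass kernel chosen arbitrarily for illustration:

```python
import numpy as np

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W
lam, U = np.linalg.eigh(L)

g = lambda x: np.exp(-2.0 * x)   # an arbitrary low-pass kernel g(lambda)

G = U @ np.diag(g(lam)) @ U.T    # filter matrix in the node space
f = np.array([1.0, -1.0, 2.0, 0.0])
f_g = G @ f                      # filtered signal

# identical result computed entirely in the Fourier space
assert np.allclose(f_g, U @ (g(lam) * (U.T @ f)))
```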
So where's the link?
Remember: the classical spectral clustering algorithm

Given the N-node graph G of adjacency matrix W:
1. Compute Uk = (u1|u2| · · · |uk), the first k eigenvectors of L = S − W.
2. Consider each node i as a point in R^k: fi = Uk^⊤ δi.
3. Run k-means with the Euclidean distance Dij = ||fi − fj|| and obtain k clusters.

Let's work on the first bottleneck: estimate Dij without partially diagonalizing the Laplacian matrix.
Ideal low-pass filtering. 1st step: assume we know Uk and λk.

Given hλk an ideal low-pass filter with cutoff λk, Hλk = U Ĥλk U^⊤ = Uk Uk^⊤ is its filter matrix. [Figure: the ideal low-pass filter hλk(λ), equal to 1 on [0, λk] and 0 beyond.]

Let R = (r1|r2| · · · |rη) ∈ R^(N×η) be a random Gaussian matrix.

We define f̃i = (Hλk R)^⊤ δi ∈ R^η and D̃ij = ||f̃i − f̃j||.

Norm conservation theorem for the ideal filter:
Let ε > 0. If η > η0 ∼ log N / ε², then, with probability > 1 − 1/N, we have:
∀(i, j) ∈ [1, N]², (1 − ε) Dij ≤ D̃ij ≤ (1 + ε) Dij.
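A numerical sketch of this distance-preservation claim, using the exact projector Uk Uk^⊤ as the ideal filter (the graph, η, and the 1/√η scaling that makes the sketched distance an unbiased estimate of the spectral one are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: two 10-node cliques joined by a single edge
N, k = 20, 2
W = np.zeros((N, N))
W[:10, :10] = 1.0
W[10:, 10:] = 1.0
np.fill_diagonal(W, 0)
W[9, 10] = W[10, 9] = 1.0
L = np.diag(W.sum(axis=1)) - W

lam, U = np.linalg.eigh(L)
Uk = U[:, :k]                        # exact spectral features (row i = f_i)

# Ideal low-pass filtering of eta random Gaussian signals:
# H = Uk Uk^T, sketched features are the rows of H R (rescaled)
eta = 200
R = rng.standard_normal((N, eta))
F = (Uk @ Uk.T @ R) / np.sqrt(eta)

D_exact = np.linalg.norm(Uk[0] - Uk[15])    # spectral distance
D_sketch = np.linalg.norm(F[0] - F[15])     # sketched distance
print(D_sketch / D_exact)                   # close to 1
```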
Non-ideal low-pass filtering. 2nd step: assume all we know is λk.

In practice, we use a polynomial approximation of order m of hλk:
h̃λk(λ) = ∑_{l=1}^{m} αl λ^l ≃ hλk(λ).

[Figure: the ideal filter and its polynomial approximations of orders m = 100, 20 and 5.]

Indeed, in this case, filtering a vector x reads:
H̃λk x = U h̃λk(Λ) U^⊤ x = U (∑_{l=1}^{m} αl Λ^l) U^⊤ x = ∑_{l=1}^{m} αl L^l x.

• Does not require the knowledge of Uk.
• Only involves matrix-vector multiplications [costs O(m|E|)].

The theorem stays (more or less) valid with this non-ideal filtering!
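A sketch of the identity H̃λk x = ∑_l αl L^l x. The slide does not specify the approximation scheme (the associated papers use Chebyshev polynomials); plain least squares on a grid is used here as a simpler stand-in:

```python
import numpy as np

def poly_filter_apply(L, x, alpha):
    """Apply sum_{l=1}^m alpha[l-1] * L^l x using only matrix-vector products."""
    y = np.zeros_like(x)
    v = x.copy()
    for a in alpha:          # alpha[0] multiplies L^1 x, alpha[1] L^2 x, ...
        v = L @ v            # v = L^l x
        y += a * v
    return y

def fit_step(lam_k, lam_max, m, n_grid=200):
    """Least-squares fit of the step 1_{lambda <= lam_k} by sum_{l=1}^m a_l x^l."""
    grid = np.linspace(0.0, lam_max, n_grid)
    target = (grid <= lam_k).astype(float)
    V = np.vander(grid, m + 1, increasing=True)[:, 1:]   # columns x^1 .. x^m
    alpha, *_ = np.linalg.lstsq(V, target, rcond=None)
    return alpha
```

Applying `poly_filter_apply` gives exactly the same vector as filtering with U h̃λk(Λ) U^⊤, while never touching the eigenvectors.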
Last step: estimate λk

Goal: given L, estimate its k-th eigenvalue as fast as possible.

We use eigencount techniques (also based on polynomial filtering of random vectors!):
• given an interval [0, b], get an approximation of the number of enclosed eigenvalues;
• find λk by dichotomy on b.
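A sketch of the dichotomy. The counter below uses a full eigendecomposition as a stand-in for the fast estimator described on the slide, which approximates the same count by polynomial filtering of random vectors:

```python
import numpy as np

def eigcount(lam_all, b):
    """Number of eigenvalues in [0, b] (exact stand-in for the fast estimator)."""
    return int(np.sum(lam_all <= b))

def estimate_lambda_k(L, k, lam_max, tol=1e-6):
    """Find lambda_k by dichotomy on the upper bound b of the interval [0, b]."""
    lam_all = np.linalg.eigvalsh(L)   # computed once, reused by the counter
    lo, hi = 0.0, lam_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if eigcount(lam_all, mid) >= k:
            hi = mid                  # [0, mid] already encloses k eigenvalues
        else:
            lo = mid
    return hi

# demo on the 4-node example graph used throughout the talk
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W
est = estimate_lambda_k(L, k=2, lam_max=6.0)
print(est)   # converges to lambda_2 of L
```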
Accelerated spectral algorithm

Given the N-node graph G of adjacency matrix W:
1. Estimate λk, the k-th eigenvalue of L.
2. Generate η random graph signals in a matrix R ∈ R^(N×η).
3. Filter them with H̃λk and treat each node i as a point in R^η: f̃i^⊤ = δi^⊤ H̃λk R.
4. Run k-means with the Euclidean distance D̃ij = ||f̃i − f̃j|| and obtain k clusters.

Let's work on the second bottleneck: avoid running k-means on a possibly very large number N of points (step 4).
Fast spectral algorithm? Given the N-node graph G of adjacency matrix W:
1. Estimate λk, the k-th eigenvalue of L.
2. Generate η random graph signals in a matrix R ∈ R^(N×η).
3. Filter them with H̃λk and treat each node i as a point in R^η: f̃i^⊤ = δi^⊤ H̃λk R.
4. Sample randomly ρ ≃ k log k ≪ N nodes out of N: f̃i^r = M f̃i = (M H̃λk R)^⊤ δi^r.
5. Run k-means in this reduced space with the Euclidean distance D^r_ij = ||f̃i^r − f̃j^r|| and obtain k clusters.
6. Interpolate the cluster indicator functions c^r_l on the whole graph:
cl = arg min_{x ∈ R^N} ||Mx − c^r_l||² + µ x^⊤ L x.
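Step 6 is a standard regularized least-squares problem: setting the gradient of ||Mx − c^r_l||² + µ x^⊤ L x to zero gives the normal equations (M^⊤M + µL) x = M^⊤ c^r_l. A dense sketch (function name and the value of µ are ours):

```python
import numpy as np

def interpolate_indicator(L, sample_idx, c_r, mu=1e-2):
    """Solve min_x ||Mx - c_r||^2 + mu * x^T L x, where M selects the
    rows in sample_idx. Normal equations: (M^T M + mu L) x = M^T c_r."""
    N = L.shape[0]
    A = mu * L.copy()
    A[sample_idx, sample_idx] += 1.0   # M^T M: ones on the sampled diagonal
    rhs = np.zeros(N)
    rhs[sample_idx] = c_r              # M^T c_r
    return np.linalg.solve(A, rhs)
```

On a two-cluster graph with a handful of labeled nodes per cluster, the interpolated indicator is high on one cluster and low on the other, so thresholding it recovers the full partition.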
Compressive spectral clustering: a summary
1. generate a feature vector for each node by filtering a few random Gaussian signals on G;
2. subsample the set of nodes;
3. cluster the reduced set of nodes;
4. interpolate the cluster indicator vectors back to the complete graph.
This work was done in collaboration with:
• Gilles PUY and Remi GRIBONVAL from the PANAMA team (INRIA);
• Pierre VANDERGHEYNST from EPFL.

Part of this work has been published (or submitted):
• circumventing the first bottleneck has been accepted to ICASSP 2016;
• interpolation of k-bandlimited graph signals has been submitted to ACHA in November (an application of which helps us circumvent the second bottleneck).
Perspectives and difficult questions

Two difficult questions (among others):
1. Given a positive semi-definite matrix, how to estimate its k-th eigenvalue as fast as possible, and only that one?
2. How to subsample ρ nodes out of N while ensuring that clustering them in k classes yields the result one would have obtained by clustering all N nodes?

Perspectives:
1. What if nodes are added one by one?
2. Rational filters instead of polynomial filters?
3. Approximating other spectral clustering algorithms?