
CMU SCS

Tools for large graph mining
WWW 2008 tutorial

Part 3: Matrix tools for graph mining

Jure Leskovec and Christos Faloutsos

Machine Learning Department

Joint work with: Deepay Chakrabarti, Tamara Kolda and Jimeng Sun.


Tutorial outline

Part 1: Structure and models for networks
• What are properties of large graphs?
• How do we model them?

Part 2: Dynamics of networks
• Diffusion and cascading behavior
• How do viruses and information propagate?

Part 3: Matrix tools for mining graphs
• Singular value decomposition (SVD)
• Random walks

Part 4: Case studies
• 240 million MSN instant messenger network
• Graph projections: what does the web look like?

Leskovec & Faloutsos, WWW 2008


About part 3

Introduce matrix and tensor tools through real mining applications

Goal: find patterns, rules, clusters, outliers, … in matrices and in tensors


What is this part about?

Connection of matrix tools and networks

Matrix tools
• Singular Value Decomposition (SVD)
• Principal Component Analysis (PCA)
• Webpage ranking algorithms: HITS, PageRank
• CUR decomposition
• Co-clustering (in part 4 of the tutorial)

Tensor tools
• Tucker decomposition

Applications


Why matrices? Examples

Social networks

Documents and terms

Authors and terms

        John  Peter  Mary  Nick  ...
John       0     11    22    55  ...
Peter      5      0     6     7  ...
Mary     ...    ...   ...   ...  ...
Nick     ...    ...   ...   ...  ...
...      ...    ...   ...   ...  ...


Why tensors? Example

Tensor: n-dimensional generalization of matrix

SIGMOD’07:

        data  mining  classif.  tree  ...
John      13      11        22    55  ...
Peter      5       4         6     7  ...
Mary     ...     ...       ...   ...  ...
Nick     ...     ...       ...   ...  ...
...      ...     ...       ...   ...  ...


[Figure: the same author × keyword matrix repeated for SIGMOD’05, SIGMOD’06 and SIGMOD’07, stacked into a tensor]


Tensors are useful for 3 or more modes 

Terminology: ‘mode’ (or ‘aspect’):

[Figure: the stacked author × keyword tensor with its three axes labeled Mode (== aspect) #1, Mode #2 and Mode #3]


Motivating applications

Why are matrices important?
Why are tensors useful?
• P1: social networks
• P2: web & text mining
• P3: network forensics
• P4: sensor networks

[Figures: source × destination matrices for normal traffic and abnormal traffic (network forensics); temperature value vs. time (min) (sensor networks); a social network]


Static Data model Tensor

Formally, 

• Generalization of matrices
• Represented as multi-array (~ data cube)

Order            1st      2nd      3rd
Correspondence   Vector   Matrix   3D array
Example          [figures]


Dynamic data model: tensor streams

A sequence of Mth-order tensors, where t is increasing over time

Order            1st                2nd                    3rd
Correspondence   Multiple streams   Time-evolving graphs   3D arrays
Example          [figure: author × keyword slices along the time axis]


SVD: Examples of Matrices

Example/Intuition: Documents and terms

Find patterns, groups, concepts

          data  mining  classif.  tree  ...
Paper#1     13      11        22    55  ...
Paper#2      5       4         6     7  ...
Paper#3    ...     ...       ...   ...  ...
Paper#4    ...     ...       ...   ...  ...
...        ...     ...       ...   ...  ...


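Singular Value Decomposition (SVD)

$$X = U \Sigma V^T$$

$$
\underbrace{\begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(M)} \end{bmatrix}}_{X}
=
\underbrace{\begin{bmatrix} u_1 & u_2 & \cdots & u_k \end{bmatrix}}_{U}
\cdot
\underbrace{\begin{bmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_k \end{bmatrix}}_{\Sigma}
\cdot
\underbrace{\begin{bmatrix} v_1 & v_2 & \cdots & v_k \end{bmatrix}^T}_{V^T}
$$

• X: input data
• U: left singular vectors
• Σ: singular values
• V: right singular vectors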


SVD as spectral decomposition

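• Best rank-k approximation in L2 and Frobenius norm
• SVD only works for static matrices (a single 2nd-order tensor)

With A an m × n matrix and $A = U \Sigma V^T$:

$$A \approx \sigma_1 u_1 \circ v_1 + \sigma_2 u_2 \circ v_2 + \cdots + \sigma_k u_k \circ v_k$$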
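The "best rank-k approximation" claim can be checked numerically. The following is a minimal sketch using NumPy; the random test matrix and the helper name `rank_k_approx` are assumptions made for illustration, not part of the tutorial.

```python
import numpy as np

def rank_k_approx(A, k):
    """Return the rank-k truncated-SVD approximation of A (hypothetical helper)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the first k terms sigma_i * (u_i outer v_i).
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))          # arbitrary illustrative matrix
s = np.linalg.svd(A, compute_uv=False)

A2 = rank_k_approx(A, 2)
# L2 (spectral) error equals the first discarded singular value;
# Frobenius error equals the root-sum-square of all discarded ones.
spectral_err = np.linalg.norm(A - A2, 2)
frob_err = np.linalg.norm(A - A2, "fro")
```

Here `spectral_err` should match `s[2]` and `frob_err` should match `sqrt(s[2]**2 + s[3]**2 + s[4]**2)`, which is the Eckart-Young property the slide refers to.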


Vector outer product – intuition:

[Figure: a 2-d histogram A of car type (VW, Volvo, BMW) by owner age (20; 30; 40), approximated by the outer product of the two 1-d histograms under an independence assumption]
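The outer-product intuition can be sketched in a few lines. The counts below are invented for illustration (the slide gives only the category labels); the point is that two 1-d histograms plus an independence assumption yield a rank-one estimate of the 2-d histogram.

```python
import numpy as np

# Hypothetical 1-d histograms over the same 100 owners.
car_hist = np.array([30.0, 20.0, 50.0])   # VW, Volvo, BMW
age_hist = np.array([40.0, 35.0, 25.0])   # owner age 20, 30, 40
total = car_hist.sum()

# Under independence: count(car, age) = count(car) * count(age) / total.
approx_2d = np.outer(car_hist, age_hist) / total
```

The estimate reproduces both marginals exactly and has rank one, which is why each term σ·u∘v of the SVD can be read as one such "independent" pattern.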


SVD ‐ Example

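$A = U \Sigma V^T$ example (rows: four CS documents, then three MD documents; columns: the terms data, inf., retrieval, brain, lung):

$$
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
2 & 2 & 2 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
5 & 5 & 5 & 0 & 0 \\
0 & 0 & 0 & 2 & 2 \\
0 & 0 & 0 & 3 & 3 \\
0 & 0 & 0 & 1 & 1
\end{bmatrix}
=
\begin{bmatrix}
0.18 & 0 \\
0.36 & 0 \\
0.18 & 0 \\
0.90 & 0 \\
0 & 0.53 \\
0 & 0.80 \\
0 & 0.27
\end{bmatrix}
\times
\begin{bmatrix}
9.64 & 0 \\
0 & 5.29
\end{bmatrix}
\times
\begin{bmatrix}
0.58 & 0.58 & 0.58 & 0 & 0 \\
0 & 0 & 0 & 0.71 & 0.71
\end{bmatrix}
$$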
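The numbers in this example can be reproduced with a standard SVD routine. This is a small sketch using NumPy's `np.linalg.svd`; note that a numerical SVD may flip the sign of a singular-vector pair, so the comparison below is on absolute values.

```python
import numpy as np

# Document-by-term matrix from the example:
# rows 0-3 are CS documents, rows 4-6 are MD documents.
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                  # the matrix has rank 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
A_rebuilt = (U_k * s_k) @ Vt_k         # U_k diag(s_k) Vt_k
```

`s_k` comes out as roughly (9.64, 5.29), and `A_rebuilt` matches `A` because the remaining singular values are zero.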

The decomposition exposes two concepts: a CS-concept and an MD-concept.

U is the doc-to-concept similarity matrix.

The first singular value, 9.64, is the ‘strength’ of the CS-concept.

$V^T$ is the term-to-concept similarity matrix; its first row is the CS-concept.


SVD ‐ Interpretation

‘documents’, ‘terms’ and ‘concepts’:

Q: if A is the document-to-term matrix, what is $A^T A$?
A: term-to-term ([m x m]) similarity matrix

Q: $A A^T$?
A: document-to-document ([n x n]) similarity matrix


SVD properties

V are the eigenvectors of the covariance matrix $A^T A$ (a covariance matrix, up to scaling, when the columns of A are mean-centered)

U are the eigenvectors of the Gram (inner-product) matrix $A A^T$

Further reading:
1. Ian T. Jolliffe, Principal Component Analysis (2nd ed), Springer, 2002.
2. Gilbert Strang, Linear Algebra and Its Applications (4th ed), Brooks Cole, 2005.
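The two eigenvector properties above are easy to verify numerically. A minimal sketch, assuming an arbitrary random matrix as test data:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))            # arbitrary illustrative matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# (A^T A) v_i = sigma_i^2 v_i for every right singular vector v_i ...
lhs_V = (A.T @ A) @ Vt.T
rhs_V = Vt.T * s**2
# ... and (A A^T) u_i = sigma_i^2 u_i for every left singular vector u_i.
lhs_U = (A @ A.T) @ U
rhs_U = U * s**2
```

Both pairs agree to floating-point precision, and the eigenvalues are the squared singular values.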


Principal Component Analysis (PCA)

• PCA is an important application of SVD
• Note that U and V are dense and may have negative entries

[Figure: truncated SVD of the m × n matrix A into U, Σ and $V^T$, with the parts labeled ‘PCs’ and ‘Loading’]


PCA interpretation

Best axis to project on (‘best’ = min sum of squares of projection errors)

[Figure: scatter plot with axes Term1 (‘data’) and Term2 (‘lung’)]


PCA ‐ interpretation 

PCA projects points onto the “best” axis (minimum RMS error): $v_1$, the first singular vector

[Figure: scatter plot with axes Term1 (‘data’) and Term2 (‘retrieval’), with the axis $v_1$ drawn through the points]
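A short sketch of the "best axis" property, using synthetic 2-d points (the data and the helper `projection_error` are assumptions for illustration). The first right singular vector of the centered data should give a smaller sum of squared projection errors than any other direction.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: two correlated term counts per document.
t = rng.normal(size=200)
X = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=200)])
Xc = X - X.mean(axis=0)                # PCA works on centered data

_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
v1 = Vt[0]                             # first singular vector = "best" axis

def projection_error(points, axis):
    """Sum of squared distances from the points to the line spanned by axis."""
    axis = axis / np.linalg.norm(axis)
    residual = points - np.outer(points @ axis, axis)
    return np.sum(residual**2)

best = projection_error(Xc, v1)
```

In two dimensions the minimum error equals the square of the second singular value, `s[1]**2`.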


Kleinberg’s algorithm HITS

Problem definition: given the web and a query, find the most ‘authoritative’ web pages for this query

Step 0: find all pages containing the query terms

Step 1: expand by one move forward and backward

Further reading:
1. J. Kleinberg. Authoritative sources in a hyperlinked environment. SODA 1998


On the resulting graph:
• give a high score (= ‘authorities’) to nodes that many important nodes point to
• give a high importance score (‘hubs’) to nodes that point to good ‘authorities’

[Figure: hubs pointing to authorities]


Observations:
• recursive definition!
• each node (say, the ‘i’-th node) has both an authoritativeness score $a_i$ and a hubness score $h_i$


Let A be the adjacency matrix: the (i,j) entry is 1 if the edge from i to j exists

Let h and a be [n x 1] vectors with the ‘hubness’ and ‘authoritativeness’ scores.

Then:


$a_i = h_k + h_l + h_m$

that is

$a_i = \sum_j h_j$ over all j such that the edge (j, i) exists

or

$a = A^T h$

[Figure: nodes k, l and m each point to node i]


Symmetrically, for the ‘hubness’:

$h_i = a_n + a_p + a_q$

that is

$h_i = \sum_j a_j$ over all j such that the edge (i, j) exists

or

$h = A a$

[Figure: node i points to nodes n, p and q]


In conclusion, we want vectors h and a such that:

$h = A a$

$a = A^T h$

That is:

$a = A^T A a$


a is a right singular vector of the adjacency matrix A (by dfn!), a.k.a. an eigenvector of $A^T A$

Starting from a random a’ and iterating, we’ll eventually converge

Q: to which of all the eigenvectors? Why?
A: to the one with the strongest eigenvalue $\lambda_1$:

$(A^T A)^k a = \lambda_1^k a$
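The iteration described here can be sketched as a plain power method. This is an illustrative sketch, not the authors' code: the 4-node graph is made up, the start vector is all ones rather than random, and the per-step normalization is added only to keep the numbers bounded.

```python
import numpy as np

def hits(A, iters=100):
    """Alternate a = A^T h and h = A a, renormalizing at each step."""
    n = A.shape[0]
    h = np.ones(n)
    a = np.ones(n)
    for _ in range(iters):
        a = A.T @ h                    # authorities: sum of hub scores pointing in
        a /= np.linalg.norm(a)
        h = A @ a                      # hubs: sum of authority scores pointed to
        h /= np.linalg.norm(h)
    return h, a

# Hypothetical graph: A[i, j] = 1 if the edge i -> j exists.
A = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

h, a = hits(A)
```

For this graph node 2 gets the highest authority score, and `a` agrees (up to sign) with the first right singular vector returned by `np.linalg.svd(A)`.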


Kleinberg’s algorithm ‐ discussion

‘authority’ score can be used to find ‘similar pages’ (how?)

closely related to ‘citation analysis’, social networks / ‘small world’ phenomena


Motivating problem: PageRank

Given a directed graph, find its most interesting/central node

A node is important if it is connected with important nodes (recursive, but OK!)


Motivating problem – PageRank solution

Proposed solution: random walk; spot the most ‘popular’ node (-> steady-state probability (ssp))

A node has high ssp if it is connected with high-ssp nodes (recursive, but OK!)


(Simplified) PageRank algorithm

Let A be the transition matrix (= adjacency matrix); let B be the transpose, column‐normalized ‐ then

[Figure: a 5-node example graph (nodes 1 to 5) and the matrix equation B p = p, where B is the 5 × 5 To × From matrix with nonzero entries 1 and 1/2, and p = (p1, …, p5)]

$B p = p$

B p = 1 * p

thus, p is the eigenvector that corresponds to the highest eigenvalue (= 1, since the matrix is column-normalized)

Why does such a p exist? p exists if B is n × n, nonnegative, irreducible [Perron–Frobenius theorem]


In short: imagine a particle randomly moving along the edges

compute its steady‐state probabilities (ssp)

Full version of algo:  with occasional random jumps

Why? To make the matrix irreducible


Full Algorithm

With probability 1‐c, fly‐out to a random node

Then, we have

$$p = c\,B\,p + \frac{1-c}{n}\,\mathbf{1} \;\Rightarrow\; p = \frac{1-c}{n}\,[I - c\,B]^{-1}\,\mathbf{1}$$
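Both forms of the full algorithm, the fixed-point iteration and the closed-form solution, can be compared on a small graph. This is a sketch under stated assumptions: the 5-node graph is invented, c = 0.85 is a commonly used value that the slides do not specify, and every node is assumed to have at least one out-edge.

```python
import numpy as np

def pagerank(A, c=0.85, iters=200):
    """Iterate p = c B p + (1 - c)/n * 1, with B the column-normalized transpose of A."""
    n = A.shape[0]
    out_deg = A.sum(axis=1)            # assumes no node has zero out-degree
    B = (A / out_deg[:, None]).T       # B[j, i] = 1/outdeg(i) if edge i -> j
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        p = c * (B @ p) + (1 - c) / n
    return p, B

# Hypothetical graph: A[i, j] = 1 if the edge i -> j exists.
A = np.array([
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
], dtype=float)

c = 0.85
n = A.shape[0]
p, B = pagerank(A, c=c)
# Closed form: p = (1 - c)/n * [I - c B]^{-1} 1, solved without forming the inverse.
p_closed = (1 - c) / n * np.linalg.solve(np.eye(n) - c * B, np.ones(n))
```

The iterative and closed-form vectors agree, and p sums to 1 because B is column-stochastic.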


Motivation of CUR or CMD

SVD and PCA both transform data into some abstract space (specified by a set of basis vectors)
• Interpretability problem
• Loss of sparsity


CUR

Example-based projection: use actual rows and columns to specify the subspace

Given a matrix $A \in \mathbb{R}^{m \times n}$, find three matrices $C \in \mathbb{R}^{m \times c}$, $U \in \mathbb{R}^{c \times r}$, $R \in \mathbb{R}^{r \times n}$, such that $\|A - CUR\|$ is small

[Figure: orthogonal projection]

U is the pseudo-inverse of X (the formula below holds when X has full column rank):

$U = X^\dagger = (X^T X)^{-1} X^T$

[Figure: example-based projection]


CUR (cont.)

Key question: how to select/sample the columns and rows?
• Uniform sampling
• Biased sampling
  • CUR w/ absolute error bound
  • CUR w/ relative error bound

Reference:
1. Tutorial: Randomized Algorithms for Matrices and Massive Datasets, SDM’06
2. Drineas et al., Subspace Sampling and Relative-error Matrix Approximation: Column-Row-Based Methods, ESA 2006
3. Drineas et al., Fast Monte Carlo Algorithms for Matrices III: Computing a Compressed Approximate Matrix Decomposition, SIAM Journal on Computing, 2006.
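A simplified CUR sketch may help make the sampling idea concrete. Several choices here are assumptions rather than the cited algorithms: columns and rows are sampled without replacement with probability proportional to their squared norms, no rescaling is applied, X is taken to be the intersection of the sampled rows and columns, and the function name `cur_sketch` is hypothetical.

```python
import numpy as np

def cur_sketch(A, c, r, rng):
    """Pick c actual columns and r actual rows of A; U = pinv(intersection)."""
    total = np.sum(A**2)
    col_p = np.sum(A**2, axis=0) / total   # biased sampling: squared column norms
    row_p = np.sum(A**2, axis=1) / total   # biased sampling: squared row norms
    cols = rng.choice(A.shape[1], size=c, replace=False, p=col_p)
    rows = rng.choice(A.shape[0], size=r, replace=False, p=row_p)
    C = A[:, cols]                          # m x c, actual columns of A
    R = A[rows, :]                          # r x n, actual rows of A
    X = A[np.ix_(rows, cols)]               # r x c intersection (assumed meaning of X)
    U = np.linalg.pinv(X, rcond=1e-10)      # c x r
    return C, U, R

rng = np.random.default_rng(3)
# Hypothetical exactly rank-3 matrix, so a few rows/columns span everything.
A = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 30))
C, U, R = cur_sketch(A, c=6, r=6, rng=rng)
rel_err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
```

Because this test matrix has exact rank 3, six sampled rows and columns recover it almost perfectly; on real data the error depends on the sampling scheme, which is what the absolute- and relative-error bounds in the references address.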


The sparsity property – pictorially:

• SVD/PCA ($A = U \Sigma V^T$): destroys sparsity
• CUR ($A = C U R$): maintains sparsity


The sparsity property

SVD: $A = U \Sigma V^T$
• A: big but sparse
• U, $V^T$: big and dense
• Σ: sparse and small

CUR: $A = C U R$
• A: big but sparse
• C, R: big but sparse
• U: dense but small


Matrix tools ‐ summary

SVD: optimal for L2 – VERY popular (HITS, PageRank, Karhunen‐Loeve, Latent Semantic Indexing, PCA,  etc etc)

C-U-R (CMD etc.): near-optimal; sparsity; interpretability


TENSORS


Reminder: SVD

Best rank‐k approximation in L2

With A an m × n matrix and $A = U \Sigma V^T$:

$$A \approx \sigma_1 u_1 \circ v_1 + \sigma_2 u_2 \circ v_2 + \cdots + \sigma_k u_k \circ v_k$$


Goal: extension to >=3 modes

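[Figure: an I × J × K tensor ≈ an R × R × R core combined with factor matrices A (I × R), B (J × R) and a third factor; equivalently written as a sum (+ … +) of R terms]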


Tensors: Main points

2 major types of tensor decompositions: Kruskal and Tucker

both can be solved with “alternating least squares” (ALS)

Details follow – we start with terminology:


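Kruskal’s Decomposition - intuition

[Figure: an I × J × K tensor ≈ an R × R × R core combined with factor matrices A (I × R), B (J × R) and a third factor; equivalently written as a sum (+ … +) of R terms]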


Tucker Decomposition ‐ intuition

[Figure: an I × J × K tensor (author × keyword × conference) ≈ an R × S × T core G combined with factor matrices A (I × R), B (J × S) and C (K × T)]

• A: author × author-group
• B: keyword × keyword-group
• C: conf. × conf-group
• G: how groups relate to each other
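The Tucker model can be written as a single contraction of the core with the three factor matrices. The sketch below only rebuilds a tensor from given factors (it does not fit them with ALS); the sizes, random factors and the helper name `tucker_reconstruct` are assumptions for illustration. It also shows the Kruskal form as the special case with a super-diagonal core.

```python
import numpy as np

def tucker_reconstruct(G, A, B, C):
    """X[i,j,k] = sum over r,s,t of G[r,s,t] * A[i,r] * B[j,s] * C[k,t]."""
    return np.einsum("rst,ir,js,kt->ijk", G, A, B, C)

rng = np.random.default_rng(4)
I, J, K = 5, 4, 3          # e.g. authors, keywords, conferences
R, S, T = 2, 3, 2          # number of groups per mode
G = rng.normal(size=(R, S, T))     # how groups relate to each other
A = rng.normal(size=(I, R))        # author x author-group
B = rng.normal(size=(J, S))        # keyword x keyword-group
C = rng.normal(size=(K, T))        # conf. x conf-group
X = tucker_reconstruct(G, A, B, C)

# Kruskal special case: same number of groups per mode, super-diagonal core.
Rk = 2
lam = np.array([2.0, 0.5])
Gd = np.zeros((Rk, Rk, Rk))
Gd[np.arange(Rk), np.arange(Rk), np.arange(Rk)] = lam
A2 = rng.normal(size=(I, Rk))
B2 = rng.normal(size=(J, Rk))
C2 = rng.normal(size=(K, Rk))
X_kruskal = tucker_reconstruct(Gd, A2, B2, C2)
# The same tensor as an explicit sum of Rk rank-one terms.
X_sum = sum(lam[r] * np.einsum("i,j,k->ijk", A2[:, r], B2[:, r], C2[:, r])
            for r in range(Rk))
```

`X_kruskal` and `X_sum` coincide, which is the "sum of rank-one terms" picture of the Kruskal decomposition.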


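2-d analog of Tucker decomposition

For an m × n matrix (e.g., terms x documents), with k row groups and l column groups:

$$
\underbrace{\begin{bmatrix}
.05 & .05 & .05 & 0 & 0 & 0 \\
.05 & .05 & .05 & 0 & 0 & 0 \\
0 & 0 & 0 & .05 & .05 & .05 \\
0 & 0 & 0 & .05 & .05 & .05 \\
.04 & .04 & 0 & .04 & .04 & .04 \\
.04 & .04 & .04 & 0 & .04 & .04
\end{bmatrix}}_{m \times n}
\approx
\underbrace{\begin{bmatrix}
.054 & .054 & .042 & 0 & 0 & 0 \\
.054 & .054 & .042 & 0 & 0 & 0 \\
0 & 0 & 0 & .042 & .054 & .054 \\
0 & 0 & 0 & .042 & .054 & .054 \\
.036 & .036 & .028 & .028 & .036 & .036 \\
.036 & .036 & .028 & .028 & .036 & .036
\end{bmatrix}}_{m \times n}
=
\underbrace{\begin{bmatrix}
.5 & 0 & 0 \\
.5 & 0 & 0 \\
0 & .5 & 0 \\
0 & .5 & 0 \\
0 & 0 & .5 \\
0 & 0 & .5
\end{bmatrix}}_{m \times k}
\underbrace{\begin{bmatrix}
.3 & 0 \\
0 & .3 \\
.2 & .2
\end{bmatrix}}_{k \times l}
\underbrace{\begin{bmatrix}
.36 & .36 & .28 & 0 & 0 & 0 \\
0 & 0 & 0 & .28 & .36 & .36
\end{bmatrix}}_{l \times n}
$$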
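The three small factors on this slide can be multiplied out to check the approximating matrix. A minimal sketch with NumPy; the variable names are chosen here for readability and are not from the slides.

```python
import numpy as np

# m x k: each of the 6 rows belongs to one of 3 row groups.
row_to_group = np.array([
    [.5, 0, 0],
    [.5, 0, 0],
    [0, .5, 0],
    [0, .5, 0],
    [0, 0, .5],
    [0, 0, .5],
])
# k x l: how the 3 row groups relate to the 2 column groups.
group_joint = np.array([
    [.3, 0],
    [0, .3],
    [.2, .2],
])
# l x n: each of the 6 columns belongs to one of 2 column groups.
group_to_col = np.array([
    [.36, .36, .28, 0, 0, 0],
    [0, 0, 0, .28, .36, .36],
])

approx = row_to_group @ group_joint @ group_to_col
```

The first row of `approx` is (.054, .054, .042, 0, 0, 0) and the fifth is (.036, .036, .028, .028, .036, .036), and all entries sum to 1.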

Reading the three factors of the same example:
• term x term-group (term groups: med. terms, cs terms, common terms)
• term group x doc. group
• doc x doc group (doc groups: med. doc, cs doc)


Tensor tools ‐ summary

Two main tools
• PARAFAC
• Tucker

Both find row-, column-, tube-groups
• but in PARAFAC the three groups are identical

To solve: Alternating Least Squares

Toolbox: from Tamara Kolda: http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox/


P1: Environmental sensor monitoring 

[Figures: four sensor time series, value vs. time (min), over 0 to 10000 minutes: Temperature, Light, Voltage, Humidity]


P1: sensor monitoring

1st factor (scaling factor 250) consists of the main trends:
• Daily periodicity on time
• Uniform on all locations
• Temp, Light and Volt are positively correlated while negatively correlated with Humid

[Figures: 1st-factor values along each mode: type (Volt, Humid, Temp, Light), location, and time (min)]


P1: sensor monitoring 

2nd factor (scaling factor 154) captures an atypical trend:
• Uniformly across all time
• Concentrating on 3 locations
• Mainly due to voltage

Interpretation: two sensors have low battery, and the other one has high battery.

[Figures: 2nd-factor values along each mode: type (Volt, Humid, Temp, Light), location, and time (min)]


P3: Social network analysis

• Multiway latent semantic indexing (LSI)
• Monitor the change of the community structure over time

[Figure: author examples Philip Yu and Michael Stonebraker; keyword examples ‘Query’ and ‘Pattern’]


P3: Social network analysis (cont.)

Authors | Keywords | Year
michael carey, michael stonebraker, h. jagadish, hector garcia-molina | queri, parallel, optimization, concurr, objectorient | 1995
surajit chaudhuri, mitch cherniack, michael stonebraker, ugur cetintemel | distribut, systems, view, storage, servic, process, cache | 2004
jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal | streams, pattern, support, cluster, index, gener, queri | 2004

• Two groups are correctly identified: Databases (DB) and Data mining (DM)
• People and concepts are drifting over time


P4: Network anomaly detection

Reconstruction error gives indication of anomalies.Prominent difference between normal and abnormal ones is mainly due to the unusual scanning activity (confirmed by the campus admin).

[Figure: reconstruction error over time (x-axis: hours, y-axis: error); source-by-destination traffic matrices for normal traffic and abnormal traffic.]
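The idea of flagging anomalies by reconstruction error can be sketched as follows. This is a minimal illustration, not the method used on the slide: it assumes each hour's source-by-destination traffic matrix is flattened into one row, fits a rank-2 subspace with a plain SVD on hours assumed to be normal, and scores every hour by the norm of the residual. The synthetic data, the planted scan at hour 25, and all function names are hypothetical.

```python
import numpy as np

def fit_subspace(train, k):
    """Top-k right singular vectors of the training rows (one row per hour)."""
    _, _, vt = np.linalg.svd(train, full_matrices=False)
    return vt[:k]

def reconstruction_errors(X, basis):
    """Norm of what remains after projecting each row onto the basis."""
    projected = X @ basis.T @ basis
    return np.linalg.norm(X - projected, axis=1)

# Synthetic traffic (assumption): every hour mixes two fixed patterns plus noise.
rng = np.random.default_rng(0)
n_hours, n_src, n_dst = 30, 10, 10
patterns = rng.random((2, n_src * n_dst))
weights = rng.random((n_hours, 2))
noise = 0.01 * rng.standard_normal((n_hours, n_src * n_dst))
traffic = weights @ patterns + noise

# Plant a scan: in one hour, a single source contacts every destination.
scan_hour = 25
scan = np.zeros((n_src, n_dst))
scan[3, :] = 5.0
traffic[scan_hour] += scan.ravel()

# Fit on the first 20 hours (assumed normal), then score all hours.
basis = fit_subspace(traffic[:20], k=2)
errors = reconstruction_errors(traffic, basis)
flagged = int(np.argmax(errors))
```

Hours that follow the usual traffic patterns are reconstructed almost exactly, so their error stays near the noise level, while the scan does not fit the learned subspace and produces a large residual.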


P5: Web graph mining

How to order the importance of web pages?

Kleinberg’s algorithm HITS

PageRank

Tensor extension of HITS (TOPHITS)


Kleinberg’s Hubs and Authorities (the HITS method)

Sparse adjacency matrix and its SVD:

[Figure: sparse from-by-to adjacency matrix and its SVD, giving hub scores and authority scores for the 1st topic, the 2nd topic, and so on.]

Kleinberg, JACM, 1999
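As a rough sketch of how hub and authority scores fall out of the SVD of the adjacency matrix: the left singular vectors give hub scores and the right singular vectors give authority scores, one pair per factor. The toy graph and the function name `hits_svd` are assumptions made for illustration, and a dense SVD is used here only because the example is tiny.

```python
import numpy as np

def hits_svd(adjacency, n_factors=2):
    """Hub and authority scores from the leading singular vectors.

    adjacency[i, j] = 1 means page i links to page j.
    """
    u, s, vt = np.linalg.svd(adjacency, full_matrices=False)
    hubs = u[:, :n_factors].copy()
    authorities = vt[:n_factors].T.copy()
    # Singular vectors are defined up to a joint sign flip; make hubs positive.
    for r in range(n_factors):
        if hubs[:, r].sum() < 0:
            hubs[:, r] *= -1
            authorities[:, r] *= -1
    return hubs, s[:n_factors], authorities

# Toy graph (hypothetical): pages 0 and 1 both link to pages 2 and 3,
# and page 4 links to page 5.
A = np.zeros((6, 6))
A[0, 2] = A[0, 3] = A[1, 2] = A[1, 3] = 1.0
A[4, 5] = 1.0

hubs, strengths, authorities = hits_svd(A)
top_authorities_factor1 = sorted(np.argsort(-authorities[:, 0])[:2].tolist())
```

In this toy graph the first factor picks out pages 2 and 3 as authorities (with pages 0 and 1 as hubs), and the second factor isolates the separate link from page 4 to page 5. For a real web graph one would use a sparse, truncated SVD rather than the dense routine shown here.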


HITS Authorities on Sample Data

We started our crawl from http://www-neos.mcs.anl.gov/neos, and crawled 4700 pages, resulting in 560 cross-linked hosts.

1st Principal Factor
.97 www.ibm.com
.24 www.alphaworks.ibm.com
.08 www-128.ibm.com
.05 www.developer.ibm.com
.02 www.research.ibm.com
.01 www.redbooks.ibm.com
.01 news.com.com

2nd Principal Factor
.99 www.lehigh.edu
.11 www2.lehigh.edu
.06 www.lehighalumni.com
.06 www.lehighsports.com
.02 www.bethlehem-pa.gov
.02 www.adobe.com
.02 lewisweb.cc.lehigh.edu
.02 www.leo.lehigh.edu
.02 www.distance.lehigh.edu
.02 fp1.cc.lehigh.edu

3rd Principal Factor
.75 java.sun.com
.38 www.sun.com
.36 developers.sun.com
.24 see.sun.com
.16 www.samag.com
.13 docs.sun.com
.12 blogs.sun.com
.08 sunsolve.sun.com
.08 www.sun-catalogue.com
.08 news.com.com

4th Principal Factor
.60 www.pueblo.gsa.gov
.45 www.whitehouse.gov
.35 www.irs.gov
.31 travel.state.gov
.22 www.gsa.gov
.20 www.ssa.gov
.16 www.census.gov
.14 www.govbenefits.gov
.13 www.kids.gov
.13 www.usdoj.gov

6th Principal Factor
.97 mathpost.asu.edu
.18 math.la.asu.edu
.17 www.asu.edu
.04 www.act.org
.03 www.eas.asu.edu
.02 archives.math.utk.edu
.02 www.geom.uiuc.edu
.02 www.fulton.asu.edu
.02 www.amstat.org
.02 www.maa.org


Three‐Dimensional View of the Web

Observe that this tensor is very sparse!

Kolda, Bader, Kenny, ICDM05


Topical HITS (TOPHITS)

Main Idea: Extend the idea behind the HITS model to incorporate term (i.e., topical) information.

[Figure: from-by-to-by-term tensor decomposed into hub scores, authority scores, and term scores for the 1st topic, the 2nd topic, and so on.]
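One way to obtain hub, authority, and term scores per topic is a CP (PARAFAC) decomposition of the three-way tensor. The sketch below fits it with plain alternating least squares on a small dense tensor. It is an illustration under assumptions: the slide does not spell out the fitting procedure, the toy tensor and all function names are hypothetical, and a real web tensor would need sparse data structures.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product; row j*K + k holds B[j, :] * C[k, :]."""
    J, R = B.shape
    K = C.shape[0]
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

def unfold(X, mode):
    """Matricize X so that rows index the chosen mode (C-order columns)."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def cp_als(X, rank, n_iter=50, seed=0):
    """Fit X[i, j, k] ~ sum_r A[i, r] * B[j, r] * C[k, r] by alternating least squares."""
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) for dim in X.shape]
    norm_x = np.linalg.norm(X)
    errors = []
    for _ in range(n_iter):
        for mode in range(3):
            first, second = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(first, second)
            # Gram matrix of kr equals the elementwise product of the two Grams.
            gram = (first.T @ first) * (second.T @ second)
            factors[mode] = unfold(X, mode) @ kr @ np.linalg.pinv(gram)
        approx = np.einsum("ir,jr,kr->ijk", *factors)
        errors.append(np.linalg.norm(X - approx) / norm_x)
    return factors, errors

# Toy from-by-to-by-term tensor with two planted topics (hypothetical data).
X = np.zeros((4, 4, 3))
X[0:2, 0:2, 0] = 1.0     # pages 0-1 link among themselves using term 0
X[2:4, 2:4, 1:3] = 1.0   # pages 2-3 link among themselves using terms 1-2

factors, errors = cp_als(X, rank=2)
hubs, authorities, terms = factors
```

Each column r of `hubs`, `authorities`, and `terms` plays the role of one topic: the largest entries in a column indicate the pages and terms that dominate that topic. The `errors` list records the relative reconstruction error after each sweep, which is useful for checking convergence.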



TOPHITS Terms & Authorities on Sample Data

1st Principal Factor
Terms | Authorities
.23 JAVA | .86 java.sun.com
.18 SUN | .38 developers.sun.com
.17 PLATFORM | .16 docs.sun.com
.16 SOLARIS | .14 see.sun.com
.16 DEVELOPER | .14 www.sun.com
.15 EDITION | .09 www.samag.com
.15 DOWNLOAD | .07 developer.sun.com
.14 INFO | .06 sunsolve.sun.com
.12 SOFTWARE | .05 access1.sun.com
.12 NO-READABLE-TEXT | .05 iforce.sun.com

2nd Principal Factor
Terms | Authorities
.20 NO-READABLE-TEXT | .99 www.lehigh.edu
.16 FACULTY | .06 www2.lehigh.edu
.16 SEARCH | .03 www.lehighalumni.com
.16 NEWS |
.16 LIBRARIES |
.16 COMPUTING |
.12 LEHIGH |

3rd Principal Factor
Terms | Authorities
.15 NO-READABLE-TEXT | .97 www.ibm.com
.15 IBM | .18 www.alphaworks.ibm.com
.12 SERVICES | .07 www-128.ibm.com
.12 WEBSPHERE | .05 www.developer.ibm.com
.12 WEB | .02 www.redbooks.ibm.com
.11 DEVELOPERWORKS | .01 www.research.ibm.com
.11 LINUX |
.11 RESOURCES |
.11 TECHNOLOGIES |
.10 DOWNLOADS |

4th Principal Factor
Terms | Authorities
.26 INFORMATION | .87 www.pueblo.gsa.gov
.24 FEDERAL | .24 www.irs.gov
.23 CITIZEN | .23 www.whitehouse.gov
.22 OTHER | .19 travel.state.gov
.19 CENTER | .18 www.gsa.gov
.19 LANGUAGES | .09 www.consumer.gov
.15 U.S | .09 www.kids.gov
.15 PUBLICATIONS | .07 www.ssa.gov
.14 CONSUMER | .05 www.forms.gov
.13 FREE | .04 www.govbenefits.gov

6th Principal Factor
Terms | Authorities
.26 PRESIDENT | .87 www.whitehouse.gov
.25 NO-READABLE-TEXT | .18 www.irs.gov
.25 BUSH | .16 travel.state.gov
.25 WELCOME | .10 www.gsa.gov
.17 WHITE | .08 www.ssa.gov
.16 U.S | .05 www.govbenefits.gov
.15 HOUSE | .04 www.census.gov
.13 BUDGET | .04 www.usdoj.gov
.13 PRESIDENTS | .04 www.kids.gov
.11 OFFICE | .02 www.forms.gov

12th Principal Factor
Terms | Authorities
.75 OPTIMIZATION | .35 www.palisade.com
.58 SOFTWARE | .35 www.solver.com
.08 DECISION | .33 plato.la.asu.edu
.07 NEOS | .29 www.mat.univie.ac.at
.06 TREE | .28 www.ilog.com
.05 GUIDE | .26 www.dashoptimization.com
.05 SEARCH | .26 www.grabitech.com
.05 ENGINE | .25 www-fp.mcs.anl.gov
.05 CONTROL | .22 www.spyderopts.com
.05 ILOG | .17 www.mosek.com

13th Principal Factor
Terms | Authorities
.46 ADOBE | .99 www.adobe.com
.45 READER |
.45 ACROBAT |
.30 FREE |
.30 NO-READABLE-TEXT |
.29 HERE |
.29 COPY |
.05 DOWNLOAD |

16th Principal Factor
Terms | Authorities
.50 WEATHER | .81 www.weather.gov
.24 OFFICE | .41 www.spc.noaa.gov
.23 CENTER | .30 lwf.ncdc.noaa.gov
.19 NO-READABLE-TEXT | .15 www.cpc.ncep.noaa.gov
.17 ORGANIZATION | .14 www.nhc.noaa.gov
.15 NWS | .09 www.prh.noaa.gov
.15 SEVERE | .07 aviationweather.gov
.15 FIRE | .06 www.nohrsc.nws.gov
.15 POLICY | .06 www.srh.noaa.gov
.14 CLIMATE |

19th Principal Factor
Terms | Authorities
.22 TAX | .73 www.irs.gov
.17 TAXES | .43 travel.state.gov
.15 CHILD | .22 www.ssa.gov
.15 RETIREMENT | .08 www.govbenefits.gov
.14 BENEFITS | .06 www.usdoj.gov
.14 STATE | .03 www.census.gov
.14 INCOME | .03 www.usmint.gov
.13 SERVICE | .02 www.nws.noaa.gov
.13 REVENUE | .02 www.gsa.gov
.12 CREDIT | .01 www.annualcreditreport.com

TOPHITS uses 3D analysis to find the dominant groupings of web pages and terms.

[Figure: tensor decomposed into hub scores, authority scores, and term scores for the 1st topic, the 2nd topic, and so on.]

w_k = number of unique links using term k
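Because the from-by-to-by-term tensor is very sparse, it is natural to store only its nonzero coordinates. The sketch below builds such a coordinate set from (from host, to host, term) triples and counts, for each term k, the number of unique links that use it. The host names, terms, and function names are hypothetical, and how the count is turned into tensor weights is not shown because the slide does not give the formula.

```python
# Hypothetical toy data: (from_host, to_host, anchor_term) triples.
links = [
    ("a.example", "b.example", "java"),
    ("a.example", "c.example", "java"),
    ("b.example", "c.example", "download"),
    ("a.example", "b.example", "java"),  # repeated link, stored only once
]

def build_sparse_tensor(triples):
    """Return the set of nonzero (from, to, term) index triples plus the labels."""
    hosts = sorted({h for f, t, _ in triples for h in (f, t)})
    terms = sorted({term for _, _, term in triples})
    host_id = {h: i for i, h in enumerate(hosts)}
    term_id = {t: k for k, t in enumerate(terms)}
    coords = {(host_id[f], host_id[t], term_id[term]) for f, t, term in triples}
    return coords, hosts, terms

def unique_links_per_term(coords, n_terms):
    """Count the unique (from, to) links that use each term."""
    counts = [0] * n_terms
    for _, _, k in coords:
        counts[k] += 1
    return counts

coords, hosts, terms = build_sparse_tensor(links)
w = unique_links_per_term(coords, len(terms))
density = len(coords) / (len(hosts) * len(hosts) * len(terms))
```

Storing coordinates keeps memory proportional to the number of observed links rather than to the full size of the tensor, which matters when the number of hosts and terms is large.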


Conclusions

Real data are often in high dimensions with multiple aspects (modes)

Matrices and tensors provide elegant theory and algorithms 

Several research problems are still open: skewed distributions, anomaly detection, streaming algorithms, distributed/parallel algorithms, efficient out-of-core processing


References

Slides borrowed from the SIGMOD ’07 tutorial by Faloutsos, Kolda and Sun.