Fast Nearest Neighbor Search on Large Time-Evolving Graphs

Leman Akoglu, Srinivasan Parthasarathy, Rohit Khandekar, Vibhore Kumar, Deepak Rajan, Kun-Lung Wu


Post on 14-Jul-2020


Graphs are everywhere…


…and LARGE and TIME-evolving!


•  1.32 billion monthly active users (June 30, 2014)


Proximity problem on graphs

also: NN-search, similarity, closeness, relevance

Q: Which nodes are “close” to A?

[Figure: example graph over nodes A, B, D, E, F, G, H, I and J, with edge weights of 1]


Application: Recommendations


Other applications

•  Finding communities (e.g. co-authorship networks such as DBLP)
•  Anomaly detection (e.g. infected hosts, potential suspects)
•  Link prediction
•  Keyword search
•  Content-based image retrieval
•  Fighting spam
•  …


Proximity measures for graphs

•  Several metrics: shortest paths, commute time, hitting time, SimRank, …
•  Prevalent (robust) metric: Personalized PageRank (PPR)

PPR captures:
-  many,
-  short,
-  heavy-weighted paths

[Figure: example graph over nodes A, B, D, E, F, G, H, I and J, with edge weights of 1]


PPR is based on RWR (random walk with restart)

[Figure: a 12-node example graph (nodes 1 to 12), shown first without scores and then annotated with node scores ranging from 0.02 to 0.13]

Slides adapted from http://www.cs.cmu.edu/~htong/pub_new.htm
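As an illustration of how a random walk with restart yields PPR scores, here is a minimal sketch using plain power iteration. It is not the method proposed in this talk (it is closer to the brute-force baseline). The function name, the toy graph and the value alpha = 0.85 are assumptions made for the example; alpha is the probability of continuing the walk, so (1 − alpha) is the restart probability.

```python
import numpy as np

def personalized_pagerank(W, q, alpha=0.85, tol=1e-12, max_iter=1000):
    """Approximate PPR scores of all nodes with respect to query node q.

    W     : (n, n) nonnegative edge-weight matrix; every node needs an edge
    alpha : probability of continuing the walk (assumed value);
            (1 - alpha) is the probability of restarting at q
    """
    T = W / W.sum(axis=1, keepdims=True)   # row-stochastic transitions T(u, v)
    restart = np.zeros(W.shape[0])
    restart[q] = 1.0
    r = restart.copy()
    for _ in range(max_iter):
        # One step of the walk: follow an edge w.p. alpha, else jump back to q.
        r_next = alpha * (T.T @ r) + (1.0 - alpha) * restart
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Hypothetical toy graph: cycle 0-1-2-3 plus the chord 0-2, all weights 1.
W = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
scores = personalized_pagerank(W, q=0)
ranking = np.argsort(-scores)   # nodes ordered from highest to lowest score
```

Each iteration touches every edge, which is what makes this direct approach expensive on large graphs.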


Problem Definition

Maintain
–  a LARGE, time-varying, edge-weighted graph G(t), so that we can answer the following query efficiently:

Given a query node q in G(t) at time t, find vertices in G(t) that are “close” to q (w.r.t. the Personalized PageRank score).


Road Map

•  Motivation
•  Problem Definition
•  Previous work
•  Our Approach
   –  Graph clustering
   –  Intra-Cluster & Inter-Cluster Random Walks (baby steps & BIG steps)
•  Time-Varying Graphs
•  Experiments
•  Conclusions


Previous Work on PPR

•  D. Fogaras, B. Rácz, K. Csalogány, T. Sarlós. Towards Scaling Fully Personalized PageRank. Internet Mathematics, 2004.
•  H. Tong, C. Faloutsos, J.-Y. Pan. Fast Random Walk with Restart and Its Applications. ICDM 2006.
•  S. Chakrabarti. Dynamic Personalized PageRank in Entity-Relation Graphs. WWW 2007.
•  H. Tong, S. Papadimitriou, P. S. Yu, C. Faloutsos. Proximity Tracking on Time-Evolving Bipartite Graphs. SDM 2008.
•  P. Sarkar, A. W. Moore. Fast Nearest-Neighbor Search in Disk-Resident Graphs. KDD 2010.
•  B. Bahmani, A. Chowdhury, A. Goel. Fast Incremental and Personalized PageRank. PVLDB 2010.
•  B. Bahmani, K. Chakrabarti, D. Xin. Fast Personalized PageRank on MapReduce. SIGMOD 2011.
•  P. A. Lofgren, S. Banerjee, A. Goel, C. Seshadhri. FAST-PPR: Scaling Personalized PageRank Estimation for Large Graphs. KDD 2014.

We consider both large AND time-varying graphs!


Our Method – ClusterRank

1) Pre-computation
   a. Graph clustering
   b. Compute meta-info for each cluster

2) Query processing
   a. Identify relevant clusters to consider
   b. Combine their meta-info to compute an answer


Graph Clustering

•  We work with large graphs (that do not fit in main memory), so we cluster vertices such that each cluster is “small enough”.
•  We need “good” clusters: many intra-cluster edges, but few inter-cluster edges.
   –  Random walks are more likely to stay within a cluster
   –  A good cluster is already a good approximation of the “close” neighborhood of the vertices in it

Note: In some cases, the graph may be clustered naturally (e.g. a Web graph spread across many servers).


Graph Clustering

•  Many graph clustering algorithms exist, e.g. based on community detection, spectral partitioning, etc.
•  Reid Andersen, Fan Chung, and Kevin Lang (ACL). Local Graph Partitioning using PageRank Vectors. FOCS, 2006.
•  Advantages:
   –  Local algorithm: complexity depends on output cluster size
   –  Gives clusters of different sizes, which can be overlapping
   –  Can do clustering while the graph is on disk


What is “good” clustering?

ACL [FOCS06]’s measure is conductance:

Φ = 3 / (4+3+4+4+2) = 3/17 ≈ 0.176

Φ ∈ [0, 1]
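To make the measure concrete, the sketch below computes conductance in its usual form: the weight of edges leaving S divided by the smaller of the two volumes (sums of degrees). This matches the worked number above, where 3 cut edges are divided by the degree sum 17 of the cluster. The toy graph (two triangles joined by one edge) and the function name are assumptions for illustration, not the graph on the slide.

```python
import numpy as np

def conductance(W, S):
    """Conductance of node set S in an undirected weighted graph W."""
    in_S = np.zeros(W.shape[0], dtype=bool)
    in_S[list(S)] = True
    cut = W[in_S][:, ~in_S].sum()          # total weight of edges leaving S
    degree = W.sum(axis=1)
    vol_S, vol_rest = degree[in_S].sum(), degree[~in_S].sum()
    return cut / min(vol_S, vol_rest)

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0

phi = conductance(W, [0, 1, 2])   # 1 cut edge over volume 2 + 2 + 3 = 7
```

A lower value means fewer edges leave the cluster relative to its size, which is the sense in which a cluster is “good” here.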


Graph Clustering

[Figure: graph G]


Our Method – ClusterRank

1) Pre-computation
   a. Graph clustering
   b. Compute meta-info for each cluster

2) Query processing
   a. Identify relevant clusters to consider
   b. Combine their meta-info to compute an answer


Compute meta-info for each cluster

C(u,v) : The expected number of times (Count) a random walk (RW) starting at node u in cluster S hits node v before exiting S (it can exit by walking to another cluster or by restarting to q).

E(u,v) : The probability that a RW starting at node u in cluster S Exits S to node v, an out-bound node in B (assuming the query (restart) vertex q is outside S).

[Figure: example cluster S]

For the example cluster S, the C matrix is 5×5 (|S| × |S|) and the E matrix is 5×3 (|S| × (|B| + |q|)).


Compute meta-info for each cluster
Intra-cluster random walks → baby steps

[Figure: clusters S1, S2, S3 and S4, with query node q]


Compute meta-info for each cluster

Recursive definition for C

T(u, v) : transition probability from u to v
N(u) : set of neighbor nodes of node u
(1 − α) : restart probability
S : set of nodes in the given cluster


Compute meta-info for each cluster

Closed-form formulae for C and E

[Equations for C and, similarly, for E. Their legend names two matrices:]
–  an |S| × |S| transition matrix
–  an |S| × (|B|+1) matrix with exit probabilities to nodes in B ∪ {q}
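The sketch below shows one reading of the closed forms that is consistent with the verbal definitions of C and E; it is an assumption, not the formulae from the slide. With T_SS the |S| × |S| block of transition probabilities inside the cluster, the recursion C = I + α T_SS C gives C = (I − α T_SS)^(-1), and E = C X, where X is an |S| × (|B|+1) matrix of one-step exit probabilities (α T(u, b) for b in B, and (1 − α) + α T(u, q) for the restart node q). The function name, the symbol X and the toy graph are hypothetical.

```python
import numpy as np

def cluster_meta_info(T, S, B, q, alpha):
    """Return (C, E) for cluster S, assuming the query node q lies outside S.

    T : (n, n) row-stochastic transition matrix of the whole graph
    B : out-bound nodes reachable from S in one step (other than q)
    """
    S, B = list(S), list(B)
    T_SS = T[np.ix_(S, S)]
    # Expected visit counts before exiting S: C = I + alpha * T_SS @ C.
    C = np.linalg.inv(np.eye(len(S)) - alpha * T_SS)
    # One-step exit probabilities, one column per node in B plus one for q.
    X = np.zeros((len(S), len(B) + 1))
    X[:, :len(B)] = alpha * T[np.ix_(S, B)]
    X[:, -1] = (1.0 - alpha) + alpha * T[S, q]   # restart, or walk onto q
    E = C @ X
    return C, E

# Hypothetical toy graph on 5 nodes with unit weights.
W = np.zeros((5, 5))
for u, v in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (0, 4)]:
    W[u, v] = W[v, u] = 1.0
T = W / W.sum(axis=1, keepdims=True)
alpha = 0.85
C, E = cluster_meta_info(T, S=[0, 1, 2], B=[3], q=4, alpha=alpha)
```

Under this reading every row of E sums to 1, since a walk that starts in S must eventually leave it, either across the boundary or by restarting.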


Our Method – ClusterRank

1) Pre-computation (OFFLINE)
   a. Graph clustering
   b. Compute meta-info for each cluster

2) Query processing (ONLINE)
   a. Identify relevant clusters to consider
   b. Combine their meta-info to compute an answer


Query processing
Update meta-info for q’s cluster

Recall: Closed-form formulae for C and E

Cq (C given q) :

Eq (E given q) : Cq

K : |S|×|S| matrix of 0s with column q all 1s (rank 1!) → fast Sherman-Morrison matrix inverse update
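The rank-1 update can be sketched as follows. The Sherman-Morrison identity itself is standard; how it is wired to C here is an assumption: when q lies inside S, a restart lands back in S, so the in-cluster transitions are taken to be α T_SS + (1 − α) K, and Cq is the inverse of I − α T_SS − (1 − α) K. The names sherman_morrison, T_SS and C_q, and the 3-node cluster, are hypothetical.

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """Return the inverse of (A + u v^T), given A_inv = A^{-1}."""
    Au = A_inv @ u
    vA = v @ A_inv
    denom = 1.0 + v @ Au
    if abs(denom) < 1e-12:
        raise ValueError("rank-1 update makes the matrix singular")
    return A_inv - np.outer(Au, vA) / denom

alpha = 0.85
# Hypothetical in-cluster transitions; rows 0 and 2 leak out of the cluster.
T_SS = np.array([[0, 1/3, 1/3],
                 [1/2, 0, 1/2],
                 [1/3, 1/3, 0]])
C = np.linalg.inv(np.eye(3) - alpha * T_SS)    # pre-computed offline

q = 1                                          # query node, local index in S
e_q = np.zeros(3)
e_q[q] = 1.0
# K = ones * e_q^T has column q all 1s, so subtracting (1 - alpha) * K
# is the rank-1 update with u = -(1 - alpha) * ones and v = e_q.
C_q = sherman_morrison(C, -(1.0 - alpha) * np.ones(3), e_q)
```

The update costs one matrix-vector product on each side plus an outer product, instead of a fresh |S| × |S| inversion per query.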


Query processing
Inter-cluster graph M over relevant clusters

[Figure: clusters S1, S2, S3 and S4, with query node q]


Query processing
Inter-cluster random walks → BIG steps

[Figure: inter-cluster graph M, with query node q]


Query processing
Combine intra- and inter-cluster meta-info to compute the final answer (“lift” C matrices)

[Figure: clusters S1, S2, S3 and S4]


Theorem: ClusterRank gives exact PPR scores.
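A small numerical check can make the exactness claim plausible. The sketch below is a simplified stand-in, not the ClusterRank algorithm: it treats a restart as the end of the walk (so nothing depends on q until the last line), assumes non-overlapping clusters that cover all nodes, and uses all clusters rather than a relevant subset. Per-cluster matrices C play the role of baby steps, M = C·O (O holds the α-weighted inter-cluster transitions) plays the role of BIG steps, and the block identity (I − αT)^(-1) = (I − M)^(-1) C recovers the full visit counts. All names and the toy graph are hypothetical.

```python
import numpy as np

def visits_via_clusters(T, clusters, alpha):
    """Expected visit counts (I - alpha*T)^{-1}, assembled cluster by cluster."""
    n = T.shape[0]
    C = np.zeros((n, n))      # block-diagonal: one C matrix per cluster
    O = alpha * T             # new array; inter-cluster part kept below
    for S in clusters:
        block = np.ix_(S, S)
        C[block] = np.linalg.inv(np.eye(len(S)) - alpha * T[block])
        O[block] = 0.0        # drop intra-cluster transitions
    M = C @ O                 # BIG steps: leave a cluster and land elsewhere
    return np.linalg.solve(np.eye(n) - M, C)

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0
T = W / W.sum(axis=1, keepdims=True)
alpha = 0.85

N = visits_via_clusters(T, [[0, 1, 2], [3, 4, 5]], alpha)
ppr_from_0 = (1.0 - alpha) * N[0]   # PPR scores for query node 0
```

The result agrees with a direct inversion of the full matrix up to floating-point error, which is the sense in which a cluster decomposition loses nothing.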


Road Map

•  Motivation
•  Problem Definition
•  Previous work
•  Our Approach
   –  Graph clustering
   –  Intra-Cluster & Inter-Cluster Random Walks (baby steps & BIG steps)
•  Time-Varying Graphs
•  Experiments
•  Conclusions


Time-varying ClusterRank

•  WLOG: assume a single edge (u,v) is added
•  Observation: the changes in the affected matrices are low-rank → compute the new C & E by the Sherman-Morrison (SM) formula
•  4 cases studied in the paper:
   –  Both u and v are new vertices
   –  Either u or v is a new vertex
   –  u and v are in the same cluster
   –  u and v are in different clusters
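For the case where u and v are in the same cluster, the low-rank observation can be sketched as follows; the other three cases are not covered, and this is an illustration under assumptions rather than the paper's procedure. Adding an undirected edge changes only rows u and v of the row-normalized transition matrix, so I − α T_SS changes by two rank-1 terms, and C can be refreshed with two Sherman-Morrison steps. It assumes C = (I − α T_SS)^(-1); the function names and the toy graph are hypothetical.

```python
import numpy as np

def transitions_within(W, S):
    """Rows and columns S of the row-normalized transition matrix of W."""
    T = W / W.sum(axis=1, keepdims=True)
    return T[np.ix_(S, S)]

def update_C(C, T_old, T_new, alpha):
    """Refresh C = (I - alpha*T)^{-1} after a few rows of T have changed."""
    C_new = C.copy()
    for i in np.flatnonzero(np.any(T_old != T_new, axis=1)):
        v = -alpha * (T_new[i] - T_old[i])   # change in row i of I - alpha*T
        Ce_i = C_new[:, i].copy()            # C_new @ e_i
        vC = v @ C_new
        # Sherman-Morrison step for the rank-1 change e_i v^T.
        C_new = C_new - np.outer(Ce_i, vC) / (1.0 + vC[i])
    return C_new

# Hypothetical graph: path 0-1-2-3-4, cluster S = {0,1,2,3}, node 4 outside.
S = [0, 1, 2, 3]
W_old = np.zeros((5, 5))
for a, b in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    W_old[a, b] = W_old[b, a] = 1.0
W_new = W_old.copy()
W_new[0, 2] = W_new[2, 0] = 1.0              # new edge (0, 2) inside S

alpha = 0.85
T_old, T_new = transitions_within(W_old, S), transitions_within(W_new, S)
C_old = np.linalg.inv(np.eye(len(S)) - alpha * T_old)
C_new = update_C(C_old, T_old, T_new, alpha)
```

Only the cluster containing the new edge is touched; the matrices of all other clusters stay as they were.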


Graph datasets

Dataset      #edges  #nodes  #clusters  Description
Synthetic    909K    300K    100        Planted partitions
Amazon       900K    262K    3739       Product co-purchase
Web          1.1M    325K    2793       http://nd.edu links
DBLP         1.1M    329K    4670       Co-authorships
LiveJournal  21.5M   2.7M    15252      Friendships

Dataset      median Φ  avg. Φ  med. size  avg. size
Amazon       0.1385    0.1486  17         98.5
Web          0.0625    0.0871  31         129.4
DBLP         0.2117    0.2196  27         102.4
LiveJournal  0.5500    0.5319  43         237.3


Pre-computation


Pre-computation time depends on 1) graph size, 2) #clusters, 3) parallelization


Query Processing: set up

•  Instead of all clusters, focus on a subset of relevant clusters: small neighborhoods around the query vertex (1 or 2 hops away).
•  Allow for a maximum of B boundary vertices
•  Sparsify the inter-cluster matrix: zero out entries close to zero
•  100 randomly chosen query vertices


Evaluation criterion

•  We report accuracy and running time for k nearest neighbor (kNN) queries.
•  Accuracy = Relative Average Goodness (RAG) score @k

RAG(@k) = (total true score of output) / (total true score of “optimum”)

Note: precision, i.e. “overlap with optimum”, is *not* a good measure (due to ties/near-ties).
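The RAG ratio can be sketched in a few lines. The function name, its argument layout and the toy scores are assumptions for illustration; "true score" is taken to mean the exact PPR score of each node.

```python
import numpy as np

def rag_at_k(true_scores, returned, k):
    """Relative Average Goodness at k.

    true_scores : exact score of every node (index = node id)
    returned    : node ids output by the method, best first
    """
    true_scores = np.asarray(true_scores, dtype=float)
    got = true_scores[list(returned)[:k]].sum()       # true score of output
    best = np.sort(true_scores)[::-1][:k].sum()       # true score of optimum
    return got / best

# Hypothetical scores with a tie between nodes 1 and 2.
true_scores = [0.4, 0.3, 0.3, 0.0]
rag_tied = rag_at_k(true_scores, returned=[0, 2], k=2)
rag_miss = rag_at_k(true_scores, returned=[0, 3], k=2)
```

Returning nodes 0 and 2 scores as well as the optimum of nodes 0 and 1, although the overlap with that optimum is only one node out of two; this is why ties make precision misleading.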


Synthetic graphs

SYNTHETIC                                  2-HOP    1-HOP
Average RAG(@50) score (100 runs)
   B = 5K                                  0.9986   0.9865
   B = 1K                                  0.9892   0.9865
ClusterRank average response time (sec.)
   B = 5K                                  5.12     2.18
   B = 1K                                  2.86     2.12

Brute-force: 5.16 sec.


Real graphs


Dynamic updates


•  500K-edge DBLP graph + 1K new edges

Avg: 42.12 seconds

Avg: 2.78 clusters

Note: load/store time of C, E matrices included


Dynamic updates

[Figure: DBLP 1959-2001; panels: +1K edges in time, +500K edges in time]


Summary

•  ClusterRank: k Nearest Neighbor queries based on Personalized PageRank scores
   –  Works with large and time-evolving graphs
   –  Fast query time: sub-linear computation on pre-computed meta-info
   –  Efficient dynamic updates by low-rank matrices
   –  Disk-based: query processing and dynamic updates only on the relevant subset of clusters
•  Future directions
   –  Cluster tracking and localized re-clustering
   –  Extension to hitting / commute time


Thank You!


[email protected]

http://www.cs.stonybrook.edu/~leman


Back-up Slides


Recursive definition for E

T(u, v) : transition probability from u to v
N(u) : set of neighbor nodes of node u
(1 − α) : restart probability
S : set of nodes in the given cluster


Closed formulations for C and E

C1 is an |S|×|S| identity matrix

Similarly,


What if s (query vertex) ∈ S?


At query time, given the query vertex s, only the two matrices (C and E) of the cluster in which s resides are updated.

K is rank 1! Therefore, we use the Sherman-Morrison lemma to update C.

Complexity: multiplication of |S|×1 and 1×|S| vectors.
Note that we do not need to run SVD, as K is only rank 1!
