
Fast Nearest Neighbor Search on Large Time-Evolving Graphs

Leman Akoglu, Srinivasan Parthasarathy, Rohit Khandekar, Vibhore Kumar, Deepak Rajan, Kun-Lung Wu

Graphs are everywhere…

Leman Akoglu Fast Nearest Neighbor Search on Large Time-Evolving Graphs 3

…and LARGE and TIME-evolving!


•  1.32 billion monthly active users (June 30, 2014)

Proximity problem on graphs

also: NN-search, similarity, closeness, relevance

Q: Which nodes are “close” to A?

[Figure: example graph on nodes A, B, D, E, F, G, H, I, J with edge weights 1]


Application: Recommendations


Other applications

•  Finding communities (e.g. co-authorship networks such as DBLP)
•  Anomaly detection (e.g. infected hosts, potential suspects)
•  Link Prediction
•  Keyword search
•  Content-based Image Retrieval
•  Fighting spam
•  …


Proximity measures for graphs
•  Several metrics: shortest paths, commute time, hitting time, SimRank, …
•  Prevalent (robust) metric: Personalized PageRank (PPR)

PPR captures:
   -  many,
   -  short,
   -  heavy-weighted paths

[Figure: example graph on nodes A, B, D, E, F, G, H, I, J with edge weights 1]


PPR is based on RWR

[Figure: 12-node example graph, shown twice; the second copy is annotated with per-node RWR scores between 0.02 and 0.13]

Slides adapted from http://www.cs.cmu.edu/~htong/pub_new.htm
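To make the link between PPR and random walk with restart (RWR) concrete, here is a minimal sketch that computes PPR scores by power iteration. It is an illustration, not the authors' code: the function name, the dense NumPy representation, and the default alpha = 0.85 are assumptions. It follows the convention used later in these slides, where (1 − α) is the restart probability.

```python
import numpy as np

def personalized_pagerank(A, q, alpha=0.85, tol=1e-10, max_iter=1000):
    """Approximate PPR scores of all nodes with respect to query node q.

    A is a dense (n, n) nonnegative weight matrix. Every node is assumed
    to have at least one outgoing edge so that each row can be normalized.
    """
    A = np.asarray(A, dtype=float)
    T = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    e_q = np.zeros(A.shape[0])
    e_q[q] = 1.0
    r = e_q.copy()
    for _ in range(max_iter):
        # Follow an edge with probability alpha, restart at q otherwise.
        r_next = alpha * (T.T @ r) + (1 - alpha) * e_q
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
scores = personalized_pagerank(A, q=0)
nearest = np.argsort(-scores)   # nodes ordered from closest to farthest
```

Each iteration touches every edge, which is the cost that motivates pre-computing per-cluster information for large graphs.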


Problem Definition

Maintain
–  A LARGE, time-varying, edge-weighted graph G(t),
so that we can answer the following query efficiently:

Given a query node q in G(t) at time t, find vertices in G(t) that are “close” to q (w.r.t. the Personalized PageRank score)


Road Map
•  Motivation
•  Problem Definition
•  Previous work
•  Our Approach
   -  Graph clustering
   -  Intra-Cluster & Inter-Cluster Random Walks (baby steps & BIG steps)
•  Time-Varying Graphs
•  Experiments
•  Conclusions

Previous Work on PPR
•  D. Fogaras, B. Rácz, K. Csalogány, Tamás Sarlós. Towards scaling fully personalized pagerank. In Internet Mathematics 2004.
•  Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast Random Walk with Restart and Its Applications. In ICDM 2006.
•  Soumen Chakrabarti. Dynamic personalized pagerank in entity-relation graphs. In WWW 2007.
•  H. Tong, S. Papadimitriou, P. S. Yu and C. Faloutsos. Proximity Tracking on Time-Evolving Bipartite Graphs. In SDM 2008.
•  P. Sarkar, A. W. Moore. Fast nearest-neighbor search in disk-resident graphs. In KDD 2010.
•  Bahman Bahmani, Abdur Chowdhury, Ashish Goel. Fast Incremental and Personalized PageRank. In PVLDB 2010.
•  Bahman Bahmani, Kaushik Chakrabarti, Dong Xin. Fast personalized PageRank on MapReduce. In SIGMOD 2011.
•  P. A. Lofgren, S. Banerjee, A. Goel, C. Seshadhri. FAST-PPR: Scaling Personalized PageRank Estimation for Large Graphs. In KDD 2014.

We consider both large AND time-varying graphs!


Our Method – ClusterRank

1) Pre-computation
   a. Graph clustering
   b. Compute meta-info for each cluster

2) Query processing
   a. Identify relevant clusters to consider
   b. Combine their meta-info to compute an answer


Graph Clustering
•  We work with large graphs (that do not fit in main memory), thus cluster vertices such that each cluster is “small enough”.
•  Need “good” clusters: many intra-cluster edges, but few inter-cluster edges.
   -  Random walks are more likely to stay within the cluster
   -  A good cluster is already a good approximation of the “close” neighborhood of the vertices in the cluster

Note: For some cases, the graph could be clustered naturally (e.g. Web graph across many servers)


Graph Clustering
•  Many graph clustering algorithms, e.g. based on community detection, spectral partitioning, etc.
•  Reid Andersen, Fan Chung, and Kevin Lang (ACL). Local Graph Partitioning using PageRank Vectors. FOCS, 2006.
•  Advantages:
   -  Local algorithm: complexity depends on output cluster size
   -  Gives different size clusters which can be overlapping
   -  Can do clustering while graph is on disk


What is “good” clustering?

ACL [FOCS06]’s measure is conductance:

Φ = 3 / (4+3+4+4+2) = 3/17 ≈ 0.176

Φ ∈ [0, 1]
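As a sketch of how such a score can be computed, the function below uses the common definition of conductance: the weight of edges leaving S divided by the smaller of the two volumes (sums of degrees) on either side of the cut. In the slide's example the numerator 3 is the number of cut edges and the denominator is the sum of the degrees of the five cluster nodes. The function name and the dense-matrix representation are assumptions made for illustration.

```python
import numpy as np

def conductance(A, S):
    """Conductance of node set S in a graph with symmetric weight matrix A."""
    A = np.asarray(A, dtype=float)
    in_S = np.zeros(A.shape[0], dtype=bool)
    in_S[list(S)] = True
    cut = A[in_S][:, ~in_S].sum()   # total weight of edges leaving S
    vol_S = A[in_S].sum()           # sum of degrees inside S
    vol_rest = A[~in_S].sum()       # sum of degrees outside S
    return cut / min(vol_S, vol_rest)

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
phi = conductance(A, [0, 1, 2])   # one cut edge over a volume of 7
```

A lower value means fewer edges leave the cluster relative to its size, so a random walk started inside is more likely to stay inside.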


Graph Clustering

[Figure: graph G]


Compute meta-info for each cluster

C(u,v) : The expected number of times (Count) a RW starting at node u in cluster S hits node v before exiting S (it can exit by walking to another cluster or by restarting to q).

E(u,v) : The probability that a RW starting at node u in cluster S Exits S to node v (an out-bound node in B), assuming the query (restart) vertex q is outside S.

[Figure: example cluster S]

C matrix for S is 5x5 (|S| x |S|); E matrix is 5x3 (|S| x 2|B|+|q|)


Compute meta-info for each cluster
Intra-cluster random walks → baby steps

[Figure: clusters S1, S2, S3, S4 and query node q]



Recursive definition for C

T(u, v) : transition probability from u to v
N(u) : neighbor nodes set of node u
(1 − α) : restart probability
S : set of nodes in given cluster



Similarly,

Closed-form formulae for C and E


: |S| x (|B|+1) matrix with exit probabilities to nodes in B ∪ {q}

: |S| x |S| transition matrix
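The formulae themselves are not reproduced in this text, so the sketch below is a reconstruction from the definitions of C and E rather than the authors' exact notation. It assumes the recursion C(u,v) = 1[u = v] + α Σ_{w ∈ N(u) ∩ S} T(u,w) C(w,v), which gives C = (I − α T_S)^(-1), and it obtains E by multiplying C with the one-step exit probabilities to B ∪ {q}. The function name, argument layout, and the choice to put the restart column last are illustrative assumptions.

```python
import numpy as np

def cluster_meta_info(T, S, B, alpha):
    """Compute C and E for one cluster, assuming the query node q is outside S.

    T     : (n, n) row-stochastic transition matrix of the whole graph
    S     : list of node ids in the cluster
    B     : list of out-of-cluster nodes adjacent to S (boundary nodes)
    alpha : probability of following an edge; (1 - alpha) is the restart probability
    """
    S, B = list(S), list(B)
    T_SS = T[np.ix_(S, S)]   # transitions that stay inside the cluster
    T_SB = T[np.ix_(S, B)]   # transitions that leave the cluster
    # Expected visit counts before exiting: sum over k of (alpha * T_SS)^k.
    C = np.linalg.inv(np.eye(len(S)) - alpha * T_SS)
    # One-step exits: walk to a boundary node, or restart to q (last column).
    one_step_exit = np.hstack([alpha * T_SB,
                               (1 - alpha) * np.ones((len(S), 1))])
    E = C @ one_step_exit    # |S| x (|B| + 1)
    return C, E
```

Under these assumptions every row of E sums to 1, because the walk eventually leaves S either across the boundary or by restarting.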




1) Pre-computation: OFFLINE

2) Query processing: ONLINE

Closed-form formulae for C and E

Cq (C given q) :

Eq (E given q) : Cq

Recall:

K : |S| x |S| matrix of 0s with column q all 1s (rank 1!) → fast Sherman-Morrison matrix inverse update
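The rank-1 update can be sketched as follows. The code assumes that when q lies inside S, a restart lands back on q within the cluster, so the matrix to invert becomes I − α T_S − (1 − α) K with K = 1 e_q^T. That reading of K, and the function name, are assumptions based on the definitions above, not a transcription of the slide's formula.

```python
import numpy as np

def update_C_for_query(C, q_idx, alpha):
    """Sherman-Morrison update of C = (I - alpha*T_S)^-1 when q is in the cluster.

    Returns (I - alpha*T_S - (1 - alpha)*K)^-1, where K is all zeros except
    for column q_idx, which is all ones. q_idx is q's position inside S.
    """
    n = C.shape[0]
    u = (1 - alpha) * (C @ np.ones(n))   # C applied to the vector (1 - alpha) * 1
    v = C[q_idx, :]                      # row q of C, i.e. e_q^T C
    denom = 1.0 - u[q_idx]               # 1 - (1 - alpha) * e_q^T C 1
    if abs(denom) < 1e-12:
        raise ValueError("walk never leaves the cluster; C_q is undefined")
    return C + np.outer(u, v) / denom
```

The update costs one matrix-vector product and one outer product, O(|S|^2), instead of a fresh O(|S|^3) inversion at query time.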

Query processing
Update meta-info for q’s cluster

[Figure: clusters S1, S2, S3, S4 and query node q]

Query processing
Inter-cluster graph M over relevant clusters

[Figure: relevant clusters around query node q]

Query processing
Inter-cluster random walks → BIG steps

[Figure: inter-cluster graph M]


Query processing
Combine intra- and inter-cluster meta-info to compute final answer (“lift” C matrices)

[Figure: clusters S1, S2, S3, S4]



Theorem: ClusterRank gives exact PPR scores.
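One way to see why a cluster-based decomposition can be exact is the block-matrix identity sketched below. This is a simplified illustration, not the ClusterRank algorithm as published: it uses every cluster rather than a relevant subset, treats the restart as terminating the walk (so no restart column of E is needed), stores dense n x n matrices, and assumes disjoint clusters that cover all nodes. The function names are hypothetical.

```python
import numpy as np

def ppr_direct(T, q, alpha):
    """Exact PPR: r^T = (1 - alpha) * e_q^T (I - alpha*T)^-1."""
    n = T.shape[0]
    e_q = np.zeros(n)
    e_q[q] = 1.0
    return (1 - alpha) * np.linalg.solve((np.eye(n) - alpha * T).T, e_q)

def ppr_clustered(T, clusters, q, alpha):
    """Same scores, assembled from per-cluster C matrices."""
    n = T.shape[0]
    C_blocks = np.zeros((n, n))   # block-diagonal matrix of per-cluster C
    T_inter = T.copy()            # keeps only inter-cluster transitions
    for S in clusters:
        idx = np.ix_(S, S)
        C_blocks[idx] = np.linalg.inv(np.eye(len(S)) - alpha * T[idx])
        T_inter[idx] = 0.0
    # One BIG step: wander inside a cluster (baby steps), then hop to another.
    M = alpha * C_blocks @ T_inter
    e_q = np.zeros(n)
    e_q[q] = 1.0
    # x[u] = expected number of times the walk enters a cluster at node u,
    # counting the start at q as the first entry.
    x = np.linalg.solve((np.eye(n) - M).T, e_q)
    # "Lift": spread each cluster entry over the nodes visited inside it.
    return (1 - alpha) * (x @ C_blocks)

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
T = A / A.sum(axis=1, keepdims=True)
same = np.allclose(ppr_direct(T, 0, 0.85),
                   ppr_clustered(T, [[0, 1, 2], [3, 4, 5]], 0, 0.85))
```

The identity behind it is (I − C_blocks · αT_inter)^(-1) C_blocks = (I − αT)^(-1), which holds because C_blocks is the inverse of I minus α times the intra-cluster part of T.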



Time-varying ClusterRank
•  WLOG: assume single edge (u,v) added
•  Observation: changes in & low-rank → compute new C & E by Sherman-Morrison (SM) formula
•  4 cases studied in paper:
   -  Both u and v new vertices
   -  Either u or v is a new vertex
   -  u and v in same cluster
   -  u and v in different clusters
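For the case where u and v are in the same cluster, the low-rank observation can be sketched as follows. Adding an edge out of u changes only row u of the cluster's transition matrix, which is a rank-1 change, so C can be refreshed without a new inversion. The sketch covers a single changed row (a directed edge); for an undirected edge one would apply it once for row u and once for row v. The function name and arguments are illustrative assumptions, and the paper's treatment of the four cases may differ.

```python
import numpy as np

def update_C_after_row_change(C, u, delta_row, alpha):
    """Sherman-Morrison update of C = (I - alpha*T_S)^-1 after row u of T_S changes.

    delta_row is (new row u of T_S) - (old row u of T_S), restricted to the
    cluster's nodes. u is the node's position inside the cluster.
    """
    col_u = C[:, u]                   # C e_u
    row_d = delta_row @ C             # delta_row^T C
    denom = 1.0 - alpha * row_d[u]    # 1 - alpha * delta_row^T C e_u
    if abs(denom) < 1e-12:
        raise ValueError("updated matrix is numerically singular")
    return C + alpha * np.outer(col_u, row_d) / denom
```

E can then be recomputed from the updated C and the changed one-step exit probabilities of row u.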


Graph datasets


Dataset      #edges  #nodes  #clusters  description
Synthetic    909K    300K    100        Planted partitions
Amazon       900K    262K    3739       Product co-purchase
Web          1.1M    325K    2793       http://nd.edu links
DBLP         1.1M    329K    4670       Co-authorships
LiveJournal  21.5M   2.7M    15252      Friendships

Dataset      median Φ  avg. Φ  med. size  avg. size
Amazon       0.1385    0.1486  17         98.5
Web          0.0625    0.0871  31         129.4
DBLP         0.2117    0.2196  27         102.4
LiveJournal  0.5500    0.5319  43         237.3

Pre-computation


Pre-computation time depends on 1) graph size, 2) #clusters, 3) parallelization

Query Processing: set up
•  Instead of all clusters, focus on a subset of relevant clusters (small neighborhoods around the query vertex, 1 or 2 hops away).
•  Allow for a maximum of B boundary vertices
•  Sparsify inter-cluster matrix: zero out entries close to zero
•  100 randomly chosen query vertices

Evaluation criterion

•  We report accuracy and running time for k nearest neighbor (kNN) queries.
•  Accuracy = Relative Average Goodness (RAG) score @k

RAG(@k) = (total true score of output) / (total true score of “optimum”)

Note: precision, i.e. “overlap with optimum”, is *not* a good measure (due to ties/near-ties).
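A small sketch of the RAG computation, assuming the exact PPR scores of all nodes are available as an array and the method under test returns a ranked list of node ids. The function and argument names are hypothetical.

```python
import numpy as np

def rag_at_k(true_scores, returned, k):
    """Relative Average Goodness at k.

    true_scores : exact proximity score of every node
    returned    : node ids returned by the method under test, best first
    """
    true_scores = np.asarray(true_scores, dtype=float)
    top = list(returned)[:k]
    achieved = true_scores[top].sum()               # true score of the output
    optimum = np.sort(true_scores)[::-1][:k].sum()  # best achievable total
    return achieved / optimum
```

Unlike precision, this does not penalize returning a node whose true score is tied or nearly tied with one in the optimal set.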


Synthetic graphs


SYNTHETIC                                   2-HOP    1-HOP

Average RAG(@50) score (100 runs)
   B = 5K                                   0.9986   0.9865
   B = 1K                                   0.9892   0.9865

ClusterRank Average Response Time (sec.)
   B = 5K                                   5.12     2.18
   B = 1K                                   2.86     2.12

Brute-Force: 5.16 sec.

Real graphs


Dynamic updates


•  500K-edge DBLP graph + 1K new edges

Avg: 42.12 seconds

Avg: 2.78 clusters

Note: load/store time of C, E matrices included

Dynamic updates


DBLP 1959-2001

+1K edges in time

+500K edges in time

Summary
•  ClusterRank: k Nearest Neighbor queries based on Personalized PageRank scores
   -  Works with large and time-evolving graphs
   -  Fast query time: sub-linear computation on pre-computed meta-info
   -  Efficient dynamic updates by low-rank matrices
   -  Disk-based: query processing and dynamic updates only on relevant subset of clusters
•  Future directions
   -  Cluster tracking and localized re-clustering
   -  Extension to hitting / commute time


Thank You!



http://www.cs.stonybrook.edu/~leman

Back-up Slides

Recursive definition for E

T(u, v) : transition probability from u to v
N(u) : neighbor nodes set of node u
(1 − α) : restart probability
S : set of nodes in given cluster


Closed formulations for C and E

C1 is an identity matrix of |S|x|S|

Similarly,


What if s (query vertex) ∈ S ?


At query time, given the query vertex s, only the two matrices (C and E) of the cluster in which s resides are updated.

K is rank 1! Therefore, we will use the Sherman-Morrison Lemma to update C.

Complexity: multiplication of |S|x1 and 1x|S| vectors.
Note that we do not need to run SVD, as K is only rank 1!

