Fast Nearest Neighbor Search on Large Time-Evolving Graphs
Leman Akoglu Srinivasan Parthasarathy Rohit Khandekar Vibhore Kumar Deepak Rajan Kun-Lung Wu
Graphs are everywhere…
…and LARGE and TIME-evolving!
• 1.32 billion monthly active users (June 30, 2014)
Proximity problem on graphs
also: NN-search, similarity, closeness, relevance
Q: Which nodes are “close” to A?
[Figure: example graph with nodes A to J and edge weights of 1]
Application: Recommendations
Other applications
• Finding communities (e.g. co-authorship networks such as DBLP)
• Anomaly detection (e.g. infected hosts, potential suspects)
• Link prediction
• Keyword search
• Content-based image retrieval
• Fighting spam
• …
Proximity measures for graphs
• Several metrics: shortest paths, commute time, hitting time, SimRank, …
• Prevalent (robust) metric: Personalized PageRank (PPR)
PPR captures:
– many,
– short,
– heavy-weighted paths
[Figure: example graph with nodes A to J and edge weights of 1]
PPR is based on RWR
[Figure: 12-node example graph, shown plain and then annotated with RWR scores ranging from 0.02 to 0.13]
Slides adapted from http://www.cs.cmu.edu/~htong/pub_new.htm
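A minimal sketch of how PPR scores follow from a random walk with restart (RWR), assuming a small dense weight matrix and every node having at least one outgoing edge. The function name `personalized_pagerank`, the toy path graph, and the value alpha = 0.85 are illustrative assumptions, not taken from the slides; (1 − alpha) plays the role of the restart probability.

```python
import numpy as np

def personalized_pagerank(W, q, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for RWR scores with restart node q.

    W     : (n, n) non-negative weight matrix, W[u, v] = weight of edge u -> v
    alpha : probability of following an edge; (1 - alpha) = restart probability
    """
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    T = W / W.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    restart = np.zeros(n)
    restart[q] = 1.0
    p = restart.copy()
    for _ in range(max_iter):
        # One step: follow an edge with prob. alpha, jump back to q otherwise.
        p_next = alpha * (T.T @ p) + (1.0 - alpha) * restart
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Hypothetical toy graph: undirected path 0 - 1 - 2 - 3, query node 0.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
scores = personalized_pagerank(W, q=0)
```

The resulting vector sums to 1, and ranking nodes by `scores` gives the nearest neighbors of q under this measure. This brute-force iteration touches the whole graph on every query.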
Problem Definition
Maintain
– A LARGE, time-varying, edge-weighted graph G(t),
so that we can answer the following query efficiently:
Given a query node q in G(t) at time t, find vertices in G(t) that are “close” to q (w.r.t. the Personalized PageRank score)
Road Map
• Motivation
• Problem Definition
• Previous work
• Our Approach
– Graph clustering
– Intra-Cluster & Inter-Cluster Random Walks (baby steps & BIG steps)
• Time-Varying Graphs
• Experiments
• Conclusions
Previous Work on PPR
• D. Fogaras, B. Rácz, K. Csalogány, T. Sarlós. Towards scaling fully personalized pagerank. In Internet Mathematics 2004.
• H. Tong, C. Faloutsos, J.-Y. Pan. Fast Random Walk with Restart and Its Applications. In ICDM 2006.
• S. Chakrabarti. Dynamic personalized pagerank in entity-relation graphs. In WWW 2007.
• H. Tong, S. Papadimitriou, P. S. Yu, C. Faloutsos. Proximity Tracking on Time-Evolving Bipartite Graphs. In SDM 2008.
• P. Sarkar, A. W. Moore. Fast nearest-neighbor search in disk-resident graphs. In KDD 2010.
• B. Bahmani, A. Chowdhury, A. Goel. Fast Incremental and Personalized PageRank. In PVLDB 2010.
• B. Bahmani, K. Chakrabarti, D. Xin. Fast personalized PageRank on MapReduce. In SIGMOD 2011.
• P. A. Lofgren, S. Banerjee, A. Goel, C. Seshadhri. FAST-PPR: Scaling Personalized PageRank Estimation for Large Graphs. In KDD 2014.
We consider both large AND time-varying graphs!
Our Method – ClusterRank
1) Pre-computation
a. Graph clustering
b. Compute meta-info for each cluster
2) Query processing
a. Identify relevant clusters to consider
b. Combine their meta-info to compute an answer
Graph Clustering
• We work with large graphs (that do not fit in main memory), thus we cluster vertices such that each cluster is “small enough”.
• Need “good” clusters: many intra-cluster edges, but few inter-cluster edges.
– Random walks are more likely to stay within a cluster
– A good cluster is already a good approximation of the “close” neighborhood of the vertices in the cluster
Note: In some cases, the graph could be clustered naturally (e.g. Web graph across many servers)
Graph Clustering
• Many graph clustering algorithms exist, e.g. based on community detection, spectral partitioning, etc.
• We use: Reid Andersen, Fan Chung, and Kevin Lang (ACL). Local Graph Partitioning using PageRank Vectors. FOCS, 2006.
• Advantages:
– Local algorithm: complexity depends on output cluster size
– Gives different size clusters, which can be overlapping
– Can do clustering while the graph is on disk
What is “good” clustering?
ACL [FOCS06]’s measure is conductance:
Φ = 3 / (4+3+4+4+2) = 3/17 ≈ 0.18
Φ ∈ [0, 1]
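A small sketch of the conductance computation, assuming the usual definition (weight of edges leaving the cluster divided by the smaller of the two volumes, where volume is the sum of degrees). The slide's example has that form: 3 cut edges over a total cluster degree of 4+3+4+4+2. The function name `conductance` and the two-triangle toy graph are illustrative assumptions.

```python
import numpy as np

def conductance(W, cluster):
    """Conductance of `cluster` in an undirected graph with weight matrix W."""
    W = np.asarray(W, dtype=float)
    inside = np.zeros(W.shape[0], dtype=bool)
    inside[list(cluster)] = True
    cut = W[inside][:, ~inside].sum()   # weight of edges leaving the cluster
    vol_in = W[inside].sum()            # sum of degrees inside the cluster
    vol_out = W[~inside].sum()          # sum of degrees outside the cluster
    return cut / min(vol_in, vol_out)

# Hypothetical toy graph: two triangles joined by the single edge 2 - 3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = np.zeros((6, 6))
for a, b in edges:
    W[a, b] = W[b, a] = 1.0

phi = conductance(W, [0, 1, 2])   # 1 cut edge over volume 2 + 2 + 3
```

Lower values mean fewer edges cross the cluster boundary, so a random walk started inside the cluster is more likely to stay there.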
Graph Clustering
[Figure: clustering of the graph G]
Compute meta-info for each cluster
C(u,v) : The expected number of times (Count) a random walk (RW) starting at node u in cluster S hits node v before exiting S (it can exit by walking to another cluster or by restarting to q).
E(u,v) : The probability that a RW starting at node u in cluster S Exits S to node v (an out-bound node in B), assuming the query (restart) vertex q is outside S.
[Figure: example cluster S]
The C matrix for S is 5x5 (|S| x |S|); the E matrix is 5x3 (|S| x 2|B|+|q|).
Compute meta-info for each cluster
Intra-cluster random walks → baby steps
[Figure: clusters S1 to S4 and query node q]
Compute meta-info for each cluster
Recursive definition for C
T(u, v) : transition probability from u to v
N(u) : set of neighbor nodes of node u
(1 − α) : restart probability
S : set of nodes in the given cluster
Compute meta-info for each cluster
Similarly, closed-form formulae for C and E, in terms of:
– an |S| x |S| transition matrix
– an |S| x (|B|+1) matrix with exit probabilities to nodes in B ∪ {q}
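The formulae themselves did not survive in this transcript. As a hedged sketch, the standard absorbing-random-walk derivation that matches the definitions above gives C = (I − α·T_SS)⁻¹ and E = C·[α·T_SB | (1 − α)·1], assuming q lies outside S. The names `T_SS`, `T_SB`, `cluster_meta_info`, and the toy numbers are assumptions made for illustration and may differ from the paper's notation.

```python
import numpy as np

def cluster_meta_info(T_SS, T_SB, alpha):
    """Count matrix C and exit matrix E for one cluster S (q assumed outside S).

    T_SS  : |S| x |S| transition probabilities between nodes of S
    T_SB  : |S| x |B| transition probabilities from S to boundary nodes B
    alpha : probability of following an edge; (1 - alpha) = restart probability
    """
    n = T_SS.shape[0]
    # Expected visit counts: sum over k of (alpha * T_SS)^k.
    C = np.linalg.inv(np.eye(n) - alpha * T_SS)
    # One-step exits: to a boundary node, or to q via restart (last column).
    exit_one_step = np.hstack([alpha * T_SB, np.full((n, 1), 1.0 - alpha)])
    E = C @ exit_one_step               # |S| x (|B| + 1)
    return C, E

# Hypothetical cluster with |S| = 3 and |B| = 2; each row of [T_SS | T_SB] sums to 1.
alpha = 0.85
T_SS = np.array([[0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.25],
                 [1/3, 1/3, 0.0]])
T_SB = np.array([[0.0, 0.0],
                 [0.25, 0.0],
                 [0.0, 1/3]])
C, E = cluster_meta_info(T_SS, T_SB, alpha)
```

Under these assumptions C satisfies the recursion C = I + α·T_SS·C, and each row of E sums to 1 because the walk eventually leaves S.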
Our Method – ClusterRank
1) Pre-computation (OFFLINE)
a. Graph clustering
b. Compute meta-info for each cluster
2) Query processing (ONLINE)
a. Identify relevant clusters to consider
b. Combine their meta-info to compute an answer
Query processing
Update meta-info for q’s cluster
Recall: closed-form formulae for C and E
Cq (C given q) : [equation not preserved in the transcript]
Eq (E given q) : Cq [rest of equation not preserved in the transcript]
K : |S|x|S| matrix of 0s with column q all 1s (rank 1!) → fast Sherman-Morrison matrix inverse update
[Figure: clusters S1 to S4 and query node q]
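A sketch of the rank-1 update, under the assumption that C = (I − α·T_SS)⁻¹ and that placing q inside S adds the restart term (1 − α)·K to the matrix being inverted. These formulas are reconstructed for illustration, not copied from the slides; `update_C_for_query` and the toy matrix are hypothetical names and values.

```python
import numpy as np

def update_C_for_query(C, q_idx, alpha):
    """Return inv(I - alpha*T_SS - (1 - alpha)*K) given C = inv(I - alpha*T_SS).

    K is all zeros except column q_idx, which is all ones, so
    (1 - alpha) * K = u e_q^T with u = (1 - alpha) * ones: a rank-1 term.
    """
    n = C.shape[0]
    u = (1.0 - alpha) * np.ones(n)
    Cu = C @ u                     # |S| x 1 vector
    row_q = C[q_idx, :]            # e_q^T C, a 1 x |S| vector
    denom = 1.0 - Cu[q_idx]        # 1 - e_q^T C u
    # Sherman-Morrison: only an outer product, no new matrix inversion.
    return C + np.outer(Cu, row_q) / denom

# Hypothetical 3-node cluster (rows sum to less than 1: some edges leave S).
alpha = 0.85
T_SS = np.array([[0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.25],
                 [1/3, 1/3, 0.0]])
C = np.linalg.inv(np.eye(3) - alpha * T_SS)
C_q = update_C_for_query(C, q_idx=0, alpha=alpha)
```

The update costs O(|S|²) for the outer product, compared with O(|S|³) for inverting from scratch, which is why the pre-computed C can be reused at query time.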
Query processing
Inter-cluster graph M over relevant clusters
Inter-cluster random walks → BIG steps
[Figure: inter-cluster graph M with query node q]
Query processing
Combine intra- and inter-cluster meta-info to compute the final answer (“lift” C matrices)
[Figure: clusters S1 to S4]
Theorem: ClusterRank gives exact PPR scores.
Time-varying ClusterRank
• WLOG: assume a single edge (u,v) is added
• Observation: the resulting changes in the matrices that define C and E are low-rank
→ compute the new C & E by the Sherman-Morrison (SM) formula
• 4 cases studied in the paper:
– Both u and v are new vertices
– Either u or v is a new vertex
– u and v are in the same cluster
– u and v are in different clusters
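As an illustration of the "same cluster" case only, the sketch below assumes a directed edge u → v is added inside S, so that exactly one row of the cluster's transition matrix changes (a rank-1 change), and that C = (I − α·T_SS)⁻¹. The helper `add_edge_update` and the toy numbers are hypothetical; the paper's treatment of the four cases, and of E, is not reproduced here.

```python
import numpy as np

def add_edge_update(C, T_SS, u, new_row, alpha):
    """Update C = inv(I - alpha*T_SS) after row u of T_SS becomes new_row."""
    delta = new_row - T_SS[u]          # change in row u of the transition matrix
    dC = delta @ C                     # delta^T C, a 1 x |S| vector
    denom = 1.0 - alpha * dC[u]
    # Sherman-Morrison for the rank-1 change alpha * e_u delta^T.
    return C + alpha * np.outer(C[:, u], dC) / denom

alpha = 0.85
# Hypothetical cluster: 3 nodes, unit weights, nodes 1 and 2 also have one edge leaving S.
W_SS = np.array([[0.0, 1.0, 1.0],
                 [1.0, 0.0, 1.0],
                 [1.0, 1.0, 0.0]])
out_weight = np.array([2.0, 3.0, 3.0])     # total outgoing weight per node
T_SS = W_SS / out_weight[:, None]
C = np.linalg.inv(np.eye(3) - alpha * T_SS)

# Add a directed edge 0 -> 1 of weight 1: row 0 is re-normalized over weight 3.
new_row = np.array([0.0, 2.0, 1.0]) / 3.0
C_new = add_edge_update(C, T_SS, u=0, new_row=new_row, alpha=alpha)
```

For an undirected edge both endpoint rows change, which is a rank-2 change and could be handled by applying the same update twice.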
Graph datasets

Dataset      #edges  #nodes  #clusters  Description
Synthetic    909K    300K    100        Planted partitions
Amazon       900K    262K    3739       Product co-purchase
Web          1.1M    325K    2793       http://nd.edu links
DBLP         1.1M    329K    4670       Co-authorships
LiveJournal  21.5M   2.7M    15252      Friendships

Pre-computation

Dataset      median Φ  avg. Φ  med. size  avg. size
Amazon       0.1385    0.1486  17         98.5
Web          0.0625    0.0871  31         129.4
DBLP         0.2117    0.2196  27         102.4
LiveJournal  0.5500    0.5319  43         237.3

Pre-computation time depends on 1) graph size, 2) #clusters, 3) parallelization
Query Processing: set up
• Instead of all clusters, focus on a subset of relevant clusters (small neighborhoods around the query vertex, 1 or 2 hops away).
• Allow for a maximum of B boundary vertices
• Sparsify the inter-cluster matrix: zero out entries close to zero
• 100 randomly chosen query vertices
Evaluation criterion
• We report accuracy and running time for k nearest neighbor (kNN) queries.
• Accuracy = Relative Average Goodness (RAG) score @k
RAG(@k) = (total true score of output) / (total true score of “optimum”)
Note: precision, i.e. “overlap with optimum”, is *not* a good measure (due to ties/near-ties).
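A short sketch of the RAG(@k) ratio as defined above, assuming the true PPR scores of all nodes are available (e.g. from a brute-force computation). The function name `rag_at_k` and the toy scores are illustrative.

```python
import numpy as np

def rag_at_k(true_scores, returned, k):
    """Relative Average Goodness: true score of the returned top-k over the best possible."""
    true_scores = np.asarray(true_scores, dtype=float)
    got = true_scores[list(returned)[:k]].sum()       # true score of the output
    best = np.sort(true_scores)[::-1][:k].sum()       # true score of the optimum
    return got / best

# Hypothetical true scores with a tie between nodes 1 and 2.
true_scores = [0.30, 0.25, 0.25, 0.10, 0.10]

tie_case = rag_at_k(true_scores, returned=[0, 2], k=2)     # picks node 2 instead of node 1
miss_case = rag_at_k(true_scores, returned=[0, 2, 3], k=3) # misses node 1 entirely
```

In the tie case the output is as good as the optimum even though it overlaps only partly with one particular optimal set {0, 1}, which is the reason the slide prefers RAG over precision.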
Synthetic graphs

SYNTHETIC                                      2-HOP    1-HOP
Average RAG(@50) score (100 runs)
  B = 5K                                       0.9986   0.9865
  B = 1K                                       0.9892   0.9865
ClusterRank average response time (sec.)
  B = 5K                                       5.12     2.18
  B = 1K                                       2.86     2.12

Brute-force: 5.16 sec.
Real graphs
Dynamic updates
• 500K edge DBLP graph + 1K new edges
Avg: 42.12 seconds
Avg: 2.78 clusters
Note: load/store time of C, E matrices included
Dynamic updates
[Figure: DBLP 1959-2001; panels: +1K edges in time, +500K edges in time]
Summary
• ClusterRank: k Nearest Neighbor queries based on Personalized PageRank scores
– Works with large and time-evolving graphs
– Fast query time: sub-linear computation on pre-computed meta-info
– Efficient dynamic updates by low-rank matrices
– Disk-based: query processing and dynamic updates only on the relevant subset of clusters
• Future directions
– Cluster tracking and localized re-clustering
– Extension to hitting / commute time
Thank You!
http://www.cs.stonybrook.edu/~leman
Back-up Slides
Recursive definition for E
T(u, v) : transition probability from u to v
N(u) : set of neighbor nodes of node u
(1 − α) : restart probability
S : set of nodes in the given cluster
Closed formulations for C and E
C1 is an |S|x|S| identity matrix
Similarly,
What if s (query vertex) ∈ S?
At query time, given the query vertex s, only the two matrices of the cluster in which s resides are updated.
K is rank 1! Therefore, we use the Sherman-Morrison lemma to update C.
Complexity: multiplication of an |S|x1 and a 1x|S| vector.
Note that we do not need to run SVD, as K is only rank 1!