1
Efficient Cohesive Subgraph Detection in Parallel
Yingxia Shao, Lei Chen, Bin Cui
School of EECS, Peking University; Hong Kong University of Science and Technology
2
Outline
• Background
• Preliminaries
• PETA: parallel and efficient truss detection algorithm
  • Basic framework
  • Triangle complete subgraph
  • Subgraph-oriented model
  • Optimization techniques
• Evaluation
• Conclusion
3
Cohesive Subgraph
• Cohesive subgraph
  • identifies cohesive subgroups within social networks.
  • helps social network analysts focus on areas of the network that are likely to be fruitful.
• E.g., clique, n-clique, n-clan, n-club, k-plex, k-core, etc.
Image source: Large Scale Cohesive Subgraph Discovery for Social Network Visual Analysis, VLDB’13
Background
4
k-Truss
• A k-truss is a cohesive subgraph in which the support of each edge is at least k-2.
• The support of an edge is the number of triangles that contain the edge.
• The maximal k-truss.
Problem Statement: Given a graph G and a threshold k, find the maximal k-truss in G.
The subgraph with black thick edges is a 4-truss.
[Figure: example graph partitioned into P1, P2, and P3.]
Background
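The definitions on this slide can be checked on a toy graph. The following is a minimal Python sketch, not code from the talk; the helper names `build_adjacency`, `support`, and `is_k_truss` are hypothetical, and the graph is assumed to be simple and undirected.

```python
from itertools import combinations

def build_adjacency(edges):
    """Map each vertex to the set of its neighbors (undirected graph)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def support(adj, u, v):
    """Support of edge (u, v): the number of triangles containing it,
    which equals the number of common neighbors of u and v."""
    return len(adj[u] & adj[v])

def is_k_truss(edges, k):
    """True if every edge of the given subgraph has support >= k - 2."""
    adj = build_adjacency(edges)
    return all(support(adj, u, v) >= k - 2 for u, v in edges)

# Toy input: a 4-clique, where every edge lies in exactly two triangles.
clique4 = list(combinations("abcd", 2))
print(is_k_truss(clique4, 4), is_k_truss(clique4, 5))
```

A 4-clique passes the check for k = 4 and fails it for k = 5, since each of its edges has support 2.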
5
Fundamental operation
• The operation computes the support of an edge.
  • Counting triangles around an edge (u, v).
• Two solutions
  • Classic solution [Cohen ’08, VLDB ’13]
    1. Sort the neighbors of each vertex in ascending order;
    2. Count triangles in O(d(u) + d(v)) time, where d(·) is the vertex degree.
  • Index-based solution [Wang ’12]
    1. When processing an edge (u, v), enumerate only the neighbors of the endpoint with the smaller degree, say u;
    2. For each such neighbor w, test the existence of the third edge (v, w) with the help of a hash table.
    • Time complexity: O(min{d(u), d(v)})
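The two solutions can be compared side by side. This is an illustrative Python sketch under the assumption of a simple undirected graph; `build_indexes`, `support_sorted`, and `support_indexed` are hypothetical names, and a Python set of `frozenset` pairs stands in for the hash table.

```python
def build_indexes(edges):
    """Return neighbor sets, sorted neighbor lists, and a hash index of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    sorted_adj = {x: sorted(nbrs) for x, nbrs in adj.items()}
    edge_index = {frozenset(e) for e in edges}
    return adj, sorted_adj, edge_index

def support_sorted(sorted_adj, u, v):
    """Classic solution: merge the two sorted neighbor lists.
    Each step advances at least one pointer, so the cost is O(d(u) + d(v))."""
    a, b = sorted_adj[u], sorted_adj[v]
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count

def support_indexed(adj, edge_index, u, v):
    """Index-based solution: scan the neighbors of the lower-degree endpoint
    and look up the third edge in the hash index, O(min(d(u), d(v))) expected."""
    if len(adj[u]) > len(adj[v]):
        u, v = v, u
    return sum(1 for w in adj[u] if w != v and frozenset((v, w)) in edge_index)

edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)]
adj, sorted_adj, edge_index = build_indexes(edges)
print(support_sorted(sorted_adj, 2, 3), support_indexed(adj, edge_index, 2, 3))
```

Both functions return the same support for every edge; they differ only in how much of the adjacency data they touch.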
Preliminaries
6
A straightforward detection framework
• The framework was introduced by J. Cohen.
  1. Enumerate triangles.
  2. For each edge, record the number of triangles containing that edge.
  3. Keep only edges whose support is at least k-2.
  4. If step 3 dropped any edges, return to step 1.
  5. The remaining graph is the maximal k-truss.
• MapReduce solution [J. Cohen ’09]
  • Two MapReduce jobs
    • One is for steps 1-2.
    • One is for step 3.
• Pregel-based solution [L. Quick ’12]
  • Three supersteps
    • Two are for steps 1-2.
    • One is for step 3.
• Inefficiency
  • Classic solution of the fundamental operation
  • High communication cost
  • Large number of iterations
Preliminaries
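The five steps of the straightforward framework can be sketched serially. The Python below is an assumption-laden illustration (single machine, simple undirected graph, hypothetical function name `maximal_k_truss`), not the MapReduce or Pregel implementation cited above.

```python
def maximal_k_truss(edges, k):
    """Repeatedly drop edges with support < k - 2 until nothing changes."""
    remaining = {frozenset(e) for e in edges}  # assumes no self-loops
    while True:
        # Steps 1-2: recompute every edge's support on the current graph.
        adj = {}
        for e in remaining:
            u, v = tuple(e)
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        # Step 3: keep only edges whose support is at least k - 2.
        kept = set()
        for e in remaining:
            u, v = tuple(e)
            if len(adj[u] & adj[v]) >= k - 2:
                kept.add(e)
        # Steps 4-5: stop when no edge was dropped; otherwise iterate again.
        if kept == remaining:
            return kept
        remaining = kept

# Toy input: a 4-clique on a, b, c, d plus a weaker triangle through e.
toy = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
       ("b", "d"), ("c", "d"), ("d", "e"), ("e", "a")]
print(len(maximal_k_truss(toy, 4)))
```

Every pass recomputes all supports from scratch, which illustrates why the number of iterations matters so much in the distributed setting.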
7
Contributions
• We propose a parallel and efficient truss detection algorithm.
• We introduce a subgraph-oriented programming model to implement the algorithm efficiently in popular graph computation systems.
• We observe the edge-support law in real-world graphs.
8
PETA: Parallel and Efficient Truss detection Algorithm
• New detection framework behind PETA
  1. Each worker constructs a specially designed subgraph;
  2. Workers simultaneously detect their local k-trusses;
  3. Workers communicate updates only when it is unavoidable;
  4. Go to step 2 until all local k-trusses are stable.
PETA Basic framework
[Figure: New detection framework behind PETA. (a) Original graph partitioned into P1, P2, and P3; (b) special subgraphs; (c) local k-trusses.]
9
Triangle Complete Subgraph
• Triangle Complete Subgraph (TC-subgraph)
  • For internal and cross edges, the TC-subgraph maintains all their triangles locally.
• Property
  • The TC-subgraph preserves sufficient knowledge.
  • Theorem 1 and Theorem 2 prove the correctness of the new framework in PETA with the TC-subgraph.
[Figure: TC-subgraph of partition P2, with internal, external, and cross edges distinguished.]
PETA TC Subgraph
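One way to read the TC-subgraph definition is sketched below in Python. It assumes that an internal edge has both endpoints in the partition, a cross edge has exactly one, and an external edge joins two outside vertices that share a neighbor inside the partition (which is what is needed to close every triangle of a cross edge). The function name `tc_subgraph` and this edge classification are assumptions made for illustration; the paper's formal definition should be taken as authoritative.

```python
def tc_subgraph(edges, part):
    """Classify the edges a worker owning the vertex set `part` must hold."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    internal, cross, external = set(), set(), set()
    for u, v in edges:
        e = frozenset((u, v))
        inside = (u in part) + (v in part)  # number of endpoints owned locally
        if inside == 2:
            internal.add(e)
        elif inside == 1:
            cross.add(e)
        elif adj[u] & adj[v] & part:
            # Both endpoints are remote, but the edge closes a triangle
            # with a local vertex, so a copy is kept as an external edge.
            external.add(e)
    return internal, cross, external

graph = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("b", "d"), ("d", "e")]
internal, cross, external = tc_subgraph(graph, {"a", "b"})
print(len(internal), len(cross), len(external))
```

In the toy graph, edge (c, d) is external for the partition {a, b} because b is adjacent to both c and d, while (d, e) is not stored at all.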
10
Subgraph-Oriented Model
• The subgraph-oriented model allows the local subgraph to be processed flexibly through properly designed APIs.
• In PETA, we can use the index-based approach to detect the local k-truss.

                      | Vertex-centric model          | Subgraph-oriented model
Accessible data       | Vertex and one-hop neighbors  | Entire local subgraph
Access pattern        | Sequential                    | Sequential/random
Local updates         | Require extra supersteps      | By-pass
User-defined function | Simple                        | Fruitful
Expressivity          | Good                          | Better
*Refer to the paper for API design.
PETA Subgraph Model
11
Local subgraph algorithm for PETA
• The algorithm contains two phases.
  • Initialization phase
    • Constructs the TC-subgraph via a triangle counting routine.
    • Requires two supersteps.
  • Detection phase
    • Applies the index-based solution to compute the support of an edge.
    • First detection iteration: scan over internal and cross edges.
    • Successive detection iterations: update the local k-truss based on the removal of external edges.
Seamless detection!
PETA Local algorithm
12
Efficiency analysis of PETA
• Computation complexity
  • The same as that of the best-known serial algorithms, O(m^1.5), where m is the number of edges.
• Communication complexity
  • The worst case is bounded by 3|Δ|.
• Number of iterations
  • Minimal for a given graph partition.
• Space complexity
  • The worst case is bounded.
• Drawback
  • The worst-case space cost is attainable in theory (e.g., on a clique), so PETA may be infeasible for large-scale graphs.
PETA Efficiency
13
Optimizations - I
• Edge replicating factor (ρ)
  • ρ is the average number of times an edge is replicated.
  • A small ρ leads to low space, computation, and communication costs.
  • A small ρ implies a small number of iterations.
[Figure: TC-subgraph of partition P2, with internal, external, and cross edges distinguished.]
In the formula for ρ, the edge cut ratio term stands for cross-edge replication.
The third term is external-edge replication, which depends on θ(e), the support of an edge e.
Optimizations
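For a concrete partition, ρ can be measured directly by counting edge copies. The Python sketch below assumes that an edge is stored once in each partition owning one of its endpoints and once more in each other partition owning a common neighbor of its endpoints (an external copy). The function name `edge_replicating_factor` and the `owner` mapping are hypothetical, and this is a direct count rather than the estimate formula from the slide.

```python
def edge_replicating_factor(edges, owner):
    """Average number of copies per edge; `owner` maps vertex -> partition id."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    copies = 0
    for u, v in edges:
        # One copy for an internal edge, two for a cross edge.
        holders = {owner[u], owner[v]}
        # Other partitions that own a common neighbor keep an external copy.
        external_holders = {owner[w] for w in adj[u] & adj[v]} - holders
        copies += len(holders) + len(external_holders)
    return copies / len(edges)

triangle = [("a", "b"), ("b", "c"), ("a", "c")]
print(edge_replicating_factor(triangle, {"a": 0, "b": 0, "c": 0}))
print(edge_replicating_factor(triangle, {"a": 0, "b": 0, "c": 1}))
```

With all vertices in one partition no edge is replicated; splitting the triangle across partitions raises the factor because every partition must see the whole triangle.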
14
Optimizations - II
• Edge-Support Law in real-world graphs
  • The frequency distribution of the initial support of edges follows a power law.
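The distribution in question can be tabulated with a few lines of Python. This is a minimal sketch with a hypothetical function name, `support_histogram`; checking for a power law on a real graph would additionally require plotting or fitting the histogram on log-log axes, which is not shown here.

```python
from collections import Counter

def support_histogram(edges):
    """Map each support value to the number of edges having that initial support."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return Counter(len(adj[u] & adj[v]) for u, v in edges)

# Toy input: a 4-clique on a, b, c, d plus one pendant edge (d, e).
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
             ("b", "d"), ("c", "d"), ("d", "e")]
print(sorted(support_histogram(toy_edges).items()))
```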
Optimizations
15
Optimizations - III
• Edge-balanced partition strategy
  • Improves the performance of the algorithm further.
  • Uses METIS to generate a “good” partition.
    • Since METIS cannot balance the core edges directly, we use each vertex’s degree as its weight and balance the total degree per part as a proxy for core-edge balance.
Graph       | E[θ(e)] | ρest  | ρrand | ρmetis
livejournal | 20.00   | 20.74 | 8.99  | 1.77
us-patent   | 2.36    | 3.25  | 3.13  | 1.19
wikitalk    | 5.93    | 7.53  | 4.52  | 3.31
dbpedia     | 7.61    | 9.11  | 6.77  | 2.10
Graphs are partitioned into 32 parts.
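The degree-as-weight idea from the partition strategy can be illustrated without calling METIS. The Python sketch below only computes the vertex weights (degrees) and measures how evenly a given partition spreads them. The names `part_degree_loads` and `imbalance` are hypothetical, and producing the partition itself is left to the partitioner.

```python
from collections import Counter

def part_degree_loads(edges, owner, num_parts):
    """Sum of vertex degrees (used as vertex weights) in each part."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    loads = [0] * num_parts
    for vertex, d in degree.items():
        loads[owner[vertex]] += d
    return loads

def imbalance(loads):
    """Heaviest part relative to the average part; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))

path = [("a", "b"), ("b", "c"), ("c", "d")]
loads = part_degree_loads(path, {"a": 0, "b": 0, "c": 1, "d": 1}, 2)
print(loads, imbalance(loads))
```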
Optimizations
16
Evaluation
• All experiments are conducted on a cluster with 23 physical nodes.
• Baselines
  • Cohen-MR [J. Cohen ’09]
  • Orig-LQ [L. Quick ’12]

Graph       | |V|   | |E|
wikitalk    | 2.4M  | 4.7M
us-patent   | 3.8M  | 16.5M
livejournal | 4.8M  | 42.9M
dbpedia     | 17.2M | 117.4M
Evaluation
17
Efficiency
• Across the datasets, PETA achieves a 5x to 6x speedup over the original Pregel-based solutions (i.e., Orig-LQ and Impr-LQ).
• Cohen-MR is at least 10x slower than the best method, so it is omitted from the figures for clarity.
Evaluation
[Figure panels: livejournal, us-patent]
18
Scalability
• The performance of PETA improves gracefully on both random and METIS-based partition schemes.
Evaluation
[Figure panels: 10-truss in dbpedia, 40-truss in dbpedia]
19
Conclusions
• We designed an efficient parallel k-truss detection algorithm, named PETA, and thoroughly analyzed its advantages.
• The subgraph-oriented model has the potential to improve the performance of other complex graph analysis tasks.
• In the future, we will address other truss-related problems under this framework.
20
Q&A
21
Backup experiment: the number of iterations

k-truss  | Orig-LQ   | Cohen-MR  | PETA (Random) | PETA (METIS)
5-truss  | 2212(503) | 1006(503) | 21            | 9
10-truss | 272(68)   | 136(68)   | 23            | 14
40-truss | 112(28)   | 56(28)    | 14            | 6
The number of iterations for k-truss detection on dbpedia.