Efficiently Answering Reachability Queries on Large Directed Graphs

Ruoming Jin, Kent State University

Joint work with Yang Xiang (KSU), Ning Ruan (KSU), and Haixun Wang (IBM T.J. Watson)

Post on 21-Dec-2015



Reachability Query

[Figure: example directed graph on vertices 1-15]

Query(1,11)? Yes

Query(3,9)? No

The problem: Given two vertices u and v in a directed graph G, is there a path from u to v ?

Directed graph → DAG (directed acyclic graph), obtained by coalescing the strongly connected components
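As a point of reference, the problem statement above can be answered without any index by a plain breadth-first search. The sketch below is only an illustration of that baseline; the function name `reachable`, the adjacency-dict representation, and the toy graph are assumptions made for this example and are not taken from the slides.

```python
from collections import deque

def reachable(succ, u, v):
    """Return True if v can be reached from u by following edges in succ.

    succ maps a vertex to the list of its direct successors (assumed format).
    """
    if u == v:
        return True
    seen = {u}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in succ.get(x, ()):
            if y == v:
                return True
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False

# Hypothetical toy DAG (not the graph drawn on the slide).
toy = {1: [2, 3], 2: [4], 3: [4], 5: [4]}
print(reachable(toy, 1, 4))  # True
print(reachable(toy, 4, 1))  # False
```

Each query may touch every vertex and edge, which is the cost that the indexing schemes discussed in the rest of the talk try to avoid.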

Applications

• XML

• Biological networks

• Ontology

• Knowledge representation (Lattice operation)

• Object programming (Class relationship)

• Distributed systems (Reachable states)

Graph Databases

Prior Work

Method                                          Query time  Construction   Index size
DFS/BFS                                         O(n+m)      O(n+m)         O(n+m)
Transitive Closure                              O(1)        O(nm)/O(n^3)   O(n^2)
Optimal Chain Cover (Jagadish, TODS'90)         O(k)        O(nm)          O(nk)
Optimal Tree Cover (Agrawal et al., SIGMOD'89)  O(n)        O(nm)          O(n^2)
Dual-Labeling (Wang et al., ICDE'06)            O(1)        O(n+m+t^3)     O(n+t^2)
Labeling+SSPI (Chen et al., VLDB'05)            O(m-n)      O(n+m)         O(n+m)
GRIPP (Trißl et al., SIGMOD'07)                 O(m-n)      O(n+m)         O(n+m)

Also: 2-HOP (O(n m^(1/2)) and O(n^4)), HOPI, and heuristic algorithms

Limitation of Tree-based approaches

• Finding a good tree cover is expensive

• A tree cover cannot represent some common types of DAGs, such as grids

• Compression limitations
  – Chain (1 parent, 1 child)
  – Tree (1 parent, multiple children)
  – Most existing methods that use a tree cover are greatly affected by how many edges are left uncovered

Overview of Path-Tree

• Chain->Tree->Path-Tree (2 parents / multiple children)

• Path-tree cover is a spanning subgraph of G in a tree shape (T)

• A node in the tree T corresponds to a path in G and an edge in T corresponds to the edges between two paths in G

• A 3-tuple labeling exists for any path-tree and answers reachability queries in O(1)

Path-Tree in a Nutshell

[Figure: the example graph partitioned into paths P1-P4, next to the corresponding path-graph on P1-P4]

The path-graph is not necessarily a planar graph.

The reachability between any two nodes can be answered in O(1).

Key Problems

• How to construct a path-tree?
  – Algorithm

• How can a path-tree help with reachability queries?
  – Labeling
  – Transitive closure compression

• How does path-tree compare with the existing methods?
  – Optimality

Constructing Path-Tree

• Step 1: Path-Decomposition of DAG

• Step 2: Minimal Equivalent Edge Set between any two paths

• Step 3: Path-Graph Construction

• Step 4: Path-Tree Cover Extraction

Step 1: Path-Decomposition

[Figure: the example graph decomposed into four paths P1-P4; each vertex carries a (PID, SID) label, e.g. (PID, SID) = (2, 5)]

For any two nodes u and v in the same path, u → v if and only if u.sid ≤ v.sid.

A simple linear-time algorithm based on topological sort yields a path-decomposition.
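One way such a decomposition could be computed is sketched below: vertices are visited in topological order, and each unlabeled vertex starts a new path that is greedily extended through unlabeled successors. This is an illustrative sketch under assumptions, not necessarily the procedure used by the authors; the function names, the tie-breaking rule (smallest topological rank), and the toy DAG are all hypothetical.

```python
from collections import deque

def topological_order(vertices, succ):
    """Kahn's algorithm; succ maps a vertex to its list of successors."""
    indeg = {v: 0 for v in vertices}
    for u in vertices:
        for w in succ.get(u, ()):
            indeg[w] += 1
    queue = deque(v for v in vertices if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in succ.get(u, ()):
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return order

def path_decomposition(vertices, succ):
    """Assign every vertex a (pid, sid) label; sid grows along each path."""
    order = topological_order(vertices, succ)
    rank = {v: i for i, v in enumerate(order)}
    label = {}
    pid = 0
    for start in order:
        if start in label:
            continue
        pid += 1
        sid = 1
        cur = start
        while cur is not None:
            label[cur] = (pid, sid)
            sid += 1
            # Assumed tie-break: continue with the unlabeled successor
            # that comes earliest in the topological order.
            candidates = [w for w in succ.get(cur, ()) if w not in label]
            cur = min(candidates, key=rank.get) if candidates else None
    return label

# Hypothetical toy DAG (not the graph drawn on the slide).
toy_vertices = [1, 2, 3, 4, 5, 6]
toy_succ = {1: [2, 3], 2: [4], 3: [4], 4: [5], 6: [5]}
toy_labels = path_decomposition(toy_vertices, toy_succ)
print(toy_labels)
```

Because consecutive vertices of a path are always joined by an edge, two vertices with the same pid satisfy the same-path test stated above: the one with the smaller sid reaches the other.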

Step 2: Minimal equivalent edge set

[Figure: the edges between two paths P1 and P2, shown before and after reduction to the minimal equivalent edge set]

The reachability between any two paths can be captured by a unique minimal set of edges.

The edges in the minimal equivalent edge set do not cross (they are always parallel)!

Step 3: Path-Graph Construction

[Figure: the path decomposition P1-P4 of the example graph and the resulting weighted directed path-graph on P1-P4]

Weighted Directed Path-Graph

The weight of an edge reflects the cost we have to pay in the transitive closure computation if that edge is excluded from the path-tree.

Step 4: Extracting Path-Tree Cover

[Figure: the weighted directed path-graph on P1-P4 and the maximal directed spanning tree extracted from it]

Weighted Directed Path-Graph → Maximal Directed Spanning Tree

Chu-Liu/Edmonds algorithm, O(m' + k log k)
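To make the objective of this step concrete, the sketch below finds a maximum-weight directed spanning tree by exhaustive search over parent assignments. It is deliberately not the Chu-Liu/Edmonds algorithm named on the slide; it only illustrates what that algorithm computes and is practical only for a handful of paths. The function names, the edge direction (parent to child), and the weights are hypothetical.

```python
from itertools import product

def _reaches_root(v, parent, root):
    """Follow parent pointers from v; fail if a cycle is met before the root."""
    seen = set()
    while v != root:
        if v in seen:
            return False
        seen.add(v)
        v = parent[v]
    return True

def max_spanning_arborescence(nodes, weight):
    """Brute force: return (total weight, root, parent map) or None.

    weight maps a directed edge (parent, child) to its weight.
    """
    best = None
    for root in nodes:
        others = [v for v in nodes if v != root]
        # Every non-root node must pick exactly one incoming edge.
        choices = [[u for u in nodes if (u, v) in weight] for v in others]
        for parents in product(*choices):
            parent = dict(zip(others, parents))
            if all(_reaches_root(v, parent, root) for v in others):
                total = sum(weight[(parent[v], v)] for v in others)
                if best is None or total > best[0]:
                    best = (total, root, parent)
    return best

# Hypothetical path-graph weights (not the values shown on the slide).
toy_nodes = ["P1", "P2", "P3", "P4"]
toy_weight = {
    ("P4", "P3"): 7,
    ("P3", "P1"): 3,
    ("P3", "P2"): 4,
    ("P4", "P1"): 1,
    ("P1", "P2"): 2,
}
toy_tree = max_spanning_arborescence(toy_nodes, toy_weight)
print(toy_tree)
```

In this toy input the search keeps the three heaviest compatible edges and roots the tree at P4; a real implementation would replace the enumeration with Chu-Liu/Edmonds to obtain the running time quoted above.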


3-Tuple Labeling for Reachability

[Figure: the path decomposition P1-P4 of the example graph and its path-tree]

• DFS labeling (1-tuple)

• Interval labeling (2-tuple): a high-level description of the paths, used to decide whether Pi → Pj

Interval labels: P1 [1,1], P2 [2,2], P3 [1,3], P4 [1,4]

DFS labeling

[Figure: the example graph laid out along paths P1-P4, with every vertex annotated with its DFS label (1-15)]

1. Start from the first vertex in the root path.

2. Always try to visit the next vertex in the same path first.

3. Label a node once all its neighbors have been visited: L(v) = N - x, where x is the number of nodes already labeled.
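The three rules on this slide can be sketched as a depth-first traversal that prefers the same-path successor and numbers vertices as they finish. The code below is a minimal illustration under assumptions: the function name `dfs_label`, the input format, the restart over a caller-supplied list of start vertices, and the toy graph are all hypothetical and not taken from the slides.

```python
def dfs_label(succ, path_of, starts):
    """Number vertices with L(v) = N - x at the moment they finish.

    succ:    vertex -> list of successors (edges of the cover)
    path_of: vertex -> id of the path that contains the vertex
    starts:  vertices to start from, the root path's first vertex first
    """
    n = len(path_of)
    label = {}
    visited = set()

    def visit(u):
        visited.add(u)
        # Rule 2: successors on the same path are tried first
        # (False sorts before True, and sorted() is stable).
        ordered = sorted(succ.get(u, ()), key=lambda w: path_of[w] != path_of[u])
        for w in ordered:
            if w not in visited:
                visit(w)
        # Rule 3: all neighbors are done, so u gets N minus the
        # number of vertices labeled so far.
        label[u] = n - len(label)

    for s in starts:
        if s not in visited:
            visit(s)
    return label

# Hypothetical toy input: path "A" is 1 -> 2 -> 3, path "B" is 4 -> 5.
toy_succ = {1: [4, 2], 2: [3], 4: [5], 5: [3]}
toy_path = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}
toy_dfs = dfs_label(toy_succ, toy_path, starts=[1])
print(toy_dfs)
```

With this numbering every edge goes from a smaller label to a larger one, which is the direction of the label comparison used in the query example later in the deck (5 < 15).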

3-Tuple Labeling for Reachability

[Figure: the example graph with DFS labels 1-15, and the path-tree with interval labels P1 [1,1], P2 [2,2], P3 [1,3], P4 [1,4]]

u → v if and only if

1) Interval label: I(u) ⊇ I(v)

2) DFS label: L(u) ≤ L(v)

Query(9,15)? P4 [1,4] ⊇ P1 [1,1] and 5 < 15: Yes

Query(9,2)?

Query(5,9)?
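The worked query on this slide suggests reading the test as interval containment plus a DFS-label comparison. The sketch below encodes that reading; the `Label` type and the `reaches` function are hypothetical names, and the only label values used are the two that appear in the Query(9,15) example.

```python
from typing import NamedTuple

class Label(NamedTuple):
    lo: int   # start of the interval label of the vertex's path
    hi: int   # end of the interval label of the vertex's path
    dfs: int  # DFS label of the vertex

def reaches(lu: Label, lv: Label) -> bool:
    """Assumed reading of the slide: I(u) contains I(v) and L(u) <= L(v)."""
    interval_ok = lu.lo <= lv.lo and lv.hi <= lu.hi
    return interval_ok and lu.dfs <= lv.dfs

# Values quoted in the slide's example: vertex 9 lies on P4 [1,4] with
# DFS label 5, vertex 15 lies on P1 [1,1] with DFS label 15.
label_9 = Label(lo=1, hi=4, dfs=5)
label_15 = Label(lo=1, hi=1, dfs=15)

print(reaches(label_9, label_15))   # True
print(reaches(label_15, label_9))   # False
```

The check is three integer comparisons, which is where the O(1) query time comes from.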

Transitive Closure Compression

An efficient procedure can compute and compress the transitive closure in O(mk), where k is the number of paths in the path-tree.

[Figure: the example graph on vertices 1-15]

The path-tree cover (including labeling) can be constructed in O(m + n log n).


Theoretical Analysis

• Optimal Path-Tree Cover (OPTC) problem
  – Given a path-decomposition, what is the optimal path-tree cover to maximally compress the transitive closure?
  – OptIndex weight assignment based on computing the predecessor set

• Optimal Path-Decomposition (OPD) problem
  – Assuming we only use path-decomposition to compress the transitive closure, what is the optimal path-decomposition to maximally compress the transitive closure?
  – Minimal-cost flow problem
  – What is the overall optimal path-decomposition?

Superiority of Path-Tree Cover

• The optimal tree cover is a special case of the path-tree cover in which each vertex forms its own path and the weights are based on OptIndex.

• The path-tree cover approach compresses the transitive closure to a size smaller than or equal to that of the optimal tree cover approach (and consequently that of the optimal chain cover approach).

Experimental Evaluation

• Implementation in C++

• 12 real datasets, as used in the Dual-Labeling and GRIPP papers

• Synthetic datasets
  – Sparse DAGs with edge density = 2

• AMD Opteron 2.0 GHz / 2 GB / Linux

• PTree1 (OptIndex) and PTree2
  – Mainly compared with Optimal Tree Cover

Real Datasets

Graph     #V      #E      DAG #V  DAG #E
AgroCyc   13969   17694   12684   13408
aMaze     11877   28700   3710    3600
Anthra    13736   17307   12499   13104
Ecoo157   13800   17308   12620   13350
HpyCyc    5565    8474    4771    5859
Human     40051   43879   38811   39576
Kegg      14271   35170   3617    3908
Mtbrv     10697   13922   9602    10245
Nasa      5704    7942    5605    7735
Reactome  3678    14447   901     846
Vchocyc   10694   14207   9491    10143
Xmark     6483    7654    6080    7028

Experimental Result (Real Data)

          Transitive Closure Size    Construction Time (ms)     Query Time (ms)
Graph     Tree     Ptree-1  Ptree-2  Tree     Ptree-1  Ptree-2  Tree     Ptree-1  Ptree-2
AgroCyc   13550    962      2133     149.8    224.853  142.311  46.629   10       14.393
aMaze     5178     1571     17274    1062.2   834.697  63.748   19.478   21.529   61.925
Anthra    13155    733      2620     141.11   212.258  143.568  44.958   9.317    16.498
Ecoo157   13493    973      3592     151.46   229.29   141.951  46.674   11.224   16.739
HpyCyc    5946     4224     4661     57.378   106.552  71.675   31.539   12.089   15.503
Human     39636    965      2910     446.32   648.005  465.148  70.107   20.008   23.008
Kegg      5121     1703     30344    746.03   1057.11  86.396   17.509   27.282   75.448
Mtbrv     10288    812      3664     111.48   173.382  106.583  40.391   9.81     19.815
Nasa      9162     5063     6670     85.291   111.397  53.139   37.037   16.214   20.771
Reactome  1293     383      1069     17.244   18.189   6.3      17.565   6.467    13.037
Vchocyc   10183    830      2262     109.47   170.714  103.036  40.026   8.999    14.274
Xmark     8237     2356     10614    204.76   247.628  68.358   37.834   17.122   41.549

Transitive closure size (Ptree-1): on average 10 times better than Tree.

Query time (Ptree-1): on average 3 times better than Tree.
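The two "on average" annotations can be checked against the table with a few lines of arithmetic. The sketch below assumes that "on average" means the arithmetic mean of the per-dataset ratio Tree / Ptree-1; that interpretation, and the helper name `mean_ratio`, are assumptions. The numbers are copied from the Tree and Ptree-1 columns of the real-data results.

```python
# (graph, TC size Tree, TC size Ptree-1, query ms Tree, query ms Ptree-1)
ROWS = [
    ("AgroCyc", 13550, 962, 46.629, 10.0),
    ("aMaze", 5178, 1571, 19.478, 21.529),
    ("Anthra", 13155, 733, 44.958, 9.317),
    ("Ecoo157", 13493, 973, 46.674, 11.224),
    ("HpyCyc", 5946, 4224, 31.539, 12.089),
    ("Human", 39636, 965, 70.107, 20.008),
    ("Kegg", 5121, 1703, 17.509, 27.282),
    ("Mtbrv", 10288, 812, 40.391, 9.81),
    ("Nasa", 9162, 5063, 37.037, 16.214),
    ("Reactome", 1293, 383, 17.565, 6.467),
    ("Vchocyc", 10183, 830, 40.026, 8.999),
    ("Xmark", 8237, 2356, 37.834, 17.122),
]

def mean_ratio(pairs):
    """Arithmetic mean of a / b over all (a, b) pairs."""
    ratios = [a / b for a, b in pairs]
    return sum(ratios) / len(ratios)

tc_gain = mean_ratio([(row[1], row[2]) for row in ROWS])
query_gain = mean_ratio([(row[3], row[4]) for row in ROWS])

print(f"mean TC size ratio Tree/Ptree-1:    {tc_gain:.2f}")
print(f"mean query time ratio Tree/Ptree-1: {query_gain:.2f}")
```

Under this reading the two means come out a little above 10 and a little above 3, in line with the annotations; note that on aMaze and Kegg the Ptree-1 query time is actually slower than Tree.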

Experimental Result (Synthetic Data)

[Figure: Transitive Closure Size. x-axis: # of vertices in K (DAG), 10 to 100; y-axis: # of vertices (TC), 0 to 180000; series: Tree, Ptree-1, Ptree-2]

Experimental Result (Synthetic Data)

[Figure: Construction Time. x-axis: # of vertices in K (DAG), 10 to 100; y-axis: construction time in ms, 0 to 1600; series: Tree, Ptree-1, Ptree-2]

Experimental Result (Synthetic Data)

[Figure: Query Time. x-axis: # of vertices in K (DAG), 10 to 100; y-axis: query time in ms, 0 to 90; series: Tree, Ptree-1, Ptree-2]

Conclusion

• A novel path-tree structure is proposed to help compress the transitive closure and answer reachability queries

• Path-tree has the potential to be integrated with other existing methods to further improve the efficiency of reachability query processing

Thanks!!

Step 3: Path-Graph Construction

[Figure: the path decomposition P1-P4 of the example graph and the resulting weighted directed path-graph on P1-P4]

Weighted Directed Path-Graph

Weight reflects the penalty if we exclude this path-tree edge

Step 2: Constructing Minimal Equivalent Edge Set (Pi → Pj)

[Figure: the edges between two paths P1 and P2, shown before and after reduction to the minimal equivalent edge set]

1. Order the vertices in Pi and Pj in decreasing order.

2. Find the first vertex v in Pj that Pi can reach.

3. Find the last vertex u in Pi that reaches v.

4. Remove all the edges that cross (u, v) and repeat steps 2-4.
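The loop in steps 2-4 can be sketched as follows, under some simplifying assumptions: only direct edges from Pi to Pj are considered, each edge is given as a pair (a, b) of positions (a on Pi, b on Pj), and positions increase along each path. The function names and the toy edge list are hypothetical, and the quadratic rescanning is chosen for readability rather than speed.

```python
def minimal_equivalent_edges(edges):
    """Keep only the Pi -> Pj edges that no other edge makes redundant."""
    remaining = set(edges)
    kept = []
    while remaining:
        # Step 2: first vertex of Pj that is still reachable from Pi.
        v = min(b for _, b in remaining)
        # Step 3: last vertex of Pi with an edge to v.
        u = max(a for a, b in remaining if b == v)
        kept.append((u, v))
        # Step 4: edges leaving Pi at or before u are implied by (u, v).
        remaining = {(a, b) for a, b in remaining if a > u}
    return kept

def path_reaches(edges, a, b):
    """Position a on Pi reaches position b on Pj through some listed edge."""
    return any(a2 >= a and b2 <= b for a2, b2 in edges)

# Hypothetical edges between two paths (not the ones drawn on the slide).
toy_edges = [(1, 3), (2, 2), (2, 5), (3, 4), (4, 4), (1, 1)]
toy_kept = minimal_equivalent_edges(toy_edges)
print(toy_kept)  # [(1, 1), (2, 2), (4, 4)]
```

In the result both coordinates strictly increase from one kept edge to the next, which is the "edges do not cross" property stated on the earlier Step 2 slide.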
