1 Holistic Twig Joins: Optimal XML Pattern Matching Nicolas Bruno, Nick Koudas, Divesh Srivastava ACM SIGMOD 2002 Presented by Jun-Ki Min


Holistic Twig Joins: Optimal XML Pattern Matching

Nicolas Bruno, Nick Koudas, Divesh Srivastava

ACM SIGMOD 2002
Presented by Jun-Ki Min


Contents
• Introduction
• Background
• Holistic Path Join Algorithms
• Twig Join Algorithms
• Experimental Evaluation
• Conclusion


Introduction
• XML
    • de facto standard for data exchange and retrieval
    • tree-structured data model


Introduction
• XML query languages
    • specify tree-structured relationships between elements
    • specify patterns with selection predicates
    • ex) book[title = ‘XML’]//author[fn = ‘jane’ AND ln = ‘doe’]


Introduction
• Finding all occurrences of a twig pattern in a database is a core operation
• Previous work
    • decomposes the twig pattern into a set of binary (parent-child and ancestor-descendant) relationships
    • matches each of the binary relationships
    • “stitches” together these basic matches


Introduction
• Contributions
    • Two families of holistic path join algorithms
        • Holistic path join approach
        • Holistic twig join approach
    • Experimental study


Background
• XML Data Model
    • An XML database is a forest of rooted, ordered, labeled trees.


Background
• Indexing XML Documents
    • Element positions are represented as tuples (DocID, Left:Right, Level), sorted by Left
    • Child and descendant relationships between elements are easily determined
• Example position lists (figure):
    author: (1,6:20,3) …
    book: (1,1:150,1) …
    jane: (1,8:8,5) … (1,43:43,5) …
    title: (1,2:4,2) (1,65:67,3) …
    XML: (1,3:3,3) (1,66:66,4) …
    year: (1,61:63,2) …
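To make "easily determined" concrete, here is a minimal sketch of the two containment tests on (DocID, Left:Right, Level) positions. The `Pos` type and the function names are illustrative assumptions, not part of the deck; the sample positions are the ones listed on this slide.

```python
from typing import NamedTuple


class Pos(NamedTuple):
    """Position of one element: (DocID, Left:Right, Level)."""
    doc: int
    left: int
    right: int
    level: int


def is_descendant(d: Pos, a: Pos) -> bool:
    # d is a descendant of a when both are in the same document
    # and d's interval is strictly nested inside a's interval.
    return a.doc == d.doc and a.left < d.left and d.right < a.right


def is_child(c: Pos, p: Pos) -> bool:
    # A child is a descendant exactly one level below its parent.
    return is_descendant(c, p) and c.level == p.level + 1


book = Pos(1, 1, 150, 1)
title = Pos(1, 2, 4, 2)
xml = Pos(1, 3, 3, 3)
author = Pos(1, 6, 20, 3)
jane = Pos(1, 8, 8, 5)

print(is_child(title, book))       # title sits directly under book
print(is_descendant(xml, book))    # XML is nested inside book
print(is_child(jane, author))      # jane is two levels below author
```

Both tests need only integer comparisons on the two positions, which is what lets the join algorithms work on sorted position streams without touching the document itself.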


Background
• Twig Pattern Matching
    • Given a query twig pattern Q and an XML database D, compute the set of all matches for Q on D.
    • ex) book[title = ‘XML’ AND year = ‘2000’]


Background
• Previous attempts
    • Based on binary joins
        • Decompose the query into binary relationships
        • Solve the binary joins against the XML DB
        • Combine the “basic” matches
    • Main drawbacks
        • Join-order optimization is required
        • Intermediate results can be large
• Example: book[title = ‘XML’ AND year = ‘2000’]
    • ((book JOIN title) JOIN XML) JOIN (year JOIN 2000)
    • (((book JOIN year) JOIN 2000) JOIN title) JOIN XML
    • many other possibilities
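The binary-join approach can be sketched on a single path, book/title/XML, using the positions from the indexing slide. This is a deliberately naive nested-loop version under stated assumptions: `binary_join` and `is_parent` are hypothetical helper names, and real structural-join algorithms are merge-based rather than quadratic.

```python
def contains(a, d):
    # Positions are (doc, left, right, level) tuples.
    return a[0] == d[0] and a[1] < d[1] and d[2] < a[2]


def is_parent(p, c):
    return contains(p, c) and c[3] == p[3] + 1


def binary_join(ancestors, descendants, rel):
    # Naive nested-loop structural join for one binary relationship.
    return [(a, d) for a in ancestors for d in descendants if rel(a, d)]


book = [(1, 1, 150, 1)]
title = [(1, 2, 4, 2), (1, 65, 67, 3)]
xml = [(1, 3, 3, 3), (1, 66, 66, 4)]

# Step 1: solve each binary relationship on its own.
book_title = binary_join(book, title, is_parent)
title_xml = binary_join(title, xml, is_parent)

# Step 2: stitch the basic matches together on the shared title element.
matches = [(b, t, x) for (b, t) in book_title for (t2, x) in title_xml if t == t2]

print(len(book_title), len(title_xml), len(matches))
```

The title/XML join produces a pair for the second title element, which is not a child of book, so that pair is computed and then thrown away during stitching. On large inputs such discarded intermediate results are the cost the slide refers to.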


Holistic Joins
• Solve the entire twig query in two phases
    • Produce “guaranteed” partial results using one pass
    • Combine (merge-join) the partial results
• Partial results are smaller than the final result
• Effective encoding of partial results


Data Structure
• Each node q in the query has associated:
    • A stream Tq, with the positions of the elements corresponding to node q, in increasing “left” order.
    • A stack Sq, with a compact encoding of partial solutions (stacks are chained).
        • Each stack entry is a pair (position, pointer to an entry in S_parent(q))


Data Structure: Result representation
• Nodes in stack Sq lie on a root-to-leaf path
• Figure (four panels):
    XML fragment (nodes, top to bottom): A1, C1, A2, C2, B1, D1
    Query: //A//C//D
    Matches: [A1, C1, D1] [A1, C2, D1] [A2, C2, D1]
    Stacks: SA: A1, A2; SC: C1, C2; SD: D1
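The following sketch shows how three chained stacks can encode the three matches of this slide. The pointer values are an assumption (the arrows of the figure did not survive in the transcript): each entry is taken to point at the entry that was on top of the parent stack when it was pushed. The `expand` helper is hypothetical.

```python
# Each stack is a list ordered bottom to top.
# Each entry is (label, index of the parent-stack entry it points to).
S_A = [("A1", -1), ("A2", -1)]   # root stack: no parent pointer
S_C = [("C1", 0), ("C2", 1)]     # C1 -> A1, C2 -> A2
S_D = [("D1", 1)]                # D1 -> C2
stacks = [S_A, S_C, S_D]


def expand(stacks, level, idx):
    """Enumerate every root-to-entry path encoded for stacks[level][idx]."""
    label, ptr = stacks[level][idx]
    if level == 0:
        return [[label]]
    paths = []
    # The entry pairs with the parent entry it points to
    # and with every parent entry below that one.
    for j in range(ptr + 1):
        for prefix in expand(stacks, level - 1, j):
            paths.append(prefix + [label])
    return paths


matches = expand(stacks, 2, 0)
print(matches)
```

Five stack entries stand for three full matches here. Because an entry implicitly pairs with everything below its pointer target, the encoding stays linear in the stack sizes even when the number of matches grows much faster.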


Path Stack: Holistic Path Queries
• Repeatedly constructs stack encodings of partial solutions by iterating through the streams Tq.
• Stacks encode the set of partial solutions from the current element in Tq to the root of the XML tree.

WHILE (!eof)
    qN = “getMin(q)”
    clean stacks
    push TqN’s first element to SqN with the pointer to top(S_parent(qN))
    IF qN is a leaf node, expand solutions
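The loop above can be turned into a small runnable sketch for a single document and a path query with descendant edges only. The function name, the tuple layout and the numeric positions are assumptions chosen for illustration; the node labels follow the deck's PathStack example (A1, A2, C1, B1, C2, B2, C3, C4 with query nodes A, B, C).

```python
INF = float("inf")


def path_stack(streams):
    """streams[i] holds (label, left, right) for query node i, root first,
    each list sorted by left. Returns all matches as lists of labels."""
    n = len(streams)
    cursor = [0] * n
    stacks = [[] for _ in range(n)]   # entries: (label, left, right, ptr)
    results = []

    def next_left(i):
        return streams[i][cursor[i]][1] if cursor[i] < len(streams[i]) else INF

    def expand(level, idx):
        label, _, _, ptr = stacks[level][idx]
        if level == 0:
            return [[label]]
        return [p + [label] for j in range(ptr + 1) for p in expand(level - 1, j)]

    while cursor[-1] < len(streams[-1]):          # until the leaf stream ends
        q = min(range(n), key=next_left)          # getMin: smallest next left
        label, left, right = streams[q][cursor[q]]
        for s in stacks:                          # clean stacks
            while s and s[-1][2] < left:
                s.pop()
        ptr = len(stacks[q - 1]) - 1 if q > 0 else -1
        stacks[q].append((label, left, right, ptr))
        cursor[q] += 1
        if q == n - 1:                            # leaf: expand, then pop
            results.extend(expand(q, len(stacks[q]) - 1))
            stacks[q].pop()
    return results


# Assumed positions: A1 contains A2 and B2; A2 contains C1 and B1;
# B1 contains C2; B2 contains C3 and C4.
streams = [
    [("A1", 1, 16), ("A2", 2, 9)],
    [("B1", 5, 8), ("B2", 10, 15)],
    [("C1", 3, 4), ("C2", 6, 7), ("C3", 11, 12), ("C4", 13, 14)],
]
for match in path_stack(streams):
    print(match)
```

C1 is pushed while the B stack is empty, so its pointer is -1 and it expands to no solution. Each stream is read once, front to back, and no element is ever revisited.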


Path Stack Example
• XML fragment (nodes): A1, A2, C1, B1, C2, B2, C3, C4
• Query path: A, B, C
• Stack contents and output, step by step (animation frames):
    SA: A1
    SA: A1 - A2
    SA: A1 - A2; SC: C1
    SA: A1 - A2; SB: B1; SC: C1
    SA: A1 - A2; SB: B1; SC: C1 - C2; output: A1,B1,C2 and A2,B1,C2
    SA: A1; SB: B2
    SA: A1; SB: B2; SC: C3; output adds A1,B2,C3
    SA: A1; SB: B2; SC: C4; output adds A1,B2,C4


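Twig Queries
• Naïve adaptation of PathStack
    • Solve each root-to-leaf path independently
    • Merge-join the intermediate results
• Problem: many intermediate results might not be part of the final answer.
• (Figure: a query twig over nodes A, B, D, C, E, with A at the root, B and D on the second level and C and E on the third, shown next to an XML fragment built from the same labels.)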
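The problem can be illustrated with a small count. The sketch below solves the two root-to-leaf paths of the twig author[fn][ln] independently and then checks which path solutions survive the merge. The (left, right) positions are borrowed from the deck's TwigStack example slide; a nested loop stands in for a per-path PathStack run, and all names are illustrative assumptions.

```python
def contains(a, d):
    # Positions are (left, right) pairs within one document.
    return a[0] < d[0] and d[1] < a[1]


def path_solutions(ancestors, descendants):
    # Stand-in for running PathStack on one root-to-leaf path.
    return [(a, d) for a in ancestors for d in descendants if contains(a, d)]


author = [(2, 5), (6, 11), (12, 15)]
fn = [(3, 4), (7, 8)]
ln = [(9, 10), (13, 14)]

fn_paths = path_solutions(author, fn)   # solutions for author//fn
ln_paths = path_solutions(author, ln)   # solutions for author//ln

# Only author elements that appear in both path results reach the final answer.
shared_roots = {a for a, _ in fn_paths} & {a for a, _ in ln_paths}
wasted = [p for p in fn_paths + ln_paths if p[0] not in shared_roots]

print(len(fn_paths) + len(ln_paths), "path solutions,", len(wasted), "wasted")
```

Half of the intermediate path solutions are produced only to be discarded by the merge, because the authors at (2,5) and (12,15) each lack one of the two required children.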


Twig Stack
1) Compute only partial solutions that are guaranteed to extend to a final solution.
2) Merge partial solutions to obtain all matches.

WHILE (!eof)
    qN = “getNext(q)”
    clean stacks
    IF TqN’s first element is part of a solution, push it
    IF qN is a leaf node, expand solutions

• getNext may advance the streams in subTree(q) past elements that are guaranteed not to be part of a solution


Twig Stack
• Key difference between PathStack and TwigStack: before a node hq from Tq is pushed on its stack Sq, TwigStack ensures that
    (1) node hq has a descendant hqi in each of the streams Tqi, for qi ∈ children(q)
    (2) each node hqi recursively satisfies the first property
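The two conditions can be written down as a direct recursive check. This is a naive sketch that scans whole streams, not the streaming getNext procedure of the algorithm; the function name and the dictionary layout are assumptions, and the positions are those of the deck's TwigStack example.

```python
def has_extension(elem, q, children, streams):
    """True if elem, a (left, right) position for query node q, has in every
    child stream a descendant that itself satisfies the same property."""
    left, right = elem
    for child in children.get(q, []):
        if not any(
            left < lo and hi < right
            and has_extension((lo, hi), child, children, streams)
            for (lo, hi) in streams[child]
        ):
            return False
    return True   # leaves have no children, so they qualify trivially


children = {"author": ["fn", "ln"]}
streams = {
    "author": [(2, 5), (6, 11), (12, 15)],
    "fn": [(3, 4), (7, 8)],
    "ln": [(9, 10), (13, 14)],
}

pushable = [e for e in streams["author"]
            if has_extension(e, "author", children, streams)]
print(pushable)
```

Only the author at (6,11) qualifies: the author at (2,5) has no ln descendant and the one at (12,15) has no fn descendant, so neither would be pushed.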


Twig Stack Example
• Before an author element is pushed onto the stack for author, the current elements of all child streams (Tfn, Tln) are checked.
• Partial results are (6,11)(7,8) and (6,11)(9,10); they are then merged to generate the final results.
• Figure:
    XML fragment:
        allauthors (1,16)
            author (2,5)
                fn (3,4)
            author (6,11)
                fn (7,8)
                ln (9,10)
            author (12,15)
                ln (13,14)
    Query: author with children fn and ln
    Streams:
        author: (2,5) (6,11) (12,15)
        fn: (3,4) (7,8)
        ln: (9,10) (13,14)
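The merge step for this example can be sketched as a join of the two path-solution lists on their shared author element. The sketch assumes a twig that branches only at the root and groups by key with a dictionary instead of a sort-merge join, purely for brevity; `merge_paths` is a hypothetical name.

```python
def merge_paths(paths_a, paths_b):
    """Join two lists of root-to-leaf path solutions on their root element."""
    by_root = {}
    for sol in paths_b:
        by_root.setdefault(sol[0], []).append(sol)
    # Keep the root once, then append the branch taken from the second path.
    return [a + b[1:] for a in paths_a for b in by_root.get(a[0], [])]


author_fn = [((6, 11), (7, 8))]    # partial result for the author-fn path
author_ln = [((6, 11), (9, 10))]   # partial result for the author-ln path

twig_matches = merge_paths(author_fn, author_ln)
print(twig_matches)
```

Both partial results share the author at (6,11), so they combine into a single twig match and nothing is discarded.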


Experiment Environments
• Implemented all algorithms in C++, using the file system as a simple storage engine.
• Synthetic database
    • Random XML documents
    • Parameters: depth, fan-out, number of distinct labels
• Techniques compared
    • Binary join techniques
    • PathStack
    • TwigStack


PathStack vs. Binary Joins
• Sequential Scan (SS): 1.87s
• PathStack: 2.53s
• Binary Joins: 16.1s to 53.07s
• (Bar chart: execution time in seconds, axis from 0 to 60, for Binary Joins, PathStack and SS)
• XML database fragment: 1 million nodes
• Path query: A1//A2//A3//A4//A5//A6


PathStack vs. TwigStack
• Query: a twig over nodes A1 to A7 (figure)
• Data: a full ternary tree
    • the first subtree contains only A1, A2, A3 and A4
    • the second subtree: A1, A5, A6, A7
    • the third subtree contains all possible nodes
• Vary the size of the third subtree relative to the size of the first two, from 8% to 24%


PathStack vs. TwigStack

• Partial solutions are discarded at the merge step


Conclusion
• Developed holistic path join algorithms
• Developed TwigStack, which generalizes PathStack for twig queries
    • better than the binary join approach
• Future work
    • Integrate TwigStack with value-based joins (id-refs, user-defined predicates, etc.)
    • Incorporate the remaining axes (following, etc.)