Holistic Twig Joins: Optimal XML Pattern Matching. Nicolas Bruno, Nick Koudas, Divesh Srivastava...

Post on 18-Jan-2018

1

Holistic Twig Joins: Optimal XML Pattern Matching

Nicolas Bruno, Nick Koudas, Divesh Srivastava

ACM SIGMOD 2002

Presented by Jun-Ki Min

2

Contents

Introduction
Background
Holistic Path Join Algorithms
Twig Join Algorithms
Experimental Evaluation
Conclusion

3

Introduction

XML is the de facto standard for data exchange and retrieval

Tree-structured data model

4

Introduction

XML query languages specify tree-structured relationships

and patterns of selection predicates

ex) book[title = 'XML']//author[fn = 'jane' AND ln = 'doe']

5

Introduction

Finding all occurrences of a twig pattern in a database is a core operation

Previous work:

decompose the twig pattern into a set of binary (parent-child and ancestor-descendant) relationships

match each of the binary relationships against the database

“stitch” together these basic matches

6

Introduction

Contributions:

Two families of holistic path join algorithms

Holistic path join approach
Holistic twig join approach

Experimental study

7

Background

XML Data Model

An XML database is a forest of rooted, ordered, labeled trees.

8

Background

Indexing XML Documents

Element positions are represented as tuples (DocID, Left:Right, Level), sorted by Left.

Child and descendant relationships between elements are easily determined from these tuples.

[Figure: position lists for the elements author, book, jane, …, title, XML, year]

(1,6:20,3) … (1,1:150,1) … (1,8:8,5) … (1,43:43,5) …

(1,2:4,2) (1,65:67,3) … (1,3:3,3) (1,66:66,4) … (1,61:63,2) …
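The containment test behind "easily determined" can be sketched as follows. This is a minimal illustration, assuming the usual interval-containment reading of the (DocID, Left:Right, Level) encoding; the names `Pos`, `is_descendant` and `is_child` are hypothetical, and the tuples are copied from the slide without assuming which tag each belongs to.

```python
from typing import NamedTuple


class Pos(NamedTuple):
    doc_id: int
    left: int
    right: int
    level: int


def is_descendant(anc: Pos, desc: Pos) -> bool:
    """True if `desc` lies strictly inside the interval of `anc` in the same document."""
    return (anc.doc_id == desc.doc_id
            and anc.left < desc.left
            and desc.right < anc.right)


def is_child(parent: Pos, child: Pos) -> bool:
    """A child is a descendant that sits exactly one level deeper."""
    return is_descendant(parent, child) and child.level == parent.level + 1


# Tuples copied from the slide, written as (DocID, Left, Right, Level).
outer = Pos(1, 1, 150, 1)
inner = Pos(1, 2, 4, 2)
leaf = Pos(1, 3, 3, 3)

print(is_child(outer, inner))       # True
print(is_descendant(outer, leaf))   # True
print(is_child(outer, leaf))        # False
```

Both tests are constant-time comparisons, which is what lets the join algorithms on the following slides work on sorted position streams without touching the document tree.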

9

Background

Twig Pattern Matching

Given a query twig pattern Q and an XML database D, compute the set of all matches for Q on D.

ex) book[title = 'XML' AND year = '2000']

10

Background

Previous attempts

Based on binary joins:
Decompose the query into binary relationships
Solve the binary joins against the XML DB
Combine the “basic” matches

Main drawbacks:
Join-order optimization is required
Intermediate results can be large

book[title = 'XML' AND year = '2000']
((book JOIN title) JOIN XML) JOIN (year JOIN 2000)
(((book JOIN year) JOIN 2000) JOIN title) JOIN XML
many other possibilities
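To make the "intermediate results can be large" drawback concrete, here is a small sketch of the binary-join approach. It assumes a naive nested-loop structural join on (left, right) intervals and ignores the value predicates; the function name and all data are hypothetical, and real binary structural joins use merge-based algorithms rather than nested loops.

```python
def structural_join(ancestors, descendants):
    """Naive nested-loop ancestor-descendant join on (left, right) intervals."""
    return [(a, d) for a in ancestors for d in descendants
            if a[0] < d[0] and d[1] < a[1]]


# Hypothetical data: two books, only the first has both a title and a year.
books = [(1, 6), (7, 10)]
titles = [(2, 3), (8, 9)]
years = [(4, 5)]

book_title = structural_join(books, titles)   # intermediate result
book_year = structural_join(books, years)     # intermediate result

# Stitch the binary matches together on the shared book element.
matches = [(b, t, y) for (b, t) in book_title
           for (b2, y) in book_year if b == b2]

print(len(book_title), len(book_year), len(matches))  # 2 1 1
```

The book/title join produces two pairs although only one book satisfies the whole twig; on real data this gap between intermediate and final result sizes is what the holistic approach aims to avoid.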

11

Holistic Joins

Solve the entire twig query in two phases:

1) Produce “guaranteed” partial results using one pass

2) Combine (merge join) the partial results

Partial results are smaller than the final result

Effective encoding of partial results

12

Data Structure

Each node q in the query has associated:

A stream Tq, with the positions of the elements corresponding to node q, in increasing “left” order.

A stack Sq, with a compact encoding of partial solutions (stacks are chained).

Each stack entry is a pair (position, pointer to an entry in Sparent(q)).

13

Data Structure: Result representation

Nodes in stack Sq lie on a root-to-leaf path.

[Figure: query, XML fragment, stacks and matches]

Query: //A//C//D

XML fragment (nodes): A1, C1, A2, C2, B1, D1

Stacks: SA = A1, A2; SC = C1, C2; SD = D1

Matches: [A1, C1, D1] [A1, C2, D1] [A2, C2, D1]
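The way the chained stacks encode several matches compactly can be sketched as below. The stack contents are those shown on the slide; the pointer values are an assumption, inferred from the listed matches, and `expand` is a hypothetical helper name.

```python
# Each stack entry is (element, index of the entry in the parent stack that was
# on top when this element was pushed). -1 means "no parent stack" (root node).
# The pointer values are inferred from the matches listed on the slide.
stacks = {
    "A": [("A1", -1), ("A2", -1)],
    "C": [("C1", 0), ("C2", 1)],
    "D": [("D1", 1)],
}
query_path = ["A", "C", "D"]  # //A//C//D, root first


def expand(level, top):
    """Yield all partial matches for query_path[:level + 1] that use
    entries 0..top of the stack at `level`."""
    if level < 0:
        yield ()
        return
    for elem, ptr in stacks[query_path[level]][: top + 1]:
        for prefix in expand(level - 1, ptr):
            yield prefix + (elem,)


leaf = len(query_path) - 1
matches = list(expand(leaf, len(stacks["D"]) - 1))
print(matches)
# [('A1', 'C1', 'D1'), ('A1', 'C2', 'D1'), ('A2', 'C2', 'D1')]
```

An entry can pair with the parent entry it points to and with every entry below that one, so five stack entries are enough to represent the three matches.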

14

PathStack: Holistic Path Queries

Repeatedly constructs stack encodings of partial solutions by iterating through the streams Tq.

Stacks encode the set of partial solutions from the current element in Tq to the root of the XML tree.

WHILE (!eof)
    qN = getMin(q)
    clean stacks
    push TqN's first element onto SqN with the pointer to top(Sparent(qN))
    IF qN is a leaf node, expand solutions
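The loop above can be sketched in runnable form as follows. This is a minimal illustration, not the authors' implementation: it assumes ancestor-descendant edges only, a single document, in-memory lists as streams, and that the leaf entry is popped right after its solutions are expanded. The names `Elem` and `path_stack` and the element positions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Elem:
    name: str
    left: int
    right: int


def path_stack(streams):
    """streams[i] is the stream for the i-th query node (root first),
    sorted by `left`. Returns all matches of the path //q0//q1//...//qk."""
    n = len(streams)
    leaf = n - 1
    cursor = [0] * n                  # next unread position in each stream
    stacks = [[] for _ in range(n)]   # entries: (Elem, index into parent stack)
    results = []

    def expand(level, top):
        # Enumerate matches for query nodes 0..level using entries 0..top.
        if level < 0:
            yield ()
            return
        for elem, ptr in stacks[level][: top + 1]:
            for prefix in expand(level - 1, ptr):
                yield prefix + (elem.name,)

    while cursor[leaf] < len(streams[leaf]):          # "!eof"
        # getMin: the query node whose next element has the smallest left.
        live = [i for i in range(n) if cursor[i] < len(streams[i])]
        q = min(live, key=lambda i: streams[i][cursor[i]].left)
        cur = streams[q][cursor[q]]
        cursor[q] += 1

        # Clean stacks: drop entries that end before `cur` starts.
        for stack in stacks:
            while stack and stack[-1][0].right < cur.left:
                stack.pop()

        # Push with a pointer to the current top of the parent stack.
        parent_top = len(stacks[q - 1]) - 1 if q > 0 else -1
        stacks[q].append((cur, parent_top))

        # A leaf completes root-to-leaf solutions; emit them and pop the leaf.
        if q == leaf:
            elem, ptr = stacks[leaf].pop()
            for prefix in expand(leaf - 1, ptr):
                results.append(prefix + (elem.name,))
    return results


# Hypothetical document: A1 contains A2 and B2; A2 contains C1 and B1;
# B1 contains C2; B2 contains C3 and C4.
t_a = [Elem("A1", 1, 16), Elem("A2", 2, 9)]
t_b = [Elem("B1", 5, 8), Elem("B2", 10, 15)]
t_c = [Elem("C1", 3, 4), Elem("C2", 6, 7), Elem("C3", 11, 12), Elem("C4", 13, 14)]

for match in path_stack([t_a, t_b, t_c]):
    print(match)
```

Each stream is read once, and C1 produces no output because the B stack is empty when it arrives, so its parent pointer is -1.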

15

PathStack Example

[Figure: animation of PathStack on an XML fragment with nodes A1, A2, C1, B1, C2, B2, C3, C4 and a path query over A, B, C]

Stack SA grows from A1 to A1 - A2; after B1 and C2 are pushed, the solutions A1,B1,C2 and A2,B1,C2 are output.

With A1 on SA and B2 on SB, pushing C3 and then C4 outputs A1,B2,C3 and A1,B2,C4.

16

Twig Queries

Naïve adaptation of PathStack:

solve each root-to-leaf path independently

merge-join the intermediate results

Problem: Many intermediate results might not be part of the final answer.

[Figure: twig query over nodes A, B, C, D, E and an example XML tree]

17

TwigStack

1) Compute only partial solutions that are guaranteed to extend to a final solution.

2) Merge partial solutions to obtain all matches.

WHILE (!eof)
    qN = getNext(q)
    clean stacks
    IF TqN's first element is part of a solution, push it
    IF qN is a leaf node, expand solutions

getNext may advance the streams in subTree(q) past elements that are guaranteed not to be part of a solution.

18

TwigStack

The key difference between PathStack and TwigStack: before a node hq from Tq is pushed onto its stack Sq, TwigStack ensures that

(1) node hq has a descendant hqi in each of the streams Tqi, for qi ∈ children(q)

(2) each node hqi recursively satisfies the first property
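The two properties can be checked naively as sketched below. This is only an illustration of the condition itself: it rescans whole child streams, whereas TwigStack's getNext establishes the same guarantee with stream cursors in a single pass (not reproduced here). The helper name `has_extension` is hypothetical; the (left, right) positions are the author, fn and ln values from the TwigStack example slide of this deck.

```python
def has_extension(elem, node, children, streams):
    """Return True if `elem` (a (left, right) interval for query node `node`)
    has, for every child query node, a descendant in that child's stream
    which itself satisfies the same condition."""
    left, right = elem
    for child in children.get(node, []):
        if not any(left < c_left and c_right < right
                   and has_extension((c_left, c_right), child, children, streams)
                   for (c_left, c_right) in streams[child]):
            return False
    return True


children = {"author": ["fn", "ln"]}
streams = {
    "author": [(2, 5), (6, 11), (12, 15)],
    "fn": [(3, 4), (7, 8)],
    "ln": [(9, 10), (13, 14)],
}
pushable = [a for a in streams["author"]
            if has_extension(a, "author", children, streams)]
print(pushable)  # [(6, 11)]
```

Only the author at (6, 11) has both an fn and an ln descendant, so it is the only author element that would be pushed.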

19

TwigStack Example

Before an author element is pushed onto its stack Sauthor, the current elements of all child streams (Tfn, Tln) are checked.

The partial results are (6,11)(7,8) and (6,11)(9,10); they are then merged to generate the final results.

[Figure: XML tree rooted at allauthors (1,16) with author, fn and ln elements, and a twig query with root author and children fn and ln]

Streams:
author: (2,5) (6,11) (12,15)
fn: (3,4) (7,8)
ln: (9,10) (13,14)
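The merge step on the two partial results quoted above can be sketched as follows. This is a simplified illustration for a twig whose branching node is the root of both paths; it uses a dictionary lookup rather than a sorted merge join, and `merge_path_solutions` is a hypothetical name.

```python
def merge_path_solutions(left_paths, right_paths):
    """Join two lists of root-to-leaf path solutions on their shared first
    element (the branching query node)."""
    by_root = {}
    for path in right_paths:
        by_root.setdefault(path[0], []).append(path[1:])
    return [lp + rest for lp in left_paths for rest in by_root.get(lp[0], [])]


author_fn = [((6, 11), (7, 8))]     # author-fn path solutions
author_ln = [((6, 11), (9, 10))]    # author-ln path solutions
print(merge_path_solutions(author_fn, author_ln))
# [((6, 11), (7, 8), (9, 10))]
```

Because TwigStack only emits path solutions that are guaranteed to join, every input tuple here contributes to the output.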

20

Experiment Environments

Implemented all algorithms in C++, using the file system as a simple storage engine.

Synthetic database: random XML documents, parameterized by depth, fan-out, and number of distinct labels.

Techniques compared: binary join techniques, PathStack, TwigStack.

21

PathStack vs. Binary Joins

Sequential Scan: 1.87s

PathStack: 2.53s

Binary Joins: 16.1s to 53.07s

[Figure: bar chart of execution time (seconds, axis 0 to 60) for Binary Joins, PathStack, and SS (sequential scan)]

XML database fragment: 1 million nodes.

Path query: A1//A2//A3//A4//A5//A6

22

PathStack vs. TwigStack

Query: [Figure: twig query over nodes A1 to A7]

Data: a full ternary tree
The first subtree contains only A1, A2, A3 and A4
The second subtree: A1, A5, A6 and A7
The third subtree contains all possible nodes
Vary the size of the third subtree relative to the size of the first two, from 8% to 24%

23

PathStack vs. TwigStack

•Partial solutions are discarded at the merge step

24

Conclusion

Developed holistic path join algorithms

Developed TwigStack, which generalizes PathStack for twig queries

Better than the binary join approach

Future work:

Integrate TwigStack with value-based joins (id-refs, user-defined predicates, etc.)

Incorporate the remaining axes (following, etc.)
