
Holistic Twig Joins: Optimal XML Pattern Matching

Nicolas Bruno, Nick Koudas, Divesh Srivastava

ACM SIGMOD 2002

Presented by Jun-Ki Min

Contents: Introduction; Background; Holistic Path Join Algorithms; Twig Join Algorithms; Experimental Evaluation; Conclusion

Introduction XML

The de facto standard for data exchange and retrieval

Tree structured model

Introduction XML Query Languages

specify tree-structured relationships between elements

specify patterns of selection predicates

ex) book[title = ‘XML’]//author[fn = ‘jane’ AND ln = ‘doe’]

Introduction Finding all occurrences of a twig pattern in a database is a core operation.

Previous work

decomposes the twig pattern into a set of binary (parent-child and ancestor-descendant) relationships,

matches each of the binary relationships, and then "stitches" together these basic matches.

Introduction Contributions

Two families of holistic join algorithms:

the holistic path join approach (PathStack) and the holistic twig join approach (TwigStack)

Experimental study

Background XML Data Model

An XML database is a forest of rooted, ordered, labeled trees.

Background Indexing XML Documents

Element positions are represented as tuples (DocID, Left:Right, Level), sorted by Left.

Child and descendant relationships between elements are easily determined from these tuples.

[Figure: position lists for the labels author, book, jane, title, XML, and year. Tuples shown include (1,1:150,1), (1,2:4,2), (1,3:3,3), (1,6:20,3), (1,8:8,5), (1,43:43,5), (1,61:63,2), (1,65:67,3), and (1,66:66,4).]
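The containment tests that this encoding makes cheap can be sketched as follows. This is a minimal illustration, not code from the paper: the names `Pos`, `is_ancestor`, and `is_parent` are hypothetical, and the three tuples are taken from the slide without assuming which label each belongs to.

```python
from typing import NamedTuple

class Pos(NamedTuple):
    """One element position: (DocID, Left:Right, Level)."""
    doc_id: int
    left: int
    right: int
    level: int

def is_ancestor(a: Pos, d: Pos) -> bool:
    # a is an ancestor of d when both are in the same document
    # and a's interval strictly encloses d's interval.
    return a.doc_id == d.doc_id and a.left < d.left and d.right < a.right

def is_parent(a: Pos, d: Pos) -> bool:
    # A parent is an ancestor exactly one level above.
    return is_ancestor(a, d) and a.level + 1 == d.level

# Tuples copied from the slide; variable names are neutral on purpose.
outer = Pos(1, 1, 150, 1)
middle = Pos(1, 2, 4, 2)
inner = Pos(1, 3, 3, 3)

print(is_ancestor(outer, inner))  # ancestor-descendant check
print(is_parent(outer, middle))   # parent-child check
```

Both checks are constant-time comparisons, which is why the join algorithms that follow can work directly on sorted position lists without touching the document itself.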

Background Twig Pattern Matching

Given a query twig pattern Q and an XML database D, compute the set of all matches for Q on D.

book[title = ‘XML’ AND year = ‘2000’]

Background Previous attempts

Based on binary joins: decompose the query into binary relationships, solve the binary joins against the XML DB, and combine the "basic" matches.

Main drawbacks: join-order optimization is required, and intermediate results can be large.

Example: book[title = ‘XML’ AND year = ‘2000’]

((book JOIN title) JOIN XML) JOIN (year JOIN 2000)

(((book JOIN year) JOIN 2000) JOIN title) JOIN XML

... and many other possible join orders

Holistic Joins Solve the entire twig query in two phases:

1) produce "guaranteed" partial results using one pass over the input;

2) combine (merge join) the partial results.

The partial results are smaller than the final result, thanks to an effective encoding of partial results.

Data Structure Each node q in query has associated:

A stream Tq, with the positions of the elements corresponding to node q, in increasing “left” order.

A stack Sq with a compact encoding of partial solutions (stacks are chained).

Each stack entry is a pair (position, pointer to an entry in Sparent(q)).

Data Structure: Result representation

The nodes in a stack Sq all lie on one root-to-leaf path of the XML tree.

[Figure: an XML fragment, the query //A//C//D, its stack encoding, and the resulting matches [A1,C1,D1], [A1,C2,D1], [A2,C2,D1].]
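To see how chained stacks encode several matches compactly, here is a small sketch that expands a stack encoding for //A//C//D into explicit matches. The layout is an assumption for illustration: each stack is a bottom-to-top list of `(label, ptr)` pairs, `ptr` is an index into the parent stack (`-1` for the root stack), and an entry pairs with the entry it points to and with everything below that entry. The function name `expand_top` is hypothetical.

```python
def expand_top(stacks):
    """Expand the top entry of the leaf stack into full root-to-leaf matches.

    stacks[i] is the stack of the i-th query node (root first), stored
    bottom-to-top as (label, ptr) pairs.
    """
    leaf_label, leaf_ptr = stacks[-1][-1]

    def walk(level, max_idx):
        # Yield every partial match that ends in stacks[level][0..max_idx].
        if level < 0:
            yield []
            return
        for i in range(max_idx + 1):
            label, ptr = stacks[level][i]
            for prefix in walk(level - 1, ptr):
                yield prefix + [label]

    return [prefix + [leaf_label] for prefix in walk(len(stacks) - 2, leaf_ptr)]

# Assumed encoding for the slide's example: C1 points at A1, C2 at A2,
# and D1 points at C2 (the top of the C stack).
stacks = [
    [("A1", -1), ("A2", -1)],
    [("C1", 0), ("C2", 1)],
    [("D1", 1)],
]
matches = expand_top(stacks)
print(matches)
# [['A1', 'C1', 'D1'], ['A1', 'C2', 'D1'], ['A2', 'C2', 'D1']]
```

Five stack entries stand for three matches here; the gap widens as nesting gets deeper, because the number of matches can grow much faster than the number of entries.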

Path Stack: Holistic Path Queries Repeatedly constructs stack encodings of partial solutions by iterating through the streams Tq.

The stacks encode the set of partial solutions from the current element in Tq to the root of the XML tree.

WHILE (!eof)
  qN = getMin(q)
  clean stacks
  push TqN’s first element onto SqN with a pointer to top(Sparent(qN))
  IF qN is a leaf node, expand solutions
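The loop above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: all query edges are ancestor-descendant (//), every query node has a distinct label, all elements belong to one document, and positions are plain `(left, right)` pairs. The function name `path_stack` and the sample data are hypothetical. One shortcut is also assumed: an element is skipped when its parent stack is empty, since it cannot take part in any solution.

```python
def path_stack(streams):
    """streams[i]: (left, right) positions for the i-th query node
    (root first), sorted by left. Returns matches as lists of positions."""
    n = len(streams)
    cursor = [0] * n
    stacks = [[] for _ in range(n)]  # entries: (left, right, ptr)
    results = []

    def expand(level, max_idx):
        # Enumerate partial matches ending in stacks[level][0..max_idx].
        if level < 0:
            yield []
            return
        for i in range(max_idx + 1):
            left, right, ptr = stacks[level][i]
            for prefix in expand(level - 1, ptr):
                yield prefix + [(left, right)]

    while cursor[-1] < len(streams[-1]):  # stop once the leaf stream is done
        # getMin: the query node whose next element has the smallest left.
        q = min((i for i in range(n) if cursor[i] < len(streams[i])),
                key=lambda i: streams[i][cursor[i]][0])
        left, right = streams[q][cursor[q]]
        # Clean stacks: drop entries that end before the new element starts.
        for s in stacks:
            while s and s[-1][1] < left:
                s.pop()
        if q == 0 or stacks[q - 1]:
            ptr = len(stacks[q - 1]) - 1 if q > 0 else -1
            if q == n - 1:
                # Leaf: expand solutions right away; no need to keep it.
                for prefix in expand(q - 1, ptr):
                    results.append(prefix + [(left, right)])
            else:
                stacks[q].append((left, right, ptr))
        cursor[q] += 1
    return results

# Hypothetical data for the path query //A//B//C.
a_stream = [(1, 20), (2, 15)]
b_stream = [(3, 10), (16, 19)]
c_stream = [(4, 5), (17, 18)]
print(path_stack([a_stream, b_stream, c_stream]))
```

Each stream is read once, front to back, and the stacks never hold more entries than the depth of the document, which is what keeps the intermediate state small.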

Path Stack Example

[Figure: step-by-step PathStack run showing the stacks for A, B, and C. The output grows as leaf elements are processed: first A1,B1,C2 and A2,B1,C2; then A1,B2,C3; then A1,B2,C4.]

Twig Queries Naïve adaptation of PathStack:

solve each root-to-leaf path independently, then merge-join the intermediate results.

Problem: many intermediate results might not be part of the final answer.

[Figure: example trees with nodes labeled A, B, C, D, and E.]

Twig Stack

1) Compute only partial solutions that are guaranteed to extend to a final solution.

2) Merge partial solutions to obtain all matches.

WHILE (!eof)
  qN = getNext(q)
  clean stacks
  IF TqN’s first element is part of a solution, push it
  IF qN is a leaf node, expand solutions

getNext might advance the streams in subTree(q) past elements that are guaranteed not to be part of a solution.

Twig Stack The key difference between PathStack and TwigStack is that, before a node hq from Tq is pushed onto its stack Sq, TwigStack ensures that

(1) node hq has a descendant hqi in each of the streams Tqi, for qi ∈ children(q), and

(2) each node hqi recursively satisfies the first property.
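The two conditions can be restated as a brute-force recursive check. This sketch is only meant to make the property concrete: it scans whole streams, whereas getNext tests it incrementally against the current stream heads. The names `has_extension`, `children`, and `streams` are hypothetical. The positions come from the author/fn/ln example in this deck, and splitting the four leaf positions between fn and ln is an assumption.

```python
def has_extension(children, streams, q, elem):
    """True if elem (a (left, right) position of query node q) has, for each
    child query node, a descendant that itself satisfies the same property."""
    left, right = elem
    for child in children.get(q, []):
        found = any(
            left < c_left and c_right < right
            and has_extension(children, streams, child, (c_left, c_right))
            for (c_left, c_right) in streams[child]
        )
        if not found:
            return False
    return True  # leaves have no children, so they always qualify

children = {"author": ["fn", "ln"]}
streams = {
    "author": [(2, 5), (6, 11), (12, 15)],
    "fn": [(3, 4), (7, 8)],      # assumed assignment of positions to fn
    "ln": [(9, 10), (13, 14)],   # assumed assignment of positions to ln
}
qualifying = [a for a in streams["author"]
              if has_extension(children, streams, "author", a)]
print(qualifying)  # [(6, 11)]
```

Only the author at (6,11) has both an fn and an ln descendant, so it is the only author element that would be pushed. The other two each lack one child, whichever way the remaining positions are assigned.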

Twig Stack Example

Before an author element is pushed onto the stack Sauthor, the current elements of all its child streams (Tfn, Tln) are checked.

The partial results are (6,11)(7,8) and (6,11)(9,10), which are then merged to generate the final result.

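[Figure: an XML tree rooted at allauthors with three author elements, a twig query with root author and children fn and ln, and the corresponding streams. Positions shown: (2,5), (6,11), (12,15) and (3,4), (7,8), (9,10), (13,14).]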
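The merge phase for this example can be sketched as a join of root-to-leaf path solutions on their shared root element. This is an illustrative stand-in that groups by root with a dictionary instead of performing a sorted merge join, and it only handles twigs that branch at the root. The name `merge_on_root` is hypothetical.

```python
from collections import defaultdict
from itertools import product

def merge_on_root(paths_per_leaf):
    """paths_per_leaf: one list per query leaf, each holding (root, leaf)
    position pairs. Returns (root, leaf_1, ..., leaf_k) twig matches."""
    groups = []
    for paths in paths_per_leaf:
        by_root = defaultdict(list)
        for root, leaf in paths:
            by_root[root].append(leaf)
        groups.append(by_root)
    # Keep only roots that appear in the solutions of every path.
    common = set(groups[0]).intersection(*groups[1:])
    matches = []
    for root in sorted(common):
        for leaves in product(*(g[root] for g in groups)):
            matches.append((root, *leaves))
    return matches

# Partial results from the example: author-fn and author-ln path solutions.
fn_paths = [((6, 11), (7, 8))]
ln_paths = [((6, 11), (9, 10))]
print(merge_on_root([fn_paths, ln_paths]))
# [((6, 11), (7, 8), (9, 10))]

# A path-at-a-time approach could also emit ((2, 5), (3, 4)) for one path;
# with no partner on the other path it is dropped here.
print(merge_on_root([[((2, 5), (3, 4))] + fn_paths, ln_paths]))
```

The second call shows the kind of wasted intermediate result that the push-time check is designed to avoid producing in the first place.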

Experiment Environments Implemented all algorithms in C++, using the file system as a simple storage engine.

Synthetic database: random XML documents, parameterized by depth, fan-out, and number of distinct labels.

Techniques compared: binary join techniques, PathStack, TwigStack.

PathStack vs. Binary Joins

XML database fragment: 1 million nodes. Path query: A1//A2//A3//A4//A5//A6

Sequential Scan (SS): 1.87s; PathStack: 2.53s; Binary Joins: 16.1s to 53.07s

[Figure: execution time chart comparing Binary Joins, PathStack, and SS.]

PathStack vs. TwigStack Query

Data: a full ternary tree. The first subtree contains only A1, A2, A3, and A4; the second subtree contains only A1, A5, A6, and A7; the third subtree contains all possible nodes.

Vary the size of the third subtree relative to the size of the first two from 8% to 24%.

PathStack vs. TwigStack

• Many of PathStack's partial solutions are discarded at the merge step

Conclusion Developed holistic path join algorithms.

Developed TwigStack, which generalizes PathStack for twig queries.

Better than the binary join approach.

Future work Integrate TwigStack with value-based joins (id-refs, user-defined predicates, etc.).

Incorporate remaining axes (following, etc.).
