Efficient Processing of Ordered XML Twig Pattern
by Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, Wei Ni
Presented by: Tian Yu
23 Aug 2005

Outline
Introduction and motivation
Background
  XML tree and twig pattern matching
  Previous two algorithms: TwigStack and TwigStackList
Our Ordered Twig Algorithms
  Ordered Children Extension (OCE for short)
  A generalized holistic matching algorithm: OrderedTJ
Experiments
Conclusion

Introduction
XML is rapidly gaining popularity as a data representation.
XML documents are modeled as ordered trees.
XML queries specify patterns of selection predicates on multiple elements that have some structural relationships (parent-child, ancestor-descendant).

What is a Twig Pattern?
A twig pattern is a small tree whose nodes are tags, attributes or text values, and whose edges are either Parent-Child (P-C) edges or Ancestor-Descendant (A-D) edges.
E.g. query description: select Figure elements that are descendants of Paragraph elements, which in turn are children of Section elements that have a child element Title.
Twig pattern:
Section
  Title (P-C child of Section)
  Paragraph (P-C child of Section)
    Figure (A-D descendant of Paragraph)
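
The twig pattern above can be written down as a small tree data structure. The following is a minimal sketch of one possible encoding, not the authors' implementation; the names `Edge`, `QueryNode`, and `to_xpath` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Edge(Enum):
    PC = "/"    # parent-child edge
    AD = "//"   # ancestor-descendant edge


@dataclass
class QueryNode:
    tag: str
    edge: Edge = Edge.AD  # edge from the parent query node to this node
    children: List["QueryNode"] = field(default_factory=list)


def to_xpath(node: QueryNode) -> str:
    """Render a twig as XPath: the last child continues the main path,
    every other child becomes a predicate in square brackets."""
    if not node.children:
        return node.tag
    *branches, last = node.children
    preds = "".join(
        f"[{'.//' if b.edge is Edge.AD else ''}{to_xpath(b)}]" for b in branches
    )
    return f"{node.tag}{preds}{last.edge.value}{to_xpath(last)}"


# The example twig: Section with children Title and Paragraph,
# and Figure as a descendant of Paragraph.
figure = QueryNode("Figure", Edge.AD)
paragraph = QueryNode("Paragraph", Edge.PC, [figure])
title = QueryNode("Title", Edge.PC)
section = QueryNode("Section", Edge.AD, [title, paragraph])

print("//" + to_xpath(section))
```

Treating the last child as the continuation of the main path is a rendering choice of this sketch; the twig itself does not single out any branch.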

Motivation
Since XML documents are modeled as ordered trees, it is natural to have ordered queries.
Four ordered axes: following-sibling, preceding-sibling, following, preceding.
Example:
ordered query: //book/title/following-sibling::chapter
unordered query: //book/title/chapter

Order axes
Four axes: following-sibling, preceding-sibling, following, and preceding.
In the sample document, set the context node to be f.
[Figure: sample XML document, a tree with the nodes a, b, c, d, e, f, g, h, i, j]
Context node: f
Following of f: i and j
Preceding of f: b, c and e
Following-sibling of f: i
Preceding-sibling of f: e
Following-sibling of f = the following nodes of f that share the same parent with f
Preceding-sibling of f = the preceding nodes of f that share the same parent with f
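
The four axes can be computed from document order alone. The sketch below is illustrative only: the tree shape in `TREE` is an assumption, chosen as one shape that is consistent with the node sets listed for the context node f (the slide's figure did not survive), and all function names are hypothetical.

```python
# Assumed tree shape: parent -> ordered list of children.
TREE = {
    "a": ["b", "d", "j"],
    "b": ["c"],
    "d": ["e", "f", "i"],
    "f": ["g", "h"],
}


def preorder(root):
    """Nodes of the subtree rooted at `root`, in document order."""
    order = [root]
    for child in TREE.get(root, []):
        order.extend(preorder(child))
    return order


def parent_of(node):
    for parent, children in TREE.items():
        if node in children:
            return parent
    return None


def ancestors(node):
    result = []
    parent = parent_of(node)
    while parent is not None:
        result.append(parent)
        parent = parent_of(parent)
    return result


def descendants(node):
    return preorder(node)[1:]


def following(node, root="a"):
    # Nodes after `node` in document order, excluding its descendants.
    order = preorder(root)
    after = order[order.index(node) + 1:]
    return [n for n in after if n not in descendants(node)]


def preceding(node, root="a"):
    # Nodes before `node` in document order, excluding its ancestors.
    order = preorder(root)
    before = order[:order.index(node)]
    return [n for n in before if n not in ancestors(node)]


def following_sibling(node, root="a"):
    return [n for n in following(node, root) if parent_of(n) == parent_of(node)]


def preceding_sibling(node, root="a"):
    return [n for n in preceding(node, root) if parent_of(n) == parent_of(node)]


print(following("f"), preceding("f"), following_sibling("f"), preceding_sibling("f"))
```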

Ordered Twig Pattern
//chapter[title=“related work”]/following::section
Intuitive meaning: search for all the sections that appear after (but are not descendants of) chapter elements with the title “related work” in the XML document.
The query node Book is ordered.
If the twig pattern is unordered: section1, section2, and section3 are all matching elements.
But for the ordered query, section1 and section2 are not in the solution. How does our method detect that?

Motivation
Naïve method:
Use an existing algorithm to output the intermediate path solutions for each individual root-leaf query path.
Merge the path solutions so that the final solutions are guaranteed to satisfy the order predicates of the query.
Disadvantage of the naïve method: many intermediate results may not contribute to final answers.
Our solution: efficient processing of ordered XML twig patterns.

XML Twig Pattern Matching
An XML document is commonly modeled as a rooted, ordered and tagged tree.
[Figure: example document tree rooted at book, with preface and chapter children; further labels are section, paragraph and title, and the text values include “XML”, “Data” and “Intro”]

Region Coding
Node label [1]: (startPos, endPos, LevelNum)
E.g., in the example tree:
book (1,21,1)
preface (2,4,2), containing (3,3,3)
first chapter (5,12,2), containing (6,8,3) and (9,11,3), which contain (7,7,4) and (10,10,4)
second chapter (13,20,2), containing (14,16,3) and (17,19,3), which contain (15,15,4) and (18,18,4)
The level-3 elements inside each chapter are a section and a title.
[1] M.P. Consens and T. Milo. Optimizing queries on files. In Proceedings of ACM SIGMOD, 1994.

Region Coding
Given elements e1 and e2: e1 is an ancestor of e2 iff e1.start < e2.start and e1.end > e2.end.
[Figure: the region-coded example tree, with an ancestor e1 and a descendant e2 marked]

Region Coding
Given elements e1 and e2: e1 is the parent of e2 iff e1.start < e2.start and e1.end > e2.end, and e1.level + 1 = e2.level.
[Figure: the region-coded example tree, with a parent e1 and its child e2 marked]
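
The two region-coding tests translate directly into code. The sketch below uses region codes that appear on the slides; the type name `Region`, the function names, and the variable names are hypothetical and only serve the illustration.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int
    end: int
    level: int


def is_ancestor(e1: Region, e2: Region) -> bool:
    """e1 is an ancestor of e2 iff e1's interval strictly contains e2's."""
    return e1.start < e2.start and e1.end > e2.end


def is_parent(e1: Region, e2: Region) -> bool:
    """e1 is the parent of e2 iff it is an ancestor exactly one level up."""
    return is_ancestor(e1, e2) and e1.level + 1 == e2.level


# Region codes taken from the example tree on the slides.
book = Region(1, 21, 1)
preface = Region(2, 4, 2)
chapter1 = Region(5, 12, 2)
chapter2 = Region(13, 20, 2)
node_6_8 = Region(6, 8, 3)    # a level-3 element inside the first chapter
text_7_7 = Region(7, 7, 4)    # the level-4 node inside node_6_8

print(is_ancestor(book, text_7_7), is_parent(book, text_7_7))
print(is_parent(chapter1, node_6_8), is_ancestor(chapter2, node_6_8))
```

Both tests need only the labels of the two elements, so they run in constant time without touching the document tree.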

Previous work: TwigStack
TwigStack [2]: a holistic approach.
Two-phase algorithm:
Phase 1 (TwigJoin): intermediate root-leaf path solutions are output.
Phase 2 (Merge): the intermediate paths are merged to get the final results.
[2] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.

Sub-optimality of TwigStack
TwigStack is optimal when the query contains only ancestor-descendant relationships.
If the query contains any parent-child relationship, TwigStack may output some intermediate path solutions that cannot contribute to final results.
We say that TwigStack is sub-optimal for queries with parent-child relationships.

TwigStackList
The main problem of TwigStack is that it treats all edges as ancestor-descendant relationships in the first phase, so it is not efficient for queries with parent-child relationships.
Improved method: TwigStackList [3] (CIKM 2004).
There is an additional list structure for each query node to cache elements that are likely to participate in final solutions.
TwigStackList improves on TwigStack because it considers parent-child relationships in the first phase.
TwigStackList is optimal when there is no P-C edge for branching nodes (a branching node is a query node with more than one child or descendant).
[3] J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In CIKM, pages 533-542, 2004.

TwigStackList vs. TwigStack
TwigStack outputs the “useless” path solution <s1,t1>, since it does not check the parent-child relationship.
TwigStackList has no useless output: <s1,t1> is not in its output.
[Figure: twig pattern with root section, title and paragraph below it, and figure below paragraph, next to an XML tree with the elements Root, s1, s2, p1, p2, p3, t1, t2, t3, f1, f2. Annotation: no parent-child relationship for the branching node]

Ordered Children Extension (OCE)
Definition: an element en (with tag n) has an OCE if:
1) In the query Q, for every A-D child n’ of n (if any), there is an element en’ (with tag n’) that is a descendant of en, and en’ also has an OCE; and
2) In the query Q, for every P-C child n’ of n (if any), there is an element e’ (with tag n) on the path from en to en’ such that e’ is the parent of en’, and en’ also has an OCE; and
3) For each child (or descendant) n’ of n, if there is a node m that is the immediate right sibling of n’, there are elements en’ and em such that en’ is a child (or descendant) of element en, en’.end < em.start, and both en’ and em have an OCE.
The first two conditions are already guaranteed by TwigStackList. Our main focus is the third condition.
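
The order test in condition 3 can be sketched with region codes. This is an illustration under stated assumptions, not the authors' code: the name `satisfies_order` is hypothetical, the region values are invented, and `candidates` is assumed to contain, for each child query node, only elements that lie below en and already have an OCE themselves.

```python
from typing import Dict, List, NamedTuple


class Region(NamedTuple):
    start: int
    end: int
    level: int


def satisfies_order(ordered_children: List[str],
                    candidates: Dict[str, List[Region]]) -> bool:
    """Pairwise check of condition 3: for each child n' and its immediate
    right sibling m, some candidate for n' must end before some candidate
    for m starts (en'.end < em.start)."""
    for left, right in zip(ordered_children, ordered_children[1:]):
        lefts = candidates.get(left, [])
        rights = candidates.get(right, [])
        if not lefts or not rights:
            return False
        # A suitable pair exists iff the earliest-ending left candidate
        # ends before the latest-starting right candidate starts.
        if min(e.end for e in lefts) >= max(e.start for e in rights):
            return False
    return True


# Hypothetical region codes for an ordered query node with children b, c.
in_order = {"b": [Region(2, 3, 2)], "c": [Region(4, 5, 2)]}
out_of_order = {"b": [Region(4, 5, 2)], "c": [Region(2, 3, 2)]}

print(satisfies_order(["b", "c"], in_order))       # b ends before c starts
print(satisfies_order(["b", "c"], out_of_order))   # b comes after c
```

The check follows the pairwise wording of the definition: each adjacent pair of query children is tested on its own.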

Ordered Children Extension (OCE), condition 3
For each child (or descendant) n’ of n, if there is a node m that is the immediate right sibling of n’, there are elements en’ and em such that en’ is a child (or descendant) of element en, en’.end < em.start, and both en’ and em have an OCE.
[Figure: ordered XML query with node n (marked “>”) and children n’ and m, next to an XML document fragment with element en and the elements en’ and em below it]

Ordered Children Extension (OCE)
In an ordered XML query:
If node n is an ordered node: to find its OCE, all three previous conditions must be checked.
If node n is an unordered node: to find its OCE, only the first two conditions need to be checked. The last condition does not apply.

Ordered Children Extension: Example 1
[Figure: query with ordered root a (marked “>”) and children b, c, d; document with root a1, the elements e1, c1, e2 on the second level and b1, d1 on the third level]
a1 has an OCE:
1) a1 has descendants b1 and d1, and child c1 (fulfills conditions 1 and 2 of the OCE definition).
2) b1 has a right sibling element c1, and c1 has a right sibling element d1 (fulfills condition 3 of the OCE definition).

Ordered Children Extension: Example 2
[Figure: the same query (ordered root a with children b, c, d); document with root a1, the elements e1 and c1 on the second level and b1, d1 on the third level]
a1 does not have any OCE:
1) a1 has descendants b1 and d1, and child c1 (fulfills conditions 1 and 2 of the OCE definition).
2) b1 has a right sibling node c1 (fulfills condition 3 of the OCE definition).
3) However, c1 only has d1 as a descendant. There is no element with the label d that is a right sibling of element c1 (does not satisfy condition 3 of the OCE definition).

Data structure
Each node n in the twig query has: Stream, List, and Stack.
Data stream: Tn
We partition an XML document into streams.
All elements in a stream have the same tag and are ordered by their start position.
The elements in each stream are read only once, from head to tail.
Ta: a1, a2, a3
Tb: b1, b2
Tc: c1, c2
Td: d1, d2, d3
[Figure: query with ordered root a (marked “>”) and children b, c, d, next to a four-level document containing the elements a1, a2, a3, b1, b2, c1, c2, d1, d2, d3]
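
The stream partitioning described above can be sketched as follows. The names `Element`, `build_streams`, and `Cursor` are hypothetical, and the start positions are invented for the illustration because the slide does not give region codes for this document.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Element(NamedTuple):
    name: str
    tag: str
    start: int


def build_streams(elements: List[Element]) -> Dict[str, List[Element]]:
    """Group elements by tag and sort each group by start position."""
    streams: Dict[str, List[Element]] = defaultdict(list)
    for e in elements:
        streams[e.tag].append(e)
    for tag in streams:
        streams[tag].sort(key=lambda e: e.start)
    return dict(streams)


class Cursor:
    """Forward-only reader: each stream is read once, from head to tail."""

    def __init__(self, stream: List[Element]) -> None:
        self._stream = stream
        self._pos = 0

    def current(self) -> Optional[Element]:
        return self._stream[self._pos] if self._pos < len(self._stream) else None

    def advance(self) -> None:
        self._pos += 1


# Invented start positions; only the relative order matters here.
doc = [
    Element("b2", "b", 8), Element("a1", "a", 1), Element("a3", "a", 7),
    Element("b1", "b", 5), Element("a2", "a", 2),
]
streams = build_streams(doc)
cursor = Cursor(streams["a"])
print([e.name for e in streams["a"]], [e.name for e in streams["b"]])
```

A cursor has no way to move backwards, which mirrors the read-once property of the streams.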

Data structure
Each node n in the twig query has: Stream, List, and Stack.
List: Ln
The elements in the lists help to check P-C relationships.
The elements in each list Ln are strictly nested from the first to the last, i.e. in the XML document each element is an ancestor or parent of the following element.
La: a1, a2…
Lb: b1 ..
Lc: c1
Ld: d1, d3

Data structure
Each node n in the twig query has: Stream, List, and Stack.
Stack: Sn
Stacks are used to store elements that have at least one OCE.
Elements in the stacks are potential solutions of the XML query.
When we insert a new element into a stack, the top element of the stack is popped if it does not have an A-D relationship with the new element.
[Figure: stacks Sa, Sb, Sc, Sd attached to the query nodes a, b, c, d]
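
The stack insertion rule can be sketched with region codes. This is a minimal illustration with hypothetical names and invented region values. Repeating the pop until the top is an ancestor of the new element is an assumption of this sketch (the slide only mentions the top element); it keeps the stack strictly nested.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    start: int
    end: int
    level: int


def is_ancestor(e1: Region, e2: Region) -> bool:
    return e1.start < e2.start and e1.end > e2.end


def push(stack: List[Region], new: Region) -> None:
    """Pop entries that are not ancestors of `new`, then push `new`.
    Assumption: the pop is repeated until an ancestor is on top."""
    while stack and not is_ancestor(stack[-1], new):
        stack.pop()
    stack.append(new)


# Invented regions: (2,5) and (6,9) are both nested in (1,10).
stack: List[Region] = []
push(stack, Region(1, 10, 1))
push(stack, Region(2, 5, 2))
push(stack, Region(6, 9, 2))   # (2,5,2) is not an ancestor, so it is popped
print(stack)
```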

A holistic matching algorithm: OrderedTJ
We propose a general algorithm, OrderedTJ, that computes answers to an ordered twig query.
Our key focus is to check the ordered nodes in the query and find elements that have at least one OCE.

Main function OrderedTJ
The main function operates in two phases.
Phase 1: some solutions to individual root-leaf query paths are output. The ordering requirements of the ordered query are checked.
Phase 2: these path solutions are merge-joined to compute the answers to the whole query.
[Figure: pseudocode of the main function, with Phase 1, Phase 2 and an important function marked]

getNext(n)
It gets the next stream to be processed and advanced.
[Figure: pseudocode of getNext(n), with the steps “Check Order” and “Check P-C” marked]

An example of the OrderedTJ algorithm
Query: Book is an ordered node (marked “>”) with Chapter and Section below it; Chapter has Title below it, and Title has the text “Related work” below it.
Document: book b1 with chapters c1, c2, c3, titles t1, t2, t3 (the title texts are “Introduction”, “Related work” and “Algorithm”), and sections s1, s2, s3.
Streams:
Book: b1
Chapter: c1, c2, c3
Title: t1, t2, t3
“Related work”: “related work”
Section: s1, s2, s3
Next action: partition the XML document into streams.
Next action: show the lists for nodes with a P-C child.
Next action: show the stacks of every node in the query.
Next action: advance(Title). t1 has no descendant “related work”.
Next action: insert t2 into the list of Title. t2 has the descendant “related work”.
Next action: advance(Chapter). c1 has no descendant title that has the child “related work”.
Next action: insert c2 into the list of Chapter. c2 has a descendant t2 that has the child “related work”.
Next action: advance(Section). s1 is not a following element of c2.
Next action: advance(Section). s2 is not a following element of c2.
Next action: push b1 onto the stack of Book. b1 has an OCE.
Next action: push c2 onto the stack of Chapter. c2 has an OCE.
Next action: push t2 onto the stack of Title. t2 has an OCE.
Next action: push “r…” onto the stack of “Related work”. “r…” is a leaf node.
Output: b1, c2, t2, “r…”. A path is found.
Next action: push s3 onto its stack. s3 is a leaf node and follows element c2.
Output: b1, s3. A path is found.
Previous output: (b1, c2, t2, “r…”) and (b1, s3).
Next action: join the output paths. A match is found.
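
The final join of the two output paths can be sketched as a join on the query nodes the paths share. This is a simple nested-loop illustration, not the merge-join used by the authors; `merge_paths` and the dictionary encoding of a path solution are hypothetical. It assumes, as in the walkthrough, that the order requirement was already checked before the paths were output.

```python
from typing import Dict, List

Path = Dict[str, str]  # query node -> matched element


def merge_paths(left: List[Path], right: List[Path]) -> List[Path]:
    """Combine every pair of path solutions that agree on all query
    nodes they have in common."""
    merged = []
    for p in left:
        for q in right:
            shared = p.keys() & q.keys()
            if all(p[k] == q[k] for k in shared):
                merged.append({**p, **q})
    return merged


# The two path solutions from the walkthrough; "r…" is the slide's
# abbreviation for the text element.
chapter_paths = [{"Book": "b1", "Chapter": "c2", "Title": "t2", "Related work": "r…"}]
section_paths = [{"Book": "b1", "Section": "s3"}]

matches = merge_paths(chapter_paths, section_paths)
print(matches)
```

Both paths share only the Book node, and both bind it to b1, so they combine into a single match.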

Optimality of OrderedTJ
TwigStack does not consider P-C relationships; therefore, it produces more intermediate results than TwigStackList.
Therefore, we compare the optimality of our OrderedTJ with that of TwigStackList.
Example: we match the ordered Query 1 against XML Document 1 using the two algorithms, TwigStackList and OrderedTJ.
[Figure: Document 1 with root a1, the elements c1 and a2 on the second level and b1 on the third level; Query 1 with ordered root a (marked “>”) and children b and c]

Optimality of OrderedTJ
TwigStackList can only solve an ordered XML query with the naïve method.
Therefore, it converts Query 1 to Query 2 by removing the order mark from the twig pattern.
[Figure: Document 1; Query 1 (ordered root a with children b and c) and Query 2 (the same twig without the order mark)]

Optimality of OrderedTJ
Sub-optimality of TwigStackList: when there is a P-C relationship at the branching node, there could be redundant intermediate output.
In this example: the elements in the streams are read only once, from head to tail. Therefore, when TwigStackList processes the elements a1, c1, and b1, there is no way to decide whether there is an element b2 that is a child of a1.
Therefore, the algorithm outputs the useless solution <a1,c1>.
[Figure: Document 1 with a further element b2, and Query 2, run with TwigStackList]

Optimality of OrderedTJ
Optimality of OrderedTJ: it allows a parent-child relationship on the first branching edge of an ordered node.
In this example: when OrderedTJ processes the elements a1, c1, and b1, there is no element with tag name b before c1, so condition 3 in the definition of OCE is not satisfied. c1 does not contribute to any final answer.
Therefore, the algorithm does not output the useless solution <a1,c1>.
[Figure: Document 1 and Query 1, run with OrderedTJ]

Optimality of OrderedTJ
TwigStack: optimal for A-D only queries.
TwigStackList: optimal for queries that have only A-D edges for branching nodes. The other edges in the query can be P-C edges.
OrderedTJ: it allows a parent-child relationship on the first branching edge of an ordered node.
[Figure: nested query classes: A-D only (TwigStack optimality), A-D for branching nodes (TwigStackList optimality), P-C for the first branch of an ordered node (OrderedTJ optimality), each illustrated with small example twigs over the nodes A, B, C, D, E]

Experiments
Algorithms for comparison:
straightforward-TwigStack (STW for short)
straightforward-TwigStackList (STWL)
our proposed OrderedTJ
Benchmarks:
XMark: synthetic data; size 115 MB, factor 1.0
TreeBank: real data from the Wall Street Journal; size 82 MB, 2.5 million nodes

Experiments
Testing queries: Q1, Q2, Q3 for XMark; Q4, Q5, Q6 for TreeBank.
Evaluation metrics:
Number of intermediate path solutions
Total running time

Experiments: Execution Time
OrderedTJ outputs fewer intermediate results.
Therefore, it has a shorter execution time.

Experiments: Intermediate result
Table 1. The number of intermediate path solutions
Query  Dataset   STW    STWL   OrderedTJ  Useful solutions
Q1     XMark     71956  71956  44382      44382
Q2     XMark     65940  65940  10679      10679
Q3     XMark     71522  71522  23959      23959
Q4     TreeBank  2237   1502   381        302
Q5     TreeBank  92705  92705  83635      79941
Q6     TreeBank  10663  11     5          5
For all queries, OrderedTJ has the smallest intermediate results.
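
The redundancy of each algorithm can be read off Table 1 by subtracting the useful solutions from the intermediate path solutions. The short script below does that arithmetic on the table's numbers; the names `TABLE_1`, `redundant`, and `optimal_queries` are hypothetical helpers for this illustration.

```python
# Table 1: query -> (STW, STWL, OrderedTJ, useful solutions)
TABLE_1 = {
    "Q1": (71956, 71956, 44382, 44382),
    "Q2": (65940, 65940, 10679, 10679),
    "Q3": (71522, 71522, 23959, 23959),
    "Q4": (2237, 1502, 381, 302),
    "Q5": (92705, 92705, 83635, 79941),
    "Q6": (10663, 11, 5, 5),
}

ALGORITHMS = ("STW", "STWL", "OrderedTJ")


def redundant(query: str) -> dict:
    """Intermediate path solutions that do not contribute to any answer."""
    *counts, useful = TABLE_1[query]
    return {alg: n - useful for alg, n in zip(ALGORITHMS, counts)}


def optimal_queries(algorithm: str) -> list:
    """Queries for which the algorithm outputs no redundant path solution."""
    return [q for q in TABLE_1 if redundant(q)[algorithm] == 0]


for q in TABLE_1:
    print(q, redundant(q))
print("OrderedTJ has no redundant output for:", optimal_queries("OrderedTJ"))
```

On these numbers OrderedTJ outputs exactly the useful solutions for Q1, Q2, Q3 and Q6, and keeps a small surplus for Q4 and Q5.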

Experiments: Intermediate result
Query 1 (see Table 1) has only A-D edges; therefore, STW and STWL output the same intermediate results.
However, OrderedTJ has fewer intermediate results, since it also considers the ordering relationship.
[Figure: Query 1, an ordered twig (marked “>”) with the node labels test, bold and keyword]

Experiments: Intermediate result
Query 4 (see Table 1) has P-C edges for non-branching nodes. Therefore, STWL outputs fewer intermediate results than STW.
OrderedTJ outputs even fewer intermediate results, since it also considers the ordering relationship.
OrderedTJ still has redundant intermediate results compared with the final useful results. This is because there are P-C edges on the second branch of the ordered node PP.
[Figure: Query 4, an ordered twig (marked “>”) with the node labels S, VP, VBN, PP, IN and NP]

Experiments: Intermediate result
For Query 6 (see Table 1), STWL outputs fewer intermediate results than STW, since there is a P-C edge in the query.
OrderedTJ outputs no redundant intermediate results compared with the final useful results. This is because the query only has a P-C edge on the first branch of the ordered node PP.
OrderedTJ is optimal in this case.
[Figure: Query 6, an ordered twig (marked “>”) with the node labels S, DT and PRP_DOLLAR_]

Conclusions
We developed a new algorithm, OrderedTJ, to solve the problem of ordered twig pattern matching.
OrderedTJ identifies a larger query class for which I/O optimality is guaranteed.
Experimental results showed the effectiveness, scalability, and efficiency of our algorithm.
Future work: use a more efficient indexing method, e.g. a B-tree or an R-tree, to skip XML elements.

References
[1] M.P. Consens and T. Milo. Optimizing queries on files. In Proceedings of ACM SIGMOD, 1994. (Node label: region encoding.)
[2] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In SIGMOD Conference, pages 310-321, 2002. (Proposes the TwigStack algorithm.)
[3] J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In CIKM, pages 533-542, 2004. (Proposes the TwigStackList algorithm.)
[4] Y. Chen, S. B. Davidson, and Y. Zheng. BLAS: An efficient XPath processing system. In Proc. of SIGMOD, pages 47-58, 2004. (Proposes a new algorithm for XPath queries.)
[5] J. Lu, T. W. Ling, C. Y. Chan, and T. Chen. From Region Encoding to Extended Dewey: On Efficient Processing of XML Twig Pattern Matching. In VLDB, 2005. (Proposes a new twig pattern matching algorithm based on a prefix labeling scheme.)
END
Thank you!
Q & A