Efficient Processing of Ordered XML Twig Pattern

by Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, Wei Ni. Presented by: Tian Yu, 23 Aug 2005


Page 1: Efficient Processing of Ordered XML Twig Pattern

Efficient Processing of Ordered XML Twig Pattern

by Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, Wei Ni

Presented by: Tian Yu

23 Aug 2005

Page 2: Efficient Processing of Ordered XML Twig Pattern

Outline

Introduction and motivation

Background
    XML tree and twig pattern matching
    Previous two algorithms: TwigStack and TwigStackList

Our Ordered Twig Algorithms
    Ordered Children Extension (OCE for short)
    A generalized holistic matching algorithm: OrderedTJ

Experiments

Conclusion


Page 4: Efficient Processing of Ordered XML Twig Pattern

Introduction

XML is rapidly gaining popularity as a data representation.

XML documents are modeled as ordered trees.

XML queries specify patterns of selection predicates on multiple elements that have some structural relationship (parent-child or ancestor-descendant).

Page 5: Efficient Processing of Ordered XML Twig Pattern

What is a Twig Pattern?

A twig pattern is a small tree whose nodes are tags, attributes or text values, and whose edges are either Parent-Child (P-C) edges or Ancestor-Descendant (A-D) edges.

E.g. Query description: select Figure elements that are descendants of Paragraph elements, which in turn are children of Section elements that have a child element Title.

Twig pattern:

Section
    Title       (P-C edge from Section)
    Paragraph   (P-C edge from Section)
        Figure  (A-D edge from Paragraph)

Page 6: Efficient Processing of Ordered XML Twig Pattern

Motivation

Since XML documents are modeled as ordered trees, it is natural to have ordered queries.

Four ordered axes: following-sibling, preceding-sibling, following, preceding.

Example:

ordered query:
//book/title/following-sibling::chapter

unordered query:
//book/title/chapter

Page 7: Efficient Processing of Ordered XML Twig Pattern

Order axis

Four axes: following-sibling, preceding-sibling, following, and preceding.

In the sample document, set the context node to be f.

[Figure: sample XML document, a tree with nodes a, b, c, d, e, f, g, h, i and j]

Context node: f
Following of f: i and j
Preceding of f: b, c and e
Following-sibling of f: i
Preceding-sibling of f: e

Following-sibling of f = the following nodes of f that share the same parent as f
Preceding-sibling of f = the preceding nodes of f that share the same parent as f
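The four axes can be sketched in a few lines of Python. This is only an illustration: the tree shape a(b(c), d(e, f(g, h), i), j) and the (start, end) positions below are assumptions, chosen as one layout that is consistent with the answers listed on this slide, and the `Node` type and function names are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Node(NamedTuple):
    name: str
    start: int             # position where the element opens (assumed values)
    end: int               # position where the element closes (assumed values)
    parent: Optional[str]  # name of the parent node, None for the root


# Assumed tree: a(b(c), d(e, f(g, h), i), j)
NODES = [
    Node("a", 1, 20, None),
    Node("b", 2, 5, "a"),
    Node("c", 3, 4, "b"),
    Node("d", 6, 17, "a"),
    Node("e", 7, 8, "d"),
    Node("f", 9, 14, "d"),
    Node("g", 10, 11, "f"),
    Node("h", 12, 13, "f"),
    Node("i", 15, 16, "d"),
    Node("j", 18, 19, "a"),
]


def following(ctx: Node, nodes: List[Node]) -> List[str]:
    # Nodes that open after the context node has closed.
    return [n.name for n in nodes if n.start > ctx.end]


def preceding(ctx: Node, nodes: List[Node]) -> List[str]:
    # Nodes that close before the context node opens (ancestors are excluded).
    return [n.name for n in nodes if n.end < ctx.start]


def following_sibling(ctx: Node, nodes: List[Node]) -> List[str]:
    # Following nodes that share the context node's parent.
    return [n.name for n in nodes if n.start > ctx.end and n.parent == ctx.parent]


def preceding_sibling(ctx: Node, nodes: List[Node]) -> List[str]:
    # Preceding nodes that share the context node's parent.
    return [n.name for n in nodes if n.end < ctx.start and n.parent == ctx.parent]


f = next(n for n in NODES if n.name == "f")
print(following(f, NODES), preceding(f, NODES))
print(following_sibling(f, NODES), preceding_sibling(f, NODES))
```

Under these assumed positions the four functions return the same node sets as the slide lists for context node f.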

Page 8: Efficient Processing of Ordered XML Twig Pattern

Ordered Twig Pattern

//chapter[title="related work"]/following::section

Intuitive meaning: search for all sections that appear after (but are not descendants of) chapter elements with the title “related work” in the XML document.

The query node Book is ordered.


Page 10: Efficient Processing of Ordered XML Twig Pattern

Ordered Twig Pattern

//chapter[title="related work"]/following::section

If the twig pattern is unordered:

section1, section2, and section3 are all matching elements.

Page 11: Efficient Processing of Ordered XML Twig Pattern

Ordered Twig Pattern

//chapter[title="related work"]/following::section

For the ordered query, however, section1 and section2 are not in the solution. How does our method determine that?

Page 12: Efficient Processing of Ordered XML Twig Pattern

Motivation

Naïve method:
    Use an existing algorithm to output the intermediate path solutions for each individual root-leaf query path.
    Merge the path solutions so that the final solutions are guaranteed to satisfy the order predicates of the query.

Disadvantage of the naïve method:
    Many intermediate results may not contribute to final answers.

Our solution: efficient processing of ordered XML twig patterns.


Page 14: Efficient Processing of Ordered XML Twig Pattern

XML Twig Pattern Matching

An XML document is commonly modeled as a rooted, ordered and tagged tree.

[Figure: example XML tree rooted at book, with preface, chapter, title, section and paragraph elements and text values such as “Intro”, “XML” and “Data”]

Page 15: Efficient Processing of Ordered XML Twig Pattern

Region Coding

Node label [1]: (startPos, endPos, LevelNum)

E.g.

[Figure: XML tree labeled with region codes. book (1,21,1); preface (2,4,2) with text “Intro” (3,3,3); chapters (5,12,2) and (13,20,2); the title and section children of the chapters at (6,8,3), (9,11,3), (14,16,3) and (17,19,3); their text values at (7,7,4), (10,10,4), (15,15,4) and (18,18,4)]

[1] M.P. Consens and T. Milo. Optimizing queries on files. In Proceedings of ACM SIGMOD, 1994.

Page 16: Efficient Processing of Ordered XML Twig Pattern

Region Coding

Given e1, e2: e1 is an ancestor of e2 iff e1.start < e2.start and e1.end > e2.end.

[Figure: the region-coded XML tree with an ancestor e1 and a descendant e2 marked]

Page 17: Efficient Processing of Ordered XML Twig Pattern

Region Coding

Given e1, e2: e1 is the parent of e2 iff e1.start < e2.start and e1.end > e2.end, and e1.level + 1 = e2.level.

[Figure: the region-coded XML tree with a parent e1 and a child e2 marked]
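The two region-coding tests translate directly into code. The sketch below is a minimal illustration, not the authors' implementation: the `Region` type and function names are hypothetical, and the sample labels are the ones read from the region-coding figure (book, preface, “Intro” and the first chapter).

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int
    end: int
    level: int


def is_ancestor(e1: Region, e2: Region) -> bool:
    # e1 strictly encloses e2.
    return e1.start < e2.start and e1.end > e2.end


def is_parent(e1: Region, e2: Region) -> bool:
    # Enclosure plus exactly one level of difference.
    return is_ancestor(e1, e2) and e1.level + 1 == e2.level


# Labels as read from the figure on the Region Coding slides.
book = Region(1, 21, 1)
preface = Region(2, 4, 2)
intro = Region(3, 3, 3)
chapter = Region(5, 12, 2)

print(is_ancestor(book, intro))       # book encloses "Intro"
print(is_parent(book, intro))         # but is two levels above it
print(is_parent(preface, intro))      # preface is the direct parent
print(is_ancestor(preface, chapter))  # siblings do not enclose each other
```

Both tests need only the two labels, so a structural relationship can be decided without traversing the document tree.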


Page 19: Efficient Processing of Ordered XML Twig Pattern

Previous work: TwigStack

TwigStack [2]: a holistic approach

Two-phase algorithm:
    Phase 1 (TwigJoin): intermediate root-leaf path solutions are output.
    Phase 2 (Merge): the intermediate paths are merged to get the final results.

[2] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.

Page 20: Efficient Processing of Ordered XML Twig Pattern

Sub-optimality of TwigStack

TwigStack is optimal when the query contains only ancestor-descendant relationships.

If the query contains any parent-child relationship, TwigStack may output intermediate path solutions that cannot contribute to final results.

We say that TwigStack is sub-optimal for queries with parent-child relationships.

Page 21: Efficient Processing of Ordered XML Twig Pattern

TwigStackList

The main problem of TwigStack is that it treats all edges as ancestor-descendant relationships in the first phase, so it is not efficient for queries with parent-child relationships.

Improved method: TwigStackList [3] [CIKM 2004]
    Each query node has an additional list structure that caches elements likely to participate in final solutions.
    TwigStackList improves on TwigStack because it considers parent-child relationships in the first phase.
    TwigStackList is optimal when there is no P-C edge below a branching node (a branching node is a node with more than one child or descendant).

[3] J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In CIKM, pages 533-542, 2004.

Page 22: Efficient Processing of Ordered XML Twig Pattern

TwigStackList vs. TwigStack

TwigStack outputs the “useless” path solution <s1, t1>, since it does not check the parent-child relationship.

TwigStackList has no useless output: <s1, t1> is not in the output.

[Figure: twig pattern with nodes section, title, paragraph and figure (no parent-child relationship for the branching node), and an XML tree with elements Root, s1, s2, t1, t2, t3, p1, p2, p3, f1 and f2]


Page 24: Efficient Processing of Ordered XML Twig Pattern

Ordered Children Extension (OCE)

Definition: An element en (of type n) has an OCE if:

1) In the query Q, for every A-D child n’ of n (if any), there is an element en’ (with tag n’) that is a descendant of en, and en’ also has an OCE; and

2) In the query Q, for every P-C child n’ of n (if any), there is an element e’ (with tag n) on the path from en to en’ such that e’ is the parent of en’, and en’ also has an OCE; and

3) For each child (or descendant) n’ of n, if there is a node m that is the immediate right sibling of n’, there are elements en’ and em such that en’ is a child (or descendant) of en, en’.end < em.start, and both en’ and em have an OCE.

The first two conditions are guaranteed by TwigStackList. Our main focus is the third condition.

Page 25: Efficient Processing of Ordered XML Twig Pattern

Ordered Children Extension (OCE)

Definition, condition 3):

For each child (or descendant) n’ of n, if there is a node m that is the immediate right sibling of n’, there are elements en’ and em such that en’ is a child (or descendant) of en, en’.end < em.start, and both en’ and em have an OCE.

[Figure: ordered XML query with node n (marked “>”) and children n’ and m, next to an XML document with element en and its descendants en’ and em]

Page 26: Efficient Processing of Ordered XML Twig Pattern

Ordered Children Extension (OCE)

In an ordered XML query:

If node n is an ordered node:
    To find its OCE, all three conditions above must be checked.

If node n is an unordered node:
    To find its OCE, only the first two conditions need to be checked. The last condition does not apply.

Page 27: Efficient Processing of Ordered XML Twig Pattern

Ordered Children Extension: Example 1

[Figure: query with ordered node a (marked “>”) and children b, c and d; document tree rooted at a1 containing the elements e1, c1, e2, b1 and d1]

a1 has an OCE:
1) a1 has descendants b1 and d1, and child c1 (conditions 1 and 2 of the OCE definition hold).
2) b1 has a right sibling element c1, and c1 has a right sibling element d1 (condition 3 of the OCE definition holds).

Page 30: Efficient Processing of Ordered XML Twig Pattern

Ordered Children Extension: Example 2

[Figure: query with ordered node a (marked “>”) and children b, c and d; document tree rooted at a1 containing the elements e1, c1, b1 and d1]

a1 doesn’t have any OCE:
1) a1 has descendants b1 and d1, and child c1 (conditions 1 and 2 of the OCE definition hold).
2) b1 has a right sibling element c1 (condition 3 holds for b and c).
3) However, d1 is only a descendant of c1. There is no element with the label d that is a right sibling of element c1 (condition 3 of the OCE definition fails).
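The ordering test behind the two examples can be sketched as follows. This is a simplified illustration of the `en’.end < em.start` part of condition 3 only: it does not check containment in en or that the chosen elements themselves have an OCE. The (start, end) positions are hypothetical, picked so that d1 comes after c1 in Example 1 and lies inside c1 in Example 2.

```python
from typing import List, NamedTuple


class Elem(NamedTuple):
    name: str
    start: int
    end: int


def order_condition_holds(candidates: List[List[Elem]]) -> bool:
    """candidates[i] holds the elements matching the i-th child of an
    ordered query node, in left-to-right query order. For every pair of
    adjacent query children, some left element must end before some
    right element starts."""
    for left, right in zip(candidates, candidates[1:]):
        if not any(l.end < r.start for l in left for r in right):
            return False
    return True


# Example 1 (assumed positions): b1, then c1, then d1.
example1 = [[Elem("b1", 3, 4)], [Elem("c1", 6, 7)], [Elem("d1", 9, 10)]]

# Example 2 (assumed positions): d1 is nested inside c1.
example2 = [[Elem("b1", 3, 4)], [Elem("c1", 6, 9)], [Elem("d1", 7, 8)]]

print(order_condition_holds(example1))
print(order_condition_holds(example2))
```

In Example 2 the check fails on the pair (c, d): c1 ends at position 9, after d1 starts at 7, which mirrors the reason given on the slide.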


Page 34: Efficient Processing of Ordered XML Twig Pattern

Data structure

Each node n in the twig query has: Stream, List, and Stack

Data Stream: Tn
    We partition an XML document into streams.
    All elements in a stream have the same tag and are ordered by their start position.
    The elements in each stream are read only once, from head to tail.

[Figure: ordered query (marked “>”) over the nodes a, b, c and d, next to a four-level document tree rooted at a1. The streams are:]

Ta: a1, a2, a3
Tb: b1, b2
Tc: c1, c2
Td: d1, d2, d3
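Partitioning a document into per-tag streams can be sketched as below. The element positions are made up for illustration and do not come from the slide's figure; `Elem` and `build_streams` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Elem(NamedTuple):
    tag: str
    name: str
    start: int
    end: int


def build_streams(elements: List[Elem]) -> Dict[str, List[Elem]]:
    # One stream per tag, each sorted by start position.
    streams: Dict[str, List[Elem]] = defaultdict(list)
    for e in elements:
        streams[e.tag].append(e)
    for tag in streams:
        streams[tag].sort(key=lambda e: e.start)
    return dict(streams)


# Elements listed in arbitrary order; positions are assumptions.
DOC = [
    Elem("b", "b1", 5, 6),
    Elem("a", "a2", 2, 9),
    Elem("a", "a1", 1, 30),
    Elem("d", "d1", 3, 4),
    Elem("a", "a3", 10, 15),
    Elem("b", "b2", 16, 17),
]

streams = build_streams(DOC)
for tag, stream in streams.items():
    print(tag, [e.name for e in stream])
```

Because each stream is sorted by start position, a single forward cursor per stream is enough to read every element once from head to tail.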

Page 35: Efficient Processing of Ordered XML Twig Pattern

Data structure

Each node n in the twig query has: Stream, List, and Stack

List: Ln
    The elements in the lists help to check for P-C relationships.
    Elements in each list Ln are strictly nested from first to last, i.e. in the XML document each element is an ancestor or the parent of the element that follows it.

[Figure: the query over nodes a, b, c and d with example lists]

La: a1, a2…
Lb: b1 ..
Lc: c1
Ld: d1, d3

Page 36: Efficient Processing of Ordered XML Twig Pattern

Data structure

Each node n in the twig query has: Stream, List, and Stack

Stack: Sn
    Stacks are used to store elements that have at least one OCE.
    Elements in the stacks are potential solutions of the XML query.
    When we insert a new element into a stack, the top element of the stack is popped if it does not have an A-D relationship with the new element.

[Figure: the query over nodes a, b, c and d with stacks Sa, Sb, Sc and Sd]
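The stack rule can be sketched as a small helper. One assumption is made beyond the slide: the pop is repeated until the top of the stack is an ancestor of the new element (or the stack is empty), which is the usual clean-up step in stack-based holistic joins. The names and positions are hypothetical.

```python
from typing import List, NamedTuple


class Elem(NamedTuple):
    name: str
    start: int
    end: int


def push_clean(stack: List[Elem], new: Elem) -> None:
    # Pop every top element that is not an ancestor of the new element,
    # then push the new element (repeated popping is an assumption).
    while stack and not (stack[-1].start < new.start and stack[-1].end > new.end):
        stack.pop()
    stack.append(new)


stack: List[Elem] = []
push_clean(stack, Elem("a1", 1, 20))
push_clean(stack, Elem("a2", 2, 9))    # nested in a1, so both stay
push_clean(stack, Elem("a3", 10, 15))  # not inside a2, so a2 is popped
print([e.name for e in stack])
```

After these three pushes the stack holds a1 and a3, so its elements remain nested from bottom to top.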

Page 37: Efficient Processing of Ordered XML Twig Pattern

A holistic matching algorithm: OrderedTJ

We propose a general algorithm, OrderedTJ, that computes answers to an ordered query twig.

Our key focus is to check the ordered nodes in the query and find elements that have at least one OCE.


Page 39: Efficient Processing of Ordered XML Twig Pattern

Main function OrderedTJ

The main function operates in two phases.

Phase 1: Solutions to the individual root-leaf paths of the query are output. The ordering requirements of the ordered query are checked.

Phase 2: These solutions are merge-joined to compute the answers to the whole query.

[Figure: pseudocode of the main function, with Phase 1, Phase 2 and an “Important function” callout marked]

Page 40: Efficient Processing of Ordered XML Twig Pattern

getNext(n)

It gets the next stream to be processed and advanced.

[Figure: pseudocode of getNext(n), with the steps “Check Order” and “Check P-C” marked]

Page 41: Efficient Processing of Ordered XML Twig Pattern

An example of OrderedTJ algorithm

Query: Book is the ordered node (marked “>”), with children Chapter and Section; Chapter has Title, and Title has the text “Related work”.

Streams:
Book: b1
Chapter: c1, c2, c3
Title: t1, t2, t3
Section: s1, s2, s3
“Related work”: “related work”

[Figure: document tree rooted at b1 with chapters c1, c2 and c3, titles t1, t2 and t3, sections s1, s2 and s3, and the title texts “Introduction”, “Related work” and “Algorithm”]

Steps, one per slide (pages 41 to 59):

Page 41. Partition the XML document into streams.
Page 42. Show lists for nodes with a P-C child.
Page 43. Show stacks of every node in the query.
Page 44. advance(Title): t1 has no descendant “related work”.
Page 45. Insert t2 into the list of Title: t2 has descendant “related work”.
Page 46. advance(Chapter): c1 has no descendant title that has child “related work”.
Page 47. Insert c2 into the list of Chapter: c2 has a descendant t2 that has child “related work”.
Page 48. advance(Section): s1 is not a following element of c2.
Page 49. advance(Section): s2 is not a following element of c2.
Page 50. Push b1 into the stack of Book: b1 has an OCE.
Page 51. Push c2 into the stack of Chapter: c2 has an OCE.
Page 52. Push t2 into the stack of Title: t2 has an OCE.
Page 53. Push “r…” into the stack of “Related work”: “rel..” is the leaf node.
Page 54. Output: b1, c2, t2, “r…”. A path is found.
Page 55. Push s3 into the stack: s3 is a leaf node and follows element c2.
Page 56. Output: b1, s3. A path is found.
Page 57. Previous output: b1, c2, t2, “r…” and b1, s3.
Page 58. Join the output paths.
Page 59. A match is found.
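The final join of the two path solutions from the walk-through (b1, c2, t2, “r…” and b1, s3) can be sketched as follows. This is a plain nested-loop join on the shared query nodes, used here for brevity instead of a true merge join; the dictionary representation, the key `Text` and the function names are assumptions.

```python
from itertools import product
from typing import Dict, List

Path = Dict[str, str]  # query node -> matched element


def compatible(p: Path, q: Path) -> bool:
    # Two path solutions agree if they bind every shared query node
    # to the same element.
    return all(p[k] == q[k] for k in p.keys() & q.keys())


def join_paths(left: List[Path], right: List[Path]) -> List[Path]:
    return [{**p, **q} for p, q in product(left, right) if compatible(p, q)]


# Path solutions output in Phase 1 of the walk-through.
chapter_paths = [{"Book": "b1", "Chapter": "c2", "Title": "t2", "Text": "r…"}]
section_paths = [{"Book": "b1", "Section": "s3"}]

matches = join_paths(chapter_paths, section_paths)
print(matches)
```

The two paths share only the Book node, both bind it to b1, and so they combine into a single match for the whole twig. The ordering requirement needs no re-check at this point because it was already enforced in Phase 1.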

Page 60: Efficient Processing of Ordered XML Twig Pattern

Optimality of OrderedTJ

TwigStack doesn’t consider P-C relationships, so it produces more intermediate results than TwigStackList. We therefore compare the optimality of OrderedTJ with that of TwigStackList.

Example: we match ordered Query 1 against XML Document 1 using the two algorithms, TwigStackList and OrderedTJ.

[Figure: Document 1, a tree with elements a1, a2, b1 and c1; Query 1, with ordered node a (marked “>”) and children b and c]

Page 61: Efficient Processing of Ordered XML Twig Pattern

Optimality of OrderedTJ

TwigStackList can only solve an ordered XML query with the naïve method. It therefore converts Query 1 to Query 2 by removing the ordered sign from the twig pattern.

[Figure: Document 1 (elements a1, a2, b1, c1); Query 1, with ordered node a (marked “>”) and children b and c; Query 2, the same twig without the ordered sign]

Page 62: Efficient Processing of Ordered XML Twig Pattern

Optimality of OrderedTJ

Sub-optimality of TwigStackList: when there is a P-C relationship at the branching node, there could be redundant intermediate output.

In this example: the elements in the streams are read only once, from head to tail. Therefore, when TwigStackList processes elements a1, c1, and b1, there is no way to decide whether there is an element b2 that is a child of a1.

Therefore, the algorithm outputs the useless solution <a1, c1>.

[Figure: TwigStackList on Query 2 (node a with children b and c) and the document with elements a1, a2, b1 and c1; a possible element b2 is marked]

Page 63: Efficient Processing of Ordered XML Twig Pattern

Optimality of OrderedTJ

Optimality of OrderedTJ: it allows a parent-child relationship on the first branching edge of an ordered node.

In this example: when OrderedTJ processes elements a1, c1, and b1, there is no element with tag name b before c1, so condition 3 in the definition of OCE is not satisfied. c1 does not contribute to any final answer.

Therefore, the algorithm doesn’t output the useless solution <a1, c1>.

[Figure: OrderedTJ on Query 1 (ordered node a, marked “>”, with children b and c) and the document with elements a1, a2, b1 and c1]

Page 64: Efficient Processing of Ordered XML Twig Pattern

Optimality of OrderedTJ

TwigStack: optimal for A-D only queries.

[Figure: TwigStack optimality. Query A with children B and C, A-D edges only]

Page 65: Efficient Processing of Ordered XML Twig Pattern

Optimality of OrderedTJ

TwigStackList: optimal for queries that have only A-D edges below branching nodes. The other edges in the query can be P-C edges.

[Figure: TwigStackList optimality. Query classes: A-D only, and A-D for branching node (query A with children B and C)]

Page 66: Efficient Processing of Ordered XML Twig Pattern

Optimality of OrderedTJ

OrderedTJ: it allows a parent-child relationship on the first branching edge of the ordered nodes.

[Figure: OrderedTJ optimality. Query classes: A-D only, A-D for branching node, and P-C for the first branch of an ordered node (example queries over nodes A, B, C, D and E)]


Page 68: Efficient Processing of Ordered XML Twig Pattern

Experiments

Algorithms for comparison:
    straightforward-TwigStack (STW for short)
    straightforward-TwigStackList (STWL)
    our proposed OrderedTJ

Benchmarks:
    XMark: synthetic data. Size: 115 MB, factor: 1.0
    TreeBank: real data from the Wall Street Journal. Size: 82 MB, nodes: 2.5 million

Page 69: Efficient Processing of Ordered XML Twig Pattern

Experiments

Testing queries:
    Q1, Q2, Q3 for XMark; Q4, Q5, Q6 for TreeBank

Evaluation metrics:
    Number of intermediate path solutions
    Total running time

Page 70: Efficient Processing of Ordered XML Twig Pattern

Experiments: Execution Time

OrderedTJ outputs fewer intermediate results.

Therefore, it has a shorter execution time.

Page 71: Efficient Processing of Ordered XML Twig Pattern

Experiments: Intermediate result

OrderedTJ has the smallest intermediate results.

Query  Dataset   STW    STWL   OrderedTJ  Useful solutions
Q1     XMark     71956  71956  44382      44382
Q2     XMark     65940  65940  10679      10679
Q3     XMark     71522  71522  23959      23959
Q4     TreeBank  2237   1502   381        302
Q5     TreeBank  92705  92705  83635      79941
Q6     TreeBank  10663  11     5          5

Table 1. The number of intermediate path solutions
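To make the comparison concrete, the number of redundant (useless) path solutions per algorithm is simply the intermediate count minus the useful count. A small sketch of that arithmetic on the Table 1 figures, with hypothetical variable and function names:

```python
from typing import Dict, Tuple

# Query -> (STW, STWL, OrderedTJ, useful solutions), copied from Table 1.
TABLE1: Dict[str, Tuple[int, int, int, int]] = {
    "Q1": (71956, 71956, 44382, 44382),
    "Q2": (65940, 65940, 10679, 10679),
    "Q3": (71522, 71522, 23959, 23959),
    "Q4": (2237, 1502, 381, 302),
    "Q5": (92705, 92705, 83635, 79941),
    "Q6": (10663, 11, 5, 5),
}


def redundant(query: str) -> Tuple[int, int, int]:
    # Redundant paths for (STW, STWL, OrderedTJ): output minus useful.
    stw, stwl, otj, useful = TABLE1[query]
    return (stw - useful, stwl - useful, otj - useful)


for q in TABLE1:
    print(q, redundant(q))
```

For Q4, for instance, this gives 1935, 1200 and 79 redundant paths for STW, STWL and OrderedTJ respectively, and for Q1 to Q3 and Q6 OrderedTJ has none.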

Page 72: Efficient Processing of Ordered XML Twig Pattern


For all queries, OrderedTJ has the smallest intermediate results.

Page 73: Efficient Processing of Ordered XML Twig Pattern

Experiments: Intermediate result

Query 1 (row Q1 of Table 1):

[Figure: Query 1, an ordered twig (marked “>”) over the nodes test, bold and keyword]

The query has only A-D edges, so STW and STWL output the same intermediate results.

However, OrderedTJ has fewer intermediate results, since it also considers the ordering relationship.

Page 74: Efficient Processing of Ordered XML Twig Pattern

Experiments: Intermediate result

Query 4 (row Q4 of Table 1):

[Figure: Query 4, a twig over the nodes S, VP, PP, IN, NP and VBN, with PP as the ordered node (marked “>”)]

The query has P-C edges for non-branching nodes, so STWL outputs fewer intermediate results than STW.

OrderedTJ outputs even fewer intermediate results, since it also considers the ordering relationship.

OrderedTJ still has redundant intermediate results compared with the final useful results, because there are P-C edges on the second branch of the ordered node PP.

Page 75: Efficient Processing of Ordered XML Twig Pattern

Experiments: Intermediate result

Query 6 (row Q6 of Table 1):

[Figure: Query 6, an ordered twig (marked “>”) over the nodes S, DT and PRP_DOLLAR_]

STWL outputs fewer intermediate results than STW, since there is a P-C edge in the query.

OrderedTJ outputs no redundant intermediate results compared with the final useful results, because the query only has a P-C edge on the first branch of the ordered node PP.

OrderedTJ is optimal in this case.


Page 77: Efficient Processing of Ordered XML Twig Pattern

Conclusions

We developed a new algorithm, OrderedTJ, to solve the problem of ordered twig pattern matching.

OrderedTJ identifies a larger query class for which I/O optimality is guaranteed.

Experimental results showed the effectiveness, scalability, and efficiency of our algorithm.

Future work: implement a more efficient indexing method, e.g. a B-tree or R-tree, to skip XML elements.

Page 78: Efficient Processing of Ordered XML Twig Pattern

Reference (1)

[1] M.P. Consens and T. Milo. Optimizing queries on files. In Proceedings of ACM SIGMOD, 1994.
    Node label: region encoding.

[2] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In SIGMOD Conference, pages 310-321, 2002.
    Proposes the TwigStack algorithm.

[3] J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In CIKM, pages 533-542, 2004.
    Proposes the TwigStackList algorithm.

Page 79: Efficient Processing of Ordered XML Twig Pattern

Reference (2)

[4] Y. Chen, S. B. Davidson, and Y. Zheng. BLAS: An efficient XPath processing system. In Proc. of SIGMOD, pages 47-58, 2004.
    Proposes a new algorithm for XPath queries.

[5] J. Lu, T. W. Ling, C.Y. Chan, and T. Chen. From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching. In VLDB, 2005.
    Proposes a new twig pattern matching algorithm based on a proposed prefix labeling scheme.

Page 80: Efficient Processing of Ordered XML Twig Pattern


END

Thank you!

Q & A