Pattern tree algebras: sets or sequences?

Stelios Paparizos, H. V. Jagadish

University of Michigan, Ann Arbor, MI, USA

Outline

- XML and XQuery Order and Duplicates
  - Document Order
  - OrderBy Clause
  - Binding Order
  - Duplicates and XQuery
- Hybrid Collections
- Correct Output Order
- Thinking Efficiently
- Experimental Evaluation
- Final Words

Document Order Usage

- Provides capability to re-establish the original document information

Example: Return authors of book with title = “Grilling…”

FOR $b IN document(t)//book
WHERE $b/title = "Grilling for amateurs"
RETURN $b/author

Result:

<author> Mario </author>
<author> Stelios </author>
<author> Alton </author>

Document Order

- Implicit, derived from the XML data model
- The order in which data is represented in a document is important information
- Requires original XML order representation within a single document
- Requires an order amongst documents during a single execution of a query
- Enforced on every XPath expression and every sequence operation, e.g. Union
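As a rough illustration of the last point, the sketch below numbers the nodes of a small tree in pre-order and uses those numbers to return a union in document order without duplicate nodes. The `Node` class and the `number_document` and `doc_order_union` helpers are hypothetical names chosen for this sketch; they are not part of XQuery or of the system described in these slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Node:
    """Hypothetical XML node; `pos` is filled in by number_document()."""
    name: str
    children: List["Node"] = field(default_factory=list)
    pos: int = -1


def number_document(root: Node) -> None:
    """Assign pre-order positions, which coincide with document order."""
    counter = 0
    stack = [root]
    while stack:
        node = stack.pop()
        node.pos = counter
        counter += 1
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))


def doc_order_union(left: List[Node], right: List[Node]) -> List[Node]:
    """Union of two node sequences: no duplicate nodes, document order."""
    by_pos = {n.pos: n for n in left + right}
    return [by_pos[p] for p in sorted(by_pos)]


mario, stelios, alton = Node("author"), Node("author"), Node("author")
book = Node("book", [mario, stelios, alton])
number_document(book)

result = doc_order_union([alton, mario], [stelios, mario])
print([n.pos for n in result])
```

Whatever order the two inputs arrive in, the union comes back in the order of the original document, and `mario` appears only once.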

ORDER BY Clause Order

- Explicit specification with ORDER BY clause
- Results sorted using item’s value

Example: Return all books sorted by year of publication

XQuery:
FOR $b IN document(t)//book
ORDER BY $b/year
RETURN $b

SQL:
SELECT book FROM t ORDER BY year

Binding Order Usage

- Provides mechanism to produce results in multiple document orders

Example: Return books and articles with the same author, order the results by document order of (book, article) or of (article, book)

Document order of book, article:

FOR $b IN document(t)//book
FOR $a IN document(t)//article
WHERE $b/author = $a/author
RETURN ($b, $a)

Results:
book1 – article1
book1 – article2
book2 – article1
book2 – article2
book2 – article3

Document order of article, book:

FOR $a IN document(t)//article
FOR $b IN document(t)//book
WHERE $b/author = $a/author
RETURN ($b, $a)

Results:
book1 – article1
book2 – article1
book1 – article2
book2 – article2
book2 – article3

Binding Order

- Implicit, derived from the way the query is typed by the user
- Results are sorted based on the order in which variables are bound
- Uses multiple document orders
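The effect of binding order can be mimicked with two nested loops whose nesting is swapped. The sketch below is only an analogy in Python; the book and article data are made up so that the matches agree with the book/article example, and `shares_author` is a hypothetical helper.

```python
# Hypothetical data: each item is (name, set of authors), listed in document order.
books = [("book1", {"A"}), ("book2", {"A", "B"})]
articles = [("article1", {"A"}), ("article2", {"A"}), ("article3", {"B"})]


def shares_author(x, y):
    """True when the two items have at least one author in common."""
    return bool(x[1] & y[1])


# FOR $b ... FOR $a ...: books vary slowest, so output follows (book, article).
book_first = [(b[0], a[0]) for b in books for a in articles if shares_author(b, a)]

# FOR $a ... FOR $b ...: articles vary slowest, so output follows (article, book).
article_first = [(b[0], a[0]) for a in articles for b in books if shares_author(b, a)]

print(book_first)
print(article_first)
```

Both loops return the same five pairs; only the order differs, which is exactly the information a pure set representation would discard.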

XQuery and Duplicates

- XQuery operates on duplicate-free sequences
  - LET clause creates binding to sequence of matching elements
  - FOR clause creates binding to each element of sequence of matching elements
- Hence, XQuery requires all duplicates to be removed at variable binding


Dilemma: Use Sequences or Sets (or Bags or …)

- Sets lose all ordering information
  - Order can be important in intermediate steps
- Sequences are expensive to manipulate
  - Optimization possibilities can be restricted
- Both sets and sequences are duplicate-free
  - Duplicate elimination can be a costly procedure that should be avoided when possible

Solution: Use Hybrid Collections

A Hybrid Collection can have duplicate semantics that vary between a bag and a set, and order semantics that vary between a set and a sequence.

- Duplicate Specification
- Ordering Specification

Duplicate Specification (D-Spec)

Given a collection of trees CT, D-Spec describes how duplicates were removed from the collection.

Possible parameter values:
- “empty”: Duplicates can be present
- “tree”: Duplicates were removed using deep-tree comparison amongst trees in CT
- List of nodes u: Duplicates were removed using a comparison of the nodes referred to by “u” in each tree in CT
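The three parameter values can be read as three comparison keys. The following is a minimal sketch under the assumption that each tree is a flat dict from node label to node identifier; the `eliminate_duplicates` function and the sample collection are hypothetical and are not the trees drawn on the example slide.

```python
def eliminate_duplicates(trees, d_spec):
    """Return `trees` with duplicates removed as described by `d_spec`.

    d_spec is "empty" (keep everything), "tree" (compare whole trees) or a
    list of node labels (compare only those nodes). Each tree is modelled as
    a dict from node label to node identifier, an assumption of this sketch.
    """
    if d_spec == "empty":
        return list(trees)
    seen = set()
    kept = []
    for tree in trees:
        if d_spec == "tree":
            key = tuple(sorted(tree.items()))
        else:
            key = tuple(tree.get(label) for label in d_spec)
        if key not in seen:
            seen.add(key)
            kept.append(tree)
    return kept


collection = [
    {"B": "B1", "E": "E1", "A": "A1"},
    {"B": "B1", "E": "E2", "A": "A1"},
    {"B": "B1", "E": "E2", "A": "A2"},
    {"B": "B1", "E": "E2", "A": "A2"},
]

for spec in ("empty", "tree", ["B", "E"]):
    print(spec, len(eliminate_duplicates(collection, spec)))
```

A narrower node list makes more trees compare equal, so the collection shrinks as the specification moves from “empty” to “tree” to a list such as {B, E}.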

Duplicate Specification Example

[Figure: the same kind of collection of small trees (root B1 with E and A nodes such as E1, E2, A1, A2) shown under three specifications: (1) D-Spec(empty), (2) D-Spec(tree), (3) D-Spec({B, E}).]

Ordering Item (O-Item)

Minimum unit used when sorting a collection CT

Parameters:
- Reference to the sort-by node
- Ascending (‘asc’) or descending (‘desc’)
- Empty greater (‘g’) or empty least (‘l’) for trees without a matching node

Example: O-Item (B, asc, l)

Ordering Specification (O-Spec)

Given a collection CT, O-Spec describes how the trees are sorted in the collection.

- It accepts as parameter an ordered list of Ordering-Items
- Sorting took place in the order in which the O-Items are specified
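One way to picture an O-Spec is as a multi-key sort in which each O-Item contributes one key. The sketch below assumes trees are dicts from node label to a sortable value, with a missing label standing for a tree without a matching node; `sort_by_o_spec` is a hypothetical helper, not an operator of the algebra.

```python
def sort_by_o_spec(trees, o_spec):
    """Sort trees by a list of O-Items (label, 'asc'|'desc', 'g'|'l').

    Trees are dicts from node label to a sortable value; a missing label
    models a tree without a matching node. Both are assumptions of this sketch.
    """
    result = list(trees)
    # Stable sorts applied from the last O-Item to the first make the first
    # O-Item the primary sort key.
    for label, direction, empty in reversed(o_spec):
        def key(tree, label=label, empty=empty):
            if label in tree:
                return (0, tree[label])
            # 'g': an empty match compares greater than any value; 'l': least.
            return (1,) if empty == "g" else (-1,)
        result.sort(key=key, reverse=(direction == "desc"))
    return result


trees = [
    {"B": 1, "E": 2, "A": 1},
    {"B": 1, "A": 2},
    {"B": 1, "E": 1, "A": 2},
    {"B": 1, "E": 1, "A": 1},
]
full_spec = [("B", "asc", "l"), ("E", "asc", "l"), ("A", "asc", "l")]
for tree in sort_by_o_spec(trees, full_spec):
    print(tree)
```

With the full list the tree without an E node comes first (empty least), then ties on E are broken by A. Dropping O-Items from the end of the list leaves more ties unresolved, which corresponds to a partially ordered collection; an empty list leaves the input order untouched.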

Ordering Specification Example

[Figure: collections of small trees (root B1 with E and A nodes) shown under different ordering specifications:
(1) O-Spec{(B, asc, l), (E, asc, l), (A, asc, l)}: “fully-ordered”
(2.a), (2.b) O-Spec{(B, asc, l), (A, asc, l)}: “partially-ordered”
(3) O-Spec{}: “any order”]


TLC-C Correct Output Algorithm

TLC-C basic principles:
- Duplicate behavior is correct with sets
- Document order is modeled by our node identifiers
  - Pattern tree matches return information in document order
- ORDER BY clause is mapped to a list of ordering items and a sort operation
- Binding order is determined during parsing by tracking how the query was typed
  - A sort operation is used at the end of each single-block FLWOR statement to capture the binding order

Binding Order Example

FOR $b IN document("lib.xml")//book
FOR $a IN $b/author
FOR $e IN $b/editor
FOR $h IN $e/hobby
FOR $i IN $a/interest
RETURN $b

Orderlist: 2, 3, 5, 6, 4

Algebraic plan (TLC):
1. Select on the pattern tree doc_root (LC=1), book (LC=2), author (LC=3), interest (LC=4), editor (LC=5), hobby (LC=6)
2. Project: Keep (2)

Algebraic plan with correct output order (TLC-C):
1. Select on the same pattern tree
2. Project: Keep (2), (3), (4), (5), (6)
3. Sort: ID(2), ID(3), ID(5), ID(6), ID(4)
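The final Sort can be read as a lexicographic sort on node identifiers, taken in the order given by the order list 2, 3, 5, 6, 4. The sketch below is an illustration only: the matches and their identifiers are made up, and `binding_order_sort` is a hypothetical helper rather than the Sort operator itself.

```python
# Matches of the pattern tree: each dict maps an LC label to the identifier
# of the node bound to it. The identifiers are made up for this sketch.
matches = [
    {2: 10, 3: 12, 5: 15, 6: 16, 4: 13},
    {2: 10, 3: 11, 5: 15, 6: 17, 4: 14},
    {2: 10, 3: 11, 5: 15, 6: 16, 4: 14},
    {2: 20, 3: 21, 5: 22, 6: 23, 4: 24},
]

# Order in which the FOR clauses bind $b, $a, $e, $h, $i (from the example).
order_list = [2, 3, 5, 6, 4]


def binding_order_sort(matches, order_list):
    """Sort matches lexicographically by node identifier, following order_list."""
    return sorted(matches, key=lambda m: tuple(m[lc] for lc in order_list))


for m in binding_order_sort(matches, order_list):
    print([m[lc] for lc in order_list])
```

This also shows why the TLC-C plan has to keep (3), (4), (5) and (6) through the Project: the sort needs those identifiers even though only (2) is returned.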

Outline

- XML and XQuery Order and Duplicates
- Hybrid Collections
- Correct Output Order
- Thinking Efficiently
  - Enhancing an algebra with Hybrid Collections
  - Minimizing Duplicate Elimination procedures
  - Selections and Ordering
  - Nested Queries and Ordering
- Experimental Evaluation
- Final Words

Operators with Ordering (example)

Select S[apt, ord](CT): produces the matches of the annotated pattern tree (apt) on the input collection CT

New parameter ord is used for ordering:
- ‘empty’: unspecified order
- ‘maintain’: preserve order of input CT
- ‘list-resort u’: destroy order of CT and resort using input list of node references u
- ‘list-add u’: preserve order of input CT and sort ties using input list of node references u
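The four values of ord can be sketched as four ways of post-processing the matches. This is a simplification with assumed names: `apply_ord` is hypothetical, matches are dicts from node reference to a sortable identifier, and `input_spec` is the list of node references the input is assumed to be sorted by already.

```python
def apply_ord(trees, input_spec, ord_mode, u=()):
    """Order the output of a Select and report the resulting order list.

    trees: matches as dicts from node reference to a sortable identifier.
    input_spec: node references the input is already sorted by.
    Returns (ordered trees, order list describing them).
    """
    def sort_on(refs):
        return sorted(trees, key=lambda t: tuple(t[r] for r in refs))

    if ord_mode == "empty":
        # No order is promised; the operator may emit matches in any order.
        return list(trees), []
    if ord_mode == "maintain":
        return list(trees), list(input_spec)
    if ord_mode == "list-resort":
        return sort_on(list(u)), list(u)
    if ord_mode == "list-add":
        # Stable sort on input_spec + u keeps the input order and only
        # rearranges trees that tie on input_spec.
        spec = list(input_spec) + list(u)
        return sort_on(spec), spec
    raise ValueError(f"unknown ord mode: {ord_mode}")


# Input already sorted by "b"; the two b=1 trees tie.
ct = [{"b": 1, "a": 5}, {"b": 1, "a": 3}, {"b": 2, "a": 1}]
print(apply_ord(ct, ["b"], "list-add", ["a"]))
print(apply_ord(ct, ["b"], "list-resort", ["a"]))
```

Returning the order list alongside the trees mirrors the idea of a collection that carries its own ordering specification, so later operators can tell which sorts are still needed.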

Algebraic Identities (example)

- Select S and Sort O can be merged:
  O[ol](S[any, any](…)) ↔ S[any, ol](…)
- Select S and Sort O can be swapped:
  O[ol](S[any, maintain](…)) ↔ S[any, maintain](O[ol](…))
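The swap identity can be checked on a toy case by treating an order-maintaining Select as a filter and Sort as a stable sort. Both stand-ins are simplifying assumptions: a real Select matches a pattern tree and can produce several matches per input tree, and the data below is made up.

```python
def select_maintain(trees, predicate):
    """Stand-in for S[apt, maintain]: keep matching trees in input order."""
    return [t for t in trees if predicate(t)]


def sort_op(trees, order_list):
    """Stand-in for O[ol]: stable sort on the listed node references."""
    return sorted(trees, key=lambda t: tuple(t[r] for r in order_list))


def is_1999(tree):
    return tree["year"] == 1999


trees = [
    {"book": 3, "year": 1999},
    {"book": 1, "year": 2001},
    {"book": 2, "year": 1999},
    {"book": 1, "year": 1999},
]

sort_after = sort_op(select_maintain(trees, is_1999), ["book"])
sort_before = select_maintain(sort_op(trees, ["book"]), is_1999)
print(sort_after == sort_before)
```

Because both sides agree, an optimizer is free to choose the cheaper placement, for example sorting after a selective Select so that fewer trees are sorted.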

Minimize Duplicate Eliminations

Step 1: Remove redundant duplicate elimination procedures

Step 2: Explore partial duplicate specifications to further minimize duplicate elimination procedures

Minimize DEs Step 1 Example

FOR $o IN document("auction.xml")//open_auction
WHERE count($o/bidder) > 5
RETURN <result> {$o/quantity} {$o/type} </result>

[Figure: two algebraic plans for this query. Both contain a Select on the pattern tree doc_root (LC=1), open_auction (LC=2), bidder (LC=3); Aggregate (count, (3), newLC=4); Filter: (4) > 5; Project: Keep (2); a second Select on (2) that adds quantity (LC=5) and type (LC=6); and Construct <result> (LC=7) from (5) and (6). The first plan contains six Duplicate Elimination: ID(tree) operators. The second plan keeps a single Duplicate Elimination: ID(tree) as operator 5, before the second Select (6) and the Construct (7).]

From 6 DE procedures to 1
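One simple way to think about this step is to track, operator by operator, whether the collection is already known to be duplicate-free and to drop every DE that cannot change anything. The sketch below rests on assumptions the slides do not state: the naive plan is taken to have one DE after each operator, and Project is taken to be the only operator that can introduce duplicates. `remove_redundant_des` is a hypothetical helper, not the rewrite rules of the system.

```python
# Assumption of this sketch: only these operators can turn a duplicate-free
# collection into one that contains duplicates.
MAY_INTRODUCE_DUPLICATES = {"Project"}


def remove_redundant_des(plan, input_is_duplicate_free=True):
    """Drop every DE operator whose input is already duplicate-free.

    plan: operator names in bottom-up execution order.
    """
    duplicate_free = input_is_duplicate_free
    kept = []
    for op in plan:
        if op == "DE":
            if not duplicate_free:
                kept.append(op)
            duplicate_free = True
        else:
            kept.append(op)
            if op in MAY_INTRODUCE_DUPLICATES:
                duplicate_free = False
    return kept


naive_plan = ["Select", "DE", "Aggregate", "DE", "Filter", "DE",
              "Project", "DE", "Select", "DE", "Construct", "DE"]
print(remove_redundant_des(naive_plan))
```

Under these assumptions the only surviving DE sits directly after the Project, which matches the position of the single remaining DE in the example.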

Minimize DEs Step 2 Example

FOR $o IN document("auction.xml")//open_auction
WHERE count($o/bidder) > 5
RETURN <result> {$o/quantity} {$o/type} </result>

[Figure: two algebraic plans for this query. The first is the result of Step 1: Select (1), Aggregate (count, (3), newLC=4), Filter: (4) > 5, Project: Keep (2), Duplicate Elimination: ID(tree) (5), second Select (6), Construct <result> (7). The second plan has neither the Project nor the Duplicate Elimination: Select (1), Aggregate, Filter, second Select (4), Construct <result> (5).]

- The DE procedure is first modified to DE: ID(2).
- Algebraic rewrites then eliminate it completely.

Selections and Ordering

For “selection” type queries, use algebraic rewrites and push the sort down to the select operator.

Selections and Ordering Example

FOR $b IN document("lib.xml")//book
FOR $a IN $b/author
FOR $e IN $b/editor
FOR $h IN $e/hobby
FOR $i IN $a/interest
RETURN $b

Before (TLC-C):
1. Select, ord=empty, on the pattern tree doc_root (LC=1), book (LC=2), author (LC=3), interest (LC=4), editor (LC=5), hobby (LC=6)
2. Project: Keep (2), (3), (4), (5), (6)
3. Sort: ID(2), ID(3), ID(5), ID(6), ID(4)
4. Construct (2)

After:
1. Select, ord=ID(2), ID(3), ID(5), ID(6), ID(4), on the same pattern tree
2. Project: Keep (2)
3. Construct (2)

- Push Sort into Select using algebraic identities.
- The optimizer can plan the Select operator without the forced blocking sort at the end.

Joins and Ordering Example

FOR $a IN document(t)//article
FOR $b IN document(t)//book
WHERE $b/author = $a/author
RETURN ($b, $a)

Algebraic plan with correct output order (TLC-C), as labelled in the figure:
1. Select, ord = empty: doc_root (LC=1), book (LC=2), author (LC=5), year = 1999 (LC=7)
2. Select, ord = empty: doc_root (LC=3), article (LC=4), author (LC=6), conf = VLDB (LC=8)
3. Join (5) = (6), ord = empty: join_root (LC=9) over (2) and (4)
4. Project: Keep (9), (2), (4)
5. Duplicate Elimination: ID(2), ID(4)
6. Sort: ID(2), ID(4)
7. Construct <result> (LC=10) from (2) and (4)

Joins and Ordering Example

Before: the TLC-C plan, with ord = empty on both Selects and on the Join (5) = (6), followed by Sort: ID(2), ID(4).

After: the Sort operator is removed and the Join (5) = (6) carries ord = ID(2), ID(4); both Selects keep ord = empty.

Push Sort into Join using algebraic identities.

Joins and Ordering Example

Before: the Join (5) = (6) carries ord = ID(2), ID(4); both Selects have ord = empty.

After: the Select on book carries ord = ID(2), the Select on article carries ord = ID(4), and the Join (5) = (6) has ord = maintain.

Push Sort further down into Selects using algebraic identities.
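Why a Join with ord = maintain can replace the final Sort is easy to see with an order-preserving join over inputs that are already sorted. The slides do not say which join algorithm is used; the nested-loop join below is just one order-preserving choice, and the data and the `ordered_nested_loop_join` name are hypothetical.

```python
def ordered_nested_loop_join(left, right, matches):
    """Join that keeps the left order, then the right order within each left item.

    If `left` is sorted by its ID and `right` by its ID, the output is sorted
    by (left ID, right ID) without a separate Sort operator.
    """
    return [(lhs, rhs) for lhs in left for rhs in right if matches(lhs, rhs)]


# Hypothetical (node ID, author) pairs, each list already in ID order.
books = [(1, "Mario"), (2, "Stelios"), (5, "Mario")]
articles = [(3, "Stelios"), (4, "Mario"), (6, "Mario")]

joined = ordered_nested_loop_join(books, articles, lambda b, a: b[1] == a[1])
ids = [(b[0], a[0]) for b, a in joined]
print(ids)
print(ids == sorted(ids))
```

A join that does not preserve input order would need the explicit Sort back, which is the trade-off the ord parameter exposes to the optimizer.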

Nested Queries and Ordering

FOR $b IN document("lib.xml")/book
LET $k := FOR $a IN document("lib.xml")/article
          WHERE $b/author = $a/author AND $a/conf = "VLDB"
          RETURN $a
WHERE $b/year = 1999
RETURN <result> {$b} {$k} </result>

Algebraic plan with correct output order (TLC-C):

[Figure: operator tree with operators numbered up to 12. Select (1) matches doc_root (LC=1), book (LC=2), author (LC=5), year = 1999 (LC=7); Select (2) matches doc_root (LC=3), article (LC=4), author (LC=6), conf = VLDB (LC=8). The plan also contains Project: Keep (4), (6); DE: ID(4), ID(6); Sort: ID(4); a Construct over (4) and (6); Join (5) = (6) with join_root (LC=9) and ord = maintain(left, right); Project: Keep (9), (2), (4); Duplicate Elimination: ID(2), ID(4); Sort: ID(2); and Construct <result> (LC=10).]

Nested Queries and Reorder

[Figure: the TLC-C plan for the nested query next to a rewritten plan. In the rewritten plan, Sort: ID(4) is replaced by Reorder: (9), (4), ID(4) as operator 8, and the Join's ord = maintain(left, right) becomes ord = empty. Sort: ID(2) is still present as operator 11.]

Rewrite Sort and blocking Join to Reorder operation.


Experimental Setup

- Timber System
  - 128MB buffer pool
  - Value index when necessary (not for all queries)
- Intel Pentium III-M 866 MHz
  - Windows 2000 Professional
  - IDE hard drive
  - 512MB RAM
- XMark dataset factor 1
  - 707MB total space (472MB data + 241MB index)

Minimizing Duplicate Eliminations

[Chart: TLC-C vs. TLC-D on queries x17, x19 and q2; vertical axis from 0 to 16.]

- x17: more selective
- x19: less selective
- q2: value join

Selections and Ordering

[Chart: TLC-C vs. TLC-O on queries x13, x17 and x19; vertical axis from 0 to 6.]

- x13: simple output
- x17: more selective
- x19: less selective

Join and Ordering

[Chart: TLC-C vs. TLC-O on queries q1, q2 and x3; vertical axis from 0 to 16.]

- q1: less selective
- q2: more selective
- x3: less selective

Nested Queries and Ordering

[Chart: TLC-C vs. TLC-O on queries x8, x9 and x11; vertical axis from 0 to 180.]

Ordering and Duplicate Optimizations

[Chart: TLC-C, TLC-D, TLC-O and TLC-OD on queries x19, q2 and x8; vertical axis from 0% to 100%.]

- x19: selection
- q2: value join
- x8: nested query


Related Work

- Relational systems recognize smart sort placement as a problem
  - D. Simmen, E. Shekita, and T. Malkemus. Fundamental techniques for order optimization. In Proc. SIGMOD Conf., 1996.
- The XML navigational-based approach has a study of ordering requirements in:
  - J. Hidders and P. Michiels. Avoiding unnecessary ordering operations in XPath. In Proc. DBPL Conf., 2003.
- XML algebraic-based approaches use sets or sequences. Aside from the performance limitations, it is unknown whether they fully address the XQuery binding order to produce correct results.

Final Words

- Ordering in XQuery is a complex procedure with significant performance ramifications
- Introduced Hybrid Collections with Ordering Specification as a means to a correct and flexible solution
  - Similar path for Duplicates
- Showed algebraic optimizations that take advantage of the provided flexibility
- Demonstrated experimentally the performance increase