1

Efficient XML Stream Processing with Automata and Query Algebra

A Master Thesis Presentation

Student: Jinhui Jian

Advisor: Prof. Elke A. Rundensteiner

Reader: Prof. Kathi Fisler

Post on 20-Dec-2015


2

The Need for XML Stream Processing

[Diagram labels: XML, Relational, HTML, news; Internet; XML data streams; XML Stream Processing Engine]

New paradigms: distributed data provider, distributed data consumer

New applications: monitoring (e.g., sensor network), information filtering (e.g., news, email)

New challenges: arbitrarily nested structure, incomplete knowledge

3

Two Existing Approaches

Automata-based [xfilter01, yfilter02, x-scan01,…]

Algebraic [tukwila01, rainbow02, …]

This thesis integrates both existing approaches into one system

4

A Running Example

Give me book titles whose price is greater than 50:

<result>
  FOR $b in doc (bib.xml) //book
  WHERE $b/price > 50
  RETURN <expensive> $b/title </expensive>
</result>

Input (bib.xml):

<bib>
  <book year="1994">
    <title>TCP/IP Illustrated</title>
    <author><last>Stevens</last><first>W.</first></author>
    <publisher>Addison-Wesley</publisher>
    <price>65.95</price>
  </book>
  <book year="2000">
    <title>Data on the Web</title>
    <author><last>Abiteboul</last><first>Serge</first></author>
    <author><last>Buneman</last><first>Peter</first></author>
    <author><last>Suciu</last><first>Dan</first></author>
    <publisher>Morgan Kaufmann Publishers</publisher>
    <price>39.95</price>
  </book>
  <book year="1992">
    <title>Advanced Programming in the Unix environment</title>
    <author><last>Stevens</last><first>W.</first></author>
    <publisher>Addison-Wesley</publisher>
    <price>65.95</price>
  </book>
</bib>

Output:

<result>
  <expensive> <title>TCP/IP Illustrated</title> </expensive>
  <expensive> <title>Advanced Programming in the Unix environment</title> </expensive>
</result>

5

XML as a Stream of Tokens

Input XML stream (along a timeline): <bib> <book> <title> TCP/IP Illustrated </title> <author> <last> Stevens </last> … </book> …

[Tree view: bib contains three book elements; a book contains title, author (with last and first), publisher, and price; the leaves are text]

A token can be: an open tag, a close tag, or PCDATA
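The token view can be sketched with Python's standard SAX parser. This is only an illustration of the three token kinds, not the thesis implementation; the names `TokenCollector` and `tokenize` and the `(kind, value)` tuple layout are assumptions made for this sketch.

```python
import xml.sax


class TokenCollector(xml.sax.ContentHandler):
    """Collects (kind, value) tokens: open tag, close tag, or PCDATA."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._text = []  # PCDATA pieces seen since the last tag

    def _flush_text(self):
        # A SAX parser may deliver one text node in several chunks,
        # so merge them and skip whitespace-only runs between tags.
        text = "".join(self._text).strip()
        self._text = []
        if text:
            self.tokens.append(("pcdata", text))

    def startElement(self, name, attrs):
        self._flush_text()
        self.tokens.append(("open", name))

    def endElement(self, name):
        self._flush_text()
        self.tokens.append(("close", name))

    def characters(self, content):
        self._text.append(content)


def tokenize(xml_text):
    """Turn an XML string into the list of tokens a stream engine would see."""
    handler = TokenCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tokens
```

Calling `tokenize` on a fragment of the running example yields open, PCDATA, and close tokens in document order, which is the only input the automaton in the following slides needs.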

6

Basic State-Transition Model

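Q := //book/price

Automaton for Q: state 0 --ε--> state 1 (self-loop on *), state 1 --book--> state 2, state 2 --price--> state 3

Input: <bib> <book> <title> TCP/IP Illustrated </title> <price> 65.95 </price> …

Input token | Active states | Stack
(start) | 0 | (empty)
<bib> | 1 | [0]
<book> | 1,2 | [0] [1]
<title> | 1 | [0] [1] [1,2]
TCP/IP Illustrated | 1 | [0] [1] [1,2]
</title> | 1,2 | [0] [1]
<price> | 1,3 | [0] [1] [1,2]
… | … | …

Running query:

FOR $b in doc (bib.xml) //book
WHERE $b/price > 50
RETURN $b/title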
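A minimal sketch of the stack-based state-transition model for Q := //book/price follows. It assumes tokens arrive as `(kind, value)` pairs with kinds `open`, `close`, and `pcdata`; that representation and the names `step` and `run` are choices made for this sketch, not names from the thesis. On an open tag the current active set is pushed and the transitions are followed; on a close tag the previous set is popped.

```python
# NFA for Q := //book/price, with the state numbering used on the slide.
EPSILON = {0: {1}}                        # 0 --epsilon--> 1
TRANSITIONS = {1: {"*": 1, "book": 2},    # 1 loops on any tag; 1 --book--> 2
               2: {"price": 3}}           # 2 --price--> 3
FINAL = 3


def closure(states):
    """One-step epsilon closure (enough here: there are no epsilon chains)."""
    result = set(states)
    for s in states:
        result |= EPSILON.get(s, set())
    return result


def step(states, tag):
    """States reachable from `states` on an open tag named `tag`."""
    nxt = set()
    for s in closure(states):
        edges = TRANSITIONS.get(s, {})
        for label in (tag, "*"):
            if label in edges:
                nxt.add(edges[label])
    return nxt


def run(tokens):
    """Return (trace, matches); trace holds (active states, stack) per token."""
    active, stack, trace, matches = {0}, [], [], 0
    for kind, value in tokens:
        if kind == "open":
            stack.append(active)          # remember where we were
            active = step(active, value)
            if FINAL in active:
                matches += 1
        elif kind == "close":
            active = stack.pop()          # backtrack to the enclosing element
        trace.append((sorted(active), [sorted(s) for s in stack]))
    return trace, matches
```

The stack is what lets the automaton backtrack: after `</title>` the active set returns to the one that was current inside `<book>`, so a following `<price>` can still reach the final state.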

7

Extended with Data Buffer and Buffer Operations

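FOR $b in doc (bib.xml) //book
WHERE $b/price > 50
RETURN $b/title

Properties: data-driven, token at a time, fixed order

Automaton: state 0 --ε--> state 1 (self-loop on *), state 1 --book--> state 2, state 2 --title--> state 3, state 2 --price--> state 4 (the diagram marks further * self-loops)

Actions at the title state (3): 1. write buffer; 2. output if flag is set

Actions at the price state (4): 1. eval pred and set/clear flag; 2. output if buffer not empty

Shared storage: buffer, flag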
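The buffer and flag actions can be sketched as a SAX handler for the running query. This is an illustrative reading of the slide, not the thesis code: the class name `BufferFlagHandler`, the helper `evaluate`, and the choice to reset the buffer and flag at every `<book>` open tag are assumptions of this sketch.

```python
import xml.sax


class BufferFlagHandler(xml.sax.ContentHandler):
    """Token-at-a-time evaluation of //book[price > 50]/title."""

    def __init__(self):
        super().__init__()
        self.path = []      # names of the currently open elements
        self.text = []      # PCDATA chunks of the innermost element
        self.buffer = []    # titles waiting for the predicate result
        self.flag = False   # True once price > 50 was seen in this book
        self.output = []

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []
        if name == "book":  # assumption: start every book with clean state
            self.buffer, self.flag = [], False

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        self.text = []
        self.path.pop()
        in_book = bool(self.path) and self.path[-1] == "book"
        if in_book and name == "title":
            self.buffer.append(value)        # 1. write buffer
            if self.flag:                    # 2. output if flag is set
                self._emit()
        elif in_book and name == "price":
            self.flag = float(value) > 50    # 1. eval pred, set/clear flag
            if self.flag and self.buffer:    # 2. output if buffer not empty
                self._emit()

    def _emit(self):
        self.output.extend(self.buffer)
        self.buffer = []


def evaluate(xml_text):
    handler = BufferFlagHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.output
```

The two-sided check (output at the title if the flag is already set, output at the price if the buffer is already filled) is what makes the result independent of whether title or price arrives first, at the cost of hard-wiring the evaluation order into the automaton.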

8

Algebraic Query Plan

FOR $b in doc (bib.xml) //book
WHERE $b/price > 50
RETURN $b/title

Properties: set at a time, postponed operation

Query plan (bottom to top): Extract //book → Navigate //book, price → Select price > 50 → Navigate //book, title → Tagger
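The algebraic plan for the running query can be sketched as a chain of set-at-a-time functions over tuples. The sketch uses `xml.etree.ElementTree` on a fully parsed document purely for brevity, so it shows the operator semantics rather than stream behavior; the function names and the dictionary-per-tuple layout are assumptions, not the thesis's data structures.

```python
import xml.etree.ElementTree as ET


def extract(root, tag, var):
    """Extract //tag: one tuple per matching element, bound to `var`."""
    return [{var: el} for el in root.iter(tag)]


def navigate(tuples, var, child, out):
    """Navigate from the element bound to `var` to its `child` elements."""
    return [{**t, out: c} for t in tuples for c in t[var].findall(child)]


def select(tuples, var, predicate):
    """Keep the tuples whose `var` text satisfies the predicate."""
    return [t for t in tuples if predicate(t[var].text)]


def tagger(tuples, var, wrapper):
    """Wrap each `var` element in a <wrapper> element under <result>."""
    result = ET.Element("result")
    for t in tuples:
        ET.SubElement(result, wrapper).append(t[var])
    return result


def run_plan(xml_text):
    root = ET.fromstring(xml_text)
    books = extract(root, "book", "$b")
    priced = navigate(books, "$b", "price", "$p")
    expensive = select(priced, "$p", lambda text: float(text) > 50)
    titled = navigate(expensive, "$b", "title", "$t")  # postponed navigation
    return tagger(titled, "$t", "expensive")
```

Because the title navigation is an ordinary operator on tuples, it can run after the selection, so titles are only looked up for books that pass the price predicate.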

9

Exploit the Flexibility of Postponed Operations

FOR $b in doc (bib.xml) //book
WHERE $b/price > 50 and $b/author/last = "Stevens"
RETURN $b/title

Query plan (bottom to top): Extract //book → Navigate //book, price → Select price > 50 → Navigate //book, author/last → Select last = "Stevens" → Navigate //book, title → Tagger

10

Query Optimization in Algebraic Systems

Logical optimization: selection pushdown, projection pushdown, join order selection

Physical optimization: operator algorithms

Runtime optimization: scheduling, resource allocation

11

Thesis Overview

Motivation

The automata model is good for on-the-fly pattern matching/retrieval

The algebraic model is good for optimizing complex queries

Major challenges

How to integrate the two models?

How to optimize a query within the integrated query model?

12

The Raindrop Approach

Integration

Optimization

13

Path Bindings in XQuery

FLWR expression: FOR … LET … WHERE … RETURN …

FOR and LET: path bindings. WHERE and RETURN: filtering and restructuring.

The running query:

FOR $b in doc (bib.xml) //book
WHERE $b/price > 50
RETURN $b/title

The same query with explicit path bindings:

FOR $b in doc (bib.xml) //book
LET $p := $b/price, $t := $b/title
WHERE $p > 50
RETURN $t

“The purpose of path bindings is to produce a tuple stream in which each tuple consists of one or more bound variables” [W3C]

14

A Two-Tier System Architecture

XML data stream → Automata plan → tuple stream → Master plan → query answer

15

Modeling the Master Plan: Algebraic

Master plan (bottom to top): Navigate //book, price → Select price > 50 → Navigate //book, author/last → Select last = … → Navigate //book, title → Tagger

16

Modeling the Automata Plan: Black Box vs. White Box

Black box: a single AutomataPlan node answering Q1 := //book, Q2 := //book/price, Q3 := //book/title

White box: SJoin //book over Extract //book/price and Extract //book/title

17

How to optimize it?

XML data stream → Automata plan → tuple stream → Master plan → query answer

18

Optimization: A Unified Process in the Logical View

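[Diagram: the automaton (state 0 --ε--> state 1 with a * self-loop, state 1 --book--> state 2) feeds Extract //book in the AutomataPlan. The MasterPlan applies Navigate //book, price → Select price > 50 → Navigate //book, title. In the logical view the same operators are written with full paths: Navigate //book, //book/price → Select //book/price > 50 → Navigate //book, //book/title.]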

19

The Algebra Core

Op | Symbol subscript | Semantics
Selection | pred | Filter tuples based on the predicate pred
Projection | v | Filter columns in the input tuples based on the variable list v
Join | pred | Join input tuples based on the predicate pred
Aggregate | f | Aggregate over input tuples with the aggregate function f, e.g., sum and average
Tagger | pt (on T) | Format outputs based on the pattern pt, i.e., reconstruct XML tags
Navigate | p1, p2 | Take input elements of path p1 and output descendant elements of path p2
Extract | p | Identify elements of path p from the input stream
Structural Join | p | Join input tuples on their structural relationship, e.g., the common parent relationship p

20

The Extract Operator

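Operator: Extract //book/title

Automaton: state 0 --ε--> state 1 (self-loop on *), state 1 --book--> state 2, followed by a title transition

Input: <bib> <book> <title> TCP/IP Illustrated </title> … </book> …

Output:

<title> TCP/IP Illustrated </title>

<title> Data on the Web </title>

<title>Advanced Programming in the Unix environment</title>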

21

The Structural Join Operator

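FOR $b in doc (bib.xml) //book
LET $p := $b/price, $t := $b/title
WHERE $p > 50
RETURN $t

Automaton: state 0 --ε--> state 1 (self-loop on *), state 1 --book--> state 2, state 2 --title--> state 3, state 2 --price--> state 4

Operators: Extract //book/title and Extract //book/price feed SJoin //book

Input: <bib> <book> <title> TCP/IP Illustrated </title> … </book> … <book> … </book>

Output tuples (one per book):

<title>…</title> <price>…</price>

<title>…</title> <price>…</price>

<title>…</title> <price>…</price>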
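A structural join on //book can be sketched over a token stream as follows. The sketch assumes `(kind, value)` tokens with kinds `open`, `close`, and `pcdata`, assumes books are not nested, and joins when the `</book>` token arrives; the names `sjoin_book` and `expensive_titles` are made up for this illustration and the join trigger is an assumption rather than a documented detail of the thesis.

```python
def sjoin_book(tokens):
    """Join the titles and prices extracted inside the same <book>."""
    path, titles, prices, tuples = [], [], [], []
    for kind, value in tokens:
        if kind == "open":
            path.append(value)
        elif kind == "pcdata":
            if path[-2:] == ["book", "title"]:    # Extract //book/title
                titles.append(value)
            elif path[-2:] == ["book", "price"]:  # Extract //book/price
                prices.append(value)
        elif kind == "close":
            path.pop()
            if value == "book":                   # SJoin //book fires here
                tuples += [(t, p) for t in titles for p in prices]
                titles, prices = [], []
    return tuples


def expensive_titles(tokens):
    """Select price > 50 on the joined tuples and return the titles."""
    return [t for t, p in sjoin_book(tokens) if float(p) > 50]
```

Only the extracted title and price values are kept between tokens, so in this sketch the rest of each book never has to be buffered.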

22

The Navigate Operator

Operator: Navigate //book, title

Input tuples (one per book):

<book>… … </book>

<book>… … </book>

<book>… … </book>

where each tuple holds a whole element such as:

<book year="1994"> <title>TCP/IP Illustrated</title> <author> <last> Stevens </last> <first> W. </first> </author> <publisher> Addison-Wesley </publisher> <price> 65.95 </price> </book>

Output tuples:

<book>… … </book> <title>…</title>

<book>… … </book> <title>…</title>

<book>… … </book> <title>…</title>

A navigate operation can be postponed, independent of the input stream

23

A Special Optimization: In or Out?

XML data stream → Automata plan → tuple stream → Master plan → query answer

24

Two Options: Bottom-up vs. Top-down

Bottom-up: <title>…</title> elements and <price>…</price> elements are extracted separately and then joined into tuples of the form <title>…</title> <price>…</price>

Top-down: starting from whole <book>… … </book> elements (such as the <book year="1994"> element of the running example), each tuple is extended first to <book>… … </book> <title>…</title> and then to <book>… … </book> <title>…</title> <price>…</price>

25

Exploiting the Options for Optimization

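The pull-out plan

Automaton: state 0 --ε--> state 1 (self-loop on *), state 1 --book--> state 2

Operators (bottom to top): Extract //book → Navigate //book, price → Select price > 50 → Navigate //book, title → Tagger

The push-in plan

Automaton: the same three states, plus state 2 --title--> state 3 and state 2 --price--> state 4

Operators (bottom to top): Extract //book/title and Extract //book/price → SJoin //book → Select //book/price > 50 → Tagger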

26

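Query Optimization by Rewriting Rules

Navigate pushin: [rule over paths p1 and p2; formula not recoverable from the transcript]

Redundant SJoin: [rule over path p1; formula not recoverable from the transcript]

Redundant Extract: [rule over path p1; formula not recoverable from the transcript]

Selection Pushdown: [formula not recoverable from the transcript]

Etc.

Algebraic transformation: [worked example of successive plan rewrites over paths built from a, b, and c; expressions not recoverable from the transcript]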

27

Runtime Optimization: Why?

Optimization relies on cost estimation, which in turn relies on statistics

Statistics may be unknown

Statistics may change

Example plan (bottom to top): Extract //book → Navigate //book, price → Select price > 50 → Navigate //book, title → Tagger

28

Runtime Optimization Steps

Stat Collection

Decision Making

Plan Migration
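The three steps can be sketched as a loop over windows of observed predicate outcomes. Everything concrete here is an assumption made for illustration: the function name `optimization_cycle`, the window size, and above all the cost models, which are placeholders supplied by the caller and do not reflect measured costs from the thesis.

```python
def optimization_cycle(outcomes, cost_models, current_plan, window=100):
    """Return the plan in force after each full window of outcomes.

    outcomes     -- booleans, True when a tuple passed the selection
    cost_models  -- dict mapping plan name to a function of selectivity
    current_plan -- name of the plan the executor starts with
    """
    history = []
    for start in range(0, len(outcomes) - window + 1, window):
        batch = outcomes[start:start + window]

        # Stat collection: observed selectivity of the predicate.
        selectivity = sum(batch) / window

        # Decision making: pick the plan with the lowest estimated cost.
        best = min(cost_models, key=lambda plan: cost_models[plan](selectivity))

        # Plan migration: switch plans only when the decision changed.
        if best != current_plan:
            current_plan = best
        history.append(current_plan)
    return history


# Hypothetical cost models, for demonstration only.
HYPOTHETICAL_COSTS = {
    "pushin": lambda selectivity: 1.0,
    "pullout": lambda selectivity: 4.0 * selectivity,
}
```

With these placeholder models, a stream whose selectivity rises over time would start on one plan and migrate to the other; which plan is actually cheaper at a given selectivity is an empirical question that the experiments later in the deck address.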

29

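Why Do We Need Migration?

When to interrupt the executor: in the master plan, in the automata plan

[Diagram: the optimization cycle, with the phases normal execution, prepare for migration, decision making, and plan modification; the legend distinguishes executor and optimizer activity, and part of the cycle is labeled as the migration process]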

30

Modifying the Automata: A Bad Example

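Plan before modification

Automaton: state 0 --ε--> state 1 (self-loop on *), state 1 --book--> state 2

Operators (bottom to top): Extract //book → Navigate //book, //book/price → Select //book/price > 50 → Navigate //book, //book/title

Plan after modification

Automaton: the same three states, plus state 2 --title--> state 3 and state 2 --price--> state 4

Operators (bottom to top): Extract //book/title and Extract //book/price → SJoin //book → Select //book/price > 50

Input: <bib> <book> <title> TCP/IP Illustrated </title> <price> 36.65 </price> … </book> … … <book>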

31

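Modifying the Automata: A Safe Approach

FOR $b in doc (bib.xml) //book
LET $p := $b/price, $t := $b/title
WHERE $p > 50
RETURN $t

Input: <bib> <note>…</note> <book> … </book> <book>…</book> <note> …</note> …

The diagram marks a safe point and an unsafe point along this stream.

Automaton before modification: state 0 --ε--> state 1 (self-loop on *), state 1 --book--> state 2

Automaton after modification: the same three states, plus state 2 --title--> state 3 and state 2 --price--> state 4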
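One plausible reading of the safe and unsafe points is that the automaton may only be modified while no `<book>` element is open, so that no partially processed binding is lost. The sketch below encodes that reading; it is an assumption of this illustration, as are the `(kind, value)` token layout and the name `safe_points`.

```python
def safe_points(tokens, binding_tag="book"):
    """Return the indexes of tokens after which modification is assumed safe.

    A position counts as safe when no element named `binding_tag` is open,
    i.e. the automaton holds no partial match for the FOR binding.
    """
    depth, safe = 0, []
    for index, (kind, value) in enumerate(tokens):
        if kind == "open" and value == binding_tag:
            depth += 1
        elif kind == "close" and value == binding_tag:
            depth -= 1
        if depth == 0:
            safe.append(index)
    return safe
```

Under this reading, an optimizer that has decided to migrate would let the executor run until the next safe index, swap the automaton, and resume, which corresponds to a "prepare for migration" phase before the plan is modified.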

32

Experimental Study

Is it feasible to integrate the automata model and the algebraic model?

Is push-in vs. pull-out a feasible optimization?

Is runtime optimization worthwhile?

33

Experimental Setup

Java 1.4

Pentium III 750 MHz, 384 MB

Windows XP Professional

Third-party components: Xerces SAX parser, the Kweelt XQuery parser, Rainbow core

34

Exp1: System Throughput

[Chart: throughput (MB/s, axis from 0 to 1) versus input size (KB: 20, 40, 80, 160, 320, 640, 1200, 2400, 4800, 9600, 19200)]

35

Exp2: Push-in vs. Pull-out

[Chart: time (s, axis from 0 to 35) versus data selectivity (0.01, 0.03, 0.09, 0.27, 0.91); series: Pushin, Pullout]

36

Exp3: Runtime Optimization

[Chart: time (s, axis from 0 to 35) versus data selectivity (0.01, 0.03, 0.09, 0.27, 0.91); series: Pushin, Pullout, Adaptive]

37

Related work

Automata-based XML processing: XFilter, YFilter, X-Scan, XTrie, XPush, …

Algebraic XQuery engines: XPeranto, LegoDB, Rainbow, Timber, …

Runtime optimization: Tukwila, TelegraphCQ, …

38

Contribution

While much recent XML stream work (e.g., at SIGMOD 2003) processes XPath queries, we are among the first to deal with XQuery

We are the first to consider the flexible automata and query algebra integration problem

Pushin vs. pullout optimization techniques

Prototype system

Experimental study

39

Conclusion

Combining automata and query algebra results in a very powerful query model for XML stream processing

Special optimization techniques (e.g., pushin vs. pullout) can be applied in the integrated system

Data statistics collected at runtime can be exploited via runtime optimization techniques

40

Thanks to:

Prof. Elke A. Rundensteiner

Prof. Kathi Fisler

The Raindrop/Rainbow team

All DSRG members

41

Questions?