Post on 20-Dec-2015
1
Efficient XML Stream Processing with Automata and Query Algebra
A Master Thesis Presentation
Student: Jinhui Jian
Advisor: Prof. Elke A. Rundensteiner
Reader: Prof. Kathi Fisler
2
The Need for XML Stream Processing
[Figure: XML, relational, HTML, and news sources send XML data streams over the Internet to an XML stream processing engine]
New paradigms: distributed data providers, distributed data consumers
New applications: monitoring (e.g., sensor networks), information filtering (e.g., news, email)
New challenges: arbitrarily nested structure, incomplete knowledge
3
Two Existing Approaches
Automata-based [xfilter01, yfilter02, x-scan01,…]
Algebraic [tukwila01, rainbow02, …]
This thesis integrates both existing approaches into one system
4
A Running Example
Give me book titles whose price is greater than 50:
<result>
  FOR $b in doc (bib.xml) //book
  WHERE $b/price > 50
  RETURN <expensive> $b/title </expensive>
</result>
Input (bib.xml):
<bib>
  <book year="1994">
    <title>TCP/IP Illustrated</title>
    <author><last>Stevens</last><first>W.</first></author>
    <publisher>Addison-Wesley</publisher>
    <price>65.95</price>
  </book>
  <book year="2000">
    <title>Data on the Web</title>
    <author><last>Abiteboul</last><first>Serge</first></author>
    <author><last>Buneman</last><first>Peter</first></author>
    <author><last>Suciu</last><first>Dan</first></author>
    <publisher>Morgan Kaufmann Publishers</publisher>
    <price>39.95</price>
  </book>
  <book year="1992">
    <title>Advanced Programming in the Unix environment</title>
    <author><last>Stevens</last><first>W.</first></author>
    <publisher>Addison-Wesley</publisher>
    <price>65.95</price>
  </book>
</bib>
Output:
<result>
  <expensive> <title>TCP/IP Illustrated</title> </expensive>
  <expensive> <title>Advanced Programming in the Unix environment</title> </expensive>
</result>
5
XML as a Stream of Tokens
Input XML stream (along the timeline):
<bib> <book> <title> TCP/IP Illustrated </title> <author> <last> Stevens </last> … </book> …
[Figure: the document tree, with bib at the root, book elements below it, title, author (last, first), publisher, and price under each book, and text at the leaves]
A token can be:
An open tag
A close tag
PCDATA
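The three token kinds can be illustrated with a small sketch built on Python's standard xml.sax module. The token names "open", "close", and "pcdata" and the TokenCollector class are illustrative choices, not names from the thesis, and the thesis prototype itself uses the Xerces SAX parser in Java. Collecting tokens in a list is only for demonstration; a streaming engine would consume each token as it arrives.

```python
import xml.sax


class TokenCollector(xml.sax.ContentHandler):
    """Turns SAX callbacks into (kind, value) tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._text = []

    def _flush_text(self):
        # SAX may deliver one text node in several chunks; join them first.
        text = "".join(self._text).strip()
        self._text = []
        if text:  # ignore whitespace-only text between tags
            self.tokens.append(("pcdata", text))

    def startElement(self, name, attrs):
        self._flush_text()
        self.tokens.append(("open", name))

    def endElement(self, name):
        self._flush_text()
        self.tokens.append(("close", name))

    def characters(self, content):
        self._text.append(content)


def tokenize(xml_text):
    handler = TokenCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tokens


tokens = tokenize("<bib><book><title>TCP/IP Illustrated</title></book></bib>")
for token in tokens:
    print(token)
```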
6
Basic State-Transition Model
Q := //book/price
FOR $b in doc (bib.xml) //book
WHERE $b/price > 50
RETURN $b/title
Automaton: 0 -ε-> 1, 1 -*-> 1, 1 -book-> 2, 2 -price-> 3
input                  active states   stack (bottom to top)
(start)                0               (empty)
<bib>                  1               [0]
<book>                 1,2             [0] [1]
<title>                1               [0] [1] [1,2]
TCP/IP Illustrated     1               [0] [1] [1,2]
</title>               1,2             [0] [1]
<price>                1,3             [0] [1] [1,2]
…                      …               …
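The trace on this slide can be reproduced with a short simulation. This is a minimal sketch, assuming the automaton for //book/price has the transitions 0 -ε-> 1, 1 -*-> 1, 1 -book-> 2, 2 -price-> 3, that an open tag pushes the current active set and takes a transition, and that a close tag pops the stack. The function names and the (kind, value) token format are illustrative.

```python
# Automaton for //book/price; "*" matches any element name.
EPSILON = {0: {1}}
TRANSITIONS = {
    (1, "*"): {1},
    (1, "book"): {2},
    (2, "price"): {3},
}


def closure(states):
    """Add every state reachable through epsilon moves."""
    result, todo = set(states), list(states)
    while todo:
        for nxt in EPSILON.get(todo.pop(), ()):
            if nxt not in result:
                result.add(nxt)
                todo.append(nxt)
    return result


def step(active, name):
    """Active set after reading the open tag <name>."""
    nxt = set()
    for state in closure(active):
        nxt |= TRANSITIONS.get((state, name), set())
        nxt |= TRANSITIONS.get((state, "*"), set())
    return nxt


def run(tokens):
    active, stack, trace = {0}, [], []
    for kind, value in tokens:
        if kind == "open":
            stack.append(active)   # remember where to return on the close tag
            active = step(active, value)
        elif kind == "close":
            active = stack.pop()   # backtrack to the state before the open tag
        trace.append((sorted(active), [sorted(s) for s in stack]))
    return trace


tokens = [
    ("open", "bib"), ("open", "book"), ("open", "title"),
    ("pcdata", "TCP/IP Illustrated"), ("close", "title"), ("open", "price"),
]
for (kind, value), (active, stack) in zip(tokens, run(tokens)):
    print(kind, value, active, stack)
```

Reaching state 3 signals that a //book/price element has just been opened.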
7
Extended with Data Buffer and Buffer Operations
FOR $b in doc (bib.xml) //book
WHERE $b/price > 50
RETURN $b/title
Data-driven; token at a time; fixed order
Automaton: 0 -ε-> 1, 1 -*-> 1, 1 -book-> 2, 2 -title-> 3, 2 -price-> 4
State 3 (title): 1. write buffer; 2. output if flag is set
State 4 (price): 1. eval pred and set/clear flag; 2. output if buffer not empty
Data: buffer, flag
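The buffer and flag operations can be sketched as follows. This is a simplified illustration, not the thesis implementation: it tracks the open-element path with a plain list instead of the automaton, resets the buffer and flag at each new book, and only outputs on the price side when the flag is set. The third book in the sample lists its price before its title (a change from the running example) so that both output paths are exercised.

```python
def evaluate(tokens, threshold=50.0):
    """Token-at-a-time evaluation of
    FOR $b in //book WHERE $b/price > threshold RETURN $b/title."""
    path, results = [], []
    buffer, flag, text = [], False, ""
    for kind, value in tokens:
        if kind == "open":
            path.append(value)
            text = ""
            if value == "book":            # new binding of $b
                buffer, flag = [], False
        elif kind == "pcdata":
            text += value
        elif kind == "close":
            if path[-2:] == ["book", "title"]:
                buffer.append(text)        # 1. write buffer
                if flag:                   # 2. output if flag is set
                    results.extend(buffer)
                    buffer = []
            elif path[-2:] == ["book", "price"]:
                flag = float(text) > threshold   # 1. eval pred, set/clear flag
                if flag and buffer:              # 2. output if buffer not empty
                    results.extend(buffer)
                    buffer = []
            path.pop()
    return results


def book(*children):
    """Helper that builds the tokens of one <book> element."""
    toks = [("open", "book")]
    for name, text in children:
        toks += [("open", name), ("pcdata", text), ("close", name)]
    return toks + [("close", "book")]


stream = (
    [("open", "bib")]
    + book(("title", "TCP/IP Illustrated"), ("price", "65.95"))
    + book(("title", "Data on the Web"), ("price", "39.95"))
    + book(("price", "65.95"),
           ("title", "Advanced Programming in the Unix environment"))
    + [("close", "bib")]
)
print(evaluate(stream))
```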
8
Algebraic Query Plan
FOR $b in doc (bib.xml) //book
WHERE $b/price > 50
RETURN $b/title
Set at a time; postponed operations
Plan (bottom to top): Extract //book -> Navigate //book, price -> Select price > 50 -> Navigate //book, title -> Tagger
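The set-at-a-time style of this plan can be sketched with Python's standard xml.etree.ElementTree. The function names mirror the operator names on the slide, but their signatures and the tuple representation (a dict from variable name to elements) are assumptions made for illustration. Each operator consumes the whole output of the previous one, and navigation to title is postponed until after the selection.

```python
import xml.etree.ElementTree as ET

BIB = (
    "<bib>"
    "<book><title>TCP/IP Illustrated</title><price>65.95</price></book>"
    "<book><title>Data on the Web</title><price>39.95</price></book>"
    "</bib>"
)


def extract(root, tag, var):
    """Extract //tag: one tuple per matching element."""
    return [{var: el} for el in root.iter(tag)]


def navigate(tuples, var, path, out):
    """Navigate var, path: add the elements reached by path as column out."""
    return [{**t, out: t[var].findall(path)} for t in tuples]


def select(tuples, pred):
    """Select: keep the tuples that satisfy the predicate."""
    return [t for t in tuples if pred(t)]


def tagger(tuples, var, tag):
    """Tagger: wrap each element of column var in a new tag."""
    return [f"<{tag}><{el.tag}>{el.text}</{el.tag}></{tag}>"
            for t in tuples for el in t[var]]


root = ET.fromstring(BIB)
plan = extract(root, "book", "$b")
plan = navigate(plan, "$b", "price", "$p")
plan = select(plan, lambda t: any(float(p.text) > 50 for p in t["$p"]))
plan = navigate(plan, "$b", "title", "$t")   # postponed until after Select
answer = tagger(plan, "$t", "expensive")
print(answer)
```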
9
Exploit the Flexibility of Postponed Operations
FOR $b in doc (bib.xml) //book
WHERE $b/price > 50 and $b/author/last = “Stevens”
RETURN $b/title
Plan (bottom to top): Extract //book -> Navigate //book, price -> Select price > 50 -> Navigate //book, author/last -> Select last = “Stevens” -> Navigate //book, title -> Tagger
10
Query Optimization in Algebraic Systems
Logical optimization: selection pushdown, projection pushdown, join order selection
Physical optimization: operator algorithms
Runtime optimization: scheduling, resource allocation
11
Thesis Overview
Motivation
The automata model is good for on-the-fly pattern matching/retrieval
The algebraic model is good for optimizing complex queries
Major challenges
How to integrate the two models?
How to optimize a query within the integrated query model?
13
Path Bindings in XQuery
FOR $b in doc (bib.xml) //book
WHERE $b/price > 50
RETURN $b/title
The same query with explicit path bindings:
FOR $b in doc (bib.xml) //book
LET $p := $b/price, $t := $b/title
WHERE $p > 50
RETURN $t
FLWR expression: FOR … LET … WHERE … RETURN …
FOR and LET: path bindings
WHERE and RETURN: filtering and restructuring
“The purpose of path bindings is to produce a tuple stream in which each tuple consists of one or more bound variables” [W3C]
14
A Two-Tier System Architecture
XML data stream -> Automata plan -> Tuple stream -> Master plan -> Query answer
15
Modeling the Master Plan: Algebraic
Plan (bottom to top): Navigate //book, price -> Select price > 50 -> Navigate //book, author/last -> Select last = … -> Navigate //book, title -> Tagger
16
Modeling the Automata Plan: Black Box vs. White Box
Black box: a single Automata Plan answering Q1 := //book, Q2 := //book/price, Q3 := //book/title
White box: SJoin //book over Extract //book/price and Extract //book/title
18
Optimization: A Unified Process in the Logical View
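With the automaton shown (0 -ε-> 1, 1 -*-> 1, 1 -book-> 2): Extract //book -> Navigate //book, //book/price -> Select //book/price > 50 -> Navigate //book, //book/title
As one logical plan: Extract //book -> Navigate //book, price -> Select price > 50 -> Navigate //book, title
The Extract operator belongs to the Automata Plan; the remaining operators belong to the Master Plan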
The Algebra Core
Op (parameters): Semantics
Selection (pred): Filter tuples based on the predicate pred
Projection (v): Filter columns in the input tuples based on the variable list v
Join (pred): Join input tuples based on the predicate pred
Aggregate (f): Aggregate over input tuples with the aggregate function f, e.g., sum and average
Tagger (pt): Format outputs based on the pattern pt, i.e., reconstruct XML tags
Navigate (p1, p2): Take input elements of path p1 and output descendant elements of path p2
Extract (p): Identify elements of path p from the input stream
Structural Join (p): Join input tuples on their structural relationship, e.g., the common parent relationship p
20
The Extract Operator
Automaton: 0 -ε-> 1, 1 -*-> 1, 1 -book-> 2, 2 -title-> 3
Extract //book/title
Input: <bib> <book> <title> TCP/IP Illustrated </title> … </book> …
Output:
<title> TCP/IP Illustrated </title>
<title> Data on the Web </title>
<title>Advanced Programming in the Unix environment</title>
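A minimal sketch of what Extract //book/title does on a token stream: it copies tokens into an output element while a matching element is open and emits the element when its close tag arrives. The generator below tracks the open-element path with a list rather than the automaton, and the (kind, value) token format is an assumption of this sketch.

```python
def extract_book_title(tokens):
    """Yield each //book/title element as text, one at a time."""
    path, current = [], None
    for kind, value in tokens:
        if kind == "open":
            path.append(value)
            if path[-2:] == ["book", "title"]:
                current = []                 # start collecting a new element
        if current is not None:
            if kind == "open":
                current.append(f"<{value}>")
            elif kind == "close":
                current.append(f"</{value}>")
            else:
                current.append(value)
        if kind == "close":
            if path[-2:] == ["book", "title"]:
                yield "".join(current)       # element is complete: emit it
                current = None
            path.pop()


stream = [
    ("open", "bib"),
    ("open", "book"),
    ("open", "title"), ("pcdata", "TCP/IP Illustrated"), ("close", "title"),
    ("open", "price"), ("pcdata", "65.95"), ("close", "price"),
    ("close", "book"),
    ("open", "book"),
    ("open", "title"), ("pcdata", "Data on the Web"), ("close", "title"),
    ("close", "book"),
    ("close", "bib"),
]
for element in extract_book_title(stream):
    print(element)
```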
21
The Structural Join Operator
FOR $b in doc (bib.xml) //book
LET $p := $b/price, $t := $b/title
WHERE $p > 50
RETURN $t
Automaton: 0 -ε-> 1, 1 -*-> 1, 1 -book-> 2, 2 -title-> 3, 2 -price-> 4
Plan: SJoin //book over Extract //book/title and Extract //book/price
Input: <bib> <book> <title> TCP/IP Illustrated </title> … </book> … <book> … </book>
Output (one tuple per book):
<title>…</title> <price>…</price>
<title>…</title> <price>…</price>
<title>…</title> <price>…</price>
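The join on a common parent can be sketched as a grouping step. The slides do not say how extracted elements are associated with their enclosing book, so the book_id attached to each element below is a hypothetical device for illustration, as is the function signature.

```python
from collections import defaultdict


def sjoin(*extracted_streams):
    """Join extracted elements that share the same enclosing //book.

    Each stream is a list of (book_id, element) pairs; book_id is a
    hypothetical identifier of the enclosing book element.
    """
    groups = defaultdict(lambda: [[] for _ in extracted_streams])
    for column, stream in enumerate(extracted_streams):
        for book_id, element in stream:
            groups[book_id][column].append(element)
    # One output tuple per book, in document order of the books.
    return [tuple(groups[book_id]) for book_id in sorted(groups)]


titles = [
    (0, "<title>TCP/IP Illustrated</title>"),
    (1, "<title>Data on the Web</title>"),
]
prices = [
    (0, "<price>65.95</price>"),
    (1, "<price>39.95</price>"),
]
for row in sjoin(titles, prices):
    print(row)
```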
22
The Navigate Operator
Input:
<book year="1994"> <title>TCP/IP Illustrated</title> <author> <last> Stevens </last> <first> W. </first> </author> <publisher> Addison-Wesley </publisher> <price> 65.95 </price> </book>
<book>… … </book>
<book>… … </book>
Navigate //book, title
Output:
<book>… … </book> <title>…</title>
<book>… … </book> <title>…</title>
<book>… … </book> <title>…</title>
A navigate operation can be postponed, independent of the input stream
23
A Special Optimization: In or Out?
XML data stream -> Automata plan -> Tuple stream -> Master plan -> Query answer
Two Options: Bottom-up vs. Top-down
Bottom-up: extract the <price>…</price> elements and the <title>…</title> elements separately, then join them into <title>…</title> <price>…</price> tuples
Top-down: extract each <book>… … </book> element, then navigate into it to add <title>…</title> and then <price>…</price>
25
Exploiting the Options for Optimization
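The pull-out plan
Automaton: 0 -ε-> 1, 1 -*-> 1, 1 -book-> 2
Plan (bottom to top): Extract //book -> Navigate //book, price -> Select price > 50 -> Navigate //book, title -> Tagger
The push-in plan
Automaton: 0 -ε-> 1, 1 -*-> 1, 1 -book-> 2, 2 -title-> 3, 2 -price-> 4
Plan (bottom to top): Extract //book/price and Extract //book/title -> SJoin //book -> Select //book/price > 50 -> Tagger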
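The relationship between the two plans can be sketched as a rewrite over a simple plan representation. The list-of-tuples encoding and the push_in function are assumptions made for this illustration; the rewrite replaces each Navigate on the extracted path with an Extract on the longer path and adds an SJoin on the common ancestor.

```python
def push_in(plan):
    """Rewrite a pull-out plan (listed bottom to top) into a push-in plan.

    Returns (automata_plan, master_plan).
    """
    (op, base), rest = plan[0], plan[1:]
    assert op == "Extract"
    extracts, master = [], []
    for node in rest:
        if node[0] == "Navigate" and node[1] == base:
            # Navigate base, p becomes Extract base/p inside the automaton.
            extracts.append(("Extract", f"{base}/{node[2]}"))
        elif node[0] == "Select":
            # The predicate now refers to the absolute path.
            master.append(("Select", f"{base}/{node[1]}", node[2], node[3]))
        else:
            master.append(node)
    return extracts + [("SJoin", base)], master


pull_out = [
    ("Extract", "//book"),
    ("Navigate", "//book", "price"),
    ("Select", "price", ">", 50),
    ("Navigate", "//book", "title"),
    ("Tagger",),
]
automata_plan, master_plan = push_in(pull_out)
print(automata_plan)
print(master_plan)
```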
26
Query Optimization by Rewriting Rules
Navigate pushin (on paths p1 and p2)
Redundant SJoin (on path p1)
Redundant Extract (on path p1)
Selection pushdown (past an operator op)
Etc.
Algebraic transformation: [Figure: step-by-step rewriting of the example plan; the operator symbols are not recoverable from the transcript]
27
Runtime Optimization: Why?
Optimization relies on cost estimation, which in turn relies on statistics
Statistics may be unknown
Statistics may change
Plan (bottom to top): Extract //book -> Navigate //book, price -> Select price > 50 -> Navigate //book, title -> Tagger
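One statistic that is unknown before the stream starts, and that can drift while it runs, is the selectivity of the Select operator. The sketch below is an illustrative assumption about how such a statistic could be tracked over a sliding window; the class name and the window size are hypothetical and do not come from the thesis.

```python
from collections import deque


class SelectivityMonitor:
    """Tracks the fraction of recent tuples that passed a predicate."""

    def __init__(self, window):
        # Only the last `window` outcomes are kept, so the estimate
        # follows changes in the data.
        self.outcomes = deque(maxlen=window)

    def record(self, passed):
        self.outcomes.append(bool(passed))

    def selectivity(self):
        if not self.outcomes:
            return None  # statistics unknown so far
        return sum(self.outcomes) / len(self.outcomes)


monitor = SelectivityMonitor(window=2)
for price in [65.95, 39.95, 65.95]:
    monitor.record(price > 50)
print(monitor.selectivity())
```

An optimizer could consult such an estimate when deciding whether the current plan should be replaced.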
29
Why Is Migration Needed?
When to interrupt the executor: master plan, automata plan
[Figure: the optimization cycle, with the steps normal execution, prepare for migration, decision making, and plan modification; a legend distinguishes executor from optimizer, and part of the cycle is labeled as the migration process]
30
Modifying the Automata: A Bad Example
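Pull-out plan
Automaton: 0 -ε-> 1, 1 -*-> 1, 1 -book-> 2
Plan (bottom to top): Extract //book -> Navigate //book, //book/price -> Select //book/price > 50 -> Navigate //book, //book/title
Push-in plan
Automaton: 0 -ε-> 1, 1 -*-> 1, 1 -book-> 2, 2 -title-> 3, 2 -price-> 4
Plan (bottom to top): Extract //book/price and Extract //book/title -> SJoin //book -> Select //book/price > 50
Input: <bib> <book> <title> TCP/IP Illustrated </title> <price> 36.65 </price> … </book> … <book>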
31
Modifying the Automata:A Safe Approach
<bib> <note>…</note> <book> … </book> <book>…</book> <note> …</note> …
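[Figure: positions in the stream above marked as a safe point and an unsafe point]
Automaton before modification: 0 -ε-> 1, 1 -*-> 1, 1 -book-> 2
Automaton after modification: 0 -ε-> 1, 1 -*-> 1, 1 -book-> 2, 2 -title-> 3, 2 -price-> 4
FOR $b in doc (bib.xml) //book
LET $p := $b/price, $t := $b/title
WHERE $p > 50
RETURN $t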
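The slide marks safe and unsafe points without defining them. The sketch below assumes one plausible reading: a point in the stream is safe when no book element is currently open, so the automaton holds no partially processed binding of $b. The function name and token format are illustrative.

```python
def safe_points(tokens, binding="book"):
    """Return the indexes of tokens after which the automaton could be swapped.

    Assumption: a point is safe when no `binding` element is open.
    """
    depth, safe = 0, []
    for i, (kind, value) in enumerate(tokens):
        if kind == "open" and value == binding:
            depth += 1
        elif kind == "close" and value == binding:
            depth -= 1
        if depth == 0:
            safe.append(i)
    return safe


stream = [
    ("open", "bib"),     # 0
    ("open", "note"),    # 1
    ("close", "note"),   # 2
    ("open", "book"),    # 3  unsafe: a book is open
    ("open", "title"),   # 4
    ("close", "title"),  # 5
    ("close", "book"),   # 6  safe again
    ("open", "book"),    # 7
    ("close", "book"),   # 8
    ("close", "bib"),    # 9
]
print(safe_points(stream))
```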
32
Experimental Study
Is it feasible to integrate the automata model and the algebraic model?
Is push-in vs. pull-out a feasible optimization?
Is runtime optimization worthwhile?
33
Experimental Setup
Java 1.4
Pentium III 750 MHz, 384 MB
Windows XP Professional
Third-party components: Xerces SAX parser, the Kweelt XQuery parser, Rainbow core
34
Exp1: System Throughput
[Figure: throughput (MB/s, axis from 0 to 1) versus input size (KB) for inputs of 20, 40, 80, 160, 320, 640, 1200, 2400, 4800, 9600, and 19200 KB]
35
Exp2: Push-in vs. Pull-out
[Figure: time (s, axis from 0 to 35) versus data selectivity (0.01, 0.03, 0.09, 0.27, 0.91) for the push-in and pull-out plans]
Exp3: Runtime Optimization
[Figure: time (s, axis from 0 to 35) versus data selectivity (0.01, 0.03, 0.09, 0.27, 0.91) for the push-in, pull-out, and adaptive plans]
37
Related work
Automata-based XML processing: XFilter, YFilter, X-Scan, XTrie, XPush, …
Algebraic XQuery engines: XPERANTO, LegoDB, Rainbow, Timber, …
Runtime optimization: Tukwila, TelegraphCQ, …
38
Contribution
While much recent XML stream work (e.g., in SIGMOD03) processes XPath queries, we are among the first to deal with XQuery
We are the first to consider the flexible automata and query algebra integration problem
Pushin vs. pullout optimization techniques
Prototype system
Experimental study
39
Conclusion
Combining automata and query algebra results in a very powerful query model for XML stream processing
Special optimization techniques (e.g., pushin vs. pullout) can be applied in the integrated system
Data statistics collected at runtime can be exploited via runtime optimization techniques
40
Thanks to:
Prof. Elke A. Rundensteiner
Prof. Kathi Fisler
The Raindrop/Rainbow team
All DSRG members