Buffering in Query Evaluation over XML Streams
Ziv Bar-Yossef, Technion
Marcus Fontoura, Vanja Josifovski
IBM Almaden Research Center
XML Document
<department>
  <name>Software Testing</name>
  <employee id="1">
    <name>Alice</name>
    <position>engineer</position>
  </employee>
  <employee id="2">
    <name>Bob</name>
    <position>engineer</position>
  </employee>
  <employee id="3">
    <name>Carole</name>
    <position>assistant</position>
  </employee>
  <manager id="4">
    <name>John</name>
  </manager>
</department>
XML Document Tree
root
  department
    name: Software Testing
    employee
      @id: 1
      name: Alice
      position: engineer
    employee
      @id: 2
      name: Bob
      position: engineer
    employee
      @id: 3
      name: Carole
      position: assistant
    manager
      @id: 4
      name: John
XPath Queries
Query: /department[manager/name = "John"]/employee[position = "engineer"]/name
[Figure: the document tree with the query's steps and predicates overlaid on the matching nodes]
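As a point of reference, the query on this slide can be evaluated over a fully parsed (non-streaming) copy of the sample document. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `evaluate` and the compact `DOC` string are assumptions made for illustration, and the predicates are checked by hand rather than through an XPath engine.

```python
import xml.etree.ElementTree as ET

# Compact copy of the sample department document (illustrative).
DOC = (
    '<department><name>Software Testing</name>'
    '<employee id="1"><name>Alice</name><position>engineer</position></employee>'
    '<employee id="2"><name>Bob</name><position>engineer</position></employee>'
    '<employee id="3"><name>Carole</name><position>assistant</position></employee>'
    '<manager id="4"><name>John</name></manager></department>'
)

def evaluate(doc: str) -> list[str]:
    """Text of the nodes selected by
    /department[manager/name = "John"]/employee[position = "engineer"]/name."""
    dept = ET.fromstring(doc)  # root element of the document
    if dept.tag != "department":
        return []
    # Predicate on department: some manager/name equals "John".
    if not any(n.text == "John" for n in dept.findall("manager/name")):
        return []
    result = []
    for emp in dept.findall("employee"):
        # Predicate on employee: some position equals "engineer".
        if any(p.text == "engineer" for p in emp.findall("position")):
            result.extend(n.text for n in emp.findall("name"))
    return result

print(evaluate(DOC))  # ['Alice', 'Bob']
```

Because the whole tree is in memory, the order in which the predicates are checked does not matter here; the streaming setting discussed later does not have that luxury.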
XPath Queries
Query: /department[employee/name = manager/name]/name
[Figure: the document tree with the query's steps and predicate overlaid on the matching nodes]
XPath
- XPath 2.0
- Forward axes only
- Eval(Q,D): nodes in D that match Q
- Two modes of XPath evaluation:
  - Full-fledged evaluation: given Q, D, output Eval(Q,D)
  - Filtering: given Q, D, determine whether Eval(Q,D) is nonempty
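The difference between the two modes can be made concrete with a small sketch. It assumes the query /department[employee/name = manager/name]/name and an in-memory tree; `full_evaluation` and `filtering` are hypothetical names, and the `=` comparison follows the existential semantics of XPath node-set equality.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<department><name>Software Testing</name>'
    '<employee id="1"><name>Alice</name></employee>'
    '<employee id="2"><name>Bob</name></employee>'
    '<manager id="4"><name>John</name></manager></department>'
)

def full_evaluation(doc: str) -> list[str]:
    """Full-fledged evaluation: return Eval(Q, D) as a list of node texts."""
    dept = ET.fromstring(doc)
    if dept.tag != "department":
        return []
    employees = {n.text for n in dept.findall("employee/name")}
    managers = {n.text for n in dept.findall("manager/name")}
    # The predicate holds if some employee name equals some manager name.
    if not employees & managers:
        return []
    return [n.text for n in dept.findall("name")]

def filtering(doc: str) -> bool:
    """Filtering: only report whether Eval(Q, D) is nonempty."""
    return bool(full_evaluation(doc))

print(full_evaluation(DOC), filtering(DOC))  # [] False
```

A dedicated filtering algorithm would not materialize the result at all; deriving the Boolean from the full result is only the simplest way to show that filtering asks for strictly less.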
XML Streams
- XML stream: sequence of SAX events
  - startDocument(), endDocument(), startElement(name), endElement(name), text(str), …
- Critical resources
  - Memory
  - Processing time
- Why XML streams?
  - For transferring XML between systems
  - For efficient access to large XML documents
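To illustrate what such an event sequence looks like, here is a small sketch built on Python's standard `xml.sax` module. The class `EventLogger` and the helper `sax_events` are hypothetical names; note that Python's SAX API calls the text event `characters`, which the sketch logs under the slide's name `text(...)`.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Records every SAX event as a string, in arrival order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument()")

    def endDocument(self):
        self.events.append("endDocument()")

    def startElement(self, name, attrs):
        self.events.append(f"startElement({name})")

    def endElement(self, name):
        self.events.append(f"endElement({name})")

    def characters(self, content):
        # Skip whitespace-only text between elements.
        if content.strip():
            self.events.append(f"text({content.strip()})")

def sax_events(doc: str) -> list[str]:
    handler = EventLogger()
    xml.sax.parseString(doc.encode("utf-8"), handler)
    return handler.events

for event in sax_events("<name>Alice</name>"):
    print(event)
```

The handler sees each event once and in document order, which is why any information needed later has to be kept in the handler's own state.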
Streaming XML Algorithms
- XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
- X-scan [Ives, Levy, and Weld 00]
- XMLTK [Avila-Campillo et al 02]
- XTrie [Chan et al 02]
- SPEX [Olteanu, Kiesling, and Bry 03]
- Lazy DFAs [Green et al 03]
- The XPush Machine [Gupta and Suciu 03]
- XSQ [Peng and Chawathe 03]
- FluX [Koch et al 04]
- TurboXPath [Josifovski, Fontoura, and Barta 05]
- …
All of them use lots of memory on certain queries & documents
Memory Bottleneck I: Storage of Large Transition Tables
- Framework of most algorithms: Q → NFA; simulate the NFA by a DFA
  - Caveat: exponential blowup
- However: exponential blowup is not necessary [Bar-Yossef, Fontoura, Josifovski 04]
  - Algorithm for filtering XML streams whose space is linear in the query size
Memory Bottleneck II: Buffering of Document Fragments
- Scenario 1: buffering nodes, which may or may not be part of the output.
Query: /department[manager/name = "John"]/employee[position = "engineer"]/name
[Figure: the document tree with the query matched against it]
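Scenario 1 can be reproduced with a hand-written streaming evaluator for this one query. The sketch below is an illustration under stated assumptions, not the algorithm from the talk: the class `Scenario1`, the helper `run`, and the compact `DOC` string are hypothetical. Names of engineers are held in `buffer` until the manager predicate is decided, and `peak` records how many were waiting at once.

```python
import xml.sax

DOC = (
    '<department><name>Software Testing</name>'
    '<employee id="1"><name>Alice</name><position>engineer</position></employee>'
    '<employee id="2"><name>Bob</name><position>engineer</position></employee>'
    '<employee id="3"><name>Carole</name><position>assistant</position></employee>'
    '<manager id="4"><name>John</name></manager></department>'
)

class Scenario1(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.path = []             # names of the currently open elements
        self.text = []             # character data of the current element
        self.emp_names = []        # names seen inside the open employee
        self.emp_engineer = False  # employee predicate, decided so far
        self.buffer = []           # names waiting for the manager predicate
        self.john_seen = False     # department predicate, decided so far
        self.output = []
        self.peak = 0              # largest size reached by self.buffer

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []
        if self.path == ["department", "employee"]:
            self.emp_names, self.emp_engineer = [], False

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if self.path == ["department", "employee", "name"]:
            self.emp_names.append(value)
        elif self.path == ["department", "employee", "position"]:
            self.emp_engineer |= value == "engineer"
        elif self.path == ["department", "manager", "name"] and value == "John":
            # Predicate resolved: everything buffered becomes output.
            self.john_seen = True
            self.output.extend(self.buffer)
            self.buffer = []
        elif self.path == ["department", "employee"]:
            if self.emp_engineer:
                target = self.output if self.john_seen else self.buffer
                target.extend(self.emp_names)
            self.peak = max(self.peak, len(self.buffer))
        self.path.pop()
        self.text = []

    def endDocument(self):
        self.buffer = []  # predicate never held: drop the candidates

def run(doc: str):
    handler = Scenario1()
    xml.sax.parseString(doc.encode("utf-8"), handler)
    return handler.output, handler.peak

print(run(DOC))  # (['Alice', 'Bob'], 2)
```

If the manager element came first in the document, the same handler would emit each name as soon as its employee closes and `peak` would stay at 0, which is the sense in which the buffering is forced by document order rather than by the query alone.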
Memory Bottleneck II: Buffering of Document Fragments
- Scenario 2: buffering nodes needed for evaluating pending predicates.
Query: /department[employee/name = manager/name]/name
[Figure: the document tree with the query matched against it]
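For Scenario 2, the values on both sides of the comparison have to be remembered until a match is found. The following sketch is a hand-written streaming evaluator for this single query, offered as an illustration only; `Scenario2`, `run`, `EMP`, `MGR`, and `DOC` are hypothetical names. It evaluates the predicate eagerly, discarding the buffered values as soon as one employee name equals one manager name.

```python
import xml.sax

DOC = (
    '<department><name>Software Testing</name>'
    '<employee id="1"><name>Alice</name><position>engineer</position></employee>'
    '<employee id="2"><name>Bob</name><position>engineer</position></employee>'
    '<employee id="3"><name>Carole</name><position>assistant</position></employee>'
    '<manager id="4"><name>John</name></manager></department>'
)

EMP = ["department", "employee", "name"]
MGR = ["department", "manager", "name"]

class Scenario2(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.path = []
        self.text = []
        self.emp = set()        # employee names buffered for the predicate
        self.mgr = set()        # manager names buffered for the predicate
        self.pending = []       # department/name values awaiting the predicate
        self.satisfied = False
        self.output = []
        self.peak = 0           # most predicate values buffered at once

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if self.path == ["department", "name"]:
            (self.output if self.satisfied else self.pending).append(value)
        elif not self.satisfied and self.path in (EMP, MGR):
            mine, other = (self.emp, self.mgr) if self.path == EMP else (self.mgr, self.emp)
            if value in other:
                # Eager evaluation: the predicate holds, buffers can go.
                self.satisfied = True
                self.output.extend(self.pending)
                self.pending = []
                self.emp.clear()
                self.mgr.clear()
            else:
                mine.add(value)
                self.peak = max(self.peak, len(self.emp) + len(self.mgr))
        self.path.pop()
        self.text = []

def run(doc: str):
    handler = Scenario2()
    xml.sax.parseString(doc.encode("utf-8"), handler)
    return handler.output, handler.peak

print(run(DOC))  # ([], 4)
```

On the sample document no employee shares a name with the manager, so all four names end up buffered and nothing is output; renaming the manager to Bob makes the predicate true at the manager's name, after three values have been buffered.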
Memory Bottleneck II: Buffering of Document Fragments
- Scenario 3: buffering multiple candidate matches that are nested within each other.
  - Relevant only when the document is "recursive"
  - Space required: Ω(doc-recursion-depth) [Bar-Yossef, Fontoura, Josifovski 04]
Our Results
- Quantitative space lower bounds for:
  - Full-fledged evaluation of queries with predicates (Scenario 1)
  - Filtering/full-fledged evaluation of queries with "multi-variate" predicates (Scenario 2)
- Matching upper bound
  - Eager evaluation of predicates
- In all other scenarios: no buffering required
  - Filtering non-recursive documents using queries with "univariate" predicates is possible without buffering [Bar-Yossef, Fontoura, Josifovski 04]
Related Work
- Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]
- Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]
- Space complexity of select-project-join queries over relational data streams [Arasu et al 02]
Document Concurrency
- Q: query
- D = σ_1,…,σ_n: document
  - Each σ_i is a SAX event
  - α_t = (σ_1,…,σ_t): the first t events of D
- Definition: x ∈ D is alive at step t if x ∈ α_t and there exist suffixes β, β' s.t.
  - x ∈ Eval(Q, α_t β)
  - x ∉ Eval(Q, α_t β')
- t-concurrency(D,Q): number of distinct nodes that are alive at step t
- concurrency(D,Q): max_t t-concurrency(D,Q)
Lower Bound Notions
- A "normal" lower bound: for every algorithm A, there exist Q and D s.t. A uses Ω(concurrency(D,Q)) bits of space on Q and D.
  - Q and D may be "pathological"
  - Doesn't say much about real-world queries/documents
- An "ideal" lower bound: for every A, every Q, and every D, A uses Ω(concurrency(D,Q)) bits of space on Q and D.
  - Too good to be true
  - A can have D and Q "hard-coded", and then know the result a priori
  - Space of A on D and Q = minimum description length of Q and D
Our Lower Bound
- Theorem: For every A, every Q, and every D, there exists an almost isomorphic document D' s.t. A uses Ω(concurrency(D,Q)) bits of space on Q and D'.
  - D' is the same as D, except for a few extra empty nodes with auxiliary names.
- Theorem holds only if:
  - Q is "star-free"
  - D is non-recursive
Why isn’t this Obvious?
- Reason 1: we want the theorem to work for every Q and D, not only ones with high MDL.
- Reason 2:
  - Obvious: if x is alive at step t, then A has to buffer x
    - Because: A may or may not need to output x
  - Not obvious: if x and y are alive at step t, then A has to buffer both
    - If x and y are not "independent", maybe it's enough to buffer just x (or just y)
Proof of Lower Bound
- C = t-concurrency(D,Q)
- x_1,…,x_C = distinct nodes alive at step t
- Recall: for every x_i there exist suffixes β_i and β'_i s.t. (with α_t the first t events of D)
  - x_i ∈ Eval(Q, α_t β_i)
  - x_i ∉ Eval(Q, α_t β'_i)
- Lemma: there exist a single β and a single β' s.t. for all i,
  - x_i ∈ Eval(Q, α_t β)
  - x_i ∉ Eval(Q, α_t β')
Proof of Lower Bound (cont.)
- For every S ⊆ {1,…,C} define document D_S:
  - D_S is the same as D, except
  - For every i ∈ S, we "mark" x_i
  - Marking: an extra empty child with an auxiliary name
- Note: D_S is almost-isomorphic to D
- α_t^S = first t events in D_S
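The marking construction can be sketched in a few lines with `xml.etree.ElementTree`. This is only an illustration: the function `build_d_s`, the auxiliary tag `aux`, the toy document, and the convention of naming candidate nodes by their position in document order are all assumptions, and indices are 0-based here whereas the slide counts from 1.

```python
import xml.etree.ElementTree as ET

DOC = "<d><e><n>A</n></e><e><n>B</n></e></d>"

def build_d_s(doc: str, candidates: list[int], subset: set[int], aux: str = "aux") -> str:
    """Return D_S: a copy of doc in which candidate i gets an empty `aux`
    child for every i in subset. candidates[i] is the position of x_i
    among all elements in document order."""
    root = ET.fromstring(doc)
    nodes = list(root.iter())  # all elements, in document order
    for i in subset:
        ET.SubElement(nodes[candidates[i]], aux)  # the empty marker child
    return ET.tostring(root, encoding="unicode")

# The two <n> elements sit at positions 2 and 4 in document order.
print(build_d_s(DOC, [2, 4], {1}))
# <d><e><n>A</n></e><e><n>B<aux /></n></e></d>
```

Each of the 2^C subsets yields a different document, while the element structure apart from the markers is unchanged.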
Proof of Lower Bound (cont.)
- A = any algorithm
- Consider the state of A after processing α_t^S (the first t events of D_S):
  - If suffix = β', none of the x_i's should be output
    - A could not have output any x_i by step t
  - If suffix = β, there is no information in the suffix about S, but S can be reconstructed from the output
    - The state of A at step t must have all information about S
- Conclusion: space = Ω(C)
- Actual proof: by one-way communication complexity
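The counting step behind the conclusion can be written out as follows. This is an informal restatement of the argument on the slide, not the communication-complexity proof itself; it assumes that two different subsets S must leave A in different states after the first t events of D_S, since otherwise A would produce the same output for both once the common suffix arrives.

```latex
\[
  \#\{\text{states of } A \text{ after the first } t \text{ events of } D_S\}
  \;\ge\; \#\{\, S \subseteq \{1,\dots,C\} \,\} \;=\; 2^{C}
  \quad\Longrightarrow\quad
  \text{space of } A \;\ge\; \log_2 2^{C} \;=\; C \text{ bits}.
\]
```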
Conclusions
- Our contributions:
  - Quantitative space lower bounds
    - Full-fledged evaluation of queries with predicates
    - Filtering/full-fledged evaluation of queries with "multi-variate" predicates
  - Matching upper bound
- Open problems:
  - Quantitative lower bounds for XQuery evaluation over streams
  - Address larger fragments of XPath
Memory Bottleneck II: Buffering of Document Fragments
- Scenario 3: buffering multiple candidate matches that are nested within each other.
Query: //a[b and c]
[Figure: a document tree with nested a elements that have b and c children]
- Relevant only when the document is "recursive"
- Space required: Ω(doc-recursion-depth) [Bar-Yossef, Fontoura, Josifovski 04]
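A small streaming sketch shows where the recursion depth enters. It assumes the query //a[b and c] and a made-up nested document; `NestedA` and `run` are hypothetical names. One record per open a element is kept on a stack, so the stack grows with the nesting depth of a inside a.

```python
import xml.sax

DOC = "<a><c/><a><b/><a><c/><b/></a></a></a>"

class NestedA(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.path = []     # names of the currently open elements
        self.open_a = []   # one {"b": bool, "c": bool} record per open <a>
        self.matches = 0   # number of <a> elements with both a b and a c child
        self.peak = 0      # largest number of simultaneously open <a> records

    def startElement(self, name, attrs):
        # A b or c child counts only for its parent, the innermost open <a>.
        if self.path and self.path[-1] == "a" and name in ("b", "c"):
            self.open_a[-1][name] = True
        if name == "a":
            self.open_a.append({"b": False, "c": False})
            self.peak = max(self.peak, len(self.open_a))
        self.path.append(name)

    def endElement(self, name):
        self.path.pop()
        if name == "a":
            record = self.open_a.pop()
            if record["b"] and record["c"]:
                self.matches += 1

def run(doc: str):
    handler = NestedA()
    xml.sax.parseString(doc.encode("utf-8"), handler)
    return handler.matches, handler.peak

print(run(DOC))  # (1, 3)
```

In this toy document only the innermost a has both children, yet all three a elements are undecided at the same time, so three records are live at the deepest point.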
Concurrency: Example
Query: /department[manager/name = "John"]/employee[position = "engineer"]/name
<department>
  <name>Software Testing</name>
  <employee id="1">
    <name>Alice</name>          (alive)
    <position>engineer</position>
  </employee>
  <employee id="2">
    <name>Bob</name>            (alive)
    <position>engineer</position>
  </employee>
  <employee id="3">
    <name>Carole</name>         (dead)
    <position>assistant</position>
  </employee>
  <manager id="4">
    <name>John</name>
  </manager>
</department>