xml stream processing
Embed Size (px)
DESCRIPTION
XML Stream Processing. Yanlei Diao University of Massachusetts Amherst. XML Data Streams. XML is the “wire format” for data exchanged online. Purchase orders http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=ubl News feeds http://blogs.law.harvard.edu/tech/rss Stock tickers - PowerPoint PPT PresentationTRANSCRIPT

XML Stream Processing
Yanlei DiaoUniversity of Massachusetts
Amherst

Yanlei Diao, University of Massachusetts Amherst 04/20/23
XML Data Streams
XML is the “wire format” for data exchanged online.• Purchase orders http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=ubl
• News feeds http://blogs.law.harvard.edu/tech/rss
• Stock tickers http://www.tickertech.com/products/xml/
• Transactions of online auctions http://bigblog.com/online_auctions.xml
• Network monitoring data http://ganglia.sourceforge.net/

Yanlei Diao, University of Massachusetts Amherst 04/20/23
XML Stream Processing
XML data continuously arrives from external sources, queries are evaluated every time a new data item is received.
A key feature of stream processing is the ability to process as it arrives.• Natural fit for XML message brokering and web services where
messages need to be filtered and transformed on-the-fly.
XML stream processing allows query execution to start before a message is completely received.• Short delay in producing results. • No need to buffer the entire message for query processing. • Both are crucial when input messages are large, e.g. the equivalent
of a database’s worth of data.

Yanlei Diao, University of Massachusetts Amherst 04/20/23
Event-Based Parsing
XML stream processing is often performed on the granularity of an XML parsing event produced by an event-based API.
<?xml version="1.0" ?> <report> <section id=“intro” difficulty=“easy”> <title>Pub-Sub</title> <section difficulty=“easy”> <figure source=“g1.jpg”> <title>XML Processing</title> </figure> </section> <figure source=“g2.jpg”> <title>Scalability</title> </figure> </section></report>
< Start Document< Start Element: report< Start Element: section< Start Element: title Characters: Pub/Sub> End Element: title< Start Element: section< Start Element: figure< Start Element: title Characters: XML Processing> End Element: title> End Element: figure> End Element: section …> End Element: section> End Element: report> End Document

Yanlei Diao, University of Massachusetts Amherst 04/20/23
Matching a Single Path Expression
XFilter [Altinel&Franklin’00], YFilter [Diao et al.’02], Tukwila [Ives et al.’02]
Simple paths: ( (“/” | “//”) (ElementName | “*”) )+ • A simple path can be transformed to a regular expression• Let = set of element names:
− ‘/’ is translated to the ‘·’ concatenation operator− “//” is translated to *− ‘*’ is translated to − “/a//b” can be translated to “a * b”
A finite automaton (FA) for each path: mapping location steps to machine states.

Yanlei Diao, University of Massachusetts Amherst 04/20/23
Query Compilation
Location
steps Automaton fragments
/a
//a
/*
a
* a
*
Map location steps to automaton fragments
Concatenate automaton fragments for location steps in a query
a
*ba * bQuery “/a//b”
Is this Automaton deterministic or non-deterministic?
For multi-path processing; o.w.
not needed

Yanlei Diao, University of Massachusetts Amherst 04/20/23
Event-Driven Query Execution
Query execution retrieves all matches of a path expression.
Event-driven execution of FA:• Parsing events (esp. start of
elements) drive the FA execution.• Elements that trigger transitions to
the accepting states are returned.
figure section* < Start Document< Start Element: report< Start Element: section< Start Element: title Characters: Pub/Sub> End Element: title< Start Element: section< Start Element: figure< Start Element: title Characters: XML Processing> End Element: title> End Element: figure> End Element: section …> End Element: section> End Element: report> End Document
Run FA over XML with nested structure, retrieve all results• Approach 1: shred XML into a set of linear paths• Approach 2: augment FA execution with backtracking

Yanlei Diao, University of Massachusetts Amherst 04/20/23
Multi-Query Processing
Problem: evaluate a set Q = Q1, …, Qn of path queries against each incoming XML document.
Brute force: iterate the query set, one query at a time Indexing of queries: inverse problem of traditional query
processing• Traditional DB: Data is stored persistently; queries executed against
the data. Indexes of data enable selective searches of the data. • XML stream processing: Queries are persistently stored; documents
or their parsing events drive the matching of queries. Indexes of queries enable selective matching of documents to queries.
Sharing of processing: commonalities exist among queries. Shared processing avoids redundant work.

Yanlei Diao, University of Massachusetts Amherst 04/20/23
YFilter [Diao et al.’03] builds a combined FA for all paths. Complete prefix sharing among paths. Moore machine with output: accepting states → partition of query ids. Nondeterministic Finite Automaton (NFA)-based implementation: a small
machine size, flexible, easy to maintain, etc.
Constructing the Combined FSM
a
{Q1}
bQ1=/a/b
Q2=/a/c
Q3=/a/b/c
Q4=/a//b/c
Q5=/a/*/b
Q6=/a//c
Q7=/a/*/*/c
Q8=/a/b/c
a
{Q2}c
c {Q3}
{Q4}c
b*
*
c {Q5}
c {Q6}
* c{Q7}
{Q3, Q8}

Yanlei Diao, University of Massachusetts Amherst 04/20/23
YFilter uses a stack mechanism to handle XML. Backtracking in the NFA. No repeated work for the same element.
<b> <c> </c>An XML fragment <a> <b> <c> </c> …
Execution Algorithm
read <a>
2
1
read </c>
3 9 7 6
2
1
initial
1
Runtime StackNFA
match Q1
9
read <b>
3
2
1
67
read <c>
5
3 9 7 6
2
1
12
8 6
match Q3 Q8
10
11
Q5 Q6Q4
c
cb
{Q1}
{Q3, Q8}
{Q2} {Q4}
{Q6}
{Q5}{Q7}
a *
c
c
* c
c
*
b
1
4
3 5
8
6
12
10
27
11
13
9

Yanlei Diao, University of Massachusetts Amherst 04/20/23
Implementation Choices
Non-Deterministic Automata (NFA) versus Deterministic Automata (DFA) • NFA: small machine, large numbers of transitions per element• DFA: potentially large machine, one transition per element
Worse-case comparison for regular expressions:Single regular expression of
length nm regular expressions
compiled together
Processingcomplexity
Machinesize
Processingcomplexity
Machinesize
NFA O(n2) O(n) O(n2m) O(nm)
DFA O(1) O(n) O(1) O(nm)
YFilter specific
O(3n) O(3nm)
Possible in practice? Restricted path expressions:

Yanlei Diao, University of Massachusetts Amherst 04/20/23
Eager DFA
Green et al. studied the size of DFA for the restricted set of path expressions [Green et al.’03]
Eager DFA: • Query compile time, translates from NFA to DFA• Creates a state for every possible situation that runtime may require
Single path:• Linear for “//e1/e2/e3/e4/e5”• Exponential for “//e1/*/*/*/e5”
Multiple paths:• Exponential for “//e1//b”, “//e2//b”, …, “//e5//b”

Yanlei Diao, University of Massachusetts Amherst 04/20/23
Start 1 8Acc-eptA
Not A
C D
Not A or D
3 6
5
C
2
4
7
A
C
C
A
A
Not A
Not A
A Not A Not A or C
Not A
Example of an Eager DFA
The DFA size is O(2w+1) where w is the number of ‘*’.
//a/*/*/c/d
“a a b b c d”“a b a b c d”
Need to remember all the A’s in three consecutive characters, as different combinations of As may yield different results.

Yanlei Diao, University of Massachusetts Amherst 04/20/23
Lazy DFA
Lazy DFA is constructed at run time, on demand
• Initially, it has a single state
• Whenever it attempts to make a transition into a missing state, it computes it and updates the transition
Hope: only a small set of the DFA states is needed.
Exploits DTD to derive upper bounds…

Yanlei Diao, University of Massachusetts Amherst 04/20/23
Lazy DFA (contd.)
A DTD graph is simple, if the only loops are self-loops.
Theorem: the size of a lazy DFA is exponential only in the maximal number of simple cycles a path can intersect.
• Not exponential in # of paths!
DTD graph
More complex recursive DTDs: e.g. “table” contains “list” and “list” contains “table”• Even a lazy DFA grows large…

Yanlei Diao, University of Massachusetts Amherst 04/20/23
Predicates in Path Expressions
Predicates can address attributes, text data, or positions of elements. • Value of an attribute in an element, e.g.,
//section[@difficulty = “easy”].• Text data of an element, e.g.,
//section/title[text()=“XPath”].• Position of an element, e.g.,
//section/figure[text()=“XPath”][1].

Yanlei Diao, University of Massachusetts Amherst 04/20/23
Predicate Evaluation
Extend the NFA • including additional states representing successful evaluation of
predicates and transitions to them
Potential problems• A potentially huge increase in the machine size• Destroy sharing of path expressions
Recent work [Gupta & Suciu 2003]: Possible to build an efficient pushdown automaton using lazy construction if• No mixed content, such as “<a> 1 <b> 2 </b> </a>”.• Can afford to periodically rebuild the automaton from scratch. • Can afford to train the automaton in each construction.

Yanlei Diao, University of Massachusetts Amherst 04/20/23
YFilter: Mapping XML to Relational
P1: //section//figure P2: //section/section/figure
4 524 5
2 7
2 5
path-tuple streams
Shared Path Matching
Engine
section
*
{P2}
section
figure
{P1}
figure
*
3title
section4
figure5
title6
2section
7
figure
title8
1
report
XQuery is much more complex than regular expressions. Leverage efficient relational processing XQuery stream processing = relational operations on pathtuple
streams!
parsing events

Yanlei Diao, University of Massachusetts Amherst 04/20/23
Example Query Plan for FWR
Query iF: //section
W1://section/title
W2://section/figure/title
R1://section//section//title
R2://section//figure
Q1:for $s in $doc//section[@difficulty=“easy”]where $s/title = “Pub/Sub” and $s/figure/title = “XML processing”return <section>
{ $s//section//title } { $s//figure }
</section>
ΠDupElim
OuterJoin-Select
Shared Path Matching Engine
F W1 W2 R1 R2
An external (post-processing) plan for each query: Selection: evaluates value-based predicates. Projection: projects onto specific fields and removes duplicates. Semijoin: handles correlations between for and where paths, finds query
matches. Outerjoin-Select: handles correlations between for and return paths,
generates query results.
Push all paths into the path engine.
ΠDupElim ΠDupElim

Yanlei Diao, University of Massachusetts Amherst 04/20/23
Buffering in XML Stream Processing
Buffering in XQuery stream processing• Whether an element belongs to the result set is
uncertain as it depends on predicates that have not been evaluated
Goal: avoid materialization and minimize buffering Issues to address:
• What queries require buffering?• What elements to buffer?• When and how to prune dynamic data structures?
− Buffered elements− State in any dynamic data structures

Yanlei Diao, University of Massachusetts Amherst 04/20/23
Buffering in XML Stream Processing
//section[title=“XML”] Requires buffering?
• Yes
What element to buffer?• section
When to prune?• until the end of section is
encountered, or • when no more title can occur in
this section. (Use DTD!)
“XML processing”
“Pub/Sub”
“Scalability”
@id=“intro”
@difficulty=“easy”
@source=“g1.jpg”
@source=“g2.jpg”3
titlesection
4
figure5
title
6
2section
7
figure
title
8
1
report
@difficulty=“easy”

Yanlei Diao, University of Massachusetts Amherst 04/20/23
Buffering in XML Stream Processing
//section[figure=“XML”]/title Requires buffering?
• Yes What element to buffer?
• title When to prune?
• until a figure matches the predicate, or
• when no more figure can occur in this section (use DTD!), or
• the end of section is encountered.
“XML processing”
“Pub/Sub”
“Scalability”
@id=“intro”
@difficulty=“easy”
@source=“g1.jpg”
@source=“g2.jpg”3
titlesection
4
figure5
title
6
2section
7
figure
title
8
1
report
@difficulty=“easy”

Yanlei Diao, University of Massachusetts Amherst 04/20/23
Buffering in XML Stream Processing
Requires buffering? • Yes
What element to buffer?• figure • figure5?, figure7?
When to prune?• until the predicate is true, or • when no more title can occur in this
section (use DTD!), or• the end of section is encountered.
“XML processing”
“Pub/Sub”
“Scalability”
@id=“intro”
@difficulty=“easy”
@source=“g1.jpg”
@source=“g2.jpg”3
titlesection
4
figure5
title
6
2section
7
figure
title
8
1
report
@difficulty=“easy”
//section[contains(.//title, “XML”)]//figure

Yanlei Diao, University of Massachusetts Amherst 04/20/23

Yanlei Diao, University of Massachusetts Amherst 04/20/23
XQuery Usage Scenarios
XML message brokers• Simple path expressions, for authentication, authorization, routing, etc.
• Single input message, relatively small
• Transient and streaming data (no indexes)
XML transformation language in Web Services• Large and complex queries, mostly for transformation
• Input message + external data sources, small/medium sized data sets (xK -> xM)
• Transient and streaming data (no indexes)
Semantic data verification• Mostly messages
• Potentially complex (but small) queries
• Streaming and multi-query optimization required

Yanlei Diao, University of Massachusetts Amherst 04/20/23
Example 2 of an Eager DFA
Start
1 2 3 S1
A
B C DA Not C , D or E
Not C or E
Not A , B or E
C
4 5 6 S2F G H
E
Not A , G or H
Not G , or A
G
E
Not A , E or FA
EE
7
8
F
A
9
B
10 S1D
CC
11 S2H
Not C , G or HG
Not C , D or G
E
CNot C or F
G
C
G
A
G
Not A , B or G
E
A
Matching AB
Matching EF
Matching AB and EF
Not A or E
//a/b//c/d
//e/f//g/h
If there are l patterns with “//” per pattern, we need O(2^l) states to record the matching of the power set of the prefixes.
The DFA size is O((x+1)l), where x is the number of “//” per pattern, and l is the number of patterns.

Yanlei Diao, University of Massachusetts Amherst 04/20/23
Full XQuery Stream Processor
Stream processing for the entire XQuery language [Daniela et al.’04]
• General algebra for XQuery• Pull-based token-at-a-time execution model• Lazy evaluation (like other functional languages)• Some preliminary work on sharing [Diao et al.’04]
Buffering is a big concern [Barton et al.’03, Peng&Chawathe’03, Koch et al.’04]
Sharing is crucial for performance and scalability for large numbers of queries