XML Stream Processing Dan Suciu www.cs.washington.edu/homes/suciu Joint work with faculty, visitors and students at UW


Page 1: XML Stream Processing - CoverPagesxml.coverpages.org/Suciu-xmlstream.pdf · The Problem •Given: – Large number of Xpath expressions – Incoming stream of XML documents • Decide

XML Stream Processing

Dan Suciu
www.cs.washington.edu/homes/suciu

Joint work with faculty, visitors and students at UW


Introduction

• This is a research project at UW
• Partially supported by MS
• Two parts:
– A free toolkit of command-line tools: xsort, xagg, ... (www.cs.washington.edu/homes/suciu/XMLTK)
– Research on XML stream processing (this talk)


The Problem

• Given:
– Large number of XPath expressions
– Incoming stream of XML documents
• Decide for each document which expressions it matches


[Figure: XPath expressions and an XML data stream go in; decisions come out.]

XPath expressions:
/datasets/dataset
/datasets/dataset[history/text()="recent"]/title
/datasets/dataset//tableHead//*
/datasets/dataset//tableHead//*/text()="Galaxy"
/datasets/dataset/history
/datasets/dataset/tableHead
/datasets/dataset/tableHead/field

XML data stream:
<datasets><dataset> ... </datasets>
<datasets><dataset> ... </datasets>

Decisions


The Application(s)

• Selective Dissemination of Information [Berkeley]
• XML content routing [MIT]
• SOAP message routing in application servers
• Typical scale:
– 10,000 to 1,000,000 XPath expressions
– XML stream: 1KB/s ? 1MB/s ?


The Approaches

• Basic techniques
– NFA plus optimizations: XFilter/YFilter, XTrie
– DFA: we are doing this here
• Beyond the obvious
– SIX
– views


Background on NFA and DFA

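XPath expression: //a/b/a/a/b

[Figure, NFA: states 0 to 5. State 0 has a self-loop on *. The transitions 0 → 1 → 2 → 3 → 4 → 5 are labeled a, b, a, a, b. State 5 is the accepting state ($X).]

[Figure, DFA: states 0, 01, 02, 013, 014, 025. The forward transitions are labeled a, b, a, a, b. Back edges are labeled a, b, and [other]. State 025 is the accepting state ($X).]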
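The DFA in the figure can be reproduced with the textbook subset construction. The following is a minimal sketch, not the talk's implementation: the function names `determinize` and `name` and the `[other]` catch-all symbol are assumptions made for illustration, and the NFA is hard-wired to the shape `//l1/l2/.../ln` (a self-loop on state 0, then one labeled transition per step).

```python
def determinize(labels):
    """Subset construction for the NFA of //l1/l2/.../ln.

    NFA: state 0 loops on every tag; state i moves to i+1 on labels[i].
    Returns the set of reachable DFA states and the transition table.
    """
    n = len(labels)
    alphabet = sorted(set(labels)) + ["[other]"]  # "[other]" = any tag not in the expression
    start = frozenset({0})
    seen, todo, delta = {start}, [start], {}
    while todo:
        current = todo.pop()
        for sym in alphabet:
            target = set()
            for s in current:
                if s == 0:
                    target.add(0)              # self-loop of the leading //
                if s < n and labels[s] == sym:
                    target.add(s + 1)          # advance one step of the path
            target = frozenset(target)
            delta[current, sym] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return seen, delta


def name(state):
    """Render a DFA state the way the slide does, e.g. {0, 1, 3} -> '013'."""
    return "".join(str(s) for s in sorted(state))


states, delta = determinize(list("abaab"))    # the steps of //a/b/a/a/b
print(sorted(name(s) for s in states))
```

For a single linear expression without wildcards the DFA stays about as small as the NFA, which is the point of this slide.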


Background on NFA and DFA

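XPath expression: //a/*/*/*/b

[Figure, NFA: states 0 to 5. State 0 has a self-loop on *. The transitions 0 → 1 → 2 → 3 → 4 → 5 are labeled a, *, *, *, b.]

[Figure, DFA (without back edges): the states fan out level by level on a and [other]:
0
01
012, 02
0123, 023, 013, 03
01234, 0234, 0134, 034, ...
Transitions on b lead to the accepting states ($X), such as 02345, 0345, 0245, 045, ...]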
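To see how wildcards inflate the DFA, one can count the reachable subset-construction states for `//a/*/.../*/b` as the number of `*` steps grows. This is an illustrative sketch only; `count_dfa_states` is a hypothetical helper, and the three-symbol alphabet (`a`, `b`, `[other]`) is an assumption that collapses all other tags into one symbol.

```python
def count_dfa_states(k):
    """Number of reachable DFA states for //a/*/.../*/b with k wildcard steps."""
    labels = ["a"] + ["*"] * k + ["b"]
    n = len(labels)
    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        current = todo.pop()
        for sym in ("a", "b", "[other]"):
            target = set()
            for s in current:
                if s == 0:
                    target.add(0)                      # self-loop of the leading //
                if s < n and labels[s] in ("*", sym):
                    target.add(s + 1)                  # wildcard or matching label
            target = frozenset(target)
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return len(seen)


counts = [count_dfa_states(k) for k in range(4)]
for k, c in enumerate(counts):
    print(f"{k} wildcards: {c} DFA states")
```

The DFA has to remember which of the last few ancestors were `a`, so the count grows exponentially with the number of wildcards.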


Background on NFA and DFA

• Issue: need to linearize XPath expressions

1 XPath expression with filters:
/catalog/product[@category="tools"][sales/@price > 200]/quantity

4 linear XPath expressions:
/catalog/product/$Y
$Y/@category = "tools"
$Y/sales/@price
$Y/quantity

Extra processing: OK in trivial cases. Complex cases require more work (future).

For now: assume all XPath expressions are linear.


Basic NFA Evaluation

[Figure: a large list of XPath expressions (such as /datasets/dataset/history/text()) is compiled into NFAs. SAX events from the XML stream (<datasets><dataset> ... </datasets>) drive the NFAs. A STACK holds sets of NFA states, with the top marked "Current state": 1,55,99,... / 2,3,543,43,254 / 3,66,102,4534,...]
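The evaluation scheme in the figure can be sketched with Python's standard `xml.sax` module. This is a simplified illustration, not XFilter or the talk's code: the class name `NFAFilter`, the `(expression index, steps matched)` state encoding, and the restriction to linear paths built from `/`, `//`, tag names and `*` are all assumptions.

```python
import re
import xml.sax


def parse(xpath):
    """Split a linear XPath into (axis, label) steps: '//a/b' -> [('//','a'), ('/','b')]."""
    return re.findall(r"(//|/)([^/]+)", xpath)


class NFAFilter(xml.sax.ContentHandler):
    def __init__(self, xpaths):
        super().__init__()
        self.paths = [parse(x) for x in xpaths]
        # An NFA state is (expression index, number of steps matched so far).
        # The stack holds one SET of NFA states per open element.
        self.stack = [{(p, 0) for p in range(len(self.paths))}]
        self.matched = set()

    def startElement(self, name, attrs):
        nxt = set()
        for p, i in self.stack[-1]:
            steps = self.paths[p]
            if i == len(steps):
                continue                    # accepting state, no outgoing transitions
            axis, label = steps[i]
            if axis == "//":
                nxt.add((p, i))             # descendant axis: stay in place
            if label in ("*", name):
                nxt.add((p, i + 1))         # this element matches the next step
                if i + 1 == len(steps):
                    self.matched.add(p)
        self.stack.append(nxt)

    def endElement(self, name):
        self.stack.pop()                    # restore the parent's state set


xpaths = ["/datasets/dataset/history", "//tableHead//*", "/datasets/title"]
doc = b"<datasets><dataset><history/><tableHead><field/></tableHead></dataset></datasets>"
nfa_filter = NFAFilter(xpaths)
xml.sax.parseString(doc, nfa_filter)
print(sorted(nfa_filter.matched))
```

The cost per SAX event is proportional to the size of the current state set, which is why the number of expressions hurts NFA throughput.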


Basic DFA Evaluation

[Figure: a large list of XPath expressions (such as /datasets/dataset/history/text()) is compiled into DFAs. SAX events from the XML stream (<datasets><dataset> ... </datasets>) drive the DFAs. A STACK holds one DFA state per entry, with the top marked "Current state": 1 / 552 / 399]
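With a DFA, each stack entry is a single state and each SAX event costs one table lookup. The sketch below builds the DFA eagerly by subset construction and then runs it over SAX events. The names `build_dfa`, `DFAFilter` and the `[other]` symbol are illustrative assumptions, and only linear paths with `/`, `//`, tag names and `*` are handled.

```python
import re
import xml.sax

OTHER = "[other]"  # stands for every tag that no expression mentions


def parse(xpath):
    return re.findall(r"(//|/)([^/]+)", xpath)


def nfa_step(paths, state, sym):
    nxt = set()
    for p, i in state:
        steps = paths[p]
        if i == len(steps):
            continue
        axis, label = steps[i]
        if axis == "//":
            nxt.add((p, i))
        if label == "*" or label == sym:
            nxt.add((p, i + 1))
    return frozenset(nxt)


def build_dfa(xpaths):
    """Eager subset construction; DFA states are numbered 0, 1, 2, ..."""
    paths = [parse(x) for x in xpaths]
    alphabet = {l for steps in paths for _, l in steps if l != "*"} | {OTHER}
    start = frozenset((p, 0) for p in range(len(paths)))
    ids, table, accept, todo = {start: 0}, {}, {}, [start]
    while todo:
        s = todo.pop()
        accept[ids[s]] = {p for p, i in s if i == len(paths[p])}
        for sym in alphabet:
            t = nfa_step(paths, s, sym)
            if t not in ids:
                ids[t] = len(ids)
                todo.append(t)
            table[ids[s], sym] = ids[t]
    return alphabet, table, accept


class DFAFilter(xml.sax.ContentHandler):
    def __init__(self, xpaths):
        super().__init__()
        self.alphabet, self.table, self.accept = build_dfa(xpaths)
        self.stack = [0]                    # one integer DFA state per open element
        self.matched = set()

    def startElement(self, name, attrs):
        sym = name if name in self.alphabet else OTHER
        state = self.table[self.stack[-1], sym]   # a single lookup per event
        self.stack.append(state)
        self.matched |= self.accept[state]

    def endElement(self, name):
        self.stack.pop()


xpaths = ["/datasets/dataset/history", "//tableHead//*", "/datasets/title"]
doc = b"<datasets><dataset><history/><tableHead><field/></tableHead></dataset></datasets>"
dfa_filter = DFAFilter(xpaths)
xml.sax.parseString(doc, dfa_filter)
print(sorted(dfa_filter.matched))
```

The per-event work no longer depends on the number of expressions; the price is the size of the transition table.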


Comparison: Throughput in MB/s

[Chart: throughput for 1k, 10k, 100k, 1000k XPEs, with prob(*)=10% and prob(//)=10%. X axis: total input size, 5MB to 25MB. Y axis: throughput in MB/s, log scale from 0.0001 to 100. Series: parser, lazyDFA(1k), lazyDFA(10k), lazyDFA(100k), lazyDFA(1000k), xfilter(1k), xfilter(10k), xfilter(100k), xfilter(1000k).]


Number of States in DFA

Compute the DFA for 1,000,000 XPath expressions ???!!?

• 1 linear XPath → small DFA
• 1,000,000 linear XPaths → HUGE DFA


Number of States in DFA

//section//footnote
//figure//footnote
//table//footnote
. . . .
//abstract//footnote

n XPath expressions → 2^n states

Solution: lazy DFA!
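A lazy DFA computes a transition only when the data actually requires it, so only states reached by real documents are ever materialized. The sketch below illustrates the idea on the `//x//footnote` family. It is an assumption-laden illustration rather than the talk's implementation: `LazyDFAFilter` is a made-up name, the extra tags `t0` ... `t15` are invented to reach 20 expressions, and only linear paths are supported.

```python
import re
import xml.sax


def parse(xpath):
    return re.findall(r"(//|/)([^/]+)", xpath)


class LazyDFAFilter(xml.sax.ContentHandler):
    def __init__(self, xpaths):
        super().__init__()
        self.paths = [parse(x) for x in xpaths]
        start = frozenset((p, 0) for p in range(len(self.paths)))
        self.delta = {}           # (state, tag) -> state, filled in on demand
        self.states = {start}     # DFA states materialized so far
        self.stack = [start]
        self.matched = set()

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.delta:             # first time: run the NFA step once
            nxt = set()
            for p, i in state:
                steps = self.paths[p]
                if i == len(steps):
                    continue
                axis, label = steps[i]
                if axis == "//":
                    nxt.add((p, i))
                if label in ("*", tag):
                    nxt.add((p, i + 1))
            nxt = frozenset(nxt)
            self.states.add(nxt)
            self.delta[key] = nxt
        return self.delta[key]                # afterwards: a plain lookup

    def startElement(self, name, attrs):
        state = self.step(self.stack[-1], name)
        self.stack.append(state)
        self.matched |= {p for p, i in state if i == len(self.paths[p])}

    def endElement(self, name):
        self.stack.pop()


# Hypothetical workload: 20 expressions of the form //x//footnote.
tags = ["section", "figure", "table", "abstract"] + [f"t{i}" for i in range(16)]
xpaths = [f"//{t}//footnote" for t in tags]
doc = b"<doc><section><footnote/></section><figure><footnote/></figure></doc>"
lazy = LazyDFAFilter(xpaths)
xml.sax.parseString(doc, lazy)
print(len(lazy.states), "DFA states materialized for", len(xpaths), "expressions")
```

An eager construction for the same expressions would have to enumerate every combination of ancestors, whereas the lazy one only creates states for nesting patterns that occur in the stream.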


Number of States in the lazy DFA

                                             Synthetic XML data       Real XML data
Non-recursive or data-style recursive DTDs   Theorem: DFA is small    Theorem: DFA is small
Document-style recursive DTD                 DFA is HUGE              Theorem: DFA is small


Number of DFA States - SYNTHETIC Data

[Chart: number of DFA states, log scale from 1 to 100000, for the data sets simple, prov, ebBPSS, protein, nasa, treebank. Series: 1k XPEs, 10k XPEs, 100k XPEs.]


Number of DFA States - REAL Data

[Chart: number of DFA states, log scale from 1 to 100000, for the data sets protein, nasa, treebank. Series: 1k XPEs, 10k XPEs, 100k XPEs.]


Beyond the Obvious I: Stream IndeX (SIX)

Main observation:
• Parsing is the major bottleneck
• Skip portions of the XML document to avoid parsing and processing them


Stream IndeX (SIX)

XML:

<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book price="55">
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>

SIX:

[Figure: a table with the columns beginOffset and endOffset and one row per element of the document (bib, book, publisher, author, author, ...).]


Stream IndeX (SIX)

• API for SIX:
– skip(k), where k >= 0
– skips to the end of the k'th surrounding element
– Uses beginOffset to sync with the XML doc
– Uses endOffset to skip
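One way to picture `skip(k)` is as a lookup from beginOffset to endOffset. The sketch below is a toy under several assumptions that the slides do not confirm: the index is a plain dictionary, `build_six` uses a simplistic regular expression instead of a real parser, `k = 0` is taken to mean the innermost open element, and the function signatures are invented for illustration.

```python
import re

# Toy tag scanner: ignores comments, CDATA and processing instructions.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")


def build_six(xml):
    """Toy SIX builder: map beginOffset -> endOffset for every element."""
    six, stack = {}, []
    for m in TAG.finditer(xml):
        closing, _name, empty = m.groups()
        if closing:
            six[stack.pop()] = m.end()     # end tag closes the innermost open element
        elif empty:
            six[m.start()] = m.end()       # <x/> begins and ends in one tag
        else:
            stack.append(m.start())
    return six


def skip(six, open_begins, k):
    """Skip to the end of the k-th surrounding element (k = 0 is the innermost).

    open_begins lists the beginOffsets of the currently open elements,
    outermost first. Returns the new read position.
    """
    target = open_begins[-1 - k]                  # beginOffset syncs index and document
    del open_begins[len(open_begins) - 1 - k:]    # the skipped elements are now closed
    return six[target]                            # endOffset is where reading resumes


xml = ("<bib><book><author>A</author><title>T</title></book>"
       "<book><title>U</title></book></bib>")
six = build_six(xml)
open_begins = [xml.index("<bib>"), xml.index("<book>")]
pos = skip(six, open_begins, 0)                   # skip the rest of the first <book>
print(xml[pos:])
```

After the call, the reader resumes at the second `<book>` without having parsed the authors and title of the first one.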


Stream IndeX (SIX)

[Figure: a stream of XML documents (<datasets><dataset> ... </datasets>), each accompanied by its SIX, a short list of offsets such as 2050, 6630, 18872.]

The SIX stream is about 6% of the data stream, and it can be made MUCH smaller.


Throughput improvements from SIX (stable)

[Chart: X axis: XML stream (MB), 55 to 105. Y axis: MB/s, 0 to 35. Series: Theta=3% (SIX), Theta=3%, Theta=8% (SIX), Theta=8%, Theta=14% (SIX), Theta=14%.]


Beyond the Obvious II: View Selections

• On-going work: View selections header

[Figure: a stream of XML documents (<datasets><dataset> ... </datasets>), each preceded by a header holding the values 72, 30, 0.]

100x speedup on a hit


Conclusions

• Two ideas:
– Computing the DFA is possible!
– Use extra info to further speed up: SIX, Headers
• Issues:
– Extend DFAs to filters: process events
– How to represent SIX or Headers in XML


• Msdn.microsoft.com/webservices
• [email protected] contact