gpx-matcher - a generic boolean predicate-based xpath expression matcher mohammad sadoghi, ioana...

57
GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans- Arno Jacobsen Middleware Systems Research Group University of Toronto MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG EDBT’2011 An X-ToPSS Project http://msrg.org/tags/x-tops

Upload: donald-sullivan

Post on 21-Jan-2016

239 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

GPX-Matcher - A Generic Boolean Predicate-based

XPath Expression Matcher

Mohammad Sadoghi, Ioana Burcea, and Hans-Arno JacobsenMiddleware Systems Research Group

University of Toronto

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORG

EDBT’2011

An X-ToPSS Project

http://msrg.org/tags/x-topss

Page 2: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGThe Problem in a Nutshell

XPath Expressions (XPE)(Millions of XPE)

XML Filtering

Matched XPE

XML

Matched Subscriptions

Event/Publication

Subscriptions(Boolean Expressions)

Pub/Sub Engine

Page 3: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGPublish/Subscribe Systems

Broker

Publisher Publisher

Subscriber Subscriber

Subscriptions

Publications

NotificationNotification

IBM=84

MSFT=27 INTC=19 JNJ=58ORCL=12

HON=24

AMGN=58

Stock marketsNYSE

NASDAQTSX

Subscriptions:IBM > 85

ORCL < 10JNJ > 60

3X-ToPSS & GPX-Matcher

Page 4: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGPub/Sub Matching Algorithms• Rete algorithm [Forgy, late 70s]

– A graph-structure to correlate events, process rules (solves a more general problem)

• SIFT [Yan et al. TODS‘94]– Predicate counting et al.

• Gough algorithm [Gough et al. ACSC‘95]– Based on a finite state representation of subscriptions

• Gryphon algorithm [Aguilera, et al. PODC‘99]– Decision tree over predicates

• Clustering algorithm [Fabret et al. SIGMOD‘01]– Clusters subscriptions based on common predicates

• k-Index [Whang et al. VLDB‘09]• Hardware-based matching acceleration [Sadoghi et al. VLDB‘10]• BE-Tree [Sadoghi & Jacobsen, SIGMOD’2011]

4X-ToPSS & GPX-Matcher

Page 5: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGThe Key Question?

Can XML Filtering be benefited from the efficient publish/subscribe matching

algorithms that have been developed for more than three decades?

5X-ToPSS & GPX-Matcher

Page 6: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGXML Filtering Challenges

• Filter XML according to XPEs

• Efficiently, at Internet-scale, for millions of XPEs, and for many XML documents per unit of time

6X-ToPSS & GPX-Matcher

XPath Expressions (XPE)(Millions of XPE)

Matched XPE

XML

Page 7: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGXML Filtering Systems• Growing need for XML filtering

– Application-level firewalls– Maleware detection and prevention– Document routing– RSS aggregators– XML-based messaging and application integration

• Selected industry players (XML appliances)– SolaceSystems– IBM DataPower– Talerian– Sarvega (Intel)

7X-ToPSS & GPX-Matcher

• XML filtering systems are publish/subscribe systems

• XPath & XML are subscription and publication, respectively

Page 8: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGThe Core Problem

• XML Document Filtering Problem– Given a set of XPath expressions Q and an XML

document d, find all expressions in Q that are matched by d

• An expressions q is matched by an XML document d if and only if q selects a non-empty set of nodes in d– XPath expressions are used to select entire

documents or fragments of documents

8X-ToPSS & GPX-Matcher

Page 9: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGAgenda

• Supported XPath Language• Mapping XML Filtering to Pub/Sub Matching

– XPath encoding– XML encoding

• Experimental results• Outlook

9X-ToPSS & GPX-Matcher

Page 10: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGXML and XPath

<section>

<subsection> <figure> … </figure> </subsection> <figure> … </figure></section>

section

subsection

figure

figure

XML fragment XML tree XML paths

section-subsection-figure

section-figure

XPath queries

/section/subsection/figure

section/figure

/section//subsection/figure

section//figure

/section/*/figure

*/figure

location step

child operator

descendent operator wildcards

absolute query

relative query

10

Page 11: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORG

X-ToPSS & GPX-Matcher 11

XPath 2.0 Subset Considered• Absolute path expressions

– /a/b• Relative path expressions

– a/b/c• Descendant operators in path expressions

– a/b//a/d• Wildcards in path expressions

– a/*/*/b• Not discussed, but shown how to address

– Filter predicates in path expressions• <path>[@x>1]/<path>

– Nested path filters (the XPE becomes a tree)• <path>[a/b]/<path>

Page 12: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGAgenda

• Supported XPath Language• Mapping XML Filtering to Pub/Sub Matching

– XPath encoding– XML encoding

• Experimental results• Outlook

12X-ToPSS & GPX-Matcher

Page 13: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGOur Question(s)

• How can we map XPath expressions onto subscriptions?– Conjunctive Boolean formula over predicates– S = (a1 op v1) (a2 op v2) … (an op vn)

• How can we map XML documents onto publications?– Set of attribute-value pairs– P = {(a1, v1), (a2, v2), …, (am, vm)}

13X-ToPSS & GPX-Matcher

Page 14: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGPredicate Calculus

• Single-tag predicate

• Double-tags predicate

• End-tag predicate

• Length-constraint predicate

voppt

v opppdtt

),( 21

vpt

vlength

14X-ToPSS & GPX-Matcher

Page 15: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGSingle-tag Predicate Example

• XPath expression/b/…

• Predicate

1 bp

15X-ToPSS & GPX-Matcher

Tag b at position 1

b

a

c

d

b-a-c

(b, 1), (a, 2), (c, 3)

Page 16: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGDouble-tags Predicate Example I

• XPath expression… a/b …

• Predicate

1 ),( ppdba

16X-ToPSS & GPX-Matcher

Distance between Tag a and Tag b is one location step

x

a d

x-a-b

(x, 1), (a, 2), (b, 3)

b

Page 17: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGDouble-tags Predicate Example II

17X-ToPSS & GPX-Matcher

Distance between Tag a and Tag b is at

least one location step

• XPath expressiona//b

• Predicate

1 ),( ppdba

a

x

b

d

a-x-b

(a, 1), (x, 2), (b, 3)

Page 18: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGEnd-tag Predicate Example

• XPath expression/a/*/*

• Predicate

2ap

18X-ToPSS & GPX-Matcher

Tag a at least two location steps away

from path end

a

x

y

d

a-x-y

(a, 1), (x, 2), (y, 3), (length, 3)

Page 19: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGLength-constraint Predicate Example

• XPath expression*/*/*

• Predicate 3 length

19X-ToPSS & GPX-Matcher

Length of the path is at least 3

x

y

z

d

x-y-z

(x, 1), (y, 2), (z, 3) (length, 3)

Page 20: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORG

Putting it Together:XPath Query Encoding Example

Q1: a/b//aQ2: a//b/dQ3: a/*/*/*//b/d

Q1: a1/b1//a2

Q2: a1//b1/d1

Q3: a1/*/*/*//b1/d1

1),( 21 ab

ppdQ1: 1),( 11 ba

ppd

P3 P4

P4P5 4),( 11 ba

ppd 1),( 11 db

ppdQ3:

1),( 11 ba

ppd 1),( 11 db

ppdQ2:

Our XPath encoding grows linearly in the size of the XPath expression

P1 P2

20

Page 21: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGXML Document Path Encodinga-b-c-d

a1-b1-c1-d1

(length, 4),

(a1, 1), (b1, 2), (c1, 3), (d1, 4)

(a1, b1, 1), (a1, c1, 2), (a1, d1, 3),

(b1, c1, 1), (b1, d1, 2),

(c1, d1, 1)

The resulting attribute-value “pairs” set has O(n2) tags.

Without duplicate tags

(i.e., all occurrence

numbers are 1)

Document path

Attribute-value pair

Publication

21

Page 22: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORG

Mapping XML Filtering to Pub/Sub Matching

Matched XPE

XML

Matched Subscriptions

Event/Publication

Subscriptions(Boolean Expressions)

Pub/Sub Engine

XPath Expressions (XPE) (Millions of XPE)

Page 23: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGMatching Algorithms

• Pick any pub/sub matching algorithm• We used

– Counting algorithm [exact origin is unknown]– Clustering algorithm [Fabret, Jacobsen et al.,

2001]• Both are two-phased matching algorithms

1. Predicate matching: Match all predicates.2. Subscriptions matching: Match subscriptions

using the result from step 1.

23X-ToPSS & GPX-Matcher

Page 24: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORG

Predicate Matching: Single Tag Predicate

=

1 3 42 Predicate value

voppt vpt vlength

(length, 4),

(a1, 1), (b1, 2), (c1, 3), (d1, 4)(a1, b1, 1), (a1, c1, 2), (a1, d1, 3),

(b1, c1, 1), (b1, d1, 2),

(c1, d1, 1)

a

Publication:

Predicate bit vector

Hash on the tag

i

i

1

1 ap

c 3 cp

0 0 0

24

with id i

j

with id j

Page 25: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGSubscription Matching:

Clustering Algorithm• Cluster queries based on the access predicates• Access predicates shared by all queries in cluster• Only check clusters whose access predicates are matched• Open Question: how to choose an effective access predicate

Access predicates

false

false

pipi

25X-ToPSS & GPX-Matcher

Page 26: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGExperimental Evaluation• All algorithms implemented in C

– GPX – the base encoding with counting– GPX-ap – the base encoding with clustering (access pred.)– YFilter & BPA

• DTDs used for generating workloads– NITF DTD (News Industry Data Format)– PSD DTD (Protein Sequence Database)

• Total filtering time averaged over 500 XML documents– XML parsing time is negligible in

the overall filtering time• Intel Quad-Core 2.66 GHz, 4GB

encodedXPath expressions

XML

26X-ToPSS & GPX-Matcher

Page 27: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGScalability in Number of XPEsAll XPEs are distinct

27X-ToPSS & GPX-Matcher

1 ms vs.

18 msap on first

ap on last

Page 28: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGScalability in Number of XPEsXPEs workload contains duplicates

28X-ToPSS & GPX-Matcher

Page 29: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGEffect of Path Length

29X-ToPSS & GPX-Matcher

Page 30: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGEffect of Wildcards

X-ToPSS & GPX-Matcher 30

Page 31: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGConclusions

• Novel XML/XPath encoding• Leverages existing matching techniques• Differs significantly from predominantly

automata-based related work• Outperforms related approach by an order of

magnitude under many experimental conditions

31X-ToPSS & GPX-Matcher

Page 32: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGThank You!

• To learn more about X-ToPSS, please see– http://msrg.org/tags/x-topss

32X-ToPSS & GPX-Matcher

Page 33: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORG

X-ToPSS & GPX-Matcher 33

Page 34: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGAgenda

• XML-based Filtering Systems• Mapping XML Filtering to Pub/Sub Matching

– XPath encoding– XML encoding

• Experimental results• Outlook

34X-ToPSS & GPX-Matcher

Page 35: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGContent-based Publish/Subscribe

• Subscription: Boolean expressions (i.e., an attribute-operator-value triple)

(subject = news) (topic = travel) (date > 21.2.2011)

• Publication (a.k.a. event): Sets of attribute-value pairs

(subject, news), (topic, travel), (date, 21.2.2011), …

35X-ToPSS & GPX-Matcher

Page 36: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGThe Pub/Sub Matching Problem

• Given an event, e, and a set of subscriptions, S, determine all subscriptions, s S, that match e.

subscriptions

event / publication

matches 36X-ToPSS & GPX-Matcher

Page 37: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGWide Applicability• Selective information dissemination• Location-based services• Personalization, alerting services• Application integration• Service & resource discovery• Network and distributed system management• Monitoring, surveillance, and control • Network and distributed system management• Workforce management• Workload management & job scheduling• Business activity monitoring• Business process management, monitoring, and execution

X-ToPSS & GPX-Matcher 37

Page 38: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGMatching Algorithm Techniques

• Amortized storage & processing• Access predicates• Cost model-driven subscription partitioning• Cache-conscious data structure layout• Asynchronous cache-level pre-fetching • Event queue re-ordering and batch processing• Parallelization of algorithms for SMP & multi-core• FPGA-based acceleration (hardware-level)

38X-ToPSS & GPX-Matcher

Page 39: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGeXtensible Markup Language

• XML – de facto standard for data exchange– Web Services, data and application integration,

information dissemination

• XPath – XML query language– Also used as basis for other query languages (e.g.,

XQuery, Xpointer, XSLT et al.)

39X-ToPSS & GPX-Matcher

Page 40: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGXML and XPath

<section>

<subsection> <figure> … </figure> </subsection> <figure> … </figure></section>

XML fragment

section

subsection

figure

figure

XML tree XML paths

section-subsection-figure

section-figure

XPath queries

/section/subsection/figure

section/figure

/section//subsection/figure

section//figure

/section/*/figure

*/figure

40X-ToPSS & GPX-Matcher

Page 41: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGXML and XPath

<section>

<subsection> <figure> … </figure> </subsection> <figure> … </figure></section>

section

subsection

figure

figure

XML fragment

XML tree XML paths

section-subsection-figure

section-figure

XPath queries

/section/subsection/figure

section/figure

/section//subsection/figure

section//figure

/section/*/figure

*/figure

location step

child operator

41

Page 42: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGXML and XPath

<section>

<subsection> <figure> … </figure> </subsection> <figure> … </figure></section>

section

subsection

figure

figure

XML fragment XML tree XML paths

section-subsection-figure

section-figure

XPath queries

/section/subsection/figure

section/figure

/section//subsection/figure

section//figure

/section/*/figure

*/figure

location step

child operator

descendent operator 42

Page 43: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGOur Research Goal

• Solve the XML filtering problem using content-based pub/sub matching algorithm.

• Why– Build on and exploit several decades worth of

insights, rather than construct special purpose solutions.

43X-ToPSS & GPX-Matcher

Page 44: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGIn a Nutshell

encodedXPath expressions

section

subsection

figure

figure

section-subsection-figure

section-figure

44X-ToPSS & GPX-Matcher

Page 45: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGSpecial purpose XML/XPath Filtering Algorithm

• XFilter [Altinel et al. VLDB‘00]• WebFilter [Pereira et al. VLDB’01]• YFilter [Diao et al. TODS‘03]• XTrie [Chan et al. ICDE‘03]• AFilter [Candan et al. VLDB‘06]• BPA [Huo & Jacobsen, ICDE‘06]• BoXFilter [Moro et al. VLDB‘07]• pFiST [Kwon et al. DKE’08]

45X-ToPSS & GPX-Matcher

Page 46: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGFrom XML Filtering to Publish/Subscribe Matching

• XPath expressions are encoded in a predicate calculus

• XML documents are expressed as a set of paths from the root to a leave in the document tree– Each path is translated into sets of attribute-value

pairs (tags and their location in the path)

• Matching algorithm– The attribute-value pairs are matched against the

predicates with traditional pub/sub matching algorithms

46X-ToPSS & GPX-Matcher

Page 47: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGPossibly Extensions

• Extend predicate calculus to encompass other XPath 2.0 features

• Alternative encodings• Exploit DTD or schema information• Exploit information about XPath expressions

processed

47X-ToPSS & GPX-Matcher

Page 48: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGX-ToPSS: XML-based Toronto Publish/Subscribe System

• Distributed, content-based publish/subscribe (cf. ICDCS’08)– Exploit DTDs (Document Type Definition) to optimize

subscription routing in distributed pub/sub systems– Explain covering and merging optimizations for

XML/XPath• Alternative predicate-based XML/Xpath

matching algorithm that cannot exploit traditional pub/sub schemes (cf. ICDE’06)

• Encoding presented herein, cf. EDBT’2011 (forthcoming)

http://msrg.org/tags/x-topss 48

Page 49: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGExample: XPath Query Encoding

a 1b1

b 1a2

1d1

1),( 21 ab

ppd

1),( 11 ba

ppd

4),( 11 ba

ppd

1),( 11 db

ppd

1),( 11 ba

ppdP1

P3

P4

P5

P2

=

=

=

1 3 421

2

3

4

5

Predicate identifier (pid)

49X-ToPSS & GPX-Matcher

Page 50: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORG

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORG

data tuplessubscriptions

query publication

Query and subscription are very similar.

Data tuples and publication are very similar.

However, the two problem statements are inverse.

That’s Like Data Base Querying !!

sets of tuples

Abo

ut p

ast

Abo

ut f

utur

e

sets of tuples

Page 51: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORG

X(length, 5),

(a1, 1), (b1, 2), (c1, 3), (b2, 4), (d1, 5)

(a1, b1, 1), (a1, c1, 2), (a1, b2, 3), (a1, d1, 4),

(b1, c1, 1), (b1, b2, 2), (b1, d1, 3),

(c1, b2, 1), (c1, d1, 2),

(b2, d1, 1)

a-b-c-b-d

a1-b1-c1-b2-d1

a1-b1-c1-b2-d1 a1-b1-c1-b1-d1

a1- -c1-b1-d1

(length, 5),

(a1, 1), (c1, 3), (b1, 4), (d1, 5)

(a1, c1, 2), (a1, b1, 3), (a1, d1, 4),

(c1, b1, 1), (c1, d1, 2),

(b1, d1, 1)

XML Document Path Encoding ExampleWith duplicate tags

51X-ToPSS & GPX-Matcher

Page 52: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORG

Example - XML Document Path Encoding(with Duplicates)a-b-c-b-d

a1-b1-c1-b2-d1

a1-b1-c1-b2-d1 a1-b1-c1-b1-d1X

a1- -c1-b1-d1(length, 5),

(a1, 1), (b1, 2), (c1, 3), (b2, 4), (d1, 5)

(a1, b1, 1), (a1, c1, 2), (a1, b2, 3), (a1, d1, 4),

(b1, c1, 1), (b1, b2, 2), (b1, d1, 3),

(c1, b2, 1), (c1, d1, 2),

(b2, d1, 1)

(length, 5),

(a1, 1), (c1, 3), (b1, 4), (d1, 5)

(a1, c1, 2), (a1, b1, 3), (a1, d1, 4),

(c1, b1, 1), (c1, d1, 2),

(b1, d1, 1)

a1-b1-c1-b2-d1

(length, 5),

(a1, 1), (b1, 2), (c1, 3), (b2, 4), (d1, 5)

(a1, b1, 1), (a1, c1, 2), (a1, b2, 3), (a1, d1, 4),

(b1, c1, 1), (b1, b2, 2), (b1, d1, 3),

(c1, b2, 1), (c1, d1, 2),

(b2, d1, 1)

X

(length, 5),

(a1, 1), (c1, 3), (b1, 4), (d1, 5)

(a1, c1, 2), (a1, b1, 3), (a1, d1, 4),

(c1, b1, 1), (c1, d1, 2),

(b1, d1, 1)52X-ToPSS & GPX-Matcher

Page 53: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORG

Predicate Matching: Double Tags Predicate

a 1b1

b 1a2

1d1

Hash on the first tag

Hash on (occ # first tag,

second tag, occ # second tag)

=

=

=

1 3 42

Predicate operator

Predicate value

v opppdtt

),( 21

(length, 4),

(a1, 1), (b1, 2), (c1, 3), (d1, 4)

(a1, b1, 1), (a1, c1, 2), (a1, d1, 3),

(b1, c1, 1), (b1, d1, 2),

(c1, d1, 1)

Publication:

53X-ToPSS & GPX-Matcher

Page 54: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGMatching Algorithm

1. Match all predicates (predicate matching) and record results in predicate bit vector

2. Match subscriptions based on predicate bit vector (subscriptions matching)

From here on forward, nothing new really (we re-use pub/sub matching algorithms, as promised.)

X-ToPSS & GPX-Matcher 54

Page 55: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORG

Subscription Matching: Counting Algorithm

Q1

Q2

Q3

222

For each query record the number of predicates

0$1

For each query count the number of satisfied

predicates

= Q2 is matched

1

5

34

2Q1

Q1

Q2

Q2, Q3

Q3

For each predicate associate queries that

contain itPredicates

453

432

211

PPQ

PPQ

PPQ

P3 P4match

55X-ToPSS & GPX-Matcher

Page 56: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGRelated Work: YfilterQ1: a/b//aQ2: a//b/dQ3: a/*/*/*//b/d

a b ε

**

ε*

b d

*

a

Q2, Q3

ε*

*

Q1

56

[Diao et al. TODS‘03]

Page 57: GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen Middleware Systems Research

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORGLonger-term Vision

• Map matching problems for different languages onto an efficient pub/sub matching kernel

• For example, for:– Graph-structured query / data (RSS, RQL)– Tree-structured query / data (XML / XPath)– Regular expressions / sentences– Etc.

57X-ToPSS & GPX-Matcher