Distributed Structural and Value XML Filtering

Iris Miliaraki and Manolis Koubarakis

Department of Informatics and Telecommunications, National and Kapodistrian University of Athens

*The paper will be presented at the “4th ACM International Conference on Distributed Event-Based Systems (DEBS 2010)”, Cambridge, UK.

9th Hellenic Data Management Symposium, Ayia Napa, Cyprus

Outline

• XML Filtering scenario
• Background: DHTs
• Structural matching
• Value matching
• Experiments
• Sum up and future work

XML Filtering scenario

[Figure: subscribers register XPath/XQuery queries with the XML filtering system; publishers submit XML documents to it.]

Centralized: YFilter, XTrie, FiST, Index-Filter, XPush

Distributed: ONYX, Gong et al. [ICDE05], Parallel/Hierarchical XTrie, Snoeren [SOSP 2001], Miliaraki [WWW 2008]



Background: DHTs

Structured overlay networks

• Solve the item location problem in a distributed and dynamic network of nodes (in O(log N) hops): let x be some data item. Find x!
• Distributed version of the hash table data structure: id = Hash(K)
• Main operations:
– Put: given a key (for a data item), map the key onto a node.
– Get: find the location of a data item with a given key.
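The put/get interface above can be sketched in a few lines. This is a single-process toy, not FreePastry or any real DHT: it skips routing (so there are no O(log N) hops) and node churn, and the names `ToyDHT`, `hash_id` and `responsible_node` are made up for illustration. The successor rule used to pick the responsible node is one common choice and is an assumption here.

```python
import hashlib
from bisect import bisect_left


def hash_id(key: str, bits: int = 32) -> int:
    """id = Hash(K): map a string key to an identifier on the ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


class ToyDHT:
    """Single-process stand-in for a DHT (no routing, no churn handling)."""

    def __init__(self, node_names):
        # Sort node identifiers so the successor of a key is found by bisection.
        self.ring = sorted((hash_id(name), name) for name in node_names)
        self.ids = [node_id for node_id, _ in self.ring]
        self.storage = {name: {} for name in node_names}

    def responsible_node(self, key: str) -> str:
        """Return the first node whose id is >= hash_id(key), wrapping around."""
        idx = bisect_left(self.ids, hash_id(key)) % len(self.ring)
        return self.ring[idx][1]

    def put(self, key, value):
        """Map the key onto a node and store the item there."""
        node = self.responsible_node(key)
        self.storage[node][key] = value
        return node

    def get(self, key):
        """Find the item by asking the node responsible for the key."""
        return self.storage[self.responsible_node(key)].get(key)


dht = ToyDHT([f"peer-{i}" for i in range(8)])
owner = dht.put("x", "some data item")
print(owner, dht.get("x"))
```

Any participant that knows the hash function can compute which node holds a key, which is what lets queries and documents meet at the same peer later on.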


[Figure: the XML filtering scenario with a DHT between subscribers (XPath/XQuery queries) and publishers.]

XML data model - example

Q1: /bib/*/author[text()="John Smith"]

Q2: /bib/phdthesis[@published=2005]/author[@nationality=greek]

Q3: /bib/article[@conf=www]

Q4: /bib/article[@year=2009]/author[@degree-from="UOA"]

Q5: /bib/article[@year=2009]/cite[@paper-id=2392]

Q6: /bib/article/cite[@paper-id=2770]







<bib>
  <article title="XML Filtering" conf="VLDB" year="2007">
    <school>Univ. of Athens</school>
    <author institute="Harvard">John Smith</author>
  </article>
</bib>













Structural matching vs. value matching

Automata-based approaches

• XFilter and YFilter, ONYX, XTrie, IndexFilter, FiST etc.
• Main idea:
– Construct an automaton from a set of XPath/XQuery queries
– Use it as a matching engine against the XML documents

Example NFA (YFilter)

Q1: /bib/phdthesis/year = ‘2008’
Q2: /bib/proceedings/school = ‘Univ. of Athens’
Q3: /bib/proceedings/title = ‘XML Dissemination’
Q4: /bib/*/author = ‘John Smith’
Q5: //*/cite [@id = 12743]

[Figure: shared NFA for Q1 to Q5 with states 0 to 11; transitions are labelled bib, phdthesis, proceedings, year, school, title, author, cite, * and ε; the accepting states are 3 (Q1), 5 (Q2), 6 (Q3), 8 (Q4) and 11 (Q5).]
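To make the prefix sharing in such an NFA concrete, here is a minimal sketch of a trie-shaped automaton built from the structural parts of Q1 to Q4. It is not YFilter itself: it only handles the child axis `/` and the `*` wildcard (so Q5, which uses `//`, is left out), it ignores value conditions, and the class name `PathNFA` is hypothetical.

```python
class PathNFA:
    """Trie-shaped NFA that shares common prefixes of path queries."""

    def __init__(self):
        self.transitions = [{}]  # state id -> {label: next state id}
        self.accepting = {}      # state id -> set of query ids

    def add_query(self, query_id, path):
        """Walk or extend the automaton along the steps of the path."""
        state = 0
        for label in path.strip("/").split("/"):
            nxt = self.transitions[state].get(label)
            if nxt is None:
                nxt = len(self.transitions)
                self.transitions.append({})
                self.transitions[state][label] = nxt
            state = nxt
        self.accepting.setdefault(state, set()).add(query_id)

    def match(self, element_path):
        """Return ids of queries matching a root-to-element tag path."""
        active = {0}
        for tag in element_path:
            nxt = set()
            for state in active:
                # Follow both the exact tag and the wildcard transition.
                for label in (tag, "*"):
                    if label in self.transitions[state]:
                        nxt.add(self.transitions[state][label])
            active = nxt
        matched = set()
        for state in active:
            matched |= self.accepting.get(state, set())
        return matched


nfa = PathNFA()
nfa.add_query("Q1", "/bib/phdthesis/year")
nfa.add_query("Q2", "/bib/proceedings/school")
nfa.add_query("Q3", "/bib/proceedings/title")
nfa.add_query("Q4", "/bib/*/author")

print(nfa.match(["bib", "phdthesis", "author"]))  # only Q4 matches
print(len(nfa.transitions))                       # 9 states for four queries
```

Q2 and Q3 share the states for `/bib/proceedings`, so adding Q3 costs a single new state; this sharing is what the automata-based approaches rely on.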

Distributed structural matching

• Utilize a distributed version of a state-of-the-art approach, YFilter
• Instead of a centralized NFA, distribute the NFA in the DHT

I. Miliaraki, Z. Kaoudi and M. Koubarakis. XML Data Dissemination using automata on top of structured overlay networks. In WWW 2008.

Distributed NFA


Structural matching!

What about value matching?



• Automata-based approaches efficient for structural matching

• Apart from defining a structural path, queries also contain value-based predicates

/dblp/phdthesis[@year=2005]/author[@nationality=greek]

• Our goal: Scale for both the size of the query set and the number of predicates per query

Definitions

Attribute predicates: element[@attr op value], e.g. /bib/phdthesis[@published=2007]

Textual predicates: element[text() op value], e.g. /bib/*/author[text()="John Smith"]

Direct evaluation with automaton/trie

• Treat predicates as elements!
• Lazy DFA [Gupta and Suciu, 2003]: the hope is that only a small set of DFA states will be computed at runtime

Q1: /dblp/phdthesis[@year=2005]/author[@nationality=greek]
Q2: /bib/*/author[text()=Michael Smith]
Q3: /bib/article/conference[text()=WWW 2009]

[Figure: NFA in which predicates are treated as elements, with extra transitions labelled year, nationality and text() next to bib, phdthesis, article, author, conference and *.]

Huge increase of NFA states!

Destroys sharing of path expressions!

Bottom-up evaluation

• Common rule in relational query optimization: apply selections as early as possible
• Works well for relational query processing
• pFiST [Kwon et al. 2005]

A lot of effort is spent evaluating predicates while the structure may not be matched

Step-by-step evaluation

• XPath queries consist of distinct steps
• Each step may contain one or more value-based predicates
• Perform value matching with structural matching in a stepwise manner
• YFilter – Inline [Diao et al. 2003]: process predicates when an NFA state is reached

Effort is spent evaluating predicates while the structure may not be fully matched

Top-down evaluation

• Check predicates after structural matching
• YFilter – Selection-Postponed [Diao et al. 2003]: performs predicate evaluation after the execution of the NFA
• VA-RoXSum [Vagena et al. 2007]: focus on message aggregation

Depending on predicate selectivity, the number of false positives may be very large

Moving on to details

Parse the XML document and generate a set of candidate predicates to perform predicate evaluation

CP1: article[@title="XML Filtering"]
CP2: article[@conf=VLDB]
CP3: article[@year=2007]
CP4: author[text()="John Smith"]
CP5: author[@institute=Harvard]

[Figure: enriched parsing events yield the candidate predicates.]
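Generating candidate predicates from a document can be sketched with the standard library parser. The function name `candidate_predicates` and the exact string form of a predicate are assumptions for illustration, and the sample document is a trimmed version of the running example (the school element is left out so that the output lines up with CP1 to CP5).

```python
import xml.etree.ElementTree as ET


def candidate_predicates(xml_text):
    """Return one candidate predicate per attribute and per text value."""
    root = ET.fromstring(xml_text)
    candidates = []
    for element in root.iter():  # document order
        for name, value in element.attrib.items():
            candidates.append(f'{element.tag}[@{name}="{value}"]')
        text = (element.text or "").strip()
        if text:  # skip elements that only contain whitespace
            candidates.append(f'{element.tag}[text()="{text}"]')
    return candidates


doc = """
<bib>
  <article title="XML Filtering" conf="VLDB" year="2007">
    <author institute="Harvard">John Smith</author>
  </article>
</bib>
"""

for cp in candidate_predicates(doc):
    print(cp)
```

A streaming parser would emit the same predicates as parsing events arrive; the tree API is used here only to keep the sketch short.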

• Step-by-step evaluation
• Top-down evaluation
• Top-down evaluation with pruning



Step-by-step evaluation

• Associate NFA states with relevant predicate information organized using a hash index

• At each step of the execution:
– check predicates
– update the list of partially matched queries Q
– continue with expanding the state if Q is not empty

Check value predicates while matching the structure

Example

[Figure: example NFA with states 0 to 8 and transitions labelled bib, phdthesis, *, article, author, conference and cite; two states are shown with their predicate tables.]

PREDICATE       QUERY LIST
true            {Q1,Q2,Q3,Q4,Q5,Q6}

PREDICATE       QUERY LIST
true            {Q6}
[@conf=www]     {Q3}
[@year=2009]    {Q4,Q5}

Candidate predicates

CP1: article[@title="XML Filtering"]
CP2: article[@conf=VLDB]
CP3: article[@year=2007]
CP4: author[text()="John Smith"]
CP5: author[@institute=Harvard]

At each step of the execution:
1. check predicates
2. update the list of partially matched queries Q
3. continue with expanding the state if Q is not empty
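The three steps can be sketched on a single machine as follows. This is an illustrative simplification, not the distributed implementation: it assumes at most one predicate per query step, follows one root-to-element path of the document, and uses made-up names (`State`, `index_query`, `step_by_step`) and a made-up string form for predicates.

```python
from collections import defaultdict


class State:
    def __init__(self):
        self.children = {}                    # label -> State
        self.by_predicate = defaultdict(set)  # predicate or "true" -> query ids
        self.accepts = set()                  # queries whose path ends here


def index_query(root, qid, steps):
    """steps: list of (label, predicate or None), one entry per XPath step."""
    state = root
    for label, predicate in steps:
        state = state.children.setdefault(label, State())
        state.by_predicate[predicate or "true"].add(qid)
    state.accepts.add(qid)


def step_by_step(root, path):
    """path: list of (tag, set of predicates that hold for that element)."""
    frontier = [(root, None)]  # (state, surviving queries); None = unrestricted
    for tag, holding in path:
        nxt = []
        for state, alive in frontier:
            for label in (tag, "*"):
                child = state.children.get(label)
                if child is None:
                    continue
                # 1. check predicates through the hash index of the state
                passed = set(child.by_predicate.get("true", ()))
                for p in holding:
                    passed |= child.by_predicate.get(p, set())
                # 2. update the list of partially matched queries Q
                q = passed if alive is None else passed & alive
                # 3. continue expanding only if Q is not empty
                if q:
                    nxt.append((child, q))
        frontier = nxt
    matched = set()
    for state, alive in frontier:
        if alive is not None:
            matched |= state.accepts & alive
    return matched


root = State()
index_query(root, "Q1", [("bib", None), ("phdthesis", "@published=2005"), ("author", "@nationality=greek")])
index_query(root, "Q2", [("bib", None), ("*", None), ("author", "text()=John Smith")])
index_query(root, "Q3", [("bib", None), ("article", "@conf=www")])
index_query(root, "Q4", [("bib", None), ("article", "@year=2009"), ("author", "@degree-from=UOA")])
index_query(root, "Q5", [("bib", None), ("article", "@year=2009"), ("cite", "@paper-id=2392")])
index_query(root, "Q6", [("bib", None), ("article", None), ("cite", "@paper-id=2770")])

document_path = [
    ("bib", set()),
    ("article", {"@title=XML Filtering", "@conf=VLDB", "@year=2007"}),
    ("author", {"text()=John Smith", "@institute=Harvard"}),
]
result = step_by_step(root, document_path)
print(result)
```

With these queries the article state ends up with the table `true {Q6}`, `@conf=www {Q3}`, `@year=2009 {Q4,Q5}`, and on the sample path only Q2 survives: the branch through article is abandoned at the author step because no query remains in Q.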


Top-down evaluation

• Execute the distributed NFA
• Only check predicates if an accepting state is reached
• Each peer uses a local index mapping predicates to the list of queries that contain them (hash index)

Delay value matching until after structural matching

Example

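[Figure: example NFA with states 0 to 8 and transitions labelled bib, phdthesis, *, article, author, conference and cite; accepting states are shown with their local predicate index.]

PREDICATE           QUERY LIST
[@conf=WWW]         {Q3}
[@paper-id=2770]    {Q6}
[@paper-id=2392]    {Q5}
[@year=2009]        {Q5}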
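A minimal sketch of the value-matching phase of top-down evaluation, under stated assumptions: the set of structurally matched queries is taken as given (it would come from the NFA execution), the document's candidate predicates below are invented for the example, and a query is accepted when all of its predicates appear among the candidates, without checking that they hold on the specific elements the structure matched. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_predicate_index(query_predicates):
    """Hash index: predicate -> set of queries that contain it."""
    index = defaultdict(set)
    for qid, predicates in query_predicates.items():
        for p in predicates:
            index[p].add(qid)
    return index


def top_down(structural_matches, query_predicates, index, candidates):
    """Check predicates only for queries that reached an accepting state."""
    hits = Counter()
    for cp in set(candidates):
        for qid in index.get(cp, ()):
            if qid in structural_matches:
                hits[qid] += 1
    # A query survives if every one of its predicates was found.
    return {q for q in structural_matches if hits[q] == len(query_predicates[q])}


query_predicates = {
    "Q3": {"article[@conf=www]"},
    "Q5": {"article[@year=2009]", "cite[@paper-id=2392]"},
    "Q6": {"cite[@paper-id=2770]"},
}
index = build_predicate_index(query_predicates)

# Hypothetical document: structure matches Q3, Q5 and Q6.
candidates = ["article[@conf=www]", "article[@year=2009]", "cite[@paper-id=2770]"]
result = top_down({"Q3", "Q5", "Q6"}, query_predicates, index, candidates)
print(result)
```

Q5 is a structural match but a value miss, which is exactly the kind of false positive that makes this method expensive when predicates are selective.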




Top-down evaluation with pruning

• At each step of the execution, part of the NFA is revealed
• Applies to equality predicates

IDEA: Use a compact summary of predicate information to stop NFA execution (prune) if we can deduce that no match can be found

[Figure: example NFA with states 0 to 8 and transitions labelled bib, phdthesis, *, article, author, conference and cite.]

TD with pruning – Details

• Each peer responsible for storing many NFA fragments

• Each peer keeps one Bloom filter, the value filter (VF), which summarizes the predicates of the queries indexed in the relevant NFA fragments

• Assuming a peer p that stores a state st, for each query q whose NFA accepting path contains st, we insert one predicate of q in the VF of p

TD with pruning - Main idea cont.

• Predicates are inserted as a whole in VFs using their string representation:
– element[@attr=value] → element + attr + value
– element[text()=value] → element + text + value

• VFs are updated during query indexing

• Since we traverse the NFA accepting path of a query to index it, all relevant VFs will be updated
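A value filter along these lines can be sketched as a small Bloom filter keyed by the string representation of a predicate. The hashing scheme (SHA-1 with an index prefix), the default sizes and the names `ValueFilter`, `predicate_key` and `should_continue` are assumptions for illustration, not the parameters used in the experiments.

```python
import hashlib


class ValueFilter:
    """Bloom filter over predicate strings (m bits, k hash functions)."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different prefixes.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def predicate_key(element, kind, value):
    """element[@attr=value] -> element + attr + value; text() uses kind 'text'."""
    return element + kind + value


def should_continue(value_filter, candidate_keys):
    """Keep executing the NFA only if some candidate predicate may be indexed."""
    return any(value_filter.might_contain(key) for key in candidate_keys)


vf = ValueFilter()
# One predicate per query, inserted while the query is indexed.
vf.add(predicate_key("article", "conf", "www"))          # from Q3
vf.add(predicate_key("author", "text", "John Smith"))    # from Q2

doc_keys = [
    predicate_key("article", "conf", "VLDB"),
    predicate_key("author", "text", "John Smith"),
]
print(should_continue(vf, doc_keys))
```

A miss is always safe to prune on, because a Bloom filter has no false negatives; a hit only means execution has to continue and the predicates are checked exactly later.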

Example: Constructing Value filters

[Figure: a peer in the DHT is responsible for a fragment of the example NFA (states 0 to 8) and keeps an m-bit value filter, shown as 100…101.]

Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek]
Q2: /bib/*/author[text()="John Smith"]
Q3: /bib/article[@conf=www]
Q4: /bib/article[@year=2009]/author[@degree-from="UOA"]
Q5: /bib/article[@year=2009]/cite[@paper-id=2392]
Q6: /bib/article/cite[@paper-id=2770]

Select 1 predicate per query to insert

Example – Querying Value filters

[Figure: NFA execution over the DHT; before a state is expanded, the responsible peer checks its value filter (100…101).]

Step 1: expanding state 0, check value filter: MATCH! Execution continues
Step 2: expanding state 1, check value filter: MISS! Execution stops

Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek]
Q2: /bib/*/author[text()="John Smith"]
Q3: /bib/article[@conf=www]
Q4: /bib/article[@year=2009]/author[@degree-from="UOA"]
Q5: /bib/article[@year=2009]/cite[@paper-id=2392]
Q6: /bib/article/cite[@paper-id=2770]
Q7: //article[@year=2007]

[Figure: the example NFA extended with states 9 and 10 for Q7, reached through ε, * and article transitions.]

Online selectivity estimation

• TD with pruning selects one of the predicates of each query to insert in the value filter:
– randomly
– or... the most selective predicate

• Example: /bib/article[@year=2009]/author[text()="John Smith"]

• It is not feasible to store the entire set of XML data that have been processed by our system

Sampling!
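One way to estimate selectivity online from a bounded sample is reservoir sampling. The slides only say that sampling is used, so the choice of reservoir sampling, the class name `DocumentSample` and the definition of selectivity as "fraction of sampled documents that satisfy the predicate" are all assumptions made for this sketch.

```python
import random


class DocumentSample:
    """Fixed-size uniform sample of the documents seen so far (reservoir sampling)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.sample = []  # each entry: frozenset of one document's candidate predicates
        self.seen = 0
        self.rng = random.Random(seed)

    def observe(self, candidate_predicates):
        """Offer one processed document to the sample."""
        self.seen += 1
        entry = frozenset(candidate_predicates)
        if len(self.sample) < self.capacity:
            self.sample.append(entry)
        else:
            # Replace a random slot with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = entry

    def selectivity(self, predicate):
        """Estimated fraction of documents that satisfy the predicate."""
        if not self.sample:
            return 1.0
        return sum(predicate in doc for doc in self.sample) / len(self.sample)

    def most_selective(self, predicates):
        """Pick the predicate satisfied by the fewest sampled documents."""
        return min(predicates, key=self.selectivity)


sample = DocumentSample(capacity=100)
for i in range(10):
    predicates = {"article[@year=2009]"}
    if i == 0:
        predicates.add("author[text()=John Smith]")
    sample.observe(predicates)

query = ["article[@year=2009]", "author[text()=John Smith]"]
print(sample.most_selective(query))
```

In this toy stream every document is from 2009 but only one is by John Smith, so the author predicate is the better one to put in the value filter: it lets the filter reject far more documents.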


Bottom-up evaluation

• Queries are indexed in the network using their predicates
• For each distinct predicate in the query set, select a responsible peer using the DHT
• Each peer organizes its queries using a local index mapping predicates to the list of queries that contain them
• This indexing model resembles work from the area of Information Filtering

Check values as early as possible


Bottom-up evaluation cont.

• Construct the set of candidate predicates

• For each candidate predicate, contact the responsible peer:
– the peer checks its local index
– the peer performs structural matching locally
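The bottom-up flow can be sketched as below. Several details are assumptions made to keep the example small: peers are chosen by hashing modulo a fixed peer list rather than by a real DHT lookup, each query is indexed under every one of its predicates, the publisher ships the full candidate set and the document's tag paths with each request, and structural matching is a plain step-by-step comparison with `*` as wildcard. All names are hypothetical.

```python
import hashlib
from collections import defaultdict

PEERS = [f"peer-{i}" for i in range(4)]


def responsible_peer(predicate):
    """Stand-in for a DHT lookup: hash the predicate onto one of the peers."""
    digest = hashlib.sha1(predicate.encode("utf-8")).digest()
    return PEERS[int.from_bytes(digest, "big") % len(PEERS)]


# peer -> predicate -> list of (query id, path steps, all predicates of the query)
local_index = {peer: defaultdict(list) for peer in PEERS}


def index_query(qid, steps, predicates):
    """Store the query at the peer responsible for each of its predicates."""
    for p in predicates:
        local_index[responsible_peer(p)][p].append((qid, steps, frozenset(predicates)))


def path_matches(steps, path):
    """Structural check: same length, each step equals the tag or is '*'."""
    return len(steps) == len(path) and all(s in ("*", t) for s, t in zip(steps, path))


def publish(candidates, element_paths):
    """For each candidate predicate, ask the responsible peer for matches."""
    matched = set()
    for cp in candidates:
        peer = responsible_peer(cp)
        for qid, steps, predicates in local_index[peer].get(cp, []):
            # Values first (local index hit), then structure, checked at the peer.
            if predicates <= candidates and any(path_matches(steps, p) for p in element_paths):
                matched.add(qid)
    return matched


index_query("Q2", ["bib", "*", "author"], {"author[text()=John Smith]"})
index_query("Q3", ["bib", "article"], {"article[@conf=www]"})

candidates = {
    "article[@title=XML Filtering]",
    "article[@conf=VLDB]",
    "article[@year=2007]",
    "author[text()=John Smith]",
    "author[@institute=Harvard]",
}
paths = [["bib"], ["bib", "article"], ["bib", "article", "author"]]
result = publish(candidates, paths)
print(result)
```

Here structural matching is only attempted for queries whose predicate was actually found in the document, which is the "check values as early as possible" idea; the cost is one peer contact per candidate predicate.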



Example

[Figure: for each candidate predicate, the responsible peer is located through the DHT.]



Experiments

• Implemented methods using the FreePastry release in Java
• Environment
– Cluster (http://www.grid.tuc.gr/): 28 machines (4 peers per machine)
– 253 peers from the PlanetLab network
• Queries
– Sets of 1000000 queries
– 1 to 8 predicates per query
• Data
– NITF DTD
• Bloom filter size
– 100K bits

Cluster (2 predicates per query)

Cluster (4 predicates per query)

Network traffic

Sum up & future work

• Described methods to combine both structural and value XML filtering in a distributed environment
• Experimental evaluation of our methods
• Future work
– Potential improvements for the SBS method
– More sophisticated methods for selectivity estimation
– Range predicates
– Textual predicates

Questions?

Planetlab (2 predicates per query)

Performance improvement

Structural vs. value matching (small query set)

Structural vs. value matching (large query set)

<?xml version="1.0" encoding="UTF-8"?>
<statuses>
  <status>
    <created_at>Tue Apr 07 22:52:51 +0000 2009</created_at>
    <id>1472669360</id>
    <text>At least I can get your humor through tweets. RT @abdur: I don't mean this in a bad way, but genetically speaking your a cul-de-sac.</text>
    <source><a href="http://www.tweetdeck.com/">TweetDeck</a></source>
    <truncated>false</truncated>
    <in_reply_to_status_id></in_reply_to_status_id>
    <in_reply_to_user_id></in_reply_to_user_id>
    <favorited>false</favorited>
    <in_reply_to_screen_name></in_reply_to_screen_name>
    <user>
      <id>1401881</id>
      <name>Doug Williams</name>
      <screen_name>dougw</screen_name>
      <location>San Francisco, CA</location>
      <description>Twitter API Support. Internet, greed, users, dougw and opportunities are my passions.</description>
      <profile_image_url>http://s3.amazonaws.com/twitter_production/profile_images/59648642/avatar_normal.png</profile_image_url>
      <url>http://www.igudo.com</url>
      <protected>false</protected>
      <followers_count>1027</followers_count>
      <profile_background_color>9ae4e8</profile_background_color>
      <profile_text_color>000000</profile_text_color>
      <profile_link_color>0000ff</profile_link_color>
      <profile_sidebar_fill_color>e0ff92</profile_sidebar_fill_color>
      <profile_sidebar_border_color>87bc44</profile_sidebar_border_color>
      <friends_count>293</friends_count>
      <created_at>Sun Mar 18 06:42:26 +0000 2007</created_at>
      <favourites_count>0</favourites_count>
      <utc_offset>-18000</utc_offset>
      <time_zone>Eastern Time (US &amp; Canada)</time_zone>
      <profile_background_image_url>http://s3.amazonaws.com/twitter_production/profile_background_images/2752608/twitter_bg_grass.jpg</profile_background_image_url>
      <profile_background_tile>false</profile_background_tile>
      <statuses_count>3390</statuses_count>
      <notifications>false</notifications>
      <following>false</following>
      <verified>true</verified>
    </user>
    <geo/>
  </status>
  ... truncated ...
</statuses>
