Distributed Structural and Value XML Filtering
Iris Miliaraki and Manolis Koubarakis
Department of Informatics and Telecommunications, National and Kapodistrian University of Athens
*The paper will be presented at the "4th ACM International Conference on Distributed Event-Based Systems (DEBS 2010)", Cambridge, UK.
9th Hellenic Data Management Symposium, Ayia Napa, Cyprus
Outline
XML Filtering scenario
Background: DHTs
Structural matching
Value matching
Experiments
Sum up and future work
XML Filtering scenario
[Figure: subscribers, publishers and an XML filtering system; subscriber queries are labeled "XPath/XQuery?"]
Centralized: YFilter, XTrie, FiST, Index-Filter, XPush
Distributed: ONYX, Gong et al. [ICDE05], Parallel/Hierarchical XTrie, Snoeren [SOSP 2001], Miliaraki [WWW 2008]
Background: DHTs
Structured overlay networks
Solve the item location problem in a distributed and dynamic network of nodes (in O(log N) hops): let x be some data item. Find x!
Distributed version of the hash table data structure: id = Hash(K)
Main operations:
Put: given a key (for a data item), map the key onto a node.
Get: find the location of a data item with a given key.
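The Put and Get operations can be illustrated with a minimal single-process sketch. It assumes a consistent-hashing ring in which the first node clockwise from Hash(K) is responsible for K. The class name `ToyDHT`, the peer names and the 32-bit identifier space are hypothetical, and the sketch looks up the responsible node directly instead of routing in O(log N) hops as a real DHT would.

```python
import bisect
import hashlib


class ToyDHT:
    """Single-process stand-in for a DHT: keys and nodes share one identifier ring."""

    def __init__(self, node_names, bits=32):
        self.space = 2 ** bits
        # Sort node identifiers so the successor of a key is found by bisection.
        self.ring = sorted((self._hash(name), name) for name in node_names)
        self.ids = [node_id for node_id, _ in self.ring]
        self.storage = {name: {} for name in node_names}

    def _hash(self, text):
        digest = hashlib.sha1(text.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % self.space

    def responsible_node(self, key):
        """id = Hash(K); the first node clockwise from id is responsible for K."""
        idx = bisect.bisect_left(self.ids, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

    def put(self, key, value):
        """Map the key onto a node and store the value there."""
        node = self.responsible_node(key)
        self.storage[node].setdefault(key, []).append(value)
        return node

    def get(self, key):
        """Find the values stored under the key at its responsible node."""
        return self.storage[self.responsible_node(key)].get(key, [])


dht = ToyDHT(["peer-%d" % i for i in range(8)])
stored_at = dht.put("bib", "NFA state reached by bib")
```

Because both operations hash the key the same way, `dht.get("bib")` is answered by the node that `put` returned.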
XML Filtering scenario
[Figure: the filtering scenario with subscribers, publishers and a DHT; subscriber queries are labeled "XPath/XQuery?"]
XML data model - example
Q1: /bib/*/author[text()="John Smith"]
Q2: /bib/phdthesis[@published=2005]/author[@nationality=greek]
Q3: /bib/article[@conf=www]
Q4: /bib/article[@year=2009]/author[@degree-from="UOA"]
Q5: /bib/article[@year=2009]/cite[@paper-id=2392]
Q6: /bib/article/cite[@paper-id=2770]
<bib>
  <article title="XML Filtering" conf="VLDB" year="2007">
    <school>Univ. of Athens</school>
    <author institute="Harvard">John Smith</author>
  </article>
</bib>
Structural matching
Value matching
Automata-based approaches
XFilter and YFilter, ONYX, XTrie, IndexFilter, FiST etc.
Main idea: construct an automaton from a set of XPath/XQuery queries
Use it as a matching engine against the XML documents
Example NFA (YFilter)
Q1: /bib/phdthesis/year = '2008'
Q2: /bib/proceedings/school = 'Univ. of Athens'
Q3: /bib/proceedings/title = 'XML Dissemination'
Q4: /bib/*/author = 'John Smith'
Q5: //*/cite [@id = 12743]
[Figure: NFA with states 0 to 11 and transitions bib, phdthesis, proceedings, *, year, school, title, author, cite and ε; accepting states are labeled Q1 to Q5]
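A minimal sketch of how a shared-prefix NFA can be built from path queries and run against a document path. It only covers child-axis steps and the `*` wildcard; the descendant axis (`//`, as in Q5) and value comparisons are left out. The class `PathNFA` and its methods are hypothetical names, not YFilter's API.

```python
class PathNFA:
    """Trie-shaped NFA: queries with a common prefix share states."""

    def __init__(self):
        self.transitions = [{}]    # state id -> {label: next state id}
        self.accepting = [set()]   # state id -> ids of queries accepted there

    def add_query(self, query_id, path):
        state = 0
        for label in path.strip("/").split("/"):
            nxt = self.transitions[state].get(label)
            if nxt is None:
                # No shared prefix beyond this point: create a new state.
                nxt = len(self.transitions)
                self.transitions.append({})
                self.accepting.append(set())
                self.transitions[state][label] = nxt
            state = nxt
        self.accepting[state].add(query_id)

    def match(self, element_path):
        """Return ids of queries matched by one root-to-element tag path."""
        active = {0}
        for tag in element_path:
            nxt = set()
            for state in active:
                # Follow both the exact label and the wildcard transition.
                for label in (tag, "*"):
                    if label in self.transitions[state]:
                        nxt.add(self.transitions[state][label])
            active = nxt
        matched = set()
        for state in active:
            matched |= self.accepting[state]
        return matched


nfa = PathNFA()
nfa.add_query("Q1", "/bib/phdthesis/year")
nfa.add_query("Q2", "/bib/proceedings/school")
nfa.add_query("Q3", "/bib/proceedings/title")
nfa.add_query("Q4", "/bib/*/author")
```

The four queries share the `bib` state, and Q2 and Q3 also share the `proceedings` state, so the automaton has 9 states instead of 13.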
Distributed structural matching
Utilize a distributed version of a state-of-the-art approach: YFilter
Instead of a centralized NFA, distribute the NFA in the DHT
I. Miliaraki, Z. Kaoudi and M. Koubarakis. XML Data Dissemination using automata on top of structured overlay networks. In WWW 2008.
Distributed NFA
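One way to picture the distribution is the sketch below. It assumes each NFA state is identified by the label path that leads to it and that hashing this key selects the responsible peer. Both the key format and the modulo-based peer selection are simplifying assumptions for illustration, as are the function names. A real DHT assigns keys through its identifier ring.

```python
import hashlib


def peer_for_state(state_key, peers):
    """Pick the peer responsible for an NFA state by hashing the state's key."""
    digest = hashlib.sha1(state_key.encode("utf-8")).digest()
    return peers[int.from_bytes(digest, "big") % len(peers)]


def distribute_nfa(paths, peers):
    """Assign every state on the paths of the queries to a peer.

    Returns {peer: set of state keys}; a state key is the label path that
    leads to the state, e.g. "/bib/article". The start state has key "/".
    """
    fragments = {peer: set() for peer in peers}
    fragments[peer_for_state("/", peers)].add("/")
    for path in paths:
        key = ""
        for label in path.strip("/").split("/"):
            key += "/" + label
            fragments[peer_for_state(key, peers)].add(key)
    return fragments


peers = ["p0", "p1", "p2", "p3"]
fragments = distribute_nfa(["/bib/article/cite", "/bib/*/author"], peers)
```

Each peer ends up holding a fragment of the NFA, and shared prefixes such as `/bib` are stored only once.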
Structural matching!
What about value matching?
• Automata-based approaches efficient for structural matching
• Queries apart from defining a structural path also contain value-based predicates
/dblp/phdthesis[@year=2005]/author[@nationality=greek]
• Our goal: Scale for both the size of the query set and the number of predicates per query
Definitions
Attribute predicates: element[@attr op value], e.g. /bib/phdthesis[@published=2007]
Textual predicates: element[text() op value], e.g. /bib/*/author[text()="John Smith"]
Direct evaluation with automaton/trie
Treat predicates as elements!
Lazy DFA [Gupta and Suciu, 2003]: the hope is that only a small set of DFA states will be computed at runtime
Q1: /dblp/phdthesis[@year=2005]/author[@nationality=greek]
Q2: /bib/*/author[text()=Michael Smith]
Q3: /bib/article/conference[text()=WWW 2009]
[Figure: NFA for Q1 to Q3 in which the predicates (year, nationality, text()) appear as additional transitions after the element transitions]
Huge increase of NFA states!
Destroys sharing of path expressions!
Bottom-up evaluation
Common rule in relational query optimization: apply selections as early as possible
Works well for relational query processing
pFiST [Kwon et al. 2005]
A lot of effort is spent evaluating predicates while the structure may not be matched
Step-by-step evaluation
XPath queries consist of distinct steps
Each step contains one or more value-based predicates
Perform value matching with structural matching in a stepwise manner
YFilter – Inline [Diao et al. 2003]: process predicates when an NFA state is reached
Effort is spent evaluating predicates while the structure may not be fully matched
Top-down evaluation
Check predicates after structural matching
YFilter – Selection-Postponed [Diao et al. 2003]: performs predicate evaluation after the execution of the NFA
VA-RoXSum [Vagena et al. 2007]: focus on message aggregation
Depending on predicate selectivity, the number of false positives may be very large
Moving on to details
Parse the XML document and generate a set of candidate predicates to perform predicate evaluation
CP1: article[@title="XML Filtering"]
CP2: article[@conf=VLDB]
CP3: article[@year=2007]
CP4: author[text()="John Smith"]
CP5: author[@institute=Harvard]
Enriched parsing events Candidate predicates
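Candidate predicate generation can be sketched with Python's standard `xml.etree.ElementTree` parser. The function name and the quoting of values in the output strings are assumptions for illustration. The sketch emits one predicate per attribute and one per non-empty text node, so it also produces a predicate for the `school` element, which the slide's CP list does not show.

```python
import xml.etree.ElementTree as ET


def candidate_predicates(xml_text):
    """List candidate predicates: one per attribute and one per non-empty text node."""
    predicates = []
    for elem in ET.fromstring(xml_text).iter():
        for attr, value in elem.attrib.items():
            predicates.append(f'{elem.tag}[@{attr}="{value}"]')
        text = (elem.text or "").strip()
        if text:
            predicates.append(f'{elem.tag}[text()="{text}"]')
    return predicates


doc = """<bib>
  <article title="XML Filtering" conf="VLDB" year="2007">
    <school>Univ. of Athens</school>
    <author institute="Harvard">John Smith</author>
  </article>
</bib>"""

cps = candidate_predicates(doc)
```

For the example document this yields six predicates, among them `article[@year="2007"]` and `author[text()="John Smith"]`.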
Step-by-step evaluation
Top-down evaluation
Top-down evaluation with pruning
Bottom-up evaluation
Step-by-step evaluation
• Associate NFA states with relevant predicate information organized using a hash index
• At each step of the execution:
– check predicates
– update list with partially matched queries Q
– continue with expanding state if Q not empty
Check value predicates while matching the structure
Example
[Figure: example NFA with states 0 to 8 and transitions bib, phdthesis, *, article, author, conference, cite]
Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek]
Q2: /bib/*/author[text()="John Smith"]
Q3: /bib/article[@conf=www]
Q4: /bib/article[@year=2009]/author[@degree-from="UOA"]
Q5: /bib/article[@year=2009]/cite[@paper-id=2392]
Q6: /bib/article/cite[@paper-id=2770]
PREDICATE QUERY LIST
true {Q1,Q2,Q3,Q4,Q5,Q6}
PREDICATE QUERY LIST
true {Q6}
[@conf=www] {Q3}
[@year=2009] {Q4,Q5}
Candidate predicates
CP1: article[@title="XML Filtering"]
CP2: article[@conf=VLDB]
CP3: article[@year=2007]
CP4: author[text()="John Smith"]
CP5: author[@institute=Harvard]
At each step of the execution:
1. check predicates
2. update list with partially matched queries Q
3. continue with expanding state if Q not empty
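The three steps can be sketched as follows, under simplifying assumptions: a single-process NFA rather than a distributed one, child-axis steps only, at most one equality predicate per step, and evaluation along one root-to-element path of the document. The class `StepwiseNFA` and the predicate string format are hypothetical.

```python
TRUE = "true"  # index key for steps that carry no predicate


class StepwiseNFA:
    def __init__(self):
        self.next = [{}]        # state -> {label: next state}
        self.index = [{}]       # state -> {predicate: set of query ids} (hash index)
        self.accepts = [set()]  # state -> queries whose last step ends here

    def add_query(self, qid, steps):
        state = 0
        for label, predicate in steps:
            if label not in self.next[state]:
                self.next[state][label] = len(self.next)
                self.next.append({})
                self.index.append({})
                self.accepts.append(set())
            state = self.next[state][label]
            self.index[state].setdefault(predicate or TRUE, set()).add(qid)
        self.accepts[state].add(qid)

    def run(self, doc_path):
        """doc_path: list of (tag, set of candidate predicates of that element)."""
        active = [(0, None)]  # (state, partially matched queries Q); None means all
        for tag, candidates in doc_path:
            expanded = []
            for state, alive in active:
                for label in (tag, "*"):
                    nxt = self.next[state].get(label)
                    if nxt is None:
                        continue
                    # 1. check predicates through the hash index of the state
                    ok = set(self.index[nxt].get(TRUE, set()))
                    for cp in candidates:
                        ok |= self.index[nxt].get(cp, set())
                    # 2. update the list of partially matched queries Q
                    q = ok if alive is None else alive & ok
                    # 3. continue expanding only if Q is not empty
                    if q:
                        expanded.append((nxt, q))
            active = expanded
        matched = set()
        for state, q in active:
            matched |= q & self.accepts[state]
        return matched


nfa = StepwiseNFA()
nfa.add_query("Q1", [("bib", None), ("phdthesis", "@published=2005"), ("author", "@nationality=greek")])
nfa.add_query("Q2", [("bib", None), ("*", None), ("author", "text()=John Smith")])
nfa.add_query("Q3", [("bib", None), ("article", "@conf=www")])
nfa.add_query("Q4", [("bib", None), ("article", "@year=2009"), ("author", "@degree-from=UOA")])
nfa.add_query("Q5", [("bib", None), ("article", "@year=2009"), ("cite", "@paper-id=2392")])
nfa.add_query("Q6", [("bib", None), ("article", None), ("cite", "@paper-id=2770")])

doc_path = [
    ("bib", set()),
    ("article", {"@title=XML Filtering", "@conf=VLDB", "@year=2007"}),
    ("author", {"text()=John Smith", "@institute=Harvard"}),
]
print(nfa.run(doc_path))  # {'Q2'}
```

At the `article` state the hash index holds the same three entries as the second table above, and only Q6 (no predicate on that step) and Q2 (through the wildcard) survive the step.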
Top-down evaluation
Execute distributed NFA
Only check predicates if an accepting state is reached
Each peer uses a local index mapping predicates to the list of queries that contain them (hash index)
Delay value matching until after structural matching
Example
[Figure: example NFA with states 0 to 8 and transitions bib, phdthesis, *, article, author, conference, cite]
Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek]
Q2: /bib/*/author[text()="John Smith"]
Q3: /bib/article[@conf=www]
Q4: /bib/article[@year=2009]/author[@degree-from="UOA"]
Q5: /bib/article[@year=2009]/cite[@paper-id=2392]
Q6: /bib/article/cite[@paper-id=2770]
PREDICATE QUERY LIST
[@conf=WWW] {Q3}
PREDICATE QUERY LIST
[@paper-id=2770] {Q6}
[@paper-id=2392] {Q5}
[@year=2009] {Q5}
Candidate predicates
CP1: article[@title="XML Filtering"]
CP2: article[@conf=VLDB]
CP3: article[@year=2007]
CP4: author[text()="John Smith"]
CP5: author[@institute=Harvard]
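A minimal sketch of the value-matching phase of top-down evaluation, assuming structural matching has already produced the set of queries that reached an accepting state. Counting satisfied predicates per query is one common way to use such a predicate-to-queries hash index and is an assumption here, as are the function names and the predicate string format. The sketch does not track which element instance satisfied a predicate.

```python
from collections import defaultdict


def build_predicate_index(query_predicates):
    """Hash index: predicate -> ids of the queries that contain it."""
    index = defaultdict(set)
    for qid, predicates in query_predicates.items():
        for predicate in predicates:
            index[predicate].add(qid)
    return index


def top_down(structural_matches, query_predicates, index, candidates):
    """Keep the structurally matched queries whose predicates all hold."""
    satisfied = defaultdict(int)
    for cp in set(candidates):
        for qid in index.get(cp, ()):
            satisfied[qid] += 1
    return {qid for qid in structural_matches
            if satisfied[qid] == len(query_predicates[qid])}


query_predicates = {
    "Q2": {'author[text()="John Smith"]'},
    "Q3": {"article[@conf=www]"},
    "Q4": {"article[@year=2009]", 'author[@degree-from="UOA"]'},
}
index = build_predicate_index(query_predicates)
candidates = [
    'article[@title="XML Filtering"]', "article[@conf=VLDB]", "article[@year=2007]",
    'author[text()="John Smith"]', "author[@institute=Harvard]",
]
# Q2, Q3 and Q4 are the queries whose paths match the example document structurally.
result = top_down({"Q2", "Q3", "Q4"}, query_predicates, index, candidates)
```

Here Q3 and Q4 are structural false positives that are only discarded after the NFA has run, which is the cost this method pays when predicates are selective.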
Top-down evaluation with pruning
At each step of the execution, part of the NFA is revealed
Applies to equality predicates
IDEA: Use a compact summary of predicate information to stop NFA execution (prune) if we can deduce that no match can be found
[Figure: example NFA with states 0 to 8 and transitions bib, phdthesis, *, article, author, conference, cite]
TD with pruning – Details
• Each peer responsible for storing many NFA fragments
• Each peer keeps one Bloom filter, the value filter (VF), which summarizes the predicates of the queries indexed in the relevant NFA fragments
• Assuming a peer p and a state st, for each query q whose NFA accepting path contains st, we insert one predicate of q in the VF of p
TD with pruning - Main idea cont.
• Predicates are inserted as a whole in VFs using their string representation:
– element[@attr=value]: element + attr + value
– element[text()=value]: element + text + value
• VFs are updated during query indexing
• Since we traverse the NFA accepting path of a query to index it, all relevant VFs will be updated
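A value filter can be sketched as a small Bloom filter over the predicate strings described above. The hashing scheme (salted SHA-1), the number of hash functions `k_hashes`, the default of 1024 bits and all names are assumptions for illustration. The pruning test also assumes that every query indexed at the peer contributed one equality predicate to the filter.

```python
import hashlib


class ValueFilter:
    """Bloom filter over predicate strings (the value filter, VF)."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # a Python int used as an m-bit vector

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def attr_key(element, attr, value):
    return element + attr + value       # for element[@attr=value]


def text_key(element, value):
    return element + "text" + value     # for element[text()=value]


def should_prune(value_filter, candidate_keys):
    """Stop NFA execution when no candidate predicate can be in the filter."""
    return not any(value_filter.might_contain(key) for key in candidate_keys)


vf = ValueFilter()
vf.add(attr_key("article", "year", "2009"))   # one predicate chosen from Q4 or Q5
vf.add(text_key("author", "John Smith"))      # the predicate of the John Smith query
```

A Bloom filter never reports an inserted key as absent, so pruning cannot drop a real match. A false positive only means that execution continues where it could have stopped.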
Example: Constructing Value filters
[Figure: a peer in the DHT is responsible for a fragment of the example NFA and keeps a value filter, an m-bit filter (100…101)]
Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek]
Q2: /bib/*/author[text()="John Smith"]
Q3: /bib/article[@conf=www]
Q4: /bib/article[@year=2009]/author[@degree-from="UOA"]
Q5: /bib/article[@year=2009]/cite[@paper-id=2392]
Q6: /bib/article/cite[@paper-id=2770]
Select 1 predicate per query to insert
Example – Querying Value filters
Candidate predicates
CP1: article[@title="XML Filtering"]
CP2: article[@conf=VLDB]
CP3: article[@year=2007]
CP4: author[text()="John Smith"]
CP5: author[@institute=Harvard]
Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek]
Q2: /bib/*/author[text()="John Smith"]
Q3: /bib/article[@conf=www]
Q4: /bib/article[@year=2009]/author[@degree-from="UOA"]
Q5: /bib/article[@year=2009]/cite[@paper-id=2392]
Q6: /bib/article/cite[@paper-id=2770]
Q7: //article[@year=2007]
Step 1: expanding state 0, check value filter (100…101): MATCH! Execution continues
Step 2: expanding state 1, check value filter (100…101): MISS! Execution stops
[Figure: the example NFA in the DHT, extended with states 9 and 10 (ε, * and article transitions) for Q7]
Online selectivity estimation
• TD with pruning selects one of the predicates of each query to insert in the value filter:
– Randomly
– Or… the most selective predicate
• Example: /bib/article[@year=2009]/author[text()="John Smith"]
• It is not feasible to store the entire set of XML data that have been processed by our system
Sampling!
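The slides only say that sampling is used. The sketch below assumes one possible realization, reservoir sampling over the candidate predicate sets of processed documents, to estimate how selective each predicate is. The class `PredicateSampler`, its parameters and the sample data are hypothetical.

```python
import random


class PredicateSampler:
    """Reservoir sample of documents, each stored as its set of candidate predicates."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.rng = random.Random(seed)
        self.sample = []
        self.seen = 0

    def observe(self, candidate_predicates):
        self.seen += 1
        doc = frozenset(candidate_predicates)
        if len(self.sample) < self.capacity:
            self.sample.append(doc)
        else:
            # Replace a random slot with probability capacity / seen.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.sample[slot] = doc

    def selectivity(self, predicate):
        """Estimated fraction of documents that satisfy the predicate."""
        if not self.sample:
            return 0.0
        return sum(predicate in doc for doc in self.sample) / len(self.sample)

    def most_selective(self, predicates):
        """The predicate satisfied by the fewest sampled documents."""
        return min(predicates, key=self.selectivity)


sampler = PredicateSampler(capacity=10)
sampler.observe({"article[@year=2009]"})
sampler.observe({"article[@year=2009]", 'author[text()="John Smith"]'})
sampler.observe({"article[@year=2009]"})
sampler.observe({"article[@year=2007]"})

query = ["article[@year=2009]", 'author[text()="John Smith"]']
chosen = sampler.most_selective(query)
```

With this sample the year predicate holds in three of four documents and the author predicate in one, so the author predicate would be the one inserted in the value filter.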
Bottom-up evaluation
Queries are indexed in the network using their predicates
For each distinct predicate in the query set, select a responsible peer using the DHT
The peer organizes its queries using a local index mapping predicates to the list of queries that contain them
This indexing model resembles work from the area of Information Filtering
Check values as early as possible
Bottom-up evaluation cont.
• Construct set of candidate predicates
• For each candidate predicate contact responsible peer
– Peer checks its local index
– Performs structural matching locally
Check values as early as possible
Example
DHT
Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek]
Find responsible
Candidate predicates
CP1: article[@title="XML Filtering"]
CP2: article[@conf=VLDB]
CP3: article[@year=2007]
CP4: author[text()="John Smith"]
CP5: author[@institute=Harvard]
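Bottom-up evaluation can be sketched as below. Several details are assumptions for illustration: each query is indexed under every one of its predicates, the responsible peer is chosen by a modulo hash rather than a DHT lookup, the peer receives the document's tag paths and candidate predicates so it can finish matching locally, and only child-axis paths with `*` are supported. All names are hypothetical.

```python
import hashlib
from collections import defaultdict


def responsible_peer(predicate, peers):
    digest = hashlib.sha1(predicate.encode("utf-8")).digest()
    return peers[int.from_bytes(digest, "big") % len(peers)]


def structurally_matches(query_path, doc_path):
    """Child-axis path with '*' wildcards against one root-to-element tag path."""
    steps = query_path.strip("/").split("/")
    return len(steps) == len(doc_path) and all(
        step in (tag, "*") for step, tag in zip(steps, doc_path))


class BottomUpIndex:
    def __init__(self, peers):
        self.peers = peers
        # peer -> predicate -> list of (query id, path, all predicates of the query)
        self.local = {peer: defaultdict(list) for peer in peers}

    def index_query(self, qid, path, predicates):
        for predicate in predicates:
            peer = responsible_peer(predicate, self.peers)
            self.local[peer][predicate].append((qid, path, frozenset(predicates)))

    def filter(self, doc_paths, candidates):
        candidates = set(candidates)
        matched = set()
        for cp in candidates:
            peer = responsible_peer(cp, self.peers)  # contact the responsible peer
            for qid, path, predicates in self.local[peer].get(cp, ()):
                # The peer checks its local index, then matches the structure locally.
                if predicates <= candidates and any(
                        structurally_matches(path, p) for p in doc_paths):
                    matched.add(qid)
        return matched


network = BottomUpIndex(["p0", "p1", "p2"])
network.index_query("Q2", "/bib/*/author", ['author[text()="John Smith"]'])
network.index_query("Q4", "/bib/article/author",
                    ["article[@year=2009]", 'author[@degree-from="UOA"]'])

doc_paths = [["bib"], ["bib", "article"], ["bib", "article", "school"],
             ["bib", "article", "author"]]
candidates = ['article[@title="XML Filtering"]', "article[@conf=VLDB]",
              "article[@year=2007]", 'author[text()="John Smith"]',
              "author[@institute=Harvard]"]
matches = network.filter(doc_paths, candidates)
```

Only peers that hold a query for one of the document's candidate predicates do any structural work, which is the sense in which values are checked as early as possible.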
Experiments
• Implemented methods using FreePastry release in Java
• Environment
– Cluster (http://www.grid.tuc.gr/)
– 28 machines (4 peers per machine)
– 253 peers from PlanetLab network
• Queries
– Sets of 1000000 queries
– 1 to 8 predicates per query
• Data
– NITF DTD
• Bloom filter size
– 100K bits
Cluster (2 predicates per query)
Cluster (4 predicates per query)
Cluster (4 predicates per query)
Network traffic
Sum up & future work
Described methods to combine both structural and value XML filtering in a distributed environment
Experimental evaluation of our methods
Future work
Potential improvements for SBS method
More sophisticated methods for selectivity estimation
Range predicates
Textual predicates
Questions?
Planetlab (2 predicates per query)
Performance improvement
Structural vs. value matching (small query set)
Structural vs. value matching (large query set)
<?xml version="1.0" encoding="UTF-8"?><statuses> <status><created_at>Tue Apr 07 22:52:51 +0000 2009</created_at><id>1472669360</id><text>At least I can get your humor through tweets. RT @abdur: I don't mean this in a bad way, but genetically speaking your a cul-de-sac.</text><source><a href="http://www.tweetdeck.com/">TweetDeck</a></source><truncated>false</truncated><in_reply_to_status_id></in_reply_to_status_id><in_reply_to_user_id></in_reply_to_user_id><favorited>false</favorited><in_reply_to_screen_name></in_reply_to_screen_name><user><id>1401881</id> <name>Doug Williams</name> <screen_name>dougw</screen_name> <location>San Francisco, CA</location> <description>Twitter API Support. Internet, greed, users, dougw and opportunities are my passions.</description> <profile_image_url>http://s3.amazonaws.com/twitter_production/profile_images/59648642/avatar_normal.png</profile_image_url> <url>http://www.igudo.com</url> <protected>false</protected> <followers_count>1027</followers_count> <profile_background_color>9ae4e8</profile_background_color> <profile_text_color>000000</profile_text_color> <profile_link_color>0000ff</profile_link_color> <profile_sidebar_fill_color>e0ff92</profile_sidebar_fill_color> <profile_sidebar_border_color>87bc44</profile_sidebar_border_color> <friends_count>293</friends_count> <created_at>Sun Mar 18 06:42:26 +0000 2007</created_at> <favourites_count>0</favourites_count> <utc_offset>-18000</utc_offset> <time_zone>Eastern Time (US & Canada)</time_zone> <profile_background_image_url>http://s3.amazonaws.com/twitter_production/profile_background_images/2752608/twitter_bg_grass.jpg</profile_background_image_url> <profile_background_tile>false</profile_background_tile> <statuses_count>3390</statuses_count> <notifications>false</notifications> <following>false</following> <verified>true</verified></user><geo/> </status> ... truncated ...</statuses>