Post on 31-Dec-2015
XML Distributed Retrieval
Emiran Curtmola @ UCSD, K. K. Ramakrishnan @ AT&T, Alin Deutsch @ UCSD, Divesh Srivastava @ AT&T
04/19/2023 2
Motivation
• Democratization of data creation on the web
  ▪ Easy to create and publish data
• Self-organization in online communities
  ▪ Easy to form online communities in an ad-hoc fashion
  ▪ Members create, publish and share data items
• Need to query the overall community data collection (all the published data)
“The virtual newspaper” community
Newspapers: advertised data items about each publisher’s articles
  P1: San Diego, San Francisco, stocks, food, weather
  P2: San Diego, gold, New York, food
  P3: San Diego, San Francisco, gold, New York, weather
  P4: San Diego, fire, gold, stocks, weather
  P5: San Diego, fire, gold, stocks, weather
  P6: fire, San Francisco, stocks, weather
  P7: fire, gold, stocks, weather
  P8: fire, gold, stocks, weather
• Query Q1: find the articles talking about fire in San Diego
• Query Q2: find the articles about San Francisco
• Query Q3: find the articles talking about food
• Query Q4: find the articles that give the weather in New York
[Figure: the community data collection, made up of the local data held by publishers P1 to P8]
• How can the community data collection be queried efficiently?
State-of-the-Art in Querying
• Topic-based approach
  ▪ Users create static topics
  ▪ A topic is a rendezvous point between consumers and publishers
  ▪ Consumers subscribe (query) to topics of interest
  ▪ Publishers classify content into topics
• Limitation
  ▪ Consumer interests cannot be specified at a very fine granularity (that would require too many topics), e.g., “news about fire damage when more than 1,000 people impacted and related to Santa Ana conditions in San Diego county, and information about related government relief efforts underway”
Ad-hoc Querying
• Content-based approach: queries are evaluated on the actual content
  ▪ E.g., search engines, hosted online communities
[Figure: queries Q1 to Q4 posed to a central site that holds the global data (the community data collection); publishers P1 to P8 with their local data]
Limitations of Centralized Approach
• The centralized approach disintermediates publishers from consumers via a centralized authority
  ▪ Publishers need to give up their data, which goes against the idea of a community of autonomous members
  ▪ Publishers cannot know who is interested in their data or who accesses it
• Insufficient timeliness
  ▪ Freshness of data depends on crawling frequency
Decentralized Approach: Move Queries Instead of Data
[Figure: queries Q1 to Q4 sent to publishers P1 to P8, whose local data together form the community data collection; each publisher advertises the same terms as on “The virtual newspaper” community slide]
Our Goal for Querying
• Data resides with the publisher
  ▪ Publishers maintain complete control over who accesses their data
• Consumers can send ad-hoc queries over the content of the community data collection
Challenges
• Distributed nature of the data among publishers
  ▪ Data is not materialized globally; it resides with each publisher
• Large number of decentralized publishers and consumers
  ▪ Publishers: “whom to tell” among the host of potential consumers?
  ▪ Consumers: “whom to ask” among the myriad of available publishers?
• Avoid flooding the network
Proposal for Query Dissemination
• The community setup
  ▪ Network of logical routers as infrastructure for the community
  ▪ Publishers connect to this network at the edge
• Build an overlay network to act as a distributed index structure
  ▪ Routers are organized into a network called a query dissemination tree (QDT)
• Use the QDT to disseminate queries
  ▪ Queries are always posed at the root
  ▪ Routers forward queries to relevant publishers based on summary information: every node contains a summary of the data stored in its subtrees
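The summary scheme can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Node` class, the `build_summary` function and the placement of P1 to P3 under a single router are assumptions made for the example; only the advertised terms come from the slides.

```python
class Node:
    """A QDT node: a router (has children) or a publisher (has advertised terms)."""

    def __init__(self, name, children=(), terms=()):
        self.name = name
        self.children = list(children)
        self.terms = set(terms)   # advertised terms (publishers only)
        self.summary = set()      # filled in by build_summary


def build_summary(node):
    """Bottom-up pass: a node's summary is its own terms plus the union
    of the summaries of its subtrees."""
    node.summary = set(node.terms)
    for child in node.children:
        node.summary |= build_summary(child)
    return node.summary


p1 = Node("P1", terms=["San Diego", "San Francisco", "stocks", "food", "weather"])
p2 = Node("P2", terms=["San Diego", "gold", "New York", "food"])
p3 = Node("P3", terms=["San Diego", "San Francisco", "gold", "New York", "weather"])
router = Node("router", children=[p1, p2, p3])  # hypothetical placement of P1..P3
build_summary(router)
print(sorted(router.summary))
```

Under these assumptions the router's summary is the seven-term set that the deck gives for node 3.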
A Query Dissemination Tree (QDT)
[Figure: a QDT with numbered routers as internal nodes and publishers P1 to P8 at the leaves; only the overlay connections between the nodes of the QDT are shown]
• P1’s advertised set of terms: San Diego, San Francisco, stocks, food, weather
• P2’s advertised set of terms: San Diego, gold, New York, food
• Node 3’s summary (set of terms): San Diego, San Francisco, stocks, food, weather, gold, New York
• A router’s summary is the union of its subtrees’ summaries
XML Content Descriptors (CDs)
• An XML document D is described (imperfectly) by a set of content descriptors, CD(D)
• A query Q is also described by a set of CDs, CD(Q)
• To estimate whether Q has a match against D, we check whether CD(Q) ⊆ CD(D)
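As a small illustration, the inclusion test maps directly onto Python sets. The function name `may_match` and the use of P1's advertised terms as a stand-in for CD(D) are assumptions for this sketch.

```python
def may_match(query_cds, doc_cds):
    """Estimate whether a query can match a document: CD(Q) must be a subset of CD(D).
    Because CDs describe the document imperfectly, True means "possibly matches"."""
    return set(query_cds) <= set(doc_cds)


# Stand-in for CD(D): the terms advertised by publisher P1.
cd_d = {"San Diego", "San Francisco", "stocks", "food", "weather"}

q3 = {"food"}                 # Q3=<food>
q1 = {"fire", "San Diego"}    # Q1=<fire, San Diego>

print(may_match(q3, cd_d))    # Q3 is included in CD(D)
print(may_match(q1, cd_d))    # "fire" is missing from CD(D)
```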
Representing Documents Using CDs
Sample XML article published by P1:
  rss
    channel
      editor: Jupiter
      item
        title: ReutersNews
        link: reuters.com
      description: San Diego, fire …
CDs can be
• all simple keywords: San Diego, fire, Jupiter, ReutersNews, reuters.com
• keywords with full path from root: /rss/channel/description/San Diego, /rss/channel/description/fire, /rss/channel/editor/Jupiter, /rss/channel/item/title/ReutersNews, /rss/channel/item/link/reuters.com
• keywords with only last tag on path: description/San Diego, description/fire, editor/Jupiter, title/ReutersNews, link/reuters.com
• etc.
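The three kinds of CDs can be derived from the sample article with Python's standard `xml.etree.ElementTree` module. This is a sketch under stated assumptions: the XML string is reconstructed from the tree on the slide, keywords are assumed to be comma-separated inside element text, and `content_descriptors` and its `mode` values are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Reconstruction of the sample article (assumed serialization).
SAMPLE = """<rss><channel>
  <editor>Jupiter</editor>
  <item><title>ReutersNews</title><link>reuters.com</link></item>
  <description>San Diego, fire</description>
</channel></rss>"""


def content_descriptors(xml_text, mode="keyword"):
    """Return the set of CDs of a document.
    mode: "keyword" (bare keywords), "full_path" (root-to-leaf path + keyword)
    or "last_tag" (enclosing tag + keyword)."""
    if mode not in ("keyword", "full_path", "last_tag"):
        raise ValueError(f"unknown mode: {mode}")
    cds = set()

    def walk(elem, path):
        path = path + [elem.tag]
        # Assumption: keywords are comma-separated within the element text.
        for kw in (k.strip() for k in (elem.text or "").split(",")):
            if not kw:
                continue
            if mode == "keyword":
                cds.add(kw)
            elif mode == "full_path":
                cds.add("/" + "/".join(path) + "/" + kw)
            else:
                cds.add(elem.tag + "/" + kw)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), [])
    return cds


for mode in ("keyword", "full_path", "last_tag"):
    print(mode, sorted(content_descriptors(SAMPLE, mode)))
```

Richer descriptors (full paths) make queries more selective, at the cost of larger CD sets per document.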
Query Routing in a QDT
[Figure: the QDT with query Q3=<food> posed at the root and forwarded hop by hop toward publishers P1 and P2]
• Only P1 and P2 publish articles about food
• At each node, check set inclusion of the query in the node’s summary
• Summaries are represented as Bloom filters
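The routing step can be sketched as a recursive descent that prunes every subtree whose summary does not include the query. The tree shape below (three hypothetical routers under a root) and the helper names `make` and `route` are assumptions; exact sets stand in for Bloom filters, and only the publishers' terms are taken from the slides.

```python
def make(name, children=(), terms=()):
    """Build a node whose summary is its terms plus its children's summaries."""
    summary = set(terms)
    for child in children:
        summary |= child["summary"]
    return {"name": name, "children": list(children), "summary": summary}


def route(node, query):
    """Return the publishers reached by `query` (a set of terms).
    A subtree is visited only if the query is included in its summary."""
    if not set(query) <= node["summary"]:
        return []                      # prune this subtree
    if not node["children"]:
        return [node["name"]]          # a publisher whose terms cover the query
    reached = []
    for child in node["children"]:
        reached += route(child, query)
    return reached


p = {
    "P1": make("P1", terms=["San Diego", "San Francisco", "stocks", "food", "weather"]),
    "P2": make("P2", terms=["San Diego", "gold", "New York", "food"]),
    "P3": make("P3", terms=["San Diego", "San Francisco", "gold", "New York", "weather"]),
    "P4": make("P4", terms=["San Diego", "fire", "gold", "stocks", "weather"]),
    "P5": make("P5", terms=["San Diego", "fire", "gold", "stocks", "weather"]),
    "P6": make("P6", terms=["fire", "San Francisco", "stocks", "weather"]),
    "P7": make("P7", terms=["fire", "gold", "stocks", "weather"]),
    "P8": make("P8", terms=["fire", "gold", "stocks", "weather"]),
}
# Hypothetical router layout, not the one drawn on the slide.
root = make("root", children=[
    make("rA", children=[p["P1"], p["P2"], p["P3"]]),
    make("rB", children=[p["P4"], p["P5"]]),
    make("rC", children=[p["P6"], p["P7"], p["P8"]]),
])

print(route(root, {"food"}))
print(route(root, {"New York", "weather"}))
```

Note that a router's summary can include all query terms even when no single publisher below it does (rA for Q4, where P2 lacks "weather"), which is why the full query is checked again at each publisher.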
Traffic Congestion at Top of QDT
• The tree topology introduces congestion during query dissemination
[Figure: the QDT with the region around the root marked as the bottleneck]
• Bottleneck: the load decreases from root to leaves due to filtering
• Routing a query workload (rather than a single query): each node needs non-zero time to process a query
• How to relieve the congestion?
Techniques for Load Balancing
• Overlay multiple logical QDTs over the same underlay network
  ▪ A node belongs to multiple QDTs, but at different levels
• Goal: organize the nodes into QDTs such that the distribution of tree levels for a node is uniform across the QDTs
Overlaying Multiple QDTs
[Figure sequence: QDT1, then QDT2, then QDT1 to QDT4, all overlaid on the same numbered routers and publishers P1 to P8]
Query Routing for Multiple QDTs
• Partition the community data collection (set of CDs) into blocks
• Build one QDT per block
  ▪ QDTi groups all publishers with CDs in Bi
• Routing a query
  ▪ Terms in the query determine the relevant blocks
  ▪ Send the query to the corresponding QDT
  ▪ Check the full query against the publishers’ storage
• Example of routing Q3=<food>
  ▪ Q3 falls in B4, so use QDT4
Block   CDs
B1      San Diego, fire
B2      San Francisco, gold
B3      New York, stocks
B4      food, weather
[Figure: QDT1 to QDT4; QDT4 (for B4) carries Q3=<food> to the publishers of articles about food]
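A minimal sketch of the block lookup, using the partition from the table. The names `BLOCKS` and `relevant_qdts` are assumptions for illustration, and the mapping of block Bi to tree QDTi follows the slide.

```python
# Partition of the CDs into blocks, as in the table; block Bi is served by QDTi.
BLOCKS = {
    "QDT1": {"San Diego", "fire"},
    "QDT2": {"San Francisco", "gold"},
    "QDT3": {"New York", "stocks"},
    "QDT4": {"food", "weather"},
}


def relevant_qdts(query):
    """Return the QDTs whose block contains at least one term of the query."""
    terms = set(query)
    return [qdt for qdt, block in BLOCKS.items() if block & terms]


print(relevant_qdts({"food"}))                 # Q3 falls in B4
print(relevant_qdts({"fire", "San Diego"}))    # Q1 falls entirely in B1
print(relevant_qdts({"New York", "weather"}))  # Q4 spans B3 and B4
```

Whichever tree is chosen, the full query still has to be checked at the publishers it reaches.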
Relieving the Congestion
[Figure: QDT1 to QDT4; Q3=<food> and Q1=<fire, San Diego> are routed on different trees]
Queries Spanning Multiple Blocks
• Q4=<New York, weather> spans two blocks: New York is in B3 (QDT3) and weather is in B4 (QDT4)
• Route Q4 on both trees?
  ▪ No: this generates redundant traffic, and therefore more messages
  ▪ Routing on both trees can touch the same nodes
• We show that it suffices to send the query to either of the trees
Routing Alternatives
• Routing Q4=<New York, weather>
  ▪ Routing by <New York>: use QDT3
  ▪ Routing by <weather>: use QDT4
• Check all the query terms at each publisher!
• Ideally, route by the most selective term
• In practice this is not possible, so use informed routing
  ▪ Keep track of popular CDs
  ▪ Avoid routing by low-selectivity (popular) CDs
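One way to picture informed routing: keep popularity counts for only the few most popular CDs and route by the query term with the lowest tracked count. The sketch below is an assumption-laden illustration, not the authors' method: popularity is taken to be the number of publishers advertising a term, untracked terms count as 0, and `choose_routing_term` is a hypothetical name.

```python
from collections import Counter

ADVERTISED = {
    "P1": ["San Diego", "San Francisco", "stocks", "food", "weather"],
    "P2": ["San Diego", "gold", "New York", "food"],
    "P3": ["San Diego", "San Francisco", "gold", "New York", "weather"],
    "P4": ["San Diego", "fire", "gold", "stocks", "weather"],
    "P5": ["San Diego", "fire", "gold", "stocks", "weather"],
    "P6": ["fire", "San Francisco", "stocks", "weather"],
    "P7": ["fire", "gold", "stocks", "weather"],
    "P8": ["fire", "gold", "stocks", "weather"],
}

# Assumed popularity measure: number of publishers advertising the term.
counts = Counter(term for terms in ADVERTISED.values() for term in terms)

# Small routing state: remember only the 3 most popular CDs.
tracked = dict(counts.most_common(3))


def choose_routing_term(query, popularity):
    """Pick the query term with the lowest tracked popularity.
    Untracked terms count as 0 (assumed rare); ties break alphabetically."""
    return min(sorted(query), key=lambda term: popularity.get(term, 0))


print(choose_routing_term({"New York", "weather"}, tracked))
```

For Q4 this avoids the popular term "weather" (advertised by seven publishers) and routes by "New York" (advertised by two), that is, on QDT3.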
Discussion: The Design Space
• How many query dissemination trees?
• 1 tree for all published terms
  ▪ Con: traffic congestion in the upper levels of the dissemination tree
  ▪ Pro: queries routed in the tree are very selective; the more conjuncts, the more selective the query, which allows early pruning of the subtrees to be visited
• 1 tree per term
  ▪ Pro: congestion-free
  ▪ Con: tree maintenance (as many trees as terms)
  ▪ Con: single-term queries are less selective, so more peers are visited unnecessarily
• The “sweet spot” is expected to lie between the above extremes: our solution
Finding the Sweet Spot
• Empirical fact: the upper 2 tree levels in a QDT are the most congested
• One solution: cyclical permutation of nodes on the tree levels
• Goal: across the QDTs, every router appears precisely once in the top 2 levels
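The slides do not spell out the permutation, so the following is only one possible reading: treat a QDT as an ordered list of levels and rotate that list by two positions for each additional tree. The function `cyclic_qdts`, the router names and the eight-level example are all hypothetical, and the sketch covers only the assignment of routers to levels, not the parent-child links.

```python
def cyclic_qdts(levels, top=2):
    """Derive QDTs by rotating the list of tree levels `top` positions at a time,
    so that each group of `top` consecutive levels is on top in exactly one QDT.
    Assumes the number of levels is a multiple of `top`."""
    n = len(levels)
    if n % top != 0:
        raise ValueError("number of levels must be a multiple of `top`")
    return [levels[shift:] + levels[:shift] for shift in range(0, n, top)]


# Hypothetical 8-level tree; each level is a list of router ids.
levels = [["r0"], ["r1", "r2"], ["r3"], ["r4", "r5"],
          ["r6"], ["r7", "r8"], ["r9"], ["r10", "r11"]]

qdts = cyclic_qdts(levels)
for i, qdt in enumerate(qdts, start=1):
    print(f"QDT{i} top 2 levels:", qdt[0] + qdt[1])
```

With eight levels this reading yields four QDTs; the number of trees follows from the tree depth rather than being chosen freely.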
Sweet Spot when 4 QDTs
[Figure sequence: QDT1 and its cyclical permutations QDT2, QDT3 and QDT4 over the same numbered routers and publishers P1 to P8]
Experimental Goals
• Effect of number of QDTs
  ▪ Find the “sweet spot” for load balancing
• Effect of routing strategy (informed routing)
  ▪ Optimize based on query selectivity estimation
• Effect of QDT topology
  ▪ Study the overlay organization of the peers
Experimental Setup
• 10,000-node overlay network simulator
  ▪ 9,400 publishers and 600 routers
• XML Wikipedia dump of 1.1M articles (8.6 GB)
• Query workload: 50,000 conjunctive queries
  ▪ Each query has 1 to 10 conjunctive terms
  ▪ Each query has at least one match in the global data collection
• QDT topology
  ▪ Multicast trees, e.g., Scribe (QDTS)
  ▪ Balanced trees (QDTB)
Measuring the Throughput
• Processing load at each node is a function of the number of messages reaching the node
• Peak load: the maximum load over all nodes
• Average load: the number of messages in the network divided by the number of routers
• The ideal load we can achieve is the average load for the 1-QDT case
• New metric: the load reduction
  ▪ How close is the actual peak load (with k QDTs) to the ideal load?
\[ \text{load reduction} = \frac{\text{average load for the 1-QDT case}}{\text{peak load for the $k$-QDT case}} \]
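The metric can be computed directly from per-router message counts. In this sketch the load of a router is taken to be its message count, and the numbers are made up for illustration; they are not measurements from the experiments.

```python
def load_reduction(loads_1qdt, loads_kqdt):
    """Ratio of the ideal load (average load in the 1-QDT case)
    to the actual peak load in the k-QDT case."""
    ideal = sum(loads_1qdt) / len(loads_1qdt)   # messages / number of routers
    peak = max(loads_kqdt)                      # most loaded router
    return ideal / peak


# Hypothetical message counts for 5 routers.
single_tree = [100, 40, 40, 10, 10]   # root handles most of the traffic
four_trees = [50, 45, 40, 35, 30]     # load spread across trees

print(load_reduction(single_tree, single_tree))  # 40 / 100
print(load_reduction(single_tree, four_trees))   # 40 / 50
```

A value closer to 1 means the peak load is closer to the ideal load.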
Effect of Number of QDTs
• Varying the number of QDTs, we confirm that the number of QDTs given by the cyclical permutation method yields the highest load reduction
• The “sweet spot” is well defined
• For this number of QDTs the load reduction is near the optimum
• Result: the actual peak load is brought very close to the ideal load
  ▪ Near-optimum peak load reduction at 15 QDTs for Scribe-generated topologies
Effect of Routing Strategy
• Query selectivity estimation
  ▪ With only 1-3% state, we get 65-75% of the routing benefit
Effect of QDT Topology
• Fanout-balanced trees are closest to optimal throughput
Ideal-to-actual peak load reduction ratio
             QDTS, 15-QDT config.   QDTB, 66-QDT config.
Processing   1.32                   1.18
Forwarding   10.23                  2.3
Summary
• Infrastructure for ad-hoc querying in online communities where the publishers keep control over their own data
• Ongoing work
  ▪ Ranked results: disseminate only to top-K relevant publishers; find only top-K matching documents
  ▪ Support for more expressive XML queries
  ▪ Simulation
  ▪ Build prototype
Thank You!
[Figure: Effect of Number of QDTs]
[Figure: Effect of Routing Strategy]
Efficient Representation of Summaries
• Naïve solution: keep “exact node summaries” as a complete list of published terms
  ▪ Con: memory intensive, with arbitrarily large summaries
  ▪ Con: costly to check set inclusion
• How to check term-set inclusion quickly?
• How to represent summaries using little space?
• Allow estimates
  ▪ Without false negatives: to avoid incomplete answers
  ▪ With bounded false positives: to avoid wasting bandwidth
• Represent summaries (term sets) using Bloom filters
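A minimal Bloom filter sketch shows why this representation fits: membership tests never miss an added term, and the union of two summaries is a bitwise OR. The filter size, the number of hash functions and the SHA-256-based hashing are arbitrary choices for illustration, not parameters from the talk.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit set with k hash functions; no false negatives,
    false positives possible."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, term):
        # Derive k bit positions by salting the term with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def might_contain(self, term):
        return all((self.bits >> pos) & 1 for pos in self._positions(term))

    def union(self, other):
        """Summary of a parent node: bitwise OR of the children's filters."""
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share size and hash count")
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


p1, p2 = BloomFilter(), BloomFilter()
for term in ["San Diego", "San Francisco", "stocks", "food", "weather"]:
    p1.add(term)
for term in ["San Diego", "gold", "New York", "food"]:
    p2.add(term)

parent = p1.union(p2)
print(all(parent.might_contain(t) for t in ["food", "gold", "weather"]))
```

A query is forwarded to a subtree only if `might_contain` holds for every query term; a false positive costs some extra messages but never loses an answer.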