Multiple-Site Distributed Spatial Query Optimization using Spatial Semijoins
Wendy Osborn and Saad Zaamout
Outline
• Introduction
• Related Work
• Algorithm
• Performance Evaluation
• Conclusion and Future Work
Spatial Data
[Figure: map captioned "Canadian Cow Country"; image borrowed from www.mapquest.ca]
Distributed Database
[Figure: distributed database with sites in Calgary, Toronto, and Montreal; image borrowed from docs.google.com]
Research Problem
• Efficient processing of a distributed spatial query
• Cost considerations:
    • data transmission
    • CPU
    • I/O
Related Work
• Spatial join
    • Kang et al. (2002)
• Spatial semijoins
    • Tan, Ooi, Abel (1995, 2000)
    • Karam and Petry (2006)
• Limitations
    • Two-site distributed spatial queries only
The Algorithm - Assumptions
• Each site has one participating spatial relation
• Each spatial relation has one spatial attribute
• All MBRs in a relation are unique
    • relation cardinality = number of MBRs in relation
• Each spatial relation is indexed by an R-tree
Spatial Semijoin Implementation
1. “Project” the spatial attribute from relation R
    • obtain (MBR, ID) pairs from the leaf nodes of the R-tree
2. Transmit the spatial attribute (R_SA) to relation S
3. Perform the semijoin between R_SA and S
4. Transmit the identifiers from R_SA whose MBRs qualify in the query back to relation R
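The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the join predicate is assumed to be MBR intersection (the slides do not name it), a nested loop stands in for the R-tree lookup, and the names `MBR`, `intersects`, `project_spatial_attribute`, and `spatial_semijoin` are hypothetical.

```python
from typing import NamedTuple


class MBR(NamedTuple):
    """Minimum bounding rectangle given by its lower-left and upper-right corners."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: MBR, b: MBR) -> bool:
    """Assumed join predicate: the two rectangles share at least one point."""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax and
            a.ymin <= b.ymax and b.ymin <= a.ymax)


def project_spatial_attribute(relation):
    """Step 1: keep only the (MBR, ID) pair of every tuple in R."""
    return [(t["mbr"], t["id"]) for t in relation]


def spatial_semijoin(r_sa, s):
    """Step 3, run at S's site on the pairs received in step 2.

    Returns the IDs from R_SA whose MBR qualifies (sent back in step 4)
    and the tuples of S that qualify.
    """
    qualifying_ids = []
    qualifying_s = []
    seen_s_ids = set()
    for mbr, ident in r_sa:
        matched = False
        for t in s:  # brute force; an R-tree search would replace this loop
            if intersects(mbr, t["mbr"]):
                matched = True
                if t["id"] not in seen_s_ids:
                    seen_s_ids.add(t["id"])
                    qualifying_s.append(t)
        if matched:
            qualifying_ids.append(ident)
    return qualifying_ids, qualifying_s


# Illustrative data only.
R = [{"id": 1, "mbr": MBR(0, 0, 2, 2), "name": "a"},
     {"id": 2, "mbr": MBR(10, 10, 12, 12), "name": "b"}]
S = [{"id": 7, "mbr": MBR(1, 1, 3, 3), "name": "c"},
     {"id": 8, "mbr": MBR(20, 20, 21, 21), "name": "d"}]

r_sa = project_spatial_attribute(R)          # steps 1 and 2
ids_for_r, s_qualifiers = spatial_semijoin(r_sa, S)  # step 3
```

Only `ids_for_r` travels back to R's site in step 4, which is why the return trip is cheap compared with shipping whole tuples.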
Algorithm - Example
[Figure: query site QS and four sites R1, R2, R3, R4, with relation cardinalities 100, 200, 600, and 800]
Algorithm - Overview
1. Sort and group by spatial attribute cardinality
2. Transmit spatial attributes
3. Execute spatial semijoins
4. Transmit qualifying tuples to query site
Algorithm - Stage 1
• All sites (i.e. relations) are sorted in ascending order of spatial attribute cardinality
• The sorted sites are divided into two groups:
    • P - the first n/2 sites
    • Q - the remaining n/2 sites
Algorithm - Stage 2
• Transmit the spatial attribute from sites in P to sites in Q in the following manner:
    • Spatial attribute with the smallest cardinality in P is sent to the site with the smallest cardinality in Q
    • Spatial attribute with the next smallest cardinality in P is sent to the site with the next smallest cardinality in Q
    • and so on…
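Stages 1 and 2 amount to a sort, a split, and a pairwise match. The sketch below assumes an even number of sites, as the n/2 split implies; the function name `plan_transmissions` and the assignment of the cardinalities 100, 200, 600, 800 to R1 through R4 are assumptions made for illustration.

```python
def plan_transmissions(cardinalities):
    """Sort sites by cardinality, split into P and Q, and pair them up.

    cardinalities: dict mapping site name -> spatial attribute cardinality.
    Returns (P, Q, pairs), where pairs[i] = (sender in P, receiver in Q).
    Assumes an even number of sites.
    """
    ordered = sorted(cardinalities, key=cardinalities.get)  # ascending order
    half = len(ordered) // 2
    p, q = ordered[:half], ordered[half:]
    # i-th smallest site in P sends to the i-th smallest site in Q
    return p, q, list(zip(p, q))


# Hypothetical assignment of cardinalities to sites.
sites = {"R1": 100, "R2": 800, "R3": 600, "R4": 200}
P, Q, pairs = plan_transmissions(sites)
```

With this assignment, `pairs` is `[("R1", "R3"), ("R4", "R2")]`: the two smallest relations ship their spatial attributes, which keeps the transmitted volume low.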
Algorithm Example
[Figure: groups P (R1, R4) and Q (R3, R2); each site in P sends its spatial attribute SA to its paired site in Q]
SA = MBR + ID
Algorithm - Stage 3
• A spatial semijoin is performed between the received spatial attribute and the relation at each site in Q
• Result:
    • set of tuples from the relation that qualify in the semijoin
    • set of identifiers from the spatial attribute whose MBRs qualify in the semijoin
• Identifiers are shipped back to the originating site in P
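Algorithm Example
[Figure: groups P (R1, R4) and Q (R3, R2); each site in Q returns the qualifying identifiers (ID) to its paired site in P]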
Algorithm - Stage 4
[Figure: sites R1, R2, R3, and R4 each transmit their qualifying tuples (QT) to the query site QS]
Performance Evaluation
• Comparison vs. naïve approach
• Six-site distributed spatial query
    • 100, 200, 400, 600, 800, 1000 tuples
• Each tuple has the following structure:
    • MBR, identifier, region name, population, line slope indicator
Cost Calculations
• Data sizes:
    • Character - 1 byte
    • Integer - 2 bytes
    • Long integer and double float - 8 bytes
• Cost of transmitting an identifier:
    cost(ID) = sizeof(int)
• Cost of transmitting a spatial attribute value (MBR):
    cost(MBR) = 4 * sizeof(double) + sizeof(int)
• Cost of transmitting a tuple:
    cost(tuple) = cost(MBR) + 20 * sizeof(char) + sizeof(long int) + sizeof(int)
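The per-item costs can be worked out directly from the data sizes. The sketch below reads the tuple cost additively (MBR pair, 20 characters, one long integer, one integer), which gives 64 bytes per tuple. It also assumes that the naïve approach ships every tuple from every site to the query site; the slides do not define the naïve approach, but this reading reproduces the Naïve column, e.g. (100 + 400) * 64 = 32000. All names are hypothetical.

```python
# Data sizes in bytes, as listed on the slide.
SIZEOF_CHAR = 1
SIZEOF_INT = 2
SIZEOF_LONG = 8
SIZEOF_DOUBLE = 8

COST_ID = SIZEOF_INT                        # 2 bytes
COST_MBR = 4 * SIZEOF_DOUBLE + SIZEOF_INT   # four coordinates + identifier = 34 bytes
# Additive reading: MBR pair + 20-char region name + population + slope indicator.
COST_TUPLE = COST_MBR + 20 * SIZEOF_CHAR + SIZEOF_LONG + SIZEOF_INT  # 64 bytes


def naive_cost(cardinalities):
    """Assumed naive strategy: every site ships all of its tuples to the query site."""
    return sum(cardinalities) * COST_TUPLE


two_site = naive_cost([100, 400])
six_site = naive_cost([100, 200, 400, 600, 800, 1000])
```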
Cost Calculations
• Cost of performing a semijoin and transmitting tuples to the query site:
    cost(X, Y, Z) = number_of_tuples(Y) * cost(MBR)
                  + number_of_qualifiers(X) * (cost(ID) + cost(tuple))
                  + number_of_qualifiers(Z) * cost(tuple)
• Calculated for all n/2 semijoins
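A small calculator makes the four cost components explicit. This is one reading of the formula, not a confirmed one: it assumes Y is the relation whose spatial attribute is shipped, X is the set of its qualifiers (each costing one returned identifier plus one tuple sent to the query site), and Z is the set of qualifiers at the receiving site. The byte constants follow from the data sizes above (2, 34, and 64 bytes), and the qualifier count of 97 is an assumed value chosen for illustration, not a figure from the slides.

```python
COST_ID = 2      # bytes per identifier
COST_MBR = 34    # bytes per (MBR, ID) pair
COST_TUPLE = 64  # bytes per full tuple


def semijoin_cost(n_sent, n_qual_sender, n_qual_receiver):
    """Bytes moved for one semijoin plus shipping its results to the query site.

    n_sent          -- (MBR, ID) pairs shipped from the site in P to the site in Q
    n_qual_sender   -- identifiers returned to P; the same tuples then go to the query site
    n_qual_receiver -- qualifying tuples at the site in Q, sent to the query site
    """
    return (n_sent * COST_MBR
            + n_qual_sender * COST_ID
            + n_qual_sender * COST_TUPLE
            + n_qual_receiver * COST_TUPLE)


def total_cost(semijoins):
    """Sum over all n/2 semijoins; each entry is (n_sent, n_qual_sender, n_qual_receiver)."""
    return sum(semijoin_cost(*s) for s in semijoins)


# Assumed qualifier counts, for illustration only.
example = semijoin_cost(100, 97, 97)
```

Under these assumed counts, `example` comes to 16010 bytes, the same as the optimized cost reported for the 100/400 two-site test, which suggests this reading of the formula is at least consistent with the results.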
Two-site Query Test
Site 1   Site 2   Optimized (bytes)   Naïve (bytes)   %Improvement
100      400      16010               32000           50
100      600      16270               44800           64
100      800      15750               57600           73
100      1000     14580               70400           79
200      400      32150               38400           17
200      600      31760               51200           38
200      800      32020               64000           50
200      1000     31890               76800           59
Four- and Six-site Query Test
Site 1   Site 2   Site 3   Site 4   Optimized (bytes)   Naïve (bytes)   %Improvement
100      200      400      600      52264               83200           37
100      200      800      1000     53410               134400          60
400      600      800      1000     162604              172900          6

For the six-site query (100, 200, 400, 600, 800, 1000):
• Optimized = 127,456 bytes
• Naïve = 198,400 bytes
• %Improvement = 36%
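The %Improvement figures appear to be the relative reduction in bytes transmitted, rounded to a whole percent. The slides do not state the formula, so the function below is an assumption; it is checked here against the six-site result and the first four-site row.

```python
def percent_improvement(optimized, naive):
    """Relative reduction in transmitted bytes, as a percentage of the naive cost."""
    return 100.0 * (naive - optimized) / naive


six_site = round(percent_improvement(127456, 198400))   # six-site query
four_site = round(percent_improvement(52264, 83200))    # sites 100, 200, 400, 600
```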
Conclusions
• For multiple-site queries, our algorithm outperforms the naïve approach in all cases
• The greater the difference in relation sizes, the greater the reduction in data transmission
Future Work
• CPU and I/O costs
• Evaluate two-site queries vs. existing strategies
• A real distributed database
• Development of more multi-site distributed spatial query processing strategies
THANK YOU!
?