Multiple-Site Distributed Spatial Query Optimization using Spatial Semijoins

Wendy Osborn and Saad Zaamout

Outline
• Introduction
• Related Work
• Algorithm
• Performance Evaluation
• Conclusion and Future Work

Canadian Cow Country.....

Spatial Data

*borrowed from www.mapquest.ca

Distributed Database

*borrowed from docs.google.com

[Figure: sites labelled Calgary, Montreal, Toronto]

Research Problem
• Efficient processing of a distributed spatial query
• Cost considerations:
  • data transmission
  • CPU
  • I/O

Related Work
• Spatial join
  • Kang et al. (2002)
• Spatial semijoins
  • Tan, Ooi, Abel (1995, 2000)
  • Karam and Petry (2006)
• Limitations
  • Two-site distributed spatial queries

The Algorithm - Assumptions
• Each site has one participating spatial relation
• Each spatial relation has one spatial attribute
• All MBRs in a relation are unique
  • relation cardinality = number of MBRs in relation
• Each spatial relation is indexed by an R-tree

Spatial Semijoin Implementation
1. "Project" the spatial attribute from relation R
  • obtain (MBR, ID) pairs from the leaf nodes of the R-tree
2. Transmit the spatial attribute to relation S
3. Perform the semijoin between RSA (the spatial attribute projected from R) and S
4. Transmit the identifiers from RSA whose MBR qualifies in the query back to relation R
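The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "qualifies" is taken to mean that two MBRs intersect, tuples are plain dictionaries with hypothetical keys "id" and "mbr", and a linear scan stands in for the R-tree lookup that the slides assume.

```python
from typing import NamedTuple


class MBR(NamedTuple):
    """Minimum bounding rectangle (hypothetical field names)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: MBR, b: MBR) -> bool:
    # Assumed join predicate: the two rectangles overlap or touch.
    return (a.xmin <= b.xmax and b.xmin <= a.xmax and
            a.ymin <= b.ymax and b.ymin <= a.ymax)


def project_spatial_attribute(relation):
    # Step 1: keep only the (MBR, ID) pairs of relation R.
    return [(t["mbr"], t["id"]) for t in relation]


def spatial_semijoin(r_sa, s):
    # Step 3: run at the site of S after r_sa has been transmitted (step 2).
    # Returns the tuples of S that qualify and the IDs from r_sa that qualify.
    qualifying_s = [t for t in s
                    if any(intersects(mbr, t["mbr"]) for mbr, _ in r_sa)]
    qualifying_r_ids = [rid for mbr, rid in r_sa
                        if any(intersects(mbr, t["mbr"]) for t in s)]
    return qualifying_s, qualifying_r_ids


# Tiny made-up relations, for illustration only.
r = [{"id": 1, "mbr": MBR(0, 0, 2, 2)}, {"id": 2, "mbr": MBR(10, 10, 12, 12)}]
s = [{"id": 7, "mbr": MBR(1, 1, 3, 3)}, {"id": 8, "mbr": MBR(20, 20, 21, 21)}]

r_sa = project_spatial_attribute(r)
s_tuples, r_ids = spatial_semijoin(r_sa, s)  # r_ids is shipped back in step 4
```

Only (MBR, ID) pairs travel from R to S and only identifiers travel back, which is where the saving in data transmission comes from.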

Algorithm - Example

[Figure: query site QS and four sites holding relations R1, R2, R3, R4; cardinality labels 100, 200, 600, 800]

Algorithm - Overview
1. Sort and group by spatial attribute cardinality
2. Transmit spatial attributes
3. Execute spatial semijoins
4. Transmit qualifying tuples to query site

Algorithm – Stage 1
• All sites (i.e. relations) are sorted in ascending order of spatial attribute cardinality
• Divided into two groups
  • P – the first n/2 sites
  • Q – the remaining n/2 sites

Algorithm - Stage 2
• Transmit spatial attribute from sites in P to sites in Q in the following manner:
  • Spatial attribute with smallest cardinality in P sent to site with smallest cardinality in Q
  • Spatial attribute with next smallest cardinality in P sent to site with next smallest cardinality in Q
  • and so on…
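Stages 1 and 2 amount to a sort, a split, and a pairwise match. A minimal sketch follows, assuming an even number of sites and using hypothetical site names; the function name pair_sites is not from the slides.

```python
def pair_sites(cardinalities):
    """Sort sites by cardinality, split them into P and Q, and pair them up.

    cardinalities: dict mapping a site name to its number of MBRs.
    Assumes an even number of sites, as in the n/2 split on the slides.
    """
    # Stage 1: ascending order of spatial attribute cardinality.
    ordered = sorted(cardinalities, key=cardinalities.get)
    half = len(ordered) // 2
    p, q = ordered[:half], ordered[half:]
    # Stage 2: i-th smallest site in P sends its spatial attribute
    # to the i-th smallest site in Q.
    pairs = list(zip(p, q))
    return p, q, pairs


# Hypothetical site names with the cardinalities 100, 200, 600, 800.
p, q, pairs = pair_sites({"a": 600, "b": 100, "c": 800, "d": 200})
```

Here p is ["b", "d"], q is ["a", "c"], and the spatial attributes travel along the pairs ("b", "a") and ("d", "c").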

Algorithm Example

[Figure: groups P and Q, with site labels R1, R3 and R4, R2; SA is transmitted from P to Q]

SA = MBR + ID

Algorithm – Stage 3
• Spatial semijoin performed between spatial attribute and relation at each site in Q
• Result:
  • set of tuples from relation that qualify in the semijoin
  • set of identifiers from spatial attribute whose MBRs qualify in the semijoin
• Identifiers shipped back to originating site in P

Algorithm Example

[Figure: groups P and Q, with site labels R1, R3 and R4, R2; ID is shipped back from Q to P]

Algorithm – Stage 4

[Figure: sites R1, R2, R3, R4 each transmit QT (qualifying tuples) to the query site QS]

Performance Evaluation
• comparison vs. naïve approach
• six-site distributed spatial query
  • 100, 200, 400, 600, 800, 1000 tuples
• each tuple has the following structure:
  • MBR, identifier, region name, population, line slope indicator

Cost Calculations
• Data sizes:
  • Character – 1 byte
  • Integer – 2 bytes
  • Long integer and double float – 8 bytes
• Cost of transmitting an identifier
  cost(ID) = sizeof(int)
• Cost of transmitting a spatial attribute value (MBR)
  cost(MBR) = 4 * sizeof(double) + sizeof(int)
• Cost of transmitting a tuple
  cost(tuple) = cost(MBR) + 20 * sizeof(char) + sizeof(long int) + sizeof(int)
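As a quick check of these sizes, the sketch below evaluates the three costs in bytes. It reads the tuple cost as a sum of field sizes (a 20-character region name, a long-integer population, and an integer line slope indicator), and it assumes that the naïve approach ships every tuple of every relation to the query site. The constant and function names are hypothetical.

```python
# Data sizes in bytes, as listed on the slide.
SIZEOF_CHAR = 1
SIZEOF_INT = 2
SIZEOF_LONG = 8
SIZEOF_DOUBLE = 8

COST_ID = SIZEOF_INT                       # identifier
COST_MBR = 4 * SIZEOF_DOUBLE + SIZEOF_INT  # four coordinates plus identifier
# Assumed reading: field sizes are added, one term per tuple field.
COST_TUPLE = COST_MBR + 20 * SIZEOF_CHAR + SIZEOF_LONG + SIZEOF_INT


def naive_cost(cardinalities):
    # Assumption: the naive strategy transmits all tuples to the query site.
    return sum(cardinalities) * COST_TUPLE


print(COST_ID, COST_MBR, COST_TUPLE)  # 2 34 64
print(naive_cost([100, 400]))         # 32000
```

Under these assumptions a tuple costs 64 bytes, and relations of 100 and 400 tuples cost 32000 bytes to ship in full.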

Cost Calculations
• Cost of performing a semijoin and transmitting tuples to query site:

  cost(X, Y, Z) = number_of_tuples(Y) * cost(MBR)
                + number_of_qualifiers(X) * (cost(ID) + cost(tuple))
                + number_of_qualifiers(Z) * cost(tuple)

• Calculated for all n/2 semijoins
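A minimal sketch of this cost function, assuming the byte sizes from the Cost Calculations slide (2, 34 and 64 bytes) and assuming that each qualifier in X costs one identifier plus one tuple while each qualifier in Z costs one tuple. The function name and the sample counts are hypothetical.

```python
def semijoin_cost(qualifiers_x, tuples_y, qualifiers_z,
                  cost_id=2, cost_mbr=34, cost_tuple=64):
    """Bytes transmitted for one semijoin plus delivery to the query site."""
    ship_attribute = tuples_y * cost_mbr   # spatial attribute sent from P to Q
    return_ids = qualifiers_x * cost_id    # identifiers sent back to P
    # Assumed: qualifying tuples from both sites go to the query site.
    to_query_site = (qualifiers_x + qualifiers_z) * cost_tuple
    return ship_attribute + return_ids + to_query_site


def total_cost(semijoins):
    # One (qualifiers_x, tuples_y, qualifiers_z) triple per P-Q pair.
    return sum(semijoin_cost(x, y, z) for x, y, z in semijoins)


example = semijoin_cost(qualifiers_x=97, tuples_y=100, qualifiers_z=97)
```

With 100 transmitted MBRs and a hypothetical 97 qualifiers on each side, the sketch gives 16010 bytes. That equals the first Optimized entry of the two-site test, although the actual qualifier counts are not given in the slides.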

Two-site Query Test

Site 1   Site 2   Optimized   Naïve   % Improvement
100      400      16010       32000   50
100      600      16270       44800   64
100      800      15750       57600   73
100      1000     14580       70400   79
200      400      32150       38400   17
200      600      31760       51200   38
200      800      32020       64000   50
200      1000     31890       76800   59

Four- and Six-site Query Test

Site 1   Site 2   Site 3   Site 4   Optimized   Naïve    % Improvement
100      200      400      600      52264       83200    37
100      200      800      1000     53410       134400   60
400      600      800      1000     162604      172900   6

For the six-site query – 100, 200, 400, 600, 800, 1000
• Optimized = 127456 bytes
• Naïve = 198400 bytes
• % Improvement = 36%
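The improvement figure can be reproduced with a one-line calculation, assuming it is the relative reduction in transmitted bytes against the naïve approach, rounded to a whole percent. The function name is hypothetical.

```python
def percent_improvement(optimized, naive):
    # Assumed definition: share of the naive byte count that is saved.
    return round(100 * (naive - optimized) / naive)


six_site = percent_improvement(127456, 198400)
print(six_site)  # 36
```

For the six-site query this gives 36, matching the figure reported above.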

Conclusions
• For multiple-site queries, our algorithm outperforms the naïve approach in all cases
• The greater the difference in relation sizes, the greater the reduction in data transmission

Future Work
• CPU and I/O costs
• Evaluate two-site queries vs. existing strategies
• A real distributed database
• Development of more multi-site distributed spatial query processing strategies

THANK YOU!

?
