spatial partitioning techniques in ... - spatial...

1
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 10 15 20 25 35 Speedup on MapReduce with Number of Queries 1 5 10 15 20 0.1 1 10 Range Query Speedup STR STR+ Z-curve K-d Quad Hilbert 0 1 2 3 4 5 6 7 8 1 2 3 4 Q1 (Dead Space) with block size (B) Hilbert K-d tree Quad-tree STR+ STR Z-Curve Spatial Partitioning Techniques in SpatialHadoop Ahmed Eldawy Louai Alarabi Mohamed F. Mokbel Department of Computer Science and Engineering University of Minnesota University of Minnesota This work is supported in part by NSF under Grants IIS-0952977 and IIS-1218168 Big Spatial Data in HDFS Hadoop Distributed File System (HDFS) ... Spatial Applications K-d Tree Z-Curve Hilbert Curve Experimental Setup Satellite Data Medical Images Space Images Social Networks Spatial Partitioning Techniques Grid Quad-tree STR/STR+ Datasets Summary of techniques Dataset Size Records All Objects 88GB 250M Buildings 25GB 109M Roads 23GB 70M Lakes 8.6GB 9M Cities 1.4GB 170K Q1= (P.width×P.height) Quality Measures Measures the amount of dead area Q3= (P.width+P.height) Measures the squareness of partitions Q2= (Partitions overlap) Measures the overlap between partitions Q4= (P.size)/ (Block size) Disk utilization Q5=Std Dev(P.count) Skewness of partitions (Load balance) Range Query Spatial Join Throughput, running time Running time Quality Measures Indexing Time Range Query Spatial Join High correlation between Q1 and the performance of spatial join Bitonic behavior due to the tradeo between quality and load balance The block size with peak performance depends on the query size Single machine is preferable over MapReduce for small queries Single Machine is better Quad tree performance degrades with large query areas due to excessive number of partitions The speedup of range query is capped by number of simultaneous queries the system runs Brain simulation, Climate changes Event detection and analysis, GIS application Volunteer GIS applications Q1 0.75 0.89 0.89 1.47 1.47 2.71 0.50 0.71 0.78 0.71 1.07 1.81 Q1 Quad STR+ K-d STR Hilbert Zcurve Quad STR+ K-d STR Hilbert Zcurve 0.52 Quad 276 440 463 675 749 1446 628 919 888 895 1054 1633 0.66 STR+ 419 381 439 577 613 1048 801 740 834 763 960 1346 0.71 K-d 465 404 416 560 664 1058 787 836 735 782 935 1321 0.71 STR 409 401 437 570 599 1031 834 765 795 733 928 1321 1.03 Hilbert 512 464 491 612 630 1100 1248 1085 1026 1030 1184 1646 1.88 Zcurve 740 589 608 837 780 1246 1979 1587 1519 1519 1644 2223 Roads (23GB) Lakes (8.6GB) Buildings (25GB) 0.00 0.20 0.40 0.60 0.80 1.00 1.20 Q1 Q2 Q3 Q4 Q5 Quality Measures Grid Quad-tree STR STR+ K-d Tree Z-Curve Hilbert 0.00 0.50 1.00 1.50 2.00 2.50 3.00 0.01 0.02 0.05 0.1 0.2 0.5 1 Q1 (Dead space) with sample ratio ( ) Grid STR STR+ Quad K-d Tree Zcurve Hilbert 0.01 0.10 1.00 10.00 100.00 0.01 0.02 0.05 0.1 0.2 0.5 1 Q5 (Load balance) with sample ratio ( ) Grid STR STR+ Quad K-d Tree Zcurve Hilbert 0 0.2 0.4 0.6 0.8 1 1 2 3 4 Q4 (Utilization) with block size (B) Hilbert K-d tree Quad-tree STR+ STR Z-Curve 0 500 1000 1500 2000 2500 5 10 15 20 Indexing Time Sampling str str+ zcurve kdtree quadtree hilbert 2.4 2.6 2.8 3 3.2 32.mb 64.mb 128.mb 256.mb Block Size STR STR+ Z-Curve K-d Tree Quad-tree Hilbert 0.00 10.00 20.00 30.00 40.00 32MB 64MB 128MB 256MB Range query throughput with Block Size (B) 0.0001 0.01 1 Tradeo between Q1 and Q4 with HDFS block size Partition techniques are relatively stable across datasets ... except for Q5 which improves with sample ratio 0 2 4 6 8 10 12 14 Cities Lakes Roads Buildings All Q1 - All Datasets hilbert kdtree quadtree str+ str zcurve Distribution-based techniques degrade with large datasets due to expansion of partitions 1% sample is as good as 100% for Q1-Q4 ... Sampling is not as scalable as partitioning Distribution-based techniques almost reach 0 skewness Technique Partitioning Bounda Grid Space Replication Quad-tree Space Replication STR Data Distribution STR+ Data Replication K-d tree Data Replication Z-Curve SFC Distribution Hilbert SFC Distribution

Upload: others

Post on 25-May-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Spatial Partitioning Techniques in ... - Spatial Hadoopspatialhadoop.cs.umn.edu/publications/ExpPartitionPoster.pdf · Hadoop Distributed File System (HDFS)... Spatial Applications

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5 10 15 20 25 35

Speedup on MapReduce with Number of Queries

1 5 10 15 20

0.1

1

10

Range Query Speedup

STR STR+ Z-curve K-d Quad Hilbert

0

1

2

3

4

5

6

7

8

1 2 3 4

Q1 (Dead Space) with block size (B)

Hilbert K-d tree Quad-tree

STR+ STR Z-Curve

Spatial Partitioning Techniques in SpatialHadoop

Ahmed Eldawy Louai Alarabi Mohamed F. MokbelDepartment of Computer Science and Engineering University of Minnesota

University of Minnesota This work is supported in part by NSF under Grants IIS-0952977 and IIS-1218168

Big Spatial Data in HDFS

Hadoop Distributed File System (HDFS)

...

Spatial Applications

K-d Tree Z-Curve Hilbert Curve

Experimental Setup

Satellite DataMedical ImagesSpace ImagesSocial Networks

Spatial Partitioning Techniques

Grid Quad-tree STR/STR+

DatasetsSummary of techniquesDataset Size Records

All Objects 88GB 250M

Buildings 25GB 109M

Roads 23GB 70M

Lakes 8.6GB 9M

Cities 1.4GB 170K

Q1= (P.width×P.height)

Quality Measures

Measures the amount

of dead area

Q3= (P.width+P.height)

Measures the squareness

of partitions

Q2= (Partitions overlap)

Measures the overlap

between partitions

Q4= (P.size)/ (Block size)

Disk utilization

Q5=Std Dev(P.count)

Skewness of partitions

(Load balance)

Range Query

Spatial Join

Throughput, running time

Running time

Quality Measures

Indexing Time

Range Query

Spatial Join

High correlation

between Q1 and

the performance

of spatial join

Bitonic behavior due to the tradeo

between quality and load balance

The block size with peak performance

depends on the query size

Single machine is preferable over

MapReduce for small queries

Single Machineis better

Quad tree performancedegrades with largequery areas due toexcessive numberof partitions

The speedup of range query is

capped by number of simultaneous

queries the system runs

Brain simulation, Climate changesEvent detection and analysis, GIS applicationVolunteer GIS applications

Q1 0.75 0.89 0.89 1.47 1.47 2.71 0.50 0.71 0.78 0.71 1.07 1.81

Q1 Quad STR+ K-d STR Hilbert Zcurve Quad STR+ K-d STR Hilbert Zcurve

0.52 Quad 276 440 463 675 749 1446 628 919 888 895 1054 1633

0.66 STR+ 419 381 439 577 613 1048 801 740 834 763 960 1346

0.71 K-d 465 404 416 560 664 1058 787 836 735 782 935 1321

0.71 STR 409 401 437 570 599 1031 834 765 795 733 928 1321

1.03 Hilbert 512 464 491 612 630 1100 1248 1085 1026 1030 1184 1646

1.88 Zcurve 740 589 608 837 780 1246 1979 1587 1519 1519 1644 2223Ro

ad

s (

23G

B)

Lakes (8.6GB) Buildings (25GB)

0.00

0.20

0.40

0.60

0.80

1.00

1.20

Q1 Q2 Q3 Q4 Q5

Quality Measures

Grid Quad-tree STR STR+ K-d Tree Z-Curve Hilbert

0.00

0.50

1.00

1.50

2.00

2.50

3.00

0.01 0.02 0.05 0.1 0.2 0.5 1

Q1 (Dead space) with sample ratio ( )

Grid STR STR+ Quad K-d Tree Zcurve Hilbert

0.01

0.10

1.00

10.00

100.00

0.01 0.02 0.05 0.1 0.2 0.5 1

Q5 (Load balance) with sample ratio ( )

Grid STR STR+ Quad K-d Tree Zcurve Hilbert

0

0.2

0.4

0.6

0.8

1

1 2 3 4

Q4 (Utilization) with block size (B)

Hilbert K-d tree Quad-tree

STR+ STR Z-Curve0

500

1000

1500

2000

2500

5 10 15 20

Indexing Time

Sampling

str str+ zcurve

kdtree quadtree hilbert

2.4

2.6

2.8

3

3.2

32.mb 64.mb 128.mb 256.mb

Block Size

STR STR+ Z-Curve K-d Tree Quad-tree Hilbert

0.00

10.00

20.00

30.00

40.00

32MB 64MB 128MB 256MB

Range query throughput with Block Size (B)

0.0001 0.01 1

Tradeo between Q1 and Q4

with HDFS block size

Partition techniques are

relatively stable across datasets

... except for Q5 which

improves with sample ratio

0

2

4

6

8

10

12

14

Cities Lakes Roads Buildings All

Q1- All Datasets

hilbert kdtree quadtree str+ str zcurve

Distribution-basedtechniques degradewith large datasetsdue to expansionof partitions

1% sample is as good

as 100% for Q1-Q4 ...

Sampling is not as scalable

as partitioning

Distribution-basedtechniques almostreach 0 skewness

Technique Partitioning Boundary Objects

Grid Space Replication

Quad-tree Space Replication

STR Data Distribution

STR+ Data Replication

K-d tree Data Replication

Z-Curve SFC Distribution

Hilbert SFC Distribution