Lehrstuhl Informatik III: Datenbanksysteme

Community Training: Partitioning Schemes in Good Shape for Federated Data Grids

Tobias Scholl, Richard Kuntschke, Angelika Reiser, Alfons Kemper

3rd IEEE International Conference on e-Science and Grid Computing, Bangalore, India, December 10th – 13th 2007

The AstroGrid-D Project

German Astronomy Community Grid http://www.gac-grid.org/

funded by the German Ministry of Education and Research

part of the D-Grid

The Multiwavelength Milky Way

http://adc.gsfc.nasa.gov/mw/

Data-intensive e-Science Applications

Many e-science application areas: astrophysics, geosciences, climatology

Combination of various, globally distributed data sources

Increasing popularity (within community and public domain): more users

Scalability issues with current approaches

Up-coming Data-intensive Applications

Data rates: terabytes a day/night, petabytes a year

Examples: LOFAR, LSST, Pan-STARRS, LHC

Current Sharing in Data Grids

Images from: “Data-Intensive Grid Computing” by Srikumar Venugopal

Data autonomy

Policies allow partners to access data

Each institution ensures availability (replication) and scalability

Various organizational structures: centralized, hierarchical, federated, hybrid

Community-Driven Data Distribution

The Training Phase

Extract data from the archives

Compute partitioning schemes

Compare different partitioning schemes: standard quadtrees, median-based quadtrees, zones

Quadtrees

Well-known index structure

Recursive decomposition

Adaptive to data resolution

Quadtree-based Schemes: Splitting Variants

Center splitting

Always bisects each dimension

Congruent subareas

Splitting points stored or computed

Median heuristics

Compute median for each dimension independently

O(n) median algorithm

Splitting points stored
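The two splitting variants can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `split_point` and `quad_split`, the `(x_min, y_min, x_max, y_max)` layout of `bounds`, and the rule that points lying on a split line go to the upper quadrant are all hypothetical. `statistics.median`, which sorts its input, stands in for the O(n) median algorithm named on the slide.

```python
from statistics import median


def split_point(points, bounds, mode):
    """Return the (x, y) point at which a region is split into four."""
    x_min, y_min, x_max, y_max = bounds
    if mode == "center":
        # Center splitting: bisect each dimension. The split point can be
        # recomputed from the bounds, so it does not have to be stored.
        return ((x_min + x_max) / 2, (y_min + y_max) / 2)
    if mode == "median":
        # Median heuristic: median of each dimension, taken independently.
        # The result depends on the data, so it has to be stored.
        # statistics.median sorts internally (assumption: acceptable for a
        # sketch; a linear-time selection could be substituted).
        return (median(p[0] for p in points), median(p[1] for p in points))
    raise ValueError(f"unknown splitting mode: {mode}")


def quad_split(points, bounds, mode):
    """Split points into four quadrants around the chosen split point."""
    sx, sy = split_point(points, bounds, mode)
    quadrants = [[], [], [], []]
    for x, y in points:
        # Quadrant index: bit 0 set if right of sx, bit 1 set if above sy.
        index = (1 if x >= sx else 0) + (2 if y >= sy else 0)
        quadrants[index].append((x, y))
    return (sx, sy), quadrants
```

For a sample clustered in one corner of the region, center splitting can place every point in a single quadrant, while the median heuristic spreads the same points evenly over the four quadrants.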

The Zones Index (J. Gray, M. Nieto-Santisteban, A. Szalay)

Index structure for databases

Specific spatial clustering in zones

Optimized for points-in-region queries and self-match, cross-match queries

Equi-distant partitioning

Declination coordinate

Zone(ra, dec) = floor((dec + 90.0) / h)

Implemented directly in SQL

Image by J. Gray et al.
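The zone formula translates directly into code. The sketch below assumes that `h` is the zone height in degrees and that declinations are given in degrees from -90 to +90; the slide does not define `h`, so that reading is an assumption. The right ascension is omitted from the function because the formula does not use it.

```python
import math


def zone(dec, h):
    """Zone number for declination `dec` (degrees), zone height `h` (degrees)."""
    # Shift declination from [-90, +90] to [0, 180], then count whole zones.
    return math.floor((dec + 90.0) / h)
```

With a zone height of 0.5 degrees, the south pole falls into zone 0 and the celestial equator into zone 180.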

Evaluation Setup

2 data sets: skewed and uniform

Size of data sample: 0.01%, 0.1%, 1%, 10%

Number of partitions: 4, 16, 64, 256, 1024, 4096, 8192, 16384, 32768, 65536, 131072 (2^4 – 2^17)

Skewed Training Data (Dskew)

Comparing Partitioning Schemes

Duration

Average data population

Variance in partition population

Empty partitions

Size of the training set

Baseline comparison

Average Data Population

[Figure: average data population (%) versus number of partitions (16 to 131072); 10% training sample, Dskew. Series: center splitting (skewed data), median splitting (skewed data), center splitting (uniform data), median splitting (uniform data).]

Average data population = avg(# objects in partitions) / # objects in biggest partition
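The load-balancing measures used in the comparison can be computed from the per-partition object counts. The following is a sketch, assuming `counts` is a list with one object count per partition and at least one non-empty partition. The function names are hypothetical, and the use of the population variance is an assumption, since the slides do not say which variance is meant.

```python
from statistics import mean, pvariance


def average_data_population(counts):
    """avg(# objects in partitions) / # objects in biggest partition, in %."""
    return 100.0 * mean(counts) / max(counts)


def population_variance(counts):
    """Variance in partition population (population variance is assumed)."""
    return pvariance(counts)


def empty_partition_ratio(counts):
    """Share of partitions that hold no objects, in %."""
    return 100.0 * sum(1 for c in counts if c == 0) / len(counts)
```

A perfectly balanced scheme reaches an average data population of 100%, while a scheme that puts all objects into one of four partitions reaches only 25% and leaves 75% of its partitions empty.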

Evolution of the Partitioning Scheme

[Figure panels: 4096 partitions, 8192 partitions, 16384 partitions]

Empty Leaves

[Figure: empty partitions (%) versus number of partitions (16 to 131072); 10% training sample, Dskew. Series: center splitting, median splitting, zones.]

Size of the Training Set

[Figure: sample ratio (1 to 16384) and empty partitions (%) versus number of partitions (16 to 131072); 0.1% training sample, Dskew. Series: sample ratio, center splitting, median splitting.]

Sample ratio = # objects in training set / # partitions to be created

Standard Quadtree vs. Median Heuristics

Evaluation – Summary

Quadtrees: good adaptation to data distribution

Quadtrees: trade-off between data load balancing and uniformly shaped regions

Median-based heuristics: best data load balancing even for skewed data sets

Zone Index: good for uniform data sets

Training set needs to be sufficiently large in order not to artificially create empty partitions

HiSbase

Peer-to-Peer layer assigns data partitions to peers

Higher flexibility

New peers are integrated seamlessly

Query Submission (at Peer d)

Determine relevant regions: [1,3]

Select coordinator: Region 1

Send CoordinateQuery message to id1: {[1,3], SQL}

Message gets routed to Peer a.
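The submission steps above can be sketched as follows. Everything in this block is hypothetical and only illustrates the flow. Regions are modeled as axis-aligned boxes keyed by region id, the coordinator is assumed to be the relevant region with the lowest id (the slide does not state the selection policy), and the message is a plain dictionary rather than a real HiSbase or FreePastry message. Routing of the message to the responsible peer is left to the peer-to-peer layer and is not shown.

```python
def overlaps(a, b):
    """True if boxes a and b, given as (x_min, y_min, x_max, y_max), intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def submit_query(sql, query_box, regions):
    """Build a CoordinateQuery message for the regions a query touches.

    `regions` maps a region id to its bounding box (hypothetical layout).
    Returns None if the query touches no region.
    """
    # Step 1: determine the relevant regions.
    relevant = sorted(rid for rid, box in regions.items()
                      if overlaps(box, query_box))
    if not relevant:
        return None
    # Step 2: select the coordinator (assumption: lowest relevant region id).
    coordinator = relevant[0]
    # Step 3: address the message to the coordinator's region id.
    return {"type": "CoordinateQuery", "to": coordinator,
            "regions": relevant, "sql": sql}
```

With four square regions numbered 0 to 3 and a query box that spans regions 1 and 3, the resulting message lists the regions `[1, 3]` and is addressed to region 1.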

Query Coordination (at Peer a)

Peer a is coordinator

Contact relevant regions

Collect intermediate results

Send complete result to client
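The coordinator's role amounts to a scatter-gather step, which might look like the sketch below. The message layout (a dictionary with `regions` and `sql` keys) and the `execute_on_region` callback are assumptions made for illustration. In a real deployment the callback would contact the peer responsible for each region, which this sketch does not attempt.

```python
def coordinate_query(message, execute_on_region):
    """Run the query on every relevant region and merge the partial results.

    `execute_on_region(region_id, sql)` is a hypothetical callback that
    returns the rows found in one region as a list.
    """
    complete_result = []
    for region_id in message["regions"]:
        # Contact the region and collect its intermediate result.
        complete_result.extend(execute_on_region(region_id, message["sql"]))
    # The complete result is what the coordinator sends back to the client.
    return complete_result
```

This version contacts the regions one after another; issuing the requests concurrently would be a natural refinement.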

Prototype Implementation

Java-based prototype

FreePastry library (Pastry implementation): Rice University, MPI-SWS

Presentations at BTW 2007, Aachen, Germany, and VLDB 2007, Vienna, Austria

Deployed in various settings: LAN, WAN (AstroGrid-D, PlanetLab)

Summary

Training phase

Community-driven Data Grids

Domain-specific partitioning scheme

Partitioning scheme supports data skew and region-based queries

Framework for comparing partitioning schemes

Various measures with regard to data load balancing

Ongoing Work

Database-driven comparison

0.1% of a petabyte is still 1 terabyte

Feasibility of median-based techniques

Workload-aware data partitioning

Heterogeneous data nodes

Get in Touch

Database systems group, TU München

Web site: http://www-db.in.tum.de

E-mail: [email protected]

HiSbase: http://www-db.in.tum.de/research/projects/hisbase/

Data stream management: “Grid-based Data Stream Processing in e-Science” (e-Science '06)

http://www.gac-grid.de/project-products/Software/DataStreamManagement.html

AstroGrid-D Research Demo

Finding Galaxy Clusters using Grid Computing Technology

Room: Banquet

Wednesday, 3:30 pm to 6:00 pm