LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Xiaodan Wang, Randal Burns
Department of Computer Science
Johns Hopkins University

Tanu Malik
Cyber Center
Purdue University

Post on 20-Feb-2016

Page 1: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Xiaodan Wang, Randal Burns
Department of Computer Science
Johns Hopkins University

Tanu Malik
Cyber Center
Purdue University

LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Page 2: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

LifeRaft: Data-Driven, Batch Processing

BETTER LUCK NEXT TIME!

Page 3: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Problem

[Figure: queries Q1, Q2, Q3, and Q4]

Page 4: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Goals

Eliminate redundant I/O to improve query throughput

Batch queries that exhibit data sharing
– Pre-process queries to identify data sharing
– Co-schedule queries that access the same data
– Access contentious data first to maximize sharing

Starvation resistance
– Avoid indefinite queuing times (response time)
– Enforce some constraints on completion order

Page 5: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Target Applications

Data intensive scan queries
– Executed against a clustered index
– Clustered and federated databases (e.g. joins that correlate multiple nodes)

Peta-scale astronomy (Pan-STARRS)
– Data are partitioned spatially
– Many queries scan full DB and last hours or days

Cross-match
– Probabilistic spatial join across multiple databases

Page 6: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Filter and Refine

Filter queries
– Pre-process queries to determine join buckets
– Buckets B1,…,Bn and queries Q1,…,Qm
– Workload Wij denotes the objects from Qi that overlap Bj

Refinement
– Read buckets one-at-a-time
– Sort-merge join (sort by HTM ID)
– Query specific predicates applied on output tuples
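The filter and refine steps can be illustrated with a minimal Python sketch. It is not the LifeRaft implementation: the names `filter_queries`, `sort_merge_join`, and `bucket_of` are hypothetical, objects are reduced to integer IDs standing in for HTM IDs, and the join is a plain equality match rather than the probabilistic spatial cross-match.

```python
from collections import defaultdict


def filter_queries(queries, bucket_of):
    """Filter step: group each query's object IDs by the bucket they overlap.

    Returns workload[j][i], the sorted IDs from query i that fall in bucket j
    (the slide's Wij). `bucket_of` is an assumed ID -> bucket mapping.
    """
    workload = defaultdict(lambda: defaultdict(list))
    for qid, ids in queries.items():
        for obj_id in ids:
            workload[bucket_of(obj_id)][qid].append(obj_id)
    # Sort once so the refine step can merge against the sorted bucket.
    for per_query in workload.values():
        for ids in per_query.values():
            ids.sort()
    return workload


def sort_merge_join(query_ids, bucket_rows):
    """Refine step: merge sorted query IDs with bucket rows sorted by ID.

    `bucket_rows` is a list of (id, payload) tuples. Equality on the ID is a
    simplification of the real spatial match.
    """
    out, i, j = [], 0, 0
    while i < len(query_ids) and j < len(bucket_rows):
        if query_ids[i] < bucket_rows[j][0]:
            i += 1
        elif query_ids[i] > bucket_rows[j][0]:
            j += 1
        else:
            # Emit every bucket row that shares this ID, then advance the query side.
            k = j
            while k < len(bucket_rows) and bucket_rows[k][0] == query_ids[i]:
                out.append(bucket_rows[k])
                k += 1
            i += 1
    return out


# Example: two queries, buckets of width 10 (an arbitrary choice).
queries = {"Q1": [5, 12, 3], "Q2": [12, 27]}
workload = filter_queries(queries, lambda obj_id: obj_id // 10)
bucket1_rows = [(11, "a"), (12, "b"), (12, "c"), (19, "d")]
matches = sort_merge_join(workload[1]["Q1"], bucket1_rows)
```

Because both Q1 and Q2 have work in bucket 1, a single read of that bucket can serve both joins; query specific predicates would then be applied to `matches`.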

Page 7: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Workload Throughput Metric

Schedule buckets greedily in order of decreasing workload throughput

Exploits data regions that experience contention

May starve requests
– Favors buckets experiencing frequent reuse
– No guarantee a particular bucket or query receives service
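The greedy choice can be sketched as follows. The slide's formula did not survive in this transcript, so the cost model here is an assumption: throughput is taken to be pending objects divided by the estimated time to service the bucket (one I/O unless the bucket is cached, plus a small per-object join cost). The names `workload_throughput` and `next_bucket` and the default costs are hypothetical.

```python
def workload_throughput(pending_objects, cached, io_cost=1.0, join_cost=0.001):
    """Objects joined per second if this bucket is serviced next.

    Assumed cost model, not taken from the slides: a cached bucket
    needs no I/O, and each pending object adds a small join cost.
    """
    cost = (0.0 if cached else io_cost) + join_cost * pending_objects
    return pending_objects / cost if cost > 0 else 0.0


def next_bucket(pending, cache):
    """Greedy step: pick the bucket with the highest workload throughput.

    `pending` maps bucket name -> number of queued objects across all queries;
    `cache` is the set of buckets currently in memory.
    """
    candidates = [b for b, n in pending.items() if n > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda b: workload_throughput(pending[b], b in cache))


pending = {"B1": 200, "B3": 50, "B4": 120}
first_with_cache = next_bucket(pending, cache={"B3"})
first_without_cache = next_bucket(pending, cache=set())
```

Under these assumptions a cached bucket wins even with little pending work, and otherwise the most contended bucket wins. A bucket with few queued objects can be passed over indefinitely, which is the starvation risk the slide notes.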

Page 8: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Aged Workload Throughput Metric

Inspired by disk-drive head scheduling

Balance arrival order (low response time) with contention (high throughput)

Adaptive trade-offs based on workload saturation
– Maximize rate at which objects are joined during saturated workloads
– Enforce completion order (queuing times) to prevent indefinite starvation during low saturation
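One way to picture the trade-off is a linear blend of throughput and age controlled by an age bias between 0 and 1, matching the later slides where bias 0 is contention based and bias 1 is age based. The exact metric is not given in this transcript, so the blend, the normalization by the current maxima, and the names `aged_score` and `pick_bucket` are assumptions made for illustration.

```python
def aged_score(throughput, age, bias, max_throughput, max_age):
    """Blend normalized throughput with normalized age.

    bias = 0 ranks purely by contention (throughput);
    bias = 1 ranks purely by age (arrival order).
    The linear form and the normalization are assumptions.
    """
    t = throughput / max_throughput if max_throughput > 0 else 0.0
    a = age / max_age if max_age > 0 else 0.0
    return (1.0 - bias) * t + bias * a


def pick_bucket(buckets, bias):
    """buckets maps name -> (workload throughput, age in seconds of oldest request)."""
    max_t = max(t for t, _ in buckets.values())
    max_a = max(a for _, a in buckets.values())
    return max(buckets, key=lambda b: aged_score(*buckets[b], bias, max_t, max_a))


# B1 is heavily contended but its requests are recent; B5 is lightly used but old.
buckets = {"B1": (166.0, 2.0), "B5": (20.0, 30.0)}
contention_choice = pick_bucket(buckets, bias=0.0)
age_choice = pick_bucket(buckets, bias=1.0)
```

With a low bias the contended bucket is served first; raising the bias eventually forces the old request in B5 to be served, which bounds its queuing time.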

Page 9: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Scheduling Behavior

[Figure: queries Qi, Qj, and Qk overlaid on buckets B1 to B8]

Sub-divide queries by bucket:
Qi – Qi1, Qi2, Qi3
Qj – Qj3, Qj4, Qj5, Qj6, Qj7, Qj8
Qj – Qj5, Qj6, Qj7, Qj8
Qk

Assumptions:
- Inter-query time of 1 sec
- I/O for each bucket of 1 sec
- Cache size of 2
- Join cost is negligible

Page 10: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Arrival order with no sharing

[Figure: timeline of bucket reads over buckets B1 to B8 for queries Qi, Qj, and Qk, with arrival (Arr) and completion (End) markers]

Bucket reads shown (order as extracted from the slide): Qi1 (B1), Qi2 (B2), Qi3 (B3), Qj1 (B1), Qj3 (B3), Qj4 (B4), Qj6 (B6), Qj7 (B7), Qj8 (B8), Qk1 (B1), Qk4 (B4), Qk8 (B8)

Completion Times:
Qi – 3 sec
Qj – 8 sec
Qk – 13 sec
Avg – 8 sec
Tp – .2 qry/sec

Page 11: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Age based scheduling (bias 1)

[Figure: timeline of bucket reads over buckets B1 to B8 for queries Qi, Qj, and Qk, with arrival (Arr) and completion (End) markers]

Bucket reads shown (order as extracted from the slide): Qi1 (B1), Qi2 (B2), Qi5 (B5), Qi3 Qj3 (B3), Qj1 Qk1 (B1), Qj4 Qk4 (B4), Qj6 Qk6 (B6), Qj8 Qk8 (B8), Qj7 Qk7 (B7)

Completion Times:
Qi – 3 sec
Qj – 7 sec
Qk – 7 sec
Avg – 5.6 sec
Tp – .33 qry/sec

Page 12: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Contention based scheduling (bias 0)

[Figure: timeline of bucket reads over buckets B1 to B8 for queries Qi, Qj, and Qk, with arrival (Arr) and completion (End) markers]

Bucket reads shown (order as extracted from the slide): Qi1 (B1), Qi2 (B2), Qi3 Qj3 (B3), Qk5 (B5), Qj1 Qk1 (B1), Qj4 Qk4 (B4), Qj6 Qk6 (B6), Qj7 Qk7 (B7), Qj8 Qk8 (B8)

Completion Times:
Qi – 7 sec
Qj – 5 sec
Qk – 6 sec
Avg – 6 sec (5.6)
Tp – .38 qry/sec (.33)

Page 13: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Throughput Performance

Page 14: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Tuning the age bias

Throughput performance gap grows while response time gap is insensitive to saturation

Increasing age bias is more attractive at low saturation

Page 15: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Parameter tuning using trade-off curves

Page 16: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Discussion

Impact of caching strategies

Workload overflow
– Large intermediate join results
– Migrate pairs of workload and bucket

Beyond completion order
– Higher priority for interactive queries

Batch processing in a clustered environment

P. Agrawal, D. Kifer, and C. Olston. Scheduling Shared Scans of Large Data Files. In VLDB, 2008.

Page 17: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

WHAT ABOUT US?

Page 18: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Filter and refine

Partition data into buckets

Page 19: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Average Response Time

Page 20: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Outline

Motivation
– Goals for data-driven, batch scheduling
– Target application (SkyQuery)

LifeRaft scheduler
– Filter and refine queries
– Throughput maximizing metric
– Starvation resistance
– Differences in outcomes

Workload adaptive parameter selection