TRANSCRIPT
Towards Automatic Optimization of MapReduce Programs
(Position Paper)
Shivnath Babu
Duke University
Roadmap
• Call to action to improve automatic optimization techniques in MapReduce frameworks
• Challenges & promising directions
[Diagram: Pig, Hive, JAQL, … running on Hadoop over HDFS]
Lifecycle of a MapReduce Job
Map function
Reduce function
Run this program as a MapReduce job
[Diagram: over time, input splits are processed by Map Wave 1 and Map Wave 2, whose outputs feed Reduce Wave 1 and Reduce Wave 2]
How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?
Job Configuration Parameters
• 190+ parameters in Hadoop
• Set manually or defaults are used
• Are defaults or rules-of-thumb good enough?
Experiments on EC2 and local clusters
[Plots: running time, in seconds and minutes, across different configuration settings]
• Performance at default and rule-of-thumb settings can be poor
• Cross-parameter interactions are significant
Illustrative Result: 50GB Terasort
17-node cluster, 64+32 concurrent map+reduce slots

mapred.reduce.tasks | io.sort.factor | io.sort.record.percent
                 10 |             10 |                   0.15
                 10 |            500 |                   0.15
                 28 |             10 |                   0.15   (based on popular rule-of-thumb)
                300 |             10 |                   0.15
                300 |            500 |                   0.15

[Bar chart: running time for each setting]
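The three knobs in the table are ordinary Hadoop (0.20-era) configuration properties. As a sketch only, one candidate setting from the table could be expressed cluster-wide in mapred-site.xml like this (values taken from the last row above):

```xml
<!-- Sketch: one candidate setting from the Terasort experiment,
     as Hadoop 0.20-style configuration properties. -->
<configuration>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>300</value>
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>500</value>
  </property>
  <property>
    <name>io.sort.record.percent</name>
    <value>0.15</value>
  </property>
</configuration>
```

The same properties can also be overridden per job, e.g. with `-D name=value` on the command line, which is how a tuning experiment like the one above would typically vary them.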
Problem Space
Current approaches:
• Predominantly manual
• Post-mortem analysis
Job configuration parameters
Declarative HiveQL/Pig operations
Multi-job workflows
Performance objectives
Cost in pay-as-you-go environments
Energy considerations
[Diagram axes: complexity vs. space of execution choices]
Is this where we want to be?
Good plan / good setting of parameters
Can DB Query Optimization Technology Help?
But:
– MapReduce jobs are not declarative
– No schema about the data
– Impact of concurrent jobs & scheduling?
– Space of parameters is huge
Optimizer:
• Enumerate
• Cost
• Search
[Diagram: Query → Database Execution Engine → Results; MapReduce job → Hadoop → Results]
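The enumerate-cost-search loop above can be made concrete. The following is a minimal sketch, not anyone's actual optimizer: the cost function is entirely invented for illustration, whereas a real cost-based optimizer would estimate running time from data statistics and a model of MapReduce execution. The parameter grid reuses the Terasort settings shown earlier.

```python
import itertools

# Hypothetical cost model (illustration only): penalize too few
# reducers (long reduce waves), too many reducers (scheduling
# overhead), and a small merge fan-in.
def estimated_cost(reduce_tasks, sort_factor, record_percent):
    return (5000 / reduce_tasks) + 0.5 * reduce_tasks \
        + (100 / sort_factor) + 1000 * abs(record_percent - 0.15)

def optimize(search_space):
    """Enumerate candidate settings, cost each one, search for the cheapest."""
    best_setting, best_cost = None, float("inf")
    for setting in itertools.product(*search_space.values()):
        cost = estimated_cost(*setting)
        if cost < best_cost:
            best_setting, best_cost = setting, cost
    return dict(zip(search_space.keys(), best_setting)), best_cost

space = {
    "mapred.reduce.tasks": [10, 28, 300],
    "io.sort.factor": [10, 500],
    "io.sort.record.percent": [0.15],
}
best, cost = optimize(space)
print(best)
```

The exhaustive loop works here because the grid is tiny; the "space of parameters is huge" objection above is precisely why the search step needs to be smarter than brute-force enumeration in practice.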
Can we:
– Borrow/adapt ideas from the wide spectrum of query optimizers that have been developed over the years?
• Or innovate!
– Exploit design & usage properties of MapReduce frameworks?
Spectrum of Query Optimizers
Rule-based ←→ Conventional Optimizers (cost models + statistics about data)
AT's Conjecture: Rule-based Optimizers (RBOs) will trump Cost-based Optimizers (CBOs) in MapReduce frameworks
Insight: Predictability(RBO) >> Predictability(CBO)
Learning Optimizers: learn from execution & adapt
Tuning Optimizers: proactively try different plans
Exploit usage & design properties of MapReduce frameworks:
• High ratio of repeated jobs to new jobs
• Schema can be learned (e.g., Pig scripts)
• Common sort-partition-merge skeleton
• Mechanisms for adaptation stemming from design for robustness (speculative execution, storing intermediate results)
• Fine-grained and pluggable scheduler
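The first property, the high ratio of repeated jobs to new jobs, is what a learning optimizer exploits most directly. As a minimal sketch (all class and method names here are illustrative, not Hadoop APIs): remember the best-performing configuration observed for each job signature, and reuse it whenever that job is resubmitted.

```python
# Sketch of a learning optimizer exploiting repeated jobs:
# it learns from every execution and adapts future config choices.
class LearningOptimizer:
    def __init__(self, default_config):
        self.default_config = default_config
        self.history = {}  # job signature -> (best config, best runtime)

    def choose_config(self, signature):
        # Repeated job: reuse the best configuration observed so far;
        # new job: fall back to the default settings.
        if signature in self.history:
            return self.history[signature][0]
        return self.default_config

    def record_run(self, signature, config, runtime):
        # Keep only the fastest configuration seen for this signature.
        best = self.history.get(signature)
        if best is None or runtime < best[1]:
            self.history[signature] = (config, runtime)

opt = LearningOptimizer({"mapred.reduce.tasks": 28})
sig = "terasort-50GB"
opt.record_run(sig, {"mapred.reduce.tasks": 28}, 2000)   # first run, default
opt.record_run(sig, {"mapred.reduce.tasks": 300}, 1200)  # tried an alternative
print(opt.choose_config(sig))  # repeat runs get the faster setting
```

A tuning optimizer would add the proactive half: deliberately varying `config` on some runs (or on speculative copies of tasks) so that `record_run` sees more of the search space than user-chosen settings alone would cover.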
Summary
• Call to action to improve automatic optimization techniques in MapReduce frameworks
– Automated generation of optimized Hadoop configuration parameter settings, HiveQL/Pig/JAQL query plans, etc.
– Rich history to learn from
– MapReduce execution creates unique opportunities/challenges