Hadoop Scheduling - a 7 year perspective
Job Scheduling in Hadoop: an exposé
Joydeep Sen Sarma
About Me
• Facebook: Ran/Managed Hadoop ~3 years; wrote Hive
• Mentor/PM, Hadoop Fair-Scheduler
• Used Hadoop/Hive (as Warehouse/ETL Dev)
• Re-wrote significant chunks of Hadoop Job Scheduling (incl. Corona)
• Qubole: Running World’s largest Hadoop clusters on AWS
c 2007
c 2014
The Crime
• Statistical Multiplexing: largest jobs only fit on pooled hardware
• Data Locality
• Easier to manage
Shared Hadoop Clusters
… and the Punishment
• “Have you no Hadoop Etiquettes?” (c 2007)
(reducer count capped in response)
• User takes down entire Cluster (OOM) (c 2007-09)
• Bad Job slows down entire Cluster (c 2009)
• Steady State Latencies get intolerable (c 2010-)
• ”How do I know I am getting my fair share?” (c 2011)
• “Too few reducer slots, cluster idle” (c 2013)
The Perfect Weapon
• Efficient
• Scalable
• Strong Isolation
• Fair
• Fault Tolerant
• Low Latency
Scheduler
Quick Review
• Fair Scheduler (Fairness/Isolation)
• Speculation (Fault Tolerance/Latency)
• Preemption (Fairness)
• Usage Monitoring/Limits (Isolation)
And then there’s Hadoop (1.x) …
• Single JobTracker for all Jobs
– Does not scale, SPOF
• Pull Based Architecture
– Scalability and Low Latency at permanent War
– Inefficient – leaves idle time
• Slot Based Scheduling
– Inefficient
• Pessimistic Locking in Tracker
– Scalability Bottleneck
• Long Running Tasks
– Fairness and Efficiency at permanent War
insert overwrite table dest
select … from ads join campaigns on …
group by …;
Poll Driven Scheduling
[Diagram: JobTracker (master) exchanges heartbeats with TaskTrackers (slaves); each TaskTracker launches Child JVMs to run Map and Reduce tasks]
Pessimistic Locking

getBestTask():
  for pool in sortedPools:
    for job in pool.sortedJobs():
      for task in job.tasks():
        if betterMatch(task): …

processHeartbeat():
  synchronized(world):
    return getBestTask()
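The pseudocode above can be sketched in Java; the class and method names here are illustrative, not Hadoop's actual internals. The key point is that every heartbeat serializes on one global lock while the scheduler walks all pools, jobs, and tasks:

```java
import java.util.List;
import java.util.ArrayList;

// Sketch of JobTracker-style pessimistic locking (names hypothetical).
// Each TaskTracker heartbeat holds one global lock for the whole scan,
// so heartbeats queue behind each other - a scalability bottleneck.
class PessimisticScheduler {
    private final Object world = new Object();                      // single global lock
    final List<List<List<String>>> sortedPools = new ArrayList<>(); // pool -> jobs -> tasks

    // Nested scan over every pool/job/task: O(cluster size) per heartbeat.
    String getBestTask() {
        for (List<List<String>> pool : sortedPools)
            for (List<String> job : pool)
                for (String task : job)
                    if (betterMatch(task))
                        return task;
        return null;
    }

    // Placeholder for the data-locality / slot-fit check.
    boolean betterMatch(String task) { return true; }

    // Entire scan runs inside the critical section.
    String processHeartbeat() {
        synchronized (world) {
            return getBestTask();
        }
    }
}
```

With thousands of TaskTrackers heartbeating, the scan under `synchronized(world)` is what puts scalability and low latency "at permanent war".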
Slot Based Scheduling
• N cpus, M map slots, R reduce slots
– Memory cannot be oversubscribed!
• How to divide?
– M < N: not enough mappers at times
– R < N: not enough reducers at times
– N = M = R: enough memory to run 2N tasks?
• Reduce Tasks Problematic
– Network Intensive to start, CPU wasted
– Memory Intensive later
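The slot-division dilemma can be made concrete with assumed numbers (a 16-core node with 48 GB of RAM and ~2 GB per task JVM; these figures are not from the talk):

```java
// Static slots: setting N = M = R gives enough mappers and enough
// reducers individually, but allows 2N concurrent tasks - which can
// oversubscribe memory even though every task "has a slot".
class SlotMath {
    // Worst-case memory demand if all map AND reduce slots fill at once.
    static long worstCaseMemMB(int mapSlots, int reduceSlots, long taskMemMB) {
        return (long) (mapSlots + reduceSlots) * taskMemMB;
    }

    static boolean oversubscribed(int cores) {
        long nodeMemMB = 48 * 1024;  // assumed node memory: 48 GB
        long taskMemMB = 2 * 1024;   // assumed per-task JVM: 2 GB
        int m = cores, r = cores;    // N = M = R
        return worstCaseMemMB(m, r, taskMemMB) > nodeMemMB;
    }
}
```

For 16 cores the worst case demands 32 × 2 GB = 64 GB against 48 GB physical: the static split forces you to choose between idle slots and memory pressure.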
Long Running Reducers
• Online Scheduling
– No advance information of future workload
• Greedy + Fair Scheduling
– Schedule ASAP
– Preempt if future workload disagrees
• Long Running Reducers
– Preemption causes restart and wasted work
– No effective way to use short bursts of idle cpu
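A toy cost model (all numbers assumed, not from the talk) shows why preempting a long-running reducer is so expensive, and why short bursts of idle CPU cannot be exploited safely:

```java
// Toy model: Hadoop reduce tasks have no checkpointing, so a preempted
// task restarts from scratch and all elapsed work is discarded.
class PreemptionCost {
    // Seconds of work thrown away if we preempt now.
    static long wastedOnPreempt(long secondsRunSoFar) {
        return secondsRunSoFar;           // redo everything
    }

    // Total CPU-seconds consumed by a task of length taskLenSec that is
    // preempted once at tPreemptSec and then reruns to completion.
    static long totalSeconds(long taskLenSec, long tPreemptSec) {
        return tPreemptSec + taskLenSec;  // wasted partial run + full rerun
    }
}
```

A one-hour reducer preempted 50 minutes in costs nearly two hours of CPU; running it in a "short burst of idle cpu" is a guaranteed loss.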
Optimistic Locking

Task[] getBestTaskCandidates():
  for pool in sortedPools:
    for job in pool.sortedJobs.clone():
      for task in job.tasks.clone():
        synchronized(task):
          …

processHeartbeat():
  tasks = getBestTaskCandidates()
  synchronized(world):
    return acquireTasks(tasks)
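A Java sketch of the optimistic variant (again with hypothetical names): the expensive scan runs over cloned snapshots with no global lock, taking only brief per-task locks, and the global lock guards just the short final acquire step:

```java
import java.util.List;
import java.util.ArrayList;

// Optimistic locking sketch: snapshot the job/task lists, scan them
// lock-free, and hold the global lock only to validate and commit.
class OptimisticScheduler {
    private final Object world = new Object();
    final List<List<List<String>>> sortedPools = new ArrayList<>();

    // Scan cloned lists outside any global lock.
    List<String> getBestTaskCandidates() {
        List<String> candidates = new ArrayList<>();
        for (List<List<String>> pool : sortedPools)
            for (List<String> job : new ArrayList<>(pool))   // clone jobs
                for (String task : new ArrayList<>(job))      // clone tasks
                    synchronized (task) {                     // brief per-task lock
                        if (betterMatch(task)) candidates.add(task);
                    }
        return candidates;
    }

    boolean betterMatch(String task) { return true; }

    // Placeholder: re-check candidates are still free and claim them.
    List<String> acquireTasks(List<String> tasks) { return tasks; }

    List<String> processHeartbeat() {
        List<String> tasks = getBestTaskCandidates();  // outside global lock
        synchronized (world) {                         // short critical section
            return acquireTasks(tasks);
        }
    }
}
```

The trade-off is that a candidate may be taken by a concurrent heartbeat between the scan and the acquire, so `acquireTasks` must re-validate; in exchange, heartbeats no longer serialize on the full cluster scan.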
Corona: Push Scheduling
1. JT subscribes for M maps and R reduces
– Receives availability from Cluster Manager (CM)
2. CM publishes availability ASAP
– Pushes events to JT
3. JT pushes tasks to available TT
– In parallel
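The three steps above can be sketched as a small event flow in Java (the interfaces are assumed for illustration, not Corona's real API):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Push scheduling sketch: the Cluster Manager pushes resource grants to
// a subscribed per-job tracker, which immediately pushes tasks to the
// granted TaskTracker - no polling, no idle time waiting for heartbeats.
class PushScheduling {
    interface Listener { void onGrant(String node); }         // JT callback

    static class ClusterManager {
        private final Queue<Listener> subscribers = new ArrayDeque<>();
        void subscribe(Listener jt) { subscribers.add(jt); }  // step 1: JT subscribes
        void nodeAvailable(String node) {                     // step 2: push ASAP
            Listener jt = subscribers.peek();
            if (jt != null) jt.onGrant(node);
        }
    }

    static class JobTracker implements Listener {
        final Queue<String> pendingTasks = new ArrayDeque<>();
        final Queue<String> launched = new ArrayDeque<>();
        public void onGrant(String node) {                    // step 3: push task to TT
            String task = pendingTasks.poll();
            if (task != null) launched.add(task + "@" + node);
        }
    }
}
```

Inverting the flow from pull to push removes the heartbeat round-trip from the scheduling latency path.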
Corona/YARN: Scalability
1. JobTracker for each Job now Independent
– More Fault Tolerant and Isolated as well
2. Centralized Cluster/Resource Manager
– Must be super-efficient!
3. Fundamental Differences
– Corona ~ Latency
– YARN ~ Heterogeneous workloads
Pesky Reducers
• Hadoop 2 removes distinction between M and R slots
• Not Enough
– Reduce Tasks don’t use much CPU in shuffle
– Still long running and bad to preempt
Re-architect to run millions of small Reducers
The Future is Cloudy
• Data Center Assumption:
– Cluster characteristics known
– Job spec fits to cluster
• In Cloud:
– Cluster can grow/shrink, change node-type
– Job Spec must be dynamic
– Uniform task configuration untenable