TRANSCRIPT
Azure
Cluster Scheduling at Microsoft Scale
MIT, 10/31/2019
Computer Networks (6.829)
Konstantinos Karanasos
GSL (CISL)
Mission: applied research lab working on systems for big data, cloud, and machine learning
Office of the CTO for Azure Data
~15 researchers/engineers in the Bay Area, Redmond, and Madison
The three hats we wear
Applied research group
Collaborating with database, big data, and AI infra groups at MS
Open-sourcing our code: Apache Hadoop, ONNX Runtime, MLflow, REEF, Heron
Current Focus Areas
Resource management
Systems for ML
ML for Systems
Query Optimization, Provenance
What is a Resource Manager?
Node Manager
Node Manager
Node Manager
1. Request
2. Allocation
3. Start container
Acquire cluster resources in the form of containers
Do we really need a Resource Manager?
Lessons learned: Abstracting out the RM layer
Ad-hoc app
Ad-hoc app
Ad-hoc app
Ad-hoc Apps
YARN
MR v2
Tez, Giraph, Storm, Dryad, REEF
...
Hive / Pig
Hadoop 1.x (MapReduce)
MR v1
Hive / Pig
Users
Application Frameworks
Programming Model(s)
Cluster OS (Resource Management)
Hadoop 1 World Hadoop 2 World
File System: HDFS 1 / HDFS 2
Hardware
Ad-hoc app
Ad-hoc app
Scope on YARN
Spark
monolithic vs. reuse of RM component
YARN
layering abstractions
Heron
Cosmos: Microsoft’s Analytics Stack
Hydra
Cluster(s): >50K nodes
Job(s): >2M tasks, >5PB input
Tasks: 10 sec at the 50th percentile
Utilization: ~60% avg CPU util
Scheduler: >70K QPS
The Scale and Utilization Challenge
Scheduling in Analytics Clusters: a Journey…
Hadoop MR, Centralized sched.
MR
2003  2008  2013  2019
Scope, Centralized sched.
Scope
[eurosys07, vldb08]
Hydra
MR, YARN, Tez, Spark
+ multi-framework, + security, + scheduler expressivity [socc13]
Scope1 Scope2 Scope3
Distributed sched. + tooling/optimizer, + scale, + high utilization [osdi14]
Teaser…
>99% tenants migrated
>250K servers
>500K daily jobs
>1 Zettabyte data processed
>1 Trillion tasks scheduled
Agenda
Overview of legacy systems (Apollo, YARN)
Scale: Federated YARN
Resource utilization: Opportunistic Containers
Production Experience
Distributed Scheduling: Legacy Cosmos (Apollo)
Legacy Cosmos (Apollo)
JM1 JM2 JM3
Distributed scheduling got us scale
Execution plan
Cosmos Job Service
Compile/Optimize
T = SELECT * FROM blah;
SELECT T.blah FROM T;
Admission Control (gang-queue)
Distributed scheduling + node-level queuing
* JM: Job Manager
Script
Legacy Cosmos
Problem: required static resource allocation yields low utilization
resource fragmentation within and across queues
Solution: opportunistically share resources
“guaranteed” containers
“opportunistic” containers
Opportunistic scheduling got us utilization
Legacy Cosmos: Pros/Cons
+ Great scalability
+ Good resource utilization
- No support for multiple frameworks
- Limited control on scheduling (e.g., fairness, load-balancing, locality, special HW)
Centralized Scheduling: Apache Hadoop YARN
Legacy YARN
AM
RM
NM
1) Submit job
2) Admission control (fairness, quotas, constraints, SLOs, …)
3) Schedule Application Master (AM)
4) Start AM container
NM NM
Legacy YARN
AM
RM
NM NM
Task
4) AM requests more containers on each heartbeat
5) RM grants “token”
6) Start container
7) AM-Task communication
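This heartbeat-driven negotiation (AM asks the RM for containers, the RM hands back grants with tokens, and the AM redeems each token at a NodeManager) can be sketched as follows. All class and method names here are illustrative stand-ins, not YARN's real API:

```python
# Minimal sketch of heartbeat-driven container negotiation (illustrative).

class ResourceManager:
    def __init__(self, nodes):
        self.free = dict(nodes)   # node name -> free container slots
        self.next_token = 0

    def allocate(self, ask):
        """Grant up to `ask` containers; each grant is a (node, token) pair
        that the AM later redeems at that node's NodeManager."""
        grants = []
        for node in self.free:
            while ask > 0 and self.free[node] > 0:
                self.free[node] -= 1
                grants.append((node, f"token-{self.next_token}"))
                self.next_token += 1
                ask -= 1
        return grants

class ApplicationMaster:
    def __init__(self, rm, tasks):
        self.rm, self.pending, self.running = rm, list(tasks), []

    def heartbeat(self):
        """On each heartbeat, ask for containers for still-pending tasks."""
        for node, token in self.rm.allocate(len(self.pending)):
            task = self.pending.pop(0)
            self.running.append((task, node, token))  # NM verifies token, starts container

rm = ResourceManager({"nm1": 2, "nm2": 1})
am = ApplicationMaster(rm, ["t1", "t2", "t3", "t4"])
am.heartbeat()
print(len(am.running), len(am.pending))  # 3 started, 1 still pending
```

The idle-resource problem mentioned on the next slide falls directly out of this loop: a slot freed between heartbeats sits unused until the next heartbeat arrives.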
Legacy YARN: Pros/Cons
+ Support for arbitrary frameworks
+ Rich scheduling invariants
+ Non-tech advantage: OSS/mind-share
- Scale limits (~5K nodes in 2013)
- Utilization: heartbeats lead to idle resources
Main challenges
High resource utilization
Scalability
Rich placement constraints
Production jobs and predictability
Medea [eurosys18]
Mercury [atc15,eurosys16] Federation [nsdi19]
Morpheus [osdi16]
OSS + production!
OSS
Improving YARN’s Scalability: Federated Architecture
Need scale and utilization? Go distributed!
Need scheduling control and multi-framework support? Go centralized!
Want it all?
RM
sub-cluster 1
RM
sub-cluster 2
Hydra Architecture
Router
Router
StateStoreProxy
StateStoreProxy
StateStoreProxy
StateStore
NM
Hydra Architecture
AMRMProxy
AM
Task
NM NM
Hydra Architecture
Global Policy Generator
NM
Scheduling Desiderata: Global Goals
High utilization
Scheduling invariants (e.g., fairness)
Locality (e.g., machine preferences)
QA QB
root
50% 50%
demand: 12 / demand: 4
demand: 12 / demand: 12
✓ Locality, Fairness
✗ Utilization
✓ Utilization, Fairness
✗ Locality
✓ Utilization, locality
✗ Fairness
✓ Utilization ✓ Fairness ✓ Locality
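One standard way to reconcile fairness and utilization in examples like the one above is max-min fairness (water-filling): unused share from a low-demand queue flows to queues that still have demand. A minimal sketch, assuming a hypothetical 16-container cluster shared by the two 50% queues:

```python
# Water-filling max-min fair share determination (illustrative sketch).
def max_min_share(capacity, demands):
    """Give every unsatisfied queue an equal split of remaining capacity;
    a queue that needs less returns the surplus, which is re-split."""
    alloc = {q: 0.0 for q in demands}
    active = {q for q, d in demands.items() if d > 0}
    while active and capacity > 1e-9:
        share = capacity / len(active)
        for q in list(active):
            give = min(share, demands[q] - alloc[q])
            alloc[q] += give
            capacity -= give
            if alloc[q] >= demands[q] - 1e-9:
                active.remove(q)
    return alloc

# Hypothetical 16-container cluster:
print(max_min_share(16, {"QA": 12, "QB": 4}))   # QA gets 12, QB gets 4: no waste
print(max_min_share(16, {"QA": 12, "QB": 12}))  # 8 each: fair, demand partly unmet
```

With demands 12/4, QB's unused 4 containers go to QA, so utilization and fairness coexist; locality still depends on where those containers land, which is why it is the separate placement problem on the next slides.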
Policies
AMRMProxy routing of requests: enforce locality?
Per-cluster RM scheduling decisions: enforce quotas?
Key Idea
Decouple:
Share determination: How many resources should a queue get?
Placement: On which machines should each task run?
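The decoupling can be sketched as two independent phases: a global phase decides how much each queue gets, and a local phase decides where each granted container runs. The equal-split policy and round-robin placement below are simplistic stand-ins for the real policies:

```python
# Two-phase sketch of the share/placement decoupling (illustrative policies).
def determine_shares(total_capacity, queue_demands):
    """Phase 1, share determination: how many containers each queue gets.
    Equal split capped by demand; a stand-in for the real fairness policy
    (leftover capacity is not redistributed in this toy version)."""
    share = total_capacity // max(len(queue_demands), 1)
    return {q: min(share, d) for q, d in queue_demands.items()}

def place(shares, nodes):
    """Phase 2, placement: which machine each granted container lands on.
    Round-robin over nodes; a stand-in for locality-aware placement."""
    placement, i = {}, 0
    for q, n in shares.items():
        placement[q] = [nodes[(i + k) % len(nodes)] for k in range(n)]
        i += n
    return placement

shares = determine_shares(8, {"QA": 6, "QB": 2})
print(shares)  # {'QA': 4, 'QB': 2}
print(place(shares, ["n1", "n2", "n3"]))
```

The point of the split is that phase 1 needs only aggregate demand (so it can run globally and slowly), while phase 2 needs only local node state (so each sub-cluster RM can run it fast and independently).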
RM
sub-cluster 1
RM
sub-cluster 2
StateStoreProxy
StateStoreProxy
StateStoreProxy
StateStore
Global Policy Generator
Proposed Solution*
* More advanced than what is in production (details in the paper)
Handling GPG downtime
If the GPG is down, we fall back to local decisions; problematic if these diverge too much from the global ones
Leverage LP-based “tuning” of local queue allocations, using historical demand as a predictor of future demand
Improving YARN’s Utilization: Opportunistic Containers
Suboptimal Utilization in Scope-on-YARN
NM1 NM2
RM, j1, j2
Due to: gang scheduling, feedback delays
Task duration:   5 sec | 10 sec | 50 sec | Mixed-5-50
Utilization:    60.59% | 78.35% | 92.38% | 78.54%
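The trend in the table (short tasks hurt utilization most) is what a back-of-the-envelope feedback-delay model predicts: if a slot sits idle for d seconds of scheduling round-trip between consecutive tasks of duration t, steady-state utilization is about t/(t+d). This is an illustrative approximation with an assumed round-trip, not the model behind the measured numbers:

```python
def slot_utilization(task_secs, feedback_delay_secs):
    """Fraction of time a slot does useful work when each task completion
    incurs a scheduling round-trip before the next task starts."""
    return task_secs / (task_secs + feedback_delay_secs)

# With a hypothetical ~2 s round-trip, short tasks waste proportionally more:
for t in (5, 10, 50):
    print(t, round(slot_utilization(t, 2.0), 2))
```

Opportunistic containers attack exactly this term: by keeping work queued at the node, the idle gap d between tasks shrinks toward zero.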
Opportunistic Containers in YARN
AM
RM
NM NM
Task
Mask feedback delays
Only OPP containers can be queued
Promotion/demotion of containers
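The queuing, preemption, and promotion rules can be sketched at the level of a single node. The classes and names below are illustrative, not YARN's actual NodeManager API:

```python
# Sketch of guaranteed vs. opportunistic containers at one node (illustrative).
GUARANTEED, OPPORTUNISTIC = "G", "O"

class NodeManager:
    def __init__(self, slots):
        self.slots = slots
        self.running = []   # [container_id, kind] pairs
        self.queued = []    # only opportunistic containers may wait here

    def start(self, cid, kind):
        if len(self.running) < self.slots:
            self.running.append([cid, kind])
        elif kind == GUARANTEED:
            # A guaranteed container preempts a running opportunistic one.
            for entry in self.running:
                if entry[1] == OPPORTUNISTIC:
                    self.running.remove(entry)
                    self.queued.append(entry[0])  # re-queue the preempted task
                    break
            self.running.append([cid, kind])
        else:
            self.queued.append(cid)  # OPP containers queue instead of failing

    def promote(self, cid):
        """Promote a running OPP container so it can no longer be preempted."""
        for entry in self.running:
            if entry[0] == cid:
                entry[1] = GUARANTEED

nm = NodeManager(slots=1)
nm.start("opp-1", OPPORTUNISTIC)   # runs immediately (node was idle)
nm.start("guar-1", GUARANTEED)     # preempts opp-1, which goes back to the queue
print(nm.running, nm.queued)       # [['guar-1', 'G']] ['opp-1']
```

A real NM would also reject a guaranteed container when no opportunistic one can be preempted; that case is omitted here for brevity.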
Utilization Gains with Node-Side Queuing
But are long queues all we need?
Job Completion Times with Node-Side Queuing
Naïve node-side queuing can be detrimental to job completion times
Proper queue management techniques are required
Problems with Node-Side Queuing
Load imbalance across nodes: suboptimal task placement
Head-of-line blocking: especially for heterogeneous tasks
Early binding of tasks to nodes
Queue management techniques
Place tasks to node queues
Prioritize task execution (queue reordering)
Bound queue lengths
Placement of Tasks to Queues
Placement based on queue length: agnostic of task characteristics; suboptimal placement for heterogeneous workloads
Placement based on queue wait time: better for heterogeneous workloads; requires task duration estimates
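The two policies differ only in what they minimize: queued task count vs. estimated queued work. A sketch with hypothetical node names and task-duration estimates:

```python
def by_queue_length(queues):
    """Pick the node with the fewest queued tasks (task-agnostic)."""
    return min(queues, key=lambda n: len(queues[n]))

def by_wait_time(queues, estimates):
    """Pick the node whose queued work finishes soonest; needs per-task
    duration estimates (e.g., from earlier runs of the same job stage)."""
    return min(queues, key=lambda n: sum(estimates[t] for t in queues[n]))

# Two long tasks queued on N1 vs. three short tasks on N2:
queues = {"N1": ["a", "b"], "N2": ["c", "d", "e"]}
est = {"a": 100, "b": 100, "c": 1, "d": 1, "e": 1}
print(by_queue_length(queues))    # N1 (shorter queue, but 200 s of work)
print(by_wait_time(queues, est))  # N2 (only 3 s of queued work)
```

The example shows why length-based placement degrades under heterogeneity: the "shorter" queue can hide far more waiting time.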
Task Prioritization
Queue reordering strategies: Shortest Remaining Job First (SRJF), Least Remaining Tasks First (LRTF), Shortest Task First (STF), Earliest Job First (EJF)
SRJF and LRTF are job-aware: dynamically reorder tasks based on job progress
Starvation freedom: give priority to tasks waiting more than X secs
Example (per-job remaining tasks): j1: 21, j2: 5, j3: 9
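SRJF with the starvation guard can be sketched as a pure reordering function over a node's queue. The job names, enqueue times, and remaining-work values below are hypothetical:

```python
# Sketch of SRJF queue reordering with a starvation guard (illustrative).
def reorder_srjf(queue, remaining_work, now, max_wait=10.0):
    """queue: list of (task, job, enqueue_time); remaining_work[job] is the
    job's estimated remaining work (e.g., task-seconds left). Tasks waiting
    longer than max_wait jump to the front (starvation freedom); the rest
    run shortest-remaining-job-first."""
    starved = sorted((e for e in queue if now - e[2] > max_wait),
                     key=lambda e: e[2])                      # oldest first
    rest = sorted((e for e in queue if now - e[2] <= max_wait),
                  key=lambda e: remaining_work[e[1]])         # SRJF order
    return starved + rest

# j1 has the most remaining work, but its task has waited past the bound:
q = [("t1", "j1", 0.0), ("t2", "j2", 5.0), ("t3", "j3", 8.0)]
order = reorder_srjf(q, {"j1": 21, "j2": 5, "j3": 9}, now=12.0)
print([t for t, _, _ in order])   # ['t1', 't2', 't3']: t1 starved, then SRJF
```

Without the guard, t1 would sit behind every shorter job indefinitely; with it, the reordering stays SRJF except for the bounded-wait override.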
Bounding Queue Lengths
Determine the max number of tasks at a queue: trade-off between short and long queues
Short queues: resource idling → lower throughput
Long queues: high queuing delays, early binding of tasks to queues → longer job completion times
Static and dynamic queue bounding
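A simple dynamic bound can target just enough queued work to cover the scheduling feedback delay (so slots never idle) without queuing far beyond it (so early binding stays cheap). The formula and headroom factor below are illustrative assumptions, not the paper's exact policy:

```python
def queue_bound(avg_task_secs, feedback_delay_secs, slots, headroom=1.5):
    """Dynamic bound on a node's queue length: roughly the number of tasks
    the node drains during one feedback delay, times a small headroom."""
    drained_during_delay = feedback_delay_secs / max(avg_task_secs, 1e-9)
    return max(1, round(slots * drained_during_delay * headroom))

# Short tasks drain fast, so they need a longer queue to hide the delay:
print(queue_bound(avg_task_secs=5, feedback_delay_secs=10, slots=4))   # 12
print(queue_bound(avg_task_secs=50, feedback_delay_secs=10, slots=4))  # 1
```

This captures the slide's trade-off directly: as average task duration grows, the bound shrinks toward one, because a long queue would only add queuing delay.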
Queuing Techniques: Evaluation
1.7x improvement in median JCT over YARN
Both bounding and reordering are crucial
Production Experience
Workload
Scale: high scheduling rate at low allocation latency
Utilization: federated design improves load balancing while retaining utilization
Task Performance (comparison)
Jobs perform just as well (and tasks are as efficient)
Qualitative Experience
operability
Recap
Overview of legacy systems (Apollo, YARN)
Scale: Federated YARN
Resource utilization: Opportunistic Containers
Production Experience
Conclusion
Research in Industry is a lot of fun
Fewer, bigger projects, with massive wins: big force multipliers
Failure is expected (otherwise we are not trying hard enough)
Modus Operandi
Be picky in choosing problems; engage early, engage deep
Be inclusive with prod/OSS counterparts