TRANSCRIPT
Azure
Cluster Scheduling at Microsoft Scale
MIT, 10/31/2019
Computer Networks (6.829)
Konstantinos Karanasos
GSL (CISL)
Mission: applied research lab working on systems for big data, cloud, and machine learning
Office of the CTO for Azure Data
~15 researchers/engineers in the Bay Area, Redmond, and Madison
The three hats we wear
Applied research group
Collaborating with database, big data, and AI infra groups at MS
Open-sourcing our code: Apache Hadoop, ONNX Runtime, MLflow, REEF, Heron
Current Focus Areas
Resource management
Systems for ML
ML for Systems
Query Optimization, Provenance
What is a Resource Manager?
Node Manager
Node Manager
Node Manager
1. Request
2. Allocation
3. Start container
Acquire cluster resources in the form of containers
Do we really need a Resource Manager?
Lessons learned: Abstracting out the RM layer
Ad-hoc app
Ad-hoc app
Ad-hoc app
Ad-hoc Apps
YARN
MR v2
Tez, Giraph, Storm, Dryad, REEF
...
Hive / Pig
Hadoop 1.x (MapReduce)
MR v1
Hive / Pig
Users
Application Frameworks
Programming Model(s)
Cluster OS (Resource Management)
Hadoop 1 World Hadoop 2 World
File System: HDFS 1 / HDFS 2
Hardware
Ad-hoc app
Ad-hoc app
Scope on YARN
Spark
monolithic vs. reuse of RM component
YARN
layering abstractions
Heron
Cosmos: Microsoft’s Analytics Stack
Hydra
Cluster(s): >50K nodes
Job(s): >2M tasks, >5PB input
Tasks: 10 sec at the 50th percentile
Utilization: ~60% avg CPU util
Scheduler: >70K QPS
The Scale and Utilization Challenge
Scheduling in Analytics Clusters: a Journey…
Hadoop MR, Centralized sched.
MR
2003  2008  2013  2019
Scope, Centralized sched.
Scope
[eurosys07, vldb08]
Hydra
MR, YARN, Tez, Spark
+ multi-framework, + security, + scheduler expressivity [socc13]
Scope1 Scope2 Scope3
Distributed sched. + tooling/optimizer, + scale, + high utilization [osdi14]
Teaser…
>99% tenants migrated
>250K servers
>500K daily jobs
>1 Zettabyte data processed
>1 Trillion tasks scheduled
Agenda
Overview of legacy systems (Apollo, YARN)
Scale: Federated YARN
Resource utilization: Opportunistic Containers
Production Experience
Distributed Scheduling: Legacy Cosmos (Apollo)
Legacy Cosmos (Apollo)
JM1 JM2 JM3
Distributed scheduling got us scale
Execution plan
Cosmos Job Service
Compile/Optimize
T = SELECT * FROM blah;
SELECT T.blah FROM T;
Admission Control (gang-queue)
Distributed scheduling + node-level queuing
* JM: Job Manager
Script
Legacy Cosmos
Problem: required static resource allocation yields low utilization
resource fragmentation within and across queues
Solution: opportunistically share resources
“guaranteed” containers
“opportunistic” containers
Opportunistic scheduling got us utilization
Legacy Cosmos: Pros/Cons
+ Great scalability
+ Good resource utilization
- No support for multiple frameworks
- Limited control on scheduling (e.g., fairness, load-balancing, locality, special HW)
Centralized Scheduling: Apache Hadoop YARN
Legacy YARN
AM
RM
NM
1) Submit job
2) Admission control (fairness, quotas, constraints, SLOs, …)
3) Schedule Application Master (AM)
4) Start AM container
NM NM
Legacy YARN
AM
RM
NM NM
Task
4) AM requests more containers on each heartbeat
5) RM grants “token”
6) Start container
7) AM-Task communication
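This heartbeat-driven negotiation (AM asks the RM for containers, the RM hands back grants with tokens, and the AM redeems each token at a NodeManager) can be sketched as follows. All class and method names here are illustrative stand-ins, not YARN's real API:

```python
# Minimal sketch of heartbeat-driven container negotiation (illustrative).

class ResourceManager:
    def __init__(self, nodes):
        self.free = dict(nodes)   # node name -> free container slots
        self.next_token = 0

    def allocate(self, ask):
        """Grant up to `ask` containers; each grant is a (node, token) pair
        that the AM later redeems at that node's NodeManager."""
        grants = []
        for node in self.free:
            while ask > 0 and self.free[node] > 0:
                self.free[node] -= 1
                grants.append((node, f"token-{self.next_token}"))
                self.next_token += 1
                ask -= 1
        return grants

class ApplicationMaster:
    def __init__(self, rm, tasks):
        self.rm, self.pending, self.running = rm, list(tasks), []

    def heartbeat(self):
        """On each heartbeat, ask for containers for still-pending tasks."""
        for node, token in self.rm.allocate(len(self.pending)):
            task = self.pending.pop(0)
            self.running.append((task, node, token))  # NM verifies token, starts container

rm = ResourceManager({"nm1": 2, "nm2": 1})
am = ApplicationMaster(rm, ["t1", "t2", "t3", "t4"])
am.heartbeat()
print(len(am.running), len(am.pending))  # 3 started, 1 still pending
```

The idle-resource problem mentioned on the next slide falls directly out of this loop: a slot freed between heartbeats sits unused until the next heartbeat arrives.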
Legacy YARN: Pros/Cons
+ Support for arbitrary frameworks
+ Rich scheduling invariants
+ Non-tech advantage: OSS/mind-share
- Scale limits (~5K nodes in 2013)
- Utilization: heartbeats lead to idle resources
Main challenges
High resource utilization
Scalability
Rich placement constraints
Production jobs and predictability
Medea [eurosys18]
Mercury [atc15,eurosys16] Federation [nsdi19]
Morpheus [osdi16]
OSS + production!
OSS
Improving YARN’s Scalability: Federated Architecture
Need scale and utilization? Go distributed!
Need scheduling control and multi-framework support? Go centralized!
Want it all?
RM
sub-cluster 1
RM
sub-cluster 2
Hydra Architecture
Router
Router
StateStoreProxy
StateStoreProxy
StateStoreProxy
StateStore
NM
Hydra Architecture
AMRMProxy
AM
Task
NM NM
Hydra Architecture
Global Policy Generator
NM
Scheduling Desiderata: Global Goals
High utilization
Scheduling invariants (e.g., fairness)
Locality (e.g., machine preferences)
QA QB
root
50% 50%
demand: 12 / demand: 4
demand: 12 / demand: 12
✓ Locality, Fairness
✗ Utilization
✓ Utilization, Fairness
✗ Locality
✓ Utilization, locality
✗ Fairness
✓ Utilization ✓ Fairness ✓ Locality
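One standard way to reconcile fairness and utilization in examples like the one above is max-min fairness (water-filling): unused share from a low-demand queue flows to queues that still have demand. A minimal sketch, assuming a hypothetical 16-container cluster shared by the two 50% queues:

```python
# Water-filling max-min fair share determination (illustrative sketch).
def max_min_share(capacity, demands):
    """Give every unsatisfied queue an equal split of remaining capacity;
    a queue that needs less returns the surplus, which is re-split."""
    alloc = {q: 0.0 for q in demands}
    active = {q for q, d in demands.items() if d > 0}
    while active and capacity > 1e-9:
        share = capacity / len(active)
        for q in list(active):
            give = min(share, demands[q] - alloc[q])
            alloc[q] += give
            capacity -= give
            if alloc[q] >= demands[q] - 1e-9:
                active.remove(q)
    return alloc

# Hypothetical 16-container cluster:
print(max_min_share(16, {"QA": 12, "QB": 4}))   # QA gets 12, QB gets 4: no waste
print(max_min_share(16, {"QA": 12, "QB": 12}))  # 8 each: fair, demand partly unmet
```

With demands 12/4, QB's unused 4 containers go to QA, so utilization and fairness coexist; locality still depends on where those containers land, which is why it is the separate placement problem on the next slides.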
Policies
AMRMProxy routing of requests: enforce locality?
Per-cluster RM scheduling decisions: enforce quotas?
Key Idea
Decouple:
Share determination: How many resources should a queue get?
Placement: On which machines should each task run?
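The decoupling can be sketched as two independent phases: a global phase decides how much each queue gets, and a local phase decides where each granted container runs. The equal-split policy and round-robin placement below are simplistic stand-ins for the real policies:

```python
# Two-phase sketch of the share/placement decoupling (illustrative policies).
def determine_shares(total_capacity, queue_demands):
    """Phase 1, share determination: how many containers each queue gets.
    Equal split capped by demand; a stand-in for the real fairness policy
    (leftover capacity is not redistributed in this toy version)."""
    share = total_capacity // max(len(queue_demands), 1)
    return {q: min(share, d) for q, d in queue_demands.items()}

def place(shares, nodes):
    """Phase 2, placement: which machine each granted container lands on.
    Round-robin over nodes; a stand-in for locality-aware placement."""
    placement, i = {}, 0
    for q, n in shares.items():
        placement[q] = [nodes[(i + k) % len(nodes)] for k in range(n)]
        i += n
    return placement

shares = determine_shares(8, {"QA": 6, "QB": 2})
print(shares)  # {'QA': 4, 'QB': 2}
print(place(shares, ["n1", "n2", "n3"]))
```

The point of the split is that phase 1 needs only aggregate demand (so it can run globally and slowly), while phase 2 needs only local node state (so each sub-cluster RM can run it fast and independently).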
RM
sub-cluster 1
RM
sub-cluster 2
StateStoreProxy
StateStoreProxy
StateStoreProxy
StateStore
Global Policy Generator
Proposed Solution*
* More advanced than what is in production (details in the paper)
Handling GPG downtime
If the GPG is down, we fall back to local decisions; problematic if these diverge too much from the global ones
Leverage LP-based “tuning” of local queue allocations, using historical demand as a predictor of future demand
Improving YARN’s Utilization: Opportunistic Containers
Suboptimal Utilization in Scope-on-YARN
NM1 NM2
RM, j1, j2
Due to: gang scheduling, feedback delays
Task duration:   5 sec | 10 sec | 50 sec | Mixed-5-50
Utilization:    60.59% | 78.35% | 92.38% | 78.54%
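The trend in the table (short tasks hurt utilization most) is what a back-of-the-envelope feedback-delay model predicts: if a slot sits idle for d seconds of scheduling round-trip between consecutive tasks of duration t, steady-state utilization is about t/(t+d). This is an illustrative approximation with an assumed round-trip, not the model behind the measured numbers:

```python
def slot_utilization(task_secs, feedback_delay_secs):
    """Fraction of time a slot does useful work when each task completion
    incurs a scheduling round-trip before the next task starts."""
    return task_secs / (task_secs + feedback_delay_secs)

# With a hypothetical ~2 s round-trip, short tasks waste proportionally more:
for t in (5, 10, 50):
    print(t, round(slot_utilization(t, 2.0), 2))
```

Opportunistic containers attack exactly this term: by keeping work queued at the node, the idle gap d between tasks shrinks toward zero.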
Opportunistic Containers in YARN
AM
RM
NM NM
Task
Mask feedback delays
Only OPP containers can be queued
Promotion/demotion of containers
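The queuing, preemption, and promotion rules can be sketched at the level of a single node. The classes and names below are illustrative, not YARN's actual NodeManager API:

```python
# Sketch of guaranteed vs. opportunistic containers at one node (illustrative).
GUARANTEED, OPPORTUNISTIC = "G", "O"

class NodeManager:
    def __init__(self, slots):
        self.slots = slots
        self.running = []   # [container_id, kind] pairs
        self.queued = []    # only opportunistic containers may wait here

    def start(self, cid, kind):
        if len(self.running) < self.slots:
            self.running.append([cid, kind])
        elif kind == GUARANTEED:
            # A guaranteed container preempts a running opportunistic one.
            for entry in self.running:
                if entry[1] == OPPORTUNISTIC:
                    self.running.remove(entry)
                    self.queued.append(entry[0])  # re-queue the preempted task
                    break
            self.running.append([cid, kind])
        else:
            self.queued.append(cid)  # OPP containers queue instead of failing

    def promote(self, cid):
        """Promote a running OPP container so it can no longer be preempted."""
        for entry in self.running:
            if entry[0] == cid:
                entry[1] = GUARANTEED

nm = NodeManager(slots=1)
nm.start("opp-1", OPPORTUNISTIC)   # runs immediately (node was idle)
nm.start("guar-1", GUARANTEED)     # preempts opp-1, which goes back to the queue
print(nm.running, nm.queued)       # [['guar-1', 'G']] ['opp-1']
```

A real NM would also reject a guaranteed container when no opportunistic one can be preempted; that case is omitted here for brevity.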
Utilization Gains with Node-Side Queuing
But are long queues all we need?
Job Completion Times with Node-Side Queuing
Naïve node-side queuing can be detrimental to job completion times
Proper queue management techniques are required
Problems with Node-Side Queuing
Load imbalance across nodes: suboptimal task placement
Head-of-line blocking: especially for heterogeneous tasks
Early binding of tasks to nodes
Queue management techniques
Place tasks to node queues
Prioritize task execution (queue reordering)
Bound queue lengths
Placement of Tasks to Queues
Placement based on queue length: agnostic of task characteristics; suboptimal placement for heterogeneous workloads
Placement based on queue wait time: better for heterogeneous workloads; requires task duration estimates
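The two policies differ only in what they minimize: queued task count vs. estimated queued work. A sketch with hypothetical node names and task-duration estimates:

```python
def by_queue_length(queues):
    """Pick the node with the fewest queued tasks (task-agnostic)."""
    return min(queues, key=lambda n: len(queues[n]))

def by_wait_time(queues, estimates):
    """Pick the node whose queued work finishes soonest; needs per-task
    duration estimates (e.g., from earlier runs of the same job stage)."""
    return min(queues, key=lambda n: sum(estimates[t] for t in queues[n]))

# Two long tasks queued on N1 vs. three short tasks on N2:
queues = {"N1": ["a", "b"], "N2": ["c", "d", "e"]}
est = {"a": 100, "b": 100, "c": 1, "d": 1, "e": 1}
print(by_queue_length(queues))    # N1 (shorter queue, but 200 s of work)
print(by_wait_time(queues, est))  # N2 (only 3 s of queued work)
```

The example shows why length-based placement degrades under heterogeneity: the "shorter" queue can hide far more waiting time.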
Task Prioritization
Queue reordering strategies: Shortest Remaining Job First (SRJF), Least Remaining Tasks First (LRTF), Shortest Task First (STF), Earliest Job First (EJF)
SRJF and LRTF are job-aware: dynamically reorder tasks based on job progress
Starvation freedom: give priority to tasks waiting more than X secs
Example (per-job remaining tasks): j1: 21, j2: 5, j3: 9
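SRJF with the starvation guard can be sketched as a pure reordering function over a node's queue. The job names, enqueue times, and remaining-work values below are hypothetical:

```python
# Sketch of SRJF queue reordering with a starvation guard (illustrative).
def reorder_srjf(queue, remaining_work, now, max_wait=10.0):
    """queue: list of (task, job, enqueue_time); remaining_work[job] is the
    job's estimated remaining work (e.g., task-seconds left). Tasks waiting
    longer than max_wait jump to the front (starvation freedom); the rest
    run shortest-remaining-job-first."""
    starved = sorted((e for e in queue if now - e[2] > max_wait),
                     key=lambda e: e[2])                      # oldest first
    rest = sorted((e for e in queue if now - e[2] <= max_wait),
                  key=lambda e: remaining_work[e[1]])         # SRJF order
    return starved + rest

# j1 has the most remaining work, but its task has waited past the bound:
q = [("t1", "j1", 0.0), ("t2", "j2", 5.0), ("t3", "j3", 8.0)]
order = reorder_srjf(q, {"j1": 21, "j2": 5, "j3": 9}, now=12.0)
print([t for t, _, _ in order])   # ['t1', 't2', 't3']: t1 starved, then SRJF
```

Without the guard, t1 would sit behind every shorter job indefinitely; with it, the reordering stays SRJF except for the bounded-wait override.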
Bounding Queue Lengths
Determine the max number of tasks at a queue: trade-off between short and long queues
Short queues: resource idling → lower throughput
Long queues: high queuing delays, early binding of tasks to queues → longer job completion times
Static and dynamic queue bounding
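A simple dynamic bound can target just enough queued work to cover the scheduling feedback delay (so slots never idle) without queuing far beyond it (so early binding stays cheap). The formula and headroom factor below are illustrative assumptions, not the paper's exact policy:

```python
def queue_bound(avg_task_secs, feedback_delay_secs, slots, headroom=1.5):
    """Dynamic bound on a node's queue length: roughly the number of tasks
    the node drains during one feedback delay, times a small headroom."""
    drained_during_delay = feedback_delay_secs / max(avg_task_secs, 1e-9)
    return max(1, round(slots * drained_during_delay * headroom))

# Short tasks drain fast, so they need a longer queue to hide the delay:
print(queue_bound(avg_task_secs=5, feedback_delay_secs=10, slots=4))   # 12
print(queue_bound(avg_task_secs=50, feedback_delay_secs=10, slots=4))  # 1
```

This captures the slide's trade-off directly: as average task duration grows, the bound shrinks toward one, because a long queue would only add queuing delay.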
Queuing Techniques: Evaluation
1.7x improvement in median JCT over YARN
Both bounding and reordering are crucial
Production Experience
Workload
Scale: high scheduling rate at low allocation latency
Utilization: federated design improves load balancing while retaining utilization
Task Performance (comparison)
Jobs perform just as well (and tasks are as efficient)
Qualitative Experience
operability
Recap
Overview of legacy systems (Apollo, YARN)
Scale: Federated YARN
Resource utilization: Opportunistic Containers
Production Experience
Conclusion
Research in Industry is a lot of fun
Fewer, bigger projects, with massive wins: big force multipliers
Failure is expected (otherwise we are not trying hard enough)
Modus Operandi
Be picky in choosing problems; engage early, engage deep
Be inclusive with prod/OSS counterparts