
Page 1: Cluster Management

Cluster Management
CS 739
Fall 2019

Page 2: Cluster Management

Notes from reviews

• How does Borg offer strategy-proofness?
  • The opposite: the authors say they can advise users on how to make jobs more easily schedulable

Page 3: Cluster Management

What is the goal?

• Schedule sets of tasks comprising a job
  • Satisfy hardware demands – particular hardware present (CPU, SSD)
  • Satisfy resource needs – memory, cycles, IOPS, storage/network bandwidth
  • High utilization – few wasted resources
  • Handle unexpected peaks

• Launch and run the tasks comprising a job
  • Ensure binaries/packages are present
  • Handle preemption and migration
  • Monitor for failures & restart

• Fault tolerant and scalable
  • Never go down, scale to large clusters

Page 4: Cluster Management

Why is it complicated?

• Range of tasks
  • Production – long-running, non-preemptible, latency-sensitive tasks
  • Non-production – batch data-processing tasks

• Range of control needed
  • Control over co-location of tasks – e.g., must be on the same machine

• Efficient resource allocation
  • Must pack tasks with multiple resource needs onto machines
  • Handle poor resource estimates: over/under-estimates of memory/CPU needed

• Fault isolation
  • Spread tasks in a job across diverse resources

Page 5: Cluster Management

What makes placement hard

• Policy: what to do about over-subscribed resources
  • Who do you starve?
  • Do you preempt, and whom?

• Fairness: do you care?
  • How do you compute fairness across jobs?
  • Some want 8 CPUs, 1 GB memory; others want 1 CPU, 16 GB memory

Page 6: Cluster Management

Scheduling: An old problem

• CPU allocation
  • Multiple processes want to execute; the OS selects one to run for some amount of time

• Bandwidth allocation
  • Packets from multiple incoming queues want to be transmitted out some link; the switch chooses one

Page 7: Cluster Management

What do we want from a scheduler?

• Isolation
  • Some sort of guarantee that misbehaved processes cannot affect me "too much"

• Efficient resource usage
  • A resource is not idle while there is a process whose demand is not fully satisfied
  • "Work conservation" – not achieved by hard allocations

• Flexibility
  • Can express some sort of priorities, e.g., strict or time-based

Slide courtesy of Mike Freedman

Page 8: Cluster Management

Single Resource: Fair Sharing

• n users want to share a resource (e.g., CPU)
  • Solution: give each 1/n of the shared resource

• Generalized by max-min fairness
  • Handles the case where a user wants less than its fair share
  • E.g., user 1 wants no more than 20%

• Generalized by weighted max-min fairness
  • Give weights to users according to importance
  • User 1 gets weight 1, user 2 weight 2

[Figure: CPU share bar charts – equal split: 33%/33%/33%; user 1 capped at 20%: 20%/40%/40%; weights 1:2: 33%/66%]

Slide courtesy of Mike Freedman
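A minimal water-filling sketch of (weighted) max-min fair sharing, covering the "user wants less than its fair share" case; the function name and structure are illustrative, not taken from the slides.

```python
def max_min_share(capacity, demands, weights=None):
    """Weighted max-min fair allocation of one resource (water-filling).

    capacity: total amount of the resource.
    demands:  per-user maximum wanted amounts.
    weights:  optional per-user weights (default: all 1).
    """
    n = len(demands)
    weights = weights or [1.0] * n
    alloc = [0.0] * n
    active = set(range(n))          # users whose demand is not yet satisfied
    remaining = capacity
    while active and remaining > 1e-9:
        # Split what's left in proportion to the weights of the active users.
        total_w = sum(weights[i] for i in active)
        done = set()
        for i in active:
            fair = remaining * weights[i] / total_w
            if demands[i] - alloc[i] <= fair:   # user is capped by its own demand
                alloc[i] = demands[i]
                done.add(i)
        if not done:                            # nobody capped: hand out fair shares
            for i in active:
                alloc[i] += remaining * weights[i] / total_w
            remaining = 0
        else:                                   # redistribute the leftover next round
            active -= done
            remaining = capacity - sum(alloc)
    return alloc

# Slide example: user 1 wants at most 20% of the CPU, users 2 and 3 want everything.
print(max_min_share(100, [20, 100, 100]))   # -> [20.0, 40.0, 40.0]
```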

Page 9: Cluster Management

Max-Min Fairness is Powerful

• Weighted fair sharing / proportional shares
  • User u1 gets weight 2, u2 weight 1

• Priorities: give u1 weight 1000, u2 weight 1

• Reservations
  • Ensure u1 gets 10%: give u1 weight 10, keep the sum of weights ≤ 100

• Deadline-based scheduling
  • Given a job's demand and deadline, compute the user's reservation / weight

• Isolation: users cannot affect others beyond their share

Slide courtesy of Mike Freedman
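Reusing the water-filling sketch above, priorities and reservations are just particular weight choices; the numbers below mirror the bullets and are illustrative only.

```python
# Priorities as extreme weights: u1 gets weight 1000, u2 weight 1.
print(max_min_share(100, [100, 100], weights=[1000, 1]))       # u1 ~99.9, u2 ~0.1
# Reservations: weight 10 with total weight <= 100 guarantees u1 at least 10%.
print(max_min_share(100, [100, 100, 100], weights=[10, 45, 45]))  # -> [10.0, 45.0, 45.0]
```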

Page 10: Cluster Management

Why is Max-Min Fairness Not Enough?

• Job scheduling is not only about a single resource

• Tasks consume CPU, memory, network, and disk I/O

• What are task demands today?

Slide courtesy of Mike Freedman

Page 11: Cluster Management

Heterogeneous Resource Demands

[Figure: task resource demands in a 2000-node Hadoop cluster at Facebook (Oct 2010) – most tasks need roughly <2 CPU, 2 GB RAM>; some tasks are memory-intensive, some are CPU-intensive]

Slide courtesy of Mike Freedman

Page 12: Cluster Management

How to allocate?

• 2 resources: CPUs & memory

• User 1 wants <1 CPU, 4 GB> per task

• User 2 wants <3 CPU, 1 GB> per task

• What's a fair allocation?

[Figure: empty CPU and memory bars with question marks – what split is fair?]

Slide courtesy of Mike Freedman

Page 13: Cluster Management

A Natural Policy

• Asset fairness: equalize each user's sum of resource shares

• Cluster with 28 CPUs, 56 GB RAM
  • U1 needs <1 CPU, 2 GB RAM> per task, or <3.6% CPUs, 3.6% RAM> per task
  • U2 needs <1 CPU, 4 GB RAM> per task, or <3.6% CPUs, 7.2% RAM> per task

• Asset fairness yields
  • U1: 12 tasks: <43% CPUs, 43% RAM> (∑ = 86%)
  • U2: 8 tasks: <28% CPUs, 57% RAM> (∑ = 86%)

[Figure: CPU bars – U1 43%, U2 28%; RAM bars – U1 43%, U2 57%]

Slide courtesy of Mike Freedman

Page 14: Cluster Management

Strawman for asset fairness

• Approach: equalize each user's sum of resource shares

• Cluster with 28 CPUs, 56 GB RAM
  • U1 needs <1 CPU, 2 GB RAM> per task, or <3.6% CPUs, 3.6% RAM> per task
  • U2 needs <1 CPU, 4 GB RAM> per task, or <3.6% CPUs, 7.2% RAM> per task

• Asset fairness yields
  • U1: 12 tasks: <43% CPUs, 43% RAM> (∑ = 86%)
  • U2: 8 tasks: <28% CPUs, 57% RAM> (∑ = 86%)

Problem: violates the share guarantee
  • User 1 has < 50% of both CPUs and RAM
  • Better off in a separate cluster with half the resources

Slide courtesy of Mike Freedman
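A quick numeric check of the asset-fairness example above; a sketch, with helper names of my own choosing.

```python
# Verify the asset-fairness numbers for the 28-CPU, 56 GB cluster above.
CPUS, RAM = 28.0, 56.0
u1_task = (1, 2)   # U1: <1 CPU, 2 GB> per task
u2_task = (1, 4)   # U2: <1 CPU, 4 GB> per task

def shares(task, n):
    """(CPU share, RAM share) when running n tasks of the given per-task demand."""
    return (n * task[0] / CPUS, n * task[1] / RAM)

s1, s2 = shares(u1_task, 12), shares(u2_task, 8)
print(s1, sum(s1))   # (0.43, 0.43), asset sum ~0.86
print(s2, sum(s2))   # (0.29, 0.57), asset sum ~0.86

# Share-guarantee violation: U1 holds < 50% of both resources, so it would be
# better off in a private half-size cluster (14 CPUs, 28 GB supports 14 of its tasks).
```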

Page 15: Cluster Management

Cheating the Scheduler

• Users are willing to game the system to get more resources

• Real-life examples
  • A cloud provider had quotas on map and reduce slots. Some users found out that the map quota was low, so they implemented maps in the reduce slots!
  • A search company provided dedicated machines to users that could ensure a certain level of utilization (e.g., 80%). Users used busy-loops to inflate utilization.

• How to achieve the share guarantee + strategy-proofness for sharing?
  • Generalize max-min fairness to multiple resources

Slide courtesy of Mike Freedman

Page 16: Cluster Management

Dominant Resource Fairness (DRF)

• A user's dominant resource is the resource the user has the biggest share of
  • Example:
    • Total resources: <8 CPU, 5 GB>
    • User 1's allocation: <2 CPU, 1 GB> = <25% CPUs, 20% RAM>
    • Dominant resource of User 1 is CPU (as 25% > 20%)

• A user's dominant share is the fraction of the dominant resource allocated
  • User 1's dominant share is 25%

Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, Ion Stoica. NSDI '11.

Slide courtesy of Mike Freedman

Page 17: Cluster Management

Dominant Resource Fairness (DRF)

• Apply max-min fairness to dominant shares

• Equalize the dominant share of the users. Example:
  • Total resources: <9 CPU, 18 GB>
  • User 1 demand: <1 CPU, 4 GB>; dominant resource: memory (1/9 < 4/18)
  • User 2 demand: <3 CPU, 1 GB>; dominant resource: CPU (3/9 > 1/18)
  • DRF allocation: User 1 gets 3 tasks <3 CPUs, 12 GB>, User 2 gets 2 tasks <6 CPUs, 2 GB>; each ends up with a dominant share of 66%

Slide courtesy of Mike Freedman
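A small progressive-filling sketch of DRF for the <9 CPU, 18 GB> example above; the function and variable names are mine, not from the NSDI paper.

```python
def drf(capacities, demands):
    """Grant tasks one at a time to the user with the smallest dominant share."""
    used = [0.0] * len(capacities)
    tasks = [0] * len(demands)
    dom_share = [0.0] * len(demands)
    while True:
        # Consider users in order of increasing dominant share; place one task
        # for the first user whose per-task demand still fits in the cluster.
        for u in sorted(range(len(demands)), key=lambda i: dom_share[i]):
            need = demands[u]
            if all(used[r] + need[r] <= capacities[r] for r in range(len(capacities))):
                for r in range(len(capacities)):
                    used[r] += need[r]
                tasks[u] += 1
                dom_share[u] = max(need[r] * tasks[u] / capacities[r]
                                   for r in range(len(capacities)))
                break
        else:
            return tasks   # no user's next task fits: stop

print(drf([9, 18], [[1, 4], [3, 1]]))   # -> [3, 2]; both dominant shares ~66%
```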

Page 18: Cluster Management

Borg Architecture – High Level

EECS 582 – F16

1. Compile the program and stick it in the cloud
2. Pass the configuration to the command-line tool (Borg config)
3. Send an RPC to the Borgmaster
4. The Borgmaster writes to the persistent store & tasks are added to the pending queue
5. The scheduler scans the pending queue asynchronously
6. Link shards check the Borglets

Page 19: Cluster Management

Borg Architecture

EECS 582 – F16

• Borgmaster
  • Central "brain" of the system
  • Holds the cluster state
  • Replicated for reliability (Paxos)

• Scheduling
  • Where to place tasks?
  • Feasibility checking
  • Scoring

• Borglet
  • Machine agent
  • Supervises local tasks
  • Interacts with the Borgmaster

Page 20: Cluster Management

Borg Architecture

EECS 582 – F16

Scalability
• Simple synchronous loop
• Split the scheduler into separate processes
• Separate threads to talk to the Borglets
• Score caching
  • Don't recompute scores if the machine state is the same
• Equivalence classes
  • Only do feasibility checking & scoring once per group of similar tasks
• Relaxed randomization
  • Calculate feasibility and scores for enough random machines, not all of them (see the sketch below)
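A toy sketch of how score caching, equivalence classes, and relaxed randomization might fit together; `feasible`, `score`, and the dictionary fields are hypothetical stand-ins, not Borg's actual code.

```python
import random

def feasible(task, m):
    """Hypothetical stand-in: does the machine have room for the task?"""
    return m["free_cpu"] >= task["cpu"] and m["free_mem"] >= task["mem"]

def score(task, m):
    """Hypothetical stand-in for scoring (here: worst fit, most room left)."""
    return min(m["free_cpu"] - task["cpu"], m["free_mem"] - task["mem"])

def pick_machine(task, machines, score_cache, sample_size=50):
    """Toy placement: feasibility + scoring over a random sample of machines."""
    # Relaxed randomization: examine a random subset, not every machine.
    candidates = random.sample(machines, min(sample_size, len(machines)))
    best, best_score = None, float("-inf")
    for m in candidates:
        if not feasible(task, m):                  # cheap check first
            continue
        # Score caching + equivalence classes: tasks in the same equivalence
        # class reuse scores for machines whose state hasn't changed.
        key = (task["equiv_class"], m["name"], m["state_version"])
        if key not in score_cache:
            score_cache[key] = score(task, m)      # the expensive part
        if score_cache[key] > best_score:
            best, best_score = m, score_cache[key]
    return best
```

The cheap feasibility check runs before the expensive scoring step, and the cache key includes a machine state version so stale scores are not reused.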

Page 21: Cluster Management

Scheduling controls

• Priority: relative importance of jobs
  • Given limited resources, who do you give them to?
  • Who do you reclaim resources from?
  • Problem: cascading preemption when introducing a new high-priority task
  • Solution: priority bands; don't preempt within a band

• Quotas: cluster-wide limits on resource usage for a time period
  • E.g., 10 TB of memory for 2 hours
  • Used for admission control: does the user have enough quota?
  • Higher-priority quota is harder to get
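A rough sketch of how priority bands and quota admission control might interact; the band names, boundaries, and helper functions are illustrative, not Borg's actual values.

```python
# Illustrative priority bands (higher number = more important).
BANDS = {"best_effort": 0, "batch": 1, "production": 2, "monitoring": 3}

def admit(job, quota_left):
    """Quota-based admission control: reject the job if it exceeds the user's quota."""
    return job["cpu"] <= quota_left["cpu"] and job["mem"] <= quota_left["mem"]

def can_preempt(new_task, victim):
    """Preempt only across bands, never within one, to avoid preemption cascades."""
    return BANDS[new_task["band"]] > BANDS[victim["band"]]

# Example: a production task may preempt batch work, but one batch task
# cannot preempt another batch task.
print(can_preempt({"band": "production"}, {"band": "batch"}))   # True
print(can_preempt({"band": "batch"}, {"band": "batch"}))        # False
```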

Page 22: Cluster Management

Job parameters

• Jobs specify ~200 parameters to guide scheduling
  • Alloc: a reserved set of resources on a machine, usable by many tasks
  • Example: want locality across several tasks --> run them in the same alloc

• Runtime job modifications
  • Can change priority/resource requests at runtime to reflect changing needs

Page 23: Cluster Management

Goal: High utilization

[Figure: view of utilization on each machine in a cluster – colors = different jobs, white space = unused resources]

• Stranded resources: only one resource type remains available on a machine (e.g., free CPU but no free memory)

Page 24: Cluster Management

Scheduling Policy

• How to place allocs/tasks on a cluster?

• Balance: spread tasks across the cluster (worst fit)
  • + Leaves room for load spikes
  • - Increased fragmentation from tasks that need most of a machine

• Best fit: find the tightest fit for the requested resources
  • + High utilization
  • - Bad for load spikes – no spare capacity to absorb load
  • - Bad for opportunistic (best-effort) jobs, as little is left over

• Borg policy: minimize stranded resources
  • How? Unknown
  • Idea: minimize the difference in the remaining amounts of different resources (CPU, memory, net, …)
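A minimal sketch contrasting the placement scores above; the anti-stranding heuristic implements the "minimize the difference in remaining resources" guess from the last bullet, not a documented Borg formula.

```python
def remaining(machine, task):
    """Fractions of CPU and memory left on the machine after placing the task."""
    cpu = (machine["free_cpu"] - task["cpu"]) / machine["cpu"]
    mem = (machine["free_mem"] - task["mem"]) / machine["mem"]
    return cpu, mem

def worst_fit_score(machine, task):        # balance: prefer the emptiest machine
    return min(remaining(machine, task))

def best_fit_score(machine, task):         # pack tightly: prefer the fullest machine
    return -max(remaining(machine, task))

def anti_stranding_score(machine, task):   # keep leftover CPU and memory in balance
    cpu, mem = remaining(machine, task)
    return -abs(cpu - mem)

# Pick the feasible machine with the highest score under the chosen policy, e.g.:
machines = [
    {"cpu": 16, "mem": 64, "free_cpu": 10, "free_mem": 8},
    {"cpu": 16, "mem": 64, "free_cpu": 4, "free_mem": 32},
]
task = {"cpu": 2, "mem": 4}
print(max(machines, key=lambda m: anti_stranding_score(m, task)))
```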

Page 25: Cluster Management

Utilization

EECS 582 – F16

Cell Sharing

Page 26: Cluster Management

Utilization

EECS 582 – F16

Large Cells

Page 27: Cluster Management

Utilization

EECS 582 – F16

Fine-grained Resource Requests

Page 28: Cluster Management

Fixed-size allocation

• Cloud providers offer fixed-size instances
  • Functions: 128 MB, 256 MB, 512 MB
  • VMs: 2 GB, 4 GB, 32 GB, 128 GB, etc.

• How efficient is this?
  • Upper bound: give a whole machine to any job needing more than 1/2 of one
  • Lower bound: let such jobs go pending

• Why is Borg different from the cloud?

Page 29: Cluster Management

Resource Allocation Problem

• Applications request more resources than they actually use

• How to reallocate?
  • Estimate future use; give the reclaimed resources to non-production jobs
  • Preempt if the resources are needed again
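A toy illustration of the reclamation idea, under the assumption (not spelled out on the slide) that the reservation is estimated as recent peak usage plus a safety margin; the names and the 20% margin are invented.

```python
def reclaimable(limit, recent_usage, margin=0.2):
    """Resources that can be lent to non-production jobs until needed again.

    limit: what the job requested; recent_usage: samples of what it actually used.
    """
    reservation = min(limit, max(recent_usage) * (1 + margin))
    return max(0.0, limit - reservation)

print(reclaimable(limit=16.0, recent_usage=[4.2, 5.0, 4.8]))  # CPUs lendable to batch work
```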

Page 30: Cluster Management

AutoPilot

Page 31: Cluster Management

AutoPilot

• Goal: deploy, provision, and repair data center applications
  • Automate as many sysadmin tasks as possible

• Design principles
  • Fault tolerant like everything else, but not against Byzantine failures
  • Simple and good enough when possible
    • E.g., repairs fall into only a few categories
  • Text-file configuration + auditing of changes
    • QUESTION: EmuLab seems to offer more. Why?
  • Correct at all times, or incorrect in an understandable, documented way
    • State exceptions, handle pathological cases
  • Crash-only
    • No explicit shutdown/cleanup
  • Replication for availability only
    • The workload is fairly small

Page 32: Cluster Management

Autopilot goal

• Handle routine tasks automatically

• Without operator intervention

• No support for legacy apps

Page 33: Cluster Management

AutoPilot services

• Device Manager
  • Stores the goal state the system should be in
  • Strongly consistent, using Paxos
  • Doesn't do anything but store state
    • Goal state, ground truth

• Satellite services
  • Based on the state in the DM, take actions to bring the system to the correct state
  • Poll for changes, but can be told to poll
    • Polling avoids lost pushes of data
    • Being told to pull gives low latency
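A minimal sketch of the pull model above; the `get_goal_state`, `reconcile`, and nudge interfaces are hypothetical stand-ins for whatever each real satellite service (deployment, repair, …) actually does.

```python
import time

class SatelliteService:
    """Toy satellite service: periodically pull goal state from the DM and converge."""

    def __init__(self, dm, poll_interval_s=60):
        self.dm = dm
        self.poll_interval_s = poll_interval_s
        self.wakeup = False            # set by a "please poll now" hint from the DM

    def nudge(self):
        # The DM can ask the service to pull soon (low latency), but correctness
        # never depends on this hint: a lost nudge just means a slower poll.
        self.wakeup = True

    def run_once(self):
        goal = self.dm.get_goal_state()    # ground truth lives in the DM
        self.reconcile(goal)               # take actions to reach the goal state

    def run_forever(self):
        while True:
            self.run_once()
            deadline = time.time() + self.poll_interval_s
            while time.time() < deadline and not self.wakeup:
                time.sleep(1)
            self.wakeup = False

    def reconcile(self, goal):
        raise NotImplementedError      # e.g., deploy binaries, trigger repairs
```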

Page 34: Cluster Management

Autopiloted System

[Diagram: an autopiloted system – the application runs alongside the Autopilot services, under Autopilot control]

Page 35: Cluster Management

Recovery-Oriented Computing


• Everything will eventually fail

• Design for failure

• Crash-only software design

• http://roc.cs.berkeley.edu

Brown, A. and D. A. Patterson. Embracing Failure: A Case for Recovery-Oriented Computing (ROC). High Performance Transaction Processing Symposium, October 2001.

Page 36: Cluster Management

Autopilot Architecture

[Diagram: Device Manager at the center, with Provisioning, Deployment, Watchdog, and Repair services, plus operational data collection and visualization]

• Discover new machines; netboot; self-test
• Install application binaries and configurations
• Monitor application health
• Fix broken machines

Page 37: Cluster Management

Centralized Replicated Control

• Keep essential control state centralized

• Replicate the state for reliability

• Use the Paxos consensus protocol

• The Device Manager uses it for ground truth – the goal state of the system

Page 38: Cluster Management

Device Manager

• Strongly consistent (Paxos), replicated for reliability

• Stores the goal state for the system
  • The set of physical machines
  • The configuration each should have

• Other services pull information from it

• QUESTION: LiveSearch decided this wasn't reliable enough. Why?
  • Answer: they didn't trust it, didn't want to lose control to another group, or was it really unreliable?
  • Shows that apps can build their own, better services on top of AutoPilot…

Page 39: Cluster Management

Cluster Machines

[Diagram: cluster machines, each running an Autopilot services manager; roles include the name service, scheduling, remote execution, storage, and a Paxos-replicated storage metadata service]

Page 40: Cluster Management

Cluster services abstraction

[Diagram: the cluster-services abstraction – cluster services layered over many Windows Server machines managed by Autopilot, presented to applications as reliable, specialized machines]

Page 41: Cluster Management

Cluster Services

• Name service: discover cluster machines

• Scheduling: allocate cluster machines

• Storage metadata: distributed file location

• Storage: distributed file contents

• Remote execution: spawn new computations


Page 42: Cluster Management

Low-level node services

• filesync: copies files to make sure the correct files are present

• application manager: makes sure the correct applications are running

Page 43: Cluster Management

High-level cluster services

• Provisioning service
  • Configures new machines according to policy
  • Determines what OS image to use; installs, boots, tests
  • A new machine asks the DM what applications it should run

Page 44: Cluster Management

High-level services

• Application deployment
  • Apps specify a set of machine types – different configurations of nodes in the service
    • Front-end web server
    • Crawler
  • Apps specify a manifest – lists the config files + app binaries needed for each machine type
  • The deployment service holds the config files + binaries (populated by the build process)
  • Machines ask the DM for their configuration, then contact the deployment service for the needed binaries
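A purely hypothetical illustration of the machine-type/manifest idea (the paper does not show Autopilot's real manifest format), written as Python data for concreteness.

```python
# Hypothetical manifests, one per machine type; names and fields are illustrative only.
MANIFESTS = {
    "frontend-web": {
        "binaries": ["websvc-1.42.tar", "common-libs-7.tar"],
        "configs":  ["websvc.cfg", "logging.cfg"],
    },
    "crawler": {
        "binaries": ["crawler-3.1.tar", "common-libs-7.tar"],
        "configs":  ["crawler.cfg"],
    },
}

def files_needed(machine_type):
    """What a machine of this type pulls from the deployment service."""
    m = MANIFESTS[machine_type]
    return m["binaries"] + m["configs"]

# A new machine asks the DM for its machine type, then fetches these files.
print(files_needed("crawler"))
```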

Page 45: Cluster Management

Application upgrade

• Rolling upgrade is built into Autopilot
  • The new version is added to the manifest for the machine type
  • Each machine of that type downloads the new code on its next poll
  • The DM instructs groups of machines to upgrade (to avoid whole-service downtime)
    • E.g., 1/10th of each machine type at a time
  • Machines are put on probation during the upgrade; if it fails, the upgrade can be rolled back
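A rough sketch of the rolling-upgrade flow; the batch fraction, the machine methods, and the rollback check are assumptions, not Autopilot specifics.

```python
def rolling_upgrade(machines, old_manifest, new_manifest, fraction=0.1):
    """Upgrade one machine type in batches; stop and roll back if a batch fails.

    The machine methods used here are hypothetical stand-ins for DM/deployment calls.
    """
    batch_size = max(1, int(len(machines) * fraction))      # e.g., 1/10th at a time
    upgraded = []
    for start in range(0, len(machines), batch_size):
        batch = machines[start:start + batch_size]
        for m in batch:
            m.set_manifest(new_manifest)   # picked up on the machine's next poll
            m.mark_probation()             # early failures are expected, not alarming
        upgraded.extend(batch)
        if not all(m.healthy_after_probation() for m in batch):
            for m in upgraded:             # batch failed: roll everything back
                m.set_manifest(old_manifest)
            return False
    return True
```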

Page 46: Cluster Management

Failure detection

• Watchdog service
  • Checks an attribute on a node and reports "OK", "Warning", or "Error"
  • The watchdog service calls a node correct if all attributes are OK or Warning (a warning is unexpected but not fatal: it generates extra logging but does not alert operators)
  • Watchdog sensors can be standard (e.g., BIOS checks for memory/disk corruption, OS version checks) or app-specific

• QUESTION: Why have apps generate their own watchdogs?

• Note: check lots of signals; any single one could still look fine while things fail
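A minimal sketch of the watchdog aggregation rule above; the sensor names and status strings are illustrative.

```python
OK, WARNING, ERROR = "ok", "warning", "error"

def node_status(attribute_reports):
    """A node is considered correct unless some attribute reports an error."""
    if any(r == ERROR for r in attribute_reports.values()):
        return "failed"            # hand the node to the repair service
    if any(r == WARNING for r in attribute_reports.values()):
        return "ok-with-warnings"  # log more, but don't alert operators
    return "healthy"

print(node_status({"bios_memtest": OK, "os_version": WARNING, "app_probe": OK}))
# -> "ok-with-warnings"
```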

Page 47: Cluster Management

Failure Detection Latency

• AutoPilot doesn't detect problems immediately
  • A down or sluggish machine can affect application latency

• So what to do?
  • Apps can use very short timeouts and retry quickly
  • Apps can report such failures to AutoPilot

• AutoPilot is generic; its recovery techniques are not suitable for stuttering latency problems
  • E.g., they would be overkill and could hurt overall reliability

Page 48: Cluster Management

Failure/recovery

• Recovery: automatically do the simple things an admin would try before taking a machine offline
  • E.g., restart the service, reboot, reinstall, replace

• Failed nodes are given a repair treatment based on their symptoms
  • DoNothing, Reboot, ReImage, Replace
  • Technicians replace computers periodically (days/weeks)
  • Also based on history (previously healthy computers go to DoNothing)

• Repaired nodes are marked as on "probation"
  • Expected to have a few early failures (ignored) but then become healthy
  • If correct for a while, moved to Healthy

• All machines affected by new code are also marked "probation", to tolerate expected failures during startup
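A toy version of the repair policy; the escalation order follows the slide, but the thresholds and the history rule are assumptions.

```python
REPAIR_ACTIONS = ["DoNothing", "Reboot", "ReImage", "Replace"]

def choose_repair(node):
    """Escalate repairs for repeat offenders; leave recently-healthy nodes alone."""
    if node["days_since_last_failure"] > 30:     # previously healthy: likely transient
        return "DoNothing"
    # Each failure within the window escalates one step, capped at Replace.
    step = min(node["recent_failures"], len(REPAIR_ACTIONS) - 1)
    return REPAIR_ACTIONS[step]

def after_repair(node):
    node["state"] = "probation"   # a few early failures are ignored in this state
    # ...later, if the node stays correct for a while, move it back to "healthy".

print(choose_repair({"days_since_last_failure": 2, "recent_failures": 3}))  # -> "Replace"
```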

Page 49: Cluster Management

Other recovery options

• Hot standby to replace a failed machine
  • But the standby machine is idle most of the time and not contributing
  • Apps can handle failures anyway, so why not let the app deal with it until full recovery?

Page 50: Cluster Management

Monitoring service

• Collects logs/performance counters from applications in a common format
  • Provides a central view of the app

• Near-real-time data, stored in a SQL database

• The Cockpit visualization tool reads the data and shows current status

• An alert service sends alert emails based on triggers/queries over the Cockpit DB

Page 51: Cluster Management

What AutoPilot doesn’t do

• Load balancing/migration
  • This is up to the app

• If the app needs more resources – what then?
  • Tell AutoPilot to provision more machines

• Address all issues in the data center
  • Network configuration
  • Power management/consolidation

Page 52: Cluster Management

AutoPilot Summary

• Provides common tools to:
  • Install a new machine
  • Provision apps onto it
  • Detect failures
  • Repair failures
  • Record logs
  • Monitor app behavior

• BUT: not for legacy code (apps must be packaged for AutoPilot)