
Page 1: Cluster Management

Cluster Management
CS 739
Fall 2019

Page 2: Cluster Management

Notes from reviews

• How does Borg offer strategy-proofness?
  • The opposite: the authors say they can advise users on how to make jobs more easily schedulable

Page 3: Cluster Management

What is the goal?

• Schedule sets of tasks comprising a job
  • Satisfy hardware demands – particular hardware present (CPU, SSD)
  • Satisfy resource needs – memory, cycles, IOPS, storage/network bandwidth
  • High utilization – few wasted resources
  • Handle unexpected peaks

• Launch and run the tasks comprising a job
  • Ensure binaries/packages are present
  • Handle preemption and migration
  • Monitor for failures & restart

• Fault tolerant and scalable
  • Never go down, scale to large clusters

Page 4: Cluster Management

Why is it complicated?

• Range of tasks
  • Production – long-running, non-preemptible, latency-sensitive tasks
  • Non-production – batch data-processing tasks

• Range of control needed
  • Control over co-location of tasks – e.g., must be on the same machine

• Efficient resource allocation
  • Must pack tasks with multiple resource needs onto machines
  • Handle poor resource estimates: over/under-estimates of memory/CPU needed

• Fault isolation
  • Spread tasks in a job across diverse resources

Page 5: Cluster Management

What makes placement hard

• Policy: what to do about over-subscribed resources
  • Who do you starve?
  • Do you preempt, and whom?

• Fairness: do you care?
  • How do you compute fairness across jobs?
  • Some want 8 CPUs, 1 GB memory; others want 1 CPU, 16 GB memory

Page 6: Cluster Management

Scheduling: An old problem

• CPU allocation
  • Multiple processes want to execute; the OS selects one to run for some amount of time

• Bandwidth allocation
  • Packets from multiple incoming queues want to be transmitted out some link; the switch chooses one

Page 7: Cluster Management

What do we want from a scheduler?

• Isolation
  • Some sort of guarantee that misbehaved processes cannot affect me "too much"

• Efficient resource usage
  • A resource is not idle while there is a process whose demand is not fully satisfied
  • "Work conservation" – not achieved by hard allocations

• Flexibility
  • Can express some sort of priorities, e.g., strict or time-based

Slide courtesy of Mike Freedman

Page 8: Cluster Management

Single Resource: Fair Sharing

• n users want to share a resource (e.g., CPU)
  • Solution: give each 1/n of the shared resource

• Generalized by max-min fairness
  • Handles the case where a user wants less than its fair share
  • E.g., user 1 wants no more than 20%

• Generalized by weighted max-min fairness
  • Give weights to users according to importance
  • User 1 gets weight 1, user 2 weight 2

[Figure: CPU share bar charts – equal split: 33%/33%/33%; user 1 capped at 20%: 20%/40%/40%; weights 1:2: 33%/66%]

Slide courtesy of Mike Freedman
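A minimal water-filling sketch of (weighted) max-min fair sharing, covering the "user wants less than its fair share" case; the function name and structure are illustrative, not taken from the slides.

```python
def max_min_share(capacity, demands, weights=None):
    """Weighted max-min fair allocation of one resource (water-filling).

    capacity: total amount of the resource.
    demands:  per-user maximum wanted amounts.
    weights:  optional per-user weights (default: all 1).
    """
    n = len(demands)
    weights = weights or [1.0] * n
    alloc = [0.0] * n
    active = set(range(n))          # users whose demand is not yet satisfied
    remaining = capacity
    while active and remaining > 1e-9:
        # Split what's left in proportion to the weights of the active users.
        total_w = sum(weights[i] for i in active)
        done = set()
        for i in active:
            fair = remaining * weights[i] / total_w
            if demands[i] - alloc[i] <= fair:   # user is capped by its own demand
                alloc[i] = demands[i]
                done.add(i)
        if not done:                            # nobody capped: hand out fair shares
            for i in active:
                alloc[i] += remaining * weights[i] / total_w
            remaining = 0
        else:                                   # redistribute the leftover next round
            active -= done
            remaining = capacity - sum(alloc)
    return alloc

# Slide example: user 1 wants at most 20% of the CPU, users 2 and 3 want everything.
print(max_min_share(100, [20, 100, 100]))   # -> [20.0, 40.0, 40.0]
```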

Page 9: Cluster Management

Max-Min Fairness is Powerful

• Weighted fair sharing / proportional shares
  • User u1 gets weight 2, u2 weight 1

• Priorities: give u1 weight 1000, u2 weight 1

• Reservations
  • Ensure u1 gets 10%: give u1 weight 10, keep the sum of weights ≤ 100

• Deadline-based scheduling
  • Given a job's demand and deadline, compute the user's reservation / weight

• Isolation: users cannot affect others beyond their share

Slide courtesy of Mike Freedman
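Reusing the water-filling sketch above, priorities and reservations are just particular weight choices; the numbers below mirror the bullets and are illustrative only.

```python
# Priorities as extreme weights: u1 gets weight 1000, u2 weight 1.
print(max_min_share(100, [100, 100], weights=[1000, 1]))       # u1 ~99.9, u2 ~0.1
# Reservations: weight 10 with total weight <= 100 guarantees u1 at least 10%.
print(max_min_share(100, [100, 100, 100], weights=[10, 45, 45]))  # -> [10.0, 45.0, 45.0]
```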

Page 10: Cluster Management

Why is Max-Min Fairness Not Enough?

• Job scheduling is not only about a single resource

• Tasks consume CPU, memory, network, and disk I/O

• What are task demands today?

Slide courtesy of Mike Freedman

Page 11: Cluster Management

Heterogeneous Resource Demands

[Figure: task resource demands in a 2000-node Hadoop cluster at Facebook (Oct 2010) – most tasks need roughly <2 CPU, 2 GB RAM>; some tasks are memory-intensive, some are CPU-intensive]

Slide courtesy of Mike Freedman

Page 12: Cluster Management

How to allocate?

• 2 resources: CPUs & memory

• User 1 wants <1 CPU, 4 GB> per task

• User 2 wants <3 CPU, 1 GB> per task

• What's a fair allocation?

[Figure: empty CPU and memory bars with question marks – what split is fair?]

Slide courtesy of Mike Freedman

Page 13: Cluster Management

A Natural Policy

• Asset fairness: equalize each user's sum of resource shares

• Cluster with 28 CPUs, 56 GB RAM
  • U1 needs <1 CPU, 2 GB RAM> per task, or <3.6% CPUs, 3.6% RAM> per task
  • U2 needs <1 CPU, 4 GB RAM> per task, or <3.6% CPUs, 7.2% RAM> per task

• Asset fairness yields
  • U1: 12 tasks: <43% CPUs, 43% RAM> (∑ = 86%)
  • U2: 8 tasks: <28% CPUs, 57% RAM> (∑ = 86%)

[Figure: CPU bars – U1 43%, U2 28%; RAM bars – U1 43%, U2 57%]

Slide courtesy of Mike Freedman

Page 14: Cluster Management

Strawman for asset fairness

• Approach: equalize each user's sum of resource shares

• Cluster with 28 CPUs, 56 GB RAM
  • U1 needs <1 CPU, 2 GB RAM> per task, or <3.6% CPUs, 3.6% RAM> per task
  • U2 needs <1 CPU, 4 GB RAM> per task, or <3.6% CPUs, 7.2% RAM> per task

• Asset fairness yields
  • U1: 12 tasks: <43% CPUs, 43% RAM> (∑ = 86%)
  • U2: 8 tasks: <28% CPUs, 57% RAM> (∑ = 86%)

Problem: violates the share guarantee
  • User 1 has < 50% of both CPUs and RAM
  • Better off in a separate cluster with half the resources

Slide courtesy of Mike Freedman
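A quick numeric check of the asset-fairness example above; a sketch, with helper names of my own choosing.

```python
# Verify the asset-fairness numbers for the 28-CPU, 56 GB cluster above.
CPUS, RAM = 28.0, 56.0
u1_task = (1, 2)   # U1: <1 CPU, 2 GB> per task
u2_task = (1, 4)   # U2: <1 CPU, 4 GB> per task

def shares(task, n):
    """(CPU share, RAM share) when running n tasks of the given per-task demand."""
    return (n * task[0] / CPUS, n * task[1] / RAM)

s1, s2 = shares(u1_task, 12), shares(u2_task, 8)
print(s1, sum(s1))   # (0.43, 0.43), asset sum ~0.86
print(s2, sum(s2))   # (0.29, 0.57), asset sum ~0.86

# Share-guarantee violation: U1 holds < 50% of both resources, so it would be
# better off in a private half-size cluster (14 CPUs, 28 GB supports 14 of its tasks).
```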

Page 15: Cluster Management

Cheating the Scheduler

• Users are willing to game the system to get more resources

• Real-life examples
  • A cloud provider had quotas on map and reduce slots. Some users found out that the map quota was low, so they implemented maps in the reduce slots!
  • A search company provided dedicated machines to users that could ensure a certain level of utilization (e.g., 80%). Users used busy-loops to inflate utilization.

• How to achieve the share guarantee + strategy-proofness for sharing?
  • Generalize max-min fairness to multiple resources

Slide courtesy of Mike Freedman

Page 16: Cluster Management

Dominant Resource Fairness (DRF)

• A user's dominant resource is the resource the user has the biggest share of
  • Example:
    • Total resources: <8 CPU, 5 GB>
    • User 1's allocation: <2 CPU, 1 GB> = <25% CPUs, 20% RAM>
    • Dominant resource of User 1 is CPU (as 25% > 20%)

• A user's dominant share is the fraction of the dominant resource allocated
  • User 1's dominant share is 25%

Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, Ion Stoica. NSDI '11.

Slide courtesy of Mike Freedman

Page 17: Cluster Management

Dominant Resource Fairness (DRF)

• Apply max-min fairness to dominant shares

• Equalize the dominant share of the users. Example:
  • Total resources: <9 CPU, 18 GB>
  • User 1 demand: <1 CPU, 4 GB>; dominant resource: memory (1/9 < 4/18)
  • User 2 demand: <3 CPU, 1 GB>; dominant resource: CPU (3/9 > 1/18)
  • DRF allocation: User 1 gets 3 tasks <3 CPUs, 12 GB>, User 2 gets 2 tasks <6 CPUs, 2 GB>; each ends up with a dominant share of 66%

Slide courtesy of Mike Freedman
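A small progressive-filling sketch of DRF for the <9 CPU, 18 GB> example above; the function and variable names are mine, not from the NSDI paper.

```python
def drf(capacities, demands):
    """Grant tasks one at a time to the user with the smallest dominant share."""
    used = [0.0] * len(capacities)
    tasks = [0] * len(demands)
    dom_share = [0.0] * len(demands)
    while True:
        # Consider users in order of increasing dominant share; place one task
        # for the first user whose per-task demand still fits in the cluster.
        for u in sorted(range(len(demands)), key=lambda i: dom_share[i]):
            need = demands[u]
            if all(used[r] + need[r] <= capacities[r] for r in range(len(capacities))):
                for r in range(len(capacities)):
                    used[r] += need[r]
                tasks[u] += 1
                dom_share[u] = max(need[r] * tasks[u] / capacities[r]
                                   for r in range(len(capacities)))
                break
        else:
            return tasks   # no user's next task fits: stop

print(drf([9, 18], [[1, 4], [3, 1]]))   # -> [3, 2]; both dominant shares ~66%
```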

Page 18: Cluster Management

Borg Architecture – High Level

EECS 582 – F16

1. Compile the program and stick it in the cloud
2. Pass the configuration to the command-line tool (Borg config)
3. Send an RPC to the Borgmaster
4. The Borgmaster writes to the persistent store & tasks are added to the pending queue
5. The scheduler scans the pending queue asynchronously
6. Link shards check the Borglets

Page 19: Cluster Management

Borg Architecture

EECS 582 – F16

• Borgmaster
  • Central "brain" of the system
  • Holds the cluster state
  • Replicated for reliability (Paxos)

• Scheduling
  • Where to place tasks?
  • Feasibility checking
  • Scoring

• Borglet
  • Machine agent
  • Supervises local tasks
  • Interacts with the Borgmaster

Page 20: Cluster Management

Borg Architecture

EECS 582 – F16

Scalability
• Simple synchronous loop
• Split the scheduler into separate processes
• Separate threads to talk to the Borglets
• Score caching
  • Don't recompute scores if the machine state is the same
• Equivalence classes
  • Only do feasibility checking & scoring once per group of similar tasks
• Relaxed randomization
  • Calculate feasibility and scores for enough random machines, not all of them (see the sketch below)
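A toy sketch of how score caching, equivalence classes, and relaxed randomization might fit together; `feasible`, `score`, and the dictionary fields are hypothetical stand-ins, not Borg's actual code.

```python
import random

def feasible(task, m):
    """Hypothetical stand-in: does the machine have room for the task?"""
    return m["free_cpu"] >= task["cpu"] and m["free_mem"] >= task["mem"]

def score(task, m):
    """Hypothetical stand-in for scoring (here: worst fit, most room left)."""
    return min(m["free_cpu"] - task["cpu"], m["free_mem"] - task["mem"])

def pick_machine(task, machines, score_cache, sample_size=50):
    """Toy placement: feasibility + scoring over a random sample of machines."""
    # Relaxed randomization: examine a random subset, not every machine.
    candidates = random.sample(machines, min(sample_size, len(machines)))
    best, best_score = None, float("-inf")
    for m in candidates:
        if not feasible(task, m):                  # cheap check first
            continue
        # Score caching + equivalence classes: tasks in the same equivalence
        # class reuse scores for machines whose state hasn't changed.
        key = (task["equiv_class"], m["name"], m["state_version"])
        if key not in score_cache:
            score_cache[key] = score(task, m)      # the expensive part
        if score_cache[key] > best_score:
            best, best_score = m, score_cache[key]
    return best
```

The cheap feasibility check runs before the expensive scoring step, and the cache key includes a machine state version so stale scores are not reused.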

Page 21: Cluster Management

Scheduling controls

• Priority: relative importance of jobs
  • Given limited resources, who do you give them to?
  • Who do you reclaim resources from?
  • Problem: cascading preemption when introducing a new high-priority task
  • Solution: priority bands; don't preempt within a band

• Quotas: cluster-wide limits on resource usage for a time period
  • E.g., 10 TB of memory for 2 hours
  • Used for admission control: does the user have enough quota?
  • Higher-priority quota is harder to get
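A rough sketch of how priority bands and quota admission control might interact; the band names, boundaries, and helper functions are illustrative, not Borg's actual values.

```python
# Illustrative priority bands (higher number = more important).
BANDS = {"best_effort": 0, "batch": 1, "production": 2, "monitoring": 3}

def admit(job, quota_left):
    """Quota-based admission control: reject the job if it exceeds the user's quota."""
    return job["cpu"] <= quota_left["cpu"] and job["mem"] <= quota_left["mem"]

def can_preempt(new_task, victim):
    """Preempt only across bands, never within one, to avoid preemption cascades."""
    return BANDS[new_task["band"]] > BANDS[victim["band"]]

# Example: a production task may preempt batch work, but one batch task
# cannot preempt another batch task.
print(can_preempt({"band": "production"}, {"band": "batch"}))   # True
print(can_preempt({"band": "batch"}, {"band": "batch"}))        # False
```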

Page 22: Cluster Management

Job parameters

• Jobs specify ~200 parameters to guide scheduling
  • Alloc: a reserved set of resources on a machine, usable by many tasks
  • Example: want locality across several tasks --> run them in the same alloc

• Runtime job modifications
  • Can change priority/resource requests at runtime to reflect changing needs

Page 23: Cluster Management

Goal: High utilization

[Figure: view of utilization on each machine in a cluster – colors = different jobs, white space = unused resources]

• Stranded resources: only one resource type remains available on a machine (e.g., free CPU but no free memory)

Page 24: Cluster Management

Scheduling Policy

• How to place allocs/tasks on a cluster?

• Balance: spread tasks across the cluster (worst fit)
  • + Leaves room for load spikes
  • - Increased fragmentation from tasks that need most of a machine

• Best fit: find the tightest fit for the requested resources
  • + High utilization
  • - Bad for load spikes – no spare capacity to absorb load
  • - Bad for opportunistic (best-effort) jobs, as little is left over

• Borg policy: minimize stranded resources
  • How? Unknown
  • Idea: minimize the difference in the remaining amounts of different resources (CPU, memory, net, …)
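A minimal sketch contrasting the placement scores above; the anti-stranding heuristic implements the "minimize the difference in remaining resources" guess from the last bullet, not a documented Borg formula.

```python
def remaining(machine, task):
    """Fractions of CPU and memory left on the machine after placing the task."""
    cpu = (machine["free_cpu"] - task["cpu"]) / machine["cpu"]
    mem = (machine["free_mem"] - task["mem"]) / machine["mem"]
    return cpu, mem

def worst_fit_score(machine, task):        # balance: prefer the emptiest machine
    return min(remaining(machine, task))

def best_fit_score(machine, task):         # pack tightly: prefer the fullest machine
    return -max(remaining(machine, task))

def anti_stranding_score(machine, task):   # keep leftover CPU and memory in balance
    cpu, mem = remaining(machine, task)
    return -abs(cpu - mem)

# Pick the feasible machine with the highest score under the chosen policy, e.g.:
machines = [
    {"cpu": 16, "mem": 64, "free_cpu": 10, "free_mem": 8},
    {"cpu": 16, "mem": 64, "free_cpu": 4, "free_mem": 32},
]
task = {"cpu": 2, "mem": 4}
print(max(machines, key=lambda m: anti_stranding_score(m, task)))
```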

Page 25: Cluster Management

Utilization

EECS 582 – F16

Cell Sharing

Page 26: Cluster Management

Utilization

EECS 582 – F16

Large Cells

Page 27: Cluster Management

Utilization

EECS 582 – F16

Fine-grained Resource Requests

Page 28: Cluster Management

Fixed-size allocation

• Cloud providers offer fixed-size instances
  • Functions: 128 MB, 256 MB, 512 MB
  • VMs: 2 GB, 4 GB, 32 GB, 128 GB, etc.

• How efficient is this?
  • Upper bound: give a whole machine to any job needing more than 1/2 of one
  • Lower bound: let such jobs go pending

• Why is Borg different from the cloud?

Page 29: Cluster Management

Resource Allocation Problem

• Applications request more resources than they actually use

• How to reallocate?
  • Estimate future use; give the reclaimed resources to non-production jobs
  • Preempt if the resources are needed again
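A toy illustration of the reclamation idea, under the assumption (not spelled out on the slide) that the reservation is estimated as recent peak usage plus a safety margin; the names and the 20% margin are invented.

```python
def reclaimable(limit, recent_usage, margin=0.2):
    """Resources that can be lent to non-production jobs until needed again.

    limit: what the job requested; recent_usage: samples of what it actually used.
    """
    reservation = min(limit, max(recent_usage) * (1 + margin))
    return max(0.0, limit - reservation)

print(reclaimable(limit=16.0, recent_usage=[4.2, 5.0, 4.8]))  # CPUs lendable to batch work
```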

Page 30: Cluster Management

AutoPilot

Page 31: Cluster Management

AutoPilot

• Goal: deploy, provision, and repair data center applications
  • Automate as many sysadmin tasks as possible

• Design principles
  • Fault tolerant like everything else, but not against Byzantine failures
  • Simple and good enough when possible
    • E.g., repairs fall into only a few categories
  • Text-file configuration + auditing of changes
    • QUESTION: EmuLab seems to offer more. Why?
  • Correct at all times, or incorrect in an understandable, documented way
    • State exceptions, handle pathological cases
  • Crash-only
    • No explicit shutdown/cleanup
  • Replication for availability only
    • The workload is fairly small

Page 32: Cluster Management

Autopilot goal

• Handle routine tasks automatically

• Without operator intervention

• No support for legacy apps

Page 33: Cluster Management

AutoPilot services

• Device Manager
  • Stores the goal state the system should be in
  • Strongly consistent, using Paxos
  • Doesn't do anything but store state
    • Goal state, ground truth

• Satellite services
  • Based on the state in the DM, take actions to bring the system to the correct state
  • Poll for changes, but can be told to poll
    • Polling avoids lost pushes of data
    • Being told to pull gives low latency
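A minimal sketch of the pull model above; the `get_goal_state`, `reconcile`, and nudge interfaces are hypothetical stand-ins for whatever each real satellite service (deployment, repair, …) actually does.

```python
import time

class SatelliteService:
    """Toy satellite service: periodically pull goal state from the DM and converge."""

    def __init__(self, dm, poll_interval_s=60):
        self.dm = dm
        self.poll_interval_s = poll_interval_s
        self.wakeup = False            # set by a "please poll now" hint from the DM

    def nudge(self):
        # The DM can ask the service to pull soon (low latency), but correctness
        # never depends on this hint: a lost nudge just means a slower poll.
        self.wakeup = True

    def run_once(self):
        goal = self.dm.get_goal_state()    # ground truth lives in the DM
        self.reconcile(goal)               # take actions to reach the goal state

    def run_forever(self):
        while True:
            self.run_once()
            deadline = time.time() + self.poll_interval_s
            while time.time() < deadline and not self.wakeup:
                time.sleep(1)
            self.wakeup = False

    def reconcile(self, goal):
        raise NotImplementedError      # e.g., deploy binaries, trigger repairs
```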

Page 34: Cluster Management

Autopiloted System

[Diagram: an autopiloted system – the application runs alongside the Autopilot services, under Autopilot control]

Page 35: Cluster Management

Recovery-Oriented Computing


• Everything will eventually fail

• Design for failure

• Crash-only software design

• http://roc.cs.berkeley.edu

Brown, A. and D. A. Patterson. Embracing Failure: A Case for Recovery-Oriented Computing (ROC). High Performance Transaction Processing Symposium, October 2001.

Page 36: Cluster Management

Autopilot Architecture

[Diagram: Device Manager at the center, with Provisioning, Deployment, Watchdog, and Repair services, plus operational data collection and visualization]

• Discover new machines; netboot; self-test
• Install application binaries and configurations
• Monitor application health
• Fix broken machines

Page 37: Cluster Management

Centralized Replicated Control

• Keep essential control state centralized

• Replicate the state for reliability

• Use the Paxos consensus protocol

• The Device Manager uses it for ground truth – the goal state of the system

Page 38: Cluster Management

Device Manager

• Strongly consistent (Paxos), replicated for reliability

• Stores the goal state for the system
  • The set of physical machines
  • The configuration each should have

• Other services pull information from it

• QUESTION: LiveSearch decided this wasn't reliable enough. Why?
  • Answer: they didn't trust it, didn't want to lose control to another group, or was it really unreliable?
  • Shows that apps can build their own, better services on top of AutoPilot…

Page 39: Cluster Management

Cluster Machines

[Diagram: cluster machines, each running an Autopilot services manager; roles include the name service, scheduling, remote execution, storage, and a Paxos-replicated storage metadata service]

Page 40: Cluster Management

Cluster services abstraction

[Diagram: the cluster-services abstraction – cluster services layered over many Windows Server machines managed by Autopilot, presented to applications as reliable, specialized machines]

Page 41: Cluster Management

Cluster Services

• Name service: discover cluster machines

• Scheduling: allocate cluster machines

• Storage metadata: distributed file location

• Storage: distributed file contents

• Remote execution: spawn new computations


Page 42: Cluster Management

Low-level node services

• filesync: copies files to make sure the correct files are present

• application manager: makes sure the correct applications are running

Page 43: Cluster Management

High-level cluster services

• Provisioning service
  • Configures new machines according to policy
  • Determines what OS image to use; installs, boots, tests
  • A new machine asks the DM what applications it should run

Page 44: Cluster Management

High-level services

• Application deployment
  • Apps specify a set of machine types – different configurations of nodes in the service
    • Front-end web server
    • Crawler
  • Apps specify a manifest – lists the config files + app binaries needed for each machine type
  • The deployment service holds the config files + binaries (populated by the build process)
  • Machines ask the DM for their configuration, then contact the deployment service for the needed binaries
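A purely hypothetical illustration of the machine-type/manifest idea (the paper does not show Autopilot's real manifest format), written as Python data for concreteness.

```python
# Hypothetical manifests, one per machine type; names and fields are illustrative only.
MANIFESTS = {
    "frontend-web": {
        "binaries": ["websvc-1.42.tar", "common-libs-7.tar"],
        "configs":  ["websvc.cfg", "logging.cfg"],
    },
    "crawler": {
        "binaries": ["crawler-3.1.tar", "common-libs-7.tar"],
        "configs":  ["crawler.cfg"],
    },
}

def files_needed(machine_type):
    """What a machine of this type pulls from the deployment service."""
    m = MANIFESTS[machine_type]
    return m["binaries"] + m["configs"]

# A new machine asks the DM for its machine type, then fetches these files.
print(files_needed("crawler"))
```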

Page 45: Cluster Management

Application upgrade

• Rolling upgrade is built into Autopilot
  • The new version is added to the manifest for the machine type
  • Each machine of that type downloads the new code on its next poll
  • The DM instructs groups of machines to upgrade (to avoid whole-service downtime)
    • E.g., 1/10th of each machine type at a time
  • Machines are put on probation during the upgrade; if it fails, the upgrade can be rolled back
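A rough sketch of the rolling-upgrade flow; the batch fraction, the machine methods, and the rollback check are assumptions, not Autopilot specifics.

```python
def rolling_upgrade(machines, old_manifest, new_manifest, fraction=0.1):
    """Upgrade one machine type in batches; stop and roll back if a batch fails.

    The machine methods used here are hypothetical stand-ins for DM/deployment calls.
    """
    batch_size = max(1, int(len(machines) * fraction))      # e.g., 1/10th at a time
    upgraded = []
    for start in range(0, len(machines), batch_size):
        batch = machines[start:start + batch_size]
        for m in batch:
            m.set_manifest(new_manifest)   # picked up on the machine's next poll
            m.mark_probation()             # early failures are expected, not alarming
        upgraded.extend(batch)
        if not all(m.healthy_after_probation() for m in batch):
            for m in upgraded:             # batch failed: roll everything back
                m.set_manifest(old_manifest)
            return False
    return True
```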

Page 46: Cluster Management

Failure detection

• Watchdog service
  • Checks an attribute on a node and reports "OK", "Warning", or "Error"
  • The watchdog service calls a node correct if all attributes are OK or Warning (a warning is unexpected but not fatal: it generates extra logging but does not alert operators)
  • Watchdog sensors can be standard (e.g., BIOS checks for memory/disk corruption, OS version checks) or app-specific

• QUESTION: Why have apps generate their own watchdogs?

• Note: check lots of signals; any single one could still look fine while things fail
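A minimal sketch of the watchdog aggregation rule above; the sensor names and status strings are illustrative.

```python
OK, WARNING, ERROR = "ok", "warning", "error"

def node_status(attribute_reports):
    """A node is considered correct unless some attribute reports an error."""
    if any(r == ERROR for r in attribute_reports.values()):
        return "failed"            # hand the node to the repair service
    if any(r == WARNING for r in attribute_reports.values()):
        return "ok-with-warnings"  # log more, but don't alert operators
    return "healthy"

print(node_status({"bios_memtest": OK, "os_version": WARNING, "app_probe": OK}))
# -> "ok-with-warnings"
```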

Page 47: Cluster Management

Failure Detection Latency

• AutoPilot doesn't detect problems immediately
  • A down or sluggish machine can affect application latency

• So what to do?
  • Apps can use very short timeouts and retry quickly
  • Apps can report such failures to AutoPilot

• AutoPilot is generic; its recovery techniques are not suitable for stuttering latency problems
  • E.g., they would be overkill and could hurt overall reliability

Page 48: Cluster Management

Failure/recovery

• Recovery: automatically do the simple things an admin would try before taking a machine offline
  • E.g., restart the service, reboot, reinstall, replace

• Failed nodes are given a repair treatment based on their symptoms
  • DoNothing, Reboot, ReImage, Replace
  • Technicians replace computers periodically (days/weeks)
  • Also based on history (previously healthy computers go to DoNothing)

• Repaired nodes are marked as on "probation"
  • Expected to have a few early failures (ignored) but then become healthy
  • If correct for a while, moved to Healthy

• All machines affected by new code are also marked "probation", to tolerate expected failures during startup
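A toy version of the repair policy; the escalation order follows the slide, but the thresholds and the history rule are assumptions.

```python
REPAIR_ACTIONS = ["DoNothing", "Reboot", "ReImage", "Replace"]

def choose_repair(node):
    """Escalate repairs for repeat offenders; leave recently-healthy nodes alone."""
    if node["days_since_last_failure"] > 30:     # previously healthy: likely transient
        return "DoNothing"
    # Each failure within the window escalates one step, capped at Replace.
    step = min(node["recent_failures"], len(REPAIR_ACTIONS) - 1)
    return REPAIR_ACTIONS[step]

def after_repair(node):
    node["state"] = "probation"   # a few early failures are ignored in this state
    # ...later, if the node stays correct for a while, move it back to "healthy".

print(choose_repair({"days_since_last_failure": 2, "recent_failures": 3}))  # -> "Replace"
```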

Page 49: Cluster Management

Other recovery options

• Hot standby to replace a failed machine
  • But the standby machine is idle most of the time and not contributing
  • Apps can handle failures anyway, so why not let the app deal with it until full recovery?

Page 50: Cluster Management

Monitoring service

• Collects logs/performance counters from applications in a common format
  • Provides a central view of the app

• Near-real-time data, stored in a SQL database

• The Cockpit visualization tool reads the data and shows current status

• An alert service sends alert emails based on triggers/queries over the Cockpit DB

Page 51: Cluster Management

What AutoPilot doesn’t do

• Load balancing/migration
  • This is up to the app

• If the app needs more resources – what then?
  • Tell AutoPilot to provision more machines

• Address all issues in the data center
  • Network configuration
  • Power management/consolidation

Page 52: Cluster Management

AutoPilot Summary

• Provides common tools to:
  • Install a new machine
  • Provision apps onto it
  • Detect failures
  • Repair failures
  • Record logs
  • Monitor app behavior

• BUT: not for legacy code (apps must be packaged for AutoPilot)