resource management in the cloud

Resource management in the cloud

DRS


Introduction

•  Lead developer at Parallels
•  DRS enthusiast, leader of the resource management team at Odin
•  Over 10 years of experience in virtualization
•  Lecturer at MIPT, Innopolis, Sbertech


Resource scheduler (transliterated into Russian as "resurs sheduler")

Resource distributor

Resource manager!


Resource scheduler


Resource scheduler

•  N resource producers (servers)
•  M resource consumers (virtual instances: VMs, containers, jobs)


Resource scheduler

•  Estimate the current resource consumption level of each consumer (demand(i))
•  Predict the resource consumption level of a consumer (future_demand(i))
•  Detect provider overload (bottleneck(j))
•  Optimal placement (place(i,j) -> place(i,k))
•  Victim selection
•  Migration estimation (possibility, migration weight)

•  N resource producers (servers)
•  M resource consumers (virtual instances: VMs, containers, jobs)
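The tasks listed above can be read as a single control loop. The following is a minimal sketch of such a loop, not the design of any particular product. Everything in it is an assumption made for illustration: the class and function names, the single scalar resource, the "latest sample" and "mean of recent samples" estimators, and the 0.9 overload threshold.

```python
from dataclasses import dataclass, field


@dataclass
class Consumer:
    name: str
    usage_history: list  # recent usage samples, arbitrary resource units


@dataclass
class Host:
    name: str
    capacity: float
    consumers: list = field(default_factory=list)


def demand(c):
    """Current demand estimate: simply the latest sample (assumption)."""
    return c.usage_history[-1]


def future_demand(c, window=3):
    """Naive prediction: mean of the last `window` samples (assumption)."""
    recent = c.usage_history[-window:]
    return sum(recent) / len(recent)


def load(h):
    return sum(demand(c) for c in h.consumers)


def bottleneck(h, threshold=0.9):
    """A host counts as overloaded above `threshold` of its capacity."""
    return load(h) > threshold * h.capacity


def rebalance_step(hosts, threshold=0.9):
    """One pass over all hosts; performs moves and returns them as
    (consumer, source, target) name tuples."""
    moves = []
    for src in hosts:
        if not bottleneck(src, threshold):
            continue
        # Victim selection: the consumer with the largest predicted demand.
        victim = max(src.consumers, key=future_demand)
        # Placement: hosts that can absorb the victim without overloading.
        candidates = [
            h for h in hosts
            if h is not src
            and load(h) + future_demand(victim) <= threshold * h.capacity
        ]
        if not candidates:
            continue  # migration is not possible right now
        dst = min(candidates, key=load)  # least-loaded target
        src.consumers.remove(victim)
        dst.consumers.append(victim)
        moves.append((victim.name, src.name, dst.name))
    return moves
```

Real schedulers replace each helper with something far more careful; the point here is only how the pieces fit together.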


Estimate demand


Estimate demand. Memory

•  OS halts when it doesn’t need CPU
•  OS doesn’t perform I/O when it doesn’t need external devices
•  OS does nothing when it doesn’t need memory


Predict demand


Predict demand

•  Math methods
    •  ARIMA, regression
    •  Kalman & other filters
    •  CuSum
    •  RL
    •  Markov chains
    •  Bayes
    •  Clustering
•  Metrics
    •  Provisioning metrics (virtualization)
    •  Consumer metrics (Guest OS)
    •  Job/user metrics
•  Time
    •  When to predict
    •  Prediction horizon (prediction accuracy)
    •  Lifetime prediction?
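As an illustration of one listed method, here is a minimal two-sided CuSum detector. It flags the moments when a consumer's demand drifts away from a known baseline, which is one possible answer to "when to predict": re-fit the model when a change is detected. The function name and the parameters `target`, `slack`, and `threshold` are assumptions for this sketch, and real deployments would have to tune them per metric.

```python
def cusum_change_points(samples, target, slack, threshold):
    """Two-sided CUSUM over `samples`.

    target    - expected baseline level of the metric
    slack     - deviation tolerated as normal noise (often called k)
    threshold - accumulated deviation that raises an alarm (often called h)

    Returns the indices at which an upward or downward shift is flagged.
    """
    pos = neg = 0.0
    alarms = []
    for i, x in enumerate(samples):
        # Accumulate only the part of the deviation that exceeds the slack.
        pos = max(0.0, pos + (x - target - slack))
        neg = max(0.0, neg + (target - x - slack))
        if pos > threshold or neg > threshold:
            alarms.append(i)
            pos = neg = 0.0  # restart accumulation after an alarm
    return alarms


cpu = [10, 10, 11, 9, 10, 15, 15, 15, 15]
print(cusum_change_points(cpu, target=10, slack=1, threshold=5))  # [6, 8]
```

The jump to 15 starts at index 5, but the alarm fires at index 6: the detector trades a short delay for robustness against single-sample spikes.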


Detect overload


Detect overload

•  Heartbeat
•  Virtual resource usage zones
•  Complicated resource producer metrics

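One possible reading of "usage zones" is a classification of host utilization into bands such as green, yellow, and red. The sketch below follows that reading; the zone names, the 70% and 90% boundaries, and the hysteresis margin are all assumptions, not values from the talk.

```python
def classify_zone(utilization, previous="green",
                  yellow_at=0.70, red_at=0.90, hysteresis=0.05):
    """Map a utilization in [0, 1] to "green", "yellow" or "red".

    A host leaves a higher zone only after dropping `hysteresis` below the
    threshold at which it entered, which avoids flapping when the load
    hovers near a boundary.
    """
    red_exit = red_at - hysteresis
    yellow_exit = yellow_at - hysteresis
    if utilization >= red_at or (previous == "red" and utilization >= red_exit):
        return "red"
    if utilization >= yellow_at or (
        previous in ("yellow", "red") and utilization >= yellow_exit
    ):
        return "yellow"
    return "green"
```

A scheduler built on this could treat "red" as the overload signal and "yellow" as a hint not to place new consumers on the host.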

Placement (1)

•  Initial (w/o estimation)
    •  Single
    •  Group
•  Dynamic (with estimation)
•  SLA and affinity issues
    •  VM-to-VM, VM-to-host
    •  Affinity & anti-affinity
    •  Resource availability
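The affinity and availability constraints above act as filters on the set of candidate hosts. A minimal sketch of such a filter follows. The data layout (plain dictionaries, rule pairs) and all names are assumptions chosen for brevity.

```python
def feasible_hosts(vm, vm_demand, free, placement,
                   affinity=(), anti_affinity=(), allowed_hosts=None):
    """Return the hosts on which `vm` may be placed.

    free          - dict host -> free capacity (resource availability)
    placement     - dict vm -> host for VMs that are already placed
    affinity      - VM-to-VM pairs that must share a host
    anti_affinity - VM-to-VM pairs that must not share a host
    allowed_hosts - optional dict vm -> set of hosts (VM-to-host rule)
    """
    allowed_hosts = allowed_hosts or {}
    partners = [b if a == vm else a for a, b in affinity if vm in (a, b)]
    rivals = [b if a == vm else a for a, b in anti_affinity if vm in (a, b)]
    result = []
    for host, free_capacity in free.items():
        if free_capacity < vm_demand:
            continue  # not enough resources
        if vm in allowed_hosts and host not in allowed_hosts[vm]:
            continue  # VM-to-host rule violated
        if any(placement.get(p, host) != host for p in partners):
            continue  # an affinity partner lives on another host
        if any(placement.get(r) == host for r in rivals):
            continue  # an anti-affinity rival lives on this host
        result.append(host)
    return result


free = {"h1": 8, "h2": 4, "h3": 8}
placement = {"db": "h1", "web2": "h3"}
print(feasible_hosts("web1", 4, free, placement,
                     anti_affinity=[("web1", "web2")]))  # ['h1', 'h2']
```

Partners that are not yet placed impose no constraint in this sketch; group placement would need to handle them jointly.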


Placement (2). Inter-consumer dependencies


Placement (3). Geography-aware


Placement (4). … after all filters, restrictions

A set of suitable target nodes. Which to choose?


Placement (4). … after all filters, restrictions

•  Random
•  Round-robin
•  Spread-first
•  Greedy (fill-first)
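The four strategies differ only in how they pick one host from the filtered set. The sketch below shows one way to express each; the function names and the representation of candidates as a dict from host name to free capacity are assumptions.

```python
import itertools
import random


def pick_random(candidates, rng=random):
    """Random: any suitable host, uniformly."""
    return rng.choice(sorted(candidates))


def make_round_robin(all_hosts):
    """Round-robin: walk a fixed host order, skipping unsuitable hosts."""
    cycle = itertools.cycle(all_hosts)

    def pick(candidates):
        for _ in range(len(all_hosts)):
            host = next(cycle)
            if host in candidates:
                return host
        raise ValueError("no candidate host")

    return pick


def pick_spread_first(candidates):
    """Spread-first: the host with the most free capacity."""
    return max(candidates, key=candidates.get)


def pick_fill_first(candidates, demand):
    """Greedy fill-first: the fullest host that still fits the demand.

    Raises ValueError if no host fits.
    """
    fitting = {h: f for h, f in candidates.items() if f >= demand}
    return min(fitting, key=fitting.get)


candidates = {"h1": 2, "h3": 6}  # free capacity after all filters
print(pick_spread_first(candidates))   # h3
print(pick_fill_first(candidates, 2))  # h1
```

Spread-first evens out load at the cost of fragmentation, while fill-first packs hosts tightly and leaves others empty, which matters for power management.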


Migration

•  Could take too long (needs prediction)
•  Could affect performance
•  Could affect energy consumption
•  Could be more or less effective
•  Could fail


Use cases: VMware (1)

VMware Distributed Resource Management: Design, Implementation, and Lessons Learned

This overview of the DRS algorithm has been greatly simplified to focus on its core load-balancing metric. The actual load-balancing algorithm considers many other factors, including the impact of the move on imbalance and the risk-adjusted benefit of each move. The next sections describe these checks and other factors in more detail.

3.1.3 Minimum Goodness

DRS load balancing rejects a move if it does not produce enough benefit in terms of improvement in the standard deviation value representing imbalance. The threshold used for this filtering is computed dynamically based on the number of hosts and VMs. The threshold is reduced significantly when imbalance is very high and moves to correct it are filtered by the normal threshold, so that many low-impact moves can be used to correct high imbalance.
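The filter described in this paragraph can be sketched as follows. The excerpt does not give the formula for the dynamic threshold, so `min_goodness` is a plain parameter here; normalizing each host's load by its capacity and the function names are likewise assumptions.

```python
from statistics import pstdev


def imbalance(host_load, host_capacity):
    """Standard deviation of normalized host loads (load / capacity)."""
    return pstdev([host_load[h] / host_capacity[h] for h in host_load])


def move_gain(host_load, host_capacity, vm_load, src, dst):
    """Reduction in imbalance if a VM with `vm_load` moves from src to dst."""
    after = dict(host_load)
    after[src] -= vm_load
    after[dst] += vm_load
    return imbalance(host_load, host_capacity) - imbalance(after, host_capacity)


def accept_move(gain, min_goodness):
    """Reject moves whose improvement is below the goodness threshold."""
    return gain >= min_goodness


load = {"a": 8.0, "b": 2.0}
capacity = {"a": 10.0, "b": 10.0}
gain = move_gain(load, capacity, vm_load=3.0, src="a", dst="b")
print(accept_move(gain, min_goodness=0.1))  # True
```

Moving a VM with load 6 instead would merely swap the two hosts' roles: the gain is zero and the move is rejected.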

3.1.4 Cost-Benefit Analysis

The main check DRS uses to filter unstable moves is cost-benefit analysis. Cost-benefit analysis considers the various costs involved in the move by modeling the vMotion costs as well as the cost to the other VMs on the destination host which will have an additional VM competing for resources. The benefit is computed as how much the VMs on the source and the migrating VM will benefit from the move. Cost-benefit analysis also considers the risk of the move by predicting how the workload might change on both the source and destination and if this move still makes sense when the workload changes.

The cost is modeled by estimating how long this VM would take to migrate. The migration time mainly depends on the memory size of the VM as well as how actively the VM modifies its pages during migration. If the VM dirties its pages frequently during vMotion, they must be copied to the destination host multiple times.

DRS keeps track of the transfer rate based on the history of migration times for each VM and host. Based on the transfer rate and current memory size, it calculates the time required for single round of copying. During this time the cost is computed as the resources required for the migrating VM to exist on the destination. The vMotion cost is in the unit of resources (MHz or MB) over time.
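To make the dependence on memory size, dirty rate, and transfer rate concrete, here is a sketch of a migration-time estimate. The excerpt says DRS computes the time of a single copy round; the loop below extends that to a textbook iterative pre-copy model, which is an assumption and not taken from the source. The function names, the 64 MB stop-and-copy threshold, and the round limit are hypothetical.

```python
def estimate_precopy(memory_mb, dirty_rate_mb_s, transfer_rate_mb_s,
                     stop_copy_mb=64.0, max_rounds=30):
    """Estimate (total_seconds, rounds) for an iterative pre-copy migration.

    Round 1 copies all memory; each later round copies the pages dirtied
    during the previous one. The loop ends with a final round once the
    remaining data is at most `stop_copy_mb`, or after `max_rounds`.
    """
    if transfer_rate_mb_s <= 0:
        raise ValueError("transfer rate must be positive")
    to_copy = float(memory_mb)
    total = 0.0
    rounds = 0
    while rounds < max_rounds:
        seconds = to_copy / transfer_rate_mb_s
        total += seconds
        rounds += 1
        if to_copy <= stop_copy_mb:
            break  # that was the final stop-and-copy round
        # Pages dirtied while this round was being transferred.
        to_copy = min(memory_mb, dirty_rate_mb_s * seconds)
    return total, rounds


def reservation_cost(resource_amount, seconds):
    """Cost in resource-time units (for example MB*s or MHz*s)."""
    return resource_amount * seconds


seconds, rounds = estimate_precopy(4096, dirty_rate_mb_s=100,
                                   transfer_rate_mb_s=1000)
print(rounds)  # 3
```

If the dirty rate approaches the transfer rate, the rounds stop shrinking and the estimate runs into `max_rounds`, which is the "could be too long" case that makes prediction necessary.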

The vMotion process itself consumes some resources on both the source and destination hosts and this is also added as a cost. DRS then computes how much the workload inside the VM would suffer due to migration. This is done by measuring how actively the VM writes to its memory pages and how much time each transfer takes. DRS approximates this by measuring the number of times the VM had to be descheduled by the CPU scheduler during past vMotions and how active the VM was at that time. The performance degradation due to the increased cost of modifying memory during migration is computed. This value is used to extrapolate how much the VM would suffer by taking into account its current consumption and added to the cost. This cost is also measured in units of resources over time, such as CPU seconds.

The vMotion benefit is also measured in resources over time. The benefit is computed by calculating how much of the demand for the candidate VM is being clipped on the source versus the increased amount of demand that is expected to be satisfied on the destination. The benefit also includes the amount of unsatisfied demand for other VMs on the source that will be satisfied after the VM moves off. To represent possible demand changes, demand is predicted for all VMs on the source and destination. Based on the workloads of all these VMs, a stable time is predicted after which we assume the VM workloads will change to the worst configuration for this move, given recent demand history (by default, for the previous hour). The use of this worst-case value is intended to make the algorithm more conservative when recommending migrations.

With respect to the source host, the worst possible situation is computed as one in which, after the stable time, the workloads of the VMs on the source exhibit the minimum demand observed during the previous hour. With respect to the destination host, the worst possible situation is computed as one in which the demands for the VMs on the destination including the VM being moved are the maximum values observed in the last one hour. By assuming this worst-case behavior, moves with benefits that depend on VMs with unstable workloads are filtered out.
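The worst-case reasoning in the last two paragraphs can be illustrated with a small calculation. This is a simplified sketch under several assumptions that the excerpt does not specify: benefit is measured as the reduction in unmet demand, current demand is the latest history sample, and the benefit over a fixed horizon is the current rate during the stable time plus the worst-case rate afterwards. All names are hypothetical.

```python
def unmet(demands, capacity):
    """Demand that a host cannot satisfy (the clipped part)."""
    return max(0.0, sum(demands) - capacity)


def benefit_rate(src_others, dst_vms, vm_on_src, vm_on_dst, src_cap, dst_cap):
    """Reduction in total unmet demand per unit time if the VM moves."""
    before = unmet(src_others + [vm_on_src], src_cap) + unmet(dst_vms, dst_cap)
    after = unmet(src_others, src_cap) + unmet(dst_vms + [vm_on_dst], dst_cap)
    return before - after


def risk_adjusted_benefit(src_hist, dst_hist, vm, src_cap, dst_cap,
                          stable_s, horizon_s):
    """Benefit in resource-seconds over `horizon_s`.

    src_hist / dst_hist map VM name -> demand samples of the last hour.
    Until `stable_s` the current demands hold; afterwards the source shows
    its minimum demands and the destination its maximum demands.
    """
    others = [h for name, h in src_hist.items() if name != vm]
    vm_hist = src_hist[vm]
    now = benefit_rate([h[-1] for h in others],
                       [h[-1] for h in dst_hist.values()],
                       vm_hist[-1], vm_hist[-1], src_cap, dst_cap)
    worst = benefit_rate([min(h) for h in others],
                         [max(h) for h in dst_hist.values()],
                         min(vm_hist), max(vm_hist), src_cap, dst_cap)
    return stable_s * now + (horizon_s - stable_s) * worst


src = {"a": [6, 6, 6], "vm": [3, 6, 6]}
bursty_dst = {"b": [2, 8, 2]}
steady_dst = {"b": [2, 2, 2]}
print(risk_adjusted_benefit(src, bursty_dst, "vm", 10, 10, 300, 900))  # -1800.0
print(risk_adjusted_benefit(src, steady_dst, "vm", 10, 10, 300, 900))  # 600.0
```

The same move looks attractive toward the steady destination and is filtered out toward the bursty one, because the bursty VM's maximum demand would overload the destination once the moved VM arrives.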

3.1.5 Pending Recommendations

DRS considers the recommendations in flight and any pending recommendations not yet started, so that it does not correct the same constraint violation or imbalance several times. VMs that are currently migrating are treated as if they exist on both source and destination. New recommendations that would conflict with pending recommendations are not generated.

Figure 4: Overview of DRS Cost-Benefit analysis

Algorithm 1: DRS Load Balancing


Use cases: VMware (2)


cost-benefit analysis


Resource scheduler. Administrative

•  Limits + Shares + Reservations
•  Grouping of resource suppliers (clusters)
•  Nested resource pools
•  Custom rules (affinity and anti-affinity)
•  Automatic, semi-automatic, manual modes
•  Migration threshold
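How limits, shares, and reservations interact can be shown with a small allocation routine. This is a generic proportional-share sketch for a single flat pool, not the algorithm of any specific product. The function name and the tuple layout are assumptions, shares are assumed to be positive, and nested pools are not modeled.

```python
def allocate(capacity, vms):
    """Divide `capacity` among VMs.

    vms maps name -> (reservation, limit, shares).
    Every VM first receives its reservation. The remainder is divided in
    proportion to shares; a VM that reaches its limit drops out and its
    unused portion is redistributed among the others.
    """
    alloc = {name: float(res) for name, (res, _, _) in vms.items()}
    remaining = capacity - sum(alloc.values())
    if remaining < 0:
        raise ValueError("reservations exceed capacity")
    active = {name for name, (res, lim, _) in vms.items() if res < lim}
    while remaining > 1e-9 and active:
        total_shares = sum(vms[name][2] for name in active)
        granted = 0.0
        for name in list(active):
            offer = remaining * vms[name][2] / total_shares
            room = vms[name][1] - alloc[name]
            take = min(offer, room)
            alloc[name] += take
            granted += take
            if room - take <= 1e-9:
                active.discard(name)  # limit reached
        remaining -= granted
    return alloc


vms = {
    "a": (10, 100, 2000),  # (reservation, limit, shares)
    "b": (10, 20, 1000),
    "c": (0, 100, 1000),
}
print(allocate(100, vms))
```

In this example "b" is capped at its limit of 20, and the capacity it could not use is split between "a" and "c" in a 2:1 ratio, matching their shares.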


Resource scheduler. Virtualization case

•  High availability
•  Failover
•  Backup
•  DPM


Resource scheduler. Main principles

•  Do not migrate too often
•  Random is a good placement algorithm
•  Choose the counters for prediction and estimation carefully
•  Maths is a pleasure!


Questions


Thank you for your attention!

[email protected]
