Cost Aware Fault Recovery in Clouds (IM 2013)

Upload: assafisr

Post on 08-Aug-2018


  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    1/29

    COST AWARE FAULT RECOVERY IN CLOUDS

    Assaf Israel, Danny Raz

    Technion - Israel Institute of Technology

    2/29

    FAULTS IN DATACENTERS

    We've come a long way in terms of server resilience

    Enterprise gra

    Component A

    Compute

    (CPU, RAM, Fans, Net)

    ~

    Storage ~

    3/29

    FAULTS IN DATACENTERS

    Typical first year of a new 1800-server cluster @ Google:

    - thousands of hard drive failures

    ~1000 individual machine failures

    ~3 router failures (have to immediately pull traffic for an hour)

    ~5 racks go wonky (40-80 machines see 50% packet loss)

    ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)

    ~1 network rewiring (~5% of machines down over 2-day span)

    ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)

    ~0.5 overheating (power down most machines in

    4/29

    FAULTS IN DATACENTERS

    Other factors also contribute to lack of resilience

    Distribution of service disruption events

    The Datacenter as a Computer (2009)

    5/29

    RECOVERY

    Most of the time we would like to recover as quickly as possible

    Backup

    6/29

    RECOVERY

    Most of the time we would like to recover as quickly as possible

    Single host recovery may take advantage of vacant resources

    Backup

    8/29

    RECOVERY

    Larger failures (racks, network segments, power regions)

    May require powering more machines

    Backup

    10/29

    RECOVERY COST

    Service Degradation

    Backup Infrastructure

    Recovery Cost

    11/29

    RECOVERY COST

    Service Degradation

    Backup Infrastructure

    Recovery Cost

    Can be formally expressed as: [equation not preserved in this transcript]

    - Service degradation cost of a task when recovered at a given host

    - Infrastructure cost of a backup host

    - 0/1 decision vectors
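The slide's own notation did not survive extraction. As a hedged sketch only, the cost described here (service degradation plus backup infrastructure, selected by 0/1 decision vectors) could be written as below. Every symbol is an assumption introduced for illustration, not the authors' notation: $T$ is the set of failed tasks, $H$ the set of backup hosts, $s_{ij}$ the service degradation cost of task $i$ when recovered at host $j$, $c_j$ the infrastructure cost of host $j$, $x_{ij} = 1$ if task $i$ is recovered at host $j$, and $y_j = 1$ if host $j$ is activated.

```latex
\[
\mathrm{RecoveryCost}(x, y)
  \;=\; \sum_{i \in T} \sum_{j \in H} s_{ij}\, x_{ij}
  \;+\; \sum_{j \in H} c_j\, y_j,
\qquad x_{ij},\; y_j \in \{0, 1\}.
\]
```

The first sum is the service degradation term and the second is the backup infrastructure term.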

    12/29

    RECOVERY COST

    Service degradation depends on:

    - Task setup/initialization

    - Host setup/initialization

    - Network configuration (if recovered to a different network segment)

    - Storage mapping

    - Storage migration (if recovered to a different SAN)

    - Software patches

    - Integrity checks

    - Manual host configuration

    - Recovery target location (latency/bandwidth)

    Service Degradation

    Backup Infrastructure

    13/29

    RECOVERY COST

    Pre-planning can help reduce recovery cost

    Activating additional backup infrastructure:

    - Can help lower some of the service degradation costs

    - At the expense of additional maintenance costs

    Service Degradation

    Backup Infrastructure

    14/29

    OBSERVATION

    Not all tasks are equal:

    Interactive & vital monitoring

    High-priority non-interactive

    Non-interactive user-facing

    Batch

    Housekeeping tasks

    Some are more susceptible to long downtimes than others

    Web-scW. Cirne

    Tight SLA

    Relaxed SLA

    15/29

    GOAL

    We would like to recover expensive tasks faster

    Balance service degradation and infrastructure costs

    Backup

    16/29

    GOAL

    We would like to recover expensive tasks first

    Balance service degradation and infrastructure costs

    Backup

    17/29

    GOAL

    Formal: Minimize the total recovery cost

    Infrastructure costs

    Service degradation costs

    Under some packing constraints
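To make the objective concrete, here is a minimal sketch of how the total cost of one candidate recovery plan might be evaluated. The function name, the dictionary layout, and the single scalar capacity per host are all assumptions made for illustration; the slides do not specify a data model, and the real packing constraints may cover several resources.

```python
from typing import Dict


def total_recovery_cost(
    assignment: Dict[str, str],                 # task -> backup host it is recovered on
    service_cost: Dict[str, Dict[str, float]],  # task -> host -> service degradation cost
    infra_cost: Dict[str, float],               # host -> cost of activating the host
    size: Dict[str, float],                     # task -> resource demand (assumed scalar)
    capacity: Dict[str, float],                 # host -> capacity (assumed scalar)
) -> float:
    """Return infrastructure cost plus service degradation cost of a plan.

    Raises ValueError if the plan violates a packing (capacity) constraint.
    """
    # Accumulate the load that the plan puts on each used host.
    load: Dict[str, float] = {}
    for task, host in assignment.items():
        load[host] = load.get(host, 0.0) + size[task]

    # Packing constraint: no host may be loaded beyond its capacity.
    for host, used in load.items():
        if used > capacity[host]:
            raise ValueError(f"host {host} over capacity: {used} > {capacity[host]}")

    # Each used host is paid for once, however many tasks land on it.
    infrastructure = sum(infra_cost[host] for host in load)
    service = sum(service_cost[task][host] for task, host in assignment.items())
    return infrastructure + service
```

Minimizing this value over all feasible assignments is the optimization problem the slide states; the function itself only scores a given plan.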

    18/29

    APPROXIMATION - OVERVIEW

    Integer Program

    LP Relaxation

    Linear Transformations

    Light Graphs

    Cycle Breaking

    Activation

    Rounding

    Approximation bounds

    Cost 1 Load

    6

    19/29

    IF WE HAD MORE INFO

    Backup

    20/29

    IF WE HAD MORE INFO

    If we knew which of the backup hosts are active, we could approximate the service degradation costs

    Backup

    21/29

    MINIMUM GENERAL ASSIGNMENT PROBLEM

    Bins, Items

    Each item has a size that depends on the target bin

    Each item has a cost that depends on the target bin

    Goal: Pack all items into bins at minimum cost, under packing constraints
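The problem statement above can be illustrated with an exhaustive solver for tiny instances. This is a sketch for intuition only: it enumerates every placement, so it is exponential in the number of items and is not one of the approximation algorithms discussed in the talk. The function name and the list-of-lists input layout are assumptions.

```python
from itertools import product
from typing import List, Optional, Tuple


def min_gap_brute_force(
    size: List[List[float]],   # size[i][b]: size of item i if placed in bin b
    cost: List[List[float]],   # cost[i][b]: cost of item i if placed in bin b
    capacity: List[float],     # capacity[b]: capacity of bin b
) -> Optional[Tuple[float, Tuple[int, ...]]]:
    """Return (minimum cost, placement) or None if no feasible packing exists.

    placement[i] is the index of the bin that item i is packed into.
    """
    n_items, n_bins = len(size), len(capacity)
    best: Optional[Tuple[float, Tuple[int, ...]]] = None
    # Try every way of mapping each item to one bin.
    for placement in product(range(n_bins), repeat=n_items):
        load = [0.0] * n_bins
        for item, b in enumerate(placement):
            load[b] += size[item][b]
        # Skip placements that break a packing constraint.
        if any(load[b] > capacity[b] for b in range(n_bins)):
            continue
        total = sum(cost[item][b] for item, b in enumerate(placement))
        if best is None or total < best[0]:
            best = (total, placement)
    return best
```

In the recovery setting, bins would correspond to backup hosts and items to failed tasks, with the bin-dependent cost playing the role of service degradation.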

    22/29

    MIN-GAP

    Has been studied extensively

    Known results:

    LP-Based 2-Approx. (Shmoys and Tardos, 1993)

    LP-Based [...]-Approx. (Fleischer, Goemans, Mirrokni and Sviridenko [...])

    Local Ratio-Based 2 -Approx. (Cohen, Katzir and Raz, 2006)

    23/29

    LOCAL SEARCH

    Iteratively find the next backup machine to activate

    Stop when there's no improvement in recovery costs

    Backup

    Active host

    Inactive host

    Base cost - All backups are inactive

    Next Ac... Find the ... activate ... recover ... is minim... (Using it...

    Stop condition

    If ( < ):
        return last RP
    Else:
        Activate
        return +
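The loop described on this slide can be sketched as follows. This is an illustrative reading, not the authors' implementation: the inner step that prices a set of active hosts is replaced here by a simple greedy assignment, and the `down_cost` penalty for a task that is left unrecovered, the scalar `size`/`capacity`, and all names are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass
class Task:
    size: float
    down_cost: float            # assumed penalty if the task is not recovered
    service: Dict[str, float]   # host name -> service degradation cost


@dataclass
class Host:
    capacity: float
    infra_cost: float           # assumed cost of activating this backup host


def plan_cost(active: Set[str], tasks: Dict[str, Task], hosts: Dict[str, Host]) -> float:
    """Greedy estimate of the recovery cost when exactly `active` hosts are on."""
    free = {h: hosts[h].capacity for h in active}
    total = sum(hosts[h].infra_cost for h in active)
    # Place the tasks that are most costly to leave down first.
    for task in sorted(tasks.values(), key=lambda t: -t.down_cost):
        options = [(task.service[h], h) for h in active if free[h] >= task.size]
        cost, host = min(options, default=(task.down_cost, None))
        if host is not None and cost < task.down_cost:
            free[host] -= task.size
            total += cost
        else:
            total += task.down_cost   # cheaper (or only option) to leave it down
    return total


def local_search(tasks: Dict[str, Task], hosts: Dict[str, Host]) -> Tuple[float, Set[str]]:
    """Activate one backup host at a time while the recovery cost improves."""
    active: Set[str] = set()
    best = plan_cost(active, tasks, hosts)      # base cost: all backups inactive
    while True:
        candidates = [
            (plan_cost(active | {h}, tasks, hosts), h)
            for h in sorted(hosts) if h not in active
        ]
        if not candidates:
            return best, active
        cost, host = min(candidates)            # next activation: cheapest resulting plan
        if cost >= best:                        # stop condition: no improvement
            return best, active
        active.add(host)
        best = cost
```

Swapping `plan_cost` for a Min-GAP approximation would bring the sketch closer to the approach the talk outlines, since pricing a fixed set of active hosts is an assignment problem of that form.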

    24/29

    SIMULATIONS

    Based on data from IBM Research Compute Cloud (RC

    Several hundred hosts, with a few thousand VMs

    4 host configurations, 3 VM configurations

    EC2-like SLA policies (higher availability guarantees, at higher rates)

    25/29

    RECOVERY COST BY RACK SIZE

    [Figure: Normalized Recovery Cost by Rack size. x-axis: Rack size (#hosts/rack), ticks 1, 2, 3, 6, 10, 17, 34; y-axis: Cost [%], 0 to 1]

    26/29

    RECOVERY COST BY VM SLA DISTRIBUTION

    [Figure: 2 host racks - Total & Service costs. x-axis: SLA Distribution, 0 to 0.5; y-axis: Cost, 0 to 250000. Annotations: 20% - Cheap to recover; 80% - Expensive to recover]

    27/29

    RECOVERY COST BY VM SLA DISTRIBUTION

    [Figure: 2 host racks - Total & Service costs. x-axis: SLA Distribution, 0 to 0.5; y-axis: Cost, 0 to 250000. Legend (truncated): Active / Servic / Inactiv / Servic / Local / Servic]

    28/29

    CONCLUSION

    Large scale infrastructure mandates fault tolerance techniques

    Pre-planning can help reduce recovery cost

    Classifying tasks by SLAs can improve overall recovery cost

    LP-Based Load/Cost Approximation with guaranteed performance

    Local Search heuristic with good practical performance

    29/29

    THANK YOU !

    Questions ?