Transcript
Page 1: Architecting for the cloud scability-availability

Architecting for the Cloud

Len and Matt Bass

Scalability

Page 2: Architecting for the cloud scability-availability

Link to yesterday’s slides

http://www.slideshare.net/lenbass/architecting-for-the-cloud-intro-virtualization-iaa-s

Page 3: Architecting for the cloud scability-availability

Outline

• Introduction to scalability

• CPU scaling

• I/O scaling

Page 4: Architecting for the cloud scability-availability

Characteristic of cloud from NIST

• On-demand self-service. A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service’s provider.

Page 5: Architecting for the cloud scability-availability

Scale in the Cloud

• Many people think that you get scalability just by virtue of being in the cloud

• This isn’t true

• What the cloud gives you is the ability to quickly and easily add resources

– It doesn’t guarantee that this results in additional capacity

• Just as with security, you need to design scalability in

Page 6: Architecting for the cloud scability-availability

What is Scalability?

• (Problem definition) Scalability is the ability of a system to support a growing amount of work.

– May be from additional users

– May be from additional requests from current users

– May be from operational activities.

• (Solution definition) Scalability is the ability to increase or decrease the resources available to your application by either changing the number of servers or disks or changing the size of the servers or disks.

Page 7: Architecting for the cloud scability-availability

Why scale?

• Are more users always a good thing?

– This is a cost/benefit question.

– More users have benefits: presumably more people receive service and the organization earns more revenue.

– More users have a cost – hardware, software, and personnel.

• Do costs scale linearly with users?

– For Netflix, the answer is yes.

– For LinkedIn, the answer is no.

Page 8: Architecting for the cloud scability-availability

The different aspects of scalability

• Adding users

– Large numbers of new users may require new computation facilities

• Adding data

– Large amounts of new data require more computation and careful attention to the distribution of this data

• Adding computation

– Computation is embedded in virtual machines

– Elasticity means adding new virtual machines

• Scaling should not impact existing activities

• May need to scale by adding computation capacity (CPU) or by adding I/O capacity

Page 9: Architecting for the cloud scability-availability

Scaling Up vs Scaling Out

• Scaling up means adding more capacity to existing hardware

– More memory

– More disk

– Faster CPU or more cores

• Scaling out means adding additional hardware

– More systems

Page 10: Architecting for the cloud scability-availability

Costs in scaling out

• Each virtual machine has a cost, charged per hour

• Licensing costs

– Many software packages charge licenses per CPU or per (virtual) computer.

– Every new instance that uses one of these packages incurs licensing costs

• Personnel costs

– In small to medium size organizations, one sysadmin can administer ~30 machines.

– In large, highly automated organizations, one sysadmin can administer thousands of machines.

– A movement called “DevOps” has as one goal the reduction of personnel costs in operations (more on this later).

Page 11: Architecting for the cloud scability-availability

How much lead time for growth of number of users?

• Some things are predictable

– Seasonal variation (Christmas, tax season)

– Daily variation (working hours or non-working hours in various time zones, holidays)

– Promotions or special offers

– Sporting events

• Other things are not predictable

– Being “SlashDotted”

– News items

– Rapid growth in popularity of a company

– Disaster

Page 12: Architecting for the cloud scability-availability

Managing growth in number of users

• A lead time allows planning

– Restructure database

– Add or restructure software

• When no lead time is available, elasticity of the cloud is the main mechanism.

Page 13: Architecting for the cloud scability-availability

Outline

• Introduction to scalability

• CPU scaling

– Load balancers

– Rule Based Scaling

– Scaling Patterns

• I/O scaling

Page 14: Architecting for the cloud scability-availability

Why have a load balancer?

• Suppose there are too many users for a single instance of a service

• The cloud allows us to create another instance of that service (elasticity)

• We would like to have half the users use one instance and half use the other

• Two options:

1. Couple instances and users (half and half). This is accomplished by having users access an instance of a service directly by IP address.

2. Use an intermediary (load balancer) to distribute half of the requests to one instance and the other half to the other.

Option 2 is preferable for a variety of reasons which we will see.

Page 15: Architecting for the cloud scability-availability

Load Balancing

• Physically a load balancer is a box that looks like it belongs in a computer network.

Page 16: Architecting for the cloud scability-availability

Load Balancer

Logically, a load balancer takes requests from clients and distributes them to copies of an application executing on multiple different servers

Servers

Clients

Load Balancer

Page 17: Architecting for the cloud scability-availability

Message sequence – client makes a request

Servers

Clients

Load Balancer

Page 18: Architecting for the cloud scability-availability

Message sequence – request arrives at load balancer

Servers

Clients

Load Balancer

Page 19: Architecting for the cloud scability-availability

Message sequence – request is sent to one server

Servers

Clients

Load Balancer

Page 20: Architecting for the cloud scability-availability

Message sequence – reply goes directly back to client

Servers

Clients

Load Balancer

Page 21: Architecting for the cloud scability-availability

Suppose Load Balancer Becomes Overloaded – Load Balance the Load Balancers

Page 22: Architecting for the cloud scability-availability

Hierarchy of Load Balancers

• Server always sends message back to client.

• Load balancers use a variety of algorithms to choose the instance for a message

– Round robin. Rotate requests evenly

– Weighted round robin. Rotate requests according to some weighting.

– Hashing. Hash the IP address of the source to determine the instance. This means that a request from a particular client is always sent to the same instance as long as that instance is still in service.

• Note that these algorithms do not require knowledge of an instance’s load. We will cover that situation in a little bit.
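The three selection algorithms on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not how any particular load balancer implements them; the function names and the instance names used in the usage note are hypothetical.

```python
import hashlib
from itertools import cycle


def round_robin(instances):
    """Return an iterator that hands out instances in strict rotation."""
    return cycle(instances)


def weighted_round_robin(weights):
    """Rotate through instances, repeating each one `weight` times per cycle.

    `weights` maps an instance name to a positive integer weight.
    """
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return cycle(expanded)


def ip_hash(client_ip, instances):
    """Map a client IP to an instance.

    The mapping is stable only while the list of instances is unchanged.
    """
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]
```

For example, `next(round_robin(["a", "b"]))` yields `"a"` first, and `ip_hash("10.0.0.1", ["a", "b"])` returns the same instance on every call. None of the three functions looks at instance load, which matches the note on the slide.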

Page 23: Architecting for the cloud scability-availability

Outline

• Introduction to scalability

• CPU scaling

– Load balancers

– Rule based scaling

– Scaling Patterns

• I/O scaling

Page 24: Architecting for the cloud scability-availability

Rule Based Scaling

Page 25: Architecting for the cloud scability-availability

Server

• A server is a virtual machine without any software

• A virtual machine can be allocated with varying amounts of memory, CPU, disk

• Each variant has different cost, typically per hour

Page 26: Architecting for the cloud scability-availability

Machine Image

• A machine image is a copy of the contents of the memory of a computer.

• A machine image may be created from any contents of a computer. Some options:

– Bare metal

– With OS

– With LAMP stack (Linux, Apache HTTP Server, MySQL, PHP or Python)

• If licensed software is contained in the machine image, then a license fee is paid when it is loaded

Page 27: Architecting for the cloud scability-availability

Executable Virtual Machine

• An executable virtual machine is created by loading a machine image into a server.

• Executable virtual machine can then be

– Booted

– Paused

– Shut down

Machine Image Server

Page 28: Architecting for the cloud scability-availability

Adding/Removing Resources

• Example shows two servers with one to be removed.

• Could be N servers with one to be added or removed

• Creating a new instance takes some time

• Removing an instance also takes time – it must satisfy existing requests and be detached from existing connections.

Page 29: Architecting for the cloud scability-availability

Autoscaling group

• An autoscaling group is a collection of instances that have been defined to be scaled together.

• Typically these represent instances of the same application.

Page 30: Architecting for the cloud scability-availability

Creating an autoscaling group

• An autoscaling group needs to know

– Machine instance id

– VM type

– Scaling policy

Page 31: Architecting for the cloud scability-availability

Scaling Policy

• Specify minimum, maximum, and desired number of instances

• Can specify scaling based on time of day

– E.g. scale up during 9:00-5:00 and down other times

• Can scale based on average CPU usage

– E.g. average CPU utilization <40% means delete instance

– Average CPU utilization >60% means add instance.

– Values come from monitor.
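A rule of this kind can be sketched as a small decision function. The 40% and 60% thresholds come from the slide; the class name, field names, and the one-instance-at-a-time step are assumptions made for illustration and do not reflect any specific provider's API.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical policy object; field names are illustrative."""
    min_instances: int
    max_instances: int
    scale_in_below: float = 40.0   # average CPU %, from the slide's example
    scale_out_above: float = 60.0  # average CPU %, from the slide's example


def desired_count(policy, current, avg_cpu):
    """Return the instance count the rule asks for, clamped to [min, max]."""
    if avg_cpu > policy.scale_out_above:
        target = current + 1   # add an instance
    elif avg_cpu < policy.scale_in_below:
        target = current - 1   # delete an instance
    else:
        target = current       # inside the band: do nothing
    return max(policy.min_instances, min(policy.max_instances, target))
```

The gap between the two thresholds keeps the group from adding and deleting instances in rapid alternation when utilization hovers around a single value.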

Page 32: Architecting for the cloud scability-availability

Outline

• Introduction to scalability

• CPU scaling

– Load balancers

– Rule Based Scaling

– Scaling Patterns

• I/O scaling

Page 33: Architecting for the cloud scability-availability

Scaling Patterns

• Autoscaling implements Push Pattern for messages

• Another pattern is Pull Pattern

Page 34: Architecting for the cloud scability-availability

Push Pattern

Page 35: Architecting for the cloud scability-availability

Push Pattern Description

• Client sends a request (e.g. HTTP message) to the app in the cloud.

• Request arrives at a load balancer

• Load balancer forwards request to one of the VMs in the resource pool.

• Load balancer uses scheduling strategy to decide which VM gets the request, e.g. dispatch to VM with lowest CPU utilization.

Page 36: Architecting for the cloud scability-availability

How does the load balancer know?

• The load balancer knows CPU utilization of the VMs and it knows how many requests it (the load balancer) has received, and possibly how long it took to service the requests. It does not know application specifics such as how many requests a VM can process.

• When resource pool is overloaded, new resources are allocated.

• The monitor decides (based on controller rules) when new resources are needed. It must have direct insight into the VM instances in order to do this. Hence, the monitor utilizes a monitoring service provided by the cloud for each instance.

Page 37: Architecting for the cloud scability-availability

Pull architecture pattern (aka Producer-Consumer)

Page 38: Architecting for the cloud scability-availability

Pull architecture description

• Each request from the client is application specific and typed.

• The queue keeps separate queues for each application running on the VMs.

• A VM requests the next message of a particular type (pull) and processes it.

• The monitor can now see how long a request waits in a queue or the average queue length and this is an indication of the load on the VMs that have applications that service requests of that type.
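The pull pattern can be sketched with one in-process queue per request type. This is a minimal stand-in for a real queue service, assuming a single process; the class name, method names, and request types are hypothetical.

```python
import queue


class TypedQueues:
    """One FIFO queue per request type (in-process stand-in for a queue service)."""

    def __init__(self):
        self._queues = {}

    def put(self, request_type, payload):
        """Client side: enqueue a typed request."""
        self._queues.setdefault(request_type, queue.Queue()).put(payload)

    def pull(self, request_type):
        """Worker side: return the next message of this type, or None if empty."""
        q = self._queues.get(request_type)
        if q is None:
            return None
        try:
            return q.get_nowait()
        except queue.Empty:
            return None

    def length(self, request_type):
        """Monitor side: queue length is a load signal for workers of this type."""
        q = self._queues.get(request_type)
        return q.qsize() if q is not None else 0
```

A monitor could poll `length("resize")` and add workers for that request type when the value stays high, which is the signal the last bullet describes.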

Page 39: Architecting for the cloud scability-availability

Differences

• Push is more responsive to requests. They are immediately forwarded to a service. There is a possibility that the service is overloaded.

• Pull is less responsive since it relies on servers to de-queue messages.

• In the pull architecture, a service polls for new messages even if there is nothing in its queue and this introduces overhead.

• It is easier to monitor and control workload in the pull architecture since messages are application specific and typed.

Page 40: Architecting for the cloud scability-availability

Outline

• Introduction to scalability

• CPU scaling

• I/O scaling

– Multiple sites

– Software techniques

Page 41: Architecting for the cloud scability-availability

I/O Scaling

• Scaling out assumes scaling requirement is solved with more CPUs.

• It may be that I/O is also a problem.

– You may run your application in multiple sites

– Half the clients go to one site, half to another

Page 42: Architecting for the cloud scability-availability

Questions when you have multiple sites

How do clients know which site to use?

How are databases used by the applications coordinated across sites? (We defer this question.)

Page 43: Architecting for the cloud scability-availability

Domain Name Server (DNS)

• Client sends URL to DNS

• DNS takes as input a URL and returns an IP address

• Client uses IP address to send message to load balancer for a site

[Figure: the client asks the Domain Name Server for Website.com and receives the IP address 123.45.67.89; Site 1 and Site 2 are shown.]

Page 44: Architecting for the cloud scability-availability

DNS with multiple sites

• DNS server returns IP address of both sites.

• DNS server will vary which address is listed first.

• Client will, typically, choose first entry.

[Figure: the Domain Name Server returns 123.45.67.89 and 456.77.88.99 for Website.com; Site 1 and Site 2 are shown.]
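The rotation described on this slide can be sketched as a toy resolver that returns every address for a name and changes which one is listed first on each lookup. This is not a real DNS server; the class name, host name, and addresses (taken from the documentation ranges) are assumptions for illustration.

```python
from collections import deque


class RotatingDns:
    """Toy resolver: returns all addresses, rotating the order on each lookup."""

    def __init__(self, records):
        # records maps a host name to a list of site addresses
        self._records = {name: deque(addrs) for name, addrs in records.items()}

    def resolve(self, name):
        addrs = self._records[name]
        answer = list(addrs)   # full list, current order
        addrs.rotate(-1)       # next lookup lists a different address first
        return answer


dns = RotatingDns({"website.example": ["192.0.2.10", "198.51.100.20"]})
first_choice = dns.resolve("website.example")[0]   # a typical client takes entry 0
```

Because clients typically pick the first entry, successive lookups spread clients across the two sites.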

Page 45: Architecting for the cloud scability-availability

Outline

• Introduction to scalability

• CPU scaling

• I/O scaling

– Multiple sites

– Software techniques

Page 46: Architecting for the cloud scability-availability

Recall Pull Pattern

Page 47: Architecting for the cloud scability-availability

To Scale for I/O - Make the queue manager more sophisticated

Key Value Store

Publisher – takes values from key-value store and distributes them

Clients

Page 48: Architecting for the cloud scability-availability

Summary

• Scalability is the ability to respond to increasing or decreasing workload

– Add CPU capacity through utilizing features of cloud provider

– Add I/O capacity through

• Distributing requests to multiple sites

• Have fast message passing software

Page 49: Architecting for the cloud scability-availability

QUESTIONS?

Page 50: Architecting for the cloud scability-availability

Architecting for the Cloud

Introduction to Availability

Page 51: Architecting for the cloud scability-availability

Outline

• What is availability

• Faults

• Availability patterns

Page 52: Architecting for the cloud scability-availability

Outline

• What is availability

• Faults

• Availability patterns

Page 53: Architecting for the cloud scability-availability

Cost of Downtime

• According to a recent survey the average cost of unplanned downtime is $7,900/minute*

• 91% of reporting companies have experienced an unplanned outage in the last 24 months

• The average outage lasts 118 minutes

• The average frequency of outages over a 24 month period was:

– 10.16 limited outages

– 5.88 local outages

– 2.04 total outages

* Emerson Network Power, Ponemon Institute Study 2013

Page 54: Architecting for the cloud scability-availability

Cost of Downtime II

• As the previous numbers indicate downtime can be expensive

• Experienced in August 2013

– New York Times had a 2 hour outage (stock price declined, Twitter exploded, and Wall Street Journal dropped their fees to try and capture readership)

– Google had between 1 and 5 minutes of downtime (~$500,000 direct loss and 40% reduction in overall web traffic)

– Amazon had an outage of under an hour (> $5 million)

• In addition to direct losses, indirect losses are experienced

– Loss of confidence, reputation, and good will

– Productivity losses

– Compliance penalties

– …

Page 55: Architecting for the cloud scability-availability

Availability: a Business Concern

• The availability of the business service impacts the earnings and associated value of an organization

• If the organization relies on an IT system to deliver business service then the availability of the IT system impacts the value of the organization

• In this section we are going to look at the availability of the system

– We want to keep in mind, however, that the objective is the availability of the business service

Page 56: Architecting for the cloud scability-availability

What Is Availability?

• Availability in general refers to the degree to which a system is in an operable state

• This is typically articulated as the percentage of time the system is available (or we’d like to have the system available) e.g. 99.99%

• There are many related terms e.g.

– Availability

– Fault-Tolerance

– Reliability

Page 57: Architecting for the cloud scability-availability

How is Availability Measured?

Availability is typically measured as:

Availability = MTBF / (MTBF + MTTR)

MTBF = Mean Time Between Failures

MTTR = Mean Time To Repair
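As a worked example of this measure, the ratio can be computed directly. The function name and the sample figures in the usage note are illustrative assumptions, not measurements from the slides.

```python
def availability(mtbf_hours, mttr_hours):
    """Fraction of time the system is operable: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

With a hypothetical MTBF of 999 hours and MTTR of 1 hour, `availability(999, 1)` is 0.999, i.e. three nines. Driving MTTR toward zero pushes the result toward 1 regardless of MTBF.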

Page 58: Architecting for the cloud scability-availability

9s

Availability Downtime per Year

90% (1-nine) 36.5 days/year

99% (2-nines) 3.65 days/year

99.9% (3-nines) 8.76 hours/year

99.99% (4-nines) 52 minutes/year

99.999% (5-nines) 5 minutes/year

99.9999% (6-nines) 31 seconds/year !
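The downtime figures in this table follow from multiplying the unavailable fraction by the length of a year. A small sketch, assuming a 365-day year (the function and constant names are illustrative):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def downtime_seconds_per_year(availability_percent):
    """Expected downtime per year, in seconds, for a given availability."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR
```

For instance, `downtime_seconds_per_year(99.99) / 60` is about 52.6 minutes, and `downtime_seconds_per_year(99.9999)` is about 31.5 seconds, in line with the rounded values in the table.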

Page 59: Architecting for the cloud scability-availability

Calculating System Availability I

• Each component = 99% (3.65 days a year)

• The overall system, however, has an availability that is the product of each component’s availability

– 99% × 99% ≈ 98% (7.26 days of downtime a year)

[Figure: two components, each 99% available, connected in series]

Page 60: Architecting for the cloud scability-availability

Calculating System Availability

• Each component = 99% (3.65 days a year)

• The overall system in this case, however, is based on the likelihood that both components would fail at the same time

1 – ((100% – 99%) × (100% – 99%)) = 99.99% (about 52 minutes a year!!)

[Figure: redundant elements, two components each 99% available]
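The two calculations (components that must all work, and redundant components where one is enough) can be sketched as follows. Both functions assume that component failures are independent, which is an assumption of this sketch as well as of the slide arithmetic; the function names are illustrative.

```python
from math import prod


def series_availability(components):
    """All components must be up: multiply the individual availabilities."""
    return prod(components)


def parallel_availability(components):
    """Redundant components: the system is down only if every one is down."""
    return 1 - prod(1 - a for a in components)
```

With two 99% components, `series_availability([0.99, 0.99])` is 0.9801 and `parallel_availability([0.99, 0.99])` is 0.9999. The second value is the 99.99% (4-nines) level, which the 9s table lists as roughly 52 minutes of downtime per year.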

Page 61: Architecting for the cloud scability-availability

Availability Measures

• A couple of things to keep in mind

– These measures refer to the mean not the minimum time between failures

– As the MTBF increases the impact of MTTR decreases

– As the MTTR approaches 0 the overall availability approaches 1

• Historically these measures were developed for hardware components

Page 62: Architecting for the cloud scability-availability

Availability Requirements

• MTBF can be measured for operational systems

• How do you predict the MTBF for a system that is yet to be built, however?

• Does it make sense to use the previously defined availability measure as a requirement?

• If not, how should requirements be articulated?

Page 63: Architecting for the cloud scability-availability

Actionable Requirements

• Remember that as a business the concern is that the services are available as needed

• In order to determine the likely availability of a system (or design) you must

– Understand the likelihood that various kinds of faults could occur

– Understand the impact of these faults on overall system availability

• You must therefore translate the desired business objective into a set of fault scenarios

Page 64: Architecting for the cloud scability-availability

End to End Availability

• Engineers often think about availability of some portion of the system e.g.

– Availability of the database or web server

• Organizations, however, are concerned with end to end availability

• When thinking about availability requirements you should think about the organizational perspective

– Once you’ve done this you’ll then need to map this to the engineering perspective

Page 65: Architecting for the cloud scability-availability

Requirements Vary

• We start with the desired requirements from a business perspective

• We then look at the system context to determine what faults might disrupt the desired behavior

– This is likely an iterative process

• One thing to keep in mind is that different business contexts imply different requirements

• Consider the needs of Discrete Manufacturing vs. Continuous Manufacturing

• Discrete manufacturing is when you manufacture discrete products

– e.g. an automobile assembly line

• Continuous process automation is when you manufacture things like chemicals or concrete

• How might the systems respond differently in the event of a fault?

Page 66: Architecting for the cloud scability-availability

Example Scenario

If a processor in one of the servers fails during peak load, the system shall continue to operate without dropping any of the current tasks and without any noticeable delay

Page 67: Architecting for the cloud scability-availability

Relationship to Goals

• How does this scenario relate to availability goals?

– It does not in and of itself guarantee a particular level of availability

• This, in conjunction with scenarios for other faults that could impact a service, does improve availability, however

• In order to understand how to think about the design we need to:

– Identify the activities that require availability

– Identify the related faults

– Identify the desired response if the fault occurs

Page 68: Architecting for the cloud scability-availability

Outline

• What is availability

• Faults

• Availability patterns

Page 69: Architecting for the cloud scability-availability

Fault Characteristics

• “Fail silent” vs. “fail operational”

– Fail silent: when a component fails it no longer operates

– Fail operational: a component continues to operate (although not correctly) when a fault is present

• Transient vs. deterministic

– Some faults will always occur in a consistent way

– Others may come and go intermittently

• Some will look similar to other faults e.g.

– A hung process, a processor crash, and a network outage can all look the same

Page 70: Architecting for the cloud scability-availability

What’s the matter with

this $#@!#% computer …

A System Can Fail Silently …

Let’s look at an example interaction

Client Machine

Network Server

FileSystem

Hmm … what’s the best

vegetarian restaurant in

Bogota?

Page 71: Architecting for the cloud scability-availability

Symptoms of Faults

• From an end user’s perspective many faults exhibit themselves similarly

• These faults could all look the same to an end user:

– A hung process

– A crashed processor

– A network outage

– An overloaded element

Page 72: Architecting for the cloud scability-availability

Or Fail Operational …

Client Machine

Network Server

FileSystem

Carnes de Res is the best

vegetarian restaurant???

Hmm … what’s the best

vegetarian restaurant in

Bogota?

Page 73: Architecting for the cloud scability-availability

Fault Manifestation

• These types of faults could occur in any of the elements of the system

• Depending on where they occur different mitigation strategies might be appropriate

• As a result you need to

– Analyze your system and determine what faults might occur

– Identify the desired response if they do occur

• This is called a fault model

Page 74: Architecting for the cloud scability-availability

Fault Model

• A fault model describes the system faults that could disrupt the critical functionality

• The fault model is going to depend on both the critical functionality and the specific architecture of the system

• Once the fault model is identified you’ll need to describe the desired response if the fault occurs

Page 75: Architecting for the cloud scability-availability

Cost of Availability

• We’ve established that downtime can be expensive

• It’s also the case that “uptime” can be expensive

– Implementing a mechanism to be resilient to faults can be expensive

• We want to understand the cost and benefit for proposed strategies and select the set that makes sense from a business perspective

• This means the initial requirements might change …

Page 76: Architecting for the cloud scability-availability

Example

• We want “appropriate” availability

• A study has been done for mobile carrier customers

– This study has determined that customers will tolerate 2 dropped calls per 100 calls made

– As soon as the system drops 3 calls per 100 they will start to change providers

• What does this say about the “appropriate” availability of the system?

Page 77: Architecting for the cloud scability-availability

Outline

• What is availability

• Faults

• Availability patterns

Page 78: Architecting for the cloud scability-availability

Elements of Availability

• Fault detection

– The system recognizes that a fault has occurred

• Masking faults

– The system is able to continue to operate despite the fault

• Recover from the fault

– The system is able to repair the faulty element of the system

Page 79: Architecting for the cloud scability-availability

Fault Detection

• There are standard “tactics” that we can use for fault detection

• They don’t detect the same types of faults, however

• They also have different “costs”

– This cost can be in terms of effort or overhead of one kind or another

• We need to understand something about the kinds of faults we are trying to detect before we can select the appropriate tactic

Page 80: Architecting for the cloud scability-availability

Detecting Silent Faults

• It’s much easier to detect elements that fail silently

• Essentially we monitor the “liveness” of the element where the fault could exist

• Example tactics are:

– Exceptions

– Heartbeat

– Ping/echo

Page 81: Architecting for the cloud scability-availability

Exceptions

• When an anomalous or exceptional event occurs it can be detected by exception handlers

• When the exception is “caught” an alternate path of execution is triggered

• The exception handling code can notify other portions of the system of the issue

• Doesn’t impose significant overhead on the system

Page 82: Architecting for the cloud scability-availability

Heart Beat

• A component emits a regular “heart beat”

• Another element will listen for this

• If this heart beat is not detected it is assumed that the component is no longer operational

• Does add overhead to the system

• Only an indication of the “liveness” of the component
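A listener for heart beats can be sketched as a table of last-seen times plus a timeout. This is a single-process illustration; the class name, method names, and the injectable clock are assumptions made so the example is easy to exercise.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heart beat per component and flags the silent ones."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = {}

    def beat(self, component):
        """Called whenever a heart beat from `component` is received."""
        self._last_seen[component] = self._clock()

    def presumed_failed(self):
        """Components whose last heart beat is older than the timeout."""
        now = self._clock()
        return sorted(
            name for name, seen in self._last_seen.items()
            if now - seen > self._timeout
        )
```

A missing heart beat only says that the component did not report in time. As the slide notes, it says nothing about whether the component's results are correct.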

Page 83: Architecting for the cloud scability-availability

Ping/Echo

• Similar to heart beat except a “watchdog” sends a ping and listens for a response

• If no response is heard it is assumed the component is not operational

• Requires more coupling than heart beat

• Increases network traffic

• Again it’s only an indication of the liveness of the component

Page 84: Architecting for the cloud scability-availability

Failing Operational

• If an element or system fails operational it’s more difficult to detect

• You don’t just monitor if the system responds but also need to determine if the results are “correct”

• Example tactics include:

– Exceptions

– Voting

– Check sum

Page 85: Architecting for the cloud scability-availability

Voting

• You compare the response of multiple elements performing the same operation

• If the results of one of the elements doesn’t match the others you assume it’s faulty

• Can detect erroneous output

• Adds overhead (must wait for multiple responses and compare)
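Majority voting over replica responses can be sketched as below. The sketch assumes a non-empty set of responses whose values are hashable and comparable for equality; the function name and replica names are hypothetical.

```python
from collections import Counter


def vote(responses):
    """Return (majority_result, suspect_replicas).

    `responses` maps a replica name to the result it produced.
    Raises ValueError when no result has a strict majority.
    """
    counts = Counter(responses.values())
    winner, votes = counts.most_common(1)[0]
    if votes <= len(responses) / 2:
        raise ValueError("no majority among replica responses")
    suspects = sorted(name for name, value in responses.items() if value != winner)
    return winner, suspects
```

For example, `vote({"r1": 42, "r2": 42, "r3": 41})` returns `(42, ["r3"])`. The caller has to wait for all responses before comparing them, which is the overhead the slide mentions.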

Page 86: Architecting for the cloud scability-availability

Check Sum

• A mathematical calculation that’s applied to a piece of data to determine if it’s been altered

• Does add some processing overhead to the system

• Can detect data corruption
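A check sum comparison can be sketched with Python's standard `hashlib` module. SHA-256 is one possible choice made for this example; the slides do not name a specific algorithm, and the function names are illustrative.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Hex digest computed over the data (SHA-256 chosen for this sketch)."""
    return hashlib.sha256(data).hexdigest()


def is_unaltered(data: bytes, expected: str) -> bool:
    """True when the data still matches the check sum recorded earlier."""
    return checksum(data) == expected
```

The sender records `checksum(payload)` alongside the payload, and the receiver calls `is_unaltered(payload, recorded)` before using it.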

Page 87: Architecting for the cloud scability-availability

Tolerating Faults

• In many cases you realize that faults will occur

– Particularly in large distributed systems

• You can’t tolerate outages every time one of the nodes experiences a fault

• You therefore need to hide the fact that the system has a faulty component

• This is called “fault masking”

• Again the strategies associated with masking the fault are going to be dependent on the kind of fault being masked

Page 88: Architecting for the cloud scability-availability

Strategies For Fault Masking

• Modular redundancy

• Rollback

– Restoring the system to a previously identified “safe state”

• Roll forward

– “Skipping” an operation that is causing a problem

• Retrying an operation

• Shedding load

• …

Page 89: Architecting for the cloud scability-availability

Modular Redundancy

• Redundant systems have multiple replicated elements (copies)

– Not to be confused with load balancing approaches

– The thing to realize is that the state is replicated across the copies

• There are multiple strategies for software replication

– Cold standby

– Warm standby

– Hot standby

Page 90: Architecting for the cloud scability-availability

Redundancy: Cold Standby

• There are non-operational copies available

• State is stored (e.g. in logs) but is not loaded on the copies until they are needed

• When a failure occurs the state is reconstructed and the replica is introduced

• Reduces operational overhead associated with maintaining copies

• Increases MTTR

Page 91: Architecting for the cloud scability-availability

Cold Standby

Page 92: Architecting for the cloud scability-availability

Redundancy: Warm Standby

• In this configuration you have a primary replica that is actively processing requests

• You have passive replicas that are not actively processing requests although they are online

• State is periodically loaded into the backup replicas

• As with cold standbys the processing overhead is reduced

• The MTTR is dependent on the state checkpoints (typically less than with cold standbys)

Page 93: Architecting for the cloud scability-availability

Warm Standby

Page 94: Architecting for the cloud scability-availability

Redundancy: Hot Standby

• All copies are processing requests

• All of the duplicate responses will be suppressed

• The copies need to be synchronized continuously

– Thus the processing overhead is increased as the number of replicas increases

• The MTTR is reduced to virtually zero, however, in the event that one of the replicas fails

Page 95: Architecting for the cloud scability-availability

Hot Standby

Page 96: Architecting for the cloud scability-availability

Considerations

• State management

– If there is state that is managed in the replicated elements you need to worry about synchronizing state

• State can be pushed to other elements …

– This impacts other concerns such as performance or security, however

– Caching commonly accessed data is a typical strategy for dealing with performance concerns

• Kinds of replicas

• Frequency of check pointing

Page 97: Architecting for the cloud scability-availability

State Management

Page 98: Architecting for the cloud scability-availability

Roll Back

• Roll back is when you undo a transaction

• You need to manage state appropriately

– You need to define an atomic set of actions

• This could be taking a complete snapshot of system state or just rolling back a transaction

Page 99: Architecting for the cloud scability-availability

Roll Forward

• Roll forward essentially skips a task and then applies the changes involved in the transactions

• The system will then be in the state consistent with the desired change

Page 100: Architecting for the cloud scability-availability

Retrying an Operation

• This is as simple as it sounds

• When a given operation fails you retry it

• It can be used in conjunction with a detection mechanism like exceptions
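Retrying in conjunction with exceptions can be sketched as a small helper. The helper name, its parameters, and the fixed delay between attempts are assumptions of this sketch; a production version would usually limit retries to operations that are safe to repeat.

```python
import time


def retry(operation, attempts=3, delay_seconds=0.0, retry_on=(Exception,)):
    """Call `operation` until it succeeds or `attempts` calls have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on as error:       # detection: the exception is caught
            last_error = error
            if attempt < attempts - 1:  # no sleep after the final failure
                time.sleep(delay_seconds)
    raise last_error
```

Passing a narrower tuple such as `retry_on=(ConnectionError,)` keeps the helper from retrying faults that will fail the same way every time.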

Page 101: Architecting for the cloud scability-availability

Shedding Load

• Sometimes issues occur due to an overload situation

• This can lead to:

– Timing errors

– Buffer overflows

– Memory consumption issues

• Shedding less critical load can help alleviate the problem
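One way to shed less critical load is an admission check that becomes stricter as a backlog grows. The two limits, the priority convention, and the function name below are all assumptions chosen for illustration.

```python
def admit(request_priority, queue_depth, soft_limit, hard_limit):
    """Decide whether to accept a request given the current backlog.

    request_priority: 0 means critical; larger numbers are less critical.
    Below soft_limit everything is accepted; between the limits only
    critical requests are accepted; at hard_limit everything is rejected.
    """
    if queue_depth >= hard_limit:
        return False
    if queue_depth >= soft_limit:
        return request_priority == 0
    return True
```

Rejected requests are dropped or told to come back later, which keeps the backlog from growing into the timing and memory problems listed above.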

Page 102: Architecting for the cloud scability-availability

Strategies For Fault Recovery

• Reboot

– This could be a partial (e.g. restarting an application or process) or total system reboot

• Removal of faulty component

• Restore component to a previously identified safe state

• …

Page 103: Architecting for the cloud scability-availability

Reboot

• Rebooting the system can often correct the issue

• This can also be done as a preventative measure

• It can be a complete or partial reboot

• There is such a thing as a “micro reboot” that takes milliseconds

Page 104: Architecting for the cloud scability-availability

Component Removal

• If you have a faulty component you can remove it from service

• You might try other remedies such as restarting first

Page 105: Architecting for the cloud scability-availability

Checkpointing State

• You can periodically take a snap shot of the system

• If at some point you have an issue, you can restore the system to the previously defined state

• The more frequently you take a snap shot of the state the smaller the loss but the more overhead
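Checkpoint and restore can be sketched with an in-memory deep copy of the state. A real system would persist the snapshot somewhere durable; the class and method names here are hypothetical.

```python
import copy


class CheckpointedState:
    """Keeps one snapshot of the state and can restore to it."""

    def __init__(self, initial):
        self.state = initial
        self._snapshot = copy.deepcopy(initial)

    def checkpoint(self):
        """Record the current state as the new safe state."""
        self._snapshot = copy.deepcopy(self.state)

    def restore(self):
        """Discard changes made since the last checkpoint."""
        self.state = copy.deepcopy(self._snapshot)
```

Each `checkpoint()` call copies the whole state, so calling it more often shrinks the work lost on `restore()` at the price of more copying.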

Page 106: Architecting for the cloud scability-availability

Availability in the Cloud

• From a high level achieving availability in the cloud is the same process as elsewhere

– It needs to be designed in

• That means you need to understand the faults that could occur

• You then need to apply the appropriate decisions to achieve the desired result

Page 107: Architecting for the cloud scability-availability

Fault Model

• We will give specific faults that occur later in the course

– This requires first a better understanding of the architecture of the cloud

• At this point it’s useful to understand that the cloud is made up of faulty components

– Failures happen on a regular basis

• There are mechanisms built in to handle this, but

– They aren’t always successful

– They don’t deal with application specific concerns

– Some things that might be a fault for your application aren’t considered a fault by the infrastructure

Page 108: Architecting for the cloud scability-availability

Summary

• Availability measures are not adequate for design

• You need to be able to translate availability goals into a set of actionable requirements that identify the possible faults and desired responses

• The approaches should support the desired responses in the event that a fault occurs

Page 109: Architecting for the cloud scability-availability

Questions??

