architecting for failures in the cloud - barcamp bangalore 2013

HOW YOU COULD HAVE SURVIVED THE AWS CHRISTMAS EVE OUTAGE

ARCHITECTING FOR FAILURES IN THE CLOUD

Pavan Verma

Founder, P3 InfoTech Solutions Pvt. Ltd.

[email protected], @YingYangPavan

1

How are datacenter failures relevant to me?

2

Types of Failures in a Datacenter

• Electronic components – CPUs, Memory

• Mechanical components – Hard Disks, Fans

• Electrical components – Power Supplies, Air conditioners

• Networking equipment – Network cables, Routers, Switches

• Software bugs

• Power disruption

• Human errors

3

Cost of Failures

• Tangible cost

  • Lost business = Business volume (revenue per unit time) × Duration of failure

  • Cost of lost data or time to re-create lost data

• Intangible cost

  • Reputation

  • Frustration

  • Lost business opportunity
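
As a worked illustration of the tangible-cost bullet, the sketch below estimates lost business as the revenue rate multiplied by the duration of the failure. The function name and the sample figures are assumptions made up for illustration; they are not numbers from the talk.

```python
def lost_business(revenue_per_hour: float, outage_hours: float) -> float:
    """Estimate revenue lost during an outage (hypothetical helper)."""
    if revenue_per_hour < 0 or outage_hours < 0:
        raise ValueError("inputs must be non-negative")
    # Longer outages and higher business volume both increase the loss.
    return revenue_per_hour * outage_hours


# Illustrative numbers only: 10,000 per hour in revenue, 6-hour outage.
print(lost_business(10_000, 6))  # 60000
```

This covers only the lost-business term; the cost of lost data and the intangible costs would have to be estimated separately.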

4

Techniques to deal with failures

5

Backup

• Backup = Copy of the data from a time before the failure

• Can restore data after the failure to a state from before the failure

• Limits the extent of data loss during a failure

• Types of Backup

  • Disk-to-tape – Offline, Slower restore, Cheaper

  • Disk-to-disk – Online, Faster restore, Costlier
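
To make "a state from before the failure" concrete, here is a minimal sketch that picks the restore point from a set of backup timestamps and shows how much data is lost. The helper name and all timestamps are hypothetical, chosen only for illustration.

```python
from bisect import bisect_left
from datetime import datetime


def pick_restore_point(backup_times, failure_time):
    """Return the latest backup taken strictly before the failure, or None."""
    ordered = sorted(backup_times)
    # Every backup at an index below idx is older than the failure.
    idx = bisect_left(ordered, failure_time)
    return ordered[idx - 1] if idx > 0 else None


# Hypothetical schedule: one backup per day at midnight.
backups = [datetime(2012, 12, 24, 0, 0), datetime(2012, 12, 23, 0, 0)]
failure = datetime(2012, 12, 24, 12, 30)

restore_point = pick_restore_point(backups, failure)
print(restore_point)            # 2012-12-24 00:00:00
print(failure - restore_point)  # 12:30:00
```

Everything written in the 12.5 hours between the last backup and the failure is lost, which is why more frequent backups limit the extent of data loss.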

6

Backup in AWS

• Snapshots for EBS volumes

• Database backups with RDS

• Redundant copies of S3 objects (*)

7

High Availability (HA)

• Ability of the application to service requests in spite of failure of some components

• Most prevalent notion of High Availability

  • Ability of the application to tolerate single component failures

  • Application has no single point of failure

• How is high availability achieved

  • Redundant components

  • Switch over traffic from failed component to working component
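
The effect of redundant components can be sketched with simple arithmetic. Assuming that components fail independently (an assumption that shared power, networking, or software bugs can break in practice), the service is down only when every redundant copy is down at once. The function name and the 99% figure below are illustrative, not values from the talk.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent failures."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one component")
    # The service is unavailable only if all copies are down simultaneously.
    return 1.0 - (1.0 - single) ** copies


# Illustrative: each component is assumed to be available 99% of the time.
for n in (1, 2, 3):
    print(n, round(combined_availability(0.99, n), 6))
```

The gain is only realised if traffic is actually switched over to the working component when one fails.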

8

High Availability (2)

• Two types of redundant components

  • Active-Active

  • Active-Passive

• Examples

  • Power Supplies

  • Servers

  • Databases
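
The difference between the two redundancy types can be sketched as two routing policies. This is a simplified illustration, not how any particular load balancer or database is implemented; the class names and component names such as "db-1" are hypothetical.

```python
from itertools import cycle


class ActivePassive:
    """All traffic goes to the primary; the standby serves only if the primary fails."""

    def __init__(self, primary, standby):
        self.primary, self.standby = primary, standby

    def route(self, healthy):
        if self.primary in healthy:
            return self.primary
        if self.standby in healthy:
            return self.standby
        raise RuntimeError("no healthy component")


class ActiveActive:
    """Traffic is spread over all members in turn; failed members are skipped."""

    def __init__(self, members):
        self.members = list(members)
        self._ring = cycle(self.members)

    def route(self, healthy):
        # Try each member at most once per request.
        for _ in range(len(self.members)):
            candidate = next(self._ring)
            if candidate in healthy:
                return candidate
        raise RuntimeError("no healthy component")


ap = ActivePassive("db-1", "db-2")
print(ap.route({"db-1", "db-2"}))  # db-1
print(ap.route({"db-2"}))          # db-2

aa = ActiveActive(["web-1", "web-2"])
print([aa.route({"web-1", "web-2"}) for _ in range(4)])
# ['web-1', 'web-2', 'web-1', 'web-2']
```

In the active-passive case the standby is idle capacity until a failure; in the active-active case all members carry load, so the survivors must be able to absorb the traffic of a failed member.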

9

High Availability in AWS

• Availability Zones (AZ)

• Elastic Load Balancer

• Database replicas

10

[Figure: Reference Architecture for High Availability setup in AWS. A User reaches an ELB in front of two Auto Scaling Groups of EC2 Instances, one in AZ #1 and one in AZ #2; the diagram also shows S3, an RDS Master, and an RDS Slave.]

11

Disaster Recovery

• Ability to resume operations after a disaster

• How bad can a disaster be?

  • Entire datacenter may be destroyed or become inoperable

  • Examples: 9/11, Hurricane Sandy, Northeast blackout of 2003

  • Affects all Availability Zones in a Region

12

Disaster Recovery Solutions

• Disaster recovery solutions involve a combination of

  • Replication / Backup of data to a different geography [On-going]

  • Start operations from the DR site [when disaster occurs]

  • Switch over traffic to DR site

  • Sync data and restore operations to primary site once it becomes operational again

• Since DR involves a different geography, data replication/backup happens over WAN
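
The sequence of DR steps listed above can be sketched as a toy simulation. It is an in-memory model under simplifying assumptions (periodic full-copy replication, instant switch-over); the class and method names are hypothetical and do not correspond to any AWS API.

```python
class Site:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True


class DrPair:
    def __init__(self, primary, dr):
        self.primary, self.dr = primary, dr
        self.active = primary  # site currently receiving traffic

    def write(self, key, value):
        if not self.active.up:
            raise RuntimeError("active site is down")
        self.active.data[key] = value

    def replicate(self):
        """On-going step: copy primary data to the DR site (over the WAN in practice)."""
        if self.primary.up and self.active is self.primary:
            self.dr.data = dict(self.primary.data)

    def fail_over(self):
        """Disaster step: primary is lost, traffic switches to the DR site."""
        self.primary.up = False
        self.active = self.dr

    def fail_back(self):
        """Recovery step: sync data back, then return traffic to the primary."""
        self.primary.up = True
        self.primary.data = dict(self.dr.data)
        self.active = self.primary


pair = DrPair(Site("primary"), Site("dr"))
pair.write("order-1", "paid")
pair.replicate()
pair.write("order-2", "paid")   # written after the last replication
pair.fail_over()
print(pair.active.name, sorted(pair.active.data))  # dr ['order-1']
pair.write("order-3", "paid")
pair.fail_back()
print(pair.active.name, sorted(pair.active.data))  # primary ['order-1', 'order-3']
```

Note that "order-2" is lost because it was written after the last replication; how often data is replicated over the WAN determines how much is lost in a disaster.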

13

Recovery Point and Recovery Time

• Recovery Point (RP) = Span of time before the failure for which data is lost (time since the last recoverable copy)

• Recovery Time (RT) = Time taken to restore the application after the failure

• Low numbers are better for both RP and RT

• Often framed as Recovery Point Objective (RPO) and Recovery Time Objective (RTO) as part of business continuity planning
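
A short sketch of how RP and RT could be computed from three timestamps and compared against RPO and RTO targets. The function names, the timestamps, and the target values are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def recovery_point(last_good_copy: datetime, failure: datetime) -> timedelta:
    """Span of time before the failure for which data is lost."""
    return failure - last_good_copy


def recovery_time(failure: datetime, restored: datetime) -> timedelta:
    """Time taken to get the application running again."""
    return restored - failure


def meets_objectives(rp, rt, rpo, rto) -> bool:
    """True when both the data-loss and downtime targets are met."""
    return rp <= rpo and rt <= rto


# Hypothetical incident timeline.
last_copy = datetime(2013, 1, 10, 2, 0)
failure = datetime(2013, 1, 10, 9, 30)
restored = datetime(2013, 1, 10, 11, 0)

rp = recovery_point(last_copy, failure)
rt = recovery_time(failure, restored)
print(rp, rt)  # 7:30:00 1:30:00

# Hypothetical targets: lose at most 24 hours of data, be back within 1 hour.
print(meets_objectives(rp, rt, rpo=timedelta(hours=24), rto=timedelta(hours=1)))  # False
```

Here the data-loss target is met but the downtime target is not, so the two measures have to be checked separately.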

14

Recovery Point and Recovery Time (2)

• Backup = High RP and High RT

• High Availability = Zero or near-zero RP and RT

• Disaster Recovery = RP and RT between Backup and HA

15

Conclusion

• IT failures directly affect business operations

• Key issues with failures

  • Unavailability of application

  • Loss of data

• Techniques to handle failures – Backups, High Availability, Disaster Recovery

• AWS provides mechanisms to deal with failures

• Recovery Point and Recovery Time measure how much data and how much time a failure costs

16

17

Have a great Barcamp Bangalore 2013!
