architecting for failures in the cloud - barcamp bangalore 2013


Upload: p3-infotech-solutions-pvt-ltd

Post on 17-Jan-2015

Category: Technology


DESCRIPTION

This is a talk from Barcamp Bangalore 2013 on how to architect for dealing with failures in the Cloud. P3 InfoTech Solutions Pvt. Ltd. helps organizations achieve business breakthroughs by adopting Cloud Computing through our Outsourced Product Development and Cloud Consulting service offerings. Check out our service offerings at http://www.p3infotech.in.

TRANSCRIPT

Page 1: Architecting for failures in the Cloud - Barcamp Bangalore 2013

HOW YOU COULD HAVE SURVIVED THE AWS CHRISTMAS EVE OUTAGE

ARCHITECTING FOR FAILURES IN THE CLOUD

Pavan Verma

Founder, P3 InfoTech Solutions Pvt. Ltd.

[email protected], @YingYangPavan

Page 2: Architecting for failures in the Cloud - Barcamp Bangalore 2013

How are datacenter failures relevant to me?

Page 3: Architecting for failures in the Cloud - Barcamp Bangalore 2013

Types of Failures in a Datacenter

• Electronic components – CPUs, Memory

• Mechanical components – Hard Disks, Fans

• Electrical components – Power Supplies, Air conditioners

• Networking equipment – Network cables, Routers, Switches

• Software bugs

• Power disruption

• Human errors

Page 4: Architecting for failures in the Cloud - Barcamp Bangalore 2013

Cost of Failures

• Tangible cost

  • Lost business = Business volume (revenue per unit time) × Duration of failure

  • Cost of lost data or Time to re-create lost data

• Intangible cost

  • Reputation

  • Frustration

  • Lost business opportunity
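The tangible cost of lost business can be sketched as simple arithmetic. This is a minimal illustration, assuming revenue accrues at a steady rate so that the loss grows with both the revenue rate and the length of the outage. The function name and the figures are hypothetical and not taken from the talk.

```python
def lost_business(revenue_per_hour: float, outage_hours: float) -> float:
    """Estimate revenue lost during an outage, assuming a steady revenue rate."""
    if revenue_per_hour < 0 or outage_hours < 0:
        raise ValueError("revenue rate and outage duration must be non-negative")
    return revenue_per_hour * outage_hours


# Hypothetical figures: a service earning 5,000 per hour that is down for 3.5 hours.
estimate = lost_business(revenue_per_hour=5000.0, outage_hours=3.5)
print(estimate)  # 17500.0
```

A real estimate would also need to account for peak versus off-peak traffic, which a flat hourly rate ignores.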

Page 5: Architecting for failures in the Cloud - Barcamp Bangalore 2013

Techniques to deal with failures

5

Page 6: Architecting for failures in the Cloud - Barcamp Bangalore 2013

Backup

• Backup = Copy of the data from a time before the failure

• Can restore data after the failure to a state from before the failure

• Limits the extent of data loss during a failure

• Types of Backup

  • Disk-to-tape – Offline, Slower restore, Cheaper

  • Disk-to-disk – Online, Faster restore, Costlier
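To illustrate how a backup limits data loss, the sketch below picks the most recent backup taken before a failure and reports how much data history would be lost on restore. It is only an illustration: the function names, the nightly schedule, and the timestamps are assumptions, not part of the talk.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def latest_backup_before(backups: Sequence[datetime], failure: datetime) -> Optional[datetime]:
    """Return the most recent backup taken strictly before the failure, if any."""
    candidates = [b for b in backups if b < failure]
    return max(candidates) if candidates else None


def data_loss_window(backups: Sequence[datetime], failure: datetime) -> Optional[timedelta]:
    """Time between the restorable backup and the failure; None if nothing can be restored."""
    restore_point = latest_backup_before(backups, failure)
    return None if restore_point is None else failure - restore_point


# Hypothetical schedule: nightly backups at 02:00, failure at 17:30 on the third day.
backups = [datetime(2013, 1, day, 2, 0) for day in (1, 2, 3)]
failure = datetime(2013, 1, 3, 17, 30)
print(latest_backup_before(backups, failure))  # 2013-01-03 02:00:00
print(data_loss_window(backups, failure))      # 15:30:00
```

Taking backups more often shrinks the loss window, at the price of more storage and more backup traffic.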

Page 7: Architecting for failures in the Cloud - Barcamp Bangalore 2013

Backup in AWS

• Snapshots for EBS volumes

• Database backups with RDS

• Redundant copies of S3 objects (*)


Page 8: Architecting for failures in the Cloud - Barcamp Bangalore 2013

High Availability (HA)

• Ability of the application to service requests in spite of the failure of some components

• Most prevalent notion of High Availability

  • Ability of the application to tolerate single component failures

  • Application has no single point of failure

• How is high availability achieved?

  • Redundant components

  • Switching traffic over from a failed component to a working component
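The idea of redundant components plus traffic switchover can be sketched in a few lines. This is a simplified illustration, not how any particular load balancer works: the component names are hypothetical and the health check is a stub standing in for a real probe.

```python
from typing import Callable, Optional, Sequence


def pick_healthy(components: Sequence[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first component that passes its health check, or None if all have failed."""
    for component in components:
        if is_healthy(component):
            return component
    return None


# Hypothetical component names and a stubbed health check for illustration.
failed = {"server-a"}
target = pick_healthy(["server-a", "server-b"], lambda name: name not in failed)
print(target)  # server-b
```

With only one component in the list, a single failure leaves nothing to switch to, which is what "single point of failure" means in practice.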

Page 9: Architecting for failures in the Cloud - Barcamp Bangalore 2013

High Availability (2)

• Two types of redundant components

  • Active-Active

  • Active-Passive

• Examples

  • Power Supplies

  • Servers

  • Databases
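The difference between the two redundancy types can be shown with a small routing sketch. It assumes a fixed snapshot of node health and uses hypothetical node names; real systems re-evaluate health continuously.

```python
from itertools import cycle
from typing import Iterator, List, Set


def active_active(nodes: List[str], healthy: Set[str]) -> Iterator[str]:
    """Round-robin over every healthy node: all redundant nodes serve traffic."""
    live = [n for n in nodes if n in healthy]
    if not live:
        raise RuntimeError("no healthy node available")
    return cycle(live)


def active_passive(nodes: List[str], healthy: Set[str]) -> str:
    """Send all traffic to the first healthy node; later nodes act as standbys."""
    for node in nodes:
        if node in healthy:
            return node
    raise RuntimeError("no healthy node available")


nodes = ["node-1", "node-2"]  # hypothetical names
rr = active_active(nodes, healthy={"node-1", "node-2"})
print([next(rr) for _ in range(4)])  # ['node-1', 'node-2', 'node-1', 'node-2']
print(active_passive(nodes, healthy={"node-1", "node-2"}))  # node-1
print(active_passive(nodes, healthy={"node-2"}))            # node-2
```

In the active-passive case the standby receives no traffic until the primary fails, so its capacity sits idle in normal operation; in the active-active case every node shares the load.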

Page 10: Architecting for failures in the Cloud - Barcamp Bangalore 2013

High Availability in AWS

• Availability Zones (AZ)

• Elastic Load Balancer

• Database replicas

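One way to reason about why redundancy across Availability Zones helps is the standard formula for components in parallel: if each copy is available with probability a and failures are independent, at least one of n copies is available with probability 1 - (1 - a)^n. The sketch below applies it; the independence assumption and the 99% figure are illustrative assumptions, not numbers from the talk or from AWS.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent failures."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one copy is required")
    return 1.0 - (1.0 - single) ** copies


# Hypothetical 99% availability per copy, deployed in 1, 2, and 3 copies.
for n in (1, 2, 3):
    print(n, combined_availability(0.99, n))
```

Under these assumptions, two copies of a 99% component give roughly 99.99%. Failures that hit several copies at once, such as a region-wide event, break the independence assumption and are not captured here.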

Page 11: Architecting for failures in the Cloud - Barcamp Bangalore 2013

[Figure: Reference Architecture for High Availability setup in AWS. Components shown: User, ELB, two Auto Scaling Groups of EC2 Instances (two EC2 instances each), S3, RDS Master, RDS Slave, AZ #1, AZ #2.]

Page 12: Architecting for failures in the Cloud - Barcamp Bangalore 2013

Disaster Recovery

• Ability to resume operations after a disaster

• How bad can a disaster be?

  • Entire datacenter may be destroyed or become inoperable

  • Examples: 9/11, Hurricane Sandy, Northeast blackout of 2003

  • Can affect all Availability Zones in a Region

Page 13: Architecting for failures in the Cloud - Barcamp Bangalore 2013

Disaster Recovery Solutions

• Disaster recovery solutions involve a combination of

  • Replication / Backup of data to a different geography [on-going]

  • Starting operations from the DR site [when disaster occurs]

  • Switching traffic over to the DR site

  • Syncing data and restoring operations to the primary site once it becomes operational again

• Since DR involves a different geography, data replication/backup happens over the WAN

Page 14: Architecting for failures in the Cloud - Barcamp Bangalore 2013

Recovery Point and Recovery Time

• Recovery Point (RP) = Window of time before the failure for which data is lost

• Recovery Time (RT) = Time taken to restore the application after the failure

• Lower numbers are better for both RP and RT

• Often framed as Recovery Point Objective (RPO) and Recovery Time Objective (RTO) as part of business continuity planning
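Both quantities can be computed from three timestamps: when the last restorable copy of the data was taken, when the failure happened, and when service was restored. The sketch below is a minimal illustration; the type name, function name, and timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class RecoveryMetrics(NamedTuple):
    recovery_point: timedelta  # how much data history was lost
    recovery_time: timedelta   # how long the application was down


def measure_recovery(last_good_copy: datetime, failure: datetime,
                     service_restored: datetime) -> RecoveryMetrics:
    """Derive RP and RT from the three timestamps, which must be in chronological order."""
    if not last_good_copy <= failure <= service_restored:
        raise ValueError("timestamps must be in chronological order")
    return RecoveryMetrics(failure - last_good_copy, service_restored - failure)


# Hypothetical incident: last copy at 02:00, failure at 17:30, service back at 21:00.
metrics = measure_recovery(
    datetime(2013, 1, 3, 2, 0),
    datetime(2013, 1, 3, 17, 30),
    datetime(2013, 1, 3, 21, 0),
)
print(metrics.recovery_point)  # 15:30:00
print(metrics.recovery_time)   # 3:30:00
```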

Page 15: Architecting for failures in the Cloud - Barcamp Bangalore 2013

Recovery Point and Recovery Time

• Backup = High RP and High RT

• High Availability = Zero (or near-zero) RP and RT

• Disaster Recovery = RP and RT between those of Backup and HA
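The ordering above can be used to shortlist techniques against a recovery objective. The sketch below is illustrative only: the RP and RT figures attached to each technique are made-up placeholders that preserve the ordering on the slide, and real values depend entirely on the setup.

```python
from datetime import timedelta
from typing import Dict, List, Tuple

# Illustrative (RP, RT) placeholders only; not measurements and not from the talk.
STRATEGIES: Dict[str, Tuple[timedelta, timedelta]] = {
    "backup": (timedelta(hours=24), timedelta(hours=8)),
    "disaster recovery": (timedelta(minutes=15), timedelta(hours=1)),
    "high availability": (timedelta(0), timedelta(0)),
}


def strategies_meeting(rpo: timedelta, rto: timedelta) -> List[str]:
    """Return the techniques whose RP and RT both fit within the given objectives."""
    return [name for name, (rp, rt) in STRATEGIES.items() if rp <= rpo and rt <= rto]


# Hypothetical objectives: lose at most 1 hour of data, be back within 2 hours.
print(strategies_meeting(rpo=timedelta(hours=1), rto=timedelta(hours=2)))
# ['disaster recovery', 'high availability']
```

In practice the techniques are complementary rather than alternatives: high availability does not help when an entire region is affected, and neither it nor disaster recovery replaces backups for recovering deleted or corrupted data.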

Page 16: Architecting for failures in the Cloud - Barcamp Bangalore 2013

Conclusion

• IT failures are highly relevant to business operations

• Key issues with failures

  • Unavailability of application

  • Loss of data

• Techniques to handle failures – Backups, High Availability, Disaster Recovery

• AWS provides mechanisms to deal with failures

• Recovery Point and Recovery Time

Page 17: Architecting for failures in the Cloud - Barcamp Bangalore 2013

Have a great Barcamp Bangalore 2013!