HOW YOU COULD HAVE SURVIVED THE AWS CHRISTMAS EVE OUTAGE
ARCHITECTING FOR FAILURES IN THE CLOUD
Pavan Verma
Founder, P3 InfoTech Solutions Pvt. Ltd.
@YingYangPavan
How are datacenter failures
relevant to me?
Types of Failures in a Datacenter
• Electronic components – CPUs, Memory
• Mechanical components – Hard Disks, Fans
• Electrical components – Power Supplies, Air
conditioners
• Networking equipment – Network cables,
Routers, Switches
• Software bugs
• Power disruption
• Human errors
Cost of Failures
• Tangible cost
• Lost business = Business volume (revenue per unit time) × Duration of failure
• Cost of lost data or Time to re-create lost data
• Intangible cost
• Reputation
• Frustration
• Lost business opportunity
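As a worked example of the tangible cost, lost business can be estimated as the revenue earned per unit of time multiplied by how long the failure lasts. The sketch below is only an illustration: the helper name `lost_business` and the figures ($5,000 per hour, 6-hour outage) are assumptions, not numbers from any real outage.

```python
def lost_business(revenue_per_hour: float, outage_hours: float) -> float:
    """Estimate revenue lost during an outage (hypothetical helper)."""
    if revenue_per_hour < 0 or outage_hours < 0:
        raise ValueError("inputs must be non-negative")
    # Revenue rate multiplied by the duration of the failure.
    return revenue_per_hour * outage_hours


# Assumed figures, for illustration only.
estimate = lost_business(revenue_per_hour=5000.0, outage_hours=6.0)
print(f"Estimated lost business: ${estimate:,.2f}")
```

The intangible costs (reputation, frustration, lost opportunity) are not captured by this number and usually have to be judged separately.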
Techniques to deal with failures
Backup
• Backup = Copy of the data from a time
before the failure
• Can restore data after the failure to a state
from before the failure
• Limits the extent of data loss during a
failure
• Types of Backup
• Disk-to-tape – Offline, Slower restore, Cheaper
• Disk-to-disk – Online, Faster restore, Costlier
Backup in AWS
• Snapshots for EBS volumes
• Database backups with RDS
• Redundant copies of S3 objects (*)
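To make the first item concrete, here is a minimal sketch of requesting an EBS snapshot from Python. It assumes the caller passes in a client that behaves like `boto3.client("ec2")`; the function name `snapshot_volume` and the description text are illustrative choices, and scheduling, retention and error handling are left out.

```python
from datetime import datetime, timezone


def snapshot_volume(ec2_client, volume_id: str) -> str:
    """Request an EBS snapshot and return its ID.

    `ec2_client` is assumed to behave like boto3.client("ec2"),
    i.e. to expose create_snapshot(VolumeId=..., Description=...).
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    response = ec2_client.create_snapshot(
        VolumeId=volume_id,
        Description=f"Backup of {volume_id} taken {stamp}",
    )
    # The response is a dict that includes the new snapshot's ID.
    return response["SnapshotId"]
```

With boto3 installed and AWS credentials configured, this could be called as `snapshot_volume(boto3.client("ec2"), volume_id)`, where `volume_id` is the ID of one of your own EBS volumes.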
High Availability (HA)
• Ability of the application to service requests
in spite of failure of some components
• Most prevalent notion of High Availability
• Ability of the application to tolerate single
component failures
• Application has no single point of failure
• How is high availability achieved?
• Redundant components
• Switch over traffic from the failed component to a
working component
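A quick calculation shows why redundant components help. The sketch below rests on two simplifying assumptions that rarely hold exactly in practice: each copy fails independently of the others, and switchover to a working copy is instantaneous and always succeeds. The function name and the 99% figure are illustrative.

```python
def combined_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components.

    Assumes independent failures and perfect switchover: the group is
    down only when every copy is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    probability_all_down = (1.0 - component_availability) ** copies
    return 1.0 - probability_all_down


# Illustrative figure: each component is up 99% of the time.
for copies in (1, 2, 3):
    print(copies, f"{combined_availability(0.99, copies):.6f}")
```

Under these assumptions, going from one copy to two moves the group from 99% to about 99.99% availability. Correlated failures, such as a shared power feed, reduce the benefit.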
High Availability (2)
• Two types of redundant components
• Active-Active
• Active-Passive
• Examples
• Power Supplies
• Servers
• Databases
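The difference between the two types can be sketched as two routing policies. This is a toy model, not how any particular load balancer is implemented: the class names, the server names and the idea of passing in a set of currently healthy nodes are all assumptions made for illustration.

```python
from itertools import count


class ActivePassive:
    """All traffic goes to the first healthy node; later nodes are standbys."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def route(self, healthy):
        for node in self.nodes:
            if node in healthy:
                return node
        raise RuntimeError("no healthy node available")


class ActiveActive:
    """Traffic is spread round-robin over every healthy node."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._counter = count()

    def route(self, healthy):
        live = [node for node in self.nodes if node in healthy]
        if not live:
            raise RuntimeError("no healthy node available")
        return live[next(self._counter) % len(live)]


nodes = ["server-a", "server-b"]  # hypothetical names
all_up = {"server-a", "server-b"}

passive = ActivePassive(nodes)
active = ActiveActive(nodes)
print([passive.route(all_up) for _ in range(4)])  # standby stays idle
print([active.route(all_up) for _ in range(4)])   # both nodes share load
print(passive.route({"server-b"}))                # standby takes over
```

Active-Active uses all capacity during normal operation, while Active-Passive keeps the standby idle until the active component fails.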
High Availability in AWS
• Availability Zones (AZ)
• Elastic Load Balancer
• Database replicas
[Figure: Reference Architecture for High Availability setup in AWS. Labels shown: User; ELB; an Auto Scaling Group of EC2 Instances in each of AZ #1 and AZ #2; RDS Master; RDS Slave; S3.]
Disaster Recovery
• Ability to resume operations after a disaster
• How bad can a disaster be?
• Entire datacenter may be destroyed or become
inoperable
• Examples: 9/11, Hurricane Sandy, Northeast
blackout of 2003
• A disaster of this scale can affect all Availability Zones in a Region
Disaster Recovery Solutions
• Disaster recovery solutions involve a combination of
• Replication / Backup of data to a different geography [On-going]
• Starting operations from the DR site [when disaster occurs]
• Switching over traffic to the DR site
• Syncing data and restoring operations to the primary site once it becomes operational again
• Since DR involves a different Geo, data replication/backup happens over WAN
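Because the data travels over a WAN, link bandwidth puts a floor on how quickly a backup or an initial replica can reach the DR site. The sketch below estimates that time; the function name, the 80% link-efficiency default and the example figures (500 GB over a 100 Mbps link) are assumptions chosen for illustration, and decimal units (1 GB = 8,000 megabits) are used.

```python
def transfer_hours(data_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Rough hours needed to copy `data_gb` gigabytes over a WAN link.

    `efficiency` is an assumed fraction of the nominal bandwidth that is
    actually usable after protocol overhead and contention.
    """
    if data_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid input")
    megabits = data_gb * 8 * 1000          # GB -> megabits (decimal units)
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


# Illustrative figures only.
hours = transfer_hours(data_gb=500, link_mbps=100)
print(f"About {hours:.1f} hours to copy 500 GB over 100 Mbps")
```

An estimate like this helps decide whether continuous replication, periodic backup shipping, or a mix of both fits the data volume and the available link.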
Recovery Point and Recovery Time
• Recovery Point (RP) = Window of time before the
failure for which data is lost
• Recovery Time (RT) = Time taken to restore the
application after the failure
• Low numbers are better for both RP and RT
• Often framed as RPO and RTO as part of
business continuity planning
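Both measures can be computed from three timestamps: when the last good copy of the data was taken, when the failure happened, and when service was restored. The sketch below uses made-up timestamps that do not describe any real incident; the function and variable names are illustrative.

```python
from datetime import datetime, timedelta


def recovery_point(last_good_copy: datetime, failure: datetime) -> timedelta:
    """Window of data lost: everything written after the last good copy."""
    if failure < last_good_copy:
        raise ValueError("failure cannot precede the last good copy")
    return failure - last_good_copy


def recovery_time(failure: datetime, restored: datetime) -> timedelta:
    """Downtime: from the failure until the application is serving again."""
    if restored < failure:
        raise ValueError("restore cannot precede the failure")
    return restored - failure


# Hypothetical timeline for illustration only.
last_backup = datetime(2013, 1, 1, 2, 0)
failure_at = datetime(2013, 1, 1, 12, 30)
restored_at = datetime(2013, 1, 1, 18, 0)

rp = recovery_point(last_backup, failure_at)
rt = recovery_time(failure_at, restored_at)
print("Recovery Point:", rp)
print("Recovery Time: ", rt)
```

Comparing the measured values against the agreed RPO and RTO shows whether the chosen technique meets the business continuity targets.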
Recovery Point and Recovery Time (2)
• Backup = High RP and High RT
• High Availability = Near-zero RP and near-zero RT
• Disaster Recovery = RP and RT between
Backup and HA
Conclusion
• Topic of IT failures is very relevant for business operations
• Key issues with failures
• Unavailability of application
• Loss of data
• Techniques to handle failures – Backups, High Availability, Disaster Recovery
• AWS provides mechanisms to deal with failures
• Recovery Point and Recovery Time measure how much data and how much time a failure costs
Have a great Barcamp Bangalore 2013!