architecting for failures in the cloud - barcamp bangalore 2013
DESCRIPTION
This is a talk from Barcamp Bangalore 2013 on how to architect for dealing with failures in the Cloud. P3 InfoTech Solutions Pvt. Ltd. helps organizations achieve business breakthroughs by adopting Cloud Computing through our Outsourced Product Development and Cloud Consulting service offerings. Check out our service offerings at http://www.p3infotech.in.
TRANSCRIPT
HOW YOU COULD HAVE SURVIVED THE AWS CHRISTMAS EVE OUTAGE
ARCHITECTING FOR FAILURES IN THE CLOUD
Pavan Verma
Founder, P3 InfoTech Solutions Pvt. Ltd.
@YingYangPavan
1
How are datacenter failures relevant to me?
2
Types of Failures in a Datacenter
• Electronic components – CPUs, Memory
• Mechanical components – Hard Disks, Fans
• Electrical components – Power Supplies, Air conditioners
• Networking equipment – Network cables, Routers, Switches
• Software bugs
• Power disruption
• Human errors
3
Cost of Failures
• Tangible cost
  • Lost business = Business volume (revenue per unit time) × Duration of failure
  • Cost of lost data or Time to re-create lost data
• Intangible cost
  • Reputation
  • Frustration
  • Lost business opportunity
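The tangible cost of lost business can be worked through with a short calculation. This is a minimal sketch that assumes revenue arrives at a roughly constant rate, so the loss grows with the duration of the failure; the function name and the figures are hypothetical, not taken from the talk.

```python
def lost_business(revenue_per_hour, outage_hours):
    """Estimate revenue lost during an outage.

    Assumes a constant revenue rate, which is a simplification:
    real traffic usually varies by time of day and season.
    """
    return revenue_per_hour * outage_hours


# Hypothetical figures: 50,000 (any currency) per hour, 4-hour outage.
loss = lost_business(50_000, 4)
print(loss)  # 200000
```

The intangible costs on this slide (reputation, frustration) do not fit such a formula, which is one reason they are easy to underestimate.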
4
Techniques to deal with failures
5
Backup
• Backup = Copy of the data from a time before the failure
• Can restore data after the failure to a state from before the failure
• Limits the extent of data loss during a failure
• Types of Backup
  • Disk-to-tape – Offline, Slower restore, Cheaper
  • Disk-to-disk – Online, Faster restore, Costlier
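How a backup limits data loss can be illustrated with a toy in-memory model. The `SnapshotStore` class below is a hypothetical helper written for this illustration, not an AWS or library API; times are plain numbers (say, minutes).

```python
import copy


class SnapshotStore:
    """Keeps timestamped copies of a dataset (hypothetical helper)."""

    def __init__(self):
        self._snapshots = []  # (time, data copy), appended in time order

    def backup(self, time, data):
        # Deep copy so later writes do not alter the stored backup.
        self._snapshots.append((time, copy.deepcopy(data)))

    def restore(self, failure_time):
        """Return (time, data) of the latest backup taken before the failure."""
        candidates = [s for s in self._snapshots if s[0] < failure_time]
        if not candidates:
            raise LookupError("no backup predates the failure")
        return candidates[-1]


store = SnapshotStore()
data = {"orders": 10}
store.backup(time=0, data=data)
data["orders"] = 25
store.backup(time=60, data=data)
data["orders"] = 31  # written at t=75, never backed up

snap_time, restored = store.restore(failure_time=90)
print(snap_time, restored)  # 60 {'orders': 25}
```

The write made at t=75 is lost, but everything up to the t=60 backup survives: the backup interval bounds how much data a failure can destroy.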
6
Backup in AWS
• Snapshots for EBS volumes
• Database backups with RDS
• Redundant copies of S3 objects (*)
7
High Availability (HA)
• Ability of the application to service requests in spite of failure of some components
• Most prevalent notion of High Availability
  • Ability of the application to tolerate single component failures
  • Application has no single point of failure
• How is high availability achieved?
  • Redundant components
  • Switch over traffic from the failed component to a working component
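The switchover idea can be sketched in a few lines. This is a simplified model under stated assumptions: `Component` and `serve` are hypothetical names, and a failure is simulated by a flag rather than detected by a real health check.

```python
class Component:
    """A redundant component that either serves a request or is down."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def serve(request, components):
    """Try each redundant component in order; fail only if all are down."""
    for component in components:
        try:
            return component.handle(request)
        except ConnectionError:
            continue  # switch over to the next redundant component
    raise RuntimeError("all redundant components failed")


primary, standby = Component("primary"), Component("standby")
print(serve("req-1", [primary, standby]))  # primary served req-1
primary.healthy = False                    # simulate a single component failure
print(serve("req-2", [primary, standby]))  # standby served req-2
```

With two components the application tolerates any single failure; it goes down only when both fail at once.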
8
High Availability (2)
• Two types of redundant components
  • Active-Active
  • Active-Passive
• Examples
  • Power Supplies
  • Servers
  • Databases
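The difference between the two redundancy types shows up in how requests are routed. The sketch below is illustrative only; the function names are hypothetical and the set `healthy` stands in for whatever health checking a real system would use.

```python
def route_active_active(request_id, servers, healthy):
    """All healthy servers share the load (simple modulo round-robin)."""
    live = [s for s in servers if s in healthy]
    if not live:
        raise RuntimeError("no healthy server")
    return live[request_id % len(live)]


def route_active_passive(servers, healthy):
    """Only the first healthy server takes traffic; the rest stand by."""
    for server in servers:
        if server in healthy:
            return server
    raise RuntimeError("no healthy server")


servers = ["a", "b"]
healthy = {"a", "b"}
print([route_active_active(i, servers, healthy) for i in range(4)])
# ['a', 'b', 'a', 'b']
print([route_active_passive(servers, healthy) for _ in range(4)])
# ['a', 'a', 'a', 'a']

healthy.discard("a")  # server "a" fails
print(route_active_active(0, servers, healthy))  # b
print(route_active_passive(servers, healthy))    # b
```

In both modes server "b" takes over after the failure. The trade-off is that active-active uses the spare capacity all the time, while active-passive keeps it idle until it is needed.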
9
High Availability in AWS
• Availability Zones (AZ)
• Elastic Load Balancer
• Database replicas
10
Reference Architecture for High Availability setup in AWS
[Diagram: User, ELB, an Auto Scaling Group of EC2 Instances in each of AZ #1 and AZ #2, RDS Master, RDS Slave, S3]
11
Disaster Recovery
• Ability to resume operations after a disaster
• How bad can a disaster be?
  • Entire datacenter may be destroyed or become inoperable
  • Examples: 9/11, Hurricane Sandy, Northeast blackout of 2003
  • Affects all Availability Zones in a Region
12
Disaster Recovery Solutions
• Disaster recovery solutions involve a combination of
  • Replication / Backup of data to a different geography [On-going]
  • Start operations from the DR site [when disaster occurs]
  • Switch over traffic to the DR site
  • Sync data and restore operations to the primary site once it becomes operational again
• Since DR involves a different geography, data replication/backup happens over WAN
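The sequence of DR steps can be traced with a toy model. Everything here is a hypothetical simplification: `Site`, `replicate`, and `active_site` are illustrative names, and replication is an instant dictionary copy rather than a transfer over WAN.

```python
class Site:
    """A site holding application data (toy model)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True


def replicate(source, target):
    """Copy data from one site to a site in a different geography."""
    target.data = dict(source.data)


def active_site(primary, dr):
    """Route traffic to the primary while it is up, else to the DR site."""
    return primary if primary.up else dr


primary, dr = Site("primary"), Site("dr")
primary.data["k1"] = "v1"
replicate(primary, dr)           # on-going replication / backup
primary.up = False               # disaster strikes the primary site
site = active_site(primary, dr)  # switch traffic over to the DR site
site.data["k2"] = "v2"           # operations continue from the DR site
primary.up = True                # primary becomes operational again
replicate(dr, primary)           # sync data back before restoring operations
print(active_site(primary, dr).name, primary.data)
# primary {'k1': 'v1', 'k2': 'v2'}
```

In practice replication over WAN lags behind the primary, so any writes made after the last completed replication would be missing at the DR site.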
13
Recovery Point and Recovery Time
• Recovery Point (RP) = Length of time before the failure for which data is lost
• Recovery Time (RT) = Length of time taken to restore the application after the failure
• Low numbers are better for both RP and RT
• Often framed as RPO and RTO (Recovery Point Objective, Recovery Time Objective) as part of business continuity planning
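Both quantities reduce to simple differences between timestamps. The sketch below assumes three known moments: the last good copy of the data, the failure, and the restoration of service. The function names and the dates are hypothetical.

```python
from datetime import datetime


def recovery_point(last_good_copy, failure):
    """Window of data lost: everything written after the last good copy."""
    return failure - last_good_copy


def recovery_time(failure, restored):
    """How long the application stayed unavailable."""
    return restored - failure


# Hypothetical timeline for illustration only.
last_backup = datetime(2013, 1, 1, 12, 0)
failure = datetime(2013, 1, 1, 12, 30)
restored = datetime(2013, 1, 1, 18, 30)

print(recovery_point(last_backup, failure))  # 0:30:00
print(recovery_time(failure, restored))      # 6:00:00
```

Comparing these measured values against the RPO and RTO targets shows whether a given technique meets the business requirement.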
14
Recovery Point and Recovery Time
• Backup = High RP and High RT
• High Availability = Zero RP and Zero RT
• Disaster Recovery = RP and RT between Backup and HA
15
Conclusion
• IT failures are highly relevant to business operations
• Key issues with failures
  • Unavailability of application
  • Loss of data
• Techniques to handle failures – Backups, High Availability, Disaster Recovery
• AWS provides mechanisms to deal with failures
• Recovery Point and Recovery Time measure how much data and how much time a failure costs
16
17
Have a great Barcamp Bangalore 2013!