Disaster Recovery - On-Premise & Cloud
CLOUDCONF 2014
Database: backup and disaster recovery in Cloud
Walter Dal Mut
@walterdalmut – www.corley.it – walterdalmut.com
DISASTER RECOVERY
Disaster recovery (DR) is about preparing for and recovering from a disaster.
DISASTER
Any event that has a negative impact on your business continuity or finances could be termed a disaster.
WHY ARE WE TALKING ABOUT DR?
• Over 70% of businesses involved in a major fire either do not reopen or subsequently fail within 3 years of the fire. (Source: continuitycentral.com)
• 80% of businesses affected by a major incident either never re-open or close within 18 months. (Source: Axa)
• 70% of companies go out of business after a major data loss. (Source: continuitycentral.com)
• 80% of businesses that suffer a computer disaster and have no disaster recovery plan go out of business. (Source: "A Bridge Too Far", IBM Business Recovery Service & Cranfield, 1993)
• A recent study from Gartner, Inc. found that 90% of companies that experience data loss go out of business within two years.
• 80% of companies without well-conceived data protection and recovery strategies go out of business within 2 years of a major disaster. (Source: US National Archives and Records Administration)
RTO – RECOVERY TIME OBJECTIVE
The time within which, and the service level to which, a business process must be restored after a disaster.
RTO – WHAT DOES IT IMPLY?
• We have a system that records 1000 transactions per hour
• We take a snapshot of the system at 03:00 am (every day)
• At 10:00 am a disaster event occurs
• We spend 1 hour getting the backup ready (off-site retrieval, preparation, etc.)
• The recovery operation takes 4 hours to get back to operation (at minimum service level)
• 5 hours (1 hour of preparation + 4 hours of recovery) is the RECOVERY TIME OBJECTIVE
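The arithmetic of this example can be sketched in a few lines of Python. The variable names are illustrative and the calendar date is arbitrary; only the times come from the example above.

```python
from datetime import datetime, timedelta

# Times taken from the example; the date itself is an arbitrary placeholder.
disaster_at = datetime(2014, 1, 1, 10, 0)  # disaster occurs at 10:00 am
preparation = timedelta(hours=1)           # getting the off-site backup ready
recovery = timedelta(hours=4)              # restore up to minimum service level

# The RTO must cover everything between the disaster and the restored service.
rto = preparation + recovery
back_online_at = disaster_at + rto

print(f"RTO: {rto}")
print(f"Service restored at: {back_online_at:%H:%M}")
```

Under these numbers the service is back at 15:00, five hours after the disaster.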
RPO – RECOVERY POINT OBJECTIVE
This describes the acceptable amount of data loss measured in time.
RPO – WHAT DOES IT IMPLY?
• We have a system that records 1000 transactions per hour
• We take a snapshot of the system at 03:00 am (every day)
• At 10:00 am a disaster event occurs
• In this case we lose around 7000 transactions
  • 1000 transactions 03:00-04:00
  • 1000 transactions 04:00-05:00
  • …
• But with one snapshot per day we are accepting up to 24 hours of data loss, i.e. up to 24000 transactions (RPO)
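A minimal sketch of the same calculation, assuming a constant transaction rate. The function name `lost_transactions` is hypothetical and the calendar date is arbitrary.

```python
from datetime import datetime, timedelta

TRANSACTIONS_PER_HOUR = 1000  # rate from the example above


def lost_transactions(last_snapshot, disaster):
    """Transactions recorded after the last snapshot and lost in the disaster."""
    hours = (disaster - last_snapshot) / timedelta(hours=1)
    return int(hours * TRANSACTIONS_PER_HOUR)


snapshot_at = datetime(2014, 1, 1, 3, 0)   # daily snapshot at 03:00 am
disaster_at = datetime(2014, 1, 1, 10, 0)  # disaster at 10:00 am

# Loss in this specific incident: 7 hours of transactions.
actual_loss = lost_transactions(snapshot_at, disaster_at)

# Worst case: the disaster hits just before the next daily snapshot.
worst_case_loss = lost_transactions(snapshot_at, snapshot_at + timedelta(hours=24))

print(actual_loss, worst_case_loss)
```

The RPO is defined by the worst case (the snapshot interval), not by the loss observed in one particular incident.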
DISASTER RECOVERY STRATEGIES
Local tape backup
Online backup
Pilot-Light
Warm Stand-by
And More…
[Figure: the strategies above on a scale of increasing cost ($ to $$$$$$) and decreasing recovery time (days to seconds)]
ON-PREMISE & CLOUD
Use cloud resources in order to provide business continuity
Disaster Recovery & Cloud?
• On Demand: we can allocate and release new resources whenever we need
• Cost Effective: pay-as-you-go model, we pay only for the resources that we are actually using
• Scalable: we can scale freely and adapt our strategy thanks to autoscaling and other mechanisms
• Secure: control doesn’t mean security
FOCUS ON DATABASES
We will focus on MySQL, but you can apply the same approach to your own infrastructure.
BACKUP & RESTORE
Take a snapshot of a system and restore it when you need it
[Diagram: Application, Backup, Restore]
RTO & RPO?
Things to remember…
RTO
What resources can impact my RTO?
RESOURCES ALLOCATION
How fast can we set up all resources, e.g. instances, network, etc.?
DB RESTORE
How long can the database restore take?
RPO
What resources can impact my RPO?
DB SNAPSHOT
How long do we need to recover all data from our snapshot?
Backup & Restore – RPO & RTO
Configuration
• Resources Allocation: ???
• Restore Operation: ???
• DNS: TTL 30 minutes
• Snapshot: every 24 hours
Effects
• RTO – Recovery Time Objective: 30 minutes + ??? + ???
• RPO – Recovery Point Objective: 24 hours
• Downtime per month
  • 99.8% availability: 86.23 minutes
  • 99.95% availability: 21.56 minutes
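The downtime-per-month figures follow from the availability target. A minimal sketch, assuming a 30-day month (an assumption of this example, not stated on the slide):

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def allowed_downtime_minutes(availability_percent):
    """Downtime budget per month for a given availability target."""
    return MINUTES_PER_MONTH * (100 - availability_percent) / 100


for availability in (99.8, 99.95):
    minutes = allowed_downtime_minutes(availability)
    print(f"{availability}% availability: {minutes:.2f} minutes per month")
```

With a 30-day month this gives about 86.4 and 21.6 minutes, slightly above the slide's 86.23 and 21.56, which presumably assume a marginally shorter month. Either way, a single recovery whose RTO exceeds the budget uses up the whole month's allowance.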
COSTS ON S3 (AWS)
• $0.085 / GB, durability 99.999999999%
• $0.068 / GB, durability 99.99%
• $0.010 / GB, durability 99.999999999% [Glacier]
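To get a feeling for these prices, here is a rough estimate for storing one backup. It assumes the prices are per GB per month; the tier names and the 200 GB backup size are hypothetical labels chosen for this sketch.

```python
# Prices per GB from the slide, assumed to be per GB per month.
# The tier names are labels chosen for this sketch.
PRICE_PER_GB = {
    "standard": 0.085,            # durability 99.999999999%
    "reduced_redundancy": 0.068,  # durability 99.99%
    "glacier": 0.010,             # durability 99.999999999%
}


def monthly_storage_cost(size_gb, tier):
    """Approximate monthly cost of keeping size_gb of backups in a tier."""
    return size_gb * PRICE_PER_GB[tier]


backup_size_gb = 200  # hypothetical backup size
for tier in PRICE_PER_GB:
    cost = monthly_storage_cost(backup_size_gb, tier)
    print(f"{tier}: ${cost:.2f} per month")
```

The estimate ignores request and data transfer charges, and retrieval from the cheapest tier is typically slower, which counts against the RTO.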
Pilot light
We keep a small resource always running, which helps us bring up the whole system when needed
Replication
Pilot light is essentially based on database replication strategies
For MySQL, async replication is used as the base strategy
http://www.slideshare.net/corleycloud/mysql-scale-out-cloudparty-2013-milano-talent-garden
ON-PREMISE – WEB APP
READ REPLICA ON A CLOUD PROVIDER
MOVE TO CLOUD ON A DISASTER
RTO & RPO?
Things to remember…
RTO
What resources can impact my RTO?
RESOURCES ALLOCATION
Running and configuring new instances typically takes a couple of minutes
You always have to keep an eye on resources and timing.
DNS PROPAGATION
DNS takes a little while to propagate new addresses (Time To Live)
RPO
What resources can impact my RPO?
DB REPLICATION
Remember that master/slave replication is ASYNC!
That implies replication LAG, which impacts your RPO!
MONITOR YOUR INFRASTRUCTURE
Setting an RPO of about 20 minutes implies that your replication LAG must always stay under 20 minutes!
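A monitoring check for this rule could look like the sketch below. It assumes the monitoring job has already run `SHOW SLAVE STATUS` on the replica and passes one result row as a dictionary; the function name and the input shape are hypothetical. `Seconds_Behind_Master` is the lag column MySQL reports there, and it is NULL when replication is not running.

```python
RPO_SECONDS = 20 * 60  # RPO of 20 minutes, as in the slide


def lag_within_rpo(slave_status, rpo_seconds=RPO_SECONDS):
    """Return True if the replica's reported lag is below the RPO.

    slave_status is one row of SHOW SLAVE STATUS as a dict
    (hypothetical input shape for this sketch).
    """
    lag = slave_status.get("Seconds_Behind_Master")
    if lag is None:
        # NULL lag: replication is stopped or broken, so the RPO is at risk.
        return False
    return int(lag) < rpo_seconds


print(lag_within_rpo({"Seconds_Behind_Master": 45}))    # small lag
print(lag_within_rpo({"Seconds_Behind_Master": 1500}))  # 25 minutes behind
print(lag_within_rpo({"Seconds_Behind_Master": None}))  # replication stopped
```

Treating a missing lag value as a failure is a deliberate choice here: a stopped replica falls further behind with every transaction on the master.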
Pilot Light – RPO & RTO
Configuration
• Resources Allocation: 20 minutes
• DNS: TTL 30 minutes
• Replication LAG: 20 minutes
Effects
• RTO – Recovery Time Objective: 50 minutes
• RPO – Recovery Point Objective: 20 minutes
• Downtime per month
  • 99.8% availability: 86.23 minutes
  • 99.95% availability: 21.56 minutes
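The effects follow directly from the configuration. A minimal sketch, assuming the steps run one after another so that the RTO is the sum of allocation time and DNS TTL; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DrConfig:
    """Configuration values in minutes (names chosen for this sketch)."""
    resources_allocation: int
    dns_ttl: int
    replication_lag: int

    @property
    def rto(self):
        # Worst case: allocate resources, then wait a full TTL for DNS.
        return self.resources_allocation + self.dns_ttl

    @property
    def rpo(self):
        # Data not yet replicated at the time of the disaster is lost.
        return self.replication_lag


pilot_light = DrConfig(resources_allocation=20, dns_ttl=30, replication_lag=20)
print(f"RTO: {pilot_light.rto} minutes, RPO: {pilot_light.rpo} minutes")
```

Lowering the DNS TTL is the cheapest lever in this model, since it shortens the RTO without adding any running resources.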
COSTS ON AWS
• 1 m1.small at $0.06 per hour: ~$43 per month
• $0.05 per GB of EBS storage
• $0.05 per 1 million EBS I/O requests
WARM STANDBY
Extends pilot-light resource allocation and preparation
Warm StandBy – RPO & RTO
Configuration
• Resources Allocation: 5 minutes
• DNS: TTL 30 minutes
• Replication LAG: 20 minutes
Effects
• RTO – Recovery Time Objective: 35 minutes
• RPO – Recovery Point Objective: 20 minutes
• Downtime per month
  • 99.8% availability: 86.23 minutes
  • 99.95% availability: 21.56 minutes
COSTS ON AWS
• 2 m1.small at $0.06 per hour each: ~$86 per month
• $0.05 per GB of EBS storage
• $0.05 per 1 million EBS I/O requests
• ELB: $20 per month
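The always-on monthly cost of the two strategies can be compared with a short calculation. This sketch assumes a 30-day month and leaves out EBS storage and I/O, which depend on the size and load of the database; the function name is hypothetical.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month
M1_SMALL_PER_HOUR = 0.06   # price per hour from the slides
ELB_PER_MONTH = 20.0       # price per month from the slides


def monthly_compute_cost(instances, with_elb=False):
    """Monthly cost of always-on instances, excluding EBS storage and I/O."""
    cost = instances * M1_SMALL_PER_HOUR * HOURS_PER_MONTH
    if with_elb:
        cost += ELB_PER_MONTH
    return cost


pilot_light_cost = monthly_compute_cost(1)
warm_standby_cost = monthly_compute_cost(2, with_elb=True)

print(f"Pilot light:  ${pilot_light_cost:.2f} per month")
print(f"Warm standby: ${warm_standby_cost:.2f} per month")
```

In these examples warm standby costs more than twice as much per month and shortens the RTO by 15 minutes (35 instead of 50).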
PILOT LIGHT VS WARM STAND-BY
In our examples, Pilot Light actually looks much more effective than warm stand-by.
Doesn’t it?
DEPENDS ON ASSUMPTIONS
We assume that we don’t need to scale out our database and that scaling it up is enough!
Resource allocation for new read replicas? How long does it take?
THANKS FOR LISTENING