system availability talk
DESCRIPTION
Talk I gave on HA, resiliency and recovery of systems
TRANSCRIPT
Michael Richardson
Twitter: @Mr_SPB
© 2011 Energized Work - www.energizedwork.com
Availability and Recoverability
So what is High Availability?
• Five 9s?
• No Single Point of Failure?
• Multiple Data Centres?
• Fault Tolerance?
• Load Balancing?
• Uptime?
The 9’s of Availability
Availability              Downtime per Year
One nine (90%)            36.5 days
Two nines (99%)           3.65 days
Three nines (99.9%)       8.76 hours
Four nines (99.99%)       52.56 minutes
Five nines (99.999%)      5.26 minutes
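The downtime figures are simple arithmetic: the unavailable fraction multiplied by the length of a year. A minimal sketch, assuming a 365-day year; the function name is illustrative and not from the talk:

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8760 hours


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours of downtime per year allowed at a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR


# Print the allowance for each row of the table, in hours and in minutes.
for percent in (90, 99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(percent)
    print(f"{percent}% -> {hours:.4f} hours ({hours * 60:.2f} minutes)")
```

Each extra nine cuts the allowed downtime by a factor of ten.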
Problem with the 9’s
• What do they mean?
• Guaranteed or just an SLA?
• Multiplicity: the availabilities of chained components multiply
(99.9% * 99.9% * 99.9% ≈ 99.7%)
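The multiplicity point can be checked with a short sketch. It assumes every component in the chain must be up for the service to work and that failures are independent; `serial_availability` is an illustrative name:

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(component_availabilities)


# Three tiers (for example web, app and database), each at three nines.
three_tiers = serial_availability([0.999, 0.999, 0.999])
print(f"{three_tiers:.4%}")
```

The combined figure is lower than that of any single component, so every tier added in series costs availability.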
SLA availability numbers just aim to provide a level of confidence in a website’s service.
No Single Point of Failure (SPOF)
Two of everything?
Start with this
[Diagram: Users -> Index.html]
End with this
[Diagram: Users; Firewall 1, Firewall 2; switch 1, switch 2; WEB1, WEB2; APP1, APP2; DB1, DB2]
Problems with eliminating SPOF
• It’s expensive (££)
• Where do you draw the line?
• Are failures independent?
• Can you guarantee no SPOF?
• Increased complexity
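Why independence matters can be shown with the textbook formula for redundant components. This is a sketch under a strong assumption: failures are statistically independent and any one copy is enough to serve traffic. The function name is illustrative:

```python
def redundant_availability(single_availability: float, copies: int) -> float:
    """Availability of `copies` identical components when any one suffices.

    Assumes independent failures; shared power, network or software bugs
    break that assumption and make the real figure lower.
    """
    return 1 - (1 - single_availability) ** copies


# Two hosts at 99% each look like four nines on paper.
print(f"{redundant_availability(0.99, 2):.4%}")
```

If both copies can fail for the same reason, the paper figure overstates what "two of everything" actually buys.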
Problem: Data Centres Fail
Solution: Get a 2nd Data Centre
Hot/Hot Multisite
• Full range of services available in multiple locations.
• Easy to automate failover of sites
• Data consistency is hard.
• Capacity planning concerns
Hot/Warm Multisite
• Simpler than Hot/Hot
• Read/write ratio dependent
• Replicate data synchronously or asynchronously?
Hot/Cold Multisite
• Easy to set up
• Will it work?
• Can it be trusted?
• Cold site rapidly becomes stale
• Is it actually valuable?
DR Multisite
• Fingers crossed you never need it.
• How can/should you test it?
• Cloud?
Problems with Multiple sites
• ££ - it’s expensive
• Managing more systems
• Managing consistency of data
• Managing capacity
• Is it still fail-proof?
• Unless you test it, it’s just a plan
We now have a Complex System
• More redundancy and automation leads to more complexity.
• More complexity often adds more points of failure.
Complex Systems
“How Complex Systems Fail”
Author: Dr. Richard Cook
• Catastrophe is always just around the corner.
• Human operators have dual roles.
• Change introduces new forms of failure.
Failure and Recovery
Questions for the Customer
• What is the cost of downtime?
• What are the RTO and RPO?
RTO = Recovery Time Objective
RPO = Recovery Point Objective
Aggressive RTO and RPO targets are expensive and have a performance impact.
RTO / RPO example
Problem
• Simple DB
• Business can tolerate up to 15 minutes of downtime
• 10-minute window of data loss
RTO / RPO example
Possible solution
1. Continuously replicate data to a 2nd host
2. Continue with nightly backups and also copy DB transaction logs from the primary host to another system.
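One way to compare candidate solutions against the stated targets is to write down each plan's worst-case data loss and recovery time and check them against the RPO and RTO. A minimal sketch: the 15-minute RTO and 10-minute RPO come from the problem above, while the class, the function and every per-plan number are hypothetical values chosen only for illustration:

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    name: str
    worst_case_data_loss_min: float  # compared against the RPO
    worst_case_recovery_min: float   # compared against the RTO


def meets_objectives(plan: RecoveryPlan, rto_min: float, rpo_min: float) -> bool:
    """True when the plan recovers fast enough and loses little enough data."""
    return (plan.worst_case_recovery_min <= rto_min
            and plan.worst_case_data_loss_min <= rpo_min)


RTO_MIN = 15  # from the problem: up to 15 minutes of downtime
RPO_MIN = 10  # from the problem: 10-minute window of data loss

# Hypothetical estimates; real values must be measured by testing a restore.
plans = [
    RecoveryPlan("nightly backups only", 24 * 60, 120),
    RecoveryPlan("nightly backups + logs copied every 5 min", 5, 12),
    RecoveryPlan("continuous replication to 2nd host", 1, 5),
]

for plan in plans:
    verdict = "meets" if meets_objectives(plan, RTO_MIN, RPO_MIN) else "misses"
    print(f"{plan.name}: {verdict} RTO/RPO")
```

The estimates only become trustworthy once a recovery has actually been rehearsed and timed.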
So what’s more important?
Increasing Availability
Or
Reducing Recovery Time
MTBF
Or
MTTR
What about MTTD?
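The trade-off can be put in numbers with the common steady-state formula, availability = MTBF / (MTBF + MTTR). A minimal sketch; counting detection time (MTTD) as extra downtime on top of MTTR is an assumption, since some definitions already include it in MTTR, and the figures are illustrative:

```python
def steady_state_availability(mtbf_hours: float,
                              mttr_hours: float,
                              mttd_hours: float = 0.0) -> float:
    """Fraction of time the system is up.

    Assumption: each failure costs detection time plus repair time.
    """
    downtime_per_failure = mttd_hours + mttr_hours
    return mtbf_hours / (mtbf_hours + downtime_per_failure)


baseline = steady_state_availability(1000, 1)
doubled_mtbf = steady_state_availability(2000, 1)
halved_mttr = steady_state_availability(1000, 0.5)
slow_detection = steady_state_availability(1000, 0.5, mttd_hours=0.5)

print(f"baseline:        {baseline:.6f}")
print(f"double the MTBF: {doubled_mtbf:.6f}")
print(f"halve the MTTR:  {halved_mttr:.6f}")
print(f"halve MTTR, but 30 min to detect: {slow_detection:.6f}")
```

Under this model, doubling the time between failures and halving the recovery time give the same availability, and slow detection can cancel out a faster repair.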
Answer?
It Depends
Failure is inevitable
Ask anyone
Thank you
The End
Twitter - @Mr_SPB