MTBF / MTTR - Energized Work TekTalk, Mar 2012


TRANSCRIPT

Page 1: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Presented by Michael Richardson, Energized Work 21 March 2012

MTBF / MTTR Availability or recoverability?

25 MACKLIN STREET LONDON WC2B 5NN +44 (0)20 7691 8933

ENERGIZED WORK

WWW.ENERGIZEDWORK.COM

Page 2: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Michael Richardson
Twitter: @mr_spb
Email: [email protected]
#ewtektalk

2 © 2012 Energized Work - www.energizedwork.com

Page 3: MTBF / MTTR - Energized Work TekTalk, Mar 2012

So what is high availability?

•  Five nines?
•  No single points of failure?
•  Multiple data centres?
•  Fault tolerance?
•  Load balancing?
•  Uptime?

Page 4: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Nines of availability

Page 5: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Nines of availability

Availability             Downtime per Year
One nine (90%)           36.5 days
Two nines (99%)          3.65 days
Three nines (99.9%)      8.76 hours
Four nines (99.99%)      52.56 minutes
Five nines (99.999%)     5.26 minutes
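
The downtime figures above follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative, not from the talk):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year, no leap day


def downtime_minutes_per_year(availability_percent):
    """Return the yearly downtime, in minutes, permitted by an availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for pct in (90, 99, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {minutes:.2f} minutes ({minutes / 60:.2f} hours) per year")
```

Each extra nine divides the allowed downtime by ten, which is why the later rows are quoted in minutes rather than days.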

Page 6: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Problem with the nines

•  What do they mean?
•  Guaranteed or just an SLA?
•  Multiplicity (99.9% * 99.9% * 99.9% = 99.7%)
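
The multiplicity point can be checked with a few lines. This is a sketch that assumes the components sit in series (a request needs all of them) and fail independently; the function name is illustrative:

```python
from math import prod


def serial_availability(availabilities):
    """Combined availability of components that must all be up at once.

    Assumes failures are independent, so the individual availabilities multiply.
    """
    return prod(availabilities)


combined = serial_availability([0.999, 0.999, 0.999])
print(f"Three components at 99.9% each: {combined:.1%} combined")
```

Every additional component in the request path lowers the combined figure, even when each component individually meets its target.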

Page 7: MTBF / MTTR - Energized Work TekTalk, Mar 2012

SLA availability numbers just aim to provide a level of confidence in a website’s service

Page 8: MTBF / MTTR - Energized Work TekTalk, Mar 2012

No single point of failure (SPOF)

Page 9: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Two of everything?

Page 10: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Start with this

(Diagram: Users, Index.html)

Page 11: MTBF / MTTR - Energized Work TekTalk, Mar 2012

End with this

(Diagram: Users; Firewall 1, Firewall 2; Switch 1, Switch 2; WEB1, WEB2; APP1, APP2; DB1, DB2)

Page 12: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Problems with eliminating SPOF

•  It’s expensive
•  Where do you draw the line?
•  Are failures independent?
•  Can you guarantee no SPOF?
•  Increased complexity
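
The independence question matters because the usual case for "two of everything" rests on it. A rough sketch, assuming identical redundant copies that fail independently, plus one hypothetical shared dependency (the availability figures are made up for illustration):

```python
def redundant_availability(single, copies):
    """Availability of `copies` identical components where any one is enough.

    Assumes failures are independent: the group is down only when every copy is down.
    """
    return 1 - (1 - single) ** copies


pair = redundant_availability(0.99, 2)

# Hypothetical shared dependency (for example, a common power feed) that both
# copies rely on. It sits in series with the pair, so it caps the result.
shared_dependency = 0.999
pair_with_shared = shared_dependency * pair

print(f"Independent pair:            {pair:.4%}")
print(f"Pair behind a shared 99.9%:  {pair_with_shared:.4%}")
```

Under these assumptions the pair alone looks like four nines, but a single shared component pulls the total back below that component's own availability.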

Page 13: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Problem: Data centres fail

Page 14: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Solution: Get a second data centre

Page 15: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Hot – Hot multisite

•  Full range of services available in multiple locations
•  Easy to automate failover of sites
•  Data consistency is hard
•  Capacity planning concerns

Page 16: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Hot – Warm multisite

•  Simpler than hot – hot
•  Read / Write ratio dependent
•  Synchronously or asynchronously replicate data?

Page 17: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Hot – Cold multisite

•  Easy to set up
•  Will it work?
•  Can it be trusted?
•  Cold site rapidly becomes stale
•  Is it actually valuable?

Page 18: MTBF / MTTR - Energized Work TekTalk, Mar 2012

DR multisite

•  Fingers crossed you never need it
•  How can / should you test it?
•  Cloud?

Page 19: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Problems with multiple sites

•  It’s expensive
•  Managing more systems
•  Managing data consistency
•  Managing capacity
•  Is it still fail-proof?
•  Unless you test it, it’s just a plan

Page 20: MTBF / MTTR - Energized Work TekTalk, Mar 2012

We now have a complex system

Page 21: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Complex systems

•  More redundancy and automation leads to more complexity
•  More complexity often adds more points of failure

Page 22: MTBF / MTTR - Energized Work TekTalk, Mar 2012

How complex systems fail

•  Catastrophe is always just around the corner
•  Human operators have dual roles
•  Change introduces new forms of failure

- Dr. Richard Cook

Page 23: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Failure and recovery

Page 24: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Questions for the business

•  What is the cost of downtime?
•  What are the Recovery Time Objectives (RTO)?
•  What are the Recovery Point Objectives (RPO)?

Page 25: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Aggressive RTO and RPO are expensive and have a performance impact

Page 26: MTBF / MTTR - Energized Work TekTalk, Mar 2012

RTO / RPO example

Problem:
•  Simple DB
•  Business can tolerate up to 15 minutes of downtime
•  10-minute window of data loss

Page 27: MTBF / MTTR - Energized Work TekTalk, Mar 2012

RTO / RPO example

Possible solution:
•  Continuously replicate data to second host
•  Continue with nightly backups and also copy DB transaction logs from the primary host to another system
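
One way to sanity-check a design like this against the example's targets (15 minutes of downtime, 10 minutes of data loss) is sketched below. The failover time and log-copy intervals are hypothetical inputs, and the worst-case data loss is assumed to equal one full copy interval:

```python
from datetime import timedelta

RTO = timedelta(minutes=15)  # tolerated downtime, from the example
RPO = timedelta(minutes=10)  # tolerated data-loss window, from the example


def meets_objectives(failover_time, log_copy_interval, rto=RTO, rpo=RPO):
    """Return True when the design satisfies both objectives.

    Assumption: in the worst case, everything written since the last
    log copy is lost, so data loss is bounded by the copy interval.
    """
    return failover_time <= rto and log_copy_interval <= rpo


# Hypothetical designs: logs copied every 5 minutes versus nightly backups only.
with_log_copy = meets_objectives(timedelta(minutes=5), timedelta(minutes=5))
nightly_only = meets_objectives(timedelta(minutes=5), timedelta(hours=24))
print(with_log_copy, nightly_only)
```

The nightly-only design fails on RPO regardless of how quickly the second host takes over, which is why the log copying is part of the proposed solution.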

Page 28: MTBF / MTTR - Energized Work TekTalk, Mar 2012

So what is more important – increasing availability or reducing recovery time?

Page 29: MTBF / MTTR - Energized Work TekTalk, Mar 2012

MTBF or MTTR?

What about MTTD?
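
The trade-off can be made concrete with the common steady-state formula, availability = MTBF / (MTBF + MTTR). The sketch below assumes MTBF is measured as uptime between failures and treats detection time (MTTD) as extra downtime added to repair time; the hour values are illustrative, not from the talk:

```python
def availability(mtbf_hours, mttr_hours, mttd_hours=0.0):
    """Steady-state availability: uptime divided by uptime plus downtime.

    Assumption: an outage lasts for the detection time plus the repair time.
    """
    downtime = mttd_hours + mttr_hours
    return mtbf_hours / (mtbf_hours + downtime)


baseline = availability(1000, 1)
doubled_mtbf = availability(2000, 1)
halved_mttr = availability(1000, 0.5)
slow_detection = availability(1000, 0.5, mttd_hours=0.5)

print(f"Baseline:        {baseline:.5%}")
print(f"MTBF doubled:    {doubled_mtbf:.5%}")
print(f"MTTR halved:     {halved_mttr:.5%}")
print(f"MTTR halved, but 30 min to detect: {slow_detection:.5%}")
```

Under this model, halving recovery time buys the same availability as doubling the time between failures, and slow detection can cancel out a faster repair.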

Page 30: MTBF / MTTR - Energized Work TekTalk, Mar 2012

The answer is: It depends

Page 31: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Failure is inevitable

Page 32: MTBF / MTTR - Energized Work TekTalk, Mar 2012

Ask anyone

Page 33: MTBF / MTTR - Energized Work TekTalk, Mar 2012

License

This presentation is provided under the Creative Commons Attribution Share Alike 3.0 Unported License.

You are free:
To share – to copy, distribute and transmit the work
To remix – to adapt the work

Under the following conditions:
Attribution – You must attribute the work in the manner specified by Energized Work (but not in any way that suggests that Energized Work endorse you or your use of the work).
Share Alike – If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.
