Building and Monitoring Services at Lithium (fault tolerance, resiliency and monitoring) Paul Cichonski, Senior Software Engineer @paulcichonski

Upload: paul-cichonski

Post on 25-May-2015

DESCRIPTION

Paul Cichonski's presentation from the SF CloudOps Meetup on building and monitoring fault-tolerant systems. (http://www.meetup.com/CloudOps/events/159397622/)

TRANSCRIPT

Page 1: Building and Monitoring Services at Lithium

Building and Monitoring Services at Lithium (fault tolerance, resiliency and monitoring)

Paul Cichonski, Senior Software Engineer

@paulcichonski

Page 2: Building and Monitoring Services at Lithium

Services at Lithium Use:

Page 3: Building and Monitoring Services at Lithium

Failure is a Constant, Need to Avoid Cascading Failure

Image Source: Netflix Hystrix: https://github.com/Netflix/Hystrix/wiki

Page 4: Building and Monitoring Services at Lithium

We All Know How to Simulate Failure:

Page 5: Building and Monitoring Services at Lithium

But how do we develop code to deal with failure?

Page 6: Building and Monitoring Services at Lithium

Need to build fault tolerant and resilient services... How?

Clustering, for high-availability, is not enough to protect against cascading failure

Page 7: Building and Monitoring Services at Lithium

#1 Fail Fast: use timeouts aggressively
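
A minimal sketch of failing fast, assuming a blocking call that can be run in a worker thread. It uses only the Python standard library; `call_with_timeout` and `slow_dependency` are hypothetical names for illustration, not anything shown in the talk.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s, *args):
    """Run fn(*args) in a worker thread; give up after timeout_s seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        # Raises concurrent.futures.TimeoutError if the call is too slow.
        return future.result(timeout=timeout_s)
    finally:
        # Do not block on a hung worker; let it finish in the background.
        pool.shutdown(wait=False)


def slow_dependency(delay_s):
    """Hypothetical stand-in for a slow remote call."""
    time.sleep(delay_s)
    return "ok"
```

The caller gets an exception after `timeout_s` instead of waiting indefinitely, so a slow dependency cannot tie up the caller's own threads.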

Page 8: Building and Monitoring Services at Lithium

#2 Use circuit breakers on network calls
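
A circuit breaker can be sketched in a few lines. The version below is a simplified illustration in standard-library Python, not the Hystrix implementation; the class name, thresholds and injectable `clock` are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail immediately rather than queueing up behind a dependency that is already known to be unhealthy.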

Page 9: Building and Monitoring Services at Lithium

#3 Use async communication when possible
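
As a rough illustration of asynchronous calls, the sketch below issues two requests concurrently with Python's `asyncio`. The `fetch` coroutine and the service names are hypothetical stand-ins for non-blocking remote calls.

```python
import asyncio


async def fetch(name, delay_s):
    """Hypothetical stand-in for a non-blocking remote call."""
    await asyncio.sleep(delay_s)
    return f"{name}: done"


async def fetch_all():
    # Both calls are in flight at once; total time is about the slowest call,
    # not the sum, and no thread is parked waiting on either one.
    return await asyncio.gather(
        fetch("profile", 0.05),
        fetch("messages", 0.05),
    )


results = asyncio.run(fetch_all())
```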

Page 10: Building and Monitoring Services at Lithium

#4 Have well thought-out backpressure mechanisms
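
One simple backpressure mechanism is a bounded queue that rejects new work when full instead of buffering without limit. The sketch below assumes that policy; `BoundedInbox` and its method names are illustrative only.

```python
import queue


class BoundedInbox:
    """Accepts work up to a fixed capacity, then rejects instead of buffering."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)
        self.rejected = 0

    def offer(self, item):
        """Return True if accepted, False if the caller should back off."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def take(self):
        """Remove and return the oldest item; raises queue.Empty if none."""
        return self._queue.get_nowait()
```

A `False` from `offer` is the signal that travels back upstream: the producer can slow down, shed load, or return an error to its own caller.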

Page 11: Building and Monitoring Services at Lithium

#5 Use cross-region (or cross-datacenter) replication

Page 12: Building and Monitoring Services at Lithium

#6 Failure models should be built into the business requirements of a service
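
One way to read this point is that the degraded behavior is decided up front, as part of the requirements, rather than improvised in an exception handler. A minimal sketch, assuming the agreed fallback is a static default; all function names here are hypothetical.

```python
def with_fallback(primary, fallback):
    """Return (value, degraded): primary() if it works, else fallback()."""
    try:
        return primary(), False
    except Exception:
        return fallback(), True  # True marks a degraded response


def load_recommendations():
    """Hypothetical dependency that is currently failing."""
    raise ConnectionError("recommendation service unavailable")


def default_recommendations():
    """Hypothetical business-approved default for the failure case."""
    return ["most-viewed"]


items, degraded = with_fallback(load_recommendations, default_recommendations)
```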

Page 13: Building and Monitoring Services at Lithium

Read:

Page 14: Building and Monitoring Services at Lithium

Even with all of that, your app will still fail, so how do you recover quickly?

Page 15: Building and Monitoring Services at Lithium

DevOps/CloudOps Model: OODA (Observe, Orient, Decide, Act)

Page 16: Building and Monitoring Services at Lithium

Observe and Orient: you need metrics and dashboards

Page 17: Building and Monitoring Services at Lithium

You Need Metrics

• Reduce “map/territory” confusion
• We use Yammer Metrics
  – Timers
  – Meters
  – Histograms
• We use them a lot
  – Every class has at least one metric, most have multiple
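
To make the three metric types concrete, here is a toy sketch in standard-library Python. It is not the Yammer Metrics API; the class and method names are assumptions chosen to mirror the concepts: a meter counts events and reports a rate, a histogram summarizes recorded values, and a timer combines the two for durations.

```python
import math
import time


class Meter:
    """Counts events and reports the mean rate since creation."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.start = clock()
        self.count = 0

    def mark(self, n=1):
        self.count += n

    def mean_rate(self):
        elapsed = self.clock() - self.start
        return self.count / elapsed if elapsed > 0 else 0.0


class Histogram:
    """Records values and reports nearest-rank percentiles."""

    def __init__(self):
        self.values = []

    def update(self, value):
        self.values.append(value)

    def percentile(self, p):
        if not self.values:
            raise ValueError("no values recorded")
        ordered = sorted(self.values)
        rank = max(1, math.ceil(len(ordered) * p / 100))
        return ordered[rank - 1]


class Timer:
    """Times a block of code: a meter of calls plus a histogram of durations."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.meter = Meter(clock)
        self.durations = Histogram()

    def __enter__(self):
        self._started = self.clock()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.durations.update(self.clock() - self._started)
        self.meter.mark()
        return False  # never swallow exceptions from the timed block
```

Wrapping a method body in `with timer:` records both how often it runs and how long it takes, which is the kind of per-class instrumentation the slide describes.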

Page 18: Building and Monitoring Services at Lithium

You Need to Visualize the Metrics

Page 19: Building and Monitoring Services at Lithium

You Need Dashboards Keyed to Business Functionality

Page 20: Building and Monitoring Services at Lithium

Use alerting as a last resort (because sometimes we need to sleep)

Page 21: Building and Monitoring Services at Lithium

Decide and Act: you need robust CI and fast code roll-outs

Page 22: Building and Monitoring Services at Lithium

Rinse and Repeat