canary analyze all the things

Post on 11-Aug-2014

DESCRIPTION

A Canary Analysis presentation for QCon/NY 2014

TRANSCRIPT

Canary Analyze All the Things

Roy Rapoport, @royrapoport

June 12, 2014

Significant contributions by Chris Sanden, @chris_sanden

Oh, the Places We’ll Go!

• Introductions

• Proposed Use Case and Definition

• Continuous Improvement / MVP Model

• Issues, Solutions

• Cloud Considerations

• The Road at Netflix

A Word About Me …

• About 20 years in technology
  • Systems engineering, networking, software development, QA, release management
• Time at Netflix: 1809 days (4y:11m:14d)
• At Netflix:
  • Systems Engineering, Service Delivery in IT/Ops
  • Troubleshooter and Builder of Python Things[tm] in Product Engineering
• Current role: Insight Engineering in Product Engineering
  • Real-Time Operational Insight

A Word About Netflix… Just the Stats

• 16 years
• 2000+ employees
• 48 million users
• 5x10^9 hours/quarter

A Word About Netflix… Freedom and Responsibility Culture

• Optimize speed of innovation
  • Constrain availability
  • Cost will be what cost will be
• Hire smart (experienced) people
  • Get out of their way
• Anti-process bias

A Word About Netflix… Technology and Operations

• Service Oriented Architecture
• Decentralized Operations. You
  • Build
  • Test
  • Deploy
  • Set up alerting and monitoring
  • Wake up at 2AM

Why Canary Analysis?

So You’ve Just Done a Release

> curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/cat
{"response": "meow"}

> curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/dog
{"response": "woof"}

> curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/fox
{"response": "wa-pa-pa-pa-pa-pa-pow"}

The correct answer to “what does the fox say?” is left as an exercise for the reader

You Need Better Testing!

Well, yeah

“I’m going to push to production, though I’m pretty sure it’s going to kill the system”

- Said no one, ever*

* Hopefully

Detour: Rate of Change vs Availability

[Figure: Availability (nines), axis 0 to 6, versus Rate of Change, axis 1, 10, 100, 1000. The final build adds the label "Operations Engineering".]

You Need Better Testing! Deployments!

Canary Analysis!!

• A deployment process where
• a new change (in behavior, code, or both)
• is rolled out into production gradually,
• with checkpoints along the way to examine the new (canary) systems
• (optionally versus the old (baseline) systems)
• and make go/no-go decisions.

Canary Analysis Is Not

• A replacement for any sort of software testing
• A/B Testing
• Releasing 100% to production and hoping for the best

One Possible Process

[Diagram. Components: Version Control System, Build & Deployment System, Automated Canary Analysis, Customers. Successive builds show the fleet as:]

• 1000 servers @ 1.0.1
• 1000 servers @ 1.0.1 plus 1 server @ 1.0.2
• 1000 servers @ 1.0.1 plus 10 servers @ 1.0.2
• 1000 servers @ 1.0.1 plus 1000 servers @ 1.0.2
• 1000 servers @ 1.0.2 only
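The staged process above can be sketched as a loop with a checkpoint after each stage. This is a minimal illustration, not the Netflix implementation: `staged_rollout` and its `deploy`, `canary_is_healthy`, and `rollback` callbacks are hypothetical names, and only the 1, 10, 1000 stage sizes come from the slides.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[int],
    deploy: Callable[[int], None],
    canary_is_healthy: Callable[[int], bool],
    rollback: Callable[[], None],
) -> bool:
    """Grow the canary fleet stage by stage, with a go/no-go check at each step.

    All four parameters are hypothetical hooks into whatever deployment and
    analysis tooling is actually in use.
    """
    for size in stages:
        deploy(size)                     # bring the new version up to `size` servers
        if not canary_is_healthy(size):  # checkpoint: examine the canary systems
            rollback()                   # no-go: remove the new version
            return False
    return True                          # every checkpoint passed


# Usage with stand-in callbacks that only record what happened.
events = []
finished = staged_rollout(
    stages=[1, 10, 1000],
    deploy=lambda n: events.append(f"deploy {n}"),
    canary_is_healthy=lambda n: True,
    rollback=lambda: events.append("rollback"),
)
print(finished, events)
```

Keeping the health check as a callback leaves the decision logic separate from the execution logic, so either can be automated independently.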

Are We There Yet?

• We’re not
• You’re probably not either

Minimally …

• Observability
• Partial traffic routing
• Decision-making

Better Yet …

• Focus on the Goal
• Current Baseline Matters
• Observability segregation

26% fewer errors in canary

Hold On a Minute!

26% fewer errors in canary

Mission Accomplished

30% fewer requests handled in canary

• Absolute numbers are relatively unimportant
• Relative numbers matter
  • Error rate
  • RPS per CPU cycle
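A short worked example shows why the relative number matters. Only the 26% and 30% figures come from the slides; the baseline counts and the function names `error_rate` and `relative_change` are assumptions made for illustration.

```python
def error_rate(errors: float, requests: float) -> float:
    """Errors per request; the relative metric the slide argues for."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return errors / requests


def relative_change(canary: float, baseline: float) -> float:
    """Fractional change of the canary value versus the baseline value."""
    return (canary - baseline) / baseline


# Hypothetical baseline counts, chosen only to make the arithmetic concrete.
baseline_errors, baseline_requests = 1000.0, 100_000.0

# From the slides: 26% fewer errors, but also 30% fewer requests handled.
canary_errors = baseline_errors * (1 - 0.26)
canary_requests = baseline_requests * (1 - 0.30)

change = relative_change(
    error_rate(canary_errors, canary_requests),
    error_rate(baseline_errors, baseline_requests),
)
print(f"canary error rate vs baseline: {change:+.1%}")
```

Because 0.74 / 0.70 is roughly 1.057, the canary's error rate is about 5.7% higher than the baseline's, even though its absolute error count is lower.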

So You’ve Got Your Graphs

[Figure: Requests Rate Comparison (requests)]

          Type       RAM     Cores  Cost
Baseline  m3.medium  3.75GB  3      $.11/hr
Canary    m1.small   1.7GB   1      $.06/hr

Automating …

• Decision
• Execution

A Quick Recap

• Observe
• Segregate metrics
• Partial deploy
• Compare to Baseline
• Absolutes are never right
• Automate decision
• Automate execution

To Save You Some Time …

Not all metrics are created equal

Focus on System and Application Metrics

Weight by category (system, latency, etc.)
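One way to weight by category is a weighted average of per-category pass rates. The sketch below is an assumption about how such a score could be computed, not the scoring method described in the talk; the function `canary_score`, the category names, and the weights are all hypothetical.

```python
from typing import Dict


def canary_score(pass_rates: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-category pass rates, scaled to 0-100.

    pass_rates maps a category to the fraction of its metrics that looked
    acceptable (0.0 to 1.0); weights maps a category to its importance.
    """
    total_weight = sum(weights[category] for category in pass_rates)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    weighted = sum(pass_rates[category] * weights[category] for category in pass_rates)
    return 100.0 * weighted / total_weight


# Hypothetical categories and weights: errors count three times as much as system metrics.
score = canary_score(
    pass_rates={"system": 1.0, "latency": 0.5, "errors": 1.0},
    weights={"system": 1.0, "latency": 2.0, "errors": 3.0},
)
print(f"canary score: {score:.1f}")
```

With these inputs the weighted sum is 1 + 1 + 3 = 5 out of a total weight of 6, so the score is about 83.3.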

To Save You Some Time …

Outliers are out, lying

Use a group of servers

Balance fidelity with customer impact
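A small sketch of why a group of servers helps: summarizing each group with a median keeps one misbehaving server from dominating the comparison. The choice of the median, the function name `group_ratio`, and the sample latencies are assumptions for illustration; the slides do not say which statistic is used.

```python
from statistics import mean, median
from typing import Sequence


def group_ratio(canary: Sequence[float], baseline: Sequence[float]) -> float:
    """Ratio of the canary group's median to the baseline group's median."""
    return median(canary) / median(baseline)


# Hypothetical per-server latencies in milliseconds; one canary server is an outlier.
canary_latency = [102.0, 98.0, 101.0, 950.0]
baseline_latency = [100.0, 99.0, 101.0, 100.0]

print("median ratio:", group_ratio(canary_latency, baseline_latency))
print("mean ratio:  ", mean(canary_latency) / mean(baseline_latency))
```

The median ratio stays close to 1 while the mean ratio is pulled above 3 by the single outlier. A larger canary group gives a more faithful picture but exposes more customers to the new version, which is the fidelity trade-off the slide names.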

To Save You Some Time …

Exercise without warmup can result in injury

Repeat canary analysis frequently

Both traffic and startup time are factors
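One simple way to account for warmup is to discard samples taken before an instance has been up for some minimum time. This is a sketch under that assumption; the `after_warmup` helper, the sample layout, and the 300-second window are hypothetical and not taken from the talk.

```python
from typing import List, Tuple

# (seconds since instance start, metric value); a hypothetical sample layout.
Sample = Tuple[float, float]


def after_warmup(samples: List[Sample], warmup_seconds: float) -> List[float]:
    """Keep only metric values recorded once the warmup window has elapsed."""
    return [value for age, value in samples if age >= warmup_seconds]


# Hypothetical latency samples: slow while caches fill, then steady.
latency_samples = [(0.0, 900.0), (60.0, 400.0), (300.0, 120.0), (360.0, 118.0)]
steady = after_warmup(latency_samples, warmup_seconds=300.0)
print(steady)
```

A time-based cutoff covers startup time only. Since traffic is also a factor, a fuller version might additionally require a minimum number of handled requests before a sample counts.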

To Save You Some Time …

vive la différence!

Hot-OK, Cold-OK

Let Application Owners Choose
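The sketch below reads "Hot-OK, Cold-OK" as: a metric can be higher ("hot") or lower ("cold") in the canary than in the baseline, and the application owner decides per metric which direction is acceptable. That reading, the 10% tolerance, and the names `classify` and `metric_passes` are assumptions, not details confirmed by the slides.

```python
def classify(canary: float, baseline: float, tolerance: float = 0.10) -> str:
    """Label a canary metric 'hot', 'cold', or 'same' relative to the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    ratio = canary / baseline
    if ratio > 1 + tolerance:
        return "hot"
    if ratio < 1 - tolerance:
        return "cold"
    return "same"


def metric_passes(label: str, hot_ok: bool = False, cold_ok: bool = False) -> bool:
    """Apply the owner's per-metric choice of which deviations are acceptable."""
    return label == "same" or (label == "hot" and hot_ok) or (label == "cold" and cold_ok)


# Hypothetical owner choices: lower latency is fine, lower throughput is not.
latency_ok = metric_passes(classify(80.0, 100.0), cold_ok=True)
throughput_ok = metric_passes(classify(80.0, 100.0), hot_ok=True)
print(latency_ok, throughput_ok)
```

The same 20% drop passes for latency and fails for throughput, which is why the choice is left to the people who know what each metric means for their application.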

To Save You Some Time …

Signal is better than no1$#[NO CARRIER]

Ignore weak signals

Good News

• Software-Defined Everything
• Incremental Pricing

Bad News

• Capacity Management
• Unpredictable Inconsistency

Numbers

• 752 services in production
• In-house telemetry platform
• A few metrics

Been there. Done that. Manually. Artisanally.

• Started in the Data Center
• Manual, dashboard-driven

Been there. Done that. Manually.

[Figures: panels labeled CPU, Requests, Errors]

• Context vs Precision
• No …
  • Repeatability
  • Trending
• Manual effort is manual

So Now What?

• Automate Analysis
• Took Some Effort
• Approach and analytics
• Presentation matters

Automated Canary Analysis

For Our Next Trick …

• Configuration GUI
• Deployment System Integration
• ACA All The Things
  • OpenConnect firmware updates
  • Client software changes
  • Configuration changes in production

Summary

• Canary Analysis makes your changes
  • Safer
  • Faster
  • Easier
• Most people can start doing it
• Everyone can do it better

Questions, Attributions, Feedback

@royrapoport

• https://www.flickr.com/photos/cseeman
• https://www.flickr.com/photos/ransomtech
• https://www.flickr.com/photos/dougbrown47
• https://www.flickr.com/photos/andresthor/
• https://www.flickr.com/photos/dougbrown47
• https://www.flickr.com/photos/pkdesigns