we'll do it live! - joseph pierri, pagerduty - devopsdays tel aviv 2016

Upload: devopsdays-tel-aviv

Post on 08-Jan-2017

TRANSCRIPT

Page 1: We'll Do It Live! - Joseph Pierri, PagerDuty - DevOpsDays Tel Aviv 2016

[email protected]

“We’ll do it Live!”

Testing your Software in Production

Joseph Pierri

[email protected]

Page 3

People who don’t take risks generally make about two big mistakes a year. People who do take risks generally make about two big mistakes a year. - Peter Drucker

Page 4

Testing: Problem Statement

Optimize User Experience

Minimize Operational Pain

Constraint: Developer Time

Page 5

Conventional Approach

Local Testing → Staging → Load Test → Production

Page 6

Staging

Benefits             Challenges
Sort of prod-like    Contention
Integration          Difficult to Scale
                     Often Broken

Page 7

Load Testing

Benefits             Challenges
Realistic Data       Maintaining Data
Realistic Fleet      Scaling Fleet
Realistic Traffic    Traffic…

Page 8

Less Conventional Approaches

Local Containers

Disposable Environments

Test in Production

Page 9

Local Containers

• No contention issues

• Easy integration testing

• Some scalability issues

Page 10

Disposable Environments

• Codified environment

• Spun up and disposed of on demand
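One way to picture a codified environment that is spun up and disposed of on demand is a context manager driven by a declarative spec. The sketch below is illustrative only: `EnvSpec`, `disposable_env`, and the temporary directory standing in for real infrastructure are assumptions, not anything shown in the talk.

```python
import shutil
import tempfile
from contextlib import contextmanager
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class EnvSpec:
    """Hypothetical codified description of an environment."""
    name: str
    services: tuple = ()


@contextmanager
def disposable_env(spec: EnvSpec):
    """Spin up an environment from the spec, then dispose of it on exit."""
    root = Path(tempfile.mkdtemp(prefix=f"{spec.name}-"))
    try:
        for service in spec.services:
            # A directory per service stands in for real provisioning.
            (root / service).mkdir()
        yield root
    finally:
        # Disposal runs even if the tests inside the block fail.
        shutil.rmtree(root, ignore_errors=True)
```

Tests would run inside `with disposable_env(spec) as root:`; because each run gets its own environment, nothing is left behind to contend over.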

Page 11

Testing in Production

• Very production-like!

• Workload & environment

• Requires risk mitigation techniques…

Page 12

Reducing Risk

Know when Something Breaks

Limit the Impact

Rolling Back

Page 13

Know when Something Breaks

Monitoring, Logging, Alerting

(Others)

Page 14

Know when Something Breaks

Production End-to-End Functional Testing

E2E Suite → Software System

FAIL? → ALERT
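The loop on this slide (run an end-to-end suite against the production system, alert when something fails) could look roughly like the following. This is a minimal sketch: `run_e2e_suite`, the shape of the checks, and the `alert` callback are hypothetical names chosen for illustration, and a real suite would page through an alerting tool rather than a plain function.

```python
from typing import Callable, Dict, List


def run_e2e_suite(checks: Dict[str, Callable[[], bool]],
                  alert: Callable[[str], None]) -> List[str]:
    """Run each check against production and alert on every failure."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that crashes counts as a failure.
            ok = False
        if not ok:
            failed.append(name)
            alert(f"E2E check failed: {name}")
    return failed
```

Running this on a schedule turns the suite into a monitor that exercises the same paths real users do.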

Page 15

Limit the Impact

Canary Deploys

[Diagram: Hosts, some running V+ and the rest running V]
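A canary deploy limits impact by putting the new version (V+) on a small share of hosts while the rest stay on V. The sketch below shows only the host-selection step; `plan_canary`, the `fraction` parameter, and the host names are assumptions made for illustration.

```python
import math
from typing import Dict, List


def plan_canary(hosts: List[str], fraction: float) -> Dict[str, List[str]]:
    """Split hosts into a canary group (V+) and the rest (V)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Always canary at least one host when there are any hosts at all.
    count = max(1, math.ceil(len(hosts) * fraction)) if hosts else 0
    return {"canary": hosts[:count], "stable": hosts[count:]}
```

If the canary group looks healthy, the same function can be called again with a larger fraction to widen the rollout.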

Page 16

Limit the Impact

Feature Flags

[Diagram: Users → Software Service, serving V+ or V]
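A feature flag limits impact per user rather than per host: the service decides at request time whether a given user sees V+ or V. The following is a sketch under stated assumptions; `flag_enabled`, `handle_request`, the flag name, and the hash-based percentage rollout are illustrative choices, not the mechanism described in the talk.

```python
import hashlib


def flag_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the flag's rollout percentage."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < rollout_percent


def handle_request(user_id: str, rollout_percent: int) -> str:
    """Route a user to the new code path (V+) or the current one (V)."""
    if flag_enabled("new-feature", user_id, rollout_percent):
        return "V+"
    return "V"
```

Hashing keeps each user in the same bucket across requests, so raising the percentage only adds users, and setting it to 0 turns the feature off without a deploy.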

Page 17

Rolling Back

Deployment Pipeline

Test, Build

Canary

Deploy

Rollback
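The pipeline stages listed on this slide can be sketched as a function that runs them in order and rolls back when the canary or the full deploy fails. Everything here is hypothetical: `run_pipeline`, the stage callables, and the log strings exist only to illustrate the control flow.

```python
from typing import Callable, List


def run_pipeline(test_and_build: Callable[[], bool],
                 canary: Callable[[], bool],
                 deploy: Callable[[], bool],
                 rollback: Callable[[], None]) -> List[str]:
    """Run the stages in order; roll back if canary or deploy fails."""
    log = []
    if not test_and_build():
        log.append("test_build_failed")
        return log  # nothing deployed yet, so nothing to roll back
    log.append("built")
    for name, stage in (("canary", canary), ("deploy", deploy)):
        if stage():
            log.append(f"{name}_ok")
        else:
            log.append(f"{name}_failed")
            rollback()
            log.append("rolled_back")
            break
    return log
```

Making rollback a first-class stage means it is exercised by the pipeline itself rather than improvised during an incident.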

Page 18

Culture

Quantify!

Risk Tolerance

“You built it, you run it”

Page 19

TiP - Real World

Good Fit              Difficult
New feature           Mobile app
Incremental change    Bank machine
New service           Rocket

Page 20

Measure Everything!

# deploys by service
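A metric such as the number of deploys per service can be derived from a log of deploy events. This is a small sketch assuming each event is a `(service, timestamp)` pair; `deploys_by_service` and that event shape are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, Tuple


def deploys_by_service(events: Iterable[Tuple[str, str]]) -> Counter:
    """Count deploy events per service; each event is (service, timestamp)."""
    return Counter(service for service, _ in events)
```

Calling `most_common()` on the result ranks services by deploy count.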

Page 22

People who don’t take risks generally make about two big mistakes a year. People who do take risks generally make about two big mistakes a year. - Peter Drucker

Page 23

A Tale of Two Services

• 2014: New backend service for notifications

• Contention headaches

• Load Test fleet matches prod

Page 24

A Tale of Two Services

• 2016: New Kafka producer service

• Containerized

• “Prod is the best LT anyways”

Page 25

Bringing it Together

1 Conventional Approaches

2 Testing in Production

3 Reducing Risk

4 Real World

Page 26

A Tale of a Bridge

Silver Bridge (1928-1967)

Page 27

A Tale of a Bridge

“[The Silver Bridge’s] legacy should be to remind engineers to proceed always with the utmost caution, ever mindful of the possible existence of unknown unknowns and the potential consequences of even the smallest design decisions” - Henry Petroski

Page 28

Software ≠ Bridge

                    Bridge        Software
Deploys             One           Many
Partial Deploys?    No            Yes
Rollbacks           Difficult     Easy
Bad Deploys?        Disaster      Manageable
Approach            Never Fail    Fail Fast