We'll Do It Live! - Joseph Pierri, PagerDuty - DevOpsDays Tel Aviv 2016
“We’ll do it Live!”
Testing your Software in Production
Joseph Pierri
People who don’t take risks generally make about two big mistakes a year. People who do take risks generally make about two big mistakes a year. - Peter Drucker
Testing: Problem Statement
Optimize User Experience
Minimize Operational Pain
Constraint: Developer Time
Conventional Approach
Local Testing -> Staging -> Load Test -> Production
Staging
Benefits | Challenges
Sort of prod-like | Contention
Integration | Difficult to Scale
- | Often Broken
Load Testing
Benefits | Challenges
Realistic Data | Maintaining Data
Realistic Fleet | Scaling Fleet
Realistic Traffic | Traffic…
Less Conventional Approaches
Local Containers
Disposable Environments
Test in Production
Local Containers
• No contention issues
• Easy integration testing
• Some scalability issues
Disposable Environments
• Codified environment
• Spun up and disposed of on demand
Testing in Production
• Very production-like!
• Workload & environment
• Requires risk mitigation techniques…
Reducing Risk
Know when Something Breaks
Limit the Impact
Rolling Back
Know when Something Breaks
Monitoring, Logging, Alerting
(Others)
Know when Something Breaks
Production End-to-End Functional Testing
[Diagram: E2E Suite -> Software System; FAIL? -> ALERT]
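A minimal Python sketch of the loop on this slide: a suite of end-to-end checks runs against production and raises one alert when any check fails. The `run_suite` helper, the `CheckResult` type, and the `alert` callback are hypothetical stand-ins, not tooling described in the talk; a real suite would drive the live system and hand failures to an alerting service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def run_suite(
    checks: Dict[str, Callable[[], None]],
    alert: Callable[[List[CheckResult]], None],
) -> List[CheckResult]:
    """Run every check; call alert() once with all failures, if any."""
    results = []
    for name, check in checks.items():
        try:
            check()  # a check signals failure by raising an exception
            results.append(CheckResult(name, True))
        except Exception as exc:
            results.append(CheckResult(name, False, str(exc)))
    failures = [r for r in results if not r.passed]
    if failures:
        alert(failures)
    return results
```

Running such a suite on a schedule, rather than only at deploy time, is what turns it into a "know when something breaks" signal.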
Limit the Impact
Feature Flags
[Diagram: Users -> Software Service, running version V or V+]
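One common way to limit impact with a feature flag is a percentage rollout, sketched below in Python. This is an illustration under assumptions: the flag name `new-feature`, the helper names, and the hash-based bucketing are all hypothetical, and the talk does not say how its flags were implemented. Each user hashes to a stable bucket from 0 to 99, so the same user always sees the same version and raising the percentage only ever adds users.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """True when the user's bucket falls inside the rollout percentage."""
    return bucket(flag, user_id) < rollout_percent


def handle_request(user_id: str, rollout_percent: int) -> str:
    """Serve the new code path (V+) to flagged users, the old one (V) to the rest."""
    if is_enabled("new-feature", user_id, rollout_percent):
        return "V+"
    return "V"
```

Setting `rollout_percent` to 0 acts as a kill switch: every user falls back to V without a redeploy.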
Rolling Back
Deployment Pipeline
[Diagram: Test, Build -> Canary -> Deploy; Rollback]
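The canary and rollback stages of the pipeline can be sketched as follows. This is a simplified Python model, not the pipeline from the talk: the fleet is a plain dict of host to version, and `healthy` is a hypothetical callback standing in for whatever monitoring decides a host is fine. The new version goes to a small canary group first; if any host in a stage is unhealthy, the whole fleet is restored to its previous versions.

```python
from typing import Callable, Dict


def deploy_with_canary(
    fleet: Dict[str, str],
    new_version: str,
    healthy: Callable[[str, str], bool],
    canary_count: int = 1,
) -> bool:
    """Deploy to canaries, then the rest; roll everything back on failure."""
    previous = dict(fleet)  # snapshot used for rollback
    hosts = sorted(fleet)
    canaries, rest = hosts[:canary_count], hosts[canary_count:]
    for stage in (canaries, rest):
        for host in stage:
            fleet[host] = new_version
        if not all(healthy(host, fleet[host]) for host in stage):
            fleet.clear()
            fleet.update(previous)  # rollback: restore every host
            return False
    return True
```

Because the canary stage is checked before the rest of the fleet is touched, a bad build only ever reaches `canary_count` hosts.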
Culture
Quantify!
Risk Tolerance
“You built it, you run it”
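One way to put a number on risk tolerance is to translate an availability target into the downtime it permits. The slides do not say how risk was quantified, so the availability framing and the `allowed_downtime_minutes` helper below are assumptions for illustration. As a worked figure, a 99.9% target over 30 days leaves roughly 43 minutes.

```python
def allowed_downtime_minutes(availability_target: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a period."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_target)
```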
TiP - Real World
Good Fit | Difficult
New feature | Mobile app
Incremental change | Bank machine
New service | Rocket
A Tale of Two Services
• 2014: New backend service for notifications
• Contention headaches
• Load Test fleet matches prod
A Tale of Two Services
• 2016: New Kafka producer service
• Containerized
• “Prod is the best LT anyways”
Bringing it Together
1 Conventional Approaches
2 Testing in Production
3 Reducing Risk
4 Real World
A Tale of a Bridge
“[The Silver Bridge] legacy should be to remind engineers to proceed always with the utmost caution, ever mindful of the possible existence of unknown unknowns and the potential consequences of even the smallest design decisions” - Henry Petroski
Software ≠ Bridge
 | Bridge | Software
Deploys | One | Many
Partial Deploys? | No | Yes
Rollbacks | Difficult | Easy
Bad Deploys? | Disaster | Manageable
Approach | Never Fail | Fail Fast