pagerduty | oscon 2016 failure testing
TRANSCRIPT
@alperkokmen
Failure Testing AUTOMATING A SERIES OF UNFORTUNATE EVENTS
#OSCON
Alper Kokmen
PRESENT
Software Engineer at PagerDuty
Surrounded by smart people
PAST
Start-ups, Microsoft
Surrounded by smart people
Goals
Start manually injecting failures. Start automating your manual tests.
CHAOS ENGINEERING
“[T]he discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
Principles of Chaos Engineering http://principlesofchaos.org
Netflix Simian Army DIFFERENT SIMIANS FOR DIFFERENT FAILURES
PagerDuty Simian Army?
Multiple cloud providers (AWS and Azure)
Experimentation
Application-specific failure scenarios
PagerDuty Simian Human Army FAILURE FRIDAY
Time-boxed recurring meeting
Pre-announced agenda
Break things
Sign-off from service owners
Attendance
GROUND RULES
Keep monitoring & alerting
Abort if needed
Failure Friday: Agenda
Failure Friday: Process
Inject Failure
Monitor
Repeat
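The inject, monitor, repeat cycle, combined with the ground rule to abort if needed, can be sketched in plain Ruby. This is only an illustration: `run_failure_session`, the step hashes, and the `healthy` check are hypothetical names and are not part of any PagerDuty tooling.

```ruby
# Hypothetical sketch of the inject -> monitor -> repeat loop.
# Each step is a hash with :name, :inject and :revert callables.
# `healthy` is any callable that returns true while the system copes.
def run_failure_session(steps, healthy)
  log = []
  steps.each do |step|
    ok = false
    begin
      step[:inject].call
      log << "inject #{step[:name]}"
      ok = healthy.call(step[:name])
      log << "monitor #{step[:name]}: #{ok ? 'ok' : 'unhealthy'}"
    ensure
      # Always undo the failure, even if injecting or monitoring raised.
      step[:revert].call
      log << "revert #{step[:name]}"
    end
    # Ground rule: abort the session instead of piling on more failures.
    unless ok
      log << 'abort'
      break
    end
  end
  log
end

noop = -> {}
steps = [
  { name: 'reboot',  inject: noop, revert: noop },
  { name: 'latency', inject: noop, revert: noop },
  { name: 'isolate', inject: noop, revert: noop }
]
# Pretend monitoring reports trouble during the latency step.
log = run_failure_session(steps, ->(name) { name != 'latency' })
puts log
```

Putting the revert in `ensure` keeps a failed injection from being left in place when the session is cut short.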
Failure Friday: Monitoring
2 Years Later BENEFITS
System design
Knowledge sharing
Incident response training
2 Years Later ACCOMPLISHMENTS
Whole DC outages
Target multiple services at once
Distribute failure testing to teams
Automation (in progress)
Automation: Rationale
“MANY” HOSTS
- Distribute tasks to multiple people and keep executing manually.
- Watch Operations team with envy while they use chef and knife.
- Start automating.
PagerDuty/blender A MODULAR ORCHESTRATION ENGINE
Ruby DSL
Host Discovery (blender-chef, blender-serf)
Ranjib Dey (@RanjibDey)
PagerDuty/blender CODE
# example.rb
ssh_task 'update' do
  execute 'sudo apt-get update -y'
  members ['ubuntu01', 'ubuntu02', 'ubuntu03']
end
PagerDuty/blender EXECUTION
blend -f example.rb
Run[example.rb] started
3 job(s) computed using 'Default' strategy
Job 1 [update on ubuntu01] finished
Job 2 [update on ubuntu02] finished
Job 3 [update on ubuntu03] finished
Run finished (42.228923876 s)
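The output above is consistent with one job per task and host. The sketch below models that fan-out in plain Ruby. It is an assumption drawn from the output alone, not blender's implementation, and `Task` and `compute_jobs` are hypothetical names.

```ruby
# Hypothetical model of a task: a name, a command and its member hosts.
Task = Struct.new(:name, :command, :members)

# Assumed fan-out: one job per (task, host) pair, in declaration order.
def compute_jobs(tasks)
  tasks.flat_map do |task|
    task.members.map { |host| "#{task.name} on #{host}" }
  end
end

tasks = [
  Task.new('update', 'sudo apt-get update -y', %w[ubuntu01 ubuntu02 ubuntu03])
]
jobs = compute_jobs(tasks)
puts "#{jobs.size} job(s) computed"
jobs.each_with_index { |job, i| puts "Job #{i + 1} [#{job}]" }
```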
PagerDuty/smoothie A SIMPLE LIBRARY OF BLENDER RECIPES
Chef Integration
Recipes for Disaster
CLI to Specify Recipes
PagerDuty/smoothie REBOOT RECIPE
def recipe__reboot(hosts)
  ssh_task 'reboot' do
    members hosts
    execute 'sudo /sbin/reboot'
    # shutdown will break ssh connection.
    ignore_failure true
  end
end
PagerDuty/smoothie UNICORN SUSPEND & RESUME RECIPES
def recipe__unicorn_suspend_master(hosts)
  ssh_task 'suspend unicorn[master] immediately' do
    members hosts
    execute 'sudo kill -s STOP `cat /u/.../pids/unicorn.pid`'
  end
end

def recipe__unicorn_resume_master(hosts)
  ssh_task 'resume unicorn[master] immediately' do
    members hosts
    execute 'sudo kill -s CONT `cat /u/.../pids/unicorn.pid`'
  end
end
PagerDuty/smoothie LATENCY RECIPE
def recipe__tc_add_latency(hosts)
  ssh_task 'add network latency using tc' do
    members hosts
    execute 'sudo tc qdisc add dev eth0 root netem delay 500ms 100ms loss 20%'
  end
end

def recipe__tc_remove_latency(hosts)
  ssh_task 'remove network latency using tc' do
    members hosts
    execute 'sudo tc qdisc del dev eth0 root netem'
  end
end
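In the netem command above, `delay 500ms 100ms` adds a 500 ms delay with 100 ms of jitter, and `loss 20%` drops roughly one packet in five. The sketch below shows one way the values could be parameterized. The helper names and keyword arguments are hypothetical and not part of smoothie; the defaults reproduce the strings in the recipes.

```ruby
# Hypothetical helper: builds the netem command used by the latency recipe.
# delay_ms  - base delay added to every outgoing packet
# jitter_ms - random variation applied around the base delay
# loss_pct  - percentage of packets dropped at random
def tc_add_latency_command(dev: 'eth0', delay_ms: 500, jitter_ms: 100, loss_pct: 20)
  "sudo tc qdisc add dev #{dev} root netem " \
    "delay #{delay_ms}ms #{jitter_ms}ms loss #{loss_pct}%"
end

# Hypothetical helper: builds the command that removes the netem qdisc again.
def tc_remove_latency_command(dev: 'eth0')
  "sudo tc qdisc del dev #{dev} root netem"
end

puts tc_add_latency_command
puts tc_add_latency_command(dev: 'eth1', delay_ms: 200, jitter_ms: 50, loss_pct: 5)
puts tc_remove_latency_command
```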
PagerDuty/smoothie EXECUTION
HOSTFILTER=app1 RECIPE=reboot blend -f smoothie.rb
def recipe__reboot(hosts)
PagerDuty/smoothie EXECUTION
ZONE=us-west-2a RECIPE=reboot blend -f smoothie.rb
def recipe__reboot(hosts)
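The two invocations above suggest that `RECIPE` selects a `recipe__<name>` method and that `HOSTFILTER` or `ZONE` narrows the host list. A minimal sketch of such a dispatch follows. It is not smoothie's actual code: the `Recipes` module, `select_hosts`, `run_recipe`, the hard-coded inventory, and the substring meaning of `HOSTFILTER` are all assumptions (the slides say host discovery goes through Chef).

```ruby
# Hypothetical stand-in for the recipe library; returns the commands it
# would run instead of opening SSH connections.
module Recipes
  def self.recipe__reboot(hosts)
    hosts.map { |host| "#{host}: sudo /sbin/reboot" }
  end
end

# Assumed filtering: HOSTFILTER is a substring match on the host name,
# ZONE is an exact match on the host's zone. Unset variables match all.
def select_hosts(all_hosts, env)
  filter = env['HOSTFILTER']
  zone = env['ZONE']
  matching = all_hosts.select do |host|
    (filter.nil? || host[:name].include?(filter)) &&
      (zone.nil? || host[:zone] == zone)
  end
  matching.map { |host| host[:name] }
end

# Maps RECIPE=<name> to Recipes.recipe__<name> and runs it on the hosts.
def run_recipe(all_hosts, env)
  method_name = "recipe__#{env.fetch('RECIPE')}"
  unless Recipes.respond_to?(method_name)
    raise ArgumentError, "unknown recipe: #{env['RECIPE']}"
  end
  Recipes.public_send(method_name, select_hosts(all_hosts, env))
end

# Hypothetical inventory in place of real host discovery.
INVENTORY = [
  { name: 'app1', zone: 'us-west-2a' },
  { name: 'app2', zone: 'us-west-2b' },
  { name: 'db1',  zone: 'us-west-2a' }
].freeze

puts run_recipe(INVENTORY, 'HOSTFILTER' => 'app1', 'RECIPE' => 'reboot')
puts run_recipe(INVENTORY, 'ZONE' => 'us-west-2a', 'RECIPE' => 'reboot')
```

Passing the environment in as a hash rather than reading `ENV` directly keeps the sketch easy to exercise without touching the real process environment.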
Failure Friday: Blender
ZONE=us-west-2a ROLE=web-app RECIPE=monit_unmonitor
ZONE=us-west-2a ROLE=web-app RECIPE=monit_monitor
ZONE=us-west-2a ROLE=web-app RECIPE=unicorn_stop_master_gracefully
ZONE=us-west-2b ROLE=web-app RECIPE=unicorn_suspend_master
ZONE=us-west-2b ROLE=web-app RECIPE=unicorn_resume_master
ZONE=us-west-2c ROLE=web-app RECIPE=reboot
ZONE=us-west-2a ROLE=web-app RECIPE=iptables_network_isolate
ZONE=us-west-2a ROLE=web-app RECIPE=iptables_rebuild
ZONE=us-west-2b ROLE=web-app RECIPE=tc_add_latency
ZONE=us-west-2b ROLE=web-app RECIPE=tc_remove_latency
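The run list above can also be kept as data and expanded into command lines. The sketch below assumes each line is completed with `blend -f smoothie.rb`, as in the earlier execution slides. `STEPS` and `blend_command` are hypothetical names.

```ruby
# The Failure Friday run list as [zone, role, recipe] triples, in order.
STEPS = [
  %w[us-west-2a web-app monit_unmonitor],
  %w[us-west-2a web-app monit_monitor],
  %w[us-west-2a web-app unicorn_stop_master_gracefully],
  %w[us-west-2b web-app unicorn_suspend_master],
  %w[us-west-2b web-app unicorn_resume_master],
  %w[us-west-2c web-app reboot],
  %w[us-west-2a web-app iptables_network_isolate],
  %w[us-west-2a web-app iptables_rebuild],
  %w[us-west-2b web-app tc_add_latency],
  %w[us-west-2b web-app tc_remove_latency]
].freeze

# Assumption: every step is executed via `blend -f smoothie.rb`.
def blend_command(zone, role, recipe, file: 'smoothie.rb')
  "ZONE=#{zone} ROLE=#{role} RECIPE=#{recipe} blend -f #{file}"
end

commands = STEPS.map { |zone, role, recipe| blend_command(zone, role, recipe) }
puts commands
```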
Future AUTOMATION
Build more automation for service-specific scenarios.
Scheduled runs (similar to Netflix).
Future CHATOPS
Inject failures by invoking chat commands.
Share metrics and graphs to help people follow along.
Collect TODOs during Failure Fridays and generate a report.
Future NEW TYPES OF FAILURES
Distributed Denial of Service (DDoS) attacks for services.
Impediments that come up during Incident Response.
Summary FAILURES WILL HAPPEN
Anything that can go wrong, will go wrong.
Proactively test failure handling now.
Start simple.
PROPOSED EDIT
“Experiments that aren’t introducing new insights should be automated and used to monitor ongoing health of the system. New experiments should be devised to continue to push the bounds of the system.”
Culture From Chaos by @dougbarth https://speakerdeck.com/dougbarth/culture-from-chaos
Thank you.
@alperkokmen