pagerduty | oscon 2016 failure testing

Upload: pagerduty

Post on 15-Jan-2017

TRANSCRIPT

Page 1: PagerDuty | OSCON 2016 Failure Testing

@alperkokmen

Failure Testing AUTOMATING A SERIES OF UNFORTUNATE EVENTS

#OSCON

Page 2: PagerDuty | OSCON 2016 Failure Testing

Alper Kokmen

PRESENT

Software Engineer at PagerDuty

Surrounded by smart people

PAST

Start-ups, Microsoft

Surrounded by smart people

Page 3: PagerDuty | OSCON 2016 Failure Testing

Page 4: PagerDuty | OSCON 2016 Failure Testing

Goals

Start manually injecting failures. Start automating your manual tests.

Page 5: PagerDuty | OSCON 2016 Failure Testing

CHAOS ENGINEERING

“[T]he discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”

Principles of Chaos Engineering http://principlesofchaos.org

Page 6: PagerDuty | OSCON 2016 Failure Testing

Netflix Simian Army DIFFERENT SIMIANS FOR DIFFERENT FAILURES

Page 7: PagerDuty | OSCON 2016 Failure Testing

PagerDuty Simian Army?

Multiple cloud providers (AWS and Azure)

Experimentation

Application-specific failure scenarios

Page 8: PagerDuty | OSCON 2016 Failure Testing

PagerDuty Simian Human Army FAILURE FRIDAY

Time-boxed recurring meeting

Pre-announced agenda

Break things

Sign-off from service owners

Attendance

GROUND RULES

Keep monitoring & alerting

Abort if needed

Page 9: PagerDuty | OSCON 2016 Failure Testing

Failure Friday: Agenda

Page 10: PagerDuty | OSCON 2016 Failure Testing

Failure Friday: Process

Inject Failure

Monitor

Repeat
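
The inject, monitor, repeat cycle, combined with the "Abort if needed" ground rule, can be sketched as a small loop. This is a minimal Ruby sketch under stated assumptions, not PagerDuty's tooling: `run_failure_session`, the step hashes, and the `healthy` callback are hypothetical names introduced here for illustration.

```ruby
# Hypothetical sketch of the inject -> monitor -> repeat cycle.
# None of these names come from blender or smoothie.

# steps:   array of { name:, inject:, restore: } hashes holding callables
# healthy: callable that returns true while monitoring looks acceptable
# Returns the names of the steps that completed without an abort.
def run_failure_session(steps, healthy:)
  completed = []
  steps.each do |step|
    ok = false
    begin
      step[:inject].call               # inject the failure
      ok = healthy.call(step[:name])   # monitor while the failure is active
    ensure
      step[:restore].call              # undo the failure even if monitoring raised
    end
    break unless ok                    # abort the rest of the session if needed
    completed << step[:name]
  end
  completed
end

# Usage with stand-in callables that only record what would happen.
log = []
steps = %w[reboot tc_add_latency].map do |name|
  { name: name,
    inject: -> { log << "inject #{name}" },
    restore: -> { log << "restore #{name}" } }
end
result = run_failure_session(steps, healthy: ->(name) { name != 'tc_add_latency' })
puts result.inspect
puts log.inspect
```

Restoring inside `ensure` is a design choice of this sketch: a failed or aborted step still gets cleaned up before the session stops.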

Page 11: PagerDuty | OSCON 2016 Failure Testing

Failure Friday: Monitoring

Page 12: PagerDuty | OSCON 2016 Failure Testing

2 Years Later BENEFITS

System design

Knowledge sharing

Incident response training

Page 13: PagerDuty | OSCON 2016 Failure Testing

2 Years Later ACCOMPLISHMENTS

Whole DC outages

Target multiple services at once

Distribute failure testing to teams

Automation (in progress)

Page 14: PagerDuty | OSCON 2016 Failure Testing

Automation: Rationale

“MANY” HOSTS

- Distribute tasks to multiple people and keep executing manually.

- Watch Operations team with envy while they use chef and knife.

- Start automating.

Page 15: PagerDuty | OSCON 2016 Failure Testing

PagerDuty/blender A MODULAR ORCHESTRATION ENGINE

Ruby DSL

Host Discovery (blender-chef, blender-serf)

Ranjib Dey (@RanjibDey)

Page 16: PagerDuty | OSCON 2016 Failure Testing

PagerDuty/blender CODE

# example.rb
ssh_task 'update' do
  execute 'sudo apt-get update -y'
  members ['ubuntu01', 'ubuntu02', 'ubuntu03']
end

Page 17: PagerDuty | OSCON 2016 Failure Testing

PagerDuty/blender EXECUTION

blend -f example.rb

Run[example.rb] started
3 job(s) computed using 'Default' strategy
Job 1 [update on ubuntu01] finished
Job 2 [update on ubuntu02] finished
Job 3 [update on ubuntu03] finished
Run finished (42.228923876 s)

Page 18: PagerDuty | OSCON 2016 Failure Testing

PagerDuty/smoothie A SIMPLE LIBRARY OF BLENDER RECIPES

Chef Integration

Recipes for Disaster

CLI to Specify Recipes

Page 19: PagerDuty | OSCON 2016 Failure Testing

PagerDuty/smoothie REBOOT RECIPE

def recipe__reboot(hosts)
  ssh_task 'reboot' do
    members hosts
    execute 'sudo /sbin/reboot'

    # shutdown will break ssh connection.
    ignore_failure true
  end
end

Page 20: PagerDuty | OSCON 2016 Failure Testing

PagerDuty/smoothie UNICORN SUSPEND & RESUME RECIPES

def recipe__unicorn_suspend_master(hosts)
  ssh_task 'suspend unicorn[master] immediately' do
    members hosts
    execute 'sudo kill -s STOP `cat /u/.../pids/unicorn.pid`'
  end
end

def recipe__unicorn_resume_master(hosts)
  ssh_task 'resume unicorn[master] immediately' do
    members hosts
    execute 'sudo kill -s CONT `cat /u/.../pids/unicorn.pid`'
  end
end

Page 21: PagerDuty | OSCON 2016 Failure Testing

PagerDuty/smoothie LATENCY RECIPE

def recipe__tc_add_latency(hosts)
  ssh_task 'add network latency using tc' do
    members hosts
    execute 'sudo tc qdisc add dev eth0 root netem delay 500ms 100ms loss 20%'
  end
end

def recipe__tc_remove_latency(hosts)
  ssh_task 'remove network latency using tc' do
    members hosts
    execute 'sudo tc qdisc del dev eth0 root netem'
  end
end
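
The netem arguments in the latency recipe are a base delay (500ms), a jitter around that delay (100ms), and a packet loss percentage (20%). As a rough sketch, the two command lines could be built from named parameters so that each value is explicit. The helpers `netem_add_command` and `netem_del_command` are hypothetical and are not part of blender or smoothie; they only build strings and do not run `tc`.

```ruby
# Hypothetical helpers (not part of blender or smoothie) that build the
# tc/netem command lines used by the latency recipes.

# dev:       network interface name
# delay_ms:  base delay added to every packet
# jitter_ms: random variation applied around the base delay
# loss_pct:  percentage of packets dropped
def netem_add_command(dev: 'eth0', delay_ms: 500, jitter_ms: 100, loss_pct: 20)
  raise ArgumentError, "invalid device: #{dev.inspect}" unless dev.match?(/\A[A-Za-z0-9_.-]+\z/)
  [delay_ms, jitter_ms, loss_pct].each do |value|
    unless value.is_a?(Integer) && value >= 0
      raise ArgumentError, "expected a non-negative integer, got #{value.inspect}"
    end
  end
  raise ArgumentError, 'loss_pct must be at most 100' if loss_pct > 100

  "sudo tc qdisc add dev #{dev} root netem delay #{delay_ms}ms #{jitter_ms}ms loss #{loss_pct}%"
end

# Removes the netem qdisc again, restoring normal network behavior.
def netem_del_command(dev: 'eth0')
  raise ArgumentError, "invalid device: #{dev.inspect}" unless dev.match?(/\A[A-Za-z0-9_.-]+\z/)
  "sudo tc qdisc del dev #{dev} root netem"
end

puts netem_add_command
puts netem_del_command
```

With the default arguments the helpers reproduce the two command strings from the recipes; validating the interface name keeps arbitrary text out of a command that runs under `sudo`.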

Page 22: PagerDuty | OSCON 2016 Failure Testing

PagerDuty/smoothie EXECUTION

HOSTFILTER=app1 RECIPE=reboot blend -f smoothie.rb

def recipe__reboot(hosts)

Page 23: PagerDuty | OSCON 2016 Failure Testing

PagerDuty/smoothie EXECUTION

ZONE=us-west-2a RECIPE=reboot blend -f smoothie.rb

def recipe__reboot(hosts)
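
These two execution slides pair `RECIPE=reboot` with `def recipe__reboot(hosts)`, which suggests that the `RECIPE` variable selects a method named `recipe__<name>`. The slides do not show how smoothie performs that lookup, so the following is only a guess at one way to do it in plain Ruby. `run_recipe` is a hypothetical name, the recipe here is a stub that returns a description instead of calling blender, and host selection via `HOSTFILTER` or `ZONE` is left out (the host list is passed in directly).

```ruby
# Stub standing in for the blender-backed recipe on the slides; it only
# describes what it would do.
def recipe__reboot(hosts)
  "reboot on #{hosts.join(', ')}"
end

# Hypothetical dispatcher: maps RECIPE=<name> to the method recipe__<name>.
# env is anything that responds to fetch like a Hash (ENV also works).
def run_recipe(env, hosts)
  name = env.fetch('RECIPE') { raise ArgumentError, 'RECIPE is required' }
  # Restrict the name so arbitrary input cannot reach unrelated methods.
  unless name.match?(/\A[a-z0-9_]+\z/)
    raise ArgumentError, "invalid recipe name: #{name.inspect}"
  end
  method_name = "recipe__#{name}"
  # Top-level methods are private, hence the second argument.
  unless respond_to?(method_name, true)
    raise ArgumentError, "unknown recipe: #{name}"
  end
  send(method_name, hosts)
end

puts run_recipe({ 'RECIPE' => 'reboot' }, ['app1'])
```

Passing `ENV` as the first argument would read the variable from the command line, as in the invocations on the slides.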

Page 24: PagerDuty | OSCON 2016 Failure Testing

Failure Friday: Blender

ZONE=us-west-2a ROLE=web-app RECIPE=monit_unmonitor

ZONE=us-west-2a ROLE=web-app RECIPE=monit_monitor

ZONE=us-west-2a ROLE=web-app RECIPE=unicorn_stop_master_gracefully

ZONE=us-west-2b ROLE=web-app RECIPE=unicorn_suspend_master

ZONE=us-west-2b ROLE=web-app RECIPE=unicorn_resume_master

ZONE=us-west-2c ROLE=web-app RECIPE=reboot

ZONE=us-west-2a ROLE=web-app RECIPE=iptables_network_isolate

ZONE=us-west-2a ROLE=web-app RECIPE=iptables_rebuild

ZONE=us-west-2b ROLE=web-app RECIPE=tc_add_latency

ZONE=us-west-2b ROLE=web-app RECIPE=tc_remove_latency
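
The agenda above can also be kept as data, which makes it easy to print the command lines and to check that every disruptive step is followed by its restoring step in the same zone. This is a sketch under two assumptions that the slides do not confirm: the trailing `blend -f smoothie.rb` is carried over from the earlier execution slides, and the inject/restore pairing in `restore_for` is inferred from the recipe names alone. `command_for` and `unrestored_steps` are hypothetical helpers.

```ruby
# The Failure Friday agenda as [zone, recipe] pairs, in slide order.
agenda = [
  %w[us-west-2a monit_unmonitor],
  %w[us-west-2a monit_monitor],
  %w[us-west-2a unicorn_stop_master_gracefully],
  %w[us-west-2b unicorn_suspend_master],
  %w[us-west-2b unicorn_resume_master],
  %w[us-west-2c reboot],
  %w[us-west-2a iptables_network_isolate],
  %w[us-west-2a iptables_rebuild],
  %w[us-west-2b tc_add_latency],
  %w[us-west-2b tc_remove_latency]
]

# Assumed inject => restore pairing, inferred from the recipe names only.
restore_for = {
  'monit_unmonitor' => 'monit_monitor',
  'unicorn_suspend_master' => 'unicorn_resume_master',
  'iptables_network_isolate' => 'iptables_rebuild',
  'tc_add_latency' => 'tc_remove_latency'
}

# Builds one command line; the blend invocation itself is an assumption.
def command_for(zone, recipe, role: 'web-app')
  "ZONE=#{zone} ROLE=#{role} RECIPE=#{recipe} blend -f smoothie.rb"
end

# Returns the [zone, recipe] steps whose restoring recipe does not appear
# later in the agenda for the same zone.
def unrestored_steps(agenda, restore_for)
  agenda.each_with_index.reject do |(zone, recipe), index|
    restore = restore_for[recipe]
    restore.nil? || agenda.drop(index + 1).include?([zone, restore])
  end.map(&:first)
end

agenda.each { |zone, recipe| puts command_for(zone, recipe) }
puts "unrestored: #{unrestored_steps(agenda, restore_for).inspect}"
```

A check like this is cheap to run before the meeting and catches an agenda that would leave a zone isolated or slowed down after the session ends.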

Page 25: PagerDuty | OSCON 2016 Failure Testing

Future AUTOMATION

Build more automation for service-specific scenarios.

Scheduled runs (similar to Netflix).

Page 26: PagerDuty | OSCON 2016 Failure Testing

Future CHATOPS

Inject failures by invoking chat commands.

Share metrics and graphs to help people follow along.

Collect TODOs during Failure Fridays and generate a report.

Page 27: PagerDuty | OSCON 2016 Failure Testing

Future NEW TYPES OF FAILURES

Distributed Denial of Service (DDoS) attacks for services.

Impediments that come up during Incident Response.

Page 28: PagerDuty | OSCON 2016 Failure Testing

Summary FAILURES WILL HAPPEN

Anything that can go wrong, will go wrong.

Proactively test failure handling now.

Start simple.

Page 29: PagerDuty | OSCON 2016 Failure Testing

PROPOSED EDIT

“Experiments that aren’t introducing new insights should be automated and used to monitor ongoing health of the system. New experiments should be devised to continue to push the bounds of the system.”

Culture From Chaos by @dougbarth https://speakerdeck.com/dougbarth/culture-from-chaos

Page 30: PagerDuty | OSCON 2016 Failure Testing

Thank you.

@alperkokmen