a culture of automation - joe smith - devopsdays tel aviv 2017

39
Joe Smith November 14, 2017 1 A Culture Of Automation

Upload: devopsdays-tel-aviv

Post on 22-Jan-2018

64 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Joe Smith

November 14, 2017

1

A Culture Of Automation

Page 2: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Joe Smith

Operations Engineer, Slack Application Operations Team

● Build/Run core systems responsible for Slack

● CDNs, Edge Regions, Web tier, Websockets, development workflow, etc

● Previously:

○ Tech Lead, Aurora/Mesos SRE at Twitter

○ Internal Technology Resident at Google

Page 3: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Folks with the desire to build

resilient systems

These roles may have different names, but

share the above goal

● Reliability Engineering

● Operations

● DevOps

● Site Reliability Engineering

● Production Engineering

● Systems Engineering

Audience

Page 4: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Agenda

1. Running Production Services

2. Runbooks

3. Automation

Page 5: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Running Production Services

Page 6: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Two Pizza Rule

A team should be sized to share around two large pizzas

Page 7: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

“”

“We will take the site down today! 💥”

– No one when they wake up

Page 8: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

● Careful Planning and Procedures

● Extensive Documentation

● Good Communication

Strategies

Page 9: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Planning

● Some components are prioritized for speed while others are meant to be

canaried and analyzed

● Changes need to be staged to coordinate with each other

● Give teams the tools and visibility they need to make improvements and

understand impact

● Identify your rollback strategy ahead of time

Page 10: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Documentation

● Each code or procedure change should be paired with an update to easily-

readable text

● Help your teammates and yourself weeks from now when you need to

understand how things work.

● Do not just describe how systems are structured, explain why they are built

that way!

● Additional context can inform future decisions

Page 11: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Communication

● Change Management - Coordinating release schedules can be difficult

● Launch Channel - Announce changes, link to more details in a feature-

specific Slack channel for the change

● Add links to commits, code reviews, threads in Slack, mailing list posts, and

StackOverflow questions

● This enables your team to benefit from the research you've done

Page 12: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

● Careful Planning and Procedures

● Extensive Documentation

● Good communication

Strategies

Page 13: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

● Unexpected changes, forced roll-forwards

● Outdated Runbooks

● Missed Notifications

Growing Pains

Page 14: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Good Problems to Have

As the team grows, it's no

longer possible to understand

everything that's happening at

once.

The scope of work is also

increasing!

Page 15: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Runbooks

Page 16: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

“”

“Checklists for commonly

repeated operational tasks.”

– Slack, Runbook README

Page 17: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

1. Location

2. Format

3. Contents

Page 18: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Location

● Google Docs

○ Good Formatting, Mobile Apps, external service

● Wiki

○ Web Interface, Track Changes

● Markdown in git repo (paired with Github)

○ Formatting, offline support, normal Pull Request flow

Page 19: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Markdown in Git Repo

● Track changes across revisions

● Optional peer review

● Link to relevant sections

● Clone repo for offline support

Page 20: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Runbook Template

(thanks to my teammate Megan!)

● ApplicationServer

○ README.md

○ standard_actions.md

○ other_actions.md

○ alerts/

■ box_failure.md

■ some_alert_name.md

Page 21: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

README.md

Page 22: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

standard_actions.md

Page 23: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

other_action.md

Page 24: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

alert.md

Page 25: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

box_failure.md

Page 26: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Example

Page 27: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Content

This is not the place for Design

Documentation.

These are highly-actionable,

succinct descriptions of next

steps.

Page 28: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Automation

Page 29: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

“”

"Test until fear turns

to boredom."

– JUnit FAQ,

http://junit.sourceforge.net/doc/faq/faq.htm#tests_6

Page 30: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

“”

"Automate once fear

turns to boredom"

– Ancient SRE Corrollary

Page 31: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Beyond Runbooks

● Turn a manual checklist into a testable, repeatable set of steps anyone can

run

● Anytime you discover a sharp edge or workaround, this can be codified in the

tool

● Reduce sections of "but if this happens, check this dashboard and then do

one of three things"

Page 32: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

The Tooling WorkflowInitial Steps

This process can evolve over a long time and generally improve things.

● One person has all the knowledge in their head

● That person writes down everything they know in a runbook

● Someone sees an annoying or complicated piece and writes a small script

to be run instead for a tiny part of the process

The next jump will be the most difficult part!

Page 33: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

The Tooling WorkflowMaintenance

● It feels great to have written your first tool!

● You may be lucky and have no bugs

● Most likely there are some edge cases- that is okay and expected!

● Take some time to figure out what went wrong and how to make things better.

Page 34: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

The Tooling WorkflowCompletion

● Later on, another part of the process can be added in and the documentation

further updated

● Over time- the runbook becomes "Run this tool we wrote, send bugs to the

authors"

● Finally- there is no longer an entry! The tool is run automatically, or the

system itself is able to solve that problem

Page 35: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

“”

"The value of humans is to

execute Judgement, the value

of computers is to execute

instructions"

– Aron, teammate at Slack

Page 36: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Runbook to Automated Workflow

1. Brain

2. Runbook

3. Start of automation

4. Automation evolution (safeguards)

5. Self-contained Tool

6. Fully Automated

Page 37: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Process in Code

● Using libraries like fabric, pychef, and boto3 can ease automation

● When there are issues, the code can be reviewed for process changes, git

history can be consulted, etc

● No more "I forgot that was changed and followed the old process!"

● Each time someone submits an improvement or workflow tweak, that will

always be useful from now on!

Page 38: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Thank You!

38

For more information go to: slack.com/jobs

Page 39: A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017

Joe Smith

Operations Engineer, Slack Application Operations Team

● Come build the future of work!

○ https://slack.com/careers/641062/senior-site-reliability-engineer

● Please reach out and say hi!

○ @Yasumoto on Twitter

● Tools Scaffold

○ https://github.com/Yasumoto/tools