2016 - ignite - blameless system design
Blameless System Design
Douglas Land
Vast.com, Inc.
I break systems… a LOT
● Auth
● Syslog
● Chef
● Ambassadors
● Prod Frontends
Sometimes I ‘break’ systems on purpose...
● Service discovery by Chef
● 90% code in prod
● No shared storage for CloudStack
Sometimes you just need to do things.
Higher standards
And yet, I still hold others to a higher standard...
● Servers still on public internet???
● Created a flat VLAN when we did move to private IPs???
● No centralized management of virtualization infrastructure???
● The only 'shared storage' is via DRBD and ha.d???
Technical debtor’s prison
We’re obsessed with technical debt
Qualifying it:
● Application Debt
● Infrastructure Debt
● Architecture Debt
Quantifying it:
● size of code base
● code coverage
● coupling and cohesion reports
● cyclomatic complexity
● Halstead complexity measures
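As an illustration of what "quantifying" can look like, here is a minimal sketch that approximates cyclomatic complexity for a piece of Python source. It is not from the talk: the function name `cyclomatic_complexity`, the set of node types counted, and the `SAMPLE` snippet are all assumptions made for this example, and real tools count more constructs.

```python
import ast

# Node types treated as decision points. This set is an assumption of the
# sketch and only approximates McCabe's metric.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate complexity as 1 + the number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds one extra path per additional operand.
            complexity += len(node.values) - 1
    return complexity


# Hypothetical sample code to measure.
SAMPLE = '''
def classify(n):
    if n < 0 and n != -1:
        return "negative"
    for i in range(n):
        if i % 2:
            return "odd"
    return "other"
'''

print(cyclomatic_complexity(SAMPLE))  # prints 5
```

A number like this measures how branchy the code is. It says nothing about whether the branching was a deliberate trade-off.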
The myth of technical debt
Peter Norvig, “All code is liability”
Not actually technical debt:
● Maintenance
● Changes in understanding
● Operational inertia
● Poor code choices
● Dependency liabilities
So what is technical debt?
Technical debt is the set of choices we intentionally make to speed up the development or implementation of systems, and which we acknowledge will need to be changed later.
Technical debt is the result of an Efficiency-Thoroughness Trade-Off at an individual level.
Technical debt is the output of a project constraint model at an organizational level.
The blame game
Shouldn't we stop blaming people for making the trade-offs they're forced to make?
Being Blameless
● If we remove fear, we will have a more honest conversation about trade-offs
● If we're honest about those trade-offs, a crisis might be averted altogether
● If we understand our history, we won't be destined to repeat it
What is blameless system design?
Assuming goodwill
Blameless post-mortems
Empathy
Experimentation
Honesty
Communication
Assume Goodwill
Your co-worker probably doesn’t come into work every day with the intent of harming you or the organization.
Blameless Post-mortems
“We must strive to understand that accidents don’t happen because people gamble and lose. Accidents happen because the person believes that:
…what is about to happen is not possible,
…or what is about to happen has no connection to what they are doing,
…or that the possibility of getting the intended outcome is well worth whatever risk there is.”
- Erik Hollnagel
Empathy
● Reject ‘contempt culture’
● Focus on the positive
● Consider others’ perspectives
Experimentation
The Engineering Design Process
● Define the Problem
● Do Background Research
● Specify Requirements
● Brainstorm Solutions
● Choose the Best Solution
● Do Development Work
● Build a Prototype
● Test and Redesign
Honesty
● Publish ALL your results
● Document ALL your decisions
● Be honest about trade-offs
● Track mitigations
Communication
● Broadcast expectations
● Honor achievements
● Make docs easy to find
● Open discussions
● Well-defined feedback channels
Did someone say devops?
● Culture
● Measurement
● Sharing
● Feedback loops
The bad
It’s hard to change culture and move away from retribution and the root cause analysis (RCA) mentality.
It’s hard to get over hindsight bias.
It’s a lot of work to encourage openness and honesty, and define what that looks like.
It’s hard for people to get over impostor syndrome and/or contempt culture.
The good
● Remove fear
● Encourage ‘risk’
● Create feedback
● Reduce redundant learning
● Improve working environment, trust
Douglas Land - Director of operations, Vast.com, Inc.
[email protected] | @webuilddevops
Some References:
http://www.datical.com/blog/technical-debt-devops/
http://laughingmeme.org/2016/01/10/towards-an-understanding-of-technical-debt/
http://blog.aurynn.com/86/contempt-culture
http://erikhollnagel.com/ideas/etto-principle/index.html
http://indecorous.com/fallible_humans/
https://hbr.org/2003/05/it-doesnt-matter/ar/pr
https://codeascraft.com/2014/07/18/just-culture-resources/
http://sidneydekker.com/just-culture/