beyond the retrospective: embracing complexity on the road to service ownership

Kevina Finn-Braun (Intuit) and J. Paul Reed (Release Engineering Approaches), DevOps Enterprise Summit, 2016. Beyond the Retrospective: Embracing Complexity on the Road Towards Service Ownership

Upload: j-paul-reed

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

KEVINA FINN-BRAUN, INTUIT

J. PAUL REED, RELEASE ENGINEERING APPROACHES

DEVOPS ENTERPRISE SUMMIT, 2016

BEYOND THE RETROSPECTIVE: EMBRACING COMPLEXITY ON THE ROAD TOWARDS SERVICE OWNERSHIP

Page 2: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

KEVINA FINN-BRAUN

• Director of Product Infrastructure Service Management at Intuit

• Director of Site Reliability Service Management at Salesforce; Business Continuity at Yahoo

• Geeks out on group dynamics and behavior

• @kfinnbraun on Twitter

Page 3: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

J. PAUL REED

• @jpaulreed on Twitter

• @shipshowpodcast alum

• Managing Partner, Release Engineering Approaches

• A “DevOps Consultant™”

• Master’s Candidate in Human Factors & Systems Safety

@jpaulreed@kfinnbraun #DOES2016

Page 4: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

A QUICK RECAP FROM LAST DOES

“The Blameless Cloud: Bringing Actionable Retrospectives to SFDC,” DOES 2015

Page 5: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

NEW MARCHING ORDERS

Page 6: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

“SERVICE OWNERSHIP?”

Page 7: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

IT’S JUST WHAT SFDC CALLED “DEVOPS”

(SSHHH, DON’T TELL ANYONE)

Page 8: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

WHICH FLAVOR OF DEVOPS WOULD YOU LIKE?

Page 12: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

“BUT HOW DO WE DO ‘THE DEVOPS?’”

• Learned helplessness?

• Uncontrollable bad event

• Perceived lack of control

• Generalized helpless behavior

• Actually: Structural blindness

Page 13: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

MAKING SENSE OF SERVICE OWNERSHIP

Page 14: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

WORKSHOP SURPRISES!

• Understanding teams’ local rationality is key

• Words have meaning; meanings are important; but they aren’t necessarily shared

• Teams must be given space to deliver on transformations

• Teams can be “retrospective blind”

Page 15: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

DEVOPS & NUCLEAR MELTDOWNS?

Page 17: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

A NEW ADVENTURE

QuickBooks

TurboTax

Mint

FY 2016: $4.7b revenue

8,000 employees worldwide

Founded: 1983

Improving the financial lives of over 45 million customers

IPO: 1993

Page 22: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

SOME DIFFERENT CHALLENGES

• Intuit not “born in the cloud”

• “Incidents” meant something different

• No “Bermuda Blob”

• (No blob at all!)

• Different business lifecycle

Page 23: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

BUT SIMILAR CHALLENGES, TOO

• Inconsistencies in operational responses

• Postmortems centered around “The Old View” of human error

• Some incidents & remediations got lost in the shuffle

• Surprising amount of (aggregated) service impact due to P3s/P4s

• “What, exactly, is an ‘incident?’”
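The point about aggregated impact from P3s/P4s can be illustrated with a small tally. This is only a sketch with invented numbers: the `Incident` record, its field names, and the sample data are hypothetical and are not taken from the talk. It shows how many small, low-priority incidents can add up to more total impact than one large, high-priority incident.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    priority: str        # e.g. "P1" through "P4"
    impact_minutes: int  # minutes of degraded or lost service


def impact_by_priority(incidents):
    """Sum service-impact minutes per priority level."""
    totals = defaultdict(int)
    for incident in incidents:
        totals[incident.priority] += incident.impact_minutes
    return dict(totals)


# Invented sample data: one large P1 and many small P3/P4 incidents.
sample = (
    [Incident("P1", 90)]
    + [Incident("P3", 20)] * 6
    + [Incident("P4", 5)] * 10
)

totals = impact_by_priority(sample)
low_priority_total = totals.get("P3", 0) + totals.get("P4", 0)
print(totals)
print("P3+P4 minutes:", low_priority_total)
```

With real incident data, the same tally is one way to check whether low-priority incidents deserve more attention than their individual labels suggest.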

Page 24: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

“BLAMELESS” “POSTMORTEMS”?

• Brené Brown, research sociologist, on vulnerability

• “Blame is a way to discharge pain and discomfort”

• Postmortem has a heavy connotation

• “Awesome postmortems?” Really?!

• More at: http://jpaulreed.com/blame-aware-postmortems

Page 25: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

[Matrix: rows Language and Behaviors; columns Novice, Beginner, Competent, Proficient, Expert]

Page 33: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

THE INCIDENT LIFECYCLE

• Incident Detection

• Incident Response

• Incident Analysis

• Incident Remediation

• Incident Prevention*

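The five lifecycle phases, each described by language and behaviors at five levels from Novice to Expert, can be held in a small nested structure. The sketch below is one hypothetical encoding: the names `STAGES`, `LEVELS`, `Cell`, and `empty_model` are assumptions made for illustration and do not come from the talk or from the authors' repository.

```python
from dataclasses import dataclass, field

# Phase and level names as they appear on the slides.
STAGES = ("Detection", "Response", "Analysis", "Remediation", "Prevention")
LEVELS = ("Novice", "Beginner", "Competent", "Proficient", "Expert")


@dataclass
class Cell:
    """One phase/level cell: what teams say and what they do."""
    language: list = field(default_factory=list)
    behaviors: list = field(default_factory=list)


def empty_model():
    """Build an empty phase -> level -> Cell mapping."""
    return {stage: {level: Cell() for level in LEVELS} for stage in STAGES}


model = empty_model()

# Example entries taken from the Incident Analysis matrix (Novice level).
model["Analysis"]["Novice"].language.append(
    "Incidents are bad; my job is on the line."
)
model["Analysis"]["Novice"].behaviors.append(
    'Completes the post-incident "paperwork."'
)

print(len(model), "phases x", len(LEVELS), "levels")
```

Keeping the model as plain data like this would make it easy for a team to mark which cell best matches what they currently say and do in each phase.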

Page 34: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

INCIDENT DETECTION

Page 35: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

Novice, Beginner, Competent, Proficient, Expert

Language

• “Problems with our service are obvious; outages are obvious.”

• “Other teams will notify us of any problems.”

• “Most of the time, we’re the first to know when a service is impacted.”

• “We use historical data to guess at service level changes.”

• “We’ve detected service level transitions via monitoring and reduced MTTD.”

• “I know which specific code/infra change caused this service level change; here’s how I know…”

• “We prioritize feature requests and bug reports to monitoring hooks; monitoring is a 1st class citizen.”

• “We’ve decoupled code/infra deployment, because we can roll back/forward.”

• “We’re not paged anymore for changes automation can react to.”

Behaviors

• Manual and/or external outage notifications.

• No baseline metrics/service levels are broadly bucketed.

• External monitoring is in place to detect real time service transitions.

• Notifications are automated.

• External infra/API endpoints/outward-facing interfaces monitored/recorded.

• Historical data exists and has been used to establish graduated service baselines.

• Application internals report data to the monitoring system.

• Monitoring systems employ deep statistical methods to (dis)prove service anomalies.

• Monitoring output is reincorporated into operational behavior in an automated fashion.

• Anomalies no longer result in defined “incidents.”
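The detection behaviors mention using historical data to establish service baselines and statistical methods to (dis)prove anomalies. As a deliberately simple stand-in for those methods, the sketch below applies a z-score test against a historical baseline. The function name, the threshold of 3.0, and the latency samples are all assumptions made for illustration, not anything described in the talk.

```python
from statistics import mean, stdev


def is_anomalous(history, observed, threshold=3.0):
    """Return True if `observed` lies more than `threshold` sample standard
    deviations from the mean of `history` (a basic z-score test)."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any different value is a deviation.
        return observed != baseline
    return abs(observed - baseline) / spread > threshold


# Invented latency samples (milliseconds) standing in for historical data.
latency_history_ms = [100, 102, 98, 101, 99, 100, 103, 97]

print(is_anomalous(latency_history_ms, 101))  # close to the baseline
print(is_anomalous(latency_history_ms, 160))  # far from the baseline
```

A fixed z-score threshold assumes roughly stable, symmetric data; real service metrics often have daily or weekly cycles, so treat this as a starting point only.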

Page 36: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

INCIDENT RESPONSE

Page 37: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

Novice, Beginner, Competent, Proficient, Expert

Language

• “Have you tried turning it off and turning it on again?”

• “Something is wrong with the X…”

• “I think X is familiar with Y; let’s find them.”

• “I think there’s a problem with the database, network, etc.”

• Standard Incident Management System language used.

• “The deployment caused the database to hang…”

• “The infrastructure on-calls: perform a system status & report back to the IC.”

• Entire team is familiar with standardized IMS language.

• Standardized IMS language is used/valued by the entire team.

• “What parts of the service did not ‘self-heal’ and need attention?”

Behaviors

• Team is event-focused; the team is “alarmed” by incidents.

• Inconsistent response once incident has commenced.

• Response based on “tribal knowledge.”

• Team is area-focused.

• Team is action-focused.

• Team has identified incident “responders,” and those people know their duties.

• Team is technology-focused.

• Incident response is an aspect of org and team “culture.”

• Incidents are embraced, but outside-business hours or repeated incidents are considered inhumane.

• Team is systems-focused.

Page 38: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

INCIDENT ANALYSIS

Page 39: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

Language

• Novice: “Incidents are bad; my job is on the line.” “I’m getting sent to the principal’s office because of this outage.”

• Beginner: “Let’s fix this as fast as possible.” “What’s the correct fix to avoid this specific issue in the future?”

• Competent: “Let’s review the timeline/incident report to answer that.” “We need to find the root cause of this incident.”

• Proficient: “Now that we’ve established what happened, how did it happen?” “How did these multiple factors influence our complex system?”

• Expert: “How does our team/system contribute to our successes?” “What can we incorporate from this incident to better respond next time?”

Behaviors

• Novice: Completes the post-incident “paperwork.” No formal retrospective/hallway retrospectives.

• Beginner: Some information (inconsistently) recorded. Jumps to a focus on why.

• Competent: Follows the prescribed format for retrospectives. Possesses and incorporates complete dataset for the incident into the retrospective.

• Proficient: Identifies inherent bias in self and others. Perspectives solicited from all involved team members/functional groups.

• Expert: Able to facilitate retrospectives by healthily helping others address tendency to blame/personal & systemic bias. Retrospective outcomes are fed back into the system and prioritized.

Page 40: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

INCIDENT REMEDIATION

Page 41: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

Novice, Beginner, Competent, Proficient, Expert

Language

• “Let’s just file a ticket to track the issue.”

• “I am sure this is the issue; the fix will correct 100% of the occurrences.”

• “I’m pretty sure we already fixed this?”

• “We need an action plan to address the process gaps.”

• “This needs to be fixed in the next release and documented in our incident response docs.”

• “We need to look deeper than this specific incident to really address the problem.”

• “What can we learn from this incident?”

• “What other system aspects have we learned from this incident? How can we use that?”

• “While operating our system today, how did we actively create & sustain success?”

Behaviors

• Remediation activities (or lack thereof) contribute to a “break-fix” cycle.

• Discussions of the incident are aggressive/blameful.

• “Low hanging fruit” may be fixed, but not documented or incorporated into team behavior.

• More processes, more procedures, more rules.

• Issues of all sizes are actively managed.

• Issues have a priority and teams have bandwidth to address them.

• Completed issue remediation is valued by the org.

• Bandwidth exists to discuss, design and implement resiliency improvements.

• Remediation is not regarded as a separate activity & is culturally integrated into work.

• Resilience is considered in the design phase for new infra/software.

Page 42: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

INCIDENT PREVENTION*

Page 43: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

Novice, Beginner, Competent, Proficient, Expert

Language

• “Preventing future incidents is difficult because of lacking data.”

• “We can use predictive metrics to completely avoid future incidents.”

• “Our system has reasonable coverage of its metrics.”

• “We use metrics to inform attack/risk surface.”

• “We use trend analysis to raise ‘soft’ problems to operators.”

• “Old documentation is problematic and dealt with accordingly.”

• “When we started game days, it was a real mess.”

• “We now care less about specific incidents & more about crew formation.”

• “The team is excited about game days.”

• “Our crews care about their formation and dissolution.”

Behaviors

• Prevention efforts include documentation, process design, metrics collection.

• Retrospective focus is on static causes/effects.

• Retrospectives include discussions of active operator behaviors.

• Docs, process, metrics established, but < 100%.

• Preventative focus is on reviewing docs+process+metrics collection, but in a day-to-day context.

• Retrospectives focus on the response of the team to an incident.

• We actively inject failure into our systems on a known schedule, to drill.

• We review our response to induced failures.

• The crew formation/dissolution process is considered our primary role+responsibility in addressing and preventing operational failure.

• We actively inject failure at random intervals.
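The prevention behaviors distinguish injecting failure on a known schedule from injecting it at random intervals. The sketch below only computes when drills would happen under each approach; it does not inject any failure. The function names, the use of exponentially distributed gaps, and the weekly interval are assumptions chosen for illustration, not a description of how the speakers run game days.

```python
import random


def fixed_schedule(start, interval, count):
    """Drill times at a known, regular interval (for example, planned game days)."""
    return [start + interval * i for i in range(1, count + 1)]


def random_schedule(start, mean_interval, count, seed=None):
    """Drill times separated by exponentially distributed gaps, so the
    timing of the next drill is not predictable from the previous one."""
    rng = random.Random(seed)
    times = []
    current = start
    for _ in range(count):
        # expovariate takes the rate (1 / mean gap) as its argument.
        current += rng.expovariate(1.0 / mean_interval)
        times.append(current)
    return times


# Times are hours from an arbitrary starting point; 168 hours is one week.
print(fixed_schedule(start=0, interval=168, count=4))
print(random_schedule(start=0, mean_interval=168, count=4, seed=42))
```

Passing a `seed` makes the random schedule reproducible for review; leaving it as `None` gives a different schedule on each run.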

Page 44: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

HELP US MAKE IT BETTER!

https://github.com/preed/incident-lifecycle-model

Page 45: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

FACILITATE TEAMS EXPLORING THEIR DISCRETIONARY SPACE

Page 47: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

INCIDENT RESPONSE != INCIDENT MANAGEMENT

(YOUR INCIDENT VALUE STREAM MATTERS)

Page 48: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

YOU ARE NEVER DONE.

Page 49: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

YOU. ARE. NEVER. DONE.

Page 50: Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership

AVENUES FOR COLLABORATION

• Take a look at the extended incident lifecycle model and your organization: see where it fits and doesn’t!

• (And then send us GitHub pull requests!)

• Compare your own (documented?) incident life cycle against your actual incident value stream; share what you find!