Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Escalating Scenarios: A Deep Dive Into Outage Pitfalls. John Allspaw, Velocity London 2012. Wednesday, October 3, 2012

Upload: john-allspaw

Post on 28-Jan-2015

DESCRIPTION

When things go wrong, our judgement is clouded at best, blinded at worst. In order to successfully navigate a large-scale outage, being aware of potential gaps in knowledge and context can help make for a better outcome. The Human Factors and Systems Safety community has been studying how people situate themselves, coordinate amongst a team, use tooling, make decisions, and keep their cool under sometimes very stressful and escalating scenarios. We can learn from this research in order to adopt a more mature stance when the s*#t hits the fan. We’re going to look closely at how people behave under these circumstances using real-world examples and scan what we can learn from High Reliability Organizations (HROs) and fields such as aviation, military, and trauma-driven healthcare.

TRANSCRIPT

Page 1: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Escalating Scenarios

A Deep Dive Into Outage Pitfalls

John Allspaw

Velocity London 2012

Wednesday, October 3, 2012

Page 2: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

TROUBLESHOOTING

This is NOT about troubleshooting

Or, not just about troubleshooting

Page 3: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

LAYOUT

• Criteria

• Situational Awareness

• HROs

• Decision Making

• Communication

• Team Coordination

• A little bit of psychology

Page 4: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 5: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 6: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

How important is this?

Page 7: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 8: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 9: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 10: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 11: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 12: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Oct 2011

Sept 2012

Page 13: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 14: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 15: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 16: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 17: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Where to learn from?

Page 18: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

TMI

Page 19: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 20: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 21: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Kegworth 1989

Page 22: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Dr. Richard Cook, Velocity US 2012

http://www.youtube.com/watch?v=R_PDc0HFdP0

Page 23: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

“The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea”

Rochlin, La Porte, and Roberts. Naval War College Review, 1987

http://govleaders.org/reliability.htm

Page 24: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 25: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 26: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

What Goes On In Our Heads?

Page 27: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Jens Rasmussen, 1983

Senior Member, IEEE

“Skills, Rules, and Knowledge; Signals, Signs, and Symbols, and Other Distinctions in Human Performance Models”

IEEE Transactions On Systems, Man, and Cybernetics, May 1983

Page 28: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

SKILL-BASED: Simple, routine

RULE-BASED: Knowable, but unfamiliar

KNOWLEDGE-BASED: WTF IS GOING ON?

(Reason, 1990)

Page 29: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Situational Awareness

"the perception of elements in the environment within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future" (Endsley, 1995)

"keeping track of what is going on around you in a complex, dynamic environment" (Moray, 2005, p. 4)

"knowing what is going on so you can figure out what to do" (Adam, 1993)

Page 30: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

OODA Loop

• Observe: Metrics, Monitoring, Alerting, Alarming

• Orient: Analysis, Visualization, Correlation

• Decide: Planning, Resourcing

• Act: Execution

credit: http://blog.b3k.us/ooda.html

Page 31: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Canonical Work

“Towards a Theory of Situational Awareness”

Mica Endsley, Human Factors (1995)

http://www.satechnologies.com/Papers/pdf/Toward%20a%20Theory%20of%20SA.pdf

Page 32: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Situational Awareness

Level I: Perception

Level II: Comprehension

Level III: Projection

Page 33: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Situational Awareness

[Diagram: model of situational awareness (Endsley)]

• Flow: State of the environment, then Situational Awareness (LEVEL I: Perception of elements in current situation; LEVEL II: Comprehension of current situation; LEVEL III: Projection of future status), then Decision, then Performance of actions, with Feedback to the state of the environment

• Task/System Factors: System capability, Interface design, Stress and workload, Complexity, Automation

• Individual Factors: Goals and objectives; Preconceptions (expectations); Information processing mechanisms; Long term memory states; Automaticity; Abilities, Experience, Training

Page 34: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Level One: Perception

Page 35: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 36: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Context

Can you spot the anomaly?

Page 37: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

24 hours

Context

Page 38: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Context

7 days

Page 39: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Context

Normal But Noisy
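
The "Context" slides suggest that whether a reading looks anomalous depends on how much history sits behind it. A minimal sketch of that idea follows, assuming a simple standard-deviation comparison; the sample values and the function name `deviation_from_baseline` are hypothetical and are not taken from the talk.

```python
from statistics import mean, pstdev

def deviation_from_baseline(history, latest):
    """Return how many standard deviations `latest` sits from the mean of `history`."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A perfectly flat history: any different value is infinitely unusual.
        return 0.0 if latest == mu else float("inf")
    return (latest - mu) / sigma

# Hypothetical hourly samples: a quiet last day inside a noisy week.
last_day = [100, 102, 98, 101, 99, 100] * 4           # 24 samples
last_week = [100, 180, 90, 170, 110, 190, 95] * 24    # 168 samples
latest = 170

short_view = deviation_from_baseline(last_day, latest)   # judged against 24 hours
long_view = deviation_from_baseline(last_week, latest)   # judged against 7 days
```

Against the 24-hour window the reading looks extreme; against the 7-day window it sits inside the usual noise, which is the "normal but noisy" case.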

Page 40: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Level Two: Comprehension

Page 41: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Level Two

Page 42: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Mental Models

• Categorization & Comprehension

• Mental “map” or “schema”

• Informed by experience, stored in memory

Page 43: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Mental Models

Page 44: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Level Three

Page 45: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Mental Models

Page 46: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Level Three

Common Clues you’re losing SA at this level

• Ambiguity

• Fixation

• Confusion

• Lack of Information

• Failure to maintain

• Failure to meet expected checkpoint or target

• Failure to resolve discrepancies

• A bad gut feeling that things are not quite right

Page 47: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Characteristics of response to escalating scenarios

Page 48: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

...tend to neglect how processes develop within time (awareness of rates) versus assessing how things are in the moment

Characteristics of response to escalating scenarios

“On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980

Page 49: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

...have difficulty in dealing with exponential developments (hard to imagine how fast something can change, or accelerate)

Characteristics of response to escalating scenarios

“On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980
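
To make the point about rates and exponential developments concrete, here is a rough sketch that projects the same observation two ways. The queue numbers and both function names are hypothetical, chosen only for illustration.

```python
import math

def hours_until_full_linear(used, capacity, growth_per_hour):
    """Hours until `used` reaches `capacity` if it grows by a fixed amount per hour."""
    if growth_per_hour <= 0:
        return math.inf
    return max(0.0, (capacity - used) / growth_per_hour)

def hours_until_full_exponential(used, capacity, growth_factor_per_hour):
    """Hours until `used` reaches `capacity` if it multiplies by a fixed factor per hour."""
    if growth_factor_per_hour <= 1 or used <= 0:
        return math.inf
    if used >= capacity:
        return 0.0
    return math.log(capacity / used) / math.log(growth_factor_per_hour)

# Hypothetical queue: 500 items an hour ago, 1,000 now, room for 64,000.
# The same observation reads as "+500 per hour" or as "doubling every hour".
linear_estimate = hours_until_full_linear(1000, 64000, 500)
exponential_estimate = hours_until_full_exponential(1000, 64000, 2)
```

Read as steady growth, the queue has days of headroom; read as doubling, it has about six hours. A snapshot of the current depth alone cannot tell the two apart.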

Page 50: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

...inclined to think in causal SERIES, instead of causal NETS.

A therefore B,

instead of

A, therefore B and C (therefore D and E), etc.

Characteristics of response to escalating scenarios

“On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980
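
The difference between a causal series and a causal net can be sketched as a graph walk. This is an illustrative assumption, not something from the talk: the slide does not say which of B and C leads to D or E, so the edges below are hypothetical, as is the function name `downstream_effects`.

```python
from collections import deque

def downstream_effects(causes, start):
    """Return every effect reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for effect in causes.get(node, []):
            if effect not in seen:
                seen.add(effect)
                order.append(effect)
                queue.append(effect)
    return order

# Series thinking: A therefore B.
causal_series = {"A": ["B"]}
# Net thinking: A therefore B and C, which in turn lead to D and E (assumed edges).
causal_net = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}

series_view = downstream_effects(causal_series, "A")
net_view = downstream_effects(causal_net, "A")
```

The series view stops at B, while the net view surfaces C, D, and E as well, which is the side-effect territory that tends to get missed under pressure.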

Page 51: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Requisite Memory Trap

SA Pitfalls

Page 52: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Workload, anxiety, fatigue, other stressors

SA Pitfalls

Page 53: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Data Overload

SA Pitfalls

Page 54: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Misplaced Salience

SA Pitfalls

Page 55: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

http://www.perceptualedge.com/articles/Whitepapers/Dashboard_Design.pdf

Page 56: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

http://www.perceptualedge.com/articles/Whitepapers/Dashboard_Design.pdf

Page 57: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 58: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Complexity Creep

“Everything should be as simple as it can be, but not simpler.” - paraphrased, Einstein

SA Pitfalls

Page 59: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Poor Mental Models

SA Pitfalls

Page 60: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Out-Of-The-Loop Syndrome

SA Pitfalls

Page 61: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Refusal to make decisions

SA Pitfalls

Page 62: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Non-communicating lone wolf-isms

Heroism

Page 63: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Irrelevant noise in comm channels

Distraction

Page 64: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 65: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

• Divide and conquer applied to problem space, division of labor

• Incident resolution vs. Problem resolution

• Reproducibility

• Fault Tolerance Effects

TEAMS

Page 66: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Shotgun debugging

TEAMS

Page 67: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

• Interpredictability

• Common Ground

• Directability

JOINT ACTIVITY

http://csel.eng.ohio-state.edu/woods/distributed/CG%20final.pdf

Page 68: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Interpredictability

Page 69: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Common Ground

Page 70: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Directability

Page 71: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Improvisation

Page 72: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

IMPROVISATION

Page 73: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

IMPROVISATION

Page 74: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Improvisation

“...you can’t improvise on nothing; you got to improvise on something.”

Charles Mingus

Page 75: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

[Diagram: problem-solving cycle]

• Detect the Problem/Opportunity

• Represent the problem

• Diagnose the problem

• Generate a course of action

• Apply Leverage Points

• Evaluate

Page 76: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Communication Recommendations

• Explicitness

• Assertiveness

• Timing

Page 77: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Assertiveness

• Passive

• Assertive

• Aggressive

Page 78: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 79: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Exercise

Page 80: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Communication

• IRC?

• Face-To-Face?

• Conference Call?

• Morse Code?

Page 81: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Kegworth 1989

Page 82: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

[Diagram: Sender (Meaning → Encode) → Transmission → Receiver (Decode → Meaning)]

Page 83: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

[Diagram: Sender (Meaning → Encode) → Transmission → Receiver (Decode → Meaning)]

Page 84: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

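[Diagram: Sender (Meaning → Encode) → Transmission → Receiver (Decode → Meaning), plus the return path with the roles reversed: the Receiver becomes Sender (Meaning → Encode) → Transmission → the original Sender becomes Receiver (Decode → Meaning)]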

Page 85: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

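[Diagram: Sender (Meaning → Encode) → Transmission → Receiver (Decode → Meaning), plus the return path with the roles reversed: the Receiver becomes Sender (Meaning → Encode) → Transmission → the original Sender becomes Receiver (Decode → Meaning)]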

Page 86: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Feedback: Informational

Page 87: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Feedback: Corrective

Page 88: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Feedback: Reinforcing

Page 89: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Decision Making

Naturalistic Decision Making (NDM)

Gary Klein

Page 90: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Decision Making

Step One: What is the problem?

Page 91: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Decision Making

Step Two: What shall I do?

Page 92: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Recognition-Primed Decisions

Decision Making

Page 93: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Rule-Based Decisions

Decision Making

Page 94: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Choice decisions

Decision Making

Page 95: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Creative decisions

Decision Making

Page 96: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Decision Making

Creative, Choice, Rule-Based, RPD

Toward RPD: decreasing cognitive effort, decreasing effects of stress

Toward Creative: increasing cognitive effort, increasing effects of stress

Page 97: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

PRE-Mortem

Decision Making

Page 98: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Tooling

Page 99: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

??

Page 100: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Metric, Time Period

Page 101: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Controls

Page 102: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

ALERTS

• Meant to boost SA

• Alarm overload

• High false alarm rates

• Routinely disable alerts

Page 103: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Alert Reliability

Page 104: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

ALERT DESIGN

• Signal:Noise can be difficult

• Easy to err on more false alarms

• Decay in trust

• Origins: Undetectable conditions

Page 105: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

ALERT DESIGN

Confirmation

Page 106: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

ALERT DESIGN

Expectancy

Page 107: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

ALERT DESIGN

Page 108: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

ALERT DESIGN

• Don’t make people singularly reliant on alarms

• Support alarm confirmation activities

• Make alarms unambiguous

• Reduce, reduce, reduce false alerts

• Set missed/false alert trade-offs appropriately
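
One way to reason about the missed/false alert trade-off is to replay a threshold against labelled history and count the outcomes. The sketch below assumes such a labelled history exists; the data and the function name `alert_outcomes` are hypothetical.

```python
def alert_outcomes(samples, threshold):
    """Count alert outcomes for (value, was_real_problem) samples at a threshold."""
    hits = misses = false_alarms = 0
    for value, was_real_problem in samples:
        fired = value >= threshold
        if fired and was_real_problem:
            hits += 1          # alert fired and the problem was real
        elif fired:
            false_alarms += 1  # alert fired but nothing was wrong
        elif was_real_problem:
            misses += 1        # real problem, no alert
    return {"hits": hits, "misses": misses, "false_alarms": false_alarms}

# Hypothetical history: (error rate in percent, was it a real problem?)
history = [(1, False), (2, False), (3, False), (4, True), (5, False),
           (6, True), (7, False), (8, True), (9, True)]

sensitive = alert_outcomes(history, threshold=4)
strict = alert_outcomes(history, threshold=8)
```

With this data the sensitive threshold misses nothing but raises two false alarms, while the strict one raises none but misses two real problems. Which side to favor is a judgment call about the cost of each kind of error.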

Page 109: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

ALERT DESIGN

• Use multiple modalities

• Minimize alarm disruptions to ongoing activities

• Support the assessment/diagnosis of multiple alerts

• Support global SA of systems in an alarm state

Page 110: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Mature Role of Automation

http://www.bainbrdg.demon.co.uk/Papers/Ironies.html

“Ironies of Automation” - Lisanne Bainbridge

Page 111: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Mature Role of Automation

• Moves humans from manual operator to supervisor

• Extends and augments human abilities, doesn’t replace them

• Doesn’t remove “human error”

• Are brittle

• Recognizes that there is always discretionary space for humans

• Recognizes the Law of Stretched Systems

Page 112: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

SUMMARY

Page 113: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

So what can we do?

“In preparing for battle, I have always found that plans are useless but planning is indispensable.”

- Eisenhower

Page 114: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

So what can we do?

We develop our Non-Technical Skills

• Situational Awareness

• Communication

• Decision Making

• Improvisation

• Crew Resource Management (CRM)

Page 115: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

So what can we do?

We tailor our environment to adapt

• Tooling to support SA

• Learning from outages (PostMortem)

• Anticipating problems (PreMortem)

• Gather Meta-Metrics
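
The slide does not say which meta-metrics to gather. One plausible reading, offered here purely as an assumption, is data about the incidents themselves, such as time to detect and time to resolve. A minimal sketch with a hypothetical timeline and function name:

```python
from datetime import datetime

def incident_meta_metrics(started, detected, resolved):
    """Return time-to-detect and time-to-resolve in minutes for one incident."""
    return {
        "minutes_to_detect": (detected - started).total_seconds() / 60,
        "minutes_to_resolve": (resolved - started).total_seconds() / 60,
    }

# Hypothetical incident timeline, not taken from the talk.
example = incident_meta_metrics(
    started=datetime(2012, 10, 3, 9, 0),
    detected=datetime(2012, 10, 3, 9, 12),
    resolved=datetime(2012, 10, 3, 10, 30),
)
```

Tracked across many incidents, numbers like these could show whether detection and response are trending better or worse over time.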

Page 116: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

Page 117: Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls

THE END