Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Upload: brian-troutwine

Post on 16-Apr-2017

TRANSCRIPT

Page 1: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Instrumentation as a Living Documentation

TEACHING HUMANS ABOUT COMPLEX SYSTEMS

Page 2: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

I do things to/with computers.

Page 3: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

I build real-time systems.

Page 4: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

I build distributed systems.

Page 5: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

I build critical systems.

Page 6: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

AdRoll

Page 7: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

LESS THIS

Page 8: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

MORE THIS

Page 9: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

WE’RE AN AD TECH COMPANY.

Page 10: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

REAL-TIME BIDDING

Page 11: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

The nature of the problem domain:

• Low latency (< 100 ms per transaction)

• Firm real-time system

• Highly concurrent (> 55 billion transactions per day)

• Global, 24/7 operation
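
A quick back-of-the-envelope calculation makes these figures concrete. The sketch below assumes the 55 billion daily transactions are spread evenly over the day (real traffic is not) and that each transaction stays in flight for the full 100 ms bound. The variable names are illustrative and do not come from the talk.

```python
# Back-of-the-envelope numbers for the bullets above.
# Assumption: traffic is spread evenly across the day (in practice it is not).
TRANSACTIONS_PER_DAY = 55_000_000_000   # "> 55 billion transactions per day"
LATENCY_BOUND_S = 0.100                 # "< 100 ms per transaction"
SECONDS_PER_DAY = 24 * 60 * 60

# Average arrival rate over the whole day.
avg_per_second = TRANSACTIONS_PER_DAY / SECONDS_PER_DAY

# Little's law (L = lambda * W): average number of transactions in flight
# if every transaction takes the full latency bound.
avg_in_flight = avg_per_second * LATENCY_BOUND_S

print(f"average rate: {avg_per_second:,.0f} transactions/second")
print(f"in flight at the latency bound: {avg_in_flight:,.0f}")
```

Under these assumptions the average works out to roughly 640,000 transactions per second, with tens of thousands in flight at any instant. Peaks will be higher than the average.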

Page 12: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

I build Complex Systems

Page 13: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Complex Systems

• Non-linear feedback

• Tightly coupled to external systems

• Difficult to model, understand

• Usually a solution to some “wicked problem”

Page 14: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

C. WEST CHURCHMAN, “GUEST EDITORIAL: WICKED PROBLEMS”, MANAGEMENT SCIENCE VOL. 4, 1967

[WICKED PROBLEMS ARE] SOCIAL PROBLEMS WHICH ARE ILL FORMULATED, WHERE THE INFORMATION IS CONFUSING, WHERE THERE ARE MANY CLIENTS AND DECISION-MAKERS WITH CONFLICTING VALUES, AND WHERE THE RAMIFICATIONS IN THE WHOLE SYSTEM ARE THOROUGHLY CONFUSING. […] THE ADJECTIVE ‘WICKED’ IS SUPPOSED TO DESCRIBE THE MISCHIEVOUS AND EVEN EVIL QUALITY OF THESE PROBLEMS, WHERE PROPOSED ‘SOLUTIONS’ OFTEN TURN OUT TO BE WORSE THAN THE SYMPTOMS.

Page 15: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Bad things happen when Complex Systems fail.

Page 16: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Complex Systems often create worse problems than those they solve.

Page 17: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

HUMANS ARE BAD AT PREDICTING THE PERFORMANCE OF COMPLEX SYSTEMS (…). OUR ABILITY TO CREATE LARGE AND COMPLEX SYSTEMS FOOLS US INTO BELIEVING THAT WE’RE ALSO ENTITLED TO UNDERSTAND THEM.

CARLOS BUENO, “MATURE OPTIMIZATION HANDBOOK”

Page 18: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

The key challenge to sustaining a complex system is maintaining our understanding of it.

Page 19: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

We write documentation.

Page 20: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Complex systems are fiendishly difficult to communicate about.

Page 21: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Miscommunications are accidents in the making.

Page 22: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Documentation reduces accidents.

Page 23: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

IF YOU DON’T KNOW HOW THE SYSTEM SHOULD BEHAVE YOU CAN’T SAY HOW IT SHOULDN’T OR ISN’T.

Page 24: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Trouble is, documentation goes out of date.

Page 25: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Complex Systems evolve and written words “rot” as the system moves on.

Page 26: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Engineers fail to update documentation as the system changes.

Page 27: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

DAVID E. HOFFMAN, “THE DEAD HAND: THE UNTOLD STORY OF THE COLD WAR ARMS RACE AND ITS DANGEROUS LEGACY”

ONE OPERATOR (…) WAS CONFUSED BY THE LOGBOOK. HE CALLED SOMEONE ELSE TO INQUIRE.

“WHAT SHALL I DO?” HE ASKED. “IN THE PROGRAM THERE ARE INSTRUCTIONS OF WHAT TO DO, AND THEN A LOT OF THINGS CROSSED OUT.”

THE OTHER PERSON THOUGHT FOR A MINUTE, THEN REPLIED, “FOLLOW THE CROSSED OUT INSTRUCTIONS.”

Page 28: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Engineers can be unaware of the system as it is actually used.

Page 29: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

ERIC SCHLOSSER, COMMAND AND CONTROL: NUCLEAR WEAPONS, THE DAMASCUS ACCIDENT, AND THE ILLUSION OF SAFETY

CLEARLY THE TEXTBOOKS (…) DIDN’T TELL YOU WHAT REALLY HAPPENED IN THE FIELD. (…) (T)HERE WAS A WAY YOU WERE SUPPOSED TO DO THINGS – AND THE WAY THINGS GOT DONE. RFHCO SUITS WERE HOT AND CUMBERSOME (…) AND IF A MAINTENANCE TASK COULD BE ACCOMPLISHED QUICKLY WITHOUT AN OFFICER NOTICING, SOMETIMES THE SUITS WEREN’T WORN.

Page 30: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

(Normal) Accidents happen.

Page 31: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

HENRY S. F. COOPER, JR., XIII: THE APOLLO FLIGHT THAT FAILED

THE FIRST DISASTER IN SPACE HAD OCCURRED, AND NO ONE KNEW WHAT HAD HAPPENED. ON THE GROUND, THE FLIGHT CONTROLLERS WERE NOT EVEN SURE THAT ANYTHING HAD.

Page 32: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Documentation doesn’t necessarily reflect the reality of the system.

Page 33: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

What can we do?

Page 34: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

INSTRUMENTATION

Page 35: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Instrumentation reflects the reality of the system as it exists.

Page 36: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Instrumentation allows users and engineers to explore the system as it exists.

Page 37: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Exploration, done honestly, guides us to a new, better understanding of the system.

Page 38: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

THIS “COLLECTIVE ENTITY” WAS ORGANIZED AROUND THE PILOT TO MAKE IT “SAFER AND MORE EFFICIENT IF THERE WAS A FOCAL POINT. AND I WAS THE FOCAL POINT. JIM FED THINGS INTO MY EARS. THE MOON FED THINGS INTO MY EYES AND I COULD FEEL THE MACHINE OPERATING.”

COMMANDER DAVID SCOTT, AS QUOTED IN DAVID A. MINDELL'S DIGITAL APOLLO: HUMAN AND MACHINE IN SPACEFLIGHT

Page 39: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Instrumentation democratizes the organization around a complex system.

Page 40: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Case Studies

Page 41: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Case Study: Exchange Throttling

Page 42: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Case Study: Exchange Throttling

Healthy pattern of bid requests

Page 43: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Case Study: Exchange Throttling

The trough of throttling

Page 44: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

BAD

GOOD

Case Study: Exchange Throttling

Page 45: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Problem confirmed with Exchange

Case Study: Exchange Throttling

Page 46: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Case Study: Exchange Throttling

• All other metrics (run-queue, CPU, network IO) were fine.

• Confirmed that no changes had been made to the running systems via deployment.

• Amazon data showed no network issues to our machines.
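
The "trough of throttling" on the earlier slide can also be flagged mechanically. The sketch below is an assumption-laden illustration, not the method used in the talk: it marks any sample that falls below a chosen fraction of the median of the preceding samples. The function name `find_troughs`, the window size, the drop fraction and the sample data are all hypothetical.

```python
from statistics import median

def find_troughs(requests_per_minute, baseline_window=5, drop_fraction=0.5):
    """Return indices whose rate is below drop_fraction of the median
    of the preceding baseline_window samples."""
    troughs = []
    for i in range(baseline_window, len(requests_per_minute)):
        baseline = median(requests_per_minute[i - baseline_window:i])
        if requests_per_minute[i] < drop_fraction * baseline:
            troughs.append(i)
    return troughs

# Made-up bid request rates: steady traffic, a sudden trough, then recovery.
rates = [100, 102, 98, 101, 99, 100, 40, 35, 38, 97, 101]
print(find_troughs(rates))  # indices of the samples inside the trough
```

A rolling baseline like this gets contaminated when a trough lasts longer than about half the window, so a real detector would need a more robust baseline. The point is only that the healthy pattern has to be written down before a deviation from it can be named.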

Page 47: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

What happened?

Case Study: Exchange Throttling

Page 48: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

We hit an implicit exchange limit. (Arguably, a bug.)

Case Study: Exchange Throttling

Page 49: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Case Study: Timeout Jumps

Page 50: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Case Study: Timeout Jumps

Healthy Pattern of Background Timeouts

Page 51: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Case Study: Timeout Jumps

Unhealthy timeouts.

Page 52: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Case Study: Timeout Jumps

Healthy Bid Requests

Page 53: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Case Study: Timeout Jumps

Unhealthy Bid Requests

Cliff of Throttling

Page 54: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Case Study: Timeout Jumps

• Timeout jump occurred only in US East; US West was fine.

• All other metrics (as above) checked out.

• System deployment strongly correlated with timeout jump.

• Rollback to previous release reduced timeouts to acceptable levels.
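
One way to put a number on "strongly correlated with deployment" is to compare the timeout rate before and after the deployment, region by region. This is a minimal sketch under stated assumptions: the function name, region keys, timestamps and sample values are all invented for illustration.

```python
from statistics import mean

def timeout_ratio_around(samples, deploy_time):
    """samples: list of (timestamp, timeouts_per_minute) pairs.
    Returns mean rate after the deployment divided by mean rate before it."""
    before = [v for t, v in samples if t < deploy_time]
    after = [v for t, v in samples if t >= deploy_time]
    if not before or not after:
        raise ValueError("need samples on both sides of the deployment")
    if mean(before) == 0:
        raise ValueError("no background timeouts before the deployment")
    return mean(after) / mean(before)

# Made-up data: the deployment happens at t=4.
regions = {
    "us-east": [(0, 10), (1, 12), (2, 11), (3, 11), (4, 44), (5, 46), (6, 42)],
    "us-west": [(0, 10), (1, 12), (2, 11), (3, 11), (4, 11), (5, 12), (6, 10)],
}
ratios = {name: timeout_ratio_around(s, deploy_time=4) for name, s in regions.items()}
print(ratios)
```

A ratio well above 1 in one region and near 1 in another shows where and when the change appeared. It does not explain why.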

Page 55: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

What happened?

Case Study: Timeout Jumps

Page 56: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Who can say? ¯\_(シ)_/¯

Case Study: Timeout Jumps

Page 57: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Lessons Learned

Page 58: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

It is possible to have too little information.

Page 59: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

(THE FIREFIGHTERS) TRIED TO BEAT DOWN THE FLAMES (OF CHERNOBYL REACTOR 4). THEY KICKED AT THE BURNING GRAPHITE WITH THEIR FEET. … THE DOCTORS KEPT TELLING THEM THEY’D BEEN POISONED BY GAS.

SVETLANA ALEXIEVICH, VOICES FROM CHERNOBYL: THE ORAL HISTORY OF A NUCLEAR DISASTER

Page 60: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

It is possible to collect too much information, or present it badly.

Page 61: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

SAFETY SYSTEMS, SUCH AS WARNING LIGHTS, ARE NECESSARY, BUT THEY HAVE THE POTENTIAL FOR DECEPTION. (…) ONE OF THE LESSONS OF COMPLEX SYSTEMS AND (THREE MILE ISLAND) IS THAT ANY PART OF THE SYSTEM MIGHT BE INTERACTING WITH OTHER PARTS IN UNANTICIPATED WAYS.

CHARLES PERROW, NORMAL ACCIDENTS: LIVING WITH HIGH-RISK TECHNOLOGIES

Page 62: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Instrumentation is not a panacea.

Page 63: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Instruments may be misleading.

Page 64: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Must know some Mathematics.

Page 65: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Too much information hampers interpretation.

Page 66: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Instruments may be inaccurate.

Page 67: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Instruments may be ignored.

Page 68: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Instrumentation may be used for undesirable purposes.

Page 69: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

What can we do?

Page 70: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Write documentation!

Page 71: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Context reduces misinterpretations.

Misleading Instruments

Page 72: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Procedure manuals and visualizations reduce the need for math background.

Must Know Math

Page 73: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

The more contextual layers you add, the more you reduce “big boards of blinky lights”.

Too Much Information

Page 74: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

INSTRUMENTATION IS LIKE A SUIT. IT NEEDS TO FIT YOUR OWN MIND.

VALENTINO VOLONGHI

Page 75: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Cross-checks and documented error margins mitigate instrument inaccuracy.

Inaccuracy
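
A cross-check can be as simple as asking whether two independent instruments agree once their documented error margins are taken into account. The sketch below assumes absolute plus-or-minus margins; the function name and the sample readings are hypothetical.

```python
def cross_check(reading_a, margin_a, reading_b, margin_b):
    """True if two readings of the same quantity agree within their
    documented absolute error margins, i.e. their intervals overlap."""
    return abs(reading_a - reading_b) <= margin_a + margin_b

# Made-up example: bid requests counted at the load balancer (+/- 20)
# and inside the application (+/- 10) over the same minute.
print(cross_check(1000, 20, 985, 10))   # intervals overlap
print(cross_check(1000, 20, 900, 10))   # intervals do not overlap
```

When the check fails, at least one instrument is wrong or the documented margins are, and either is worth knowing before acting on the numbers.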

Page 76: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

IF YOU DON'T TRUST A COMPUTER BECAUSE SOMETIMES IT DOESN'T TELL YOU THE TRUTH, TELLING IT TO TELL YOU TO TRUST IT IS ASKING IT TO LIE TO YOU SOMETIMES.

MIKE SASSAK, CURBSIDE

Page 77: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Checklists with references to instrumentation at decision points.

May be Ignored
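
A checklist that points at instruments can be written down as data, so each decision point names the reading to consult and the rule to apply. This is a sketch only: the `Step` type, the metric names and the thresholds are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Step:
    description: str
    metric: str                    # name of the instrument to consult
    ok: Callable[[float], bool]    # decision rule applied to the reading

def run_checklist(steps: List[Step], readings: Dict[str, float]) -> List[Tuple[str, str]]:
    """Walk the checklist and record an outcome for every step."""
    results = []
    for step in steps:
        if step.metric not in readings:
            # A missing instrument is itself a finding, not a pass.
            results.append((step.description, "NO READING"))
        elif step.ok(readings[step.metric]):
            results.append((step.description, "OK"))
        else:
            results.append((step.description, "INVESTIGATE"))
    return results

# Hypothetical steps and thresholds.
steps = [
    Step("CPU within normal range?", "cpu_percent", lambda v: v < 80),
    Step("Timeouts at background level?", "timeouts_per_minute", lambda v: v < 100),
    Step("Run queue short?", "run_queue_length", lambda v: v < 10),
]
readings = {"cpu_percent": 35.0, "timeouts_per_minute": 480.0}
results = run_checklist(steps, readings)
for description, outcome in results:
    print(f"{outcome:12} {description}")
```

Because every step forces a look at a named instrument, an ignored or missing reading shows up in the checklist output instead of passing silently.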

Page 78: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Collaborative Workplaces, Cooperatives, Unions, Laws etc.

Undesirable Purposes

Page 79: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

I PROPOSE THAT MEN AND WOMEN BE RETURNED TO WORK AS CONTROLLERS OF MACHINES, AND THAT THE CONTROL OF PEOPLE BY MACHINES BE CURTAILED. I PROPOSE, FURTHER, THAT THE EFFECTS OF CHANGES IN TECHNOLOGY AND ORGANIZATION ON LIFE PATTERNS BE TAKEN INTO CAREFUL CONSIDERATION, AND THAT THE CHANGES BE WITHHELD OR INTRODUCED ON THE BASIS OF THIS CONSIDERATION.

KURT VONNEGUT, PLAYER PIANO

Page 80: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Instrumentation addresses the problems of documentation, documentation the problems of instrumentation.

TL;DR

Page 81: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Complex Systems need them both.

Page 82: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

How do I get started?

Page 83: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Exometer

Page 84: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Dropwizard’s Metrics

Page 85: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Scales

Page 86: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

DataDog, NewRelic, Librato
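
Before adopting any of the tools above, it can help to see how little is needed for a first instrument. The sketch below uses only the Python standard library and does not reproduce the API of Exometer, Dropwizard's Metrics, Scales or any hosted service. The `Metrics` class, the metric names and `handle_bid_request` are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class Metrics:
    """A tiny in-process registry: named counters and lists of timings."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def incr(self, name, amount=1):
        self.counters[name] += amount

    @contextmanager
    def timed(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the wrapped code raises.
            self.timings[name].append(time.perf_counter() - start)

metrics = Metrics()

def handle_bid_request(request):
    metrics.incr("bid_requests")
    with metrics.timed("bid_latency_seconds"):
        return {"bid": len(request)}  # stand-in for the real work

handle_bid_request("example")
handle_bid_request("another")
print(metrics.counters["bid_requests"], len(metrics.timings["bid_latency_seconds"]))
```

A real library adds what this sketch lacks: bounded memory, percentiles, and reporting to an external system.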

Page 87: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Questions?

Page 88: Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Thanks! <3

@bltroutwine