fail better: radical ideas from the practice of cloud computing · 2019-03-20 · “cloud...

100
Tom Limoncelli Stack Overflow Fail Better: Radical Ideas from the Practice of Cloud Computing

Upload: others

Post on 22-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Tom Limoncelli Stack Overflow

Fail Better: Radical Ideas from the Practice of Cloud

Computing

Page 2: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

• Learning Center tools for professional development: http://learning.acm.org • 4,500+ trusted technical books and videos by O’Reilly, Morgan Kaufmann, etc. • 1,300+ courses, virtual labs, test preps, live mentoring for software professionals covering

programming, data management, cybersecurity, networking, project management, more • Training toward top vendor certifications (CEH, Cisco, CISSP, CompTIA, ITIL, PMI, etc.) • Learning Webinars from thought leaders and top practitioner • Podcast interviews with innovators, entrepreneurs, and award winners

• Popular publications:

• Flagship Communications of the ACM (CACM) magazine: http://cacm.acm.org/ • ACM Queue magazine for practitioners: http://queue.acm.org/

• ACM Digital Library, the world’s most comprehensive database of computing literature: http://dl.acm.org.

• International conferences that draw leading experts on a broad spectrum of computing topics: http://www.acm.org/conferences.

• Prestigious awards, including the ACM A.M. Turing and Infosys: http://awards.acm.org

• And much more… http://www.acm.org.

ACM Highlights

Page 3: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Tom Limoncelli, SREStack Exchange, IncNew York City

the-cloud-book.com @YesThatTom

Radical Ideas from The Practice of Cloud System Administration

www.informit.com/TPOSA Discount code TPOSA35

Page 4: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Who is Tom Limoncelli?Sysadmin since 1988

Worked at Google, AT&T/Bell Labs and many more.

SRE at Stack Exchange, Inc (NYC) http://careers.stackoverflow.com

Blog: EverythingSysadmin.com

Twitter: @YesThatTom

Page 5: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create
Page 6: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create
Page 7: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

The Cloud

Page 8: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

The Cloud

Page 9: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

The Cloooooouud

Page 10: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

The Cloud!!!!!!

Page 11: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create
Page 12: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

TheCloud!!1!

Page 13: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

We <heart>

The Cloud

Page 14: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

The cloud solves all problems.

Page 15: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud.

C

Page 16: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Distributed Computing

Page 17: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create
Page 18: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create
Page 19: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create
Page 20: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create
Page 21: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Distributed Computing• Divide work among many machines

• Coordinated central or decentralized

• Examples:

• Genomics: 100s machines working on a dataset

• Web Service: 10 machines each taking 1/10th of the web traffic for StackExchange.com

• Storage: xx,000 machines holding all of Gmail’s messages

Page 22: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Distributed computing can do more “work” than the largest single computer.

More storage. More computing power. More memory. More throughput.

Page 23: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Mo’ computers, Mo’ problems

Thousands of Users • Bigger risks • Failures more visible • Automation mandatory • Cost containment

becomes critical

Page 24: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Mo’ computers, Mo’ problems

Thousands of Users • Bigger risks • Failures more visible • Automation mandatory • Cost containment

becomes critical

In response: Radical ideas on • Reducing risk / Improve safety • Reliability becomes a

competitive differentiator • New automation paradigms • Cost and economics

Page 25: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Make peace with failure

Parts are imperfect Networks are imperfect Systems are imperfect Code is imperfect People are imperfect

Page 26: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Learn how to

FAILBETTER

Page 27: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create
Page 28: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Buy the best, most reliable computer in the world. It is still going to fail.

If it doesn’t, you’ll still need to take it down for maintenance.

Page 29: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

3 ways to fail better

1. Use cheaper, less reliable, hardware.

2. If a process/procedure is risky, do it a lot.

3. Don’t punish people for outages.

Page 30: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Fail Better Part 1 of 3:

Use cheaper, less reliable, hardware.

Page 31: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create
Page 32: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

• Loss-damage waiver

• Liability

• Personal accident insurance

• Personal effects coverage

Page 33: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

• Loss-damage waiver

• Liability

• Personal accident insurance

• Personal effects coverage

Page 34: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

• Loss-damage waiver

• Liability

• Personal accident insurance

• Personal effects coverage

Page 35: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

• Loss-damage waiver

• Liability

• Personal accident insurance

• Personal effects coverage

Page 36: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

• Loss-damage waiver

• Liability

• Personal accident insurance

• Personal effects coverage

$$$$$$

Page 37: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

High-End Server

Page 38: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

High-End Server

RAID

Page 39: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

High-End Server

RAID

Dual PS

Page 40: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

High-End Server

RAID

Dual PS

UPS

Page 41: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

High-End Server

RAID

Dual PS

UPS

Gold Maintenance

Page 42: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

High-End Server

RAID

Dual PSUPS

Gold Maintenance

High-End Server

RAID

Dual PSUPS

Gold Maintenance

High-End Server

RAID

Dual PSUPS

Gold Maintenance

High-End Server

RAID

Dual PSUPS

Gold Maintenance

Load Balancer

High-End Server

RAIDDual PS

UPSGold Maintenance

Page 43: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

High-End Server

RAID

Dual PSUPS

Gold Maintenance

High-End Server

RAID

Dual PSUPS

Gold Maintenance

High-End Server

RAID

Dual PSUPS

Gold Maintenance

High-End Server

RAID

Dual PSUPS

Gold Maintenance

Load Balancer

Code Changes to Coordinate and Distribute Work

High-End Server

RAIDDual PS

UPSGold Maintenance

Page 44: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

High-End Server

RAID

Dual PSUPS

Gold Maintenance

High-End Server

RAID

Dual PSUPS

Gold Maintenance

High-End Server

RAID

Dual PSUPS

Gold Maintenance

High-End Server

RAID

Dual PSUPS

Gold Maintenance

Load Balancer

Code Changes to Coordinate and Distribute Work

High-End Server

RAIDDual PS

UPSGold Maintenance

Load Balancer

Page 45: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

High-End Server

RAID

Dual PSUPS

Gold Maintenance

High-End Server

RAID

Dual PSUPS

Gold Maintenance

High-End Server

RAID

Dual PSUPS

Gold Maintenance

High-End Server

RAID

Dual PSUPS

Gold Maintenance

Load Balancer

Code Changes to Coordinate and Distribute Work

High-End Server

RAIDDual PS

UPSGold Maintenance

Load Balancer$$$$$$

Page 46: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Reliability through software

• Resiliency through software:

• Costs to develop. Free to deploy.

• Resiliency through hardware:

• Costs every time you buy a new machine.

Page 47: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

$$$$

$$$$

Write code so that the system is distributed.

Best hardware.

Double-spending

Page 48: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

$$$$

$$$$

Write code so that the system is distributed.

Best hardware.

Double-spending

Page 49: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Efficient Server Efficient Server Efficient Server Efficient Server Efficient Server

Load Balancer Load Balancer

Page 50: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

These techniques work for large

grids of machines…

…and every-day systems too.Efficient Efficient Efficient Efficient Efficient

Load Balancer Load Balancer

Page 51: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Big resiliency is cheaper

Load Balancer

50% 50%

50% overhead

Load Balancer

10% overhead

90% 90% 90% 90% 90%

90% 90% 90% 90% 90%

Page 52: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

The right amount of resiliency is good.

Too much is a waste.Aim for an SLA target so you know when to stop.

Page 53: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Load balancing & redundancy is just one

way to achieve resiliency.

Page 54: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

The cheapest way to buy

terabytes of RAM.

Page 55: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Fail Better Part 1 of 3:

Use cheaper, less reliable, hardware.

Page 56: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Fail Better Part 2 of 3:

If a process/procedure is risky, do it a lot.

Page 57: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Risky behavior vs.

Risky procedures

Page 58: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Risky Behaviors are inherently risky

• Smoking • Shooting yourself in the foot • Blindfolded chainsaw juggling

Page 59: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Risky behavior is risky.

Page 60: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Risky Processes can be improved through practice

• Software Upgrades • Database Failovers • Network Trunk Failovers • Hardware Hot Swaps

Page 61: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

• StackExchange.com has a “DR” site in Oregon.

• StackExchange.com runs on SQL Server with “AlwaysOn” Availability Groups plus…

Redis, HAproxy, ISC BIND, CloudFlare, IIS, and many home-grown applications

StackExchange.com Failover from NY or Oregon

Page 62: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Process was risky• Took 10+ hours

• Required “hands on” by 3 teams.

• Found 30+ “improvements needed”

• Certain people were S.P.O.F.

Page 63: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Drill Results30

10Hours

Bugs Filed

Page 64: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Drill Results30

20

10

5

Hours

Bugs Filed

Page 65: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Drill Results30

20

1210

52

Hours

Bugs Filed

Page 66: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Drill Results30

20

12

5

10

52 1

Hours

Bugs Filed

Page 67: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Why?• Each drill “surfaces” areas of improvement.

• Each member of the team gains experience and builds confidence.

• “Smaller Batches” are better

Page 68: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Software Upgrades• Traditional

• Months of planning

• Incompatibility issues

• Very expensive

• Very visible mistakes

• By the time we’re done, time to start over again.

• Distributed Computing

• High frequency (many times a day or week)

• Fully automated

• Easy to fix failures

• Cheap… encourages experiments

Page 69: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

“Big Bang” releases are inherently risky.

Page 70: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Small batches are betterFewer changes each batch:

• If there are bugs, easier to identify source Reduced lead time:

• It is easier to debug code written recently. Environment has changed less:

• Fewer “external changes” to break on Happier, more motivated, employees:

• Instant gratification for all involved

Page 71: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Risk is inversely proportional to how recently a process has

been used

more recent

less recent

Backups that have

never been

restored

LB web servers

that fail all the time

Continuous Software

Deployment

Software Upgrades

every 3 years

most risky

least risky

Page 72: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

• Randomly reboots machines.

• Keeps Netflix “on its toes”.

• Part of the Simian Army:

• Chaos Monkey (hosts)

• Chaos Kong (data centers)

• Latency Monkey (adds random performance delays)

Netflix “Chaos Monkey”

Page 73: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Fail Better Part 2 of 3:

If a process/procedure is risky, do it a lot.

Page 74: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Fail Better Part 3 of 3:

Don’t punish people for outages.

Page 75: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

There will always be outages.

Page 76: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

There will always be outages.

Page 77: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Getting angry about outages is equivalent to expecting them to

never happen… which is irrational.

Page 78: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Out-dated attitudes about outages • Expect perfection: 100% uptime • Punish exceptions:

• fire someone to “prove we’re serious” • Results:

• People hide problems • People stop communicating • Discourages transparency • Small problems get ignored, turn into big

problems

Page 79: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

New thinking on outages • Set uptime goals: 99.9% +/- 0.05 • Anticipate outages:

• Strategic resiliency techniques, oncall system • Drills to keep in practice, improve process

• Results: • Encourages transparency, communication • Small problems addressed, fewer big

problems • Over-all uptime improved

Page 80: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

There are only Contributing Factors

John Allspaw http://www.kitchensoap.com/2012/02/10/each-necessary-but-only-jointly-sufficient/

Page 81: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

After the outage, publish a postmortem document

• People involved write a “blameless postmortem” • Identifies what happened, how, what can be done

to prevent similar problems in the future. • Published widely internally and externally.

• Instead of blame, people take responsibility: • Responsibility for implementing long-term fixes. • Responsibility for educating other teams how to

learn from this.

Page 82: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create
Page 83: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create
Page 84: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create
Page 85: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

I dunno about anybody else, but I really like getting these post-mortem reports. Not only is it nice to know what happened, but it’s also great to see how you guys handled it in the moment and how you plan to prevent these events going forward. Really neato. Thanks for the great work :)

—-Anna

Page 86: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Fail Better Part 3 of 3:

Don’t punish people for outages.

Page 87: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Take-homes• “cloud computing” = “distributed computing”1. Use cheaper, less reliable, hardware

• Create reliability through software (when possible)

• Pay only for the reliability you need 2. If a process/procedure is risky, do it a lot

• Practice makes perfect • “Small Batches” improves quality and morale

3. Don’t punish people for outages• Focus on accountability and take responsibility

Page 88: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

?

Page 89: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create
Page 90: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create
Page 91: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Home Life

Page 92: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create
Page 93: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create
Page 94: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create
Page 95: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create
Page 96: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Tom Limoncelli, SREStackExchange.com

the-cloud-book.com @YesThatTom

Radical Ideas from The Practice of Cloud System Administration

Page 97: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Tom Limoncelli, SREStackExchange.com

the-cloud-book.com @YesThatTom

Radical Ideas from The Practice of Cloud System Administration

Very Reasonable

Page 98: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

If you liked this talk……there’s more like it in

http://the-cloud-book.com

Save 35% www.informit.com/TPOSA Discount code TPOSA35

THOMAS A. LIMONCELLI • STRATA R. CHALUP • CHRISTINA J. HOGAN

DES IGN ING AND OPERAT ING LARGE D I S TR I BUTED SYSTEMS

T H E P R A C T I C E O F

Cloud System Administr ation

V O L U M E 2

Page 99: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

Q&A

Page 100: Fail Better: Radical Ideas from the Practice of Cloud Computing · 2019-03-20 · “cloud computing” = “distributed computing” 1. Use cheaper, less reliable, hardware • Create

ACM: The Learning Continues…

• Questions about this webcast? [email protected]

• ACM Learning Webinars (on-demand archive): http://learning.acm.org/webinar

• ACM Learning Center: http://learning.acm.org

• ACM SIGMIS: http://sigmis.org/

• ACM Queue: http://queue.acm.org/