Page 1: Toward Recovery-Oriented Computing Armando Fox, Stanford University David Patterson, UC Berkeley and a cast of tens

Toward Recovery-Oriented Computing

Armando Fox, Stanford University
David Patterson, UC Berkeley

and a cast of tens

© 2002 Armando Fox

Outline

- Whither recovery-oriented computing?
  - research/industry agenda of last 15 years
  - today’s pressing problem: availability (we knew that), but what is new/different compared to previous F/T work, databases, etc.?
- Recovery-Oriented Computing as an approach to availability
  - Motivation and philosophy
  - sampling of research avenues
  - what ROC is not

Reevaluating goals & assumptions

- Goals of last 15 years
  - Goal #1: Improve performance
  - Goal #2: Improve performance
  - Goal #3: Improve cost-performance
- Assumptions
  - Humans are perfect (they don’t make mistakes during installation, wiring, upgrade, maintenance or repair)
  - Software will eventually be bug free (good programmers will write bug-free code, debugging works)
  - Hardware MTBF is already very large (~100 years between failures), and will continue to increase

Results of this successful agenda

- Good news: faster computers, denser disks, cheaper $
  - computation faster by >3 orders of magnitude
  - disk capacity greater by >3 orders of magnitude
  - Result: TCO dominated by administration, not hardware cost
- Bad news: complex, brittle systems that fail frequently
  - 65% of IT managers report that their websites were unavailable to customers over a 6-month period (25%: 3 or more outages) [Internet Week, 4/3/2000]
  - outage costs: negative press, “click overs” to competitor, stock price, market cap…
- Yet availability is key metric for online services!

Direct Downtime Costs (per Hour)

Operation | Cost per hour
--- | ---
Brokerage operations | $6,450,000
Credit card authorization | $2,600,000
Ebay (22 hour outage) | $225,000
Amazon.com | $180,000
Package shipping services | $150,000
Home shopping channel | $113,000
Catalog sales center | $90,000
Airline reservation center | $89,000
Cellular service activation | $41,000
On-line network fees | $25,000
ATM service fees | $14,000

Sources: InternetWeek 4/3/2000 + Fibre Channel: A Comprehensive Introduction, R. Kembel 2000, p.8. “...based on a survey done by Contingency Planning Research.”

So, what are today’s challenges?

- We all seem to agree on goals
  - Dave Patterson, IPTS 2002: ACME “availability, change, maintenance, evolution”
  - Jim Gray, HPTS 2001: FAASM “functionality, availability, agility, scalability, manageability”
  - Butler Lampson, SOSP 1999: “Always available, evolving while they run, growing without practical limit”
  - John Hennessy, FCRC 1999: “Availability, maintainability and ease of upgrades, scalability”
  - Fox & Brewer, HotOS 1997: BASE “best-effort service, availability, soft state, eventual consistency”
- We’re all singing the same tune, but what is new?

What’s New and Different

- Evolution and change are integral
  - not true of many “traditional” five nines systems: long design cycle, changes incur high overhead for design/spec/testing
  - Last version of space shuttle software: 1 bug in 420 KLOC, cost $35M/yr to maintain (good quality commercial SW: 1 bug/KLOC)
  - But, recent upgrade for GPS support required generating 2,500 pages of specs before changing anything in 6.3 KLOC (1.5%)
- Performance still important, but focus changed
  - Interactive performance and availability to end users is key
  - Users appear willing to occasionally tolerate temporary degradation (“service quality”) in exchange for improved availability
  - How to capture this tradeoff: soft/stale state, partial performance degradation, imprecise answers…

ROC Philosophy

- ROC philosophy (“Peres’s Law”): “If a problem has no solution, it may not be a problem, but a fact; not to be solved, but to be coped with over time” (Shimon Peres)
- Failures (hardware, software, operator-induced) are a fact; recovery is how we cope with them over time
- Availability = MTTF / MTBF = MTTF / (MTTF + MTTR)
  - Rather than just making MTTF very large, make MTTR << MTTF
- Why?
  1. Human errors will still cause outages => minimize recovery time
  2. Recovery time is directly measurable, and directly captures impact on users of a specific outage incident (MTTF doesn’t)
  3. Rapid evolution makes exhaustive testing/validation impossible => unexpected/transient failures will still occur
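As a worked illustration of the availability formula above (the 1000-hour and 1-hour figures are assumptions chosen for round numbers, not data from the talk), note that unavailability depends only on the ratio of MTTR to MTTF:

```latex
A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}},
\qquad
1 - A = \frac{\mathrm{MTTR}}{\mathrm{MTTF} + \mathrm{MTTR}}
\approx \frac{\mathrm{MTTR}}{\mathrm{MTTF}}
\quad (\mathrm{MTTR} \ll \mathrm{MTTF})

% Baseline (assumed): MTTF = 1000 h, MTTR = 1 h
A_0 = \frac{1000}{1001} \approx 0.99900

% Halve MTTR to 0.5 h, or double MTTF to 2000 h: same availability
A_1 = \frac{1000}{1000.5} = \frac{2000}{2001} \approx 0.99950
```

Under these assumptions, halving recovery time buys exactly as much availability as doubling time to failure, which is the sense in which lowering MTTR is an alternative to raising MTTF.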

1. Human Error Is Inevitable

- Human error major factor in downtime…
  - PSTN: Half of all outage incidents and outage-minutes from 1992-1994 were due to human error (including errors by phone company maintenance workers)
  - Oracle: up to half of DB failures due to human error (1999)
  - Microsoft blamed human error for ~24-hour outage in Jan 2001
- Approach:
  - Learn from psychology of human error and disaster case studies
  - Build in system support for recovery from human errors
  - Use tools such as error injection, virtual machine technology to provide “flight simulator” training for operators

The 3R undo model

- Undo == time travel for system operators
- Three R’s for recovery
  - Rewind: roll system state backwards in time
  - Repair: change system to prevent failure
    - e.g., edit history, fix latent error, retry unsuccessful operation, install preventative patch
  - Replay: roll system state forward, replaying end-user interactions lost during rewind
- All three R’s are critical
  - rewind enables undo
  - repair lets user/administrator fix problems
  - replay preserves updates, propagates fixes forward

Example e-mail scenario

- Before undo:
  - virus-laden message arrives
  - user copies it into a folder without looking at it
- Operator invokes undo (rewind) to install virus filter (repair)
- During replay:
  - message is redelivered but now discarded by virus filter
  - copy operation is now unsafe (source message doesn’t exist)
  - compensating action: insert placeholder for message
  - now copy command can be executed, making history replay-acceptable
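The scenario above can be sketched as a toy program. This is only an illustration under stated assumptions: `MailStore`, `replay`, `PLACEHOLDER`, the log format, and the string-matching "virus filter" are all hypothetical names invented here, and rewind is modeled simply as starting again from an empty store rather than from non-overwriting storage.

```python
from dataclasses import dataclass, field

# Hypothetical placeholder text inserted by the compensating action.
PLACEHOLDER = "[message removed by virus filter]"


@dataclass
class MailStore:
    """Toy mail store: folder name -> {message id -> body}."""
    folders: dict = field(default_factory=lambda: {"inbox": {}})
    filters: list = field(default_factory=list)  # predicates that reject a body

    def deliver(self, msg_id, body):
        if any(rejects(body) for rejects in self.filters):
            return False  # discarded by an installed filter
        self.folders["inbox"][msg_id] = body
        return True

    def copy(self, msg_id, dest):
        body = self.folders["inbox"][msg_id]  # KeyError if the source is missing
        self.folders.setdefault(dest, {})[msg_id] = body


def replay(store, log):
    """Replay logged operations, compensating when one has become unsafe."""
    for op, *args in log:
        if op == "deliver":
            store.deliver(*args)
        elif op == "copy":
            msg_id, dest = args
            if msg_id not in store.folders["inbox"]:
                # Compensating action: insert a placeholder so the copy can run.
                store.folders["inbox"][msg_id] = PLACEHOLDER
            store.copy(msg_id, dest)


# Before undo: the virus-laden message arrives and the user files it unread.
log = [("deliver", "m1", "hello VIRUS payload"), ("copy", "m1", "saved")]
store = MailStore()
replay(store, log)

# Rewind: return to the state before the logged operations (here, empty).
store = MailStore()
# Repair: install a virus filter.
store.filters.append(lambda body: "VIRUS" in body)
# Replay: the delivery is discarded; the copy succeeds on a placeholder.
replay(store, log)
print(store.folders)
```

After replay, both the inbox and the `saved` folder hold the placeholder instead of the infected message, so the user's copy action is preserved in history even though its original source no longer exists.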

First implementation attempt

- Undo wrapper for open source IMAP email store

[Figure: architecture diagram. Labeled components: Email Server (includes user state, mailboxes, application, operating system); Non-overwriting Storage; Undo Log; 3R Layer containing a 3R Proxy and a State Tracker; SMTP and IMAP connections on both sides of the proxy; a control connection.]

3. Handling Transient Failures via Restart

- Many failures are either (a) transient and fixable through reboot, or (b) non-transient, but reboot is the lowest-MTTR fix
- Recursive Restarts: To minimize MTTR, restart the minimal set of subsystems that could cure a failure; if that doesn’t help, restart the next-higher containing set, etc.
- Partial restarts/reboots
  - Return system (mostly) to well-tested, well-understood start state
  - High confidence way to reclaim stale/leaked resources
  - Unlike true checkpointing, reboot more likely to avoid repeated failure due to corrupted state
  - We focus on reactive restarts; can also be proactive (SW rejuvenation)
  - “Easier to run a system 365 times for 1 day than 365 days”
- Goals:
  - What is the software structure that can best accommodate such failure management while still preserving all other requirements (functionality, performance, consistency, etc.)?
  - Develop methodology for building and managing RR systems (concrete engineering methods)
  - Develop the tools for building, testing, deploying, and managing RR systems
  - Design for fast restartability in online-service building blocks

A Hierarchy of Restartable Units

- Siblings highly fault-isolated
  - low level: by high-confidence, low-level, HW-assisted machinery (e.g., MMU, physical isolation)
  - higher level: by VM-level abstractions based on the above machinery (e.g., JVM, HW VM, process)
- R-map (= hierarchy of restartable component groups) captures restart dependencies
  - Groups of restart units can be restarted by common parent
  - Restarting a node restarts everything in its subtree
  - A failure is minimally curable at a specific node
  - Restarts farther up tree are more expensive, but higher confidence for curing transients
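One way to picture the restart hierarchy and the recursive-restart policy is the sketch below. It is a minimal illustration, not the project's implementation: `RestartNode`, `recursive_restart`, the `is_cured` callback, and the component names (`station`, `radio`, `tuner`, `tracker`) are hypothetical, and a real system would replace `is_cured` with an actual restart followed by a health check.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class RestartNode:
    """A node in a restart tree; restarting it restarts its whole subtree."""
    name: str
    children: list = field(default_factory=list)
    parent: Optional["RestartNode"] = field(default=None, repr=False)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def subtree(self):
        """Names of every component restarted when this node is restarted."""
        names = [self.name]
        for child in self.children:
            names.extend(child.subtree())
        return names


def recursive_restart(failed, is_cured):
    """Restart the smallest group first; on failure, move to the parent group."""
    node, attempts = failed, []
    while node is not None:
        attempts.append(node.name)
        if is_cured(node):  # stands in for "restart, then check health"
            return attempts
        node = node.parent
    raise RuntimeError("failure not cured even by restarting the root")


# Hypothetical tree: station -> {radio -> {tuner}, tracker}
root = RestartNode("station")
radio = root.add(RestartNode("radio"))
tuner = radio.add(RestartNode("tuner"))
tracker = root.add(RestartNode("tracker"))

# Assume the failure shows up in the tuner but is minimally curable at "radio":
# it is cured by any restart whose subtree includes the radio group.
attempts = recursive_restart(tuner, lambda node: "radio" in node.subtree())
print(attempts)
```

In this run the cheap tuner restart is tried first, fails to cure the problem, and the policy escalates one level to the radio group without touching the tracker or the rest of the station.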

RR-ifying a satellite ground station

- Biggest improvement: MTTF/MTTR-based boundary redrawing
  - Ability to isolate unstable components without penalizing whole system
  - Achieve a balanced MTTF/MTTR ratio across components at the same level
- Lower MTTR may be strictly better than higher MTTF
  - unplanned downtime is more expensive than planned downtime, and downtime under a heavy/critical workload (e.g., satellite pass) is more expensive than downtime under a light/non-critical workload
  - high MTTF doesn’t guarantee failure-free operation interval, but sufficiently low MTTR may mitigate impact of failure
- Current work is applying RR to a ubiquitous computing environment, a J2EE application server, and an OSGI-based platform for cars
  - new lessons will emerge (e.g., r-tree needs to be an r-DAG)
- Most of these lessons are not surprising, but RR provides a uniform framework within which to discuss them

MTTR Captures Outage Costs

- Recent software-related outages at Ebay: 4.5 hours in Apr02, 22 hours Jun99, 7 hours May99, 9 hours Dec98
- Assume two 4-hour (“newsworthy”) outages/year
  - A = (182*24 hours) / (182*24 + 4 hours) = 99.9%
  - Dollar cost: Ebay policy for >2 hour outage, fees credited to all affected users (US$3-5M for Jun99)
  - Customer loyalty: after Jun99 outage, Yahoo Auctions reported statistically significant increase in users
  - Ebay’s market cap dropped US$4B after Jun99 outage, stock price dropped 25%
- Newsworthy due to number of users affected, given length of outage

Outage costs, cont.

- What about a 10-minute outage once per week?
  - A = (7*24 hours) / (7*24 + 1/6 hours) = 99.9% - the same
  - Can we quantify “savings” over the previous scenario?
- Shorter outages affect fewer users at a time
  - Typical AOL email “outage” affects 1-2% of users
  - Many short outages may affect different subsets of users
- Shorter outages typically not news-worthy
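The two availability figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers from the two scenarios (one 4-hour outage per 182 days, and one 10-minute outage per week); the function and variable names are illustrative.

```python
def availability(uptime_hours, outage_hours):
    """Availability as uptime / (uptime + downtime) over one failure cycle."""
    return uptime_hours / (uptime_hours + outage_hours)


# Scenario 1: two 4-hour outages per year, i.e. one per 182 days of uptime.
rare_long = availability(182 * 24, 4)
# Scenario 2: one 10-minute (1/6 hour) outage per week of uptime.
frequent_short = availability(7 * 24, 1 / 6)

# Total downtime per year in each scenario, in hours.
annual_downtime_long = 2 * 4
annual_downtime_short = 52 * (1 / 6)

print(f"rare, long outages:      {rare_long:.1%}")
print(f"frequent, short outages: {frequent_short:.1%}")
print(f"annual downtime (hours): {annual_downtime_long} vs {annual_downtime_short:.2f}")
```

Both scenarios round to 99.9% and accumulate a similar total of roughly 8 to 9 hours of downtime per year, so the availability number alone cannot distinguish them; the difference lies in how many users each individual outage reaches and whether it is newsworthy.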

When Low MTTR Trumps High MTTF

- MTTR is directly measurable; MTTF usually not
  - Component MTTF’s -> tens of years
  - Software MTTF ceiling -> ~30 yrs (Gray, HDCC 01)
  - Result: “measuring” MTTF requires 100’s of system-years
  - But, MTTR’s are minutes to hours, even for complex SW components
- MTTR more directly captures impact of a specific outage
  - Very low MTTR (~10 seconds) achievable with redundancy and failover
  - Keeps response time below user threshold of distraction [Miller 1968, Bhatti et al 2001, Zona Research 1999]

Degraded Service vs. Outage

- How about longer MTTR’s (minutes or hours)?
- Can service be designed so that “short” outages appear to users as temporary degradation instead?
  - How much degradation will users tolerate?
  - For how long (until they abandon the site because it feels like a true outage - abandonment can be measured)
  - How frequently?
- Even if above thresholds can be deduced, how to design service so that transient failures can be mapped onto degraded quality?

Examples of degraded service

- Goal: derive a set of service “primitives” that directly reflect parameterizable degradation due to transient failure (“theory” is too strong…)

Nature of degradation | Users affected | Thresholds | Mechanism
--- | --- | --- | ---
See only headers (not body) for some email messages (AOL) | 1.5-2% typical | If >1 minute, treated as outage | Messages on failed servers unavailable, but metadata kept on Tandem cluster
Reduced search harvest (Inktomi) | All users | Varies | Lossy reads to avoid failed servers
Above-the-fold content only (CNN.com) | All users | Varies | Fast but “manual” reconfiguration of front page on dynamic content server
Slower service (Anonymizer) | Non-paying users | Indefinite | ?
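The “lossy reads” row can be illustrated with a small sketch of a partitioned search that skips failed servers instead of failing the whole request. Everything here is an assumption made for illustration: the `lossy_search` function, the server names, the sample documents, and the choice to measure harvest as the fraction of partitions that answered (which treats partitions as equal in size). It does not describe how Inktomi actually implemented this.

```python
def lossy_search(query, partitions, failed):
    """Query every live partition; skip failed ones rather than fail the request."""
    hits, live = [], 0
    for server, docs in partitions.items():
        if server in failed:
            continue  # lossy read: this slice of the index is left out
        live += 1
        hits.extend(doc for doc in docs if query in doc)
    # Harvest: fraction of the partitioned data reflected in the answer.
    harvest = live / len(partitions)
    return hits, harvest


# Hypothetical index split across four servers, one of which is down.
partitions = {
    "s1": ["roc paper", "undo log"],
    "s2": ["roc slides"],
    "s3": ["restart tree"],
    "s4": ["roc faq"],
}
hits, harvest = lossy_search("roc", partitions, failed={"s2"})
print(hits, harvest)
```

With one of four servers down, the user still gets an answer, just an incomplete one: the failure shows up as a reduced harvest (0.75 here) rather than as an outage.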

Two Frequently Asked Questions

1. Is ROC the same as autonomic computing™?
2. Are you saying we should build lousy hardware and software and mask all those failures with ROC mechanisms?

1. Does ROC==autonomic computing?

- Self-administering?
  - For now, focus on empowering administrators, not eliminating them
  - Humans are good at detecting and learning from own mistakes, so why not? (avoiding automation irony)
  - We’re not sure we understand sysadmins’ current techniques well enough to think about automation
- Self-healing, self-reprovisioning, self-load-balancing…?
  - Sure - Web services and datacenters already do this for many situations; many techniques and tools are “well known”
  - But - do we know how (“theory”) to design the app software to make these techniques possible?
  - Digital immune system - it’s in WinXP

2. What ROC is not

- We do not advocate for…
  - producing buggy software
  - building lousy hardware
  - slacking on design, testing, or careful administration
  - discarding existing useful techniques or tools
- We do advocate for…
  - an increased focus on lowering MTTR specifically
  - increased examination of when some guarantees can be traded for lower MTTR
  - systematic exploration of “design for fast recovery” in the context of a variety of applications
  - stealing great ideas from systems, Internet protocols, psychology, safety-critical systems design

Summary: ROC and Online Services

- Current software realities lead to new foci
  - Rapid evolution => traditional FT methodologies difficult to apply
  - Human error inevitable, but humans are good at identifying own errors => provide facilities to allow recovery from these
  - HW and SW failure inevitable => use redundancy and designed-in ability to substitute temporary degradation for outages (“design for recovery”)
- Trying to stay relevant via direct contact with designers/operators of large systems
  - Need real data on how large systems fail
  - Need real data on how different kinds of failures are perceived by users

Interested in ROCing?

- Are you willing to anonymously share failure data?
  - Already great relationships (and in some cases data-sharing agreements) with BEA, IBM, HP, Keynote, Microsoft, Oracle, Tellme, Yahoo!, others
- See http://roc.stanford.edu or http://roc.cs.berkeley.edu for publications, talks, research areas, etc.
- Contact Armando Fox ([email protected]) or Dave Patterson ([email protected])

Discussion Question

3. [For discussion] So what if you pick the low hanging fruit? The challenge is in reaching the highest leaves.