Toward Recovery-Oriented Computing
Armando Fox, Stanford University
David Patterson, UC Berkeley
and a cast of tens
© 2002 Armando Fox
Outline
- Whither recovery-oriented computing?
  - research/industry agenda of last 15 years
  - today’s pressing problem: availability (we knew that), but what is new/different compared to previous F/T work, databases, etc.?
- Recovery-Oriented Computing as an approach to availability
  - Motivation and philosophy
  - sampling of research avenues
  - what ROC is not
Reevaluating goals & assumptions
- Goals of last 15 years
  - Goal #1: Improve performance
  - Goal #2: Improve performance
  - Goal #3: Improve cost-performance
- Assumptions
  - Humans are perfect (they don’t make mistakes during installation, wiring, upgrade, maintenance or repair)
  - Software will eventually be bug free (good programmers will write bug-free code, debugging works)
  - Hardware MTBF is already very large (~100 years between failures), and will continue to increase
Results of this successful agenda
- Good news: faster computers, denser disks, cheaper $
  - computation faster by >3 orders of magnitude
  - disk capacity greater by >3 orders of magnitude
  - Result: TCO dominated by administration, not hardware cost
- Bad news: complex, brittle systems that fail frequently
  - 65% of IT managers report that their websites were unavailable to customers over a 6-month period (25%: 3 or more outages) [Internet Week, 4/3/2000]
  - outage costs: negative press, “click overs” to competitor, stock price, market cap…
- Yet availability is key metric for online services!
Direct Downtime Costs (per Hour)
Brokerage operations          $6,450,000
Credit card authorization     $2,600,000
Ebay (22 hour outage)           $225,000
Amazon.com                      $180,000
Package shipping services       $150,000
Home shopping channel           $113,000
Catalog sales center             $90,000
Airline reservation center       $89,000
Cellular service activation      $41,000
On-line network fees             $25,000
ATM service fees                 $14,000
Sources: InternetWeek 4/3/2000 + Fibre Channel: A Comprehensive Introduction, R. Kembel 2000, p.8. “...based on a survey done by Contingency Planning Research.”
So, what are today’s challenges?
- We all seem to agree on goals
  - Dave Patterson, IPTS 2002: ACME “availability, change, maintenance, evolution”
  - Jim Gray, HPTS 2001: FAASM “functionality, availability, agility, scalability, manageability”
  - Butler Lampson, SOSP 1999: “Always available, evolving while they run, growing without practical limit”
  - John Hennessy, FCRC 1999: “Availability, maintainability and ease of upgrades, scalability”
  - Fox & Brewer, HotOS 1997: BASE “best-effort service, availability, soft state, eventual consistency”
- We’re all singing the same tune, but what is new?…
What’s New and Different
- Evolution and change are integral
  - not true of many “traditional” five nines systems: long design cycle, changes incur high overhead for design/spec/testing
  - Last version of space shuttle software: 1 bug in 420 KLOC, cost $35M/yr to maintain (good quality commercial SW: 1 bug/KLOC)
  - But, recent upgrade for GPS support required generating 2,500 pages of specs before changing anything in 6.3 KLOC (1.5%)
- Performance still important, but focus changed
  - Interactive performance and availability to end users is key
  - Users appear willing to occasionally tolerate temporary degradation (“service quality”) in exchange for improved availability
  - How to capture this tradeoff: soft/stale state, partial performance degradation, imprecise answers…
ROC Philosophy
- ROC philosophy (“Peres’s Law”):
  “If a problem has no solution, it may not be a problem, but a fact; not to be solved, but to be coped with over time”
  Shimon Peres
- Failures (hardware, software, operator-induced) are a fact; recovery is how we cope with them over time
- Availability = MTTF / MTBF = MTTF / (MTTF + MTTR)
  - Rather than just making MTTF very large, make MTTR << MTTF
- Why?
  1. Human errors will still cause outages => minimize recovery time
  2. Recovery time is directly measurable, and directly captures impact on users of a specific outage incident (MTTF doesn’t)
  3. Rapid evolution makes exhaustive testing/validation impossible => unexpected/transient failures will still occur
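The availability formula on this slide can be checked numerically. The following is a minimal sketch, not taken from the talk: the function name and the sample MTTF/MTTR values are illustrative assumptions. It shows that, under this formula, cutting MTTR by a factor of ten yields the same availability as raising MTTF by a factor of ten.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical numbers, chosen only for illustration.
baseline = availability(mttf_hours=1000.0, mttr_hours=1.0)
longer_mttf = availability(mttf_hours=10_000.0, mttr_hours=1.0)   # 10x MTTF
shorter_mttr = availability(mttf_hours=1000.0, mttr_hours=0.1)    # MTTR / 10

print(f"baseline:     {baseline:.6f}")
print(f"10x MTTF:     {longer_mttf:.6f}")
print(f"MTTR / 10:    {shorter_mttr:.6f}")
```

The two improved cases agree because availability depends only on the ratio MTTR/MTTF; the slide's argument is that the MTTR side of that ratio is the one that can be measured and engineered directly.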
1. Human Error Is Inevitable
- Human error major factor in downtime…
  - PSTN: Half of all outage incidents and outage-minutes from 1992-1994 were due to human error (including errors by phone company maintenance workers)
  - Oracle: up to half of DB failures due to human error (1999)
  - Microsoft blamed human error for ~24-hour outage in Jan 2001
- Approach:
  - Learn from psychology of human error and disaster case studies
  - Build in system support for recovery from human errors
  - Use tools such as error injection, virtual machine technology to provide “flight simulator” training for operators
The 3R undo model
- Undo == time travel for system operators
- Three R’s for recovery
  - Rewind: roll system state backwards in time
  - Repair: change system to prevent failure
    - e.g., edit history, fix latent error, retry unsuccessful operation, install preventative patch
  - Replay: roll system state forward, replaying end-user interactions lost during rewind
- All three R’s are critical
  - rewind enables undo
  - repair lets user/administrator fix problems
  - replay preserves updates, propagates fixes forward
Example e-mail scenario
- Before undo:
  - virus-laden message arrives
  - user copies it into a folder without looking at it
- Operator invokes undo (rewind) to install virus filter (repair)
- During replay:
  - message is redelivered but now discarded by virus filter
  - copy operation is now unsafe (source message doesn’t exist)
  - compensating action: insert placeholder for message
  - now copy command can be executed, making history replay-acceptable
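The scenario above can be sketched as a toy replay over an operation log. This is an illustrative sketch only, not the implementation described in the talk: the `MailStore` class, the log format, the `"VIRUS"` marker, and the placeholder text are all assumptions made for the example. Rewind is modelled as starting again from an empty store, repair as enabling the filter, and replay as re-running the logged operations.

```python
from dataclasses import dataclass, field


@dataclass
class MailStore:
    """Toy mail store (hypothetical; stands in for the real email server)."""
    messages: dict = field(default_factory=dict)  # msg_id -> body
    folders: dict = field(default_factory=dict)   # folder name -> [msg_id]
    virus_filter: bool = False


PLACEHOLDER = "[placeholder: message removed during repair]"


def replay(log, virus_filter):
    """Rewind to an empty store, apply the repair, then replay the log."""
    store = MailStore(virus_filter=virus_filter)
    for op in log:
        if op[0] == "deliver":
            _, msg_id, body = op
            if store.virus_filter and "VIRUS" in body:
                continue  # the repaired system now discards this message
            store.messages[msg_id] = body
        elif op[0] == "copy":
            _, msg_id, folder = op
            if msg_id not in store.messages:
                # Compensating action: the copy's source no longer exists,
                # so insert a placeholder to keep the history replayable.
                store.messages[msg_id] = PLACEHOLDER
            store.folders.setdefault(folder, []).append(msg_id)
    return store


log = [
    ("deliver", "m1", "hello"),
    ("deliver", "m2", "VIRUS payload"),
    ("copy", "m2", "saved"),
]
before = replay(log, virus_filter=False)  # original timeline
after = replay(log, virus_filter=True)    # timeline after rewind + repair
print(after.messages["m2"])
```

In the repaired timeline the user's copy still succeeds, but it now refers to a placeholder rather than the infected message.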
First implementation attempt
- Undo wrapper for open source IMAP email store
[Figure: undo wrapper architecture. Labels: Email Server (includes user state, mailboxes, application, operating system); Non-overwriting Storage; Undo Log; 3R Layer; 3R Proxy; State Tracker; SMTP and IMAP connections; control.]
3. Handling Transient Failures via Restart
- Many failures are either (a) transient and fixable through reboot, or (b) non-transient, but reboot is the lowest-MTTR fix
- Recursive Restarts: To minimize MTTR, restart the minimal set of subsystems that could cure a failure; if that doesn’t help, restart the next-higher containing set, etc.
- Partial restarts/reboots
  - Return system (mostly) to well-tested, well-understood start state
  - High confidence way to reclaim stale/leaked resources
  - Unlike true checkpointing, reboot more likely to avoid repeated failure due to corrupted state
  - We focus on reactive restarts; can also be proactive (SW rejuvenation)
  - “Easier to run a system 365 times for 1 day than 365 days”
- Goals:
  - What is the software structure that can best accommodate such failure management while still preserving all other requirements (functionality, performance, consistency, etc.)
  - Develop methodology for building and managing RR systems (concrete engineering methods)
  - Develop the tools for building, testing, deploying, and managing RR systems
  - Design for fast restartability in online-service building blocks
A Hierarchy of Restartable Units
- Siblings highly fault-isolated
  - low level: by high-confidence, low-level, HW-assisted machinery (e.g., MMU, physical isolation)
  - higher level: by VM-level abstractions based on the above machinery (e.g., JVM, HW VM, process)
- R-map (= hierarchy of restartable component groups) captures restart dependencies
  - Groups of restart units can be restarted by common parent
  - Restarting a node restarts everything in its subtree
  - A failure is minimally curable at a specific node
  - Restarts farther up tree are more expensive, but higher confidence for curing transients
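The escalation policy described on these two slides can be sketched as a walk up the restart hierarchy. This is a minimal sketch under stated assumptions: the `RestartNode` class, the `is_cured` callback, and the component names are hypothetical and are not from the ROC code base. The loop restarts the smallest subtree containing the failed unit and moves to the parent only if the failure persists.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(eq=False)
class RestartNode:
    """One node in a restart hierarchy (hypothetical structure)."""
    name: str
    children: List["RestartNode"] = field(default_factory=list)
    parent: Optional["RestartNode"] = field(default=None, repr=False)

    def add(self, child: "RestartNode") -> "RestartNode":
        child.parent = self
        self.children.append(child)
        return child

    def subtree(self) -> List[str]:
        """Names of every unit restarted when this node is restarted."""
        names = [self.name]
        for child in self.children:
            names.extend(child.subtree())
        return names


def recursive_restart(failed: RestartNode,
                      is_cured: Callable[[List[str]], bool]) -> List[str]:
    """Restart the failed node, then successively larger subtrees.

    Returns the names of the nodes restarted, in order. `is_cured` receives
    the units just restarted and reports whether the failure went away.
    """
    attempts = []
    node: Optional[RestartNode] = failed
    while node is not None:
        attempts.append(node.name)
        if is_cured(node.subtree()):
            return attempts
        node = node.parent  # escalate: costlier, but higher confidence
    raise RuntimeError("failure not cured even by restarting the root")


# Hypothetical tree: station -> {tracker, radio -> {tuner, serial}}
station = RestartNode("station")
station.add(RestartNode("tracker"))
radio = station.add(RestartNode("radio"))
tuner = radio.add(RestartNode("tuner"))
radio.add(RestartNode("serial"))

# Assume the failure only clears when tuner and serial restart together,
# i.e. it is minimally curable at "radio".
attempts = recursive_restart(tuner, lambda units: {"tuner", "serial"} <= set(units))
print(attempts)
```

With these assumptions the failure observed in `tuner` is cured on the second attempt, at `radio`, without restarting `tracker` or the whole station.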
RR-ifying a satellite ground station
- Biggest improvement: MTTF/MTTR-based boundary redrawing
  - Ability to isolate unstable components without penalizing whole system
  - Achieve a balanced MTTF/MTTR ratio across components at the same level
- Lower MTTR may be strictly better than higher MTTF
  - unplanned downtime is more expensive than planned downtime, and downtime under a heavy/critical workload (e.g., satellite pass) is more expensive than downtime under a light/non-critical workload
  - high MTTF doesn’t guarantee failure-free operation interval, but sufficiently low MTTR may mitigate impact of failure
- Current work is applying RR to a ubiquitous computing environment, a J2EE application server, and an OSGI-based platform for cars
  - new lessons will emerge (e.g., r-tree needs to be an r-DAG)
- Most of these lessons are not surprising, but RR provides a uniform framework within which to discuss them
MTTR Captures Outage Costs
- Recent software-related outages at Ebay: 4.5 hours in Apr02, 22 hours Jun99, 7 hours May99, 9 hours Dec98
  - Assume two 4-hour (“newsworthy”) outages/year
    - A = (182*24 hours)/(182*24 + 4 hours) = 99.9%
  - Dollar cost: Ebay policy for >2 hour outage, fees credited to all affected users (US$3-5M for Jun99)
  - Customer loyalty: after Jun99 outage, Yahoo Auctions reported statistically significant increase in users
  - Ebay’s market cap dropped US$4B after Jun99 outage, stock price dropped 25%
- Newsworthy due to number of users affected, given length of outage
Outage costs, cont.
- What about a 10-minute outage once per week?
  - A = (7*24 hours)/(7*24 + 1/6 hours) = 99.9%, the same
  - Can we quantify “savings” over the previous scenario?
- Shorter outages affect fewer users at a time
  - Typical AOL email “outage” affects 1-2% of users
  - Many short outages may affect different subsets of users
- Shorter outages typically not news-worthy
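The arithmetic behind the two outage scenarios can be reproduced directly. This is a small sketch using only the figures on the slides; the function and variable names are illustrative assumptions.

```python
def availability_per_cycle(up_hours: float, down_hours: float) -> float:
    """Availability over one failure/repair cycle: up / (up + down)."""
    return up_hours / (up_hours + down_hours)


# Scenario 1: two 4-hour outages per year (one every ~182 days).
two_long = availability_per_cycle(182 * 24, 4)

# Scenario 2: one 10-minute outage per week.
weekly_short = availability_per_cycle(7 * 24, 1 / 6)

# Total downtime per year in each scenario, in hours.
long_downtime = 2 * 4
short_downtime = 52 * (10 / 60)

print(f"two 4-hour outages/year: {two_long:.1%}, {long_downtime} h down")
print(f"10 minutes every week:   {weekly_short:.1%}, {short_downtime:.2f} h down")
```

Both scenarios round to the same 99.9% figure and accumulate a similar number of down hours per year, which is why the slides argue that availability alone does not capture the difference in cost between them.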
When Low MTTR Trumps High MTTF
- MTTR is directly measurable; MTTF usually not
  - Component MTTF’s -> tens of years
  - Software MTTF ceiling -> ~30 yrs (Gray, HDCC 01)
  - Result: “measuring” MTTF requires 100’s of system-years
  - But, MTTR’s are minutes to hours, even for complex SW components
- MTTR more directly captures impact of a specific outage
  - Very low MTTR (~10 seconds) achievable with redundancy and failover
  - Keeps response time below user threshold of distraction [Miller 1968, Bhatti et al 2001, Zona Research 1999]
Degraded Service vs. Outage
- How about longer MTTR’s (minutes or hours)?
- Can service be designed so that “short” outages appear to users as temporary degradation instead?
  - How much degradation will users tolerate?
  - For how long (until they abandon the site because it feels like a true outage - abandonment can be measured)
  - How frequently?
- Even if above thresholds can be deduced, how to design service so that transient failures can be mapped onto degraded quality?
Examples of degraded service
- Goal: derive a set of service “primitives” that directly reflect parameterizable degradation due to transient failure (“theory” is too strong…)

Nature of degradation | Users affected | Thresholds | Mechanism
See only headers (not body) for some email messages (AOL) | 1.5-2% typical | If >1 minute, treated as outage | Messages on failed servers unavailable, but metadata kept on Tandem cluster
Reduced search harvest (Inktomi) | All users | Varies | Lossy reads to avoid failed servers
Above-the-fold content only (CNN.com) | All users | Varies | Fast but “manual” reconfiguration of front page on dynamic content server
Slower service (Anonymizer) | Non-paying users | Indefinite | ?
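The “lossy reads to avoid failed servers” row can be illustrated with a toy partitioned search. This is a sketch under assumptions, not a description of any production system: the partition layout, the `search` function, and the sample documents are invented for the example. A failed partition is skipped rather than failing the whole query, and the fraction of partitions consulted is reported as the harvest.

```python
def search(partitions, failed, term):
    """Query every live partition; skip failed ones instead of erroring.

    Returns (hits, harvest), where harvest is the fraction of partitions
    that actually contributed to the answer.
    """
    hits = []
    live = 0
    for name, docs in partitions.items():
        if name in failed:
            continue  # lossy read: accept a partial answer
        live += 1
        hits.extend(doc for doc in docs if term in doc)
    harvest = live / len(partitions) if partitions else 0.0
    return sorted(hits), harvest


# Hypothetical index split across four partitions.
partitions = {
    "p0": ["roc undo", "raid"],
    "p1": ["roc restart"],
    "p2": ["roc harvest"],
    "p3": ["disk"],
}

full_hits, full_harvest = search(partitions, failed=set(), term="roc")
hits, harvest = search(partitions, failed={"p2"}, term="roc")
print(hits, harvest)
```

With `p2` down, every user still gets an answer, but one built from three quarters of the index; the transient failure shows up as reduced result quality rather than as an outage.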
Two Frequently Asked Questions
1. Is ROC the same as autonomic computing™?
2. Are you saying we should build lousy hardware and software and mask all those failures with ROC mechanisms?
1. Does ROC==autonomic computing?
- Self-administering?
  - For now, focus on empowering administrators, not eliminating them
  - Humans are good at detecting and learning from own mistakes, so why not? (avoiding automation irony)
  - We’re not sure we understand sysadmins’ current techniques well enough to think about automation
- Self-healing, self-reprovisioning, self-load-balancing…?
  - Sure - Web services and datacenters already do this for many situations; many techniques and tools are “well known”
  - But - do we know how (“theory”) to design the app software to make these techniques possible
- Digital immune system - it’s in WinXP
2. What ROC is not
- We do not advocate for…
  - producing buggy software
  - building lousy hardware
  - slacking on design, testing, or careful administration
  - discarding existing useful techniques or tools
- We do advocate for…
  - an increased focus on lowering MTTR specifically
  - increased examination of when some guarantees can be traded for lower MTTR
  - systematic exploration of “design for fast recovery” in the context of a variety of applications
  - stealing great ideas from systems, Internet protocols, psychology, safety-critical systems design
Summary: ROC and Online Services
- Current software realities lead to new foci
  - Rapid evolution => traditional FT methodologies difficult to apply
  - Human error inevitable, but humans are good at identifying own errors => provide facilities to allow recovery from these
  - HW and SW failure inevitable => use redundancy and designed-in ability to substitute temporary degradation for outages (“design for recovery”)
- Trying to stay relevant via direct contact with designers/operators of large systems
  - Need real data on how large systems fail
  - Need real data on how different kinds of failures are perceived by users
Interested in ROCing?
- Are you willing to anonymously share failure data?
  - Already great relationships (and in some cases data-sharing agreements) with BEA, IBM, HP, Keynote, Microsoft, Oracle, Tellme, Yahoo!, others
- See http://roc.stanford.edu or http://roc.cs.berkeley.edu for publications, talks, research areas, etc.
- Contact Armando Fox ([email protected]) or Dave Patterson
Discussion Question
3. [For discussion] So what if you pick the low hanging fruit? The challenge is in reaching the highest leaves.