what can we learn from four years of data center hardware...

What Can We Learn from Four Years of Data Center Hardware Failures?

Guosai Wang, Lifei Zhang, Wei Xu

Motivation: Evolving Failure Model

• Failures in data centers are common and costly
  - Violate service level agreements (SLAs) and cause loss of revenue
• Understand failures: reduce TCO
• Today’s data centers are different
  - Better failure detection systems, experienced operators
  - Adoption of less-reliable, commodity or custom-ordered hardware; more heterogeneous hardware and workloads
  - Result: more complex failure model
• Goal: comprehensive analysis of hardware failures in modern large-scale IDCs

We Re-study Hardware Failures in IDCs

Our work:
  - Large scale: hundreds of thousands of servers with 290,000 failure operation tickets
  - Long-term: 2012-2016
  - Multi-dimensional: components, time, space, product lines, operators’ response, etc.
  - Reconfirm or extend previous findings + observe new patterns

[Figure: analysis dimensions: time, space, components, product lines, operators’ response]

Interesting Findings Overview

Common beliefs
• Failures are uniformly randomly distributed over time/space
• Failures happen independently
• HW unreliability shapes the software fault tolerance design

Our findings
• HW failures are not uniformly random
  - at different time scales
  - sometimes at different locations
• Correlated HW failures are common in IDCs
• It is also the other way around: software fault tolerance lets operators care less about HW dependability

Failure Management Architecture

• HMS agents detect failures on servers
• HMS collects failure records and stores them in a failure pool
• Operators/programs generate a FOT for each failure record

Dataset: 290,000+ FOTs

• The failure operation tickets (FOTs) contain many fields
  - id, hostname, hostidc, errordevice, errortype, errortime, errorposition, optime, errordetail, etc.

Multi-dimensional Analysis on the Dataset

• We study the failures along different dimensions based on different fields of the FOTs
  - Time: errortime
  - Space: hostname, hostidc
  - Components: errordevice
  - Product lines: hostname
  - Operators’ response: errortime, optime
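The mapping from FOT fields to analysis dimensions can be sketched in Python. This is a minimal illustration, not the authors' tooling: the `FOT` class, the `count_by` helper, and the sample tickets are all hypothetical, and only a subset of the listed fields is modeled.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FOT:
    """One failure operation ticket (hypothetical subset of the fields)."""
    id: int
    hostname: str
    hostidc: str
    errordevice: str
    errortime: datetime
    optime: datetime


# Which FOT fields feed which analysis dimension, as listed on the slide.
DIMENSION_FIELDS = {
    "time": ("errortime",),
    "space": ("hostname", "hostidc"),
    "components": ("errordevice",),
    "product lines": ("hostname",),
    "operators' response": ("errortime", "optime"),
}


def count_by(tickets, field):
    """Count tickets per distinct value of one field."""
    return Counter(getattr(t, field) for t in tickets)


# Hypothetical tickets, invented for illustration only.
tickets = [
    FOT(1, "web-001", "idc-a", "hdd",
        datetime(2015, 11, 16, 21, 5), datetime(2015, 11, 20, 9, 0)),
    FOT(2, "web-002", "idc-a", "hdd",
        datetime(2015, 11, 16, 22, 40), datetime(2015, 11, 20, 9, 0)),
    FOT(3, "db-007", "idc-b", "memory",
        datetime(2015, 11, 17, 1, 15), datetime(2015, 11, 18, 14, 30)),
]

by_device = count_by(tickets, "errordevice")  # components dimension
by_idc = count_by(tickets, "hostidc")         # space dimension
```

Grouping by a single field is enough for the component and space breakdowns; the time and response dimensions need the timestamp fields instead.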

Failure Percentage Breakdown by Component

Device             Proportion
Hard Disk Drive    81.84%
Miscellaneous*     10.20%
Memory              3.06%
Power               1.74%
RAID card           1.23%
Flash card          0.67%
Motherboard         0.57%
SSD                 0.31%
Fan                 0.19%
HDD backboard       0.14%
CPU                 0.04%

* “Miscellaneous” are manually submitted or uncategorized failures

Failure Types for Hard Disk Drive

• About half of HDD failures are related to SMART values or prediction error count

[Figure: Failure Type Breakdown of HDD. Legend: SMARTFail, PredictErr, RaidPdPreErr, RaidPdFailed, Missing, NotReady, MediumErr, RaidPdMediaErr, BadSector, PendingLBA, TooMany, DStatus, Others]
  - SMARTFail: some HDD SMART value exceeds the threshold
  - PredictErr: the prediction error count exceeds the threshold
  - Other types

SMART = Self-Monitoring, Analysis and Reporting Technology

Outline

• Dataset overview
➢ Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
• Operators’ response to failures
• Lessons Learned

FR is NOT Uniformly Random over Days of the Week

• Hypothesis 1. The average number of component failures is uniformly random over different days of the week.
• A chi-square test rejects the hypothesis at the 0.01 significance level for all component classes.
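A test of Hypothesis 1 can be sketched as a Pearson chi-square goodness-of-fit test against equal expected counts per weekday. This is an assumption about the form of the test, and the weekday counts below are invented; only the 0.01 significance level comes from the slide.

```python
def chi_square_uniform(counts):
    """Pearson chi-square statistic against equal expected counts."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


# Critical value of the chi-square distribution for 6 degrees of
# freedom (7 weekdays - 1) at the 0.01 significance level.
CHI2_CRIT_DF6_ALPHA_001 = 16.812

# Hypothetical failure counts for Monday..Sunday, not from the dataset.
failures_per_weekday = [520, 540, 510, 530, 500, 410, 390]

statistic = chi_square_uniform(failures_per_weekday)
# Reject "uniform over weekdays" when the statistic exceeds the critical value.
reject_uniform = statistic > CHI2_CRIT_DF6_ALPHA_001
```

With these made-up counts the weekend dip pushes the statistic well past the critical value, so uniformity is rejected.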

FR is NOT Uniformly Random over Hours of the Day

• Hypothesis 2. The average number of component failures is uniformly random during each hour of the day.
• Possible reasons
  - High workload results in more failures
  - Human factors
  - Components fail in large batches

FR of Each Component Changes During its Life Cycle

• Different component classes exhibit different FR patterns.
• Infant mortalities
• Wear out

Outline

• Dataset overview
• Temporal distribution of the failures
➢ Spatial distribution of the failures
• Correlated failures
• Operators’ response to failures
• Lessons Learned

Physical Locations Might Affect the FR Distribution

• Hypothesis 3. The failure rate on each rack position is independent of the rack position.
• In general, at the 0.05 significance level:
  - cannot reject the hypothesis in 40% of the data centers
  - can reject it in the other 60%
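One way to test Hypothesis 3 when rack positions hold different numbers of servers is to scale the expected failures at each position by its server count. The sketch below assumes that setup; the function name and all counts are hypothetical, and the slides do not state how the authors handled unequal occupancy.

```python
def chi_square_with_exposure(failures, servers):
    """Chi-square statistic when expected failures scale with server count."""
    total_failures = sum(failures)
    total_servers = sum(servers)
    statistic = 0.0
    for observed, n in zip(failures, servers):
        # Under the hypothesis, every position shares one failure rate.
        expected = total_failures * n / total_servers
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Critical value for 3 degrees of freedom (4 positions - 1) at 0.05.
CHI2_CRIT_DF3_ALPHA_005 = 7.815

# Hypothetical data for four rack positions, invented for illustration.
servers_at_position = [100, 100, 80, 120]
failures_at_position = [30, 27, 25, 58]

statistic = chi_square_with_exposure(failures_at_position, servers_at_position)
position_matters = statistic > CHI2_CRIT_DF3_ALPHA_005
```

In this made-up data center the last position fails more often than its share of servers predicts, so the hypothesis is rejected at the 0.05 level.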

FR Can be Affected by the Cooling Design

• FRs are higher at rack positions 22 and 35
• Possible reasons
  - Design of IDC cooling and physical structure of the racks

[Figure: a typical Scorpion rack with cooling air flow; highlighted positions: at the top, above the PSU]

Outline

• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
➢ Correlated failures
• Operators’ response to failures
• Lessons Learned

Correlated Failures are Common

• Correlated failures: batch failures, correlated component failures, repeating synchronous failures
• Fact: 200+ HDD failures on each of 22.5% of the days
• Case study
  - Nov. 16th and 17th, 2015
  - 5,000+ servers, or 32% of all the servers of the product line, reported hard drive SMARTFail failures
  - 99% of these failures were detected between 21:00 on the 16th and 3:00 on the 17th
  - Operators replaced about 1,600 drives and decommissioned the remaining 4,000+ out-of-warranty drives
  - Failure reason not clear yet

Causes of Correlated Failures

All of the following have happened before:
  - Environmental factors (e.g., humidity)
  - Firmware bugs
  - Single point of failure (e.g., power module failures)
  - Human operator mistakes
  - ...

Outline

• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
➢ Operators’ response to failures
• Lessons Learned

Operators’ Response to Failures

• Response time: RT = op_time - err_time

RT is Very High in General

• RT for D_fixing: avg. 42.2 days, median 6.1 days
• 10% of the FOTs: RT > 140 days
  - Is it because operators are busy dealing with a large number of failures?
  - No!
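The response time definition RT = op_time - err_time can be sketched as follows. The ticket timestamps are invented, and the helper name is hypothetical. The example only illustrates why a few very slow tickets pull the average far above the median, the same pattern as the 42.2-day average against the 6.1-day median.

```python
from datetime import datetime
from statistics import mean, median


def response_time_days(errortime, optime):
    """RT = op_time - err_time, expressed in days."""
    return (optime - errortime).total_seconds() / 86400.0


# Hypothetical (errortime, optime) pairs, invented for illustration.
tickets = [
    (datetime(2015, 1, 1), datetime(2015, 1, 2)),   # 1 day
    (datetime(2015, 1, 1), datetime(2015, 1, 4)),   # 3 days
    (datetime(2015, 1, 1), datetime(2015, 1, 7)),   # 6 days
    (datetime(2015, 1, 1), datetime(2015, 4, 11)),  # 100 days
]

rts = [response_time_days(e, o) for e, o in tickets]
avg_rt = mean(rts)        # dominated by the single 100-day ticket
median_rt = median(rts)   # robust to the long tail
```

Reporting both statistics, as the slide does, makes the long tail visible; the average alone would hide how quickly the typical ticket is handled.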

RT in Different Product Lines Varies

• Observation 1: Variation of RT across different product lines is large
• Observation 2: Operators respond to large numbers of failures more quickly

[Figure: RT against number of HDD failures during year 2015, by product line; annotations: “The REAL problems”, “Who cares?”]

OPs are Less Motivated to Respond to HW Failures

Possible reasons
• Software redundancy design
  - Delayed responding, process failures in batches
• Many hardware failures are no longer urgent
  - E.g., SMART failures may not be fatal
• Repair operation can be costly
  - E.g., task migration

[Figure: Operator, Resilient Software, Hardware Redundancy]

Outline

• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
• Operators’ response to failures
➢ Lessons Learned

Lessons Learned I

• Much old wisdom still holds.
  - More correlated failures → software design challenge
  - Automatic hardware failure detection & handling
  - Data center design: avoid “bad spots”

Lessons Learned II

• Strike the right balance among software stack complexity, hardware dependability, and operation cost.
• Data center dependability needs joint optimization effort that crosses layers.

[Figure: Operation Cost, Resilient Software Design, Dependable Hardware Infrastructure]

Lessons Learned III

• Stateful failure handling system
  - Data mining tool: discover correlation among failures
  - Provide operators with extra information

[Figure: information attached to a Hardware Failure: server model, workload, environment, failure history, correlation with other failures]

Thank you! Q&A

Outline

• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
• Operators’ response to failures
• Lessons Learned

TBF Cannot be Well Fitted by Well-known Distributions

• Hypothesis 4. Time between failures (TBF) of all components follows an exponential distribution.
• Hypothesis 5. TBF of each individual component class follows an exponential distribution.

[Figure: CDF of time between failures (min, log scale from 10^0 to 10^2) for the data and fitted Exp, Weibull, Gamma, and LogNormal distributions; annotation: large proportion of small values]
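A rough way to see why an exponential fits poorly is to fit its rate by maximum likelihood and measure the largest gap between the fitted and empirical CDFs (the Kolmogorov-Smirnov distance). This is a sketch under assumptions: the slides do not say which goodness-of-fit test the authors used, and the TBF samples below are invented to mimic a large share of very small values.

```python
import math


def ks_distance_exponential(samples):
    """Fit an exponential by MLE and return (rate, KS distance)."""
    xs = sorted(samples)
    n = len(xs)
    rate = n / sum(xs)  # MLE of the exponential rate is 1 / mean
    distance = 0.0
    for i, x in enumerate(xs):
        model_cdf = 1.0 - math.exp(-rate * x)
        # The empirical CDF steps from i/n up to (i+1)/n at x.
        distance = max(distance,
                       abs(model_cdf - i / n),
                       abs((i + 1) / n - model_cdf))
    return rate, distance


# Hypothetical TBF samples in minutes: many tiny gaps plus a few long ones.
tbf_minutes = [0.1] * 30 + [0.5] * 10 + [20.0, 45.0, 60.0, 90.0, 120.0]

rate, distance = ks_distance_exponential(tbf_minutes)
```

A distance close to 1 means the fitted curve misses most of the probability mass. Here the many near-zero gaps sit far above what an exponential with the same mean would predict. Turning the distance into a formal test would need critical values that account for the fitted parameter.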

Failure Operation Ticket (FOT)

• Categories of FOTs
• Fields: id, hostid, hostname, hostidc, errordevice, errortype, errortime, errorposition, errordetail

FR of Misc. Failures During the Lifecycle

• Most manual detection and debugging efforts happen only at deployment time
• Lower cost to repair (not many tasks to migrate)

RT for Each Component Class

• Median RTs for SSD and misc. failures are the shortest (hours)
• Median RTs for HDD, fans, and memory are the longest (7-18 days)
• Standard deviation of the RT for HDD: 30.2 days

Self-Monitoring, Analysis and Reporting Technology

• Fields: raw value, worst, threshold, status
• SMART attribute examples (failure related)
  - Reallocated Sectors Count
  - End-to-End error
  - Uncorrectable Sector Count
  - Reported Uncorrectable Errors
  - Current Pending Sector Count
  - Command Timeout
  - ...

Examples of Failure Types

Repeating Failures

• Over 85% of the fixed components never repeat the same failure
• Repair can fail
• 2% of servers that ever failed contribute more than 99% of all failures

Batch Failure Frequency for Each Component

• r_N: a normalized count of the days, within a period of D days, on which more than N failures happen
• Normalized by the total time length D.
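The definition of r_N can be written out directly: count the days with more than N failures and divide by D. The function name and the sample dates below are hypothetical and serve only as a sketch of the metric.

```python
from collections import Counter
from datetime import date


def batch_failure_frequency(failure_dates, n, total_days):
    """r_N: fraction of the D days on which more than N failures occurred."""
    per_day = Counter(failure_dates)
    batch_days = sum(1 for count in per_day.values() if count > n)
    return batch_days / total_days


# Hypothetical failure dates over a D = 4 day window, for illustration.
failure_dates = (
    [date(2015, 11, 14)] * 2
    + [date(2015, 11, 16)] * 5
    + [date(2015, 11, 17)] * 3
)

# Two of the four days (the 16th and 17th) have more than 2 failures.
r_2 = batch_failure_frequency(failure_dates, n=2, total_days=4)
```

Days with no failures never appear in the counter, so D has to be passed in explicitly rather than inferred from the data.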
