what can we learn from four years of data center hardware...
TRANSCRIPT
What Can We Learn from Four Years of Data Center Hardware Failures?
Guosai Wang, Lifei Zhang, Wei Xu
Motivation: Evolving Failure Model
• Failures in data centers are common and costly
  - Violate service level agreement (SLA) and cause loss of revenue
• Understand failures: reduce TCO
• Today’s data centers are different
  - Better failure detection systems, experienced operators
  - Adoption of less-reliable, commodity or custom-ordered hardware; more heterogeneous hardware and workload
  - Result: more complex failure model
• Goal: comprehensive analysis of hardware failures in modern large-scale IDCs
We Re-study Hardware Failures in IDCs
Our work:
  - Large scale: hundreds of thousands of servers with 290,000 failure operation tickets
  - Long-term: 2012-2016
  - Multi-dimensional: components, time, space, product lines, operators’ response, etc.
  - Reconfirm or extend previous findings + observe new patterns
Figure: the five analysis dimensions (time, space, components, product lines, operators’ response)
Interesting Findings Overview
Common beliefs
• Failures are uniformly randomly distributed over time/space
• Failures happen independently
• HW unreliability shapes the software fault tolerance design
Our findings
• HW failures are not uniformly random
  - at different time scales
  - sometimes at different locations
• Correlated HW failures are common in IDCs
• It is also the other way around: software fault tolerance lets operators care less about HW dependability
Failure Management Architecture
• HMS agents detect failures on servers
• HMS collects failure records and stores them in a failure pool
• Operators/programs generate an FOT for each failure record
Dataset: 290,000+ FOTs
• The failure operation tickets (FOTs) contain many fields
  - id, hostname, hostidc, errordevice, errortype, errortime, errorposition, optime, errordetail, etc.
Multi-dimensional Analysis on the Dataset
• We study the failures on different dimensions based on different fields of FOTs
  - Time: errortime
  - Space: hostname, hostidc
  - Components: errordevice
  - Product lines: hostname
  - Operators’ response: errortime, optime
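The mapping from FOT fields to analysis dimensions could be represented roughly as follows. This is a minimal sketch: the field names come from the slide, but the field types, the `FOT` class, the `project` helper, and the example ticket values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FOT:
    """One failure operation ticket; field names follow the slide, types are assumed."""
    id: str
    hostname: str
    hostidc: str
    errordevice: str
    errortype: str
    errortime: datetime
    errorposition: str
    optime: datetime
    errordetail: str


# Fields used by each analysis dimension, as listed on the slide.
DIMENSION_FIELDS = {
    "time": ["errortime"],
    "space": ["hostname", "hostidc"],
    "components": ["errordevice"],
    "product lines": ["hostname"],
    "operators' response": ["errortime", "optime"],
}


def project(fot, dimension):
    """Return only the fields of `fot` that the given dimension relies on (hypothetical helper)."""
    return {name: getattr(fot, name) for name in DIMENSION_FIELDS[dimension]}


# Hypothetical ticket, for illustration only (not taken from the dataset).
example = FOT(
    id="1",
    hostname="host-001.example",
    hostidc="idc-a",
    errordevice="hdd",
    errortype="SMARTFail",
    errortime=datetime(2015, 11, 16, 21, 30),
    errorposition="slot-3",
    optime=datetime(2015, 11, 20, 9, 0),
    errordetail="",
)
space_view = project(example, "space")
```

Keeping the mapping in one table makes it easy to see that hostname serves both the space and the product-line dimensions.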
Failure Percentage Breakdown by Component
Device             Proportion
Hard Disk Drive    81.84%
Miscellaneous*     10.20%
Memory             3.06%
Power              1.74%
RAID card          1.23%
Flash card         0.67%
Motherboard        0.57%
SSD                0.31%
Fan                0.19%
HDD backboard      0.14%
CPU                0.04%
* “Miscellaneous” are manually submitted or uncategorized failures
Failure Types for Hard Disk Drive
• About half of HDD failures are related to SMART values or prediction error count
Figure: Failure Type Breakdown of HDD. Legend: SMARTFail, PredictErr, RaidPdPreErr, RaidPdFailed, Missing, NotReady, MediumErr, RaidPdMediaErr, BadSector, PendingLBA, TooMany, DStatus, Others
Figure callouts: some HDD SMART value exceeds the threshold; the prediction error count exceeds the threshold; other types
SMART = Self-Monitoring, Analysis and Reporting Technology
Outline
• Dataset overview
➢ Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
• Operators’ response to failures
• Lessons Learned
FR is NOT Uniformly Random over Days of the Week
• Hypothesis 1. The average number of component failures is uniformly random over different days of the week.
• A chi-square test can reject the hypothesis at the 0.01 significance level for all component classes.
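A chi-square goodness-of-fit test of this kind can be sketched as follows. The daily counts are made up for illustration and are not from the dataset. The function names are hypothetical, and the closed-form p-value used here is valid only for an even number of degrees of freedom (7 days give 6).

```python
import math


def chi_square_uniform(counts):
    """Chi-square statistic of observed counts against a uniform expectation."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square variable; closed form valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    term = 1.0
    series = 1.0
    # Sum of (x/2)^k / k! for k = 0 .. df/2 - 1
    for k in range(1, df // 2):
        term *= half / k
        series += term
    return math.exp(-half) * series


# Hypothetical failure counts for Monday..Sunday (illustrative only).
counts = [120, 135, 128, 131, 126, 90, 85]
stat = chi_square_uniform(counts)
p_value = chi2_sf_even_df(stat, df=len(counts) - 1)
reject_at_001 = p_value < 0.01
```

With these made-up counts the weekend dip is large enough that the uniform hypothesis is rejected at the 0.01 level. Testing hours of the day (23 degrees of freedom) would need a general chi-square survival function instead of this even-df shortcut.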
FR is NOT Uniformly Random over Hours of the Day
• Hypothesis 2. The average number of component failures is uniformly random during each hour of the day.
• Possible reasons
  - High workload results in more failures
  - Human factors
  - Components fail in large batches
FR of each Component Changes During its Life Cycle
• Different component classes exhibit different FR patterns.
• Infant mortalities
• Wear out
Outline
• Dataset overview
• Temporal distribution of the failures
➢ Spatial distribution of the failures
• Correlated failures
• Operators’ response to failures
• Lessons Learned
Physical Locations Might Affect the FR Distribution
• Hypothesis 3. The failure rate on each rack position is independent of the rack position.
• In general, at the 0.05 significance level:
  - cannot reject the hypothesis in 40% of the data centers
  - can reject it in the other 60%
FR Can be Affected by the Cooling Design
• FRs are higher at rack position 22 and 35
• Possible reasons
  - Design of IDC cooling and physical structure of the racks
Figure: a typical Scorpion rack (callouts: at the top; above the PSU; cooling air)
Outline
• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
➢ Correlated failures
• Operators’ response to failures
• Lessons Learned
Correlated Failures are Common
• Correlated failures: batch failures, correlated component failures, repeating synchronous failures
• Fact: on 22.5% of the days, 200+ HDD failures occurred
• Case study
  - Nov. 16th and 17th, 2015
  - 5,000+ servers, or 32% of all the servers of the product line, reporting hard drive SMARTFail failures
  - 99% of these failures were detected between 21:00 on the 16th and 3:00 on the 17th
  - Operators replaced about 1,600 drives and decommissioned the remaining 4,000+ out-of-warranty drives
  - Failure reason not clear yet
Causes of Correlated Failures
All the following have happened before
  - Environmental factors (e.g., humidity)
  - Firmware bugs
  - Single point of failure (e.g., power module failures)
  - Human operator mistakes
  - ...
Outline
• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
➢ Operators’ response to failures
• Lessons Learned
Operators’ Response to Failures
• Response time: RT = op_time - err_time
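The response-time definition can be computed per ticket and then summarized. The sketch below assumes both timestamps are available as `datetime` values. The three tickets are hypothetical, and the 140-day threshold is used only because the talk later quotes it.

```python
from datetime import datetime
from statistics import mean, median


def response_time_days(err_time, op_time):
    """RT = op_time - err_time, expressed in days."""
    return (op_time - err_time).total_seconds() / 86400.0


# Hypothetical (err_time, op_time) pairs, for illustration only.
tickets = [
    (datetime(2015, 1, 1, 0, 0), datetime(2015, 1, 2, 12, 0)),
    (datetime(2015, 1, 5, 0, 0), datetime(2015, 1, 11, 0, 0)),
    (datetime(2015, 2, 1, 0, 0), datetime(2015, 7, 1, 0, 0)),
]

rts = [response_time_days(err, op) for err, op in tickets]
avg_rt = mean(rts)
median_rt = median(rts)
# Share of tickets whose RT exceeds a threshold (140 days here).
share_over_140 = sum(1 for rt in rts if rt > 140) / len(rts)
```

A single very slow ticket pulls the average far above the median, which is the same shape of gap the talk reports between average and median RT.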
RT is Very High in General
• RT for D_fixing: avg. 42.2 days, median 6.1 days
• 10% of the FOTs: RT > 140 days
  - Is it because operators are busy dealing with a large number of failures?
  - No!
RT in Different Product Lines Varies
• Observation 1: Variation of RT in different product lines is large
• Observation 2: Operators respond to a large number of failures more quickly
Figure: RT against the number of HDD failures during year 2015 (callouts: “The REAL problems”; “Who cares?”)
OPs are Less Motivated to Respond to HW Failures
Possible reasons
• Software redundancy design
  - Delayed response; process failures in batches
• Many hardware failures are no longer urgent
  - E.g., SMART failures may not be fatal
• Repair operation can be costly
  - E.g., task migration
Figure: operator, resilient software, hardware redundancy
Outline
• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
• Operators’ response to failures
➢ Lessons Learned
Lessons Learned I
• Much old wisdom still holds.
  - More correlated failures → software design challenge
  - Automatic hardware failure detection & handling
  - Data center design: avoid “bad spot”
Lessons Learned II
• Strike the right balance among software stack complexity, hardware dependability, and operation cost.
• Data center dependability needs joint optimization effort that crosses layers.
Figure: operation cost, resilient software design, dependable hardware infrastructure
Lessons Learned III
• Stateful failure handling system
  - Data mining tool: discover correlation among failures
  - Provide operators with extra information
Figure: a hardware failure linked to server model, workload, environment, failure history, and correlation with other failures
Thank you! Q&A
Outline
• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
• Operators’ response to failures
• Lessons Learned
TBF Cannot be Well Fitted by Well-known Distributions
• Hypothesis 4. Time between failures (TBF) of all components follows an exponential distribution.
• Hypothesis 5. TBF of each individual component class follows an exponential distribution.
Figure: CDF of time between failures (min), x-axis ticks at 10^0, 10^1, 10^2; curves: Exp, Weibull, Gamma, LogNormal, Data. Callout: large proportion of small values.
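One way to probe Hypotheses 4 and 5 is to measure how far the empirical TBF distribution sits from a fitted exponential. The slides do not say which goodness-of-fit test was used. The sketch below uses the Kolmogorov-Smirnov distance to an exponential fitted by maximum likelihood, chosen here purely for illustration. The function names and the sample values are hypothetical.

```python
import math


def time_between_failures(timestamps):
    """Gaps between consecutive failure timestamps (same unit as the input)."""
    ordered = sorted(timestamps)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def ks_statistic_exponential(samples):
    """KS distance between the empirical CDF and an exponential with rate 1/mean."""
    n = len(samples)
    if n == 0 or sum(samples) <= 0:
        raise ValueError("need at least one sample and a positive mean")
    rate = n / sum(samples)  # maximum-likelihood estimate of the rate
    distance = 0.0
    for i, x in enumerate(sorted(samples)):
        model = 1.0 - math.exp(-rate * x)
        # Compare the model CDF with the empirical CDF just before and at x.
        distance = max(distance, abs(model - i / n), abs((i + 1) / n - model))
    return distance


# Hypothetical TBF sample (minutes) with many small gaps and a few long ones.
bursty_tbf = [1.0] * 18 + [500.0, 700.0]
d_bursty = ks_statistic_exponential(bursty_tbf)
```

A sample dominated by very small gaps, like the one above, lands far from the fitted exponential, which is consistent with the "large proportion of small values" noted on the slide.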
Failure Operation Ticket (FOT)
• Categories of FOTs
• Fields: id, hostid, hostname, hostidc, errordevice, errortype, errortime, errorposition, errordetail
FR of Misc. Failures During the Lifecycle
• Most manual detection and debugging efforts happen only at deployment time
• Lower cost to repair (not many tasks to migrate)
RT for Each Component Class
• Median RTs for SSD and misc. failures are the shortest (hours)
• Median RTs for HDD, fans, and memory are the longest (7-18 days)
• Standard deviation of the RT for HDD: 30.2 days
Self-Monitoring, Analysis and Reporting Technology
• Fields: raw value, worst, threshold, status
• SMART attribute examples (failure related)
  - Reallocated Sectors Count
  - End-to-End error
  - Uncorrectable Sector Count
  - Reported Uncorrectable Errors
  - Current Pending Sector Count
  - Command Timeout
  - ...
Examples of Failure Types
Repeating Failures
• Over 85% of the fixed components never repeat the same failure
• Repair can fail
• 2% of servers that ever failed contribute more than 99% of all failures
Batch Failure Frequency for Each Component
• r_N: the number of days, among the D days observed, on which more than N failures happen on the same day
• Normalized by the total time length D.
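The r_N definition translates directly into a per-day count. This is a minimal sketch that assumes each failure carries a day label. The function name and the sample data are hypothetical.

```python
from collections import Counter


def batch_failure_frequency(failure_days, n, total_days):
    """r_N: fraction of the D observed days with more than N failures."""
    if total_days <= 0:
        raise ValueError("total_days (D) must be positive")
    per_day = Counter(failure_days)
    batch_days = sum(1 for count in per_day.values() if count > n)
    return batch_days / total_days


# Hypothetical day labels, one entry per failure (illustrative only).
failure_days = ["2015-11-16"] * 5 + ["2015-11-17"] * 3 + ["2015-11-18"]
r_2 = batch_failure_frequency(failure_days, n=2, total_days=10)
r_4 = batch_failure_frequency(failure_days, n=4, total_days=10)
```

Days with no failures never appear in the counter, so D has to be supplied separately rather than inferred from the tickets.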