What Can We Learn from Four Years of Data Center Hardware Failures? Guosai Wang, Lifei Zhang, Wei Xu


Page 1: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,

What Can We Learn from Four Years of Data Center Hardware Failures?

Guosai Wang, Lifei Zhang, Wei Xu

Page 2:

Motivation: Evolving Failure Model

• Failures in data centers are common and costly
  - Violate service level agreement (SLA) and cause loss of revenue
• Understand failures: reduce TCO
• Today's data centers are different
  - Better failure detection systems, experienced operators
  - Adoption of less-reliable, commodity or custom-ordered hardware, more heterogeneous hardware and workload
  - Result: more complex failure model
• Goal: comprehensive analysis of hardware failures in modern large-scale IDCs

Page 3:

We Re-study Hardware Failures in IDCs

Our work:
  - Large scale: hundreds of thousands of servers with 290,000 failure operation tickets
  - Long-term: 2012-2016
  - Multi-dimensional: components, time, space, product lines, operators' response, etc.
  - Reconfirm or extend previous findings + observe new patterns

[Figure: the five dimensions of the analysis: time, space, components, product lines, operators' response]

Page 4:

Interesting Findings Overview

Common beliefs
• Failures are uniformly randomly distributed over time/space
• Failures happen independently
• HW unreliability shapes the software fault tolerance design

Our findings
• HW failures are not uniformly random
  - at different time scales
  - sometimes at different locations
• Correlated HW failures are common in IDCs
• It is also the other way around: software fault tolerance lets operators care less about HW dependability

Pages 5-8:

Failure Management Architecture

[Figure: failure management architecture diagram; labels not recoverable]

• HMS agents detect failures on servers
• HMS collects failure records and stores them in a failure pool
• Operators/programs generate a FOT for each failure record

Page 9:

Dataset: 290,000+ FOTs

• The failure operation tickets (FOTs) contain many fields: id, hostname, hostidc, errordevice, errortype, errortime, errorposition, optime, errordetail, etc.

Pages 10-11:

Multi-dimensional Analysis on the Dataset

• We study the failures on different dimensions based on different fields of FOTs
  - Time: errortime
  - Space: hostname, hostidc
  - Components: errordevice
  - Product lines: hostname
  - Operators' response: errortime, optime

FOT fields: id, hostname, hostidc, errordevice, errortype, errortime, errorposition, optime, errordetail, etc.
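The mapping from analysis dimensions to FOT fields can be sketched as a small record type plus a lookup table. This is only an illustration: the field names come from the slide, but the Python types, the `FOT` class, the `DIMENSION_FIELDS` table, and the `project` helper are assumptions, not part of the authors' tooling.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FOT:
    """One failure operation ticket. Field names follow the slide; types are assumed."""
    id: int
    hostname: str
    hostidc: str
    errordevice: str
    errortype: str
    errortime: datetime
    errorposition: str
    optime: datetime
    errordetail: str


# Which ticket fields feed each analysis dimension, as listed on the slide.
DIMENSION_FIELDS = {
    "time": ("errortime",),
    "space": ("hostname", "hostidc"),
    "components": ("errordevice",),
    "product lines": ("hostname",),
    "operators' response": ("errortime", "optime"),
}


def project(fot, dimension):
    """Return only the fields of a ticket that are relevant to one dimension."""
    return {name: getattr(fot, name) for name in DIMENSION_FIELDS[dimension]}
```

Calling `project(ticket, "space")` on a ticket returns just its `hostname` and `hostidc`, which is the slice of the record a spatial analysis would start from.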

Page 12:

Failure Percentage Breakdown by Component

Device             Proportion
Hard Disk Drive    81.84%
Miscellaneous*     10.20%
Memory              3.06%
Power               1.74%
RAID card           1.23%
Flash card          0.67%
Motherboard         0.57%
SSD                 0.31%
Fan                 0.19%
HDD backboard       0.14%
CPU                 0.04%

* "Miscellaneous" are manually submitted or uncategorized failures
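A breakdown like this one can be derived by counting tickets per `errordevice` value. The sketch below assumes the tickets have already been reduced to a list of device labels; the function name and the toy data are hypothetical.

```python
from collections import Counter


def breakdown_by_component(devices):
    """Return (device, percentage) pairs, largest share first.

    `devices` is an iterable with one errordevice label per ticket.
    """
    counts = Counter(devices)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(device, 100.0 * n / total) for device, n in counts.most_common()]


# Toy input, not the real dataset: 8 HDD tickets and 2 memory tickets.
toy = ["hdd"] * 8 + ["memory"] * 2
for device, pct in breakdown_by_component(toy):
    print(f"{device:10s} {pct:6.2f}%")
```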

Pages 13-14:

Failure Types for Hard Disk Drive

• About half of HDD failures are related to SMART values or prediction error count

[Figure: failure type breakdown of HDD. Legend: SMARTFail, PredictErr, RaidPdPreErr, RaidPdFailed, Missing, NotReady, MediumErr, RaidPdMediaErr, BadSector, PendingLBA, TooMany, DStatus, Others. Annotations: "Some HDD SMART value exceeds the threshold", "The prediction error count exceeds the threshold", "Other types"]

SMART = Self-Monitoring, Analysis and Reporting Technology

Page 15:

Outline

• Dataset overview
→ Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
• Operators' response to failures
• Lessons Learned

Page 16:

FR is NOT Uniformly Random over Days of the Week

• Hypothesis 1. The average number of component failures is uniformly random over different days of the week.
• A chi-square test can reject the hypothesis at 0.01 significance level for all component classes.
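A chi-square goodness-of-fit test against a uniform distribution over the seven weekdays can be sketched as follows. This is a minimal standard-library version, not the authors' analysis code: it assumes per-weekday failure counts as input, and it hard-codes 16.812, the standard table value for the upper 1% point of the chi-square distribution with 6 degrees of freedom. The function names and sample counts are hypothetical.

```python
# Upper 1% critical value of the chi-square distribution with 7 - 1 = 6 degrees of freedom.
CHI2_CRIT_DF6_ALPHA_001 = 16.812


def chi_square_uniform(observed):
    """Chi-square statistic of observed counts against equal expected counts."""
    total = sum(observed)
    expected = total / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


def rejects_uniform_over_weekdays(daily_counts):
    """True if uniformity over the 7 weekdays is rejected at the 0.01 level."""
    if len(daily_counts) != 7:
        raise ValueError("expected one count per day of the week")
    return chi_square_uniform(daily_counts) > CHI2_CRIT_DF6_ALPHA_001


# Made-up counts, Monday to Sunday, with a visible weekend dip.
print(rejects_uniform_over_weekdays([150, 140, 130, 120, 110, 60, 50]))
```

The same statistic applies to Hypothesis 2 with 24 hourly bins, but the critical value then has to be taken for 23 degrees of freedom.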

Page 17:

FR is NOT Uniformly Random over Hours of the Day

• Hypothesis 2. The average number of component failures is uniformly random during each hour of the day.

Pages 18-21:

FR is NOT Uniformly Random over Hours of the Day

• Possible reasons
  - High workload results in more failures
  - Human factors
  - Components fail in large batches

Pages 22-24:

FR of each Component Changes During its Life Cycle

• Different component classes exhibit different FR patterns.
• Infant mortalities
• Wear out

Page 25:

Outline

• Dataset overview
• Temporal distribution of the failures
→ Spatial distribution of the failures
• Correlated failures
• Operators' response to failures
• Lessons Learned

Page 26:

Physical Locations Might Affect the FR Distribution

• Hypothesis 3. The failure rate on each rack position is independent of the rack position.
• In general, at 0.05 significance level:
  - cannot reject the hypothesis in 40% of the data centers
  - can reject it in the other 60%

Page 27:

FR Can be Affected by the Cooling Design

• FRs are higher at rack positions 22 and 35
• Possible reasons
  - Design of IDC cooling and physical structure of the racks

[Figure: a typical Scorpion rack, with labels "At the top", "Above the PSU", and "Cooling air"]

Page 28:

Outline

• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
→ Correlated failures
• Operators' response to failures
• Lessons Learned

Page 29:

Correlated Failures are Common

• Correlated failures: batch failures, correlated component failures, repeating synchronous failures
• Fact: 200+ HDD failures on each of 22.5% of the days
• Case study
  - Nov. 16th and 17th, 2015
  - 5,000+ servers, or 32% of all the servers of the product line, reporting hard drive SMARTFail failures
  - 99% of these failures were detected between 21:00 on the 16th and 3:00 on the 17th.
  - Operators replaced about 1,600, decommissioned the remaining 4,000+ out-of-warranty drives
  - Failure reason not clear yet

Page 30:

Causes of Correlated Failures

All the following have happened before
  - Environmental factors (e.g., humidity)
  - Firmware bugs
  - Single point of failure (e.g., power module failures)
  - Human operator mistakes
  - ...

Page 31:

Outline

• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
→ Operators' response to failures
• Lessons Learned

Page 32:

Operators' Response to Failures

• Response time: RT = op_time - err_time

[Figure: failure management architecture diagram; labels not recoverable]

Page 33:

RT is Very High in General

• RT for D_fixing: Avg. 42.2 days, median 6.1 days
• 10% of the FOTs: RT > 140 days
  - Is it because operators are busy dealing with a large number of failures?
  - No!
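The response-time definition RT = op_time - err_time, and the gap between its average and median, can be illustrated with a short sketch. It assumes each ticket is reduced to an (err_time, op_time) pair of `datetime` values; the helper names and the three sample tickets are made up for illustration.

```python
from datetime import datetime
from statistics import mean, median


def response_time_days(err_time, op_time):
    """RT = op_time - err_time, expressed in days."""
    return (op_time - err_time).total_seconds() / 86400.0


def summarize_rt(tickets):
    """Mean and median RT over (err_time, op_time) pairs."""
    rts = [response_time_days(err, op) for err, op in tickets]
    return {"mean_days": mean(rts), "median_days": median(rts)}


# Three made-up tickets: two handled within days, one left for weeks.
tickets = [
    (datetime(2015, 1, 1), datetime(2015, 1, 2)),
    (datetime(2015, 1, 1), datetime(2015, 1, 4)),
    (datetime(2015, 1, 1), datetime(2015, 1, 21)),
]
print(summarize_rt(tickets))  # mean 8.0 days, median 3.0 days
```

A few tickets that wait very long pull the mean far above the median, which is the same shape as the reported 42.2-day average against a 6.1-day median.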

Page 34:

RT in Different Product Lines Varies

• Observation 1: Variation of RT in different product lines is large
• Observation 2: Operators respond to a large number of failures more quickly

[Figure: x-axis "Number of HDD Failures During Year 2015"; annotations "The REAL problems" and "Who cares?"]

Page 35:

OPs are Less Motivated to Respond to HW Failures

Possible reasons
• Software redundancy design
  - Delayed responding, process failures in batches
• Many hardware failures are no longer urgent
  - E.g., SMART failures may not be fatal
• Repair operation can be costly
  - E.g., task migration

[Figure: operator, resilient software, hardware redundancy]

Page 36:

Outline

• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
• Operators' response to failures
→ Lessons Learned

Page 37:

Lessons Learned I

• Much old wisdom still holds.
  - More correlated failures → software design challenge
  - Automatic hardware failure detection & handling
  - Data center design: avoid "bad spots"

Page 38:

Lessons Learned II

• Strike the right balance among software stack complexity, hardware dependability, and operation cost.
• Data center dependability needs joint optimization effort that crosses layers.

[Figure: operation cost, resilient software design, dependable hardware infrastructure]

Page 39:

Lessons Learned III

• Stateful failure handling system
  - Data mining tool: discover correlation among failures
  - Provide operators with extra information

[Figure: a hardware failure linked to server model, workload, environment, failure history, and correlation with other failures]

Page 40:

Thank you! Q&A

Outline
• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
• Operators' response to failures
• Lessons Learned

Page 41:

TBF Cannot be Well Fitted by Well-known Distributions

• Hypothesis 4. Time between failures (TBF) of all components follows an exponential distribution.
• Hypothesis 5. TBF of each individual component class follows an exponential distribution.

[Figure: CDF of time between failures (min), log-scale x-axis from 10^0 to 10^2, comparing Exp, Weibull, Gamma, and LogNormal fits against the data; annotation: "Large proportion of small values"]
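One way to see why a large proportion of small TBF values defeats an exponential fit is to compare the empirical CDF with a fitted exponential CDF. The sketch below is not the authors' procedure. It assumes failure timestamps in minutes, fits the exponential rate by maximum likelihood (rate = 1 / mean TBF), and reports the Kolmogorov-Smirnov distance between the two CDFs. All names are hypothetical.

```python
import math


def time_between_failures(timestamps_min):
    """Gaps between consecutive failure timestamps (in minutes)."""
    ts = sorted(timestamps_min)
    return [b - a for a, b in zip(ts, ts[1:])]


def ks_distance_to_exponential(tbf):
    """Largest gap between the empirical CDF and a maximum-likelihood exponential CDF."""
    if not tbf or sum(tbf) <= 0:
        raise ValueError("need at least one positive TBF value")
    n = len(tbf)
    rate = n / sum(tbf)  # MLE of the exponential rate: 1 / mean
    distance = 0.0
    for i, x in enumerate(sorted(tbf)):
        model = 1.0 - math.exp(-rate * x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        distance = max(distance, abs((i + 1) / n - model), abs(model - i / n))
    return distance


# Made-up bursty data: many 1-minute gaps and a few very long ones.
bursty = [1.0] * 90 + [1000.0] * 10
print(ks_distance_to_exponential(bursty))
```

For bursty data like the toy sample, the distance is close to 1, because the fitted exponential puts almost no probability mass on the short gaps that dominate the data.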

Page 42:

Failure Operation Ticket (FOT)

• Categories of FOTs
• Fields: id, hostid, hostname, hostidc, errordevice, errortype, errortime, errorposition, errordetail

Page 43:

FR of Misc. Failures During the Lifecycle

• Most manual detection and debugging efforts happen only at deployment time
• Less cost to repair (not many tasks to migrate)

Page 44:

RT for Each Component Class

• Median RTs for SSD and misc. failures are the shortest (hours)
• Median RTs for HDD, fans, and memory are the longest (7-18 days)
• Standard deviation of the RT for HDD: 30.2 days

Page 45:

Self-Monitoring, Analysis and Reporting Technology

• Fields: raw value, worst, threshold, status
• SMART attribute examples (failure related)
  - Reallocated Sectors Count
  - End-to-End error
  - Uncorrectable Sector Count
  - Reported Uncorrectable Errors
  - Current Pending Sector Count
  - Command Timeout
  - ...

Page 46:

Examples of Failure Types

Page 47:

Repeating Failures

• Over 85% of the fixed components never repeat the same failure
• Repair can fail
• 2% of servers that ever failed contribute more than 99% of all failures
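A concentration figure of the form "x% of servers account for y% of failures" can be computed by ranking servers by ticket count. The sketch below assumes one `hostname` entry per ticket; the function name and the toy data are hypothetical, and the slides do not say exactly how the authors counted.

```python
import math
from collections import Counter


def share_of_top_servers(hostnames, fraction):
    """Fraction of all failures contributed by the top `fraction` of failed servers."""
    counts = sorted(Counter(hostnames).values(), reverse=True)
    if not counts:
        raise ValueError("no failures given")
    k = max(1, math.ceil(fraction * len(counts)))  # number of top servers to include
    return sum(counts[:k]) / sum(counts)


# Toy data: one server with 97 tickets, three servers with one ticket each.
toy = ["a"] * 97 + ["b", "c", "d"]
print(share_of_top_servers(toy, 0.25))  # 0.97
```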

Page 48:

Batch Failure Frequency for Each Component

• r_N: a normalized count of the days, out of D days, on which more than N failures happen on the same day
• Normalized by the total time length D.
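Read literally, r_N is the number of days with more than N failures divided by D. A minimal sketch under that reading follows. It assumes one `date` entry per failure of a given component class; the function name and sample data are hypothetical.

```python
from collections import Counter
from datetime import date


def batch_failure_frequency(failure_days, n, total_days):
    """r_N: share of the D = total_days days on which more than n failures occurred."""
    if total_days <= 0:
        raise ValueError("total_days must be positive")
    per_day = Counter(failure_days)
    days_over = sum(1 for count in per_day.values() if count > n)
    return days_over / total_days


# Made-up data over a 10-day window: 3, 1, and 5 failures on three different days.
days = [date(2015, 1, 1)] * 3 + [date(2015, 1, 2)] + [date(2015, 1, 5)] * 5
print(batch_failure_frequency(days, n=2, total_days=10))  # 0.2
```

In this notation, the earlier statement that 22.5% of the days saw 200+ HDD failures corresponds to an r_N of about 0.225 for HDDs with N around 200.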