Transcript
Page 1: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,

What Can We Learn from Four Years of Data Center Hardware Failures?

Guosai Wang, Lifei Zhang, Wei Xu

Page 2:

Motivation: Evolving Failure Model

• Failures in data centers are common and costly
- Violate service level agreements (SLAs) and cause loss of revenue

• Understand failures: reduce TCO
• Today’s data centers are different
- Better failure detection systems, experienced operators
- Adoption of less-reliable, commodity or custom-ordered hardware; more heterogeneous hardware and workload
- Result: more complex failure model

• Goal: comprehensive analysis of hardware failures in modern large-scale IDCs

Page 3:

We Re-study Hardware Failures in IDCs

Our work:
- Large scale: hundreds of thousands of servers with 290,000 failure operation tickets
- Long-term: 2012-2016
- Multi-dimensional: components, time, space, product lines, operators’ response, etc.
- Reconfirm or extend previous findings + observe new patterns

[Figure: the five analysis dimensions: Time, Space, Components, Product lines, Operators’ response]

Page 4:

Interesting Findings Overview

Common beliefs
• Failures are uniformly randomly distributed over time/space
• Failures happen independently
• HW unreliability shapes the software fault tolerance design

Our findings
• HW failures are not uniformly random
- at different time scales
- sometimes at different locations
• Correlated HW failures are common in IDCs
• It is also the other way around: software fault tolerance lets operators care less about HW dependability

Page 5:

Failure Management Architecture

[Figure: failure management architecture diagram; labels lost in extraction]

Page 6:

Failure Management Architecture

• HMS agents detect failures on servers

[Figure: failure management architecture diagram; labels lost in extraction]

Page 7:

Failure Management Architecture

• HMS agents detect failures on servers
• HMS collects failure records and stores them in a failure pool

[Figure: failure management architecture diagram; labels lost in extraction]

Page 8:

Failure Management Architecture

• HMS agents detect failures on servers
• HMS collects failure records and stores them in a failure pool
• Operators/programs generate a FOT for each failure record

[Figure: failure management architecture diagram; labels lost in extraction]

Page 9:

Dataset: 290,000+ FOTs

• The failure operation tickets (FOTs) contain many fields

id, hostname, hostidc, errordevice, errortype, errortime, errorposition, optime, errordetail, etc.
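The field list can be pictured as a simple record type. The sketch below is only an illustration: the field names come from the slide, but the Python types, the timestamp format, the treatment of an empty `optime`, and the helper name `parse_fot` are all assumptions, since the slides do not show the real schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Assumed timestamp layout; the slides do not show the real format.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


@dataclass(frozen=True)
class FOT:
    """One failure operation ticket. Field names follow the slide; types are guesses."""
    id: int
    hostname: str
    hostidc: str
    errordevice: str
    errortype: str
    errortime: datetime
    errorposition: str
    optime: Optional[datetime]  # None while no operator has responded (assumption)
    errordetail: str


def parse_fot(row: dict) -> FOT:
    """Build an FOT from a dict of strings, e.g. one row read by csv.DictReader."""
    raw_optime = row.get("optime") or ""
    return FOT(
        id=int(row["id"]),
        hostname=row["hostname"],
        hostidc=row["hostidc"],
        errordevice=row["errordevice"],
        errortype=row["errortype"],
        errortime=datetime.strptime(row["errortime"], TIME_FORMAT),
        errorposition=row.get("errorposition", ""),
        optime=datetime.strptime(raw_optime, TIME_FORMAT) if raw_optime else None,
        errordetail=row.get("errordetail", ""),
    )


# Made-up example row, not taken from the dataset.
example = parse_fot({
    "id": "1", "hostname": "web-001", "hostidc": "idc-a",
    "errordevice": "HDD", "errortype": "SMARTFail",
    "errortime": "2015-11-16 21:05:00", "errorposition": "slot3",
    "optime": "", "errordetail": "",
})
print(example.errordevice, example.optime)
```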

Page 10:

Multi-dimensional Analysis on the Dataset

• We study the failures on different dimensions based on different fields of FOTs

[Figure: the five analysis dimensions: Time, Space, Components, Product lines, Operators’ response]

id, hostname, hostidc, errordevice, errortype, errortime, errorposition, optime, errordetail, etc.

Page 11:

Multi-dimensional Analysis on the Dataset

• We study the failures on different dimensions based on different fields of FOTs

Time: errortime
Space: hostname, hostidc
Components: errordevice
Product lines: hostname
Operators’ response: errortime, optime

id, hostname, hostidc, errordevice, errortype, errortime, errorposition, optime, errordetail, etc.

Page 12:

Failure Percentage Breakdown by Component

Device             Proportion
Hard Disk Drive    81.84%
Miscellaneous*     10.20%
Memory              3.06%
Power               1.74%
RAID card           1.23%
Flash card          0.67%
Motherboard         0.57%
SSD                 0.31%
Fan                 0.19%
HDD backboard       0.14%
CPU                 0.04%

* “Miscellaneous” are manually submitted or uncategorized failures

Page 13:

Failure Types for Hard Disk Drive

• About half of HDD failures are related to SMART values or prediction error count

[Figure: Failure Type Breakdown of HDD. Legend: SMARTFail, PredictErr, RaidPdPreErr, RaidPdFailed, Missing, NotReady, MediumErr, RaidPdMediaErr, BadSector, PendingLBA, TooMany, DStatus, Others. Callouts: "Some HDD SMART value exceeds the threshold"; "The prediction error count exceeds the threshold"; "Other types"]

SMART = Self-Monitoring, Analysis and Reporting Technology


Page 15:

Outline

• Dataset overview
➢ Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
• Operators’ response to failures
• Lessons Learned

Page 16:

FR is NOT Uniformly Random over Days of the Week

• Hypothesis 1. The average number of component failures is uniformly random over different days of the week.
• A chi-square test can reject the hypothesis at the 0.01 significance level for all component classes.
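A chi-square goodness-of-fit test against a uniform distribution over the seven weekdays can be sketched as follows. This is a minimal illustration rather than the authors' analysis code: the weekday counts are made up, the function names are hypothetical, and the constant 16.812 is the standard chi-square critical value for 6 degrees of freedom at the 0.01 level.

```python
# Chi-square critical value for 6 degrees of freedom (7 weekdays - 1), alpha = 0.01.
CRITICAL_DF6_ALPHA_001 = 16.812


def chi_square_uniform(counts):
    """Chi-square statistic of observed counts against equal expected counts."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def reject_uniform_weekdays(counts):
    """True if uniformity over the 7 weekdays is rejected at the 0.01 level."""
    if len(counts) != 7:
        raise ValueError("expected one count per day of the week")
    return chi_square_uniform(counts) > CRITICAL_DF6_ALPHA_001


# Made-up failure counts for Monday..Sunday (not from the dataset).
weekday_counts = [520, 540, 515, 530, 525, 410, 400]
print(chi_square_uniform(weekday_counts), reject_uniform_weekdays(weekday_counts))
```

With fewer failures on the two weekend days, as in this invented sample, the statistic is well above the critical value, so the uniformity hypothesis is rejected.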

Page 17:

FR is NOT Uniformly Random over Hours of the Day

• Hypothesis 2. The average number of component failures is uniformly random during each hour of the day.

Page 18:

FR is NOT Uniformly Random over Hours of the Day

• Possible reasons
- High workload results in more failures
- Human factors
- Components fail in large batches


Page 22:

FR of each Component Changes During its Life Cycle

• Different component classes exhibit different FR patterns.

Page 23:

FR of each Component Changes During its Life Cycle

• Infant mortalities:

Page 24:

FR of each Component Changes During its Life Cycle

• Wear out

Page 25:

Outline

• Dataset overview
• Temporal distribution of the failures
➢ Spatial distribution of the failures
• Correlated failures
• Operators’ response to failures
• Lessons Learned

Page 26:

Physical Locations Might Affect the FR Distribution

• Hypothesis 3. The failure rate on each rack position is independent of the rack position.
• In general, at the 0.05 significance level:
- cannot reject the hypothesis in 40% of the data centers
- can reject it in the other 60%

Page 27:

FR Can be Affected by the Cooling Design

• FRs are higher at rack positions 22 and 35
• Possible reasons
- Design of IDC cooling and physical structure of the racks

[Figure: a typical Scorpion rack; labels: "At the top", "Above the PSU", "Cooling air"]

Page 28:

Outline

• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
➢ Correlated failures
• Operators’ response to failures
• Lessons Learned

Page 29:

Correlated Failures are Common

• Correlated failures: batch failures, correlated component failures, repeating synchronous failures
• Fact: 200+ HDD failures on each of 22.5% of the days
• Case study
- Nov. 16th and 17th, 2015
- 5,000+ servers, or 32% of all the servers of the product line, reporting hard drive SMARTFail failures
- 99% of these failures were detected between 21:00 on the 16th and 3:00 on the 17th.
- Operators replaced about 1,600, decommissioned the remaining 4,000+ out-of-warranty drives
- Failure reason not clear yet

Page 30:

Causes of Correlated Failures

All the following have happened before
- Environmental factors (e.g., humidity)
- Firmware bugs
- Single point of failure (e.g., power module failures)
- Human operator mistakes
- ...

Page 31:

Outline

• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
➢ Operators’ response to failures
• Lessons Learned

Page 32:

Operators’ Response to Failures

• Response time: RT = op_time - err_time

[Figure: failure management architecture diagram; labels lost in extraction]

Page 33:

RT is Very High in General

• RT for D_fixing: Avg. 42.2 days, median 6.1 days
• 10% of the FOTs: RT > 140 days
- Is it because operators are busy dealing with a large number of failures?
- No!
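The gap between the average and the median RT is what a heavy right tail produces. The sketch below computes RT = op_time - err_time in days and summarizes a handful of tickets; the timestamps and helper names are invented for illustration, and only the definition of RT comes from the slides.

```python
from datetime import datetime, timedelta
from statistics import mean, median


def response_time_days(err_time: datetime, op_time: datetime) -> float:
    """RT = op_time - err_time, expressed in days."""
    return (op_time - err_time).total_seconds() / 86400.0


def summarize_rt(rts):
    """Mean and median of a list of response times (days)."""
    return {"mean": mean(rts), "median": median(rts)}


# Made-up tickets: most are handled within a week, one waits 200 days.
start = datetime(2015, 1, 1)
delays = [0.5, 2, 6, 7, 200]
tickets = [(start, start + timedelta(days=d)) for d in delays]

rts = [response_time_days(err, op) for err, op in tickets]
summary = summarize_rt(rts)
print(summary)
```

A single long-delayed ticket pulls the mean far above the median, which is the same shape as the 42.2-day average versus 6.1-day median reported above.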

Page 34:

RT in Different Product Lines Varies

• Observation 1: Variation of RT in different product lines is large
• Observation 2: Operators respond to a large number of failures more quickly

[Figure residue: axis label "Number of HDD Failures During Year 2015"; callouts "The REAL problems" and "Who cares?"]

Page 35:

OPs are Less Motivated to Respond to HW Failures

Possible reasons
• Software redundancy design
- Delayed responding, process failures in batches
• Many hardware failures are no longer urgent
- E.g., SMART failures may not be fatal
• Repair operation can be costly
- E.g., task migration

[Figure labels: Operator, Resilient Software, Hardware Redundancy]

Page 36:

Outline

• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
• Operators’ response to failures
➢ Lessons Learned

Page 37:

Lessons Learned I

• Much old wisdom still holds.
- More correlated failures → software design challenge
- Automatic hardware failure detection & handling
- Data center design: avoid “bad spots”

Page 38:

Lessons Learned II

• Strike the right balance among software stack complexity, hardware dependability, and operation cost.
• Data center dependability needs joint optimization effort that crosses layers.

[Figure labels: Operation Cost, Resilient Software Design, Dependable Hardware Infrastructure]

Page 39:

Lessons Learned III

• Stateful failure handling system
- Data mining tool: discover correlation among failures
- Provide operators with extra information

[Figure labels: Hardware Failure, Server model, Workload, Environment, Failure history, Correlation with other failures]

Page 40:

Thank you! Q&A

Outline
• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
• Operators’ response to failures
• Lessons Learned

Page 41:

TBF Cannot be Well Fitted by Well-known Distributions

• Hypothesis 4. Time between failures (TBF) of all components follows an exponential distribution.
• Hypothesis 5. TBF of each individual component class follows an exponential distribution.

[Figure: CDF of Time between Failures (min), log-scale x-axis from 10^0 to 10^2; curves: Exp, Weibull, Gamma, LogNormal, Data; callout: "Large proportion of small values"]
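One way to see why a large proportion of small TBF values defeats an exponential fit is to measure the largest gap between the empirical CDF and a fitted exponential CDF. The sketch below uses the Kolmogorov-Smirnov distance with the rate estimated as 1/mean; this choice of measure is an assumption made for illustration and is not necessarily the test the authors used, and both samples are synthetic.

```python
import math


def ks_distance_exponential(tbf):
    """Largest gap between the empirical CDF of tbf and an exponential CDF
    whose rate is the maximum-likelihood estimate 1 / mean(tbf)."""
    xs = sorted(tbf)
    n = len(xs)
    rate = n / sum(xs)
    distance = 0.0
    for i, x in enumerate(xs):
        fitted = 1.0 - math.exp(-rate * x)
        # Compare against the empirical CDF just before and just after x.
        distance = max(distance, abs(fitted - i / n), abs((i + 1) / n - fitted))
    return distance


# Synthetic sample that follows an exponential shape (evenly spaced quantiles).
n = 200
exponential_like = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]

# Synthetic bursty sample: many tiny gaps (batch failures) and a few long ones.
bursty = [0.1] * 180 + [100.0] * 20

print(ks_distance_exponential(exponential_like))
print(ks_distance_exponential(bursty))
```

The bursty sample sits far from any exponential curve because most of its mass is concentrated at very small gaps, which matches the callout on the slide.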

Page 42:

Failure Operation Ticket (FOT)

• Categories of FOTs
• Fields: id, hostid, hostname, hostidc, errordevice, errortype, errortime, errorposition, errordetail

Page 43:

FR of Misc. Failures During the Lifecycle

• Most manual detection and debugging efforts happen only at deployment time
• Less cost to repair (not many tasks to migrate)

Page 44:

RT for Each Component Class

• Median RTs for SSD and misc. failures are the shortest (hours)
• Median RTs for HDD, fans, and memory are the longest (7-18 days)
• Standard deviation of the RT for HDD: 30.2 days

Page 45:

Self-Monitoring, Analysis and Reporting Technology

• Fields: raw value, worst, threshold, status
• SMART attribute examples (failure related)
- Reallocated Sectors Count
- End-to-End error
- Uncorrectable Sector Count
- Reported Uncorrectable Errors
- Current Pending Sector Count
- Command Timeout
- ...

Page 46:

Examples of Failure Types

Page 47:

Repeating Failures

• Over 85% of the fixed components never repeat the same failure
• Repair can fail
• 2% of servers that ever failed contribute more than 99% of all failures
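A concentration figure of this kind (a small share of servers accounting for a large share of failures) can be computed from per-server failure counts. The sketch below is a generic illustration with invented counts and a hypothetical helper name; it does not reproduce the numbers above.

```python
import math


def top_share(failures_per_server, fraction):
    """Share of all failures contributed by the top `fraction` of servers,
    ranked by failure count (at least one server is always included)."""
    counts = sorted(failures_per_server, reverse=True)
    k = max(1, math.ceil(fraction * len(counts)))
    return sum(counts[:k]) / sum(counts)


# Made-up counts for four servers that each failed at least once.
counts = [97, 1, 1, 1]
print(top_share(counts, 0.25), top_share(counts, 0.5))
```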

Page 48:

Batch Failure Frequency for Each Component

• r_N: a normalized count of the days, out of D days in total, on which more than N failures happen on the same day
• Normalized by the total time length D.
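Read literally, r_N is the number of days with more than N failures divided by the total number of days D. A minimal sketch of that reading follows; the function name and the sample dates are invented, and the strict "more than N" comparison follows the wording of the slide.

```python
from collections import Counter
from datetime import date


def batch_failure_frequency(failure_dates, n, total_days):
    """r_N: fraction of the total_days (D) on which more than n failures occurred."""
    per_day = Counter(failure_dates)
    days_over_n = sum(1 for count in per_day.values() if count > n)
    return days_over_n / total_days


# Made-up failure dates for one component class over D = 4 days.
failure_dates = [
    date(2015, 11, 16), date(2015, 11, 16), date(2015, 11, 16),
    date(2015, 11, 17),
]
print(batch_failure_frequency(failure_dates, n=2, total_days=4))
```

Days with no failures never appear in the counter, so they count toward D but not toward the numerator.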

