predicting storage failures rebuild is not free 2012 2015 2017 capacity 4tb hdd 8tb hdd 16tb hdd...

PredictingStorageFailuresbyAhmedEl-Shimi

[email protected]

VAULT- LinuxStorageandFileSystemsConferenceCambridgeMA.March22,2017

CopyrightMinimaInc.2017

ThisTalk

• PartI:Motivation• DriveFailure&CommonMitigations• SeekingaBetterRecover/Rebuild• UseCases&Goals

• PartII:ExaminingtheData• Dataset• Features,Trends,Challenges

• PartIII:PredictingDriveFailure• HowtoEvaluate• Baseline• Approach&Models• Evaluation&Results


PartIDisksfail.Andevenwithredundancyfailurehascosts.


TheProblem

Source:HardDriveFailureRateReportQ32016,Backblaze.

2%DriveFailureRate

Averaged,Annualized.

DriveAnnualFailureRatebyManufacturerandModel

1.2%0.7% 0.6% 0.4% 0.4% 0.3% 0.0%

10.2%

3.2%

1.5%0.6%

11.3%

2.2%

0.0%

1 2 3 4 5 6 7 1 2 3 4 1 2 3

HGST Seagate WDC


Today’sMitigation

Assume: “Everythingthatcanfailwillfail”

Design: Redundancyateverypointoffailure

RAID ERASURECODING REPLICATION


ButRebuildisnotfree

2012 2015 2017

Capacity 4TBHDD 8TBHDD 16TBHDD

Throughput 100MB/Sec 150MB/Sec 200MB /Sec

1-Disk RebuildTime 11hours 15hours 22hours

0

200

400

600

800

1000

1200

1400

2002(80GBHDD) 2012(4TBHDD) 2015(8TBHDD) 2017(16TBHDD)

SingleHDDRebuildTime(Minutes)

RebuildInflationhasconsequences:• AvailabilityandDurability9s• Rebuildisaworkload!• Resilience,Reliability• DiskCapacity&NetworkManagement• FailureModes• Lotsofcomplexitytoaddressedgecases

Canwedobetterifwehadanearlywarning?


UseCases&Goals

Cloud

• ProactiveRebuild

• SmarterOpsScheduling

Enterprise/Field

• ProactiveRebuild

• BetterFRUSLA

End-UserPC

• BackupNow• Contingencyplanning


PartIIExaminingtheData.


TheBackblaze Dataset

• Backblaze.com:OnlineBackupandCloudStorageprovider.• 83+Kdrives• 2013-2016• Seagate,Hitachi,HGST,WesternDigital,Toshiba,Samsung

https://www.backblaze.com/b2/hard-drive-test-data.html• Hatsofftothemforsharingtheirdataopenly.


UnderstandingtheData

Dataset:• 83Kdrives• 46MDriveDays(2013-2016)• >5000failures• Dailysnapshotforeachdrive’shealth

state+SMARTmetrics• SMART(Self-Monitoring,Analysisand

ReportingTechnology)astandardmonitoringsystemincludedinHDDsandSSDs

ExampleSMARTAttributes:• SMART_1:ReadErrorRate• SMART_5:ReallocatedSectorsCount• SMART_9:PowerOnHours• SMART_7:SeekErrorRate• SMART_197:CurrentPendingSector

Count

https://en.wikipedia.org/wiki/S.M.A.R.T.


SMART5:ReallocatedSectorCount

Sampleof7678drives(50%failed/50%healthy)


SMART1:ReadErrorRate



SMART9:PowerOnHours



SMART197:CurrentPendingSectorCountover30days



PartIIIDiskfailurescanbepredicted.Notperfectly.Butbetterthansimpleheuristics.


PerformanceMetric

• Goals:• Detectrareevent(1/20orless)• TunedependingonUseCase

• ToleranceforFalsePositives• vs.ToleranceforFalseNegatives

• Wewanttomaximizeourabilitytomakebettertradeoffs

• PerformanceMetric:• PR.AUC:AreaUnderPrecision-Recallcurve

PR.AUC

Precision

RecallCopyrightMinimaInc.2017

BaselineHeuristic

• IfanyofthecriticalSMARTattributes>0thenthedriveislikelytofail

• SMART_5:ReallocatedSectorsCount• SMART_187:ReportedUncorrectable• SMART_188:CommandTimeout• SMART_197:CurrentPendingSectorCount• SMART_198:OfflineUncorrectable


BaselinePerformance

• Evaluationdataset:• 13980drives• 699failed• 13281healthy

• Precision:42%• (i.e.58%falsepositives)

• Recall:68%• (i.e.32%falsenegatives)

ConfusionMatrix

BaselinePrediction

Healthy Failed

TruthHealthy 12625 656

Failed 223 476


Approach

• Splitthedataintotrain/testandEval• 2013-2015:Train/Test• 2016:Eval

• Samplefromthetraindataat50/50• (Learnequallyfromfailure/health)

• SamplefromtheEval dataat95/05• (Evaluateatafixedreal-lifefailure/healthymix)


Featuremanagementchallenges

• InconsistencyofSMARTdatasupportacrossvendorsanddrivemodels• DataSparseness• OpacityofmostSMARTmetrics• FurtheropacityofNormalizedSMARTvalues• WideRangeformostSMARTvalues• Datasetskewnessbyvendorandmodel• Somegapsandinconsistenciesinthetelemetrydata


FeatureSelection&Models

• FeatureSelection:• RawoverNormalizedSMARTData• Createdmodel_fam featuretocollapsevendor/model• Z-ScoreNormalizationofRAWvalues• 3-day,5-dayrollingVarianceforallSMARTRAWvalues• andafewthingswhichdidn’tworkaswellasinitiallyhopedJ

• Models:• RandomForests• LogisticRegression• SupportVectorMachines

• Goal:• Improveonthebaseline• ExpandoptionstotuneDecisionThresholdforvariousPrecisionvs.Recalltradeoffs(basedonUseCase)


Model1:RandomForests

SingleDecisionTree RandomForestCopyrightMinimaInc.2017

Model2:LogisticRegression

• Sigmoid(Step)functiontomodelabinomialoutput(0/1)

• Workswellforlinearcontinuousnumericalinputs• Corollary:Horriblyforcategoricalnon-linearvariablesappearingtobecontinuousnumerical

• Inourcasethekeyisto:• IsolaterightSMARTmetrics• Normalizewhereneeded


Model3:SupportVectorMachines

• BinaryClassifierbasedonmappingnon-lineardataintoahigherdimensiontomakelinearseparationpossible

• Maximizesmarginofseparationbetween+/-categories

• InourcasethekeywasstilltoisolaterightSMARTmetricsbeforetraining

ByAlisneaky,svg versionbyUser:Zirguezi - Ownwork,CCBY-SA4.0,https://commons.wikimedia.org/w/index.php?curid=47868867


Results:Seagatedrives.DayZero


RandomModelAUC=0.05

Results:Seagatedrives.Day-1


RandomModelAUC=0.05



RandomModelAUC=0.05

Results:Hitachidrives.DayZero


RandomModelAUC=0.05

Results:Hitachidrives.Day-1


RandomModelAUC=0.05

FeatureImportance– SMARTRaw– AllModels

0

500

1000

1500

2000

2500

3000

3500

4000

Importance(RandomForestsmodeloverSMARTRawfeatures)


FeatureImportance– SMARTVariance- Seagate

0

50

100

150

200

250

300

350

400

Importance(RandomForestsmodeloverSMARTRollingVariancefeatures)


Summary

• 2%ofDisksfailannually,onaverage.Mileagevariesbymodel.• SMARTmetricscanclearlysignalfailures,sometimesdaysbeforeithappens• Wecantrain/predictacrosssomeofthedrivemodelsovercomingtrainingdatasparsity• WecanreasonablytrainmodelstopredictdrivefailureusingSMARTdataandimproveuponexistingheuristics• PickingandtuningtherightmodeldependsonUseCaseandgoals

• Precisionvs.Recalltradeoff• Differentmodelsgiveyoumoreoptions

• MoreDataisbetter.Duh!


predicting storage failures rebuild is not free 2012 2015 2017 capacity 4tb hdd 8tb hdd 16tb hdd...

Documents