predicting storage failures rebuild is not free 2012 2015 2017 capacity 4tb hdd 8tb hdd 16tb hdd...

35
Predicting Storage Failures by Ahmed El-Shimi [email protected] VAULT - Linux Storage and File Systems Conference Cambridge MA. March 22, 2017 Copyright Minima Inc. 2017

Upload: others

Post on 16-Apr-2020

13 views

Category:

Documents


1 download

TRANSCRIPT

PredictingStorageFailuresbyAhmedEl-Shimi

[email protected]

VAULT- LinuxStorageandFileSystemsConferenceCambridgeMA.March22,2017

CopyrightMinimaInc.2017

ThisTalk

• PartI:Motivation• DriveFailure&CommonMitigations• SeekingaBetterRecover/Rebuild• UseCases&Goals

• PartII:ExaminingtheData• Dataset• Features,Trends,Challenges

• PartIII:PredictingDriveFailure• HowtoEvaluate• Baseline• Approach&Models• Evaluation&Results

CopyrightMinimaInc.2017

PartIDisksfail.Andevenwithredundancyfailurehascosts.

CopyrightMinimaInc.2017

TheProblem

Source:HardDriveFailureRateReportQ32016,Backblaze.

2%DriveFailureRate

Averaged,Annualized.

DriveAnnualFailureRatebyManufacturerandModel

1.2%0.7% 0.6% 0.4% 0.4% 0.3% 0.0%

10.2%

3.2%

1.5%0.6%

11.3%

2.2%

0.0%

1 2 3 4 5 6 7 1 2 3 4 1 2 3

HGST Seagate WDC

CopyrightMinimaInc.2017

Today’sMitigation

Assume: “Everythingthatcanfailwillfail”

Design: Redundancyateverypointoffailure

RAID ERASURECODING REPLICATION

CopyrightMinimaInc.2017

ButRebuildisnotfree

2012 2015 2017

Capacity 4TBHDD 8TBHDD 16TBHDD

Throughput 100MB/Sec 150MB/Sec 200MB /Sec

1-Disk RebuildTime 11hours 15hours 22hours

0

200

400

600

800

1000

1200

1400

2002(80GBHDD) 2012(4TBHDD) 2015(8TBHDD) 2017(16TBHDD)

SingleHDDRebuildTime(Minutes)

RebuildInflationhasconsequences:• AvailabilityandDurability9s• Rebuildisaworkload!• Resilience,Reliability• DiskCapacity&NetworkManagement• FailureModes• Lotsofcomplexitytoaddressedgecases

Canwedobetterifwehadanearlywarning?

CopyrightMinimaInc.2017

UseCases&Goals

Cloud

• ProactiveRebuild

• SmarterOpsScheduling

Enterprise/Field

• ProactiveRebuild

• BetterFRUSLA

End-UserPC

• BackupNow• Contingencyplanning

CopyrightMinimaInc.2017

PartIIExaminingtheData.

CopyrightMinimaInc.2017

TheBackblaze Dataset

• Backblaze.com:OnlineBackupandCloudStorageprovider.• 83+Kdrives• 2013-2016• Seagate,Hitachi,HGST,WesternDigital,Toshiba,Samsung

https://www.backblaze.com/b2/hard-drive-test-data.html• Hatsofftothemforsharingtheirdataopenly.

CopyrightMinimaInc.2017

UnderstandingtheData

Dataset:• 83Kdrives• 46MDriveDays(2013-2016)• >5000failures• Dailysnapshotforeachdrive’shealth

state+SMARTmetrics• SMART(Self-Monitoring,Analysisand

ReportingTechnology)astandardmonitoringsystemincludedinHDDsandSSDs

ExampleSMARTAttributes:• SMART_1:ReadErrorRate• SMART_5:ReallocatedSectorsCount• SMART_9:PowerOnHours• SMART_7:SeekErrorRate• SMART_197:CurrentPendingSector

Count

https://en.wikipedia.org/wiki/S.M.A.R.T.

CopyrightMinimaInc.2017

SMART5:ReallocatedSectorCount

Sampleof7678drives(50%failed/50%healthy)

CopyrightMinimaInc.2017

SMART1:ReadErrorRate

Sampleof7678drives(50%failed/50%healthy)

CopyrightMinimaInc.2017

SMART9:PowerOnHours

Sampleof7678drives(50%failed/50%healthy)

CopyrightMinimaInc.2017

SMART197:CurrentPendingSectorCountover30days

Sampleof7678drives(50%failed/50%healthy)

CopyrightMinimaInc.2017

PartIIIDiskfailurescanbepredicted.Notperfectly.Butbetterthansimpleheuristics.

CopyrightMinimaInc.2017

PerformanceMetric

• Goals:• Detectrareevent(1/20orless)• TunedependingonUseCase

• ToleranceforFalsePositives• vs.ToleranceforFalseNegatives

• Wewanttomaximizeourabilitytomakebettertradeoffs

• PerformanceMetric:• PR.AUC:AreaUnderPrecision-Recallcurve

PR.AUC

Precision

RecallCopyrightMinimaInc.2017

BaselineHeuristic

• IfanyofthecriticalSMARTattributes>0thenthedriveislikelytofail

• SMART_5:ReallocatedSectorsCount• SMART_187:ReportedUncorrectable• SMART_188:CommandTimeout• SMART_197:CurrentPendingSectorCount• SMART_198:OfflineUncorrectable

CopyrightMinimaInc.2017

BaselinePerformance

• Evaluationdataset:• 13980drives• 699failed• 13281healthy

• Precision:42%• (i.e.58%falsepositives)

• Recall:68%• (i.e.32%falsenegatives)

ConfusionMatrix

BaselinePrediction

Healthy Failed

TruthHealthy 12625 656

Failed 223 476

CopyrightMinimaInc.2017

Approach

• Splitthedataintotrain/testandEval• 2013-2015:Train/Test• 2016:Eval

• Samplefromthetraindataat50/50• (Learnequallyfromfailure/health)

• SamplefromtheEval dataat95/05• (Evaluateatafixedreal-lifefailure/healthymix)

CopyrightMinimaInc.2017

Featuremanagementchallenges

• InconsistencyofSMARTdatasupportacrossvendorsanddrivemodels• DataSparseness• OpacityofmostSMARTmetrics• FurtheropacityofNormalizedSMARTvalues• WideRangeformostSMARTvalues• Datasetskewnessbyvendorandmodel• Somegapsandinconsistenciesinthetelemetrydata

CopyrightMinimaInc.2017

FeatureSelection&Models

• FeatureSelection:• RawoverNormalizedSMARTData• Createdmodel_fam featuretocollapsevendor/model• Z-ScoreNormalizationofRAWvalues• 3-day,5-dayrollingVarianceforallSMARTRAWvalues• andafewthingswhichdidn’tworkaswellasinitiallyhopedJ

• Models:• RandomForests• LogisticRegression• SupportVectorMachines

• Goal:• Improveonthebaseline• ExpandoptionstotuneDecisionThresholdforvariousPrecisionvs.Recalltradeoffs(basedonUseCase)

CopyrightMinimaInc.2017

Model1:RandomForests

SingleDecisionTree RandomForestCopyrightMinimaInc.2017

Model2:LogisticRegression

• Sigmoid(Step)functiontomodelabinomialoutput(0/1)

• Workswellforlinearcontinuousnumericalinputs• Corollary:Horriblyforcategoricalnon-linearvariablesappearingtobecontinuousnumerical

• Inourcasethekeyisto:• IsolaterightSMARTmetrics• Normalizewhereneeded

CopyrightMinimaInc.2017

Model3:SupportVectorMachines

• BinaryClassifierbasedonmappingnon-lineardataintoahigherdimensiontomakelinearseparationpossible

• Maximizesmarginofseparationbetween+/-categories

• InourcasethekeywasstilltoisolaterightSMARTmetricsbeforetraining

ByAlisneaky,svg versionbyUser:Zirguezi - Ownwork,CCBY-SA4.0,https://commons.wikimedia.org/w/index.php?curid=47868867

CopyrightMinimaInc.2017

Results:Seagatedrives.DayZero

CopyrightMinimaInc.2017

RandomModelAUC=0.05

Results:Seagatedrives.DayZero

CopyrightMinimaInc.2017

RandomModelAUC=0.05

Results:Seagatedrives.Day-1

CopyrightMinimaInc.2017

RandomModelAUC=0.05

Results:Seagatedrives.Day-2

CopyrightMinimaInc.2017

RandomModelAUC=0.05

Results:Seagatedrives.Day-4

CopyrightMinimaInc.2017

RandomModelAUC=0.05

Results:Seagatedrives.Day-8

CopyrightMinimaInc.2017

RandomModelAUC=0.05

Results:Hitachidrives.DayZero

CopyrightMinimaInc.2017

RandomModelAUC=0.05

Results:Hitachidrives.Day-1

CopyrightMinimaInc.2017

RandomModelAUC=0.05

FeatureImportance– SMARTRaw– AllModels

0

500

1000

1500

2000

2500

3000

3500

4000

Importance(RandomForestsmodeloverSMARTRawfeatures)

CopyrightMinimaInc.2017

FeatureImportance– SMARTVariance- Seagate

0

50

100

150

200

250

300

350

400

Importance(RandomForestsmodeloverSMARTRollingVariancefeatures)

CopyrightMinimaInc.2017

Summary

• 2%ofDisksfailannually,onaverage.Mileagevariesbymodel.• SMARTmetricscanclearlysignalfailures,sometimesdaysbeforeithappens• Wecantrain/predictacrosssomeofthedrivemodelsovercomingtrainingdatasparsity• WecanreasonablytrainmodelstopredictdrivefailureusingSMARTdataandimproveuponexistingheuristics• PickingandtuningtherightmodeldependsonUseCaseandgoals

• Precisionvs.Recalltradeoff• Differentmodelsgiveyoumoreoptions

• MoreDataisbetter.Duh!

CopyrightMinimaInc.2017