predicting storage failures rebuild is not free 2012 2015 2017 capacity 4tb hdd 8tb hdd 16tb hdd...
TRANSCRIPT
PredictingStorageFailuresbyAhmedEl-Shimi
VAULT- LinuxStorageandFileSystemsConferenceCambridgeMA.March22,2017
CopyrightMinimaInc.2017
ThisTalk
• PartI:Motivation• DriveFailure&CommonMitigations• SeekingaBetterRecover/Rebuild• UseCases&Goals
• PartII:ExaminingtheData• Dataset• Features,Trends,Challenges
• PartIII:PredictingDriveFailure• HowtoEvaluate• Baseline• Approach&Models• Evaluation&Results
CopyrightMinimaInc.2017
TheProblem
Source:HardDriveFailureRateReportQ32016,Backblaze.
2%DriveFailureRate
Averaged,Annualized.
DriveAnnualFailureRatebyManufacturerandModel
1.2%0.7% 0.6% 0.4% 0.4% 0.3% 0.0%
10.2%
3.2%
1.5%0.6%
11.3%
2.2%
0.0%
1 2 3 4 5 6 7 1 2 3 4 1 2 3
HGST Seagate WDC
CopyrightMinimaInc.2017
Today’sMitigation
Assume: “Everythingthatcanfailwillfail”
Design: Redundancyateverypointoffailure
RAID ERASURECODING REPLICATION
CopyrightMinimaInc.2017
ButRebuildisnotfree
2012 2015 2017
Capacity 4TBHDD 8TBHDD 16TBHDD
Throughput 100MB/Sec 150MB/Sec 200MB /Sec
1-Disk RebuildTime 11hours 15hours 22hours
0
200
400
600
800
1000
1200
1400
2002(80GBHDD) 2012(4TBHDD) 2015(8TBHDD) 2017(16TBHDD)
SingleHDDRebuildTime(Minutes)
RebuildInflationhasconsequences:• AvailabilityandDurability9s• Rebuildisaworkload!• Resilience,Reliability• DiskCapacity&NetworkManagement• FailureModes• Lotsofcomplexitytoaddressedgecases
Canwedobetterifwehadanearlywarning?
CopyrightMinimaInc.2017
UseCases&Goals
Cloud
• ProactiveRebuild
• SmarterOpsScheduling
Enterprise/Field
• ProactiveRebuild
• BetterFRUSLA
End-UserPC
• BackupNow• Contingencyplanning
CopyrightMinimaInc.2017
TheBackblaze Dataset
• Backblaze.com:OnlineBackupandCloudStorageprovider.• 83+Kdrives• 2013-2016• Seagate,Hitachi,HGST,WesternDigital,Toshiba,Samsung
https://www.backblaze.com/b2/hard-drive-test-data.html• Hatsofftothemforsharingtheirdataopenly.
CopyrightMinimaInc.2017
UnderstandingtheData
Dataset:• 83Kdrives• 46MDriveDays(2013-2016)• >5000failures• Dailysnapshotforeachdrive’shealth
state+SMARTmetrics• SMART(Self-Monitoring,Analysisand
ReportingTechnology)astandardmonitoringsystemincludedinHDDsandSSDs
ExampleSMARTAttributes:• SMART_1:ReadErrorRate• SMART_5:ReallocatedSectorsCount• SMART_9:PowerOnHours• SMART_7:SeekErrorRate• SMART_197:CurrentPendingSector
Count
https://en.wikipedia.org/wiki/S.M.A.R.T.
CopyrightMinimaInc.2017
SMART197:CurrentPendingSectorCountover30days
Sampleof7678drives(50%failed/50%healthy)
CopyrightMinimaInc.2017
PartIIIDiskfailurescanbepredicted.Notperfectly.Butbetterthansimpleheuristics.
CopyrightMinimaInc.2017
PerformanceMetric
• Goals:• Detectrareevent(1/20orless)• TunedependingonUseCase
• ToleranceforFalsePositives• vs.ToleranceforFalseNegatives
• Wewanttomaximizeourabilitytomakebettertradeoffs
• PerformanceMetric:• PR.AUC:AreaUnderPrecision-Recallcurve
PR.AUC
Precision
RecallCopyrightMinimaInc.2017
BaselineHeuristic
• IfanyofthecriticalSMARTattributes>0thenthedriveislikelytofail
• SMART_5:ReallocatedSectorsCount• SMART_187:ReportedUncorrectable• SMART_188:CommandTimeout• SMART_197:CurrentPendingSectorCount• SMART_198:OfflineUncorrectable
CopyrightMinimaInc.2017
BaselinePerformance
• Evaluationdataset:• 13980drives• 699failed• 13281healthy
• Precision:42%• (i.e.58%falsepositives)
• Recall:68%• (i.e.32%falsenegatives)
ConfusionMatrix
BaselinePrediction
Healthy Failed
TruthHealthy 12625 656
Failed 223 476
CopyrightMinimaInc.2017
Approach
• Splitthedataintotrain/testandEval• 2013-2015:Train/Test• 2016:Eval
• Samplefromthetraindataat50/50• (Learnequallyfromfailure/health)
• SamplefromtheEval dataat95/05• (Evaluateatafixedreal-lifefailure/healthymix)
CopyrightMinimaInc.2017
Featuremanagementchallenges
• InconsistencyofSMARTdatasupportacrossvendorsanddrivemodels• DataSparseness• OpacityofmostSMARTmetrics• FurtheropacityofNormalizedSMARTvalues• WideRangeformostSMARTvalues• Datasetskewnessbyvendorandmodel• Somegapsandinconsistenciesinthetelemetrydata
CopyrightMinimaInc.2017
FeatureSelection&Models
• FeatureSelection:• RawoverNormalizedSMARTData• Createdmodel_fam featuretocollapsevendor/model• Z-ScoreNormalizationofRAWvalues• 3-day,5-dayrollingVarianceforallSMARTRAWvalues• andafewthingswhichdidn’tworkaswellasinitiallyhopedJ
• Models:• RandomForests• LogisticRegression• SupportVectorMachines
• Goal:• Improveonthebaseline• ExpandoptionstotuneDecisionThresholdforvariousPrecisionvs.Recalltradeoffs(basedonUseCase)
CopyrightMinimaInc.2017
Model2:LogisticRegression
• Sigmoid(Step)functiontomodelabinomialoutput(0/1)
• Workswellforlinearcontinuousnumericalinputs• Corollary:Horriblyforcategoricalnon-linearvariablesappearingtobecontinuousnumerical
• Inourcasethekeyisto:• IsolaterightSMARTmetrics• Normalizewhereneeded
CopyrightMinimaInc.2017
Model3:SupportVectorMachines
• BinaryClassifierbasedonmappingnon-lineardataintoahigherdimensiontomakelinearseparationpossible
• Maximizesmarginofseparationbetween+/-categories
• InourcasethekeywasstilltoisolaterightSMARTmetricsbeforetraining
ByAlisneaky,svg versionbyUser:Zirguezi - Ownwork,CCBY-SA4.0,https://commons.wikimedia.org/w/index.php?curid=47868867
CopyrightMinimaInc.2017
FeatureImportance– SMARTRaw– AllModels
0
500
1000
1500
2000
2500
3000
3500
4000
Importance(RandomForestsmodeloverSMARTRawfeatures)
CopyrightMinimaInc.2017
FeatureImportance– SMARTVariance- Seagate
0
50
100
150
200
250
300
350
400
Importance(RandomForestsmodeloverSMARTRollingVariancefeatures)
CopyrightMinimaInc.2017
Summary
• 2%ofDisksfailannually,onaverage.Mileagevariesbymodel.• SMARTmetricscanclearlysignalfailures,sometimesdaysbeforeithappens• Wecantrain/predictacrosssomeofthedrivemodelsovercomingtrainingdatasparsity• WecanreasonablytrainmodelstopredictdrivefailureusingSMARTdataandimproveuponexistingheuristics• PickingandtuningtherightmodeldependsonUseCaseandgoals
• Precisionvs.Recalltradeoff• Differentmodelsgiveyoumoreoptions
• MoreDataisbetter.Duh!
CopyrightMinimaInc.2017