8/9/2019 Failure Trends
1/21
FAILURE TRENDS IN A LARGE DISK DRIVE POPULATION
Paper: Eduardo Pinheiro, Wolf-Dietrich Weber, Luiz André Barroso
Slides: Tim Disney for CMPS 229 - Spring 2010
Thursday, April 15, 2010
-
In 2002, an estimated 90% of all new information was stored on magnetic media
-
Problem: Insufficient studies of disk failure
Manufacturers
Accelerated life test extrapolation
Returned units
Field studies
Small populations
Too little monitoring
-
Solution: Google it
Figure 1: Collection, storage, and analysis architecture.
Population:
> 100,000 drives
From 2001
5400 to 7200 rpm
80 to 400 GB
Collection:
S.M.A.R.T
Environment (temp)
Activity Level
-
Data
Failure:
Some drives fail in the field but test as good
A drive counts as failed if it was replaced as part of a repairs procedure
Filtering:
Clean up bad data
Some drives reported being hotter than the sun
Filtering reduced the sample set by less than 0.1%
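The filtering step can be illustrated with a minimal sketch. It assumes each reading is a (drive id, temperature in Celsius) pair; the bounds `MIN_TEMP_C` and `MAX_TEMP_C` and the function name are hypothetical and are not the thresholds used in the paper.

```python
# Hypothetical bounds for a plausible drive temperature, in degrees Celsius.
# These values are illustrative assumptions, not the paper's thresholds.
MIN_TEMP_C = 0.0
MAX_TEMP_C = 100.0

def filter_readings(readings):
    """Split (drive_id, temp_c) readings into (kept, dropped) lists."""
    kept, dropped = [], []
    for drive_id, temp_c in readings:
        if MIN_TEMP_C <= temp_c <= MAX_TEMP_C:
            kept.append((drive_id, temp_c))
        else:
            dropped.append((drive_id, temp_c))
    return kept, dropped

# Made-up sample: drive "d3" reports an impossible temperature.
readings = [("d1", 34.0), ("d2", 41.5), ("d3", 5600.0), ("d4", 29.0)]
kept, dropped = filter_readings(readings)
dropped_fraction = len(dropped) / len(readings)
```

Tracking the dropped fraction makes it easy to confirm that the filter removes only a small part of the sample.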
-
Say it with Charts!
AFR (annualized failure rate): percentage of drives that fail in a year
The 3-month, 6-month, and 1-year age groups overlap
Drive models are mixed
Figure 2: Annualized failure rates broken down by age groups
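One common way to compute an annualized failure rate is failures per drive-year of operation, expressed as a percentage. The sketch below assumes that definition; the function name and the sample numbers are hypothetical and do not come from the paper.

```python
def annualized_failure_rate(failures, drive_days):
    """Return AFR as a percentage: failures per drive-year of operation."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return 100.0 * failures / drive_years

# Made-up example: 1,000 drives observed for 73 days each, with 4 failures.
# That is 200 drive-years of operation, so the AFR is 2%.
afr = annualized_failure_rate(failures=4, drive_days=1000 * 73)
```

Counting drive-days rather than drives lets a population observed for less than a full year still yield a yearly rate.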
-
Utilization
Weekly averages of read/write bandwidth
Expected high correlation to failure
Figure 3: Utilization AFR
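A weekly average can be sketched as follows. The slides only state that utilization is a weekly average of read/write bandwidth; the daily sampling, the units, and the function name here are assumptions made for illustration.

```python
def weekly_averages(daily_mb):
    """Average daily bandwidth samples over consecutive 7-day windows.

    A trailing partial week is averaged over the days it actually contains.
    """
    weeks = []
    for start in range(0, len(daily_mb), 7):
        window = daily_mb[start:start + 7]
        weeks.append(sum(window) / len(window))
    return weeks

# Made-up daily totals (MB transferred): one full week plus two extra days.
daily_mb = [10, 20, 30, 40, 50, 60, 70, 100, 100]
weeks = weekly_averages(daily_mb)
```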
-
Temperature
High temperature is often thought to be an important factor in drive failure
Effect seen only at very low and very high temperatures
Figure 4: Distribution of average temperatures and failure rates.
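A distribution like the one in Figure 4 can be approximated by grouping drives into temperature bins and computing the fraction that failed in each bin. This is a minimal sketch under assumed inputs: the 5-degree bin width, the function name, and the sample data are hypothetical, and it reports a plain failed fraction rather than an annualized rate.

```python
from collections import defaultdict

def failure_rate_by_temp_bin(drives, bin_width_c=5):
    """Return {bin_start_c: fraction_failed} for (avg_temp_c, has_failed) pairs."""
    totals = defaultdict(int)
    failed = defaultdict(int)
    for avg_temp_c, has_failed in drives:
        # Round down to the start of the bin, e.g. 34.5 -> 30 for width 5.
        bin_start = int(avg_temp_c // bin_width_c) * bin_width_c
        totals[bin_start] += 1
        if has_failed:
            failed[bin_start] += 1
    return {b: failed[b] / totals[b] for b in sorted(totals)}

# Made-up sample of (average temperature in Celsius, failed?) pairs.
drives = [(22.0, True), (24.0, False), (31.0, False), (33.0, False),
          (34.5, False), (47.0, True), (48.0, False)]
rates = failure_rate_by_temp_bin(drives)
```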
-
Temperature
More pronounced difference for
very old drives
Figure 5: AFR for average drive temperature.
-
SMART - Scan Errors
Found by drives scanning the
disk surface in the background
Figure 6: AFR for scan errors.
-
SMART - Scan Errors
Figure 8: Impact of scan errors on survival probability. Left figure shows aggregate survival probability for all drives after first scan error. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of scan errors.
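The survival probability after a first scan error can be estimated empirically as the fraction of affected drives that have not failed a given number of days later. The sketch below is a simplification and not the authors' method: it assumes every drive was observed for the full horizon (no censoring), and the function name and data are hypothetical.

```python
def survival_after_event(days_to_failure, horizon_days):
    """Fraction of drives still alive `horizon_days` after their first scan error.

    `days_to_failure` holds, per drive, the number of days between the first
    scan error and failure, or None if the drive never failed while observed.
    Assumes every drive was observed for at least `horizon_days`.
    """
    if not days_to_failure:
        raise ValueError("need at least one drive")
    alive = sum(1 for d in days_to_failure if d is None or d > horizon_days)
    return alive / len(days_to_failure)

# Made-up sample of ten drives; three of them fail after their first scan error.
days_to_failure = [None, 12, None, 45, None, None, 200, None, None, None]
curve = {h: survival_after_event(days_to_failure, h) for h in (30, 60, 240)}
```

The same function applies to the other SMART signals by changing which event starts the clock.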
-
SMART - Reallocation Counts
Number of times a bad sector
has been remapped
Figure 7: AFR for reallocation counts.
-
SMART - Reallocation Counts
Figure 11: Impact of reallocation count values on survival probability. Left figure shows aggregate survival probability for all drives after first reallocation. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of reallocations.
-
SMART - Offline Reallocation Counts
Reallocated sectors found only
during background scrubbing
Figure 9: AFR for offline reallocation count.
-
SMART - Offline Reallocation Counts
Figure 12: Impact of offline reallocation on survival probability. Left figure shows aggregate survival probability for all drives after first offline reallocation. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of offline reallocations.
-
SMART - Probational Count
Suspicious sectors that haven't
yet been reallocated
Figure 10: AFR for probational count.
-
SMART - Probational Count
Figure 13: Impact of probational count values on survival probability. Left figure shows aggregate survival probability for all drives after first probational count. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of probational counts.
-
SMART - Others
Seek Errors
Drive fails to track to a sector
Only seen in drives from a single manufacturer
CRC Errors
Error between physical media and interface
Some correlation but not much
Power Cycles
Number of times the drive is powered on/off
Correlated only in older drives
Calibration Retries
Authors couldn't find a consistent definition
Weakly tied to failure rates
Spin Retries
Number of retries when the disk spins up
No counts in the population
Power-on hours
In this population, basically the same as drive age
-
Predictive power
Authors hoped to form a predictive failure model
[Pie chart: failed drives, split 56% / 44% between those with and without SMART signals]
-
Predictive power
Figure 14: Percentage of failed drives with SMART errors.
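The percentage in Figure 14 amounts to asking what share of failed drives showed a nonzero count in at least one SMART signal before failing. A minimal sketch, assuming each failed drive is a dictionary of signal counts; the key names, function name, and sample data are hypothetical, with the four signals taken from the slides above.

```python
# The four SMART signals covered in the preceding slides (key names are assumed).
SIGNALS = ("scan_errors", "reallocations", "offline_reallocations", "probational")

def fraction_with_signal(failed_drives, signals=SIGNALS):
    """Fraction of failed drives with a nonzero count in any listed signal."""
    if not failed_drives:
        raise ValueError("need at least one failed drive")
    flagged = sum(
        1 for drive in failed_drives
        if any(drive.get(s, 0) > 0 for s in signals)
    )
    return flagged / len(failed_drives)

# Made-up sample: two of four failed drives showed a signal before failing.
failed_drives = [
    {"scan_errors": 3},
    {"reallocations": 1, "probational": 2},
    {},
    {"scan_errors": 0},
]
share = fraction_with_signal(failed_drives)
```

Drives that fall outside `share` are the ones a model built only on these signals could not flag in advance.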
-
So...
Huge study of disk failure
Temperature and activity are less important than believed
SMART parameters are well correlated with failure
But SMART data alone won't be enough to predict failures