
  • 8/9/2019 Failure Trends

    1/21

    FAILURE TRENDS IN A LARGE DISK DRIVE POPULATION

    Paper: Eduardo Pinheiro, Wolf-Dietrich Weber, Luiz Andre Barroso

    Slides: Tim Disney for CMPS 229, Spring 2010

    Thursday, April 15, 2010


    In 2002, an estimated 90% of all new information was stored on magnetic media


    Problem: Insufficient studies of disk failure

    Manufacturers
      Accelerated life test extrapolation
      Returned units

    Field studies
      Small populations
      Too little monitoring


    Solution: Google it

    Figure 1: Collection, storage, and analysis architecture.

    Population:
      > 100,000 drives
      From 2001
      5400 to 7200 rpm
      80 to 400 GB

    Collection:
      S.M.A.R.T.
      Environment (temp)
      Activity level


    Data

    Failure:
      Some drives fail in the field but test as good
      A drive counts as failed if it was replaced as part of a repairs procedure

    Filtering:
      Clean up bad data
      Some drives reported being hotter than the sun
      Filtering reduced the sample set by less than 0.1%
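The filtering step can be pictured as a simple sanity check on each reading. The sketch below is only an illustration: the function name, the drive labels, and the 0 to 100 degrees Celsius bounds are assumptions made here, not values taken from the paper.

```python
def is_plausible_temperature(temp_c: float, low: float = 0.0, high: float = 100.0) -> bool:
    """Return True if a reported drive temperature is physically believable.

    The bounds are hypothetical; pick limits that suit the hardware in question.
    """
    return low <= temp_c <= high


# Hypothetical readings: (drive label, reported temperature in Celsius).
readings = [("drive-a", 31.0), ("drive-b", 5800.0), ("drive-c", 44.5)]

# Keep only the readings that pass the sanity check.
clean = [(drive, temp) for drive, temp in readings if is_plausible_temperature(temp)]

# Fraction of the sample removed by the filter.
dropped_fraction = 1 - len(clean) / len(readings)
print(clean, dropped_fraction)
```

Tracking the dropped fraction matters because a filter that removes much of the data could bias the results; the slide reports that the real filter touched less than 0.1% of the sample.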


    Say it with Charts!

    AFR: percentage of drives that fail per year

    3-month, 6-month, and 1-year sample sets overlap

    Drive models are mixed

    Figure 2: Annualized failure rates broken down by age groups
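As a rough illustration of the AFR metric, the sketch below computes an annualized failure rate as failures divided by drive-years of observation. The function name and the sample numbers are hypothetical, and the paper's exact normalization may differ.

```python
def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """Return AFR as a percentage: failures per drive-year of observation."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return 100.0 * failures / drive_years


# Hypothetical example: 1,000 drives observed for 90 days each, 5 failures.
afr = annualized_failure_rate(failures=5, drive_days=1000 * 90)
print(f"AFR: {afr:.2f}%")
```

Normalizing by drive-years rather than by drive count lets age groups with different observation windows be compared on the same scale.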


    Utilization

    Weekly averages of read/write bandwidth

    Expected a high correlation with failure

    Figure 3: Utilization AFR


    Temperature

    High temperature is often thought to be an important factor in drive failure

    The data show an effect only at very low and very high temperatures

    Figure 4: Distribution of average temperatures and failure rates.


    Temperature

    More pronounced difference for very old drives

    Figure 5: AFR for average drive temperature.


    SMART - Scan Errors

    Found by drives scanning the disk surface in the background

    Figure 6: AFR for scan errors.


    SMART - Scan Errors

    Figure 8: Impact of scan errors on survival probability. Left figure shows aggregate survival probability for all drives after first scan error. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of scan errors.
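A survival probability of this kind can be read as the fraction of drives still working a given number of days after their first scan error. The sketch below is a simplified illustration with made-up data: it assumes every drive was observed for at least the full horizon (no censoring), which the paper's own analysis may handle differently.

```python
from typing import Optional, Sequence


def survival_probability(days_to_failure: Sequence[Optional[float]],
                         horizon_days: float) -> float:
    """Fraction of drives still alive horizon_days after their first scan error.

    Each entry is the number of days from first scan error to failure,
    or None if the drive had not failed by the end of observation.
    Assumes every drive was observed for at least horizon_days.
    """
    if not days_to_failure:
        raise ValueError("need at least one drive")
    survivors = sum(1 for d in days_to_failure if d is None or d > horizon_days)
    return survivors / len(days_to_failure)


# Hypothetical population of 10 drives that each logged a first scan error.
observed = [None, 12.0, None, 45.0, None, None, 200.0, None, None, None]

for horizon in (30, 60, 240):
    print(horizon, survival_probability(observed, horizon))
```

Evaluating the same function on subsets, for example grouped by drive age or by number of scan errors, would give breakdowns like those in the middle and right panels.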


    SMART - Reallocation Counts

    Number of times a bad sector has been remapped

    Figure 7: AFR for reallocation counts.


    SMART - Reallocation Counts

    Figure 11: Impact of reallocation count values on survival probability. Left figure shows aggregate survival probability for all drives after first reallocation. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of reallocations.


    SMART - Offline Reallocation Counts

    Reallocated sectors found only during background scrubbing

    Figure 9: AFR for offline reallocation count.


    SMART - Offline Reallocation Counts

    Figure 12: Impact of offline reallocation on survival probability. Left figure shows aggregate survival probability for all drives after first offline reallocation. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of offline reallocations.


    SMART - Probational Count

    Suspicious sectors that haven't yet been reallocated

    Figure 10: AFR for probational count.


    SMART - Probational Count

    Figure 13: Impact of probational count values on survival probability. Left figure shows aggregate survival probability for all drives after first probational count. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of probational counts.


    SMART - Others

    Seek Errors
      Drive fails to track to a sector
      Only seen in drives from a single manufacturer

    CRC Errors
      Errors between the physical media and the interface
      Some correlation but not much

    Power Cycles
      Number of times the drive is powered on/off
      Correlated only in older drives

    Calibration Retries
      Authors couldn't find a consistent definition
      Weakly tied to failure rates

    Spin Retries
      Number of retries when the disk spins up
      No counts in the population

    Power-on Hours
      In this population, basically the same as drive age


    Predictive power

    Hoped to form a predictive failure model

    Failed drives: 44% with SMART errors, 56% without SMART errors


    Predictive power

    Figure 14: Percentage of failed drives with SMART errors.
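One way to arrive at a percentage like this is to count how many failed drives showed a nonzero value in at least one SMART signal before failing. The sketch below assumes a hypothetical record layout (one dictionary of counters per failed drive) and hypothetical field names; it is not the authors' analysis code.

```python
from typing import Mapping, Sequence

# Hypothetical field names for the four signals covered on the earlier slides.
STRONG_SIGNALS = ("scan_errors", "reallocations", "offline_reallocations", "probational")


def fraction_with_signal(failed_drives: Sequence[Mapping[str, int]],
                         signals: Sequence[str] = STRONG_SIGNALS) -> float:
    """Fraction of failed drives with a nonzero count in at least one signal."""
    if not failed_drives:
        raise ValueError("need at least one failed drive")
    flagged = sum(
        1 for drive in failed_drives
        if any(drive.get(name, 0) > 0 for name in signals)
    )
    return flagged / len(failed_drives)


# Made-up records for four failed drives; missing counters are treated as zero.
failed = [
    {"scan_errors": 3},
    {"reallocations": 0, "probational": 0},
    {"offline_reallocations": 1, "probational": 2},
    {},
]

print(fraction_with_signal(failed))
```

Any drive that fails with all counters at zero is invisible to a model built only on these signals, which is the limitation this slide points to.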


    So

    Huge study of failure

    Temperature and activity less important than believed

    SMART parameters well-correlated

    But SMART data alone won't help