
  • 1

    1

    Digital Systems Architecture

    EECE 343-01 / EECE 292-02

    Dependability and RAID

    Dr. William H. Robinson

    April 23, 2004

    http://eecs.vanderbilt.edu/courses/eece343/

    2

    Topics

    "KILLS BUGS DEAD!" (TV commercial for RAID bug spray)

    Administrative stuff
      Turn in Reading Assignment #8
      Homework Assignment #4 solutions posted

    More on magnetic storage devices
    Dependability defined
    RAID
    Queuing theory

    3

    Pentium 4 Implementation

    North Bridge: connects CPU to memory & video

    South Bridge: connects North Bridge to everything else

    4

    AMD's Integrated Memory Controller

    http://www.pcstats.com/articleview.cfm?articleid=1469&page=5

    Reduces the time it takes the processor to access memory
      Data need only travel between the processor and the physical memory

    Memory traffic no longer needs to run between the processor and the Northbridge
      The Northbridge traditionally hosts the memory controller; removing this bottleneck improves overall performance

  • 2

    5

    Motivation: Who Cares About I/O?

    CPU performance: 60% per year

    I/O system performance limited by mechanical delays (disk I/O)
      < 10% per year (I/O per sec or MB per sec)

    Amdahl's Law: system speed-up limited by the slowest part!
      10% I/O & 10x CPU => 5x performance (lose 50%)
      10% I/O & 100x CPU => 10x performance (lose 90%)

    I/O bottleneck:
      Diminishing fraction of time in CPU
      Diminishing value of faster CPUs

    Adapted from John Kubiatowicz's CS 252 lecture notes. Copyright 2003 UCB.
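The Amdahl's Law figures on this slide can be checked with a short calculation. This is a minimal sketch, assuming the 10% I/O share is a fraction of the original run time and that only the CPU portion speeds up; the function name `amdahl_speedup` is made up for illustration.

```python
def amdahl_speedup(io_fraction: float, cpu_speedup: float) -> float:
    """Overall speedup when only the non-I/O (CPU) portion gets faster."""
    cpu_fraction = 1.0 - io_fraction
    # New run time, relative to an original run time of 1.0.
    new_time = io_fraction + cpu_fraction / cpu_speedup
    return 1.0 / new_time


for s in (10, 100):
    print(f"10% I/O, {s}x CPU -> {amdahl_speedup(0.10, s):.2f}x overall")
```

The results (about 5.26x and 9.17x) match the slide's rounded 5x and 10x: the faster the CPU gets, the more the untouched I/O time dominates.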

    6

    Big Picture: Who Cares About CPUs?

    Why is it still important to keep CPUs busy vs. I/O devices ("CPU time"), as CPUs are not costly?
      Moore's Law leads to both large, fast CPUs and very small, cheap CPUs
      2001 hypothesis: is a 600 MHz PC fast enough for office tools?
      PC slowdown, since machines are fast enough unless games or new apps demand more?

    People care more about storing information and communicating information than calculating
      "Information Technology" vs. "Computer Science"
      1960s and 1980s: Computing Revolution
      1990s and 2000s: Information Age

    Adapted from David Culler's CS 252 lecture notes. Copyright 2002 UCB.

    7

    I/O Systems

    [Figure: Processor and Cache attached to a Memory - I/O Bus, which connects Main Memory and three I/O Controllers (for two Disks, Graphics, and Network); the I/O controllers signal the processor via interrupts]

    Adapted from John Kubiatowicz's CS 252 lecture notes. Copyright 2003 UCB.

    8

    Storage Technology Drivers

    Driven by the prevailing computing paradigm
      1950s: migration from batch to on-line processing
      1990s: migration to ubiquitous computing
        computers in phones, books, cars, video cameras
        nationwide fiber optical network with wireless tails

    Effects on storage industry
      Embedded storage
        smaller, cheaper, more reliable, lower power
      Data utilities
        high capacity, hierarchically managed storage

    Adapted from John Kubiatowicz's CS 252 lecture notes. Copyright 2003 UCB.

  • 3

    9

    Disk Device Performance

    Disk Latency = Seek Time + Rotation Time + Transfer Time + Controller Overhead

    Seek Time? Depends on number of tracks arm has to move, seek speed of disk

    Rotation Time? Depends on speed disk rotates, how far sector is from head

    Transfer Time? Depends on data rate (bandwidth) of disk (bit density), size of request

    [Figure: disk drive showing Platter, Arm, Actuator, Head, Sector, Inner Track, Outer Track, Controller, Spindle]

    Adapted from David Culler's CS 252 lecture notes. Copyright 2002 UCB.
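The disk latency formula on this slide can be sketched as a sum of four terms. This is a minimal sketch: the function names and the example inputs (8 ms seek, 10000 RPM, a 4096-byte request, 16 MB/s, 1 ms controller overhead) are assumptions chosen for illustration, and the rotation term uses the common half-revolution average.

```python
def avg_rotation_ms(rpm: float) -> float:
    """Average rotational delay: half of one revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm


def transfer_ms(request_bytes: float, bandwidth_mb_per_s: float) -> float:
    """Time to move the request at the given data rate (1 MB = 10**6 bytes)."""
    return request_bytes / (bandwidth_mb_per_s * 1_000_000) * 1000.0


def disk_latency_ms(seek_ms, rpm, request_bytes, bandwidth_mb_per_s, controller_ms):
    """Disk latency = seek + rotation + transfer + controller overhead."""
    return (seek_ms
            + avg_rotation_ms(rpm)
            + transfer_ms(request_bytes, bandwidth_mb_per_s)
            + controller_ms)


# Hypothetical drive and request, for illustration only.
total = disk_latency_ms(seek_ms=8.0, rpm=10_000, request_bytes=4096,
                        bandwidth_mb_per_s=16.0, controller_ms=1.0)
print(f"{total:.3f} ms")
```

With these inputs the mechanical terms (seek and rotation) account for 11 of roughly 12.3 ms, which is why small requests are dominated by positioning time rather than by transfer time.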

    10

    Disk Device Performance

    Average distance of sector from head?
      1/2 time of a rotation
        10000 revolutions per minute => 166.67 rev/sec
        1 revolution = 1/166.67 sec => 6.00 milliseconds
        1/2 rotation (revolution) => 3.00 ms

    Average number of tracks arm moves?
      Sum all possible seek distances from all possible tracks / # possible
        Assumes seek distance is random
      Disk industry standard benchmark

    Adapted from David Culler's CS 252 lecture notes. Copyright 2002 UCB.
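The "sum all possible seek distances from all possible tracks" rule can be sketched by brute force. This is a sketch under stated assumptions: every (start, target) track pair is taken as equally likely, including pairs where the arm does not move, and the function name `average_seek_distance` is made up for illustration.

```python
def average_seek_distance(num_tracks: int) -> float:
    """Mean |start - target| over all equally likely track pairs."""
    total = 0
    for start in range(num_tracks):
        for target in range(num_tracks):
            total += abs(start - target)
    return total / (num_tracks * num_tracks)


tracks = 1000  # hypothetical track count
avg = average_seek_distance(tracks)
print(f"average seek distance: {avg:.1f} of {tracks} tracks")
```

Under this uniform-random assumption the average comes out close to one third of the track count. Real workloads tend to have locality, so measured seek distances are often shorter than this figure suggests.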

    11

    Data Rate: Inner vs. Outer Tracks

    To keep things simple, disks originally kept the same number of sectors per track
      Since outer track is longer, lower Bits Per Inch (BPI)

    Competition decided to keep BPI the same for all tracks (constant bit density)
      More capacity per disk
      More sectors per track towards edge
      Since disk spins at constant speed, outer tracks have faster data rate

    Bandwidth of outer track is 1.7X inner track!
      Inner track highest density, outer track lowest, so not really constant
      2.1X length of track outer / inner, 1.7X bits outer / inner

    Adapted from David Culler's CS 252 lecture notes. Copyright 2002 UCB.
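The 1.7X and 2.1X ratios can be tied together with a small calculation. This is a minimal sketch, assuming a constant spindle speed; the inner-track bit count and the 10000 RPM value are hypothetical placeholders, and only the two ratios come from the slide.

```python
def track_data_rate(bits_per_track: float, rpm: float) -> float:
    """Bits per second read from one track: one full track passes per revolution."""
    return bits_per_track * rpm / 60.0


rpm = 10_000                   # hypothetical spindle speed
inner_bits = 1_000_000         # hypothetical bits on the innermost track
outer_bits = 1.7 * inner_bits  # slide: 1.7X bits, outer / inner

# At constant RPM the bandwidth ratio equals the bits-per-track ratio.
bandwidth_ratio = track_data_rate(outer_bits, rpm) / track_data_rate(inner_bits, rpm)

# The outer track is 2.1X longer but holds only 1.7X the bits, so its linear
# density is lower than that of the inner track.
density_ratio = 1.7 / 2.1

print(f"bandwidth outer/inner: {bandwidth_ratio:.2f}")
print(f"bit density outer/inner: {density_ratio:.2f}")
```

The density ratio of roughly 0.81 is what the slide means by "not really constant": the outer track stores about 19% fewer bits per inch than the inner track.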

    12

    Devices: Magnetic Disks

    [Figure: disk geometry showing Sector, Track, Cylinder, Head, Platter]

    Purpose:
      Long-term, nonvolatile storage
      Large, inexpensive, slow level in the storage hierarchy

    Characteristics:
      Seek Time (~8 ms avg)
        positional latency
        rotational latency
      Transfer rate
        10-40 MByte/sec
        Blocks
      Capacity
        Gigabytes
        Quadruples every 2 years
        (aerodynamics)

    7200 RPM = 120 RPS => ~8 ms per rev
    avg rot. latency = ~4 ms

    128 sectors per track => ~0.0625 ms per sector
    1 KB per sector => ~16 MB / s

    Response time = Queue + Controller + Seek + Rot + Xfer
    Service time = Controller + Seek + Rot + Xfer

    Adapted from David Culler's CS 252 lecture notes. Copyright 2002 UCB.
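The per-sector arithmetic on this slide can be reproduced as follows. This is a minimal sketch using the slide's 7200 RPM, 128 sectors per track, and 1 KB sectors; the function names are made up for illustration, and 1 KB is taken as 1024 bytes (an assumption).

```python
def sector_time_ms(rpm: float, sectors_per_track: int) -> float:
    """Time for one sector to pass under the head, in milliseconds."""
    revolution_ms = 60_000.0 / rpm
    return revolution_ms / sectors_per_track


def track_rate_bytes_per_s(rpm: float, sectors_per_track: int,
                           bytes_per_sector: int) -> float:
    """Sustained rate while reading one full track per revolution."""
    return sectors_per_track * bytes_per_sector * rpm / 60.0


print(f"{60_000.0 / 7200:.2f} ms per revolution")
print(f"{sector_time_ms(7200, 128):.4f} ms per sector")
print(f"{track_rate_bytes_per_s(7200, 128, 1024) / 1e6:.2f} MB/s")
```

Without the slide's rounding to 8 ms per revolution, a revolution takes about 8.33 ms, a sector about 0.065 ms, and the track rate is about 15.7 MB/s, consistent with the slide's roughly 16 MB/s.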

  • 4

    13

    Capacity: +100%/year (2X / 1.0 yrs)

    Transfer rate (BW): +40%/year (2X / 2.0 yrs)

    Rotation + seek time: -8%/year (1/2 in 10 yrs)

    MB/$> 100%/year (2X /
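The parenthesized doubling and halving times follow from compound growth. This is a minimal sketch, reading the rotation + seek figure as an 8% per year reduction; the function name `years_to_factor` is made up for illustration.

```python
import math


def years_to_factor(annual_rate: float, factor: float) -> float:
    """Years for a quantity changing by annual_rate per year to reach `factor` times its start."""
    return math.log(factor) / math.log(1.0 + annual_rate)


print(f"capacity doubles in {years_to_factor(1.00, 2.0):.2f} years")
print(f"transfer rate doubles in {years_to_factor(0.40, 2.0):.2f} years")
print(f"rotation + seek time halves in {years_to_factor(-0.08, 0.5):.2f} years")
```

The first two results agree with the slide's 1.0 and 2.0 years. The third comes out nearer 8.3 years than 10, so the slide's "1/2 in 10 yrs" is best read as a rounded figure.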

  • 5

    17

    IFIP Standard Terminology

    Computer system dependability: quality of delivered service such that reliance can be placed on the service

    Service is the observed actual behavior as perceived by other system(s) interacting with this system's users

    Each module has an ideal specified behavior, where the service specification is an agreed description of the expected behavior

    A system failure occurs when the actual behavior deviates from the specified behavior

    A failure occurs because of an error, a defect in a module
      The cause of an error is a fault
      When a fault occurs it creates a latent error, which becomes effective when it is activated
      When an error actually affects the delivered service, a failure occurs (time from error to failure is error latency)

    Adapted from David Patterson's CS 252 lecture notes. Copyright 2001 UCB.

    18

    Fault vs. (Latent) Error vs. Failure

    A fault creates one or more latent errors

    Properties of errors are
      A latent error becomes effective once activated
      An error may cycle between its latent and effective states
      An effective error often propagates from one component to another, thereby creating new errors

    An effective error is either a formerly latent error in that component or one that propagated from another error

    A component failure occurs when the error affects the delivered service

    These properties are recursive, and apply to any component in the system

    An error is the manifestation in the system of a fault; a failure is the manifestation on the service of an error

    Adapted from David Patterson's CS 252 lecture notes. Copyright 2001 UCB.
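The fault, latent error, effective error, failure chain can be sketched as a small state model. This is an illustrative toy, not part of the IFIP terminology itself: the class `Component`, its method names, and the `masks_errors` flag are all assumptions introduced here.

```python
from enum import Enum


class ErrorState(Enum):
    NONE = "no error"
    LATENT = "latent error"
    EFFECTIVE = "effective error"


class Component:
    """Toy model of one component in the fault -> error -> failure chain."""

    def __init__(self, masks_errors: bool = False):
        self.masks_errors = masks_errors  # True models error tolerance
        self.state = ErrorState.NONE
        self.failed = False

    def fault(self) -> None:
        """A fault creates a latent error."""
        if self.state is ErrorState.NONE:
            self.state = ErrorState.LATENT

    def activate(self) -> None:
        """Exercising the faulty part makes the latent error effective."""
        if self.state is ErrorState.LATENT:
            self.state = ErrorState.EFFECTIVE

    def deliver_service(self) -> bool:
        """Return True if service is correct; an unmasked effective error is a failure."""
        if self.state is ErrorState.EFFECTIVE and not self.masks_errors:
            self.failed = True
        return not self.failed


module = Component()
module.fault()
print(module.state.value, module.deliver_service())  # latent: service still correct
module.activate()
print(module.state.value, module.deliver_service())  # effective and unmasked: failure
```

A component built with `masks_errors=True` goes through the same states but never fails, which is the sense in which tolerance prevents errors from becoming failures.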

    19

    Fault vs. (Latent) Error vs. Failure

    An error is the manifestation in the system of a fault; a failure is the manifestation on the service of an error

    Is a programming mistake a fault, error, or failure?
      Are we talking about the time it was designed or the time the program is run?
      If the running program doesn't exercise the mistake, is it still a fault/error/failure?

    A programming mistake is a fault
      The consequence is an error (or latent error) in the software
      Upon activation, the error becomes effective
      When this effective error produces erroneous data which affect the delivered service, a failure occurs

    Adapted from David Patterson's CS 252 lecture notes. Copyright 2001 UCB.

    20

    Fault vs. (Latent) Error vs. Failure

    An error is the manifestation in the system of a fault; a failure is the manifestation on the service of an error

    If an alpha particle hits a DRAM memory cell, is it a fault/error/failure if it doesn't change the value?
      Is it a fault/error/failure if the memory doesn't access the changed bit?
      Did a fault/error/failure still occur if the memory had error correction and delivered the corrected value to the CPU?

    An alpha particle hitting a DRAM can be a fault
      If it changes the memory, it creates an error
      The error remains latent until the affected memory word is read
      If the error in the affected word affects the delivered service, a failure occurs

    Adapted from David Patterson's CS 252 lecture notes. Copyright 2001 UCB.
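The error-correction question can be made concrete with a tiny code. The slide does not say which code the memory uses; a Hamming(7,4) single-error-correcting code is swapped in here only because it is small, and the names `encode` and `decode` are made up for illustration.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword (bit positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Return (data bits, syndrome); a nonzero syndrome is the position that was corrected."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the single bit the syndrome points at
    return [c[2], c[4], c[5], c[6]], syndrome


original = [1, 0, 1, 1]
stored = encode(original)
stored[4] ^= 1  # fault: a particle strike flips the bit at position 5

raw = [stored[2], stored[4], stored[5], stored[6]]  # read without correction
corrected, syndrome = decode(stored)                # read with correction

print("raw read correct:      ", raw == original)
print("corrected read correct:", corrected == original, "(position", syndrome, "fixed)")
```

In the slide's terms, the bit flip is an error in both cases. Without correction the read delivers wrong data and a failure can follow; with correction the error is masked and the delivered service stays correct.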

  • 6

    21

    Fault Tolerance vs. Disaster Tolerance

    Fault Tolerance (or more properly, Error Tolerance): masks local faults (prevents errors from becoming failures)
      RAID disks
      Uninterruptible Power Supplies

    Disaster Tolerance: masks site errors (prevents site errors from causing service failures)
      Protects against fire, flood, sabotage, ..
      Redundant system and service at remote site
      Use design diversity

    From Jim Gray's talk at UC Berkeley on Fault Tolerance, 11/9/00

    Adapted from David Patterson's CS 252 lecture notes. Copyright 2001 UCB.

    22

    [Figure: disk form factors. Conventional: 4 disk designs (14", 10", 5.25", 3.5"), spanning low end to high end. Disk array: 1 disk design (3.5")]

    Katz and Patterson asked in 1987: Can smaller disks be used to close the gap in performance between disks and CPUs?

    Adapted from Dav