1
Digital Systems Architecture
EECE 343-01 / EECE 292-02
Dependability and RAID
Dr. William H. Robinson
April 23, 2004
http://eecs.vanderbilt.edu/courses/eece343/
2
Topics
"KILLS – BUGS – DEAD!"
– TV commercial for RAID bug spray
• Administrative stuff
  – Turn in Reading Assignment #8
  – Homework Assignment #4 solutions posted
• More on magnetic storage devices
• Dependability defined
• RAID
• Queuing theory
3
Pentium 4 Implementation
• North Bridge
  – Connects CPU to memory & video
• South Bridge
  – Connects North Bridge to everything else
4
AMD’s Integrated Memory Controller
• http://www.pcstats.com/articleview.cfm?articleid=1469&page=5
• Reduces the time it takes the processor to access memory
  – Data need only travel between the processor and the physical memory
• Memory traffic need no longer run through the Northbridge
  – The Northbridge traditionally provides the memory controller with data, and removing this bottleneck speeds up overall operation
5
Motivation: Who Cares About I/O?
• CPU performance: ~60% per year
• I/O system performance is limited by mechanical delays (disk I/O): < 10% per year (I/Os per sec or MB per sec)
• Amdahl's Law: system speed-up is limited by the slowest part!
  – 10% I/O & 10x CPU ⇒ 5x performance (lose 50%)
  – 10% I/O & 100x CPU ⇒ 10x performance (lose 90%)
• I/O bottleneck:
  – Diminishing fraction of time in CPU
  – Diminishing value of faster CPUs
Adapted from John Kubiatowicz’s CS 252 lecture notes. Copyright © 2003 UCB.
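As a quick check of the speed-up numbers above: with a fraction f = 0.1 of time in (unaccelerated) I/O and a CPU speedup of s, Amdahl's Law gives

\[
\text{Speedup} = \frac{1}{(1-f)/s + f}, \qquad
\frac{1}{0.9/10 + 0.1} \approx 5.3\times, \qquad
\frac{1}{0.9/100 + 0.1} \approx 9.2\times
\]

which the slide rounds to 5x and 10x.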
6
Big Picture: Who Cares About CPUs?
• Why is it still important to keep CPUs busy vs. I/O devices ("CPU time"), since CPUs are not costly?
  – Moore's Law leads both to large, fast CPUs and to very small, cheap CPUs
  – 2001 hypothesis: a 600 MHz PC is fast enough for office tools?
  – PC slowdown since "fast enough", unless games or new apps?
• People care more about storing and communicating information than about calculating
  – "Information Technology" vs. "Computer Science"
  – 1960s and 1980s: Computing Revolution
  – 1990s and 2000s: Information Age
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
7
I/O Systems
[Figure: typical I/O system organization. The processor and its cache attach to a memory-I/O bus; main memory and several I/O controllers (for disks, graphics, and a network) sit on the same bus, and the controllers signal the processor via interrupts.]
Adapted from John Kubiatowicz’s CS 252 lecture notes. Copyright © 2003 UCB.
8
Storage Technology Drivers
• Driven by the prevailing computing paradigm
  – 1950s: migration from batch to on-line processing
  – 1990s: migration to ubiquitous computing
    • computers in phones, books, cars, video cameras, …
    • nationwide fiber optic network with wireless tails
• Effects on the storage industry
  – Embedded storage: smaller, cheaper, more reliable, lower power
  – Data utilities: high capacity, hierarchically managed storage
Adapted from John Kubiatowicz’s CS 252 lecture notes. Copyright © 2003 UCB.
9
Disk Device Performance
• Disk Latency = Seek Time + Rotation Time + Transfer Time + Controller Overhead
• Seek Time? Depends on the number of tracks the arm has to move and the seek speed of the disk
• Rotation Time? Depends on how fast the disk rotates and how far the sector is from the head
• Transfer Time? Depends on the data rate (bandwidth) of the disk (bit density) and the size of the request
[Figure: disk geometry. Platters on a spindle; inner and outer tracks divided into sectors; the head on an arm moved by the actuator, managed by the controller.]
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
10
Disk Device Performance
• Average distance of a sector from the head?
  – 1/2 the time of a rotation
  – 10,000 revolutions per minute ⇒ 166.67 rev/sec
  – 1 revolution = 1/166.67 sec ⇒ 6.00 milliseconds
  – 1/2 rotation (revolution) ⇒ 3.00 ms
• Average number of tracks the arm moves?
  – Sum all possible seek distances from all possible tracks / # possible
    • Assumes seek distances are random
  – Disk industry standard benchmark
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
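A minimal sketch of the latency model above. The 10,000 RPM figure comes from this slide; the seek time, transfer rate, and controller overhead are illustrative assumptions (an ~8 ms average seek and 10-40 MB/s transfer rates appear on a later slide):

```python
def disk_latency_ms(seek_ms, rpm, request_kb, rate_mb_s, overhead_ms):
    """Disk latency = seek + rotation + transfer + controller overhead."""
    rotation_ms = 0.5 * (60_000 / rpm)    # average: half a revolution
    transfer_ms = request_kb / rate_mb_s  # KB / (MB/s) comes out in ms
    return seek_ms + rotation_ms + transfer_ms + overhead_ms

# Example: 10,000 RPM disk, 8 ms average seek, 40 MB/s, 0.1 ms overhead
print(disk_latency_ms(seek_ms=8, rpm=10_000, request_kb=8, rate_mb_s=40,
                      overhead_ms=0.1))   # ~11.3 ms for an 8 KB request
```

Note how the mechanical terms (seek + rotation = 11 ms) dwarf the 0.2 ms transfer for a small request.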
11
Data Rate: Inner vs. Outer Tracks
• To keep things simple, disks originally kept the same number of sectors per track
  – Since the outer track is longer, it had lower Bits Per Inch (BPI)
• Competition ⇒ decided to keep BPI roughly the same for all tracks ("constant bit density")
  ⇒ More capacity per disk
  ⇒ More sectors per track toward the edge
  ⇒ Since the disk spins at constant speed, outer tracks have a faster data rate
• Bandwidth of the outer track is 1.7X that of the inner track!
  – The inner track has the highest density and the outer track the lowest, so density is not really constant
  – 2.1X length of track outer / inner, 1.7X bits outer / inner
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
12
Devices: Magnetic Disks
[Figure: disk geometry. Platters and heads; tracks divided into sectors; a cylinder is the set of tracks under the heads across all platters.]
• Purpose:
  – Long-term, nonvolatile storage
  – Large, inexpensive, slow level in the storage hierarchy
• Characteristics:
  – Seek time (~8 ms avg)
    • positional latency
    • rotational latency
  – Transfer rate: 10–40 MByte/sec, in blocks
  – Capacity: gigabytes; quadruples every 2 years (aerodynamics)
• Example: 7200 RPM = 120 RPS ⇒ ~8 ms per rev ⇒ avg rotational latency = 4 ms;
  128 sectors per track ⇒ ~0.0625 ms per sector; 1 KB per sector ⇒ ~16 MB/s
• Response time = Queue + Controller + Seek + Rotation + Transfer
  (Controller + Seek + Rotation + Transfer = service time)
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
13
Disk Performance Model / Trends
• Capacity: +100%/year (2X / 1.0 yrs)
• Transfer rate (BW): +40%/year (2X / 2.0 yrs)
• Rotation + seek time: –8%/year (1/2 in 10 yrs)
• MB/$: >100%/year (2X / <1.5 yrs), from fewer chips + areal density
Adapted from John Kubiatowicz’s CS 252 lecture notes. Copyright © 2003 UCB.
14
Areal Density
• Bits recorded along a track
  – Metric is Bits Per Inch (BPI)
• Number of tracks per surface
  – Metric is Tracks Per Inch (TPI)
• Disk designs brag about bit density per unit area
  – Metric is Bits Per Square Inch, called Areal Density
  – Areal Density = BPI x TPI
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
15
Areal Density

  Year    Areal Density
  1973        1.7
  1979        7.7
  1989         63
  1997       3090
  2000      17100

[Chart: areal density vs. year, 1970-2000, log scale.]
– Areal Density = BPI x TPI
– The growth slope changed from ~30%/yr to ~60%/yr around 1991
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
16
Definitions
Examples of why precise definitions are so important for reliability:
• Is a programming mistake a fault, error, or failure?
  – Are we talking about the time it was designed or the time the program is run?
  – If the running program doesn’t exercise the mistake, is it still a fault/error/failure?
• If an alpha particle hits a DRAM memory cell, is it a fault/error/failure if it doesn’t change the value?
  – Is it a fault/error/failure if the memory doesn’t access the changed bit?
  – Did a fault/error/failure still occur if the memory had error correction and delivered the corrected value to the CPU?
Adapted from David Patterson’s CS 252 lecture notes. Copyright © 2001 UCB.
17
IFIP Standard Terminology
• Computer system dependability: the quality of delivered service such that reliance can be placed on that service
• Service is the observed actual behavior as perceived by other system(s) interacting with this system’s users
• Each module has an ideal specified behavior, where the service specification is an agreed description of the expected behavior
• A system failure occurs when the actual behavior deviates from the specified behavior
• A failure occurs because of an error, a defect in a module
• The cause of an error is a fault
• When a fault occurs, it creates a latent error, which becomes effective when it is activated
• When the error actually affects the delivered service, a failure occurs (the time from error to failure is the error latency)
Adapted from David Patterson’s CS 252 lecture notes. Copyright © 2001 UCB.
18
Fault vs. (Latent) Error vs. Failure
• A fault creates one or more latent errors
• Properties of errors:
  – A latent error becomes effective once activated
  – An error may cycle between its latent and effective states
  – An effective error often propagates from one component to another, thereby creating new errors
• An effective error is either a formerly-latent error in that component, or one that propagated from another error
• A component failure occurs when the error affects the delivered service
• These properties are recursive and apply to any component in the system
• An error is the manifestation in the system of a fault; a failure is the manifestation on the service of an error
Adapted from David Patterson’s CS 252 lecture notes. Copyright © 2001 UCB.
19
Fault vs. (Latent) Error vs. Failure
• An error is the manifestation in the system of a fault; a failure is the manifestation on the service of an error
• Is a programming mistake a fault, error, or failure?
  – Are we talking about the time it was designed or the time the program is run?
  – If the running program doesn’t exercise the mistake, is it still a fault/error/failure?
• A programming mistake is a fault
• The consequence is an error (or latent error) in the software
• Upon activation, the error becomes effective
• When this effective error produces erroneous data that affect the delivered service, a failure occurs
Adapted from David Patterson’s CS 252 lecture notes. Copyright © 2001 UCB.
20
Fault vs. (Latent) Error vs. Failure
• An error is the manifestation in the system of a fault; a failure is the manifestation on the service of an error
• If an alpha particle hits a DRAM memory cell, is it a fault/error/failure if it doesn’t change the value?
  – Is it a fault/error/failure if the memory doesn’t access the changed bit?
  – Did a fault/error/failure still occur if the memory had error correction and delivered the corrected value to the CPU?
• An alpha particle hitting a DRAM can be a fault
• If it changes the memory, it creates an error
• The error remains latent until the affected memory word is read
• If the affected word's error affects the delivered service, a failure occurs
Adapted from David Patterson’s CS 252 lecture notes. Copyright © 2001 UCB.
21
Fault Tolerance vs. Disaster Tolerance
• Fault tolerance (or more properly, error tolerance): masks local faults (prevents errors from becoming failures)
  – RAID disks
  – Uninterruptible power supplies
• Disaster tolerance: masks site errors (prevents site errors from causing service failures)
  – Protects against fire, flood, sabotage, …
  – Redundant system and service at a remote site
  – Uses design diversity
From Jim Gray’s talk at UC Berkeley on fault tolerance, 11/9/00
Adapted from David Patterson’s CS 252 lecture notes. Copyright © 2001 UCB.
22
Use Arrays of Small Disks?
Katz and Patterson asked in 1987: can smaller disks be used to close the gap in performance between disks and CPUs?
[Figure: conventional designs use 4 disk form factors (14", 10", 5.25", 3.5") spanning low end to high end; a disk array uses a single 3.5" disk design.]
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
23
Advantages of Small Form-Factor Disk Drives
Cost and environmental efficiencies:
• Low cost/MB
• High MB/volume
• High MB/watt
• Low cost/actuator
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
24
Replace Small Number of Large Disks with Large Number of Small Disks! (1988 Disks)

             IBM 3390K     IBM 3.5" 0061   x70 (array of 70)
  Capacity   20 GBytes     320 MBytes      23 GBytes
  Volume     97 cu. ft.    0.1 cu. ft.     11 cu. ft.      (9X better)
  Power      3 KW          11 W            1 KW            (3X better)
  Data Rate  15 MB/s       1.5 MB/s        120 MB/s        (8X better)
  I/O Rate   600 I/Os/s    55 I/Os/s       3900 I/Os/s     (6X better)
  MTTF       250 KHrs      50 KHrs         ??? Hrs
  Cost       $250K         $2K             $150K

Disk arrays have the potential for large data and I/O rates, high MB per cu. ft., and high MB per KW, but what about reliability?
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
25
Array Reliability
• Reliability of N disks = reliability of 1 disk ÷ N
  – 50,000 hours ÷ 70 disks = ~700 hours
  – Disk system MTTF drops from 6 years to 1 month!
• Arrays (without redundancy) are too unreliable to be useful!
• Hot spares support reconstruction in parallel with access: very high media availability can be achieved
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
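A worked form of the estimate above (assuming independent, exponentially distributed disk failures, so array MTTF scales as 1/N):

\[
\text{MTTF}_{\text{array}} \approx \frac{\text{MTTF}_{\text{disk}}}{N}
= \frac{50{,}000\ \text{hours}}{70} \approx 714\ \text{hours} \approx 1\ \text{month}
\]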
26
Redundant Arrays of (Inexpensive) Disks
• Files are "striped" across multiple disks
• Redundancy yields high data availability
  – Availability: service is still provided to the user, even if some components have failed
• Disks will still fail
• Contents are reconstructed from data redundantly stored in the array
  ⇒ Capacity penalty to store redundant info
  ⇒ Bandwidth penalty to update redundant info
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
27
RAID Levels 0 Through 5
[Figure: data layouts for RAID levels 0 through 5; backup and parity drives are shaded.]
28
Redundant Arrays of Inexpensive Disks RAID 1: Disk Mirroring/Shadowing
• Each disk is fully duplicated onto its "mirror" (its recovery group)
  – Very high availability can be achieved
• Bandwidth sacrifice on write:
  – Logical write = two physical writes
• Reads may be optimized
• Most expensive solution: 100% capacity overhead
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
29
Redundant Arrays of Inexpensive Disks RAID 2: Hamming Code
• Split each byte into a pair of 4-bit nibbles
  – Add a Hamming code to each
  – Parity bits are at positions 1, 2, 4
• Provides single-bit error correction
• Drawbacks
  – Requires all drives to be rotationally synchronized
  – Requires a large number of drives to reduce overhead
  – Controller complexity to calculate the checksum
See the sketch below for how the code itself works.
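A minimal sketch of the Hamming(7,4) code the slide describes, with parity bits at (1-indexed) positions 1, 2, and 4. This only illustrates encoding and single-bit correction of one nibble; it is not a model of a RAID 2 controller, which would spread these bits across drives:

```python
def hamming74_encode(nibble):
    """Encode 4 data bits [d1,d2,d3,d4] into 7 bits, parity at 1, 2, 4."""
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4          # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]   # positions 1..7

def hamming74_correct(code):
    """Locate and flip a single-bit error; the syndrome is its position."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity of positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity of positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity of positions 4, 5, 6, 7
    pos = s1 + 2 * s2 + 4 * s4       # 0 means no error detected
    if pos:
        c[pos - 1] ^= 1
    return c

code = hamming74_encode([1, 0, 1, 1])
corrupted = code[:]; corrupted[5] ^= 1    # flip one bit
assert hamming74_correct(corrupted) == code
```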
30
Redundant Arrays of Inexpensive Disks RAID 3: Parity Disk
[Figure: a logical record (e.g., 10010011 11001101 10010011 11001101) is striped as physical records across four data disks, plus a fifth parity disk P.]
• P contains the sum of the other disks per stripe, mod 2 ("parity")
• If a disk fails, subtract P from the sum of the other disks to find the missing information
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
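A minimal sketch of this parity scheme in code: the parity block is the byte-wise XOR (sum mod 2) of the data blocks, and XORing the parity with the surviving blocks recovers a lost one. The single-byte disk contents are just the illustrative bit patterns from the figure:

```python
from functools import reduce

def parity(blocks):
    """Byte-wise XOR ("sum mod 2") across all blocks in a stripe."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

stripe = [b"\x93", b"\xcd", b"\x93", b"\xcd"]   # 10010011, 11001101, ...
p = parity(stripe)

# Disk 2 fails: recover its contents from the parity and the survivors.
survivors = stripe[:2] + stripe[3:]
recovered = parity(survivors + [p])
assert recovered == stripe[2]
```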
31
RAID 3
• The sum is computed across the recovery group to protect against hard disk failures, and stored in the P disk
• Logically a single high-capacity, high-transfer-rate disk: good for large transfers
• Wider arrays reduce capacity costs but decrease availability
• 33% capacity cost for parity in this configuration
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
32
Inspiration for RAID 4
• RAID 3 relies on the parity disk to discover errors on a read
• But every sector has its own error detection field
• Rely on the sector's error detection field to catch errors on read, not on the parity disk
• This allows independent reads from different disks simultaneously
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
33
Redundant Arrays of Inexpensive Disks RAID 4: High I/O Rate Parity
Insides of 5 disks (disk columns; logical disk addresses increase down the columns; one parity block per stripe, on a dedicated parity disk):

  D0   D1   D2   D3   P
  D4   D5   D6   D7   P
  D8   D9   D10  D11  P
  D12  D13  D14  D15  P
  D16  D17  D18  D19  P
  D20  D21  D22  D23  P
  ...

Example: small read of D0 & D5, large write of D12-D15
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
34
Inspiration for RAID 5
• RAID 4 works well for small reads
• Small writes (write to one disk):
  – Option 1: read the other data disks, create the new sum, and write it to the parity disk
  – Option 2: since P has the old sum, compare the old data to the new data and add the difference to P
• Small writes are limited by the parity disk: writes to D0 and D5 must both also write to the P disk

  D0   D1   D2   D3   P
  D4   D5   D6   D7   P

Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
35
Redundant Arrays of Inexpensive Disks RAID 5: High I/O Rate Interleaved Parity
Independent writes are possible because of interleaved parity (disk columns; logical disk addresses increase down the columns; the parity block rotates across the disks):

  D0   D1   D2   D3   P
  D4   D5   D6   P    D7
  D8   D9   P    D10  D11
  D12  P    D13  D14  D15
  P    D16  D17  D18  D19
  D20  D21  D22  D23  P
  ...

Example: writes to D0 and D5 use disks 0, 1, 3, and 4
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
36
Problems of Disk Arrays: Small Writes
RAID-5 small write algorithm: 1 logical write = 2 physical reads + 2 physical writes.
To write new data D0' over D0 in the stripe D0 D1 D2 D3 P:
  (1. Read) old data D0
  (2. Read) old parity P
  Compute the new parity: P' = (D0 XOR D0') XOR P
  (3. Write) new data D0'
  (4. Write) new parity P'
The stripe becomes D0' D1 D2 D3 P'.
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
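A minimal sketch of this small-write update (Option 2 from the RAID 4 slide): the new parity is derived from the old data, new data, and old parity alone, so the other data disks are never read. Block values are illustrative:

```python
def raid5_small_write(old_data, new_data, old_parity):
    """New parity = (old_data XOR new_data) XOR old_parity.

    2 reads (old data, old parity) + 2 writes (new data, new parity),
    without touching the other data disks in the stripe.
    """
    delta = bytes(a ^ b for a, b in zip(old_data, new_data))
    return bytes(d ^ p for d, p in zip(delta, old_parity))

# Consistency check against recomputing parity over the full stripe:
d0, d1, d2, d3 = b"\x93", b"\xcd", b"\x93", b"\xcd"
p = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(d0, d1, d2, d3))
d0_new = b"\x0f"
p_new = raid5_small_write(d0, d0_new, p)
full = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(d0_new, d1, d2, d3))
assert p_new == full
```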
37
Berkeley History: RAID-I
• RAID-I (1989)
  – Consisted of a Sun 4/280 workstation with 128 MB of DRAM, four dual-string SCSI controllers, 28 5.25-inch SCSI disks, and specialized disk striping software
• Today RAID is a $19 billion industry, and 80% of non-PC disks are sold in RAIDs
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
38
RAID Techniques Summary: Goal was performance, popularity due to reliability of storage
• Disk mirroring/shadowing (RAID 1)
  – Each disk is fully duplicated onto its "shadow"
  – Logical write = two physical writes
  – 100% capacity overhead
• Parity data bandwidth array (RAID 3)
  – Parity computed horizontally
  – Logically a single high-data-bandwidth disk
• High I/O rate parity array (RAID 5)
  – Interleaved parity blocks
  – Independent reads and writes
  – Logical write = 2 reads + 2 writes
[Figure: example bit patterns illustrating mirroring (RAID 1) and horizontal parity (RAID 3).]
Adapted from David Culler’s CS 252 lecture notes. Copyright © 2002 UCB.
39
Introduction to Queueing Theory
[Figure: a "black box" queueing system, with arrivals entering and departures leaving.]
• Queueing theory applies to long-term, steady-state behavior ⇒ arrival rate = departure rate
• Little’s Law: mean number of tasks in system = arrival rate x mean response time
  – Observed by many; Little was the first to prove it
  – Simple interpretation: you should see the same number of tasks in the queue when entering as when leaving
• Applies to any system in equilibrium, as long as nothing in the black box is creating or destroying tasks
Adapted from John Kubiatowicz’s CS 252 lecture notes. Copyright © 2003 UCB.
40
A Little Queuing Theory: Use of Random Distributions
[Figure: queue + server ("system") between the processor (Proc), I/O controller (IOC), and device.]
• Server spends a variable amount of time with customers
  – Weighted mean: m1 = (f1 T1 + f2 T2 + ... + fn Tn)/F = Σ p(T)·T
  – Variance: σ² = (f1 T1² + f2 T2² + ... + fn Tn²)/F – m1² = Σ p(T)·T² – m1²
  – Squared coefficient of variance: C = σ²/m1²
    • Unitless measure (100 ms² vs. 0.1 s²)
• Exponential distribution, C = 1: most tasks short relative to the average, a few long; 90% < 2.3 x average, 63% < average
• Hypoexponential distribution, C < 1: most close to the average; C = 0.5 ⇒ 90% < 2.0 x average, only 57% < average
• Hyperexponential distribution, C > 1: further from the average; C = 2.0 ⇒ 90% < 2.8 x average, 69% < average
Adapted from John Kubiatowicz’s CS 252 lecture notes. Copyright © 2003 UCB.
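A small sketch of these formulas, with a made-up discrete service-time distribution (the fi are observation counts and F is their total):

```python
def service_stats(freqs, times):
    """Weighted mean m1, variance, and squared coeff. of variance C."""
    F = sum(freqs)
    m1 = sum(f * t for f, t in zip(freqs, times)) / F
    var = sum(f * t * t for f, t in zip(freqs, times)) / F - m1 * m1
    return m1, var, var / (m1 * m1)

# e.g., 70 requests took 10 ms, 20 took 30 ms, 10 took 100 ms
m1, var, C = service_stats([70, 20, 10], [10.0, 30.0, 100.0])
print(f"m1 = {m1:.1f} ms, sigma^2 = {var:.1f} ms^2, C = {C:.2f}")
# m1 = 23.0 ms, sigma^2 = 721.0 ms^2, C = 1.36 (mildly hyperexponential)
```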
41
A Little Queuing Theory: Variable Service Time
[Figure: queue + server ("system") between the processor (Proc), I/O controller (IOC), and device.]
• Disk response times have C ≈ 1.5 (the majority of seeks are shorter than the average)
• Yet we usually pick C = 1.0 for simplicity
  – Memoryless, exponential distribution
  – Many complex systems are well described by the memoryless distribution!
• Another useful value is the average time you must wait for the server to complete its current task: m1(z)
  – Called the "average residual wait time"
  – Not just 1/2 x m1, because that doesn’t capture the variance
  – Can derive: m1(z) = 1/2 x m1 x (1 + C)
  – No variance ⇒ C = 0 ⇒ m1(z) = 1/2 x m1
  – Exponential ⇒ C = 1 ⇒ m1(z) = m1
Adapted from John Kubiatowicz’s CS 252 lecture notes. Copyright © 2003 UCB.
42
A Little Queuing Theory: Average Wait Time
• Calculating the average wait time in queue, Tq:
  – All customers in line must complete; average service time m1 ≡ Tser = 1/µ
  – If something is at the server, it takes m1(z) on average to complete
  – Chance the server is busy = u = λ/µ, so the average residual delay is u x m1(z)

  Tq = u x m1(z) + Lq x Tser
  Tq = u x m1(z) + λ x Tq x Tser     (Little’s Law: Lq = λ x Tq)
  Tq = u x m1(z) + u x Tq            (definition of utilization: u = λ x Tser)
  Tq x (1 – u) = m1(z) x u
  Tq = m1(z) x u/(1 – u) = Tser x {1/2 x (1 + C)} x u/(1 – u)

• Notation:
  λ      average number of arriving customers/second
  Tser   average time to service a customer
  u      server utilization (0..1): u = λ x Tser
  Tq     average time/customer in queue
  Lq     average length of queue: Lq = λ x Tq
  m1(z)  average residual wait time = Tser x {1/2 x (1 + C)}
Adapted from John Kubiatowicz’s CS 252 lecture notes. Copyright © 2003 UCB.
43
A Little Queuing Theory: M/G/1 and M/M/1
• Assumptions so far:
  – System in equilibrium
  – Times between successive arrivals are random
  – Server can start on the next customer immediately after the prior one finishes
  – No limit to the queue: works First-In-First-Out
  – Afterward, all customers in line must complete; each takes Tser on average
• This describes "memoryless" or Markovian request arrival (M for C = 1, exponentially random), General service distribution (no restrictions), 1 server: the M/G/1 queue
• When service times also have C = 1, it is an M/M/1 queue:
  Tq = Tser x u x (1 + C) / (2 x (1 – u)) = Tser x u / (1 – u)
  where Tser is the average time to service a customer, u = λ x Tser is the server utilization (0..1), and Tq is the average time/customer in queue
Adapted from John Kubiatowicz’s CS 252 lecture notes. Copyright © 2003 UCB.
44
A Little Queuing Theory: An Example
• A processor sends 10 x 8 KB disk I/Os per second; requests & service are exponentially distributed; average disk service time = 20 ms
• On average, how utilized is the disk?
  – What is the number of requests in the queue?
  – What is the average time spent in the queue?
  – What is the average response time for a disk request?
• Solution:
  λ     average number of arriving customers/second = 10
  Tser  average time to service a customer = 20 ms (0.02 s)
  u     server utilization (0..1): u = λ x Tser = 10/s x 0.02 s = 0.2
  Tq    average time/customer in queue = Tser x u/(1 – u)
        = 20 x 0.2/(1 – 0.2) = 20 x 0.25 = 5 ms (0.005 s)
  Tsys  average time/customer in system: Tsys = Tq + Tser = 25 ms
  Lq    average length of queue: Lq = λ x Tq = 10/s x 0.005 s = 0.05 requests in queue
  Lsys  average # of tasks in system: Lsys = λ x Tsys = 10/s x 0.025 s = 0.25
Adapted from John Kubiatowicz’s CS 252 lecture notes. Copyright © 2003 UCB.
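These steps are mechanical enough to script. A minimal sketch using the M/G/1 formula from the previous slide (valid only for u < 1; C = 1 reduces it to the M/M/1 case used here):

```python
def mg1(lam, t_ser, C=1.0):
    """M/G/1 queue metrics; lam in customers/s, t_ser in seconds."""
    u = lam * t_ser                            # server utilization
    t_q = t_ser * u * (1 + C) / (2 * (1 - u))  # avg time in queue
    t_sys = t_q + t_ser                        # avg response time
    return u, t_q, t_sys, lam * t_q, lam * t_sys   # ..., Lq, Lsys

u, t_q, t_sys, l_q, l_sys = mg1(lam=10, t_ser=0.020)  # the disk example
print(u, t_q, t_sys, l_q, l_sys)   # 0.2, 0.005 s, 0.025 s, 0.05, 0.25
```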
45
A Little Queuing Theory: Yet Another Example
• Processor accesses memory over a "network"
• DRAM service properties
  – Speed = 9 cycles + 2 cycles/word
  – With 8-word cache lines: Tser = 25 cycles, µ = 1/25 = 0.04 ops/cycle
  – Deterministic service time! (C = 0)
• Processor behavior
  – CPI = 1, 40% memory instructions, 7.5% cache misses
  – Rate: λ = 1 inst/cycle x 0.4 x 0.075 = 0.03 ops/cycle
• Solution:
  λ     average number of arriving customers/cycle = 0.03
  Tser  average time to service a customer = 25 cycles
  u     server utilization (0..1): u = λ x Tser = 0.03 x 25 = 0.75
  Tq    average time/customer in queue = Tser x u x (1 + C)/(2 x (1 – u))
        = (25 x 0.75 x 1/2)/(1 – 0.75) = 37.5 cycles
  Tsys  average time/customer in system: Tsys = Tq + Tser = 62.5 cycles
  Lq    average length of queue: Lq = λ x Tq = 0.03 x 37.5 = 1.125 requests in queue
  Lsys  average # of tasks in system: Lsys = λ x Tsys = 0.03 x 62.5 = 1.875
Adapted from John Kubiatowicz’s CS 252 lecture notes. Copyright © 2003 UCB.
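The same `mg1` sketch from the previous example reproduces this one by setting C = 0 for the deterministic (M/D/1) service time:

```python
# Deterministic service: C = 0 halves the queueing delay vs. C = 1.
u, t_q, t_sys, l_q, l_sys = mg1(lam=0.03, t_ser=25, C=0.0)
print(u, t_q, t_sys, l_q, l_sys)   # 0.75, 37.5, 62.5, 1.125, 1.875 (cycles)
```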
46
Summary
• The disk industry is growing rapidly and improving:
  – Bandwidth: 40%/yr
  – Areal density: 60%/yr; $/MB improving even faster?
• Disk latency = queue + controller + seek + rotation + transfer
• The advertised average seek time benchmark is much greater than the average seek time in practice
Adapted from John Kubiatowicz’s CS 252 lecture notes. Copyright © 2003 UCB.
47
Things to Do
• Homework Assignment #4 posted with solutions
• Project #3 due on Monday