CS 61C: Great Ideas in Computer Architecture
Dependability – More on ECC, RAID
Vladimir Stojanovic & Nicholas Weaver
http://inst.eecs.berkeley.edu/~cs61c/



Hamming Distance: 8 code words

Hamming Distance 2: Detection
Detect Single Bit Errors

• No 1-bit error goes to another valid code word
• ½ of the code words are valid

Figure label: Invalid Code Words

Hamming Distance 3: Correction
Correct Single Bit Errors, Detect Double Bit Errors

• No 2-bit error goes to another valid code word; a 1-bit error stays nearest to its original code word
• 1/4 of the code words are valid

Figure labels: Nearest 000 (one 1); Nearest 111 (one 0)
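The distance claims on the two slides above can be checked with a short sketch. The two example codes below (a 3-bit even-parity code and a 3-bit repetition code) are assumptions chosen to match the "½ valid" and "1/4 valid" bullets; the slides' figures may use different labels.

```python
from itertools import combinations

def hamming_distance(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("code words must have the same length")
    return sum(x != y for x, y in zip(a, b))

def min_distance(code: list[str]) -> int:
    """Smallest Hamming distance between any two distinct code words."""
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))

# Assumed example: even parity over 2 data bits (4 of the 8 words are valid).
parity_code = ["000", "011", "101", "110"]
# Assumed example: 3-bit repetition code (2 of the 8 words are valid).
repetition_code = ["000", "111"]

print(min_distance(parity_code))      # 2: single-bit errors are detectable
print(min_distance(repetition_code))  # 3: single-bit errors are correctable
```

With distance 2, any single flipped bit lands on an invalid word; with distance 3, it also stays closer to the original word than to any other valid word.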

Hamming Error Correcting Code

• Overhead involved in a single-error-correction code
• Let p be the total number of parity bits and d the number of data bits in a p + d bit word
• If p error-correction bits are to point to the error bit (p + d cases) + indicate that no error exists (1 case), we need:

  2^p >= p + d + 1, thus p >= log2(p + d + 1); for large d, p approaches log2(d)

• 8 bits data => d = 8, 2^p >= p + 8 + 1 => p >= 4
• 16b data => 5b parity, 32b data => 6b parity, 64b data => 7b parity
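The inequality above can be evaluated directly. This is a minimal sketch; the function name `parity_bits_needed` is an illustrative choice, not something from the slides.

```python
def parity_bits_needed(d: int) -> int:
    """Smallest p such that 2^p >= p + d + 1 for d data bits."""
    p = 0
    while 2 ** p < p + d + 1:
        p += 1
    return p

for d in (8, 16, 32, 64):
    print(f"{d} data bits -> {parity_bits_needed(d)} parity bits")
```

The loop reproduces the slide's values: 4, 5, 6, and 7 parity bits for 8, 16, 32, and 64 data bits.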

Hamming Single-Error Correction, Double-Error Detection (SEC/DED)

• Adding an extra parity bit covering the entire word provides double error detection as well as single error correction

  Position: 1  2  3  4  5  6  7  8
  Bit:      p1 p2 d1 p3 d2 d3 d4 p4

• Hamming parity bits H (p1 p2 p3) are computed (even parity as usual) plus the even parity over the entire word, p4:
  – H = 0, p4 = 0: no error
  – H ≠ 0, p4 = 1: correctable single error (odd parity if 1 error => p4 = 1)
  – H ≠ 0, p4 = 0: double error occurred (even parity if 2 errors => p4 = 0)
  – H = 0, p4 = 1: single error occurred in the p4 bit, not in the rest of the word

Typical modern codes in DRAM memory systems: 64-bit data blocks (8 bytes) with 72-bit code words (9 bytes).
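The four cases above can be sketched for the 8-bit layout p1 p2 d1 p3 d2 d3 d4 p4. The names `encode_secded`, `classify`, and `flip` are hypothetical helpers, and "p4 = 1" is interpreted here as "the recomputed parity over all 8 bits is odd", which is an assumption about the slide's shorthand.

```python
def encode_secded(d1, d2, d3, d4):
    """Build the 8-bit word p1 p2 d1 p3 d2 d3 d4 p4 with even parity."""
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    word = [p1, p2, d1, p3, d2, d3, d4]
    p4 = sum(word) % 2  # even parity over the entire word
    return word + [p4]

def classify(word):
    """Return (status, error_position) following the slide's four cases."""
    h = 0  # syndrome H: XOR of the positions (1..7) that hold a 1
    for position, bit in enumerate(word[:7], start=1):
        if bit:
            h ^= position
    p4_check = sum(word) % 2  # 1 means the overall parity is odd
    if h == 0 and p4_check == 0:
        return "no error", None
    if h != 0 and p4_check == 1:
        return "single error", h
    if h != 0 and p4_check == 0:
        return "double error", None
    return "error in p4", 8

def flip(word, *positions):
    """Copy of word with the given 1-based positions inverted."""
    out = list(word)
    for position in positions:
        out[position - 1] ^= 1
    return out

sent = encode_secded(1, 0, 1, 1)
print(classify(sent))
print(classify(flip(sent, 6)))
print(classify(flip(sent, 2, 6)))
print(classify(flip(sent, 8)))
```

A single error is the only case where a position is reported for correction; a double error is flagged but not corrected.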

Hamming Single Error Correction + Double Error Detection

Hamming Distance = 4

Figure labels: 1-bit error (one 1), nearest 0000; 1-bit error (one 0), nearest 1111; 2-bit error (two 0s, two 1s), halfway between both

iClicker Question

The following word is received, encoded with Hamming code:
0 1 1 0 0 0 1

What is the corrected data bit sequence?

A. 1111
B. 0001
C. 1101
D. 1011
E. 1000

What if More Than 2-Bit Errors?

• Network transmissions, disks, distributed storage: the common failure mode is bursts of bit errors, not just one or two bit errors
  – Contiguous sequence of B bits in which the first, last, and any number of intermediate bits are in error
  – Caused by impulse noise or by fading in wireless
  – Effect is greater at higher data rates
• Solve with Cyclic Redundancy Check (CRC), interleaving, or other more advanced codes
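As a rough illustration of CRC-based burst detection, the sketch below appends a CRC-32 (via Python's standard `zlib.crc32`) to a payload and then flips 8 contiguous bits. The helper names `with_crc` and `check_crc`, the payload, and the choice of CRC-32 are assumptions for illustration; the slides do not name a specific polynomial.

```python
import zlib

def with_crc(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check_crc(frame: bytes) -> bool:
    """Recompute the CRC-32 over the payload and compare with the trailer."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(payload) == received

frame = with_crc(b"dependability")

# Burst error: 8 contiguous bits flipped, straddling a byte boundary.
corrupted = bytearray(frame)
corrupted[3] ^= 0x0F
corrupted[4] ^= 0xF0

print(check_crc(frame))             # True
print(check_crc(bytes(corrupted)))  # False
```

A CRC only detects the burst; recovery then relies on retransmission or a separate correcting code.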

iClicker Question

The following word is received, encoded with Hamming code:
0 1 1 0 0 0 1

check p1: 0 x 1 x 0 x 1 – o.k.
check p2: x 1 1 x x 0 1 – error in p2
check p4: x x x 0 0 0 1 – error in p4
Error in location 2 + 4 = 6
Correct data: 1 0 1 1 (answer D)
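The parity-group checks above can be replayed in a few lines. This sketch assumes the 7-bit layout p1 p2 d1 p4 d2 d3 d4 used on the slide, with parity bits named by position; `GROUPS` and `correct_hamming74` are illustrative names.

```python
# Positions (1-based) covered by each parity bit, keyed by the parity position.
GROUPS = {1: (1, 3, 5, 7), 2: (2, 3, 6, 7), 4: (4, 5, 6, 7)}

def correct_hamming74(word):
    """Return (error_position, corrected_data_bits) for a 7-bit word."""
    error_position = 0
    for parity_position, covered in GROUPS.items():
        # Even parity: an odd count of 1s means this check fails.
        if sum(word[p - 1] for p in covered) % 2 == 1:
            error_position += parity_position
    fixed = list(word)
    if error_position:
        fixed[error_position - 1] ^= 1  # flip the bit the checks point to
    data = [fixed[2], fixed[4], fixed[5], fixed[6]]  # d1 d2 d3 d4
    return error_position, data

received = [0, 1, 1, 0, 0, 0, 1]
error_position, data = correct_hamming74(received)
print(error_position, data)  # 6 [1, 0, 1, 1]
```

The failing checks (p2 and p4) add up to position 6, and flipping that bit yields data 1011.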

Evolution of the Disk Drive

Figure labels: IBM RAMAC 305, 1956; IBM 3390K, 1986; Apple SCSI, 1986

Can smaller disks be used to close the gap in performance between disks and CPUs?

Arrays of Small Disks

Conventional: 4 disk designs (14", 10", 5.25", 3.5")
Disk Array: 1 disk design (3.5")

Figure labels: Low End, High End

Replace Small Number of Large Disks with Large Number of Small Disks! (1988 Disks)

             IBM 3390K     IBM 3.5" 0061   x70             x70 vs. 3390K
Capacity     20 GBytes     320 MBytes      23 GBytes
Volume       97 cu. ft.    0.1 cu. ft.     11 cu. ft.      9X
Power        3 KW          11 W            1 KW            3X
Data Rate    15 MB/s       1.5 MB/s        120 MB/s        8X
I/O Rate     600 I/Os/s    55 I/Os/s       3900 I/Os/s     6X
MTTF         250 KHrs      50 KHrs         ??? Hrs
Cost         $250K         $2K             $150K

Disk arrays have potential for large data and I/O rates, high MB per cu. ft., high MB per KW, but what about reliability?
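The 9X, 3X, 8X, and 6X annotations appear to compare the 70-disk array against the IBM 3390K; that reading is an assumption based on the arithmetic, sketched below with the table's figures. The dictionary keys and the `improvement` helper are illustrative names.

```python
# Figures copied from the table; power converted to watts.
ibm_3390k = {"volume_cu_ft": 97, "power_w": 3000, "data_rate_mb_s": 15, "io_rate_per_s": 600}
array_x70 = {"volume_cu_ft": 11, "power_w": 1000, "data_rate_mb_s": 120, "io_rate_per_s": 3900}

# For volume and power, smaller is better; for rates, larger is better.
LOWER_IS_BETTER = {"volume_cu_ft", "power_w"}

def improvement(metric: str) -> float:
    """Factor by which the disk array improves on the 3390K for one metric."""
    big, arr = ibm_3390k[metric], array_x70[metric]
    return big / arr if metric in LOWER_IS_BETTER else arr / big

ratios = {metric: improvement(metric) for metric in ibm_3390k}
for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.1f}x")
```

The computed factors are roughly 8.8, 3.0, 8.0, and 6.5, which the slide rounds to 9X, 3X, 8X, and 6X.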

RAID: Redundant Arrays of (Inexpensive) Disks

• Files are "striped" across multiple disks
• Redundancy yields high data availability
  – Availability: service still provided to user, even if some components failed
• Disks will still fail
• Contents reconstructed from data redundantly stored in the array
  => Capacity penalty to store redundant info
  => Bandwidth penalty to update redundant info

Redundant Arrays of Inexpensive Disks
RAID 1: Disk Mirroring/Shadowing

• Each disk is fully duplicated onto its "mirror"
  Very high availability can be achieved
• Writes limited by single-disk speed
• Reads may be optimized

Most expensive solution: 100% capacity overhead

Figure label: recovery group

Redundant Array of Inexpensive Disks
RAID 3: Parity Disk

Figure labels: logical record 10010011 11001101 10010011 ...; striped physical records 10100011, 11001101, 10100011, 11001101; parity disk P

P contains the sum of the other disks per stripe mod 2 ("parity"). If a disk fails, subtract P from the sum of the other disks to find the missing information.
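Sum mod 2 is bitwise XOR, and subtraction mod 2 is the same operation, so reconstruction is just another XOR. The sketch below uses made-up byte values for the data disks; they are illustrative and not taken from the slide's figure.

```python
from functools import reduce

def parity(blocks):
    """Bitwise sum mod 2 (XOR) of a list of equal-width blocks."""
    return reduce(lambda a, b: a ^ b, blocks, 0)

# Illustrative stripe: one byte per data disk (values are assumptions).
data_disks = [0b10100011, 0b11001101, 0b01100110, 0b00011000]
p = parity(data_disks)

# Suppose disk 2 fails: XOR the survivors with P to rebuild its contents.
failed = 2
survivors = [block for i, block in enumerate(data_disks) if i != failed]
recovered = parity(survivors + [p])

print(f"{recovered:08b}")  # 01100110
```

This works only because the array knows which disk failed; parity alone cannot locate an error.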

Redundant Arrays of Inexpensive Disks
RAID 4: High I/O Rate Parity

Insides of 5 disks (disk columns left to right; each row is a stripe; logical disk address increases downward):

D0   D1   D2   D3   P
D4   D5   D6   D7   P
D8   D9   D10  D11  P
D12  D13  D14  D15  P
D16  D17  D18  D19  P
D20  D21  D22  D23  P
...

Example: small read D0 & D5, large write D12-D15
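The example on this slide can be traced with a small address-mapping sketch. It assumes 4 data disks numbered 0 to 3 plus a dedicated parity disk 4, with blocks laid out row by row; `raid4_location` is a hypothetical helper.

```python
DATA_DISKS = 4   # disks 0..3 hold data
PARITY_DISK = 4  # disk 4 holds parity for every stripe

def raid4_location(block: int):
    """Map logical block number to (disk, stripe) in a RAID 4 layout."""
    return block % DATA_DISKS, block // DATA_DISKS

# Small reads of D0 and D5 land on different disks, so they can overlap.
print(raid4_location(0))  # (0, 0)
print(raid4_location(5))  # (1, 1)

# Large write D12-D15 covers every data disk of stripe 3, so parity can be
# computed from the new data alone, with no reads of old contents.
large_write = [raid4_location(block) for block in range(12, 16)]
print(large_write)  # [(0, 3), (1, 3), (2, 3), (3, 3)]
```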

Inspiration for RAID 5

• RAID 4 works well for small reads
• Small writes (write to one disk):
  – Option 1: read other data disks, create new sum and write to Parity Disk
  – Option 2: since P has old sum, compare old data to new data, add the difference to P
• Small writes are limited by Parity Disk: Write to D0, D5 both also write to P disk

D0   D1   D2   D3   P
D4   D5   D6   D7   P
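Both options produce the same parity, since "difference" and "sum" are both XOR here. A minimal sketch, with illustrative block values and hypothetical function names `parity_option1` and `parity_option2`:

```python
def parity_option1(stripe, index, new_data):
    """Option 1: XOR the new data with the other data disks in the stripe."""
    new_parity = new_data
    for i, block in enumerate(stripe):
        if i != index:
            new_parity ^= block
    return new_parity

def parity_option2(old_data, new_data, old_parity):
    """Option 2: add the difference between old and new data to old parity."""
    return old_parity ^ (old_data ^ new_data)

stripe = [0x12, 0x34, 0x56, 0x78]  # illustrative contents of D0..D3
old_parity = stripe[0] ^ stripe[1] ^ stripe[2] ^ stripe[3]
new_d0 = 0xAB

p1 = parity_option1(stripe, 0, new_d0)
p2 = parity_option2(stripe[0], new_d0, old_parity)
print(p1 == p2)  # True
```

Option 1 reads every other data disk in the stripe, while Option 2 reads only the old data and the old parity, so Option 2 scales better as the array widens.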

RAID 5: High I/O Rate Interleaved Parity

Independent writes possible because of interleaved parity

Disk columns left to right; logical disk addresses increase downward:

D0   D1   D2   D3   P
D4   D5   D6   P    D7
D8   D9   P    D10  D11
D12  P    D13  D14  D15
P    D16  D17  D18  D19
D20  D21  D22  D23  P
...

Example: write to D0, D5 uses disks 0, 1, 3, 4
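The rotated layout shown on this slide can be expressed as a mapping. This sketch assumes 5 disks numbered 0 to 4 with parity moving one disk to the left on each successive stripe, as in the diagram; real RAID 5 implementations use several different rotation schemes, and the function names are illustrative.

```python
NUM_DISKS = 5

def parity_disk(stripe: int) -> int:
    """Parity starts on the last disk and rotates left one disk per stripe."""
    return (NUM_DISKS - 1) - (stripe % NUM_DISKS)

def data_disk(block: int) -> int:
    """Disk holding logical data block D<block>."""
    stripe, index = divmod(block, NUM_DISKS - 1)
    disks = [d for d in range(NUM_DISKS) if d != parity_disk(stripe)]
    return disks[index]

def disks_for_small_write(block: int) -> set:
    """A small write touches the block's data disk and its stripe's parity disk."""
    stripe = block // (NUM_DISKS - 1)
    return {data_disk(block), parity_disk(stripe)}

d0 = disks_for_small_write(0)
d5 = disks_for_small_write(5)
print(sorted(d0), sorted(d5))  # [0, 4] [1, 3]
print(d0.isdisjoint(d5))       # True: the two writes can proceed in parallel
```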

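Problems of Disk Arrays: Small Writes

RAID-5: Small Write Algorithm
1 Logical Write = 2 Physical Reads + 2 Physical Writes

Stripe before: D0 D1 D2 D3 P; new data: D0'
  (1. Read) old data D0
  (2. Read) old parity P
  XOR new data D0' with old data D0, then XOR the result with old parity P to get P'
  (3. Write) new data D0'
  (4. Write) new parity P'
Stripe after: D0' D1 D2 D3 P'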
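To make the I/O count concrete, the sketch below wraps each disk in a tiny class that counts physical reads and writes. The `Disk` class, `small_write` function, and block values are hypothetical stand-ins, not part of any real RAID software.

```python
class Disk:
    """One block of storage with counters for physical operations."""
    def __init__(self, block: int = 0):
        self.block = block
        self.reads = 0
        self.writes = 0

    def read(self) -> int:
        self.reads += 1
        return self.block

    def write(self, value: int) -> None:
        self.writes += 1
        self.block = value

def small_write(data_disk: Disk, parity_disk: Disk, new_data: int) -> None:
    old_data = data_disk.read()                      # 1. read old data
    old_parity = parity_disk.read()                  # 2. read old parity
    new_parity = old_parity ^ (old_data ^ new_data)  # two XORs
    data_disk.write(new_data)                        # 3. write new data
    parity_disk.write(new_parity)                    # 4. write new parity

stripe = [Disk(0x12), Disk(0x34), Disk(0x56), Disk(0x78)]
parity = Disk(0x12 ^ 0x34 ^ 0x56 ^ 0x78)
small_write(stripe[0], parity, 0xAB)

total_reads = sum(d.reads for d in stripe + [parity])
total_writes = sum(d.writes for d in stripe + [parity])
print(total_reads, total_writes)  # 2 2
```

Disks D1 through D3 are never touched, which is why the cost stays at four physical operations regardless of stripe width.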

Tech Report Read 'Round the World (December 1987)

RAID-I

• RAID-I (1989)
  – Consisted of a Sun 4/280 workstation with 128 MB of DRAM, four dual-string SCSI controllers, 28 5.25-inch SCSI disks and specialized disk striping software

RAID II

• 1990-1993
• Early Network Attached Storage (NAS) System running a Log Structured File System (LFS)
• Impact:
  – $25 Billion/year in 2002
  – Over $150 Billion in RAID devices sold from 1990 to 2002
  – 200+ RAID companies (at the peak)
  – Software RAID a standard component of modern OSs

And, in Conclusion, …

• Memory
  – Hamming distance 2: Parity for Single Error Detect
  – Hamming distance 3: Single Error Correction Code + encode bit position of error
• Treat disks like memory, except you know when a disk has failed: erasure makes parity an Error Correcting Code
• RAID-2, -3, -4, -5: Interleaved data and parity