Disks and RAID
CS 4410 Operating Systems
Spring 2017, Cornell University
Lorenzo Alvisi
Anne Bracy
See: Ch 12, 14.2 in OSPP textbook
The slides are the product of many rounds of teaching CS 4410 by Professors Sirer, Bracy, Agarwal, George, and Van Renesse.
Storage Devices

Magnetic disks
• Storage that rarely becomes corrupted
• Large capacity at low cost
• Block-level random access
• Slow performance for random access
• Better performance for streaming access

Flash memory
• Storage that rarely becomes corrupted
• Capacity at intermediate cost (50x disk)
• Block-level random access
• Good performance for reads; worse for random writes
THAT WAS THEN
• 13th September 1956
• The IBM RAMAC 350
• Total Storage = 5 million characters (just under 5 MB)
http://royal.pingdom.com/2008/04/08/the-history-of-computer-data-storage-in-pictures/

THIS IS NOW
• 2.5-3.5" hard drive
• Example: 500GB Western Digital Scorpio Blue hard drive

Magnetic Disks are 60 years old!
Reading from a disk

[Diagram: disk anatomy showing platters, surfaces, tracks, sectors, spindle, motor, arm assembly, and head]

Must specify:
• cylinder # (distance from spindle)
• surface #
• sector #
• transfer size
• memory address
Disk Tracks

Track*
• ~1 micron wide (1000 nm)
• Wavelength of light is ~0.5 micron
• Resolution of human eye: 50 microns
• 100K tracks on a typical 2.5" disk

Track length varies across the disk
• Outside:
  • More sectors per track
  • Higher bandwidth
• Most of the disk area is in the outer regions of the disk

*not to scale: the head is actually much bigger than a track
Disk Overheads

Disk Latency = Seek Time + Rotation Time + Transfer Time
• Seek: to get to the track (5-15 millisecs)
• Rotational Latency: to get to the sector (4-8 millisecs); on average, only need to wait half a rotation
• Transfer: get bits off the disk (25-50 microsecs)
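To make the formula concrete, here is a back-of-the-envelope calculation in Python. The drive parameters (9 ms average seek, 7200 RPM, 100 MB/s transfer rate, 4 KB request) are illustrative assumptions, not numbers from the slide.

```python
# Rough disk latency for one 4 KB random read (assumed drive parameters).
seek_ms = 9.0                           # average seek time
rpm = 7200
rotation_ms = (60_000 / rpm) / 2        # wait half a rotation on average ~= 4.17 ms
transfer_ms = 4096 / 100e6 * 1000       # 4 KB at 100 MB/s ~= 0.04 ms

latency_ms = seek_ms + rotation_ms + transfer_ms
print(f"~{latency_ms:.2f} ms per random 4 KB read")   # ~13.21 ms
```

Note how the transfer term is negligible: seek and rotation dominate random access.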
Hard Disks vs. RAM
Hard Disks RAMSmallest write sector wordAtomic write sector word
Random access 5 ms 10-1000 nsSequential access 200 MB/s 200-1000MB/s
Cost $50 / terabyte $5 / gigabytePower reliance (survives power outage?)
Non-volatile (yes)
Volatile (no)
Disk Scheduling

Objective: minimize seek time
Context: a queue of cylinder numbers (#0-199); head pointer @ 53
Queue: 98, 183, 37, 122, 14, 124, 65, 67
Metric: how many cylinders traversed?
Disk Scheduling: FIFO

• Schedule disk operations in the order they arrive
• Downsides?

Head pointer @ 53. Queue: 98, 183, 37, 122, 14, 124, 65, 67
FIFO schedule? Total head movement?
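One way to answer the questions above is a few lines of Python (an illustrative sketch, not from the slide) that total the head movement when requests are served in arrival order, using the slide's head position and queue.

```python
def fifo_movement(head, queue):
    """Total cylinders traversed when requests are served in arrival order."""
    total = 0
    for cyl in queue:
        total += abs(cyl - head)
        head = cyl
    return total

print(fifo_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))  # 640
```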
Disk Scheduling: Shortest Seek Time First

• Select the request with minimum seek time from the current head position
• A form of Shortest Job First (SJF) scheduling
• Not optimal: suppose a cluster of requests at the far end of the disk ➜ starvation!

Head pointer @ 53. Queue: 98, 183, 37, 122, 14, 124, 65, 67
SSTF schedule? Total head movement?
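A corresponding sketch for SSTF, again using the slide's head position and queue; it greedily picks the closest pending request at each step.

```python
def sstf_movement(head, queue):
    """Serve the closest pending request next; return (schedule, total movement)."""
    pending, schedule, total = list(queue), [], 0
    while pending:
        nxt = min(pending, key=lambda c: abs(c - head))  # closest request
        pending.remove(nxt)
        schedule.append(nxt)
        total += abs(nxt - head)
        head = nxt
    return schedule, total

print(sstf_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))
# ([65, 67, 37, 14, 98, 122, 124, 183], 236)
```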
Disk Scheduling: SCAN

• Arm starts at one end of the disk
• Moves toward the other end, servicing requests
• Movement is reversed at the end of the disk
• Repeat
• AKA the elevator algorithm

Head pointer @ 53. Queue: 98, 183, 37, 122, 14, 124, 65, 67
SCAN schedule? Total head movement?
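A SCAN sketch for the same queue, assuming the sweep starts toward cylinder 0 and the head travels all the way to cylinder 0 before reversing (conventions differ; LOOK variants reverse at the last request instead).

```python
def scan_movement(head, queue):
    """SCAN / elevator: sweep toward cylinder 0, reverse at the edge, sweep back up."""
    below = sorted((c for c in queue if c <= head), reverse=True)
    above = sorted(c for c in queue if c > head)
    path = below + ([0] if below else []) + above   # touch cylinder 0 before reversing
    total, pos = 0, head
    for c in path:
        total += abs(c - pos)
        pos = c
    return below + above, total                      # (service order, cylinders traversed)

print(scan_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))
# ([37, 14, 65, 67, 98, 122, 124, 183], 236)
```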
Disk Scheduling: C-SCAN

• Head moves from one end to the other, servicing requests as it goes
• When it reaches the end, it returns to the beginning
• No requests are serviced on the return trip
• Treats cylinders as a circular list: wraps around from last to first
• More uniform wait time than SCAN

Head pointer @ 53. Queue: 98, 183, 37, 122, 14, 124, 65, 67
C-SCAN schedule? Total head movement?
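And a C-SCAN sketch, under the assumptions that the head sweeps up to the last cylinder (199), jumps back to cylinder 0 without servicing anything, and that the return seek counts toward head movement (texts differ on this accounting).

```python
def cscan_movement(head, queue, max_cyl=199):
    """C-SCAN: service only while sweeping up; jump back to cylinder 0 at the end."""
    above = sorted(c for c in queue if c >= head)
    below = sorted(c for c in queue if c < head)
    # Up to the disk edge, back to 0 (no servicing), then up to the last low request.
    total = (max_cyl - head) + max_cyl + (max(below) if below else 0)
    return above + below, total

print(cscan_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))
# ([65, 67, 98, 122, 124, 183, 14, 37], 382)
```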
Solid State Drives (Flash)

Most SSDs are based on NAND flash
• retains its state for months to years without power

https://flashdba.com/2015/01/09/understanding-flash-floating-gates-and-wear/

NAND Flash

[Diagram: Metal Oxide Semiconductor Field Effect Transistor (MOSFET) vs. Floating Gate MOSFET (FGMOS)]
Charge is stored in the Floating Gate (can have Single- and Multi-Level Cells)
Flash Operations

Erase block: sets each cell to "1"
• erase granularity = "erasure block" = 128-512 KB
• time: several ms

Write page: can only write erased pages
• write granularity = 1 page = 2-4 KBytes
• time: 10s of µs

Read page:
• read granularity = 1 page = 2-4 KBytes
• time: 10s of µs
Flash Limitations

• can't write 1 byte/word (must write whole blocks)
• limited # of erase cycles per block (memory wear)
  • 10^3-10^6 erases and the cell wears out
• reads can "disturb" nearby words and overwrite them with garbage

Lots of techniques to compensate:
• error correcting codes
• bad page / erasure block management
• wear leveling: trying to distribute erasures across the entire drive
Flash Translation Layer

• Flash device firmware maps a logical page # to a physical location
  – Garbage collect an erasure block by copying live pages to a new location, then erase
    • More efficient if blocks stored at the same time are deleted at the same time (e.g., keep blocks of a file together)
  – Wear-levelling: only write each physical page a limited number of times
  – Remap pages that no longer work (sector sparing)
• Transparent to the device user
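To illustrate the remapping idea, here is a toy flash translation layer in Python. It is a sketch only: the class and method names are made up, and it omits wear leveling, error correction, free-block accounting, and real flash timings.

```python
PAGES_PER_BLOCK = 4
NUM_BLOCKS = 8

class ToyFTL:
    def __init__(self):
        self.flash = [[None] * PAGES_PER_BLOCK for _ in range(NUM_BLOCKS)]  # physical pages
        self.map = {}          # logical page number -> (block, page)
        self.live = {}         # (block, page) -> logical page number (for GC)
        self.cursor = (0, 0)   # next erased page to write into

    def _advance(self):
        blk, pg = self.cursor
        pg += 1
        if pg == PAGES_PER_BLOCK:
            blk, pg = (blk + 1) % NUM_BLOCKS, 0
        self.cursor = (blk, pg)

    def write(self, lpn, data):
        """Out-of-place write: put data on a fresh page and remap the logical page."""
        if lpn in self.map:
            self.live.pop(self.map[lpn], None)     # old copy becomes garbage
        blk, pg = self.cursor
        self.flash[blk][pg] = data
        self.map[lpn] = (blk, pg)
        self.live[(blk, pg)] = lpn
        self._advance()

    def read(self, lpn):
        blk, pg = self.map[lpn]
        return self.flash[blk][pg]

    def gc(self, blk):
        """Garbage collection: copy live pages to new locations, then erase the block.
        (Toy version: assumes the write cursor is not inside the block being erased.)"""
        for pg in range(PAGES_PER_BLOCK):
            lpn = self.live.pop((blk, pg), None)
            if lpn is not None:
                self.write(lpn, self.flash[blk][pg])   # relocate live data
        self.flash[blk] = [None] * PAGES_PER_BLOCK      # "erase": whole block at once

ftl = ToyFTL()
ftl.write(7, "v1")
ftl.write(7, "v2")      # second write lands on a new physical page
print(ftl.read(7))      # v2
```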
SSD vs HDD
                   SSD          HDD
Cost               10 cts/gig   6 cts/gig
Power              2-3 W        6-7 W
Typical Capacity   1 TB         2 TB
Write Speed        250 MB/sec   200 MB/sec
Read Speed         700 MB/sec   200 MB/sec
What do we want?

Performance: keeping up with the CPU
• CPU 2x faster every 2 years (until recently)
• Disks 20x faster in 3 decades

What can we do to improve Disk Performance?
Hint #1: Disks did get cheaper in the past 3 decades…
Hint #2: When CPUs stopped getting faster, we also did this…
RAID, Step 0: Striping

Redundant Array of Inexpensive Disks (RAID)
• In industry, "I" is for "Independent"
• The alternative is SLED: single large expensive disk
• RAID + RAID controller looks just like a SLED to the computer (yay, abstraction!)

GOALS:
1. Performance
  • Parallelize individual requests
  • Support parallel requests

TECHNIQUES:
0. Striping
RAID-0

Files striped across disks
• Read: high throughput (parallel I/O)
• Write: best throughput
Downsides?

Disk 0   Disk 1   Disk 2   Disk 3
D0       D1       D2       D3
D4       D5       D6       D7
D8       D9       D10      D11
D12      D13      D14      D15
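The block placement above follows from simple modular arithmetic; a minimal sketch (the function name is made up for illustration):

```python
def raid0_locate(block_num, num_disks=4):
    """Map a logical block number to (disk, offset) under simple striping."""
    return block_num % num_disks, block_num // num_disks

# Matches the layout above: D5 is on Disk 1 at offset 1, D14 on Disk 2 at offset 3.
assert raid0_locate(5) == (1, 1)
assert raid0_locate(14) == (2, 3)
```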
What could possibly go wrong?

Failure can occur for:
(1) Isolated Disk Sectors (1+ sectors down, rest OK)
• Permanent: physical malfunction (magnetic coating, scratches, contaminants)
• Transient: data corrupted, but new data can be successfully written to / read from the sector

(2) Entire Device Failure
• Damage to disk head, electronic failure, mechanical wear out
• Detected by the device driver; accesses return error codes
• Characterized by annual failure rates or Mean Time To Failure (MTTF)
What do we also want?

Reliability: data fetched is what you stored
Availability: data is there when you want it

• More disks ➜ higher probability of some disk failing
• Striping reduces reliability 😞
  • N disks: 1/Nth the mean time between failures of 1 disk (e.g., 10 disks with a 100,000-hour MTTF each give an array MTTF of roughly 10,000 hours)

What can we do to improve Disk Reliability?
Hint #1: When CPUs stopped being reliable, we also did this…
RAID, Step 1: Mirroring

To improve reliability, add redundancy

GOALS:
1. Performance
  • Parallelize individual requests
  • Support parallel requests
2. Reliability

TECHNIQUES:
0. Striping
1. Mirroring
RAID-1

Disks mirrored: data written in 2 places
Simple, expensive
Example: Google File System replicated data on 3 disks, spread across multiple racks

• Reads: go to either disk ➜ 2x faster than SLED
• Write: replicate to every mirrored disk ➜ same speed as SLED

Full Disk Failure: use the surviving disk
Bit Flip Error: Detect? Correct?
RAID, Step 2: Parity

To recover from failures, add parity
• n-input XOR gives bit-level parity (1 = odd, 0 = even)
• 1101 ⊕ 1100 ⊕ 0110 = 0111 (parity block)
• Can reconstruct any missing block from the others

GOALS:
1. Performance
  • Parallelize individual requests
  • Support parallel requests
2. Reliability

TECHNIQUES:
0. Striping
1. Mirroring
2. Parity
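The parity arithmetic on this slide can be checked directly; the small helper below is illustrative and uses the slide's three example blocks.

```python
def parity(blocks):
    """Bitwise XOR of all blocks gives the parity block."""
    p = 0
    for b in blocks:
        p ^= b
    return p

data = [0b1101, 0b1100, 0b0110]
p = parity(data)                           # 0b0111, as on the slide

# Reconstruct a "lost" block from the survivors plus parity:
recovered = parity([data[0], data[2], p])
assert recovered == data[1] == 0b1100
print(f"parity = {p:04b}, recovered block = {recovered:04b}")
```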
Lesser Loved RAIDs

RAID-2: bit-level striping with ECC codes
• 7 disk arms synchronized and move in unison
• Complicated controller (and hence very unpopular)
• Tolerates 1 error with no performance degradation

RAID-3: byte-level striping + parity disk
• read accesses all data disks
• write accesses all data disks + parity disk
• On disk failure: read parity disk, compute missing data

RAID-4: block-level striping + parity disk
+ better spatial locality for disk access
- parity disk is the write bottleneck and wears out faster

RAID-3 layout:  Disk 0: byte 0    Disk 1: byte 1    Disk 2: byte 2    Disk 3: byte 3    Disk 4: Parity
RAID-4 layout:  Disk 0: stripe 0  Disk 1: stripe 1  Disk 2: stripe 2  Disk 3: stripe 3  Disk 4: Parity
A word about Granularity

Bit-level ➜ byte-level ➜ block-level
• fine-grained: stripe each file across all disks
  + high throughput for the file
  - wasted disk seek time
  - limited to transferring 1 file at a time
• coarse-grained: stripe each file over a few disks
  - limits throughput for 1 file
  + better use of spatial locality (for disk seek)
  + allows more parallel file access
RAID-5: Rotating Parity with Striping

• Write 1 block:
  – Read old data block
  – Read old parity block
  – Write new data block
  – Write new parity block (old data ⊕ old parity ⊕ new data)
• Write entire stripe:
  – Write data blocks and parity for each strip in the stripe
  Good write performance
• Read: go to the correct disk; can outperform SLEDs and RAID-0
RAID-5: Write Example

• Write(D2, 0111)
  – Read old data block (1010)
  – Read old parity block (1001)
  – Write new data block (0111)
  – Write new parity block (old data ⊕ old parity ⊕ new data): 1010 ⊕ 1001 ⊕ 0111 = 0100

         Disk 0   Disk 1   Disk 2   Disk 3   Disk 4
         D0       D1       D2       D3       P0-3
Before   0000     1111     1010     1100     1001
After    0000     1111     0111     1100     0100
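The same example replayed in Python, checking that the incremental parity update matches recomputing parity over the whole stripe (values taken from the slide).

```python
# RAID-5 small write from the slide.
old_data, old_parity, new_data = 0b1010, 0b1001, 0b0111

new_parity = old_data ^ old_parity ^ new_data       # incremental parity update
assert new_parity == 0b0100

# Sanity check: recomputing parity over all data blocks after the write
# gives the same answer as the incremental update.
stripe_after = [0b0000, 0b1111, new_data, 0b1100]   # D0, D1, D2 (new), D3
full = 0
for b in stripe_after:
    full ^= b
assert full == new_parity
print(f"{new_parity:04b}")   # 0100
```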