Disks and RAID
CS 4410 Operating Systems
Spring 2017, Cornell University
Lorenzo Alvisi
Anne Bracy
See: Ch 12, 14.2 in OSPP textbook
The slides are the product of many rounds of teaching CS 4410 by Professors Sirer, Bracy, Agarwal, George, and Van Renesse.
Storage Devices

Magnetic disks
• Storage that rarely becomes corrupted
• Large capacity at low cost
• Block-level random access
• Slow performance for random access
• Better performance for streaming access

Flash memory
• Storage that rarely becomes corrupted
• Capacity at intermediate cost (50x disk)
• Block-level random access
• Good performance for reads; worse for random writes
THAT WAS THEN
• 13th September 1956
• The IBM RAMAC 350
• Total Storage = 5 million characters (just under 5 MB)
http://royal.pingdom.com/2008/04/08/the-history-of-computer-data-storage-in-pictures/

THIS IS NOW
• 2.5-3.5" hard drive
• Example: 500GB Western Digital Scorpio Blue hard drive

Magnetic Disks are 60 years old!
Reading from a disk

[Diagram: disk anatomy showing platters, surfaces, tracks, sectors, spindle, motor, arm assembly, and head]

Must specify:
• cylinder # (distance from spindle)
• surface #
• sector #
• transfer size
• memory address
Disk Tracks

Track*
• ~1 micron wide (1000 nm)
• Wavelength of light is ~0.5 micron
• Resolution of human eye: 50 microns
• 100K tracks on a typical 2.5" disk

Track length varies across the disk
• Outside:
  • More sectors per track
  • Higher bandwidth
• Most of the disk area is in the outer regions of the disk

*not to scale: the head is actually much bigger than a track
Disk Overheads

Disk Latency = Seek Time + Rotation Time + Transfer Time
• Seek: to get to the track (5-15 millisecs)
• Rotational Latency: to get to the sector (4-8 millisecs); on average, only need to wait half a rotation
• Transfer: get bits off the disk (25-50 microsecs)
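To make the formula concrete, here is a back-of-the-envelope calculation in Python. The drive parameters (9 ms average seek, 7200 RPM, 100 MB/s transfer rate, 4 KB request) are illustrative assumptions, not numbers from the slide.

```python
# Rough disk latency for one 4 KB random read (assumed drive parameters).
seek_ms = 9.0                           # average seek time
rpm = 7200
rotation_ms = (60_000 / rpm) / 2        # wait half a rotation on average ~= 4.17 ms
transfer_ms = 4096 / 100e6 * 1000       # 4 KB at 100 MB/s ~= 0.04 ms

latency_ms = seek_ms + rotation_ms + transfer_ms
print(f"~{latency_ms:.2f} ms per random 4 KB read")   # ~13.21 ms
```

Note how the transfer term is negligible: seek and rotation dominate random access.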
Hard Disks vs. RAM
Hard Disks RAMSmallest write sector wordAtomic write sector word
Random access 5 ms 10-1000 nsSequential access 200 MB/s 200-1000MB/s
Cost $50 / terabyte $5 / gigabytePower reliance (survives power outage?)
Non-volatile (yes)
Volatile (no)
Disk Scheduling

Objective: minimize seek time
Context: a queue of cylinder numbers (#0-199); head pointer @ 53
Queue: 98, 183, 37, 122, 14, 124, 65, 67
Metric: how many cylinders traversed?
Disk Scheduling: FIFO

• Schedule disk operations in the order they arrive
• Downsides?

Head pointer @ 53. Queue: 98, 183, 37, 122, 14, 124, 65, 67
FIFO schedule? Total head movement?
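One way to answer the questions above is a few lines of Python (an illustrative sketch, not from the slide) that total the head movement when requests are served in arrival order, using the slide's head position and queue.

```python
def fifo_movement(head, queue):
    """Total cylinders traversed when requests are served in arrival order."""
    total = 0
    for cyl in queue:
        total += abs(cyl - head)
        head = cyl
    return total

print(fifo_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))  # 640
```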
Disk Scheduling: Shortest Seek Time First

• Select the request with minimum seek time from the current head position
• A form of Shortest Job First (SJF) scheduling
• Not optimal: suppose a cluster of requests at the far end of the disk ➜ starvation!

Head pointer @ 53. Queue: 98, 183, 37, 122, 14, 124, 65, 67
SSTF schedule? Total head movement?
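A corresponding sketch for SSTF, again using the slide's head position and queue; it greedily picks the closest pending request at each step.

```python
def sstf_movement(head, queue):
    """Serve the closest pending request next; return (schedule, total movement)."""
    pending, schedule, total = list(queue), [], 0
    while pending:
        nxt = min(pending, key=lambda c: abs(c - head))  # closest request
        pending.remove(nxt)
        schedule.append(nxt)
        total += abs(nxt - head)
        head = nxt
    return schedule, total

print(sstf_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))
# ([65, 67, 37, 14, 98, 122, 124, 183], 236)
```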
Disk Scheduling: SCAN

• Arm starts at one end of the disk
• Moves toward the other end, servicing requests
• Movement is reversed at the end of the disk
• Repeat
• AKA the elevator algorithm

Head pointer @ 53. Queue: 98, 183, 37, 122, 14, 124, 65, 67
SCAN schedule? Total head movement?
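A SCAN sketch for the same queue, assuming the sweep starts toward cylinder 0 and the head travels all the way to cylinder 0 before reversing (conventions differ; LOOK variants reverse at the last request instead).

```python
def scan_movement(head, queue):
    """SCAN / elevator: sweep toward cylinder 0, reverse at the edge, sweep back up."""
    below = sorted((c for c in queue if c <= head), reverse=True)
    above = sorted(c for c in queue if c > head)
    path = below + ([0] if below else []) + above   # touch cylinder 0 before reversing
    total, pos = 0, head
    for c in path:
        total += abs(c - pos)
        pos = c
    return below + above, total                      # (service order, cylinders traversed)

print(scan_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))
# ([37, 14, 65, 67, 98, 122, 124, 183], 236)
```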
Disk Scheduling: C-SCAN

• Head moves from one end to the other, servicing requests as it goes
• When it reaches the end, it returns to the beginning
• No requests are serviced on the return trip
• Treats cylinders as a circular list: wraps around from last to first
• More uniform wait time than SCAN

Head pointer @ 53. Queue: 98, 183, 37, 122, 14, 124, 65, 67
C-SCAN schedule? Total head movement?
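And a C-SCAN sketch, under the assumptions that the head sweeps up to the last cylinder (199), jumps back to cylinder 0 without servicing anything, and that the return seek counts toward head movement (texts differ on this accounting).

```python
def cscan_movement(head, queue, max_cyl=199):
    """C-SCAN: service only while sweeping up; jump back to cylinder 0 at the end."""
    above = sorted(c for c in queue if c >= head)
    below = sorted(c for c in queue if c < head)
    # Up to the disk edge, back to 0 (no servicing), then up to the last low request.
    total = (max_cyl - head) + max_cyl + (max(below) if below else 0)
    return above + below, total

print(cscan_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))
# ([65, 67, 98, 122, 124, 183, 14, 37], 382)
```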
Solid State Drives (Flash)

Most SSDs are based on NAND flash
• retains its state for months to years without power

https://flashdba.com/2015/01/09/understanding-flash-floating-gates-and-wear/

NAND Flash

[Diagram: Metal Oxide Semiconductor Field Effect Transistor (MOSFET) vs. Floating Gate MOSFET (FGMOS)]
Charge is stored in the Floating Gate (can have Single- and Multi-Level Cells)
Flash Operations

Erase block: sets each cell to "1"
• erase granularity = "erasure block" = 128-512 KB
• time: several ms

Write page: can only write erased pages
• write granularity = 1 page = 2-4 KBytes
• time: 10s of µs

Read page:
• read granularity = 1 page = 2-4 KBytes
• time: 10s of µs
Flash Limitations

• can't write 1 byte/word (must write whole blocks)
• limited # of erase cycles per block (memory wear)
  • 10^3-10^6 erases and the cell wears out
• reads can "disturb" nearby words and overwrite them with garbage

Lots of techniques to compensate:
• error correcting codes
• bad page / erasure block management
• wear leveling: trying to distribute erasures across the entire drive
Flash Translation Layer

• Flash device firmware maps a logical page # to a physical location
  – Garbage collect an erasure block by copying live pages to a new location, then erase
    • More efficient if blocks stored at the same time are deleted at the same time (e.g., keep blocks of a file together)
  – Wear-levelling: only write each physical page a limited number of times
  – Remap pages that no longer work (sector sparing)
• Transparent to the device user
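To illustrate the remapping idea, here is a toy flash translation layer in Python. It is a sketch only: the class and method names are made up, and it omits wear leveling, error correction, free-block accounting, and real flash timings.

```python
PAGES_PER_BLOCK = 4
NUM_BLOCKS = 8

class ToyFTL:
    def __init__(self):
        self.flash = [[None] * PAGES_PER_BLOCK for _ in range(NUM_BLOCKS)]  # physical pages
        self.map = {}          # logical page number -> (block, page)
        self.live = {}         # (block, page) -> logical page number (for GC)
        self.cursor = (0, 0)   # next erased page to write into

    def _advance(self):
        blk, pg = self.cursor
        pg += 1
        if pg == PAGES_PER_BLOCK:
            blk, pg = (blk + 1) % NUM_BLOCKS, 0
        self.cursor = (blk, pg)

    def write(self, lpn, data):
        """Out-of-place write: put data on a fresh page and remap the logical page."""
        if lpn in self.map:
            self.live.pop(self.map[lpn], None)     # old copy becomes garbage
        blk, pg = self.cursor
        self.flash[blk][pg] = data
        self.map[lpn] = (blk, pg)
        self.live[(blk, pg)] = lpn
        self._advance()

    def read(self, lpn):
        blk, pg = self.map[lpn]
        return self.flash[blk][pg]

    def gc(self, blk):
        """Garbage collection: copy live pages to new locations, then erase the block.
        (Toy version: assumes the write cursor is not inside the block being erased.)"""
        for pg in range(PAGES_PER_BLOCK):
            lpn = self.live.pop((blk, pg), None)
            if lpn is not None:
                self.write(lpn, self.flash[blk][pg])   # relocate live data
        self.flash[blk] = [None] * PAGES_PER_BLOCK      # "erase": whole block at once

ftl = ToyFTL()
ftl.write(7, "v1")
ftl.write(7, "v2")      # second write lands on a new physical page
print(ftl.read(7))      # v2
```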
SSD vs HDD
                   SSD          HDD
Cost               10 cts/gig   6 cts/gig
Power              2-3 W        6-7 W
Typical Capacity   1 TB         2 TB
Write Speed        250 MB/sec   200 MB/sec
Read Speed         700 MB/sec   200 MB/sec
What do we want?

Performance: keeping up with the CPU
• CPU 2x faster every 2 years (until recently)
• Disks 20x faster in 3 decades

What can we do to improve Disk Performance?
Hint #1: Disks did get cheaper in the past 3 decades…
Hint #2: When CPUs stopped getting faster, we also did this…
RAID, Step 0: Striping

Redundant Array of Inexpensive Disks (RAID)
• In industry, "I" is for "Independent"
• The alternative is SLED: single large expensive disk
• RAID + RAID controller looks just like a SLED to the computer (yay, abstraction!)

GOALS:
1. Performance
  • Parallelize individual requests
  • Support parallel requests

TECHNIQUES:
0. Striping
RAID-0

Files striped across disks
• Read: high throughput (parallel I/O)
• Write: best throughput
Downsides?

Disk 0   Disk 1   Disk 2   Disk 3
D0       D1       D2       D3
D4       D5       D6       D7
D8       D9       D10      D11
D12      D13      D14      D15
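The block placement above follows from simple modular arithmetic; a minimal sketch (the function name is made up for illustration):

```python
def raid0_locate(block_num, num_disks=4):
    """Map a logical block number to (disk, offset) under simple striping."""
    return block_num % num_disks, block_num // num_disks

# Matches the layout above: D5 is on Disk 1 at offset 1, D14 on Disk 2 at offset 3.
assert raid0_locate(5) == (1, 1)
assert raid0_locate(14) == (2, 3)
```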
What could possibly go wrong?

Failure can occur for:
(1) Isolated Disk Sectors (1+ sectors down, rest OK)
• Permanent: physical malfunction (magnetic coating, scratches, contaminants)
• Transient: data corrupted, but new data can be successfully written to / read from the sector

(2) Entire Device Failure
• Damage to disk head, electronic failure, mechanical wear out
• Detected by the device driver; accesses return error codes
• Characterized by annual failure rates or Mean Time To Failure (MTTF)
What do we also want?

Reliability: data fetched is what you stored
Availability: data is there when you want it

• More disks ➜ higher probability of some disk failing
• Striping reduces reliability 😞
  • N disks: 1/Nth the mean time between failures of 1 disk (e.g., 10 disks with a 100,000-hour MTTF each give an array MTTF of roughly 10,000 hours)

What can we do to improve Disk Reliability?
Hint #1: When CPUs stopped being reliable, we also did this…
RAID, Step 1: Mirroring

To improve reliability, add redundancy

GOALS:
1. Performance
  • Parallelize individual requests
  • Support parallel requests
2. Reliability

TECHNIQUES:
0. Striping
1. Mirroring
RAID-1

Disks mirrored: data written in 2 places
Simple, expensive
Example: Google File System replicated data on 3 disks, spread across multiple racks

• Reads: go to either disk ➜ 2x faster than SLED
• Write: replicate to every mirrored disk ➜ same speed as SLED

Full Disk Failure: use the surviving disk
Bit Flip Error: Detect? Correct?
RAID, Step 2: Parity

To recover from failures, add parity
• n-input XOR gives bit-level parity (1 = odd, 0 = even)
• 1101 ⊕ 1100 ⊕ 0110 = 0111 (parity block)
• Can reconstruct any missing block from the others

GOALS:
1. Performance
  • Parallelize individual requests
  • Support parallel requests
2. Reliability

TECHNIQUES:
0. Striping
1. Mirroring
2. Parity
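The parity arithmetic on this slide can be checked directly; the small helper below is illustrative and uses the slide's three example blocks.

```python
def parity(blocks):
    """Bitwise XOR of all blocks gives the parity block."""
    p = 0
    for b in blocks:
        p ^= b
    return p

data = [0b1101, 0b1100, 0b0110]
p = parity(data)                           # 0b0111, as on the slide

# Reconstruct a "lost" block from the survivors plus parity:
recovered = parity([data[0], data[2], p])
assert recovered == data[1] == 0b1100
print(f"parity = {p:04b}, recovered block = {recovered:04b}")
```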
Lesser Loved RAIDs

RAID-2: bit-level striping with ECC codes
• 7 disk arms synchronized and move in unison
• Complicated controller (and hence very unpopular)
• Tolerates 1 error with no performance degradation

RAID-3: byte-level striping + parity disk
• read accesses all data disks
• write accesses all data disks + parity disk
• On disk failure: read parity disk, compute missing data

RAID-4: block-level striping + parity disk
+ better spatial locality for disk access
- parity disk is the write bottleneck and wears out faster

RAID-3 layout:  Disk 0: byte 0    Disk 1: byte 1    Disk 2: byte 2    Disk 3: byte 3    Disk 4: Parity
RAID-4 layout:  Disk 0: stripe 0  Disk 1: stripe 1  Disk 2: stripe 2  Disk 3: stripe 3  Disk 4: Parity
A word about Granularity

Bit-level ➜ byte-level ➜ block-level
• fine-grained: stripe each file across all disks
  + high throughput for the file
  - wasted disk seek time
  - limited to transferring 1 file at a time
• coarse-grained: stripe each file over a few disks
  - limits throughput for 1 file
  + better use of spatial locality (for disk seek)
  + allows more parallel file access
RAID-5: Rotating Parity with Striping

• Write 1 block:
  – Read old data block
  – Read old parity block
  – Write new data block
  – Write new parity block (old data ⊕ old parity ⊕ new data)
• Write entire stripe:
  – Write data blocks and parity for each strip in the stripe
  Good write performance
• Read: go to the correct disk; can outperform SLEDs and RAID-0
RAID-5: Write Example

• Write(D2, 0111)
  – Read old data block (1010)
  – Read old parity block (1001)
  – Write new data block (0111)
  – Write new parity block (old data ⊕ old parity ⊕ new data): 1010 ⊕ 1001 ⊕ 0111 = 0100

         Disk 0   Disk 1   Disk 2   Disk 3   Disk 4
         D0       D1       D2       D3       P0-3
Before   0000     1111     1010     1100     1001
After    0000     1111     0111     1100     0100
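The same example replayed in Python, checking that the incremental parity update matches recomputing parity over the whole stripe (values taken from the slide).

```python
# RAID-5 small write from the slide.
old_data, old_parity, new_data = 0b1010, 0b1001, 0b0111

new_parity = old_data ^ old_parity ^ new_data       # incremental parity update
assert new_parity == 0b0100

# Sanity check: recomputing parity over all data blocks after the write
# gives the same answer as the incremental update.
stripe_after = [0b0000, 0b1111, new_data, 0b1100]   # D0, D1, D2 (new), D3
full = 0
for b in stripe_after:
    full ^= b
assert full == new_parity
print(f"{new_parity:04b}")   # 0100
```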