TRANSCRIPT
COMP530: Operating Systems

Disks and I/O Scheduling
Don Porter
Portions courtesy Emmett Witchel
Quick Recap
• CPU Scheduling
  – Balance competing concerns with heuristics
    • What were some goals?
  – No perfect solution
• Today: Block device scheduling
  – How different from the CPU?
  – Focus primarily on a traditional hard drive
  – Extend to new storage media
Disks: Just like memory, but different
• Why have disks?
  – Memory is small. Disks are large.
    • Short-term storage for memory contents (e.g., swap space).
    • Reduce what must be kept in memory (e.g., code pages).
  – Memory is volatile. Disks are forever (?!)
    • File storage.

          GB/dollar             dollar/GB
  RAM     0.013 (0.015, 0.01)   $77 ($68, $95)
  Disks   3.3 (1.4, 1.1)        30¢ (71¢, 90¢)

  Capacity: 2 GB vs. 1 TB; 2 GB vs. 400 GB; 1 GB vs. 320 GB
OS's view of a disk
• Simple array of blocks
  – Blocks are usually 512 bytes or 4 KB
A simple disk model
• Disks are slow. Why?
  – Moving parts << circuits
• Programming interface: simple array of sectors (blocks)
• Physical layout:
  – Concentric circular "tracks" of blocks on a platter
    • E.g., sectors 0-9 on innermost track, 10-19 on next track, etc.
  – Disk arm moves between tracks
  – Platter rotates under disk head to align with requested sector
Disk Model
[Figure: one track with sectors 0-7; each block is on a sector]
• Disk head reads at the granularity of an entire sector
• Disk spins at a constant speed; sectors rotate underneath the head
Disk Model
[Figure: two concentric tracks; sectors 0-7 on the inner track, 8-21 on the outer track]
• Concentric tracks
• Disk head seeks to different tracks
• The gap between sectors 7 and 8 accounts for seek time
Many Tracks
[Figure: a platter with many concentric tracks under the disk head]
Several (~4) Platters
• Platters spin together at the same speed
• Each platter has a head; all heads seek together
Implications of multiple platters
• Blocks actually striped across platters
• Also, both sides of a platter can store data
  – Called a surface
  – Need a head on top and bottom
• Example:
  – Sector 0 on platter 0 (top)
  – Sector 1 on platter 0 (bottom, same position)
  – Sector 2 on platter 1 at same position, top
  – Sector 3 on platter 1 at same position, bottom
  – Etc.
  – 8 heads can read all 8 sectors simultaneously
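The layout in the example above can be written out as a tiny mapping function. This is a sketch of the scheme on the slide, assuming 4 platters with 2 surfaces each; the function name is illustrative.

```python
NUM_PLATTERS = 4
SURFACES = 2  # top and bottom of each platter

def locate(sector):
    """Map a logical sector number to (position, platter, surface).

    Sectors are striped two per platter (top, then bottom) at each
    arm position, as in the slide's example.
    """
    per_position = NUM_PLATTERS * SURFACES      # 8 sectors readable without moving
    position = sector // per_position           # which "ring" of 8 sectors
    within = sector % per_position
    platter = within // SURFACES
    surface = "top" if within % SURFACES == 0 else "bottom"
    return position, platter, surface

print(locate(0))  # (0, 0, 'top')
print(locate(3))  # (0, 1, 'bottom')
print(locate(8))  # (1, 0, 'top')
```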
Real Example
• Seagate 73.4 GB Fibre Channel Ultra 160 SCSI disk
• Specs:
  – 12 platters, 24 heads, 12 arms
  – 14,100 tracks; variable # of sectors/track; 512 bytes/sector
  – 10,000 RPM
  – Average (rotational) latency: 2.99 ms
  – Seek times:
    • Track-to-track: 0.6/0.9 ms
    • Average: 5.6/6.2 ms
    • Includes acceleration and settle time
  – 160-200 MB/s peak transfer rate
  – 1-8K cache
3 Key Latencies
• I/O delay: time it takes to read/write a sector
• Rotational delay: time the disk head waits for the platter to rotate the desired sector under it
  – Note: disk rotates continuously at constant speed
• Seek delay: time the disk arm takes to move to a different track
Observations
• Latency of a given operation is a function of current disk arm and platter position
• Each request changes these values
• Idea: build a model of the disk
  – Maybe use delay values from measurement or manuals
  – Use simple math to evaluate latency of each pending request
  – Greedy algorithm: always select lowest latency
Example formula
• s = seek latency, in time/track
• r = rotational latency, in time/sector
• i = I/O latency, in seconds
• Time = (Δtracks × s) + (Δsectors × r) + i
• Note: Δsectors must factor in the head's position after the seek is finished. Why?
  – The platter keeps rotating while the arm seeks, so the sector under the head changes during the seek

Example read time: seek time + rotational latency + transfer time
(5.6 ms + 2.99 ms + 0.014 ms ≈ 8.6 ms)
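The formula above is easy to turn into a small calculator. A minimal sketch, with illustrative constants loosely derived from the Seagate example (they are assumptions, not the drive's real per-track and per-sector figures):

```python
# Assumed per-unit delays, roughly backed out of the Seagate specs:
SEEK_PER_TRACK_MS = 5.6 / 7050      # avg seek spread over ~half of 14,100 tracks
ROT_PER_SECTOR_MS = 2.99 * 2 / 400  # full rotation = 2 * avg rotational latency,
                                    # assuming ~400 sectors/track
IO_MS = 0.014                       # time to transfer one sector

def access_time_ms(dtracks, dsectors):
    """Time = (Δtracks * s) + (Δsectors * r) + i, per the slide's formula."""
    return dtracks * SEEK_PER_TRACK_MS + dsectors * ROT_PER_SECTOR_MS + IO_MS

# Average-case read from the slide: average seek + average rotation + transfer
avg = 5.6 + 2.99 + 0.014
print(round(avg, 3))  # 8.604
```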
The Disk Scheduling Problem: Background
• Goals: Maximize disk throughput
  – Bound latency
• Between file system and disk, you have a queue of pending requests:
  – Read or write a given logical block address (LBA) range
• You can reorder these as you like to improve throughput
• What reordering heuristic to use? If any?
• The heuristic is called the I/O Scheduler
  – Or "Disk Scheduler" or "Disk Head Scheduler"
• Evaluation metric: how many tracks the head moves across
I/O Scheduling Algorithm 1: FCFS
• Assume a queue of requests exists to read/write tracks:
  – 150, 16, 147, 14, 72, 83
  – and the head is on track 65 (tracks numbered 0-150)
• FCFS: moves the head 552 tracks
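A quick way to check the head-movement totals on these slides is to simulate the queue. For FCFS, that is just summing the per-request seek distances in arrival order:

```python
def fcfs_distance(head, queue):
    """Total tracks the head crosses servicing requests in FIFO order."""
    total = 0
    for track in queue:
        total += abs(track - head)  # seek from current position to the request
        head = track
    return total

print(fcfs_distance(65, [150, 16, 147, 14, 72, 83]))  # 552
```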
I/O Scheduling Algorithm 2: SSTF
• Greedy scheduling: shortest seek time first
• Rearrange queue from: 150, 16, 147, 14, 72, 83
  To: 72, 83, 147, 150, 16, 14
• SSTF scheduling results in the head moving 221 tracks (vs. 552 for FCFS)
• Can we do better?
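The greedy policy can be sketched directly: at each step, service whichever pending request is closest to the current head position.

```python
def sstf(head, queue):
    """Return (service order, total tracks moved) for shortest-seek-time-first."""
    pending = list(queue)
    order, total = [], 0
    while pending:
        nearest = min(pending, key=lambda t: abs(t - head))  # greedy choice
        pending.remove(nearest)
        total += abs(nearest - head)
        head = nearest
        order.append(nearest)
    return order, total

order, total = sstf(65, [150, 16, 147, 14, 72, 83])
print(order, total)  # [72, 83, 147, 150, 16, 14] 221
```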
Other problems with greedy?
• "Far" requests will starve
  – Assuming you reorder every time a new request arrives
• Disk head may just hover around the "middle" tracks
I/O Scheduling Algorithm 3: SCAN
• Move the head in one direction until all requests have been serviced, and then reverse
  – Also called Elevator Scheduling
• Rearrange queue from: 150, 16, 147, 14, 72, 83
  To: 16, 14, 72, 83, 147, 150
• SCAN: 187 tracks (vs. 221 for SSTF)
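The elevator behavior can be simulated the same way. This sketch sweeps toward track 0 first and then reverses, matching the direction used in the example above:

```python
def scan_down_then_up(head, queue):
    """Service requests at or below the head (descending), then the rest (ascending)."""
    down = sorted((t for t in queue if t <= head), reverse=True)
    up = sorted(t for t in queue if t > head)
    order = down + up
    total, pos = 0, head
    for t in order:
        total += abs(t - pos)
        pos = t
    return order, total

order, total = scan_down_then_up(65, [150, 16, 147, 14, 72, 83])
print(order, total)  # [16, 14, 72, 83, 147, 150] 187
```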
I/O Scheduling Algorithm 4: C-SCAN
• Circular SCAN: Move the head in one direction until an edge of the disk is reached, and then reset to the opposite edge
• C-SCAN: 265 tracks (vs. 221 for SSTF, 187 for SCAN)
• Marginally better fairness than SCAN
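A C-SCAN sketch under one explicit set of assumptions: the head sweeps upward servicing requests, travels to the top edge, resets to the bottom edge (with the reset travel counted), then continues upward. C-SCAN totals depend on exactly how the edge travel and reset are accounted for, so this model's number need not match the slide's figure.

```python
def cscan(head, queue, lo=0, hi=150):
    """Circular SCAN: sweep up, reset from hi edge to lo edge, keep sweeping up."""
    up = sorted(t for t in queue if t >= head)        # serviced on the first pass
    wrapped = sorted(t for t in queue if t < head)    # serviced after the reset
    order = up + wrapped
    # Distance: up to the hi edge, full reset to the lo edge,
    # then up to the last wrapped request.
    total = (hi - head) + (hi - lo)
    if wrapped:
        total += wrapped[-1] - lo
    return order, total

order, total = cscan(65, [150, 16, 147, 14, 72, 83])
print(order, total)  # [72, 83, 147, 150, 14, 16] 251
```

Note that low-numbered tracks are always reached by a fresh sweep from the edge, which is where the fairness improvement over SCAN comes from.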
Scheduling Checkpoint
• SCAN seems most efficient for these examples
  – C-SCAN offers better fairness at marginal cost
  – Your mileage may vary (i.e., workload dependent)
• File systems would be wise to place related data "near" each other
  – Files in the same directory
  – Blocks of the same file
• You will explore the practical implications of this model in Lab 4!
Disk Partitioning
• Multiple file systems can share a disk: partition the space
• Disks are typically partitioned to minimize the maximum seek time
  – A partition is a collection of cylinders
  – Each partition is a logically separate disk
[Figure: a disk divided into Partition A and Partition B]
Disks: Technology Trends
• Disks are getting smaller in size
  – Smaller → spin faster; smaller distance for head to travel; lighter weight
• Disks are getting denser
  – More bits/square inch → small disks with large capacities
• Disks are getting cheaper
  – Well, in $/byte – a single disk has cost at least $50-100 for 20 years
  – 2x/year since 1991
• Disks are getting faster
  – Seek time, rotational latency: 5-10%/year (2-3x per decade)
  – Bandwidth: 20-30%/year (~10x per decade)
  – This trend is really flattening out on commodity devices; more apparent on high-end
• Overall: capacity improving much faster than performance
Parallel performance with disks
• Idea: Use more of them working together
  – Just like with multiple cores
• Redundant Array of Inexpensive Disks (RAID)
  – Intuition: Spread logical blocks across multiple devices
  – Ex: Read 4 LBAs from 4 different disks in parallel
• Does this help throughput or latency?
  – Definitely throughput; one can construct scenarios where a request waits on fewer other requests (latency)
• It can also protect data from a disk failure
  – Transparently write one logical block to 1+ devices
Disk Striping: RAID-0
• Blocks broken into sub-blocks that are stored on separate disks
  – Similar to memory interleaving
• Provides for higher disk bandwidth through a larger effective block size
[Figure: an OS disk block's sub-blocks (8-11, 12-15, 0-3) interleaved across separate physical disks]
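A generic round-robin striping sketch (not necessarily the exact layout in the figure): logical block addresses are interleaved across N disks, so N consecutive blocks can be read in parallel.

```python
NUM_DISKS = 4

def stripe(lba, num_disks=NUM_DISKS):
    """Map a logical block to (disk index, offset on that disk)."""
    return lba % num_disks, lba // num_disks

# Four consecutive LBAs land on four different disks:
print(stripe(0))  # (0, 0)
print(stripe(1))  # (1, 0)
print(stripe(2))  # (2, 0)
print(stripe(5))  # (1, 1)
```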
RAID 1: Mirroring
• To increase the reliability of the disk, redundancy must be introduced
  – Simple scheme: disk mirroring (RAID-1)
  – Write to both disks, read from either
• Can lose one disk without losing data
[Figure: identical bit patterns on a primary disk and a mirror disk; a failed disk's contents survive on its mirror]
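A toy sketch of the write-both/read-either rule, using two in-memory lists as stand-in "disks" (the class and its names are illustrative, not a real driver interface):

```python
class Mirror:
    """RAID-1 toy model: every write goes to both copies; reads use any live copy."""

    def __init__(self, nblocks):
        self.disks = [[None] * nblocks, [None] * nblocks]
        self.alive = [True, True]

    def write(self, block, data):
        for d, disk in enumerate(self.disks):
            if self.alive[d]:
                disk[block] = data

    def read(self, block):
        for d, disk in enumerate(self.disks):
            if self.alive[d]:
                return disk[block]
        raise IOError("both mirrors failed")

m = Mirror(8)
m.write(3, b"data")
m.alive[0] = False   # lose the primary disk...
print(m.read(3))     # ...the mirror still serves the block: b'data'
```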
RAID 5: Performance and Redundancy
• Idea: Sacrifice one disk to store the parity bits of the other disks (e.g., xor-ed together)
• Still get parallelism
• Can recover from failure of any one disk
• Cost: Extra writes to update parity
[Figure: block x striped across Disks 1-4, with its parity block on Disk 5; a failed disk's contents can be rebuilt from the survivors]
RAID 5: Interleaved Parity
[Figure: stripes for blocks x, x+1, x+2, and x+3 across Disks 1-5; each successive stripe places its parity block on a different disk, so parity is interleaved across all five disks]
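The parity scheme above can be demonstrated with xor over in-memory "blocks" (a sketch, not an on-disk format): the parity block is the xor of the data blocks, and any one lost block is rebuilt by xor-ing the survivors together.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise xor of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*blocks))

# One stripe: four data blocks plus their parity block.
data = [b"\x0b\x01", b"\x02\x02", b"\x04\x03", b"\x08\x04"]
parity = xor_blocks(data)

# Disk 2 fails; rebuild its block from the remaining data plus parity.
survivors = [data[0], data[2], data[3], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])  # True
```

The "extra writes" cost is visible here too: updating any one data block also requires recomputing and rewriting the stripe's parity block.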
Other RAID variations
• Variations on encoding schemes, different trades for failures and performance
  – See Wikipedia
  – But 0, 1, 5 are the most popular by far
• More general area of erasure coding:
  – Store k logical blocks (message) in n physical blocks (k < n)
  – In an optimal erasure code, recover from any k of the n blocks
  – Xor parity is a (k, k+1) erasure code
  – Gaining popularity at datacenter granularity
Where is RAID implemented?
• Hardware (i.e., a chip that looks to the OS like 1 disk)
  + Tend to be reliable (hardware implementers test)
  + Offload parity computation from the CPU
    • Hardware is a bit faster for rewrite-intensive workloads
  – Dependent on the card for recovery (replacements?)
  – Must buy the card (for the PCI bus)
  – Serial reconstruction of a lost disk
• Software (i.e., a "fake" disk driver)
  – Software has bugs
  – Ties up the CPU to compute parity
  + Other OS instances might be able to recover
  + No additional cost
  + Parallel reconstruction of a lost disk
• Most PCs have "fake" HW RAID: all the work is in the driver
Word to the wise
• RAID is a good idea for protecting data
  – Can safely lose 1+ disks (depending on configuration)
• But there is another weak link: the power supply
  – I have personally had a power supply go bad and fry 2 of the 4 disks in a RAID-5 array, effectively losing all of the data
• RAID is no substitute for backup to another machine
Summary
• Understand the disk performance model
  – Will explore more in Lab 4
• Understand I/O scheduling algorithms
• Understand RAID