Disks and RAID - Cornell University (transcript, 2020-11-03)
Disks and RAID
CS 4410 Operating Systems
[R. Agarwal, L. Alvisi, A. Bracy, F. Schneider, E. Sirer, R. Van Renesse]
• Magnetic disks
  • large capacity at low cost
  • block-level random access
  • slow performance for random access
  • good performance for streaming access
• Flash memory
  • capacity at intermediate cost
  • block-level random access
  • slow performance for random writes
  • good performance otherwise
Storage Devices
THAT WAS THEN
• 13th September 1956
• The IBM RAMAC 350
• Total storage = 5 million characters (just under 5 MB)
Magnetic Disks are 60 years old!
Source: http://royal.pingdom.com/2008/04/08/the-history-of-computer-data-storage-in-pictures/
THIS IS NOW
• 2.5-3.5" hard drive
• Example: 500 GB Western Digital Scorpio Blue hard drive
• easily up to a few TB
![Page 4: Disks and RAID - Cornell University · 2020. 11. 3. · Disks and RAID CS 4410 Operating Systems [R. Agarwal, L. Alvisi, A. Bracy, F. Schneider, E. Sirer, R. Van Renesse] •Magnetic](https://reader036.vdocuments.net/reader036/viewer/2022071301/609e614679dd306253501cd6/html5/thumbnails/4.jpg)
RAM (Memory) vs. HDD (Disk) vs. SSD, 2020
|                         | RAM          | HDD          | SSD          |
|-------------------------|--------------|--------------|--------------|
| Typical size            | 16 GB        | 1 TB         | 1 TB         |
| Cost                    | $5-10 per GB | $0.05 per GB | $0.10 per GB |
| Latency                 | 15 ns        | 15 ms        | 1 ms         |
| Throughput (sequential) | 8000 MB/s    | 175 MB/s     | 500 MB/s     |
| Volatility              | volatile     | non-volatile | non-volatile |
![Page 5: Disks and RAID - Cornell University · 2020. 11. 3. · Disks and RAID CS 4410 Operating Systems [R. Agarwal, L. Alvisi, A. Bracy, F. Schneider, E. Sirer, R. Van Renesse] •Magnetic](https://reader036.vdocuments.net/reader036/viewer/2022071301/609e614679dd306253501cd6/html5/thumbnails/5.jpg)
[Figure: disk anatomy - platters, surfaces, spindle, motor, tracks, sectors, heads, arms, arm assembly]

Must specify:
• cylinder # (distance from spindle)
• head #
• sector #
• transfer size
• memory address
Reading from disk
Tracks are ~1 micron wide (1000 nm)
• wavelength of light is ~0.5 micron
• resolution of the human eye: 50 microns
• 100K tracks on a typical 2.5" disk

Track length varies across the disk
• Outside tracks:
  - more sectors per track
  - higher bandwidth
• Most of the disk area is in the outer regions
Disk Tracks
[Figure: track* and sector]
*not to scale: the head is actually much bigger than a track
Disk Latency = Seek Time + Rotation Time + Transfer Time
• Seek: to get to the track (5-15 milliseconds (ms))
• Rotational latency: to get to the sector (4-8 ms; on average, only need to wait half a rotation)
• Transfer: get bits off the disk (25-50 microseconds (μs))
Disk overheads
[Figure: seek time and rotational latency shown on a platter with tracks and sectors]
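The latency formula above invites a quick back-of-envelope calculation. The sketch below (not from the slides) uses illustrative mid-range numbers - a 10 ms seek, a 7200 RPM spindle, and a 0.05 ms transfer - rather than figures for any particular drive:

```python
def disk_latency_ms(seek_ms, rpm, transfer_ms):
    """Seek + average rotational latency (half a rotation) + transfer."""
    half_rotation_ms = (60_000 / rpm) / 2   # one rotation takes 60000/rpm ms
    return seek_ms + half_rotation_ms + transfer_ms

# 7200 RPM drive: half a rotation is ~4.17 ms
t = disk_latency_ms(seek_ms=10.0, rpm=7200, transfer_ms=0.05)
print(f"{t:.2f} ms per random access")   # ~14.22 ms, dominated by seek + rotation
```

Note how little the transfer term matters: random access cost is almost entirely mechanical positioning, which is why scheduling and streaming access matter so much.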
Objective: minimize seek time
Context: a queue of cylinder numbers (#0-199)
Metric: how many cylinders traversed?
Disk Scheduling
Head pointer @ 53
Queue: 98, 183, 37, 122, 14, 124, 65, 67
• Schedule disk operations in the order they arrive
• Downsides?

FIFO schedule? Total head movement?
Disk Scheduling: FIFO
Head pointer @ 53
Queue: 98, 183, 37, 122, 14, 124, 65, 67
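The FIFO answer for this queue can be checked with a few lines of Python (a sketch, not part of the original slides):

```python
def total_movement(start, order):
    """Total cylinders traversed servicing requests in the given order."""
    pos, moves = start, 0
    for cyl in order:
        moves += abs(pos - cyl)   # seek distance to the next request
        pos = cyl
    return moves

queue = [98, 183, 37, 122, 14, 124, 65, 67]
print(total_movement(53, queue))   # 640 cylinders
```

The big back-and-forth swings (183 to 37, 122 to 14) are the downside FIFO ignores.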
• Select request with minimum seek time from current head position
• A form of Shortest Job First (SJF) scheduling
• Not optimal: suppose a cluster of requests at the far end of the disk ➜ starvation!

SSTF schedule? Total head movement?
Disk Scheduling: Shortest Seek Time First
Head pointer @ 53
Queue: 98, 183, 37, 122, 14, 124, 65, 67
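The greedy SSTF policy is easy to sketch (again, illustrative code rather than slide material):

```python
def sstf_order(start, requests):
    """Always service the pending request closest to the current head position."""
    pending, pos, order = list(requests), start, []
    while pending:
        nxt = min(pending, key=lambda c: abs(c - pos))
        pending.remove(nxt)
        order.append(nxt)
        pos = nxt
    return order

def total_movement(start, order):
    pos, moves = start, 0
    for cyl in order:
        moves += abs(pos - cyl)
        pos = cyl
    return moves

queue = [98, 183, 37, 122, 14, 124, 65, 67]
order = sstf_order(53, queue)
print(order)                       # [65, 67, 37, 14, 98, 122, 124, 183]
print(total_movement(53, order))   # 236 cylinders (vs. 640 for FIFO)
```

If new requests kept arriving near the head, the cluster at the far end would starve - the greedy choice never pays the long seek.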
Elevator algorithm:
• arm starts at one end of the disk
• moves to the other end, servicing requests
• movement reversed at the end of the disk
• repeat

SCAN schedule? Total head movement?
Disk Scheduling: SCAN
Head pointer @ 53
Queue: 98, 183, 37, 122, 14, 124, 65, 67
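A sketch of the SCAN head movement for this queue, assuming the head first sweeps toward cylinder 0 (the usual textbook treatment) and that requests exist on both sides of the head:

```python
def scan_movement(start, requests, lo=0, hi=199, direction="down"):
    """Elevator: sweep to the edge in `direction`, then reverse to the
    farthest request on the other side. Assumes requests on both sides."""
    if direction == "down":
        return (start - lo) + (max(requests) - lo)
    return (hi - start) + (hi - min(requests))

queue = [98, 183, 37, 122, 14, 124, 65, 67]
print(scan_movement(53, queue))   # 53 + 183 = 236 cylinders
```

Same total as SSTF here, but SCAN bounds the wait for any request to at most two sweeps, so nothing starves.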
Circular list treatment:
• head moves from one end to the other
• servicing requests as it goes
• when it reaches the end, it returns to the beginning
• no requests serviced on the return trip
+ More uniform wait time than SCAN
Disk Scheduling: C-SCAN
C-SCAN schedule? Total head movement?
Head pointer @ 53
Queue: 98, 183, 37, 122, 14, 124, 65, 67
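A sketch of C-SCAN for the same queue, assuming the head sweeps upward first and that the return jump to cylinder 0 counts as arm movement:

```python
def cscan_movement(start, requests, lo=0, hi=199):
    """Sweep up to the edge, jump back to the low end (counted as hi - lo),
    then continue up to the highest remaining request."""
    below = [c for c in requests if c < start]
    up_to_edge = hi - start                    # service everything above
    wrap = hi - lo                             # unproductive return trip
    finish = (max(below) - lo) if below else 0 # service everything below
    return up_to_edge + wrap + finish

queue = [98, 183, 37, 122, 14, 124, 65, 67]
print(cscan_movement(53, queue))   # 146 + 199 + 37 = 382 cylinders
```

More total movement than SCAN (382 vs. 236), but every request waits at most one full sweep, which is where the more uniform wait time comes from.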
Most SSDs are based on NAND flash
• retains its state for months to years without power
Solid State Drives (Flash)
Source: https://flashdba.com/2015/01/09/understanding-flash-floating-gates-and-wear/
[Figure: a standard Metal Oxide Semiconductor Field Effect Transistor (MOSFET) vs. a Floating Gate MOSFET (FGMOS)]
Charge is stored in the floating gate (cells can be single- or multi-level)
NAND Flash
[Figure: Floating Gate MOSFET (FGMOS)]
• Erase block: sets each cell to "1"
  • erase granularity = "erasure block" = 128-512 KB
  • time: several ms
• Write page: can only write erased pages
  • write granularity = 1 page = 2-4 KB
• Read page:
  • read granularity = 1 page = 2-4 KB
Flash Operations
• can't write 1 byte/word (must write whole pages)
• limited # of erase cycles per block (memory wear)
  • 10^3-10^6 erases and the cell wears out
• reads can "disturb" nearby words and overwrite them with garbage

• Lots of techniques to compensate:
  • error correcting codes
  • bad page / erasure block management
  • wear leveling: trying to distribute erasures across the entire drive
Flash Limitations
Flash Translation Layer
Flash device firmware maps logical page # to a physical location
• Garbage collect an erasure block by copying live pages to a new location, then erasing it
  - more efficient if blocks stored at the same time are deleted at the same time (e.g., keep the blocks of a file together)
• Wear-levelling: only write each physical page a limited number of times
• Remap pages that no longer work (sector sparing)
Transparent to the device user
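The core of the mapping idea can be sketched in a toy page-mapped FTL. This is purely illustrative (`ToyFTL` is a made-up name, and real FTLs also do garbage collection, wear counting, and bad-block remapping):

```python
class ToyFTL:
    """Toy page-mapped FTL: logical page # -> physical page #."""

    def __init__(self, n_pages):
        self.map = {}                         # logical page -> physical page
        self.free = list(range(n_pages))      # erased physical pages

    def write(self, lpage, data, flash):
        # Flash can't overwrite in place: take a fresh erased page, write
        # there, and repoint the mapping. The old page becomes garbage.
        ppage = self.free.pop(0)
        flash[ppage] = data
        self.map[lpage] = ppage

    def read(self, lpage, flash):
        return flash[self.map[lpage]]

flash = {}
ftl = ToyFTL(n_pages=8)
ftl.write(0, b"v1", flash)
ftl.write(0, b"v2", flash)    # rewrite lands on a new physical page
print(ftl.read(0, flash))     # b'v2'
print(ftl.map[0])             # 1: logical page 0 now lives at physical page 1
```

The OS sees a device that overwrites in place; the firmware quietly redirects every rewrite, which is exactly the transparency the slide describes.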
(1) Isolated disk sectors (1+ sectors down, rest OK)
• Permanent: physical malfunction (magnetic coating, scratches, contaminants)
• Transient: data corrupted, but new data can be successfully written to / read from the sector

(2) Entire device failure
• damage to disk head, electronic failure, wear out
• detected by the device driver; accesses return error codes
• quantified by annual failure rate or Mean Time To Failure (MTTF)
Disk Failure Cases
• Fast: data is there when you want it
• Reliable: data fetched is what you stored
• Affordable: won't break the bank

Enter: Redundant Array of Inexpensive Disks (RAID)
• In industry, "I" is for "Independent"
• The alternative is a SLED: Single Large Expensive Disk
• RAID + RAID controller looks just like a SLED to the computer (yay, abstraction!)
What do we want from storage?
Files striped across disks
+ Fast
  latency? throughput?
+ Cheap
  capacity?
– Unreliable
  max # failures? MTTF?
RAID-0
Disk 0: stripe 0, 2, 4, 6, 8, 10, 12, 14, ...
Disk 1: stripe 1, 3, 5, 7, 9, 11, 13, 15, ...
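Round-robin striping makes the logical-to-physical mapping trivial. A minimal sketch (hypothetical helper name, not slide code):

```python
def raid0_locate(stripe, n_disks):
    """Map a logical stripe number to (disk, offset-on-disk)
    for round-robin RAID-0 striping."""
    return stripe % n_disks, stripe // n_disks

# With 2 disks, as in the layout above: even stripes on disk 0, odd on disk 1
print(raid0_locate(5, 2))   # (1, 2): stripe 5 is the 3rd stripe on disk 1
```

Because consecutive stripes land on different disks, a large sequential read keeps all spindles busy at once - that is where RAID-0's throughput comes from.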
Striping reduces reliability
• more disks ➜ higher probability of some disk failing
• N disks: 1/Nth the mean time between failures of 1 disk
What can we do to improve Disk Reliability?
Striping and Reliability
😞
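The 1/N rule is worth making concrete. An illustrative sketch (the 1,000,000-hour rating below is a made-up example figure, not from the slides):

```python
def array_mttf(disk_mttf_hours, n_disks):
    """Expected time to the *first* failure among N independent disks:
    one disk's MTTF divided by N."""
    return disk_mttf_hours / n_disks

print(array_mttf(1_000_000, 1))    # 1000000.0 hours (~114 years)
print(array_mttf(1_000_000, 100))  # 10000.0 hours (~14 months)
```

A single disk that "never" fails becomes a roughly-annual failure event at data-center scale, which is the motivation for the redundant RAID levels that follow.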
Disks Mirrored: data written in 2 places
+ Reliable
  deals well with disk loss, but not with corruption
+ Fast
  latency? throughput?
– Expensive
RAID-1
Disk 0: data 0, 1, 2, 3, 4, 5, 6, 7, ...
Disk 1: data 0, 1, 2, 3, 4, 5, 6, 7, ...
bit-level striping with ECC codes
• 7 disk arms synchronized, move in unison
• complicated controller (➜ very unpopular)
• detect & correct 1 error with no performance degradation

+ Reliable
– Expensive

parity 1 = 3⊕5⊕7 (all disks whose # has a 1 in the LSB, xx1)
parity 2 = 3⊕6⊕7 (all disks whose # has a 1 in the 2nd bit, x1x)
parity 4 = 5⊕6⊕7 (all disks whose # has a 1 in the MSB, 1xx)
RAID-2
Disk 1 (001): parity 1, parity 4, parity 7, parity 10
Disk 2 (010): parity 2, parity 5, parity 8, parity 11
Disk 3 (011): bit 1, bit 5, bit 9, bit 13
Disk 4 (100): parity 3, parity 6, parity 9, parity 12
Disk 5 (101): bit 2, bit 6, bit 10, bit 14
Disk 6 (110): bit 3, bit 7, bit 11, bit 15
Disk 7 (111): bit 4, bit 8, bit 12, bit 16
RAID-3: byte-level striping + parity disk
• read accesses all data disks
• write accesses all data disks + parity disk
• on disk failure: read parity disk, compute missing data

RAID-4: block-level striping + parity disk
+ better spatial locality for disk access

+ Cheap
– Slow writes
– Unreliable
  parity disk is a write bottleneck and wears out faster
2 more rarely-used RAIDs
Disk 1: data 1, data 5, data 9, data 13
Disk 2: data 2, data 6, data 10, data 14
Disk 3: data 3, data 7, data 11, data 15
Disk 4: data 4, data 8, data 12, data 16
Disk 5: parity 1, parity 2, parity 3, parity 4
• D_n = D_1 ⊕ D_2 ⊕ … ⊕ D_{n-1}
• ⊕ = XOR operation
• If one of D_1 … D_{n-1} fails, we can reconstruct its data by XOR-ing all the remaining drives (including the parity drive D_n)
Using a parity disk
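The reconstruction rule can be demonstrated byte-by-byte. A small sketch with made-up block contents (`parity` is an illustrative helper, not slide code):

```python
from functools import reduce

def parity(blocks):
    """XOR corresponding bytes across blocks: D_n = D_1 ^ D_2 ^ ..."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"\x0f\x12", b"\xf0\x34", b"\xaa\x56"]   # three data blocks
p = parity(data)                                  # the parity block

# "Lose" block 1: XOR the survivors with the parity block to rebuild it.
rebuilt = parity([data[0], data[2], p])
print(rebuilt == data[1])   # True
```

This works because XOR is its own inverse: XOR-ing the parity with every surviving block cancels them out, leaving exactly the missing block.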
• Suppose the block lives on disk D_i
• Method 1:
  • read the corresponding blocks on all the other data disks
  • XOR them all with the new content of the block to get the new parity
  • write disks D_i and D_n in parallel
• Method 2:
  • read D_i (old content) and D_n (old parity)
  • XOR both with the new content of the block to get the new parity
  • write disks D_i and D_n in parallel
• Either way:
  • throughput: ½ that of a single disk
  • latency: double that of a single disk
Updating a block in RAID-4
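Method 2's "small write" trick - new parity = old parity ⊕ old data ⊕ new data - can be checked against recomputing the parity from scratch. A sketch with made-up one-byte blocks:

```python
def update_parity(old_parity, old_data, new_data):
    """Incremental RAID-4/5 parity update: cancel the old data out of the
    parity with one XOR, fold the new data in with another."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

def full_parity(blocks):
    return bytes(a ^ b ^ c for a, b, c in zip(*blocks))

d = [b"\x01", b"\x02", b"\x04"]
old_p = full_parity(d)                     # parity over the whole stripe
new_d1 = b"\x07"                           # new content for block 1

incremental = update_parity(old_p, d[1], new_d1)   # Method 2: 2 reads
scratch = full_parity([d[0], new_d1, d[2]])        # Method 1: read all disks
print(incremental == scratch)   # True
```

The incremental path touches only two disks regardless of stripe width, which is why it wins as arrays get wider - but both methods still write the parity disk, preserving the bottleneck.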
• Save up updates to a full stripe across D_1 … D_{n-1}
• Compute D_n = D_1 ⊕ D_2 ⊕ … ⊕ D_{n-1}
• Write D_1 … D_n in parallel
• Throughput/latency same as a single disk

• Note that in all write cases D_n must always be updated
  ➜ D_n is a write performance bottleneck
Streaming update in RAID-4
+ Reliable
  you can lose one disk
+ Fast
  (N-1)x sequential throughput of a single disk
  N/4x random write throughput
+ Affordable
RAID 5: Rotating Parity w/Striping
Disk 0: parity 0-3, data 4, data 8, data 12, data 16
Disk 1: data 0, parity 4-7, data 9, data 13, data 17
Disk 2: data 1, data 5, parity 8-11, data 14, data 18
Disk 3: data 2, data 6, data 10, parity 12-15, data 19
Disk 4: data 3, data 7, data 11, data 15, parity 16-19