file systems part 5 - brown universitycs.brown.edu/courses/cs167/lectures/18file5x.pdf · operating...
TRANSCRIPT
Operating Systems In Depth XVIII–1 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
File Systems Part 5
Operating Systems In Depth XVIII–2 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Beyond Disks: Flash
Pro• Flash block ≈ file-system
block• Random access• Low power• Vibration-resistant
Con• Limited lifetime• Writes can be expensive• Cost more than disks
– 1TB SSD: $164.99- x4.89 roughly eight years ago
– 3TB disk: $47.50- x2.67 roughly eight years ago
Operating Systems In Depth XVIII–3 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Flash Memory
• Two technologies– nor
- byte addressable– nand
- page addressable• Writing
– newly “erased” block is all ones– “programming” changes some ones to zeroes
- per byte in nor; per page in nand (multiple pages/block)
- to change zeroes to ones, must erase entire block- can erase no more than ~100k times/block
Operating Systems In Depth XVIII–4 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Coping
• Wear leveling– spread writes (erasures) across entire drive
• Flash translation layer (FTL)– specification from 1994– provides disk-like block interface– maps disk blocks to flash blocks
- mapping changed dynamically to effect wear-leveling
Operating Systems In Depth XVIII–5 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Flash with FTL
• Which file system?– FAT32 (sort of like S5FS, but from Microsoft)– NTFS– FFS– Ext3
• All were designed to exploit disks– much of what they do is irrelevant for flash
Operating Systems In Depth XVIII–6 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
A Problem Case (1)
Free
Free
Modified
Operating Systems In Depth XVIII–7 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Free
Free
A Problem Case (2)
Free
Free
Free
Free
Modified
1) Copy2) Erase3) Copy and modify
Operating Systems In Depth XVIII–8 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Free
Free
Trimming
Modified
1) Copy2) Erase3) Copy and modify
Operating Systems In Depth XVIII–9 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Flash without FTL
• Known as memory technology device (MTD)– software wear-leveling– perhaps other tricks
Operating Systems In Depth XVIII–10 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
JFFS and JFFS2
• Journaling flash file system– log-based: no journal!
- each log entry contains inode info and some data
- garbage collection copies info out of partially obsoleted blocks, allowing block to be erased
- complete index of inodes kept in RAM• entire file system must be read when
mounted
Operating Systems In Depth XVIII–11 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
UBI/UBIFS
• UBI (unsorted block images)– supports multiple logical volumes on one flash
device– performs wear-leveling across entire device– handles bad blocks
• UBIFS– file system layered on UBI– it really has a journal (originally called JFFS3)– file map kept in flash as B+ tree– no need to scan entire file system when
mounted
Operating Systems In Depth XVIII–12 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Flash as Part of the Hierarchy
• Flash as log device– aggregate write throughput sufficient, but
latency is bad– augment with DRAM and a “super-capacitor”
• Flash as cache– large level-2 cache
- integrated into ZFS- can use cheaper (slower) disks with no
loss of performance• reduced power consumption
Operating Systems In Depth XVIII–13 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Apple Fusion Drives (FD)
• SSD used along with hard drive• Implemented in the LVM
– total capacity is sum of disk and SSD capacities
- works with both HFS+ and ZFS– SSD used to buffer all incoming writes– data is moved from disk to SSD if used
sufficiently often- migration happens in background
Operating Systems In Depth XVIII–14 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
FD Observations
• Implementation is in the LVM, i.e., below the file system
– all decisions based on block access• 4GB available on SSD for writes
– all writes go to SSD while there’s space– otherwise go to HDD
• Frequent reads trigger promotion to SSD• Data transferred between SSD and HDD in
units of 128KB
Operating Systems In Depth XVIII–15 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
FD Write
if !onSSD(block) {if SSDfreeSpace > 0 {
remap(block)writeOnSSD(block)
} else {writeOnHDD(block)
}} else {
writeOnSSD(block)}
Operating Systems In Depth XVIII–16 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
FD Read
Update_Usage(block)if onSSD(block)
readFromSSD(block)else
readFromHDD(block)
Operating Systems In Depth XVIII–17 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
FD Background Activities
when (accessThresholdReached(block)) {if SSDfreeSpace > 0 {
remap(block)writeOnSSD(block)
}}
when(SSDFreeSpace < 4GB) {for each infrequent block {
remap(block)writeOnHDD(block)
}}
Operating Systems In Depth XVIII–18 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
FD Block Map
• Vital data structure• Kept up to date on SSD• Perhaps backed up on HDD
Operating Systems In Depth XVIII–19 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Case Studies
• NTFS• WAFL• ZFS
Operating Systems In Depth XVIII–20 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
NTFS
• “Volume aggregation” options– spanned volumes– RAID 0 (striping)– RAID 1 (mirroring)– RAID 5– snapshots
Operating Systems In Depth XVIII–21 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Backups
• Want to back up a file system– while still using it
- files are being modified while the backup takes place
- applications may be in progress — files in inconsistent states
• Solution– have critical applications quickly reach a safe
point and pause– snapshot the file system– resume applications– back up the snapshot
Operating Systems In Depth XVIII–22 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Windows Snapshots
normalapplications
backupprogram
C volume ShadowC volume
0 1′ 2 3 4′ 5 6′ 7′ 1 4 6 7
File System
C Drive Shadow C Drive
Snapshot Driver
Operating Systems In Depth XVIII–23 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
NTFS File Records
Name Standardattributes Object ID Data stream
Name Standardattributes Object ID Data streamProperties
stream
Extent 3
Extent 2
Extent 1
Operating Systems In Depth XVIII–24 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Additional NTFS Features
• Data compression– run-length encoding of zeroes– compressed blocks
• Encrypted files
Operating Systems In Depth XVIII–25 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
WAFL
• Runs on special-purpose OS– machine is dedicated to being a filer– handles both NFS and CIFS requests
• Utilizes shadow paging and log-structured writes
• Provides snapshots
Operating Systems In Depth XVIII–26 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
WAFL and RAID
Data blocks
Parity blocks
Data from oneconsistency point
Operating Systems In Depth XVIII–27 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Consistency Points … and Beyond
• Consistency points taken every ~10 seconds– too relaxed for many applications
- NFS- databases
• Solution …
(battery-backed-up RAM)(a.k.a. non-volatile RAM (NVRAM))
Operating Systems In Depth XVIII–28 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Snapshots
• Periodic snapshots kept of file system– made easy with shadow paging
Operating Systems In Depth XVIII–29 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Taking a Snapshot (Before)
Root
Inode file indirect blocks
Inode file data blocks
Regular file indirect blocks
Regular file data blocks
Operating Systems In Depth XVIII–30 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Taking a Snapshot (After)
Root
Inode file indirect blocks
Inode file data blocks
Regular file indirect blocks
Regular file data blocks
Snapshot Root
Operating Systems In Depth XVIII–31 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Paranoia
• You think your files are safe simply because they’re on a RAID-4 or RAID-5 system …
– power failure at inopportune moment- parity is irreparably wrong
– obscure bug in controller firmware or OS- data is garbage (but with correct parity!)
– sysadmin accidentally scribbled on one drive- (profuse apologies … )
– out of disk space- must restructure 4TB file system
– out of address space- 264 isn’t as big as it used to be
Operating Systems In Depth XVIII–32 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Partial Writes
Data blocks
Parity blocks
Operating Systems In Depth XVIII–33 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Small Writes
Data blocks
Parity blocks
Writing these:
Requires reading then writing this:
Operating Systems In Depth XVIII–34 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Hardware RAID
Data blocks
Parity blocks
RAIDController
Operating Systems In Depth XVIII–35 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Adding a Disk (1)
Disk Disk
LVMDisk
?
Operating Systems In Depth XVIII–36 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Disk
Adding a Disk (2)
Disk Disk
LVM
?
Disk Disk
LVM
Operating Systems In Depth XVIII–37 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
ZFSThe Last (?!) Word in File Systems
Operating Systems In Depth XVIII–38 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
ZFS Layers
Disk Disk Disk Disk Disk
Storage Pool
Data Management
Operating Systems In Depth XVIII–39 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Enter ZFS
ZFS POSIX Layer(ZPL)
Data Management Unit(DMU)
Storage Pool Allocator(SPA)
Device Driver
VFSVnode operations
<dataset, object, offset>
<data virtual address>
<physical device, offset>
128-bit addresses!
264 objects;each up to 264 bytesProvides
transactions on objects
Maps virtual blocks to disks and
physical blocks
Operating Systems In Depth XVIII–40 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Shadow-Page Tree(with a twist …)
überblock
pointerchecksum
ditto blocks
Operating Systems In Depth XVIII–41 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Storage Pool Allocator
Data Management Unit(DMU)
Storage Pool AllocatorMirroring, spanning, or RAID
Operating Systems In Depth XVIII–42 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
RAID-ZSoftware Dynamic Striping
a1 a2 a3 ap1-3 b1 b2
b3 b4 bp1-4 c1 c2 cp1-2
d1 d2 d3 d4 d5 dp1-5
d6 d7 dp6-7
Disk 1
Disk 2
Disk 3
Disk 4
Disk 5
Disk 6
Operating Systems In Depth XVIII–43 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Quiz
Compared with RAID 4, which of the following would be more time-consuming with RAID-Z?a) adding a diskb) replacing a crashed diskc) bothd) neither
Operating Systems In Depth XVIII–44 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
RAID-ZAdding a Disk
a1 a2 a3 ap1-3 b1 b2
b3 b4 bp1-4 c1 c2 cp1-2
d1 d2 d3 d4 d5 dp1-5
d6 d7 dp6-7
Disk 1
Disk 2
Disk 3
Disk 4
Disk 5
Disk 6
Disk 7
e1 e2 e3 e4
e5 e6 e1-6 e7 e8 e9 e7-9
Operating Systems In Depth XVIII–45 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Scenarios
• Power failure at inopportune moment– “live data” is not modified– single lost write can be recovered
• Obscure bug in controller firmware or OS– detected by checksum in pointer
• Sysadmin accidentally scribbled on one drive– detected and repaired
• Out of disk space– add to the pool; SPA will cope
• Out of address space– 2128 is big
- 1 address per cubic yard of a sphere bounded by the orbit of Neptune
Operating Systems In Depth XVIII–46 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
And There’s More …
• Adaptive replacement cache• Advanced prefetching
Operating Systems In Depth XVIII–47 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
LRU Caching
• LRU cache holds n least-recently-used disk blocks
– working sets of current processes• New process reads n-block file sequentially
– cache fills with this file’s blocks– old contents flushed– new cache contents never accessed again
Operating Systems In Depth XVIII–48 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
(Non-Adaptive) Solution
• Split cache in two– half of it is for blocks that have been
referenced exactly once– half of it is for blocks that have been
referenced more than once• Is 50/50 split the right thing to do?
Operating Systems In Depth XVIII–49 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Adaptive Replacement Cache
……
LRU LRUn
n n
list t1 list t2
list b1 list b2
……
t1 ; b1:LRU list of blocks referenced oncet1 list (most recently used) contain contentsb1 list (least recently used) contain just references
t2 ; b2:LRU list of blocks referenced more than oncet2 list (most recently used) contain contentsb2 list (least recently used) contain just references
Operating Systems In Depth XVIII–50 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Adaptive Replacement Cache
……
LRU LRUn
n n
list t1 list t2
list b1 list b2
……
cache miss:if t1 is full
evict LRU(t1) and make it MRU(b1)referenced block becomes MRU(t1)
Operating Systems In Depth XVIII–51 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Adaptive Replacement Cache
……
LRU LRUn
n n
list t1 list t2
list b1 list b2
……
cache hit:if in t1 or t2, block becomes MRU(t2)otherwise
if block is referred to by b1, increase t1 space at expense of t2otherwise
increase t2 space at expense of t1if t1 is full, evict LRU(t1) and make it MRU(b1)if t2 is full, evict LRU(t2) and make it MRU(b2)insert block as MRU(t2)
Operating Systems In Depth XVIII–52 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
Prefetch
• FFS prefetch– keeps track of last block read by each process– fetches block i+1 if current block is i and
previous was i-1– chokes on
- diff file1 file2
Operating Systems In Depth XVIII–53 Copyright © 2020 Thomas W. Doeppner. All rights reserved.
zfetch
• Tracks multiple prefetch streams• Handles four patterns
– forward sequential access– backward sequential access– forward strided access
- iterating across columns of matrix stored by columns
– backward strided access