the btrfs filesystem - oraclemason/presentation/btrfs-eu11.pdf · the btrfs filesystem jointly...
TRANSCRIPT
The BtrfsFilesystem
Chris Mason
The Btrfs Filesystem
• Jointly developed by a number of companies
Oracle, Redhat, Fujitsu, Intel, SUSE, many others
• All data and metadata is written via copy-on-write
• CRCs maintained for all metadata and data
• Efficient writable snapshots
• Multi-device support
• Online resize and defrag
• Transparent compression
• Efficient storage for small files
• SSD optimizations and trim support
Btrfs Progress
• Extensive performance and stability fixes
• Significant code cleanups
• Efficient free space caching across reboots
• Delayed metadata insertion and deletion
• Background scrubbing
• New LZO compression mode
• New Snappy compression mode in development
• Batched discard (fitrim ioctl)
• Per-inode flags to control COW, compression
• Automatic file defrag option
• Focus on stability and performance
Logging Improvements
• Btrfs fsck log was rewriting some items over and over again
• New code from Fujitsu bumps the metadata generationnumbers inside a transaction
• Cuts down log traffic by 75%
• Will go into 3.2 merge window
Metadata Fragmentation
• Btrfs btree uses key ordering to group related items into thesame metadata block
• COW tends to fragment the btree over time
• Larger blocksizes lower metadata overhead and improveperformance
• Larger blocksizes provide very inexpensive btreedefragmentation
• Ex: Intel 120GB MLC drive:
4KB Random Reads – 78MB/s8KB Random Reads – 137MB/s16KB Random Reads – 186MB/s
• Code queued up for Linux 3.3 allows larger btree blocks
Scrub
• Btrfs CRCs allow us to verify data stored on disk
• CRC errors can be corrected by reading a good copy of theblock from another drive
• New scrubbing code scans the allocated data and metadatablocks (Arne Jansen)
• Any CRC errors are fixed during the scan if a second copyexists
• Will be extended to track and offline bad devices
• (Scrub Demo)
Discard/Trim
• Trim and discard notify storage that we’re done with a block
• Btrfs now supports both real-time trim and batched
• Real-time trims blocks as they are freed
• Batched trims all free space via an ioctl
Drive Swapping
• GSOC project
• Current raid rebuild works via the rebalance code
• Moves all extents into new locations as it rebuilds
• Drive swapping will replace an existing drive in place
• Uses extent-allocation map to limit the number of bytes read
• Can also restripe between different RAID levels
Efficient Backups
• Advanced btrfs send/receive tool in development (JanSchmidt)
• Transmits in a neutral format so corruptions are not duplicated
Embedded Systems
• Btrfs is fairly friendly to small machines• Btrfs is not quite as friendly to small disks
But this is getting better
• Btrfs works very well overall on low end flash
RAID 5/6
• Initial implementation from Intel some time ago
• Merge pending completion of fsck work
• Will also add triple mirroring
• Mixed raid modes for metadata and data are included
When Bad Things Happen to Good Data
• Beta filesystem recovery tool from Josef Bacik
Risk free – copies data out of the corrupt FS
• tree root history log to recover from many hardware errors
• New fsck releases on the way to repair in place
• git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git recovery-beta
• (demo)
Billions of Files?
• Dramatic differences in filesystem writeback patterns
• Sequential IO still matters on modern SSDs
• Btrfs COW allows flexible writeback patterns
• Ext4 and XFS tend to get stuck behind their logs
• Btrfs tends to produce more sequential writes and morerandom reads
File Creation Benchmark Summary
0
20000
40000
60000
80000
100000
120000
140000
160000
180000
File
s/se
c
Btrfs SSDXFS SSDExt4 SSDBtrfsXFSExt4
• Btrfs duplicates metadataby default
2x the writes
• Btrfs stores the file namethree times
• Btrfs and XFS are CPUbound on SSD
0 45 90 135 180 225 270 315 330
Time (seconds)
0
20
40
60
80
100
120
140
160
MB
/s
File CreationThroughput
BtrfsXFSExt4
0 45 90 135 180 225 270 315
Time (seconds)
0
1500
3000
4500
6000
7500
9000
10500
12000
IO /
sec
IOPs
BtrfsXFSExt4
IO Animations
• Ext4 is seeking between a large number of disk areas
• XFS is walking forward through a series of distinct disk areas
• Both XFS and Ext4 show heavy log activity
• Btrfs is doing sequential writes and some random reads