fstream: managing flash streams in the file system · eunhee rho, kanchan joshi, seung-uk shin,...
TRANSCRIPT
FStream: Managing Flash Streams
in the File System
Eunhee Rho, Kanchan Joshi, Seung-Uk Shin, Nitesh Jagadeesh Shetty, Joo-Young Hwang, Sangyeun Cho, Daniel DG Lee, Jaeheon Jeong
Memory Division, Samsung Electronics Co., Ltd.
Table of Contents
• Flash-based SSDs
• Garbage Collection & WAF
• Multi-stream
• FStream
• Workload Analysis & Experimental Results
• Conclusion
2/17
Flash-based Solid State Drives Replacement of HDDs
• Flash Translation Layer (FTL) allows SSDs to maintain traditional block interface
Application
OS
File System
Application
OS
File System
HDD SSD
Block Layer Block Layer
Common Interface
Faster
Energy Efficient
More Reliable
FTL
3/18
Garbage Collection & WAF Garbage Collection (GC) Overheads
• Reclaiming space for empty blocks requires valid page copy
• Media write amplified due to garbage collection
• Shortens SSD lifetime and hampers performance
Write Amplification Factor (WAF)
• Ratio of the actual media writes to the user I/O
4/18
Multi-stream Managing data placement on a SSD with streams
• Mapping data to separate stream by their life expectancy
Standardization status
• T10 (SCSI) standard & NVME 1.3 “directives”
5/18
Multi-stream Cont’d Data Placement Comparison
H1
C1
H2
C2
C3
H3
H4
C4
Writes =
H1, C1, H2, C2,
C3, H3, H4, C4
1
H1
C1
H2
C2
C3
H3
H4
C4
H1’
H2’
H4’
H3’
H1
H2
H4
H3
C1
C2
C3
C4
New Writes =
H1’, H2’, H3’, H4’
2
H1
H2
H3
H4
C1
C2
C3
C4
Stream = X Stream = Y
Stream = Y
H1’
H2’
H4’
H3’
Stream = X
Conventional SSD Multi-streamed SSD
Free Space Fragmentation!
Valid page copy required to reclaim the free space.
Return to free block
6/18
FStream Motivation
• We need easier, general method of stream assignment.
• Block device layer has limited information about data lifetime.
• File system metadata has different lifetime from user data, need be separated.
Our Approach
• File system level stream assignment.
• Separate streams for file system metadata, journal, and user data.
• Implemented FStream in existing file systems.
[1]
[2]
[1] Kang, JU et al., “The Multi-streamed Solid-State Drive”, HotStorage ’14 [2] Yang, Jingpei et al., “AutoStream: automatic stream management for multi-streamed SSDs”, SYSTOR ‘17
7/18
EXT4 metadata and journaling
• EXT4 on-disk layout: block groups with data and metadata related to it
• EXT4 journal: write ordering in ‘data=ordered’ mode
Ext4
8/18
Ext4Stream Mount options
• Journal-stream
• Separate journal writes
• Inode-stream
• Separate inode writes
• Dir-stream
• Separate directory blocks
• Misc-stream
• Inode/block bitmap and group-descriptor
• Fname-stream
• Distinct stream to file(s) with specific names
• Extn-stream
• File-extension based stream
9/18
XFS XFS metadata and journaling
• Parallel metadata operations, metadata buffering (page cache not used)
• Mixture of logical and physical journaling
• Minimum inode update size is a chunk of 64 inodes.
10/18
XFStream Mount options
• Log-stream
• Separate journal writes
• Inode-stream
• Separate inode writes
• Fname-stream
• Distinct stream to file(s) with specific names
11/18
Application Specific Data Separation
Upon Write Request..
1) Writes to Commit Log - For error recovery
2) Writes to Memtable - In memory space
3) Returns success
4) Data in Memtable are
flushed to SSTable
5) SSTables go through compaction
Memory
Commit Log
Memtable
SSTable 1 K1 K2
SSTable 2 K1 K3
SSTable 3 K2 K3
SSTable 4 K1 K3
SSTable 5 K1 K2 K3
SSTable 6 SSTable 7
SSTable 21
Flush
ing
Write
Request
Stream for Cassandra’s commit log file.
• Fname_stream option: commitlog-*
12/18
Experimental Setup
OS:
• Linux kernel v4.5 with io-streamid support
System:
• Dell PowerEdge R720 server with 32 cores and 32GB memory
SSD:
• Samsung PM963 480GB with streams support
Benchmarks:
• Filebench: Varmail & Fileserver
• YCSB on Cassandra
13/18
Filebench Workload Analysis Varmail
Fileserver
Journal 61.0%
Inode 8.0%
Directory 4.0%
Other meta 0.2%
Data 26.8%
EXT4
Journal 60% Inode
9%
Data 31%
XFS
Inode 21.0% Directory
15.8%
Other meta 0.2%
Data 63.0%
EXT4 No Journal
Journal 26%
Inode 16%
Directory 3%
Other meta 0%
Data 55%
EXT4
Journal 16%
Inode 32%
Data 52%
XFS Workload Conditions - Filled 80% of disk before test
- Number of test files: 900,000 (14GB)
- Varmail : fsync-intensive
- Runtime: 2 hours
XFS writes more inodes (random writes) than Ext4.
14/18
Filebench: Performance Fstream achieved 5 ~ 35% performance improvements.
0
20000
40000
60000
80000
100000
120000
Ext4
Ext4Stream
Ext4-NJ
Ext4Stream-NJ
XFS
XFStream
0
20000
40000
60000
80000
100000
120000
Ext4
Ext4Stream
XFS
XFStream
Ops/
sec
Ops/
sec
Varmail Fileserver
15/18
Filebench: WAF
1.3
1.04
1.6
1.09 1.2
1.05
0
0.5
1
1.5
2
Ext4
Ext4Stream
Ext4-NJ
Ext4Stream-NJ
XFS
XFStream
1.1 1.02 1.2
1.05
0
0.5
1
1.5
2
Ext4
Ext4Stream
XFS
XFStream
WA
F
WA
F
Varmail Fileserver
Fstream achieved WAF of close to one.
Ext4’s WAF < Ext4NJ’s WAF
• Journal is written in a circular fashion, so is invalidated periodically.
16/18
YCSB on Cassandra Results Data intensive workload
• Load phase: 1KB record x 120 million inserts
• Run phase: 1KB record x 80 million inserts
0
10000
20000
30000
40000
50000
60000
70000
Ext4
Ext4Stream
XFS
XFStream
2
1.1
2
1.2
0
0.5
1
1.5
2
2.5
Ext4
Ext4Stream
XFS
XFStream
Ops/
sec
WA
F 17/18
Conclusion and Acknowledgements SSD Performance & Lifetime
• The less FTL garbage collection overheads, the longer SSD lives and the faster SSD performs.
Streams: SSD interface for separating data with different lifetimes
FStream: stream assignment in file system
• Separate streams for file system metadata, journal, and user data.
• Provide filename and extension based user data separation.
• Achieved 5~35% performance improvement and near 1 WAF for filebench.
Acknowledgements
• We thank Cristian Ungureanu, our shepherd, and anonymous reviewers for their feedbacks.
18/18