How to Teach an Old File System Dog New Object Store Tricks
USENIX HotStorage ’18
Eunji Lee¹, Youil Han¹, Suli Yang², Andrea C. Arpaci-Dusseau², Remzi H. Arpaci-Dusseau²
¹ Chungbuk National University
Data-service Platforms
• Layering
  • Abstract away underlying details
  • Reuse of existing software
  • Agility: development, operation, and maintenance
[Figure: Ecosystem of Data-Service Platform]
Local File System
• Bottom layer of modern storage platforms
• Portability, Extensibility, Ease of Development
Often at odds with efficiency
Distributed Data Store (Dynamo, MongoDB)
Key-value Store (RocksDB, BerkeleyDB)
Local File System (Ext4, XFS, BtrFS)
Distributed Data Store (HBase, BigTable)
Object Store
Distributed File System (HDFS, GFS)
Local File System (Ext4, XFS, BtrFS)
Distributed Data Store (Ceph)
Object Store Daemon (Ceph)
Local File System (Ext4, XFS, BtrFS)
Local File System
• Not intended to serve as an underlying storage engine
• Mismatch between the two layers
• System-wide optimization
  • Ignore demands from individual applications
  • Little control over file system internals
  • Suffer from degraded QoS
• Lack of required operations
  • No atomic operation
  • No data movement or reorganization
  • No additional user-level metadata
Out-of-control and Sub-optimal Performance
Current Solutions
• Bypass File System
  • Key-value store, Object Store, Database
  • But relinquishes file system benefits
• Extend file system interfaces
  • Add new features to POSIX APIs
  • Slow and conservative evolution
  • Stable maintenance is favored over specific optimizations
Name: Ext2/3/4 Birth: 1993
Our Approach
• Use a file system as it is, but in a different manner!
• Design patterns of user-level data platform
  • Take advantage of the file system
  • Minimize negative effects of mismatches
Contents
• Motivation
• Problem Analysis
• SwimStore
• Performance Evaluation
• Conclusion
Data-service Platform Taxonomy
What is the best way to store objects atop a file system?
• Mapping: “Object as a file”
• Packing: “Multiple objects in a file”
Case Study: Ceph
• Backend object store engine
  • FileStore: mapping
  • KStore: packing
  • BlueStore
[Diagram: Ceph Architecture. RGW, RBD, CephFS, … sit on RADOS; each OSD uses a backend object store (FileStore, KStore, or BlueStore); below it are the file system and the storage device]
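Mapping vs. Packing
• FileStore (Mapping): “Object as a File”
  [Diagram: object A is stored in File A and object B in File B; the object store is paired with a log]
• KStore (Packing): “Multiple Objects in a File”
  [Diagram: objects A, B, … are packed into a shared file; the object store is paired with a log and an LSM tree]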
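The two layouts can be contrasted with a toy sketch. This is not Ceph code: `MappingStore`, `PackingStore`, and `pack.dat` are hypothetical names chosen for illustration, and the packing variant uses a plain append-only file with an in-memory index instead of the LSM tree that a key-value backend would use.

```python
import os
import tempfile


class MappingStore:
    """'Object as a file': every object lives in its own file."""

    def __init__(self, root):
        self.root = root

    def put(self, key, data):
        with open(os.path.join(self.root, key), "wb") as f:
            f.write(data)

    def get(self, key):
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()


class PackingStore:
    """'Multiple objects in a file': objects are appended to one shared file."""

    def __init__(self, root):
        self.path = os.path.join(root, "pack.dat")  # hypothetical file name
        self.index = {}  # key -> (offset, length) of the latest version
        open(self.path, "wb").close()

    def put(self, key, data):
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)  # append position
            f.write(data)
        self.index[key] = (offset, len(data))

    def get(self, key):
        offset, length = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)


def demo():
    with tempfile.TemporaryDirectory() as d1, tempfile.TemporaryDirectory() as d2:
        mapping, packing = MappingStore(d1), PackingStore(d2)
        for store in (mapping, packing):
            store.put("A", b"aaaa")
            store.put("B", b"bb")
        # Mapping creates one file per object; packing keeps a single file.
        return (mapping.get("A"), packing.get("B"),
                len(os.listdir(d1)), len(os.listdir(d2)))
```

In the mapping layout every new object costs a file creation, and with it file system metadata updates. In the packing layout the file count stays at one, but stale versions accumulate in the shared file until something compacts them.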
Experimental Setup
• Ceph 12.01
• Amazon EC2 Clusters
• Intel Xeon quad-core
• 32GB DRAM
• 256 GB SSD x 2
• Ubuntu Server 16.04
• File System: XFS (recommended in Ceph)
• Backend: FileStore, KStore
• Benchmark: Rados
• Metric: IOPS, throughput, write traffic
Performance
• Small Write (4KB)
  • KStore performs better than FileStore by 1.5x
  • Write amplification by file metadata
[Figure: Average IOPS. FileStore 1x, KStore 1.5x]
[Figure: Write Traffic Breakdown. Y-axis: ratio w.r.t. original write (0.0 to 10.0); X-axis: 4KB, 1MB; components: Original, Logging, Compaction, Filesystem]
• FileStore: 8.8x (4KB), 2.1x (1MB)
• KStore: 4.4x (4KB), 3.2x (1MB)
Original write traffic
• KStore (4KB): 864 MB
• KStore (1MB): 2.4 GB
• FileStore (4KB): 332 MB
• FileStore (1MB): 3.8 GB
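As a worked example, the chart's ratios can be turned back into absolute device traffic. The sketch below assumes the ratio is total traffic divided by original write traffic, as the y-axis label suggests. It also assumes the backend-to-ratio pairing read from the chart labels and treats 1 GB as 1000 MB. The function names are illustrative only.

```python
# Values read from the slide's chart labels; GB converted with 1 GB = 1000 MB (assumption).
ORIGINAL_MB = {
    ("KStore", "4KB"): 864,
    ("KStore", "1MB"): 2400,
    ("FileStore", "4KB"): 332,
    ("FileStore", "1MB"): 3800,
}

# Total write traffic as a multiple of the original write traffic.
RATIO = {
    ("KStore", "4KB"): 4.4,
    ("KStore", "1MB"): 3.2,
    ("FileStore", "4KB"): 8.8,
    ("FileStore", "1MB"): 2.1,
}


def device_traffic_mb(backend, size):
    """Total traffic reaching the device: original traffic times the ratio."""
    return ORIGINAL_MB[(backend, size)] * RATIO[(backend, size)]


def extra_traffic_mb(backend, size):
    """Traffic added by logging, compaction, and file system overhead."""
    return device_traffic_mb(backend, size) - ORIGINAL_MB[(backend, size)]
```

Under these assumptions, FileStore turns 332 MB of 4KB writes into roughly 2.9 GB of device traffic, which is the metadata-driven amplification the slide points to.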
Performance
• Large Write (1MB)
  • FileStore outperforms KStore by 1.6x
  • Write amplification by compaction
[Figure: Average IOPS. KStore 1x, FileStore 1.6x]
[Figure: Write Traffic Breakdown. Ratio w.r.t. original write for 1MB writes: FileStore 2.1x, KStore 3.2x; components: Original, Logging, Compaction, Filesystem]
Performance
• Lack of atomic update support in file systems
  • Double-write penalty of logging
  • Halves bandwidth for large writes
[Figure: Write Traffic Breakdown. Ratio w.r.t. original write for 4KB and 1MB writes on FileStore and KStore; components: Original, Logging, Compaction, Filesystem]
QoS
• FileStore
[Figure: Performance and Write Traffic. Background write (MiB) and throughput (MB/s) over time (s) for FileStore on XFS; panels: 4KB write, 1MB write]
[Diagram: 4KB and 1MB writes pass through the page cache to storage]
• Periodic flush with buffered I/O
• Transaction entanglement
QoS
• KStore
[Figure: Background write (MiB) and throughput (MB/s) over time (s) for KStore on XFS; panels: 4KB write, 1MB write; annotations: “Throughput: 40MB/s”, “Consistently Poor”]
[Diagram: 4KB and 1MB writes pass through a user-level cache to storage]
• Frequent compaction
• Write amplification by merge
Summary
• Performance penalties of file systems
  • Small objects suffer seriously from write amplification caused by file system metadata.
  • Large writes are sensitive to the extra write traffic caused by logging in both architectures, and by frequent compaction in the packing architecture.
  • Buffered I/O and the out-of-control flush mechanism in file systems make it challenging to support QoS.
SwimStore
• Shadowing with Immutable Metadata Store
• Provide consistently excellent performance for all object sizes running over a file system
SwimStore
• Strategy 1. In-file shadowing
[Diagram labels: File, Object, Log, Direct I/O; objects A, A’, B; Indexing: key, offset, length]
Problems
• Filesystem metadata overhead
• Double-write penalty
• Performance fluctuation
• Compaction cost
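In-file shadowing can be sketched as follows: an update is written to a fresh offset in the same file, and only then does the index entry (key, offset, length) switch to the new copy. This is a minimal illustration, not SwimStore's implementation. `ShadowFile` is a hypothetical name, the index lives only in memory, and an ordinary write followed by `fsync` stands in for the direct I/O shown on the slide.

```python
import os
import tempfile


class ShadowFile:
    """Toy in-file shadowing: an update never overwrites the live copy."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        self.index = {}     # key -> (offset, length) of the live version
        self.obsolete = []  # (offset, length) regions waiting to be recycled
        self.tail = 0       # next unused offset in the file

    def put(self, key, data):
        offset = self.tail
        os.pwrite(self.fd, data, offset)       # 1. write the new version elsewhere
        os.fsync(self.fd)                      # 2. make it durable before switching
        old = self.index.get(key)
        self.index[key] = (offset, len(data))  # 3. point the index at the new copy
        if old is not None:
            self.obsolete.append(old)          # 4. the old copy becomes garbage
        self.tail += len(data)

    def get(self, key):
        offset, length = self.index[key]
        return os.pread(self.fd, length, offset)

    def close(self):
        os.close(self.fd)


def demo():
    with tempfile.TemporaryDirectory() as d:
        store = ShadowFile(os.path.join(d, "container"))
        store.put("A", b"old-A")
        store.put("B", b"B")
        store.put("A", b"new-A'")  # shadows the first version of A
        result = (store.get("A"), store.get("B"), list(store.obsolete))
        store.close()
        return result
```

Because the old copy stays intact until the index switches, the data itself is written only once, which is how shadowing avoids the double-write penalty of logging. The price is the obsolete region left behind, which has to be recycled later.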
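SwimStore
• Strategy 1. In-file shadowing
[Diagram: write paths of FileStore and SwimStore over the file system]
• FileStore: A’ is logged to a raw device, then applied to the file with asynchronous buffered I/O
• SwimStore: A’ is written into the file with synchronous direct I/O
User-facing latency increases!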
SwimStore
• File system access is slower than raw device access
  • File system metadata (e.g., inode, allocation bitmap, etc.)
  • Transaction entanglement
[Diagram: a synchronous direct I/O write of A’ into the file also triggers file system metadata (m) writes]
SwimStore
• Strategy 2. Metadata-Immutable Container
  • Create a container file and allocate space in advance
[Figure: Latency (4KB write) for per-file, single file, raw device, and metadata-immutable container; plotted values: 1, 0.4, 0.86, 0.4]
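A minimal sketch of the preallocation idea, assuming a POSIX system: the container file is created once at its full size, so later writes land inside space that is already allocated and do not change the file's size or block allocation. Whether SwimStore uses `posix_fallocate` is not stated in the slides. The call, the 1 MiB size, and the function names are choices made for this example.

```python
import os
import tempfile

CONTAINER_SIZE = 1 << 20  # 1 MiB, an arbitrary size chosen for the example
BLOCK = 4096


def create_container(path, size=CONTAINER_SIZE):
    """Create a container file and reserve its space up front."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)  # allocate blocks and set the file size
    except (AttributeError, OSError):
        os.ftruncate(fd, size)           # fallback: sets the size only
    return fd


def demo():
    with tempfile.TemporaryDirectory() as d:
        fd = create_container(os.path.join(d, "container"))
        size_before = os.fstat(fd).st_size
        os.pwrite(fd, b"x" * BLOCK, 8 * BLOCK)  # write inside the reserved space
        os.fsync(fd)
        size_after = os.fstat(fd).st_size
        os.close(fd)
        return size_before, size_after
```

The sketch only demonstrates that the size stays fixed. Other metadata, such as timestamps, can still change on a write, and the slides do not say how that is handled.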
SwimStore
• Strategy 3. Hole-punching with Buddy-like Allocation
  • The shadowing technique requires recycling of obsolete data space
SwimStore
• Strategy 3. Hole-punching with Buddy-like Allocation
Opportunities
(+) Filesystem has “infinite address space”
(+) Filesystem provides “physical space reclamation” with punch-hole
[Diagram: hole-punching]
SwimStore
• Strategy 3. Hole-punching with Buddy-like Allocation
  • Holes that are too small severely fragment the space
[Diagram: logical address vs. physical address, with a new object being placed]
SwimStore
• Strategy 3. Hole-punching with Buddy-like Allocation
[Diagram: size classes 2^0, 2^1, …, 2^n]
• Hole-punching for large holes
• GC for small holes
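The split between hole-punching and GC can be sketched with power-of-two size classes. This is a simplification under stated assumptions: `MIN_BLOCK`, `PUNCH_THRESHOLD`, and `BuddyLikeRecycler` are hypothetical, a full buddy allocator would also split and merge blocks, and the sketch only decides which path a freed extent takes. On Linux the punch itself would be a `fallocate` call with `FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE`, which is not issued here.

```python
MIN_BLOCK = 4096             # size of class 2^0 (assumed)
PUNCH_THRESHOLD = 64 * 1024  # holes at least this large get punched (assumed)


def size_class(length):
    """Smallest n such that MIN_BLOCK * 2**n can hold `length` bytes."""
    n = 0
    while (MIN_BLOCK << n) < length:
        n += 1
    return n


class BuddyLikeRecycler:
    """Routes freed extents to hole-punching or to garbage collection."""

    def __init__(self):
        self.punch_queue = []  # (offset, size): large holes to hole-punch
        self.gc_queue = []     # (offset, size): small holes left for GC

    def allocated_size(self, length):
        """Objects occupy a whole power-of-two block."""
        return MIN_BLOCK << size_class(length)

    def free(self, offset, length):
        size = self.allocated_size(length)
        if size >= PUNCH_THRESHOLD:
            self.punch_queue.append((offset, size))
        else:
            self.gc_queue.append((offset, size))
        return size


def demo():
    r = BuddyLikeRecycler()
    small = r.free(0, 5000)           # rounds up to 8 KiB, below the threshold
    large = r.free(1 << 20, 100_000)  # rounds up to 128 KiB, gets punched
    return small, large, r.gc_queue, r.punch_queue
```

Rounding to power-of-two blocks keeps freed extents aligned and uniform, so a large freed block can be handed back to the file system as one hole, while small blocks are left for GC to consolidate.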
SwimStore
• Architecture
[Diagram components]
• Container File Pool
• Metadata (indexing, attributes, etc.)
• Intent Log (metadata, checksum)
• LSM-Tree (LevelDB)
Experimental Setup
• Ceph 12.01, C++ 12K LOC
• Amazon EC2 Clusters
• Intel Xeon quad-core
• 32GB DRAM
• 256 GB SSD x 2
• Ubuntu Server 16.04
• File System: XFS (recommended in Ceph)
• Backend: FileStore, KStore, BlueStore, SwimStore
• Benchmark: Rados
• Metric: IOPS, throughput, write traffic
Performance Evaluation
• IOPS
[Figure: IOPS as ratio w.r.t. FileStore (0.0 to 3.0) by I/O size (4KB, 16KB, 64KB, 256KB, 1MB) for FileStore, BlueStore, SwimStore, KStore; bar annotations: 1454, 1472, 881, 243, 67 ops/s]
Small Write: SwimStore is
• 2.5x better than FileStore
• 1.6x better than BlueStore
• 1.1x better than KStore
Large Write: SwimStore is
• 1.8x better than FileStore
• 3.1x better than KStore
Performance Evaluation
• Write Traffic
[Figure: write traffic as ratio w.r.t. SwimStore (0.0 to 5.0) by I/O size (4KB, 16KB, 64KB, 256KB, 1MB) for FileStore, BlueStore, SwimStore, KStore; bar annotations: 1129, 3020, 5370, 7342, 7342 MB]
Conclusion
• Explore design patterns to build an object store atop a local file system
• SwimStore: a new backend object store
  • In-file shadowing
  • Immutable metadata container
  • Hole-punching with buddy-like allocation
• Provide high performance and little performance variation
• Retain all benefits of the file system
Thank you