TRANSCRIPT
Write Dependency Disentanglement with HORAE
Xiaojian Liao, Youyou Lu, Erci Xu, Jiwu Shu
Tsinghua University
1. Background and Motivation
Background: Storage Trend
              HDD          SATA SSD        NVMe SSD        All-SSD Array
Bandwidth     ~80 MB/s     ~500 MB/s       ~6 GB/s         ~48 GB/s
Concurrency   single-head  multi-channel   multi-channel   multi-device
              (1)          (1~8)           (>=32)          (>=256)
[Figure: internals of a SATA SSD (one controller core, a few flash channels) versus an NVMe SSD (multiple cores, many flash channels).]
Modern storage delivers higher bandwidth and concurrency.
Background: Write Dependency
A write dependency is a "persist-before" order among writes: A1 must be persisted before B2, and B2 before C3.
Storage systems (file system journaling, COW-based file systems, soft updates, ...) require this storage order for consistency, and they rely on the storage device to maintain it.
Background: Storage Order
[Figure: writes A1, A2, A3 travel from the application through the block layer, device driver, and storage device, and may be reordered at every level: OS-level scheduling in the block layer (Deadline, BFQ, etc.), hardware retries in the driver, and device-side scheduling.]
The modern IO stack is orderless.
Background: Storage Order
To enforce storage order, the current IO stack must process only one set of orderless IOs at a time: A1 is dispatched and completed before A2 may enter the stack, and A2 before A3.
[Animation: A1, A2, A3 walk through the stack (application -> block layer -> device driver -> storage device) one after another.]
Serialization vs. Concurrency
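A minimal sketch of what this serialization looks like from the application's point of view, assuming a single file descriptor and ignoring error handling (this illustrates how today's stack is typically driven; it is not Horae code):

#include <unistd.h>

/* Enforce A1 -> A2 -> A3 on an orderless stack: each write must complete
 * and be flushed before the next dependent write may even be submitted. */
void ordered_writes(int fd, const void *a1, const void *a2, const void *a3,
                    size_t len)
{
        pwrite(fd, a1, len, 0);
        fsync(fd);               /* drain + FLUSH: A1 is durable          */
        pwrite(fd, a2, len, len);
        fsync(fd);               /* only now may A3 be issued             */
        pwrite(fd, a3, len, 2 * len);
        fsync(fd);
}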
Motivation
Orderless writes -> full bandwidth! Ordered writes -> hard to grow!
Setup: fio, 4 KB IOs, up to 4 threads, Intel 750 SSDs, 1 to 3 devices.
[Figure: throughput (KIOPS, up to 800) vs. number of hardware queues; orderless writes scale as queues are added, while ordered writes stay flat, an 8x gap.]
2. Design
Can we achieve storage order with full bandwidth, and how?
Yes, with Horae.
Horae IO Stack Overview
Key idea: split the IO stack into an ordered control path and an orderless data path.
[Figure: the Linux IO stack sends writes 1, 2, 3 through File System -> Block Layer -> Device Driver -> Storage Arrays. In the Horae IO stack, HoraeFS and HoraeStore push ordering information through a dedicated Ordering Layer (control path), while the data itself flows through the block layer and device driver (data path) to the storage arrays.]
Result: full bandwidth and storage order.
Horae's Key Design
Key idea: a dedicated control path.
Opportunities:
• Parallelizing durability commands
• Parallelizing in-place updates
Use cases:
• File system: HoraeFS
• Object store: HoraeStore
Dedicated Control Path
[Figure: an SSD with its data buffer and flash chips; the CMB (persistent Controller Memory Buffer, defined in the NVMe spec) holds a circular ordering queue, written directly by the CPU or a DMA engine.]
Ordering metadata: 16 bytes per entry, persist latency ~0.5 us.
  lba, len, devID, etag  -> ordering
  dr                     -> durability
  plba                   -> in-place update
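The slide gives the field names but not their widths; the struct below is a minimal sketch of one entry, with field sizes assumed only so that they add up to 16 bytes (not the paper's exact layout):

#include <stdint.h>

/* One entry of the circular ordering queue kept in the persistent CMB.
 * Field widths are assumptions; only the names come from the slide. */
struct horae_ord_meta {
        uint32_t lba;    /* target logical block address    - ordering        */
        uint16_t len;    /* length of the write (blocks)    - ordering        */
        uint8_t  devid;  /* destination device ID           - ordering        */
        uint8_t  etag;   /* epoch/ordering tag              - ordering        */
        uint32_t dr;     /* durability state                - durability      */
        uint32_t plba;   /* redirected LBA for the update   - in-place update */
};                       /* 4 + 2 + 1 + 1 + 4 + 4 = 16 bytes */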
Dedicated Control Path
Horae uses the ordering metadata to provide the ordering guarantee during both normal execution and crash recovery.
(a) Normal execution: data blocks A1, B2, C3 reach the device data buffers through the orderless data path, while the ordering layer records A1 -> B2 -> C3; the ordering is satisfied.
(b) Crash recovery: the ordering layer still holds the persist-before chain after a crash; a write that breaks the chain (here C3) is discarded during recovery, leaving the ordered prefix A1, B2.
Proof In the Paper
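A sketch of the recovery idea shown above, reusing the metadata struct from the earlier sketch plus two hypothetical helpers (data_is_persisted, discard_entries); the paper's actual recovery procedure and proof are more involved:

#include <stdbool.h>

/* Hypothetical helpers, not Horae's API. */
bool data_is_persisted(const struct horae_ord_meta *e);
void discard_entries(struct horae_ord_meta *q, int from, int to);

/* Walk the circular ordering queue in persist-before order and keep writes
 * only up to the first entry whose data did not survive the crash; later
 * entries (C3 in the example) are discarded so the surviving state is an
 * ordered prefix (A1, B2). */
void horae_recover(struct horae_ord_meta *queue, int nentries)
{
        for (int i = 0; i < nentries; i++) {
                if (!data_is_persisted(&queue[i])) {
                        discard_entries(queue, i, nentries);
                        return;
                }
        }
}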
Classic Durability Guarantee
• Linux uses the FLUSH command to synchronously and serially drain the possibly volatile data buffer of each device: device A is flushed (A1), and only then device B (B2).
[Figure: devices A and B each drain their data buffer to flash, one after the other.]
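A sketch of this serial draining, assuming one open file descriptor per block device (on Linux, fsync on a block-device fd triggers a device cache FLUSH); error handling omitted:

#include <unistd.h>

/* Classic path: flush device A, wait, then flush device B, and so on.
 * The total latency is the sum of all individual FLUSHes. */
void serial_flush(const int *dev_fds, int ndev)
{
        for (int i = 0; i < ndev; i++)
                fsync(dev_fds[i]);
}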
Horae's Durability Guarantee
• Horae introduces the joint FLUSH to perform parallel FLUSHes on the separate devices.
[Figure: devices A and B drain their data buffers (A1 and B2) concurrently under a single joint FLUSH.]
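A sketch of the joint-FLUSH effect using one thread per device; the thread-based wrapper below only illustrates the parallelism, it is not Horae's implementation:

#include <pthread.h>
#include <unistd.h>

static void *flush_one(void *arg)
{
        fsync(*(const int *)arg);        /* per-device cache FLUSH */
        return NULL;
}

/* Joint FLUSH: all devices drain their volatile buffers concurrently,
 * so the latency is roughly one FLUSH instead of the sum. */
void joint_flush(int *dev_fds, int ndev)
{
        pthread_t tid[ndev];
        for (int i = 0; i < ndev; i++)
                pthread_create(&tid[i], NULL, flush_one, &dev_fds[i]);
        for (int i = 0; i < ndev; i++)
                pthread_join(tid[i], NULL);
}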
Parallelizing In-Place Updates
• Classic IO stacks serialize in-place updates: the write of a.txt (V=1) must complete before the overwrite a.txt (V=2) may be issued.
• Horae parallelizes in-place updates through write redirection using the plba field of the ordering metadata: the later version is redirected to another location recorded in plba, so both versions can be in flight at once.
[Figure: Linux IPU issues V=1 and V=2 to the same blocks back to back; Horae IPU issues them concurrently, with V=2 going to the redirected location.]
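A sketch of the write-redirection idea, reusing the earlier metadata struct; alloc_redirect_block, persist_ordering_metadata, and submit_async_write are hypothetical helpers, not Horae's API:

#include <stdint.h>

uint32_t alloc_redirect_block(void);                            /* hypothetical */
void persist_ordering_metadata(const struct horae_ord_meta *m); /* hypothetical */
void submit_async_write(uint8_t devid, uint32_t lba,
                        const void *buf, uint16_t len);         /* hypothetical */

/* In-place update under Horae: instead of waiting for the old version at
 * `lba` to persist, the new version is redirected to a block recorded in
 * `plba` and issued immediately; the ordering metadata decides which
 * version is valid after a crash. */
void horae_ipu_write(struct horae_ord_meta *m, const void *new_version)
{
        m->plba = alloc_redirect_block();        /* pick the redirected target    */
        persist_ordering_metadata(m);            /* ~0.5 us write into the CMB    */
        submit_async_write(m->devid, m->plba,    /* no waiting on the old version */
                           new_version, m->len);
}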
Use Case 1: HoraeFS
HoraeFS separates the ordering logic from the durability logic: fbarrier enforces ordering only, while fsync additionally waits for durability.
[Figure: fsync timelines (CPU, IO, context switches, ordering-layer writes) of the application and journal threads. ext4 serializes data and journal IO inside every fsync; HoraeFS overlaps them, yielding a latency reduction.]
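A usage sketch of the split interface; fbarrier() is HoraeFS's ordering-only call (in the spirit of BarrierFS's fbarrier from FAST '18), not a standard Linux syscall, so the prototype below is an assumption:

#include <sys/types.h>
#include <unistd.h>

int fbarrier(int fd);    /* assumed prototype: ordering only, no durability wait */

/* Two dependent appends: fbarrier guarantees r1 persists before r2 without
 * stalling the application; the final fsync still provides durability. */
void append_dependent_records(int fd, const void *r1, const void *r2,
                              size_t len, off_t off)
{
        pwrite(fd, r1, len, off);
        fbarrier(fd);                    /* ordering only: cheap, no IO wait */
        pwrite(fd, r2, len, off + (off_t)len);
        fsync(fd);                       /* durability for both records      */
}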
Use Case 2: HoraeStore
HoraeStore parallelizes dependent transactions.
[Figure: CPU, IO, and ordering-layer timelines of two dependent transactions, TXN1 and TXN2. Ceph BlueStore processes them serially; HoraeStore processes them concurrently.]
3. Evaluation
Compared systems: Linux vs. Barrier [1] vs. Horae
Test layers and benchmarks:
• Block device: fio random overwrite
• File system: fio allocating write
• Application: OLTP-insert, dbbench, objectbench
Hardware:
• Lower-end SATA SSD A: Samsung 860 Pro
• Medium-end NVMe SSD B: Intel 750 (consumer-grade)
• High-end NVMe SSD C: Intel DC P3700 (datacenter-grade)
[1] Barrier-Enabled IO Stack for Flash Storage, FAST '18
Block Device Performance
• Single device: 1 NVMe SSD C. [Figure: throughput (MB/s, 0-2000) vs. write size (4-64 KB) for Linux, Barrier-enabled, and Horae, with the maximum device bandwidth marked; annotation: 2x.]
• Multiple devices: 3 NVMe SSDs (2x B + C). [Figure: throughput (MB/s, 0-4000) vs. write size (4-64 KB) for Linux and Horae; annotations: 4.1x, 6.8x.]
Separating the ordering control path is efficient.
File System Performance
• 1 NVMe SSD (B): [Figure: KIOPS (0-100) vs. number of threads (1, 4, 12) for ext4, BarrierFS, and HoraeFS, under both ordering (fbarrier) and durability (fsync); annotations: 1.5x, 1.6x.]
• 2 NVMe SSDs (B+C): [Figure: KIOPS (0-150) vs. number of threads (1, 4, 12) for ext4 and HoraeFS, under both ordering and durability; annotation: 2.2x.]
Parallelizing the data and journal processing is efficient.
Application Performance - MySQL
[Figure: MySQL throughput (KTX/s) under fsync and fbarrier. Sole device: ext4 vs. BarrierFS vs. HoraeFS. Separate redo log: ext4-RAID0 vs. ext4 vs. HoraeFS-RAID0 vs. HoraeFS. Annotations: 1.4x, 1.6x, 1.8x.]
Note: ext4's "fbarrier" is emulated with the nobarrier mount option, which gives weaker ordering guarantees.
• Horae boosts MySQL performance by up to 1.8x.
• HoraeFS-RAID0 performs worse than plain HoraeFS.
Application Performance - HoraeStore
Sole (S): SSD A; Multiple (M): SSDs A+B+C.
[Figure: throughput (K TPS) vs. number of sequencers (1-12) for BlueStore-S, BlueStore-M, HoraeStore-S, and HoraeStore-M; one panel for apply_transaction and one for queue_transaction; annotations: 1.4x, 2x.]
HoraeStore outperforms BlueStore by up to 2x.
Conclusion
• The write dependency overhead becomes more severe as storage concurrency scales.
• We re-architect the IO stack with HORAE, providing a dedicated control path to reduce the write dependency overhead.
• Horae boosts file-system-level and application-level performance by up to 2.2x and 2x, respectively.
Thank You!
Write Dependency Disentanglement with HORAE
Xiaojian Liao, Youyou Lu, Erci Xu, Jiwu Shu