TRANSCRIPT
Write Dependency Disentanglement with HORAE
Xiaojian Liao, Youyou Lu, Erci Xu, Jiwu Shu
Tsinghua University
1. Background and Motivation
Background: Storage Trend
              HDD          SATA SSD        NVMe SSD        All-SSD Array
Bandwidth     ~80 MB/s     ~500 MB/s       ~6 GB/s         ~48 GB/s
Concurrency   single-head  multi-channel   multi-channel   multi-device
              (1)          (1~8)           (>=32)          (>=256)
[Figure: internals of a SATA SSD (one controller core, a few flash channels) versus an NVMe SSD (multiple cores, many flash channels).]
Modern storage delivers higher bandwidth and concurrency.
Background: Write Dependency
A write dependency is a "persist-before" order among writes: A1 must be persisted before B2, and B2 before C3.
Storage systems (file system journaling, COW-based file systems, soft updates, ...) require this storage order for consistency, and they rely on the storage device to maintain it.
Background: Storage Order
[Figure: writes A1, A2, A3 travel from the application through the block layer, device driver, and storage device, and may be reordered at every level: OS-level scheduling in the block layer (Deadline, BFQ, etc.), hardware retries in the driver, and device-side scheduling.]
The modern IO stack is orderless.
Background: Storage Order
To enforce storage order, the current IO stack must process only one set of orderless IOs at a time: A1 is dispatched and completed before A2 may enter the stack, and A2 before A3.
[Animation: A1, A2, A3 walk through the stack (application -> block layer -> device driver -> storage device) one after another.]
Serialization vs. Concurrency
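A minimal sketch of what this serialization looks like from the application's point of view, assuming a single file descriptor and ignoring error handling (this illustrates how today's stack is typically driven; it is not Horae code):

#include <unistd.h>

/* Enforce A1 -> A2 -> A3 on an orderless stack: each write must complete
 * and be flushed before the next dependent write may even be submitted. */
void ordered_writes(int fd, const void *a1, const void *a2, const void *a3,
                    size_t len)
{
        pwrite(fd, a1, len, 0);
        fsync(fd);               /* drain + FLUSH: A1 is durable          */
        pwrite(fd, a2, len, len);
        fsync(fd);               /* only now may A3 be issued             */
        pwrite(fd, a3, len, 2 * len);
        fsync(fd);
}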
Motivation
Orderless writes -> full bandwidth! Ordered writes -> hard to grow!
Setup: fio, 4 KB IOs, up to 4 threads, Intel 750 SSDs, 1 to 3 devices.
[Figure: throughput (KIOPS, up to 800) vs. number of hardware queues; orderless writes scale as queues are added, while ordered writes stay flat, an 8x gap.]
2. Design
Can we achieve storage order with full bandwidth, and how?
Yes, with Horae.
Horae IO Stack Overview
Key idea: split the IO stack into an ordered control path and an orderless data path.
[Figure: the Linux IO stack sends writes 1, 2, 3 through File System -> Block Layer -> Device Driver -> Storage Arrays. In the Horae IO stack, HoraeFS and HoraeStore push ordering information through a dedicated Ordering Layer (control path), while the data itself flows through the block layer and device driver (data path) to the storage arrays.]
Result: full bandwidth and storage order.
Horae's Key Design
Key idea: a dedicated control path.
Opportunities:
• Parallelizing durability commands
• Parallelizing in-place updates
Use cases:
• File system: HoraeFS
• Object store: HoraeStore
Dedicated Control Path
[Figure: an SSD with its data buffer and flash chips; the CMB (persistent Controller Memory Buffer, defined in the NVMe spec) holds a circular ordering queue, written directly by the CPU or a DMA engine.]
Ordering metadata: 16 bytes per entry, persist latency ~0.5 us.
  lba, len, devID, etag  -> ordering
  dr                     -> durability
  plba                   -> in-place update
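The slide gives the field names but not their widths; the struct below is a minimal sketch of one entry, with field sizes assumed only so that they add up to 16 bytes (not the paper's exact layout):

#include <stdint.h>

/* One entry of the circular ordering queue kept in the persistent CMB.
 * Field widths are assumptions; only the names come from the slide. */
struct horae_ord_meta {
        uint32_t lba;    /* target logical block address    - ordering        */
        uint16_t len;    /* length of the write (blocks)    - ordering        */
        uint8_t  devid;  /* destination device ID           - ordering        */
        uint8_t  etag;   /* epoch/ordering tag              - ordering        */
        uint32_t dr;     /* durability state                - durability      */
        uint32_t plba;   /* redirected LBA for the update   - in-place update */
};                       /* 4 + 2 + 1 + 1 + 4 + 4 = 16 bytes */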
Dedicated Control Path
Horae uses the ordering metadata to provide the ordering guarantee during both normal execution and crash recovery.
(a) Normal execution: data blocks A1, B2, C3 reach the device data buffers through the orderless data path, while the ordering layer records A1 -> B2 -> C3; the ordering is satisfied.
(b) Crash recovery: the ordering layer still holds the persist-before chain after a crash; a write that breaks the chain (here C3) is discarded during recovery, leaving the ordered prefix A1, B2.
Proof In the Paper
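A sketch of the recovery idea shown above, reusing the metadata struct from the earlier sketch plus two hypothetical helpers (data_is_persisted, discard_entries); the paper's actual recovery procedure and proof are more involved:

#include <stdbool.h>

/* Hypothetical helpers, not Horae's API. */
bool data_is_persisted(const struct horae_ord_meta *e);
void discard_entries(struct horae_ord_meta *q, int from, int to);

/* Walk the circular ordering queue in persist-before order and keep writes
 * only up to the first entry whose data did not survive the crash; later
 * entries (C3 in the example) are discarded so the surviving state is an
 * ordered prefix (A1, B2). */
void horae_recover(struct horae_ord_meta *queue, int nentries)
{
        for (int i = 0; i < nentries; i++) {
                if (!data_is_persisted(&queue[i])) {
                        discard_entries(queue, i, nentries);
                        return;
                }
        }
}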
Classic Durability Guarantee
• Linux uses the FLUSH command to synchronously and serially drain the possibly volatile data buffer of each device: device A is flushed (A1), and only then device B (B2).
[Figure: devices A and B each drain their data buffer to flash, one after the other.]
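A sketch of this serial draining, assuming one open file descriptor per block device (on Linux, fsync on a block-device fd triggers a device cache FLUSH); error handling omitted:

#include <unistd.h>

/* Classic path: flush device A, wait, then flush device B, and so on.
 * The total latency is the sum of all individual FLUSHes. */
void serial_flush(const int *dev_fds, int ndev)
{
        for (int i = 0; i < ndev; i++)
                fsync(dev_fds[i]);
}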
Horae's Durability Guarantee
• Horae introduces the joint FLUSH to perform parallel FLUSHes on the separate devices.
[Figure: devices A and B drain their data buffers (A1 and B2) concurrently under a single joint FLUSH.]
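A sketch of the joint-FLUSH effect using one thread per device; the thread-based wrapper below only illustrates the parallelism, it is not Horae's implementation:

#include <pthread.h>
#include <unistd.h>

static void *flush_one(void *arg)
{
        fsync(*(const int *)arg);        /* per-device cache FLUSH */
        return NULL;
}

/* Joint FLUSH: all devices drain their volatile buffers concurrently,
 * so the latency is roughly one FLUSH instead of the sum. */
void joint_flush(int *dev_fds, int ndev)
{
        pthread_t tid[ndev];
        for (int i = 0; i < ndev; i++)
                pthread_create(&tid[i], NULL, flush_one, &dev_fds[i]);
        for (int i = 0; i < ndev; i++)
                pthread_join(tid[i], NULL);
}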
Parallelizing In-Place Updates
• Classic IO stacks serialize in-place updates: the write of a.txt (V=1) must complete before the overwrite a.txt (V=2) may be issued.
• Horae parallelizes in-place updates through write redirection using the plba field of the ordering metadata: the later version is redirected to another location recorded in plba, so both versions can be in flight at once.
[Figure: Linux IPU issues V=1 and V=2 to the same blocks back to back; Horae IPU issues them concurrently, with V=2 going to the redirected location.]
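A sketch of the write-redirection idea, reusing the earlier metadata struct; alloc_redirect_block, persist_ordering_metadata, and submit_async_write are hypothetical helpers, not Horae's API:

#include <stdint.h>

uint32_t alloc_redirect_block(void);                            /* hypothetical */
void persist_ordering_metadata(const struct horae_ord_meta *m); /* hypothetical */
void submit_async_write(uint8_t devid, uint32_t lba,
                        const void *buf, uint16_t len);         /* hypothetical */

/* In-place update under Horae: instead of waiting for the old version at
 * `lba` to persist, the new version is redirected to a block recorded in
 * `plba` and issued immediately; the ordering metadata decides which
 * version is valid after a crash. */
void horae_ipu_write(struct horae_ord_meta *m, const void *new_version)
{
        m->plba = alloc_redirect_block();        /* pick the redirected target    */
        persist_ordering_metadata(m);            /* ~0.5 us write into the CMB    */
        submit_async_write(m->devid, m->plba,    /* no waiting on the old version */
                           new_version, m->len);
}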
Use Case 1: HoraeFS
HoraeFS separates the ordering logic from the durability logic: fbarrier enforces ordering only, while fsync additionally waits for durability.
[Figure: fsync timelines (CPU, IO, context switches, ordering-layer writes) of the application and journal threads. ext4 serializes data and journal IO inside every fsync; HoraeFS overlaps them, yielding a latency reduction.]
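A usage sketch of the split interface; fbarrier() is HoraeFS's ordering-only call (in the spirit of BarrierFS's fbarrier from FAST '18), not a standard Linux syscall, so the prototype below is an assumption:

#include <sys/types.h>
#include <unistd.h>

int fbarrier(int fd);    /* assumed prototype: ordering only, no durability wait */

/* Two dependent appends: fbarrier guarantees r1 persists before r2 without
 * stalling the application; the final fsync still provides durability. */
void append_dependent_records(int fd, const void *r1, const void *r2,
                              size_t len, off_t off)
{
        pwrite(fd, r1, len, off);
        fbarrier(fd);                    /* ordering only: cheap, no IO wait */
        pwrite(fd, r2, len, off + (off_t)len);
        fsync(fd);                       /* durability for both records      */
}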
Use Case 2: HoraeStore
HoraeStore parallelizes dependent transactions.
[Figure: CPU, IO, and ordering-layer timelines of two dependent transactions, TXN1 and TXN2. Ceph BlueStore processes them serially; HoraeStore processes them concurrently.]
3. Evaluation
Compared systems: Linux vs. Barrier [1] vs. Horae
Test layers and benchmarks:
• Block device: fio random overwrite
• File system: fio allocating write
• Application: OLTP-insert, dbbench, objectbench
Hardware:
• Lower-end SATA SSD A: Samsung 860 Pro
• Medium-end NVMe SSD B: Intel 750 (consumer-grade)
• High-end NVMe SSD C: Intel DC P3700 (datacenter-grade)
[1] Barrier-Enabled IO Stack for Flash Storage, FAST '18
Block Device Performance
• Single device: 1 NVMe SSD C. [Figure: throughput (MB/s, 0-2000) vs. write size (4-64 KB) for Linux, Barrier-enabled, and Horae, with the maximum device bandwidth marked; annotation: 2x.]
• Multiple devices: 3 NVMe SSDs (2x B + C). [Figure: throughput (MB/s, 0-4000) vs. write size (4-64 KB) for Linux and Horae; annotations: 4.1x, 6.8x.]
Separating the ordering control path is efficient.
File System Performance
• 1 NVMe SSD (B): [Figure: KIOPS (0-100) vs. number of threads (1, 4, 12) for ext4, BarrierFS, and HoraeFS, under both ordering (fbarrier) and durability (fsync); annotations: 1.5x, 1.6x.]
• 2 NVMe SSDs (B+C): [Figure: KIOPS (0-150) vs. number of threads (1, 4, 12) for ext4 and HoraeFS, under both ordering and durability; annotation: 2.2x.]
Parallelizing the data and journal processing is efficient.
Application Performance - MySQL
[Figure: MySQL throughput (KTX/s) under fsync and fbarrier. Sole device: ext4 vs. BarrierFS vs. HoraeFS. Separate redo log: ext4-RAID0 vs. ext4 vs. HoraeFS-RAID0 vs. HoraeFS. Annotations: 1.4x, 1.6x, 1.8x.]
Note: ext4's "fbarrier" is emulated with the nobarrier mount option, which gives weaker ordering guarantees.
• Horae boosts MySQL performance by up to 1.8x.
• HoraeFS-RAID0 performs worse than plain HoraeFS.
Application Performance - HoraeStore
Sole (S): SSD A; Multiple (M): SSDs A+B+C.
[Figure: throughput (K TPS) vs. number of sequencers (1-12) for BlueStore-S, BlueStore-M, HoraeStore-S, and HoraeStore-M; one panel for apply_transaction and one for queue_transaction; annotations: 1.4x, 2x.]
HoraeStore outperforms BlueStore by up to 2x.
Conclusion
• The write dependency overhead becomes more severe as storage concurrency scales.
• We re-architect the IO stack with HORAE, providing a dedicated control path to reduce the write dependency overhead.
• Horae boosts file-system-level and application-level performance by up to 2.2x and 2x, respectively.
Thank You!
Write Dependency Disentanglement with HORAE
Xiaojian Liao, Youyou Lu, Erci Xu, Jiwu Shu