write dependency disentanglement with horae

37
Write Dependency Disentanglement with HORAE Xiaojian Liao, Youyou Lu, Erci Xu, Jiwu Shu 1 Tsinghua University

Upload: others

Post on 18-May-2022

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Write Dependency Disentanglement with HORAE

Write Dependency Disentanglement with HORAE

Xiaojian Liao, Youyou Lu, Erci Xu, Jiwu Shu

1

Tsinghua University

Page 2: Write Dependency Disentanglement with HORAE

1. Background and Motivation

2

Page 3: Write Dependency Disentanglement with HORAE

Background: Storage Trend

HDD SATA SSD NVME SSD All SSD Array

Bandwidth ~80MB/s ~500MB/s ~6GB/s ~48GB/s

Single-Head Concurrency

(1)

Multi-Channel Concurrency

(>=32)

Multi-Device Concurrency(> =256)

Modern storage delivers higher bandwidth and concurrency.

core core core core

flash flash flash flash

NVMe Controller

core

flash flash

SATA Ctrl

Multi-Channel Concurrency

(1~8)

3

Page 4: Write Dependency Disentanglement with HORAE

Background: Write Dependency

4

A1 B2 C3

Write Dependency

“Persist-before”

Storage

SystemA1 B2 C3

Storage order

Consistency

File System Journaling

COW-based File System Soft Updates ……

require

maintainStorage Device

Page 5: Write Dependency Disentanglement with HORAE

Background: Storage Order

Application Block Layer

OS levelscheduling

Device Driver

Hardware retries

Storage Device

Device sidescheduling

Deadline, BFQ, etc.

Modern IO stack is Orderless.

5

A3 A2 A1 A3 A1 A2 A1 A2 A3

Page 6: Write Dependency Disentanglement with HORAE

Background: Storage Order To enforce storage order, current IO stack should process only one set of orderless IOs at a time.

Serialization Concurrency6

Application Block Layer

OS levelscheduling

Device Driver

Hardware retries

Storage Device

Device sidescheduling

Deadline, BFQ, etc.

A2 A1A3

Page 7: Write Dependency Disentanglement with HORAE

Background: Storage Order To enforce storage order, current IO stack should process only one set of orderless IOs at a time.

Serialization Concurrency7

Application Block Layer

OS levelscheduling

Device Driver

Hardware retries

Storage Device

Device sidescheduling

Deadline, BFQ, etc.

A2A1

A3

Page 8: Write Dependency Disentanglement with HORAE

Background: Storage Order To enforce storage order, current IO stack should process only one set of orderless IOs at a time.

Serialization Concurrency8

Application Block Layer

OS levelscheduling

Device Driver

Hardware retries

Storage Device

Device sidescheduling

Deadline, BFQ, etc.

A2 A1A3

Page 9: Write Dependency Disentanglement with HORAE

Background: Storage Order To enforce storage order, current IO stack should process only one set of orderless IOs at a time.

Serialization Concurrency9

Application Block Layer

OS levelscheduling

Device Driver

Hardware retries

Storage Device

Device sidescheduling

Deadline, BFQ, etc.

A2 A1A3

Page 10: Write Dependency Disentanglement with HORAE

Motivation

Orderless writes-> Full bandwidth!

Ordered writes -> Hard to grow!

1 Device 2 Devices 3 Devices

Test tool: fio

IO size: 4 KB

Threads: up to 4

Test SSD: Intel 750

10

0

200

400

600

800

1 8 24 32 39 55 63 70 86

KIO

PS

# of hardware queues

Ordered

Orderless 8X

Page 11: Write Dependency Disentanglement with HORAE

2. Design

Can we and how to achieve storage order with full bandwidth?

with Horae.

11

Yes,

Page 12: Write Dependency Disentanglement with HORAE

Horae IO stack Overview Key idea: Split the IO stack into an ordered control path and an orderless data path

File System

Device Driver

Storage Arrays

Block Layer

Linux IO Stack

2

3

HoraeFS

Ordering Layer

HoraeStore

Horae IO Stack

Device Driver

Block Layer

Storage Arrays

1

2

3

Full bandwidth Storage order

12

Page 13: Write Dependency Disentanglement with HORAE

Horae’s Key Design

HoraeFS

Ordering Layer

HoraeStore

Horae IO Stack

Device Driver

Block Layer

Storage Arrays

1

2

3

2

13

Key idea:

Dedicated Control Path

Opportunities:

• Parallelizing durability commands

• Parallelizing in-place updates

Use cases:

• File System: HoraeFS

• Object Store: HoraeStore

3

2 3

1

Page 14: Write Dependency Disentanglement with HORAE

Dedicated Control Path

Data Buffer

chip chip chip chip

SSD

CMB

23

CPUDMA engine

CMB (persistent Controller Memory Buffer) In NVMespec.

Circular Ordering Queue

16–Byte Ordering Metadata, persist latency ~0.5us

lba len devID etag ordering

dr durability

plba in-place update

2

3

2 3

A1

14

Page 15: Write Dependency Disentanglement with HORAE

Dedicated Control Path

Use the ordering metadata to provide ordering guarantee during normal execution and crash recovery.

(a) Normal execution

Data Buffer

chip chip chip chip

A1B2

C3

Ordering Layer

OrderingSatisfied

15

A1 B2 C3

Page 16: Write Dependency Disentanglement with HORAE

Dedicated Control Path

Use the ordering metadata to provide ordering guarantee during normal execution and crash recovery.

(a) Normal execution

Data Buffer

chip chip chip chip

A1B2

C3

Ordering Layer

A1 B2 C3

OrderingSatisfied

16

Page 17: Write Dependency Disentanglement with HORAE

Dedicated Control Path

Use the ordering metadata to provide ordering guarantee during normal execution and crash recovery.

(a) Normal execution (b) Crash recovery

Data Buffer

chip chip chip chip

A1B2

C3

Ordering Layer

A1 B2 C3

OrderingSatisfied

Data Buffer

chip chip chip chip

A1B2

C3

Ordering Layer

A1

B2

C3

OrderingSatisfied

17

Page 18: Write Dependency Disentanglement with HORAE

Dedicated Control Path

Use the ordering metadata to provide ordering guarantee during normal execution and crash recovery.

(a) Normal execution (b) Crash recovery

Data Buffer

chip chip chip chip

A1B2

C3

Ordering Layer

A1 B2 C3

OrderingSatisfied

Data Buffer

chip chip chip chip

A1B2

C3

Ordering Layer

A1

B2

OrderingSatisfied

Discard C3 during recovery

18

Page 19: Write Dependency Disentanglement with HORAE

Dedicated Control Path

Use the ordering metadata to provide ordering guarantee during normal execution and crash recovery.

(a) Normal execution (b) Crash recovery

Data Buffer

chip chip chip chip

A1B2

C3

Ordering Layer

A1 B2 C3

OrderingSatisfied

Data Buffer

chip chip chip chip

A1B2

C3

Ordering Layer

A1

B2

OrderingSatisfied

Discard C3 during recovery

19

Proof In the Paper

Page 20: Write Dependency Disentanglement with HORAE

Classic Durability Guarantee

• Linux uses FLUSH command to synchronously and serially drain out the possibly volatile data buffer in each device.

Data Buffer

chip chip chip chip

Device A

A1 Data Buffer

chip chip chip chip

Device B

B2

20

Page 21: Write Dependency Disentanglement with HORAE

Classic Durability Guarantee

• Linux uses FLUSH command to synchronously and serially drain out the possibly volatile data buffer in each device.

Data Buffer

chip chip chip chip

Device A

A1

Data Buffer

chip chip chip chip

Device B

B2

21

Page 22: Write Dependency Disentanglement with HORAE

Classic Durability Guarantee

• Linux uses FLUSH command to synchronously and serially drain out the possibly volatile data buffer in each device.

Data Buffer

chip chip chip chip

Device A

A1

Data Buffer

chip chip chip chip

Device B

B2

22

Page 23: Write Dependency Disentanglement with HORAE

Horae’s Durability Guarantee

• Horae introduces joint FLUSH to perform parallel FLUSHes in separate devices.

Data Buffer

chip chip chip chip

Device A

A1 Data Buffer

chip chip chip chip

Device B

B2

A1B2

23

Page 24: Write Dependency Disentanglement with HORAE

Horae’s Durability Guarantee

• Horae introduces joint FLUSH to perform parallel FLUSHes in separate devices.

Data Buffer

chip chip chip chip

Device A

A1

Data Buffer

chip chip chip chip

Device B

B2

A1B2

24

Page 25: Write Dependency Disentanglement with HORAE

Data Buffer

chip chip chip chip

Linux IPU

Parallelizing In-Place Updates• Classic IO stacks serialize in-place updates.

a.txt(V=1)

a.txt(V=2)

25

Page 26: Write Dependency Disentanglement with HORAE

Data Buffer

chip chip chip chip

Linux IPU

Parallelizing In-Place Updates• Classic IO stacks serialize in-place updates.

a.txt(V=1)

a.txt(V=2)

26

Page 27: Write Dependency Disentanglement with HORAE

Data Buffer

chip chip chip chip

Linux IPU

Parallelizing In-Place Updates• Classic IO stacks serialize in-place updates.

a.txt(V=1)

a.txt(V=2)

27

Page 28: Write Dependency Disentanglement with HORAE

Data Buffer

chip chip chip chip

Horae IPU

Data Buffer

chip chip chip chip

Linux IPU

Parallelizing In-Place Updates• Classic IO stacks serialize in-place updates.

a.txt(V=1)

a.txt(V=2)

• Horae parallelizes in-place updates through write redirection using the plba field of ordering metadata.

a.txt(V=1)

a.txt(V=2)

V=1

V=2

28

Page 29: Write Dependency Disentanglement with HORAE

Data Buffer

chip chip chip chip

Horae IPU

Data Buffer

chip chip chip chip

Linux IPU

Parallelizing In-Place Updates• Classic IO stacks serialize in-place updates.

a.txt(V=1)

a.txt(V=2)

• Horae parallelizes in-place updates through write redirection using the plba field of ordering metadata.

a.txt(V=1)

a.txt(V=2)

V=1

V=2

29

Page 30: Write Dependency Disentanglement with HORAE

Use Case 1: HoraeFS

fsync

App:

Journal:

ext4

Latency Reduction!

fbarrier: ordering

App:

Ordering:

Durability:

HoraeFS

CPU IO Context Switch Ordering Layer

30

separate ordering logic from durability logic

fsync

Page 31: Write Dependency Disentanglement with HORAE

Use Case 2: HoraeStore

CPU IO Ordering Layer

TXN1:

TXN2:

Ceph BlueStore

TXN1:

TXN2:

HoraeStore31

parallelize dependent transactions

Concurrently processed!

Serially processed.

Page 32: Write Dependency Disentanglement with HORAE

3. Evaluation

[1] Barrier-Enabled IO stack for Flash Storage, FAST’18

Test layer

Block DeviceFile System

Storage

Application

• FIO random overwrite

Benchmarks

• FIO allocating write

• OLTP-insert, dbbench, objectbench

• Lower-end SATA SSD A: Samsung 860 Pro

• Medium-end NVMe SSD B: Intel 750 (consumer-grade)

• High-end NVMe SSD C: Intel DC P3700 (datacenter-grade)

Hardware

• Linux VS. Barrier[1] VS. Horae

Compared systems

32

Page 33: Write Dependency Disentanglement with HORAE

Block Device Performance

0

1000

2000

3000

4000

4 8 16 32 64

Thro

ughpu

t (M

B/s

)

Write Size (KB)

3 NVMe SSDs (2B + C)

Linux Horae

0

500

1000

1500

2000

4 8 16 32 64

Thro

ughpu

t (M

B/s

)

Write Size (KB)

1 NVMe SSD C

Linux Barrier-enabled Horae

• Single Device • Multiple Devices

Maximum Device Bandwidth

2x

4.1x 6.8x

33Separating the ordering control path is efficient.

Page 34: Write Dependency Disentanglement with HORAE

File System Performance

0

50

100

1 4 12

KIO

PS

0

50

100

1 4 12

KIO

PS

Threadsext4 BarrierFS HoraeFS

Durability

Ordering

0

50

100

150

1 4 12

KIO

PS

Durability

0

50

100

150

1 4 12

KIO

PS

Threadsext4 HoraeFS

Ordering

1.5x

1.6x

2.2x

34

• 1 NVMe SSD (B) • 2 NVMe SSDs (B+C)

Parallelizing the data and journal processing is efficient.

Page 35: Write Dependency Disentanglement with HORAE

0

20

40

60

80

100

fsync fbarrier

KT

X/s

ext4-RAID0 ext4

HoraeFS-RAID0 HoraeFS

Application Performance-MySQL

0

10

20

30

40

50

fsync fbarrier

KT

X/s

ext4 BarrierFS HoraeFS

1.8x

1.4x1.6x

ext4 fbarrier: nobarrier option, weaker ordering

35

• Sole • Separate redo log

• Horae boosts MySQL performance by up to 1.8x• HoraeFS-RAID0 < HoraeFS

Page 36: Write Dependency Disentanglement with HORAE

Application Performance-HoraeStore

0

1

2

3

4

5

1 2 4 6 8 10 12

TPS

(K)

# of sequencers

BlueStore-S BlueStore-M

HoraeStore-S HoraeStore-M

0

5

10

15

1 2 4 6 8 10 12

TPS

(K)

# of sequencers

BlueStore-S BlueStore-M

HoraeStore-S HoraeStore-M

Sole: SSD A; Multiple: SSD A+B+C

1.4x2x

36

• apply_transaction • queue_transaction

HoraeStore outperforms BlueStore by up to 2x.

Page 37: Write Dependency Disentanglement with HORAE

Conclusion• The write dependency overhead becomes more severe with

the scaling of storage concurrency.

• We re-architect IO stack with HORAE, providing a dedicated control path to reduce the write dependency overhead.

• Horae boosts file system level and application level performance by up to 2.2x and 2x.

Thank You!

[email protected]

Write Dependency Disentanglement with HORAE

Xiaojian Liao, Youyou Lu, Erci Xu, Jiwu Shu

37