the linux block layer - built for fast storage
TRANSCRIPT
The Linux Block Layer - Built for Fast Storage
Light up your cloud!
Sagi Grimberg, KernelTLV, 27/6/18
1
First off, happy 1st birthday, Roni!
2
Who am I?
• Co-founder and Principal Architect @ Lightbits Labs
• Lightbits Labs is a stealth-mode startup pushing the software and hardware technology boundaries in cloud-scale storage.
• We are looking for excellent people who enjoy a challenge, for a variety of positions spanning both software and hardware. More information is on our website at http://www.lightbitslabs.com/#join, or talk with me after the talk.
• Active contributor to the Linux I/O and RDMA stacks
• Maintainer of the iSCSI over RDMA (iSER) drivers
• Co-maintainer of the Linux NVMe subsystem
• Used to work for Mellanox on Storage, Networking, RDMA and pretty much everything in between…
3
Where were we 10 years ago?
• Only rotating storage devices existed
• Devices were limited to hundreds of IOPS
• Device access latency was in the milliseconds ballpark
• The Linux block layer was sufficient to handle these devices
• High performance applications found clever ways to avoid storage access as much as possible
4
What happened? (hint: HW)
• Flash SSDs started appearing in the datacenter
• IOPS went from hundreds to hundreds of thousands to millions
• Latency went from milliseconds to microseconds
• Fast interfaces evolved: PCIe (NVMe)
• Processor core counts increased a lot!
• And NUMA...
5
I/O Stack
6
What are the issues?
• The existing I/O stack had a lot of data sharing
• Between different applications (running on different cores)
• Between submission and completion
• Locking for synchronization
• Zero NUMA awareness
• All stack heuristics and optimizations centered around slow storage
• The result is very bad (even negative) scaling, lots of wasted CPU cycles, and much, much higher latencies.
7
I/O Stack - Little deeper
8
I/O Stack - Little deeper
9
Hmmm...
- Requests are serialized
- Placed for staging
- Retrieved by the drivers
⇒ Lots of shared state!
I/O Stack - Performance
10
I/O Stack - Performance
11
Workaround: Bypass the request layer
12
Problems with bypass:
● Give up flow control
● Give up error handling
● Give up statistics
● Give up tagging and indexing
● Give up I/O deadlines
● Give up I/O scheduling
● Crazy code duplication - mistakes are copied because people get stuff wrong...
Most importantly, this is not the Linux design approach!
Enter Block Multiqueue
• The old stack does not consist of “one serialization point” that can simply be removed
• The stack needed a complete rewrite from the ground up
• What do we do:
• Go look at the networking stack, which solved the exact same issue 10+ years ago
• But build from scratch for storage devices
13
Block Multiqueue - Goals
• Linear scaling with CPU cores
• Split shared state between applications and between submission/completion
• Careful locality awareness: cachelines, NUMA
• Pre-allocate resources as much as possible
• Provide full helper functionality - ease of implementation
• Support all existing HW
• Become THE queueing mode, not a “third” one
14
Block Multiqueue - Architecture
15
Block Multiqueue - Features
• Efficient tagging
• Locality of submissions and completions
• Extreme care to minimize cache pollution
• Smart error handling - minimal intrusion into the hot path
• Smart CPU <-> queue mappings
• Clean API
• Easy conversion (usually just cleaning up old cruft)
16
Block Multiqueue - I/O Flow
17
Block Multiqueue - Completions
18
● Applications are usually “cpu-sticky”
● If the I/O completion arrives on the “correct” CPU, complete it there
● Otherwise, IPI to the “correct” CPU (a rough sketch of this steering follows below)
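To make the decision concrete, here is a rough, illustrative sketch of that steering logic; it is not the upstream code, and the accessors complete_locally(), ipi_complete_on() and request_submit_cpu() are invented names standing in for the real per-request machinery in block/blk-mq.c:

/* Illustrative sketch only - not the upstream implementation. */
void sketch_complete_request(struct request *rq)
{
	int submit_cpu = request_submit_cpu(rq);   /* CPU that queued the I/O (invented accessor) */
	int this_cpu = smp_processor_id();         /* CPU the completion landed on */

	if (submit_cpu == this_cpu) {
		/* Caches are hot here: run the completion handler locally. */
		complete_locally(rq);
	} else {
		/* Send an IPI so the completion runs on the CPU that owns
		 * the request's cachelines and NUMA-local memory. */
		ipi_complete_on(submit_cpu, rq);
	}
}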
Block Multiqueue - Tagging
19
• Almost every modern HW supports queueing
• Tags are used to identify individual I/Os in the presence of out-of-order completions
• Tags are limited by the capabilities of the HW, so the driver needs to flow control
Block Multiqueue - Tagging
20
• Per-CPU, cacheline-aware scalable bitmaps (a toy model follows below)
• Efficient at near-exhaustion
• Rolling wake-ups
• Maps 1x1 with HW usage - no driver-specific tagging
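As a toy illustration of the per-CPU, cacheline-aware bitmap idea (the real implementation is the kernel's sbitmap code; this is a simplified, single-threaded model with made-up sizes):

#include <stdint.h>
#include <stdio.h>

#define WORDS         8        /* one 64-bit word per "cacheline" bucket */
#define TAGS_PER_WORD 64

static uint64_t bitmap[WORDS];  /* bit set == tag in use */

/* Allocate a tag, starting the search at the word "hinted" by the calling
 * CPU, so allocators on different CPUs rarely touch the same cacheline. */
static int alloc_tag(int cpu)
{
	for (int i = 0; i < WORDS; i++) {
		int w = (cpu + i) % WORDS;
		for (int b = 0; b < TAGS_PER_WORD; b++) {
			uint64_t mask = 1ULL << b;
			if (!(bitmap[w] & mask)) {
				bitmap[w] |= mask;  /* real code uses an atomic test-and-set */
				return w * TAGS_PER_WORD + b;
			}
		}
	}
	return -1;  /* no tags left: caller must block, i.e. flow control */
}

static void free_tag(int tag)
{
	bitmap[tag / TAGS_PER_WORD] &= ~(1ULL << (tag % TAGS_PER_WORD));
}

int main(void)
{
	int t0 = alloc_tag(0), t1 = alloc_tag(3);
	printf("cpu0 got tag %d, cpu3 got tag %d\n", t0, t1);
	free_tag(t0);
	free_tag(t1);
	return 0;
}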
Block Multiqueue - Pre-allocations
21
• Eliminate hot path allocations
• Allocate all request memory at initialization time
• Tag and request allocations are combined (no two-step allocation)
• No driver per-request allocation
• Driver context and SG lists are placed in “slack space” behind the request (see the sketch below)
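A hedged sketch of how a driver asks for that slack space via the blk-mq tag-set API, roughly as it looked around the time of this talk (my_cmd, my_mq_ops, my_queue_rq and MY_MAX_SEGS are made-up driver names; exact field names and signatures vary by kernel version):

#include <linux/blk-mq.h>

#define MY_MAX_SEGS 128                       /* made-up driver limit */

struct my_cmd {
	struct scatterlist sg[MY_MAX_SEGS];   /* SG list lives in the slack space */
	int status;
};

static struct blk_mq_tag_set tag_set = {
	.ops		= &my_mq_ops,         /* driver's blk_mq_ops, defined elsewhere */
	.nr_hw_queues	= 4,
	.queue_depth	= 128,
	.numa_node	= NUMA_NO_NODE,
	/* cmd_size bytes are allocated right behind each struct request
	 * when the tag set is allocated: no per-I/O allocation, ever. */
	.cmd_size	= sizeof(struct my_cmd),
};

static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
				const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;
	struct my_cmd *cmd = blk_mq_rq_to_pdu(rq);   /* pointer just past *rq */

	/* ... map cmd->sg and issue to HW, using rq->tag as the HW command id ... */
	return BLK_STS_OK;
}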
Block Multiqueue - Performance
22
Test case:
- null_blk driver
- fio - 4K sync random read
- Dual-socket system
Block Multiqueue - perf profiling
23
• Locking time is drastically reduced
• fio reports much less “system time”
• Average and tail latencies are much lower and more consistent
Next on the Agenda: SCSI, NVMe and friends
• NVMe started as a bypass driver - converted to blk-mq
• mtip32xx (Micron)
• virtio_blk, xen
• rbd (ceph)
• loop
• more...
• The SCSI midlayer was a bigger project...
24
SCSI multiqueue
• Needed the concept of “shared tag sets”
• Tags are now a property of the HBA and not the storage device
• Needed chunking of scatter-gather lists
• SCSI HBAs support huge SG lists, too much to allocate up front
• Needed “head of queue” insertion
• For SCSI’s complex error handling
• Removed the “big SCSI host_busy lock”
• Reduced the huge contention on the SCSI target “busy” atomic
• Needed partial completion support
• Needed BIDI support (yuck...)
• Hardened the stack a lot with lots of user bug reports
25
Block Multiqueue - MSI(X) based queue mapping
26
● Motivation: eliminate the IPI case
● Expose MSI(X) vector affinity mappings to the block layer
● Map the HW context mappings via the underlying device IRQ mappings
● Offer MSI(X) allocation and correct affinity spreading via the PCI subsystem
● Take advantage of this in PCI-based drivers (nvme, rdma, fc, hpsa, etc.) - see the sketch below
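Roughly, a PCI blk-mq driver ties the two together as sketched here (hedged: the helper signatures have shifted across kernel versions, and my_setup_irqs / my_map_queues are invented names; the sketch assumes the driver stashed its pci_dev in the tag set's driver_data):

#include <linux/blk-mq.h>
#include <linux/blk-mq-pci.h>
#include <linux/pci.h>

static int my_setup_irqs(struct pci_dev *pdev, struct blk_mq_tag_set *set)
{
	int nr_vecs;

	/* Ask the PCI core for one MSI-X vector per queue, spreading the
	 * IRQ affinity masks evenly across online CPUs and NUMA nodes. */
	nr_vecs = pci_alloc_irq_vectors(pdev, 1, set->nr_hw_queues,
					PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
	if (nr_vecs < 0)
		return nr_vecs;

	set->nr_hw_queues = nr_vecs;   /* one hardware context per vector */
	return 0;
}

/* .map_queues callback: derive the cpu -> hw queue map from the IRQ
 * affinity the PCI core just set up, so completions arrive on a CPU
 * that submits to that queue and the IPI fallback is never needed. */
static int my_map_queues(struct blk_mq_tag_set *set)
{
	struct pci_dev *pdev = set->driver_data;   /* stashed by the driver */

	return blk_mq_pci_map_queues(set, pdev);
}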
But wait, what about I/O schedulers?
• What we didn’t mention was that block multiqueue lacked a proper I/O scheduler for approximately 3 years!
• A fundamental part of the I/O stack functionality is scheduling
• Optimize I/O sequentiality - the elevator algorithm
• Prevent write vs. read starvation (e.g. the deadline scheduler)
• Fairness enforcement (e.g. CFQ)
• One can argue that I/O scheduling was designed for rotating media
• Optimized for reducing actuator seek time
NOT NECESSARILY TRUE - flash can benefit from scheduling!
27
Start from ground up: WriteBack Throttling
• Linux, since the dawn of time, has sucked at buffered I/O
• Writes are naturally buffered and committed to disk in the background
• Background writeback needs to have little or no impact on foreground activity
• What was needed:
• Plumb I/O stats for submitted reads and writes
• Track average latency at window granularity, plus what is currently enqueued
• Scale the queue depth accordingly (a toy model of this scaling follows below)
• Prefer reads over non-direct-I/O writes
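A toy model of the scaling idea (this is not the actual blk-wbt code; the constants, names and the halving/increment policy are invented for illustration):

#include <stdio.h>

#define MIN_WB_DEPTH 1
#define MAX_WB_DEPTH 32

static int wb_depth = MAX_WB_DEPTH;   /* background writes allowed in flight */

/* Called once per stats window with the mean read latency observed. */
static void scale_wb_depth(unsigned long mean_read_lat_us,
			   unsigned long target_lat_us)
{
	if (mean_read_lat_us > target_lat_us) {
		/* Foreground reads are suffering: back off writeback. */
		if (wb_depth > MIN_WB_DEPTH)
			wb_depth /= 2;
	} else {
		/* Reads are within target: slowly give writeback more room. */
		if (wb_depth < MAX_WB_DEPTH)
			wb_depth++;
	}
}

int main(void)
{
	scale_wb_depth(900, 500);   /* reads too slow -> depth shrinks */
	scale_wb_depth(200, 500);   /* reads healthy  -> depth creeps up */
	printf("background write depth: %d\n", wb_depth);
	return 0;
}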
28
WriteBack Throttling - Performance
29
Before... After...
Now we are ready for I/O schedulers - MQ-Deadline
• Added interception of requests for building schedulers on top
• The first MQ conversion was the deadline scheduler
• Pretty easy and straightforward
• Just delay the writes FIFO until the deadline hits
• The reads FIFO is pass-through
• All per-CPU context - tradeoff?
• Remember: an I/O scheduler can hurt synthetic workloads, but it is the impact on real-life workloads that matters.
30
Next: Kyber I/O Scheduler
• Targeted at fast multi-queue devices
• Lightweight
• Prefers reads over writes
• All I/Os are split into two queues (reads and writes)
• Reads are typically preferred
• Writes are throttled, but not to the point of starvation
• The key is to keep submission queues short to guarantee latency targets
• Kyber tracks I/O latency stats and adjusts queue sizes accordingly
• Aware of flash background operations
31
Next: BFQ I/O Scheduler
• Budget Fair Queueing scheduler
• A lot heavier
• Maintains a per-process I/O budget
• Maintains a bunch of per-process heuristics
• Yields the “best” I/O to queue at any given time
• A better fit for slower storage, especially rotating media and cheap & deep SSDs
32
But wait #2: What about ultra-low latency devices?
• New media is emerging with ultra-low latency (1-2 us)
• 3D XPoint
• Z-NAND
• Even with block MQ, the Linux I/O stack still has issues providing these latencies
• It starts with IRQ (interrupt handling)
• If the I/O is so fast, we might want to poll for completion and avoid paying the cost of an MSI(X) interrupt
33
Interrupt based I/O completion model
34
Polling based I/O completion model
35
IRQ vs. Polling
36
• Polling can remove the extra context switch from the completion handling
So we should support polling!
37
• Add a selective polling syscall interface:
• Use preadv2/pwritev2 with the RWF_HIPRI flag (see the example below)
• Saves roughly 25% of the added latency
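For example, a minimal user-space polled read might look like this (assumes a kernel and glibc recent enough to provide preadv2 and RWF_HIPRI, an O_DIRECT-capable device path passed as the first argument, and a driver with polling enabled, e.g. via the queue's io_poll sysfs attribute):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <device-or-file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) { perror("open"); return 1; }

	void *buf;
	if (posix_memalign(&buf, 4096, 4096))   /* O_DIRECT alignment */
		return 1;

	struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

	/* RWF_HIPRI asks the kernel to busy-poll for this I/O's completion
	 * instead of sleeping until the device interrupt fires. */
	ssize_t ret = preadv2(fd, &iov, 1, 0, RWF_HIPRI);
	if (ret < 0)
		perror("preadv2");
	else
		printf("read %zd bytes with polled completion\n", ret);

	close(fd);
	return 0;
}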
But what about CPU% - can we go hybrid?
38
• Yes!
• We have the statistics framework in place, let’s use it for hybrid polling!
• Wake up the poller after ½ of the mean latency (a toy model follows below)
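A toy model of the hybrid idea (not the kernel implementation; io_completed() here just simulates a completion queue becoming ready):

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Stand-in for "is the device's completion queue entry ready yet?" */
static int poll_budget = 1000;
static bool io_completed(void)
{
	return --poll_budget <= 0;   /* pretend the I/O finishes eventually */
}

static void hybrid_poll(unsigned long mean_lat_ns)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = (long)(mean_lat_ns / 2) };

	/* Phase 1: sleep for ~half the expected completion time, giving the
	 * CPU back instead of burning it for the whole I/O. */
	nanosleep(&ts, NULL);

	/* Phase 2: busy-poll for the remainder, so the completion is seen
	 * immediately without the interrupt + context-switch cost. */
	while (!io_completed())
		;
}

int main(void)
{
	hybrid_poll(10000);   /* e.g. a mean latency of 10us observed so far */
	printf("I/O completed after hybrid poll\n");
	return 0;
}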
Hybrid polling - Performance
39
Hybrid polling - Adjust to I/O size
40
• The block layer sees I/Os of different sizes
• Some are 4K, some are 256K, and some are 1-2MB
• We need to consider that when tracking stats for polling decisions
• Simple solution: bucketize the stats...
• 0-4K
• 4-16K
• 16-64K
• >64K
• Now hybrid polling has good QoS! (a bucketing helper is sketched below)
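A tiny helper sketch of the bucketing, mirroring the size classes listed above (illustrative only; the enum names are made up), so the hybrid-poll sleep time is derived from the latency stats of similarly sized I/Os:

#include <stddef.h>

enum io_size_bucket { BUCKET_0_4K, BUCKET_4_16K, BUCKET_16_64K, BUCKET_64K_PLUS };

static enum io_size_bucket size_to_bucket(size_t bytes)
{
	if (bytes <= 4 * 1024)
		return BUCKET_0_4K;
	if (bytes <= 16 * 1024)
		return BUCKET_4_16K;
	if (bytes <= 64 * 1024)
		return BUCKET_16_64K;
	return BUCKET_64K_PLUS;      /* the 256K and 1-2MB I/Os land here */
}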
To Conclude
41
• Lots of interesting stuff happening in Linux
• Linux belongs to everyone, get involved!
• We always welcome patches and bug reports :)
42
LIGHT UP YOUR CLOUD!