the linux block layer - built for fast storage
TRANSCRIPT
The Linux Block Layer - Built for Fast Storage
Light up your cloud!
Sagi Grimberg, KernelTLV, 27/6/18
1
First off, happy 1st birthday, Roni!
2
Who am I?
• Co-founder and Principal Architect @ Lightbits Labs
• Lightbits Labs is a stealth-mode startup pushing the software and hardware technology boundaries in cloud-scale storage.
• We are looking for excellent people who enjoy a challenge, for a variety of positions spanning both software and hardware. More information is on our website at http://www.lightbitslabs.com/#join, or talk with me after the talk.
• Active contributor to the Linux I/O and RDMA stacks
• Maintainer of the iSCSI over RDMA (iSER) drivers
• Co-maintainer of the Linux NVMe subsystem
• Used to work for Mellanox on Storage, Networking, RDMA and pretty much everything in between…
3
Where were we 10 years ago?
• Only rotating storage devices existed
• Devices were limited to hundreds of IOPS
• Device access latency was in the milliseconds ballpark
• The Linux block layer was sufficient to handle these devices
• High performance applications found clever ways to avoid storage access as much as possible
4
What happened? (hint: HW)
• Flash SSDs started appearing in the datacenter
• IOPS went from hundreds to hundreds of thousands to millions
• Latency went from milliseconds to microseconds
• Fast interfaces evolved: PCIe (NVMe)
• Processor core counts increased a lot!
• And NUMA...
5
I/O Stack
6
What are the issues?
• The existing I/O stack had a lot of data sharing
• Between different applications (running on different cores)
• Between submission and completion
• Locking for synchronization
• Zero NUMA awareness
• All stack heuristics and optimizations centered around slow storage
• The result is very bad (even negative) scaling, lots of wasted CPU cycles, and much, much higher latencies.
7
I/O Stack - Little deeper
8
I/O Stack - Little deeper
9
Hmmm...
- Requests are serialized
- Placed for staging
- Retrieved by the drivers
⇒ Lots of shared state!
I/O Stack - Performance
10
I/O Stack - Performance
11
Workaround: Bypass the request layer
12
Problems with bypass:
● Give up flow control
● Give up error handling
● Give up statistics
● Give up tagging and indexing
● Give up I/O deadlines
● Give up I/O scheduling
● Crazy code duplication - mistakes are copied because people get stuff wrong...
Most importantly, this is not the Linux design approach!
Enter Block Multiqueue
• The old stack does not consist of “one serialization point” that can simply be removed
• The stack needed a complete rewrite from the ground up
• What do we do:
• Go look at the networking stack, which solved the exact same issue 10+ years ago
• But build from scratch for storage devices
13
Block Multiqueue - Goals
• Linear scaling with CPU cores
• Split shared state between applications and between submission/completion
• Careful locality awareness: cachelines, NUMA
• Pre-allocate resources as much as possible
• Provide full helper functionality - ease of implementation
• Support all existing HW
• Become THE queueing mode, not a “third” one
14
Block Multiqueue - Architecture
15
Block Multiqueue - Features
• Efficient tagging
• Locality of submissions and completions
• Extreme care to minimize cache pollution
• Smart error handling - minimal intrusion into the hot path
• Smart CPU <-> queue mappings
• Clean API
• Easy conversion (usually just cleaning up old cruft)
16
Block Multiqueue - I/O Flow
17
Block Multiqueue - Completions
18
● Applications are usually “cpu-sticky”
● If the I/O completion arrives on the “correct” CPU, complete it there
● Otherwise, IPI to the “correct” CPU (a rough sketch of this steering follows below)
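To make the decision concrete, here is a rough, illustrative sketch of that steering logic; it is not the upstream code, and the accessors complete_locally(), ipi_complete_on() and request_submit_cpu() are invented names standing in for the real per-request machinery in block/blk-mq.c:

/* Illustrative sketch only - not the upstream implementation. */
void sketch_complete_request(struct request *rq)
{
	int submit_cpu = request_submit_cpu(rq);   /* CPU that queued the I/O (invented accessor) */
	int this_cpu = smp_processor_id();         /* CPU the completion landed on */

	if (submit_cpu == this_cpu) {
		/* Caches are hot here: run the completion handler locally. */
		complete_locally(rq);
	} else {
		/* Send an IPI so the completion runs on the CPU that owns
		 * the request's cachelines and NUMA-local memory. */
		ipi_complete_on(submit_cpu, rq);
	}
}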
Block Multiqueue - Tagging
19
• Almost every modern HW supports queueing
• Tags are used to identify individual I/Os in the presence of out-of-order completions
• Tags are limited by the capabilities of the HW, so the driver needs to flow control
Block Multiqueue - Tagging
20
• Per-CPU, cacheline-aware scalable bitmaps (a toy model follows below)
• Efficient at near-exhaustion
• Rolling wake-ups
• Maps 1x1 with HW usage - no driver-specific tagging
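As a toy illustration of the per-CPU, cacheline-aware bitmap idea (the real implementation is the kernel's sbitmap code; this is a simplified, single-threaded model with made-up sizes):

#include <stdint.h>
#include <stdio.h>

#define WORDS         8        /* one 64-bit word per "cacheline" bucket */
#define TAGS_PER_WORD 64

static uint64_t bitmap[WORDS];  /* bit set == tag in use */

/* Allocate a tag, starting the search at the word "hinted" by the calling
 * CPU, so allocators on different CPUs rarely touch the same cacheline. */
static int alloc_tag(int cpu)
{
	for (int i = 0; i < WORDS; i++) {
		int w = (cpu + i) % WORDS;
		for (int b = 0; b < TAGS_PER_WORD; b++) {
			uint64_t mask = 1ULL << b;
			if (!(bitmap[w] & mask)) {
				bitmap[w] |= mask;  /* real code uses an atomic test-and-set */
				return w * TAGS_PER_WORD + b;
			}
		}
	}
	return -1;  /* no tags left: caller must block, i.e. flow control */
}

static void free_tag(int tag)
{
	bitmap[tag / TAGS_PER_WORD] &= ~(1ULL << (tag % TAGS_PER_WORD));
}

int main(void)
{
	int t0 = alloc_tag(0), t1 = alloc_tag(3);
	printf("cpu0 got tag %d, cpu3 got tag %d\n", t0, t1);
	free_tag(t0);
	free_tag(t1);
	return 0;
}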
Block Multiqueue - Pre-allocations
21
• Eliminate hot path allocations
• Allocate all request memory at initialization time
• Tag and request allocations are combined (no two-step allocation)
• No driver per-request allocation
• Driver context and SG lists are placed in “slack space” behind the request (see the sketch below)
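A hedged sketch of how a driver asks for that slack space via the blk-mq tag-set API, roughly as it looked around the time of this talk (my_cmd, my_mq_ops, my_queue_rq and MY_MAX_SEGS are made-up driver names; exact field names and signatures vary by kernel version):

#include <linux/blk-mq.h>

#define MY_MAX_SEGS 128                       /* made-up driver limit */

struct my_cmd {
	struct scatterlist sg[MY_MAX_SEGS];   /* SG list lives in the slack space */
	int status;
};

static struct blk_mq_tag_set tag_set = {
	.ops		= &my_mq_ops,         /* driver's blk_mq_ops, defined elsewhere */
	.nr_hw_queues	= 4,
	.queue_depth	= 128,
	.numa_node	= NUMA_NO_NODE,
	/* cmd_size bytes are allocated right behind each struct request
	 * when the tag set is allocated: no per-I/O allocation, ever. */
	.cmd_size	= sizeof(struct my_cmd),
};

static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
				const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;
	struct my_cmd *cmd = blk_mq_rq_to_pdu(rq);   /* pointer just past *rq */

	/* ... map cmd->sg and issue to HW, using rq->tag as the HW command id ... */
	return BLK_STS_OK;
}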
Block Multiqueue - Performance
22
Test case:
- null_blk driver
- fio - 4K sync random read
- Dual-socket system
Block Multiqueue - perf profiling
23
• Locking time is drastically reduced
• fio reports much less “system time”
• Average and tail latencies are much lower and more consistent
Next on the Agenda: SCSI, NVMe and friends
• NVMe started as a bypass driver - converted to blk-mq
• mtip32xx (Micron)
• virtio_blk, xen
• rbd (ceph)
• loop
• more...
• The SCSI midlayer was a bigger project...
24
SCSI multiqueue
• Needed the concept of “shared tag sets”
• Tags are now a property of the HBA and not the storage device
• Needed chunking of scatter-gather lists
• SCSI HBAs support huge SG lists, too much to allocate up front
• Needed “head of queue” insertion
• For SCSI’s complex error handling
• Removed the “big SCSI host_busy lock”
• Reduced the huge contention on the SCSI target “busy” atomic
• Needed partial completion support
• Needed BIDI support (yuck...)
• Hardened the stack a lot with lots of user bug reports
25
Block Multiqueue - MSI(X) based queue mapping
26
● Motivation: eliminate the IPI case
● Expose MSI(X) vector affinity mappings to the block layer
● Map the HW context mappings via the underlying device IRQ mappings
● Offer MSI(X) allocation and correct affinity spreading via the PCI subsystem
● Take advantage of this in PCI-based drivers (nvme, rdma, fc, hpsa, etc.) - see the sketch below
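Roughly, a PCI blk-mq driver ties the two together as sketched here (hedged: the helper signatures have shifted across kernel versions, and my_setup_irqs / my_map_queues are invented names; the sketch assumes the driver stashed its pci_dev in the tag set's driver_data):

#include <linux/blk-mq.h>
#include <linux/blk-mq-pci.h>
#include <linux/pci.h>

static int my_setup_irqs(struct pci_dev *pdev, struct blk_mq_tag_set *set)
{
	int nr_vecs;

	/* Ask the PCI core for one MSI-X vector per queue, spreading the
	 * IRQ affinity masks evenly across online CPUs and NUMA nodes. */
	nr_vecs = pci_alloc_irq_vectors(pdev, 1, set->nr_hw_queues,
					PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
	if (nr_vecs < 0)
		return nr_vecs;

	set->nr_hw_queues = nr_vecs;   /* one hardware context per vector */
	return 0;
}

/* .map_queues callback: derive the cpu -> hw queue map from the IRQ
 * affinity the PCI core just set up, so completions arrive on a CPU
 * that submits to that queue and the IPI fallback is never needed. */
static int my_map_queues(struct blk_mq_tag_set *set)
{
	struct pci_dev *pdev = set->driver_data;   /* stashed by the driver */

	return blk_mq_pci_map_queues(set, pdev);
}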
But wait, what about I/O schedulers?
• What we didn’t mention was that block multiqueue lacked a proper I/O scheduler for approximately 3 years!
• A fundamental part of the I/O stack functionality is scheduling
• Optimize I/O sequentiality - the elevator algorithm
• Prevent write vs. read starvation (e.g. the deadline scheduler)
• Fairness enforcement (e.g. CFQ)
• One can argue that I/O scheduling was designed for rotating media
• Optimized for reducing actuator seek time
NOT NECESSARILY TRUE - flash can benefit from scheduling!
27
Start from ground up: WriteBack Throttling
• Linux, since the dawn of time, has sucked at buffered I/O
• Writes are naturally buffered and committed to disk in the background
• Background writeback needs to have little or no impact on foreground activity
• What was needed:
• Plumb I/O stats for submitted reads and writes
• Track average latency at window granularity, plus what is currently enqueued
• Scale the queue depth accordingly (a toy model of this scaling follows below)
• Prefer reads over non-direct-I/O writes
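A toy model of the scaling idea (this is not the actual blk-wbt code; the constants, names and the halving/increment policy are invented for illustration):

#include <stdio.h>

#define MIN_WB_DEPTH 1
#define MAX_WB_DEPTH 32

static int wb_depth = MAX_WB_DEPTH;   /* background writes allowed in flight */

/* Called once per stats window with the mean read latency observed. */
static void scale_wb_depth(unsigned long mean_read_lat_us,
			   unsigned long target_lat_us)
{
	if (mean_read_lat_us > target_lat_us) {
		/* Foreground reads are suffering: back off writeback. */
		if (wb_depth > MIN_WB_DEPTH)
			wb_depth /= 2;
	} else {
		/* Reads are within target: slowly give writeback more room. */
		if (wb_depth < MAX_WB_DEPTH)
			wb_depth++;
	}
}

int main(void)
{
	scale_wb_depth(900, 500);   /* reads too slow -> depth shrinks */
	scale_wb_depth(200, 500);   /* reads healthy  -> depth creeps up */
	printf("background write depth: %d\n", wb_depth);
	return 0;
}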
28
WriteBack Throttling - Performance
29
Before... After...
Now we are ready for I/O schedulers - MQ-Deadline
• Added interception of requests for building schedulers on top
• The first MQ conversion was the deadline scheduler
• Pretty easy and straightforward
• Just delay the writes FIFO until the deadline hits
• The reads FIFO is pass-through
• All per-CPU context - tradeoff?
• Remember: an I/O scheduler can hurt synthetic workloads, but it is the impact on real-life workloads that matters.
30
Next: Kyber I/O Scheduler
• Targeted at fast multi-queue devices
• Lightweight
• Prefers reads over writes
• All I/Os are split into two queues (reads and writes)
• Reads are typically preferred
• Writes are throttled, but not to the point of starvation
• The key is to keep submission queues short to guarantee latency targets
• Kyber tracks I/O latency stats and adjusts queue sizes accordingly
• Aware of flash background operations
31
Next: BFQ I/O Scheduler
• Budget Fair Queueing scheduler
• A lot heavier
• Maintains a per-process I/O budget
• Maintains a bunch of per-process heuristics
• Yields the “best” I/O to queue at any given time
• A better fit for slower storage, especially rotating media and cheap & deep SSDs
32
But wait #2: What about ultra-low latency devices?
• New media is emerging with ultra-low latency (1-2 us)
• 3D XPoint
• Z-NAND
• Even with block MQ, the Linux I/O stack still has issues providing these latencies
• It starts with IRQ (interrupt handling)
• If the I/O is so fast, we might want to poll for completion and avoid paying the cost of an MSI(X) interrupt
33
Interrupt based I/O completion model
34
Polling based I/O completion model
35
IRQ vs. Polling
36
• Polling can remove the extra context switch from the completion handling
So we should support polling!
37
• Add a selective polling syscall interface:
• Use preadv2/pwritev2 with the RWF_HIPRI flag (see the example below)
• Saves roughly 25% of the added latency
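For example, a minimal user-space polled read might look like this (assumes a kernel and glibc recent enough to provide preadv2 and RWF_HIPRI, an O_DIRECT-capable device path passed as the first argument, and a driver with polling enabled, e.g. via the queue's io_poll sysfs attribute):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <device-or-file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) { perror("open"); return 1; }

	void *buf;
	if (posix_memalign(&buf, 4096, 4096))   /* O_DIRECT alignment */
		return 1;

	struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

	/* RWF_HIPRI asks the kernel to busy-poll for this I/O's completion
	 * instead of sleeping until the device interrupt fires. */
	ssize_t ret = preadv2(fd, &iov, 1, 0, RWF_HIPRI);
	if (ret < 0)
		perror("preadv2");
	else
		printf("read %zd bytes with polled completion\n", ret);

	close(fd);
	return 0;
}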
But what about CPU% - can we go hybrid?
38
• Yes!
• We have the statistics framework in place, let’s use it for hybrid polling!
• Wake up the poller after ½ of the mean latency (a toy model follows below)
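A toy model of the hybrid idea (not the kernel implementation; io_completed() here just simulates a completion queue becoming ready):

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Stand-in for "is the device's completion queue entry ready yet?" */
static int poll_budget = 1000;
static bool io_completed(void)
{
	return --poll_budget <= 0;   /* pretend the I/O finishes eventually */
}

static void hybrid_poll(unsigned long mean_lat_ns)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = (long)(mean_lat_ns / 2) };

	/* Phase 1: sleep for ~half the expected completion time, giving the
	 * CPU back instead of burning it for the whole I/O. */
	nanosleep(&ts, NULL);

	/* Phase 2: busy-poll for the remainder, so the completion is seen
	 * immediately without the interrupt + context-switch cost. */
	while (!io_completed())
		;
}

int main(void)
{
	hybrid_poll(10000);   /* e.g. a mean latency of 10us observed so far */
	printf("I/O completed after hybrid poll\n");
	return 0;
}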
Hybrid polling - Performance
39
Hybrid polling - Adjust to I/O size
40
• The block layer sees I/Os of different sizes
• Some are 4K, some are 256K, and some are 1-2MB
• We need to consider that when tracking stats for polling decisions
• Simple solution: bucketize the stats...
• 0-4K
• 4-16K
• 16-64K
• >64K
• Now hybrid polling has good QoS! (a bucketing helper is sketched below)
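A tiny helper sketch of the bucketing, mirroring the size classes listed above (illustrative only; the enum names are made up), so the hybrid-poll sleep time is derived from the latency stats of similarly sized I/Os:

#include <stddef.h>

enum io_size_bucket { BUCKET_0_4K, BUCKET_4_16K, BUCKET_16_64K, BUCKET_64K_PLUS };

static enum io_size_bucket size_to_bucket(size_t bytes)
{
	if (bytes <= 4 * 1024)
		return BUCKET_0_4K;
	if (bytes <= 16 * 1024)
		return BUCKET_4_16K;
	if (bytes <= 64 * 1024)
		return BUCKET_16_64K;
	return BUCKET_64K_PLUS;      /* the 256K and 1-2MB I/Os land here */
}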
To Conclude
41
• Lots of interesting stuff happening in Linux
• Linux belongs to everyone, get involved!
• We always welcome patches and bug reports :)
42
LIGHT UP YOUR CLOUD!