K2: Work-Constraining Scheduling of NVMe-Attached Storage
Till Miemietz, Hannes Weisbach, Michael Roitzsch and Hermann Härtig
Presentation at the 40th IEEE Real-Time Systems Symposium
Hong Kong · 4th of December 2019
What are the implications of fast storage devices for real-time systems?
What May a Modern Storage Stack Look Like?
[Figure: the Linux storage stack. Applications (e.g., file systems) running on CPUs 0-2 submit bios to the block layer; an I/O scheduler-specific staging queue scheme turns them into requests, which pass through FIFO-only dispatching queues to the driver; the driver sends NVMe commands to the SSD via NVMe queue pairs.]
What May a Modern Storage Stack Look Like?
[Figure: inside the SSD. NVMe commands from the driver arrive through submission/completion queue pairs (SQ/CQ) at a controller with a Flash Translation Layer (FTL), which maps requests onto flash packages organized into blocks and pages.]
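The NVMe queue pairs in these figures are rings shared between host and controller. The fragment below is a minimal C model of the host-side bookkeeping for one pair; field names and sizes are simplified assumptions, not the layout from the NVMe specification:

    #include <stdbool.h>
    #include <stdint.h>

    #define QDEPTH 32                /* entries per ring (illustrative) */

    struct sq_entry { uint8_t opcode; uint64_t slba; uint16_t nlb; };
    struct cq_entry { uint16_t sq_head; uint16_t status; };

    struct queue_pair {
        struct sq_entry sq[QDEPTH];  /* submission ring, host writes   */
        struct cq_entry cq[QDEPTH];  /* completion ring, device writes */
        uint16_t sq_tail;            /* next free SQ slot (host-owned) */
        uint16_t sq_head;            /* mirrored from completions      */
        uint16_t cq_head;            /* next CQ entry the host reads   */
    };

    /* The SQ is full when advancing the tail would hit the head. */
    static bool sq_full(const struct queue_pair *qp)
    {
        return (uint16_t)((qp->sq_tail + 1) % QDEPTH) == qp->sq_head;
    }

    /* Host side of a submission: fill the slot and advance the tail;
     * a real driver would then write sq_tail to the SQ doorbell. */
    static bool sq_submit(struct queue_pair *qp, struct sq_entry e)
    {
        if (sq_full(qp))
            return false;
        qp->sq[qp->sq_tail] = e;
        qp->sq_tail = (qp->sq_tail + 1) % QDEPTH;
        return true;
    }

Completion entries report how far the controller has consumed the submission ring, which is how sq_head is updated on the host side.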
Latency Characteristics of SSDs – Motivation
● The gap between CPU and storage devices is shrinking
→ Speedup of 1000x compared to HDDs
→ Up to 40% of storage latency is caused by software
● The high degree of abstraction (FTL) is a source of non-determinism
→ Garbage Collection
→ Caching
→ Scheduling of multiple NVMe queue pairs
● Can host-sided I/O schedulers still be used to enforce latency goals?
Dissecting Linux' Block I/O Schedulers
● Performing micro-benchmarks in a simulated real-time scenario
→ Samsung 970 EVO (250 GB, NVMe 1.3) on a desktop system
● Multiple background processes used to create load on the drive
→ Goal is bandwidth maximization
● Single foreground high-priority process that periodically issues requests
→ Goal is fast access to the drive
● Analyse latency characteristics of the RT process under different I/O schedulers (a fio sketch of this setup follows)
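Such a load pattern can be approximated with fio. The job file below is a rough sketch of the setup, not the authors' actual configuration; the device path, job count, and timing values are assumptions:

    ; hypothetical fio job approximating the benchmark scenario
    [global]
    filename=/dev/nvme0n1
    ioengine=libaio
    direct=1
    runtime=60
    time_based

    ; background jobs: maximize drive bandwidth
    [background]
    rw=randread
    bs=4k
    iodepth=32
    numjobs=4

    ; foreground RT job: one periodic request at a time
    [realtime]
    rw=randread
    bs=4k
    iodepth=1
    prioclass=1
    thinktime=10ms

Here prioclass=1 marks the job as real-time for I/O priority-aware schedulers, and thinktime inserts a pause between requests to emulate periodic issuing.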
Block I/O Schedulers in Linux 4.15
● None (default)
● mq-deadline
→ Orders requests by target block address
● Kyber
→ Core-local, balances latencies of coarse-grained request classes
● Budget Fair Queueing (BFQ)
→ Bandwidth control per process
→ Aware of I/O priorities
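For reference, the active scheduler of a block device can be inspected and switched at run time through sysfs (nvme0n1 is an example device name):

    cat /sys/block/nvme0n1/queue/scheduler
    echo kyber > /sys/block/nvme0n1/queue/scheduler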
Latency Characteristics for Random Reads
● Both real-time and background processes issue small random reads (4K)
→ The x-axis shows achieved bandwidth; the colour of the plotting symbols depicts the targeted throughput
● Plots look similar for larger block sizes
Latency Characteristics for Write-Only Workloads
● Writing is very fast as long as the drive-internal SLC cache absorbs the writes
● Garbage collection drastically increases the storage latency of the RT process
Dissecting Linux’ I/O Schedulers – Lessons Learned
● Reading is often slower than writing
→ Write caching avoids synchronous access to the second-level flash cells
● No performance isolation
→ Real-time process faces high latency when SSD is fully loaded
→ BFQ is unable to enforce priorities correctly
● From a latency viewpoint, the I/O schedulers show little difference
→ However, complex implementations incur latency penalties of up to 10%
K2: Work-Constraining I/O Scheduling
● Work-conserving behavior of current schedulers is not optimal w.r.t. latencies
→ Stalls high-priority read requests
→ Amplifies effects of garbage collection
● Concept: Limit the number of requests that are served in parallel (sketched in the code below)
→ Device-wide limit of inflight requests
→ Requests are stored in per-priority FIFO queues
→ Submit new requests on completion of previous ones
● Queue length as a tunable parameter
→ Trade global throughput for softly bounded I/O latency of high-priority processes
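A user-space sketch of this dispatch discipline follows. It is illustrative only; the names, the number of priority levels, and the limit of 8 are assumptions, and the real K2 runs as an I/O scheduler inside the Linux block layer:

    /* Sketch of work-constraining dispatch: per-priority FIFO
     * staging queues plus a device-wide inflight limit. */
    #include <stddef.h>

    #define NUM_PRIOS      4   /* priority levels; 0 = highest (assumed) */
    #define INFLIGHT_LIMIT 8   /* device-wide cap, the tunable parameter */

    struct request {
        struct request *next;  /* FIFO chaining */
        int prio;              /* 0 = highest priority */
    };

    static struct request *head[NUM_PRIOS], *tail[NUM_PRIOS];
    static int inflight;       /* requests currently at the device */

    /* Stage a new request in the FIFO queue of its priority. */
    static void k2_enqueue(struct request *rq)
    {
        rq->next = NULL;
        if (tail[rq->prio])
            tail[rq->prio]->next = rq;
        else
            head[rq->prio] = rq;
        tail[rq->prio] = rq;
    }

    /* Oldest request of the highest non-empty priority, or NULL. */
    static struct request *k2_pick(void)
    {
        for (int p = 0; p < NUM_PRIOS; p++) {
            struct request *rq = head[p];
            if (rq) {
                head[p] = rq->next;
                if (!head[p])
                    tail[p] = NULL;
                return rq;
            }
        }
        return NULL;
    }

    /* Work-constraining dispatch: stop at the inflight limit even
     * while staged requests remain, so an arriving high-priority
     * request always finds a short device queue. */
    static void k2_dispatch(void (*submit)(struct request *))
    {
        while (inflight < INFLIGHT_LIMIT) {
            struct request *rq = k2_pick();
            if (!rq)
                break;
            inflight++;
            submit(rq);
        }
    }

    /* A completion frees one slot; refill from the staging queues. */
    static void k2_complete(void (*submit)(struct request *))
    {
        inflight--;
        k2_dispatch(submit);
    }

With this structure, a high-priority request waits for at most INFLIGHT_LIMIT in-flight requests plus its own service time, which is what bounds the tail latency at the cost of peak throughput.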
Evaluation of K2 – Random Reads (64K)
● Limiting the length of the device queues enforces the correct service order
→ Tested K2 with queue lengths of 8, 16, and 32
→ Note the different scales of the y-axes!
● Host gains flexibility to enforce quick submission of real-time requests
Evaluation of K2 – Sequential Write Operations (64K)
● Performance for very fast operations similar to mq-deadline and Kyber
● K2 cannot avoid garbage collection but mitigates its impact on RT applications
Evaluation of K2 – Application Benchmark
● Tested the read-only OLTP benchmarks of sysbench with read/write background load
Summary
● Fast SSDs impose new challenges on the OS to enforce timely access to storage
→ Complex abstractions cause non-determinism of performance parameters
● Current I/O schedulers are not suitable for real-time demands
→ No performance isolation
→ Overridden by drive-internal scheduler
● K2: work-constraining I/O scheduling to limit storage access latency
→ Trade throughput for lower tail-latencies
→ Improve worst-case latency up to 10x for reading, 6.8x for writing
→ Works with off-the-shelf components
Additional Slides
Latency Characteristics for Random Reads (All Percentiles)
● Both real-time and background processes issue small random reads (4K)
Latency Characteristics for Write-Only Workloads
● Garbage collection also affects lower percentiles
Latency Characteristics for Mixed Workloads (64K)
● Access times of real-time application are similar to reading
Impact of Scheduler Complexity
● For small random writes, complex policies have a notable impact on overall latency
Evaluation of K2 – Random Reads (64K, All Percentiles)
● Limiting the length of the device queues enforces correct service order
Evaluation of K2 – Mixed Workloads (64K)
● Reduction of tail latencies is also present for mixed read / write requests
● Bandwidth penalty reduced by fast write operations
Evaluation – Comparison of I/O Schedulers
● Table shows the 99.9th percentile of latency at maximum throughput
→ The throughput loss of K2 is mitigated by large block sizes
Evaluation – Comparison of I/O Schedulers
● Table shows throughput at the maximum 99.9th percentile of latency