
K2: Work-Constraining Scheduling of NVMe-Attached Storage

Till Miemietz, Hannes Weisbach, Michael Roitzsch and Hermann Härtig

Presentation at the 40th IEEE Real-Time Systems Symposium

Hong Kong ⚫ 4th of December 2019

What are the implications of fast storage devices for real-time systems?


What May a Modern Storage Stack Look Like?

[Figure: layered view of the Linux storage stack. Applications (e.g., file systems) running on several CPUs submit bios to the block layer; there they become requests, pass through an I/O-scheduler-specific staging queue scheme and FIFO-only dispatching queues, and are handed to the driver, which issues NVMe commands through NVMe queue pairs to the SSD.]

What May a Modern Storage Stack Look Like?

[Figure: the same stack, zoomed into the SSD. The drive's controller with its Flash Translation Layer (FTL) consumes NVMe commands from submission/completion queue pairs (SQ CQ) and maps them onto flash packages, which are organized into blocks and pages.]

Latency Characteristics of SSDs – Motivation

● Gap between CPU and storage devices is shrinking

→ Speedup of 1000x compared to HDDs

→ Up to 40% of storage latency caused by software

● High degree of abstraction (FTL) is a source of non-determinism

→ Garbage Collection

→ Caching

→ Scheduling of multiple NVMe queue pairs

● Can host-side I/O schedulers still be used to enforce latency goals?


Dissecting Linux' Block I/O Schedulers

● Performing microbenchmarks in a simulated real-time scenario

→ Samsung 970 EVO (250 GB, NVMe 1.3) on a desktop system

● Multiple background processes used to create load on the drive

→ Goal is bandwidth maximization

● Single foreground high-priority process that periodically issues requests

→ Goal is fast access to the drive

● Analyse latency characteristics of the RT process under different I/O schedulers (a sketch of such a measurement loop follows below)
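The foreground reader in such a setup can be approximated by a short userspace loop. The following is a minimal sketch under assumptions of my own (device path /dev/nvme0n1, a 10 ms period, 1000 requests), not the benchmark code used in the talk: it periodically issues one 4 KiB random read with O_DIRECT and prints the per-request latency.

    /* Sketch of a periodic high-priority reader: issue one 4 KiB random read
     * per period with O_DIRECT (bypassing the page cache) and record latency.
     * Device path, period, and request count are illustrative assumptions. */
    #define _GNU_SOURCE                       /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define BLOCK_SIZE   4096
    #define PERIOD_NS    (10 * 1000 * 1000)   /* 10 ms issue period (assumed) */
    #define NUM_REQUESTS 1000

    static long long elapsed_ns(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
    }

    int main(void)
    {
        int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);    /* device name assumed */
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE))       /* O_DIRECT needs alignment */
            return 1;

        off_t blocks = 250LL * 1000 * 1000 * 1000 / BLOCK_SIZE; /* ~250 GB drive */

        for (int i = 0; i < NUM_REQUESTS; i++) {
            off_t off = (rand() % blocks) * BLOCK_SIZE;         /* random 4 KiB target */

            struct timespec start, end;
            clock_gettime(CLOCK_MONOTONIC, &start);
            if (pread(fd, buf, BLOCK_SIZE, off) != BLOCK_SIZE) { perror("pread"); break; }
            clock_gettime(CLOCK_MONOTONIC, &end);

            printf("%lld\n", elapsed_ns(start, end));           /* latency in nanoseconds */

            struct timespec period = { 0, PERIOD_NS };
            nanosleep(&period, NULL);                           /* wait for the next period */
        }

        free(buf);
        close(fd);
        return 0;
    }

The background load generators would run similar loops without the sleep, issuing requests back to back to keep the drive saturated.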


Block I/O Schedulers in Linux 4.15

● None (default)

● mq-deadline

→ Orders requests by target block address

● Kyber

→ Core-local, balances latencies of coarse-grained request classes

● Budget Fair Queueing (BFQ)

→ Bandwidth control per process

→ Aware of I/O priorities
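For reference, the host-side scheduler in use can be switched per device at runtime through sysfs. A minimal sketch in C, assuming the drive shows up as nvme0n1 (adjust the name for the system at hand):

    /* Sketch: select a block I/O scheduler for a device via sysfs.
     * The device name and the chosen scheduler are assumptions; valid values
     * include "none", "mq-deadline", "kyber", and "bfq" when the corresponding
     * modules are available. Requires root privileges. */
    #include <stdio.h>

    int set_io_scheduler(const char *device, const char *scheduler)
    {
        char path[256];
        snprintf(path, sizeof(path), "/sys/block/%s/queue/scheduler", device);

        FILE *f = fopen(path, "w");
        if (!f) { perror("fopen"); return -1; }

        /* Writing one of the available scheduler names switches the policy. */
        int ok = (fputs(scheduler, f) >= 0);
        if (fclose(f) != 0)
            ok = 0;
        return ok ? 0 : -1;
    }

    int main(void)
    {
        return set_io_scheduler("nvme0n1", "kyber") == 0 ? 0 : 1;
    }

Reading the same sysfs file lists the schedulers available for the device, with the active one shown in brackets.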


Latency Characteristics for Random Reads

● Both real-time and background processes issuing small random reads (4K)

→ X-axis shows achieved bandwidth, color of plotting symbols depicts targeted throughput

● Plots look similar for larger block sizes


Latency Characteristics for Write-Only Workloads

● Writing is very fast as long as drive-internal SLC cache is accessed

● Garbage collection drastically increases storage latency of RT process


Dissecting Linux’ I/O Schedulers – Lessons Learned

● Reading is often slower than writing

→ Caching can avoid synchronous access of second-level flash cells when writing

● No performance isolation

→ Real-time process faces high latency when SSD is fully loaded

→ BFQ is unable to enforce priorities correctly

● From a latency viewpoint, the I/O schedulers show little difference

→ However, complex implementations have latency penalties of up to 10%


K2: Work-Constraining I/O Scheduling

● Work-conserving behavior of current schedulers is not optimal w.r.t. latencies

→ Stalls high-priority read requests

→ Amplifies effects of garbage collection

● Concept: Limit the number of requests that are served in parallel (a dispatch sketch follows after this list)

→ Device-wide limit of inflight requests

→ Requests are stored in per-priority FIFO queues

→ Submit new requests on completion of previous ones

● Queue length as a tunable parameter

→ Trade global throughput for softly bounded I/O latency of high-priority processes
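The dispatch logic behind this concept fits in a few lines. Below is a conceptual sketch with assumed names and fixed-size staging queues, not the actual K2 module:

    /* Conceptual sketch of work-constraining dispatch (illustrative only).
     * Requests wait in per-priority FIFO queues and are handed to the device
     * only while fewer than INFLIGHT_LIMIT requests are outstanding. */
    #include <stddef.h>

    #define NUM_PRIOS       4              /* number of priority classes (assumed) */
    #define INFLIGHT_LIMIT  8              /* device-wide limit, e.g. 8, 16, or 32 */
    #define QUEUE_CAP       64             /* staging capacity per class (assumed) */

    struct request;                        /* opaque block-layer request */

    struct k2_like_sched {
        struct request *fifo[NUM_PRIOS][QUEUE_CAP];   /* per-priority staging FIFOs */
        size_t head[NUM_PRIOS], tail[NUM_PRIOS];
        unsigned inflight;                 /* requests currently inside the drive */
    };

    /* Stage a new request in its priority class (0 = highest priority). */
    void sched_add(struct k2_like_sched *s, struct request *rq, int prio)
    {
        s->fifo[prio][s->tail[prio]++ % QUEUE_CAP] = rq;
    }

    /* Hand out the next request, highest priority first, but only while the
     * device-wide inflight limit has not been reached (work-constraining). */
    struct request *sched_dispatch(struct k2_like_sched *s)
    {
        if (s->inflight >= INFLIGHT_LIMIT)
            return NULL;                   /* hold further work back on the host */

        for (int prio = 0; prio < NUM_PRIOS; prio++) {
            if (s->head[prio] != s->tail[prio]) {
                s->inflight++;
                return s->fifo[prio][s->head[prio]++ % QUEUE_CAP];
            }
        }
        return NULL;                       /* nothing staged */
    }

    /* Completion callback: a slot becomes free, so the next staged request
     * (possibly a high-priority one) can be submitted promptly. */
    void sched_complete(struct k2_like_sched *s)
    {
        if (s->inflight > 0)
            s->inflight--;
    }

With such a limit, a newly arriving high-priority request waits behind at most INFLIGHT_LIMIT requests already inside the drive, which is what softly bounds its latency at the cost of some aggregate throughput.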


Evaluation of K2 – Random Reads (64K)

● Limiting the length of the device queues enforces correct service order

→ Tested K2 with queue lengths of 8, 16, and 32

→ Note the different scales of the y-axis!

● Host gains flexibility to enforce quick submission of real-time requests


Evaluation of K2 – Sequential Write Operations (64K)

● Performance for very fast operations similar to mq-deadline and Kyber

● K2 cannot avoid garbage collection but mitigates its impact on RT applications


Evaluation of K2 – Application Benchmark

● Tested read-only OLTP benchmarks of sysbench with read / write background load


Summary

● Fast SSDs impose new challenges on the OS to enforce timely access to storage

→ Complex abstractions cause non-determinism of performance parameters

● Current I/O schedulers are not suitable for real-time demands

→ No performance isolation

→ Overridden by drive-internal scheduler

● K2: work-constraining I/O scheduling to limit storage access latency

→ Trade throughput for lower tail-latencies

→ Improve worst-case latency by up to 10x for reading, 6.8x for writing

→ Works with off-the-shelf components


Additional Slides

Latency Characteristics for Random Reads (All Percentiles)

● Both real-time and background processes issuing small random reads (4K)


Latency Characteristics for Write-Only Workloads

● Garbage collection also affects lower percentiles


Latency Characteristics for Mixed Workloads (64K)

● Access times of real-time application are similar to reading


Impact of Scheduler Complexity

● For small random writes, complex policies have a notable impact on overall latency


Evaluation of K2 – Random Reads (64K, All Percentiles)

● Limiting the length of the device queues enforces correct service order


Evaluation of K2 – Mixed Workloads (64K)

● Reduction of tail latencies is also present for mixed read / write requests

● Bandwidth penalty reduced by fast write operations


Evaluation – Comparison of I/O Schedulers

● Table shows the 99.9th percentile of latency at maximum throughput

→ Throughput loss of K2 is mitigated by large block sizes

Evaluation – Comparison of I/O Schedulers

● Table shows throughput at the maximum 99.9th percentile of latency
