ORNL is managed by UT-Battelle for the US Department of Energy
Improving Block Level Efficiency with scsi-mq
Blake Caldwell NCCS/ORNL March 4th, 2015
Block Layer Problems
• Global lock of request queue per block device
• Cache coherency traffic
  – If parts of a request are serviced on multiple cores, the lock must be obtained on the new core and invalidated on the old core
• Interrupt locality
  – Hardware interrupt may occur on the wrong core, requiring a soft interrupt to be sent to the proper core
Linux Block Layer
• Designed with rotational media in mind
  – Time spent in the queue allows sequential request reordering, a very good thing
  – Completion latencies 10ms to 100ms
• Single request queue
  – Staging area for merging, reordering, scheduling
• Drivers are presented with the same interface for each block device
blk-mq (multi-queue)
• Rewrite of the Linux block layer (since kernel 3.13)
• Two levels of queues
  1) Per-core submission queues
  2) 1 or more hardware dispatch queues with affinity to NUMA nodes/CPUs (device-driver specific)
• IO scheduling within software queues
  – Inserted in FIFO order, then interleaved to hardware queues
• IOs are tagged, and the tags are reused for lookup on completion
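
The two queue levels and the tag reuse can be sketched as a toy model. This is not kernel code: the class name `ToyBlkMq`, the modulo mapping of cores to hardware queues, and the round-robin interleaving are assumptions made for illustration only.

```python
from collections import deque


class ToyBlkMq:
    """Toy model of per-core submission queues feeding hardware dispatch queues."""

    def __init__(self, num_cores, num_hw_queues, tag_depth):
        # Level 1: one software submission queue per core (FIFO).
        self.sw_queues = [deque() for _ in range(num_cores)]
        # Assumption: cores map to hardware queues by simple modulo.
        self.core_to_hw = [core % num_hw_queues for core in range(num_cores)]
        # Level 2: each hardware queue owns a fixed pool of reusable tags.
        self.free_tags = [list(range(tag_depth)) for _ in range(num_hw_queues)]
        # (hw_queue, tag) -> request, used for lookup on completion.
        self.in_flight = {}

    def submit(self, core, request):
        """Insert a request in FIFO order on the submitting core's queue."""
        self.sw_queues[core].append(request)

    def dispatch(self):
        """Interleave software queues round-robin into the hardware queues.

        Returns (hw_queue, tag, request) tuples in dispatch order. A request
        stays in its software queue while its hardware queue has no free tag.
        """
        dispatched = []
        progressed = True
        while progressed:
            progressed = False
            for core, swq in enumerate(self.sw_queues):
                hw = self.core_to_hw[core]
                if swq and self.free_tags[hw]:
                    tag = self.free_tags[hw].pop(0)
                    request = swq.popleft()
                    self.in_flight[(hw, tag)] = request
                    dispatched.append((hw, tag, request))
                    progressed = True
        return dispatched

    def complete(self, hw, tag):
        """Look up the request by tag and return the tag to the free pool."""
        request = self.in_flight.pop((hw, tag))
        self.free_tags[hw].append(tag)
        return request
```

With 4 cores, 2 hardware queues, and a tag depth of 2, a second request from core 0 waits until a tag on hardware queue 0 is completed and handed back, at which point the same tag number is reused.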
[Figure: two IO-path diagrams, one labeled "single queue" and one labeled "blk-mq". Both show Application (Core 0, 1, 2) with ost_io and libaio, Page Cache, Core 0 through Core 3, Block Device Driver (SRP Initiator), Block Device (SRP Target), and Disk/SSD. Additional labels: Software Submission Queues, Hardware Dispatch Queues, Hardware Parallelism.]
IO Device Affinity
[Figure: OSS node diagram. Labels: NUMA node 0 with CPU 0, NUMA node 1 with CPU 1, QPI, HCA 1 and HCA 2 on PCIe, LNET, and a link to the storage controller.]
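
To find out which NUMA node an HCA hangs off on a given host, the sysfs attributes can be read directly. The following is a minimal sketch; the HCA name `mlx4_0` in the usage note is a hypothetical example, and the `sysfs_root` parameter exists only so the functions can be exercised against a scratch directory.

```python
from pathlib import Path


def hca_numa_node(hca, sysfs_root="/sys"):
    """Return the NUMA node of an InfiniBand HCA's PCI device (-1 if unknown)."""
    path = Path(sysfs_root) / "class" / "infiniband" / hca / "device" / "numa_node"
    return int(path.read_text().strip())


def cpus_on_node(node, sysfs_root="/sys"):
    """Return the CPU list string (for example "8-15") of a NUMA node."""
    path = Path(sysfs_root) / "devices" / "system" / "node" / f"node{node}" / "cpulist"
    return path.read_text().strip()
```

For example, `cpus_on_node(hca_numa_node("mlx4_0"))` would give the cores closest to that HCA, which is the set a benchmark thread or an interrupt would be pinned to.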
Controller Caching (direct)
[Figure: direct-path diagram. Labels: OSS with HCA; controllers ddn1 and ddn2, each with RP0 and RP1; Lun 0, Lun 1, Lun 3, Lun 4.]
Controller Caching (indirect)
[Figure: indirect-path diagram. Labels: OSS with HCA; controllers ddn1 and ddn2, each with RP0 and RP1; Lun 0, Lun 1, Lun 3, Lun 4.]
Evaluation Setup
• Linux 3.18
  – blk-mq (3.13)
  – scsi-mq (3.17)
  – ib-srp multichannel (3.18)
  – dm-multipath support (4.0)
• Lustre 2.7.0 rc1
  – ldiskfs patches rebased to 3.18
• OSS
  – Dual Ivy Bridge E5-2650 (2 NUMA nodes)
  – 64GB
  – Dual-port QDR IB to array
• Storage Array
  – Dual DDN 10k controllers
  – 8GB write-back cache
  – RAID 6 (8+2) LUNs
  – SATA disk
Evaluation Goals
• Block level testing: adapt tests done with the null-blk device to a real storage device with scsi-mq
  – Does NUMA affinity awareness lead to efficiency?
    • Increased bandwidth
    • Decreased request latency
• Multipath performance
• Explore benefits for filesystems
  – Ready for use with fabric attached storage?
  – Are there performance benefits?
Evaluation Parameters
• Affinity
  – IB MSI-x interrupts spread among cores on NUMA node 1
  – Fio threads bound to NUMA node 1 (closest to HCA)
  – Block device rq_affinity=2 (completion happens on submitting core)
• Block tuning
  – Noop scheduler
  – max_sectors_kb = max_hw_sectors_kb
• IB-srp module
  – 16 channels
  – Max_sect 8192
  – Max_cmd_per_lun 62
  – Queue size 127 (fixed by hardware)
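
The block-device settings listed above correspond to attributes under `/sys/block/<device>/queue/`. A minimal sketch of applying them follows; the function name, the device name `sdb` in the usage note, and the `sysfs_root` parameter are assumptions for illustration, and writing to the real sysfs requires root. The scheduler step presumably applies to the single-queue path, since blk-mq devices on kernels of this period had no selectable IO scheduler.

```python
from pathlib import Path


def apply_block_tuning(device, sysfs_root="/sys"):
    """Apply the block-layer tuning from the slide to one block device.

    Returns the value copied from max_hw_sectors_kb to max_sectors_kb.
    """
    queue = Path(sysfs_root) / "block" / device / "queue"

    # rq_affinity=2: complete each request on the core that submitted it.
    (queue / "rq_affinity").write_text("2\n")

    # Noop scheduler (single-queue block layer).
    (queue / "scheduler").write_text("noop\n")

    # Raise the per-request size limit to the hardware maximum.
    max_hw = (queue / "max_hw_sectors_kb").read_text().strip()
    (queue / "max_sectors_kb").write_text(max_hw + "\n")
    return int(max_hw)
```

Calling `apply_block_tuning("sdb")` once per LUN before a run would put every device in the same state; the ib-srp module parameters are set separately when the SRP target is added.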
Throughput (direct path) thread/LUN
[Figure: write IOPS (axis 0 to 350000) vs. number of LUNs (1, 2, 4, 6). Series: 2.6.32-431, 3.18 mq, 3.18 mq+mc.]
Throughput (indirect path) thread/LUN
[Figure: write IOPS (axis 0 to 350000) vs. number of LUNs (1, 2, 4, 6). Series: 2.6.32-431, 3.18 mq, 3.18 mq+mc.]
CPU Usage (direct path) thread/LUN
[Figure: CPU % (axis 0 to 90) vs. number of LUNs (1, 2, 4, 6, 8). Series: usr and sys for each of 2.6.32-431, 3.18 mq, 3.18 mq+mc.]
Throughput (direct path) 1 LUN
[Figure: write IOPS (axis 0 to 600000) vs. number of threads (1, 2, 4, 8). Series: 2.6.32-431, 3.18 mq, 3.18 mq+mc.]
Read Throughput (direct path) 1 LUN
[Figure: read IOPS (axis 0 to 350000) vs. number of threads (axis 0 to 9). Series: 2.6 sq, 3.18 mq, 3.18 mq+mc.]
dm-multipath support
• Evaluated on Linux 4.0rc1 with patches for dm blk-mq support
  – 4k IO, libaio, iodepth=127, SRP multi-channel support enabled

| Multipath | Direct (no IO FWD) | Indirect (IO FWD) |
|---|---|---|
| 393.2 MB/s | 849.5 MB/s | 854.8 MB/s |
| 100658 IOPs | 217468 IOPs | 218829 IOPs |
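
The bandwidth and IOPS figures are two views of the same measurement. A short check, assuming 4k means 4096 bytes and that MB/s is counted in units of 1024 × 1024 bytes, reproduces the bandwidth row from the IOPS row:

```python
def iops_to_mib_per_s(iops, io_size_bytes=4096):
    """Convert an IOPS figure to bandwidth in MiB/s for a fixed IO size."""
    return iops * io_size_bytes / (1024 * 1024)


# IOPS values as reported on the slide.
reported_iops = {"Multipath": 100658, "Direct": 217468, "Indirect": 218829}

for path, iops in reported_iops.items():
    print(f"{path}: {iops_to_mib_per_s(iops):.1f} MB/s")
```

Under those assumptions the three results round to 393.2, 849.5, and 854.8, matching the reported bandwidths.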
Request latency
• Average for 100 4k IOs, fio with synchronous IO engine

| Kernel | Multipath | Direct (no IO FWD) | Indirect (IO FWD) |
|---|---|---|---|
| 2.6.32-431.17.1 | 130.50 | 119.92 | 169.22 |
| 4.0rc1 (blk-mq) | 130.56 | 103.58 | 142.84 |
| Improvement | -0.04% | 13.6% | 15.6% |
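
The improvement row appears to be the relative drop in average latency from the 2.6.32 kernel to 4.0rc1. A small sketch of that calculation, with the formula being an assumption inferred from the numbers rather than something stated on the slide:

```python
def improvement_pct(old_latency, new_latency):
    """Relative latency reduction in percent; negative means latency went up."""
    return (old_latency - new_latency) / old_latency * 100.0


# (2.6.32-431.17.1, 4.0rc1 blk-mq) average latencies as reported on the slide.
latencies = {
    "Multipath": (130.50, 130.56),
    "Direct": (119.92, 103.58),
    "Indirect": (169.22, 142.84),
}

for path, (old, new) in latencies.items():
    print(f"{path}: {improvement_pct(old, new):.2f}%")
```

The direct and indirect results round to the reported 13.6% and 15.6%. The multipath result comes out at roughly -0.046%, which agrees with the reported -0.04% if that figure was truncated rather than rounded.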
Lustre Profiling
• FlameGraph of kernel code using Perf, 100Hz, Linux 3.18, Lustre 2.7.0rc1, 1MB writes to a single OST
Lustre Applications
• Metadata IO
  – Improve single request latency
  – Is bandwidth necessary while flushing metadata to the MDT?
• Object IO
  – Scheduling
    • Request size
    • Request tagging
Future Directions
• Lustre with Linux 4.0
• Testing with hardware capable of 600k+ 4k IOPs
  – Random write performance for multiple threads/LUN
• Evaluate multiple threads/LUN sequential writes
• Read and random tests need further investigation
Conclusion
• scsi-mq has potential to lower CPU usage even with rotational media
• scsi-mq has lower IO completion latency
• Further evaluation needed of device drivers that support multiple hardware dispatch queues