High Performance Storage Devices in the Linux Kernel



E8 STORAGE
Come join us at careers@e8storage.com

https://goo.gl/FOuczd


HIGH PERFORMANCE STORAGE DEVICES IN THE KERNEL

E8 Storage

By Evgeny Budilovsky (@budevg)

12 Jan 2016


STORAGE STACK 101


APPLICATION
The application invokes system calls (read, write, mmap) on files

Block device files - direct access to the block device
Regular files - access through a specific file system (ext4, btrfs, etc.)
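To make this concrete, here is a small userspace sketch: the same read(2) call serves both kinds of files, and only the kernel layers traversed differ. The device and file paths are placeholders chosen for illustration.

/* Minimal sketch: read(2) on a block device file vs. a regular file.
 * The paths below are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void read_first_4k(const char *path)
{
    char buf[4096];
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror(path); return; }
    ssize_t n = read(fd, buf, sizeof(buf));   /* sys_read -> VFS -> ... */
    printf("%s: read %zd bytes\n", path, n);
    close(fd);
}

int main(void)
{
    read_first_4k("/dev/sda");          /* block device file: straight to the bio layer   */
    read_first_4k("/home/user/1.bin");  /* regular file: through ext4/btrfs and page cache */
    return 0;
}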


FILE SYSTEM
System calls go through the VFS layer
Specific file system logic is applied
Read/write operations are translated into reads/writes of memory pages
The page cache is involved to reduce disk access


BIO LAYER (STRUCT BIO)
The file system constructs a bio, the main unit of IO
The bio is dispatched to the block layer or to a bypass driver (make_request_fn driver)
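As a rough illustration of this step, the sketch below builds a one-page write bio and hands it to the block layer. It assumes a kernel API of roughly the era of this talk (the submit_bio() signature and bio fields changed in later kernels), and bdev, page, sector and my_end_io stand in for driver-provided context.

/* Hedged sketch (approximate ~3.19-4.7 API; signatures differ in later kernels):
 * build a one-page write bio and hand it to the block layer. */
#include <linux/bio.h>
#include <linux/blkdev.h>

static void submit_one_page_write(struct block_device *bdev, struct page *page,
                                  sector_t sector, bio_end_io_t *my_end_io)
{
    struct bio *bio = bio_alloc(GFP_NOIO, 1);   /* room for one bio_vec      */

    bio->bi_bdev = bdev;                        /* target block device       */
    bio->bi_iter.bi_sector = sector;            /* starting LBA (512B units) */
    bio_add_page(bio, page, PAGE_SIZE, 0);      /* attach one page of data   */
    bio->bi_end_io = my_end_io;                 /* completion callback       */

    submit_bio(WRITE, bio);  /* enters generic_make_request() / make_request_fn */
}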


BLOCK LAYER (STRUCT REQUEST)
Set up a struct request and move it to the request queue (single queue per device)
The IO scheduler can delay/merge requests
Dispatch requests:
  To hardware drivers (request_fn driver)
  To the SCSI layer (SCSI devices)
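For concreteness, a hedged sketch of how a legacy single-queue driver hooks into this layer: it registers a request_fn via blk_init_queue() and drains requests from the shared queue. This is the old, pre-blk-mq API; my_request_fn, my_lock and the hardware interaction are placeholder names.

/* Hedged sketch of a legacy (single-queue) block driver hook.
 * Uses the old request_fn API that blk-mq later replaced. */
#include <linux/blkdev.h>

static DEFINE_SPINLOCK(my_lock);

/* Called with the queue lock held; drains the single shared request queue. */
static void my_request_fn(struct request_queue *q)
{
    struct request *req;

    while ((req = blk_fetch_request(q)) != NULL) {
        /* ... issue req to the hardware here ... */
        __blk_end_request_all(req, 0);      /* complete with success */
    }
}

/* At probe time: one request queue per device, protected by one lock. */
static struct request_queue *my_create_queue(void)
{
    return blk_init_queue(my_request_fn, &my_lock);
}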

SCSI LAYER (STRUCT SCSI_CMND)
Translate the request into a SCSI command
Dispatch the command to low-level SCSI drivers


WRITE (O_DIRECT)

# Application
perf record -g dd if=/dev/zero of=1.bin bs=4k \
    count=1000000 oflag=direct
...
perf report --call-graph --stdio

# Submit into block queue (bypass page cache)
write
  system_call_fastpath
    sys_write
      vfs_write
        new_sync_write
          ext4_file_write_iter
            __generic_file_write_iter
              generic_file_direct_write
                ext4_direct_IO
                  ext4_ind_direct_IO
                    __blockdev_direct_IO
                      do_blockdev_direct_IO
                        submit_bio
                          generic_make_request


# after IO scheduling, submit to the SCSI layer and to the hardware
...
io_schedule
  blk_flush_plug_list
    queue_unplugged
      __blk_run_queue
        scsi_request_fn
          scsi_dispatch_cmd
            ata_scsi_queuecmd


WRITE (WITH PAGE CACHE)

# Application
perf record -g dd if=/dev/zero of=1.bin bs=4k \
    count=1000000
...
perf report --call-graph --stdio

# write data into the page cache
write
  system_call_fastpath
    sys_write
      vfs_write
        new_sync_write
          ext4_file_write_iter
            __generic_file_write_iter
              generic_perform_write
                ext4_da_write_begin
                  grab_cache_page_write_begin
                    pagecache_get_page
                      __page_cache_alloc
                        alloc_pages_current


# asynchronously flush dirty pages to disk
kthread
  worker_thread
    process_one_work
      bdi_writeback_workfn
        wb_writeback
          __writeback_inodes_wb
            writeback_sb_inodes
              __writeback_single_inode
                do_writepages
                  ext4_writepages
                    mpage_map_and_submit_buffers
                      mpage_submit_page
                        ext4_bio_write_page
                          ext4_io_submit
                            submit_bio
                              generic_make_request
                                blk_queue_bio


HIGH PERFORMANCE BLOCK DEVICES


CHANGES IN THE STORAGE WORLD
Rotational devices (HDD): "hundreds" of IOPS, "tens" of milliseconds latency
Today, flash-based devices (SSD): "hundreds of thousands" of IOPS, "tens" of microseconds latency
Large internal data parallelism
Increase in cores and NUMA architecture
New standardized storage interfaces (NVMe)


NVM EXPRESS


HIGH PERFORMANCE STANDARD

Standardized interface for PCIe SSDs
Designed from the ground up to exploit:
  Low latency of today's PCIe-based SSDs
  Parallelism of today's CPUs

BEATS THE AHCI STANDARD FOR SATA HOSTS

                                  AHCI                              NVMe
Maximum Queue Depth               1 command queue,                  64K queues,
                                  32 commands per queue             64K commands per queue
Un-cacheable register accesses    6 per non-queued command,         2 per command
(2K cycles each)                  9 per queued command
MSI-X and Interrupt Steering      Single interrupt; no steering     2K MSI-X interrupts
Parallelism & Multiple Threads    Requires a synchronization        No locking
                                  lock to issue a command
Efficiency for 4KB Commands       Command parameters require two    Command parameters in
                                  serialized host DRAM fetches      one 64B fetch
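To make the last row concrete, here is an approximate sketch of the 64-byte NVMe read/write command, modeled on the Linux driver's struct nvme_rw_command; field names and exact layout should be checked against the spec and kernel version, so treat this as an illustration rather than a definitive definition.

/* Approximate layout of a 64-byte NVMe read/write submission queue entry,
 * modeled after the kernel's struct nvme_rw_command (names may differ
 * slightly between kernel versions). One 64B fetch carries everything. */
#include <linux/types.h>

struct nvme_rw_cmd_sketch {
    __u8    opcode;         /* read or write                     */
    __u8    flags;
    __u16   command_id;     /* matches completion to submission  */
    __le32  nsid;           /* namespace id                      */
    __u64   rsvd2;
    __le64  metadata;       /* metadata pointer                  */
    __le64  prp1;           /* data pointer (PRP entry 1)        */
    __le64  prp2;           /* data pointer (PRP entry 2 / list) */
    __le64  slba;           /* starting LBA                      */
    __le16  length;         /* number of LBAs, 0-based           */
    __le16  control;
    __le32  dsmgmt;
    __le32  reftag;
    __le16  apptag;
    __le16  appmask;
};                          /* 64 bytes total */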


IOPS

Consumer-grade NVMe SSDs (enterprise-grade have much better performance)
100% writes are less impressive due to NAND limitations


BANDWIDTH


LATENCY


THE OLD STACK DOES NOT SCALE


THE NULL_BLK EXPERIMENT
Jens Axboe (Facebook)
null_blk configuration:

  queue_mode=1 (rq)  completion_nsec=0  irqmode=0 (none)

fio: each thread does pread(2), 4k, randomly, O_DIRECT
Each added thread alternates between the two available NUMA nodes (2-socket system, 32 threads)
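A hedged userspace sketch of what each such thread does: random 4k pread(2) calls with O_DIRECT. The device path, iteration count and error handling are illustrative assumptions; the actual experiment uses fio.

/* Sketch of the per-thread workload: random, 4 KiB, O_DIRECT pread(2).
 * /dev/nullb0 and the iteration count are illustrative; real runs use fio. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

int main(void)
{
    /* O_DIRECT needs an aligned buffer and aligned offsets. */
    void *buf;
    if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE) != 0)
        return 1;

    int fd = open("/dev/nullb0", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    off_t dev_blocks = lseek(fd, 0, SEEK_END) / BLOCK_SIZE;
    if (dev_blocks <= 0) return 1;

    for (long i = 0; i < 1000000; i++) {
        off_t off = (random() % dev_blocks) * BLOCK_SIZE;  /* random 4k-aligned offset */
        if (pread(fd, buf, BLOCK_SIZE, off) != BLOCK_SIZE) {
            perror("pread");
            break;
        }
    }

    close(fd);
    free(buf);
    return 0;
}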


LIMITED PERFORMANCE


PERF
Spinlock contention


PERF
40% from sending requests (blk_queue_bio)
20% from completing requests (blk_end_bidi_request)
18% from sending bios from the application to the bio layer (blk_flush_plug_list)


OLD STACK HAS SEVERE SCALING ISSUES


PROBLEMS
Good scalability before the block layer (file system, page cache, bio)
The single shared queue is a problem
A bypass-mode driver can work with bios directly, without ever entering the shared queue (see the sketch below)
Problem with a bypass driver: code duplication
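The sketch referenced above, assuming roughly the 3.x/early-4.x make_request_fn signature (it changed across kernel versions); my_handle_bio stands in for the driver's real submission path:

/* Hedged sketch of a bypass ("bio-based") driver: it registers a
 * make_request_fn and consumes bios itself, so no struct request and
 * no shared request queue are involved. */
#include <linux/bio.h>
#include <linux/blkdev.h>

static void my_handle_bio(struct bio *bio);   /* assumed driver-specific helper */

static void my_make_request(struct request_queue *q, struct bio *bio)
{
    /* Talk directly to the hardware (or an internal queue) using the bio. */
    my_handle_bio(bio);
    /* The driver is now responsible for calling bio_endio() on completion. */
}

static struct request_queue *my_create_bypass_queue(void)
{
    struct request_queue *q = blk_alloc_queue(GFP_KERNEL);

    if (q)
        blk_queue_make_request(q, my_make_request);  /* bypass the request layer */
    return q;
}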


BLOCK MULTI-QUEUE TO THE RESCUE


HISTORY
Prototyped in 2011
Paper in SYSTOR 2013
Merged into Linux 3.13 (2014)
A replacement for the old block layer, with a different driver API

Drivers gradually converted to blk-mq (scsi-mq, nvme-core, virtio_blk)


ARCHITECTURE - 2 LAYERS OF QUEUES
The application works with a per-CPU software queue
Multiple software queues map onto hardware queues
The number of HW queues is based on the number of HW contexts supported by the device
Requests from the HW queue are submitted by the low-level driver to the device
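A hedged sketch of how a driver describes these queues to blk-mq: it fills a blk_mq_tag_set with the number of hardware queues and the queue depth, and supplies a .queue_rq callback. The API shown is approximately the 4.x-era form, the sizes are arbitrary, and my_queue_rq / my_init_mq_queue are placeholder names; a real driver sets additional fields and ops.

/* Hedged sketch of blk-mq driver registration (approximate 4.x-era API,
 * incomplete on purpose). */
#include <linux/blk-mq.h>

static int my_queue_rq(struct blk_mq_hw_ctx *hctx,
                       const struct blk_mq_queue_data *bd)
{
    struct request *rq = bd->rq;

    blk_mq_start_request(rq);
    /* ... submit rq to one hardware queue/context here ... */
    return BLK_MQ_RQ_QUEUE_OK;
}

static struct blk_mq_ops my_mq_ops = {
    .queue_rq = my_queue_rq,          /* called per hardware context */
};

static struct blk_mq_tag_set my_tag_set;

static struct request_queue *my_init_mq_queue(void)
{
    my_tag_set.ops          = &my_mq_ops;
    my_tag_set.nr_hw_queues = 32;     /* matches HW contexts the device offers     */
    my_tag_set.queue_depth  = 128;    /* requests (and tags) preallocated per queue */
    my_tag_set.numa_node    = NUMA_NO_NODE;
    my_tag_set.cmd_size     = 0;      /* extra per-request driver data             */

    if (blk_mq_alloc_tag_set(&my_tag_set))
        return NULL;
    return blk_mq_init_queue(&my_tag_set);  /* software queues map onto HW queues */
}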

ARCHITECTURE - ALLOCATION AND TAGGING

An IO tag is an integer value that uniquely identifies an IO submitted to the hardware
On completion, the tag tells us which IO was completed
Legacy drivers maintained their own tagging implementations

With blk-mq, requests are allocated at initialization time (based on queue depth)
Tag and request allocation are combined
This avoids per-request allocations in the driver and manual tag maintenance
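A hedged illustration of the tag flow: the submission side reads the preallocated tag from the request, and the completion side maps the hardware-reported tag back to the request. blk_mq_tag_to_rq() usage is approximate and version-dependent; my_hctx and the hardware interaction are assumptions.

/* Hedged sketch of using blk-mq tags: no per-IO allocation, no driver-side
 * tag bookkeeping. */
#include <linux/blk-mq.h>

static struct blk_mq_hw_ctx *my_hctx;   /* assumed: saved by the driver */

/* Submission side: the request already carries its preallocated tag. */
static void my_submit_to_hw(struct request *rq)
{
    unsigned int tag = rq->tag;         /* unique per in-flight IO, nothing allocated here */
    /* ... write 'tag' into the hardware command and ring the doorbell ... */
    (void)tag;                          /* placeholder for the real doorbell write */
}

/* Completion side (e.g. interrupt handler): hardware reports the tag back. */
static void my_complete_from_hw(unsigned int tag)
{
    struct request *rq = blk_mq_tag_to_rq(my_hctx->tags, tag);

    blk_mq_end_request(rq, 0);          /* finish the request, success */
}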


ARCHITECTURE - I/O COMPLETIONS
Generally we want completions to be as local as possible
Use IPIs to complete requests on the submitting node
The old block layer used software interrupts instead of IPIs
In the best case there is an SQ/CQ pair for each core, with an MSI-X interrupt set up for each CQ, steered to the relevant core
IPIs are used when there aren't enough interrupts/HW queues
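A hedged sketch of how a driver benefits from this: the interrupt handler reports the completion to blk-mq, which decides whether to finish the request locally or bounce it to the submitting CPU via an IPI. The exact blk_mq_complete_request() signature varies between kernel versions (some take an extra error argument), so treat this as an outline.

/* Hedged sketch: completion locality is handled by blk-mq itself;
 * the driver just reports the completion. */
#include <linux/blk-mq.h>

/* Runs on the CPU that submitted the request (blk-mq may use an IPI). */
static void my_softirq_done(struct request *rq)
{
    blk_mq_end_request(rq, 0);
}

/* Interrupt handler side: may run on whichever CPU received the MSI-X vector. */
static void my_handle_hw_completion(struct request *rq)
{
    blk_mq_complete_request(rq);   /* blk-mq redirects to the submitting CPU if needed */
}

static struct blk_mq_ops my_completion_ops_sketch = {
    .complete = my_softirq_done,   /* invoked after any cross-CPU redirection */
    /* .queue_rq etc. omitted in this sketch */
};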


NULL_BLK EXPERIMENT AGAIN
null_blk configuration:

  queue_mode=2 (multiqueue)  completion_nsec=0  irqmode=0 (none)  submit_queues=32

fio: each thread does pread(2), 4k, randomly, O_DIRECT
Each added thread alternates between the two available NUMA nodes (2-socket system, 32 threads)


SUCCESS


WHAT ABOUT HARDWARE WITHOUT MULTI-QUEUE SUPPORT?

Same null_blk setup
1/2/n HW queues in blk-mq
mq-1 and mq-2 are so close because we have a 2-socket system
NUMA issues are eliminated once we have a queue per NUMA node

CONVERSION PROGRESS
mtip32xx (Micron SSD)
NVMe
virtio_blk, xen block driver
rbd (Ceph block)
loop
ubi
SCSI (scsi-mq)


SUMMARY
Storage device performance has accelerated from hundreds of IOPS to hundreds of thousands of IOPS
Software bottlenecks are gradually eliminated by exploiting concurrency and introducing lock-less architecture
blk-mq is one example

Questions?


REFERENCES
1. Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems
2. The Performance Impact of NVMe and NVMe over Fabrics
3. Null block device driver
4. blk-mq: new multi-queue block IO queueing mechanism
5. fio
6. perf
7. Solving the Linux storage scalability bottlenecks - by Jens Axboe
