falcon: scaling io performance in multi-ssd volumes · pdf filedirect io vfs page cache 1...

23
Falcon: Scaling IO Performance in Multi-SSD Volumes Pradeep Kumar H Howie Huang The George Washington University

Upload: vanquynh

Post on 22-Mar-2018

224 views

Category:

Documents


6 download

TRANSCRIPT

Page 1: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Falcon: Scaling IO Performance in Multi-SSD Volumes

Pradeep Kumar H Howie Huang

The George Washington University

Page 2: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

SSDs in Big Data Applications➢ Recent trends advocate using many SSDs for higher throughput in

▪Graph Analytics

▪Machine Learning

▪Key-Value stores, etc.

➢New techniques are taking advantage of high random IOPS of SSDs▪ Fine grained IOs in graph processing [FAST’17]

▪Doing random IOs in graph processing [ATC’16]

▪Range scan in WiscKey is many parallel random IOs [FAST16]

➢ Increasing use of batched IO interfaces such as libaio in Linux

2

Page 3: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Existing IO Model

3

IO Performance HighLow

Pro

gram

min

gC

om

ple

xity

Low

High

Kernel-managed IO (1 application IO thread)

Application-managed IO(1 application IO thread per-SSD)

Falcon(1 application IO thread)

Kernel-managed IO (many application IO threads)

(a) Application-managed

SSD …

IO Stack

SSD

… …

1 application IO thread per-SSD

User spaceKernel Space

SSD SSD…

IO Stack

(b) Kernel-managed

Volume

1 or more application IO thread per-volume

ApplicationIO Threads

ComputingThreads

UserspaceIO Buffer

Page 4: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Outline

4

01 – Overview and Problem Statement

02 – Background and Block Layer Insufficiency

03 – Falcon Architecture

04 – Evaluation

Page 5: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Linux: IO Flow and IO States

5

Process Next Request

enqueue to software-queue

Unplug Phase

sort

no

enqueue to plug-list

no

yes

yes

yes

no

0

dispatch IOto driver

Plug Phase

Dispatch Phase

tagavailable?

merge ?

unplug?

no

IRQ eventcompletion

Complete IOFreetag

bio

Completion Phase

classify

complete8

merge3

ready4

wait5

insert6

dispatch7

SCSI Layer and Drivers

SSD1 SSDm…

Block LayerInstance (SSD1)

Block Layer Instance (SSDm)

Applications

bio1 biom

Volume Manager Instance

bio

VFSDirect IO Page Cache

biostart1

split2

start1

split2

IO Phases: Plug, Unplug, Dispatch, Completion

IO States: start, split, merge, wait, ready, insert, dispatch, complete

Page 6: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Linux: Mixes IO Batching and IO Serving Tasks

➢ Examples:▪Mixing batching with merge and tag allocation in

plug phase

▪Mixing classify with sort in unplug phase

➢ Root cause:▪Many tasks are tied to plug-list

▪Not designed for multi-SSD volume

➢ Creates many Insufficiencies▪ Lack of parallelism in IO processing

▪ Inefficient Merge and Sort

▪ Unpredictable blocking

6

Unplug Phase (sort, classify)

bio1 bio2 biom

Plug Phase(batch, merge, tag allocation)

Dispatch Phase (dispatch)

To SCSI Layer and Drivers

Software queues

Software queues

Linux block layer control flow

Thread-specificplug-list

Page 7: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Insufficiency #1: Lack of Parallelism

➢ Increasing stack latency of member SSDs▪ E.g., Stack Latency of sda is less than sdb

➢ Effect of sequential IO serving and round-robin dispatch▪ IOs of last drive will acquire insert after IOs of every other drive gets

dispatched

7

050

100150200250300

sda sdb sdc sdd sde sdf sdg sdh 8-SSD(avg)

1-SSD

Stac

k La

ten

cy

(in

use

c)

|________8-SSD Volume Member Drives ___________|

Page 8: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Insufficiency #2: Inefficient Merge and Sort

➢ Stack Latency in 8-SSD volume is greater than 1-SSD system➢ Stack latency is greater than device latency in 8-SSD volume

➢ Plug-list intermixes IOs destined to all member drives▪ Search for a merge candidate even in unrelated IOs▪ Larger sorting workload across SSDs

8

0%

20%

40%

60%

80%

100%

1-SSD 8-SSD Volume0

100

200

300

400

500

1-SSD 8-SSD Volume

Late

ncy

(u

sec)

Stack Latency Device Latency

(a) Absolute latency (b) Percentage

Page 9: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Insufficiency #3: Unpredictable Blocking

➢ If tag allocation fails, the IO thread blocks waiting for a free tag

➢Uncertainty about active IO count in the pipeline▪ Storage controller dependent

▪ Compromises the IO scalability in SATA controller connected SSD volume

9

0

1

2

3

LSI 9300-8i HBA Intel C602 SATA

No

rmal

ize

d

Thro

ugh

pu

t

1-SSD 2-SSD

0

20

40

60

80

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

Qu

eu

e D

ep

th

Time (Sec)

sda sdb sda+sdb Available tag

(a) IO performance Scaling (b) Tag usage in 2-SSD SATA volume

Page 10: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Outline

10

01 – Overview and Problem Statement

02 – Background and Block Layer Insufficiency

03 – Falcon Architecture

04 – Evaluation

Page 11: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Falcon: Separates IO Batching from IO Serving Tasks

11

SCSI Layer and Drivers

SSD1SSDm

FBL Instance (SSD1)

FBL Instance(SSDm)

bio1 biom

Volume Manager Instance

Falcon IO Management Layer (FML)

Applications

VFSDirect IO Page Cache

bio bio

Falcon Threads

start1 start1

split2 split2

Falcon IO Management Layer (FML) PerformsIO batching tasks Only

Falcon Block Layer (FBL) performs

IO Serving tasks only

Classification Phase (classify)

bio1 bio2 biom

Batching Phase (batch)

Thread-specific plug-list

FML

Process Phase merge, tag

allocation, dispatch

Sort Phase (sort)

FBL…

software queues

Process Phase merge, tag

allocation, dispatch

Sort Phase (sort)

FBL…

Software queues

To SCSI Layer and Drivers

Page 12: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Falcon: Feature Comparison with Linux

12

Block LayerFeatures

Linux1-SSD

Linux Multi-SSD

FalconVolume

Parallel Processing NA

Per-Drive Sort

Neighbor Merge

Dynamic Tag Management

➢Well-suited for multi-SSD volume

➢ Improvements are applicable to 1-SSD system also

Page 13: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Falcon IO Management Layer

➢ IO batching in plug-list▪ No processing, just batching

➢ Classification ▪ Single pass operation▪ No sorting

➢ Enabling parallel processing▪ Creates Falcon threads per FBL

➢ Uniform unplug criteria▪ 32 per-SSD, 256 for 8-SSD volume▪ Lower criteria for low IO demand,

and latency sensitive applications▪ See the paper for more details

13

Move per-drive queues to software queues

insert

Classification Phase

Classify plug-list to temporary per-drive queues

no

enqueue to plug-list

Process NextIO Request

yes

Batching Phase

unplug?

bio (from Volume manager Layer)

Spawn Falcon threads, if needed

Page 14: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Falcon Block Layer

➢ Sort Phase▪ Per-Drive Sort

➢ Process Phase▪ Neighbor Merge▪ Dynamic tag allocation▪ Dispatch

➢ Completion Phase

➢ Able to saturate a Samsung 950 Pro 512GB NVMe SSD ▪ 1375 MB/s (13% better than

Linux)

14

0

0Dispatch IO to driver

dispatch

ready

Process Phase

Sort software-queue

Allocate tag

Neighbor merge

Sort Phase

merge

IRQ event completion

Complete IO

complete

bio(from FML layer)

To SCSI layer and Drivers

Completion Phase

Page 15: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Dynamic Tag Management

➢Allocate a tag only if a dispatch is required

➢ Bio-queue keeps bio objects yet to be dispatched

➢ Pressure point controls the active IO count in the pipeline

15

(a) IO Throughput Scaling for SATA controller

0

10

20

30

40

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30Q

ue

ue

De

pth

Time (sec)

sda-Linux sdb-Linux sda-Falcon sdb-Falcon

(b) Tag usage in 2-SSD SATA volume

0

1

2

3

Linux Falcon

No

rmal

ize

d

Thro

ugh

pu

t

1-SSD 2-SSD

Page 16: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Outline

16

01 – Overview and Problem Statement

02 – Background and Block Layer Insufficiency

03 – Falcon Architecture

04 – Evaluation

Page 17: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Evaluation Setup

➢ Falcon: 600 lines of C kernel code add

➢ Ubuntu 16.04 version with Kernel version 4.4.0, Blk-mq block layer

➢ 2 Intel Xeon CPU E5-2620 2GHz with 6 cores each

➢ 32GB DRAM

➢ 8 Samsung EVO 850 500 GB SSDs, connected using LSI SAS9300-8i HBA

➢ 4KB Stripe size is used by default

➢ Raw volume, Ext4 and XFS file systems are evaluated

➢ Revised FIO is used as micro-benchmark

17

Page 18: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Ext4 File Random IO

➢ Ext4 has file inode lock issue

➢ 1.77x speedup for random read

➢ 1.59x speedup for random write

18

(a) Single file read throughput (b) Single file write throughput

0500

100015002000

1 2 4 8

Thro

ugh

pu

t(M

B/s

ec)

IO thread count

Linux Falcon

0

500

1000

1500

1 2 4 8

Thro

ugh

pu

t (M

B/s

ec)

IO thread count

Linux Falcon

Page 19: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Buffered Random Write

➢ Buffer cache has just 1 thread per volume for flushing dirty pages

➢ 1.39x speedup for 4-SSD volume

➢ 1.59x speedup for 8-SSD volume

19

0400800

12001600

1 2 4 8Th

rou

ghp

ut

(MB

/se

c)SSD count

Linux Falcon

Page 20: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Graph Processing➢G-Store[SC’16] is used

▪ Semi-external graph analytics engine▪ Configurable number of IO threads▪ Linux setup, 1 and 8 IO threads▪ Falcon : 1 IO thread

➢ 8-SSD volume, XFS filesystem

➢ Kronecker graph scale 28, edge factor 16 is used

➢ BFS, kCore: High random IO

➢ Connected component (CC), PageRank (PR): Sequential IO

20

0123456

BFS kCore CC PR

Spe

ed

up

Linux 1 IO Thread Linux 8 IO Threads Falcon

Graph Processing

➢ 4.12x speedup over Linux 1 IO thread setup

➢ 1.78x speedup over Linux 8 IO thread

Page 21: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

IO Trace Replay

21

Trace Name Read (%) IO size range Size (GB) Type

UM-Financial1 23.16 512B - 16715KB 17.22 Online transaction processing

UM-Financial2 82.34 512B - 256.5KB 8.44 Online transaction processing

UM-Websearch1 99.98 512B - 1111KB 15.24 Web Search

UM-Websearch2 99.98 8KB - 32KB 65.82 Web Search

FIU-Home 1 512B - 512KB 34.58 Research group activities

FIU-Mail 8.58 4KB - 4KB 86.64 Mail Server

FIU-Webuser 10.33 4KB - 128KB 30.94 Web User

FIU-Web-vm 21.8 4KB - 4KB 54.52 Webmail proxy/online course management

0500

10001500200025003000

UM-Financial1 UM-Financial2 UM-Websearch1 UM-Webserach2 FIU-Home FIU-Mail FIU-Webuser FIU-Web-vm

Thro

ugh

pu

t (M

B/s

ec)

Linux Falcon

1.67x better IO throughput on average

Page 22: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Conclusion

➢ Separating batching from IO serving tasks is the key for IO scalability in multi-SSD volume

➢ Falcon enforces per-drive processing▪ Improves the IO stack performance

▪Parallelizes the IO serving tasks across member SSDs

➢ Falcon improves the performance by 1.69x for various applications on 8-SSD volume

➢ Falcon achieves 1.13x throughput for an NVMe SSD

22

Page 23: Falcon: Scaling IO Performance in Multi-SSD Volumes · PDF fileDirect IO VFS Page Cache 1 startbio 2 2split IO Phases: Plug, Unplug, Dispatch, Completion IO States: start, split, merge,

Thank You

➢ Falcon is open source now▪https://github.com/iHeartGraph/falcon

23