TRANSCRIPT
Falcon: Scaling IO Performance in Multi-SSD Volumes
Pradeep Kumar H Howie Huang
The George Washington University
SSDs in Big Data Applications
➢ Recent trends advocate using many SSDs for higher throughput in:
▪ Graph Analytics
▪ Machine Learning
▪ Key-Value stores, etc.
➢ New techniques are taking advantage of the high random IOPS of SSDs:
▪ Fine-grained IOs in graph processing [FAST'17]
▪ Random IOs in graph processing [ATC'16]
▪ A range scan in WiscKey becomes many parallel random IOs [FAST'16]
➢ Increasing use of batched IO interfaces such as libaio in Linux (sketched below)
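For reference, a minimal sketch of the libaio batched-submission pattern the slide refers to: several reads are prepared, submitted with a single io_submit() call, and reaped together. The file path, batch size, and offsets are placeholders and error handling is omitted. Build with `gcc aio_batch.c -laio`.

```c
#define _GNU_SOURCE            /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BATCH 8
#define BLK   4096

int main(void)
{
    /* Placeholder path: any file or device opened with O_DIRECT. */
    int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
    io_context_t ctx = 0;
    io_setup(BATCH, &ctx);                 /* create the kernel AIO context */

    struct iocb cbs[BATCH], *cbp[BATCH];
    for (int i = 0; i < BATCH; i++) {
        void *buf;
        posix_memalign(&buf, BLK, BLK);    /* O_DIRECT requires aligned buffers */
        /* one 4KB random read per iocb; offsets within the first 1GB */
        io_prep_pread(&cbs[i], fd, buf, BLK,
                      (long long)(rand() % 262144) * BLK);
        cbp[i] = &cbs[i];
    }
    io_submit(ctx, BATCH, cbp);            /* submit the whole batch in one syscall */

    struct io_event events[BATCH];
    io_getevents(ctx, BATCH, BATCH, events, NULL);  /* reap all completions */

    io_destroy(ctx);
    close(fd);
    return 0;
}
```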
Existing IO Model
[Figure: IO performance vs. programming complexity. Kernel-managed IO with one application IO thread sits at low complexity and low performance; kernel-managed IO with many application IO threads and application-managed IO with one IO thread per SSD raise performance at the cost of complexity; Falcon reaches high performance with a single application IO thread.]
[Figure: (a) Application-managed IO: one application IO thread per SSD, each driving its own IO stack instance in user space. (b) Kernel-managed IO: computing threads, application IO threads, and a userspace IO buffer in user space; one or more application IO threads per volume submit to the kernel-space IO stack, which fans out to the member SSDs.]
Outline
01 – Overview and Problem Statement
02 – Background and Block Layer Insufficiency
03 – Falcon Architecture
04 – Evaluation
Linux: IO Flow and IO States
[Figure: Linux IO flow. Applications issue IO through the VFS (direct IO or page cache); a bio enters the volume manager instance at start (1) and is split (2) into per-drive bios (bio1..biom), one per block layer instance (SSD1..SSDm). In the plug phase, each bio is checked for a merge (3) and for a free tag, becoming ready (4) if a tag is available or wait (5) otherwise, and is enqueued to the thread-specific plug-list. When the unplug condition is met, the unplug phase sorts and classifies the plug-list and inserts (6) requests into the software queues; the dispatch phase dispatches (7) IOs to the driver, through the SCSI layer and drivers to the SSDs. On an IRQ completion event, the completion phase completes (8) the IO, frees its tag, and the next request is processed.]
IO Phases: Plug, Unplug, Dispatch, Completion
IO States: start, split, merge, wait, ready, insert, dispatch, complete
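To make the plug/unplug mechanism above concrete, here is roughly how a kernel submission path batches bios on the per-task plug-list (kernel 4.4-era API, simplified): everything issued between blk_start_plug() and blk_finish_plug() accumulates on the plug-list, and the unplug work runs at blk_finish_plug(). A sketch, not Falcon code:

```c
#include <linux/blkdev.h>
#include <linux/bio.h>
#include <linux/fs.h>

static void submit_batch(struct bio **bios, int n)
{
    struct blk_plug plug;
    int i;

    blk_start_plug(&plug);          /* arm the thread-specific plug-list */
    for (i = 0; i < n; i++)
        submit_bio(READ, bios[i]);  /* bios accumulate on the plug-list;
                                       merge and tag allocation also
                                       happen on this path */
    blk_finish_plug(&plug);         /* unplug: sort, classify, and flush
                                       to the software queues for dispatch */
}
```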
Linux: Mixes IO Batching and IO Serving Tasks
➢ Examples:
▪ Mixing batching with merge and tag allocation in the plug phase
▪ Mixing classify with sort in the unplug phase
➢ Root cause:
▪ Many tasks are tied to the plug-list
▪ The plug-list was not designed for multi-SSD volumes
➢ Creates many insufficiencies:
▪ Lack of parallelism in IO processing
▪ Inefficient merge and sort
▪ Unpredictable blocking
[Figure: Linux block layer control flow. The plug phase (batch, merge, tag allocation) accumulates bio1..biom on the thread-specific plug-list; the unplug phase (sort, classify) distributes them to the per-drive software queues; the dispatch phase (dispatch) sends them to the SCSI layer and drivers.]
Insufficiency #1: Lack of Parallelism
➢ Stack latency increases across member SSDs
▪ E.g., the stack latency of sda is less than that of sdb
➢ Effect of sequential IO serving and round-robin dispatch
▪ IOs for the last drive reach the insert state only after IOs for every other drive have been dispatched
[Figure: Stack latency (in µsec, 0-300) for each member drive (sda-sdh) of the 8-SSD volume, the 8-SSD average, and a 1-SSD system; stack latency grows from sda to sdh.]
Insufficiency #2: Inefficient Merge and Sort
➢ Stack latency in the 8-SSD volume is greater than in the 1-SSD system
➢ Stack latency is greater than device latency in the 8-SSD volume
➢ The plug-list intermixes IOs destined to all member drives
▪ Merge candidates are searched for even among unrelated IOs
▪ Larger sorting workload across SSDs
[Figure: (a) Absolute latency (µsec, 0-500) and (b) percentage breakdown of stack latency vs. device latency for the 1-SSD system and the 8-SSD volume.]
Insufficiency #3: Unpredictable Blocking
➢ If tag allocation fails, the IO thread blocks waiting for a free tag
➢ Uncertainty about the active IO count in the pipeline
▪ Storage-controller dependent
▪ Compromises IO scalability for SSD volumes connected to a SATA controller
[Figure: (a) IO performance scaling: normalized throughput (0-3) of 1-SSD vs. 2-SSD configurations on an LSI 9300-8i HBA and an Intel C602 SATA controller. (b) Tag usage in the 2-SSD SATA volume: queue depth (0-80) over 30 seconds for sda, sdb, sda+sdb, and the available tags.]
Outline
01 – Overview and Problem Statement
02 – Background and Block Layer Insufficiency
03 – Falcon Architecture
04 – Evaluation
Falcon: Separates IO Batching from IO Serving Tasks
[Figure: Falcon architecture. Applications issue IO through the VFS (direct IO or page cache) to the volume manager instance; each bio goes through start (1) and split (2), and the Falcon IO Management Layer (FML) hands per-drive bios (bio1..biom) to per-drive FBL instances (SSD1..SSDm), each served by Falcon threads, down through the SCSI layer and drivers to the SSDs.]
➢ The Falcon IO Management Layer (FML) performs IO batching tasks only
➢ The Falcon Block Layer (FBL) performs IO serving tasks only
[Figure: Falcon control flow. In the FML, the batching phase (batch) accumulates bio1..biom on the thread-specific plug-list and the classification phase (classify) routes them to per-drive queues. Each FBL then runs its own sort phase (sort) and process phase (merge, tag allocation, dispatch) over its software queues, feeding the SCSI layer and drivers.]
Falcon: Feature Comparison with Linux
Block Layer Feature     | Linux 1-SSD | Linux Multi-SSD | Falcon Volume
Parallel Processing     | NA          | ✗               | ✓
Per-Drive Sort          | ✓           | ✗               | ✓
Neighbor Merge          | ✗           | ✗               | ✓
Dynamic Tag Management  | ✗           | ✗               | ✓
➢ Well-suited for multi-SSD volumes
➢ The improvements also apply to 1-SSD systems
Falcon IO Management Layer
➢ IO batching in the plug-list
▪ No processing, just batching
➢ Classification (sketched below)
▪ Single-pass operation
▪ No sorting
➢ Enabling parallel processing
▪ Creates Falcon threads per FBL
➢ Uniform unplug criteria
▪ 32 per SSD, i.e., 256 for an 8-SSD volume
▪ A lower criterion for low-IO-demand and latency-sensitive applications
▪ See the paper for more details
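A hypothetical sketch of the single-pass classification step: each batched request on the plug-list is routed to a temporary per-drive queue in one O(n) pass, with no sorting. The types and names (io_req, fml_classify) are illustrative, not Falcon's actual source.

```c
#include <linux/list.h>
#include <linux/types.h>

struct io_req {                 /* stand-in for a batched bio */
    sector_t         sector;    /* volume-relative sector */
    int              drive;     /* member drive, set by the volume manager */
    struct list_head node;
};

static void fml_classify(struct list_head *plug_list,
                         struct list_head per_drive[], int ndrives)
{
    struct io_req *req, *tmp;

    /* Single pass, no sorting: O(n) regardless of drive count. */
    list_for_each_entry_safe(req, tmp, plug_list, node) {
        list_del(&req->node);
        list_add_tail(&req->node, &per_drive[req->drive]);
    }
    /* The per-drive queues are then moved to the software queues
     * (the "insert" state) and Falcon threads are woken as needed. */
}
```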
[Figure: FML control flow. A bio from the volume manager layer enters the batching phase and is enqueued to the plug-list; if the unplug criterion is not met, the next IO request is processed. On unplug, the classification phase classifies the plug-list into temporary per-drive queues, moves them to the software queues (insert), and spawns Falcon threads if needed.]
Falcon Block Layer
➢ Sort phase
▪ Per-drive sort
➢ Process phase (sketched below)
▪ Neighbor merge
▪ Dynamic tag allocation
▪ Dispatch
➢ Completion phase
➢ Able to saturate a Samsung 950 Pro 512GB NVMe SSD
▪ 1375 MB/s (13% better than Linux)
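A hypothetical sketch of per-drive sort followed by neighbor merge: once one drive's software queue is sorted by sector, a merge candidate can only be the immediate neighbor, so merging is a single linear scan rather than a search over unrelated IOs. Names and types are illustrative.

```c
#include <stdlib.h>

struct io_req { long sector; long nsect; };

static int by_sector(const void *a, const void *b)
{
    const struct io_req *x = a, *y = b;
    return (x->sector > y->sector) - (x->sector < y->sector);
}

/* Returns the new queue length after merging contiguous neighbors. */
static int fbl_sort_and_merge(struct io_req *q, int n)
{
    int i, out = 0;

    qsort(q, n, sizeof(*q), by_sector);       /* per-drive sort */
    for (i = 1; i < n; i++) {
        if (q[out].sector + q[out].nsect == q[i].sector)
            q[out].nsect += q[i].nsect;       /* neighbor merge: extend */
        else
            q[++out] = q[i];                  /* not contiguous: keep */
    }
    return n ? out + 1 : 0;
}
```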
[Figure: FBL control flow. A bio from the FML layer enters the sort phase, which sorts the software queue. The process phase performs neighbor merge (merge), allocates a tag (ready), and dispatches the IO to the driver (dispatch), toward the SCSI layer and drivers. On an IRQ completion event, the completion phase completes the IO (complete).]
Dynamic Tag Management
➢ Allocate a tag only if a dispatch is required
➢ The bio-queue keeps bio objects yet to be dispatched
➢ A pressure point controls the active IO count in the pipeline (sketched below)
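A hypothetical sketch of dynamic tag management under these rules: requests wait on a per-drive bio-queue, a tag is taken only when a dispatch actually happens, and a pressure point caps the IOs in flight so the submitting thread never blocks on tag exhaustion. tag_alloc() and dispatch_to_driver() are assumed helpers; all names are illustrative.

```c
#define PRESSURE_POINT 32       /* illustrative cap on in-flight IOs per drive */

struct io_req;                  /* opaque request, as in the earlier sketches */

int  tag_alloc(void);                                /* assumed: tag or <0 if none */
void dispatch_to_driver(struct io_req *r, int tag);  /* assumed helper */

struct fbl_queue {
    struct io_req *bioq[256];   /* bio-queue: requests not yet dispatched */
    int            head, tail;
    int            in_flight;   /* IOs currently holding a tag */
};

static void fbl_try_dispatch(struct fbl_queue *q)
{
    /* Dispatch only while below the pressure point and a tag is free. */
    while (q->head != q->tail && q->in_flight < PRESSURE_POINT) {
        int tag = tag_alloc();
        if (tag < 0)
            break;              /* no free tag: leave bios queued,
                                   do not block the IO thread */
        dispatch_to_driver(q->bioq[q->head], tag);
        q->head = (q->head + 1) % 256;
        q->in_flight++;
    }
}

/* On IRQ completion: free the tag, decrement in_flight, and call
 * fbl_try_dispatch() again to drain the bio-queue. */
```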
[Figure: (a) IO throughput scaling for the SATA controller: normalized throughput (0-3) of 1-SSD vs. 2-SSD under Linux and Falcon. (b) Tag usage in the 2-SSD SATA volume: queue depth (0-40) over 30 seconds for sda-Linux, sdb-Linux, sda-Falcon, and sdb-Falcon.]
Outline
01 – Overview and Problem Statement
02 – Background and Block Layer Insufficiency
03 – Falcon Architecture
04 – Evaluation
Evaluation Setup
➢ Falcon adds about 600 lines of C kernel code
➢ Ubuntu 16.04 with Linux kernel 4.4.0 and the blk-mq block layer
➢ 2 Intel Xeon E5-2620 CPUs at 2GHz, 6 cores each
➢ 32GB DRAM
➢ 8 Samsung 850 EVO 500GB SSDs, connected through an LSI SAS9300-8i HBA
➢ 4KB stripe size is used by default (see the mapping sketch below)
➢ Raw volume, Ext4, and XFS file systems are evaluated
➢ A revised FIO is used as the micro-benchmark
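As a worked example of the default 4KB striping, and assuming a standard RAID-0-style round-robin layout (the paper's exact mapping may differ), logical block i of the volume lands on member drive i % NDRIVES at per-drive block i / NDRIVES:

```c
#define STRIPE   4096           /* default stripe size from the setup above */
#define NDRIVES  8              /* 8-SSD volume */

/* Map a volume byte offset to (member drive, per-drive byte offset). */
static void map_block(long long vol_off, int *drive, long long *drive_off)
{
    long long blk = vol_off / STRIPE;
    *drive     = (int)(blk % NDRIVES);                       /* round-robin */
    *drive_off = (blk / NDRIVES) * STRIPE + vol_off % STRIPE;
}
```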
Ext4 File Random IO
➢ Ext4 has a file inode lock issue
➢ 1.77x speedup for random read
➢ 1.59x speedup for random write
[Figure: (a) Single-file read throughput (MB/sec, 0-2000) and (b) single-file write throughput (MB/sec, 0-1500) vs. IO thread count (1, 2, 4, 8) for Linux and Falcon.]
Buffered Random Write
➢ The buffer cache has just one thread per volume for flushing dirty pages
➢ 1.39x speedup for 4-SSD volume
➢ 1.59x speedup for 8-SSD volume
[Figure: Buffered random-write throughput (MB/sec, 0-1600) vs. SSD count (1, 2, 4, 8) for Linux and Falcon.]
Graph Processing
➢ G-Store [SC'16] is used
▪ Semi-external graph analytics engine
▪ Configurable number of IO threads
▪ Linux setup: 1 and 8 IO threads
▪ Falcon: 1 IO thread
➢ 8-SSD volume, XFS filesystem
➢ A Kronecker graph of scale 28 with edge factor 16 is used
➢ BFS, kCore: high random IO
➢ Connected Components (CC), PageRank (PR): sequential IO
[Figure: Speedup (0-6) of Linux with 1 IO thread, Linux with 8 IO threads, and Falcon on BFS, kCore, CC, and PR.]
➢ 4.12x speedup over the Linux 1-IO-thread setup
➢ 1.78x speedup over the Linux 8-IO-thread setup
IO Trace Replay
Trace Name     | Read (%) | IO Size Range  | Size (GB) | Type
UM-Financial1  | 23.16    | 512B - 16715KB | 17.22     | Online transaction processing
UM-Financial2  | 82.34    | 512B - 256.5KB | 8.44      | Online transaction processing
UM-Websearch1  | 99.98    | 512B - 1111KB  | 15.24     | Web search
UM-Websearch2  | 99.98    | 8KB - 32KB     | 65.82     | Web search
FIU-Home       | 1        | 512B - 512KB   | 34.58     | Research group activities
FIU-Mail       | 8.58     | 4KB            | 86.64     | Mail server
FIU-Webuser    | 10.33    | 4KB - 128KB    | 30.94     | Web user
FIU-Web-vm     | 21.8     | 4KB            | 54.52     | Webmail proxy/online course management
[Figure: Replay throughput (MB/sec, 0-3000) of Linux vs. Falcon for UM-Financial1, UM-Financial2, UM-Websearch1, UM-Websearch2, FIU-Home, FIU-Mail, FIU-Webuser, and FIU-Web-vm.]
➢ 1.67x better IO throughput on average
Conclusion
➢ Separating IO batching from IO serving tasks is the key to IO scalability in multi-SSD volumes
➢ Falcon enforces per-drive processing
▪ Improves IO stack performance
▪ Parallelizes the IO serving tasks across member SSDs
➢ Falcon improves performance by 1.69x for various applications on an 8-SSD volume
➢ Falcon achieves 1.13x throughput on a single NVMe SSD
Thank You
➢ Falcon is now open source
▪ https://github.com/iHeartGraph/falcon