Challenges in Getting Flash Drives Closer to CPU
Myoungsoo Jung (UT-Dallas), Mahmut Kandemir (PSU)
The University of Texas at Dallas
Take-away
• Leveraging the PCIe bus as a storage interface
  – ≠ conventional memory system interconnects
  – ≠ thin storage interfaces
  – Requires a new SSD architecture and storage stack
• Motivation: there are not many studies focusing on the system characteristics of these emerging PCIe SSD platforms.
• Contributions: we quantitatively analyze the challenges faced by PCIe SSDs in getting flash memory closer to the CPU:
  1. Memory consumption
  2. Computation resource requirements
  3. Performance as a shared storage system
  4. Latency impact on their storage-level queuing mechanisms
Bandwidth Trend
[Chart: interface bandwidth (GB/sec) vs. year (1998~2016) for PATA, SATA, and SSD; the Southbridge performance limit is marked, along with the SATA bandwidth improvement from 150 MB/s to 600 MB/s]
Bandwidth Trend
[Chart: interface bandwidth (GB/sec) vs. year (1998~2016) for PATA, SATA, and SSD, with SSD products plotted over time: Future PCIe SSD (expectation), FusionIO ioDrive Octal, FusionIO ioDrive2, Z-Drive R4, FusionIO ioDrive, SF-1000, Intel X25, ST-Zeus, A25FB, Winchester; the Southbridge performance limit is marked]
SSDs have improved their bandwidth 4x
SSDs begin to blur the distinction between block-access and memory-access semantic devices
Flash Storage Migration
[Diagram: conventional platform layout. CPU cores (~30 GIPS/core) attach to the memory controller hub (Northbridge) with memory slots and high-speed graphic I/O slots (PCI Express); the I/O controller hub (Southbridge) hangs off it with IDE/SATA, USB, PCI slots, and cables and ports leading off-board; flash chips (~400 MHz per flash chip) sit behind the Southbridge's 600 MB/s physical limit]
Taking SSDs out from the I/O controller hub and locating them as close to the CPU side as possible
Interface Bottleneck
The PCIe interface is one of the easiest ways to integrate flash memory into the processor-memory complex
Flash Integration
1. Bridge-based PCIe SSD (BSSD)
2. From-scratch PCIe SSD (FSSD)
Bridge-based PCIe SSD (BSSD)
[Diagram: the memory controller hub's PCIe root complex (RC) connects through an I/O slot (PCI Express) to the PCIe SSD's endpoint (EP); a PCIe-to-SAS bridge controller sits behind the endpoint and drives multiple SAS controllers, each attached to its own flash chips]
• The bridge controller combines multiple traditional SAS/SATA SSD controllers
• It exposes the aggregated SAS/SATA SSD performance over the PCIe link (sketch below)
RC = Root Complex, EP = Endpoint, CTRL = Controller, HBA = Host Bus Adapter
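One way to picture what the bridge controller does: it can stripe each host request across its internal SAS/SATA controllers so that their bandwidth adds up on the PCIe side. The C sketch below is purely conceptual; the function names, stripe size, and controller count are assumptions for illustration, not the vendors' firmware.

    #include <stdint.h>

    #define NUM_SAS_CTRLS   4               /* internal controllers behind the bridge */
    #define STRIPE_BYTES    (128u * 1024u)  /* hypothetical stripe size */

    /* Hypothetical hook: hand one stripe-sized chunk to one SAS controller. */
    void sas_ctrl_submit(int ctrl, uint64_t offset, uint32_t len);

    /* Bridge-controller view: split a host request into stripes and fan them
     * out across the internal SAS controllers, so the PCIe link sees their
     * aggregated bandwidth rather than a single controller's. */
    void bridge_submit(uint64_t offset, uint32_t len)
    {
        uint32_t done = 0;
        while (done < len) {
            uint32_t chunk = (len - done < STRIPE_BYTES) ? (len - done)
                                                         : STRIPE_BYTES;
            int ctrl = (int)(((offset + done) / STRIPE_BYTES) % NUM_SAS_CTRLS);
            sas_ctrl_submit(ctrl, offset + done, chunk);
            done += chunk;
        }
    }

The redundant control logic and encoding/decoding steps listed as CONS on the next slide come from the extra PCIe-to-SAS translation each such request must pass through.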
Bridge-based PCIe SSD (BSSD)
PROS:
• High compatibility
• Fast development process
CONS:
• Redundant control logic
• Computational overheads
• Encoding/decoding overheads
From-scratch PCIe SSD (FSSD)
[Diagram: the from-scratch PCIe SSD attaches the memory controller hub's PCIe root complex to an on-device PCIe switch and multiple native PCIe endpoint controllers (PCIe EP-CTRL), each with its own flash chips; shown alongside the bridge-based design (PCIe-to-SAS bridge plus SAS controllers) for comparison]
PCIe endpoints (EPs) have upstream and downstream buffers, which control in-bound and out-bound I/O requests (a conceptual sketch follows)
The PCIe EPs and the switch are implemented as native PCIe controllers
The FSSD is built from the bottom up by directly interconnecting the NAND flash interface and the external PCIe link
Point-to-point PCIe link network
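As a rough mental model of those endpoint buffers (the talk does not detail the actual controller design), each native EP can be pictured as a pair of circular buffers, one for in-bound requests and one for out-bound completions. Every name and the depth below are assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define EP_QUEUE_DEPTH 64   /* hypothetical per-endpoint buffer depth */

    /* One I/O command as seen by a native PCIe endpoint controller. */
    struct ep_cmd {
        uint64_t lba;       /* logical block address   */
        uint32_t nblocks;   /* transfer size in blocks */
        bool     is_write;
    };

    /* Conceptual endpoint: a downstream buffer for in-bound requests
     * (host -> flash) and an upstream buffer for out-bound completions
     * (flash -> host), as described on the slide. */
    struct pcie_endpoint {
        struct ep_cmd downstream[EP_QUEUE_DEPTH];
        struct ep_cmd upstream[EP_QUEUE_DEPTH];
        unsigned down_head, down_tail;
        unsigned up_head, up_tail;
    };

    /* Host side pushes a request into the downstream buffer; returns false
     * when the buffer is full and the request must be retried later. */
    static bool ep_submit(struct pcie_endpoint *ep, struct ep_cmd cmd)
    {
        unsigned next = (ep->down_tail + 1) % EP_QUEUE_DEPTH;
        if (next == ep->down_head)
            return false;   /* downstream buffer full: flow control kicks in */
        ep->downstream[ep->down_tail] = cmd;
        ep->down_tail = next;
        return true;
    }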
From-scratch PCIe SSD (FSSD)
PROS:
• Highly scalable
• Exposes flash performance
CONS:
• Protocol design/implementation
• Tailoring SW/HW
• Resource competition
Flash Software Stack
Software stack (top to bottom):
• Database
• File System
• Block Storage Layer
• HBA Device Driver
• Logical Block I/O Interface (boundary between the Host and Storage sides)
• Host Interface Layer (NVMHC)
• Flash Software (FTL): buffer cache, address mapping, wear-leveling (a minimal FTL sketch follows)
• Hardware Abstraction Layer
[Diagram: from-scratch SSD architecture, with host cores and DRAM connected over the PCIe host interface (root complex, switch) to PCIe EP-CTRLs, each with its flash chips]
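To make the FTL's responsibilities concrete, here is a toy page-level address map with a greedy wear-leveling allocator. It only illustrates the kind of bookkeeping the flash software performs; the real FTLs in these products are proprietary, and every name and size below is invented.

    #include <stdint.h>

    /* Toy geometry: purely illustrative numbers, not any product's layout. */
    #define NUM_LOGICAL_PAGES  (1u << 20)   /* 4 GB of logical space at 4 KB/page */
    #define NUM_FLASH_BLOCKS   8192u
    #define PAGES_PER_BLOCK    256u

    /* Page-level mapping table, one entry per logical page.  Keeping tables
     * like this (plus a buffer cache) resident in DRAM is a large part of
     * the memory footprint measured later in the talk. */
    static uint32_t l2p[NUM_LOGICAL_PAGES];

    /* Per-block bookkeeping used for a simple wear-leveling policy. */
    static uint32_t erase_count[NUM_FLASH_BLOCKS];
    static uint32_t next_free_page[NUM_FLASH_BLOCKS];

    /* Pick a block with free pages and the fewest erases (greedy wear-leveling). */
    static uint32_t pick_block(void)
    {
        uint32_t best = 0;
        for (uint32_t b = 1; b < NUM_FLASH_BLOCKS; b++)
            if (next_free_page[b] < PAGES_PER_BLOCK &&
                (next_free_page[best] >= PAGES_PER_BLOCK ||
                 erase_count[b] < erase_count[best]))
                best = b;
        return best;   /* a real FTL triggers garbage collection when space runs low */
    }

    /* Out-of-place write: allocate a fresh physical page and remap the LPN;
     * the previously mapped page becomes garbage for the GC to reclaim. */
    static void ftl_write(uint32_t lpn)
    {
        uint32_t blk = pick_block();
        uint32_t ppn = blk * PAGES_PER_BLOCK + next_free_page[blk]++;
        l2p[lpn] = ppn;
    }

    /* Read path: translate the logical page number to its physical page. */
    static uint32_t ftl_read(uint32_t lpn)
    {
        return l2p[lpn];
    }

The point of the sketch is the mapping table itself: at roughly one 4-byte entry per 4 KB page, mapping metadata grows linearly with device capacity, which is one reason a host-resident flash software stack can consume gigabytes of DRAM.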
Experimental Setup
• Host configuration
  – Quad-core i7 Sandy Bridge, 3.4 GHz
  – Extra external HDD (for logging the footprints)
  – 16 GB memory (4GB DDR3-1333 DIMM * 4)
Most performance values observed with FSSD are about 40% better than those with BSSD
Tool
• Synthesized micro-benchmark workloads of Iometer
• Modified Iometer
  – Time-series evaluation: a script that generates log data every second
  – Memory usage evaluation: added a module to Iometer that calls the system API GlobalMemoryStatusEx() (sketch below)
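For reference, a minimal sketch of such a logging module on Windows. GlobalMemoryStatusEx() is the real Win32 call named on the slide; the once-per-second loop, file name, and CSV format are assumptions about how the modified Iometer produced its log data, not the authors' actual code.

    #include <windows.h>
    #include <stdio.h>

    /* Sample the host's physical memory usage once per second and append it
     * to a CSV log, roughly mirroring the module added to Iometer. */
    int main(void)
    {
        FILE *log = fopen("mem_usage.csv", "w");   /* hypothetical log file name */
        if (!log)
            return 1;

        for (;;) {
            MEMORYSTATUSEX st;
            st.dwLength = sizeof(st);              /* must be set before the call */
            if (GlobalMemoryStatusEx(&st)) {
                double used_gb =
                    (double)(st.ullTotalPhys - st.ullAvailPhys) / (1ULL << 30);
                fprintf(log, "%lu,%.3f\n", (unsigned long)GetTickCount(), used_gb);
                fflush(log);
            }
            Sleep(1000);                           /* one sample per second */
        }
    }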
Memory Usage (Overall)
[Charts: physical memory consumption (GB) vs. access granularity (512 B, 4 KB, 16 KB, 128 KB, 1 MB; request sizes of 1~512 sectors) for FSSD and BSSD, sequential and random; left panel: writes, right panel: reads]
• Writes: FSSD consumes 3x~16x more memory space than BSSD (~0.6 GB)
• Reads: FSSD consumes 2.5x more memory space than BSSD (~0.6 GB)
Memory Usage (BSSD)
[Chart: memory usage (GB) over time (0~12,500 s) for BSSD-Legacy and BSSD-Queue; both curves stay within roughly 0.63~0.67 GB]
• Queue mode submits I/Os whenever the device is available (128 queue entries)
• BSSD requires only ~0.6 GB of memory space regardless of the I/O type and size (a sketch of the two submission modes follows)
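To clarify the two I/O modes being compared (legacy vs. queue), a generic sketch follows; submit_io() and reap_completions() are hypothetical placeholders for driver calls, not Iometer or vendor code. Legacy mode keeps a single request outstanding, while queue mode keeps the device's 128 queue entries as full as possible.

    #include <stdbool.h>

    #define QUEUE_DEPTH 128   /* queue entries exposed by the device, per the slide */

    /* Hypothetical driver hooks (not a real API): issue one asynchronous
     * request, or collect however many requests have completed. */
    bool submit_io(void);          /* false if the device cannot accept more      */
    int  reap_completions(void);   /* number of requests finished since last call */

    /* Legacy mode: one outstanding request at a time. */
    void run_legacy(long total_ios)
    {
        for (long i = 0; i < total_ios; i++) {
            submit_io();
            while (reap_completions() == 0)
                ;                  /* wait for this single request to finish */
        }
    }

    /* Queue mode: keep submitting whenever the device has a free entry,
     * up to 128 requests in flight at once. */
    void run_queued(long total_ios)
    {
        long submitted = 0, completed = 0, inflight = 0;
        while (completed < total_ios) {
            while (inflight < QUEUE_DEPTH && submitted < total_ios && submit_io()) {
                submitted++;
                inflight++;
            }
            int done = reap_completions();
            completed += done;
            inflight  -= done;
        }
    }

Keeping many requests in flight is what lets queue mode reach higher throughput, but it is also why its per-request latency and host resource usage look worse in the later slides.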
Memory Usage (FSSD)
[Chart: memory usage (GB) over time (0~12,500 s) for FSSD-Legacy and FSSD-Queue]
• FSSD starts at about a 2 GB memory requirement; as the I/O process progresses, memory usage keeps increasing in a logarithmic fashion and reaches 10 GB
• Using 10 GB of memory just to manage the underlying SSD may not be acceptable in many applications
CPU Usage (BSSD)
[Chart: time series of host-level CPU usage (%) over time (0~3,200 s) for BSSD-Legacy and BSSD-Queue]
• BSSD consumes 15%~30% of total CPU cycles for handling I/O requests
CPU Usage (FSSD)
[Chart: host-level CPU usage (%) over time (0~3,200 s) for FSSD-Legacy and FSSD-Queue]
• FSSD requires much higher CPU usage (50%~90% of host-side CPU cycles)
• I/O service with queue-mode operation requires about 50% more CPU cycles than legacy mode
• CPU usage over 60% just for I/O processing can degrade overall system performance
FSSD performance (multi-threads)
[Charts: FSSD latency (ms) and throughput (IOPS) over time (0~1,000 s) for 1, 4, and 8 worker threads]
• Latency with eight workers is 118% worse than with four workers and 289% worse than with a single worker
• Throughput with eight workers is 2.2x better than with a single worker
• FSSD offers very stable and predictable performance
FSSD resource usages (multi-threads)
[Charts: FSSD CPU usage (%) and memory requirements (GB) over time (0~1,000 s) for 1, 4, and 8 worker threads]
• The multi-threading advantage decreases because of high memory requirements and CPU usage
• With more workers, FSSD requires 134% more memory space and 201% more computation resources
BSSD resource usages (multi-threads)
[Charts: BSSD CPU usage (%) and memory requirements (MB) over time (0~1,000 s) for 1, 4, and 8 worker threads]
• Offers similar memory requirements (less than 0.66 GB) irrespective of the number of threads
• Offers similar CPU usage (less than 30%) irrespective of the number of threads
BSSD performance (multi-threads)
[Charts: BSSD latency (ms) and throughput (IOPS) over time (0~1,000 s) for 1, 4, and 8 worker threads]
• Latency with eight workers is 289% worse than with four workers and 708% worse than with a single worker
• Throughput shows no differences with varying numbers of workers
• A write cliff occurs (garbage collection impact)
Latency Impact on a Queuing Method
[Charts: latency (ms) over time for queue vs. legacy requests under sequential and random access; left: FSSD, right: BSSD]
• Across the FSSD and BSSD, sequential and random cases, a queued request can be worse than a legacy request by 86x, 99x, 106x, and 184x
Summary
• Design trade-off between performance and resource utilization
  – All-Flash-Array
  – Data-center/HPC local node SSD
• Software stack optimization
  – Co-operative approaches
  – Unified/direct file systems
  – Garbage collection schedulers
  – Queue control
• We are constructing an environment for automated SSD evaluation at camelab.org