Challenges in Getting Flash Drives Closer to CPU
Myoungsoo Jung (UT-Dallas), Mahmut Kandemir (PSU)
The University of Texas at Dallas
Take-away
• Leveraging the PCIe bus as a storage interface
  – ≠ conventional memory system interconnects
  – ≠ thin storage interfaces
  – Requires a new SSD architecture and storage stack
• Motivation: there are not many studies focusing on the system characteristics of these emerging PCIe SSD platforms.
• Contributions: we quantitatively analyze the challenges faced by PCIe SSDs in getting flash memory closer to the CPU:
  1. Memory consumption
  2. Computation resource requirements
  3. Performance as a shared storage system
  4. Latency impact on their storage-level queuing mechanisms
Bandwidth Trend
[Chart: interface bandwidth (GB/sec) vs. year (1998~2016) for PATA, SATA, and SSD; the Southbridge performance limit is marked, along with the SATA bandwidth improvement from 150 MB/s to 600 MB/s]
Bandwidth Trend
[Chart: interface bandwidth (GB/sec) vs. year (1998~2016) for PATA, SATA, and SSD, with SSD products plotted over time: Future PCIe SSD (expectation), FusionIO ioDrive Octal, FusionIO ioDrive2, Z-Drive R4, FusionIO ioDrive, SF-1000, Intel X25, ST-Zeus, A25FB, Winchester; the Southbridge performance limit is marked]
SSDs have improved their bandwidth 4x
SSDs begin to blur the distinction between block-access and memory-access semantic devices
Flash Storage Migration
[Diagram: conventional platform layout. CPU cores (~30 GIPS/core) attach to the memory controller hub (Northbridge) with memory slots and high-speed graphic I/O slots (PCI Express); the I/O controller hub (Southbridge) hangs off it with IDE/SATA, USB, PCI slots, and cables and ports leading off-board; flash chips (~400 MHz per flash chip) sit behind the Southbridge's 600 MB/s physical limit]
Taking SSDs out from the I/O controller hub and locating them as close to the CPU side as possible
Interface Bottleneck
The PCIe interface is one of the easiest ways to integrate flash memory into the processor-memory complex
Flash Integration
1. Bridge-based PCIe SSD (BSSD)
2. From-scratch PCIe SSD (FSSD)
Bridge-based PCIe SSD (BSSD)
[Diagram: the memory controller hub's PCIe root complex (RC) connects through an I/O slot (PCI Express) to the PCIe SSD's endpoint (EP); a PCIe-to-SAS bridge controller sits behind the endpoint and drives multiple SAS controllers, each attached to its own flash chips]
• The bridge controller combines multiple traditional SAS/SATA SSD controllers
• It exposes the aggregated SAS/SATA SSD performance over the PCIe link (sketch below)
RC = Root Complex, EP = Endpoint, CTRL = Controller, HBA = Host Bus Adapter
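One way to picture what the bridge controller does: it can stripe each host request across its internal SAS/SATA controllers so that their bandwidth adds up on the PCIe side. The C sketch below is purely conceptual; the function names, stripe size, and controller count are assumptions for illustration, not the vendors' firmware.

    #include <stdint.h>

    #define NUM_SAS_CTRLS   4               /* internal controllers behind the bridge */
    #define STRIPE_BYTES    (128u * 1024u)  /* hypothetical stripe size */

    /* Hypothetical hook: hand one stripe-sized chunk to one SAS controller. */
    void sas_ctrl_submit(int ctrl, uint64_t offset, uint32_t len);

    /* Bridge-controller view: split a host request into stripes and fan them
     * out across the internal SAS controllers, so the PCIe link sees their
     * aggregated bandwidth rather than a single controller's. */
    void bridge_submit(uint64_t offset, uint32_t len)
    {
        uint32_t done = 0;
        while (done < len) {
            uint32_t chunk = (len - done < STRIPE_BYTES) ? (len - done)
                                                         : STRIPE_BYTES;
            int ctrl = (int)(((offset + done) / STRIPE_BYTES) % NUM_SAS_CTRLS);
            sas_ctrl_submit(ctrl, offset + done, chunk);
            done += chunk;
        }
    }

The redundant control logic and encoding/decoding steps listed as CONS on the next slide come from the extra PCIe-to-SAS translation each such request must pass through.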
Bridge-based PCIe SSD (BSSD)
PROS:
• High compatibility
• Fast development process
CONS:
• Redundant control logic
• Computational overheads
• Encoding/decoding overheads
From-scratch PCIe SSD (FSSD)
[Diagram: the from-scratch PCIe SSD attaches the memory controller hub's PCIe root complex to an on-device PCIe switch and multiple native PCIe endpoint controllers (PCIe EP-CTRL), each with its own flash chips; shown alongside the bridge-based design (PCIe-to-SAS bridge plus SAS controllers) for comparison]
PCIe endpoints (EPs) have upstream and downstream buffers, which control in-bound and out-bound I/O requests (a conceptual sketch follows)
The PCIe EPs and the switch are implemented as native PCIe controllers
The FSSD is built from the bottom up by directly interconnecting the NAND flash interface and the external PCIe link
Point-to-point PCIe link network
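As a rough mental model of those endpoint buffers (the talk does not detail the actual controller design), each native EP can be pictured as a pair of circular buffers, one for in-bound requests and one for out-bound completions. Every name and the depth below are assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define EP_QUEUE_DEPTH 64   /* hypothetical per-endpoint buffer depth */

    /* One I/O command as seen by a native PCIe endpoint controller. */
    struct ep_cmd {
        uint64_t lba;       /* logical block address   */
        uint32_t nblocks;   /* transfer size in blocks */
        bool     is_write;
    };

    /* Conceptual endpoint: a downstream buffer for in-bound requests
     * (host -> flash) and an upstream buffer for out-bound completions
     * (flash -> host), as described on the slide. */
    struct pcie_endpoint {
        struct ep_cmd downstream[EP_QUEUE_DEPTH];
        struct ep_cmd upstream[EP_QUEUE_DEPTH];
        unsigned down_head, down_tail;
        unsigned up_head, up_tail;
    };

    /* Host side pushes a request into the downstream buffer; returns false
     * when the buffer is full and the request must be retried later. */
    static bool ep_submit(struct pcie_endpoint *ep, struct ep_cmd cmd)
    {
        unsigned next = (ep->down_tail + 1) % EP_QUEUE_DEPTH;
        if (next == ep->down_head)
            return false;   /* downstream buffer full: flow control kicks in */
        ep->downstream[ep->down_tail] = cmd;
        ep->down_tail = next;
        return true;
    }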
From-scratch PCIe SSD (FSSD)
PROS:
• Highly scalable
• Exposes flash performance
CONS:
• Protocol design/implementation
• Tailoring SW/HW
• Resource competition
Flash Software Stack
Software stack (top to bottom):
• Database
• File System
• Block Storage Layer
• HBA Device Driver
• Logical Block I/O Interface (boundary between the Host and Storage sides)
• Host Interface Layer (NVMHC)
• Flash Software (FTL): buffer cache, address mapping, wear-leveling (a minimal FTL sketch follows)
• Hardware Abstraction Layer
[Diagram: from-scratch SSD architecture, with host cores and DRAM connected over the PCIe host interface (root complex, switch) to PCIe EP-CTRLs, each with its flash chips]
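To make the FTL's responsibilities concrete, here is a toy page-level address map with a greedy wear-leveling allocator. It only illustrates the kind of bookkeeping the flash software performs; the real FTLs in these products are proprietary, and every name and size below is invented.

    #include <stdint.h>

    /* Toy geometry: purely illustrative numbers, not any product's layout. */
    #define NUM_LOGICAL_PAGES  (1u << 20)   /* 4 GB of logical space at 4 KB/page */
    #define NUM_FLASH_BLOCKS   8192u
    #define PAGES_PER_BLOCK    256u

    /* Page-level mapping table, one entry per logical page.  Keeping tables
     * like this (plus a buffer cache) resident in DRAM is a large part of
     * the memory footprint measured later in the talk. */
    static uint32_t l2p[NUM_LOGICAL_PAGES];

    /* Per-block bookkeeping used for a simple wear-leveling policy. */
    static uint32_t erase_count[NUM_FLASH_BLOCKS];
    static uint32_t next_free_page[NUM_FLASH_BLOCKS];

    /* Pick a block with free pages and the fewest erases (greedy wear-leveling). */
    static uint32_t pick_block(void)
    {
        uint32_t best = 0;
        for (uint32_t b = 1; b < NUM_FLASH_BLOCKS; b++)
            if (next_free_page[b] < PAGES_PER_BLOCK &&
                (next_free_page[best] >= PAGES_PER_BLOCK ||
                 erase_count[b] < erase_count[best]))
                best = b;
        return best;   /* a real FTL triggers garbage collection when space runs low */
    }

    /* Out-of-place write: allocate a fresh physical page and remap the LPN;
     * the previously mapped page becomes garbage for the GC to reclaim. */
    static void ftl_write(uint32_t lpn)
    {
        uint32_t blk = pick_block();
        uint32_t ppn = blk * PAGES_PER_BLOCK + next_free_page[blk]++;
        l2p[lpn] = ppn;
    }

    /* Read path: translate the logical page number to its physical page. */
    static uint32_t ftl_read(uint32_t lpn)
    {
        return l2p[lpn];
    }

The point of the sketch is the mapping table itself: at roughly one 4-byte entry per 4 KB page, mapping metadata grows linearly with device capacity, which is one reason a host-resident flash software stack can consume gigabytes of DRAM.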
Experimental Setup
• Host configuration
  – Quad-core i7 Sandy Bridge, 3.4 GHz
  – Extra external HDD (for logging the footprints)
  – 16 GB memory (4GB DDR3-1333 DIMM * 4)
Most performance values observed with FSSD are about 40% better than those with BSSD
Tool
• Synthesized micro-benchmark workloads of Iometer
• Modified Iometer
  – Time-series evaluation: a script that generates log data every second
  – Memory usage evaluation: added a module to Iometer that calls the system API GlobalMemoryStatusEx() (sketch below)
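For reference, a minimal sketch of such a logging module on Windows. GlobalMemoryStatusEx() is the real Win32 call named on the slide; the once-per-second loop, file name, and CSV format are assumptions about how the modified Iometer produced its log data, not the authors' actual code.

    #include <windows.h>
    #include <stdio.h>

    /* Sample the host's physical memory usage once per second and append it
     * to a CSV log, roughly mirroring the module added to Iometer. */
    int main(void)
    {
        FILE *log = fopen("mem_usage.csv", "w");   /* hypothetical log file name */
        if (!log)
            return 1;

        for (;;) {
            MEMORYSTATUSEX st;
            st.dwLength = sizeof(st);              /* must be set before the call */
            if (GlobalMemoryStatusEx(&st)) {
                double used_gb =
                    (double)(st.ullTotalPhys - st.ullAvailPhys) / (1ULL << 30);
                fprintf(log, "%lu,%.3f\n", (unsigned long)GetTickCount(), used_gb);
                fflush(log);
            }
            Sleep(1000);                           /* one sample per second */
        }
    }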
Memory Usage (Overall)
[Charts: physical memory consumption (GB) vs. access granularity (512 B, 4 KB, 16 KB, 128 KB, 1 MB; request sizes of 1~512 sectors) for FSSD and BSSD, sequential and random; left panel: writes, right panel: reads]
• Writes: FSSD consumes 3x~16x more memory space than BSSD (~0.6 GB)
• Reads: FSSD consumes 2.5x more memory space than BSSD (~0.6 GB)
Memory Usage (BSSD)
[Chart: memory usage (GB) over time (0~12,500 s) for BSSD-Legacy and BSSD-Queue; both curves stay within roughly 0.63~0.67 GB]
• Queue mode submits I/Os whenever the device is available (128 queue entries)
• BSSD requires only ~0.6 GB of memory space regardless of the I/O type and size (a sketch of the two submission modes follows)
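To clarify the two I/O modes being compared (legacy vs. queue), a generic sketch follows; submit_io() and reap_completions() are hypothetical placeholders for driver calls, not Iometer or vendor code. Legacy mode keeps a single request outstanding, while queue mode keeps the device's 128 queue entries as full as possible.

    #include <stdbool.h>

    #define QUEUE_DEPTH 128   /* queue entries exposed by the device, per the slide */

    /* Hypothetical driver hooks (not a real API): issue one asynchronous
     * request, or collect however many requests have completed. */
    bool submit_io(void);          /* false if the device cannot accept more      */
    int  reap_completions(void);   /* number of requests finished since last call */

    /* Legacy mode: one outstanding request at a time. */
    void run_legacy(long total_ios)
    {
        for (long i = 0; i < total_ios; i++) {
            submit_io();
            while (reap_completions() == 0)
                ;                  /* wait for this single request to finish */
        }
    }

    /* Queue mode: keep submitting whenever the device has a free entry,
     * up to 128 requests in flight at once. */
    void run_queued(long total_ios)
    {
        long submitted = 0, completed = 0, inflight = 0;
        while (completed < total_ios) {
            while (inflight < QUEUE_DEPTH && submitted < total_ios && submit_io()) {
                submitted++;
                inflight++;
            }
            int done = reap_completions();
            completed += done;
            inflight  -= done;
        }
    }

Keeping many requests in flight is what lets queue mode reach higher throughput, but it is also why its per-request latency and host resource usage look worse in the later slides.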
Memory Usage (FSSD)
[Chart: memory usage (GB) over time (0~12,500 s) for FSSD-Legacy and FSSD-Queue]
• FSSD starts at about a 2 GB memory requirement; as the I/O process progresses, memory usage keeps increasing in a logarithmic fashion and reaches 10 GB
• Using 10 GB of memory just to manage the underlying SSD may not be acceptable in many applications
CPU Usage (BSSD)
[Chart: time series of host-level CPU usage (%) over time (0~3,200 s) for BSSD-Legacy and BSSD-Queue]
• BSSD consumes 15%~30% of total CPU cycles for handling I/O requests
CPU Usage (FSSD)
[Chart: host-level CPU usage (%) over time (0~3,200 s) for FSSD-Legacy and FSSD-Queue]
• FSSD requires much higher CPU usage (50%~90% of host-side CPU cycles)
• I/O service with queue-mode operation requires about 50% more CPU cycles than legacy mode
• CPU usage over 60% just for I/O processing can degrade overall system performance
FSSD performance (multi-threads)
[Charts: FSSD latency (ms) and throughput (IOPS) over time (0~1,000 s) for 1, 4, and 8 worker threads]
• Latency with eight workers is 118% worse than with four workers and 289% worse than with a single worker
• Throughput with eight workers is 2.2x better than with a single worker
• FSSD offers very stable and predictable performance
FSSD resource usages (multi-threads)
[Charts: FSSD CPU usage (%) and memory requirements (GB) over time (0~1,000 s) for 1, 4, and 8 worker threads]
• The multi-threading advantage decreases because of high memory requirements and CPU usage
• With more workers, FSSD requires 134% more memory space and 201% more computation resources
BSSD resource usages (multi-threads)
[Charts: BSSD CPU usage (%) and memory requirements (MB) over time (0~1,000 s) for 1, 4, and 8 worker threads]
• Offers similar memory requirements (less than 0.66 GB) irrespective of the number of threads
• Offers similar CPU usage (less than 30%) irrespective of the number of threads
BSSD performance (multi-threads)
[Charts: BSSD latency (ms) and throughput (IOPS) over time (0~1,000 s) for 1, 4, and 8 worker threads]
• Latency with eight workers is 289% worse than with four workers and 708% worse than with a single worker
• Throughput shows no differences with varying numbers of workers
• A write cliff occurs (garbage collection impact)
Latency Impact on a Queuing Method
[Charts: latency (ms) over time for queue vs. legacy requests under sequential and random access; left: FSSD, right: BSSD]
• Across the FSSD and BSSD, sequential and random cases, a queued request can be worse than a legacy request by 86x, 99x, 106x, and 184x
Summary
• Design trade-off between performance and resource utilization
  – All-Flash-Array
  – Data-center/HPC local node SSD
• Software stack optimization
  – Co-operative approaches
  – Unified/direct file systems
  – Garbage collection schedulers
  – Queue control
• We are constructing an environment for automated SSD evaluation at camelab.org