Challenges in Getting Flash Drives Closer to CPU Myoungsoo Jung (UT-Dallas) Mahmut Kandemir (PSU) The University of Texas at Dallas


TRANSCRIPT

Page 1:

Challenges in Getting Flash Drives Closer to CPU

Myoungsoo Jung (UT-Dallas), Mahmut Kandemir (PSU)

The University of Texas at Dallas

Page 2:

Take-away

• Leveraging PCIe bus as storage interface
  – ≠ conventional memory system interconnects
  – ≠ thin storage interfaces
  – Requires new SSD architecture and storage stack

• Motivation: few studies focus on the system characteristics of these emerging PCIe SSD platforms.

• Contributions: we quantitatively analyze the challenges faced by PCIe SSDs in getting flash memory closer to CPU
  1. Memory consumption
  2. Computation resource requirement
  3. Performance as a shared storage system
  4. Latency impact on their storage-level queuing mechanisms

Page 3:

Bandwidth Trend

[Figure: bandwidth (GB/sec, up to 16) vs. year (1998 ~ 2016) for SATA, PATA, and SSDs; annotations mark the Southbridge performance limit and the bandwidth improvement from 150 MB/s to 600 MB/s.]

Page 4:

Bandwidth Trend

[Figure: the same bandwidth-vs.-year plot, now annotated with individual drives — Winchester, A25FB, ST-Zeus, Intel X25, SF-1000, FusionIO ioDrive, Z-Drive R4, FusionIO ioDrive2, FusionIO ioDrive Octal — and an expected future PCIe SSD; the Southbridge performance limit is marked.]

SSDs have improved their bandwidth by 4x

SSDs begin to blur the distinction between block-access and memory-access semantic devices

Page 5:

Flash Storage Migration

[Diagram: conventional PC organization — the Northbridge/memory controller hub hosts memory slots and high-speed PCI Express graphics slots, while the Southbridge/I/O controller hub hosts IDE, SATA, USB, PCI slots, and off-board cables and ports. Annotations: ~30 GIPS per core, 400 MHz per flash chip, and the 600 MB/s physical limit of the storage interface.]

Taking SSDs out of the I/O controller hub and locating them as close to the CPU as possible

Interface Bottleneck

The PCIe interface is by far one of the easiest ways to integrate flash memory into the processor-memory complex

Page 6:

Flash Integration

1. Bridge-based PCIe SSD (BSSD)
2. From-scratch PCIe SSD (FSSD)

Page 7:

Bridge-based PCIe SSD (BSSD)

[Diagram: bridge-based PCIe SSD — the PCIe root complex (RC) in the memory controller hub connects through a PCIe endpoint (EP) and a PCIe-to-SAS bridge to a SAS controller layer, behind which multiple SAS controllers each manage their own flash.]

A bridge controller aggregates multiple traditional SAS/SATA SSD controllers, exposing their combined SAS/SATA SSD performance over PCIe

RC = Root Complex, CTRL = Controller, EP = Endpoint, HBA = Host Bus Adapter

Page 8:

Bridge-based PCIe SSD (BSSD)

[Diagram: the same bridge-based PCIe SSD architecture as on the previous slide.]

PROS:
• High compatibility
• Fast development process

CONS:
• Redundant control logic
• Computational overheads
• Encoding/decoding overheads

RC = Root Complex, CTRL = Controller, EP = Endpoint, HBA = Host Bus Adapter

Page 9:

From-scratch PCIe SSD (FSSD)

[Diagram: from-scratch PCIe SSD — the PCIe root complex connects through a PCIe switch to native PCIe endpoint controllers (PCIe EP-CTRL), each directly attached to flash; shown alongside the bridge-based design for comparison.]

PCIe endpoints (EPs) have upstream and downstream buffers, which control inbound and outbound I/O requests

The PCIe EPs and the switch are implemented as a form of native PCIe controller

The FSSD is built from the bottom up by directly interconnecting the NAND flash interface and the external PCIe link

Point-to-point PCIe link network

RC = Root Complex, CTRL = Controller, EP = Endpoint, HBA = Host Bus Adapter

Page 10:

From-scratch PCIe SSD (FSSD)

[Diagram: the same from-scratch PCIe SSD architecture as on the previous slide.]

PROS:
• Highly scalable
• Exposes flash performance

CONS:
• Protocol design/implementation
• Tailoring of SW/HW
• Resource competition

RC = Root Complex, CTRL = Controller, EP = Endpoint, HBA = Host Bus Adapter

Page 11:

Flash Software Stack

[Diagram: flash software stack mapped onto the from-scratch SSD architecture (host cores and DRAM on one side; PCIe RC, switch, and PCIe EP-CTRLs with flash on the other).]

Host:
• Database
• File System
• Block Storage Layer
• HBA Device Driver

Logical Block I/O Interface (between host and storage)

Storage:
• Host Interface Layer (NVMHC)
• Flash Software (FTL): buffer cache, address mapping, wear-leveling (a minimal sketch follows below)
• Hardware Abstraction Layer
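The slides only name the FTL duties; the following is a minimal sketch of a page-mapping FTL with greedy wear-leveling, written in C. All sizes and names (NUM_LOGICAL_PAGES, NUM_PHYSICAL_PAGES, allocate_page, ftl_write) are illustrative assumptions, not the design of the evaluated drives.

/* Minimal page-mapping FTL sketch: logical-to-physical mapping with a
 * crude greedy wear-leveling policy. Illustrative only. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_LOGICAL_PAGES  1024u
#define NUM_PHYSICAL_PAGES 1280u   /* extra pages = over-provisioning */
#define INVALID_PPN        UINT32_MAX

static uint32_t l2p[NUM_LOGICAL_PAGES];          /* logical -> physical map */
static uint32_t erase_count[NUM_PHYSICAL_PAGES]; /* wear-leveling statistic */
static uint8_t  in_use[NUM_PHYSICAL_PAGES];

/* Pick the least-worn free physical page (greedy wear-leveling). */
static uint32_t allocate_page(void)
{
    uint32_t best = INVALID_PPN;
    for (uint32_t p = 0; p < NUM_PHYSICAL_PAGES; p++) {
        if (!in_use[p] && (best == INVALID_PPN ||
                           erase_count[p] < erase_count[best]))
            best = p;
    }
    return best;
}

/* Out-of-place update: map the logical page to a fresh physical page and
 * retire the old one (a real FTL would defer the erase to garbage collection). */
static int ftl_write(uint32_t lpn)
{
    uint32_t new_ppn = allocate_page();
    if (lpn >= NUM_LOGICAL_PAGES || new_ppn == INVALID_PPN)
        return -1;
    if (l2p[lpn] != INVALID_PPN) {
        in_use[l2p[lpn]] = 0;
        erase_count[l2p[lpn]]++;     /* stand-in for block erase accounting */
    }
    in_use[new_ppn] = 1;
    l2p[lpn] = new_ppn;
    return 0;
}

int main(void)
{
    memset(l2p, 0xFF, sizeof(l2p));  /* all entries start as INVALID_PPN */
    ftl_write(7);
    ftl_write(7);                    /* rewrite forces a remap */
    printf("LPN 7 -> PPN %u\n", l2p[7]);
    return 0;
}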

Page 12:

Experimental Setup

• Host configuration
  – Quad-core i7 Sandy Bridge, 3.4 GHz
  – External extra HDD (for logging the footprints)
  – 16 GB memory (4 GB DDR3-1333 DIMM × 4)

Most performance values observed with the FSSD are about 40% better than with the BSSD

Page 13:

Tool

• Synthesized micro-benchmark workloads of Iometer
• Modified Iometer
  – Time-series evaluation: a script that generates log data every second
  – Memory usage evaluation: added a module to Iometer that calls the system API GlobalMemoryStatusEx() (a polling sketch follows below)
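The added Iometer module itself is not shown in the slides; the standalone C sketch below illustrates the same kind of per-second memory logging on a Windows host via GlobalMemoryStatusEx(). The log-file name and the 60-sample loop bound are assumptions for the sketch.

/* Poll GlobalMemoryStatusEx() once per second and log used physical memory. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("mem_usage.csv", "w");    /* hypothetical log file name */
    if (!fp)
        return 1;
    fprintf(fp, "second,used_phys_GB\n");

    for (int sec = 0; sec < 60; sec++) {       /* 60 samples for the demo */
        MEMORYSTATUSEX ms;
        ms.dwLength = sizeof(ms);
        if (GlobalMemoryStatusEx(&ms)) {
            /* Used physical memory = total - available, reported in GB. */
            double used_gb = (double)(ms.ullTotalPhys - ms.ullAvailPhys)
                             / (1024.0 * 1024.0 * 1024.0);
            fprintf(fp, "%d,%.3f\n", sec, used_gb);
            fflush(fp);
        }
        Sleep(1000);                           /* one sample per second */
    }
    fclose(fp);
    return 0;
}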

Page 14:

Memory Usage (Overall)

[Figure: physical memory consumption (GB) vs. access granularity (512 B ~ 1 MB; request sizes of 1 ~ 512 sectors) for FSSD and BSSD, sequential and random, with separate panels for writes and reads.]

For writes, FSSD consumes 3x~16x more memory space than BSSD; for reads, it consumes about 2.5x more. BSSD stays at roughly 0.6 GB in both cases.

Page 15:

Memory Usage (BSSD)

[Figure: BSSD memory consumption (GB) over time (0 ~ 12,500 s) for BSSD-Legacy and BSSD-Queue, with an inset zooming in on the 0.63 ~ 0.67 GB range.]

I/Os are submitted whenever the device is available; queue mode uses up to 128 queue entries (a generic sketch of such a submission queue follows below)

BSSD requires only 0.6 GB of memory space regardless of the I/O type and size.
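To make the 128-entry queue mode concrete, here is a generic fixed-depth submission-queue sketch in C. QUEUE_DEPTH, struct io_request, and the sq_* helpers are hypothetical names; this is a conceptual illustration of the mechanism, not the drive's actual driver or firmware interface.

/* Fixed-depth I/O submission ring: the host enqueues requests while entries
 * are free and the device consumes them independently. Legacy mode would
 * instead wait for completion after every single request. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define QUEUE_DEPTH 128u

struct io_request {
    uint64_t lba;      /* starting logical block address */
    uint32_t sectors;  /* request size in sectors        */
    bool     is_write;
};

struct sq {
    struct io_request entries[QUEUE_DEPTH];
    uint32_t head;     /* next slot the device will consume */
    uint32_t tail;     /* next slot the host will fill      */
};

static bool sq_full(const struct sq *q)  { return q->tail - q->head == QUEUE_DEPTH; }
static bool sq_empty(const struct sq *q) { return q->tail == q->head; }

/* Host side: submit only while the queue has a free entry (queue mode). */
static bool sq_submit(struct sq *q, struct io_request req)
{
    if (sq_full(q))
        return false;
    q->entries[q->tail % QUEUE_DEPTH] = req;
    q->tail++;
    return true;
}

/* Device side: pull the oldest outstanding request. */
static bool sq_consume(struct sq *q, struct io_request *out)
{
    if (sq_empty(q))
        return false;
    *out = q->entries[q->head % QUEUE_DEPTH];
    q->head++;
    return true;
}

int main(void)
{
    struct sq q = { 0 };
    for (uint64_t i = 0; i < 200; i++) {
        struct io_request r = { .lba = i * 8, .sectors = 8, .is_write = true };
        if (!sq_submit(&q, r))
            break;                   /* queue fills up after 128 submissions */
    }
    printf("outstanding entries: %u\n", q.tail - q.head);
    return 0;
}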

Page 16:

Memory Usage (FSSD)

[Figure: FSSD memory usage (GB, 0 ~ 10) over time (0 ~ 12,500 s) for FSSD-Legacy and FSSD-Queue.]

2 GB memory requirement

Using 10 GB of memory just to manage the underlying SSD may not be acceptable in many applications

As the I/O process progresses, memory usage keeps increasing in a logarithmic fashion and reaches 10 GB

Page 17:

CPU Usage (BSSD)

[Figure: time series of host-level CPU usage (%, 10 ~ 90) over time (0 ~ 3,200 s) for BSSD-Queue and BSSD-Legacy.]

BSSD consumes 15%~30% of total CPU cycles for handling I/O requests

Page 18:

CPU Usage (FSSD)

[Figure: host-level CPU usage (%, 10 ~ 90) over time (0 ~ 3,200 s) for FSSD-Legacy and FSSD-Queue.]

FSSD requires much higher CPU usage (50% ~ 90%)

Spending over 60% of host-side CPU cycles just on I/O processing can degrade overall system performance

I/O service with queue-mode operation requires 50% more CPU cycles
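The slides report host-level CPU usage but do not show how the per-second trace was captured; one plausible way to collect such a trace on a Windows host is to sample GetSystemTimes() once per second, as in the illustrative C sketch below (not the measurement tool used in the study; the 60-sample loop is an assumption).

/* Sample GetSystemTimes() every second and print the busy CPU fraction. */
#include <windows.h>
#include <stdio.h>

static ULONGLONG ft_to_u64(FILETIME ft)
{
    return ((ULONGLONG)ft.dwHighDateTime << 32) | ft.dwLowDateTime;
}

int main(void)
{
    FILETIME idle0, kern0, user0, idle1, kern1, user1;

    GetSystemTimes(&idle0, &kern0, &user0);
    for (int sec = 0; sec < 60; sec++) {          /* 60 samples for the demo */
        Sleep(1000);
        GetSystemTimes(&idle1, &kern1, &user1);

        /* Kernel time includes idle time, so busy = (kernel + user) - idle. */
        ULONGLONG idle = ft_to_u64(idle1) - ft_to_u64(idle0);
        ULONGLONG busy = (ft_to_u64(kern1) - ft_to_u64(kern0))
                       + (ft_to_u64(user1) - ft_to_u64(user0)) - idle;
        double usage = 100.0 * (double)busy / (double)(busy + idle);
        printf("%d,%.1f%%\n", sec, usage);

        idle0 = idle1; kern0 = kern1; user0 = user1;
    }
    return 0;
}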

Page 19:

FSSD performance (multi-threads)

[Figure: FSSD latency (ms, up to ~5) and throughput (IOPS, up to ~500,000) over time (0 ~ 1,000 s) for 1, 4, and 8 worker threads.]

Latency: worse than four workers by 118%, and worse than a single worker by 289%. Throughput: 2.2x better than a single worker.

FSSD offers very stable and predictable performance

Page 20:

FSSD resource usages (multi-threads)

[Figure: FSSD memory requirements (GB, 1.0 ~ 4.0) and CPU usage (%, 0 ~ 100) over time (0 ~ 1,000 s) for 1, 4, and 8 worker threads.]

The advantage decreases because of high memory requirements and CPU usage: 134% more memory space and 201% more computation resources are required

Page 21:

BSSD resource usages (multi-threads)

[Figure: BSSD memory requirements (GB, with an inset zooming in on the 600 ~ 700 MB range) and CPU usage (%) over time (0 ~ 1,000 s) for 1, 4, and 8 worker threads.]

BSSD offers similar memory requirements (less than 0.66 GB) irrespective of the number of threads

BSSD offers similar CPU usage (less than 30%) irrespective of the number of threads

Page 22:

BSSD performance (multi-threads)

[Figure: BSSD latency (ms, 2 ~ 512) and throughput (IOPS, up to ~35,000) over time (0 ~ 1,000 s) for 1, 4, and 8 worker threads.]

Latency: worse than four workers by 289%, and worse than a single worker by 708%

Throughput shows no differences with a varying number of workers

A write-cliff occurs (garbage collection impact)

Page 23:

Latency Impact on a Queuing Method

[Figure: latency (ms) over time for sequential and random accesses under queue and legacy request modes; separate panels for FSSD and BSSD.]

Queue-mode requests are worse than legacy requests by 86x ~ 184x across the four cases shown (106x, 86x, 99x, and 184x)
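Why queue-mode operation can inflate per-request latency by roughly two orders of magnitude is easy to see with a toy queueing model: once the outstanding queue depth exceeds the drive's internal parallelism, extra entries only add waiting time. The C sketch below assumes 80 us per flash operation, 16 independent channels, and a legacy-like depth of 1 versus deeper queue-mode depths; these are illustrative assumptions, not measurements from this study.

/* Toy Little's-law model of the latency/throughput trade-off of queueing. */
#include <stdio.h>

int main(void)
{
    const double service_us = 80.0;  /* assumed time per flash operation     */
    const int    channels   = 16;    /* assumed internal flash parallelism   */
    const int    depths[]   = { 1, 32, 128 };  /* legacy-like QD 1 vs. queue */

    printf("%6s %14s %12s\n", "QD", "latency(us)", "IOPS");
    for (int i = 0; i < 3; i++) {
        int qd = depths[i];
        int inflight = qd < channels ? qd : channels;      /* ops overlapping */
        double iops = inflight * (1e6 / service_us);       /* throughput      */
        double latency_us = (double)qd * service_us / inflight; /* W = L/lambda */
        printf("%6d %14.1f %12.0f\n", qd, latency_us, iops);
    }
    return 0;
}

With these assumed numbers, going from a depth of 1 to 128 raises throughput 16x but per-request latency 8x; with deeper firmware queues, scheduling, and garbage collection in the path, the latency gap grows much further.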

Page 24:

Summary

• Design trade-off between performance and resource utilization
  – All-Flash-Array
  – Data-center/HPC local node SSD

• Software stack optimization
  – Co-operative approaches
  – Unified/direct file systems
  – Garbage collection schedulers
  – Queue control

• We are constructing an environment for automated SSD evaluation at camelab.org