
FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs

Jie Zhang, Miryeong Kwon, Donghyun Gouk, Sungjoon Koh, Changlim Lee, Mohammad Alian, Myoungjun Chun, Mahmut Kandemir, Nam Sung Kim, Jihong Kim and Myoungsoo Jung

Executive Summary

FlashShare punches through the performance barriers of the storage stack:
• Reduces average turnaround response times by 22%;
• Reduces 99th-percentile turnaround response times by 31%.

[Slide diagram: in the datacenter, latency-critical and throughput applications expect memory-like performance from a ULL-SSD, but every layer of the storage stack (file system, block layer, NVMe driver, flash firmware) is unaware of the ULL-SSD and of the latency-critical application. This barrier causes interference and longer I/O latency instead of ultra-low latency.]

Motivation: applications in the datacenter

A datacenter executes a wide range of latency-critical workloads.
• Driven by the market of social media and web services;
• Required to satisfy a certain level of service-level agreement (SLA);
• Sensitive to latency (i.e., turnaround response time).

A typical example: Apache. An HTTP request arrives over the TCP/IP service, is queued and monitored, and worker threads fetch the data object and respond. The key metric is user experience.

Motivation: applications in the datacenter

• Latency-critical applications exhibit varying loads during a day.
• The datacenter overprovisions its server resources to meet the SLA.
• However, this results in low utilization and low energy efficiency.

Figure 1. Example diurnal pattern in queries per second for a Web Search cluster (varying loads) [1].
Figure 2. CPU utilization analysis of a Google server cluster (about 30% average utilization) [2].

[1] Power Management of Online Data-Intensive Services.
[2] The Datacenter as a Computer.

Motivation: applications in the datacenter

Popular solution: co-locating latency-critical and throughput workloads (MICRO'11, ISCA'15, EuroSys'14).

Challenge: applications in the datacenter

Applications:
• Apache – online latency-critical application;
• PageRank – offline throughput application.

Server configuration: [table not shown]

Experiment: Apache+PageRank vs. Apache only.

Performance metrics:
• SSD device latency;
• Response time of the latency-critical application.

Challenge: applications in the datacenter

Experiment: Apache+PageRank vs. Apache only.
• The throughput-oriented application drastically increases the I/O access latency of the latency-critical application (Fig. 1: Apache SSD latency increases due to PageRank).
• This latency increase deteriorates the turnaround response time of the latency-critical application (Fig. 2: Apache response time increases due to PageRank).

Challenge: ULL-SSD

There are emerging Ultra-Low-Latency SSD (ULL-SSD) technologies, which can be used for faster I/O services in the datacenter.

            Z-NAND           XL-Flash         Optane             nvNitro
Technique   New NAND Flash   New NAND Flash   Phase-change RAM   MRAM
Vendor      Samsung          Toshiba          Intel              Everspin
Read        3us              N/A              10us               6us
Write       100us            N/A              10us               6us

Challenge: ULL-SSD

In this work, we use an engineering sample of Z-SSD.

Z-NAND [1]:
• Technology: SLC-based 3D NAND with 48 stacked word-line layers;
• Capacity: 64 Gb/die;
• Page size: 2 KB/page.

A Z-NAND based SSD is called a "Z-SSD".

[1] Cheong, Wooseong, et al. "A flash memory controller for 15μs ultra-low-latency SSD using high-speed 3D NAND flash with 3μs read time." 2018 IEEE International Solid-State Circuits Conference (ISSCC), 2018.

Challenge: datacenter server with ULL-SSD

Applications:
• Apache – online latency-critical application;
• PageRank – offline throughput application.

Server configuration: [table not shown]

Device latency analysis: [figure; values shown include 42x, 36us, and 28us.]

Unfortunately, the short-latency characteristic of the ULL-SSD cannot be exposed to users (in particular, to the latency-critical applications).

Challenge: datacenter server with ULL-SSD

The storage stack is unaware of the characteristics of both the latency-critical workload and the ULL-SSD.

[Slide diagram: applications on top of the caching layer, filesystem, blkmq, NVMe driver, and ULL-SSD.]

The current design of the blkmq layer, NVMe driver, and SSD firmware can hurt the performance of latency-critical applications: the ULL-SSD fails to deliver short latency because of the storage stack.

Blkmq layer: challenge

[Slide diagram: I/O submission path; incoming requests are merged and queued in the blkmq software queue, then queued again in the hardware queue.]

Software queue: holds latency-critical I/O requests for a long time.

Blkmq layer: challenge

[Slide diagram: dispatch from the software queue to the hardware queue is throttled by tokens (Token=1 / Token=0).]

Software queue: holds latency-critical I/O requests for a long time.
Hardware queue: dispatches an I/O request without knowledge of its latency-criticality.

Blkmq layer: optimization

Our solution: bypass.

[Slide diagram: latency-critical requests (LatReq) bypass the software and hardware queues with no merge and no I/O scheduling; throughput requests (ThrReq) are still merged and queued.]

• Latency-critical I/Os: bypass blkmq for a faster response (little penalty; the rest is addressed at the NVMe level).
• Throughput I/Os: merge in blkmq for higher storage bandwidth.
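The slide shows no code, but the routing rule can be made concrete with a small sketch. Everything below is hypothetical and simplified: the real change patches the Linux blk-mq submission path and reads the attribute from the submitting task's task_struct, whereas this is a user-space-style illustration with made-up names (io_request, submit_io).

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical request descriptor: the real kernel code works on struct bio /
 * struct request and inherits the attribute from the submitting task. */
struct io_request {
    const char *name;
    bool latency_critical;   /* workload attribute set via chworkload_attr */
};

/* Stand-ins for the blk-mq software-queue merge path and direct hardware
 * dispatch; stubs only, to make the routing rule visible. */
static void sw_queue_insert_and_merge(struct io_request *rq)
{
    printf("%s: queued in software queue (merge + schedule)\n", rq->name);
}

static void hw_queue_issue(struct io_request *rq)
{
    printf("%s: issued directly to hardware queue (bypass)\n", rq->name);
}

/* Routing rule sketched on the slide: latency-critical I/Os skip the
 * software queue (no merge, no scheduling delay); throughput I/Os take the
 * normal merge path for bandwidth. */
static void submit_io(struct io_request *rq)
{
    if (rq->latency_critical)
        hw_queue_issue(rq);
    else
        sw_queue_insert_and_merge(rq);
}

int main(void)
{
    struct io_request lat = { "LatReq", true  };
    struct io_request thr = { "ThrReq", false };
    submit_io(&lat);
    submit_io(&thr);
    return 0;
}
```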

NVMe SQ: challenge (bypass is not simple enough)

[Slide diagram: per-core NVMe submission queues (SQ) and completion queues (CQ) with SQ/CQ doorbell registers; the driver rings the SQ doorbell and the NVMe controller fetches commands from the head of the queue.]

NVMe protocol-level queue: a latency-critical I/O request can be blocked by prior I/O requests. For example, with two requests ahead of it in the SQ, its fetch cost becomes Tfetch-self + 2 x Tfetch, i.e., more than 200% overhead.

NVMe SQ: optimization

Target: design a responsiveness-aware NVMe submission.

Key insight:
• Conventional NVMe controllers allow the standard arbitration strategy to be customized for different NVMe protocol-level queue accesses.
• Thus, we can make the NVMe controller decide which NVMe command to fetch by sharing a hint about the I/O urgency.

NVMe SQ: optimization

[Slide diagram: each core gets two submission queues, Lat-SQ and Thr-SQ, each with its own doorbell register; the NVMe controller postpones Thr-SQ entries and immediately fetches Lat-SQ entries.]

Our solution:
1. Double the SQs (one for latency-critical I/Os, another for throughput I/Os);
2. Double the SQ doorbell registers;
3. A new arbitration strategy that gives the highest priority to the Lat-SQ (immediate fetch; throughput commands are postponed).
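For illustration only, the arbitration rule can be sketched as a tiny controller-side fetch loop. The ring structure and helper names below are assumptions, not the controller firmware or driver code from the paper.

```c
#include <stdbool.h>

#define SQ_DEPTH 64

/* Hypothetical fixed-size ring standing in for an NVMe submission queue. */
struct nvme_sq {
    int      cmds[SQ_DEPTH];  /* command identifiers (placeholder payload) */
    unsigned head;            /* advanced by the controller on fetch */
    unsigned tail;            /* advanced by the host on a doorbell write */
};

static bool sq_empty(const struct nvme_sq *sq)
{
    return sq->head == sq->tail;
}

static int sq_pop(struct nvme_sq *sq)
{
    int cmd = sq->cmds[sq->head];
    sq->head = (sq->head + 1) % SQ_DEPTH;
    return cmd;
}

/* Arbitration rule from the slide: always drain the latency-critical SQ
 * first; throughput commands are fetched only when no latency-critical
 * command is pending. Returns -1 when both queues are empty. */
static int fetch_next_command(struct nvme_sq *lat_sq, struct nvme_sq *thr_sq)
{
    if (!sq_empty(lat_sq))
        return sq_pop(lat_sq);    /* immediate fetch */
    if (!sq_empty(thr_sq))
        return sq_pop(thr_sq);    /* postponed until the Lat-SQ is idle */
    return -1;
}

/* Example: with one command pending in each queue, the Lat-SQ entry wins. */
int main(void)
{
    struct nvme_sq lat = { .cmds = { 7 }, .head = 0, .tail = 1 };
    struct nvme_sq thr = { .cmds = { 3 }, .head = 0, .tail = 1 };
    return fetch_next_command(&lat, &thr) == 7 ? 0 : 1;
}
```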

SSD firmware: challenge

[Slide diagram: inside the ULL-SSD, the NVMe controller sits above an embedded caching layer, the FTL, and NAND flash; incoming requests hit or miss the embedded cache, and misses evict existing entries.]

• The embedded cache provides the fastest response (DRAM service): a hit costs TCL + TCACHE, while a miss costs TCL + TFTL + TNAND + TCACHE.
• The embedded cache cannot protect latency-critical I/O from eviction.
• The embedded cache can be polluted by throughput requests.

SSD firmware: optimization

Our design: split the internal cache space to protect latency-critical I/O requests.

[Slide diagram: the embedded cache is split so that a protection region is reserved for latency-critical data; evictions triggered by throughput requests cannot displace entries in the protection region.]
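A minimal sketch of the split policy, assuming a small fully-associative cache model with a fixed number of ways reserved as the protection region; the real firmware's cache organization, replacement policy, and split ratio are not specified on this slide, so the structures and constants here are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_WAYS        8
#define PROTECTED_WAYS  2   /* ways reserved for latency-critical data (assumed ratio) */

struct cache_line {
    bool     valid;
    bool     latency_critical;
    uint64_t addr;
};

static struct cache_line cache[NUM_WAYS];

/* Pick a victim way: throughput requests may only evict from the
 * unprotected region (ways >= PROTECTED_WAYS), so cached latency-critical
 * lines in the protection region are never displaced by throughput traffic. */
static int pick_victim(bool latency_critical)
{
    int start = latency_critical ? 0 : PROTECTED_WAYS;
    for (int w = start; w < NUM_WAYS; w++)   /* prefer an empty way */
        if (!cache[w].valid)
            return w;
    return start;   /* trivial replacement: evict the first eligible way */
}

void cache_insert(uint64_t addr, bool latency_critical)
{
    int w = pick_victim(latency_critical);
    cache[w].valid = true;
    cache[w].latency_critical = latency_critical;
    cache[w].addr = addr;
}
```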

NVMe CQ: challenge

[Slide diagram: on completion, the NVMe controller writes a completion entry and raises an MSI through the interrupt controller; the CPU runs the interrupt service routine, and the blkmq layer resumes the I/O path across two context switches.]

NVMe completion: MSI overhead for each I/O request. Each completion pays two context switches (TCS each) plus the interrupt service routine (TISR), i.e., a cost of 2 x TCS + TISR.

NVMe CQ: optimization

Key insight: state-of-the-art Linux supports a polling mechanism.

[Slide diagram: with polling, the blkmq layer polls the CQ directly, skipping the MSI, the interrupt service routine, and both context switches, saving 2 x TCS + TISR per I/O.]

NVMe CQ: optimization

A polling mechanism can bring benefits to a fast storage device.

[Figure: average latency of interrupt vs. polling on the ULL-SSD for 4KB-32KB reads and writes. Polling decreases average latency by 7.5% for reads and 13.2% for writes.]

NVMe CQ: optimization

However, poll-based I/O services consume most of the host resources.

[Figure: CPU utilization over time and memory-bound fraction (%) for 4KB-32KB requests, comparing polling and interrupt; polling consumes far more CPU than interrupt-driven completion.]

NVMe CQ: optimization

Our solution: selective interrupt service routine (Select-ISR).

[Slide diagram: the blkmq layer polls the CQ for latency-critical requests, while throughput requests let the CPU sleep and are completed through the conventional MSI, interrupt service routine, and context-switch path.]
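A minimal sketch of the selection rule, assuming a completion path that can either spin on the completion queue or arm the interrupt and sleep; the handle and helper names below are hypothetical stand-ins for the NVMe driver's completion path.

```c
#include <stdbool.h>

/* Hypothetical completion-queue handle and helpers; the real code lives in
 * the NVMe driver and the blk-mq completion path. */
struct nvme_cq;

bool cq_entry_ready(struct nvme_cq *cq);          /* phase-bit check at the CQ head */
void complete_entry(struct nvme_cq *cq);          /* hand the result back to blk-mq */
void arm_interrupt_and_sleep(struct nvme_cq *cq); /* wait for the MSI, run the ISR */

/* Select-ISR as sketched on the slide: poll the CQ for latency-critical
 * I/Os (saving the MSI, ISR, and context switches), but fall back to the
 * conventional interrupt path for throughput I/Os so the CPU can sleep. */
void wait_for_completion(struct nvme_cq *cq, bool latency_critical)
{
    if (latency_critical) {
        while (!cq_entry_ready(cq))
            ;                        /* busy-wait: trades CPU time for latency */
        complete_entry(cq);
    } else {
        arm_interrupt_and_sleep(cq); /* ISR completes the entry after the MSI */
    }
}
```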

Design: responsiveness awareness

Key insight: users have better knowledge of I/O responsiveness (i.e., latency-critical vs. throughput).

Our approach:
• Open a set of APIs to users, which pass the workload attribute to the Linux PCB: a new utility, chworkload_attr, invokes a new system call, which modifies the Linux PCB (task_struct) data structure (a hedged sketch follows below).
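The slide names the pieces but not their interfaces, so the sketch below is hypothetical: it assumes a placeholder syscall number (NR_set_workload_attr) and a two-argument form (pid, attribute). The real chworkload_attr utility and system call in FlashShare may differ.

```c
/* Hypothetical user-space side of chworkload_attr: tag a process as
 * latency-critical (or throughput) so the kernel records the attribute in
 * the task's PCB (task_struct). The syscall number and signature below are
 * assumptions for illustration only. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#define NR_set_workload_attr 436   /* placeholder syscall number */

enum workload_attr { WORKLOAD_THROUGHPUT = 0, WORKLOAD_LATENCY_CRITICAL = 1 };

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <0=throughput|1=latency-critical>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);
    long attr = atol(argv[2]);

    /* The (assumed) new system call stores the attribute in the PCB, where
     * the filesystem, blkmq, and NVMe driver can read it on every I/O. */
    long ret = syscall(NR_set_workload_attr, pid, attr);
    if (ret < 0) {
        perror("set_workload_attr");
        return 1;
    }
    return 0;
}
```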

Design: responsiveness awareness

Key insight: users have better knowledge of I/O responsiveness (i.e., latency-critical vs. throughput).

Our approach:
• Open a set of APIs to users, which pass the workload attribute to the Linux PCB.
• Deliver the workload attribute to each layer of the storage stack (see the sketch below).

[Slide diagram: the attribute flows from the user process's task_struct, through the page cache (address_space) and the file system's BIO, the block layer (blk-mq) request and tag array, into the nvme_rw_command (carried in the rsvd2 field) issued by the NVMe driver, and finally to the NVMe controller and its embedded cache.]
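To make the final hand-off concrete, here is a sketch of tagging the NVMe read/write command with the attribute. The struct below is a simplified stand-in (the slide points at the rsvd2 field of nvme_rw_command, which is reserved in the baseline command format); the field layout and widths here are illustrative, not the exact kernel definition.

```c
#include <stdint.h>

/* Simplified stand-in for the NVMe read/write command; the design reuses the
 * reserved rsvd2 field to carry the workload attribute from the driver down
 * to the controller. */
struct nvme_rw_cmd_sketch {
    uint8_t  opcode;
    uint8_t  flags;
    uint16_t command_id;
    uint32_t nsid;
    uint64_t rsvd2;      /* reserved in the baseline spec; reused as the hint */
    uint64_t slba;       /* starting LBA (other fields omitted) */
    uint16_t length;
};

#define WORKLOAD_LATENCY_CRITICAL 1ULL

/* Driver side: copy the attribute inherited from task_struct (and carried
 * through the bio/request) into the command before ringing the Lat-SQ or
 * Thr-SQ doorbell. */
void set_urgency_hint(struct nvme_rw_cmd_sketch *cmd, int latency_critical)
{
    cmd->rsvd2 = latency_critical ? WORKLOAD_LATENCY_CRITICAL : 0;
}

/* Controller/firmware side: read the hint to choose the cache protection
 * region and the fetch priority. */
int is_latency_critical(const struct nvme_rw_cmd_sketch *cmd)
{
    return cmd->rsvd2 == WORKLOAD_LATENCY_CRITICAL;
}
```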

More optimizations

Advanced caching layer designs:
• Dynamic cache split scheme: maximize cache hits under various request patterns;
• Read prefetching: better utilize SSD internal parallelism;
• Adjustable read prefetching with a ghost cache: adaptive to different request patterns (a hedged sketch follows below).

Hardware accelerator designs:
• Conduct simple but time-consuming tasks such as I/O polling and I/O merging;
• Simplify the design of blkmq and the NVMe driver.
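The slides do not detail the adjustable prefetching policy, so the sketch below is an assumed one: addresses evicted from the read cache are remembered (metadata only) in a small ghost list, and a miss that hits the ghost list raises the prefetch degree, while other misses let it decay. FlashShare's actual algorithm may differ.

```c
#include <stdbool.h>
#include <stdint.h>

#define GHOST_ENTRIES 64
#define MAX_DEGREE    8

static uint64_t ghost[GHOST_ENTRIES];  /* addresses only; a real design would also track validity */
static unsigned ghost_pos;
static unsigned prefetch_degree = 1;   /* number of sequential pages to prefetch */

/* Record an evicted address in the ghost list (metadata only, no data). */
void ghost_remember(uint64_t addr)
{
    ghost[ghost_pos] = addr;
    ghost_pos = (ghost_pos + 1) % GHOST_ENTRIES;
}

static bool ghost_contains(uint64_t addr)
{
    for (unsigned i = 0; i < GHOST_ENTRIES; i++)
        if (ghost[i] == addr)
            return true;
    return false;
}

/* Called on a read-cache miss: a ghost hit means a more aggressive
 * prefetch/cache would have served this block, so grow the degree;
 * otherwise decay toward conservative prefetching. */
unsigned on_cache_miss(uint64_t addr)
{
    if (ghost_contains(addr) && prefetch_degree < MAX_DEGREE)
        prefetch_degree++;
    else if (prefetch_degree > 1)
        prefetch_degree--;
    return prefetch_degree;   /* caller prefetches this many next pages */
}
```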

Experiment Setup

Test environment: http://simplessd.org

System configurations:
• Vanilla – a vanilla Linux-based computer system running on Z-SSD;
• CacheOpt – compared to Vanilla, it optimizes the cache layer of the SSD firmware;
• KernelOpt – it optimizes the blkmq layer and NVMe I/O submission;
• SelectISR – compared to KernelOpt, it adds the optimization of the selective ISR.

Evaluation: latency breakdown

• KernelOpt reduces the time cost of the blkmq layer by 46%, thanks to eliminating queuing time;
• As latency-critical I/Os are fetched by the NVMe controller immediately, KernelOpt drastically reduces the waiting time;
• CacheOpt better utilizes the embedded cache layer and reduces the SSD access delays by 38%;
• By selectively using the polling mechanism, SelectISR reduces the I/O completion time by 5us.

Evaluation: online I/O access

• CacheOpt reduces the average I/O service latency, but it cannot eliminate the long tails;
• KernelOpt removes the long tails, because it avoids long queuing times and prevents throughput I/Os from blocking latency-critical I/Os;
• SelectISR reduces the average latency further, thanks to selectively using the polling mechanism.

Conclusion

Observation: The ultra-low latency of new memory-based SSDs is not exposed to latency-critical applications and brings no benefit from the user-experience angle.

Challenge: Piecemeal reformations of the current storage stack won't work due to multiple barriers; the storage stack is unaware of the behaviors of both the ULL-SSD and latency-critical applications.

Our solution: FlashShare exposes different levels of I/O responsiveness to the key components of the current storage stack and optimizes the corresponding system layers to make ultra-low latency visible to users (latency-critical applications).

Major results:
• Reduces average turnaround response times by 22%;
• Reduces 99th-percentile turnaround response times by 31%.

Thank you
