
Page 1:

Unblinding the OS to Optimize User-Perceived Flash SSD Latency

Woong Shin*, Jaehyun Park**, Heon Y. Yeom*

*Seoul National University, **Arizona State University

USENIX HotStorage 2016, Jun. 21, 2016

Page 2:

OS I/O Path Optimizations: Reducing S/W Overheads

Simplified I/O Path

Application

Storage Device

Issue Side: user thread context

Completion Side: user thread context

OS I/O Path

Overhead: from 13 us down to 0.2 ~ 2.5 us

Under-20us technology: memory technology with ultra-low (nanoseconds), predictable latencies.

"Block the CPU (poll) while waiting for the I/O to complete"
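For context on the polling strategy quoted above, here is a minimal sketch of a submit-and-poll path in C; submit_io() and the io_request layout are illustrative placeholders, not the driver interface used in this work.

```c
/* Minimal submit-and-poll sketch. submit_io() and struct io_request are
 * illustrative placeholders, not a real driver API. */
#include <stdatomic.h>
#include <stdbool.h>

struct io_request {
    atomic_bool done;   /* set by the device/driver when the I/O completes */
    /* command, buffer, LBA, ... */
};

extern void submit_io(struct io_request *req);  /* assumed: posts the command to the device */

/* Issue the I/O, then spin until the completion flag flips: no interrupt and no
 * context switch, so latency is minimal, but the core does no other work. */
static void read_block_polling(struct io_request *req)
{
    atomic_store(&req->done, false);
    submit_io(req);
    while (!atomic_load_explicit(&req->done, memory_order_acquire))
        ;   /* busy-wait: reasonable only when device latency is a few microseconds */
}
```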

Page 3:

Cannot use polling: wastes CPU cycles, harms system parallelism

Issue Side: user thread context

Completion Side: user thread context

OS I/O Path

Higher latency & high variability (e.g., read: 20us ~ 150us)

Hard IRQ / Deferred Processing

To Block the CPU (sync) or to Yield the CPU (async)

H/W context switch: 4 us ~ 5 us; OS-involved switch: 7 us ~ 8 us

"Impact of Modern SSDs": more contexts on a CPU core, larger scheduling delays

Page 4:

Impact of Modern SSDs

700,000 IOPS NVMe SSD

Bandwidth: 4 kB x 700 kIOPS = 2.8 GB/s

Latency: 1 sec / 700 kIOPS = 1.42 µs ?!?

Page 5:

Impact of Modern SSDs

700,000 IOPS NVMe SSD. Single NAND die: approx. 14,285 8 kB IOPS (i.e., 70 us read latency). Requires more than 49 NAND dies to achieve 700,000 IOPS (see the arithmetic after this slide).

Multi-channel, multi-way, high-die-count NAND array

Large DRAM

Powerful controllers
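As a quick check, the figures on Pages 4 and 5 follow directly from the stated request sizes and the 70 us per-die read latency:

\[
\text{Bandwidth} = 4\,\mathrm{kB} \times 700{,}000\,\mathrm{IOPS} = 2.8\,\mathrm{GB/s},
\qquad
\text{Per-I/O budget} = \frac{1\,\mathrm{s}}{700{,}000} \approx 1.42\,\mu\mathrm{s}
\]
\[
\text{IOPS per die} \approx \frac{1}{70\,\mu\mathrm{s}} \approx 14{,}285,
\qquad
\text{Dies needed} \geq \frac{700{,}000}{14{,}285} \approx 49
\]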

Page 6:

Impact of Modern SSDs

700,000 IOPS NVMe SSD. Single NAND die: approx. 14,285 8 kB IOPS (i.e., 70 us read latency). Requires more than 49 NAND dies to achieve 700,000 IOPS.

Multi-channel, multi-way, high-die-count NAND array

Large DRAM

Powerful controllers

High count of I/O contexts (threads, state machines) required for high IOPS

Page 7:

Higher Context Multiplexing Cost: "Scheduling Delays"

Multi-channel, multi-way, high-die-count NAND array

Large DRAM

Powerful controllers

NAND die count >>> CPU core count (e.g., four PCI-e 3.0 four-lane slots in a chassis, more SSDs)

Redundancy, more capacity ...

[Diagram: the scheduler either time-shares a CPU core among many contexts (sync. I/O) or runs a worker thread that multiplexes multiple I/O contexts on a core (async. I/O)]

Page 8:

The OS is Blind: Conservative Strategies

Issue Side: user thread context

Completion Side: user thread context

OS I/O Path

SSD Controller / DRAM

Page 9:

The OS is Blind: Conservative Strategies

Issue Side: user thread context

Completion Side: user thread context

OS I/O Path

Hard IRQ / Deferred Processing

20us ~ 10ms: higher latency, higher variance

[Diagram: externally experienced latency depends on the physical destination within the SSD; reads (R), writes (W), and erases (E) are serviced either by the SSD controller's DRAM or by individual NAND dies]

Page 10:

This Work: Unblinding the OS

Issue Side: user thread context

Completion Side: user thread context

OS I/O Path

Hard IRQ / Deferred Processing

H/W context switch: 4 us ~ 5 us; OS-involved switch: 7 us ~ 8 us; more contexts on a CPU core (higher)

"Predictable SSD Latency"

Page 11:

This Work: Unblinding the OS

Issue Side: user thread context

Completion Side: user thread context

OS I/O Path

Hard IRQ / Deferred Processing

"Predictable SSD Latency"

[Diagram: the SSD exposes latency information, keyed by I/O destination, to the OS]

Page 12:

This Work: Unblinding the OS

Issue Side: user thread context

Completion Side: user thread context

OS I/O Path

Hard IRQ / Deferred Processing

"Predictable SSD Latency"

Latency by I/O destination x I/O type:

                A              B              C
Destination     DRAM           NAND           NAND
I/O type        Small R/W      Small READ     Else
Latency         Predictable    Predictable    Unpredictable

Better Interaction: Extended Interface between OS and SSD

Internal knowledge exposed as "a simplified model"

Page 13:

This Work: Unblinding the OS

Issue Side: user thread context

Completion Side: user thread context

OS I/O Path

Hard IRQ / Deferred Processing

"Predictable SSD Latency"

Latency by I/O destination x I/O type:

                A              B              C
Destination     DRAM           NAND           NAND
I/O type        Small R/W      Small READ     Else
Latency         Predictable    Predictable    Unpredictable

Better Interaction: Extended Interface between OS and SSD

Internal knowledge exposed as "a simplified model"

Exploiting SSD Internal Information: towards a "Predictable SSD", new optimization opportunities

Page 14:

Exploiting SSD Internal Information

The Unpredictable SSD

Multi-channel, multi-way, high-die-count NAND array

DRAM, SSD Controller

OS <-> SSD

Page 15:

Exploiting SSD Internal Information: "Decomposition & Classification"

Multi-channel, multi-way, high-die-count NAND array

DRAM, SSD Controller

Destination decomposition: DRAM / NAND (exposed from SSD to OS)

I/O request classification:

              Small R        Small W         Large R             Large W
Hit path      Cache hit      Buffer hit      N/A                 N/A
              Predictable    Predictable
Miss path     Cache miss     Buffer miss     Interleaved reads   Interleaved writes
              Predictable    Unpredictable   No benefit          No benefit
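To make the classification table concrete, here is a hypothetical OS-side sketch in C of the same decision logic; the type names, the 4 kB "small" threshold, and the ssd_state fields are illustrative assumptions, not the paper's implementation.

```c
/* Hypothetical OS-side I/O classifier built from the table above.
 * ssd_state mirrors hints exposed by the SSD (cached LBAs, write-buffer room);
 * all names and thresholds here are illustrative assumptions. */
#include <stdbool.h>
#include <stdint.h>

enum io_class { IO_PREDICTABLE_DRAM, IO_PREDICTABLE_NAND_READ, IO_UNPREDICTABLE };

struct ssd_state {
    bool (*lba_cached)(uint64_t lba);   /* small read would hit controller DRAM cache */
    bool write_buffer_has_room;         /* small write would be absorbed by the buffer */
};

#define SMALL_IO_BYTES 4096             /* assumed "small" threshold */

static enum io_class classify(const struct ssd_state *s,
                              uint64_t lba, uint32_t bytes, bool is_write)
{
    if (bytes > SMALL_IO_BYTES)
        return IO_UNPREDICTABLE;                 /* large R/W: no benefit */
    if (is_write)
        return s->write_buffer_has_room
             ? IO_PREDICTABLE_DRAM               /* buffer hit */
             : IO_UNPREDICTABLE;                 /* buffer miss */
    return s->lba_cached(lba)
         ? IO_PREDICTABLE_DRAM                   /* cache hit */
         : IO_PREDICTABLE_NAND_READ;             /* cache miss: plain NAND read */
}
```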

Page 16:

Exploiting SSD Internal Information: "Decomposition & Classification"

Multi-channel, multi-way, high-die-count NAND array

DRAM, SSD Controller

Destination decomposition: DRAM / NAND (exposed from SSD to OS)

I/O request classification:

Small R: cache hit -> Predictable; cache miss -> Predictable
Small W: buffer hit -> Predictable; buffer miss -> Unpredictable

Page 17:

Exploiting SSD Internal Information: "Decomposition & Classification"

Multi-channel, multi-way, high-die-count NAND array

DRAM, SSD Controller

Destination decomposition: DRAM / NAND (exposed from SSD to OS)

I/O request classification:

Small R: cache hit -> Predictable; cache miss -> Predictable
Small W: buffer hit -> Predictable; buffer miss -> Unpredictable

Page 18:

Mitigating the Impact of Scheduling Delays

Destination decomposition: DRAM / NAND (exposed from SSD to OS)

I/O request classification:

Small R: cache hit -> Predictable; cache miss -> Predictable
Small W: buffer hit -> Predictable; buffer miss -> Unpredictable

OS I/O Path

Issue Side: user thread context

Completion Side: user thread context

Hard IRQ / Deferred Processing

Predictable NAND Read

Page 19:

Mitigating the Impact of Scheduling Delays

Destination decomposition: DRAM / NAND (exposed from SSD to OS)

I/O request classification:

Small R: cache hit -> Predictable; cache miss -> Predictable
Small W: buffer hit -> Predictable; buffer miss -> Unpredictable

OS I/O Path

Issue Side: user thread context

Completion Side: user thread context

Hard IRQ / Deferred Processing

Predictable NAND Read

Predictable Buffer Hits

Page 20:

Accurate Latency Prediction: Remaining I/O Time

[Timeline: total SSD I/O time = I/O processing & queuing delays + Flash I/O + ECC + DMA transfer; the remaining I/O time is the portion still outstanding at a given moment]

OS I/O Path

Issue Side: user thread context

Classification: "Small Read" ("only for NAND reads")

Latency Predictor
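A minimal sketch, assuming a hypothetical inflight_read record, of how the issue side could turn a predicted total SSD I/O time into a remaining-time estimate; the predictor output itself (see Page 26) and all names here are assumptions for illustration, not the paper's code.

```c
/* Hypothetical issue-side bookkeeping: record when a classified small NAND read
 * was issued plus a predicted total device time, then derive the remaining time
 * on demand. Names and fields are illustrative. */
#include <stdint.h>
#include <time.h>

struct inflight_read {
    struct timespec issued_at;    /* timestamp taken when the command was submitted */
    uint64_t predicted_total_ns;  /* predictor output: expected total SSD I/O time */
};

static uint64_t elapsed_ns(const struct timespec *since)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    int64_t ns = (int64_t)(now.tv_sec - since->tv_sec) * 1000000000LL
               + (int64_t)(now.tv_nsec - since->tv_nsec);
    return ns > 0 ? (uint64_t)ns : 0;
}

/* Remaining I/O time = predicted total SSD I/O time - time already spent. */
static uint64_t remaining_io_ns(const struct inflight_read *r)
{
    uint64_t spent = elapsed_ns(&r->issued_at);
    return spent >= r->predicted_total_ns ? 0 : r->predicted_total_ns - spent;
}
```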

Page 21:

Precompletions: Overlapping I/O & Scheduling Delay

[Timeline: within the total SSD I/O time (I/O processing & queuing delays + Flash I/O + ECC + DMA transfer), a precompletion IRQ is raised before the end of the remaining I/O time; the precompletion window (wait period) covers the final stretch, during which the woken thread busy-waits for the flag (waiting on L1) until the actual completion arrives as a flag update with no IRQ]

OS I/O Path

Issue Side: user thread context

Completion Side: user thread context

Hard IRQ / Deferred Processing

Classification: "Small Read" ("only for NAND reads")

Latency Predictor
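A hedged sketch of the completion-side idea: an early precompletion IRQ wakes the waiting thread, which then spins on a per-request flag that the device updates at the actual completion (no second IRQ). wait_for_precompletion_irq() and the io_request layout are assumed placeholders, not the authors' driver code.

```c
/* Hypothetical completion-side flow for a "precompletion": the device raises an
 * early IRQ, then the woken thread busy-waits on a flag updated in host memory
 * when the I/O really finishes. Placeholders throughout. */
#include <stdatomic.h>
#include <stdbool.h>

struct io_request {
    atomic_bool done;   /* flag the device/driver sets at actual completion */
};

extern void wait_for_precompletion_irq(struct io_request *req); /* assumed: sleeps until the early IRQ */

static void wait_with_precompletion(struct io_request *req)
{
    /* 1. Sleep (yield the CPU) until the precompletion IRQ fires shortly before
     *    the predicted completion time, hiding the wakeup/scheduling delay. */
    wait_for_precompletion_irq(req);

    /* 2. The real completion is imminent: spin on the flag instead of taking a
     *    second interrupt; the spin lasts only for the short precompletion window. */
    while (!atomic_load_explicit(&req->done, memory_order_acquire))
        ;
}
```

The point of the early wakeup is that the scheduling delay overlaps the tail of the device I/O, so only a brief busy-wait remains at the end.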

Page 22:

OS & SSD Interaction: Simple Behavioral Models

Destination decomposition: DRAM / NAND (exposed from SSD to OS)

I/O request classification:

Small R: cache hit -> Predictable; cache miss -> Predictable
Small W: buffer hit -> Predictable; buffer miss -> Unpredictable

Page 23:

In-band Communication Channel: "Piggybacked Information"

[Diagram: along the OS I/O path, the previous I/O completion piggybacks buffer state ("blocks to consume") back to the host; an I/O classifier and a buffer monitor track the current blocks before the next I/O request is issued]
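One plausible shape for the piggybacked channel, under the assumption that each completion carries a "blocks consumed" hint and the OS keeps a running estimate of free write-buffer blocks for the small-write classifier; every field and function name here is illustrative, not from the paper.

```c
/* Hypothetical OS-side buffer monitor fed by piggybacked completion hints.
 * The completion entry is assumed to report how many write-buffer blocks the
 * SSD drained; the monitor keeps a running estimate used by the classifier. */
#include <stdbool.h>
#include <stdint.h>

struct completion_hint {        /* assumed layout of the piggybacked field */
    uint32_t blocks_consumed;   /* buffer blocks the SSD drained since the last completion */
};

struct buffer_monitor {
    int64_t current_blocks;     /* OS-side estimate of free write-buffer blocks */
    int64_t total_blocks;       /* capacity reported by the device at init */
};

static void on_completion(struct buffer_monitor *m, const struct completion_hint *h)
{
    m->current_blocks += h->blocks_consumed;   /* drained blocks become free again */
    if (m->current_blocks > m->total_blocks)
        m->current_blocks = m->total_blocks;
}

static bool on_small_write_issue(struct buffer_monitor *m, uint32_t blocks_needed)
{
    /* Classifier query: will this small write be a buffer hit (predictable)? */
    if (m->current_blocks < (int64_t)blocks_needed)
        return false;                          /* likely buffer miss: unpredictable */
    m->current_blocks -= blocks_needed;        /* reserve; corrected by later hints */
    return true;
}
```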

Page 24:

Implementation & Evaluation

Host system: Application (FIO), Custom Block Driver, OS: Linux 3.5.40

OpenSSD platform (http://www.openssd-project.org/wiki/The_OpenSSD_Project):

Zynq-7000 FPGA + ARM Cortex-A9 dual core

128 GB NAND module x 1 (4 ch, 4 way each)

DDR3 DRAM 512 MB x 2

External PCI-e Gen 2 four-lane connection (operating in Gen 2 one lane), PCI-e external adaptor

FTL, NAND Controller, MSI IRQ, NVMe-like I/F

Page 25:

Implementation & Evaluation

Host system: Application (FIO), Custom Block Driver, OS: Linux 3.5.40

OpenSSD platform (http://www.openssd-project.org/wiki/The_OpenSSD_Project):

Zynq-7000 FPGA + ARM Cortex-A9 dual core

128 GB NAND module x 1 (4 ch, 4 way each)

DDR3 DRAM 512 MB x 2

External PCI-e Gen 2 four-lane connection (operating in Gen 2 one lane), PCI-e external adaptor

FTL, NAND Controller, MSI IRQ, NVMe-like I/F

Limitations:
• Single I/O depth
• Unoptimized FPGA NAND controller (higher latencies)
  • Fixed latency
  • Slow DMA transfers (low-frequency bus)
  • PCI-e Gen 2 one lane

Evaluation: Latency Prediction; Impact of Precompletion

Page 26:

Predicting SSD Latency (Small NAND Read)

Flash: NAND I/O + ECC

DMA: device-to-host transfer (4 kB)

Prediction: three-value moving average

Low variance, low error: "Predictable"
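The predictor on this slide is a three-value moving average over recently observed latencies; a minimal sketch of that idea, one instance per I/O class, with illustrative names:

```c
/* Three-sample moving-average latency predictor, one instance per I/O class
 * (e.g., small NAND read). A minimal sketch of the idea on the slide. */
#include <stdint.h>

#define PRED_WINDOW 3

struct latency_predictor {
    uint64_t samples_ns[PRED_WINDOW];   /* last observed total SSD I/O times */
    unsigned next;                      /* ring index */
    unsigned count;                     /* number of valid samples (<= PRED_WINDOW) */
};

static void predictor_observe(struct latency_predictor *p, uint64_t observed_ns)
{
    p->samples_ns[p->next] = observed_ns;
    p->next = (p->next + 1) % PRED_WINDOW;
    if (p->count < PRED_WINDOW)
        p->count++;
}

static uint64_t predictor_predict(const struct latency_predictor *p)
{
    if (p->count == 0)
        return 0;                       /* no data yet; caller falls back to the IRQ path */
    uint64_t sum = 0;
    for (unsigned i = 0; i < p->count; i++)
        sum += p->samples_ns[i];
    return sum / p->count;              /* average of the last (up to) three samples */
}
```

A short window keeps the predictor responsive to drift while smoothing single outliers.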

Page 27:

The Impact of Precompletion: fio I/O thread vs background threads

Non-NAND avg. latency = measured avg. latency - 352 us (NAND latency)

Page 28:

The Impact of Precompletion: fio I/O thread vs background threads

Non-NAND avg. latency = measured avg. latency - 352 us (NAND latency)

I/O vs CPU: requires a priority boost for polling (priority degradation)

Polling damages system parallelism

Page 29:

The Impact of Precompletion: fio I/O thread vs background threads

Non-NAND avg. latency = measured avg. latency - 352 us (NAND latency)

Precompletion requires an adequate precompletion window:

Overshoot harms parallelism (busy wait)

Undershoot will expose scheduling delays

Page 30:

The Impact of Precompletion: fio I/O thread vs background threads

Non-NAND avg. latency = measured avg. latency - 352 us (NAND latency)

Polling: approx. 20% degradation of system parallelism

IRQ vs Precompletion: 7.16 us gain vs CPU threads, 7.52 us gain vs I/O threads, with no degradation in system parallelism

Page 31:

Summary

• Unblinding the OS: cross-layer optimization
  • Achieved a partially predictable SSD: "decomposition / classification"
  • Exploit SSD internal information: "remaining I/O time"
  • Protecting SSD proprietary internals: "abstracted behavioral models"
• Mitigating scheduling delays
  • Exploiting predictability of certain I/O requests
  • Pre-completion: projection (1 I/O depth vs other threads)

              IRQ     Polling   This work
Latency       Bad     Good      Good
Parallelism   Good    Bad       Good

Page 32:

Future Work

• Future implementation & evaluation
  • Full-blown SSD
  • Projection (1 I/O depth: this work) -> simulation (varying tech latency, etc.) -> real implementation
• Cross-layer optimization
  • More models
  • More use cases
  • More backend technologies beyond flash

Page 33: