NVMW 2014 – Extending Main Memory with Flash: the Optimized SWAP Approach
DESCRIPTION
Title: Extending Main Memory with Flash – the Optimized SWAP Approach
Author: Jihyung Park, Hyuck Han, Sangyeun Cho
Memory Solutions Lab, Memory Business, Samsung Electronics

TRANSCRIPT
Jihyung Park, Hyuck Han and Sangyeun Cho
Memory Solutions Lab
Memory Business
Extending Main Memory with Flash – the Optimized SWAP Approach
1. Introduction
2. Optimized SWAP
3. Evaluation
4. Future Work
5. Conclusion
Why extend main memory with flash?
• To overcome DRAM scaling limitations and offer large working memory
• To reduce total cost of ownership (acquisition and operation)
• Flash has no seek time
• Flash has much lower latency than HDD

Two approaches toward memory extension
• Non-transparent approach: applications must be modified
• Transparent approach: applications are NOT aware of the underlying flash
Introduction
The current swap algorithm is optimized for HDD

Paging for a fast device
• Fast and simple vs. heavy and accurate
Motivation
Swap entry search
• A new search algorithm

I/O path optimization
• Swap read-ahead
• I/O scheduler
• Swappiness

Swap device as backing store: inclusive vs. exclusive
• We adjust the swap entry free policy to enforce that the swap device “includes” all swapped-out pages
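The payoff of the inclusive policy can be illustrated with a toy eviction routine. This is a sketch under stated assumptions, not the actual kernel code; all class and function names here are illustrative. The key point: because a swap entry survives swap-in, a page that was brought back into DRAM but never dirtied still has a valid copy on flash and can be evicted without any write.

```python
from typing import Optional

class Page:
    def __init__(self, data: bytes):
        self.data = data
        self.dirty = True          # freshly written pages must be flushed
        self.swap_slot: Optional[int] = None

class SwapDevice:
    """Toy swap device that counts flash writes."""
    def __init__(self):
        self.slots = {}
        self.next_slot = 0
        self.writes = 0
    def alloc(self) -> int:
        slot, self.next_slot = self.next_slot, self.next_slot + 1
        return slot
    def write(self, slot: int, data: bytes):
        self.slots[slot] = data
        self.writes += 1

def evict(page: Page, swap: SwapDevice) -> str:
    # Inclusive policy: the swap entry survives swap-in, so a clean
    # page that already has a slot is simply dropped -- no flash I/O.
    if page.swap_slot is not None and not page.dirty:
        return "dropped"
    if page.swap_slot is None:
        page.swap_slot = swap.alloc()
    swap.write(page.swap_slot, page.data)
    page.dirty = False
    return "written"
```

Under the exclusive policy the slot would have been freed at swap-in, so even a clean page would pay a full flash write on its next eviction.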
Optimized SWAP
Tree search
• “Bit tree”: no pointers, a node size is just one byte
• Fan-out degree is 8 (each bit points to a child node)
• An 8-level tree covers multi-terabytes of swap space
• Search cost: O(log N)
• Reduced swap structure size
  – Roughly, current swap mechanism vs. O-Swap = 10MB vs. 2MB (to support 32GB of swap space)
Optimized SWAP
[Figure: bit-tree example with numbered nodes 0–9]
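The bit tree above can be sketched as a hierarchical bitmap. The class below is a toy reconstruction under stated assumptions (the names, the propagation rules, and the exact layout are ours, not Samsung's code): each node is one byte, bit i of a node is set when the i-th child subtree still holds a free slot, and a lookup walks down one byte per level by picking the lowest set bit, giving O(log N) work with no pointers stored.

```python
class BitTree:
    """Toy hierarchical bitmap for free-swap-slot search (assumed design).

    Each node is one byte; bit i of a node is 1 when the i-th child
    subtree still contains at least one free slot. With fan-out 8,
    a tree of L levels covers 8**L slots.
    """
    FANOUT = 8

    def __init__(self, levels: int):
        self.levels = levels
        # Level l holds 8**l one-byte nodes; level 0 is the root.
        self.nodes = [bytearray([0xFF] * (self.FANOUT ** l))
                      for l in range(levels)]

    def alloc(self) -> int:
        """Return the index of a free slot and mark it used, or -1 if full."""
        if self.nodes[0][0] == 0:          # root byte 0 => tree is full
            return -1
        idx = 0
        for l in range(self.levels):       # descend one byte per level
            byte = self.nodes[l][idx]
            bit = (byte & -byte).bit_length() - 1   # lowest set bit
            idx = idx * self.FANOUT + bit
        self._clear(idx)
        return idx

    def free(self, slot: int):
        """Mark a slot free again, re-setting parent bits as needed."""
        idx = slot
        for l in range(self.levels - 1, -1, -1):
            node, bit = divmod(idx, self.FANOUT)
            was_empty = self.nodes[l][node] == 0
            self.nodes[l][node] |= 1 << bit
            if not was_empty:              # parent bit was already set
                break
            idx = node

    def _clear(self, slot: int):
        """Clear a slot's bit; propagate upward while a node becomes 0."""
        idx = slot
        for l in range(self.levels - 1, -1, -1):
            node, bit = divmod(idx, self.FANOUT)
            self.nodes[l][node] &= ~(1 << bit) & 0xFF
            if self.nodes[l][node] != 0:
                break
            idx = node
```

With one byte per node, metadata for N slots is roughly N/7 bytes in total (leaf bitmap plus the geometric series of internal levels), which is consistent in spirit with the slide's 2MB-for-32GB figure.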
Read-ahead
• No read-ahead (due to access randomness)
• Note also that SSDs have no seek time

I/O scheduler
• NOOP (due to randomness and the fast-response requirement)
• Bypass

Swappiness
• swappiness: 0
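The read-ahead, scheduler, and swappiness settings above map onto standard Linux knobs. A minimal sketch follows; the device name nvme0n1 is an assumption, and on kernels where NVMe sits on blk-mq the scheduler stage is skipped entirely, which matches the slide's "Bypass" note.

```shell
# Disable swap read-ahead: vm.page-cluster is a power-of-two exponent,
# so 0 means one page per swap-in (random access gains nothing from more).
sysctl -w vm.page-cluster=0

# Select the NOOP scheduler on the swap SSD (device name is an example;
# NVMe devices on blk-mq kernels have no legacy scheduler to set).
echo noop > /sys/block/nvme0n1/queue/scheduler

# swappiness = 0, as in the slide's configuration.
sysctl -w vm.swappiness=0
```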
Swap entry reclaim policy
• Avoid freeing swap entries whenever possible, so the swap device keeps a valid copy of every swapped-out page
Optimized SWAP
Evaluation - Memcached
System
  CPU: Xeon E5-2665 (HT disabled)
  # Cores: 16
  Network: 10Gb Ethernet
  SSD: Samsung XS1715 (NVMe)

Workload: YCSB
  DB size: 30GB
  Value length: 2048B
  # memcached threads: 64
  # Clients: 320
  Get : Update = 95% : 5%
Memory configurations
  SWAP:      DRAM 8GB + SSD swap 32GB
  OSWAP:     DRAM 8GB + SSD swap 32GB
  Full DRAM: DRAM 32GB
Evaluation - Memcached
[Figure: Memcached throughput (operations per second ×10,000) under SWAP, OSWAP, and Full DRAM; NVMe SSD, 10Gb network]
Evaluation - Memcached
[Figure: SWAP performance by latency segment, 256µs to 512ms, in operations per second ×1,000; the < 1ms QoS boundary is marked]
Evaluation - Memcached
[Figure: OSWAP performance by latency segment, 256µs to 512ms, in operations per second ×1,000; the < 1ms QoS boundary is marked]
Evaluation - Memcached
[Figure: Full DRAM performance by latency segment, 256µs to 128ms, in operations per second ×10,000; the < 1ms QoS boundary is marked]
Evaluation - Linkbench
System
  CPU: Xeon E5-2665 (HT disabled)
  # Cores: 16
  Network: 10Gb Ethernet
  SSD: Samsung XS1715 (NVMe)

Workload: Linkbench
  DB size: 30GB
  # Clients: 400

Memory configurations
  SWAP:      DRAM 8GB + SSD swap 32GB
  OSWAP:     DRAM 8GB + SSD swap 32GB
  Full DRAM: DRAM 32GB
Evaluation - Linkbench
[Figure: Linkbench throughput (requests per second ×1,000) under SWAP, OSWAP, and Full DRAM]
Rack-scale architecture
High-performance memory + high-capacity memory
Future Work
[Diagram: a compute node (CPUs + DRAM) connected over a PCIe cable, through controllers, to a memory device holding the high-capacity memory]
• Cost-effective memory capacity
• Exploit flash memory transparently
Conclusion