NVMW 2014 – Extending Main Memory with Flash: the Optimized SWAP Approach
DESCRIPTION
Title: Extending Main Memory with Flash – the Optimized SWAP Approach
Author: Jihyung Park, Hyuck Han, Sangyeun Cho
Memory Solutions Lab, Memory Business, Samsung Electronics

TRANSCRIPT
Jihyung Park, Hyuck Han and Sangyeun Cho
Memory Solutions Lab
Memory Business
Extending Main Memory with Flash – the Optimized SWAP Approach
1. Introduction
2. Optimized SWAP
3. Evaluation
4. Future Work
5. Conclusion
Why extend main memory with flash?
• To overcome DRAM scaling limitations and offer large working memory
• To reduce total cost of ownership (acquisition and operation)
• Flash has no seek time
• Flash has much lower latency than HDD

Two approaches toward memory extension
• Non-transparent approach: applications must be modified
• Transparent approach: applications are NOT aware of the underlying flash
Introduction
The current swap algorithm is optimized for HDD

Paging for a fast device
• Fast and simple vs. heavy and accurate
Motivation
Swap entry search
• A new search algorithm

I/O path optimization
• Swap read-ahead
• I/O scheduler
• Swappiness

Swap device as backing store: inclusive vs. exclusive
• We adjust the swap entry free policy to enforce that the swap device “includes” all swapped-out pages
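The payoff of the inclusive policy can be illustrated with a toy eviction routine. This is a sketch under stated assumptions, not the actual kernel code; all class and function names here are illustrative. The key point: because a swap entry survives swap-in, a page that was brought back into DRAM but never dirtied still has a valid copy on flash and can be evicted without any write.

```python
from typing import Optional

class Page:
    def __init__(self, data: bytes):
        self.data = data
        self.dirty = True          # freshly written pages must be flushed
        self.swap_slot: Optional[int] = None

class SwapDevice:
    """Toy swap device that counts flash writes."""
    def __init__(self):
        self.slots = {}
        self.next_slot = 0
        self.writes = 0
    def alloc(self) -> int:
        slot, self.next_slot = self.next_slot, self.next_slot + 1
        return slot
    def write(self, slot: int, data: bytes):
        self.slots[slot] = data
        self.writes += 1

def evict(page: Page, swap: SwapDevice) -> str:
    # Inclusive policy: the swap entry survives swap-in, so a clean
    # page that already has a slot is simply dropped -- no flash I/O.
    if page.swap_slot is not None and not page.dirty:
        return "dropped"
    if page.swap_slot is None:
        page.swap_slot = swap.alloc()
    swap.write(page.swap_slot, page.data)
    page.dirty = False
    return "written"
```

Under the exclusive policy the slot would have been freed at swap-in, so even a clean page would pay a full flash write on its next eviction.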
Optimized SWAP
Tree search
• “Bit tree”: no pointers, a node size is just one byte
• Fan-out degree is 8 (each bit points to a child node)
• An 8-level tree covers multi-terabytes of swap space
• Search cost: O(log N)
• Reduced swap structure size
  – Roughly, current swap mechanism vs. O-Swap = 10MB vs. 2MB (to support 32GB of swap space)
Optimized SWAP
[Figure: bit-tree example with numbered nodes 0–9]
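The bit tree above can be sketched as a hierarchical bitmap. The class below is a toy reconstruction under stated assumptions (the names, the propagation rules, and the exact layout are ours, not Samsung's code): each node is one byte, bit i of a node is set when the i-th child subtree still holds a free slot, and a lookup walks down one byte per level by picking the lowest set bit, giving O(log N) work with no pointers stored.

```python
class BitTree:
    """Toy hierarchical bitmap for free-swap-slot search (assumed design).

    Each node is one byte; bit i of a node is 1 when the i-th child
    subtree still contains at least one free slot. With fan-out 8,
    a tree of L levels covers 8**L slots.
    """
    FANOUT = 8

    def __init__(self, levels: int):
        self.levels = levels
        # Level l holds 8**l one-byte nodes; level 0 is the root.
        self.nodes = [bytearray([0xFF] * (self.FANOUT ** l))
                      for l in range(levels)]

    def alloc(self) -> int:
        """Return the index of a free slot and mark it used, or -1 if full."""
        if self.nodes[0][0] == 0:          # root byte 0 => tree is full
            return -1
        idx = 0
        for l in range(self.levels):       # descend one byte per level
            byte = self.nodes[l][idx]
            bit = (byte & -byte).bit_length() - 1   # lowest set bit
            idx = idx * self.FANOUT + bit
        self._clear(idx)
        return idx

    def free(self, slot: int):
        """Mark a slot free again, re-setting parent bits as needed."""
        idx = slot
        for l in range(self.levels - 1, -1, -1):
            node, bit = divmod(idx, self.FANOUT)
            was_empty = self.nodes[l][node] == 0
            self.nodes[l][node] |= 1 << bit
            if not was_empty:              # parent bit was already set
                break
            idx = node

    def _clear(self, slot: int):
        """Clear a slot's bit; propagate upward while a node becomes 0."""
        idx = slot
        for l in range(self.levels - 1, -1, -1):
            node, bit = divmod(idx, self.FANOUT)
            self.nodes[l][node] &= ~(1 << bit) & 0xFF
            if self.nodes[l][node] != 0:
                break
            idx = node
```

With one byte per node, metadata for N slots is roughly N/7 bytes in total (leaf bitmap plus the geometric series of internal levels), which is consistent in spirit with the slide's 2MB-for-32GB figure.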
Read-ahead
• No read-ahead (due to access randomness)
• Note also that SSDs have no seek time

I/O scheduler
• NOOP (due to randomness and the fast-response requirement)
• Bypass

Swappiness
• swappiness: 0
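The read-ahead, scheduler, and swappiness settings above map onto standard Linux knobs. A minimal sketch follows; the device name nvme0n1 is an assumption, and on kernels where NVMe sits on blk-mq the scheduler stage is skipped entirely, which matches the slide's "Bypass" note.

```shell
# Disable swap read-ahead: vm.page-cluster is a power-of-two exponent,
# so 0 means one page per swap-in (random access gains nothing from more).
sysctl -w vm.page-cluster=0

# Select the NOOP scheduler on the swap SSD (device name is an example;
# NVMe devices on blk-mq kernels have no legacy scheduler to set).
echo noop > /sys/block/nvme0n1/queue/scheduler

# swappiness = 0, as in the slide's configuration.
sysctl -w vm.swappiness=0
```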
Swap entry reclaim policy
• Avoid freeing swap entries whenever possible, so the swap device keeps a valid copy of every swapped-out page
Optimized SWAP
Evaluation - Memcached
System
  CPU: Xeon E5-2665 (HT disabled)
  # Cores: 16
  Network: 10Gb Ethernet
  SSD: Samsung XS1715 (NVMe)

Workload: YCSB
  DB size: 30GB
  Value length: 2048B
  # memcached threads: 64
  # Clients: 320
  Get : Update = 95% : 5%
Memory configurations
  SWAP:      DRAM 8GB + SSD swap 32GB
  OSWAP:     DRAM 8GB + SSD swap 32GB
  Full DRAM: DRAM 32GB
Evaluation - Memcached
[Figure: Memcached throughput (operations per second ×10,000) under SWAP, OSWAP, and Full DRAM; NVMe SSD, 10Gb network]
Evaluation - Memcached
[Figure: SWAP performance by latency segment, 256µs to 512ms, in operations per second ×1,000; the < 1ms QoS boundary is marked]
Evaluation - Memcached
[Figure: OSWAP performance by latency segment, 256µs to 512ms, in operations per second ×1,000; the < 1ms QoS boundary is marked]
Evaluation - Memcached
[Figure: Full DRAM performance by latency segment, 256µs to 128ms, in operations per second ×10,000; the < 1ms QoS boundary is marked]
Evaluation - Linkbench
System
  CPU: Xeon E5-2665 (HT disabled)
  # Cores: 16
  Network: 10Gb Ethernet
  SSD: Samsung XS1715 (NVMe)

Workload: Linkbench
  DB size: 30GB
  # Clients: 400

Memory configurations
  SWAP:      DRAM 8GB + SSD swap 32GB
  OSWAP:     DRAM 8GB + SSD swap 32GB
  Full DRAM: DRAM 32GB
Evaluation - Linkbench
[Figure: Linkbench throughput (requests per second ×1,000) under SWAP, OSWAP, and Full DRAM]
Rack-scale architecture
High-performance memory + high-capacity memory
Future Work
[Diagram: a compute node (CPUs + DRAM) connected over a PCIe cable, through controllers, to a memory device holding the high-capacity memory]
• Cost-effective memory capacity
• Exploit flash memory transparently
Conclusion