practical memory disaggregation

Practical Memory DisaggregationA Case Study in Network-Informed Data Systems Design

Mosharaf ChowdhuryNovember 2020

Five Years Ago…

The volume of data businesses want to make sense of is increasing

2015 2016 2017 2018

1. Data Volume Will Keep Increasing

2015 2016 2017 2018

Data Systems

Big Data AnalyticsAI/ML Tools

Massive dataHigh parallelismGPU clustersDistributed…

> 100 ms

Over the World

2. Deployed in Diverse Networks

< 10 µs

Within a Rack Within a Datacenter~ 1 ms

Network-Informed Data Systems Design

I. Network-adaptive Big Data and AI/ML systems

II. Tailoring data systems to extreme networksI. Computation over the InternetII. Leveraging high-speed networks

2016 20202017 20192018

Carbyne

Tiresias

NetLock

CellScope

PracticalMemoryDisaggregation

Memory is King!

Perform Great!

1.5420

100% 75% 50%

In-Memory Working Set

TPC-C on VoltDB

Perform Great Until Memory Runs Out

1.5420

100% 75% 50%

TPC-C on VoltDB

1.5420

100% 75% 50%

100% 75% 50%O

TPC-C on VoltDB FB Workload on Memcached

1.5420

100% 75% 50%

57 67.5

100% 75% 50%

TPC-C on VoltDB FB Workload on Memcached PageRank on PowerGraph95

100% 75% 50%O

50% Less Memory Causes Slowdown of …

1.5420

100% 75% 50%

57 67.5

100% 75% 50%

TPC-C on VoltDB FB Workload on Memcached PageRank on PowerGraph95

100% 75% 50%O

Between a Rock and a Hard Place

OverallocationLeads to underutilization

30-50% in Google, Alibaba, and Facebook

UnderallocationLeads to severe performance loss

Machine 1 Machine 2 Machine 3 Machine N

Used Memory Free Memory

Disaggregated Memory

Memory Disaggregation

Remote Memory 16

Network is Getting Faster!

TCP/IP Hundreds of µsec

DPDK Tens of µsec

RDMA Single-digit µsec

Hundreds of nsecDRAM

time to access a 4KB memory page17

What is PracticalMemoryDisaggregation?

1. Applicability2. Scalability3. Efficiency4. Performance5. Isolation6. Resilience7. Security8. Generality9. …

Infiniswap

w/ Juncheng Gu and many othersNSDI’17

Efficient Memory Disaggregation

Applicability

Application ChangesPe

Major NoMinor

Key-Value

Paging

Very Good

Key-Value

Paging

How can we enable any application to leverage disaggregated memory without sacrificing performance?

Remote Memory Paging

Exposes memory across server boundaries• Scalable• Efficient• Fault-tolerant

No changes to• applications,• operating systems, or• hardware

Core Idea

1. Infiniswap Block DeviceFinds free remote memory, maps pages, and provides fault tolerance without any central coordination

2. Infiniswap DaemonProactively evicts remote pages to ensure transparent, best-effortservice

Exposes free remote memory as swap devices in a decentralized manner without affecting remote processes

Infiniswap in One Slide

Container 1 Container NInfiniswapDaemon

User Space

Kernel Space

Local Disk

Async Sync

Machine-1

User Space

Kernel Space

InfiniswapDaemonContainer A

Machine-2

X Mapped to memory of Machine-X

Virtual Memory Manager (VMM)

Infiniswap Block Device

Individual pagePage fault

User Space

Kernel Space

InfiniswapDaemonContainer A

Machine-3

Container 1 Container NInfiniswapDaemon

User Space

Kernel Space

Local Disk

Async Sync

Machine-N

Virtual Memory Manager (VMM)

Infiniswap Block Device

Scalability via Decentralization

How to find free remote memory in a large cluster?• Problem: Centralized solution can be slow and expensive• Solution: Power of two choices

How to evict mapped memory?• Problem: LRU/LFU is hard because one-sided RDMA bypasses CPU• Solution: Power of many choices

Higher Efficiency & Better Load Balancing

Higher Utilization

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29Mem

Rank of 32 Machines

Infiniswap w/o Infiniswap

100% 75% 50%

56.1 63.9 64.2

100% 75% 50%

100% 75% 50%O

TPC-C on VoltDB FB Workload on Memcached

Even on 50% Memory, Slowdown is

PageRank on PowerGraph

<2X ≈1X ≈1X 27

w/ Hasan Al MarufATC’20 Best Paper

LeapEffectively Prefetching Remote Memory

User Space

Kernel Space

Device Mapping Layer

Block Device Driver

Generic Block Layer

I/O Scheduler Request QueueRequest queue processing:

Insertion, Merging, Sorting, Staging and Dispatch

Remote Memory

Dispatch Queue

Memory ManagementUnit (MMU)

Process 1 Process 2 Process N…

Page Fault

RDMA: 4.3 us

0.27 us

10.04 us

21.88 us

2.1 us

CacheMiss

CacheHit

MMU Page Cache

Life of a Page

Where Does the Time Go?Page Request

In Page Cache?

Allocate Cache for Page

Read Request?

Update Page Table & End I/OPrepare for I/O

Queue and Batch Requests

Execute I/O

0.12 µs

2.1 µs

10.04 µs

21.88 µs

RDMA: 4.3 µs

0.15 µs

Fast Path

Slow Path

We Need to

1. Increase cache hit• Faster path serves more page faults

2. Reduce the latency of the slow path• Remove block-layer operations unnecessary for RDMA

User Space

Kernel Space

Device Mapping Layer

Block Device Driver

Generic Block Layer

I/O Scheduler Request QueueRequest queue processing:

Insertion, Merging, Sorting, Staging and Dispatch

Remote Memory

Dispatch Queue

Page Fault

RDMA: 4.3 us

0.27 us

10.04 us

21.88 us

2.1 us

CacheMiss

CacheHit

MMU Page Cache

Life of a Page

User Space

Kernel Space

Remote Memory

Page Fault

RDMA: 4.3 us

0.27 us

2.1 us

CacheMiss

CacheHit

MMU Page Cache

Life of a Pagew/ Leap

Process Specific Page Access Tracker

Trend Detection

Prefetch CandidateGeneration

Prefetcher

Eager Cache Eviction

0.34 us

Prefetching in Linux

Reads ahead pages sequentially

Based only on the last page access

Does not distinguish between processesCannot detect thread-level access irregularities

too aggressive on seq: cache pollution

too conservative off seq: brings nothing

Trend Detection in LeapStart with a smaller window of Access History

Majority found?

Doubles the window size

No Yes

Run Boyer-Moore on the window

Return Majority ∆maj

Max. window

YesNo trend found

Resilient to short term irregularity

Identifies the majority element in access history

Regular trends can be found within recent accesses

Lowers Remote Page Access Latency by…Sequential Access Stride Access

0.01 1 100 10000

Latency (us)

Infiniswap

Infiniswap+Leap

0.01 1 100 10000

Latency (us)

4X 104X37

Performs Great Even After Memory Runs Out

TPC-C on VoltDB

100% 75% 50% 25%

Infiniswap

37 36.3 35.6

100% 75% 50% 25%

TPC-C on VoltDB

Infiniswap + Leap

≈1X 38

Performs Great Even After Memory Runs Out

TPC-C on VoltDB

100% 75% 50% 25%

Infiniswap

37 36.3 35.6

100% 75% 50% 25%

TPC-C on VoltDB

Infiniswap + Leap

2.4X 39

Applicability & Performance

Application Changes

Major NoMinor

Very Good

Key-Value

Infiniswap

Infiniswap + Leap

Memtrade

Justitia

w/ ZhuolongYu, Yiwen Zhang and othersSIGCOMM’20

NetLockLock Management with Programmable Switches

Transactions

Transaction processing needs• High throughput;• Low latency; and• Policy support

Existing approaches• Centralized: low throughput• Decentralized: limited policy support

Network-Assisted Lock Management

Transaction processing needs• High throughput;• Low latency; and• Policy support

Challenges• Limited memory to store the locks• Limited functionalities to process the

locks and realize the policies

Programmable Switch

NetLock processes lock requests with a combination of switch and servers• The switch only stores and

processes the requests on hot locks• Servers do the rest

Implemented on a 6.5Tbps Barefoot Tofino switch

Clients

L2/L3 Routing

LockTable

Lock TableServer

DatabaseServers

ToR Switch Lock TableServer

NetLock

NetLock Architecture

Switch Memory Disaggregation

Determine how much switch memory is needed for a target throughput• Formulated as a fractional knapsack problem• Depends of expected contention

Handling overflow• Move locks back and forth between switch and servers

Single µs Latency

20X lower latency for TPC-C over DSLR

Billions of Locks/Sec

18X higher throughput for TPC-C over DSLR

Memory Disaggregation

Wide-Area Computing

AI/MLSystems

Big DataSystems

Network-Informed Data Systems Design

Juncheng Gu Jie You Yiwen ZhangFan Lai Jiachen Liu Hasan Al Maruf Peifeng Yu

PhD Students

Undergraduate& Master’s

Collaborators

Chris ChenYinwei DaiShuoren FuSongyuan Guan

Jack KosaianQinye LiYang LiuYuze Lou

Alexander NebenWenting TanYue TanKaiwei Tu

Yuchen WangYujia XieYilei XuJiaxing Yang

Yiwei ZhangJiangchen ZhuJingyuan ZhuXiangfeng Zhu

Aditya AkellaGanesh AnanthanarayananWei BaiVladimir BravermanShuchi ChawlaKai ChenLi ChenAsaf CidonYanhui GengAli GhodsiAyush Goel

Robert GrandlChuanxiong GuoMatan HamilisAnthony HuangAnand P. IyerMyeongjae JeonXin JinSamir KhullerTan N. LeYoungmoon LeeLi Erran Li

Hongqiang LiuZhenhua LiuHarsha V. MadhyasthaKshiteej MahajanBarzan MozafariLinh NguyenAurojit PandaManish PurohitJunjie QianKannan RamchandranK. V. Rashmi

Kang G. ShinScott ShenkerBrent StephensIon StoicaXiao SunMuhammed UluyolShivaram VenkataramanCarl WaldspurgerHongyi WangJingfeng WuSheng Yang

BairenYiDong Young YoonZhuolongYuHong ZhangJunxue ZhangYuhong ZhongYibo Zhu

practical memory disaggregation

Documents

unsupervised disaggregation of low frequency...

operating systems, 142 practical session 9, memory 1

energy disaggregation ashokkumar

readactor-practical code randomization resilient to memory...

practical kernel memory disclosure detection ·...

white paper: bringing disaggregation to transport networks...

flash storage disaggregation

you practical ideas on activities, memory boxes

practical systemtap basics: perl memory profiling

operating systems, 112 practical session 9, memory 1

dpus: acceleration through disaggregation

system programming practical session 7 c++ memory handling

practical low-overhead enforcement of memory...

practical low-overhead enforcement of memory safety...

accelerating network disaggregation

the fruits of disaggregation: the general engineering...

practical byte-granular memory blacklisting using califorms

leakchecker: practical static memory leak detection for...

multisite rainfall downscaling and disaggregation in a...

disaggregation argument