practical memory disaggregation

Post on 01-Oct-2021

13 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

1

Practical Memory DisaggregationA Case Study in Network-Informed Data Systems Design

Mosharaf ChowdhuryNovember 2020

Five Years Ago…

3

The volume of data businesses want to make sense of is increasing

2015 2016 2017 2018

1. Data Volume Will Keep Increasing

4

2015 2016 2017 2018

Data Systems

5

Big Data AnalyticsAI/ML Tools

Massive dataHigh parallelismGPU clustersDistributed…

> 100 ms

Over the World

2. Deployed in Diverse Networks

< 10 µs

Within a Rack Within a Datacenter~ 1 ms

6

Network-Informed Data Systems Design

7

I. Network-adaptive Big Data and AI/ML systems

II. Tailoring data systems to extreme networksI. Computation over the InternetII. Leveraging high-speed networks

2016 20202017 20192018

HU

G

CO

DA

Herm

es

EC-C

ache

Carbyne

QO

OP

Tiresias

Salu

s

AlloX

NO

CS

Sol

Pand

o

Infin

iswap

DSL

R

Leap

NetLock

CellScope

PracticalMemoryDisaggregation

8

Memory is King!

9

Perform Great!

36.18

6.619

1.5420

5

10

15

20

25

30

35

40

100% 75% 50%

TPS

(T

hous

ands

)

In-Memory Working Set

TPC-C on VoltDB

10

Perform Great Until Memory Runs Out

36.18

6.619

1.5420

5

10

15

20

25

30

35

40

100% 75% 50%

TPS

(T

hous

ands

)

In-Memory Working Set

TPC-C on VoltDB

11

Perform Great Until Memory Runs Out

36.18

6.619

1.5420

5

10

15

20

25

30

35

40

100% 75% 50%

TPS

(T

hous

ands

)

In-Memory Working Set

95.8

44.9

23.8

0

20

40

60

80

100

120

100% 75% 50%O

ps (

Tho

usan

ds)

In-Memory Working Set

TPC-C on VoltDB FB Workload on Memcached

12

Perform Great Until Memory Runs Out

36.18

6.619

1.5420

5

10

15

20

25

30

35

40

100% 75% 50%

TPS

(T

hous

ands

)

In-Memory Working Set

57 67.5

453.4

1

10

100

1000

100% 75% 50%

Com

plet

ion

Tim

e (s

)

In-Memory Working Set

TPC-C on VoltDB FB Workload on Memcached PageRank on PowerGraph95

.8

44.9

23.8

0

20

40

60

80

100

120

100% 75% 50%O

ps (

Tho

usan

ds)

In-Memory Working Set

13

50% Less Memory Causes Slowdown of …

36.18

6.619

1.5420

5

10

15

20

25

30

35

40

100% 75% 50%

TPS

(T

hous

ands

)

In-Memory Working Set

57 67.5

453.4

1

10

100

1000

100% 75% 50%

Com

plet

ion

Tim

e (s

)

In-Memory Working Set

TPC-C on VoltDB FB Workload on Memcached PageRank on PowerGraph95

.8

44.9

23.8

0

20

40

60

80

100

120

100% 75% 50%O

ps (

Tho

usan

ds)

In-Memory Working Set

14

Between a Rock and a Hard Place

OverallocationLeads to underutilization

30-50% in Google, Alibaba, and Facebook

UnderallocationLeads to severe performance loss

VS.

15

Machine 1 Machine 2 Machine 3 Machine N

Used Memory Free Memory

Disaggregated Memory

Memory Disaggregation

Remote Memory 16

Network is Getting Faster!

TCP/IP Hundreds of µsec

DPDK Tens of µsec

RDMA Single-digit µsec

Hundreds of nsecDRAM

time to access a 4KB memory page17

What is PracticalMemoryDisaggregation?

1. Applicability2. Scalability3. Efficiency4. Performance5. Isolation6. Resilience7. Security8. Generality9. …

18

1. Applicability2. Scalability3. Efficiency4. Performance5. Isolation6. Resilience7. Security8. Generality9. …

19

What is PracticalMemoryDisaggregation?

Infiniswap

w/ Juncheng Gu and many othersNSDI’17

Efficient Memory Disaggregation

20

Applicability

Application ChangesPe

rfor

man

ce

Best

Poor

Major NoMinor

Key-Value

File

Disk

Paging

Very Good

File

Disk

Key-Value

Paging

How can we enable any application to leverage disaggregated memory without sacrificing performance?

Remote Memory Paging

22

Exposes memory across server boundaries• Scalable• Efficient• Fault-tolerant

No changes to• applications,• operating systems, or• hardware

Core Idea

23

1. Infiniswap Block DeviceFinds free remote memory, maps pages, and provides fault tolerance without any central coordination

2. Infiniswap DaemonProactively evicts remote pages to ensure transparent, best-effortservice

Exposes free remote memory as swap devices in a decentralized manner without affecting remote processes

Infiniswap in One Slide

Container 1 Container NInfiniswapDaemon

User Space

Kernel Space

Local Disk

Async Sync

RNIC

2 3 2

Machine-1

User Space

Kernel Space

InfiniswapDaemonContainer A

Machine-2

X Mapped to memory of Machine-X

Virtual Memory Manager (VMM)

Infiniswap Block Device

Individual pagePage fault

User Space

Kernel Space

InfiniswapDaemonContainer A

Machine-3

Container 1 Container NInfiniswapDaemon

User Space

Kernel Space

Local Disk

Async Sync

RNIC

2 3

Machine-N

Virtual Memory Manager (VMM)

Infiniswap Block Device

24

Scalability via Decentralization

How to find free remote memory in a large cluster?• Problem: Centralized solution can be slow and expensive• Solution: Power of two choices

How to evict mapped memory?• Problem: LRU/LFU is hard because one-sided RDMA bypasses CPU• Solution: Power of many choices

25

Higher Efficiency & Better Load Balancing

Higher Utilization

0

20

40

60

80

100

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29Mem

ory

Util

izat

ion

(%)

Rank of 32 Machines

Infiniswap w/o Infiniswap

26

35.89

27.74

19.33

0

5

10

15

20

25

30

35

40

100% 75% 50%

TPS

(T

hous

ands

)

In-Memory Working Set

56.1 63.9 64.2

1

10

100

1000

10000

100% 75% 50%

Com

plet

ion

Tim

e (s

)

In-Memory Working Set

99.1

100.

4

91.3

0

20

40

60

80

100

120

140

160

100% 75% 50%O

ps (

Tho

usan

ds)

In-Memory Working Set

TPC-C on VoltDB FB Workload on Memcached

Even on 50% Memory, Slowdown is

PageRank on PowerGraph

<2X ≈1X ≈1X 27

1. Applicability2. Scalability3. Efficiency4. Performance5. Isolation6. Resilience7. Security8. Generality9. …

28

What is PracticalMemoryDisaggregation?

w/ Hasan Al MarufATC’20 Best Paper

29

LeapEffectively Prefetching Remote Memory

User Space

Kernel Space

Device Mapping Layer

Block Device Driver

Generic Block Layer

I/O Scheduler Request QueueRequest queue processing:

Insertion, Merging, Sorting, Staging and Dispatch

bio

Remote Memory

Dispatch Queue

Memory ManagementUnit (MMU)

Process 1 Process 2 Process N…

Page Fault

RDMA: 4.3 us

0.27 us

10.04 us

21.88 us

2.1 us

CacheMiss

CacheHit

MMU Page Cache

Life of a Page

30

Where Does the Time Go?Page Request

In Page Cache?

Allocate Cache for Page

Read Request?

No

Yes

Update Page Table & End I/OPrepare for I/O

YesNo

Queue and Batch Requests

Execute I/O

0.12 µs

2.1 µs

10.04 µs

21.88 µs

RDMA: 4.3 µs

0.15 µs

Fast Path

Slow Path

31

We Need to

32

1. Increase cache hit• Faster path serves more page faults

2. Reduce the latency of the slow path• Remove block-layer operations unnecessary for RDMA

User Space

Kernel Space

Device Mapping Layer

Block Device Driver

Generic Block Layer

I/O Scheduler Request QueueRequest queue processing:

Insertion, Merging, Sorting, Staging and Dispatch

bio

Remote Memory

Dispatch Queue

Memory ManagementUnit (MMU)

Process 1 Process 2 Process N…

Page Fault

RDMA: 4.3 us

0.27 us

10.04 us

21.88 us

2.1 us

CacheMiss

CacheHit

MMU Page Cache

Life of a Page

33

User Space

Kernel Space

Remote Memory

Memory ManagementUnit (MMU)

Process 1 Process 2 Process N…

Page Fault

RDMA: 4.3 us

0.27 us

2.1 us

CacheMiss

CacheHit

MMU Page Cache

Life of a Pagew/ Leap

Process Specific Page Access Tracker

Leap

Trend Detection

Prefetch CandidateGeneration

Prefetcher

Eager Cache Eviction

34

0.34 us

Prefetching in Linux

Reads ahead pages sequentially

Based only on the last page access

Does not distinguish between processesCannot detect thread-level access irregularities

too aggressive on seq: cache pollution

too conservative off seq: brings nothing

35

Trend Detection in LeapStart with a smaller window of Access History

Majority found?

Doubles the window size

No Yes

Run Boyer-Moore on the window

Return Majority ∆maj

Max. window

size?

YesNo trend found

No

Resilient to short term irregularity

Identifies the majority element in access history

Regular trends can be found within recent accesses

36

Lowers Remote Page Access Latency by…Sequential Access Stride Access

0

0.2

0.4

0.6

0.8

1

0.01 1 100 10000

CD

F

Latency (us)

Infiniswap

Infiniswap+Leap

0

0.2

0.4

0.6

0.8

1

0.01 1 100 10000

CD

F

Latency (us)

4X 104X37

Performs Great Even After Memory Runs Out

TPC-C on VoltDB

<2X

37.00

27.74

19.33

1.50

5

10

15

20

25

30

35

40

100% 75% 50% 25%

TPS

(T

hous

ands

)

In-Memory Working Set

Infiniswap

37 36.3 35.6

15.6

0

5

10

15

20

25

30

35

40

100% 75% 50% 25%

TPS

(T

hous

ands

)

In-Memory Working Set

TPC-C on VoltDB

Infiniswap + Leap

≈1X 38

Performs Great Even After Memory Runs Out

TPC-C on VoltDB

37.00

27.74

19.33

1.50

5

10

15

20

25

30

35

40

100% 75% 50% 25%

TPS

(T

hous

ands

)

In-Memory Working Set

Infiniswap

37 36.3 35.6

15.6

0

5

10

15

20

25

30

35

40

100% 75% 50% 25%

TPS

(T

hous

ands

)

In-Memory Working Set

TPC-C on VoltDB

Infiniswap + Leap

2.4X 39

Applicability & Performance

Application Changes

Perf

orm

ance

Best

Poor

Major NoMinor

Very Good

File

Disk

Key-Value

Infiniswap

40

Infiniswap + Leap

1. Applicability2. Scalability3. Efficiency4. Performance5. Isolation6. Resilience7. Security8. Generality9. …

41

What is PracticalMemoryDisaggregation?

Memtrade

Justitia

Hydra

1. Applicability2. Scalability3. Efficiency4. Performance5. Isolation6. Resilience7. Security8. Generality9. …

42

What is PracticalMemoryDisaggregation?

w/ ZhuolongYu, Yiwen Zhang and othersSIGCOMM’20

43

NetLockLock Management with Programmable Switches

Transactions

44

Transaction processing needs• High throughput;• Low latency; and• Policy support

Existing approaches• Centralized: low throughput• Decentralized: limited policy support

Network-Assisted Lock Management

45

Transaction processing needs• High throughput;• Low latency; and• Policy support

Challenges• Limited memory to store the locks• Limited functionalities to process the

locks and realize the policies

Programmable Switch

NetLock processes lock requests with a combination of switch and servers• The switch only stores and

processes the requests on hot locks• Servers do the rest

Implemented on a 6.5Tbps Barefoot Tofino switch

46

Clients

L2/L3 Routing

LockTable

Lock TableServer

DatabaseServers

ToR Switch Lock TableServer

NetLock

NetLock Architecture

Switch Memory Disaggregation

47

Determine how much switch memory is needed for a target throughput• Formulated as a fractional knapsack problem• Depends of expected contention

Handling overflow• Move locks back and forth between switch and servers

48

Single µs Latency

20X lower latency for TPC-C over DSLR

Billions of Locks/Sec

49

18X higher throughput for TPC-C over DSLR

1. Applicability2. Scalability3. Efficiency4. Performance5. Isolation6. Resilience7. Security8. Generality9. …

50

What is PracticalMemoryDisaggregation?

51

Memory Disaggregation

Wide-Area Computing

AI/MLSystems

Big DataSystems

Network-Informed Data Systems Design

52

Juncheng Gu Jie You Yiwen ZhangFan Lai Jiachen Liu Hasan Al Maruf Peifeng Yu

PhD Students

Undergraduate& Master’s

Collaborators

Chris ChenYinwei DaiShuoren FuSongyuan Guan

Jack KosaianQinye LiYang LiuYuze Lou

Alexander NebenWenting TanYue TanKaiwei Tu

Yuchen WangYujia XieYilei XuJiaxing Yang

Yiwei ZhangJiangchen ZhuJingyuan ZhuXiangfeng Zhu

Aditya AkellaGanesh AnanthanarayananWei BaiVladimir BravermanShuchi ChawlaKai ChenLi ChenAsaf CidonYanhui GengAli GhodsiAyush Goel

Robert GrandlChuanxiong GuoMatan HamilisAnthony HuangAnand P. IyerMyeongjae JeonXin JinSamir KhullerTan N. LeYoungmoon LeeLi Erran Li

Hongqiang LiuZhenhua LiuHarsha V. MadhyasthaKshiteej MahajanBarzan MozafariLinh NguyenAurojit PandaManish PurohitJunjie QianKannan RamchandranK. V. Rashmi

Kang G. ShinScott ShenkerBrent StephensIon StoicaXiao SunMuhammed UluyolShivaram VenkataramanCarl WaldspurgerHongyi WangJingfeng WuSheng Yang

BairenYiDong Young YoonZhuolongYuHong ZhangJunxue ZhangYuhong ZhongYibo Zhu

top related