an empirical guide to scalable persistent memory · persistent memory joseph izraelevitz, jian...

54
1 An Empirical Guide to Scalable Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amirsaman Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu, Subramanya R. Dulloor, Jishen Zhao, Steven Swanson Non-Volatile Systems Laboratory Department of Computer Science & Engineering University of California, San Diego Department of Electrical, Computer & Energy Engineering University of Colorado, Boulder

Upload: others

Post on 29-Sep-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

1

An Empirical Guide to Scalable Persistent Memory

Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amirsaman Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu,

Subramanya R. Dulloor, Jishen Zhao, Steven Swanson

Non-Volatile Systems LaboratoryDepartment of Computer Science & Engineering

University of California, San Diego

Department of Electrical, Computer & Energy EngineeringUniversity of Colorado, Boulder

Page 2: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

2

Optane DIMMs are Here!

Page 3: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

3

Optane DIMMs are Here!

• Not Just Slow Dense DRAMTM

Page 4: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

4

Optane DIMMs are Here!

• Not Just Slow Dense DRAMTM

• Slower, denser, media

-> More complex architecture

-> Second-order performance anomalies

-> Fundamentally different

Page 5: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

6

Basics: Emulation is misleading

RocksDB

OptaneDRAM

Page 6: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

7

Outline

• Background

• Basics: Optane DIMM Performance

• Lessons: Optane DIMM Best Practices

• Conclusion

Page 7: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

8

Background

Page 8: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

9

Background: Optane in the Machine

Core CoreCore

Page 9: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

10

Background: Optane in the Machine

Core Core

L1/2 L1/2

L3

Core

L1/2

Page 10: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

11

Background: Optane in the Machine

Core Core

L1/2 L1/2

L3

iMC iMC

Core

L1/2

Page 11: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

12

Background: Optane in the Machine

Core Core

L1/2 L1/2

L3

iMC iMC

Core

L1/2

DRAM

DRAM

DRAM

DRAM

DRAM

DRAM

Page 12: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

13

Background: Optane in the Machine

Core Core

L1/2 L1/2

L3

iMC iMC

Core

L1/2

DRAM

DRAM

DRAM

Optane DIMM

Optane DIMM

Optane DIMM

DRAM

DRAM

DRAM

Page 13: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

14

Background: Optane in the Machine

Core Core

L1/2 L1/2

L3

iMC iMC

AppDirectMode

Core

L1/2

DRAM

DRAM

DRAM

Optane DIMM

Optane DIMM

Optane DIMM

DRAM

DRAM

DRAM

Page 14: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

15

Background: Optane in the Machine

Core Core

L1/2 L1/2

L3

iMC iMC

Core

L1/2

AppDirectMode

DRAM

DRAM

DRAM

Optane DIMM

Optane DIMM

Optane DIMM

Optane DIMM

Optane DIMM

Optane DIMM

DRAM

DRAM

DRAM

Far MemoryNear Memory

(Direct Mapped Cache)

Memory Mode

Page 15: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

16

Background: Optane in the Machine

Core Core

L1/2 L1/2

L3

iMC iMC

Optane DIMM

Optane DIMM

Optane DIMM

DRAM

DRAM

DRAM

Far MemoryNear Memory

(Direct Mapped Cache)

Memory Mode Core

L1/2

AppDirectMode

DRAM

DRAM

DRAM

Optane DIMM

Optane DIMM

Optane DIMM

Page 16: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

17

Background: Optane Internals

$

MC

Optane

DRAM

Optane DIMM

Page 17: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

18

Background: Optane Internals

$

MC

Optane

DRAM

WPQ

iMC

Optane DIMM

From CPU

Page 18: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

19

Background: Optane Internals

$

MC DRAM

Optane Controller

WPQ

iMC

Optane DIMM

From CPU

Optane

Page 19: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

20

Background: Optane Internals

$

MC DRAM

Optane Media

Optane Controller

WPQ

iMC

Optane DIMM

From CPU

256B Block

Optane

Page 20: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

21

Background: Optane Internals

$

MC DRAM

Optane Media

Optane ControllerOptaneBuffer

WPQ

iMC

Optane DIMM

From CPU

256B Block

Optane

Page 21: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

22

Background: Optane Internals

$

MC DRAM

Optane Media

Optane ControllerOptaneBuffer

AIT Cache

AIT

WPQ

iMC

Optane DIMM

From CPU

256B Block

Optane

Page 22: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

23

Background: Optane Internals

$

MC DRAM

Optane Media

Optane ControllerOptaneBuffer

AIT Cache

AIT

WPQ

iMC

Optane DIMM

From CPU

256B Block

ADR (Asynchronous DRAM Refresh)

Optane

Page 23: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

24

Background: Optane Interleaving

• Optane is interleaved across NVDIMMs at 4KB granularity.

Optane4

Optane5

Optane6

Optane1

Optane2

Optane3

Page 24: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

25

Background: Optane Interleaving

• Optane is interleaved across NVDIMMs at 4KB granularity.

Optane4

Optane5

Optane6

Optane1

Optane2

Optane3

4 5 61 2 3

Phys. Addr: 4KB 12KB 20KB0 8KB 16KB

1

24KB

Page 25: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

26

Basics: How does Optane DC perform?

Page 26: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

27

Basics: Our Approach

• Microbenchmark sweeps across state space– Access Patterns (random, sequential)

– Operations (read, ntstore, clflush, etc.)

– Access/Stride Size

– Power Budget

– NUMA Configuration

– Address Space Interleaving

• Targeted experiments

• Total: 10,000 experiments

Page 27: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

28

Test Platform

• CPU– Intel Cascade Lake, 24 cores at 2.2 GHz in 2 sockets

– Hyperthreading off

• DRAM– 32 GB Micron DDR4 2666 MHz

– 384 GB across 2 sockets w/ 6 channels

• Optane– 256 GB Intel Optane 2666 MHz QS

– 3 TB across 2 sockets w/ 6 channels

• OS– Fedora 27, 4.13.0

Page 28: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

29

Basics: Latency

• 2x -3x as slow as DRAM

• Write latency masked by ADR

Page 29: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

31

Basics: Bandwidth

• ~Scalable reads, Non-scalable writes

Page 30: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

32

Basics: Bandwidth

• Access size matters

Page 31: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

33

Basics: Bandwidth

• A mystery!

*answer in ~10 minutes

Page 32: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

34

Lessons: What are Optane Best Practices?

Page 33: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

35

Lessons: What are Optane Best Practices?

• Avoid small random accesses

• Use ntstores for large writes

• Limit threads accessing one NVDIMM

Page 34: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

36

Lesson #1: Avoid small random accesses

Page 35: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

37

Lesson #1: Avoid small random accesses

Optane Media

Optane ControllerOptaneBuffer

AIT Cache

AIT

WPQ

iMC

Optane DIMM

From CPU

256B Block

$

MC

Optane

DRAM

Page 36: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

38

Lesson #1: Avoid small random accesses

Optane Media

Optane ControllerOptaneBuffer

AIT Cache

AIT

WPQ

iMC

Optane DIMM

From CPU

256B Block

$

MC

Optane

DRAM

Page 37: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

39

Lesson #1: Avoid small random accesses

Optane Media

Optane ControllerOptaneBuffer

AIT Cache

AIT

WPQ

iMC

Optane DIMM

From CPU

256B Block

$

MC

Optane

DRAM

Page 38: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

40

Lessons: Optane Buffer Size

• Write amplification if working set is larger than Optane Buffer

Page 39: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

42

Lesson #1: Avoid small random accesses

• Bad bandwidth with:

– Small random writes (<256B)

– Not tiny working set / NVDIMM (>16KB)

• Good bandwidth with

– Sequential accesses

Page 40: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

43

Lesson #2: Use ntstores for large writes

Page 41: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

44

Lesson #2: Use ntstores for large writes

• Non-temporal stores bypass the cache

Page 42: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

45

Lessons: Store instructions

ntstore store + clwb store

$

MC

Optane

DRAM

1

2

3

1

2

3

Page 43: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

46

Lessons: Store instructions

ntstore store + clwb store

$

MC

Optane

DRAM

1

2

3

1

2

3

*lost bandwidth *lost locality

Page 44: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

47

Lesson #2: Use ntstores for large writes

Page 45: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

48

Lesson #2: Use ntstores for large writes

• Non-temporal stores bypass the cache

– Avoid cache-line read

– Maintain locality

Page 46: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

49

Lesson #3: Limit threads accessing one NVDIMM

Page 47: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

50

Lesson #3: Limit threads accessing one NVDIMM

• Contention at Optane Buffer

• Contention at iMC

Page 48: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

51

Lessons: Contention at Optane Buffer

• More threads = access amplification = lower bandwidth

Page 49: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

52

Lessons: Contention at media & iMC

• iMCs queues are small & media is slow

– Short queues end up clogged

Page 50: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

54

Lessons: Contention at media & iMC

• Contention is largest when random access size = interleave size (4KB)

Page 51: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

55

Lessons: Contention at media & iMC

• Contention is largest when random access size = interleave size (4KB)

• Load is fairest when access size = #DIMMs x interleave size (24KB)

Page 52: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

56

Lesson #3: Limit threads accessing one NVDIMM

• Contention at Optane Buffer

– Increase access amplification

• Contention at media/iMC

– Lose bandwidth through uneven NVDIMM load

– Avoid interleave aligned random accesses

Page 53: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

61

Conclusion

Page 54: An Empirical Guide to Scalable Persistent Memory · Persistent Memory Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, ... –32 GB Micron DDR4 2666 MHz –384 GB across

62

Conclusion

• Not Just Slow Dense DRAM

• Slower media

-> More complex architecture

-> Second-order performance anomalies

-> Fundamentally different

• Max performance is tricky– Avoid small random accesses

– Use ntstores for large writes

– Limit threads accessing one NVDIMM