
Page 1: Exploiting 3D-Stacked Memory Devices

Rajeev Balasubramonian
School of Computing, University of Utah
Oct 2012

Page 2: Power Contributions

[Figure: processor and memory contributions as a percentage of total server power]

Page 3: Power Contributions

[Figure: the same processor vs. memory power chart, continued]

Page 4: Example IBM Server

Source: P. Bose, WETI Workshop, 2012

Page 5: Reasons for Memory Power Increase

• Innovations for the processor, but not for memory

• Harder to get to memory (buffer chips)

• New workloads that demand more memory: SAP HANA in-memory databases, SAS in-memory analytics

Page 6: The Cost of Data Movement

• 64-bit double-precision FP MAC: 50 pJ (NSF CPOM Workshop report)

• 1 instruction on an ARM Cortex A5: 80 pJ (ARM datasheets)

• Fetching 256-bit block from a distant cache bank: 1.2 nJ (NSF CPOM Workshop report)

• Fetching 256-bit block from an HMC device: 2.68 nJ

• Fetching 256-bit block from a DDR3 device: 16.6 nJ (Jeddeloh and Keeth, 2012 Symp. on VLSI Technology)
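To put these figures on a common scale, a quick back-of-the-envelope script (the numbers are copied from the bullets above; the per-bit conversion is just nJ divided by 256 bits):

```python
# Energy to fetch one 256-bit block, from the numbers above (joules).
FETCH_ENERGY = {
    "distant cache bank": 1.2e-9,   # NSF CPOM report
    "HMC device":         2.68e-9,  # Jeddeloh and Keeth, VLSI 2012
    "DDR3 device":        16.6e-9,  # Jeddeloh and Keeth, VLSI 2012
}
FP_MAC = 50e-12  # 64-bit double-precision FP MAC

for source, joules in FETCH_ENERGY.items():
    print(f"{source}: {joules / FP_MAC:.0f}x one FP MAC, "
          f"{joules / 256 * 1e12:.1f} pJ/bit")
# DDR3 costs 16.6 / 2.68 ~ 6.2x more than HMC per block fetched.
```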

Page 7: Memory Basics

[Diagram: host multi-core processor with four on-chip memory controllers (MC), each driving a memory channel]

Page 8: FB-DIMM

[Diagram: FB-DIMM organization; DIMMs daisy-chained behind each memory controller]

Page 9: SMB/SMI

[Diagram: SMB/SMI organization; buffer chips sit between the memory controllers and the DRAM]

Page 10: Micron Hybrid Memory Cube Device

Page 11: HMC Architecture

[Diagram: host processor with memory controllers linked to HMC devices; each HMC stacks DRAM dies above a logic layer]

Page 12: Key Points

• HMC allows logic layer to easily reach DRAM chips

• Open question: new functionalities on the logic chip – cores, routing, refresh, scheduling

• Data transfer out of the HMC is just as expensive as before

Two responses:

• Near Data Computing … to cut off-HMC movement

• Intelligent Network-of-Memories … to reduce hops

Page 13: Near Data Computing (NDC)

Page 14: Timely Innovation

• A low-cost way to achieve NDC

• Workloads that are embarrassingly parallel

• Workloads that are increasingly memory bound

• Mature frameworks (MapReduce) in place

Page 15: Open Questions

• What workloads will benefit from this?

• What causes the benefit?

Page 16: Workloads

• Initial focus on MapReduce, but any workload with localized data access patterns will be a good fit

• Map phase in MapReduce: the dataset is partitioned and each Map task works on its own "split"; embarrassingly parallel, localized data access, often the bottleneck; e.g., count word occurrences in each individual document (sketched below)

• Reduce phase in MapReduce: aggregates the results of many Mappers; requires random access to data, but deals with less data than the Mappers; e.g., summing up the occurrences of each word
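As a concrete illustration, a minimal word-count map/reduce sketch in Python (illustrative only; the function names are invented here, not taken from the talk):

```python
from collections import Counter

def map_phase(split):
    """Mapper: count word occurrences in one 'split' (e.g., one document).
    Embarrassingly parallel; touches only its own local data."""
    return Counter(split.split())

def reduce_phase(partial_counts):
    """Reducer: aggregate partial counts from many Mappers.
    Random access across mapper outputs, but far less data."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

docs = ["the cat sat", "the dog sat", "the cat ran"]
print(reduce_phase(map_phase(d) for d in docs))
# Counter({'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'ran': 1})
```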

Page 17: Baseline Architecture

[Diagram: baseline two-socket system organization]

• Mappers and Reducers both execute on the host processor

• Many simple cores are better than a few complex cores

• 2 sockets, 256 GB memory, 260 W processing power budget; 512 ARM cores (EE-Cores) per socket, each running at 876 MHz

Page 18: NDC Architecture

[Diagram: NDC system organization, with ND Cores in the HMC logic layers]

• Mappers execute on ND Cores; Reducers execute on the host processor

• 32 ND Cores per HMC; 2048 total ND Cores and 1024 total EE-Cores; 260 W total processing power budget
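A quick sanity check on the two configurations (the 64-HMC count is derived from the slide's numbers, not stated explicitly; a sketch, not part of the talk):

```python
# Baseline: all compute on the host.
ee_cores = 2 * 512                 # 2 sockets x 512 EE-Cores = 1024

# NDC: Mappers move into the memory devices.
nd_cores_per_hmc = 32
nd_cores_total = 2048
hmc_count = nd_cores_total // nd_cores_per_hmc  # 64 HMCs (derived)

POWER_BUDGET_W = 260               # same processing budget in both systems
print(hmc_count, ee_cores + nd_cores_total)     # 64 HMCs, 3072 cores
```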

Page 19: NDC Memory Hierarchy

[Diagram: NDC memory hierarchy]

• Memory latency excludes delay for link queuing and traversal

• Many row-buffer hits

• L1 instruction and data caches per ND Core

• Each vault reserves space for intermediate outputs and for Mapper/runtime code and data

Page 20: Methodology

• Three workloads:

  Range-Aggregate: count occurrences of something

  Group-By: count occurrences of everything

  Equi-Join: for two tables, count the pairs of records whose join attributes match

• Dataset: 1998 World Cup web server logs

• Simulations of individual Mappers and Reducers on EE-Cores using the TRAX simulator
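A rough sketch of what the three queries compute, over toy stand-ins for the log records (illustrative; the field names are invented, not from the talk):

```python
from collections import Counter

logs = [  # toy stand-ins for 1998 World Cup web-server log records
    {"url": "/a", "status": 200}, {"url": "/b", "status": 404},
    {"url": "/a", "status": 200}, {"url": "/c", "status": 200},
]

# Range-Aggregate: count occurrences of one thing.
hits_on_a = sum(1 for r in logs if r["url"] == "/a")

# Group-By: count occurrences of everything.
hits_per_url = Counter(r["url"] for r in logs)

# Equi-Join: count pairs across two tables with matching attributes.
owners = [{"url": "/a", "owner": "x"}, {"url": "/b", "owner": "y"}]
join_pairs = sum(1 for r in logs for o in owners if r["url"] == o["url"])

print(hits_on_a, hits_per_url, join_pairs)  # 2, Counter(...), 3
```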

Page 21: Single Thread Performance

Page 22: Effect of Bandwidth

Page 23: Exec Time vs. Frequency

Page 24: Maximizing the Power Budget

Page 25: Scaling the Core Count

Page 26: Energy Reduction

Page 27: Results Summary

• Execution time reductions of 7%-89%

• NDC performance scales better with core count

• Energy reduction of 26%-91%

Sources of the gains: no bandwidth limitation, lower memory access latency, lower bit-transport energy.

Page 28: Intelligent Network of Memories

• How should several HMCs be connected to the processor?

• How should data be placed in these HMCs?

Page 29: Contributions

• Evaluation of different network topologies: route adaptivity does help

• Page placement to bring popular data to nearby HMCs: percolate-down based on page access counts

• Use of router bypassing under low load

• Use of deep sleep modes for distant HMCs

Page 30: Topologies

Page 31: Topologies

Page 32: Topologies

[Figures: (d) F-Tree and (e) T-Tree topologies]

Page 33: Network Properties

• Supports 44-64 HMC devices with 2-4 rings

• Adaptive routing (deadlock avoidance based on timers)

• An entire page resides in one ring, but cache lines are striped across the channels
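One way to realize the last property, sketched under assumed parameters (page size, line size, and channel count are not given on the slide; the ring assignment comes from the placement policy on the next slide):

```python
PAGE_SIZE  = 4096   # bytes (assumed)
LINE_SIZE  = 64     # bytes (assumed)
N_CHANNELS = 4      # processor-to-network channels (assumed)

# ring_of_page is maintained by the page-placement policy: new pages
# start in Ring-0 and may percolate down to farther rings over time.
ring_of_page = {}

def locate(addr):
    page = addr // PAGE_SIZE
    line = addr // LINE_SIZE
    ring = ring_of_page.setdefault(page, 0)  # whole page in one ring
    channel = line % N_CHANNELS              # lines striped over channels
    return ring, channel

print(locate(0x1000), locate(0x1040))  # same ring (0), channels 0 and 1
```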

Page 34: Percolate-Down Page Placement

• New pages are placed in the nearest ring

• Periodically, inactive pages are demoted to the next ring; thresholds matter because of queuing delays

• Activity is tracked with the multi-queue algorithm: hierarchical queues, where each entry has a timer and an access count; an entry is demoted to a lower queue if its timer expires and promoted to a higher queue if its access count is high (see the sketch after this list)

• Page migration is off the critical path and striped across many channels; the distant links it uses are under-utilized anyway
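A minimal sketch of that multi-queue bookkeeping (queue depth, lifetime, and promotion threshold are assumed values, not from the talk):

```python
NUM_QUEUES    = 4      # hierarchy depth (assumed)
LIFETIME      = 1000   # time units before an entry's timer expires (assumed)
PROMOTE_COUNT = 16     # access count that triggers promotion (assumed)

class PageEntry:
    """Per-page activity state: a queue level, an access count, a timer."""
    def __init__(self, now):
        self.level = 0                 # queue index; higher means hotter
        self.count = 0
        self.expiry = now + LIFETIME

def on_access(entry, now):
    entry.count += 1
    entry.expiry = now + LIFETIME      # any access resets the timer
    if entry.count >= PROMOTE_COUNT and entry.level < NUM_QUEUES - 1:
        entry.level += 1               # promote to a higher queue
        entry.count = 0

def check_timer(entry, now):
    if now >= entry.expiry:
        if entry.level > 0:
            entry.level -= 1           # demote to a lower queue
        entry.expiry = now + LIFETIME

# Pages sitting in queue 0 with expired timers are the candidates
# for percolating down to the next ring.
```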

Page 35: Router Bypassing

• Topologies with more links and adaptive routing (T-Tree) are better… but distant links experience relatively low load

• While a complex router is required for the T-Tree, the router can often be bypassed

Page 36: Power-Down Modes

• The activity shift to nearby rings leaves distant HMCs under-utilized

• Can power off the DRAM layers (PD-0) and the SerDes circuits (PD-1)

• 26% energy saving for a 5% performance penalty
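One hypothetical way the two modes could be engaged, keyed to idle time (the thresholds and the nesting of PD-1 as the deeper state are assumptions; the talk only names the two modes):

```python
# Idle-time thresholds (assumed; not given in the talk).
PD0_THRESHOLD = 10_000    # cycles before the DRAM layers power off
PD1_THRESHOLD = 100_000   # cycles before the SerDes circuits also power off

def power_state(idle_cycles):
    if idle_cycles >= PD1_THRESHOLD:
        return "PD-1"     # SerDes off: deepest sleep
    if idle_cycles >= PD0_THRESHOLD:
        return "PD-0"     # DRAM layers off
    return "ACTIVE"
```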

Page 37: Methodology

• 128-thread traces of NAS parallel benchmarks (capacity requirements of nearly 211 GB)

• Detailed simulations of traces with 1 billion memory accesses; confirmatory page-access simulations for the entire application

• Power breakdown: 3.7 pJ/bit for DRAM access, 6.8 pJ/bit for HMC logic layer, 3.9 pJ/bit for a 5x5 router
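Composing these per-bit figures for one 256-bit request (how the components add up, including one router traversal per hop, is an assumption here, not from the talk):

```python
BITS = 256
dram_pj_per_bit   = 3.7
logic_pj_per_bit  = 6.8
router_pj_per_bit = 3.9   # per 5x5 router traversed

def request_energy_nj(hops):
    per_bit = dram_pj_per_bit + logic_pj_per_bit + hops * router_pj_per_bit
    return BITS * per_bit / 1000.0

print(request_energy_nj(1))  # 3.6864 -> ~3.7 nJ for a one-hop access
```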

Page 38: Results – Normalized Exec Time

• T-Tree P-Down reduces exec time by 50%

• 86% of flits bypass the router

• 88% of requests serviced by Ring-0

Page 39: Results – Energy

Page 40: Summary

• Must reduce data movement on off-chip memory links

• NDC reduces energy and improves performance by overcoming the bandwidth wall

• More work required to analyze workloads, build software frameworks, analyze thermals, etc.

• iNoM uses OS page placement to minimize hops for popular data and increase power-down opportunities

• Path diversity is useful, router overhead is small

Page 41: Acknowledgements

• Co-authors: Kshitij Sudan, Seth Pugsley, Manju Shevgoor, Jeff Jestes, Al Davis, Feifei Li

• Group funded by: NSF, HP, Samsung, IBM

Page 42: Backup Slide

Page 43: Backup Slide