Diamonds are a Memory Controller’s Best Friend*

*Also known as: Achieving Predictable Performance through Better Memory Controller Placement in Many-Core CMPs, from ISCA ’09. Those responsible for the original title have been sacked.

Dennis Abts, Google
Natalie Enright Jerger, University of Toronto
John Kim, KAIST
Dan Gibson, Univ of Wisconsin
Mikko Lipasti, Univ of Wisconsin

Page 1: Diamonds are a Memory Controller’s Best Friend*

Diamonds are a Memory Controller’s Best Friend*

*Also known as: Achieving Predictable Performance through Better Memory Controller Placement in Many-Core CMPs, from ISCA ’09. Those responsible for the original title have been sacked.

Dennis Abts, Google

Natalie Enright Jerger, University of Toronto

John Kim, KAIST

Dan Gibson, Univ of Wisconsin

Mikko Lipasti, Univ of Wisconsin

Page 2: Diamonds are a Memory Controller’s Best Friend*

Executive Summary

• On what tiles should memory controllers reside?
  – Three-tiered simulation approach
    • Heuristic-guided search
    • Detailed network simulation
    • Full-system simulation
• Diamond MC placement works well for on-chip meshes and tori
  – Diamonds minimize maximum channel load
  – Diamonds deliver lower and more predictable runtimes

Page 3: Diamonds are a Memory Controller’s Best Friend*

Background

• Diverse on-chip communication
  – Cache-to-cache
  – LD/ST to Memory
  – Off-chip traffic (e.g., I/O)
• Processors/chip on the rise
  – Pins available for memory not rising as fast: Memory bandwidth becomes more precious
  – Reality: Many Cores, Few Memory Controllers
• Tiled architectures gaining popularity
  – Commonly employ on-chip meshes or tori

Page 4: Diamonds are a Memory Controller’s Best Friend*

The Problem

• What Memory Controller placement is best overall?
  – Flip-chip packaging allows flexible escape routes
  – n tiles and m ports:
    • Don’t worry, there are only $\binom{n}{m}$ configurations!
  – What are the characteristics of the best configuration?
    • Performance: Low runtime for a set of objective workloads
    • Throughput: Low latency as a function of offered load
    • Fairness: Similar (low) average memory latency across all nodes
    • Predictability: Low latency and runtime variance

Slight Simplification: Assume $n = k^2$ and $m = 2k$

Page 5: Diamonds are a Memory Controller’s Best Friend*

Baseline Placement: row0_7

• Ports to MCs located at top and bottom of chip
• Conceptually similar to real parts:
  – Tilera’s Tile64
    • 64 cores, 4 MCs (4 ports each, top/bottom of chip)
  – Intel TeraFLOPs
    • 80 cores, 2 MCs (8 ports each, top/bottom of chip)

[Figure callout: X-Dimension Traffic Encounters Congestion on Rows with Memory Controllers]

Page 6: Diamonds are a Memory Controller’s Best Friend*

Three-Tiered Approach

[Figure: the three tiers, Link Contention Simulation, Detailed Network Simulation, and Full System. Arrow labels: "More Runs, Shorter Runtimes" and "More Detail".]

Page 7: Diamonds are a Memory Controller’s Best Friend*

Tier 0.5: Exhaustive Search

• It turns out $\binom{k^2}{2k}$ is tractable for k<7
  – (At least on the link contention simulator: only 3,268,760 possibilities for k=5)

[Figure callouts: "Patterns Emerge!" and "Another Contender"]
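The size of the search space can be checked directly. This is a minimal sketch in Python, assuming the earlier simplification of n = k^2 tiles and m = 2k memory-controller ports; the function name `placements` is made up for illustration.

```python
from math import comb


def placements(k: int) -> int:
    """Number of ways to pick m = 2k memory-controller tiles out of n = k*k tiles."""
    return comb(k * k, 2 * k)


# Print the size of the search space for a few mesh sizes.
for k in range(4, 9):
    print(f"k={k}: {placements(k)} placements")
```

For k = 5 this gives the 3,268,760 possibilities quoted on the slide. The count grows very quickly with k, which is consistent with the switch to heuristic search for larger meshes.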

Page 8: Diamonds are a Memory Controller’s Best Friend*

Tier 1: Heuristic-Guided Search

• k>6: Intractable to search all configurations
  – Use search heuristics and random search
• Genetic Algorithm:
  – Represent designs as a population of strings (Bit Vectors)
  – Generate new designs by combining members of the population via genetic crossover (Bit Selection)
  – Occasionally, mutate new population members (Swap adjacent bits)
  – Reduce population size by removing least-fit members (Survival of the Fittest)
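The four steps above can be sketched as a small genetic loop. This is an illustration under stated assumptions, not the authors' implementation. It assumes bit i of a design marks tile (row i // k, column i % k) and adds a repair step after crossover so every design keeps exactly 2k memory controllers. The `cost` function is a stand-in that penalizes crowded rows and columns, not the link-contention fitness used in the talk. All function names and parameter values are hypothetical.

```python
import random


def popcount(x: int) -> int:
    return bin(x).count("1")


def cost(design: int, k: int) -> int:
    """Stand-in fitness (lower is better): penalize many MCs in one row or column."""
    rows, cols = [0] * k, [0] * k
    for i in range(k * k):
        if (design >> i) & 1:
            rows[i // k] += 1
            cols[i % k] += 1
    return sum(r * r for r in rows) + sum(c * c for c in cols)


def crossover(a: int, b: int, n_bits: int, m: int, rng: random.Random) -> int:
    """Bit selection: take each bit from parent a or b, then repair to m set bits."""
    mask = rng.getrandbits(n_bits)
    child = (a & mask) | (b & ~mask)
    ones = [i for i in range(n_bits) if (child >> i) & 1]
    zeros = [i for i in range(n_bits) if not (child >> i) & 1]
    while len(ones) > m:  # too many MCs: clear random set bits
        child &= ~(1 << ones.pop(rng.randrange(len(ones))))
    while len(ones) < m:  # too few MCs: set random clear bits
        i = zeros.pop(rng.randrange(len(zeros)))
        child |= 1 << i
        ones.append(i)
    return child


def mutate(design: int, n_bits: int, rng: random.Random) -> int:
    """Swap one randomly chosen pair of adjacent bits."""
    i = rng.randrange(n_bits - 1)
    if ((design >> i) & 1) != ((design >> (i + 1)) & 1):
        design ^= (1 << i) | (1 << (i + 1))
    return design


def evolve(k: int, generations: int = 100, pop_size: int = 30, seed: int = 0) -> int:
    rng = random.Random(seed)
    n_bits, m = k * k, 2 * k
    # Start from random designs with exactly m memory controllers each.
    pop = [sum(1 << i for i in rng.sample(range(n_bits), m)) for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            a, b = rng.sample(pop, 2)
            child = crossover(a, b, n_bits, m, rng)
            if rng.random() < 0.1:  # occasional mutation
                child = mutate(child, n_bits, rng)
            children.append(child)
        # Survival of the fittest: keep only the pop_size lowest-cost designs.
        pop = sorted(pop + children, key=lambda d: cost(d, k))[:pop_size]
    return pop[0]


if __name__ == "__main__":
    best = evolve(8)
    print(f"best design: {best:#018x}, cost: {cost(best, 8)}")
```

Swapping in a fitness based on maximum channel load would bring the sketch closer to the search described here; the stand-in cost only keeps the example short and self-contained.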

Page 9: Diamonds are a Memory Controller’s Best Friend*

Genetic MC Placement

[Figure: parent bit vectors 0x00AA550000AA5500 and 0x0000FF0000FF0000 are combined into 0x00AAF00000F25100, which the "Mutate" step turns into 0x00AAF00000F25080.]
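The mutation in this example can be reproduced with a single adjacent-bit swap. The sketch below is in Python; the helper name `swap_adjacent_bits` is hypothetical, and the slides do not state which bit index corresponds to which tile, so bits are simply counted from the least-significant end.

```python
def swap_adjacent_bits(placement: int, i: int) -> int:
    """Swap bit i and bit i + 1 of a placement bit vector (bit 0 is the LSB)."""
    lo = (placement >> i) & 1
    hi = (placement >> (i + 1)) & 1
    if lo != hi:
        # The bits differ, so flipping both is the same as swapping them.
        placement ^= (1 << i) | (1 << (i + 1))
    return placement


before = 0x00AAF00000F25100
after = swap_adjacent_bits(before, 7)  # swap bits 7 and 8
print(f"{before:#018x} -> {after:#018x}")
```

Because a swap only moves a set bit, the number of memory controllers in the design stays the same after mutation.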

Page 10: Diamonds are a Memory Controller’s Best Friend*

Link Contention Results, k=8

Config.    Max Channel Load
           Mesh     Torus
row0_7     13.5     9.25
X          8.93     7.72
Diamond    8.90     7.72

• GA selected Diamond as most fit solution for 8x8
  – Minimizes MCs in a single row/column
  – Spreads DOR load
• Sanity Check: GA also prefers Diamond for 4x4, 5x5, and 6x6
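A toy link-contention model shows why placement changes the maximum channel load. This Python sketch makes several assumptions that are not taken from the slides: a k x k mesh, every tile sending an equal share of requests to every memory controller, requests and replies both routed X-then-Y, every packet counted with the same weight, and `X` taken to mean the two main diagonals. The function and variable names are hypothetical, and the numbers it prints are not expected to match the table above.

```python
from collections import defaultdict
from itertools import product


def xy_route(src, dst):
    """Yield the directed links of an X-then-Y (dimension-order) path on a mesh."""
    x, y = src
    dx, dy = dst
    while x != dx:
        nx = x + (1 if dx > x else -1)
        yield (x, y), (nx, y)
        x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        yield (x, y), (x, ny)
        y = ny


def max_channel_load(k, mcs):
    """Accumulate request and reply traffic per link and return the busiest link's load."""
    load = defaultdict(float)
    share = 1.0 / len(mcs)  # each tile spreads its requests evenly over the MCs
    for tile in product(range(k), repeat=2):
        for mc in mcs:
            for link in xy_route(tile, mc):  # request: tile -> MC
                load[link] += share
            for link in xy_route(mc, tile):  # reply: MC -> tile
                load[link] += share
    return max(load.values())


k = 8
# row0_7: memory controllers on every tile of the top and bottom rows.
row0_7 = [(x, y) for y in (0, k - 1) for x in range(k)]
# Assumed shape of the "X" placement: both main diagonals.
x_diag = sorted({(i, i) for i in range(k)} | {(i, k - 1 - i) for i in range(k)})

print("row0_7:", max_channel_load(k, row0_7))
print("X     :", max_channel_load(k, x_diag))
```

In this model the busiest links for row0_7 are horizontal links in the rows that hold the memory controllers, while the diagonal placement spreads the same traffic over more rows and columns.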

Page 11: Diamonds are a Memory Controller’s Best Friend*

Network Simulation: Open-Loop Evaluation

• Detailed simulation of all network events (buffers, links, etc.)
• Cores are Bernoulli injection processes, uniform random traffic
• Measure latency vs. offered load

Parameters            Values
Router latency        1 cycle (aggressive)
Inter-router delay    1 cycle
Buffers               32 flits per port
Packet size           Request: 1 flit; Reply: 4 flits
Virtual channels      4 (XY-YX routing)

Page 12: Diamonds are a Memory Controller’s Best Friend*

Open-Loop Results

[Figure: latency (cycles) vs. offered load (flits/cycle) for the row0_7, row2_5, Diamond, and X placements.]

Page 13: Diamonds are a Memory Controller’s Best Friend*

Closed-Loop Evaluation

• Each processor executes N memory operations
• Up to r operations outstanding at a time
  – Models MSHRs
• Uniform Random requests, and real request streams with ‘hot spot’ behavior

Page 14: Diamonds are a Memory Controller’s Best Friend*

Closed-Loop Results

[Figure: number of processors vs. completion time for the Diamond and row0_7 placements; the completion-time axes span 3500 to 6500 and 8000 to 11000.]

Page 15: Diamonds are a Memory Controller’s Best Friend*

Full System Results

[Figure: average network latency (cycles) for request to memory controller vs. standard deviation, for row0_7 and Diamond on the JBB, WEB, TPC-W, TPC-H, and TPC-W+H workloads.]

Diamond placement yields lower latency and lower latency variance.

Page 16: Diamonds are a Memory Controller’s Best Friend*

Conclusion

• MC Placement Matters!
  – Diamond reduces contention, improves latency, and reduces latency/runtime variance
  – X does fairly well