Light NUCA: a proposal for bridging the inter-cache latency gap Darío Suárez 1 , Teresa Monreal 1 , Fernando Vallejo 2 , Ramón Beivide 2 , and Victor Viñals 1 1 Universidad de Zaragoza and 2 Universidad de Cantabria Spain


Light NUCA: a proposal for bridging the inter-cache latency gap

Darío Suárez 1, Teresa Monreal 1, Fernando Vallejo 2, Ramón Beivide 2, and Victor Viñals 1

1 Universidad de Zaragoza and 2 Universidad de Cantabria

Spain


Load-to-Use cache latency trend


Facing the inter-cache latency gap

• Reconfigurable L1/L2 (Balasubramonian et al., MICRO’00)
  – single-ported memory cells, hence low bandwidth
• NUCA (Kim et al., ASPLOS’02)
  – wire-delay dominated large caches
  – routing-cache-routing network overhead
• L-NUCA: L1 + small cache tiles + specialized networks
  – low latency
  – high bandwidth
  – large associativity


Summary

• Motivation

• Introduction to L-NUCAs

• Networking

–Topologies

–Routing

–Messages

• Global Miss Determination

• Single cycle cache look-up plus one-hop routing

• Experimental Platform

• Results

• Conclusions


L-NUCA introduction

Latency  Tiles  Size (KB)
1        1      8
3        4      32
4        6      48
5        9      72
6        13     104
7        15     120
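The Size column is consistent with every tile holding 8 KB. That tile size is an assumption here: it is inferred from the table itself and from the 8KB cache mentioned on the headerless-messages slide, and is not stated on this slide. A minimal Python check of the arithmetic, with `TILE_KB` and `capacity_kb` as illustrative names:

```python
# Assumed tile size in KB; inferred from the table, not stated on this slide.
TILE_KB = 8

# (latency, number of tiles) pairs exactly as listed in the table.
LEVELS = [(1, 1), (3, 4), (4, 6), (5, 9), (6, 13), (7, 15)]


def capacity_kb(tiles, tile_kb=TILE_KB):
    """Capacity in KB of the given number of equally sized tiles."""
    return tiles * tile_kb


# Rebuild the three-column table from the first two columns.
TABLE = [(lat, tiles, capacity_kb(tiles)) for lat, tiles in LEVELS]

for lat, tiles, size in TABLE:
    print(f"latency {lat}: {tiles:2d} tiles, {size:3d} KB")
```

Under that assumption the computed sizes match the Size (KB) column row by row.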


Topologies and Routing

Search
• Broadcast Tree
• No flow control

Transport
• 2D mesh
• On/Off flow control
• Dynamic Distributed Routing (DDR)

Replacement
• Blocks ordered by temporal locality
• Latency-driven topology
• On/Off flow control
• DDR

• The three operations are independent, which ensures deadlock avoidance


Headerless messages

Operation    Message content            Source      Destination              Width (bits or wires)
Search       @ + MSHR + st data + ctrl  r-tile      rest tiles               41 + 4 + 64 + 2 = 111
Transport    block + MSHR               tile (hit)  r-tile                   256 + 4 = 260
Replacement  block + @                  tile i      tile k, lat(k)=lat(i)+1  256 + 41 = 297

Assuming 32-byte blocks

• no header overhead
• message = packet = flit = phit
• Implicit destination
• More than 1k m4/m5 wires fit in one side of an 8KB cache (Intel 32nm, Natarajan et al., IEDM’08)

[Figure: one side of an 8KB cache, > 1000 wires; worst case: 668]
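The widths in the table can be recomputed from the individual fields. The sketch below does so in Python. The constant names are illustrative, and reading the worst case of 668 as one link of each network sharing the same tile side is an assumption that happens to match the sum of the three widths.

```python
# Field widths in bits, taken from the table (which assumes 32-byte blocks).
ADDRESS = 41        # "@"
MSHR = 4            # MSHR field
STORE_DATA = 64     # "st data"
CTRL = 2            # "ctrl"
BLOCK = 32 * 8      # 32-byte block = 256 bits

WIDTHS = {
    "search": ADDRESS + MSHR + STORE_DATA + CTRL,
    "transport": BLOCK + MSHR,
    "replacement": BLOCK + ADDRESS,
}

# Assumed interpretation of "worst case": one link of each network on one side.
WORST_CASE = sum(WIDTHS.values())

for name, width in WIDTHS.items():
    print(f"{name}: {width} wires")
print(f"worst case on one side: {WORST_CASE} wires")
```

With these numbers the total stays below the more than 1k wires quoted for one side of an 8KB cache.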


Global Miss Determination Logic

• Tiles stop miss propagation on hits
• L-NUCA miss iff all last-level tiles miss
• Scalable hierarchical organization, taken from SRAM bitlines (Yang and Kim, JSSC’05)
• Determined one cycle after the last-level look-up
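The global miss condition can be illustrated as a hierarchical reduction over the miss flags of the last-level tiles. This is only a behavioural sketch in Python, not the circuit from the slides: the function name `global_miss` and the `fan_in` parameter are hypothetical, and a tile that hit (or that never saw the search because an earlier tile stopped it) is modelled simply as a `False` flag.

```python
def global_miss(last_level_misses, fan_in=4):
    """Return True iff every last-level tile reports a miss.

    The flags are combined group by group (fan_in flags per group), which
    loosely mirrors a hierarchical organization instead of one wide AND.
    """
    level = list(last_level_misses)
    if not level:
        raise ValueError("need at least one last-level tile")
    while len(level) > 1:
        # Each group is a miss only if all of its members are misses.
        level = [all(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return level[0]


print(global_miss([True] * 6))                 # all last-level tiles miss
print(global_miss([True, False, True, True]))  # one tile did not miss
```

The result does not depend on `fan_in`; the parameter only changes how many reduction levels are needed.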


Single-cycle tiles

• Three networks
• Headerless messages
• No DC, RT, and VA
• No virtual channels
• Avoidance of multiple routing stages
• Parallel data array access and switch allocation
• XBar: 3 inputs, 2 outputs
• low latency


Summary

• Motivation

• Introduction to L-NUCAs

• Experimental Platform

• Results

• Conclusions


Simulator

• Enhanced SimpleScalar 3.0d (Alpha) with:
  – Cycle-accurate memory and network models
• 4-issue processor:
  – Speculative wake-up and selective recovery (similar to the Intel Pentium 4)
  – 128-entry ROB
  – 64-entry LSQ
  – Load-to-Use L1 miss penalty: 4 + cache latency
• Memory system:
  – L1/RT: 32KB-4Way-32B (lat. 2 / init. rate 1) (2 ports)
  – L3: 8MB-16Way-128B (lat. 20 / init. rate 15)
  – 16-entry L1/RT MSHR
• 32 nm technology and 19 FO4 cycle time


Workload and Delay, Power, and Area Models

• Workloads
  – All but one SPEC CPU 2006 benchmarks (unable to run 483.xalancbmk on Alpha)
• Delay, Power, and Area modelling
  – Cacti 5.3 and improved Orion for caches and routers


Summary

• Motivation

• Introduction to L-NUCAs

• Experimental Platform

• Results

• 3-level conventional cache vs. L-NUCA and L3

• D-NUCA vs. L-NUCA and D-NUCA

• Conclusions


Tested Scenarios

•3-level conventional cache vs. L-NUCA and L3

•D-NUCA vs. L-NUCA and D-NUCA


Average IPC, 3-level vs. L-NUCA

[Figure: average IPC, annotated with gains of +6.1% and +15%]


Hierarchy energy, 3-level vs. L-NUCA

[Figure: hierarchy energy, annotated with -14.2%]


IPC and Area Comparison

[Figure: IPC and area for the configurations L2-256KB, L2-512KB, LN2-72KB, LN3-144KB, and LN4-248KB. Area labels in the figure: 0.91 mm2, 1.29 mm2, 0.86 mm2, 1.59 mm2, 0.46 mm2]

• small L-NUCA network overhead (14 to 19 %)
• The low density of L-NUCAs discourages the use of large sizes


Tested Scenarios

•3-level conventional cache vs. L-NUCA and L3

•D-NUCA vs. L-NUCA and D-NUCA


Average IPC, L-NUCA with D-NUCA

[Figure: average IPC, annotated with gains of +4.2% and +6.8%]


Hierarchy Energy, L-NUCA with D-NUCA

[Figure: hierarchy energy, annotated with 4.25%]


L-NUCA load-to-use latency

Configuration  IPC
L2-256KB       1.46
LN3-144KB      1.66

In 10 benchmarks, Le2 captures more than 75% of L2 read hits
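The two IPC values listed on this slide imply a relative gain that can be computed directly. A minimal Python sketch of that arithmetic follows; `relative_gain` is an illustrative helper, and the result applies only to this pair of configurations.

```python
def relative_gain(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


# IPC values as listed on the slide.
IPC = {"L2-256KB": 1.46, "LN3-144KB": 1.66}

GAIN = relative_gain(IPC["LN3-144KB"], IPC["L2-256KB"])
print(f"LN3-144KB over L2-256KB: {GAIN:.1f}% higher IPC")
```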


Summary

• Motivation

• Introduction to L-NUCAs

• Experimental Platform

• Results

• Conclusions


Conclusions & Future Work

• L-NUCAs leverage the advantages of NoChips for NoCaches (low latency and high bandwidth) and reduce the inter-cache latency gap
• Design based on 3 specialized networks conveying headerless messages
• Performance and energy gains with conventional and D-NUCA LLCs
• Future Work:
  – Integrate L-NUCAs in CMP and SMT environments
  – Study the effect of prefetching for increasing spatial locality




L-NUCA summary


Out-of-Order processor pipeline



Tile internals

Search, Transport, Replacement

• MA: Miss Address Register (Search)
• U bf: Upstream buffer (Replacement)
• D bf: Downstream buffer (Transport)

Every D and U buffer has 2 entries (2-cycle round-trip delay)
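To make the role of the 2-entry buffers concrete, here is a small Python sketch of On/Off flow control over a single buffer. It is a simplified behavioural model, not the tile design from the slides. The names `OnOffBuffer` and `simulate` and the `drain_every` parameter are hypothetical, and the On/Off signal is treated as visible to the sender in the same cycle, so the 2-cycle round trip is not modelled.

```python
from collections import deque


class OnOffBuffer:
    """Fixed-capacity FIFO that exposes an On/Off signal to its sender."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.entries = deque()

    @property
    def on(self):
        # "On" while at least one entry is free, "Off" when full.
        return len(self.entries) < self.capacity

    def push(self, msg):
        if not self.on:
            raise RuntimeError("sender ignored the Off signal")
        self.entries.append(msg)

    def pop(self):
        return self.entries.popleft() if self.entries else None


def simulate(messages, drain_every=2, capacity=2):
    """Send messages through one buffer drained every `drain_every` cycles."""
    buf = OnOffBuffer(capacity)
    pending = deque(messages)
    delivered, stalls, cycle = [], 0, 0
    while pending or buf.entries:
        # Receiver side: remove at most one message on a drain cycle.
        if cycle % drain_every == 0:
            msg = buf.pop()
            if msg is not None:
                delivered.append(msg)
        # Sender side: inject only while the signal is On, otherwise stall.
        if pending:
            if buf.on:
                buf.push(pending.popleft())
            else:
                stalls += 1
        cycle += 1
    return delivered, stalls


delivered, stalls = simulate("abcd", drain_every=2)
print(delivered, stalls)
```

In this model messages are never dropped and keep their order; a slow receiver only shows up as sender stalls.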