Locality-Aware Data Replication in the Last-Level Cache
George Kurian¹, Srinivas Devadas¹, Omer Khan²
¹ Massachusetts Institute of Technology, ² University of Connecticut, Storrs

Upload: stacy-vasquez

Post on 03-Jan-2016


Page 1: Locality-Aware Data Replication in the Last-Level Cache

Locality-Aware Data Replication in the Last-Level Cache

George Kurian¹, Srinivas Devadas¹, Omer Khan²

¹ Massachusetts Institute of Technology
² University of Connecticut, Storrs

Page 2: Locality-Aware Data Replication in the Last-Level Cache

The Problem

• Future multicore processors will have 100s of cores

• LLC management key to optimizing performance and energy

• Last-level cache (LLC) data locality and off-chip miss rates often show opposing trends

• Goal: Intelligent replication at the LLC

# Network Hops = ⅔ * √N
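
The hop-count expression above can be checked numerically. The sketch below is not from the slides: it assumes N is the core count, that the cores sit on a square 2D mesh, that hops are Manhattan distances between tiles, and that source and destination are chosen uniformly at random. The function name `mean_hops` is made up for illustration.

```python
from itertools import product
from math import isqrt


def mean_hops(n_cores: int) -> float:
    """Mean Manhattan distance between two uniformly random tiles
    of a square 2D mesh (assumed topology, not stated on the slide)."""
    k = isqrt(n_cores)
    if k * k != n_cores:
        raise ValueError("this sketch only handles square meshes")
    tiles = list(product(range(k), repeat=2))
    total = sum(abs(ax - bx) + abs(ay - by)
                for ax, ay in tiles
                for bx, by in tiles)
    return total / len(tiles) ** 2


if __name__ == "__main__":
    for n in (16, 64, 256):
        print(n, mean_hops(n), (2 / 3) * n ** 0.5)
```

Under these assumptions a 64-core mesh averages 5.25 hops, against 5.33 from ⅔ × √N, so the slide's expression behaves as a close upper estimate that grows with the square root of the core count.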

Page 3: Locality-Aware Data Replication in the Last-Level Cache

LLC Replication Strategy

• Black block shows benefit with replication
  – E.g., frequently-read shared data
  – Core-1 and Core-2 allowed to create replicas

• Red block shows NO benefit with replication
  – E.g., frequently-written shared data

[Figure: tiled multicore. Each tile holds a core (compute pipeline, private L1-I and L1-D caches), an L2 cache (LLC slice) with directory, and a router. Home and Replica LLC locations are labeled for the example blocks.]

Page 4: Locality-Aware Data Replication in the Last-Level Cache

Outline

• Motivation
• Comparison to Previous Schemes
• Design & Implementation
• Evaluation
• Conclusion

Page 5: Locality-Aware Data Replication in the Last-Level Cache

Motivation
Reuse at the LLC

• Reuse: # accesses to a cache line by a core before eviction or a conflicting access by another core
  – Note: Private L1 cache hits are filtered out

[Figure: tiled multicore (per tile: core with compute pipeline and private L1-I/L1-D caches, L2 cache (LLC slice) with directory, router). Example: Core 3 makes 5 accesses to a line at its home LLC slice, then Core 4 writes it, so Reuse = 5.]
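
The reuse definition above can be sketched as a small trace-processing function. This is an illustration only: the event format, the name `reuse_runs`, and the simplification that only a write by another core counts as a conflicting access are assumptions, not details from the slides. Events are LLC-level accesses to one cache line, with L1 hits already filtered out.

```python
def reuse_runs(events):
    """Return (core, reuse) pairs for ONE cache line.

    events: iterable of (core, op), op in {"read", "write", "evict"}.
    A core's run ends when the line is evicted or when another core
    writes it (assumed meaning of "conflicting access").
    """
    runs = []
    counts = {}  # core -> accesses in its current run
    for core, op in events:
        if op == "evict":
            # Eviction closes every open run.
            runs.extend(counts.items())
            counts = {}
            continue
        if op == "write":
            # A write ends the runs of all OTHER cores.
            for other in [c for c in counts if c != core]:
                runs.append((other, counts.pop(other)))
        counts[core] = counts.get(core, 0) + 1
    # Runs still open when the trace ends are reported as-is.
    runs.extend(counts.items())
    return runs


if __name__ == "__main__":
    # The slide's example: Core 3 accesses the line 5 times, then Core 4 writes.
    trace = [(3, "read")] * 5 + [(4, "write")]
    print(reuse_runs(trace))
```

For the slide's example the first run reported is Core 3 with a reuse of 5, matching the figure.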

Page 6: Locality-Aware Data Replication in the Last-Level Cache

Motivation
Reuse Determines Replication Benefit

• Reuse: # accesses to a cache line by a core before eviction or a conflicting access by another core

• Higher the reuse, higher the efficacy of replication

Fig: LLC Access Count vs Reuse
[Per-benchmark breakdown of LLC access count (0% to 100%) by reuse bucket: [1-2], [3-9], [≥10]. Benchmarks: RADIX, FFT, LU-C, LU-NC, CHOLESKY, BARNES, OCEAN-C, OCEAN-NC, WATER-NSQ, RAYTRACE, VOLREND, BLACKSCH., SWAPTIONS, FLUIDANIM., STREAMCLUS., DEDUP, FERRET, BODYTRACK, FACESIM, PATRICIA, CONCOMP.]

Page 7: Locality-Aware Data Replication in the Last-Level Cache

Motivation (cont’d)
Reuse Determines Replication Benefit

• Reuse: # accesses to a cache line by a core before eviction or a conflicting access by another core

• Higher the reuse, higher the efficacy of replication

Fig: LLC Access Count vs Reuse
[Per-benchmark breakdown of LLC access count (0% to 100%) by reuse bucket: [1-2], [3-9], [≥10], for the 21 benchmarks RADIX through CONCOMP. Regions of the chart are annotated "Replicate" and "Don’t Replicate".]

Page 8: Locality-Aware Data Replication in the Last-Level Cache

Motivation (cont’d)
Reuse Independent of Cache Line Type

• Private data exhibits varying degrees of reuse

Fig: LLC Access Count vs Reuse
[Per-benchmark breakdown of LLC access count (0% to 100%) for the 21 benchmarks RADIX through CONCOMP. Legend: Private cache lines, split by reuse bucket (1-2, 3-9, ≥10).]

Page 9: Locality-Aware Data Replication in the Last-Level Cache

Motivation (cont’d)
Reuse Independent of Cache Line Type

• Instructions mostly exhibit high reuse

Fig: LLC Access Count vs Reuse
[Per-benchmark breakdown of LLC access count (0% to 100%) for the 21 benchmarks RADIX through CONCOMP. Legend: Private and Instruction cache lines, each split by reuse bucket (1-2, 3-9, ≥10).]

Page 10: Locality-Aware Data Replication in the Last-Level Cache

Motivation (cont’d)
Reuse Independent of Cache Line Type

• Shared read-only data exhibits varying degrees of reuse

Fig: LLC Access Count vs Reuse
[Per-benchmark breakdown of LLC access count (0% to 100%) for the 21 benchmarks RADIX through CONCOMP. Legend: Private, Instruction and Shared Read-Only cache lines, each split by reuse bucket (1-2, 3-9, ≥10).]

Page 11: Locality-Aware Data Replication in the Last-Level Cache

Motivation (cont’d)
Reuse Independent of Cache Line Type

• Shared read-write data exhibits varying degrees of reuse

Fig: LLC Access Count vs Reuse
[Per-benchmark breakdown of LLC access count (0% to 100%) for the 21 benchmarks RADIX through CONCOMP. Legend: Private, Instruction, Shared Read-Only and Shared Read-Write cache lines, each split by reuse bucket (1-2, 3-9, ≥10).]

Page 12: Locality-Aware Data Replication in the Last-Level Cache

Motivation (cont’d)
Reuse Independent of Cache Line Type

• Replication must be based on reuse and not cache line classification

Fig: LLC Access Count vs Reuse
[Per-benchmark breakdown of LLC access count (0% to 100%) for the 21 benchmarks RADIX through CONCOMP. Legend: Private, Instruction, Shared Read-Only and Shared Read-Write cache lines, each split by reuse bucket (1-2, 3-9, ≥10).]

Replicate based on reuse:
  – Instructions
  – Shared read-only data
  – Shared read-write data
  – (Even) private data

Page 13: Locality-Aware Data Replication in the Last-Level Cache

Locality-Aware Replication
Salient Features

• Locality-based: Based on reuse and not memory classification information
  – Replicate data with high reuse
  – Bypass replication mechanisms for low reuse data

• Cache-line Level: Reuse measured and replication decision made at cache-line level

• Dynamic: Reuse profiled at runtime using highly-accurate hardware counters

• Minimal Coherence Protocol Changes: Replication is done at the local LLC slice

• Fully Hardware: LLC replication techniques require no modification to operating system

Page 14: Locality-Aware Data Replication in the Last-Level Cache

Comparison to Previous Schemes

LLC Management Scheme                | Replication Candidates     | When to Replicate?
Static-NUCA (S-NUCA)                 | None                       | Never
Reactive-NUCA (R-NUCA)               | Instructions (per-cluster) | Every L1 cache miss (NO intelligence)
Victim Replication (VR)              | All                        | Every L1 cache eviction (NO intelligence)
Adaptive Selective Replication (ASR) | Shared read-only           | L1 cache eviction (adapts replication level)
Locality-Aware Replication           | All                        | L1 cache miss (detects high reuse)

Page 15: Locality-Aware Data Replication in the Last-Level Cache

Outline

• Motivation
• Comparison to Previous Schemes
• Design & Implementation
• Evaluation
• Conclusion

Page 16: Locality-Aware Data Replication in the Last-Level Cache

Baseline System

• Compute pipeline with private L1-I and L1-D caches
• Logically shared, physically distributed L2 cache (LLC) with integrated directory
• LLC managed using Reactive-NUCA [Hardavellas – ISCA09]
  – Local placement of private pages, shared pages are striped
• ACKwise limited-directory protocol [Kurian – PACT10]

[Figure: one tile with core (compute pipeline), L1 I-cache, L1 D-cache, L2 cache (LLC), directory and router; blocks labeled "M" appear at the edge of the tile array.]

Page 17: Locality-Aware Data Replication in the Last-Level Cache

Locality Tracking Intelligence
Replica Reuse Counter

• Replica Reuse: Tracks cache line usage by a core at the LLC replica

• Replica reuse counter is communicated back to directory on eviction or invalidation for classification

• NO additional network messages
• Storage overhead: 1 KB (0.4%)

[Figure: LLC cache line tag fields: State, LRU, Tag, Replica Reuse, Complete Locality List (1 .. n) holding Mode_1 … Mode_n and Home Reuse_1 … Home Reuse_n, ACKwise Pointers (1 … p).]

Page 18: Locality-Aware Data Replication in the Last-Level Cache

Locality Tracking Intelligence
Mode & Home Reuse Counters

• Mode_i: Can cache line be replicated at Core_i?

• Home Reuse_i: Tracks cache line usage by Core_i at home LLC slice

• Complete Locality Classifier: Tracks locality information for all cores and for all LLC cache lines

• Storage Overhead: 96 KB (30%)
  – We’ll fix this later

[Figure: LLC cache line tag fields: State, LRU, Tag, Replica Reuse, Complete Locality List (1 .. n) holding Mode_1 … Mode_n and Home Reuse_1 … Home Reuse_n, ACKwise Pointers (1 … p).]

Page 19: Locality-Aware Data Replication in the Last-Level Cache

Mode Transitions
Replication Intelligence

• Initially, no replica is created
• All requests are serviced at the LLC home
• Replication decision made based on previous cache line reuse behavior

[Figure: mode state diagram. Initial → No Replica.]

Page 20: Locality-Aware Data Replication in the Last-Level Cache

Mode Transitions

• Home-Reuse counter: Tracks the # accesses by a core at the LLC home location
• Replication decision made based on previous cache line reuse behavior

[Figure: mode state diagram. Initial → No Replica.]

Page 21: Locality-Aware Data Replication in the Last-Level Cache

Mode Transitions

• A replica is created if enough reuse is detected at the LLC home
• If (Home-Reuse >= Replication-Threshold)
    Promote to “Replica” mode
    Create Replica
• Higher Replication-Threshold: fewer replicas
• Lower Replication-Threshold: more replicas

[Figure: mode state diagram. Initial → No Replica; No Replica → Replica when Home Reuse >= RT. RT: Replication Threshold.]

Page 22: Locality-Aware Data Replication in the Last-Level Cache

Mode Transitions

• Replica-Reuse counter: Tracks the # accesses to the LLC at the replica location

[Figure: mode state diagram. Initial → No Replica; No Replica → Replica when Home Reuse >= RT. RT: Replication Threshold.]

Page 23: Locality-Aware Data Replication in the Last-Level Cache

Mode Transitions

• Eviction from LLC Replica Location
• Triggered by capacity limitations
• If (Replica-Reuse >= Replication-Threshold)
    Stay in “Replica” mode
  Else
    Demote to “No-Replica” mode

[Figure: mode state diagram. Initial → No Replica; No Replica → Replica when Home Reuse >= RT; Replica stays Replica when Replica Reuse >= RT; Replica → No Replica when Replica Reuse < RT. RT: Replication Threshold.]

Page 24: Locality-Aware Data Replication in the Last-Level Cache

Mode Transitions

• Invalidation at LLC Replica Location
• Triggered by a conflicting write
• If ([Replica + Home] Reuse >= Replication-Threshold)
    Stay in “Replica” mode
  Else
    Demote to “No-Replica” mode

[Figure: mode state diagram. Initial → No Replica; No Replica → Replica when Home Reuse >= RT; Replica stays Replica when (Replica + Home) Reuse >= RT; Replica → No Replica when (Replica + Home) Reuse < RT. RT: Replication Threshold.]

Page 25: Locality-Aware Data Replication in the Last-Level Cache

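Mode Transitions

• Conflicting-Write from another core: Reset Home-Reuse counter to ‘0’

[Figure: mode state diagram. Initial → No Replica; No Replica stays No Replica while Home Reuse < RT; No Replica → Replica when Home Reuse >= RT; Replica stays Replica when XReuse >= RT; Replica → No Replica when XReuse < RT. RT: Replication Threshold.]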

Page 26: Locality-Aware Data Replication in the Last-Level Cache

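Mode Transitions Summary

• Replication decision made based on previous cache line reuse behavior

[Figure: mode state diagram. Initial → No Replica; No Replica stays No Replica while Home Reuse < RT; No Replica → Replica when Home Reuse >= RT; Replica stays Replica when XReuse >= RT; Replica → No Replica when XReuse < RT. RT: Replication Threshold. XReuse is Replica Reuse on an eviction and (Replica + Home) Reuse on an invalidation.]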
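
The mode transitions summarized above can be sketched as a small state machine for one (cache line, core) pair. This is an illustrative model under stated assumptions, not the authors' hardware design. The class name `LocalityTracker` and the `has_replica` flag are hypothetical. Resetting the counters on a replica eviction, starting the replica counter at 0 when a replica is created, and re-creating the replica on the next home access while in Replica mode are all assumptions; the slides only state that a conflicting write resets the Home-Reuse counter.

```python
from dataclasses import dataclass

NO_REPLICA = "no-replica"
REPLICA = "replica"


@dataclass
class LocalityTracker:
    """Replication state for one (cache line, core) pair; a sketch only."""
    rt: int = 3                 # Replication Threshold (RT)
    mode: str = NO_REPLICA      # initially, no replica is created
    has_replica: bool = False   # hypothetical: replica present in the local LLC slice
    home_reuse: int = 0         # accesses serviced at the LLC home
    replica_reuse: int = 0      # accesses serviced at the LLC replica

    def access(self) -> str:
        """An L1 miss by this core; returns the LLC location that serves it."""
        if self.has_replica:
            self.replica_reuse += 1
            return "replica"
        self.home_reuse += 1
        if self.mode == NO_REPLICA and self.home_reuse >= self.rt:
            self.mode = REPLICA            # promote to Replica mode
        if self.mode == REPLICA:
            self.has_replica = True        # replica created with the home reply
            self.replica_reuse = 0         # assumption: counter starts at 0
        return "home"

    def evict_replica(self) -> None:
        """Capacity eviction at the replica: judge on Replica Reuse alone."""
        if not self.has_replica:
            return
        if self.replica_reuse < self.rt:
            self.mode = NO_REPLICA         # demote
        self._drop_replica()

    def conflicting_write(self) -> None:
        """Write by another core: judge on (Replica + Home) Reuse."""
        if self.has_replica and self.replica_reuse + self.home_reuse < self.rt:
            self.mode = NO_REPLICA         # demote
        self._drop_replica()

    def _drop_replica(self) -> None:
        self.has_replica = False
        self.home_reuse = 0                # from the slides for writes; assumed for evictions
        self.replica_reuse = 0
```

With RT = 3, three accesses are served at the home, the third promotes the line to Replica mode, and later accesses are served by the replica. A conflicting write right after promotion keeps the line in Replica mode because the combined reuse is at least RT, while a write that arrives after a single further access demotes it.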

Page 27: Locality-Aware Data Replication in the Last-Level Cache

Locality Tracking Intelligence
Limited-k Locality Classifier

• Complete Locality Classifier: Prohibitive storage overhead (30%)

• Limited Locality Classifier (k): Mode and Home Reuse information tracked for only k cores

• Modes of other cores obtained by majority voting
• Smaller k -> Lower overhead
• Inactive cores replaced in locality list based on access pattern to accommodate new sharers

[Figure: LLC cache line tag fields: State, LRU, Tag, Replica Reuse, Limited Locality List (1 .. k) holding Core ID_1 … Core ID_k, Mode_1 … Mode_k and Home Reuse_1 … Home Reuse_k, ACKwise Pointers (1 … p).]
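
The majority-voting rule for untracked cores can be sketched as below. The function name `mode_for`, the dictionary representation of the locality list, and the tie-break toward "no-replica" are assumptions made for illustration; the slides do not specify how ties or an empty list are handled, and the replacement of inactive cores is not modeled here.

```python
from collections import Counter


def mode_for(core_id, tracked):
    """Mode to use for core_id given a limited locality list.

    tracked: dict mapping at most k core IDs to "replica" or "no-replica".
    Tracked cores use their own mode; other cores take the majority
    mode of the tracked cores (ties and empty lists fall back to
    "no-replica", which is an assumption).
    """
    if core_id in tracked:
        return tracked[core_id]
    votes = Counter(tracked.values())
    return "replica" if votes["replica"] > votes["no-replica"] else "no-replica"


if __name__ == "__main__":
    limited3 = {3: "replica", 7: "replica", 12: "no-replica"}
    print(mode_for(12, limited3))  # tracked core: its own mode
    print(mode_for(40, limited3))  # untracked core: majority vote
```

The appeal of this rule is that sharers of the same line often behave alike, so a handful of tracked cores can stand in for the rest at a fraction of the storage.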

Page 28: Locality-Aware Data Replication in the Last-Level Cache

Limited-3 Locality Classifier

• Limited-3 classifier approximates performance & energy of Complete classifier

• Mode and Home Reuse tracked for 3 sharers

Bit overhead per core (256KB L2, 32KB L1-D, 16KB L1-I):

Classifier | Bit Overhead
Complete   | 96 KB (30%)
Limited-3  | 13.5 KB (4.5%)

Metric          | Limited-3 vs Complete
Completion Time | 0.6% higher
Energy          | 1.0% higher
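
The kilobyte figures quoted in the slides (1 KB for the replica reuse counter, 96 KB for the Complete classifier, 13.5 KB for Limited-3) can be reproduced with a short back-of-the-envelope calculation. The field widths below are assumptions chosen for illustration and are not stated in the slides: 64-byte cache lines, a 1-bit mode, 2-bit reuse counters, and 6-bit core IDs for 64 cores. Only the 256 KB L2 slice and the 64-core configuration come from the slides. The percentage figures are not derived here.

```python
L2_BYTES = 256 * 1024   # per-core LLC slice size (from the slides)
N_CORES = 64            # simulated core count (from the slides)
LINE_BYTES = 64         # assumption: cache line size
MODE_BITS = 1           # assumption: replica / no-replica
REUSE_BITS = 2          # assumption: enough to count up to RT = 3
CORE_ID_BITS = 6        # assumption: log2(64) bits per tracked core ID

LINES = L2_BYTES // LINE_BYTES  # cache lines per LLC slice


def overhead_kb(bits_per_line: int) -> float:
    """Storage in KB when every LLC line carries bits_per_line extra bits."""
    return bits_per_line * LINES / 8 / 1024


replica_reuse_kb = overhead_kb(REUSE_BITS)
complete_kb = overhead_kb(N_CORES * (MODE_BITS + REUSE_BITS))
limited3_kb = overhead_kb(3 * (CORE_ID_BITS + MODE_BITS + REUSE_BITS))

if __name__ == "__main__":
    print(replica_reuse_kb, complete_kb, limited3_kb)
```

Under these assumed widths the three results come out to 1.0, 96.0 and 13.5 KB, which is consistent with the slides and shows why tracking only k = 3 cores per line shrinks the classifier so sharply.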

Page 29: Locality-Aware Data Replication in the Last-Level Cache

Outline

• Motivation
• Comparison to Previous Schemes
• Design & Implementation
• Evaluation
• Conclusion

Page 30: Locality-Aware Data Replication in the Last-Level Cache

Evaluation Methodology

• Evaluations done using
  – Graphite simulator for 64 cores
  – McPAT/CACTI cache energy models and DSENT network energy models at 11 nm
• Evaluated 21 benchmarks from the SPLASH-2 (11), PARSEC (8), Parallel MI-bench (1) and UHPC (1) suites
• LLC management schemes compared:
  – Static-NUCA (S-NUCA)
  – Reactive-NUCA (R-NUCA)
  – Victim Replication (VR)
  – Adaptive Selective Replication (ASR) [modified]
  – Locality-Aware Replication (RT-1, RT-3, RT-8)

Page 31: Locality-Aware Data Replication in the Last-Level Cache

Replicate Shared Read-Write Data
LLC Accesses: BARNES

• Most LLC accesses are reads to widely-shared high-reuse shared read-write data

• Important to replicate shared read-write data

[Figure: LLC access count (0 to 40,000,000) vs number of sharers (1 to 64), broken down by cache line type (Private, Instruction, Shared Read-Only, Shared Read-Write) and reuse bucket (1-2, 3-9, ≥10).]

Page 32: Locality-Aware Data Replication in the Last-Level Cache

Replicate Shared Read-Write Data
Energy Results: BARNES

• Locality-aware protocol reduces network router & link energy by replicating shared read-write data locally

• Victim replication (VR) obtains limited energy benefits
  – (Almost) blind replica creation scheme
  – Simplistic LLC replacement policy
  – Removing and re-inserting replicas on L1 misses & evictions

• Adaptive selective replication (ASR) and Reactive-NUCA do not replicate shared read-write data

[Figure: Energy (normalized, 0 to 1.2) for S-NUCA, R-NUCA, VR, ASR, RT-1, RT-3, RT-8. Components: DRAM, Network Link, Network Router, Directory, L2 Cache (LLC), L1-D Cache, L1-I Cache.]

Page 33: Locality-Aware Data Replication in the Last-Level Cache

Replicate Shared Read-Write Data
Completion Time Results: BARNES

• Locality-aware protocol reduces communication time with the LLC home (L1-To-LLC-Home)

[Figure: Completion Time (normalized, 0 to 1.2) for S-NUCA, R-NUCA, VR, ASR, RT-1, RT-3, RT-8. Components: Synchronization, LLC-Home-To-OffChip, LLC-Home-To-Sharers, LLC-Home-Waiting, L1-To-LLC-Home, L1-To-LLC-Replica, Compute.]

Page 34: Locality-Aware Data Replication in the Last-Level Cache

Replicate Private Cache Lines
Page vs Cache Line Classification: BLACKSCHOLES

• Page-level classification incurs false positives
  – Multiple cores work privately on cache lines in the same page
  – Page classified shared read-only instead of private

• Page-level data placement not optimal
  – Reactive-NUCA cannot localize most LLC accesses

• Replicate private data to localize all LLC accesses

[Figure: LLC access count (0% to 100%) by cache line type (Private, Instruction, Shared Read-Only, Shared Read-Write) under page-level vs cache-line-level classification.]
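
The false-positive effect of page-level classification can be illustrated with a toy classifier run at two granularities. This sketch is not from the slides: the function name `classify`, the access-tuple format, and the 64-byte line and 4 KB page sizes are assumptions, and instruction accesses are left out for brevity.

```python
from collections import defaultdict

LINE_BYTES = 64     # assumption: cache line size
PAGE_BYTES = 4096   # assumption: page size


def classify(accesses, granule):
    """Classify each granule-sized unit touched by the accesses.

    accesses: iterable of (core, address, is_write).
    Returns {unit index: "private" | "shared read-only" | "shared read-write"}.
    """
    sharers = defaultdict(set)
    written = defaultdict(bool)
    for core, addr, is_write in accesses:
        unit = addr // granule
        sharers[unit].add(core)
        written[unit] |= is_write
    result = {}
    for unit, cores in sharers.items():
        if len(cores) == 1:
            result[unit] = "private"
        elif written[unit]:
            result[unit] = "shared read-write"
        else:
            result[unit] = "shared read-only"
    return result


if __name__ == "__main__":
    # Two cores read different cache lines that sit in the same page.
    trace = [(0, 0x1000, False), (0, 0x1000, False), (1, 0x1040, False)]
    print(classify(trace, PAGE_BYTES))
    print(classify(trace, LINE_BYTES))
```

In this trace the page is classified shared read-only while both cache lines are private, which is the kind of mismatch the slide describes for BLACKSCHOLES.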

Page 35: Locality-Aware Data Replication in the Last-Level Cache

Replicate Private Cache Lines
Energy Results: BLACKSCHOLES

• Locality-aware protocol reduces network energy through replication of private cache lines

• ASR replicates just shared read-only cache lines
• VR obtains limited improvements in energy
  – Still restricted by replication mechanisms

[Figure: Energy (normalized, 0 to 1.2) for S-NUCA, R-NUCA, VR, ASR, RT-1, RT-3, RT-8. Components: DRAM, Network Link, Network Router, Directory, L2 Cache, L1-D Cache, L1-I Cache.]

Page 36: Locality-Aware Data Replication in the Last-Level Cache

Replicate All Classes of Cache Lines
LLC Accesses: BODYTRACK

• Most LLC accesses are reads to widely-shared high-reuse instructions, shared read-only and shared read-write data

• Best replication policy should optimize handling of all 3 classes of cache lines

[Figure: LLC access count (0 to 120,000,000) vs number of sharers (1 to 64), broken down by cache line type (Private, Instruction, Shared Read-Only, Shared Read-Write) and reuse bucket (1-2, 3-9, ≥10).]

Page 37: Locality-Aware Data Replication in the Last-Level Cache

Replicate All Classes of Cache Lines
Energy Results: BODYTRACK

• R-NUCA replicates instructions, hence obtains small network energy improvements

• ASR replicates instructions and shared read-only data and obtains larger energy improvements

• The locality-aware protocol replicates shared read-write data as well

[Figure: Energy (normalized, 0 to 1.2) for S-NUCA, R-NUCA, VR, ASR, RT-1, RT-3, RT-8. Components: DRAM, Network Link, Network Router, Directory, L2 Cache, L1-D Cache, L1-I Cache.]

Page 38: Locality-Aware Data Replication in the Last-Level Cache

Use Optimal Replication Threshold
Energy Results: STREAMCLUSTER

• Perform intelligent replication
• RT-1 performs badly due to LLC pollution
• RT-8 identifies fewer replicas and is slow to identify useful ones
• RT-3 identifies more replicas, and faster, while not creating LLC pollution
• Use optimal replication threshold of 3

[Figure: Energy (normalized, 0 to 1.2) for S-NUCA, R-NUCA, VR, ASR, RT-1, RT-3, RT-8. Components: DRAM, Network Link, Network Router, Directory, L2 Cache, L1-D Cache, L1-I Cache.]

Page 39: Locality-Aware Data Replication in the Last-Level Cache

Results Summary

• We choose a static Replication threshold (RT) of 3
• Energy improved by 13-21%
• Completion Time improved by 4-13%

[Figure: two panels, Energy and Completion Time (each normalized, 0 to 1), comparing S-NUCA, R-NUCA, VR, ASR, RT-1, RT-3, RT-8.]

Page 40: Locality-Aware Data Replication in the Last-Level Cache

Conclusion

• Locality-aware instruction and data replication in the last-level cache (LLC)

• Spatio-temporal locality profiled dynamically at the cache line level using low-overhead yet highly accurate hardware counters

• Enables replication only for lines with high reuse
• Requires minimal changes to the baseline cache coherence protocol since replicas are placed locally
• Significant energy and performance improvements against state-of-the-art replication mechanisms

Page 41: Locality-Aware Data Replication in the Last-Level Cache

See The Paper For …

• Exhaustive benchmark case studies
  – Apps with migratory shared data
  – Apps with NO benefit from replication

• Limited locality classifier study
  – Sensitivity to number of tracked cores (k)

• Cluster-level locality-aware LLC replication study
  – Sensitivity to cluster size (C)

Page 42: Locality-Aware Data Replication in the Last-Level Cache

Thank You!
Questions?
