Locality-Aware Data Replication in the Last-Level Cache
George Kurian1, Srinivas Devadas1, Omer Khan2,
1 Massachusetts Institute of Technology2 University of Connecticut, Storrs
The Problem
• Future multicore processors will have 100s of cores
• LLC management is key to optimizing performance and energy
• Last-level cache (LLC) data locality and off-chip miss rates often show opposing trends
• Goal: Intelligent replication at the LLC
• Average # network hops to a remote LLC slice = ⅔ · √N (for an N-core mesh network)
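As a sanity check on the ⅔·√N estimate, a short Python brute force (the function name `avg_mesh_hops` is mine, not from the slides) over all tile pairs of a square mesh:

```python
import itertools

def avg_mesh_hops(n_cores):
    """Mean Manhattan distance between two uniformly random tiles
    on a sqrt(N) x sqrt(N) mesh, computed by brute force."""
    k = int(n_cores ** 0.5)
    assert k * k == n_cores, "assumes a square mesh"
    tiles = list(itertools.product(range(k), range(k)))
    total = sum(abs(x1 - x2) + abs(y1 - y2)
                for (x1, y1) in tiles for (x2, y2) in tiles)
    return total / (n_cores * n_cores)

# For a 64-core (8x8) mesh the exact mean is 2*(64-1)/(3*8) = 5.25,
# close to the (2/3)*sqrt(64) = 5.33 approximation.
print(avg_mesh_hops(64))
```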
LLC Replication Strategy
• Black block shows benefit with replication
– E.g., frequently-read shared data
– Core 1 and Core 2 are allowed to create replicas
• Red block shows NO benefit with replication
– E.g., frequently-written shared data
Fig: Tiled multicore; each core has a compute pipeline, private L1 I/D caches, an L2 (LLC) slice with an integrated directory, and a router. Replicas of a line are created at the requesting cores' local LLC slices in addition to the line's home slice.
Outline
• Motivation
• Comparison to Previous Schemes
• Design & Implementation
• Evaluation
• Conclusion
Motivation: Reuse at the LLC
• Reuse: # accesses to a cache line by a core before eviction or a conflicting access by another core
– Note: private L1 cache hits are filtered out
Fig: Tiled multicore example; Core 3 makes 5 accesses to a line at its home LLC slice before a conflicting write by Core 4, so Reuse = 5.
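The reuse definition above can be sketched as a small tracker (class and method names are illustrative, not from the paper). A conflicting write by another core ends each other core's reuse run for that line:

```python
from collections import defaultdict

class ReuseTracker:
    """Counts per-(line, core) LLC accesses; a conflicting write by
    another core ends the other cores' reuse runs for that line."""
    def __init__(self):
        self.counts = defaultdict(int)   # (line, core) -> run length

    def access(self, line, core, is_write=False):
        ended = {}
        if is_write:
            # Conflicting write: collect and end every other core's run.
            ended = {c: n for (l, c), n in self.counts.items()
                     if l == line and c != core}
            for c in ended:
                del self.counts[(line, c)]
        self.counts[(line, core)] += 1
        return ended   # {core: reuse} for runs ended by this access

tr = ReuseTracker()
for _ in range(5):
    tr.access(0x40, core=3)                     # Core 3 reads the line 5 times
print(tr.access(0x40, core=4, is_write=True))   # -> {3: 5}
```

Evictions (not modeled here) would end a run the same way.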
Motivation: Reuse Determines Replication Benefit
• Reuse: # accesses to a cache line by a core before eviction or a conflicting access by another core
• The higher the reuse, the higher the efficacy of replication
Fig: LLC access count vs reuse (buckets [1-2], [3-9], [≥10]) across the 21 evaluated benchmarks (RADIX through CONCOMP).
Motivation (cont’d): Reuse Determines Replication Benefit
• Reuse: # accesses to a cache line by a core before eviction or a conflicting access by another core
• The higher the reuse, the higher the efficacy of replication
Fig: LLC access count vs reuse, annotated with the policy: replicate lines in the higher-reuse buckets ([3-9], [≥10]), don’t replicate lines with reuse [1-2].
Motivation (cont’d): Reuse Independent of Cache Line Type
• Private data exhibits varying degrees of reuse
Fig: LLC access count vs reuse for private data across the 21 benchmarks.
Motivation (cont’d): Reuse Independent of Cache Line Type
• Instructions mostly exhibit high reuse
Fig: LLC access count vs reuse for private data and instructions.
Motivation (cont’d): Reuse Independent of Cache Line Type
• Shared read-only data exhibits varying degrees of reuse
Fig: LLC access count vs reuse for private data, instructions, and shared read-only data.
Motivation (cont’d): Reuse Independent of Cache Line Type
• Shared read-write data exhibits varying degrees of reuse
Fig: LLC access count vs reuse for private data, instructions, shared read-only, and shared read-write data.
Motivation (cont’d): Reuse Independent of Cache Line Type
• Replication must be based on reuse, not on cache line classification
Fig: LLC access count vs reuse for all four line classes. Replicating based on reuse covers instructions, shared read-only data, shared read-write data, and even private data.
Locality-Aware Replication: Salient Features
• Locality-based: decisions use reuse, not memory classification information
– Replicate data with high reuse
– Bypass replication mechanisms for low-reuse data
• Cache-line level: reuse is measured and the replication decision is made per cache line
• Dynamic: reuse is profiled at runtime using highly accurate hardware counters
• Minimal coherence protocol changes: replication is done at the local LLC slice
• Fully hardware: the LLC replication techniques require no modification to the operating system
Comparison to Previous Schemes

| LLC Management Scheme                | Replication Candidates     | When to Replicate?                           |
| Static-NUCA (S-NUCA)                 | None                       | Never                                        |
| Reactive-NUCA (R-NUCA)               | Instructions (per-cluster) | Every L1 cache miss (no intelligence)        |
| Victim Replication (VR)              | All                        | Every L1 cache eviction (no intelligence)    |
| Adaptive Selective Replication (ASR) | Shared read-only           | L1 cache eviction (adapts replication level) |
| Locality-Aware Replication           | All                        | L1 cache miss (detects high reuse)           |
Outline
• Motivation
• Comparison to Previous Schemes
• Design & Implementation
• Evaluation
• Conclusion
Baseline System
• Compute pipeline with private L1-I and L1-D caches
• Logically shared, physically distributed L2 cache (LLC) with integrated directory
• LLC managed using Reactive-NUCA [Hardavellas – ISCA09]
– Local placement of private pages; shared pages are striped
• ACKwise limited-directory protocol [Kurian – PACT10]
Fig: Tiled multicore with memory controllers; each core contains a compute pipeline, L1 I-cache, L1 D-cache, L2 cache (LLC) slice, directory, and router.
Locality Tracking Intelligence: Replica Reuse Counter
• Replica Reuse: tracks cache line usage by a core at the LLC replica
• The replica reuse counter is communicated back to the directory on eviction or invalidation, for classification
• NO additional network messages
• Storage overhead: 1 KB (0.4%)
Fig: LLC directory entry layout: State, Tag, ACKwise Pointers (1…p), LRU, a Replica Reuse counter, and the Complete Locality List (1…n) of per-core Mode and Home Reuse fields.
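A rough sketch of the tracked state, with hypothetical field names mirroring the layout on this slide (field widths and types are illustrative, not the hardware encoding):

```python
from dataclasses import dataclass, field

@dataclass
class DirectoryEntry:
    """Per-line home directory entry: coherence state, ACKwise sharer
    pointers, and (for the complete classifier) a per-core mode and
    home-reuse counter."""
    state: str = "I"
    tag: int = 0
    ackwise_ptrs: list = field(default_factory=list)  # up to p sharers
    mode: dict = field(default_factory=dict)          # core -> "replica"/"no-replica"
    home_reuse: dict = field(default_factory=dict)    # core -> reuse counter

@dataclass
class ReplicaLine:
    """Per-line state at a replicating LLC slice. The replica-reuse
    counter rides back to the home directory on the existing
    eviction/invalidation message, so no extra network traffic."""
    replica_reuse: int = 0
```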
Locality Tracking Intelligence: Mode & Home Reuse Counters
• Mode_i: can the cache line be replicated at Core_i?
• Home Reuse_i: tracks cache line usage by Core_i at the home LLC slice
• Complete locality classifier: tracks locality information for all cores and all LLC cache lines
• Storage overhead: 96 KB (30%)
– We’ll fix this later
Fig: Home directory entry: State, Tag, ACKwise Pointers (1…p), LRU, and the Complete Locality List (1…n) of Mode_i and Home Reuse_i fields.
Mode Transitions: Replication Intelligence
• Initially, no replica is created
• All requests are serviced at the LLC home
• Replication decision made based on previous cache line reuse behavior
Fig: State diagram; a line starts in the “No Replica” mode.
Mode Transitions
• Home Reuse counter: tracks the # accesses by a core at the LLC home location
• Replication decision made based on previous cache line reuse behavior
Mode Transitions
• A replica is created if enough reuse is detected at the LLC home
• If (Home Reuse >= Replication Threshold): promote to “Replica” mode and create a replica
Fig: State diagram; “No Replica” transitions to “Replica” when Home Reuse >= RT (RT: Replication Threshold).
Mode Transitions
• Replica Reuse counter: tracks the # accesses to the LLC at the replica location
Mode Transitions
• Eviction from the LLC replica location (triggered by capacity limitations):
If (Replica Reuse >= Replication Threshold): stay in “Replica” mode
Else: demote to “No Replica” mode
Mode Transitions
• Invalidation at the LLC replica location (triggered by a conflicting write):
If (Replica Reuse + Home Reuse >= Replication Threshold): stay in “Replica” mode
Else: demote to “No Replica” mode
Mode Transitions
• Conflicting write from another core (in “No Replica” mode): reset the Home Reuse counter to 0
Mode Transitions: Summary
• Replication decision made based on previous cache line reuse behavior
Fig: State diagram; Initial → “No Replica”; promote to “Replica” when Home Reuse >= RT; on eviction/invalidation, stay in “Replica” if XReuse >= RT, else demote (XReuse: Replica Reuse on eviction, Replica + Home Reuse on invalidation; RT: Replication Threshold).
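The transition rules above can be sketched as a per-(line, core) state machine. The slides don't specify whether counters reset on promotion, so the reset points below are one plausible reading, and all names are illustrative:

```python
RT = 3  # replication threshold (the statically chosen value from the results)

class LineMode:
    """Per-(line, core) mode machine: promote to Replica once home reuse
    reaches RT; on eviction/invalidation, stay in Replica mode only if
    the observed reuse still reaches RT."""
    def __init__(self):
        self.mode = "no-replica"
        self.home_reuse = 0
        self.replica_reuse = 0

    def home_access(self):
        self.home_reuse += 1
        if self.mode == "no-replica" and self.home_reuse >= RT:
            self.mode = "replica"          # create a local replica

    def replica_access(self):
        self.replica_reuse += 1

    def evict_replica(self):
        # Capacity eviction: demote unless the replica earned its keep.
        if self.replica_reuse < RT:
            self.mode = "no-replica"
        self.replica_reuse = 0

    def invalidate_replica(self):
        # Conflicting write: judge on combined (replica + home) reuse.
        if self.replica_reuse + self.home_reuse < RT:
            self.mode = "no-replica"
        self.home_reuse = self.replica_reuse = 0

    def conflicting_write_at_home(self):
        self.home_reuse = 0                # reset the reuse run

m = LineMode()
for _ in range(3):
    m.home_access()
print(m.mode)        # -> replica (RT home accesses reached)
```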
Locality Tracking Intelligence: Limited-k Locality Classifier
• Complete locality classifier: prohibitive storage overhead (30%)
• Limited locality classifier (k): Mode and Home Reuse information tracked for only k cores
• Modes of other cores obtained by majority voting
• Smaller k -> lower overhead
• Inactive cores are replaced in the locality list based on access pattern, to accommodate new sharers
Fig: Directory entry with a limited locality list (1…k): State, Tag, ACKwise Pointers (1…p), LRU, and per-entry Core ID, Mode, and Home Reuse fields.
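A minimal sketch of the majority-vote lookup, assuming the tracked entries are a dict from core ID to mode; the tie-break and empty-list behaviors are my assumptions, not specified on the slide:

```python
from collections import Counter

def classify(core, tracked_modes):
    """Limited-k classifier lookup: modes are tracked for only k cores;
    an untracked core's mode is approximated by a majority vote over
    the tracked entries."""
    if core in tracked_modes:
        return tracked_modes[core]
    if not tracked_modes:
        return "no-replica"    # assumed default with no information
    votes = Counter(tracked_modes.values())
    return votes.most_common(1)[0][0]

tracked = {1: "replica", 5: "replica", 9: "no-replica"}  # k = 3 entries
print(classify(7, tracked))   # -> replica (untracked core, 2-vs-1 vote)
```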
Limited-3 Locality Classifier
• Limited-3 classifier approximates the performance & energy of the Complete classifier
• Mode and Home Reuse tracked for 3 sharers

| Classifier | Bit Overhead per Core (256KB L2, 32KB L1-D, 16KB L1-I) |
| Complete   | 96 KB (30%)                                            |
| Limited-3  | 13.5 KB (4.5%)                                         |

| Metric          | Limited-3 vs Complete |
| Completion Time | 0.6% higher           |
| Energy          | 1.0% higher           |
Outline
• Motivation
• Comparison to Previous Schemes
• Design & Implementation
• Evaluation
• Conclusion
Evaluation Methodology
• Evaluations done using
– The Graphite simulator for 64 cores
– McPAT/CACTI cache energy models and DSENT network energy models at 11 nm
• Evaluated 21 benchmarks from the SPLASH-2 (11), PARSEC (8), Parallel MI-Bench (1) and UHPC (1) suites
• LLC management schemes compared:
– Static-NUCA (S-NUCA)
– Reactive-NUCA (R-NUCA)
– Victim Replication (VR)
– Adaptive Selective Replication (ASR) [modified]
– Locality-Aware Replication (RT-1, RT-3, RT-8)
Replicate Shared Read-Write Data
LLC Accesses: BARNES
• Most LLC accesses are reads to widely-shared, high-reuse shared read-write data
• Important to replicate shared read-write data
Fig: LLC access count vs number of sharers (1-64) for BARNES, broken down by line class (private, instruction, shared read-only, shared read-write) and reuse bucket ([1-2], [3-9], [≥10]).
Replicate Shared Read-Write Data
Energy Results: BARNES
• Locality-aware protocol reduces network router & link energy by replicating shared read-write data locally
• Victim Replication (VR) obtains limited energy benefits
– (Almost) blind replica creation scheme
– Simplistic LLC replacement policy
– Removes and re-inserts replicas on L1 misses & evictions
• Adaptive Selective Replication (ASR) and Reactive-NUCA do not replicate shared read-write data
Fig: Energy (normalized) for S-NUCA, R-NUCA, VR, ASR, RT-1, RT-3 and RT-8, broken down into DRAM, network link, network router, directory, L2 cache (LLC), L1-D cache, and L1-I cache components.
Replicate Shared Read-Write Data
Completion Time Results: BARNES
• Locality-aware protocol reduces communication time with the LLC home (L1-To-LLC-Home)
Fig: Completion time (normalized) for S-NUCA, R-NUCA, VR, ASR, RT-1, RT-3 and RT-8, broken down into Synchronization, LLC-Home-To-OffChip, LLC-Home-To-Sharers, LLC-Home-Waiting, L1-To-LLC-Home, L1-To-LLC-Replica, and Compute components.
Replicate Private Cache Lines
Page vs Cache Line Classification: BLACKSCHOLES
• Page-level classification incurs false positives
– Multiple cores work privately on cache lines in the same page
– The page is classified shared read-only instead of private
• Page-level data placement is not optimal
– Reactive-NUCA cannot localize most LLC accesses
• Replicate private data to localize all LLC accesses
Fig: LLC access count breakdown (private, instruction, shared read-only, shared read-write) under page-level vs cache-line-level classification.
Replicate Private Cache Lines
Energy Results: BLACKSCHOLES
• Locality-aware protocol reduces network energy through replication of private cache lines
• ASR replicates just shared read-only cache lines
• VR obtains limited improvements in energy
– Still restricted by its replication mechanisms
Fig: Energy (normalized) for S-NUCA, R-NUCA, VR, ASR, RT-1, RT-3 and RT-8, broken down by component.
36
Replicate All Classes of Cache LinesLLC Accesses: BODYTRACK
• Most LLC accesses are reads to widely-shared high-reuse instructions, shared read-only and shared read-write data
• Best replication policy should optimize handling of all 3 classes of cache lines
1 5 10 15 20 25 30 35 40 45 50 55 60 640
40000000
80000000
120000000
Number of Sharers
LLC
Acc
ess
Coun
t
Private
InstructionShared Read-Only
Shared Read-Write
1-2 3-9 ≥10
1-2 3-9 ≥10
37
Replicate All Classes of Cache Lines
Energy Results: BODYTRACK
• R-NUCA replicates instructions, hence obtains small network energy improvements
• ASR replicates instructions and shared read-only data, and obtains larger energy improvements
• The locality-aware protocol replicates shared read-write data as well
Fig: Energy (normalized) for S-NUCA, R-NUCA, VR, ASR, RT-1, RT-3 and RT-8, broken down by component.
38
Use Optimal Replication ThresholdEnergy Results: STREAMCLUSTER
• Perform intelligent replication• RT-1 performs badly due to LLC pollution• RT-8 identifies less replicas, slow to identify useful ones• RT-3 identifies more replicas and faster while not creating LLC
pollution• Use optimal replication threshold of 3
S-NUCA
R-NUCA VR
ASR RT-1
RT-3 RT-8
00.20.40.60.8
11.2
DRAM Network Link Network Router Directory L2 Cache L1-D Cache L1-I CacheEn
ergy
(nor
mal
ized
)
39
Results Summary
• We choose a static replication threshold (RT) of 3
• Energy improved by 13-21%
• Completion time improved by 4-13%
Fig: Energy and completion time (normalized) for S-NUCA, R-NUCA, VR, ASR, RT-1, RT-3 and RT-8.
Conclusion
• Locality-aware instruction and data replication in the last-level cache (LLC)
• Spatio-temporal locality profiled dynamically at the cache-line level using low-overhead yet highly accurate hardware counters
• Enables replication only for lines with high reuse
• Requires minimal changes to the baseline cache coherence protocol, since replicas are placed locally
• Significant energy and performance improvements over state-of-the-art replication mechanisms
See the Paper For…
• Exhaustive benchmark case studies
– Apps with migratory shared data
– Apps with NO benefit from replication
• Limited locality classifier study
– Sensitivity to the number of tracked cores (k)
• Cluster-level locality-aware LLC replication study
– Sensitivity to cluster size (C)
Thank You! Questions?