1
Adaptive Insertion Policies for High-Performance Caching
Moinuddin K. Qureshi, Yale N. Patt
Aamer Jaleel, Simon C. Steely Jr., Joel Emer
International Symposium on Computer Architecture (ISCA) 2007
2
Background
L1 misses: short latency, can be hidden
L2 misses: long latency, hurts performance
Important to reduce Last-Level (L2) cache misses
Fast processor + slow memory => cache hierarchy:
Proc -> L1 (~2 cycles) -> L2 (~10 cycles) -> Memory (~300 cycles)
An L2 miss goes all the way to memory
3
Motivation
L1 for latency, L2 for capacity
Traditionally, L2 is managed like L1 (typically LRU)
L1 filters out temporal locality => poor locality at L2
LRU causes thrashing when working set > cache size
Most lines remain unused between insertion and eviction
4
Dead on Arrival (DoA) Lines
DoA Lines: Lines unused between insertion and eviction
For the 1MB 16-way L2, 60% of lines are DoA
Ineffective use of cache space
[Chart: % DoA lines per benchmark]
5
Why DoA lines?
Streaming data: never reused, so the L2 cache does not help
Working set of the application greater than the cache size
Solution: if working set > cache size, retain some of the working set
[Charts: misses per 1000 instructions vs. cache size (MB) for art and mcf]
6
Overview
Problem: LRU replacement inefficient for L2 caches
Goal: a replacement policy that has:
1. Low hardware overhead
2. Low complexity
3. High performance
4. Robust across workloads
Proposal: A mechanism that reduces misses by 21% and has total storage overhead < two bytes
7
Outline
Introduction
Static Insertion Policies
Dynamic Insertion Policies
Summary
8
Cache Insertion Policy
Simple changes to insertion policy can greatly improve cache performance for memory-intensive workloads
Two components of cache replacement:
1. Victim Selection: Which line to replace for incoming line? (E.g. LRU, Random, FIFO, LFU)
2. Insertion Policy: Where is incoming line placed in replacement list? (E.g. insert incoming line at MRU position)
9
LRU-Insertion Policy (LIP)
MRU [ a b c d e f g h ] LRU
Reference to 'i' with traditional LRU policy:
MRU [ i a b c d e f g ] LRU
Reference to 'i' with LIP:
MRU [ a b c d e f g i ] LRU
Choose the victim, but do NOT promote the incoming line to MRU
Lines do not enter non-LRU positions unless reused
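The two insertions above can be reproduced with plain Python lists (a sketch; the recency stack is modeled with MRU at index 0, as on the slide):

```python
# Recency stack as a list, MRU at index 0, LRU at the end (illustrative).
stack = list("abcdefgh")

# Reference to 'i' under traditional LRU: evict 'h', insert 'i' at MRU.
lru_stack = ["i"] + stack[:-1]

# Reference to 'i' under LIP: evict 'h', insert 'i' at the LRU position.
lip_stack = stack[:-1] + ["i"]

print("".join(lru_stack))  # iabcdefg
print("".join(lip_stack))  # abcdefgi
```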
10
Bimodal-Insertion Policy (BIP)
LIP does not age older lines
Infrequently insert lines at the MRU position:
if (rand() < ε) insert at MRU position;
else insert at LRU position;
Let ε = the bimodal throttle parameter
For small ε, BIP retains the thrashing protection of LIP while responding to changes in the working set
11
Circular Reference Model [Smith & Goodman, ISCA '84]
Reference stream: (a1 a2 ... aT)^N followed by (b1 b2 ... bT)^N, i.e. T blocks repeated N times; the cache has K blocks (K < T and N >> T). Hit fraction of each policy:

Policy         (a1 a2 ... aT)^N   (b1 b2 ... bT)^N
LRU            0                  0
OPT            (K-1)/(T-1)        (K-1)/(T-1)
LIP            (K-1)/T            0
BIP (small ε)  ≈ (K-1)/T          ≈ (K-1)/T
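The LRU, LIP, and BIP rows of this table can be checked with a small fully associative cache simulator (a sketch; the function name and the ε = 1/32 default are illustrative):

```python
import random

def hit_rate(policy, refs, k, eps=1/32):
    """Fully associative cache of k blocks; stack index 0 = MRU.
    policy is one of "lru", "lip", "bip"."""
    stack, hits = [], 0
    for block in refs:
        if block in stack:
            hits += 1
            stack.remove(block)
            stack.insert(0, block)      # all policies promote to MRU on a hit
        else:
            if len(stack) == k:
                stack.pop()             # evict the LRU line
            if policy == "lru" or (policy == "bip" and random.random() < eps):
                stack.insert(0, block)  # insert at MRU
            else:                       # LIP, or BIP's common case
                stack.append(block)     # insert at LRU
    return hits / len(refs)

# K=4, T=8, N=100: LRU thrashes (0 hits), LIP keeps K-1 blocks resident.
a = [f"a{i}" for i in range(8)] * 100
print(hit_rate("lru", a, 4))   # 0.0
print(hit_rate("lip", a, 4))   # close to (K-1)/T = 0.375
```

When the working set changes (the b-blocks), LIP's hit rate collapses to zero while BIP adapts, matching the last row of the table.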
12
Results for LIP and BIP
Changing the insertion policy increases misses for LRU-friendly workloads
[Chart: % reduction in L2 MPKI per benchmark for LIP and BIP (ε = 1/32)]
13
Outline
Introduction
Static Insertion Policies
Dynamic Insertion Policies
Summary
14
Dynamic-Insertion Policy (DIP)
Two types of workloads: LRU-friendly or BIP-friendly
DIP can be implemented by:
1. Monitor both policies (LRU and BIP)
2. Choose the best-performing policy
3. Apply the best policy to the cache
Need a cost-effective implementation: "Set Dueling"
15
DIP via "Set Dueling"
Divide the cache into three groups of sets:
- Dedicated LRU sets
- Dedicated BIP sets
- Follower sets (follow the winner of LRU vs. BIP)
One n-bit saturating counter: a miss in an LRU set increments it; a miss in a BIP set decrements it
The counter's MSB decides the policy for the follower sets:
- MSB = 0: use LRU
- MSB = 1: use BIP
Monitor, choose, and apply, all with a single counter
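A minimal sketch of the set-dueling selector (the class name, counter width, and the choice of which sets are dedicated are illustrative; real designs spread the dedicated sets across the cache):

```python
class SetDueling:
    """Single saturating PSEL counter that picks LRU or BIP for follower sets."""
    def __init__(self, dedicated=32, bits=10):
        self.dedicated = dedicated           # sets per dedicated group
        self.max = (1 << bits) - 1
        self.msb = 1 << (bits - 1)
        self.psel = self.max // 2            # start in the middle

    def policy_for(self, set_idx):
        if set_idx < self.dedicated:
            return "lru"                     # dedicated LRU set
        if set_idx < 2 * self.dedicated:
            return "bip"                     # dedicated BIP set
        # Follower set: MSB = 0 -> LRU wins, MSB = 1 -> BIP wins.
        return "bip" if self.psel & self.msb else "lru"

    def on_miss(self, set_idx):
        if set_idx < self.dedicated:         # LRU set missed: vote against LRU
            self.psel = min(self.psel + 1, self.max)
        elif set_idx < 2 * self.dedicated:   # BIP set missed: vote against BIP
            self.psel = max(self.psel - 1, 0)
```

Misses in follower sets never touch the counter; only the two dedicated groups vote.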
16
Bounds on Dedicated Sets
How many dedicated sets are required for "Set Dueling"?
μLRU, σLRU, μBIP, σBIP = mean and standard deviation of misses for LRU and BIP
P(Best) = probability of selecting the best policy
P(Best) = P(Z < r·√n), where:
n = number of dedicated sets
Z = standard Gaussian variable
r = |μLRU − μBIP| / √(σLRU² + σBIP²)
For the majority of workloads r > 0.2, so 32-64 dedicated sets are sufficient
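Plugging in r = 0.2 shows why 32-64 dedicated sets suffice (a sketch that evaluates the standard normal CDF via math.erf):

```python
import math

def p_best(r, n):
    """P(Best) = P(Z < r * sqrt(n)) for a standard normal Z."""
    x = r * math.sqrt(n)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# For r = 0.2, the probability of selecting the better policy:
print(round(p_best(0.2, 32), 3))   # ~0.87
print(round(p_best(0.2, 64), 3))   # ~0.95
```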
17
Results for DIP
DIP reduces average MPKI by 21% and requires < two bytes of storage overhead
[Chart: % reduction in L2 MPKI per benchmark for BIP and DIP (32 dedicated sets)]
18
DIP vs. Other Policies
[Chart: % reduction in average MPKI (0-35% scale) for (LRU+RND), (LRU+LFU), (LRU+MRU), DIP, OPT, and Double (2MB)]
DIP bridges two-thirds of the gap between LRU and OPT
19
IPC Improvement
Processor: 4-wide, 32-entry window; Memory: 270 cycles; L2: 1MB 16-way LRU
[Chart: IPC improvement with DIP (%) per benchmark]
DIP Improves IPC by 9.3% on average
20
Outline
Introduction
Static Insertion Policies
Dynamic Insertion Policies
Summary
21
Summary
LRU is inefficient for L2 caches: most lines remain unused between insertion and eviction
The proposed change to the cache insertion policy (DIP) has:
1. Low hardware overhead: requires < two bytes of storage
2. Low complexity: trivial to implement, no changes to the cache structure
3. High performance: reduces misses by 21%, two-thirds as good as OPT
4. Robust across workloads: almost as good as LRU for LRU-friendly workloads
23
DIP vs. LRU Across Cache Sizes
[Chart: MPKI relative to 1MB LRU (%) for LRU and DIP at 1MB, 2MB, 4MB, and 8MB, for art, mcf, equake, swim, health, and Avg_16; smaller is better]
MPKI decreases until the workload fits in the cache
24
DIP with 1MB 8-way L2 Cache
MPKI reduction with 8-way (19%) similar to 16-way (21%)
[Chart: % reduction in L2 MPKI per benchmark with a 1MB 8-way L2 (0-50% scale)]
25
Interaction with Prefetching
[Chart: % reduction in L2 MPKI per benchmark for LRU-Pref, DIP-NoPref, and DIP-Pref]
DIP also works well in presence of prefetching
(PC-based stride prefetcher)
26
mcf snippet
27
art snippet
28
health mpki
29
swim mpki
30
DIP Bypass
31
DIP (design and implementation)
32
Random Replacement (Success Function)
The cache contains K blocks and the cyclic reference stream contains T distinct blocks
Probability that a cached block survives one eviction = (1 − 1/K)
Expected number of evictions between consecutive references to a block = (T−1)·Pmiss
Phit = (1 − 1/K)^((T−1)·Pmiss) = (1 − 1/K)^((T−1)·(1 − Phit))
Iterative solution, starting at Phit = 0:
1. Phit = (1 − 1/K)^(T−1)
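The fixed point can be computed by continuing the iteration the slide starts (a sketch; the function name and parameter values are illustrative):

```python
def random_hit_rate(k, t, iters=200):
    """Solve p = (1 - 1/k)^((t-1)*(1-p)) by fixed-point iteration from p = 0."""
    p = 0.0
    for _ in range(iters):
        p = (1.0 - 1.0 / k) ** ((t - 1) * (1.0 - p))
    return p

# K = 4 blocks, T = 8 distinct blocks in the cyclic stream:
p = random_hit_rate(4, 8)
print(round(p, 3))   # equilibrium hit probability under random replacement
```

The iteration is monotone increasing from 0 and converges to the stable fixed point below 1.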