Evaluating Associativity in CPU Caches
Mark D. Hill and Alan Jay Smith
IEEE Transactions on Computers (1989)
CPU Caches
● Direct mapped
● Fully associative
● Set associative
  – Block size
  – Number of sets
  – Associativity (elements in one set)
  – Set mapping function (block -> set)
    ● Usually bit selection (modulo)
  – Replacement policy
Data and related work
● Trace-driven simulation
  – Samples must be short: space and time limits (1989)
  – Metric: miss rate
● Related work
  – Simulation algorithms
    ● Mattson et al. introduced the inclusion property of alternative caches (not multilevel inclusion)
      – Requires a total ordering of pages in a set before eviction
      – Stack simulation for caches
  – Associativity
    ● 32K and smaller caches
    ● Overall design of bigger caches
    ● Few papers focus solely on associativity
Stack simulation
● Usually a linked-list implementation
  – More complex implementations exist
    ● AVL trees, hash tables, ...
  – Good for CPU caches (few links)
    ● CPU references have a high degree of locality
    ● Caches have a large number of sets and limited associativity
  – With LRU, references that hit the most recently used element can be deleted without affecting the number of misses
  – We record the stack distance of every reference:
    ● n-way cache after K references: miss_ratio = 1 – Σ distance[i]/K, for i < n
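A minimal one-pass sketch of the idea (a Python list stands in for the paper's linked list; names are illustrative): every reference records its per-set LRU stack distance, and the miss ratio of any associativity up to a chosen maximum can then be read off the distance histogram.

```python
def stack_simulate(trace, num_sets, max_assoc):
    """One-pass LRU stack simulation: record the per-set LRU stack
    distance of every reference; miss ratios for all associativities
    up to max_assoc follow from the distance histogram."""
    stacks = [[] for _ in range(num_sets)]   # one LRU stack per set
    distance = [0] * max_assoc               # distance[i]: hits at depth i
    K = 0
    for block in trace:
        K += 1
        stack = stacks[block % num_sets]     # bit selection (modulo)
        if block in stack:
            d = stack.index(block)           # LRU stack distance
            if d < max_assoc:
                distance[d] += 1
            stack.remove(block)
        stack.insert(0, block)               # block is now most recently used
    # n-way miss ratio = 1 - sum(distance[i] for i < n) / K
    return [1 - sum(distance[:n]) / K for n in range(1, max_assoc + 1)]
```

For example, the trace 0, 1, 0, 1 with a single set misses every time in a 1-way (direct-mapped) cache but hits twice in a 2-way cache.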
Trace Data
Methods and Traces
● Primary metric: miss ratio
  – Effective access time: t_cache + miss_ratio * t_memory
  – An increase in associativity can increase access time and thus degrade performance
  – Easy to define, interpret, and compute; implementation independent
● Traces:
  – Five-trace group (5 x 500 000 references)
  – 23-trace groups (170–400 k references)
  – Include instruction fetch references
  – Both cold and warm startup
  – Trace-length limitations: large caches subject to errors
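As an illustration of the access-time tradeoff above (all timings hypothetical), a faster but more miss-prone direct-mapped cache can beat a slower set-associative one:

```python
def effective_access_time(t_cache, t_memory, miss_ratio):
    # t_eff = t_cache + miss_ratio * t_memory
    return t_cache + miss_ratio * t_memory

# Hypothetical numbers: the DM cache has a worse miss ratio but a
# faster hit time, and still wins on effective access time.
dm = effective_access_time(t_cache=1.0, t_memory=20.0, miss_ratio=0.050)
sa = effective_access_time(t_cache=1.2, t_memory=20.0, miss_ratio=0.045)
```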
Figure 2
Figure 3
Simulation of alternative DM and SA caches
● Useful properties
  – Set refinement
    ● Function f refines g if f(x) = f(y) => g(x) = g(y), for all blocks x, y
    ● "C2 refines C1" is shorthand for: F2 refines F1
  – Inclusion
    ● Cache C2 includes C1 if, after any series of references, for any block x:
      x is resident in C1 => x is resident in C2
● Theorem
  – With the same block size and LRU replacement:
    ● C2 includes C1 <=> F2 refines F1 AND assoc2 >= assoc1
Useful implications
● If C2 refines C1 and the mapping functions differ, C2 has a greater number of sets
● Bit selection: a function with 2^i sets refines one with 2^j sets, for all i >= j
● C2 must be strictly larger than a different C1 in order to include it
● Refinement implies inclusion in direct-mapped caches
● Inclusion holds for direct-mapped caches using bit selection
● Inclusion does not hold between pairs of different set-associative caches
● Inclusion is a partial ordering on the set of caches
● Refinement is a partial ordering
● Refinement can be used to speed up simulation of alternative caches that use LRU replacement
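The bit-selection claim can be checked by brute force; the sketch below assumes modulo mapping functions over a small universe of blocks (helper names are illustrative):

```python
def refines(f, g, blocks):
    """f refines g iff f(x) == f(y) implies g(x) == g(y) for all blocks."""
    return all(g(x) == g(y)
               for x in blocks for y in blocks
               if f(x) == f(y))

blocks = range(64)
f = lambda x: x % 8   # bit selection, 2^3 sets
g = lambda x: x % 4   # bit selection, 2^2 sets

assert refines(f, g, blocks)      # more sets refines fewer sets
assert not refines(g, f, blocks)  # but not the other way around
```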
Simulating Direct-Mapped Caches
● Forest simulation
  – Requires that the mapping functions obey set refinement
    ● Refinement implies inclusion here, which is advantageous
  – The data structure uses trees to simulate alternative direct-mapped caches
  – Each level corresponds to one cache
  – Key idea:
    ● Start at the top and proceed down until the reference is found
    ● Increment distance[i] if found on level i
    ● miss_ratio = 1 – Σ distance[i]/K, for i < n
  – Can be extended to n-way caches by replacing nodes with stacks
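A sketch of the forest idea for direct-mapped caches under bit selection (so set refinement, and hence inclusion, holds by construction); the structure and names are illustrative, not the paper's code:

```python
def forest_simulate(trace, levels):
    """Simulate direct-mapped caches with 2^0 .. 2^(levels-1) sets in one
    pass. tables[l] maps a set index to the block resident in that set of
    cache l. By inclusion, a block found at level l is also resident at
    every deeper (larger) level."""
    tables = [dict() for _ in range(levels)]
    found = [0] * (levels + 1)       # found[levels] counts misses everywhere
    K = 0
    for block in trace:
        K += 1
        level = levels               # assume a miss in every cache
        for l in range(levels):      # top (smallest cache) downwards
            if tables[l].get(block % (1 << l)) == block:
                level = l            # hit in cache l and all larger caches
                break
        found[level] += 1
        for l in range(levels):      # the block is now resident everywhere
            tables[l][block % (1 << l)] = block
    # the cache with 2^l sets hits whenever the block was found at level <= l
    return [1 - sum(found[:l + 1]) / K for l in range(levels)]
```

For the trace 0, 1, 0: the 1-set cache misses every time, while the 2-set cache hits on the final reference.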
Forest example
Simulating Set-Associative Caches
● New all-associativity simulation
  – Simulates alternative DM and SA caches that have the same block size, LRU replacement, and no prefetching
  – Generalization of earlier work
  – Unique accesses can usually be stored in memory
    ● Storage space can be reclaimed if not used by any cache
  – Single run for all alternative caches
Figure 6
Figure 7
Figure 8
Trace performance
Number of sets (32B block):

Size    1-way   2-way   4-way
16K       512     256     128
32K      1024     512     256
64K      2048    1024     512
128K     4096    2048    1024
● Comparable for a single cache
● Forest is fastest for DM caches
● All-associativity is fastest for general caches (DM + SA)
Associativity and Miss Ratio
● Relationships exist independent of cache size
● Categorizing cache misses
  – Conflict misses (no more room in the same set)
    ● miss_ratio – miss_ratio_fully_associative
  – Capacity misses (no more space in the cache)
    ● miss_ratio_fully_associative – miss_ratio_infinite
  – Compulsory misses (first reference to the data)
    ● miss_ratio_infinite
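Given the three measured miss ratios, the classification is simple arithmetic (the numbers below are hypothetical):

```python
def classify_misses(miss_ratio, miss_ratio_fa, miss_ratio_inf):
    """Split a cache's miss ratio into the three components, where
    miss_ratio_fa is for a fully associative cache of the same size and
    miss_ratio_inf is for an infinite cache (compulsory misses only)."""
    return {
        "conflict":   miss_ratio - miss_ratio_fa,
        "capacity":   miss_ratio_fa - miss_ratio_inf,
        "compulsory": miss_ratio_inf,
    }

# Hypothetical measurements:
parts = classify_misses(0.060, 0.045, 0.010)
# conflict ~ 0.015, capacity ~ 0.035, compulsory = 0.010
```

Note that the three components sum back to the measured miss ratio by construction.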
Table III
Set associative vs. Fully associative
● p_i(s): probability that a reference is made to the i-th most recently used block in one of the s sets
● q_i: probability that a reference is made to the i-th most recently used block in a fully associative cache
● Miss ratio of an n-way cache with s sets: 1 – Σ p_i(s), for i <= n
● Miss ratio of a FA cache with n blocks: 1 – Σ q_i, for i <= n
● Bayes rule:
  – p_n(s) = Σ Prob(LRU distance n with s sets | LRU distance i with 1 set) · q_i
● The probability of a set conflict is 1/s
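A sketch of this prediction under the stated model, assuming each block intervening between two references to x independently maps to x's set with probability 1/s, so the number of same-set intervening blocks is binomial (the function name and the sample profile are illustrative):

```python
from math import comb

def predict_set_lru_probs(q, s):
    """Map fully associative LRU-distance probabilities q (q[j] is the
    probability of FA distance j+1) to per-set distance probabilities
    p_i(s), assuming independent, equally likely set conflicts (1/s)."""
    p = [0.0] * len(q)
    for j, qj in enumerate(q):      # FA distance j+1: j intervening blocks
        for i in range(j + 1):      # exactly i of them land in the same set
            p[i] += comb(j, i) * (1 / s) ** i * (1 - 1 / s) ** (j - i) * qj
    return p

q = [0.5, 0.2, 0.1, 0.05]           # hypothetical FA distance profile
p = predict_set_lru_probs(q, s=2)
# estimated miss ratio of a 2-way cache with 2 sets: 1 - (p[0] + p[1])
```

With s = 1 the cache degenerates to fully associative and the prediction returns q unchanged, a useful sanity check.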
Figure 11
It works!
● Predictions are accurate
  – Error is usually less than 5%
● Predictions are usually pessimistic
  – Bit-selection collisions are slightly less likely than random, due to locality
● The error gets smaller with increased associativity
● The residual error is not important (it can be measured)
● IMPORTANT:
  – The increase in miss ratio is nearly identical to results that assume independent and equally probable set conflicts
Comparing SA Caches
● Miss ratio spread
  – Two caches of the same capacity, n-way vs. 2n-way:
    M_n/M_2n – 1 = (M_n – M_2n)/M_2n
  – Data smoothed using a weighted average
    ● 0.15 for distance 2
    ● 0.20 for distance 1
    ● 0.30 for the current point
  – Large caches (> 64K) are again subject to trace errors
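Illustrative arithmetic for the spread, with a sketch of the weighted smoothing (weights mirrored to 0.15/0.20/0.30/0.20/0.15; renormalizing at the ends of the series is an assumption, not stated in the slides):

```python
def miss_ratio_spread(m_n, m_2n):
    # spread = M_n / M_2n - 1 = (M_n - M_2n) / M_2n
    return m_n / m_2n - 1

def smooth(values):
    """Weighted moving average over distances -2..+2."""
    weights = {-2: 0.15, -1: 0.20, 0: 0.30, 1: 0.20, 2: 0.15}
    out = []
    for i in range(len(values)):
        acc = norm = 0.0
        for d, w in weights.items():
            if 0 <= i + d < len(values):
                acc += w * values[i + d]
                norm += w
        out.append(acc / norm)   # renormalize near the edges
    return out

# Hypothetical 2-way vs. 4-way miss ratios: a spread of about 10%.
spread = miss_ratio_spread(0.044, 0.040)
```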
Figure 12
Figure 13
Table IV
Trends
● The spread is larger in low-associativity caches
  – Largest from DM to 2-way
● Except for instruction caches, cache size does not matter
  – The spread ratio in small instruction caches is smaller
  – Due to the sequential behavior of instruction references
● The spread is positively correlated with block size
  – Larger blocks => fewer sets
● The miss ratio spread of data and unified caches is similar
  – Smaller for instruction caches
● 8->4, 4->2, and 2->1 ways: spread ratios of about 5%, 10%, and 30%
  – Regardless of size, type, and block size!
● Design target miss ratios
  – Rule of thumb: the miss ratio drops as the square root of the cache size
Conclusion
● Both set refinement and cache inclusion useful for developing fast simulation algorithms– Forest simulation for direct mapped caches
– All associativity simulation
● Miss classification– Conflict, capacity, compulsory
● Difference between FA and SA caches can be predicted● Miss ratio spread is invariant to cache size, and original miss ratio
– 5, 10, 30 percent
● Trace size limitations skewed the results for large caches– 64K, 128K