
Evaluating Associativity in CPU Caches

Mark D. Hill, Alan Jay Smith. IEEE Transactions on Computers (1989)

CPU Caches

● Direct mapped
● Fully associative
● Set associative

– Block size
– Number of sets
– Associativity (elements in one set)
– Set mapping function (block -> set)

● Usually bit selection (modulo)

– Replacement policy
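As a concrete illustration of the bit-selection mapping above, here is a minimal sketch (illustrative names, not code from the paper):

```python
# A minimal sketch of bit selection: block address modulo the number of sets.

def set_index(addr: int, block_size: int, num_sets: int) -> int:
    """Map a byte address to a set via bit selection (modulo)."""
    block = addr // block_size   # block address
    return block % num_sets      # bit selection = low-order block-address bits

# Example: 32 B blocks, 256 sets -> set index is bits [5:13] of the address.
assert set_index(0x1234, 32, 256) == (0x1234 >> 5) & 0xFF
```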

Data and related work

● Trace-driven simulation
– Samples must be short due to space and time limits (1989)

– Metric: miss ratio

● Related work
– Simulation algorithms

● Mattson et al. introduced the inclusion property between alternative caches (distinct from multilevel inclusion)

– Requires a total ordering of the pages in a set before eviction
– Stack simulation for caches

– Associativity
● 32K and smaller caches
● Overall design of bigger caches
● Few papers focus solely on associativity

Stack simulation

● Usually a linked-list implementation
– More complex implementations exist

● AVL trees, hash tables, ...

– Linked lists are good for CPU caches (short search chains)
● CPU references have a high degree of locality
● Caches have a large number of sets and limited associativity

– With LRU, references that hit the most recently used element of a set can be deleted from the trace without affecting the number of misses

– We record the stack distance of every reference:
● n-way cache after K references: miss_ratio = 1 – Σ_{i<n} distance[i] / K
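The formula above admits a direct implementation. Below is a minimal sketch of per-set LRU stack simulation (Python lists stand in for the linked lists the slide mentions; all names are illustrative):

```python
# A minimal sketch of LRU stack simulation: one stack per set, recording each
# reference's within-set stack distance.
from collections import defaultdict

def stack_simulate(blocks, num_sets, max_assoc):
    """Return hist, where hist[d] counts hits at within-set LRU depth d."""
    stacks = defaultdict(list)        # set index -> LRU stack (MRU first)
    hist = [0] * max_assoc
    for b in blocks:                  # b = block address
        s = b % num_sets              # bit-selection set mapping
        stack = stacks[s]
        if b in stack:
            d = stack.index(b)        # LRU stack distance within the set
            if d < max_assoc:
                hist[d] += 1
            stack.remove(b)
        stack.insert(0, b)            # move b to the MRU position
    return hist

def miss_ratio(hist, n, K):
    """Slide's formula: miss ratio of an n-way cache after K references."""
    return 1.0 - sum(hist[:n]) / K
```

One histogram run yields the miss ratio for every associativity up to max_assoc, which is the point of the technique.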

Trace Data

Methods and Traces

● Primary metric: miss ratio

– Effective access time: t_cache + miss_ratio × t_memory (worked example at the end of this slide)

– An increase in associativity can increase access time and thus degrade overall performance

– The miss ratio is easy to define, interpret, and compute, and is implementation independent

● Traces:
– Five-trace group (5 × 500,000 references)

– 23-trace group (170,000–400,000 references)

– Include instruction fetch references

– Both cold and warm startup

– Trace-length limitations: results for large caches are subject to error
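As a worked example of the effective-access-time metric (the cycle counts are illustrative, not from the paper): with t_cache = 1 cycle, t_memory = 20 cycles, and a 5% miss ratio, the effective access time is 1 + 0.05 × 20 = 2 cycles. If doubling associativity cuts the miss ratio to 4% but stretches t_cache to 1.2 cycles, the result is 1.2 + 0.04 × 20 = 2.0 cycles: no net gain, which is the degradation risk noted above.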

Figure 2

Figure 3

Simulation of alternative DM and SA caches

● Useful properties
– Set refinement

● Function f refines g if f(x) = f(y) => g(x) = g(y), for all blocks x, y
● C2 refines C1 when its mapping function F2 refines F1

– Inclusion
● Cache C2 includes C1 if, after any series of references, for any block x:
x resident in C1 => x resident in C2

● Theorem
– For caches with the same block size and LRU replacement:
● C2 includes C1 <=> F2 refines F1 AND assoc2 >= assoc1
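The refinement definition can be checked mechanically. A minimal sketch, assuming bit-selection mapping functions f_i(x) = x mod 2^i (the helper name is illustrative):

```python
# Check the definition: f refines g iff f(x) == f(y) implies g(x) == g(y).

def refines(f, g, blocks):
    """True iff every f-equivalence class maps into one g-equivalence class."""
    image = {}                        # f-value -> the single g-value seen so far
    for x in blocks:
        fx, gx = f(x), g(x)
        if fx in image and image[fx] != gx:
            return False
        image[fx] = gx
    return True

f2 = lambda x: x % (2 ** 8)           # bit selection, 256 sets
f1 = lambda x: x % (2 ** 6)           # bit selection, 64 sets
assert refines(f2, f1, range(4096))   # 2^i refines 2^j for i >= j
assert not refines(f1, f2, range(4096))
```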

Useful implications

● If C2 refines C1 and the mapping functions differ, then C2 has a greater number of sets

● Bit selection: a cache with 2^i sets refines one with 2^j sets, for all i >= j

● C2 must be strictly larger than a different C1 in order to include it

● Refinement implies inclusion for direct-mapped caches
● Inclusion holds among direct-mapped caches using bit selection
● Inclusion does not hold between pairs of different set-associative caches
● Inclusion is a partial ordering of a set of caches
● Refinement is a partial ordering
● Refinement can be used to speed up simulation of alternative caches that use LRU replacement

Simulating Direct-Mapped Caches

● Forest simulation
– Requires that the mapping functions obey set refinement

● Refinement then implies inclusion, which the algorithm exploits

– The data structure uses a forest of trees to simulate alternative direct-mapped caches

– Each tree level corresponds to one cache

– Key idea:
● Start at the top and proceed down until the reference is found
● Increment distance[i] if found on level i
● miss_ratio = 1 – Σ_{i<n} distance[i] / K

– Can be extended to n-way caches by replacing nodes with stacks
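A minimal sketch of the idea, using a dict keyed by (level, set) in place of the paper's tree nodes; the level order (fewest sets at the top) and the stop-on-first-hit search follow the slide's description, and the early stop is justified by inclusion:

```python
# Forest simulation sketch for alternative direct-mapped caches with
# 2**j .. 2**J sets (bit selection). Names are illustrative.

def forest_simulate(blocks, min_log_sets, max_log_sets):
    node = {}                              # (level, set index) -> resident block
    levels = range(min_log_sets, max_log_sets + 1)
    found = {l: 0 for l in levels}         # hits whose first match is level l
    for x in blocks:
        hit_level = None
        for l in levels:                   # top (fewest sets) downwards
            s = x % (2 ** l)               # x's set in the 2**l-set cache
            if node.get((l, s)) == x:
                hit_level = l              # inclusion: also resident deeper down
                break
        for l in levels:                   # x is now resident in every cache
            if hit_level is not None and l >= hit_level:
                break                      # deeper nodes already hold x
            node[(l, x % (2 ** l))] = x
        if hit_level is not None:
            found[hit_level] += 1
    return found
    # misses of the 2**m-set cache = K - sum(found[l] for l <= m)
```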

Forest example

Simulating Set-Associative Caches

● New all-associativity simulation
– Simulates alternative DM and SA caches that have the same block size, LRU replacement, and no prefetching

– Generalization of earlier work

– The unique blocks referenced can usually all be stored in memory
● Storage space can be reclaimed if not used by any cache

– Single run for all alternative caches
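A naive one-pass rendering of the idea (the paper's algorithm adds data structures to avoid the linear scans; names here are illustrative): keep one global LRU stack, and for each reference derive the within-set stack distance for every candidate number of sets by counting more recently used blocks that collide with it.

```python
# All-associativity simulation sketch: one pass over the trace yields miss
# ratios for every (number of sets, associativity) combination.
from collections import defaultdict

def all_assoc_simulate(blocks, set_counts, max_assoc):
    stack = []                                        # global LRU stack, MRU first
    hist = {s: [0] * max_assoc for s in set_counts}   # hist[s][d] = hits at depth d
    for x in blocks:
        if x in stack:
            i = stack.index(x)
            for s in set_counts:
                # more recently used blocks that map to x's set with s sets
                d = sum(1 for y in stack[:i] if y % s == x % s)
                if d < max_assoc:
                    hist[s][d] += 1
            stack.remove(x)
        stack.insert(0, x)
    return hist
    # miss ratio (s sets, n-way) = 1 - sum(hist[s][:n]) / len(blocks)
```

This works because, under LRU, the within-set recency order is the global recency order filtered to that set.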

Figure 6

Figure 7

Figure 8

Trace performance

Size    DM      2-way   4-way   (number of sets, 32 B block)
16K     512     256     128
32K     1024    512     256
64K     2048    1024    512
128K    4096    2048    1024
(sets = capacity / (32 B × associativity))

● Comparable speed when simulating a single cache
● Forest simulation is fastest for DM caches
● All-associativity simulation is fastest for general sets of caches (DM + SA)

Associativity and Miss Ratio

● Relationships exist independent of cache size
● Categorizing cache misses

– Conflict misses (no more room in the same set)
● miss_ratio – miss_ratio_FA
– Capacity misses (no more room in the cache)
● miss_ratio_FA – miss_ratio_infinite
– Compulsory misses (first reference to the data)
● miss_ratio_infinite
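A worked example of the classification, with made-up miss ratios rather than the paper's data:

```python
# Hypothetical measured miss ratios (illustrative, not from the paper):
m_set_assoc   = 0.048   # e.g. a 2-way cache
m_fully_assoc = 0.040   # same capacity, fully associative, LRU
m_infinite    = 0.012   # infinite cache: only first references miss

conflict   = m_set_assoc - m_fully_assoc   # 0.008: evicted by set collisions
capacity   = m_fully_assoc - m_infinite    # 0.028: cache simply too small
compulsory = m_infinite                    # 0.012: first-time references
assert abs(conflict + capacity + compulsory - m_set_assoc) < 1e-12
```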

Table III

Set associative vs. Fully associative

● p_i(s) – probability that a reference is made to the i-th most recently used block of its set, in a cache with s sets

● q_i – probability that a reference is made to the i-th most recently used block in a FA cache

● Miss ratio of an n-way cache: 1 – Σ_{i=1}^{n} p_i(s)

● Miss ratio of a FA cache (n blocks): 1 – Σ_{i=1}^{n} q_i

● Bayes rule:

– p_n(s) = Σ_i Prob(LRU distance = n with s sets | LRU distance = i with 1 set) · q_i

● Probability of set conflict is 1/s
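Putting the last three bullets together: assuming each more recently used block lands in the referenced block's set independently with probability 1/s, the conditional probability is binomial. A minimal sketch of the model (not the paper's code):

```python
# Predict set-associative LRU-distance probabilities from fully associative
# ones via the independent-conflict (probability 1/s) assumption.
from math import comb

def p(n, s, q):
    """P(within-set LRU distance == n) given FA distances q[1], q[2], ...

    Of the i-1 blocks above the reference in the fully associative stack,
    exactly n-1 must land in the same set for the within-set distance to be n.
    q[0] is an unused placeholder so that q[i] means FA distance i.
    """
    return sum(
        comb(i - 1, n - 1) * (1 / s) ** (n - 1) * (1 - 1 / s) ** (i - n) * q[i]
        for i in range(n, len(q))
    )

def miss_ratio_n_way(n, s, q):
    return 1.0 - sum(p(k, s, q) for k in range(1, n + 1))
```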

Figure 11

It works!

● Predictions are accurate
– Error usually less than 5%

● Predictions are usually slightly pessimistic
– Bit-selection collisions are slightly less likely than random due to locality

● The error gets smaller with increased associativity
● The error is not important (it can be measured)
● IMPORTANT:

– The increase in miss ratio is nearly identical to results that assume independent, equally likely set conflicts

Comparing SA Caches

● Miss ratio spread
– Two caches of the same capacity, n-way vs. 2n-way:
M_n / M_2n – 1 = (M_n – M_2n) / M_2n

– Data smoothed using a weighted average
● 0.15 for distance 2
● 0.20 for distance 1
● 0.30 for the current point

– Large caches (> 64K) again subject to errors
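A minimal sketch of the spread metric and the smoothing step (assuming the weights are applied symmetrically around each point, so 0.30 + 2 × 0.20 + 2 × 0.15 = 1; the paper does not spell out edge handling, so the renormalization at the ends is an assumption):

```python
def spread(m_n, m_2n):
    """Relative miss ratio increase of n-way over 2n-way: M_n/M_2n - 1."""
    return m_n / m_2n - 1

def smooth(values):
    """Weighted moving average: 0.30 centre, 0.20 at +-1, 0.15 at +-2."""
    w = [0.15, 0.20, 0.30, 0.20, 0.15]
    out = []
    for i in range(len(values)):
        num = den = 0.0
        for k, wk in zip(range(i - 2, i + 3), w):
            if 0 <= k < len(values):
                num += wk * values[k]
                den += wk
        out.append(num / den)   # renormalize where the window is clipped
    return out
```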

Figure 12

Figure 13

Table IV

Trends

● The spread is larger in low-associativity caches
– DM to 2-way

● Except for instruction caches, size does not matter
– The spread ratio in small instruction caches is smaller
– Due to the sequential behavior of instruction references

● Positively correlated with block size
– Larger blocks => fewer sets

● Miss ratio spread of data and unified caches is similar
– Smaller for instruction caches

● Halving associativity from 8 to 4, 4 to 2, and 2 to 1 gives spread ratios of roughly 5%, 10%, and 30%, respectively
– Regardless of size, type, or block size!

● Design target miss ratios
– Rule of thumb: the miss ratio drops as the square root of the cache size (worked example below)
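As a worked example of the rule of thumb (the numbers are illustrative, not from the paper): if a 16K cache shows an 8% miss ratio, a 64K cache of the same organization would be expected to show roughly 8% × √(16/64) = 4%.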

Conclusion

● Both set refinement and cache inclusion are useful for developing fast simulation algorithms
– Forest simulation for direct-mapped caches

– All associativity simulation

● Miss classification
– Conflict, capacity, compulsory

● The difference between FA and SA caches can be predicted
● The miss ratio spread is invariant to cache size and original miss ratio

– 5%, 10%, and 30%

● Trace-size limitations skewed the results for large caches
– 64K and 128K
