
Page 1: Towards a Theory of Cache-Efficient Algorithms

Towards a Theory of Cache-Efficient Algorithms

Summary for the seminar:

Analysis of algorithms in hierarchical memory – Spring 2004

by Gala Golan

Page 2: Towards a Theory of Cache-Efficient Algorithms

The RAM Model

In the previous lecture we discussed caching in the operating-system setting

We saw a lower bound on sorting:

N = number of elements to sort
B = number of elements in each block
M = size of fast memory, in elements

$$\Omega\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)$$
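To make the bound concrete, here is a minimal Python sketch that evaluates it numerically; the parameter values are hypothetical, chosen only for illustration:

```python
import math

def io_sort_lower_bound(N, B, M):
    """Block transfers needed to sort N elements in I(M,B):
    Omega((N/B) * log_{M/B} (N/B))."""
    return (N / B) * math.log(N / B) / math.log(M / B)

# Hypothetical parameters: 2^26 elements, 8 elements per block,
# fast memory holding 2^13 elements.
print(f"{io_sort_lower_bound(N=2**26, B=2**3, M=2**13):.3e}")
```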

Page 3: Towards a Theory of Cache-Efficient Algorithms

The I/O Model

1. A datum can be accessed only from fast memory

2. B elements are brought to memory in each access

3. Computation cost << I/O cost

4. A block of data can be placed anywhere in fast memory

5. I/O operations are explicit
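Assumption 2 is what makes scanning cheap: reading N contiguous elements costs about N/B block transfers rather than N accesses. A minimal sketch of that count (the function name is ours, for illustration):

```python
import math

def scan_io_cost(N, B):
    """I/O cost of reading N contiguous elements in I(M,B):
    each explicit I/O brings B elements, so ceil(N/B) transfers suffice."""
    return math.ceil(N / B)

print(scan_io_cost(N=1000, B=32))  # 32 block transfers, not 1000 accesses
```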

Page 4: Towards a Theory of Cache-Efficient Algorithms

The Cache Model

1. A datum can be accessed only from fast memory √

2. B elements are brought to memory in each access √

3. Computation cost << I/O cost: L denotes the normalized cache latency; accessing a block from cache costs 1

4. A block of data can be placed anywhere in fast memory: here, instead, a fixed mapping distributes main memory blocks across the cache

5. I/O operations are explicit: here, instead, the cache is not visible to the programmer
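A toy cost model may help here; this sketch (our own simplification, not the paper's formal machinery) charges 1 per cache hit and L per miss, with the fixed direct mapping of assumption 4:

```python
class DirectMappedCache:
    """Toy cost model for C(M,B,L) with m frames: block b may reside
    only in frame b mod m; a hit costs 1, a miss costs L."""

    def __init__(self, m, L):
        self.frames = [None] * m   # frames[i] = block currently cached there
        self.L = L
        self.cost = 0

    def access(self, block):
        i = block % len(self.frames)    # the fixed mapping: one legal frame
        if self.frames[i] == block:
            self.cost += 1              # hit: normalized latency 1
        else:
            self.frames[i] = block      # miss: fetch block, evict silently
            self.cost += self.L

cache = DirectMappedCache(m=4, L=10)
for b in [0, 4, 0, 4]:                  # blocks 0 and 4 collide in frame 0
    cache.access(b)
print(cache.cost)                       # 40: every access misses
```

Note the contrast with assumption 5: the program above never issues an explicit I/O; the misses happen implicitly.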

Page 5: Towards a Theory of Cache-Efficient Algorithms

Notation

I(M,B) – the I/O model
C(M,B,L) – the cache model
n = N/B, m = M/B – the sizes of the data and of memory in blocks (instead of elements)

The goal of algorithm design is to minimize running time = (number of cache accesses) + L · (number of memory accesses)

Page 6: Towards a Theory of Cache-Efficient Algorithms

Reminder – Cache Associativity

Associativity specifies the number of different frames in which a memory block can reside.

[Figure: fully associative, direct-mapped, and 2-way set-associative placements of a block.]
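The placement rule can be stated in one line; a small sketch (the helper name is ours) showing how the set index changes with associativity:

```python
def cache_set(block, num_frames, k):
    """Set index of a block in a k-way set-associative cache with
    num_frames frames: the block may occupy any of the k frames of
    set (block mod num_sets). k=1 is direct mapped; k=num_frames is
    fully associative (a single set)."""
    num_sets = num_frames // k
    return block % num_sets

print(cache_set(block=13, num_frames=8, k=1))  # direct mapped: set 5
print(cache_set(block=13, num_frames=8, k=2))  # 2-way: set 1
print(cache_set(block=13, num_frames=8, k=8))  # fully associative: set 0
```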

Page 7: Towards a Theory of Cache-Efficient Algorithms

Emulation Theorem

An algorithm A in I(M,B) using T block transfers and I processing time can be converted to an equivalent algorithm Ac in C(M,B,L) that runs in O(I + (L+B)·T) steps.

The additional memory requirement is m blocks.

In other words: an algorithm that is efficient in main memory can be made efficient in cache.
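The accounting behind the O(I + (L+B)·T) term can be sketched as follows; this is an illustration of the idea under our own simplifying assumptions, not the paper's exact construction:

```python
def emulate_transfer(Mem, Buf, src_block, dst_frame, B):
    """One emulated I/O of algorithm Ac: copy block src_block of main
    memory Mem[] into frame dst_frame of the buffer Buf[], element by
    element. The copy touches O(1) cache lines (O(L) miss cost) and
    moves B elements (O(B) work), so T transfers add O((L+B)T)."""
    for j in range(B):
        Buf[dst_frame * B + j] = Mem[src_block * B + j]

Mem = list(range(64))      # main memory: n = 16 blocks of B = 4 elements
Buf = [None] * 16          # emulated fast memory: m = 4 blocks
emulate_transfer(Mem, Buf, src_block=5, dst_frame=2, B=4)
print(Buf[8:12])           # [20, 21, 22, 23]
```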

Page 8: Towards a Theory of Cache-Efficient Algorithms

Proof (1)

[Figure: the direct-mapped cache C[1..m], main memory Mem[1..n], and the emulation buffer Buf[] of m blocks kept in main memory.]

Page 9: Towards a Theory of Cache-Efficient Algorithms

Proof (2)

[Figure: the same layout, with blocks a and b marked in Buf[].]

Page 10: Towards a Theory of Cache-Efficient Algorithms

Proof (3)

[Figure: the same layout, with blocks a and b in Buf[] and a frame q marked.]

Page 11: Towards a Theory of Cache-Efficient Algorithms

Proof (4)

[Figure: the same layout; block b also shown in main memory.]

Page 12: Towards a Theory of Cache-Efficient Algorithms

Proof (5)

[Figure: the same layout, with blocks a, b and frame q marked.]

Page 13: Towards a Theory of Cache-Efficient Algorithms

Proof (6)

[Figure: the same layout, with blocks a, b and frame q marked.]

Page 14: Towards a Theory of Cache-Efficient Algorithms

Block efficient algorithms

An algorithm is block efficient if it performs computation on at least a constant fraction of the elements in each block transferred.

In such a case B·T = O(I), so an algorithm for I(M,B) can be emulated in C(M,B,L) in O(I + L·T) steps.

The algorithms for sorting, FFT, and matrix transposition are block efficient.

Page 15: Towards a Theory of Cache-Efficient Algorithms

Extension to set-associative cache

In a set-associative cache, if all k lines of a set are occupied, the hardware uses LRU to choose which block of the set to replace.

In the emulation technique described before we do not have explicit control of the replacement.

Instead, a property of LRU will be used, and the cache will be used only partially.

Page 16: Towards a Theory of Cache-Efficient Algorithms

Optimal Replacement Algorithm for Cache

OPT or MIN – a hypothetical algorithm that minimizes cache misses for a given (finite) access trace.

Offline – it knows in advance which blocks will be accessed next.

It evicts from the cache the block whose next access lies farthest in the future.

Proven to be optimal – better than any online algorithm.

Proposed by Belady in 1966. Used to theoretically test the efficiency of online algorithms.
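Belady's policy is short enough to state as code; a minimal, unoptimized sketch:

```python
def belady_misses(trace, m):
    """OPT/MIN: on a miss with a full cache of m frames, evict the block
    whose next use lies farthest in the future (or never recurs).
    Returns the number of misses on the given access trace."""
    cache, misses = set(), 0
    for t, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) == m:
            def next_use(b):           # look ahead for b's next access
                for u in range(t + 1, len(trace)):
                    if trace[u] == b:
                        return u
                return float('inf')    # never used again: ideal victim
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses

print(belady_misses([1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5], m=3))  # 7
```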

Page 17: Towards a Theory of Cache-Efficient Algorithms

LRU vs. OPT

For any constant factor c > 1, LRU with fast memory size m makes at most c times as many misses as OPT with fast memory size (1-1/c)m.

For example, with c = 3: LRU with cache size m makes at most 3 times as many misses as OPT with memory of size (2/3)m, e.g. m = 9 frames for LRU against (1 − 1/3) · 9 = 6 frames for OPT.

[Figure: LRU with 9 frames – at most 3X misses; OPT with 6 frames – X misses.]
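To see the bound empirically, one can pit an LRU simulator against the belady_misses sketch above on a random trace; the trace and sizes below are arbitrary choices for illustration:

```python
import random
from collections import OrderedDict

def lru_misses(trace, m):
    """Count LRU misses with m frames (OrderedDict tracks recency)."""
    cache, misses = OrderedDict(), 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)        # mark most recently used
        else:
            misses += 1
            if len(cache) == m:
                cache.popitem(last=False)   # evict least recently used
            cache[block] = True
    return misses

# c = 3: LRU with 9 frames vs. OPT with (1 - 1/3) * 9 = 6 frames.
# The theorem bounds the left side by 3 times the right (up to an
# additive term); belady_misses is the sketch from the previous slide.
trace = [random.randrange(20) for _ in range(500)]
print(lru_misses(trace, 9), "vs. 3 *", belady_misses(trace, 6))
```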

Page 18: Towards a Theory of Cache-Efficient Algorithms

Extension to set-associative cache – Cont.

Similarly, LRU with cache size m makes at most 2 times as many misses as OPT with memory of size m/2.

We emulate the I/O algorithm using only half of Buf[]: instead of k cache lines per set, only k/2 are used.

These k/2 blocks are managed optimally, by the optimality of the I/O algorithm.

In the real cache, all k lines are managed by LRU and incur at most twice as many misses.

Page 19: Towards a Theory of Cache-Efficient Algorithms

Extension to set-associative cache – Cont.

[Figure: the cache C[1..m], main memory Mem[1..n], and the emulation buffer Buf[], of which only half is used.]

Page 20: Towards a Theory of Cache-Efficient Algorithms

Generalized Emulation Theorem

An algorithm A in I(M/2,B) using T block transfers and I processing time can be converted to an equivalent algorithm Ac in the k-way associative cache model C(M,B,L) that runs in O(I + (L+B)·T) steps.

The additional memory requirement is m/2 blocks.

Page 21: Towards a Theory of Cache-Efficient Algorithms

The cache complexity of sorting

The lower bound for sorting in I(M,B) is

$$T = \Omega\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)$$

The lower bound for sorting in C(M,B,L) is

$$\Omega\!\left(N \log N + L \cdot \frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)$$

where I = computations and T = I/O operations account for the two terms.
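For intuition about when each term dominates, a quick numeric comparison with hypothetical parameters (our choice, for illustration only):

```python
import math

def cache_sort_lower_bound_terms(N, B, M, L):
    """The two terms of the C(M,B,L) bound: computation N log N and
    I/O cost L * (N/B) * log_{M/B}(N/B)."""
    comp = N * math.log2(N)
    io = L * (N / B) * math.log(N / B) / math.log(M / B)
    return comp, io

comp, io = cache_sort_lower_bound_terms(N=2**26, B=2**3, M=2**13, L=50)
print(f"computation ~ {comp:.2e}, L * transfers ~ {io:.2e}")
```

With these values the two terms come out the same order of magnitude, which is why neither can be dropped from the bound.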

Page 22: Towards a Theory of Cache-Efficient Algorithms

Cache Miss Classes

Compulsory Miss – a block is being referenced for the first time

Capacity Miss – a block was evicted from the cache because the cache is too small

Conflict Miss – a block was evicted from the cache because another block was mapped to the same set.
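A common way to attribute misses to these classes is to compare against a fully associative LRU cache of the same size; the following sketch uses that textbook convention (our choice of attribution scheme, for illustration):

```python
from collections import OrderedDict

def classify_misses(trace, num_frames):
    """Classify the misses of a direct-mapped cache with num_frames frames.
    compulsory: first reference to the block; conflict: the reference would
    have hit in a fully associative LRU cache of equal size; capacity: rest."""
    direct, full = {}, OrderedDict()
    seen = set()
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for block in trace:
        frame = block % num_frames
        direct_hit = direct.get(frame) == block
        full_hit = block in full
        if full_hit:                        # maintain the fully associative
            full.move_to_end(block)         # LRU shadow cache
        else:
            if len(full) == num_frames:
                full.popitem(last=False)
            full[block] = True
        if not direct_hit:
            if block not in seen:
                counts["compulsory"] += 1
            elif full_hit:
                counts["conflict"] += 1
            else:
                counts["capacity"] += 1
            direct[frame] = block
        seen.add(block)
    return counts

print(classify_misses([0, 4, 0, 4, 1, 2, 3], num_frames=4))
# {'compulsory': 5, 'capacity': 0, 'conflict': 2}
```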

Page 23: Towards a Theory of Cache-Efficient Algorithms

Average case performance of merge-sort in the cache model

We want to estimate the number of cache misses incurred while running the algorithm:

Compulsory misses are unavoidable.

Capacity misses are minimized by the I/O algorithm.

We can quantify the expected number of conflict misses.

Page 24: Towards a Theory of Cache-Efficient Algorithms

When does a conflict miss occur?

$s$ cache sets are available for the $k$ runs $S_1 \ldots S_k$.

The expected number of elements in any run $S_i$ is $N/k$.

A leading block is a cache line containing the leading element of a run; $b_i$ is the leading block of $S_i$.

A conflict occurs when two leading blocks are mapped to the same cache set.

Page 25: Towards a Theory of Cache-Efficient Algorithms

When does a conflict miss occur – Cont.

Formally: a conflict miss occurs for element $S_{i,j+1}$ when there is at least one element $x$ in a leading block $b_k$, $k \ne i$, such that $S_{i,j} < x < S_{i,j+1}$ and $S(b_i) = S(b_k)$, where $S(b)$ denotes the cache set to which block $b$ maps.

[Figure: runs $S_i$ and $S_k$; element $x$ in the leading block of $S_k$ falls between elements $j$ and $j+1$ of $S_i$.]

Page 26: Towards a Theory of Cache-Efficient Algorithms

How many conflict misses to expect

$P_i$ = the probability of a conflict for element $i$, $1 \le i \le N$.

Assume uniform distributions: of the leading blocks among the cache sets, and of the leading element within its leading block.

If $k$ is $\Omega(s)$ then $P_i$ is $\Omega(1)$, so in each pass the expected number of conflict misses is $\Omega(N)$.
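The $\Omega(1)$ claim can be checked with a small Monte Carlo experiment under the uniformity assumption above (function name and parameters are ours):

```python
import random

def conflict_probability(k, s, trials=100_000):
    """Estimate the chance that a fixed leading block shares its cache set
    with at least one of the other k-1 leading blocks, each mapped
    uniformly at random to one of s sets."""
    hits = sum(
        any(random.randrange(s) == 0 for _ in range(k - 1))  # our set is 0
        for _ in range(trials)
    )
    return hits / trials

print(conflict_probability(k=16, s=64))  # ~ 1 - (1 - 1/64)^15  ~= 0.21
print(conflict_probability(k=64, s=64))  # ~ 1 - (1 - 1/64)^63  ~= 0.63
```

As $k$ grows toward $s$, the collision probability approaches a constant, matching $P_i = \Omega(1)$ when $k = \Omega(s)$.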

Page 27: Towards a Theory of Cache-Efficient Algorithms

How many conflict misses to expect – Cont.

The expected number of conflict misses throughout merge-sort is

$$\Omega\!\left(N \log_{M/B} \frac{N}{B}\right)$$

that is, $\Omega(N)$ misses in each pass. By choosing $k \ll s$ we reduce the probability of conflict misses, but we incur more capacity misses.

Page 28: Towards a Theory of Cache-Efficient Algorithms

Conclusions

There is a way to transform I/O-efficient algorithms into cache-efficient algorithms.

The emulation applies only to a blocking, direct-mapped cache that does not distinguish between reads and writes.

The constants hidden in these asymptotic bounds are important in practice.