
Page 1: Towards a Theory of Cache-Efficient Algorithms

Towards a Theory of Cache-Efficient Algorithms

Summary for the seminar:

Analysis of algorithms in hierarchical memory – Spring 2004

by Gala Golan

Page 2: Towards a Theory of Cache-Efficient Algorithms

The RAM Model

In the previous lecture we discussed caching in the operating-system setting

We saw a lower bound on sorting:

N = number of elements to sort
B = number of elements in each block
M = size of fast memory, in elements

$$\Omega\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)$$
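To make the bound concrete, here is a minimal Python sketch that evaluates it numerically; the parameter values are hypothetical, chosen only for illustration:

```python
import math

def io_sort_lower_bound(N, B, M):
    """Block transfers needed to sort N elements in I(M,B):
    Omega((N/B) * log_{M/B} (N/B))."""
    return (N / B) * math.log(N / B) / math.log(M / B)

# Hypothetical parameters: 2^26 elements, 8 elements per block,
# fast memory holding 2^13 elements.
print(f"{io_sort_lower_bound(N=2**26, B=2**3, M=2**13):.3e}")
```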

Page 3: Towards a Theory of Cache-Efficient Algorithms

The I/O Model

1. A datum can be accessed only from fast memory

2. B elements are brought to memory in each access

3. Computation cost << I/O cost

4. A block of data can be placed anywhere in fast memory

5. I/O operations are explicit
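Assumption 2 is what makes scanning cheap: reading N contiguous elements costs about N/B block transfers rather than N accesses. A minimal sketch of that count (the function name is ours, for illustration):

```python
import math

def scan_io_cost(N, B):
    """I/O cost of reading N contiguous elements in I(M,B):
    each explicit I/O brings B elements, so ceil(N/B) transfers suffice."""
    return math.ceil(N / B)

print(scan_io_cost(N=1000, B=32))  # 32 block transfers, not 1000 accesses
```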

Page 4: Towards a Theory of Cache-Efficient Algorithms

The Cache Model

1. A datum can be accessed only from fast memory √

2. B elements are brought to memory in each access √

3. Computation cost << I/O cost: L denotes the normalized cache latency; accessing a block from cache costs 1

4. A block of data can be placed anywhere in fast memory: here, instead, a fixed mapping distributes main memory blocks across the cache

5. I/O operations are explicit: here, instead, the cache is not visible to the programmer
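A toy cost model may help here; this sketch (our own simplification, not the paper's formal machinery) charges 1 per cache hit and L per miss, with the fixed direct mapping of assumption 4:

```python
class DirectMappedCache:
    """Toy cost model for C(M,B,L) with m frames: block b may reside
    only in frame b mod m; a hit costs 1, a miss costs L."""

    def __init__(self, m, L):
        self.frames = [None] * m   # frames[i] = block currently cached there
        self.L = L
        self.cost = 0

    def access(self, block):
        i = block % len(self.frames)    # the fixed mapping: one legal frame
        if self.frames[i] == block:
            self.cost += 1              # hit: normalized latency 1
        else:
            self.frames[i] = block      # miss: fetch block, evict silently
            self.cost += self.L

cache = DirectMappedCache(m=4, L=10)
for b in [0, 4, 0, 4]:                  # blocks 0 and 4 collide in frame 0
    cache.access(b)
print(cache.cost)                       # 40: every access misses
```

Note the contrast with assumption 5: the program above never issues an explicit I/O; the misses happen implicitly.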

Page 5: Towards a Theory of Cache-Efficient Algorithms

Notation

I(M,B) – the I/O model
C(M,B,L) – the cache model
n = N/B, m = M/B – the sizes of the data and of memory in blocks (instead of elements)

The goal of algorithm design is to minimize running time = (number of cache accesses) + L · (number of memory accesses)

Page 6: Towards a Theory of Cache-Efficient Algorithms

Reminder – Cache Associativity

Associativity specifies the number of different frames in which a memory block can reside.

[Figure: fully associative, direct-mapped, and 2-way set-associative placements of a block.]
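The placement rule can be stated in one line; a small sketch (the helper name is ours) showing how the set index changes with associativity:

```python
def cache_set(block, num_frames, k):
    """Set index of a block in a k-way set-associative cache with
    num_frames frames: the block may occupy any of the k frames of
    set (block mod num_sets). k=1 is direct mapped; k=num_frames is
    fully associative (a single set)."""
    num_sets = num_frames // k
    return block % num_sets

print(cache_set(block=13, num_frames=8, k=1))  # direct mapped: set 5
print(cache_set(block=13, num_frames=8, k=2))  # 2-way: set 1
print(cache_set(block=13, num_frames=8, k=8))  # fully associative: set 0
```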

Page 7: Towards a Theory of Cache-Efficient Algorithms

Emulation Theorem

An algorithm A in I(M,B) using T block transfers and I processing time can be converted to an equivalent algorithm Ac in C(M,B,L) that runs in O(I + (L+B)·T) steps.

The additional memory requirement is m blocks.

In other words: an algorithm that is efficient in main memory can be made efficient in cache.
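The accounting behind the O(I + (L+B)·T) term can be sketched as follows; this is an illustration of the idea under our own simplifying assumptions, not the paper's exact construction:

```python
def emulate_transfer(Mem, Buf, src_block, dst_frame, B):
    """One emulated I/O of algorithm Ac: copy block src_block of main
    memory Mem[] into frame dst_frame of the buffer Buf[], element by
    element. The copy touches O(1) cache lines (O(L) miss cost) and
    moves B elements (O(B) work), so T transfers add O((L+B)T)."""
    for j in range(B):
        Buf[dst_frame * B + j] = Mem[src_block * B + j]

Mem = list(range(64))      # main memory: n = 16 blocks of B = 4 elements
Buf = [None] * 16          # emulated fast memory: m = 4 blocks
emulate_transfer(Mem, Buf, src_block=5, dst_frame=2, B=4)
print(Buf[8:12])           # [20, 21, 22, 23]
```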

Page 8: Towards a Theory of Cache-Efficient Algorithms

Proof (1)

[Figure: the direct-mapped cache C[1..m], main memory Mem[1..n], and the emulation buffer Buf[] of m blocks kept in main memory.]

Page 9: Towards a Theory of Cache-Efficient Algorithms

Proof (2)

[Figure: the same layout, with blocks a and b marked in Buf[].]

Page 10: Towards a Theory of Cache-Efficient Algorithms

Proof (3)

[Figure: the same layout, with blocks a and b in Buf[] and a frame q marked.]

Page 11: Towards a Theory of Cache-Efficient Algorithms

Proof (4)

[Figure: the same layout; block b also shown in main memory.]

Page 12: Towards a Theory of Cache-Efficient Algorithms

Proof (5)

[Figure: the same layout, with blocks a, b and frame q marked.]

Page 13: Towards a Theory of Cache-Efficient Algorithms

Proof (6)

[Figure: the same layout, with blocks a, b and frame q marked.]

Page 14: Towards a Theory of Cache-Efficient Algorithms

Block efficient algorithms

An algorithm is block efficient if it performs computation on at least a constant fraction of the elements in each block transferred.

In such a case B·T = O(I), so an algorithm for I(M,B) can be emulated in C(M,B,L) in O(I + L·T) steps.

The algorithms for sorting, FFT, and matrix transposition are block efficient.

Page 15: Towards a Theory of Cache-Efficient Algorithms

Extension to set-associative cache

In a set-associative cache, if all k lines of a set are occupied, the hardware uses LRU to choose which block of the set to replace.

In the emulation technique described before we do not have explicit control of the replacement.

Instead, a property of LRU will be used, and the cache will be used only partially.

Page 16: Towards a Theory of Cache-Efficient Algorithms

Optimal Replacement Algorithm for Cache

OPT or MIN – a hypothetical algorithm that minimizes cache misses for a given (finite) access trace.

Offline – it knows in advance which blocks will be accessed next.

It evicts from the cache the block whose next access lies farthest in the future.

Proven to be optimal – better than any online algorithm.

Proposed by Belady in 1966. Used to theoretically test the efficiency of online algorithms.
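Belady's policy is short enough to state as code; a minimal, unoptimized sketch:

```python
def belady_misses(trace, m):
    """OPT/MIN: on a miss with a full cache of m frames, evict the block
    whose next use lies farthest in the future (or never recurs).
    Returns the number of misses on the given access trace."""
    cache, misses = set(), 0
    for t, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) == m:
            def next_use(b):           # look ahead for b's next access
                for u in range(t + 1, len(trace)):
                    if trace[u] == b:
                        return u
                return float('inf')    # never used again: ideal victim
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses

print(belady_misses([1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5], m=3))  # 7
```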

Page 17: Towards a Theory of Cache-Efficient Algorithms

LRU vs. OPT

For any constant factor c > 1, LRU with fast memory size m makes at most c times as many misses as OPT with fast memory size (1-1/c)m.

For example, with c = 3: LRU with cache size m makes at most 3 times as many misses as OPT with memory of size (2/3)m, e.g. m = 9 frames for LRU against (1 − 1/3) · 9 = 6 frames for OPT.

[Figure: LRU with 9 frames – at most 3X misses; OPT with 6 frames – X misses.]
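To see the bound empirically, one can pit an LRU simulator against the belady_misses sketch above on a random trace; the trace and sizes below are arbitrary choices for illustration:

```python
import random
from collections import OrderedDict

def lru_misses(trace, m):
    """Count LRU misses with m frames (OrderedDict tracks recency)."""
    cache, misses = OrderedDict(), 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)        # mark most recently used
        else:
            misses += 1
            if len(cache) == m:
                cache.popitem(last=False)   # evict least recently used
            cache[block] = True
    return misses

# c = 3: LRU with 9 frames vs. OPT with (1 - 1/3) * 9 = 6 frames.
# The theorem bounds the left side by 3 times the right (up to an
# additive term); belady_misses is the sketch from the previous slide.
trace = [random.randrange(20) for _ in range(500)]
print(lru_misses(trace, 9), "vs. 3 *", belady_misses(trace, 6))
```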

Page 18: Towards a Theory of Cache-Efficient Algorithms

Extension to set-associative cache – Cont.

Similarly, LRU with cache size m makes at most 2 times as many misses as OPT with memory of size m/2.

We emulate the I/O algorithm using only half of Buf[]: instead of k cache lines per set, only k/2 are used.

These k/2 blocks are managed optimally, by the optimality of the I/O algorithm.

In the real cache, all k lines are managed by LRU and incur at most twice as many misses.

Page 19: Towards a Theory of Cache-Efficient Algorithms

Extension to set-associative cache – Cont.

[Figure: the cache C[1..m], main memory Mem[1..n], and the emulation buffer Buf[], of which only half is used.]

Page 20: Towards a Theory of Cache-Efficient Algorithms

Generalized Emulation Theorem

An algorithm A in I(M/2,B) using T block transfers and I processing time can be converted to an equivalent algorithm Ac in the k-way associative cache model C(M,B,L) that runs in O(I + (L+B)·T) steps.

The additional memory requirement is m/2 blocks.

Page 21: Towards a Theory of Cache-Efficient Algorithms

The cache complexity of sorting

The lower bound for sorting in I(M,B) is

$$T = \Omega\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)$$

The lower bound for sorting in C(M,B,L) is

$$\Omega\!\left(N \log N + L \cdot \frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)$$

where I = computations and T = I/O operations account for the two terms.
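For intuition about when each term dominates, a quick numeric comparison with hypothetical parameters (our choice, for illustration only):

```python
import math

def cache_sort_lower_bound_terms(N, B, M, L):
    """The two terms of the C(M,B,L) bound: computation N log N and
    I/O cost L * (N/B) * log_{M/B}(N/B)."""
    comp = N * math.log2(N)
    io = L * (N / B) * math.log(N / B) / math.log(M / B)
    return comp, io

comp, io = cache_sort_lower_bound_terms(N=2**26, B=2**3, M=2**13, L=50)
print(f"computation ~ {comp:.2e}, L * transfers ~ {io:.2e}")
```

With these values the two terms come out the same order of magnitude, which is why neither can be dropped from the bound.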

Page 22: Towards a Theory of Cache-Efficient Algorithms

Cache Miss Classes

Compulsory Miss – a block is being referenced for the first time

Capacity Miss – a block was evicted from the cache because the cache is too small

Conflict Miss – a block was evicted from the cache because another block was mapped to the same set.
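A common way to attribute misses to these classes is to compare against a fully associative LRU cache of the same size; the following sketch uses that textbook convention (our choice of attribution scheme, for illustration):

```python
from collections import OrderedDict

def classify_misses(trace, num_frames):
    """Classify the misses of a direct-mapped cache with num_frames frames.
    compulsory: first reference to the block; conflict: the reference would
    have hit in a fully associative LRU cache of equal size; capacity: rest."""
    direct, full = {}, OrderedDict()
    seen = set()
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for block in trace:
        frame = block % num_frames
        direct_hit = direct.get(frame) == block
        full_hit = block in full
        if full_hit:                        # maintain the fully associative
            full.move_to_end(block)         # LRU shadow cache
        else:
            if len(full) == num_frames:
                full.popitem(last=False)
            full[block] = True
        if not direct_hit:
            if block not in seen:
                counts["compulsory"] += 1
            elif full_hit:
                counts["conflict"] += 1
            else:
                counts["capacity"] += 1
            direct[frame] = block
        seen.add(block)
    return counts

print(classify_misses([0, 4, 0, 4, 1, 2, 3], num_frames=4))
# {'compulsory': 5, 'capacity': 0, 'conflict': 2}
```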

Page 23: Towards a Theory of Cache-Efficient Algorithms

Average case performance of merge-sort in the cache model

We want to estimate the number of cache misses incurred while running the algorithm:

Compulsory misses are unavoidable.

Capacity misses are minimized by the I/O algorithm.

We can quantify the expected number of conflict misses.

Page 24: Towards a Theory of Cache-Efficient Algorithms

When does a conflict miss occur?

$s$ cache sets are available for the $k$ runs $S_1 \ldots S_k$.

The expected number of elements in any run $S_i$ is $N/k$.

A leading block is a cache line containing the leading element of a run; $b_i$ is the leading block of $S_i$.

A conflict occurs when two leading blocks are mapped to the same cache set.

Page 25: Towards a Theory of Cache-Efficient Algorithms

When does a conflict miss occur – Cont.

Formally: a conflict miss occurs for element $S_{i,j+1}$ when there is at least one element $x$ in a leading block $b_k$, $k \ne i$, such that $S_{i,j} < x < S_{i,j+1}$ and $S(b_i) = S(b_k)$, where $S(b)$ denotes the cache set to which block $b$ maps.

[Figure: runs $S_i$ and $S_k$; element $x$ in the leading block of $S_k$ falls between elements $j$ and $j+1$ of $S_i$.]

Page 26: Towards a Theory of Cache-Efficient Algorithms

How many conflict misses to expect

$P_i$ = the probability of a conflict for element $i$, $1 \le i \le N$.

Assume uniform distributions: of the leading blocks among the cache sets, and of the leading element within its leading block.

If $k$ is $\Omega(s)$ then $P_i$ is $\Omega(1)$, so in each pass the expected number of conflict misses is $\Omega(N)$.
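The $\Omega(1)$ claim can be checked with a small Monte Carlo experiment under the uniformity assumption above (function name and parameters are ours):

```python
import random

def conflict_probability(k, s, trials=100_000):
    """Estimate the chance that a fixed leading block shares its cache set
    with at least one of the other k-1 leading blocks, each mapped
    uniformly at random to one of s sets."""
    hits = sum(
        any(random.randrange(s) == 0 for _ in range(k - 1))  # our set is 0
        for _ in range(trials)
    )
    return hits / trials

print(conflict_probability(k=16, s=64))  # ~ 1 - (1 - 1/64)^15  ~= 0.21
print(conflict_probability(k=64, s=64))  # ~ 1 - (1 - 1/64)^63  ~= 0.63
```

As $k$ grows toward $s$, the collision probability approaches a constant, matching $P_i = \Omega(1)$ when $k = \Omega(s)$.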

Page 27: Towards a Theory of Cache-Efficient Algorithms

How many conflict misses to expect – Cont.

The expected number of conflict misses throughout merge-sort is

$$\Omega\!\left(N \log_{M/B} \frac{N}{B}\right)$$

that is, $\Omega(N)$ misses in each pass. By choosing $k \ll s$ we reduce the probability of conflict misses, but we incur more capacity misses.

Page 28: Towards a Theory of Cache-Efficient Algorithms

Conclusions

There is a way to transform I/O-efficient algorithms into cache-efficient algorithms.

The emulation applies only to a blocking, direct-mapped cache that does not distinguish between reads and writes.

The constants hidden in these asymptotic bounds are important in practice.