3.2 Cache Oblivious Algorithms

Upload: baina

Post on 19-Jan-2016


Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.


CS598dhp


Outline

Cache complexity
Cache aware algorithms
Cache oblivious algorithms
  Matrix multiplication
  Matrix transposition
  FFT
Conclusion


Assumption

Only two levels in the memory hierarchy:
  An ideal cache
    Fully associative
    Optimal replacement strategy
    “Tall cache”: the number of cache lines is at least proportional to the length of one line
  A very large memory


An Ideal Cache Model

An ideal cache model (Z,L)

Z: Total words in the cache

L: Words in one cache line


Cache Complexity

An algorithm with input size n is measured by:
  Work complexity W(n)
  Cache complexity Q(n; Z, L): the number of cache misses it incurs
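As a rough illustration of Q(n; Z, L), the sketch below counts the misses incurred by a stream of word addresses. It is not the ideal cache of the model: it assumes LRU replacement as a stand-in for optimal replacement, and the name count_misses is hypothetical.

```python
from collections import OrderedDict

def count_misses(addresses, Z, L):
    """Count misses on a fully associative cache of Z words with L-word lines.

    Assumption: LRU replacement stands in for the optimal replacement
    strategy of the ideal-cache model.
    """
    num_lines = Z // L
    cache = OrderedDict()              # line id -> None, oldest entry first
    misses = 0
    for addr in addresses:
        line = addr // L               # cache line holding this word
        if line in cache:
            cache.move_to_end(line)    # mark as most recently used
        else:
            misses += 1
            if len(cache) >= num_lines:
                cache.popitem(last=False)   # evict least recently used line
            cache[line] = None
    return misses

# Scanning 1000 contiguous words with L = 8 touches 1000 / 8 = 125 lines.
print(count_misses(range(1000), Z=256, L=8))        # 125
# Touching one word per line gives one miss per access.
print(count_misses(range(0, 8000, 8), Z=256, L=8))  # 1000
```

The two calls show the effect of spatial locality: the same number of accesses costs 125 misses when contiguous and 1000 misses when strided by the line length.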


Cache Aware Algorithms

Contain parameters to minimize the cache complexity for a particular cache size (Z) and line length (L).

Need to adjust parameters when running on different platforms.


Example:

A blocked matrix multiplication algorithm

s is a tuning parameter to make the algorithm run fast

[Figure: the n x n matrix A partitioned into s x s blocks; A11 is the first block.]
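A minimal sketch of a blocked multiplication, assuming square matrices stored as Python lists of lists. The name block_mult and the loop order are illustrative choices, not taken from the slides; s is the tuning parameter.

```python
def block_mult(A, B, s):
    """Multiply two n x n matrices using s x s blocks (cache aware: s is tuned)."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, s):              # block row of C
        for j0 in range(0, n, s):          # block column of C
            for k0 in range(0, n, s):      # block index along the shared dimension
                # C[i0.., j0..] += A[i0.., k0..] * B[k0.., j0..] on one s x s block triple
                for i in range(i0, min(i0 + s, n)):
                    for k in range(k0, min(k0 + s, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + s, n)):
                            C[i][j] += a * B[k][j]
    return C

print(block_mult([[1, 2], [3, 4]], [[5, 6], [7, 8]], s=1))  # [[19, 22], [43, 50]]
```

The result does not depend on s; only the memory access pattern does, which is why s has to be re-tuned for each platform.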


Example (2)

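Cache complexity

The three s x s submatrices should fit into the cache, so they occupy $\Theta(\max(s, s^2/L)) = \Theta(s + s^2/L)$ cache lines.
Optimal performance is obtained when $s = \Theta(\sqrt{Z})$.
$Z/L$ cache misses are needed to bring the 3 submatrices into cache.
$n^2/L$ cache misses are needed to read the $n^2$ elements.
The total cache complexity is
$$Q(n) = \Theta\left(1 + n^2/L + (n/s)^3 (Z/L)\right) = \Theta\left(1 + n^2/L + n^3/(L\sqrt{Z})\right)$$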


Cache Oblivious Algorithms

Have no parameters that depend on the hardware, such as cache size (Z) or cache-line length (L).
No tuning needed, platform independent.

The algorithms introduced in the following slides are proved to have optimal cache complexity.


Matrix Multiplication

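Partition matrices A and B in half along the largest dimension. A: n x m, B: m x p

Proceed recursively until reaching the base case of one element.

n ≥ max (m, p): split A horizontally
  $\begin{pmatrix} A_1 \\ A_2 \end{pmatrix} B = \begin{pmatrix} A_1 B \\ A_2 B \end{pmatrix}$

m ≥ max (n, p): split A vertically and B horizontally
  $\begin{pmatrix} A_1 & A_2 \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} = A_1 B_1 + A_2 B_2$

p ≥ max (n, m): split B vertically
  $A \begin{pmatrix} B_1 & B_2 \end{pmatrix} = \begin{pmatrix} A B_1 & A B_2 \end{pmatrix}$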
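The three cases can be sketched as a recursion over index ranges. This is an illustration under assumptions: the names rec_mult and multiply are hypothetical, matrices are Python lists of lists (so the memory layout is not truly contiguous), and the base case is a single scalar product as on the slide.

```python
def rec_mult(A, B, C, i0, n, k0, m, j0, p):
    """C[i0:i0+n][j0:j0+p] += A[i0:i0+n][k0:k0+m] * B[k0:k0+m][j0:j0+p]."""
    if n == 1 and m == 1 and p == 1:
        C[i0][j0] += A[i0][k0] * B[k0][j0]      # base case: one element
    elif n >= max(m, p):                        # split the rows of A
        h = n // 2
        rec_mult(A, B, C, i0, h, k0, m, j0, p)
        rec_mult(A, B, C, i0 + h, n - h, k0, m, j0, p)
    elif m >= max(n, p):                        # split the shared dimension
        h = m // 2
        rec_mult(A, B, C, i0, n, k0, h, j0, p)
        rec_mult(A, B, C, i0, n, k0 + h, m - h, j0, p)
    else:                                       # p is largest: split the columns of B
        h = p // 2
        rec_mult(A, B, C, i0, n, k0, m, j0, h)
        rec_mult(A, B, C, i0, n, k0, m, j0 + h, p - h)

def multiply(A, B):
    """Multiply a non-empty n x m matrix by a non-empty m x p matrix."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    rec_mult(A, B, C, 0, n, 0, m, 0, p)
    return C

print(multiply([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]]))
# [[58, 64], [139, 154]]
```

No cache size or line length appears anywhere in the code; the recursion alone produces subproblems of every size.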


Matrix Multiplication (2)

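Assume the sizes of A and B are n x 4n and 4n x n.

$$A B = \begin{pmatrix} A_1 & A_2 \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} = A_1 B_1 + A_2 B_2$$

$$A_1 B_1 = \begin{pmatrix} A_{11} & A_{12} \end{pmatrix} \begin{pmatrix} B_{11} \\ B_{12} \end{pmatrix} = A_{11} B_{11} + A_{12} B_{12}$$

$$A_2 B_2 = \begin{pmatrix} A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{21} \\ B_{22} \end{pmatrix} = A_{21} B_{21} + A_{22} B_{22}$$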


Matrix Multiplication (3)

Intuitively, once a subproblem fits into the cache, its smaller subproblems can be solved in cache with no further misses.


Matrix Multiplication (4)

Cache complexity
  Can achieve the same cache complexity as the Block-MULT algorithm (cache aware).
  For square matrices, the optimal cache complexity is achieved.


Matrix Transposition

A → A^T, where A is m x n and B = A^T is n x m

for i ← 1 to m
  for j ← 1 to n
    B( j, i ) = A( i, j )

If n is very large, the column-wise access to B causes a cache miss every time! (No spatial locality in B.)


Matrix Transposition (2)

Partition array A along the longer dimension and recursively execute the transpose function.

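$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \;\rightarrow\; A^T = \begin{pmatrix} A_{11}^T & A_{21}^T \\ A_{12}^T & A_{22}^T \end{pmatrix}$$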
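A minimal sketch of the recursive transpose, halving the longer dimension at each step. The names rec_transpose and transpose are hypothetical, and Python lists of lists only illustrate the recursion order, not a contiguous row-major layout.

```python
def rec_transpose(A, B, i0, m, j0, n):
    """Write the transpose of the m x n block of A at (i0, j0) into B."""
    if m == 1 and n == 1:
        B[j0][i0] = A[i0][j0]                    # base case: one element
    elif m >= n:                                 # more rows than columns: split rows
        h = m // 2
        rec_transpose(A, B, i0, h, j0, n)
        rec_transpose(A, B, i0 + h, m - h, j0, n)
    else:                                        # split columns
        h = n // 2
        rec_transpose(A, B, i0, m, j0, h)
        rec_transpose(A, B, i0, m, j0 + h, n - h)

def transpose(A):
    """Return the n x m transpose of a non-empty m x n matrix."""
    m, n = len(A), len(A[0])
    B = [[None] * m for _ in range(n)]
    rec_transpose(A, B, 0, m, 0, n)
    return B

print(transpose([[1, 2, 3], [4, 5, 6]]))  # [[1, 4], [2, 5], [3, 6]]
```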


Matrix Transposition (3)

Cache complexity
  It has the optimal cache complexity
  $Q(m, n) = \Theta(1 + mn/L)$
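To make the bound concrete, the sketch below replays the address traces of the row-by-row loop and of the recursive transpose through a small simulated cache. Everything here is an assumption for illustration: LRU replacement instead of optimal replacement, row-major storage with A at word 0 and B directly after it, and the hypothetical names count_misses, naive_trace and recursive_trace.

```python
from collections import OrderedDict

def count_misses(addresses, Z, L):
    """Misses on a fully associative LRU cache of Z words with L-word lines."""
    cache, misses = OrderedDict(), 0
    for addr in addresses:
        line = addr // L
        if line in cache:
            cache.move_to_end(line)
        else:
            misses += 1
            if len(cache) >= Z // L:
                cache.popitem(last=False)   # evict least recently used line
            cache[line] = None
    return misses

def naive_trace(N):
    """Addresses touched by the doubly nested loop on an N x N matrix."""
    for i in range(N):
        for j in range(N):
            yield i * N + j              # read A(i, j)
            yield N * N + j * N + i      # write B(j, i)

def recursive_trace(N):
    """Addresses touched by the recursive transpose on an N x N matrix."""
    def rec(i0, m, j0, n):
        if m == 1 and n == 1:
            yield i0 * N + j0
            yield N * N + j0 * N + i0
        elif m >= n:
            h = m // 2
            yield from rec(i0, h, j0, n)
            yield from rec(i0 + h, m - h, j0, n)
        else:
            h = n // 2
            yield from rec(i0, m, j0, h)
            yield from rec(i0, m, j0 + h, n - h)
    yield from rec(0, N, 0, N)

print(count_misses(naive_trace(64), Z=256, L=8))      # 4608
print(count_misses(recursive_trace(64), Z=256, L=8))  # 1024
```

With N = 64 and L = 8 the two matrices occupy 2 * 64 * 64 / 8 = 1024 lines, so in this setting the recursive order loads each line exactly once, while the loop misses on every write to B.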


Fast Fourier Transform

Use the Cooley-Tukey algorithm.
Cooley-Tukey algorithms recursively re-express a DFT of a composite size n = n1n2 as:
  Perform n2 DFTs of size n1.
  Multiply by complex roots of unity called twiddle factors.
  Perform n1 DFTs of size n2.

The DFT of X is
$$Y[i] = \sum_{j=0}^{n-1} X[j]\,\omega_n^{-ij}$$
where $\omega_n = e^{2\pi\sqrt{-1}/n}$ is a primitive n-th root of unity.


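Writing $i = i_1 + i_2 n_1$ and $j = j_1 n_2 + j_2$, the DFT
$$Y[i] = \sum_{j=0}^{n-1} X[j]\,\omega_n^{-ij}$$
becomes
$$Y[i_1 + i_2 n_1] = \sum_{j_2=0}^{n_2-1} \left[ \left( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2]\,\omega_{n_1}^{-i_1 j_1} \right) \omega_n^{-i_1 j_2} \right] \omega_{n_2}^{-i_2 j_2}$$

[Figure: X viewed as a matrix with dimensions n1 and n2.]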


Assume X is a row-major n1 × n2 matrix.
Steps:
  1. Transpose X in place.
  2. Compute n2 DFTs of size n1.
  3. Multiply by twiddle factors.
  4. Transpose X in place.
  5. Compute n1 DFTs of size n2.
  6. Transpose X in place.
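The six steps can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: n must be a power of 2, the transposes copy into a new list rather than working in place, twiddle factors use the convention exp(-2πi/n), and the names fft, transpose and naive_dft are hypothetical.

```python
import cmath

def transpose(a, rows, cols):
    """Transpose a row-major rows x cols array into a row-major cols x rows list."""
    return [a[r * cols + c] for c in range(cols) for r in range(rows)]

def fft(x):
    """DFT of a list whose length n is a power of 2, following the six steps."""
    n = len(x)
    if n == 1:
        return [x[0]]
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    k = n.bit_length() - 1                 # n = 2**k
    n1 = 1 << ((k + 1) // 2)               # n1 * n2 = n, both close to sqrt(n)
    n2 = n // n1
    a = transpose(x, n1, n2)               # step 1: now n2 rows of length n1
    for j2 in range(n2):                   # step 2: n2 DFTs of size n1
        a[j2 * n1:(j2 + 1) * n1] = fft(a[j2 * n1:(j2 + 1) * n1])
    for j2 in range(n2):                   # step 3: twiddle factors w_n^(-i1*j2)
        for i1 in range(n1):
            a[j2 * n1 + i1] *= cmath.exp(-2j * cmath.pi * i1 * j2 / n)
    a = transpose(a, n2, n1)               # step 4: now n1 rows of length n2
    for i1 in range(n1):                   # step 5: n1 DFTs of size n2
        a[i1 * n2:(i1 + 1) * n2] = fft(a[i1 * n2:(i1 + 1) * n2])
    return transpose(a, n1, n2)            # step 6: restore the output order

def naive_dft(x):
    """Direct O(n^2) evaluation of the DFT definition, used only as a check."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * i * j / n) for j in range(n))
            for i in range(n)]

data = [1, 2, 3, 4, 5, 6, 7, 8]
error = max(abs(u - v) for u, v in zip(fft(data), naive_dft(data)))
print(error < 1e-9)  # True
```

For n = 8 the split is n1 = 4 and n2 = 2. Each transpose makes the next batch of DFTs operate on contiguous rows, which is where the locality comes from.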


Fast Fourier Transform

Example: n1=4, n2=2

  Transpose to select n2 DFTs of size n1
  Call FFT recursively with n1=2, n2=2
  Reach the base case, return
  Multiply by twiddle factors
  Transpose to select n1 DFTs of size n2
  Transpose and return


Fast Fourier Transform

Cache complexity
  Optimal for a Cooley-Tukey algorithm, when n is an exact power of 2
  $Q(n) = O(1 + (n/L)(1 + \log_Z n))$


Other Cache Oblivious Algorithms

Funnelsort
Distribution sort
LU decomposition without pivots


Questions

How large is the range of practicality of cache-oblivious algorithms?

What are the relative strengths of cache-oblivious and cache-aware algorithms?


Practicality of Cache-oblivious Algorithms

[Figure: average time to transpose an N x N matrix, divided by N².]


Practicality of Cache-oblivious Algorithms (2)

[Figure: average time taken to multiply two N x N matrices, divided by N³.]


Question 2

Do cache-oblivious algorithms perform as well as cache-aware algorithms?
  FFTW library
  No answer yet.


References

Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.

Cache-Oblivious Algorithms by Harald Prokop. Master's Thesis, MIT Department of Electrical Engineering and Computer Science. June 1999.

Optimizing Matrix Multiplication with a Classifier Learning System by Xiaoming Li and María Jesus Garzarán. LCPC 2005.