3.2 Cache Oblivious Algorithms
CS598dhp
Cache-Oblivious Algorithms, by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.
Outline
- Cache complexity
- Cache aware algorithms
- Cache oblivious algorithms
  - Matrix multiplication
  - Matrix transposition
  - FFT
- Conclusion
Assumption
Only two levels in the memory hierarchy:
- An ideal cache
  - Fully associative
  - Optimal replacement strategy
  - "Tall cache": $Z = \Omega(L^2)$, with Z and L as defined in the ideal cache model
- A very large memory
An Ideal Cache Model
An ideal cache model (Z,L)
Z: Total words in the cache
L: Words in one cache line
Cache Complexity
An algorithm with input size n is measured by:
- Work complexity: W(n)
- Cache complexity: Q(n; Z, L), the number of cache misses it incurs
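To make Q(n; Z, L) concrete, here is a minimal sketch of a miss counter for a trace of word addresses. It is an illustration under stated assumptions, not the model itself: the ideal cache uses optimal replacement, whereas this sketch substitutes LRU as a simple stand-in, and the name `count_misses` is hypothetical.

```python
from collections import OrderedDict

def count_misses(addresses, Z, L):
    """Count cache misses for a sequence of word addresses.

    Z is the cache size in words and L the line length in words, so the
    cache holds Z // L lines. The cache is fully associative.
    Assumption: LRU replacement stands in for the optimal strategy.
    """
    num_lines = Z // L
    cache = OrderedDict()          # line number -> True, oldest first
    misses = 0
    for addr in addresses:
        line = addr // L           # the cache line holding this word
        if line in cache:
            cache.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses
```

A sequential scan of n words costs about n/L misses, which is the best any algorithm that reads all n words can do.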
Cache Aware Algorithms
Contain parameters to minimize the cache complexity for a particular cache size (Z) and line length (L).
Need to adjust parameters when running on different platforms.
Example:
A blocked matrix multiplication algorithm
s is a tuning parameter to make the algorithm run fast
[Figure: an n x n matrix A partitioned into s x s blocks, with one block labeled A11.]
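A minimal sketch of the blocked algorithm's loop structure, assuming square matrices stored as lists of lists. The function name `blocked_matmul` is hypothetical, and Python lists do not model a real cache layout; the point is only how the tuning parameter s shapes the loops.

```python
def blocked_matmul(A, B, s):
    """Multiply two n x n matrices using s x s blocks.

    s is the tuning parameter: a cache aware implementation would pick it
    so that three s x s blocks fit in the cache at once.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    # Outer three loops walk over blocks; inner three multiply one block pair.
    for i0 in range(0, n, s):
        for j0 in range(0, n, s):
            for k0 in range(0, n, s):
                for i in range(i0, min(i0 + s, n)):
                    for k in range(k0, min(k0 + s, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + s, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The result does not depend on s; only the order of memory accesses does, which is why s has to be re-tuned per platform.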
Example (2)
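Cache complexity
- The three s x s submatrices should fit into the cache, so they occupy $\Theta(\max(s, s^2/L)) = \Theta(s + s^2/L)$ cache lines.
- Optimal performance is obtained when $s = \Theta(\sqrt{Z})$.
- $\Theta(Z/L)$ cache misses are needed to bring the 3 submatrices into the cache.
- $\Theta(n^2/L)$ cache misses are needed to read $n^2$ elements.
- The cache complexity is
  $Q(n) = \Theta(1 + n^2/L + (n/s)^3 (Z/L)) = \Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$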
Cache Oblivious Algorithms
- Have no parameters that depend on the hardware, such as cache size (Z) or cache-line length (L).
- No tuning needed; platform independent.
- The algorithms presented next are proved to have optimal cache complexity.
Matrix Multiplication
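- Partition matrices A and B in half along the largest dimension, where A is n x m and B is m x p:
  - n ≥ max(m, p): split A horizontally into A1 (top) and A2 (bottom); A*B is A1*B stacked on A2*B.
  - m ≥ max(n, p): split A vertically into (A1 A2) and B horizontally into B1 (top) and B2 (bottom); A*B = A1*B1 + A2*B2.
  - p ≥ max(n, m): split B vertically into (B1 B2); A*B = (A*B1 A*B2).
- Proceed recursively until reaching the base case of one element.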
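The three cases can be sketched as a short recursive function. This is an illustration only, assuming non-empty matrices stored as lists of lists; the name `rec_matmul` is hypothetical, and the slicing copies submatrices, whereas a real implementation would pass index ranges into a shared array.

```python
def rec_matmul(A, B):
    """Multiply an n x m matrix A by an m x p matrix B by halving the
    largest of the three dimensions until a single element remains."""
    n, m, p = len(A), len(B), len(B[0])
    if n == 1 and m == 1 and p == 1:
        return [[A[0][0] * B[0][0]]]          # base case: one element
    if n >= max(m, p):
        # Split A horizontally; the results stack vertically.
        h = n // 2
        return rec_matmul(A[:h], B) + rec_matmul(A[h:], B)
    if m >= max(n, p):
        # Split A vertically and B horizontally; the results are added.
        h = m // 2
        C1 = rec_matmul([row[:h] for row in A], B[:h])
        C2 = rec_matmul([row[h:] for row in A], B[h:])
        return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(C1, C2)]
    # Otherwise p is largest: split B vertically; results sit side by side.
    h = p // 2
    C1 = rec_matmul(A, [row[:h] for row in B])
    C2 = rec_matmul(A, [row[h:] for row in B])
    return [r1 + r2 for r1, r2 in zip(C1, C2)]
```

Note that no cache parameter appears anywhere in the code; the recursion alone produces subproblems of every size.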
Matrix Multiplication (2)
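Assume the sizes of A and B are n x 4n and 4n x n.
A*B = A1*B1 + A2*B2, where A = (A1 A2) and B is B1 stacked on B2.
A1*B1 = A11*B11 + A12*B12, where A1 = (A11 A12) and B1 is B11 stacked on B12.
A2*B2 = A21*B21 + A22*B22, where A2 = (A21 A22) and B2 is B21 stacked on B22.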
Matrix Multiplication (3)
Intuitively, once a sub problem fits into the cache, its smaller sub problems can be solved in cache with no further misses.
Matrix Multiplication (4)
Cache complexity
- Matches the cache complexity of the (cache aware) Block-MULT algorithm.
- For square matrices, the optimal cache complexity is achieved.
Matrix Transposition
A (m x n) → B = A^T (n x m)
for i ← 1 to m
    for j ← 1 to n
        B(j, i) = A(i, j)
If n is very large, the column-wise access to B causes a cache miss every time! (No spatial locality in B.)
Matrix Transposition (2)
Partition array A along the longer dimension and recursively execute the transpose function.
$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \quad A^T = \begin{pmatrix} A_{11}^T & A_{21}^T \\ A_{12}^T & A_{22}^T \end{pmatrix}$
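A minimal sketch of the recursive transpose, assuming matrices stored as lists of lists and an out-of-place result. The names `rec_transpose` and `transpose` are hypothetical, and a practical version would stop at a small base-case block instead of a single element.

```python
def rec_transpose(A, B, r0, r1, c0, c1):
    """Write the transpose of the block A[r0:r1][c0:c1] into B."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= 0 or cols <= 0:
        return
    if rows == 1 and cols == 1:
        B[c0][r0] = A[r0][c0]                 # base case: one element
    elif rows >= cols:
        mid = r0 + rows // 2                  # split the longer (row) dimension
        rec_transpose(A, B, r0, mid, c0, c1)
        rec_transpose(A, B, mid, r1, c0, c1)
    else:
        mid = c0 + cols // 2                  # split the longer (column) dimension
        rec_transpose(A, B, r0, r1, c0, mid)
        rec_transpose(A, B, r0, r1, mid, c1)

def transpose(A):
    """Return the n x m transpose of an m x n matrix A."""
    m, n = len(A), len(A[0])
    B = [[None] * m for _ in range(n)]
    rec_transpose(A, B, 0, m, 0, n)
    return B
```

Always halving the longer dimension keeps the blocks close to square, so a block and its transposed destination fit in the cache together once they are small enough.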
Matrix Transposition (3)
Cache complexity
- It has the optimal cache complexity: $Q(m, n) = \Theta(1 + mn/L)$
Fast Fourier Transform
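Use the Cooley-Tukey algorithm. The DFT of X is

$$Y[i] = \sum_{j=0}^{n-1} X[j]\,\omega_n^{-ij}$$

where $\omega_n = e^{2\pi\sqrt{-1}/n}$ is a primitive n-th root of unity (the sign convention of the cited paper).

Cooley-Tukey algorithms recursively re-express a DFT of a composite size n = n1n2 as:
1. Perform n2 DFTs of size n1.
2. Multiply by complex roots of unity called twiddle factors.
3. Perform n1 DFTs of size n2.

$$Y[i_1 + i_2 n_1] = \sum_{j_2=0}^{n_2-1} \left[ \left( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2]\,\omega_{n_1}^{-i_1 j_1} \right) \omega_n^{-i_1 j_2} \right] \omega_{n_2}^{-i_2 j_2}$$

[Figure: X viewed as a matrix with dimensions n1 and n2.]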
Assume X is a row-major n1 x n2 matrix. Steps:
1. Transpose X in place.
2. Compute n2 DFTs of size n1.
3. Multiply by the twiddle factors.
4. Transpose X in place.
5. Compute n1 DFTs of size n2.
6. Transpose X in place.
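The six steps can be sketched in pure Python as follows. This is a hedged illustration, not the paper's implementation: the names `six_step_fft`, `naive_dft`, and `transpose_flat` are hypothetical, the transposes allocate new lists instead of working in place, n is assumed to be a power of 2, and the DFT is assumed to use the exponent sign $e^{-2\pi\sqrt{-1}\,ij/n}$.

```python
import cmath

def naive_dft(x):
    """Direct O(n^2) DFT, used as the base case and as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * i * j / n) for j in range(n))
            for i in range(n)]

def transpose_flat(flat, rows, cols):
    """Transpose a row-major rows x cols array into a row-major cols x rows one."""
    return [flat[r * cols + c] for c in range(cols) for r in range(rows)]

def six_step_fft(x):
    n = len(x)
    assert n >= 1 and n & (n - 1) == 0, "this sketch assumes n is a power of 2"
    if n <= 2:
        return naive_dft(x)
    k = n.bit_length() - 1                 # n = 2**k
    n1 = 1 << ((k + 1) // 2)               # n1 and n2 are both about sqrt(n)
    n2 = n // n1
    # Step 1: transpose the n1 x n2 matrix so each row holds one size-n1 DFT input.
    t = transpose_flat(x, n1, n2)
    # Step 2: n2 recursive DFTs of size n1.
    rows = [six_step_fft(t[j2 * n1:(j2 + 1) * n1]) for j2 in range(n2)]
    # Step 3: multiply element (j2, i1) by the twiddle factor w_n^(-i1*j2).
    t = [rows[j2][i1] * cmath.exp(-2j * cmath.pi * i1 * j2 / n)
         for j2 in range(n2) for i1 in range(n1)]
    # Step 4: transpose back to n1 x n2 so each row holds one size-n2 DFT input.
    t = transpose_flat(t, n2, n1)
    # Step 5: n1 recursive DFTs of size n2; element (i1, i2) is Y[i1 + i2*n1].
    rows = [six_step_fft(t[i1 * n2:(i1 + 1) * n2]) for i1 in range(n1)]
    t = [v for row in rows for v in row]
    # Step 6: final transpose puts Y[i1 + i2*n1] at index i2*n1 + i1.
    return transpose_flat(t, n1, n2)
```

Choosing n1 and n2 near the square root of n means the subproblems shrink quickly enough to fit in the cache after a few levels of recursion, without the code ever naming Z or L.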
Fast Fourier Transform
Example with n1 = 4, n2 = 2:
1. Transpose to select n2 DFTs of size n1.
2. Call FFT recursively with n1 = 2, n2 = 2; reach the base case and return.
3. Multiply by the twiddle factors.
4. Transpose to select n1 DFTs of size n2.
5. Transpose and return.
Fast Fourier Transform
Cache complexity
- Optimal for a Cooley-Tukey algorithm when n is an exact power of 2: $Q(n) = O(1 + (n/L)(1 + \log_Z n))$
Other Cache Oblivious Algorithms
- Funnelsort
- Distribution sort
- LU decomposition without pivots
Questions
How large is the range of practicality of cache-oblivious algorithms?
What are the relative strengths of cache-oblivious and cache-aware algorithms?
Practicality of Cache-oblivious Algorithms
[Figure: average time to transpose an N x N matrix, divided by N^2.]
Practicality of Cache-oblivious Algorithms (2)
[Figure: average time taken to multiply two N x N matrices, divided by N^3.]
Question 2
Do cache-oblivious algorithms perform as well as cache-aware algorithms?
- FFTW library
- No answer yet.
References
- Cache-Oblivious Algorithms, by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.
- Cache-Oblivious Algorithms, by Harald Prokop. Master's Thesis, MIT Department of Electrical Engineering and Computer Science. June 1999.
- Optimizing Matrix Multiplication with a Classifier Learning System, by Xiaoming Li and María Jesus Garzarán. LCPC 2005.