# The Study of Cache Oblivious Algorithms

Post on 25-Feb-2016


The Study of Cache Oblivious Algorithms. Prepared by Jia Guo. Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.


CS598dhp


## Outline
- Cache complexity
- Cache aware algorithms
- Cache oblivious algorithms
  - Matrix multiplication
  - Matrix transposition
  - FFT
- Conclusion

## Assumptions
Only two levels of the memory hierarchy are considered:
- An ideal cache
  - Fully associative
  - Optimal replacement strategy
  - Tall cache: $Z = \Omega(L^2)$
- A very large memory

## An Ideal Cache Model
An ideal cache model (Z, L):
- Z: total number of words in the cache
- L: number of words in one cache line

## Cache Complexity
An algorithm with input size n is measured by:
- Work complexity W(n)
- Cache complexity Q(n; Z, L): the number of cache misses it incurs
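
To make Q(n; Z, L) concrete, here is a minimal sketch of a miss counter for a fully associative (Z, L) cache. It assumes LRU replacement as a stand-in for the optimal replacement strategy of the ideal cache, and the name `count_misses` is hypothetical, not taken from the paper.

```python
from collections import OrderedDict


def count_misses(addresses, Z, L):
    """Count cache misses for a sequence of word addresses.

    Assumptions: a fully associative cache of Z words with lines of
    L words, and LRU replacement (the ideal-cache model uses optimal
    replacement instead).
    """
    num_lines = Z // L
    cache = OrderedDict()  # keys are line numbers, oldest first
    misses = 0
    for addr in addresses:
        line = addr // L
        if line in cache:
            cache.move_to_end(line)  # mark as most recently used
        else:
            misses += 1
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict least recently used
            cache[line] = True
    return misses
```

Under these assumptions, a sequential scan of n words costs about n/L misses, while a strided scan that touches more lines than the cache holds misses on every access.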


## Cache Aware Algorithms
- Contain parameters that are set to minimize the cache complexity for a particular cache size (Z) and line length (L).
- The parameters need to be adjusted when running on different platforms.

## Example: A blocked matrix multiplication algorithm
s is a tuning parameter chosen to make the algorithm run fast.

[Figure: an n × n matrix A divided into s × s blocks, with block A11 labeled.]
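
The blocked algorithm itself is not preserved in this transcript. A minimal sketch of the idea, assuming square n × n matrices stored as lists of lists and a block size s that divides n (the name `block_mult` is hypothetical):

```python
def block_mult(A, B, s):
    """Multiply two n x n matrices using s x s blocks.

    Assumption: s divides n. s is the tuning parameter that a
    cache-aware implementation would choose per platform.
    """
    n = len(A)
    if n % s != 0:
        raise ValueError("this sketch assumes s divides n")
    C = [[0] * n for _ in range(n)]
    # Loop over all triples of s x s blocks.
    for i0 in range(0, n, s):
        for j0 in range(0, n, s):
            for k0 in range(0, n, s):
                # Ordinary multiplication restricted to one block triple.
                for i in range(i0, i0 + s):
                    for k in range(k0, k0 + s):
                        a = A[i][k]
                        for j in range(j0, j0 + s):
                            C[i][j] += a * B[k][j]
    return C
```

The result does not depend on s; only the memory access pattern does, which is why s has to be tuned to the cache.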

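## Example (2): Cache complexity
- The three s × s submatrices should fit into the cache together; each occupies $\Theta(s + s^2/L)$ cache lines.
- Optimal performance is obtained when $s = \Theta(\sqrt{Z})$.
- $\Theta(Z/L)$ cache misses are needed to bring the 3 submatrices into the cache.
- $\Theta(n^2/L)$ cache misses are needed to read the $n^2$ elements.
- The cache complexity is $\Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$.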
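
As a worked step, assuming s divides n and $s = \Theta(\sqrt{Z})$: the algorithm performs $(n/s)^3$ block multiplications, each incurring $\Theta(Z/L)$ misses, in addition to the cost of reading the input. Putting these together gives:

```latex
Q(n) = \Theta\!\left(1 + \frac{n^2}{L}
       + \left(\frac{n}{\sqrt{Z}}\right)^{3} \cdot \frac{Z}{L}\right)
     = \Theta\!\left(1 + \frac{n^2}{L} + \frac{n^3}{L\sqrt{Z}}\right)
```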


## Cache Oblivious Algorithms
- Have no parameters that depend on the hardware, such as cache size (Z) or cache-line length (L).
- No tuning needed; platform independent.
- The algorithms introduced below are proved to have optimal cache complexity.

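## Matrix Multiplication
Partition matrices A and B in half along the largest dimension, where A is n × m and B is m × p. There are three cases:
- $n \ge \max(m, p)$: split A horizontally, $\begin{pmatrix} A_1 \\ A_2 \end{pmatrix} B = \begin{pmatrix} A_1 B \\ A_2 B \end{pmatrix}$
- $m \ge \max(n, p)$: split A vertically and B horizontally, $\begin{pmatrix} A_1 & A_2 \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} = A_1 B_1 + A_2 B_2$
- $p \ge \max(n, m)$: split B vertically, $A \begin{pmatrix} B_1 & B_2 \end{pmatrix} = \begin{pmatrix} A B_1 & A B_2 \end{pmatrix}$

Proceed recursively until reaching the base case of one element.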
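
The recursive partitioning can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: matrices are non-empty lists of lists, sub-blocks are addressed by offsets rather than copied, and the names `rec_mult` and `multiply` are hypothetical.

```python
def rec_mult(A, B, C, i0, n, k0, m, j0, p):
    """Add the product of the n x m block of A at (i0, k0) and the
    m x p block of B at (k0, j0) into the n x p block of C at (i0, j0)."""
    if n == 1 and m == 1 and p == 1:
        C[i0][j0] += A[i0][k0] * B[k0][j0]  # base case: one element
    elif n >= max(m, p):
        h = n // 2  # split A horizontally
        rec_mult(A, B, C, i0, h, k0, m, j0, p)
        rec_mult(A, B, C, i0 + h, n - h, k0, m, j0, p)
    elif m >= max(n, p):
        h = m // 2  # split A vertically and B horizontally
        rec_mult(A, B, C, i0, n, k0, h, j0, p)
        rec_mult(A, B, C, i0, n, k0 + h, m - h, j0, p)
    else:
        h = p // 2  # split B vertically
        rec_mult(A, B, C, i0, n, k0, m, j0, h)
        rec_mult(A, B, C, i0, n, k0, m, j0 + h, p - h)


def multiply(A, B):
    """Multiply an n x m matrix A by an m x p matrix B."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    rec_mult(A, B, C, 0, n, 0, m, 0, p)
    return C
```

Note that no cache parameter appears anywhere in the code; the recursion alone determines the access pattern.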

## Matrix Multiplication (2)
Assume the sizes of A and B are n × 4n and 4n × n.

[Figure: recursion tree. A*B = A1*B1 + A2*B2, where A1*B1 = A11*B11 + A12*B12 and A2*B2 = A21*B21 + A22*B22.]

## Matrix Multiplication (3)
Intuitively, once a subproblem fits into the cache, its smaller subproblems can be solved in cache with no further misses.

## Matrix Multiplication (4): Cache complexity
- It achieves the same cache complexity as the cache-aware Block-MULT algorithm.
- For square matrices, the optimal cache complexity is achieved.


## Matrix Transposition
Compute B = A^T, where A is m × n and B is n × m:

    for i ← 1 to m
        for j ← 1 to n
            B(j, i) = A(i, j)

If n is very large, accessing B by column causes a cache miss every time (no spatial locality in B).

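## Matrix Transposition (2)
Partition array A along the longer dimension and recursively execute the transpose function.

$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}^{T} = \begin{pmatrix} A_{11}^{T} & A_{21}^{T} \\ A_{12}^{T} & A_{22}^{T} \end{pmatrix}$$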
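
A minimal sketch of the recursive transpose, assuming a non-empty m × n matrix stored as a list of lists and an out-of-place result; the names `rec_transpose` and `transpose` are hypothetical.

```python
def rec_transpose(A, B, i0, m, j0, n):
    """Write the transpose of the m x n block of A at (i0, j0) into B."""
    if m == 1 and n == 1:
        B[j0][i0] = A[i0][j0]  # base case: one element
    elif m >= n:
        h = m // 2  # rows are the longer dimension: split them
        rec_transpose(A, B, i0, h, j0, n)
        rec_transpose(A, B, i0 + h, m - h, j0, n)
    else:
        h = n // 2  # columns are the longer dimension: split them
        rec_transpose(A, B, i0, m, j0, h)
        rec_transpose(A, B, i0, m, j0 + h, n - h)


def transpose(A):
    """Return the n x m transpose of an m x n matrix A."""
    m, n = len(A), len(A[0])
    B = [[None] * m for _ in range(n)]
    rec_transpose(A, B, 0, m, 0, n)
    return B
```

Splitting the longer dimension keeps the sub-blocks close to square, so that once a block of A and the matching block of B fit in cache together, the remaining recursion on that block causes no further misses.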

## Matrix Transposition (3): Cache complexity
It has the optimal cache complexity $Q(m, n) = \Theta(1 + mn/L)$.

## Fast Fourier Transform
Use the Cooley-Tukey algorithm. Cooley-Tukey algorithms recursively re-express a DFT of composite size n = n1 n2 as:
1. Perform n2 DFTs of size n1.
2. Multiply by complex roots of unity called twiddle factors.
3. Perform n1 DFTs of size n2.

[Figure: a two-dimensional array with its dimensions labeled n1 and n2.]

Assume X is a row-major n1 × n2 matrix. Steps:
1. Transpose X in place.
2. Compute n2 DFTs.
3. Multiply by twiddle factors.
4. Transpose X in place.
5. Compute n1 DFTs.
6. Transpose X in place.
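
The six steps can be sketched in Python as below. This is a simplified illustration under several assumptions: n is a power of 2, n1 is chosen as 2^⌈(log2 n)/2⌉, the transposes copy into new lists rather than working in place, and the function names are hypothetical.

```python
import cmath


def transpose(flat, rows, cols):
    """Transpose a rows x cols row-major matrix stored in a flat list."""
    return [flat[r * cols + c] for c in range(cols) for r in range(rows)]


def fft(x):
    """DFT of x (length a power of 2) via the six-step Cooley-Tukey scheme."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    log_n = n.bit_length() - 1
    if 1 << log_n != n:
        raise ValueError("this sketch assumes n is a power of 2")
    n1 = 1 << ((log_n + 1) // 2)
    n2 = n // n1
    # Step 1: view x as a row-major n1 x n2 matrix and transpose it.
    t = transpose(x, n1, n2)
    # Step 2: compute n2 DFTs of size n1, one per row.
    rows = [fft(t[j2 * n1:(j2 + 1) * n1]) for j2 in range(n2)]
    # Step 3: multiply entry (j2, i1) by the twiddle factor w_n^(-i1*j2).
    t = [rows[j2][i1] * cmath.exp(-2j * cmath.pi * i1 * j2 / n)
         for j2 in range(n2) for i1 in range(n1)]
    # Step 4: transpose back to n1 x n2.
    t = transpose(t, n2, n1)
    # Step 5: compute n1 DFTs of size n2, one per row.
    rows = [fft(t[i1 * n2:(i1 + 1) * n2]) for i1 in range(n1)]
    t = [v for row in rows for v in row]
    # Step 6: final transpose puts output i1 + i2*n1 at position i2*n1 + i1.
    return transpose(t, n1, n2)
```

The transposes exist so that every sub-DFT operates on contiguous data; an implementation aiming for the stated cache bound would perform them in place with a cache-oblivious transpose.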

## Fast Fourier Transform: example with n1 = 4, n2 = 2
- Transpose to select n2 DFTs of size n1.
- Call FFT recursively with n1 = 2, n2 = 2; reach the base case and return.
- Multiply by the twiddle factors.
- Transpose to select n1 DFTs of size n2.
- Transpose and return.

## Fast Fourier Transform: Cache complexity
- Optimal for a Cooley-Tukey algorithm when n is an exact power of 2.
- $Q(n) = O(1 + (n/L)(1 + \log_Z n))$

## Other Cache Oblivious Algorithms
- Funnelsort
- Distribution sort
- LU decomposition without pivots


## Questions
- How large is the range of practicality of cache-oblivious algorithms?
- What are the relative strengths of cache-oblivious and cache-aware algorithms?

## Practicality of Cache-oblivious Algorithms
[Figure: average time to transpose an N × N matrix, divided by N².]

## Practicality of Cache-oblivious Algorithms (2)
[Figure: average time taken to multiply two N × N matrices, divided by N³.]

## Question 2
Do cache-oblivious algorithms perform as well as cache-aware algorithms?
- FFTW library
- No answer yet.

## References
- Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.
- Cache-Oblivious Algorithms by Harald Prokop. Master's Thesis, MIT Department of Electrical Engineering and Computer Science. June 1999.
- Optimizing Matrix Multiplication with a Classifier Learning System by Xiaoming Li and María Jesús Garzarán. LCPC 2005.
