The Study of Cache Oblivious Algorithms

Post on 25-Feb-2016

DESCRIPTION

The Study of Cache Oblivious Algorithms. Prepared by Jia Guo. Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.

TRANSCRIPT

  • The Study of Cache Oblivious Algorithms
    Prepared by Jia Guo

    CS598dhp

    Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.

    Outline
      • Cache complexity
      • Cache-aware algorithms
      • Cache-oblivious algorithms
      • Matrix multiplication
      • Matrix transposition
      • FFT
      • Conclusion

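    Assumptions
      Only two levels of memory hierarchy:
      • An ideal cache: fully associative, with an optimal replacement strategy, and tall ($Z = \Omega(L^2)$, where $Z$ is the cache size in words and $L$ the line length in words)
      • A very large main memory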

    An Ideal Cache Model
      An ideal cache is described by two parameters (Z, L):
      • Z: total number of words in the cache
      • L: number of words in one cache line

    Cache Complexity
      An algorithm with input size n is measured by:
      • Work complexity W(n)
      • Cache complexity Q(n; Z, L): the number of cache misses it incurs
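To make the definition of Q(n; Z, L) concrete, here is a minimal sketch of a miss counter for a fully associative cache. It is an illustration only: the function name `count_misses` is hypothetical, and LRU replacement is used as a stand-in for the optimal replacement strategy that the ideal-cache model assumes.

```python
from collections import OrderedDict


def count_misses(addresses, Z, L):
    """Count cache misses for a sequence of word addresses.

    Assumption: LRU replacement stands in for the optimal (offline)
    replacement strategy of the ideal-cache model.
    """
    num_lines = Z // L           # the cache holds Z/L lines of L words each
    cache = OrderedDict()        # cache-line id -> None, ordered by recency
    misses = 0
    for addr in addresses:
        line = addr // L         # L consecutive words share one cache line
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1                    # miss: load the whole line
            cache[line] = None
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses
```

For instance, `count_misses(range(64), 32, 8)` scans 64 consecutive words and touches 8 distinct lines, so a sequential scan costs about n/L misses.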

    Cache-Aware Algorithms
      • Contain parameters that are tuned to minimize the cache complexity for a particular cache size (Z) and line length (L).
      • The parameters must be adjusted again when running on a different platform.

    Example: a blocked matrix multiplication algorithm
      The n x n matrices are divided into s x s submatrices (such as A11), where s is a tuning parameter chosen to make the algorithm run fast.
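The slide's own pseudocode did not survive in this transcript, so the following is only a sketch of what a blocked (tiled) multiplication with tuning parameter s can look like. The function name `block_mult` and the list-of-lists representation are assumptions, and s is assumed to divide n.

```python
def block_mult(A, B, n, s):
    """Multiply two n x n matrices (lists of lists) tile by tile.

    Assumption: the tile size s divides n. s is the cache-aware tuning
    parameter; it would be chosen so that three s x s tiles fit in cache.
    """
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, s):
        for j0 in range(0, n, s):
            for k0 in range(0, n, s):
                # Multiply tile A[i0.., k0..] by tile B[k0.., j0..]
                # and accumulate the product into tile C[i0.., j0..].
                for i in range(i0, i0 + s):
                    for k in range(k0, k0 + s):
                        a = A[i][k]
                        for j in range(j0, j0 + s):
                            C[i][j] += a * B[k][j]
    return C
```

Changing s does not change the result, only the order in which memory is touched, which is why s has to be re-tuned for each cache.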

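    Example (2): Cache complexity
      • The three s x s submatrices should fit into the cache at the same time, so they occupy $\Theta(s + s^2/L)$ cache lines.
      • Optimal performance is obtained when $s = \Theta(\sqrt{Z})$.
      • $\Theta(Z/L)$ cache misses are needed to bring the 3 submatrices into the cache.
      • $\Theta(n^2/L)$ cache misses are needed to read the $n^2$ elements.
      • The cache complexity is $Q(n) = \Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$.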
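The arithmetic behind the blocked algorithm's bound can be sketched as follows, assuming the tile size $s = \Theta(\sqrt{Z})$ used in the cited FOCS '99 paper and counting one tile-by-tile product per iteration of the three outer loops:

```latex
\begin{align*}
\text{number of tile products} &= (n/s)^3 = \Theta\big((n/\sqrt{Z})^3\big) \\
\text{misses per tile product} &= \Theta(Z/L) \\
Q(n) &= \Theta\big(1 + n^2/L + (n/\sqrt{Z})^3 \cdot Z/L\big) \\
     &= \Theta\big(1 + n^2/L + n^3/(L\sqrt{Z})\big)
\end{align*}
```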

    Cache-Oblivious Algorithms
      • Contain no hardware parameters, such as the cache size (Z) or the cache-line length (L).
      • Need no tuning and are platform independent.
      • The algorithms introduced below are proved to have optimal cache complexity.

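    Matrix Multiplication
      Partition matrices A (n x m) and B (m x p) in half along the largest of the three dimensions:
      • $n \ge \max(m, p)$: split A horizontally
      • $m \ge \max(n, p)$: split A vertically and B horizontally
      • $p \ge \max(n, m)$: split B vertically
      Proceed recursively until the base case of a single element is reached.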
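The recursion described on this slide can be sketched as below. This is an illustration rather than the paper's exact procedure: the names `rec_mult` and `multiply` are hypothetical, matrices are lists of lists, and index ranges are passed instead of copying submatrices. Note that no cache parameter appears anywhere in the code.

```python
def rec_mult(A, B, C, i0, i1, k0, k1, j0, j1):
    """Accumulate A[i0:i1, k0:k1] * B[k0:k1, j0:j1] into C[i0:i1, j0:j1]."""
    n, m, p = i1 - i0, k1 - k0, j1 - j0
    if n == 1 and m == 1 and p == 1:
        # Base case: a single scalar product.
        C[i0][j0] += A[i0][k0] * B[k0][j0]
    elif n >= max(m, p):
        # Rows of A dominate: split A horizontally.
        mid = i0 + n // 2
        rec_mult(A, B, C, i0, mid, k0, k1, j0, j1)
        rec_mult(A, B, C, mid, i1, k0, k1, j0, j1)
    elif m >= max(n, p):
        # Shared dimension dominates: split A vertically and B horizontally.
        mid = k0 + m // 2
        rec_mult(A, B, C, i0, i1, k0, mid, j0, j1)
        rec_mult(A, B, C, i0, i1, mid, k1, j0, j1)
    else:
        # Columns of B dominate: split B vertically.
        mid = j0 + p // 2
        rec_mult(A, B, C, i0, i1, k0, k1, j0, mid)
        rec_mult(A, B, C, i0, i1, k0, k1, mid, j1)


def multiply(A, B):
    """Return A * B for an n x m matrix A and an m x p matrix B."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    rec_mult(A, B, C, 0, n, 0, m, 0, p)
    return C
```

A practical version would stop the recursion at a small block and switch to plain loops; the one-element base case is kept here only to mirror the slide.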

    Matrix Multiplication (2)
      Assume A is n x 4n and B is 4n x n. The recursion splits the shared dimension twice:
        A*B   = A1*B1 + A2*B2
        A1*B1 = A11*B11 + A12*B12
        A2*B2 = A21*B21 + A22*B22

    Matrix Multiplication (3)
      Intuitively, once a subproblem fits into the cache, its smaller subproblems can be solved in cache with no further misses.

    Matrix Multiplication (4): Cache complexity
      • The recursive algorithm achieves the same cache complexity as the cache-aware Block-MULT algorithm.
      • For square matrices, the optimal cache complexity is achieved.

    Matrix Transposition
      A is an m x n matrix and B = A^T is the n x m result:

        for i = 1 to m
          for j = 1 to n
            B(j, i) = A(i, j)

      If n is very large, every column-wise access to B causes a cache miss (there is no spatial locality in B).

    Matrix Transposition (2)
      Partition array A along the longer dimension and recursively execute the transpose function. After a split in each dimension:

        A = [ A11  A12 ]      A^T = [ A11^T  A21^T ]
            [ A21  A22 ]            [ A12^T  A22^T ]
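A minimal sketch of this recursive transpose, assuming list-of-lists matrices and an out-of-place result; the names `rec_transpose` and `transpose` are hypothetical.

```python
def rec_transpose(A, B, i0, i1, j0, j1):
    """Write the transpose of A[i0:i1, j0:j1] into B[j0:j1, i0:i1]."""
    m, n = i1 - i0, j1 - j0
    if m == 1 and n == 1:
        B[j0][i0] = A[i0][j0]
    elif m >= n:
        # More rows than columns: halve the row range.
        mid = i0 + m // 2
        rec_transpose(A, B, i0, mid, j0, j1)
        rec_transpose(A, B, mid, i1, j0, j1)
    else:
        # More columns than rows: halve the column range.
        mid = j0 + n // 2
        rec_transpose(A, B, i0, i1, j0, mid)
        rec_transpose(A, B, i0, i1, mid, j1)


def transpose(A):
    """Return the n x m transpose of an m x n matrix A."""
    m, n = len(A), len(A[0])
    B = [[None] * m for _ in range(n)]
    rec_transpose(A, B, 0, m, 0, n)
    return B
```

Because the longer dimension is always the one halved, the subproblems stay roughly square, so both the block read from A and the block written to B eventually fit in cache together.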

    Matrix Transposition (3): Cache complexity
      The algorithm has optimal cache complexity: $Q(m, n) = \Theta(1 + mn/L)$.

    Fast Fourier Transform
      The Cooley-Tukey algorithm recursively re-expresses a DFT of composite size n = n1 n2 as:
      1. Perform n2 DFTs of size n1.
      2. Multiply by complex roots of unity called twiddle factors.
      3. Perform n1 DFTs of size n2.

    Assume X is a row-major n1 x n2 matrix. Steps:
      1. Transpose X in place.
      2. Compute n2 DFTs of size n1.
      3. Multiply by the twiddle factors.
      4. Transpose X in place.
      5. Compute n1 DFTs of size n2.
      6. Transpose X in place.
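The six steps above can be sketched in Python as follows. Several choices here are assumptions made for brevity and are not from the slides: the transposes build new lists instead of working in place, n is a power of two, sizes 1 and 2 are handled directly as base cases, and the names `fft` and `transpose` are hypothetical.

```python
import cmath


def transpose(x, rows, cols):
    """Transpose a row-major rows x cols matrix stored as a flat list."""
    return [x[r * cols + c] for c in range(cols) for r in range(rows)]


def fft(x):
    """DFT of x (length a power of two) via the six-step decomposition."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    lg = n.bit_length() - 1
    n1 = 1 << ((lg + 1) // 2)    # n1 = 2^ceil(lg n / 2)
    n2 = 1 << (lg // 2)          # n2 = 2^floor(lg n / 2), so n = n1 * n2
    # Step 1: view x as n1 x n2 and transpose to n2 x n1.
    t = transpose(x, n1, n2)
    # Step 2: n2 DFTs of size n1, one per (now contiguous) row.
    rows = [fft(t[r * n1:(r + 1) * n1]) for r in range(n2)]
    # Step 3: multiply entry (j2, i1) by the twiddle factor w_n^(-i1 * j2).
    t = [rows[j2][i1] * cmath.exp(-2j * cmath.pi * i1 * j2 / n)
         for j2 in range(n2) for i1 in range(n1)]
    # Step 4: transpose back to n1 x n2.
    t = transpose(t, n2, n1)
    # Step 5: n1 DFTs of size n2, one per row.
    rows = [fft(t[r * n2:(r + 1) * n2]) for r in range(n1)]
    t = [v for row in rows for v in row]
    # Step 6: final transpose puts the outputs in natural order.
    return transpose(t, n1, n2)
```

For n = 8 this picks n1 = 4 and n2 = 2, and the size-4 subproblems recurse with n1 = 2, n2 = 2. The transposes exist so that every recursive DFT reads a contiguous run of memory.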

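    Fast Fourier Transform: worked example with n1 = 4, n2 = 2
      [Figure: data layout after each step]
      1. Transpose to select n2 DFTs of size n1.
      2. Call FFT recursively with n1 = 2, n2 = 2; on reaching the base case, return.
      3. Multiply by the twiddle factors.
      4. Transpose to select n1 DFTs of size n2.
      5. Transpose and return.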

    Fast Fourier Transform: Cache complexity
      • Optimal for a Cooley-Tukey algorithm when n is an exact power of 2.
      • $Q(n) = O(1 + (n/L)(1 + \log_Z n))$

    Other Cache-Oblivious Algorithms
      • Funnelsort
      • Distribution sort
      • LU decomposition without pivots

    Questions
      • How large is the range of practicality of cache-oblivious algorithms?
      • What are the relative strengths of cache-oblivious and cache-aware algorithms?

    Practicality of Cache-Oblivious Algorithms
      [Figure: average time to transpose an N x N matrix, divided by N^2]

    Practicality of Cache-Oblivious Algorithms (2)
      [Figure: average time taken to multiply two N x N matrices, divided by N^3]

    Question 2
      Do cache-oblivious algorithms perform as well as cache-aware algorithms?
      • FFTW library
      • No answer yet.

    References
      • Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-Oblivious Algorithms. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.
      • Harald Prokop. Cache-Oblivious Algorithms. Master's Thesis, MIT Department of Electrical Engineering and Computer Science, June 1999.
      • Xiaoming Li and María Jesús Garzarán. Optimizing Matrix Multiplication with a Classifier Learning System. LCPC 2005.
