
Page 1: Jim  Demmel  and  Oded  Schwartz

Communication Avoiding Algorithms
for Dense Linear Algebra:
LU, QR, and Cholesky decompositions and Sparse Cholesky decompositions

Jim Demmel and Oded Schwartz

CS294, Lecture #5, Fall 2011: Communication-Avoiding Algorithms
www.cs.berkeley.edu/~odedsc/CS294

Based on: Bootcamp 2011, CS267, Summer-school 2010, Laura Grigori’s, many papers

Page 2: Jim  Demmel  and  Oded  Schwartz


Outline for today
• Recall communication lower bounds
• What a “matching upper bound”, i.e. a CA algorithm, might have to do
• Summary of what is known so far
• Case studies of CA dense algorithms
  • Last time, case study I: matrix multiplication
  • Case study II: LU, QR
  • Case study III: Cholesky decomposition
• Case studies of CA sparse algorithms
  • Sparse Cholesky decomposition

Page 3: Jim  Demmel  and  Oded  Schwartz


Communication lower bounds
• Applies to algorithms satisfying the technical conditions discussed last time (“smells like 3-nested loops”)
• For each processor:
  • M = size of the fast memory of that processor
  • G = #operations (multiplies) performed by that processor

#words_moved to/from fast memory = Ω( max( G / M^{1/2}, #inputs + #outputs ) )

#messages to/from fast memory ≥ #words_moved / max_message_size = Ω( G / M^{3/2} )
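As a purely illustrative aid (ours, not from the slides), the two bounds can be evaluated for one processor with a small Python helper; the example plugs in sequential classical matrix multiplication, where G = n^3:

```python
from math import sqrt

def communication_lower_bounds(G, M, n_inputs, n_outputs):
    """Evaluate the two lower bounds above for one processor:
    G = #multiplies performed, M = fast memory size in words."""
    words_moved = max(G / sqrt(M), n_inputs + n_outputs)  # Omega(max(G/M^{1/2}, #inputs + #outputs))
    messages = G / M**1.5                                  # Omega(G/M^{3/2}), since max_message_size <= M
    return words_moved, messages

# Example: classical matmul of 1024 x 1024 matrices, G = n^3, with M = 2^20 words of fast memory
n, M = 1024, 2**20
print(communication_lower_bounds(G=n**3, M=M, n_inputs=2 * n**2, n_outputs=n**2))
```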

Page 4: Jim  Demmel  and  Oded  Schwartz

Summary of dense sequential algorithms attaining communication lower bounds
• Algorithms shown minimizing #messages use a (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports); only some are shown, plus ours
• On the original slide, cache-oblivious algorithms are underlined and ours are in green; “?” marks unknown/future work

Algorithm: 2 Levels of Memory (#words moved; #messages) vs. Multiple Levels of Memory (#words moved; #messages)

BLAS-3
  2 levels: usual blocked or recursive algorithms (both #words moved and #messages)
  Multiple levels: usual blocked algorithms (nested), or recursive [Gustavson 97]

Cholesky
  2 levels, #words moved: LAPACK (with b = M^{1/2}), [Gustavson 97], [BDHS09]
  2 levels, #messages: [Gustavson 97], [Ahmed, Pingali 00], [BDHS09]
  Multiple levels: same as 2 levels

LU with pivoting
  2 levels, #words moved: LAPACK (sometimes), [Toledo 97], [GDX 08]
  2 levels, #messages: [GDX 08] (not partial pivoting)
  Multiple levels, #words moved: [Toledo 97], [GDX 08]?
  Multiple levels, #messages: [GDX 08]?

QR (rank-revealing)
  2 levels, #words moved: LAPACK (sometimes), [Elmroth, Gustavson 98], [DGHL08]
  2 levels, #messages: [Frens, Wise 03] (but 3x the flops?), [DGHL08]
  Multiple levels, #words moved: [Elmroth, Gustavson 98], [DGHL08]?
  Multiple levels, #messages: [Frens, Wise 03], [DGHL08]?

Eig, SVD
  2 levels, #words moved: not LAPACK; [BDD11] (randomized, but more flops), [BDK11]
  2 levels, #messages: [BDD11, BDK11?]
  Multiple levels: [BDD11, BDK11?]

Page 5: Jim  Demmel  and  Oded  Schwartz

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^{1/2} ) = Ω( n^2 / P^{1/2} )
    #messages    = Ω( (n^3/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm         Reference    Factor exceeding lower bound for #words_moved    Factor exceeding lower bound for #messages
Matrix Multiply
Cholesky
LU
QR
Sym Eig, SVD
Nonsym Eig
(entries are filled in on the following slides)

Page 6: Jim  Demmel  and  Oded  Schwartz

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (conventional approach)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^{1/2} ) = Ω( n^2 / P^{1/2} )
    #messages    = Ω( (n^3/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm         Reference      Factor exceeding lower bound for #words_moved
Matrix Multiply   [Cannon, 69]   1
Cholesky          ScaLAPACK      log P
LU                ScaLAPACK      log P
QR                ScaLAPACK      log P
Sym Eig, SVD      ScaLAPACK      log P
Nonsym Eig        ScaLAPACK      P^{1/2} log P

Page 7: Jim  Demmel  and  Oded  Schwartz

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (conventional approach)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^{1/2} ) = Ω( n^2 / P^{1/2} )
    #messages    = Ω( (n^3/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm         Reference      Factor exceeding lower bound for #words_moved    Factor exceeding lower bound for #messages
Matrix Multiply   [Cannon, 69]   1                                                1
Cholesky          ScaLAPACK      log P                                            log P
LU                ScaLAPACK      log P                                            n log P / P^{1/2}
QR                ScaLAPACK      log P                                            n log P / P^{1/2}
Sym Eig, SVD      ScaLAPACK      log P                                            n / P^{1/2}
Nonsym Eig        ScaLAPACK      P^{1/2} log P                                    n log P

Page 8: Jim  Demmel  and  Oded  Schwartz

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (better)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^{1/2} ) = Ω( n^2 / P^{1/2} )
    #messages    = Ω( (n^3/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm         Reference      Factor exceeding lower bound for #words_moved    Factor exceeding lower bound for #messages
Matrix Multiply   [Cannon, 69]   1                                                1
Cholesky          ScaLAPACK      log P                                            log P
LU                [GDX10]        log P                                            log P
QR                [DGHL08]       log P                                            log^3 P
Sym Eig, SVD      [BDD11]        log P                                            log^3 P
Nonsym Eig        [BDD11]        log P                                            log^3 P

Page 9: Jim  Demmel  and  Oded  Schwartz

Can we do even better?

• Assume n x n matrices on P processors
• Why just one copy of the data, M = O(n^2 / P) per processor?
• Increase M to reduce the lower bounds:
    #words_moved = Ω( (n^3/P) / M^{1/2} )
    #messages    = Ω( (n^3/P) / M^{3/2} )

Algorithm         Reference      Factor exceeding lower bound for #words_moved    Factor exceeding lower bound for #messages
Matrix Multiply   [Cannon, 69]   1                                                1
Cholesky          ScaLAPACK      log P                                            log P
LU                [GDX10]        log P                                            log P
QR                [DGHL08]       log P                                            log^3 P
Sym Eig, SVD      [BDD11]        log P                                            log^3 P
Nonsym Eig        [BDD11]        log P                                            log^3 P

Page 10: Jim  Demmel  and  Oded  Schwartz

Recall Sequential One-sided Factorizations (LU, QR)

• Classical approach:
    for i = 1 to n
        update column i
        update trailing matrix
  #words_moved = O(n^3)

• Blocked approach (LAPACK):
    for i = 1 to n/b
        update block i of b columns
        update trailing matrix
  #words_moved = O(n^3 / M^{1/3})

• Recursive approach:
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            update right half of A
            factor(right half of A)
  #words_moved = O(n^3 / M^{1/2})

• None of these approaches minimizes #messages or addresses parallelism; we need another idea. (A code sketch of the recursive variant follows below.)
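Below is a minimal Python sketch (ours, unpivoted for simplicity, so it assumes no pivot ever vanishes) of the recursive approach; it overwrites A with its LU factors:

```python
import numpy as np

def recursive_lu(A):
    """Recursive (column-halving) LU without pivoting on an m x n block, m >= n.
    L (unit lower, strict part) and U are stored in place of A."""
    n = A.shape[1]
    if n == 1:
        A[1:, 0] /= A[0, 0]                 # "update it": scale the single column
        return
    k = n // 2
    recursive_lu(A[:, :k])                  # factor(left half of A)
    # update right half of A: triangular solve with unit-lower L11, then Schur complement
    L11 = np.tril(A[:k, :k], -1) + np.eye(k)
    A[:k, k:] = np.linalg.solve(L11, A[:k, k:])
    A[k:, k:] -= A[k:, :k] @ A[:k, k:]
    recursive_lu(A[k:, k:])                 # factor(right half of A)

A = np.random.randn(8, 8) + 8 * np.eye(8)    # diagonally dominant, so no pivoting is needed
LU = A.copy(); recursive_lu(LU)
L, U = np.tril(LU, -1) + np.eye(8), np.triu(LU)
assert np.allclose(L @ U, A)
```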


Page 11: Jim  Demmel  and  Oded  Schwartz

Minimizing latency in the sequential case requires new data structures

• To minimize latency, we need to load/store a whole rectangular subblock of the matrix with one “message”
  • Incompatible with conventional columnwise (rowwise) storage
  • Ex: rows (columns) are not in contiguous memory locations
• Blocked storage: store the matrix as a matrix of b x b blocks, each block stored contiguously
  • OK for one level of the memory hierarchy; what if there are more?
• Recursive blocked storage: store each block using subblocks
  • Also known as “space-filling curves” or “Morton ordering” (see the sketch below)
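Here is a small illustrative sketch (ours; the helper name morton_index is an assumption) of Morton ordering by bit interleaving, which is one way to lay out recursive blocks contiguously:

```python
def morton_index(i, j, bits=16):
    """Z-order (Morton) index of block coordinates (i, j), obtained by interleaving bits."""
    z = 0
    for b in range(bits):
        z |= ((i >> b) & 1) << (2 * b + 1)
        z |= ((j >> b) & 1) << (2 * b)
    return z

# Ordering b x b blocks by morton_index makes every 2x2 group of blocks contiguous,
# then every 4x4 group, and so on: a recursive blocked layout.
print(sorted(range(16), key=lambda t: morton_index(t // 4, t % 4))[:4])  # blocks (0,0),(0,1),(1,0),(1,1)
```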


Page 12: Jim  Demmel  and  Oded  Schwartz


Different Data Layouts for Parallel GE

[Figure: six candidate layouts of a matrix over 4 processors]
1) 1D Column Blocked Layout: bad load balance, P0 idle after the first n/4 steps
2) 1D Column Cyclic Layout: load balanced, but can’t easily use BLAS2 or BLAS3
3) 1D Column Block Cyclic Layout: can trade load balance and BLAS2/3 performance by choosing b, but the factorization of a block column is a bottleneck
4) Block Skewed Layout: complicated addressing; may not want full parallelism in each column/row
5) 2D Row and Column Blocked Layout: bad load balance, P0 idle after the first n/2 steps
6) 2D Row and Column Block Cyclic Layout: the winner!

Page 13: Jim  Demmel  and  Oded  Schwartz


Distributed GE with a 2D Block Cyclic Layout


Page 14: Jim  Demmel  and  Oded  Schwartz


[Figure: one update step of distributed GE] Matrix multiply of green = green - blue * pink

Page 15: Jim  Demmel  and  Oded  Schwartz

Where is the communication bottleneck?
• In both the sequential and parallel case: the panel factorization
• Can we retain partial pivoting? No
• “Proof” for the parallel case:
  • each pivot choice requires a reduction (to find the maximum entry) per column
  • #messages = Ω(n) or Ω(n log P), versus the goal of O(P^{1/2})
• Solution for QR first, then LU


Page 16: Jim  Demmel  and  Oded  Schwartz

TSQR: QR of a Tall, Skinny matrix

W = [ W0 ; W1 ; W2 ; W3 ] = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00 ; R10 ; R20 ; R30 ]

[ R00 ; R10 ; R20 ; R30 ] = [ Q01·R01 ; Q11·R11 ] = diag(Q01, Q11) · [ R01 ; R11 ]

[ R01 ; R11 ] = Q02·R02

Output: Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02
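A minimal NumPy sketch (ours, not the production algorithm) of TSQR with a binary reduction tree on four row blocks; it keeps only the final R factor and discards the implicit Q00…Q02:

```python
import numpy as np

def tsqr_binary_tree(W, nblocks=4):
    """TSQR sketch: local QRs on row blocks, then pairwise combines up a binary tree."""
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(Wi, mode='r') for Wi in blocks]           # leaf level: R00, R10, R20, R30
    while len(Rs) > 1:                                           # combine pairs: R01, R11, then R02
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r') for i in range(0, len(Rs), 2)]
    return Rs[0]

# Sanity check: the R from TSQR matches a direct QR up to the signs of its rows
W = np.random.randn(1000, 50)
assert np.allclose(np.abs(tsqr_binary_tree(W)), np.abs(np.linalg.qr(W, mode='r')))
```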

Page 17: Jim  Demmel  and  Oded  Schwartz

TSQR: An Architecture-Dependent Algorithm

Parallel (binary tree):   W = [ W0 ; W1 ; W2 ; W3 ] → local QRs give R00, R10, R20, R30 → pairwise combines give R01, R11 → final combine gives R02
Sequential (flat tree):   W = [ W0 ; W1 ; W2 ; W3 ] → factor W0 into R00 → fold in W1 to get R01 → fold in W2 to get R02 → fold in W3 to get R03
Dual core:                a hybrid of the two trees

Can choose the reduction tree dynamically
Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?


Page 18: Jim  Demmel  and  Oded  Schwartz

Back to LU. Using a similar idea for TSLU as in TSQR: use a reduction tree to do “Tournament Pivoting”

W (n x b) = [ W1 ; W2 ; W3 ; W4 ]

Step 1: factor each block, Wi = Pi·Li·Ui, and choose b pivot rows of each Wi; call them Wi’
Step 2: stack pairs and factor again: [ W1’ ; W2’ ] = P12·L12·U12 and [ W3’ ; W4’ ] = P34·L34·U34; choose b pivot rows of each, call them W12’ and W34’
Step 3: [ W12’ ; W34’ ] = P1234·L1234·U1234; choose the final b pivot rows

Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting). A small code sketch of the pivot-row tournament follows below.
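A small illustrative sketch (ours) of the tournament: select_pivot_rows is an assumed helper that performs b steps of partial pivoting to pick b candidate rows, and the pairs are then combined up a binary tree:

```python
import numpy as np

def select_pivot_rows(block, b):
    """Local indices of b pivot rows of `block`, chosen by b steps of GE with partial pivoting."""
    A = block.astype(float)
    rows, chosen = list(range(A.shape[0])), []
    for k in range(b):
        p = k + int(np.argmax(np.abs(A[k:, k])))
        A[[k, p]] = A[[p, k]]
        rows[k], rows[p] = rows[p], rows[k]
        chosen.append(rows[k])
        A[k + 1:, k:] -= np.outer(A[k + 1:, k] / A[k, k], A[k, k:])
    return chosen

def tournament_pivoting(W, b, nblocks=4):
    """Tournament pivoting on a tall-skinny W (n x b): b candidates per block, combined pairwise."""
    blocks = np.array_split(np.arange(W.shape[0]), nblocks)
    candidates = [blk[select_pivot_rows(W[blk], b)] for blk in blocks]
    while len(candidates) > 1:
        merged = []
        for i in range(0, len(candidates), 2):
            idx = np.concatenate(candidates[i:i + 2])
            merged.append(idx[select_pivot_rows(W[idx], b)])
        candidates = merged
    return candidates[0]          # global indices of the final b pivot rows

pivot_rows = tournament_pivoting(np.random.randn(1024, 8), b=8)
```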

Page 19: Jim  Demmel  and  Oded  Schwartz

Minimizing Communication in TSLU

Parallel (binary tree):   W = [ W1 ; W2 ; W3 ; W4 ] → local LU of each block → pairwise LUs up the tree → one final LU
Sequential (flat tree):   W = [ W1 ; W2 ; W3 ; W4 ] → LU of W1, then fold in W2, W3, W4 one at a time
Dual core:                a hybrid of the two trees

Can choose the reduction tree dynamically, as before
Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?


Page 20: Jim  Demmel  and  Oded  Schwartz

Making TSLU Numerically Stable
• Stability goal: make ||A − P·L·U|| very small: O(machine_epsilon · ||A||)
• Details matter
  • Going up the tree, we could do LU either on the original rows of W (tournament pivoting) or on the computed rows of U
  • Only tournament pivoting is stable
• Thm: the new scheme is as stable as partial pivoting (PP) in the following sense: it gives the same results as PP applied to a different input matrix whose entries are blocks taken from the input A
• CALU – Communication-Avoiding LU for general A
  – Use TSLU for the panel factorizations
  – Apply the updates to the rest of the matrix
  – Cost: redundant panel factorizations
    – Extra flops for one reduction step: n/b panels, O(n·b^2) flops per panel
    – For the sequential algorithm this is n^2·b; with b = M^{1/2} the extra flops are n^2·M^{1/2} (< n^3, since M < n^2)
    – For the parallel 2D algorithm there are log P reduction steps, n/b panels, n·b^2/P^{1/2} flops per panel, and b = n/(log^2(P)·P^{1/2}), so the extra flops are (n/b)·(n·b^2)·log P / P^{1/2} = n^2·b·log P / P^{1/2} = n^3/(P·log P) < n^3/P
• Benefit:
  – Stable in practice, but not the same pivot choice as GEPP
  – One reduction operation per panel: reduces latency to the minimum


Page 21: Jim  Demmel  and  Oded  Schwartz

Sketch of TSLU Stability Proof

• Consider A = [ A11 A12 ; A21 A22 ; A31 A32 ], with TSLU of the first block column [ A11 ; A21 ; A31 ] selecting A21 (assume wlog that A21 already contains the desired pivot rows)
• The resulting LU decomposition (first n/2 steps) is

    [ Π11 Π12 0 ; Π21 Π22 0 ; 0 0 I ] · [ A11 A12 ; A21 A22 ; A31 A32 ]
        = [ A’11 A’12 ; A’21 A’22 ; A31 A32 ]
        = [ L’11 ; L’21 I ; L’31 0 I ] · [ U’11 U’12 ; A^s_22 ; A^s_32 ]

• Claim: A^s_32 can be obtained by GEPP on a larger matrix G formed from blocks of A:

    G = [ A’11 A’12 ; A21 A21 ; −A31 A32 ]
      = [ L’11 ; A21·U’11^{-1} L21 ; −L31 I ] · [ U’11 U’12 ; U21 −L21^{-1}·A21·U’11^{-1}·U’12 ; A^s_32 ]

• GEPP applied to G pivots no rows, and we need to show that this A^s_32 is the same as the one above:

    L31·L21^{-1}·A21·U’11^{-1}·U’12 + A^s_32
      = L31·U21·U’11^{-1}·U’12 + A^s_32
      = A31·U’11^{-1}·U’12 + A^s_32
      = L’31·U’12 + A^s_32
      = A32

Page 22: Jim  Demmel  and  Oded  Schwartz

Stability of LU using TSLU: CALU (1/2)
• Worst-case analysis
• Pivot growth factor from one panel factorization with b columns
  • TSLU:
    – Proven: 2^{bH}, where H = 1 + height of the reduction tree
    – Attained: 2^b, same as GEPP
• Pivot growth factor on an m x n matrix
  • TSLU:
    – Proven: 2^{nH-1}
    – Attained: 2^{n-1}, same as GEPP
• |L| can be as large as 2^{(b-2)H-b+1}, but CALU is still stable
• There are examples where GEPP is exponentially less stable than CALU, and vice versa, so neither is always better than the other
• Discovered sparse matrices with O(2^n) pivot growth

Page 23: Jim  Demmel  and  Oded  Schwartz

Stability of LU using TSLU: CALU (2/2)

• Empirical testing
• Both random matrices and “special ones”
• Both binary tree (BCALU) and flat tree (FCALU)
• 3 metrics: ||PA − LU|| / ||A||, normwise and componentwise backward errors
• See [D., Grigori, Xiang, 2010] for details

Page 24: Jim  Demmel  and  Oded  Schwartz

Performance vs ScaLAPACK

• TSLU
  – IBM Power 5: up to 4.37x faster (16 procs, 1M x 150)
  – Cray XT4: up to 5.52x faster (8 procs, 1M x 150)
• CALU
  – IBM Power 5: up to 2.29x faster (64 procs, 1000 x 1000)
  – Cray XT4: up to 1.81x faster (64 procs, 1000 x 1000)
• See INRIA Tech Report 6523 (2008); paper at SC08

Page 25: Jim  Demmel  and  Oded  Schwartz

Exascale predicted speedups for Gaussian Elimination: CA-LU vs ScaLAPACK-LU

[Figure: predicted speedup plotted against log2(p) on one axis and log2(n^2/p) = log2(memory_per_proc) on the other]

Page 26: Jim  Demmel  and  Oded  Schwartz

Case Study III: Cholesky Decomposition


Page 27: Jim  Demmel  and  Oded  Schwartz

LU and Cholesky decompositions

LU decomposition of A:
    A = P·L·U, where U is upper triangular, L is lower triangular, and P is a permutation matrix.

Cholesky decomposition of A:
    If A is a real symmetric positive definite matrix, then there exists a real lower triangular L such that A = L·L^T (L is unique if we restrict its diagonal elements to be > 0).

Efficient parallel and sequential algorithms?
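As a quick illustration (ours), the Cholesky definition can be checked directly in NumPy on a small SPD matrix:

```python
import numpy as np

A = np.array([[4., 2.],
              [2., 3.]])                 # real symmetric positive definite
L = np.linalg.cholesky(A)                # lower triangular with positive diagonal
assert np.allclose(L @ L.T, A)           # A = L L^T
```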

Page 28: Jim  Demmel  and  Oded  Schwartz

Application of the generalized form to Cholesky decomposition

(1) Generalized form: for (i,j) ∈ S,
    C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ),   k1, k2, … ∈ S_ij

For Cholesky this instantiates to:
    L(j,j) = ( A(j,j) − Σ_{k<j} L(j,k)^2 )^{1/2}
    L(i,j) = ( A(i,j) − Σ_{k<j} L(i,k)·L(j,k) ) / L(j,j),   for i > j

Page 29: Jim  Demmel  and  Oded  Schwartz

By the Geometric Embedding theorem [Ballard, Demmel, Holtz, S. 2011a]
(follows [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]):

The communication cost (bandwidth) of any classical algorithm for Cholesky decomposition (of dense matrices) is
    Ω( n^3 / M^{1/2} )  sequentially, and  Ω( n^3 / (P·M^{1/2}) )  in parallel.

Page 30: Jim  Demmel  and  Oded  Schwartz

Some Sequential Classical Cholesky Algorithms

Algorithm (data structure)                               Bandwidth                      Latency
Lower bound                                              Ω(n^3/M^{1/2})                 Ω(n^3/M^{3/2})
Naive                                                    Θ(n^3)                         Θ(n^3/M)
LAPACK (with the right block size!), column-major        O(n^3/M^{1/2})                 O(n^3/M)
LAPACK (with the right block size!), contiguous blocks   O(n^3/M^{1/2})                 O(n^3/M^{3/2})
Rectangular Recursive [Toledo 97], column-major          O(n^3/M^{1/2} + n^2 log n)     Θ(n^3/M)
Rectangular Recursive [Toledo 97], contiguous blocks     O(n^3/M^{1/2} + n^2 log n)     Θ(n^2)
Square Recursive [Ahmed, Pingali 00], column-major       O(n^3/M^{1/2})                 O(n^3/M)
Square Recursive [Ahmed, Pingali 00], contiguous blocks  O(n^3/M^{1/2})                 O(n^3/M^{3/2})

(Both recursive algorithms are cache-oblivious.)

Page 31: Jim  Demmel  and  Oded  Schwartz

Some Parallel Classical Cholesky Algorithms

Algorithm                                   Bandwidth                         Latency
Lower bound (general)                       Ω(n^3 / (P·M^{1/2}))              Ω(n^3 / (P·M^{3/2}))
Lower bound (2D layout: M = O(n^2/P))       Ω(n^2 / P^{1/2})                  Ω(P^{1/2})
ScaLAPACK (general)                         O((n^2/P^{1/2} + n·b) log P)      O((n/b) log P)
ScaLAPACK (choosing b = n/P^{1/2})          O((n^2/P^{1/2}) log P)            O(P^{1/2} log P)

Page 32: Jim  Demmel  and  Oded  Schwartz

Upper bounds – Supporting Data Structures

[Figure] Top (column-major): Full, Old Packed, Rectangular Full Packed. Bottom (block-contiguous): Blocked, Recursive Format, Recursive Full Packed. [Frigo, Leiserson, Prokop, Ramachandran 99; Ahmed, Pingali 00; Elmroth, Gustavson, Jonsson, Kagstrom 04]

Page 33: Jim  Demmel  and  Oded  Schwartz

Upper bounds – Supporting Data Structures

Rules of thumb:
• Don’t mix column/row-major with blocked storage: decide which one fits your algorithm.
• Converting between the two once in your algorithm is OK for dense instances, e.g., changing the data structure at the beginning of the algorithm.

Page 34: Jim  Demmel  and  Oded  Schwartz

The Naïve Algorithms

[Figure: access patterns of the left-looking and right-looking variants]

Naïve left-looking and right-looking Cholesky, assuming a column-major data structure:

    L(i,i) = ( A(i,i) − Σ_{k<i} L(i,k)^2 )^{1/2}
    L(j,i) = ( A(j,i) − Σ_{k<i} L(j,k)·L(i,k) ) / L(i,i),   for j > i

Bandwidth: Θ(n^3)    Latency: Θ(n^3/M)
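An illustrative unblocked implementation (ours) of the two formulas above, paying no attention to communication:

```python
import numpy as np

def naive_cholesky(A):
    """Unblocked Cholesky, column by column, straight from the formulas above."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for i in range(n):
        L[i, i] = np.sqrt(A[i, i] - np.dot(L[i, :i], L[i, :i]))
        for j in range(i + 1, n):
            L[j, i] = (A[j, i] - np.dot(L[j, :i], L[i, :i])) / L[i, i]
    return L

A = np.array([[4., 2., 0.], [2., 3., 1.], [0., 1., 2.]])
assert np.allclose(naive_cholesky(A) @ naive_cholesky(A).T, A)
```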

Page 35: Jim  Demmel  and  Oded  Schwartz

Rectangular Recursive [Toledo 97]

[Figure: an m x n panel split into column halves; blocks A11 A12 / A21 A22 / A31 A32 become L11 0 / L21 L22 / L31 L32]

• Recurse on the left rectangular half (columns 1..n/2)
• Set L12 := 0; update [A22; A32] := [A22; A32] − [L21; L31]·L21^T
• Recurse on the right rectangular half
• Base case (a single column): L := A / (A11)^{1/2}

(An illustrative code sketch follows below.)
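A minimal sketch (ours) of the rectangular recursive factorization on an m x n panel with m >= n; it ignores data layout, which is where the actual bandwidth/latency behavior comes from:

```python
import numpy as np

def chol_rect(A):
    """Rectangular recursive Cholesky panel factorization: returns the m x n lower-trapezoidal L."""
    m, n = A.shape
    A = A.astype(float)
    if n == 1:                                    # base case: a single column
        return A / np.sqrt(A[0, 0])
    h = n // 2
    L = np.zeros((m, n))
    L[:, :h] = chol_rect(A[:, :h])                # recurse on the left rectangle
    A[h:, h:] -= L[h:, :h] @ L[h:n, :h].T         # [A22; A32] -= [L21; L31] L21^T  (L12 := 0 implicitly)
    L[h:, h:] = chol_rect(A[h:, h:])              # recurse on the right rectangle
    return L

A = np.array([[4., 2., 0.], [2., 3., 1.], [0., 1., 2.]])
assert np.allclose(chol_rect(A) @ chol_rect(A).T, A)
```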

Page 36: Jim  Demmel  and  Oded  Schwartz

Rectangular Recursive [Toledo 97]

Toledo’s algorithm guarantees stability for LU.

Bandwidth recurrence:
    BW(m, n) = 2m if n = 1, otherwise BW(m, n/2) + BW_mm(m − n/2, n/2, n/2) + BW(m − n/2, n/2)

Bandwidth: O(n^3/M^{1/2} + n^2 log n), which is optimal (for almost all M).

Page 37: Jim  Demmel  and  Oded  Schwartz

Rectangular Recursive [Toledo 97]

Is it latency optimal?

If we use a column-major data structure, then L = Θ(n^3/M), which is a factor of M^{1/2} away from the optimum Θ(n^3/M^{3/2}). This is due to the matrix multiply.

Open problem: can we prove a latency lower bound for all classical matrix-multiply algorithms that use a column-major data structure?

Note: we don’t need such a lower bound here, as the matrix-multiply algorithm used is known.

Page 38: Jim  Demmel  and  Oded  Schwartz

Rectangular Recursive [Toledo 97]

Is it latency optimal?

If we use a contiguous-blocks data structure, then L = Θ(n^2), due to the forward column updates. The optimum is Θ(n^3/M^{3/2}), and n^2 = O(n^3/M^{3/2}) only when n ≥ M^{3/2}; outside this range the algorithm is not latency optimal.

Page 39: Jim  Demmel  and  Oded  Schwartz

Square Recursive algorithm [Ahmed, Pingali 00], first analyzed in [Ballard, Demmel, Holtz, S. 09]

Split the n x n matrix into n/2 x n/2 blocks [ A11 A12 ; A21 A22 ]:
• Recurse on A11, giving L11
• A12 := 0
• L21 := RTRSM(A21, L11’)   (RTRSM: recursive triangular solver)
• A22 := A22 − L21·L21’
• Recurse on the updated A22

(A code sketch follows below.)
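A minimal sketch (ours): SciPy's solve_triangular stands in for the recursive RTRSM and NumPy's matmul for the recursive matrix multiply, so only the recursion structure is illustrated, not the cache behavior:

```python
import numpy as np
from scipy.linalg import solve_triangular

def chol_square_recursive(A):
    """Square recursive Cholesky: recurse on A11, solve for L21, update A22, recurse on A22."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return np.sqrt(A)
    h = n // 2
    L = np.zeros((n, n))
    L[:h, :h] = chol_square_recursive(A[:h, :h])                            # recurse on A11
    L[h:, :h] = solve_triangular(L[:h, :h], A[:h, h:], lower=True).T        # L21 = A21 L11^{-T}
    L[h:, h:] = chol_square_recursive(A[h:, h:] - L[h:, :h] @ L[h:, :h].T)  # A22 - L21 L21^T, recurse
    return L

A = np.array([[4., 2., 0.], [2., 3., 1.], [0., 1., 2.]])
assert np.allclose(chol_square_recursive(A) @ chol_square_recursive(A).T, A)
```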

Page 40: Jim  Demmel  and  Oded  Schwartz

Square Recursive algorithm [Ahmed, Pingali 00], first analyzed in [Ballard, Demmel, Holtz, S. 09]

Bandwidth recurrence:
    BW(n) = 3n^2 if 3n^2 ≤ M, otherwise 2·BW(n/2) + O(n^3/M^{1/2} + n^2)

Page 41: Jim  Demmel  and  Oded  Schwartz

Square Recursive algorithm [Ahmed, Pingali 00], first analyzed in [Ballard, Demmel, Holtz, S. 09]

Uses:
• recursive matrix multiplication [Frigo, Leiserson, Prokop, Ramachandran 99]
• recursive RTRSM

Bandwidth: O(n^3/M^{1/2})    Latency: O(n^3/M^{3/2})    (assuming a recursive-blocks data structure)
• Optimal bandwidth
• Optimal latency
• Cache-oblivious
• Not stable for LU

Page 42: Jim  Demmel  and  Oded  Schwartz

Parallel Algorithm

[Figure: n x n matrix distributed over a Pr x Pc processor grid in a 2D block-cyclic layout]

1. Find the Cholesky factor locally on a diagonal block; broadcast it downwards.
2. Parallel triangular solve; broadcast the result to the right.
3. Parallel symmetric rank-b update of the trailing matrix.

(A sequential analogue of these three steps in code follows below.)
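A sequential analogue (ours) of these three steps, written as a blocked right-looking Cholesky; the parallel algorithm performs the same operations but distributes each one over the Pr x Pc grid with the broadcasts described above:

```python
import numpy as np
from scipy.linalg import solve_triangular

def blocked_cholesky(A, b):
    """Right-looking blocked Cholesky: per block column, factor the diagonal block (step 1),
    triangular-solve below it (step 2), and rank-b update the trailing matrix (step 3)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    for k in range(0, n, b):
        e = min(k + b, n)
        L[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])                              # step 1
        if e < n:
            L[e:, k:e] = solve_triangular(L[k:e, k:e], A[k:e, e:], lower=True).T   # step 2
            A[e:, e:] -= L[e:, k:e] @ L[e:, k:e].T                                 # step 3
    return L

A = np.array([[4., 2., 0.], [2., 3., 1.], [0., 1., 2.]])
assert np.allclose(blocked_cholesky(A, b=2) @ blocked_cholesky(A, b=2).T, A)
```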

Page 43: Jim  Demmel  and  Oded  Schwartz

Parallel Algorithm

Algorithm                                   Bandwidth                         Latency
Lower bound (general)                       Ω(n^3 / (P·M^{1/2}))              Ω(n^3 / (P·M^{3/2}))
Lower bound (2D layout: M = O(n^2/P))       Ω(n^2 / P^{1/2})                  Ω(P^{1/2})
ScaLAPACK (general)                         O((n^2/P^{1/2} + n·b) log P)      O((n/b) log P)
ScaLAPACK (choosing b = n/P^{1/2})          O((n^2/P^{1/2}) log P)            O(P^{1/2} log P)

The n·b term is low order, as b ≤ n/P^{1/2}.
Optimal (up to a log P factor) only for 2D layouts; the new 2.5D algorithm extends this to any memory size M.

Page 44: Jim  Demmel  and  Oded  Schwartz

Summary of algorithms for dense Cholesky decomposition

Tight lower bounds for Cholesky decomposition.
Sequential:
    Bandwidth: Θ(n^3/M^{1/2})    Latency: Θ(n^3/M^{3/2})
Parallel:
    Bandwidth and latency lower bounds are tight up to a Θ(log P) factor.

Page 45: Jim  Demmel  and  Oded  Schwartz

Sparse Algorithms: Cholesky Decomposition

Based on [L. Grigori, P.-Y. David, J. Demmel, S. Peyronnet ’00]


Page 46: Jim  Demmel  and  Oded  Schwartz

Lower bounds for sparse direct methods

• The geometric-embedding bound for sparse 3-nested-loops-like algorithms may be loose:
    BW = Ω(#flops / M^{1/2}),    L = Ω(#flops / M^{3/2})
  It may be smaller than the NNZ (number of non-zeros) in the input/output.
• For example, computing the product of two (block-)diagonal matrices may involve no communication in parallel.

Page 47: Jim  Demmel  and  Oded  Schwartz

Lower bounds for sparse matrix multiplication

• Compute C = A·B, with A, B, C of size n x n, each having dn nonzeros uniformly distributed; #flops is at most d^2·n
• Lower bound on communication in the sequential case: BW = Ω(dn), L = Ω(dn/M)
• Two types of algorithms, depending on how their communication pattern is designed:
  – Taking into account the nonzero structure of the matrices. Example: [Buluc, Gilbert], BW = L = Ω(d^2·n)
  – Oblivious to the nonzero structure of the matrices. Example: sparse Cannon ([Buluc, Gilbert] for the parallel case), BW = Ω(dn / P^{1/2}), #messages = Ω(P^{1/2})
• [L. Grigori, P.-Y. David, J. Demmel, S. Peyronnet]

Page 48: Jim  Demmel  and  Oded  Schwartz

Lower bounds for sparse Cholesky factorization

• Consider A of size k^s x k^s resulting from a finite-difference operator on a regular grid of dimension s ≥ 2 with k^s nodes.
• Its Cholesky factor L contains a dense lower triangular matrix of size k^{s-1} x k^{s-1}, so
    BW = Ω(W / M^{1/2}),   L = Ω(W / M^{3/2}),
  where W = k^{3(s-1)}/2 for a sequential algorithm and W = k^{3(s-1)}/(2P) for a parallel algorithm.
• This result applies more generally to a matrix A whose graph G = (V, E) has the following property for some s:
  • every set of vertices W ⊂ V with n/3 ≤ |W| ≤ 2n/3 is adjacent to at least s vertices in V − W;
  then the Cholesky factor of A contains a dense s x s submatrix.

Page 49: Jim  Demmel  and  Oded  Schwartz

Optimal algorithms for sparse Cholesky factorization

• PSPASES with an optimal layout could attain the lower bound in parallel for 2D/3D regular grids:
  • uses nested dissection to reorder the matrix
  • distributes the matrix using the subtree-to-subcube algorithm
  • the factorization of every dense multifrontal matrix is performed using an optimal parallel dense Cholesky factorization
• The sequential multifrontal algorithm attains the lower bound:
  • the factorization of every dense multifrontal matrix is performed using an optimal dense Cholesky factorization

Page 50: Jim  Demmel  and  Oded  Schwartz

Open Problems / Work in Progress (1/2)
• Rank-Revealing CAQR (RRQR)
  • A·P = Q·R, with P a permutation chosen to put the “most independent” columns first
  • Does LAPACK geqp3 move O(n^3) words, or fewer?
  • Better: Tournament Pivoting (TP) on blocks of columns
• CALU with more reliable pivoting (Gu, Grigori, Khabou et al.)
  • Replace tournament pivoting of the panel (TSLU) with rank-revealing QR of the panel
• LU with Complete Pivoting (GECP)
  • A = Pr·L·U·Pc^T, where Pr and Pc are permutations putting the “most independent” rows and columns first
  • Idea: TP on column blocks to choose columns, then TSLU on those columns to choose rows
• LU of A = A^T, with symmetric pivoting
  • A = P·L·L^T·P^T, with P a permutation and L lower triangular
  • Bunch-Kaufman, rook pivoting: O(n^3) words moved?
  • CA approach: choose a block with one step of CA-GECP, permute its rows and columns to the top left, pick a symmetric subset with local Bunch-Kaufman
• A = P·L·B·L^T·P^T, with symmetric pivoting (Toledo et al.)
  • L triangular, P a permutation, B banded
  • B tridiagonal: due to Aasen
  • CA approach: B block tridiagonal, stability still in question

Page 51: Jim  Demmel  and  Oded  Schwartz

Open Problems / Work in Progress (2/2)
• Heterogeneity (Ballard, Gearhart)
  • How to assign work to attain the lower bounds when flop rates / bandwidths / latencies vary?
• Using extra memory in the distributed-memory case (Solomonik)
  • Using all available memory to attain improved lower bounds
  • Just matmul and CALU so far
• Using extra memory in the shared-memory case
• Sparse matrices …
• Combining sparse and dense algorithms for better communication performance, e.g.,
  • the above [Grigori, David, Peleg, Peyronnet ’09]
  • our retuned version of [Yuster, Zwick ’04]
• Eliminate the log P factor?
• Prove latency lower bounds for specific data-structure use; e.g., can we show that any classical matrix-multiply algorithm that uses a column-major data structure has to be a factor of M^{1/2} away from optimality?