
Page 1: Jim  Demmel  and  Oded  Schwartz

Communication Avoiding Algorithms
for Dense Linear Algebra:
LU, QR, and Cholesky decompositions and Sparse Cholesky decompositions

Jim Demmel and Oded Schwartz

CS294, Lecture #5, Fall 2011: Communication-Avoiding Algorithms
www.cs.berkeley.edu/~odedsc/CS294

Based on: Bootcamp 2011, CS267, Summer-school 2010, Laura Grigori’s, many papers

Page 2: Jim  Demmel  and  Oded  Schwartz


Outline for today
• Recall communication lower bounds
• What a “matching upper bound”, i.e. a CA algorithm, might have to do
• Summary of what is known so far
• Case studies of CA dense algorithms
  • Last time, case study I: matrix multiplication
  • Case study II: LU, QR
  • Case study III: Cholesky decomposition
• Case studies of CA sparse algorithms
  • Sparse Cholesky decomposition

Page 3: Jim  Demmel  and  Oded  Schwartz


Communication lower bounds
• Applies to algorithms satisfying the technical conditions discussed last time (“smells like 3-nested loops”)
• For each processor:
  • M = size of the fast memory of that processor
  • G = #operations (multiplies) performed by that processor

#words_moved to/from fast memory = Ω( max( G / M^{1/2}, #inputs + #outputs ) )

#messages to/from fast memory ≥ #words_moved / max_message_size = Ω( G / M^{3/2} )
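As a purely illustrative aid (ours, not from the slides), the two bounds can be evaluated for one processor with a small Python helper; the example plugs in sequential classical matrix multiplication, where G = n^3:

```python
from math import sqrt

def communication_lower_bounds(G, M, n_inputs, n_outputs):
    """Evaluate the two lower bounds above for one processor:
    G = #multiplies performed, M = fast memory size in words."""
    words_moved = max(G / sqrt(M), n_inputs + n_outputs)  # Omega(max(G/M^{1/2}, #inputs + #outputs))
    messages = G / M**1.5                                  # Omega(G/M^{3/2}), since max_message_size <= M
    return words_moved, messages

# Example: classical matmul of 1024 x 1024 matrices, G = n^3, with M = 2^20 words of fast memory
n, M = 1024, 2**20
print(communication_lower_bounds(G=n**3, M=M, n_inputs=2 * n**2, n_outputs=n**2))
```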

Page 4: Jim  Demmel  and  Oded  Schwartz

Summary of dense sequential algorithms attaining communication lower bounds
• Algorithms shown minimizing #messages use a (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports); only some are shown, plus ours
• On the original slide, cache-oblivious algorithms are underlined and ours are in green; “?” marks unknown/future work

Algorithm: 2 Levels of Memory (#words moved; #messages) vs. Multiple Levels of Memory (#words moved; #messages)

BLAS-3
  2 levels: usual blocked or recursive algorithms (both #words moved and #messages)
  Multiple levels: usual blocked algorithms (nested), or recursive [Gustavson 97]

Cholesky
  2 levels, #words moved: LAPACK (with b = M^{1/2}), [Gustavson 97], [BDHS09]
  2 levels, #messages: [Gustavson 97], [Ahmed, Pingali 00], [BDHS09]
  Multiple levels: same as 2 levels

LU with pivoting
  2 levels, #words moved: LAPACK (sometimes), [Toledo 97], [GDX 08]
  2 levels, #messages: [GDX 08] (not partial pivoting)
  Multiple levels, #words moved: [Toledo 97], [GDX 08]?
  Multiple levels, #messages: [GDX 08]?

QR (rank-revealing)
  2 levels, #words moved: LAPACK (sometimes), [Elmroth, Gustavson 98], [DGHL08]
  2 levels, #messages: [Frens, Wise 03] (but 3x the flops?), [DGHL08]
  Multiple levels, #words moved: [Elmroth, Gustavson 98], [DGHL08]?
  Multiple levels, #messages: [Frens, Wise 03], [DGHL08]?

Eig, SVD
  2 levels, #words moved: not LAPACK; [BDD11] (randomized, but more flops), [BDK11]
  2 levels, #messages: [BDD11, BDK11?]
  Multiple levels: [BDD11, BDK11?]

Page 5: Jim  Demmel  and  Oded  Schwartz

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^{1/2} ) = Ω( n^2 / P^{1/2} )
    #messages    = Ω( (n^3/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm         Reference    Factor exceeding lower bound for #words_moved    Factor exceeding lower bound for #messages
Matrix Multiply
Cholesky
LU
QR
Sym Eig, SVD
Nonsym Eig
(entries are filled in on the following slides)

Page 6: Jim  Demmel  and  Oded  Schwartz

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (conventional approach)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^{1/2} ) = Ω( n^2 / P^{1/2} )
    #messages    = Ω( (n^3/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm         Reference      Factor exceeding lower bound for #words_moved
Matrix Multiply   [Cannon, 69]   1
Cholesky          ScaLAPACK      log P
LU                ScaLAPACK      log P
QR                ScaLAPACK      log P
Sym Eig, SVD      ScaLAPACK      log P
Nonsym Eig        ScaLAPACK      P^{1/2} log P

Page 7: Jim  Demmel  and  Oded  Schwartz

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (conventional approach)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^{1/2} ) = Ω( n^2 / P^{1/2} )
    #messages    = Ω( (n^3/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm         Reference      Factor exceeding lower bound for #words_moved    Factor exceeding lower bound for #messages
Matrix Multiply   [Cannon, 69]   1                                                1
Cholesky          ScaLAPACK      log P                                            log P
LU                ScaLAPACK      log P                                            n log P / P^{1/2}
QR                ScaLAPACK      log P                                            n log P / P^{1/2}
Sym Eig, SVD      ScaLAPACK      log P                                            n / P^{1/2}
Nonsym Eig        ScaLAPACK      P^{1/2} log P                                    n log P

Page 8: Jim  Demmel  and  Oded  Schwartz

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (better)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^{1/2} ) = Ω( n^2 / P^{1/2} )
    #messages    = Ω( (n^3/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm         Reference      Factor exceeding lower bound for #words_moved    Factor exceeding lower bound for #messages
Matrix Multiply   [Cannon, 69]   1                                                1
Cholesky          ScaLAPACK      log P                                            log P
LU                [GDX10]        log P                                            log P
QR                [DGHL08]       log P                                            log^3 P
Sym Eig, SVD      [BDD11]        log P                                            log^3 P
Nonsym Eig        [BDD11]        log P                                            log^3 P

Page 9: Jim  Demmel  and  Oded  Schwartz

Can we do even better?

• Assume n x n matrices on P processors
• Why just one copy of the data, M = O(n^2 / P) per processor?
• Increase M to reduce the lower bounds:
    #words_moved = Ω( (n^3/P) / M^{1/2} )
    #messages    = Ω( (n^3/P) / M^{3/2} )

Algorithm         Reference      Factor exceeding lower bound for #words_moved    Factor exceeding lower bound for #messages
Matrix Multiply   [Cannon, 69]   1                                                1
Cholesky          ScaLAPACK      log P                                            log P
LU                [GDX10]        log P                                            log P
QR                [DGHL08]       log P                                            log^3 P
Sym Eig, SVD      [BDD11]        log P                                            log^3 P
Nonsym Eig        [BDD11]        log P                                            log^3 P

Page 10: Jim  Demmel  and  Oded  Schwartz

Recall Sequential One-sided Factorizations (LU, QR)

• Classical approach:
    for i = 1 to n
        update column i
        update trailing matrix
  #words_moved = O(n^3)

• Blocked approach (LAPACK):
    for i = 1 to n/b
        update block i of b columns
        update trailing matrix
  #words_moved = O(n^3 / M^{1/3})

• Recursive approach:
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            update right half of A
            factor(right half of A)
  #words_moved = O(n^3 / M^{1/2})

• None of these approaches minimizes #messages or addresses parallelism; we need another idea. (A code sketch of the recursive variant follows below.)
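Below is a minimal Python sketch (ours, unpivoted for simplicity, so it assumes no pivot ever vanishes) of the recursive approach; it overwrites A with its LU factors:

```python
import numpy as np

def recursive_lu(A):
    """Recursive (column-halving) LU without pivoting on an m x n block, m >= n.
    L (unit lower, strict part) and U are stored in place of A."""
    n = A.shape[1]
    if n == 1:
        A[1:, 0] /= A[0, 0]                 # "update it": scale the single column
        return
    k = n // 2
    recursive_lu(A[:, :k])                  # factor(left half of A)
    # update right half of A: triangular solve with unit-lower L11, then Schur complement
    L11 = np.tril(A[:k, :k], -1) + np.eye(k)
    A[:k, k:] = np.linalg.solve(L11, A[:k, k:])
    A[k:, k:] -= A[k:, :k] @ A[:k, k:]
    recursive_lu(A[k:, k:])                 # factor(right half of A)

A = np.random.randn(8, 8) + 8 * np.eye(8)    # diagonally dominant, so no pivoting is needed
LU = A.copy(); recursive_lu(LU)
L, U = np.tril(LU, -1) + np.eye(8), np.triu(LU)
assert np.allclose(L @ U, A)
```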


Page 11: Jim  Demmel  and  Oded  Schwartz

Minimizing latency in the sequential case requires new data structures

• To minimize latency, we need to load/store a whole rectangular subblock of the matrix with one “message”
  • Incompatible with conventional columnwise (rowwise) storage
  • Ex: rows (columns) are not in contiguous memory locations
• Blocked storage: store the matrix as a matrix of b x b blocks, each block stored contiguously
  • OK for one level of the memory hierarchy; what if there are more?
• Recursive blocked storage: store each block using subblocks
  • Also known as “space-filling curves” or “Morton ordering” (see the sketch below)
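Here is a small illustrative sketch (ours; the helper name morton_index is an assumption) of Morton ordering by bit interleaving, which is one way to lay out recursive blocks contiguously:

```python
def morton_index(i, j, bits=16):
    """Z-order (Morton) index of block coordinates (i, j), obtained by interleaving bits."""
    z = 0
    for b in range(bits):
        z |= ((i >> b) & 1) << (2 * b + 1)
        z |= ((j >> b) & 1) << (2 * b)
    return z

# Ordering b x b blocks by morton_index makes every 2x2 group of blocks contiguous,
# then every 4x4 group, and so on: a recursive blocked layout.
print(sorted(range(16), key=lambda t: morton_index(t // 4, t % 4))[:4])  # blocks (0,0),(0,1),(1,0),(1,1)
```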


Page 12: Jim  Demmel  and  Oded  Schwartz


Different Data Layouts for Parallel GE

[Figure: six candidate layouts of a matrix over 4 processors]
1) 1D Column Blocked Layout: bad load balance, P0 idle after the first n/4 steps
2) 1D Column Cyclic Layout: load balanced, but can’t easily use BLAS2 or BLAS3
3) 1D Column Block Cyclic Layout: can trade load balance and BLAS2/3 performance by choosing b, but the factorization of a block column is a bottleneck
4) Block Skewed Layout: complicated addressing; may not want full parallelism in each column/row
5) 2D Row and Column Blocked Layout: bad load balance, P0 idle after the first n/2 steps
6) 2D Row and Column Block Cyclic Layout: the winner!

Page 13: Jim  Demmel  and  Oded  Schwartz


Distributed GE with a 2D Block Cyclic Layout


Page 14: Jim  Demmel  and  Oded  Schwartz


[Figure: one update step of distributed GE] Matrix multiply of green = green - blue * pink

Page 15: Jim  Demmel  and  Oded  Schwartz

Where is the communication bottleneck?
• In both the sequential and parallel case: the panel factorization
• Can we retain partial pivoting? No
• “Proof” for the parallel case:
  • each pivot choice requires a reduction (to find the maximum entry) per column
  • #messages = Ω(n) or Ω(n log P), versus the goal of O(P^{1/2})
• Solution for QR first, then LU


Page 16: Jim  Demmel  and  Oded  Schwartz

TSQR: QR of a Tall, Skinny matrix

W = [ W0 ; W1 ; W2 ; W3 ] = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00 ; R10 ; R20 ; R30 ]

[ R00 ; R10 ; R20 ; R30 ] = [ Q01·R01 ; Q11·R11 ] = diag(Q01, Q11) · [ R01 ; R11 ]

[ R01 ; R11 ] = Q02·R02

Output: Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02
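A minimal NumPy sketch (ours, not the production algorithm) of TSQR with a binary reduction tree on four row blocks; it keeps only the final R factor and discards the implicit Q00…Q02:

```python
import numpy as np

def tsqr_binary_tree(W, nblocks=4):
    """TSQR sketch: local QRs on row blocks, then pairwise combines up a binary tree."""
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(Wi, mode='r') for Wi in blocks]           # leaf level: R00, R10, R20, R30
    while len(Rs) > 1:                                           # combine pairs: R01, R11, then R02
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r') for i in range(0, len(Rs), 2)]
    return Rs[0]

# Sanity check: the R from TSQR matches a direct QR up to the signs of its rows
W = np.random.randn(1000, 50)
assert np.allclose(np.abs(tsqr_binary_tree(W)), np.abs(np.linalg.qr(W, mode='r')))
```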

Page 17: Jim  Demmel  and  Oded  Schwartz

TSQR: An Architecture-Dependent Algorithm

Parallel (binary tree):   W = [ W0 ; W1 ; W2 ; W3 ] → local QRs give R00, R10, R20, R30 → pairwise combines give R01, R11 → final combine gives R02
Sequential (flat tree):   W = [ W0 ; W1 ; W2 ; W3 ] → factor W0 into R00 → fold in W1 to get R01 → fold in W2 to get R02 → fold in W3 to get R03
Dual core:                a hybrid of the two trees

Can choose the reduction tree dynamically
Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?


Page 18: Jim  Demmel  and  Oded  Schwartz

Back to LU. Using a similar idea for TSLU as in TSQR: use a reduction tree to do “Tournament Pivoting”

W (n x b) = [ W1 ; W2 ; W3 ; W4 ]

Step 1: factor each block, Wi = Pi·Li·Ui, and choose b pivot rows of each Wi; call them Wi’
Step 2: stack pairs and factor again: [ W1’ ; W2’ ] = P12·L12·U12 and [ W3’ ; W4’ ] = P34·L34·U34; choose b pivot rows of each, call them W12’ and W34’
Step 3: [ W12’ ; W34’ ] = P1234·L1234·U1234; choose the final b pivot rows

Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting). A small code sketch of the pivot-row tournament follows below.
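A small illustrative sketch (ours) of the tournament: select_pivot_rows is an assumed helper that performs b steps of partial pivoting to pick b candidate rows, and the pairs are then combined up a binary tree:

```python
import numpy as np

def select_pivot_rows(block, b):
    """Local indices of b pivot rows of `block`, chosen by b steps of GE with partial pivoting."""
    A = block.astype(float)
    rows, chosen = list(range(A.shape[0])), []
    for k in range(b):
        p = k + int(np.argmax(np.abs(A[k:, k])))
        A[[k, p]] = A[[p, k]]
        rows[k], rows[p] = rows[p], rows[k]
        chosen.append(rows[k])
        A[k + 1:, k:] -= np.outer(A[k + 1:, k] / A[k, k], A[k, k:])
    return chosen

def tournament_pivoting(W, b, nblocks=4):
    """Tournament pivoting on a tall-skinny W (n x b): b candidates per block, combined pairwise."""
    blocks = np.array_split(np.arange(W.shape[0]), nblocks)
    candidates = [blk[select_pivot_rows(W[blk], b)] for blk in blocks]
    while len(candidates) > 1:
        merged = []
        for i in range(0, len(candidates), 2):
            idx = np.concatenate(candidates[i:i + 2])
            merged.append(idx[select_pivot_rows(W[idx], b)])
        candidates = merged
    return candidates[0]          # global indices of the final b pivot rows

pivot_rows = tournament_pivoting(np.random.randn(1024, 8), b=8)
```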

Page 19: Jim  Demmel  and  Oded  Schwartz

Minimizing Communication in TSLU

Parallel (binary tree):   W = [ W1 ; W2 ; W3 ; W4 ] → local LU of each block → pairwise LUs up the tree → one final LU
Sequential (flat tree):   W = [ W1 ; W2 ; W3 ; W4 ] → LU of W1, then fold in W2, W3, W4 one at a time
Dual core:                a hybrid of the two trees

Can choose the reduction tree dynamically, as before
Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?


Page 20: Jim  Demmel  and  Oded  Schwartz

Making TSLU Numerically Stable
• Stability goal: make ||A − P·L·U|| very small: O(machine_epsilon · ||A||)
• Details matter
  • Going up the tree, we could do LU either on the original rows of W (tournament pivoting) or on the computed rows of U
  • Only tournament pivoting is stable
• Thm: the new scheme is as stable as partial pivoting (PP) in the following sense: it gives the same results as PP applied to a different input matrix whose entries are blocks taken from the input A
• CALU – Communication-Avoiding LU for general A
  – Use TSLU for the panel factorizations
  – Apply the updates to the rest of the matrix
  – Cost: redundant panel factorizations
    – Extra flops for one reduction step: n/b panels, O(n·b^2) flops per panel
    – For the sequential algorithm this is n^2·b; with b = M^{1/2} the extra flops are n^2·M^{1/2} (< n^3, since M < n^2)
    – For the parallel 2D algorithm there are log P reduction steps, n/b panels, n·b^2/P^{1/2} flops per panel, and b = n/(log^2(P)·P^{1/2}), so the extra flops are (n/b)·(n·b^2)·log P / P^{1/2} = n^2·b·log P / P^{1/2} = n^3/(P·log P) < n^3/P
• Benefit:
  – Stable in practice, but not the same pivot choice as GEPP
  – One reduction operation per panel: reduces latency to the minimum


Page 21: Jim  Demmel  and  Oded  Schwartz

Sketch of TSLU Stability Proof

• Consider A = [ A11 A12 ; A21 A22 ; A31 A32 ], with TSLU of the first block column [ A11 ; A21 ; A31 ] selecting A21 (assume wlog that A21 already contains the desired pivot rows)
• The resulting LU decomposition (first n/2 steps) is

    [ Π11 Π12 0 ; Π21 Π22 0 ; 0 0 I ] · [ A11 A12 ; A21 A22 ; A31 A32 ]
        = [ A’11 A’12 ; A’21 A’22 ; A31 A32 ]
        = [ L’11 ; L’21 I ; L’31 0 I ] · [ U’11 U’12 ; A^s_22 ; A^s_32 ]

• Claim: A^s_32 can be obtained by GEPP on a larger matrix G formed from blocks of A:

    G = [ A’11 A’12 ; A21 A21 ; −A31 A32 ]
      = [ L’11 ; A21·U’11^{-1} L21 ; −L31 I ] · [ U’11 U’12 ; U21 −L21^{-1}·A21·U’11^{-1}·U’12 ; A^s_32 ]

• GEPP applied to G pivots no rows, and we need to show that this A^s_32 is the same as the one above:

    L31·L21^{-1}·A21·U’11^{-1}·U’12 + A^s_32
      = L31·U21·U’11^{-1}·U’12 + A^s_32
      = A31·U’11^{-1}·U’12 + A^s_32
      = L’31·U’12 + A^s_32
      = A32

Page 22: Jim  Demmel  and  Oded  Schwartz

Stability of LU using TSLU: CALU (1/2)
• Worst-case analysis
• Pivot growth factor from one panel factorization with b columns
  • TSLU:
    – Proven: 2^{bH}, where H = 1 + height of the reduction tree
    – Attained: 2^b, same as GEPP
• Pivot growth factor on an m x n matrix
  • TSLU:
    – Proven: 2^{nH-1}
    – Attained: 2^{n-1}, same as GEPP
• |L| can be as large as 2^{(b-2)H-b+1}, but CALU is still stable
• There are examples where GEPP is exponentially less stable than CALU, and vice versa, so neither is always better than the other
• Discovered sparse matrices with O(2^n) pivot growth

Page 23: Jim  Demmel  and  Oded  Schwartz

Stability of LU using TSLU: CALU (2/2)

• Empirical testing
• Both random matrices and “special ones”
• Both binary tree (BCALU) and flat tree (FCALU)
• 3 metrics: ||PA − LU|| / ||A||, normwise and componentwise backward errors
• See [D., Grigori, Xiang, 2010] for details

Page 24: Jim  Demmel  and  Oded  Schwartz

Performance vs ScaLAPACK

• TSLU
  – IBM Power 5: up to 4.37x faster (16 procs, 1M x 150)
  – Cray XT4: up to 5.52x faster (8 procs, 1M x 150)
• CALU
  – IBM Power 5: up to 2.29x faster (64 procs, 1000 x 1000)
  – Cray XT4: up to 1.81x faster (64 procs, 1000 x 1000)
• See INRIA Tech Report 6523 (2008); paper at SC08

Page 25: Jim  Demmel  and  Oded  Schwartz

Exascale predicted speedups for Gaussian Elimination: CA-LU vs ScaLAPACK-LU

[Figure: predicted speedup plotted against log2(p) on one axis and log2(n^2/p) = log2(memory_per_proc) on the other]

Page 26: Jim  Demmel  and  Oded  Schwartz

Case Study III: Cholesky Decomposition


Page 27: Jim  Demmel  and  Oded  Schwartz

LU and Cholesky decompositions

LU decomposition of A:
    A = P·L·U, where U is upper triangular, L is lower triangular, and P is a permutation matrix.

Cholesky decomposition of A:
    If A is a real symmetric positive definite matrix, then there exists a real lower triangular L such that A = L·L^T (L is unique if we restrict its diagonal elements to be > 0).

Efficient parallel and sequential algorithms?
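As a quick illustration (ours), the Cholesky definition can be checked directly in NumPy on a small SPD matrix:

```python
import numpy as np

A = np.array([[4., 2.],
              [2., 3.]])                 # real symmetric positive definite
L = np.linalg.cholesky(A)                # lower triangular with positive diagonal
assert np.allclose(L @ L.T, A)           # A = L L^T
```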

Page 28: Jim  Demmel  and  Oded  Schwartz

Application of the generalized form to Cholesky decomposition

(1) Generalized form: for (i,j) ∈ S,
    C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ),   k1, k2, … ∈ S_ij

For Cholesky this instantiates to:
    L(j,j) = ( A(j,j) − Σ_{k<j} L(j,k)^2 )^{1/2}
    L(i,j) = ( A(i,j) − Σ_{k<j} L(i,k)·L(j,k) ) / L(j,j),   for i > j

Page 29: Jim  Demmel  and  Oded  Schwartz

By the Geometric Embedding theorem [Ballard, Demmel, Holtz, S. 2011a]
(follows [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]):

The communication cost (bandwidth) of any classical algorithm for Cholesky decomposition (of dense matrices) is
    Ω( n^3 / M^{1/2} )  sequentially, and  Ω( n^3 / (P·M^{1/2}) )  in parallel.

Page 30: Jim  Demmel  and  Oded  Schwartz

Some Sequential Classical Cholesky Algorithms

Algorithm (data structure)                               Bandwidth                      Latency
Lower bound                                              Ω(n^3/M^{1/2})                 Ω(n^3/M^{3/2})
Naive                                                    Θ(n^3)                         Θ(n^3/M)
LAPACK (with the right block size!), column-major        O(n^3/M^{1/2})                 O(n^3/M)
LAPACK (with the right block size!), contiguous blocks   O(n^3/M^{1/2})                 O(n^3/M^{3/2})
Rectangular Recursive [Toledo 97], column-major          O(n^3/M^{1/2} + n^2 log n)     Θ(n^3/M)
Rectangular Recursive [Toledo 97], contiguous blocks     O(n^3/M^{1/2} + n^2 log n)     Θ(n^2)
Square Recursive [Ahmed, Pingali 00], column-major       O(n^3/M^{1/2})                 O(n^3/M)
Square Recursive [Ahmed, Pingali 00], contiguous blocks  O(n^3/M^{1/2})                 O(n^3/M^{3/2})

(Both recursive algorithms are cache-oblivious.)

Page 31: Jim  Demmel  and  Oded  Schwartz

Some Parallel Classical Cholesky Algorithms

Algorithm                                   Bandwidth                         Latency
Lower bound (general)                       Ω(n^3 / (P·M^{1/2}))              Ω(n^3 / (P·M^{3/2}))
Lower bound (2D layout: M = O(n^2/P))       Ω(n^2 / P^{1/2})                  Ω(P^{1/2})
ScaLAPACK (general)                         O((n^2/P^{1/2} + n·b) log P)      O((n/b) log P)
ScaLAPACK (choosing b = n/P^{1/2})          O((n^2/P^{1/2}) log P)            O(P^{1/2} log P)

Page 32: Jim  Demmel  and  Oded  Schwartz

Upper bounds – Supporting Data Structures

[Figure] Top (column-major): Full, Old Packed, Rectangular Full Packed. Bottom (block-contiguous): Blocked, Recursive Format, Recursive Full Packed. [Frigo, Leiserson, Prokop, Ramachandran 99; Ahmed, Pingali 00; Elmroth, Gustavson, Jonsson, Kagstrom 04]

Page 33: Jim  Demmel  and  Oded  Schwartz

Upper bounds – Supporting Data Structures

Rules of thumb:
• Don’t mix column/row-major with blocked storage: decide which one fits your algorithm.
• Converting between the two once in your algorithm is OK for dense instances, e.g., changing the data structure at the beginning of the algorithm.

Page 34: Jim  Demmel  and  Oded  Schwartz

The Naïve Algorithms

[Figure: access patterns of the left-looking and right-looking variants]

Naïve left-looking and right-looking Cholesky, assuming a column-major data structure:

    L(i,i) = ( A(i,i) − Σ_{k<i} L(i,k)^2 )^{1/2}
    L(j,i) = ( A(j,i) − Σ_{k<i} L(j,k)·L(i,k) ) / L(i,i),   for j > i

Bandwidth: Θ(n^3)    Latency: Θ(n^3/M)
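An illustrative unblocked implementation (ours) of the two formulas above, paying no attention to communication:

```python
import numpy as np

def naive_cholesky(A):
    """Unblocked Cholesky, column by column, straight from the formulas above."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for i in range(n):
        L[i, i] = np.sqrt(A[i, i] - np.dot(L[i, :i], L[i, :i]))
        for j in range(i + 1, n):
            L[j, i] = (A[j, i] - np.dot(L[j, :i], L[i, :i])) / L[i, i]
    return L

A = np.array([[4., 2., 0.], [2., 3., 1.], [0., 1., 2.]])
assert np.allclose(naive_cholesky(A) @ naive_cholesky(A).T, A)
```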

Page 35: Jim  Demmel  and  Oded  Schwartz

Rectangular Recursive [Toledo 97]

[Figure: an m x n panel split into column halves; blocks A11 A12 / A21 A22 / A31 A32 become L11 0 / L21 L22 / L31 L32]

• Recurse on the left rectangular half (columns 1..n/2)
• Set L12 := 0; update [A22; A32] := [A22; A32] − [L21; L31]·L21^T
• Recurse on the right rectangular half
• Base case (a single column): L := A / (A11)^{1/2}

(An illustrative code sketch follows below.)
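A minimal sketch (ours) of the rectangular recursive factorization on an m x n panel with m >= n; it ignores data layout, which is where the actual bandwidth/latency behavior comes from:

```python
import numpy as np

def chol_rect(A):
    """Rectangular recursive Cholesky panel factorization: returns the m x n lower-trapezoidal L."""
    m, n = A.shape
    A = A.astype(float)
    if n == 1:                                    # base case: a single column
        return A / np.sqrt(A[0, 0])
    h = n // 2
    L = np.zeros((m, n))
    L[:, :h] = chol_rect(A[:, :h])                # recurse on the left rectangle
    A[h:, h:] -= L[h:, :h] @ L[h:n, :h].T         # [A22; A32] -= [L21; L31] L21^T  (L12 := 0 implicitly)
    L[h:, h:] = chol_rect(A[h:, h:])              # recurse on the right rectangle
    return L

A = np.array([[4., 2., 0.], [2., 3., 1.], [0., 1., 2.]])
assert np.allclose(chol_rect(A) @ chol_rect(A).T, A)
```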

Page 36: Jim  Demmel  and  Oded  Schwartz

Rectangular Recursive [Toledo 97]

Toledo’s algorithm guarantees stability for LU.

Bandwidth recurrence:
    BW(m, n) = 2m if n = 1, otherwise BW(m, n/2) + BW_mm(m − n/2, n/2, n/2) + BW(m − n/2, n/2)

Bandwidth: O(n^3/M^{1/2} + n^2 log n), which is optimal (for almost all M).

Page 37: Jim  Demmel  and  Oded  Schwartz

Rectangular Recursive [Toledo 97]

Is it latency optimal?

If we use a column-major data structure, then L = Θ(n^3/M), which is a factor of M^{1/2} away from the optimum Θ(n^3/M^{3/2}). This is due to the matrix multiply.

Open problem: can we prove a latency lower bound for all classical matrix-multiply algorithms that use a column-major data structure?

Note: we don’t need such a lower bound here, as the matrix-multiply algorithm used is known.

Page 38: Jim  Demmel  and  Oded  Schwartz

Rectangular Recursive [Toledo 97]

Is it latency optimal?

If we use a contiguous-blocks data structure, then L = Θ(n^2), due to the forward column updates. The optimum is Θ(n^3/M^{3/2}), and n^2 = O(n^3/M^{3/2}) only when n ≥ M^{3/2}; outside this range the algorithm is not latency optimal.

Page 39: Jim  Demmel  and  Oded  Schwartz

Square Recursive algorithm [Ahmed, Pingali 00], first analyzed in [Ballard, Demmel, Holtz, S. 09]

Split the n x n matrix into n/2 x n/2 blocks [ A11 A12 ; A21 A22 ]:
• Recurse on A11, giving L11
• A12 := 0
• L21 := RTRSM(A21, L11’)   (RTRSM: recursive triangular solver)
• A22 := A22 − L21·L21’
• Recurse on the updated A22

(A code sketch follows below.)
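A minimal sketch (ours): SciPy's solve_triangular stands in for the recursive RTRSM and NumPy's matmul for the recursive matrix multiply, so only the recursion structure is illustrated, not the cache behavior:

```python
import numpy as np
from scipy.linalg import solve_triangular

def chol_square_recursive(A):
    """Square recursive Cholesky: recurse on A11, solve for L21, update A22, recurse on A22."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return np.sqrt(A)
    h = n // 2
    L = np.zeros((n, n))
    L[:h, :h] = chol_square_recursive(A[:h, :h])                            # recurse on A11
    L[h:, :h] = solve_triangular(L[:h, :h], A[:h, h:], lower=True).T        # L21 = A21 L11^{-T}
    L[h:, h:] = chol_square_recursive(A[h:, h:] - L[h:, :h] @ L[h:, :h].T)  # A22 - L21 L21^T, recurse
    return L

A = np.array([[4., 2., 0.], [2., 3., 1.], [0., 1., 2.]])
assert np.allclose(chol_square_recursive(A) @ chol_square_recursive(A).T, A)
```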

Page 40: Jim  Demmel  and  Oded  Schwartz

Square Recursive algorithm [Ahmed, Pingali 00], first analyzed in [Ballard, Demmel, Holtz, S. 09]

Bandwidth recurrence:
    BW(n) = 3n^2 if 3n^2 ≤ M, otherwise 2·BW(n/2) + O(n^3/M^{1/2} + n^2)

Page 41: Jim  Demmel  and  Oded  Schwartz

Square Recursive algorithm [Ahmed, Pingali 00], first analyzed in [Ballard, Demmel, Holtz, S. 09]

Uses:
• recursive matrix multiplication [Frigo, Leiserson, Prokop, Ramachandran 99]
• recursive RTRSM

Bandwidth: O(n^3/M^{1/2})    Latency: O(n^3/M^{3/2})    (assuming a recursive-blocks data structure)
• Optimal bandwidth
• Optimal latency
• Cache-oblivious
• Not stable for LU

Page 42: Jim  Demmel  and  Oded  Schwartz

Parallel Algorithm

[Figure: n x n matrix distributed over a Pr x Pc processor grid in a 2D block-cyclic layout]

1. Find the Cholesky factor locally on a diagonal block; broadcast it downwards.
2. Parallel triangular solve; broadcast the result to the right.
3. Parallel symmetric rank-b update of the trailing matrix.

(A sequential analogue of these three steps in code follows below.)
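A sequential analogue (ours) of these three steps, written as a blocked right-looking Cholesky; the parallel algorithm performs the same operations but distributes each one over the Pr x Pc grid with the broadcasts described above:

```python
import numpy as np
from scipy.linalg import solve_triangular

def blocked_cholesky(A, b):
    """Right-looking blocked Cholesky: per block column, factor the diagonal block (step 1),
    triangular-solve below it (step 2), and rank-b update the trailing matrix (step 3)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    for k in range(0, n, b):
        e = min(k + b, n)
        L[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])                              # step 1
        if e < n:
            L[e:, k:e] = solve_triangular(L[k:e, k:e], A[k:e, e:], lower=True).T   # step 2
            A[e:, e:] -= L[e:, k:e] @ L[e:, k:e].T                                 # step 3
    return L

A = np.array([[4., 2., 0.], [2., 3., 1.], [0., 1., 2.]])
assert np.allclose(blocked_cholesky(A, b=2) @ blocked_cholesky(A, b=2).T, A)
```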

Page 43: Jim  Demmel  and  Oded  Schwartz

Parallel Algorithm

Algorithm                                   Bandwidth                         Latency
Lower bound (general)                       Ω(n^3 / (P·M^{1/2}))              Ω(n^3 / (P·M^{3/2}))
Lower bound (2D layout: M = O(n^2/P))       Ω(n^2 / P^{1/2})                  Ω(P^{1/2})
ScaLAPACK (general)                         O((n^2/P^{1/2} + n·b) log P)      O((n/b) log P)
ScaLAPACK (choosing b = n/P^{1/2})          O((n^2/P^{1/2}) log P)            O(P^{1/2} log P)

The n·b term is low order, as b ≤ n/P^{1/2}.
Optimal (up to a log P factor) only for 2D layouts; the new 2.5D algorithm extends this to any memory size M.

Page 44: Jim  Demmel  and  Oded  Schwartz

Summary of algorithms for dense Cholesky decomposition

Tight lower bounds for Cholesky decomposition.
Sequential:
    Bandwidth: Θ(n^3/M^{1/2})    Latency: Θ(n^3/M^{3/2})
Parallel:
    Bandwidth and latency lower bounds are tight up to a Θ(log P) factor.

Page 45: Jim  Demmel  and  Oded  Schwartz

Sparse Algorithms: Cholesky Decomposition

Based on [L. Grigori, P.-Y. David, J. Demmel, S. Peyronnet ’00]


Page 46: Jim  Demmel  and  Oded  Schwartz

Lower bounds for sparse direct methods

• The geometric-embedding bound for sparse 3-nested-loops-like algorithms may be loose:
    BW = Ω(#flops / M^{1/2}),    L = Ω(#flops / M^{3/2})
  It may be smaller than the NNZ (number of non-zeros) in the input/output.
• For example, computing the product of two (block-)diagonal matrices may involve no communication in parallel.

Page 47: Jim  Demmel  and  Oded  Schwartz

Lower bounds for sparse matrix multiplication

• Compute C = A·B, with A, B, C of size n x n, each having dn nonzeros uniformly distributed; #flops is at most d^2·n
• Lower bound on communication in the sequential case: BW = Ω(dn), L = Ω(dn/M)
• Two types of algorithms, depending on how their communication pattern is designed:
  – Taking into account the nonzero structure of the matrices. Example: [Buluc, Gilbert], BW = L = Ω(d^2·n)
  – Oblivious to the nonzero structure of the matrices. Example: sparse Cannon ([Buluc, Gilbert] for the parallel case), BW = Ω(dn / P^{1/2}), #messages = Ω(P^{1/2})
• [L. Grigori, P.-Y. David, J. Demmel, S. Peyronnet]

Page 48: Jim  Demmel  and  Oded  Schwartz

Lower bounds for sparse Cholesky factorization

• Consider A of size k^s x k^s resulting from a finite-difference operator on a regular grid of dimension s ≥ 2 with k^s nodes.
• Its Cholesky factor L contains a dense lower triangular matrix of size k^{s-1} x k^{s-1}, so
    BW = Ω(W / M^{1/2}),   L = Ω(W / M^{3/2}),
  where W = k^{3(s-1)}/2 for a sequential algorithm and W = k^{3(s-1)}/(2P) for a parallel algorithm.
• This result applies more generally to a matrix A whose graph G = (V, E) has the following property for some s:
  • every set of vertices W ⊂ V with n/3 ≤ |W| ≤ 2n/3 is adjacent to at least s vertices in V − W;
  then the Cholesky factor of A contains a dense s x s submatrix.

Page 49: Jim  Demmel  and  Oded  Schwartz

Optimal algorithms for sparse Cholesky factorization

• PSPASES with an optimal layout could attain the lower bound in parallel for 2D/3D regular grids:
  • uses nested dissection to reorder the matrix
  • distributes the matrix using the subtree-to-subcube algorithm
  • the factorization of every dense multifrontal matrix is performed using an optimal parallel dense Cholesky factorization
• The sequential multifrontal algorithm attains the lower bound:
  • the factorization of every dense multifrontal matrix is performed using an optimal dense Cholesky factorization

Page 50: Jim  Demmel  and  Oded  Schwartz

Open Problems / Work in Progress (1/2)
• Rank-Revealing CAQR (RRQR)
  • A·P = Q·R, with P a permutation chosen to put the “most independent” columns first
  • Does LAPACK geqp3 move O(n^3) words, or fewer?
  • Better: Tournament Pivoting (TP) on blocks of columns
• CALU with more reliable pivoting (Gu, Grigori, Khabou et al.)
  • Replace tournament pivoting of the panel (TSLU) with rank-revealing QR of the panel
• LU with Complete Pivoting (GECP)
  • A = Pr·L·U·Pc^T, where Pr and Pc are permutations putting the “most independent” rows and columns first
  • Idea: TP on column blocks to choose columns, then TSLU on those columns to choose rows
• LU of A = A^T, with symmetric pivoting
  • A = P·L·L^T·P^T, with P a permutation and L lower triangular
  • Bunch-Kaufman, rook pivoting: O(n^3) words moved?
  • CA approach: choose a block with one step of CA-GECP, permute its rows and columns to the top left, pick a symmetric subset with local Bunch-Kaufman
• A = P·L·B·L^T·P^T, with symmetric pivoting (Toledo et al.)
  • L triangular, P a permutation, B banded
  • B tridiagonal: due to Aasen
  • CA approach: B block tridiagonal, stability still in question

Page 51: Jim  Demmel  and  Oded  Schwartz

Open Problems / Work in Progress (2/2)
• Heterogeneity (Ballard, Gearhart)
  • How to assign work to attain the lower bounds when flop rates / bandwidths / latencies vary?
• Using extra memory in the distributed-memory case (Solomonik)
  • Using all available memory to attain improved lower bounds
  • Just matmul and CALU so far
• Using extra memory in the shared-memory case
• Sparse matrices …
• Combining sparse and dense algorithms for better communication performance, e.g.,
  • the above [Grigori, David, Peleg, Peyronnet ’09]
  • our retuned version of [Yuster, Zwick ’04]
• Eliminate the log P factor?
• Prove latency lower bounds for specific data-structure use; e.g., can we show that any classical matrix-multiply algorithm that uses a column-major data structure has to be a factor of M^{1/2} away from optimality?