Page 1

Source: solomon2.web.engr.illinois.edu/talks/siam-ala-2012.pdf

Introduction | Matrix multiplication | LU factorization | QR factorization | Conclusion

2.5D algorithms for parallel dense linear algebra

Edgar Solomonik and James Demmel

UC Berkeley

June, 2012

Edgar Solomonik and James Demmel 2.5D algorithms 1/ 33

Page 2

Outline

- Introduction
  - Strong scaling
- Matrix multiplication
  - 2D and 3D algorithms
  - 2.5D matrix multiplication
- LU factorization
  - 2.5D LU without pivoting
  - 2.5D LU with pivoting
- QR factorization
  - 2.5D QR using Givens rotations
  - 2.5D QR using Householder transformations
- Conclusion

Page 3

Strong scaling

Solving science problems faster

Parallel computers can solve bigger problems

- weak scaling

Parallel computers can also solve a fixed problem faster

- strong scaling

Obstacles to strong scaling

- may increase relative cost of communication
- may hurt load balance

How to reduce communication and maintain load balance?

- reduce (minimize) communication along the critical path
- exploit the network topology

Page 4

2D and 3D algorithms | 2.5D matrix multiplication

Blocking matrix multiplication

[Figure: blocked matrices A and B]

Page 5

2D matrix multiplication [Cannon 69], [Van De Geijn and Watts 97]

[Figure: blocked matrices A and B on a 4x4 grid of CPUs]

16 CPUs (4x4)

- O(n^3/p) flops
- O(n^2/√p) words moved
- O(√p) messages
- O(n^2/p) bytes of memory
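The 2D costs above can be made concrete with a serial simulation of Cannon's algorithm. This is a minimal sketch, not the authors' implementation: it assumes a q x q grid with q = √p dividing n, models each processor as a list entry rather than an MPI rank, and the function name `cannon_matmul` is made up for illustration. Each of the q rounds corresponds to one shift (one message) per processor, which is where the O(√p) message count comes from.

```python
import numpy as np

def cannon_matmul(A, B, q):
    """Serially simulate Cannon's algorithm on a q x q processor grid."""
    n = A.shape[0]
    assert A.shape == B.shape == (n, n) and n % q == 0
    b = n // q  # side length of the block owned by each simulated processor

    # Ab[i][j] / Bb[i][j] is the block currently held by processor (i, j).
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]

    # Initial skew: row i of A shifts left by i, column j of B shifts up by j.
    Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]

    Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
    for _ in range(q):  # q = sqrt(p) rounds
        for i in range(q):
            for j in range(q):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]  # local block multiply
        # Communication step: shift A left by one, B up by one.
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(Cb)
```

Per round each processor receives two blocks of n^2/p words, so q rounds move on the order of n^2/√p words in total.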

Page 6

3D matrix multiplication [Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Berntsen 89], [McColl and Tiskin 99]

[Figure: blocked matrices A and B on a 4x4x4 grid of CPUs]

64 CPUs (4x4x4), 4 copies of matrices

- O(n^3/p) flops
- O(n^2/p^(2/3)) words moved
- O(1) messages
- O(n^2/p^(2/3)) bytes of memory

Page 7

2.5D matrix multiplication [McColl and Tiskin 99]

[Figure: blocked matrices A and B on a 4x4x2 grid of CPUs]

32 CPUs (4x4x2), 2 copies of matrices

- O(n^3/p) flops
- O(n^2/√(c·p)) words moved
- O(√(p/c^3)) messages
- O(c·n^2/p) bytes of memory
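The 2.5D cost expressions interpolate between the 2D and 3D ones. The sketch below evaluates them with all constants dropped, assuming c denotes the number of replicated copies of the matrices (as the "2 copies" example suggests); `mm_costs` is a hypothetical helper, not part of any library.

```python
import math

def mm_costs(n, p, c=1):
    """Asymptotic per-processor costs of 2.5D matrix multiplication
    (constants dropped). c=1 reproduces the 2D costs and c=p**(1/3)
    reproduces the 3D costs."""
    return {
        "flops": n**3 / p,
        "words": n**2 / math.sqrt(c * p),
        "messages": math.sqrt(p / c**3),
        "memory": c * n**2 / p,
    }

if __name__ == "__main__":
    n = 65536
    for p, c in [(16, 1), (32, 2), (64, 4)]:  # the 2D, 2.5D and 3D examples
        print(p, c, mm_costs(n, p, c))
```

Setting c = p^(1/3) gives words = n^2/p^(2/3), messages = 1 and memory = n^2/p^(2/3), which agrees with the 3D figures; the extra memory is what buys the lower communication.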

Page 8

Strong scaling matrix multiplication

[Plot: 2.5D MM on BG/P (n=65,536). x-axis: p (256, 512, 1024, 2048); y-axis: percentage of machine peak (0 to 100); series: 2.5D MM, 2D MM, ScaLAPACK PDGEMM]

Page 9

2.5D LU without pivoting | 2.5D LU with pivoting

2.5D recursive LU

A = L · U where L is lower-triangular and U is upper-triangular

- A 2.5D recursive algorithm with no pivoting [A. Tiskin 2002]
  - Tiskin gives algorithm under the BSP model
    - Bulk Synchronous Parallel
    - considers communication and synchronization
- We give an alternative distributed-memory adaptation and implementation
- Also, we lower-bound the latency cost
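For reference, the recursive structure of LU without pivoting can be sketched sequentially. This is not Tiskin's BSP algorithm or the authors' distributed adaptation; it is only the underlying 2x2 block recursion, assuming a square matrix whose leading principal minors are all nonzero. The name `recursive_lu` is illustrative, and the triangular solves use the general `np.linalg.solve` for brevity.

```python
import numpy as np

def recursive_lu(A):
    """LU without pivoting via recursive 2x2 blocking. Returns (L, U)."""
    n = A.shape[0]
    if n == 1:
        return np.ones((1, 1)), A.astype(float).copy()
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]

    L11, U11 = recursive_lu(A11)
    U12 = np.linalg.solve(L11, A12)            # solves L11 @ U12 = A12
    L21 = np.linalg.solve(U11.T, A21.T).T      # solves L21 @ U11 = A21
    L22, U22 = recursive_lu(A22 - L21 @ U12)   # recurse on the Schur complement

    L = np.block([[L11, np.zeros((h, n - h))], [L21, L22]])
    U = np.block([[U11, U12], [np.zeros((n - h, h)), U22]])
    return L, U
```

A diagonally dominant test matrix is a safe input, since no pivoting is performed.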

Page 10

2D blocked LU factorization

[Figure, built up over pages 10-13: the matrix A; the diagonal block factored into L₀₀ and U₀₀; the panels L and U; the trailing update S = A - LU]
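The three steps in the figure (factor the diagonal block, compute the L and U panels, form the Schur complement S = A - LU) can be written as a sequential right-looking loop. This is a sketch under stated assumptions: no pivoting, a matrix for which that is safe, and a block size `b` chosen by the caller. The helper names are illustrative only.

```python
import numpy as np

def lu_nopivot(A):
    """Unblocked LU without pivoting, used on diagonal blocks."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]                                # column of L
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])    # rank-1 update
    return np.tril(A, -1) + np.eye(n), np.triu(A)

def blocked_lu(A, b):
    """Right-looking blocked LU without pivoting. Returns (L, U)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    for k in range(0, n, b):
        e = min(k + b, n)
        # Step 1: factor the diagonal block.
        L[k:e, k:e], U[k:e, k:e] = lu_nopivot(A[k:e, k:e])
        if e < n:
            # Step 2: compute the U row panel and the L column panel.
            U[k:e, e:] = np.linalg.solve(L[k:e, k:e], A[k:e, e:])
            L[e:, k:e] = np.linalg.solve(U[k:e, k:e].T, A[e:, k:e].T).T
            # Step 3: update the trailing matrix, S = A - L U.
            A[e:, e:] -= L[e:, k:e] @ U[k:e, e:]
    return L, U
```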

Page 14

2D block-cyclic decomposition

[Figure: 4x4 grid illustrating the 2D block-cyclic decomposition; cell labels not recoverable]
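A 2D block-cyclic layout assigns blocks to processors round-robin in both dimensions, so every processor keeps work as the factorization shrinks the active matrix. The mapping can be sketched as follows; the block size `b` and grid shape `pr x pc` are parameters assumed for illustration, and the function name is hypothetical.

```python
def block_cyclic_owner(i, j, b, pr, pc):
    """Grid coordinates of the processor owning matrix entry (i, j) in a
    2D block-cyclic layout with b x b blocks on a pr x pc processor grid."""
    return (i // b) % pr, (j // b) % pc

def owner_map(n, b, pr, pc):
    """Owner of every b x b block of an n x n matrix, as a nested list."""
    nb = (n + b - 1) // b  # number of block rows/columns
    return [[block_cyclic_owner(bi * b, bj * b, b, pr, pc)
             for bj in range(nb)] for bi in range(nb)]
```

For example, `owner_map(8, 2, 2, 2)` lays a 4x4 grid of blocks over a 2x2 processor grid, with each processor owning four non-adjacent blocks.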

Page 15

2D block-cyclic LU factorization

[Figure, built up over pages 15-17: the panels L and U, then the trailing update S = A - LU, on the block-cyclic layout]

Page 18

A new latency lower bound for LU

- Relate volume to surface area to diameter
- For block size n/d, LU does
  - Ω(n^3/d^2) flops
  - Ω(n^2/d) words
  - Ω(d) msgs
- Now pick d (= latency cost)
  - d = Ω(√p) to minimize flops
  - d = Ω(√(c·p)) to minimize words
- More generally, latency · bandwidth = n^2

[Figure: n x n matrix A with diagonal blocks A₀₀, A₁₁, A₂₂, A₃₃, A₄₄, ..., A_{d-1,d-1}; the critical path runs along the diagonal through k₀, k₁, k₂, k₃, k₄, ..., k_{d-1}]
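The tradeoff in this bound can be checked numerically. The sketch below simply evaluates the three expressions with constants dropped, treating d as a free parameter; `lu_path_costs` is a hypothetical helper and the values of n, p, and c are chosen only for illustration.

```python
import math

def lu_path_costs(n, d):
    """Costs along the critical path when LU is organized around d diagonal
    blocks of size n/d (asymptotic, constants dropped)."""
    return {"flops": n**3 / d**2, "words": n**2 / d, "messages": d}

if __name__ == "__main__":
    n, p, c = 65536, 1024, 4
    for label, d in [("d = sqrt(p)", math.sqrt(p)),
                     ("d = sqrt(c*p)", math.sqrt(c * p))]:
        costs = lu_path_costs(n, d)
        # The product messages * words equals n^2 for every choice of d.
        print(label, costs, costs["messages"] * costs["words"] / n**2)
```

With d = √p the flop term becomes n^3/p, and with d = √(c·p) the word term becomes n^2/√(c·p); in both cases lowering the bandwidth cost raises the message count by the same factor.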

Page 19

2.5D LU factorization

[Figure, built up over pages 19-21 in steps (A) to (D): blocks L₀₀ and U₀₀ (each drawn several times), blocks L₁₀, L₂₀, L₃₀, U₀₁, U₀₃, and the panels L and U]

Page 22

2.5D LU strong scaling (without pivoting)

[Plot: 2.5D LU on BG/P (n=65,536). x-axis: p (256, 512, 1024, 2048); y-axis: percentage of machine peak (0 to 100); series: 2.5D LU (no pvt), 2D LU (no pvt)]

Page 23

2.5D LU with pivoting

A = P · L · U, where P is a permutation matrix

- 2.5D generic pairwise elimination (neighbor/pairwise pivoting or Givens rotations (QR)) [A. Tiskin 2007]
  - pairwise pivoting does not produce an explicit L
  - pairwise pivoting may have stability issues for large matrices
- Our approach uses tournament pivoting, which is more stable than pairwise pivoting and gives L explicitly
  - pass up rows of A instead of U to avoid error accumulation

Page 24

Tournament pivoting

Partial pivoting is not communication-optimal on a blocked matrix

- requires message/synchronization for each column
- O(n) messages needed

Tournament pivoting is communication-optimal

- performs a tournament to determine best pivot row candidates
- passes up ’best rows’ of A
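The tournament described above can be sketched sequentially for a single tall panel. This is an illustrative sketch, not the authors' code: it assumes a binary reduction tree over `chunks` row groups, uses Gaussian elimination with partial pivoting as the local selection rule, and passes original rows of the panel (not partially eliminated ones) up the tree. All names are hypothetical.

```python
import numpy as np

def gepp_pivot_rows(block, b):
    """Local indices of the first b pivot rows chosen by partial pivoting."""
    W = block.astype(float).copy()
    idx = np.arange(W.shape[0])
    k_max = min(b, W.shape[0])
    for k in range(k_max):
        p = k + np.argmax(np.abs(W[k:, k]))   # largest entry in column k
        W[[k, p]] = W[[p, k]]                 # swap rows k and p
        idx[[k, p]] = idx[[p, k]]
        if W[k, k] != 0:
            W[k+1:, k:] -= np.outer(W[k+1:, k] / W[k, k], W[k, k:])
    return idx[:k_max]

def tournament_pivot(panel, chunks):
    """Row indices of the b = panel.shape[1] rows that win the tournament."""
    b = panel.shape[1]
    groups = np.array_split(np.arange(panel.shape[0]), chunks)
    # Leaves: each group of rows nominates its b best candidates.
    cands = [g[gepp_pivot_rows(panel[g], b)] for g in groups]
    # Internal nodes: two candidate sets play off on the ORIGINAL rows of
    # the panel, and the b winners move up one level.
    while len(cands) > 1:
        nxt = []
        for i in range(0, len(cands), 2):
            merged = np.concatenate(cands[i:i+2])
            nxt.append(merged[gepp_pivot_rows(panel[merged], b)])
        cands = nxt
    return cands[0]
```

With a tree over the row groups, the number of synchronization steps per panel grows with the tree depth rather than with the number of columns, which is the contrast with partial pivoting drawn on this slide.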

Page 25

2.5D LU factorization with tournament pivoting

[Figure, built up over pages 25-28: row blocks PA₀ to PA₃ feed a tree of PLU factorizations; the selected rows are factored (LU); panels L₁₀ to L₄₀ and U₀₁ to U₀₃ are computed; the remaining blocks are marked Update; L₀₀U₀₀ is drawn several times]

Page 29

Strong scaling of 2.5D LU with tournament pivoting

[Plot: 2.5D LU on BG/P (n=65,536). x-axis: p (256, 512, 1024, 2048); y-axis: percentage of machine peak (0 to 100); series: 2.5D LU (CA-pvt), 2D LU (CA-pvt), ScaLAPACK PDGETRF]

Page 30

2.5D QR using Givens rotations | 2.5D QR using Householder transformations

2.5D QR factorization

A = Q · R where Q is orthogonal and R is upper-triangular

- 2.5D QR using Givens rotations (generic pairwise elimination) is given by [A. Tiskin 2007]
- Tiskin minimizes latency and bandwidth by working on slanted panels
- 2.5D QR cannot be done with right-looking updates as 2.5D LU can, due to non-commutativity of orthogonalization updates
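Pairwise elimination with Givens rotations can be illustrated sequentially. The sketch below is the textbook neighbor-row scheme, not Tiskin's slanted-panel 2.5D algorithm; each rotation combines two adjacent rows to zero one subdiagonal entry. The function name is illustrative.

```python
import numpy as np

def givens_qr(A):
    """QR by Givens rotations on adjacent row pairs. Returns (Q, R)."""
    R = A.astype(float).copy()
    m, n = R.shape
    Q = np.eye(m)
    for j in range(n):
        for i in range(m - 1, j, -1):        # zero R[i, j] using row i-1
            a, b = R[i-1, j], R[i, j]
            r = np.hypot(a, b)
            if r == 0:
                continue                     # nothing to eliminate
            c, s = a / r, b / r
            G = np.array([[c, s], [-s, c]])  # maps (a, b) to (r, 0)
            R[[i-1, i], :] = G @ R[[i-1, i], :]
            Q[:, [i-1, i]] = Q[:, [i-1, i]] @ G.T   # keeps Q @ R == A
    return Q, R
```

Because each rotation touches only two rows, the eliminations within a column must respect an order, which hints at why the update pattern differs from LU.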

Page 31

2.5D QR factorization using the YT representation

The YT representation of Householder QR factorization is more work-efficient when computing only R

- We give an algorithm that performs 2.5D QR using the YT representation
- The algorithm performs left-looking updates on Y
- Householder with YT needs less computation (roughly 2x less) than Givens rotations
- Our approach achieves optimal bandwidth cost, but has O(n) latency
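In the YT (compact WY) representation, the product of Householder reflectors is stored as Q = I - Y T Yᵀ, with Y holding the reflector vectors and T a small upper-triangular matrix. The sketch below is a plain sequential construction assuming m >= n; it applies reflectors in the usual right-looking order and is not the left-looking 2.5D algorithm of the talk. The function name is illustrative.

```python
import numpy as np

def householder_yt(A):
    """Householder QR returning (Y, T, R) with Q = I - Y @ T @ Y.T."""
    R = A.astype(float).copy()
    m, n = R.shape
    Y = np.zeros((m, n))
    T = np.zeros((n, n))
    for j in range(n):
        x = R[j:, j]
        v = x.copy()
        alpha = -np.copysign(np.linalg.norm(x), x[0])
        v[0] -= alpha                          # v = x - alpha * e1
        nv = v @ v
        tau = 0.0 if nv == 0 else 2.0 / nv     # H = I - tau * v v^T
        R[j:, j:] -= tau * np.outer(v, v @ R[j:, j:])
        Y[j:, j] = v
        # Append one column to T so that I - Y T Y^T equals H_0 H_1 ... H_j.
        T[:j, j] = -tau * T[:j, :j] @ (Y[:, :j].T @ Y[:, j])
        T[j, j] = tau
    return Y, T, np.triu(R[:n, :])
```

Storing Q this way lets updates be applied as matrix-matrix products with Y and T instead of one reflector at a time.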

Page 32

2.5D QR using YT representation

Page 33

Conclusion

Our contributions:

- 2.5D mapping of matrix multiplication
  - Optimal according to lower bounds [Irony, Tiskin, Toledo 04] and [Aggarwal, Chandra, and Snir 90]
- A new latency lower bound for LU
- Communication-optimal 2.5D LU and QR
  - Both are bandwidth-optimal according to general lower bound [Ballard, Demmel, Holtz, Schwartz 10]
  - LU is latency-optimal according to new lower bound

Reflections:

- Replication allows better strong scaling
- Topology-aware mapping cuts communication costs

Page 34

Rectangular collectives

Backup slides

Page 35

Performance of multicast (BG/P vs Cray)

[Plot: 1 MB multicast on BG/P, Cray XT5, and Cray XE6. x-axis: #nodes (8, 64, 512, 4096); y-axis: bandwidth (MB/sec), 128 to 8192; series: BG/P, XE6, XT5]

Page 36

Why the performance discrepancy in multicasts?

- Cray machines use binomial multicasts
  - Form spanning tree from a list of nodes
  - Route copies of message down each branch
  - Network contention degrades utilization on a 3D torus
- BG/P uses rectangular multicasts
  - Require network topology to be a k-ary n-cube
  - Form 2n edge-disjoint spanning trees
    - Route in different dimensional order
    - Use both directions of bidirectional network
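The binomial multicast mentioned above can be illustrated by listing which node sends to which in each round. This is a generic binomial-tree schedule over a flat list of node ids, not the actual Cray or MPI implementation; the function name and node numbering are assumptions for illustration.

```python
def binomial_multicast_schedule(p, root=0):
    """Rounds of (sender, receiver) pairs for a binomial-tree multicast
    among p nodes numbered 0..p-1, starting from root."""
    rounds = []
    have = 1  # nodes holding the message, in numbering relative to root
    while have < p:
        # Every node that already has the message forwards one copy.
        sends = [((root + s) % p, (root + s + have) % p)
                 for s in range(have) if s + have < p]
        rounds.append(sends)
        have *= 2
    return rounds
```

The tree is built from the node list alone, so branches ignore where nodes sit on the torus; several copies can end up crossing the same links, which is the contention effect the slide refers to.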

Page 37

2D rectangular multicast trees

[Figure: six panels showing a 2D 4x4 torus with the root marked, spanning trees 1 to 4, and all 4 trees combined]

Page 38

Cost breakdown of MM on 65,536 cores

[Plot: Matrix multiplication on 16,384 nodes of BG/P. y-axis: execution time normalized by 2D (0 to 1.4); bars: n=8192 2D, n=8192 2.5D, n=131072 2D, n=131072 2.5D; legend: computation, idle, communication; annotation: 95% reduction in comm]

Page 39

2.5D LU on 65,536 cores

[Plot: LU on 16,384 nodes of BG/P (n=131,072). y-axis: time (sec), 0 to 100; bars: NO-pivot 2D, NO-pivot 2.5D, CA-pivot 2D, CA-pivot 2.5D; legend: compute, idle, communication; annotations: 2X faster (shown twice)]
