
2.5D algorithms for parallel dense linear algebra

Edgar Solomonik and James Demmel

UC Berkeley

June, 2012

Outline

- Introduction: strong scaling
- Matrix multiplication: 2D and 3D algorithms; 2.5D matrix multiplication
- LU factorization: 2.5D LU without pivoting; 2.5D LU with pivoting
- QR factorization: 2.5D QR using Givens rotations; 2.5D QR using Householder transformations
- Conclusion

Strong scaling

Solving science problems faster:

- Parallel computers can solve bigger problems (weak scaling)
- Parallel computers can also solve a fixed problem faster (strong scaling)

Obstacles to strong scaling:

- may increase the relative cost of communication
- may hurt load balance

How to reduce communication and maintain load balance?

- reduce (minimize) communication along the critical path
- exploit the network topology

Blocking matrix multiplication

[Figure: blocked decomposition of matrices A and B for computing the product AB]
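The blocked scheme above can be sketched in a few lines of Python. This is an illustrative, single-machine sketch rather than the distributed algorithm from the slides; the function name `blocked_matmul` is ours.

```python
# Illustrative sketch of blocked matrix multiplication: C = A * B is
# computed block by block, so each b x b block of A and B is reused
# across many inner products instead of being streamed in repeatedly.
def blocked_matmul(A, B, b):
    """Multiply square matrices A and B (lists of lists) using b x b blocks."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, b):          # block row of C
        for j0 in range(0, n, b):      # block column of C
            for k0 in range(0, n, b):  # blocks along the shared dimension
                for i in range(i0, min(i0 + b, n)):
                    for k in range(k0, min(k0 + b, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + b, n)):
                            C[i][j] += a * B[k][j]
    return C
```

In the parallel setting each block pair lives on a different processor, which is what the 2D, 3D, and 2.5D layouts on the following slides organize.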

2D matrix multiplication [Cannon 69], [Van De Geijn and Watts 97]

[Figure: blocks of A and B distributed over a 4x4 grid of 16 CPUs]

- O(n³/p) flops
- O(n²/√p) words moved
- O(√p) messages
- O(n²/p) bytes of memory
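Cannon's 2D algorithm can be simulated in a single process to make the shift pattern concrete. This is a sketch under the assumption that q divides n; the "processors" are just array slots, and `cannon_matmul` is our name, not an API from the talk.

```python
# Single-process simulation of Cannon's 2D algorithm: p = q*q "processors"
# each own one block of A and B; blocks are skewed once, then shifted q times,
# multiplying local blocks at every step.
def cannon_matmul(A, B, q):
    n = len(A)
    b = n // q  # block size (assumes q divides n)

    def block(M, i, j):
        return [[M[i*b + r][j*b + c] for c in range(b)] for r in range(b)]

    # initial skew: A block row i shifted left by i, B block column j shifted up by j
    Ablk = [[block(A, i, (j + i) % q) for j in range(q)] for i in range(q)]
    Bblk = [[block(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
    Cblk = [[[[0.0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]

    for _ in range(q):  # q shift-and-multiply steps
        for i in range(q):
            for j in range(q):
                Ca, Aa, Ba = Cblk[i][j], Ablk[i][j], Bblk[i][j]
                for r in range(b):
                    for k in range(b):
                        a = Aa[r][k]
                        for c in range(b):
                            Ca[r][c] += a * Ba[k][c]
        # shift A blocks left by one processor, B blocks up by one processor
        Ablk = [[Ablk[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bblk = [[Bblk[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    # assemble the global result from the per-processor blocks
    return [[Cblk[i // b][j // b][i % b][j % b] for j in range(n)] for i in range(n)]
```

Each processor stores one block of each matrix, which is where the O(n²/p) memory and O(√p) message counts on this slide come from.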

3D matrix multiplication [Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Bernsten 89], [McColl and Tiskin 99]

[Figure: blocks of A and B replicated across a 4x4x4 grid of 64 CPUs]

- 4 copies of matrices
- O(n³/p) flops
- O(n²/p^(2/3)) words moved
- O(1) messages
- O(n²/p^(2/3)) bytes of memory

2.5D matrix multiplication [McColl and Tiskin 99]

[Figure: blocks of A and B distributed over a 4x4x2 grid of 32 CPUs]

- 2 copies of matrices
- O(n³/p) flops
- O(n²/√(c·p)) words moved
- O(√(p/c³)) messages
- O(c·n²/p) bytes of memory
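The trade-off can be tabulated with a tiny script. This uses only the asymptotic expressions from this slide (constant factors ignored): c = 1 recovers the 2D costs and c = p^(1/3) recovers the 3D costs, so 2D and 3D are the endpoints of the 2.5D family.

```python
import math

# Cost model from the slides: with c replicated copies of the matrices,
# 2.5D matrix multiplication moves O(n^2 / sqrt(c*p)) words in
# O(sqrt(p / c^3)) messages.
def words_moved(n, p, c):
    return n * n / math.sqrt(c * p)

def messages(p, c):
    return math.sqrt(p / c ** 3)

n, p = 65536, 4096
for c in (1, 4, 16):  # 16 = p^(1/3) here, i.e. the 3D limit
    print(c, words_moved(n, p, c), messages(p, c))
```

At c = p^(1/3) the message count drops to O(1) and the bandwidth matches the 3D algorithm's O(n²/p^(2/3)), in exchange for c copies of the matrices in memory.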

Strong scaling matrix multiplication

[Figure: percentage of machine peak vs. p (256 to 2048) for 2.5D MM on BG/P (n = 65,536); curves: 2.5D MM, 2D MM, ScaLAPACK PDGEMM]

2.5D recursive LU

A = L · U, where L is lower-triangular and U is upper-triangular.

- A 2.5D recursive algorithm with no pivoting [A. Tiskin 2002]
- Tiskin gives the algorithm under the BSP (Bulk Synchronous Parallel) model, which considers communication and synchronization
- We give an alternative distributed-memory adaptation and implementation
- We also lower-bound the latency cost
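The underlying factorization can be sketched as follows. This is a sequential, unpivoted sketch for reference only (the 2.5D algorithm parallelizes and blocks these same steps); `lu_nopivot` is our name.

```python
# Sequential sketch of LU without pivoting (right-looking): after step k,
# column k below the diagonal holds the multipliers of L, row k from the
# diagonal holds U, and the trailing matrix is the Schur complement.
def lu_nopivot(A):
    n = len(A)
    A = [row[:] for row in A]  # work on a copy
    for k in range(n):
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                    # multiplier l_ik
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]      # Schur complement update
    L = [[A[i][j] if j < i else (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    U = [[A[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return L, U
```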

2D blocked LU factorization

[Figure sequence: factor the leading block of A into L₀₀ and U₀₀, compute the remaining L and U panels, then form the Schur complement S = A − LU]

2D block-cyclic decomposition

[Figure: matrix blocks assigned cyclically over a 4x4 processor grid]
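The block-cyclic layout is easy to state as an ownership function. This is an illustrative helper under the usual ScaLAPACK-style convention (block size b on a pr x pc grid); the function name `owner` is ours.

```python
# Which processor owns global entry (i, j) under a 2D block-cyclic layout
# with square block size b on a pr x pc processor grid: blocks are dealt
# out round-robin in both dimensions.
def owner(i, j, b, pr, pc):
    return ((i // b) % pr, (j // b) % pc)
```

Because blocks alternate between processors in both dimensions, every processor keeps work at every stage of the factorization, which is the load-balance motivation for block-cyclic over plain blocked layouts.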

2D block-cyclic LU factorization

[Figure sequence: the same steps as the blocked factorization (L and U panels, then S = A − LU), with the blocks distributed cyclically over the processor grid]

A new latency lower bound for LU

- Relate volume to surface area to diameter
- For block size n/d, LU does:
  - Ω(n³/d²) flops
  - Ω(n²/d) words
  - Ω(d) messages
- Now pick d (= latency cost):
  - d = Ω(√p) to minimize flops
  - d = Ω(√(c·p)) to minimize words
- More generally, latency · bandwidth = Ω(n²)

[Figure: the critical path of LU passes through the diagonal blocks A₀₀, A₁₁, ..., A_(d−1,d−1) of the n x n matrix, with panel widths k₀, k₁, ..., k₄]

2.5D LU factorization

[Figure sequence (A)-(D): the corner block is factored into L₀₀ and U₀₀ and replicated; the panels L₁₀, L₂₀, L₃₀ and U₀₁, U₀₃ are computed; then the trailing matrix is updated by the L and U panels]

2.5D LU strong scaling (without pivoting)

[Figure: percentage of machine peak vs. p (256 to 2048) for 2.5D LU on BG/P (n = 65,536); curves: 2.5D LU (no pvt), 2D LU (no pvt)]

2.5D LU with pivoting

A = P · L · U, where P is a permutation matrix.

- 2.5D generic pairwise elimination (neighbor/pairwise pivoting, or Givens rotations for QR) [A. Tiskin 2007]
  - pairwise pivoting does not produce an explicit L
  - pairwise pivoting may have stability issues for large matrices
- Our approach uses tournament pivoting, which is more stable than pairwise pivoting and gives L explicitly
  - pass up rows of A instead of U to avoid error accumulation

Tournament pivoting

Partial pivoting is not communication-optimal on a blocked matrix:

- requires a message/synchronization for each column
- O(n) messages needed

Tournament pivoting is communication-optimal:

- performs a tournament to determine the best pivot-row candidates
- passes up the 'best rows' of A
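The tree-shaped reduction can be sketched as follows. Note this is a deliberate simplification: real tournament pivoting (CALU) selects winners by running LU with partial pivoting on the stacked candidate rows, whereas this sketch ranks rows by the magnitude of their leading entry purely to illustrate the merge tree; both function names are ours.

```python
# Simplified sketch of a tournament selecting b pivot-row candidates.
# Each "processor" holds a chunk of rows; local winners are chosen, then
# merged pairwise up a tree, so only O(log p) rounds of messages are needed
# instead of one synchronization per column.
def best_rows(rows, b):
    # Stand-in selection rule: keep the b rows with the largest leading
    # absolute value (real CALU uses partial-pivoting LU on the candidates).
    return sorted(rows, key=lambda r: -abs(r[0]))[:b]

def tournament(chunks, b):
    candidates = [best_rows(c, b) for c in chunks]   # local rounds
    while len(candidates) > 1:                       # pairwise merge tree
        merged = []
        for i in range(0, len(candidates), 2):
            group = candidates[i] + (candidates[i + 1] if i + 1 < len(candidates) else [])
            merged.append(best_rows(group, b))
        candidates = merged
    return candidates[0]
```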

2.5D LU factorization with tournament pivoting

[Figure sequence: each panel block PA₀ ... PA₃ is factored as P, L, U; winners merge up a tournament tree to select the pivot rows; the resulting L₀₀, U₀₀ and the panels L₁₀ ... L₄₀ and U₀₁ ... U₀₃ are broadcast; then all copies apply the trailing update]

Strong scaling of 2.5D LU with tournament pivoting

[Figure: percentage of machine peak vs. p (256 to 2048) for 2.5D LU on BG/P (n = 65,536); curves: 2.5D LU (CA-pvt), 2D LU (CA-pvt), ScaLAPACK PDGETRF]

2.5D QR factorization

A = Q · R, where Q is orthogonal and R is upper-triangular.

- 2.5D QR using Givens rotations (generic pairwise elimination) is given by [A. Tiskin 2007]
- Tiskin minimizes latency and bandwidth by working on slanted panels
- 2.5D QR cannot be done with right-looking updates as in 2.5D LU, due to the non-commutativity of orthogonalization updates
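The elimination primitive behind Givens-based QR is a single plane rotation. A minimal sketch (our helper name, operating in place on a list-of-lists matrix):

```python
import math

# One Givens rotation: zero out entry A[i][k] by rotating rows k and i.
# Repeated over all subdiagonal entries, these rotations reduce A to R;
# pairwise elimination schemes like Tiskin's apply them between row pairs.
def givens_zero(A, k, i):
    a, b = A[k][k], A[i][k]
    r = math.hypot(a, b)
    if r == 0.0:
        return
    c, s = a / r, b / r
    for j in range(len(A[0])):
        akj, aij = A[k][j], A[i][j]
        A[k][j] = c * akj + s * aij    # rotated row k
        A[i][j] = -s * akj + c * aij   # rotated row i; entry (i, k) becomes 0
```

Because each rotation touches only two rows, rotations acting on disjoint row pairs commute and can run in parallel, which is what makes pairwise elimination amenable to 2.5D scheduling.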

2.5D QR factorization using the YT representation

The YT representation of Householder QR factorization is more work-efficient when computing only R.

- We give an algorithm that performs 2.5D QR using the YT representation
- The algorithm performs left-looking updates on Y
- Householder with YT needs less computation (roughly 2x) than Givens rotations
- Our approach achieves optimal bandwidth cost, but has O(n) latency
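For reference, a sequential sketch of Householder QR computing only R. The (Y, T) factors of the slide aggregate the reflector vectors v and their scalars into Q = I − Y·T·Yᵀ; this sketch applies each reflector directly instead and uses our own function name.

```python
# Sequential sketch of Householder QR computing only R: each step builds a
# reflector I - 2*v*v^T/(v^T v) that zeros one column below the diagonal.
# The YT form collects these v vectors (Y) and combining scalars (T).
def householder_r(A):
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    for k in range(min(m, n)):
        x = [R[i][k] for i in range(k, m)]
        norm = sum(t * t for t in x) ** 0.5
        if norm == 0.0:
            continue
        alpha = -norm if x[0] >= 0 else norm   # sign choice avoids cancellation
        v = x[:]
        v[0] -= alpha
        vnorm2 = sum(t * t for t in v)
        if vnorm2 == 0.0:
            continue
        for j in range(k, n):                  # apply the reflector to trailing columns
            dot = sum(v[i - k] * R[i][j] for i in range(k, m))
            coef = 2.0 * dot / vnorm2
            for i in range(k, m):
                R[i][j] -= coef * v[i - k]
    return R  # upper-triangular up to roundoff
```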

2.5D QR using the YT representation

[Figure: diagram of the 2.5D QR algorithm using the YT representation]

Conclusion

Our contributions:

- 2.5D mapping of matrix multiplication
  - optimal according to lower bounds [Irony, Tiskin, Toledo 04] and [Aggarwal, Chandra, and Snir 90]
- A new latency lower bound for LU
- Communication-optimal 2.5D LU and QR
  - both are bandwidth-optimal according to the general lower bound [Ballard, Demmel, Holtz, Schwartz 10]
  - LU is latency-optimal according to the new lower bound

Reflections:

- Replication allows better strong scaling
- Topology-aware mapping cuts communication costs

Backup slides: rectangular collectives

Performance of multicast (BG/P vs. Cray)

[Figure: bandwidth (MB/sec, 128 to 8192) vs. #nodes (8 to 4096) for a 1 MB multicast on BG/P, Cray XT5, and Cray XE6]

Why the performance discrepancy in multicasts?

- Cray machines use binomial multicasts:
  - form a spanning tree from a list of nodes
  - route copies of the message down each branch
  - network contention degrades utilization on a 3D torus
- BG/P uses rectangular multicasts:
  - require the network topology to be a k-ary n-cube
  - form 2n edge-disjoint spanning trees
  - route in different dimensional orders
  - use both directions of the bidirectional network
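The "2n edge-disjoint trees" construction can be sketched schematically. This is a speculative illustration of the idea only (rotate the dimension order, use both link directions); the actual BG/P route construction is more involved, and the function name is ours.

```python
# Schematic sketch: the 2n routing schedules behind rectangular multicasts
# on an n-dimensional torus.  Each spanning tree routes the message through
# the dimensions in a rotated order, in one of the two link directions,
# yielding 2n edge-disjoint trees that together use every outgoing link.
def dimension_orders(n):
    orders = []
    dims = list(range(n))
    for shift in range(n):                  # rotate the dimension order
        order = dims[shift:] + dims[:shift]
        orders.append((order, +1))          # positive link direction
        orders.append((order, -1))          # negative link direction
    return orders
```

For a 2D torus this yields four trees, matching the four spanning trees shown on the next slide.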

2D rectangular multicast trees

[Figure: a 2D 4x4 torus with four edge-disjoint spanning trees from the root, shown individually and combined]

Cost breakdown of MM on 65,536 cores

[Figure: execution time normalized by 2D, split into computation, idle, and communication, for n = 8192 and n = 131,072 with 2D and 2.5D MM on 16,384 nodes of BG/P; 2.5D gives a 95% reduction in communication]

2.5D LU on 65,536 cores

[Figure: time (sec) for LU on 16,384 nodes of BG/P (n = 131,072), split into compute, idle, and communication, for NO-pivot and CA-pivot in 2D and 2.5D; 2.5D is 2x faster in both cases]
