2.5D algorithms for parallel dense linear algebra
Edgar Solomonik and James Demmel
UC Berkeley
June, 2012
Edgar Solomonik and James Demmel 2.5D algorithms 1/ 33
Outline
- Introduction
  - Strong scaling
- Matrix multiplication
  - 2D and 3D algorithms
  - 2.5D matrix multiplication
- LU factorization
  - 2.5D LU without pivoting
  - 2.5D LU with pivoting
- QR factorization
  - 2.5D QR using Givens rotations
  - 2.5D QR using Householder transformations
- Conclusion
Strong scaling
Solving science problems faster:
- Parallel computers can solve bigger problems (weak scaling)
- Parallel computers can also solve a fixed problem faster (strong scaling)
Obstacles to strong scaling:
- strong scaling may increase the relative cost of communication
- it may hurt load balance
How to reduce communication and maintain load balance?
- reduce (minimize) communication along the critical path
- exploit the network topology
Blocking matrix multiplication
[Figure: blocks of A and B are paired to compute blocks of the product AB]
2D matrix multiplication [Cannon 69], [Van De Geijn and Watts 97]
[Figure: 16 CPUs in a 4x4 grid, each holding blocks of A and B]
- O(n³/p) flops
- O(n²/√p) words moved
- O(√p) messages
- O(n²/p) bytes of memory
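The 2D algorithm can be illustrated with a serial simulation of Cannon's algorithm: each virtual process on a √p x √p grid holds one block of A and B, and blocks are shifted left/up between multiply steps. This is a minimal sketch for intuition (the grid/block helper names are assumptions, not the paper's implementation):

```python
# Serial simulation of Cannon's 2D matrix multiplication on a
# q x q virtual process grid of b x b blocks. Each "process" (i, j)
# accumulates its block of C = A * B while blocks of A shift left
# and blocks of B shift up each step.
# Illustrative sketch only; helper names are assumptions.

def matmul_blocks(X, Y):
    """Multiply two dense blocks stored as lists of lists."""
    n, m, k = len(X), len(Y[0]), len(Y)
    return [[sum(X[i][t] * Y[t][j] for t in range(k))
             for j in range(m)] for i in range(n)]

def add_blocks(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def cannon_matmul(A, B, q, b):
    """C = A*B using a q x q virtual grid of b x b blocks."""
    # Extract block (i, j) of a matrix.
    blk = lambda M, i, j: [row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]]
    # Initial skew: block row i of A shifts left by i,
    # block column j of B shifts up by j.
    Ab = [[blk(A, i, (j + i) % q) for j in range(q)] for i in range(q)]
    Bb = [[blk(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
    Cb = [[[[0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]
    for _ in range(q):  # q multiply-and-shift steps
        for i in range(q):
            for j in range(q):
                Cb[i][j] = add_blocks(Cb[i][j],
                                      matmul_blocks(Ab[i][j], Bb[i][j]))
        # Shift A blocks left by one and B blocks up by one.
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    # Reassemble C from its blocks.
    return [sum((Cb[i][j][r] for j in range(q)), [])
            for i in range(q) for r in range(b)]
```

Each process sends and receives O(q) = O(√p) messages of size b² = n²/p, matching the word and message counts above.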
3D matrix multiplication [Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Bernsten 89], [McColl and Tiskin 99]
[Figure: 64 CPUs in a 4x4x4 grid; 4 copies of the matrices]
- O(n³/p) flops
- O(n²/p^(2/3)) words moved
- O(1) messages
- O(n²/p^(2/3)) bytes of memory
2.5D matrix multiplication [McColl and Tiskin 99]
[Figure: 32 CPUs in a 4x4x2 grid; 2 copies of the matrices]
- O(n³/p) flops
- O(n²/√(c·p)) words moved
- O(√(p/c³)) messages
- O(c·n²/p) bytes of memory
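These costs interpolate between the 2D and 3D cases as the replication factor c ranges from 1 to p^(1/3). A small sketch makes the endpoints explicit (function names here are illustrative, not from the paper):

```python
import math

# Asymptotic per-process costs of 2.5D matrix multiplication
# (constants dropped), as functions of matrix dimension n, process
# count p, and replication factor c in [1, p^(1/3)].
# Illustrative sketch; function names are assumptions.

def words_moved(n, p, c):
    return n**2 / math.sqrt(c * p)

def messages(p, c):
    return math.sqrt(p / c**3)

def memory(n, p, c):
    return c * n**2 / p

# c = 1 recovers the 2D costs (n^2/sqrt(p) words, sqrt(p) messages);
# c = p^(1/3) recovers the 3D costs (n^2/p^(2/3) words, O(1) messages).
```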
Strong scaling matrix multiplication
[Figure: 2.5D MM on BG/P (n=65,536); percentage of machine peak vs. p = 256, 512, 1024, 2048; curves: 2.5D MM, 2D MM, ScaLAPACK PDGEMM]
2.5D recursive LU
A = L · U, where L is lower-triangular and U is upper-triangular
- A 2.5D recursive algorithm with no pivoting [A. Tiskin 2002]
  - Tiskin gives the algorithm under the BSP (Bulk Synchronous Parallel) model
  - the model considers communication and synchronization
- We give an alternative distributed-memory adaptation and implementation
- We also lower-bound the latency cost
2D blocked LU factorization
[Figure, four steps: start from A; factor the leading block as L₀₀U₀₀; solve triangular systems for the corresponding panels of L and U; form the Schur complement S = A - LU and recurse on it]
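The steps above (factor the leading block, solve for the panels, update with S = A - LU, recurse) can be sketched serially; with block size 1 they reduce to the classical unblocked elimination below. This is a plain textbook sketch, not the paper's 2.5D implementation, and it assumes no zero pivots arise:

```python
# Serial LU without pivoting, mirroring the blocked steps:
# take the pivot (the 1x1 "L00 U00"), scale the column panel of L,
# copy the row panel of U, and update the trailing matrix
# S = A - L*U before moving to the next step.
# Illustrative sketch only (assumes no zero pivots).

def lu_nopivot(A):
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    S = [row[:] for row in A]  # working copy (trailing matrix)
    for k in range(n):
        pivot = S[k][k]
        U[k][k:] = S[k][k:]            # row k of U
        L[k][k] = 1.0
        for i in range(k + 1, n):
            L[i][k] = S[i][k] / pivot  # column k of L
        for i in range(k + 1, n):      # Schur complement update
            for j in range(k + 1, n):
                S[i][j] -= L[i][k] * U[k][j]
    return L, U
```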
2D block-cyclic decomposition
[Figure: a 4x4 processor grid; each processor owns a cyclically scattered set of matrix blocks]
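In a 2D block-cyclic layout, block (I, J) of the matrix lives on processor (I mod pr, J mod pc), so every processor keeps owning blocks of the shrinking active submatrix as the factorization proceeds. A minimal owner map (names assumed, not ScaLAPACK's API):

```python
# Owner of matrix entry (i, j) under a 2D block-cyclic layout with
# block size b on a pr x pc processor grid: block indices wrap
# cyclically around the grid, so each processor owns many scattered
# blocks and the active part of LU stays load-balanced.
# Illustrative sketch; names are assumptions.

def owner(i, j, b, pr, pc):
    return ((i // b) % pr, (j // b) % pc)
```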
2D block-cyclic LU factorization
[Figure, three steps: the same LU steps on the block-cyclic layout; the factored panels of L and U and the Schur complement S = A - LU stay spread cyclically over the processor grid]
A new latency lower bound for LU
- Relate volume to surface area to diameter
- For block size n/d, LU does:
  - Ω(n³/d²) flops
  - Ω(n²/d) words
  - Ω(d) messages
- Now pick d (= latency cost):
  - d = Ω(√p) to minimize flops
  - d = Ω(√(c·p)) to minimize words
- More generally, latency · bandwidth = Ω(n²)
[Figure: an n-by-n matrix partitioned into diagonal blocks A₀₀, A₁₁, A₂₂, ..., A_{d-1,d-1}; the critical path runs along the diagonal through elimination steps k₀, k₁, k₂, k₃, k₄]
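The tradeoff says no choice of block size makes both terms small at once: with d diagonal blocks the critical path sends about d messages of about n²/d words each, so the product message count times word count is always about n². A toy evaluation of the counting (names assumed, constants dropped):

```python
# Latency/bandwidth tradeoff along the LU critical path: with the
# matrix cut into d diagonal blocks of size n/d, the critical path
# performs ~n^3/d^2 flops, moves ~n^2/d words, and sends ~d
# messages, so messages * words = n^2 regardless of d.
# Illustrative sketch of the counting argument, not a proof.

def lu_critical_path_costs(n, d):
    flops = n**3 // d**2
    words = n**2 // d
    msgs = d
    return flops, words, msgs
```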
2.5D LU factorization
[Figure, steps (A)-(D): the leading panel is factored into L₀₀ and U₀₀ and replicated across layers; the remaining panels L₁₀, L₂₀, L₃₀ and U₀₁ through U₀₃ are computed; the trailing matrix is then updated with the resulting L and U]
2.5D LU strong scaling (without pivoting)
[Figure: 2.5D LU on BG/P (n=65,536); percentage of machine peak vs. p = 256, 512, 1024, 2048; curves: 2.5D LU (no pivoting), 2D LU (no pivoting)]
2.5D LU with pivoting
A = P · L · U, where P is a permutation matrix
- 2.5D generic pairwise elimination (neighbor/pairwise pivoting, or Givens rotations for QR) [A. Tiskin 2007]
  - pairwise pivoting does not produce an explicit L
  - pairwise pivoting may have stability issues for large matrices
- Our approach uses tournament pivoting, which is more stable than pairwise pivoting and gives L explicitly
  - pass up rows of A instead of U to avoid error accumulation
Tournament pivoting
Partial pivoting is not communication-optimal on a blocked matrix:
- it requires a message/synchronization for each column
- O(n) messages needed
Tournament pivoting is communication-optimal:
- it performs a tournament to determine the best pivot row candidates
- it passes up the 'best rows' of A
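The tournament can be sketched as a binary reduction: each leaf proposes its best local candidate rows, and each round keeps the best of two candidate sets. Note that real tournament pivoting (CALU) selects a block of b rows per round by running GEPP on stacked candidate blocks; the single-row, largest-absolute-value selection rule below is a simplification for illustration only:

```python
# Simplified tournament pivoting for a single pivot column: each
# process block proposes its best local row, then a binary
# tournament reduces the candidates pairwise. Real tournament
# pivoting (CALU) selects b rows per round via GEPP on stacked
# candidate blocks; picking one row by largest |entry| in the pivot
# column is a deliberate simplification.

def tournament_pivot(row_blocks, col):
    best = lambda rows: max(rows, key=lambda r: abs(r[col]))
    # Leaves of the tournament: best candidate row in each block.
    candidates = [best(block) for block in row_blocks]
    # Pairwise reduction rounds (log2 of the number of blocks).
    while len(candidates) > 1:
        candidates = [best(candidates[i:i + 2])
                      for i in range(0, len(candidates), 2)]
    return candidates[0]
```

With p blocks this needs only O(log p) reduction rounds, versus one synchronization per column for partial pivoting.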
2.5D LU factorization with tournament pivoting
[Figure, four steps: (1) row panels PA₀ through PA₃ run a tournament of local PLU factorizations to select pivot rows; (2) the selected rows yield the panel factorization and the column panels L₁₀ through L₄₀; (3) the panels U₀₁ and L₁₀ are broadcast and the trailing matrix is updated on each layer; (4) the result holds L₀₀, U₀₀, the panels L₁₀, L₂₀, L₃₀, and U₀₁, U₀₂, U₀₃]
Strong scaling of 2.5D LU with tournament pivoting
[Figure: 2.5D LU on BG/P (n=65,536); percentage of machine peak vs. p = 256, 512, 1024, 2048; curves: 2.5D LU (CA-pivoting), 2D LU (CA-pivoting), ScaLAPACK PDGETRF]
2.5D QR factorization
A = Q · R, where Q is orthogonal and R is upper-triangular
- 2.5D QR using Givens rotations (generic pairwise elimination) is given by [A. Tiskin 2007]
  - Tiskin minimizes latency and bandwidth by working on slanted panels
- 2.5D QR cannot be done with right-looking updates as in 2.5D LU, due to the non-commutativity of orthogonalization updates
2.5D QR factorization using the YT representation
The YT representation of Householder QR factorization is more work-efficient when computing only R.
- We give an algorithm that performs 2.5D QR using the YT representation
- The algorithm performs left-looking updates on Y
- Householder QR with the YT representation needs less computation (roughly 2x less) than Givens rotations
- Our approach achieves optimal bandwidth cost, but has O(n) latency
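For reference, the building block being reorganized here is ordinary Householder QR: reflectors H = I - 2vv^T/(v^T v) reduce A to R, and the vectors v are the columns of Y in the YT representation. The sketch below is textbook serial Householder QR, not the paper's 2.5D algorithm:

```python
import math

# Serial Householder QR: reduce A (m x n, m >= n) to upper-
# triangular R by applying reflectors H_k = I - 2 v v^T / (v^T v).
# The vectors v collected in Y are the columns of Y in the YT
# representation. Textbook sketch, not the paper's 2.5D algorithm.

def householder_qr(A):
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Y = []  # Householder vectors
    for k in range(n):
        # Build the reflector from column k at and below the diagonal.
        x = [R[i][k] for i in range(k, m)]
        alpha = -math.copysign(math.sqrt(sum(t * t for t in x)), x[0])
        v = x[:]
        v[0] -= alpha  # v = x - alpha * e1
        vnorm2 = sum(t * t for t in v)
        if vnorm2 > 0:
            # Apply H = I - 2 v v^T / (v^T v) to the trailing columns.
            for j in range(k, n):
                dot = sum(v[i] * R[k + i][j] for i in range(m - k))
                coef = 2.0 * dot / vnorm2
                for i in range(m - k):
                    R[k + i][j] -= coef * v[i]
        Y.append(v)
    return Y, R
```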
2.5D QR using the YT representation
[Figure: diagram of the 2.5D QR algorithm based on the YT representation]
Conclusion
Our contributions:
- 2.5D mapping of matrix multiplication
  - optimal according to the lower bounds of [Irony, Tiskin, Toledo 04] and [Aggarwal, Chandra, and Snir 90]
- A new latency lower bound for LU
- Communication-optimal 2.5D LU and QR
  - both are bandwidth-optimal according to the general lower bound of [Ballard, Demmel, Holtz, Schwartz 10]
  - LU is latency-optimal according to the new lower bound
Reflections:
- Replication allows better strong scaling
- Topology-aware mapping cuts communication costs
Rectangular collectives
Backup slides
Performance of multicast (BG/P vs Cray)
[Figure: bandwidth (MB/sec) of a 1 MB multicast vs. #nodes (8 to 4096) on BG/P, Cray XE6, and Cray XT5]
Why the performance discrepancy in multicasts?
- Cray machines use binomial multicasts
  - form a spanning tree from a list of nodes
  - route copies of the message down each branch
  - network contention degrades utilization on a 3D torus
- BG/P uses rectangular multicasts
  - require the network topology to be a k-ary n-cube
  - form 2n edge-disjoint spanning trees
  - route in different dimensional orders
  - use both directions of the bidirectional network
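On a k-ary n-cube, the 2n edge-disjoint trees arise from pairing each of the n cyclic dimension orders with each of the two link directions. A small enumeration sketches the counting (this is only the counting argument, not BG/P's actual routing tables):

```python
# Enumerate the routes used by a rectangular multicast on a k-ary
# n-cube: each spanning tree routes in one of n cyclic dimension
# orders, in one of the two torus link directions, giving 2n
# edge-disjoint trees. Sketch of the counting only; not BG/P's
# routing implementation.

def rectangular_routes(ndims):
    dims = list(range(ndims))
    routes = []
    for i in range(ndims):
        order = tuple(dims[i:] + dims[:i])  # cyclic dimension order
        for direction in (+1, -1):          # both torus directions
            routes.append((order, direction))
    return routes
```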
2D rectangular multicast trees
[Figure: a 2D 4x4 torus with a root node; spanning trees 1 through 4, and all 4 trees combined]
Cost breakdown of MM on 65,536 cores
[Figure: matrix multiplication on 16,384 nodes of BG/P; execution time normalized by 2D, split into computation, idle, and communication, for n=8192 and n=131072 in 2D and 2.5D; 2.5D gives a 95% reduction in communication]
2.5D LU on 65,536 cores
[Figure: LU on 16,384 nodes of BG/P (n=131,072); time (sec) split into compute, idle, and communication for no-pivot 2D, no-pivot 2.5D, CA-pivot 2D, and CA-pivot 2.5D; 2.5D is 2X faster in both cases]