2.5D algorithms for parallel dense linear algebra
Edgar Solomonik and James Demmel
UC Berkeley
June, 2012
Edgar Solomonik and James Demmel 2.5D algorithms 1/ 33
Outline
- Introduction
  - Strong scaling
- Matrix multiplication
  - 2D and 3D algorithms
  - 2.5D matrix multiplication
- LU factorization
  - 2.5D LU without pivoting
  - 2.5D LU with pivoting
- QR factorization
  - 2.5D QR using Givens rotations
  - 2.5D QR using Householder transformations
- Conclusion
Strong scaling
Solving science problems faster:
- Parallel computers can solve bigger problems (weak scaling)
- Parallel computers can also solve a fixed problem faster (strong scaling)
Obstacles to strong scaling:
- strong scaling may increase the relative cost of communication
- it may hurt load balance
How to reduce communication and maintain load balance?
- reduce (minimize) communication along the critical path
- exploit the network topology
Blocking matrix multiplication
[Figure: blocks of A and B are paired to compute blocks of the product AB]
2D matrix multiplication [Cannon 69], [Van De Geijn and Watts 97]
[Figure: 16 CPUs in a 4x4 grid, each holding blocks of A and B]
- O(n³/p) flops
- O(n²/√p) words moved
- O(√p) messages
- O(n²/p) bytes of memory
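The 2D algorithm can be illustrated with a serial simulation of Cannon's algorithm: each virtual process on a √p x √p grid holds one block of A and B, and blocks are shifted left/up between multiply steps. This is a minimal sketch for intuition (the grid/block helper names are assumptions, not the paper's implementation):

```python
# Serial simulation of Cannon's 2D matrix multiplication on a
# q x q virtual process grid of b x b blocks. Each "process" (i, j)
# accumulates its block of C = A * B while blocks of A shift left
# and blocks of B shift up each step.
# Illustrative sketch only; helper names are assumptions.

def matmul_blocks(X, Y):
    """Multiply two dense blocks stored as lists of lists."""
    n, m, k = len(X), len(Y[0]), len(Y)
    return [[sum(X[i][t] * Y[t][j] for t in range(k))
             for j in range(m)] for i in range(n)]

def add_blocks(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def cannon_matmul(A, B, q, b):
    """C = A*B using a q x q virtual grid of b x b blocks."""
    # Extract block (i, j) of a matrix.
    blk = lambda M, i, j: [row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]]
    # Initial skew: block row i of A shifts left by i,
    # block column j of B shifts up by j.
    Ab = [[blk(A, i, (j + i) % q) for j in range(q)] for i in range(q)]
    Bb = [[blk(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
    Cb = [[[[0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]
    for _ in range(q):  # q multiply-and-shift steps
        for i in range(q):
            for j in range(q):
                Cb[i][j] = add_blocks(Cb[i][j],
                                      matmul_blocks(Ab[i][j], Bb[i][j]))
        # Shift A blocks left by one and B blocks up by one.
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    # Reassemble C from its blocks.
    return [sum((Cb[i][j][r] for j in range(q)), [])
            for i in range(q) for r in range(b)]
```

Each process sends and receives O(q) = O(√p) messages of size b² = n²/p, matching the word and message counts above.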
3D matrix multiplication [Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Bernsten 89], [McColl and Tiskin 99]
[Figure: 64 CPUs in a 4x4x4 grid; 4 copies of the matrices]
- O(n³/p) flops
- O(n²/p^(2/3)) words moved
- O(1) messages
- O(n²/p^(2/3)) bytes of memory
2.5D matrix multiplication [McColl and Tiskin 99]
[Figure: 32 CPUs in a 4x4x2 grid; 2 copies of the matrices]
- O(n³/p) flops
- O(n²/√(c·p)) words moved
- O(√(p/c³)) messages
- O(c·n²/p) bytes of memory
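These costs interpolate between the 2D and 3D cases as the replication factor c ranges from 1 to p^(1/3). A small sketch makes the endpoints explicit (function names here are illustrative, not from the paper):

```python
import math

# Asymptotic per-process costs of 2.5D matrix multiplication
# (constants dropped), as functions of matrix dimension n, process
# count p, and replication factor c in [1, p^(1/3)].
# Illustrative sketch; function names are assumptions.

def words_moved(n, p, c):
    return n**2 / math.sqrt(c * p)

def messages(p, c):
    return math.sqrt(p / c**3)

def memory(n, p, c):
    return c * n**2 / p

# c = 1 recovers the 2D costs (n^2/sqrt(p) words, sqrt(p) messages);
# c = p^(1/3) recovers the 3D costs (n^2/p^(2/3) words, O(1) messages).
```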
Strong scaling matrix multiplication
[Figure: 2.5D MM on BG/P (n=65,536); percentage of machine peak vs. p = 256, 512, 1024, 2048; curves: 2.5D MM, 2D MM, ScaLAPACK PDGEMM]
2.5D recursive LU
A = L · U, where L is lower-triangular and U is upper-triangular
- A 2.5D recursive algorithm with no pivoting [A. Tiskin 2002]
  - Tiskin gives the algorithm under the BSP (Bulk Synchronous Parallel) model
  - the model considers communication and synchronization
- We give an alternative distributed-memory adaptation and implementation
- We also lower-bound the latency cost
2D blocked LU factorization
[Figure, four steps: start from A; factor the leading block as L₀₀U₀₀; solve triangular systems for the corresponding panels of L and U; form the Schur complement S = A - LU and recurse on it]
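The steps above (factor the leading block, solve for the panels, update with S = A - LU, recurse) can be sketched serially; with block size 1 they reduce to the classical unblocked elimination below. This is a plain textbook sketch, not the paper's 2.5D implementation, and it assumes no zero pivots arise:

```python
# Serial LU without pivoting, mirroring the blocked steps:
# take the pivot (the 1x1 "L00 U00"), scale the column panel of L,
# copy the row panel of U, and update the trailing matrix
# S = A - L*U before moving to the next step.
# Illustrative sketch only (assumes no zero pivots).

def lu_nopivot(A):
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    S = [row[:] for row in A]  # working copy (trailing matrix)
    for k in range(n):
        pivot = S[k][k]
        U[k][k:] = S[k][k:]            # row k of U
        L[k][k] = 1.0
        for i in range(k + 1, n):
            L[i][k] = S[i][k] / pivot  # column k of L
        for i in range(k + 1, n):      # Schur complement update
            for j in range(k + 1, n):
                S[i][j] -= L[i][k] * U[k][j]
    return L, U
```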
2D block-cyclic decomposition
[Figure: a 4x4 processor grid; each processor owns a cyclically scattered set of matrix blocks]
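In a 2D block-cyclic layout, block (I, J) of the matrix lives on processor (I mod pr, J mod pc), so every processor keeps owning blocks of the shrinking active submatrix as the factorization proceeds. A minimal owner map (names assumed, not ScaLAPACK's API):

```python
# Owner of matrix entry (i, j) under a 2D block-cyclic layout with
# block size b on a pr x pc processor grid: block indices wrap
# cyclically around the grid, so each processor owns many scattered
# blocks and the active part of LU stays load-balanced.
# Illustrative sketch; names are assumptions.

def owner(i, j, b, pr, pc):
    return ((i // b) % pr, (j // b) % pc)
```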
2D block-cyclic LU factorization
[Figure, three steps: the same LU steps on the block-cyclic layout; the factored panels of L and U and the Schur complement S = A - LU stay spread cyclically over the processor grid]
A new latency lower bound for LU
- Relate volume to surface area to diameter
- For block size n/d, LU does:
  - Ω(n³/d²) flops
  - Ω(n²/d) words
  - Ω(d) messages
- Now pick d (= latency cost):
  - d = Ω(√p) to minimize flops
  - d = Ω(√(c·p)) to minimize words
- More generally, latency · bandwidth = Ω(n²)
[Figure: an n-by-n matrix partitioned into diagonal blocks A₀₀, A₁₁, A₂₂, ..., A_{d-1,d-1}; the critical path runs along the diagonal through elimination steps k₀, k₁, k₂, k₃, k₄]
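The tradeoff says no choice of block size makes both terms small at once: with d diagonal blocks the critical path sends about d messages of about n²/d words each, so the product message count times word count is always about n². A toy evaluation of the counting (names assumed, constants dropped):

```python
# Latency/bandwidth tradeoff along the LU critical path: with the
# matrix cut into d diagonal blocks of size n/d, the critical path
# performs ~n^3/d^2 flops, moves ~n^2/d words, and sends ~d
# messages, so messages * words = n^2 regardless of d.
# Illustrative sketch of the counting argument, not a proof.

def lu_critical_path_costs(n, d):
    flops = n**3 // d**2
    words = n**2 // d
    msgs = d
    return flops, words, msgs
```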
2.5D LU factorization
[Figure, steps (A)-(D): the leading panel is factored into L₀₀ and U₀₀ and replicated across layers; the remaining panels L₁₀, L₂₀, L₃₀ and U₀₁ through U₀₃ are computed; the trailing matrix is then updated with the resulting L and U]
2.5D LU strong scaling (without pivoting)
[Figure: 2.5D LU on BG/P (n=65,536); percentage of machine peak vs. p = 256, 512, 1024, 2048; curves: 2.5D LU (no pivoting), 2D LU (no pivoting)]
2.5D LU with pivoting
A = P · L · U, where P is a permutation matrix
- 2.5D generic pairwise elimination (neighbor/pairwise pivoting, or Givens rotations for QR) [A. Tiskin 2007]
  - pairwise pivoting does not produce an explicit L
  - pairwise pivoting may have stability issues for large matrices
- Our approach uses tournament pivoting, which is more stable than pairwise pivoting and gives L explicitly
  - pass up rows of A instead of U to avoid error accumulation
Tournament pivoting
Partial pivoting is not communication-optimal on a blocked matrix:
- it requires a message/synchronization for each column
- O(n) messages needed
Tournament pivoting is communication-optimal:
- it performs a tournament to determine the best pivot row candidates
- it passes up the 'best rows' of A
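The tournament can be sketched as a binary reduction: each leaf proposes its best local candidate rows, and each round keeps the best of two candidate sets. Note that real tournament pivoting (CALU) selects a block of b rows per round by running GEPP on stacked candidate blocks; the single-row, largest-absolute-value selection rule below is a simplification for illustration only:

```python
# Simplified tournament pivoting for a single pivot column: each
# process block proposes its best local row, then a binary
# tournament reduces the candidates pairwise. Real tournament
# pivoting (CALU) selects b rows per round via GEPP on stacked
# candidate blocks; picking one row by largest |entry| in the pivot
# column is a deliberate simplification.

def tournament_pivot(row_blocks, col):
    best = lambda rows: max(rows, key=lambda r: abs(r[col]))
    # Leaves of the tournament: best candidate row in each block.
    candidates = [best(block) for block in row_blocks]
    # Pairwise reduction rounds (log2 of the number of blocks).
    while len(candidates) > 1:
        candidates = [best(candidates[i:i + 2])
                      for i in range(0, len(candidates), 2)]
    return candidates[0]
```

With p blocks this needs only O(log p) reduction rounds, versus one synchronization per column for partial pivoting.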
2.5D LU factorization with tournament pivoting
[Figure, four steps: (1) row panels PA₀ through PA₃ run a tournament of local PLU factorizations to select pivot rows; (2) the selected rows yield the panel factorization and the column panels L₁₀ through L₄₀; (3) the panels U₀₁ and L₁₀ are broadcast and the trailing matrix is updated on each layer; (4) the result holds L₀₀, U₀₀, the panels L₁₀, L₂₀, L₃₀, and U₀₁, U₀₂, U₀₃]
Strong scaling of 2.5D LU with tournament pivoting
[Figure: 2.5D LU on BG/P (n=65,536); percentage of machine peak vs. p = 256, 512, 1024, 2048; curves: 2.5D LU (CA-pivoting), 2D LU (CA-pivoting), ScaLAPACK PDGETRF]
2.5D QR factorization
A = Q · R, where Q is orthogonal and R is upper-triangular
- 2.5D QR using Givens rotations (generic pairwise elimination) is given by [A. Tiskin 2007]
  - Tiskin minimizes latency and bandwidth by working on slanted panels
- 2.5D QR cannot be done with right-looking updates as in 2.5D LU, due to the non-commutativity of orthogonalization updates
2.5D QR factorization using the YT representation
The YT representation of Householder QR factorization is more work-efficient when computing only R.
- We give an algorithm that performs 2.5D QR using the YT representation
- The algorithm performs left-looking updates on Y
- Householder QR with the YT representation needs less computation (roughly 2x less) than Givens rotations
- Our approach achieves optimal bandwidth cost, but has O(n) latency
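For reference, the building block being reorganized here is ordinary Householder QR: reflectors H = I - 2vv^T/(v^T v) reduce A to R, and the vectors v are the columns of Y in the YT representation. The sketch below is textbook serial Householder QR, not the paper's 2.5D algorithm:

```python
import math

# Serial Householder QR: reduce A (m x n, m >= n) to upper-
# triangular R by applying reflectors H_k = I - 2 v v^T / (v^T v).
# The vectors v collected in Y are the columns of Y in the YT
# representation. Textbook sketch, not the paper's 2.5D algorithm.

def householder_qr(A):
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Y = []  # Householder vectors
    for k in range(n):
        # Build the reflector from column k at and below the diagonal.
        x = [R[i][k] for i in range(k, m)]
        alpha = -math.copysign(math.sqrt(sum(t * t for t in x)), x[0])
        v = x[:]
        v[0] -= alpha  # v = x - alpha * e1
        vnorm2 = sum(t * t for t in v)
        if vnorm2 > 0:
            # Apply H = I - 2 v v^T / (v^T v) to the trailing columns.
            for j in range(k, n):
                dot = sum(v[i] * R[k + i][j] for i in range(m - k))
                coef = 2.0 * dot / vnorm2
                for i in range(m - k):
                    R[k + i][j] -= coef * v[i]
        Y.append(v)
    return Y, R
```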
2.5D QR using the YT representation
[Figure: diagram of the 2.5D QR algorithm based on the YT representation]
Conclusion
Our contributions:
- 2.5D mapping of matrix multiplication
  - optimal according to the lower bounds of [Irony, Tiskin, Toledo 04] and [Aggarwal, Chandra, and Snir 90]
- A new latency lower bound for LU
- Communication-optimal 2.5D LU and QR
  - both are bandwidth-optimal according to the general lower bound of [Ballard, Demmel, Holtz, Schwartz 10]
  - LU is latency-optimal according to the new lower bound
Reflections:
- Replication allows better strong scaling
- Topology-aware mapping cuts communication costs
Rectangular collectives
Backup slides
Performance of multicast (BG/P vs Cray)
[Figure: bandwidth (MB/sec) of a 1 MB multicast vs. #nodes (8 to 4096) on BG/P, Cray XE6, and Cray XT5]
Why the performance discrepancy in multicasts?
- Cray machines use binomial multicasts
  - form a spanning tree from a list of nodes
  - route copies of the message down each branch
  - network contention degrades utilization on a 3D torus
- BG/P uses rectangular multicasts
  - require the network topology to be a k-ary n-cube
  - form 2n edge-disjoint spanning trees
  - route in different dimensional orders
  - use both directions of the bidirectional network
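On a k-ary n-cube, the 2n edge-disjoint trees arise from pairing each of the n cyclic dimension orders with each of the two link directions. A small enumeration sketches the counting (this is only the counting argument, not BG/P's actual routing tables):

```python
# Enumerate the routes used by a rectangular multicast on a k-ary
# n-cube: each spanning tree routes in one of n cyclic dimension
# orders, in one of the two torus link directions, giving 2n
# edge-disjoint trees. Sketch of the counting only; not BG/P's
# routing implementation.

def rectangular_routes(ndims):
    dims = list(range(ndims))
    routes = []
    for i in range(ndims):
        order = tuple(dims[i:] + dims[:i])  # cyclic dimension order
        for direction in (+1, -1):          # both torus directions
            routes.append((order, direction))
    return routes
```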
2D rectangular multicast trees
[Figure: a 2D 4x4 torus with a root node; spanning trees 1 through 4, and all 4 trees combined]
Cost breakdown of MM on 65,536 cores
[Figure: matrix multiplication on 16,384 nodes of BG/P; execution time normalized by 2D, split into computation, idle, and communication, for n=8192 and n=131072 in 2D and 2.5D; 2.5D gives a 95% reduction in communication]
2.5D LU on 65,536 cores
[Figure: LU on 16,384 nodes of BG/P (n=131,072); time (sec) split into compute, idle, and communication for no-pivot 2D, no-pivot 2.5D, CA-pivot 2D, and CA-pivot 2.5D; 2.5D is 2X faster in both cases]