Topology-aware collectives · 2.5D algorithms · Tensor contractions · Conclusions and future work
Reducing communication in dense matrix/tensor computations
Edgar Solomonik
UC Berkeley
Aug 11th, 2011
Edgar Solomonik Communication-avoiding contractions 1/ 44
Outline
Topology-aware collectives: rectangular collectives, multicasts, reductions
2.5D algorithms: 2.5D matrix multiplication, 2.5D LU factorization
Tensor contractions: algorithms for distributed tensor contractions, a tensor contraction library implementation
Conclusions and future work
Rectangular collectives · Multicasts · Reductions
Performance of multicast (BG/P vs Cray)
[Figure: 1 MB multicast bandwidth (MB/s) vs. #nodes (8 to 4096) on BG/P, Cray XT5, and Cray XE6]
Why the performance discrepancy in multicasts?
- Cray machines use binomial multicasts
  - Form a spanning tree from a list of nodes
  - Route copies of the message down each branch
  - Network contention degrades utilization on a 3D torus
- BG/P uses rectangular multicasts
  - Require the network topology to be a k-ary n-cube
  - Form 2n edge-disjoint spanning trees
  - Route in different dimensional orders
  - Use both directions of the bidirectional network
2D rectangular multicast trees

[Figure: a 2D 4x4 torus with a root node; spanning trees 1-4 and all 4 trees combined]
A model for rectangular multicasts
t_mcast = m/B_n + 2(d+1)·o + 3L + d·P^(1/d)·(2o + L)

Our multicast model consists of three terms:

1. m/B_n, the bandwidth cost incurred at the root
2. 2(d+1)·o + 3L, the start-up overhead of setting up the multicasts in all dimensions
3. d·P^(1/d)·(2o + L), the path overhead, reflecting the time for a packet to get from the root to the farthest destination node
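To make the model concrete, here is a small Python sketch of t_mcast, evaluated term by term (the parameter values below are illustrative, not measured BG/P constants):

```python
def t_rect(m, B_n, d, o, L, P):
    """Rectangular multicast model t_mcast for a d-dimensional torus of P nodes."""
    bandwidth = m / B_n                           # term 1: injection at the root
    startup = 2 * (d + 1) * o + 3 * L             # term 2: setup in all dimensions
    path = d * P ** (1.0 / d) * (2 * o + L)       # term 3: root to farthest node
    return bandwidth + startup + path

# Example with made-up units: a 1 MB message on a 3D torus of 512 nodes.
t = t_rect(m=1e6, B_n=425e6, d=3, o=1e-6, L=1e-6, P=512)
```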
A model for binomial multicasts
t_bnm = log2(P) · (m/B_n + 2o + L)

- The root of the binomial tree sends the entire message log2(P) times
- The setup overhead is overlapped with the path overhead
- We assume no contention
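Putting the two models side by side (a sketch with illustrative constants; t_rect restates the rectangular model from the previous slide) shows the log2(P) bandwidth gap for large messages, and the lower latency of the binomial tree for tiny ones:

```python
import math

def t_bnm(m, B_n, o, L, P):
    """Binomial multicast model: the root resends the whole message log2(P) times."""
    return math.log2(P) * (m / B_n + 2 * o + L)

def t_rect(m, B_n, d, o, L, P):
    """Rectangular multicast model (previous slide)."""
    return m / B_n + 2 * (d + 1) * o + 3 * L + d * P ** (1.0 / d) * (2 * o + L)

P, d, B_n, o, L = 512, 3, 1.0, 1e-6, 1e-6
# For large m both models are bandwidth-bound and the ratio approaches
# log2(512) = 9: the root of the binomial tree is the bottleneck.
ratio = t_bnm(1e9, B_n, o, L, P) / t_rect(1e9, B_n, d, o, L, P)
```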
Model verification: one dimension
[Figure: DCMF Broadcast on a ring of 8 nodes of BG/P, bandwidth (MB/s) vs. message size (KB), comparing the t_rect model with DCMF rectangle dput and the t_bnm model with DCMF binomial]
Model verification: two dimensions
[Figure: DCMF Broadcast on 64 (8x8) nodes of BG/P, bandwidth (MB/s) vs. message size (KB), comparing the t_rect model with DCMF rectangle dput and the t_bnm model with DCMF binomial]
Model verification: three dimensions
[Figure: DCMF Broadcast on 512 (8x8x8) nodes of BG/P, bandwidth (MB/s) vs. message size (KB), comparing the t_rect model, Faraj et al. data, DCMF rectangle dput, the t_bnm model, and DCMF binomial]
A model for rectangular reductions
t_red = max[m/(8γ), 3m/β, m/B_n] + 2(d+1)·o + 3L + d·P^(1/d)·(2o + L)

- Any multicast tree can be inverted to produce a reduction tree
- The reduction operator must be applied at each node
  - Each node operates on 2m data
  - Both the memory bandwidth and computation cost can be overlapped
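A sketch of the model in Python (γ is the compute rate and β the memory bandwidth; the constants passed below are illustrative). The overlap is what the max expresses: only the slowest of the three per-node rates matters:

```python
def t_red(m, B_n, d, o, L, P, beta, gamma):
    """Rectangular reduction model: the inverted multicast tree, with the
    per-node operator cost (compute at gamma, memory traffic at beta)
    overlapped against the network bandwidth term via a max."""
    overlapped = max(m / (8 * gamma), 3 * m / beta, m / B_n)
    return overlapped + 2 * (d + 1) * o + 3 * L + d * P ** (1.0 / d) * (2 * o + L)

# With very fast compute and memory, the model degenerates to the
# multicast model: the network term dominates the max.
t = t_red(1e6, 425e6, 3, 1e-6, 1e-6, 512, beta=1e10, gamma=1e10)
```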
Rectangular reduction performance on BG/P
[Figure: DCMF Reduce peak bandwidth (largest message size) vs. #nodes, for torus rectangle ring, torus rectangle, torus binomial, and torus short binomial]

BG/P rectangular reduction performs significantly worse than multicast.
Performance of custom line reduction
[Figure: performance of custom Reduce/Multicast on 8 nodes, bandwidth (MB/s) vs. message size (KB), for MPI Broadcast, Custom Ring Multicast, Custom Ring Reduce 1, Custom Ring Reduce 2, and MPI Reduce]
Another look at that first plot
Just how much better are rectangular algorithms on P = 4096 nodes?

- Binomial collectives on XE6: 1/30th of link bandwidth
- Rectangular collectives on BG/P: 4.3X the link bandwidth
- Over 120X improvement in efficiency!

How can we apply this?
[Figure: repeat of the 1 MB multicast bandwidth plot on BG/P, Cray XT5, and Cray XE6]
2.5D matrix multiplication · 2.5D LU factorization
2.5D Cannon-style matrix multiplication

[Figure: block layouts and shift patterns of A and B for 2D (P=16, c=1), 2.5D (P=32, c=2), and 3D (P=64, c=4) Cannon-style matrix multiplication]
Classification of parallel dense matrix algorithms
algs   c             memory (M)       words (W)       messages (S)
2D     1             O(n^2/P)         O(n^2/√P)       O(√P)
2.5D   [1, P^(1/3)]  O(c·n^2/P)       O(n^2/√(cP))    O(√(P/c^3))
3D     P^(1/3)       O(n^2/P^(2/3))   O(n^2/P^(2/3))  O(log(P))
NEW: 2.5D algorithms generalize 2D and 3D algorithms
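The table's asymptotic costs can be sanity-checked in a few lines (constants dropped; n, P, c arbitrary). Setting c = 1 recovers the 2D row, and c = P^(1/3) recovers the 3D row:

```python
from math import sqrt

def mem(n, P, c):
    return c * n**2 / P          # memory (M): O(c*n^2/P)

def words(n, P, c):
    return n**2 / sqrt(c * P)    # bandwidth (W): O(n^2/sqrt(cP))

def msgs(n, P, c):
    return sqrt(P / c**3)        # latency (S): O(sqrt(P/c^3))

n, P = 4096, 4096
assert words(n, P, 1) == n**2 / sqrt(P)   # c = 1 gives the 2D costs
assert msgs(n, P, 16) == 1.0              # c = P**(1/3) = 16: latency drops to O(1)
assert words(n, P, 16) == 65536.0         # equals n^2 / P**(2/3), the 3D row
```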
Minimize communication with

- minimal memory (2D)
- as much memory as available (2.5D) - flexible
- as much memory as the algorithm can exploit (3D)

Match the network topology of

- a √P-by-√P grid (2D)
- a √(P/c)-by-√(P/c)-by-c grid, most cuboids (2.5D) - flexible
- a P^(1/3)-by-P^(1/3)-by-P^(1/3) cube (3D)
2.5D SUMMA-style matrix multiplication
[Figure: matrix mapping to a 3D partition of BG/P]
2.5D MM strong scaling
[Figure: 2.5D MM strong scaling on BG/P (n=65,536), percentage of machine peak vs. #nodes (256 to 2048), for 2.5D SUMMA, 2.5D Cannon, 2D MM (Cannon), and ScaLAPACK PDGEMM]
2.5D MM on 65,536 cores
[Figure: 2.5D MM on 16,384 nodes of BG/P, percentage of machine peak vs. n (8192 to 131072), for 2D SUMMA, 2D Cannon, 2.5D Cannon, and 2.5D SUMMA]
Cost breakdown of MM on 65,536 cores
[Figure: cost breakdown of SUMMA (2D vs. 2.5D) on 16,384 nodes of BG/P, execution time normalized by 2D and split into computation, idle, and communication, for n = 8192, 32768, 131072]
A new latency lower bound for LU
Can we reduce latency to O(√(P/c^3)) for LU?

- For block size n/d, LU does
  - Ω(n^3/d^2) flops
  - Ω(n^2/d) words
  - Ω(d) messages
- Now pick d (= latency cost)
  - d = Ω(√P) to minimize flops
  - d = Ω(√(c·P)) to minimize words

No dice. But let's minimize bandwidth.
[Figure: LU critical path through diagonal blocks A₀₀, A₁₁, ..., A_{d-1,d-1} of an n-by-n matrix, advancing one block per step k₀, k₁, ...]
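The trade-off can be checked numerically (arbitrary n, P, c; the functions only restate the lower-bound terms above): any d small enough to keep both the flop and word terms at their 2.5D-MM levels forces a latency of Ω(√(cP)), well above the O(√(P/c^3)) achieved for multiplication:

```python
from math import sqrt

def lu_critical_path(n, d):
    """Lower bounds along the LU critical path with block size n/d."""
    return n**3 / d**2, n**2 / d, d   # flops, words, messages

P, c = 4096, 16
d_flops = sqrt(P)        # smallest d keeping flops at O(n^3 / P)
d_words = sqrt(c * P)    # smallest d keeping words at O(n^2 / sqrt(cP))
d = max(d_flops, d_words)             # d must satisfy both constraints
# The resulting message count exceeds the 2.5D-MM latency target:
assert d >= sqrt(c * P) > sqrt(P / c**3)
```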
2.5D LU factorization without pivoting
1. Factorize A₀₀ and communicate L₀₀ and U₀₀ among layers.
2. Perform TRSMs to compute a panel of L and a panel of U.
3. Broadcast blocks so all layers own the panels of L and U.
4. Broadcast different subpanels within each layer.
5. Multiply subpanels on each layer.
6. Reduce (sum) the next panels.*
7. Broadcast the panels and continue factorizing the Schur complement...

* All layers always need to contribute to the reduction, even if an iteration is done with a subset of layers.

[Figure: panels (A)-(D) showing L₀₀ and U₀₀ replicated across layers, the L panels (L₁₀, L₂₀, L₃₀), and the U panels (U₀₁, U₀₃)]
2.5D LU factorization with tournament pivoting
1. Factorize each block in the first column with pivoting.
2. Reduce to find the best pivot rows.
3. Pivot rows in the first big block column on each layer.
4. Apply TRSMs to compute the first column of L and the first block of a row of U.
5. Update the corresponding interior blocks S = A - L·U₀₁.
6. Recurse to compute the rest of the first big block column of L.
7. Pivot rows in the rest of the matrix on each layer.
8. Perform TRSMs to compute a panel of U.
9. Update the rest of the matrix as before and recurse on the next block panel...

[Figure: progression across layers showing the pivoted blocks PA₀ ... PA₃, the panels L₁₀ ... L₄₀ and U₀₁ ... U₀₃, the L₀₀U₀₀ factorizations, and the interior updates]
2.5D LU strong scaling
[Figure: 2.5D LU strong scaling on BG/P (n=65,536), percentage of machine peak vs. #nodes (256 to 2048), for 2.5D LU (no-pvt), 2.5D LU (CA-pvt), 2D LU (no-pvt), 2D LU (CA-pvt), and ScaLAPACK PDGETRF]
2.5D LU on 65,536 cores
[Figure: 2.5D LU vs. 2D LU on 16,384 nodes of BG/P (n=131,072), time (sec) split into compute, idle, and communication, for 2D no-pvt, 2.5D no-pvt (binomial), 2.5D no-pvt (rectangular), 2D CA-pvt, and 2.5D CA-pvt]
Algorithms for distributed tensor contractions · A tensor contraction library implementation
Bridging dense linear algebra techniques and applications
Target application: tensor contractions in electronic structure calculations (quantum chemistry)

- Often memory constrained
- Most target tensors are oddly shaped
- Need support for high-dimensional tensors
- Need handling of partial/full tensor symmetries
- Would like to use communication-avoiding ideas (blocking, 2.5D, topology-awareness)
Decoupling memory usage and topology-awareness
- 2.5D algorithms couple memory usage and virtual topology: c copies of a matrix imply c processor layers
- Instead, we can nest 2D and/or 2.5D algorithms
- Higher-dimensional algorithms allow smarter topology-aware mapping
Higher-dimensional distributed MM
4D SUMMA-Cannon

How do we map to a 3D partition without using more memory?

- SUMMA (bcast-based) on 2D layers
- Cannon (send-based) along the third dimension
- Cannon calls SUMMA as a sub-routine
- Minimizes inefficient (non-rectangular) communication
- Allows better overlap
- Treats MM as a 4D tensor contraction
[Figure: MM on 512 nodes of BG/P, percentage of flops peak vs. matrix dimension (4096 to 131072), for 2.5D MM, 4D MM, 2D MM (Cannon), and PBLAS MM]
Symmetry is a problem

- A fully symmetric tensor of dimension d requires only ~n^d/d! storage
- Symmetry significantly complicates sequential implementation
  - Irregular indexing makes alignment and unrolling difficult
  - Generalizing over all partial symmetries is expensive
- Blocked or block-cyclic virtual processor decompositions give irregular or imbalanced virtual grids
[Figure: blocked vs. block-cyclic decompositions of a symmetric matrix over processors P0, P1, P2, P3]
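The storage claim is easy to verify exactly: a fully symmetric order-d tensor with index range n has C(n+d-1, d) unique entries, whose leading term is n^d/d! (a small Python check):

```python
from math import comb, factorial

def packed_size(n, d):
    """Unique entries of a fully symmetric order-d tensor with index range n:
    the number of multisets of size d drawn from n symbols."""
    return comb(n + d - 1, d)

# Order 2 (a symmetric matrix): n(n+1)/2 entries, roughly n^2/2!
assert packed_size(10, 2) == 55
# For large n the count approaches n^d / d!
n, d = 100, 3
assert abs(packed_size(n, d) / (n**d / factorial(d)) - 1) < 0.05
```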
Solving the symmetry problem

- A cyclic decomposition allows balanced and regular blocking of symmetric tensors
- If the cyclic phase is the same in each symmetric dimension, each sub-tensor retains the symmetry of the whole tensor
[Figure: cyclic decomposition of a symmetric matrix]
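A small plain-Python demonstration of this property (sizes chosen for illustration): under a cyclic map with equal phase in both dimensions of a symmetric matrix, the diagonal local blocks are themselves symmetric, and off-diagonal blocks come in transposed pairs:

```python
n, p = 6, 2                       # matrix size, cyclic phase per dimension
A = [[i * j for j in range(n)] for i in range(n)]   # a symmetric matrix

def local_block(r, c):
    """Sub-matrix owned by virtual processor (r, c) under a cyclic map."""
    return [[A[i][j] for j in range(c, n, p)] for i in range(r, n, p)]

def transpose(M):
    return [list(col) for col in zip(*M)]

# Diagonal blocks retain the symmetry of the whole matrix...
assert local_block(0, 0) == transpose(local_block(0, 0))
# ...and off-diagonal blocks are pairwise transposes of each other.
assert local_block(0, 1) == transpose(local_block(1, 0))
```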
A generalized cyclic layout is still challenging
- In order to retain partial symmetry, all symmetric dimensions of a tensor must be mapped with the same cyclic phase
- The contracted dimensions of A and B must be mapped with the same phase
- And yet the virtual mapping needs to be mapped to a physical topology, which can be any shape
Virtual processor grid dimensions
- Our virtual cyclic topology is somewhat restrictive, and the physical topology is very restricted
- Virtual processor grid dimensions serve as a new level of indirection
- If a tensor dimension must have a certain cyclic phase, adjust the physical mapping by creating a virtual processor dimension
- Allows the physical processor grid to be 'stretchable'
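One way to sketch this stretching (the lcm choice here is a hypothetical simplification, not necessarily the library's rule): dimensions that must share a cyclic phase get a common virtual phase, so each physical dimension is stretched by an integer virtual factor:

```python
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

def virtualize(phys_dims):
    """For physical grid dimensions that must share one cyclic phase,
    return the common phase and the per-dimension virtual factors."""
    phase = 1
    for d in phys_dims:
        phase = lcm(phase, d)
    return phase, [phase // d for d in phys_dims]

# Two symmetric tensor dimensions mapped onto a 2x3 physical grid: both
# get cyclic phase 6; the grid is "stretched" by virtual factors 3 and 2.
assert virtualize([2, 3]) == (6, [3, 2])
```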
Constructing a virtual processor grid for MM
[Figure: matrix multiply C = A × B on a 2x3 processor grid; red lines represent the virtualized part of the processor grid, and elements are assigned to blocks by cyclic phase]
Unfolding the processor grid
- Higher-dimensional fully-symmetric tensors can be mapped onto a lower-dimensional processor grid via creation of new virtual dimensions
- Lower-dimensional tensors can be mapped onto a higher-dimensional processor grid by unfolding (serializing) pairs of processor dimensions
- However, when possible, replication is better than unfolding, since unfolded processor grids can lead to an unbalanced mapping
A basic parallel algorithm for symmetric tensor contractions
1. Arrange the processor grid in any k-ary n-cube shape
2. Map (via unfolding and virtualization) both A and B cyclically along the dimensions being contracted
3. Map (via unfolding and virtualization) the remaining dimensions of A and B cyclically
4. For each tensor dimension contracted over, recursively multiply the tensors along the mapping
   - Each contraction dimension is handled with a nested call to a local multiply or a parallel algorithm (e.g. Cannon)
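As a serial illustration of step 4 (each outer loop below stands in for a nested parallel algorithm over one mapped contraction dimension; this is a sketch, not the library's code), here is a matrix multiply whose contracted index is peeled one mapping level at a time:

```python
def contract(A, B, n, k_phase):
    """C = A*B with the contracted index k split across mapping levels:
    the outer loop models a parallel algorithm (e.g. Cannon) over k_phase
    processors; the innermost level is the local multiply over owned indices."""
    C = [[0.0] * n for _ in range(n)]
    for kp in range(k_phase):                 # nested "parallel" level
        for i in range(n):
            for j in range(n):                # local multiply over indices
                C[i][j] += sum(A[i][k] * B[k][j]
                               for k in range(kp, n, k_phase))
    return C

n = 4
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i * j + 1) for j in range(n)] for i in range(n)]
direct = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
assert contract(A, B, n, k_phase=2) == direct
```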
Tensor library structure
The library supports arbitrary-dimensional parallel tensor contractions with any symmetries on n-cuboid processor torus partitions:

1. Load tensor data as (global rank, value) pairs
2. Once a contraction is defined, map the participating tensors
3. Distribute or reshuffle tensor data/pairs
4. Construct the contraction algorithm with recursive function/argument pointers
5. Contract the sub-tensors with a user-defined sequential contract function
6. Output (global rank, value) pairs on request
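Step 3 can be sketched on (global rank, value) pairs with a simplified, hypothetical 1D cyclic layout: each pair is bucketed to the processor its rank maps to under the cyclic phase:

```python
def redistribute(pairs, P):
    """Bucket (global rank, value) pairs by owner under a 1D cyclic map."""
    buckets = [[] for _ in range(P)]
    for rank, value in pairs:
        buckets[rank % P].append((rank, value))
    return buckets

pairs = [(r, r * 0.5) for r in range(10)]
buckets = redistribute(pairs, P=4)
# Every pair landed on the processor its cyclic phase selects.
assert all(rank % 4 == owner
           for owner, b in enumerate(buckets) for rank, _ in b)
```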
Current tensor library status
- Dense and symmetric remapping/repadding/contractions implemented
- Currently functional only for dense tensors, but with full symmetric logic
- Can perform automatic mapping with physical and virtual dimensions, but cannot unfold processor dimensions yet
- Complete library interface implemented, including basic auxiliary functions (e.g. map/reduce, sum, etc.)
Next implementation steps
- Currently integrating the library with an SCF method code that uses dense contractions
- Get symmetric redistribution working correctly
- Automatic unfolding of processor dimensions
- Implement mapping by replication to enable 2.5D algorithms
- Much basic performance debugging/optimization left to do
- More optimization needed for sequential symmetric contractions
Very preliminary contraction library results
Contracts tensors of size 64x64x256x256 in 1 second on 2K nodes
[Figure: strong scaling of a dense 64x64x256x256 contraction on BG/P, percentage of machine peak vs. #nodes (256 to 2048), for no rephase, rephase every contraction, and repad every contraction]
Potential benefit of unfolding

Unfolding the smallest two BG/P torus dimensions improves performance.
[Figure: strong scaling of a dense 64x64x256x256 contraction on BG/P, percentage of machine peak vs. #nodes (256 to 2048), for no-rephase 2D vs. no-rephase 3D]
Contributions

- Models for rectangular collectives
- 2.5D algorithms: theory and implementation
- Using a cyclic mapping to parallelize symmetric tensor contractions
- Extending and tuning the processor grid with virtual dimensions
- Automatic mapping of high-dimensional tensors to topology-aware physical partitions
- A parallel tensor contraction algorithm/library without a global address space
Conclusions and references
- The parallel tensor contraction algorithm and library appear to be the first communication-efficient practical approach
- Preliminary results and theory indicate high potential for this tensor contraction library
- Papers:
  - (2.5D) to appear in Euro-Par 2011, Distinguished Paper
  - (2.5D + rectangular collective models) to appear in Supercomputing 2011
Backup slides
A new LU latency lower bound
[Figure: LU critical path through diagonal blocks A₀₀, A₁₁, ..., A_{d-1,d-1} of an n-by-n matrix, advancing one block per step k₀, k₁, ...]
The flops lower bound requires d = Ω(√P) blocks/messages.
The bandwidth lower bound requires d = Ω(√(cP)) blocks/messages.
Virtual topology of 2.5D algorithms
[Figure: the (√(P/c))×(√(P/c))×c virtual processor grid with axes i, j, k]

2D algorithm mapping: a (√P)×(√P) grid
2.5D algorithm mapping: a (√(P/c))×(√(P/c))×c grid, for any c