Topology-aware collectives · 2.5D algorithms · Tensor contractions · Conclusions and future work
Reducing communication in dense matrix/tensor computations
Edgar Solomonik
UC Berkeley
Aug 11th, 2011
Edgar Solomonik Communication-avoiding contractions 1/ 44
Outline
Topology-aware collectives: rectangular collectives, multicasts, reductions
2.5D algorithms: 2.5D matrix multiplication, 2.5D LU factorization
Tensor contractions: algorithms for distributed tensor contractions, a tensor contraction library implementation
Conclusions and future work
Rectangular collectives · Multicasts · Reductions
Performance of multicast (BG/P vs Cray)
[Figure: 1 MB multicast bandwidth (MB/s) vs. #nodes (8 to 4096) on BG/P, Cray XT5, and Cray XE6]
Why the performance discrepancy in multicasts?
- Cray machines use binomial multicasts
  - Form a spanning tree from a list of nodes
  - Route copies of the message down each branch
  - Network contention degrades utilization on a 3D torus
- BG/P uses rectangular multicasts
  - Require the network topology to be a k-ary n-cube
  - Form 2n edge-disjoint spanning trees
  - Route in different dimensional orders
  - Use both directions of the bidirectional network
2D rectangular multicast trees

[Figure: a 2D 4x4 torus with a root node; spanning trees 1-4 and all 4 trees combined]
A model for rectangular multicasts
t_mcast = m/B_n + 2(d+1)·o + 3L + d·P^(1/d)·(2o + L)

Our multicast model consists of three terms:

1. m/B_n, the bandwidth cost incurred at the root
2. 2(d+1)·o + 3L, the start-up overhead of setting up the multicasts in all dimensions
3. d·P^(1/d)·(2o + L), the path overhead, reflecting the time for a packet to get from the root to the farthest destination node
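To make the model concrete, here is a small Python sketch of t_mcast, evaluated term by term (the parameter values below are illustrative, not measured BG/P constants):

```python
def t_rect(m, B_n, d, o, L, P):
    """Rectangular multicast model t_mcast for a d-dimensional torus of P nodes."""
    bandwidth = m / B_n                           # term 1: injection at the root
    startup = 2 * (d + 1) * o + 3 * L             # term 2: setup in all dimensions
    path = d * P ** (1.0 / d) * (2 * o + L)       # term 3: root to farthest node
    return bandwidth + startup + path

# Example with made-up units: a 1 MB message on a 3D torus of 512 nodes.
t = t_rect(m=1e6, B_n=425e6, d=3, o=1e-6, L=1e-6, P=512)
```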
A model for binomial multicasts
t_bnm = log2(P) · (m/B_n + 2o + L)

- The root of the binomial tree sends the entire message log2(P) times
- The setup overhead is overlapped with the path overhead
- We assume no contention
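Putting the two models side by side (a sketch with illustrative constants; t_rect restates the rectangular model from the previous slide) shows the log2(P) bandwidth gap for large messages, and the lower latency of the binomial tree for tiny ones:

```python
import math

def t_bnm(m, B_n, o, L, P):
    """Binomial multicast model: the root resends the whole message log2(P) times."""
    return math.log2(P) * (m / B_n + 2 * o + L)

def t_rect(m, B_n, d, o, L, P):
    """Rectangular multicast model (previous slide)."""
    return m / B_n + 2 * (d + 1) * o + 3 * L + d * P ** (1.0 / d) * (2 * o + L)

P, d, B_n, o, L = 512, 3, 1.0, 1e-6, 1e-6
# For large m both models are bandwidth-bound and the ratio approaches
# log2(512) = 9: the root of the binomial tree is the bottleneck.
ratio = t_bnm(1e9, B_n, o, L, P) / t_rect(1e9, B_n, d, o, L, P)
```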
Model verification: one dimension
[Figure: DCMF Broadcast on a ring of 8 nodes of BG/P, bandwidth (MB/s) vs. message size (KB), comparing the t_rect model with DCMF rectangle dput and the t_bnm model with DCMF binomial]
Model verification: two dimensions
[Figure: DCMF Broadcast on 64 (8x8) nodes of BG/P, bandwidth (MB/s) vs. message size (KB), comparing the t_rect model with DCMF rectangle dput and the t_bnm model with DCMF binomial]
Model verification: three dimensions
[Figure: DCMF Broadcast on 512 (8x8x8) nodes of BG/P, bandwidth (MB/s) vs. message size (KB), comparing the t_rect model, Faraj et al. data, DCMF rectangle dput, the t_bnm model, and DCMF binomial]
A model for rectangular reductions
t_red = max[m/(8γ), 3m/β, m/B_n] + 2(d+1)·o + 3L + d·P^(1/d)·(2o + L)

- Any multicast tree can be inverted to produce a reduction tree
- The reduction operator must be applied at each node
  - Each node operates on 2m data
  - Both the memory bandwidth and computation cost can be overlapped
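A sketch of the model in Python (γ is the compute rate and β the memory bandwidth; the constants passed below are illustrative). The overlap is what the max expresses: only the slowest of the three per-node rates matters:

```python
def t_red(m, B_n, d, o, L, P, beta, gamma):
    """Rectangular reduction model: the inverted multicast tree, with the
    per-node operator cost (compute at gamma, memory traffic at beta)
    overlapped against the network bandwidth term via a max."""
    overlapped = max(m / (8 * gamma), 3 * m / beta, m / B_n)
    return overlapped + 2 * (d + 1) * o + 3 * L + d * P ** (1.0 / d) * (2 * o + L)

# With very fast compute and memory, the model degenerates to the
# multicast model: the network term dominates the max.
t = t_red(1e6, 425e6, 3, 1e-6, 1e-6, 512, beta=1e10, gamma=1e10)
```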
Rectangular reduction performance on BG/P
[Figure: DCMF Reduce peak bandwidth (largest message size) vs. #nodes, for torus rectangle ring, torus rectangle, torus binomial, and torus short binomial]

BG/P rectangular reduction performs significantly worse than multicast.
Performance of custom line reduction
[Figure: performance of custom Reduce/Multicast on 8 nodes, bandwidth (MB/s) vs. message size (KB), for MPI Broadcast, Custom Ring Multicast, Custom Ring Reduce 1, Custom Ring Reduce 2, and MPI Reduce]
Another look at that first plot
Just how much better are rectangular algorithms on P = 4096 nodes?

- Binomial collectives on XE6: 1/30th of link bandwidth
- Rectangular collectives on BG/P: 4.3X the link bandwidth
- Over 120X improvement in efficiency!

How can we apply this?
[Figure: repeat of the 1 MB multicast bandwidth plot on BG/P, Cray XT5, and Cray XE6]
2.5D matrix multiplication · 2.5D LU factorization
2.5D Cannon-style matrix multiplication

[Figure: block layouts and shift patterns of A and B for 2D (P=16, c=1), 2.5D (P=32, c=2), and 3D (P=64, c=4) Cannon-style matrix multiplication]
Classification of parallel dense matrix algorithms
algs   c             memory (M)       words (W)       messages (S)
2D     1             O(n^2/P)         O(n^2/√P)       O(√P)
2.5D   [1, P^(1/3)]  O(c·n^2/P)       O(n^2/√(cP))    O(√(P/c^3))
3D     P^(1/3)       O(n^2/P^(2/3))   O(n^2/P^(2/3))  O(log(P))
NEW: 2.5D algorithms generalize 2D and 3D algorithms
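The table's asymptotic costs can be sanity-checked in a few lines (constants dropped; n, P, c arbitrary). Setting c = 1 recovers the 2D row, and c = P^(1/3) recovers the 3D row:

```python
from math import sqrt

def mem(n, P, c):
    return c * n**2 / P          # memory (M): O(c*n^2/P)

def words(n, P, c):
    return n**2 / sqrt(c * P)    # bandwidth (W): O(n^2/sqrt(cP))

def msgs(n, P, c):
    return sqrt(P / c**3)        # latency (S): O(sqrt(P/c^3))

n, P = 4096, 4096
assert words(n, P, 1) == n**2 / sqrt(P)   # c = 1 gives the 2D costs
assert msgs(n, P, 16) == 1.0              # c = P**(1/3) = 16: latency drops to O(1)
assert words(n, P, 16) == 65536.0         # equals n^2 / P**(2/3), the 3D row
```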
Minimize communication with

- minimal memory (2D)
- as much memory as available (2.5D) - flexible
- as much memory as the algorithm can exploit (3D)

Match the network topology of

- a √P-by-√P grid (2D)
- a √(P/c)-by-√(P/c)-by-c grid, most cuboids (2.5D) - flexible
- a P^(1/3)-by-P^(1/3)-by-P^(1/3) cube (3D)
2.5D SUMMA-style matrix multiplication
[Figure: matrix mapping to a 3D partition of BG/P]
2.5D MM strong scaling
[Figure: 2.5D MM strong scaling on BG/P (n=65,536), percentage of machine peak vs. #nodes (256 to 2048), for 2.5D SUMMA, 2.5D Cannon, 2D MM (Cannon), and ScaLAPACK PDGEMM]
2.5D MM on 65,536 cores
[Figure: 2.5D MM on 16,384 nodes of BG/P, percentage of machine peak vs. n (8192 to 131072), for 2D SUMMA, 2D Cannon, 2.5D Cannon, and 2.5D SUMMA]
Cost breakdown of MM on 65,536 cores
[Figure: cost breakdown of SUMMA (2D vs. 2.5D) on 16,384 nodes of BG/P, execution time normalized by 2D and split into computation, idle, and communication, for n = 8192, 32768, 131072]
A new latency lower bound for LU
Can we reduce latency to O(√(P/c^3)) for LU?

- For block size n/d, LU does
  - Ω(n^3/d^2) flops
  - Ω(n^2/d) words
  - Ω(d) messages
- Now pick d (= latency cost)
  - d = Ω(√P) to minimize flops
  - d = Ω(√(c·P)) to minimize words

No dice. But let's minimize bandwidth.
[Figure: LU critical path through diagonal blocks A₀₀, A₁₁, ..., A_{d-1,d-1} of an n-by-n matrix, advancing one block per step k₀, k₁, ...]
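The trade-off can be checked numerically (arbitrary n, P, c; the functions only restate the lower-bound terms above): any d small enough to keep both the flop and word terms at their 2.5D-MM levels forces a latency of Ω(√(cP)), well above the O(√(P/c^3)) achieved for multiplication:

```python
from math import sqrt

def lu_critical_path(n, d):
    """Lower bounds along the LU critical path with block size n/d."""
    return n**3 / d**2, n**2 / d, d   # flops, words, messages

P, c = 4096, 16
d_flops = sqrt(P)        # smallest d keeping flops at O(n^3 / P)
d_words = sqrt(c * P)    # smallest d keeping words at O(n^2 / sqrt(cP))
d = max(d_flops, d_words)             # d must satisfy both constraints
# The resulting message count exceeds the 2.5D-MM latency target:
assert d >= sqrt(c * P) > sqrt(P / c**3)
```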
2.5D LU factorization without pivoting
1. Factorize A₀₀ and communicate L₀₀ and U₀₀ among layers.
2. Perform TRSMs to compute a panel of L and a panel of U.
3. Broadcast blocks so all layers own the panels of L and U.
4. Broadcast different subpanels within each layer.
5. Multiply subpanels on each layer.
6. Reduce (sum) the next panels.*
7. Broadcast the panels and continue factorizing the Schur complement...

* All layers always need to contribute to the reduction, even if an iteration is done with a subset of layers.

[Figure: panels (A)-(D) showing L₀₀ and U₀₀ replicated across layers, the L panels (L₁₀, L₂₀, L₃₀), and the U panels (U₀₁, U₀₃)]
2.5D LU factorization with tournament pivoting
1. Factorize each block in the first column with pivoting.
2. Reduce to find the best pivot rows.
3. Pivot rows in the first big block column on each layer.
4. Apply TRSMs to compute the first column of L and the first block of a row of U.
5. Update the corresponding interior blocks S = A - L·U₀₁.
6. Recurse to compute the rest of the first big block column of L.
7. Pivot rows in the rest of the matrix on each layer.
8. Perform TRSMs to compute a panel of U.
9. Update the rest of the matrix as before and recurse on the next block panel...

[Figure: progression across layers showing the pivoted blocks PA₀ ... PA₃, the panels L₁₀ ... L₄₀ and U₀₁ ... U₀₃, the L₀₀U₀₀ factorizations, and the interior updates]
2.5D LU strong scaling
[Figure: 2.5D LU strong scaling on BG/P (n=65,536), percentage of machine peak vs. #nodes (256 to 2048), for 2.5D LU (no-pvt), 2.5D LU (CA-pvt), 2D LU (no-pvt), 2D LU (CA-pvt), and ScaLAPACK PDGETRF]
2.5D LU on 65,536 cores
[Figure: 2.5D LU vs. 2D LU on 16,384 nodes of BG/P (n=131,072), time (sec) split into compute, idle, and communication, for 2D no-pvt, 2.5D no-pvt (binomial), 2.5D no-pvt (rectangular), 2D CA-pvt, and 2.5D CA-pvt]
Algorithms for distributed tensor contractions · A tensor contraction library implementation
Bridging dense linear algebra techniques and applications
Target application: tensor contractions in electronic structure calculations (quantum chemistry)

- Often memory constrained
- Most target tensors are oddly shaped
- Need support for high-dimensional tensors
- Need handling of partial/full tensor symmetries
- Would like to use communication-avoiding ideas (blocking, 2.5D, topology-awareness)
Decoupling memory usage and topology-awareness
- 2.5D algorithms couple memory usage and virtual topology: c copies of a matrix imply c processor layers
- Instead, we can nest 2D and/or 2.5D algorithms
- Higher-dimensional algorithms allow smarter topology-aware mapping
Higher-dimensional distributed MM
4D SUMMA-Cannon

How do we map to a 3D partition without using more memory?

- SUMMA (bcast-based) on 2D layers
- Cannon (send-based) along the third dimension
- Cannon calls SUMMA as a sub-routine
- Minimizes inefficient (non-rectangular) communication
- Allows better overlap
- Treats MM as a 4D tensor contraction
[Figure: MM on 512 nodes of BG/P, percentage of flops peak vs. matrix dimension (4096 to 131072), for 2.5D MM, 4D MM, 2D MM (Cannon), and PBLAS MM]
Symmetry is a problem

- A fully symmetric tensor of dimension d requires only ~n^d/d! storage
- Symmetry significantly complicates sequential implementation
  - Irregular indexing makes alignment and unrolling difficult
  - Generalizing over all partial symmetries is expensive
- Blocked or block-cyclic virtual processor decompositions give irregular or imbalanced virtual grids
[Figure: blocked vs. block-cyclic decompositions of a symmetric matrix over processors P0, P1, P2, P3]
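The storage claim is easy to verify exactly: a fully symmetric order-d tensor with index range n has C(n+d-1, d) unique entries, whose leading term is n^d/d! (a small Python check):

```python
from math import comb, factorial

def packed_size(n, d):
    """Unique entries of a fully symmetric order-d tensor with index range n:
    the number of multisets of size d drawn from n symbols."""
    return comb(n + d - 1, d)

# Order 2 (a symmetric matrix): n(n+1)/2 entries, roughly n^2/2!
assert packed_size(10, 2) == 55
# For large n the count approaches n^d / d!
n, d = 100, 3
assert abs(packed_size(n, d) / (n**d / factorial(d)) - 1) < 0.05
```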
Solving the symmetry problem

- A cyclic decomposition allows balanced and regular blocking of symmetric tensors
- If the cyclic phase is the same in each symmetric dimension, each sub-tensor retains the symmetry of the whole tensor
[Figure: cyclic decomposition of a symmetric matrix]
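A small plain-Python demonstration of this property (sizes chosen for illustration): under a cyclic map with equal phase in both dimensions of a symmetric matrix, the diagonal local blocks are themselves symmetric, and off-diagonal blocks come in transposed pairs:

```python
n, p = 6, 2                       # matrix size, cyclic phase per dimension
A = [[i * j for j in range(n)] for i in range(n)]   # a symmetric matrix

def local_block(r, c):
    """Sub-matrix owned by virtual processor (r, c) under a cyclic map."""
    return [[A[i][j] for j in range(c, n, p)] for i in range(r, n, p)]

def transpose(M):
    return [list(col) for col in zip(*M)]

# Diagonal blocks retain the symmetry of the whole matrix...
assert local_block(0, 0) == transpose(local_block(0, 0))
# ...and off-diagonal blocks are pairwise transposes of each other.
assert local_block(0, 1) == transpose(local_block(1, 0))
```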
A generalized cyclic layout is still challenging
- In order to retain partial symmetry, all symmetric dimensions of a tensor must be mapped with the same cyclic phase
- The contracted dimensions of A and B must be mapped with the same phase
- And yet the virtual mapping needs to be mapped to a physical topology, which can be any shape
Virtual processor grid dimensions
- Our virtual cyclic topology is somewhat restrictive, and the physical topology is very restricted
- Virtual processor grid dimensions serve as a new level of indirection
- If a tensor dimension must have a certain cyclic phase, adjust the physical mapping by creating a virtual processor dimension
- Allows the physical processor grid to be 'stretchable'
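One way to sketch this stretching (the lcm choice here is a hypothetical simplification, not necessarily the library's rule): dimensions that must share a cyclic phase get a common virtual phase, so each physical dimension is stretched by an integer virtual factor:

```python
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

def virtualize(phys_dims):
    """For physical grid dimensions that must share one cyclic phase,
    return the common phase and the per-dimension virtual factors."""
    phase = 1
    for d in phys_dims:
        phase = lcm(phase, d)
    return phase, [phase // d for d in phys_dims]

# Two symmetric tensor dimensions mapped onto a 2x3 physical grid: both
# get cyclic phase 6; the grid is "stretched" by virtual factors 3 and 2.
assert virtualize([2, 3]) == (6, [3, 2])
```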
Constructing a virtual processor grid for MM
[Figure: matrix multiply C = A × B on a 2x3 processor grid; red lines represent the virtualized part of the processor grid, and elements are assigned to blocks by cyclic phase]
Unfolding the processor grid
- Higher-dimensional fully-symmetric tensors can be mapped onto a lower-dimensional processor grid via creation of new virtual dimensions
- Lower-dimensional tensors can be mapped onto a higher-dimensional processor grid by unfolding (serializing) pairs of processor dimensions
- However, when possible, replication is better than unfolding, since unfolded processor grids can lead to an unbalanced mapping
A basic parallel algorithm for symmetric tensor contractions
1. Arrange the processor grid in any k-ary n-cube shape
2. Map (via unfolding and virtualization) both A and B cyclically along the dimensions being contracted
3. Map (via unfolding and virtualization) the remaining dimensions of A and B cyclically
4. For each tensor dimension contracted over, recursively multiply the tensors along the mapping
   - Each contraction dimension is handled with a nested call to a local multiply or a parallel algorithm (e.g. Cannon)
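As a serial illustration of step 4 (each outer loop below stands in for a nested parallel algorithm over one mapped contraction dimension; this is a sketch, not the library's code), here is a matrix multiply whose contracted index is peeled one mapping level at a time:

```python
def contract(A, B, n, k_phase):
    """C = A*B with the contracted index k split across mapping levels:
    the outer loop models a parallel algorithm (e.g. Cannon) over k_phase
    processors; the innermost level is the local multiply over owned indices."""
    C = [[0.0] * n for _ in range(n)]
    for kp in range(k_phase):                 # nested "parallel" level
        for i in range(n):
            for j in range(n):                # local multiply over indices
                C[i][j] += sum(A[i][k] * B[k][j]
                               for k in range(kp, n, k_phase))
    return C

n = 4
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i * j + 1) for j in range(n)] for i in range(n)]
direct = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
assert contract(A, B, n, k_phase=2) == direct
```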
Tensor library structure
The library supports arbitrary-dimensional parallel tensor contractions with any symmetries on n-cuboid processor torus partitions:

1. Load tensor data as (global rank, value) pairs
2. Once a contraction is defined, map the participating tensors
3. Distribute or reshuffle tensor data/pairs
4. Construct the contraction algorithm with recursive function/argument pointers
5. Contract the sub-tensors with a user-defined sequential contract function
6. Output (global rank, value) pairs on request
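Step 3 can be sketched on (global rank, value) pairs with a simplified, hypothetical 1D cyclic layout: each pair is bucketed to the processor its rank maps to under the cyclic phase:

```python
def redistribute(pairs, P):
    """Bucket (global rank, value) pairs by owner under a 1D cyclic map."""
    buckets = [[] for _ in range(P)]
    for rank, value in pairs:
        buckets[rank % P].append((rank, value))
    return buckets

pairs = [(r, r * 0.5) for r in range(10)]
buckets = redistribute(pairs, P=4)
# Every pair landed on the processor its cyclic phase selects.
assert all(rank % 4 == owner
           for owner, b in enumerate(buckets) for rank, _ in b)
```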
Current tensor library status
- Dense and symmetric remapping/repadding/contractions implemented
- Currently functional only for dense tensors, but with full symmetric logic
- Can perform automatic mapping with physical and virtual dimensions, but cannot unfold processor dimensions yet
- Complete library interface implemented, including basic auxiliary functions (e.g. map/reduce, sum, etc.)
Next implementation steps
- Currently integrating the library with an SCF method code that uses dense contractions
- Get symmetric redistribution working correctly
- Automatic unfolding of processor dimensions
- Implement mapping by replication to enable 2.5D algorithms
- Much basic performance debugging/optimization left to do
- More optimization needed for sequential symmetric contractions
Very preliminary contraction library results
Contracts tensors of size 64x64x256x256 in 1 second on 2K nodes
[Figure: strong scaling of a dense 64x64x256x256 contraction on BG/P, percentage of machine peak vs. #nodes (256 to 2048), for no rephase, rephase every contraction, and repad every contraction]
Potential benefit of unfolding

Unfolding the smallest two BG/P torus dimensions improves performance.
[Figure: strong scaling of a dense 64x64x256x256 contraction on BG/P, percentage of machine peak vs. #nodes (256 to 2048), for no-rephase 2D vs. no-rephase 3D]
Contributions

- Models for rectangular collectives
- 2.5D algorithms: theory and implementation
- Using a cyclic mapping to parallelize symmetric tensor contractions
- Extending and tuning the processor grid with virtual dimensions
- Automatic mapping of high-dimensional tensors to topology-aware physical partitions
- A parallel tensor contraction algorithm/library without a global address space
Conclusions and references
- The parallel tensor contraction algorithm and library appear to be the first communication-efficient practical approach
- Preliminary results and theory indicate high potential for this tensor contraction library
- Papers:
  - (2.5D) to appear in Euro-Par 2011, Distinguished Paper
  - (2.5D + rectangular collective models) to appear in Supercomputing 2011
Backup slides
A new LU latency lower bound
[Figure: LU critical path through diagonal blocks A₀₀, A₁₁, ..., A_{d-1,d-1} of an n-by-n matrix, advancing one block per step k₀, k₁, ...]
The flops lower bound requires d = Ω(√P) blocks/messages.
The bandwidth lower bound requires d = Ω(√(cP)) blocks/messages.
Virtual topology of 2.5D algorithms
[Figure: the (√(P/c))×(√(P/c))×c virtual processor grid with axes i, j, k]

2D algorithm mapping: a (√P)×(√P) grid
2.5D algorithm mapping: a (√(P/c))×(√(P/c))×c grid, for any c