
How to Compute and Prove

Lower and Upper Bounds on the

Communication Costs of Your Algorithm

Part III: Graph analysis

Oded Schwartz

CS294 (Communication-Avoiding Algorithms), Lecture #10, Fall 2011

www.cs.berkeley.edu/~odedsc/CS294

Based on:

G. Ballard, J. Demmel, O. Holtz, and O. Schwartz:

Graph expansion and communication costs of fast matrix multiplication.

2

Previous talks on lower bounds. Communication Lower Bounds:

Approaches:

1. Reduction [Ballard, Demmel, Holtz, S. 2009]

2. Geometric Embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]

3. Graph Analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]

Proving that your algorithm/implementation is as good as it gets.

3

Previous talk on lower bounds: algorithms with the “flavor” of 3 nested loops [Ballard, Demmel, Holtz, S. 2009], [Ballard, Demmel, Holtz, S. 2011a], following [Irony, Toledo, Tiskin 04]

• BLAS, LU, Cholesky, LDL^T, and QR factorizations, eigenvalues and singular values, i.e., essentially all direct methods of linear algebra.

• Dense or sparse matrices. In sparse cases: bandwidth is a function of NNZ.

• Bandwidth and latency.

• Sequential, hierarchical, and parallel (distributed- and shared-memory) models.

• Compositions of linear algebra operations.

• Certain graph optimization problems [Demmel, Pearson, Poloni, Van Loan, 11].

• Tensor contractions.

Recall the classical (cubic) bounds: sequential bandwidth lower bound Ω(n^3 / M^(1/2)); parallel bandwidth lower bound Ω(n^3 / (P · M^(1/2))).

4

Geometric Embedding (2nd approach) [Ballard, Demmel, Holtz, S. 2011a], following [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]

(1) Generalized form: for all (i,j) ∈ S,

C(i,j) = f_ij( g_(i,j,k1)(A(i,k1), B(k1,j)), g_(i,j,k2)(A(i,k2), B(k2,j)), …, other arguments ), for k1, k2, … ∈ S_ij.

But many algorithms just don’t fit the generalized form!

For example: Strassen’s fast matrix multiplication

5

Beyond 3-nested loops

How about the communication costs of algorithms that have a more complex structure?

6

Communication Lower Bounds

Approaches:

1. Reduction [Ballard, Demmel, Holtz, S. 2009]

2. Geometric Embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]

3. Graph Analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]

Proving that your algorithm/implementation is as good as it gets.

7

[Strassen 69]

• Compute 2 × 2 matrix multiplication using only 7 multiplications (instead of 8).

• Apply recursively (block-wise).

M1 = (A11 + A22) (B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 - B22)
M4 = A22 (B21 - B11)
M5 = (A11 + A12) B22
M6 = (A21 - A11) (B11 + B12)
M7 = (A12 - A22) (B21 + B22)

C11 = M1 + M4 - M5 + M7

C12 = M3 + M5

C21 = M2 + M4

C22 = M1 - M2 + M3 + M6

Recall: Strassen’s Fast Matrix Multiplication

[Figure: C = A · B partitioned into n/2 × n/2 blocks C11, C12, C21, C22, and likewise A and B.]

T(n) = 7 T(n/2) + O(n^2)

T(n) = Θ(n^(log2 7)) ≈ Θ(n^2.81)
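The recursion above can be sketched in a few lines of Python. This is a minimal illustration (assuming NumPy and n a power of two), not the communication-optimal implementation discussed in these slides:

```python
import numpy as np

def strassen(A, B):
    """Strassen's recursion for n x n matrices, n a power of two.
    Base case n = 1 is a scalar product."""
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The seven products M1..M7 from the slide.
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)
    # Reassemble the four blocks of C.
    C = np.empty((n, n), dtype=np.result_type(A, B))
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

Each level does 7 recursive calls plus O(n^2) block additions, matching the recurrence T(n) = 7 T(n/2) + O(n^2).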

8

Strassen-like algorithms

• Compute n0 × n0 matrix multiplication using only n0^ω0 multiplications (instead of n0^3).

• Apply recursively (block-wise).

ω0:
2.81 [Strassen 69] (works fast in practice)
2.79 [Pan 78]
2.78 [Bini 79]
2.55 [Schönhage 81]
2.50 [Pan, Romani, Coppersmith, Winograd 84]
2.48 [Strassen 87]
2.38 [Coppersmith, Winograd 90]
2.38 [Cohn, Kleinberg, Szegedy, Umans 05] (group-theoretic approach)

T(n) = n0^ω0 T(n/n0) + O(n^2)

T(n) = Θ(n^ω0)
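The exponent ω0 of a Strassen-like algorithm follows directly from the recursion parameters. A small sketch (the helper name `omega0` is illustrative, not from the slides):

```python
import math

def omega0(n0, mults):
    """Exponent of a Strassen-like algorithm that multiplies n0 x n0
    block matrices with `mults` block multiplications:
    T(n) = mults * T(n/n0) + O(n^2)  =>  T(n) = Theta(n^omega0)
    with omega0 = log base n0 of mults."""
    return math.log(mults, n0)

# Strassen: 2x2 blocks, 7 multiplications -> omega0 = log2(7) ~ 2.81.
# Classical: 2x2 blocks, 8 multiplications -> omega0 = 3.
```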

9

New lower bound for Strassen’s fast matrix multiplication

[Ballard, Demmel, Holtz, S. 2011b]: The communication bandwidth lower bound is

Sequential:

For Strassen’s: BW = Ω( (n / M^(1/2))^(log2 7) · M )

Strassen-like: BW = Ω( (n / M^(1/2))^ω0 · M )

Recall, for cubic: BW = Ω( (n / M^(1/2))^(log2 8) · M )

Parallel:

For Strassen’s: BW = Ω( (n / M^(1/2))^(log2 7) · M / P )

Strassen-like: BW = Ω( (n / M^(1/2))^ω0 · M / P )

For cubic: BW = Ω( (n / M^(1/2))^(log2 8) · M / P )

The parallel lower bounds apply to
2D: M = Θ(n^2/P)
2.5D: M = Θ(c·n^2/P)
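The bounds above can be compared numerically. A back-of-the-envelope sketch (the helper `bw_lower_bound` and the example sizes are illustrative assumptions, and the hidden Ω-constant is taken to be 1):

```python
import math

def bw_lower_bound(n, M, exponent, P=1):
    """Evaluate (n / sqrt(M))^exponent * M / P, the shape of the
    bandwidth lower bounds on this slide, ignoring the Omega constant."""
    return (n / math.sqrt(M)) ** exponent * M / P

# Illustrative sizes: n = 2^12 matrix dimension, M = 2^16 words of fast memory.
n, M = 2**12, 2**16
classical = bw_lower_bound(n, M, math.log2(8))  # cubic algorithms
strassen = bw_lower_bound(n, M, math.log2(7))   # Strassen
```

Since log2 7 < log2 8, Strassen’s bound is strictly smaller than the classical one for the same n and M, which is why a smaller communication cost is attainable.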

10

For sequential? Hierarchy? Yes, existing implementations do!

For parallel 2D? Parallel 2.5D? Yes: new algorithms.

11

Sequential and new 2D and 2.5D parallel Strassen-like algorithms

Sequential and Hierarchy cases: Attained by the natural recursive implementation.

Also: LU, QR,… (Black-box use of fast matrix multiplication)

[Ballard, Demmel, Holtz, S., Rom 2011]: New 2D parallel Strassen-like algorithm.

Attains the lower bound.

New 2.5D parallel Strassen-like algorithm: c^(ω0/2 - 1) parallel communication speedup over the 2D implementation (where c · 3n^2 = M · P).

[Ballard, Demmel, Holtz, S. 2011b]:This is as good as it gets.

Implications for sequential architectural scaling

• Requirements so that “most” time is spent doing arithmetic on n × n dense matrices, n^2 > M:

• Time to add two rows of the largest locally storable square matrix exceeds the reciprocal bandwidth.

• Time to multiply the 2 largest locally storable square matrices exceeds the latency.

Strassen-like algorithms do fewer flops and less communication but are more demanding on the hardware. If ω0 = 2, it is all about communication.

CA matrix multiplication algorithm | Scaling bandwidth requirement | Scaling latency requirement

Classic | M^(1/2) | M^(3/2)

Strassen-like | M^(ω0/2 - 1) | M^(ω0/2)

13

Expansion (3rd approach) [Ballard, Demmel, Holtz, S. 2011b], in the spirit of [Hong & Kung 81]

Let G = (V, E) be a d-regular graph.

A is the normalized adjacency matrix, with eigenvalues 1 = λ1 ≥ λ2 ≥ … ≥ λn, and spectral gap λ = 1 - max{λ2, |λn|}.

Edge expansion: h(G) = min over S ⊆ V with |S| ≤ |V|/2 of |E(S, V \ S)| / (d · |S|).

Thm [Alon-Milman 84, Dodziuk 84, Alon 86]: λ/2 ≤ h(G) ≤ (2λ)^(1/2).
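The theorem can be checked numerically on a small example. The sketch below (assuming NumPy; the 9-cycle is chosen as a small non-bipartite 2-regular graph, so the spectral gap is nonzero) computes λ from the normalized adjacency matrix and brute-forces h(G) over all subsets with |S| ≤ |V|/2:

```python
import itertools
import numpy as np

def cheeger_check(n):
    """For the n-cycle (2-regular), return (spectral gap, edge expansion)."""
    # Normalized adjacency matrix of the n-cycle: each row sums to 1.
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[i, (i - 1) % n] = 0.5
    eig = np.sort(np.linalg.eigvalsh(A))[::-1]   # descending eigenvalues
    lam = 1 - max(eig[1], abs(eig[-1]))          # 1 - max{lambda_2, |lambda_n|}
    d = 2
    # Brute-force h(G): edges leaving S, over d*|S|, minimized over subsets.
    h = min(
        sum(1 for i in S for j in range(n) if A[i, j] > 0 and j not in S)
        / (d * len(S))
        for k in range(1, n // 2 + 1)
        for S in map(set, itertools.combinations(range(n), k))
    )
    return lam, h
```

For the 9-cycle, the best cut is a contiguous arc of 4 vertices (2 boundary edges over d·|S| = 8 edge endpoints, so h = 1/4), which sits between λ/2 and (2λ)^(1/2) as the theorem requires.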

14

The Computation Directed Acyclic Graph

Expansion (3rd approach): communication cost is graph expansion.

[Figure: a CDAG with input/output vertices, intermediate values, and dependency edges; a segment S of vertices and its boundary in V \ S.]

15

For a given run (Algorithm, Machine, Input):

1. Consider the computation DAG G = (V, E): V = the set of computations and inputs, E = the dependencies.

2. Partition G into segments S of Θ(M^(ω0/2)) vertices (corresponding to time / location adjacency).

3. Show that every S has ≥ 3M vertices with incoming / outgoing edges, hence performs ≥ M read/writes.

4. The total communication bandwidth is
BW = (BW of one segment) × (#segments) = Ω(M) · Θ(n^ω0) / Θ(M^(ω0/2)) = Ω(n^ω0 / M^(ω0/2 - 1)).
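Step 4 is simple arithmetic; a sketch (the helper `segment_bound` is illustrative, with all hidden constants taken to be 1):

```python
import math

def segment_bound(n, M, omega):
    """Total bandwidth from the segment argument: Omega(M) reads/writes
    per segment times Theta(n^omega / M^(omega/2)) segments."""
    flops = n ** omega            # total computation vertices, ~ Theta(n^omega)
    seg_size = M ** (omega / 2)   # vertices per segment
    segments = flops / seg_size
    return M * segments           # = n^omega / M^(omega/2 - 1)
```

With omega = 3 this reproduces the classical bound n^3 / M^(1/2).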

[Figure: the execution timeline partitioned into segments S1, S2, S3, …; each segment is a run of reads, FLOPs, and writes.]

16

Is it a Good Expander?

Break G into edge-disjoint subgraphs, each corresponding to the algorithm on M^(1/2) × M^(1/2) matrices. Consider the expansion of S in each part (they sum up).

We need to show that Θ(M^(ω0/2)) expands to Ω(M):

h(G(n)) = Ω(M / M^(ω0/2)) for n = Θ(M^(1/2)).

Namely, for every n, h(G(n)) = Ω(n^2 / n^ω0); for Strassen (ω0 = lg 7) this is Ω((4/7)^(lg n)).

BW = Ω(T(n)) · h(G(M^(1/2)))

17

What is the CDAG of Strassen’s algorithm?

18

M1 = (A11 + A22) (B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 - B22)
M4 = A22 (B21 - B11)
M5 = (A11 + A12) B22
M6 = (A21 - A11) (B11 + B12)
M7 = (A12 - A22) (B21 + B22)

C11 = M1 + M4 - M5 + M7

C12 = M3 + M5

C21 = M2 + M4

C22 = M1 - M2 + M3 + M6

The DAG of Strassen, n = 2

[Figure: the n = 2 CDAG: Enc1A on inputs A(1,1), A(1,2), A(2,1), A(2,2) and Enc1B on B’s inputs feed the seven products M1, …, M7; Dec1C combines them into C(1,1), C(1,2), C(2,1), C(2,2).]

19

The DAG of Strassen, n = 4

One recursive level:

• Each vertex splits into four.

• Multiply blocks.

[Figure: Enc1A and Enc1B layers feed seven block products, each itself an n = 2 CDAG with its own Enc1A, Enc1B, and Dec1C.]

20

The DAG of Strassen: further recursive steps

Recursive construction. Given DeciC, construct Deci+1C:

1. Duplicate it 4 times.

2. Connect with a cross-layer of Dec1C.

[Figure: after lg n levels, Enc_(lg n)A and Enc_(lg n)B have n^2 inputs each, feeding n^(lg 7) multiplications, decoded by Dec_(lg n)C.]

21

The DAG of Strassen

1. Compute weighted sums of A’s elements.

2. Compute weighted sums of B’s elements.

3. Compute the multiplications m1, m2, …, m_(n^(lg 7)).

4. Compute weighted sums of m1, m2, …, m_(n^(lg 7)) to obtain C.

[Figure: Enc_(lg n)A and Enc_(lg n)B (n^2 inputs each) feed the n^(lg 7) products; Dec_(lg n)C combines them into C.]
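The layer sizes in this CDAG follow from the recursion. A sketch (the helper `strassen_cdag_sizes` is an illustrative name, assuming n is a power of two):

```python
def strassen_cdag_sizes(n):
    """Layer sizes in Strassen's CDAG for an n x n multiply, n a power
    of two: each operand encoder reads 4^(lg n) = n^2 inputs, and the
    middle layer has 7^(lg n) = n^(lg 7) multiplication vertices."""
    lg = n.bit_length() - 1        # lg n for a power of two
    inputs_per_operand = 4 ** lg   # = n^2
    multiplications = 7 ** lg      # = n^(lg 7)
    return inputs_per_operand, multiplications
```

For n = 2 this gives the base gadget of the slides: 4 inputs per operand and 7 products.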

22

Expansion of a Segment

Two methods to compute the expansion of the recursively constructed graph:

• Combinatorial: estimate the edge / vertex expansion directly (in the spirit of [Alon, S., Shapira 08]),

or

• Spectral: compute the edge expansion via the spectral gap (in the spirit of the Zig-Zag analysis [Reingold, Vadhan, Wigderson 00]).

23

Expansion of a Segment

Main technical challenges:

• Two types of vertices: with/without recursion.

• The graph is not regular.

[Figure: the n = 2 CDAG again: Enc1A, Enc1B, the seven products, and Dec1C.]

24

Estimating the edge expansion, combinatorially

• Dec1C is a consistency gadget: a mixed vertex pays 1/12 of its edges.

• The fraction of S vertices is consistent between the 1st level and the four 2nd levels (deviations pay linearly).

[Figure: segments S1, …, Sk across the recursion levels; vertices are classified as in S, not in S, or mixed.]

25

Communication Lower Bounds

Approaches:

1. Reduction [Ballard, Demmel, Holtz, S. 2009]

2. Geometric Embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]

3. Graph Analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]

Proving that your algorithm/implementation is as good as it gets.

26

Open Problems

Find algorithms that attain the lower bounds:

• Sparse matrix algorithms

• for sequential and parallel models

• that auto-tune or are cache oblivious

Address complex heterogeneous hardware:

• Lower bounds and algorithms [Demmel, Volkov 08], [Ballard, Demmel, Gearhart 11]

Extend the techniques to other algorithms and algorithmic tools:

• Non-uniform recursive structure

Characterize a communication lower bound for a problem rather than for an algorithm.


How to Compute and Prove

Lower Bounds on the

Communication Costs of Your Algorithm

Part III: Graph analysis

Oded Schwartz

CS294 (Communication-Avoiding Algorithms), Lecture #10, Fall 2011

Based on:

G. Ballard, J. Demmel, O. Holtz, and O. Schwartz:

Graph expansion and communication costs of fast matrix multiplication.

Thank you!
