Page 1: Design of parallel algorithms

Design of parallel algorithms

Matrix operations

J. Porras

Page 2: Design of parallel algorithms

Matrix x vector

• Sequential approach MAT_VECT(A,x,y)

for (i = 0; i < n; i++) {
    y[i] = 0;
    for (j = 0; j < n; j++) {
        y[i] = y[i] + A[i,j] * x[j];
    }
}

• Work = n²
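For reference, a complete, runnable C version of the routine above (a sketch; the flat row-major storage and the small test in main are my additions, not from the slides):

#include <stdio.h>
#include <stdlib.h>

/* Sequential matrix-vector product: y = A * x, with A stored row-major. */
void mat_vect(const double *A, const double *x, double *y, int n)
{
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];   /* n multiplications per row */
    }
}

int main(void)
{
    int n = 3;
    double A[] = { 1, 0, 0,
                   0, 2, 0,
                   0, 0, 3 };
    double x[] = { 1, 1, 1 };
    double y[3];

    mat_vect(A, x, y, n);
    for (int i = 0; i < n; i++)
        printf("y[%d] = %g\n", i, y[i]);   /* expect 1 2 3 */
    return 0;
}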

Page 3: Design of parallel algorithms

Parallelization of matrix operations: Matrix x vector

• Three ways to implement
  – rowwise striping
  – columnwise striping
  – checkerboarding

• DRAW each of these approaches !

Page 4: Design of parallel algorithms

Rowwise striping

• The n x n matrix is distributed onto n processors (one row each)

• The n x 1 vector is distributed onto n processors (one element each)

• All processors need the whole vector, so an all-to-all broadcast is required

Page 5: Design of parallel algorithms
Page 6: Design of parallel algorithms

Rowwise striping

• The all-to-all broadcast requires Θ(n) time

• One row takes Θ(n) time for the multiplications

• Rows are calculated in parallel, thus the total time is Θ(n) and the work Θ(n²)
  – Algorithm is cost-optimal

Page 7: Design of parallel algorithms

Block striping

• Assume that p < n and the matrix is partitioned using block striping

• Each processor contains n/p rows of the matrix and n/p elements of the vector

• All processors require the whole vector, thus an all-to-all broadcast is required (message size n/p)
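The all-to-all broadcast of the vector pieces maps naturally onto MPI_Allgather. A minimal sketch, assuming p divides n and row-major storage of the local rows; the function and variable names are illustrative, not from the slides:

#include <mpi.h>
#include <stdlib.h>

/* Block-striped matrix-vector multiply: each of the p processes holds
 * n/p consecutive rows of A and the matching n/p elements of x. */
void striped_matvec(double *local_A, double *local_x, double *local_y,
                    int n, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int rows = n / p;                     /* rows (and x elements) per process */

    /* All-to-all broadcast of the vector pieces (message size n/p):
     * afterwards every process holds the full vector x. */
    double *x = malloc(n * sizeof(double));
    MPI_Allgather(local_x, rows, MPI_DOUBLE, x, rows, MPI_DOUBLE, comm);

    /* Each process multiplies its n/p rows with the full vector: n^2/p work. */
    for (int i = 0; i < rows; i++) {
        local_y[i] = 0.0;
        for (int j = 0; j < n; j++)
            local_y[i] += local_A[i * n + j] * x[j];
    }
    free(x);
}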

Page 8: Design of parallel algorithms

Block striping in hypercube

• All-to-all broadcast in a hypercube with n/p-sized messages takes

ts log p + tw(n/p)(p-1)

• If p is considered large enough, (n/p)(p-1) ≈ n and this becomes ts log p + tw n

• The multiplication requires n²/p time (n/p rows to multiply with the vector)

Page 9: Design of parallel algorithms

Block striping in hypercube

• Parallel execution time TP = n²/p + ts log p + tw n

• Cost pTP = n² + ts p log p + tw n p

• Algorithm is cost-optimal if

p = O(n)
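The cost-optimality condition follows by requiring the overhead to grow no faster than the work W = n²; my expansion of the step the slide skips, in LaTeX:

pT_P = n^2 + t_s\,p\log p + t_w\,n\,p,
\qquad
T_0 = pT_P - W = t_s\,p\log p + t_w\,n\,p .

Cost-optimality needs T_0 = O(n^2); the dominant requirement t_w n p = O(n^2) gives p = O(n), which also keeps t_s p log p = O(n^2).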

Page 10: Design of parallel algorithms

Block striping in mesh

• All-to-all broadcast in a mesh with wraparound takes 2ts(√p - 1) + tw(n/p)(p - 1)

• Parallel execution requires TP = n²/p + 2ts(√p - 1) + tw n

Page 11: Design of parallel algorithms

Scalability of block striping

• Overhead (T0 = pTP – W)

T0 = ts p log p + tw n p

• Isoefficiency (W = KT0) for the hypercube

W = K ts p log p

W = K tw n p

• Since W = n², n = K tw p and thus W = K² tw² p²

Page 12: Design of parallel algorithms

Scalability of block striping

• Because p = O(n): n = Ω(p), n² = Ω(p²), W = Ω(p²)

• This equation gives the highest asymptotic rate at which the problem size must increase with the number of processors to maintain a fixed efficiency

Page 13: Design of parallel algorithms

Scalability of block striping

• Isoefficiency on the hypercube is Θ(p²)

• A similar analysis for the mesh architecture gives the same value, Θ(p²)

• Thus, with striped partitioning, scalability is no better on a hypercube than on a mesh

Page 14: Design of parallel algorithms

Checkerboard

• The n x n matrix is partitioned onto n² processors (one element per processor)

• The n x 1 vector is located on the last column (or on the diagonal)

• The vector is distributed to the corresponding processors

• Multiplications are calculated in parallel and the results are collected with a single-node accumulation into the last processor

Page 15: Design of parallel algorithms
Page 16: Design of parallel algorithms
Page 17: Design of parallel algorithms

Checkerboard

• Three communication steps are required
  – One-to-one communication to send the vector onto the diagonal
  – One-to-all broadcast to distribute the elements of the vector
  – Single-node accumulation to sum the partial results

Page 18: Design of parallel algorithms

Checkerboard

• The mesh requires Θ(n) time for all these operations (SF routing) and the hypercube Θ(log n)

• Multiplication happens in constant time

• Parallel execution time is Θ(n) on the mesh and Θ(log n) on the hypercube architecture

• Cost is Θ(n³) for the mesh and Θ(n² log n) for the hypercube

• Algorithms are not cost-optimal

Page 19: Design of parallel algorithms

Checkerboard p < n²

• Cost-optimality can be achieved if the granularity is increased ?

• Consider a two-dimensional mesh of p processors in which each processor stores an (n/√p) x (n/√p) block of the matrix

• Similarly for the vector (n/√p elements per processor)

Page 20: Design of parallel algorithms

Checkerboard p < n²

• Vector elements are sent to the diagonal

• Vector elements are distributed to the other processors

• Each processor performs n²/p multiplications and calculates n/√p additions

• Partial sums are collected with a single-node accumulation (see the MPI sketch below)
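A minimal MPI sketch of these four steps, assuming p is a perfect square, √p divides n, blocks are stored row-major, and the vector initially lives in the last column; all function and variable names are illustrative, not from the slides:

#include <mpi.h>
#include <stdlib.h>
#include <math.h>

void checkerboard_matvec(double *local_A, double *local_x, double *local_y,
                         int n, MPI_Comm comm)
{
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);

    int q  = (int)(sqrt((double)p) + 0.5);   /* sqrt(p) x sqrt(p) grid */
    int nb = n / q;                          /* block size n/sqrt(p)   */

    /* Build the 2D process grid and row/column subcommunicators. */
    int dims[2] = { q, q }, periods[2] = { 0, 0 }, coords[2];
    MPI_Comm grid, row_comm, col_comm;
    MPI_Cart_create(comm, 2, dims, periods, 0, &grid);
    MPI_Cart_coords(grid, rank, 2, coords);
    int row = coords[0], col = coords[1];
    MPI_Comm_split(grid, row, col, &row_comm);   /* processes in my row    */
    MPI_Comm_split(grid, col, row, &col_comm);   /* processes in my column */

    /* Step 1: one-to-one communication, the last column sends its vector
     * block to the diagonal process of the same row. */
    if (col == q - 1 && row != q - 1)
        MPI_Send(local_x, nb, MPI_DOUBLE, row /* diagonal's column */, 0, row_comm);
    if (row == col && col != q - 1)
        MPI_Recv(local_x, nb, MPI_DOUBLE, q - 1, 0, row_comm, MPI_STATUS_IGNORE);

    /* Step 2: one-to-all broadcast of the block down each column.
     * Note: local_x is overwritten with the block this column needs. */
    MPI_Bcast(local_x, nb, MPI_DOUBLE, col /* diagonal's row index */, col_comm);

    /* Step 3: local work, an (n/sqrt(p)) x (n/sqrt(p)) block times x block. */
    double *partial = malloc(nb * sizeof(double));
    for (int i = 0; i < nb; i++) {
        partial[i] = 0.0;
        for (int j = 0; j < nb; j++)
            partial[i] += local_A[i * nb + j] * local_x[j];
    }

    /* Step 4: single-node accumulation, sum the partial results along each
     * row; here the result lands in the last column, where x started. */
    MPI_Reduce(partial, local_y, nb, MPI_DOUBLE, MPI_SUM, q - 1, row_comm);

    free(partial);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Comm_free(&grid);
}

Accumulating into a different target (for example back onto the diagonal) only changes the root argument of MPI_Reduce.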

Page 21: Design of parallel algorithms

Scalability of checkerboard p < n²

• Assume that the processors are connected as a two-dimensional √p x √p mesh with cut-through routing (no wraparound)

• The send to the diagonal takes

ts + tw n/√p + th √p

• One-to-all broadcast in the columns takes
(ts + tw n/√p) log √p + th √p

Page 22: Design of parallel algorithms

Scalability of checkerboard p < n²

• The single-node accumulation takes
(ts + tw n/√p) log √p + th √p

• The multiplications in each processor take n²/p time

• Thus

TP = n²/p + ts log p + (tw n/√p) log p + 3 th √p

• T0 = pTP - W gives for the overhead:

T0 = ts p log p + tw n √p log p + 3 th p^(3/2)
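The overhead expression follows by expanding pT_P and subtracting W = n²; my intermediate step, in LaTeX:

pT_P = n^2 + t_s\,p\log p + t_w\,n\sqrt{p}\,\log p + 3\,t_h\,p^{3/2}
\;\Longrightarrow\;
T_0 = t_s\,p\log p + t_w\,n\sqrt{p}\,\log p + 3\,t_h\,p^{3/2}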

Page 23: Design of parallel algorithms

Scalability of checkerboard p < n²

• Isoefficiency for ts:

W = K ts p log p

• Isoefficiency for tw:

W = n² = K tw n √p log p

n = K tw √p log p

n² = K² tw² p log² p

W = K² tw² p log² p

• Isoefficiency for th:

W = 3 K th p^(3/2)

Page 24: Design of parallel algorithms

Scalability of checkerboard p < n²

• If p = O(n²): n² = Ω(p), so W = Ω(p)

• tw and th dominate ts

Page 25: Design of parallel algorithms

Scalability of checkerboard p < n²

• Concentrate on the th term, Θ(p^(3/2)), and the tw term, Θ(p log² p)

• Because p^(3/2) > p log² p only for p > 65536, either term can dominate

• Assume that the Θ(p log² p) term dominates
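A quick numeric check of the claimed crossover point (a standalone C sketch, not part of the original slides):

#include <stdio.h>
#include <math.h>

/* Compare p^(3/2) with p * (log2 p)^2 around p = 65536 = 2^16. */
int main(void)
{
    for (int e = 14; e <= 18; e++) {
        double p  = ldexp(1.0, e);   /* p = 2^e */
        double lg = (double)e;       /* log2 p  */
        printf("p = 2^%-2d  p^(3/2) = %.4g   p*log2(p)^2 = %.4g\n",
               e, pow(p, 1.5), p * lg * lg);
    }
    return 0;                        /* the two terms meet exactly at p = 2^16 */
}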

Page 26: Design of parallel algorithms

Scalability of checkerboard p < n²

• The maximum number of processors that can be used cost-optimally for the problem size W is determined by

p log² p = O(n²)

log p + 2 log log p = O(log n)

log p = O(log n)

Page 27: Design of parallel algorithms

Scalability of checkerboard p < n²

• Substituting log n for log p:

p log² n = O(n²)

p = O(n² / log² n)

• This gives the upper limit on the number of processors that can be used cost-optimally

Page 28: Design of parallel algorithms

SF and CT

• Parallel execution takes n²/p + 2 ts √p + 3 tw n time on a p-processor mesh with SF routing (isoefficiency Θ(p²) due to tw)

• CT routing performs much better

• Note that this is true for cases with several elements per processor

• HOW about fine-grain case ?

Page 29: Design of parallel algorithms

Striped and checkerboard

• Comparison shows that checkerboarding is faster than the striped approach with the same number of processors

• If p > n, the striped approach cannot be used

• How about the effect of architecture ?

• Scalability ?

• Isoefficiency ?

Page 30: Design of parallel algorithms

Sequential matrix multiplication

• Procedure MAT_MULT(A, B, C)
  for i := 0 to n-1 do
      for j := 0 to n-1 do
          C[i,j] := 0;
          for k := 0 to n-1 do
              C[i,j] := C[i,j] + A[i,k] * B[k,j]

• n³ work (Strassen's algorithm has better complexity)

Page 31: Design of parallel algorithms

Block approach

• n/q * n/q submatrices

• Procedure BLOCK_MAT_MULT(A, B, C)
  for i := 0 to q-1 do
      for j := 0 to q-1 do
          Initialize C[i,j] to zero
          for k := 0 to q-1 do
              C[i,j] := C[i,j] + A[i,k] * B[k,j]   (submatrix operations)

• Same complexity, n³
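A plain C version of the block procedure (a sketch, assuming q divides n and flat row-major storage; the names are my own):

/* Block matrix multiplication: C = A * B with q x q blocks of size n/q. */
void block_mat_mult(const double *A, const double *B, double *C, int n, int q)
{
    int b = n / q;                               /* block size */
    for (int i = 0; i < n * n; i++)
        C[i] = 0.0;                              /* initialize C to zero */

    for (int bi = 0; bi < q; bi++)               /* block row of C    */
        for (int bj = 0; bj < q; bj++)           /* block column of C */
            for (int bk = 0; bk < q; bk++)       /* C[bi,bj] += A[bi,bk] * B[bk,bj] */
                for (int i = 0; i < b; i++)
                    for (int j = 0; j < b; j++)
                        for (int k = 0; k < b; k++)
                            C[(bi*b + i)*n + (bj*b + j)] +=
                                A[(bi*b + i)*n + (bk*b + k)] *
                                B[(bk*b + k)*n + (bj*b + j)];
}

The total operation count is the same n³ multiply-adds as in MAT_MULT; only the loop order, and hence the memory access pattern, changes.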

Page 32: Design of parallel algorithms

Simple parallel approach

• Matrices A and B are partitioned into p blocks of size (n/√p) x (n/√p)

• Mapped onto a √p x √p mesh

• Processors P0,0 ... P√p-1,√p-1

• Pi,j stores Ai,j and Bi,j and computes Ci,j

• Ci,j requires Ai,k and Bk,j

• A needs to be communicated within rows
• B is communicated within columns

Page 33: Design of parallel algorithms

Performance on hypercube

• Requires two all-to-all broadcasts (within rows and within columns)

• Message size n²/p

• tc = 2(ts log √p + tw(n²/p)(√p - 1))

• tm = √p (n/√p)³ = n³/p

• TP = n³/p + ts log p + 2 tw n²/√p ,   p » 1

Page 34: Design of parallel algorithms

Performance on mesh

• Store-and-forward routing

• tc = 2(ts √p + tw n²/√p)

• tm = √p (n/√p)³ = n³/p

• TP = n³/p + 2 ts √p + 2 tw n²/√p

Page 35: Design of parallel algorithms

Cannon's algorithm

• Partition to blocks as usual

• Processors P0,0 ... P√p-1,√p-1

• Pi,j contains Ai,j and Bi,j

• Rotate the blocks !!

• A blocks to the left

• B blocks upwards
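A sketch of Cannon's algorithm in MPI, assuming p is a perfect square, √p divides n, each process already holds its blocks in row-major order, and local_C is zero on entry; all names are illustrative, not from the slides:

#include <mpi.h>

void cannon(double *local_A, double *local_B, double *local_C,
            int nb, MPI_Comm comm)               /* nb = n / sqrt(p) */
{
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);
    int q = 0;
    while (q * q < p) q++;                       /* q = sqrt(p) */

    /* 2D torus so that the shifts wrap around. */
    int dims[2] = { q, q }, periods[2] = { 1, 1 }, coords[2];
    MPI_Comm grid;
    MPI_Cart_create(comm, 2, dims, periods, 0, &grid);
    MPI_Cart_coords(grid, rank, 2, coords);

    int left, right, up, down;
    MPI_Cart_shift(grid, 1, -1, &right, &left);  /* A moves to the left */
    MPI_Cart_shift(grid, 0, -1, &down, &up);     /* B moves upwards     */

    /* Initial alignment: shift A left by the row index,
     * B up by the column index. */
    int src, dst;
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(local_A, nb*nb, MPI_DOUBLE, dst, 0, src, 0,
                         grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(local_B, nb*nb, MPI_DOUBLE, dst, 0, src, 0,
                         grid, MPI_STATUS_IGNORE);

    for (int step = 0; step < q; step++) {
        /* Local block multiply: C += A * B. */
        for (int i = 0; i < nb; i++)
            for (int k = 0; k < nb; k++)
                for (int j = 0; j < nb; j++)
                    local_C[i*nb + j] += local_A[i*nb + k] * local_B[k*nb + j];

        /* Rotate: A blocks one step to the left, B blocks one step up. */
        MPI_Sendrecv_replace(local_A, nb*nb, MPI_DOUBLE, left, 0, right, 0,
                             grid, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(local_B, nb*nb, MPI_DOUBLE, up, 0, down, 0,
                             grid, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&grid);
}

After the initial alignment, each of the √p rotation steps pairs exactly the A and B blocks needed for one term of Ci,j, which is why no broadcasts are needed.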

Page 36: Design of parallel algorithms
Page 37: Design of parallel algorithms
Page 38: Design of parallel algorithms
Page 39: Design of parallel algorithms

Fox’s algorithm

• Partition to blocks as usual

• Pi,j contains Ai,j and Bi,j

• Uses one-to-all broadcasts, √p iterations

• (1) broadcast the selected A block to the row

• (2) multiply by B

• (3) send B upwards

• (4) select Ai,(j+1) mod √p
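A corresponding MPI sketch of the four steps above (same assumptions as the Cannon sketch: p a perfect square, √p dividing n, row-major blocks, local_C zero on entry; names are illustrative):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void fox(double *local_A, double *local_B, double *local_C,
         int nb, MPI_Comm comm)                  /* nb = n / sqrt(p) */
{
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);
    int q = 0;
    while (q * q < p) q++;                       /* q = sqrt(p) */

    /* 2D torus plus a row communicator for the broadcasts. */
    int dims[2] = { q, q }, periods[2] = { 1, 1 }, coords[2];
    MPI_Comm grid, row_comm;
    MPI_Cart_create(comm, 2, dims, periods, 0, &grid);
    MPI_Cart_coords(grid, rank, 2, coords);
    MPI_Comm_split(grid, coords[0], coords[1], &row_comm);

    int up, down;
    MPI_Cart_shift(grid, 0, -1, &down, &up);     /* B moves upwards */

    double *tmp_A = malloc(nb * nb * sizeof(double));

    for (int step = 0; step < q; step++) {
        /* (1) the process in column (i + step) mod q broadcasts its A block
         *     to the rest of row i */
        int root = (coords[0] + step) % q;
        if (coords[1] == root)
            memcpy(tmp_A, local_A, nb * nb * sizeof(double));
        MPI_Bcast(tmp_A, nb * nb, MPI_DOUBLE, root, row_comm);

        /* (2) multiply the received A block by the resident B block */
        for (int i = 0; i < nb; i++)
            for (int k = 0; k < nb; k++)
                for (int j = 0; j < nb; j++)
                    local_C[i*nb + j] += tmp_A[i*nb + k] * local_B[k*nb + j];

        /* (3) send B upwards with wraparound */
        MPI_Sendrecv_replace(local_B, nb * nb, MPI_DOUBLE, up, 0, down, 0,
                             grid, MPI_STATUS_IGNORE);
    }

    free(tmp_A);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&grid);
}

The difference from Cannon's algorithm is that the A blocks are selected by a one-to-all broadcast in each row instead of being rotated, while B still moves upwards one step per iteration.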

Page 40: Design of parallel algorithms
Page 41: Design of parallel algorithms
Page 42: Design of parallel algorithms

DNS

• Dekel, Nassimi and Sahni

• n³ processors available

• Uses a 3D structure

• Pi,j,k computes A[i,k] x B[k,j]

• C[i,j] = Pi,j,0 + ... + Pi,j,n-1

Θ(log n) time

Page 43: Design of parallel algorithms

DNS for hypercube

• The 3D structure is mapped onto a hypercube with n³ = 2^(3d) processors

• Processor Pi,j,0 contains A[i,j] and B[i,j]

• 3 steps

• (1) move A & B to correct plane

• (2) replicate on each plane

• (3) single node accumulation

Page 44: Design of parallel algorithms
Page 45: Design of parallel algorithms
Page 46: Design of parallel algorithms

DNS with p < n³ processors

• p = q³ processors, q < n

• Partition the matrices into (n/q) x (n/q) blocks

• Matrices contain q x q submatrices

• Since 1 ≤ q ≤ n, p ranges from 1 to n³