cs 584. dense matrix algorithms there are two types of matrices dense (full) sparse we will consider...

CS 584

Dense Matrix Algorithms

There are two types of matrices: dense (full) and sparse.

We will consider matrices that are dense and square.

Mapping Matrices

How do we partition a matrix for parallel processing?

There are two basic ways: striped partitioning and block partitioning.

Striped Partitioning

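[Figure: eight stripes of a matrix (0 to 7) distributed over processors P0 to P3. Block striping gives each processor a contiguous group of stripes; cyclic striping deals the stripes out in the repeating order P0, P1, P2, P3, P0, P1, P2, P3.]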
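The two striping schemes can be written as owner functions. This is a minimal sketch that takes the stripes to be rows; the function names are illustrative and not from the slides, and the block version assumes contiguous stripes of ceil(n/p) rows.

```python
def block_stripe_owner(row, n, p):
    """Owner of `row` when n rows are split into p contiguous stripes."""
    rows_per_proc = -(-n // p)  # ceiling division
    return row // rows_per_proc


def cyclic_stripe_owner(row, p):
    """Owner of `row` when rows are dealt out round-robin."""
    return row % p


n, p = 8, 4
block_map = [block_stripe_owner(r, n, p) for r in range(n)]
cyclic_map = [cyclic_stripe_owner(r, p) for r in range(n)]
print(block_map)   # [0, 0, 1, 1, 2, 2, 3, 3]
print(cyclic_map)  # [0, 1, 2, 3, 0, 1, 2, 3]
```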

Block Partitioning

[Figure: two panels, block checkerboard and cyclic checkerboard, showing sub-blocks of the matrix assigned to processor grids labeled P0 to P3 and P0 to P7.]
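A checkerboard partitioning maps each element (i, j) to a processor in a two-dimensional grid. The sketch below assumes a square q x q grid numbered row by row (the figure may use a different grid shape); both function names are hypothetical.

```python
def block_checkerboard_owner(i, j, n, q):
    """Owner of element (i, j) when an n x n matrix is cut into q x q contiguous blocks."""
    b = -(-n // q)  # block side length, rounded up
    return (i // b) * q + (j // b)


def cyclic_checkerboard_owner(i, j, q):
    """Owner of element (i, j) when rows and columns are both dealt out round-robin."""
    return (i % q) * q + (j % q)


n, q = 4, 2
block_grid = [[block_checkerboard_owner(i, j, n, q) for j in range(n)] for i in range(n)]
cyclic_grid = [[cyclic_checkerboard_owner(i, j, q) for j in range(n)] for i in range(n)]
for row in block_grid:
    print(row)
for row in cyclic_grid:
    print(row)
```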

Block vs. Striped Partitioning

Scalability?
  Striping is limited to n processors.
  Checkerboard is limited to n x n processors.

Complexity?
  Striping is easy.
  Block could introduce more dependencies.

Dense Matrix Algorithms

Transposition
Matrix-Vector Multiplication
Matrix-Matrix Multiplication
Solving Systems of Linear Equations
  Gaussian Elimination

Matrix Transposition

The transpose of A is A^T such that A^T[i,j] = A[j,i].

All elements below the diagonal move above the diagonal and vice versa.

If we assume unit time per exchange, the transpose takes (n^2 - n)/2 time units: one exchange for each pair of off-diagonal elements.
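The exchange count can be checked with a small sequential sketch that swaps each below-diagonal element with its mirror and counts the swaps. The function name is illustrative only.

```python
def transpose_in_place(a):
    """Transpose a square matrix in place and return the number of exchanges."""
    n = len(a)
    swaps = 0
    for i in range(n):
        for j in range(i + 1, n):
            # exchange the element above the diagonal with its mirror below
            a[i][j], a[j][i] = a[j][i], a[i][j]
            swaps += 1
    return swaps


a = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
swaps = transpose_in_place(a)
print(swaps)  # 3, which equals (3*3 - 3) / 2
print(a)      # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
```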

Transpose

Consider the case where each processor has more than one element.

Hypothesis: The transpose of the full matrix can be done by first sending the multiple-element messages to their destinations and then transposing the contents of each message.
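The hypothesis can be tested on a small case. In this sketch each b x b block plays the role of one message: it is delivered to the mirrored block position and then transposed locally. It assumes b divides n, and the helper names are hypothetical.

```python
def transpose(m):
    """Plain element-wise transpose, used as the reference."""
    return [list(col) for col in zip(*m)]


def blockwise_transpose(a, b):
    """Transpose an n x n matrix by moving b x b blocks, then transposing each block."""
    n = len(a)
    out = [[None] * n for _ in range(n)]
    for bi in range(0, n, b):
        for bj in range(0, n, b):
            # the "message": the block whose top-left corner is (bi, bj)
            block = [row[bj:bj + b] for row in a[bi:bi + b]]
            # transpose the message contents locally ...
            local = transpose(block)
            # ... and store them at the mirrored block position (bj, bi)
            for di in range(b):
                for dj in range(b):
                    out[bj + di][bi + dj] = local[di][dj]
    return out


a = [[4 * i + j for j in range(4)] for i in range(4)]
result = blockwise_transpose(a, 2)
print(result == transpose(a))
```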

Transpose (Striped Partitioning)

Transpose (Block Partitioning)

Matrix Multiplication

One Dimensional Decomposition

Each processor "owns" the black portion of the matrices.

To compute its owned portion of the answer, each processor requires all of A.

$$T = (P - 1)\left(t_s + t_w \frac{N^2}{P}\right)$$
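A sequential sketch of the one-dimensional decomposition follows. It assumes each processor owns a stripe of columns of B and of the result (the slides do not say whether stripes are rows or columns) and that p divides n. The inner loop makes the data requirement visible: every owned column needs every row of A.

```python
def matmul_1d(a, b, p):
    """Multiply n x n matrices, simulating p processors that each own a column stripe."""
    n = len(a)
    cols_per_proc = n // p  # assumes p divides n
    c = [[0] * n for _ in range(n)]
    for proc in range(p):
        owned = range(proc * cols_per_proc, (proc + 1) * cols_per_proc)
        for j in owned:         # only the owned columns of B and C are touched
            for i in range(n):  # but every row of A is read
                c[i][j] = sum(a[i][k] * b[k][j] for k in range(n))
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul_1d(a, b, p=2)
print(c)  # [[19, 22], [43, 50]]
```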

Two Dimensional Decomposition

Requires less data per processor.
Algorithm can be performed stepwise:
  Broadcast an A sub-matrix to the other processors in the row.
  Compute.
  Rotate the B sub-matrix upwards.

Algorithm

Set B' = Blocal
for j = 0 to sqrt(P) - 1
    in each row i, the [(i+j) mod sqrt(P)]th task broadcasts A' = Alocal to the other tasks in the row
    accumulate A' * B'
    send B' to upward neighbor
done

$$T = (\sqrt{P} - 1)\left(1 + \frac{\log P}{2}\right)\left(t_s + t_w \frac{N^2}{P}\right)$$
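The broadcast-and-rotate scheme can be simulated sequentially. This sketch assumes a q x q task grid where each task holds a single element (a 1 x 1 sub-matrix), so q = sqrt(P) = n; it runs q accumulation steps so that each of the q products contributes to every result entry. The function name is hypothetical.

```python
def matmul_broadcast_rotate(a, b):
    """Simulate the 2-D broadcast-and-rotate multiply with one element per task."""
    q = len(a)
    c = [[0] * q for _ in range(q)]
    b_cur = [row[:] for row in b]  # B' = Blocal on every task
    for step in range(q):
        for i in range(q):
            k = (i + step) % q
            a_bcast = a[i][k]  # task (i, k) broadcasts its A value along row i
            for j in range(q):
                c[i][j] += a_bcast * b_cur[i][j]  # accumulate A' * B'
        # rotate B' upward: task (i, j) receives from task (i + 1, j)
        b_cur = [b_cur[(i + 1) % q] for i in range(q)]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
c = matmul_broadcast_rotate(a, b)
print(c)
```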

Cannon’s Algorithm

Broadcasting a submatrix to all who need it is costly.

Suggestion: shift both submatrices.

$$T = 2(\sqrt{P} - 1)\left(t_s + t_w \frac{N^2}{P}\right)$$
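A sequential sketch of the shifting idea follows, again with one element per task on a q x q grid. The initial skew (row i of A shifted left by i, column j of B shifted up by j) is taken from the usual textbook statement of Cannon's algorithm and is an assumption here, since the slides do not spell it out.

```python
def cannon(a, b):
    """Simulate Cannon's algorithm with one element per task."""
    q = len(a)
    # initial alignment: skew A by rows and B by columns
    a_cur = [[a[i][(j + i) % q] for j in range(q)] for i in range(q)]
    b_cur = [[b[(i + j) % q][j] for j in range(q)] for i in range(q)]
    c = [[0] * q for _ in range(q)]
    for step in range(q):
        # every task multiplies the pair of values it currently holds
        for i in range(q):
            for j in range(q):
                c[i][j] += a_cur[i][j] * b_cur[i][j]
        if step < q - 1:
            # shift A one position left and B one position up (nearest neighbor only)
            a_cur = [[a_cur[i][(j + 1) % q] for j in range(q)] for i in range(q)]
            b_cur = [[b_cur[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
c = cannon(a, b)
print(c)
```

Between the q multiply steps there are q - 1 shift rounds, each moving one A value and one B value per task.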

Divide and Conquer

[ App  Apq ]     [ Bpp  Bpq ]     [ P0 + P1   P2 + P3 ]
[ Aqp  Aqq ]  x  [ Bqp  Bqq ]  =  [ P4 + P5   P6 + P7 ]

P0 = App * Bpp    P1 = Apq * Bqp
P2 = App * Bpq    P3 = Apq * Bqq
P4 = Aqp * Bpp    P5 = Aqq * Bqp
P6 = Aqp * Bpq    P7 = Aqq * Bqq
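One level of the divide-and-conquer scheme can be sketched as follows, using the standard block identity (for example, the top-left block of the result is App * Bpp + Apq * Bqp). The eight products are independent and could each go to a separate processor. The helper names are hypothetical, and the matrices are assumed to have even dimension.

```python
def matmul(x, y):
    """Plain triple-loop product, used for the sub-blocks and as the reference."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def add(x, y):
    return [[u + v for u, v in zip(rx, ry)] for rx, ry in zip(x, y)]


def split(m):
    """Return the four quadrants pp, pq, qp, qq of a square matrix."""
    h = len(m) // 2
    return ([row[:h] for row in m[:h]], [row[h:] for row in m[:h]],
            [row[:h] for row in m[h:]], [row[h:] for row in m[h:]])


def block_multiply(a, b):
    app, apq, aqp, aqq = split(a)
    bpp, bpq, bqp, bqq = split(b)
    # eight independent sub-products
    p0, p1 = matmul(app, bpp), matmul(apq, bqp)
    p2, p3 = matmul(app, bpq), matmul(apq, bqq)
    p4, p5 = matmul(aqp, bpp), matmul(aqq, bqp)
    p6, p7 = matmul(aqp, bpq), matmul(aqq, bqq)
    # combine pairwise sums back into the four quadrants of the result
    top = [left + right for left, right in zip(add(p0, p1), add(p2, p3))]
    bottom = [left + right for left, right in zip(add(p4, p5), add(p6, p7))]
    return top + bottom


a = [[i + j for j in range(4)] for i in range(4)]
b = [[i * j + 1 for j in range(4)] for i in range(4)]
c = block_multiply(a, b)
print(c == matmul(a, b))
```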

Systems of Linear Equations

A linear equation in n variables has the form

$$a_0 x_0 + a_1 x_1 + \dots + a_{n-1} x_{n-1} = b$$

A set of linear equations is called a system.
A vector x is a solution of the system iff it satisfies every equation in the system.
Many scientific and engineering problems take this form.

Solving Systems of Equations

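Many such systems are large: thousands of equations and unknowns.

$$\begin{aligned}
a_{0,0} x_0 + a_{0,1} x_1 + \dots + a_{0,n-1} x_{n-1} &= b_0 \\
a_{1,0} x_0 + a_{1,1} x_1 + \dots + a_{1,n-1} x_{n-1} &= b_1 \\
&\;\;\vdots \\
a_{n-1,0} x_0 + a_{n-1,1} x_1 + \dots + a_{n-1,n-1} x_{n-1} &= b_{n-1}
\end{aligned}$$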


A linear system of equations can be represented in matrix form

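$$\begin{bmatrix}
a_{0,0} & a_{0,1} & \dots & a_{0,n-1} \\
a_{1,0} & a_{1,1} & \dots & a_{1,n-1} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n-1,0} & a_{n-1,1} & \dots & a_{n-1,n-1}
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n-1} \end{bmatrix}
=
\begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{n-1} \end{bmatrix}$$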

Ax = b


Solving a system of linear equations is done in two steps:
  Reduce the system to upper-triangular form.
  Use back-substitution to find the solution.

These steps are performed on the system in matrix form (Gaussian elimination, etc.).


Reduce the system to upper-triangular form

Use back-substitution

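$$\begin{bmatrix}
a_{0,0} & a_{0,1} & \dots & a_{0,n-1} \\
0 & a_{1,1} & \dots & a_{1,n-1} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & a_{n-1,n-1}
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n-1} \end{bmatrix}
=
\begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{n-1} \end{bmatrix}$$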
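Back-substitution solves the upper-triangular system from the last equation upward. The sketch below is a minimal sequential version; it assumes every diagonal entry is nonzero, and the function name is illustrative.

```python
def back_substitute(u, y):
    """Solve u x = y for x, where u is upper triangular with a nonzero diagonal."""
    n = len(u)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        # subtract the contribution of the unknowns already solved
        s = sum(u[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / u[i][i]
    return x


u = [[2.0, 1.0],
     [0.0, 4.0]]
y = [5.0, 8.0]
x = back_substitute(u, y)
print(x)  # [1.5, 2.0]
```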

Reducing the System

Gaussian elimination systematically eliminates variable x[k] from equations k+1 to n-1 by reducing its coefficients in those equations to zero.

This is done by subtracting an appropriate multiple of the kth equation from each of the equations k+1 to n-1.

Procedure GaussianElimination(A, b, y)
    for k = 0 to n - 1
        /* Division Step */
        for j = k + 1 to n - 1
            A[k,j] = A[k,j] / A[k,k]
        endfor
        y[k] = b[k] / A[k,k]
        A[k,k] = 1

        /* Elimination Step */
        for i = k + 1 to n - 1
            for j = k + 1 to n - 1
                A[i,j] = A[i,j] - A[i,k] * A[k,j]
            endfor
            b[i] = b[i] - A[i,k] * y[k]
            A[i,k] = 0
        endfor
    endfor
end
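The procedure translates almost line for line into Python. This sketch works on copies of its inputs and, like the pseudocode, does no pivoting, so it assumes every A[k,k] is nonzero when it is reached.

```python
def gaussian_elimination(a, b):
    """Reduce A x = b to a unit upper-triangular system U x = y; return (U, y)."""
    n = len(a)
    a = [list(map(float, row)) for row in a]  # work on copies
    b = list(map(float, b))
    y = [0.0] * n
    for k in range(n):
        pivot = a[k][k]  # assumed nonzero: no pivoting is done
        # division step
        for j in range(k + 1, n):
            a[k][j] /= pivot
        y[k] = b[k] / pivot
        a[k][k] = 1.0
        # elimination step
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
            b[i] -= a[i][k] * y[k]
            a[i][k] = 0.0
    return a, y


u, y = gaussian_elimination([[2, 1], [4, 5]], [3, 6])
print(u)  # [[1.0, 0.5], [0.0, 1.0]]
print(y)  # [1.5, 0.0]
```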

Parallelizing Gaussian Elim.

Use domain decomposition
  Rowwise striping

Division step requires no communication.
Elimination step requires a one-to-all broadcast for each equation.
No agglomeration.
Initially map one row to each processor.

Communication Analysis

Consider the algorithm step by step:
Division step requires no communication.
Elimination step requires a one-to-all broadcast
  only broadcast to other active processors
  only broadcast active elements

Final computation requires no communication.

Communication Analysis

One-to-all broadcast
  log2 q communications
  q = n - k - 1 active processors

Message size
  q active processors
  q elements required

$$T = (t_s + t_w q) \log_2 q$$
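The per-step broadcast cost can be evaluated directly from the formula. In this sketch ts and tw are the startup and per-element transfer costs; the numeric values used are made up purely for illustration, and steps with no other active processor are taken to cost nothing.

```python
import math


def broadcast_time(n, k, ts, tw):
    """Cost of the one-to-all broadcast in step k: (ts + tw*q) * log2(q), q = n - k - 1."""
    q = n - k - 1
    if q < 1:
        return 0.0  # no other active processors remain
    return (ts + tw * q) * math.log2(q)


n, ts, tw = 5, 10.0, 1.0  # hypothetical values
per_step = [broadcast_time(n, k, ts, tw) for k in range(n)]
total = sum(per_step)
print(per_step)
print(total)
```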

Computation Analysis

Division step
  q divisions

Elimination step
  q multiplications and subtractions

Assuming each operation takes equal time: 3q operations per step.

Computation Analysis

In each step, the active processor set is reduced by one, resulting in:

$$\text{CompTime} = \sum_{k=0}^{n-1} 3(n - k - 1) = \frac{3n(n-1)}{2}$$
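The total can be checked numerically by summing 3q over all steps, with q = n - k - 1, and comparing against the closed form 3n(n-1)/2. This is only a sanity check under the equal-time assumption above.

```python
def comp_time(n):
    """Sum of 3*q operations over steps k = 0 .. n-1, where q = n - k - 1."""
    return sum(3 * (n - k - 1) for k in range(n))


for n in (1, 2, 10, 100):
    # the step-by-step sum matches the closed form 3n(n-1)/2
    assert comp_time(n) == 3 * n * (n - 1) // 2

print(comp_time(10))  # 135
```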

Can we do better?

Previous version is synchronous, and parallelism is reduced at each step.
Pipeline the algorithm.
Run the resulting algorithm on a linear array of processors.
Communication is nearest-neighbor.
Results in O(n) steps of O(n) operations.

Pipelined Gaussian Elim.

Basic assumption: A processor does not need to wait until all processors have received a value to proceed.

Algorithm
  If processor p has data for other processors, send the data to processor p+1.
  If processor p can do some computation using the data it has, do it.
  Otherwise, wait to receive data from processor p-1.
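The data flow of the pipelined version can be sketched with one row per processor on a linear array. Each processor forwards every pivot row it receives to its successor, uses it to eliminate, and finally normalizes its own row and sends that on too. The simulation below visits processors one after another, so it shows only who talks to whom, not the overlap in time that a real pipeline would achieve; names are hypothetical and no pivoting is done.

```python
def pipelined_gaussian_elimination(a, b):
    """Simulate row-striped Gaussian elimination with nearest-neighbor messages only."""
    n = len(a)
    rows = [list(map(float, r)) for r in a]  # rows[p] lives on processor p
    rhs = list(map(float, b))
    y = [0.0] * n
    inbox = []  # messages arriving at processor 0: none
    for p in range(n):
        outbox = []
        for k, pivot_row, pivot_y in inbox:
            # forward the pivot row to processor p + 1 before computing with it
            outbox.append((k, pivot_row, pivot_y))
            factor = rows[p][k]
            for j in range(k + 1, n):
                rows[p][j] -= factor * pivot_row[j]
            rhs[p] -= factor * pivot_y
            rows[p][k] = 0.0
        # all earlier pivots applied: do the division step on the local row
        pivot = rows[p][p]  # assumed nonzero
        for j in range(p + 1, n):
            rows[p][j] /= pivot
        y[p] = rhs[p] / pivot
        rows[p][p] = 1.0
        outbox.append((p, rows[p], y[p]))  # send the new pivot row downstream
        inbox = outbox  # processor p + 1 receives what p sent, in order
    return rows, y


u, y = pipelined_gaussian_elimination([[2, 1, 1], [4, 3, 3], [8, 7, 9]], [4, 10, 24])
print(u)
print(y)
```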

Conclusion

Using a striped partitioning method, it is natural to pipeline the Gaussian elimination algorithm to achieve the best performance.

Pipelined algorithms work best on a linear array of processors, or on something that can be linearly mapped.

Would it be better to block partition? How would it affect the algorithm?
