CS4402 – Parallel Computing
Lecture 5
Fox and Cannon Matrix Multiplication

TRANSCRIPT
Matrix Multiplication
Start with two matrices: A is n*m and B is m*p.
The product C = A*B is an n*p matrix.
The "row by column" multiplication has a complexity of O(n*m*p).
Parallel Implementation: Linear Partitioning (I)
1. Scatter A to localA and Bcast B.
2. Compute localC = localA * B.
3. Gather localC to C.
Matrix Multiplication
Parallel Implementation: Linear Partitioning (II)
1. Bcast A and Scatter B by columns to localB.
2. Compute localC = A * localB.
3. Gather the columns of localC to C.

Advantages
1. Execution time decreases and the speedup increases.
2. Simple computation for each processor.
Disadvantage
1. For each element localC[i][j], a column of B must be traversed.
Matrix Multiplication
Improvement of the Parallel Implementation
1. Transpose the matrix B.
2. Scatter A to localA and Bcast B.
3. Compute the pseudo product localC = localA * B, multiplying "row by row".
4. Gather localC to C.
Memory cache overhead is reduced.
Complexity of the Linear Multiplication
Scatter n^2/size elements: size*T_startup + n^2*T_comm
Bcast n^2 elements: T_startup + n^2*T_comm
Compute the product: (n^3/size)*T_com
Gather n^2/size elements: size*T_startup + n^2*T_comm
Total Complexity: T = (2*size + 1)*T_startup + 3*n^2*T_comm + (n^3/size)*T_com
Strassen’s Algorithm

Partition the matrices into 2x2 blocks:

A11 A12     B11 B12     C11 C12
A21 A22     B21 B22     C21 C22

P1 = (A11 + A22)(B11 + B22)
P2 = (A21 + A22) B11
P3 = A11 (B12 - B22)
P4 = A22 (B21 - B11)
P5 = (A11 + A12) B22
P6 = (A21 - A11)(B11 + B12)
P7 = (A12 - A22)(B21 + B22)

C11 = P1 + P4 - P5 + P7
C12 = P3 + P5
C21 = P2 + P4
C22 = P1 - P2 + P3 + P6
Fast Matrix Multiplication
1. Strassen: 7 multiplies, 18 additions, O(n^2.81)
2. Strassen–Winograd: 7 multiplies, 15 additions
3. Coppersmith–Winograd: O(n^2.376)
   1. But this is not (easily) implementable.
   2. "Previous authors in this field have exhibited their algorithms directly, but we will have to rely on hashing and counting arguments to show the existence of a suitable algorithm."
Grid Topology
Grid Elements:
- the dimension: 1, 2, 3, etc.
- the size of each dimension.
- the periodicity: whether the extremes of a dimension are adjacent.
- whether the processors may be reordered.
MPI Methods:
- MPI_Cart_create() to create the grid.
- MPI_Cart_coords() to get the coordinates.
- MPI_Cart_rank() to find the rank.
MPI_Cart_create
Creates a communicator containing topology information.

int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart);

MPI_Comm grid_comm;
int size[2], wrap_around[2], reorder;
size[0] = size[1] = q;
wrap_around[0] = 1; wrap_around[1] = 0;
reorder = 1;
MPI_Cart_create(MPI_COMM_WORLD, 2, size, wrap_around, reorder, &grid_comm);
MPI_Cart_coords, MPI_Cart_rank
MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int *coords);
MPI_Cart_rank(MPI_Comm comm, int *coords, int *rank);

Find the coordinates from a rank, and the rank from coordinates.
They map between ranks and grid coordinates.
How to find the ranks of the neighbours
Consider that processor rank has grid coordinates (row, col).

1. Find the grid coordinates of the left/right neighbours and transform them into ranks:

leftCoords[0] = row;
leftCoords[1] = (col - 1 + p) % p;   /* add p so the index stays non-negative */
MPI_Cart_rank(grid, leftCoords, &leftRank);

2. Or use MPI_Cart_shift():

int MPI_Cart_shift(MPI_Comm comm, int direction, int disp, int *rank_source, int *rank_dest);

MPI_Cart_shift(grid, 1, -1, &rightRank, &leftRank);
How to partition the matrix a
Some simple facts:
- Processor 0 has the whole matrix, so it needs to extract the blocks Ai,j.
- Processor 0 sends the block Ai,j to the processor with coordinates (i, j).
- Processor rank receives whatever Processor 0 sends.
How to partition + shift the matrix a

if (rank == 0)
    for (i = 0; i < p; i++)
        for (j = 0; j < p; j++) {
            extract_matrix(n, n, a, n/p, n/p, local_a, i*n/p, j*n/p);
            senderCoords[0] = i;
            senderCoords[1] = (j - i + p) % p;   /* add p so the index stays non-negative */
            MPI_Cart_rank(grid, senderCoords, &senderRank);
            MPI_Send(&local_a[0][0], n*n/(p*p), MPI_INT, senderRank, tag1, MPI_COMM_WORLD);
        }
MPI_Recv(&local_a[0][0], n*n/(p*p), MPI_INT, 0, tag1, MPI_COMM_WORLD, &status_a);
Facts about the systolic computation
Consider the processor rank = (row, col).
- The processor repeats the following p - 1 times:
  a. Receive a block from the left into local_a.
  b. Receive a block from above into local_b.
  c. Compute the product local_a * local_b and accumulate it into local_c.
  d. Send local_a to the right.
  e. Send local_b below.
- The computation local_a * local_b takes place only after the processor's receives have completed.
- Many processors are idle.
C00 = A00 B00
A00
A10
A20
B00 B01 B02
C00 = A00 B00 + A01 B10
A01 A00
A10
A20
B10 B01 B02
B00
C00 = A00 B00 + A01 B10 + A02 B20
A02 A01 A00
A11 A10
A20
B20 B11 B02
B10 B01
B00
Some Other Facts
- The processing ends after 2*p - 1 stages, when processor (p-1, p-1) receives the last blocks.
- After p stages of processing, some processors become idle, e.g. processor (0,0).
- The question remains: how can we reduce the number of stages to exactly p?
Fox = Broadcast A, multiply, and roll B.
Cannon = Multiply, roll A, roll B.
Cannon’s Matrix Multiplication
- The matrix a is block partitioned on the grid as follows:
- Row i of processors gets row i of blocks, cyclically shifted left by i positions.

Before the shift:
A00 A01 A02
A10 A11 A12
A20 A21 A22

After the shift:
A00 A01 A02
A11 A12 A10
A22 A20 A21
Cannon’s Matrix Multiplication
- The matrix b is block partitioned on the grid as follows:
- Column i of processors gets column i of blocks, cyclically shifted up by i positions.

Before the shift:
B00 B01 B02
B10 B11 B12
B20 B21 B22

After the shift:
B00 B11 B22
B10 B21 B02
B20 B01 B12
Cannon’s Matrix Multiplication
For p times do the following computation:
- Multiply local_a with local_b.
- Shift local_a left one position.
- Shift local_b up one position.
A00 A01 A02
A11 A12 A10
A22 A20 A21
B00 B11 B22
B10 B21 B02
B20 B01 B12
C00 = A00 B00
Step 1.
A00 A01 A02
A11 A12 A10
A22 A20 A21
B00 B11 B22
B10 B21 B02
B20 B01 B12
C00 = A00 B00 + A01 B10
Step 2.
A01 A02 A00
A12 A10 A11
A20 A21 A22
B10 B21 B02
B20 B01 B12
B00 B11 B22
C00 = A00 B00 + A01 B10 + A02 B20
Step 3.
A02 A00 A01
A10 A11 A12
A21 A22 A20
B20 B01 B12
B00 B11 B22
B10 B21 B02
Cannon Computation
How to roll the matrices: use send/receive.

for (step = 0; step < p; step++) {
    /* calculate the product local_a * local_b and accumulate in local_c */
    cc = prod_matrix(n/p, n/p, n/p, local_a, local_b);
    for (i = 0; i < n/p; i++)
        for (j = 0; j < n/p; j++)
            local_c[i][j] += cc[i][j];
    /* shift local_a left */
    MPI_Send(&local_a[0][0], n*n/(p*p), MPI_INT, leftRank, tag1, MPI_COMM_WORLD);
    MPI_Recv(&local_a[0][0], n*n/(p*p), MPI_INT, rightRank, tag1, MPI_COMM_WORLD, &status);
    /* shift local_b up */
    MPI_Send(&local_b[0][0], n*n/(p*p), MPI_INT, upRank, tag1, MPI_COMM_WORLD);
    MPI_Recv(&local_b[0][0], n*n/(p*p), MPI_INT, downRank, tag1, MPI_COMM_WORLD, &status);
}

Note: a blocking MPI_Send before the matching MPI_Recv can deadlock for large messages; MPI_Sendrecv_replace() avoids this.
Cannon Computation
How to roll the matrices:
- Use MPI_Sendrecv_replace()

for (step = 0; step < p; step++) {
    /* calculate the product local_a * local_b and accumulate in local_c */
    cc = prod_matrix(n/p, n/p, n/p, local_a, local_b);
    for (i = 0; i < n/p; i++)
        for (j = 0; j < n/p; j++)
            local_c[i][j] += cc[i][j];
    /* shift local_a left and local_b up */
    MPI_Sendrecv_replace(&local_a[0][0], n*n/(p*p), MPI_INT, leftRank, tag1, rightRank, tag1, MPI_COMM_WORLD, &status);
    MPI_Sendrecv_replace(&local_b[0][0], n*n/(p*p), MPI_INT, upRank, tag1, downRank, tag1, MPI_COMM_WORLD, &status);
}
Cannon’s Complexity
Evaluate the complexity in terms of n and p = sqrt(size).
- The matrices a and b are sent to the grid, one send operation per block:
  T1 = 2*size*T_startup + 2*n^2*T_comm
- Each processor computes p matrix multiplications of (n/p)x(n/p) blocks:
  T2 = p*(n/p)^3*T_com = (n^3/size)*T_com
- Each processor does p rolls of local_a and local_b:
  T3 = 2*p*T_startup + 2*(n^2/p)*T_comm
- Total execution time:
  T = 2*(size + p)*T_startup + 2*n^2*(1 + 1/p)*T_comm + (n^3/size)*T_com
Simple Comparisons:
Complexities:
Linear: T = (2*size + 1)*T_startup + 3*n^2*T_comm + (n^3/size)*T_com
Cannon: T = 2*(size + p)*T_startup + 2*n^2*(1 + 1/p)*T_comm + (n^3/size)*T_com
Each strategy uses the same amount of computation.
Cannon uses less communication.
Cannon uses smaller matrices.
Fox’s Matrix Multiplication (1)
- Row i of blocks is broadcast to row i of processors in the order Ai,i, Ai,i+1, Ai,i+2, ..., Ai,i-1.
- The matrix b is partitioned on the grid row by row in the normal order.
- In this way each processor has a block of A and a block of B, and it can proceed with the computation.
- After each computation, roll the matrix b up.
Fox’s Matrix Multiplication (2)
Consider the processor rank = (row, col).
Step 1. Partition the matrix b on the grid so that Bi,j goes to Pi,j.
Step 2. For i = 0, 1, 2, ..., p-1 do:
- Broadcast Arow,(row+i) mod p to all the processors of the same row.
- Multiply local_a by local_b and accumulate the product into local_c.
- Send local_b to (row-1, col).
- Receive into local_b from (row+1, col).
C00 = A00 B00
A00 A00 A00
A11 A11 A11
A22 A22 A22
B00 B01 B02
B10 B11 B12
B20 B21 B22
Step 1.
C00 = A00 B00 + A01 B10
A01 A01 A01
A12 A12 A12
A20 A20 A20
B10 B11 B12
B20 B21 B22
B00 B01 B02
Step 2.
C00 = A00 B00 + A01 B10 + A02 B20
A02 A02 A02
A10 A10 A10
A21 A21 A21
B20 B21 B22
B00 B01 B02
B10 B11 B12
Step 3.