12e.1 more on parallel computing unc-wilmington, c. ferner, 2007 mar 21, 2007

23
12e.1 More on Parallel Computing UNC-Wilmington, C. Ferner, 2007 Mar 21, 2007

Post on 20-Dec-2015

216 views

Category:

Documents


0 download

TRANSCRIPT

12e.1

More on Parallel Computing

UNC-Wilmington, C. Ferner, 2007 Mar 21, 2007

12e.2

Block Mapping (Review)

blksz = (int)ceil((float)N / P);

for (i = lb + my_rank * blksz; i < min(N, lb + (my_rank + 1) * blksz); i++) { ...}

(lb is the lower bound of the original loop)

12e.3

Example

for (i = 1; i < N; i++) {

for (j = 0; j < N; j++) {

a[i][j] += f(a[i-1][j]);

}

}

12e.4

Example

0,0 0,1 0,2 0,3 0,N-1...

1,0 1,1 1,2 1,3 1,N-1......

2,0 2,1 2,2 2,3 2,N-1.........

N-1,0 N-1,1 N-1,2 N-1,3 N-1,N-1...

... ... ... ... ...

j

i

12e.5

Example

• If we mapped iterations of the i loop to processors, the dependencies cross processors boundaries

• Therefore interprocessor communication would be required

12e.6

N-1,N-1

Example

0,0 0,1 0,2 0,3 0,N-1...

1,0 1,1 1,2 1,3 1,N-1......

2,0 2,1 2,2 2,3 2,N-1.........

N-1,0 N-1,1 N-1,2 N-1,3 ...

... ... ... ... ...

PE0:

PE1:

PE2:

PEP:

12e.7

Example

• A better solution would be to map iterations of the j loop to processors

12e.8

N-1,N-1

Example

0,0 0,1 0,2 0,3 0,N-1...

1,0 1,1 1,2 1,3 1,N-1......

2,0 2,1 2,2 2,3 2,N-1.........

N-1,0 N-1,1 N-1,2 N-1,3 ...

... ... ... ... ...

PE0: PE

1: PE

2: PE3:

12e.9

Example

for (i = 1; i < N; i++) {

for (j = my_rank * blksz; i < min(N, (my_rank + 1) * blksz); i++) { a[i][j] += f(a[i-1][j]);

}

}

12e.10

Block Mapping (Review)

blksz = (int)ceil((float)N / P);

for (i = lb + my_rank * blksz; i < min(N, lb + (my_rank + 1) * blksz); i++) { ...}

(lb is the lower bound of the original loop)

12e.11

...0 1*blksz+0 (P-1)*blksz+01 1*blksz+1 (P-1)*blksz+1... ... ... ...1*blksz-1 2*blksz-1 N-1

PE0 PE1 PEP-1

Block Mapping

12e.12

Block Mapping

• The problem is that block mapping can lead to a load imbalance

• Example, let N=26, P=6• blksz = ceiling(26/6) = 5• (lb = 0)

12e.13

0 5 10 15 20 251 6 11 16 212 7 12 17 223 8 13 18 234 9 14 19 24

PE0

PE1

PE2

PE3

PE4

PE5

Block Mapping

• Processors 0-4 have 5 iterations of work• Processor 5 has 1 iteration

12e.14

Cyclic Mapping

• An alternative to block mapping is cyclic mapping

• This is where each iteration is assigned to each processors in a round robin fashion

• This leads to a better load balance

12e.15

0 1 2 3 4 56 7 8 9 10 1112 13 14 15 16 1718 19 20 21 22 2324 25

PE0

PE1

PE2

PE3

PE4

PE5

Cyclic Mapping

• Processors 0-2 have 6 iterations of work• Processor 3-6 have only 5, but it is only 1

iteration fewer!

12e.16

Cyclic Mapping

for (i = lb + my_rank; i < N; i += P) { ...}

(lb is the lower bound of the original loop)

12e.17

Cyclic Mapping

• Conceptually, this is an easier mapping to implement than block mapping

• It leads to better load balancing• However, it can (and often does) lead to

more communication• Suppose that each iteration in the above

example is dependent on the previous iteration

12e.18

0 1 2 3 4 56 7 8 9 10 1112 13 14 15 16 1718 19 20 21 22 2324 25

PE0

PE1

PE2

PE3

PE4

PE5

Cyclic Mapping

• A message is sent from iteration 0 to 1, from 1 to 2, from 2 to 3, from 3 to 4, from 4 to 5, from 5 to 6, ...

12e.19

Block Mapping

• With block mapping, only messages are sent from iteration 5 to 6, from 11 to 12, from 17 to 18, and from 23 to 24

0 5 10 15 20 251 6 11 16 212 7 12 17 223 8 13 18 234 9 14 19 24

PE0

PE1

PE2

PE3

PE4

PE5

12e.20

Block vs Cyclic

• Block mapping increases the granularity and reduces overall communication (O(P)). However, it can lead to load imbalances (O(N/P)).

• Cyclic mapping decreases granularity and increases overall communication (O(N)). However, it improves load balance (O(1)).

• Block-Cyclic is a combination of the two

12e.21

Block-Cyclic Mapping

• Block-cyclic with N=26, P=6, and blksz=2• The load imbalance will be <= blksz

0 2 4 6 8 101 3 5 7 9 1112 14 16 18 20 2213 15 17 19 21 2324 25

PE0

PE1

PE2

PE3

PE4

PE5

12e.22

Block-Cyclic Mapping(N, P, and blksz are given)

nLayers = (int)ceil(((float)N)/(blksz*P));

for (layer = 0; layer < nLayers; layer++) {

beginBlk = layer*blksz*N; for (i = beginBlk + mypid*blksz; i < min(N, beginBlk + (mypid + 1)*blksz); i++) { ... }}

12e.23

Block vs Cyclic

• Block-Cyclic is in between Block and Cyclic in terms of granularity, communication, and load balancing.

• Block and Cyclic are special cases of Block-Cyclic– Block = Block-Cyclic with blksz = ceiling(N/P)– Cyclic = Block-Cyclic with blksz = 1