12e.1 more on parallel computing unc-wilmington, c. ferner, 2007 mar 21, 2007
Post on 20-Dec-2015
216 views
TRANSCRIPT
12e.2
Block Mapping (Review)
blksz = (int)ceil((float)N / P);
for (i = lb + my_rank * blksz; i < min(N, lb + (my_rank + 1) * blksz); i++) { ...}
(lb is the lower bound of the original loop)
12e.4
Example
0,0 0,1 0,2 0,3 0,N-1...
1,0 1,1 1,2 1,3 1,N-1......
2,0 2,1 2,2 2,3 2,N-1.........
N-1,0 N-1,1 N-1,2 N-1,3 N-1,N-1...
... ... ... ... ...
j
i
12e.5
Example
• If we mapped iterations of the i loop to processors, the dependencies cross processors boundaries
• Therefore interprocessor communication would be required
12e.6
N-1,N-1
Example
0,0 0,1 0,2 0,3 0,N-1...
1,0 1,1 1,2 1,3 1,N-1......
2,0 2,1 2,2 2,3 2,N-1.........
N-1,0 N-1,1 N-1,2 N-1,3 ...
... ... ... ... ...
PE0:
PE1:
PE2:
PEP:
12e.8
N-1,N-1
Example
0,0 0,1 0,2 0,3 0,N-1...
1,0 1,1 1,2 1,3 1,N-1......
2,0 2,1 2,2 2,3 2,N-1.........
N-1,0 N-1,1 N-1,2 N-1,3 ...
... ... ... ... ...
PE0: PE
1: PE
2: PE3:
12e.9
Example
for (i = 1; i < N; i++) {
for (j = my_rank * blksz; i < min(N, (my_rank + 1) * blksz); i++) { a[i][j] += f(a[i-1][j]);
}
}
12e.10
Block Mapping (Review)
blksz = (int)ceil((float)N / P);
for (i = lb + my_rank * blksz; i < min(N, lb + (my_rank + 1) * blksz); i++) { ...}
(lb is the lower bound of the original loop)
12e.11
...0 1*blksz+0 (P-1)*blksz+01 1*blksz+1 (P-1)*blksz+1... ... ... ...1*blksz-1 2*blksz-1 N-1
PE0 PE1 PEP-1
Block Mapping
12e.12
Block Mapping
• The problem is that block mapping can lead to a load imbalance
• Example, let N=26, P=6• blksz = ceiling(26/6) = 5• (lb = 0)
12e.13
0 5 10 15 20 251 6 11 16 212 7 12 17 223 8 13 18 234 9 14 19 24
PE0
PE1
PE2
PE3
PE4
PE5
Block Mapping
• Processors 0-4 have 5 iterations of work• Processor 5 has 1 iteration
12e.14
Cyclic Mapping
• An alternative to block mapping is cyclic mapping
• This is where each iteration is assigned to each processors in a round robin fashion
• This leads to a better load balance
12e.15
0 1 2 3 4 56 7 8 9 10 1112 13 14 15 16 1718 19 20 21 22 2324 25
PE0
PE1
PE2
PE3
PE4
PE5
Cyclic Mapping
• Processors 0-2 have 6 iterations of work• Processor 3-6 have only 5, but it is only 1
iteration fewer!
12e.16
Cyclic Mapping
for (i = lb + my_rank; i < N; i += P) { ...}
(lb is the lower bound of the original loop)
12e.17
Cyclic Mapping
• Conceptually, this is an easier mapping to implement than block mapping
• It leads to better load balancing• However, it can (and often does) lead to
more communication• Suppose that each iteration in the above
example is dependent on the previous iteration
12e.18
0 1 2 3 4 56 7 8 9 10 1112 13 14 15 16 1718 19 20 21 22 2324 25
PE0
PE1
PE2
PE3
PE4
PE5
Cyclic Mapping
• A message is sent from iteration 0 to 1, from 1 to 2, from 2 to 3, from 3 to 4, from 4 to 5, from 5 to 6, ...
12e.19
Block Mapping
• With block mapping, only messages are sent from iteration 5 to 6, from 11 to 12, from 17 to 18, and from 23 to 24
0 5 10 15 20 251 6 11 16 212 7 12 17 223 8 13 18 234 9 14 19 24
PE0
PE1
PE2
PE3
PE4
PE5
12e.20
Block vs Cyclic
• Block mapping increases the granularity and reduces overall communication (O(P)). However, it can lead to load imbalances (O(N/P)).
• Cyclic mapping decreases granularity and increases overall communication (O(N)). However, it improves load balance (O(1)).
• Block-Cyclic is a combination of the two
12e.21
Block-Cyclic Mapping
• Block-cyclic with N=26, P=6, and blksz=2• The load imbalance will be <= blksz
0 2 4 6 8 101 3 5 7 9 1112 14 16 18 20 2213 15 17 19 21 2324 25
PE0
PE1
PE2
PE3
PE4
PE5
12e.22
Block-Cyclic Mapping(N, P, and blksz are given)
nLayers = (int)ceil(((float)N)/(blksz*P));
for (layer = 0; layer < nLayers; layer++) {
beginBlk = layer*blksz*N; for (i = beginBlk + mypid*blksz; i < min(N, beginBlk + (mypid + 1)*blksz); i++) { ... }}