ECE 1747H: Parallel Programming
Lecture 2-3: More on parallelism and dependences -- synchronization
Synchronization
• All programming models give the user the ability to control the ordering of events on different processors.
• This facility is called synchronization.
Example 1
```
f()    { a = 1; b = 2; c = 3; }
g()    { d = 4; e = 5; f = 6; }
main() { f(); g(); }
```
• No dependences between f and g.
• Thus, f and g can be run in parallel.
Example 2
```
f()    { a = 1; b = 2; c = 3; }
g()    { a = 4; b = 5; c = 6; }
main() { f(); g(); }
```
• Dependences between f and g.
• Thus, f and g cannot be run in parallel.
Example 2 (continued)
```
f()    { a = 1; b = 2; c = 3; }
g()    { a = 4; b = 5; c = 6; }
main() { f(); g(); }
```
• Dependences are between the assignments to a, the assignments to b, and the assignments to c.
• No other dependences.
• Therefore, we only need to enforce these dependences.
Synchronization Facility
• Suppose we had a set of primitives, signal(x) and wait(x).
• wait(x) blocks unless a signal(x) has occurred.
• signal(x) does not block, but causes a wait(x) to unblock, or causes a future wait(x) not to block.
Example 2 (continued)
```
f()    { a = 1; b = 2; c = 3; }
g()    { a = 4; b = 5; c = 6; }
main() { f(); g(); }
```

```
f()    { a = 1; signal(e_a); b = 2; signal(e_b); c = 3; signal(e_c); }
g()    { wait(e_a); a = 4; wait(e_b); b = 5; wait(e_c); c = 6; }
main() { f(); g(); }
```
Example 2 (continued)

```
f()            g()
a = 1;         wait(e_a);
signal(e_a);
b = 2;         a = 4;
signal(e_b);   wait(e_b);
c = 3;         b = 5;
signal(e_c);   wait(e_c);
               c = 6;
```

• Execution is (mostly) parallel and correct.
• Dependences are “covered” by synchronization.
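One way (not shown in the slides) to realize signal(x)/wait(x) is with POSIX counting semaphores initialized to 0: `sem_post` never blocks, and `sem_wait` blocks until a matching post has occurred. A runnable sketch of the synchronized Example 2, with `run_example2` as an illustrative driver name:

```c
#include <pthread.h>
#include <semaphore.h>

int a, b, c;
sem_t e_a, e_b, e_c;   /* event semaphores, initialized to 0 */

static void *f_thread(void *unused) {
    a = 1; sem_post(&e_a);    /* signal(e_a) */
    b = 2; sem_post(&e_b);    /* signal(e_b) */
    c = 3; sem_post(&e_c);    /* signal(e_c) */
    return NULL;
}

static void *g_thread(void *unused) {
    sem_wait(&e_a); a = 4;    /* wait(e_a) covers the dependence on a */
    sem_wait(&e_b); b = 5;    /* wait(e_b) covers the dependence on b */
    sem_wait(&e_c); c = 6;    /* wait(e_c) covers the dependence on c */
    return NULL;
}

void run_example2(void) {
    pthread_t tf, tg;
    sem_init(&e_a, 0, 0); sem_init(&e_b, 0, 0); sem_init(&e_c, 0, 0);
    pthread_create(&tf, NULL, f_thread, NULL);
    pthread_create(&tg, NULL, g_thread, NULL);
    pthread_join(tf, NULL);
    pthread_join(tg, NULL);
}
```

Whatever the interleaving, each write in g happens after the corresponding write in f, so the final values are always a = 4, b = 5, c = 6.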
About synchronization
• Synchronization is necessary to make some programs execute correctly in parallel.
• However, synchronization is expensive.
• Therefore, it needs to be reduced, or sometimes we need to give up on parallelism.
Example 3
```
f()    { a = 1; b = 2; c = 3; }
g()    { d = 4; e = 5; a = 6; }
main() { f(); g(); }
```

```
f()    { a = 1; signal(e_a); b = 2; c = 3; }
g()    { d = 4; e = 5; wait(e_a); a = 6; }
main() { f(); g(); }
```
Example 4
```
for( i=1; i<100; i++ ) {
    a[i] = …;
    …;
    … = a[i-1];
}
```

• Loop-carried dependence; not parallelizable as written.
Example 4 (continued)
```
for( i=...; i<...; i++ ) {
    a[i] = …;
    signal(e_a[i]);
    …;
    wait(e_a[i-1]);
    … = a[i-1];
}
```
Example 4 (continued)
• Note that here it matters which iterations are assigned to which processor.
• It does not matter for correctness, but it matters for performance.
• Cyclic assignment is probably best.
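A runnable sketch of this pattern under the cyclic assignment just described (assumptions: POSIX semaphores realize signal/wait, two threads, and `i * i` plus an `out[]` array stand in for the elided computations):

```c
#include <pthread.h>
#include <semaphore.h>

#define LN 100
#define LP 2                  /* number of threads */
int arr[LN], out[LN];
sem_t ready[LN];              /* ready[i] is posted once arr[i] is written */

/* Cyclic assignment: thread id handles iterations id, id+LP, id+2*LP, ... */
static void *pipe_worker(void *arg) {
    long id = (long)arg;
    for (int i = id; i < LN; i += LP) {
        arr[i] = i * i;                /* stands in for "a[i] = ..." */
        sem_post(&ready[i]);           /* signal(e_a[i]) */
        if (i > 0) {
            sem_wait(&ready[i - 1]);   /* wait(e_a[i-1]) */
            out[i] = arr[i - 1];       /* stands in for "... = a[i-1]" */
        }
    }
    return NULL;
}

void run_pipeline(void) {
    pthread_t t[LP];
    for (int i = 0; i < LN; i++) sem_init(&ready[i], 0, 0);
    for (long id = 0; id < LP; id++)
        pthread_create(&t[id], NULL, pipe_worker, (void *)id);
    for (int id = 0; id < LP; id++)
        pthread_join(t[id], NULL);
}
```

Because each thread posts `ready[i]` before waiting on `ready[i-1]`, no thread can block a peer that is one iteration behind it, so the pipeline cannot deadlock.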
Example 5
```
for( i=0; i<100; i++ )
    a[i] = f(i);
x = g(a);
for( i=0; i<100; i++ )
    b[i] = x + h( a[i] );
```

• First loop can be run in parallel.
• Middle statement is sequential.
• Second loop can be run in parallel.
Example 5 (continued)
• We will need to make parallel execution stop after first loop and resume at the beginning of the second loop.
• Two (standard) ways of doing that:
  – fork() - join()
  – barrier synchronization
Fork-Join Synchronization
• fork() causes a number of processes to be created and to be run in parallel.
• join() causes all these processes to wait until all of them have executed a join().
Example 5 (continued)
```
fork();
for( i=...; i<...; i++ )
    a[i] = f(i);
join();
x = g(a);
fork();
for( i=...; i<...; i++ )
    b[i] = x + h( a[i] );
join();
```
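This fork-join structure can be sketched with POSIX threads, where `pthread_create` plays the role of fork() and `pthread_join` of join(). The definitions of f5, g5, and h5 are hypothetical stand-ins, since the slides leave f, g, and h unspecified:

```c
#include <pthread.h>

#define N5 100
#define P5 4                  /* number of threads */
double da[N5], db[N5], x5;

/* Hypothetical stand-ins for the slides' f, g, and h. */
static double f5(int i)     { return (double)i; }
static double g5(double *v) { double s = 0.0; for (int i = 0; i < N5; i++) s += v[i]; return s; }
static double h5(double v)  { return 2.0 * v; }

/* Each thread gets a contiguous block of N5/P5 iterations. */
static void *loop1(void *arg) {
    long id = (long)arg;
    for (int i = id * (N5/P5); i < (id+1) * (N5/P5); i++) da[i] = f5(i);
    return NULL;
}
static void *loop2(void *arg) {
    long id = (long)arg;
    for (int i = id * (N5/P5); i < (id+1) * (N5/P5); i++) db[i] = x5 + h5(da[i]);
    return NULL;
}

void run_example5(void) {
    pthread_t t[P5];
    for (long k = 0; k < P5; k++) pthread_create(&t[k], NULL, loop1, (void *)k); /* "fork" */
    for (int k = 0; k < P5; k++)  pthread_join(t[k], NULL);                      /* "join" */
    x5 = g5(da);                                   /* sequential middle statement */
    for (long k = 0; k < P5; k++) pthread_create(&t[k], NULL, loop2, (void *)k);
    for (int k = 0; k < P5; k++)  pthread_join(t[k], NULL);
}
```

The join between the first loop and `g5(da)` guarantees every element of `da` is written before the sequential statement reads it.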
Example 6
```
sum = 0.0;
for( i=0; i<100; i++ )
    sum += a[i];
```

• Iterations have a dependence on sum.
• Cannot be parallelized, but ...
Example 6 (continued)
```
for( k=0; k<...; k++ )
    sum[k] = 0.0;
fork();
for( j=…; j<…; j++ )      /* each processor k sums its own block */
    sum[k] += a[j];
join();
sum = 0.0;
for( k=0; k<...; k++ )
    sum += sum[k];
```
Reduction
• This pattern is very common.
• Many parallel programming systems have explicit support for it, called reduction.

```
sum = reduce( +, a, 0, 100 );
```
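The per-processor partial-sum idea can be sketched with POSIX threads as follows. This is a hypothetical implementation; reduce() as written above is a built-in of some parallel systems, not of C, and the array contents here are illustrative:

```c
#include <pthread.h>

#define NR 100
#define PR 4                  /* number of threads */
double ra[NR];
double partial[PR];

/* Each thread k sums its own block into partial[k]; no two threads
   write the same location, so no further synchronization is needed. */
static void *partial_sum(void *arg) {
    long k = (long)arg;
    double s = 0.0;
    for (int j = k * (NR/PR); j < (k+1) * (NR/PR); j++) s += ra[j];
    partial[k] = s;
    return NULL;
}

double reduce_sum(void) {
    pthread_t t[PR];
    double total = 0.0;
    for (int j = 0; j < NR; j++) ra[j] = 1.0;   /* example data */
    for (long k = 0; k < PR; k++) pthread_create(&t[k], NULL, partial_sum, (void *)k);
    for (int k = 0; k < PR; k++)  pthread_join(t[k], NULL);
    for (int k = 0; k < PR; k++)  total += partial[k];  /* sequential combine */
    return total;
}
```

Only the final combine over PR partial sums is sequential; the O(N) accumulation runs in parallel.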
Final word on synchronization
• Many different synchronization constructs exist in different programming models.
• Dependences have to be “covered” by appropriate synchronization.
• Synchronization is often expensive.
ECE 1747H: Parallel Programming
Lecture 2-3: Data Parallelism
Previously
• Ordering of statements.
• Dependences.
• Parallelism.
• Synchronization.
Goal of next few lectures
• Standard patterns of parallel programs.
• Examples of each.
• Later, code examples in various programming models.
Flavors of Parallelism
• Data parallelism: all processors do the same thing on different data.
  – Regular
  – Irregular
• Task parallelism: processors do different tasks.
  – Task queue
  – Pipelines
Data Parallelism
• Essential idea: each processor works on a different part of the data (usually in one or more arrays).
• Regular or irregular data parallelism: using linear or non-linear indexing.
• Examples: matrix multiplication (MM, regular), successive over-relaxation (SOR, regular), molecular dynamics (MD, irregular).
Matrix Multiplication
• Multiplication of two n by n matrices A and B into a third n by n matrix C.
Matrix Multiply
```
for( i=0; i<n; i++ )
    for( j=0; j<n; j++ )
        c[i][j] = 0.0;
for( i=0; i<n; i++ )
    for( j=0; j<n; j++ )
        for( k=0; k<n; k++ )
            c[i][j] += a[i][k]*b[k][j];
```
Parallel Matrix Multiply
• No loop-carried dependences in the i- or j-loop.
• Loop-carried dependence on the k-loop.
• All i- and j-iterations can be run in parallel.
Parallel Matrix Multiply (contd.)
• If we have P processors, we can give n/P rows or columns to each processor.
• Or, we can divide the matrix in P squares, and give each processor one square.
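The row-partitioned variant can be sketched with POSIX threads; the names and the tiny 4x4 problem size are illustrative, not from the slides:

```c
#include <pthread.h>

#define NMAT 4               /* matrix dimension */
#define PT 2                 /* number of threads; divides NMAT */
double A[NMAT][NMAT], B[NMAT][NMAT], C[NMAT][NMAT];

/* Each thread computes a block of NMAT/PT consecutive rows of C.
   Rows are disjoint, so no synchronization is needed inside the loop. */
static void *mm_rows(void *arg) {
    long id = (long)arg;
    for (int i = id * (NMAT/PT); i < (id+1) * (NMAT/PT); i++)
        for (int j = 0; j < NMAT; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < NMAT; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
    return NULL;
}

void mm_parallel(void) {
    pthread_t t[PT];
    for (int i = 0; i < NMAT; i++)
        for (int j = 0; j < NMAT; j++) { A[i][j] = 1.0; B[i][j] = 1.0; }  /* example data */
    for (long id = 0; id < PT; id++) pthread_create(&t[id], NULL, mm_rows, (void *)id);
    for (int id = 0; id < PT; id++)  pthread_join(t[id], NULL);
}
```

With all-ones inputs, every entry of C equals NMAT, which makes the partitioning easy to check.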
Data Distribution: Examples
BLOCK DISTRIBUTION
Data Distribution: Examples
BLOCK DISTRIBUTION BY ROW
Data Distribution: Examples
BLOCK DISTRIBUTION BY COLUMN
Data Distribution: Examples
CYCLIC DISTRIBUTION BY COLUMN
Data Distribution: Examples
BLOCK CYCLIC
Data Distribution: Examples
COMBINATIONS
SOR
• SOR (successive over-relaxation) implements a mathematical model for many natural phenomena, e.g., heat dissipation in a metal sheet.
• The model is a partial differential equation.
• The focus is on the algorithm, not on the derivation.
Problem Statement
[Figure: a rectangle in the (x,y) plane, with boundary conditions F = 1 on one edge and F = 0 on the other three; in the interior, ∇²F(x,y) = 0.]
Discretization
• Represent F in continuous rectangle by a 2-dimensional discrete grid (array).
• The boundary conditions on the rectangle are the boundary values of the array.
• The internal values are found by the relaxation algorithm.
Discretized Problem Statement
[Figure: the discretized grid, with rows indexed by i and columns by j.]
Relaxation Algorithm
• For some number of iterations: for each internal grid point, compute the average of its four neighbors.
• Termination condition: values at grid points change very little (we will ignore this part in our example).
Discretized Problem Statement
```
for some number of timesteps/iterations {
    for( i=1; i<n; i++ )
        for( j=1; j<n; j++ )
            temp[i][j] = 0.25 *
                ( grid[i-1][j] + grid[i+1][j] +
                  grid[i][j-1] + grid[i][j+1] );
    for( i=1; i<n; i++ )
        for( j=1; j<n; j++ )
            grid[i][j] = temp[i][j];
}
```
Parallel SOR
• No dependences between iterations of first (i,j) loop nest.
• No dependences between iterations of second (i,j) loop nest.
• Anti-dependence between first and second loop nest in the same timestep.
• True dependence between second loop nest and first loop nest of next timestep.
Parallel SOR (continued)
• First (i,j) loop nest can be parallelized.
• Second (i,j) loop nest can be parallelized.
• We must make processors wait at the end of each (i,j) loop nest.
• Natural synchronization: fork-join.
Parallel SOR (continued)
• If we have P processors, we can give n/P rows or columns to each processor.
• Or, we can divide the array in P squares, and give each processor a square to compute.
Molecular Dynamics (MD)
• Simulation of a set of bodies under the influence of physical laws.
• Atoms, molecules, celestial bodies, ...
• All have the same basic structure.
Molecular Dynamics (Skeleton)
```
for some number of timesteps {
    for all molecules i
        for all other molecules j
            force[i] += f( loc[i], loc[j] );
    for all molecules i
        loc[i] = g( loc[i], force[i] );
}
```
Molecular Dynamics (continued)
• To reduce amount of computation, account for interaction only with nearby molecules.
Molecular Dynamics (continued)
```
for some number of timesteps {
    for all molecules i
        for all nearby molecules j
            force[i] += f( loc[i], loc[j] );
    for all molecules i
        loc[i] = g( loc[i], force[i] );
}
```
Molecular Dynamics (continued)
For each molecule i, we keep:
• count[i], the number of nearby molecules
• index[j] ( 0 <= j < count[i] ), an array of indices of the nearby molecules
Molecular Dynamics (continued)
```
for some number of timesteps {
    for( i=0; i<num_mol; i++ )
        for( j=0; j<count[i]; j++ )
            force[i] += f( loc[i], loc[index[j]] );
    for( i=0; i<num_mol; i++ )
        loc[i] = g( loc[i], force[i] );
}
```
Molecular Dynamics (continued)
• No loop-carried dependence in the first i-loop.
• Loop-carried dependence (reduction) in the j-loop.
• No loop-carried dependence in the second i-loop.
• True dependence between the first and second i-loop.
Molecular Dynamics (continued)
• First i-loop can be parallelized.
• Second i-loop can be parallelized.
• Must make processors wait between loops.
• Natural synchronization: fork-join.
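A concrete sketch of one parallel MD timestep with POSIX threads and a cyclic assignment of molecules. Everything here is hypothetical: fpair and gmove are toy stand-ins for the slides' f and g, and the neighbor lists simply link each molecule on a line to its left and right neighbors:

```c
#include <pthread.h>

#define NMOL 8
#define MP 2                       /* number of threads */
double loc[NMOL], force[NMOL];
int count[NMOL];
int index_[NMOL][NMOL];            /* index_[i][j]: j-th neighbor of molecule i */

/* Hypothetical stand-ins for the slides' f and g. */
static double fpair(double xi, double xj) { return xj - xi; }
static double gmove(double x, double fo)  { return x + 0.1 * fo; }

static void *force_loop(void *arg) {
    long id = (long)arg;
    for (int i = id; i < NMOL; i += MP)      /* cyclic: helps load balance */
        for (int j = 0; j < count[i]; j++)
            force[i] += fpair(loc[i], loc[index_[i][j]]);
    return NULL;
}
static void *move_loop(void *arg) {
    long id = (long)arg;
    for (int i = id; i < NMOL; i += MP)
        loc[i] = gmove(loc[i], force[i]);
    return NULL;
}

void md_step(void) {
    pthread_t t[MP];
    for (int i = 0; i < NMOL; i++) {         /* toy neighbor lists on a line */
        loc[i] = (double)i; force[i] = 0.0; count[i] = 0;
        if (i > 0)        index_[i][count[i]++] = i - 1;
        if (i < NMOL - 1) index_[i][count[i]++] = i + 1;
    }
    for (long id = 0; id < MP; id++) pthread_create(&t[id], NULL, force_loop, (void *)id);
    for (int id = 0; id < MP; id++)  pthread_join(t[id], NULL);   /* wait between loops */
    for (long id = 0; id < MP; id++) pthread_create(&t[id], NULL, move_loop, (void *)id);
    for (int id = 0; id < MP; id++)  pthread_join(t[id], NULL);
}
```

The join between the two loops covers the true dependence: no molecule moves before all forces on it are accumulated.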
Molecular Dynamics (continued)
```
for some number of timesteps {
    for( i=0; i<num_mol; i++ )
        for( j=0; j<count[i]; j++ )
            force[i] += f( loc[i], loc[index[j]] );
    for( i=0; i<num_mol; i++ )
        loc[i] = g( loc[i], force[i] );
}
```
Irregular vs. regular data parallel
• In SOR, all arrays are accessed through linear expressions of the loop indices, known at compile time [regular].
• In MD, some arrays are accessed through non-linear expressions of the loop indices, some known only at runtime [irregular].
Irregular vs. regular data parallel
• No real differences in terms of parallelization (based on dependences).
• Will lead to fundamental differences in expressions of parallelism:
  – irregular is difficult for parallelism based on data distribution
  – not difficult for parallelism based on iteration distribution.
Molecular Dynamics (continued)
• Parallelization of the first loop:
  – has a load balancing issue
  – some molecules have few neighbors, others have many
  – more sophisticated loop partitioning is necessary