Open Multiprocessing
Dr. Bo Yuan, E-mail: [email protected]
2
Note on Parallel Programming
• An incorrect program may produce correct results.
– The order of execution of processes/threads is unpredictable.
– May depend on your luck!
• A program that always produces correct results may still not make sense.
– The outputs of a program are just part of the story.
– Efficiency matters!
3
OpenMP
• An API for shared-memory multiprocessing (parallel) programming in C, C++ and Fortran.
  – Supports multiple platforms (processor architectures and operating systems).
  – Higher-level implementation (you mark a block of code that should be executed in parallel).
• A method of parallelizing whereby a master thread forks a number of slave threads and a task is divided among them.
• Based on preprocessor directives (pragmas)
  – Requires compiler support.
  – omp.h
• References
  – http://openmp.org/
  – https://computing.llnl.gov/tutorials/openMP/
  – http://supercomputingblog.com/openmp/
4
Hello, World!

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Hello(void);

int main(int argc, char* argv[]) {
   /* Get the number of threads from the command line */
   int thread_count = strtol(argv[1], NULL, 10);

#  pragma omp parallel num_threads(thread_count)
   Hello();

   return 0;
}

void Hello(void) {
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();

   printf("Hello from thread %d of %d\n", my_rank, thread_count);
}
5
Definitions

# pragma omp parallel [clauses]
{
   code_block
}

• Clause: text that modifies the directive.
• Thread team = master + slaves; there is an implicit barrier at the end of the parallel block.

Error Checking

#ifdef _OPENMP
#  include <omp.h>
#endif

#ifdef _OPENMP
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();
#else
   int my_rank = 0;
   int thread_count = 1;
#endif
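For concreteness, here is a minimal, self-contained sketch (not from the slides; the values of n and thread_count are illustrative) showing a parallel directive with clauses, a shared and a private variable, and the implicit barrier:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void) {
   int n = 8;              /* declared before the parallel block: shared */
   int thread_count = 4;

   /* num_threads is a clause: text that modifies the parallel directive. */
#  pragma omp parallel num_threads(thread_count) shared(n)
   {
      /* my_rank is declared inside the block: private to each thread */
#     ifdef _OPENMP
      int my_rank = omp_get_thread_num();
#     else
      int my_rank = 0;
#     endif
      printf("Thread %d sees n = %d\n", my_rank, n);
   }  /* implicit barrier: every thread finishes before execution continues */

   return 0;
}

Without OpenMP support, the pragma is ignored and the block runs once on a single thread.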
6
The Trapezoidal Rule
/* Input: a, b, n */
h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
for (i = 1; i <= n-1; i++) {
   x_i = a + i*h;
   approx += f(x_i);
}
approx = h*approx;

Each thread (e.g., thread 0, thread 2) computes a partial sum over its own subinterval and then adds it to the shared result:

# pragma omp critical
global_result += my_result;

Shared memory and shared variables imply a potential race condition, hence the critical directive.
7
The critical Directive
# pragma omp critical
y = f(x);
...
double f(double x) {
#  pragma omp critical
   z = g(x);
   ...
}
Cannot be executed simultaneously!
# pragma omp critical(one)
y = f(x);
...
double f(double x) {
#  pragma omp critical(two)
   z = g(x);
   ...
}
Deadlock
8
The atomic Directive

# pragma omp atomic
x <op>= <expression>;

<op> can be one of the binary operators: +, *, -, /, &, ^, |, <<, >>
The statement may also take the forms x++, ++x, x--, or --x.

• Higher performance than the critical directive.
• Only a single C assignment statement is protected.
• Only the load and store of x are protected.
• <expression> must not reference x.

# pragma omp atomic
x += f(y);

# pragma omp critical
x = g(x);

The two protected statements above can be executed simultaneously! (An atomic update and a critical section are not guaranteed to exclude each other, so mixing them on the same variable is a potential race; see the counter sketch below for plain atomic use.)
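As a small illustration (not from the slides; the loop bound and counter name are illustrative), atomic can protect a simple shared-counter update:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[]) {
   int thread_count = strtol(argv[1], NULL, 10);
   int hits = 0;   /* shared counter */

#  pragma omp parallel num_threads(thread_count)
   {
      int i;
      for (i = 0; i < 1000; i++) {
         /* Protect only the load/update/store of hits. */
#        pragma omp atomic
         hits += 1;
      }
   }

   /* Expect thread_count * 1000. */
   printf("hits = %d\n", hits);
   return 0;
}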
9
Locks

/* Executed by one thread */
Initialize the lock data structure;
...
/* Executed by multiple threads */
Attempt to lock or set the lock data structure;
Critical section;
Unlock or unset the lock data structure;
...
/* Executed by one thread */
Destroy the lock data structure;

void omp_init_lock(omp_lock_t* lock_p);
void omp_set_lock(omp_lock_t* lock_p);
void omp_unset_lock(omp_lock_t* lock_p);
void omp_destroy_lock(omp_lock_t* lock_p);
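A minimal sketch of the pattern above (not from the slides; the per-thread value is illustrative), protecting a shared sum with an explicit lock:

#include <stdio.h>
#include <omp.h>

int main(void) {
   double sum = 0.0;
   omp_lock_t lock;

   omp_init_lock(&lock);          /* executed once, before the parallel block */

#  pragma omp parallel num_threads(4)
   {
      double my_part = omp_get_thread_num() + 1.0;   /* illustrative per-thread value */

      omp_set_lock(&lock);        /* attempt to acquire the lock */
      sum += my_part;             /* critical section */
      omp_unset_lock(&lock);      /* release the lock */
   }

   omp_destroy_lock(&lock);       /* executed once, after the parallel block */
   printf("sum = %f\n", sum);     /* expect 1+2+3+4 = 10 */
   return 0;
}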
10
Trapezoidal Rule in OpenMP

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Trap(double a, double b, int n, double* global_result_p);

int main(int argc, char* argv[]) {
   double global_result = 0.0;
   double a, b;
   int n, thread_count;

   thread_count = strtol(argv[1], NULL, 10);
   printf("Enter a, b, and n\n");
   scanf("%lf %lf %d", &a, &b, &n);

#  pragma omp parallel num_threads(thread_count)
   Trap(a, b, n, &global_result);

   printf("With n = %d trapezoids, our estimate\n", n);
   printf("of the integral from %f to %f = %.15e\n", a, b, global_result);
   return 0;
}
11
Trapezoidal Rule in OpenMP

void Trap(double a, double b, int n, double* global_result_p) {
   double h, x, my_result;
   double local_a, local_b;
   int i, local_n;
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();

   h = (b-a)/n;
   local_n = n/thread_count;
   local_a = a + my_rank*local_n*h;
   local_b = local_a + local_n*h;
   my_result = (f(local_a) + f(local_b))/2.0;
   for (i = 1; i <= local_n-1; i++) {
      x = local_a + i*h;
      my_result += f(x);
   }
   my_result = my_result*h;

#  pragma omp critical
   *global_result_p += my_result;
}
12
Scope of Variables
Private scope
• Only accessible by a single thread.
• Declared in the code block (or in functions called from it).
• Examples: my_rank, my_result, global_result_p.

Shared scope
• Accessible by all threads in a team.
• Declared before a parallel directive.
• Examples: a, b, n, global_result, thread_count (and *global_result_p, which refers to the shared global_result).
In serial programming:
• Function-wide scope
• File-wide scope
13
Another Trap Function

double Local_trap(double a, double b, int n);

global_result = 0.0;
#  pragma omp parallel num_threads(thread_count)
   {
#     pragma omp critical
      global_result += Local_trap(a, b, n);
   }

global_result = 0.0;
#  pragma omp parallel num_threads(thread_count)
   {
      double my_result = 0.0;   /* private */
      my_result = Local_trap(a, b, n);
#     pragma omp critical
      global_result += my_result;
   }

In the first version the call to Local_trap is inside the critical section, so the threads' computations are serialized; in the second version each thread computes my_result in parallel and only the final addition is serialized.
14
The Reduction Clause
• Reduction: A computation (binary operation) that repeatedly applies the same reduction operator (e.g., addition or multiplication) to a sequence of operands in order to get a single result.
• Note:
  – The reduction variable itself is shared.
  – A private variable is created for each thread in the team.
  – The private variables are initialized to 0 for the addition operator (more generally, to the identity of the reduction operator).

reduction(<operator>: <variable list>)

global_result = 0.0;
#  pragma omp parallel num_threads(thread_count) \
      reduction(+: global_result)
   global_result += Local_trap(a, b, n);
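A self-contained sketch of the clause (not from the slides; My_part and the thread count are illustrative):

#include <stdio.h>
#include <omp.h>

/* Illustrative per-thread work: each thread just returns its rank + 1. */
double My_part(void) {
   return omp_get_thread_num() + 1.0;
}

int main(void) {
   double global_result = 0.0;   /* reduction variable: shared */

#  pragma omp parallel num_threads(4) reduction(+: global_result)
   global_result += My_part();   /* each thread adds into its private copy */

   /* The private copies (initialized to 0 for +) are combined here: 1+2+3+4 = 10. */
   printf("global_result = %f\n", global_result);
   return 0;
}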
15
The parallel for Directive
h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
for (i = 1; i <= n-1; i++) {
   approx += f(a + i*h);
}
approx = h*approx;

h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: approx)
   for (i = 1; i <= n-1; i++) {
      approx += f(a + i*h);
   }
approx = h*approx;
• The code block must be a for loop.
• Iterations of the for loop are divided among threads.
• approx is a reduction variable.
• i is a private variable.
16
The parallel for Directive
• Sounds like a truly wonderful approach to parallelizing serial programs.
• Does not work with while or do-while loops.
  – How about converting them into for loops?
• The number of iterations must be determined in advance, so loops like the following cannot be parallelized:

for (; ;) {
   ...
}

for (i = 0; i < n; i++) {
   if (...) break;
   ...
}

int x, y;
#  pragma omp parallel for num_threads(thread_count)
   for (x = 0; x < width; x++) {
      for (y = 0; y < height; y++) {
         finalImage[x][y] = f(x, y);
      }
   }

• Here x, the loop variable of the parallelized loop, is automatically private, but y (declared before the directive) is shared by default: the directive needs a private(y) clause, as in the corrected sketch below.
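A corrected, self-contained sketch under illustrative assumptions (small width and height, a trivial f):

#include <stdio.h>
#include <omp.h>

#define WIDTH  4
#define HEIGHT 3

/* Illustrative pixel function. */
double f(int x, int y) {
   return x * 10.0 + y;
}

int main(void) {
   double finalImage[WIDTH][HEIGHT];
   int x, y;

   /* x is private automatically; y must be listed explicitly. */
#  pragma omp parallel for num_threads(2) private(y)
   for (x = 0; x < WIDTH; x++) {
      for (y = 0; y < HEIGHT; y++) {
         finalImage[x][y] = f(x, y);
      }
   }

   printf("finalImage[3][2] = %f\n", finalImage[3][2]);   /* expect 32.0 */
   return 0;
}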
17
Estimating π
\pi = 4 \left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots\right) = 4 \sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1}
double factor = 1.0;
double sum = 0.0;
for (k = 0; k < n; k++) {
   sum += factor/(2*k+1);
   factor = -factor;
}
pi_approx = 4.0*sum;

double factor = 1.0;
double sum = 0.0;
#  pragma omp parallel for \
      num_threads(thread_count) \
      reduction(+: sum)
   for (k = 0; k < n; k++) {
      sum += factor/(2*k+1);
      factor = -factor;
   }
pi_approx = 4.0*sum;

Is the parallel version correct? No: the statement factor = -factor creates a loop-carried dependence, so the iterations are not independent.
18
Estimating π

if (k % 2 == 0)
   factor = 1.0;
else
   factor = -1.0;
sum += factor/(2*k+1);

factor = (k % 2 == 0) ? 1.0 : -1.0;
sum += factor/(2*k+1);

double factor = 1.0;
double sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: sum) private(factor)
   for (k = 0; k < n; k++) {
      if (k % 2 == 0)
         factor = 1.0;
      else
         factor = -1.0;
      sum += factor/(2*k+1);
   }
pi_approx = 4.0*sum;
19
Scope Matters
double factor = 1.0;
double sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      default(none) reduction(+: sum) private(k, factor) shared(n)
   for (k = 0; k < n; k++) {
      if (k % 2 == 0)
         factor = 1.0;
      else
         factor = -1.0;
      sum += factor/(2*k+1);
   }
pi_approx = 4.0*sum;

• With the default(none) clause, we need to specify the scope of every variable used in the block that was declared outside the block.
• The value of a variable with private scope is unspecified at the beginning (and after completion) of a parallel or parallel for block.
• In particular, the initial value of the private copy of factor is unspecified; that is acceptable here because factor is assigned before it is used in every iteration.
20
Bubble Sort

for (len = n; len >= 2; len--)
   for (i = 0; i < len-1; i++)
      if (a[i] > a[i+1]) {
         tmp = a[i];
         a[i] = a[i+1];
         a[i+1] = tmp;
      }
• Can we make it faster?
• Can we parallelize the outer loop?
• Can we parallelize the inner loop?
21
Odd-Even Sort
Phase | Array contents (subscripts 0 1 2 3)
  0   | 9 7 8 6  ->  7 9 6 8
  1   | 7 9 6 8  ->  7 6 9 8
  2   | 7 6 9 8  ->  6 7 8 9
  3   | 6 7 8 9  ->  6 7 8 9
Any opportunities for parallelism?
22
Odd-Even Sort

void Odd_even_sort(int a[], int n) {
   int phase, i, temp;

   for (phase = 0; phase < n; phase++)
      if (phase % 2 == 0) {      /* Even phase */
         for (i = 1; i < n; i += 2)
            if (a[i-1] > a[i]) {
               temp = a[i];
               a[i] = a[i-1];
               a[i-1] = temp;
            }
      } else {                   /* Odd phase */
         for (i = 1; i < n-1; i += 2)
            if (a[i] > a[i+1]) {
               temp = a[i];
               a[i] = a[i+1];
               a[i+1] = temp;
            }
      }
}
23
Odd-Even Sort in OpenMP

for (phase = 0; phase < n; phase++) {
   if (phase % 2 == 0) {      /* Even phase */
#     pragma omp parallel for num_threads(thread_count) \
         default(none) shared(a, n) private(i, temp)
      for (i = 1; i < n; i += 2)
         if (a[i-1] > a[i]) {
            temp = a[i];
            a[i] = a[i-1];
            a[i-1] = temp;
         }
   } else {                   /* Odd phase */
#     pragma omp parallel for num_threads(thread_count) \
         default(none) shared(a, n) private(i, temp)
      for (i = 1; i < n-1; i += 2)
         if (a[i] > a[i+1]) {
            temp = a[i];
            a[i] = a[i+1];
            a[i+1] = temp;
         }
   }
}
24
Odd-Even Sort in OpenMP

#  pragma omp parallel num_threads(thread_count) \
      default(none) shared(a, n) private(i, temp, phase)
   for (phase = 0; phase < n; phase++) {
      if (phase % 2 == 0) {      /* Even phase */
#        pragma omp for
         for (i = 1; i < n; i += 2)
            if (a[i-1] > a[i]) {
               temp = a[i];
               a[i] = a[i-1];
               a[i-1] = temp;
            }
      } else {                   /* Odd phase */
#        pragma omp for
         for (i = 1; i < n-1; i += 2)
            if (a[i] > a[i+1]) {
               temp = a[i];
               a[i] = a[i+1];
               a[i+1] = temp;
            }
      }
   }
25
Data Partitioning
With 9 iterations (0 to 8) and 3 threads (0 to 2):

Block partition:  thread 0 gets iterations 0, 1, 2;  thread 1 gets 3, 4, 5;  thread 2 gets 6, 7, 8.
Cyclic partition: thread 0 gets iterations 0, 3, 6;  thread 1 gets 1, 4, 7;  thread 2 gets 2, 5, 8.
26
Scheduling Loops

double Z[N][N];
...
sum = 0.0;
for (i = 0; i < N; i++)
   sum += f(i);

double f(int r) {
   int i;
   double val = 0.0;

   for (i = r+1; i < N; i++) {
      val += sin(Z[r][i]);
   }
   return val;
}

Load balancing: the cost of f(i) decreases as i increases, so assigning each thread a block of consecutive iterations gives an uneven workload.
27
The schedule Clause

sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: sum) schedule(static, 1)
   for (i = 0; i < n; i++)
      sum += f(i);
n = 12, thread_count = 3 (the second argument of schedule is the chunksize):

schedule(static, 1):   Thread 0: 0, 3, 6, 9     Thread 1: 1, 4, 7, 10    Thread 2: 2, 5, 8, 11
schedule(static, 2):   Thread 0: 0, 1, 6, 7     Thread 1: 2, 3, 8, 9     Thread 2: 4, 5, 10, 11
schedule(static, 4):   Thread 0: 0, 1, 2, 3     Thread 1: 4, 5, 6, 7     Thread 2: 8, 9, 10, 11

The default schedule is approximately equivalent to schedule(static, total_iterations/thread_count).
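A small sketch (not from the slides) that prints which thread executes each iteration, so the cyclic assignment produced by schedule(static, 1) can be observed directly:

#include <stdio.h>
#include <omp.h>

int main(void) {
   int i;
   int n = 12;

#  pragma omp parallel for num_threads(3) schedule(static, 1)
   for (i = 0; i < n; i++) {
      /* Under schedule(static, 1), thread t gets iterations t, t+3, t+6, ... */
      printf("iteration %2d executed by thread %d\n", i, omp_get_thread_num());
   }
   return 0;
}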
28
The dynamic and guided Types
• In a dynamic schedule:
  – Iterations are broken into chunks of chunksize consecutive iterations (default chunksize: 1).
  – Each thread executes a chunk; when a thread finishes a chunk, it requests another one.
• In a guided schedule:
  – Each thread executes a chunk; when a thread finishes a chunk, it requests another one.
  – As chunks are completed, the size of the new chunks decreases: each new chunk is approximately the number of remaining iterations divided by the number of threads.
  – The size of chunks decreases down to chunksize, or to 1 if chunksize is not specified (see the sketch below and the example table on the next slide).
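A minimal usage sketch (not from the slides; work and the chunk size are illustrative) showing how the schedule clause selects these policies for a loop whose per-iteration cost varies:

#include <stdio.h>
#include <omp.h>

/* Illustrative work whose cost grows with i, so a plain block partition is unbalanced. */
double work(int i) {
   double v = 0.0;
   for (int k = 0; k < i * 1000; k++)
      v += 1.0;
   return v;
}

int main(void) {
   double sum = 0.0;
   int i;

   /* Chunks of 4 iterations are handed out on demand; schedule(guided) would
      instead start with large chunks and shrink them as the loop nears completion. */
#  pragma omp parallel for num_threads(4) schedule(dynamic, 4) reduction(+: sum)
   for (i = 0; i < 1000; i++)
      sum += work(i);

   printf("sum = %f\n", sum);
   return 0;
}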
29
Example of a guided Schedule (two threads, 9,999 iterations in total)

Thread | Chunk       | Size of Chunk | Remaining Iterations
   0   | 1 - 5000    | 5000          | 4999
   1   | 5001 - 7500 | 2500          | 2499
   1   | 7501 - 8750 | 1250          | 1249
   1   | 8751 - 9375 |  625          |  624
   0   | 9376 - 9687 |  312          |  312
   1   | 9688 - 9843 |  156          |  156
   0   | 9844 - 9921 |   78          |   78
   1   | 9922 - 9960 |   39          |   39
   1   | 9961 - 9980 |   20          |   19
   1   | 9981 - 9990 |   10          |    9
   1   | 9991 - 9995 |    5          |    4
   0   | 9996 - 9997 |    2          |    2
   1   | 9998 - 9998 |    1          |    1
   0   | 9999 - 9999 |    1          |    0
30
Which schedule?
• The optimal schedule depends on:
  – The type of problem
  – The number of iterations
  – The number of threads
• Overhead: guided > dynamic > static
  – If you are getting satisfactory results (e.g., close to the theoretical maximum speedup) without a schedule clause, go no further.
• The cost of iterations:
  – If it is roughly the same for every iteration, use the default schedule.
  – If it decreases or increases linearly as the loop executes, a static schedule with small chunksize values will be good.
  – If it cannot be determined in advance, try exploring different options (the schedule(runtime) sketch below is one convenient way to do so).
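One way to explore the options without recompiling is the standard schedule(runtime) clause, which takes the schedule from the OMP_SCHEDULE environment variable; this goes slightly beyond the slides, so treat the following as an optional sketch:

#include <stdio.h>
#include <omp.h>

int main(void) {
   double sum = 0.0;
   int i;

   /* The actual schedule is taken from the OMP_SCHEDULE environment variable,
      e.g.  export OMP_SCHEDULE="guided,4"  or  OMP_SCHEDULE="dynamic,1". */
#  pragma omp parallel for num_threads(4) schedule(runtime) reduction(+: sum)
   for (i = 0; i < 10000; i++)
      sum += i * 0.001;

   printf("sum = %f\n", sum);
   return 0;
}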
31
Performance Issue

Matrix-vector multiplication: y = A x

#  pragma omp parallel for num_threads(thread_count) \
      default(none) private(i, j) shared(A, x, y, m, n)
   for (i = 0; i < m; i++) {
      y[i] = 0.0;
      for (j = 0; j < n; j++)
         y[i] += A[i][j]*x[j];
   }
32
Performance Issue
Number of | 8,000,000 x 8       | 8,000 x 8,000       | 8 x 8,000,000
Threads   | Time    Efficiency  | Time    Efficiency  | Time    Efficiency
    1     | 0.322   1.000       | 0.264   1.000       | 0.333   1.000
    2     | 0.219   0.735       | 0.189   0.698       | 0.300   0.555
    4     | 0.141   0.571       | 0.119   0.555       | 0.303   0.275

Possible causes of the poor scaling: cache misses and false sharing (discussed next).
33
Performance Issue
• 8,000,000-by-8
  – y has 8,000,000 elements: potentially a large number of write misses.
• 8-by-8,000,000
  – x has 8,000,000 elements: potentially a large number of read misses.
• 8-by-8,000,000
  – y has 8 elements (8 doubles), which could all be stored in the same cache line (64 bytes).
  – Potentially a serious false-sharing effect for multiple processors (a mitigation sketch follows below).
• 8,000-by-8,000
  – y has 8,000 elements (8,000 doubles).
  – Thread 2: elements 4000 to 5999; Thread 3: elements 6000 to 7999.
  – Only the boundary elements {y[5996], y[5997], y[5998], y[5999], y[6000], y[6001], y[6002], y[6003]} could share a cache line.
  – The effect of false sharing is highly unlikely.
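A sketch of one common mitigation (not from the slides): aligning and padding per-thread data so that each thread's accumulator occupies its own cache line. The 64-byte line size and the use of C11 _Alignas are assumptions:

#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4

/* C11: align each slot to a 64-byte cache line so threads never share a line. */
struct padded_double {
   _Alignas(64) double value;
};

int main(void) {
   struct padded_double partial[NUM_THREADS];
   double total = 0.0;
   int t;

#  pragma omp parallel num_threads(NUM_THREADS)
   {
      int my_rank = omp_get_thread_num();
      int i;

      partial[my_rank].value = 0.0;
      for (i = 0; i < 1000000; i++)
         partial[my_rank].value += 1.0;   /* repeated writes, but no false sharing */
   }

   for (t = 0; t < NUM_THREADS; t++)
      total += partial[t].value;

   printf("total = %f\n", total);   /* expect 4000000.0 */
   return 0;
}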
34
Thread Safety
• How to generate random numbers in C?– First, call srand() with an integer seed.– Second, call rand() to create a sequence of random numbers.
• Pseudorandom Number Generator (PRNG)
• Is it thread safe?– Can it be simultaneously executed by multiple threads without causing problems?
The underlying generator is typically a linear congruential generator: X_{n+1} = (a X_n + c) \bmod m.

Shared state: rand() keeps the generator state X_n in shared (static) storage, so concurrent calls from multiple threads race on it.
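A sketch of one way to make the generator thread safe (not from the slides): give each thread its own state. The LCG constants below are used purely for illustration; POSIX rand_r() offers a similar reentrant interface:

#include <stdio.h>
#include <omp.h>

/* Illustrative LCG step: X_{n+1} = (a*X_n + c) mod 2^32, reduced to 15 bits. */
static unsigned lcg_next(unsigned* state_p) {
   *state_p = 1103515245u * (*state_p) + 12345u;
   return (*state_p >> 16) & 0x7FFF;
}

int main(void) {
#  pragma omp parallel num_threads(4)
   {
      int my_rank = omp_get_thread_num();
      unsigned state = 1u + (unsigned) my_rank;   /* each thread owns its own state (seed) */
      unsigned r = 0;
      int i;

      for (i = 0; i < 5; i++)
         r = lcg_next(&state);   /* no shared state, hence thread safe */

      printf("thread %d: last value %u\n", my_rank, r);
   }
   return 0;
}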
35
Foster’s Methodology
• Partitioning
  – Divide the computation and the data into small tasks.
  – Identify tasks that can be executed in parallel.
• Communication
  – Determine what communication needs to be carried out.
  – Local communication vs. global communication.
• Agglomeration
  – Group tasks into larger tasks.
  – Reduce communication.
  – Task dependence.
• Mapping
  – Assign the composite tasks to processes/threads.
36
Foster’s Methodology
37
The n-body Problem
• To predict the motion of a group of objects that interact with each other gravitationally over a period of time.
  – Inputs: mass, position and velocity of each object.
• Astrophysicist
  – The positions and velocities of a collection of stars.
• Chemist
  – The positions and velocities of a collection of molecules.
38
Newton’s Law
The force on particle q due to particle k, the total force on particle q, and its acceleration:

f_{qk}(t) = -\frac{G m_q m_k}{|s_q(t) - s_k(t)|^3} \bigl(s_q(t) - s_k(t)\bigr)

F_q(t) = -G m_q \sum_{k=0,\, k \neq q}^{n-1} \frac{m_k}{|s_q(t) - s_k(t)|^3} \bigl(s_q(t) - s_k(t)\bigr)

a_q(t) = \frac{F_q(t)}{m_q} = -G \sum_{k=0,\, k \neq q}^{n-1} \frac{m_k}{|s_q(t) - s_k(t)|^3} \bigl(s_q(t) - s_k(t)\bigr)
39
The Basic Algorithm

Get input data;
for each timestep {
   if (timestep output)
      Print positions and velocities of particles;
   for each particle q
      Compute total force on q;
   for each particle q
      Compute position and velocity of q;
}

for each particle q {
   forces[q][0] = forces[q][1] = 0;
   for each particle k != q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      forces[q][0] -= G*masses[q]*masses[k]/dist_cubed * x_diff;
      forces[q][1] -= G*masses[q]*masses[k]/dist_cubed * y_diff;
   }
}
40
Newton’s 3rd Law of Motion
(Figure: with n = 12 particles, the force pair between particles q = 8 and r = 3 satisfies f_83 = -f_38, and similarly f_85 = -f_58. Each pairwise force therefore needs to be computed only once.)
41
The Reduced Algorithm

for each particle q
   forces[q][0] = forces[q][1] = 0;

for each particle q {
   for each particle k > q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
      force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;

      forces[q][0] += force_qk[0];
      forces[q][1] += force_qk[1];
      forces[k][0] -= force_qk[0];
      forces[k][1] -= force_qk[1];
   }
}
42
Euler Method
y(t_0 + \Delta t) \approx y(t_0) + \Delta t \, y'(t_0)
43
Position and Velocity
s_q(\Delta t) \approx s_q(0) + \Delta t \, s_q'(0) = s_q(0) + \Delta t \, v_q(0)
v_q(\Delta t) \approx v_q(0) + \Delta t \, v_q'(0) = v_q(0) + \Delta t \, a_q(0) = v_q(0) + \Delta t \, F_q(0)/m_q

s_q(2\Delta t) \approx s_q(\Delta t) + \Delta t \, v_q(\Delta t)
v_q(2\Delta t) \approx v_q(\Delta t) + \Delta t \, a_q(\Delta t) = v_q(\Delta t) + \Delta t \, F_q(\Delta t)/m_q
for each particle q {
   pos[q][0] += delta_t*vel[q][0];
   pos[q][1] += delta_t*vel[q][1];
   vel[q][0] += delta_t*forces[q][0]/masses[q];
   vel[q][1] += delta_t*forces[q][1]/masses[q];
}
44
Communications: Basic
(Task graph: computing F_q(t) and F_r(t) requires the current positions s_q(t), s_r(t); the forces, together with v_q(t) and v_r(t), are then used to compute s_q(t + Δt), v_q(t + Δt), s_r(t + Δt), v_r(t + Δt), which in turn feed F_q(t + Δt) and F_r(t + Δt).)
45
Agglomeration: Basic
(In the basic version, each composite task owns s_q, v_q, F_q for its particle; at timestep t and again at t + Δt, the tasks for particles q and r exchange the positions s_q and s_r with each other.)
46
Agglomeration: Reduced
(In the reduced version, each composite task again owns s_q, v_q, F_q; for a pair with q < r, the task for r sends s_r to the task for q, and the task for q sends the pairwise force f_qr back, at timestep t and again at t + Δt.)
47
Parallelizing the Basic Solver

#  pragma omp parallel
   for each timestep {
      if (timestep output) {
#        pragma omp single nowait
         Print positions and velocities of particles;
      }
#     pragma omp for
      for each particle q
         Compute total force on q;
#     pragma omp for
      for each particle q
         Compute position and velocity of q;
   }
Race Conditions?
48
Parallelizing the Reduced Solver

#  pragma omp for
   for each particle q
      forces[q][0] = forces[q][1] = 0;

#  pragma omp for
   for each particle q {
      for each particle k > q {
         x_diff = pos[q][0] - pos[k][0];
         y_diff = pos[q][1] - pos[k][1];
         dist = sqrt(x_diff*x_diff + y_diff*y_diff);
         dist_cubed = dist*dist*dist;
         force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
         force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;

         forces[q][0] += force_qk[0];
         forces[q][1] += force_qk[1];
         forces[k][0] -= force_qk[0];
         forces[k][1] -= force_qk[1];
      }
   }
49
Does it work properly?
• Consider 2 threads and 4 particles.
• Thread 1 is assigned particle 0 and particle 1.
• Thread 2 is assigned particle 2 and particle 3.
• F3=-f03-f13-f23
• Who will calculate f03 and f13?
• Who will calculate f23?
• Any race conditions?
50
Thread Contributions
Thread | Particle | Contribution of thread 0     | thread 1           | thread 2
   0   |    0     | f01+f02+f03+f04+f05          | 0                  | 0
   0   |    1     | -f01+f12+f13+f14+f15         | 0                  | 0
   1   |    2     | -f02-f12                     | f23+f24+f25        | 0
   1   |    3     | -f03-f13                     | -f23+f34+f35       | 0
   2   |    4     | -f04-f14                     | -f24-f34           | f45
   2   |    5     | -f05-f15                     | -f25-f35           | -f45
3 Threads, 6 Particles, Block Partition
51
Thread Contributions
Thread | Particle | Contribution of thread 0     | thread 1           | thread 2
   0   |    0     | f01+f02+f03+f04+f05          | 0                  | 0
   1   |    1     | -f01                         | f12+f13+f14+f15    | 0
   2   |    2     | -f02                         | -f12               | f23+f24+f25
   0   |    3     | -f03+f34+f35                 | -f13               | -f23
   1   |    4     | -f04-f34                     | -f14+f45           | -f24
   2   |    5     | -f05-f35                     | -f15-f45           | -f25
3 Threads, 6 Particles, Cyclic Partition
52
First Phase
#  pragma omp for
   for each particle q {
      for each particle k > q {
         x_diff = pos[q][0] - pos[k][0];
         y_diff = pos[q][1] - pos[k][1];
         dist = sqrt(x_diff*x_diff + y_diff*y_diff);
         dist_cubed = dist*dist*dist;
         force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
         force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;

         loc_forces[my_rank][q][0] += force_qk[0];
         loc_forces[my_rank][q][1] += force_qk[1];
         loc_forces[my_rank][k][0] -= force_qk[0];
         loc_forces[my_rank][k][1] -= force_qk[1];
      }
   }
53
Second Phase
#  pragma omp for
   for (q = 0; q < n; q++) {
      forces[q][0] = forces[q][1] = 0;
      for (thread = 0; thread < thread_count; thread++) {
         forces[q][0] += loc_forces[thread][q][0];
         forces[q][1] += loc_forces[thread][q][1];
      }
   }
Race Conditions?
• In the first phase, each thread carries out the same calculations as before but the values are stored in its own array of forces (loc_forces).
• In the second phase, the thread that has been assigned particle q will add the contributions that have been computed by different threads.
54
Evaluating the OpenMP Codes
• In the reduced code:
  – Loop 1: initialization of the loc_forces array.
  – Loop 2: the first phase of the computation of forces.
  – Loop 3: the second phase of the computation of forces.
  – Loop 4: the updating of positions and velocities.
• Which schedule should be used?
Threads | Basic | Reduced, Default | Reduced, Forces Cyclic | Reduced, All Cyclic
   1    | 7.71  | 3.90             | 3.90                   | 3.90
   2    | 3.87  | 2.94             | 1.98                   | 2.01
   4    | 1.95  | 1.73             | 1.01                   | 1.08
   8    | 0.99  | 0.95             | 0.54                   | 0.61
55
Review
• What are the major differences between MPI and OpenMP?
• What is the scope of a variable?
• What is a reduction variable?
• How to ensure mutual exclusion in a critical section?
• What are the common loop scheduling options?
• What is a thread safe function?
• What factors may affect the performance of an OpenMP program?