Open Multiprocessing
Dr. Bo Yuan, E-mail: [email protected]
2
Note on Parallel Programming
• An incorrect program may produce correct results.
– The order of execution of processes/threads is unpredictable.
– May depend on your luck!
• A program that always produces correct results may still not make sense.
– The outputs of a program are just part of the story.
– Efficiency matters!
3
OpenMP
• An API for shared-memory multiprocessing (parallel) programming in C, C++ and Fortran.
  – Supports multiple platforms (processor architectures and operating systems).
  – Higher-level implementation (you mark a block of code that should be executed in parallel).
• A method of parallelizing whereby a master thread forks a number of slave threads and a task is divided among them.
• Based on preprocessor directives (pragmas)
  – Requires compiler support.
  – omp.h
• References
  – http://openmp.org/
  – https://computing.llnl.gov/tutorials/openMP/
  – http://supercomputingblog.com/openmp/
4
Hello, World!

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Hello(void);

int main(int argc, char* argv[]) {
   /* Get the number of threads from the command line */
   int thread_count = strtol(argv[1], NULL, 10);

#  pragma omp parallel num_threads(thread_count)
   Hello();

   return 0;
}

void Hello(void) {
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();

   printf("Hello from thread %d of %d\n", my_rank, thread_count);
}
5
Definitions

# pragma omp parallel [clauses]
{
   code_block
}

• Clause: text that modifies the directive.
• Thread team = master + slaves; there is an implicit barrier at the end of the parallel block.

Error Checking

#ifdef _OPENMP
#  include <omp.h>
#endif

#ifdef _OPENMP
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();
#else
   int my_rank = 0;
   int thread_count = 1;
#endif
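For concreteness, here is a minimal, self-contained sketch (not from the slides; the values of n and thread_count are illustrative) showing a parallel directive with clauses, a shared and a private variable, and the implicit barrier:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void) {
   int n = 8;              /* declared before the parallel block: shared */
   int thread_count = 4;

   /* num_threads is a clause: text that modifies the parallel directive. */
#  pragma omp parallel num_threads(thread_count) shared(n)
   {
      /* my_rank is declared inside the block: private to each thread */
#     ifdef _OPENMP
      int my_rank = omp_get_thread_num();
#     else
      int my_rank = 0;
#     endif
      printf("Thread %d sees n = %d\n", my_rank, n);
   }  /* implicit barrier: every thread finishes before execution continues */

   return 0;
}

Without OpenMP support, the pragma is ignored and the block runs once on a single thread.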
6
The Trapezoidal Rule
/* Input: a, b, n */
h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
for (i = 1; i <= n-1; i++) {
   x_i = a + i*h;
   approx += f(x_i);
}
approx = h*approx;

Each thread (e.g., thread 0, thread 2) computes a partial sum over its own subinterval and then adds it to the shared result:

# pragma omp critical
global_result += my_result;

Shared memory and shared variables imply a potential race condition, hence the critical directive.
7
The critical Directive
# pragma omp critical
y = f(x);
...
double f(double x) {
#  pragma omp critical
   z = g(x);
   ...
}
Cannot be executed simultaneously!
# pragma omp critical(one)
y = f(x);
...
double f(double x) {
#  pragma omp critical(two)
   z = g(x);
   ...
}
Deadlock
8
The atomic Directive

# pragma omp atomic
x <op>= <expression>;

<op> can be one of the binary operators: +, *, -, /, &, ^, |, <<, >>
The statement may also take the forms x++, ++x, x--, or --x.

• Higher performance than the critical directive.
• Only a single C assignment statement is protected.
• Only the load and store of x are protected.
• <expression> must not reference x.

# pragma omp atomic
x += f(y);

# pragma omp critical
x = g(x);

The two protected statements above can be executed simultaneously! (An atomic update and a critical section are not guaranteed to exclude each other, so mixing them on the same variable is a potential race; see the counter sketch below for plain atomic use.)
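As a small illustration (not from the slides; the loop bound and counter name are illustrative), atomic can protect a simple shared-counter update:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[]) {
   int thread_count = strtol(argv[1], NULL, 10);
   int hits = 0;   /* shared counter */

#  pragma omp parallel num_threads(thread_count)
   {
      int i;
      for (i = 0; i < 1000; i++) {
         /* Protect only the load/update/store of hits. */
#        pragma omp atomic
         hits += 1;
      }
   }

   /* Expect thread_count * 1000. */
   printf("hits = %d\n", hits);
   return 0;
}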
9
Locks

/* Executed by one thread */
Initialize the lock data structure;
...
/* Executed by multiple threads */
Attempt to lock or set the lock data structure;
Critical section;
Unlock or unset the lock data structure;
...
/* Executed by one thread */
Destroy the lock data structure;

void omp_init_lock(omp_lock_t* lock_p);
void omp_set_lock(omp_lock_t* lock_p);
void omp_unset_lock(omp_lock_t* lock_p);
void omp_destroy_lock(omp_lock_t* lock_p);
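A minimal sketch of the pattern above (not from the slides; the per-thread value is illustrative), protecting a shared sum with an explicit lock:

#include <stdio.h>
#include <omp.h>

int main(void) {
   double sum = 0.0;
   omp_lock_t lock;

   omp_init_lock(&lock);          /* executed once, before the parallel block */

#  pragma omp parallel num_threads(4)
   {
      double my_part = omp_get_thread_num() + 1.0;   /* illustrative per-thread value */

      omp_set_lock(&lock);        /* attempt to acquire the lock */
      sum += my_part;             /* critical section */
      omp_unset_lock(&lock);      /* release the lock */
   }

   omp_destroy_lock(&lock);       /* executed once, after the parallel block */
   printf("sum = %f\n", sum);     /* expect 1+2+3+4 = 10 */
   return 0;
}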
10
Trapezoidal Rule in OpenMP

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Trap(double a, double b, int n, double* global_result_p);

int main(int argc, char* argv[]) {
   double global_result = 0.0;
   double a, b;
   int n, thread_count;

   thread_count = strtol(argv[1], NULL, 10);
   printf("Enter a, b, and n\n");
   scanf("%lf %lf %d", &a, &b, &n);

#  pragma omp parallel num_threads(thread_count)
   Trap(a, b, n, &global_result);

   printf("With n = %d trapezoids, our estimate\n", n);
   printf("of the integral from %f to %f = %.15e\n", a, b, global_result);
   return 0;
}
11
Trapezoidal Rule in OpenMP

void Trap(double a, double b, int n, double* global_result_p) {
   double h, x, my_result;
   double local_a, local_b;
   int i, local_n;
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();

   h = (b-a)/n;
   local_n = n/thread_count;
   local_a = a + my_rank*local_n*h;
   local_b = local_a + local_n*h;
   my_result = (f(local_a) + f(local_b))/2.0;
   for (i = 1; i <= local_n-1; i++) {
      x = local_a + i*h;
      my_result += f(x);
   }
   my_result = my_result*h;

#  pragma omp critical
   *global_result_p += my_result;
}
12
Scope of Variables
Private scope
• Only accessible by a single thread.
• Declared in the code block (or in functions called from it).
• Examples: my_rank, my_result, global_result_p.

Shared scope
• Accessible by all threads in a team.
• Declared before a parallel directive.
• Examples: a, b, n, global_result, thread_count (and *global_result_p, which refers to the shared global_result).
In serial programming:
• Function-wide scope
• File-wide scope
13
Another Trap Function

double Local_trap(double a, double b, int n);

global_result = 0.0;
#  pragma omp parallel num_threads(thread_count)
   {
#     pragma omp critical
      global_result += Local_trap(a, b, n);
   }

global_result = 0.0;
#  pragma omp parallel num_threads(thread_count)
   {
      double my_result = 0.0;   /* private */
      my_result = Local_trap(a, b, n);
#     pragma omp critical
      global_result += my_result;
   }

In the first version the call to Local_trap is inside the critical section, so the threads' computations are serialized; in the second version each thread computes my_result in parallel and only the final addition is serialized.
14
The Reduction Clause
• Reduction: A computation (binary operation) that repeatedly applies the same reduction operator (e.g., addition or multiplication) to a sequence of operands in order to get a single result.
• Note:
  – The reduction variable itself is shared.
  – A private variable is created for each thread in the team.
  – The private variables are initialized to 0 for the addition operator (more generally, to the identity of the reduction operator).

reduction(<operator>: <variable list>)

global_result = 0.0;
#  pragma omp parallel num_threads(thread_count) \
      reduction(+: global_result)
   global_result += Local_trap(a, b, n);
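A self-contained sketch of the clause (not from the slides; My_part and the thread count are illustrative):

#include <stdio.h>
#include <omp.h>

/* Illustrative per-thread work: each thread just returns its rank + 1. */
double My_part(void) {
   return omp_get_thread_num() + 1.0;
}

int main(void) {
   double global_result = 0.0;   /* reduction variable: shared */

#  pragma omp parallel num_threads(4) reduction(+: global_result)
   global_result += My_part();   /* each thread adds into its private copy */

   /* The private copies (initialized to 0 for +) are combined here: 1+2+3+4 = 10. */
   printf("global_result = %f\n", global_result);
   return 0;
}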
15
The parallel for Directive
h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
for (i = 1; i <= n-1; i++) {
   approx += f(a + i*h);
}
approx = h*approx;

h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: approx)
   for (i = 1; i <= n-1; i++) {
      approx += f(a + i*h);
   }
approx = h*approx;
• The code block must be a for loop.
• Iterations of the for loop are divided among threads.
• approx is a reduction variable.
• i is a private variable.
16
The parallel for Directive
• Sounds like a truly wonderful approach to parallelizing serial programs.
• Does not work with while or do-while loops.
  – How about converting them into for loops?
• The number of iterations must be determined in advance, so loops like the following cannot be parallelized:

for (; ;) {
   ...
}

for (i = 0; i < n; i++) {
   if (...) break;
   ...
}

int x, y;
#  pragma omp parallel for num_threads(thread_count)
   for (x = 0; x < width; x++) {
      for (y = 0; y < height; y++) {
         finalImage[x][y] = f(x, y);
      }
   }

• Here x, the loop variable of the parallelized loop, is automatically private, but y (declared before the directive) is shared by default: the directive needs a private(y) clause, as in the corrected sketch below.
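A corrected, self-contained sketch under illustrative assumptions (small width and height, a trivial f):

#include <stdio.h>
#include <omp.h>

#define WIDTH  4
#define HEIGHT 3

/* Illustrative pixel function. */
double f(int x, int y) {
   return x * 10.0 + y;
}

int main(void) {
   double finalImage[WIDTH][HEIGHT];
   int x, y;

   /* x is private automatically; y must be listed explicitly. */
#  pragma omp parallel for num_threads(2) private(y)
   for (x = 0; x < WIDTH; x++) {
      for (y = 0; y < HEIGHT; y++) {
         finalImage[x][y] = f(x, y);
      }
   }

   printf("finalImage[3][2] = %f\n", finalImage[3][2]);   /* expect 32.0 */
   return 0;
}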
17
Estimating π
\pi = 4 \left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots\right) = 4 \sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1}
double factor = 1.0;
double sum = 0.0;
for (k = 0; k < n; k++) {
   sum += factor/(2*k+1);
   factor = -factor;
}
pi_approx = 4.0*sum;

double factor = 1.0;
double sum = 0.0;
#  pragma omp parallel for \
      num_threads(thread_count) \
      reduction(+: sum)
   for (k = 0; k < n; k++) {
      sum += factor/(2*k+1);
      factor = -factor;
   }
pi_approx = 4.0*sum;

Is the parallel version correct? No: the statement factor = -factor creates a loop-carried dependence, so the iterations are not independent.
18
Estimating π

if (k % 2 == 0)
   factor = 1.0;
else
   factor = -1.0;
sum += factor/(2*k+1);

factor = (k % 2 == 0) ? 1.0 : -1.0;
sum += factor/(2*k+1);

double factor = 1.0;
double sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: sum) private(factor)
   for (k = 0; k < n; k++) {
      if (k % 2 == 0)
         factor = 1.0;
      else
         factor = -1.0;
      sum += factor/(2*k+1);
   }
pi_approx = 4.0*sum;
19
Scope Matters
double factor = 1.0;
double sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      default(none) reduction(+: sum) private(k, factor) shared(n)
   for (k = 0; k < n; k++) {
      if (k % 2 == 0)
         factor = 1.0;
      else
         factor = -1.0;
      sum += factor/(2*k+1);
   }
pi_approx = 4.0*sum;

• With the default(none) clause, we need to specify the scope of every variable used in the block that was declared outside the block.
• The value of a variable with private scope is unspecified at the beginning (and after completion) of a parallel or parallel for block.
• In particular, the initial value of the private copy of factor is unspecified; that is acceptable here because factor is assigned before it is used in every iteration.
20
Bubble Sort

for (len = n; len >= 2; len--)
   for (i = 0; i < len-1; i++)
      if (a[i] > a[i+1]) {
         tmp = a[i];
         a[i] = a[i+1];
         a[i+1] = tmp;
      }
• Can we make it faster?
• Can we parallelize the outer loop?
• Can we parallelize the inner loop?
21
Odd-Even Sort
Phase | Array contents (subscripts 0 1 2 3)
  0   | 9 7 8 6  ->  7 9 6 8
  1   | 7 9 6 8  ->  7 6 9 8
  2   | 7 6 9 8  ->  6 7 8 9
  3   | 6 7 8 9  ->  6 7 8 9
Any opportunities for parallelism?
22
Odd-Even Sort

void Odd_even_sort(int a[], int n) {
   int phase, i, temp;

   for (phase = 0; phase < n; phase++)
      if (phase % 2 == 0) {      /* Even phase */
         for (i = 1; i < n; i += 2)
            if (a[i-1] > a[i]) {
               temp = a[i];
               a[i] = a[i-1];
               a[i-1] = temp;
            }
      } else {                   /* Odd phase */
         for (i = 1; i < n-1; i += 2)
            if (a[i] > a[i+1]) {
               temp = a[i];
               a[i] = a[i+1];
               a[i+1] = temp;
            }
      }
}
23
Odd-Even Sort in OpenMP

for (phase = 0; phase < n; phase++) {
   if (phase % 2 == 0) {      /* Even phase */
#     pragma omp parallel for num_threads(thread_count) \
         default(none) shared(a, n) private(i, temp)
      for (i = 1; i < n; i += 2)
         if (a[i-1] > a[i]) {
            temp = a[i];
            a[i] = a[i-1];
            a[i-1] = temp;
         }
   } else {                   /* Odd phase */
#     pragma omp parallel for num_threads(thread_count) \
         default(none) shared(a, n) private(i, temp)
      for (i = 1; i < n-1; i += 2)
         if (a[i] > a[i+1]) {
            temp = a[i];
            a[i] = a[i+1];
            a[i+1] = temp;
         }
   }
}
24
Odd-Even Sort in OpenMP

#  pragma omp parallel num_threads(thread_count) \
      default(none) shared(a, n) private(i, temp, phase)
   for (phase = 0; phase < n; phase++) {
      if (phase % 2 == 0) {      /* Even phase */
#        pragma omp for
         for (i = 1; i < n; i += 2)
            if (a[i-1] > a[i]) {
               temp = a[i];
               a[i] = a[i-1];
               a[i-1] = temp;
            }
      } else {                   /* Odd phase */
#        pragma omp for
         for (i = 1; i < n-1; i += 2)
            if (a[i] > a[i+1]) {
               temp = a[i];
               a[i] = a[i+1];
               a[i+1] = temp;
            }
      }
   }
25
Data Partitioning
With 9 iterations (0 to 8) and 3 threads (0 to 2):

Block partition:  thread 0 gets iterations 0, 1, 2;  thread 1 gets 3, 4, 5;  thread 2 gets 6, 7, 8.
Cyclic partition: thread 0 gets iterations 0, 3, 6;  thread 1 gets 1, 4, 7;  thread 2 gets 2, 5, 8.
26
Scheduling Loops

double Z[N][N];
...
sum = 0.0;
for (i = 0; i < N; i++)
   sum += f(i);

double f(int r) {
   int i;
   double val = 0.0;

   for (i = r+1; i < N; i++) {
      val += sin(Z[r][i]);
   }
   return val;
}

Load balancing: the cost of f(i) decreases as i increases, so assigning each thread a block of consecutive iterations gives an uneven workload.
27
The schedule Clause

sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: sum) schedule(static, 1)
   for (i = 0; i < n; i++)
      sum += f(i);
n = 12, thread_count = 3 (the second argument of schedule is the chunksize):

schedule(static, 1):   Thread 0: 0, 3, 6, 9     Thread 1: 1, 4, 7, 10    Thread 2: 2, 5, 8, 11
schedule(static, 2):   Thread 0: 0, 1, 6, 7     Thread 1: 2, 3, 8, 9     Thread 2: 4, 5, 10, 11
schedule(static, 4):   Thread 0: 0, 1, 2, 3     Thread 1: 4, 5, 6, 7     Thread 2: 8, 9, 10, 11

The default schedule is approximately equivalent to schedule(static, total_iterations/thread_count).
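A small sketch (not from the slides) that prints which thread executes each iteration, so the cyclic assignment produced by schedule(static, 1) can be observed directly:

#include <stdio.h>
#include <omp.h>

int main(void) {
   int i;
   int n = 12;

#  pragma omp parallel for num_threads(3) schedule(static, 1)
   for (i = 0; i < n; i++) {
      /* Under schedule(static, 1), thread t gets iterations t, t+3, t+6, ... */
      printf("iteration %2d executed by thread %d\n", i, omp_get_thread_num());
   }
   return 0;
}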
28
The dynamic and guided Types
• In a dynamic schedule:
  – Iterations are broken into chunks of chunksize consecutive iterations (default chunksize: 1).
  – Each thread executes a chunk; when a thread finishes a chunk, it requests another one.
• In a guided schedule:
  – Each thread executes a chunk; when a thread finishes a chunk, it requests another one.
  – As chunks are completed, the size of the new chunks decreases: each new chunk is approximately the number of remaining iterations divided by the number of threads.
  – The size of chunks decreases down to chunksize, or to 1 if chunksize is not specified (see the sketch below and the example table on the next slide).
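A minimal usage sketch (not from the slides; work and the chunk size are illustrative) showing how the schedule clause selects these policies for a loop whose per-iteration cost varies:

#include <stdio.h>
#include <omp.h>

/* Illustrative work whose cost grows with i, so a plain block partition is unbalanced. */
double work(int i) {
   double v = 0.0;
   for (int k = 0; k < i * 1000; k++)
      v += 1.0;
   return v;
}

int main(void) {
   double sum = 0.0;
   int i;

   /* Chunks of 4 iterations are handed out on demand; schedule(guided) would
      instead start with large chunks and shrink them as the loop nears completion. */
#  pragma omp parallel for num_threads(4) schedule(dynamic, 4) reduction(+: sum)
   for (i = 0; i < 1000; i++)
      sum += work(i);

   printf("sum = %f\n", sum);
   return 0;
}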
29
Example of a guided Schedule (two threads, 9,999 iterations in total)

Thread | Chunk       | Size of Chunk | Remaining Iterations
   0   | 1 - 5000    | 5000          | 4999
   1   | 5001 - 7500 | 2500          | 2499
   1   | 7501 - 8750 | 1250          | 1249
   1   | 8751 - 9375 |  625          |  624
   0   | 9376 - 9687 |  312          |  312
   1   | 9688 - 9843 |  156          |  156
   0   | 9844 - 9921 |   78          |   78
   1   | 9922 - 9960 |   39          |   39
   1   | 9961 - 9980 |   20          |   19
   1   | 9981 - 9990 |   10          |    9
   1   | 9991 - 9995 |    5          |    4
   0   | 9996 - 9997 |    2          |    2
   1   | 9998 - 9998 |    1          |    1
   0   | 9999 - 9999 |    1          |    0
30
Which schedule?
• The optimal schedule depends on:
  – The type of problem
  – The number of iterations
  – The number of threads
• Overhead: guided > dynamic > static
  – If you are getting satisfactory results (e.g., close to the theoretical maximum speedup) without a schedule clause, go no further.
• The cost of iterations:
  – If it is roughly the same for every iteration, use the default schedule.
  – If it decreases or increases linearly as the loop executes, a static schedule with small chunksize values will be good.
  – If it cannot be determined in advance, try exploring different options (the schedule(runtime) sketch below is one convenient way to do so).
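One way to explore the options without recompiling is the standard schedule(runtime) clause, which takes the schedule from the OMP_SCHEDULE environment variable; this goes slightly beyond the slides, so treat the following as an optional sketch:

#include <stdio.h>
#include <omp.h>

int main(void) {
   double sum = 0.0;
   int i;

   /* The actual schedule is taken from the OMP_SCHEDULE environment variable,
      e.g.  export OMP_SCHEDULE="guided,4"  or  OMP_SCHEDULE="dynamic,1". */
#  pragma omp parallel for num_threads(4) schedule(runtime) reduction(+: sum)
   for (i = 0; i < 10000; i++)
      sum += i * 0.001;

   printf("sum = %f\n", sum);
   return 0;
}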
31
Performance Issue

Matrix-vector multiplication: y = A x

#  pragma omp parallel for num_threads(thread_count) \
      default(none) private(i, j) shared(A, x, y, m, n)
   for (i = 0; i < m; i++) {
      y[i] = 0.0;
      for (j = 0; j < n; j++)
         y[i] += A[i][j]*x[j];
   }
32
Performance Issue
Number of | 8,000,000 x 8       | 8,000 x 8,000       | 8 x 8,000,000
Threads   | Time    Efficiency  | Time    Efficiency  | Time    Efficiency
    1     | 0.322   1.000       | 0.264   1.000       | 0.333   1.000
    2     | 0.219   0.735       | 0.189   0.698       | 0.300   0.555
    4     | 0.141   0.571       | 0.119   0.555       | 0.303   0.275

Possible causes of the poor scaling: cache misses and false sharing (discussed next).
33
Performance Issue
• 8,000,000-by-8
  – y has 8,000,000 elements: potentially a large number of write misses.
• 8-by-8,000,000
  – x has 8,000,000 elements: potentially a large number of read misses.
• 8-by-8,000,000
  – y has 8 elements (8 doubles), which could all be stored in the same cache line (64 bytes).
  – Potentially a serious false-sharing effect for multiple processors (a mitigation sketch follows below).
• 8,000-by-8,000
  – y has 8,000 elements (8,000 doubles).
  – Thread 2: elements 4000 to 5999; Thread 3: elements 6000 to 7999.
  – Only the boundary elements {y[5996], y[5997], y[5998], y[5999], y[6000], y[6001], y[6002], y[6003]} could share a cache line.
  – The effect of false sharing is highly unlikely.
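A sketch of one common mitigation (not from the slides): aligning and padding per-thread data so that each thread's accumulator occupies its own cache line. The 64-byte line size and the use of C11 _Alignas are assumptions:

#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4

/* C11: align each slot to a 64-byte cache line so threads never share a line. */
struct padded_double {
   _Alignas(64) double value;
};

int main(void) {
   struct padded_double partial[NUM_THREADS];
   double total = 0.0;
   int t;

#  pragma omp parallel num_threads(NUM_THREADS)
   {
      int my_rank = omp_get_thread_num();
      int i;

      partial[my_rank].value = 0.0;
      for (i = 0; i < 1000000; i++)
         partial[my_rank].value += 1.0;   /* repeated writes, but no false sharing */
   }

   for (t = 0; t < NUM_THREADS; t++)
      total += partial[t].value;

   printf("total = %f\n", total);   /* expect 4000000.0 */
   return 0;
}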
34
Thread Safety
• How to generate random numbers in C?– First, call srand() with an integer seed.– Second, call rand() to create a sequence of random numbers.
• Pseudorandom Number Generator (PRNG)
• Is it thread safe?– Can it be simultaneously executed by multiple threads without causing problems?
The underlying generator is typically a linear congruential generator: X_{n+1} = (a X_n + c) \bmod m.

Shared state: rand() keeps the generator state X_n in shared (static) storage, so concurrent calls from multiple threads race on it.
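A sketch of one way to make the generator thread safe (not from the slides): give each thread its own state. The LCG constants below are used purely for illustration; POSIX rand_r() offers a similar reentrant interface:

#include <stdio.h>
#include <omp.h>

/* Illustrative LCG step: X_{n+1} = (a*X_n + c) mod 2^32, reduced to 15 bits. */
static unsigned lcg_next(unsigned* state_p) {
   *state_p = 1103515245u * (*state_p) + 12345u;
   return (*state_p >> 16) & 0x7FFF;
}

int main(void) {
#  pragma omp parallel num_threads(4)
   {
      int my_rank = omp_get_thread_num();
      unsigned state = 1u + (unsigned) my_rank;   /* each thread owns its own state (seed) */
      unsigned r = 0;
      int i;

      for (i = 0; i < 5; i++)
         r = lcg_next(&state);   /* no shared state, hence thread safe */

      printf("thread %d: last value %u\n", my_rank, r);
   }
   return 0;
}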
35
Foster’s Methodology
• Partitioning
  – Divide the computation and the data into small tasks.
  – Identify tasks that can be executed in parallel.
• Communication
  – Determine what communication needs to be carried out.
  – Local communication vs. global communication.
• Agglomeration
  – Group tasks into larger tasks.
  – Reduce communication.
  – Task dependence.
• Mapping
  – Assign the composite tasks to processes/threads.
36
Foster’s Methodology
37
The n-body Problem
• To predict the motion of a group of objects that interact with each other gravitationally over a period of time.
  – Inputs: mass, position and velocity of each object.
• Astrophysicist
  – The positions and velocities of a collection of stars.
• Chemist
  – The positions and velocities of a collection of molecules.
38
Newton’s Law
The force on particle q due to particle k, the total force on particle q, and its acceleration:

f_{qk}(t) = -\frac{G m_q m_k}{|s_q(t) - s_k(t)|^3} \bigl(s_q(t) - s_k(t)\bigr)

F_q(t) = -G m_q \sum_{k=0,\, k \neq q}^{n-1} \frac{m_k}{|s_q(t) - s_k(t)|^3} \bigl(s_q(t) - s_k(t)\bigr)

a_q(t) = \frac{F_q(t)}{m_q} = -G \sum_{k=0,\, k \neq q}^{n-1} \frac{m_k}{|s_q(t) - s_k(t)|^3} \bigl(s_q(t) - s_k(t)\bigr)
39
The Basic Algorithm

Get input data;
for each timestep {
   if (timestep output)
      Print positions and velocities of particles;
   for each particle q
      Compute total force on q;
   for each particle q
      Compute position and velocity of q;
}

for each particle q {
   forces[q][0] = forces[q][1] = 0;
   for each particle k != q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      forces[q][0] -= G*masses[q]*masses[k]/dist_cubed * x_diff;
      forces[q][1] -= G*masses[q]*masses[k]/dist_cubed * y_diff;
   }
}
40
Newton’s 3rd Law of Motion
(Figure: with n = 12 particles, the force pair between particles q = 8 and r = 3 satisfies f_83 = -f_38, and similarly f_85 = -f_58. Each pairwise force therefore needs to be computed only once.)
41
The Reduced Algorithm

for each particle q
   forces[q][0] = forces[q][1] = 0;

for each particle q {
   for each particle k > q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
      force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;

      forces[q][0] += force_qk[0];
      forces[q][1] += force_qk[1];
      forces[k][0] -= force_qk[0];
      forces[k][1] -= force_qk[1];
   }
}
42
Euler Method
y(t_0 + \Delta t) \approx y(t_0) + \Delta t \, y'(t_0)
43
Position and Velocity
s_q(\Delta t) \approx s_q(0) + \Delta t \, s_q'(0) = s_q(0) + \Delta t \, v_q(0)
v_q(\Delta t) \approx v_q(0) + \Delta t \, v_q'(0) = v_q(0) + \Delta t \, a_q(0) = v_q(0) + \Delta t \, F_q(0)/m_q

s_q(2\Delta t) \approx s_q(\Delta t) + \Delta t \, v_q(\Delta t)
v_q(2\Delta t) \approx v_q(\Delta t) + \Delta t \, a_q(\Delta t) = v_q(\Delta t) + \Delta t \, F_q(\Delta t)/m_q
for each particle q {
   pos[q][0] += delta_t*vel[q][0];
   pos[q][1] += delta_t*vel[q][1];
   vel[q][0] += delta_t*forces[q][0]/masses[q];
   vel[q][1] += delta_t*forces[q][1]/masses[q];
}
44
Communications: Basic
(Task graph: computing F_q(t) and F_r(t) requires the current positions s_q(t), s_r(t); the forces, together with v_q(t) and v_r(t), are then used to compute s_q(t + Δt), v_q(t + Δt), s_r(t + Δt), v_r(t + Δt), which in turn feed F_q(t + Δt) and F_r(t + Δt).)
45
Agglomeration: Basic
(In the basic version, each composite task owns s_q, v_q, F_q for its particle; at timestep t and again at t + Δt, the tasks for particles q and r exchange the positions s_q and s_r with each other.)
46
Agglomeration: Reduced
(In the reduced version, each composite task again owns s_q, v_q, F_q; for a pair with q < r, the task for r sends s_r to the task for q, and the task for q sends the pairwise force f_qr back, at timestep t and again at t + Δt.)
47
Parallelizing the Basic Solver

#  pragma omp parallel
   for each timestep {
      if (timestep output) {
#        pragma omp single nowait
         Print positions and velocities of particles;
      }
#     pragma omp for
      for each particle q
         Compute total force on q;
#     pragma omp for
      for each particle q
         Compute position and velocity of q;
   }
Race Conditions?
48
Parallelizing the Reduced Solver

#  pragma omp for
   for each particle q
      forces[q][0] = forces[q][1] = 0;

#  pragma omp for
   for each particle q {
      for each particle k > q {
         x_diff = pos[q][0] - pos[k][0];
         y_diff = pos[q][1] - pos[k][1];
         dist = sqrt(x_diff*x_diff + y_diff*y_diff);
         dist_cubed = dist*dist*dist;
         force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
         force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;

         forces[q][0] += force_qk[0];
         forces[q][1] += force_qk[1];
         forces[k][0] -= force_qk[0];
         forces[k][1] -= force_qk[1];
      }
   }
49
Does it work properly?
• Consider 2 threads and 4 particles.
• Thread 1 is assigned particle 0 and particle 1.
• Thread 2 is assigned particle 2 and particle 3.
• F3=-f03-f13-f23
• Who will calculate f03 and f13?
• Who will calculate f23?
• Any race conditions?
50
Thread Contributions
Thread | Particle | Contribution of thread 0     | thread 1           | thread 2
   0   |    0     | f01+f02+f03+f04+f05          | 0                  | 0
   0   |    1     | -f01+f12+f13+f14+f15         | 0                  | 0
   1   |    2     | -f02-f12                     | f23+f24+f25        | 0
   1   |    3     | -f03-f13                     | -f23+f34+f35       | 0
   2   |    4     | -f04-f14                     | -f24-f34           | f45
   2   |    5     | -f05-f15                     | -f25-f35           | -f45
3 Threads, 6 Particles, Block Partition
51
Thread Contributions
Thread | Particle | Contribution of thread 0     | thread 1           | thread 2
   0   |    0     | f01+f02+f03+f04+f05          | 0                  | 0
   1   |    1     | -f01                         | f12+f13+f14+f15    | 0
   2   |    2     | -f02                         | -f12               | f23+f24+f25
   0   |    3     | -f03+f34+f35                 | -f13               | -f23
   1   |    4     | -f04-f34                     | -f14+f45           | -f24
   2   |    5     | -f05-f35                     | -f15-f45           | -f25
3 Threads, 6 Particles, Cyclic Partition
52
First Phase
#  pragma omp for
   for each particle q {
      for each particle k > q {
         x_diff = pos[q][0] - pos[k][0];
         y_diff = pos[q][1] - pos[k][1];
         dist = sqrt(x_diff*x_diff + y_diff*y_diff);
         dist_cubed = dist*dist*dist;
         force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
         force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;

         loc_forces[my_rank][q][0] += force_qk[0];
         loc_forces[my_rank][q][1] += force_qk[1];
         loc_forces[my_rank][k][0] -= force_qk[0];
         loc_forces[my_rank][k][1] -= force_qk[1];
      }
   }
53
Second Phase
#  pragma omp for
   for (q = 0; q < n; q++) {
      forces[q][0] = forces[q][1] = 0;
      for (thread = 0; thread < thread_count; thread++) {
         forces[q][0] += loc_forces[thread][q][0];
         forces[q][1] += loc_forces[thread][q][1];
      }
   }
Race Conditions?
• In the first phase, each thread carries out the same calculations as before but the values are stored in its own array of forces (loc_forces).
• In the second phase, the thread that has been assigned particle q will add the contributions that have been computed by different threads.
54
Evaluating the OpenMP Codes
• In the reduced code:
  – Loop 1: initialization of the loc_forces array.
  – Loop 2: the first phase of the computation of forces.
  – Loop 3: the second phase of the computation of forces.
  – Loop 4: the updating of positions and velocities.
• Which schedule should be used?
Threads | Basic | Reduced, Default | Reduced, Forces Cyclic | Reduced, All Cyclic
   1    | 7.71  | 3.90             | 3.90                   | 3.90
   2    | 3.87  | 2.94             | 1.98                   | 2.01
   4    | 1.95  | 1.73             | 1.01                   | 1.08
   8    | 0.99  | 0.95             | 0.54                   | 0.61
55
Review
• What are the major differences between MPI and OpenMP?
• What is the scope of a variable?
• What is a reduction variable?
• How to ensure mutual exclusion in a critical section?
• What are the common loop scheduling options?
• What is a thread safe function?
• What factors may affect the performance of an OpenMP program?