Computational Methods in Physics PHYS 3437
Dr Rob Thacker, Dept of Astronomy & Physics (MM-301C)
[email protected]


TRANSCRIPT

Page 1: Computational Methods in Physics  PHYS 3437

Computational Methods in Physics
PHYS 3437
Dr Rob Thacker
Dept of Astronomy & Physics (MM-301C)
[email protected]

Page 2: Computational Methods in Physics  PHYS 3437

Today’s Lecture

Recap from end of last lecture
Some technical details related to parallel programming
  Data dependencies
  Race conditions
Summary of other clauses you can use in setting up parallel loops

Page 3: Computational Methods in Physics  PHYS 3437

Recap

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
      do i=1,n
        Y(i)=a*X(i)+Y(i)
      end do

C$OMP PARALLEL DO denotes that this is a region of code for parallel execution.
DEFAULT(NONE) is good programming practice: you must declare the nature of all variables.
Thread PRIVATE variables: each thread must have its own copy of this variable (in this case i is the only private variable).
Thread SHARED variables: all threads can access these variables, but must not update individual memory locations simultaneously.
These are comment pragmas for FORTRAN; the ampersand is necessary for continuation.
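For comparison, the same loop can be sketched with the C/C++ OpenMP pragma spellings (which appear later in the lecture's clause table). The function name and 0-based indexing are my own illustrative choices, not from the slides:

```c
/* C/OpenMP sketch of the slide's Fortran loop: y <- a*x + y.
   Compiled without OpenMP the pragma is ignored and the loop runs serially,
   giving the same result. */
void saxpy(int n, float a, const float *x, float *y)
{
    int i;
#pragma omp parallel for default(none) private(i) shared(n, a, x, y)
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];  /* each iteration writes a distinct y[i] */
}
```

As in the Fortran version, the loop is safe to parallelize because no iteration reads a value another iteration writes.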

Page 4: Computational Methods in Physics  PHYS 3437

SHARED and PRIVATE

These are the most commonly used directives and are necessary to ensure correct execution.
PRIVATE: any variable declared as private will be local only to a given thread and is inaccessible to others (it is also uninitialized).
  This means that if you have a variable, say t, in the serial section of the code and then use it in a loop, the value of t in the loop will not carry over the value of t from the serial part.
  Watch out for this – but there is a way around it…
SHARED: any variable declared as shared will be accessible by all other threads of execution.

Page 5: Computational Methods in Physics  PHYS 3437

Example

The SHARED and PRIVATE specifications can be long!

C$OMP& PRIVATE(icb,icol,izt,iyt,icell,iz_off,iy_off,ibz,
C$OMP& iby,ibx,i,rxadd,ryadd,rzadd,inx,iny,inz,nb,nebs,ibrf,
C$OMP& nbz,nby,nbx,nbrf,nbref,jnbox,jnboxnhc,idt,mdt,iboxd,
C$OMP& dedge,idir,redge,is,ie,twoh,dosph,rmind,in,ixyz,
C$OMP& redaughter,Ustmp,ngpp,hpp,vpp,apps,epp,hppi,hpp2,
C$OMP& rh2,hpp2i,hpp3i,hpp5i,dpp,divpp,dcvpp,nspp,rnspp,
C$OMP& rad2torbin,de1,dosphflag,dosphnb,nbzlow,nbzhigh,nbylow,
C$OMP& nbyhigh,nbxlow,nbxhigh,nbzadd,nbyadd,r3i,r2i,r1i,
C$OMP& dosphnbnb,dogravnb,js,je,j,rad2,rmj,grc,igrc,gfrac,
C$OMP& Gr,hppj,jlist,dx,rdv,rcv,v2,radii2,rbin,ibin,fbin,
C$OMP& wl1,dwl1,drnspp,hppa,hppji,hppj2i,hppj3i,hppj5i,
C$OMP& wl2,dwl2,w,dw,df,dppi,divppr,dcvpp2,dcvppm,divppm,csi,
C$OMP& fi,prhoi2,ispp,frcij,rdotv,hpa,rmuij,rhoij,cij,qij,
C$OMP& frc3,frc4,hcalc,rath,av,frc2,dr1,dr2,dr3,dr12,dr22,dr32,
C$OMP& appg1,appg2,appg3,gdiff,ddiff,d2diff,dv1,dv2,dv3,rpp,
C$OMP& Gro)

Page 6: Computational Methods in Physics  PHYS 3437

FIRSTPRIVATE

Declaring a variable FIRSTPRIVATE will ensure that its value is copied in from any prior piece of serial code.
  However (of course), if the variable is not initialized in the serial section it will remain uninitialized.
The copy happens only once for a given thread set.
Try to avoid writing to variables declared FIRSTPRIVATE.

Page 7: Computational Methods in Physics  PHYS 3437

FIRSTPRIVATE example

The lower bound of the values is set to the value of a; without the FIRSTPRIVATE clause, a=0.0 inside the loop.

      a=5.0
C$OMP PARALLEL DO
C$OMP& SHARED(r), PRIVATE(i)
C$OMP& FIRSTPRIVATE(a)
      do i=1,n
        r(i)=max(a,r(i))
      end do
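The same clamp can be sketched in C; the function name is illustrative. firstprivate(a) gives each thread a copy of a initialized to the value passed in, whereas a plain private(a) copy would start uninitialized:

```c
/* C sketch of the FIRSTPRIVATE example: clamp each r[i] from below by a. */
void clamp_below(int n, float a, float *r)
{
    int i;
#pragma omp parallel for firstprivate(a) shared(n, r) private(i)
    for (i = 0; i < n; i++)
        if (r[i] < a)
            r[i] = a;  /* equivalent to r(i) = max(a, r(i)) */
}
```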

Page 8: Computational Methods in Physics  PHYS 3437

LASTPRIVATE

Occasionally it may be necessary to know the last value of a variable from the end of the loop.
LASTPRIVATE variables will initialize the value of the variable in the serial section using the last (sequential) value of the variable from the parallel loop.
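A minimal C sketch of this behaviour (function name mine): after the loop, the lastprivate variable holds the value it had on the sequentially final iteration, exactly as if the loop had run serially.

```c
/* LASTPRIVATE sketch: `last` carries the value from iteration i = n-1
   back into the serial section. */
int last_square(int n)
{
    int i, last = 0;
#pragma omp parallel for lastprivate(last)
    for (i = 0; i < n; i++)
        last = i * i;
    return last;  /* (n-1)^2 for n >= 1 */
}
```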

Page 9: Computational Methods in Physics  PHYS 3437

Default behaviour

You can actually omit the SHARED and PRIVATE statements – what is the expected behaviour?
  Scalars are private by default
  Arrays are shared by default
Bad practice in my opinion – specify the types for everything.

Page 10: Computational Methods in Physics  PHYS 3437

DEFAULT

I recommend using DEFAULT(NONE) at all times.
  It forces specification of all variable types.
Alternatively, you can use DEFAULT(SHARED) or DEFAULT(PRIVATE) to specify that un-scoped variables will default to the particular type chosen.
  e.g. choosing DEFAULT(PRIVATE) will ensure any un-scoped variable is private.

Page 11: Computational Methods in Physics  PHYS 3437

The Parallel Do Pragmas

So far we've considered a small subset of functionality.
Before we talk more about data dependencies, let's look briefly at what other statements can be used in a parallel do loop.
Besides PRIVATE and SHARED variables there are a number of other clauses that can be applied.

Page 12: Computational Methods in Physics  PHYS 3437

Loop Level Parallelism in more detail

For each parallel do (for) pragma, the following clauses are possible:

  FORTRAN        C/C++
  PRIVATE        private
  SHARED         shared
  FIRSTPRIVATE   firstprivate
  LASTPRIVATE    lastprivate
  REDUCTION      reduction
  ORDERED        ordered
  SCHEDULE       schedule
  COPYIN         copyin
  DEFAULT

(On the slide, the most frequently used clauses are highlighted in red; clauses in italics have already been seen.)

Page 13: Computational Methods in Physics  PHYS 3437

More background on data dependencies

Suppose you try to parallelize the following loop:

      c=0.0
      do i=1,n
        c=c+1.0
        Y(i)=c
      end do

This won't work as it is written, since iteration i depends upon iteration i-1, and thus we can't start anything in parallel.
To see this explicitly, let n=20 and start thread 1 at i=1 and thread 2 at i=11: thread 1 sets Y(1)=1.0 and thread 2 sets Y(11)=1.0 (which is wrong!).

Page 14: Computational Methods in Physics  PHYS 3437

Simple solution

This loop can easily be re-written in a way that can be parallelized:

      c=0.0
      do i=1,n
        Y(i)=c+float(i)
      end do
      c=c+n

There is no longer any dependence on the previous operation.
Private variables: i; shared variables: Y(), c, n.
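The rewrite can be sketched in C as follows; the function name is mine, and i+1 mirrors Fortran's 1-based loop index. Each iteration computes its value directly from i, with no reference to iteration i-1, so the iterations are independent:

```c
/* Dependence-free rewrite of the accumulating loop: Y(i) = c + i.
   The serial epilogue c = c + n is returned to the caller. */
float fill_counts(int n, float c, float *y)
{
    int i;
#pragma omp parallel for shared(n, c, y) private(i)
    for (i = 0; i < n; i++)
        y[i] = c + (float)(i + 1);  /* no cross-iteration dependence */
    return c + (float)n;            /* done once, serially */
}
```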

Page 15: Computational Methods in Physics  PHYS 3437

Types of Data Dependencies

Suppose we have operations O1, O2.

True Dependence: O2 has a true dependence on O1 if O2 reads a value written by O1.
Anti Dependence: O2 has an anti-dependence on O1 if O2 writes a value read by O1.
Output Dependence: O2 has an output dependence on O1 if O2 writes a variable written by O1.

Page 16: Computational Methods in Physics  PHYS 3437

Examples

True dependence:
  A1=A2+A3
  B1=A1+B2

Anti-dependence:
  B1=A1+B2
  A1=C2

Output dependence:
  B1=5
  B1=2

Page 17: Computational Methods in Physics  PHYS 3437

Dealing with Data Dependencies

Any loop where iterations depend upon the previous one has a potential problem.
Any result which depends upon the order of the iterations will be a problem.
A good first test of whether something can be parallelized: reverse the loop iteration order.
Not all data dependencies can be eliminated.
Accumulations of variables (e.g. the sum of the elements in an array) can be dealt with easily.

Page 18: Computational Methods in Physics  PHYS 3437

Accumulations

Consider the following loop:

      a=0.0
      do i=1,n
        a=a+X(i)
      end do

It apparently has a data dependency – however, each thread can accumulate its own partial sum independently.
OpenMP provides an explicit interface for this kind of operation ("REDUCTION").
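Written with a reduction clause in C (function name mine), each thread sums into a private copy of a, and the copies are combined when the loop ends:

```c
/* The accumulation loop as an OpenMP reduction. Compiled serially the
   pragma is ignored and the result is identical. */
float array_sum(int n, const float *x)
{
    int i;
    float a = 0.0f;
#pragma omp parallel for reduction(+:a)
    for (i = 0; i < n; i++)
        a += x[i];
    return a;
}
```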

Page 19: Computational Methods in Physics  PHYS 3437

REDUCTION clause

This clause deals with parallel versions of the following loops:

      do i=1,N
        a=max(a,b(i))
      end do

      do i=1,N
        a=min(a,b(i))
      end do

      do i=1,n
        a=a+b(i)
      end do

The outcome is determined by a `reduction' over all the values for each thread.
e.g. the max over all of a set is equivalent to the max over the maxes of subsets: if A = U A_n, then Max(A) = Max(U {Max(A_n)}).

Page 20: Computational Methods in Physics  PHYS 3437

Examples

Syntax: REDUCTION(OP:variable) where OP = max, min, +, -, * (& logic ops)

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(b)
C$OMP& REDUCTION(max:a)
      do i=1,N
        a=max(a,b(i))
      end do

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(b)
C$OMP& REDUCTION(min:a)
      do i=1,N
        a=min(a,b(i))
      end do
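The max reduction can also be written in C (function name mine). One caveat worth knowing: max and min reductions in C/C++ require OpenMP 3.1 or later, whereas Fortran has always supported them:

```c
/* REDUCTION(max:a) in C. Serially compiled, the pragma is ignored and the
   loop is an ordinary running maximum. */
float array_max(int n, const float *b)
{
    int i;
    float a = b[0];
#pragma omp parallel for reduction(max:a)
    for (i = 1; i < n; i++)
        if (b[i] > a)
            a = b[i];
    return a;
}
```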

Page 21: Computational Methods in Physics  PHYS 3437

What is REDUCTION actually doing?

Saving you from writing more code.
The reduction clause generates an array of the reduction variables, and each thread is responsible for a certain element in the array.
The final reduction over all the array elements (when the loop is finished) is performed transparently to the user.

Page 22: Computational Methods in Physics  PHYS 3437

Initialization

Reduction variables are initialized as follows (from the standard):

  Operator   Initialization
  +          0
  *          1
  -          0
  MAX        Smallest rep. number
  MIN        Largest rep. number

Page 23: Computational Methods in Physics  PHYS 3437

Race Conditions

A common operation is to resolve a spatial position into an array index: consider the following loop.

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i,j)
C$OMP& SHARED(r,A)
      do i=1,n
        j=int(r(i))
        A(j)=A(j)+1.
      end do

r(): array of positions
A(): array that is modified using information from r()

This looks innocent enough – but suppose two particles have the same positions…

Page 24: Computational Methods in Physics  PHYS 3437

Race Conditions: A concurrency problem

Two different threads of execution can concurrently attempt to update the same memory location:

  Start: A(j)=1.
  Thread 1: Gets A(j)=1. Adds 1. Puts A(j)=2.
  Thread 2: Gets A(j)=1. Adds 1. Puts A(j)=2.
  End state: A(j)=2.  INCORRECT – two increments should have given 3.

Page 25: Computational Methods in Physics  PHYS 3437

Dealing with Race Conditions

We need a mechanism to ensure updates to single variables occur within a critical section.
Any thread entering a critical section blocks all others.
Critical sections can be established by using "lock variables".
  Think of lock variables as preventing more than one thread from working on a particular piece of code at any one time.
  Just like a lock on a door prevents people from entering a room.

Page 26: Computational Methods in Physics  PHYS 3437

Deadlocks: The pitfall of locking

You must ensure a situation is not created where requests in possession create a deadlock.
  Nested locks are a classic example of this.
  A problem can also be created with multiple processes – the `deadly embrace':

  Process 1 holds Resource 1 and requests Resource 2;
  Process 2 holds Resource 2 and requests Resource 1.

Page 27: Computational Methods in Physics  PHYS 3437

Solutions

We need to ensure memory read/writes occur without any overlap.
If the access occurs to a single region, we can use a critical section:

      do i=1,n
        **work**
C$OMP CRITICAL(lckx)
        a=a+1.
C$OMP END CRITICAL(lckx)
      end do

Only one thread will be allowed inside the critical section at a time.
I have given a name to the critical section, but you don't have to do this.

Page 28: Computational Methods in Physics  PHYS 3437

ATOMIC

If all you want to do is ensure the correct update of one variable, you can use the atomic update facility:

C$OMP PARALLEL DO
      do i=1,n
        **work**
C$OMP ATOMIC
        a=a+1.
      end do

This is exactly the same as a critical section around one single update point.
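Applying this to the earlier binning loop, a C sketch might look as follows (function name mine). With the shared update protected by an atomic pragma, two threads that compute the same bin index j can no longer lose an increment:

```c
/* The position-binning loop with its racy update made atomic. */
void bin_positions(int n, const float *r, float *a)
{
    int i;
#pragma omp parallel for default(none) private(i) shared(n, r, a)
    for (i = 0; i < n; i++) {
        int j = (int)r[i];  /* resolve position into an array index */
#pragma omp atomic
        a[j] += 1.0f;       /* the one protected update point */
    }
}
```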

Page 29: Computational Methods in Physics  PHYS 3437

Can be inefficient

If other threads are waiting to enter the critical section, then the program may even degenerate to a serial code!
Make sure there is much more work outside the locked region than inside it!

[Figure: a parallel section where each thread waits for the lock before being able to proceed – a complete disaster. The bars show time spent doing work vs. waiting for the lock.]

Page 30: Computational Methods in Physics  PHYS 3437

COPYIN & ORDERED

Suppose you have a small section of code that always needs to be executed in sequential order.
However, the remaining work can be done in any order.
Placing an ORDERED clause around the work section will force threads to execute this section of code sequentially.

If a common block is specified as private in a parallel do, COPYIN will ensure that all threads are initialized with the same values as in the serial section of the code.
  Essentially `FIRSTPRIVATE' for common blocks/globals.
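A small C sketch of the ordered idea (names mine): the body of the ordered region executes in iteration order even though the enclosing loop is parallel, so results are appended to the output array in sequence:

```c
/* ORDERED sketch: the parallel part may run in any order, but the ordered
   block runs sequentially in iteration order. */
void squares_in_order(int n, int *out)
{
    int i, pos = 0;
#pragma omp parallel for ordered shared(n, out, pos) private(i)
    for (i = 0; i < n; i++) {
        int v = i * i;        /* work that can proceed in any order */
#pragma omp ordered
        {
            out[pos++] = v;   /* executed in iteration order */
        }
    }
}
```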

Page 31: Computational Methods in Physics  PHYS 3437

Subtle point about running in parallel

When running in parallel you are only as fast as your slowest thread.
In the example, the total work is 40 seconds, and we have 4 CPUs.
The best possible time would be 40/4 = 10 secs.
All threads would have to take 10 secs each, though, to give the maximum speed-up.

[Figure: bar chart of work per thread for 4 threads. An example of poor load balance: the slowest thread takes 16 secs, so only a 40/16 = 2.5 speed-up despite using 4 processors.]

Page 32: Computational Methods in Physics  PHYS 3437

SCHEDULE

This is the mechanism for determining how work is spread among threads.
It is important for ensuring that work is spread evenly among the threads – just having the same number of iterations may not guarantee they all complete at the same time.
Four types of scheduling are possible: STATIC, DYNAMIC, GUIDED, RUNTIME.

Page 33: Computational Methods in Physics  PHYS 3437

STATIC scheduling

The simplest of the four.
If SCHEDULE is unspecified, STATIC scheduling will result.
The default behaviour is to simply divide up the iterations among the threads, ~n/(# threads) each.
STATIC(chunksize) creates a cyclic distribution of iterations.

Page 34: Computational Methods in Physics  PHYS 3437

Comparison

STATIC, no chunksize:
  THREAD 1: 1 2 3 4     THREAD 2: 5 6 7 8     THREAD 3: 9 10 11 12   THREAD 4: 13 14 15 16

STATIC, chunksize=1:
  THREAD 1: 1 5 9 13    THREAD 2: 2 6 10 14   THREAD 3: 3 7 11 15    THREAD 4: 4 8 12 16
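The two pictures correspond to simple iteration-to-thread maps, sketched below using 0-based iterations and threads (the slide numbers both from 1); the function names are mine:

```c
/* Which thread owns iteration i under STATIC with no chunksize:
   contiguous blocks of ~n/nthreads iterations per thread. */
int owner_block(int i, int n, int nthreads)
{
    int chunk = (n + nthreads - 1) / nthreads;  /* ceiling of n/nthreads */
    return i / chunk;
}

/* Which thread owns iteration i under STATIC(chunksize=1):
   a cyclic (round-robin) distribution. */
int owner_cyclic(int i, int nthreads)
{
    return i % nthreads;
}
```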

Page 35: Computational Methods in Physics  PHYS 3437

DYNAMIC scheduling

DYNAMIC scheduling is a personal favourite.
Specify it using DYNAMIC(chunksize).
It is a simple implementation of a master-worker type distribution of iterations.
The master thread passes off values of iterations to the workers in pieces of size chunksize.
Not a silver bullet: if the load imbalance is too severe (i.e. one thread takes longer than the rest combined) an algorithm rewrite is necessary.
Also not good if you need a regular access pattern for data locality.
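A C sketch of where dynamic scheduling helps (names mine): schedule(dynamic, 1) hands one iteration at a time to whichever thread is free, which balances the load when the cost per iteration varies, as in this deliberately triangular workload:

```c
/* Iteration i does i units of inner work, so a static split would be badly
   imbalanced; dynamic scheduling doles out iterations on demand. */
double triangular_work(int n)
{
    int i;
    double total = 0.0;
#pragma omp parallel for schedule(dynamic, 1) reduction(+:total)
    for (i = 1; i <= n; i++) {
        double t = 0.0;
        int k;
        for (k = 0; k < i; k++)  /* iteration i costs i units */
            t += 1.0;
        total += t;
    }
    return total;  /* n(n+1)/2 */
}
```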

Page 36: Computational Methods in Physics  PHYS 3437

Master-Worker Model

[Diagram: a Master hands out pieces of work to THREAD 1, THREAD 2, and THREAD 3.]

Page 37: Computational Methods in Physics  PHYS 3437

Other ways to use OpenMP

We've really only skimmed the surface of what you can do.
However, we have covered the important details.
OpenMP provides a different programming model to just using loops.
It isn't that much harder, but you need to think slightly differently.
Check out www.openmp.org for more details.

Page 38: Computational Methods in Physics  PHYS 3437

Applying to algorithms used in the course

What could we apply OpenMP to?
Root finding algorithms are actually fundamentally serial!
  Global bracket finder: subdivide the region and let each CPU search in its allotted space in parallel.
LU decomposition can be parallelized.
Numerical integration can be parallelized.
ODE solvers are not usually good parallelization candidates, but it is problem dependent.
MC methods usually (but not always) parallelize well.

Page 39: Computational Methods in Physics  PHYS 3437

Summary

The main difficulty in loop level parallel programming is figuring out whether there are data dependencies or race conditions.
Remember that variables do not naturally carry into a parallel loop, or for that matter out of one.
  Use FIRSTPRIVATE and LASTPRIVATE when you need to do this.
SCHEDULE provides many options.
  Use DYNAMIC when you have an unknown amount of work in a loop.
  Use STATIC when you need a regular access pattern to an array.

Page 40: Computational Methods in Physics  PHYS 3437

Next Lecture

Introduction to visualization