efficient parallel implementation of molecular dynamics with embedded atom method on multi-core...

25
Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali Liu, and Jianjiang Li Information Engineering School, University of Science and Technology

Upload: erika-barker

Post on 26-Dec-2015

223 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms

Reporter: Jilin Zhang

Authors:Changjun Hu, Yali Liu, and Jianjiang Li

Information Engineering School, University of Science and Technology Beijing, Beijing, P.R.China

Page 2: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

Outline 1 Motivation 2 Related Works 3 Spatial Decomposition Coloring (SDC) Approach 4 Short-Range Forces Calculations of EAM using

SDC method 5 Experiments and Discussion 6 Conclusion and Future Directions

Page 3: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

1 Motivation The process of molecular dynamics simulations

Fig. 1 the process of molecular dynamics simulations.

12

3

4

5

6

0

9

7 8

calculate forces

1 2

3

4

56

0

97

8

calculate new positions of atoms

set init_state

12

3

4

5

6

0

9

7 8

Page 4: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

1 Motivation the intensive computation

appears in short-range force calculations procedure of MD simulations Neighbor-list method

decreases the intensive computation largely. It make each atom only interacts with atoms in its neighbor region.

Newton’s third law can have the force computations. And it brings the reduction operations on irregular arrays

for ( i = 0; i < N; i++){

neighstart = neighindex[i];neighend = neighstart + neighlen[i];for ( k = neighstart ; k < neighend; k++){

j = neighlist[k];xd = Coord[j][X]- Coord[i][X];yd = Coord[j][Y]- Coord[i][Y];zd = Coord[j][Z]- Coord[i][Z];…forc = …force[i][X] += forc*xd ;force[i][Y] += forc*yd ;force[i][Z] += forc*zd ;force[j][X] -= forc*xd ;force[j][Y] -= forc*yd ;force[j][Z] -= forc*zd ;

}}

Fig. 2 codes of force caluclations.

Page 5: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

2 Related Works --- parallel reduction operations on irregular arrays

Some types of solutionsenclosing reduction operation in a critical

sectionprivating the reduction arrayusing redundant computations strategy

Page 6: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

2 Related Works --- parallel reduction operations on irregular arrays

enclosing reduction operation in a critical sectioncreate a critical section in inner loop

straight and easy to implement parallelization.

high synchronization cost arose by critical region, atomic or lock involved in inner loop

Page 7: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

2 Related Works --- parallel reduction operations on irregular arrays

private the reduction arrayeach thread have to update share array in critical

region according the value of its private array it reduce times of entering into critical region and

reduce synchronization cost. high memory overhead of private array limit number of particles allowed in simulations compete for cache space and decrease program

speed

Page 8: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

2 Related Works --- parallel reduction operations on irregular arrays

redundant computations strategydoes not use Newton’s third law. So each pair

interaction has to be calculated twice. the high parallelizability since data

dependence has been removed between the loop iterations

there are double computations and that neighbor list requires more memory space.

Page 9: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

3 Spatial Decomposition Coloring (SDC) Approach Spatial Decomposition (SD) method

distributed memory multi-processors involving several hundreds of processors

change all array declarations and all loop bounds, and explicitly codes the periodic transfer of the boundary data between processors.

It is simple to implement SD in OpenMP.

Page 10: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

3 Spatial Decomposition Coloring (SDC) Approach

SD method places a restriction on parallelism in OpenMP.

synchronization will be required to ensure that multiple threads do not attempt to update the same atom simultaneously.

1 2

8

4 5

10

3 6

14 1613 17 18

127 9

15

11rc rc

Fig. 3 SD method.

Page 11: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

3 Spatial Decomposition Coloring (SDC) Approach SDC method

SDC method consists of the following steps

Step 1): Split domain

Step 2): Coloring subdomains

Step 3): Parallel Computing

Page 12: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

3 Spatial Decomposition Coloring (SDC) Approach SDC method

SDC method consists of the following steps Step 1): Split domain

Split the spatial domain into subdomains.

Length of a subdomain must be longer than diameter.

Number of subdomains in dimension decomposed should be even.

Page 13: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

3 Spatial Decomposition Coloring (SDC) Approach SDC method

SDC method consists of the following steps Step 2): Coloring subdomains

The number of subdomains with each color must be equal

each subdomain is surrounded only by those subdomains with different colors.

Page 14: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

3 Spatial Decomposition Coloring (SDC) Approach

SDC methodSDC method consists of the following steps

Step 3): Parallel ComputingCalculations of forces on subdomains

with one color can be run in parallel.a barrier should be given for waiting all

threads to complete computation on this color.

Calculations on subdomains with different colors must run in a serial fashion.

Page 15: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

3 Spatial Decomposition Coloring (SDC) Approach SDC method

advantage neighbor list usually doesn’t be updated in every time-

step Cost of SDC method is very lowest. higher-dimensional decomposition method creates

more subdomains. scalable and suitable on multi-core and many-core architectures.

disadvantage Spatial Decomposition method Overload imbalance

under condition of simulation system has uniformity of density

Page 16: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

4 Short-Range Forces Calculations of EAM using SDC method

EAM methodshort-range

forces the intensive

computation three computational

phases the most time

consuming parts are 1 and 3

N

ijiji )r(φρ

)(' iF ρ

N

ijjijijiiji FF ijr)')('')(')r(V'(F

ρρρρ

Fig. 4 short-range forces in EAM method.

Page 17: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

4 Short-Range Forces Calculations of EAM using SDC method The parallel procedure of short-range

forces calculations using SDC method1) Run electron density computations using

SDC method2) Calculate embedding function value and

their derivative in parallel3) Run force calculations using SDC method

Page 18: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

force calculations based on SDC method

L1: computations on subdomains with different color

L2 : computations on subdomains with same color

L3 deals with all atoms that constitute a subdomain

L4 deals with neighbors of a atom

#pragma omp parallel private(cpart) for (cpart = 0; cpart < colors; cpart++){ ...#pragma omp for private(spart,i,j,k,…) for (spart = cpart; spart < subdomains; spart += colors) for ( ipart = pstart[spart]; ipart < pstart[spart+1]; ipart++) {

i = partindex[ipart];neighstart = neighindex[i];neighend = neighstart + neighlen[i];for ( k = neighstart ; k < neighend; k++){

j = neighlist[k];…forc = …force[i][X] += forc*xd ;force[i][Y] += forc*yd ;force[i][Z] += forc*zd ;force[j][X] -= forc*xd ;force[j][Y] -= forc*yd ;force[j][Z] -= forc*zd ;

} }}

L1:

L2:

L3:

L4:

4 Short-Range Forces Calculations of EAM using SDC method

Fig. 5 forces calculations using SDC.

Page 19: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

5 Experiments and Discussion

Experimental environment Four Intel Xeon(R) Quad-core E7320 (L2 Cache 4MB)

processors, 16 GB memory OS is Fedora release 9 with kernel 2.6.25. The compiler is gcc

4.3.0. Experimental cases

observe micro-deformation behaviors of pure Fe metals material

---came from XMD program under periodic boundary conditions initial state -- body-centered cubic (bcc) lattice arrangement test cases

Small-scale case (1): 54,000 atoms Medium-scale case (2): 265,302 atoms Large-scale case (3): 1,062,882 atoms Large-scale case (4): 3,456,000 atoms

Page 20: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

Speedup

Small case (1) on 2~16 cores Medium case (2) on 2~16 cores

2 3 4 8 12 16 2 3 4 8 12 16

SDC (one-dim) 1.71 2.46 3.07 4.17 1.84 2.64 3.37 6.24 6.33

SDC (two-dim) 1.70 2.46 3.07 4.74 5.90 6.43 1.84 2.65 3.39 6.20 8.89 10.90

SDC (three-dim) 1.66 2.40 2.99 4.61 5.74 6.30 1.82 2.65 3.36 6.16 8.76 10.78

Large case (3) on 2~16 cores Large case (4) on 2~16 cores

2 3 4 8 12 16 2 3 4 8 12 16

SDC (one-dim) 1.86 2.76 3.67 6.82 9.76 9.59 1.88 2.79 3.66 6.30 9.97 9.82

SDC (two-dim) 1.87 2.78 3.64 6.74 9.73 12.31 1.87 2.80 3.65 6.77 9.84 12.42

SDC (three-dim) 1.86 2.75 3.64 6.64 9.65 12.29 1.87 2.80 3.67 6.74 9.82 12.34

Table 1. The Speedups of Spatial Decomposition Coloring (SDC) Methods

Page 21: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

5 Experiments and Discussion

the scalability of our SDC method. performance of multi-dimensional SDC method has been improved with the increase in the number of cores and the increase in the number of atoms.

performance of SDC methods. We can see that two-dimensional SDC method achieves highest efficiency. two-dimensional decomposition algorithm strives to make

subdomains with small surface area and large volume, which results in better cache locality compared to the one-dimensional decomposition strategy.

three-dimensional SDC method slightly degrades the performance due to the more overhead of fork-join threads and scheduling.

Page 22: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

0

2

4

6

8

10

12

14

2 3 4 8 12 16

Number of cores

Spee

dup

SDC on small case(1) SDC on medium case(2) SDC on large case(3) SDC on large case(4)CS on small case(1) CS on medium case(2) CS on large case(3) CS on large case(4)SAP on small case(1) SAP on medium case(2) SAP on large case(3) SAP on large case(4)RC on small case(1) RC on medium case(2) RC on large case(3) RC on large case(4)

Fig. 6 The speedup of two-dimensional Spatial Decomposition Coloring (SDC) method, Critical Section (CS) method, Share Array Privatization (SAP) method and Redundant Computations (RC) method.

Page 23: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

5 Experiments and Discussion

SDC method achieves a nearly linear speedup and highest speedup than other methods The reason of nearly linear speedup is that the low

synchronization cost of implicit barriers in our method can be amortized over a large amount of computation.

CSmethod achieves lowest efficiency. CS method encloses reduction

operations on irregular array in critical section. SAPmethod

performance degrade with the increase of the number of executing cores. memory overhead+synchronization overhead

RC VS SDC there is nearly two-fold computation work for the short-range

force calculations in RC method than in SDC method, the efficiency of RC method is low than that of SDC method.

Page 24: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

Conclusion and Future Directions

A scalable spatial decomposition coloring (SDC) method To solve a class of short-range force calculations

problems on shared memory multi-core platforms It is scalable not only to large simulation system but

also to many-core architectures Future directions

To study SDC method on NUMA memory architecture To implement SDC method using MPI+OpenMP in

multi-core cluster

Page 25: Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms Reporter: Jilin Zhang Authors:Changjun Hu, Yali

Thank You !