Software challenges for Petaflop computing
Ivo F. Sbalzarini
Institute of Computational Science, ETH Zurich


Page 1: Software challenges for Petaflop computing

Software challenges for Petaflop computing

Ivo F. Sbalzarini

Institute of Computational Science, ETH Zurich

Page 2: Software challenges for Petaflop computing

Programming languages and compilers (how much of the parallelism can be expressed at the language level?)

Middleware (how much of the parallelism can be automated using libraries?)

Ease of use (how do we enable a wider range of scientists to use HPC? In what form should they communicate with the machine?)

Debugging and performance profiling tools

Interactive simulations and on-line visualization

Education


Some of them...

1. Software Challenges

Page 3: Software challenges for Petaflop computing

Ideas for language-level parallelism have existed for a long time! (Illiac IV, 1978)


Language level

1. Software Challenges

"Vectoral: only by trying new things will we ever get computer languages right."

Alan Wray

NASA Ames

Page 4: Software challenges for Petaflop computing

We need a hierarchy of levels (language, loops, algorithms, numerical methods)

Domain- and method-specific extensions are useful and needed

A lot exists at the level of linear algebra (ScaLAPACK, PETSc, ...)

but what to do for other applications (agent-based models, particles, trees, ...)?


But...

1. Software Challenges

Page 5: Software challenges for Petaflop computing

A middleware for the portable parallelization of particle-mesh simulations

Ivo F. Sbalzarini, J. H. Walther, B. Polasek, P. Chatelain, M. Bergdorf, S. E. Hieber, E. M. Kotsalis, P. Koumoutsakos

Institute of Computational Science, ETH Zurich

Page 6: Software challenges for Petaflop computing

Goal: make HPC systems easier to use and reduce code development time.

Enable non-traditional HPC user fields (biology, social sciences, economics, psychology, ...).

Re-usable code base: well tested, improvements immediately benefit all applications.

Portable code across (heterogeneous) architectures


Why middleware?

2. Middleware

Page 7: Software challenges for Petaflop computing

Hybrid particle-mesh descriptions are common in many fields (a minimal sketch of the coupling follows the list):

Plasma physics (charged particles, dynamics on the mesh)

Molecular dynamics (fast electrostatics involve a mesh)

Astrophysics (planetary and gravitational systems)

Fluid mechanics (vortex methods and SPH)

Computational biology (reaction-diffusion simulations in complex geometries)

but also: Monte Carlo, traffic simulations, optimal control, financial mathematics, granular flows, DEM, ...
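As a minimal, generic illustration of the coupling behind all of these (plain Python/NumPy, not PPM code; the function name is made up for this sketch): particle quantities such as charge or vorticity are deposited onto a mesh, a field equation is solved there, and the result is interpolated back to the particles. The sketch shows only the first step, a 1D cloud-in-cell deposition.

```python
# Generic illustration only (not PPM code): deposit particle strengths onto a
# 1D periodic mesh with cloud-in-cell (linear) weights -- the basic operation
# that couples the particle and mesh halves of a hybrid method.
import numpy as np

def particles_to_mesh(x, w, n_cells, h):
    """Assign particle weights w at positions x to a periodic mesh with
    n_cells points of spacing h, using linear (CIC) interpolation."""
    s = x / h                          # positions in units of the mesh spacing
    i0 = np.floor(s).astype(int)       # index of the mesh point to the left
    f = s - i0                         # fractional distance to that point
    field = np.zeros(n_cells)
    np.add.at(field, i0 % n_cells, w * (1.0 - f))   # share to the left point
    np.add.at(field, (i0 + 1) % n_cells, w * f)     # share to the right point
    return field

# Example: 10 unit-strength particles on a 16-point periodic mesh
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.6, size=10)
rho = particles_to_mesh(x, np.ones(10), n_cells=16, h=0.1)
print(rho.sum())   # total strength is conserved (approximately 10.0)
```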


Why hybrid particle-mesh?

2. Middleware

Page 8: Software challenges for Petaflop computing

use of symmetry in particle-particle interactions

ease of use and flexibility

parallel scaling in CPU time and memory

independence from specific applications / physics

adaptive domain decompositions

good portability across platforms

good vectorization of all major loops

re-usable, well tested code base


Design goals

2. Middleware

Page 9: Software challenges for Petaflop computing

Parallel Particle Mesh Library (PPM)

An easy-to-use and efficient infrastructure for particle-mesh simulations on parallel computers


[Diagram: the Parallel Particle Mesh Library (PPM) builds on the Message Passing Interface (MPI), METIS, and FFTW, and targets single-processor, vector, shared-memory, and distributed-memory machines.]

Fig. 16. Iso-surfaces of the vorticity (|ω| = 10) for t = 5.0, 6.0, 9.0, 10.5, 12.0, and 14.2. The surfaces are colored according to the magnitude of the vorticity components perpendicular to the tube axes.

5 Summary

We have presented the new Parallel Particle Mesh library ppm, which provides a general-purpose, physics-independent infrastructure for simulating continuum physical systems using particle, mesh, and hybrid particle-mesh algorithms. The design goals included ease of use, flexibility, state-of-the-art parallel scaling, good vectorization, and platform independence.

Ease of use was achieved by limiting the number of user-callable functions and by using generic interfaces and overloading for the different variants of the same task. The library demonstrated its ease of use in the process of developing the client applications presented in this paper: within a few weeks, and without library documentation, a single person could write and test a complete parallel simulation application.

Flexibility and independence from specific physics were demonstrated by the various simulation client applications. The library was successfully compiled and used on Intel/Linux, Apple G5/OS X, IBM p690/AIX, NEC SX-5/SUPER-UX, and AMD Opteron/Linux.

Parallel scaling and efficiency were assessed in the test cases presented in Section 4 of this paper. All applications showed parallel efficiencies comparable to the present state of the art and favorable run-times on large systems. Moreover, vectorization as tested on the NEC SX-5 computer demonstrated the suitability of the ppm library for vector architectures. Ongoing work involves extension of the library to discrete atomistic, mesoscale, and hybrid systems as they are simulated by particle methods.

The absence of suitable parallel libraries has so far prevented the widespread ...

3. PPM

Page 10: Software challenges for Petaflop computing

The PPM Library provides:

An easy-to-use and efficient infrastructure to implement particle-mesh simulations on parallel machines

Based on the abstractions:
Topologies (data partition onto processors)
Mappings (communication operators that move data between processors)
Particles (location in space and attributed properties)
Connections (links between particles, hard or soft)
Meshes (hierarchies of Cartesian meshes)

Write the simulation sequentially in terms of these primitives!
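What "sequential in terms of these primitives" could look like is sketched below as a purely hypothetical pseudo-client. The real PPM library is written in Fortran and its routine names differ; every identifier here (create_topology, map_ghost_get, interp_p2m, ...) is invented for illustration only, and the lib object stands in for the library.

```python
# Hypothetical pseudo-client, for illustration only: none of these names are
# the real (Fortran) PPM API.  The point is the programming model -- the user
# writes a sequential-looking time loop over topologies, mappings, particles,
# and meshes, and the library hides the parallel data movement inside the
# mapping calls.

def run_simulation(lib, positions, strengths, n_steps, dt):
    topo = lib.create_topology(decomposition="adaptive")      # data partition
    mesh = lib.create_mesh(topo, resolution=(128, 128, 128))  # Cartesian mesh
    parts = lib.create_particles(topo, positions, strengths)

    for _ in range(n_steps):
        lib.map_ghost_get(parts, topo)        # communicate ghost layers
        rhs = lib.interp_p2m(parts, mesh)     # particles -> mesh
        field = lib.poisson_solve(mesh, rhs)  # e.g. the multigrid/FFT module
        vel = lib.interp_m2p(field, parts)    # mesh -> particles
        parts.positions += dt * vel           # sequential-looking update
        lib.map_partial(parts, topo)          # re-map particles that crossed
                                              # sub-domain boundaries
    return parts
```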


The PPM Library

3. PPM

Page 11: Software challenges for Petaflop computing

The Library then automatically (transparently) performs:

Adaptive domain decomposition:
Generate >> Nproc sub-domains with an associated cost
Build smaller sub-domains in high-density regions

Dynamic load balancing:
Probe and monitor individual processor speeds
Dynamically assign sub-domains to processors (particles AND mesh)
Re-decompose when the cost is amortized; predict re-decomposition time points

Communication scheduling:
Probe and monitor communication speeds (empirical topology)
Build a graph of the application's communication needs
Minimal edge coloring (within the maximum-degree + 1 bound) to determine a near-optimal schedule (a generic sketch follows below)
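As a generic illustration of the scheduling step (Python, not the library's Fortran; a simple greedy pass, not the near-optimal coloring within the maximum-degree + 1 bound referred to above): edges of the processor communication graph are colored so that no processor appears twice in one color, and each color then becomes one round of conflict-free exchanges.

```python
# Sketch only (not PPM's actual scheduler): color the edges of the processor
# communication graph so that edges sharing a processor get different colors;
# each color then forms one communication round in which all exchanges can
# proceed concurrently.
from collections import defaultdict

def greedy_edge_coloring(edges):
    """edges: list of (proc_a, proc_b) pairs that must exchange data."""
    used = defaultdict(set)          # colors already incident to a processor
    rounds = defaultdict(list)       # color -> list of edges in that round
    for a, b in edges:
        c = 0
        while c in used[a] or c in used[b]:
            c += 1                   # smallest color free at both endpoints
        used[a].add(c)
        used[b].add(c)
        rounds[c].append((a, b))
    return rounds

# Example: 4 processors with a ring-plus-diagonal exchange pattern
schedule = greedy_edge_coloring([(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)])
for color, pairs in sorted(schedule.items()):
    print("round", color, ":", pairs)
```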


The PPM Library

3. PPM

Page 12: Software challenges for Petaflop computing

The core has been supplemented with modules for:

Parallel Poisson solvers (multigrid and FFT)

Parallel evaluation of differential operators on particles

Parallel neighbor lists (cell lists and Verlet lists)

Parallel multi-stage ODE integrators

Parallel Fast Multipole Method (FMM) -- a global distributed tree code!

Boundary element solvers

Parallel disk I/O

Interfaces to external libraries (FFTW, Hypre, HDF5, METIS)
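To make one of these modules concrete, here is a minimal serial sketch of the cell-list idea (generic Python, not the library's Fortran implementation): particles are binned into cells at least as wide as the interaction cutoff, so neighbor candidates are restricted to the 27 surrounding cells instead of all N particles.

```python
# Minimal serial sketch of a cell-list neighbor search (illustration only,
# not PPM code): bin particles into cells of edge >= cutoff, then test only
# particles in neighboring cells.
import numpy as np
from collections import defaultdict
from itertools import product

def cell_list_neighbors(pos, cutoff):
    """Return pairs (i, j), i < j, with |pos[i] - pos[j]| < cutoff."""
    cells = defaultdict(list)
    for i, p in enumerate(pos):
        cells[tuple((p // cutoff).astype(int))].append(i)   # bin by cell index
    pairs = []
    for cell, members in cells.items():
        for off in product((-1, 0, 1), repeat=3):           # 27 nearby cells
            other = tuple(c + o for c, o in zip(cell, off))
            for i in members:
                for j in cells.get(other, ()):
                    if i < j and np.linalg.norm(pos[i] - pos[j]) < cutoff:
                        pairs.append((i, j))
    return pairs

pos = np.random.default_rng(1).uniform(0.0, 2.0, size=(200, 3))
print(len(cell_list_neighbors(pos, cutoff=0.3)), "pairs within cutoff")
```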


The PPM Library

3. PPM

Page 13: Software challenges for Petaflop computing

Fig. 3. Visualization of individual grains from a direct DEM simulation of a sand avalanche down an inclined plane. Initially, 120,000 grains with a void fraction of 0.468 and uniformly distributed radii between 1.00 and 1.12 mm are equilibrated in a box of dimensions 740 × 200 × 8 mm. At time 0 the box is inclined at an angle of 31.2° and the right wall is removed. The flow of the avalanche down the plane is shown 0.1, 0.2, ..., 0.6 seconds after removing the wall (a-f).

• Vortex methods for incompressible fluids: 268M particles, 512 processors, 76% efficiency

• Smoothed Particle Hydrodynamics for compressible fluids: 268M particles, 128 processors, 91% efficiency

• Particle diffusion: up to 1 billion particles, 242 processors, 84% efficiency

• Discrete element methods: up to 122M particles, 192 processors, 40% efficiency

Past applications of PPM

4. Benchmarks

Page 14: Software challenges for Petaflop computing

PPM-SPH

Remeshed Smoothed Particle Hydrodynamics method solving the 3D compressible Navier-Stokes equations.
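As a reminder of what the method computes, the core SPH operation is a kernel-weighted sum over neighboring particles. The sketch below (generic Python, serial, O(N²); not the PPM-SPH client) shows the density summation with a standard 3D cubic-spline kernel.

```python
# Generic SPH density summation (illustration only, not the PPM-SPH client):
# rho_i = sum_j m_j W(|x_i - x_j|, h), with a 3D cubic-spline kernel.
import numpy as np

def cubic_spline_w(r, h):
    """Standard 3D cubic-spline kernel W(r, h)."""
    q = r / h
    sigma = 1.0 / (np.pi * h**3)
    w = np.where(q < 1.0, 1.0 - 1.5 * q**2 + 0.75 * q**3,
        np.where(q < 2.0, 0.25 * (2.0 - q)**3, 0.0))
    return sigma * w

def sph_density(pos, mass, h):
    """O(N^2) reference version of the density summation."""
    diff = pos[:, None, :] - pos[None, :, :]
    r = np.linalg.norm(diff, axis=-1)
    return (mass[None, :] * cubic_spline_w(r, h)).sum(axis=1)

pos = np.random.default_rng(2).uniform(0.0, 1.0, size=(500, 3))
rho = sph_density(pos, mass=np.full(500, 1.0 / 500), h=0.1)
print(rho.mean())
```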

State of the art:
• Gadget (MPI Astrophysics)
• 3082³ grid (~10^10)
• 512 CPUs (IBM p690, MPI Garching)
• 85% efficiency (on 32 CPUs)

Present:
• 1024 × 512² grid (~10^8)
• 128 CPUs (IBM p690, CSCS)
• 91% efficiency (on 128 CPUs)

4. Benchmarks

Page 15: Software challenges for Petaflop computing

Vortex-in-Cell particle method solving the 3D incompressible Navier-Stokes equations, using multigrid as a fast Poisson solver.
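For illustration, the mesh side of such a vortex method recovers the velocity from the vorticity through a Poisson solve for the streamfunction. The sketch below does this in 2D with an FFT instead of multigrid, purely to keep it short; it is generic Python, not the PPM client.

```python
# 2D periodic sketch (illustration only, FFT instead of multigrid): solve
# lap(psi) = -omega for the streamfunction, then take u = d(psi)/dy and
# v = -d(psi)/dx to get the velocity on the mesh.
import numpy as np

def velocity_from_vorticity(omega, L=2.0 * np.pi):
    n = omega.shape[0]
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=L / n)    # angular wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0                                  # avoid dividing the mean mode
    psi_hat = np.fft.fft2(omega) / k2               # psi_hat = omega_hat / k^2
    psi_hat[0, 0] = 0.0
    u = np.real(np.fft.ifft2(1j * ky * psi_hat))    # u =  d(psi)/dy
    v = np.real(np.fft.ifft2(-1j * kx * psi_hat))   # v = -d(psi)/dx
    return u, v

# Example: a single Gaussian vortex on a 128 x 128 mesh
n, L = 128, 2.0 * np.pi
x = np.linspace(0.0, L, n, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
omega = np.exp(-((X - L / 2)**2 + (Y - L / 2)**2) / 0.1)
u, v = velocity_from_vorticity(omega, L)
print(np.abs(u).max())
```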

State of the art:
• Kuwahara Laboratory (Japan)
• Isotropic turbulence
• 4096³ grid (~10^10)
• 512 CPUs (Earth Simulator)
• 50% efficiency

Present:
• Double shear layer
• 1024 × 512² grid (~10^8)
• 512 CPUs (XT/3, CSCS)
• 76% efficiency

PPM-VM

4. Benchmarks

Page 16: Software challenges for Petaflop computing

[Figure: individual grains from the direct DEM simulation of a sand avalanche down an inclined plane; see the caption on page 13.]

Discrete Element Method for granular flow with fully resolved visco-elastic colliding particles.

State of the art:
• Landry et al. (SNL)
• Same particle model (Silbert)
• 200k particles

Present:
• Sand avalanche
• 122M particles
• 192 CPUs (XT/3, CSCS)
• 40% efficiency (strong scaling)

PPM-DEM

4. Benchmarks

[Excerpt from Landry et al., Phys. Rev. E 67, 041303 (2003), describing the spring-dashpot DEM contact model used as the state-of-the-art reference.]

Page 17: Software challenges for Petaflop computing

Particle Strength Exchange method solving the 3D diffusion equation in the geometry of a real cellular organelle, the endoplasmic reticulum (ER).
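The PSE idea itself fits in a few lines: the Laplacian is approximated by a symmetric exchange of strength between nearby particles through a smooth kernel. The sketch below is a generic, serial O(N²) version with a common second-order Gaussian kernel (plain Python, not the PPM-PSE client, which adds neighbor lists and the parallel mappings).

```python
# Generic PSE sketch (illustration only): approximate the Laplacian as
# lap(c)_i ~ (1/eps^2) * sum_j vol_j (c_j - c_i) eta_eps(x_i - x_j),
# here with a second-order Gaussian kernel in 3D.
import numpy as np

def pse_laplacian(pos, c, vol, eps):
    diff = pos[:, None, :] - pos[None, :, :]
    r2 = (diff**2).sum(-1)
    # Gaussian kernel eta_eps(x) = 4 / (eps^3 pi^{3/2}) * exp(-|x|^2 / eps^2)
    eta = np.exp(-r2 / eps**2) * 4.0 / (eps**3 * np.pi**1.5)
    return (vol[None, :] * (c[None, :] - c[:, None]) * eta).sum(axis=1) / eps**2

# One explicit diffusion step dc/dt = nu * lap(c)
rng = np.random.default_rng(3)
pos = rng.uniform(0.0, 1.0, size=(400, 3))
c = np.exp(-((pos - 0.5)**2).sum(-1) / 0.01)     # initial concentration blob
vol = np.full(400, 1.0 / 400)                    # particle volumes
nu, dt = 1.0e-2, 1.0e-4
c_new = c + dt * nu * pse_laplacian(pos, c, vol, eps=0.1)
```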

State of the art:
• Winckelmans (Belgium): PSE for viscous vortex methods, 2.4M particles (2002), 32 CPUs (HP Superdome)
• Sakaguchi et al. (Japan): diffusion using random walk (2004), 6648 × 6648 × 6656 (294G), 4096 CPUs (NEC SX/5, Earth Simulator)

Present:
• ER diffusion
• Up to 1 billion particles
• 4…242 CPUs (IBM p690, CSCS)
• 84% efficiency (on 242 CPUs)

PPM-PSE

4. Benchmarks

Page 18: Software challenges for Petaflop computing

PPM-PSE Benchmark

Domain decomposition: balanced decomposition on 242 processors

4. Benchmarks

Page 19: Software challenges for Petaflop computing

Lennard-Jones Molecular Dynamics of Argon

Present:
• Argon gas
• 8 million atoms
• 256 CPUs (XT/3, CSCS)
• 63% efficiency
• 0.25 seconds per time step
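For context, the dominant per-timestep work in such a run is the Lennard-Jones force evaluation. The sketch below is a generic O(N²) reference in plain Python/NumPy with a cutoff and minimum-image periodic boundaries; it is not the ppm-md client, which uses cell/Verlet lists and parallel mappings instead.

```python
# Generic Lennard-Jones force evaluation (illustration only, reduced units,
# O(N^2), no cell lists, no parallelism -- not the ppm-md client).
import numpy as np

def lj_forces(pos, box, rcut=2.5, eps=1.0, sigma=1.0):
    """Per-particle LJ forces with a cutoff and minimum-image periodicity."""
    d = pos[:, None, :] - pos[None, :, :]
    d -= box * np.round(d / box)                 # minimum-image convention
    r2 = (d**2).sum(-1)
    np.fill_diagonal(r2, np.inf)                 # exclude self-interaction
    inv_r2 = np.where(r2 < rcut**2, 1.0 / r2, 0.0)
    s6 = (sigma**2 * inv_r2)**3                  # (sigma/r)^6 inside the cutoff
    # F_ij = 24 eps (2 (sigma/r)^12 - (sigma/r)^6) / r^2 * d_ij
    coeff = 24.0 * eps * (2.0 * s6**2 - s6) * inv_r2
    return (coeff[:, :, None] * d).sum(axis=1)

pos = np.random.default_rng(4).uniform(0.0, 10.0, size=(256, 3))
f = lj_forces(pos, box=10.0)
print(np.abs(f.sum(axis=0)).max())   # net force ~ 0 by Newton's third law
```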

State of the art:
• Dedicated MD package FASTTUBE: two-fold slower than ppm-md; took 6 years to develop, versus 3 months for ppm-md (Werder et al., ETH Zurich)

PPM-MD

4. Benchmarks


Page 20: Software challenges for Petaflop computing

Acknowledgements

PPM core development team:

Prof. Dr. Jens Walther
Prof. Dr. Ivo Sbalzarini
Dr. Philippe Chatelain

Michael Bergdorf
Evangelos Kotsalis

Simone Hieber


More information (reference manual) and PPM source:

http://minion.inf.ethz.ch/ppm

I. F. Sbalzarini, J. H. Walther, M. Bergdorf, S. E. Hieber, E. M. Kotsalis, and P. Koumoutsakos. PPM – a highly efficient parallel particle-mesh library for the simulation of continuum systems. J. Comput. Phys., 215(2):566–588, 2006.