The Fast Multipole Method in molecular dynamics
TRANSCRIPT
Berk Hess, KTH Royal Institute of Technology, Stockholm, Sweden
ADAC6 workshop, Zurich, 20-06-2018
Molecular Dynamics of biomolecules

Powerful because:
• structure and dynamics in atomistic detail

Challenging because:
• fixed system size: strong scaling needed
• always more simulation needed than possible
• force-field (model) accuracy can be limiting
• results can be difficult to analyse
Molecular Dynamics

Newton's equation of motion:

$m_i \frac{d^2 x_i}{dt^2} = F_i = -\nabla_{r_i} V(x), \quad i = 1, \ldots, N$

Simple (symplectic) integrator, in leapfrog form:

$v_i(t + \Delta t/2) = v_i(t - \Delta t/2) + a_i(t)\,\Delta t$
$x_i(t + \Delta t) = x_i(t) + v_i(t + \Delta t/2)\,\Delta t$

Fundamental issue: this is a sequential problem. Δt ~ 1 fs, timescales of interest μs - s ⇨ need 10^9 - 10^15 steps!
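To make the update concrete, here is a minimal, illustrative leapfrog step in C++ (not GROMACS code; all names are made up for this sketch):

```cpp
// Minimal leapfrog step: velocities live at half steps, positions at full
// steps. Illustrative only; a real integrator adds constraints, thermostats, etc.
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// v(t+dt/2) = v(t-dt/2) + F(t)/m * dt;  x(t+dt) = x(t) + v(t+dt/2) * dt
void leapfrogStep(std::vector<Vec3>& x, std::vector<Vec3>& vHalf,
                  const std::vector<Vec3>& f,
                  const std::vector<double>& invMass, double dt)
{
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        vHalf[i].x += f[i].x * invMass[i] * dt;
        vHalf[i].y += f[i].y * invMass[i] * dt;
        vHalf[i].z += f[i].z * invMass[i] * dt;
        x[i].x += vHalf[i].x * dt;
        x[i].y += vHalf[i].y * dt;
        x[i].z += vHalf[i].z * dt;
    }
}
```

Because step n+1 needs the forces computed from the positions of step n, the steps cannot be parallelised over time; this is what makes the 10^9 - 10^15 steps a strong scaling problem.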
Computational cost is mostly in computing the forces.

$U(r) = \sum_{\mathrm{bonds}} U_{\mathrm{bond}}(r) + \sum_{\mathrm{angles}} U_{\mathrm{angle}}(r) + \sum_{\mathrm{dihs}} U_{\mathrm{dih}}(r) + \sum_i \sum_{j>i} \left[ \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}} + \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^6} \right]$

We need the force on each particle i:

$F_i = -\frac{dU}{dr_i}$

The pair sum is the largest cost: use a cut-off (plus a long-range electrostatics algorithm).
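For illustration, a single Coulomb + Lennard-Jones pair evaluation could look like the sketch below (not a GROMACS kernel; `one4PiEps0` and the A/B parameters are assumed inputs):

```cpp
// Illustrative evaluation of one i-j pair within the cut-off:
// U = q_i q_j / (4 pi eps0 r) + A/r^12 - B/r^6, and its force.
#include <cmath>

struct PairForce { double fx, fy, fz, energy; };

PairForce coulombLJPair(double dx, double dy, double dz,  // r_i - r_j
                        double qi, double qj, double Aij, double Bij,
                        double one4PiEps0)
{
    double r2    = dx*dx + dy*dy + dz*dz;
    double rinv  = 1.0 / std::sqrt(r2);
    double rinv2 = rinv * rinv;
    double rinv6 = rinv2 * rinv2 * rinv2;

    double eCoul = one4PiEps0 * qi * qj * rinv;        // q_i q_j / (4 pi eps0 r)
    double eLJ   = Aij * rinv6 * rinv6 - Bij * rinv6;  // A/r^12 - B/r^6

    // F = -dU/dr; the scalar below is (-dU/dr)/r, applied to the distance vector
    double fScal = (eCoul + 12.0*Aij*rinv6*rinv6 - 6.0*Bij*rinv6) * rinv2;

    return { fScal*dx, fScal*dy, fScal*dz, eCoul + eLJ };
}
```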
The MD strong scaling challenge

To reach 10^9 - 10^15 steps we need to minimise the wall time per step.

The best MD codes in the world are at 100 microseconds per step (on a single node):
• at ~ 200 atoms per CPU core ⇨ we run in L1/L2 cache
• at ~ 3000 atoms per GPU
• but systems of interest are usually > 100000 atoms

Extreme demands on hardware and parallelisation libraries (MPI, OpenMP):
• ideally overheads should be < 1 microsecond
GROMACS

• Started as a hardware+software project in Groningen (NL)
• Now a core team in Stockholm and many contributors world-wide
• Focus on high performance: efficient algorithms, SIMD & GPUs
• Good parallel scaling
• Compiles and runs efficiently on nearly any hardware
• Open source, LGPL
• Incorporates important algorithms for flexibly manipulating systems
• C++, MPI, OpenMP, CUDA, OpenCL
• C++ and Python APIs in the works
[Figure: pair-interaction setups on SIMD hardware: classical 1x1 neighborlist on 4-way SIMD; 4x4 setup on 4-way SIMD; 4x4 setup on 8-way SIMD; 4x4 setup on SIMT-8]
Highly optimised vectorization:
• Small C++ SIMD layer with highly optimised math functions for: SSE2/4, AVX2-128, AVX(2)-256, AVX-512, ARM, VSX, HPC-ACE
• CPU kernels reach 50% of peak
• CUDA and OpenCL need separate kernels
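The gist of the 4x4 cluster setup can be sketched in plain C++ (a simplified illustration, not the actual SIMD-layer code): the fixed-size i-cluster x j-cluster block gives fully regular loops that map directly onto SIMD lanes.

```cpp
// Illustrative 4x4 cluster-pair loop structure (simplified; real kernels use
// the C++ SIMD layer and compute the full interaction, not just distances).
constexpr int kCluster = 4;

void clusterPairBlock(const float (&xi)[kCluster][3],
                      const float (&xj)[kCluster][3],
                      float rcut2, float (&r2Out)[kCluster][kCluster])
{
    for (int i = 0; i < kCluster; ++i)
    {
        // Fixed trip count: this j loop is the SIMD dimension, one lane per j atom.
        for (int j = 0; j < kCluster; ++j)
        {
            float dx = xi[i][0] - xj[j][0];
            float dy = xi[i][1] - xj[j][1];
            float dz = xi[i][2] - xj[j][2];
            float r2 = dx*dx + dy*dy + dz*dz;
            // Some pairs beyond the cut-off are evaluated anyway; that extra
            // work is the price paid for regular, vectorizable loops.
            r2Out[i][j] = (r2 < rcut2) ? r2 : 0.0f;
        }
    }
}
```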
Atom clustering (regularization) and pair list buffering

We can optimize the size of the pair-list buffer: a larger buffer means more calculations, but we can update the neighbor list less frequently.
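As an illustration of the trade-off, the classic displacement-based validity test for a buffered pair list looks like this (a sketch only; GROMACS actually sizes the buffer from an error estimate and reuses the list for a fixed number of steps):

```cpp
// A pair list built with radius rcut + rbuffer remains exact as long as no
// atom has moved more than rbuffer/2 since the last pair search.
#include <vector>

bool pairListStillValid(const std::vector<double>& displacementSinceSearch,
                        double rbuffer)
{
    for (double d : displacementSinceSearch)
    {
        if (d > 0.5 * rbuffer)
        {
            return false;  // an atom may have crossed the buffer: rebuild
        }
    }
    return true;
}
```

A larger rbuffer means more pairs evaluated per step but fewer expensive pair searches; the plot below scans exactly this trade-off.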
[Plot: performance (ns/day, 50-90) vs pair search frequency (MD steps, 8-200), for 56 ranks with and without dynamic pruning]

New in GROMACS-2018: dual pair list:
• reduces overhead
• less sensitive to parameters
Intra-rank parallelisation: OpenMP

GROMACS has efficient parallelization of all algorithms using MPI + OpenMP.

OpenMP is (performance) portable, but limited:
• No way to run parallel tasks next to each other
• No binding of threads to cores (cache locality)

Need for a better threading model; requirements:
• Extremely low overhead barriers (all-all, all-1, 1-all)
• Binding of threads to cores
• Portable
[Figure: 8th-shell domain decomposition, showing the zones communicated for one domain with cut-off radius rc and bonded cut-off rb]
Spatial decomposition

8th-shell domain decomposition:
• minimizes communicated volume
• minimizes communication calls
• implemented using aggregation of data along dimensions (see the sketch below)
• 1 MPI rank per domain
• dynamic load balancing

Question: is this still the best approach for networks that support many messages?

Hess, Kutzner, Van der Spoel, Lindahl; J. Chem. Theory Comput. 4, 435 (2008)
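A hedged sketch of the dimension-wise aggregation idea with MPI (illustrative only, not the GROMACS implementation): by exchanging along one Cartesian dimension at a time and forwarding what was received in earlier dimensions, one send/receive pair per dimension reaches all neighbor domains, including diagonal ones.

```cpp
// Dimension-wise halo exchange on a 3D Cartesian communicator (sketch).
// A real implementation would select only the halo-zone atoms to forward.
#include <mpi.h>
#include <vector>

void haloExchange(std::vector<double>& sendBuf, MPI_Comm cartComm)
{
    for (int dim = 0; dim < 3; ++dim)
    {
        int src, dst;
        MPI_Cart_shift(cartComm, dim, 1, &src, &dst);

        int nSend = static_cast<int>(sendBuf.size());
        int nRecv = 0;
        // Exchange sizes first (they vary with local particle density).
        MPI_Sendrecv(&nSend, 1, MPI_INT, dst, 0,
                     &nRecv, 1, MPI_INT, src, 0, cartComm, MPI_STATUS_IGNORE);

        std::vector<double> recvBuf(nRecv);
        MPI_Sendrecv(sendBuf.data(), nSend, MPI_DOUBLE, dst, 1,
                     recvBuf.data(), nRecv, MPI_DOUBLE, src, 1,
                     cartComm, MPI_STATUS_IGNORE);

        // Aggregate: data received along this dimension is forwarded along
        // later dimensions, so three exchanges cover all diagonal neighbors.
        sendBuf.insert(sendBuf.end(), recvBuf.begin(), recvBuf.end());
    }
}
```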
[Diagram: timeline of one MD step (100-500 μs) across CPU OpenMP threads and a CUDA GPU. The CPU runs bonded forces, PME (spread, 3D-FFT, solve, 3D-FFT, gather), integration, constraints and pull/FE/etc., while the GPU clears the force buffer, computes non-bonded forces, prunes the pair list (rolling pruning) and reduces forces. H2D transfers (x,q and the pair list) and D2H transfers (F,E) overlap with compute; DD & pair search run every 80-150 steps. Average CPU-GPU overlap is 75-90%.]

New in GROMACS-2018: PME on GPU!
GROMACS strong scaling

Lignocellulose benchmark:
• 3.3 million atoms
• No long-range electrostatics!
• machine: Piz Daint at CSCS, a Cray XC50 with NVIDIA P100 GPUs

[Plot: performance (ns/day, up to ~200) vs #cores / #GPUs (96/4, 960/40, 9600/400), for 4, 6 and 12 ranks/node]
Strong scaling issues

At 100s of microseconds per step:
• The 3D-FFT in PME electrostatics
• MPI overhead: we need MPI_Put_notify()
• OpenMP barriers take significant time
• Load imbalance
• CUDA API overhead can be 50% of the CPU time
• Too many GROMACS options to tweak manually

[Plot: performance (ns/day, 10-1000) vs number of cores or #GPUs*8 (10-1000), 140000 atom ion channel on Piz Daint XC]
Long-range electrostatics

All examples up till now were without long-range electrostatics, but it is needed in most cases: we need all-vs-all Coulomb interactions.

All MD codes use Particle-mesh Ewald (PME) for electrostatics. PME uses a 3D-FFT with 64^3 - 128^3 grid points. Issues:
• The 3D-FFT requires two grid transposes that need all-to-all communication in slabs (see the sketch below)
• Pencil decomposition clashes with 3D decomposition
• 3D-FFT scaling is the bottleneck for MD

[Diagram: 3D-FFT communication with 8 PP/PME ranks versus 6 PP ranks + 2 dedicated PME ranks]

GROMACS Multiple-Program Multiple-Data parallelisation improves scaling by a factor of 4.
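To see where the all-to-all comes from, consider a slab-decomposed 3D-FFT transpose sketch (illustrative; real PME uses FFTW/MKL/cuFFT plans): each rank can transform its local slabs along two axes, but the third axis requires a full transpose, i.e. an MPI_Alltoall.

```cpp
// One grid transpose of a slab-decomposed 3D-FFT (sketch). PME needs two such
// transposes per 3D-FFT, and at 64^3 - 128^3 grid points the block exchanged
// per rank pair shrinks rapidly with rank count, so latency dominates.
#include <complex>
#include <vector>
#include <mpi.h>

void transposeSlabs(std::vector<std::complex<double>>& grid,
                    int blockElems,  // grid.size() / nRanks, per rank pair
                    MPI_Comm comm)
{
    std::vector<std::complex<double>> recv(grid.size());
    MPI_Alltoall(grid.data(), blockElems, MPI_CXX_DOUBLE_COMPLEX,
                 recv.data(), blockElems, MPI_CXX_DOUBLE_COMPLEX, comm);
    grid.swap(recv);  // grid is now decomposed along the other axis
}
```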
Long-range electrostatics methods for molecular dynamics

Method                     | In use in MD | Computational cost | Pre-factor  | Communication cost
Ewald summation            | since 70s    | O(N^3/2)           | small (FFT) | high
PPPM / Particle Mesh Ewald | since 90s    | O(N log N)         | small (FFT) | high
Fast Multipole Method      | not yet      | O(N)               | larger      | low
Multigrid Electrostatics   | 2010         | O(N)               | larger      | low

For FMM, we recently solved the energy conservation issue.
FMM scheme
(slide by Rio Yokota, Tokyo Tech)
The accuracy of FMM is controlled by:
• the order of the expansion(s)
• the theta parameter (opening angle)

Main issue for FMM in MD:
• We want low accuracy (or better: high speed)
• We want energy conservation

[Figure: extended cells, with source-cell radius R, local-expansion radius L and cell distance r]
[Figure: near-neighbor regions for θ=0.50, θ=0.35, θ=0.25, θ=0.20; the number of near neighbors increases as θ decreases]
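One common form of the acceptance test that θ controls is sketched below (an assumption for illustration; FMM variants differ in the exact criterion):

```cpp
// A source cell with multipole radius R, at distance r from a target local
// expansion of radius L, may be treated by the multipole approximation only
// when it is far enough away. Smaller theta pushes more cells into the
// directly computed near field, as in the figure above.
bool multipoleAccepted(double R, double L, double r, double theta)
{
    return R + L < theta * r;
}
```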
FMM for molecular dynamics

[Figure: direct vs approximate evaluation of φ = 1/r for combinations of p (order of expansion) and θ (opening angle); certain combinations yield the same accuracy]

But even at the same accuracy, the combinations where θ is small have less energy drift, because the jump at the direct/approximate boundary is further away.

What happens if we smooth the jump?

(work by Rio Yokota)
Main physics issue with FMM in MD:
• All atoms have partial charges, but molecules are mostly locally neutral
• The sharp boundaries of FMM cells introduce charges at the boundaries

[Figure: a neutral water molecule crossing 2 FMM cells. With charges qO = -0.82 and qH = +0.41, the cell containing the oxygen and one hydrogen gets qcell = -0.41 and the cell containing the other hydrogen gets qcell = +0.41, even though at long distance water is a dipole.]
FMM regularization: add smooth overlap between cells

[Figure: switching functions S1 and S2 overlapping at a cell boundary, with q-q direct interactions in the overlap region and q-multipole interactions beyond; net charges remain q=0]

$V = s_1(x)\,V_1(x) + s_2(x)\,V_2(x)$
$F = -\nabla V = s_1(x)\,F_1(x) + s_2(x)\,F_2(x) - s_1'(x)\,V_1(x) - s_2'(x)\,V_2(x)$

Original idea: Chartier, Darrigand, Faou; BIT Numer. Math. 50, 23 (2010)
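A minimal sketch of this blending in one dimension, assuming a C2 smoothstep as the switching function (the actual choice of s(x) is not specified here):

```cpp
// Regularized combination of two regions' potentials: in the overlap the
// potential is blended with smooth weights s1, s2 = 1 - s1, and the force
// picks up an extra term from differentiating the weights.
#include <cmath>

struct Blend { double V, F; };

// s(t) = 3t^2 - 2t^3 on [0,1]: smooth 0 -> 1 switch; s'(t) = 6t - 6t^2
Blend regularize(double x, double x0, double width,
                 double V1, double F1, double V2, double F2)
{
    double t  = std::fmin(1.0, std::fmax(0.0, (x - x0) / width));
    double s2 = t * t * (3.0 - 2.0 * t);      // weight of region 2
    double s1 = 1.0 - s2;                     // weight of region 1
    double ds = 6.0 * t * (1.0 - t) / width;  // ds2/dx = -ds1/dx

    double V = s1 * V1 + s2 * V2;
    // F = -dV/dx = s1*F1 + s2*F2 - s1'(x)*V1 - s2'(x)*V2
    //            = s1*F1 + s2*F2 + ds*(V1 - V2)   since s1' = -ds, s2' = +ds
    double F = s1 * F1 + s2 * F2 + ds * (V1 - V2);
    return { V, F };
}
```

The extra s'(x)(V1 - V2) term is what makes the force the exact gradient of the blended potential, so the energy drift caused by the jump disappears.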
• Davoud Saffar Shamshirgar has analyzed the accuracy of regularization
• Currently: 2D FMM with full regularization
• Results for charge pairs with θ=0.35 = second nearest neighbors, 8 atoms per cell

[Plot: potential error (10^-4 to 10^-3) vs regularization width / cell size (0 to 0.6), for p=4 and p=6; regularization improves the accuracy by a factor 5.7]
[Plots: log10 absolute RMS of multipole coefficients and of local expansion coefficients vs expansion order P (1-6), for dumbbell and random charge configurations with buffer fraction bf=0 and bf=0.03]

[Plot: RMS coefficient (monopole, dipole, quadrupole) vs half regularization width (0-0.2 nm), for SPC/E water with 0.473 nm cell size]

Real water shows the same decrease of coefficients.
Computational cost of regularization

• 1/2 regularization width = 1/4 cell
• Second nearest neighbors computed directly
• Direct pair cost factor: (2.25/2)^3 = 1.42
• At 0.473 nm cell size (10.6 atoms) this comes for free with LJ
• Multipole lowest level: factor 1.5^3 = 3.4
• But we get a factor 5 more accuracy with P=4
• And we get energy conservation!
Conclusions

• PME 3D-FFT is limiting scaling
• FMM has better communication characteristics
• FMM is a good fit for SIMD/GPU acceleration
• FMM regularization gives energy conservation
• FMM regularization improves accuracy
• FMM fits well with GROMACS pair computation
• We are implementing an optimized, regularised FMM

Is there a better threading/tasking model than OpenMP on the horizon?
Acknowledgements

Erik Lindahl
Mark Abraham
Szilard Pall
Aleksei Iupinov

Rio Yokota, Tokyo Tech
John Eblen, ORNL
Roland Schulz, Intel
Mark Berger, NVIDIA
and the rest of the GROMACS team world-wide