The Fast Multipole Method in molecular dynamics
TRANSCRIPT
Berk Hess, KTH Royal Institute of Technology, Stockholm, Sweden
ADAC6 workshop, Zurich, 20-06-2018
Molecular Dynamics of biomolecules

Powerful because:
• structure and dynamics in atomistic detail

Challenging because:
• fixed system size: strong scaling needed
• always more simulation needed than possible
• force-field (model) accuracy can be limiting
• results can be difficult to analyse
Molecular Dynamics

Newton's equation of motion:

$m_i \frac{d^2 x_i}{dt^2} = F_i = -\nabla_{r_i} V(x), \quad i = 1, \ldots, N$

Simple (symplectic) integrator, in leapfrog form:

$v_i(t + \Delta t/2) = v_i(t - \Delta t/2) + a_i(t)\,\Delta t$
$x_i(t + \Delta t) = x_i(t) + v_i(t + \Delta t/2)\,\Delta t$

Fundamental issue: this is a sequential problem. Δt ~ 1 fs, timescales of interest μs - s ⇨ need 10^9 - 10^15 steps!
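To make the update concrete, here is a minimal, illustrative leapfrog step in C++ (not GROMACS code; all names are made up for this sketch):

```cpp
// Minimal leapfrog step: velocities live at half steps, positions at full
// steps. Illustrative only; a real integrator adds constraints, thermostats, etc.
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// v(t+dt/2) = v(t-dt/2) + F(t)/m * dt;  x(t+dt) = x(t) + v(t+dt/2) * dt
void leapfrogStep(std::vector<Vec3>& x, std::vector<Vec3>& vHalf,
                  const std::vector<Vec3>& f,
                  const std::vector<double>& invMass, double dt)
{
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        vHalf[i].x += f[i].x * invMass[i] * dt;
        vHalf[i].y += f[i].y * invMass[i] * dt;
        vHalf[i].z += f[i].z * invMass[i] * dt;
        x[i].x += vHalf[i].x * dt;
        x[i].y += vHalf[i].y * dt;
        x[i].z += vHalf[i].z * dt;
    }
}
```

Because step n+1 needs the forces computed from the positions of step n, the steps cannot be parallelised over time; this is what makes the 10^9 - 10^15 steps a strong scaling problem.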
Computational cost is mostly in computing the forces.

$U(r) = \sum_{\mathrm{bonds}} U_{\mathrm{bond}}(r) + \sum_{\mathrm{angles}} U_{\mathrm{angle}}(r) + \sum_{\mathrm{dihs}} U_{\mathrm{dih}}(r) + \sum_i \sum_{j>i} \left[ \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}} + \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^6} \right]$

We need the force on each particle i:

$F_i = -\frac{dU}{dr_i}$

The pair sum is the largest cost: use a cut-off (plus a long-range electrostatics algorithm).
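For illustration, a single Coulomb + Lennard-Jones pair evaluation could look like the sketch below (not a GROMACS kernel; `one4PiEps0` and the A/B parameters are assumed inputs):

```cpp
// Illustrative evaluation of one i-j pair within the cut-off:
// U = q_i q_j / (4 pi eps0 r) + A/r^12 - B/r^6, and its force.
#include <cmath>

struct PairForce { double fx, fy, fz, energy; };

PairForce coulombLJPair(double dx, double dy, double dz,  // r_i - r_j
                        double qi, double qj, double Aij, double Bij,
                        double one4PiEps0)
{
    double r2    = dx*dx + dy*dy + dz*dz;
    double rinv  = 1.0 / std::sqrt(r2);
    double rinv2 = rinv * rinv;
    double rinv6 = rinv2 * rinv2 * rinv2;

    double eCoul = one4PiEps0 * qi * qj * rinv;        // q_i q_j / (4 pi eps0 r)
    double eLJ   = Aij * rinv6 * rinv6 - Bij * rinv6;  // A/r^12 - B/r^6

    // F = -dU/dr; the scalar below is (-dU/dr)/r, applied to the distance vector
    double fScal = (eCoul + 12.0*Aij*rinv6*rinv6 - 6.0*Bij*rinv6) * rinv2;

    return { fScal*dx, fScal*dy, fScal*dz, eCoul + eLJ };
}
```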
The MD strong scaling challenge

To reach 10^9 - 10^15 steps we need to minimise the wall time per step.

The best MD codes in the world are at 100 microseconds per step (on a single node):
• at ~ 200 atoms per CPU core ⇨ we run in L1/L2 cache
• at ~ 3000 atoms per GPU
• but systems of interest are usually > 100000 atoms

Extreme demands on hardware and parallelisation libraries (MPI, OpenMP):
• ideally overheads should be < 1 microsecond
GROMACS

• Started as a hardware+software project in Groningen (NL)
• Now a core team in Stockholm and many contributors world-wide
• Focus on high performance: efficient algorithms, SIMD & GPUs
• Good parallel scaling
• Compiles and runs efficiently on nearly any hardware
• Open source, LGPL
• Incorporates important algorithms for flexibly manipulating systems
• C++, MPI, OpenMP, CUDA, OpenCL
• C++ and Python APIs in the works
[Figure: pair-interaction setups on SIMD hardware: classical 1x1 neighborlist on 4-way SIMD; 4x4 setup on 4-way SIMD; 4x4 setup on 8-way SIMD; 4x4 setup on SIMT-8]
Highly optimised vectorization:
• Small C++ SIMD layer with highly optimised math functions for: SSE2/4, AVX2-128, AVX(2)-256, AVX-512, ARM, VSX, HPC-ACE
• CPU kernels reach 50% of peak
• CUDA and OpenCL need separate kernels
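The gist of the 4x4 cluster setup can be sketched in plain C++ (a simplified illustration, not the actual SIMD-layer code): the fixed-size i-cluster x j-cluster block gives fully regular loops that map directly onto SIMD lanes.

```cpp
// Illustrative 4x4 cluster-pair loop structure (simplified; real kernels use
// the C++ SIMD layer and compute the full interaction, not just distances).
constexpr int kCluster = 4;

void clusterPairBlock(const float (&xi)[kCluster][3],
                      const float (&xj)[kCluster][3],
                      float rcut2, float (&r2Out)[kCluster][kCluster])
{
    for (int i = 0; i < kCluster; ++i)
    {
        // Fixed trip count: this j loop is the SIMD dimension, one lane per j atom.
        for (int j = 0; j < kCluster; ++j)
        {
            float dx = xi[i][0] - xj[j][0];
            float dy = xi[i][1] - xj[j][1];
            float dz = xi[i][2] - xj[j][2];
            float r2 = dx*dx + dy*dy + dz*dz;
            // Some pairs beyond the cut-off are evaluated anyway; that extra
            // work is the price paid for regular, vectorizable loops.
            r2Out[i][j] = (r2 < rcut2) ? r2 : 0.0f;
        }
    }
}
```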
Atom clustering (regularization) and pair list buffering

We can optimize the size of the pair-list buffer: a larger buffer means more calculations, but we can update the neighbor list less frequently.
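As an illustration of the trade-off, the classic displacement-based validity test for a buffered pair list looks like this (a sketch only; GROMACS actually sizes the buffer from an error estimate and reuses the list for a fixed number of steps):

```cpp
// A pair list built with radius rcut + rbuffer remains exact as long as no
// atom has moved more than rbuffer/2 since the last pair search.
#include <vector>

bool pairListStillValid(const std::vector<double>& displacementSinceSearch,
                        double rbuffer)
{
    for (double d : displacementSinceSearch)
    {
        if (d > 0.5 * rbuffer)
        {
            return false;  // an atom may have crossed the buffer: rebuild
        }
    }
    return true;
}
```

A larger rbuffer means more pairs evaluated per step but fewer expensive pair searches; the plot below scans exactly this trade-off.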
[Plot: performance (ns/day, 50-90) vs pair search frequency (MD steps, 8-200), for 56 ranks with and without dynamic pruning]

New in GROMACS-2018: dual pair list:
• reduces overhead
• less sensitive to parameters
Intra-rank parallelisation: OpenMP

GROMACS has efficient parallelization of all algorithms using MPI + OpenMP.

OpenMP is (performance) portable, but limited:
• No way to run parallel tasks next to each other
• No binding of threads to cores (cache locality)

Need for a better threading model; requirements:
• Extremely low overhead barriers (all-all, all-1, 1-all)
• Binding of threads to cores
• Portable
[Figure: 8th-shell domain decomposition, showing the zones communicated for one domain with cut-off radius rc and bonded cut-off rb]
Spatial decomposition

8th-shell domain decomposition:
• minimizes communicated volume
• minimizes communication calls
• implemented using aggregation of data along dimensions (see the sketch below)
• 1 MPI rank per domain
• dynamic load balancing

Question: is this still the best approach for networks that support many messages?

Hess, Kutzner, Van der Spoel, Lindahl; J. Chem. Theory Comput. 4, 435 (2008)
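A hedged sketch of the dimension-wise aggregation idea with MPI (illustrative only, not the GROMACS implementation): by exchanging along one Cartesian dimension at a time and forwarding what was received in earlier dimensions, one send/receive pair per dimension reaches all neighbor domains, including diagonal ones.

```cpp
// Dimension-wise halo exchange on a 3D Cartesian communicator (sketch).
// A real implementation would select only the halo-zone atoms to forward.
#include <mpi.h>
#include <vector>

void haloExchange(std::vector<double>& sendBuf, MPI_Comm cartComm)
{
    for (int dim = 0; dim < 3; ++dim)
    {
        int src, dst;
        MPI_Cart_shift(cartComm, dim, 1, &src, &dst);

        int nSend = static_cast<int>(sendBuf.size());
        int nRecv = 0;
        // Exchange sizes first (they vary with local particle density).
        MPI_Sendrecv(&nSend, 1, MPI_INT, dst, 0,
                     &nRecv, 1, MPI_INT, src, 0, cartComm, MPI_STATUS_IGNORE);

        std::vector<double> recvBuf(nRecv);
        MPI_Sendrecv(sendBuf.data(), nSend, MPI_DOUBLE, dst, 1,
                     recvBuf.data(), nRecv, MPI_DOUBLE, src, 1,
                     cartComm, MPI_STATUS_IGNORE);

        // Aggregate: data received along this dimension is forwarded along
        // later dimensions, so three exchanges cover all diagonal neighbors.
        sendBuf.insert(sendBuf.end(), recvBuf.begin(), recvBuf.end());
    }
}
```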
[Diagram: timeline of one MD step (100-500 μs) across CPU OpenMP threads and a CUDA GPU. The CPU runs bonded forces, PME (spread, 3D-FFT, solve, 3D-FFT, gather), integration, constraints and pull/FE/etc., while the GPU clears the force buffer, computes non-bonded forces, prunes the pair list (rolling pruning) and reduces forces. H2D transfers (x,q and the pair list) and D2H transfers (F,E) overlap with compute; DD & pair search run every 80-150 steps. Average CPU-GPU overlap is 75-90%.]

New in GROMACS-2018: PME on GPU!
GROMACS strong scaling

Lignocellulose benchmark:
• 3.3 million atoms
• No long-range electrostatics!
• machine: Piz Daint at CSCS, a Cray XC50 with NVIDIA P100 GPUs

[Plot: performance (ns/day, up to ~200) vs #cores / #GPUs (96/4, 960/40, 9600/400), for 4, 6 and 12 ranks/node]
Strong scaling issues

At 100s of microseconds per step:
• The 3D-FFT in PME electrostatics
• MPI overhead: we need MPI_Put_notify()
• OpenMP barriers take significant time
• Load imbalance
• CUDA API overhead can be 50% of the CPU time
• Too many GROMACS options to tweak manually

[Plot: performance (ns/day, 10-1000) vs number of cores or #GPUs*8 (10-1000), 140000 atom ion channel on Piz Daint XC]
Long-range electrostatics

All examples up till now were without long-range electrostatics, but it is needed in most cases: we need all-vs-all Coulomb interactions.

All MD codes use Particle-mesh Ewald (PME) for electrostatics. PME uses a 3D-FFT with 64^3 - 128^3 grid points. Issues:
• The 3D-FFT requires two grid transposes that need all-to-all communication in slabs (see the sketch below)
• Pencil decomposition clashes with 3D decomposition
• 3D-FFT scaling is the bottleneck for MD

[Diagram: 3D-FFT communication with 8 PP/PME ranks versus 6 PP ranks + 2 dedicated PME ranks]

GROMACS Multiple-Program Multiple-Data parallelisation improves scaling by a factor of 4.
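To see where the all-to-all comes from, consider a slab-decomposed 3D-FFT transpose sketch (illustrative; real PME uses FFTW/MKL/cuFFT plans): each rank can transform its local slabs along two axes, but the third axis requires a full transpose, i.e. an MPI_Alltoall.

```cpp
// One grid transpose of a slab-decomposed 3D-FFT (sketch). PME needs two such
// transposes per 3D-FFT, and at 64^3 - 128^3 grid points the block exchanged
// per rank pair shrinks rapidly with rank count, so latency dominates.
#include <complex>
#include <vector>
#include <mpi.h>

void transposeSlabs(std::vector<std::complex<double>>& grid,
                    int blockElems,  // grid.size() / nRanks, per rank pair
                    MPI_Comm comm)
{
    std::vector<std::complex<double>> recv(grid.size());
    MPI_Alltoall(grid.data(), blockElems, MPI_CXX_DOUBLE_COMPLEX,
                 recv.data(), blockElems, MPI_CXX_DOUBLE_COMPLEX, comm);
    grid.swap(recv);  // grid is now decomposed along the other axis
}
```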
Long-range electrostatics methods for molecular dynamics

Method                     | In use in MD | Computational cost | Pre-factor  | Communication cost
Ewald summation            | since 70s    | O(N^3/2)           | small (FFT) | high
PPPM / Particle Mesh Ewald | since 90s    | O(N log N)         | small (FFT) | high
Fast Multipole Method      | not yet      | O(N)               | larger      | low
Multigrid Electrostatics   | 2010         | O(N)               | larger      | low

For FMM, we recently solved the energy conservation issue.
FMM scheme
(slide by Rio Yokota, Tokyo Tech)
The accuracy of FMM is controlled by:
• the order of the expansion(s)
• the theta parameter (opening angle)

Main issue for FMM in MD:
• We want low accuracy (or better: high speed)
• We want energy conservation

[Figure: extended cells, with source-cell radius R, local-expansion radius L and cell distance r]
[Figure: near-neighbor regions for θ=0.50, θ=0.35, θ=0.25, θ=0.20; the number of near neighbors increases as θ decreases]
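One common form of the acceptance test that θ controls is sketched below (an assumption for illustration; FMM variants differ in the exact criterion):

```cpp
// A source cell with multipole radius R, at distance r from a target local
// expansion of radius L, may be treated by the multipole approximation only
// when it is far enough away. Smaller theta pushes more cells into the
// directly computed near field, as in the figure above.
bool multipoleAccepted(double R, double L, double r, double theta)
{
    return R + L < theta * r;
}
```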
FMM for molecular dynamics

[Figure: direct vs approximate evaluation of φ = 1/r for combinations of p (order of expansion) and θ (opening angle); certain combinations yield the same accuracy]

But even at the same accuracy, the combinations where θ is small have less energy drift, because the jump at the direct/approximate boundary is further away.

What happens if we smooth the jump?

(work by Rio Yokota)
Main physics issue with FMM in MD:
• All atoms have partial charges, but molecules are mostly locally neutral
• The sharp boundaries of FMM cells introduce charges at the boundaries

[Figure: a neutral water molecule crossing 2 FMM cells. With charges qO = -0.82 and qH = +0.41, the cell containing the oxygen and one hydrogen gets qcell = -0.41 and the cell containing the other hydrogen gets qcell = +0.41, even though at long distance water is a dipole.]
FMM regularization: add smooth overlap between cells

[Figure: switching functions S1 and S2 overlapping at a cell boundary, with q-q direct interactions in the overlap region and q-multipole interactions beyond; net charges remain q=0]

$V = s_1(x)\,V_1(x) + s_2(x)\,V_2(x)$
$F = -\nabla V = s_1(x)\,F_1(x) + s_2(x)\,F_2(x) - s_1'(x)\,V_1(x) - s_2'(x)\,V_2(x)$

Original idea: Chartier, Darrigand, Faou; BIT Numer. Math. 50, 23 (2010)
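A minimal sketch of this blending in one dimension, assuming a C2 smoothstep as the switching function (the actual choice of s(x) is not specified here):

```cpp
// Regularized combination of two regions' potentials: in the overlap the
// potential is blended with smooth weights s1, s2 = 1 - s1, and the force
// picks up an extra term from differentiating the weights.
#include <cmath>

struct Blend { double V, F; };

// s(t) = 3t^2 - 2t^3 on [0,1]: smooth 0 -> 1 switch; s'(t) = 6t - 6t^2
Blend regularize(double x, double x0, double width,
                 double V1, double F1, double V2, double F2)
{
    double t  = std::fmin(1.0, std::fmax(0.0, (x - x0) / width));
    double s2 = t * t * (3.0 - 2.0 * t);      // weight of region 2
    double s1 = 1.0 - s2;                     // weight of region 1
    double ds = 6.0 * t * (1.0 - t) / width;  // ds2/dx = -ds1/dx

    double V = s1 * V1 + s2 * V2;
    // F = -dV/dx = s1*F1 + s2*F2 - s1'(x)*V1 - s2'(x)*V2
    //            = s1*F1 + s2*F2 + ds*(V1 - V2)   since s1' = -ds, s2' = +ds
    double F = s1 * F1 + s2 * F2 + ds * (V1 - V2);
    return { V, F };
}
```

The extra s'(x)(V1 - V2) term is what makes the force the exact gradient of the blended potential, so the energy drift caused by the jump disappears.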
• Davoud Saffar Shamshirgar has analyzed the accuracy of regularization
• Currently: 2D FMM with full regularization
• Results for charge pairs with θ=0.35 = second nearest neighbors, 8 atoms per cell

[Plot: potential error (10^-4 to 10^-3) vs regularization width / cell size (0 to 0.6), for p=4 and p=6; regularization improves the accuracy by a factor 5.7]
[Plots: log10 absolute RMS of multipole coefficients and of local expansion coefficients vs expansion order P (1-6), for dumbbell and random charge configurations with buffer fraction bf=0 and bf=0.03]

[Plot: RMS coefficient (monopole, dipole, quadrupole) vs half regularization width (0-0.2 nm), for SPC/E water with 0.473 nm cell size]

Real water shows the same decrease of coefficients.
Computational cost of regularization

• 1/2 regularization width = 1/4 cell
• Second nearest neighbors computed directly
• Direct pair cost factor: (2.25/2)^3 = 1.42
• At 0.473 nm cell size (10.6 atoms) this comes for free with LJ
• Multipole lowest level: factor 1.5^3 = 3.4
• But we get a factor 5 more accuracy with P=4
• And we get energy conservation!
Conclusions

• PME 3D-FFT is limiting scaling
• FMM has better communication characteristics
• FMM is a good fit for SIMD/GPU acceleration
• FMM regularization gives energy conservation
• FMM regularization improves accuracy
• FMM fits well with GROMACS pair computation
• We are implementing an optimized, regularised FMM

Is there a better threading/tasking model than OpenMP on the horizon?
Acknowledgements

Erik Lindahl
Mark Abraham
Szilard Pall
Aleksei Iupinov

Rio Yokota, Tokyo Tech
John Eblen, ORNL
Roland Schulz, Intel
Mark Berger, NVIDIA
and the rest of the GROMACS team world-wide