scalable molecular dynamics for large biomolecular systems
![Page 1: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/1.jpg)
1
Scalable Molecular Dynamics for Large Biomolecular Systems
Robert Brunner
James C Phillips
Laxmikant Kale
![Page 2: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/2.jpg)
2
Overview
• Context: approach and methodology
• Molecular dynamics for biomolecules
• Our program NAMD
  – Basic parallelization strategy
• NAMD performance optimizations
  – Techniques
  – Results
• Conclusions: summary, lessons and future work
![Page 3: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/3.jpg)
3
The context
• Objective: Enhance performance and productivity in parallel programming
  – For complex, dynamic applications
  – Scalable to thousands of processors
• Theme:
  – Adaptive techniques for handling dynamic behavior
• Look for optimal division of labor between human programmer and the “system”
  – Let the programmer specify what to do in parallel
  – Let the system decide when and where to run the subcomputations
• Data driven objects as the substrate
![Page 4: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/4.jpg)
4
![Page 5: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/5.jpg)
5
Data driven execution
[Figure: two schedulers, each with its own message queue (Message Q)]
![Page 6: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/6.jpg)
6
Charm++
• Parallel C++ with Data Driven Objects
• Object Arrays and collections
• Asynchronous method invocation
• Object Groups:
  – global object with a “representative” on each PE
• Prioritized scheduling
• Mature, robust, portable
• http://charm.cs.uiuc.edu
![Page 7: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/7.jpg)
7
Multi-partition decomposition
![Page 8: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/8.jpg)
8
Load balancing
• Based on migratable objects
• Collect timing data for several cycles
• Run heuristic load balancer
  – Several alternative ones
• Re-map and migrate objects accordingly
  – Registration mechanisms facilitate migration
![Page 9: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/9.jpg)
9
Measurement based load balancing
• Application induced imbalances:
  – Abrupt, but infrequent, or
  – Slow, cumulative
  – Rarely: frequent, large changes
• Principle of persistence
  – Extension of principle of locality
  – Behavior of objects, including computational load and communication patterns, tends to persist over time
• We have implemented strategies that exploit this automatically
![Page 10: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/10.jpg)
10
Molecular Dynamics
![Page 11: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/11.jpg)
11
Molecular dynamics and NAMD
• MD to understand the structure and function of biomolecules
  – proteins, DNA, membranes
• NAMD is a production quality MD program
  – Active use by biophysicists (science publications)
  – 50,000+ lines of C++ code
  – 1000+ registered users
  – Features and “accessories” such as
    • VMD: visualization
    • Biocore: collaboratory
    • Steered and Interactive Molecular Dynamics
![Page 12: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/12.jpg)
12
NAMD Contributors
• PIs:
  – Laxmikant Kale, Klaus Schulten, Robert Skeel
• NAMD 1:
  – Robert Brunner, Andrew Dalke, Attila Gursoy, Bill Humphrey, Mark Nelson
• NAMD 2:
  – M. Bhandarkar, R. Brunner, A. Gursoy, J. Phillips, N. Krawetz, A. Shinozaki, K. Varadarajan, Gengbin Zheng, ..
![Page 13: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/13.jpg)
13
Molecular Dynamics
• Collection of [charged] atoms, with bonds
• Newtonian mechanics
• At each time-step
  – Calculate forces on each atom
    • bonds:
    • non-bonded: electrostatic and van der Waals
  – Calculate velocities and advance positions
• 1 femtosecond time-step, millions needed!
• Thousands of atoms (1,000 - 100,000)
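The per-step loop above (forces, then velocities, then positions) can be sketched in a few lines. This is a toy illustration only: the slides do not name the integrator, so velocity Verlet is an assumption here, and `bond_forces`, `md_step`, and `run` are hypothetical names using a 1-D harmonic bond rather than a real force field.

```python
def bond_forces(positions, bonds, k=1.0, r0=1.0):
    """Toy 1-D harmonic bond forces (hypothetical stand-in for a real force field)."""
    forces = [0.0] * len(positions)
    for i, j in bonds:
        # Assumes positions[j] > positions[i] so the signed distance is the bond length.
        stretch = (positions[j] - positions[i]) - r0
        forces[i] += k * stretch  # a stretched bond pulls atom i toward atom j
        forces[j] -= k * stretch  # equal and opposite force on atom j
    return forces


def md_step(positions, velocities, masses, bonds, dt):
    """Advance one time-step with velocity Verlet (an assumed integrator)."""
    f_old = bond_forces(positions, bonds)
    new_positions = [x + v * dt + 0.5 * f / m * dt * dt
                     for x, v, f, m in zip(positions, velocities, f_old, masses)]
    f_new = bond_forces(new_positions, bonds)
    new_velocities = [v + 0.5 * (fo + fn) / m * dt
                      for v, fo, fn, m in zip(velocities, f_old, f_new, masses)]
    return new_positions, new_velocities


def run(positions, velocities, masses, bonds, dt, n_steps):
    """Repeat md_step n_steps times and return the final state."""
    for _ in range(n_steps):
        positions, velocities = md_step(positions, velocities, masses, bonds, dt)
    return positions, velocities
```

Calling `run([0.0, 1.5], [0.0, 0.0], [1.0, 1.0], [(0, 1)], 0.01, 1000)` makes the two bonded atoms oscillate about their equilibrium separation while the center of mass stays put. A production code spends nearly all of its time in the force routine, which is why the rest of the talk concentrates on distributing it.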
![Page 14: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/14.jpg)
14
Cut-off radius
• Use of cut-off radius to reduce work
  – 8 - 14 Å
  – Faraway charges ignored!
• 80-95% of the work is non-bonded force computations
• Some simulations need faraway contributions
  – Periodic systems: Ewald, Particle-Mesh Ewald
  – Aperiodic systems: FMA
• Even so, cut-off based computations are important:
  – near-atom calculations are part of the above
  – multiple time-stepping is used: k cut-off steps, 1 PME/FMA
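To make the cut-off idea concrete, the sketch below lists the atom pairs that fall within a given radius. It is a brute-force illustration under assumptions: `pairs_within_cutoff` is a hypothetical helper, coordinates are in Å, and periodic boundaries are ignored.

```python
import math


def pairs_within_cutoff(coords, cutoff):
    """Return index pairs (i, j), i < j, whose distance is at most cutoff.

    Brute force over all pairs; real codes avoid this O(N^2) scan by
    binning atoms into boxes first.
    """
    pairs = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                pairs.append((i, j))
    return pairs
```

With three atoms at x = 0, 5 and 20 Å and a 12 Å cut-off, only the first two interact; the third contributes nothing, which is exactly the work the cut-off saves.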
![Page 15: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/15.jpg)
15
Scalability
• The program should scale up to use a large number of processors.
  – But what does that mean?
• An individual simulation isn’t truly scalable
• Better definition of scalability:
  – If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size
![Page 16: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/16.jpg)
16
Isoefficiency
• Quantify scalability
  – (Work of Vipin Kumar, U. Minnesota)
• How much increase in problem size is needed to retain the same efficiency on a larger machine?
• Efficiency: Seq. Time / (P · Parallel Time)
  – parallel time = computation + communication + idle
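The efficiency definition above translates directly into a small function. This is a sketch with hypothetical names and made-up numbers in the usage note; the three components are per-processor times.

```python
def parallel_efficiency(seq_time, n_procs, computation, communication, idle):
    """Efficiency = sequential time / (P * parallel time).

    Parallel time is modeled as computation + communication + idle,
    all measured on one processor for one step.
    """
    parallel_time = computation + communication + idle
    return seq_time / (n_procs * parallel_time)
```

For illustration, a 100-second sequential step run on 10 processors with 10 s of computation, 1.5 s of communication and 1 s of idle time per processor gives an efficiency of 0.8. Isoefficiency asks how fast the problem size must grow for that number to stay constant as P increases.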
![Page 17: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/17.jpg)
17
Atom decomposition
• Partition the Atoms array across processors
  – Nearby atoms may not be on the same processor
  – Communication: O(N) per processor
  – Communication/Computation: O(N)/(N/P): O(P)
  – Again, not scalable by our definition
![Page 18: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/18.jpg)
18
Force Decomposition
• Distribute force matrix to processors
  – Matrix is sparse, non-uniform
  – Each processor has one block
  – Communication: O(N/√P)
  – Ratio: O(N/√P)/(N/P): O(√P)
• Better scalability in practice
  – (can use 100+ processors)
  – Plimpton:
  – Hwang, Saltz, et al:
    • 6% on 32 PEs, 36% on 128 processors
  – Yet not scalable in the sense defined here!
![Page 19: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/19.jpg)
19
Spatial Decomposition
• Allocate close-by atoms to the same processor
• Three variations possible:
  – Partitioning into P boxes, 1 per processor
    • Good scalability, but hard to implement
  – Partitioning into fixed size boxes, each a little larger than the cutoff distance
  – Partitioning into smaller boxes
• Communication: O(N/P):
  – so, scalable in principle
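The second variation (fixed-size boxes a little larger than the cutoff) can be sketched as a simple binning step. The function name, the rectangular simulation box, and the assumption that every coordinate lies in [0, L] along each axis are illustrative choices, not details taken from the slides.

```python
from collections import defaultdict


def assign_to_patches(coords, box_lengths, cutoff):
    """Bin atoms into boxes whose sides are at least the cutoff distance.

    Returns (patches, counts): patches maps a box index (ix, iy, iz) to the
    list of atom indices inside it; counts is the number of boxes per axis.
    """
    # Fit as many boxes per axis as possible while keeping each side >= cutoff.
    counts = [max(1, int(length // cutoff)) for length in box_lengths]
    sides = [length / n for length, n in zip(box_lengths, counts)]
    patches = defaultdict(list)
    for atom, pos in enumerate(coords):
        # Clamp so an atom sitting exactly on the upper wall stays in the last box.
        index = tuple(min(int(x // s), n - 1)
                      for x, s, n in zip(pos, sides, counts))
        patches[index].append(atom)
    return dict(patches), counts
```

With boxes at least one cutoff wide, every atom that can interact with a given atom lies in the same box or in one of its 26 neighbors, so each box only needs data from its immediate surroundings.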
![Page 20: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/20.jpg)
20
Spatial Decomposition in NAMD
• NAMD 1 used spatial decomposition
• Good theoretical isoefficiency, but for a fixed size system, load balancing problems
• For midsize systems, got good speedups up to 16 processors….
• Use the symmetry of Newton’s 3rd law to facilitate load balancing
![Page 21: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/21.jpg)
21
Spatial Decomposition
But the load balancing problems are still severe:
![Page 22: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/22.jpg)
22
![Page 23: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/23.jpg)
23
FD + SD
• Now, we have many more objects to load balance:
  – Each diamond can be assigned to any processor
  – Number of diamonds (3D): 14 · Number of Patches
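One way to see where the factor of 14 comes from: each patch has 26 neighbors in 3D, and if every neighboring pair is handled by a single force object (using Newton's third law), a patch is responsible for half of them, 13, plus one object for interactions within itself. The sketch below enumerates those offsets; the "keep the lexicographically positive half" rule is an illustrative convention, not necessarily the one NAMD uses.

```python
from itertools import product


def compute_offsets():
    """Offsets of the patch pairs owned by one patch: itself plus 13 neighbors."""
    offsets = [(0, 0, 0)]  # the self-interaction object
    for offset in product((-1, 0, 1), repeat=3):
        # Tuple comparison is lexicographic, so exactly one of
        # offset and its negation passes this test.
        if offset > (0, 0, 0):
            offsets.append(offset)
    return offsets
```

`len(compute_offsets())` is 14, matching the count on the slide.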
![Page 24: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/24.jpg)
24
Bond Forces
• Multiple types of forces:
  – Bonds (2), Angles (3), Dihedrals (4), ..
  – Luckily, each involves atoms in neighboring patches only
• Straightforward implementation:
  – Send message to all neighbors,
  – receive forces from them
  – 26*2 messages per patch!
![Page 25: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/25.jpg)
25
Bonded Forces:
• Assume one patch per processor:
  – an angle force involving atoms in patches:
    • (x1,y1,z1), (x2,y2,z2), (x3,y3,z3)
    • is calculated in patch: (max{xi}, max{yi}, max{zi})
[Figure labels: A, B, C]
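The ownership rule for bonded terms is a component-wise maximum over the patch indices of the atoms involved. A minimal sketch, with `owning_patch` as a hypothetical name:

```python
def owning_patch(patch_indices):
    """Patch that computes a bonded term, given the patch index of each atom.

    patch_indices is a sequence of (x, y, z) integer patch coordinates,
    one per atom in the bond, angle or dihedral.
    """
    xs, ys, zs = zip(*patch_indices)
    return (max(xs), max(ys), max(zs))
```

For an angle whose atoms sit in patches (1, 2, 3), (2, 2, 3) and (1, 3, 3), the term is computed in patch (2, 3, 3). Every patch holding one of the atoms reaches the same answer independently, so no negotiation is needed to decide who does the work.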
![Page 26: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/26.jpg)
26
Implementation
• Multiple objects per processor
  – Different types: patches, pairwise forces, bonded forces
  – Each may have its data ready at different times
  – Need ability to map and remap them
  – Need prioritized scheduling
• Charm++ supports all of these
![Page 27: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/27.jpg)
27
Load Balancing
• Is a major challenge for this application
  – especially for a large number of processors
• Unpredictable workloads
  – Each diamond (force object) and patch encapsulate variable amount of work
  – Static estimates are inaccurate
• Measurement based Load Balancing Framework
  – Robert Brunner’s recent Ph.D. thesis
  – Very slow variations across timesteps
![Page 28: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/28.jpg)
28
Bipartite graph balancing
• Background load:
  – Patches (integration, ..) and bond-related forces
• Migratable load:
  – Non-bonded forces
• Bipartite communication graph
  – between migratable and non-migratable objects
• Challenge:
  – Balance load while minimizing communication
![Page 29: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/29.jpg)
29
Load balancing strategy
Greedy variant (simplified):
  Sort compute objects (diamonds)
  Repeat (until all assigned)
    S = set of all processors that:
      -- are not overloaded
      -- generate least new communication
    P = least loaded processor in S
    Assign heaviest compute to P
Refinement:
  Repeat
    - Pick a compute from the most overloaded PE
    - Assign it to a suitable underloaded PE
  Until (no movement)
[Figure labels: Cell, Cell, Compute]
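The two phases above can be sketched as follows. This is a simplified reading, not the NAMD implementation: "overloaded" is taken to mean exceeding `overload_factor` times the average load, "new communication" is counted as the number of patches whose data is not yet on a processor, and the refinement step ignores communication. All names and parameters are hypothetical.

```python
def greedy_assign(computes, patch_home, background, overload_factor=1.1):
    """Greedy phase. computes: list of (work, patches); patch_home: patch -> PE;
    background: non-migratable load per PE. Returns (assignment, load)."""
    n_procs = len(background)
    load = list(background)
    total = sum(background) + sum(work for work, _ in computes)
    limit = overload_factor * total / n_procs
    # Patch data already available on each PE (home patches, later proxies).
    present = [set() for _ in range(n_procs)]
    for patch, proc in patch_home.items():
        present[proc].add(patch)
    assignment = {}
    heaviest_first = sorted(range(len(computes)),
                            key=lambda i: computes[i][0], reverse=True)
    for i in heaviest_first:
        work, patches = computes[i]
        # PEs that stay under the limit; fall back to all PEs if none do.
        candidates = [p for p in range(n_procs) if load[p] + work <= limit]
        candidates = candidates or list(range(n_procs))
        new_comm = {p: len(set(patches) - present[p]) for p in candidates}
        fewest = min(new_comm.values())
        best = min((p for p in candidates if new_comm[p] == fewest),
                   key=lambda p: load[p])
        assignment[i] = best
        load[best] += work
        present[best].update(patches)
    return assignment, load


def refine(computes, assignment, load, tolerance=1.05):
    """Refinement phase: move computes off the most loaded PE until nothing moves.
    Modifies assignment and load in place and returns them."""
    average = sum(load) / len(load)
    moved = True
    while moved:
        moved = False
        donor = max(range(len(load)), key=lambda p: load[p])
        if load[donor] <= tolerance * average:
            break
        receiver = min(range(len(load)), key=lambda p: load[p])
        on_donor = sorted((i for i, p in assignment.items() if p == donor),
                          key=lambda i: computes[i][0], reverse=True)
        for i in on_donor:
            work = computes[i][0]
            # Only move if the receiver ends up lighter than the donor was.
            if work > 0 and load[receiver] + work < load[donor]:
                assignment[i] = receiver
                load[donor] -= work
                load[receiver] += work
                moved = True
                break
    return assignment, load
```

With two processors, one home patch on each, and four computes of work 4, 3, 2 and 1, `greedy_assign` ends with a load of 5 on each processor. Starting instead from everything on one processor, `refine` reaches the same balance in two moves.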
![Page 30: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/30.jpg)
30
[Figure: two charts of load per processor, showing migratable work and non-migratable work; x-axis: Processors (the second chart also includes an Average entry), y-axis: Time]
![Page 31: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/31.jpg)
32
Initial Speedup Results: ASCI Red
[Figure: Speedup on ASCI Red: Apo-A1; x-axis: Processors (0 to 1800), y-axis: Speedup (0 to 900)]
![Page 32: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/32.jpg)
33
BC1 complex: 200k atoms
![Page 33: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/33.jpg)
34
Optimizations
• Series of optimizations• Examples to be covered here:
– Grainsize distributions (bimodal)
– Integration: message sending overheads
![Page 34: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/34.jpg)
35
Grainsize and Amdahl’s law
• A variant of Amdahl’s law, for objects, would be:
  – The fastest time can be no shorter than the time for the biggest single object!
• How did it apply to us?
  – Sequential step time was 57 seconds
  – To run on 2k processors, no object should be more than 28 msecs.
    • Should be even shorter
  – Grainsize analysis via Projections showed that was not so.
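The 28 msec figure is just the sequential step time divided by the processor count, and the object-level variant of Amdahl's law is a lower bound on the step time. A small sketch, assuming "2k" means 2048 processors and using hypothetical function names:

```python
def max_grainsize_ms(sequential_step_s, n_procs):
    """Largest object time (ms) compatible with perfect speedup on n_procs."""
    return 1000.0 * sequential_step_s / n_procs


def step_time_lower_bound(object_times, n_procs):
    """A step can finish no sooner than its biggest object,
    nor sooner than the total work spread evenly over all processors."""
    return max(max(object_times), sum(object_times) / n_procs)
```

`max_grainsize_ms(57, 2048)` is about 27.8, which rounds to the 28 msecs quoted above. For objects taking 40, 1, 1, 1 and 1 ms on 4 processors, the bound is 40 ms even though the even-split time is only 11 ms, which is why a handful of oversized objects can cap the speedup.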
![Page 35: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/35.jpg)
36
Grainsize analysis
[Figure: Grainsize distribution, annotated “Problem”; x-axis: grainsize in milliseconds (1 to 43), y-axis: number of objects (0 to 1000)]
Solution: Split compute objects that may have too much work, using a heuristic based on the number of interacting atoms
![Page 36: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/36.jpg)
37
Grainsize reduced
[Figure: Grainsize distribution after splitting; x-axis: grainsize in msecs (1 to 25), y-axis: number of objects (0 to 1600)]
![Page 37: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/37.jpg)
38
Performance audit
![Page 38: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/38.jpg)
39
Performance audit
• Through the optimization process, an audit was kept to decide where to look to improve performance

| | Ideal | Actual |
|---|---|---|
| Total | 57.04 | 86 |
| nonBonded | 52.44 | 49.77 |
| Bonds | 3.16 | 3.9 |
| Integration | 1.44 | 3.05 |
| Overhead | 0 | 7.97 |
| Imbalance | 0 | 10.45 |
| Idle | 0 | 9.25 |
| Receives | 0 | 1.61 |

Integration time doubled
![Page 39: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/39.jpg)
40
Integration overhead analysis
integration
Problem: integration time had doubled from sequential run
![Page 40: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/40.jpg)
41
Integration overhead example:
• The Projections pictures showed the overhead was associated with sending messages.
• Many cells were sending 30-40 messages.
  – The overhead was still too much compared with the cost of messages.
  – Code analysis: memory allocations!
  – An identical message was being sent to 30+ processors.
• Simple multicast support was added to Charm++
  – Mainly eliminates memory allocations (and some copying)
![Page 41: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/41.jpg)
42
Integration overhead: After multicast
![Page 42: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/42.jpg)
43
Improved Performance Data
[Figure: Speedup on ASCI Red; x-axis: Processors (0 to 2500), y-axis: Speedup (0 to 1400)]
![Page 43: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/43.jpg)
45
Results on Linux Cluster
[Figure: Speedup on Linux Cluster; x-axis: Processors (0 to 120), y-axis: Speedup (0 to 80)]
![Page 44: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/44.jpg)
46
Performance of Apo-A1 on ASCI Red
[Figure: x-axis: Processors (0 to 2500), y-axis: Speedup (0 to 1200)]
![Page 45: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/45.jpg)
47
Performance of Apo-A1 on O2k and T3E
[Figure: x-axis: Processors (0 to 300), y-axis: Speedup (0 to 250)]
![Page 46: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/46.jpg)
48
Lessons learned
• Need to downsize objects!
  – Choose smallest possible grainsize that amortizes overhead
• One of the biggest challenges
  – was getting time for performance tuning runs on parallel machines
![Page 47: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/47.jpg)
49
Future and Planned work
• Speedup on small molecules!
  – Interactive molecular dynamics
• Increased speedups on 2k-10k processors
  – Smaller grainsizes
  – New algorithms for reducing communication impact
  – New load balancing strategies
• Further performance improvements for PME/FMA
  – With multiple timestepping
  – Needs multi-phase load balancing
![Page 48: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/48.jpg)
50
Steered MD: example picture
Image and Simulation by the theoretical biophysics group, Beckman Institute, UIUC
![Page 49: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/49.jpg)
51
More information
• Charm++ and associated framework:
  – http://charm.cs.uiuc.edu
• NAMD and associated biophysics tools:
  – http://www.ks.uiuc.edu
• Both include downloadable software
![Page 50: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/50.jpg)
52
Performance: size of system
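
| System (# of atoms) | Procs | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 160 |
|---|---|---|---|---|---|---|---|---|---|---|
| bR (3,762 atoms) | Time | 1.14 | 0.58 | 0.315 | 0.158 | 0.086 | 0.048 | | | |
| | Speedup | 1.0 | 1.97 | 3.61 | 7.20 | 13.2 | 23.7 | | | |
| ER-ERE (36,573 atoms) | Time | | 6.115 | 3.099 | 1.598 | 0.810 | 0.397 | 0.212 | 0.123 | 0.098 |
| | Speedup | | (1.97) | 3.89 | 7.54 | 14.9 | 30.3 | 56.8 | 97.9 | 123 |
| ApoA-I (92,224 atoms) | Time | | | 10.76 | 5.46 | 2.85 | 1.47 | 0.729 | 0.382 | 0.321 |
| | Speedup | | | (3.88) | 7.64 | 14.7 | 28.4 | 57.3 | 109 | 130 |

Performance data on Cray T3E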
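The parenthesized speedups in the table appear to be assumed baselines for systems that were not run on a single processor; the remaining entries then follow from the time ratios. The sketch below checks this reading against the ER-ERE row. The alignment of times to processor counts and the meaning of the parentheses are both interpretations of the flattened slide, and `scaled_speedups` is a hypothetical helper.

```python
# ER-ERE times from the table, keyed by processor count (assumed alignment).
ER_ERE_TIMES = {2: 6.115, 4: 3.099, 8: 1.598, 16: 0.810,
                32: 0.397, 64: 0.212, 128: 0.123, 160: 0.098}


def scaled_speedups(times, base_speedup):
    """Speedups relative to the smallest run, which is assigned base_speedup."""
    base_procs = min(times)
    base_time = times[base_procs]
    return {procs: base_speedup * base_time / t for procs, t in times.items()}


def efficiencies(speedups):
    """Parallel efficiency = speedup / number of processors."""
    return {procs: s / procs for procs, s in speedups.items()}
```

`scaled_speedups(ER_ERE_TIMES, 1.97)` reproduces the tabulated values to within rounding (about 3.89 on 4 processors and about 123 on 160), and `efficiencies` then puts the 160-processor run at roughly 77%.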
![Page 51: Scalable Molecular Dynamics for Large Biomolecular Systems](https://reader036.vdocuments.net/reader036/viewer/2022062721/568136c6550346895d9e622f/html5/thumbnails/51.jpg)
53
Performance: various machines
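
| Machine | Procs | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 160 | 192 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| T3E | Time | | 6.12 | 3.10 | 1.60 | 0.810 | 0.397 | 0.212 | 0.123 | 0.098 | |
| | Speedup | | (1.97) | 3.89 | 7.54 | 14.9 | 30.3 | 56.8 | 97.9 | 123 | |
| Origin 2000 | Time | 8.28 | 4.20 | 2.17 | 1.07 | 0.542 | 0.271 | 0.152 | | | |
| | Speedup | 1.0 | 1.96 | 3.80 | 7.74 | 15.3 | 30.5 | 54.3 | | | |
| ASCI-Red | Time | 28.0 | 13.9 | 7.24 | 3.76 | 1.91 | 1.01 | 0.500 | 0.279 | 0.227 | 0.196 |
| | Speedup | 1.0 | 2.01 | 3.87 | 7.45 | 14.7 | 27.9 | 56.0 | 100 | 123 | 143 |
| NOWs HP735/125 | Time | 24.1 | 12.4 | 6.39 | 3.69 | | | | | | |
| | Speedup | 1.0 | 1.94 | 3.77 | 6.54 | | | | | | |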