Improving NAMD Performance on Multi-GPU Platforms
David J. Hardy
Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology
University of Illinois at Urbana-Champaign http://www.ks.uiuc.edu/~dhardy/
16th Annual Workshop on Charm++ and its Applications April 11, 2018
Outline

• NAMD’s use of GPUs as coprocessors, a historical perspective
- NAMD has been developed for more than 20 years
- First full-featured molecular dynamics code to adopt CUDA
- Stone, et al. J Comput Chem, 28:2618-2640, 2007
• The challenge posed by today’s multi-GPU architectures
• How can Charm++ help address these challenges?
Parallelism in Molecular Dynamics Limited to Each Timestep
Computational workflow of MD:

• Initialize coordinates
• Force calculation (about 99% of computational work)
• Update coordinates (about 1% of computational work); forces and coordinates pass between these two steps every timestep
• Occasional output of reduced quantities (energy, temperature, pressure)
• Occasional output of coordinates (trajectory snapshot)
Work Dominated by Nonbonded Forces

90% — Non-bonded forces, short-range cutoff
5% — Long-range electrostatics, gridded (e.g. PME)
2% — Bonded forces (bonds, angles, etc.)
2% — Correction for excluded interactions
1% — Integration, constraints, thermostat, barostat

Apply GPU acceleration first to the most expensive part: the force calculation, not the coordinate update.
NAMD Hybrid Decomposition with Charm++
• Spatially decompose data and communication
• Separate but related work decomposition
• “Compute objects” create a much greater degree of parallelism, enabling Charm++’s iterative, measurement-based load balancing system
Kale et al., J. Comp. Phys. 151:283-312, 1999
Phillips et al., SC2002
Offload to GPU
Objects are assigned to processors and queued as data arrives
Overlap Calculations, Offload Nonbonded Forces
Early Nonbonded Forces Kernel Used All Memory Systems

• Start with most expensive calculation: direct nonbonded interactions.
• Decompose work into pairs of patches, identical to NAMD structure.
• GPU hardware assigns patch-pairs to multiprocessors dynamically.

Force computation on a single multiprocessor (the GeForce 8800 GTX has 16):
• 16 kB shared memory — Patch A coordinates & parameters
• 32 kB registers — Patch B coordinates, parameters, & forces
• Texture unit (8 kB cache) — force table interpolation
• Constants (8 kB cache) — exclusions
• 32-way SIMD multiprocessor — 32–256 multiplexed threads
• 768 MB main memory — no cache, 300+ cycle latency
NAMD Performance Improved Using Early GPUs
• Full NAMD, not test harness
• Useful performance boost
  - 8x speedup for nonbonded
  - 5x speedup overall w/o PME
  - 3.5x speedup overall w/ PME
  - GPU = quad-core CPU
• Plans for better performance
  - Overlap GPU and CPU work
  - Tune or port remaining work (PME, bonded, integration, etc.)

[Chart: ApoA1 performance, seconds per step (0–3) for CPU vs. GPU, broken into Other, PME, and Nonbond; 2.67 GHz Core 2 Quad Extreme + GeForce 8800 GTX]
Reduce Communication Latency by Separating Work Units
[Timeline figure: within one timestep, the GPU computes remote forces then local forces while the CPU handles other nodes/processes, local work, and the coordinate update; coordinates (x) flow to the GPU and forces (f) return as each part completes]

Phillips et al., SC2008
Early GPU Fits Into Parallel NAMD as Coprocessor
• Offload most expensive calculation: non-bonded forces
• Fits into existing parallelization
• Extends existing code without modifying core data structures
• Requires work aggregation and kernel scheduling considerations to optimize remote communication
• GPU is treated as a coprocessor
Biomedical Technology Research Center for Macromolecular Modeling and Bioinformatics, Beckman Institute, University of Illinois at Urbana-Champaign (www.ks.uiuc.edu), GTC 2017
NAMD Scales Well on Kepler Based Computers
[Scaling chart (2 fs timestep): performance in ns per day (0.25–64, log scale) vs. number of nodes (256–16384) for 21M-atom and 224M-atom systems on Blue Waters XK7 (GTC16), Titan XK7 (GTC16), Edison XC30 (SC14), and Blue Waters XE6 (SC14); the XK7 machines are Kepler based]
Large Rate Difference Between Pascal and CPU
• Balance between GPU and CPU capability keeps shifting towards the GPU
• NVIDIA plots show only through Pascal; Volta widens the performance gap!
• The difference is made worse by multiple GPUs per CPU (e.g. AWS, DGX, Summit)
• Past efforts to balance work between GPU and CPU are now CPU bound

The 20x FLOP rate difference between GPU and CPU requires full use of CPU cores and vectorization!
Reduce Latency, Offload All Force Computation

Emphasis on improving communication latency:
• Overlapped GPU communication and computation (2012)
• Offload atom-based work for PME (2013)
  - Use higher order interpolation with coarser grid
  - Reduce parallel FFT communication

Emphasis on using GPUs more effectively:
• Faster nonbonded force kernels (2016)
• Offload entire PME using cuFFT (for single-node use) (2016)
• Offload remaining force terms (2017)
  - Includes: bonds, angles, dihedrals, impropers, crossterms, exclusions
Overlapped GPU Communication and Computation

• Allows incremental results from a single grid to be processed on CPU before grid finishes on GPU
• Allows merging and prioritizing of remote and local work
• GPU side:
  - Write results to host-mapped memory (also without streaming)
  - __threadfence_system() and __syncthreads()
  - Atomic increment for next output queue location
  - Write result index to output queue
• CPU side:
  - Poll end of output queue (int array) in host memory
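The GPU-side and CPU-side steps above amount to a single-producer output queue published through host-mapped memory. Below is a minimal host-only sketch of that pattern (illustrative, not NAMD source): a plain C++ thread stands in for the GPU kernel, and `std::atomic` release/acquire ordering plays the role of `__threadfence_system()`. All names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// ResultQueue: producer writes a payload, publishes its index with a
// release store; consumer polls the queue tail, exactly like the CPU
// polling the end of the output queue (int array) in host memory.
struct ResultQueue {
    std::vector<int> results;           // result payloads (host-mapped memory on a real GPU)
    std::vector<std::atomic<int>> idx;  // output queue of result indices, -1 = empty
    std::atomic<int> tail{0};           // next free output queue location

    explicit ResultQueue(int n) : results(n), idx(n) {
        for (auto& e : idx) e.store(-1, std::memory_order_relaxed);
    }

    // Producer ("GPU") side: publish one result.
    void push(int slot, int value) {
        results[slot] = value;                          // write payload first
        int q = tail.fetch_add(1);                      // atomic increment for queue location
        idx[q].store(slot, std::memory_order_release);  // fence + write result index
    }

    // Consumer ("CPU") side: poll until the q-th entry appears, then read it.
    int poll(int q) {
        int slot;
        while ((slot = idx[q].load(std::memory_order_acquire)) < 0) { /* spin */ }
        return results[slot];
    }
};

// Demo: producer pushes squares; consumer accumulates incrementally,
// processing each result as soon as it arrives rather than at the end.
int demo(int n) {
    ResultQueue q(n);
    std::thread producer([&] {
        for (int i = 0; i < n; ++i) q.push(i, i * i);
    });
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += q.poll(i);
    producer.join();
    return sum;
}
```

The release store on the index is what makes the payload write visible to the polling thread; on the GPU the `__threadfence_system()` before the queue write serves that purpose.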
Non-overlapped Kernel Communication
Integration unable to start until GPU kernel finishes
Overlapped Kernel Communication
GPU kernel communicates results while running; patches begin integration as soon as data arrives
S6623: Advances in NAMD GPU Performance

Non-bonded Force Computation in NAMD

• Two levels of spatial sorting:
  - Simulation box is divided into patches
  - Within the patch, atoms are sorted spatially into groups of 32 using the orthogonal recursive bisection method
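The second sorting level can be sketched as follows (an illustrative C++ version, not NAMD source; the group size of 32 matches the warp-sized groups described above). Atoms are recursively split at the median along the widest axis of their bounding box until each group holds at most 32 atoms, so each group is spatially compact.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <utility>
#include <vector>

using Atom = std::array<float, 3>;  // x, y, z

// Orthogonal recursive bisection: split [lo, hi) at the median along
// the widest bounding-box axis until groups have <= 32 atoms.
static void orb(std::vector<Atom>::iterator lo, std::vector<Atom>::iterator hi,
                std::vector<std::pair<int, int>>& groups, int base) {
    int n = int(hi - lo);
    if (n <= 32) { groups.push_back({base, n}); return; }
    // Find the widest axis of the group's bounding box.
    Atom mn = *lo, mx = *lo;
    for (auto it = lo; it != hi; ++it)
        for (int d = 0; d < 3; ++d) {
            mn[d] = std::min(mn[d], (*it)[d]);
            mx[d] = std::max(mx[d], (*it)[d]);
        }
    int axis = 0;
    for (int d = 1; d < 3; ++d)
        if (mx[d] - mn[d] > mx[axis] - mn[axis]) axis = d;
    // Partition atoms so the lower half along that axis comes first.
    auto mid = lo + n / 2;
    std::nth_element(lo, mid, hi,
        [axis](const Atom& a, const Atom& b) { return a[axis] < b[axis]; });
    orb(lo, mid, groups, base);
    orb(mid, hi, groups, base + n / 2);
}

// Reorders atoms in place; returns (offset, size) of each group of <= 32.
std::vector<std::pair<int, int>> sortIntoGroups(std::vector<Atom>& atoms) {
    std::vector<std::pair<int, int>> groups;
    if (!atoms.empty()) orb(atoms.begin(), atoms.end(), groups, 0);
    return groups;
}
```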
Non-bonded Force Compute

• Compute = all pairwise interactions between two patches
• For GPU, a compute is split into tiles of size 32x32 atoms

[Figure: Compute 1 covers Patch 1 x Patch 2 and Compute 2 covers Patch 2 x Patch 3; each compute is a grid of 32x32 tiles]
Non-bonded Force Computation

• One warp per tile
• Loop through the 32x32 tile diagonally
  - Avoids race condition when storing forces Fi and Fj
• Bitmask used for exclusion lookup

[Figure: atoms in patch i vs. atoms in patch j; warps 1–4 each process one 32x32 tile, sweeping its diagonals 1, 2, 3, …, 32]
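The diagonal traversal works because on every diagonal step each of the 32 lanes handles a distinct atom i and a distinct atom j, so no two lanes update the same force accumulator at the same time. A serial sketch of the index pattern (illustrative, not the CUDA kernel):

```cpp
#include <cassert>

constexpr int TILE = 32;

// At diagonal step k, lane t processes the pair (i = t, j = (t + k) % 32).
// Within any step the 32 lanes cover 32 distinct i's and 32 distinct j's,
// so Fi and Fj can be accumulated race-free; over all 32 steps every
// (i, j) pair in the tile is visited exactly once.
void diagonalTraversal(int visits[TILE][TILE]) {
    for (int k = 0; k < TILE; ++k)        // 32 diagonal steps
        for (int t = 0; t < TILE; ++t) {  // 32 lanes of the warp
            int i = t;
            int j = (t + k) % TILE;
            ++visits[i][j];
        }
}
```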
Neighbor List Sorting

• Tile lists executed in the same thread block should have approximately the same workload
• Simple solution is to sort according to tile list length
• Also minimizes tail effects at the end of kernel execution

[Figure: unsorted tile lists of varying length across warps 1–3 cause load imbalance; after the thread-block sort, warps 1–4 carry similar-length lists and there is no load imbalance]
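The sort itself is simple; a hedged C++ sketch (the `TileList` type and field names are illustrative stand-ins, not NAMD source). Sorting longest-first groups similar-length lists into the same thread block and schedules the biggest work items earliest, which shrinks the tail at the end of the kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One tile list = the set of 32x32 tiles one warp will process.
struct TileList {
    int patchPair;  // which compute (patch pair) this list belongs to
    int numTiles;   // list length, a proxy for its workload
};

// Order tile lists by descending length so lists executed together
// have approximately the same work, minimizing tail effects.
void sortTileLists(std::vector<TileList>& lists) {
    std::stable_sort(lists.begin(), lists.end(),
        [](const TileList& a, const TileList& b) {
            return a.numTiles > b.numTiles;  // longest first
        });
}
```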
Single-Node GPU Performance Competitive on Maxwell
New kernels by Antti-Pekka Hynninen, NVIDIA
Stone, Hynninen, et al., International Workshop on OpenPOWER for HPC (IWOPH'16), 2016
More Improvement from Offloading Bonded Forces

• GPU offloading for bonds, angles, dihedrals, impropers, exclusions, and crossterms
• Computation in single precision
  - Forces are accumulated in 24.40 fixed point
  - Virials are accumulated in 34.30 fixed point
  - Code path exists for double precision accumulation on Pascal and newer GPUs
• Reduces CPU workload and hence improves performance on GPU-heavy systems

[Chart: DGX-1 speedup (1x–1.7x) from offloading bonded forces on apoa1, f1atpase, and stmv]

New kernels by Antti-Pekka Hynninen, NVIDIA
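The 24.40 fixed-point format above means a 64-bit integer with 24 integer bits and 40 fractional bits. A minimal sketch of the idea (illustrative, not NAMD source): because integer addition is associative, accumulated forces come out bitwise identical no matter what order the GPU's atomic adds land in, unlike floating-point accumulation.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// 24.40 fixed point: scale by 2^40 and store in a 64-bit integer.
constexpr int FRAC_BITS = 40;
constexpr double SCALE = double(int64_t(1) << FRAC_BITS);

// Convert one force component to fixed point (round to nearest).
int64_t toFixed(double f) { return int64_t(std::llround(f * SCALE)); }

// Convert an accumulated fixed-point sum back to floating point.
double fromFixed(int64_t q) { return double(q) / SCALE; }
```

In the kernel the `int64_t` accumulator would be updated with `atomicAdd`; the point of the format is that the final sum is deterministic and still carries ~12 decimal digits of fractional precision.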
Supercomputers Increasing GPU to CPU Ratio
Blue Waters, Titan with Cray XK7 nodes: 1 K20 / 16-core AMD Opteron

Summit nodes: 6 Voltas / 42 IBM Power9 cores

Only 7 cores supporting each Volta!
Revolutionary GPU-based Hardware

• 16 Volta GPUs
• 16 x 32 GB HBM2
• Fast switch makes memory uniformly accessible
• 2 Intel Xeon Platinum CPUs (2 x 28 cores)
• 1.5 TB main memory

Equivalent compute power to about 160 nodes of Blue Waters

DGX-2: 3.5 CPU cores / GPU vs. Blue Waters: 16 CPU cores / GPU
Limited Scaling Even After Offloading All Forces

Results on NVIDIA DGX-1 (Intel Haswell using 28 cores with Volta V100 GPUs)

[Chart: STMV (1 million atoms) performance in ns per day (0.5–4) vs. number of NVIDIA Voltas (1–4), comparing offloading all forces (NAMD 2.13) against nonbonded forces only (NAMD 2.12)]
CPU Integrator Calculation (1%) Causing Bottleneck

Nsight Systems profiling of NAMD running STMV (1M atoms) on 1 Volta & 28 CPU cores:
• GPU is not being kept busy (gaps between the nonbonded, bonded, and PME kernels)
• Too much communication!
• Too much CPU work: 2200 patches across 28 cores
• Patches running sequentially within each core
CPU integrator work is mostly data parallel, but…

• Uses double precision for positions, velocities, forces
  - Data layout is array of structures (AOS), not well suited to vectorization
• Each NAMD “patch” runs the integrator in a separate user-level thread to make the source code more accessible
  - Benefit from vectorization is reduced: each patch loops over only 200–600 atoms
• Too many exceptional cases handled within the same code path
  - E.g. fixed atoms, pseudo-atom particles (Drude and lone pair)
  - Conditionals for simulation options and rare events (e.g. trajectory output) are tested every timestep
CPU integrator work is mostly data parallel, but…

• Additional communication is required
  - Send reductions for kinetic energies and virial
  - Receive broadcast of periodic cell rescaling when simulating constant pressure
• A few core algorithms have sequential parts, with reduced parallelism
  - Not suitable for CPU vectorization
  - Rigid bond constraints: successive relaxation over each hydrogen group
  - Random number generator: Gaussian numbers using Box–Muller transform with rejection
  - Reductions calculated over “hydrogen groups”: irregular data access patterns
Strategies for Overcoming Bottleneck

• Data structures for CPU vectorization
  - Convert atom data storage from AOS (array of structures) form into vector-friendly SOA (structure of arrays) form
• Algorithms for CPU vectorization
  - Replace non-vectorizing RNG code with a vectorized version
  - Replace the sequential rigid bond constraints algorithm with one capable of fine-grained parallelism (maybe LINCS or Matrix-SHAKE)
• Offload integrator to GPU
  - Main challenge is aggregating patch data
  - Use vectorized algorithms; adapt cuRAND for Gaussian random numbers
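The AOS-to-SOA conversion mentioned above can be illustrated with a minimal sketch (hypothetical types, not NAMD's actual data structures). In AOS form each atom's fields are interleaved, defeating unit-stride vector loads; in SOA form each field is one contiguous array the compiler can vectorize over directly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AOS: one struct per atom, fields interleaved in memory.
struct AtomAOS { double x, y, z, vx, vy, vz; };

// SOA: one contiguous array per field.
struct AtomsSOA { std::vector<double> x, y, z, vx, vy, vz; };

AtomsSOA toSOA(const std::vector<AtomAOS>& aos) {
    AtomsSOA s;
    const size_t n = aos.size();
    s.x.reserve(n);  s.y.reserve(n);  s.z.reserve(n);
    s.vx.reserve(n); s.vy.reserve(n); s.vz.reserve(n);
    for (const AtomAOS& a : aos) {
        s.x.push_back(a.x);   s.y.push_back(a.y);   s.z.push_back(a.z);
        s.vx.push_back(a.vx); s.vy.push_back(a.vy); s.vz.push_back(a.vz);
    }
    return s;
}

// A vector-friendly position update: each loop walks one contiguous
// array with unit stride, the pattern auto-vectorizers handle best.
void advancePositions(AtomsSOA& s, double dt) {
    for (size_t i = 0; i < s.x.size(); ++i) s.x[i] += dt * s.vx[i];
    for (size_t i = 0; i < s.y.size(); ++i) s.y[i] += dt * s.vy[i];
    for (size_t i = 0; i < s.z.size(); ++i) s.z[i] += dt * s.vz[i];
}
```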
Goal: Developing GPU-based NAMD

• CPU primarily manages GPU kernel launching
  - CPU prepares and aggregates data structures for GPU, handles Charm++ communication
• Reduces overhead of host-to-device memory transfers
  - Data lives on the GPU, in clusters of patches
  - Communicate edge patches for force calculation and atom migration
• Design new data structures capable of GPU or CPU vectorization
  - Major refactor of code to reveal more data parallelism
• Kernel-based design with consistent interfaces across GPU and CPU
  - Need fallback kernels for CPU (e.g. continue to support Blue Waters and Titan)
Some Conclusions — for HPC apps in general and NAMD in particular

• Balance between CPU and GPU computational capability continues to shift in favor of the GPU
  - The 1% workload from 10 years ago is now 60% or more of wall clock time
• Any past code that has attempted to load balance work between CPU and GPU is today likely to be CPU bound
• Best utilization of the GPU might require keeping all data on the GPU
  - Motivates turning a previously CPU-based code using the GPU as an accelerator into a GPU-based code using the CPU as a kernel management and communication coprocessor
• Volta + CUDA 9.x + CUDA Toolkit libraries provide enough general-purpose support to allow moving an application entirely to the GPU
What will Charm++’s role be for multi-GPU NAMD?

• Single-node multi-GPU case
  - With everything running on CPU-managed GPUs, Charm++ has nothing to do!
  - Even now, NAMD patches are non-migratable
• Multi-node multi-GPU case
  - Charm++ handles internode communication
  - Load balancing remains ineffective until Charm++ understands GPU work
• Charm++ could help us make use of improvements to device communication, like NVLink
Acknowledgments

• Special thanks to:
- John Stone and others from Theoretical Biophysics Research Group, Beckman Institute, UIUC
- Ke Li and others from NVIDIA CUDA Team
- NVIDIA NSight Systems Team
- James Phillips, NCSA, UIUC
- In memoriam Antti-Pekka Hynninen, NVIDIA
- Ronak Buch and Karthik Senthil from Parallel Programming Laboratory, Dept of Computer Science, UIUC
• Grants:
- NIH P41-GM104601 Center for Macromolecular Modeling and Bioinformatics
- Summit Center for Accelerated Application Readiness, OLCF, ORNL