Source: GTC 2012 presentation S0351, on-demand.gputechconf.com/gtc/2012/presentations/S0351...
Strong Scaling for Molecular Dynamics Applications
Scaling and Molecular Dynamics
Strong Scaling
— How does solution time vary with the number of processors for a fixed problem size?
Classical Molecular Dynamics
— At each time step:
Compute pairwise forces on atoms
Integrate atoms forward in time
— Forces are either bonded (the atoms are covalently bound) or non-bonded
— Limit the amount of directly computed forces by using a cutoff distance
Estimate forces beyond the cutoff out to infinity with an FFT (PME)
Strong Scaling + Molecular Dynamics
— For a single run, performance can only be improved by speeding up the time step
Time steps already complete in milliseconds
— Strong scaling is needed because a large number of time steps is required to study phenomena of biological interest
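To see why per-step latency matters, here is a quick back-of-the-envelope calculation. The 2 fs integration timestep and the `ns_per_day` helper are illustrative assumptions, not numbers from the slides:

```python
# Simulated time per day as a function of wall-clock time per step.
# Assumes a 2 fs integration timestep (typical for biomolecular MD).
FEMTOSECONDS_PER_NS = 1e6
SECONDS_PER_DAY = 86400.0

def ns_per_day(wall_seconds_per_step, timestep_fs=2.0):
    steps_per_day = SECONDS_PER_DAY / wall_seconds_per_step
    return steps_per_day * timestep_fs / FEMTOSECONDS_PER_NS

# At 1 ms of wall clock per step, a 2 fs timestep yields ~172.8 ns/day;
# reaching microsecond timescales still takes days of wall-clock time,
# hence the need for strong scaling.
print(ns_per_day(1e-3))  # 172.8
```

Even halving the step time (to 0.5 ms) only doubles throughput, which is why every millisecond of the step is worth fighting for.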
Overview of talk
This talk looks at the issues we found while trying to improve the scaling performance of MD codes, mostly focusing on NAMD.
— I will try to make these lessons as general as possible
— This is not a comprehensive list of all scaling issues an application might face
What could cause scaling issues on GPU nodes?
We will focus on issues that differ from what you would face on the CPU path:
1. The GPU kernel(s) themselves do not scale
2. The code that sets up work for the GPU, manages the GPU, or processes results from the GPU does not scale
3. Something else, such as communication, becomes the bottleneck
— the GPU may speed up the computation to the point that computation is no longer the bottleneck
Process
Profile! Over multiple processor counts, observe how the following vary:
— Time spent in GPU functions
— Time spent in CPU functions
— Communication cost
Look at timeline traces to help understand bottlenecks and validate hypotheses
Example: AMBER
JAC NVE on M2090:

| | 1 node | 2 nodes | 4 nodes | 6 nodes | 8 nodes |
|---|---|---|---|---|---|
| ns/day | 35.7 | 49.16 | 69.68 | 79.73 | 85.21 |
| Parallel efficiency | 1.00 | 0.69 | 0.49 | 0.37 | 0.30 |
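The parallel-efficiency row follows directly from the throughput numbers; a minimal sketch (the `efficiency` helper is ours, not AMBER code):

```python
# Parallel efficiency from the JAC NVE throughput table above:
# efficiency(n) = (throughput on n nodes / throughput on 1 node) / n
nodes = [1, 2, 4, 6, 8]
ns_day = [35.7, 49.16, 69.68, 79.73, 85.21]

def efficiency(n, perf, base_perf=35.7):
    return (perf / base_perf) / n

effs = [round(efficiency(n, p), 2) for n, p in zip(nodes, ns_day)]
print(effs)  # [1.0, 0.69, 0.49, 0.37, 0.3]
```

Throughput still rises with node count, but by 8 nodes 70% of the added hardware is being wasted.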
Example: AMBER profile

| Function | Time (ms), 2 nodes | 4 nodes | 8 nodes | Parallel efficiency, 2 nodes | 4 nodes | 8 nodes |
|---|---|---|---|---|---|---|
| Build neighbor list | 0.171 | 0.087 | 0.051 | 1.000 | 0.982 | 0.844 |
| Non-bond forces | 1.030 | 0.548 | 0.347 | 1.000 | 0.939 | 0.742 |
| MPI win fence | 0.291 | 0.205 | 0.280 | 1.000 | 0.710 | 0.259 |
| MPI allgather | 0.193 | 0.324 | 0.404 | 1.000 | 0.298 | 0.119 |
| PME fill charge buffer | 0.339 | 0.346 | 0.386 | 1.000 | 0.490 | 0.220 |
Example: AMBER timeline
(Figure: timelines at 2, 4, and 8 nodes. Legend: ChargeGrid and PME kernels; MPI fence and gather; non-bond force kernel; other kernels; host-side CUDA functions; CUDA memcopies.)
Kernel Scaling
Possible reasons for kernel timing not scaling
Is the work given to the kernel scaling? I.e., is it reduced to 1/n per GPU when running on n GPUs?
Are you running fewer threads than needed to saturate the machine?
Are you running into the tail effect?
Work executed by the GPU not scaling? AMBER
Some of the functions are executed on only one GPU: Amdahl's law
(Figure: time taken for serial PME at 2, 4, 6, and 8 nodes.)
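Amdahl's law makes the cost of that serialized PME work concrete. A minimal sketch with an assumed serial fraction (the 10% figure is illustrative, not from the slides):

```python
# Amdahl's law: if a fraction s of the work stays serial (e.g. PME run on
# a single GPU), the speedup on n GPUs is bounded regardless of n.
def amdahl_speedup(n, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

# With 10% of the step serialized, 8 GPUs give only ~4.7x, and the
# asymptotic limit is 1/0.1 = 10x no matter how many GPUs are added.
print(round(amdahl_speedup(8, 0.10), 2))  # 4.71
```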
Not enough work to saturate the GPU
Try to increase occupancy
— Is occupancy limited by register or shared memory usage?
— Can you reduce the amount of work each thread does in order to increase the total number of threads on the GPU?
Are there multiple kernels that you could run concurrently?
Is there more work that you can port to the GPU?
Kernel not scaling: NAMD
In the case of NAMD, the amount of work given to the kernel was scaling perfectly and there was enough work.
Hypothesis: some blocks take a very long time, causing low SM occupancy towards the tail end of the kernel and hurting scaling.

| | 1 node | 8 nodes | 16 nodes | 32 nodes |
|---|---|---|---|---|
| Apoa1 | 27.3 | 3.9 | 2.3 | 1.5 |
| Parallel efficiency | 1 | 0.88 | 0.74 | 0.57 |
Tail effect
(Figure: SM 0 timelines for GPU 0 and GPU 1 over time.)
Block timing, Apoa1 at 32 nodes, twoAwayY, SM 0-10
(Figure: blocks sorted by start time, across timesteps 1 through 5.)
Block timing
(Figure: number of blocks running on SM 0 versus SM clock, at 2, 16, and 32 nodes; y-axis from 2 to 8 blocks.)
Performance Model
We can build a performance model to estimate the improvement in kernel time from perfect load balancing:
Distribute blocks to SMs in round-robin fashion
— Inside each SM, keep n queues, where n = number of concurrent blocks
Model actual kernel behavior by sampling from measured block runtimes
Model kernel behavior with perfect load balancing by setting all block runtimes to the average value
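The model above can be sketched in a few lines of Python. This is illustrative, not the authors' implementation; the exponential runtime distribution stands in for measured block times, and `kernel_time` is our name:

```python
import random

# Blocks go to SMs round-robin; inside each SM, n_concurrent queues model
# blocks resident at the same time. Kernel time = max over all queues of
# the summed block runtimes assigned to that queue.
def kernel_time(block_runtimes, num_sms, n_concurrent):
    queues = [[0.0] * n_concurrent for _ in range(num_sms)]
    for i, t in enumerate(block_runtimes):
        sm = queues[i % num_sms]
        # Each new block lands in that SM's least-loaded queue slot.
        slot = min(range(n_concurrent), key=lambda q: sm[q])
        sm[slot] += t
    return max(max(sm) for sm in queues)

random.seed(0)
# "Measured" runtimes: mostly short blocks plus a long tail (assumed data).
measured = [random.expovariate(1.0) for _ in range(512)]
balanced = [sum(measured) / len(measured)] * len(measured)  # perfect balance

actual = kernel_time(measured, num_sms=16, n_concurrent=8)
ideal = kernel_time(balanced, num_sms=16, n_concurrent=8)
print(ideal <= actual)  # True: balancing can only shorten the makespan here
```

The gap between `actual` and `ideal` is the model's estimate of how much the tail effect is costing.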
Estimates from the performance model
52% predicted improvement at 32 nodes if blocks are perfectly load balanced
Managing the GPU
Collecting work from multiple cores
Potential performance issues arise when running multiple MPI ranks with each rank submitting work to the GPU
— Each rank has to submit to its own GPU context, and kernels from different contexts cannot run concurrently
You can do better by:
— Using threads instead of MPI ranks
This way all threads can submit to the same context
— Using the CUDA proxy (introduced in CUDA 5.0, later known as MPS)
Collects work from multiple MPI ranks to be submitted through one context
NAMD – collect work from multiple ranks
Waiting for all the data to arrive
(Figure: time-step timelines at 1 node and 16 nodes; at 16 nodes a large portion of each time step is spent waiting for data to arrive before the GPU kernel runs.)
Packaging work for the GPU
Is the function that packages work for the GPU and processes work coming off the GPU scaling?
— Across nodes?
— Across cores?
NAMD:
— Pushing enqueueCUDA work to the sender nodes improved scaling both across nodes and across cores
Figuring out kernel completion
Reducing the time to query the event signaling kernel completion:
— 10% improvement at 16 nodes, 57% improvement at 52 nodes for Apoa1
(Figure: GPU and CPU timelines before and after the change.)
Communication Issues
Communication Issues
If all your CPU functions and GPU kernels are scaling well but the overall application is not, communication could be the bottleneck.
Communication issues are more critical to address in the GPU path: the GPU speeds up the computation, so communication can more easily become the bottleneck.
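A toy model makes this concrete. The numbers and the `step_time` helper are assumptions for illustration, not measurements from the slides:

```python
# Toy model: when compute and communication overlap, per-step time is
# roughly the larger of the two.
def step_time(compute_ms, comm_ms):
    return max(compute_ms, comm_ms)

compute_cpu, compute_gpu, comm = 10.0, 1.0, 2.0  # assumed per-step costs
# On the CPU path, 10 ms of compute hides the 2 ms of communication
# entirely; after a 10x GPU speedup, the same 2 ms of communication
# dominates the step.
print(step_time(compute_cpu, comm), step_time(compute_gpu, comm))  # 10.0 2.0
```

The same network that was invisible on the CPU path caps the GPU path at 2 ms per step in this model.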
Improving the communication performance
Use native uGNI messaging rather than generic MPI on a Cray cluster.
Improving communication performance helps the scaling of the GPU path more than the CPU path.
NAMD scaling of 100 M atoms
Performance results generated on Titan at Oak Ridge National Laboratory
Analysis

| Time (s) | Average time step | Non-PME step | PME step | Sum PME | Computation kernel |
|---|---|---|---|---|---|
| GPU, 100 stmv, 32 nodes | 1.249 | 1.009 | 1.833 | 0.796 | 0.964 |
| GPU, 100 stmv, 768 nodes | 0.085 | 0.046 | 0.176 | 0.034 | 0.042 |
| Strong scaling efficiency | 0.616 | 0.905 | 0.435 | 0.988 | 0.966 |
Example NAMD timelines
Legend: blue/purple = CPU functions; pink = GPU kernel; magenta = enqueueCUDA(work); green = PME functions; white = idle time.
Showing 11 CPU threads and one communication thread.
NAMD, strong scaling, 100 M atoms, GPU
100 stmv, 32 nodes: 4 timesteps = 4982 ms = 1.24 s/step (1.83 s for a PME step, 1.01 s for a non-PME step)
100 stmv, 768 nodes: 4 timesteps = 336 ms = 0.084 s/step (0.185 s for a PME step, 0.046 s for a non-PME step)
Communication delays in PME: tracing the data needed for one ungrid calculation
At 768 nodes (4 timesteps = 336 ms = 0.084 s/step): 0.042 s delay for a 10 KB message; 0.046 s delay for a 100 KB message
How to improve performance
Algorithmic changes to reduce the communication needed
— Reduce the amount of PME data by doing more non-bonded work
Improve communication performance
Topology
Ideal topology?
— Do you have the ideal topology of nodes to maximize bandwidth?
— Do the CPU-only nodes have the same or better topology?
Topology-aware mapping
Randomization
— As an experiment, try randomizing the ordering of the nodes you receive
Reducing Communication Delays
Reduce or remove global syncs and pairwise syncs
— Try to use one-sided RMA operations
If MPI is slow, try alternatives such as the Global Arrays toolkit, UPC, Co-Array Fortran, or DDI
Use P2P / interprocess P2P for GPUs on the same node

2 nodes: MPI_Win_fence = 338 us, MPI_Allgatherv = 183 us
8 nodes: MPI_Win_fence = 537 us, MPI_Allgatherv = 413 us
Contention
If you are running into network contention, try rendezvous messaging (the sender sends a small control message, and the receiver initiates the get)
— versus greedy / fire-and-forget messaging
Rendezvous lets you throttle and control congestion on both the sender and the receiver side
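The rendezvous idea can be sketched with in-memory queues. This is a simulation with hypothetical names (`RendezvousChannel`, `send`, `receive_one`), not a real messaging API:

```python
from collections import deque

# Rendezvous sketch: the sender posts only a small control message; the
# payload moves when the receiver decides to pull it, so the receiver can
# throttle incoming data instead of being flooded eagerly.
class RendezvousChannel:
    def __init__(self):
        self.control = deque()   # small "message ready" notifications
        self.buffers = {}        # sender-side payloads, keyed by message id

    def send(self, msg_id, payload):
        self.buffers[msg_id] = payload   # payload stays at the sender
        self.control.append(msg_id)      # only the control message "travels"

    def receive_one(self):
        if not self.control:
            return None
        msg_id = self.control.popleft()  # receiver initiates the get...
        return self.buffers.pop(msg_id)  # ...when it has room to land it

chan = RendezvousChannel()
chan.send(0, b"x" * 100_000)  # large payload is not pushed eagerly
chan.send(1, b"y" * 10_000)
print(len(chan.receive_one()), len(chan.receive_one()))  # 100000 10000
```

In a greedy scheme both payloads would be in flight immediately; here the receiver drains them at its own pace.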
Overlap kernel computation and network activity
If your CPU core is also responsible for the communication, compare three patterns:
— Kernel, then cudaSynchronize, then poll network
— Kernel, then CUDA memcpy, then poll network, then the next kernel
— Kernel with an async copy, checking the status of an event query and polling the network in between
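The third pattern can be sketched as a polling loop. This is a pure-Python simulation (`drive_step` and the tick counter stand in for `cudaMemcpyAsync` plus `cudaEventQuery`); none of these names come from the slides:

```python
from collections import deque

# Simulated overlap loop: the CPU core interleaves cheap, non-blocking
# "is the kernel done yet?" checks with network polling, instead of
# blocking in a synchronize call while messages pile up.
def drive_step(kernel_ticks_remaining, network_in):
    handled = []
    ticks = kernel_ticks_remaining
    while ticks > 0 or network_in:
        if ticks > 0:
            ticks -= 1                   # stands in for an event query: not done yet
            if ticks == 0:
                handled.append("kernel_done")
        if network_in:
            handled.append(network_in.popleft())  # poll network between queries
    return handled

msgs = deque(["msg0", "msg1", "msg2"])
order = drive_step(kernel_ticks_remaining=2, network_in=msgs)
print(order)
```

Network messages are serviced while the kernel is still in flight, which is exactly what the blocking cudaSynchronize variant cannot do.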
Acknowledgements
Oak Ridge National Lab
Scott Le Grand and Ross Walker – AMBER
Jim Phillips, Sanjay Kale, Yanhua Sun, Gengbin Zhang and Eric Bohm – NAMD
Sarah Anderson and Ryan Olson – Cray
Sky Wu, Nikolai Sakharnykh, Mike Parker and Justin Luitjens – NVIDIA
Questions?
Thank You!