TRANSCRIPT
VASP: A CASE STUDY FOR ACCELERATING PLANE WAVE DFT CODES
Presenters: Sarah Tariq and Przemyslaw Tredak
Authors: Jeroen Bedorf, Przemyslaw Tredak, Dusan Stosic, Arash Ashari, Paul Springer, Darko Stosic, Sarah Tariq, Paul Fleurat-Lessard and Anciaux-Sedrakian (ENS Lyon, IFPEN), Maxwell Hutchinson (University of Chicago) and Michael Widom (CMU)
GPU VASP COLLABORATION
Project scope: minimization algorithms to calculate the electronic ground state
— Blocked Davidson (ALGO = NORMAL & FAST)
— RMM-DIIS (ALGO = VERYFAST & FAST)
Earlier work
— Speeding up plane-wave electronic-structure calculations using graphics-processing units. Maintz, Eck, Dronskowski. (2011)
— VASP on a GPU: application to exact-exchange calculations of the stability of elemental boron. Hutchinson, Widom. (2011)
— Accelerating VASP Electronic Structure Calculations Using Graphic Processing Units. Hacene, Anciaux-Sedrakian, Rozanska, Klahr, Guignon, Fleurat-Lessard. (2012)
VASP OVERVIEW
Atomic scale materials modeling from first principles
Simulate atoms (mostly solids/surfaces)
Liquids, crystals, magnetism, semiconductors/insulators, surfaces, catalysts
Solve many-body Schrödinger equation
Density Functional Theory (DFT): Kohn-Sham equations
Optionally add exact exchange using hybrid Hartree-Fock functionals (HF)
THEORY
Self-consistent Kohn-Sham system
— Self-consistency loop until convergence
— Compute the Kohn-Sham potential $v_{KS}(\mathbf{r})$
— Solve the Kohn-Sham eigenproblem
— Obtain the electronic density $n(\mathbf{r})$
Kohn-Sham eigenproblem
— Diagonalize the Hamiltonian matrix $H_{KS}$
— Problem: $H_{KS}$ is often very large
— Solution: iterative matrix diagonalization schemes
— Blocked Davidson, RMM-DIIS
— Find the lowest few eigenstates $\varphi_i$ of $H_{KS}$
[Flowchart: start from an initial density $n_0(\mathbf{r})$, build $v_{KS}(\mathbf{r})$, solve $H_{KS}\,\varphi_i(\mathbf{r}) = E_i\,\varphi_i(\mathbf{r})$, form $n(\mathbf{r}) = \sum_i |\varphi_i(\mathbf{r})|^2$, and repeat until converged]
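Written out (a textbook Kohn-Sham summary in atomic units, not anything VASP-specific), the quantities in the loop are:

    $v_{KS}[n](\mathbf{r}) = v_{ext}(\mathbf{r}) + \int \frac{n(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|}\, d\mathbf{r}' + v_{xc}[n](\mathbf{r})$

    $\left( -\tfrac{1}{2} \nabla^2 + v_{KS}(\mathbf{r}) \right) \varphi_i(\mathbf{r}) = E_i\, \varphi_i(\mathbf{r}), \qquad n(\mathbf{r}) = \sum_{i \in \mathrm{occ}} |\varphi_i(\mathbf{r})|^2$

The loop feeds the new density back into the potential and repeats until $n(\mathbf{r})$ stops changing.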
SIMILARITIES IN PW DFT CODES
Rely heavily on the math libraries BLAS and FFT
— Easily offloaded using cuBLAS and cuFFT (a minimal sketch follows below)
Don't need to write a lot of specialized routines
— Focus is on keeping the GPU busy and reducing communication, rather than on hand-optimizing kernels
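As a rough illustration of how directly those library calls map, here is a minimal sketch (not VASP code; sizes, buffer names and the use of an in-place Z2Z transform are assumptions) of offloading the two hot spots to cuBLAS and cuFFT:

    // Sketch: the double-complex GEMM and the 3-D FFT that dominate plane-wave
    // codes map directly onto cublasZgemm and cufftExecZ2Z.
    #include <cublas_v2.h>
    #include <cufft.h>
    #include <cuComplex.h>

    void offload_hotspots(cuDoubleComplex *d_A, cuDoubleComplex *d_B, cuDoubleComplex *d_C,
                          cufftDoubleComplex *d_grid, int m, int n, int k,
                          int nx, int ny, int nz)
    {
        cublasHandle_t blas;
        cublasCreate(&blas);
        cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
        cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
        // Subspace rotations, overlap matrices, ...: double-complex GEMM on the GPU
        cublasZgemm(blas, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &one, d_A, m, d_B, k, &zero, d_C, m);

        // Wavefunction transform between k-space and real space: 3-D FFT on the GPU
        cufftHandle fft;
        cufftPlan3d(&fft, nx, ny, nz, CUFFT_Z2Z);
        cufftExecZ2Z(fft, d_grid, d_grid, CUFFT_FORWARD);

        cufftDestroy(fft);
        cublasDestroy(blas);
    }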
TARGET WORKLOADS
Silica
— 7 Å thick slab of amorphous silica, 240 atoms (Si68O148H24)
— RMM-DIIS (ALGO = VERYFAST)
NiAl-MD
— Liquid metal molecular dynamics sample of a nickel-based superalloy
— 500 atoms, 9 chemical species
— Blocked Davidson (ALGO = NORMAL)
VERSION AND HARDWARE
The GPU port is based on VASP version 5.2.12
The accelerated code includes the RMM-DIIS and Blocked Davidson routines, as well as the exact-exchange work from CMU
We have run the code on Fermi and Kepler boards
The code has been tested for functional correctness on more than 25 benchmarks
We present performance results on 2 benchmarks at the end of this presentation
OPTIMIZATION DETAILS
RUNTIME DISTRIBUTION FOR SILICA
[Chart: time in seconds for 1 K40 GPU + 1 Ivy Bridge core, comparing the original and the optimized GPU port, broken down into CPU, Memcopy, Gemm, FFT and Other]
OUTLINE
Reduce communication
Port more work to the GPU
Optimize for small benchmarks
Batch work
Improve MPI scaling
REDUCE COMMUNICATION
[Diagram: K40 on-chip memory bandwidth is 288 GB/s theoretical peak, while the PCIe Gen3 bus delivers only 16 GB/s theoretical peak per direction]
REDUCE COMMUNICATION – EDDRM AND EDDIAG
Overlap transfers with compute by passing stream index into pipeline of FFT subroutines
[Timeline: with everything on the default stream, the FFTs and memcopies serialize, leaving unnecessary idle time on the GPU]
[Timeline: with three streams, the memcopies overlap the FFTs]
Much better GPU utilization – 40% speedup in EDDRM and 144% in EDDIAG!
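A minimal sketch of the streaming pattern just described, assuming pinned host buffers and one cuFFT plan per stream (names such as h_buf, d_buf and nbatch are illustrative, not VASP's):

    #include <cuda_runtime.h>
    #include <cufft.h>

    #define NSTREAMS 3

    // Cycle the batches over a few streams so the HtoD copy, the FFT and the
    // DtoH copy of different batches overlap instead of serializing.
    void pipelined_ffts(cufftDoubleComplex *h_buf[], cufftDoubleComplex *d_buf[],
                        cufftHandle plan[NSTREAMS], size_t bytes, int nbatch)
    {
        cudaStream_t s[NSTREAMS];
        for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&s[i]);

        for (int b = 0; b < nbatch; ++b) {
            int i = b % NSTREAMS;
            cufftSetStream(plan[i], s[i]);               // this FFT runs in stream i
            cudaMemcpyAsync(d_buf[i], h_buf[b], bytes,
                            cudaMemcpyHostToDevice, s[i]);
            cufftExecZ2Z(plan[i], d_buf[i], d_buf[i], CUFFT_FORWARD);
            cudaMemcpyAsync(h_buf[b], d_buf[i], bytes,
                            cudaMemcpyDeviceToHost, s[i]);
        }
        cudaDeviceSynchronize();
        for (int i = 0; i < NSTREAMS; ++i) cudaStreamDestroy(s[i]);
    }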
REDUCE COMMUNICATION – EDDIAG
[Profiler timelines: EDDIAG before and after the change]
REDUCE COMMUNICATION – FORCE AND STRESS
[Timeline: an FFT kernel surrounded by HtoD/DtoH memcopies and downstream CPU work; the memory copies take more time than the kernel itself!]
— Port downstream CPU work to the GPU
— Remove unnecessary memory copies
— When possible, initialize data on the GPU (sketched below)
— Use streams to overlap computation and transfers
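As a small illustration of the "initialize data on the GPU" step (hypothetical kernel and data, not the actual force/stress code), generating values where they are consumed removes the HtoD copy entirely:

    #include <cuComplex.h>

    // Hypothetical example: fill an array of complex phase factors directly on
    // the device instead of computing it on the host and copying it over.
    __global__ void init_phase_factors(cuDoubleComplex *w, int n, double scale)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            w[i] = make_cuDoubleComplex(cos(scale * i), sin(scale * i));
    }

    void prepare_on_gpu(cuDoubleComplex *d_w, int n, double scale)
    {
        // Before: fill a host array, then cudaMemcpy(..., cudaMemcpyHostToDevice)
        // After : generate the same data in place; no transfer at all
        init_phase_factors<<<(n + 255) / 256, 256>>>(d_w, n, scale);
    }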
Result: force and stress time drops from 117 ms to 14 ms, an 8.3x speedup over the original GPU version
REDUCE COMMUNICATION – HIGH LEVEL RMM-DIIS PORT
Typical drop-in replacement may not work well for small CPU functions
[Diagram: replacing one small CPU function with a GPU call adds HtoD and DtoH copies around it and can actually cause a slowdown]
Porting more functions and keeping data on the GPU reduces communication and improves results!
High level RMM-DIIS port – 18% improvement!
BATCH AND STREAM WORK
BATCH WORK AND STREAM WORK
The GPU is massively parallel; we need to launch sufficient work to saturate it
A single call to a zgemm of (50x50) * (50x50) only launches 2 blocks, which fit on one SM
— Not sufficient to fully utilize the GPU!
We can launch multiple independent pieces of work simultaneously
BATCH WORK AND STREAM WORK
Sequential:
    for(int i=0;i<N;i++)
        cublasZgemm();
Streamed:
    for(int i=0;i<N;i++){
        cublasSetStream();
        cublasZgemm();
    }
Batched:
    cublasZgemmBatched();
Streamed: GPU utilization improved, but each zgemm still pays its own kernel launch overhead
Batched: a single zgemmBatched launch removes the per-call kernel launch overhead as well
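A runnable sketch of the streamed and batched variants (matrix sizes and pointer-array names are illustrative; note that cublasZgemmBatched expects the arrays of matrix pointers to live in device memory):

    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <cuComplex.h>

    // Streamed: cycle the cuBLAS handle over a few streams so the small,
    // independent GEMMs can execute concurrently (hA/hB/hC are host arrays
    // holding N device pointers to m x m matrices).
    void zgemms_streamed(cublasHandle_t h, int N, int m,
                         cuDoubleComplex *hA[], cuDoubleComplex *hB[], cuDoubleComplex *hC[],
                         cudaStream_t streams[], int nstreams)
    {
        cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
        cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
        for (int i = 0; i < N; ++i) {
            cublasSetStream(h, streams[i % nstreams]);
            cublasZgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, m, m,
                        &one, hA[i], m, hB[i], m, &zero, hC[i], m);
        }
        cublasSetStream(h, 0);   // restore the default stream
    }

    // Batched: one launch covers all N products; dA/dB/dC are device-resident
    // arrays of device pointers, so the per-call launch overhead disappears.
    void zgemms_batched(cublasHandle_t h, int N, int m,
                        const cuDoubleComplex **dA, const cuDoubleComplex **dB,
                        cuDoubleComplex **dC)
    {
        cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
        cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
        cublasZgemmBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, m, m, m,
                           &one, dA, m, dB, m, &zero, dC, m, N);
    }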
[Charts: GPU utilization over time for the three variants. The sequential loop runs one small GEMM at a time at low utilization; the streamed version overlaps GEMMs and raises utilization but keeps the per-call kernel launch overhead; the batched cublasZgemmBatched call keeps the GPU fully utilized]
BATCH WORK – INVERSE REAL-SPACE PROJECTION
Padding with 0 is required so that all gemms have the same sizes
[Diagram: data blocks of different lengths are zero-padded to a common size]
BATCH WORK - RPROMU
Problem: how to easily batch it?
Before:
    for i in 1..N
        for j in 1..M
            kernel<<<B,T,0,stream(i)>>>(…i,j);
Solution: use more grid dimensions and extract i and j from blockIdx.y and blockIdx.z
After:
    dim3 blocks(B,M,N);
    kernel<<<blocks,T>>>(…);
[Timelines: launch pattern before and after batching via the grid dimensions]
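A minimal sketch of the grid-dimension trick with a made-up kernel body (the real RPROMU kernel is different); the i/j double loop collapses into a single launch whose blockIdx.z and blockIdx.y carry the former loop indices:

    // Illustrative kernel: each (i, j) pair of the original loop nest becomes a
    // (blockIdx.z, blockIdx.y) slice of one large grid.
    __global__ void projection_kernel(double *out, const double *in, int len)
    {
        int i = blockIdx.z;                                // former outer loop index
        int j = blockIdx.y;                                // former inner loop index
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t < len) {
            int slice = i * gridDim.y + j;                 // which (i, j) block this is
            out[slice * len + t] = 2.0 * in[slice * len + t];
        }
    }

    void launch_all_at_once(double *d_out, const double *d_in, int len,
                            int B, int T, int M, int N)
    {
        dim3 blocks(B, M, N);                              // one launch for the whole loop nest
        projection_kernel<<<blocks, T>>>(d_out, d_in, len);
    }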
STREAM WORK: GRAM-SCHMIDT ORTHONORMALIZATION (ORTHCH), MULTI-BASIS MATRIX-MATRIX MULTIPLY (LINCOM)
Running on a K20X with 14 SMs, the zgemm kernel launches only 12 blocks
Because of register usage it can run 3 blocks per SM, so theoretically 14*3 = 42 blocks could be resident
Use streams to launch multiple independent zgemms and fill all the SMs
[Profiler timelines: original vs. new version]
MODIFY PARAMETERS TO IMPROVE BATCH SIZES
N = 2*NSIM
Increasing NSIM is an easy way to improve performance without changing the numerical accuracy of the results
REDUCE ALLOCATION / DEALLOCATION ON GPU
REDUCE ALLOCATION/DEALLOCATION ON GPU
Allocation / deallocation on the GPU is expensive, just as on the CPU
— Try to allocate once and reuse many times, even for temporary data
Allocations also cause expensive synchronization with the host, which introduces gaps in GPU utilization
Allocations and deallocations can be tracked using the CUDA API Trace functionality of the CUDA Visual Profiler
Before (1.4 ms): allocate and free around every call
    cudaMalloc(…);
    cudaMemcpy(…);
    kernel<<<…>>>(…);
    cudaMemcpy(…);
    cudaFree(…);
After (0.3 ms): keep the buffer alive and free/reallocate only when a larger size is needed
    cudaMemcpy(…);
    kernel<<<…>>>(…);
    cudaMemcpy(…);
    if(size < size_needed)
        cudaFree(…);
[Timelines: the per-call cudaMalloc/cudaFree pair shows up as unnecessary gaps around the kernel and the copies]
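A sketch of the reuse logic shown in the "after" snippet, written as a tiny scratch-buffer helper (names and the growth policy are assumptions, not VASP's implementation):

    #include <cuda_runtime.h>

    static void  *scratch      = NULL;    // survives across calls
    static size_t scratch_size = 0;

    void *get_scratch(size_t size_needed)
    {
        if (scratch_size < size_needed) {        // grow only when the request is larger
            if (scratch) cudaFree(scratch);      // the expensive path, taken rarely
            cudaMalloc(&scratch, size_needed);
            scratch_size = size_needed;
        }
        return scratch;                          // common path: no malloc, no device sync
    }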
REDUCE ALLOCATION/DEALLOCATION ON GPU - ECCP
REDUCE ALLOCATION/DEALLOCATION ON GPU – FORCE AND STRESS
The profile used to show cuFFT plan create and plan destroy calls; now there is no plan create or destroy
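The same idea applies to cuFFT plans: create once and reuse until the grid changes. A hedged sketch (the caching granularity and names are assumptions):

    #include <cufft.h>

    static cufftHandle cached_plan;
    static int cached_nx = -1, cached_ny = -1, cached_nz = -1;

    cufftHandle get_z2z_plan(int nx, int ny, int nz)
    {
        if (nx != cached_nx || ny != cached_ny || nz != cached_nz) {
            if (cached_nx != -1) cufftDestroy(cached_plan);
            cufftPlan3d(&cached_plan, nx, ny, nz, CUFFT_Z2Z);
            cached_nx = nx; cached_ny = ny; cached_nz = nz;
        }
        return cached_plan;   // repeat calls with the same grid create nothing
    }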
REDUCE CPU WORK
PORT ADDITIONAL WORK TO THE GPU
Setup precond – 9.3x speedup
— Change from executing many times on the CPU in the new-bands loop to executing only once on the GPU after the new-bands loop
Potlok
[Chart: Potlok as a fraction of total runtime. CPU version: 2%; initial GPU port: 7%; after optimizing other parts: 15%; after porting GGA (~50% of Potlok) to the GPU: 6%]
REMOVE UNNECESSARY CPU WORK
Example: DAXPY and DSCAL in EDDRM
[Diagram: the wavefunction has 135K elements in k-space and 1,143K elements in real space after the FFT; unnecessary DSCAL and DAXPY calls on the large real-space arrays are crossed out and removed]
1.24x speedup for the EDDRM routine
USING MORE CPU CORES
Runtime breakdown for SILICA on 1 K40 + 1 Ivy Bridge core (seconds): CPU 436, Memcopy 68, Gemm 120, FFT 288, Other 165
The CPU portion is left-over CPU work
USING MORE CPU CORES
[Chart: performance improvement from using multiple CPU cores; speedup vs. 1 GPU + 1 core as a function of cores per GPU (1, 2, 3, 4, 6), shown for 1, 2 and 4 GPUs]
USE MULTI PROCESS SERVICE (MPS)
Performance issues with running multiple MPI ranks per GPU
— Increased MPI communication
— Each rank running in its own context on the GPU
Use the MPS functionality introduced in CUDA 5.5 to have multiple MPI ranks run on the same GPU at the same time
— Allows kernels from multiple MPI ranks to run at the same time on the GPU
USING MULTIPLE CPU CORES PER GPU
[Diagram: with 1 GPU + 2 cores and no MPS, the zgemms of MPI rank 1 (context 1) and MPI rank 2 (context 2) cannot run concurrently; the GPU must context-switch between the two ranks]
USING MULTIPLE CPU CORES PER GPU
[Chart: performance improvement from using multiple CPU cores; speedup vs. 1 core as a function of cores per GPU (1, 2, 3, 4, 6), for 1, 2 and 4 GPUs with and without MPS; MPS adds roughly 11-14% on top of the multi-core speedup]
OPTIMIZATION FOR SMALL BENCHMARKS
SMALL BENCHMARK - PROBLEMS
Launch latency, memory copies and bookkeeping become a relatively large part of the time
Small kernels don't saturate the GPU, wasting resources
SMALL BENCHMARK - SOLUTION
Group independent parts together
Merge independent calls into one kernel
Group independent iterations together
SMALL BENCHMARK – EXAMPLE: I3 LOOP
[Diagram. Before: for each sim in nsim, setup kernel arguments, launch Daxpy kernel, launch Reduction kernel, copy results to CPU, process results. After: setup kernel arguments and launch the Daxpy and Reduction kernels for each sim in nsim, copy all the results to the CPU, then process the results for each sim on the CPU in parallel with further GPU work]
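A minimal sketch of the regrouped loop (the Daxpy kernel is a stand-in, the reduction kernel is omitted, and printf stands in for the real post-processing): all launches are issued back-to-back, one copy returns every per-band result, and the CPU processing happens once at the end where it can overlap the next block of GPU work:

    #include <cuda_runtime.h>
    #include <stdio.h>

    // Hypothetical stand-in for the per-band GPU work inside the I3 loop.
    __global__ void daxpy_kernel(int n, double a, const double *x, double *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] += a * x[i];
    }

    void i3_loop_regrouped(int nsim, int n, double *d_x[], double *d_y[],
                           double *d_res, double *h_res, double a, cudaStream_t s)
    {
        // 1) Launch the GPU work for all nsim bands without any per-band sync.
        for (int sim = 0; sim < nsim; ++sim)
            daxpy_kernel<<<(n + 255) / 256, 256, 0, s>>>(n, a, d_x[sim], d_y[sim]);

        // 2) One asynchronous copy brings back all per-band results at once
        //    (d_res is assumed to be filled by a reduction kernel, omitted here).
        cudaMemcpyAsync(h_res, d_res, nsim * sizeof(double), cudaMemcpyDeviceToHost, s);

        // 3) Sync once, then process every band on the CPU; this CPU work can
        //    now run in parallel with the GPU work of the next block of bands.
        cudaStreamSynchronize(s);
        for (int sim = 0; sim < nsim; ++sim)
            printf("band %d: %f\n", sim, h_res[sim]);
    }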
RESULTS FOR I3 LOOP
3.75x improvement for Pdo
— Small benchmark with only 87 ions
1.3x improvement for SILICA
SCALING
MPI SCALING
Number of GPUs | EDDIAG [seconds, scaling] | EDDRM [seconds, scaling] | ORTHCH [seconds, scaling]
1 GPU          | 4.2s, 100%                | 6.7s, 100%               | 1.5s, 100%
2 GPUs         | 2.8s, 75%                 | 3.4s, 99%                | 1.5s, 50%
4 GPUs         | 2.7s, 39%                 | 1.8s, 95%                | 2.4s, 15%
8 GPUs         | 1.9s, 27%                 | 0.9s, 93%                | 1.4s, 13%
Compute intensive routine (EDDRM): good scaling. MPI intensive routines (EDDIAG, ORTHCH): bad scaling.
OVERLAPPING MPI AND GPU WORK
Reordered such that MPI overlaps with computation
[Timelines: on the default stream, GPU compute, memcopies and MPI run back-to-back; with two streams the MPI communication and memory copies are hidden behind GPU compute]
3x improvement in Striploop in EDDIAG
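A hedged sketch of the overlap pattern as a generic chunked pipeline (not the actual Striploop code): while the GPU works on chunk k, the host performs the MPI reduction for chunk k-1:

    #include <mpi.h>
    #include <cuda_runtime.h>

    __global__ void compute_chunk(double *d, int n)       // stand-in for the real GPU work
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0;
    }

    void pipelined_reduce(double *d_chunk[], double *h_chunk[], int nchunks, int n,
                          cudaStream_t s, MPI_Comm comm)
    {
        for (int k = 0; k < nchunks; ++k) {
            // GPU: compute chunk k and start copying it back, asynchronously.
            compute_chunk<<<(n + 255) / 256, 256, 0, s>>>(d_chunk[k], n);
            cudaMemcpyAsync(h_chunk[k], d_chunk[k], n * sizeof(double),
                            cudaMemcpyDeviceToHost, s);
            // CPU: meanwhile, reduce the previous chunk over MPI.
            if (k > 0)
                MPI_Allreduce(MPI_IN_PLACE, h_chunk[k - 1], n, MPI_DOUBLE, MPI_SUM, comm);
            cudaStreamSynchronize(s);                     // chunk k is now on the host
        }
        MPI_Allreduce(MPI_IN_PLACE, h_chunk[nchunks - 1], n, MPI_DOUBLE, MPI_SUM, comm);
    }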
PRE-ALLOCATING MEMORY IN ONE CONTIGUOUS CHUNK
VASP allocates hundreds of small buffers at the start of the RMM-DIIS iterations.
— Memory allocations require locks and syncs and can therefore be relatively expensive.
— This cost increases with multiple GPUs
Instead:
— Do a single large memory allocation
— Divide the large memory buffer over the hundreds of small buffers
— Memory allocation phase over 100x faster.
[Profiler timelines: the buffer allocation phase before and after the change]
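A minimal sketch of the single-allocation scheme: one cudaMalloc up front and a bump-pointer sub-allocator handing out the hundreds of small buffers (alignment, names and the missing free path are simplifying assumptions):

    #include <cuda_runtime.h>
    #include <stddef.h>

    static char  *pool      = NULL;
    static size_t pool_size = 0;
    static size_t pool_used = 0;

    void pool_init(size_t total_bytes)
    {
        cudaMalloc((void **)&pool, total_bytes);   // the only real device allocation
        pool_size = total_bytes;
        pool_used = 0;
    }

    void *pool_alloc(size_t bytes)
    {
        bytes = (bytes + 255) & ~(size_t)255;      // keep sub-buffers 256-byte aligned
        if (pool_used + bytes > pool_size) return NULL;
        void *p = pool + pool_used;                // hand out a slice: no cudaMalloc,
        pool_used += bytes;                        // no lock, no device synchronization
        return p;
    }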
USING GPU DIRECT
[Diagram: without GPUDirect, data travels GPU to CPU to NIC on the sending node and NIC to CPU to GPU on the receiving node; with GPUDirect, the data moves between GPU and NIC directly]
USING GPU DIRECT
Use CUDA-aware MPI
— As simple as calling MPI_Send, MPI_Recv with pointers to the GPU data
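A minimal sketch of what this looks like, assuming a CUDA-aware MPI build (buffer names and the use of MPI_Sendrecv are illustrative):

    #include <mpi.h>

    // The send and receive buffers are device pointers; no staging copy through
    // host memory is needed, and with GPUDirect the data can go GPU to NIC directly.
    void exchange_coefficients(double *d_send, double *d_recv, int n,
                               int peer, MPI_Comm comm)
    {
        MPI_Sendrecv(d_send, n, MPI_DOUBLE, peer, 0,
                     d_recv, n, MPI_DOUBLE, peer, 0,
                     comm, MPI_STATUS_IGNORE);
    }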
Performance improvements
Number of GPUs | Time ORTHCH without | Time ORTHCH with | % improvement
2 GPUs         | 1.32s               | 0.99s            | 33%
4 GPUs         | 0.87s               | 0.63s            | 37%
RESULTS
RESULTS SILICA (RMM-DIIS) – VASP 5.2.2
• All results measured on K40 GPUs and a dual-socket Sandy Bridge node with 8 cores per socket running at 2.9 GHz
[Chart: speedup vs. a single CPU socket as a function of the number of CPU sockets, for CPU only (8 cores/CPU), a 1 GPU : 1 CPU ratio (2-6 cores/GPU) and a 2 GPU : 1 CPU ratio (1-2 cores/GPU); the GPU configurations reach roughly 2.3x-3.7x over the corresponding CPU-only runs]
1 node with two GPUs is faster than 10 CPU sockets (5 nodes)
RESULTS NIAL-MD (BLOCKED DAVIDSON) – VASP 5.2.2
• All results measured on K40 GPUs and a dual-socket Sandy Bridge node with 8 cores per socket running at 2.9 GHz
• Running with more cores per GPU runs out of memory
[Chart: speedup vs. a single CPU socket as a function of the number of CPU sockets, for CPU only (8 cores/CPU), a 1 GPU : 1 CPU ratio (1 core/GPU) and a 2 GPU : 1 CPU ratio (1 core/GPU); the GPU configurations reach roughly 3.4x-6.9x over the corresponding CPU-only runs]
1 node with one GPU is faster than 8 CPU sockets (4 nodes)