VASP: A CASE STUDY FOR ACCELERATING PLANE WAVE DFT CODES

Presenters: Sarah Tariq and Przemyslaw Tredak

Authors: Jeroen Bedorf, Przemyslaw Tredak, Dusan Stosic, Arash Ashari, Paul Springer, Darko Stosic, Sarah Tariq, Paul Fleurat-Lessard and Anciaux Sedrakian (Ens-lyon, IFPEN), Maxwell Hutchinson (University of Chicago) and Michael Widom (CMU)


Post on 16-Mar-2020


Page 1: VASP: A Case Study for Accelerating Plane Wave DFT Codes · on-demand.gputechconf.com/...vasp-accelerating-plane-wave-dft-codes.pdf



GPU VASP COLLABORATION

Collaborators

Project Scope

Minimization algorithms to calculate electronic ground state

— Blocked Davidson (ALGO = NORMAL & FAST)

— RMM-DIIS (ALGO = VERYFAST & FAST)

Earlier work

— Speeding up plane-wave electronic-structure calculations using graphics-processing units. Maintz, Eck, Dronskowski. (2011)

— VASP on a GPU: application to exact-exchange calculations of the stability of elemental boron. Hutchinson, Widom. (2011)

— Accelerating VASP Electronic Structure Calculations Using Graphic Processing Units. Hacene, Anciaux-Sedrakian, Rozanska, Klahr, Guignon, Fleurat-Lessard. (2012)


VASP OVERVIEW

Atomic scale materials modeling from first principles

Simulate atoms (mostly solids/surfaces)

Liquids, crystals, magnetism, semiconductor/insulators, surfaces, catalysts

Solve many-body Schrödinger equation

Density Functional Theory (DFT): Kohn-Sham equations

Optionally add exact-exchange using Hybrid Hartree Fock functionals (HF)


THEORY

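Self-consistent Kohn-Sham system

— Self-consistency loop until convergence

— Compute Kohn-Sham potential $v_{KS}(\mathbf{r})$

— Solve Kohn-Sham eigenproblem

— Obtain electronic density $n(\mathbf{r})$

Kohn-Sham eigenproblem

— Diagonalize Hamiltonian matrix $H_{KS}$

— Problem: often $H_{KS}$ is very big

— Solution: Iterative matrix diagonalization schemes

— Blocked Davidson, RMM-DIIS

— Find lowest few $\varphi_i$ eigenstates of $H_{KS}$

Flowchart: start from an initial density $n_0(\mathbf{r})$; compute $v_{KS}(\mathbf{r})$; solve $H_{KS}\,\varphi_i(\mathbf{r}) = E_i\,\varphi_i(\mathbf{r})$; form $n(\mathbf{r}) = \sum_i |\varphi_i(\mathbf{r})|^2$; stop? If yes, end; if no, return to the potential step.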


SIMILARITIES IN PW DFT CODES

Rely heavily on math libraries BLAS and FFT

— Easily offloaded using cuBLAS and cuFFT

Don’t need to write a lot of specialized routines

— Focus is on keeping GPU busy, and reducing communication instead of optimizing kernels


TARGET WORKLOADS

Silica

— 7 Å thick slab of amorphous silica, 240 atoms (Si68O148H24)

— RMM-DIIS (ALGO = VERYFAST)

NiAl-MD

— Liquid metal molecular dynamics sample of Nickel-based superalloy

— 500 atoms, 9 chemical species

— Blocked Davidson (ALGO = NORMAL)


VERSION AND HARDWARE

The GPU port is on VASP version 5.2.12

Code accelerated includes RMM-DIIS and Blocked Davidson routines and also exact-exchange work from CMU

We have run the code on Fermi and Kepler boards

The code has been tested for functional correctness on more than 25 benchmarks

We present performance results on 2 benchmarks at the end of this presentation


OPTIMIZATION DETAILS


RUNTIME DISTRIBUTION FOR SILICA

[Bar chart: time in sec (axis 0 to 3500) for 1 K40 GPU + 1 IvyBridge core. Chart labels: Optimized GPU port, original GPU port, CPU, Memcopy, Gemm, FFT, Other.]


OUTLINE

Reduce communication

Port more work to the GPU

Optimize for small benchmarks

Batch work

Improve MPI scaling


REDUCE COMMUNICATION


REDUCE COMMUNICATION

PCIe Bus

K40: 288 GB/s theoretical peak memory bandwidth on chip

PCIe Gen3: 16 GB/s theoretical peak per direction


REDUCE COMMUNICATION – EDDRM AND EDDIAG

Overlap transfers with compute by passing stream index into pipeline of FFT subroutines

[Timeline, default stream: FFT and Memcopy alternate over time, with unnecessary idle time between them]

[Timeline, Stream 1, Stream 2 and Stream 3: FFT and Memcopy overlap over time]

Much better GPU utilization – 40% speedup in EDDRM and 144% in EDDIAG!

[Profiler screenshots for EDDIAG: Before, After]
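The gain from overlapping copies with kernels can be estimated with a simple two-stage pipeline model. The host-only C++ sketch below is not taken from the VASP port: the function names are hypothetical, and it assumes one copy engine, one kernel running at a time, equal-sized chunks and no launch overhead.

```cpp
#include <algorithm>

// Time when every chunk is copied and then processed in a single stream:
// nothing overlaps, so the costs simply add up.
double serial_time(int chunks, double copy, double compute) {
    return chunks * (copy + compute);
}

// Idealised time when the copy of chunk i+1 overlaps the kernel of chunk i
// because the two are issued to different streams. Only the first copy and
// the last kernel are fully exposed; in between, the slower stage dominates.
double overlapped_time(int chunks, double copy, double compute) {
    if (chunks <= 0) return 0.0;
    return copy + (chunks - 1) * std::max(copy, compute) + compute;
}
```

For example, four chunks with a copy cost of 1 and a compute cost of 2 (arbitrary units) take 12 units serially and 9 units when overlapped in this model.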


REDUCE COMMUNICATION – FORCE AND STRESS

[Timeline rows: FFT, Memcopy (HtoD, DtoH), CPU (downstream CPU work), GPU, over time]

Memory copies taking more time than the kernel!

— Port downstream CPU work to GPU (the DtoH and HtoD copies around it become unnecessary)

— Remove unnecessary memory copies

— When possible, initialize data on the GPU

— Use streams to overlap computation and transfers

[Profiler screenshots: 117 ms, 14 ms]

8.3x speedup over original GPU version


REDUCE COMMUNICATION – HIGH LEVEL RMM-DIIS PORT

Typical drop-in replacement may not work well for small CPU functions

[Timeline: CPU, CPU, CPU]

[Timeline with one function ported: CPU, HtoD, GPU, DtoH, CPU. Slowdown!]

Porting more functions and keeping data on the GPU reduces communication and improves results!

[Timeline with all functions ported: GPU, GPU, GPU]

High level RMM-DIIS port – 18% improvement!


BATCH AND STREAM WORK


BATCH WORK AND STREAM WORK

GPU is massively parallel

Need to launch sufficient work to saturate it

A single call to a zgemm of (50x50) * (50x50) only launches 2 blocks which fit on one SM

— Not sufficient to fully utilize the GPU!

Can launch multiple independent pieces of work simultaneously


BATCH WORK AND STREAM WORK

Original

    for(int i=0;i<N;i++)
        cublasZgemm();

[Chart: GPU utilization (0 to 100) over time, one GEMM after another]

STREAMED

    for(int i=0;i<N;i++){
        cublasSetStream();
        cublasZgemm();
    }

[Chart: GPU utilization (0 to 100) over time, four GEMMs running concurrently]

— Improved: GPU utilization

— Not improved: kernel launch overhead

BATCHED

    cublasZgemmBatched();

[Chart: GPU utilization (0 to 100) over time, a single zgemmBatched launch]


BATCH WORK – INVERSE REAL-SPACE PROJECTION

Padding with 0 required to have same sizes of all gemms

[Diagram: three data blocks of different sizes, the smaller ones padded with 0]
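The padding step can be sketched on the host as follows. This is only an illustration under stated assumptions: the `Matrix` type, `pad_to` and `pad_batch` are hypothetical names, matrices are assumed to be column-major, and the real port would build the padded buffers in device memory before a batched zgemm call.

```cpp
#include <algorithm>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Dense column-major matrix (hypothetical helper type for this sketch).
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<Complex> data;  // rows * cols entries
};

// Copy `m` into the top-left corner of a rows x cols matrix of zeros.
Matrix pad_to(const Matrix& m, std::size_t rows, std::size_t cols) {
    Matrix out{rows, cols, std::vector<Complex>(rows * cols, Complex(0.0, 0.0))};
    for (std::size_t c = 0; c < m.cols; ++c)
        for (std::size_t r = 0; r < m.rows; ++r)
            out.data[c * rows + r] = m.data[c * m.rows + r];
    return out;
}

// Pad every matrix in the batch to the largest row and column count found,
// so that all gemms in the batch share the same dimensions.
std::vector<Matrix> pad_batch(const std::vector<Matrix>& batch) {
    std::size_t rows = 0, cols = 0;
    for (const Matrix& m : batch) {
        rows = std::max(rows, m.rows);
        cols = std::max(cols, m.cols);
    }
    std::vector<Matrix> out;
    out.reserve(batch.size());
    for (const Matrix& m : batch) out.push_back(pad_to(m, rows, cols));
    return out;
}
```

Zero rows and columns contribute nothing to a matrix product, so the padded gemms give the same values in the original top-left block at the cost of some wasted arithmetic.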


BATCH WORK - RPROMU

Problem: How to easily batch it?

Code

    for i in 1..N
        for j in 1..M
            kernel<<<B,T,0,stream(i)>>>(…i,j);

Use more grid dimensions and extract i and j from blockIdx.y and blockIdx.z

    dim3 blocks(B,M,N);
    kernel<<<blocks,T>>>(…);

[Result: profiler timelines over time for both versions]
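The index mapping behind the single launch can be checked on the host. In the C++ sketch below, `BlockIdx` is a hypothetical stand-in for CUDA's `blockIdx`, and the assignment of `j` to the y dimension and `i` to the z dimension is an assumption that follows from the `dim3 blocks(B,M,N)` ordering on the slide.

```cpp
#include <utility>
#include <vector>

// Host-side stand-in for CUDA's blockIdx (hypothetical type for this sketch).
struct BlockIdx { unsigned x, y, z; };

// With a grid of dim3 blocks(B, M, N), blockIdx.y runs over 0..M-1 and
// blockIdx.z over 0..N-1, so they can carry the two loop indices.
// Returns the 1-based (i, j) pair used by the original nested loops.
std::pair<unsigned, unsigned> loop_indices(const BlockIdx& b) {
    unsigned j = b.y + 1;  // inner loop index, j in 1..M
    unsigned i = b.z + 1;  // outer loop index, i in 1..N
    return {i, j};
}

// Walk the whole grid and record the (i, j) pair seen by the first x-block
// of every (y, z) slice. Each pair of the nested loops should appear once.
std::vector<std::pair<unsigned, unsigned>> covered_pairs(unsigned B, unsigned M, unsigned N) {
    std::vector<std::pair<unsigned, unsigned>> pairs;
    for (unsigned z = 0; z < N; ++z)
        for (unsigned y = 0; y < M; ++y)
            for (unsigned x = 0; x < B; ++x)
                if (x == 0) pairs.push_back(loop_indices({x, y, z}));
    return pairs;
}
```

One practical limit to keep in mind: CUDA caps the y and z grid dimensions at 65535, so very large M or N would still need to be split across launches.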


STREAM WORK: GRAM-SCHMIDT ORTHONORMALIZATION (ORTHCH), MULTI BASIS MATRIX MATRIX MULTIPLY (LINCOM)

Running on K20X with 14 SMs

Kernel launches 12 blocks

Because of register usage can run 3 blocks per SM

Theoretically can run 14*3 = 42 blocks

Use streams to launch multiple independent Zgemms and fill all the SMs

[Profiler timelines: Original, New]


MODIFY PARAMETERS TO IMPROVE BATCH SIZES

N = 2*NSIM

Increasing NSIM is an easy way to improve the performance without changing the numerical accuracy of the results


REDUCE ALLOCATION / DEALLOCATION ON GPU


REDUCE ALLOCATION/DEALLOCATION ON GPU

Allocation / Deallocation on GPU is expensive, same as CPU

— Try to allocate once and use many times, even for temporary data

Allocations also cause expensive synchronization with the host, which introduces gaps in the GPU utilization

Allocations and deallocations may be tracked using CUDA API Trace functionality of CUDA Visual Profiler


REDUCE ALLOCATION/DEALLOCATION ON GPU

Before

[Timeline over time: Allocate, HtoD, GPU, DtoH, Deallocate]

    cudaMalloc(…);
    cudaMemcpy(…);
    kernel<<<…>>>(…);
    cudaMemcpy(…);
    cudaFree(…);

After

[Timeline over time: HtoD, GPU, DtoH]

    if(size < size_needed) {
        cudaFree(…);
        cudaMalloc(…);
    }
    cudaMemcpy(…);
    kernel<<<…>>>(…);
    cudaMemcpy(…);

REDUCE ALLOCATION/DEALLOCATION ON GPU - ECCP

[Profiler screenshot labels: 1.4ms, 0.3ms, Unnecessary]
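The `if(size < size_needed)` test amounts to a grow-only scratch buffer. The host-only C++ sketch below illustrates that pattern; `std::malloc` and `std::free` stand in for `cudaMalloc` and `cudaFree` so that it runs without a GPU, and the class name and allocation counter are hypothetical additions for illustration.

```cpp
#include <cstddef>
#include <cstdlib>

// Scratch buffer that is reallocated only when a request outgrows it.
// std::malloc/std::free stand in for cudaMalloc/cudaFree in this sketch.
class ReusableBuffer {
public:
    ReusableBuffer() = default;
    ReusableBuffer(const ReusableBuffer&) = delete;
    ReusableBuffer& operator=(const ReusableBuffer&) = delete;
    ~ReusableBuffer() { std::free(ptr_); }

    // Return a buffer of at least `size_needed` bytes.
    void* get(std::size_t size_needed) {
        if (size_ < size_needed) {            // same test as on the slide
            std::free(ptr_);                  // cudaFree in the GPU code
            ptr_ = std::malloc(size_needed);  // cudaMalloc in the GPU code
            size_ = ptr_ ? size_needed : 0;
            ++allocations_;
        }
        return ptr_;
    }

    // Number of real allocations performed so far.
    int allocations() const { return allocations_; }

private:
    void* ptr_ = nullptr;
    std::size_t size_ = 0;
    int allocations_ = 0;
};
```

Repeated calls with the same or a smaller size reuse the existing block, so the allocation cost is paid only when the working set grows.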


REDUCE ALLOCATION/DEALLOCATION ON GPU – FORCE AND STRESS

[Profiler screenshot labels: cuFFT plan create, cuFFT plan destroy]

Now: no plan create or destroy


REDUCE CPU WORK


PORT ADDITIONAL WORK TO THE GPU

Setup precond – 9.3x speedup

— Change from executing many times on the CPU in the new bands loop to executing only once on the GPU after the new bands loop

Potlok

— CPU: 2% of runtime

— Initial GPU: 7% of runtime

— After optimizing other parts, GPU: 15% of runtime

— After porting GGA (~50% of Potlok) to GPU: 6% of runtime


REMOVE UNNECESSARY CPU WORK

Example: Daxpy and Dscal in EDDRM

[Diagram: K space (135K elements) and real space (1,143K elements) connected by an FFT, with DSCAL and DAXPY steps applied to 1,143K elements. One DSCAL and DAXPY pair is crossed out; one pair remains.]

1.24x speedup for EDDRM routine


USING MORE CPU CORES

[Pie chart: SILICA, 1 K40 + 1 Ivy Bridge core]

— CPU (left over CPU work): 436

— Memcopy: 68

— Gemm: 120

— FFT: 288

— Other: 165


USING MORE CPU CORES

[Chart: Performance improvement with using multiple CPU cores. x-axis: Cores per GPU (1, 2, 3, 4, 6). y-axis: Speedup vs. 1 GPU 1 core (0 to 3). Series: 1 GPU, 2 GPUs, 4 GPUs.]


USE MULTI PROCESS SERVICE (MPS)

Performance issues with running multiple MPI ranks per GPU

— Increased MPI communication

— Each rank running in its own context on the GPU

Use the MPS functionality introduced in CUDA 5.5 to have multiple MPI ranks run on the same GPU at the same time

— Allows kernels from multiple MPI ranks to run at the same time on the GPU


USING MULTIPLE CPU CORES PER GPU

[Timeline, 1 GPU + 1 core: zgemm calls run back to back (Time 1)]

[Timeline, 1 GPU + 2 cores: zgemm calls from Context 1 (MPI rank 1) and Context 2 (MPI rank 2) alternate, separated by a context switch (Time 2)]


USING MULTIPLE CPU CORES PER GPU

[Chart: Performance improvement with using multiple CPU cores. x-axis: Cores per GPU (1, 2, 3, 4, 6). y-axis: Speedup vs. 1 core (0.8 to 3.3). Series: 1 GPU, 1 GPU + MPS, 2 GPU, 2 GPU + MPS, 4 GPU, 4 GPU + MPS. Annotated MPS gains: 14%, 13%, 11%.]


OPTIMIZATION FOR SMALL BENCHMARKS


SMALL BENCHMARK - PROBLEMS

Launch latency, memory copies and bookkeeping relatively large part of time

Small kernels don’t saturate GPU, wasting resources


SMALL BENCHMARK - SOLUTION

Group independent parts together

Merge independent calls into one kernel

Group independent iterations together


SMALL BENCHMARK – EXAMPLE I3 LOOP

BEFORE

For each sim in nsim:

— Setup kernel arguments

— Launch Daxpy kernel

— Launch Reduction kernel

— Copy results to CPU

— Process results

AFTER

— For each sim in nsim (CPU work in parallel): Setup kernel arguments

— Launch Daxpy kernel

— Launch Reduction kernel

— Copy results to CPU

— For each sim in nsim (CPU work in parallel): Process results


RESULTS FOR I3 LOOP

3.75x improvement for Pdo

— Small benchmark with only 87 ions

1.3x improvement for SILICA


SCALING


MPI SCALING

Number of GPUs   EDDIAG [seconds, scaling]   EDDRM [seconds, scaling]   ORTHCH [seconds, scaling]
1 GPU            4.2s, 100%                  6.7s, 100%                 1.5s, 100%
2 GPUs           2.8s, 75%                   3.4s, 99%                  1.5s, 50%
4 GPUs           2.7s, 39%                   1.8s, 95%                  2.4s, 15%
8 GPUs           1.9s, 27%                   0.9s, 93%                  1.4s, 13%

Compute intensive routine (EDDRM): good scaling

MPI intensive routines (EDDIAG, ORTHCH): bad scaling


OVERLAPPING MPI AND GPU WORK

Reordered such that MPI overlaps with computation

[Timeline, default stream: GPU compute, Memcopy and MPI run one after another over time]

[Timeline, Stream 1 and Stream 2: GPU compute overlaps with Memcopy and MPI over time]

Hide MPI communication and memory copies.

3x improvement in Striploop in EDDIAG


PRE-ALLOCATING MEMORY IN ONE CONTIGUOUS CHUNK

VASP allocates hundreds of small buffers at the start of the RMM-DIIS iterations.

— Memory allocations require locks and syncs and can therefore be relatively expensive.

— This cost increases with multiple GPUs

Instead:

— Do a single large memory allocation

— Divide the large memory buffer over the hundreds of small buffers

— Memory allocation phase over 100x faster.
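A minimal host-only sketch of the single-allocation approach follows. It is not the VASP implementation: `std::malloc` stands in for one `cudaMalloc` call so the code runs without a GPU, and the class name and the 256-byte alignment are assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Carve many small buffers out of one large allocation.
// std::malloc stands in for a single cudaMalloc in this sketch.
class BufferPool {
public:
    explicit BufferPool(const std::vector<std::size_t>& sizes,
                        std::size_t alignment = 256) {
        std::size_t offset = 0;
        for (std::size_t s : sizes) {
            offsets_.push_back(offset);
            // Round each size up so that the next buffer starts aligned.
            offset += (s + alignment - 1) / alignment * alignment;
        }
        total_ = offset;
        // The one and only allocation (at least 1 byte to keep malloc happy).
        base_ = static_cast<char*>(std::malloc(total_ ? total_ : 1));
    }
    BufferPool(const BufferPool&) = delete;
    BufferPool& operator=(const BufferPool&) = delete;
    ~BufferPool() { std::free(base_); }

    // Pointer to the i-th small buffer inside the large block.
    void* buffer(std::size_t i) const { return base_ + offsets_[i]; }

    // Size of the large block in bytes, including alignment padding.
    std::size_t total_bytes() const { return total_; }

private:
    char* base_ = nullptr;
    std::size_t total_ = 0;
    std::vector<std::size_t> offsets_;
};
```

Handing out a sub-buffer is then just pointer arithmetic, so hundreds of requests cost one allocation and one deallocation in total.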


USING GPU DIRECT

BEFORE: GPU, CPU, NIC, NIC, CPU, GPU

AFTER: GPU, NIC, NIC, GPU


USING GPU DIRECT

Use CUDA Aware MPI

— As simple as calling MPI_Send, MPI_Recv with pointers to the GPU data

Performance improvements

Number of GPUs   Time ORTHCH without   Time ORTHCH with   % improvement
2 GPUs           1.32s                 0.99s              33%
4 GPUs           0.87s                 0.63s              37%


RESULTS


RESULTS SILICA (RMM-DIIS) – VASP 5.2.2

• all results measured on K40 and dual socket sandy bridge with 8 cores per socket running at 2.9GHz

[Chart: Speedup vs. single CPU socket (0 to 10) against number of CPU sockets (0 to 10). Series: 2 GPU : 1 CPU ratio (1-2 cores/GPU), CPU only (8 cores/CPU), 1 GPU : 1 CPU ratio (2-6 cores/GPU). Data labels: 2.5x, 2.4x, 2.3x, 2.9x, 2.9x, 3.7x, 3.6x.]

1 node with two GPUs is faster than 10 CPU sockets (5 nodes)


RESULTS NIAL-MD (BLOCKED DAVIDSON), VASP 5.2.2

• all results measured on K40 and dual socket sandy bridge with 8 cores per socket running at 2.9GHz

• Running with more cores per GPU runs out of memory

[Chart: Speedup vs. single CPU socket (0 to 10) against number of CPU sockets (0 to 8). Series: 2 GPU : 1 CPU ratio (1 core/GPU), CPU only (8 cores/CPU), 1 GPU : 1 CPU ratio (1 core/GPU). Data labels: 4x, 6.9x, 4.8x, 4.9x, 3.5x, 3.4x.]

1 node with one GPU is faster than 8 CPU sockets (4 nodes)