vasp-gpu on Balena: Usage and Some Benchmarks
TRANSCRIPT
Balena User Group Meeting
3rd February 2017
Ø The VASP SCF cycle in a nutshell
Ø Parallelisation in VASP
o Workload and data distribution
o Parallelisation control parameters
o Some rules of thumb for optimising parallel scaling
Ø The GPU (CUDA) port of VASP
o Compiling and running
o Features
o Some initial benchmarks
Ø Thoughts and discussion points
Balena User Group Meeting, February 2017 | Slide 2
Overview
[Figure: the VASP SCF cycle. Sources: http://www.iue.tuwien.ac.at/phd/goes/dissse14.html; S. Maintz et al., Comput. Phys. Comm. 182, 1421 (2011)]
The VASP SCF cycle in a nutshell
Ø The newest versions of VASP implement four levels of parallelism:
o k-point parallelism: KPAR
o Band parallelism and data distribution: NCORE and NPAR
o Parallelisation and data distribution over plane-wave coefficients (= FFTs; done over planes along NGZ): LPLANE
o Parallelisation of some linear-algebra operations using ScaLAPACK (notionally set at compile time, but can be controlled at runtime using LSCALAPACK)
Ø Effective parallelisation will…:
o …minimise (relatively slow) communication between MPI processes,…
o …distribute data to reduce memory requirements,…
o …and make sure the MPI processes have enough work to keep them busy
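As a rough illustration, the four levels map onto INCAR tags along these lines (a sketch; the values are hypothetical, for a 64-process job):

```
KPAR = 4            ! 4 k-point groups of 16 processes each
NCORE = 8           ! 8 cores per band group (data distributed within groups)
LPLANE = .TRUE.     ! distribute FFT data over planes along NGZ (the default)
LSCALAPACK = .TRUE. ! use ScaLAPACK for the linear-algebra routines
```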
Parallelisation in VASP
[Diagram: MPI processes divided into KPAR k-point groups, then NPAR band groups, then NGZ FFT groups(?)]
Ø Workload distribution over KPAR k-point groups, NBANDS band groups and NGZ plane-wave coefficient (FFT) groups [not 100% sure how this works…]
Parallelisation: Workload distribution
[Diagram: data divided over KPAR k-point groups, NPAR band groups and NGZ FFT groups(?)]
Ø Data distribution over NBANDS band groups and NGZ plane-wave coefficient (FFT) groups [also not 100% sure how this works…]
Parallelisation: Data distribution
Ø During a standard DFT calculation, k-points are independent -> k-point parallelism should be linearly scaling, although perhaps not in practice: https://www.nsc.liu.se/~pla/blog/2015/01/12/vasp-how-many-cores
Ø WARNING: <#procs> must be divisible by KPAR, but the parallelisation is via a round-robin algorithm, so <#k-points> does not need to be divisible by KPAR -> check how many irreducible k-points you have (IBZKPT file) and set KPAR accordingly
[Diagram: round-robin distribution of 3 k-points (k1-k3) over KPAR groups. KPAR = 1: t = 3 rounds [OK]; KPAR = 2: t = 2 rounds, with one group idle in round 2 [Bad]; KPAR = 3: t = 1 round [Good]]
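The divisibility rule can be sketched as a toy shell helper (not part of VASP; the process and k-point counts are assumed inputs — in a real job the irreducible k-point count would come from line 2 of the IBZKPT file):

```shell
#!/bin/bash
# Pick the largest KPAR that (a) divides the MPI process count and
# (b) divides the irreducible k-point count, so that every round of
# the round-robin is completely filled (no idle k-point groups).
nprocs=64   # MPI processes (hypothetical)
nk=12       # irreducible k-points, e.g. from line 2 of IBZKPT
best=1
for kpar in $(seq 1 "$nprocs"); do
  (( nprocs % kpar )) && continue       # <#procs> must be divisible by KPAR
  if (( kpar <= nk && nk % kpar == 0 )); then
    best=$kpar                          # all rounds full: no idle groups
  fi
done
echo "KPAR = $best"
```

For 64 processes and 12 irreducible k-points this picks KPAR = 4, the largest divisor of both counts.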
Parallelisation: KPAR
NCORE: number of cores in band groups; NPAR: number of bands treated simultaneously
NCORE = <#procs> / NPAR
Ø For NCORE = 1 / NPAR = <#procs> (the default), having more band groups appears to increase memory pressure and incur a substantial communication overhead
[Figure: benchmark speedup comparison; data labels 7.08×, 6.41×, 6.32×]
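The relation above is simple enough to tabulate (a throwaway sketch; the 64-process job is hypothetical):

```shell
#!/bin/bash
# NCORE = <#procs> / NPAR, tabulated for a 64-process job.
nprocs=64
for npar in 1 2 4 8 16 32 64; do
  echo "NPAR=$npar -> NCORE=$(( nprocs / npar ))"
done
```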
Parallelisation: NCORE and NPAR
Ø WARNING: VASP will increase the default NBANDS to the nearest multiple of the number of groups
Ø Since the electronic minimisation scales as a power of NBANDS, this can backfire in calculations with a large NPAR (e.g. those requiring NPAR = <#procs>)
Cores   NBANDS (default)   NBANDS (adjusted)
96      455                480
128     455                512
192     455                576
256     455                512
384     455                768
512     455                512

NBANDS = NELECT/2 + NIONS/2

Example system:
• 238 atoms w/ 672 electrons
• Default NBANDS = 455

For spin-polarised calculations: NBANDS = (3/5) NELECT + NMAG
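The adjusted values in the table follow from rounding the default NBANDS up to a multiple of the number of band groups (NPAR = <#procs> by default); a quick sketch:

```shell
#!/bin/bash
# Round the default NBANDS (455 for the example system) up to the
# nearest multiple of the number of band groups (= core count when
# the default NPAR = <#procs> is used).
default=455
for nprocs in 96 128 192 256 384 512; do
  adjusted=$(( (default + nprocs - 1) / nprocs * nprocs ))
  echo "Cores=$nprocs -> NBANDS=$adjusted"
done
```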
Parallelisation: NCORE and NPAR
Ø The RMM-DIIS (ALGO = VeryFast | Fast) algorithm involves three steps:
o EDDIAG: subspace diagonalisation
o RMM-DIIS: electronic minimisation
o ORTHCH: wavefunction orthogonalisation
Routine     312 atoms        624 atoms        1,248 atoms       1,872 atoms
EDDIAG      2.90 (18.64%)    12.97 (22.24%)   75.26 (26.38%)    208.29 (31.31%)
RMM-DIIS    12.39 (79.63%)   42.73 (73.27%)   187.62 (65.78%)   379.80 (57.10%)
ORTHCH      0.27 (1.74%)     2.62 (4.49%)     22.36 (7.84%)     77.11 (11.59%)
Ø EDDIAG and ORTHCH formally scale as N³, and rapidly begin to dominate the SCF cycle time for large calculations
Ø A good ScaLAPACK library can improve the performance of these routines in massively-parallel calculations
See also: https://www.nsc.liu.se/~pla/blog/2014/01/30/vasp9k
Parallelisation: ScaLAPACK
Ø KPAR: the current implementation does not distribute data over k-point groups -> KPAR = N will use N× more memory than KPAR = 1
Ø NPAR/NCORE: data is distributed over band groups -> decreasing NPAR/increasing NCORE will considerably reduce memory requirements
Ø NPAR takes precedence over NCORE - if you use "master" INCAR files, make sure you don't define both
Ø The defaults for NPAR/NCORE (NPAR = <#procs>, NCORE = 1) are usually a poor choice for both memory requirements and performance
Ø Band parallelism for hybrid functionals has been supported since VASP 5.3.5; for memory-intensive calculations, it is a good alternative to underpopulating nodes
Ø LPLANE: distributes data over plane-wave coefficients, and speeds things up by reducing communication during FFTs - the default is LPLANE = .TRUE., and should only need to be changed for massively-parallel architectures (e.g. Blue Gene/Q)
Parallelisation: Memory
Ø For x86_64 IB systems (e.g. Balena, Archer…):
o Try KPAR for heavy calculations (e.g. hybrids)
o Set NPAR = (<#procs> / KPAR) or NCORE = <#procs/node>
o 1 node/band group per 50 atoms; may want to use 2 nodes/50 atoms for hybrids, or decrease to ½ node per band group for <10 atoms
o Leave LPLANE at the default (.TRUE.)
o WARNING: In my experience of Cray systems (Archer/XC30, SiSu/XC40), using KPAR sometimes causes VASP to hang during multistep calculations (e.g. optimisations)
Ø For the IBM Blue Gene/Q (STFC Hartree Centre):
o Last time I used it, the Hartree machine only had VASP 5.2.x -> no KPAR
o Try to choose a square number of cores, and set NPAR = sqrt(<#procs>)
o Consider setting LPLANE = .FALSE. if <#procs> ≥ NGZ
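Put together, the x86_64 rules might translate into an INCAR fragment like this (a sketch only, assuming a hypothetical heavy job on 4 nodes of 16 cores each, with at least 4 irreducible k-points):

```
KPAR = 4      ! one k-point group per node
NCORE = 16    ! = <#procs/node> for an (assumed) 16-core node
! LPLANE left at the default (.TRUE.)
```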
Parallelisation: Some rules of thumb
Ø GPU computing works in an offload model
Ø Programming models such as CUDA and OpenCL provide APIs for:
o Copying memory to and from the GPU
o Compiling kernel programs to run on the GPU
o Setting up and running kernels on input data
Ø Porting codes for GPUs involves identifying routines that can be efficiently mapped to the GPU architecture, writing kernels, and interfacing them to the CPU code
[Diagram: offload model - data and kernel programs are copied from the CPU to the GPU, the kernels run on the GPU, and the results are copied back]
GPU computing
vasp-gpu
Ø Starting from the February 2016 release of VASP 5.4.1, the distribution includes a CUDA port that offloads some of the core DFT routines onto NVIDIA GPUs
Ø A culmination of research at the University of Chicago, Carnegie Mellon and ENS-Lyon, and a healthy dose of optimisation by NVIDIA
Ø Three papers covering the implementation and testing:
o M. Hacene et al., J. Comput. Chem. 33, 2581 (2012), 10.1002/jcc.23096
o M. Hutchinson and M. Widom, Comput. Phys. Comm. 183, 1422 (2012), 10.1016/j.cpc.2012.02.017
o S. Maintz et al., Comput. Phys. Comm. 182, 1421 (2011), 10.1016/j.cpc.2011.03.010
Because sharing is caring...
https://github.com/JMSkelton/VASP-GPU-Benchmarking
Ø Easy(ish) with the VASP 5.4.1 build system:
o Load cuda/toolkit (along with intel/compiler, intel/mkl, etc.)
o Modify the arch/makefile.include.linux_intel_cuda example
o Make the gpu and/or gpu_ncl targets

Modules used:
  intel/compiler/64/15.0.0.090
  intel/mkl/64/11.2
  openmpi/intel/1.8.4
  cuda/toolkit/7.5.18

makefile.include (excerpt):
  FC = mpif90
  FCL = mpif90 -mkl -lstdc++
  ...
  CUDA_ROOT := /cm/shared/apps/cuda75/toolkit/7.5.18
  ...
  MPI_INC = /apps/openmpi/intel-2015/1.8.4/include/
https://github.com/JMSkelton/VASP-GPU-Benchmarking/Compilation
vasp-gpu: Compilation
Ø Available as a module on Balena: module load untested vasp/intel/5.4.1
Ø To use vasp-gpu on Balena, you need to request a GPU-equipped node and perform some basic setup tasks in your SLURM scripts
#SBATCH --partition=batch-acc

# Node w/ 1 k20x card.
#SBATCH --gres=gpu:1
#SBATCH --constraint=k20x

# Node w/ 4 k20x cards.
##SBATCH --gres=gpu:4
##SBATCH --constraint=k20x

if [ ! -d "/tmp/nvidia-mps" ] ; then
    mkdir "/tmp/nvidia-mps"
fi
export CUDA_MPS_PIPE_DIRECTORY="/tmp/nvidia-mps"

if [ ! -d "/tmp/nvidia-log" ] ; then
    mkdir "/tmp/nvidia-log"
fi
export CUDA_MPS_LOG_DIRECTORY="/tmp/nvidia-log"

nvidia-cuda-mps-control -d
https://github.com/JMSkelton/VASP-GPU-Benchmarking/Scripts
vasp-gpu: Running jobs
Ø Uses cuFFT and CUDA ports of compute-heavy parts of the SCF cycle
Ø ALGO = Normal | VeryFast (+ Fast) w/ LREAL = Auto fully supported, along with KPAR, exact exchange and non-collinear spin
Ø ALGO = All | Damped and the GW routines work, but are not optimised ("passively supported")
Ø LREAL = .FALSE., NCORE > 1 (NPAR != N) and electric fields are not supported (will crash with an error)
Ø Currently no Gamma-only version
Ø Future roadmap: Γ-point optimisations and support for LREAL = .FALSE., vdW functionals, RPA/GW calculations and band parallelism
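A minimal INCAR sketch for a fully-supported GPU run, based on the feature list above (the KPAR and NSIM values are hypothetical and would need tuning):

```
ALGO = VeryFast    ! fully supported on the GPU port
LREAL = Auto       ! required: LREAL = .FALSE. is not supported
NCORE = 1          ! band parallelism (NCORE > 1) is not supported
KPAR = 2           ! k-point parallelism is supported
NSIM = 16          ! larger values give better GPU utilisation at a memory cost
```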
vasp-gpu: Features
Ø Each MPI process allocates its own set of cuFFT plans and CUDA kernels, distributing round-robin among the available GPUs
Ø The size of the CUDA kernels is controlled by NSIM: broadly, NSIM ↑ = better GPU utilisation but higher memory requirements
Ø <#procs> should be a multiple of <#GPUs>, and for most systems you will probably end up underpopulating the CPUs
[Diagram: round-robin mapping of MPI processes to GPUs - e.g. processes 1-4 alternating over GPUs 1-2, or processes 1-4 mapped one-to-one onto GPUs 1-4]
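The round-robin assignment sketched in the diagram amounts to rank mod #GPUs (a toy illustration; the counts are hypothetical):

```shell
#!/bin/bash
# Round-robin mapping of MPI ranks to GPUs: rank mod #GPUs.
ngpus=2    # GPUs on the node (hypothetical)
nprocs=4   # MPI processes
map=""
for (( rank = 0; rank < nprocs; rank++ )); do
  map+="process $rank -> GPU $(( rank % ngpus )); "
done
echo "$map"
```

With 4 processes and 2 GPUs, processes 0 and 2 share GPU 0 while 1 and 3 share GPU 1.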
vasp-gpu: Load balancing
Ø 64 to 1,024 atoms in a random cubic arrangement; ALGO = VeryFast w/ LREAL = Auto, k = Γ; 1 GPU node w/ 1 or 4 Tesla K20x cards vs. 1 compute node
vasp-gpu: Benchmarking
# MPI procs \ NSIM      1      2      4      8     12     16     24     32     48     64
 1                  13.52   8.88   8.15   7.82   7.77   7.76   7.72   7.74   7.81   7.89
 2                   9.11   6.75   6.34   6.21   6.23   6.21   6.23   6.25   6.32    OOM
 4                   6.72   5.57   5.33   5.24   5.29   5.30    OOM    OOM    OOM    OOM
 8                   6.01   5.26   5.14    OOM    OOM    OOM    OOM    OOM    OOM    OOM
12                    OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM
16                    OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM
(OOM = out of memory)
[Figure: speedup over vasp_gam (left) and vasp_std (right) vs. # atoms (64-512), for 1 GPU and 4 GPUs; y-axis 0.0-5.0×]
# MPI procs \ NSIM          1         2         4         8        16
 1                  -14131.52   -158.39   -158.39   -158.39   -158.39
 2                  -14131.52   -158.39   -158.39   -158.39   -158.39
 4                  -14131.52   -158.39   -158.39   -158.39   -158.39
 8                  -14131.52   -158.39   -158.39         -         -
12                          -         -         -         -         -
16                          -         -         -         -         -
Ø Three papers covering the implementation and testing…:
o M. Hacene et al., J. Comput. Chem. 33, 2581 (2012), 10.1002/jcc.23096
o M. Hutchinson and M. Widom, Comput. Phys. Comm. 183, 1422 (2012), 10.1016/j.cpc.2012.02.017
o S. Maintz et al., Comput. Phys. Comm. 182, 1421 (2011), 10.1016/j.cpc.2011.03.010
Ø …and a couple of other links:
o https://www.vasp.at/index.php/news/44-administrative/115-new-release-vasp-5-4-1-with-gpu-support
o https://www.nsc.liu.se/~pla/blog/2015/11/16/vaspgpu/
o http://images.nvidia.com/events/sc15/SC5120-vasp-gpus.html
Further reading
Ø Understanding the parallelisation in VASP and applying a few simple rules of thumb can make your jobs scale better and use fewer resources (the default settings aren't great...)
Ø At the moment, running VASP on GPUs is mostly for interest:
o Does not benefit all types of job
o Requires some fiddly testing to get the best performance
o If you will be running a lot of a suitable workload on Balena (e.g. large MD jobs), it could be worth the effort
Ø Aims for further benchmark tests:
o What types of job benefit from GPU acceleration?
o What is the most "balanced" configuration (1/2/4 GPUs/node)?
o Is it possible to run over multiple GPU nodes?
o Can GPUs be a cost/power-efficient way to run certain VASP jobs?
Thoughts and discussion points
Acknowledgements