vasp-gpu on Balena: Usage and Some Benchmarks
TRANSCRIPT
Balena User Group Meeting
3rd February 2017
Ø The VASP SCF cycle in a nutshell
Ø Parallelisation in VASP
o Workload and data distribution
o Parallelisation control parameters
o Some rules of thumb for optimising parallel scaling
Ø The GPU (CUDA) port of VASP
o Compiling and running
o Features
o Some initial benchmarks
Ø Thoughts and discussion points
Balena User Group Meeting, February 2017 | Slide 2
Overview
[Figure: the VASP SCF cycle. Sources: http://www.iue.tuwien.ac.at/phd/goes/dissse14.html; S. Maintz et al., Comput. Phys. Comm. 182, 1421 (2011)]
The VASP SCF cycle in a nutshell
Ø The newest versions of VASP implement four levels of parallelism:
o k-point parallelism: KPAR
o Band parallelism and data distribution: NCORE and NPAR
o Parallelisation and data distribution over plane-wave coefficients (= FFTs; done over planes along NGZ): LPLANE
o Parallelisation of some linear-algebra operations using ScaLAPACK (notionally set at compile time, but can be controlled at runtime using LSCALAPACK)
Ø Effective parallelisation will…:
o …minimise (relatively slow) communication between MPI processes,…
o …distribute data to reduce memory requirements,…
o …and make sure the MPI processes have enough work to keep them busy
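As a rough illustration, the four levels map onto INCAR tags along these lines (a sketch; the values are hypothetical, for a 64-process job):

```
KPAR = 4            ! 4 k-point groups of 16 processes each
NCORE = 8           ! 8 cores per band group (data distributed within groups)
LPLANE = .TRUE.     ! distribute FFT data over planes along NGZ (the default)
LSCALAPACK = .TRUE. ! use ScaLAPACK for the linear-algebra routines
```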
Parallelisation in VASP
[Diagram: MPI processes divided into KPAR k-point groups, then NPAR band groups, then NGZ FFT groups(?)]
Ø Workload distribution over KPAR k-point groups, NBANDS band groups and NGZ plane-wave coefficient (FFT) groups [not 100% sure how this works…]
Parallelisation: Workload distribution
[Diagram: data divided over KPAR k-point groups, NPAR band groups and NGZ FFT groups(?)]
Ø Data distribution over NBANDS band groups and NGZ plane-wave coefficient (FFT) groups [also not 100% sure how this works…]
Parallelisation: Data distribution
Ø During a standard DFT calculation, k-points are independent -> k-point parallelism should be linearly scaling, although perhaps not in practice: https://www.nsc.liu.se/~pla/blog/2015/01/12/vasp-how-many-cores
Ø WARNING: <#procs> must be divisible by KPAR, but the parallelisation is via a round-robin algorithm, so <#k-points> does not need to be divisible by KPAR -> check how many irreducible k-points you have (IBZKPT file) and set KPAR accordingly
[Diagram: round-robin distribution of 3 k-points (k1-k3) over KPAR groups. KPAR = 1: t = 3 rounds [OK]; KPAR = 2: t = 2 rounds, with one group idle in round 2 [Bad]; KPAR = 3: t = 1 round [Good]]
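The divisibility rule can be sketched as a toy shell helper (not part of VASP; the process and k-point counts are assumed inputs — in a real job the irreducible k-point count would come from line 2 of the IBZKPT file):

```shell
#!/bin/bash
# Pick the largest KPAR that (a) divides the MPI process count and
# (b) divides the irreducible k-point count, so that every round of
# the round-robin is completely filled (no idle k-point groups).
nprocs=64   # MPI processes (hypothetical)
nk=12       # irreducible k-points, e.g. from line 2 of IBZKPT
best=1
for kpar in $(seq 1 "$nprocs"); do
  (( nprocs % kpar )) && continue       # <#procs> must be divisible by KPAR
  if (( kpar <= nk && nk % kpar == 0 )); then
    best=$kpar                          # all rounds full: no idle groups
  fi
done
echo "KPAR = $best"
```

For 64 processes and 12 irreducible k-points this picks KPAR = 4, the largest divisor of both counts.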
Parallelisation: KPAR
NCORE: number of cores in band groups; NPAR: number of bands treated simultaneously
NCORE = <#procs> / NPAR
Ø For NCORE = 1 / NPAR = <#procs> (the default), having more band groups appears to increase memory pressure and incur a substantial communication overhead
[Figure: benchmark speedup comparison; data labels 7.08×, 6.41×, 6.32×]
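The relation above is simple enough to tabulate (a throwaway sketch; the 64-process job is hypothetical):

```shell
#!/bin/bash
# NCORE = <#procs> / NPAR, tabulated for a 64-process job.
nprocs=64
for npar in 1 2 4 8 16 32 64; do
  echo "NPAR=$npar -> NCORE=$(( nprocs / npar ))"
done
```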
Parallelisation: NCORE and NPAR
Ø WARNING: VASP will increase the default NBANDS to the nearest multiple of the number of groups
Ø Since the electronic minimisation scales as a power of NBANDS, this can backfire in calculations with a large NPAR (e.g. those requiring NPAR = <#procs>)
Cores   NBANDS (default)   NBANDS (adjusted)
96      455                480
128     455                512
192     455                576
256     455                512
384     455                768
512     455                512

NBANDS = NELECT/2 + NIONS/2

Example system:
• 238 atoms w/ 672 electrons
• Default NBANDS = 455

For spin-polarised calculations: NBANDS = (3/5) NELECT + NMAG
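The adjusted values in the table follow from rounding the default NBANDS up to a multiple of the number of band groups (NPAR = <#procs> by default); a quick sketch:

```shell
#!/bin/bash
# Round the default NBANDS (455 for the example system) up to the
# nearest multiple of the number of band groups (= core count when
# the default NPAR = <#procs> is used).
default=455
for nprocs in 96 128 192 256 384 512; do
  adjusted=$(( (default + nprocs - 1) / nprocs * nprocs ))
  echo "Cores=$nprocs -> NBANDS=$adjusted"
done
```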
Parallelisation: NCORE and NPAR
Ø The RMM-DIIS (ALGO = VeryFast | Fast) algorithm involves three steps:
o EDDIAG: subspace diagonalisation
o RMM-DIIS: electronic minimisation
o ORTHCH: wavefunction orthogonalisation
Routine     312 atoms        624 atoms        1,248 atoms       1,872 atoms
EDDIAG      2.90 (18.64%)    12.97 (22.24%)   75.26 (26.38%)    208.29 (31.31%)
RMM-DIIS    12.39 (79.63%)   42.73 (73.27%)   187.62 (65.78%)   379.80 (57.10%)
ORTHCH      0.27 (1.74%)     2.62 (4.49%)     22.36 (7.84%)     77.11 (11.59%)
Ø EDDIAG and ORTHCH formally scale as N³, and rapidly begin to dominate the SCF cycle time for large calculations
Ø A good ScaLAPACK library can improve the performance of these routines in massively-parallel calculations
See also: https://www.nsc.liu.se/~pla/blog/2014/01/30/vasp9k
Parallelisation: ScaLAPACK
Ø KPAR: the current implementation does not distribute data over k-point groups -> KPAR = N will use N× more memory than KPAR = 1
Ø NPAR/NCORE: data is distributed over band groups -> decreasing NPAR/increasing NCORE will considerably reduce memory requirements
Ø NPAR takes precedence over NCORE - if you use "master" INCAR files, make sure you don't define both
Ø The defaults for NPAR/NCORE (NPAR = <#procs>, NCORE = 1) are usually a poor choice for both memory requirements and performance
Ø Band parallelism for hybrid functionals has been supported since VASP 5.3.5; for memory-intensive calculations, it is a good alternative to underpopulating nodes
Ø LPLANE: distributes data over plane-wave coefficients, and speeds things up by reducing communication during FFTs - the default is LPLANE = .TRUE., and should only need to be changed for massively-parallel architectures (e.g. Blue Gene/Q)
Parallelisation: Memory
Ø For x86_64 IB systems (e.g. Balena, Archer…):
o Try KPAR for heavy calculations (e.g. hybrids)
o Set NPAR = (<#procs> / KPAR) or NCORE = <#procs/node>
o 1 node/band group per 50 atoms; may want to use 2 nodes/50 atoms for hybrids, or decrease to ½ node per band group for <10 atoms
o Leave LPLANE at the default (.TRUE.)
o WARNING: In my experience of Cray systems (Archer/XC30, SiSu/XC40), using KPAR sometimes causes VASP to hang during multistep calculations (e.g. optimisations)
Ø For the IBM Blue Gene/Q (STFC Hartree Centre):
o Last time I used it, the Hartree machine only had VASP 5.2.x -> no KPAR
o Try to choose a square number of cores, and set NPAR = sqrt(<#procs>)
o Consider setting LPLANE = .FALSE. if <#procs> ≥ NGZ
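Put together, the x86_64 rules might translate into an INCAR fragment like this (a sketch only, assuming a hypothetical heavy job on 4 nodes of 16 cores each, with at least 4 irreducible k-points):

```
KPAR = 4      ! one k-point group per node
NCORE = 16    ! = <#procs/node> for an (assumed) 16-core node
! LPLANE left at the default (.TRUE.)
```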
Parallelisation: Some rules of thumb
Ø GPU computing works in an offload model
Ø Programming models such as CUDA and OpenCL provide APIs for:
o Copying memory to and from the GPU
o Compiling kernel programs to run on the GPU
o Setting up and running kernels on input data
Ø Porting codes for GPUs involves identifying routines that can be efficiently mapped to the GPU architecture, writing kernels, and interfacing them to the CPU code
[Diagram: offload model - data and kernel programs are copied from the CPU to the GPU, the kernels run on the GPU, and the results are copied back]
GPU computing
vasp-gpu
Ø Starting from the February 2016 release of VASP 5.4.1, the distribution includes a CUDA port that offloads some of the core DFT routines onto NVIDIA GPUs
Ø A culmination of research at the University of Chicago, Carnegie Mellon and ENS-Lyon, and a healthy dose of optimisation by NVIDIA
Ø Three papers covering the implementation and testing:
o M. Hacene et al., J. Comput. Chem. 33, 2581 (2012), 10.1002/jcc.23096
o M. Hutchinson and M. Widom, Comput. Phys. Comm. 183, 1422 (2012), 10.1016/j.cpc.2012.02.017
o S. Maintz et al., Comput. Phys. Comm. 182, 1421 (2011), 10.1016/j.cpc.2011.03.010
Because sharing is caring...
https://github.com/JMSkelton/VASP-GPU-Benchmarking
Ø Easy(ish) with the VASP 5.4.1 build system:
o Load cuda/toolkit (along with intel/compiler, intel/mkl, etc.)
o Modify the arch/makefile.include.linux_intel_cuda example
o Make the gpu and/or gpu_ncl targets

Modules used:
  intel/compiler/64/15.0.0.090
  intel/mkl/64/11.2
  openmpi/intel/1.8.4
  cuda/toolkit/7.5.18

makefile.include (excerpt):
  FC = mpif90
  FCL = mpif90 -mkl -lstdc++
  ...
  CUDA_ROOT := /cm/shared/apps/cuda75/toolkit/7.5.18
  ...
  MPI_INC = /apps/openmpi/intel-2015/1.8.4/include/
https://github.com/JMSkelton/VASP-GPU-Benchmarking/Compilation
vasp-gpu: Compilation
Ø Available as a module on Balena: module load untested vasp/intel/5.4.1
Ø To use vasp-gpu on Balena, you need to request a GPU-equipped node and perform some basic setup tasks in your SLURM scripts
#SBATCH --partition=batch-acc

# Node w/ 1 k20x card.
#SBATCH --gres=gpu:1
#SBATCH --constraint=k20x

# Node w/ 4 k20x cards.
##SBATCH --gres=gpu:4
##SBATCH --constraint=k20x

if [ ! -d "/tmp/nvidia-mps" ] ; then
    mkdir "/tmp/nvidia-mps"
fi
export CUDA_MPS_PIPE_DIRECTORY="/tmp/nvidia-mps"

if [ ! -d "/tmp/nvidia-log" ] ; then
    mkdir "/tmp/nvidia-log"
fi
export CUDA_MPS_LOG_DIRECTORY="/tmp/nvidia-log"

nvidia-cuda-mps-control -d
https://github.com/JMSkelton/VASP-GPU-Benchmarking/Scripts
vasp-gpu: Running jobs
Ø Uses cuFFT and CUDA ports of compute-heavy parts of the SCF cycle
Ø ALGO = Normal | VeryFast (+ Fast) w/ LREAL = Auto fully supported, along with KPAR, exact exchange and non-collinear spin
Ø ALGO = All | Damped and the GW routines work, but are not optimised ("passively supported")
Ø LREAL = .FALSE., NCORE > 1 (NPAR != N) and electric fields are not supported (will crash with an error)
Ø Currently no Gamma-only version
Ø Future roadmap: Γ-point optimisations and support for LREAL = .FALSE., vdW functionals, RPA/GW calculations and band parallelism
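A minimal INCAR sketch for a fully-supported GPU run, based on the feature list above (the KPAR and NSIM values are hypothetical and would need tuning):

```
ALGO = VeryFast    ! fully supported on the GPU port
LREAL = Auto       ! required: LREAL = .FALSE. is not supported
NCORE = 1          ! band parallelism (NCORE > 1) is not supported
KPAR = 2           ! k-point parallelism is supported
NSIM = 16          ! larger values give better GPU utilisation at a memory cost
```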
vasp-gpu: Features
Ø Each MPI process allocates its own set of cuFFT plans and CUDA kernels, distributing round-robin among the available GPUs
Ø The size of the CUDA kernels is controlled by NSIM: broadly, NSIM ↑ = better GPU utilisation but higher memory requirements
Ø <#procs> should be a multiple of <#GPUs>, and for most systems you will probably end up underpopulating the CPUs
[Diagram: round-robin mapping of MPI processes to GPUs - e.g. processes 1-4 alternating over GPUs 1-2, or processes 1-4 mapped one-to-one onto GPUs 1-4]
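The round-robin assignment sketched in the diagram amounts to rank mod #GPUs (a toy illustration; the counts are hypothetical):

```shell
#!/bin/bash
# Round-robin mapping of MPI ranks to GPUs: rank mod #GPUs.
ngpus=2    # GPUs on the node (hypothetical)
nprocs=4   # MPI processes
map=""
for (( rank = 0; rank < nprocs; rank++ )); do
  map+="process $rank -> GPU $(( rank % ngpus )); "
done
echo "$map"
```

With 4 processes and 2 GPUs, processes 0 and 2 share GPU 0 while 1 and 3 share GPU 1.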
vasp-gpu: Load balancing
Ø 64 to 1,024 atoms in a random cubic arrangement; ALGO = VeryFast w/ LREAL = Auto, k = Γ; 1 GPU node w/ 1 or 4 Tesla K20x cards vs. 1 compute node
vasp-gpu: Benchmarking
# MPI procs \ NSIM      1      2      4      8     12     16     24     32     48     64
 1                  13.52   8.88   8.15   7.82   7.77   7.76   7.72   7.74   7.81   7.89
 2                   9.11   6.75   6.34   6.21   6.23   6.21   6.23   6.25   6.32    OOM
 4                   6.72   5.57   5.33   5.24   5.29   5.30    OOM    OOM    OOM    OOM
 8                   6.01   5.26   5.14    OOM    OOM    OOM    OOM    OOM    OOM    OOM
12                    OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM
16                    OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM
(OOM = out of memory)
[Figure: speedup over vasp_gam (left) and vasp_std (right) vs. # atoms (64-512), for 1 GPU and 4 GPUs; y-axis 0.0-5.0×]
# MPI procs \ NSIM          1         2         4         8        16
 1                  -14131.52   -158.39   -158.39   -158.39   -158.39
 2                  -14131.52   -158.39   -158.39   -158.39   -158.39
 4                  -14131.52   -158.39   -158.39   -158.39   -158.39
 8                  -14131.52   -158.39   -158.39         -         -
12                          -         -         -         -         -
16                          -         -         -         -         -
Ø Three papers covering the implementation and testing…:
o M. Hacene et al., J. Comput. Chem. 33, 2581 (2012), 10.1002/jcc.23096
o M. Hutchinson and M. Widom, Comput. Phys. Comm. 183, 1422 (2012), 10.1016/j.cpc.2012.02.017
o S. Maintz et al., Comput. Phys. Comm. 182, 1421 (2011), 10.1016/j.cpc.2011.03.010
Ø …and a couple of other links:
o https://www.vasp.at/index.php/news/44-administrative/115-new-release-vasp-5-4-1-with-gpu-support
o https://www.nsc.liu.se/~pla/blog/2015/11/16/vaspgpu/
o http://images.nvidia.com/events/sc15/SC5120-vasp-gpus.html
Further reading
Ø Understanding the parallelisation in VASP and applying a few simple rules of thumb can make your jobs scale better and use fewer resources (the default settings aren't great...)
Ø At the moment, running VASP on GPUs is mostly for interest:
o Does not benefit all types of job
o Requires some fiddly testing to get the best performance
o If you will be running a lot of a suitable workload on Balena (e.g. large MD jobs), it could be worth the effort
Ø Aims for further benchmark tests:
o What types of job benefit from GPU acceleration?
o What is the most "balanced" configuration (1/2/4 GPUs/node)?
o Is it possible to run over multiple GPU nodes?
o Can GPUs be a cost/power-efficient way to run certain VASP jobs?
Thoughts and discussion points
Acknowledgements