vasp-gpu on balena: usage and some benchmarks

Balena User Group Meeting, 3rd February 2017

Uploaded by jonathan-skelton, 15-Apr-2017


TRANSCRIPT

Page 1: vasp-gpu on Balena: Usage and Some Benchmarks

Balena User Group Meeting

3rd February 2017

vasp-gpu on Balena: Usage and Some Benchmarks

Page 2: vasp-gpu on Balena: Usage and Some Benchmarks

Ø The VASP SCF cycle in a nutshell

Ø Parallelisation in VASP

o Workload and data distribution

o Parallelisation control parameters

o Some rules of thumb for optimising parallel scaling

Ø The GPU (CUDA) port of VASP

o Compiling and running

o Features

o Some initial benchmarks

Ø Thoughts and discussion points

Balena User Group Meeting, February 2017 | Slide 2

Overview

Page 3: vasp-gpu on Balena: Usage and Some Benchmarks

http://www.iue.tuwien.ac.at/phd/goes/dissse14.html
S. Maintz et al., Comput. Phys. Comm. 182, 1421 (2011)

The VASP SCF cycle in a nutshell

Page 4: vasp-gpu on Balena: Usage and Some Benchmarks

Ø The newest versions of VASP implement four levels of parallelism:

o k-point parallelism: KPAR

o Band parallelism and data distribution: NCORE and NPAR

o Parallelisation and data distribution over plane-wave coefficients (= FFTs; done over planes along NGZ): LPLANE

o Parallelisation of some linear-algebra operations using ScaLAPACK (notionally set at compile time, but can be controlled at runtime using LSCALAPACK)

Ø Effective parallelisation will…:

o …minimise (relatively slow) communication between MPI processes,…

o …distribute data to reduce memory requirements,…

o …and make sure the MPI processes have enough work to keep them busy

Parallelisation in VASP

Page 5: vasp-gpu on Balena: Usage and Some Benchmarks

[Diagram: MPI processes divided into KPAR k-point groups, then NPAR band groups, then NGZ FFT groups (?)]

Ø Workload distribution over KPAR k-point groups, NBANDS band groups and NGZ plane-wave coefficient (FFT) groups [not 100% sure how this works…]

Parallelisation: Workload distribution

Page 6: vasp-gpu on Balena: Usage and Some Benchmarks

[Diagram: data distributed over KPAR k-point groups, NPAR band groups and NGZ FFT groups (?)]

Ø Data distribution over NBANDS band groups and NGZ plane-wave coefficient (FFT) groups [also not 100% sure how this works…]

Parallelisation: Data distribution

Page 7: vasp-gpu on Balena: Usage and Some Benchmarks

Ø During a standard DFT calculation, k-points are independent -> k-point parallelism should be linearly scaling, although perhaps not in practice: https://www.nsc.liu.se/~pla/blog/2015/01/12/vasp-how-many-cores

Ø WARNING: <#procs> must be divisible by KPAR, but the parallelisation is via a round-robin algorithm, so <#k-points> does not need to be divisible by KPAR -> check how many irreducible k-points you have (IBZKPT file) and set KPAR accordingly

[Diagram: three k-points k1-k3 dealt out round-robin over the k-point groups; KPAR = 1 takes t = 3 rounds [OK], KPAR = 2 takes t = 2 rounds [Bad], KPAR = 3 takes t = 1 round [Good]]
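The round-robin behaviour above can be sketched numerically: with the constraint that <#procs> is divisible by KPAR, the wall time is set by the number of sequential rounds, ceil(<#k-points> / KPAR). A minimal sketch (the function name is illustrative, not part of VASP):

```python
import math

def kpar_rounds(n_kpoints, kpar, n_procs):
    """Estimate sequential rounds for round-robin k-point parallelism."""
    if n_procs % kpar != 0:
        raise ValueError("<#procs> must be divisible by KPAR")
    # k-points are dealt out round-robin over the KPAR groups; the wall
    # time is set by the group that receives the most k-points.
    return math.ceil(n_kpoints / kpar)

# Three irreducible k-points, as in the slide's example:
for kpar in (1, 2, 3):
    print(kpar, kpar_rounds(3, kpar, 12))  # 1 -> 3 rounds, 2 -> 2, 3 -> 1
```

KPAR = 2 is "bad" here because one group sits idle in the second round, so it costs the same memory as KPAR = 2 but only saves one round over KPAR = 1.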

Parallelisation: KPAR

Page 8: vasp-gpu on Balena: Usage and Some Benchmarks

NCORE: number of cores in band groups
NPAR: number of bands treated simultaneously

NCORE = <#procs> / NPAR

Ø For NCORE = 1 / NPAR = <#procs> (the default), more band groups appear to increase memory pressure and incur a substantial communication overhead
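The relationship above is a simple one; a short helper makes the arithmetic explicit (hypothetical function name; VASP itself only reads NPAR or NCORE from the INCAR):

```python
def npar_from_ncore(n_procs, ncore):
    """NCORE = <#procs> / NPAR, so NPAR = <#procs> / NCORE."""
    if n_procs % ncore != 0:
        raise ValueError("NCORE must divide <#procs>")
    return n_procs // ncore

# One band group per node on 4 x 16-core nodes:
print(npar_from_ncore(64, 16))  # -> 4 band groups of 16 cores each
```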

[Chart: benchmark timings with speedups of 7.08x, 6.41x and 6.32x annotated]

Parallelisation: NCORE and NPAR

Page 9: vasp-gpu on Balena: Usage and Some Benchmarks

Ø WARNING: VASP will increase the default NBANDS to the nearest multiple of the number of groups

Ø Since the electronic minimisation scales as a power of NBANDS, this can backfire in calculations with a large NPAR (e.g. those requiring NPAR = <#procs>)

Cores | Default NBANDS | Adjusted NBANDS
   96 |            455 |             480
  128 |            455 |             512
  192 |            455 |             576
  256 |            455 |             512
  384 |            455 |             768
  512 |            455 |             512

NBANDS = NELECT/2 + NIONS/2

Example system:

• 238 atoms w/ 272 electrons

• Default NBANDS = 455

NBANDS = (3/5)·NELECT + NMAG (spin-polarised)
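The adjusted values in the table follow from rounding the default NBANDS up to the nearest multiple of the number of band groups; with NPAR = <#procs> the group count equals the core count, and a quick sketch (hypothetical helper) reproduces the table:

```python
import math

def adjusted_nbands(default_nbands, n_groups):
    """Round NBANDS up to the nearest multiple of the number of band groups."""
    return math.ceil(default_nbands / n_groups) * n_groups

# NPAR = <#procs>, default NBANDS = 455, as in the table above:
for cores in (96, 128, 192, 256, 384, 512):
    print(cores, adjusted_nbands(455, cores))
# -> 480, 512, 576, 512, 768, 512
```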

Parallelisation: NCORE and NPAR

Page 10: vasp-gpu on Balena: Usage and Some Benchmarks

Ø The RMM-DIIS (ALGO = VeryFast | Fast) algorithm involves three steps:

o EDDIAG: subspace diagonalisation

o RMM-DIIS: electronic minimisation

o ORTHCH: wavefunction orthogonalisation

Routine  | 312 atoms      | 624 atoms      | 1,248 atoms     | 1,872 atoms
EDDIAG   |  2.90 (18.64%) | 12.97 (22.24%) |  75.26 (26.38%) | 208.29 (31.31%)
RMM-DIIS | 12.39 (79.63%) | 42.73 (73.27%) | 187.62 (65.78%) | 379.80 (57.10%)
ORTHCH   |  0.27 (1.74%)  |  2.62 (4.49%)  |  22.36 (7.84%)  |  77.11 (11.59%)

Ø EDDIAG and ORTHCH formally scale as N³, and rapidly begin to dominate the SCF cycle time for large calculations

Ø A good ScaLAPACK library can improve the performance of these routines in massively-parallel calculations

See also: https://www.nsc.liu.se/~pla/blog/2014/01/30/vasp9k
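Fitting an effective exponent to the timings in the table (t ∝ N^x between the smallest and largest cells) shows EDDIAG and ORTHCH growing much faster than RMM-DIIS, consistent with the formal N³ scaling; a rough check:

```python
import math

def effective_exponent(n1, t1, n2, t2):
    """Effective scaling exponent x in t ~ N**x between two system sizes."""
    return math.log(t2 / t1) / math.log(n2 / n1)

# Timings for 312 vs. 1,872 atoms, from the table above:
print(effective_exponent(312, 2.90, 1872, 208.29))   # EDDIAG, ~2.4
print(effective_exponent(312, 12.39, 1872, 379.80))  # RMM-DIIS, ~1.9
print(effective_exponent(312, 0.27, 1872, 77.11))    # ORTHCH, ~3.2
```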

Parallelisation: ScaLAPACK

Page 11: vasp-gpu on Balena: Usage and Some Benchmarks

Ø KPAR: the current implementation does not distribute data over k-point groups -> KPAR = N will use N× more memory than KPAR = 1

Ø NPAR/NCORE: data is distributed over band groups -> decreasing NPAR/increasing NCORE will considerably reduce memory requirements

Ø NPAR takes precedence over NCORE - if you use “master” INCAR files, make sure you don’t define both

Ø The defaults for NPAR/NCORE (NPAR = <#procs>, NCORE = 1) are usually a poor choice for both memory requirements and performance

Ø Band parallelism for hybrid functionals has been supported since VASP 5.3.5; for memory-intensive calculations, it is a good alternative to underpopulating nodes

Ø LPLANE: distributes data over plane-wave coefficients, and speeds things up by reducing communication during FFTs - the default is LPLANE = .TRUE., and should only need to be changed for massively-parallel architectures (e.g. BlueGene/Q)

Parallelisation: Memory

Page 12: vasp-gpu on Balena: Usage and Some Benchmarks

Ø For x86_64 IB systems (e.g. Balena, Archer…):

o Try KPAR for heavy calculations (e.g. hybrids)

o Set NPAR = (<#procs> / KPAR) or NCORE = <#procs/node>

o 1 node/band group per 50 atoms; may want to use 2 nodes/50 atoms for hybrids, or decrease to ½ node per band group for < 10 atoms

o Leave LPLANE at the default (.TRUE.)

o WARNING: In my experience of Cray systems (Archer/XC30, SiSu/XC40), using KPAR sometimes causes VASP to hang during multistep calculations (e.g. optimisations)

Ø For the IBM BlueGene/Q (STFC Hartree Centre):

o Last time I used it, the Hartree machine only had VASP 5.2.x -> no KPAR

o Try to choose a square number of cores, and set NPAR = sqrt(<#procs>)

o Consider setting LPLANE = .FALSE. if <#procs> ≥ NGZ
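As a sketch, the x86_64 rules above can be turned into a helper that proposes starting parameters (the function and its selection logic are illustrative only; always verify choices against your own benchmarks):

```python
def suggest_settings(n_procs, cores_per_node, n_irreducible_kpoints,
                     hybrid=False):
    """Starting-point KPAR/NCORE per the x86_64 rules of thumb above
    (illustrative helper, not part of VASP)."""
    kpar = 1
    if hybrid:
        # Try KPAR for heavy calculations: pick the largest value that
        # divides <#procs> without exceeding the irreducible k-point count.
        for trial in range(min(n_procs, n_irreducible_kpoints), 0, -1):
            if n_procs % trial == 0:
                kpar = trial
                break
    # NCORE = <#procs/node>, i.e. one band group per node.
    return {"KPAR": kpar, "NCORE": cores_per_node}

# 4 x 16-core nodes, 8 irreducible k-points, hybrid functional:
print(suggest_settings(64, 16, 8, hybrid=True))  # {'KPAR': 8, 'NCORE': 16}
```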

Parallelisation: Some rules of thumb

Page 13: vasp-gpu on Balena: Usage and Some Benchmarks

Ø GPU computing works in an offload model

Ø Programming models such as CUDA and OpenCL provide APIs for:

o Copying memory to and from the GPU

o Compiling kernel programs to run on the GPU

o Setting up and running kernels on input data

Ø Porting codes for GPUs involves identifying routines that can be efficiently mapped to the GPU architecture, writing kernels, and interfacing them to the CPU code
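The offload model can be illustrated with a purely schematic host/device simulation in Python (a toy class, not a real API; actual ports use CUDA or OpenCL calls for each of the three steps):

```python
class ToyDevice:
    """Schematic accelerator: explicit copy-in, kernel launch, copy-out."""
    def __init__(self):
        self._mem = {}

    def copy_in(self, name, data):      # host -> device transfer
        self._mem[name] = list(data)

    def launch(self, kernel, name):     # run a "kernel" on device data
        self._mem[name] = [kernel(x) for x in self._mem[name]]

    def copy_out(self, name):           # device -> host transfer
        return list(self._mem[name])

dev = ToyDevice()
dev.copy_in("psi", [1.0, 2.0, 3.0])
dev.launch(lambda x: x * x, "psi")      # the offloaded compute step
print(dev.copy_out("psi"))              # [1.0, 4.0, 9.0]
```

The point of the sketch is the explicit data movement: the transfers on either side of the launch are exactly the overhead a port has to amortise against the GPU's compute throughput.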

[Diagram: offload model - data and the kernel program are copied from the CPU to the GPU, the kernel runs on the GPU, and the results are copied back]

GPU computing

Page 14: vasp-gpu on Balena: Usage and Some Benchmarks

vasp-gpu

Ø Starting from the February 2016 release of VASP 5.4.1, the distribution includes a CUDA port that offloads some of the core DFT routines onto NVIDIA GPUs

Ø A culmination of research at the University of Chicago, Carnegie Mellon and ENS-Lyon, and a healthy dose of optimisation by NVIDIA

Ø Three papers covering the implementation and testing:

o M. Hacene et al., J. Comput. Chem. 33, 2581 (2012), 10.1002/jcc.23096

o M. Hutchinson and M. Widom, Comput. Phys. Comm. 183, 1422 (2012), 10.1016/j.cpc.2012.02.017

o S. Maintz et al., Comput. Phys. Comm. 182, 1421 (2011), 10.1016/j.cpc.2011.03.010

Page 15: vasp-gpu on Balena: Usage and Some Benchmarks

Because sharing is caring...

https://github.com/JMSkelton/VASP-GPU-Benchmarking

Page 16: vasp-gpu on Balena: Usage and Some Benchmarks

Ø Easy(ish) with the VASP 5.4.1 build system:

o Load cuda/toolkit (along with intel/compiler, intel/mkl, etc.)

o Modify the arch/makefile.include.linux_intel_cuda example

o Make the gpu and/or gpu_ncl targets

Modules:

intel/compiler/64/15.0.0.090
intel/mkl/64/11.2
openmpi/intel/1.8.4
cuda/toolkit/7.5.18

makefile.include (excerpt):

FC = mpif90
FCL = mpif90 -mkl -lstdc++
...
CUDA_ROOT := /cm/shared/apps/cuda75/toolkit/7.5.18
...
MPI_INC = /apps/openmpi/intel-2015/1.8.4/include/

https://github.com/JMSkelton/VASP-GPU-Benchmarking/Compilation

vasp-gpu: Compilation

Ø Available as a module on Balena: module load untested vasp/intel/5.4.1

Page 17: vasp-gpu on Balena: Usage and Some Benchmarks

Ø To use vasp-gpu on Balena, you need to request a GPU-equipped node and perform some basic setup tasks in your SLURM scripts

#SBATCH --partition=batch-acc

# Node w/ 1 k20x card.
#SBATCH --gres=gpu:1
#SBATCH --constraint=k20x

# Node w/ 4 k20x cards.
##SBATCH --gres=gpu:4
##SBATCH --constraint=k20x

if [ ! -d "/tmp/nvidia-mps" ] ; then
    mkdir "/tmp/nvidia-mps"
fi

export CUDA_MPS_PIPE_DIRECTORY="/tmp/nvidia-mps"

if [ ! -d "/tmp/nvidia-log" ] ; then
    mkdir "/tmp/nvidia-log"
fi

export CUDA_MPS_LOG_DIRECTORY="/tmp/nvidia-log"

nvidia-cuda-mps-control -d

https://github.com/JMSkelton/VASP-GPU-Benchmarking/Scripts

vasp-gpu: Running jobs

Page 18: vasp-gpu on Balena: Usage and Some Benchmarks

Ø Uses cuFFT and CUDA ports of compute-heavy parts of the SCF cycle

Ø ALGO = Normal | VeryFast (+ Fast) w/ LREAL = Auto fully supported, along with KPAR, exact exchange and non-collinear spin

Ø ALGO = All | Damped and the GW routines work, but are not optimised (“passively supported”)

Ø LREAL = .FALSE., NCORE > 1 (NPAR != N) and electric fields are not supported (will crash with an error)

Ø Currently no Gamma-only version

Ø Future roadmap: Γ-point optimisations and support for LREAL = .FALSE., vdW functionals, RPA/GW calculations and band parallelism

vasp-gpu: Features

Page 19: vasp-gpu on Balena: Usage and Some Benchmarks

Ø Each MPI process allocates its own set of cuFFT plans and CUDA kernels, distributing round-robin among the available GPUs

Ø The size of the CUDA kernels is controlled by NSIM: broadly, NSIM ↑ = better GPU utilisation but higher memory requirements

Ø <#procs> should be a multiple of <#GPUs>, and for most systems you will probably end up underpopulating the CPUs
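The round-robin mapping of MPI ranks onto GPUs can be sketched as a plain modulo assignment (an assumption about the exact scheme, but it matches the slide's examples of four processes on two or four GPUs):

```python
def gpu_for_rank(rank, n_gpus):
    """Round-robin assignment of MPI ranks to the available GPUs."""
    return rank % n_gpus

# 4 MPI processes on 2 GPUs vs. 4 GPUs:
print([gpu_for_rank(r, 2) for r in range(4)])  # [0, 1, 0, 1]
print([gpu_for_rank(r, 4) for r in range(4)])  # [0, 1, 2, 3]
```

This is also why <#procs> should be a multiple of <#GPUs>: otherwise some GPUs end up serving more processes than others and the load is unbalanced.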

[Diagram: four MPI processes mapped round-robin onto two GPUs (two processes per GPU) and onto four GPUs (one process each)]

vasp-gpu: Load balancing

Page 20: vasp-gpu on Balena: Usage and Some Benchmarks

Ø 64 to 1,024 atoms in a random cubic arrangement; ALGO = VeryFast w/ LREAL = Auto, k = Γ; 1 GPU node w/ 1 or 4 Tesla K20x cards vs. 1 compute node

vasp-gpu: Benchmarking

Page 21: vasp-gpu on Balena: Usage and Some Benchmarks

Ø 64 to 1,024 atoms in a random cubic arrangement; ALGO = VeryFast w/ LREAL = Auto, k = Γ; 1 GPU node w/ 1 or 4 Tesla K20x cards vs. 1 compute node

#MPI \ NSIM |     1 |    2 |    4 |    8 |   12 |   16 |   24 |   32 |   48 |   64
          1 | 13.52 | 8.88 | 8.15 | 7.82 | 7.77 | 7.76 | 7.72 | 7.74 | 7.81 | 7.89
          2 |  9.11 | 6.75 | 6.34 | 6.21 | 6.23 | 6.21 | 6.23 | 6.25 | 6.32 |  OOM
          4 |  6.72 | 5.57 | 5.33 | 5.24 | 5.29 | 5.30 |  OOM |  OOM |  OOM |  OOM
          8 |  6.01 | 5.26 | 5.14 |  OOM |  OOM |  OOM |  OOM |  OOM |  OOM |  OOM
         12 |   OOM |  OOM |  OOM |  OOM |  OOM |  OOM |  OOM |  OOM |  OOM |  OOM
         16 |   OOM |  OOM |  OOM |  OOM |  OOM |  OOM |  OOM |  OOM |  OOM |  OOM

vasp-gpu: Benchmarking

Page 22: vasp-gpu on Balena: Usage and Some Benchmarks

Ø 64 to 1,024 atoms in a random cubic arrangement; ALGO = VeryFast w/ LREAL = Auto, k = Γ; 1 GPU node w/ 1 or 4 Tesla K20x cards vs. 1 compute node

vasp-gpu: Benchmarking

[Figure: two panels plotting speedup vs. vasp_gam (left) and vs. vasp_std (right) against # atoms (64-512), for 1 GPU and 4 GPUs; the speedup axes run from 0.0 to 5.0]

Page 23: vasp-gpu on Balena: Usage and Some Benchmarks

#MPI \ NSIM |         1 |       2 |       4 |       8 |      16
          1 | -14131.52 | -158.39 | -158.39 | -158.39 | -158.39
          2 | -14131.52 | -158.39 | -158.39 | -158.39 | -158.39
          4 | -14131.52 | -158.39 | -158.39 | -158.39 | -158.39
          8 | -14131.52 | -158.39 | -158.39 |       - |       -
         12 |         - |       - |       - |       - |       -
         16 |         - |       - |       - |       - |       -

Ø 64 to 1,024 atoms in a random cubic arrangement; ALGO = VeryFast w/ LREAL = Auto, k = Γ; 1 GPU node w/ 1 or 4 Tesla K20x cards vs. 1 compute node

vasp-gpu: Benchmarking

Page 24: vasp-gpu on Balena: Usage and Some Benchmarks

Ø Three papers covering the implementation and testing…:

o M. Hacene et al., J. Comput. Chem. 33, 2581 (2012), 10.1002/jcc.23096

o M. Hutchinson and M. Widom, Comput. Phys. Comm. 183, 1422 (2012), 10.1016/j.cpc.2012.02.017

o S. Maintz et al., Comput. Phys. Comm. 182, 1421 (2011), 10.1016/j.cpc.2011.03.010

Ø …and a couple of other links:

o https://www.vasp.at/index.php/news/44-administrative/115-new-release-vasp-5-4-1-with-gpu-support

o https://www.nsc.liu.se/~pla/blog/2015/11/16/vaspgpu/

o http://images.nvidia.com/events/sc15/SC5120-vasp-gpus.html

Further reading

Page 25: vasp-gpu on Balena: Usage and Some Benchmarks

Ø Understanding the parallelisation in VASP and applying a few simple rules of thumb can make your jobs scale better and use less resources (the default settings aren’t great...)

Ø At the moment, running VASP on GPUs is mostly for interest:

o Does not benefit all types of job

o Requires some fiddly testing to get the best performance

o If you will be running a lot of a suitable workload on Balena (e.g. large MD jobs), it could be worth the effort

Ø Aims for further benchmark tests:

o What types of job benefit from GPU acceleration?

o What is the most “balanced” configuration (1/2/4 GPUs/node)?

o Is it possible to run over multiple GPU nodes?

o Can GPUs be a cost/power efficient way to run certain VASP jobs?

Thoughts and discussion points

Page 26: vasp-gpu on Balena: Usage and Some Benchmarks


Acknowledgements