Upcoming Biowulf Seminars
• November 30, 1-3 pm: Python in HPC. Overview of Python tools used in high-performance computing, and how to improve the performance of your Python jobs on Biowulf.
• Jan 16, 1-3 pm: RELION tips and tricks, and Parallel jobs and benchmarking. Mechanics and best practices for submitting RELION jobs to the batch system, from both the command line and via the RELION GUI, as well as methods for monitoring and evaluating the results. Scaling of parallel jobs, and how to benchmark to make effective use of your allocated resources.
Bldg 50, Rm 1227
Making Effective Use of the Biowulf Batch System
NIH HPC Systems
Steven Fellini, staff@hpc.nih.gov
NIH HPC Staff, CIT
Oct 30, 2017
Effective Use == Effective Resource Allocation
• Specifying resources
• Estimating required resources
• Allocating resources with sbatch and swarm
• Monitoring resource allocation
• Scheduling and resource allocation
• Post-mortem analysis
Hardware Terminology Review
(diagram labels: Node, Processor, CPU/CPUs, Hyper-threading)
Estimating Resources
• CPU
  • Check the documentation (https://hpc.nih.gov/apps/)
  • Objective: match CPUs to threads 1:1 (there are exceptions, e.g., MD jobs)
• Memory
  • Run a job or swarm with a large memory allocation
  • Check the actual memory usage
  • Add 10% to the actual memory usage
• Time
  • Run a job or swarm with a large time allocation
  • Check the actual walltime
  • Add 10% to the actual walltime
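The "check actual usage, then add 10%" rule of thumb can be sketched in a few lines of shell. The peak-memory figure below is made up for illustration; in practice it would come from jobhist or jobload:

```shell
# Illustrative only: suppose a trial run reported a peak memory usage of 7.3 GB.
peak_mem_gb=7.3

# Add ~10% headroom and round up to a whole GB for the next submission.
mem_request_gb=$(awk -v m="$peak_mem_gb" 'BEGIN { printf "%d", m * 1.1 + 1 }')

echo "sbatch --mem=${mem_request_gb}g myjob.sh"   # -> sbatch --mem=9g myjob.sh
```

The same calculation applies to the walltime estimate: take the observed runtime, add roughly 10%, and use that as the --time request.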
Allocating Resources with sbatch and swarm
• All jobs
  • --mem (sbatch) or -g (swarm)
  • --time (sbatch and swarm)
  • -b to bundle command lines (swarm)
• Single-threaded jobs
  • "-p 2" to load cores with 2 threads (swarm)
• Multi-node jobs
  • See "Parallel Jobs and Benchmarking", Jan 16
• Multi-threaded jobs
  • --cpus-per-task (sbatch) or -t (swarm)
  • Use $SLURM_CPUS_PER_TASK in the batch script
  • OMP_NUM_THREADS
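Putting the multi-threaded options together, a minimal batch script might look like the sketch below. The application name my_app is hypothetical; the sbatch options and $SLURM_CPUS_PER_TASK are as described above:

```shell
#!/bin/bash
#SBATCH --cpus-per-task=8
#SBATCH --mem=8g
#SBATCH --time=04:00:00

# Size the thread count from the allocation instead of hard-coding it,
# so the script stays correct if --cpus-per-task changes.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# my_app is a stand-in for your actual multi-threaded application.
my_app --threads=$SLURM_CPUS_PER_TASK input.dat
```

An equivalent swarm submission would pass the same per-command resources on the swarm command line, e.g. "swarm -t 8 -g 8 --time 04:00:00 -b 4 commands.swarm".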
Monitoring Resource Allocation
• CPU
  • jobload while the job is running
  • Dashboard during or after the job
  • (No easy way to monitor GPU utilization at the moment)
• Walltime
  • jobhist, Dashboard or sacct during or after the job has completed
• Memory
  • jobload while the job is running
  • jobhist, Dashboard or sacct during or after the job has completed
jobload

% jobload -u someuser
       JOBID              TIME             NODES  CPUS   THREADS  LOAD      MEMORY
               Elapsed / Wall                     Alloc  Active          Used / Alloc
    51863534   6-22:08:01 / 10-00:00:00   cn3095    4       4    100%    1.0 / 8.0 GB
    51863535   6-22:08:01 / 10-00:00:00   cn3256    4       5    125%    0.9 / 8.0 GB
    51863536   6-22:08:01 / 10-00:00:00   cn3348    4       1     25%    1.0 / 8.0 GB
    51863537   6-22:08:01 / 10-00:00:00   cn3401    4       3     75%    0.9 / 8.0 GB
    51881591   6-19:42:16 / 10-00:00:00   cn3097    4       1     25%    1.0 / 8.0 GB

% jobload -j 51874438_233
       JOBID              TIME             NODES  CPUS   THREADS  LOAD      MEMORY
               Elapsed / Wall                     Alloc  Active          Used / Alloc
51874438_233   6-20:10:13 / 10-00:00:00   cn3105    2       1     50%    0.5 / 1.5 GB
jobhist

# jobhist 52102264_67
      Jobid  Partition      State  Nodes  CPUs  Walltime   Runtime      MemReq  MemUsed  Nodelist
52102264_67       norm  COMPLETED      1     2  02:00:00  00:05:29  4.0GB/node    0.8GB    cn3185

(Walltime and MemReq show what was allocated; Runtime and MemUsed show what the job actually used.)
sacct
% sacct --format=Jobname,AllocCPUS,AllocNodes,ReqMem,MaxRSS,Elapsed -j 52102332
   JobName  AllocCPUS  AllocNodes      ReqMem      MaxRSS     Elapsed
---------- ---------- ---------- ---------- ---------- ----------
tbss_2_reg          2          1         4Gn              00:05:29
     batch          2          1         4Gn     815152K  00:05:29
Using Your Dashboard to Monitor Jobs
https://hpc.nih.gov
https://hpc.nih.gov/dashboard/
Scheduling and Resource Allocation
• Scheduling is determined by job priority
• Priority is determined by the user's Fairshare value
• Fairshare is determined by the recent CPU and memory allocations of running jobs
• An unnecessarily long time allocation will prevent jobs from being backfilled

Why are my jobs pending?
• 'freen' shows free CPUs but not free memory or disk
• Other jobs have higher priority (sprio)
• Nodes are reserved for higher-priority jobs
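To see how a pending job's priority breaks down, the standard Slurm sprio utility mentioned above can be queried directly; the job ID below is illustrative, and this sketch only runs on a system with a Slurm scheduler:

```shell
# Show the priority components (fairshare, job age, ...) for one pending job.
sprio -j 12345678

# List the priorities of all of your own pending jobs.
sprio -u $USER
```

A low fairshare component relative to other users is the usual explanation for a long-pending job after heavy recent usage.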
Consequences of…

Resource  Specifying more resources than needed            Specifying fewer resources than needed
CPU       Wasted CPU resources, possibly unnecessary       Job runs a little / a lot slower
          scheduling delays
Memory    Wasted memory resources, possibly unnecessary    Job is "killed" by the kernel
          scheduling delays
Time      Possibly unnecessary scheduling delays           Job is killed by the batch system
Post-mortem of Jobs Using the User Dashboard
Or: The Good, the Bad, and the Ugly…
Comment: the job is running with the default allocations for CPU and memory
Recommendation: if this is a subjob of a large swarm, try "-p 2"
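The "-p 2" recommendation could be applied when resubmitting the swarm; the swarm file name and the other option values below are illustrative:

```shell
# Pack two single-threaded commands per core, so a large swarm of
# single-threaded subjobs ties up half as many cores.
swarm -p 2 -g 2 --time 02:00:00 commands.swarm
```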
A+
Another A+
Recommendation: reduce the CPU allocation
A+
Comment: perfect CPU utilization; memory is underutilized, but the entire node is allocated because of the CPU allocation
Comment: good overall utilization; possibly split into two jobs with a dependency, and with differing resource allocations
Comment: 8 CPUs too little or too much? 2 would do
Recommendation: could run in half the memory
Comment: good memory utilization
Recommendation: increasing the CPU allocation probably won't help
Comment: CPUs badly overloaded
Recommendation: could run in less than half the memory
Comment: CPUs overloaded at 200%
Recommendation: 256 GB of memory allocated, only MBs used
Comment: good memory utilization
Recommendation: might run faster with 32 CPUs?
Recommendation: increasing the CPU count might improve performance?
Recommendation: allocating 56 CPUs would likely help
staff@hpc.nih.gov