Intro to Monsoon and Slurm
TRANSCRIPT
1/18/2017
Introductions
• Introduce yourself
  – Name
  – Department/group
  – What project(s) do you plan to use Monsoon for?
  – Linux or Unix experience
  – Previous cluster experience?
List of Topics
• Cluster education
  – What is a cluster, exactly?
  – Queues, scheduling, and resource management
• Cluster orientation
  – Monsoon cluster specifics
  – How do I use this cluster?
  – Group resource limits
  – Exercises
  – Question and answer
What is a cluster?
• A computer cluster is many individual computer systems (nodes) networked together locally to serve as a single resource
• A cluster provides the ability to solve problems at a scale not feasible on a single machine
Inside a Node
• (Diagram: a single node containing four CPU sockets, Socket 0 through Socket 3)
What is a queue?
• Normally thought of as a line: first in, first out (FIFO)
• Queues on a cluster can be as basic as a FIFO, or far more advanced, with dynamic priorities taking many factors into consideration
What is scheduling?
• "A plan or procedure with a goal of completing some objective within some time frame"
• Scheduling for a cluster at the basic level is much the same: assigning work to computers to complete objectives within the time available
• It is not quite that easy, though; many factors come into play when scheduling work on a cluster
Scheduling
• A scheduler needs to know what resources are available on the cluster
• Assignment of work on a cluster is carried out most efficiently with scheduling and resource management working together
Resource Management
• Monitoring resource availability and health
• Allocation of resources
• Execution of work on resources
• Accounting of resource usage
Cluster scheduling goals
• Optimize quantity of work
• Optimize usage of resources
• Serve all users and projects fairly
• Make scheduling decisions transparent
Cluster Resources
• Nodes
• Memory
• CPUs
• GPUs
• Licenses
Many scheduling methods
• FIFO
  – Simply first in, first out
• Backfill
  – Runs smaller jobs with lower resource requirements while larger jobs wait for their higher resource requirements to become available
• Fairshare
  – Prioritizes jobs based on each user's recent resource consumption
Monsoon
• The Monsoon cluster is a resource available to the NAU research enterprise
• 32 systems (nodes) – cn[1-32]
• 884 cores
• 12 GPUs, NVIDIA Tesla K80
• Red Hat Enterprise Linux 6.7
• 12 TB memory – 128 GB/node min, 1.5 TB max
• 170 TB high-speed scratch storage
• 500 TB long-term storage
• High-speed interconnect: FDR InfiniBand
Monsoon scheduling
• Slurm (Simple Linux Utility for Resource Management)
• Excellent resource manager and scheduler
• Precise control over resource requests
• Developed at LLNL, continued by SchedMD
• Used everywhere from small clusters to the largest clusters:
  – Sunway TaihuLight (#1), 10.6M cores, 93 PF, ~15 MW – China
  – Titan (#3), 561K cores, 17.6 PF, ~8 MW – United States
Small cluster! Dual core?
Largest cluster! 10,649,600 cores
Monsoon scheduling
• Combination of scheduling methods
• Currently configured to use backfill along with a multifactor priority system to prioritize jobs
Factors contributing to priority
• Fairshare (predominant factor)
  – Priority points determined by a user's recent resource usage
  – Decay half-life of 1 day
• QOS (Quality of Service)
  – Some QOSs have higher priority than others, for instance: debug
• Age – how long the job has sat pending
• Job size – the number of nodes/cpus a job is requesting
Storage
• /home
  – 10 GB quota
  – Keep your scripts and executables here
  – Snapshotted twice a day: /home/.snapshot
  – Please do not write job output (logs, results) here!
• /scratch
  – 30-day retention
  – Very fast storage, capable of 11 GB/sec
  – Checkpoints, logs
  – Keep all temp/intermediate data here
  – Should be your default location to perform input/output
Storage
• /projects
  – Long-term storage project shares
  – Space is assigned to a faculty member for the group to share
  – Snapshots available
  – No backups today
• /common
  – Cluster support share
  – Contrib: place to put scripts/libs/confs/dbs for others to use
Data Flow
1. Keep scripts and executables in /home
2. Write temp/intermediate data to /scratch
3. Copy data to /projects/<group_project> for group storage and reference in other projects
4. Clean up /scratch
** Remember, /scratch is a scratch filesystem, used for high-speed temporary and intermediate data (see the sketch below)
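A minimal shell sketch of that flow. The directory names my_run and my_group, and the script run_analysis.sh, are hypothetical placeholders; substitute your own NAU user id for nauid:

    # 1. scripts and executables live in /home
    cd /home/nauid/my_run_scripts

    # 2. the job writes temp/intermediate data under /scratch
    sbatch --workdir=/scratch/nauid/my_run run_analysis.sh

    # 3. copy results you want to keep to the group project share
    cp -av /scratch/nauid/my_run/results /projects/my_group/my_run_results

    # 4. clean up /scratch when finished
    rm -rf /scratch/nauid/my_run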
Remote storage access
• scp
  – scp [email protected]:/scratch/nauid
  – WinSCP
• samba/cifs
  – \\nau.froot.nau.edu\cirrus (Windows)
  – smb://nau.froot.nau.edu/cirrus (Mac)
Groups
• NAU has a resource called Enterprise groups
• They are available to you on the cluster if you'd like to manage access to data
• Manage access to your files
• https://my.nau.edu
  – "Go to Enterprise Groups"
  – Take a look at our FAQ: nau.edu/hpc/faq
• If they are not working for you, contact the ITS help desk
Software
• ENVI/IDL
• Matlab
• Intel Compilers, and MKL
• R
• Qiime
• Anaconda Python
• OpenFOAM
• SOWFA
• Lots of bioinformatics programs
• Request additional software to be installed!
Modules
• Software environment management is handled by the modules package management system
• module avail – what modules are available
• module list – modules currently loaded
• module load <modulename> – load a package module
• module display <modulename> – detailed information, including environment variables affected
• A typical session is sketched below
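The python/3.3.4 module name below is taken from the example job script later in this deck; the exact environment variables each module sets will vary:

    module avail                  # see what is installed
    module load python/3.3.4      # load the version you need
    module list                   # confirm what is loaded
    module display python/3.3.4   # see the environment changes it makes
    module unload python/3.3.4    # remove it when done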
MPI
• Quick note on MPI
• Message Passing Interface for parallel computing
• OpenMPI set as the default MPI
• Mvapich2 also available
  – module unload openmpi
  – module load mvapich2
• Example MPI job script (a minimal sketch follows):
  – /common/contrib/examples/job_scripts/mpijob.sh
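For reference, a minimal MPI batch script might look roughly like this. It is a sketch, not the contents of mpijob.sh, and the program name hello_mpi is a placeholder:

    #!/bin/bash
    #SBATCH --job-name=mpi_test
    #SBATCH --output=/scratch/nauid/mpi_test.out
    #SBATCH --ntasks=4              # number of MPI ranks
    #SBATCH --time=10:00

    module load openmpi             # or: module load mvapich2

    # srun launches one copy of the program per task (MPI rank)
    srun ./hello_mpi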
Interacting with Slurm
• Decide what you need to accomplish
• What resources are needed?
  – 2 cpus, 12 GB memory, for 2 hours?
• What steps are required?
  – Run prog1, then prog2… etc.
  – Are the steps dependent on one another?
• Can your work or project be broken up into smaller pieces? Smaller pieces can make the workload more agile.
• How long should your job run for?
• Is your software multithreaded; does it use OpenMP or MPI?
Example Job script

    #!/bin/bash
    #SBATCH --job-name=test
    #SBATCH --output=/scratch/nauid/output.txt
    #SBATCH --time=20:00            # shorter time = sooner start
    #SBATCH --workdir=/scratch/nauid

    # replace this module with software required in your workload
    module load python/3.3.4

    # example job commands
    # each srun command is a job step, so this job will have 2 steps
    srun sleep 300
    srun python -V
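As a rough illustration, the resource question from the "Interacting with Slurm" slide (2 cpus, 12 GB memory, for 2 hours) could be expressed with directives like these; the values are only an example:

    #SBATCH --cpus-per-task=2       # 2 cpus for the job (or -c 2)
    #SBATCH --mem=12000             # 12 GB of memory, specified in MB
    #SBATCH --time=2:00:00          # 2 hours of wall time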
Interactive/Debug Work
• Run your compiles and testing on the cluster nodes by:
  – srun -p all gcc hello.c -o a.out
  – srun -p all -c 12 make -j12
  – srun --qos=debug -c 12 make -j12
  – srun Rscript analysis.r
  – srun python analysis.py
  – srun cp -av /scratch/NAUID/lots_o_files /scratch-lt/NAUID/destination
Long Interactive work
• salloc
  – Obtain a SLURM job allocation that you can work with interactively for an extended amount of time. This is useful for longer testing/debugging sessions.
    [cbc@wind ~]$ salloc -c 8 --time=2-00:00:00
    salloc: Granted job allocation 33442
    [cbc@wind ~]$ srun python analysis.py
    [cbc@wind ~]$ exit
    salloc: Relinquishing job allocation 33442
    [cbc@wind ~]$ salloc -N 2
    salloc: Granted job allocation 33443
    [cbc@wind ~]$ srun hostname
    cn3.nauhpc
    cn2.nauhpc
    [cbc@wind ~]$ exit
    salloc: Relinquishing job allocation 33443
Job Parameters
You want / Switches needed
• More than one cpu for the job: --cpus-per-task=2, or -c 2
• To specify an ordering of your jobs: --dependency=afterok:job_id, or -d afterok:job_id
• To split up the output and errors: --output=result.txt --error=error.txt
• To run your job at a particular time/day: --begin=16:00, --begin=now+1hour, --begin=2010-01-20T12:34:00
• To add MPI tasks/ranks to your job: --ntasks=2, or -n 2
• To control requeue behavior on failure: --no-requeue, --requeue
• To receive status email: --mail-type=ALL
Constraints and Resources
You want / Switches needed
• To choose a specific node feature (e.g. avx2): --constraint=avx2
• To use a generic resource (e.g. a gpu): --gres=gpu:tesla:1
• To reserve a whole node for yourself: --exclusive
• To choose a partition: --partition=<partition_name>
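Putting a couple of these together, a GPU job might look roughly like the following sketch. The job name, output path, time limit, and program name are placeholders; the gres and constraint values are the ones from the table above:

    #!/bin/bash
    #SBATCH --job-name=gpu_test
    #SBATCH --output=/scratch/nauid/gpu_test.out
    #SBATCH --time=30:00
    #SBATCH --gres=gpu:tesla:1      # request one Tesla GPU
    #SBATCH --constraint=avx2       # only run on nodes with the avx2 feature

    # my_gpu_program stands in for your own GPU-enabled application
    srun ./my_gpu_program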
Submit the script

    [cbc@wind ~]$ sbatch jobscript.sh
    Submitted batch job 85223

• Slurm returns a job id for your job that you can use to monitor it or modify its constraints
Monitoring your job
• squeue – view information about jobs located in the SLURM scheduling queue
• squeue --start
• squeue -u login
• squeue -o "%j %u …"
• squeue -p partitionname
• squeue -S sortfield
• squeue -t <state> (PD or R)
• Combined examples below
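A few combined examples; abc123 is just a placeholder for your own NAU user id:

    squeue -u abc123 -t PD          # your jobs that are still pending
    squeue -u abc123 --start        # estimated start times for those pending jobs
    squeue -p all -t R              # running jobs in the "all" partition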
Cluster info
• sinfo – view information about SLURM nodes and partitions
• sinfo -N -l
• sinfo -R
  – List reasons for downed nodes and partitions
Monitoring your job
• sprio – view the factors that comprise a job's scheduling priority
• sprio -l – list priority of users' jobs in the pending state
• sprio -o "%j %u …"
• sprio -w
Monitoring your job
• sstat – display various status information for a running job/step
• sstat -j jobid
• sstat -o AveCPU,AveRSS
• Only works with jobs containing steps
Controlling your job
• scancel – used to signal jobs or job steps that are under the control of Slurm
• scancel jobid
• scancel -n jobname
• scancel -u mylogin
• scancel -t pending (only yours)
Controlling your job
• scontrol – used to view and modify Slurm configuration and state
• scontrol show job 85224
Job Accounting
• sacct
  – Displays accounting data for your jobs and job steps from the SLURM job accounting log or SLURM database
• sacct -j jobid -o jobid,elapsed,maxrss
• sacct -N nodelist
• sacct -u mylogin
• Try our alias "jobstats"
  – jobstats
  – jobstats -j <jobid>
Job Accounting
• sshare – tool for listing the shares of associations to a cluster
• sshare -l: view and compare your fairshare with other accounts
• sshare -a: view all users' fairshare
• sshare -a -A <account>: view all members in your account (group)
Account hierarchy
• Your user account belongs to a parent faculty account (group)
• Your user account shares resources that are provided for your group
• Example:
  – coffey
    • cbc
    • mkg52
• View the account structure you belong to with: "sshare -a -A <account>"
• Example:
  – sshare -a -A coffey
Limits on the account (group)
• Limits are in place to prevent intentional or unintentional misuse of resources, to ensure quick and fair turnaround times on jobs for everyone
• Groups are limited to a total number of cpu-minutes in use at one time: 700,000
• This cpu resource limit mechanism is referred to as "TRESRunMins"
TRESRunMins Limit
• What the heck is that!?
• A number which limits the total number of remaining cpu-minutes that your running jobs can occupy
• Enables flexible resource limiting
• Staggers jobs
• Increases cluster utilization
• Leads to more accurate resource requests
• Computed as the sum over your running jobs of (cpus × time limit remaining)
Examples
• 14,400 = 10 jobs × 1 cpu × 1,440 min (1 day) each
• 144,000 = 10 jobs × 10 cpus × 1,440 min (1 day) each
• 576,000 = 10 jobs × 10 cpus × 5,760 min (4 days) each
• 648,000 = 100 jobs × 1 cpu × 6,480 min (4.5 days) each
Questions?
• Check your group's cpu-min usage:
  – sshare -l
Exercise 1
Get to know Monsoon and Slurm on your own.
1. How many nodes make up Monsoon?
   – Hint: use "sinfo"
2. How many nodes are in the all partition?
3. How many jobs are currently in the running state?
   – Hint: use "squeue -t R"
4. How many jobs are currently in the pending state? Why?
   – Hint: use "squeue -t PD"
Exercise 2
• Create a simple job in your home directory
• Example here: /common/contrib/examples/job_scripts/simplejob.sh (copy it if you like)
• Name your job: "exercise"
• Name your job's output: "exercise.out"
• Output should go to /scratch/<user>/exercise.out
• Load the module "workshop"
• Run the "date" command
• And additionally, the "secret" command
• Submit your job with sbatch, i.e. "sbatch simplejob.sh"
• One possible shape for the script is sketched below
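A sketch of what the edited script might look like, not the contents of simplejob.sh; the time limit is an assumption, and the exercise implies the "workshop" module provides the "secret" command:

    #!/bin/bash
    #SBATCH --job-name=exercise
    #SBATCH --output=/scratch/<user>/exercise.out   # replace <user> with your NAU user id
    #SBATCH --workdir=/scratch/<user>
    #SBATCH --time=5:00                             # assumed short time limit

    module load workshop

    srun date
    srun secret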
Exercise 3
• Make your job sleep for 5 minutes (sleep 300)
  – sleep is a command that creates a process that… sleeps
• Monitor your job
  – squeue -u your_nauid
  – squeue -t R
  – scontrol show job jobnum
  – sacct -j jobnum
• Inspect the steps
• Cancel your job
  – scancel jobnum
Exercise 4
• Copy this job script and edit it:
  – /common/contrib/examples/job_scripts/lazyjob.sh
• Submit the job; it will take 65 sec to complete
• Use sstat to monitor the job
  – sstat -j <jobid>
• Review the resources that the job used
  – jobstats -j <jobid>
• We are looking for "MaxRSS"; MaxRSS is the max amount of memory used
• Edit the job script, reduce the memory being requested (in MB), and resubmit; edit "--mem=", e.g. --mem=600
• Review the resources that the optimized job utilized once again
  – jobstats -j <jobid>
• Ok, memory looks good, but notice that the user cpu is the same as the elapsed time
  – User cpu = number of utilized cpus × elapsed time
• This is because the application we were running only used 1 of the 4 cpus that we requested
• Edit the lazyjob script, comment out the first srun command, and uncomment the second srun command
• Resubmit
• Rerun jobstats -j <jobid>; notice that user cpu is now a multiple (in this case 4) of the elapsed time, because we were allocated 4 cpus and used all 4 cpus
Slurm Arrays!
Slurm Arrays Exercise
• From your scratch directory: "/scratch/nauid"
• tar xvf /common/contrib/examples/bigdata_example.tar
• cd bigdata
• Edit the file "job_array.sh" so that it works with your nauid, replacing all NAUID with yours
• Submit the script: "sbatch job_array.sh"
• Run "squeue"; notice there are 5 jobs running. How did that happen? (See the sketch below.)
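The trick is the --array directive: one sbatch submission fans out into several array tasks. A minimal sketch of the idea follows; it is not the contents of job_array.sh, and the 1-5 range, program name, and data file naming are assumptions based on the five jobs seen in squeue:

    #!/bin/bash
    #SBATCH --job-name=array_demo
    #SBATCH --output=/scratch/NAUID/array_%A_%a.out   # %A = array job id, %a = task id
    #SBATCH --array=1-5                               # submit 5 array tasks from one script
    #SBATCH --time=10:00

    # each task gets its own SLURM_ARRAY_TASK_ID, so it can pick its own input file
    srun ./process_chunk input_${SLURM_ARRAY_TASK_ID}.dat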
MPI Example
• Refer to the MPI example here:
  – /common/contrib/examples/job_scripts/mpijob.sh
• Edit it for your work areas, then experiment:
  – Change number of tasks, nodes… etc.
• You can also run the example like this:
  – srun --qos=debug -n 4 /common/contrib/examples/mpi/hellompi
Keep these tips in mind
• Know the software you are running
• Request resources accurately
• Supply an accurate time limit for your job
• Don't be lazy; it will affect you and your group negatively
Question and Answer
• More info here: http://nau.edu/hpc
• Linux shell help here:
  – http://linuxcommand.org/tlcl.php
  – Free book download
• And on the nauhpc listserv
  – [email protected]