Intro to Monsoon and Slurm
TRANSCRIPT
1/18/2017
Introductions
• Introduce yourself
  – Name
  – Department/group
  – What project(s) do you plan to use Monsoon for?
  – Linux or Unix experience
  – Previous cluster experience?
List of Topics
• Cluster education
  – What is a cluster, exactly?
  – Queues, scheduling, and resource management
• Cluster orientation
  – Monsoon cluster specifics
  – How do I use this cluster?
  – Group resource limits
  – Exercises
  – Question and answer
What is a cluster?
• A computer cluster is many individual computer systems (nodes) networked together locally to serve as a single resource
• A cluster provides the ability to solve problems at a scale not feasible on a single machine
Inside a Node
• (Diagram: a single node containing four CPU sockets, Socket 0 through Socket 3)
What is a queue?
• Normally thought of as a line: first in, first out (FIFO)
• Queues on a cluster can be as basic as a FIFO, or far more advanced, with dynamic priorities taking many factors into consideration
What is scheduling?
• "A plan or procedure with a goal of completing some objective within some time frame"
• Scheduling for a cluster at the basic level is much the same: assigning work to computers to complete objectives within the time available
• It is not quite that easy, though; many factors come into play when scheduling work on a cluster
Scheduling
• A scheduler needs to know what resources are available on the cluster
• Assignment of work on a cluster is carried out most efficiently with scheduling and resource management working together
Resource Management
• Monitoring resource availability and health
• Allocation of resources
• Execution of work on resources
• Accounting of resource usage
Cluster scheduling goals
• Optimize quantity of work
• Optimize usage of resources
• Serve all users and projects fairly
• Make scheduling decisions transparent
Cluster Resources
• Nodes
• Memory
• CPUs
• GPUs
• Licenses
Many scheduling methods
• FIFO
  – Simply first in, first out
• Backfill
  – Runs smaller jobs with lower resource requirements while larger jobs wait for their higher resource requirements to become available
• Fairshare
  – Prioritizes jobs based on each user's recent resource consumption
Monsoon
• The Monsoon cluster is a resource available to the NAU research enterprise
• 32 systems (nodes) – cn[1-32]
• 884 cores
• 12 GPUs, NVIDIA Tesla K80
• Red Hat Enterprise Linux 6.7
• 12 TB memory – 128 GB/node min, 1.5 TB max
• 170 TB high-speed scratch storage
• 500 TB long-term storage
• High-speed interconnect: FDR InfiniBand
Monsoon scheduling
• Slurm (Simple Linux Utility for Resource Management)
• Excellent resource manager and scheduler
• Precise control over resource requests
• Developed at LLNL, continued by SchedMD
• Used everywhere from small clusters to the largest clusters:
  – Sunway TaihuLight (#1), 10.6M cores, 93 PF, ~15 MW – China
  – Titan (#3), 561K cores, 17.6 PF, ~8 MW – United States
Small cluster! Dual core?
Largest cluster! 10,649,600 cores
Monsoon scheduling
• Combination of scheduling methods
• Currently configured to use backfill along with a multifactor priority system to prioritize jobs
Factors contributing to priority
• Fairshare (predominant factor)
  – Priority points determined by a user's recent resource usage
  – Decay half-life of 1 day
• QOS (Quality of Service)
  – Some QOSs have higher priority than others, for instance: debug
• Age – how long the job has sat pending
• Job size – the number of nodes/cpus a job is requesting
Storage
• /home
  – 10 GB quota
  – Keep your scripts and executables here
  – Snapshotted twice a day: /home/.snapshot
  – Please do not write job output (logs, results) here!
• /scratch
  – 30-day retention
  – Very fast storage, capable of 11 GB/sec
  – Checkpoints, logs
  – Keep all temp/intermediate data here
  – Should be your default location to perform input/output
Storage
• /projects
  – Long-term storage project shares
  – Space is assigned to a faculty member for the group to share
  – Snapshots available
  – No backups today
• /common
  – Cluster support share
  – Contrib: place to put scripts/libs/confs/dbs for others to use
Data Flow
1. Keep scripts and executables in /home
2. Write temp/intermediate data to /scratch
3. Copy data to /projects/<group_project> for group storage and reference in other projects
4. Clean up /scratch
** Remember, /scratch is a scratch filesystem, used for high-speed temporary and intermediate data (see the sketch below)
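A minimal shell sketch of that flow. The directory names my_run and my_group, and the script run_analysis.sh, are hypothetical placeholders; substitute your own NAU user id for nauid:

    # 1. scripts and executables live in /home
    cd /home/nauid/my_run_scripts

    # 2. the job writes temp/intermediate data under /scratch
    sbatch --workdir=/scratch/nauid/my_run run_analysis.sh

    # 3. copy results you want to keep to the group project share
    cp -av /scratch/nauid/my_run/results /projects/my_group/my_run_results

    # 4. clean up /scratch when finished
    rm -rf /scratch/nauid/my_run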
Remote storage access
• scp
  – scp [email protected]:/scratch/nauid
  – WinSCP
• samba/cifs
  – \\nau.froot.nau.edu\cirrus (Windows)
  – smb://nau.froot.nau.edu/cirrus (Mac)
Groups
• NAU has a resource called Enterprise groups
• They are available to you on the cluster if you'd like to manage access to data
• Manage access to your files
• https://my.nau.edu
  – "Go to Enterprise Groups"
  – Take a look at our FAQ: nau.edu/hpc/faq
• If they are not working for you, contact the ITS help desk
Software
• ENVI/IDL
• Matlab
• Intel Compilers, and MKL
• R
• Qiime
• Anaconda Python
• OpenFOAM
• SOWFA
• Lots of bioinformatics programs
• Request additional software to be installed!
Modules
• Software environment management is handled by the modules package management system
• module avail – what modules are available
• module list – modules currently loaded
• module load <modulename> – load a package module
• module display <modulename> – detailed information, including environment variables affected
• A typical session is sketched below
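The python/3.3.4 module name below is taken from the example job script later in this deck; the exact environment variables each module sets will vary:

    module avail                  # see what is installed
    module load python/3.3.4      # load the version you need
    module list                   # confirm what is loaded
    module display python/3.3.4   # see the environment changes it makes
    module unload python/3.3.4    # remove it when done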
MPI
• Quick note on MPI
• Message Passing Interface for parallel computing
• OpenMPI set as the default MPI
• Mvapich2 also available
  – module unload openmpi
  – module load mvapich2
• Example MPI job script (a minimal sketch follows):
  – /common/contrib/examples/job_scripts/mpijob.sh
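For reference, a minimal MPI batch script might look roughly like this. It is a sketch, not the contents of mpijob.sh, and the program name hello_mpi is a placeholder:

    #!/bin/bash
    #SBATCH --job-name=mpi_test
    #SBATCH --output=/scratch/nauid/mpi_test.out
    #SBATCH --ntasks=4              # number of MPI ranks
    #SBATCH --time=10:00

    module load openmpi             # or: module load mvapich2

    # srun launches one copy of the program per task (MPI rank)
    srun ./hello_mpi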
Interacting with Slurm
• Decide what you need to accomplish
• What resources are needed?
  – 2 cpus, 12 GB memory, for 2 hours?
• What steps are required?
  – Run prog1, then prog2… etc.
  – Are the steps dependent on one another?
• Can your work or project be broken up into smaller pieces? Smaller pieces can make the workload more agile.
• How long should your job run for?
• Is your software multithreaded; does it use OpenMP or MPI?
Example Job script

    #!/bin/bash
    #SBATCH --job-name=test
    #SBATCH --output=/scratch/nauid/output.txt
    #SBATCH --time=20:00            # shorter time = sooner start
    #SBATCH --workdir=/scratch/nauid

    # replace this module with software required in your workload
    module load python/3.3.4

    # example job commands
    # each srun command is a job step, so this job will have 2 steps
    srun sleep 300
    srun python -V
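As a rough illustration, the resource question from the "Interacting with Slurm" slide (2 cpus, 12 GB memory, for 2 hours) could be expressed with directives like these; the values are only an example:

    #SBATCH --cpus-per-task=2       # 2 cpus for the job (or -c 2)
    #SBATCH --mem=12000             # 12 GB of memory, specified in MB
    #SBATCH --time=2:00:00          # 2 hours of wall time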
Interactive/Debug Work
• Run your compiles and testing on the cluster nodes by:
  – srun -p all gcc hello.c -o a.out
  – srun -p all -c 12 make -j12
  – srun --qos=debug -c 12 make -j12
  – srun Rscript analysis.r
  – srun python analysis.py
  – srun cp -av /scratch/NAUID/lots_o_files /scratch-lt/NAUID/destination
Long Interactive work
• salloc
  – Obtain a SLURM job allocation that you can work with interactively for an extended amount of time. This is useful for longer testing/debugging sessions.
    [cbc@wind ~]$ salloc -c 8 --time=2-00:00:00
    salloc: Granted job allocation 33442
    [cbc@wind ~]$ srun python analysis.py
    [cbc@wind ~]$ exit
    salloc: Relinquishing job allocation 33442
    [cbc@wind ~]$ salloc -N 2
    salloc: Granted job allocation 33443
    [cbc@wind ~]$ srun hostname
    cn3.nauhpc
    cn2.nauhpc
    [cbc@wind ~]$ exit
    salloc: Relinquishing job allocation 33443
Job Parameters
You want / Switches needed
• More than one cpu for the job: --cpus-per-task=2, or -c 2
• To specify an ordering of your jobs: --dependency=afterok:job_id, or -d afterok:job_id
• To split up the output and errors: --output=result.txt --error=error.txt
• To run your job at a particular time/day: --begin=16:00, --begin=now+1hour, --begin=2010-01-20T12:34:00
• To add MPI tasks/ranks to your job: --ntasks=2, or -n 2
• To control requeue behavior on failure: --no-requeue, --requeue
• To receive status email: --mail-type=ALL
Constraints and Resources
You want / Switches needed
• To choose a specific node feature (e.g. avx2): --constraint=avx2
• To use a generic resource (e.g. a gpu): --gres=gpu:tesla:1
• To reserve a whole node for yourself: --exclusive
• To choose a partition: --partition=<partition_name>
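Putting a couple of these together, a GPU job might look roughly like the following sketch. The job name, output path, time limit, and program name are placeholders; the gres and constraint values are the ones from the table above:

    #!/bin/bash
    #SBATCH --job-name=gpu_test
    #SBATCH --output=/scratch/nauid/gpu_test.out
    #SBATCH --time=30:00
    #SBATCH --gres=gpu:tesla:1      # request one Tesla GPU
    #SBATCH --constraint=avx2       # only run on nodes with the avx2 feature

    # my_gpu_program stands in for your own GPU-enabled application
    srun ./my_gpu_program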
Submit the script

    [cbc@wind ~]$ sbatch jobscript.sh
    Submitted batch job 85223

• Slurm returns a job id for your job that you can use to monitor it or modify its constraints
Monitoring your job
• squeue – view information about jobs located in the SLURM scheduling queue
• squeue --start
• squeue -u login
• squeue -o "%j %u …"
• squeue -p partitionname
• squeue -S sortfield
• squeue -t <state> (PD or R)
• Combined examples below
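A few combined examples; abc123 is just a placeholder for your own NAU user id:

    squeue -u abc123 -t PD          # your jobs that are still pending
    squeue -u abc123 --start        # estimated start times for those pending jobs
    squeue -p all -t R              # running jobs in the "all" partition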
Cluster info
• sinfo – view information about SLURM nodes and partitions
• sinfo -N -l
• sinfo -R
  – List reasons for downed nodes and partitions
Monitoring your job
• sprio – view the factors that comprise a job's scheduling priority
• sprio -l – list priority of users' jobs in the pending state
• sprio -o "%j %u …"
• sprio -w
Monitoring your job
• sstat – display various status information for a running job/step
• sstat -j jobid
• sstat -o AveCPU,AveRSS
• Only works with jobs containing steps
Controlling your job
• scancel – used to signal jobs or job steps that are under the control of Slurm
• scancel jobid
• scancel -n jobname
• scancel -u mylogin
• scancel -t pending (only yours)
Controlling your job
• scontrol – used to view and modify Slurm configuration and state
• scontrol show job 85224
Job Accounting
• sacct
  – Displays accounting data for your jobs and job steps from the SLURM job accounting log or SLURM database
• sacct -j jobid -o jobid,elapsed,maxrss
• sacct -N nodelist
• sacct -u mylogin
• Try our alias "jobstats"
  – jobstats
  – jobstats -j <jobid>
Job Accounting
• sshare – tool for listing the shares of associations to a cluster
• sshare -l: view and compare your fairshare with other accounts
• sshare -a: view all users' fairshare
• sshare -a -A <account>: view all members in your account (group)
Account hierarchy
• Your user account belongs to a parent faculty account (group)
• Your user account shares resources that are provided for your group
• Example:
  – coffey
    • cbc
    • mkg52
• View the account structure you belong to with: "sshare -a -A <account>"
• Example:
  – sshare -a -A coffey
Limits on the account (group)
• Limits are in place to prevent intentional or unintentional misuse of resources, to ensure quick and fair turnaround times on jobs for everyone
• Groups are limited to a total number of cpu-minutes in use at one time: 700,000
• This cpu resource limit mechanism is referred to as "TRESRunMins"
TRESRunMins Limit
• What the heck is that!?
• A number which limits the total number of remaining cpu-minutes that your running jobs can occupy
• Enables flexible resource limiting
• Staggers jobs
• Increases cluster utilization
• Leads to more accurate resource requests
• Computed as the sum over your running jobs of (cpus × time limit remaining)
Examples
• 14,400 = 10 jobs × 1 cpu × 1,440 min (1 day) each
• 144,000 = 10 jobs × 10 cpus × 1,440 min (1 day) each
• 576,000 = 10 jobs × 10 cpus × 5,760 min (4 days) each
• 648,000 = 100 jobs × 1 cpu × 6,480 min (4.5 days) each
Questions?
• Check your group's cpu-min usage:
  – sshare -l
Exercise 1
Get to know Monsoon and Slurm on your own.
1. How many nodes make up Monsoon?
   – Hint: use "sinfo"
2. How many nodes are in the all partition?
3. How many jobs are currently in the running state?
   – Hint: use "squeue -t R"
4. How many jobs are currently in the pending state? Why?
   – Hint: use "squeue -t PD"
Exercise 2
• Create a simple job in your home directory
• Example here: /common/contrib/examples/job_scripts/simplejob.sh (copy it if you like)
• Name your job: "exercise"
• Name your job's output: "exercise.out"
• Output should go to /scratch/<user>/exercise.out
• Load the module "workshop"
• Run the "date" command
• And additionally, the "secret" command
• Submit your job with sbatch, i.e. "sbatch simplejob.sh"
• One possible shape for the script is sketched below
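A sketch of what the edited script might look like, not the contents of simplejob.sh; the time limit is an assumption, and the exercise implies the "workshop" module provides the "secret" command:

    #!/bin/bash
    #SBATCH --job-name=exercise
    #SBATCH --output=/scratch/<user>/exercise.out   # replace <user> with your NAU user id
    #SBATCH --workdir=/scratch/<user>
    #SBATCH --time=5:00                             # assumed short time limit

    module load workshop

    srun date
    srun secret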
Exercise 3
• Make your job sleep for 5 minutes (sleep 300)
  – sleep is a command that creates a process that… sleeps
• Monitor your job
  – squeue -u your_nauid
  – squeue -t R
  – scontrol show job jobnum
  – sacct -j jobnum
• Inspect the steps
• Cancel your job
  – scancel jobnum
Exercise 4
• Copy this job script and edit it:
  – /common/contrib/examples/job_scripts/lazyjob.sh
• Submit the job; it will take 65 sec to complete
• Use sstat to monitor the job
  – sstat -j <jobid>
• Review the resources that the job used
  – jobstats -j <jobid>
• We are looking for "MaxRSS"; MaxRSS is the max amount of memory used
• Edit the job script, reduce the memory being requested (in MB), and resubmit; edit "--mem=", e.g. --mem=600
• Review the resources that the optimized job utilized once again
  – jobstats -j <jobid>
• Ok, memory looks good, but notice that the user cpu is the same as the elapsed time
  – User cpu = number of utilized cpus × elapsed time
• This is because the application we were running only used 1 of the 4 cpus that we requested
• Edit the lazyjob script, comment out the first srun command, and uncomment the second srun command
• Resubmit
• Rerun jobstats -j <jobid>; notice that user cpu is now a multiple (in this case 4) of the elapsed time, because we were allocated 4 cpus and used all 4 cpus
Slurm Arrays!
Slurm Arrays Exercise
• From your scratch directory: "/scratch/nauid"
• tar xvf /common/contrib/examples/bigdata_example.tar
• cd bigdata
• Edit the file "job_array.sh" so that it works with your nauid, replacing all NAUID with yours
• Submit the script: "sbatch job_array.sh"
• Run "squeue"; notice there are 5 jobs running. How did that happen? (See the sketch below.)
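The trick is the --array directive: one sbatch submission fans out into several array tasks. A minimal sketch of the idea follows; it is not the contents of job_array.sh, and the 1-5 range, program name, and data file naming are assumptions based on the five jobs seen in squeue:

    #!/bin/bash
    #SBATCH --job-name=array_demo
    #SBATCH --output=/scratch/NAUID/array_%A_%a.out   # %A = array job id, %a = task id
    #SBATCH --array=1-5                               # submit 5 array tasks from one script
    #SBATCH --time=10:00

    # each task gets its own SLURM_ARRAY_TASK_ID, so it can pick its own input file
    srun ./process_chunk input_${SLURM_ARRAY_TASK_ID}.dat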
MPI Example
• Refer to the MPI example here:
  – /common/contrib/examples/job_scripts/mpijob.sh
• Edit it for your work areas, then experiment:
  – Change number of tasks, nodes… etc.
• You can also run the example like this:
  – srun --qos=debug -n 4 /common/contrib/examples/mpi/hellompi
Keep these tips in mind
• Know the software you are running
• Request resources accurately
• Supply an accurate time limit for your job
• Don't be lazy; it will affect you and your group negatively
Question and Answer
• More info here: http://nau.edu/hpc
• Linux shell help here:
  – http://linuxcommand.org/tlcl.php
  – Free book download
• And on the nauhpc listserv
  – [email protected]