Cori and Edison Queues
Helen He NUG Meeting, 1/21/2016
Goals for Cori and Edison
• Where to run what type of jobs after Carver and Hopper retired?
• The Cori Phase 1 (also known as the “Cori Data Partition”) system is designed to accelerate data-intensive applications, with high throughput and “realtime” need.
  – “shared” partition. Multiple jobs on the same node. Larger submit and run limits.
  – The 1-2 node bin in the “regular” partition (mimics the “thruput” queue on Hopper). Large submit and run limits.
  – “realtime” partition. Highest queue priority. Special permission only.
  – “burst buffer” capability, in early user period.
  – Max walltime limit for Cori increased to 48 hrs (from 24 hrs) yesterday.
• Edison’s purpose is the support of large jobs
  – Edison is the largest NERSC system.
  – Larger jobs are boosted for queue priority.
  – Jobs using 683+ nodes on Edison get a 40% charging discount.
  – Edison queue structure is largely simplified.
• These goals have been communicated to users in the weekly newsletter and published on the NERSC website.
Cori Queues
Edison Queues
SLURM on Cori and Edison
• This presentation will focus more on Cori.
• Users have been on Cori with SLURM longer
  – Cori: all users from 11/12/2015
  – Edison: all users from 01/04/2016
  – More experience tuning SLURM configurations on Cori
• Cori has more complicated queue structures
  – Exciting new features complicate scheduling
• Edison and Cori share similar SLURM configurations.
• Lessons learned from Cori are applied to Edison, and vice versa.
SLURM Configuration is Ongoing
• Before AY16 started on Jan 12, we mostly focused on installing Cori, moving Edison, and performing initial deployments of SLURM.
• After the move and the allocation year policy changes were in, we focused on detailed queue turnaround, utilization, and efficient scheduling of the workload.
  – Extremely successful in fixing the issues that were present in the initial configurations
• We will be tuning towards more user-facing issues, such as reliable rankings of the queue, end-of-job processing, and enabling new features to allow users to continue running once their repo has been exhausted.
• User feedback and comments are always welcome
“shared” Partition on Cori
• Users see many jobs in “shared”, and each appears to use 1 node per job (as displayed by the queue monitoring scripts). It actually does NOT.
• Serial jobs or small parallel jobs are shared on these nodes.
• 40 nodes are set aside for the “shared” jobs.
• “shared” jobs do not run on other nodes currently (may change in the future).
• High submit limits (2500) and run limits (500).
• Jobs are getting very good throughput.
• “shared” jobs are not charged by entire node, but by actual physical cores used.
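The per-core charging in the last bullet can be illustrated with a small calculation. This is only a sketch: the 32-core node size, the linear scaling, and the helper names `shared_node_fraction` and `charged_core_hours` are assumptions for illustration, not NERSC's actual charging formula.

```python
def shared_node_fraction(cores_used: int, cores_per_node: int = 32) -> float:
    """Fraction of one node attributed to a job in the "shared" partition.

    cores_per_node=32 is an assumed value for illustration only.
    """
    if not 1 <= cores_used <= cores_per_node:
        raise ValueError("cores_used must be between 1 and cores_per_node")
    return cores_used / cores_per_node


def charged_core_hours(cores_used: int, wall_hours: float) -> float:
    """Core-hours billed when only the physical cores used are charged."""
    return cores_used * wall_hours


if __name__ == "__main__":
    # A hypothetical 4-core job that runs for 2.5 hours.
    print(shared_node_fraction(4))       # share of the node it occupies
    print(charged_core_hours(4, 2.5))    # core-hours under per-core charging
```

Under these assumptions a serial job pays for one core rather than a whole node, which is why the “shared” partition suits serial and very small parallel work.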
“realtime” Partition on Cori
• Special permission to use “realtime” for real-time need of data-intensive workflows.
• Highest priority for “realtime” jobs so they start almost immediately. Could be disruptive to overall queue scheduling.
• “realtime” jobs can run in “shared” or “exclusive” mode for node usage.
• 8 nodes are set aside for the “realtime” jobs (currently)
• “realtime” jobs can run on other nodes.
Two SLURM Schedulers are in Work
• Instant Scheduler (event triggered)
  – Performs a quick and simple scheduling attempt at events such as job submission or completion and configuration changes.
• Backfill Scheduler (at set intervals)
  – Considers pending jobs in priority order, determining when and where each will start, taking into consideration the possibility of job preemption, gang scheduling, generic resource (GRES) requirements, memory requirements, etc.
  – If the job under consideration can start immediately without impacting the expected start time of any higher priority job, then it does so.
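The backfill rule in the last bullet can be sketched with a toy model. The code below is a simplified single-reservation backfill over one pool of identical nodes, not SLURM's actual implementation: it ignores preemption, gang scheduling, GRES, and memory, and the names `Job`, `earliest_start`, and `backfill_pass` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    hours: float  # requested wall time


def earliest_start(job, free_nodes, running):
    """Return (start_time, nodes_available_then) for `job`.

    `running` is a list of (end_time, nodes) pairs; time 0 is "now".
    """
    available = free_nodes
    if available >= job.nodes:
        return 0.0, available
    for end_time, nodes in sorted(running):
        available += nodes
        if available >= job.nodes:
            return end_time, available
    raise ValueError(f"{job.name} can never fit on this system")


def backfill_pass(pending, free_nodes, running):
    """One simplified backfill cycle; returns names of jobs started now.

    `pending` must already be sorted by priority, highest first.
    """
    started = []
    running = list(running)
    shadow_time = None  # reserved start time of the first blocked job
    spare_nodes = 0     # nodes the blocked job will not need at shadow_time
    for job in pending:
        if shadow_time is None:
            start, available = earliest_start(job, free_nodes, running)
            if start > 0.0:
                # First job that cannot start now: protect its start time.
                shadow_time, spare_nodes = start, available - job.nodes
                continue
        else:
            fits_before = job.hours <= shadow_time  # ends before reservation
            fits_beside = job.nodes <= spare_nodes  # uses nodes not reserved
            if job.nodes > free_nodes or not (fits_before or fits_beside):
                continue
            if not fits_before:
                spare_nodes -= job.nodes
        started.append(job.name)
        free_nodes -= job.nodes
        running.append((job.hours, job.nodes))
    return started


if __name__ == "__main__":
    # Hypothetical 10-node system: 6 nodes busy for 4 more hours, 4 free.
    queue = [Job("A", 8, 2), Job("B", 2, 3), Job("C", 4, 6), Job("D", 2, 10)]
    print(backfill_pass(queue, free_nodes=4, running=[(4.0, 6)]))
```

In this example job A is blocked and gets a reserved start at hour 4. B starts now because it finishes before then, D starts because it fits in the nodes A will not need, and C is skipped because it would delay A.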
SLURM Limits and Priority Tunings
• No separate queues for “premium”, “low”, etc. These are now available via QOS settings in the “regular” partition.
• No “idle” limits concept.
  – All jobs in the queue are eligible, except
    • User held jobs, priority value is 0.
  – Dependency jobs, priority value is not 0, but do not age
• Limits and policies enforced to ensure fairness
  – Max submit limit
  – Max run limit
  – Total number of nodes per partition/QOS
  – Backfill interval
  – Max backfill per user (users submitting many jobs won’t have an advantage)
  – Max backfill per partition
  – Max total remaining walltime * nodes from all running jobs (used previously)
  – Fairshare policy (based on remaining allocation and usage before AY16; based on recent usage, with much lower weight, now)
Shorter Queues After Charging Began
• Many more jobs were submitted during free time.
  – Backlogs are large
• Charging began at AY16 start
  – Jobs with no active repo were cancelled
  – Users cancelled their own jobs that they did not want to be charged for
  – Job submission limits were decreased
• User education
  – Communicated with individual users to use the “shared” partition, job arrays, and bundling jobs.
Job Wait Time Improves Significantly on Cori
• Users complained about VERY LONG wait times for jobs
• Changes were made from Jan 15
  – Added a max number of backfill jobs per partition (on top of the max number of backfill jobs per user), which significantly improved the backlog for debug jobs.
  – It allows lower priority debug jobs to run ahead of regular jobs that have a higher absolute priority value.
  – Decreased max size of debug from 128 to 112.
• Most debug jobs now start within 30 min, many much sooner!
• The regular job wait times are significantly shorter too
  – Additional tuning:
    • Increased max backfill interval from 30 to 150 sec
    • Tuned max backfill jobs per user, and max backfill per partition
  – Users deleted more jobs submitted during free time
• Backlog on Cori is now only ~4 days
Backlogs on Cori
• Current backlog is 4 days.
• Huge submissions from 2 users increased backlogs significantly.
  – One user submitted many 512-node jobs, each 24 hrs, which increased the backlog from 40 to 92 days.
  – Another user submitted a large 1000-task array job with a 1 hr walltime limit, later increased to a 12 hr limit, which increased the backlog from 33 to 83 to 644 days.
  – Although the backlogs caused by such submissions appear high, they won’t significantly affect scheduling of other users’ jobs, since the limits we have set mean that most of these jobs are not considered for scheduling.
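The slide does not define how the backlog figure is computed. A minimal sketch, assuming it is the queued node-hours divided by the machine's daily node-hour capacity, shows why a few large submissions move the number so much. The function name `backlog_days` and the 1000-node system size are hypothetical.

```python
def backlog_days(queued_jobs, system_nodes):
    """Days of work in the queue under an assumed node-hour definition.

    queued_jobs: iterable of (nodes, wall_hours, count) tuples.
    system_nodes: nodes available for scheduling (hypothetical value below).
    """
    node_hours = sum(nodes * hours * count for nodes, hours, count in queued_jobs)
    return node_hours / (system_nodes * 24.0)


if __name__ == "__main__":
    nodes_in_system = 1000  # hypothetical, not the real Cori node count

    # 100 hypothetical jobs of 512 nodes x 24 hrs each.
    print(backlog_days([(512, 24, 100)], nodes_in_system))

    # A 1000-task array, one node per task (assumed), at 1 hr and then 12 hrs.
    print(backlog_days([(1, 1, 1000)], nodes_in_system))
    print(backlog_days([(1, 12, 1000)], nodes_in_system))
```

Under this assumed definition, raising an array's walltime limit from 1 hr to 12 hrs multiplies its contribution by 12, which is roughly in line with the 33 to 83 to 644 day jump reported above.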
Average Wait Time for Debug Jobs on Cori
[Figure panels: 11/30/15–1/11/16, 1/12/16–1/15/16, 1/16/16–1/20/16]
Current Debug Jobs on Cori
Average Wait Time for Regular Jobs on Cori (1)
[Figure: 11/30/15–1/11/16. Edison move started on 11/30/15; Hopper retired on 12/15/15.]
Average Wait Time for Regular Jobs on Cori (2)
[Figure panels: Dec 16 – Jan 11; 1/12/16–1/15/16 (AY16 started on 1/12/16)]
Average Wait Time for Regular Jobs on Cori (3)
[Figure panels: Dec 16 – Jan 11; Jan 16-20, 2016 (after changes made on Jan 15)]
New “sqs” with 2 Columns of Priority Ranking
• A new version of “sqs” (a NERSC custom queue monitoring script) was deployed on Jan 19. The original “sqs” has one ranking column, based on the start time provided by the backfill scheduler.
• “sqs” by default shows only the user’s own jobs
• “sqs -a” shows all jobs
• Other sample options:
  – “sqs -a -p debug” (show only debug jobs)
  – “sqs -a -nr -np shared” (no running jobs, no shared jobs)
  – “sqs -w” (show all my jobs in wide format with more info)
  – “sqs -s” (short summary of queued jobs)
• This version provides two columns of ranking values to give users more perspective on their jobs in the queue.
  – Column RANK_P shows the ranking by absolute priority value, which is a function of partition QOS, job wait time, and fair share. Jobs with higher priority won't necessarily run earlier, due to the various run limits, total node limits, and backfill depth we have set.
  – Column RANK_BF shows the ranking by the best estimated start time (if available) at a backfill scheduling cycle (every 150 sec now), so the ranking is dynamic and changes frequently as the queued jobs change.
  – The first few jobs with reason “Resources” are ranked by priority value, hence they match in the RANK_P and RANK_BF columns.
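To show how the two columns can disagree, here is a minimal sketch that ranks the same jobs both ways. It is not the actual “sqs” code: the `QueuedJob` type, the job values, and the choice to rank jobs without a start estimate last (by priority) are all assumptions for illustration.

```python
from typing import NamedTuple, Optional


class QueuedJob(NamedTuple):
    job_id: int
    priority: int
    est_start: Optional[float]  # hours from now; None if no estimate yet


def rank_by_priority(jobs):
    """RANK_P-style ranking: highest priority value first."""
    ordered = sorted(jobs, key=lambda j: -j.priority)
    return {j.job_id: rank for rank, j in enumerate(ordered, start=1)}


def rank_by_backfill(jobs):
    """RANK_BF-style ranking: earliest estimated start first.

    Jobs without an estimate go last, ordered by priority (an assumption).
    """
    ordered = sorted(
        jobs,
        key=lambda j: (j.est_start is None, j.est_start or 0.0, -j.priority),
    )
    return {j.job_id: rank for rank, j in enumerate(ordered, start=1)}


if __name__ == "__main__":
    jobs = [
        QueuedJob(101, 9000, 1.0),
        QueuedJob(102, 8000, 5.0),   # large job, must wait for nodes
        QueuedJob(103, 7000, 2.0),   # small job that backfill can fit earlier
        QueuedJob(104, 6000, None),  # not yet reached by the backfill cycle
    ]
    print(rank_by_priority(jobs))
    print(rank_by_backfill(jobs))
```

Job 103 has lower priority than job 102 but an earlier estimated start, so it ranks third by priority and second by backfill estimate.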
Sample “sqs” Output
% sqs -a -nr | more
Places and Tools to Check Job Status
• Completed jobs webpage:
  – https://www.nersc.gov/users/job-logs-statistics/completed-jobs/
• MyNERSC Queues display
  – https://my.nersc.gov/queues.php?machine=cori&full_name=Cori
• Queue Wait Times
  – http://www.nersc.gov/users/queues/queue-wait-times/
• Scripts described on Queue Monitoring Page (sqs, squeue, sstat, sprio, etc.)
  – https://www.nersc.gov/users/computational-systems/cori/running-jobs/monitoring-jobs/
A Few Tips to Get Faster Job Turnaround
• Request a shorter walltime if you can; do not use the allowed max walltime.
• Use the “shared” partition for serial jobs or very small parallel jobs.
• Bundle jobs (multiple “sruns” in one script, sequentially or simultaneously)
• Use Job Arrays (better for managing jobs, not necessarily faster turnaround. Each array task is considered a single job for scheduling.)
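As an illustration of the bundling tip, the sketch below generates the text of a batch script that runs several “srun” commands in one job, either one after another or side by side. The helper name `bundle_script`, the executable names, and the node counts are hypothetical; the generated `#SBATCH -p`, `-N`, and `-t` directives are standard SLURM options, but check the site documentation for the settings your jobs actually need.

```python
def bundle_script(commands, nodes_per_task, walltime,
                  simultaneous=True, partition="regular"):
    """Build the text of a batch script that bundles several srun steps.

    simultaneous=True requests enough nodes for all steps at once and runs
    them in the background; otherwise the steps run one after another.
    """
    if simultaneous:
        total_nodes = nodes_per_task * len(commands)
    else:
        total_nodes = nodes_per_task
    lines = [
        "#!/bin/bash",
        f"#SBATCH -p {partition}",
        f"#SBATCH -N {total_nodes}",
        f"#SBATCH -t {walltime}",
        "",
    ]
    for cmd in commands:
        suffix = " &" if simultaneous else ""
        lines.append(f"srun -N {nodes_per_task} {cmd}{suffix}")
    if simultaneous:
        lines.append("wait")  # keep the job alive until all steps finish
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical executables, two nodes each, run side by side.
    print(bundle_script(["./a.out", "./b.out"], 2, "01:00:00"))
```

A simultaneous bundle asks for more nodes over a shorter time, while a sequential bundle asks for fewer nodes over a longer time, so the walltime passed in should reflect which mode is used.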