scalable i/o-aware job scheduling for burst buffer enabled hpc...
[Figure labels: I/O-ignorant, I/O-aware; Streamed I/O patterns (PFS-side)]
Scalable I/O-Aware Job Scheduling for Burst Buffer Enabled HPC Clusters
Motivation
Critical Questions
I/O-aware scheduling keeps allocated nodes in computation 100% of the time
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. National Science Foundation CCF-1318445/1318417.
Stephen Herbein (1), Dong H. Ahn (2), Don Lipari (2), Tom Scogland (2), Marc Stearman (2), Jim Garlick (2), Mark Grondona (2), Becky Springmeyer (2), Michela Taufer (1)
(1) University of Delaware, (2) Lawrence Livermore National Laboratory
[Figure labels: Peak FLOPS; CN; BB SSD; Parallel File System; PFS BW (10s GB/s); BB BW (100s GB/s); PFS BW (1s GB/s)]
Scheduler Decision Time / Efficiency vs. Turnaround
I/O-aware scheduling eliminates variability in job performance due to I/O contention
I/O-aware scheduling is still viable for online batch job scheduling
I/O-aware scheduling increases science (>1.29x) in exchange for increasing turnaround time (<1.52x)
Making the Scheduler I/O-aware
Modeling the I/O Contention
• Two scenarios are modeled:
  § All jobs get their requested BW and extra BW remains
  § Smaller I/O requests are satisfied, larger requests contend for BW; no extra BW remains
• Contention occurs in case two and is modeled using an Interference Factor defined in [2]
• Four levels of PFS provisioning
  § 0% (70 GB/s), 10% (63 GB/s), 20% (56 GB/s), and 30% (49 GB/s)
  § Simulates a small PFS or a reservation of BW for external sources of I/O
• Does I/O-aware scheduling:
  § Impact the percentage of time that nodes spend in computation?
  § Impact the variability of each individual job’s performance?
  § Affect the time to make a scheduling decision?
• What is the trade-off between system efficiency and turnaround time?
• The FLOPS vs. I/O imbalance can cause I/O contention
• Burst buffers (BBs) and smart staging postpone contention
• Parallel file systems (PFSes) remain the main bottleneck
We propose a novel, I/O-aware batch scheduling algorithm that can manage I/O contention at the PFS level [1]
• Job 1, by itself, can be scheduled on the system
• Job 2 requests too much BW and can cause contention with Job 1
  § Job 2 is delayed until more BW is available (i.e., when Job 1 completes)
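A minimal sketch of this delay rule, assuming a single shared bandwidth limit and strict first-come, first-served order with no backfilling. The limit, requests, and runtimes are illustrative values, and `start_times` is a hypothetical helper, not the scheduler described in [1].

```python
def start_times(jobs, limit):
    """Return {job name: start time} under one shared bandwidth limit.

    jobs:  list of (name, bandwidth, runtime) in submission order.
    limit: total bandwidth that running jobs may use at once.
    ASSUMPTION: strict FCFS; a job waits until enough bandwidth frees up.
    """
    running = []  # (end_time, bandwidth) of jobs that have been started
    now = 0.0
    starts = {}
    for name, bandwidth, runtime in jobs:
        if bandwidth > limit:
            raise ValueError(f"{name} can never fit under the limit")
        while True:
            # Drop jobs that have finished by the current time.
            running = [(end, bw) for end, bw in running if end > now]
            if sum(bw for _, bw in running) + bandwidth <= limit:
                break
            # Not enough bandwidth: jump to the next job completion.
            now = min(end for end, _ in running)
        starts[name] = now
        running.append((now + runtime, bandwidth))
    return starts


# Illustrative numbers: together the two jobs would need 576 > 512.
delayed = start_times([("job1", 192, 100.0), ("job2", 384, 50.0)], limit=512)
# With a larger limit both jobs start immediately.
together = start_times([("job1", 192, 100.0), ("job2", 384, 50.0)], limit=1024)
```

With the 512 limit, the second job starts only when the first one ends at time 100; with the 1024 limit, both start at time 0.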
References:
[1] S. Herbein, D. H. Ahn, D. Lipari, T. R. Scogland, M. Stearman, M. Grondona, J. Garlick, B. Springmeyer, and M. Taufer, “Scalable I/O-Aware Job Scheduling for Burst Buffer Enabled HPC Clusters,” in Proc. of the 25th International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2016.
[2] M. Dorier, G. Antoniu, R. Ross, D. Kimpe, and S. Ibrahim, “CALCioM: Mitigating I/O Interference in HPC Systems Through Cross-Application Coordination,” in Proc. of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS), May 2014.
Growing computational capability, stagnating I/O capabilities
• Without BBs, the bursty I/O goes straight to the PFS
• With BBs, the application sees much higher I/O BWs
• With BBs, the I/O to the PFS is a constant stream
• PFS is now provisioned for avg. I/O load (not max load)
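The smoothing effect described in these bullets can be sketched in a few lines of Python. This is an illustration only: the burst trace, its units (data written per time step), and the helper `drain_through_buffer` are all hypothetical, and the buffer is assumed to have unlimited capacity.

```python
def drain_through_buffer(burst_trace, drain_rate):
    """Pass a bursty write trace through a buffer that drains at a fixed rate.

    burst_trace: amount written by the application in each time step.
    drain_rate:  maximum amount forwarded to the PFS per time step.
    ASSUMPTION: the buffer never fills up (capacity is not modeled).
    Returns (amount sent to the PFS per step, peak buffer occupancy).
    """
    occupancy = 0.0
    peak_occupancy = 0.0
    pfs_trace = []
    for written in burst_trace:
        occupancy += written
        peak_occupancy = max(peak_occupancy, occupancy)
        sent = min(occupancy, drain_rate)
        occupancy -= sent
        pfs_trace.append(sent)
    return pfs_trace, peak_occupancy


# Hypothetical application: a burst of 100 units every fourth step.
bursty = [100, 0, 0, 0] * 3
average = sum(bursty) / len(bursty)  # load the PFS must sustain
pfs_trace, peak_occupancy = drain_through_buffer(bursty, average)
```

Without the buffer the PFS would have to absorb peaks of 100 units per step; with it, the PFS sees a steady 25 units per step, which is the average load the last bullet refers to.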
Modeling the I/O Subsystem
[Figure: complex resource graph. Labels: Core Switch Pool, Gateway Node Pool, PFS, High Level Switches, Low Level Switches, Scalable Units (SUs) SU0, SU1, …, SU12]
• Modeled system:
  § A 1944 node / 12 SU cluster
  § I/O routed round-robin across core switches and gateway nodes
• Key simplifications:
  § Merge core switches and gateway nodes
  § Leverage BBs to model I/O as a constant stream rather than variable bursts
[Figure: simple resource tree. Labels: Lustre / Parallel File System, Core Switches, High Level Switches, Low Level Switches, Scalable Units (SUs) SU0, SU1, …, SU12]
From a complex resource graph… To a simple resource tree
• 2,500 jobs sampled from LLNL’s workloads
  § Constant job I/O rate of 18 MB/s
• 3,888 node system model from LLNL’s CTS-1
• I/O-aware/ignorant versions of EASY backfilling
  § Emulated using the Flux framework emulator
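As a quick consistency check on these numbers, the following Python lines multiply the job I/O rate by the node count. This assumes the 18 MB/s rate applies per node, which the poster does not state; under that assumption the aggregate demand comes to roughly the 70 GB/s listed as the 0% provisioning level. The second calculation, also under that assumption, estimates how many nodes could stream at once at each provisioning level.

```python
NODES = 3888      # node count of the modeled system
RATE_MB_S = 18    # ASSUMPTION: the constant job I/O rate is per node

# Aggregate demand if every node streams at the constant rate.
aggregate_mb_s = NODES * RATE_MB_S
aggregate_gb_s = aggregate_mb_s / 1000  # decimal GB

# PFS provisioning levels from the poster (percent reserved -> GB/s).
levels_gb_s = {0: 70, 10: 63, 20: 56, 30: 49}

# Nodes that can stream concurrently without exceeding each level.
nodes_supported = {
    pct: (gb_s * 1000) // RATE_MB_S for pct, gb_s in levels_gb_s.items()
}
```

If the per-node reading is right, the 0% level is just enough for all 3,888 nodes, and each further 10% reduction removes bandwidth for a few hundred nodes.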
Test Configuration
I/O-aware Scheduling Scenarios
Based on: N. Liu, J. Cope, P. Carns, C. Carothers, R. Ross, G. Grider, A. Crume, and C. Maltzahn, “On the Role of Burst Buffers in Leadership-class Storage Systems,” MSST/SNAPI 2012.
From a talk of Lucy Nowell, DoE Program Director (DoE Workflow Workshop, Rockville, MD, April 20-21, 2015)
• I/O-aware means using I/O as a key constraint when scheduling jobs
  § Jobs are delayed if they would cause contention in the I/O subsystem
• I/O-aware schedulers keep track of I/O allocations and predict potential I/O contention using both the I/O subsystem and I/O contention models
The Flux framework’s global system view and resource description language enable the use of I/O subsystem and contention models in a scheduler
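One way to picture allocation tracking on a resource tree is the Python sketch below. It is not Flux's API or resource description language; the `Resource` class and `try_allocate` function are hypothetical. The limits and requests are taken from the scenario figure on this poster, and the wiring of compute nodes to switches is inferred from the request totals shown there.

```python
class Resource:
    """A node in the I/O resource tree with a bandwidth limit in MB/s."""

    def __init__(self, name, limit, parent=None):
        self.name, self.limit, self.parent = name, limit, parent
        self.allocated = 0


def path_to_root(node):
    """Yield the node and every ancestor up to the root."""
    while node is not None:
        yield node
        node = node.parent


def try_allocate(requests):
    """Allocate all (leaf, MB/s) requests of a job, or none of them.

    Returns (True, []) on success, or (False, names of the resources
    whose limit would be exceeded) if the job has to be delayed.
    """
    demand = {}
    for leaf, bw in requests:
        for res in path_to_root(leaf):
            demand[res] = demand.get(res, 0) + bw
    over = [r.name for r, bw in demand.items() if r.allocated + bw > r.limit]
    if over:
        return False, over
    for res, bw in demand.items():
        res.allocated += bw
    return True, []


# Tree from the poster figure: PFS -> core switch -> two switches -> 4 BBs.
pfs = Resource("pfs", 1024)
core = Resource("core", 512, pfs)
switches = [Resource(f"switch{i}", 256, core) for i in range(2)]
bbs = [Resource(f"bb{i}", 192, switches[i // 2]) for i in range(4)]

job1_ok, _ = try_allocate([(bbs[0], 192)])
job2_ok, job2_over = try_allocate([(bb, 128) for bb in bbs[1:]])
```

Job 1 fits on its own. Job 2 is rejected because it would push one lowest-level switch to 320 MB/s against a 256 MB/s limit and the core switch to 576 MB/s against a 512 MB/s limit, so a scheduler using this check would delay it.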
Total System Performance / Individual Job Performance
[Figure: resource tree with Job 1 and Job 2 scheduled together]
Parallel File System (Limit: 1024 MB/s, Request: 576 MB/s)
  Core Network Switch (Limit: 512 MB/s, Request: 576 MB/s)
    Lowest Level Switch (Limit: 256 MB/s, Request: 320 MB/s)
      Compute Node, Burst Buffer (Limit: 192 MB/s, Request: 192 MB/s): Job 1_0, Request: 192 MB/s
      Compute Node, Burst Buffer (Limit: 192 MB/s, Request: 128 MB/s): Job 2_0, Request: 128 MB/s
    Lowest Level Switch (Limit: 256 MB/s, Request: 256 MB/s)
      Compute Node, Burst Buffer (Limit: 192 MB/s, Request: 128 MB/s): Job 2_1, Request: 128 MB/s
      Compute Node, Burst Buffer (Limit: 192 MB/s, Request: 128 MB/s): Job 2_2, Request: 128 MB/s
[Figure: resource tree with Job 1 alone]
Parallel File System (Limit: 1024 MB/s, Request: 192 MB/s)
  Core Network Switch (Limit: 512 MB/s, Request: 192 MB/s)
    Lowest Level Switch (Limit: 256 MB/s, Request: 192 MB/s)
      Compute Node, Burst Buffer (Limit: 192 MB/s, Request: 192 MB/s): Job 1_0, Request: 192 MB/s
      Compute Node, Burst Buffer (Limit: 192 MB/s, Request: 0 MB/s)
    Lowest Level Switch (Limit: 256 MB/s, Request: 0 MB/s)
      Compute Node, Burst Buffer (Limit: 192 MB/s, Request: 0 MB/s)
      Compute Node, Burst Buffer (Limit: 192 MB/s, Request: 0 MB/s)
[Figure labels: Peak I/O Bandwidth; I/O-ignorant, I/O-aware]
[Figure labels: Bursty I/O patterns; Streamed I/O patterns (App-side); Application 1, Application 2, Application 3]
LLNL-POST-690319