
Parallelization and Performance of the NIM Weather Model for CPU, GPU and MIC Processors

Mark Govett1, Jim Rosinski2, Jacques Middlecoff2, Tom Henderson2, Jin Lee1, Alexander MacDonald1, Paul Madden3, Julie Schramm2, Antonio Duarte3

Preprint Submitted to BAMS for Review, February 9, 2016

Abstract

The design and performance of the NIM global weather prediction model is described. NIM was designed to run on GPU and MIC processors. It demonstrates efficient parallel performance and scalability to tens of thousands of compute nodes, and has been an effective way to make comparisons between traditional CPU and emerging fine-grain processors. Design of the NIM also serves as a useful guide for fine-grain parallelization of the FV3 and MPAS models, two candidates being considered by the NWS as their next global weather prediction model to replace the operational GFS.

The F2C-ACC compiler, co-developed to support running the NIM on GPUs, has served as an effective vehicle to gain substantial improvements in commercial Fortran OpenACC compilers. Performance results comparing F2C-ACC with commercial GPU compilers demonstrate their increasing maturity and ability to efficiently parallelize and run next generation weather and climate prediction models.

This paper describes the code structure and parallelization of NIM using F2C-ACC, and standards-compliant OpenMP and OpenACC directives. NIM uses the directives to support a single, performance-portable code that runs on CPU, GPU and MIC systems. Performance results are compared for four generations of computer chips. Single and multi-node performance and scalability are also shown, along with a cost-benefit comparison based on vendor list prices.

1. Introduction

A new generation of High-Performance Computing (HPC) has emerged, referred to as Massively Parallel Fine Grain (MPFG). The term "Massively Parallel" refers to systems containing tens of thousands to millions of processing cores. "Fine Grain" refers to loop-level parallelism that must be exposed in the application to permit thousands to millions of arithmetic operations to be executed every clock cycle. Two general classes of MPFG chips are available: Many Integrated Core (MIC) from Intel and Graphics Processing Units (GPUs) from NVIDIA and AMD. In contrast to the up to 36 cores used on the latest generation Intel Haswell CPUs, these MPFG chips contain hundreds to thousands of processing cores. They provide 10-20 times greater peak performance than CPUs, and they appear in systems that increasingly dominate the list of top supercomputers in the world (Top500, 2015). Peak performance does not translate to real application performance, however. Good performance can only be achieved if fine-grain parallelism can be found and exploited in the codes. Fortunately, most weather and climate codes contain a high degree of parallelism, making them good candidates for MPFG computing.

As a result, research groups worldwide have begun parallelizing their weather and climate prediction models for MPFG processors. The Swiss National Supercomputing Center (CSCS) has done the most comprehensive work so far. They parallelized the dynamical core of the COSMO model for GPUs in

1 NOAA Earth System Research Laboratory, 325 Broadway, Boulder, Colorado 80305. 2 Cooperative Institute for Research in the Atmosphere, Colorado State University, Fort Collins, Colorado 80523. 3 Cooperative Institute for Research in Environmental Sciences, University of Colorado, Boulder, Colorado 80309.


2013 (Fuhrer et al. 2014). At that time, no viable commercial Fortran GPU compilers were available, so the code was re-written in C++ to enhance performance and portability. They reported the C++ version gave a 2.9X speedup over the original Fortran code using same-generation dual-socket Intel SandyBridge CPU and Kepler K20x GPU chips. Parallelization of model physics in 2014 preserved the original Fortran code by using industry-standard OpenACC compiler directives for parallelization (Lapillonne et al. 2014). The entire model, including data assimilation, is now running operationally on GPUs at MeteoSwiss.

Most groups have primarily focused on parallelization of model dynamics. The German Weather Service (DWD) and Max Planck Institute for Meteorology (MPI-M) developed the ICON dynamical core, which has been parallelized for GPUs. Early work converted the Fortran using NVIDIA-specific CUDA-Fortran and OpenCL, demonstrating a 2X speedup over dual-socket CPU nodes. The invasive, platform-specific code changes were unacceptable to domain scientists, so current efforts are focused on minimal changes to the original code using OpenACC for parallelization (Sawyer et al. 2014). Another dynamical core, the Non-hydrostatic ICosahedral Atmospheric Model (NICAM), has been parallelized for GPUs, with a reported 7-8X performance speedup comparing 2 NVIDIA K20x GPUs to single older-generation, dual-socket Intel Westmere CPUs (Yashiro et al. 2014). Other dynamical cores parallelized for the GPU, including the Finite-Volume cubed (FV3) model (Nguyen et al. 2013) used in GEOS-5 (Putnam 2011), and HOMME (Carpenter et al. 2013), have shown some speedup versus the CPU.

Collectively, these experiences show that porting codes to GPUs can be challenging, but most users have reported speedups over CPUs. Over time, more mature GPU compilers have simplified parallelization and improved application performance. However, reporting of results has not been uniform and can be misleading. Ideally, comparisons should be made using the same source code, with optimizations applied faithfully to the CPU and GPU, and run on same-generation processors. When codes are rewritten, it becomes harder to make fair comparisons as multiple versions must be maintained and optimized. When different-generation hardware is used (e.g., 2010 CPUs versus 2013 GPUs), adjustments should be made to normalize reported speedups. Similarly, when comparisons are made with multiple GPUs attached to a single node, further adjustments should be made. Finally, comparisons between a GPU and a single CPU core give impressive speedups of 50-100X, but such results are not useful or fair, and require adjustment to factor in use of all cores available on the CPU.

When Intel released its MIC processor, called Knights Corner, in 2013, a new influx of researchers began exploring fine-grain computing. Research teams from NCAR's Community Earth System Model (CESM) (Kim et al. 2013), the Weather Research and Forecast (WRF) Model (Michalakes et al. 2015), and FV3 parallelized select portions of these codes for the MIC and reported little to no performance gain compared to the CPU. A more comprehensive parallelization of NOAA's Flow-following Finite-volume Icosahedral Model (FIM) (Bleck, R. et al., 2015) included both dynamics coupled with GFS physics running on the MIC (Rosinski 2015). Execution of the entire model on the MIC gave no performance benefit compared to the CPU.
A common sentiment in these efforts is that porting applications to run on the MIC is easy, but getting good performance can be difficult. With few reports of better MIC performance than the CPU, most believe their efforts will pay off when the next-generation Intel Knights Landing chip arrives in 2016. Further, optimizations targeting the MIC have improved CPU performance, giving immediate benefit for the work.

This paper describes development of the Non-hydrostatic Icosahedral Model (NIM), a model that was designed to exploit MPFG processors. The NIM was initially designed for NVIDIA GPUs in 2009. Since commercial Fortran GPU compilers were not available at that time, the Fortran-to-CUDA Accelerator (F2C-ACC) (Govett et al. 2010) was co-developed with NIM to convert Fortran code into CUDA, a high-level programming language used on NVIDIA GPUs (CUDA, 2015). The F2C-ACC compiler has been the primary compiler used for execution of NIM on NVIDIA GPUs, and has served as a benchmark for evaluation of commercial OpenACC compilers from Cray and PGI. Using the same source code, the NIM was ported to Intel MIC when these processors became available.


The NIM is currently the only weather model that runs on CPU, GPU and MIC processors using a single source code. The dynamics portion of NIM uses OpenMP (CPU & MIC), OpenACC (GPU) and F2C-ACC (GPU) directives for parallelization. For distributed-memory parallelization, Scalable Modeling System (SMS) directives are used to handle domain decomposition, inter-process communications, and I/O operations (Govett et al. 1996). Collectively, these directives allow a single source code to be maintained that is capable of running on CPU, GPU and MIC processors for serial or parallel execution. Further, the NIM has demonstrated efficient parallel performance and scalability to tens of thousands of compute nodes, and has been useful for comparisons between CPU, GPU and MIC processors.

The rest of this paper describes the NIM dynamical core, with some references to work done on model physics. Section 2 describes the computational design of the NIM. Sections 3 and 4 contain details on the parallelization of NIM for Intel MIC and NVIDIA GPU. Section 4 also compares the performance of the F2C-ACC and the PGI GPU compilers. Section 5 highlights performance and scalability in terms of device, node, and multiple nodes. Using these results, a cost-benefit analysis is given based on vendor list prices. The paper concludes with an analysis of the work and final remarks in Sections 6 and 7.

To appear as a sidebar:

Many-Core and GPU Computing Explained

Many-core and Graphics Processing Units (GPUs) represent a new class of computing called Massively Parallel Fine-Grain (MPFG). In contrast to CPU chips that have up to 36 cores, these fine-grain processors contain hundreds to thousands of computational cores. Each individual core is slower than a traditional CPU core, but there are many more of them available to execute instructions simultaneously. This has required model calculations to become increasingly fine-grained.

GPU: GPUs are designed for compute-intensive, highly parallel execution. GPUs contain up to five thousand compute cores that execute instructions simultaneously. As a co-processor to the CPU, work is given to the GPU in routines or regions of code called kernels. Loop-level calculations are typically executed in parallel in kernels. The OpenACC programming model designates three levels of parallelism for loop calculations: gang, worker and vector, which are mapped to execution threads and blocks on the GPU. Gang parallelism is for coarse-grain calculations. Worker-level parallelism is fine-grain, where each gang will contain one or more workers. Vector parallelism is for Single Instruction Multiple Data (SIMD) or vector parallelism that is executed on the hardware simultaneously.

MIC: Many Integrated Core (MIC) hardware from Intel also provides the opportunity to exploit more parallelism than traditional CPU architectures. Like GPUs, the clock rate of the chips is 2-5 times slower than current-generation CPUs, with higher peak performance provided by additional processing cores, wider vector processing units, and a fused multiply-add (FMA) instruction. The programming model used to express parallelism on MIC hardware is traditional OpenMP threading along with vectorization. User code can be written to offload computationally intensive calculations from the CPU to the MIC (similar to GPU), run in MIC-only mode, or be shared between MIC and CPU host.
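As a rough illustration of these three OpenACC loop levels (a sketch, not code from NIM; the loop bounds and array names are hypothetical), a loop nest might be annotated as follows:

    !$acc parallel
    !$acc loop gang                ! coarse grain: gangs of grid cells
    do ipn = 1, npts
    !$acc loop worker              ! fine grain: workers within each gang
      do ie = 1, nedges
    !$acc loop vector              ! SIMD: lanes executed simultaneously in hardware
        do k = 1, nz
          flux(k,ie,ipn) = 0.0
        end do
      end do
    end do
    !$acc end parallel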


2. NIM Design

2.1 Model Description

NIM is a multi-scale model which has been designed, developed, tested, and run globally at 3-km resolution to improve medium-range weather forecasts. The model was designed to explicitly permit convective cloud systems without the cumulus parameterizations typically used in models run at coarser scales. In addition, NIM has extended the conventional two-dimensional finite-volume approach into three-dimensional finite-volume solvers designed to improve pressure gradient calculation and orographic precipitation over complex terrain. NIM uses the following innovations in the model formulation:

• A local coordinate system that remaps a spherical surface to a plane (Lee et al. 2009)
• Horizontal grid-point calculations in a loop that allows any point sequence (MacDonald et al. 2011)
• Flux-Corrected Transport formulated on finite-volume operators to maintain conservative and monotonic transport (Lee et al. 2010)
• All differentials evaluated as finite-volume integrals around the cells
• Icosahedral-hexagonal grid optimization (Wang et al. 2011)

The icosahedral-hexagonal grid is a key part of the model. This formulation approximates a sphere with a varying number of hexagons, but always includes twelve pentagons (Sadourney, R., et al., 1968; Williamson, D., 1971). The key advantage of this formulation is the nearly uniform grid areas that are possible over a sphere, as illustrated in Figure 1. This is in contrast to the latitude-longitude models that have dominated global weather and climate prediction for 30 years. The nearly uniform grid represents the poles without the notorious "pole problem" inherent in latitude-longitude grids where meridians converge toward the poles.

Figure 1: Icosahedral-hexagonal grid at approximately 450 km horizontal resolution.

NIM uses a fully three-dimensional finite-volume discretization scheme designed to improve pressure gradient calculations over complex terrain. Three-dimensional finite-volume operators also provide accurate and efficient tracer transport essential for next-generation global atmospheric models. Prognostic variables are co-located at horizontal cell centers (Arakawa, A., and V. Lamb, 1977). This simplifies looping constructs and data dependencies in the code.

The numerical scheme utilizes a local coordinate system remapped from the spherical surface to a plane at each grid cell. All differentials are evaluated as finite-volume integrals around each grid cell. Flux-corrected transport is formulated on finite-volume operators to maintain conservative, positive-definite transport. Passive tracers are strictly conserved to the round-off limit of single-precision floating-point operations. NIM governing equations are cast in conservative flux forms with mass flux to transport both momentum and tracer variables.


2.2 Computational Design

NIM is a Fortran code containing a mix of Fortran 77 and Fortran 90 language constructs. It does not use derived types, pointers or other constructs that can be challenging for compilers to support or run efficiently4. The SMS library utilized by NIM for coarse-grain parallelism employs the MPI (Message Passing Interface) library to handle domain decomposition, inter-process communications, reductions, and other MPI operations.

NIM was designed from the outset to maximize the fine-grain or loop-level parallel computational capability of both NVIDIA GPU and Intel MIC architectures. Primary model computations are organized as simple dot products or vector operations, and loops with no data-dependent conditionals or branching. The NIM dynamical core requires only single-precision floating-point computations and runs well on a single CPU core, achieving 11% of peak performance on a single Intel SandyBridge CPU core.

Grid cells can be stored in any order. A lookup table is used to access neighboring grid cells and edges on the icosahedral-hexagonal grid. The model's loop and array structures are organized with the vertical dimension innermost in dynamics routines. Testing during model development verified that the indirect scheme yielded less than a 1 percent performance penalty because the cost of the extra reference can be amortized over the innermost, directly addressed, vertical dimension (MacDonald et al. 2010). Further, unit-stride accesses in the vertical dimension support CPU and MIC inner-loop vectorization, and efficient memory accesses on the GPU.

For the GPU, NIM dynamics has been parallelized with F2C-ACC and OpenACC directives and executes completely on the GPU. Model state remains resident in GPU global memory. In the model dynamics, data are only copied between CPU and GPU (across the PCIe bus) for disk I/O and between GPUs for inter-process communications. Physical parameterizations have not yet been ported to the GPU, so data must be moved between the GPU and CPU every physics time step. GPU-to-GPU communications are handled via SMS directives and initiated by the CPU.

Fine-grain parallelization for MIC (and also multi-core CPU nodes) is implemented via OpenMP directives. MIC parallelization of NIM was easy since the code had already been modified to run efficiently on the CPU and GPU. Symmetric execution, where both the CPU host and attached MIC co-processor are used, is also permitted and will be described in Section 3.

Example 1 shows representative code from the NIM dynamics with OpenACC, F2C-ACC and OpenMP directives prefaced with !$acc, !ACC$, and !$OMP respectively. F2C-ACC does not adhere to the OpenACC standard, though the directives and their placement in source code are quite similar. In this example, the OpenMP threaded region is identified by !$OMP PARALLEL DO, which corresponds to !ACC$REGION and !$acc parallel that are used to define GPU kernels. F2C-ACC directives will be removed from NIM once ongoing comparisons with OpenACC compilers are complete.

4 The OpenACC specification only recently added support for derived types; pointer abstractions may limit the ability of compilers to fully analyze and optimize calculations.
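A minimal sketch of the loop and array organization described in the computational design above, using hypothetical names rather than the actual NIM variables: the horizontal cell index is outermost and indirectly addressed through a neighbor lookup table, while the vertical index is innermost and unit-stride.

    ! prox(isn,ipn) holds the horizontal index of the isn-th neighbor of cell
    ! ipn; nprox(ipn) is the number of cell edges (5 or 6); nz is the number
    ! of vertical levels.  Names are illustrative only.
    do ipn = ips, ipe                     ! horizontal cells owned by this task
      do isn = 1, nprox(ipn)              ! loop over cell edges / neighbors
        ipp = prox(isn, ipn)              ! indirect lookup of the neighbor cell
        do k = 1, nz                      ! innermost, unit-stride vertical loop
          xsum(k,ipn) = xsum(k,ipn) + x(k,ipp)
        end do
      end do
    end do

Because the indirect reference is made once per neighbor and amortized over the 96 directly addressed vertical levels, its cost is small, as noted above.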


Example 1: Representative Fortran code from NIM, parallelized with OpenMP, F2C-ACC and OpenACC directives. For clarity in this example, numeric arguments for the number of threads (96) and blocks (10242) in the directives are used that equate to the loop bounds nz and ipe-ips+1 respectively. In the actual NIM code, variable arguments are used.

    !$OMP PARALLEL DO PRIVATE (k)
    !ACC$REGION(<96>,<10242>) BEGIN
    !$acc parallel num_gangs(10242) vector_length(96)
    !ACC$DO PARALLEL(1)
    !$acc loop gang
          do ipn=ips,ipe                  ! Loop over horizontal
    !ACC$DO VECTOR(1,1:nz-1)
    !$acc loop vector
            do k=1,nz-1                   ! Loop over vertical levels
              bedgvar(k,ipn,1) = . . .
              bedgvar(k,ipn,2) = . . .
            end do
            < more loops and calculations >
          end do
    !$acc end parallel
    !ACC$REGION END
    !$OMP END PARALLEL DO

3. CPU and MIC Parallelization

Parallelization for the CPU and MIC involved three steps: (1) insertion of OpenMP directives to identify thread-level parallelism in the code and fusing multiple threaded loops where possible, (2) vectorization optimizations using directives and compiler flags, and (3) other code optimizations and runtime configurations to improve performance.

3.1 Vectorization

Vectorization is an optimization where independent calculations executed serially within a loop can be executed simultaneously in hardware by specially designated vector registers available to each processing core. The number of operations that can be executed simultaneously is based on the length of the vector registers. On the CPU, vector registers are currently 256 bits in length; the KNC MIC co-processor contains 512-bit vector registers. Based on these vector lengths, eight single-precision operations can be executed simultaneously on the CPU, and sixteen such simultaneous operations on the MIC. As a result, vectorizing loops provided some benefit on the host, but in most cases it provided a greater improvement on the MIC. Intel compilers automatically attempt vectorization by default, with compiler flags available for further optimization on specific hardware.

3.2 Optimizations

OpenMP parallelization of NIM dynamics for both host and MIC is done exclusively at the horizontal loop level. Horizontal looping in almost all cases is outside of vertical looping and, if applicable, loops over cell edges. Indirect addressing of horizontal indices (described earlier) allows for a single loop over horizontal grid points owned by each MPI task. This single loop structure enables a great deal of thread parallelism, without relying on restrictive OpenMP verbs such as "collapse" (fuses multiple OMP loops into one), which are required in some codes to expose sufficient parallelism. In NIM there are no MPI communications inside of threaded loops. This enables use of a "thread funneled" MPI library (most MPI libraries are built this way), where the assumption that no MPI communication happens


inside of threaded loops enables the MPI library to avoid numerous performance-sapping locks and tests on threading. Most OpenMP loops in NIM contain sufficient work to easily amortize the cost of assigning work to individual threads on loop startup, and thread synchronization at the end of the loop. These costs are significantly higher on the MIC than on the SNB host. Part of the reason is that the MIC has more threads which must be synchronized, with as many as 244 per MIC card vs. 16 on the SNB host. Also, the clock cycle on the MIC is more than twice as slow as that of the SNB host. As a result, fusing some threaded loops gave a bigger performance boost on the MIC than the host. Some boost was visible on both architectures because in each case the number of thread synchronization points was reduced.

The OpenMP standard includes a verb ("guided") which tells the OpenMP runtime library to parcel out iterations of the subsequent threaded loop in a more sophisticated way than the default mechanism ("static"), where each thread is handed an equal number of loop iterations. When the "guided" verb is in force, the OpenMP runtime library initially parcels out relatively large numbers of loop iterations to threads. Subsequent iterations are parceled out in "hungry puppy" fashion to threads which have finished their initial work, with the size of each iteration set diminishing as work done by the threads progresses. This procedure has a positive impact on load imbalance. Though most horizontal loops in the NIM dynamics contain equal work per loop iteration, it was found that the "guided" verb provided measurable benefit in some cases on the MIC. Possible causes for this imbalance, even given equal workload, include memory contention, cache utilization, and OS intervention.

Though the discussion in this paper is focused almost exclusively on the NIM dynamics, work on the physics package utilized in NIM has shown a great advantage in deploying the "guided" verb on physics horizontal loops exercised on both CPU and MIC. In NIM the physics loops are both vectorized and threaded over horizontal iterations. Each thread is provided at least enough horizontal points at a time to enable vectorization of all the points it is given. This "chunking" approach in a single dimension is necessary because most of the physical parameterizations contain dependencies in the vertical dimension that preclude most opportunities to vectorize and/or thread. The parameterizations contain much inherent load imbalance, such as solar radiation calculations only being done in sunlit regions, and cloud calculations only being necessary where clouds are present. Thus the "guided" verb is ideal for many of the physical parameterizations. NIM contains a single horizontal loop over all physical parameterizations, so "guided" provides enormous benefit on both the SNB and the MIC.
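A minimal sketch of the "guided" scheduling and chunking approach described above; the chunk bookkeeping and array names are illustrative, not the NIM source.

    ! Physics-style arrays dimensioned (npts,nz).  Each thread receives whole
    ! chunks of columns so the inner horizontal loop can vectorize, while
    ! schedule(guided) hands out large iteration sets first and smaller ones
    ! to threads that finish early, mitigating load imbalance.
    !$OMP PARALLEL DO SCHEDULE(guided) PRIVATE(i1,i2,i,k)
    do ichunk = 1, nchunks
      i1 = (ichunk-1)*chunksize + 1
      i2 = min(ichunk*chunksize, npts)
      do k = 1, nz                        ! vertical dependencies stay serial
        do i = i1, i2                     ! unit-stride, vectorizable loop
          t(i,k) = t(i,k) + dtdt(i,k)*dt
        end do
      end do
    end do
    !$OMP END PARALLEL DO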

3.3 Symmetric Execution

Parallelization of NIM for the MIC was done using symmetric mode, where both the host and attached MIC co-processor share the workload. Communication between tasks is accomplished using the same MPI primitives as homogeneous (CPU-only) mode. The only difference is that separate compilations are required for the different (CPU and MIC) architectures. Complications imposed by various communication patterns in a multi-node run (e.g. host-host, host-mic, mic-mic) are blissfully hidden from the application developer by the MPI implementation.

In a heterogeneous computational environment (in this case host and MIC), inevitably there will be load imbalances resulting from the different computational capabilities of the separate architectures. Noting that the SMS library attempts to equally distribute the number of horizontal points assigned to each MPI task, an attempt to mitigate this imbalance was made in NIM by varying the number of MPI tasks placed on each device. The "stampede" system at TACC (Texas Advanced Computing Center) was used for most of the MIC porting activity with NIM. This system employs SandyBridge (SNB) host processors and KNC MIC co-processors. On this system it turned out that the time to run a single NIM timestep on a host node roughly equaled the time to run that same timestep on a MIC card. Thus, placing an equal number of MPI tasks on the host(s) and attached co-processor(s) worked well.


Obtaining roughly equivalent performance on a single SNB node and the attached MIC card required some software engineering work. After the initial port, the SNB node was substantially faster than the MIC. Subsequent code modifications to address vectorization and optimize OpenMP performance brought the two architectures into balance. Obtaining balance was more difficult than originally anticipated, mainly because code modifications designed to improve performance on the MIC also improved performance on the host.

4. GPU Parallelization

4.1 OpenACC Compiler Evaluation

Either OpenACC or F2C-ACC directives can be used for GPU parallelization of NIM. F2C-ACC has been an effective way to push for improvements in commercial Fortran GPU compilers. Prior evaluation of OpenACC compilers and their predecessors was done in 2011 (CAPS, PGI) (Henderson et al. 2011), 2013 (PGI, Cray) (Govett, 2013) and in 2014 (PGI, Cray) (Govett et al. 2014). These evaluations exposed bugs in the compilers, and areas where additional development was needed to support weather prediction models used at NOAA. A comprehensive performance evaluation in 2014 showed the Cray and PGI compilers ran the dynamics portion of NIM 1.7 and 2.1 times slower than F2C-ACC. This led to increased collaboration between NOAA and vendors to improve the performance of their respective compilers.

Table 1 shows the runtimes for NIM dynamics routines, with near parity observed in performance with the F2C-ACC and PGI compilers. PGI improvements since the 2014 evaluation can be attributed to: (1) better support for user-defined fast memory, (2) direct mapping of do-loops when loop bounds are known and do not exceed GPU hardware limits, and (3) performance optimizations that were introduced or improved. While most routines run faster with the PGI compiler than F2C-ACC, two routines (vdmints and vdmintv) are slower. The cause is under investigation, but higher register usage by the PGI compiler may explain the performance degradation5.

Table 1: Performance of the NIM dynamics with the F2C-ACC, PGI and Cray compilers. Runtimes in seconds for main routines are shown. Speedup represents runtimes compared to F2C-ACC. Vdmints runtime represents the sum of three variants in the model: vdmints, vdmints0, and vdmints3.

    Routine         F2C-ACC v5.8   PGI version 15.10   PGI Speedup
    vdmints         4.58           5.40                0.84
    vdmintv         1.63           2.29                0.71
    diag            1.24           0.65                1.90
    flux            0.98           1.03                0.95
    force           0.61           0.52                1.17
    trisol          0.36           0.28                1.28
    timediff        0.18           0.15                1.20
    Total Runtime   11.77          12.34               0.95

5 GPUs have a limited number of registers in hardware. Excessive register use can lead to lower occupancy or concurrency that limits application performance.

While performance with both compilers is similar, code parallelization is simpler with the OpenACC compilers, which speeds the time required to port applications to the GPU. The rest of this section describes OpenACC parallelization of NIM, with some references to F2C-ACC capabilities that led to improvement in the Cray and PGI compilers.

4.2 OpenACC Parallelization

Parallelization for the GPU can be expressed in three phases: defining GPU kernels and identifying loop-level parallelism, minimizing data movement, and optimizing performance. Parallelization of NIM


for these phases will be described in the next three sections. For brevity, the sentinels (e.g. !$acc) are omitted.

4.2.1 Defining GPU Kernels and Parallelism

OpenACC's parallel directive is used to define code sections, called kernels, that will be executed on the GPU6. The number of threads and blocks used (96 and 10242 respectively in Example 1) are required arguments for F2C-ACC that map directly to CUDA thread and grid blocks in the generated code. While no arguments are required with parallel, it was best to specify resources explicitly using num_gangs and vector_length (see Example 1) to get optimal performance. A vector length of 96 was used that matched the number of vertical levels in the model. OpenACC's loop directive is used to identify parallel loops along with vector and gang keywords. Using similar directives, F2C-ACC demonstrated the value of directly mapping loops to thread and grid block execution on the GPU. This led to improvements to the PGI compiler to provide similar capabilities.

4.2.2 Minimizing Data Movement

By default, OpenACC compilers assume data are resident on the CPU and automatically generate data copies between the CPU and GPU necessary to execute each kernel. However, to run efficiently, data movement between the CPU and GPU must be minimized by keeping data resident on the GPU. With OpenACC, data regions are defined using the data directive, with data movement specified optionally by listing variables to be managed using the pcopy, pcopyin, and pcopyout keywords. The letter "p" preceding the copy is shorthand for "present_or" and indicates that the runtime system checks to determine when a data transfer is needed. Asking the runtime system to always perform the check incurs little overhead. To keep data resident on the GPU, variables are listed in the data directive with enter and exit clauses indicating data creation and removal.

To further simplify data management with OpenACC compilers, the CUDA runtime library supports unified memory, a means to programmatically treat CPU and GPU memory as one large memory. With unified memory, data are automatically copied between GPU and CPU on demand by the runtime system. While less efficient than when users prescribe data transfers themselves, it reduces the time and effort required to get applications running on the GPU. Using unified memory, data movement between CPU and GPU becomes a performance optimization rather than a debugging challenge.

To date, unified memory is managed in software by the NVIDIA runtime system. With the next-generation chip (called "Pascal"), data accesses will be managed in hardware. The hardware will include a new high-speed interconnect called NVLINK that will allow the GPU to access CPU memory at CPU memory speeds. This will allow CPU and GPU memory to be treated as one common fast memory, simplifying GPU programming and improving application performance.
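The data-management pattern described above might look roughly like the following sketch; the variable names and placement are illustrative rather than taken from NIM.

    ! Create device-resident copies once, near model start-up.
    !$acc enter data copyin(u, v, w, trp) create(tend)

    !  ... time stepping: kernels operate on the device-resident arrays,
    !      with "present" semantics avoiding redundant transfers ...

    ! Copy data back to the host only when needed, e.g. for disk I/O.
    !$acc update host(u, v, w)

    ! Free the device copies at the end of the run.
    !$acc exit data delete(u, v, w, trp, tend)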

4.2.3 Optimizing Performance

Fine-tuning performance primarily means optimizing the use of the multi-tiered memory available on the GPU. To reduce memory use, F2C-ACC supports variable demotion to remove a dimension from a locally defined variable. Demotion is particularly beneficial when a single-dimensional array can be replaced with a scalar. This transformation, called scalar replacement, improves performance by replacing slower global memory accesses with faster register accesses.

OpenACC memory access is intentionally more restrictive, to avoid reference to specific details of the underlying hardware. This improves portability and leaves the compilers free to implement efficient use of memory where possible through compiler analysis and optimization. Implicit use of faster local and fast shared memory is possible in OpenACC using the private clause attached to gang or vector loops. In addition, improved support for cache memory by the Cray and PGI compilers permits explicit

6 OpenACC's kernels directive was also tried. While simpler to use, it provided less user control over parallelization, and slower performance than parallel.


user-managed shared memory to improve application performance on the GPU. Variable demotion or scalar replacement is not exposed in the OpenACC standard, but is handled automatically when analysis determines such optimizations can be done.
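As an illustration of demotion and scalar replacement, consider a hypothetical local work array used only inside the vertical loop; this is a sketch, not NIM code.

    ! Before: the temporary tmp(1:nz) lives in slow GPU global memory.
    do k = 1, nz
      tmp(k)   = a(k,ipn) * b(k,ipn)
      c(k,ipn) = c(k,ipn) + tmp(k)
    end do

    ! After demotion: the k dimension is removed and tmp becomes a scalar
    ! that the compiler can hold in a register.
    do k = 1, nz
      tmps     = a(k,ipn) * b(k,ipn)
      c(k,ipn) = c(k,ipn) + tmps
    end do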

5. Performance and Scaling

The NIM code has been highly optimized for serial and parallel execution. It has demonstrated good scaling on both CPUs and GPUs on Titan7, where it has run on more than 250,000 CPU cores and more than 15,000 GPUs. It has also been run on up to 320 Intel MIC (Xeon Phi) processors at the Texas Advanced Computing Center (TACC)8. Optimizations targeting Xeon Phi and GPU have also improved CPU performance.

Since NIM has been optimized for the CPU, GPU and MIC, it is a useful way to make comparisons between chips. Every attempt is made to make fair comparisons between same-generation hardware, using identical source code optimized for all architectures. Given the increasing diversity of hardware solutions, results are shown in terms of device, node and multi-node performance.

5.1 Device Performance

Single device performance is the simplest and most direct comparison of chip technologies. Figure 2 shows performance running the entire NIM dynamical core on four generations of CPU, GPU and MIC hardware. CPU results are based on standard two-socket node configurations. A roughly 2X performance benefit favoring accelerators is observed for 2010 through 2014 generation GPU chips. In addition, the MIC Knights Corner processor demonstrates a 1.3X speedup over the IvyBridge CPU.

Figure 2: Runtimes for the NIM running at 240 km resolution (10,242 horizontal points, 96 vertical levels) for 100 timesteps with the following chips9:

    Year      CPU (2 sockets)   Cores   GPU           Cores   MIC              Cores
    2010/11   Westmere          12      Fermi         448
    2012      SandyBridge       16      Kepler K20x   2688
    2013      IvyBridge         20      Kepler K40    2880    Knights Corner   61
    2014      Haswell           24      Kepler K80    4992

7 Titan is a CPU-GPU system containing over 17,000 NVIDIA GPUs attached to AMD CPUs, managed by the U.S. Department of Energy's Oak Ridge National Laboratory (ORNL).
8 TACC is a National Science Foundation (NSF) supported CPU-MIC cluster at the University of Texas, Austin.
9 Nodes configured with 2 sockets/node were: Westmere: X5650, SandyBridge: E5-2670, IvyBridge: E5-2690 v2, Haswell: E5-2690 v3.

[Figure 2 bar chart: runtime (sec) by chip generation. CPU: 49.8 (2010/11), 26.8 (2012), 20 (2013), 14.3 (2014); GPU: 23.6, 15.1, 13.9, 7.8; MIC: 16 (2013).]


Improvements to the CPU chips, including increased memory bandwidth, advanced vector instructions, and steadily increasing core counts, have been sufficient to generally keep pace with accelerators. Of particular importance in the CPU results is the nearly 1.9X decrease in runtime for the SandyBridge over the Westmere processor (26.8 and 49.8 seconds respectively), largely due to a 2X increase in memory speed. This is because weather and climate dynamical cores are usually memory bound, rather than compute bound.

Hyperthreading is a technique where the number of execution threads is greater than the number of available processing cores. Over-scheduling can be an effective optimization specified trivially at run-time for CPU, GPU and MIC. Scaling results are shown for a single MIC processor in Figure 3. The MIC processor contains 61 physical cores, with increasing performance demonstrated out to the maximum of 244 threads permitted on the device. Hyperthreading can be beneficial when a thread running on a core is stalled waiting on a memory read or write. In this case the operating system can swap out the running thread and swap in a waiting thread that is not otherwise waiting on hardware resources. Near-linear speedup to the 61 physical cores is observed, with 4-way hyperthreading providing nearly an additional 2X performance advantage beyond that.

Figure 3: NIM thread scaling on a single MIC processor containing 61 cores. The green line is the linear speedup reference; the red curve is observed speedup. NIM dynamics achieves nearly linear scaling up to 61 threads, with a 2X further speedup when hyperthreading with up to 244 virtual threads.

Since accelerators currently must be attached to a CPU host, price-performance benefit is reduced when the cost of all hardware is included. This practical and economic consideration motivates comparisons between nodes that include both the host and attached accelerators.


5.2 Single Node Performance

The basic components of a standard Intel CPU node include two CPU sockets, memory, a network interconnect (NIC), a Peripheral Component Interconnect express (PCIe) bus and a motherboard. Deviations from this basic configuration are available but more expensive since the volumes manufactured are lower. Therefore, most computing centers use standard, high-volume parts that offer the best price-performance. Accelerators are attached to these nodes and communicate with the CPU host via the PCIe bus. Figure 4 shows a simple illustration of two CPU nodes, each containing two attached accelerators and connected via a high-speed network interconnect.

Figure 4: Standard Intel Sandy Bridge node configuration containing two CPU sockets per node with attached accelerators.

Figure 5 shows performance of the NIM dynamical core for several traditional CPU and accelerator configurations. Standard two-socket Intel IvyBridge nodes were used with attached MIC and GPU accelerators. The four left-most results show runtimes when only the CPU, GPU or MIC are used and appear individually in blue, green and orange respectively. The shaded results highlight performance of the NIM for symmetric execution, where both the CPU and attached accelerator are used. Two runtimes appear for each symmetric configuration, one in blue and a second in green or orange, with a numeric value representing the node runtime at the top of each column.

Figure 5: Full node performance for CPU, GPU and MIC using standard two-socket Intel IvyBridge CPUs and one or two attached accelerators. Symmetric mode execution gives performance when both the CPU and GPU or MIC are used. Runtimes are for the NIM running at 120 km resolution (40,968 columns, 96 vertical levels) for 100 timesteps with the following hardware: IB20 – IvyBridge, 20 cores, 3.0 GHz (E5 2690-v2); GPU – Kepler K40, 2880 cores, 0.745 GHz; IB24 – IvyBridge, 24 cores, 2.6 GHz (E5 2697-v2); MIC – KNC 7120, 61 cores, 1.238 GHz.

[Figure 5 bar chart: node runtimes (sec) for each configuration – IB20 only: 81, IB24 only: 74, MIC only: 73, GPU only: 58, IB24+MIC: 42, IB20+GPU: 46, IB20+2GPU: 33. Symmetric-mode columns show both the CPU runtime and the MIC or GPU (F2C-ACC) runtime.]

For symmetric runs, the model sub-domain is divided evenly between the host and device, leading to differences in execution time as shown. When the accelerator is faster than the host, it must wait for


the host to complete. This explains the slower overall node time of 46 seconds for the CPU-GPU configuration (IB20+GPU), where the GPU runtime is significantly faster than the host (33 versus 46). Performance of the 24-core IvyBridge was nearly equivalent to the MIC (KNC), leading to a balanced symmetric result.

These results show significant performance benefit using symmetric execution, especially when the number of accelerators is small. Comparisons using two accelerators were tested that gave continued improvement in runtimes for the GPU, but worse performance for the MIC (not shown) due to hardware resource limitations. Given Intel's plans to develop a host-less MIC chip (Knights Landing), there was little reason to pursue resolution of this issue. On the GPU, symmetric execution yields diminishing benefit as the number of accelerators attached to a single node increases. Since eight or more GPUs can be attached to a single node, performance is further examined in terms of strong and weak scaling.

5.2.1 Strong Scaling

Strong scaling is measured by applying increasing numbers of compute resources to a fixed problem size. This metric is particularly important for operational weather prediction, where forecasts should run in one percent of real-time. The requirement is normally achieved by increasing the number of processors until the given time threshold is met. For example, a one-day forecast that runs in 15 minutes represents one percent of real-time; a run in two percent of real-time would take 30 minutes.

Figure 6 shows strong scaling for NIM running at 120 km resolution on a single node containing 1 to 8 GPUs. To reduce inter-GPU communications, NVIDIA developed GPUdirect, a means of bypassing the CPU host in MPI data transfers. Use of GPUdirect improved multi-GPU runtimes for the NIM by up to 38 percent on a single node (Middlecoff, 2015). In the figure, "Cols/GPU" is the number of vertical columns assigned to a single GPU. The term column refers to a single icosahedral grid cell with 96 vertical levels. As the number of GPUs increases from 1 to 8, efficiency decreases because there are fewer columns and thus fewer calculations per GPU. These strong scaling results show the best efficiency (>90%) is achieved when at least 10,000 columns of work are given to each GPU.

Figure 6: Strong scaling performance, where "Cols/GPU" indicates the amount of work (columns per GPU) each device must do. Numeric values represent speedup over a single GPU. NVIDIA K80s packaged with 2 GPUs per board were used for these results, so the 2-GPU result ran on one K80.

5.2.2 Weak Scaling

Weak scaling is a measure of how solution time varies with increasing numbers of processors when the problem size per processor is fixed. It is considered a good way to determine how a model scales

[Figure 6 chart: Single Node Performance, 40,962 columns, 100 timesteps. Runtime and communications time (seconds) versus the number of GPUs (1, 2, 4, 6, 8), with Cols/GPU of 40962, 20481, 10242, 6827 and 5120 respectively; numeric labels 0.95, 0.90, 0.77, 0.71.]


to high numbers of processors and is particularly useful for measuring communications overhead. Table 2 gives performance results for a single node with 20,482 columns per GPU for 120 km and 60 km resolution runs using 2 and 8 GPUs. NVIDIA K80s packaged with two GPUs were used for the runs. Computation time is nearly identical for all runs, with communications time increasing to 3.19 seconds for the one-node, 8-GPU run. An additional run using multiple nodes illustrates the substantial increase in off-node communications time. Given that communications time within a node (3.19 seconds) is less than the off-node time (7.23 seconds), the results show that more GPUs could be added to each node without adversely affecting model runtimes. This is because all processes must wait for the slowest communication to complete before model execution can continue.

Table 2: Weak scaling performance with GPUdirect for GPU-to-GPU data transfers for a single node (first two rows) and multiple nodes (third row). Off-node communications have much higher overhead compared to within-node data transfers.

    GPUs per Node   Number of Nodes   Model Resolution   Columns/GPU   Computation Time   Communications Time   Total Runtime
    2               1                 120 km             20,482        25.13              0.56                  25.71
    8               1                 60 km              20,482        25.16              3.19                  28.35
    2               4                 60 km              20,482        25.22              7.23                  33.45

5.2.3 Architectural Considerations

A standard Intel two-socket node is normally configured to support up to two accelerators that can communicate with the host at full speed. When more than two devices are attached to the host, they must share the PCIe bandwidth. More specialized solutions are available that reduce this constraint by adding additional PCIe hardware. Figure 7 illustrates the architecture for a Cray Storm node with 8 attached accelerators (GPUs are shown, but MIC processors can also be used). In this configuration, two accelerators are attached to each PCIe communications hub, with two hubs per CPU socket. Communications between sockets are handled with Intel's QuickPath Interconnect (QPI).

Figure 7: Illustration of the Cray Storm node architecture containing eight accelerators per node. NVIDIA GPUs are shown, but other PCIe-compatible devices can be used. Reference from Cray.

One potential performance bottleneck for this node architecture may be the limited bandwidth of the single InfiniBand (IB) connection shared by the eight attached accelerators. Alternative node configurations are available, including ones with multiple IB connections, nested PCIe architectures, and solutions that avoid use of QPI due to reported latency issues (Ellis 2015; Cirrascale 2015).


These more specialized solutions offer potential benefit over standard high-volume parts, where application testing is the best way to determine overall cost-benefit.

5.3 Multi-node Performance and Scaling

Good multi-node performance and scaling on hundreds to thousands of nodes requires efficient inter-process communications. For most models, communications normally include gathering and packing data to be sent to neighboring processes, MPI communications of the data, and then unpacking and distributing the data. Analysis of NIM dynamics performance showed that message packing and unpacking accounted for 50% of inter-GPU communications time. Since NIM relies on a lookup table to reference horizontal grid points, data can be reorganized in any order to eliminate packing and unpacking. This optimization is called "spiral grid order".

Figure 8 illustrates the spiral grid ordering used in NIM. Optimal spiral grid ordering is based on the number of MPI ranks that will be used, and is determined during model initialization. As shown in the figure, points are organized according to where they must be sent (as interior points) or received (as halo points). Each point in the figure represents an icosahedral grid column that contains 96 vertical levels. In the figure, the Spiral Grid Ordering section illustrates the method used to order points within each MPI task. The Data Layout section of the figure illustrates how grid points are organized in memory for optimal communications and computation. Use of the spiral grid order gave performance benefit on all architectures, with a 20 percent improvement in model runtimes on the GPU, 16 percent on the MIC-CPU, and 5 percent on the CPU.
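A minimal sketch of why the spiral ordering removes packing and unpacking; the bookkeeping arrays (send_start, send_count, and so on) are hypothetical, but the key point is that the columns exchanged with a given neighbor occupy one contiguous index range, so MPI can read from and write into the model arrays directly.

    ! Assumes 'use mpi' and declarations of the request and count arrays.
    ! Arrays are dimensioned (nz,npts), so a contiguous range of columns is
    ! contiguous in memory and can be sent or received without a pack buffer.
    do n = 1, num_neighbors
      call MPI_Irecv(var(1,recv_start(n)), nz*recv_count(n), MPI_REAL, &
                     nbr_rank(n), tag, comm, recv_req(n), ierr)
      call MPI_Isend(var(1,send_start(n)), nz*send_count(n), MPI_REAL, &
                     nbr_rank(n), tag, comm, send_req(n), ierr)
    end do
    call MPI_Waitall(num_neighbors, recv_req, MPI_STATUSES_IGNORE, ierr)
    call MPI_Waitall(num_neighbors, send_req, MPI_STATUSES_IGNORE, ierr)

In NIM the exchange itself is generated by SMS directives; the sketch simply shows why no separate pack/unpack step is required once points are stored in spiral order.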

Figure 8: Spiral communications, where the Spiral Data Layout panel shows interior points for several MPI tasks, organized with data sent to neighbor tasks in black, and points to be received shown in different colors for each MPI task. The Data Storage Layout panel shows how data are organized in memory, where each horizontal point shown represents a model column containing 96 vertical levels.



Figure 9 shows multi-node scaling results for 20 to 320 nodes for CPU, GPU, and symmetric CPU+MIC runs. The hardware used in the runs was older-generation Intel SandyBridge (2012), NVIDIA K20x (2012), and Intel Knights Corner (2013). These results show the MIC and GPU are best when they have more than 10,000 columns of work per device. CPU performance is best when there is less work per CPU: increasing efficiency (0.65 to 0.71) and super-linear behavior are observed (4096 and 2048 columns respectively) because computations fit better in cache memory.

Figure 9: NIM scaling comparison for dual-socket SandyBridge CPU, NVIDIA Kepler K20x GPU, and symmetric execution with Intel Knights Corner (MIC) and a dual-socket Intel SandyBridge CPU. The horizontal axis gives the number of nodes used for the fixed problem size. Cols/Node indicates the number of columns per node, consisting of either the dual-socket CPU, the GPU, or split between the CPU and MIC for symmetric execution. Efficiency compared to the twenty-node runtime appears as a numeric value in each performance bar.

[Figure 9 chart: NIM Multi-Node Performance, 30 km resolution, 96 vertical levels, 100 model timesteps. Runtime (seconds) versus nodes (20, 40, 80, 160, 320), with Cols/Node of 32768, 16384, 8192, 4096 and 2048. Efficiency labels: CPU .95, .94, .65, .71; GPU .87, .85, .73, .67; MIC+CPU .84, .70, .53.]

5.4 Cost-Benefit (GPU)

Cost benefit for the GPU is determined using list prices as specified from Intel and NVIDIA in Table 3. The CPU node estimate was based on a standard two-socket, 24-core, Intel Haswell node that includes the processor, memory, network interconnect, and warranty. The system interconnect was not included in cost calculations, based on the assumption that the cost for each system would be similar. While significant discounts are normally offered to customers, it would be impossible to fairly represent them in any cost-benefit evaluation here.

Table 3: List prices for Intel Haswell CPU, Intel MIC, and NVIDIA K80 GPU processors. The CPU node is based on the cost of a Dell R430 rack-mounted system.

    Chip               Part             Cores   Power (watts)   OEM Price
    Haswell            E5-2690-V3 (2)   24      270             $4,180 (10)
    NVIDIA K80         K80              4992    300             $5,000 (11)
    Intel MIC (KNC)    7120P            61      300             $4,129 (12)
    Haswell CPU Node   Dell R430        24      -               $6,500 (13)

10 http://ark.intel.com/products/81713/Intel-Xeon-Processor-E5-2690-v3-30M-Cache-2_60-GHz
11 http://www.anandtech.com/show/8729/nvidia-launches-tesla-k80-gk210-gpu
12 http://ark.intel.com/products/75799/Intel-Xeon-Phi-Coprocessor-7120P-16GB-1_238-GHz-61-core
13 Dell PowerEdge R430 server, rack mounted, quote 3/2/2015, see appendix for details.


Figure 10 shows a cost-benefit comparison based on running NIM dynamics at 30 km model resolution. Each of the five system configurations shown produced a 3-hour forecast in 23 seconds, or 0.20 percent of real-time. The CPU-only configuration (upper-left point) required 960 cores or 40 Haswell nodes. The right-most configurations used 20 NVIDIA K80 GPUs that were attached to 20, 10, 7 and 5 CPUs respectively. The execution time of 23 seconds can be extrapolated to 1.6 percent of real-time for a 3.75 km resolution model when per-process workload remains fixed (weak scaling)14. The 20,482 columns of work per GPU is also consistent with the 95 percent device efficiency given in Figure 6.

Figure 10: Cost comparison for CPU-only and CPU-GPU systems needed to run 100 timesteps of NIM dynamics in 23 seconds. Runtimes do not include model initialization or I/O. Cost estimates are based on list prices for hardware given in Table 3. The CPU-only system used 40 Haswell CPU nodes. Four CPU-GPU configurations were used, where "numCPUs" indicates the total number of CPUs used, and "K80s per CPU" indicates the number of accelerators attached to each node.

Based on list prices in Table 3, a 40-node CPU system would cost $260,000. Systems configured with 1 to 4 NVIDIA K80s per CPU are shown that lower the price of the system from $230K to $132.5K respectively. For these tests, 20 NVIDIA K80s were used, containing 2 GPUs per board; no changes in runtimes were observed for the four CPU-GPU configurations. Systems such as Cray Storm support 8 or more K80s, which could give additional cost benefit favoring GPUs.
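For reference, these totals follow directly from the Table 3 list prices: the CPU-only system is 40 nodes × $6,500 = $260,000; with one K80 per CPU, 20 × $6,500 + 20 × $5,000 = $230,000; with two per CPU, 10 × $6,500 + 20 × $5,000 = $165,000; with three per CPU, 7 × $6,500 + 20 × $5,000 = $145,500; and with four per CPU, 5 × $6,500 + 20 × $5,000 = $132,500.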

6. Discussion

The NIM demonstrates that weather prediction codes can be designed for high performance and portability targeting CPU, GPU and MIC architectures with a single source code. Inherent in the design of NIM has been the simplicity of the code, use of basic Fortran language constructs, and minimal branching in loop calculations. Use of Fortran pointers, derived types, and other constructs that are not well supported or are challenging for compilers to analyze and optimize was avoided. NIM's icosahedral-hexagonal grid permits grid cells to be treated identically, which minimizes branching in grid-point calculations. Further, the code design separated fine-grain and coarse-grain (MPI) parallelism. This was primarily due to limitations in F2C-ACC, but had the benefit of organizing calculations to avoid creation and execution of small parallel regions, where synchronization and thread startup (CPU, MIC) and synchronization or kernel startup (GPU) time can be significant.

14 Each doubling in model resolution requires 4X more compute and a 2X increase in the number of model time-steps. Assuming perfect scaling, an increase in model resolution from 30 km to 3.75 km requires 64X (4³) more GPUs, with an 8X (2³) increase in the number of model time-steps. Therefore, scaling to 3.75 km is calculated as 8 * 0.20 = 1.6% of real-time. Additional increases in compute and time-to-solution are expected when physics calculations are included.

[Figure 10 chart: CPU versus GPU cost-benefit, NIM 30 km resolution. Cost (thousands of dollars) versus numCPUs (40, 20, 10, 7, 5) with 0, 1, 2, 3 and 4 K80s per CPU respectively: $260K (CPU only), $230K, $165K, $145.5K, $132.5K.]


The choice to organize arrays and loop calculations with an innermost vertical dimension and indirect addressing to access neighboring grid cells simplified code design without sacrificing performance. It also improved code portability and performance in unanticipated ways. First, the innermost vertical dimension of 96 levels was sufficient for CPU and MIC vectorization, but essential for the GPU's high core-count devices. With few dependencies in the vertical dimension, vectorization (CPU, MIC) and thread parallelism (GPU) were consistently available in dynamics routines. Second, indirect addressing of grid cells gave flexibility and benefit in how they could be organized. For example, spiral grid re-ordering to eliminate MPI message packing and unpacking gave up to a 20 percent improvement in model performance.

Optimizations benefitting one architecture also helped the others. In the rare event performance degraded on one or more architectures, the changes were re-formulated to gain positive benefit on all. The PGI OpenACC compiler now matches the performance of the F2C-ACC compiler. OpenACC compilers continue to mature, benefiting from F2C-ACC comparisons that exposed bugs and performance issues that were corrected. Parallelization is simpler with OpenACC, largely because data movement between CPU and GPU is managed by the runtime system. Unified memory on the GPU is expected to further simplify parallelization, narrowing the ease-of-use gap versus OpenMP. These improvements have led to a decision to phase out use of F2C-ACC, relying solely on OpenACC compilers for NIM and other models.

The scope of this paper primarily focused on model dynamics, largely because domain scientists had not decided which physics suite to use for high-resolution (< 4 km) runs. Parallelization of select microphysics and radiation routines improved performance on all architectures, but MIC and GPU runtimes were slower than the CPU (Henderson et al. 2015; Michalakes et al. 2015). Lower performance is likely due to more conditionals and less parallelism in physics calculations than in dynamics routines. Conditionals restrict SIMD parallelism, limiting performance. Further, parallelism is largely determined by data dependencies in the scientific formulations. Since physics routines generally contain dependencies in the vertical column, parallelism is often only available in the horizontal dimension. To ease this limitation, chunking was described, a means to divide parallelism between vectorization and thread parallelism on the CPU and MIC processors that can be similarly mapped to the thread and block parallelism on the GPU. Hyper-threading on the CPU and MIC was also described, an effective load-balancing mechanism when multiple vertical columns are executed by each MPI task.

The paper gives a cost-benefit calculation for NIM dynamics that showed increasing value as more accelerators per node are used. However, there are several limitations in the value of these results. First, the comparison was only for model dynamics; when physics is included, model performance and cost benefit favoring the GPU are expected to decrease. Second, use of list price is naive as vendors typically offer significant discounts, particularly for large installations. Third, calculations did not include the cost of the system interconnect. For small systems with tens of nodes this was deemed acceptable for comparison as there would be little difference in price or performance.
However, comparisons with hundreds to thousands of nodes would amplify the role of the interconnect, which would need to be included in cost-benefit calculations.

Finally, the Next Generation Global Prediction System (NGGPS) program is tasked with developing the Nation's next global weather prediction model that will be used by NOAA's National Weather Service around 2020. As part of the High Impact Weather Prediction Project (HiWPP), two dynamical cores were selected as leading candidates to replace the existing operational Global Forecast System (GFS): NOAA GFDL's Finite-Volume cubed (FV3) (Lin 2004) and the Model for Prediction Across Scales (MPAS) (Skamarock et al. 2015) from NCAR. The NIM work can be used as a template for MPFG parallelization of these models to enable running global forecast models operationally at 3 km resolution in the next decade.


7. Conclusion

The NIM is currently the only weather model capable of running on CPU, GPU and MIC architectures with a single source code. Performance of the NIM dynamical core was described. CPU, GPU and MIC comparisons were made for device, node and multi-node performance. Performance calculations were made based on the goal of running at 3 km resolution in 1-2 percent of real-time. Device comparisons show NIM ran on the MIC and GPU 1.3X and 1.9X faster, respectively, than same-generation CPU hardware. The 1.3X MIC speedup versus a dual-socket CPU is significant, since no other weather or climate model has been able to demonstrate a speedup favoring the MIC. Single-node comparisons demonstrated the value of symmetric execution where both the CPU and GPU or MIC hardware was used. For multi-node execution, the spiral grid ordering was described, which eliminated data packing and unpacking and gave performance benefit on all architectures. Finally, a cost-benefit analysis demonstrated increasing benefits favoring the GPU when up to 8 accelerators were attached to each CPU host.

New hardware in 2016 looks promising and will invite further comparisons. Both GPU and MIC chips will include high-speed memory with up to 5 times faster performance than current-generation hardware. For the MIC, the host-less Knights Landing processor will no longer be attached to a CPU host, which should also improve inter-node communications performance. For the GPU, the Pascal and Volta chips will offer hardware-supported unified memory and faster communications, which should improve programmability and performance. These improvements, along with more mature compilers, should further simplify porting codes to GPUs.

Fundamental to achieving good performance and portability has been the design of NIM. The simplicity of the code design, looping and array structures, and the indirect addressing of the icosahedral grid were all chosen to expose the maximum parallelism to the underlying hardware. The work reported here represents a successful development effort by a team of domain and computer scientists and software engineers. Scientists write the code, computer scientists are responsible for the directive-based parallelization and optimization, and software engineers maintain the software infrastructure capable of supporting development, testing and running the model on diverse supercomputer systems.

Applying NIM performance and portability techniques to the MPAS and FV3 models is a next step in the work. The HiWPP report showed that both models demonstrated good scaling to 130,000 CPU cores (Michalakes et al., 2015). While these results indicate sufficient parallelism is available, some work will be required to adapt them to run efficiently on GPU and MIC processors.

In the next decade, HPC is expected to become increasingly fine-grained, with systems containing potentially hundreds of millions of processing cores. To take advantage of these systems, new weather prediction models will need to be co-developed by scientific and computational teams to incorporate parallelism in model design, code structure, algorithms, and underlying physical processes.

ACKNOWLEDGMENTS

Thanks to the technical teams at Intel, Cray, PGI, and NVIDIA who were responsible for fixing bugs and providing access to the latest hardware and compilers. Thanks also to the staff at ORNL Titan and NSF TACC for providing system resources and helping to resolve system issues. This work was also supported in part by the Disaster Relief Appropriations Act of 2013 and the NOAA HPCC program.

REFERENCES

Arakawa, A., and V. R. Lamb, 1977: Computational design of the basic dynamical processes of the UCLA general circulation model. Meth. Comput. Phys., 17, Academic Press, New York, 173-265.

Bleck, R., J. Bao, S. Benjamin, J. Brown, M. Fiorino, T. Henderson, J. Lee, A. MacDonald, P. Madden, J. Middlecoff, J. Rosinski, T. Smirnova, S. Sun and N. Wang, 2015: Monthly Weather Review, 143, 2386-2403, DOI:10.1175/MWR-D-14-00300.1.

Carpenter, I., R. Archibald, K. Evans, and M. Taylor, 2013: Progress towards accelerating HOMME on hybrid multi-core systems, Intl Journal of High Performance Computing Applications, 27(3), 335-347, July 2013, DOI:10.1177/1094342012462751.

Cirrascale, 2015: Scaling GPU compute performance, Cirrascale white paper, June 2015, http://www.cirrascale.com/documents/whitepapers/Cirrascale_ScalingGPUCompute_WP_M987_REVA.pdf.

CUDA C Programming Guide, 2015: http://docs.nvidia.com/cuda/cuda-c-programming-guide/.

Ellis, S., 2015: Exploring the PCI bus routines, Cirrascale blog post, August 13, 2014, http://www.cirrascale.com/blog/index.php/exploring-the-pcie-bus-routes/.

Fuhrer, O., C. Osuna, X. Lapillone, T. Gysi, B. Cumming, M. Bianco, A. Arteaga, and T. Schulthess, 2014: Towards a performance portable, architecture agnostic implementation strategy for weather and climate models, Computing Frontiers and Innovations, 1, No 1, DOI:10.14529/jsfi1401.

Govett, M., L. Hart, T. Henderson, and D. Schaffer, 2003: The Scalable Modeling System: directive-based code parallelization for distributed and shared memory computers, Parallel Computing, 29(8), 995-1020.

Govett, M., J. Middlecoff and T. Henderson, 2010: Running the NIM next-generation weather model on GPUs, 10th IEEE/ACM Intl Conf on Cluster, Cloud and Grid Computing, CCGrid 2010, 17-20 May 2010, Melbourne, Australia.

Govett, M., 2013: Using OpenACC compilers to run FIM and NIM on GPUs, 2013 NCAR multi-core workshop, Boulder, Colorado, Sept 2013, https://www2.cisl.ucar.edu/sites/default/files/govett_6b.pdf.

Govett, M., J. Middlecoff and T. Henderson, 2014: Directive-based parallelization of the NIM weather model for GPUs, First Workshop on Accelerator Programming using Directives (WACCPD14), pp 55-61, New Orleans, LA, November 2014.

Henderson, T., M. Govett, and J. Middlecoff, 2011: Applying Fortran GPU compilers to numerical weather prediction, 2011 Symposium on Application Accelerators in High Performance Computing (SAAHPC 2011), pp 34-41, Knoxville, TN, August 2011.

Henderson, T., J. Michalakes, I. Gokhale and A. Jha, 2015: Optimizing Numerical Weather Prediction, High Performance Parallelism Pearls Volume Two: Multicore and Many-core Programming Approaches, 7-23, Morgan Kaufmann, DOI:10.1016/B978-0-12-803819-2.00016-1.

Lapillonne, X., and O. Fuhrer, 2014: Using Compiler Directives to Port Large Scientific Applications to GPUs: An Example from Atmospheric Science, Parallel Process. Letters, 24, 1450003 [18 pages], DOI:10.1142/S0129626414500030.

Lee, J. and A. E. MacDonald, 2009: A finite-volume icosahedral shallow water model on local coordinate, Monthly Weather Review, 137, 1422-1437.

Lee, J., R. Bleck and A. E. MacDonald, 2010: A multistep flux corrected transport scheme, Journal of Computational Physics, 229, 9284-9298.

Lin, S.-J., 2004: A vertically Lagrangian finite-volume dynamical core for global models, Monthly Weather Review, 132, 2293-2307.

MacDonald, A. E., J. Middlecoff, T. Henderson and J. Lee, 2011: A general method for modeling on irregular grids, Intl Journal of High Performance Computing Applications, 25(4), 392-403, DOI: http://dx.doi.org/10.1177/1094342010385019.

Michalakes, J., M. Iacono, and E. Jessup, 2015: Optimizing Weather Model Radiative Transfer Physics for Intel's Many Integrated Core (MIC) Architecture, preprint.

Michalakes, J., M. Govett, T. Black, H. Juang, A. Reinecke, and B. Skamarock, 2015: AVEC Report: NGGPS Level-1 Benchmarks and Software Evaluation, April 30, 2015, http://www.nws.noaa.gov/ost/nggps/DycoreTestingFiles/AVEC%20Level%201%20Benchmarking%20Report%2008%2020150602.pdf.

Middlecoff, J., 2015: Optimization of MPI message passing in a multi-core NWP dynamical core running on NVIDIA GPUs, in Fifth NCAR multi-core workshop, Boulder, Colorado, Sept 2015, https://www2.cisl.ucar.edu/sites/default/files/Abstract_Middlecoff.pdf.

Norman, M., I. Demeshko, J. Larkin, A. Vose, and M. Taylor, 2015: Experiences with CUDA and OpenACC from porting ACME to GPUs, in Fifth NCAR multi-core workshop, Boulder, Colorado, Sept 2015, https://www2.cisl.ucar.edu/sites/default/files/Norman_Slides.pdf.

Nguyen, H., C. Kerr, and Z. Liang, 2013: Performance of the Cube-Sphere Atmospheric Dynamical Core on Xeon and Xeon-Phi Architectures, in Third NCAR Multi-core Workshop, Boulder, Colorado, Sept 2013, https://www2.cisl.ucar.edu/sites/default/files/vu_3a.pdf.

Putnam, B., 2011: Graphics Processing Unit (GPU) acceleration of the Goddard Earth Observing System Atmospheric Model, NASA Technical Report, Jan 2011.

Rosinski, J., 2015: Porting and Optimizing NCEP's GFS Physics Package for Unstructured Grids on Intel Xeon and Xeon-Phi, in Fifth NCAR Multi-core Workshop, Boulder, Colorado, Sept 2015, https://www2.cisl.ucar.edu/sites/default/files/Rosinski_slides.pdf.

Sadourny, R., A. Arakawa and Y. Mintz, 1968: Integration of the non-divergent barotropic vorticity equation with an icosahedral-hexagonal grid for the sphere, Monthly Weather Review, 96(6), 351-356, DOI: http://dx.doi.org/10.1175/1520-0493(1968)096<0351:IOTNBV>2.0.CO;2.

Sawyer, W., G. Zaengl, and L. Linardakis, 2014: Towards a multi-node OpenACC Implementation of the ICON Model, EGU General Assembly 2014, held 27 April - 2 May 2014 in Vienna, Austria, id. 15276.

Skamarock, B., J. Klemp, M. Duda, L. Fowler, S. Park, and T. Ringler, 2012: A multi-scale non-hydrostatic atmospheric model using centroidal Voronoi tessellations and C-grid staggering, Monthly Weather Review, 140, 3090-3105, DOI:10.1175/MWR-D-11-00215.1.

Top500 Website, 2015: http://www.top500.org/lists/2015/11/.

Wang, N. and J. Lee, 2011: Geometric properties of the icosahedral-hexagonal grid on the two-sphere, SIAM J. Sci. Computing, 33(5), 2536-2559.

Whitaker, J., 2015: HIWPP non-hydrostatic dynamical core tests: Results from idealized test cases, Jan 21, 2015, http://www.nws.noaa.gov/ost/nggps/DycoreTestingFiles/HIWPP_idealized_tests-v8%20revised%2005212015.pdf.

Williamson, D., 1971: A comparison of first and second-order difference approximations over a spherical geodesic grid, J. Comp. Phys., 7(2), 301-309, April 1971, DOI:10.1016/0021-9991(71)90091-X.

Yashiro, H., A. Naruse, R. Yoshida, and H. Tomita, 2014: A global atmosphere simulation on a GPU supercomputer using OpenACC: Dry dynamical core tests, TSUBAME ESJ, Vol. 12 (Pub. Sep. 22, 2014).