Optimizing Network Usage on Sequoia and Sierra


Page 1:

LLNL-PRES-695482 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Optimizing Network Usage on Sequoia and Sierra

Bronis R. de Supinski, Chief Technology Officer

Livermore Computing, June 23, 2016

Page 2:

[Roadmap figure: NNSA ASC platform timeline across fiscal years '13 through '23, showing System Delivery, Procure & Deploy, Use, and Retire phases. Advanced Technology Systems (ATS): Sequoia (LLNL), ATS 1 Trinity (LANL/SNL), ATS 2 Sierra (LLNL), ATS 3 Crossroads (LANL/SNL), ATS 4 (LLNL), ATS 5 (LANL/SNL). Commodity Technology Systems (CTS): Tri-lab Linux Capacity Cluster II (TLCC II), CTS 1, CTS 2.]

My focus is NNSA ASC ATS platforms at LLNL

Sequoia and Sierra are the current and next-generation Advanced Technology Systems at LLNL

Page 3:

Sequoia provides previously unprecedented levels of capability and concurrency

§ Sequoia statistics
— 20 petaFLOP/s peak
— 17 petaFLOP/s LINPACK
— Memory: 1.5 PB, 4 PB/s bandwidth
— 98,304 nodes
— 1,572,864 cores
— 3 PB/s link bandwidth
— 60 TB/s bi-section bandwidth
— 0.5–1.0 TB/s Lustre bandwidth
— 50 PB disk

§ 9.6 MW power, 4,000 ft² footprint

§ Third generation IBM BlueGene

Page 4:

The BG/Q compute chip integrates processors, memory and networking logic into one chip

§ 16 user + 1 OS + 1 redundant cores
— 4-way multi-threaded, 1.6 GHz, 64-bit
— 16 kB / 16 kB L1 I/D caches
— Quad FPUs (4-wide DP SIMD)
— Peak: 204.8 GFLOPS @ 55 W

§ Shared 32 MB eDRAM L2 cache
— Multi-versioned cache

§ Dual memory controller
— 16 GB DDR3 memory (1.33 Gb/s)
— 2 × 16-byte-wide interface (+ECC)

§ Chip-to-chip networking
— 5D torus topology + external link
— Each link 2 GB/s send + 2 GB/s receive
— DMA, put/get, collective operations (see the one-sided sketch below)
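As an aside, the put/get style of communication that the BG/Q messaging unit accelerates can be illustrated at the MPI level with a one-sided halo write. This is a generic sketch, not the BG/Q SPI interface; the function and buffer names are hypothetical.

/* Generic MPI one-sided (put/get) sketch; illustrates the style of RDMA
   communication the messaging hardware accelerates, not the SPI API. */
#include <mpi.h>

void halo_put(double *sendbuf, double *recvbuf, int n, int neighbor) {
    MPI_Win win;
    /* Expose recvbuf so the neighbor can write halo data directly into it. */
    MPI_Win_create(recvbuf, n * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                       /* open access epoch   */
    MPI_Put(sendbuf, n, MPI_DOUBLE,              /* local data          */
            neighbor, 0, n, MPI_DOUBLE, win);    /* remote rank, offset */
    MPI_Win_fence(0, win);                       /* complete transfers  */

    MPI_Win_free(&win);
}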

Page 5:

Traditional Blue Gene overall system integration results in a small footprint

1. Chip: 16 cores
2. Module: single chip
3. Compute card: one single-chip module, 16 GB DDR3 memory
4. Node card: 32 compute cards, optical modules, link chips, torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards, 8 PCIe Gen2 slots
6. Rack: 2 midplanes, 1, 2, or 4 I/O drawers
7. System: 20 PF/s

Page 6:

§ Communication locality through optimized MPI process placement is critical on 3D torus networks
— Use of a 5D torus reduces the network diameter and reduces the importance of MPI process placement

§ Support for hardware-optimized collectives should apply to subcommunicators as well as global operations
— Increased network communication contexts allow more applications to exploit hardware support for collective operations (see the MPI sketch below)

§ Hardware support for network partitioning minimizes jitter

§ Multiple networks provide many benefits but also increase costs

Sequoia and BlueGene/Q reflect lessons from previous Blue Gene generations

These examples are network-centric; others reflect lessons throughout the system hardware and software architecture
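The first two lessons can be made concrete at the MPI level. The following sketch is illustrative only (not from the original slides; the dimensions and names are hypothetical): it requests topology-aware placement through a periodic Cartesian communicator and then runs a collective on a subcommunicator, the case the slide argues should also benefit from hardware offload.

/* Hypothetical sketch: torus-aware placement plus a subcommunicator collective. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int nranks, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Factor the job into a 3D grid and build a periodic (torus) Cartesian
       communicator; reorder = 1 lets the library place ranks for locality. */
    int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1}, coords[3];
    MPI_Dims_create(nranks, 3, dims);
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 3, coords);

    /* One subcommunicator per torus plane (fixed third coordinate); collectives
       on it can still use hardware support when enough network contexts exist. */
    MPI_Comm plane;
    MPI_Comm_split(cart, coords[2], rank, &plane);

    double local = (double)rank, sum = 0.0;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, plane);

    if (coords[0] == 0 && coords[1] == 0)
        printf("plane %d: sum = %g\n", coords[2], sum);

    MPI_Comm_free(&plane);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}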

Page 7:

§ Mechanisms that lead to arrhythmia are not well understood; contraction of the heart is controlled by the electrical behavior of heart cells

§ Mathematical models reproduce the component ionic currents of the action potential
— System of non-linear ODEs (see the generic sketch below)
— TT06 includes 6 species and 14 gates
— Negative currents depolarize (activate)
— Positive currents repolarize (return to rest)

§ Ability to run, at high resolution, thousands instead of tens of heartbeats enables detailed study of drug effects

LLNL, with IBM, has developed Cardioid, a state-of-the-art cardiac electrophysiology simulation

2012 Gordon Bell finalist typifies the effort required to exploit supercomputer capability fully
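To show the shape of the per-cell reaction work these slides refer to, here is a generic forward-Euler membrane and gate update in the Hodgkin-Huxley style. It is a sketch only: the rate functions, constants, and names are placeholders, not the TT06 model or Cardioid's implementation.

/* Generic, hypothetical cardiac reaction step (forward Euler). */
#include <math.h>

typedef struct {
    double v;         /* membrane potential (mV)   */
    double gate[14];  /* gating variables in [0,1] */
} Cell;

void reaction_step(Cell *c, double dt) {
    /* Relax each gate toward its voltage-dependent steady state. */
    for (int g = 0; g < 14; ++g) {
        double g_inf = 1.0 / (1.0 + exp(-c->v / 10.0));  /* placeholder rate fn */
        double tau   = 1.0 + 0.1 * g;                     /* placeholder time constant */
        c->gate[g] += dt * (g_inf - c->gate[g]) / tau;
    }

    /* Sum the component ionic currents; a negative total depolarizes the cell. */
    double i_ion = 0.0;
    for (int g = 0; g < 14; ++g)
        i_ion += c->gate[g] * (c->v + 85.0);              /* placeholder currents */

    c->v -= dt * i_ion;                                    /* dV/dt = -I_ion / C_m, with C_m = 1 */
}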

Page 8:

[Figure: 60 heartbeats simulated in 67.2 seconds versus 60 heartbeats in 197.4 seconds]

§ Measured peak performance: 11.84 PFlop/s (58.8% of peak)
— 0.05 mm resolution heart (3B tissue cells)
— Ten million iterations, dt = 4 μs
— Performance of full simulation loop, including I/O, measured with HPM

Cardioid achieves outstanding performance that enables nearly real-time heartbeat simulation

Optimized Cardioid is 50x faster than "naive" code

§ Extreme strong scaling limit:
— 0.10 mm: 236 tissue cells/core (roughly 370M cells spread over 1.57M cores)
— 0.13 mm: 114 tissue cells/core

Page 9:

Cardioid represents a major advance in the state of the art of human heart simulation

[Figure: 0.1 mm heart (370M tissue cells), one minute of wall time; previous state of the art versus Cardioid; 18.2 seconds of simulation time]

Page 10:

§ Partitioned cells over processes with an upper bound on time (not on equal time)

§ Assigned diffusion work and reaction work to different cores

§ Transformed the potassium equation to remove serialization

§ Expensive 1D functions in reaction model expressed with rational approximates (see the sketch after this list)

§ Single precision weights to reduce diffusion stencil use of L2 bandwidth

§ Hand unrolled to SIMDize loops over cells
§ Sorted by cell type to improve SIMDization
§ Sub-sorting of cells to increase sequential/vector load and storing of data
§ log function from libm replaced with custom inlined functions
§ On-the-fly assembly of code to optimize data movement at runtime
§ Memory layout tuned to improve cache performance
§ Use of vector intrinsics and custom divides

Cardioid achieves outstanding performance through detailed tuning to Sequoia's architecture

§ Moved integer operations to floating point units to exploit SIMD units

§ No explicit network barrier
§ L2 on-node thread barriers
§ Use low-level SPI for halo data exchange between tasks (DMA)

§ Application-managed threads
§ SIMDized diffusion stencil implementation
§ Zero flux boundary conditions approximated by method with no global solve
§ High performance I/O is aware of BG/Q network topology

§ Low overhead in-situ performance monitors
§ Assignment of threads to diffusion/reaction dependent on domain characteristics
§ Co-scheduled threads for improved dual issue
§ Multiple diffusion implementations to obtain optimal performance for various domains
§ Remote & local copies separated to improve bandwidth utilization
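As an illustration of two items above (not Cardioid's actual code), the sketch below evaluates a cheap rational approximation of an expensive 1D gate-rate function over cells that have been pre-sorted by type, so the hot loop runs over contiguous, uniform ranges that vectorize well. The coefficients and data layout are hypothetical.

/* Hypothetical sketch of two Cardioid-style optimizations. */
#include <stddef.h>

/* Rational approximation r(v) = P(v)/Q(v) standing in for an expensive
   transcendental rate function; these coefficients are placeholders,
   not the fitted values used in the real code. */
static inline double rate_approx(double v) {
    double p = 1.0 + v * (0.5 + v * 0.01);
    double q = 1.0 + v * (0.2 + v * 0.02);
    return p / q;
}

/* Cells are assumed pre-sorted by type, so each type occupies one contiguous
   range [type_start[t], type_start[t+1]): the inner loop is branch-free and
   straightforward for the compiler (or hand unrolling) to SIMDize. */
void update_gates(double *gate, const double *v, const size_t *type_start,
                  size_t ntypes, double dt) {
    for (size_t t = 0; t < ntypes; ++t)
        for (size_t i = type_start[t]; i < type_start[t + 1]; ++i)
            gate[i] += dt * (rate_approx(v[i]) - gate[i]);
}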

Page 11:

At the largest scales, small software overheads can significantly impact performance

Direct use of message units and L2 atomic operations minimizes software overhead

370 million cells, 1.6 million cores, 1600 flops/cell: 60 μs per iteration (see the arithmetic below)

[Figure: per-iteration time in μs (0–300 scale) for L2 atomic barrier, OMP fork/join, SPI halo exchange, and MPI halo exchange, compared against the 60 μs target]
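The 60 μs budget follows from the figures on this slide (a back-of-envelope check, not part of the original deck):

\[
370\times10^{6}\ \text{cells}\times 1600\ \tfrac{\text{flops}}{\text{cell}} \approx 5.9\times10^{11}\ \text{flops per iteration},
\qquad
\frac{5.9\times10^{11}\ \text{flops}}{60\ \mu\text{s}} \approx 9.9\ \text{PFlop/s}.
\]

Hitting the 60 μs target therefore corresponds to sustaining roughly 10 PFlop/s, in line with the 11.84 PFlop/s reported earlier, so any barrier or halo exchange that costs tens of microseconds per iteration erases a large fraction of the achievable performance.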

Page 12:

The Sierra system that will replace Sequoia features a GPU-accelerated architecture

Components:
— Mellanox® interconnect: dual-rail EDR InfiniBand®
— IBM POWER: NVLink™
— NVIDIA Volta: HBM, NVLink
— Compute node: POWER® architecture processor, NVIDIA® Volta™, NVMe-compatible PCIe 800 GB SSD, > 512 GB DDR4 + HBM, coherent shared memory
— Compute rack: standard 19", warm water cooling
— Compute system: 2.1–2.7 PB memory, 120–150 PFLOPS, 10 MW
— GPFS™ file system: 120 PB usable storage, 1.2/1.0 TB/s R/W bandwidth

Page 13:

Outstanding benchmark analysis by IBM and NVIDIA demonstrates the system's usability

Projections included code changes that showed a tractable annotation-based approach (i.e., OpenMP) will be competitive

Figure 5: CORAL benchmark projections show GPU-accelerated system is expected to deliver substantially higher performance at the system level compared to CPU-only configuration.

The demonstration of compelling, scalable performance at the system level across a wide range of applications proved to be one of the key factors in the U.S. DoE's decision to build Summit and Sierra on the GPU-accelerated OpenPOWER platform.

Conclusion

Summit and Sierra are historic milestones in HPC's efforts to reach exascale computing. With these new pre-exascale systems, the U.S. DoE maintains its leadership position, trailblazing the next generation of supercomputers while allowing the nation to stay ahead in scientific discoveries and economic competitiveness.

The future of large-scale systems will inevitably be accelerated with throughput-oriented processors. Latency-optimized CPU-based systems have long hit a power wall that no longer delivers year-on-year performance increases. So while the question of "accelerator or not" is no longer in debate, other questions remain, such as CPU architecture, accelerator architecture, inter-node interconnect, intra-node interconnect, and heterogeneous versus self-hosted computing models.

With those questions in mind, the technological building blocks of these systems were carefully chosen with the focused goal of eventually deploying exascale supercomputers. The key building blocks that allow Summit and Sierra to meet this goal are:

• The heterogeneous computing model
• NVIDIA NVLink high-speed interconnect
• NVIDIA GPU accelerator platform
• IBM OpenPOWER platform

With the unveiling of the Summit and Sierra supercomputers, Oak Ridge National Laboratory and Lawrence Livermore National Laboratory have spoken loud and clear about the technologies that they believe will best carry the industry to exascale.

[Figure: CORAL application performance projections, relative performance (0x–14x), CPU-only versus CPU + GPU. Scalable Science Benchmarks: QBOX, LSMS, HACC, NEKbone. Throughput Benchmarks: CAM-SE, UMT2013, AMG2013, MCB.]

Page 14:

Sierra NRE will provide significant benefit to the final system

§ Center of Excellence
§ Motherboard design
§ Water cooled compute nodes
§ HW resilience studies/investigation (NVIDIA)
§ Switch based collectives
§ Hardware tag matching
§ GPUDirect and NVMe
§ Open source compiler infrastructure
§ System diagnostics
§ System scheduling
§ Burst buffer
§ GPFS performance and scalability
§ Cluster management
§ Open source tools

Page 15:

§ Reliable, scalable, general purpose primitive, applicable to multiple use cases
— In-network tree-based aggregation mechanism
— Large number of groups
— Multiple simultaneous outstanding operations

§ High performance collective offload
— Barrier, Reduce, All-Reduce
— Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
— Can overlap communication and computation (see the nonblocking sketch below)

§ Flexible mechanism reflects lessons learned from Blue Gene systems

Switch-based support for collectives further improves critical functionality

[Figure: SHArP tree with aggregation nodes and end nodes (processes running on HCAs) and a SHArP tree root]
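From the application's point of view, switch-offloaded collectives such as SHArP are typically reached through the MPI library, and the overlap the slide mentions shows up as a nonblocking collective that progresses while the host computes. The sketch below is generic MPI, not the SHArP interface itself; the function and variable names are illustrative.

/* Overlapping an allreduce with local work; with switch offload the
   reduction consumes essentially no host CPU time while it progresses. */
#include <mpi.h>

double overlap_allreduce(MPI_Comm comm, double local_sum, double *work, int n) {
    double global_sum;
    MPI_Request req;

    /* Start the reduction, then keep computing while it progresses. */
    MPI_Iallreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

    for (int i = 0; i < n; ++i)
        work[i] *= 0.5;                 /* placeholder local computation */

    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* reduction complete; result usable */
    return global_sum;
}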

Page 16:

Initial results demonstrate that SHArP collectives improve performance significantly

§ OSU Allreduce, 1 PPN, 128 nodes

Page 17:

§ Realistic applications, particularly in C++, often use small messages
— Realized message rate is often the key performance indicator
— MPI provides little ability to coalesce these messages

§ MPI matching rules heavily impact realized message rate
— Message envelopes must match
• Wildcards (MPI_ANY_SOURCE, MPI_ANY_TAG) increase envelope matching complexity and, thus, cost
— Posted receives must be matched in order against the in-order posted sends (see the sketch below)

As seen with Cardioid, MPI software overhead critically limits realized network performance

Hardware message matching support can alleviate software overhead

[Figure: sender/receiver matching diagram; receives posted with Tag=A, Communicator=B, source=C at times X and X+D must match, in order, sends posted with Tag=A, Communicator=B, destination=C at times Y and Y+D']
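The matching rules above can be seen directly in how receives are posted. The fragment below is illustrative only: two receives with identical envelopes must match incoming sends in posting order, and a wildcard receive forces the library to compare every incoming envelope against it, which is exactly the cost that hardware tag matching is meant to absorb.

/* Illustrative only: receive posting and MPI matching order. */
#include <mpi.h>

void post_receives(double *a, double *b, double *c, int n, MPI_Comm comm) {
    MPI_Request req[3];

    /* Specific source and tag: cheap to match (and offloadable to hardware). */
    MPI_Irecv(a, n, MPI_DOUBLE, /*source=*/1, /*tag=*/100, comm, &req[0]);

    /* Identical envelope: MPI guarantees this buffer receives the *second*
       matching send from rank 1 with tag 100, preserving posting order. */
    MPI_Irecv(b, n, MPI_DOUBLE, 1, 100, comm, &req[1]);

    /* Wildcard receive: matches any source and any tag, so every incoming
       message must be checked against it as the posted-receive list is walked. */
    MPI_Irecv(c, n, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &req[2]);

    MPI_Waitall(3, req, MPI_STATUSES_IGNORE);
}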

Page 18:

MPI tag matching operations must appear to be performed atomically

§ Complexity/serialization of message matching limits the processing that can be performed on GPUs

Page 19:

§ Offloaded to the ConnectX-5 HCA
— Enables more of the hardware bandwidth to translate to realized message rate
— Full MPI tag matching as compute progresses
— Rendezvous offload: large data delivery as compute progresses

§ Control can be passed between hardware and software

§ Verbs tag matching support is being upstreamed

Mellanox hardware will support efficient MPI tag matching

Page 20:

§ Sierra uses a commodity cluster network solution
— Separate management (Ethernet) and user (IB) networks
— Single network for user traffic (point-to-point, collectives & file system traffic)

§ A single network for user traffic saves money but has other costs
— Jitter impact of other jobs' file system traffic can be severe
— Burst buffer strategy smooths file system bandwidth demand
• File system traffic of a job now competes with its MPI traffic

§ Different types of network traffic are not equally critical
— File system traffic "only" needs a guarantee of eventual completion
— Collectives often critically limit overall performance
— Other traffic classes also exist

Deploying multiple networks exacerbates network hardware costs, which are already too high in large-scale systems

Quality-of-Service (QoS) mechanisms are necessary to achieve a network solution that reduces network hardware costs while providing acceptable, consistent performance for all traffic classes

Page 21:

Preliminary results indicate that IB priority levels compensate for checkpoint traffic

Figure 21: Application (pF3D nearest neighbor exchange in Z, 16 processes per node) completion time for the various benchmarked configurations, in presence of checkpointing and in isolation.

Figure 22: Checkpointing rate in absence and in presence of application (pF3D nearest neighbor exchange in Z, 16 processes per node) for the various benchmarked configurations.

3.3 All-to-All Simulations on Fat Tree

The all-to-all communication phases of pF3D, due to the mapping we have chosen, interact with checkpointing traffic in a smaller portion of the network. They do, however, exhibit similar behavior to that presented in the previous section for the nearest neighbor exchange, and the conclusions are the same. In the interest of brevity, we will only include the summary of the results. The figures present the following.

For the all-to-all exchange in the X direction:
• Figure 23: pF3D completion time in the single process per node case.
• Figure 24: checkpointing rate in the pF3D single process per node case.


Page 22:

§ Flexibility of SHArP switch-based collectives will accelerate subcommunicator collectives and will allow jobs to share the network

§ HCA MPI tag matching will reduce software cost on the critical path
— Future systems should further accelerate message passing software

§ QoS mechanisms are essential with burst buffers or systems that share network resources across jobs
— Multiple networks might still provide the best solution in some cases
— Network partitioning could still be valuable on future systems

§ GPUDirect and NVMe reduce within-node messaging impact

§ High capability nodes lead to a smaller network
— Reduces importance of network partitioning
— LLNL CTS-1 with a 2-to-1 tapered fat tree still requires careful task mapping

Sierra network hardware addresses lessons learned from previous LLNL systems

Substantial research questions still remain for systems after Sierra

Page 23: