Programming Models for Exascale Systems
TRANSCRIPT
High-Performance and Scalable Designs of Programming Models for Exascale Systems

by Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Talk at HPCAC-Switzerland (Mar 2016)
High-End Computing (HEC): ExaFlop & ExaByte

• ExaFlop & HPC: 100-200 PFlops in 2016-2018; 1 EFlops in 2020-2024?
• ExaByte & Big Data: 10K-20K EBytes in 2016-2018; 40K EBytes in 2020?

[Figure 1. Source: IDC's Digital Universe Study, sponsored by EMC, December 2012]
Within these broad outlines of the digital universe are some singularities worth noting.
First, while the portion of the digital universe holding potential analytic value is growing, only a tiny fraction of territory has been explored. IDC estimates that by 2020, as much as 33% of the digital universe will contain information that might be valuable if analyzed, compared with 25% today. This untapped value could be found in patterns in social media usage, correlations in scientific data from discrete studies, medical information intersected with sociological data, faces in security footage, and so on. However, even with a generous estimate, the amount of information in the digital universe that is "tagged" accounts for only about 3% of the digital universe in 2012, and that which is analyzed is half a percent of the digital universe. Herein is the promise of "Big Data" technology — the extraction of value from the large untapped pools of data in the digital universe.
Trends for Commodity Computing Clusters in the Top500 List (http://www.top500.org)

[Figure: Number and percentage of commodity clusters in the Top500 over time; clusters now account for 85% of the list.]
Drivers of Modern HPC Cluster Architectures

• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

[Figure: Example systems Tianhe-2, Titan, Stampede, Tianhe-1A, built from multi-core processors; high-performance interconnects (InfiniBand: <1 usec latency, 100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM.]
Large-scale InfiniBand Installations

• 235 IB clusters (47%) in the Nov '15 Top500 list (http://www.top500.org)
• Installations in the Top 50 (21 systems):
  – 462,462 cores (Stampede) at TACC (10th)
  – 185,344 cores (Pleiades) at NASA/Ames (13th)
  – 72,800 cores Cray CS-Storm in US (15th)
  – 72,800 cores Cray CS-Storm in US (16th)
  – 265,440 cores SGI ICE at Tulip Trading Australia (17th)
  – 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (18th)
  – 72,000 cores (HPC2) in Italy (19th)
  – 152,692 cores (Thunder) at AFRL/USA (21st)
  – 147,456 cores (SuperMUC) in Germany (22nd)
  – 86,016 cores (SuperMUC Phase 2) in Germany (24th)
  – 76,032 cores (Tsubame 2.5) at Japan/GSIC (25th)
  – 194,616 cores (Cascade) at PNNL (27th)
  – 76,032 cores (Makman-2) at Saudi Aramco (32nd)
  – 110,400 cores (Pangea) in France (33rd)
  – 37,120 cores (Lomonosov-2) at Russia/MSU (35th)
  – 57,600 cores (SwiftLucy) in US (37th)
  – 55,728 cores (Prometheus) at Poland/Cyfronet (38th)
  – 50,544 cores (Occigen) at France/GENCI-CINES (43rd)
  – 76,896 cores (Salomon) SGI ICE in Czech Republic (47th)
  – and many more!
Two Major Categories of Applications

• Scientific Computing
  – Message Passing Interface (MPI), including MPI+OpenMP, is the dominant programming model
  – Many discussions towards Partitioned Global Address Space (PGAS): UPC, OpenSHMEM, CAF, etc.
  – Hybrid programming: MPI + PGAS (OpenSHMEM, UPC)
• Big Data/Enterprise/Commercial Computing
  – Focuses on large data and data analysis
  – Hadoop (HDFS, HBase, MapReduce)
  – Spark is emerging for in-memory computing
  – Memcached is also used for Web 2.0
Towards Exascale System (Today and Target)

| Systems                    | 2016 (Tianhe-2)                      | 2020-2024                     | Difference Today & Exascale |
|----------------------------|--------------------------------------|-------------------------------|-----------------------------|
| System peak                | 55 PFlop/s                           | 1 EFlop/s                     | ~20x                        |
| Power                      | 18 MW (3 GFlops/W)                   | ~20 MW (50 GFlops/W)          | O(1), ~15x                  |
| System memory              | 1.4 PB (1.024 PB CPU + 0.384 PB CoP) | 32-64 PB                      | ~50x                        |
| Node performance           | 3.43 TF/s (0.4 CPU + 3 CoP)          | 1.2 or 15 TF                  | O(1)                        |
| Node concurrency           | 24-core CPU + 171-core CoP           | O(1k) or O(10k)               | ~5x-~50x                    |
| Total node interconnect BW | 6.36 GB/s                            | 200-400 GB/s                  | ~40x-~60x                   |
| System size (nodes)        | 16,000                               | O(100,000) or O(1M)           | ~6x-~60x                    |
| Total concurrency          | 3.12M (12.48M threads, 4/core)       | O(billion) for latency hiding | ~100x                       |
| MTTI                       | Few/day                              | Many/day                      | O(?)                        |

Courtesy: Prof. Jack Dongarra
Basic Design Challenges for Exascale Systems

• Energy and Power Challenge: hard to solve power requirements for data movement
• Memory and Storage Challenge: hard to achieve high capacity and high data rate
• Concurrency and Locality Challenge: management of a very large amount of concurrency (billions of threads)
• Resiliency Challenge: low-voltage devices (for low power) introduce more faults
Parallel Programming Models Overview

[Figure: Three abstract machine models. Shared Memory Model (SHMEM, DSM): processes P1, P2, P3 over a single shared memory. Distributed Memory Model (MPI, the Message Passing Interface): P1, P2, P3 each with private memory. Partitioned Global Address Space (PGAS) (Global Arrays, UPC, Chapel, X10, CAF, ...): private memories combined into a logical shared memory.]

• Programming models provide abstract machine models
• Models can be mapped onto different types of systems, e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually receiving importance
MPI Overview and History

• Message Passing Library standardized by the MPI Forum: C and Fortran
• Goal: a portable, efficient and flexible standard for writing parallel applications
• Not an IEEE or ISO standard, but widely considered the "industry standard" for HPC applications
• Evolution of MPI
  – MPI-1: 1994
  – MPI-2: 1996
  – MPI-3.0: 2008-2012, standardized on September 21, 2012
  – MPI-3.1: 2012-2015, standardized on June 4, 2015
  – Next plan is for MPI 4.0
How does MPI Plan to Meet Exascale Challenges?

• Power required for data movement operations is one of the main challenges
• Non-blocking collectives: overlap computation and communication
• Much improved one-sided interface: reduce synchronization of sender/receiver
• Manage concurrency: improved interoperability with PGAS (e.g. UPC, Global Arrays, OpenSHMEM, CAF)
• Resiliency: new interface for detecting failures
Major New Features in MPI-3.0

• Major features in MPI 3.0
  – Non-blocking collectives
  – Improved one-sided (RMA) model
  – MPI Tools interface
• Specification is available from: http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
MPI-3 RMA: One-sided Communication Model

[Figure: Three processes P1, P2, P3, each with an HCA. After a global region creation (buffer information exchanged), P1 posts a write to P2 and a write to P3 to its HCA; the HCAs then write the data directly into the buffers at P2 and P3 without involving the remote CPUs.]
MPI-3 RMA: Communication and Synchronization Primitives

• Non-blocking one-sided communication routines
  – Put, Get (Rput, Rget)
  – Accumulate, Get_accumulate
  – Atomics
• Flexible synchronization operations to control initiation and completion

MPI One-sided Synchronization/Completion Primitives

| Synchronization          | Completion      |
|--------------------------|-----------------|
| Lock/Unlock              | Win_sync        |
| Lock_all/Unlock_all      | Flush           |
| Fence                    | Flush_all       |
| Post-Wait/Start-Complete | Flush_local     |
|                          | Flush_local_all |
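To make these primitives concrete, here is a minimal, illustrative sketch in C (not from the slides): each process exposes a buffer through a window, and rank 0 writes into its peers with passive-target synchronization, using Win_flush_all for completion. All calls are standard MPI-3.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process exposes one int through a window (the "global region"). */
    int *buf;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);
    *buf = -1;   /* -1 means "no put received yet" */

    /* Passive-target epoch: no matching calls needed at the targets. */
    MPI_Win_lock_all(0, win);
    if (rank == 0) {
        /* One-sided write of rank 0's ID into every other process. */
        for (int peer = 1; peer < size; peer++)
            MPI_Put(&rank, 1, MPI_INT, peer, 0, 1, MPI_INT, win);
        MPI_Win_flush_all(win);   /* completion: data is at the targets */
    }
    MPI_Win_unlock_all(win);

    MPI_Barrier(MPI_COMM_WORLD);
    printf("rank %d sees %d\n", rank, *buf);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```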
MPI-3 RMA: Overlapping Communication and Computation

• Network adapters can provide RDMA features that don't require software involvement at the remote side
• As long as puts/gets are executed as soon as they are issued, overlap can be achieved
• RDMA-based implementations do just that
MPI-3 Non-blocking Collective (NBC) Operations

• Enable overlap of computation with communication
• Non-blocking calls do not match blocking collective calls
  – MPI may use different algorithms for blocking and non-blocking collectives
  – Blocking collectives: optimized for latency
  – Non-blocking collectives: optimized for overlap
• A process calling an NBC operation
  – Schedules the collective operation and immediately returns
  – Executes application computation code
  – Waits for the end of the collective
• The communication is progressed by
  – Application code through MPI_Test
  – Network adapter (HCA) with hardware support
  – Dedicated processes/thread in the MPI library
• There is a non-blocking equivalent for each blocking operation
  – Has an "I" in the name (MPI_Bcast -> MPI_Ibcast; MPI_Reduce -> MPI_Ireduce)
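As an illustration of the pattern above, here is a hedged sketch in C that schedules an MPI_Ibcast, interleaves computation, and drives progress with MPI_Test; do_local_work is a hypothetical stand-in for application computation, not something from the talk.

```c
#include <mpi.h>
#include <stddef.h>

/* Stand-in for independent application computation. */
static void do_local_work(double *chunk, size_t n)
{
    for (size_t i = 0; i < n; i++)
        chunk[i] = chunk[i] * 1.000001 + 1.0;
}

/* Broadcast `data` from rank 0 while overlapping local work on `work`. */
void overlapped_bcast(double *data, int count,
                      double *work, size_t n, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Ibcast(data, count, MPI_DOUBLE, 0, comm, &req); /* schedule, return */

    int done = 0;
    while (!done) {
        do_local_work(work, n);                    /* compute */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* progress + completion */
    }
}
```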
MPI Tools Interface

• Extended tools support in MPI-3, beyond the PMPI interface
• Provides a standardized interface (MPI_T) to access MPI internal information
  – Configuration and control information: eager limit, buffer sizes, ...
  – Performance information: time spent in blocking, memory usage, ...
  – Debugging information: packet counters, thresholds, ...
• External tools can build on top of this standard interface
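For example, a tool built on this interface might enumerate the control variables (cvars) an MPI library exposes. The sketch below uses only standard MPI_T calls; which variables are reported (eager limits, buffer sizes, ...) depends entirely on the MPI implementation.

```c
#include <mpi.h>
#include <stdio.h>

/* List the control variables (cvars) an MPI library exposes via MPI_T. */
int main(int argc, char **argv)
{
    int provided, ncvars;
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    MPI_T_cvar_get_num(&ncvars);
    for (int i = 0; i < ncvars; i++) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, binding, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;
        if (MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                &enumtype, desc, &desc_len, &binding,
                                &scope) == MPI_SUCCESS)
            printf("cvar %d: %s -- %s\n", i, name, desc);
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}
```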
MPI-3.1 Enhancements

• MPI 3.1 was approved on June 4, 2015
  – Specification is available from: http://mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf
• Major features and enhancements:
  – Corrections to the Fortran bindings introduced in MPI-3.0
  – New functions added include routines to manipulate MPI_Aint values in a portable manner
  – Non-blocking collective I/O routines
  – Routines to get the index value by name for MPI_T performance and control variables
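A short sketch of the non-blocking collective I/O addition, assuming a file opened across MPI_COMM_WORLD; MPI_File_iread_all is the MPI-3.1 routine here, and the overlap region and parameter setup are illustrative.

```c
#include <mpi.h>

/* Start a non-blocking collective read and overlap it with computation.
   `filename`, `buf`, and `count` are assumed to be set up by the caller. */
void read_with_overlap(const char *filename, double *buf, int count)
{
    MPI_File fh;
    MPI_Request req;

    MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    /* MPI-3.1 non-blocking collective I/O: all ranks participate. */
    MPI_File_iread_all(fh, buf, count, MPI_DOUBLE, &req);

    /* ... independent computation can proceed here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```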
Partitioned Global Address Space (PGAS) Models

• Key features
  – Simple shared memory abstractions
  – Lightweight one-sided communication
  – Easier to express irregular communication
• Different approaches to PGAS
  – Languages
    • Unified Parallel C (UPC)
    • Co-Array Fortran (CAF)
    • X10
    • Chapel
  – Libraries
    • OpenSHMEM
    • UPC++
    • Global Arrays
OpenSHMEM

• SHMEM implementations: Cray SHMEM, SGI SHMEM, Quadrics SHMEM, HP SHMEM, GSHMEM
• Subtle differences in API across versions, for example:

|                | SGI SHMEM    | Quadrics SHMEM | Cray SHMEM  |
|----------------|--------------|----------------|-------------|
| Initialization | start_pes(0) | shmem_init     | start_pes   |
| Process ID     | _my_pe       | my_pe          | shmem_my_pe |

• This made application codes non-portable
• OpenSHMEM is an effort to address this: "A new, open specification to consolidate the various extant SHMEM versions into a widely accepted standard." – OpenSHMEM Specification v1.0
  – by University of Houston and Oak Ridge National Lab
  – SGI SHMEM is the baseline
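To illustrate the consolidated API, here is a minimal OpenSHMEM 1.0-style program in C using the SGI-baseline calls named above (start_pes, _my_pe); each PE performs a one-sided put into its right neighbor's symmetric variable. The neighbor-exchange pattern is illustrative, not from the talk.

```c
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    static int src_pe = -1;   /* symmetric: same address on every PE */

    start_pes(0);             /* OpenSHMEM 1.0 initialization */
    int me    = _my_pe();
    int npes  = _num_pes();
    int right = (me + 1) % npes;

    shmem_int_put(&src_pe, &me, 1, right);  /* one-sided, no matching recv */
    shmem_barrier_all();                    /* completion + synchronization */

    printf("PE %d received data from PE %d\n", me, src_pe);
    return 0;
}
```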
UPC, CAF and UPC++

• UPC: Unified Parallel C, a PGAS-based language extension to C
  – An ISO C99-based language providing a uniform programming model for both shared and distributed memory hardware to support HPC
  – UPC = UPC translator + C compiler + UPC runtime
• Coarray Fortran (CAF): language-level PGAS support in Fortran
  – An extension to Fortran to support global shared arrays (coarrays) in parallel Fortran applications
  – CAF = CAF compiler + CAF runtime (libcaf)
  – Basic support in Fortran 2008 and extended support for collectives in Fortran 2015
• UPC++: an object-oriented PGAS programming model
  – A compiler-free PGAS programming model in the context of C++
  – Built on top of C++ standard templates and runtime libraries
  – Extension to UPC's programming idioms
  – Register tasks for async execution
MPI+PGAS for Exascale Architectures and Applications

• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) model
  – MPI across address spaces
  – PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared memory programming models
• Applications can have kernels with different communication patterns
• Can benefit from different models
• Re-writing complete applications can be a huge effort
• Port critical kernels to the desired model instead
Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
• Benefits:
  – Best of the distributed computing model
  – Best of the shared memory computing model
• Exascale roadmap*: "Hybrid programming is a practical way to program exascale systems"

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420

[Figure: An HPC application composed of kernels 1..N, where some kernels (e.g. Kernel 2, Kernel N) are implemented in PGAS and the rest in MPI.]
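A minimal sketch of the idea, assuming a unified runtime (such as MVAPICH2-X, discussed later) that permits MPI and OpenSHMEM calls in the same program; the kernel split and the initialization order shown are illustrative, not prescribed by the talk, so consult the runtime's documentation for the exact requirements.

```c
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

static int halo = 0;   /* symmetric (data segment) variable */

int main(int argc, char **argv)
{
    /* With a unified runtime, both models can be initialized in one program;
       the required order is runtime-specific (assumed here). */
    start_pes(0);
    MPI_Init(&argc, &argv);

    int me;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    int npes = _num_pes();

    /* "Kernel" with irregular, one-sided traffic: OpenSHMEM put. */
    int val = me * 10;
    shmem_int_put(&halo, &val, 1, (me + 1) % npes);
    shmem_barrier_all();

    /* "Kernel" with regular, global traffic: MPI collective. */
    int sum = 0;
    MPI_Allreduce(&halo, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("PE %d: halo=%d, global sum=%d\n", me, halo, sum);

    MPI_Finalize();
    return 0;
}
```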
Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: Layered middleware co-design stack, with opportunities and challenges across the layers and the goals of performance, scalability and fault-resilience:
• Application kernels/applications
• Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Communication library or runtime for programming models: point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance
• Networking technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path)
• Multi-/many-core architectures
• Accelerators (NVIDIA and MIC)]
Broad Challenges in Designing Communication Libraries for (MPI+X) at Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Scalable job start-up
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
• Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
  – Multiple end-points per node
• Support for efficient multi-threading
• Integrated support for GPGPUs and accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI+OpenMP, MPI+UPC, MPI+OpenSHMEM, CAF, ...)
• Virtualization
• Energy-awareness
Additional Challenges for Designing Exascale Software Libraries

• Extreme low memory footprint
  – Memory per core continues to decrease
• D-L-A framework
  – Discover
    • Overall network topology (fat-tree, 3D, ...), network topology for the processes of a given job
    • Node architecture, health of network and node
  – Learn
    • Impact on performance and scalability
    • Potential for failure
  – Adapt
    • Internal protocols and algorithms
    • Process mapping
    • Fault-tolerance solutions
  – Low-overhead techniques while delivering performance, scalability and fault-tolerance
Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Used by more than 2,525 organizations in 77 countries
  – More than 356,000 (>0.36 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '15 ranking):
    • 10th-ranked 519,640-core cluster (Stampede) at TACC
    • 13th-ranked 185,344-core cluster (Pleiades) at NASA
    • 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – From System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops)
  – to Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)
MVAPICH2 Architecture

• High-performance parallel programming models
  – Message Passing Interface (MPI)
  – PGAS (UPC, OpenSHMEM, CAF, UPC++*)
  – Hybrid: MPI + X (MPI + PGAS + OpenMP/Cilk)
• High-performance and scalable communication runtime with diverse APIs and mechanisms
  – Point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection and analysis
• Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPower*, Xeon Phi (MIC, KNL*), NVIDIA GPGPU)
  – Transport protocols: RC, XRC, UD, DC
  – Modern network features: UMR, ODP*, SR-IOV, multi-rail
  – Transport mechanisms: shared memory, CMA, IVSHMEM
  – Modern architecture features: MCDRAM*, NVLink*, CAPI*

* Upcoming
MVAPICH Project Timeline

[Figure: Timeline from Oct '02 through 2016 showing the introduction and evolution of MVAPICH (from Oct '02, now EOL), OMB, MVAPICH2 (from Nov '04), MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-MIC, MVAPICH2-Virt, MVAPICH2-EA, and OSU-INAM.]
MVAPICH2 Software Family

| Requirements                                               | MVAPICH2 Library to use |
|------------------------------------------------------------|-------------------------|
| MPI with IB, iWARP and RoCE                                | MVAPICH2                |
| Advanced MPI, OSU INAM, PGAS and MPI+PGAS with IB and RoCE | MVAPICH2-X              |
| MPI with IB & GPU                                          | MVAPICH2-GDR            |
| MPI with IB & MIC                                          | MVAPICH2-MIC            |
| HPC Cloud with MPI & IB                                    | MVAPICH2-Virt           |
| Energy-aware MPI with IB, iWARP and RoCE                   | MVAPICH2-EA             |
MVAPICH/MVAPICH2 Release Timeline and Downloads

[Figure: Cumulative number of downloads from Sep '04 through Jan '16 (approaching 350,000), annotated with releases from MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2 2.1, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.1rc2, MV2-GDR 2.2b, MV2-X 2.2b, and MV2 2.2b.]
Overview of a Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
  – Support for advanced IB mechanisms (UMR and ODP)
  – Extremely minimal memory footprint
  – Scalable job start-up
• Collective communication
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• InfiniBand Network Analysis and Monitoring (INAM)
• Integrated support for GPGPUs
• Integrated support for MICs
• Virtualization (SR-IOV and container)
• Energy-awareness
One-way Latency: MPI over IB with MVAPICH2

[Figure: Small-message latency (values of 1.26, 1.19, 1.15 and 0.95 us across the four adapters below) and large-message latency versus message size for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, and ConnectX-4-EDR.

Platforms: TrueScale-QDR: 2.8 GHz deca-core (IvyBridge) Intel PCI Gen3 with IB switch; ConnectX-3-FDR: 2.8 GHz deca-core (IvyBridge) Intel PCI Gen3 with IB switch; ConnectIB-DualFDR: 2.8 GHz deca-core (IvyBridge) Intel PCI Gen3 with IB switch; ConnectX-4-EDR: 2.8 GHz deca-core (Haswell) Intel PCI Gen3, back-to-back.]
Bandwidth: MPI over IB with MVAPICH2

[Figure: Unidirectional bandwidth versus message size: 3,387 MB/s (TrueScale-QDR), 6,356 MB/s (ConnectX-3-FDR), 12,465 MB/s (ConnectIB-DualFDR), and 12,104 MB/s (ConnectX-4-EDR). Bidirectional bandwidth: 6,308 MB/s (TrueScale-QDR), 12,161 MB/s (ConnectX-3-FDR), 24,353 MB/s (ConnectIB-DualFDR), and 21,425 MB/s (ConnectX-4-EDR).

Platforms: same as in the latency figure above.]
MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support (LiMIC and CMA))

[Figure: With the latest MVAPICH2 2.2b on Intel Ivy-bridge: small-message latency of 0.18 us intra-socket and 0.45 us inter-socket; intra-socket bandwidth up to 14,250 MB/s and inter-socket bandwidth up to 13,749 MB/s, comparing CMA, shared-memory (Shmem) and LiMIC channels.]
User-mode Memory Registration (UMR)

• Introduced by Mellanox to support direct local and remote noncontiguous memory access
  – Avoids packing at the sender and unpacking at the receiver
• Available with MVAPICH2-X 2.2b

[Figure: Small/medium-message (4K-1M) and large-message (2M-16M) datatype latency, UMR versus the default scheme. Platform: Connect-IB (54 Gbps): 2.8 GHz dual ten-core (IvyBridge) Intel PCI Gen3 with Mellanox IB FDR switch.]

M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits, CLUSTER, 2015
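The kind of noncontiguous transfer UMR accelerates can be expressed with a standard MPI derived datatype; this illustrative sketch sends a strided matrix column without manual packing. Whether the library packs internally or gathers the strided elements directly via UMR-style hardware support is the implementation's choice, not something this code controls.

```c
#include <mpi.h>

/* Send one column of an n x n row-major matrix without manual packing. */
void send_column(double *matrix, int n, int col, int dest, MPI_Comm comm)
{
    MPI_Datatype column;
    /* n blocks of 1 double, stride n elements apart: a matrix column. */
    MPI_Type_vector(n, 1, n, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    MPI_Send(&matrix[col], 1, column, dest, /*tag=*/0, comm);

    MPI_Type_free(&column);
}
```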
On-Demand Paging (ODP)

• Introduced by Mellanox to support direct remote memory access without pinning
• Memory regions are paged in/out dynamically by the HCA/OS
• Size of registered buffers can be larger than physical memory
• Will be available in a future MVAPICH2 release

[Figure: Graph500 pin-down buffer sizes (MB) and BFS kernel execution time (s) at 16, 32 and 64 processes, pin-down versus ODP. Platform: Connect-IB (54 Gbps): 2.6 GHz dual octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch.]
Minimizing Memory Footprint with the Dynamically Connected (DC) Transport

[Figure: Four nodes (processes P0-P7) communicating over an IB network through DC initiator/target objects.]

• Constant connection cost (one QP for any peer)
• Full feature set (RDMA, atomics, etc.)
• Separate objects for send (DC Initiator) and receive (DC Target)
  – DC Target identified by "DCT Number"
  – Messages routed with (DCT Number, LID)
  – Requires the same "DC Key" to enable communication
• Available since MVAPICH2-X 2.2a

[Figure: Connection memory (KB) for Alltoall at 80-640 processes for RC, DC-Pool, UD and XRC (RC and XRC grow into the thousands of KB, reaching 1,022 KB and 4,797 KB, while DC-Pool and UD remain roughly constant at ~1-13 KB), and normalized NAMD (apoa1, large dataset) execution time at 160-620 processes.]

H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14)
Towards High-Performance and Scalable Startup at Exascale

• Near-constant MPI and OpenSHMEM initialization time at any process count
• 10x and 30x improvement in startup time of MPI and OpenSHMEM, respectively, at 16,384 processes
• Memory consumption for remote endpoint information reduced by O(processes per node)
• 1 GB of memory saved per node with 1M processes and 16 processes per node

[Figure: Job startup performance and memory required to store endpoint information, comparing state-of-the-art PGAS (P) and MPI (M) with the optimized PGAS/MPI design (O), built from: (a) on-demand connection, (b) PMIX_Ring, (c) PMIX_Ibarrier, (d) PMIX_Iallgather, (e) shmem-based PMI.]

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D. K. Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS '15)

PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D. K. Panda, Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia '14)

Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '15)

SHMEMPMI – Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), accepted for publication
Process Management Interface over Shared Memory (SHMEMPMI)

• SHMEMPMI allows MPI processes to directly read remote endpoint (EP) information from the process manager through shared memory segments
• Only a single copy per node: O(processes per node) reduction in memory usage
• Estimated savings of 1 GB per node with 1 million processes and 16 processes per node
• Up to 1,000 times faster PMI Gets compared to the default design; will be available in MVAPICH2 2.2 RC1

[Figure: Time taken by one PMI_Get for 1-32 processes per node, default versus SHMEMPMI (16x faster measured), and memory usage per node for remote EP information for 16 to 1M processes per job, Fence/Allgather default versus shmem (estimated 1000x reduction). Platform: TACC Stampede, Connect-IB (54 Gbps): 2.6 GHz quad octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR.]

SHMEMPMI – Shared Memory Based PMI for Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), accepted for publication
Overview of a Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
  – Offload and non-blocking
  – Topology-aware
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• InfiniBand Network Analysis and Monitoring (INAM)
Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload Co-Direct Hardware (Available since MVAPICH2-X 2.2a)

• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
• Modified pre-conjugate gradient solver with Offload-Allreduce does up to 21.8% better than the default version

[Figure: Normalized HPL performance (HPL-Offload, HPL-1ring, HPL-Host) versus HPL problem size (N) as a percentage of total memory (4.5% gain); P3DFFT application run-time versus data size 512-800 (17% gain); PCG run-time at 64-512 processes, PCG-Default versus Modified-PCG-Offload (21.8% gain).]

K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011

K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011

K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12

Can Network-Offload-based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS '12
Network-Topology-Aware Placement of Processes

• Can we design a highly scalable network topology detection service for IB?
• How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
• What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?

[Figure: Overall performance and split-up of physical communication for MILC on Ranger: performance for varying system sizes, plus default versus topology-aware placement for a 2048-core run (15% gain).]

• Reduced network topology discovery time from O(N^2 hosts) to O(N hosts)
• 15% improvement in MILC execution time @ 2048 cores
• 15% improvement in Hypre execution time @ 1024 cores

H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC '12. BEST Paper and BEST STUDENT Paper Finalist
Overview of a Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• InfiniBand Network Analysis and Monitoring (INAM)
MVAPICH2-X for Advanced MPI and Hybrid MPI+PGAS Applications

[Figure: MPI, OpenSHMEM, UPC, CAF, UPC++ or hybrid (MPI + PGAS) applications issue MPI, OpenSHMEM, UPC, CAF and UPC++ calls into the unified MVAPICH2-X runtime, which runs over InfiniBand, RoCE or iWARP.]

• Unified communication runtime for MPI, UPC, OpenSHMEM, CAF, UPC++, available with MVAPICH2-X 1.9 onwards (since 2012)
• UPC++ support will be available in the upcoming MVAPICH2-X 2.2 RC1
• Feature highlights
  – Supports MPI (+OpenMP), OpenSHMEM, UPC, CAF, UPC++, MPI (+OpenMP) + OpenSHMEM, MPI (+OpenMP) + UPC
  – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH), UPC++
  – Scalable inter-node and intra-node communication: point-to-point and collectives
Application-Level Performance with Graph500 and Sort

Graph500 execution time:

[Figure: Graph500 execution time at 4K, 8K and 16K processes for MPI-Simple, MPI-CSC, MPI-CSR and hybrid (MPI+OpenSHMEM) designs; the hybrid design is 7.6x faster than MPI-Simple at 8K processes and 13x faster at 16K processes.]

• Performance of the hybrid (MPI+OpenSHMEM) Graph500 design
  – 8,192 processes: 2.4x improvement over MPI-CSR, 7.6x improvement over MPI-Simple
  – 16,384 processes: 1.5x improvement over MPI-CSR, 13x improvement over MPI-Simple

Sort execution time:

[Figure: Sort execution time for MPI versus hybrid designs from 500 GB at 512 processes up to 4 TB at 4K processes; the hybrid design is 51% faster at the largest scale.]

• Performance of the hybrid (MPI+OpenSHMEM) Sort application
  – 4,096 processes, 4 TB input size: MPI 2408 sec (0.16 TB/min); hybrid 1172 sec (0.36 TB/min); 51% improvement over the MPI design

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013
MiniMD: Total Execution Time

• The hybrid design performs better than the MPI implementation
• 1,024 processes: 17% improvement over the MPI version
• Strong scaling with input size 128x128x128

[Figure: Total execution time (ms) for Hybrid-Barrier, MPI-Original and Hybrid-Advanced at 512 and 1,024 cores (performance) and at 256, 512 and 1,024 cores (strong scaling); 17% gain at 1,024 cores.]

M. Li, J. Lin, X. Lu, K. Hamidouche, K. Tomko and D. K. Panda, Scalable MiniMD Design with Hybrid MPI and OpenSHMEM, OpenSHMEM User Group Meeting (OUG '14), held in conjunction with the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS '14)
Hybrid MPI+UPC NAS-FT

• Modified the NAS FT UPC all-to-all pattern to use MPI_Alltoall
• Truly hybrid program
• For FT (Class C, 128 processes):
  – 34% improvement over UPC-GASNet
  – 30% improvement over UPC-OSU

[Figure: NAS-FT time (s) for problem sizes B-64, C-64, B-128 and C-128 with UPC-GASNet, UPC-OSU and Hybrid-OSU; 34% gain for C-128.]

J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS '10), October 2010

Hybrid MPI+UPC support available since MVAPICH2-X 1.9 (2012)
Overview of a Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• InfiniBand Network Analysis and Monitoring (INAM)
Overview of OSU INAM

• A network monitoring and analysis tool capable of analyzing traffic on the InfiniBand network with inputs from the MPI runtime
  – http://mvapich.cse.ohio-state.edu/tools/osu-inam/
  – http://mvapich.cse.ohio-state.edu/userguide/osu-inam/
• Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes
• Capability to analyze and profile node-level, job-level and process-level activities for MPI communication (point-to-point, collectives and RMA)
• Ability to filter data based on type of counters using a "drop-down" list
• Remotely monitor various metrics of MPI processes at user-specified granularity
• "Job page" to display jobs in ascending/descending order of various performance metrics, in conjunction with MVAPICH2-X
• Visualize the data transfer happening in a "live" or "historical" fashion for the entire network, a job or a set of nodes
OSU INAM: Network Level View

• Show the network topology of large clusters
• Visualize traffic patterns on different links
• Quickly identify congested links and links in error state
• See the history unfold: play back the historical state of the network

[Figure: Full network view (152 nodes) and a zoomed-in view of the network.]
OSU INAM: Job and Node Level Views

[Figure: Visualizing a job (5 nodes) and finding routes between nodes.]

• Job level view
  – Show different network metrics (load, error, etc.) for any live job
  – Play back historical data for completed jobs to identify bottlenecks
• Node level view provides details per processor per node
  – CPU utilization for each rank/node
  – Bytes sent/received for MPI operations (point-to-point, collective, RMA)
  – Network metrics (e.g. XmitDiscard, RcvError) per rank/node
Live Node Level View
Live Switch Level View
List of Supported Switch Counters

The following counters are queried from the InfiniBand switches:

• XmitData
  – Total number of data octets, divided by 4, transmitted on all VLs from the port
  – This includes all octets between (and not including) the start-of-packet delimiter and the VCRC, and may include packets containing errors
  – Excludes all link packets
• RcvData
  – Total number of data octets, divided by 4, received on all VLs from the port
  – This includes all octets between (and not including) the start-of-packet delimiter and the VCRC, and may include packets containing errors
  – Excludes all link packets
• Max[XmitData/RcvData]: maximum of the two values above
List of Supported MPI Process Level Counters

MVAPICH2-X collects additional information about each process's network usage, which can be displayed by OSU INAM:

• XmitData: total number of bytes transmitted as part of the MPI application
• RcvData: total number of bytes received as part of the MPI application
• Max[XmitData/RcvData]: maximum of the two values above
• Point-to-Point Sent: total number of bytes transmitted as part of MPI point-to-point operations
• Point-to-Point Rcvd: total number of bytes received as part of MPI point-to-point operations
• Max[Point-to-Point Sent/Rcvd]: maximum of the two values above
• CollBytesSent: total number of bytes transmitted as part of MPI collective operations
• CollBytesRcvd: total number of bytes received as part of MPI collective operations
List of Supported MPI Process Level Counters (Cont.)

• Max[CollBytesSent/Rcvd]: maximum of the two values above
• RMABytesSent: total number of bytes transmitted as part of MPI RMA operations
  – Note that due to the nature of RMA operations, bytes received for RMA operations cannot be counted
• RC VBUF: the number of internal communication buffers used for reliable connection (RC)
• UD VBUF: the number of internal communication buffers used for unreliable datagram (UD)
• VMSize: total number of bytes used by the program for its virtual memory
• VMPeak: maximum number of virtual memory bytes used by the program
• VMRSS: the number of bytes resident in memory (resident set size)
• VMHWM: the maximum number of bytes that have been resident in memory (peak resident set size, or high water mark)
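On Linux, these per-process VM quantities correspond to the VmSize, VmPeak, VmRSS and VmHWM fields of /proc/self/status; the sketch below reads them that way as one plausible mechanism (the talk does not specify how MVAPICH2-X actually collects them).

```c
#include <stdio.h>
#include <string.h>

/* Print the VmSize, VmPeak, VmRSS and VmHWM lines for this process,
   as reported by the Linux kernel in /proc/self/status. */
void print_vm_counters(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    if (!f)
        return;
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "VmSize", 6) == 0 ||
            strncmp(line, "VmPeak", 6) == 0 ||
            strncmp(line, "VmRSS", 5)  == 0 ||
            strncmp(line, "VmHWM", 5)  == 0)
            fputs(line, stdout);
    fclose(f);
}
```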
List of Supported Network Error Counters (Cont.)

• XmtDiscards
  – Total number of outbound packets discarded by the port because the port is down or congested. Reasons for this include:
    • Output port is not in the active state
    • Packet length exceeded neighbor MTU
    • Switch lifetime limit exceeded
    • Switch HOQ lifetime limit exceeded
    • This may also include packets discarded while in VL Stalled state
• XmtConstraintErrors
  – Total number of packets not transmitted from the switch physical port for the following reasons:
    • FilterRawOutbound is true and the packet is raw
    • PartitionEnforcementOutbound is true and the packet fails the partition key check or IP version check
• RcvConstraintErrors
  – Total number of packets not received from the switch physical port for the following reasons:
    • FilterRawInbound is true and the packet is raw
    • PartitionEnforcementInbound is true and the packet fails the partition key check or IP version check
• LinkIntegrityErrors
  – The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors
• ExcBufOverrunErrors
  – The number of times that OverrunErrors consecutive flow-control update periods occurred, each having at least one overrun error
• VL15Dropped: number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) in the port
List of Supported Network Error Counters

The following error counters are available both at switch and process level:

• SymbolErrors: total number of minor link errors detected on one or more physical lanes
• LinkRecovers: total number of times the Port Training state machine has successfully completed the link error recovery process
• LinkDowned: total number of times the Port Training state machine has failed the link error recovery process and downed the link
• RcvErrors: total number of packets containing an error that were received on the port. These errors include:
  – Local physical errors
  – Malformed data packet errors
  – Malformed link packet errors
  – Packets discarded due to buffer overrun
• RcvRemotePhysErrors: total number of packets marked with the EBP delimiter received on the port
• RcvSwitchRelayErrors: total number of packets received on the port that were discarded because they could not be forwarded by the switch relay
Conclusions

• Provided an overview of programming models for exascale systems
• Outlined the associated challenges in designing runtimes for these programming models
• Demonstrated how the MVAPICH2 project is addressing some of these challenges
Additional Challenges to be Covered in Today's 1:30 pm Talk

• Integrated support for GPGPUs
• Integrated support for MICs
• Virtualization (SR-IOV and container)
• Energy-awareness
• Best practice: a set of tunings for common applications (available through the MVAPICH website)
Thank You!

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/