Programming Models for Exascale Systems
TRANSCRIPT
High-Performance and Scalable Designs of Programming Models for Exascale Systems

by Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Talk at HPCAC-Switzerland (Mar 2016)
High-End Computing (HEC): ExaFlop & ExaByte

• ExaFlop & HPC: 100-200 PFlops in 2016-2018; 1 EFlops in 2020-2024?
• ExaByte & Big Data: 10K-20K EBytes in 2016-2018; 40K EBytes in 2020?

[Figure 1. Source: IDC's Digital Universe Study, sponsored by EMC, December 2012]
Within these broad outlines of the digital universe are some singularities worth noting.
First, while the portion of the digital universe holding potential analytic value is growing, only a tiny fraction of territory has been explored. IDC estimates that by 2020, as much as 33% of the digital universe will contain information that might be valuable if analyzed, compared with 25% today. This untapped value could be found in patterns in social media usage, correlations in scientific data from discrete studies, medical information intersected with sociological data, faces in security footage, and so on. However, even with a generous estimate, the amount of information in the digital universe that is "tagged" accounts for only about 3% of the digital universe in 2012, and that which is analyzed is half a percent of the digital universe. Herein is the promise of "Big Data" technology — the extraction of value from the large untapped pools of data in the digital universe.
Trends for Commodity Computing Clusters in the Top500 List (http://www.top500.org)

[Figure: Number and percentage of commodity clusters in the Top500 over time; clusters now account for 85% of the list.]
Drivers of Modern HPC Cluster Architectures

• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

[Figure: Example systems Tianhe-2, Titan, Stampede, Tianhe-1A, built from multi-core processors; high-performance interconnects (InfiniBand: <1 usec latency, 100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM.]
Large-scale InfiniBand Installations

• 235 IB clusters (47%) in the Nov '15 Top500 list (http://www.top500.org)
• Installations in the Top 50 (21 systems):
  – 462,462 cores (Stampede) at TACC (10th)
  – 185,344 cores (Pleiades) at NASA/Ames (13th)
  – 72,800 cores Cray CS-Storm in US (15th)
  – 72,800 cores Cray CS-Storm in US (16th)
  – 265,440 cores SGI ICE at Tulip Trading Australia (17th)
  – 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (18th)
  – 72,000 cores (HPC2) in Italy (19th)
  – 152,692 cores (Thunder) at AFRL/USA (21st)
  – 147,456 cores (SuperMUC) in Germany (22nd)
  – 86,016 cores (SuperMUC Phase 2) in Germany (24th)
  – 76,032 cores (Tsubame 2.5) at Japan/GSIC (25th)
  – 194,616 cores (Cascade) at PNNL (27th)
  – 76,032 cores (Makman-2) at Saudi Aramco (32nd)
  – 110,400 cores (Pangea) in France (33rd)
  – 37,120 cores (Lomonosov-2) at Russia/MSU (35th)
  – 57,600 cores (SwiftLucy) in US (37th)
  – 55,728 cores (Prometheus) at Poland/Cyfronet (38th)
  – 50,544 cores (Occigen) at France/GENCI-CINES (43rd)
  – 76,896 cores (Salomon) SGI ICE in Czech Republic (47th)
  – and many more!
Two Major Categories of Applications

• Scientific Computing
  – Message Passing Interface (MPI), including MPI+OpenMP, is the dominant programming model
  – Many discussions towards Partitioned Global Address Space (PGAS): UPC, OpenSHMEM, CAF, etc.
  – Hybrid programming: MPI + PGAS (OpenSHMEM, UPC)
• Big Data/Enterprise/Commercial Computing
  – Focuses on large data and data analysis
  – Hadoop (HDFS, HBase, MapReduce)
  – Spark is emerging for in-memory computing
  – Memcached is also used for Web 2.0
Towards Exascale System (Today and Target)

| Systems                    | 2016 (Tianhe-2)                      | 2020-2024                     | Difference Today & Exascale |
|----------------------------|--------------------------------------|-------------------------------|-----------------------------|
| System peak                | 55 PFlop/s                           | 1 EFlop/s                     | ~20x                        |
| Power                      | 18 MW (3 GFlops/W)                   | ~20 MW (50 GFlops/W)          | O(1), ~15x                  |
| System memory              | 1.4 PB (1.024 PB CPU + 0.384 PB CoP) | 32-64 PB                      | ~50x                        |
| Node performance           | 3.43 TF/s (0.4 CPU + 3 CoP)          | 1.2 or 15 TF                  | O(1)                        |
| Node concurrency           | 24-core CPU + 171-core CoP           | O(1k) or O(10k)               | ~5x-~50x                    |
| Total node interconnect BW | 6.36 GB/s                            | 200-400 GB/s                  | ~40x-~60x                   |
| System size (nodes)        | 16,000                               | O(100,000) or O(1M)           | ~6x-~60x                    |
| Total concurrency          | 3.12M (12.48M threads, 4/core)       | O(billion) for latency hiding | ~100x                       |
| MTTI                       | Few/day                              | Many/day                      | O(?)                        |

Courtesy: Prof. Jack Dongarra
Basic Design Challenges for Exascale Systems

• Energy and Power Challenge: hard to solve power requirements for data movement
• Memory and Storage Challenge: hard to achieve high capacity and high data rate
• Concurrency and Locality Challenge: management of a very large amount of concurrency (billions of threads)
• Resiliency Challenge: low-voltage devices (for low power) introduce more faults
Parallel Programming Models Overview

[Figure: Three abstract machine models. Shared Memory Model (SHMEM, DSM): processes P1, P2, P3 over a single shared memory. Distributed Memory Model (MPI, the Message Passing Interface): P1, P2, P3 each with private memory. Partitioned Global Address Space (PGAS) (Global Arrays, UPC, Chapel, X10, CAF, ...): private memories combined into a logical shared memory.]

• Programming models provide abstract machine models
• Models can be mapped onto different types of systems, e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually receiving importance
MPI Overview and History

• Message Passing Library standardized by the MPI Forum: C and Fortran
• Goal: a portable, efficient and flexible standard for writing parallel applications
• Not an IEEE or ISO standard, but widely considered the "industry standard" for HPC applications
• Evolution of MPI
  – MPI-1: 1994
  – MPI-2: 1996
  – MPI-3.0: 2008-2012, standardized on September 21, 2012
  – MPI-3.1: 2012-2015, standardized on June 4, 2015
  – Next plan is for MPI 4.0
How does MPI Plan to Meet Exascale Challenges?

• Power required for data movement operations is one of the main challenges
• Non-blocking collectives: overlap computation and communication
• Much improved one-sided interface: reduce synchronization of sender/receiver
• Manage concurrency: improved interoperability with PGAS (e.g. UPC, Global Arrays, OpenSHMEM, CAF)
• Resiliency: new interface for detecting failures
Major New Features in MPI-3.0

• Major features in MPI 3.0
  – Non-blocking collectives
  – Improved one-sided (RMA) model
  – MPI Tools interface
• Specification is available from: http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
MPI-3 RMA: One-sided Communication Model

[Figure: Three processes P1, P2, P3, each with an HCA. After a global region creation (buffer information exchanged), P1 posts a write to P2 and a write to P3 to its HCA; the HCAs then write the data directly into the buffers at P2 and P3 without involving the remote CPUs.]
MPI-3 RMA: Communication and Synchronization Primitives

• Non-blocking one-sided communication routines
  – Put, Get (Rput, Rget)
  – Accumulate, Get_accumulate
  – Atomics
• Flexible synchronization operations to control initiation and completion

MPI One-sided Synchronization/Completion Primitives

| Synchronization          | Completion      |
|--------------------------|-----------------|
| Lock/Unlock              | Win_sync        |
| Lock_all/Unlock_all      | Flush           |
| Fence                    | Flush_all       |
| Post-Wait/Start-Complete | Flush_local     |
|                          | Flush_local_all |
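To make these primitives concrete, here is a minimal, illustrative sketch in C (not from the slides): each process exposes a buffer through a window, and rank 0 writes into its peers with passive-target synchronization, using Win_flush_all for completion. All calls are standard MPI-3.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process exposes one int through a window (the "global region"). */
    int *buf;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);
    *buf = -1;   /* -1 means "no put received yet" */

    /* Passive-target epoch: no matching calls needed at the targets. */
    MPI_Win_lock_all(0, win);
    if (rank == 0) {
        /* One-sided write of rank 0's ID into every other process. */
        for (int peer = 1; peer < size; peer++)
            MPI_Put(&rank, 1, MPI_INT, peer, 0, 1, MPI_INT, win);
        MPI_Win_flush_all(win);   /* completion: data is at the targets */
    }
    MPI_Win_unlock_all(win);

    MPI_Barrier(MPI_COMM_WORLD);
    printf("rank %d sees %d\n", rank, *buf);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```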
MPI-3 RMA: Overlapping Communication and Computation

• Network adapters can provide RDMA features that don't require software involvement at the remote side
• As long as puts/gets are executed as soon as they are issued, overlap can be achieved
• RDMA-based implementations do just that
MPI-3 Non-blocking Collective (NBC) Operations

• Enable overlap of computation with communication
• Non-blocking calls do not match blocking collective calls
  – MPI may use different algorithms for blocking and non-blocking collectives
  – Blocking collectives: optimized for latency
  – Non-blocking collectives: optimized for overlap
• A process calling an NBC operation
  – Schedules the collective operation and immediately returns
  – Executes application computation code
  – Waits for the end of the collective
• The communication is progressed by
  – Application code through MPI_Test
  – Network adapter (HCA) with hardware support
  – Dedicated processes/thread in the MPI library
• There is a non-blocking equivalent for each blocking operation
  – Has an "I" in the name (MPI_Bcast -> MPI_Ibcast; MPI_Reduce -> MPI_Ireduce)
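As an illustration of the pattern above, here is a hedged sketch in C that schedules an MPI_Ibcast, interleaves computation, and drives progress with MPI_Test; do_local_work is a hypothetical stand-in for application computation, not something from the talk.

```c
#include <mpi.h>
#include <stddef.h>

/* Stand-in for independent application computation. */
static void do_local_work(double *chunk, size_t n)
{
    for (size_t i = 0; i < n; i++)
        chunk[i] = chunk[i] * 1.000001 + 1.0;
}

/* Broadcast `data` from rank 0 while overlapping local work on `work`. */
void overlapped_bcast(double *data, int count,
                      double *work, size_t n, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Ibcast(data, count, MPI_DOUBLE, 0, comm, &req); /* schedule, return */

    int done = 0;
    while (!done) {
        do_local_work(work, n);                    /* compute */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* progress + completion */
    }
}
```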
MPI Tools Interface

• Extended tools support in MPI-3, beyond the PMPI interface
• Provides a standardized interface (MPI_T) to access MPI internal information
  – Configuration and control information: eager limit, buffer sizes, ...
  – Performance information: time spent in blocking, memory usage, ...
  – Debugging information: packet counters, thresholds, ...
• External tools can build on top of this standard interface
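For example, a tool built on this interface might enumerate the control variables (cvars) an MPI library exposes. The sketch below uses only standard MPI_T calls; which variables are reported (eager limits, buffer sizes, ...) depends entirely on the MPI implementation.

```c
#include <mpi.h>
#include <stdio.h>

/* List the control variables (cvars) an MPI library exposes via MPI_T. */
int main(int argc, char **argv)
{
    int provided, ncvars;
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    MPI_T_cvar_get_num(&ncvars);
    for (int i = 0; i < ncvars; i++) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, binding, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;
        if (MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                &enumtype, desc, &desc_len, &binding,
                                &scope) == MPI_SUCCESS)
            printf("cvar %d: %s -- %s\n", i, name, desc);
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}
```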
MPI-3.1 Enhancements

• MPI 3.1 was approved on June 4, 2015
  – Specification is available from: http://mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf
• Major features and enhancements:
  – Corrections to the Fortran bindings introduced in MPI-3.0
  – New functions added include routines to manipulate MPI_Aint values in a portable manner
  – Non-blocking collective I/O routines
  – Routines to get the index value by name for MPI_T performance and control variables
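A short sketch of the non-blocking collective I/O addition, assuming a file opened across MPI_COMM_WORLD; MPI_File_iread_all is the MPI-3.1 routine here, and the overlap region and parameter setup are illustrative.

```c
#include <mpi.h>

/* Start a non-blocking collective read and overlap it with computation.
   `filename`, `buf`, and `count` are assumed to be set up by the caller. */
void read_with_overlap(const char *filename, double *buf, int count)
{
    MPI_File fh;
    MPI_Request req;

    MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    /* MPI-3.1 non-blocking collective I/O: all ranks participate. */
    MPI_File_iread_all(fh, buf, count, MPI_DOUBLE, &req);

    /* ... independent computation can proceed here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```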
Partitioned Global Address Space (PGAS) Models

• Key features
  – Simple shared memory abstractions
  – Lightweight one-sided communication
  – Easier to express irregular communication
• Different approaches to PGAS
  – Languages
    • Unified Parallel C (UPC)
    • Co-Array Fortran (CAF)
    • X10
    • Chapel
  – Libraries
    • OpenSHMEM
    • UPC++
    • Global Arrays
OpenSHMEM

• SHMEM implementations: Cray SHMEM, SGI SHMEM, Quadrics SHMEM, HP SHMEM, GSHMEM
• Subtle differences in API across versions, for example:

|                | SGI SHMEM    | Quadrics SHMEM | Cray SHMEM  |
|----------------|--------------|----------------|-------------|
| Initialization | start_pes(0) | shmem_init     | start_pes   |
| Process ID     | _my_pe       | my_pe          | shmem_my_pe |

• This made application codes non-portable
• OpenSHMEM is an effort to address this: "A new, open specification to consolidate the various extant SHMEM versions into a widely accepted standard." – OpenSHMEM Specification v1.0
  – by University of Houston and Oak Ridge National Lab
  – SGI SHMEM is the baseline
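To illustrate the consolidated API, here is a minimal OpenSHMEM 1.0-style program in C using the SGI-baseline calls named above (start_pes, _my_pe); each PE performs a one-sided put into its right neighbor's symmetric variable. The neighbor-exchange pattern is illustrative, not from the talk.

```c
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    static int src_pe = -1;   /* symmetric: same address on every PE */

    start_pes(0);             /* OpenSHMEM 1.0 initialization */
    int me    = _my_pe();
    int npes  = _num_pes();
    int right = (me + 1) % npes;

    shmem_int_put(&src_pe, &me, 1, right);  /* one-sided, no matching recv */
    shmem_barrier_all();                    /* completion + synchronization */

    printf("PE %d received data from PE %d\n", me, src_pe);
    return 0;
}
```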
UPC, CAF and UPC++

• UPC: Unified Parallel C, a PGAS-based language extension to C
  – An ISO C99-based language providing a uniform programming model for both shared and distributed memory hardware to support HPC
  – UPC = UPC translator + C compiler + UPC runtime
• Coarray Fortran (CAF): language-level PGAS support in Fortran
  – An extension to Fortran to support global shared arrays (coarrays) in parallel Fortran applications
  – CAF = CAF compiler + CAF runtime (libcaf)
  – Basic support in Fortran 2008 and extended support for collectives in Fortran 2015
• UPC++: an object-oriented PGAS programming model
  – A compiler-free PGAS programming model in the context of C++
  – Built on top of C++ standard templates and runtime libraries
  – Extension to UPC's programming idioms
  – Register tasks for async execution
MPI+PGAS for Exascale Architectures and Applications

• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) model
  – MPI across address spaces
  – PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared memory programming models
• Applications can have kernels with different communication patterns
• Can benefit from different models
• Re-writing complete applications can be a huge effort
• Port critical kernels to the desired model instead
Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
• Benefits:
  – Best of the distributed computing model
  – Best of the shared memory computing model
• Exascale roadmap*: "Hybrid programming is a practical way to program exascale systems"

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420

[Figure: An HPC application composed of kernels 1..N, where some kernels (e.g. Kernel 2, Kernel N) are implemented in PGAS and the rest in MPI.]
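A minimal sketch of the idea, assuming a unified runtime (such as MVAPICH2-X, discussed later) that permits MPI and OpenSHMEM calls in the same program; the kernel split and the initialization order shown are illustrative, not prescribed by the talk, so consult the runtime's documentation for the exact requirements.

```c
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

static int halo = 0;   /* symmetric (data segment) variable */

int main(int argc, char **argv)
{
    /* With a unified runtime, both models can be initialized in one program;
       the required order is runtime-specific (assumed here). */
    start_pes(0);
    MPI_Init(&argc, &argv);

    int me;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    int npes = _num_pes();

    /* "Kernel" with irregular, one-sided traffic: OpenSHMEM put. */
    int val = me * 10;
    shmem_int_put(&halo, &val, 1, (me + 1) % npes);
    shmem_barrier_all();

    /* "Kernel" with regular, global traffic: MPI collective. */
    int sum = 0;
    MPI_Allreduce(&halo, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("PE %d: halo=%d, global sum=%d\n", me, halo, sum);

    MPI_Finalize();
    return 0;
}
```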
Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: Layered middleware co-design stack, with opportunities and challenges across the layers and the goals of performance, scalability and fault-resilience:
• Application kernels/applications
• Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Communication library or runtime for programming models: point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance
• Networking technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path)
• Multi-/many-core architectures
• Accelerators (NVIDIA and MIC)]
Broad Challenges in Designing Communication Libraries for (MPI+X) at Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Scalable job start-up
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
• Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
  – Multiple end-points per node
• Support for efficient multi-threading
• Integrated support for GPGPUs and accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI+OpenMP, MPI+UPC, MPI+OpenSHMEM, CAF, ...)
• Virtualization
• Energy-awareness
Additional Challenges for Designing Exascale Software Libraries

• Extreme low memory footprint
  – Memory per core continues to decrease
• D-L-A framework
  – Discover
    • Overall network topology (fat-tree, 3D, ...), network topology for the processes of a given job
    • Node architecture, health of network and node
  – Learn
    • Impact on performance and scalability
    • Potential for failure
  – Adapt
    • Internal protocols and algorithms
    • Process mapping
    • Fault-tolerance solutions
  – Low-overhead techniques while delivering performance, scalability and fault-tolerance
Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Used by more than 2,525 organizations in 77 countries
  – More than 356,000 (>0.36 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '15 ranking):
    • 10th-ranked 519,640-core cluster (Stampede) at TACC
    • 13th-ranked 185,344-core cluster (Pleiades) at NASA
    • 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – From System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops)
  – to Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)
MVAPICH2 Architecture

• High-performance parallel programming models
  – Message Passing Interface (MPI)
  – PGAS (UPC, OpenSHMEM, CAF, UPC++*)
  – Hybrid: MPI + X (MPI + PGAS + OpenMP/Cilk)
• High-performance and scalable communication runtime with diverse APIs and mechanisms
  – Point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection and analysis
• Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPower*, Xeon Phi (MIC, KNL*), NVIDIA GPGPU)
  – Transport protocols: RC, XRC, UD, DC
  – Modern network features: UMR, ODP*, SR-IOV, multi-rail
  – Transport mechanisms: shared memory, CMA, IVSHMEM
  – Modern architecture features: MCDRAM*, NVLink*, CAPI*

* Upcoming
MVAPICH Project Timeline

[Figure: Timeline from Oct '02 through 2016 showing the introduction and evolution of MVAPICH (from Oct '02, now EOL), OMB, MVAPICH2 (from Nov '04), MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-MIC, MVAPICH2-Virt, MVAPICH2-EA, and OSU-INAM.]
MVAPICH2 Software Family

| Requirements                                               | MVAPICH2 Library to use |
|------------------------------------------------------------|-------------------------|
| MPI with IB, iWARP and RoCE                                | MVAPICH2                |
| Advanced MPI, OSU INAM, PGAS and MPI+PGAS with IB and RoCE | MVAPICH2-X              |
| MPI with IB & GPU                                          | MVAPICH2-GDR            |
| MPI with IB & MIC                                          | MVAPICH2-MIC            |
| HPC Cloud with MPI & IB                                    | MVAPICH2-Virt           |
| Energy-aware MPI with IB, iWARP and RoCE                   | MVAPICH2-EA             |
MVAPICH/MVAPICH2 Release Timeline and Downloads

[Figure: Cumulative number of downloads from Sep '04 through Jan '16 (approaching 350,000), annotated with releases from MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2 2.1, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.1rc2, MV2-GDR 2.2b, MV2-X 2.2b, and MV2 2.2b.]
Overview of a Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
  – Support for advanced IB mechanisms (UMR and ODP)
  – Extremely minimal memory footprint
  – Scalable job start-up
• Collective communication
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• InfiniBand Network Analysis and Monitoring (INAM)
• Integrated support for GPGPUs
• Integrated support for MICs
• Virtualization (SR-IOV and container)
• Energy-awareness
One-way Latency: MPI over IB with MVAPICH2

[Figure: Small-message latency (values of 1.26, 1.19, 1.15 and 0.95 us across the four adapters below) and large-message latency versus message size for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, and ConnectX-4-EDR.

Platforms: TrueScale-QDR: 2.8 GHz deca-core (IvyBridge) Intel PCI Gen3 with IB switch; ConnectX-3-FDR: 2.8 GHz deca-core (IvyBridge) Intel PCI Gen3 with IB switch; ConnectIB-DualFDR: 2.8 GHz deca-core (IvyBridge) Intel PCI Gen3 with IB switch; ConnectX-4-EDR: 2.8 GHz deca-core (Haswell) Intel PCI Gen3, back-to-back.]
Bandwidth: MPI over IB with MVAPICH2

[Figure: Unidirectional bandwidth versus message size: 3,387 MB/s (TrueScale-QDR), 6,356 MB/s (ConnectX-3-FDR), 12,465 MB/s (ConnectIB-DualFDR), and 12,104 MB/s (ConnectX-4-EDR). Bidirectional bandwidth: 6,308 MB/s (TrueScale-QDR), 12,161 MB/s (ConnectX-3-FDR), 24,353 MB/s (ConnectIB-DualFDR), and 21,425 MB/s (ConnectX-4-EDR).

Platforms: same as in the latency figure above.]
MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support (LiMIC and CMA))

[Figure: With the latest MVAPICH2 2.2b on Intel Ivy-bridge: small-message latency of 0.18 us intra-socket and 0.45 us inter-socket; intra-socket bandwidth up to 14,250 MB/s and inter-socket bandwidth up to 13,749 MB/s, comparing CMA, shared-memory (Shmem) and LiMIC channels.]
User-mode Memory Registration (UMR)

• Introduced by Mellanox to support direct local and remote noncontiguous memory access
  – Avoids packing at the sender and unpacking at the receiver
• Available with MVAPICH2-X 2.2b

[Figure: Small/medium-message (4K-1M) and large-message (2M-16M) datatype latency, UMR versus the default scheme. Platform: Connect-IB (54 Gbps): 2.8 GHz dual ten-core (IvyBridge) Intel PCI Gen3 with Mellanox IB FDR switch.]

M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits, CLUSTER, 2015
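The kind of noncontiguous transfer UMR accelerates can be expressed with a standard MPI derived datatype; this illustrative sketch sends a strided matrix column without manual packing. Whether the library packs internally or gathers the strided elements directly via UMR-style hardware support is the implementation's choice, not something this code controls.

```c
#include <mpi.h>

/* Send one column of an n x n row-major matrix without manual packing. */
void send_column(double *matrix, int n, int col, int dest, MPI_Comm comm)
{
    MPI_Datatype column;
    /* n blocks of 1 double, stride n elements apart: a matrix column. */
    MPI_Type_vector(n, 1, n, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    MPI_Send(&matrix[col], 1, column, dest, /*tag=*/0, comm);

    MPI_Type_free(&column);
}
```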
On-Demand Paging (ODP)

• Introduced by Mellanox to support direct remote memory access without pinning
• Memory regions are paged in/out dynamically by the HCA/OS
• Size of registered buffers can be larger than physical memory
• Will be available in a future MVAPICH2 release

[Figure: Graph500 pin-down buffer sizes (MB) and BFS kernel execution time (s) at 16, 32 and 64 processes, pin-down versus ODP. Platform: Connect-IB (54 Gbps): 2.6 GHz dual octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch.]
Minimizing Memory Footprint with the Dynamically Connected (DC) Transport

[Figure: Four nodes (processes P0-P7) communicating over an IB network through DC initiator/target objects.]

• Constant connection cost (one QP for any peer)
• Full feature set (RDMA, atomics, etc.)
• Separate objects for send (DC Initiator) and receive (DC Target)
  – DC Target identified by "DCT Number"
  – Messages routed with (DCT Number, LID)
  – Requires the same "DC Key" to enable communication
• Available since MVAPICH2-X 2.2a

[Figure: Connection memory (KB) for Alltoall at 80-640 processes for RC, DC-Pool, UD and XRC (RC and XRC grow into the thousands of KB, reaching 1,022 KB and 4,797 KB, while DC-Pool and UD remain roughly constant at ~1-13 KB), and normalized NAMD (apoa1, large dataset) execution time at 160-620 processes.]

H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14)
Towards High-Performance and Scalable Startup at Exascale

• Near-constant MPI and OpenSHMEM initialization time at any process count
• 10x and 30x improvement in startup time of MPI and OpenSHMEM, respectively, at 16,384 processes
• Memory consumption for remote endpoint information reduced by O(processes per node)
• 1 GB of memory saved per node with 1M processes and 16 processes per node

[Figure: Job startup performance and memory required to store endpoint information, comparing state-of-the-art PGAS (P) and MPI (M) with the optimized PGAS/MPI design (O), built from: (a) on-demand connection, (b) PMIX_Ring, (c) PMIX_Ibarrier, (d) PMIX_Iallgather, (e) shmem-based PMI.]

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D. K. Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS '15)

PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D. K. Panda, Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia '14)

Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '15)

SHMEMPMI – Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), accepted for publication
Process Management Interface over Shared Memory (SHMEMPMI)

• SHMEMPMI allows MPI processes to directly read remote endpoint (EP) information from the process manager through shared memory segments
• Only a single copy per node: O(processes per node) reduction in memory usage
• Estimated savings of 1 GB per node with 1 million processes and 16 processes per node
• Up to 1,000 times faster PMI Gets compared to the default design; will be available in MVAPICH2 2.2 RC1

[Figure: Time taken by one PMI_Get for 1-32 processes per node, default versus SHMEMPMI (16x faster measured), and memory usage per node for remote EP information for 16 to 1M processes per job, Fence/Allgather default versus shmem (estimated 1000x reduction). Platform: TACC Stampede, Connect-IB (54 Gbps): 2.6 GHz quad octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR.]

SHMEMPMI – Shared Memory Based PMI for Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), accepted for publication
Overview of a Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
  – Offload and non-blocking
  – Topology-aware
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• InfiniBand Network Analysis and Monitoring (INAM)
Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload Co-Direct Hardware (Available since MVAPICH2-X 2.2a)

• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
• Modified pre-conjugate gradient solver with Offload-Allreduce does up to 21.8% better than the default version

[Figure: Normalized HPL performance (HPL-Offload, HPL-1ring, HPL-Host) versus HPL problem size (N) as a percentage of total memory (4.5% gain); P3DFFT application run-time versus data size 512-800 (17% gain); PCG run-time at 64-512 processes, PCG-Default versus Modified-PCG-Offload (21.8% gain).]

K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011

K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011

K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12

Can Network-Offload-based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS '12
Network-Topology-Aware Placement of Processes

• Can we design a highly scalable network topology detection service for IB?
• How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
• What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?

[Figure: Overall performance and split-up of physical communication for MILC on Ranger: performance for varying system sizes, plus default versus topology-aware placement for a 2048-core run (15% gain).]

• Reduced network topology discovery time from O(N^2 hosts) to O(N hosts)
• 15% improvement in MILC execution time @ 2048 cores
• 15% improvement in Hypre execution time @ 1024 cores

H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC '12. BEST Paper and BEST STUDENT Paper Finalist
Overview of a Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• InfiniBand Network Analysis and Monitoring (INAM)
MVAPICH2-X for Advanced MPI and Hybrid MPI+PGAS Applications

[Figure: MPI, OpenSHMEM, UPC, CAF, UPC++ or hybrid (MPI + PGAS) applications issue MPI, OpenSHMEM, UPC, CAF and UPC++ calls into the unified MVAPICH2-X runtime, which runs over InfiniBand, RoCE or iWARP.]

• Unified communication runtime for MPI, UPC, OpenSHMEM, CAF, UPC++, available with MVAPICH2-X 1.9 onwards (since 2012)
• UPC++ support will be available in the upcoming MVAPICH2-X 2.2 RC1
• Feature highlights
  – Supports MPI (+OpenMP), OpenSHMEM, UPC, CAF, UPC++, MPI (+OpenMP) + OpenSHMEM, MPI (+OpenMP) + UPC
  – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH), UPC++
  – Scalable inter-node and intra-node communication: point-to-point and collectives
Application-Level Performance with Graph500 and Sort

Graph500 execution time:

[Figure: Graph500 execution time at 4K, 8K and 16K processes for MPI-Simple, MPI-CSC, MPI-CSR and hybrid (MPI+OpenSHMEM) designs; the hybrid design is 7.6x faster than MPI-Simple at 8K processes and 13x faster at 16K processes.]

• Performance of the hybrid (MPI+OpenSHMEM) Graph500 design
  – 8,192 processes: 2.4x improvement over MPI-CSR, 7.6x improvement over MPI-Simple
  – 16,384 processes: 1.5x improvement over MPI-CSR, 13x improvement over MPI-Simple

Sort execution time:

[Figure: Sort execution time for MPI versus hybrid designs from 500 GB at 512 processes up to 4 TB at 4K processes; the hybrid design is 51% faster at the largest scale.]

• Performance of the hybrid (MPI+OpenSHMEM) Sort application
  – 4,096 processes, 4 TB input size: MPI 2408 sec (0.16 TB/min); hybrid 1172 sec (0.36 TB/min); 51% improvement over the MPI design

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013
MiniMD: Total Execution Time

• The hybrid design performs better than the MPI implementation
• 1,024 processes: 17% improvement over the MPI version
• Strong scaling with input size 128x128x128

[Figure: Total execution time (ms) for Hybrid-Barrier, MPI-Original and Hybrid-Advanced at 512 and 1,024 cores (performance) and at 256, 512 and 1,024 cores (strong scaling); 17% gain at 1,024 cores.]

M. Li, J. Lin, X. Lu, K. Hamidouche, K. Tomko and D. K. Panda, Scalable MiniMD Design with Hybrid MPI and OpenSHMEM, OpenSHMEM User Group Meeting (OUG '14), held in conjunction with the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS '14)
Hybrid MPI+UPC NAS-FT

• Modified the NAS FT UPC all-to-all pattern to use MPI_Alltoall
• Truly hybrid program
• For FT (Class C, 128 processes):
  – 34% improvement over UPC-GASNet
  – 30% improvement over UPC-OSU

[Figure: NAS-FT time (s) for problem sizes B-64, C-64, B-128 and C-128 with UPC-GASNet, UPC-OSU and Hybrid-OSU; 34% gain for C-128.]

J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS '10), October 2010

Hybrid MPI+UPC support available since MVAPICH2-X 1.9 (2012)
Overview of a Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• InfiniBand Network Analysis and Monitoring (INAM)
Overview of OSU INAM

• A network monitoring and analysis tool capable of analyzing traffic on the InfiniBand network with inputs from the MPI runtime
  – http://mvapich.cse.ohio-state.edu/tools/osu-inam/
  – http://mvapich.cse.ohio-state.edu/userguide/osu-inam/
• Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes
• Capability to analyze and profile node-level, job-level and process-level activities for MPI communication (point-to-point, collectives and RMA)
• Ability to filter data based on type of counters using a "drop-down" list
• Remotely monitor various metrics of MPI processes at user-specified granularity
• "Job page" to display jobs in ascending/descending order of various performance metrics, in conjunction with MVAPICH2-X
• Visualize the data transfer happening in a "live" or "historical" fashion for the entire network, a job or a set of nodes
OSU INAM: Network Level View

• Show the network topology of large clusters
• Visualize traffic patterns on different links
• Quickly identify congested links and links in error state
• See the history unfold: play back the historical state of the network

[Figure: Full network view (152 nodes) and a zoomed-in view of the network.]
OSU INAM: Job and Node Level Views

[Figure: Visualizing a job (5 nodes) and finding routes between nodes.]

• Job level view
  – Show different network metrics (load, error, etc.) for any live job
  – Play back historical data for completed jobs to identify bottlenecks
• Node level view provides details per processor per node
  – CPU utilization for each rank/node
  – Bytes sent/received for MPI operations (point-to-point, collective, RMA)
  – Network metrics (e.g. XmitDiscard, RcvError) per rank/node
Live Node Level View
Live Switch Level View
List of Supported Switch Counters

The following counters are queried from the InfiniBand switches:

• XmitData
  – Total number of data octets, divided by 4, transmitted on all VLs from the port
  – This includes all octets between (and not including) the start-of-packet delimiter and the VCRC, and may include packets containing errors
  – Excludes all link packets
• RcvData
  – Total number of data octets, divided by 4, received on all VLs from the port
  – This includes all octets between (and not including) the start-of-packet delimiter and the VCRC, and may include packets containing errors
  – Excludes all link packets
• Max[XmitData/RcvData]: maximum of the two values above
List of Supported MPI Process Level Counters

MVAPICH2-X collects additional information about each process's network usage, which can be displayed by OSU INAM:

• XmitData: total number of bytes transmitted as part of the MPI application
• RcvData: total number of bytes received as part of the MPI application
• Max[XmitData/RcvData]: maximum of the two values above
• Point-to-Point Sent: total number of bytes transmitted as part of MPI point-to-point operations
• Point-to-Point Rcvd: total number of bytes received as part of MPI point-to-point operations
• Max[Point-to-Point Sent/Rcvd]: maximum of the two values above
• CollBytesSent: total number of bytes transmitted as part of MPI collective operations
• CollBytesRcvd: total number of bytes received as part of MPI collective operations
List of Supported MPI Process Level Counters (Cont.)

• Max[CollBytesSent/Rcvd]: maximum of the two values above
• RMABytesSent: total number of bytes transmitted as part of MPI RMA operations
  – Note that due to the nature of RMA operations, bytes received for RMA operations cannot be counted
• RC VBUF: the number of internal communication buffers used for reliable connection (RC)
• UD VBUF: the number of internal communication buffers used for unreliable datagram (UD)
• VMSize: total number of bytes used by the program for its virtual memory
• VMPeak: maximum number of virtual memory bytes used by the program
• VMRSS: the number of bytes resident in memory (resident set size)
• VMHWM: the maximum number of bytes that have been resident in memory (peak resident set size, or high water mark)
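On Linux, these per-process VM quantities correspond to the VmSize, VmPeak, VmRSS and VmHWM fields of /proc/self/status; the sketch below reads them that way as one plausible mechanism (the talk does not specify how MVAPICH2-X actually collects them).

```c
#include <stdio.h>
#include <string.h>

/* Print the VmSize, VmPeak, VmRSS and VmHWM lines for this process,
   as reported by the Linux kernel in /proc/self/status. */
void print_vm_counters(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    if (!f)
        return;
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "VmSize", 6) == 0 ||
            strncmp(line, "VmPeak", 6) == 0 ||
            strncmp(line, "VmRSS", 5)  == 0 ||
            strncmp(line, "VmHWM", 5)  == 0)
            fputs(line, stdout);
    fclose(f);
}
```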
List of Supported Network Error Counters (Cont.)

• XmtDiscards
  – Total number of outbound packets discarded by the port because the port is down or congested. Reasons for this include:
    • Output port is not in the active state
    • Packet length exceeded neighbor MTU
    • Switch lifetime limit exceeded
    • Switch HOQ lifetime limit exceeded
    • This may also include packets discarded while in VL Stalled state
• XmtConstraintErrors
  – Total number of packets not transmitted from the switch physical port for the following reasons:
    • FilterRawOutbound is true and the packet is raw
    • PartitionEnforcementOutbound is true and the packet fails the partition key check or IP version check
• RcvConstraintErrors
  – Total number of packets not received from the switch physical port for the following reasons:
    • FilterRawInbound is true and the packet is raw
    • PartitionEnforcementInbound is true and the packet fails the partition key check or IP version check
• LinkIntegrityErrors
  – The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors
• ExcBufOverrunErrors
  – The number of times that OverrunErrors consecutive flow-control update periods occurred, each having at least one overrun error
• VL15Dropped: number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) in the port
List of Supported Network Error Counters

The following error counters are available both at switch and process level:

• SymbolErrors: total number of minor link errors detected on one or more physical lanes
• LinkRecovers: total number of times the Port Training state machine has successfully completed the link error recovery process
• LinkDowned: total number of times the Port Training state machine has failed the link error recovery process and downed the link
• RcvErrors: total number of packets containing an error that were received on the port. These errors include:
  – Local physical errors
  – Malformed data packet errors
  – Malformed link packet errors
  – Packets discarded due to buffer overrun
• RcvRemotePhysErrors: total number of packets marked with the EBP delimiter received on the port
• RcvSwitchRelayErrors: total number of packets received on the port that were discarded because they could not be forwarded by the switch relay
Conclusions

• Provided an overview of programming models for exascale systems
• Outlined the associated challenges in designing runtimes for these programming models
• Demonstrated how the MVAPICH2 project is addressing some of these challenges
Additional Challenges to be Covered in Today's 1:30 pm Talk

• Integrated support for GPGPUs
• Integrated support for MICs
• Virtualization (SR-IOV and container)
• Energy-awareness
• Best practice: a set of tunings for common applications (available through the MVAPICH website)
Thank You!

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/