-
IMPROVING THE SCALABILITY OF CFD CODES
Francesco Gava, Ghislain Lartigue, Vincent Moureau (CNRS-CORIA)
-
2
ICARUS* PROJECT
Context
Ø Objective: Development of high-fidelity calculation tools for the design of hot parts of engines (aerospace + automotive)
Ø Task: Optimisation of codes' performance on HPC machines
Ø Motivation: Next-generation (2020) machines will be massively parallel. CFD codes are not ready to take full advantage of such supercomputers.
Ø Funding: FUI – Fonds Unique Interministériel
* Intensive Calculation for AeRo and automotive engines Unsteady Simulations
-
3
Presentation Planning
Ø Context
Ø CFD codes
  Ø General concepts
  Ø Parallelism
Ø Review of parallelism paradigms
Ø Design choices for a hybrid code
  Ø Motivations
  Ø MPI+OpenMP Fine Grain
  Ø MPI+OpenMP Coarse Grain
Ø Perspectives & Conclusions
-
7
Performances of the Top500 (source: top500.org)
The Top500 is a ranking of the 500 most powerful supercomputers in the world.
Change in the trend: performance now increases much more slowly.
Physical limits of materials and energy consumption are capping processor frequencies, and hence performance.
-
8
Multicore architectures
Sequential performance is limited, but with more cores the parallel performance can still increase.
Almost all supercomputers use multicore processors.
The number of cores per socket is ever-increasing and more varied.
[Chart: Top500 system share by cores per socket, 11/2000 to 11/2018. Source: top500.org]
(Frequency-scaling figure prepared by C. Batten, School of Electrical and Computer Engineering, Cornell University, 2005, retrieved Dec 12 2012, http://www.cls.cornell.edu/courses/ece5950/handouts/ece5950-overview.pdf)
-
9
The memory hierarchy
[Diagram: mono-core vs multi-core. Mono-core: CPU, L1 cache, L2 cache, L3 cache, RAM. Multi-core: each CPU has private L1 and L2 caches; the L3 cache is shared amongst all CPUs; RAM; network.]
Typical sizes and access latencies:
Ø L1 cache: fastest, 32 KB, 1 cycle
Ø L2 cache: faster, 256 KB, 3 cycles
Ø L3 cache: fast, a few MB, 10 cycles
Ø RAM: slow, many GB, 100+ cycles
-
10
The roofline model
Code performance can be limited by:
Ø Processor speed (compute bound)
Ø Memory access speed (memory bound)
[Plot: attainable performance (Gflops) vs arithmetic intensity (flops/byte), with a memory-bound slope and a compute-bound plateau]
In CFD solvers:
Ø Fast computation
Ø High number of memory accesses
Ø Large data sizes
The aim is to move towards the compute-bound region:
Ø Exploit the memory hierarchy
Ø Work on smaller data
Ø Compute as much as possible on the same data
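The roofline bound itself can be written compactly; a standard formulation (given here for reference, not taken from the slide) is

$$P_{\text{attainable}} = \min\left(P_{\text{peak}},\; I \cdot B_{\text{mem}}\right)$$

where $I$ is the arithmetic intensity in flops/byte, $B_{\text{mem}}$ the memory bandwidth in bytes/s and $P_{\text{peak}}$ the peak compute throughput in flops/s. Raising $I$ (more computation per byte moved) is what pushes a code from the memory-bound slope onto the compute-bound plateau.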
-
11
Presentation Planning
Ø Context
Ø CFD codes
  Ø General concepts
  Ø Parallelism
Ø Review of parallelism paradigms
Ø Design choices for a hybrid code
  Ø Motivations
  Ø MPI+OpenMP Fine Grain
  Ø MPI+OpenMP Coarse Grain
Ø Perspectives & Conclusions
-
12
Computational Fluid Dynamics
[Image: CFD solver, the PRECCINSTA burner with YALES2]
Generally, a CFD code:
Ø Solves the Navier-Stokes (and other) equations
  Ø Relies on linear operators
  Ø Fast computations (additions, ...)
  Ø A lot of memory reads/writes
  Ø Needs to exploit the memory hierarchy
Ø Uses a discretized domain
  Ø The finer the discretization, the higher the precision
  Ø Large meshes may not fit into RAM and take longer to compute
  Ø Must use parallel solvers

$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\frac{1}{\rho}\nabla p + \nu \nabla^2 \mathbf{u}$$
-
13
From the incompressible momentum equation to the Poisson equation
Solve the incompressible momentum equation for u with a prediction-correction method [1]:

$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\frac{1}{\rho}\nabla p + \nu \nabla^2 \mathbf{u}$$

Imposing the continuity equation

$$\nabla \cdot \mathbf{u}^{n+1} = 0$$

leads to the Poisson equation for the pressure

$$\nabla^2 p^{n+1} = \mathrm{rhs}$$

which can be rewritten as a linear system

$$Lp = b$$

One must solve for p to obtain u.

[1] Chorin, A. J. (1967), "The numerical solution of the Navier-Stokes equations for an incompressible fluid", Bull. Am. Math. Soc., 73:928-931.
-
14
The Poisson equation and the Conjugate Gradient method
The linear system $Lp = b$ has to be solved for p. This can be done with an iterative method: let $r_k$ be the residual at iteration k, and iterate along the direction $d_k$ conjugate to $r_k$ until convergence.

The Conjugate Gradient method:

$$r_0 = b - Lp_0, \quad d_0 = r_0, \quad k = 0$$
$$\epsilon = \text{convergence criterion}, \quad \mathrm{err} = \|r_0\|_1$$

while (err > $\epsilon$):
$$\alpha_k = \frac{r_k^T r_k}{d_k^T L d_k}$$
$$p_{k+1} = p_k + \alpha_k d_k$$
$$r_{k+1} = r_k - \alpha_k L d_k, \quad \mathrm{err} = \|r_{k+1}\|_1$$
$$\beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$$
$$d_{k+1} = r_{k+1} + \beta_k d_k$$
$$k = k + 1$$
end while
return $p_k$ as the result

This method is trivial without parallelism.
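For reference, a minimal serial sketch of the same algorithm (assumed C, with a caller-supplied matvec applying L; not the YALES2 implementation):

```c
#include <math.h>
#include <stdlib.h>

/* Minimal serial conjugate gradient sketch: solves L p = b for a
 * symmetric positive-definite operator L of size n; matvec applies L. */
void cg_solve(int n, void (*matvec)(const double *in, double *out),
              const double *b, double *p, double eps) {
    double *r  = malloc(n * sizeof *r);
    double *d  = malloc(n * sizeof *d);
    double *Ld = malloc(n * sizeof *Ld);

    matvec(p, Ld);                       /* r0 = b - L p0 */
    double rr = 0.0, err = 0.0;
    for (int i = 0; i < n; i++) {
        r[i] = b[i] - Ld[i];
        d[i] = r[i];
        rr  += r[i] * r[i];
        err += fabs(r[i]);               /* 1-norm of the residual */
    }
    while (err > eps) {
        matvec(d, Ld);
        double dLd = 0.0;
        for (int i = 0; i < n; i++) dLd += d[i] * Ld[i];
        double alpha = rr / dLd;         /* alpha_k = r.r / d.Ld */
        double rr_new = 0.0; err = 0.0;
        for (int i = 0; i < n; i++) {
            p[i] += alpha * d[i];        /* p_{k+1} = p_k + alpha d_k */
            r[i] -= alpha * Ld[i];       /* r_{k+1} = r_k - alpha L d_k */
            rr_new += r[i] * r[i];
            err    += fabs(r[i]);
        }
        double beta = rr_new / rr;       /* beta_k */
        for (int i = 0; i < n; i++) d[i] = r[i] + beta * d[i];
        rr = rr_new;
    }
    free(r); free(d); free(Ld);
}
```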
-
15
Presentation Planning
Ø Context
Ø CFD codes
  Ø General concepts
  Ø Parallelism
Ø Review of parallelism paradigms
Ø Design choices for a hybrid code
  Ø Motivations
  Ø MPI+OpenMP Fine Grain
  Ø MPI+OpenMP Coarse Grain
Ø Perspectives & Conclusions
-
16
Parallel computation and domain decomposition
Large problems cannot be computed by a single process.
Ø Domain decomposition divides the problem amongst many processes
  Ø More memory available
  Ø More computational power
  Ø Communication needed
Data on the nodes at the subdomain boundaries have to be exchanged between processes.
[Diagram: decomposed mesh; a shared-memory node with per-CPU L1/L2 caches, a shared L3 cache and RAM]
-
17
Computation on a domain node

$$\nabla\phi_i = \sum_{j \in \mathcal{N}_i} f(\phi_i, \phi_j, M_{ij})$$

Inside the domain:
Ø Needs the contribution of all neighbour nodes
Ø All surrounding nodes belong to the domain (proc #1)
Ø No problem

On the domain boundary:
Ø Needs the contribution of all neighbour nodes
Ø Some nodes do not belong to the domain (they belong to proc #2)
Ø Must communicate with neighbours
-
18
Parallel Conjugate Gradient method
The linear system $Lp = b$ has to be solved for p with the same Conjugate Gradient method as before (slide 14), now distributed across processes.
Ø The algorithm requires 4 COLLECTIVE communications per iteration: one for each scalar product (3) and one for the norm
Ø The algorithm requires a POINT-TO-POINT communication to compute $Ld$
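To make the communication pattern explicit, here is a minimal sketch (assumed C with MPI, not the YALES2 implementation) of the distributed scalar product and norm: each one costs an MPI_Allreduce, exactly the collective whose cost is measured later on.

```c
#include <mpi.h>
#include <math.h>

/* Each rank owns n_loc entries of the distributed vectors; the halo
 * exchange needed before applying L is omitted here. */
double dot(const double *x, const double *y, int n_loc) {
    double loc = 0.0, glob = 0.0;
    for (int i = 0; i < n_loc; i++) loc += x[i] * y[i];
    /* one collective per scalar product: this limits scalability */
    MPI_Allreduce(&loc, &glob, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return glob;
}

double norm1(const double *r, int n_loc) {
    double loc = 0.0, glob = 0.0;
    for (int i = 0; i < n_loc; i++) loc += fabs(r[i]);
    MPI_Allreduce(&loc, &glob, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return glob;
}
```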
-
19
YALES2 structure
Domain decomposition divides the problem amongst many processors, but the decomposed domain is still too big to fit into the L3 cache.
YALES2 therefore uses a Double Domain Decomposition:
Ø Each subdomain is split into small groups of elements (EL_GRPs) which fit into L3, possibly L2
Ø In YALES2, boundary nodes are duplicated
  Ø A partial value is computed on each side
  Ø The total value is computed on the internal comm.
Data on nodes shared between processors have to be exchanged; YALES2 has a dedicated data structure for this, the external communicator (EC).
Data on nodes shared between EL_GRPs on the same processor also have to be exchanged; YALES2 has a dedicated data structure for this, the internal communicator (IC).
[Diagram: grid of proc #1 split into el_grps, with the internal communicator between el_grps and external communicators towards proc #2 and proc #3]
-
20
YALES2 Internal Communicator
The internal communicator is an array used to compute the contribution of each GROUP on a shared node.
Ø Nodes on boundaries between groups are duplicated
Ø Each el_grp computes its own contribution
Ø Contributions are added on the internal comm.
Ø The total value is possibly copied back on the el_grp nodes
[Diagram, shown step by step in the original slides 20-23: duplicated boundary nodes, partial values summed into the IC array, total copied back]
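A minimal sketch of this mechanism (hypothetical C data layout, not the actual YALES2 structures):

```c
/* Nodes shared by several element groups are duplicated; partial values
 * are summed into the IC array, then the total is copied back to each
 * duplicate. ic_map_t is a hypothetical mapping per el_grp. */
typedef struct {
    int  n_shared;   /* number of duplicated nodes in this el_grp   */
    int *local_id;   /* index of each duplicate in the group's data */
    int *ic_id;      /* index of the same node in the IC array      */
} ic_map_t;

void ic_sum(double *ic, int n_ic, double **grp_val,
            const ic_map_t *map, int n_grp) {
    for (int i = 0; i < n_ic; i++) ic[i] = 0.0;
    for (int g = 0; g < n_grp; g++)            /* add partial values */
        for (int k = 0; k < map[g].n_shared; k++)
            ic[map[g].ic_id[k]] += grp_val[g][map[g].local_id[k]];
    for (int g = 0; g < n_grp; g++)            /* copy the total back */
        for (int k = 0; k < map[g].n_shared; k++)
            grp_val[g][map[g].local_id[k]] = ic[map[g].ic_id[k]];
}
```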
-
24
YALES2 External Communicator
The external communicator is an array used to exchange the contribution of each PROCESS on a shared node.
Ø Nodes on boundaries between processes are duplicated
Ø Each el_grp computes its own contribution
Ø Contributions are added on the internal comm.
Ø The total value is copied onto the external send communicator (SEND EC) and sent to the partner process
Ø The value received on the external receive communicator (RECV EC) is added to the internal communicator
Ø The final value is possibly copied back on the el_grp nodes
[Diagram, shown step by step in the original slides 24-29: SEND/RECV ECs exchanged with proc #2 and proc #3 and summed into the IC]
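The send/receive step can be sketched as follows (assumed C with MPI; hypothetical buffer layout, one partner process shown):

```c
#include <mpi.h>

/* One external-communicator exchange: the totals gathered on the IC are
 * packed into the send EC, swapped with the partner rank, and the
 * received values are then added back into the IC by the caller. */
void ec_exchange(double *send_ec, double *recv_ec, int n_ec,
                 int partner, MPI_Comm comm) {
    MPI_Request req[2];
    MPI_Irecv(recv_ec, n_ec, MPI_DOUBLE, partner, 0, comm, &req[0]);
    MPI_Isend(send_ec, n_ec, MPI_DOUBLE, partner, 0, comm, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}
```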
-
30
Presentation Planning
Ø Context
Ø CFD codes
  Ø General concepts
  Ø Parallelism
Ø Review of parallelism paradigms
Ø Design choices for a hybrid code
  Ø Motivations
  Ø MPI+OpenMP Fine Grain
  Ø MPI+OpenMP Coarse Grain
Ø Perspectives & Conclusions
-
31
Data exchange
There are mainly 3 (and 1/2) ways to achieve parallelism:
Ø MPI (Message Passing): old but (almost) gold
Ø OpenMP (Shared Memory): easy but (very) limited
Ø PGAS (Partitioned Global Address Space): promising but (too) new (not treated here)
Ø MPI+OpenMP (Hybrid)
-
32
MPI: Message Passing Interface
Ø Relies on a message-exchange paradigm (often through the network)
Ø The most common paradigm
  Ø Very well tested
  Ø A lot of support
Ø Fairly easy to implement
Ø Can be used on any platform
Ø Works better on distributed-memory systems
  Ø Does not take advantage of shared memory
Ø Does not scale to high numbers of processes
  Ø Collective communications are a bottleneck
  Ø Cannot fully exploit huge supercomputers
Currently YALES2 uses MPI.
[Diagram: distributed-memory nodes (CPU, L1/L2/L3 caches, RAM) connected by a network]
-
33
OpenMP: Shared Memory parallelism
Ø Relies on memory shared amongst cores
Ø Very common
  Ø Well tested
  Ø Good support
Ø Extremely easy to implement (fine grain)
Ø Can be used ONLY on architectures with shared memory
  Ø Cannot go beyond the cores of a NUMA domain
  Ø Must be used together with another paradigm (MPI, ...)
Ø The overhead of pragmas is not negligible
Ø Hard to achieve full parallelisation (Amdahl's law)
[Diagram: shared-memory node with per-CPU L1 and L2 caches, a shared L3 cache and RAM]
-
34
Presentation Planning
Ø Context
Ø CFD codes
  Ø General concepts
  Ø Parallelism
Ø Review of parallelism paradigms
Ø Design choices for a hybrid code
  Ø Motivations
  Ø MPI+OpenMP Fine Grain
  Ø MPI+OpenMP Coarse Grain
Ø Perspectives & Conclusions
-
35
Hybrid MPI+OpenMP: motivations
Ø MPI codes do not scale indefinitely
  Ø Threading can reduce the number of MPI ranks and hence improve scalability
Ø MPI alone cannot take full advantage of multicore architectures
  Ø OpenMP can exploit shared memory
https://nvidia-gpugenius.highspot.com/viewer/5bf5139e659e9366ed606a3e?iid=5bf5134ac714335696ba3410
-
36
Performance measurement specifications
All the following measurements were obtained on the Myria supercomputer at CRIANN:
Processor: bi-socket Broadwell ([email protected], [email protected])
Network: low-latency, high-bandwidth Intel Omni-Path (100 Gbit/s)
MPI library: Intel-MPI 2017.1.132 (others give similar results)
Test case: incompressible, non-reactive PRECCINSTA burner
[Images: PRECCINSTA burner with YALES2; the Myria supercomputer at CRIANN]
-
37
MPI scalability
Real-case scenario.
[Plot, shown step by step in the original slides 37-42: normalized Wall Clock Time (lower is better) vs number of cores, 10 to 10000. Reference: 14M elements on 28 cores. Curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M, MPI.]
Ø Strong scalability: constant global work => linearly decreasing WCT
Ø Weak scalability: constant work per process => constant WCT
Ø The deviation from ideal is mainly due to communications
-
43
MPI scalability limits: collective communications
[Plot: MPI_ALLREDUCE time [us] vs number of cores, 0 to 3000, for MPI_28PPN, MPI_1PPN, MPI_2PPN and MPI_4PPN. PPN = Processes Per Node.]
Reducing the number of communicating PPN whilst maintaining the number of cores used for computation reduces the cost of collective communications.
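The idea behind lowering the PPN can be sketched as a two-level reduction: threads (or on-node ranks) combine their values in shared memory first, and only one rank per node takes part in the MPI_Allreduce. A minimal sketch (assumed C with MPI+OpenMP, to be called from inside a parallel region; node_roots_comm is a hypothetical communicator containing one rank per node):

```c
#include <mpi.h>
#include <omp.h>

double hybrid_allreduce(double my_val, MPI_Comm node_roots_comm) {
    static double node_sum, global;      /* shared amongst threads */
    #pragma omp single
    node_sum = 0.0;                      /* implicit barrier after single */
    #pragma omp atomic
    node_sum += my_val;                  /* intra-node, shared memory */
    #pragma omp barrier                  /* all contributions are in */
    #pragma omp master
    MPI_Allreduce(&node_sum, &global, 1, MPI_DOUBLE, MPI_SUM,
                  node_roots_comm);      /* inter-node, few ranks */
    #pragma omp barrier                  /* result visible to all */
    double result = global;
    #pragma omp barrier                  /* protect 'global' before reuse */
    return result;
}
```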
-
45
Presentation Planning
Ø Context
Ø CFD codes
  Ø General concepts
  Ø Parallelism
Ø Review of parallelism paradigms
Ø Design choices for a hybrid code
  Ø Motivations
  Ø MPI+OpenMP Fine Grain
  Ø MPI+OpenMP Coarse Grain
Ø Perspectives & Conclusions
-
46
MPI+OpenMP Fine Grain
[Diagram: fork-join execution; the master thread forks threads 2-4 around each parallel loop and joins after it]
Ø Objective: have a larger domain for one MPI rank
  Ø Fewer MPI ranks
  Ø Less communication
Ø Divide the work among threads
Ø Based on the fork-join model (see the sketch after this slide):
  Ø Simple pragmas around loops
  Ø Work in loops is shared by all threads
  Ø Work outside loops and communication are done by the master thread only
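A minimal fine-grain sketch (assumed C; exchange_halos is a hypothetical MPI routine, and MPI_THREAD_FUNNELED support is sufficient since only the master thread communicates):

```c
void exchange_halos(double *u);  /* hypothetical MPI halo exchange */

/* Fine grain: a simple pragma shares the loop iterations among threads;
 * outside the loop the code runs on the master thread alone. */
void explicit_update(double *u, const double *rhs, int n, double dt) {
    #pragma omp parallel for schedule(static)   /* fork */
    for (int i = 0; i < n; i++)
        u[i] += dt * rhs[i];                    /* independent iterations */
    /* join: back to the master thread, which communicates */
    exchange_halos(u);
}
```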
-
47
MPI+OpenMP Fine Grain domain decomposition
[Diagrams: without OpenMP vs with OpenMP Fine Grain; with OpenMP the domain is coloured by threads #1-#4]
Ø Larger MPI domain
Ø Fewer MPI ranks
Ø Threads share the work on EL_GRPs
Ø Must take care of data races
-
48
MPI+OpenMP Fine Grain (Base version)
Ø Processes have a larger domain
Ø Threads share the work on groups of elements
Ø Communication is done by the master thread only
Ø Only loops with independent iterations are parallelised
  Ø No concurrency
  Ø Not much is parallelised
[Charts: runtime breakdown and percentage runtime breakdown (OpenMP vs sequential, with ideal scaling)]
80% of the code is parallelized, but with 7 threads only 40% of the runtime is executed in parallel.
-
49
MPI+OpenMP Fine Grain (Base version) performance
In-socket strong scaling.
[Plot: in-socket speedup vs number of cores, 1 to 16. Reference: 1.7M elements, 1 core, MPI. Curves: IDEAL, MPI, OMP_FG (Base).]
Worse scalability than MPI.
-
50
MPI+OpenMP Fine Grain (Base version) scaling
Real-case scenario.
[Plot, shown step by step in the original slides 50-53: normalized WCT (lower is better) vs number of cores. Reference: 14M elements, 28 cores. Curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M, MPI, OMP_FG (Base).]
Ø Starts with a considerable overhead
Ø Slightly better scalability
Ø Globally no improvement with respect to MPI
-
54
MPI+OpenMP Fine Grain
Ø Processes have a larger domain
Ø Threads share the work on groups of elements
Ø Communication is done by the master thread
Ø Loops with concurrent iterations are also parallelised
  Ø Almost everything is parallelised
  Ø Overhead to avoid concurrency
[Charts: runtime breakdown of the strong scaling and percentage runtime breakdown (OpenMP vs sequential, with ideal scaling)]
95% of the code is parallelized, but with 7 threads only 80% of the runtime is executed in parallel.
-
55
MPI+OpenMP Fine Grain performance
In-socket strong scaling.
[Plot: in-socket speedup vs number of cores, 1 to 16. Reference: 1.7M elements, 1 core, MPI. Curves: IDEAL, MPI, OMP_FG (Base), OMP_FG.]
Still worse scalability than MPI, but better scalability than the Base version.
-
56
MPI+OpenMP Fine Grain scaling
Real-case scenario.
[Plot, shown step by step in the original slides 56-59: normalized WCT (lower is better) vs number of cores. Reference: 14M elements, 28 cores. Curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M, MPI, OMP_FG.]
Ø Starts with a considerable overhead
Ø Better scalability
Ø Globally no improvement with respect to MPI
-
60
OpenMP Fine Grain limits: minimum amount of work
[Plot: time [us] vs loop iterations, 1 to 100000, log-log; one curve per thread count (0 to 14 threads) plus the linear fit y = 0.0027x of the 0-thread case. The 0-thread curve is sequential without OpenMP pragmas; the 1-thread curve is sequential with OpenMP pragmas.]
There is an overhead due to OpenMP: a minimum amount of work per OpenMP region is needed to have any gain.
-
61
OpenMP Fine Grain limits: fork-join overhead
[Plot: fork-join overhead time [us] vs number of threads, 0 to 16; one curve per loop iteration count, 1 to 1000.]
The overhead:
Ø Is independent of the amount of work
Ø Increases with the number of threads
Ø Imposes a minimum amount of work to be effective
-
62
OpenMP Fine Grain limits: data races
[Diagrams, shown step by step in the original slides 62-69: el_grps updating shared entries of the IC array]
Ø Without OpenMP: values are updated sequentially in the IC
Ø With OpenMP: two threads may update the same IC entry at the same time
  Ø No guarantee of data coherency: a data race
Ø With OpenMP and an augmented IC: one additional non-concurrent copy is made in order to avoid data races on the IC
  Ø The IC is updated in parallel without concurrency
  Ø The non-concurrent copy is an additional cost
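A sketch of what the augmented-IC idea buys (reusing the hypothetical ic_map_t layout from the earlier sketch; not the actual YALES2 code): each group accumulates into its own private row, and the merge loop is parallelised over IC entries, so no two threads ever write the same location.

```c
/* Race-free variant of ic_sum. ic and aug_ic are assumed zeroed.
 * The naive parallel loop over groups would race on shared ic entries;
 * private rows plus an entry-wise merge avoid that, at the price of
 * the extra non-concurrent copy. */
void ic_sum_omp(double *ic, int n_ic, double **grp_val, double **aug_ic,
                const ic_map_t *map, int n_grp) {
    #pragma omp parallel
    {
        #pragma omp for                      /* private rows: no race */
        for (int g = 0; g < n_grp; g++)
            for (int k = 0; k < map[g].n_shared; k++)
                aug_ic[g][map[g].ic_id[k]] +=
                    grp_val[g][map[g].local_id[k]];
        #pragma omp for                      /* each thread owns distinct i */
        for (int i = 0; i < n_ic; i++)
            for (int g = 0; g < n_grp; g++)  /* the extra copy = extra cost */
                ic[i] += aug_ic[g][i];
    }
}
```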
-
70
MPI+OpenMP Fine Grain: recap and conclusions
[Diagram: fork-join execution model]
Ø Objective: have a larger domain for one MPI rank
  Ø Fewer MPI ranks
  Ø Less communication
Ø Divide the work among threads
Ø Based on the fork-join model:
  Ø Simple pragmas around loops
  Ø Work in loops is shared by all threads
  Ø Work outside loops and communication are done by the master thread only
Ø Conclusions:
  Ø The minimum amount of work per OpenMP region does not allow complete code parallelisation
  Ø Threading scalability is limited by Amdahl's law
  Ø Fewer MPI ranks allow better overall scalability anyway
  Ø OpenMP pragma and data-concurrency overheads prevent better performance than MPI
-
71
Presentation Planning
Ø Context
Ø CFD codes
  Ø General concepts
  Ø Parallelism
Ø Review of parallelism paradigms
Ø Design choices for a hybrid code
  Ø Motivations
  Ø MPI+OpenMP Fine Grain
  Ø MPI+OpenMP Coarse Grain
Ø Perspectives & Conclusions
-
72
MPI+OpenMP Coarse Grain
[Diagram: instead of forking and joining around each loop, all threads live for the whole execution and take the place of MPI ranks]
Ø Objective:
  Ø Get rid of the fork-join overhead and of sequential computation
  Ø Substitute MPI ranks by threads
    Ø Fewer MPI ranks
    Ø Collective communication less expensive
Ø The entire code is inside one OpenMP region (see the sketch after this slide):
  Ø The entire code must be thread-safe (extremely hard to code and debug)
  Ø Threads do all the work and the communication
THIS IS FASTER for large numbers of processes.
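A minimal coarse-grain sketch (assumed C; solver_loop is a hypothetical thread-safe routine): the whole solver lives inside a single OpenMP parallel region and any thread may call MPI, which requires MPI_THREAD_MULTIPLE support from the library.

```c
#include <mpi.h>
#include <omp.h>

void solver_loop(int tid);   /* hypothetical thread-safe solver routine */

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);    /* library cannot run this mode */

    #pragma omp parallel
    {
        /* each thread plays the role an MPI rank played before:
         * it owns its element groups, computes and communicates */
        solver_loop(omp_get_thread_num());
    }

    MPI_Finalize();
    return 0;
}
```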
-
73
MPI+OpenMP Coarse Grain: collective communications
[Plots, shown step by step in the original slides 73-78: MPI_ALLREDUCE time [us] vs number of cores, 0 to 3000, for MPI_28PPN, MPI_1PPN, MPI_2PPN, MPI_4PPN and the coarse-grain variants MPI_2PPN+CG and MPI_4PPN+CG.]
Ø The total cost is the MPI cost plus the OpenMP cost
Ø The OpenMP cost does not increase with the number of cores: f(nthreads) = constant
Ø The OpenMP cost can be reduced with better algorithms (WIP)
Ø Net gain with respect to pure MPI at high core counts
-
79
MPI+OpenMP Coarse Grain domain decomposition
[Diagrams: without OpenMP vs with OpenMP Coarse Grain; the domain is coloured by threads #1-#3]
Ø Threads substitute MPI ranks
Ø Fewer MPI ranks
Ø The entire work is done in parallel
Ø Threads must communicate
-
80
MPI+OpenMP Coarse Grain performance
In-socket strong scaling.
[Plot: in-socket speedup vs number of cores, 1 to 16. Reference: 1.7M elements, 1 core, MPI. Curves: IDEAL, MPI, OMP_FG, OMP_CG.]
Same scalability as MPI; better scalability than Fine Grain.
-
81
MPI+OpenMP Coarse Grain scaling
Real-case scenario.
[Plot: normalized WCT (lower is better) vs number of cores. Reference: 14M elements, 28 cores. Curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M, MPI, OMP_FG, OMP_CG.]
Catastrophic performance.
-
83
MPI+OpenMP Coarse Grain limits: point-to-point communications
[Chart: ALL2ALL via non-blocking P2P communication on one node, MPI vs MPI_2PPN+CG.]
Ø 40 times slower
Ø MPI implementations allow multithreading but serialize internally
Ø Impossible to attain any performance
-
86
MPI+OpenMP Coarse Grain: recap and conclusions
[Diagram: all threads live for the whole execution]
Ø Objective:
  Ø Get rid of the fork-join overhead and of sequential computation
  Ø Substitute MPI ranks by threads
    Ø Fewer MPI ranks
    Ø Collective communication less expensive
Ø The entire code is inside one OpenMP region:
  Ø The entire code must be thread-safe (extremely hard to code and debug)
  Ø Threads do all the work and the communication
Ø Conclusions:
  Ø OpenMP Coarse Grain allows complete parallelization of the code
    Ø Same performance as MPI
    Ø Improvement in collective communications
  Ø Implementations prevent fully multithreaded concurrent MPI calls
    Ø Point-to-point communications kill performance
-
87
Presentation Planning
Ø Context
Ø CFD codes
  Ø General concepts
  Ø Parallelism
Ø Review of parallelism paradigms
Ø Design choices for a hybrid code
  Ø Motivations
  Ø MPI+OpenMP Fine Grain
  Ø MPI+OpenMP Coarse Grain
Ø Perspectives & Conclusions
-
88
Perspectives: MPI+OpenMP
Ø Solve the point-to-point communication problem for Coarse Grain
Ø Be smarter on domain decomposition
  Ø Minimize the number of neighbors on different ranks
Ø Funnel all communication to one thread
  Ø More communication for the designated threads
  Ø Idle time for the non-communicating threads
  Ø More synchronization points
Ø The MPI 4 standard may introduce the endpoints concept
  Ø Fully multithreaded MPI (hopefully)
  Ø Must wait for the libraries to implement it
-
89
Perspectives
MPI+MPI-3:
Ø MPI-3 allows the creation of shared-memory windows inside a node
Ø Same solution as OpenMP CG for collective communications
  Ø Starts from the 1PPN curve
Ø Performance must be verified
  Ø Possibly expensive synchronization (?)
Ø No problem for P2P communications
GASPI (PGAS):
Ø Alternative to MPI
Ø Uses RMA instead of messages
Ø Fully multithreaded
  Ø Should solve the P2P problem
Ø Can be combined with MPI
  Ø Useful for complex collectives
Ø Not supported on all machines
[Plot: MPI_ALLREDUCE time vs number of cores for MPI_28PPN, MPI_1PPN, MPI_2PPN, MPI_4PPN, MPI_2PPN+CG, MPI_4PPN+CG]
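A minimal sketch of the shared-window idea (assumed C with MPI-3; the function and variable names are illustrative): ranks on the same node allocate one shared window, so intra-node sums can go through shared memory while ordinary MPI handles inter-node traffic.

```c
#include <mpi.h>

/* Returns a pointer to an n-double segment owned by node rank 0 and
 * directly load/store-accessible by every rank on the same node. */
double *alloc_node_shared(MPI_Comm comm, int n, MPI_Comm *node_comm,
                          MPI_Win *win) {
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, node_comm);
    int rank; MPI_Comm_rank(*node_comm, &rank);
    double *base;
    /* only rank 0 provides memory; the others get a zero-size segment */
    MPI_Win_allocate_shared(rank == 0 ? n * sizeof(double) : 0,
                            sizeof(double), MPI_INFO_NULL,
                            *node_comm, &base, win);
    MPI_Aint size; int disp;
    MPI_Win_shared_query(*win, 0, &size, &disp, &base);
    return base;   /* pointer to rank 0's segment */
}
```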
-
90
Conclusions
Ø MPI has reached its scalability limits on modern architectures
  Ø Hybrid codes could improve performance
Ø It is not easy to write a performant hybrid MPI+OpenMP code
  Ø OpenMP Fine Grain suffers from fork-join overhead and Amdahl's law
  Ø OpenMP Coarse Grain is limited by the MPI implementations on P2P comms.
Ø Other hybrid solutions are worth exploring
  Ø MPI+MPI-3
  Ø (MPI+)GASPI+OpenMP
  Ø ...