-
IMPROVING THE SCALABILITY OF CFD CODES
Francesco Gava, Ghislain Lartigue, Vincent Moureau (CNRS-CORIA)
-
2
ICARUS* PROJECT
Context
Ø Objective: Development of high-fidelity calculation tools for the design of hot parts of engines (aerospace + automotive)
Ø Task: Optimisation of codes' performance on HPC machines
Ø Motivation: Next-generation (2020) machines will be massively parallel. CFD codes are not ready to take full advantage of such supercomputers.
Ø Funding: FUI – Fonds Unique Interministériel
* Intensive Calculation for AeRo and automotive engines Unsteady Simulations
-
3
Presentation Planning
Ø Context
Ø CFD codes
  Ø General concepts
  Ø Parallelism
Ø Review of parallelism paradigms
Ø Design choices for a hybrid code
  Ø Motivations
  Ø MPI+OpenMP Fine Grain
  Ø MPI+OpenMP Coarse Grain
Ø Perspectives & Conclusions
-
7
Performances of the Top500 (source: top500.org)
The Top500 is a ranking of the 500 most powerful supercomputers in the world.
Change in the trend: performance now increases much more slowly.
Physical limits of materials and energy consumption are capping processor frequencies, and hence performance.
-
8
Multicore architectures
Sequential performance is limited, but with more cores the parallel performance can still increase.
Almost all supercomputers use multicore processors.
The number of cores per socket is ever-increasing and more varied.
[Chart: Top500 system share by cores per socket, 11/2000 to 11/2018. Source: top500.org]
(Frequency-scaling figure prepared by C. Batten, School of Electrical and Computer Engineering, Cornell University, 2005, retrieved Dec 12 2012, http://www.cls.cornell.edu/courses/ece5950/handouts/ece5950-overview.pdf)
-
9
The memory hierarchy
[Diagram: mono-core vs multi-core. Mono-core: CPU, L1 cache, L2 cache, L3 cache, RAM. Multi-core: each CPU has private L1 and L2 caches; the L3 cache is shared amongst all CPUs; RAM; network.]
Typical sizes and access latencies:
Ø L1 cache: fastest, 32 KB, 1 cycle
Ø L2 cache: faster, 256 KB, 3 cycles
Ø L3 cache: fast, a few MB, 10 cycles
Ø RAM: slow, many GB, 100+ cycles
-
10
The roofline model
Code performance can be limited by:
Ø Processor speed (compute bound)
Ø Memory access speed (memory bound)
[Plot: attainable performance (Gflops) vs arithmetic intensity (flops/byte), with a memory-bound slope and a compute-bound plateau]
In CFD solvers:
Ø Fast computation
Ø High number of memory accesses
Ø Large data sizes
The aim is to move towards the compute-bound region:
Ø Exploit the memory hierarchy
Ø Work on smaller data
Ø Compute as much as possible on the same data
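The roofline bound itself can be written compactly; a standard formulation (given here for reference, not taken from the slide) is

$$P_{\text{attainable}} = \min\left(P_{\text{peak}},\; I \cdot B_{\text{mem}}\right)$$

where $I$ is the arithmetic intensity in flops/byte, $B_{\text{mem}}$ the memory bandwidth in bytes/s and $P_{\text{peak}}$ the peak compute throughput in flops/s. Raising $I$ (more computation per byte moved) is what pushes a code from the memory-bound slope onto the compute-bound plateau.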
-
11
Presentation Planning
Ø Context
Ø CFD codes
  Ø General concepts
  Ø Parallelism
Ø Review of parallelism paradigms
Ø Design choices for a hybrid code
  Ø Motivations
  Ø MPI+OpenMP Fine Grain
  Ø MPI+OpenMP Coarse Grain
Ø Perspectives & Conclusions
-
12
Computational Fluid Dynamics
[Image: CFD solver, the PRECCINSTA burner with YALES2]
Generally, a CFD code:
Ø Solves the Navier-Stokes (and other) equations
  Ø Relies on linear operators
  Ø Fast computations (additions, ...)
  Ø A lot of memory reads/writes
  Ø Needs to exploit the memory hierarchy
Ø Uses a discretized domain
  Ø The finer the discretization, the higher the precision
  Ø Large meshes may not fit into RAM and take longer to compute
  Ø Must use parallel solvers

$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\frac{1}{\rho}\nabla p + \nu \nabla^2 \mathbf{u}$$
-
13
From the incompressible momentum equation to the Poisson equation
Solve the incompressible momentum equation for u with a prediction-correction method [1]:

$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\frac{1}{\rho}\nabla p + \nu \nabla^2 \mathbf{u}$$

Imposing the continuity equation

$$\nabla \cdot \mathbf{u}^{n+1} = 0$$

leads to the Poisson equation for the pressure

$$\nabla^2 p^{n+1} = \mathrm{rhs}$$

which can be rewritten as a linear system

$$Lp = b$$

One must solve for p to obtain u.

[1] Chorin, A. J. (1967), "The numerical solution of the Navier-Stokes equations for an incompressible fluid", Bull. Am. Math. Soc., 73:928-931.
-
14
The Poisson equation and the Conjugate Gradient method
The linear system $Lp = b$ has to be solved for p. This can be done with an iterative method: let $r_k$ be the residual at iteration k, and iterate along the direction $d_k$ conjugate to $r_k$ until convergence.

The Conjugate Gradient method:

$$r_0 = b - Lp_0, \quad d_0 = r_0, \quad k = 0$$
$$\epsilon = \text{convergence criterion}, \quad \mathrm{err} = \|r_0\|_1$$

while (err > $\epsilon$):
$$\alpha_k = \frac{r_k^T r_k}{d_k^T L d_k}$$
$$p_{k+1} = p_k + \alpha_k d_k$$
$$r_{k+1} = r_k - \alpha_k L d_k, \quad \mathrm{err} = \|r_{k+1}\|_1$$
$$\beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$$
$$d_{k+1} = r_{k+1} + \beta_k d_k$$
$$k = k + 1$$
end while
return $p_k$ as the result

This method is trivial without parallelism.
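For reference, a minimal serial sketch of the same algorithm (assumed C, with a caller-supplied matvec applying L; not the YALES2 implementation):

```c
#include <math.h>
#include <stdlib.h>

/* Minimal serial conjugate gradient sketch: solves L p = b for a
 * symmetric positive-definite operator L of size n; matvec applies L. */
void cg_solve(int n, void (*matvec)(const double *in, double *out),
              const double *b, double *p, double eps) {
    double *r  = malloc(n * sizeof *r);
    double *d  = malloc(n * sizeof *d);
    double *Ld = malloc(n * sizeof *Ld);

    matvec(p, Ld);                       /* r0 = b - L p0 */
    double rr = 0.0, err = 0.0;
    for (int i = 0; i < n; i++) {
        r[i] = b[i] - Ld[i];
        d[i] = r[i];
        rr  += r[i] * r[i];
        err += fabs(r[i]);               /* 1-norm of the residual */
    }
    while (err > eps) {
        matvec(d, Ld);
        double dLd = 0.0;
        for (int i = 0; i < n; i++) dLd += d[i] * Ld[i];
        double alpha = rr / dLd;         /* alpha_k = r.r / d.Ld */
        double rr_new = 0.0; err = 0.0;
        for (int i = 0; i < n; i++) {
            p[i] += alpha * d[i];        /* p_{k+1} = p_k + alpha d_k */
            r[i] -= alpha * Ld[i];       /* r_{k+1} = r_k - alpha L d_k */
            rr_new += r[i] * r[i];
            err    += fabs(r[i]);
        }
        double beta = rr_new / rr;       /* beta_k */
        for (int i = 0; i < n; i++) d[i] = r[i] + beta * d[i];
        rr = rr_new;
    }
    free(r); free(d); free(Ld);
}
```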
-
15
Presentation Planning
Ø Context
Ø CFD codes
  Ø General concepts
  Ø Parallelism
Ø Review of parallelism paradigms
Ø Design choices for a hybrid code
  Ø Motivations
  Ø MPI+OpenMP Fine Grain
  Ø MPI+OpenMP Coarse Grain
Ø Perspectives & Conclusions
-
16
Parallel computation and domain decomposition
Large problems cannot be computed by a single process.
Ø Domain decomposition divides the problem amongst many processes
  Ø More memory available
  Ø More computational power
  Ø Communication needed
Data on the nodes at the subdomain boundaries have to be exchanged between processes.
[Diagram: decomposed mesh; a shared-memory node with per-CPU L1/L2 caches, a shared L3 cache and RAM]
-
17
Computation on a domain node

$$\nabla\phi_i = \sum_{j \in \mathcal{N}_i} f(\phi_i, \phi_j, M_{ij})$$

Inside the domain:
Ø Needs the contribution of all neighbour nodes
Ø All surrounding nodes belong to the domain (proc #1)
Ø No problem

On the domain boundary:
Ø Needs the contribution of all neighbour nodes
Ø Some nodes do not belong to the domain (they belong to proc #2)
Ø Must communicate with neighbours
-
18
Parallel Conjugate Gradient method
The linear system $Lp = b$ has to be solved for p with the same Conjugate Gradient method as before (slide 14), now distributed across processes.
Ø The algorithm requires 4 COLLECTIVE communications per iteration: one for each scalar product (3) and one for the norm
Ø The algorithm requires a POINT-TO-POINT communication to compute $Ld$
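To make the communication pattern explicit, here is a minimal sketch (assumed C with MPI, not the YALES2 implementation) of the distributed scalar product and norm: each one costs an MPI_Allreduce, exactly the collective whose cost is measured later on.

```c
#include <mpi.h>
#include <math.h>

/* Each rank owns n_loc entries of the distributed vectors; the halo
 * exchange needed before applying L is omitted here. */
double dot(const double *x, const double *y, int n_loc) {
    double loc = 0.0, glob = 0.0;
    for (int i = 0; i < n_loc; i++) loc += x[i] * y[i];
    /* one collective per scalar product: this limits scalability */
    MPI_Allreduce(&loc, &glob, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return glob;
}

double norm1(const double *r, int n_loc) {
    double loc = 0.0, glob = 0.0;
    for (int i = 0; i < n_loc; i++) loc += fabs(r[i]);
    MPI_Allreduce(&loc, &glob, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return glob;
}
```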
-
19
YALES2 structure
Domain decomposition divides the problem amongst many processors, but the decomposed domain is still too big to fit into the L3 cache.
YALES2 therefore uses a Double Domain Decomposition:
Ø Each subdomain is split into small groups of elements (EL_GRPs) which fit into L3, possibly L2
Ø In YALES2, boundary nodes are duplicated
  Ø A partial value is computed on each side
  Ø The total value is computed on the internal comm.
Data on nodes shared between processors have to be exchanged; YALES2 has a dedicated data structure for this, the external communicator (EC).
Data on nodes shared between EL_GRPs on the same processor also have to be exchanged; YALES2 has a dedicated data structure for this, the internal communicator (IC).
[Diagram: grid of proc #1 split into el_grps, with the internal communicator between el_grps and external communicators towards proc #2 and proc #3]
-
20
YALES2 Internal Communicator
The internal communicator is an array used to compute the contribution of each GROUP on a shared node.
Ø Nodes on boundaries between groups are duplicated
Ø Each el_grp computes its own contribution
Ø Contributions are added on the internal comm.
Ø The total value is possibly copied back on the el_grp nodes
[Diagram, shown step by step in the original slides 20-23: duplicated boundary nodes, partial values summed into the IC array, total copied back]
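A minimal sketch of this mechanism (hypothetical C data layout, not the actual YALES2 structures):

```c
/* Nodes shared by several element groups are duplicated; partial values
 * are summed into the IC array, then the total is copied back to each
 * duplicate. ic_map_t is a hypothetical mapping per el_grp. */
typedef struct {
    int  n_shared;   /* number of duplicated nodes in this el_grp   */
    int *local_id;   /* index of each duplicate in the group's data */
    int *ic_id;      /* index of the same node in the IC array      */
} ic_map_t;

void ic_sum(double *ic, int n_ic, double **grp_val,
            const ic_map_t *map, int n_grp) {
    for (int i = 0; i < n_ic; i++) ic[i] = 0.0;
    for (int g = 0; g < n_grp; g++)            /* add partial values */
        for (int k = 0; k < map[g].n_shared; k++)
            ic[map[g].ic_id[k]] += grp_val[g][map[g].local_id[k]];
    for (int g = 0; g < n_grp; g++)            /* copy the total back */
        for (int k = 0; k < map[g].n_shared; k++)
            grp_val[g][map[g].local_id[k]] = ic[map[g].ic_id[k]];
}
```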
-
24
YALES2 External Communicator
The external communicator is an array used to exchange the contribution of each PROCESS on a shared node.
Ø Nodes on boundaries between processes are duplicated
Ø Each el_grp computes its own contribution
Ø Contributions are added on the internal comm.
Ø The total value is copied onto the external send communicator (SEND EC) and sent to the partner process
Ø The value received on the external receive communicator (RECV EC) is added to the internal communicator
Ø The final value is possibly copied back on the el_grp nodes
[Diagram, shown step by step in the original slides 24-29: SEND/RECV ECs exchanged with proc #2 and proc #3 and summed into the IC]
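The send/receive step can be sketched as follows (assumed C with MPI; hypothetical buffer layout, one partner process shown):

```c
#include <mpi.h>

/* One external-communicator exchange: the totals gathered on the IC are
 * packed into the send EC, swapped with the partner rank, and the
 * received values are then added back into the IC by the caller. */
void ec_exchange(double *send_ec, double *recv_ec, int n_ec,
                 int partner, MPI_Comm comm) {
    MPI_Request req[2];
    MPI_Irecv(recv_ec, n_ec, MPI_DOUBLE, partner, 0, comm, &req[0]);
    MPI_Isend(send_ec, n_ec, MPI_DOUBLE, partner, 0, comm, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}
```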
-
30
Presentation Planning
Ø Context
Ø CFD codes
  Ø General concepts
  Ø Parallelism
Ø Review of parallelism paradigms
Ø Design choices for a hybrid code
  Ø Motivations
  Ø MPI+OpenMP Fine Grain
  Ø MPI+OpenMP Coarse Grain
Ø Perspectives & Conclusions
-
31
Data exchange
There are mainly 3 (and 1/2) ways to achieve parallelism:
Ø MPI (Message Passing): old but (almost) gold
Ø OpenMP (Shared Memory): easy but (very) limited
Ø PGAS (Partitioned Global Address Space): promising but (too) new (not treated here)
Ø MPI+OpenMP (Hybrid)
-
32
MPI: Message Passing Interface
Ø Relies on a message-exchange paradigm (often through the network)
Ø The most common paradigm
  Ø Very well tested
  Ø A lot of support
Ø Fairly easy to implement
Ø Can be used on any platform
Ø Works better on distributed-memory systems
  Ø Does not take advantage of shared memory
Ø Does not scale to high numbers of processes
  Ø Collective communications are a bottleneck
  Ø Cannot fully exploit huge supercomputers
Currently YALES2 uses MPI.
[Diagram: distributed-memory nodes (CPU, L1/L2/L3 caches, RAM) connected by a network]
-
33
OpenMP: Shared Memory parallelism
Ø Relies on memory shared amongst cores
Ø Very common
  Ø Well tested
  Ø Good support
Ø Extremely easy to implement (fine grain)
Ø Can be used ONLY on architectures with shared memory
  Ø Cannot go beyond the cores of a NUMA domain
  Ø Must be used together with another paradigm (MPI, ...)
Ø The overhead of pragmas is not negligible
Ø Hard to achieve full parallelisation (Amdahl's law)
[Diagram: shared-memory node with per-CPU L1 and L2 caches, a shared L3 cache and RAM]
-
34
Presentation Planning
Ø Context
Ø CFD codes
  Ø General concepts
  Ø Parallelism
Ø Review of parallelism paradigms
Ø Design choices for a hybrid code
  Ø Motivations
  Ø MPI+OpenMP Fine Grain
  Ø MPI+OpenMP Coarse Grain
Ø Perspectives & Conclusions
-
35
Hybrid MPI+OpenMP: motivations
Ø MPI codes do not scale indefinitely
  Ø Threading can reduce the number of MPI ranks and hence improve scalability
Ø MPI alone cannot take full advantage of multicore architectures
  Ø OpenMP can exploit shared memory
https://nvidia-gpugenius.highspot.com/viewer/5bf5139e659e9366ed606a3e?iid=5bf5134ac714335696ba3410
-
36
Performance measurement specifications
All the following measurements were obtained on the Myria supercomputer at CRIANN:
Processor: bi-socket Broadwell ([email protected], [email protected])
Network: low-latency, high-bandwidth Intel Omni-Path (100 Gbit/s)
MPI library: Intel-MPI 2017.1.132 (others give similar results)
Test case: incompressible, non-reactive PRECCINSTA burner
[Images: PRECCINSTA burner with YALES2; the Myria supercomputer at CRIANN]
-
37
MPI scalability
Real-case scenario.
[Plot, shown step by step in the original slides 37-42: normalized Wall Clock Time (lower is better) vs number of cores, 10 to 10000. Reference: 14M elements on 28 cores. Curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M, MPI.]
Ø Strong scalability: constant global work => linearly decreasing WCT
Ø Weak scalability: constant work per process => constant WCT
Ø The deviation from ideal is mainly due to communications
-
43
MPI scalability limits: collective communications
[Plot: MPI_ALLREDUCE time [us] vs number of cores, 0 to 3000, for MPI_28PPN, MPI_1PPN, MPI_2PPN and MPI_4PPN. PPN = Processes Per Node.]
Reducing the number of communicating PPN whilst maintaining the number of cores used for computation reduces the cost of collective communications.
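The idea behind lowering the PPN can be sketched as a two-level reduction: threads (or on-node ranks) combine their values in shared memory first, and only one rank per node takes part in the MPI_Allreduce. A minimal sketch (assumed C with MPI+OpenMP, to be called from inside a parallel region; node_roots_comm is a hypothetical communicator containing one rank per node):

```c
#include <mpi.h>
#include <omp.h>

double hybrid_allreduce(double my_val, MPI_Comm node_roots_comm) {
    static double node_sum, global;      /* shared amongst threads */
    #pragma omp single
    node_sum = 0.0;                      /* implicit barrier after single */
    #pragma omp atomic
    node_sum += my_val;                  /* intra-node, shared memory */
    #pragma omp barrier                  /* all contributions are in */
    #pragma omp master
    MPI_Allreduce(&node_sum, &global, 1, MPI_DOUBLE, MPI_SUM,
                  node_roots_comm);      /* inter-node, few ranks */
    #pragma omp barrier                  /* result visible to all */
    double result = global;
    #pragma omp barrier                  /* protect 'global' before reuse */
    return result;
}
```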
-
45
Presentation Planning
Ø Context
Ø CFD codes
  Ø General concepts
  Ø Parallelism
Ø Review of parallelism paradigms
Ø Design choices for a hybrid code
  Ø Motivations
  Ø MPI+OpenMP Fine Grain
  Ø MPI+OpenMP Coarse Grain
Ø Perspectives & Conclusions
-
46
MPI+OpenMP Fine Grain
[Diagram: fork-join execution; the master thread forks threads 2-4 around each parallel loop and joins after it]
Ø Objective: have a larger domain for one MPI rank
  Ø Fewer MPI ranks
  Ø Less communication
Ø Divide the work among threads
Ø Based on the fork-join model (see the sketch after this slide):
  Ø Simple pragmas around loops
  Ø Work in loops is shared by all threads
  Ø Work outside loops and communication are done by the master thread only
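A minimal fine-grain sketch (assumed C; exchange_halos is a hypothetical MPI routine, and MPI_THREAD_FUNNELED support is sufficient since only the master thread communicates):

```c
void exchange_halos(double *u);  /* hypothetical MPI halo exchange */

/* Fine grain: a simple pragma shares the loop iterations among threads;
 * outside the loop the code runs on the master thread alone. */
void explicit_update(double *u, const double *rhs, int n, double dt) {
    #pragma omp parallel for schedule(static)   /* fork */
    for (int i = 0; i < n; i++)
        u[i] += dt * rhs[i];                    /* independent iterations */
    /* join: back to the master thread, which communicates */
    exchange_halos(u);
}
```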
-
47
MPI+OpenMP Fine Grain domain decomposition
[Diagrams: without OpenMP vs with OpenMP Fine Grain; with OpenMP the domain is coloured by threads #1-#4]
Ø Larger MPI domain
Ø Fewer MPI ranks
Ø Threads share the work on EL_GRPs
Ø Must take care of data races
-
48
MPI+OpenMP Fine Grain (Base version)
Ø Processes have a larger domain
Ø Threads share the work on groups of elements
Ø Communication is done by the master thread only
Ø Only loops with independent iterations are parallelised
  Ø No concurrency
  Ø Not much is parallelised
[Charts: runtime breakdown and percentage runtime breakdown (OpenMP vs sequential, with ideal scaling)]
80% of the code is parallelized, but with 7 threads only 40% of the runtime is executed in parallel.
-
49
MPI+OpenMP Fine Grain (Base version) performance
In-socket strong scaling.
[Plot: in-socket speedup vs number of cores, 1 to 16. Reference: 1.7M elements, 1 core, MPI. Curves: IDEAL, MPI, OMP_FG (Base).]
Worse scalability than MPI.
-
50
MPI+OpenMP Fine Grain (Base version) scaling
Real-case scenario.
[Plot, shown step by step in the original slides 50-53: normalized WCT (lower is better) vs number of cores. Reference: 14M elements, 28 cores. Curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M, MPI, OMP_FG (Base).]
Ø Starts with a considerable overhead
Ø Slightly better scalability
Ø Globally no improvement with respect to MPI
-
54
MPI+OpenMP Fine Grain
Ø Processes have a larger domain
Ø Threads share the work on groups of elements
Ø Communication is done by the master thread
Ø Loops with concurrent iterations are also parallelised
  Ø Almost everything is parallelised
  Ø Overhead to avoid concurrency
[Charts: runtime breakdown of the strong scaling and percentage runtime breakdown (OpenMP vs sequential, with ideal scaling)]
95% of the code is parallelized, but with 7 threads only 80% of the runtime is executed in parallel.
-
55
MPI+OpenMP Fine Grain performance
In-socket strong scaling.
[Plot: in-socket speedup vs number of cores, 1 to 16. Reference: 1.7M elements, 1 core, MPI. Curves: IDEAL, MPI, OMP_FG (Base), OMP_FG.]
Still worse scalability than MPI, but better scalability than the Base version.
-
56
MPI+OpenMP Fine Grain scaling
Real-case scenario.
[Plot, shown step by step in the original slides 56-59: normalized WCT (lower is better) vs number of cores. Reference: 14M elements, 28 cores. Curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M, MPI, OMP_FG.]
Ø Starts with a considerable overhead
Ø Better scalability
Ø Globally no improvement with respect to MPI
-
60
OpenMP Fine Grain limits: minimum amount of work
[Plot: time [us] vs loop iterations, 1 to 100000, log-log; one curve per thread count (0 to 14 threads) plus the linear fit y = 0.0027x of the 0-thread case. The 0-thread curve is sequential without OpenMP pragmas; the 1-thread curve is sequential with OpenMP pragmas.]
There is an overhead due to OpenMP: a minimum amount of work per OpenMP region is needed to have any gain.
-
61
OpenMP Fine Grain limits: fork-join overhead
[Plot: fork-join overhead time [us] vs number of threads, 0 to 16; one curve per loop iteration count, 1 to 1000.]
The overhead:
Ø Is independent of the amount of work
Ø Increases with the number of threads
Ø Imposes a minimum amount of work to be effective
-
62
OpenMP Fine Grain limits: data races
[Diagrams, shown step by step in the original slides 62-69: el_grps updating shared entries of the IC array]
Ø Without OpenMP: values are updated sequentially in the IC
Ø With OpenMP: two threads may update the same IC entry at the same time
  Ø No guarantee of data coherency: a data race
Ø With OpenMP and an augmented IC: one additional non-concurrent copy is made in order to avoid data races on the IC
  Ø The IC is updated in parallel without concurrency
  Ø The non-concurrent copy is an additional cost
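A sketch of what the augmented-IC idea buys (reusing the hypothetical ic_map_t layout from the earlier sketch; not the actual YALES2 code): each group accumulates into its own private row, and the merge loop is parallelised over IC entries, so no two threads ever write the same location.

```c
/* Race-free variant of ic_sum. ic and aug_ic are assumed zeroed.
 * The naive parallel loop over groups would race on shared ic entries;
 * private rows plus an entry-wise merge avoid that, at the price of
 * the extra non-concurrent copy. */
void ic_sum_omp(double *ic, int n_ic, double **grp_val, double **aug_ic,
                const ic_map_t *map, int n_grp) {
    #pragma omp parallel
    {
        #pragma omp for                      /* private rows: no race */
        for (int g = 0; g < n_grp; g++)
            for (int k = 0; k < map[g].n_shared; k++)
                aug_ic[g][map[g].ic_id[k]] +=
                    grp_val[g][map[g].local_id[k]];
        #pragma omp for                      /* each thread owns distinct i */
        for (int i = 0; i < n_ic; i++)
            for (int g = 0; g < n_grp; g++)  /* the extra copy = extra cost */
                ic[i] += aug_ic[g][i];
    }
}
```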
-
70
MPI+OpenMP Fine Grain: recap and conclusions
[Diagram: fork-join execution model]
Ø Objective: have a larger domain for one MPI rank
  Ø Fewer MPI ranks
  Ø Less communication
Ø Divide the work among threads
Ø Based on the fork-join model:
  Ø Simple pragmas around loops
  Ø Work in loops is shared by all threads
  Ø Work outside loops and communication are done by the master thread only
Ø Conclusions:
  Ø The minimum amount of work per OpenMP region does not allow complete code parallelisation
  Ø Threading scalability is limited by Amdahl's law
  Ø Fewer MPI ranks allow better overall scalability anyway
  Ø OpenMP pragma and data-concurrency overheads prevent better performance than MPI
-
71
Presentation Planning
Ø Context
Ø CFD codes
  Ø General concepts
  Ø Parallelism
Ø Review of parallelism paradigms
Ø Design choices for a hybrid code
  Ø Motivations
  Ø MPI+OpenMP Fine Grain
  Ø MPI+OpenMP Coarse Grain
Ø Perspectives & Conclusions
-
72
MPI+OpenMP Coarse Grain
[Diagram: instead of forking and joining around each loop, all threads live for the whole execution and take the place of MPI ranks]
Ø Objective:
  Ø Get rid of the fork-join overhead and of sequential computation
  Ø Substitute MPI ranks by threads
    Ø Fewer MPI ranks
    Ø Collective communication less expensive
Ø The entire code is inside one OpenMP region (see the sketch after this slide):
  Ø The entire code must be thread-safe (extremely hard to code and debug)
  Ø Threads do all the work and the communication
THIS IS FASTER for large numbers of processes.
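A minimal coarse-grain sketch (assumed C; solver_loop is a hypothetical thread-safe routine): the whole solver lives inside a single OpenMP parallel region and any thread may call MPI, which requires MPI_THREAD_MULTIPLE support from the library.

```c
#include <mpi.h>
#include <omp.h>

void solver_loop(int tid);   /* hypothetical thread-safe solver routine */

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);    /* library cannot run this mode */

    #pragma omp parallel
    {
        /* each thread plays the role an MPI rank played before:
         * it owns its element groups, computes and communicates */
        solver_loop(omp_get_thread_num());
    }

    MPI_Finalize();
    return 0;
}
```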
-
73
MPI+OpenMP Coarse Grain: collective communications
[Plots, shown step by step in the original slides 73-78: MPI_ALLREDUCE time [us] vs number of cores, 0 to 3000, for MPI_28PPN, MPI_1PPN, MPI_2PPN, MPI_4PPN and the coarse-grain variants MPI_2PPN+CG and MPI_4PPN+CG.]
Ø The total cost is the MPI cost plus the OpenMP cost
Ø The OpenMP cost does not increase with the number of cores: f(nthreads) = constant
Ø The OpenMP cost can be reduced with better algorithms (WIP)
Ø Net gain with respect to pure MPI at high core counts
-
79
MPI+OpenMP Coarse Grain domain decomposition
[Diagrams: without OpenMP vs with OpenMP Coarse Grain; the domain is coloured by threads #1-#3]
Ø Threads substitute MPI ranks
Ø Fewer MPI ranks
Ø The entire work is done in parallel
Ø Threads must communicate
-
80
MPI+OpenMP Coarse Grain performance
In-socket strong scaling.
[Plot: in-socket speedup vs number of cores, 1 to 16. Reference: 1.7M elements, 1 core, MPI. Curves: IDEAL, MPI, OMP_FG, OMP_CG.]
Same scalability as MPI; better scalability than Fine Grain.
-
81
MPI+OpenMP Coarse Grain scaling
Real-case scenario.
[Plot: normalized WCT (lower is better) vs number of cores. Reference: 14M elements, 28 cores. Curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M, MPI, OMP_FG, OMP_CG.]
Catastrophic performance.
-
83
MPI+OpenMP Coarse Grain limits: point-to-point communications
[Chart: ALL2ALL via non-blocking P2P communication on one node, MPI vs MPI_2PPN+CG.]
Ø 40 times slower
Ø MPI implementations allow multithreading but serialize internally
Ø Impossible to attain any performance
-
86
MPI+OpenMP Coarse Grain: recap and conclusions
[Diagram: all threads live for the whole execution]
Ø Objective:
  Ø Get rid of the fork-join overhead and of sequential computation
  Ø Substitute MPI ranks by threads
    Ø Fewer MPI ranks
    Ø Collective communication less expensive
Ø The entire code is inside one OpenMP region:
  Ø The entire code must be thread-safe (extremely hard to code and debug)
  Ø Threads do all the work and the communication
Ø Conclusions:
  Ø OpenMP Coarse Grain allows complete parallelization of the code
    Ø Same performance as MPI
    Ø Improvement in collective communications
  Ø Implementations prevent fully multithreaded concurrent MPI calls
    Ø Point-to-point communications kill performance
-
87
Presentation Planning
Ø Context
Ø CFD codes
  Ø General concepts
  Ø Parallelism
Ø Review of parallelism paradigms
Ø Design choices for a hybrid code
  Ø Motivations
  Ø MPI+OpenMP Fine Grain
  Ø MPI+OpenMP Coarse Grain
Ø Perspectives & Conclusions
-
88
Perspectives: MPI+OpenMP
Ø Solve the point-to-point communication problem for Coarse Grain
Ø Be smarter on domain decomposition
  Ø Minimize the number of neighbors on different ranks
Ø Funnel all communication to one thread
  Ø More communication for the designated threads
  Ø Idle time for the non-communicating threads
  Ø More synchronization points
Ø The MPI 4 standard may introduce the endpoints concept
  Ø Fully multithreaded MPI (hopefully)
  Ø Must wait for the libraries to implement it
-
89
Perspectives
MPI+MPI-3:
Ø MPI-3 allows the creation of shared-memory windows inside a node
Ø Same solution as OpenMP CG for collective communications
  Ø Starts from the 1PPN curve
Ø Performance must be verified
  Ø Possibly expensive synchronization (?)
Ø No problem for P2P communications
GASPI (PGAS):
Ø Alternative to MPI
Ø Uses RMA instead of messages
Ø Fully multithreaded
  Ø Should solve the P2P problem
Ø Can be combined with MPI
  Ø Useful for complex collectives
Ø Not supported on all machines
[Plot: MPI_ALLREDUCE time vs number of cores for MPI_28PPN, MPI_1PPN, MPI_2PPN, MPI_4PPN, MPI_2PPN+CG, MPI_4PPN+CG]
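A minimal sketch of the shared-window idea (assumed C with MPI-3; the function and variable names are illustrative): ranks on the same node allocate one shared window, so intra-node sums can go through shared memory while ordinary MPI handles inter-node traffic.

```c
#include <mpi.h>

/* Returns a pointer to an n-double segment owned by node rank 0 and
 * directly load/store-accessible by every rank on the same node. */
double *alloc_node_shared(MPI_Comm comm, int n, MPI_Comm *node_comm,
                          MPI_Win *win) {
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, node_comm);
    int rank; MPI_Comm_rank(*node_comm, &rank);
    double *base;
    /* only rank 0 provides memory; the others get a zero-size segment */
    MPI_Win_allocate_shared(rank == 0 ? n * sizeof(double) : 0,
                            sizeof(double), MPI_INFO_NULL,
                            *node_comm, &base, win);
    MPI_Aint size; int disp;
    MPI_Win_shared_query(*win, 0, &size, &disp, &base);
    return base;   /* pointer to rank 0's segment */
}
```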
-
90
Conclusions
Ø MPI has reached its scalability limits on modern architectures
  Ø Hybrid codes could improve performance
Ø It is not easy to write a performant hybrid MPI+OpenMP code
  Ø OpenMP Fine Grain suffers from fork-join overhead and Amdahl's law
  Ø OpenMP Coarse Grain is limited by the MPI implementations on P2P comms.
Ø Other hybrid solutions are worth exploring
  Ø MPI+MPI-3
  Ø (MPI+)GASPI+OpenMP
  Ø ...