IMPROVING THE SCALABILITY OF CFD CODES
Francesco Gava, Ghislain Lartigue, Vincent Moureau (CNRS-CORIA)
05/06/2019


  • IMPROVING THE SCALABILITY OF CFD CODES

    Francesco Gava, Ghislain Lartigue, Vincent Moureau (CNRS-CORIA)

  • 2

    ICARUS* PROJECT

    Context

    Ø Objective: development of high-fidelity calculation tools for the design of hot parts of engines (aerospace + automotive)

    Ø Task: optimisation of the codes' performances on HPC machines

    Ø Motivation: next-generation (2020) machines will be massively parallel. CFD codes are not ready to take full advantage of such supercomputers.

    Ø Funding: FUI – Fonds Unique Interministériel

    *Intensive Calculation for AeRo and automotive engines Unsteady Simulations

  • 3

    Presentation Planning

    Ø Context

    Ø CFD codes
      Ø General concepts
      Ø Parallelism

    Ø Review of parallelism paradigms

    Ø Design choices for a hybrid code
      Ø Motivation
      Ø MPI+OpenMP Fine Grain
      Ø MPI+OpenMP Coarse Grain

    Ø Perspectives & Conclusions


  • 5-7

    Performances of the Top500

    The Top500 is a ranking of the 500 most powerful supercomputers in the world.

    Change in the trend: performance now increases much more slowly.

    Physical limits of materials and of energy consumption are capping processor frequencies, and hence the performances.

    Source: top500.org

  • 8

    Multicore architectures

    Sequential performance is limited, but with more cores the parallel performance can still increase.

    Almost all supercomputers use multicore processors. The number of cores per socket is ever-increasing and ever more varied.

    [Chart: TOP500 system share by cores per socket, 11/2000 to 11/2018; categories from 1 up to 64 and 68 cores per socket, plus Others]

    Prepared by C. Batten, School of Electrical and Computer Engineering, Cornell University, 2005, retrieved Dec 12 2012, http://www.cls.cornell.edu/courses/ece5950/handouts/ece5950-overview.pdf

  • 9

    The memory hierarchy

    Mono-core vs. multi-core: in a multi-core socket each CPU has private L1 and L2 caches, the L3 cache is shared amongst all CPUs, and sockets are connected through the network to the RAM and to each other.

    Level | Size    | Latency
    L1    | 32 KB   | 1 cycle (fastest)
    L2    | 256 KB  | 3 cycles (faster)
    L3    | few MB  | 10 cycles (fast)
    RAM   | many GB | 100+ cycles (slow)

  • 10

    The roofline model

    Code performance can be limited by:
    Ø Processor speed (compute bound)
    Ø Memory access speed (memory bound)

    [Figure: attainable performance (Gflops) vs. arithmetic intensity (flops/byte), with a memory-bound region and a compute-bound region]

    In CFD solvers:
    Ø Fast computation
    Ø High number of memory accesses
    Ø Large data sizes

    The aim is to move towards the compute-bound region:
    Ø Exploit the memory hierarchy
    Ø Work on smaller data
    Ø Compute as much as possible on the same data
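
    As a rough illustration of why CFD kernels tend to sit on the memory-bound side of the roofline, the sketch below estimates the arithmetic intensity of a simple axpy-like update and the resulting roofline bound min(peak, AI x bandwidth). The peak and bandwidth numbers are assumptions chosen only for the example, not measurements of any machine mentioned here.

```c
/* Hypothetical roofline estimate for y[i] = a*x[i] + y[i] (illustrative only).
 * Per iteration: 2 flops; 3 double accesses (load x, load y, store y) = 24 bytes. */
#include <stdio.h>

int main(void) {
    const double peak_gflops = 1000.0;   /* assumed peak of one node        */
    const double bw_gbytes   = 100.0;    /* assumed memory bandwidth (GB/s) */

    const double flops_per_iter = 2.0;
    const double bytes_per_iter = 3.0 * sizeof(double);
    const double ai = flops_per_iter / bytes_per_iter;   /* ~0.083 flop/byte */

    /* Roofline: attainable performance is capped by both limits. */
    double attainable = ai * bw_gbytes;
    if (attainable > peak_gflops) attainable = peak_gflops;

    printf("arithmetic intensity = %.3f flop/byte\n", ai);
    printf("attainable           = %.1f Gflop/s (memory bound if below peak)\n", attainable);
    return 0;
}
```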

  • 11

    Presentation Planning (outline repeated; next section: CFD codes, general concepts)

  • 12

    Computational Fluid Dynamics

    [Figure: PRECCINSTA burner simulated with the YALES2 CFD solver]

    Generally, a CFD code:
    Ø Solves the Navier-Stokes (and other) equations

    \[ \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u} = -\frac{1}{\rho}\nabla p + \nu \nabla^{2}\mathbf{u} \]

    Ø Relies on linear operators
      Ø Fast computations (additions, ...)
      Ø A lot of memory reads/writes
      Ø Needs to exploit the memory hierarchy
    Ø Uses a discretized domain
      Ø The finer the discretization, the higher the precision
      Ø Large meshes may not fit into RAM and take a longer time to compute
      Ø Uses parallel solvers

  • 13

    From the incompressible momentum equation to Poisson's equation

    Solve the incompressible momentum equation for u:

    \[ \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u} = -\frac{1}{\rho}\nabla p + \nu \nabla^{2}\mathbf{u} \]

    A prediction-correction method [1], combined with the continuity equation

    \[ \nabla \cdot \mathbf{u}^{n+1} = 0, \]

    leads to Poisson's equation for the pressure,

    \[ \nabla^{2} p^{n+1} = \mathrm{rhs}, \]

    which can be rewritten as a linear system

    \[ L\, p = b. \]

    This system must be solved for p to obtain u.

    [1] Chorin, A. J. (1967), "The numerical solution of the Navier-Stokes equations for an incompressible fluid", Bull. Am. Math. Soc., 73: 928-931.
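
    For completeness, one common form of this prediction-correction (projection) splitting, written here with an explicit first-order time discretisation, is sketched below. This is a generic illustration of the idea, not necessarily the exact scheme used in YALES2.

    \[ \mathbf{u}^{*} = \mathbf{u}^{n} + \Delta t \left[ -(\mathbf{u}^{n}\cdot\nabla)\mathbf{u}^{n} + \nu \nabla^{2}\mathbf{u}^{n} \right] \]
    \[ \nabla^{2} p^{n+1} = \frac{\rho}{\Delta t}\, \nabla\cdot\mathbf{u}^{*} \quad\Longrightarrow\quad L\, p = b \]
    \[ \mathbf{u}^{n+1} = \mathbf{u}^{*} - \frac{\Delta t}{\rho}\, \nabla p^{n+1} \]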

  • 14

    Poisson's equation and the Conjugate Gradient method

    The linear system L p = b has to be solved for p. This can be done with an iterative method: let r_k be the residual at iteration k; iterate along the direction d_k, conjugate to r_k, until convergence.

    The Conjugate Gradient method:

        r_0 = b - L p_0
        d_0 = r_0
        k = 0
        delta = convergence criterion
        err = ||r_0||
        while (err > delta)
            alpha_k = (r_k^T r_k) / (d_k^T L d_k)
            p_{k+1} = p_k + alpha_k d_k
            r_{k+1} = r_k - alpha_k L d_k
            err = ||r_{k+1}||
            beta_k = (r_{k+1}^T r_{k+1}) / (r_k^T r_k)
            d_{k+1} = r_{k+1} + beta_k d_k
            k = k + 1
        end while
        return p_k as the result

    This method is trivial without parallelism.
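
    To make the algorithm concrete, here is a minimal serial sketch in C. It assumes a matrix-free operator apply_L() (a 1D Laplacian stands in for the real discrete operator), the infinity norm as the error measure, and an arbitrary right-hand side; none of this is YALES2 code.

```c
/* Minimal serial Conjugate Gradient for L p = b, with a 1D Laplacian as a
 * stand-in operator (illustrative sketch only). */
#include <math.h>
#include <stdio.h>

#define N 64

static void apply_L(const double *p, double *Lp) {
    for (int i = 0; i < N; i++) {
        double left  = (i > 0)     ? p[i - 1] : 0.0;
        double right = (i < N - 1) ? p[i + 1] : 0.0;
        Lp[i] = 2.0 * p[i] - left - right;
    }
}

static double dot(const double *a, const double *b) {
    double s = 0.0;
    for (int i = 0; i < N; i++) s += a[i] * b[i];
    return s;
}

static double norm_inf(const double *a) {
    double m = 0.0;
    for (int i = 0; i < N; i++) if (fabs(a[i]) > m) m = fabs(a[i]);
    return m;
}

int main(void) {
    double p[N] = {0}, b[N], r[N], d[N], Ld[N];
    for (int i = 0; i < N; i++) b[i] = 1.0;          /* arbitrary right-hand side */

    apply_L(p, Ld);                                  /* r_0 = b - L p_0, d_0 = r_0 */
    for (int i = 0; i < N; i++) { r[i] = b[i] - Ld[i]; d[i] = r[i]; }

    const double delta = 1e-8;                       /* convergence criterion */
    double err = norm_inf(r);
    double rr  = dot(r, r);

    while (err > delta) {
        apply_L(d, Ld);                              /* in parallel: point-to-point exchange */
        double alpha = rr / dot(d, Ld);              /* in parallel: collective (Allreduce)  */
        for (int i = 0; i < N; i++) { p[i] += alpha * d[i]; r[i] -= alpha * Ld[i]; }
        err = norm_inf(r);                           /* in parallel: collective (Allreduce)  */
        double rr_new = dot(r, r);
        double beta = rr_new / rr;
        rr = rr_new;
        for (int i = 0; i < N; i++) d[i] = r[i] + beta * d[i];
    }
    printf("converged, err = %g\n", err);
    return 0;
}
```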

  • 15

    Presentation Planning (outline repeated; next section: CFD codes, parallelism)

  • 16

    Parallel computation and domain decomposition

    Large problems cannot be computed by a single process:
    Ø Domain decomposition divides the problem amongst many processes
      Ø More memory available
      Ø More computational power
      Ø Communication needed

    Data on the nodes shared between subdomains have to be exchanged between processes.

    [Diagram: decomposed domain distributed over several sockets, each with its CPUs, L1/L2 caches, shared L3 cache and RAM, connected by the network]

  • 17

    Computation on a domain node

    The nodal operation is a sum over the neighbours N_i of node i:

    \[ \nabla \phi_i = \sum_{j \in N_i} f(\phi_i, \phi_j, M_{ij}) \]

    Inside the domain (proc #1):
    Ø Needs the contribution of all neighbour nodes
    Ø All surrounding nodes belong to the domain
      Ø No problem

    On a domain boundary (proc #1 / proc #2):
    Ø Needs the contribution of all neighbour nodes
    Ø Some nodes do not belong to the domain
      Ø Must communicate with the neighbours
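
    A minimal sketch of such a nodal gather loop, assuming an illustrative CSR-style neighbour list (none of these names are the YALES2 data structures, and the particular f used here is only one possible choice). Nodes whose neighbour list crosses the partition boundary are exactly the ones that require communication.

```c
/* Illustrative nodal sum: res[i] = sum over neighbours j of f(phi_i, phi_j, M_ij).
 * neigh_start/neigh_idx form a CSR-style neighbour list; coef plays the role of M_ij. */
void nodal_gather(int n_nodes, const int *neigh_start, const int *neigh_idx,
                  const double *coef, const double *phi, double *res)
{
    for (int i = 0; i < n_nodes; i++) {
        double acc = 0.0;
        for (int k = neigh_start[i]; k < neigh_start[i + 1]; k++) {
            int j = neigh_idx[k];
            acc += coef[k] * (phi[j] - phi[i]);   /* one possible f(phi_i, phi_j, M_ij) */
        }
        res[i] = acc;
        /* If node i sits on a partition boundary, part of this sum lives on
         * another process and must be added through communication. */
    }
}
```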

  • 18

    Parallel Conjugate Gradient method

    The same Conjugate Gradient algorithm as above, executed on the decomposed domain, requires communication at every iteration:

    Ø 4 COLLECTIVE communications: one for each scalar product (3) and one for the norm
    Ø A POINT-TO-POINT communication to compute L d_k
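
    In a distributed-memory implementation, each of those scalar products becomes a local partial sum followed by a reduction over all ranks; the matrix-vector product L d_k only needs the values of d on the duplicated boundary nodes of neighbouring subdomains, hence point-to-point exchanges. A minimal MPI sketch of the collective part (illustrative, not the YALES2 implementation):

```c
/* Distributed dot product: local partial sum + MPI_Allreduce, as needed by
 * the scalar products and the norm of the parallel Conjugate Gradient. */
#include <mpi.h>

double parallel_dot(const double *a, const double *b, int n_local, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n_local; i++)
        local += a[i] * b[i];
    /* Collective communication: every rank receives the full sum. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}
```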

  • 19

    YALES2 structure

    Domain decomposition divides the problem amongst many processors, but the decomposed domain is still too big to fit into the L3 cache. YALES2 therefore uses a Double Domain Decomposition:
    Ø Each subdomain is split into small groups of elements (EL_GRPs) which will fit into L3, possibly L2

    Data on the nodes shared between processors have to be exchanged; YALES2 has a dedicated data structure for this: the external communicator (EC).
    Data on the nodes shared between EL_GRPs on the same processor have to be exchanged; YALES2 has a dedicated data structure for this: the internal communicator (IC).

    [Diagram: grid of proc #1 split into el_grps, with internal communicators on el_grp boundaries and external communicators towards proc #2 and proc #3]

    Ø In YALES2, boundary nodes are duplicated
    Ø A partial value is computed on each side
    Ø The total value is computed on the internal communicator

  • 20-23

    YALES2 Internal Communicator

    The internal communicator is an array used to compute the contribution of each GROUP on a shared node:
    Ø Nodes on boundaries between groups are duplicated
    Ø Each el_grp computes its own contribution
    Ø Contributions are added on the internal communicator
    Ø The total value is possibly copied back onto the el_grp nodes

  • 24-29

    YALES2 External Communicator

    The external communicator is an array used to exchange the contribution of each PROCESS on a shared node:
    Ø Nodes on boundaries between processes are duplicated
    Ø Each el_grp computes its own contribution
    Ø Contributions are added on the internal communicator
    Ø The total value is copied onto the external send communicator (SEND EC) and sent to the partner process
    Ø The value received on the external receive communicator (RECV EC) is added to the internal communicator
    Ø The final value is possibly copied back onto the el_grp nodes

    [Diagram: SEND and RECV external communicators towards proc #2 and proc #3, built on top of the internal communicator]
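
    The send/receive pattern behind the external communicator is the classic non-blocking halo exchange. A minimal generic MPI sketch with illustrative buffer names follows; the real YALES2 communicators are more elaborate than this.

```c
/* Non-blocking exchange of duplicated-node contributions with the partner ranks. */
#include <mpi.h>

void exchange_halos(int n_partners, const int *partner_rank,
                    double **send_ec, double **recv_ec, const int *ec_size,
                    MPI_Comm comm)
{
    MPI_Request reqs[2 * 64];   /* sketch assumes n_partners <= 64 */
    int r = 0;

    for (int p = 0; p < n_partners; p++) {
        MPI_Irecv(recv_ec[p], ec_size[p], MPI_DOUBLE, partner_rank[p], 0, comm, &reqs[r++]);
        MPI_Isend(send_ec[p], ec_size[p], MPI_DOUBLE, partner_rank[p], 0, comm, &reqs[r++]);
    }
    /* Local work on interior el_grps could be overlapped here. */
    MPI_Waitall(r, reqs, MPI_STATUSES_IGNORE);

    /* Received values are then added back into the internal communicator. */
}
```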

  • 30

    Presentation Planning (outline repeated; next section: review of parallelism paradigms)

  • 31

    Data exchange

    There are mainly 3 (and a half) ways to achieve parallelism:

    Ø MPI (Message Passing): old but (almost) gold
    Ø OpenMP (Shared Memory): easy but (very) limited
    Ø PGAS (Partitioned Global Address Space): promising but (too) new (not treated here)
    Ø MPI+OpenMP (Hybrid)

  • 32

    MPI: Message Passing Interface

    Ø Relies on a message-exchange paradigm (often through the network)
    Ø The most common paradigm
      Ø Very well tested
      Ø A lot of support
    Ø Fairly easy to implement
    Ø Can be used on any platform
    Ø Works better on distributed-memory systems
      Ø Does not take advantage of shared memory
    Ø Does not scale to high numbers of processes
      Ø Collective communications are a bottleneck
      Ø Cannot exploit huge supercomputers well

    Currently YALES2 uses MPI.

    [Diagram: several nodes (CPU, L1/L2/L3 caches, RAM) connected through the network]

  • 33

    OpenMP: shared-memory parallelism

    Ø Relies on memory shared amongst cores
    Ø Very common
      Ø Well tested
      Ø Good support
    Ø Extremely easy to implement (fine grain)
    Ø Can be used ONLY on architectures with shared memory
      Ø Cannot go beyond the cores of a NUMA domain
      Ø Must be used together with another paradigm (MPI, ...)
    Ø The overhead of pragmas is not negligible
    Ø Hard to get full parallelisation (Amdahl's law)

    [Diagram: one node with per-CPU L1/L2 caches and a shared L3 cache and RAM]

  • 34

    Presentation Planning (outline repeated; next section: design choices for a hybrid code, motivations)

  • 35

    Hybrid MPI+OpenMP: motivations

    Ø MPI codes do not scale indefinitely
      Ø Threading can reduce the number of MPI ranks and hence improve scalability
    Ø MPI alone cannot take full advantage of multicore architectures
      Ø OpenMP can exploit shared memory

    https://nvidia-gpugenius.highspot.com/viewer/5bf5139e659e9366ed606a3e?iid=5bf5134ac714335696ba3410

  • 36

    Performance measurement specifications

    All following measurements were obtained on the Myria supercomputer at CRIANN:
    Ø Processor: bi-socket Broadwell (28 cores per node)
    Ø Network: low-latency, high-bandwidth Intel Omni-Path (100 Gbit/s)
    Ø MPI library: Intel MPI 2017.1.132 (others give similar results)
    Ø Test case: incompressible, non-reactive PRECCINSTA burner

    [Images: PRECCINSTA burner with YALES2; Myria supercomputer at CRIANN]

  • 37-42

    MPI scalability, real case scenario

    [Figure: normalized Wall Clock Time (lower is better) vs. number of cores, log-log; reference: 14M elements on 28 cores; curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M elements, MPI]

    Ø Strong scalability: constant global work => linearly decreasing WCT
    Ø Weak scalability: constant work per process => constant WCT
    Ø Deviation from the ideal curves is mainly due to communications

  • 43-44

    MPI scalability limits: collective communications

    [Figure: MPI_ALLREDUCE time [us] vs. number of cores (up to ~3000), for 28, 4, 2 and 1 processes per node]

    PPN = Processes Per Node

    Reducing the number of communicating PPN, whilst maintaining the number of cores used for computation, reduces the cost of collective communications.
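
    The kind of measurement behind this plot can be reproduced with a very small microbenchmark. The sketch below (illustrative, with an arbitrary repetition count and message size) times MPI_Allreduce on MPI_COMM_WORLD and reports the average on rank 0.

```c
/* Timing a single-double MPI_Allreduce, averaged over many repetitions. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int reps = 1000;
    double in = 1.0, out = 0.0;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d ranks: %.2f us per MPI_Allreduce\n",
               size, 1e6 * (t1 - t0) / reps);

    MPI_Finalize();
    return 0;
}
```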

  • 45

    Presentation Planning (outline repeated; next section: MPI+OpenMP Fine Grain)

  • 46

    MPI+OpenMP Fine Grain

    [Diagram: fork-join timeline; the master thread spawns threads 2, 3 and 4 around each parallel region]

    Ø Objective: have a larger domain for one MPI rank
      Ø Fewer MPI ranks
      Ø Less communication
      Ø Divide the work among threads
    Ø Based on the fork-join model:
      Ø Simple pragmas around loops
      Ø Work on loops is shared by all threads
      Ø Work outside loops and communication is done by the master thread only (a minimal sketch of this pattern follows below)
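
    A minimal sketch of the fine-grain pattern, assuming an illustrative loop over groups of elements (the names el_grp, grp_size, etc. are placeholders, not YALES2 identifiers): only the compute loop is threaded, while everything else, including the MPI calls, stays on the master thread.

```c
/* Fine-grain hybrid pattern: pragmas around the compute loops only. */
#include <mpi.h>

void update_el_grps(int n_el_grp, double **data, const int *grp_size)
{
    /* Work on the loop is shared by all threads of the rank. */
    #pragma omp parallel for schedule(static)
    for (int g = 0; g < n_el_grp; g++)
        for (int i = 0; i < grp_size[g]; i++)
            data[g][i] *= 0.5;            /* placeholder for the real kernel */

    /* Back on a single thread: communication is done by the master only. */
    MPI_Barrier(MPI_COMM_WORLD);          /* stands in for the rank's communication step */
}
```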

  • 47

    MPI+OpenMP Fine Grain domain decomposition

    Without OpenMP vs. with OpenMP Fine Grain:
    Ø Larger MPI domain
    Ø Fewer MPI ranks
    Ø Threads (#1 to #4) share the work on EL_GRPs
      Ø Must take care of data races

  • 48

    MPI+OpenMP Fine Grain (Base version)

    Ø Processes have a larger domain
    Ø Threads share the work on groups of elements
    Ø Communication is done by the master thread only
    Ø Only loops with independent iterations are parallelised
      Ø No concurrency
      Ø Not much is parallelised

    [Charts: runtime breakdown and percentage runtime breakdown (OpenMP vs. sequential vs. ideal scaling)]

    80% of the code is parallelised, but with 7 threads only 40% is executed in parallel.

  • 49

    MPI+OpenMP Fine Grain (Base version) performances

    In-socket strong scaling.

    [Figure: in-socket speedup vs. number of cores (1 to 16); reference: 1.7M elements, 1 core, MPI; curves: IDEAL, MPI, OMP_FG (Base)]

    Worse scalability than MPI.

  • 50-53

    MPI+OpenMP Fine Grain (Base version) scaling, real case scenario

    [Figure: normalized WCT (lower is better) vs. number of cores; reference: 14M elements, 28 cores; curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M, MPI, OMP_FG (Base)]

    Ø Starts with considerable overhead
    Ø Slightly better scalability
    Ø Globally no improvement with respect to MPI

  • 54

    MPI+OpenMP Fine Grain

    Ø Processes have a larger domain
    Ø Threads share the work on groups of elements
    Ø Communication is done by the master thread
    Ø Loops with concurrent iterations are also parallelised
      Ø Almost everything is parallelised
      Ø Overhead to avoid concurrency

    [Charts: percentage runtime breakdown and runtime breakdown of strong scaling (OpenMP vs. sequential vs. ideal scaling)]

    95% of the code is parallelised, but with 7 threads only 80% is executed in parallel.

  • 55

    MPI+OpenMP Fine Grain performances

    In-socket strong scaling.

    [Figure: in-socket speedup vs. number of cores (1 to 16); reference: 1.7M elements, 1 core, MPI; curves: IDEAL, MPI, OMP_FG (Base), OMP_FG]

    Ø Still worse scalability than MPI
    Ø Better scalability than the Base version

  • 56-59

    MPI+OpenMP Fine Grain scaling, real case scenario

    [Figure: normalized WCT (lower is better) vs. number of cores; reference: 14M elements, 28 cores; curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M, MPI, OMP_FG]

    Ø Starts with considerable overhead
    Ø Better scalability
    Ø Globally no improvement with respect to MPI

  • 60

    OpenMP Fine Grain limits: OpenMP scalability

    [Figure: loop time [us] vs. number of loop iterations (log-log), for 0 to 14 threads, with the linear fit y = 0.0027x of the 0-thread curve; the "sequential (with OpenMP pragmas)" and "sequential (no OpenMP pragmas)" curves do not coincide]

    Ø There is an overhead due to OpenMP: a sequential run with OpenMP pragmas is slower than one without
    Ø There is a minimum amount of work per OpenMP region needed to have some gain

  • 61

    OpenMP Fine Grain limits: fork-join overhead

    [Figure: fork-join overhead time [us] vs. number of threads (0 to 16), for loop lengths from 1 to 1000 iterations]

    Overhead:
    Ø Independent of the amount of work
    Ø Increases with the number of threads
    Ø Imposes a minimum amount of work to be effective
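
    A measurement of this kind can be sketched with omp_get_wtime(): time a tiny parallel loop many times so that the fork-join cost dominates. This is an illustrative microbenchmark, not the one used for the plot above.

```c
/* Rough estimate of the fork-join overhead of a parallel for region. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int reps = 10000, n = 10;      /* tiny loop: overhead dominates */
    volatile double sink = 0.0;

    double t0 = omp_get_wtime();
    for (int r = 0; r < reps; r++) {
        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < n; i++)
            local += 1.0;                /* negligible work per iteration */
        sink += local;
    }
    double t1 = omp_get_wtime();

    printf("%d threads: ~%.2f us per fork-join (sink=%g)\n",
           omp_get_max_threads(), 1e6 * (t1 - t0) / reps, sink);
    return 0;
}
```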

  • 62-69

    OpenMP Fine Grain limits: data races

    Ø Without OpenMP, the value on a shared node is updated sequentially in the internal communicator (IC)
    Ø With OpenMP, two threads may update the same IC entry at the same time: a data race, with no guarantee of data coherency
    Ø Remedy: an augmented IC, so that the IC is updated in parallel without concurrency; one additional non-concurrent copy is made to avoid data races on the IC, and this copy is an additional cost
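
    A stripped-down illustration of one way around such a race, in the spirit of the augmented IC: each thread accumulates into its own scratch copy, which is then reduced without concurrency. The names and layout are illustrative only, not the YALES2 implementation.

```c
/* Accumulating contributions onto shared IC entries without a data race:
 * each thread writes to its own scratch copy, then the copies are reduced serially. */
#include <omp.h>
#include <stdlib.h>

void assemble_ic(int n_ic, int n_contrib, const int *ic_of_contrib,
                 const double *contrib, double *ic)
{
    int nthreads = omp_get_max_threads();
    double *scratch = calloc((size_t)nthreads * n_ic, sizeof(double));

    #pragma omp parallel
    {
        double *mine = scratch + (size_t)omp_get_thread_num() * n_ic;
        #pragma omp for
        for (int k = 0; k < n_contrib; k++)
            mine[ic_of_contrib[k]] += contrib[k];   /* no sharing between threads */
    }

    /* Non-concurrent reduction: this extra copy/merge is the additional cost. */
    for (int t = 0; t < nthreads; t++)
        for (int i = 0; i < n_ic; i++)
            ic[i] += scratch[(size_t)t * n_ic + i];

    free(scratch);
}
```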

  • 70

    MPI+OpenMP Fine Grain: recap and conclusions

    Ø Objective: have a larger domain for one MPI rank
      Ø Fewer MPI ranks
      Ø Less communication
      Ø Divide the work among threads
    Ø Based on the fork-join model:
      Ø Simple pragmas around loops
      Ø Work on loops is shared by all threads
      Ø Work outside loops and communication is done by the master thread only
    Ø Conclusions:
      Ø The minimum amount of work per OpenMP region does not allow complete parallelisation of the code
        Ø Threading scalability is limited by Amdahl's law
      Ø Fewer MPI ranks allow better overall scalability anyway
      Ø OpenMP pragma and data-concurrency overheads prevent better performance than MPI

  • 71

    Presentation Planning (outline repeated; next section: MPI+OpenMP Coarse Grain)

  • 72

    MPI+OpenMP Coarse Grain

    Ø Objective:
      Ø Get rid of the fork-join overhead and of the sequential computation
      Ø Substitute MPI ranks by threads
        Ø Fewer MPI ranks
        Ø Collective communication is less expensive
    Ø The entire code is inside one OpenMP region:
      Ø The entire code must be thread-safe (extremely hard to code and debug)
      Ø Threads do all the work and the communication (a minimal sketch of this pattern follows below)

    [Diagram: with Coarse Grain, threads replace MPI ranks inside a node; THIS IS FASTER for large numbers of processes]
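
    A minimal sketch of the coarse-grain structure, assuming MPI_THREAD_MULTIPLE support, an even number of ranks, the same number of threads on every rank, and per-thread point-to-point messages distinguished by tag. It illustrates the pattern only, not the YALES2 code.

```c
/* Coarse-grain hybrid: one OpenMP region spans the whole solver, and each thread
 * owns its subdomain, does its work and its own MPI point-to-point calls. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        double halo_out = rank + 0.1 * tid, halo_in = 0.0;  /* placeholder data      */
        int partner = rank ^ 1;                             /* toy partner: even/odd  */

        /* Each thread exchanges with the matching thread on the partner rank,
         * using the thread id as the message tag. Requires a thread-safe MPI. */
        MPI_Sendrecv(&halo_out, 1, MPI_DOUBLE, partner, tid,
                     &halo_in,  1, MPI_DOUBLE, partner, tid,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ...thread-local computation on its own el_grps would go here... */
    }

    MPI_Finalize();
    return 0;
}
```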

  • 73-78

    MPI+OpenMP Coarse Grain: collective communications

    [Figure: MPI_ALLREDUCE time [us] vs. number of cores, for pure MPI at 28, 4, 2 and 1 PPN and for the coarse-grain hybrid at MPI_2PPN+CG and MPI_4PPN+CG]

    Ø With Coarse Grain, the cost of a collective is the MPI cost plus an OpenMP cost
    Ø The OpenMP cost does not increase with the number of cores: f(nthreads) = constant
    Ø The OpenMP cost can be reduced with better algorithms (work in progress)
    Ø Overall, a gain with respect to pure MPI at 28 PPN

  • 79

    MPI+OpenMP Coarse Grain domain decomposition

    Without OpenMP vs. with OpenMP Coarse Grain:
    Ø Threads (#1 to #3) substitute MPI ranks
    Ø Fewer MPI ranks
    Ø The entire work is done in parallel
    Ø Threads must communicate

  • 80

    MPI+OpenMP Coarse Grain performances

    In-socket strong scaling.

    [Figure: in-socket speedup vs. number of cores (1 to 16); reference: 1.7M elements, 1 core, MPI; curves: IDEAL, MPI, OMP_FG, OMP_CG]

    Ø Same scalability as MPI
    Ø Better scalability than Fine Grain

  • 81-82

    MPI+OpenMP Coarse Grain scaling, real case scenario

    [Figure: normalized WCT (lower is better) vs. number of cores; reference: 14M elements, 28 cores; curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M, MPI, OMP_FG, OMP_CG]

    Catastrophic performances.

  • 83-85

    MPI+OpenMP Coarse Grain limits: point-to-point communications

    [Figure: ALL2ALL via non-blocking P2P communication on one node, MPI vs. MPI_2PPN+CG]

    Ø The coarse-grain version is 40 times slower
    Ø MPI implementations allow multithreading but serialise the calls internally
    Ø Impossible to attain any performance this way

  • 86

    MPI+OpenMP Coarse Grain: recap and conclusions

    Ø Objective:
      Ø Get rid of the fork-join overhead and of the sequential computation
      Ø Substitute MPI ranks by threads
        Ø Fewer MPI ranks
        Ø Collective communication is less expensive
    Ø The entire code is inside one OpenMP region:
      Ø The entire code must be thread-safe (extremely hard to code and debug)
      Ø Threads do all the work and the communication
    Ø Conclusions:
      Ø OpenMP Coarse Grain allows complete parallelisation of the code
        Ø Same performances as MPI
        Ø Improvement in collective communications
      Ø Implementations prevent fully multithreaded concurrent MPI calls
        Ø Point-to-point communications kill the performance

  • 87

    Presentation Planning (outline repeated; next section: Perspectives & Conclusions)

  • 88

    Perspectives: MPI+OpenMP

    Ø Solve the point-to-point communication problem for Coarse Grain
    Ø Be smarter about domain decomposition
      Ø Minimize the number of neighbours on different ranks
    Ø Funnel all communication to one thread
      Ø More communications for the designated threads
      Ø Idle time for the non-communicating threads
      Ø More synchronization points
    Ø The MPI 4 standard may introduce the endpoints concept
      Ø Fully multithreaded MPI (hopefully)
      Ø Must wait for the libraries to implement it

  • 89

    Perspectives: MPI+MPI-3 and GASPI (PGAS)

    MPI+MPI-3:
    Ø MPI-3 allows the creation of shared-memory windows inside a node
    Ø Same solution as OpenMP Coarse Grain for collective communications
      Ø Must verify the performances
      Ø Starts from the 1 PPN curve
      Ø Expensive synchronization (?)
    Ø No problem for point-to-point communications

    GASPI (PGAS):
    Ø Alternative to MPI
    Ø Uses RMA instead of messages
    Ø Fully multithreaded
      Ø Should solve the point-to-point problem
    Ø Can be combined with MPI
      Ø Useful for complex collectives
    Ø Not supported on all machines

    [Figure: MPI_ALLREDUCE time vs. number of cores, repeated from the collective-communications slides]
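
    For the MPI+MPI-3 direction, the basic building block is the shared-memory window. Below is a minimal sketch using standard MPI-3 calls with an illustrative segment size; how the window would actually be used inside the solver is not specified here.

```c
/* Allocating a node-shared array with MPI-3 shared-memory windows. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* One communicator per shared-memory node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Each rank contributes a chunk; all chunks live in one shared segment. */
    MPI_Aint local_size = 1024 * sizeof(double);
    double *local_ptr;
    MPI_Win win;
    MPI_Win_allocate_shared(local_size, sizeof(double), MPI_INFO_NULL,
                            node_comm, &local_ptr, &win);

    /* Any rank of the node can query a direct pointer to another rank's chunk. */
    MPI_Aint size;
    int disp_unit;
    double *rank0_ptr;
    MPI_Win_shared_query(win, 0, &size, &disp_unit, &rank0_ptr);

    if (node_rank == 0) local_ptr[0] = 42.0;
    MPI_Win_fence(0, win);                 /* simple synchronization of the window */
    printf("node rank %d sees %g\n", node_rank, rank0_ptr[0]);

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```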

  • 90

    Conclusions

    Ø MPI has reached its scalability limits on modern architectures
      Ø Hybrid codes could improve the performances
    Ø It is not easy to write a performing hybrid MPI+OpenMP code
      Ø OpenMP Fine Grain suffers from the fork-join overhead and Amdahl's law
      Ø OpenMP Coarse Grain is limited by the MPI implementations on P2P communications
    Ø Other hybrid solutions are worth exploring
      Ø MPI+MPI-3
      Ø (MPI+)GASPI+OpenMP
      Ø …