the trilinos project exascale roadmap · the trilinos project exascale roadmap michael a. heroux...

Post on 05-Apr-2020

12 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

1

The Trilinos Project ExascaleRoadmap

Michael A. HerouxSandia National Laboratories

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.

Contributors: Mark Hoemmen, Siva Rajamanickam, Tobias Wiesner,Alicia Klinvex, Heidi Thornquist,

Kate Evans, Andrey Prokopenko, Lois McInnes

trilinos.github.io

Outline

§ Challenges.§ BriefOverviewofTrilinos.§ On-nodeparallelism.§ ParallelAlgorithms.§ TrilinosProductsOrganization.§ ForTrilinos.§ ThexSDK.

2

Challenges

§ On-nodeconcurrencyexpression:§ Vectorization/SIMT.§ On-nodetasking.§ Portability.

§ Parallelalgorithms:§ Vector/SIMTexpressible.§ Latencytolerant.§ Highlyscalable.

§ Multi-scaleandmulti-physics.§ Preconditioning.§ Softwarecomposition.

§ Resilience(DiscussedtomorrowmorningMS40).

3

4

Trilinos Overview

What is Trilinos?§ Object-oriented software framework for…§ Solving big complex science & engineering problems.§ Large collection of reusable scientific capabilities.§ More like LEGO™ bricks than Matlab™.

5

OptimalKernelstoOptimalSolutions:w Geometry,Meshingw Discretizations,LoadBalancing.w ScalableLinear,Nonlinear,Eigen,

Transient,Optimization,UQsolvers.w ScalableI/O,GPU,Manycore

w 60+Packages.w Otherdistributions:

w CrayLIBSCI.w GitHubrepo.

w ThousandsofUsers.w Worldwidedistribution.

LaptopstoLeadershipsystems

Trilinos Package SummaryObjective Package(s)

DiscretizationsMeshing & Discretizations STK, Intrepid, Pamgen, Sundance, ITAPS, Mesquite

Time Integration Rythmos

MethodsAutomatic Differentiation Sacado

Mortar Methods Moertel

Services

Linear algebra objects Epetra, Tpetra, Kokkos, Xpetra

Interfaces Thyra, Stratimikos, RTOp, FEI, Shards

Load Balancing Zoltan, Isorropia, Zoltan2“Skins” PyTrilinos, WebTrilinos, ForTrilinos, Ctrilinos, Optika

C++ utilities, I/O, thread API Teuchos, EpetraExt, Kokkos, Triutils, ThreadPool, Phalanx, Trios

Solvers

Iterative linear solvers AztecOO, Belos, Komplex

Direct sparse linear solvers Amesos, Amesos2, ShyLUDirect dense linear solvers Epetra, Teuchos, Pliris

Iterative eigenvalue solvers Anasazi, Rbgen

ILU-type preconditioners AztecOO, IFPACK, Ifpack2, ShyLU

Multilevel preconditioners ML, CLAPS, Muelu

Block preconditioners Meros, Teko

Nonlinear system solvers NOX, LOCA, Piro

Optimization (SAND) MOOCHO, Aristos, TriKota, Globipack, Optipack

Stochastic PDEs Stokhos

Unique features of Trilinos

8

§ Huge library of algorithms§ Linear and nonlinear solvers, preconditioners, …§ Optimization, transients, sensitivities, uncertainty, …

§ Growing support for multicore & hybrid CPU/GPU§ Built into the new Tpetra linear algebra objects

§ Therefore into iterative solvers with zero effort!§ Unified intranode programming model: Kokkos§ Spreading into the whole stack:

§ Multigrid, sparse factorizations, element assembly…

§ Support for mixed and arbitrary precisions§ Don’t have to rebuild Trilinos to use it

§ Support for flexible 2D sparse partitioning§ Useful for graph analytics, other data science apps.

§ Support for huge (> 2B unknowns) problems

AnasaziSnapshot

§ Abbreviationskey§ “Gen.”=generalizedeigenvaluesystem.§ “Prec:”=canuseapreconditioner.§ BKS=BlockKrylov Schur.§ LOBPCG=LocallyOptimial BlockPreconditionedConjugateGradient.§ RTR=RiemannianTrust-Regionmethod.

§ ^ denotesthatthesefeaturesmaybeimplementedifthereissufficientinterest.§ $ denotesthattheTraceMin familyofsolversiscurrentlyexperimental. 9

Core team:- Alicia Klinvex- Rich Lehoucq- Heidi Thornquist

Works with:- Epetra- Tpetra- Custom

https://trilinos.github.io/anasazi.html

Trilinoslinearsolvers§ Sparselinearalgebra

(Kokkos/KokkosKernels/Tpetra)§ Threadedconstruction,Sparsegraphs,(block)

sparsematrices,densevectors,parallelsolvekernels,parallelcommunication&redistribution

§ Iterative(Krylov)solvers (Belos)§ CG,GMRES,TFQMR,recyclingmethods

10

§ Sparsedirectsolvers (Amesos2)§ Algebraiciterativemethods (Ifpack2)

§ Jacobi,SOR,polynomial,incompletefactorizations,additiveSchwarz

§ Shared-memoryfactorizations (ShyLU)§ LU,ILU(k),ILUt,IC(k),iterativeILU(k)§ Direct+iterative preconditioners

§ Segregated blocksolvers(Teko)§ Algebraicmultigrid (MueLu)

KokkosKernels

11

On-node data & execution

Mustsupport>3architectures§ Comingsystemstosupport

§ Trinity (IntelHaswell &KNL)§ Sierra:NVIDIAGPUs+IBM

multicoreCPUs§ Plus“everythingelse”

§ 3differentarchitectures§ MulticoreCPUs(bigcores)§ Manycore CPUs(smallcores)§ GPUs(highlyparallel)

§ MPIonly,&MPI+threads§ Threadsdon’talwayspayon

non-GPUarchitecturestoday§ Portingtothreadsmustnot

slowdowntheMPI-onlycase12

Kokkos:Performance,Portability,&Productivity

13

DDR#

HBM#

DDR#

HBM#

DDR#DDR#

DDR#

HBM#HBM#

Kokkos#

LAMMPS# Sierra# Albany#Trilinos#

14

Kokkos ProgrammingModel

• Machinemodel:• N execution space + M memory spaces • NxM matrix for memory access performance/possibility• Asynchronous execution allowed

• Implementationapproach• A C++ template library• C++11 now required• Target different back-ends for different hardware architecture• Abstract hardware details and execution mapping details away

• Distribution• Open Source library• Soon (i.e. in the next few weeks) available on GitHub

• LongTermVision• Move features into the C++ standard (Carter Edwards voting committee member)

Goal:OneCodegivesgoodperformanceoneveryplatform

15

AbstractionConcepts

Execution Pattern: parallel_for, parallel_reduce, parallel_scan, task, …

Execution Policy : how (and where) a user function is executedE.g., data parallel range : concurrently call function(i) for i = [0..N)User’s function is a C++ functor or C++11 lambda

Execution Space : where functions executeEncapsulates hardware resources; e.g., cores, GPU, vector units, ...

Memory Space : where data residesØ AND what execution space can access that dataAlso differentiated by access performance; e.g., latency & bandwidth

Memory Layout : how data structures are ordered in memoryØ provide mapping from logical to physical index space

Memory Traits : how data shall be accessedØ allow specialisation for different usage scenarios (read only, random, atomic, …)

16

ExecutionPattern#include <Kokkos_Core.hpp>#include <cstdio>

int main(int argc, char* argv[]) {// Initialize Kokkos analogous to MPI_Init()// Takes arguments which set hardware resources (number of threads, GPU Id)Kokkos::initialize(argc, argv);

// A parallel_for executes the body in parallel over the index space, here a simple range 0<=i<10// It takes an execution policy (here an implicit range as an int) and a functor or lambda// The lambda operator has one argument, and index_type (here a simple int for a range)Kokkos::parallel_for(10,[=](int i){

printf(”Hello %i\n",i); });

// A parallel_reduce executes the body in parallel over the index space, here a simple range 0<=i<10 and // performs a reduction over the values given to the second argument // It takes an execution policy (here an implicit range as an int); a functor or lambda; and a return valuedouble sum = 0;Kokkos::parallel_reduce(10,[=](int i, int& lsum) {

lsum += i; },sum);printf("Result %lf\n",sum);

// A parallel_scan executes the body in parallel over the index space, here a simple range 0<=i<10 and // Performs a scan operation over the values given to the second argument // If final == true lsum contains the prefix sum. double sum = 0;Kokkos::parallel_scan(10,[=](int i, int& lsum, bool final) {

if(final) printf(”ScanValue %i\n",lsum); lsum += i;

});

Kokkos::finalize();}

Kokkos protectsusagainst…§ Hardwaredivergence§ Programmingmodeldiversity§ Threadsatall

§ Kokkos::Serial back-end§ Kokkos’semanticsrequire

vectorizable (ivdep)loops§ Exposeparallelismtoexploitlater§ Hierarchicalparallelismmodel

encouragesexploitinglocality

§ Kokkos protectsourHUGEtimeinvestmentofportingTrilinos

§ Note:Kokkos isnotmagic,cannotmakebadalgorithmsscale.

17

Kokkos isourhedge

Parallel Algorithms

FoundationalTechnology:KokkosKernels§ ProvideBLAS(1,2,3);Sparse;GraphandTensorKernels§ Kokkos based:PerformancePortable§ Interfacestovendorlibrariesifapplicable(MKL,CuSparse,…)§ Goal:Providekernelsforalllevelsofnodehierarchy

19

VectorLane§ ElementalFunctions,Serial,e.g.3x3DGEMM

HyperThread§ VectorParallelism,Synchfree,e.g.MatrixRowx Vector

Core§ ThreadParallelism,SharedL1/L2,e.g.SubdomainSolve

Socket§ ThreadTeams,SharedL3,e.g.FullSolve

Kokkoskernels:SparseMatrix-MatrixMultiplication(SpGEMM)

• SpGEMM is the most expensive part of the multigrid setup.

• New portable write-avoiding algorithm in KokkosKernels is~14x faster than NVIDIA’s CUSPARSE on K80 GPUs.

• ~2.8x faster than Intel’s MKL on Intel’s Knights Landing (KNL).

• Memory scalable: Solving larger problems that cannot be solved by codes like NVIDIA’s CUSP and Intel’s MKL when using large number of threads.

• Up is good.

1.22x%

2.55x%

0.00%

1.00%

2.00%

3.00%

4.00%

5.00%

6.00%

7.00%

1% 2% 4% 8% 16% 32% 64% 128% 256% 1% 2% 4% 8% 16% 32% 64% 128% 256%

NoReuse% Reuse%

Geo

metric

%mean%of%th

e%GFLOPs%fo

r%vario

us%M

ulDp

licaD

ons%%on%KN

L%%%

(Stron

g%Scaling)%

KokkosKernels%

MKL%

AxP$ RX(AP)$ AxP$ RX(AP)$ AxP$ RX(AP)$

Laplace$ Brick$ Empire$

cuSPARSE$ 0.100$ 0.229$ 0.291$ 0.542$ 0.646$ 0.715$

KokkosKernels$ 1.489$ 1.458$ 2.234$ 2.118$ 2.381$ 1.678$

14.89x$

0.00$

0.25$

0.50$

0.75$

1.00$

1.25$

1.50$

1.75$

2.00$

2.25$

2.50$

Sparse$M

atrixHM

atrix$m

ulIplicaIon$$

GFLOPs$on$K80$

cuSPARSE$

KokkosKernels$

• Goal: Identify independent data that can be processed in parallel.

• Performance: Better quality (4x on average) and run time (1.5x speedup ) w.r.t cuSPARSE.

• Enables parallelization of preconditioners: Gauss Seidel: 82x speedup on KNC, 136x on K20 GPUs

63.87&

1.22&

11.01&

1.75&

2.38&

0.66&

0.41&

0.19&

0.44&

0.64&

0.13&

0.25&

0.50&

1.00&

2.00&

4.00&

8.00&

16.00&

32.00&

64.00&

ci in kr li h a rg e B Q

Speedu

p&w.r.t.&cuSPAR

SE& KokkosKernels& cuSPARSE&

0.02$

0.97$

0.19$

0.59$

0.95$

0.32$

0.39$

0.19$

0.33$

0.34$

0.00$

0.20$

0.40$

0.60$

0.80$

1.00$

1.20$

cir in kr liv ho au rg eu Bu Q

Normalized

$#colors$w.r.t.$

cuSPAR

SE$

KokkosKernels$cuSPARSE$

Kokkoskernels:GraphColoringandSymmetricGauss-Seidel

Kokkos and Kokkos Kernels are available independent of Trilinos:https://github.com/kokkos

▪ MPI+Xbasedsubdomainsolvers▪ DecouplethenotionofoneMPIrankasonesubdomain:Subdomainscanspan

multipleMPIrankseachwithitsownsubdomainsolverusingXorMPI+X▪ Subpackages ofShyLU:MultipleKokkos-basedoptionsforon-nodeparallelism

▪ Basker :LUorILU(t)factorization▪ Tacho:IncompleteCholesky - IC(k)▪ Fast-ILU:Fast-ILUfactorizationforGPUs

▪ KokkosKernels:ColoringbasedGauss-Seidel(M.Deveci),TriangularSolves(A.Bradley)

ShyLU andSubdomainSolvers:Overview

TachoBasker FAST-ILUKLU2

Amesos2 Ifpack2

ShyLU

KokkosKernels –SGS, Tri-Solve (HTS)

23

ObjectConstructionModifications:MakingMPI+Xtrulyscalable(lotsofwork)

§ Pattern:1. Count /estimateallocationsize;mayuseKokkos parallel_scan2. Allocate;useKokkos::Viewforbestdatalayout&firsttouch3. Fill:parallel_reduce overerrorcodes;ifyourunoutofspace,keep

going,counthowmuchmoreyouneed,&returnto(2)4. Compute (e.g.,solvethelinearsystem)usingfilleddatastructures

§ ComparetoFill,Setup,Solve sparselinearalgebrausepattern§ Fortran<=77codersshouldfindthisfamiliar§ Semanticschange:Runningoutofmemorynotanerror!

§ Alwaysreturn:Eithernosideeffects,orcorrectresult§ Callersmustexpectfailure&protectagainstinfiniteloops§ Generalizestootherkindsoffailures,evenfaulttolerance

24

Trilinos Product Organization

ProductLeaders:Maximizecohesion,controlcoupling

26

§ Product:§ Framework(J.Willenbring).§ DataServices(K.Devine).§ LinearSolvers(S.Rajamanickam).§ NonlinearSolvers(R.Pawlowski).§ Discretizations (M.Perego).

§ Productfocus:§ New,strongerleadershipmodel.§ Focus:

§ PublishedAPIs§ Highcohesionwithinproduct.Lowcouplingacross.§ Deliberateproduct-levelupstreamplanning&design.

ForTrilinos: Full, sustainable Fortran access to Trilinos

HistoryViaadhoc interfaces,FortranappshavebenefittedfromTrilinoscapabilitiesformanyyears,mostnotablyclimate.

Why ForTrilinosTrilinosprovidealargecollectionofrobust,productionscientificC++software,includingstate-of-the-artmanycore/GPUcapabilities.ForTrilinos willgiveaccesstoFortranapps.

Research Details– Accessviaadhoc APIshasexistedforyears,especiallyintheclimatecommunity.

– ForTrilinos providesnative,sustainableAPIs,includingsupportforuser-providedFortranphysics-basedpreconditioninganduser-definedFortranoperators.

– Useofrobustauto-generationcodeandAPItoolsmakesfutureextensibilityfeasible.

– Auto-generationtoolsapplytootherprojects.– ForTrilinos willprovideearlyaccesstolatestscalablemanycore/GPUfunctionality,sophisticatedsolvers.

MichaelHeroux(PI,SNL),KateEvans(co-PI,ORNL),devteambasedatORNL.

https://github.com/flang-compiler

The Extreme-Scale Scientific Software Development Kit (xSDK)

Extreme-scaleScientificSoftwareEcosystem

Libraries• Solvers,etc.• Interoperable.

Frameworks&tools• Docgenerators.• Test,buildframework.

Extreme-scaleScientificSoftwareDevelopmentKit(xSDK)

SWengineering• Productivitytools.• Models,processes.

Domaincomponents• Reactingflow,etc.• Reusable.

Documentationcontent• Sourcemarkup.• Embeddedexamples.

Testingcontent• Unittests.• Testfixtures.

Buildcontent• Rules.• Parameters.

Libraryinterfaces• Parameterlists.• Interfaceadapters.• Functioncalls.

Shareddataobjects• Meshes.• Matrices,vectors.

Nativecode&dataobjects• Singleusecode.• Coordinatedcomponentuse.• Applicationspecific.

Extreme-scaleScienceApplications

Domaincomponentinterfaces• Datamediatorinteractions.• Hierarchicalorganization.• Multiscale/multiphysics coupling.

Focusofkeyaccomplishments:xSDK foundations

Impact:• Improvedcodequality,usability,access,sustainability• InformpotentialusersthatanxSDK memberpackagecanbeeasilyusedwithotherxSDK packages• Foundationforworkonperformanceportability,deeperlevelsofpackageinteroperability

Buildingthefoundationofahighlyeffectiveextreme-scalescientificsoftwareecosystem

30

Focus:Increasingthefunctionality,quality,andinteroperabilityofimportantscientificlibraries,domaincomponents,anddevelopmenttools¨ xSDKrelease0.2.0:April2017(soon)

¤ Spack packageinstallationn ./spack install xsdk

¤ Packageinteroperabilityn Numericallibraries

n hypre,PETSc,SuperLU,Trilinosn Domaincomponents

n Alquimia,PFLOTRAN¨ xSDK communitypolicies:

¤ Addresschallengesininteroperabilityandsustainabilityofsoftwaredevelopedbydiversegroupsatdifferentinstitutions

website:xSDK.info

xSDK communitypolicies31

xSDK compatiblepackage:MustsatisfymandatoryxSDK policies:M1.SupportxSDK communityGNUAutoconf orCMakeoptions.M2.Provideacomprehensivetestsuite.M3.Employuser-providedMPIcommunicator.M4.Givebesteffortatportabilitytokeyarchitectures.M5.Provideadocumented,reliablewaytocontactthedevelopmentteam.M6.Respectsystemresourcesandsettingsmadebyotherpreviouslycalledpackages.M7.Comewithanopensourcelicense.M8.ProvidearuntimeAPItoreturnthecurrentversionnumberofthesoftware.M9.Usealimitedandwell-definedsymbol,macro,library,andincludefilenamespace.M10.Provideanaccessiblerepository(notnecessarilypubliclyavailable).M11.HavenohardwiredprintorIOstatements.M12.Allowinstalling,building,andlinkingagainstanoutsidecopyofexternalsoftware.M13.Installheadersandlibrariesunder<prefix>/include/and<prefix>/lib/.M14.Bebuildableusing64bitpointers.32bitisoptional.

Draft 0.3, Dec 2016

Alsospecifyrecommendedpolicies,whichcurrentlyareencouragedbutnotrequired:R1.Haveapublicrepository.R2.Possibletoruntestsuiteundervalgrindinordertotestformemorycorruptionissues.R3.Adoptanddocumentconsistentsystemforerrorconditions/exceptions.R4.Freeallsystemresourcesithasacquiredassoonastheyarenolongerneeded.R5.Provideamechanismtoexportorderedlistoflibrarydependencies.

xSDK memberpackage:MustbeanxSDK-compatiblepackage,and itusesorcanbeusedbyanotherpackageinthexSDK,andtheconnectinginterfaceisregularlytestedforregressions.

https://xsdk.info/policies

xSDK4 SnapshotStatus

ASCRxSDKRelease0.1and0.2givesapplicationteamssingle-pointinstallation,accesstoTrilinos,hypre,PETScandSuperLU.

ValueThexSDKisessentialformulti-scale/physicsapplicationcouplingandinteroperability,andbroadestaccesstocriticallibraries.xSDkeffortsexpandcollaborationscopetoalllabs.

Research Details– xSDKstartedunderDOEASCR(Ndousse-Fetter,2014).– PriortoxSDK,verydifficulttousexSDKmemberlibstogether,adhocapproachrequired,versioninghard.

– Latestrelease(andfuture)usesSpack forbuilds.– xSDK4ECPwillinclude3additionallibs.– xSDKeffortsenablenewscopeofcross-labcoordinationandcollaboration.

– Long-termgoal:Createcommunityandpolicy-basedlibraryandcomponentecosystemsforcompositionalapplicationdevelopment.

Ref:xSDK Foundations:TowardanExtreme-scaleScientificSoftwareDevelopmentKit,,Feb2017,https://arxiv.org/abs/1702.08425,toappearinSupercomputingFrontiersandInnovations.

xsdk.info

hypre

Trilinos

PETSc

SuperLU

Released:Feb2017

TestedonkeymachinesatALCF,OLCF,NERSC,alsoLinuxandMacOSX

SLATE

DTK

SUNDIALS

Release:Oct2017

All ECP math and scientific libraries,

Many community

libraries

Future

MichaelHeroux(co-leadPI,SNL),LoisCurfman McInnes (co-leadPI,ANL),JamesWillenbring(Releaselead,SNL)

MorexSDK info33

¨ Paper:xSDK Foundations:TowardanExtreme-scaleScientificSoftwareDevelopmentKit¤ R.Bartlett,I.Demeshko,T.Gamblin,G.Hammond,M.Heroux,J.Johnson,A.Klinvex,X.Li,L.C.McInnes,D.Osei-Kuffuor,J.Sarich,B.Smith,J.Willenbring,U.M.Yang

¤ https://arxiv.org/abs/1702.08425¤ ToappearinSupercomputingFrontiersandInnovations,2017

¨ CSE17Posters:¤ xSDK:WorkingtowardaCommunitySoftwareEcosystem

n https://doi.org/10.6084/m9.figshare.4531526¤ ManagingtheSoftwareEcosystemwithSpack

n https://doi.org/10.6084/m9.figshare.4702294

FinalTake-AwayPoints▪ Intra-nodeparallelismisbiggestchallengerightnow:

▪ Kokkos providesvehicleforreasoningandimplementingon-nodeparallel.▪ Eventualgoal:SearchandreplaceKokkos::withstd::

▪ Node-parallelalgorithmsarealreadyavailable.▪ Fullynodeparallelexecutionishardwork.

▪ Inter-nodeparallelism:▪ Muelu frameworkprovideflexibility:

▪ Pluggable,customizablecomponents.▪ Multi-physics.

▪ TrilinosProducts:▪ Improvesupstreamplanningabilities.▪ Enablesincreasedcohesion,reduced(incidental)coupling.

▪ Communityexpansion:▪ ForTrilinos providesnativeaccessandextensibilityforFortranusers.▪ xSDK providesturnkey,consistentaccesstogrowingsetoflibraries.

▪ Contactifinterestedinjoining.▪ EuroTUG 2018:BetweenISCandPASC?

34

top related