
PETSc: High-Performance Software for Engineering and Science

Hong Zhang

Computer Science Dept., Illinois Institute of Technology
Mathematics and Computer Science Div., Argonne National Laboratory, USA

Sept. 2010

PETSc: Portable, Extensible Toolkit for Scientific Computation

• High-performance software for the scalable (parallel) solution of scientific applications

• PETSc history
  – Begun Sept. 1991
  – Over 20,000 downloads since 1995 (version 2), currently 400 per month

• PETSc funding and support
  – Department of Energy, USA
  – National Science Foundation, USA

[Timeline figure, 1991–2006: growth of the PETSc team, including non-LANS team members and active developers]

Portable, Extensible Toolkit for Scientific Computation

• Portable to any parallel system supporting MPI
  – Tightly coupled systems: Cray T3E, SGI Origin, IBM SP, HP 9000, …
  – Loosely coupled systems: PCs running Linux or Windows
• Free for everyone, including industrial users
• Download from www.mcs.anl.gov/petsc
• Extensive documentation
• Hundreds of tutorial-style examples
• Support via email: [email protected]
• Usable from Fortran 90, C, C++, and Python


PETSc Structure

[Layer diagram, by level of abstraction from bottom to top:]
• Computation and communication kernels: MPI, MPI-IO, BLAS, LAPACK
• Profiling interface
• PETSc objects: matrices, vectors, index sets; grid management
• Linear solvers: preconditioners + Krylov methods
• Nonlinear solvers; ODE integrators; visualization
• Application codes

Portable, Extensible Toolkit for Scientific Computation

• Parallel vector/array operations
• Numerous (parallel) sparse matrix formats
• Numerous linear solvers
• Nonlinear solvers
• ODE integrators
• Parallel grid/data management
• Provides a common interface for most linear solver software (Hypre, SuperLU, MUMPS, Trilinos)
• Allows switching of virtually all solvers at runtime

Portable, Extensible Toolkit for Scientific Computation

Interfaced packages:

• LU, Cholesky, ILU, ICC: SuperLU, SuperLU_DIST, ESSL, Matlab, PLAPACK, UMFPACK, SPOOLES, MUMPS, BlockSolve95
• Algebraic multigrid: parallel BoomerAMG (part of Hypre, LLNL), ML (part of Trilinos, SNL)
• Partitioning: ParMetis, Chaco, Jostle, Party, Scotch
• ODE integrators: Sundials (LLNL)
• Eigenvalue solvers: BLOPEX
• FFTW
• SPRNG
• …

Child packages of PETSc (all follow PETSc's style of programming):

• SIPs: Shift-and-Invert Parallel Spectral Transformations
• SLEPc: scalable eigenvalue/eigenvector solver package
• TAO: scalable optimization algorithms

Portable, Extensible Toolkit for Scientific Computation

Algorithms, (parallel) debugging aids, low-overhead profiling

• It is not possible to pick the solver a priori.
  – What will deliver the best or most competitive performance for a given physics, problem size, discretization, and architecture?
• PETSc was developed as a platform for experimentation:
  – models, discretizations, algorithms, solvers
  – an algebra of composition, so new solvers can be created at runtime
• It is important to keep solvers decoupled from physics and discretization, because we also experiment with those.

Portable, Extensible Toolkit for Scientific Computation

Who Uses PETSc?

• Computational scientists
  – PyLith (TECTON), Underworld, Columbia group, PFLOTRAN
• Algorithm developers
  – iterative methods and preconditioning researchers
• Package developers
  – SIPs, SLEPc, TAO, MagPar, StGermain, deal.II, PETSc-FEM

Portable, Extensible Toolkit for Scientific Computation

Applications of PETSc

• Nano-simulations
• Biology/medical: cardiology, imaging and surgery
• Fusion
• Geosciences
• Environmental/subsurface flow
• Computational fluid dynamics
• Wave propagation and the Helmholtz equation
• Optimization
• Fast algorithms
• Software engineering
• Algorithm analysis and design

2009 R&D 100 Award


What Can We Handle?

• PETSc has run implicit problems with over 1 billion unknowns
  – PFLOTRAN, for flow in porous media
• PETSc has run efficiently on over 130,000 cores
  – UNIC on the IBM BG/P Intrepid at ANL
  – PFLOTRAN on the Cray XT5 Jaguar at ORNL
• PETSc applications have run at 3 teraflops
  – LANL PFLOTRAN code
• PETSc also runs on your laptop
• Only a handful of our users ever go over 64 processors

Modeling of Nanostructured Materials (Density-Functional-based Tight-Binding)

Matrices are

• large: the ultimate goal is 50,000 atoms with electronic structure, i.e. N ≈ 200,000
• sparse: nonzero density → 0 as N increases
• dense solutions are requested: 60% of the eigenvalues and eigenvectors

Dense solutions of large sparse problems!

The DFTB eigenvalue problem is distinguished by:

• (A, B) is large and sparse
  → iterative method
• A large number of eigensolutions (60%) are requested
  → iterative method + multiple shift-and-invert
• The spectrum has poor average eigenvalue separation O(1/N), clusters of hundreds of tightly packed eigenvalues, and gaps >> O(1/N)
  → iterative method + multiple shift-and-invert + robustness
• The matrix factorization (A − σB) = LDL^T ranges from not-very-sparse (7% nonzero density) to dense (50%)
  → iterative method + multiple shift-and-invert + robustness + efficiency
• Ax = λBx is solved many times (possibly thousands)
  → iterative method + multiple shift-and-invert + robustness + efficiency + initial approximation of eigensolutions
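For context, the shift-and-invert spectral transformation rewrites Ax = λBx as (A − σB)^{-1}Bx = νx with ν = 1/(λ − σ), so the eigenvalues nearest a shift σ become the dominant ones and converge quickly under iterative methods; running multiple shifts lets each LDL^T factorization resolve its own slice of a tightly clustered spectrum.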

Software Structure

[Stack diagram: SIPs is built on SLEPc, which sits on PETSc together with MUMPS and ARPACK, all on top of MPI]

Shift-and-Invert Parallel Spectral Transforms (SIPs)

• Select shifts
• Bookkeep and validate eigensolutions
• Balance parallel jobs
• Ensure global orthogonality of eigenvectors
• Manage matrix storage

FACETS: Framework Application for Core-Edge Transport Simulations

• https://facets.txcorp.com/facets
• PI: John Cary, Tech-X Corporation
• Goal: provide modeling of a fusion device from the core to the wall
• TOPS emphasis in FACETS
  – Incorporate TOPS expertise in scalable nonlinear algebraic solvers into the base physics codes that provide the foundation for the coupled models
  – Study mathematical challenges that arise in coupled core-edge and transport-turbulence systems

[Slide also shows collaborating institutions, including IU, NYU, and Lodestar]

The edge-plasma region is a key component to include for integrated modeling of fusion devices

• Edge-pedestal temperature has a large impact on fusion gain
• Plasma exhaust can damage walls
• Impurities from the wall can dilute core fuel and radiate substantial energy

[Figure: 2D mesh for DIII-D]

UEDGE is a 2D plasma/neutral transport code

Features of UEDGE

Physics:
– Multispecies plasma; variables n_{i,e}, u_{||i,e}, T_{i,e} for particle density, parallel momentum, and energy balances
– Reaction-diffusion-convection type equations
– Reduced Navier-Stokes or Monte Carlo treatment of wall-recycled/sputtered neutrals
– Multi-step ionization and recombination

Numerics:
– Finite volume discretization
– Preconditioned Newton-Krylov implicit solver
– Non-orthogonal mesh for fitting the divertor
– Steady-state or time-dependent
– Parallel version
– PYTHON or BASIS scripting control

References:
• T. D. Rognlien and M. E. Rensink, "Edge-Plasma Models and Characteristics for Magnetic Fusion Energy Devices," Fusion Engineering and Design, 60: 497-514, 2002.
• T. D. Rognlien, D. D. Ryutov, N. Mattor, and G. D. Porter, "Two-Dimensional Electric Fields and Drifts Near the Magnetic Separatrix in Divertor Tokamaks," Physics of Plasmas, 6 (5): 1851-1857, 1999.
• T. D. Rognlien, X. Q. Xu, and A. C. Hindmarsh, "Application of Parallel Implicit Methods to Edge-Plasma Numerical Simulations," Journal of Computational Physics, 175: 249-268, 2002.

The Role of PETSc

Developing parallel, nontrivial PDE solvers that deliver high performance is still difficult and requires months (or even years) of concentrated effort.

PETSc is a tool that can ease these difficulties and reduce the development time, but it is not a black-box PDE solver, nor a silver bullet.


Outline

• Overview of PETSc
  – Linear solver interface: KSP
  – Nonlinear solver interface: SNES
  – Profiling and debugging
• Ongoing research and developments

The PETSc Programming Model

• Distributed memory, "shared nothing"
  – Requires only a standard compiler
  – Access to data on remote machines through MPI
• The details of communication are hidden within objects
• The user orchestrates communication at a higher abstraction level than direct MPI calls

Getting Started

PetscInitialize();
ObjCreate(MPI_comm, &obj);
ObjSetType(obj, …);
ObjSetFromOptions(obj, …);
ObjSolve(obj, …);
ObjGetxxx(obj, …);
ObjDestroy(obj);
PetscFinalize();
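To make this lifecycle concrete, here is a minimal sketch of a complete KSP program that assembles a 1D Laplacian and solves Ax = b. It is written against the PETSc 3.1-era C API in use at the time of this tutorial (e.g., KSPSetOperators takes a MatStructure flag and the Destroy routines take the object rather than a pointer; later releases changed these signatures), with error checking omitted for brevity:

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat      A;                  /* linear system matrix */
  Vec      x, b;               /* solution and right-hand side */
  KSP      ksp;                /* linear solver context */
  PetscInt i, Istart, Iend, n = 100;

  PetscInitialize(&argc, &argv, 0, 0);

  /* Create and assemble a tridiagonal (1D Laplacian) matrix */
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
  MatSetFromOptions(A);
  MatGetOwnershipRange(A, &Istart, &Iend);
  for (i = Istart; i < Iend; i++) {
    if (i > 0)   MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);
    if (i < n-1) MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);
    MatSetValue(A, i, i, 2.0, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  /* Create vectors; set b = 1 */
  VecCreate(PETSC_COMM_WORLD, &b);
  VecSetSizes(b, PETSC_DECIDE, n);
  VecSetFromOptions(b);
  VecDuplicate(b, &x);
  VecSet(b, 1.0);

  /* Create the solver, honor runtime options, and solve */
  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A, DIFFERENT_NONZERO_PATTERN);
  KSPSetFromOptions(ksp);   /* -ksp_type, -pc_type, ... are picked up here */
  KSPSolve(ksp, b, x);

  KSPDestroy(ksp); VecDestroy(x); VecDestroy(b); MatDestroy(A);
  PetscFinalize();
  return 0;
}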

PETSc Numerical Components

• Vectors (Vec)
• Index Sets (IS): indices, block indices, stride, other
• Matrices (Mat): compressed sparse row (AIJ), blocked compressed sparse row (BAIJ), block diagonal (BDIAG), dense, matrix-free, other
• Distributed Arrays (DA)
• Krylov Subspace Methods (KSP): GMRES, CG, CGS, Bi-CG-Stab, TFQMR, Richardson, Chebychev, other
• Preconditioners (PC): additive Schwarz, block Jacobi, Jacobi, ILU, ICC, LU (sequential only), others
• Nonlinear Solvers (SNES): Newton-based methods (line search, trust region), other
• Time Steppers (TS): Euler, backward Euler, pseudo time stepping, other

Linear Solver Interface: KSP

[Diagram: flow of control for a linear solve — the user's main routine performs application initialization, evaluation of A and b, and post-processing; PETSc's linear solvers (KSP with PC) solve Ax = b. The solver internals are PETSc code; everything else is user code.]

Setting Solver Options at Runtime

• -ksp_type [cg,gmres,bcgs,tfqmr,…]
• -pc_type [lu,ilu,jacobi,sor,asm,…]
• -ksp_max_it <max_iters>
• -ksp_gmres_restart <restart>
• -pc_asm_overlap <overlap>
• -pc_asm_type [basic,restrict,interpolate,none]
• etc.
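For example, the same executable can switch solvers entirely from the command line (illustrative invocations; mpiexec and ./app are placeholders for your launcher and program):

mpiexec -n 4 ./app -ksp_type gmres -pc_type bjacobi -ksp_gmres_restart 50
mpiexec -n 4 ./app -ksp_type cg -pc_type jacobi -ksp_max_it 500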

Recursion: Specifying Solvers for Schwarz Preconditioner Blocks

• Specify KSP solvers and options with the "-sub" prefix, e.g.,
  – full or incomplete factorization:
    -sub_pc_type lu
    -sub_pc_type ilu -sub_pc_ilu_levels <levels>
  – inner Krylov iterations can also be used, e.g.,
    -sub_ksp_type gmres -sub_ksp_rtol <rtol> -sub_ksp_max_it <maxit>
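Putting the pieces together, an additive Schwarz run whose subdomain blocks are solved by ILU(1) might be launched as follows (an illustrative command line; the executable name and process count are placeholders):

mpiexec -n 8 ./app -ksp_type gmres -pc_type asm -pc_asm_overlap 1 \
        -sub_ksp_type preonly -sub_pc_type ilu -sub_pc_ilu_levels 1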

Flow of Control for PDE Solution

[Diagram: the user's main routine performs application initialization, function evaluation, Jacobian evaluation, and post-processing; PETSc supplies the timestepping solvers (TS), nonlinear solvers (SNES), and linear solvers (KSP with PC) that call back into the user code.]

UEDGE and PETSc: Solve F(u) = 0

[Diagram: the UEDGE driver and timestepping (application code, using the algorithms and data structures originally employed by UEDGE) call PETSc's nonlinear solvers (SNES); the application also supplies initialization, function evaluation, and post-processing. UEDGE computes a finite-difference Jacobian for the preconditioning matrix; PETSc code supplies the matrix-free Jacobian-vector products. PETSc components involved: Krylov solvers (GMRES, TFQMR, BCGS, CGS, BCG, others), preconditioners (ASM, ILU, block Jacobi, SSOR, multigrid, others), matrices (AIJ, block AIJ, diagonal, dense, matrix-free, others), and sequential/parallel vectors.]

Nonlinear Solver Interface: SNES

Goal: for problems arising from PDEs, support the general solution of F(u) = 0

User provides:
– code to evaluate F(u)
– code to evaluate the Jacobian of F(u) (optional)
  • or use a sparse finite difference approximation
  • or use automatic differentiation
    – AD support via collaboration with P. Hovland and B. Norris
    – coming in the next PETSc release via an automated interface to ADIFOR and ADIC (see http://www.mcs.anl.gov/autodiff)

SNES: Review of Basic Usage

• SNESCreate( )         - create SNES context
• SNESSetFunction( )    - set function evaluation routine
• SNESSetJacobian( )    - set Jacobian evaluation routine
• SNESSetFromOptions( ) - set runtime solver options for [SNES, SLES, KSP, PC]
• SNESSolve( )          - run the nonlinear solver
• SNESView( )           - view the solver options actually used at runtime (alternative: -snes_view)
• SNESDestroy( )        - destroy the solver
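A minimal sketch of this sequence for a toy scalar residual F(u) = u^2 - 9 (a hypothetical example, again written against the PETSc 3.1-era C API, error checking omitted). No Jacobian routine is set here, so one would run with -snes_fd (or -snes_mf) so PETSc approximates the Jacobian; depending on the release, an explicit SNESSetJacobian call with a preallocated matrix may be required:

#include <petscsnes.h>

/* Residual evaluation: F(u) = u^2 - 9, with root u = 3 */
static PetscErrorCode FormFunction(SNES snes, Vec X, Vec F, void *ctx)
{
  PetscScalar *x, *f;
  VecGetArray(X, &x);
  VecGetArray(F, &f);
  f[0] = x[0]*x[0] - 9.0;
  VecRestoreArray(X, &x);
  VecRestoreArray(F, &f);
  return 0;
}

int main(int argc, char **argv)
{
  SNES snes;
  Vec  x, r;

  PetscInitialize(&argc, &argv, 0, 0);
  VecCreate(PETSC_COMM_WORLD, &x);
  VecSetSizes(x, PETSC_DECIDE, 1);
  VecSetFromOptions(x);
  VecDuplicate(x, &r);
  VecSet(x, 1.0);                             /* initial guess */

  SNESCreate(PETSC_COMM_WORLD, &snes);
  SNESSetFunction(snes, r, FormFunction, 0);  /* r holds the residual */
  SNESSetFromOptions(snes);                   /* e.g. -snes_fd, -snes_monitor */
  SNESSolve(snes, 0, x);                      /* no right-hand side: F(u) = 0 */

  SNESDestroy(snes); VecDestroy(x); VecDestroy(r);
  PetscFinalize();
  return 0;
}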

Uniform access to all linear and nonlinear solvers

• -ksp_type [cg,gmres,bcgs,tfqmr,…]
• -pc_type [lu,ilu,jacobi,sor,asm,…]
• -snes_type [ls,…]
• -snes_line_search <line search method>
• -sles_ls <parameters>
• -snes_convergence <tolerance>
• etc.

PETSc Programming Aids

• Correctness debugging
  – automatic generation of tracebacks
  – detection of memory corruption and leaks
  – optional user-defined error handlers
• Performance profiling
  – integrated profiling using -log_summary
  – profiling by stages of an application
  – user-defined events
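These aids are likewise driven by runtime options, for instance (illustrative invocations; ./app is a placeholder executable):

./app -log_summary          (print a performance summary at exit)
./app -malloc_dump          (report memory that was never freed)
./app -start_in_debugger    (launch each process under a debugger)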

Conclusions

PETSc can help you:

• easily construct a code to test your ideas
  – lots of code construction, management, and debugging tools
• scale an existing code to large or distributed machines
• incorporate more scalable or higher-performance algorithms
  – such as domain decomposition or multigrid
• tune your code to new architectures
  – using profiling tools and specialized implementations

Ongoing Research and Developments

• Coupled multiphysics systems
• Preconditioning using splitting methods
• Memory performance for sparse kernels
  – sparse matrix-vector products
  – triangular solves
• ILU with drop tolerance
• Time integration
  – differential algebraic equations
  – strong-stability-preserving methods
• …

Coupled Multiphysics Systems

• Two or more models which are not traditionally solved fully coupled:

  F(x, y) = 0
  G(x, y) = 0

• Our approach:
  – simple user specification
  – abstractions for managing the composition of solvers
  – flexible solver options and efficient solution

Framework for a Multi-model Algebraic System
~petsc/src/snes/examples/tutorials/ex31.c

A model "multiphysics" solver based on Vincent Mousseau's reactor-core pilot code. There are three grids:

• DA1: fluid
• DA2: thermal conduction (cladding and core)
• DA3: fission (core)

/* Create the DMComposite object to manage the three grids/physics */
DMCompositeCreate(app.comm,&app.pack);
DACreate1d(app.comm,DA_XPERIODIC,app.nxv,6,3,0,&da1);
DMCompositeAddDA(app.pack,da1);
DACreate2d(app.comm,DA_YPERIODIC,DA_STENCIL_STAR,…,&da2);
DMCompositeAddDA(app.pack,da2);
DACreate2d(app.comm,DA_XYPERIODIC,DA_STENCIL_STAR,…,&da3);
DMCompositeAddDA(app.pack,da3);

/* Create the solver object and attach the grid/physics info */
DMMGCreate(app.comm,1,0,&dmmg);
DMMGSetDM(dmmg,(DM)app.pack);
DMMGSetSNES(dmmg,FormFunction,0);

/* Solve the nonlinear system */
DMMGSolve(dmmg);

/* Free work space */
DMCompositeDestroy(app.pack);
DMMGDestroy(dmmg);

/* Unwrap the input vector and pass its local ghosted pieces into the user functions */
FormFunction(SNES snes,Vec X,Vec F,void *ctx)
…
DMCompositeGetEntries(dm,&da1,&da2,&da3);
DAGetLocalInfo(da1,&info1);

/* Get local vectors to hold the ghosted parts of X; then fill them
   from the unghosted global vector X */
DMCompositeGetLocalVectors(dm,&X1,&X2,&X3);
DMCompositeScatter(dm,X,X1,X2,X3);

/* Access subvectors of F (not ghosted); directly access the memory locations in F */
DMCompositeGetAccess(dm,F,&F1,&F2,&F3);

/* Evaluate the local user-provided functions */
FormFunctionLocalFluid(&info1,x1,f1);
FormFunctionLocalThermal(&info2,x2,f2);
FormFunctionLocalFuel(&info3,x3,f3);
…
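Because DMCompositeScatter delivers each physics' piece of X as a ghosted local vector, the local routines (FormFunctionLocalFluid, etc.) can be written as if for a single structured grid; the composite object handles the parallel packing, unpacking, and ghost updates.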

Three-Phase Instantaneous Time-Domain Simulation of Electric Power Systems
by Shri Abhyankar, IIT, using PETSc

[Diagram: the power system model comprises network, generator, and load subsystems, each with its own set of equations]

Multiphysics modeling approach:
– Solve the entire system at once.
– Unpack the system X and pack the subsystem F at each iteration.

This is done in PETSc using the DMMG object, with PETSc's multiphysics data structures and functions:

• Unpack the composite vector X (DMMGGetLocalVectors)
• Compute the subsystem functions
• Pack the subsystem functions (DMMGRestoreLocalAccess)

The PETSc Experience

Plus (+)
• vector and matrix structures
• wide range of linear and nonlinear solvers and preconditioners
• parallel data structures and solvers
• efficient multiphysics data structures and functions to access system and subsystem vectors
• runtime options!!

Minus (−)
• scattering of vectors locally within a subsystem

Memory Performance for Sparse Kernels

Computational kernels:
• SpMV: y = A*x
• Solve A*x = b
  – sparse factorization: A = LU or A ≈ LU
  – sparse triangular solves: Ly = b and Ux = y

These kernels
• have been implemented in numerical software libraries without serious consideration of computer architecture, e.g., memory layout and data-fetching behavior
• account for 70–80% of total computational time in most applications
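As a reminder of why these kernels are memory-bound, here is a generic CSR (AIJ-style) matrix-vector product, a minimal illustrative sketch rather than PETSc's actual implementation: every matrix entry is read exactly once per product, so performance is limited by memory bandwidth, not floating-point throughput.

/* y = A*x for an n-row CSR matrix: rowptr has n+1 entries;
   col/val hold the column index and value of each nonzero. */
void spmv_csr(int n, const int *rowptr, const int *col,
              const double *val, const double *x, double *y)
{
  for (int i = 0; i < n; i++) {
    double sum = 0.0;
    for (int k = rowptr[i]; k < rowptr[i+1]; k++)
      sum += val[k] * x[col[k]];   /* streams val/col; gathers from x */
    y[i] = sum;
  }
}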

Sparse LU Factorization

Input: A
Output: L and U

For i = 1, 2, …, n Do:
  w := A(i,:)
  For k = 1, 2, …, i-1 Do:
    mul := A(i,k)/A(k,k)
    update w := w - mul*U(k,:)
  EndDo
  store w in L(i,:) and U(i,:)
EndDo

Old storage: L and U are stored in the order of data creation:
L(1,:), U(1,:), L(2,:), U(2,:), L(3,:), U(3,:), …

Sparse Triangular Solve

Input: L, U, b
Output: x, such that LUx ≈ b

Forward substitution:
For i = 1, 2, …, n Do:
  x(i) := b(i) - L(i,1:i-1)*[x(1), …, x(i-1)]^T
EndDo

Backward substitution:
For i = n, n-1, …, 1 Do:
  x(i) := (x(i) - U(i,i+1:n)*[x(i+1), …, x(n)]^T)/U(i,i)
EndDo

Old storage: the array of factored values is accessed in two interleaved passes:
L(1,:), U(1,:), L(2,:), U(2,:), L(3,:), U(3,:), …

Memory access can be improved through a simple reorganization:

Sparse LU Factorization

Input: A
Output: L and U

For i = 1, 2, …, n Do:
  w := A(i,:)
  For k = 1, 2, …, i-1 Do:
    mul := A(i,k)/A(k,k)
    update w := w - mul*U(k,:)
  EndDo
  store w in L(i,:) and U(i,:)
EndDo

New storage: L and U are stored in the order of data access:
L(1,:), L(2,:), L(3,:), …, U(3,:), U(2,:), U(1,:)

Sparse Triangular Solve

Input: L, U, b
Output: x, such that LUx ≈ b

Forward substitution:
For i = 1, 2, …, n Do:
  x(i) := b(i) - L(i,1:i-1)*[x(1), …, x(i-1)]^T
EndDo

Backward substitution:
For i = n, n-1, …, 1 Do:
  x(i) := (x(i) - U(i,i+1:n)*[x(i+1), …, x(n)]^T)/U(i,i)
EndDo

New storage: the array is accessed in a linear fashion:
L(1,:), L(2,:), L(3,:), …, U(3,:), U(2,:), U(1,:)

→ Performance improvement of PETSc applications
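To see why the reordering makes the solve stream through memory, consider this minimal sketch (an illustration under simplified assumptions, not PETSc's code): L has a unit diagonal, L rows are packed first in natural order, and U rows follow in reverse row order with each U row starting at its diagonal, so both substitution loops advance a single pointer monotonically through val/col.

/* Solve LUx = b where the packed factor stores
   L(1,:), ..., L(n,:), U(n,:), ..., U(1,:) contiguously. */
typedef struct {
  int           n;      /* dimension */
  const double *val;    /* nonzero values, packed in access order */
  const int    *col;    /* matching column indices */
  const int    *lnnz;   /* lnnz[i]: strictly-lower nonzeros in row i */
  const int    *unnz;   /* unnz[i]: nonzeros in U row i, diagonal first */
} PackedFactor;

void solve_packed(const PackedFactor *f, const double *b, double *x)
{
  const double *v = f->val;
  const int    *c = f->col;
  /* Forward substitution: L rows in order 1..n */
  for (int i = 0; i < f->n; i++) {
    double s = b[i];
    for (int k = 0; k < f->lnnz[i]; k++) s -= (*v++) * x[*c++];
    x[i] = s;
  }
  /* Backward substitution: U rows are stored n..1, so the same
     pointers keep moving forward through memory. */
  for (int i = f->n - 1; i >= 0; i--) {
    double diag = *v++; c++;                 /* U(i,i) comes first */
    double s = x[i];
    for (int k = 1; k < f->unnz[i]; k++) s -= (*v++) * x[*c++];
    x[i] = s / diag;
  }
}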

Triangular solve performance, old vs. new storage (MFlops/s; percentages give the triangular-solve rate as a fraction of the SpMV rate):

Processor     Matrix             SpMV    Tri. Solve (old)   Tri. Solve (new)
Core 2 Duo    7-point Laplace    537     261 (49%)          447 (83%)
Core 2 Duo    3D Euler, AIJ      620     260 (42%)          660 (106%)
Core 2 Duo    3D Euler, BAIJ     890     468 (53%)          758 (85%)
BlueGene/P    7-point Laplace     98      46 (47%)           62 (63%)
BlueGene/P    3D Euler, AIJ      148      99 (66%)          126 (85%)
BlueGene/P    3D Euler, BAIJ     303     198 (65%)          260 (86%)

Barry Smith and Hong Zhang, "Sparse Triangular Solves for ILU Revisited: Data Layout Crucial to Better Performance," Intl. J. of High Performance Computing Applications, 2010.

Improvement from better storage

Select data structures based on how they will be accessed most frequently and efficiently, not on how they are first created!

How will we solve numerical applications in 20 years?

•  Not with the algorithms we use today?

•  Not with the software (development) we use today?


How Can We Help?

• Provide documentation: http://www.mcs.anl.gov/petsc
• Quickly answer questions
• Help install
• Guide large-scale flexible code development
• Answer email at [email protected]